R/C2/Merging-and-Importing-Data/English

From Script | Spoken-Tutorial
Revision as of 13:22, 25 March 2019 by Madhurig (Talk | contribs)


Title of script: Merging and Importing Data

Author: Shaik Sameer (IIIT Vadodara) and Sudhakar Kumar (IIT Bombay)

Keywords: R, RStudio, data merge, data import, video tutorial

Visual Cue Narration
Show slide

Opening slide

Welcome to the spoken tutorial on Merging and Importing Data
Show slide

Learning Objectives

In this tutorial, we will learn how to:
  • Use built-in functions for exploring a data frame
  • Merge two data frames
  • Import data in different formats in R
Show slide

Pre-requisites

http://spoken-tutorial.org

To understand this tutorial, you should know
  • Data frames in R
  • R script in RStudio
  • How to set working directory in RStudio

If not, please locate the relevant tutorials on R on this website.

Show slide

System Specifications

This tutorial is recorded on
  • Ubuntu Linux OS version 16.04
  • R version 3.4.4
  • RStudio version 1.1.456

Install R version 3.2.0 or higher.

Show slide

Download files

For this tutorial, we will use
  • five data frames in different formats and
  • a script file myDataSet.R.

Please download these files from the Code files link of this tutorial.

[Computer screen]

Highlight data frames and myDataSet.R in the folder myProject

I have downloaded these files from the Code files link.

I have moved them to the DataMerging folder inside the myProject folder on the Desktop.

I have also set this folder as my Working Directory.

Let us switch to RStudio.
Click myDataSet.R in RStudio

Point to myDataSet.R in RStudio

Open the script myDataSet.R in RStudio.

For this, click on the script myDataSet.R.

Script myDataSet.R opens in RStudio.

Highlight the Source button

Run this script by clicking on Source button.

Highlight captaincyOne in the Source window

captaincyOne appears in the Source window.
[RStudio]

Highlight captaincyOne in the Source window

We will use some built-in functions of R to explore captaincyOne.

For all the built-in functions used in this tutorial, please refer to the Additional Material.

Cursor on the interface.

First, we will use the summary function.

Highlight myDataSet.R in the Source window

Click on the script myDataSet.R.
[RStudio]

summary(captaincyOne)

Highlight Source button

In the Source window, type summary and then captaincyOne in parentheses.

Save the script and run the current line by pressing Ctrl+Enter keys simultaneously.

Highlight the output in the Console window

In the Console window, scroll up to locate the output.

Summary statistics for each column of captaincyOne are shown in the Console.

Highlight summary(captaincyOne) in the Source window

In the Source window, press Enter.

Press Enter at the end of every command.

Now, let us look at the class function.
[RStudio]

class(captaincyOne)

In the Source window, type class and then captaincyOne in parentheses.

Save the script and run the current line.

Highlight the output in the Console window

The class function returns the class of captaincyOne, which is data.frame.

Point to Source window.

Next, let us look at the typeof function.
[RStudio]

typeof(captaincyOne)

In the Source window, type typeof and then captaincyOne in parentheses.

Save the script and run the current line.

Highlight the output in the Console window

The typeof function returns the storage type of captaincyOne, which is list.

Highlight typeof in the Source window

To know more about the typeof function, we will access the help section of RStudio.
[RStudio]

help(typeof)

In the Console window, type help and, within parentheses, typeof. Press Enter.

Highlight Description in the help window

typeof determines the R internal type or storage mode of any object.
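The difference between class and typeof can be tried on any small data frame. The sketch below uses a hypothetical stand-in for captaincyOne; the data frame captains and its column names and values are made up for illustration and are not the tutorial's data.

```r
# Hypothetical stand-in for captaincyOne (names and values are invented).
captains <- data.frame(
  names  = c("A", "B", "C"),
  played = c(60, 47, 49),
  won    = c(27, 21, 21),
  stringsAsFactors = FALSE
)

class(captains)          # high-level class of the object: "data.frame"
typeof(captains)         # internal storage type: "list" (a list of columns)
typeof(captains$played)  # each column has its own storage type: "double"
is.list(captains)        # TRUE, because a data frame is stored as a list
```

This is why typeof reports list for a data frame while class reports data.frame.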
Highlight Files tab in the lower right of RStudio

Click on the Files tab.

Highlight broom icon in the Console window

Clear the Console window by clicking on the broom icon.

Highlight captaincyOne in the Source window

Click on the data frame captaincyOne.

Highlight captaincyOne in Source window

Now, let us extract the first two rows of captaincyOne.

For this, we will use head function.

Highlight myDataSet.R in the Source window

Click on the script myDataSet.R
[RStudio]

head(captaincyOne, 2)

In the Source window, type head within parentheses captaincyOne comma space 2.

Save the script and run the current line.

Highlight the output in the Console window

The top two rows of captaincyOne are shown in the Console window.

Highlight captaincyOne in the Source window

Click on the data frame captaincyOne.

Highlight captaincyOne in the Source window

Suppose we want to extract the last two rows of captaincyOne.

For this, we will use the tail function.

Highlight myDataSet.R in the Source window

Click on the script myDataSet.R
[RStudio]

tail(captaincyOne, 2)

In the Source window, type tail within parentheses captaincyOne comma space 2.

Save the script and run the current line.

Highlight the output in the Console window

The last two rows of captaincyOne are shown in the Console window.

Cursor on the interface.

Next, let us learn about the str function.

This function is used to display the structure of an R object.

[RStudio]

str(captaincyOne)

In the Source window, type str within parentheses captaincyOne.

Save the script and run the current line.

Highlight the output in the Console window

The structural details of captaincyOne are shown on the Console.
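The exploring functions covered so far can be tried together on a small data frame. This is a minimal sketch; the data frame captains below is a hypothetical stand-in for captaincyOne, with invented names and values.

```r
# Hypothetical stand-in for captaincyOne (names and values are invented).
captains <- data.frame(
  names  = c("A", "B", "C", "D", "E"),
  played = c(60, 47, 49, 25, 14),
  won    = c(27, 21, 21, 8, 3),
  stringsAsFactors = FALSE
)

first_two <- head(captains, 2)  # rows 1 and 2
last_two  <- tail(captains, 2)  # rows 4 and 5; the original row names are kept

str(captains)                   # compact structure: dimensions and one line per column
summary(captains$played)        # minimum, quartiles, mean and maximum of one column

print(first_two)
print(last_two)
```

If the second argument is left out, head and tail return six rows by default.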
Now, we will look at merging of data frames.
Show slide

Merging data frames

Merging data frames has advantages like:
  • It makes data more available.
  • It helps in improving data quality.
  • Combining similar data also reduces data complexity.
Let us switch to RStudio.
[RStudio]

Highlight CaptaincyData.csv and CaptaincyData2.csv under Files tab

We will learn how to merge two data frames CaptaincyData.csv and CaptaincyData2.csv.
[RStudio]

captaincyTwo <- read.csv("CaptaincyData2.csv")

We will declare a variable captaincyTwo to store the data read from CaptaincyData2.csv.

In the Source window, type the following command and press Enter.

[RStudio]

View(captaincyTwo)

Now, type View within parentheses captaincyTwo.

Save the script and run the last two lines.

Highlight captaincyTwo in Source window

The contents of captaincyTwo appear in the Source window.
Highlight the name of captains in captaincyTwo

Highlight the column drawn in captaincyTwo

This data frame has the same captains as captaincyOne.

However, it has different information about them, such as the number of matches drawn.
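The read.csv step can also be tried without the tutorial files. The sketch below writes a tiny CSV to a temporary file and reads it back; the file contents and the variable captaincy_demo are hypothetical and only mimic the names and drawn columns mentioned above.

```r
# Write a small, made-up CSV file to a temporary location.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("names,drawn", "A,11", "B,7", "C,15"), csv_path)

# read.csv treats the first line as column names by default (header = TRUE).
# stringsAsFactors = FALSE keeps text columns as character in older R versions.
captaincy_demo <- read.csv(csv_path, stringsAsFactors = FALSE)

str(captaincy_demo)  # names is read as character, drawn as integer
```

In a script, head(captaincy_demo) is a non-interactive alternative to View for a quick look at the result.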

Highlight captaincyOne in the Source window

Now, we will update captaincyOne by adding information from captaincyTwo.

For this, we use the merge function.

Highlight myDataSet.R in the Source window

Click on the script myDataSet.R.

Drag the Source window.

I am resizing the Source window.
[RStudio]

captaincyOne <- merge(captaincyOne, captaincyTwo, by = "names")

In the Source window, type the following command. Press Enter.
Highlight by = "names" in the Source window

In the merge function, the by argument specifies the column on which the two data frames are merged.

Here, it is names.
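How merge matches rows on the by column can be seen on two small data frames. This is a sketch with hypothetical data; the data frames one and two and their values are invented, and only the column names names and drawn come from the tutorial.

```r
# Two made-up data frames that share the key column "names".
one <- data.frame(names = c("A", "B", "C"), won = c(27, 21, 21),
                  stringsAsFactors = FALSE)
two <- data.frame(names = c("C", "A", "D"), drawn = c(15, 11, 4),
                  stringsAsFactors = FALSE)

# Default merge: keeps only the names present in both data frames.
inner <- merge(one, two, by = "names")

# all = TRUE: keeps unmatched rows from both sides and fills the gaps with NA.
outer <- merge(one, two, by = "names", all = TRUE)

print(inner)
print(outer)
```

In the tutorial both data frames hold the same captains, so the default merge keeps every row. When the keys differ, all, all.x or all.y decide which unmatched rows are kept.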

[RStudio]

View(captaincyOne)

Now, type View and captaincyOne in parentheses.

Save the script and run these two lines.

Highlight captaincyOne in the Source window

The contents of the updated captaincyOne appear in the Source window.
[RStudio]

Highlight the tabs captaincyOne and captaincyTwo

Close the two tabs captaincyOne and captaincyTwo.
Cursor on the interface.

Now, we will learn how to import data of different formats in R.
[RStudio]

# Importing data in different formats

We shall add one comment first.

In the Source window, type # hash space Importing data in different formats.

Highlight CaptaincyData.xml under Files tab

Now, let us import the CaptaincyData.xml file.

For that, we need to install the XML package.

Make sure that you are connected to the Internet.

We need to install the Ubuntu package libxml2-dev before installing the XML package.

Information on how to install this package is provided in the Additional Material.

[RStudio]

Click in the Console window

I have already installed the libxml2-dev package.

Hence, I will proceed to install the XML package now.

[RStudio]

install.packages("XML")

Highlight the red dot in the Console window

In the Console window, type install dot packages.

Now, type XML inside double quotes and in parentheses.

Press Enter.

We will wait until R installs the package.

Then, we load this package using the library function.

Highlight myDataSet.R in the Source window

Click on the script myDataSet.R.

Click at the top of the script myDataSet.R

Since we are loading a package, we will add it at the top of the script.
[RStudio]

library(XML)

In the Source window, scroll up.

Now, at the top of the script myDataSet.R, type library and XML in parentheses.

Save the script and run this line.

[RStudio]

Point to the comment.

xmldata <- xmlToDataFrame("CaptaincyData.xml")

Now, in the Source window, click on the next line after the comment Importing data in different formats.

Type the following command and press Enter.

[RStudio]

View(xmldata)

Then type View and xmldata in parentheses.

Save the script and run these two lines.

Highlight xmldata in the Source window

The contents of the xml file are shown here.
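The xmlToDataFrame step can be tried on a small XML file of our own. The sketch below assumes a simple layout in which each child of the root node is one record; the file contents, the node names and the variable xml_demo are hypothetical and may not match the structure of CaptaincyData.xml. The code runs only if the XML package is installed.

```r
if (requireNamespace("XML", quietly = TRUE)) {
  # Write a small, made-up XML file: one <captain> node per row.
  xml_path <- tempfile(fileext = ".xml")
  writeLines(c(
    "<captains>",
    "  <captain><names>A</names><played>60</played></captain>",
    "  <captain><names>B</names><played>47</played></captain>",
    "</captains>"
  ), xml_path)

  # Each child of the root becomes a row; its children become columns.
  xml_demo <- XML::xmlToDataFrame(xml_path, stringsAsFactors = FALSE)

  # Values are read as text, so numeric columns need an explicit conversion.
  xml_demo$played <- as.numeric(xml_demo$played)

  str(xml_demo)
}
```

Because the values arrive as text, it is worth checking the column types with str before doing any arithmetic on imported XML data.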
Highlight CaptaincyData.txt under Files tab

Next, let us learn how to import CaptaincyData.txt.

Highlight myDataSet.R in the Source window

Click on the script myDataSet.R.
[RStudio]

txtdata <- read.table("CaptaincyData.txt")

In the Source window, type the following command and press Enter.

[RStudio]

View(txtdata)

Next, type View and txtdata in parentheses.

Save the script and run these two lines.

Highlight txtdata in the Source window

The contents of the txt file are shown.
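When a text file has column names in its first line, read.table may need the header argument. The sketch below shows the difference on a made-up, space-separated file; whether CaptaincyData.txt itself has a header line is not stated in this tutorial, so treat this as an illustration only. The file contents and variable names are hypothetical.

```r
# Write a small, made-up text file whose first line holds the column names.
txt_path <- tempfile(fileext = ".txt")
writeLines(c("names played won", "A 60 27", "B 47 21"), txt_path)

# Without header = TRUE the first line is treated as data here,
# and the columns are named V1, V2, V3.
no_header <- read.table(txt_path, stringsAsFactors = FALSE)

# With header = TRUE the first line supplies the column names.
with_header <- read.table(txt_path, header = TRUE, stringsAsFactors = FALSE)

print(no_header)
print(with_header)
```

The sep argument of read.table can be set in the same way for files that use a separator other than white space.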
Highlight CaptaincyData.xlsx under the Files tab

Now, we will learn how to import data using the user interface of RStudio.

I am resizing the Source window.

We will import the Excel file CaptaincyData.xlsx using this method.

Please ensure that packages such as readxl and Rcpp are installed on your system.

Highlight Environment tab

In the top right corner of RStudio, click on the Environment tab.
Highlight Import Dataset button

Highlight From Excel option

In the Environment tab, click on Import Dataset.

From the drop-down menu, select From Excel.

Highlight Import Excel Data window

A window named Import Excel Data appears.

Highlight File/Url option

You can select a file on your computer or type the URL from which you want to load an Excel file.

We will select a file on our computer.

Highlight Browse option

In the upper right corner of this window, near the File/Url text field, click on Browse.

Highlight CaptaincyData.xlsx in the folder myProject

I will select the file CaptaincyData.xlsx located in the DataMerging folder.

This folder is in the myProject folder on the Desktop.

Click Open to load this file.

Highlight Data Preview option

Below the field File/Url, RStudio shows the preview of the Excel file being imported.

Highlight Code Preview option

At the bottom right corner of this window, you can see the code for importing this Excel file.

Highlight Import button

Finally, click on the Import button.

Highlight CaptaincyData in the Source window

The contents of the Excel file are shown here.
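The same import can also be written in a script. The code shown in Code Preview is typically a call to read_excel from the readxl package; the sketch below wraps such a call in a hypothetical helper, import_excel, which is not part of the tutorial. The helper returns NULL when readxl is not installed or the file is not found, so the lines can be run safely in any working directory.

```r
# Hypothetical helper: reads an Excel file if readxl and the file are available.
import_excel <- function(path) {
  if (!requireNamespace("readxl", quietly = TRUE) || !file.exists(path)) {
    return(NULL)  # nothing to import
  }
  # read_excel returns a tibble; convert it to a plain data frame.
  as.data.frame(readxl::read_excel(path))
}

# Assumes CaptaincyData.xlsx is in the current working directory.
CaptaincyData <- import_excel("CaptaincyData.xlsx")

if (!is.null(CaptaincyData)) {
  str(CaptaincyData)
}
```

Keeping the import in the script makes the analysis reproducible without repeating the clicks in the Import Dataset window.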
Let us summarize what we have learnt.
Show Slide

Summary

In this tutorial, we have learnt how to:
  • Use built-in functions for exploring a data frame
  • Merge two data frames
  • Import data in different formats in R
Show Slide

Assignment

We now suggest an assignment.
  • Using built-in dataset iris, implement all the functions we have learnt in this tutorial.
Show slide

About the Spoken Tutorial Project

The video at the following link summarises the Spoken Tutorial project.

Please download and watch it.

Show slide

Spoken Tutorial Workshops

We conduct workshops using Spoken Tutorials and give certificates.

Please contact us.

Show Slide

Forum to answer questions

Please post your timed queries in this forum.
Show Slide

Forum to answer questions

Please post your general queries in this forum.
Show Slide

Textbook Companion

The FOSSEE team coordinates the TBC project.

For more details, please visit these sites.

Show Slide

Acknowledgement

The Spoken Tutorial project is funded by NMEICT, MHRD, Govt. of India
Show Slide

Thank You

The script for this tutorial was contributed by Shaik Sameer (FOSSEE Fellow 2018).

This is Sudhakar Kumar from IIT Bombay signing off. Thanks for watching.

Contributors and Content Editors

Madhurig, Nancyvarkey, Sudhakarst