Revision as of 19:06, 6 August 2019

Title of the script: Data Manipulation using dplyr package

Author: Varshit Dubey (CoE Pune) and Sudhakar Kumar (IIT Bombay)

Keywords: R, RStudio, data manipulation, dplyr, filter, video tutorial

Visual Cue Narration
Show slide

Opening Slide

Welcome to this tutorial on Data manipulation using dplyr package.
Show slide

Learning Objective

In this tutorial, we will learn about,
  • Data manipulation
  • dplyr package
  • How to use filter and arrange functions

Show slide

Pre-requisites

To understand this tutorial, you should know,
  • Basics of statistics
  • Basics of ggplot2 package
  • Data frames

If not, please locate the relevant tutorials on R on this website.

Show slide

System Specifications

This tutorial is recorded on
  • Ubuntu Linux OS version 16.04
  • R version 3.4.4
  • RStudio version 1.1.463

Install R version 3.2.0 or higher.

Show slide

Download Files

For this tutorial, we will use
  • A data set moviesData.csv
  • A script file myVis.R.

Please download these files from the Code files link of this tutorial.

[Computer screen]

Highlight moviesData.csv and myVis.R in the folder DataVis

I have downloaded and moved these files to DataVis folder.

This folder is located in myProject folder on my Desktop.

I have also set DataVis folder as my Working Directory.

Show slide

Need for Data Manipulation

In real life, it is rare that we get the data in exactly the right form we need.
Show slide

Need for Data Manipulation

Often we’ll need to
  • create some new variables or summaries
  • rename the variables
  • reorder the observations in order to make the data a little easier to work with
We will learn how to achieve all this by using dplyr package.
Show slide

About dplyr Package

  • dplyr is a package for data manipulation, written and maintained by Hadley Wickham.
  • It comprises many functions that perform the most commonly used data manipulation operations.
Let us switch to RStudio.
Highlight myVis.R in the Files window of RStudio

Open the script myVis.R in RStudio.

Highlight the Source button

Let us run this script by clicking on the Source button.

Highlight movies in the Source window

The movies data frame opens in the Source window.

This data frame will be used later in this tutorial.

Now, we will install dplyr package. Please make sure that you are connected to the Internet.
[RStudio]

install.packages("dplyr")

In the Console window, type the following command and press Enter.
Highlight the red dot in the Console window

The installation of the package takes a few seconds.

We will wait while the package is being installed.

Click at the top of the script myVis.R

To load this package, we will add the library at the top of the script.

Highlight the script myVis.R in the Source window

Click on the script myVis.R.
[RStudio]

library(dplyr)

Press Ctrl+Enter keys.

At the top of the script, type library and dplyr in parentheses.

Save the script and run this line by pressing Ctrl+Enter keys simultaneously.

Show slide

Functions in dplyr package

Now we will learn about some key functions in the dplyr package:
  • filter - to select cases based on their values.
  • arrange - to reorder the cases.
  • select - to select variables based on their names.
  • mutate - to add new variables that are functions of existing variables.
Show slide

Functions in dplyr package

  • summarise - to condense multiple values to a single value.

All these functions can be combined with the group underscore by function, which allows us to perform any operation by group.
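The select, mutate, summarise and group_by functions are not demonstrated in this tutorial. As a rough sketch only, assuming a small made-up data frame called toy (its column names and values are invented for illustration and are not taken from moviesData.csv), they might be used as follows:

```r
library(dplyr)

# Made-up toy data, for illustration only (not from moviesData.csv)
toy <- data.frame(
  title  = c("A", "B", "C", "D"),
  genre  = c("Comedy", "Drama", "Comedy", "Drama"),
  rating = c(6.0, 7.0, 8.0, 5.0),
  stringsAsFactors = FALSE
)

# select: keep only the named variables
toySel <- select(toy, title, rating)

# mutate: add a new variable computed from an existing one
toyMut <- mutate(toy, ratingOutOf100 = rating * 10)

# group_by + summarise: condense the ratings to one mean value per genre
toySum <- summarise(group_by(toy, genre), meanRating = mean(rating))
print(toySum)
```

Here toySum has one row per genre, because summarise is applied to each group created by group_by.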

Let us switch to RStudio.
Highlight movies in the Source window

In the Source window, click on movies.

Highlight the scroll bar in the Source window

In the Source window, scroll from left to right.

This will enable us to see the remaining columns of the movies data frame.

Highlight genre in the Source window

Suppose we want to filter the movies having genre as Comedy.

For this, we will use the filter function.

Highlight the script myVis.R in the Source window

Click on the script myVis.R.
[RStudio]

moviesComedy <- filter(movies,
                       genre == "Comedy")

In the Source window, type the following command.
Highlight filter in the Source window

Recall that the filter function in the dplyr package allows us to select cases based on their values.

Highlight movies after filter in the Source window

Inside the filter function, the first argument is the name of the data frame, which is movies.

Highlight genre == "Comedy" in the Source window

The second argument is the condition by which we want to filter the movies data frame.

Highlight the Run button in the Source window

Save the script and run the current line.

Highlight moviesComedy in the Environment window

The resulting data frame is stored in an object called moviesComedy, which appears in the Environment window.

Let us view the data frame moviesComedy to check whether it contains movies with genre as Comedy.

[RStudio]

View(moviesComedy)

In the Source window, type the following command.
Highlight the Run button in the Source window

Run the current line.

Highlight moviesComedy in the Source window

The moviesComedy data frame opens in the Source window.

Highlight genre in the Source window

All the movies having genre as Comedy have been filtered.

Highlight moviesComedy in the Source window

Let us close this data frame moviesComedy for now.
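Apart from viewing the result, the outcome of filter can also be checked in code. The following is a minimal sketch, assuming a made-up data frame called toy whose values are invented for illustration. The prefix dplyr:: is written out because loading dplyr masks the filter function of the stats package.

```r
library(dplyr)

# Made-up toy data, for illustration only (not from moviesData.csv)
toy <- data.frame(
  title = c("A", "B", "C", "D"),
  genre = c("Comedy", "Drama", "Comedy", "Horror"),
  stringsAsFactors = FALSE
)

# Keep only the rows for which the condition is TRUE
toyComedy <- dplyr::filter(toy, genre == "Comedy")

# Every remaining row should have genre equal to "Comedy"
allComedy <- all(toyComedy$genre == "Comedy")
print(allComedy)        # TRUE
print(nrow(toyComedy))  # 2
```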
Highlight filter in the Source window

We can also use logical operators to combine two or more conditions.

Highlight movies in the Source window

In the Source window, click on movies.

Highlight genre in the Source window

Suppose we want to filter the movies with genre as either Comedy or Drama.

Highlight the script myVis.R in the Source window

Click on the script myVis.R.
[RStudio]

moviesComDr <- filter(movies,
                      genre == "Comedy" | genre == "Drama")

View(moviesComDr)

In the Source window, type the following commands.
Highlight filter in the Source window

Here, we have two values by which we would like to filter the movies data frame.

Highlight | in the Source window

For this, we have used the logical OR operator.

Highlight the Run button in the Source window

Run the last two lines of code.

Highlight moviesComDr in the Source window

moviesComDr opens in the Source window.

The movies having genre as either Comedy or Drama have been filtered.
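The OR condition can be tried on a small example. This is only a sketch, assuming a made-up data frame called toy with invented values. Note that the single bar | compares element by element, which is what filter needs, whereas the double bar || is intended for single TRUE or FALSE values.

```r
library(dplyr)

# Made-up toy data, for illustration only (not from moviesData.csv)
toy <- data.frame(
  title = c("A", "B", "C", "D"),
  genre = c("Comedy", "Drama", "Horror", "Comedy"),
  stringsAsFactors = FALSE
)

# A row is kept if at least one of the two conditions is TRUE
toyComDr <- filter(toy, genre == "Comedy" | genre == "Drama")
print(toyComDr$title)  # "A" "B" "D"
```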

Highlight moviesComDr in the Source window

Let us close this data frame moviesComDr for now.

Highlight moviesComDr <- filter(movies, genre == "Comedy" | genre == "Drama") in the Source window

This filter function can also be written using the match operator.
[RStudio]

moviesComDrP <- filter(movies,
                       genre %in% c("Comedy", "Drama"))

View(moviesComDrP)

In the Source window, type the following commands.

Highlight %in% in the Source window

%in% is used for value matching.
[RStudio]

help('%in%')

To know more about this operator, let us access the Help.

In the Console window, type the following command and press Enter.

Highlight Help window

The match function returns a vector of the positions of (first) matches of its first argument in its second. The %in% operator returns a logical vector indicating whether there is a match for each element of its left operand.

Highlight the Run button in the Source window

Run the last two lines of code.

Highlight moviesComDrP in the Source window

moviesComDrP opens in the Source window.

The movies having genre as either Comedy or Drama have been filtered.
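The difference between match and %in%, and the equivalence of the two ways of writing the filter, can be seen in a short sketch. The vectors genres and wanted and the data frame toy are made up for illustration and are not part of the tutorial files.

```r
library(dplyr)

# Made-up values, for illustration only
genres <- c("Comedy", "Horror", "Drama")
wanted <- c("Comedy", "Drama")

# match gives the position of each element of genres within wanted
pos <- match(genres, wanted)
print(pos)    # 1 NA 2

# %in% gives TRUE or FALSE for each element of genres
isIn <- genres %in% wanted
print(isIn)   # TRUE FALSE TRUE

# Both forms of the filter keep the same rows
toy <- data.frame(title = c("A", "B", "C"), genre = genres,
                  stringsAsFactors = FALSE)
byOr <- filter(toy, genre == "Comedy" | genre == "Drama")
byIn <- filter(toy, genre %in% wanted)
print(identical(byOr$title, byIn$title))  # TRUE
```

The %in% form tends to be easier to read when the list of allowed values grows beyond two.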

Highlight moviesComDrP in the Source window

Let us close this data frame moviesComDrP for now.

Highlight movies in the Source window

In the Source window, click on movies.

Highlight the scroll bar in the Source window

In the Source window, scroll from left to right.

Highlight genre and imdb_rating in the Source window

Let us now filter movies with genre as Comedy and imdb underscore rating greater than or equal to 7 point 5.

Highlight the script myVis.R in the Source window

Click on the script myVis.R.
[RStudio]

moviesComIm <- filter(movies,
                      genre == "Comedy" & imdb_rating >= 7.5)

View(moviesComIm)

In the Source window, type the following commands.

Highlight genre == "Comedy" & imdb_rating >= 7.5 in the Source window

Here, we have used the logical AND operator to include both conditions.

Highlight the Run button in the Source window

Save the script and run the last two lines of code.

Highlight moviesComIm in the Source window

moviesComIm opens in the Source window.

I will resize the Console window.

There are seven movies with genre as Comedy and imdb underscore rating greater than or equal to 7 point 5.
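The number of matching rows can also be obtained with nrow instead of counting them in the viewer. The sketch below assumes a made-up data frame called toy with invented values. It also shows that conditions separated by commas inside filter are combined in the same way as with the AND operator.

```r
library(dplyr)

# Made-up toy data, for illustration only (not from moviesData.csv)
toy <- data.frame(
  title  = c("A", "B", "C", "D"),
  genre  = c("Comedy", "Comedy", "Drama", "Comedy"),
  rating = c(8.0, 6.5, 9.0, 7.5),
  stringsAsFactors = FALSE
)

# A row is kept only if both conditions are TRUE
byAnd <- filter(toy, genre == "Comedy" & rating >= 7.5)

# Comma-separated conditions give the same result
byComma <- filter(toy, genre == "Comedy", rating >= 7.5)

print(nrow(byAnd))                              # 2
print(identical(byAnd$title, byComma$title))    # TRUE
```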

Highlight moviesComIm in the Source window

Let us close this data frame moviesComIm for now.

Highlight movies in the Source window

In the Source window, click on movies.

Highlight imdb_rating in the Source window

Suppose we want to arrange the movies in ascending order of imdb underscore rating.

For this, we will use the arrange function.

Highlight the script myVis.R in the Source window Click on the script myVis.R
[RStudio]

moviesImA <- arrange(movies, imdb_rating)

View(moviesImA)

In the Source window, type the following commands.

Highlight the Run button in the Source window

Run the last two lines of code.

Highlight moviesImA in the Source window

moviesImA opens in the Source window.

Highlight imdb_rating in the Source window

In the Source window, scroll from left to right and locate the imdb_rating column.


The movies have been arranged in ascending order of imdb_rating.
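The behaviour of arrange can be seen on a small example. This is a sketch only, assuming a made-up data frame called toy with invented values, one of which is deliberately missing. In dplyr, missing values are placed at the end of the arranged result.

```r
library(dplyr)

# Made-up toy data, for illustration only (not from moviesData.csv)
toy <- data.frame(
  title  = c("A", "B", "C", "D"),
  rating = c(7.5, NA, 6.0, 9.0),
  stringsAsFactors = FALSE
)

# Rows are reordered from the lowest to the highest rating
toyAsc <- arrange(toy, rating)
print(toyAsc$title)  # "C" "A" "D" "B"
```

The original data frame toy is left unchanged. arrange returns a new, reordered data frame.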

Highlight imdb_rating in the Source window

Now, let us say we want to arrange the movies in descending order of imdb_rating.

For this, we will use the desc function.

Highlight moviesImA in the Source window

Let us close this data frame moviesImA for now.
[RStudio]

moviesImD <- arrange(movies, desc(imdb_rating))

View(moviesImD)

In the Source window, type the following commands.

Highlight the Run button in the Source window

Run the last two lines of code.

Highlight moviesImD in the Source window

moviesImD opens in the Source window.

Highlight imdb_rating in the Source window

In the Source window, scroll from left to right and locate the imdb_rating column.

The movies have been arranged in descending order of imdb_rating.
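As a small sketch of desc, assuming a made-up data frame called toy with invented values, the same function can be applied to a numeric column as well as to a character column:

```r
library(dplyr)

# Made-up toy data, for illustration only (not from moviesData.csv)
toy <- data.frame(
  title  = c("A", "B", "C"),
  rating = c(7.5, 6.0, 9.0),
  stringsAsFactors = FALSE
)

# Highest rating first
toyDesc <- arrange(toy, desc(rating))
print(toyDesc$title)     # "C" "A" "B"

# desc also reverses the order of a character column
toyTitleDesc <- arrange(toy, desc(title))
print(toyTitleDesc$title)  # "C" "B" "A"
```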

Highlight moviesImD in the Source window

Let us close this data frame moviesImD for now.

Highlight movies in the Source window

In the Source window, click on movies.

Highlight genre and imdb_rating in the Source window

Suppose we want to arrange the movies both by genre and imdb_rating.

Highlight the script myVis.R in the Source window

Click on the script myVis.R.
[RStudio]

moviesGeIm <- arrange(movies, genre, imdb_rating)

View(moviesGeIm)

In the Source window, type the following commands.
Highlight the Run button in the Source window

Run the last two lines of code.

Highlight moviesGeIm in the Source window

moviesGeIm opens in the Source window.

Highlight the scroll bar in the Source window

In the Source window, scroll from left to right.

The movies have been arranged first by genre and then, within each genre, by imdb_rating.
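When arrange is given more than one column, the later columns are used only to order rows that are tied on the earlier ones. A minimal sketch, assuming a made-up data frame called toy with invented values, which also combines two sort keys with desc:

```r
library(dplyr)

# Made-up toy data, for illustration only (not from moviesData.csv)
toy <- data.frame(
  title  = c("A", "B", "C", "D", "E"),
  genre  = c("Drama", "Comedy", "Drama", "Comedy", "Comedy"),
  rating = c(7.0, 8.0, 6.0, 5.0, 9.0),
  stringsAsFactors = FALSE
)

# Genre first, then ascending rating within each genre
toyGeIm <- arrange(toy, genre, rating)
print(toyGeIm$title)   # "D" "B" "E" "C" "A"

# Genre first, then descending rating within each genre
toyGeImD <- arrange(toy, genre, desc(rating))
print(toyGeImD$title)  # "E" "B" "D" "A" "C"
```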

Let us summarize what we have learnt.

Show slide

Summary

In this tutorial, we have learnt about:
  • Data manipulation
  • dplyr package
  • How to use filter and arrange functions
Show slide

Assignment

We now suggest an assignment.
  • Consider the built-in data set mtcars. Find the cars with hp greater than 100 and cyl equal to 3.
  • Arrange the mtcars data set based on mpg variable.
Show slide

About the Spoken Tutorial Project

The video at the following link summarises the Spoken Tutorial project.

Please download and watch it.

Show slide

Spoken Tutorial Workshops

We conduct workshops using Spoken Tutorials and give certificates.

Please contact us.

Show Slide

Forum to answer questions

Please post your timed queries in this forum.
Show Slide

Forum to answer questions

Please post your general queries in this forum.
Show Slide

Textbook Companion

The FOSSEE team coordinates the TBC project.

For more details, please visit these sites.

Show Slide

Acknowledgment

The Spoken Tutorial project is funded by NMEICT, MHRD, Govt. of India.
Show Slide

Thank You

The script for this tutorial was contributed by Varshit Dubey (CoE Pune).


This is Sudhakar Kumar from IIT Bombay signing off. Thanks for watching.

Contributors and Content Editors

Madhurig, Nancyvarkey, Sudhakarst