Difference between revisions of "R/C2/Data-Manipulation-using-dplyr-Package/English"

From Script | Spoken-Tutorial
Line 73: Line 73:
  
 
|| Often we’ll need to  
 
|| Often we’ll need to  
* create some new variables or summaries
+
* create some new '''variables''' or '''summaries'''
* rename the variables
+
* rename the '''variables'''
 
* reorder the observations in order to make the data a little easier to work with
 
* reorder the observations in order to make the data a little easier to work with
  
 
|-  
 
|-  
 
||  
 
||  
|| We will learn how to achieve all this by using '''dplyr''' package.
+
|| We will learn how to achieve all this by using '''dplyr package'''.
 
|-  
 
|-  
 
|| Show slide
 
|| Show slide
Line 86: Line 86:
  
 
||  
 
||  
* '''dplyr''' is a package for data manipulation, written and maintained by '''Hadley Wickham'''.
+
* '''dplyr''' is a '''package''' for '''data manipulation''', written and maintained by '''Hadley Wickham'''.
* It comprises many functions that perform mostly used data manipulation operations.  
+
* It comprises many '''functions''' that perform mostly used '''data manipulation operations'''.  
  
 
|-  
 
|-  
Line 93: Line 93:
 
|| Let us switch to '''RStudio'''.
 
|| Let us switch to '''RStudio'''.
 
|-  
 
|-  
|| Highlight '''myVis.R''' in the '''Files '''window''' '''of '''RStudio '''
+
|| Highlight '''myVis.R''' in the '''Files '''window of '''RStudio '''
 
|| Open the '''script myVis.R '''in''' RStudio'''.  
 
|| Open the '''script myVis.R '''in''' RStudio'''.  
 
|-  
 
|-  
 
|| Highlight the '''Source''' button
 
|| Highlight the '''Source''' button
|| Let us run this '''script''' by clicking on the '''Source''' button.  
+
|| Let us '''run''' this '''script''' by clicking on the '''Source''' button.  
 
|-  
 
|-  
 
|| Highlight '''movies''' in the '''Source''' window  
 
|| Highlight '''movies''' in the '''Source''' window  
Line 105: Line 105:
 
|-  
 
|-  
 
||  
 
||  
|| Now, we will install '''dplyr''' package. Please make sure that you are connected to the Internet.  
+
|| Now, we will install '''dplyr package'''. Please make sure that you are connected to the '''Internet'''.  
 
|-  
 
|-  
 
|| [RStudio]  
 
|| [RStudio]  
  
 
'''install.packages("dplyr")'''
 
'''install.packages("dplyr")'''
|| In the '''Console''' window, type the following command and press '''Enter'''.  
+
|| In the '''Console''' window, type the following '''command''' and press '''Enter'''.  
 
|-  
 
|-  
 
|| Highlight the red dot in the '''Console''' window  
 
|| Highlight the red dot in the '''Console''' window  
|| The installation of the package takes a few seconds.  
+
|| The installation of the '''package''' takes a few seconds.  
  
We will wait while the package is being installed.  
+
We will wait while the '''package''' is being installed.  
 
|-  
 
|-  
 
|| Click at the top of the '''script myVis.R'''
 
|| Click at the top of the '''script myVis.R'''
|| To load this package, we will add the library at the top of the '''script'''.
+
|| To load this '''package''', we will add the library at the top of the '''script'''.
 
|-  
 
|-  
 
|| Highlight the '''script myVis.R''' in the '''Source''' window  
 
|| Highlight the '''script myVis.R''' in the '''Source''' window  
Line 128: Line 128:
  
 
Press '''Ctrl+Enter''' keys.
 
Press '''Ctrl+Enter''' keys.
|| At the top of the '''script''', type '''library '''and '''dplyr '''in parentheses'''.'''
+
|| At the top of the '''script''', type '''library '''and '''dplyr '''in parentheses.
  
Save the '''script '''and run this line by pressing '''Ctrl+Enter''' keys simultaneously.
+
Save the '''script '''and '''run''' this line by pressing '''Ctrl+Enter''' keys simultaneously.
 
|-  
 
|-  
 
|| Show slide
 
|| Show slide
  
Functions in '''dplyr''' package
+
'''Functions''' in '''dplyr package'''
  
|| Now we learn about some key functions in '''dplyr '''package:  
+
|| Now we learn about some key '''functions''' in '''dplyr package''':  
* '''filter - '''to select cases based on their values.  
+
* '''filter - '''to select ''cases'' based on their values.  
* '''arrange - '''to reorder the cases.''' '''
+
* '''arrange - '''to reorder the '''cases.'''  
* '''select - '''to select variables based on their names.
+
* '''select - '''to select '''variables''' based on their names.
* '''mutate - '''to add new variables that are functions of existing variables.
+
* '''mutate - '''to add new '''variables''' that are '''functions''' of existing '''variables'''.
  
 
|-  
 
|-  
 
|| Show slide
 
|| Show slide
  
Functions in '''dplyr''' package
+
'''Functions''' in '''dplyr package'''
 
||  
 
||  
 
* '''summarise - '''to condense multiple values to a single value.
 
* '''summarise - '''to condense multiple values to a single value.
  
These all functions can be combined with '''group underscore by''' function. It allows us to perform any operation by a group.
+
All these '''functions''' can be combined with '''group underscore by function'''. It allows us to perform any operation by a '''group'''.
 
|-  
 
|-  
 
||  
 
||  
Line 160: Line 160:
 
|| In the '''Source''' window, scroll from left to right.  
 
|| In the '''Source''' window, scroll from left to right.  
  
This will enable us to see the remaining objects of '''movies''' '''data frame'''.  
+
This will enable us to see the remaining '''objects''' of '''movies data frame'''.  
 
|-  
 
|-  
 
|| Highlight '''genre''' in the '''Source''' window  
 
|| Highlight '''genre''' in the '''Source''' window  
|| Suppose we want to filter the movies having '''genre''' as '''Comedy'''.
+
|| Suppose we want to filter the movies having genre as '''Comedy'''.
  
For this, we will use the '''filter''' function.  
+
For this, we will use the '''filter function'''.  
 
|-  
 
|-  
 
|| Highlight the '''script myVis.R''' in the '''Source''' window  
 
|| Highlight the '''script myVis.R''' in the '''Source''' window  
Line 175: Line 175:
  
 
'''genre == "Comedy")'''
 
'''genre == "Comedy")'''
|| In the '''Source''' window, type the following command.  
+
|| In the '''Source''' window, type the following '''command'''.  
 
|-  
 
|-  
 
|| Highlight '''filter''' in the '''Source''' window  
 
|| Highlight '''filter''' in the '''Source''' window  
|| Recall that, '''filter '''function in '''dplyr''' package allows us to select cases based on their values.  
+
|| Recall that, '''filter function''' in '''dplyr package''' allows us to select '''cases''' based on their values.  
 
|-  
 
|-  
 
|| Highlight '''movies''' after '''filter''' in the '''Source''' window  
 
|| Highlight '''movies''' after '''filter''' in the '''Source''' window  
|| Inside the '''filter''' function, the first argument is the name of the '''data frame''' which is '''movies'''.  
+
|| Inside the '''filter function''', the first '''argument''' is the name of the '''data frame''' which is '''movies'''.  
 
|-  
 
|-  
 
|| Highlight '''genre == "Comedy" '''in the '''Source''' window  
 
|| Highlight '''genre == "Comedy" '''in the '''Source''' window  
|| The second argument is the value by which we want to filter the '''movies''' data frame.  
+
|| The second '''argument''' is the value by which we want to filter the '''movies data frame. '''
 
|-  
 
|-  
 
|| Highlight the '''Run''' button in the '''Source''' window  
 
|| Highlight the '''Run''' button in the '''Source''' window  
|| Save the '''script''' and run the current line.  
+
|| Save the '''script''' and '''run''' the current line.  
 
|-  
 
|-  
 
|| Highlight '''moviesComedy''' in the '''Environment''' window  
 
|| Highlight '''moviesComedy''' in the '''Environment''' window  
|| Resulting data frame is stored in an object called '''moviesComedy '''in the''' Environment window.'''
+
|| Resulting '''data frame''' is stored in an '''object''' called '''moviesComedy '''in the''' Environment window.'''
  
Let us view the data frame '''moviesComedy''' to check whether it contains '''movies''' with genre as '''Comedy'''.
+
Let us view the '''data frame moviesComedy''' to check whether it contains '''movies''' with genre as '''Comedy'''.
 
|-  
 
|-  
 
|| [RStudio]
 
|| [RStudio]
  
 
'''View(moviesComedy) '''
 
'''View(moviesComedy) '''
|| In the '''Source''' window, type the following command.  
+
|| In the '''Source''' window, type the following '''command'''.  
 
|-  
 
|-  
 
|| Highlight the '''Run''' button in the '''Source''' window  
 
|| Highlight the '''Run''' button in the '''Source''' window  
|| Run the current line.  
+
|| '''Run''' the current line.  
 
|-  
 
|-  
 
|| Highlight '''moviesComedy''' in the Source window  
 
|| Highlight '''moviesComedy''' in the Source window  
|| '''moviesComedy''' data frame opens in the '''Source''' window.  
+
|| '''moviesComedy data frame''' opens in the '''Source''' window.  
 
|-  
 
|-  
 
|| Highlight '''genre''' in the '''Source''' window  
 
|| Highlight '''genre''' in the '''Source''' window  
Line 209: Line 209:
 
|-  
 
|-  
 
|| Highlight '''moviesComedy''' in the Source window  
 
|| Highlight '''moviesComedy''' in the Source window  
|| Let us close this data frame '''moviesComedy '''for now.  
+
|| Let us close this '''data frame moviesComedy '''for now.  
 
|-  
 
|-  
 
|| Highlight '''filter''' in the '''Source''' window  
 
|| Highlight '''filter''' in the '''Source''' window  
|| We can also use logical operators to combine two or more than two values.  
+
|| We can also use '''logical''' operators to combine two or more than two values.  
 
|-
 
|-
 
|  | Highlight '''movies''' in the '''Source''' window  
 
|  | Highlight '''movies''' in the '''Source''' window  
Line 230: Line 230:
  
 
'''View(moviesComDr)'''
 
'''View(moviesComDr)'''
|| In the '''Source''' window, type the following commands.  
+
|| In the '''Source''' window, type the following '''commands'''.  
 
|-  
 
|-  
 
|| Highlight '''filter''' in the '''Source''' window  
 
|| Highlight '''filter''' in the '''Source''' window  
|| Here, we have two values by which we would like to filter '''movies''' '''data frame'''.  
+
|| Here, we have two values by which we would like to filter '''movies data frame'''.  
 
|-  
 
|-  
 
|| Highlight''' |''' in the '''Source''' window  
 
|| Highlight''' |''' in the '''Source''' window  
|| For this, we have used a logical '''OR''' operator.  
+
|| For this, we have used a '''logical OR''' operator.  
 
|-  
 
|-  
 
|| Highlight the '''Run''' button in the '''Source''' window  
 
|| Highlight the '''Run''' button in the '''Source''' window  
|| Run the last two lines of code.  
+
|| '''Run''' the last two lines of code.  
 
|-
 
|-
 
|  | Highlight '''moviesComDr''' in the '''Source''' window  
 
|  | Highlight '''moviesComDr''' in the '''Source''' window  
 
|  | '''moviesComDr '''opens in the '''Source''' window.  
 
|  | '''moviesComDr '''opens in the '''Source''' window.  
  
The '''movies''' having '''genre''' as either '''Comedy '''or '''Drama '''have been filtered.  
+
The '''movies''' having genre as either '''Comedy '''or '''Drama '''have been filtered.  
 
|-  
 
|-  
 
|| Highlight '''moviesComDr''' in the '''Source''' window  
 
|| Highlight '''moviesComDr''' in the '''Source''' window  
|| Let us close this data frame '''moviesComDr '''for now.  
+
|| Let us close this '''data frame moviesComDr '''for now.  
 
|-
 
|-
 
|  | Highlight '''moviesComDr <- filter(movies, genre == "Comedy" | genre == "Drama") '''in the '''Source''' window  
 
|  | Highlight '''moviesComDr <- filter(movies, genre == "Comedy" | genre == "Drama") '''in the '''Source''' window  
|  | This '''filter''' function can also be written using the '''match''' operator.  
+
|  | This '''filter function''' can also be written using the '''match''' operator.  
 
|-
 
|-
 
|  | [RStudio]
 
|  | [RStudio]
Line 259: Line 259:
  
 
'''View(moviesComDrP)'''
 
'''View(moviesComDrP)'''
|  | In the '''Source''' window, type the following command.  
+
|  | In the '''Source''' window, type the following '''command'''.  
 
|-
 
|-
 
|  | Highlight '''%in%''' in the '''Source''' window  
 
|  | Highlight '''%in%''' in the '''Source''' window  
Line 269: Line 269:
 
|  | To know more about this operator, let us access the '''Help'''.  
 
|  | To know more about this operator, let us access the '''Help'''.  
  
In the '''Console''' window, type the following command and press '''Enter'''.  
+
In the '''Console''' window, type the following '''command''' and press '''Enter'''.  
 
|-
 
|-
 
|  | Highlight '''Help''' window  
 
|  | Highlight '''Help''' window  
|  | Match returns a vector of the positions of (first) matches of its first argument in its second.
+
|  | Match returns a '''vector''' of the positions of first matches of its first '''argument''' in its second.
 
|-  
 
|-  
 
|| Highlight the '''Run''' button in the '''Source''' window  
 
|| Highlight the '''Run''' button in the '''Source''' window  
|| Run the last two lines of code.  
+
|| '''Run''' the last two lines of code.  
 
|-  
 
|-  
 
|| Highlight '''moviesComDrP''' in the '''Source''' window  
 
|| Highlight '''moviesComDrP''' in the '''Source''' window  
 
|| '''moviesComDrP '''opens in the '''Source''' window.  
 
|| '''moviesComDrP '''opens in the '''Source''' window.  
  
The '''movies''' having '''genre''' as either '''Comedy '''or '''Drama '''have been filtered.  
+
The movies having genre as either '''Comedy '''or '''Drama '''have been filtered.  
 
|-  
 
|-  
 
|| Highlight '''moviesComDrP''' in the '''Source''' window  
 
|| Highlight '''moviesComDrP''' in the '''Source''' window  
|| Let us close this data frame '''moviesComDrP '''for now.  
+
|| Let us close this '''data frame moviesComDrP '''for now.  
 
|-
 
|-
 
|  | Highlight '''movies''' in the '''Source''' window  
 
|  | Highlight '''movies''' in the '''Source''' window  
Line 292: Line 292:
 
|-  
 
|-  
 
|| Highlight '''genre''' and '''imdb_rating''' in the '''Source''' window  
 
|| Highlight '''genre''' and '''imdb_rating''' in the '''Source''' window  
|| Let us now filter '''movies''' with '''genre''' as '''Comedy '''and '''imdb '''underscore''' rating '''greater than or equal to 7 point 5.
+
|| Let us now filter movies with genre as '''Comedy '''and '''imdb underscore rating '''greater than or equal to 7 point 5.
 
|-  
 
|-  
 
|| Highlight the '''script myVis.R''' in the '''Source''' window  
 
|| Highlight the '''script myVis.R''' in the '''Source''' window  
Line 304: Line 304:
  
 
'''View(moviesComIm)'''
 
'''View(moviesComIm)'''
|| In the '''Source''' window, type the following command.  
+
|| In the '''Source''' window, type the following '''command'''.  
 
|-  
 
|-  
 
|| Highlight '''genre == "Comedy" & imdb_rating >= 7.5 '''in the '''Source''' window  
 
|| Highlight '''genre == "Comedy" & imdb_rating >= 7.5 '''in the '''Source''' window  
|| Here, we have used a logical '''AND''' operator to include both conditions.  
+
|| Here, we have used a '''logical AND''' operator to include both conditions.  
  
 
|-  
 
|-  
 
|| Highlight the '''Run''' button in the '''Source''' window  
 
|| Highlight the '''Run''' button in the '''Source''' window  
|| Save the script and run the last two lines of code.  
+
|| Save the script and '''run''' the last two lines of code.  
 
|-  
 
|-  
 
|| Highlight '''moviesComIm''' in the '''Source''' window  
 
|| Highlight '''moviesComIm''' in the '''Source''' window  
Line 318: Line 318:
 
I will resize the '''Console''' window.  
 
I will resize the '''Console''' window.  
  
There are seven movies with '''genre''' as '''Comedy '''and '''imdb '''underscore''' rating '''greater than or equal to 7 point 5.  
+
There are seven movies with genre as '''Comedy '''and '''imdb underscore rating '''greater than or equal to 7 point 5.  
 
|-  
 
|-  
 
|| Highlight '''moviesComIm''' in the '''Source''' window  
 
|| Highlight '''moviesComIm''' in the '''Source''' window  
|| Let us close this data frame '''moviesComIm '''for now.  
+
|| Let us close this '''data frame moviesComIm '''for now.  
 
|-
 
|-
 
|  | Highlight '''movies''' in the '''Source''' window  
 
|  | Highlight '''movies''' in the '''Source''' window  
Line 327: Line 327:
 
|-  
 
|-  
 
|| Highlight '''imdb_rating''' in the '''Source''' window  
 
|| Highlight '''imdb_rating''' in the '''Source''' window  
|| Suppose, we want to arrange the '''movies''' in an ascending order of '''imdb '''underscore '''rating'''.  
+
|| Suppose, we want to arrange the movies in an ascending order of '''imdb underscore rating'''.  
  
For this, we will use the '''arrange''' function.  
+
For this, we will use the '''arrange function'''.  
 
|-  
 
|-  
 
|| Highlight the '''script myVis.R''' in the '''Source''' window  
 
|| Highlight the '''script myVis.R''' in the '''Source''' window  
Line 339: Line 339:
  
 
'''View(moviesImA)'''
 
'''View(moviesImA)'''
|  | In the '''Source''' window, type the following command.  
+
|  | In the '''Source''' window, type the following '''command'''.  
 
|-  
 
|-  
 
|| Highlight the '''Run''' button in the '''Source''' window  
 
|| Highlight the '''Run''' button in the '''Source''' window  
|| Run the last two lines of code.  
+
|| '''Run''' the last two lines of code.  
 
|-  
 
|-  
 
|| Highlight '''moviesImA''' in the '''Source''' window  
 
|| Highlight '''moviesImA''' in the '''Source''' window  
Line 348: Line 348:
 
|-  
 
|-  
 
|| Highlight '''imdb_rating''' in the '''Source''' window
 
|| Highlight '''imdb_rating''' in the '''Source''' window
|| In the '''Source''' window, scroll from left to right and locate the '''imdb_rating''' column.  
+
|| In the '''Source''' window, scroll from left to right and locate the '''imdb underscore rating''' column.  
  
The '''movies''' have been arranged in ascending order of '''imdb_rating'''.  
+
The movies have been arranged in ascending order of '''imdb underscore rating'''.  
 
|-  
 
|-  
 
|| Highlight '''imdb_rating''' in the '''Source''' window
 
|| Highlight '''imdb_rating''' in the '''Source''' window
 
|| Now, let us say we want to arrange the movies in descending order of''' imdb rating. '''
 
|| Now, let us say we want to arrange the movies in descending order of''' imdb rating. '''
  
For this, we use '''desc''' function.  
+
For this, we use '''desc function'''.  
 
|-  
 
|-  
 
|| Highlight '''moviesImA''' in the '''Source''' window  
 
|| Highlight '''moviesImA''' in the '''Source''' window  
|| Let us close this data frame '''moviesImA '''for now.  
+
|| Let us close this '''data frame moviesImA '''for now.  
 
|-  
 
|-  
 
|| [RStudio]
 
|| [RStudio]
Line 365: Line 365:
  
 
'''View(moviesImD)'''
 
'''View(moviesImD)'''
|| In the '''Source''' window, type the following command.  
+
|| In the '''Source''' window, type the following '''command'''.  
 
|-  
 
|-  
 
|| Highlight the '''Run''' button in the '''Source''' window  
 
|| Highlight the '''Run''' button in the '''Source''' window  
|| Run the last two lines of code.  
+
|| '''Run''' the last two lines of code.  
 
|-  
 
|-  
 
|| Highlight '''moviesImD''' in the '''Source''' window  
 
|| Highlight '''moviesImD''' in the '''Source''' window  
Line 374: Line 374:
 
|-  
 
|-  
 
|| Highlight '''imdb_rating''' in the '''Source''' window
 
|| Highlight '''imdb_rating''' in the '''Source''' window
|| In the '''Source''' window, scroll from left to right and locate the '''imdb_rating '''column.  
+
|| In the '''Source''' window, scroll from left to right and locate the '''imdb underscore rating '''column.  
  
The '''movies''' have been arranged in descending order of '''imdb_rating'''.  
+
The movies have been arranged in descending order of '''imdb rating'''.  
 
|-  
 
|-  
 
|| Highlight '''moviesImD''' in the '''Source''' window  
 
|| Highlight '''moviesImD''' in the '''Source''' window  
|| Let us close this data frame '''moviesImD '''for now.  
+
|| Let us close this '''data frame moviesImD '''for now.  
 
|-
 
|-
 
|  | Highlight '''movies''' in the '''Source''' window  
 
|  | Highlight '''movies''' in the '''Source''' window  
Line 385: Line 385:
 
|-
 
|-
 
|  | Highlight '''genre''' and '''imdb_rating''' in the '''Source''' window
 
|  | Highlight '''genre''' and '''imdb_rating''' in the '''Source''' window
|  | Suppose we want to arrange the movies both by '''genre''' and '''imdb_rating'''.
+
|  | Suppose we want to arrange the movies both by genre and '''imdb rating'''.
 
|-  
 
|-  
 
|| Highlight the '''script myVis.R''' in the '''Source''' window  
 
|| Highlight the '''script myVis.R''' in the '''Source''' window  
Line 395: Line 395:
  
 
'''View(moviesGeIm)'''
 
'''View(moviesGeIm)'''
|  | In the '''Source''' window, type the following commands.  
+
|  | In the '''Source''' window, type the following '''commands'''.  
 
|-  
 
|-  
 
|| Highlight the '''Run''' button in the '''Source''' window  
 
|| Highlight the '''Run''' button in the '''Source''' window  
|| Run the last two lines of code.  
+
|| '''Run''' the last two lines of code.  
 
|-  
 
|-  
 
|| Highlight '''moviesGeIm''' in the '''Source''' window  
 
|| Highlight '''moviesGeIm''' in the '''Source''' window  
Line 406: Line 406:
 
|| In the '''Source''' window, scroll from left to right.  
 
|| In the '''Source''' window, scroll from left to right.  
  
'''movies''' have been arranged both by '''genre''' and '''imdb_rating'''.
+
Movies have been arranged both by genre and '''imdb underscore rating'''.
 
|-  
 
|-  
 
||  
 
||  
Line 418: Line 418:
  
 
|| In this tutorial, we have learnt about:
 
|| In this tutorial, we have learnt about:
* Data manipulation
+
* '''Data manipulation'''
* '''dplyr''' package
+
* '''dplyr package'''
* How to use '''filter''' and '''arrange''' functions
+
* How to use '''filter''' and '''arrange functions'''
  
 
|-  
 
|-  
Line 427: Line 427:
 
Assignment
 
Assignment
 
|| We now suggest an assignment.
 
|| We now suggest an assignment.
* Consider the built-in data set '''mtcars'''. Find the cars with '''hp''' greater than 100 and '''cyl''' equal to 3.
+
* Consider the '''built-in data set mtcars'''. Find the cars with '''hp''' greater than 100 and '''cyl''' equal to 3.
* Arrange the '''mtcars''' data set based on '''mpg''' variable.
+
* Arrange the '''mtcars data set '''based on '''mpg variable'''.
  
 
|-  
 
|-  

Revision as of 14:03, 7 August 2019

Title of the script: Data Manipulation using dplyr package

Author: Varshit Dubey (CoE Pune) and Sudhakar Kumar (IIT Bombay)

Keywords: R, RStudio, data manipulation, dplyr, filter, video tutorial

Visual Cue Narration
Show slide

Opening Slide

Welcome to this tutorial on Data manipulation using dplyr package.
Show slide

Learning Objective

In this tutorial, we will learn about,
  • Data manipulation
  • dplyr package
  • How to use filter and arrange functions

Show slide

Pre-requisites

To understand this tutorial, you should know,
  • Basics of statistics
  • Basics of ggplot2 package
  • Data frames

If not, please locate the relevant tutorials on R on this website.

Show slide

System Specifications

This tutorial is recorded on
  • Ubuntu Linux OS version 16.04
  • R version 3.4.4
  • RStudio version 1.1.463

Install R version 3.2.0 or higher.

Show slide

Download Files

For this tutorial, we will use
  • A data frame moviesData.csv
  • A script file myVis.R.

Please download these files from the Code files link of this tutorial.

[Computer screen]

Highlight moviesData.csv and myVis.R in the folder DataVis

I have downloaded and moved these files to DataVis folder.

This folder is located in myProject folder on my Desktop.

I have also set DataVis folder as my Working Directory.

Show slide

Need for Data Manipulation

In real life, it is rare that we get the data in exactly the right form we need.
Show slide

Need for Data Manipulation

Often we’ll need to
  • create some new variables or summaries
  • rename the variables
  • reorder the observations in order to make the data a little easier to work with
We will learn how to achieve all this by using the dplyr package.
Show slide

About dplyr Package

  • dplyr is a package for data manipulation, written and maintained by Hadley Wickham.
  • It comprises many functions that perform the most commonly used data manipulation operations.
Let us switch to RStudio.
Highlight myVis.R in the Files window of RStudio Open the script myVis.R in RStudio.
Highlight the Source button Let us run this script by clicking on the Source button.
Highlight movies in the Source window movies data frame opens in the Source window.

This data frame will be used later in this tutorial.

Now, we will install the dplyr package. Please make sure that you are connected to the Internet.
[RStudio]

install.packages("dplyr")

In the Console window, type the following command and press Enter.
Highlight the red dot in the Console window The installation of the package takes a few seconds.

We will wait while the package is being installed.

Click at the top of the script myVis.R To load this package, we will add a library call at the top of the script.
Highlight the script myVis.R in the Source window Click on the script myVis.R
[RStudio]

library(dplyr)

Press Ctrl+Enter keys.

At the top of the script, type library and dplyr in parentheses.

Save the script and run this line by pressing Ctrl+Enter keys simultaneously.
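As a minimal sketch (an addition for illustration, not part of the recorded script), the install and load steps can be guarded so that running the script again does not reinstall the package. requireNamespace and packageVersion are standard R functions; the guard itself is an assumption about how one might organise the script.

```r
# Install dplyr only if it is not already available on this machine
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}

# Attach the package so that filter(), arrange() and others can be called directly
library(dplyr)

# Report which version was loaded
print(packageVersion("dplyr"))
```

Keeping the guard at the top of the script makes it usable on a machine where the package has not been installed yet.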

Show slide

Functions in dplyr package

Now we will learn about some key functions in the dplyr package:
  • filter - to select cases based on their values.
  • arrange - to reorder the cases.
  • select - to select variables based on their names.
  • mutate - to add new variables that are functions of existing variables.
Show slide

Functions in dplyr package

  • summarise - to condense multiple values to a single value.

All these functions can be combined with the group underscore by function, which allows us to perform any operation by group.
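The functions that are not demonstrated later in this tutorial can be sketched as follows. The data frame demo and its runtime column are hypothetical stand-ins, not the tutorial's moviesData.csv.

```r
library(dplyr)

# Hypothetical stand-in for the movies data frame (not the tutorial's data)
demo <- data.frame(
  title = c("A", "B", "C", "D"),
  genre = c("Comedy", "Drama", "Comedy", "Drama"),
  imdb_rating = c(7.8, 6.4, 5.9, 8.1),
  runtime = c(95, 120, 88, 140),
  stringsAsFactors = FALSE
)

# select: keep only the named variables
ratings <- select(demo, title, imdb_rating)

# mutate: add a new variable computed from an existing one
with_hours <- mutate(demo, runtime_hours = runtime / 60)

# summarise: condense a column to a single value
overall <- summarise(demo, mean_rating = mean(imdb_rating))

# group_by + summarise: one summary row per genre
by_genre <- summarise(group_by(demo, genre), mean_rating = mean(imdb_rating))

print(by_genre)
```

Here by_genre has one row for each genre, which is what performing an operation by group means in practice.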

Let us switch to RStudio.
Highlight movies in the Source window In the Source window, click on movies.
Highlight the scroll bar in the Source window In the Source window, scroll from left to right.

This will enable us to see the remaining variables of the movies data frame.

Highlight genre in the Source window Suppose we want to filter the movies having genre as Comedy.

For this, we will use the filter function.

Highlight the script myVis.R in the Source window Click on the script myVis.R
[RStudio]

moviesComedy <- filter(movies,

genre == "Comedy")

In the Source window, type the following command.
Highlight filter in the Source window Recall that the filter function in the dplyr package allows us to select cases based on their values.
Highlight movies after filter in the Source window Inside the filter function, the first argument is the name of the data frame which is movies.
Highlight genre == "Comedy" in the Source window The second argument is the condition by which we want to filter the movies data frame.
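To see what the second argument does, here is a small sketch on a hypothetical data frame called demo (an assumption for illustration, not the tutorial's movies data). The condition is evaluated for every row, and filter keeps the rows where it is TRUE.

```r
library(dplyr)

# Hypothetical stand-in for the movies data frame
demo <- data.frame(
  title = c("A", "B", "C", "D"),
  genre = c("Comedy", "Drama", "Comedy", "Horror"),
  stringsAsFactors = FALSE
)

# The condition on its own is a logical vector, one value per row
keep <- demo$genre == "Comedy"
print(keep)

# filter() returns the rows for which the condition is TRUE
demoComedy <- filter(demo, genre == "Comedy")
print(demoComedy)
```

Inside filter, column names such as genre can be written without the demo$ prefix.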
Highlight the Run button in the Source window Save the script and run the current line.
Highlight moviesComedy in the Environment window The resulting data frame is stored in an object called moviesComedy, which appears in the Environment window.

Let us view the data frame moviesComedy to check whether it contains movies with genre as Comedy.

[RStudio]

View(moviesComedy)

In the Source window, type the following command.
Highlight the Run button in the Source window Run the current line.
Highlight moviesComedy in the Source window moviesComedy data frame opens in the Source window.
Highlight genre in the Source window All the movies having genre as Comedy have been filtered.
Highlight moviesComedy in the Source window Let us close this data frame moviesComedy for now.
Highlight filter in the Source window We can also use logical operators to combine two or more conditions.
Highlight movies in the Source window In the Source window, click on movies.
Highlight genre in the Source window Suppose we want to filter the movies with genre as either Comedy or Drama.
Highlight the script myVis.R in the Source window Click on the script myVis.R
[RStudio]

moviesComDr <- filter(movies,

genre == "Comedy" | genre == "Drama")

View(moviesComDr)

In the Source window, type the following commands.
Highlight filter in the Source window Here, we have two conditions by which we would like to filter the movies data frame.
Highlight | in the Source window For this, we have used a logical OR operator.
Highlight the Run button in the Source window Run the last two lines of code.
Highlight moviesComDr in the Source window moviesComDr opens in the Source window.

The movies having genre as either Comedy or Drama have been filtered.
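The OR filter can be tried on a small hypothetical data frame (demo is an assumption for illustration, not the tutorial's data). A row is kept when at least one of the conditions is TRUE.

```r
library(dplyr)

# Hypothetical stand-in for the movies data frame
demo <- data.frame(
  title = c("A", "B", "C", "D"),
  genre = c("Comedy", "Drama", "Horror", "Drama"),
  stringsAsFactors = FALSE
)

# Keep rows whose genre is Comedy OR Drama; the Horror row is dropped
demoComDr <- filter(demo, genre == "Comedy" | genre == "Drama")
print(demoComDr)
```

Note that each side of the | operator is a complete comparison with its own genre == part.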

Highlight moviesComDr in the Source window Let us close this data frame moviesComDr for now.
Highlight moviesComDr <- filter(movies, genre == "Comedy" | genre == "Drama") in the Source window This filter call can also be written using the value-matching operator %in%.
[RStudio]

moviesComDrP <- filter(movies,

genre %in% c("Comedy", "Drama"))

View(moviesComDrP)

In the Source window, type the following commands.
Highlight %in% in the Source window %in% is used for value matching.
[RStudio]

help('%in%')

To know more about this operator, let us access the Help.

In the Console window, type the following command and press Enter.

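Highlight Help window The match function returns a vector of the positions of the first matches of its first argument in its second. The %in% operator returns a logical vector indicating whether each element has a match.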
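The difference between match and %in% can be seen with two short vectors. This sketch uses only base R; the vectors genres and wanted are hypothetical examples.

```r
# Hypothetical vectors for illustration
genres <- c("Drama", "Horror", "Comedy", "Drama")
wanted <- c("Comedy", "Drama")

# match() gives the position of each element of genres within wanted,
# and NA where there is no match
positions <- match(genres, wanted)
print(positions)   # 2 NA 1 2

# %in% gives TRUE or FALSE for each element of genres
found <- genres %in% wanted
print(found)       # TRUE FALSE TRUE TRUE
```

Because %in% returns one logical value per row, it can be used directly as a condition inside filter.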
Highlight the Run button in the Source window Run the last two lines of code.
Highlight moviesComDrP in the Source window moviesComDrP opens in the Source window.

The movies having genre as either Comedy or Drama have been filtered.

Highlight moviesComDrP in the Source window Let us close this data frame moviesComDrP for now.
Highlight movies in the Source window In the Source window, click on movies.
Highlight the scroll bar in the Source window In the Source window, scroll from left to right.
Highlight genre and imdb_rating in the Source window Let us now filter movies with genre as Comedy and imdb underscore rating greater than or equal to 7 point 5.
Highlight the script myVis.R in the Source window Click on the script myVis.R
[RStudio]

moviesComIm <- filter(movies,

genre == "Comedy" & imdb_rating >= 7.5)

View(moviesComIm)

In the Source window, type the following commands.
Highlight genre == "Comedy" & imdb_rating >= 7.5 in the Source window Here, we have used a logical AND operator to include both conditions.
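As a sketch on a hypothetical data frame (demo is an assumption, not the tutorial's data), the AND filter keeps only the rows where both conditions are TRUE. In dplyr, conditions passed to filter as separate arguments are also combined with AND, so the two calls below select the same rows.

```r
library(dplyr)

# Hypothetical stand-in for the movies data frame
demo <- data.frame(
  title = c("A", "B", "C", "D"),
  genre = c("Comedy", "Drama", "Comedy", "Comedy"),
  imdb_rating = c(7.8, 8.2, 5.9, 7.5),
  stringsAsFactors = FALSE
)

# Both conditions joined with the logical AND operator
withAmp <- filter(demo, genre == "Comedy" & imdb_rating >= 7.5)

# The same conditions given as separate arguments
withComma <- filter(demo, genre == "Comedy", imdb_rating >= 7.5)

print(withAmp)
```

The row with rating exactly 7.5 is kept because the comparison is greater than or equal to.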
Highlight the Run button in the Source window Save the script and run the last two lines of code.
Highlight moviesComIm in the Source window moviesComIm opens in the Source window.

I will resize the Console window.

There are seven movies with genre as Comedy and imdb underscore rating greater than or equal to 7 point 5.

Highlight moviesComIm in the Source window Let us close this data frame moviesComIm for now.
Highlight movies in the Source window In the Source window, click on movies.
Highlight imdb_rating in the Source window Suppose we want to arrange the movies in ascending order of imdb underscore rating.

For this, we will use the arrange function.

Highlight the script myVis.R in the Source window Click on the script myVis.R
[RStudio]

moviesImA <- arrange(movies, imdb_rating)

View(moviesImA)

In the Source window, type the following commands.
Highlight the Run button in the Source window Run the last two lines of code.
Highlight moviesImA in the Source window moviesImA opens in the Source window.
Highlight imdb_rating in the Source window In the Source window, scroll from left to right and locate the imdb underscore rating column.

The movies have been arranged in ascending order of imdb underscore rating.

Highlight imdb_rating in the Source window Now, let us say we want to arrange the movies in descending order of imdb rating.

For this, we use the desc function.

Highlight moviesImA in the Source window Let us close this data frame moviesImA for now.
[RStudio]

moviesImD <- arrange(movies, desc(imdb_rating))

View(moviesImD)

In the Source window, type the following commands.
Highlight the Run button in the Source window Run the last two lines of code.
Highlight moviesImD in the Source window moviesImD opens in the Source window.
Highlight imdb_rating in the Source window In the Source window, scroll from left to right and locate the imdb underscore rating column.

The movies have been arranged in descending order of imdb rating.
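Ascending and descending order can be compared side by side on a small hypothetical data frame (demo is an assumption for illustration, not the tutorial's data).

```r
library(dplyr)

# Hypothetical stand-in for the movies data frame
demo <- data.frame(
  title = c("A", "B", "C", "D"),
  imdb_rating = c(7.8, 6.4, 5.9, 8.1),
  stringsAsFactors = FALSE
)

# arrange() sorts in ascending order by default
demoAsc <- arrange(demo, imdb_rating)
print(demoAsc$title)    # "C" "B" "A" "D"

# Wrapping the column in desc() reverses the order
demoDesc <- arrange(demo, desc(imdb_rating))
print(demoDesc$title)   # "D" "A" "B" "C"
```

In both cases arrange returns a new data frame and leaves demo unchanged.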

Highlight moviesImD in the Source window Let us close this data frame moviesImD for now.
Highlight movies in the Source window In the Source window, click on movies.
Highlight genre and imdb_rating in the Source window Suppose we want to arrange the movies both by genre and imdb rating.
Highlight the script myVis.R in the Source window Click on the script myVis.R
[RStudio]

moviesGeIm <- arrange(movies, genre, imdb_rating)

View(moviesGeIm)

In the Source window, type the following commands.
Highlight the Run button in the Source window Run the last two lines of code.
Highlight moviesGeIm in the Source window moviesGeIm opens in the Source window.
Highlight the scroll bar in the Source window In the Source window, scroll from left to right.

The movies have been arranged first by genre and then, within each genre, by imdb underscore rating.
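A small sketch makes the ordering by two variables easier to see. The data frame demo is hypothetical, and the second call, which mixes ascending and descending order, is an addition not shown in the tutorial.

```r
library(dplyr)

# Hypothetical stand-in for the movies data frame
demo <- data.frame(
  title = c("A", "B", "C", "D"),
  genre = c("Drama", "Comedy", "Drama", "Comedy"),
  imdb_rating = c(6.4, 7.8, 8.1, 5.9),
  stringsAsFactors = FALSE
)

# Sort by genre first; rows with the same genre are then sorted by rating
demoGeIm <- arrange(demo, genre, imdb_rating)
print(demoGeIm$title)       # "D" "B" "A" "C"

# desc() can be applied to one variable only: genre ascending, rating descending
demoGeImDesc <- arrange(demo, genre, desc(imdb_rating))
print(demoGeImDesc$title)   # "B" "D" "C" "A"
```

The order of the arguments matters: the variable listed first is the primary sort key.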

Let us summarize what we have learnt.

Show slide

Summary

In this tutorial, we have learnt about:
  • Data manipulation
  • dplyr package
  • How to use filter and arrange functions
Show slide

Assignment

We now suggest an assignment.
  • Consider the built-in data set mtcars. Find the cars with hp greater than 100 and cyl equal to 3.
  • Arrange the mtcars data set based on the mpg variable.
Show slide

About the Spoken Tutorial Project

The video at the following link summarises the Spoken Tutorial project.

Please download and watch it.

Show slide

Spoken Tutorial Workshops

We conduct workshops using Spoken Tutorials and give certificates.

Please contact us.

Show Slide

Forum to answer questions

Please post your timed queries in this forum.
Show Slide

Forum to answer questions

Please post your general queries in this forum.
Show Slide

Textbook Companion

The FOSSEE team coordinates the TBC project.

For more details, please visit these sites.

Show Slide

Acknowledgment

The Spoken Tutorial project is funded by NMEICT, MHRD, Govt. of India.
Show Slide

Thank You

The script for this tutorial was contributed by Varshit Dubey (CoE Pune).

This is Sudhakar Kumar from IIT Bombay signing off. Thanks for watching.

Contributors and Content Editors

Madhurig, Nancyvarkey, Sudhakarst