Latest revision as of 11:21, 28 August 2019
Title of the script: Data Manipulation using dplyr package
Author: Varshit Dubey (CoE Pune) and Sudhakar Kumar (IIT Bombay)
Keywords: R, RStudio, data manipulation, dplyr, filter, video tutorial
Visual Cue | Narration |
Show slide
Opening Slide |
Welcome to this tutorial on Data manipulation using dplyr package. |
Show slide
Learning Objective |
In this tutorial, we will learn about,
|
Show slide
Pre-requisites |
To understand this tutorial, you should know,
* Basics of Statistics
* Basics of ggplot2 package
* Data frames
If not, please locate the relevant tutorials on R on this website. |
Show slide
System Specifications |
This tutorial is recorded on,
* Ubuntu Linux OS version 16.04
* R version 3.4.4
Install R version 3.2.0 or higher. |
Show slide
Download Files |
For this tutorial, we will use
* A data frame moviesData.csv
* A script file myVis.R.
Please download these files from the Code files link of this tutorial. |
[Computer screen]
Highlight moviesData.csv and myVis.R in the folder DataVis |
I have downloaded and moved these files to DataVis folder.
This folder is located in myProject folder on my Desktop. I have also set DataVis folder as my Working Directory. |
Show slide
Need for Data Manipulation |
In real life, it is rare that we get the data in exactly the right form we need. |
Show slide
Need for Data Manipulation |
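Often we’ll need to
* create some new variables or summaries
* rename the variables
* reorder the observations in order to make the data a little easier to work with. |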
About dplyr package | We will learn how to achieve all this by using dplyr package. |
Show slide
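About dplyr Package |
* dplyr is a package for data manipulation, written and maintained by Hadley Wickham.
* It comprises many functions that perform the most commonly used data manipulation operations. |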
Let us switch to RStudio. | |
Highlight myVis.R in the Files window of RStudio | Open the script myVis.R in RStudio. |
Highlight the Source button | Let us run this script by clicking on the Source button. |
Highlight movies in the Source window | movies data frame opens in the Source window.
This data frame will be used later in this tutorial. |
Cursor on the interface. | Now, we will install dplyr package. Please make sure that you are connected to the Internet. |
[RStudio]
install.packages("dplyr") |
In the Console window, type the following command and press Enter. |
Highlight the red dot in the Console window | The installation of the package takes a few seconds.
We will wait while the package is being installed. |
Click at the top of the script myVis.R | To load this package, we will add a library call at the top of the script. |
Highlight the script myVis.R in the Source window | Click on the script myVis.R |
[RStudio]
library(dplyr)
Press Ctrl+Enter keys. |
At the top of the script, type library and dplyr in parentheses.
Save the script and run this line by pressing Ctrl + Enter keys simultaneously. |
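The install and load steps above can be sketched as follows. The `requireNamespace` guard is an addition that is not part of the original script; it is assumed here only to avoid reinstalling the package on every run, and it presumes a CRAN mirror is reachable when installation is needed.

```r
# Install dplyr only if it is not already available (guard is an assumption,
# the tutorial itself simply runs install.packages("dplyr") once).
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}

# Attach the package so that filter(), arrange() and friends can be called directly.
library(dplyr)
```

Once attached, `filter` refers to the dplyr version rather than `stats::filter`, which is why the `library` call goes at the top of the script.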
Show slide
Functions in dplyr package |
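Now we will learn about some key functions in the dplyr package:
* filter - to select cases based on their values.
* arrange - to reorder the cases.
* select - to select variables based on their names.
* mutate - to add new variables that are functions of existing variables. |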
Show slide
Functions in dplyr package |
* summarise - to condense multiple values to a single value.
All these functions can be combined with the group underscore by function. It allows us to perform any operation by group. |
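As a rough illustration of the functions listed on these slides, the sketch below applies each of them once. The small `movies` data frame built here is a hypothetical stand-in, not the real moviesData.csv, and the column `rating_pct` is invented for the example.

```r
library(dplyr)

# Hypothetical stand-in for the tutorial's movies data frame.
movies <- data.frame(
  title = c("M1", "M2", "M3", "M4", "M5", "M6"),
  genre = c("Comedy", "Drama", "Comedy", "Horror", "Drama", "Comedy"),
  imdb_rating = c(7.8, 6.5, 5.9, 7.1, 8.2, 7.5),
  stringsAsFactors = FALSE
)

comedies  <- filter(movies, genre == "Comedy")              # keep matching cases
by_rating <- arrange(movies, imdb_rating)                   # reorder the cases
two_cols  <- select(movies, title, imdb_rating)             # keep variables by name
with_pct  <- mutate(movies, rating_pct = imdb_rating * 10)  # add a derived variable
overall   <- summarise(movies, mean_rating = mean(imdb_rating))  # one value overall

# group_by() makes summarise() return one value per genre instead of one overall.
per_genre <- summarise(group_by(movies, genre), mean_rating = mean(imdb_rating))
```

Only `filter` and `arrange` are covered in the rest of this tutorial; the other calls are shown just to make the list concrete.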
Let us switch to RStudio. | |
Highlight movies in the Source window | In the Source window, click on movies. |
Highlight the scroll bar in the Source window | In the Source window, scroll from left to right.
This will enable us to see the remaining variables of the movies data frame. |
Highlight genre in the Source window | Suppose we want to filter the movies having genre as Comedy.
For this, we will use the filter function. |
Highlight the script myVis.R in the Source window | Click on the script myVis.R |
[RStudio]
moviesComedy <- filter(movies, genre == "Comedy") |
In the Source window, type the following command. |
Highlight filter in the Source window | Recall that the filter function in the dplyr package allows us to select cases based on their values. |
Highlight movies after filter in the Source window | Inside the filter function, the first argument is the name of the data frame which is movies. |
Highlight genre == "Comedy" in the Source window | The second argument is the condition by which we want to filter the movies data frame. |
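To make the role of the two arguments concrete, here is a minimal sketch on a hypothetical stand-in for `movies` (not the real moviesData.csv). The helper variable `keep` is introduced only for illustration; it shows that the second argument evaluates to one TRUE or FALSE per row.

```r
library(dplyr)

# Hypothetical stand-in for the tutorial's movies data frame.
movies <- data.frame(
  title = c("M1", "M2", "M3", "M4", "M5", "M6"),
  genre = c("Comedy", "Drama", "Comedy", "Horror", "Drama", "Comedy"),
  imdb_rating = c(7.8, 6.5, 5.9, 7.1, 8.2, 7.5),
  stringsAsFactors = FALSE
)

# The condition on its own is a logical vector with one entry per row.
keep <- movies$genre == "Comedy"
print(keep)

# filter() keeps exactly the rows for which the condition is TRUE.
moviesComedy <- filter(movies, genre == "Comedy")
print(moviesComedy)
```

Inside `filter`, column names such as `genre` can be written without the `movies$` prefix because the first argument already names the data frame.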
Highlight the Run button in the Source window | Save the script and run the current line. |
Highlight moviesComedy in the Environment window | The resulting data frame is stored in an object called moviesComedy, which appears in the Environment window.
Let us view the data frame moviesComedy to check whether it contains movies with genre as Comedy. |
[RStudio]
View(moviesComedy) |
In the Source window, type the following command. |
Highlight the Run button in the Source window | Run the current line. |
Highlight moviesComedy in the Source window | moviesComedy data frame opens in the Source window. |
Highlight genre in the Source window | All the movies having genre as Comedy have been filtered. |
Highlight moviesComedy in the Source window | Let us close this data frame moviesComedy for now. |
Highlight filter in the Source window | We can also use logical operators to combine two or more conditions. |
Highlight movies in the Source window | In the Source window, click on movies. |
Highlight genre in the Source window | Suppose we want to filter the movies with genre as either Comedy or Drama. |
Highlight the script myVis.R in the Source window | Click on the script myVis.R |
[RStudio]
moviesComDr <- filter(movies, genre == "Comedy" | genre == "Drama")
View(moviesComDr) |
In the Source window, type the following commands. |
Highlight filter in the Source window | Here, we have two values by which we would like to filter the movies data frame. |
Highlight | in the Source window | For this, we have used a logical OR operator. |
Highlight the Run button in the Source window | Run the last two lines of code. |
Highlight moviesComDr in the Source window | moviesComDr opens in the Source window.
The movies having genre as either Comedy or Drama have been filtered. |
Highlight moviesComDr in the Source window | Let us close this data frame moviesComDr for now. |
Highlight moviesComDr <- filter(movies, genre == "Comedy" | genre == "Drama") in the Source window | This filter function can also be written using the match operator. |
[RStudio]
moviesComDrP <- filter(movies, genre %in% c("Comedy", "Drama"))
View(moviesComDrP) |
In the Source window, type the following command. |
Highlight %in% in the Source window | %in% is used for value matching. |
[RStudio]
help('%in%') |
To know more about this operator, let us access the Help.
In the Console window, type the following command and press Enter. |
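The two ways of writing this filter can be compared side by side. The sketch below uses a hypothetical stand-in for `movies` (not the real moviesData.csv) and checks that the logical OR form and the `%in%` form select the same cases.

```r
library(dplyr)

# Hypothetical stand-in for the tutorial's movies data frame.
movies <- data.frame(
  title = c("M1", "M2", "M3", "M4", "M5", "M6"),
  genre = c("Comedy", "Drama", "Comedy", "Horror", "Drama", "Comedy"),
  imdb_rating = c(7.8, 6.5, 5.9, 7.1, 8.2, 7.5),
  stringsAsFactors = FALSE
)

# Logical OR: a row is kept when either comparison is TRUE.
moviesComDr <- filter(movies, genre == "Comedy" | genre == "Drama")

# Value matching: a row is kept when genre is one of the listed values.
moviesComDrP <- filter(movies, genre %in% c("Comedy", "Drama"))

# Both approaches keep the same titles in the same order.
print(identical(moviesComDr$title, moviesComDrP$title))
```

The `%in%` form tends to stay readable as the list of accepted values grows, since each extra value is one more element in the vector rather than another `|` clause.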
Highlight the Run button in the Source window | Run the last two lines of code. |
Highlight moviesComDrP in the Source window | moviesComDrP opens in the Source window.
The movies having genre as either Comedy or Drama have been filtered. |
Highlight moviesComDrP in the Source window | Let us close this data frame moviesComDrP for now. |
Highlight movies in the Source window | In the Source window, click on movies. |
Highlight the scroll bar in the Source window | In the Source window, scroll from left to right. |
Highlight genre and imdb_rating in the Source window | Let us now filter movies with genre as Comedy and imdb underscore rating greater than or equal to 7 point 5. |
Highlight the script myVis.R in the Source window | Click on the script myVis.R |
[RStudio]
moviesComIm <- filter(movies, genre == "Comedy" & imdb_rating >= 7.5)
View(moviesComIm) |
In the Source window, type the following command. |
Highlight genre == "Comedy" & imdb_rating >= 7.5 in the Source window | Here, we have used a logical AND operator to include both conditions. |
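A small sketch of the logical AND filter, again on a hypothetical stand-in for `movies` rather than the real data. The second call, which passes the two conditions as separate arguments, is not used in the tutorial; it is shown because dplyr combines comma-separated conditions with AND as well.

```r
library(dplyr)

# Hypothetical stand-in for the tutorial's movies data frame.
movies <- data.frame(
  title = c("M1", "M2", "M3", "M4", "M5", "M6"),
  genre = c("Comedy", "Drama", "Comedy", "Horror", "Drama", "Comedy"),
  imdb_rating = c(7.8, 6.5, 5.9, 7.1, 8.2, 7.5),
  stringsAsFactors = FALSE
)

# Both conditions must hold for a row to be kept.
moviesComIm <- filter(movies, genre == "Comedy" & imdb_rating >= 7.5)

# Alternative spelling (not in the tutorial): separate conditions with a comma.
moviesComIm2 <- filter(movies, genre == "Comedy", imdb_rating >= 7.5)

print(moviesComIm)
```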
Highlight the Run button in the Source window | Save the script and run the last two lines of code. |
Highlight moviesComIm in the Source window | moviesComIm opens in the Source window.
I will resize the Console window. There are seven movies with genre as Comedy and imdb underscore rating greater than or equal to 7 point 5. |
Highlight moviesComIm in the Source window | Let us close this data frame moviesComIm for now. |
Highlight movies in the Source window | In the Source window, click on movies. |
Highlight imdb_rating in the Source window | Suppose we want to arrange the movies in ascending order of imdb underscore rating.
For this, we will use the arrange function. |
Highlight the script myVis.R in the Source window | Click on the script myVis.R |
[RStudio]
moviesImA <- arrange(movies, imdb_rating)
View(moviesImA) |
In the Source window, type the following command. |
Highlight the Run button in the Source window | Run the last two lines of code. |
Highlight moviesImA in the Source window | moviesImA opens in the Source window. |
Highlight imdb_rating in the Source window | In the Source window, scroll from left to right and locate the imdb underscore rating column.
The movies have been arranged in ascending order of imdb underscore rating. |
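Instead of checking the order by eye in the viewer, the result can also be verified in code. This is a minimal sketch on a hypothetical stand-in for `movies`; the use of `is.unsorted` as a check is an addition for illustration.

```r
library(dplyr)

# Hypothetical stand-in for the tutorial's movies data frame.
movies <- data.frame(
  title = c("M1", "M2", "M3", "M4", "M5", "M6"),
  genre = c("Comedy", "Drama", "Comedy", "Horror", "Drama", "Comedy"),
  imdb_rating = c(7.8, 6.5, 5.9, 7.1, 8.2, 7.5),
  stringsAsFactors = FALSE
)

# arrange() sorts rows in ascending order of the given column by default.
moviesImA <- arrange(movies, imdb_rating)
print(moviesImA$imdb_rating)

# is.unsorted() returns FALSE when the vector is already in ascending order.
print(is.unsorted(moviesImA$imdb_rating))
```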
Highlight imdb_rating in the Source window | Now, let us say we want to arrange the movies in descending order of imdb underscore rating.
For this, we use the desc function. |
Highlight moviesImA in the Source window | Let us close this data frame moviesImA for now. |
[RStudio]
moviesImD <- arrange(movies, desc(imdb_rating))
View(moviesImD) |
In the Source window, type the following command. |
Highlight the Run button in the Source window | Run the last two lines of code. |
Highlight moviesImD in the Source window | moviesImD opens in the Source window. |
Highlight imdb_rating in the Source window | In the Source window, scroll from left to right and locate the imdb underscore rating column.
The movies have been arranged in descending order of imdb underscore rating. |
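The descending sort can be sketched in the same way, using a hypothetical stand-in for `movies`. Wrapping the column in `desc` reverses the sort direction for that column only.

```r
library(dplyr)

# Hypothetical stand-in for the tutorial's movies data frame.
movies <- data.frame(
  title = c("M1", "M2", "M3", "M4", "M5", "M6"),
  genre = c("Comedy", "Drama", "Comedy", "Horror", "Drama", "Comedy"),
  imdb_rating = c(7.8, 6.5, 5.9, 7.1, 8.2, 7.5),
  stringsAsFactors = FALSE
)

# desc() asks arrange() to sort this column from highest to lowest.
moviesImD <- arrange(movies, desc(imdb_rating))
print(moviesImD[, c("title", "imdb_rating")])
```

The highest-rated title now appears in the first row, which is a quick way to confirm the direction of the sort.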
Highlight moviesImD in the Source window | Let us close this data frame moviesImD for now. |
Highlight movies in the Source window | In the Source window, click on movies. |
Highlight genre and imdb_rating in the Source window | Suppose we want to arrange the movies both by genre and imdb underscore rating. |
Highlight the script myVis.R in the Source window | Click on the script myVis.R |
[RStudio]
moviesGeIm <- arrange(movies, genre, imdb_rating)
View(moviesGeIm) |
In the Source window, type the following commands. |
Highlight the Run button in the Source window | Run the last two lines of code. |
Highlight moviesGeIm in the Source window | moviesGeIm opens in the Source window. |
Highlight the scroll bar in the Source window | In the Source window, scroll from left to right.
Movies have been arranged both by genre and imdb underscore rating. |
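When several columns are given, `arrange` sorts by the first and uses the later ones to break ties. The sketch below shows this on a hypothetical stand-in for `movies`; the second call, which combines the grouping column with `desc`, goes slightly beyond what the tutorial runs and is included only as an illustration.

```r
library(dplyr)

# Hypothetical stand-in for the tutorial's movies data frame.
movies <- data.frame(
  title = c("M1", "M2", "M3", "M4", "M5", "M6"),
  genre = c("Comedy", "Drama", "Comedy", "Horror", "Drama", "Comedy"),
  imdb_rating = c(7.8, 6.5, 5.9, 7.1, 8.2, 7.5),
  stringsAsFactors = FALSE
)

# Sort by genre first; within each genre, sort by rating (lowest first).
moviesGeIm <- arrange(movies, genre, imdb_rating)
print(moviesGeIm)

# Variation (not in the tutorial): highest rating first within each genre.
moviesGeImD <- arrange(movies, genre, desc(imdb_rating))
print(moviesGeImD)
```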
Let us summarize what we have learnt. | |
Show slide
Summary |
In this tutorial, we have learnt about:
* Data manipulation
* dplyr package
* How to use filter and arrange functions. |
Show slide
Assignment |
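We now suggest an assignment.
* Consider the built-in data set mtcars. Find the cars with hp greater than 100 and cyl equal to 3.
* Arrange the mtcars data set based on mpg variable. |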
Show slide
About the Spoken Tutorial Project |
The video at the following link summarises the Spoken Tutorial project.
Please download and watch it. |
Show slide
Spoken Tutorial Workshops |
We conduct workshops using Spoken Tutorials and give certificates.
Please contact us. |
Show Slide
Forum to answer questions |
Please post your timed queries in this forum. |
Show Slide
Forum to answer questions |
Please post your general queries in this forum. |
Show Slide
Textbook Companion |
The FOSSEE team coordinates the TBC project.
For more details, please visit these sites. |
Show Slide
Acknowledgment |
The Spoken Tutorial project is funded by NMEICT, MHRD, Govt. of India |
Show Slide
Thank You |
The script for this tutorial was contributed by Varshit Dubey (CoE Pune).
This is Sudhakar Kumar from IIT Bombay signing off. Thanks for watching. |