R/C2/Data-Manipulation-using-dplyr-Package/English
Title of the script: Data Manipulation using dplyr package
Author: Varshit Dubey (CoE Pune) and Sudhakar Kumar (IIT Bombay)
Keywords: R, RStudio, data manipulation, dplyr, filter, video tutorial
Visual Cue | Narration |
Show slide
Opening Slide |
Welcome to this tutorial on Data Manipulation using the dplyr package. |
Show slide
Learning Objective |
In this tutorial, we will learn about,
|
Show slide Pre-requisites |
To understand this tutorial, you should know,
If not, please locate the relevant tutorials on R on this website. |
Show slide
System Specifications |
This tutorial is recorded on
Install R version 3.2.0 or higher. |
Show slide
Download Files |
For this tutorial, we will use
Please download these files from the Code files link of this tutorial. |
[Computer screen]
Highlight moviesData.csv and myVis.R in the folder DataVis |
I have downloaded and moved these files to the DataVis folder.
This folder is located in the myProject folder on my Desktop. I have also set the DataVis folder as my Working Directory. |
Show slide
Need for Data Manipulation |
In real life, it is rare that we get the data in exactly the right form we need. |
Show slide
Need for Data Manipulation |
Often we’ll need to
|
We will learn how to achieve all this by using the dplyr package. | |
Show slide
About dplyr Package |
* dplyr is a package for data manipulation, written and maintained by Hadley Wickham.
|
Let us switch to RStudio. | |
Highlight myVis.R in the Files window of RStudio | Open the script myVis.R in RStudio. |
Highlight the Source button | Let us run this script by clicking on the Source button. |
Highlight movies in the Source window | The movies data frame opens in the Source window.
|
Now, we will install the dplyr package. Please make sure that you are connected to the Internet. | |
[RStudio]
install.packages("dplyr") |
In the Console window, type the following command and press Enter. |
Highlight the red dot in the Console window | The installation of the package takes a few seconds.
We will wait while the package is being installed. |
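As a minimal sketch, the installation step can be guarded so that the package is downloaded only when it is missing. The helper name needsInstall is hypothetical and is not part of the tutorial's script.

```r
# Hypothetical helper: TRUE when the named package is not installed.
# requireNamespace() returns FALSE instead of raising an error
# when the package cannot be found.
needsInstall <- function(pkg) {
  !requireNamespace(pkg, quietly = TRUE)
}

# Download dplyr only if it is missing (needs an Internet connection).
if (needsInstall("dplyr")) {
  install.packages("dplyr")
}
```

With this guard, running the script again does not reinstall a package that is already available.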
Click at the top of the script myVis.R | To load this package, we will add a library call at the top of the script. |
Highlight the script myVis.R in the Source window | Click on the script myVis.R |
[RStudio]
library(dplyr)
Press Ctrl+Enter keys. |
At the top of the script, type library and dplyr in parentheses.
Save the script and run this line by pressing Ctrl+Enter keys simultaneously. |
Show slide
Functions in dplyr package |
Now we will learn about some key functions in the dplyr package:
|
Show slide
Functions in dplyr package |
* summarise - to condense multiple values to a single value.
All these functions can be combined with the group underscore by function, which allows us to perform any operation by group. |
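A minimal sketch of how summarise can be combined with group_by is given here. The data frame toyMovies is a hypothetical stand-in for the movies data frame. Only the column names genre and imdb_rating are taken from this tutorial.

```r
library(dplyr)

# Hypothetical toy data standing in for the movies data frame.
toyMovies <- data.frame(
  title = c("A", "B", "C", "D"),
  genre = c("Comedy", "Drama", "Comedy", "Drama"),
  imdb_rating = c(6.0, 8.0, 7.0, 9.0),
  stringsAsFactors = FALSE
)

# Group the rows by genre, then condense the ratings
# of each group to a single value, their mean.
byGenre <- group_by(toyMovies, genre)
meanByGenre <- summarise(byGenre, mean_rating = mean(imdb_rating))

print(meanByGenre)  # one row per genre: Comedy 6.5, Drama 8.5
```

Without group_by, summarise would return a single row for the whole data frame.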
Let us switch to RStudio. | |
Highlight movies in the Source window | In the Source window, click on movies. |
Highlight the scroll bar in the Source window | In the Source window, scroll from left to right.
|
Highlight genre in the Source window | Suppose we want to filter the movies having genre as Comedy.
For this, we will use the filter function. |
Highlight the script myVis.R in the Source window | Click on the script myVis.R |
[RStudio]
moviesComedy <- filter(movies, genre == "Comedy") |
In the Source window, type the following command. |
Highlight filter in the Source window | Recall that the filter function in the dplyr package allows us to select cases based on their values. |
Highlight movies after filter in the Source window | Inside the filter function, the first argument is the name of the data frame which is movies. |
Highlight genre == "Comedy" in the Source window | The second argument is the condition by which we want to filter the movies data frame. |
Highlight the Run button in the Source window | Save the script and run the current line. |
Highlight moviesComedy in the Environment window | The resulting data frame is stored in an object called moviesComedy in the Environment window.
|
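To see what the second argument of filter does, here is a minimal sketch on a hypothetical toy data frame named toyMovies. It assumes only that the data has a genre column, as the movies data frame does.

```r
library(dplyr)

# Hypothetical toy data standing in for the movies data frame.
toyMovies <- data.frame(
  title = c("A", "B", "C"),
  genre = c("Comedy", "Drama", "Comedy"),
  imdb_rating = c(6.1, 7.9, 8.2),
  stringsAsFactors = FALSE
)

# The condition is evaluated for every row and gives TRUE or FALSE.
isComedy <- toyMovies$genre == "Comedy"
print(isComedy)  # TRUE FALSE TRUE

# filter() keeps exactly the rows for which the condition is TRUE.
toyComedy <- filter(toyMovies, genre == "Comedy")
print(toyComedy)
```

Inside filter, the column can be written as genre directly, without the data frame name and the dollar sign.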
[RStudio]
View(moviesComedy) |
In the Source window, type the following command. |
Highlight the Run button in the Source window | Run the current line. |
Highlight moviesComedy in the Source window | moviesComedy data frame opens in the Source window. |
Highlight genre in the Source window | All the movies having genre as Comedy have been filtered. |
Highlight moviesComedy in the Source window | Let us close this data frame moviesComedy for now. |
Highlight filter in the Source window | We can also use logical operators to combine two or more conditions. |
Highlight movies in the Source window | In the Source window, click on movies. |
Highlight genre in the Source window | Suppose we want to filter the movies with genre as either Comedy or Drama. |
Highlight the script myVis.R in the Source window | Click on the script myVis.R |
[RStudio]
moviesComDr <- filter(movies, genre == "Comedy" | genre == "Drama")
View(moviesComDr) |
In the Source window, type the following commands. |
Highlight filter in the Source window | Here, we have two values by which we would like to filter the movies data frame. |
Highlight | in the Source window | For this, we have used a logical OR operator. |
Highlight the Run button in the Source window | Run the last two lines of code. |
Highlight moviesComDr in the Source window | moviesComDr opens in the Source window.
|
Highlight moviesComDr in the Source window | Let us close this data frame moviesComDr for now. |
Highlight moviesComDr <- filter(movies, genre == "Comedy" | genre == "Drama") in the Source window | This filter function can also be written using the match operator. |
[RStudio]
moviesComDrP <- filter(movies, genre %in% c("Comedy", "Drama"))
View(moviesComDrP) |
In the Source window, type the following commands. |
Highlight %in% in the Source window | %in% is used for value matching. |
[RStudio]
help('%in%') |
To know more about this operator, let us access the Help.
In the Console window, type the following command and press Enter. |
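Highlight Help window | The match function returns a vector of the positions of (first) matches of its first argument in its second. The %in% operator returns a logical vector that indicates whether there is a match for its left operand. |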
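The difference between match and %in% can be sketched as follows. The vectors genres and wanted and the data frame toyMovies are hypothetical and are used only for illustration.

```r
library(dplyr)

# Hypothetical values, for illustration only.
genres <- c("Comedy", "Drama", "Horror", "Drama")
wanted <- c("Comedy", "Drama")

# match() gives the position of each element of genres in wanted,
# or NA when there is no match.
matchPositions <- match(genres, wanted)
print(matchPositions)  # 1 2 NA 2

# %in% gives TRUE or FALSE for each element of genres.
isWanted <- genres %in% wanted
print(isWanted)  # TRUE TRUE FALSE TRUE

# Both ways of writing the filter keep the same rows.
toyMovies <- data.frame(
  title = c("A", "B", "C", "D"),
  genre = genres,
  stringsAsFactors = FALSE
)
withOr <- filter(toyMovies, genre == "Comedy" | genre == "Drama")
withIn <- filter(toyMovies, genre %in% wanted)
print(identical(withOr$title, withIn$title))  # TRUE
```

The %in% form stays short when more genres are added, because only the vector on the right side grows.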
Highlight the Run button in the Source window | Run the last two lines of code. |
Highlight moviesComDrP in the Source window | moviesComDrP opens in the Source window.
|
Highlight moviesComDrP in the Source window | Let us close this data frame moviesComDrP for now. |
Highlight movies in the Source window | In the Source window, click on movies. |
Highlight the scroll bar in the Source window | In the Source window, scroll from left to right. |
Highlight genre and imdb_rating in the Source window | Let us now filter movies with genre as Comedy and imdb underscore rating greater than or equal to 7 point 5. |
Highlight the script myVis.R in the Source window | Click on the script myVis.R |
[RStudio]
moviesComIm <- filter(movies, genre == "Comedy" & imdb_rating >= 7.5)
View(moviesComIm) |
In the Source window, type the following commands. |
Highlight genre == "Comedy" & imdb_rating >= 7.5 in the Source window | Here, we have used a logical AND operator to include both conditions. |
Highlight the Run button in the Source window | Save the script and run the last two lines of code. |
Highlight moviesComIm in the Source window | moviesComIm opens in the Source window.
I will resize the Console window. There are seven movies with genre as Comedy and imdb underscore rating greater than or equal to 7 point 5. |
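As a minimal sketch on hypothetical toy data, the AND condition can be written in two ways. The filter function in dplyr combines conditions separated by commas with AND, so both forms keep the same rows.

```r
library(dplyr)

# Hypothetical toy data standing in for the movies data frame.
toyMovies <- data.frame(
  title = c("A", "B", "C", "D"),
  genre = c("Comedy", "Comedy", "Drama", "Comedy"),
  imdb_rating = c(7.5, 6.9, 8.1, 8.0),
  stringsAsFactors = FALSE
)

# Both conditions must be TRUE for a row to be kept.
withAmp <- filter(toyMovies, genre == "Comedy" & imdb_rating >= 7.5)

# Conditions passed as separate arguments are also combined with AND.
withComma <- filter(toyMovies, genre == "Comedy", imdb_rating >= 7.5)

print(withAmp$title)    # "A" "D"
print(withComma$title)  # "A" "D"
```

Movie A is kept because its rating is exactly 7.5 and the condition uses greater than or equal to.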
Highlight moviesComIm in the Source window | Let us close this data frame moviesComIm for now. |
Highlight movies in the Source window | In the Source window, click on movies. |
Highlight imdb_rating in the Source window | Suppose we want to arrange the movies in ascending order of imdb underscore rating.
For this, we will use the arrange function. |
Highlight the script myVis.R in the Source window | Click on the script myVis.R |
[RStudio]
moviesImA <- arrange(movies, imdb_rating)
View(moviesImA) |
In the Source window, type the following commands. |
Highlight the Run button in the Source window | Run the last two lines of code. |
Highlight moviesImA in the Source window | moviesImA opens in the Source window. |
Highlight imdb_rating in the Source window | In the Source window, scroll from left to right and locate the imdb_rating column.
The movies have been arranged in ascending order of imdb_rating. |
Highlight imdb_rating in the Source window | Now, let us say we want to arrange the movies in descending order of imdb underscore rating.
For this, we use the desc function. |
Highlight moviesImA in the Source window | Let us close this data frame moviesImA for now. |
[RStudio]
moviesImD <- arrange(movies, desc(imdb_rating))
View(moviesImD) |
In the Source window, type the following commands. |
Highlight the Run button in the Source window | Run the last two lines of code. |
Highlight moviesImD in the Source window | moviesImD opens in the Source window. |
Highlight imdb_rating in the Source window | In the Source window, scroll from left to right and locate the imdb_rating column.
The movies have been arranged in descending order of imdb_rating. |
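The two orderings can be compared side by side in a minimal sketch. The data frame toyMovies is hypothetical. Only the column name imdb_rating is taken from the movies data frame.

```r
library(dplyr)

# Hypothetical toy data standing in for the movies data frame.
toyMovies <- data.frame(
  title = c("A", "B", "C"),
  imdb_rating = c(7.2, 5.4, 8.8),
  stringsAsFactors = FALSE
)

# arrange() sorts in ascending order by default.
ascending <- arrange(toyMovies, imdb_rating)
print(ascending$title)   # "B" "A" "C"

# Wrapping the column in desc() reverses the order.
descending <- arrange(toyMovies, desc(imdb_rating))
print(descending$title)  # "C" "A" "B"
```

Note that arrange returns a new data frame. The original toyMovies keeps its row order.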
Highlight moviesImD in the Source window | Let us close this data frame moviesImD for now. |
Highlight movies in the Source window | In the Source window, click on movies. |
Highlight genre and imdb_rating in the Source window | Suppose we want to arrange the movies both by genre and imdb_rating. |
Highlight the script myVis.R in the Source window | Click on the script myVis.R |
[RStudio]
moviesGeIm <- arrange(movies, genre, imdb_rating)
View(moviesGeIm) |
In the Source window, type the following commands. |
Highlight the Run button in the Source window | Run the last two lines of code. |
Highlight moviesGeIm in the Source window | moviesGeIm opens in the Source window. |
Highlight the scroll bar in the Source window | In the Source window, scroll from left to right.
The movies have been arranged first by genre and then, within each genre, by imdb_rating. |
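How arrange handles more than one column can be sketched on hypothetical toy data. The first column decides the order, and the second column is used only for rows that have the same value in the first. The second call is an additional illustration that is not part of the tutorial's script.

```r
library(dplyr)

# Hypothetical toy data standing in for the movies data frame.
toyMovies <- data.frame(
  title = c("A", "B", "C", "D"),
  genre = c("Drama", "Comedy", "Drama", "Comedy"),
  imdb_rating = c(8.0, 7.0, 6.0, 5.0),
  stringsAsFactors = FALSE
)

# Sort by genre first, then by rating within each genre.
byGenreRating <- arrange(toyMovies, genre, imdb_rating)
print(byGenreRating$title)      # "D" "B" "C" "A"

# Additional illustration: ascending genre, descending rating.
byGenreRatingDesc <- arrange(toyMovies, genre, desc(imdb_rating))
print(byGenreRatingDesc$title)  # "B" "D" "A" "C"
```

In both results all Comedy rows come before the Drama rows. Only the order inside each genre changes.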
Let us summarize what we have learnt. | |
Show slide Summary |
In this tutorial, we have learnt about:
|
Show slide
Assignment |
We now suggest an assignment.
|
Show slide
About the Spoken Tutorial Project |
The video at the following link summarises the Spoken Tutorial project.
Please download and watch it. |
Show slide
Spoken Tutorial Workshops |
We conduct workshops using Spoken Tutorials and give certificates.
Please contact us. |
Show Slide
Forum to answer questions |
Please post your timed queries in this forum. |
Show Slide
Forum to answer questions |
Please post your general queries in this forum. |
Show Slide
Textbook Companion |
The FOSSEE team coordinates the TBC project.
For more details, please visit these sites. |
Show Slide
Acknowledgment |
The Spoken Tutorial project is funded by NMEICT, MHRD, Govt. of India |
Show Slide
Thank You |
The script for this tutorial was contributed by Varshit Dubey (CoE Pune).
|