R/C2/Pipe-Operator/English
Revision as of 12:56, 23 September 2019
Title of the script: Pipe operator
Author: Varshit Dubey (CoE Pune) and Sudhakar Kumar (IIT Bombay)
Keywords: R, RStudio, dplyr package, ggplot2, summarise function, group_by function, pipe operator, boxplot, video tutorial
Visual Cue | Narration |
Show slide
Opening Slide |
Welcome to this tutorial on Pipe Operator. |
Show slide
Learning Objective |
In this tutorial, we will learn about:
|
Show slide Pre-requisites |
To understand this tutorial, you should know,
If not, please locate the relevant tutorials on R on this website. |
Show slide
System Specifications |
This tutorial is recorded on
Install R version 3.2.0 or higher. |
Show slide
Download Files |
For this tutorial, we will use the files moviesData.csv and myPipe.R.
Please download these files from the Code files link of this tutorial. |
[Computer screen]
Highlight moviesData.csv and myPipe.R in the folder pipeOps |
I have downloaded and moved these files to pipeOps folder.
This folder is located in myProject folder on my Desktop. I have also set pipeOps folder as my Working Directory. |
Show slide
summarise function |
Now we will learn about the summarise function.
|
Let us switch to RStudio. | |
Highlight myPipe.R in the Files window of RStudio | Open the script myPipe.R in RStudio. |
Highlight the Source button | Run this script by clicking on the Source button. |
Highlight movies in the Source window | movies data frame opens in the Source window. |
Highlight the scroll bar in the Source window | In the movies data frame, scroll from left to right.
This will enable us to see the remaining columns of the movies data frame. |
Highlight imdb_rating in the Source window | To know the mean of imdb_rating of all movies, we will use summarise function. |
Highlight the script myPipe.R in the Source window | Click on the script myPipe.R |
[RStudio]
summarise(movies, mean(imdb_rating)) |
In the Source window, type the following command. |
Highlight summarise in the Source window | Inside the summarise function, the first argument is a data frame to be summarised.
Here, it is movies. The second argument is the information we need, that is the mean of imdb_rating. |
Highlight the Run button in the Source window | Save the script and run the current line by pressing Ctrl+Enter keys simultaneously. |
Highlight output in the Console window
|
The mean value is shown.
One may argue that the mean can be found by using the mean function along with the dollar operator. Then what is the use of installing a whole package and using a more complex function? |
Highlight summarise in the Source window | Indeed, we do not normally use the summarise function for such simple computations.
This function becomes useful when we pair it with the group underscore by function. |
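The comparison between summarise and the mean function can be sketched as follows. This is a minimal illustration, assuming the dplyr package is installed. The data frame movies_demo and its values are made up for this example and are not taken from moviesData.csv.

```r
library(dplyr)

# Hypothetical stand-in for the movies data frame (values are made up)
movies_demo <- data.frame(
  genre = c("Drama", "Drama", "Comedy", "Comedy"),
  imdb_rating = c(7.0, 8.0, 5.0, 6.0)
)

# summarise() returns a one-row data frame holding the mean
overall <- summarise(movies_demo, mean_rating = mean(imdb_rating))

# mean() with the dollar operator returns a plain number
overall_base <- mean(movies_demo$imdb_rating)

print(overall)
print(overall_base)
```

Both approaches give the same value here. The difference is that summarise returns a data frame, which is what makes it convenient to combine with group_by.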
Show slide
group_by() function |
When we use group_by function, the data frame gets divided into different groups. |
Let us switch back to RStudio. | |
Highlight movies in the Source window | In the Source window, click on movies data frame. |
Highlight the scroll bar in the Source window | In the movies data frame, scroll from right to left. |
Highlight genre in the Source window | We will group the movies data frame based on the genre.
For this, we will use group underscore by function. |
Highlight the script myPipe.R in the Source window | Click on the script myPipe.R |
[RStudio]
groupMovies <- group_by(movies, genre) |
In the Source window, type the following command. |
Highlight Run button in the Source window | Run the current line. |
Highlight groupMovies in the Environment window | A new data frame groupMovies is stored.
Now, we will use summarise function on this data frame. |
[RStudio]
summarise(groupMovies, mean(imdb_rating)) |
In the Source window, type the following command. |
Highlight Run button in the Source window | Run the current line. |
I will resize the Console window | |
Highlight output in the Console window.
Point to the mean values in the console. |
The mean values of all movies in different genres are displayed.
Notice that the Documentary genre has the highest mean imdb_rating, and the Comedy genre has the lowest mean imdb_rating. |
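The effect of grouping before summarising can be tried on a small example. This is only a sketch, assuming the dplyr package is installed. The data frame movies_demo, the name groupDemo, and all the ratings are hypothetical and do not come from moviesData.csv.

```r
library(dplyr)

# Hypothetical stand-in for the movies data frame (values are made up)
movies_demo <- data.frame(
  genre = c("Drama", "Drama", "Comedy", "Comedy", "Documentary"),
  imdb_rating = c(7.0, 8.0, 5.0, 6.0, 8.5)
)

# Divide the data frame into one group per genre
groupDemo <- group_by(movies_demo, genre)

# summarise() now computes one mean per group instead of one overall mean
genreMeans <- summarise(groupDemo, mean_rating = mean(imdb_rating))

print(genreMeans)
```

The result has one row per genre, so the number of rows equals the number of distinct genres in the data.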
I will resize the Console window | |
Highlight movies in the Source window | In the Source window, click on movies data frame. |
Highlight the scroll bar in the Source window | In the movies data frame, scroll from left to right. |
Point to imdb_rating, genre and mpaa_rating in the movies data frame. | Let us find how the mean imdb_rating is distributed for the movies of the Drama genre.
For this, we will group the movies of the Drama genre by mpaa_rating. We will use the filter, group_by, and summarise functions one by one. |
Highlight the script myPipe.R in the Source window | Click on the script myPipe.R |
[RStudio]
dramaMov <- filter(movies, genre == "Drama")
gr_dramaMov <- group_by(dramaMov, mpaa_rating)
summarise(gr_dramaMov, mean(imdb_rating)) |
In the Source window, type the following commands. |
Highlight filter(movies, genre == "Drama") in the Source window | First, we will extract the movies of Drama genre. |
Highlight gr_dramaMov <- group_by(dramaMov, mpaa_rating) in the Source window | Then, we group these movies based on mpaa_rating. |
Highlight summarise(gr_dramaMov, mean(imdb_rating)) in the Source window | Finally, we apply summarise function.
This will calculate the mean of the filtered and grouped movies. |
Highlight Run button in the Source window | Run the last three lines of code. |
I will resize the Console window | |
Highlight output in the Console window | The required mean values are printed on the console. |
I will resize the Console window again. | |
Highlight the last three lines of code in the Source window | In this code, we have to give names to each and every intermediate data frame.
But there is an alternative way to write these statements, using the pipe operator. |
Show slide
Pipe operator |
|
Show slide
Example of pipe operator |
If we have to find the cosine of sine of pi, we can write
pi %>% sin() %>% cos() |
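The cosine of sine of pi example can be checked against the usual nested call. This is a small sketch, assuming the dplyr package is installed, since dplyr makes the pipe operator from the magrittr package available. The names nested and piped are chosen only for this illustration.

```r
library(dplyr)  # makes the %>% operator available

# Nested form: read from the inside out
nested <- cos(sin(pi))

# Piped form: read from left to right
piped <- pi %>% sin() %>% cos()

# Both forms perform the same computation
print(identical(nested, piped))
```

The pipe passes the value on its left as the first argument of the function on its right, so the two forms are the same calculation written in a different order.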
Let us switch to RStudio. | |
Highlight the last three lines of code in the Source window | We will learn how to do the same analysis by using the pipe operator. |
[RStudio]
movies %>% filter(genre=="Drama") %>% group_by(mpaa_rating) %>% summarise(mean(imdb_rating)) |
In the Source window, type the following command. |
Highlight movies %>% filter(genre=="Drama") %>%
group_by(mpaa_rating) %>% summarise(mean(imdb_rating)) in the Source window |
Here, the three lines of code have been written as a single chain of function calls.
We can read this code as,
|
Highlight movies %>% filter(genre=="Drama") %>%
group_by(mpaa_rating) %>% summarise(mean(imdb_rating)) in the Source window |
This code is easier to read and write than the previous one.
With the pipe operator, we don’t have to repeat the name of the data frame. Notice that we have written the name of the data frame only once. |
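That the two styles give the same answer can be verified on a small example. This is a sketch under stated assumptions: the dplyr package is installed, and the data frame movies_demo with its values is made up for illustration. The names dramaDemo, grDramaDemo, stepwise, and piped are also hypothetical.

```r
library(dplyr)

# Hypothetical stand-in for the movies data frame (values are made up)
movies_demo <- data.frame(
  genre = c("Drama", "Drama", "Drama", "Comedy"),
  mpaa_rating = c("R", "R", "PG", "PG"),
  imdb_rating = c(7.0, 8.0, 6.0, 5.0)
)

# Step-by-step version: every intermediate data frame needs a name
dramaDemo <- filter(movies_demo, genre == "Drama")
grDramaDemo <- group_by(dramaDemo, mpaa_rating)
stepwise <- summarise(grDramaDemo, mean_rating = mean(imdb_rating))

# Piped version: the data frame is named once and flows through each step
piped <- movies_demo %>%
  filter(genre == "Drama") %>%
  group_by(mpaa_rating) %>%
  summarise(mean_rating = mean(imdb_rating))

print(stepwise)
print(piped)
```

In the piped version, no intermediate objects are left behind in the Environment window, which keeps the workspace uncluttered.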
Highlight movies %>% filter(genre=="Drama") %>%
group_by(mpaa_rating) %>% summarise(mean(imdb_rating)) in the Source window |
Save the script and run the current line. |
I will resize the Console window. | |
Highlight output on the Console window | The required mean values are printed on the Console. |
I will resize the Console window again. | |
Highlight movies in the Source window | In the Source window, click on movies data frame. |
Highlight the scroll bar in the Source window | In the Source window, scroll from left to right. |
Highlight audience_score and critics_score in movies | Let us examine the difference between critics_score and audience_score for all the movies.
We will use a box plot for our study. By using the pipe operator, we can combine the functions of ggplot2 and dplyr packages. |
Highlight the script myPipe.R in the Source window | Click on the script myPipe.R |
[RStudio]
movies %>% mutate(diff = audience_score - critics_score) %>% ggplot(mapping = aes(x=genre, y=diff)) + geom_boxplot() |
In the Source window, type the following command. |
Highlight the Run button in the Source window | Save the script and run the current line. |
Highlight Plots window | The required box plot appears in the Plots window. |
Highlight Plots window | In the Plots window, click on the Zoom button to maximize the plot. |
Highlight Plots window, highlight drama, horror, and mystery & suspense | Here you can see that for Drama, Horror, and Mystery & Suspense movies, the median is close to zero.
This means that the opinions of the audience and the critics are very similar for these genres. |
Highlight Plots window, highlight action & adventure and comedy | Whereas for Action & Adventure and Comedy movies, the median is not close to zero.
This means that the opinions of the audience and the critics differ for these genres. |
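The medians read off the box plot can also be computed directly. The following is a sketch, assuming the dplyr package is installed. It does not draw the plot, it only calculates the median of diff for each genre. The data frame movies_demo and its scores are made up and are not taken from moviesData.csv.

```r
library(dplyr)

# Hypothetical stand-in for the movies data frame (values are made up)
movies_demo <- data.frame(
  genre = c("Drama", "Drama", "Drama", "Comedy", "Comedy", "Comedy"),
  audience_score = c(80, 70, 60, 75, 65, 85),
  critics_score = c(78, 72, 60, 50, 45, 70)
)

# Add the diff column, then compute its median within each genre
diffMedians <- movies_demo %>%
  mutate(diff = audience_score - critics_score) %>%
  group_by(genre) %>%
  summarise(median_diff = median(diff))

print(diffMedians)
```

In this made-up data, the median for Drama is zero, while the median for Comedy is well above zero. A positive median means the audience rated those movies higher than the critics did.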
Highlight the close button in the Plot Zoom window | Close this plot. |
Highlight movies in the Source window | In the Source window, click on movies data frame. |
Highlight the scroll bar in the Source window | In the Source window, scroll from right to left. |
Highlight mpaa_rating and genre in movies | Let us check the number of movies for each combination of genre and mpaa_rating. |
Highlight the script myPipe.R in the Source window | Click on the script myPipe.R |
[RStudio]
movies %>% group_by(genre, mpaa_rating) %>% summarise(num = n()) |
In the Source window, type the following command. |
Highlight group_by in movies %>% group_by(genre, mpaa_rating) %>%
summarise(num = n()) in the Source window |
Notice that we have included both genre and mpaa_rating in group_by function.
So, the analysis will be done on the data divided by these 2 variables. |
Highlight summarise(num = n()) in movies %>% group_by(genre, mpaa_rating) %>%
summarise(num = n()) in the Source window |
Also, we have used num = n().
The function n returns the number of rows in the current group, that is, the number of movies for each combination of genre and mpaa_rating. |
Highlight the Run button in the Source window | Run the current line. |
Resize the Console window. | I will resize the Console window. |
Highlight the output in the Console window | From the output, we can see that there are 22 Action & Adventure movies with mpaa_rating as R. |
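Counting with two grouping variables can be tried on a small example. This is a sketch, assuming the dplyr package is installed. The data frame movies_demo is made up, so its counts have no relation to the counts in moviesData.csv.

```r
library(dplyr)

# Hypothetical stand-in for the movies data frame (values are made up)
movies_demo <- data.frame(
  genre = c("Drama", "Drama", "Drama", "Comedy", "Comedy"),
  mpaa_rating = c("R", "R", "PG", "PG", "PG")
)

# n() counts the rows in each genre and mpaa_rating combination.
# Recent dplyr versions may print an informational message about grouping.
counts <- movies_demo %>%
  group_by(genre, mpaa_rating) %>%
  summarise(num = n())

print(counts)
```

Only the combinations that actually occur in the data appear in the output. Here there is no row for Comedy movies with mpaa_rating as R, because no such movie exists in the made-up data.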
Let us summarize what we have learnt. | |
Show slide Summary |
In this tutorial, we have learnt about:
|
Show slide
Assignment |
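We now suggest an assignment.
* Use the built-in data set iris.
* Using the pipe operator, group the flowers by their species.
* Summarise the grouped data by the mean of Sepal.Length and Sepal.Width. |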
Show slide
About the Spoken Tutorial Project |
The video at the following link summarises the Spoken Tutorial project.
Please download and watch it. |
Show slide
Spoken Tutorial Workshops |
We conduct workshops using Spoken Tutorials and give certificates.
Please contact us. |
Show Slide
Forum to answer questions |
Please post your timed queries in this forum. |
Show Slide
Forum to answer questions |
Please post your general queries in this forum. |
Show Slide
Textbook Companion |
The FOSSEE team coordinates the TBC project.
For more details, please visit these sites. |
Show Slide
Acknowledgment |
The Spoken Tutorial project is funded by NMEICT, MHRD, Govt. of India |
Show Slide
Thank You |
The script for this tutorial was contributed by Varshit Dubey (CoE Pune).
This is Sudhakar Kumar from IIT Bombay signing off. Thanks for watching. |