Difference between revisions of "R/C2/Pipe-Operator/English"


Latest revision as of 17:08, 23 September 2019

Title of the script: Pipe operator

Author: Varshit Dubey (CoE Pune) and Sudhakar Kumar (IIT Bombay)

Keywords: R, RStudio, dplyr package, ggplot2, summarise function, group_by function, pipe operator, boxplot, video tutorial

Visual Cue

Narration
Show slide

Opening Slide

Welcome to this tutorial on Pipe Operator.
Show slide

Learning Objective

In this tutorial, we will learn about:
  • summarise and group_by functions
  • Operations in summarise function
  • Pipe operator

Show slide

Pre-requisites

https://spoken-tutorial.org/

To understand this tutorial, you should know:
  • Basics of statistics
  • Basics of ggplot2 and dplyr packages
  • Data frames

If not, please locate the relevant tutorials on R on this website.

Show slide

System Specifications

This tutorial is recorded on
  • Ubuntu Linux OS version 16.04
  • R version 3.4.4
  • RStudio version 1.1.463

Install R version 3.2.0 or higher.

Show slide

Download Files

For this tutorial, we will use
  • A data frame moviesData.csv
  • A script file myPipe.R.

Please download these files from the Code files link of this tutorial.

[Computer screen]

Highlight moviesData.csv and myPipe.R in the folder pipeOps

I have downloaded these files and moved them to the pipeOps folder.

This folder is located in the myProject folder on my Desktop.

I have also set the pipeOps folder as my working directory.

Show slide

summarise function

Now we will learn about the summarise function.
  • The summarise function reduces a data frame to a single row, or to one row per group when the data frame is grouped.
  • It gives summaries like mean, median, etc. of the variables available in the data frame.
  • We usually use summarise along with the group_by function.
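As a minimal sketch of the points above, the code below computes two summaries at once. The data frame toy_movies and the column names mean_rating and median_rating are hypothetical, made up for illustration. They stand in for moviesData.csv, which is not reproduced here.

```r
library(dplyr)

# Hypothetical toy data standing in for moviesData.csv
toy_movies <- data.frame(
  title = c("A", "B", "C", "D"),
  imdb_rating = c(6.0, 7.0, 8.0, 9.0)
)

# Ask for several summaries at once; the result is a data frame with one row
rating_summary <- summarise(
  toy_movies,
  mean_rating = mean(imdb_rating),
  median_rating = median(imdb_rating)
)

print(rating_summary)
```

Naming each summary, as done here, gives readable column names in the result.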
Let us switch to RStudio.
Highlight myPipe.R in the Files window of RStudio

Open the script myPipe.R in RStudio.
Highlight the Source button

Run this script by clicking on the Source button.
Highlight movies in the Source window

The movies data frame opens in the Source window.
Highlight the scroll bar in the Source window

In the movies data frame, scroll from left to right.

This will enable us to see the remaining variables of the movies data frame.

Highlight imdb_rating in the Source window

To know the mean of imdb_rating of all movies, we will use the summarise function.
Highlight the script myPipe.R in the Source window

Click on the script myPipe.R.
[RStudio]

summarise(movies, mean(imdb_rating))

In the Source window, type the following command.
Highlight summarise in the Source window

Inside the summarise function, the first argument is the data frame to be summarised.

Here, it is movies.

The second argument is the information we need, that is, the mean of imdb_rating.

Highlight the Run button in the Source window

Save the script and run the current line by pressing the Ctrl+Enter keys simultaneously.
Highlight output in the Console window


The mean value is shown.

One may argue that we can find the mean by using the mean function along with the dollar ($) operator.

What is the use of installing a whole package and using a complex function?
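To make this comparison concrete, here is a small sketch on a hypothetical data frame toy_movies (the values are invented for illustration). It shows that both approaches give the same number. The mean function returns a plain number, while summarise returns a one-row data frame.

```r
library(dplyr)

# Hypothetical toy data; the ratings are invented for illustration
toy_movies <- data.frame(imdb_rating = c(5.5, 6.5, 7.5, 8.5))

# Base R: the dollar operator extracts the column, mean() returns a number
base_mean <- mean(toy_movies$imdb_rating)

# dplyr: summarise() returns a data frame with one row
dplyr_mean <- summarise(toy_movies, mean_rating = mean(imdb_rating))

print(base_mean)
print(dplyr_mean)
```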

Highlight summarise in the Source window

Basically, we do not use the summarise function for computing such things.

This function becomes useful when we pair it with the group_by function.

Show slide

group_by() function

When we use the group_by function, the data frame gets divided into different groups.
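The following sketch illustrates this on a hypothetical data frame toy_movies (invented values, not the tutorial's data). Grouping does not change the rows. It only records how the rows are divided, and summarise then returns one row per group.

```r
library(dplyr)

# Hypothetical toy data; genres and ratings are invented for illustration
toy_movies <- data.frame(
  genre = c("Drama", "Comedy", "Drama", "Comedy", "Horror"),
  imdb_rating = c(7.0, 5.0, 8.0, 6.0, 6.5)
)

# group_by() keeps all rows and records the grouping variable
grouped <- group_by(toy_movies, genre)
print(group_vars(grouped))  # name of the grouping variable
print(n_groups(grouped))    # number of distinct groups

# summarise() on grouped data returns one row per group
genre_means <- summarise(grouped, mean_rating = mean(imdb_rating))
print(genre_means)
```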
Let us switch back to RStudio.
Highlight movies in the Source window

In the Source window, click on the movies data frame.
Highlight the scroll bar in the Source window

In the movies data frame, scroll from right to left.
Highlight genre in the Source window

We will group the movies data frame based on the genre.

For this, we will use the group underscore by function.

Highlight the script myPipe.R in the Source window

Click on the script myPipe.R.
[RStudio]

groupMovies <- group_by(movies, genre)

In the Source window, type the following command.
Highlight Run button in the Source window

Run the current line.
Highlight groupMovies in the Environment window

A new data frame groupMovies is stored.

Now, we will use the summarise function on this data frame.

[RStudio]

summarise(groupMovies, mean(imdb_rating))

In the Source window, type the following command.
Highlight Run button in the Source window

Run the current line.
Drag the Console window to resize.

I will resize the Console window.
Highlight output in the Console window.

Point to the mean values in the console.

The mean imdb_rating values of the movies in each genre are displayed.

Notice that the Documentary genre has the highest mean imdb_rating.

And the Comedy genre has the lowest mean imdb_rating.

Drag the Console window to resize.

I will resize the Console window.
Highlight movies in the Source window

In the Source window, click on the movies data frame.
Highlight the scroll bar in the Source window

In the movies data frame, scroll from left to right.
Point to imdb_rating, genre and mpaa_rating in the movies data frame.

Let us find the mean imdb_rating for the movies of the Drama genre.

Also, we will group the movies of the Drama genre by mpaa_rating.

For this, we will use the filter, group_by, and summarise functions one by one.

Highlight the script myPipe.R in the Source window

Click on the script myPipe.R.
[RStudio]

dramaMov <- filter(movies, genre == "Drama")

gr_dramaMov <- group_by(dramaMov, mpaa_rating)

summarise(gr_dramaMov, mean(imdb_rating))

In the Source window, type the following commands.
Highlight filter(movies, genre == "Drama") in the Source window

First, we will extract the movies of the Drama genre.
Highlight gr_dramaMov <- group_by(dramaMov, mpaa_rating) in the Source window

Then, we group these movies based on mpaa_rating.
Highlight summarise(gr_dramaMov, mean(imdb_rating)) in the Source window

Finally, we apply the summarise function.

This will calculate the mean imdb_rating of the filtered and grouped movies.

Highlight Run button in the Source window

Run the last three lines of code.
Drag the Console window to resize.

I will resize the Console window.
Highlight output in the Console window

The required mean values are printed on the console.
Drag the Console window to resize.

I will resize the Console window again.
Highlight the last three lines of code in the Source window

In this code, we have to give a name to every intermediate data frame.

But there is an alternate method to write these statements using the pipe operator.

Show slide

Pipe operator

  • The pipe operator is denoted as %>%. It passes the result on its left as the first argument of the function on its right.
  • It saves us from creating unnecessary intermediate data frames.
  • We can read the pipe as a series of imperative statements.
Show slide

Example of pipe operator

If we have to find the cosine of the sine of pi, we can write

pi %>% sin() %>% cos()
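As a quick check of this example, the sketch below compares the piped form with the usual nested call. It loads dplyr only because dplyr makes the %>% operator available. The variable names nested, piped and rounded are chosen here for illustration.

```r
library(dplyr)  # dplyr makes the %>% operator available

# Nested form: read from the inside out
nested <- cos(sin(pi))

# Piped form: read from left to right
piped <- pi %>% sin() %>% cos()

print(identical(nested, piped))

# The left-hand value becomes the first argument of the next function,
# so this is the same as round(3.14159, 2)
rounded <- 3.14159 %>% round(2)
print(rounded)
```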

Let us switch to RStudio.
Highlight the last three lines of code in the Source window

We will learn how to do the same analysis by using the pipe operator.
[RStudio]

movies %>% filter(genre == "Drama") %>%
  group_by(mpaa_rating) %>%
  summarise(mean(imdb_rating))

In the Source window, type the following command.
Highlight movies %>% filter(genre=="Drama") %>%

group_by(mpaa_rating) %>%

summarise(mean(imdb_rating)) in the Source window

Here, the three lines of code have been written as a single series of statements.

We can read this code as:

  • Using the movies data frame, filter the movies of the Drama genre.
  • Next, group the filtered movies by mpaa_rating.
  • Finally, summarise the mean of imdb_rating of the grouped data.
Highlight movies %>% filter(genre=="Drama") %>%

group_by(mpaa_rating) %>%

summarise(mean(imdb_rating)) in the Source window

This code is easier to read and write than the previous one.

With the pipe operator, we don’t have to repeat the name of the data frame.

Notice that we have written the name of the data frame only once.
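The sketch below puts the two styles side by side on a hypothetical data frame toy_movies (invented values, not the tutorial's data). It then checks that both give the same result. The names drama, grouped_drama, step_result and pipe_result are chosen here for illustration.

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration
toy_movies <- data.frame(
  genre = c("Drama", "Drama", "Drama", "Comedy"),
  mpaa_rating = c("R", "PG", "R", "PG"),
  imdb_rating = c(7.0, 6.0, 8.0, 5.0)
)

# Style 1: every intermediate data frame gets a name
drama <- filter(toy_movies, genre == "Drama")
grouped_drama <- group_by(drama, mpaa_rating)
step_result <- summarise(grouped_drama, mean_rating = mean(imdb_rating))

# Style 2: the pipe passes each result on to the next function
pipe_result <- toy_movies %>%
  filter(genre == "Drama") %>%
  group_by(mpaa_rating) %>%
  summarise(mean_rating = mean(imdb_rating))

# Compare the two results as plain data frames
print(all.equal(as.data.frame(step_result), as.data.frame(pipe_result)))
```

In the piped version, toy_movies is written only once, and no intermediate data frames are left behind in the Environment window.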

Highlight movies %>% filter(genre=="Drama") %>%

group_by(mpaa_rating) %>%

summarise(mean(imdb_rating)) in the Source window

Save the script and run the current line.
Drag the Console window to resize.

I will resize the Console window.
Highlight output on the Console window

The required mean values are printed on the Console.
Drag the Console window to resize.

I will resize the Console window again.
Highlight movies in the Source window

In the Source window, click on the movies data frame.
Highlight the scroll bar in the Source window

In the Source window, scroll from left to right.
Highlight audience_score and critics_score in movies

Let us check the difference between audience_score and critics_score for all the movies.

We will use a box plot for our study.

By using the pipe operator, we can combine the functions of the ggplot2 and dplyr packages.

Highlight the script myPipe.R in the Source window

Click on the script myPipe.R.
[RStudio]

movies %>% mutate(diff = audience_score - critics_score) %>%
  ggplot(mapping = aes(x = genre, y = diff)) +
  geom_boxplot()

In the Source window, type the following command.
Highlight the Run button in the Source window

Save the script and run the current line.
Highlight Plots window

The required box plot appears in the Plots window.
Highlight Plots window

In the Plots window, click on the Zoom button to maximize the plot.
Highlight Plots window, highlight drama, horror, and mystery & suspense

Here you can see that for Drama, Horror, and Mystery & Suspense movies, the median of the difference is close to zero.

This means that the audience's and critics' opinions are very similar for these genres.

Highlight Plots window, highlight action & adventure and comedy

Whereas for Action & Adventure and Comedy movies, the median is not close to zero.

This means that the audience's and critics' opinions differ for these genres.
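The medians read off the box plot can also be computed as numbers. The sketch below does this on a hypothetical data frame toy_movies. The scores are invented for illustration and are not the values from moviesData.csv. The result name median_by_genre and the column name median_diff are also chosen here.

```r
library(dplyr)

# Hypothetical toy data; the scores are invented for illustration
toy_movies <- data.frame(
  genre = c("Drama", "Drama", "Drama", "Comedy", "Comedy", "Comedy"),
  audience_score = c(70, 80, 65, 75, 60, 85),
  critics_score = c(72, 78, 65, 50, 45, 60)
)

# Same mutate step as in the plot, followed by a median for each genre
median_by_genre <- toy_movies %>%
  mutate(diff = audience_score - critics_score) %>%
  group_by(genre) %>%
  summarise(median_diff = median(diff))

print(median_by_genre)
```

A median near zero means that audience and critics scores are similar for that genre in this toy data.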

Highlight the close button in the Plot Zoom window

Close this plot.
Highlight movies in the Source window

In the Source window, click on the movies data frame.
Highlight the scroll bar in the Source window

In the Source window, scroll from right to left.
Highlight mpaa_rating and genre in movies

Let us count the number of movies in every mpaa_rating category of each genre.
Highlight the script myPipe.R in the Source window

Click on the script myPipe.R.
[RStudio]

movies %>% group_by(genre, mpaa_rating) %>%
  summarise(num = n())

In the Source window, type the following command.
Highlight group_by in movies %>% group_by(genre, mpaa_rating) %>%

summarise(num = n()) in the Source window

Notice that we have included both genre and mpaa_rating in the group_by function.

So, the analysis will be done on the data divided by these two variables.

Highlight summarise(num = n()) in movies %>% group_by(genre, mpaa_rating) %>%

summarise(num = n()) in the Source window

Also, we have used num = n().

The function n() counts the number of rows in each group, that is, the number of movies for each combination of genre and mpaa_rating.
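Here is a small sketch of this counting step on a hypothetical data frame toy_movies (invented values). It also shows the count function from dplyr, which is not used in this tutorial but gives the same counts in a column named n.

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration
toy_movies <- data.frame(
  genre = c("Drama", "Drama", "Drama", "Comedy", "Comedy"),
  mpaa_rating = c("R", "R", "PG", "PG", "PG")
)

# n() counts the rows in each genre and mpaa_rating combination
counts <- toy_movies %>%
  group_by(genre, mpaa_rating) %>%
  summarise(num = n()) %>%
  ungroup()  # drop the grouping that remains after summarise

print(counts)

# count() is a shorthand for the same operation; its column is named n
shortcut <- toy_movies %>% count(genre, mpaa_rating)
print(shortcut)
```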

Highlight the Run button in the Source window

Run the current line.
Resize the Console window.

I will resize the Console window.
Highlight the output in the Console window

From the output, we can see that there are 22 Action & Adventure movies with mpaa_rating as R.
Let us summarize what we have learnt.
Show slide

Summary

In this tutorial, we have learnt about:
  • summarise and group_by functions
  • Operations in summarise function
  • Pipe operator
Show slide

Assignment

We now suggest an assignment.
  • Use the built-in data set iris.
  • Using the pipe operator, group the flowers by their species.
  • Summarise the grouped data by the mean of Sepal.Length and Sepal.Width.
Show slide

About the Spoken Tutorial Project

The video at the following link summarises the Spoken Tutorial project.

Please download and watch it.

Show slide

Spoken Tutorial Workshops

We conduct workshops using Spoken Tutorials and give certificates.

Please contact us.

Show Slide

Forum to answer questions

Please post your timed queries in this forum.
Show Slide

Forum to answer questions

Please post your general queries in this forum.
Show Slide

Textbook Companion

The FOSSEE team coordinates the TBC project.

For more details, please visit these sites.

Show Slide

Acknowledgment

The Spoken Tutorial project is funded by NMEICT, MHRD, Govt. of India
Show Slide

Thank You

The script for this tutorial was contributed by Varshit Dubey (CoE Pune).

This is Sudhakar Kumar from IIT Bombay signing off. Thanks for watching.

Contributors and Content Editors

Madhurig, Nancyvarkey, Sudhakarst