R/C2/Pipe-Operator/English-timed

From Script | Spoken-Tutorial
Revision as of 16:36, 22 May 2020 by Sakinashaikh (Talk | contribs)



Time Narration
00:01 Welcome to this tutorial on Pipe Operator.
00:06 In this tutorial, we will learn about:
00:10 summarise and group_by functions
00:14 Operations in summarise function
00:18 Pipe operator
00:20 To understand this tutorial, you should know:
00:25 Basics of statistics
00:28 Basics of ggplot2 and dplyr packages
00:34 Data frames
00:36 If not, please locate the relevant tutorials on R on this website.
00:43 This tutorial is recorded on
00:45 Ubuntu Linux OS version 16.04
00:51 R version 3.4.4
00:55 RStudio version 1.1.463
01:01 Install R version 3.2.0 or higher.
01:07 For this tutorial, we will use
01:10 A data frame moviesData.csv
01:15 A script file myPipe.R.
01:20 Please download these files from the Code files link of this tutorial.
01:27 I have downloaded and moved these files to pipeOps folder.
01:33 This folder is located in myProject folder on my Desktop.
01:40 I have also set pipeOps folder as my Working Directory.
01:46 Now we will learn about the summarise function.
01:50 The summarise function reduces a data frame into a single row.
01:56 It gives summaries like mean, median, etc. of the variables available in the data frame.
02:05 We use summarise along with the group_by function.
02:11 Let us switch to RStudio.
02:15 Open the script myPipe.R in RStudio.
02:21 Run this script by clicking on the Source button.
02:26 The movies data frame opens in the Source window.
02:31 In the movies data frame, scroll from left to right.
02:37 This will enable us to see the remaining variables of the movies data frame.
02:44 To know the mean of imdb_rating of all movies, we will use the summarise function.
02:52 Click on the script myPipe.R.
02:57 In the Source window, type the following command.
03:02 Inside the summarise function, the first argument is a data frame to be summarised.
03:09 Here, it is movies.
03:12 The second argument is the information we need, that is the mean of imdb_rating.
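The command typed on screen is not reproduced in this transcript. A minimal sketch of such a call is given here, assuming a small made-up data frame in place of moviesData.csv. Only the column names come from the narration; the values and the name overallMean are hypothetical.

```r
library(dplyr)

# Made-up stand-in for moviesData.csv; only the column names follow the tutorial.
movies <- data.frame(
  genre = c("Drama", "Drama", "Comedy", "Documentary"),
  imdb_rating = c(7.0, 6.0, 5.5, 8.0)
)

# First argument: the data frame to summarise.
# Second argument: the summary we want, here the mean of imdb_rating.
overallMean <- summarise(movies, mean(imdb_rating))
print(overallMean)
```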
03:21 Save the script and run the current line by pressing Ctrl+Enter keys simultaneously.
03:31 The mean value is shown.
03:34 One may argue that we can find the mean by using the mean function along with the dollar operator.
03:43 What is the use of installing a whole package and using a complex function?
03:49 Basically, we do not use the summarise function for computing such things.
03:56 This function is not very useful unless we pair it with the group_by function.
04:03 When we use group_by function, the data frame gets divided into different groups.
04:12 Let us switch back to RStudio.
04:16 In the Source window, click on movies data frame.
04:21 In the movies data frame, scroll from right to left.
04:27 We will group the movies data frame based on the genre.
04:33 For this, we will use the group_by function.
04:39 Click on the script myPipe.R.
04:43 In the Source window, type the following command.
04:48 Run the current line.
04:51 A new data frame groupMovies is stored.
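The grouping command itself is not shown in this transcript. A possible sketch, assuming a small made-up movies data frame (the values are hypothetical), could be:

```r
library(dplyr)

# Made-up stand-in for moviesData.csv; values are for illustration only.
movies <- data.frame(
  genre = c("Drama", "Drama", "Comedy", "Documentary"),
  imdb_rating = c(7.0, 6.0, 5.5, 8.0)
)

# Group the rows by genre. The rows themselves are unchanged;
# the result only records which group each row belongs to.
groupMovies <- group_by(movies, genre)

# Inspect the grouping variable and the number of groups.
print(group_vars(groupMovies))
print(n_groups(groupMovies))
```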
04:56 Now, we will use summarise function on this data frame.
05:02 In the Source window, type the following command.
05:07 Run the current line.
05:10 I will resize the Console window.
05:14 The mean values of all movies in different genres are displayed.
05:21 Notice that the Documentary genre has the highest mean imdb_rating.
05:28 And the Comedy genre has the lowest mean imdb_rating.
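A sketch of summarising a grouped data frame follows, again on made-up values chosen only for illustration. The column name mean_rating and the sorting step with arrange are assumptions added here; they are not part of the narration.

```r
library(dplyr)

# Made-up stand-in for moviesData.csv.
movies <- data.frame(
  genre = c("Drama", "Drama", "Comedy", "Documentary"),
  imdb_rating = c(7.0, 6.0, 5.5, 8.0)
)

groupMovies <- group_by(movies, genre)

# On a grouped data frame, summarise returns one row per group.
genreMeans <- summarise(groupMovies, mean_rating = mean(imdb_rating))

# Sort from highest to lowest mean so the extremes are easy to spot.
genreMeans <- arrange(genreMeans, desc(mean_rating))
print(genreMeans)
```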
05:34 I will resize the Console window.
05:38 In the Source window, click on movies data frame.
05:43 In the movies data frame, scroll from left to right.
05:49 Let us find how the mean imdb_rating is distributed among the movies of the Drama genre.
05:58 For this, we will group the movies of the Drama genre by mpaa_rating.
06:05 We will use the filter, group_by, and summarise functions one by one.
06:12 Click on the script myPipe.R.
06:17 In the Source window, type the following commands.
06:23 First, we will extract the movies of Drama genre.
06:29 Then, we group these movies based on mpaa_rating.
06:35 Finally, we apply summarise function.
06:39 This will calculate the mean of the filtered and grouped movies.
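The three commands are not included in this transcript. The steps just described might be sketched as below, assuming a made-up data frame; the intermediate names dramaMovies, dramaByRating and dramaMeans are hypothetical.

```r
library(dplyr)

# Made-up stand-in for moviesData.csv.
movies <- data.frame(
  genre = c("Drama", "Drama", "Drama", "Comedy"),
  mpaa_rating = c("R", "R", "PG", "PG"),
  imdb_rating = c(7.0, 6.0, 8.0, 5.5)
)

# Step 1: keep only the Drama movies.
dramaMovies <- filter(movies, genre == "Drama")

# Step 2: group the Drama movies by mpaa_rating.
dramaByRating <- group_by(dramaMovies, mpaa_rating)

# Step 3: compute the mean imdb_rating within each group.
dramaMeans <- summarise(dramaByRating, mean_rating = mean(imdb_rating))
print(dramaMeans)
```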
06:46 Run the last three lines of code.
06:50 I will resize the Console window.
06:54 The required mean values are printed on the console.
06:59 I will resize the Console window again.
07:03 In this code, we have to give names to each and every intermediate data frame.
07:10 But there is an alternate method to write these statements using the pipe operator.
07:17 The pipe operator is denoted as %>%.
07:25 It prevents us from making unnecessary data frames.
07:30 We can read the pipe as a series of imperative statements.
07:35 If we have to find the cosine of sine of pi, we can write pi, pipe it to sin, and then pipe the result to cos.
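The expression shown on the slide is not part of this transcript. As a small sketch, the nested call and one possible piped form can be compared like this (the names nested and piped are only for illustration):

```r
library(dplyr)  # dplyr makes the %>% operator available

# Nested form: read from the inside out.
nested <- cos(sin(pi))

# Piped form: read from left to right.
# Take pi, then apply sin, then apply cos.
piped <- pi %>% sin() %>% cos()

# Both forms compute the same value.
print(all.equal(nested, piped))
```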
07:42 Let us switch to RStudio.
07:45 We will learn how to do the same analysis by using the pipe operator.
07:51 In the Source window, type the following command.
07:56 Here three lines of code have been written as a series of statements.
08:02 We can read this code as,
08:06 Using the movies data frame, filter the movies of Drama genre
08:13 Next, group the filtered movies by mpaa_rating
08:19 Finally, summarise the mean of imdb_rating of the grouped data.
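The piped command is not reproduced in this transcript. One way the reading above could be written is sketched here on made-up data; the names dramaMeans and mean_rating are assumptions.

```r
library(dplyr)

# Made-up stand-in for moviesData.csv.
movies <- data.frame(
  genre = c("Drama", "Drama", "Drama", "Comedy"),
  mpaa_rating = c("R", "R", "PG", "PG"),
  imdb_rating = c(7.0, 6.0, 8.0, 5.5)
)

# The data frame is named once; each step receives the result of the previous one.
dramaMeans <- movies %>%
  filter(genre == "Drama") %>%
  group_by(mpaa_rating) %>%
  summarise(mean_rating = mean(imdb_rating))

print(dramaMeans)
```

No intermediate data frames are created in this form; only the final result is stored.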
08:26 This code is easier to read and write than the previous one.
08:32 In the case of pipe operator, we don’t have to repeat the name of the data frame.
08:39 Notice that we have written name of the data frame only once.
08:45 Save the script and run the current line.
08:50 I will resize the Console window.
08:54 The required mean values are printed on the Console.
08:59 I will resize the Console window again.
09:03 In the Source window, click on movies data frame.
09:08 In the Source window, scroll from left to right.
09:13 Let us check the difference between critics_score and audience_score of all the movies.
09:22 We will use a box plot for our study.
09:26 By using the pipe operator, we can combine the functions of ggplot2 and dplyr packages.
09:34 Click on the script myPipe.R.
09:38 In the Source window, type the following command.
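The plotting command is not included in this transcript. A rough sketch of combining dplyr and ggplot2 through the pipe is given below. It assumes made-up scores, a hypothetical column named diff, and that the difference is taken as audience_score minus critics_score; the original command may differ on all of these points.

```r
library(dplyr)
library(ggplot2)

# Made-up stand-in for moviesData.csv.
movies <- data.frame(
  genre = c("Drama", "Drama", "Comedy", "Comedy"),
  critics_score = c(80, 60, 40, 30),
  audience_score = c(78, 63, 60, 55)
)

# dplyr step: add the difference between the two scores as a new column.
scoreDiff <- movies %>%
  mutate(diff = audience_score - critics_score)

# ggplot2 step: pipe the result into ggplot and draw one box per genre.
# Layers are still added with +, not with the pipe.
p <- scoreDiff %>%
  ggplot(aes(x = genre, y = diff)) +
  geom_boxplot()

print(p)
```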
09:43 Save the script and run the current line.
09:49 The required box plot appears in the Plots window.
09:54 In the Plots window, click on the Zoom button to maximize the plot.
10:00 Here you can see that for Drama, Horror, and Mystery & Suspense movies, the median is close to zero.
10:14 This means that the audience's and critics' opinions are very similar for these genres.
10:22 Whereas for Action & Adventure and Comedy movies, the median is not close to zero.
10:30 This means that the audience's and critics' opinions are different for these genres.
10:37 Close this plot.
10:39 In the Source window, click on movies data frame.
10:44 In the Source window, scroll from right to left.
10:49 Let us check the number of movies in every category of mpaa_rating for each genre.
10:57 Click on the script myPipe.R.
11:01 In the Source window, type the following command.
11:06 Notice that we have included both genre and mpaa_rating in the group_by function.
11:15 So, the analysis will be done on the data divided by these two variables.
11:22 Also, we have used num = n().
11:27 The function n() counts the number of rows in each group, that is, the number of movies for each combination of genre and mpaa_rating.
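The counting command is not shown in this transcript. A sketch of counting rows per combination of two grouping variables, assuming made-up data, could look like this (the name counts is hypothetical; num = n() follows the narration):

```r
library(dplyr)

# Made-up stand-in for moviesData.csv.
movies <- data.frame(
  genre = c("Drama", "Drama", "Drama", "Comedy"),
  mpaa_rating = c("R", "R", "PG", "PG")
)

# Group by both variables, then count the rows in each group with n().
counts <- movies %>%
  group_by(genre, mpaa_rating) %>%
  summarise(num = n())

print(counts)
```

Recent versions of dplyr may print a message about how the result is grouped; it does not affect the counts.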
11:35 Run the current line.
11:38 I will resize the Console window.
11:42 From the output, we can see that there are 22 Action & Adventure movies with mpaa_rating as R.
11:53 Let us summarize what we have learnt.
11:57 In this tutorial, we have learnt about:
12:00 summarise and group_by functions
12:04 Operations in summarise function
12:08 Pipe operator
12:10 We now suggest an assignment.
12:14 Use the built-in data set iris.
12:18 Using the pipe operator, group the flowers by their species.
12:24 Summarise the grouped data by the mean of Sepal.Length and Sepal.Width.
12:33 The video at the following link summarises the Spoken Tutorial project.
12:37 Please download and watch it.
12:41 We conduct workshops using Spoken Tutorials and give certificates.
12:46 Please contact us.
12:49 Please post your timed queries in this forum.
12:54 Please post your general queries in this forum.
12:59 The FOSSEE team coordinates the TBC project.
13:03 For more details, please visit these sites.
13:07 The Spoken Tutorial project is funded by NMEICT, MHRD, Govt. of India
13:13 The script for this tutorial was contributed by Varshit Dubey (CoE Pune).
13:20 This is Sudhakar Kumar from IIT Bombay signing off. Thanks for watching.

Contributors and Content Editors

Sakinashaikh