R/C2/Pipe-Operator/English-timed
Revision as of 16:36, 22 May 2020 by Sakinashaikh
Time | Narration |
00:01 | Welcome to this tutorial on Pipe Operator. |
00:06 | In this tutorial, we will learn about: |
00:10 | summarise and group_by functions |
00:14 | Operations in summarise function |
00:18 | Pipe operator |
00:20 | To understand this tutorial, you should know: |
00:25 | Basics of statistics |
00:28 | Basics of ggplot2 and dplyr packages |
00:34 | Data frames |
00:36 | If not, please locate the relevant tutorials on R on this website. |
00:43 | This tutorial is recorded on |
00:45 | Ubuntu Linux OS version 16.04 |
00:51 | R version 3.4.4 |
00:55 | RStudio version 1.1.463 |
01:01 | Install R version 3.2.0 or higher. |
01:07 | For this tutorial, we will use |
01:10 | A data frame moviesData.csv |
01:15 | A script file myPipe.R. |
01:20 | Please download these files from the Code files link of this tutorial. |
01:27 | I have downloaded and moved these files to pipeOps folder. |
01:33 | This folder is located in myProject folder on my Desktop. |
01:40 | I have also set pipeOps folder as my Working Directory. |
01:46 | Now we will learn about the summarise function. |
01:50 | The summarise function reduces a data frame to a single row, or to one row per group when the data is grouped. |
01:56 | It gives summaries like mean, median, etc. of the variables available in the data frame. |
02:05 | We use summarise along with group_by function. |
02:11 | Let us switch to RStudio. |
02:15 | Open the script myPipe.R in RStudio. |
02:21 | Run this script by clicking on the Source button. |
02:26 | The movies data frame opens in the Source window. |
02:31 | In the movies data frame, scroll from left to right. |
02:37 | This will enable us to see the remaining variables of the movies data frame. |
02:44 | To know the mean of imdb_rating of all movies, we will use the summarise function. |
02:52 | Click on the script myPipe.R. |
02:57 | In the Source window, type the following command. |
03:02 | Inside the summarise function, the first argument is a data frame to be summarised. |
03:09 | Here, it is movies. |
03:12 | The second argument is the information we need, that is the mean of imdb_rating. |
03:21 | Save the script and run the current line by pressing Ctrl+Enter keys simultaneously. |
03:31 | The mean value is shown. |
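The command typed at this step does not appear in the transcript. A minimal sketch of what it might look like, assuming a small made-up movies data frame in place of moviesData.csv (the values below are hypothetical):

```r
library(dplyr)

# Hypothetical stand-in for the movies data frame (not the real moviesData.csv)
movies <- data.frame(
  genre = c("Drama", "Drama", "Comedy", "Documentary"),
  imdb_rating = c(7.0, 8.0, 5.5, 8.5)
)

# First argument: the data frame to summarise
# Second argument: the summary we need, here the mean of imdb_rating
overall <- summarise(movies, mean(imdb_rating))
print(overall)  # a data frame with one row and one column
```

Without grouping, the result has exactly one row, whatever the size of the input.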
03:34 | One may argue that we can find the mean by using the mean function along with the dollar operator. |
03:43 | What is the use of installing a whole package and using a complex function? |
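For comparison, the base R approach referred to here might look like the sketch below. It again assumes a made-up movies data frame with hypothetical values, not the real data.

```r
# Hypothetical stand-in for the movies data frame (values are made up)
movies <- data.frame(imdb_rating = c(7.0, 8.0, 5.5, 8.5))

# Base R: extract the column with the dollar operator and pass it to mean()
baseMean <- mean(movies$imdb_rating)
print(baseMean)
```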
03:49 | Basically, we do not use the summarise function for computing such things. |
03:56 | This function is not very useful unless we pair it with the group_by function. |
04:03 | When we use group_by function, the data frame gets divided into different groups. |
04:12 | Let us switch back to RStudio. |
04:16 | In the Source window, click on movies data frame. |
04:21 | In the movies data frame, scroll from right to left. |
04:27 | We will group the movies data frame based on the genre. |
04:33 | For this, we will use the group_by function. |
04:39 | Click on the script myPipe.R. |
04:43 | In the Source window, type the following command. |
04:48 | Run the current line. |
04:51 | A new data frame, groupMovies, is created. |
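The grouping command is not shown in this transcript. A sketch of how groupMovies might be created, using a made-up movies data frame (hypothetical values) in place of the real one:

```r
library(dplyr)

# Hypothetical stand-in for the movies data frame
movies <- data.frame(
  genre = c("Drama", "Drama", "Comedy", "Documentary"),
  imdb_rating = c(7.0, 8.0, 5.5, 8.5)
)

# Divide the data frame into groups, one per genre
groupMovies <- group_by(movies, genre)

# The rows are unchanged; only the grouping information is added
print(group_vars(groupMovies))  # name of the grouping variable
print(n_groups(groupMovies))    # number of distinct genres
```

Note that group_by on its own does not compute anything. It only records which rows belong together, so that later functions such as summarise work group by group.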
04:56 | Now, we will use summarise function on this data frame. |
05:02 | In the Source window, type the following command. |
05:07 | Run the current line. |
05:10 | I will resize the Console window. |
05:14 | The mean imdb_rating of the movies in each genre is displayed. |
05:21 | Notice that the Documentary genre has the highest mean imdb_rating. |
05:28 | And the Comedy genre has the lowest mean imdb_rating. |
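The summarise call on the grouped data is not captured in the transcript. A minimal sketch under stated assumptions: the movies data frame below is made up, and the column name mean_rating is chosen here only for readability.

```r
library(dplyr)

# Hypothetical stand-in for the movies data frame
movies <- data.frame(
  genre = c("Drama", "Drama", "Comedy", "Documentary"),
  imdb_rating = c(7.0, 8.0, 5.5, 8.5)
)

groupMovies <- group_by(movies, genre)

# summarise now returns one row per genre instead of a single row
genreMeans <- summarise(groupMovies, mean_rating = mean(imdb_rating))
print(genreMeans)
```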
05:34 | I will resize the Console window again. |
05:38 | In the Source window, click on movies data frame. |
05:43 | In the movies data frame, scroll from left to right. |
05:49 | Let us find the mean imdb_rating for the movies of the Drama genre. |
05:58 | Also, we will group the movies of the Drama genre by mpaa_rating. |
06:05 | For this, we will use the filter, group_by, and summarise functions one by one. |
06:12 | Click on the script myPipe.R. |
06:17 | In the Source window, type the following commands. |
06:23 | First, we will extract the movies of Drama genre. |
06:29 | Then, we group these movies based on mpaa_rating. |
06:35 | Finally, we apply summarise function. |
06:39 | This will calculate the mean imdb_rating of the filtered and grouped movies. |
06:46 | Run the last three lines of code. |
06:50 | I will resize the Console window. |
06:54 | The required mean values are printed on the console. |
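The three commands are not included in this transcript. One way they might be written is sketched below. The movies data frame is a made-up stand-in, and the names dramaMovies, grDramaMovies and mean_rating are hypothetical.

```r
library(dplyr)

# Hypothetical stand-in for the movies data frame
movies <- data.frame(
  genre = c("Drama", "Drama", "Drama", "Comedy"),
  mpaa_rating = c("PG", "R", "R", "PG"),
  imdb_rating = c(6.0, 7.0, 8.0, 5.0)
)

# Step 1: extract the movies of Drama genre
dramaMovies <- filter(movies, genre == "Drama")
# Step 2: group these movies based on mpaa_rating
grDramaMovies <- group_by(dramaMovies, mpaa_rating)
# Step 3: compute the mean imdb_rating within each group
dramaMeans <- summarise(grDramaMovies, mean_rating = mean(imdb_rating))
print(dramaMeans)
```

Each step needs its own named intermediate data frame, which is the inconvenience the pipe operator removes.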
06:59 | I will resize the Console window again. |
07:03 | In this code, we have to give names to each and every intermediate data frame. |
07:10 | But there is an alternative way to write these statements, using the pipe operator. |
07:17 | The pipe operator is denoted as %>%. |
07:25 | It saves us from creating unnecessary intermediate data frames. |
07:30 | We can read the pipe as a series of imperative statements. |
07:35 | For example, to find the cosine of the sine of pi, we can write pi %>% sin() %>% cos(). |
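To see that the piped form gives the same value as the usual nested call, a small sketch (it assumes the dplyr package is installed, which makes %>% available):

```r
library(dplyr)  # makes the %>% operator available

# Nested form: read from the inside out
nested <- cos(sin(pi))

# Piped form: read from left to right as "take pi, then sin, then cos"
piped <- pi %>% sin() %>% cos()

print(nested)
print(piped)
```

The pipe passes the value on its left as the first argument of the function on its right, so both expressions perform the same computation.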
07:42 | Let us switch to RStudio. |
07:45 | We will learn how to do the same analysis by using the pipe operator. |
07:51 | In the Source window, type the following command. |
07:56 | Here, the three lines of code have been written as a single chain of piped statements. |
08:02 | We can read this code as: |
08:06 | Using the movies data frame, filter the movies of Drama genre. |
08:13 | Next, group the filtered movies by mpaa_rating. |
08:19 | Finally, summarise the mean of imdb_rating of the grouped data. |
08:26 | This code is easier to read and write than the previous one. |
08:32 | With the pipe operator, we don’t have to repeat the name of the data frame. |
08:39 | Notice that we have written the name of the data frame only once. |
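The piped command itself is not captured in the transcript. A sketch of how the reading above might translate into code, assuming a made-up movies data frame and a hypothetical column name mean_rating:

```r
library(dplyr)

# Hypothetical stand-in for the movies data frame
movies <- data.frame(
  genre = c("Drama", "Drama", "Drama", "Comedy"),
  mpaa_rating = c("PG", "R", "R", "PG"),
  imdb_rating = c(6.0, 7.0, 8.0, 5.0)
)

# The data frame is named once; each step receives the previous result
dramaMeans <- movies %>%
  filter(genre == "Drama") %>%
  group_by(mpaa_rating) %>%
  summarise(mean_rating = mean(imdb_rating))
print(dramaMeans)
```

No intermediate data frames are named here; only the final result is stored.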
08:45 | Save the script and run the current line. |
08:50 | I will resize the Console window. |
08:54 | The required mean values are printed on the Console. |
08:59 | I will resize the Console window again. |
09:03 | In the Source window, click on movies data frame. |
09:08 | In the Source window, scroll from left to right. |
09:13 | Let us check the difference between critics_score and audience_score of all the movies. |
09:22 | We will use a box plot for our study. |
09:26 | By using the pipe operator, we can combine the functions of ggplot2 and dplyr packages. |
09:34 | Click on the script myPipe.R. |
09:38 | In the Source window, type the following command. |
09:43 | Save the script and run the current line. |
09:49 | The required box plot appears in the Plots window. |
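The plotting command is not shown in this transcript. One possible form is sketched below. The data values are made up, the column name diff is hypothetical, and computing the difference as audience_score minus critics_score is an assumption.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for the movies data frame
movies <- data.frame(
  genre = c("Drama", "Drama", "Comedy", "Comedy"),
  critics_score = c(80, 60, 40, 30),
  audience_score = c(78, 63, 60, 55)
)

# dplyr's mutate adds the difference column; the result is piped into ggplot
scorePlot <- movies %>%
  mutate(diff = audience_score - critics_score) %>%
  ggplot(aes(x = genre, y = diff)) +
  geom_boxplot()
print(scorePlot)
```

Note the change of operator: dplyr steps are joined with %>%, while ggplot2 layers are still added with +.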
09:54 | In the Plots window, click on the Zoom button to maximize the plot. |
10:00 | Here you can see that for Drama, Horror, and Mystery & Suspense movies, the median is close to zero. |
10:14 | This means that the audience's and critics' opinions are very similar for these genres. |
10:22 | Whereas for Action & Adventure and Comedy movies, the median is not close to zero. |
10:30 | This means that the audience's and critics' opinions are different for these genres. |
10:37 | Close this plot. |
10:39 | In the Source window, click on movies data frame. |
10:44 | In the Source window, scroll from right to left. |
10:49 | Let us count the movies in every mpaa_rating category within each genre. |
10:57 | Click on the script myPipe.R. |
11:01 | In the Source window, type the following command. |
11:06 | Notice that we have included both genre and mpaa_rating in group_by function. |
11:15 | So, the analysis will be done on the data divided by these two variables. |
11:22 | Also, we have used num = n(). |
11:27 | The function n counts the number of observations, that is rows, in each group. |
11:35 | Run the current line. |
11:38 | I will resize the Console window. |
11:42 | From the output, we can see that there are 22 Action & Adventure movies with mpaa_rating as R. |
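The counting command is not included in the transcript. A minimal sketch, assuming a made-up movies data frame, so the counts below are not those of the real data:

```r
library(dplyr)

# Hypothetical stand-in for the movies data frame
movies <- data.frame(
  genre = c("Action & Adventure", "Action & Adventure", "Drama", "Drama", "Drama"),
  mpaa_rating = c("R", "R", "PG", "R", "R")
)

# Group by both variables, then count the rows in each combination
counts <- movies %>%
  group_by(genre, mpaa_rating) %>%
  summarise(num = n())
print(counts)
```

Only the combinations of genre and mpaa_rating that actually occur in the data get a row in the output.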
11:53 | Let us summarize what we have learnt. |
11:57 | In this tutorial, we have learnt about: |
12:00 | summarise and group_by functions |
12:04 | Operations in summarise function |
12:08 | Pipe operator |
12:10 | We now suggest an assignment. |
12:14 | Use the built-in data set iris. |
12:18 | Using the pipe operator, group the flowers by their species. |
12:24 | Summarise the grouped data by the mean of Sepal.Length and Sepal.Width. |
12:33 | The video at the following link summarises the Spoken Tutorial project. |
12:37 | Please download and watch it. |
12:41 | We conduct workshops using Spoken Tutorials and give certificates. |
12:46 | Please contact us. |
12:49 | Please post your timed queries in this forum. |
12:54 | Please post your general queries in this forum. |
12:59 | The FOSSEE team coordinates the TBC project. |
13:03 | For more details, please visit these sites. |
13:07 | The Spoken Tutorial project is funded by NMEICT, MHRD, Govt. of India |
13:13 | The script for this tutorial was contributed by Varshit Dubey (CoE Pune). |
13:20 | This is Sudhakar Kumar from IIT Bombay signing off. Thanks for watching. |