Title of the script: Data Cleaning in R
Author: Tanmay Srinath
Keywords: R, RStudio, machine learning, supervised, unsupervised, data cleaning, NA values, CSV files, spoken tutorial, video tutorial.
Visual Cue | Narration |
Show Slide
Opening Slide |
Welcome to the spoken tutorial on Data Cleaning using R. |
Show Slide
Learning Objectives |
In this tutorial, we will learn about:
|
Show Slide
System Specifications |
This tutorial is recorded using,
It is recommended to install R version 4.1.0 or higher. |
Show Slide
Prerequisites |
To follow this tutorial, the learner should know:
If not, please access the relevant tutorials on R on this website. |
Show Slide
What is Data Cleaning? |
Now let us see what Data Cleaning is.
It involves detecting and correcting corrupt or inaccurate records in a dataset. Correction may also mean removing specific inaccurate records. |
Show Slide
Need for Data Cleaning |
Next let us see the need for data cleaning.
|
Show Slide
Reading Data from Text File |
|
Only Narration. | Let us see how we can read data in RStudio. |
Show Slide
Download Files |
We will use a script file, DataCleaning.R, and a dataset, airquality.txt.
Please download these files from the Code files link of this tutorial. Make a copy and then use them for practising. |
[Computer screen]
Highlight DataCleaning.R and the folder Data Cleaning. |
I have downloaded and moved these files to the Data Cleaning folder.
This folder is located in the MLProject folder on my Desktop. I have also set the Data Cleaning folder as my Working Directory. |
Cursor in the Data Cleaning folder. | Let us switch to RStudio. |
Double click DataCleaning.R in RStudio
Point to DataCleaning.R in RStudio. |
Let us open the script DataCleaning.R in RStudio.
|
Highlight airTxt <- readLines("airquality.txt")
Highlight print(airTxt) |
The readLines() function reads a text file line by line into a character vector.
The print() function prints the contents that were read. |
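As a minimal sketch of what readLines() returns, the example below writes a small tab-separated sample to a temporary file and reads it back. The three sample lines and the column names are hypothetical stand-ins for airquality.txt, not the actual file contents.

```r
# Hypothetical three-line sample standing in for airquality.txt
sample_lines <- c("Ozone\tSolar.R\tWind\tTemp\tMonth\tDay",
                  "41\t190\t7.4\t67\t5\t1",
                  "NA\tNA\t14.3\t56\t5\t5")

# Write the sample to a temporary file so the example is self-contained
tmp <- tempfile(fileext = ".txt")
writeLines(sample_lines, tmp)

# readLines() returns a character vector with one element per line
airTxt <- readLines(tmp)
print(airTxt)
length(airTxt)  # number of lines read from the file
```

Each element of airTxt is one whole line, so the tab separators are still inside the strings at this stage.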
Highlight airTxt <- readLines("airquality.txt")
print(airTxt)
|
Select and run these commands.
|
Drag the boundary. | I will drag the boundary to see the console window clearly. |
Highlight output in console
Scroll to show all the output. |
This dataset is a text document.
Each individual entry in a row is separated by a tab character, denoted by backslash t. |
Highlight NAs in airTxt | Here NA denotes missing data.
|
Highlight “\t” in the console window. | First, we will convert this dataset to a dataframe.
|
Drag the boundary. | I will drag the boundary to see the Source window clearly. |
[RStudio]
airList <- strsplit(airTxt, "\t")
View(airList) |
In the Source window, type these two commands. |
Highlight strsplit
Highlight \t |
The strsplit() function splits the elements of a character vector into substrings.
Here we have to split our text with the separator “\t”. |
Select the commands. Click the Run Button. | Select and run these commands. |
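The splitting step can be tried on a few lines without the data file. This is a small sketch; the three strings below are hypothetical sample lines, not the real dataset.

```r
# Hypothetical lines in the same tab-separated layout as the dataset
airTxt <- c("Ozone\tSolar.R\tWind",
            "41\t190\t7.4",
            "NA\tNA\t14.3")

# strsplit() returns a list with one character vector per input string
airList <- strsplit(airTxt, "\t")

str(airList)    # compact view of the list structure
airList[[2]]    # the fields of the second line, still stored as text
```

Note that every field, including the numbers and the NA markers, is still a character string after splitting.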
Highlight output in Source.
|
A list of character vectors, one for each line of the file, appears in the Source window.
We will use the first row as the dataframe’s column names. We will store this row in a variable cols. |
Highlight DataCleaning.R | Click on DataCleaning.R in the Source window. |
[RStudio]
cols <- unlist(airList[1])
|
In the Source window, type this command.
Save and run the command.
|
Drag Boundary
|
I will drag the boundary to see the Environment tab clearly.
|
Show Slide:
Purpose of Type Conversion |
Now let us understand why we need type conversion.
|
Only Narration. | Switch to RStudio. |
[RStudio]
|
We will now convert our data to numeric type.
|
Highlight lapply
|
lapply applies a given function to every element of a list.
Save and run the command. |
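The conversion command itself is not shown in this script. Since the next step uses a variable named airN, one plausible form is lapply() with as.numeric(); this is an assumption, sketched here on a hypothetical three-element list.

```r
# Hypothetical list of split lines; the first element is the header row
airList <- list(c("Ozone", "Wind"),
                c("41", "7.4"),
                c("NA", "14.3"))

# Assumed form of the conversion step: apply as.numeric() to each element.
# Text that is not a number (such as the header) becomes NA with a warning,
# which is suppressed here to keep the output readable.
airN <- suppressWarnings(lapply(airList, as.numeric))

airN[[1]]  # header row: every entry is now NA
airN[[2]]  # numeric values
airN[[3]]  # the text "NA" becomes a real missing value
```

Under this assumption, the header row turning into NAs explains why the first row of the resulting data frame contains only missing values.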
Cursor in the Source window. | Next, we convert our list to a data frame. |
[RStudio]
airDF <- as.data.frame(do.call(rbind, airN))
View(airDF) |
In the Source window, type these commands. |
Highlight airDF <- as.data.frame(do.call(rbind, airN))
Point to the column names. V1 to V6. |
The do.call() function calls rbind() with every element of the list as an argument, which binds the list by rows.
The raw text document is converted to a dataframe.
|
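To see what do.call(rbind, ...) does, here is a minimal sketch on a hypothetical list of three numeric vectors. It behaves like writing rbind(airN[[1]], airN[[2]], airN[[3]]) by hand.

```r
# Hypothetical list of numeric rows; the first row is all NA
airN <- list(c(NA_real_, NA_real_),
             c(41, 7.4),
             c(36, 8.0))

# do.call() passes each list element to rbind() as a separate argument,
# producing a matrix with one row per element
airMat <- do.call(rbind, airN)

# as.data.frame() converts the matrix and assigns default names V1, V2, ...
airDF <- as.data.frame(airMat)
print(airDF)
names(airDF)
```

The default names V1, V2 and so on appear because the matrix carries no column names of its own.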
Highlight NAs on 1st row. | The first row is full of NAs, which represent missing values.
|
Highlight DataCleaning.R | Click on DataCleaning.R in the Source window. |
[RStudio]
airDF <- airDF[-1,]
Click on Save button and Run button. |
In the Source window, type this command.
This command removes the first row of NAs.
|
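Later steps refer to the column airDF$Month, so the column names stored earlier in cols need to be applied at some point. The script as shown here does not include that step; the sketch below assumes it is done with colnames(), using a hypothetical two-column data frame.

```r
# Hypothetical stand-ins for the objects built in the earlier steps
cols  <- c("Ozone", "Month")
airDF <- data.frame(V1 = c(NA, 41, 36),
                    V2 = c(NA, 5, 5))

# Drop the first row, which holds only NAs
airDF <- airDF[-1, ]

# Assumed step: apply the stored header row as column names
colnames(airDF) <- cols

# Optional: renumber the rows from 1 after the deletion
rownames(airDF) <- NULL

print(airDF)
```

Negative indexing such as airDF[-1, ] returns every row except the one listed, and leaves the columns untouched.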
Highlight airDF window | In the Source window, click on airDF.
|
Only Narration. | Now we will learn how to handle missing values in our dataset. |
Show Slide
Handling NA values |
There are two ways of handling NA values: removing them, or replacing them with other values.
|
Only Narration. | First we will learn to remove NA values.
Let us return to RStudio. |
Highlight DataCleaning.R | Click on DataCleaning.R in the Source window. |
[RStudio]
airDF_without_NA <- na.omit(airDF)
View(airDF_without_NA) |
In the Source window, type the following commands. |
Highlight na.omit | The na.omit() function removes every row that contains an NA value. |
Click on Save button and Run button.
Highlight : airDF_without_NA <- na.omit(airDF) View(airDF_without_NA)
|
Save and run these commands.
All the rows containing NA values have been removed. |
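The row-wise behaviour of na.omit() is easy to check on a small example. The data frame below is a hypothetical sample with three rows, of which only the first is complete.

```r
# Hypothetical sample: rows 2 and 3 each contain one missing value
airDF <- data.frame(Ozone = c(41, NA, 12),
                    Wind  = c(7.4, 14.3, NA),
                    Temp  = c(67, 56, 74))

# na.omit() drops every row that has at least one NA in any column
airDF_without_NA <- na.omit(airDF)

nrow(airDF)             # rows before
nrow(airDF_without_NA)  # rows after
print(airDF_without_NA)
```

Because a single NA removes the whole row, this approach can discard a large share of the data when missing values are spread across many rows.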
Show Slide
Replacing Missing Values |
|
Only Narration. | Let us return to RStudio. |
Highlight DataCleaning.R | Click on DataCleaning.R in the Source window. |
[RStudio]
airDF[is.na(airDF)] <- 0
Highlight airDF[is.na(airDF)] <- 0 |
We will replace all the NA values with zeros.
In the Source window, type this command.
The is.na() function checks each value and returns TRUE where it is missing. Wherever this condition is true, the command updates the value to 0.
|
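The replacement works in two parts: is.na() builds a logical mask, and the assignment writes 0 into the positions where the mask is TRUE. A minimal sketch on a hypothetical two-row data frame:

```r
# Hypothetical sample with one missing value in each column
airDF <- data.frame(Ozone = c(41, NA),
                    Wind  = c(NA, 14.3))

# is.na() returns a logical matrix of the same shape as the data frame
mask <- is.na(airDF)
print(mask)

# Assigning through the mask changes only the positions that are TRUE
airDF[mask] <- 0
print(airDF)
```

Values that were not missing are left exactly as they were.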
View(airDF)
airDF[is.na(airDF)] <- 0
View(airDF)
|
In the Source window, type this command.
Select these commands and run them. |
Highlight
airDF in Source |
All the NA values have been replaced with 0.
NA values can also be replaced by the mean, the median, or another summary statistic of the column.
|
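As one illustration of replacing NA values with a statistic, the sketch below uses the median of each column. The data frame is a hypothetical sample, and the loop is one possible approach rather than the method used in the tutorial script. It has to run on data that still contains NA values, that is, before they are overwritten with zeros.

```r
# Hypothetical sample with one missing value in each column
airDF <- data.frame(Ozone = c(41, NA, 12, 18),
                    Wind  = c(7.4, 8.0, NA, 11.5))

for (col in names(airDF)) {
  # na.rm = TRUE ignores the missing values when computing the median
  col_median <- median(airDF[[col]], na.rm = TRUE)
  # Replace only the missing entries of this column
  airDF[[col]][is.na(airDF[[col]])] <- col_median
}

print(airDF)
```

Without na.rm = TRUE, median() would itself return NA for any column that has a missing value.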
Highlight
airDF in Source |
Before performing the data analysis, we need to encode categorical variables.
|
Highlight DataCleaning.R | Click on DataCleaning.R in the Source window. |
[RStudio]
airDF$Month <- factor(airDF$Month)
levels(airDF$Month) |
In the Source window, type these commands. |
Highlight
airDF$Month <- factor(airDF$Month)
|
This encodes the column Month as a factor.
|
[RStudio]
names <- c("May", "June", "July", "August", "September")
airDF$Month <- factor(airDF$Month, labels = names)
levels(airDF$Month)
Drag boundary. |
In the Source window, type the following commands.
Drag boundary to see the source window clearly.
|
Highlight names <- c("May", "June", "July", "August", "September")
|
We use the factor() function to change the numeric levels to month names. |
Highlight Output in console | Now the levels have changed to month names. |
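The relabelling can be tried on a plain vector. The sketch below uses hypothetical month numbers and the variable name month_names; it also passes levels explicitly, which is an addition for illustration and not part of the command shown in the tutorial.

```r
# Hypothetical month numbers as they might appear in the Month column
month_num   <- c(5, 6, 6, 7, 8, 9, 5)
month_names <- c("May", "June", "July", "August", "September")

# levels fixes which number maps to which label, in order
month_f <- factor(month_num, levels = 5:9, labels = month_names)

levels(month_f)   # the five month names
table(month_f)    # count of observations per month
```

Stating the levels explicitly keeps the mapping from numbers to names fixed even if one month is absent from the data.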
Only Narration. | With this we come to the end of this tutorial.
Let us summarise. |
Show Slide
Summary |
In this tutorial we have learnt about:
|
Show Slide
Assignment |
Here is an assignment for you.
In the air quality dataset, replace the NA values with the mean of observations. |
Show Slide
About the Spoken Tutorial Project |
The video at the following link summarises the Spoken Tutorial project. Please download and watch it. |
Show Slide
Spoken Tutorial Workshops |
We conduct workshops using Spoken Tutorials and give certificates.
|
Show Slide
Spoken Tutorial Forum to answer questions
Choose the minute and second where you have the question. Explain your question briefly. Someone from the FOSSEE team will answer them. Please visit this site. |
Please post your timed queries in this forum. |
Show Slide
Forum to answer questions |
Do you have any general/technical questions?
Please visit the forum given in the link. |
Show Slide
Textbook Companion |
The FOSSEE team coordinates the coding of solved examples from popular books and case study projects.
|
Show Slide
Acknowledgment |
The Spoken Tutorial and FOSSEE projects are funded by the Ministry of Education, Government of India. |
Show Slide
Thank You |
This tutorial is contributed by Tanmay Srinath and Madhuri Ganapathi from IIT Bombay.
Thank you for watching. |