Machine-Learning-using-R - old 2022/C2/Data-Cleaning-using-R/English
Revision as of 13:12, 3 March 2023
Title of the script: Data Cleaning in R
Author: Tanmay Srinath
Keywords: R, RStudio, machine learning, supervised, unsupervised, data cleaning, NA values, CSV files, spoken tutorial, video tutorial.
Visual Cue | Narration |
Show Slide
Opening Slide |
Welcome to the spoken tutorial on Data Cleaning using R. |
Show Slide
Learning Objectives |
In this tutorial, we will learn about:
* Data Cleaning.
* Reading data from a text file.
* Type conversions.
* Handling NA values.
|
Show Slide
System Specifications |
This tutorial is recorded using:
* Ubuntu Linux OS version 20.04
* R version 4.1.2
* RStudio version 1.4.1717
It is recommended to install R version 4.1.0 or higher. |
Show Slide
Prerequisites |
To follow this tutorial, the learner should know:
If not, please access the relevant tutorials on R on this website. |
Show Slide
What is Data Cleaning? |
Now let us see what Data Cleaning is.
It involves detecting and correcting corrupt or inaccurate records in a dataset. The correction may also involve removing specific inaccurate records. |
Show Slide
Need for Data Cleaning |
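Next, let us see the need for data cleaning.
* It improves data quality and data reliability.
* It delivers accuracy and ensures consistency in data.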
|
Show Slide
Reading Data from Text File |
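* Data might not be available in convenient forms like CSV files.
* We need to learn how to extract data from text files. |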
Only Narration. | Let us see how we can read data in RStudio. |
Show Slide
Download Files |
We will use a script file DataCleaning.R and a dataset airquality.txt.
Please download these files from the Code files link of this tutorial. Make a copy and then use them for practising. |
[Computer screen]
Highlight DataCleaning.R and the folder Data Cleaning. |
I have downloaded and moved these files to the DataCleaning folder.
This folder is located in the MLProject folder on my Desktop. I have also set the Data Cleaning folder as my Working Directory. |
Cursor in the Data Cleaning folder. | Let us switch to RStudio. |
Double click DataCleaning.R in RStudio
Point to DataCleaning.R in RStudio. |
Let us open the script DataCleaning.R in RStudio.
|
Highlight airTxt <- readLines("airquality.txt")
Highlight print(airTxt) |
The readLines() function reads a text file line by line.
The print() function prints the contents that were read. |
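A minimal sketch of this step, assuming a small made-up tab-separated file rather than the tutorial's airquality.txt. The names tmp and sampleTxt are hypothetical and are used only for illustration.

```r
# Build a small tab-separated file so the sketch is self-contained.
# The contents are illustrative; they are not the tutorial's dataset.
tmp <- tempfile(fileext = ".txt")
writeLines(c("Ozone\tTemp\tMonth", "41\t67\t5", "NA\t72\t5"), tmp)

# readLines() returns a character vector with one element per line.
sampleTxt <- readLines(tmp)

# print() shows each line as a quoted string, with tabs displayed as \t.
print(sampleTxt)
length(sampleTxt)
```

Each line of the file becomes one string, so the values within a line still have to be separated in a later step.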
Highlight airTxt <- readLines("airquality.txt")
print(airTxt)
|
Select and run these commands.
|
Drag the boundary. | I will drag the boundary to see the console window clearly. |
Highlight output in console
Scroll to show all the output. |
This dataset is a text document.
Each entry in a row is separated by a tab character, denoted by backslash t. |
Highlight NAs in airTxt | Here NA denotes missing data.
Missing data given as NA is omitted or replaced for proper data analysis.
|
Highlight “\t” in the console window. | First, we will convert this dataset to a dataframe.
|
Drag the boundary. | I will drag the boundary to see the Source window clearly. |
[RStudio]
airList <- strsplit(airTxt, "\t")
View(airList) |
In the Source window, type these two commands. |
Highlight strsplit
Highlight \t |
The strsplit() function splits the elements of a character vector into substrings.
Here we have to split our text with the separator “\t”. |
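A minimal sketch of the split, assuming three made-up lines in place of airTxt. The names sampleTxt and sampleList are hypothetical.

```r
# Three illustrative lines, each holding tab-separated values.
sampleTxt <- c("Ozone\tTemp\tMonth", "41\t67\t5", "NA\t72\t5")

# strsplit() returns a list with one character vector per input string.
sampleList <- strsplit(sampleTxt, "\t")

# str() gives a compact view of the resulting list structure.
str(sampleList)
```

Note that every value, including the numbers, is still stored as text at this stage.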
Select the commands. Click the Run Button. | Select and run these commands. |
Highlight output in Source.
|
A list of character vectors, one for each row of the text file, appears in the Source window.
We will use the first row as the dataframe’s column names. We will store this row in a variable cols. |
Highlight DataCleaning.R | Click on DataCleaning.R in the Source window. |
[RStudio]
cols <- unlist(airList[1])
|
In the Source window type this command.
Save and run the command.
|
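A minimal sketch of why unlist() is used here, assuming a small made-up list in place of airList. The names sampleList, firstItem and headerCols are hypothetical.

```r
# An illustrative list of character vectors, similar in shape to airList.
sampleList <- list(c("Ozone", "Temp", "Month"), c("41", "67", "5"))

# Single brackets keep the result wrapped in a list of length 1.
firstItem <- sampleList[1]
is.list(firstItem)

# unlist() flattens that list into a plain character vector.
headerCols <- unlist(firstItem)
is.character(headerCols)
print(headerCols)
```

Double brackets, as in sampleList[[1]], would give the same character vector directly.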
Drag Boundary
|
I will drag the boundary to see the Environment tab clearly.
|
Show Slide:
Purpose of Type Conversion |
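Now let us understand why we need type conversion.
* Many functions in R work only with numeric data.
* Hence, type conversion will make data suitable for analysis.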
|
Only Narration. | Switch to RStudio. |
[RStudio]
|
We will now convert our data to numeric type.
|
Highlight lapply
|
lapply applies a given function to every element of a list.
as.numeric converts a variable to the numeric data type.
Save and run the command. |
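The exact command is not shown in this script. The following is only a sketch, assuming the conversion combines lapply with as.numeric on a list of character vectors. The names sampleList and sampleNum are hypothetical.

```r
# An illustrative list: a header row followed by two data rows.
sampleList <- list(c("Ozone", "Temp", "Month"),
                   c("41", "67", "5"),
                   c("NA", "72", "5"))

# as.numeric() is applied to each vector in turn.
# Text that is not a number becomes NA, and R warns about the coercion,
# so the warning is suppressed here to keep the output short.
sampleNum <- suppressWarnings(lapply(sampleList, as.numeric))

print(sampleNum)
```

Under this assumption, the header row turns entirely into NA values, while the string "NA" in the data rows becomes a proper missing value.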
Cursor in the Source window. | Next, we convert our list to a data frame. |
[RStudio]
airDF <- as.data.frame(do.call(rbind, airN))
View(airDF) |
In the Source window, type these commands. |
Highlight airDF <- as.data.frame(do.call(rbind, airN))
Point to the column names. V1 to V6. |
The do.call() function calls rbind with every element of the list, which binds the list by rows.
The raw text document is converted to a dataframe.
|
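A minimal sketch of the row binding, assuming a small made-up numeric list in place of airN. The names sampleNum, sampleMat and sampleDF are hypothetical, and the way the column names are assigned here is an assumption, since that step is not shown in this script.

```r
# An illustrative list: the first element mimics a header row
# that became NA after numeric conversion.
sampleNum <- list(c(NA, NA, NA), c(41, 67, 5), c(NA, 72, 5))

# do.call(rbind, x) behaves like rbind(x[[1]], x[[2]], x[[3]]).
sampleMat <- do.call(rbind, sampleNum)
sampleDF <- as.data.frame(sampleMat)

# Without column names, the columns get default names V1, V2, V3.
defaultNames <- names(sampleDF)
print(defaultNames)

# Assumed step: assign the stored header as column names.
colnames(sampleDF) <- c("Ozone", "Temp", "Month")

# A negative index drops the first row, which holds only NA values.
sampleDF <- sampleDF[-1, ]
print(sampleDF)
```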
Highlight NAs on 1st row. | The first row is full of NAs, which represent missing values.
This row held the column names, which cannot be converted to numbers.
|
Highlight DataCleaning.R | Click on DataCleaning.R in the Source window. |
[RStudio]
airDF <- airDF[-1,]
Click on Save button and Run button. |
In the Source window, type this command.
This command removes the first row of NAs.
|
Highlight airDF window | In the Source window, click on airDF.
|
Only Narration. | Now we will learn how to handle missing values in our dataset. |
Show Slide
Handling NA values |
There are two ways of handling NA values:
* Removing the NA values.
* Replacing the NA values.
|
Only Narration. | First we will learn to remove NA values.
Let us return to RStudio. |
Highlight DataCleaning.R | Click on DataCleaning.R in the Source window. |
[RStudio]
airDF_without_NA <- na.omit(airDF)
View(airDF_without_NA) |
In the Source window, type the following commands. |
Highlight na.omit | The na.omit() function removes every row that contains an NA value. |
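A minimal sketch of this behaviour, assuming a small made-up data frame in place of airDF. The names sampleDF and cleanDF are hypothetical.

```r
# An illustrative data frame with NA values in two different rows.
sampleDF <- data.frame(Ozone = c(41, NA, 12),
                       Temp  = c(67, 72, NA),
                       Month = c(5, 5, 6))

# na.omit() drops each row that has at least one NA in any column.
cleanDF <- na.omit(sampleDF)

print(cleanDF)
nrow(cleanDF)
```

Only the first row survives here, which shows that removing rows can discard a lot of valid data when NA values are spread across many rows.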
Click on Save button and Run button.
Highlight:
airDF_without_NA <- na.omit(airDF)
View(airDF_without_NA)
|
Save and run these commands.
All the rows containing NA values have been removed. |
Show Slide
Replacing Missing Values |
|
Only Narration. | Let us return to RStudio. |
Highlight DataCleaning.R | Click on DataCleaning.R in the Source window. |
[RStudio]
airDF[is.na(airDF)] <- 0
Highlight airDF[is.na(airDF)] <- 0 |
We will replace all the NA values with zeros.
In the Source window, type this command.
is.na() checks whether each value is NA.
Where this condition is true, the NA value is updated to 0.
|
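A minimal sketch of how this replacement works, assuming a small made-up data frame in place of airDF. The names sampleDF and naMask are hypothetical.

```r
# An illustrative data frame with two NA values.
sampleDF <- data.frame(Ozone = c(41, NA, 12),
                       Temp  = c(67, 72, NA))

# is.na() on a data frame returns a logical matrix of the same shape,
# with TRUE wherever a value is missing.
naMask <- is.na(sampleDF)
print(naMask)
sum(naMask)

# Indexing with the logical matrix assigns 0 only to the TRUE positions.
sampleDF[naMask] <- 0
print(sampleDF)
```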
View(airDF)
airDF[is.na(airDF)] <- 0
View(airDF)
|
In the Source window type this command.
Select these commands and run them. |
Highlight
airDF in Source |
All the NA values have been replaced with 0.
NA values can also be replaced by the mean, median or any other similar statistical value of the column.
|
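A minimal sketch of replacement with a statistical value, assuming the median is chosen and a small made-up data frame is used in place of airDF. The names sampleDF, col and colMedian are hypothetical.

```r
# An illustrative data frame with one NA value in each column.
sampleDF <- data.frame(Ozone = c(41, NA, 12, 18),
                       Temp  = c(67, 72, NA, 62))

# Replace the NA values column by column with that column's median.
for (col in names(sampleDF)) {
  # na.rm = TRUE ignores the missing values while computing the median.
  colMedian <- median(sampleDF[[col]], na.rm = TRUE)
  sampleDF[[col]][is.na(sampleDF[[col]])] <- colMedian
}

print(sampleDF)
```

Working column by column keeps each replacement on the scale of its own variable, unlike a single constant such as 0.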
Highlight
airDF in Source |
Before performing the data analysis, we need to encode categorical variables.
|
Highlight DataCleaning.R | Click on DataCleaning.R in the Source window. |
[RStudio]
airDF$Month <- factor(airDF$Month)
levels(airDF$Month) |
In the Source window, type these commands. |
Highlight
airDF$Month <- factor(airDF$Month)
|
This encodes the column Month as a factor.
|
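A minimal sketch of this encoding, assuming a short made-up vector of month numbers in place of the Month column. The names monthNum and monthFac are hypothetical.

```r
# Illustrative month numbers, stored as plain numeric values.
monthNum <- c(5, 5, 6, 7, 9, 8, 6)

# factor() treats each distinct value as a category.
monthFac <- factor(monthNum)

# The levels are the sorted distinct values, stored as text.
levels(monthFac)

# table() counts how many observations fall in each category.
table(monthFac)
```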
[RStudio]
names <- c("May", "June", "July", "August", "September")
airDF$Month <- factor(airDF$Month, labels = names)
levels(airDF$Month)
Drag boundary. |
In the Source window, type the following commands.
Drag the boundary to see the Source window clearly.
|
Highlight names <- c("May", "June", "July", "August", "September")
|
We use the factor() function to change the numeric levels to month names. |
Highlight Output in console | Now the levels have changed to month names. |
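A minimal sketch of the relabelling, assuming a short made-up vector of month numbers. The names monthNum, monthNames and monthFac are hypothetical, and passing levels explicitly is a design choice of this sketch rather than something the script above does.

```r
# Illustrative month numbers and the labels they should map to.
monthNum <- c(5, 5, 6, 7, 9, 8, 6)
monthNames <- c("May", "June", "July", "August", "September")

# labels are matched to levels by position, so levels = 5:9 makes
# the mapping explicit: 5 is May, 6 is June, and so on.
monthFac <- factor(monthNum, levels = 5:9, labels = monthNames)

levels(monthFac)
as.character(monthFac)
```

Stating the levels guards against a wrong mapping if one of the months happens to be absent from the data.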
Only Narration. | With this we come to the end of this tutorial.
Let us summarise. |
Show Slide
Summary |
In this tutorial we have learnt about:
* Data Cleaning.
* Reading data from a text file.
* Type conversions.
* Handling NA values.
|
Show Slide
Assignment |
Here is an assignment for you.
In the air quality dataset, replace the NA values with the mean of observations. |
Show Slide
About the Spoken Tutorial Project |
The video at the following link summarises the Spoken Tutorial project. Please download and watch it. |
Show Slide
Spoken Tutorial Workshops |
We conduct workshops using Spoken Tutorials and give certificates.
|
Show Slide
Spoken Tutorial Forum to answer questions
Choose the minute and second where you have the question. Explain your question briefly. Someone from the FOSSEE team will answer them. Please visit this site. |
Please post your timed queries in this forum. |
Show Slide
Forum to answer questions |
Do you have any general/technical questions?
Please visit the forum given in the link. |
Show Slide
Textbook Companion |
The FOSSEE team coordinates coding of solved examples of popular books and case study projects.
|
Show Slide
Acknowledgment |
The Spoken Tutorial and FOSSEE projects are funded by the Ministry of Education, Govt. of India. |
Show Slide
Thank You |
This tutorial is contributed by Tanmay Srinath and Madhuri Ganapathi from IIT Bombay.
Thank you for watching. |