Machine-Learning-using-R - old 2022/C2/Data-Cleaning-using-R/English

Title of the script: Data Cleaning in R

Author: Tanmay Srinath

Keywords: R, RStudio, machine learning, supervised, unsupervised, data cleaning, NA values, CSV files, video tutorial.

Visual Cue	Narration
Show Slide Opening Slide	Welcome to the spoken tutorial on Data Cleaning using R.
Show Slide Learning Objectives	In this tutorial, we will learn about: Data Cleaning. Reading Data from a text file. Type conversions. Handling NA values. Encoding values to factors.
Show Slide System Specifications	This tutorial is recorded using: Ubuntu Linux OS version 20.04, R version 4.1.2, and RStudio version 1.4.1717. It is recommended to install R version 4.1.0 or higher.
Show Slide Prerequisites https://spoken-tutorial.org	To follow this tutorial, the learner should know: Basics of R Programming. Dataframes, lists and vectors. If not, please access the relevant tutorials on R on this website.
Show Slide What is Data Cleaning?	Now let us see what Data Cleaning is. It involves detecting and correcting corrupt or inaccurate records in a dataset. The correction may also lead to the removal of specific inaccurate records.
Show Slide Need for Data Cleaning	Next let us see the need for data cleaning. It improves data quality and data reliability. Delivers accuracy and ensures consistency in data. Ensures that data is set for statistical analysis.
Show Slide Reading Data from Text File	Data might not be available in convenient forms like CSV files. We need to learn how to extract data from text files.
Only Narration.	Let us see how we can read data in RStudio.
Show Slide Download Files	We will use a script file DataCleaning.R and a dataset airquality.txt. Please download these files from the Code files link of this tutorial. Make a copy and then use them for practising.
[Computer screen] Highlight DataCleaning.R and the folder Data Cleaning.	I have downloaded and moved these files to the DataCleaning folder. This folder is located in the MLProject folder on my Desktop. I have also set the Data Cleaning folder as my Working Directory.
Only Narration.	Let us switch to RStudio.
Double click DataCleaning.R in RStudio Point to DataCleaning.R in RStudio.	Let us open the script DataCleaning.R in RStudio. Script DataCleaning.R opens in RStudio.
Highlight airTxt <- readLines("airquality.txt") Highlight print(airTxt)	The readLines() function reads a text file line by line into a character vector. The print() function prints its contents.
Highlight airTxt <- readLines("airquality.txt") print(airTxt) Click the Run button.	Select and run these commands. The contents of the dataset are printed in the console window.
Drag the boundary.	I will drag the boundary to see the console window clearly.
Highlight output in console Scroll to show all the output.	This dataset is a text document. Each individual entry in a row is separated by tab space, denoted by backslash t.
Highlight NAs in airTxt	Here NA denotes missing data. Missing data given as NA must be omitted or replaced for proper data analysis.
Highlight “\t” in the console window.	First, we will convert this dataset to a dataframe. For that, we should remove these tab spaces.
Drag the boundary.	I will drag the boundary to see the Source window clearly.
[RStudio] airList <- strsplit(airTxt, "\t") View(airList)	In the Source window, type these two commands.
Highlight strsplit Highlight \t	The strsplit() function splits the elements of a character vector into substrings. Here we split our text at the separator “\t”.
Select the commands. Click the Run Button.	Select and run these commands.
Highlight output in Source. Highlight 1st row of ‘airList’.	A list appears in the Source window. Each element of the list is a character vector that holds the entries of one line. We will use the first row as the dataframe’s column names. We will store this row in a variable cols.
Highlight DataCleaning.R	Click on DataCleaning.R in the Source window.
[RStudio] cols <- unlist(airList[1]) Highlight unlist()	In the Source window type this command. Save and run the command. unlist() function converts a list into a vector.
Drag Boundary Point to Environment tab.	I will drag the boundary to see the Environment tab clearly. We can see the vector in the Environment tab.
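The reading and splitting steps above can be sketched as a self-contained example. It assumes that airquality.txt is a tab-separated file with a header line; because the tutorial's file is not reproduced here, the sketch writes the first rows of R's built-in airquality dataset to a temporary file and uses that as a stand-in. The name tmp is introduced only for this illustration.

```r
# Assumption: a stand-in for airquality.txt, built from the first six rows
# of R's built-in airquality dataset and written as tab-separated text.
tmp <- tempfile(fileext = ".txt")
write.table(head(airquality), tmp, sep = "\t", quote = FALSE, row.names = FALSE)

airTxt <- readLines(tmp)           # one string per line of the file
airList <- strsplit(airTxt, "\t")  # one character vector per line
cols <- unlist(airList[1])         # the header line as a character vector

print(cols)
length(airList)                    # header line plus the data lines
```

Each element of airList is still a character vector at this point, which is why the type conversion step that follows is needed.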
Only Narration.	Now let us understand why we need type conversion.
Show Slide: Purpose of Type Conversion	Many statistical functions in R work only on numeric data. Hence, type conversion will make data suitable for analysis.
Only Narration.	Switch to RStudio.
[RStudio] airN <- lapply(airList, as.numeric)	We will now convert our data to numeric type. In the Source window, type the following command.
Highlight lapply Highlight as.numeric	lapply applies a given function to every element of a list. as.numeric converts a variable to numeric data type. Save and run the command.
	Next, we convert our list to a dataframe.
[RStudio] airDF <- as.data.frame(do.call(rbind, airN)) View(airDF)	In the Source window, type these commands.
Highlight airDF <- as.data.frame(do.call(rbind, airN)) Highlight airDF in Source. Point to the column names. V1 to V6.	do.call() calls rbind() with the elements of the list as its arguments, so the list is bound by rows. Save and run these commands. The raw text document is converted to a dataframe. Observe that the column names are given as V1, V2, V3, and so on.
Highlight NAs on 1st row.	The first row is full of NAs, which represent missing values. This happened because we used the as.numeric() function on character data, namely the column names. Let us correct these problems.
Highlight DataCleaning.R	Click on DataCleaning.R in the Source window.
[RStudio] airDF <- airDF[-1,] colnames(airDF) <- cols Click on Save button and Run button.	In the Source window, type these commands. The first command removes the first row of NAs. The second command replaces the default column names with the names stored in cols. Select the commands and run them.
Highlight airDF window	In the Source window, click on airDF. We can see that the column names have been changed.
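The conversion from a list of character vectors to a named dataframe can be sketched end to end as follows. As an assumption, the first rows of R's built-in airquality dataset are written to a temporary tab-separated file (tmp, a name introduced here) to stand in for airquality.txt. suppressWarnings() is added only to hide the coercion warning that as.numeric() raises on the header line.

```r
# Assumption: stand-in for airquality.txt built from the built-in dataset.
tmp <- tempfile(fileext = ".txt")
write.table(head(airquality), tmp, sep = "\t", quote = FALSE, row.names = FALSE)
airList <- strsplit(readLines(tmp), "\t")
cols <- unlist(airList[1])

# as.numeric() turns strings that are not numbers (the header) into NA.
airN <- suppressWarnings(lapply(airList, as.numeric))

# Bind the numeric vectors by rows, then convert the matrix to a dataframe.
airDF <- as.data.frame(do.call(rbind, airN))

airDF <- airDF[-1, ]     # drop the all-NA row produced by the header line
colnames(airDF) <- cols  # replace V1, V2, ... with the header names
str(airDF)
```

All columns come out as numeric, and the NA values that remain are the ones that were already missing in the data.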
Only Narration.	Now we will learn how to handle missing values in our dataset.
Show Slide Handling NA values	There are two ways of handling NA values: Removing NA values. Replacing them with appropriate values.
Only Narration.	First we will learn to remove NA values. Let us return to RStudio.
Highlight DataCleaning.R	Click on DataCleaning.R in the Source window.
[RStudio] airDF_without_NA <- na.omit(airDF) View(airDF_without_NA)	In the Source window, type the following commands.
Highlight na.omit	The na.omit() function removes every row that contains at least one NA value.
Click on Save button and Run button. Highlight : airDF_without_NA <- na.omit(airDF) View(airDF_without_NA) Highlight airDF_without_NA in Source	Save and run these commands. All the rows containing NA values have been removed.
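A minimal sketch of the effect of na.omit() on the number of rows is given here. It assumes that R's built-in airquality dataset is an acceptable stand-in for the dataframe built from airquality.txt.

```r
# Assumption: the built-in airquality dataset stands in for airDF.
airDF <- airquality

airDF_without_NA <- na.omit(airDF)

nrow(airDF)                   # rows before removal
nrow(airDF_without_NA)        # rows after removal
sum(is.na(airDF_without_NA))  # count of NA values that remain
```

Comparing the two row counts shows how much data is lost, which helps in deciding whether removal or replacement is the better choice.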
Show Slide Replacing Missing Values	Removing NA values makes sense when we have few such entries. However, removing a lot of missing values without replacement leads to data loss. So, we should know how to replace the missing values.
Only Narration.	Let us return to RStudio.
Highlight DataCleaning.R	Click on DataCleaning.R in the Source window.
[RStudio] airDF[is.na(airDF)] <- 0 Highlight airDF[is.na(airDF)] <- 0	We will replace all the NA values with zeros. In the Source window, type this command. is.na() checks whether each value is NA. Wherever this condition is true, the value is updated to 0. The NA values can be replaced by any numeric value, for example 0.
View(airDF) [Highlight] airDF[is.na(airDF)] <- 0 View(airDF) Click on Save button and Run button.	In the Source window type this command. Select these commands and run them.
Highlight airDF in Source	All the NA values have been replaced with 0. NA can be replaced by mean, median or any other similar statistical value. To know more about it, please refer to the Additional Reading Material.
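The two kinds of replacement mentioned above can be sketched side by side. The example assumes the built-in airquality dataset as a stand-in for airDF. The names airZero, airMedian and ozoneMedian are introduced only for this illustration, and the median is applied to the Ozone column alone.

```r
# Assumption: the built-in airquality dataset stands in for airDF.
airDF <- airquality

# Replace every NA in the dataframe with 0.
airZero <- airDF
airZero[is.na(airZero)] <- 0

# Replace the NAs of one column with that column's median.
# na.rm = TRUE makes median() ignore the missing values.
airMedian <- airDF
ozoneMedian <- median(airMedian$Ozone, na.rm = TRUE)
airMedian$Ozone[is.na(airMedian$Ozone)] <- ozoneMedian

sum(is.na(airZero))          # NA values left after the zero replacement
sum(is.na(airMedian$Ozone))  # NA values left in the Ozone column
```

Working on copies keeps the original airDF unchanged, so the two approaches can be compared.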
Highlight airDF in Source	Before performing the data analysis, we need to encode categorical variables. In R, this is done by converting them to factors.
Highlight DataCleaning.R	Click on DataCleaning.R in the Source window.
[RStudio] airDF$Month <- factor(airDF$Month) levels(airDF$Month)	In the Source window, type these commands.
Highlight airDF$Month <- factor(airDF$Month) Highlight levels(airDF$Month) Highlight Output in console	This encodes the column Month as a factor. To find the levels of a factor we use the levels() function. Save and Run the commands. We can see that the levels are 5, 6, 7, 8 and 9. We will replace these factor values with Month names.
[RStudio] names <- c("May", "June", "July", "August", "September") airDF$Month <- factor(airDF$Month, labels = names) levels(airDF$Month) Drag boundary.	In the Source window, type the following commands. Drag the boundary to see the Source window clearly. The first command creates a vector that stores the Month names from May to September. Save and run the commands.
Highlight names <- c("May", "June", "July", "August", "September") Highlight airDF$Month <- factor(airDF$Month, labels = names)	We use the factor() function to change the numeric levels to month names.
Highlight Output in console	Now the levels have changed to month names.
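The factor encoding can be tried in a self-contained sketch. It assumes the built-in airquality dataset as a stand-in for airDF. The vector is called monthNames here, a name chosen for this illustration so that it does not mask R's built-in names() function.

```r
# Assumption: the built-in airquality dataset stands in for airDF.
airDF <- airquality

airDF$Month <- factor(airDF$Month)
levels(airDF$Month)   # levels before relabelling

# labels are matched to the sorted levels in order: 5 becomes May, and so on.
monthNames <- c("May", "June", "July", "August", "September")
airDF$Month <- factor(airDF$Month, labels = monthNames)

levels(airDF$Month)   # levels after relabelling
table(airDF$Month)    # number of observations in each month
```

The labels are assigned by position, so the vector of month names must follow the same order as the original levels.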
Only Narration.	With this we come to the end of this tutorial. Let us summarise.
Show Slide Summary	In this tutorial we have learnt about: Data Cleaning. Reading Data from a text file. Type conversions. Handling NA values. Encoding values to factors.
Show Slide Assignment	Here is an assignment for you. In the air quality dataset, replace the NA values with the mean of observations.
Show Slide About the Spoken Tutorial Project	The video at the following link summarises the Spoken Tutorial project. Please download and watch it.
Show Slide Spoken Tutorial Workshops	We conduct workshops using Spoken Tutorials and give certificates. Please contact us.
Show Slide Spoken Tutorial Forum to answer questions Do you have questions in THIS Spoken Tutorial? Choose the minute and second where you have the question. Explain your question briefly. Someone from the FOSSEE team will answer them. Please visit this site.	Please post your timed queries in this forum.
Show Slide Forum to answer questions	Do you have any general/technical questions? Please visit the forum given in the link.
Show Slide Textbook Companion	The FOSSEE team coordinates coding of solved examples of popular books and case study projects. We give certificates to those who do this. For more details, please visit these sites.
Show Slide Acknowledgment	The Spoken Tutorial and FOSSEE projects are funded by the Ministry of Education, Govt. of India.
Show Slide Thank You	This tutorial is contributed by Tanmay Srinath and Madhuri Ganapathi from IIT Bombay. Thank you for watching.

Contributors and Content Editors

Madhurig, Nancyvarkey
