Machine-Learning-using-R - old 2022/C2/Data-Cleaning-using-R/English


Title of the script: Data Cleaning in R

Author: Tanmay Srinath

Keywords: R, RStudio, machine learning, supervised, unsupervised, data cleaning, NA values, CSV files, spoken tutorial, video tutorial.


Visual Cue Narration
Show Slide

Opening Slide

Welcome to the spoken tutorial on Data Cleaning using R.
Show Slide

Learning Objectives

In this tutorial, we will learn about:
  • Data Cleaning.
  • Reading data from a text file.
  • Type conversions.
  • Handling NA values.
  • Encoding values to factors.
Show Slide

System Specifications

This tutorial is recorded using:
  • Ubuntu Linux OS version 20.04
  • R version 4.1.2
  • RStudio version 1.4.1717

It is recommended to install R version 4.1.0 or higher.

Show Slide

Prerequisites

https://spoken-tutorial.org

To follow this tutorial, the learner should know:
  • Basics of R Programming.
  • Dataframes, lists and vectors.

If not, please access the relevant tutorials on R on this website.

Show Slide

What is Data Cleaning?

Now let us see what Data Cleaning is.

It involves detecting and correcting corrupt or inaccurate records in a dataset.

The correction may also involve removing specific inaccurate records.

Show Slide

Need for Data Cleaning

Next let us see the need for data cleaning.
  • It improves data quality and data reliability.
  • Delivers accuracy and ensures consistency in data.
  • Ensures that data is set for statistical analysis.
Show Slide

Reading Data from Text File

  • Data might not be available in convenient forms like CSV files.
  • We need to learn how to extract data from text files.
Only Narration. Let us see how we can read data in RStudio.
Show Slide

Download Files

We will use a script file DataCleaning.R and a dataset airquality.txt

Please download these files from the Code files link of this tutorial.

Make a copy and then use them for practising.

[Computer screen]

Highlight DataCleaning.R and the folder Data Cleaning.

I have downloaded and moved these files to the DataCleaning folder.

This folder is located in the MLProject folder on my Desktop.

I have also set the Data Cleaning folder as my Working Directory.

Cursor in the Data Cleaning folder. Let us switch to RStudio.
Double click DataCleaning.R in RStudio

Point to DataCleaning.R in RStudio.

Let us open the script DataCleaning.R in RStudio.


Script DataCleaning.R opens in RStudio.

Highlight airTxt <- readLines("airquality.txt")

Highlight print(airTxt)

The readLines() function reads a text file line by line into a character vector.

The print() function prints the contents of airTxt.
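As a minimal sketch of what readLines() returns, the snippet below writes a small tab-separated file to a temporary location and reads it back. The file contents and the names tmpFile and sampleTxt are made up for illustration; they are not the tutorial's airquality.txt.

```r
# Hypothetical sample data, not the tutorial's airquality.txt
tmpFile <- tempfile(fileext = ".txt")
writeLines(c("Ozone\tTemp\tMonth", "41\t67\t5", "NA\t72\t5"), tmpFile)

# readLines() returns a character vector with one element per line
sampleTxt <- readLines(tmpFile)
print(sampleTxt)

# Number of lines that were read
length(sampleTxt)
```

Each element of sampleTxt is one whole line of the file, so the tab characters are still embedded in the strings at this point.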

Highlight airTxt <- readLines("airquality.txt")

print(airTxt)


Click the Run button.

Select and run these commands.


The contents of the dataset are printed in the console window.

Drag the boundary. I will drag the boundary to see the console window clearly.
Highlight output in console

Scroll to show all the output.

This dataset is a text document.

Each entry in a row is separated by a tab character, denoted by backslash t.

Highlight NAs in airTxt Here NA denotes missing data.


Missing data, given as NA, must be omitted or replaced before proper data analysis.

Highlight “\t” in the console window. First, we will convert this dataset to a dataframe.


For that, we should split each line at these tab characters.

Drag the boundary. I will drag the boundary to see the Source window clearly.
[RStudio]

airList <- strsplit(airTxt, "\t")

View(airList)

In the Source window, type these two commands.
Highlight strsplit

Highlight \t

The strsplit() function splits the elements of a character vector into substrings.

Here we have to split our text at the separator “\t”.
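The following is a small sketch of the same splitting step on made-up lines. The vector sampleTxt and its values are assumptions for illustration only.

```r
# Hypothetical lines, in the same tab-separated layout as the text file
sampleTxt <- c("Ozone\tTemp\tMonth", "41\t67\t5", "NA\t72\t5")

# strsplit() returns a list with one character vector per input line
sampleList <- strsplit(sampleTxt, "\t")

# Inspect the structure of the result
str(sampleList)

# The second element holds the fields of the second line
sampleList[[2]]
```

Note that every field is still a character string after splitting, including the numbers and the text "NA".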

Select the commands. Click the Run Button. Select and run these commands.
Highlight output in Source.


Highlight 1st row of ‘airList’.

A list of character vectors, one for each line of the file, appears in the Source window.

We will use the first element of this list as the dataframe’s column names.

We will store this row in a variable cols.

Highlight DataCleaning.R Click on DataCleaning.R in the Source window.
[RStudio]

cols <- unlist(airList[1])


Highlight unlist()

In the Source window type this command.

Save and run the command.


The unlist() function flattens a list into a vector.
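A short sketch of why unlist() is needed here, using a made-up list named sampleList. Single square brackets return a list of length one, and unlist() turns it into a plain character vector.

```r
# Hypothetical list, similar in shape to the output of strsplit()
sampleList <- list(c("Ozone", "Temp", "Month"), c("41", "67", "5"))

# Single brackets keep the list wrapper around the first element
first <- sampleList[1]
is.list(first)

# unlist() flattens it into a character vector
sampleCols <- unlist(first)
sampleCols

# Double brackets extract the same vector directly
identical(sampleCols, sampleList[[1]])
```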

Drag Boundary


Point to Environment tab.

I will drag the boundary to see the Environment tab clearly.


We can see the vector in the Environment tab.


Show Slide:

Purpose of Type Conversion

Now let us understand why we need type conversion.
  • Most statistical functions in R expect numeric data.
  • Hence, type conversion will make data suitable for analysis.
Only Narration. Switch to RStudio.
[RStudio]


airN <- lapply(airList, as.numeric)

We will now convert our data to numeric type.


In the Source window, type the following command.

Highlight lapply


Highlight as.numeric

lapply applies a given function to every element of a list.


as.numeric converts a value to the numeric data type. Text that cannot be read as a number becomes NA.
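The conversion can be sketched on a small made-up list as shown below. The name sampleList and its values are illustrative, and the call to suppressWarnings() is an addition that only hides the coercion warning R raises for the header text.

```r
# Hypothetical list of character vectors: a header row and two data rows
sampleList <- list(c("Ozone", "Temp"), c("41", "67"), c("NA", "72"))

# Apply as.numeric() to every element of the list.
# Header text cannot be parsed as a number, so it becomes NA.
sampleN <- suppressWarnings(lapply(sampleList, as.numeric))

sampleN[[1]]   # header row after conversion
sampleN[[2]]   # a fully numeric row
sampleN[[3]]   # a row with a missing value
```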

Save and run the command.

Cursor in the Source window. Next we convert our list to a data frame.
[RStudio]

airDF <- as.data.frame(do.call(rbind, airN))

View(airDF)

In the Source window, type these commands.
Highlight airDF <- as.data.frame(do.call(rbind, airN))


Highlight airDF in Source.

Point to the column names.

V1 to V6.

The do.call() function calls rbind() with the elements of the list as its arguments, which binds them together by rows.


Save and run these commands.

The raw text document is converted to a dataframe.


Observe that column names are given as V1, V2, V3, and so on.
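A minimal sketch of this step on a made-up list named sampleN. It shows that do.call(rbind, ...) builds a matrix with one row per list element, and that as.data.frame() then assigns the default column names.

```r
# Hypothetical list of numeric vectors; the first one mimics the converted header row
sampleN <- list(c(NA, NA), c(41, 67), c(36, 72))

# Same effect as rbind(sampleN[[1]], sampleN[[2]], sampleN[[3]])
sampleMat <- do.call(rbind, sampleN)

# Convert the matrix to a data frame
sampleDF <- as.data.frame(sampleMat)

names(sampleDF)   # default column names
dim(sampleDF)     # rows and columns
```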

Highlight NAs on 1st row. The first row is full of NAs, which represent missing values.


This happened because we applied the as.numeric() function to the header row, which contains character data.


Let us correct these problems.

Highlight DataCleaning.R Click on DataCleaning.R in the Source window.
[RStudio]

airDF <- airDF[-1,]


colnames(airDF) <- cols

Click on Save button and Run button.

In the Source window, type these commands.

This command removes the first row of NAs.


This command replaces the default column names with the names stored in cols.
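Both corrections can be sketched on a small made-up data frame. The names sampleDF and sampleCols and their values are assumptions for illustration.

```r
# Hypothetical data frame whose first row holds only NAs
sampleDF <- data.frame(V1 = c(NA, 41, 36), V2 = c(NA, 67, 72))
sampleCols <- c("Ozone", "Temp")

# A negative row index drops that row and keeps all columns
sampleDF <- sampleDF[-1, ]

# Assign the stored header names to the columns
colnames(sampleDF) <- sampleCols

sampleDF
```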


Select commands and run them.

Highlight airDF window In the Source window, click on airDF.


We can see that the column names have been changed.

Only Narration. Now we will learn how to handle missing values in our dataset.
Show Slide

Handling NA values

There are two ways of handling NA values:
  • Removing NA values.
  • Replacing them with appropriate values.
Only Narration. First we will learn to remove NA values.

Let us return to RStudio.

Highlight DataCleaning.R Click on DataCleaning.R in the Source window.
[RStudio]

airDF_without_NA <- na.omit(airDF)

View(airDF_without_NA)

In the Source window, type the following commands.
Highlight na.omit The na.omit() function removes every row that contains an NA value.
Click on Save button and Run button.

Highlight :

airDF_without_NA <- na.omit(airDF)

View(airDF_without_NA)


Highlight airDF_without_NA in Source

Save and run these commands.

All the rows containing NA values have been removed.
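The sketch below, on a made-up data frame, shows that na.omit() works row by row: a row is dropped even if only one of its columns is NA. The names sampleDF and cleanDF are illustrative.

```r
# Hypothetical data frame with missing values in two different rows
sampleDF <- data.frame(
  Ozone = c(41, NA, 12),
  Temp  = c(67, 72, NA),
  Month = c(5, 5, 6)
)

# Drop every row that has at least one NA
cleanDF <- na.omit(sampleDF)

nrow(sampleDF)   # rows before
nrow(cleanDF)    # rows after
anyNA(cleanDF)   # check that no NA remains
```

Here two of the three rows are lost, which illustrates how quickly na.omit() can shrink a dataset with scattered missing values.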

Show Slide

Replacing Missing Values

  • Removing NA values makes sense when we have few such entries.
  • However, removing a lot of missing values without replacement leads to data loss.
  • So, we should know how to replace the missing values.


Only Narration. Let us return to RStudio.
Highlight DataCleaning.R Click on DataCleaning.R in the Source window.
[RStudio]

airDF[is.na(airDF)] <- 0

Highlight airDF[is.na(airDF)] <- 0

We will replace all the NA values with zeros.

In the Source window, type this command.


is.na() checks if the value is NA.

Wherever this condition is true, the NA value is replaced with 0.


The NA values can be replaced by any numeric value, for example 0.
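As a small sketch of how this indexing works, the snippet below stores the result of is.na() in a variable before using it. The names sampleDF and naMask are made up for illustration.

```r
# Hypothetical data frame with two missing values
sampleDF <- data.frame(Ozone = c(41, NA, 12), Temp = c(67, 72, NA))

# is.na() returns a logical matrix of the same shape as the data frame
naMask <- is.na(sampleDF)
naMask

# Count the missing values
sum(naMask)

# Assign 0 to every position where the mask is TRUE
sampleDF[naMask] <- 0
sampleDF
```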

View(airDF)


[Highlight]

airDF[is.na(airDF)] <- 0

View(airDF)


Click on Save button and Run button.

In the Source window type this command.

Select these commands and run them.

Highlight

airDF in Source

All the NA values have been replaced with 0.

NA values can also be replaced by the mean, median or a similar summary statistic.


To know more about it, please refer to the Additional Reading Material.
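One possible way to do such a replacement is sketched below, using the median of each column. This is only an illustration on made-up data, not the method from the tutorial's script. The names sampleDF and colMedian are assumptions.

```r
# Hypothetical data frame with one missing value per column
sampleDF <- data.frame(
  Ozone = c(41, NA, 12, 18),
  Temp  = c(67, 72, NA, 62)
)

# Replace the NAs in each column with that column's median
for (col in names(sampleDF)) {
  # na.rm = TRUE ignores the missing values when computing the median
  colMedian <- median(sampleDF[[col]], na.rm = TRUE)
  sampleDF[[col]][is.na(sampleDF[[col]])] <- colMedian
}

sampleDF
```

Unlike replacing with 0, this keeps the filled-in values within the range of the observed data.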

Highlight

airDF in Source

Before performing the data analysis, we need to encode categorical variables.


In R, this is done by converting them to factors.

Highlight DataCleaning.R Click on DataCleaning.R in the Source window.
[RStudio]

airDF$Month <- factor(airDF$Month)

levels(airDF$Month)

In the Source window, type these commands.
Highlight

airDF$Month <- factor(airDF$Month)


Highlight levels(airDF$Month)


Highlight Output in console

This encodes the column Month as a factor.


To find the levels of a factor we use the levels() function.


Save and Run the commands.


We can see that the levels are 5, 6, 7, 8 and 9.
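A minimal sketch of the same encoding on a made-up vector of month numbers. The names sampleMonth and monthFactor are illustrative.

```r
# Hypothetical month numbers, in no particular order
sampleMonth <- c(5, 5, 6, 9, 7, 8, 9)

# factor() takes the sorted unique values as its levels
monthFactor <- factor(sampleMonth)

levels(monthFactor)       # levels are stored as character strings
nlevels(monthFactor)      # number of distinct levels
as.integer(monthFactor)   # internal integer codes, not the month numbers
```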


We will replace these factor values with Month names.

[RStudio]

names <- c("May", "June", "July", "August", "September")

airDF$Month <- factor(airDF$Month, labels = names)

levels(airDF$Month)

Drag boundary.

In the Source window, type the following commands.

Drag boundary to see the source window clearly.


This creates a vector that stores the Month names from May to September.


Save and Run the commands.

Highlight names <- c("May", "June", "July", "August", "September")


Highlight airDF$Month <- factor(airDF$Month, labels = names)

We use the factor() function to change the numeric levels to month names.
Highlight Output in console Now the levels have changed to month names.
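The relabelling can be sketched on a made-up vector as shown below. The explicit levels = 5:9 argument is an addition that spells out which number each label belongs to, and the variable is called monthNames here so that it does not mask the base R function names().

```r
# Hypothetical month numbers
sampleMonth <- c(5, 5, 6, 9, 7, 8, 9)
monthNames <- c("May", "June", "July", "August", "September")

# Labels are matched to levels in order: 5 is May, 6 is June, and so on
monthFactor <- factor(sampleMonth, levels = 5:9, labels = monthNames)

levels(monthFactor)
as.character(monthFactor)

# Count of observations per month
table(monthFactor)
```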
Only Narration. With this we come to the end of this tutorial.

Let us summarise.

Show Slide

Summary

In this tutorial we have learnt about:
  • Data Cleaning.
  • Reading data from a text file.
  • Type conversions.
  • Handling NA values.
  • Encoding values to factors.
Show Slide

Assignment

Here is an assignment for you.

In the air quality dataset, replace the NA values with the mean of observations.

Show Slide

About the Spoken Tutorial Project

The video at the following link summarises the Spoken Tutorial project. Please download and watch it.
Show Slide

Spoken Tutorial Workshops

We conduct workshops using Spoken Tutorials and give certificates.


Please contact us.

Show Slide

Spoken Tutorial Forum to answer questions


Do you have questions in THIS Spoken Tutorial?

Choose the minute and second where you have the question.

Explain your question briefly.

Someone from the FOSSEE team will answer them.

Please visit this site.

Please post your timed queries in this forum.
Show Slide

Forum to answer questions

Do you have any general/technical questions?

Please visit the forum given in the link.

Show Slide

Textbook Companion

The FOSSEE team coordinates coding of solved examples of popular books and case study projects.


We give certificates to those who do this.


For more details, please visit these sites.

Show Slide

Acknowledgment

The Spoken Tutorial and FOSSEE projects are funded by the Ministry of Education, Govt. of India.
Show Slide

Thank You

This tutorial is contributed by Tanmay Srinath and Madhuri Ganapathi from IIT Bombay.

Thank you for watching.

Contributors and Content Editors

Madhurig, Nancyvarkey