Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English


Title of the script: Introduction to Machine Learning in R

Author: Debatosh Chakraborty

Keywords: R, RStudio, machine learning, supervised, unsupervised, video tutorial.


Visual Cue Narration
Show slide

Opening Slide

Welcome to this spoken tutorial on Introduction to Machine Learning in R
Show slide

Learning Objectives

In this tutorial, we will learn about:
  • Machine Learning
  • Supervised and Unsupervised Learning
  • Workflow of an ML Classifier Algorithm
  • Visualizing Feature Space
  • Constructing a dummy classifier
  • Evaluation of the chosen dummy classifier
Show slide

System Specifications

This tutorial is recorded using,
  • Windows 11
  • R version 4.3.0
  • RStudio version 2023.06.1

It is recommended to install R version 4.2.0 or higher.

Show slide

Prerequisites

https://spoken-tutorial.org

To follow this tutorial, the learner should know
  • Basic programming in R.
  • How to use the ggplot2 and dplyr packages.

If not, please access the relevant tutorials on this website.

Show slide

Machine Learning

About machine learning
  • ML enables computers to learn from data.
  • ML algorithms automate the learning process from data through patterns.
  • Their primary role is prediction, classification or clustering of data.
  • ML algorithms are applied in several applications.
  • For example Natural Language Processing, Image and speech recognition, etc.
Show Slide

Types of Machine Learning

ML algorithms include the following types and tasks:
  • Supervised learning: Prediction and Classification,
  • Unsupervised learning: Clustering,
  • Semi-supervised learning
  • Reinforcement learning.

In this series, we will focus on Supervised and Unsupervised learning algorithms.

Show Slide

Supervised and Unsupervised Learning


Supervised learning: Labeled data
  • ML algorithms predict labels for unseen features
  • They predict based on given features and labels of data.

Unsupervised learning: Unlabeled data

  • ML algorithms develop a mechanism to group similar features into clusters.
  • And label them for future analysis.
Show Slides

Classification and Regression

  • Supervised learning consists of Regression and Classification.
  • Regression is applied to predict and learn continuous-valued responses from features.
  • Regression techniques include Linear, Spline, Ridge, Lasso, and others.
  • Classification is applied to predict the class of a discrete (labeled) response from features.
  • Classification techniques include Logistic Regression, Decision Tree, SVM, and others.
Show Slides

Workflow of an ML Classifier algorithm

The workflow of an ML Classifier algorithm includes:
  • Feature Space: Collection of all possible values of the features.
  • A classification algorithm partitions the feature space into a number of classes.
  • Data is split into training and testing sets to learn and evaluate the algorithm.
  • The model learns from the training data to create partitions of feature space.
  • The model is evaluated on the test dataset through performance metrics.
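
The workflow above can be sketched in a few lines of base R. This is only an illustrative sketch on made-up data: the data frame toy, the feature x and the midpoint rule are assumptions for this example and are not part of the tutorial's Intro.R script.

```r
# Hypothetical toy data: one feature x and a two-level class label.
set.seed(42)
n <- 200
toy <- data.frame(
  x     = c(rnorm(n / 2, mean = 0), rnorm(n / 2, mean = 3)),
  class = factor(rep(c("A", "B"), each = n / 2))
)

# Split the rows into training (70%) and testing (30%) sets.
train_idx <- sample(seq_len(nrow(toy)), size = round(0.7 * nrow(toy)))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# "Learn" a partition of the feature space from the training data:
# here, the midpoint between the two class means.
threshold <- mean(tapply(train$x, train$class, mean))

# Predict on the test set and evaluate with a performance metric (accuracy).
pred <- factor(ifelse(test$x < threshold, "A", "B"), levels = levels(toy$class))
accuracy <- mean(pred == test$class)
print(accuracy)
```

The threshold plays the role of the partition of the feature space, and the accuracy is computed only on rows the rule never saw while it was being fitted.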
Show Slide

Dataset

Let’s use the Raisin dataset with two chosen variables, or features, to understand a classification problem.

For more information on the Raisin data, please refer to the Additional Reading Material on this tutorial page.

Show slide

Download Files

We will use the script file Intro.R and the Raisin dataset file Raisin.xlsx.

Please download these files from the Code files link of this tutorial.

Make a copy and then use them while practicing.

[Computer screen]

Point to Intro.R and the folder Introduction.

Point to the MLProject folder on the Desktop.

I have downloaded and moved these files to the Introduction folder.

This folder is located in the MLProject folder on my Desktop.

I have also set the Introduction folder as my working Directory.

In this tutorial, we will introduce classification on the raisin dataset.

Let us switch to RStudio.
Click Intro.R in RStudio

Point to Intro.R in RStudio.

Let us open the script Intro.R in RStudio.

Script Intro.R opens in RStudio.

[RStudio]

Highlight the command library(readxl)

Highlight the command library(caret)

Highlight the command library(ggplot2)

#install.packages("package_name")


Point to the command.

Select and run these commands to import the packages.


We will use the readxl package to load the Excel file of our Raisin dataset.


We will use the caret package to create the confusion matrix.

The ggplot2 package will be used to create the decision boundary plot.

Please ensure that all the packages are installed correctly.

As I have already installed the packages, I have imported them directly.

[RStudio]

Highlight the command

data<- read_xlsx("Raisin.xlsx")

Run this command to load the Raisin dataset.

Drag boundary to see the Environment tab clearly.

In the Environment tab below Data, you will see the data variable.

Click on data to load the dataset in the Source window.

Click on Intro.R in the Source window and close the tab.

Highlight the command.

data <- data[c("minorAL", "ecc", "class")]

data$class <- factor(data$class)

Select the commands and click the Run button

We now select three columns from data.

2 columns ("minorAL", "ecc") are chosen as features.

The class column is chosen as a target variable.

We convert the target variable data$class to a factor.

Select and run the commands.

Click on the Environment tab.

Click on data.

Click on data to load the modified data in the Source window.
We will now understand the feature space of this data.
range_minor_al <- range(data$minorAL)

range_ecc <- range(data$ecc)

In the Source window type these commands
Highlight the command

range_minor_al <- range(data$minorAL)

Highlight the command

range_ecc <- range(data$ecc)

These commands show the range of the feature variables minorAL and ecc.

Select and run the commands.


Drag boundary to see the environment tab clearly.


The minimum and maximum values of minorAL and ecc are stored in the variables range_minor_al and range_ecc.
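
As a small illustration of what range() returns, here is a sketch on a made-up vector. The values in v are hypothetical and are not taken from the Raisin data.

```r
# Hypothetical feature values.
v <- c(0.62, 0.35, 0.96, 0.78)

# range() returns a vector of length 2: the minimum, then the maximum.
r <- range(v)
print(r)

r[1]  # minimum, same as min(v)
r[2]  # maximum, same as max(v)
```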

X <- seq(min(data$minorAL), max(data$minorAL), length.out = 100)

Y <- seq(min(data$ecc), max(data$ecc), length.out = 100)

feature <- expand.grid(minorAL = X, ecc = Y)

We will now use the range to generate grid points to construct the feature space.

In the Source window type these commands

Highlight

X <- seq(min(data$minorAL), max(data$minorAL), length.out = 100)

Y <- seq(min(data$ecc), max(data$ecc), length.out = 100)

Highlight

feature <- expand.grid(minorAL = X, ecc = Y)

The first two commands generate sequences of 100 points spanning the ranges of minorAL and ecc.

The last command creates the Cartesian product of the two sequences to form a grid over the feature space.

Select and run the commands.
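
To see what expand.grid does, here is a minimal sketch with much shorter sequences. The values of X and Y below are made up for illustration. With the 100-point sequences used in this tutorial, the same call would give 100 * 100 = 10000 grid points.

```r
# Hypothetical short sequences standing in for the two features.
X <- seq(0, 1, length.out = 3)    # 0.0, 0.5, 1.0
Y <- seq(10, 20, length.out = 2)  # 10, 20

# expand.grid pairs every X value with every Y value.
grid <- expand.grid(minorAL = X, ecc = Y)
print(grid)

nrow(grid)  # 3 * 2 = 6 rows
```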

ggplot(data = data, aes(x = minorAL, y = ecc)) +

geom_point(aes(color = class), size = 2) +

scale_fill_manual(values = c("#ffff46", "#FF46e9")) +

scale_color_manual(values = c("red", "blue")) +

labs(title = "Feature Space") +

theme_minimal()

We will now plot the feature space created


In the Source window type these commands

ggplot(data = data, aes(x = minorAL, y = ecc)) +

geom_point(aes(color = class), size = 2) +

scale_fill_manual(values = c("#ffff46", "#FF46e9")) +

scale_color_manual(values = c("red", "blue")) +

labs(title = "Feature Space") +

theme_minimal()

These commands plot the data points in the feature space.

Select and run the commands.

Drag boundaries. Drag boundaries to see the plot window clearly.
Point to the data. Now let us split our data into training and testing data.
[RStudio]

set.seed(1)


index_split<- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE)

Click on Intro.R in the Source window, and type these commands.

Highlight the command

set.seed(1)


Highlight the command

index_split<- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE)

Select the commands and run them.

[RStudio]

train_data <- data[index_split, ]


test_data <- data[-c(index_split), ]

In the Source window type these commands
Highlight the command

train_data <- data[index_split, ]

Highlight the command

test_data <- data[-c(index_split), ]

This creates training data, consisting of 630 unique rows.


This creates testing data, consisting of 270 unique rows.
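
The row counts can be checked without the actual dataset. The sketch below uses a hypothetical 900-row data frame named toy in place of the Raisin data and repeats the same indexing. It assumes, as the narration implies, that the dataset has 900 rows.

```r
# Hypothetical stand-in for the Raisin data: 900 rows, one id column.
toy <- data.frame(id = 1:900)

set.seed(1)
index_split <- sample(1:nrow(toy), size = round(0.7 * nrow(toy)), replace = FALSE)

# Positive indices keep the sampled rows; negative indices drop them.
train_toy <- toy[index_split, , drop = FALSE]
test_toy  <- toy[-index_split, , drop = FALSE]

nrow(train_toy)                               # 630
nrow(test_toy)                                # 270
length(intersect(train_toy$id, test_toy$id))  # 0: no row is in both sets
```

Because sampling is done without replacement, every row lands in exactly one of the two sets.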

Select the commands and click the Run button.


Point to the sets in the Environment Tab

Click the test_data and train_data

Select the commands and run them.

The data sets are shown in the Environment tab.

Drag boundary to see the Environment tab clearly


Click on test_data and train_data to load them in the Source window.

Here we try to partition the feature space to construct the classifier.

To begin with, one might construct a heuristic line to build the classifier.

[Rstudio]

fit = function(x)((x * (-0.0021)) + 1.445)

model_predict <- function(x){

factor(ifelse(x$ecc < fit(x$minorAL), "Kecimen", "Besni"))

}

In the Source window type these commands.
Highlight the command

fit = function(x)((x * (-0.0021)) + 1.445)

Highlight the command

model_predict <- function(x){

factor(ifelse(x$ecc < fit(x$minorAL), "Kecimen", "Besni"))

}

Click the Save and Run buttons.

Let us describe the steps of the classification algorithm.

For that we will define a line to partition the data as a dummy classifier.

It does not involve training data so performance may be poor.

We define a function that separates data points belonging to either side of the line.
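
To see how the line decides a class, the sketch below applies the same two functions to three hand-made points. The data frame pts is hypothetical and its values are not taken from the Raisin data.

```r
# Same line and prediction rule as in the script.
fit <- function(x) (x * (-0.0021)) + 1.445

model_predict <- function(x) {
  factor(ifelse(x$ecc < fit(x$minorAL), "Kecimen", "Besni"))
}

# Hypothetical points, chosen to fall on both sides of the line.
pts <- data.frame(minorAL = c(200, 300, 400),
                  ecc     = c(0.80, 0.90, 0.50))

# Height of the line at each minorAL value: 1.025, 0.815, 0.605
fit(pts$minorAL)

# Points below the line are labelled Kecimen, the others Besni.
model_predict(pts)  # Kecimen, Besni, Kecimen
```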

Click Save.

Select and run the commands.

feature$class <- model_predict(feature)

feature$classnum <- as.numeric(feature$class)

Let’s use the line to classify the feature space and draw the decision boundary.

In the Source window type these commands

Highlight

feature$class <- model_predict(feature)

Highlight

feature$classnum <- as.numeric(feature$class)

This command will use the line created to predict the class of every point in the grid of feature space.

This command encodes the class string labels into numbers suitable for plotting.
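
As a small illustration of this encoding, as.numeric on a factor returns the position of each value among the factor levels. The vector below is made up for this example.

```r
# Hypothetical labels.
labels <- factor(c("Kecimen", "Besni", "Kecimen"))

levels(labels)      # "Besni" "Kecimen": levels are sorted alphabetically by default
as.numeric(labels)  # 2 1 2: each label is replaced by its level index
```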

Select and run the commands.

Click on feature in the Environment tab.

Point to the data in the Source window.

Drag boundary to see the Environment window.

Click on feature in the Environment tab.

The feature set with the predicted classes loads in the source window.

ggplot() +

geom_raster(data= feature, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +

geom_point(data = data, aes(x = minorAL, y = ecc, color = class), size = 2) +

geom_abline(slope = -0.0021, intercept = 1.445, size = 1.2)+

scale_fill_manual(values = c("#ffff46", "#FF46e9")) +

scale_color_manual(values = c("red", "blue")) +

labs(title = "Data Boundary") +

theme_minimal()

In the Source window type these commands
Highlight the command

ggplot() +

geom_raster(data= feature, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +

geom_point(data = data, aes(x = minorAL, y = ecc, color = class), size = 2) +

geom_abline(slope = -0.0021, intercept = 1.445, size = 1.2)+

scale_fill_manual(values = c("#ffff46", "#FF46e9")) +

scale_color_manual(values = c("red", "blue")) +

labs(title = "Data Boundary") +

theme_minimal()

We are visualising the feature space and the partition line using ggplot2.

Select and run the commands.

Drag boundary to see the plot window. Drag boundary to see the plot window clearly.

Overall, the plot shows that the chosen line only approximately separates the two classes in the data.

prediction_test = model_predict(test_data)

Let us see how well the partition performs on the testing dataset.

In the Source window type this command

Highlight the command

prediction_test = model_predict(test_data)

We predict the classes for the testing data and store them in the prediction_test variable.

Select and run the command.

Let us now measure the performance of the classification.
[RStudio]

test_confusion_matrix <- confusionMatrix(test_data$class,prediction_test)

In the Source window, type the command

Highlight the command

test_confusion_matrix <- confusionMatrix(test_data$class,prediction_test)

Click on Save and Run buttons.

We use the confusionMatrix function from the caret package to calculate the performance metrics.

Select and run the command.

test_confusion_matrix$overall["Accuracy"]

In the Source window, type this command
Highlight

test_confusion_matrix$overall["Accuracy"]

It fetches the accuracy metric from the list created

Select and run the command

Drag boundary to see the console window clearly
Highlight

Accuracy

0.6962963

The accuracy on the testing dataset is about 69.6%.
Drag boundary to see the source window clearly Drag boundary to see the source window clearly

Let us now view the confusion matrix of the testing dataset

[RStudio]

test_confusion_matrix$table

In the Source window type this command
Highlight the command

test_confusion_matrix$table

Click on Save and Run buttons.

Select and run the command.


Point to the output in the console window

          Reference
Prediction Besni Kecimen
   Besni      50      82
   Kecimen     0     138

Drag boundary to see the console window clearly.

The output is seen in the console window.

Observe that:

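In the command, the actual classes test_data$class were passed as the first argument of confusionMatrix. So the rows labelled Prediction hold the actual classes, and the columns labelled Reference hold the predicted classes.

82 samples of class Besni have been incorrectly classified as Kecimen.

0 samples of class Kecimen have been incorrectly classified.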
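
The accuracy reported earlier can be recomputed by hand from this table. The sketch below retypes the four counts into a base R matrix. It does not use caret, and the object name cm is an assumption for this example.

```r
# Counts copied from the confusion matrix printed above.
cm <- matrix(c(50, 0, 82, 138), nrow = 2,
             dimnames = list(Prediction = c("Besni", "Kecimen"),
                             Reference  = c("Besni", "Kecimen")))

correct  <- sum(diag(cm))  # 50 + 138 = 188 samples on the diagonal
total    <- sum(cm)        # 270 samples in the testing dataset
accuracy <- correct / total

round(accuracy, 7)         # 0.6962963
```

Accuracy counts only the diagonal cells, so it is the same whichever way the rows and columns are read.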

We can see that our partition line is skewed.

For the same problem many partitions can be drawn.

We can choose a complicated partition to reduce train misclassification error.

But this gives no control over the error on test data.

We can aim to choose a classifier which is simple with a smaller test misclassification error.

With this, we come to the end of this tutorial.

Let us summarize.

Show Slide

Summary

In this tutorial we have learned about:
  • Machine Learning
  • Supervised and Unsupervised Learning
  • Workflow of an ML Classifier Algorithm
  • Visualizing Feature Space
  • Constructing a dummy classifier
  • Evaluation of the chosen dummy classifier
Here is an assignment for you.
Show Slide

Assignment

  • Use a vertical line as a classifier to partition the feature space.
  • Plot the decision boundary for the same.
  • Evaluate the classifier on the test dataset
Show slide

About the Spoken Tutorial Project

The video at the following link summarizes the Spoken Tutorial project.

Please download and watch it.

Show slide

Spoken Tutorial Workshops

We conduct workshops using Spoken Tutorials and give certificates.


Please contact us.

Show Slide

Spoken Tutorial Forum to answer questions

Do you have questions in THIS Spoken Tutorial?

Choose the minute and second where you have the question.

Explain your question briefly.

Someone from our team will answer them.

Please visit this site.

Please post your timed queries in this forum.


Show Slide

Forum to answer questions

Do you have any general/technical questions?

Please visit the forum given in the link.

Show Slide

R Activities

The FOSSEE team coordinates the Textbook Companion, Lab Migration and the Case Study Projects.

We give certificates to those who do this.

For more details, please visit the website.

Show Slide

Acknowledgment

The Spoken Tutorial project was established by the Ministry of Education, Govt. of India.
Show Slide

Thank You

This tutorial is contributed by Debatosh Chakraborty from IIT Bombay.

Thank you for joining.

Contributors and Content Editors

Ushav