Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English
Title of the script: Introduction to Machine Learning in R
Author: Debatosh Chakraborty
Keywords: R, RStudio, machine learning, supervised, unsupervised, video tutorial.
Visual Cue | Narration |
Show slide
Opening Slide |
Welcome to this spoken tutorial on Introduction to Machine Learning in R |
Show slide
Learning Objectives |
In this tutorial, we will learn about:
|
Show slide
System Specifications |
This tutorial is recorded using,
It is recommended to install R version 4.2.0 or higher. |
Show slide
Prerequisites |
To follow this tutorial, the learner should know
If not, please access the relevant tutorials on this website. |
Show slide
Machine Learning
|
About machine learning
|
Show Slide
Types of Machine Learning |
ML algorithms include the following types and tasks:
In this series, we will focus on Supervised and Unsupervised learning algorithms. |
Show Slide
Supervised and Unsupervised Learning
|
Supervised learning: Labeled data
Unsupervised learning: Unlabeled data
|
Show Slides
Classification and Regression |
|
Show Slides
Workflow of an ML Classifier algorithm |
The Workflow of an ML Classifier algorithm
|
Show Slide
Dataset |
Let’s use the Raisin dataset with two chosen variables to understand a classification problem.
For more information on the Raisin dataset, please refer to the Additional Reading Material on this tutorial page. |
Show slide
Download Files |
We will use the script file Intro.R and the Raisin dataset file ‘raisin.xlsx’.
Please download these files from the Code files link of this tutorial. Make a copy and then use them while practicing. |
[Computer screen]
Point to Intro.R and the folder Introduction. Point to the MLProject folder on the Desktop. |
I have downloaded and moved these files to the Introduction folder.
This folder is located in the MLProject folder on my Desktop. I have also set the Introduction folder as my working directory. In this tutorial, we will introduce classification on the Raisin dataset. |
Let us switch to RStudio. | |
Click Intro.R in RStudio
Point to Intro.R in RStudio. |
Let us open the script Intro.R in RStudio.
Script Intro.R opens in RStudio. |
[RStudio]
Highlight the command library(readxl) Highlight the command library(caret) Highlight the command library(ggplot2) #install.packages("package_name")
|
Select and run these commands to import the packages.
The ggplot2 package will be used to create the decision boundary plot. Please ensure that all the packages are installed correctly. As I have already installed the packages, I have imported them directly. |
[RStudio]
Highlight the command data<- read_xlsx("Raisin.xlsx") |
Run this command to load the Raisin dataset.
Drag boundary to see the Environment tab clearly. In the Environment tab below Data, you will see the data variable. Click on data to load the dataset in the Source window. Click on Intro.R in the Source window and close the tab. |
Highlight the command.
data <- data[c("minorAL", "ecc", "class")] data$class <- factor(data$class) Select the commands and click the Run button |
We now select three columns from data. Two columns ("minorAL" and "ecc") are chosen as features. The class column is chosen as the target variable. We convert the target variable data$class to a factor. Select and run the commands. |
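The selection and factor conversion can be tried on a small made-up data frame first. This is only a sketch: the values and the extra `area` column are hypothetical and are not taken from the Raisin dataset.

```r
# Hypothetical toy data standing in for the Raisin dataset
toy <- data.frame(
  area    = c(87524, 75166, 90856, 45928),
  minorAL = c(253.3, 243.0, 266.3, 208.8),
  ecc     = c(0.82, 0.80, 0.79, 0.68),
  class   = c("Kecimen", "Kecimen", "Besni", "Besni"),
  stringsAsFactors = FALSE
)

# Keep two features and the target; column names are given as quoted strings
toy <- toy[c("minorAL", "ecc", "class")]

# Convert the target from character to factor
toy$class <- factor(toy$class)

levels(toy$class)  # factor levels are sorted alphabetically
```

Note that an unquoted name inside c() would be looked up as a variable, which is why each column name is written in quotes.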
Click on the Environment tab.
Click on data. |
Click on data to load the modified data in the Source window. |
We will now understand the feature space of this data. | |
range_minor_al <- range(data$minorAL)
range_ecc <- range(data$ecc) |
In the Source window type these commands |
Highlight the command
range_minor_al <- range(data$minorAL) Highlight the command range_ecc <- range(data$ecc) |
These commands compute the range of the feature variables minorAL and ecc and store them in range_minor_al and range_ecc.
Select and run the commands.
|
X <- seq(min(data$minorAL), max(data$minorAL), length.out = 100)
Y <- seq(min(data$ecc), max(data$ecc), length.out = 100) feature <- expand.grid(minorAL = X, ecc = Y) |
We will now use the range to generate grid points to construct the feature space.
In the Source window type these commands |
Highlight
X <- seq(min(data$minorAL), max(data$minorAL), length.out = 100) Y <- seq(min(data$ecc), max(data$ecc), length.out = 100) Highlight feature <- expand.grid(minorAL = X, ecc = Y) |
These commands generate sequences of 100 evenly spaced points spanning the ranges of minorAL and ecc.
This command creates a Cartesian product of the two sequences to create a feature space. Select and run the commands. |
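To see what the grid looks like, here is a minimal sketch with only 3 points per feature instead of 100. The ranges used (200 to 300 and 0.5 to 0.9) are assumed values for illustration, not the actual ranges of the Raisin data.

```r
# Small grid over hypothetical ranges: 3 points per feature
X <- seq(200, 300, length.out = 3)   # 200, 250, 300
Y <- seq(0.5, 0.9, length.out = 3)   # approximately 0.5, 0.7, 0.9

# Every combination of X and Y: 3 * 3 = 9 rows
grid <- expand.grid(minorAL = X, ecc = Y)

nrow(grid)     # 9
head(grid, 3)  # minorAL varies fastest: 200, 250, 300 with ecc = 0.5
```

With length.out = 100 for both features, the same call produces 100 * 100 = 10000 grid points.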
ggplot(data = data, aes(x = minorAL, y = ecc)) +
geom_point(aes(color = class), size = 2) + scale_fill_manual(values = c("#ffff46", "#FF46e9")) + scale_color_manual(values = c("red", "blue")) + labs(title = "Feature Space") + theme_minimal() |
We will now plot the feature space created
|
ggplot(data = data, aes(x = minorAL, y = ecc)) +
geom_point(aes(color = class), size = 2) + scale_fill_manual(values = c("#ffff46", "#FF46e9")) + scale_color_manual(values = c("red", "blue")) + labs(title = "Feature Space") + theme_minimal() |
These commands plot the data points in the feature space.
Select and run the commands. |
Drag boundaries. | Drag boundaries to see the plot window clearly. |
Point to the data. | Now let us split our data into training and testing data. |
[RStudio]
set.seed(1)
|
Click on Intro.R in the Source window, and type these commands. |
Highlight the command
set.seed(1)
index_split<- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) |
Select the commands and run them. |
[RStudio]
train_data <- data[index_split, ]
|
In the Source window type these commands |
Highlight the command
train_data <- data[index_split, ] Highlight the command test_data <- data[-c(index_split), ] |
This creates training data, consisting of 630 unique rows.
|
Select the commands and click the Run button.
Click the train_data and test_data |
Select the commands and run them. The data sets are shown in the Environment tab. Drag boundary to see the Environment window clearly
|
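The split can be sketched on a made-up data frame of the same size. The data frame `toy` and its columns are hypothetical; only the splitting commands mirror the ones used above.

```r
set.seed(1)  # makes the random split reproducible

# Hypothetical data frame with 900 rows
toy <- data.frame(id = 1:900, value = sqrt(1:900))

# Draw 70% of the row numbers at random, without replacement
idx <- sample(1:nrow(toy), size = 0.7 * nrow(toy), replace = FALSE)

train <- toy[idx, ]    # the sampled rows
test  <- toy[-idx, ]   # every row that was not sampled

nrow(train)  # 630
nrow(test)   # 270
```

Because sampling is done without replacement, no row appears in both sets.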
Here we try to partition the feature space to construct the classifier.
To begin with, one might construct a heuristic line to build the classifier. | |
[RStudio]
fit = function(x)((x * (-0.0021)) + 1.445) model_predict <- function(x){ factor(ifelse(x$ecc < fit(x$minorAL), "Kecimen", "Besni")) } |
In the Source window, type these commands. |
Highlight the command
fit = function(x)((x * (-0.0021)) + 1.445) Highlight the command model_predict <- function(x){ factor(ifelse(x$ecc < fit(x$minorAL), "Kecimen", "Besni")) } Click Save and Click Run buttons. |
Let us describe the steps of the classification algorithm.
For that we will define a line to partition the data as a dummy classifier. The line is chosen by hand and does not involve the training data, so its performance may be poor. We define a function that separates data points belonging to either side of the line. Click Save. Select and run the commands. |
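The behaviour of the heuristic line can be checked on two made-up points, one on each side of it. The functions are the same as in the script; the data frame `pts` is hypothetical.

```r
# The heuristic line: ecc = -0.0021 * minorAL + 1.445
fit <- function(x) (x * (-0.0021)) + 1.445

# Points below the line are labelled "Kecimen", all others "Besni"
model_predict <- function(x) {
  factor(ifelse(x$ecc < fit(x$minorAL), "Kecimen", "Besni"))
}

# Two hypothetical points, one on each side of the line
pts <- data.frame(minorAL = c(200, 400), ecc = c(0.70, 0.90))

fit(pts$minorAL)     # height of the line at each point: 1.025 and 0.605
model_predict(pts)   # Kecimen, Besni
```

The first point lies below the line (0.70 < 1.025), so it is labelled Kecimen. The second lies above it (0.90 > 0.605), so it is labelled Besni.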
feature$class <- model_predict(feature)
feature$classnum <- as.numeric(feature$class) |
Let’s use the line to classify the feature space and draw the decision boundary.
In the Source window type these commands |
Highlight
feature$class <- model_predict(feature) Highlight feature$classnum <- as.numeric(feature$class) |
The first command uses the line created to predict the class of every point in the grid of the feature space. The second command encodes the class labels as numbers suitable for plotting. Select and run the commands. |
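As a small illustration of the encoding step, as.numeric on a factor returns the position of each label among the factor levels. The vector `labels` below is made up for this sketch.

```r
# Hypothetical class labels stored as a factor
labels <- factor(c("Kecimen", "Besni", "Besni"))

levels(labels)       # "Besni" "Kecimen" (alphabetical order)
as.numeric(labels)   # 2 1 1
```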
Click on feature in the Environment tab.
Point to the data in the Source window. |
Drag boundary to see the Environment window.
Click on feature in the Environment tab. The feature set with the predicted classes loads in the source window. |
ggplot() +
geom_raster(data= feature, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) + geom_point(data = data, aes(x = minorAL, y = ecc, color = class), size = 2) + geom_abline(slope = -0.0021, intercept = 1.445, size = 1.2)+ scale_fill_manual(values = c("#ffff46", "#FF46e9")) + scale_color_manual(values = c("red", "blue")) + labs(title = "Data Boundary") + theme_minimal() |
In the Source window type these commands |
Highlight the command
ggplot() + geom_raster(data= feature, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) + geom_point(data = data, aes(x = minorAL, y = ecc, color = class), size = 2) + geom_abline(slope = -0.0021, intercept = 1.445, size = 1.2)+ scale_fill_manual(values = c("#ffff46", "#FF46e9")) + scale_color_manual(values = c("red", "blue")) + labs(title = "Data Boundary") + theme_minimal() |
We are visualising the feature space and the partition line using ggplot2. Select and run the commands. |
Drag boundary to see the plot window. | Drag boundary to see the plot window clearly.
Overall, the plot shows that the chosen line approximately separates the two classes. |
prediction_test = model_predict(test_data) |
Let us see how well the partition performs on the testing dataset.
In the Source window type this command |
Highlight the command
prediction_test = model_predict(test_data) |
We predict the classes for the testing data and store them in the prediction_test variable. Select and run the command. |
Let us now measure the performance of the classification. | |
[RStudio]
test_confusion_matrix <- confusionMatrix(test_data$class,prediction_test) |
In the Source window, type the command |
Highlight the command
test_confusion_matrix <- confusionMatrix(test_data$class,prediction_test) Click on Save and Run buttons. |
We use the confusionMatrix function from the caret package to calculate performance metrics.
Select and run the command. |
test_confusion_matrix$overall["Accuracy"] | In the Source window, type this command |
Highlight
test_confusion_matrix$overall["Accuracy"] |
It fetches the accuracy metric from the list created
Select and run the command |
Drag boundary to see the console window clearly | |
Highlight
Accuracy 0.6962963 |
The accuracy on the testing dataset is about 69.6%. |
Drag boundary to see the source window clearly | Drag boundary to see the source window clearly
Let us now view the confusion matrix of the testing dataset |
[RStudio]
test_confusion_matrix$table |
In the Source window type this command |
Highlight the command
test_confusion_matrix$table Click on Save and Run buttons. |
Select and run the command.
The output is seen in the console window |
Point the output in the console window
          Reference
Prediction Besni Kecimen
   Besni      50      82
   Kecimen     0     138 |
Drag boundary to see the console window clearly
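Observe that the first argument, test_data$class, fills the rows and the second argument, prediction_test, fills the columns. So here the rows hold the actual classes and the columns hold the predicted classes. 82 samples of class Besni have been incorrectly classified as Kecimen. 0 samples of class Kecimen have been incorrectly classified. We can see that our partition line is skewed. |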
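The accuracy reported earlier can be checked by hand from this table, since correct classifications lie on the diagonal. A minimal sketch in base R, re-entering the four counts from the console output (the variable names are chosen only for this illustration):

```r
# Counts copied from the confusion matrix printed in the console
cm <- matrix(c(50, 0, 82, 138), nrow = 2,
             dimnames = list(Prediction = c("Besni", "Kecimen"),
                             Reference  = c("Besni", "Kecimen")))

correct  <- sum(diag(cm))   # 50 + 138 = 188
total    <- sum(cm)         # 270 test samples
accuracy <- correct / total

round(accuracy, 7)  # 0.6962963
```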
For the same problem many partitions can be drawn.
We can choose a complicated partition to reduce the training misclassification error. But this gives no control over the error on the test data. Instead, we can aim for a simple classifier with a small test misclassification error. | |
With this, we come to the end of this tutorial.
Let us summarize. | |
Show Slide
Summary |
In this tutorial we have learned about:
|
Here is an assignment for you. | |
Show Slide
Assignment |
|
Show slide
About the Spoken Tutorial Project |
The video at the following link summarizes the Spoken Tutorial project.
Please download and watch it. |
Show slide
Spoken Tutorial Workshops |
We conduct workshops using Spoken Tutorials and give certificates.
|
Show Slide
Spoken Tutorial Forum to answer questions Do you have questions in THIS Spoken Tutorial? Choose the minute and second where you have the question. Explain your question briefly. Someone from our team will answer them. Please visit this site. |
Please post your timed queries in this forum.
|
Show Slide
Forum to answer questions |
Do you have any general/technical questions?
Please visit the forum given in the link. |
Show Slide
R Activities |
The FOSSEE team coordinates the Textbook Companion, Lab Migration and the Case Study Projects.
We give certificates to those who do this. For more details, please visit the website. |
Show Slide
Acknowledgment |
The Spoken Tutorial project was established by the Ministry of Education, Govt. of India. |
Show Slide
Thank You |
This tutorial is contributed by Debatosh Chakraborty from IIT Bombay.
Thank you for joining. |