Machine-Learning-using-R - old 2022/C2/Supervised-Learning/English

Revision as of 17:30, 24 February 2023 by Madhurig (Talk | contribs)


Title of the script: Supervised Learning

Author: Sudhakar Kumar

Keywords: R, RStudio, machine learning, supervised learning, unsupervised, classification, Naive Bayes, confusion matrix, video tutorial.


Visual Cue Narration
Show Slide

Opening Slide

Welcome to this spoken tutorial on Supervised Learning.
Show Slide

Learning Objectives

In this tutorial, we will learn about:
  • Machine Learning and its types
  • Supervised learning
  • Classification model on iris data
  • Confusion matrix
Show Slide

System Specifications

This tutorial is recorded using:
  • Ubuntu Linux OS version 20.04
  • R version 4.1.2
  • RStudio version 1.4.1717

It is recommended to install R version 4.1.0 or higher.

Show Slide

Prerequisites


https://spoken-tutorial.org

To understand this tutorial, you should know:
  • Basics of R programming
  • Basics of Statistics


If not, please access the relevant tutorials on R on this website.

Show Slide

What is Machine Learning?

Now let us see what machine learning is.
  • ML is a science that enables computers to learn without being explicitly programmed
  • Its applications include self-driving cars, speech recognition, etc.
  • It is seen as a subset of Artificial Intelligence, also known as AI.
Show Slide

Classification of Machine Learning

ML is broadly classified into the following types:
  • Supervised learning,
  • Unsupervised learning,
  • Semi-supervised learning and
  • Reinforcement learning.


In this series, we will focus on Supervised and Unsupervised learning.

Show Slide

Iris Flower


Highlight the iris flower

Let us consider a flower named iris.

An image of this flower is shown here.

There are two key parts of an iris flower:

  • Sepal, and
  • Petal


One can measure the length and width of these two parts.

Show Slide

Species of an iris flower


Highlight the species of an iris flower

Based on these measurements, iris flowers can be grouped into three species:
  • Setosa
  • Versicolor
  • Virginica
Show Slide

Tabulating the Data

Consider a situation:
  • A botanist wants to distinguish the species of iris flowers.
  • She collects four features of some iris flowers:
    • Sepal length and Sepal width
    • Petal length and Petal width
Show Slide

Tabulating the Data

She asks an expert to label each of these flowers as one of the three species.


Show Slide

Download Files

For this tutorial, we will use:
  • A data set iris.csv
  • A script file irisModel.R


Please download these files from the Code files link of this tutorial.

Make a copy and then use them for practising.

[Computer screen]


Highlight irisModel.R and the folder SupervisedLearning

I have downloaded and moved these files to the SupervisedLearning folder.


This folder is located in the MLProject folder on my Desktop.


I have also set the SupervisedLearning folder as my Working Directory.

Cursor near irisModel.R file.

Let us switch to RStudio.

Double click on irisModel.R to open in RStudio

Point to irisModel.R in RStudio.

Let’s open the script irisModel.R in RStudio.


For this, double-click on the script irisModel.R


Script irisModel.R opens in RStudio.

Highlight irisModel.R in the Source window

Run this script by clicking on the Source button.


Highlight iris_data in the Source window

The iris data frame is displayed in the Source window.

Highlight 100 entries, 5 total columns at the bottom of the Source window

Here we can see five columns with 100 rows.

Highlight Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species in the Source window

The columns are Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species.

Highlight Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species in the Source window

The first four columns are the features of an iris flower.


The fifth column, Species, is the label of each iris flower.

Highlight Species column in the Source window

In the Source window, scroll down to locate the different Species.


Notice that this data set contains only two species of the iris flower, setosa and versicolor.


A typical iris dataset contains three different Species.
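
If iris.csv is not at hand, a similar two-species data frame can be sketched from the iris data set that ships with R. This is an assumption made for illustration only: the built-in data set has three species, so the sketch drops virginica, and the tutorial's own iris.csv may differ in row order or values.

```r
# Assumption: R's built-in iris data set stands in for the tutorial's iris.csv.
# Keep only setosa and versicolor, and drop the unused factor level.
iris_data <- droplevels(subset(iris, Species != "virginica"))

# Number of rows and columns.
dim(iris_data)

# Column names: four features followed by the label.
names(iris_data)

# Number of flowers recorded for each species.
species_counts <- table(iris_data$Species)
species_counts
```

The counts show whether the two species are equally represented, which is worth knowing before the data is split into training and test sets.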

Show Slide

Posing the Problem

Suppose that the botanist asks the following questions about the iris flower:
  • Can I build a model that learns from labels of known species?
  • Can this model accurately predict the species from its measurements?
Show Slide

Mapping of Features and Labels

We will map the dimensions of sepal and petal to iris species.


The classification model would work as a function as given below:


This mechanism is supervised learning.

Highlight the function

Show Slide

Supervised Learning

In Supervised learning,
  • The desired output labels are available for training datasets.
  • These labels can be called supervisors.


Show Slide

Supervised Learning

  • While learning, the model makes predictions using the given training dataset.
  • The model iteratively makes predictions on the training dataset.
  • The supervisor corrects the model.
Show Slide

Types of Supervised Learning

There are two types of supervised learning:

Regression and Classification.

  • Regression is applied to predict a continuous-valued output.
  • For example, predicting prices for the real estate sector.
Types of Supervised Learning
  • Classification is applied to predict a discrete-valued output.
  • For example, predicting the species of an iris flower.
Let’s build a classification model to predict the Species of an iris flower.

Here we will perform a 2-class classification.

The species which we will try to predict are setosa and versicolor.


For this task, we will apply a Naive Bayes classifier.

Let us switch to RStudio.
Highlight irisModel.R in the Source window

In the Source window, click on the script irisModel.R.

Libraries e1071 and caret.

install.packages() function.

Here we need to install and import libraries e1071 and caret.


These packages are needed to fit a Naive Bayes classifier and evaluate its performance.


To know more about these packages, please refer to Additional Reading Material.


I have already installed e1071 and caret. I will directly import these.


If you have not installed them, please install them using the install.packages() function.
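
One way to check which of the two packages still need installing is sketched below. The variable names are illustrative and are not part of irisModel.R, and the install line is left commented out because it needs an internet connection.

```r
# Packages used in this tutorial.
required_pkgs <- c("e1071", "caret")

# requireNamespace() returns FALSE when a package is not installed.
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # Uncomment the next line to install them (needs an internet connection).
  # install.packages(missing_pkgs)
} else {
  message("All required packages are installed.")
}
```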

[RStudio]

library(e1071)

library(caret)

Let us type the following commands at the top of the script.


Press Ctrl + S keys to save the script.

Highlight library(e1071) and

library(caret) in the Source window

Select the commands and click the Run button to load these libraries.
Highlight iris_data in the Source window

n <- nrow(iris_data)

Type the following command below the View(iris_data) command.


Using this command, we find the number of rows in iris_data.

[RStudio]

Type n_train <- round(0.80 * n)

We will now compute the number of data points to reserve for the training set.


Type the following command.


For this model, I will use 80% of the data points for training the model.


The remaining 20% of the data points will be used for testing the model.


To know more about splitting the dataset, please refer to Additional Reading Material.

Click on Save button.

Save the script.
[RStudio]


train_indices <- sample(1:n, n_train)


iris_train <- iris_data[train_indices, ]

Next, we will create a vector of indices.


Type the following commands.


It will be a random sample containing 80% of the row indices.


This vector will be used to extract the data points for the training set.
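
Because sample() draws at random, each run of the script can give a different training set. A minimal sketch of making the draw repeatable with set.seed() follows. The seed value 42 is an arbitrary choice and is not part of the tutorial's script, and n is set by hand to match the 100 rows seen earlier.

```r
n <- 100                      # number of rows, as seen in iris_data
n_train <- round(0.80 * n)    # number of data points for training

set.seed(42)                  # arbitrary seed (an assumption, not from the script)
first_draw <- sample(1:n, n_train)

set.seed(42)                  # same seed again
second_draw <- sample(1:n, n_train)

# The two draws are identical.
identical(first_draw, second_draw)

# sample() draws without replacement by default, so no index is repeated.
anyDuplicated(first_draw) == 0
```

Fixing the seed is useful when results need to be compared across runs. Without it, the accuracy reported later may vary slightly from one run to the next.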

iris_test <- iris_data[-train_indices, ]


Highlight the minus sign before train_indices

Now, we will create a test set.


Type the following command.


Note that there is a minus sign before train_indices.


It is to exclude the data points already used in the training set.
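
The effect of the minus sign can be seen on a small example. The vector and data frame below are made up for illustration and are not part of the tutorial's data.

```r
values <- c(10, 20, 30, 40, 50)
picked <- c(2, 4)

values[picked]     # elements at positions 2 and 4
values[-picked]    # every element except those at positions 2 and 4

# The same idea applied to the rows of a small made-up data frame.
toy <- data.frame(id = 1:5, value = values)
toy_train <- toy[picked, ]
toy_test  <- toy[-picked, ]

# No row appears in both sets, and together they cover all rows.
length(intersect(toy_train$id, toy_test$id)) == 0
nrow(toy_train) + nrow(toy_test) == nrow(toy)
```

Keeping the two sets disjoint matters because a model tested on rows it was trained on would appear more accurate than it really is.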

Highlight the Source button

Click Save and Run buttons.

Save the script and select the commands after View to the end.


Click on the Run button to execute the selected commands.

Drag the boundary.

I will drag the boundary to see the Environment tab clearly.

Highlight iris_train and iris_test in the Environment.

Click the train set and test set to load them in the Source window.


In the Source window, click on iris_train and iris_test to see the details.

Drag the boundary.

Now, we will train a classification model with a Naive Bayes classifier.


Again I will drag the boundary to see the Source window clearly.

[RStudio]

iris_model <- naiveBayes(formula = Species~., data = iris_train)


Highlight the above command

In the Source window, type the following command.


We will learn more about arguments in the upcoming tutorials in this series.


Save the script and run this line by pressing Ctrl + Enter keys together.
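
For numeric features, a Naive Bayes classifier of this kind treats each feature as normally distributed within each species and as independent of the other features. The base R sketch below illustrates that idea by hand. It is not the internal code of naiveBayes. It also assumes R's built-in iris data (setosa and versicolor only) and an arbitrary seed as stand-ins for the tutorial's files, and names such as predict_one are hypothetical.

```r
# Assumption: built-in iris data, two species only, stands in for iris.csv.
iris_data <- droplevels(subset(iris, Species != "virginica"))

set.seed(42)  # arbitrary seed, not part of the tutorial's script
n <- nrow(iris_data)
train_indices <- sample(1:n, round(0.80 * n))
iris_train <- iris_data[train_indices, ]
iris_test  <- iris_data[-train_indices, ]

features <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
classes  <- levels(iris_train$Species)

# Learning step: per-species prior, and mean and standard deviation of each feature.
priors <- table(iris_train$Species) / nrow(iris_train)
means  <- sapply(classes, function(k) {
  colMeans(iris_train[iris_train$Species == k, features])
})
sds    <- sapply(classes, function(k) {
  apply(iris_train[iris_train$Species == k, features], 2, sd)
})

# Prediction step: pick the species with the largest log-posterior score.
predict_one <- function(x) {
  scores <- sapply(classes, function(k) {
    log(priors[[k]]) + sum(dnorm(x, mean = means[, k], sd = sds[, k], log = TRUE))
  })
  classes[which.max(scores)]
}

manual_prediction <- factor(
  apply(as.matrix(iris_test[, features]), 1, predict_one),
  levels = classes
)

# Share of test flowers whose species was predicted correctly.
mean(as.character(manual_prediction) == as.character(iris_test$Species))
```

Working with log-probabilities and adding them, rather than multiplying raw probabilities, avoids very small numbers when there are many features.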

Highlight the iris_model command

Now, let's use the test set to evaluate the performance of the model created.
[RStudio]

class_prediction <- predict(object = iris_model, newdata = iris_test)

Type the following command.

Using this, we will predict the Species of the data points in the test set.

Highlight the class_prediction command

Save the script and run this line by pressing Ctrl + Enter keys together.

Highlight class_prediction in the Environment window

Now we can use class_prediction values to evaluate the performance of our model.


For this, we can use a confusion matrix.

Show Slide

Confusion Matrix

  • It is a performance measurement for ML classification problems.
  • In these classification problems, the output can be two or more classes.


To know more about Confusion matrix, please refer to Additional Reading Material.

Let us switch to RStudio.
[RStudio]

Highlight iris_model in the Source window

confusionMatrix(data = class_prediction,
                reference = as.factor(iris_test$Species))

Now, we will compute the confusion matrix to check the performance of this model.


In the Source window, type the following command.

Save the script and run this line by pressing Ctrl + Enter keys together.

Drag boundary.

Drag the boundary to see the Console window clearly.
[RStudio]

Highlight the Console window


Highlight the Confusion Matrix and Statistics

In the Console window, scroll up and locate the Confusion Matrix and Statistics.


The confusion matrix and its corresponding values are displayed.

[RStudio]

Highlight Reference in the Console window

Highlight Prediction in the Console window

Here, the Reference represents the actual values.


Prediction represents the predicted values.

[RStudio]

Highlight the figures in the confusion matrix on the Console window

Accuracy is the proportion of correct predictions: the sum of True Positives and True Negatives divided by the total number of points in the test set.


In this case, the accuracy of the model is 1.


The classification model correctly predicted the values for all the points in the test set.
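
How accuracy follows from the cells of a confusion matrix can be sketched in base R with table(). The labels below are made up for illustration and include one deliberate mistake. They are not output from the tutorial's model.

```r
# Made-up labels for illustration only; not output from the tutorial's model.
actual <- factor(c("setosa", "setosa", "versicolor",
                   "versicolor", "versicolor", "setosa"))
predicted <- factor(c("setosa", "versicolor", "versicolor",
                      "versicolor", "versicolor", "setosa"),
                    levels = levels(actual))

# Rows are predicted species, columns are actual species.
cm <- table(Prediction = predicted, Reference = actual)
cm

# Correct predictions lie on the diagonal of the matrix.
accuracy <- sum(diag(cm)) / sum(cm)
accuracy
```

Here five of the six made-up points are classified correctly, so the accuracy is 5/6. An accuracy of 1, as in the tutorial, means every off-diagonal cell is zero.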

With this, we come to the end of this tutorial.

Let us summarize.

Show Slide

Summary

In this tutorial, we have learnt about:
  • Machine Learning and its types
  • Supervised Learning
  • Classification model on iris data
  • Confusion Matrix
Show Slide

About the Spoken Tutorial Project

The video at the following link summarises the Spoken Tutorial project.

Please download and watch it.

Show Slide

Spoken Tutorial Workshops

We conduct workshops using Spoken Tutorials and give certificates.

Please contact us.

Show Slide

Spoken Tutorial Forum to answer questions

Do you have questions about THIS Spoken Tutorial?

Please visit this site.

Choose the minute and second where you have the question.

Explain your question briefly.

The FOSSEE project will ensure an answer.

You will have to register to ask questions.

Show Slide

Spoken Tutorial Forum for specific questions:

The Spoken Tutorial forum is for specific questions on this tutorial.

Please do not post unrelated and general questions on it.

This will help reduce the clutter.

With less clutter, we can use these discussions as instructional material.

Show Slide

Forum to answer questions

Do you have any general/technical questions?

Please visit the forum given in the link.

Show Slide

Textbook Companion

The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.

We give certificates to those who do this.

For more details, please visit these sites.

Show Slide

Acknowledgment

The Spoken Tutorial and FOSSEE projects are funded by the Ministry of Education, Govt. of India.
Show Slide

About the Contributors

This tutorial is contributed by Sudhakar Kumar and Madhuri Ganapathi from IIT Bombay.

Thank you for watching.

Contributors and Content Editors

Madhurig, Nancyvarkey