Machine-Learning-using-R - old 2022/C3/Linear-Discriminant-Analysis-in-R/English


Title of the script: Linear Discriminant Analysis in R

Author: Tanmay Srinath

Keywords: R, RStudio, machine learning, dimensionality reduction, LDA, confusion matrix, dataset, gaussian, Bayes classifier, Homoscedasticity, heteroscedastic, QDA, spoken tutorial, video tutorial.


Visual Cue Narration
Show Slide

Opening Slide

Welcome to this spoken tutorial on Linear Discriminant Analysis in R.
Show Slide

Learning Objectives

In this tutorial, we will learn about:
  • Linear Discriminant Analysis or LDA
  • Applications of LDA.
  • Assumptions of LDA.
  • Robustness of LDA.
  • LDA on the iris dataset.
Show Slide

System Specifications

This tutorial is recorded using:
  • Ubuntu Linux OS version 20.04
  • R version 4.1.2
  • RStudio version 1.4.1717

It is recommended to install R version 4.1.0 or higher.

Show Slide

Prerequisites

To follow this tutorial, the learner should know:
  • Basics of R programming.
  • Basics of Machine Learning using R.

If not, please access the relevant tutorials on R on this website.

Show Slide

Linear Discriminant Analysis

Linear Discriminant Analysis.
  • It finds linear combinations of features that separate two or more classes.
  • It is the optimal classifier when each class is gaussian and all classes share the same covariance matrix.
  • Under these conditions, it has the smallest possible misclassification error.
  • It is derived from the Bayes classifier.
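
For reference, the decision rule behind these points can be written out. This is the standard textbook form and is not shown on the tutorial's slides. The symbols are notation introduced here as an assumption: $\pi_k$ is the prior probability of class $k$, $\mu_k$ is the mean of class $k$, and $\Sigma$ is the covariance matrix shared by all classes.

```latex
\delta_k(x) = x^{\top}\Sigma^{-1}\mu_k - \tfrac{1}{2}\,\mu_k^{\top}\Sigma^{-1}\mu_k + \log \pi_k,
\qquad
\hat{y}(x) = \arg\max_{k}\, \delta_k(x)
```

Each $\delta_k(x)$ is linear in $x$, which is why the boundaries between classes are straight lines or planes.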
Show Slide

Applications of LDA

LDA is primarily a multi-class classifier.
Let us now understand the assumptions of LDA.
Show Slide

Assumptions of LDA

  • Multivariate normality: The predictors in each class follow a multivariate gaussian distribution.
  • Homoscedasticity: The variance and covariance structure is the same for all classes.
  • Independence: The records are independent of each other.
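
A rough way to look at the homoscedasticity assumption is to compare the covariance matrices of the classes. The sketch below does this informally on the built-in iris dataset; the object names cov_by_class and variances are chosen here for illustration, and this is an eyeball check, not a formal statistical test.

```r
# Sketch: inspect per-class covariance on the built-in iris data
data(iris)

# The four numeric predictors
predictors <- iris[, 1:4]

# One 4 x 4 covariance matrix per species
cov_by_class <- lapply(split(predictors, iris$Species), cov)

# Compare the variances (diagonal entries) across the three species
variances <- sapply(cov_by_class, diag)
print(round(variances, 3))
```

If the columns of this table differ a lot, the classes do not share one covariance structure.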
Show Slide

Robustness of LDA

  • LDA is robust to slight violations of these assumptions.
  • If the data is heteroscedastic, quadratic discriminant analysis or QDA should be used instead of LDA.
  • QDA estimates a separate covariance matrix for each class, so its decision boundaries are quadratic.
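
As a minimal sketch of the difference, the MASS package provides qda() alongside lda(), and both take the same formula and data arguments. The example below fits both on the full iris dataset and compares accuracy on that same data. This comparison is an illustration added here, not part of LDA.R, and accuracy on the training data is optimistic.

```r
library(MASS)
data(iris)

# Fit both models on the full dataset (illustration only)
lda_fit <- lda(Species ~ ., data = iris)
qda_fit <- qda(Species ~ ., data = iris)

# Fraction of rows whose predicted class matches the true species
lda_acc <- mean(predict(lda_fit, iris)$class == iris$Species)
qda_acc <- mean(predict(qda_fit, iris)$class == iris$Species)

print(c(LDA = lda_acc, QDA = qda_acc))
```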


Show Slide

LDA

Now let us implement LDA on the iris dataset.
Show Slide

Download Files

We will use a script file LDA.R

Please download this file from the Code files link of this tutorial.

Make a copy and then use it for practising.

[Computer screen]

Highlight LDA.R and the folder LDA

I have downloaded and moved this file to the LDA folder.


This folder is located in the MLProject folder on my Desktop.


I have also set the LDA folder as my working directory.

Show Slide

LDA Classifier Model

In this tutorial, we will create an LDA classifier model on the iris dataset.
Let us switch to RStudio.
Click LDA.R in RStudio.


Point to LDA.R in RStudio.

Open the script LDA.R in RStudio.


For this, double click on the script LDA.R


Script LDA.R opens in RStudio.

Highlight library(MASS)

Highlight library(e1071)


Highlight library(ggplot2)

Highlight library(caret)

The MASS package contains the lda() function that we use for our analysis.


The e1071 package is needed as a dependency of the caret package.


The ggplot2 package is used to plot the results of our analysis.


The caret package contains the confusionMatrix function.


Confusion Matrix is used to measure the performance of a model.


As I have already installed these packages, I will directly import them.


If you have not installed the libraries, please install them before importing.
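
One possible way to check which of the four packages are missing is sketched below. The variable names pkgs and missing_pkgs are chosen here for illustration. The install line is left commented out because it needs an internet connection.

```r
# Packages used in this tutorial
pkgs <- c("MASS", "e1071", "ggplot2", "caret")

# Names of the packages that are not yet installed
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
print(missing_pkgs)

# Uncomment to install whatever is missing:
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```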


[RStudio]

library(MASS)

library(e1071)

library(ggplot2)

library(caret)

Select and run these commands to import the requisite packages.
Highlight data(iris)


Double click in the Environment tab to load the iris dataset.

Run this command to import the iris dataset.

In the Environment tab, double click on iris values to load the data.


Then click the iris data to load the dataset in the Source window.

Point to the dataset. Now we split our dataset into training and testing data.
[RStudio]

set.seed(222)

trn_ind=sample(1:nrow(iris),

size=0.8*nrow(iris),

replace=FALSE)

Click on LDA.R in the Source window


In the Source window type these commands.

Highlight set.seed(222)


Highlight

sample(1:nrow(iris),

size=0.8*nrow(iris),replace=FALSE)


Highlight replace=FALSE

Select the commands and click the Run button.

First we set a seed for reproducible results.


We will create a vector of indices using the sample() function.

It will be an 80% random sample of the total number of rows.

We are sampling without replacement.

This is done so that the model doesn’t train on duplicate rows.


Select the commands and run them.

Point to Environment tab. The vector is shown in the Environment tab.
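
If you want to confirm what the seed and the sampling did, a small check such as the following can be run. It is a sketch added here, not part of LDA.R, and trn_ind_again is a name chosen for illustration.

```r
data(iris)

# Same sampling step as in the tutorial
set.seed(222)
trn_ind <- sample(1:nrow(iris), size = 0.8 * nrow(iris), replace = FALSE)

# Number of sampled indices
print(length(trn_ind))

# anyDuplicated() returns 0 when no index was drawn twice
print(anyDuplicated(trn_ind))

# Resetting the same seed reproduces the same indices
set.seed(222)
trn_ind_again <- sample(1:nrow(iris), size = 0.8 * nrow(iris), replace = FALSE)
print(identical(trn_ind, trn_ind_again))
```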
Point to Environment tab. We use the indices that we previously generated to obtain our train-test split.
[RStudio]

train <- iris[trn_ind, ]

test <- iris[-c(trn_ind), ]

In the Source window type these commands.
Highlight train <- iris[trn_ind, ]

Highlight test <- iris[-c(trn_ind), ]

This command creates training data, consisting of 120 unique rows.


This command creates testing data, consisting of 30 unique rows.

Select the commands and click the Run button.


Point to the sets in the Environment Tab

Select the commands and run them.


The data sets are shown in the Environment tab.

Click the test set and train set to load them in the Source window.
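
Besides viewing the two sets, the split can also be checked from the console. The sketch below, added here for illustration, confirms the sizes, checks that no row is shared, and counts the species in each set.

```r
data(iris)

# Same split as in the tutorial
set.seed(222)
trn_ind <- sample(1:nrow(iris), size = 0.8 * nrow(iris), replace = FALSE)
train <- iris[trn_ind, ]
test  <- iris[-trn_ind, ]

# Rows and columns of each set
print(dim(train))
print(dim(test))

# Number of rows that appear in both sets (should be zero)
shared_rows <- intersect(rownames(train), rownames(test))
print(length(shared_rows))

# Number of observations of each species in each set
print(table(train$Species))
print(table(test$Species))
```

A random split does not guarantee equal species counts, so it is worth looking at these tables before training.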

Cursor in the panel. Let us train our LDA model.
[RStudio]

lda_model <- lda(Species~., data=train)

lda_model

In the Source window, type these commands.
Highlight

lda_model <- lda(Species~., data=train)

Click on Save and Run buttons.

We pass two parameters to the lda() function.
  1. formula and
  2. data on which the model should train.

Save and run these commands.


The output is shown in the console window.

Drag boundary to see the console window. Drag boundary to see the console window clearly.
Highlight output in console. Our lda_model provides us a lot of information.

Let us go through them one at a time.

Highlight Prior probabilities of groups.


Highlight Group means.

Highlight Coefficients of linear discriminants.

Highlight Proportion of trace.

The prior probabilities show the proportion of each class in the training dataset.


The group means display the mean value of each predictor variable for each species.


The coefficients of linear discriminants display the weights of the predictor variables in each linear combination.


These linear combinations form the decision rule of the LDA model.


The proportion of trace displays the share of between-class separation achieved by each of the LD1 and LD2 functions.
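
The same pieces of the printed output can also be read directly from the model object. The sketch below, added here for illustration, uses the prior, means, scaling and svd components of the object returned by lda(). The name prop_trace is chosen here.

```r
library(MASS)
data(iris)

# Rebuild the training data and the model as in the tutorial
set.seed(222)
trn_ind <- sample(1:nrow(iris), size = 0.8 * nrow(iris), replace = FALSE)
train <- iris[trn_ind, ]
lda_model <- lda(Species ~ ., data = train)

# Prior probabilities of groups: class proportions in the training data
print(lda_model$prior)

# Group means: one row per species, one column per predictor
print(lda_model$means)

# Coefficients of linear discriminants: one column each for LD1 and LD2
print(lda_model$scaling)

# Proportion of trace, computed from the singular values
prop_trace <- lda_model$svd^2 / sum(lda_model$svd^2)
names(prop_trace) <- colnames(lda_model$scaling)
print(prop_trace)
```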

Drag boundary to see the Source window. Drag boundary to see the Source window.
Cursor in the window. Let us use this model to make predictions on the testing data.
[RStudio]

predicted_values <-

predict(lda_model, test)

In the Source window type this command and run it.

Let us check what predicted_values contains.

Click the predicted_values data in the Environment tab.


Point to the table.

Click the predicted_values data in the Environment tab.


The contents of predicted_values are loaded in the Source window.

[RStudio]

head(predicted_values$class)

head(predicted_values$posterior)

head(predicted_values$x)

In the Source window type these commands.


Save and run the commands.

The output is seen in the console window.

Drag boundary to see the console window clearly. Drag boundary to see the console window clearly.
Highlight output of

head(predicted_values$class) in console


Highlight output of

head(predicted_values$posterior) in console


Highlight output of

head(predicted_values$x) in console

The class component contains the species that the model has predicted for each observation.

The posterior component contains the posterior probability of the observation belonging to each class.

The x component contains the linear discriminant scores, LD1 and LD2, for each observation.
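
To see how the three components relate to each other, the following sketch may help. It is added here for illustration and is not part of LDA.R. The names pred, post, most_likely, centre and manual_x are chosen here. It checks that the predicted class is the one with the highest posterior probability, and that the scores are the centred predictors multiplied by the coefficients of linear discriminants.

```r
library(MASS)
data(iris)

# Rebuild the split and the model as in the tutorial
set.seed(222)
trn_ind <- sample(1:nrow(iris), size = 0.8 * nrow(iris), replace = FALSE)
train <- iris[trn_ind, ]
test  <- iris[-trn_ind, ]
lda_model <- lda(Species ~ ., data = train)
pred <- predict(lda_model, test)

# Posterior probabilities of one observation add up to 1
post <- pred$posterior
print(range(rowSums(post)))

# The predicted class is the column with the largest posterior
most_likely <- colnames(post)[apply(post, 1, which.max)]
print(all(most_likely == as.character(pred$class)))

# Scores: centre the predictors at the prior-weighted mean of the
# group means, then multiply by the discriminant coefficients
centre <- colSums(lda_model$prior * lda_model$means)
manual_x <- scale(as.matrix(test[, 1:4]), center = centre, scale = FALSE) %*%
  lda_model$scaling
print(all.equal(unname(manual_x), unname(pred$x)))
```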

Drag boundary to see the Source window clearly. Drag boundary to see the Source window clearly.
Cursor in the source window. Now we will measure the performance of our model using the Confusion Matrix.
[RStudio]

confusionMatrix(predicted_values$class, test$Species)


Click on Save and Run buttons.

In the Source window type this command.

Save and run the command.

Drag boundary to see the console window clearly. Drag boundary to see the console window clearly.
Highlight output in console Our model has misclassified just one observation.


This shows that our model classifies the testing data accurately.
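
The counts and the accuracy reported by confusionMatrix() can be cross-checked with base R. The sketch below is added here for illustration; conf_tab and accuracy are names chosen here. It builds the same table of predicted against actual species with table().

```r
library(MASS)
data(iris)

# Rebuild the split, the model and the predictions as in the tutorial
set.seed(222)
trn_ind <- sample(1:nrow(iris), size = 0.8 * nrow(iris), replace = FALSE)
train <- iris[trn_ind, ]
test  <- iris[-trn_ind, ]
lda_model <- lda(Species ~ ., data = train)
predicted_values <- predict(lda_model, test)

# Rows are predicted species, columns are actual species
conf_tab <- table(Predicted = predicted_values$class, Actual = test$Species)
print(conf_tab)

# Accuracy: correct predictions (the diagonal) over all predictions
accuracy <- sum(diag(conf_tab)) / sum(conf_tab)
print(accuracy)
```

Values off the diagonal of the table are the misclassified observations.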

Drag boundary to see the Source window clearly. Drag boundary to see the Source window clearly.
Let us visualise how well our model separates different classes.
[RStudio]

lda_plot <- cbind(train, predict(lda_model)$x)


ggplot(lda_plot, aes(LD1, LD2)) +

geom_point(aes(color = Species))

In the Source window, type these commands.
[RStudio]

Highlight

lda_plot <- cbind(train, predict(lda_model)$x)


Highlight

ggplot(lda_plot, aes(LD1, LD2)) +

geom_point(aes(color = Species))

Select the commands and run them.


Drag boundary to see the plot window.

This command creates the data for our plot.


It consists of the training data and the predicted linear discriminants.


We create a plot with axes LD1 and LD2.


Then colour the points according to the species.


Select and run these commands.


Drag boundaries to see the plot window clearly.

[RStudio]

Highlight output in Plots

We can see that our model has separated almost all the data points clearly.
With this we come to the end of this tutorial. Let us summarise.
Show Slide

Summary

In this tutorial we have learnt about:
  • Linear Discriminant Analysis or LDA
  • Applications of LDA.
  • Assumptions of LDA.
  • Robustness of LDA.
  • LDA on the iris dataset.
Show Slide

Assignment

Now we will suggest an assignment.

Perform LDA on the in-built PlantGrowth dataset.


Evaluate the model using a confusion matrix and visualise the results.

Show Slide

About the Spoken Tutorial Project

The video at the following link summarises the Spoken Tutorial project.

Please download and watch it.

Show Slide

Spoken Tutorial Workshops

We conduct workshops using Spoken Tutorials and give certificates.


For more details, please contact us.

Show Slide

Spoken Tutorial Forum to answer questions

Do you have questions in THIS Spoken Tutorial?

Choose the minute and second where you have the question. Explain your question briefly.

Someone from the FOSSEE team will answer them.

Please visit this site.

Please post your timed queries in this forum.
Show Slide

Forum to answer questions

Do you have any general or technical questions?

Please visit the forum given in the link.

Show Slide

Textbook Companion

The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.

We give certificates to those who do this.

For more details, please visit these sites.

Show Slide

Acknowledgment

The Spoken Tutorial and FOSSEE projects are funded by the Ministry of Education, Govt. of India.
Show Slide

Thank You

This tutorial is contributed by Tanmay Srinath and Madhuri Ganapathi from IIT Bombay.

Thank you for watching.

Contributors and Content Editors

Madhurig, Nancyvarkey