Machine-Learning-using-R/C2/Linear-Discriminant-Analysis-in-R/English

Revision as of 19:09, 28 May 2024 by Ushav (Talk | contribs)


Title of the script: Linear Discriminant Analysis in R

Author: YATE ASSEKE RONALD OLIVERA and Debatosh Charkraborty

Keywords: R, RStudio, machine learning, supervised, unsupervised, dimensionality reduction, confusion matrix, console, LDA, video tutorial.

Visual Cue Narration
Show slide

Opening Slide

Welcome to this spoken tutorial on Linear Discriminant Analysis in R.
Show slide

Learning Objectives

In this tutorial, we will learn about:
  1. Linear Discriminant Analysis (LDA) and its implementation.
  2. Assumptions of LDA
  3. Limitations of LDA
  4. LDA on a subset of Raisin dataset
  5. Visualization of the LDA separator and its corresponding confusion matrix.


Show slide

System Specifications

This tutorial is recorded using:
  • Windows 11
  • R version 4.3.0
  • RStudio version 2023.06.1

It is recommended to install R version 4.2.0 or higher.

Show slide.

Prerequisites

https://spoken-tutorial.org

To follow this tutorial, the learner should know:
  • Basics of R programming.
  • Basics of Machine Learning using R.

If not, please access the relevant tutorials on R on this website.

Show slide.

Linear Discriminant Analysis

Linear Discriminant Analysis is a statistical method.
  • It is used for classification.
  • It constructs a data-driven line that best separates different classes.
  • It classifies observations into two or more classes by maximizing a likelihood function.


Show slide.

Applications of LDA

  • LDA is used in several applications, such as:
    • Fraud detection
    • Bio-imaging classification
    • Classification of patient disease state
Only Narration. Let us now understand the assumptions of LDA.
Show Slide

Assumptions for LDA

Multivariate Normality:
  • All predictors are continuous and Gaussian within each class, and all classes share the same covariance matrix.
  • Mean vectors for each class are different.
  • Data records are independent and identically distributed within each class.
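
Under these assumptions, the classification rule takes a standard closed form. As a sketch, using notation introduced here and not taken from the tutorial slides: let $\pi_k$ be the prior probability of class $k$, $\mu_k$ its mean vector, and $\Sigma$ the covariance matrix shared by all classes. An observation $x$ is assigned to the class with the largest linear discriminant score.

```latex
\delta_k(x) = x^{\top}\Sigma^{-1}\mu_k
            - \tfrac{1}{2}\,\mu_k^{\top}\Sigma^{-1}\mu_k
            + \log \pi_k,
\qquad
\hat{y}(x) = \arg\max_{k}\ \delta_k(x)
```

Because $\Sigma$ is the same for every class, the term that is quadratic in $x$ cancels when two scores are compared. This is why the boundary between two classes is a straight line when there are two predictors.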
Show Slide

Limitations of LDA

Now we will see the limitations of LDA.
  • Departure from Gaussianity can increase misclassification probability in LDA.
  • LDA may perform poorly if the classes have unequal covariance matrices.
Show Slide

Implementation Of LDA

Now let us implement LDA on the raisin dataset with two chosen variables.

More information on raisin data is available in the Additional Reading material on this tutorial page.

Show slide

Download Files

We will use a script file LDA.R

Please download this file from the Code files link of this tutorial.

Make a copy and then use it for practicing.

[Computer screen]

Point to LDA.R and the folder LDA.

Point to the MLProject folder on the Desktop.


Point to the LDA folder.

I have downloaded and moved these files to the LDA folder.


This folder is in the MLProject folder on my Desktop.


I have also set the LDA folder as my working directory.

Point to the script file LDA.R. In this tutorial, we will create an LDA classifier model on the raisin dataset.


Let us switch to RStudio.

Open LDA.R in RStudio


Point to LDA.R in RStudio.

Open the script LDA.R in RStudio.

For this, click on the script LDA.R.

Script LDA.R opens in RStudio.

Highlight the command library(readxl)

Highlight the command library(MASS)

Highlight the command library(ggplot2)

Highlight the command library(caret)

Highlight the command library(lattice)

Highlight all the commands.

#install.packages("package_name")

The readxl package is used to load the Excel file.


The MASS package contains the lda() function that we will use for our analysis.


The ggplot2 package is used to plot the results of our analysis.


The caret package contains the confusionMatrix() function.


It is used as a measure for the performance of the classifier.


Please note that in order to import these libraries, we need to install them.


Please ensure that everything is installed correctly.


You can use the command install.packages("package_name") to install the required packages.


As I have already installed these packages, I will directly import them.
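
Before importing, it can help to check which of the packages are actually installed. The following is a minimal sketch, not part of LDA.R. The helper name missing_packages is hypothetical and introduced only for illustration. The install line is left commented out so that nothing is installed by accident.

```r
# Hypothetical helper (not from LDA.R): returns the packages in `pkgs`
# that are not installed on this machine.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# The packages imported in this tutorial.
required <- c("readxl", "MASS", "ggplot2", "caret", "lattice")
to_install <- missing_packages(required)
print(to_install)

# Uncomment the next line to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

An empty result means all the packages are available and the library() commands will work.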

[RStudio]

library(readxl)

library(MASS)

library(ggplot2)

library(caret)

library(lattice)


Select and run these commands to import the requisite packages.
Highlight the command

data <- read_xlsx("Raisin.xlsx")


Highlight the command data<-data[c("minorAL","ecc","class")]


Highlight the commands.

data <- read_xlsx("Raisin.xlsx")


data<-data[c("minorAL","ecc","class")]

We will read the Excel file and choose 3 columns: two features (minorAL, ecc) and one target variable (class).

Run these commands to import the raisin dataset.

Drag boundary to see the Environment tab clearly.

Point to the data variable in the Environment tab.

Click the data to load the dataset.

Drag boundary to see the Environment tab clearly.

In the Environment tab under Data heading, you will see a data variable.

Click the data variable to load the dataset in the Source window.

Drag boundary to see the Source window clearly.
[RStudio]

Type these commands in the source window.

data$class <- factor(data$class)

In the Source window type this command.
Highlight the below commands.

data$class <- factor(data$class)

Select the commands and click the Run button.

Here we are converting the variable data$class to a factor.

It ensures that the categorical data is properly encoded.

Select the command and run it.
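
To see what the conversion does, here is a small sketch on a toy character vector. The labels "Kecimen" and "Besni" are assumed here to be the class names in the raisin data. The tutorial itself does not list them.

```r
# Toy labels (assumed class names, not read from Raisin.xlsx).
class_labels <- c("Kecimen", "Besni", "Besni", "Kecimen")

# factor() stores the distinct labels as levels, sorted alphabetically.
class_factor <- factor(class_labels)

print(levels(class_factor))      # the distinct categories
print(as.integer(class_factor))  # internal integer code of each entry
print(table(class_factor))       # number of entries per category
```

Functions such as lda() rely on this encoding to treat the target as a set of categories and not as free text.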

Only Narration. Now we split our dataset into training and testing data.
[RStudio]

Type the command in the source window.

set.seed(1)

index_split=sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE)

In the Source window type these commands.
Highlight the command

set.seed(1)

Highlight the command sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE)

Highlight the command replace=FALSE

Select the commands and click the Run button.

First we set a seed for reproducible results.


We will create a vector of indices using sample() function.


The sampled indices cover 70% of the rows and are used for training; the remaining 30% are used for testing.


The training data is chosen using simple random sampling without replacement.

Select the commands and run them.

The vector is shown in the Environment tab.
Point to train-test split. We use the indices that we previously generated to obtain our train-test split.
[RStudio]

Type the command

train_data <- data[index_split, ]

test_data <- data[-c(index_split), ]

In the Source window type these commands.
Highlight the command

train_data <- data[index_split, ]

Highlight the command

test_data <- data[-c(index_split), ]

This creates training data, consisting of 630 unique rows.


This creates testing data, consisting of 270 unique rows.

Select the commands and click the Run button.


Point to the sets in the Environment Tab

Select the commands and run them.


The data sets are shown in the Environment tab.


Click on test_data and train_data to load them in the Source window.
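
The split can be checked on a toy data frame with the same number of rows. This is a sketch under assumptions: the data frame toy and its columns are made up for illustration, and round() is added as a guard in case 0.7 * nrow() is not a whole number for other dataset sizes.

```r
set.seed(1)

# Toy stand-in with 900 rows (not the raisin data).
toy <- data.frame(id = 1:900, value = rnorm(900))

# Sample 70% of the row indices without replacement.
n_train <- round(0.7 * nrow(toy))
idx <- sample(1:nrow(toy), size = n_train, replace = FALSE)

train <- toy[idx, ]   # rows whose index was sampled
test  <- toy[-idx, ]  # all remaining rows

print(nrow(train))
print(nrow(test))

# No row should appear in both sets.
print(length(intersect(train$id, test$id)))
```

Negative indexing with -idx drops the sampled rows, so the two sets are disjoint and together cover the whole dataset.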

Only Narration. Let us train our LDA model.
[RStudio]

LDA_model <- lda(class~.,data=train_data)

LDA_model

In the Source window, type these commands.
Highlight the command

LDA_model <- lda(class~.,data=train_data)

LDA_model


Highlight the command LDA_model


Click on Save and Run buttons.

Point to the output in the console window.

We pass two parameters to the lda() function.
  1. formula
  2. data on which the model should train.

Select the commands and run them.

The output is shown in the console window.

Drag boundary to see the console window. Drag boundary to see the console window clearly.
Highlight output in the console. Our model provides us with a lot of information.

Let us go through them one at a time.

Highlight the output Prior probabilities of groups.

Highlight the output Group means.

Highlight the output Coefficients of linear discriminants.

Prior probabilities of groups show the proportion of each class in the training dataset.


Group means display the mean value of each predictor variable for each class.


Coefficients of linear discriminants display the weights of the linear combination of predictor variables.


This linear combination forms the decision rule of the LDA model.
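
The three parts of the printed output can also be read directly from the fitted object. The sketch below uses two species of the built-in iris data as a stand-in, since Raisin.xlsx is not available outside this tutorial. The names two_class and fit are chosen here for illustration.

```r
library(MASS)

# Stand-in data: two iris species and two predictors (not the raisin data).
two_class <- droplevels(subset(iris, Species != "setosa",
                               select = c(Petal.Length, Petal.Width, Species)))

fit <- lda(Species ~ ., data = two_class)

print(fit$prior)    # proportion of each class in the training data
print(fit$means)    # mean of each predictor within each class
print(fit$scaling)  # coefficients of the linear discriminant (LD1)
```

With two classes there is a single linear discriminant, so fit$scaling has one column.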

Drag boundary to see the Source window. Drag boundary to see the Source window clearly.
Let us use this model to make predictions on the testing data.
[RStudio]

predicted_values <- predict(LDA_model, test_data)

In the Source window type this command and run it.

Let us check what predicted_values contains.

Click the predicted_values data in the Environment tab.


Point to the table.

Click the predicted_values data in the Environment tab.

The predicted_values table is loaded in the Source window.

[RStudio]

head(predicted_values$class)

head(predicted_values$posterior)

head(predicted_values$x)

In the Source window type these commands and run them.


The output is seen in the console window.

Highlight the command output of head(predicted_values$class) in the console.


Highlight the command output of head(predicted_values$posterior) in the console.


Highlight the command output of head(predicted_values$x) in console

predicted_values$class contains the class that the model has predicted for each observation.


predicted_values$posterior contains the posterior probability of the observation belonging to each class.

predicted_values$x contains the linear discriminant score for each observation.
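
The relation between these three components can be sketched as follows. The example again uses two iris species as a stand-in for the raisin data, and the variable names are chosen here for illustration.

```r
library(MASS)

# Stand-in data: two iris species and two predictors (not the raisin data).
two_class <- droplevels(subset(iris, Species != "setosa",
                               select = c(Petal.Length, Petal.Width, Species)))

set.seed(1)
idx <- sample(nrow(two_class), size = round(0.7 * nrow(two_class)))

fit  <- lda(Species ~ ., data = two_class[idx, ])
pred <- predict(fit, two_class[-idx, ])

print(head(pred$class))      # predicted class labels
print(head(pred$posterior))  # one probability column per class
print(head(pred$x))          # linear discriminant scores

# Each row of the posterior matrix sums to 1.
row_totals <- rowSums(pred$posterior)

# The predicted class is the column with the largest posterior probability.
most_likely <- colnames(pred$posterior)[max.col(pred$posterior, ties.method = "first")]
print(all(most_likely == as.character(pred$class)))
```

Looking at the posterior probabilities is useful when an observation lies close to the boundary, because both probabilities are then near 0.5.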

Only Narration. Now we will measure the performance of our model using the Confusion Matrix.
[RStudio]

confusion <- table(test_data$class, predicted_values$class)


fourfoldplot(confusion, color = c("red", "green"), conf.level = 0, margin=1)


Click on Save and Run buttons.

In the Source window type these commands.


Save and run the commands.

Highlight the command confusion <- table(test_data$class, predicted_values$class)

Highlight the command

fourfoldplot(confusion, color = c("red", "green"), conf.level = 0, margin=1)

The table() function creates a confusion matrix.


The fourfoldplot() function generates a visual plot of the confusion matrix.


The output is seen in the plot window.

Highlight the plot in plot window Drag boundary to see the plot window clearly.

Given the specific seed, set.seed(1), LDA has misclassified 33 out of 270 observations.

This number may change for different sets of training data.
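
The misclassification count can be turned into an accuracy figure. The sketch below first shows the calculation on a toy confusion matrix, where the labels "A" and "B" and the vectors are made up for illustration, and then applies the same arithmetic to the 33 out of 270 reported above.

```r
# Toy actual and predicted labels (hypothetical, for illustration only).
actual    <- factor(c("A", "A", "A", "B", "B", "B"))
predicted <- factor(c("A", "A", "B", "B", "B", "A"))

# Rows are actual classes, columns are predicted classes.
toy_confusion <- table(Actual = actual, Predicted = predicted)
print(toy_confusion)

# Correct predictions lie on the diagonal.
correct       <- sum(diag(toy_confusion))
misclassified <- sum(toy_confusion) - correct
accuracy      <- correct / sum(toy_confusion)
print(misclassified)
print(accuracy)

# The same arithmetic with the numbers reported in this tutorial.
tutorial_accuracy <- (270 - 33) / 270
print(round(tutorial_accuracy, 4))
```

With 33 errors in 270 test observations, the accuracy works out to about 0.878.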

Only Narration. Let us visualize how well our model separates different classes.
[RStudio]


X <- seq(min(train_data$minorAL), max(train_data$minorAL), length.out = 100)


Y <- seq(min(train_data$ecc), max(train_data$ecc), length.out = 100)


min_max <- expand.grid(minorAL = X, ecc = Y)


min_max$predicted_class <- predict(LDA_model, newdata = min_max)$class


grid <- expand.grid(minorAL = X, ecc = Y)

grid$class <- predict(LDA_model, newdata = grid)$class


grid$classnum <- as.numeric(grid$class)


Click on Save and Run buttons.

In the Source window, type these commands.


This block of code operates as a setup for visual plotting.


It builds a rectangular grid of coordinates spanning the range of the training data, together with the predicted class at each grid point.


The seq function generates a sequence of evenly spaced values within a range of smallest and largest values of 'minorAL' and 'ecc' variables from the training data.


The 'grid' variable contains the generated data including the prediction of the LDA_model on it.


The as.numeric function encodes the predicted class labels into numeric values.


Select the commands and run them.

Point to the Environment tab. Drag boundary to see the details in the Environment tab.


These variables contain the data for visualizing the decision boundary of the LDA model.

Click the grid data in the Environment tab.

The grid data table is loaded in the Source window.
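
How seq(), expand.grid() and as.numeric() work together can be seen on a very small grid. This is a sketch with made-up values. The rule used to label the grid points is hypothetical and only stands in for the predictions of the LDA model.

```r
# Three evenly spaced values on each axis.
x_vals <- seq(0, 1, length.out = 3)
y_vals <- seq(10, 20, length.out = 3)

# expand.grid() pairs every x value with every y value: 3 x 3 = 9 rows.
toy_grid <- expand.grid(x = x_vals, y = y_vals)
print(toy_grid)

# Hypothetical labels standing in for predict(LDA_model, ...)$class.
toy_grid$class <- factor(ifelse(toy_grid$x + toy_grid$y / 20 > 1.2, "B", "A"))

# as.numeric() on a factor returns the level codes: "A" -> 1, "B" -> 2.
toy_grid$classnum <- as.numeric(toy_grid$class)
print(toy_grid)
```

In the tutorial each axis has 100 values, so the grid has 10000 rows.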

[RStudio]


ggplot() +

geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 3) +

geom_point(data = min_max, aes(x = minorAL, y = ecc, color = predicted_class), size = 1, alpha = 0.3) +

theme_minimal()


ggplot() +

geom_raster(data=grid, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +

geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 2) +

geom_contour(data= grid, aes(x=minorAL, y=ecc, z = classnum), colour="black", linewidth = 1.2) +

scale_fill_manual(values = c("#ffff46", "#FF46e9")) +

scale_color_manual(values = c("red", "blue")) +

labs(title = "LDA Decision Boundary") +

theme_minimal()

In the Source window, type these commands.
Highlight the command

ggplot() +

geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 3) +

geom_point(data = min_max, aes(x = minorAL, y = ecc, color = predicted_class), size = 1, alpha = 0.3) +

theme_minimal()


ggplot() +

geom_raster(data=grid, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +

geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 2) +

geom_contour(data= grid, aes(x=minorAL, y=ecc, z = classnum), colour="black", linewidth = 1.2) +

scale_fill_manual(values = c("#ffff46", "#FF46e9")) +

scale_color_manual(values = c("red", "blue")) +

labs(title = "LDA Decision Boundary") +

theme_minimal()


Select the commands and run them.

This command creates the decision boundary plot.


It plots the grid points with colors indicating the predicted classes.

geom_raster creates a colour map indicating the predicted classes of the grid points.

geom_contour creates the decision boundary of the LDA.

The scale_color_manual function assigns specific colors to the classes and so does scale_fill_manual function.


The overall plot provides a visual representation of the decision boundary and the distribution of training data points of the model.


Select and run these commands.


Drag boundaries to see the plot window clearly.

Point to the output in the Plots window. We can see that our model has separated most of the data points clearly.
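
The reason a contour of the numeric class codes traces the decision boundary can be sketched without ggplot2. The codes take only the values 1 and 2, so a contour at a level between them, such as 1.5, runs exactly where the predicted class changes. The example uses two iris species as a stand-in for the raisin data and the base function contourLines(). The variable names are chosen here for illustration.

```r
library(MASS)

# Stand-in data: two iris species and two predictors (not the raisin data).
two_class <- droplevels(subset(iris, Species != "setosa",
                               select = c(Petal.Length, Petal.Width, Species)))
fit <- lda(Species ~ ., data = two_class)

# Grid spanning the range of both predictors.
X <- seq(min(two_class$Petal.Length), max(two_class$Petal.Length), length.out = 50)
Y <- seq(min(two_class$Petal.Width),  max(two_class$Petal.Width),  length.out = 50)
grid_pts <- expand.grid(Petal.Length = X, Petal.Width = Y)

# Numeric class codes (1 or 2) predicted at every grid point.
grid_pts$classnum <- as.numeric(predict(fit, newdata = grid_pts)$class)

# expand.grid() varies the first variable fastest, so the codes fill a
# matrix with one row per X value and one column per Y value.
z <- matrix(grid_pts$classnum, nrow = length(X), ncol = length(Y))

# The contour at level 1.5 lies where the code switches from 1 to 2.
boundary <- contourLines(X, Y, z, levels = 1.5)
print(length(boundary))
print(head(data.frame(x = boundary[[1]]$x, y = boundary[[1]]$y)))
```

The coordinates returned for the contour fall roughly along a straight line, as expected for a linear classifier.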
Only Narration. With this we come to the end of this tutorial.

Let us summarize.

Show Slide

Summary

In this tutorial we have learnt:
  • Linear Discriminant Analysis (LDA) and its implementation. 
  • Assumptions of LDA
  • Limitations of LDA
  • LDA on a subset of Raisin dataset
  • Visualization of the LDA separator and its corresponding confusion matrix


Now we will suggest an assignment for this Spoken Tutorial.
Show Slide

Assignment

  • Perform LDA on the built-in PlantGrowth dataset
  • Evaluate the model using a confusion matrix and visualize the results
Show slide

About the Spoken Tutorial Project

The video at the following link summarizes the Spoken Tutorial project.

Please download and watch it.

Show slide

Spoken Tutorial Workshops

We conduct workshops using Spoken Tutorials and give certificates.


Please contact us.

Show Slide

Spoken Tutorial Forum to answer questions.

Do you have questions in THIS Spoken Tutorial?

Choose the minute and second where you have the question. Explain your question briefly.

Someone from the FOSSEE team will answer them.

Please visit this site.

Please post your timed queries in this forum.
Show Slide

Forum to answer questions

Do you have any general/technical questions?

Please visit the forum given in the link.

Show Slide

Textbook Companion

The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.

We give certificates to those who do this.

For more details, please visit these sites.

Show Slide

Acknowledgment

The Spoken Tutorial project was established by the Ministry of Education, Govt. of India.
Show Slide

Thank You

This tutorial is contributed by Yate Asseke Ronald and Debatosh Chakraborthy from IIT Bombay.

Thank you for joining.

Contributors and Content Editors

Madhurig, Ushav