Machine-Learning-using-R/C2/Linear-Discriminant-Analysis-in-R/English
Title of the script: Linear Discriminant Analysis in R
Author: YATE ASSEKE RONALD OLIVERA and Debatosh Charkraborty
Keywords: R, RStudio, machine learning, supervised, unsupervised, dimensionality reduction, confusion matrix, console, LDA, video tutorial.
Visual Cue | Narration |
Show slide
Opening Slide |
Welcome to this spoken tutorial on Linear Discriminant Analysis in R. |
Show slide
Learning Objectives |
In this tutorial, we will learn about:
|
Show slide
System Specifications |
This tutorial is recorded using,
It is recommended to install R version 4.2.0 or higher. |
Show slide.
Prerequisites |
To follow this tutorial, the learner should know:
If not, please access the relevant tutorials on R on this website. |
Show slide.
Linear Discriminant Analysis |
Linear Discriminant Analysis is a statistical method.
|
Show slide.
Applications of LDA |
|
Only Narration | Let us now understand the assumptions of LDA. |
Show Slide
Assumptions for LDA |
Multivariate Normality:
|
Show Slide
Limitations of LDA |
Now we will see the limitations of LDA.
|
Show Slide
Implementation Of LDA |
Now let us implement LDA on the raisin dataset with two chosen variables.
More information on raisin data is available in the Additional Reading material on this tutorial page. |
Show slide
Download Files |
We will use a script file LDA.R
Please download this file from the Code files link of this tutorial. Make a copy and then use it for practising. |
[Computer screen]
Point to LDA.R and the folder LDA. Point to the MLProject folder on the Desktop.
|
I have downloaded and moved these files to the LDA folder.
|
Point to the script file LDA.R. | In this tutorial, we will create an LDA classifier model on the raisin dataset.
|
Open LDA.R in RStudio
|
Open the script LDA.R in RStudio.
For this, click on the script LDA.R. Script LDA.R opens in RStudio. |
Highlight the command library(readxl)
Highlight the command library(MASS) Highlight the command library(ggplot2) Highlight the command library(caret) Highlight the command library(lattice) Highlight all the commands. #install.packages("package_name") |
The readxl package is used to load the Excel file.
confusionMatrix function.
|
[RStudio]
library(readxl)
library(MASS)
library(ggplot2)
library(caret)
library(lattice)
|
Select and run these commands to import the requisite packages. |
Highlight the command
data <- read_xlsx("Raisin.xlsx")
|
We will read the Excel file and choose three columns: two features (minorAL and ecc) and one target variable (class).
Run these commands to import the raisin dataset. |
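The column selection described above can be sketched as follows. This is only an illustration: the data frame raisin_raw, its extra column area, and all values are hypothetical stand-ins for the table that read_xlsx("Raisin.xlsx") returns. Only the column names minorAL, ecc and class come from this script.

```r
# Hypothetical stand-in for the table returned by read_xlsx("Raisin.xlsx").
# The column 'area' and every value below are made up for illustration.
raisin_raw <- data.frame(
  area    = c(87524, 75166, 90856, 45928),
  minorAL = c(253.3, 243.0, 266.3, 208.8),
  ecc     = c(0.82, 0.80, 0.80, 0.68),
  class   = c("Kecimen", "Kecimen", "Besni", "Besni")
)

# Keep the two features and the target variable, in that order.
data <- raisin_raw[, c("minorAL", "ecc", "class")]

# Inspect the structure of the reduced data frame.
str(data)
```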
Drag boundary to see the Environment tab clearly.
Point to the data variable in the Environment tab. Click the data to load the dataset. |
Drag boundary to see the Environment tab clearly.
In the Environment tab under Data heading, you will see a data variable. Click the data variable to load the dataset in the Source window. |
Drag boundary to see the Source window clearly. | Drag boundary to see the Source window clearly. |
[RStudio]
Type these commands in the source window. data$class <- factor(data$class) |
In the Source window type this command. |
Highlight the below commands.
data$class <- factor(data$class) Select the commands and click the Run button. |
Here we are converting the variable data$class to a factor.
It ensures that the categorical data is properly encoded. Select the command and run it. |
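A small sketch of what factor() does to a character column is given below. The label values are assumptions used for illustration; the labels stored in Raisin.xlsx may differ.

```r
# Hypothetical class labels; the labels in Raisin.xlsx may differ.
labels_chr <- c("Kecimen", "Besni", "Kecimen", "Besni", "Besni")

# Convert the character vector to a factor.
labels_fac <- factor(labels_chr)

levels(labels_fac)      # the distinct categories
as.integer(labels_fac)  # the integer code stored for each observation
table(labels_fac)       # number of observations in each category
```

Functions such as lda() rely on these levels to identify the groups.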
Only Narration. | Now we split our dataset into training and testing data. |
[RStudio]
Type the command in the source window.
set.seed(1)
index_split=sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) |
In the Source window type these commands. |
Highlight the command
set.seed(1) Highlight the command sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) Highlight the command replace=FALSE Select the commands and click the Run button. |
First we set a seed for reproducible results.
Select the commands and run them. |
The vector is shown in the Environment tab. | |
Point to train-test split. | We use the indices that we previously generated to obtain our train-test split. |
[RStudio]
Type the command
train_data <- data[index_split, ]
test_data <- data[-c(index_split), ] |
In the Source window type these commands. |
Highlight the command
train_data <- data[index_split, ] Highlight the command test_data <- data[-c(index_split), ] |
This creates training data, consisting of 630 unique rows.
|
Select the commands and click the Run button.
|
Select the commands and run them.
|
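The split logic can be checked on a toy data frame, as sketched below. The data frame toy and its columns are hypothetical. The row count of 900 is inferred from the 630 training rows and 270 testing rows mentioned in this tutorial. The call to round() is an added guard so that the sample size is a whole number.

```r
set.seed(1)  # fix the random number generator for reproducible sampling

# Hypothetical data frame with 900 rows.
n   <- 900
toy <- data.frame(id = seq_len(n), value = rnorm(n))

# Draw 70% of the row indices without replacement.
idx <- sample(seq_len(n), size = round(0.7 * n), replace = FALSE)

train <- toy[idx, ]   # rows whose index was drawn
test  <- toy[-idx, ]  # all remaining rows

nrow(train)
nrow(test)

# No row appears in both sets.
length(intersect(train$id, test$id))
```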
Only Narration. | Let us train our LDA model. |
[RStudio]
LDA_model <- lda(class~.,data=train_data)
LDA_model |
In the Source window, type these commands. |
Highlight the command
LDA_model <- lda(class~.,data=train_data) LDA_model
Point to the output in the console window. |
We pass two parameters to the lda() function.
Select the commands and run them. The output is shown in the console window. |
Drag boundary to see the console window. | Drag boundary to see the console window clearly. |
Highlight output in the console. | Our model provides us with a lot of information.
Let us go through them one at a time. |
Highlight the output Prior probabilities of groups.
Highlight the output Group means. Highlight the output Coefficients of linear discriminants |
These explain the distribution of classes in the training dataset.
|
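The three parts of the printed model can also be accessed individually. The sketch below fits lda() on made-up two-class data, since the raisin file is not available here. The names toy, x1, x2 and toy_model are hypothetical.

```r
library(MASS)  # provides lda()

set.seed(1)

# Made-up data: two classes of 50 points each, with different means.
toy <- data.frame(
  x1    = c(rnorm(50, mean = 0), rnorm(50, mean = 3)),
  x2    = c(rnorm(50, mean = 0), rnorm(50, mean = 3)),
  class = factor(rep(c("A", "B"), each = 50))
)

toy_model <- lda(class ~ ., data = toy)

toy_model$prior    # prior probabilities of groups (class proportions)
toy_model$means    # group means of each feature
toy_model$scaling  # coefficients of linear discriminants
```

With two classes there is a single linear discriminant, so the scaling matrix has one column.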
Drag boundary to see the Source window. | Drag boundary to see the Source window clearly. |
Let us use this model to make predictions on the testing data. | |
[RStudio]
predicted_values <- predict(LDA_model, test_data) |
In the Source window type this command and run it.
Let us check what predicted_values contain. |
Click the predicted_values data in the Environment tab.
|
Click the predicted_values data in the Environment tab.
The predicted_values table is loaded in the Source window. |
[RStudio]
head(predicted_values$class)
head(predicted_values$posterior)
head(predicted_values$x) |
In the Source window type these commands and run them.
|
Highlight the command output of head(predicted_values$class) in the console.
|
It contains the class of raisin that the model has predicted for each observation.
This contains the linear discriminants for each observation. |
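The structure of the object returned by predict() can be explored as sketched below. The data are made up and the names toy, toy_model and pred are hypothetical. For brevity the model predicts on the same data it was trained on, which the tutorial itself does not do.

```r
library(MASS)  # provides lda() and its predict method

set.seed(1)

# Made-up two-class data.
toy <- data.frame(
  x1    = c(rnorm(50, mean = 0), rnorm(50, mean = 3)),
  x2    = c(rnorm(50, mean = 0), rnorm(50, mean = 3)),
  class = factor(rep(c("A", "B"), each = 50))
)

toy_model <- lda(class ~ ., data = toy)
pred      <- predict(toy_model, toy)

names(pred)           # the components of the prediction object
head(pred$class)      # predicted class for each observation
head(pred$posterior)  # posterior probability of each class
head(pred$x)          # linear discriminant score(s)

# The posterior probabilities of each observation add up to 1.
range(rowSums(pred$posterior))
```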
Only Narration. | Now we will measure the performance of our model using the Confusion Matrix. |
[RStudio]
confusion <- table(test_data$class, predicted_values$class)
fourfoldplot(confusion, color = c("red", "green"), conf.level = 0, margin = 1)
|
In the Source window type these commands.
|
Highlight the command confusion <- table(test_data$class, predicted_values$class)
Highlight the command fourfoldplot(confusion, color = c("red", "green"), conf.level = 0, margin=1) |
This command creates a confusion matrix.
|
Highlight the plot in plot window | Drag boundary to see the plot window clearly.
Given the specific seed, set.seed(1), LDA has misclassified 33 out of 270 observations. This number may change for different sets of training data. |
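Accuracy can be read off a confusion matrix as the share of observations on its diagonal. The sketch below uses made-up labels; the vectors actual and predicted are hypothetical. The last lines apply the same idea to the 33 misclassifications out of 270 reported above.

```r
# Made-up true and predicted labels for eight observations.
actual    <- factor(c("A", "A", "A", "B", "B", "B", "B", "A"))
predicted <- factor(c("A", "A", "B", "B", "B", "A", "B", "A"))

# Rows are the actual classes, columns are the predicted classes.
cm <- table(actual, predicted)
cm

# Correct predictions lie on the diagonal.
accuracy <- sum(diag(cm)) / sum(cm)
accuracy

# Accuracy implied by 33 misclassified observations out of 270.
reported_accuracy <- 1 - 33 / 270
round(reported_accuracy, 4)
```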
Only Narration. | Let us visualize how well our model separates different classes. |
[RStudio]
X <- seq(min(train_data$minorAL), max(train_data$minorAL), length.out = 100)
grid$class <- predict(LDA_model, newdata = grid)$class
|
In the Source window, type these commands.
|
Point to the Environment tab. | Drag boundary to see the details in the Environment tab.
Click the grid data in the Environment tab. The grid data table is loaded in the Source window. |
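One possible way to build such a prediction grid is sketched below. It is an assumption about the steps involved, not necessarily the exact code in LDA.R. The data are made up, and Y, expand.grid() and the classnum column are choices made for this illustration.

```r
library(MASS)  # provides lda()

set.seed(1)

# Made-up stand-in for train_data; the values are not from the raisin file.
toy <- data.frame(
  minorAL = c(rnorm(50, mean = 220, sd = 20), rnorm(50, mean = 280, sd = 30)),
  ecc     = c(rnorm(50, mean = 0.74, sd = 0.05), rnorm(50, mean = 0.82, sd = 0.04)),
  class   = factor(rep(c("A", "B"), each = 50))
)
toy_model <- lda(class ~ ., data = toy)

# 100 evenly spaced values along each feature axis.
X <- seq(min(toy$minorAL), max(toy$minorAL), length.out = 100)
Y <- seq(min(toy$ecc), max(toy$ecc), length.out = 100)

# Every combination of X and Y: 100 x 100 grid points.
grid <- expand.grid(minorAL = X, ecc = Y)

# Predicted class at each grid point, plus a numeric copy for contour plots.
grid$class    <- predict(toy_model, newdata = grid)$class
grid$classnum <- as.numeric(grid$class)

head(grid)
```

The column names of the grid must match the feature names used to train the model, otherwise predict() cannot find the variables.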
[RStudio]
ggplot() + geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 3) + geom_point(data = min_max, aes(x = minorAL, y = ecc, color = predicted_class), size = 1, alpha = 0.3) + theme_minimal()
ggplot() + geom_raster(data=grid, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) + geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 2) + geom_contour(data= grid, aes(x=minorAL, y=ecc, z = classnum), colour="black", linewidth = 1.2) + scale_fill_manual(values = c("#ffff46", "#FF46e9")) + scale_color_manual(values = c("red", "blue")) + labs(title = "LDA Decision Boundary") + theme_minimal() |
In the Source window, type these commands. |
Highlight the command
ggplot() + geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 3) + geom_point(data = min_max, aes(x = minorAL, y = ecc, color = predicted_class), size = 1, alpha = 0.3) + theme_minimal()
ggplot() + geom_raster(data=grid, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) + geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 2) + geom_contour(data= grid, aes(x=minorAL, y=ecc, z = classnum), colour="black", linewidth = 1.2) + scale_fill_manual(values = c("#ffff46", "#FF46e9")) + scale_color_manual(values = c("red", "blue")) + labs(title = "LDA Decision Boundary") + theme_minimal()
|
This command creates the decision boundary plot.
geom_raster creates a colour map indicating the predicted classes of the grid points. geom_contour draws the decision boundary of the LDA model. The scale_color_manual and scale_fill_manual functions assign specific colours to the classes.
|
Point to the output in the Plots window | We can see that our model has separated most of the data points clearly. |
Only Narration | With this we come to the end of this tutorial.
Let us summarize. |
Show Slide
Summary |
In this tutorial we have learnt:
|
Now we will suggest an assignment for this Spoken Tutorial. | |
Show Slide
Assignment |
|
Show slide
About the Spoken Tutorial Project |
The video at the following link summarizes the Spoken Tutorial project.
Please download and watch it. |
Show slide
Spoken Tutorial Workshops |
We conduct workshops using Spoken Tutorials and give certificates.
|
Show Slide
Spoken Tutorial Forum to answer questions. Do you have questions in THIS Spoken Tutorial? Choose the minute and second where you have the question. Explain your question briefly. Someone from the FOSSEE team will answer them. Please visit this site. |
Please post your timed queries in this forum. |
Show Slide
Forum to answer questions |
Do you have any general/technical questions?
Please visit the forum given in the link. |
Show Slide
Textbook Companion |
The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.
We give certificates to those who do this. For more details, please visit these sites. |
Show Slide
Acknowledgment |
The Spoken Tutorial project was established by the Ministry of Education, Govt. of India. |
Show Slide
Thank You |
This tutorial is contributed by Yate Asseke Ronald and Debatosh Chakraborthy from IIT Bombay.
Thank you for joining. |