Title of the script: Quadratic Discriminant Analysis in R
Author: Yate Asseke Ronald Olivera and Debatosh Chakraborty
Visual Cue | Narration |
Show slide
Opening Slide |
Welcome to this spoken tutorial on Quadratic Discriminant Analysis in R |
Show slide
Learning Objectives |
In this tutorial, we will learn about:
|
Show slide
System Specifications |
This tutorial is recorded using,
It is recommended to install R version 4.2.0 or higher. |
Show slide
Prerequisites |
To follow this tutorial, the learner should know,
If not, please access the relevant tutorials on this website. |
Show slide
Quadratic Discriminant Analysis |
|
Show Slide
Differences between LDA and QDA |
Now let’s see the differences between LDA and QDA.
|
Show Slides
Assumptions for QDA
QDA assumes that each class has its own covariance matrix. |
Now let us see the assumptions of QDA.
QDA is used when the data in each class is multivariate Gaussian and each class has its own covariance matrix. |
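The idea of a separate covariance matrix per class can be checked directly in R. The following is a minimal sketch, assuming the built-in iris dataset as a stand-in (it is not the raisin data used in this tutorial); the names features and class_cov are illustrative only.

```r
# Illustrative stand-in data: the four numeric columns of iris.
features <- iris[, 1:4]

# split() groups the rows by class; cov() is then applied to each group,
# giving one covariance matrix per class.
class_cov <- lapply(split(features, iris$Species), cov)

# Compare two of the class-specific matrices.
round(class_cov$setosa, 3)
round(class_cov$virginica, 3)
```

If the printed matrices differ noticeably between classes, a model that allows class-specific covariances, such as QDA, may be a better fit than one that pools them.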
Show slide.
Limitations of QDA
|
These are the limitations of QDA |
Show slide.
Applications of QDA
|
QDA technique is used in several applications. |
Show Slide
Implementation Of QDA |
Let us implement QDA on the Raisin dataset with two chosen variables.
|
Show slide
Download Files |
We will use a script file QDA.R and the Raisin dataset ‘Raisin.xlsx’.
Please download these files from the Code files link of this tutorial. Make a copy and then use them while practicing. |
[Computer screen]
point to QDA.R and the folder QDA. Point to the MLProject folder on the Desktop. |
I have downloaded and moved these files to the QDA folder.
This folder is located in the MLProject folder on my Desktop. I have also set the QDA folder as my working directory. In this tutorial, we will create a QDA classifier model on the raisin dataset. |
Let us switch to RStudio. | |
Click QDA.R in RStudio
Point to QDA.R in RStudio. |
Let us open the script QDA.R in RStudio.
For this, click on the script QDA.R. Script QDA.R opens in RStudio. |
[RStudio]
Highlight the command library(readxl)
Highlight the command library(MASS)
Highlight the command library(caret)
Highlight the command library(ggplot2)
Highlight the command library(dplyr)
|
Select and run these commands to import the packages.
The MASS package contains the qda() function to create our classifier.
I have directly imported them. |
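Before calling library(), it can help to check that the packages are actually installed. This is a small sketch, not part of QDA.R; the variable names required, is_installed and missing_pkgs are illustrative.

```r
# Packages loaded at the top of QDA.R, as listed in this tutorial.
required <- c("readxl", "MASS", "caret", "ggplot2", "dplyr")

# requireNamespace() returns FALSE instead of raising an error when a
# package is not installed, so it is safe to use as a check.
is_installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Install before running QDA.R: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```

Any package reported as missing can be installed with install.packages() before the script is run.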
[RStudio]
data <- read_xlsx("Raisin.xlsx") |
Click on QDA.R in the Source window.
In the Source window type these commands. |
Highlight the command data <- read_xlsx("Raisin.xlsx") |
Run this command to load the Raisin dataset.
Then click on data to load the dataset in the Source window. |
[RStudio]
data <- data[c("minorAL", "ecc", "class")]
data$class <- factor(data$class) |
Click on QDA.R in the Source window and close the tab.
In the Source window type these commands |
Highlight the commands
data <- data[c("minorAL", "ecc", "class")]
data$class <- factor(data$class)
Select the commands and click the Run button. |
We now select three columns from data and convert the variable data$class to a factor.
Select and run the commands. |
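The two steps above, selecting columns by name and converting the label to a factor, can be tried on a tiny data frame. This is a sketch with made-up values; the data frame raisins and its area column are hypothetical and only mimic the shape of the raisin data.

```r
# Hypothetical stand-in for the raisin data, with one extra column.
# The numbers are invented for illustration.
raisins <- data.frame(
  area    = c(87524, 75166, 90856, 45928),
  minorAL = c(253.3, 243.0, 266.3, 208.8),
  ecc     = c(0.82, 0.80, 0.80, 0.68),
  class   = c("Kecimen", "Kecimen", "Besni", "Besni"),
  stringsAsFactors = FALSE
)

# Column names must be quoted strings inside c().
raisins <- raisins[c("minorAL", "ecc", "class")]

# qda() needs the response to be a factor, not plain text.
raisins$class <- factor(raisins$class)

str(raisins)
levels(raisins$class)
```

Note that every column name is quoted; an unquoted name such as ecc would be looked up as a variable and would stop the script with an error.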
Click on the Environment tab.
Click on data. |
Click on data to load the modified data in the Source window. |
Point to the data. | Now let us split our data into training and testing data. |
[RStudio]
set.seed(1)
index_split <- sample(1:nrow(data), size = 0.7 * nrow(data), replace = FALSE)
|
Click on QDA.R in the Source window. In the Source window type these commands |
Highlight the commands
set.seed(1)
index_split <- sample(1:nrow(data), size = 0.7 * nrow(data), replace = FALSE)
|
First we set a seed for reproducible results.
The training data is chosen using simple random sampling without replacement.
|
[RStudio]
train_data <- data[index_split, ]
test_data <- data[-c(index_split), ]
|
In the Source window type these commands |
Highlight the command
train_data <- data[index_split, ]
Highlight the command
test_data <- data[-c(index_split), ] |
This creates training data, consisting of 630 unique rows.
This creates testing data, consisting of 270 unique rows. |
Select the commands and click the Run button.
Point to the sets in the Environment tab.
Click train_data and test_data. |
Select the commands and run them. The data sets are shown in the Environment tab. Click on train_data and test_data to load them in the Source window. |
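The split described above can be sketched on a stand-in data frame of the same size. The data frame demo and its columns are hypothetical; only the row count of 900 is chosen to match the 630 and 270 rows mentioned in the narration.

```r
# Hypothetical 900-row data frame standing in for the raisin data.
set.seed(1)
demo <- data.frame(id = 1:900, x = rnorm(900))

# Draw 70% of the row indices without replacement.
index_split <- sample(1:nrow(demo), size = 0.7 * nrow(demo), replace = FALSE)

train_demo <- demo[index_split, ]   # rows that were sampled
test_demo  <- demo[-index_split, ]  # all remaining rows

nrow(train_demo)
nrow(test_demo)

# No row should appear in both sets.
length(intersect(train_demo$id, test_demo$id))
```

Because sampling is without replacement and the test set uses negative indexing, the two sets are disjoint and together cover every row.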
Let’s perform QDA on the training dataset. | |
[RStudio]
QDA_model <- qda(class ~ ., data = train_data)
QDA_model |
Click on QDA.R in the Source window.
In the Source window type these commands |
Highlight the command
QDA_model <- qda(class ~ ., data = train_data)
Highlight the command
QDA_model
Click the Save and Run buttons. |
We use this command to create the QDA model.
We pass two parameters to the qda() function: the formula and the data.
Click Save. Select and run the commands. The output is shown in the console window. |
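The same formula-plus-data call can be tried without the raisin file. This is a minimal sketch that assumes the built-in iris dataset as a substitute; iris_qda is an illustrative name.

```r
library(MASS)  # provides qda()

# Species ~ . uses every other column as a predictor, mirroring the
# class ~ . formula used above.
iris_qda <- qda(Species ~ ., data = iris)

# Printing the model shows the call, prior probabilities and group means.
iris_qda
```

The dot in the formula is shorthand for "all remaining columns", which is why the data was reduced to the two chosen predictors and the class label before fitting.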
Drag boundary to see the console window. | Drag boundary to see the console window. |
Point to the output in the console.
Highlight the output Prior probabilities of groups
Highlight the output Group means |
These are the parameters of our model.
The prior probabilities indicate the composition of classes in the training data. The group means indicate the mean values of the predictor variables for each class. |
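Both quantities can be recomputed by hand to see where they come from. The sketch below assumes the built-in iris dataset rather than the raisin data, and relies on qda() using class proportions as priors when no prior argument is given.

```r
library(MASS)

fit <- qda(Species ~ ., data = iris)

# Prior probabilities stored in the model ...
fit$prior
# ... match the class proportions in the training data.
prop.table(table(iris$Species))

# Group means stored in the model ...
fit$means
# ... match the per-class column means computed directly.
aggregate(. ~ Species, data = iris, FUN = mean)
```

On a balanced dataset such as iris the priors are equal; on the raisin training data they reflect whatever class mix the random split produced.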
Drag boundary to see the Source window. | Drag boundary to see the Source window. |
Let us now use our model to make predictions on test data. | |
[RStudio]
predicted_values <- predict(QDA_model, test_data)
predicted_values |
Click on QDA.R in the Source window. In the Source window type these commands |
Highlight the command
predicted_values <- predict(QDA_model, test_data)
Highlight the command
predicted_values
Click on the Save and Run buttons. |
Let’s use this command to predict the class variable from the test data using the trained QDA model.
This predicts the class and the posterior probabilities for the testing data. Select and run the commands. |
Click on predicted_values in the Environment tab.
Point to the output in the console.
Highlight class
Highlight posterior |
Click on predicted_values in the Environment tab.
This shows us that our predicted variable has two components. class contains the predicted classes of the testing data. posterior contains the posterior probability of an observation belonging to each class. |
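The structure of the prediction object can be explored on a small example. This sketch assumes the built-in iris dataset and an arbitrary 100-row training sample; idx, fit and pred are illustrative names.

```r
library(MASS)

set.seed(1)
idx  <- sample(nrow(iris), size = 100)      # rows used for training
fit  <- qda(Species ~ ., data = iris[idx, ])
pred <- predict(fit, iris[-idx, ])          # predict the held-out rows

names(pred)                      # the two components of the result
head(pred$class)                 # predicted class labels
head(round(pred$posterior, 3))   # one probability column per class

# Each row of posterior probabilities sums to 1.
range(rowSums(pred$posterior))

# The predicted class is the column with the largest posterior.
best <- colnames(pred$posterior)[apply(pred$posterior, 1, which.max)]
all(best == as.character(pred$class))
```

Looking at the posterior values is useful for spotting borderline observations, where the two largest probabilities are close to each other.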
Let us compute the accuracy of our model. | |
confusion <- confusionMatrix(test_data$class,predicted_values$class) | Click on QDA.R in the source window.
In the Source window type these commands |
Highlight the command
confusion <- confusionMatrix(test_data$class,predicted_values$class)
Point to confusion in the Environment tab.
Highlight the attribute table |
This command creates a confusion matrix list.
The list is created from the actual and predicted class labels of the testing data and is stored in the confusion variable. It helps to assess the classification model's performance and accuracy. Select and run the command. The confusion matrix list is shown in the Environment tab. Click confusion to load it in the Source window. The confusion list contains a component named table, which holds the required confusion matrix. |
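What a confusion matrix counts can be seen with base R alone. The sketch below uses eight made-up labels and table() instead of caret; actual, predicted and confusion_tab are illustrative names. As a side note, caret's confusionMatrix() takes the predicted classes as its first argument (data) and the true classes as its second (reference); overall accuracy is the same in either order.

```r
# Made-up labels for eight observations, for illustration only.
actual <- factor(c("Besni", "Besni", "Besni", "Kecimen",
                   "Kecimen", "Kecimen", "Kecimen", "Besni"))
predicted <- factor(c("Besni", "Kecimen", "Besni", "Kecimen",
                      "Kecimen", "Besni", "Kecimen", "Besni"),
                    levels = levels(actual))

# Rows are predictions, columns are the true labels.
confusion_tab <- table(Predicted = predicted, Actual = actual)
confusion_tab

# Correct predictions sit on the diagonal.
accuracy <- sum(diag(confusion_tab)) / sum(confusion_tab)
accuracy
```

The off-diagonal cells are the misclassified observations, which is what the plot in the next step colors differently.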
plot_confusion_matrix <- function(confusion_matrix){
  tab <- confusion_matrix$table
  tab <- as.data.frame(tab)
  tab$Prediction <- factor(tab$Prediction, levels = rev(levels(tab$Prediction)))
  tab <- tab %>%
    rename(Actual = Reference) %>%
    mutate(cor = if_else(Actual == Prediction, 1, 0))
  tab$cor <- as.factor(tab$cor)
  ggplot(tab, aes(Actual, Prediction)) +
    geom_tile(aes(fill = cor), alpha = 0.4) +
    geom_text(aes(label = Freq)) +
    scale_fill_manual(values = c("red", "green")) +
    theme_light() +
    theme(legend.position = "none", line = element_blank()) +
    scale_x_discrete(position = "top")
} |
Now let’s plot the confusion matrix from the table.
Click on QDA.R in the source window. In the Source window type these commands |
Highlight the command
tab <- confusion_matrix$table
Highlight the commands
tab <- as.data.frame(tab)
tab$Prediction <- factor(tab$Prediction, levels = rev(levels(tab$Prediction)))
tab <- tab %>%
  rename(Actual = Reference) %>%
  mutate(cor = if_else(Actual == Prediction, 1, 0))
tab$cor <- as.factor(tab$cor)
Highlight the command
ggplot(tab, aes(Actual, Prediction)) +
  geom_tile(aes(fill = cor), alpha = 0.4) +
  geom_text(aes(label = Freq)) +
  scale_fill_manual(values = c("red", "green")) +
  theme_light() +
  theme(legend.position = "none", line = element_blank()) +
  scale_x_discrete(position = "top")
} |
These commands create a function plot_confusion_matrix to display the confusion matrix from the confusion matrix list created earlier.
It fetches the confusion matrix table from the list. It creates a data frame from the table that is suitable for plotting with ggplot2. It plots the confusion matrix using this data frame. It represents correct and incorrect predictions using different colors. Select and run the commands. |
[RStudio]
plot_confusion_matrix(confusion) |
Click on QDA.R in the Source window.
In the Source window type these commands |
Highlight the command
plot_confusion_matrix(confusion)
Click on the Save and Run buttons. |
We use the plot_confusion_matrix() function created earlier to generate a visual plot of the confusion matrix stored in the confusion variable.
Select and run the command. The output is seen in the plot window. |
Point to the output in the plot window. | Drag boundary to see the plot window clearly.
Observe that: 22 samples of class Kecimen have been incorrectly classified. 11 samples of class Besni have been incorrectly classified. Overall, the model has misclassified only 33 out of 270 samples. We can say that our model performs well. |
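The overall figures quoted above translate into an error rate and an accuracy with a short calculation. This sketch uses only the two totals stated in the narration; the variable names are illustrative.

```r
total_samples <- 270   # size of the testing data
misclassified <- 33    # total misclassified samples reported above

error_rate <- misclassified / total_samples
accuracy   <- 1 - error_rate

# About 0.122 error rate and 0.878 accuracy.
round(c(error_rate = error_rate, accuracy = accuracy), 3)
```

An accuracy of roughly 88% on held-out data is what supports the statement that the model performs well.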
[RStudio]
grid <- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500),
                    ecc = seq(min(data$ecc), max(data$ecc), length = 500))
grid$class = predict(QDA_model, newdata = grid)$class
grid$classnum <- as.numeric(grid$class) |
Drag boundary to see the source window clearly.
In the Source window type these commands |
Highlight the command
grid <- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500), ecc = seq(min(data$ecc), max(data$ecc), length = 500))
grid$class = predict(QDA_model, newdata = grid)$class
grid$classnum <- as.numeric(grid$class)
|
This block of code first creates a grid of points spanning the range of the minorAL and ecc features in the dataset.
It stores the grid in a variable named grid. Then, it uses the QDA model to predict the class of each point in this grid. It stores these predictions as a new column class in the grid data frame. The as.numeric function converts the predicted class labels into numeric codes. The resulting grid of points and their predicted classes will be used to visualize the decision boundary of the QDA model. Select and run these commands. Click grid in the Environment tab to load the grid data frame in the Source window. |
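The grid-and-predict pattern can be tried at a smaller scale. This is a sketch under stated assumptions: it uses two classes and two features of the built-in iris dataset in place of the raisin data, and a 50 by 50 grid instead of 500 by 500; two_class, fit and demo_grid are illustrative names.

```r
library(MASS)

# Keep two classes and drop the unused factor level.
two_class <- droplevels(subset(iris, Species != "setosa"))
fit <- qda(Species ~ Sepal.Length + Petal.Width, data = two_class)

# expand.grid() builds every combination of the two sequences.
demo_grid <- expand.grid(
  Sepal.Length = seq(min(two_class$Sepal.Length), max(two_class$Sepal.Length),
                     length.out = 50),
  Petal.Width  = seq(min(two_class$Petal.Width), max(two_class$Petal.Width),
                     length.out = 50)
)

# Predict a class for every grid point, then encode it as 1 or 2.
demo_grid$class    <- predict(fit, newdata = demo_grid)$class
demo_grid$classnum <- as.numeric(demo_grid$class)

nrow(demo_grid)            # 50 * 50 grid points
table(demo_grid$classnum)  # how many points fall in each region
```

The grid column names must match the predictor names used when fitting the model, otherwise predict() cannot find the variables. The numeric codes follow the order of the factor levels.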
[RStudio]
ggplot() +
  geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +
  geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class)) +
  geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum), colour = "black", linewidth = 0.7) +
  scale_fill_manual(values = c("#ffff46", "#FF46e9")) +
  scale_color_manual(values = c("red", "blue")) +
  labs(x = "MinorAL", y = "ecc", title = "QDA Decision Boundary") +
  theme_minimal() |
Click on QDA.R in the Source window.
In the Source window type these commands |
Highlight the command
ggplot() +
  geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +
  geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class)) +
  geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum), colour = "black", linewidth = 0.7) +
  scale_fill_manual(values = c("#ffff46", "#FF46e9")) +
  scale_color_manual(values = c("red", "blue")) +
  labs(x = "MinorAL", y = "ecc", title = "QDA Decision Boundary") +
  theme_minimal()
|
We are creating the decision boundary plot using ggplot2.
This command plots the grid points with colors indicating the predicted classes. geom_raster creates a colour map of the predicted classes of the grid points. geom_point plots the training data points. geom_contour draws the decision boundary of the QDA model. The scale_fill_manual and scale_color_manual functions assign specific colors to the classes. The overall plot provides a visual representation of the decision boundary and of the distribution of the training data points. Select and run these commands. Drag boundaries to see the plot window clearly. |
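The same three layers (colored regions, boundary line, data points) can also be drawn with base R graphics. This is an alternative sketch, not the ggplot2 command used in the tutorial; it assumes two classes and two features of the built-in iris dataset, and all variable names are illustrative.

```r
library(MASS)

two_class <- droplevels(subset(iris, Species != "setosa"))
fit <- qda(Species ~ Sepal.Length + Petal.Width, data = two_class)

xs <- seq(min(two_class$Sepal.Length), max(two_class$Sepal.Length), length.out = 100)
ys <- seq(min(two_class$Petal.Width), max(two_class$Petal.Width), length.out = 100)
grid <- expand.grid(Sepal.Length = xs, Petal.Width = ys)

# expand.grid() varies its first argument fastest, so filling a matrix
# column by column gives z[i, j] = class code at (xs[i], ys[j]).
z <- matrix(as.numeric(predict(fit, newdata = grid)$class), nrow = length(xs))

# Colored regions, similar in role to geom_raster.
image(xs, ys, z, col = c("#ffff46", "#FF46e9"),
      xlab = "Sepal.Length", ylab = "Petal.Width",
      main = "QDA Decision Boundary")

# The boundary lies where the class code changes from 1 to 2.
contour(xs, ys, z, levels = 1.5, drawlabels = FALSE, add = TRUE)

# Training points, colored by their true class.
points(two_class$Sepal.Length, two_class$Petal.Width,
       col = c("red", "blue")[as.integer(two_class$Species)], pch = 19)
```

Drawing the contour at level 1.5 works because the class codes are 1 and 2, so the boundary is exactly where the predicted code switches.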
We can see that the decision boundary of our model is a curve rather than a straight line.
And our model has separated most of the data points clearly. | |
With this, we come to the end of this tutorial.
Let us summarize. | |
Show Slide
Summary |
In this tutorial we have learned about:
|
Here is an assignment for you. | |
Show Slide
Assignment |
This dataset can be found in the HDclassif package. Install the package and import the dataset using the data() command |
Show slide
About the Spoken Tutorial Project |
The video at the following link summarizes the Spoken Tutorial project.
Please download and watch it. |
Show slide
Spoken Tutorial Workshops |
We conduct workshops using Spoken Tutorials and give certificates.
|
Show Slide
Spoken Tutorial Forum to answer questions
Do you have questions in THIS Spoken Tutorial?
Choose the minute and second where you have the question.
Explain your question briefly.
Someone from the FOSSEE team will answer them.
Please visit this site.
|
Please post your timed queries in this forum. |
Show Slide
Forum to answer questions |
Do you have any general/technical questions?
Please visit the forum given in the link. |
Show Slide
Textbook Companion |
The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.
We give certificates to those who do this. For more details, please visit these sites. |
Show Slide
Acknowledgment |
The Spoken Tutorial project was established by the Ministry of Education, Govt. of India. |
Show Slide
Thank You |
This tutorial is contributed by Yate Asseke Ronald and Debatosh Chakraborty from IIT Bombay.
Thank you for joining. |