Difference between revisions of "Machine-Learning-using-R/C2/Quadratic-Discriminant-Analysis-in-R/English"


Latest revision as of 17:07, 5 June 2024

Title of the script: Quadratic Discriminant Analysis in R

Author: Yate Asseke Ronald Olivera and Debatosh Chakraborty

Keywords: R, RStudio, machine learning, supervised, unsupervised, QDA, quadratic discriminant analysis, video tutorial.


Visual Cue | Narration
Show slide

Opening Slide

Welcome to this spoken tutorial on Quadratic Discriminant Analysis in R
Show slide

Learning Objectives

In this tutorial, we will learn about:
  • Quadratic Discriminant Analysis (QDA).
  • Comparison between QDA and LDA.
  • Assumptions for QDA.
  • Applications of QDA
  • Implementation of QDA using Raisin Dataset.
  • Visualization of the QDA separator
  • Limitations of QDA
Show slide

System Specifications

This tutorial is recorded using,
  • Windows 11
  • R version 4.3.0
  • RStudio version 2023.06.1

It is recommended to install R version 4.2.0 or higher.

Show slide

Prerequisites

https://spoken-tutorial.org

To follow this tutorial, the learner should know
  • Basic programming in R.
  • Basics of Machine Learning.

If not, please access the relevant tutorials on this website.

Show slide

Quadratic Discriminant Analysis

  • Quadratic discriminant analysis is a statistical method used for classification.
  • QDA constructs a data-driven non-linear separator between two classes.
  • The covariance matrix for different classes is not necessarily equal.
  • A quadratic function describes the decision boundary between each pair of classes (a standard form is given in the note below).
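Editor's note (not part of the original narration), for reference: in standard textbook notation, QDA assigns an observation x to the class k that maximizes the quadratic discriminant function

\delta_k(x) = -\tfrac{1}{2}\log\lvert\Sigma_k\rvert - \tfrac{1}{2}(x-\mu_k)^{\top}\Sigma_k^{-1}(x-\mu_k) + \log\pi_k

where \mu_k is the class mean, \Sigma_k the class covariance matrix and \pi_k the prior probability of class k. Because \Sigma_k may differ across classes, the set of points where two discriminants are equal is quadratic in x, which gives the non-linear boundary mentioned above.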
Show Slide

Differences between LDA and QDA

Now let’s see the differences between QDA and LDA


  • LDA assumes that each class has the same covariance matrix.
  • QDA relaxes the assumption of an equal covariance matrix for all the classes.
  • LDA constructs a linear boundary, while QDA constructs a non-linear boundary.
  • When the covariance matrices of different classes are the same, QDA reduces to LDA.
Show Slides

Assumptions for QDA


QDA is primarily used when data is multivariate Gaussian.

QDA assumes that each class has its own covariance matrix.


Now let us see the assumptions of QDA.

QDA is used when data is multivariate Gaussian and each class has its own covariance matrix.
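Editor's sketch (not part of the original script): a quick way to eyeball the per-class covariance assumption in R, shown here on the built-in iris data since the Raisin data has not been loaded yet.

by(iris[, 1:2], iris$Species, cov)   # prints one covariance matrix per class; clearly unequal matrices favour QDA over LDA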

Show slide.

Applications of QDA

  • Medical Diagnosis.
  • Bio-Imaging classification.
  • Fraud Detection.
The QDA technique is used in several applications.
Show Slide

Implementation Of QDA

Let us implement QDA on the Raisin dataset with two chosen variables.

For more information on Raisin data please see the Additional Reading material on this tutorial page.

Show slide

Download Files

We will use a script file QDA.R and Raisin Dataset ‘raisin.xlsx’

Please download these files from the Code files link of this tutorial.

Make a copy and then use them while practising.

[Computer screen]

point to QDA.R and the folder QDA.

Point to the MLProject folder on the Desktop.

I have downloaded and moved these files to the QDA folder.

This folder is located in the MLProject folder on my Desktop.

I have also set the QDA folder as my working directory.
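Editor's sketch: the exact command depends on where you placed the folder; the path below is only an example.

setwd("~/Desktop/MLProject/QDA")   # hypothetical path; adjust to your own folder location
getwd()                            # confirm the current working directory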

In this tutorial, we will create a QDA classifier model on the raisin dataset.

Let us switch to RStudio.
Click QDA.R in RStudio

Point to QDA.R in RStudio.

Let us open the script QDA.R in RStudio.

For this, click on the script QDA.R.

Script QDA.R opens in RStudio.

[RStudio]

Highlight the command library(readxl)

Highlight the command library(MASS)

Highlight the command library(caret)

Highlight the command library(ggplot2)

library(dplyr)

#install.packages(“package_name”)

Point to the command.

Select and run these commands to import the packages.


We will use the readxl package to load the excel file of our Raisin Dataset.

The MASS package contains the qda() function to create our classifier.

We will use the caret package to create the confusion matrix.

The ggplot2 package will be used to create the decision boundary plot.

We will use the dplyr package to aid the visualisation of the confusion matrix.

Please ensure that all the packages are installed correctly.
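Editor's sketch: one way to install only the packages that are still missing before importing them.

pkgs <- c("readxl", "MASS", "caret", "ggplot2", "dplyr")
missing <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing) > 0) install.packages(missing)   # installs nothing if all are present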

As I have already installed the packages, I have directly imported them.

[RStudio]

data <- read_xlsx("Raisin.xlsx")

Highlight the command data <- read_xlsx("Raisin.xlsx")

Run this command to load the Raisin dataset.

Drag boundary to see the Environment tab clearly.

In the Environment tab below Data, you will see the data variable.

Then click on data to load the dataset in the Source window.

[Rstudio]

data$class <- factor(data$class)

Click on QDA.R in the Source window and close the tab.
Highlight the command.

data <- data[c("minorAL", "ecc", "class")]

data$class <- factor(data$class)

Select the commands and click the Run button

We now select three columns from data and convert the variable data$class to a factor.

Select and run the commands.

Click on the Environment tab.

Click on data.

Click on data to load the modified data in the Source window.
Point to the data.

Now let us split our data into training and testing data.
[RStudio]

set.seed(1)

index_split<- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE)

Click on QDA.R in the Source window.

In the Source window type these commands

Highlight the command

set.seed(1)

Highlight the command

index_split<- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE)

First we set a seed for reproducible results.


We will create a vector of indices using sample() function.


It will be 70% of the total number of rows for training and 30% for testing.

The training data is chosen using simple random sampling without replacement.

Select the commands and run them.

[RStudio]

train_data <- data[index_split, ]

test_data <- data[-c(index_split), ]

In the Source window type these commands.
Highlight the command

train_data <- data[index_split, ]

Highlight the command

test_data <- data[-c(index_split), ]

This creates training data, consisting of 630 unique rows.

This creates testing data, consisting of 270 unique rows.

Select the commands and click the Run button.


Point to the sets in the Environment Tab

Click the train_data and test_data

Select the commands and run them.

The data sets are shown in the Environment tab.

Click on train_data and test_data to load them in the Source window.
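Editor's sketch: a quick check that the split worked as described (630 training rows and 270 testing rows out of 900).

nrow(train_data)   # expected: 630
nrow(test_data)    # expected: 270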

Let’s perform QDA on the training dataset.
[Rstudio]


QDA_model <- qda(class~.,data=train_data)

Click on QDA.R in the Source window.

In the Source window type these commands

Highlight the command

QDA_model <- qda(class~.,data=train_data)

Highlight the command

QDA_model

Click the Save and Run buttons.

We use this command to create the QDA model.

We pass two parameters to the qda() function.

  1. formula
  2. data on which the model should train.

Click Save.

Select and run the commands.

The output is shown in the console window.

Point to the output in the console

Highlight the command Prior probabilities of groups

Highlight the command Group means

These are the parameters of our model.

This indicates the composition of classes in the training data.

These indicate the mean values of the predictor variables of each class.
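Editor's sketch: the same parameters can also be printed directly from the fitted object returned by qda().

QDA_model$prior   # prior probabilities, i.e. the class composition of the training data
QDA_model$means   # mean of minorAL and ecc for each class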

Drag boundary to see the Source window.
Let us now use our model to make predictions on test data.
[RStudio]

predicted_values <- predict(QDA_model, test_data)

predicted_values

In the Source window type these commands
Highlight the command

predicted_values <- predict(QDA_model, test_data)

Highlight the command

predicted_values

Click on Save and Run buttons.

Let’s use this command to predict the class variable from the test data using the trained QDA model.

This predicts the class and posterior probability for the testing data.

Select and run the commands.

Click on predicted_values in the Environment tab.

Point to the output in the console

Highlight the command class


Highlight the command posterior

Click on predicted_values in the Environment tab

This shows us that our predicted variable has two components.

class contains the predicted classes of the testing data.

Posterior contains the posterior probability of an observation belonging to each class.
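Editor's sketch: an optional quick look at the two components of predicted_values.

head(predicted_values$class)       # first few predicted class labels for the test data
head(predicted_values$posterior)   # posterior probability of each class for those rows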

Let us compute the accuracy of our model.
confusion <- confusionMatrix(test_data$class,predicted_values$class)

Click on QDA.R in the source window.

In the Source window type these commands

Highlight the command confusionMatrix(test_data$class,predicted_values$class)

Point to the confusion in the Environment Tab

Highlight the attribute

table

This command creates a confusion matrix list.

The list is created from the actual and predicted class labels of testing data and it is stored in the confusion variable.

It helps to assess the classification model's performance and accuracy.

Select and run the command.

The confusion matrix list is shown in the Environment tab.

Click confusion to load it in the Source window.

confusion list contains a component table containing the required confusion matrix.
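Editor's sketch: besides the table, the object returned by confusionMatrix() also carries summary statistics.

confusion$table                 # the confusion matrix itself
confusion$overall["Accuracy"]   # overall accuracy of the QDA model on the test data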

plot_confusion_matrix <- function(confusion_matrix){

tab <- confusion_matrix$table

tab = as.data.frame(tab)

tab$Prediction <- factor(tab$Prediction, levels = rev(levels(tab$Prediction)))

tab <- tab %>%

rename(Actual = Reference) %>%

mutate(cor = if_else(Actual == Prediction, 1,0))

tab$cor <- as.factor(tab$cor)

ggplot(tab, aes(Actual,Prediction)) +

geom_tile(aes(fill= cor),alpha = 0.4) + geom_text(aes(label=Freq)) +

scale_fill_manual(values = c("red","green")) +

theme_light() +

theme(legend.position = "None",

line = element_blank()) +

scale_x_discrete(position = "top")

}

Now let’s plot the confusion matrix from the table.

In the Source window type these commands

Highlight the command

tab <- confusion_matrix$table


Highlight the command

tab <- confusion_matrix$table

tab = as.data.frame(tab)

tab$Prediction <- factor(tab$Prediction, levels = rev(levels(tab$Prediction)))

tab <- tab %>%

rename(Actual = Reference) %>%

mutate(cor = if_else(Actual == Prediction, 1,0))

tab$cor <- as.factor(tab$cor)

Highlight the command

ggplot(tab, aes(Actual,Prediction)) +

geom_tile(aes(fill= cor),alpha = 0.4) + geom_text(aes(label=Freq)) +

scale_fill_manual(values = c("red","green")) +

theme_light() +

theme(legend.position = "None",

line = element_blank()) +

scale_x_discrete(position = "top")

}

These commands create a function plot_confusion_matrix to display the confusion matrix from the confusion matrix list created.

It fetches the confusion matrix table from the list.

It creates a data frame from the table which is suitable for plotting using GGPlot2.

It plots the confusion matrix using the data frame created.

It represents correct and incorrect predictions using different colors.

Select and run the commands.

[RStudio]

plot_confusion_matrix(confusion)

In the Source window type these commands
Highlight the command

plot_confusion_matrix(confusion)

Click on Save and Run buttons.

We are using the plot_confusion_matrix() function created above to generate a visual plot of the confusion matrix stored in the confusion variable.

Select and run the command.

The output is seen in the plot window.

Point to the output in the plot window.

Drag boundary to see the plot window clearly.


Observe that:

22 samples of class Kecimen have been incorrectly classified.

11 samples of class Besni have been incorrectly classified.

Overall, the model has misclassified only 33 out of 270 samples.
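Editor's note: 33 misclassified samples out of 270 correspond to an accuracy of (270 - 33) / 270, roughly 87.8% on the test data.

(270 - 33) / 270   # approximately 0.878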

[RStudio]

grid <- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500),

ecc = seq(min(data$ecc), max(data$ecc), length = 500))

grid$class = predict(QDA_model, newdata = grid)$class

grid$classnum <- as.numeric(grid$class)

In the Source window type these commands.
Highlight the command

grid <- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500),

ecc = seq(min(data$ecc), max(data$ecc), length = 500))

Highlight the command

grid$class = predict(QDA_model, newdata = grid)$class

grid$classnum <- as.numeric(grid$class)

This block of code first creates a grid of points spanning the range of minorAL and ecc features in the dataset.

It stores it in a variable 'grid'.

Then, it uses the QDA model to predict the class of each point in this grid.

It stores these predictions as a new column 'class' in the grid dataframe.

The as.numeric function encodes the predicted classes' string labels into numeric values.

The resulting grid of points and their predicted classes will be used to visualize the decision boundaries of the QDA model.
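Editor's note: with length = 500 in each direction, expand.grid produces 500 * 500 = 250,000 points, and one class is predicted for each grid point.

nrow(grid)   # 250000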

Select and run these commands.

Click grid on the Environment tab to load the grid dataframe in the source window.

[RStudio]

ggplot() +

geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +

geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class)) +

geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),

colour = "black", linewidth = 0.7) +

scale_fill_manual(values = c("#ffff46", "#FF46e9")) +

scale_color_manual(values = c("red", "blue")) +

labs(x = "MinorAL", y = "ecc", title = "QDA Decision Boundary") +

theme_minimal()

Click on QDA.R in the Source window.

In the Source window type these commands

Highlight the command


ggplot() +

geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +

geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class)) +

geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),

colour = "black", linewidth = 0.7) +

scale_fill_manual(values = c("#ffff46", "#FF46e9")) +

scale_color_manual(values = c("red", "blue")) +

labs(x = "MinorAL", y = "ecc", title = "QDA Decision Boundary") +

theme_minimal()

We are creating the decision boundary plot using ggplot2.

It plots the grid points with colors indicating the predicted classes.

geom_raster creates a colour map indicating the predicted classes of the grid points.

geom_point plots the training data points in the plot.

geom_contour creates the decision boundary of the QDA.

The scale_fill_manual function assigns specific colors to the classes, and so does the scale_color_manual function.

The overall plot provides a visual representation of the decision boundary and of the distribution of the training data points.

Select and run these commands.

Drag boundaries to see the plot window clearly.

Point to the plot.

We can see that the decision boundary of our model is non-linear.

And our model has separated most of the data points clearly.
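Editor's sketch: if you would like to keep the plot as an image file (not covered in this tutorial), ggplot2's ggsave() writes the most recently displayed plot to disk; the file name below is only an example.

ggsave("QDA_decision_boundary.png", width = 7, height = 5)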

Show slide.

Limitations of QDA

  • Multicollinearity among predictors may lead to poor performance.
  • The presence of outliers in data may also lead to poor performance.
These are the limitations of QDA
With this, we come to the end of this tutorial.

Let us summarize.

Show Slide

Summary

In this tutorial we have learned about:
  • Quadratic Discriminant Analysis (QDA).
  • Comparison between QDA and LDA.
  • Assumptions for QDA.
  • Applications of QDA
  • Implementation Of QDA using Raisin Dataset.
  • Visualization of the QDA separator
  • Limitations of QDA
Here is an assignment for you.
Show Slide

Assignment

  • Apply QDA on the wine dataset.
  • Measure the accuracy of the model.

This dataset can be found in the HDclassif package.

Install the package and import the dataset using the data() command
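Editor's sketch for getting started with the assignment, following the steps shown above; check the structure of the data with str() to identify the class label column before fitting qda().

install.packages("HDclassif")   # only needed once
library(HDclassif)
library(MASS)
data(wine)    # loads the wine data frame
str(wine)     # inspect the variables and identify the class label column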

Show slide

About the Spoken Tutorial Project

The video at the following link summarizes the Spoken Tutorial project.

Please download and watch it.

Show slide

Spoken Tutorial Workshops

We conduct workshops using Spoken Tutorials and give certificates.


Please contact us.

Show Slide

Spoken Tutorial Forum to answer questions

Do you have questions in THIS Spoken Tutorial?

Choose the minute and second where you have the question.

Explain your question briefly.

Someone from the FOSSEE team will answer them.

Please visit this site.


Please post your timed queries in this forum.
Show Slide

Forum to answer questions

Do you have any general/technical questions?

Please visit the forum given in the link.

Show Slide

Textbook Companion

The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.

We give certificates to those who do this.

For more details, please visit these sites.

Show Slide

Acknowledgment

The Spoken Tutorial project was established by the Ministry of Education Govt of India.
Show Slide

Thank You

This tutorial is contributed by Yate Asseke Ronald and Debatosh Chakraborty from IIT Bombay.

Thank you for joining.
