Difference between revisions of "Machine-Learning-using-R/C2/Linear-Discriminant-Analysis-in-R/English"
Latest revision as of 11:15, 4 June 2024
Title of the script: Linear Discriminant Analysis in R
Author: YATE ASSEKE RONALD OLIVERA and Debatosh Charkraborty
Keywords: R, RStudio, machine learning, supervised, unsupervised, dimensionality reduction, confusion matrix, console, LDA, video tutorial.
Visual Cue | Narration |
Show slide
Opening Slide |
Welcome to this spoken tutorial on Linear Discriminant Analysis in R. |
Show slide
Learning Objectives |
In this tutorial, we will learn about:
|
Show slide
System Specifications |
This tutorial is recorded using,
It is recommended to install R version 4.2.0 or higher. |
Show slide.
Prerequisites |
To follow this tutorial, the learner should know:
If not, please access the relevant tutorials on R on this website. |
Show slide.
Linear Discriminant Analysis |
Linear Discriminant Analysis is a statistical method.
|
Show slide.
Applications of LDA |
|
Only Narration | Let us now understand the assumptions of LDA. |
Show Slide
Assumptions for LDA |
Multivariate Normality:
|
Show Slide
Limitations of LDA |
Now we will see the limitations of LDA.
|
Show Slide
Implementation Of LDA |
Now let us implement LDA on the raisin dataset with two chosen variables.
More information on raisin data is available in the Additional Reading material on this tutorial page. |
Show slide
Download Files |
We will use the script file LDA.R.
Please download this file from the Code files link of this tutorial. Make a copy and then use it for practising. |
[Computer screen]
Point to LDA.R and the folder LDA. Point to the MLProject folder on the Desktop.
|
I have downloaded and moved these files to the LDA folder.
|
Point to the script file LDA.R. | In this tutorial, we will create an LDA classifier model on the raisin dataset.
|
Open LDA.R in RStudio
|
Open the script LDA.R in RStudio.
For this, click on the script LDA.R. Script LDA.R opens in RStudio. |
Highlight the readxl package.
Highlight the command library(MASS) Highlight the command library(ggplot2) Highlight the command library(caret) Highlight the command library(lattice) Highlight all the commands. #install.packages("package_name") |
The readxl package is used to load the Excel file.
The caret package provides the confusionMatrix function.
Please note that in order to import these libraries, we need to install them.
|
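Before running the library() calls, it can help to check which of the packages are already installed. The following is a minimal sketch of such a check. The variable names pkgs and not_installed are illustrative and are not part of LDA.R.

```r
# Packages imported by the script in this tutorial.
pkgs <- c("readxl", "MASS", "ggplot2", "caret", "lattice")

# requireNamespace() returns TRUE when a package can be loaded, FALSE otherwise.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
not_installed <- pkgs[!installed]

if (length(not_installed) > 0) {
  message("Not installed: ", paste(not_installed, collapse = ", "))
  # install.packages(not_installed)   # uncomment to install the missing ones
} else {
  message("All requisite packages are installed.")
}
```

The install.packages() call is left commented out so that nothing is downloaded unless the learner chooses to do so.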
[RStudio]
library(readxl) library(MASS) library(ggplot2) library(caret) library(lattice) |
Select and run these commands to import the requisite packages. |
Highlight the command
data <- read_xlsx("Raisin.xlsx")
|
We will read the Excel file and choose three columns: two features (minorAL, ecc) and one target variable (class).
Run these commands to import the raisin dataset. |
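The column selection described here can be sketched as follows. This is only an illustration: the data frame raw is a hypothetical stand-in for the result of read_xlsx("Raisin.xlsx"), its values are invented, and the extra column area is assumed for the example.

```r
# Hypothetical stand-in for the data frame returned by read_xlsx("Raisin.xlsx").
# The values below are invented for illustration only.
raw <- data.frame(
  area    = c(87524, 75166, 90856),
  minorAL = c(253.3, 243.0, 266.3),
  ecc     = c(0.82, 0.80, 0.79),
  class   = c("Kecimen", "Kecimen", "Besni")
)

# Keep only the two features and the target variable used in this tutorial.
data <- raw[, c("minorAL", "ecc", "class")]

# Inspect the structure of the reduced data frame.
str(data)
```

Selecting columns by name, rather than by position, keeps the code valid even if the column order in the Excel file changes.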
Drag boundary to see the Environment tab clearly.
Point to the data variable in the Environment tab. Click the data to load the dataset. |
Drag boundary to see the Environment tab clearly.
In the Environment tab under the Data heading, you will see a data variable. Click the data variable to load the dataset in the Source window. |
Drag boundary to see the Source window clearly. | Drag boundary to see the Source window clearly. |
[RStudio]
Type this command in the Source window. data$class <- factor(data$class) |
In the Source window type this command. |
Highlight the below commands.
data$class <- factor(data$class) Select the commands and click the Run button. |
Here we are converting the variable data$class to a factor.
It ensures that the categorical data is properly encoded. Select the command and run it. |
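To see what the conversion to a factor does, here is a small sketch on an invented three-row data frame. The class labels used are assumptions for illustration.

```r
# Invented example: the target column starts out as plain character strings.
data <- data.frame(class = c("Kecimen", "Besni", "Kecimen"),
                   stringsAsFactors = FALSE)
is.factor(data$class)      # character column, so this is FALSE

# Convert the target to a factor so that R treats it as categorical.
data$class <- factor(data$class)

levels(data$class)         # the distinct categories, sorted alphabetically
table(data$class)          # number of observations in each category
```

Functions such as lda() and table() rely on these levels to identify the groups.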
Only Narration. | Now we split our dataset into training and testing data. |
[RStudio]
Type the command in the source window. set.seed(1) index_split=sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) |
In the Source window type these commands. |
Highlight the command
set.seed(1) Highlight the command sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) Highlight the command replace=FALSE Select the commands and click the Run button. |
First we set a seed for reproducible results.
Select the commands and run them. |
The vector is shown in the Environment tab. | |
Point to train-test split. | We use the indices that we previously generated to obtain our train-test split. |
[RStudio]
Type the commands train_data <- data[index_split, ] test_data <- data[-c(index_split), ] |
In the Source window type these commands. |
Highlight the command
train_data <- data[index_split, ] Highlight the command test_data <- data[-c(index_split), ] |
This creates training data, consisting of 630 unique rows.
This creates testing data, consisting of 270 unique rows. |
Select the commands and click the Run button.
Point to the sets in the Environment Tab |
Select the commands and run them.
The data sets are shown in the Environment tab.
|
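The 70/30 split can be checked on a stand-in data frame. The sketch below assumes only that the dataset has 900 rows, which matches the 630 and 270 rows mentioned above. The column id is hypothetical and is used to confirm that the two sets do not overlap.

```r
# Stand-in data frame with 900 rows; id is a hypothetical row identifier.
data <- data.frame(id = 1:900)

# Fix the seed so that the same rows are sampled on every run.
set.seed(1)

# Draw 70% of the row indices without replacement.
index_split <- sample(1:nrow(data), size = 0.7 * nrow(data), replace = FALSE)

# Sampled rows form the training set; the remaining rows form the testing set.
train_data <- data[index_split, , drop = FALSE]
test_data  <- data[-index_split, , drop = FALSE]

nrow(train_data)                                  # 630 rows
nrow(test_data)                                   # 270 rows
length(intersect(train_data$id, test_data$id))    # 0, no row is in both sets
```

Because replace = FALSE, every index is drawn at most once, which is why the training rows are unique.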
Only Narration. | Let us train our LDA model. |
[RStudio]
LDA_model <- lda(class~.,data=train_data) LDA_model |
In the Source window, type these commands. |
Highlight the command
LDA_model <- lda(class~.,data=train_data) LDA_model
Point to the output in the console window. |
We pass two parameters to the lda() function.
Select the commands and run them. The output is shown in the console window. |
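The formula class~. means "use every other column as a predictor". As a minimal sketch on simulated data (the numbers are invented and are not the raisin measurements), the following shows that it gives the same model as naming the two predictors explicitly.

```r
library(MASS)   # provides lda()

# Simulated training data with the same column names as in the tutorial.
set.seed(1)
train_data <- data.frame(
  minorAL = c(rnorm(50, 230, 30), rnorm(50, 290, 30)),
  ecc     = c(rnorm(50, 0.75, 0.05), rnorm(50, 0.83, 0.05)),
  class   = factor(rep(c("Kecimen", "Besni"), each = 50))
)

# Formula with a dot: all columns other than class are predictors.
LDA_model <- lda(class ~ ., data = train_data)

# Equivalent formula with the predictors written out.
LDA_explicit <- lda(class ~ minorAL + ecc, data = train_data)

# Both models have the same discriminant coefficients.
all.equal(LDA_model$scaling, LDA_explicit$scaling)

LDA_model
```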
Drag boundary to see the console window. | Drag boundary to see the console window clearly. |
Highlight output in the console. | Our model provides us with a lot of information.
Let us go through them one at a time. |
Highlight the output Prior probabilities of groups.
Highlight the output Group means. Highlight the output Coefficients of linear discriminants |
These explain the distribution of classes in the training dataset.
|
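The three parts of the printed output can also be read directly from the fitted object. The sketch below uses simulated data with invented values. The components prior, means and scaling are those returned by lda() in the MASS package.

```r
library(MASS)   # provides lda()

# Simulated, balanced training data (50 observations per class).
set.seed(1)
train_data <- data.frame(
  minorAL = c(rnorm(50, 230, 30), rnorm(50, 290, 30)),
  ecc     = c(rnorm(50, 0.75, 0.05), rnorm(50, 0.83, 0.05)),
  class   = factor(rep(c("Kecimen", "Besni"), each = 50))
)
LDA_model <- lda(class ~ ., data = train_data)

# Prior probabilities of groups: the share of each class in the training data.
LDA_model$prior

# Group means: the mean of each predictor within each class.
LDA_model$means

# Coefficients of linear discriminants: with two classes there is one column, LD1.
LDA_model$scaling
```

With two classes, LDA produces a single discriminant, so the scaling matrix has one row per predictor and one column.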
Drag boundary to see the Source window. | Drag boundary to see the Source window clearly. |
Cursor in the Source window. | Let us use this model to make predictions on the testing data. |
[RStudio]
predicted_values <- predict(LDA_model, test_data) |
In the Source window type this command and run it.
Let us check what predicted_values contains. |
Click the predicted_values data in the Environment tab.
|
Click the predicted_values data in the Environment tab.
The predicted_values table is loaded in the Source window. |
[RStudio]
head(predicted_values$class) head(predicted_values$posterior) head(predicted_values$x) |
In the Source window type these commands and run them.
|
Highlight the command output of head(predicted_values$class) in the console.
|
It contains the class that the model has predicted for each observation.
This contains the linear discriminants for each observation. |
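How the three components of predicted_values relate to each other can be sketched on simulated data. The values are invented, and the split sizes below are chosen only for the example.

```r
library(MASS)   # provides lda() and its predict() method

# Simulated data with the tutorial's column names, split into train and test.
set.seed(1)
data <- data.frame(
  minorAL = c(rnorm(100, 230, 30), rnorm(100, 290, 30)),
  ecc     = c(rnorm(100, 0.75, 0.05), rnorm(100, 0.83, 0.05)),
  class   = factor(rep(c("Kecimen", "Besni"), each = 100))
)
index_split <- sample(1:nrow(data), size = 0.7 * nrow(data), replace = FALSE)
train_data <- data[index_split, ]
test_data  <- data[-index_split, ]

LDA_model <- lda(class ~ ., data = train_data)
predicted_values <- predict(LDA_model, test_data)

# class: one predicted label per test observation.
head(predicted_values$class)

# posterior: one column per class; each row sums to 1.
head(predicted_values$posterior)

# x: the linear discriminant score (LD1) of each observation.
head(predicted_values$x)

# The predicted label is the class with the largest posterior probability.
best <- colnames(predicted_values$posterior)[
  apply(predicted_values$posterior, 1, which.max)
]
all(best == as.character(predicted_values$class))
```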
Only Narration. | Now we will measure the performance of our model using the Confusion Matrix. |
[RStudio]
confusion <-table(test_data$class,predicted_values$class)
|
In the Source window type these commands.
|
Highlight the command confusion <- table(test_data$class, predicted_values$class)
Highlight the command fourfoldplot(confusion, color = c("red", "green"), conf.level = 0, margin=1) |
The table() function creates a confusion matrix.
|
Highlight the plot in plot window | Drag boundary to see the plot window clearly.
Given the specific seed (set.seed(1)), LDA has misclassified 33 out of 270 observations. This number may change for different sets of training data. |
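The number of misclassified observations and the accuracy can be read off a confusion matrix as sketched below. The vectors actual and predicted are invented for illustration and are not the tutorial's results.

```r
# Invented actual and predicted labels for six observations.
actual    <- factor(c("Besni", "Besni", "Kecimen", "Kecimen", "Kecimen", "Besni"))
predicted <- factor(c("Besni", "Kecimen", "Kecimen", "Kecimen", "Besni", "Besni"),
                    levels = levels(actual))

# Rows are the actual classes, columns are the predicted classes.
confusion <- table(actual, predicted)
confusion

# Correct predictions lie on the diagonal; everything else is misclassified.
misclassified <- sum(confusion) - sum(diag(confusion))
accuracy <- sum(diag(confusion)) / sum(confusion)
misclassified
accuracy

# The same arithmetic applied to the figures quoted in this tutorial:
# 33 misclassified out of 270 gives an accuracy of about 0.878.
(270 - 33) / 270
```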
Only Narration. | Let us visualize how well our model separates different classes. |
[RStudio]
X <- seq(min(train_data$minorAL), max(train_data$minorAL), length.out = 100)
grid$class <- predict(LDA_model, newdata = grid)$class
|
In the Source window, type these commands.
|
Point to the Environment tab. | Drag boundary to see the details in the Environment tab.
Click the grid data in the Environment tab. The grid data table is loaded in the Source window. |
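One way to build such a grid is sketched below on simulated data. This is an assumption about the approach, not the exact code in LDA.R: the vector Y, the use of expand.grid() and the way classnum is derived are illustrative choices.

```r
library(MASS)   # provides lda()

# Simulated training data with the tutorial's column names.
set.seed(1)
train_data <- data.frame(
  minorAL = c(rnorm(50, 230, 30), rnorm(50, 290, 30)),
  ecc     = c(rnorm(50, 0.75, 0.05), rnorm(50, 0.83, 0.05)),
  class   = factor(rep(c("Kecimen", "Besni"), each = 50))
)
LDA_model <- lda(class ~ ., data = train_data)

# 100 evenly spaced values across the range of each feature.
X <- seq(min(train_data$minorAL), max(train_data$minorAL), length.out = 100)
Y <- seq(min(train_data$ecc), max(train_data$ecc), length.out = 100)  # assumed name

# Every combination of X and Y: 100 x 100 = 10000 grid points.
grid <- expand.grid(minorAL = X, ecc = Y)

# Predict a class for each grid point.
grid$class <- predict(LDA_model, newdata = grid)$class

# Numeric version of the class (1 or 2), suitable for drawing a contour.
grid$classnum <- as.numeric(grid$class)

head(grid)
```

The column names of the grid must match the predictor names used to train the model, otherwise predict() cannot find them.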
[RStudio]
ggplot() + geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 3) + geom_point(data = min_max, aes(x = minorAL, y = ecc, color = predicted_class), size = 1, alpha = 0.3) + theme_minimal()
ggplot() + geom_raster(data=grid, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) + geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 2) + geom_contour(data= grid, aes(x=minorAL, y=ecc, z = classnum), colour="black", linewidth = 1.2) + scale_fill_manual(values = c("#ffff46", "#FF46e9")) + scale_color_manual(values = c("red", "blue")) + labs(title = "LDA Decision Boundary") + theme_minimal() |
In the Source window, type these commands. |
Highlight the command
ggplot() + geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 3) + geom_point(data = min_max, aes(x = minorAL, y = ecc, color = predicted_class), size = 1, alpha = 0.3) + theme_minimal()
ggplot() + geom_raster(data=grid, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) + geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 2) + geom_contour(data= grid, aes(x=minorAL, y=ecc, z = classnum), colour="black", linewidth = 1.2) + scale_fill_manual(values = c("#ffff46", "#FF46e9")) + scale_color_manual(values = c("red", "blue")) + labs(title = "LDA Decision Boundary") + theme_minimal()
|
This command creates the decision boundary plot.
geom_raster creates a colour map indicating the predicted classes of the grid points. geom_contour creates the decision boundary of the LDA. The scale_color_manual and scale_fill_manual functions assign specific colors to the classes.
|
Point to the output in the Plots window. | We can see that our model has separated most of the data points clearly. |
Only Narration | With this we come to the end of this tutorial.
Let us summarize. |
Show Slide
Summary |
In this tutorial we have learnt:
|
Now we will suggest an assignment for this Spoken Tutorial. | |
Show Slide
Assignment |
|
Show slide
About the Spoken Tutorial Project |
The video at the following link summarizes the Spoken Tutorial project.
Please download and watch it. |
Show slide
Spoken Tutorial Workshops |
We conduct workshops using Spoken Tutorials and give certificates.
|
Show Slide
Spoken Tutorial Forum to answer questions. Do you have questions in THIS Spoken Tutorial? Choose the minute and second where you have the question. Explain your question briefly. Someone from the FOSSEE team will answer them. Please visit this site. |
Please post your timed queries in this forum. |
Show Slide
Forum to answer questions |
Do you have any general/technical questions?
Please visit the forum given in the link. |
Show Slide
Textbook Companion |
The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.
We give certificates to those who do this. For more details, please visit these sites. |
Show Slide
Acknowledgment |
The Spoken Tutorial project was established by the Ministry of Education, Govt. of India. |
Show Slide
Thank You |
This tutorial is contributed by Yate Asseke Ronald and Debatosh Chakraborthy from IIT Bombay.
Thank you for joining. |