Difference between revisions of "Machine-Learning-using-R/C2/Linear-Discriminant-Analysis-in-R/English"

Latest revision as of 11:15, 4 June 2024

Title of the script: Linear Discriminant Analysis in R

Author: YATE ASSEKE RONALD OLIVERA and Debatosh Charkraborty

Keywords: R, RStudio, machine learning, supervised, unsupervised, dimensionality reduction, confusion matrix, console, LDA, video tutorial.

Visual Cue	Narration
Show slide Opening Slide	Welcome to this spoken tutorial on Linear Discriminant Analysis in R.
Show slide Learning Objectives	In this tutorial, we will learn about: Linear Discriminant Analysis (LDA) and its implementation; assumptions of LDA; limitations of LDA; LDA on a subset of the Raisin dataset; visualization of the LDA separator and its corresponding confusion matrix.
Show slide System Specifications	This tutorial is recorded using Windows 11, R version 4.3.0, and RStudio version 2023.06.1. It is recommended to install R version 4.2.0 or higher.
Show slide. Prerequisites https://spoken-tutorial.org	To follow this tutorial, the learner should know: Basics of R programming. Basics of Machine Learning using R. If not, please access the relevant tutorials on R on this website.
Show slide. Linear Discriminant Analysis	Linear Discriminant Analysis is a statistical method used for classification. It constructs a data-driven line that best separates the different classes. It classifies observations into two or more classes by maximizing a likelihood function.
Show slide. Applications of LDA	The LDA technique is used in several applications, such as fraud detection, bio-imaging classification, and classification of patient disease state.
Only Narration	Let us now understand the assumptions of LDA.
Show Slide Assumptions for LDA	Multivariate normality: all features are continuous and Gaussian, with the same covariance matrix for every class. The mean vectors of the classes are different. Data records are independent and identically distributed within each class.
Show Slide Limitations of LDA	Now we will see the limitations of LDA. Departure from Gaussianity can increase the misclassification probability in LDA. LDA may perform poorly if the classes have unequal covariance matrices.
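The equal-covariance assumption can be inspected informally before fitting a model. The sketch below is an illustration only: it uses simulated data rather than the raisin dataset, and the names shared_cov, class_a, class_b and max_diff are hypothetical. It draws two Gaussian classes that share one covariance matrix and compares their sample covariance matrices.

```r
library(MASS)  # provides mvrnorm() for multivariate normal sampling

set.seed(1)
# Hypothetical two-class data that satisfies the LDA assumptions:
# Gaussian features, different mean vectors, one shared covariance matrix.
shared_cov <- matrix(c(1, 0.3, 0.3, 1), nrow = 2)
class_a <- mvrnorm(200, mu = c(0, 0), Sigma = shared_cov)
class_b <- mvrnorm(200, mu = c(2, 1), Sigma = shared_cov)

# Sample covariance matrix of each class. Under the equal-covariance
# assumption the two matrices should look similar.
cov_a <- cov(class_a)
cov_b <- cov(class_b)
print(round(cov_a, 2))
print(round(cov_b, 2))

# Largest absolute element-wise difference, as a rough informal check.
max_diff <- max(abs(cov_a - cov_b))
print(max_diff)
```

A large difference between the two matrices on real data would suggest that the equal-covariance assumption is doubtful. This is only a visual check, not a formal test.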
Show Slide Implementation Of LDA	Now let us implement LDA on the raisin dataset with two chosen variables. More information on raisin data is available in the Additional Reading material on this tutorial page.
Show slide Download Files	We will use the script file LDA.R. Please download this file from the Code files link of this tutorial. Make a copy and then use it for practising.
[Computer screen] Point to LDA.R and the folder LDA. Point to the MLProject folder on the Desktop. Point to the LDA folder.	I have downloaded and moved these files to the LDA folder. This folder is in the MLProject folder on my Desktop. I have also set the LDA folder as my working directory.
Point to the script file LDA.R.	In this tutorial, we will create an LDA classifier model on the raisin dataset. Let us switch to RStudio.
Open LDA.R in RStudio Point to LDA.R in RStudio.	Open the script LDA.R in RStudio. For this, click on the script LDA.R. Script LDA.R opens in RStudio.
Highlight the readxl package. Highlight the command library(MASS) Highlight the command library(ggplot2) Highlight the command library(caret) Highlight all the commands. #install.packages("package_name")	The readxl package is used to load the Excel file. The MASS package contains the lda() function that we will use for our analysis. The ggplot2 package is used to plot the results of our analysis. The caret package contains the confusionMatrix() function, which is used to measure the performance of the classifier. Please note that in order to import these libraries, we need to install them first. Please ensure that everything is installed correctly. You can use the command install.packages("package_name") to install the required packages. As I have already installed these packages, I will directly import them.
[RStudio] library(readxl) library(MASS) library(ggplot2) library(caret) library(lattice)	Select and run these commands to import the requisite packages.
Highlight the command data <- read_xlsx("Raisin.xlsx") Highlight the command data <- data[c("minorAL","ecc","class")] Highlight the commands. data <- read_xlsx("Raisin.xlsx") data <- data[c("minorAL","ecc","class")]	We will read the Excel file and choose three columns: two features (minorAL and ecc) and one target variable (class). Run these commands to import the raisin dataset.
Drag boundary to see the Environment tab clearly. Point to the data variable in the Environment tab. Click the data to load the dataset.	Drag boundary to see the Environment tab clearly. In the Environment tab under Data heading, you will see a data variable. Click the data variable to load the dataset in the Source window.
Drag boundary to see the Source window clearly.	Drag boundary to see the Source window clearly.
[RStudio] Type these commands in the source window. data$class <- factor(data$class)	In the Source window type this command.
Highlight the below commands. data$class <- factor(data$class) Select the commands and click the Run button.	Here we are converting the variable data$class to a factor. It ensures that the categorical data is properly encoded. Select the command and run it.
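To see what the factor conversion does, here is a minimal sketch on a made-up character vector. The labels "A" and "B" and the names class_labels and class_factor are hypothetical and are not taken from the raisin dataset.

```r
# Hypothetical class labels stored as plain character strings.
class_labels <- c("B", "A", "B", "A", "A")

# factor() records the distinct labels as levels and stores
# each observation as an integer code pointing to its level.
class_factor <- factor(class_labels)

levels(class_factor)      # the distinct classes, sorted alphabetically
as.integer(class_factor)  # the underlying integer codes
table(class_factor)       # number of observations per class
```

Functions such as lda() use these levels to identify the groups to be separated.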
Only Narration.	Now we split our dataset into training and testing data.
[RStudio] Type the command in the source window. set.seed(1) index_split=sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE)	In the Source window type these commands.
Highlight the command set.seed(1) Highlight the command sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) Highlight the command replace=FALSE Select the commands and click the Run button.	First we set a seed for reproducible results. We will create a vector of indices using the sample() function. It selects 70% of the row indices for training; the remaining 30% will be used for testing. The training data is chosen using simple random sampling without replacement. Select the commands and run them.
	The vector is shown in the Environment tab.
Point to train-test split.	We use the indices that we previously generated to obtain our train-test split.
[RStudio] Type the command train_data <- data[index_split, ] test_data <- data[-c(index_split), ]	In the Source window type these commands.
Highlight the command train_data <- data[index_split, ] Highlight the command test_data <- data[-c(index_split), ]	The first command creates the training data, consisting of 630 unique rows. The second command creates the testing data, consisting of 270 unique rows.
Select the commands and click the Run button. Point to the sets in the Environment Tab	Select the commands and run them. The data sets are shown in the Environment tab. Click on test_data and train_data to load them in the Source window.
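The split can be tried out without the Excel file. The sketch below assumes a hypothetical 900-row data frame called toy_data with random values; only the column names and the splitting commands follow the tutorial.

```r
set.seed(1)
# Hypothetical stand-in for the raisin data: 900 rows, two features, one class.
toy_data <- data.frame(
  minorAL = rnorm(900),
  ecc     = rnorm(900),
  class   = factor(rep(c("A", "B"), each = 450))
)

# Draw 70% of the row indices without replacement.
n <- nrow(toy_data)
index_split <- sample(1:n, size = 0.7 * n, replace = FALSE)

# Rows at the sampled indices form the training set;
# negative indexing keeps all the other rows for testing.
train_data <- toy_data[index_split, ]
test_data  <- toy_data[-index_split, ]

nrow(train_data)
nrow(test_data)

# Number of rows that appear in both sets.
length(intersect(rownames(train_data), rownames(test_data)))
```

Because sampling is done without replacement, no row can land in both the training and the testing set.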
Only Narration.	Let us train our LDA model.
[RStudio] LDA_model <- lda(class~.,data=train_data) LDA_model	In the Source window, type these commands.
Highlight the command LDA_model <- lda(class~.,data=train_data) Highlight the command LDA_model Click on Save and Run buttons. Point to the output in the console window.	We pass two parameters to the lda() function: the formula, and the data on which the model should train. Select the commands and run them. The output is shown in the console window.
Drag boundary to see the console window.	Drag boundary to see the console window clearly.
Highlight output in the console.	Our model provides us with a lot of information. Let us go through them one at a time.
Highlight the output Prior probabilities of groups. Highlight the output Group means. Highlight the output Coefficients of linear discriminants	The prior probabilities describe the distribution of classes in the training dataset. The group means display the mean value of each predictor variable for each class. The coefficients display the linear combination of predictor variables. The given linear combination forms the decision rule of the LDA model.
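The three parts of the printed output can also be read directly from the fitted object. The following is a minimal sketch on simulated data, not the raisin dataset; toy_train and toy_model are hypothetical names, while the components prior, means and scaling are those returned by lda() in the MASS package.

```r
library(MASS)  # provides lda()

set.seed(1)
# Hypothetical two-class training data; the column names mirror the tutorial's.
toy_train <- data.frame(
  minorAL = c(rnorm(100, mean = 0), rnorm(100, mean = 2)),
  ecc     = c(rnorm(100, mean = 0), rnorm(100, mean = 1)),
  class   = factor(rep(c("A", "B"), each = 100))
)

toy_model <- lda(class ~ ., data = toy_train)

toy_model$prior    # prior probabilities: class proportions in the training data
toy_model$means    # group means: mean of each predictor within each class
toy_model$scaling  # coefficients of the linear discriminant (LD1)
```

With two classes there is a single linear discriminant, so the scaling matrix has one column.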
Drag boundary to see the Source window.	Drag boundary to see the Source window clearly.
Cursor in the Source window.	Let us use this model to make predictions on the testing data.
[RStudio] predicted_values <- predict(LDA_model, test_data)	In the Source window type this command and run it. Let us check what predicted_values contain.
Click the predicted_values data in the Environment tab. Point to the table.	Click the predicted_values data in the Environment tab. The predicted_values table is loaded in the Source window.
[RStudio] head(predicted_values$class) head(predicted_values$posterior) head(predicted_values$x)	In the Source window type these commands and run them. The output is seen in the console window.
Highlight the command output of head(predicted_values$class) in the console. Highlight the command output of head(predicted_values$posterior) in the console. Highlight the command output of head(predicted_values$x) in console	The class component contains the class that the model has predicted for each observation. The posterior component contains the posterior probability of the observation belonging to each class. The x component contains the linear discriminants for each observation.
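How the three components relate to each other can be checked on a small example. This sketch again uses simulated data with hypothetical names (toy_data, toy_model, toy_pred, best_column); it is not the tutorial's script.

```r
library(MASS)  # provides lda() and its predict() method

set.seed(1)
# Hypothetical two-class data; the column names mirror the tutorial's.
toy_data <- data.frame(
  minorAL = c(rnorm(100, mean = 0), rnorm(100, mean = 2)),
  ecc     = c(rnorm(100, mean = 0), rnorm(100, mean = 1)),
  class   = factor(rep(c("A", "B"), each = 100))
)
toy_model <- lda(class ~ ., data = toy_data)
toy_pred  <- predict(toy_model, toy_data)

names(toy_pred)  # the components of the prediction

# Each row of the posterior matrix holds probabilities that sum to 1.
range(rowSums(toy_pred$posterior))

# The predicted class is the class with the largest posterior probability.
best_column <- colnames(toy_pred$posterior)[max.col(toy_pred$posterior)]
all(best_column == as.character(toy_pred$class))

dim(toy_pred$x)  # one discriminant score (LD1) per observation
```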
Only Narration.	Now we will measure the performance of our model using the Confusion Matrix.
[RStudio] confusion <-table(test_data$class,predicted_values$class) fourfoldplot(confusion, color = c("red", "green"), conf.level = 0, margin=1) Click on Save and Run buttons.	In the Source window type these commands. Save and run the commands.
Highlight the command confusion <- table(test_data$class, predicted_values$class) Highlight the command fourfoldplot(confusion, color = c("red", "green"), conf.level = 0, margin=1)	The table() function creates a confusion matrix. The fourfoldplot() function generates a visual plot of the confusion matrix. The output is seen in the plot window.
Highlight the plot in plot window	Drag boundary to see the plot window clearly. Given the specific seed, set.seed(1), LDA has misclassified 33 out of 270 observations. This number may change for different sets of training data.
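The number of misclassified observations can be read off the confusion matrix directly. The sketch below uses short made-up label vectors, so the names actual and predicted and their values are hypothetical; only the use of table() follows the tutorial.

```r
# Hypothetical true and predicted labels, for illustration only.
actual    <- factor(c("A", "A", "A", "B", "B", "B", "B", "A"))
predicted <- factor(c("A", "B", "A", "B", "B", "A", "B", "A"))

# Rows are the true classes, columns are the predicted classes.
confusion <- table(Actual = actual, Predicted = predicted)
print(confusion)

# Correct predictions sit on the diagonal of the confusion matrix.
n_correct       <- sum(diag(confusion))
n_misclassified <- sum(confusion) - n_correct
accuracy        <- n_correct / sum(confusion)

print(n_misclassified)
print(accuracy)
```

Applying the same arithmetic to the tutorial's result, 33 misclassified out of 270 corresponds to an accuracy of 237/270, which is about 0.878.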
Only Narration.	Let us visualize how well our model separates different classes.
[RStudio] X <- seq(min(train_data$minorAL), max(train_data$minorAL), length.out = 100) Y <- seq(min(train_data$ecc), max(train_data$ecc), length.out = 100) min_max <- expand.grid(minorAL = X, ecc = Y) min_max$predicted_class <- predict(LDA_model, newdata = min_max)$class grid <- expand.grid(minorAL = X, ecc = Y) grid$class <- predict(LDA_model, newdata = grid)$class grid$classnum <- as.numeric(grid$class) Click on Save and Run buttons.	In the Source window, type these commands. This block of code sets up the data for plotting. It builds a grid of coordinates spanning the range of the training data and predicts a class for each grid point. The seq() function generates a sequence of evenly spaced values between the smallest and largest values of the 'minorAL' and 'ecc' variables in the training data. The 'grid' variable contains the generated grid together with the predictions of LDA_model on it. The as.numeric() function encodes the predicted class labels as numeric values. Select the commands and run them.
Point to the Environment tab.	Drag boundary to see the details in the Environment tab. These variables contain the data for the visualization of the linear discriminants. Click the grid data in the Environment tab. The grid data table is loaded in the Source window.
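The way seq() and expand.grid() build the grid is easier to see at a small scale. The sketch below uses made-up ranges and only 5 by 3 points; the tutorial's script uses the ranges of the training data and 100 by 100 points.

```r
# Evenly spaced values over two hypothetical ranges.
X <- seq(0, 1, length.out = 5)
Y <- seq(10, 20, length.out = 3)

# expand.grid() returns every combination of the X and Y values,
# one combination per row, with the first variable changing fastest.
grid <- expand.grid(minorAL = X, ecc = Y)

nrow(grid)  # number of grid points: length(X) * length(Y)
head(grid)
```

Predicting a class for every row of such a grid is what allows the plot to colour the whole plane by predicted class.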
[RStudio] ggplot() + geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 3) + geom_point(data = min_max, aes(x = minorAL, y = ecc, color = predicted_class), size = 1, alpha = 0.3) + theme_minimal() ggplot() + geom_raster(data=grid, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) + geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 2) + geom_contour(data= grid, aes(x=minorAL, y=ecc, z = classnum), colour="black", linewidth = 1.2) + scale_fill_manual(values = c("#ffff46", "#FF46e9")) + scale_color_manual(values = c("red", "blue")) + labs(title = "LDA Decision Boundary") + theme_minimal()	In the Source window, type these commands.
Highlight the command ggplot() + geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 3) + geom_point(data = min_max, aes(x = minorAL, y = ecc, color = predicted_class), size = 1, alpha = 0.3) + theme_minimal() ggplot() + geom_raster(data=grid, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) + geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 2) + geom_contour(data= grid, aes(x=minorAL, y=ecc, z = classnum), colour="black", linewidth = 1.2) + scale_fill_manual(values = c("#ffff46", "#FF46e9")) + scale_color_manual(values = c("red", "blue")) + labs(title = "LDA Decision Boundary") + theme_minimal() Select the commands and run them.	These commands create the decision boundary plot. They plot the grid points with colors indicating the predicted classes. geom_raster() creates a colour map indicating the predicted classes of the grid points. geom_contour() draws the decision boundary of the LDA. The scale_color_manual() and scale_fill_manual() functions assign specific colors to the classes. The overall plot provides a visual representation of the decision boundary and the distribution of the training data points. Select and run these commands. Drag boundaries to see the plot window clearly.
Point to the output in the Plots window	We can see that our model has separated most of the data points clearly.
Only Narration	With this we come to the end of this tutorial. Let us summarize.
Show Slide Summary	In this tutorial we have learnt: Linear Discriminant Analysis (LDA) and its implementation; assumptions of LDA; limitations of LDA; LDA on a subset of the Raisin dataset; visualization of the LDA separator and its corresponding confusion matrix.
	Now we will suggest an assignment for this Spoken Tutorial.
Show Slide Assignment	Perform LDA on the inbuilt PlantGrowth dataset. Evaluate the model using a confusion matrix and visualize the results.
Show slide About the Spoken Tutorial Project	The video at the following link summarizes the Spoken Tutorial project. Please download and watch it.
Show slide Spoken Tutorial Workshops	We conduct workshops using Spoken Tutorials and give certificates. Please contact us.
Show Slide Spoken Tutorial Forum to answer questions. Do you have questions in THIS Spoken Tutorial? Choose the minute and second where you have the question. Explain your question briefly. Someone from the FOSSEE team will answer them. Please visit this site.	Please post your timed queries in this forum.
Show Slide Forum to answer questions	Do you have any general/technical questions? Please visit the forum given in the link.
Show Slide Textbook Companion	The FOSSEE team coordinates the coding of solved examples of popular books and case study projects. We give certificates to those who do this. For more details, please visit these sites.
Show Slide Acknowledgment	The Spoken Tutorial project was established by the Ministry of Education, Govt. of India.
Show Slide Thank You	This tutorial is contributed by Yate Asseke Ronald and Debatosh Chakraborthy from IIT Bombay. Thank you for joining.

Contributors and Content Editors

Madhurig, Ushav

