Title of the script: Linear Discriminant Analysis in R
Author: YATE ASSEKE RONALD OLIVERA and Debatosh Chakraborty
Keywords: R, RStudio, machine learning, supervised, unsupervised, dimensionality reduction, LDA, video tutorial.
Visual Cue | Narration |
Show slide
Opening Slide |
Welcome to this spoken tutorial on Linear Discriminant Analysis in R. |
Show slide
Learning Objectives |
In this tutorial, we will learn about:

* Linear Discriminant Analysis (LDA) and its implementation
* Assumptions of LDA
* Limitations of LDA
* LDA on a subset of the Raisin dataset
* Visualization of the LDA separator and its corresponding confusion matrix |
Show slide
System Specifications |
This tutorial is recorded using:

* Windows 11
* R version 4.3.0
* RStudio version 2023.06.1

It is recommended to install R version 4.2.0 or higher. |
Show slide.
Prerequisites

https://spoken-tutorial.org |
To follow this tutorial, the learner should know:

* Basics of R programming
* Basics of Machine Learning using R

If not, please access the relevant tutorials on R on this website. |
Show slide.
Linear Discriminant Analysis |
Linear Discriminant Analysis is a statistical method.

* It is used for classification.
* It constructs a data-driven line that best separates the different classes.
* It classifies two or more classes by maximizing a likelihood function. |
Show slide.
Applications of LDA |
The LDA technique is used in several applications, such as:

* Fraud detection
* Bio-imaging classification
* Classifying patient disease state |
Only Narration | Let us now understand the assumptions of LDA. |
Show Slide
Assumptions for LDA |
Multivariate Normality:

* All data entries are continuous and Gaussian, with an equal covariance matrix across all classes.
* The mean vectors for each class are different.
* Data records are independent and identically distributed within each class. |
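[Optional aside for the learner; not part of the recorded script] |
A learner who wants to check these assumptions informally can compare the class-wise means and covariance matrices. A minimal sketch, assuming the raisin data has already been loaded and its class column converted to a factor as shown later in this tutorial:

by(data[c("minorAL", "ecc")], data$class, colMeans)   # per-class mean vectors
by(data[c("minorAL", "ecc")], data$class, cov)        # per-class covariance matrices

Roughly equal covariance matrices across the classes support the use of LDA. |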
Show Slide
Limitations of LDA |
Now we will see the limitations of LDA.

* Departure from Gaussianity may increase the misclassification probability of LDA.
* LDA may perform poorly if the classes have unequal covariance matrices. |
Show Slide
Implementation Of LDA |
Now let us implement LDA on the raisin dataset with two chosen variables.
More information on raisin data is available in the Additional Reading material on this tutorial page. |
Show slide
Download Files |
We will use a script file LDA.R
Please download this file from the Code files link of this tutorial. Make a copy and then use it for practicing. |
[Computer screen]
Point to LDA.R and the folder LDA. Point to the MLProject folder on the Desktop.
|
I have downloaded and moved these files to the LDA folder.

This folder is in the MLProject folder on my Desktop.

I have also set the LDA folder as my working directory. |
Point to the script file LDA.R. | In this tutorial, we will create an LDA classifier model on the raisin dataset.

Let us switch to RStudio. |
Open LDA.R in RStudio
|
Open the script LDA.R in RStudio.
For this, click on the script LDA.R. Script LDA.R opens in RStudio. |
Highlight the readxl package.

Highlight the command library(MASS) Highlight the command library(ggplot2) Highlight the command library(caret) Highlight all the commands. #install.packages("package_name") |
The readxl package is used to load the Excel file.

The MASS package contains the lda() function that we will use for our analysis.

The ggplot2 package is used to plot the results of our analysis.

The caret package contains the confusionMatrix() function, which we use to measure the performance of the classifier.

Please note that these libraries must be installed before they can be imported. You can use the command install.packages("package_name") to install the required packages.

As I have already installed these packages, I will directly import them. |
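[Optional aside for the learner; not part of the recorded script] |
If some of these packages are not yet installed, a one-time setup along these lines installs only the missing ones (the package list matches the libraries imported below):

pkgs <- c("readxl", "MASS", "ggplot2", "caret", "lattice")
missing <- setdiff(pkgs, rownames(installed.packages()))   # packages not yet installed
if (length(missing) > 0) install.packages(missing)
|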
[RStudio]
library(readxl) library(MASS) library(ggplot2) library(caret) library(lattice)
|
Select and run these commands to import the requisite packages. |
Highlight the commands

data <- read_xlsx("Raisin.xlsx")

data <- data[c("minorAL", "ecc", "class")] |
We will read the Excel file and choose 3 columns: two features (minorAL, ecc) and one target (class) variable.
Run these commands to import the raisin dataset. |
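[Optional aside for the learner; not part of the recorded script] |
A quick way to confirm that the import worked is to inspect the structure and the first few rows of the data:

str(data)    # expected: 900 rows and 3 columns (minorAL, ecc, class)
head(data)   # first few observations
|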
Drag boundary to see the Environment tab clearly.
Point to the data variable in the Environment tab. Click the data to load the dataset. |
Drag boundary to see the Environment tab clearly.
In the Environment tab under Data heading, you will see a data variable. Click the data variable to load the dataset in the Source window. |
Drag boundary to see the Source window clearly. | Drag boundary to see the Source window clearly. |
[RStudio]
Type this command in the Source window. data$class <- factor(data$class) |
In the Source window type this command. |
Highlight the command below.

data$class <- factor(data$class)

Select the command and click the Run button. |
Here we are converting the variable data$class to a factor.

It ensures that the categorical data is properly encoded.

Select the command and run it. |
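[Optional aside for the learner; not part of the recorded script] |
To confirm the conversion, one can list the levels of the new factor and count the observations in each class:

levels(data$class)   # the two raisin classes
table(data$class)    # number of observations per class
|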
Only Narration. | Now we split our dataset into training and testing data. |
[RStudio]
Type these commands in the Source window. set.seed(1) index_split = sample(1:nrow(data), size = 0.7*nrow(data), replace = FALSE) |
In the Source window type these commands. |
Highlight the command
set.seed(1) Highlight the command sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) Highlight the command replace=FALSE Select the commands and click the Run button. |
First, we set a seed for reproducible results.

We then create a vector of row indices using the sample() function: 70% of the rows will be used for training and 30% for testing.

The training rows are chosen using simple random sampling without replacement.

Select the commands and run them. |
The vector is shown in the Environment tab. | |
Point to train-test split. | We use the indices that we previously generated to obtain our train-test split. |
[RStudio]
Type these commands. train_data <- data[index_split, ] test_data <- data[-c(index_split), ] |
In the Source window type these commands. |
Highlight the command
train_data <- data[index_split, ] Highlight the command test_data <- data[-c(index_split), ] |
This creates training data, consisting of 630 unique rows.

This creates testing data, consisting of 270 unique rows. |
Select the commands and click the Run button.

Point to the sets in the Environment tab. |
Select the commands and run them.

The data sets are shown in the Environment tab.

Click on test_data and train_data to load them in the Source window. |
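[Optional aside for the learner; not part of the recorded script] |
A quick sanity check of the split sizes, assuming the commands above were run after set.seed(1):

nrow(train_data)   # expected: 630
nrow(test_data)    # expected: 270
|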
Only Narration. | Let us train our LDA model. |
[RStudio]
LDA_model <- lda(class~.,data=train_data) LDA_model |
In the Source window, type these commands. |
Highlight the command
LDA_model <- lda(class~.,data=train_data) LDA_model
Point to the output in the console window. |
We pass two parameters to the lda() function: the formula and the data on which the model should train.

Select the commands and run them. The output is shown in the console window. |
Drag boundary to see the console window. | Drag boundary to see the console window clearly. |
Highlight output in the console. | Our model provides us with a lot of information.
Let us go through them one at a time. |
Highlight Prior probabilities of groups.

Highlight Group means.

Highlight Coefficients of linear discriminants. |
Prior probabilities of groups explain the distribution of the classes in the training dataset.

Group means display the mean value of each predictor variable for each class.

Coefficients of linear discriminants display the linear combination of the predictor variables that forms the decision rule of the LDA model. |
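[Optional aside for the learner; not part of the recorded script] |
The same pieces of the output can also be accessed individually from the fitted model object returned by lda():

LDA_model$prior     # prior probabilities of groups
LDA_model$means     # group means
LDA_model$scaling   # coefficients of linear discriminants
|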
Drag boundary to see the Source window. | Drag boundary to see the Source window clearly. |
Let us use this model to make predictions on the testing data. | |
[RStudio]
predicted_values <- predict(LDA_model, test_data) |
In the Source window type this command and run it.

Let us check what predicted_values contains. |
Click the predicted_values data in the Environment tab.
|
Click the predicted_values data in the Environment tab.
The predicted_values table is loaded in the Source window. |
[RStudio]
head(predicted_values$class) head(predicted_values$posterior) head(predicted_values$x) |
In the Source window type these commands and run them.

The output is seen in the console window. |
Highlight the output of head(predicted_values$class) in the console.

Highlight the output of head(predicted_values$posterior) in the console.

Highlight the output of head(predicted_values$x) in the console. |
class contains the class label that the model has predicted for each observation.

posterior contains the posterior probability of each observation belonging to each class.

x contains the linear discriminants for each observation. |
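[Optional aside for the learner; not part of the recorded script] |
To see the predictions next to the true labels, one can place them side by side for the first few test observations:

head(data.frame(actual    = test_data$class,
                predicted = predicted_values$class))
|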
Only Narration. | Now we will measure the performance of our model using the Confusion Matrix. |
[RStudio]
confusion <- table(test_data$class, predicted_values$class)

fourfoldplot(confusion, color = c("red", "green"), conf.level = 0, margin = 1)

Click on Save and Run buttons. |
In the Source window type these commands.

Save and run the commands. |
Highlight the command confusion <- table(test_data$class, predicted_values$class)

Highlight the command fourfoldplot(confusion, color = c("red", "green"), conf.level = 0, margin = 1) |
The table() function creates the confusion matrix.

The fourfoldplot() function generates a visual plot of the confusion matrix.

The output is seen in the Plots window. |
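[Optional aside for the learner; not part of the recorded script] |
The confusionMatrix() function from the caret package, mentioned earlier, gives a numeric summary of the same result, including the overall accuracy. A minimal sketch using the objects created above:

cm <- confusionMatrix(data = predicted_values$class, reference = test_data$class)
cm$table                   # same counts as the fourfoldplot
cm$overall["Accuracy"]     # overall accuracy on the test data
|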
Highlight the plot in the Plots window. | Drag boundary to see the Plots window clearly.

Given the specific seed, set.seed(1), LDA has misclassified 33 out of 270 test observations, an accuracy of about 88% (237/270).

This number may change for different sets of training data. |
Only Narration. | Let us visualize how well our model separates different classes. |
[RStudio]

X <- seq(min(train_data$minorAL), max(train_data$minorAL), length.out = 100)

Y <- seq(min(train_data$ecc), max(train_data$ecc), length.out = 100)

min_max <- expand.grid(minorAL = X, ecc = Y)

min_max$predicted_class <- predict(LDA_model, newdata = min_max)$class

grid <- expand.grid(minorAL = X, ecc = Y)

grid$class <- predict(LDA_model, newdata = grid)$class

grid$classnum <- as.numeric(grid$class)

Click on Save and Run buttons. |
In the Source window, type these commands.

This block of code is the setup for the visual plots.

The seq() function generates a sequence of 100 evenly spaced values between the smallest and largest values of the minorAL and ecc variables in the training data.

expand.grid() builds a square grid of coordinates covering the range of the training data, and the model's predicted class is stored for every grid point.

The as.numeric() function encodes the predicted class labels as numeric values, which the contour plot will need.

Select the commands and run them. |
Point to the Environment tab. | Drag boundary to see the details in the Environment tab.

These variables contain the data for the visualization of the linear discriminants.

Click the grid data in the Environment tab. The grid data table is loaded in the Source window. |
[RStudio]

ggplot() +

geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 3) +

geom_point(data = min_max, aes(x = minorAL, y = ecc, color = predicted_class), size = 1, alpha = 0.3) +

theme_minimal()


ggplot() +

geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.3) +

geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 2) +

geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum), colour = "black", linewidth = 1.2) +

scale_fill_manual(values = c("#ffff46", "#FF46e9")) +

scale_color_manual(values = c("red", "blue")) +

labs(title = "LDA Decision Boundary") +

theme_minimal() |
In the Source window, type these commands. |
Highlight the command
ggplot() + geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 3) + geom_point(data = min_max, aes(x = minorAL, y = ecc, color = predicted_class), size = 1, alpha = 0.3) +theme_minimal()
geom_raster(data=grid, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) + geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 2) + geom_contour(data= grid, aes(x=minorAL, y=ecc, z = classnum), colour="black", linewidth = 1.2) + scale_fill_manual(values = c("#ffff46", "#FF46e9")) + scale_color_manual(values = c("red", "blue")) + labs(title = "LDA Decision Boundary") + theme_minimal()
|
These commands create the decision boundary plots.

The first plot shows the grid points coloured by their predicted class, with the training data points drawn on top.

In the second plot, geom_raster creates a colour map indicating the predicted class of each grid point.

geom_contour draws the decision boundary of the LDA model.

The scale_color_manual and scale_fill_manual functions assign specific colours to the classes.

The overall plot provides a visual representation of the decision boundary and the distribution of the training data points.

Select and run these commands.

Drag boundaries to see the Plots window clearly. |
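[Optional aside for the learner; not part of the recorded script] |
If you wish to keep the figure, ggsave() from ggplot2 writes the most recently displayed plot to the working directory; the file name here is only an example:

ggsave("lda_decision_boundary.png", width = 7, height = 5)
|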
Point to the output in the Plots window. | We can see that our model has separated most of the data points clearly. |
Only Narration | With this, we come to the end of this tutorial.
Let us summarize. |
Show Slide
Summary |
In this tutorial we have learnt:

* Linear Discriminant Analysis (LDA) and its implementation
* Assumptions of LDA
* Limitations of LDA
* LDA on a subset of the Raisin dataset
* Visualization of the LDA separator and its corresponding confusion matrix |
Now we will suggest an assignment for this Spoken Tutorial. | |
Show Slide
Assignment |

* Perform LDA on the inbuilt PlantGrowth dataset.
* Evaluate the model using a confusion matrix and visualize the results. |
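[Optional aside for the learner; not part of the recorded script] |
One possible starting point for the assignment is sketched below; it mirrors the steps of this tutorial on the PlantGrowth dataset, which ships with base R and has a numeric weight column and a factor group column:

library(MASS)
library(caret)

data(PlantGrowth)
set.seed(1)
idx <- sample(1:nrow(PlantGrowth), size = 0.7 * nrow(PlantGrowth), replace = FALSE)
pg_train <- PlantGrowth[idx, ]
pg_test  <- PlantGrowth[-idx, ]

pg_lda  <- lda(group ~ weight, data = pg_train)                  # LDA with one predictor
pg_pred <- predict(pg_lda, pg_test)$class                        # predicted groups
confusionMatrix(data = pg_pred, reference = pg_test$group)       # confusion matrix and accuracy
|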
Show slide
About the Spoken Tutorial Project |
The video at the following link summarizes the Spoken Tutorial project.
Please download and watch it. |
Show slide
Spoken Tutorial Workshops |
We conduct workshops using Spoken Tutorials and give certificates.
|
Show Slide
Spoken Tutorial Forum to answer questions. Do you have questions in THIS Spoken Tutorial? Choose the minute and second where you have the question.Explain your question briefly. Someone from the FOSSEE team will answer them. Please visit this site. |
Please post your timed queries in this forum. |
Show Slide
Forum to answer questions |
Do you have any general/technical questions?
Please visit the forum given in the link. |
Show Slide
Textbook Companion |
The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.
We give certificates to those who do this. For more details, please visit these sites. |
Show Slide
Acknowledgment |
The Spoken Tutorial project was established by the Ministry of Education, Govt. of India. |
Show Slide
Thank You |
This tutorial is contributed by Yate Asseke Ronald and Debatosh Chakraborty from IIT Bombay.
Thank you for joining. |