Difference between revisions of "PhET-Simulations-for-Physics/C2/Geometric-Optics/English"

(Created page with "'''Title of the script''': '''Installation of orca on Windows OS''' '''Author: '''Madhuri Ganapathi '''Keywords''': Orca, Windows 10 OS, zip files, Orca Forum, register, l...")
 
Line 1: Line 1:
'''Title of the script''': '''Installation of orca on Windows OS'''
+
'''Title of the script''': Linear Discriminant Analysis in R
  
 +
'''Author''': YATE ASSEKE RONALD OLIVERA  and Debatosh Charkraborty
  
'''Author: '''Madhuri Ganapathi
+
'''Keywords''': R, RStudio, machine learning, supervised, unsupervised, dimensionality reduction, LDA, video tutorial.
  
 +
{| border=1
 +
|-
 +
|| '''Visual Cue'''
 +
|| '''Narration'''
 +
|-
 +
|| '''Show slide'''
  
'''Keywords''': Orca, Windows 10 OS, zip files, Orca Forum, register, login, run input file, video tutorial.
+
'''Opening Slide'''
 +
|| Welcome to this spoken tutorial on '''Linear Discriminant Analysis in R.'''
 +
|-
 +
|| '''Show slide'''
  
 +
'''Learning Objectives'''
  
{| border=1
+
|| In this tutorial, we will learn about:
|-
+
# Linear Discriminant Analysis ('''LDA''') and its implementation.
|| '''Visual Cue '''
+
# Assumptions of LDA
|| '''Narration'''
+
# Limitations of LDA
|-
+
# LDA on a subset of Raisin dataset
 +
# Visualization of the '''LDA''' separator and its corresponding confusion matrix.
  
|| '''Slide Number 1'''
 
  
'''Title slide'''
+
|-
|| Welcome to this Spoken Tutorial on  '''Installation of ORCA on Windows'''.
+
|| '''Show slide'''
  
|-
+
'''System Specifications'''
|| '''Slide Number 2'''
+
|| This tutorial is recorded using,
 +
* '''Windows 11 '''
 +
* '''R''' version '''4.3.0'''
 +
* '''RStudio''' version '''2023.06.1'''
  
'''Learning Objectives'''
+
It is recommended to install '''R''' version '''4.2.0''' or higher.
 +
|-
 +
|| '''Show slide.'''
  
|| In this tutorial, we will learn to,
+
'''Prerequisites '''
  
* Download '''orca 5.0.4 '''compressed zip files for Windows OS
+
'''https://spoken-tutorial.org'''
 +
|| To follow this tutorial, the learner should know:
  
* Extract the compressed files.
+
* Basics of '''R''' programming.
 +
* Basics of '''Machine Learning '''using '''R'''.  
  
* Run an '''orca''' input file to check the installation.
+
If not, please access the relevant tutorials on '''R '''on this website.
 +
|-
 +
|| '''Show slide.'''
  
 +
'''Linear Discriminant Analysis'''
 +
|| Linear Discriminant Analysis is a statistical method.
 +
* It is used for classification.
 +
* It constructs a data-driven line that best separates the different classes.
 +
* It classifies observations into two or more classes by maximizing a likelihood function.
  
|-
 
|| '''Slide Number 3'''
 
  
'''System Requirements'''
+
|-
|| This tutorial is recorded using,
+
|| '''Show slide.'''
  
'''Windows 10 OS'''
+
'''Applications of LDA'''
 +
||
 +
* The LDA technique is used in several applications, such as:
  
'''Orca '''version 5.0.4
+
** Fraud Detection
 +
** Bio-Imaging classification
 +
** Classification of patient disease state
  
'''Notepad '''version''' 20H2'''
+
|-
 +
|| Only Narration
 +
|| Let us now understand the assumptions of LDA.
 +
|-
 +
|| '''Show Slide '''
  
'''Firefox Web browser''' version 112.0.2
+
'''Assumptions for LDA'''
 +
|| '''Multivariate Normality: '''
  
 +
* All features are continuous and Gaussian within each class, with the same covariance matrix for all classes.
 +
* Mean vectors for each class are different.
 +
* Data records are independent and identically distributed within each class.
  
Before you begin, please make sure that you are connected to the internet.
+
|-  
|-
+
|| '''Show Slide '''
|| '''Slide Number 4'''
+
  
'''Pre-requisites'''
+
'''Limitations of LDA'''
 +
|| Now we will see the limitations of LDA.
  
|| To follow this tutorial,
+
* Departure from Gaussianity may increase misclassification probability in LDA.
* Learner must be familiar with basic computer and Internet skills.
+
* '''LDA''' may perform poorly if the classes have unequal covariance matrices.
  
|-
+
|-  
|| '''Slide Number 5'''
+
|| '''Show Slide'''
  
'''Code Files'''
+
'''Implementation of LDA'''
 +
|| Now let us implement '''LDA''' on the '''raisin''' dataset with two chosen variables.
  
||
+
More information on '''raisin''' data is available in the '''Additional Reading material''' on this tutorial page.
* The input file to check the installation is provided in the '''code files''' link.
+
|-
 +
|| '''Show slide '''
  
* Please download and extract the file.
+
'''Download Files'''
 +
|| We will use the script file '''LDA.R'''.
  
* Make a copy and then use it for practising.
+
Please download this file from the '''Code files''' link of this tutorial.
  
|-
+
Make a copy and then use it for practicing.
|| Open a web browser and type '''orca forum ''' in the ''' Google search the web'''.
+
|-  
 +
|| [Computer screen]
  
<u>https://orcaforum.kofo.mpg.de/app.php/portal</u>
+
Point to '''LDA.R''' and the folder '''LDA.'''
  
 +
Point to the''' MLProject folder '''on the '''Desktop.'''
  
Point to the '''ORCA Forum''' page.
 
|| Open your web browser and type '''orca Forum''' and press '''Enter'''.
 
  
You can see the first instance as '''ORCA Forum - Portal'''.
+
Point to the''' LDA folder.'''
 +
|| I have downloaded and moved these files to the '''LDA '''folder.
  
Click on the link.
 
  
'''ORCA Forum''' page opens.
+
This folder is in the '''MLProject''' folder on my '''Desktop'''.
|-
+
|| Point to '''Register''' and '''Login''' links on the top right.
+
|| On the top right side of the page, we have links for '''Register''' and '''Login'''.
+
|-
+
|| Cursor on the page.  
+
  
Point to the Register link.
 
|| If you are a first time user, you have to register on the '''ORCA Forum'''.
 
|-
 
|| Click on the '''Register '''link.
 
  
Point to '''ORCA Forum Registration page.'''
+
I have also set the '''LDA''' folder as my working directory.
 +
|-
 +
|| Point to the script file '''LDA.R.'''
 +
|| In this tutorial, we will create an '''LDA''' classifier model on the '''raisin''' dataset.
  
Scroll down the page.
 
|| Click the '''Register '''link.
 
  
'''ORCA Forum Registration''' page opens.
+
Let us switch to '''RStudio'''.
 +
|-
 +
|| Open '''LDA.R '''in '''RStudio'''
  
Scroll down the page.
 
  
 +
Point to''' LDA.R''' in '''RStudio'''.
 +
|| Open the script '''LDA.R''' in '''RStudio'''.
  
Please read the information on the page before clicking the button.
+
For this, click on the script '''LDA.R.'''
|-
+
|| Click on '''I agree to these terms''' button.
+
  
Point to '''ORCA Forum – Registration''' page.
+
Script '''LDA.R''' opens in '''RStudio'''.
|| Click on '''I agree to these terms''' button.
+
|-
 +
|| Highlight the command '''library(readxl)'''
  
The '''Registration''' page refreshes with the details.
+
Highlight the command '''library(MASS) '''
|-
+
|| Point to the registration details part.
+
  
 +
Highlight the command '''library(ggplot2)'''
  
Point to the Email Address text box.
+
Highlight the command '''library(caret)'''
  
Point to the message.
+
Highlight the command '''library(lattice)'''
|| Here you have to type your details to register on the '''ORCA Forum'''.
+
  
 +
Highlight all the commands.
  
Please enter your valid email address in the '''Email Address '''text box.
+
'''<nowiki>#install.packages("package_name")</nowiki>'''
 +
|| The '''readxl''' package is used to load the '''Excel''' file.
 +
 
 +
 
 +
The '''MASS''' package contains the '''lda()''' function that we will use for our analysis.
 +
 
 +
 
 +
The '''ggplot2 package''' is used to plot the results of our analysis.
 +
 
 +
 
 +
The '''caret''' package contains the '''confusionMatrix()''' function.
 +
 
 +
 
 +
It is used to measure the performance of the classifier.
 +
 
 +
 
 +
Please note that these packages must be installed before they can be loaded.
 +
 
 +
 
 +
Please ensure that everything is installed correctly.
 +
 
 +
 
 +
You can use the command '''install.packages("package_name")''' to install the required packages.
 +
 
 +
 
 +
As I have already installed these packages, I will directly import them.
 +
 
 +
|-
 +
|| [RStudio]
 +
 
 +
'''library(readxl)'''
 +
 
 +
'''library(MASS)'''
 +
 
 +
'''library(ggplot2)'''
 +
 
 +
'''library(caret)'''
 +
 
 +
'''library(lattice)'''
 +
 
 +
 
 +
 
 +
|| Select and run these commands to import the requisite packages.
 +
 
 +
|-
 +
|| Highlight the command
 +
 
 +
'''data <- read_xlsx("Raisin.xlsx")'''
 +
 
 +
 
 +
Highlight the command''' data<-data[c("minorAL","ecc","class")]'''
 +
 
 +
 
 +
Highlight the commands.
 +
 
 +
'''data <- read_xlsx("Raisin.xlsx")'''
 +
 
 +
 
 +
'''data<-data[c("minorAL","ecc","class")]'''
 +
 
 +
|| We will read the Excel file and choose 3 columns: two features ('''minorAL''' and '''ecc''') and one target variable ('''class''').
 +
 
 +
Run these commands to import the '''raisin''' dataset.
 +
 
 +
|-
 +
|| Drag boundary to see the '''Environment '''tab clearly.
 +
 
 +
Point to the data variable in the Environment tab.
 +
 
 +
Click the data to load the dataset.
 +
 
 +
|| Drag boundary to see the Environment tab clearly.
 +
 
 +
In the Environment tab under '''Data '''heading, you will see a '''data '''variable.
 +
 
 +
Click the '''data''' variable to load the dataset in the '''Source''' window.
 +
|-
 +
|| Drag boundary to see the Source window clearly.
 +
|| Drag boundary to see the '''Source '''window clearly.
  
 
|-
 
|-
|| Click on the '''Submit ''' button.
+
||[RStudio]
|| Click the '''Submit button ''' to complete the registration.
+
 
 +
Type these commands in the source window.
 +
 
 +
'''data$class <- factor(data$class)'''
 +
 
 +
|| In the '''Source''' window type this command.
 +
 
 
|-
 
|-
||  
+
||Highlight the below commands.
|| The '''orca''' team will send an activation link to the given email address.
+
  
You need to click the link sent in the email to get the registration activated.  
+
'''data$class <- factor(data$class)'''
 +
 
 +
Select the commands and click the Run button.
 +
 
 +
||Here we are converting the variable '''data$class''' to a factor.
 +
 
 +
It ensures that the categorical data is properly encoded.
 +
 
 +
Select the command and run it.
 
|-
 
|-
|| Point to login.
+
||Only Narration.
|| I have already registered on the''' ORCA Forum''' so I will login now.
+
|| Now we split our dataset into training and testing data.
 
|-
 
|-
|| Fill in the details to login.
+
||[RStudio]
  
 +
Type the command in the source window.
  
Click on the '''Login''' button.
+
'''set.seed(1) '''
|| I will log in using my '''Username''' and '''password'''.  
+
 
 +
'''index_split=sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE)'''
 +
||In the '''Source''' window type these commands.
  
 
|-
 
|-
|| Point to the page.
+
||Highlight the command
  
Click on the Downloads link on the top left side.
+
'''set.seed(1)'''
|| A new page opens.
+
  
 +
Highlight the command '''sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE)'''
  
On the top left side, click the '''Downloads''' link.
+
Highlight the command '''replace=FALSE'''
|-
+
|| Point to '''Downloads – Categories''' page.
+
  
 +
Select the commands and click the Run button.
 +
||First we set a seed for reproducible results.
  
Point to the various versions on the page.
 
  
|| The''' Downloads – Categories '''page opens.
+
We will create a vector of indices using '''sample() '''function.
  
  
The page shows various versions of '''ORCA'''.
+
We will use 70% of the rows for training and the remaining 30% for testing.
|-
+
|| Click on the '''5.0.4 folder''' link.
+
  
Cursor on the page.
 
|| Click on the '''ORCA''' '''5.0.4 folder''' link.
 
  
A new page '''ORCA 5.0.4''' opens.
+
The training data is chosen using simple random sampling without replacement.  
  
A new version of orca might be available at the time of your download.  
+
Select the commands and run them.
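The split described above can be tried on a small scale first. The following is a minimal sketch, assuming a made-up index range of 1 to 10 instead of the raisin data; the names '''toy_rows''', '''toy_train''' and '''toy_test''' are hypothetical and do not appear in '''LDA.R'''.

```r
# Toy illustration of a 70/30 split by sampling row indices without replacement.
# toy_rows stands in for 1:nrow(data); the real script works on the raisin data.
set.seed(1)                       # same idea as in the tutorial: reproducible sampling
toy_rows <- 1:10

# Draw 70% of the indices; replace = FALSE guarantees no index is drawn twice
toy_train <- sample(toy_rows, size = 0.7 * length(toy_rows), replace = FALSE)

# Everything not drawn for training goes to the test set
toy_test <- setdiff(toy_rows, toy_train)

length(toy_train)                 # number of training indices
length(toy_test)                  # number of testing indices
```

Because sampling is done without replacement, the training and testing indices never overlap, which is why the tutorial can later select the test rows with a negative index.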
 
|-
 
|-
|| Scroll down the page.
+
|| The vector is shown in the''' Environment '''tab.
|| Scroll down the page.
+
 
|-
 
|-
|| Point to the 3 files.
+
||Point to train-test split.
|| On the page, the required files for ''' Windows OS ''' are provided as 3 zip files.
+
|| We use the indices that we previously generated to obtain our train-test split.
 +
|-
 +
|| [RStudio]
  
We need to download all 3 parts one at a time.
+
Type the command
|-
+
|| Click on [https://orcaforum.kofo.mpg.de/app.php/dlext/?view=detail&df_id=172 ORCA 5.0.][https://orcaforum.kofo.mpg.de/app.php/dlext/?view=detail&df_id=172 4][https://orcaforum.kofo.mpg.de/app.php/dlext/?view=detail&df_id=172 , Windows,64bit, .zip Archive, Part 1/3]
+
  
 +
'''train_data <- data[index_split, ]'''
  
Point to the Details
+
'''test_data <- data[-c(index_split), ]'''
 +
|| In the '''Source '''window type these commands.
 +
|-
 +
|| Highlight the command
  
Point to the file size.
+
'''train_data <- data[index_split, ]'''
|| Click on '''ORCA 5.0.4, part 1 zip file'''.
+
  
The page refreshes.
+
Highlight the command
  
Here are the details of this file.  
+
'''test_data <- data[-c(index_split), ]'''
 +
|| This creates training data, consisting of 630 unique rows.
  
The file size is also seen here.
 
|-
 
  
|| Point to the green Download button.
+
This creates testing data, consisting of 270 unique rows.
 +
|-
 +
|| Select the commands and click the Run button.
  
Click on the Download button.
 
|| Click the large green '''Download ''' button at the bottom-right to start downloading.
 
|-
 
|| Point to the dialog box.
 
  
 +
Point to the sets in the Environment Tab
 +
|| Select the commands and run them.
  
Click the '''Save button''' to download the file.
 
|| A dialog box opens, which prompts you to save the file.
 
  
 +
The data sets are shown in the Environment tab.
 +
 
  
Click on the''' Save button ''' to download the file.  
+
Click on '''test_data '''and '''train_data '''to load them in the Source window.
  
 +
|-
 +
|| Only Narration.
 +
|| Let us train our '''LDA''' model.
 +
|-
 +
|| [RStudio]
  
In some systems, downloading may start directly without asking to save.
+
'''LDA_model <- lda(class~.,data=train_data)'''
  
The file takes some time to download due to its large file size.
+
'''LDA_model'''
|-
+
|| In the '''Source '''window, type these commands.
|| Point to the downloading process.
+
|| The file downloads to the '''Downloads''' folder for me.
+
  
The file may download to the folder as per your system settings.
+
|-  
|-
+
|| Highlight the command
|| Point to the downloaded file.
+
|| Here is my downloaded file.
+
|-
+
|| Cursor on the ORCA Forum Downloads page.
+
|| Go back to the previous page, where all the files are listed.
+
  
|-
+
'''LDA_model <- lda(class~.,data=train_data)'''
|| Show the downloading files.
+
  
Click on [https://orcaforum.kofo.mpg.de/app.php/dlext/?view=detail&df_id=172 ORCA 5.0.][https://orcaforum.kofo.mpg.de/app.php/dlext/?view=detail&df_id=172 4][https://orcaforum.kofo.mpg.de/app.php/dlext/?view=detail&df_id=172 , Windows, 64bit, .zip Archive, Part 2/3]
+
'''LDA_model'''
  
|| Now click the link for part 2 '''zip''' file and download it.
 
  
|-
+
Highlight the command '''LDA_model'''
|| Click on [https://orcaforum.kofo.mpg.de/app.php/dlext/?view=detail&df_id=172 ORCA 5.0.][https://orcaforum.kofo.mpg.de/app.php/dlext/?view=detail&df_id=172 4][https://orcaforum.kofo.mpg.de/app.php/dlext/?view=detail&df_id=172 , Windows, 64bit, .zip Archive, Part 3/3]
+
  
|| Similarly click the link for part 3 ''' zip''' file and download it.
 
|-
 
|| In the left panel select '''Downloads'''.
 
|| Open the file manager and go to the '''Downloads '''directory.
 
  
Notice the three downloaded '''zip''' files.
+
Click on Save and Run buttons.
|-
+
|| Enlarge the folder '''orca ''' using '''Ctrl ++'''.
+
  
 +
Point to the output in the '''console '''window.
 +
|| We pass two parameters to the '''lda()''' function.
 +
# formula
 +
# data on which the model should train.
  
Right-click in the '''Downloads''' folder.
+
Select the commands and run them.
  
 +
The output is shown in the '''console''' window.
 +
|-
 +
|| Drag boundary to see the '''console''' window.
 +
|| Drag boundary to see the '''console '''window clearly.
  
From the context menu select the '''New Folder''' option.
+
|-
|| Let us create a new directory and move the''' zip ''' files to this directory.
+
|| Highlight '''output''' in the '''console.'''
 +
|| Our '''model''' provides us with a lot of information.
  
 +
Let us go through them one at a time.
 +
|-
 +
|| Highlight '''Prior probabilities of groups''' in the output.
  
To create a new directory, right-click in the '''Downloads''' directory.
+
Highlight '''Group means''' in the output.
  
 +
Highlight '''Coefficients of linear discriminants''' in the output.
  
From the context menu select the '''New Folder''' option.
+
|| The prior probabilities of groups show the proportion of each class in the training dataset.
|-
+
|| Type the name as '''orca '''in the '''Folder Name''' text box.
+
|| Name the folder as '''orca'''.
+
  
|-
 
|| Point to the '''orca''' directory.
 
  
Press and hold the '''Ctrl''' key,
+
The group means display the mean value of each '''predictor''' variable for each class.
  
Click on all 3 files to select them.
 
|| Let’s move the 3 downloaded zip files to this newly created '''orca''' directory.
 
  
 +
The coefficients of linear discriminants give the linear combination of the predictor variables.
  
Press and hold the '''Ctrl''' key and click the 3 files to select them.
 
|-
 
|| Drag and drop into '''orca''' folder.
 
|| Drag and drop them into the '''orca''' directory.
 
  
|-
+
The given linear combinations form the decision rule of the '''LDA''' model.
|| Point to the zip files.
+
|| Now we will extract the files one by one.
+
  
|-
+
|-  
|| Right-click on '''orca part1 zip '''file.
+
|| Drag boundary to see the Source window.
 +
|| Drag boundary to see the '''Source '''window clearly.
  
From the context menu select '''Extract All''' option.
+
|-
|| Right-click on '''orca''' part1''' zip '''file.
+
||  
 +
|| Let us use this model to make predictions on the testing data.
 +
|-
 +
|| [RStudio]
  
From the context menu select '''Extract All''' option.
+
'''predicted_values <- predict(LDA_model, test_data)'''
  
|-
+
|| In the '''Source '''window type this command and run it.  
|| Point to the dialog box.
+
  
 +
Let us check what '''predicted_values''' contains.
  
Click the '''Extract '''button.
+
|-
|| '''Extract Compressed Folders '''dialog box opens.
+
|| Click the '''predicted_values '''data in the Environment tab.
  
In the box click on the '''Extract''' button at the bottom-right.
 
  
|-
+
Point to the table.
|| Point to the progress bar.
+
|| Click the '''predicted_values '''data in the Environment tab.
|| Extraction progress is shown.
+
  
Wait for the extraction to complete.
+
The '''predicted_values '''table is loaded in the '''Source''' window.
|-
+
|| Point to '''orca_5_0_4_win64_msmpi10_part1'''
+
  
folder.
+
|-
 +
|| [RStudio]
  
|| A folder with the same name as the '''zip '''file is created.
+
'''head(predicted_values$class)'''
  
All the files get extracted to this folder.
+
'''head(predicted_values$posterior)'''
|-
+
|| Point to '''orca_5_0_4_win64_msmpi10_part2''' folder.
+
|| In the same manner extract the '''orca''' part2 '''zip '''file.
+
  
Again a folder with the same name as the '''zip '''file is created.
+
'''head(predicted_values$x)'''
 +
|| In the '''Source''' window type these commands and run them.
  
  
All the files are extracted into it.
+
The output is seen in the''' console''' window.
 +
|-
 +
|| Highlight the command output of '''head(predicted_values$class) '''in the '''console.'''
  
|-
 
|| Point to orca_5_0_4_win64_msmpi10_part3
 
  
folder.
+
Highlight the command output of '''head(predicted_values$posterior)''' in the '''console.'''
|| Similarly extract the part 3 '''zip''' file.
+
  
|-
 
|| Point to the directories.
 
  
Double click to open them.
+
Highlight the command output of '''head(predicted_values$x) '''in '''console'''
|| We have completed the extraction of all the parts.
+
|| The '''class''' component contains the class that the model has predicted for each observation.
  
  
We now have three directories with the extracted files.
+
The '''posterior''' component contains the '''posterior probability''' of the observation belonging to each class.
|-
+
|| point to '''orca''' directory.
+
|| We have to place all the extracted files in the '''orca''' directory.
+
  
|-
+
The '''x''' component contains the linear discriminant scores for each observation.
|| Ctrl + A to Select.
+
  
Ctrl + X to Cut.
+
|-
 +
|| Only Narration.
 +
|| Now we will measure the performance of our model using the '''Confusion Matrix'''.
 +
|-
 +
|| [RStudio]
  
Ctrl + V to Paste.
+
'''confusion <-table(test_data$class,predicted_values$class)'''
|| Open the directories one by one.
+
  
  
Select all the files in each of them.
+
'''fourfoldplot(confusion, color = c("red", "green"), conf.level = 0, margin=1)'''
  
Cut and paste them into the '''orca''' directory.
 
  
|-
+
Click on '''Save '''and''' Run''' buttons.
|| Select and press Delete key on the keyword.
+
|| In the '''Source '''window type these commands.
|| You may delete the empty directories.  
+
  
|-
 
|| Select and press Delete key on the keyboard.
 
|| You may also delete the 3 zip files to save the disk space.
 
  
|-
+
Save and run the commands.
|| Point to the compiled files in the folder.
+
|-  
 +
|| Highlight the command '''confusion <- table(test_data$class, predicted_values$class)'''
  
Point to the '''orca''' file.
+
Highlight the command
|| All the files must be placed together for the orca program to function.
+
  
Here, all the files are already compiled and are in executable format.
+
'''fourfoldplot(confusion, color = c("red", "green"), conf.level = 0, margin=1)'''
  
All calculations will run using the '''orca '''executable file.
+
|| The '''table()''' function creates a confusion matrix.
|-
+
|| Point to the '''carbonmonoxide '''folder in the Downloads folder.
+
|| Now we will run an input file to check the installation of '''orca'''.
+
  
This file is provided to you in the '''Code Files''' link.
 
  
Please download and extract it to a path convenient to you.
+
The '''fourfoldplot()''' function generates a visual plot of the confusion matrix.
  
I have downloaded the file to my '''Downloads '''directory.
 
  
|-
+
The output is seen in the '''plot''' window.
|| Point to the Search bar.
+
|-  
 +
|| Highlight the plot in '''plot window '''
 +
|| Drag boundary to see the plot window clearly.
  
Type''' command prompt'''.
+
With the seed used here ('''set.seed(1)'''), LDA has misclassified 33 out of 270 observations.
|| Let us open the command prompt.
+
  
In the '''Search bar''' next to the '''Start''' icon type '''command prompt'''.
+
This number may change for different sets of training data.  
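The misclassification count quoted above translates directly into an accuracy figure. The sketch below first does the arithmetic with the two numbers from the narration, then shows a small helper for any confusion table. The helper name '''accuracy_from_table''' and the counts in '''toy_table''' are invented for illustration; they are not the cell values of the tutorial's confusion matrix, which the narration does not give.

```r
# Accuracy from the figures quoted in the narration: 33 errors out of 270 test rows
misclassified <- 33
n_test <- 270
accuracy <- 1 - misclassified / n_test
round(accuracy, 3)

# Hypothetical helper: correct predictions sit on the diagonal of the table
accuracy_from_table <- function(tab) sum(diag(tab)) / sum(tab)

# Invented 2x2 counts, used only to show the helper working
toy_table <- matrix(c(40, 5, 3, 52), nrow = 2)
accuracy_from_table(toy_table)
```

The same helper could be applied to the '''confusion''' object created with '''table()''', since that object is also a square table of counts.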
  
|-
+
|-  
|| Click on it to select command prompt
+
|| Only Narration.
|| Select '''Command Prompt''' from the shown list.
+
|| Let us visualize how well our model separates different classes.
 +
|-
 +
|| [RStudio]
  
'''Command Prompt''' opens.
+
[RStudio]
|-
+
|| At the prompt:
+
  
type '''cd Downloads >> '''press Enter.
+
'''X <- seq(min(train_data$minorAL), max(train_data$minorAL), length.out = 100)'''
|| At the prompt type '''cd space Downloads '''and press '''Enter.'''
+
  
Our ''' orca''' directory is in the '''Downloads''' directory.
 
  
|-
+
'''Y <- seq(min(train_data$ecc), max(train_data$ecc), length.out = 100)'''
|| Type '''orca\orca '''press Enter.
+
|| Then type '''orca\orca '''and press '''Enter'''.
+
  
|-
 
|| Point to the message.
 
|| A message appears.
 
  
'''This program requires the name of a parameter file as argument'''
+
'''min_max <- expand.grid(minorAL = X, ecc = Y)'''
  
'''For example ORCA TEST.INP'''
 
  
|-
+
'''min_max$predicted_class <- predict(LDA_model, newdata = min_max)$class'''
||'''orca\orca carbonmonoxide\carbonmonoxide.inp'''
+
|| The input file is saved in my Downloads directory.
+
  
Now type '''orca\orca''' space '''carbonmonoxide\carbonmonoxide.inp '''and press '''Enter.'''
 
  
 +
'''grid <- expand.grid(minorAL = X, ecc = Y)'''
  
Users have to type the commands as per their folder structure.
+
'''grid$class <- predict(LDA_model, newdata = grid)$class'''
|-
+
|| Show the processing.
+
  
Point to the final output.
 
|| Observe that, the input file is processed in the orca environment.
 
  
The output is generated with ORCA terminated normally.
+
'''grid$classnum <- as.numeric(grid$class)'''
  
|-
 
|| Open and show the output files in the directory.
 
|| The generated output files are saved in the same directory as the input file.
 
  
 +
Click on Save and Run buttons.
  
This confirms the successful installation of '''orca'''.
+
|| In the '''Source''' window, type these commands.
|-
+
|| Only Narration.
+
|| With this, we come to the end of this tutorial.
+
  
Let us summarise.
 
|-
 
|| '''Slide Number 7'''
 
  
'''Summary '''
+
This block of code operates as a setup for visual plotting.
  
|| In this tutorial, we have learnt to,
 
  
* Download orca 5.0.4 compressed zip files for Windows OS
+
It builds a grid of coordinates spanning the range of the training data, together with the class predicted for each grid point.
  
* Extract the compressed files.
 
  
* Run an o'''rca''' input file to check the installation
+
The '''seq()''' function generates a sequence of evenly spaced values between the smallest and largest values of the '''minorAL''' and '''ecc''' variables in the training data.
  
|-
 
|| '''Slide Number 8'''
 
  
'''Assignment'''
+
The '''grid''' variable contains the generated data along with the predictions of '''LDA_model''' on it.
  
'''https://sites.google.com/site/orcainputlibrary/home'''
 
  
|| As an assignment, users can
+
The '''as.numeric()''' function encodes the predicted class labels as numeric values.
  
Go through '''ORCA Input Library''' and check the updates for '''input files'''.
 
  
Check the '''ORCA Forum''' for queries.
+
Select the commands and run them.
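To see what the grid setup produces, it may help to run it on a tiny example. This is a sketch with made-up ranges and only 3 values per axis; the column names '''minorAL''' and '''ecc''' follow the tutorial, while '''toy_grid''', '''toy_labels''' and the labels "A" and "B" are hypothetical.

```r
# Three evenly spaced values per axis (the tutorial uses length.out = 100)
X <- seq(0, 1, length.out = 3)
Y <- seq(10, 20, length.out = 3)

# expand.grid() pairs every X value with every Y value
toy_grid <- expand.grid(minorAL = X, ecc = Y)
nrow(toy_grid)                    # 3 * 3 = 9 rows

# as.numeric() on a factor returns the integer code of each level;
# levels are sorted alphabetically, so "A" is 1 and "B" is 2
toy_labels <- factor(c("B", "A", "B"))
as.numeric(toy_labels)            # 2 1 2
```

With '''length.out = 100''' on both axes, as in the tutorial, the grid has 100 * 100 = 10000 rows. The numeric class codes are what allow a contour to be drawn between the two predicted regions.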
  
|-
+
|-  
|| '''Slide Number 9'''
+
|| Point to the Environment tab.
 +
|| Drag boundary to see the details in the Environment tab.
  
'''Spoken Tutorial Project'''
 
  
|| The video at the following link summarises the Spoken Tutorial Project.
+
These variables contain the data for the visualization of the linear discriminants.
  
Please download and watch it.
+
Click the '''grid''' '''data''' in the Environment tab.
  
|-
+
The '''grid data''' table is loaded in the '''Source''' window.
|| '''Slide Number 10'''
+
  
'''Spoken Tutorial workshops'''
+
|-
|| We conduct workshops using spoken tutorials and give certificates.
+
|| [RStudio]
  
  
For more details, please write to us.
+
'''ggplot() +'''
|-
+
|| '''Slide Number 11'''
+
  
'''Forum for questions'''
+
'''geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 3) +'''
  
 +
'''geom_point(data = min_max, aes(x = minorAL, y = ecc, color = predicted_class), size = 1, alpha = 0.3) +'''
 +
 +
'''theme_minimal()'''
 +
 +
 +
'''ggplot() +'''
 +
 +
'''geom_raster(data=grid, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''
 +
 +
'''geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''
 +
 +
'''geom_contour(data= grid, aes(x=minorAL, y=ecc, z = classnum), colour="black", linewidth = 1.2) +'''
 +
 +
'''scale_fill_manual(values = c("#ffff46", "#FF46e9")) +'''
 +
 +
'''scale_color_manual(values = c("red", "blue")) +'''
 +
 +
'''labs(title = "LDA Decision Boundary") +'''
 +
 +
'''theme_minimal()'''
 +
 +
|| In the '''Source''' window, type these commands.
 +
 +
|-
 +
|| Highlight the command
 +
 +
'''ggplot() +'''
 +
 +
'''geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 3) +'''
 +
 +
'''geom_point(data = min_max, aes(x = minorAL, y = ecc, color = predicted_class), size = 1, alpha = 0.3) +theme_minimal()'''
 +
 +
 +
'''ggplot() +'''
 +
 +
'''geom_raster(data=grid, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''
 +
 +
'''geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''
 +
 +
'''geom_contour(data= grid, aes(x=minorAL, y=ecc, z = classnum), colour="black", linewidth = 1.2) +'''
 +
 +
'''scale_fill_manual(values = c("#ffff46", "#FF46e9")) +'''
 +
 +
'''scale_color_manual(values = c("red", "blue")) +'''
 +
 +
'''labs(title = "LDA Decision Boundary") +'''
 +
 +
'''theme_minimal()'''
 +
 +
 +
Select the commands and run them.
 +
 +
|| This command creates the decision boundary plot.
 +
 +
 +
It plots the '''grid''' points with colors indicating the predicted classes.
 +
 +
'''geom_raster()''' creates a color map indicating the predicted classes of the grid points.
 +
 +
'''geom_contour()''' draws the decision boundary of the LDA model.
 +
 +
The '''scale_color_manual()''' and '''scale_fill_manual()''' functions assign specific colors to the classes.
 +
 +
 +
The overall plot provides a visual representation of the decision boundary and the distribution of training data points of the '''model'''.
 +
 +
 +
Select and run these commands.
 +
 +
 +
Drag boundaries to see the plot window clearly.
 +
|-
 +
|| Point to the output in the '''Plots''' window.
 +
|| We can see that our model has separated most of the data points clearly.
 +
|-
 +
|| Only Narration
 +
|| With this, we come to the end of this tutorial.
 +
 +
Let us summarize.
 +
|-
 +
|| '''Show Slide'''
 +
 +
'''Summary'''
 +
|| In this tutorial we have learnt:
 +
 +
* Linear Discriminant Analysis ('''LDA''') and its implementation.
 +
* Assumptions of LDA
 +
* Limitations of LDA
 +
* LDA on a subset of Raisin dataset
 +
* Visualization of the '''LDA''' separator and its corresponding confusion matrix
 +
 +
 +
|-
 
||  
 
||  
* Do you have questions in THIS '''Spoken Tutorial'''?
+
|| Now we will suggest an assignment for this Spoken Tutorial.
* Please visit this site.
+
|-
* Choose the minute and second where you have the question.
+
|| '''Show Slide'''
* Explain your question briefly.
+
* The spoken tutorial project will ensure an answer.
+
  
You will have to register on this website to ask questions.
+
'''Assignment'''
|-
+
||  
|| '''Slide Number 13'''
+
* Perform LDA on the built-in '''PlantGrowth''' dataset
 +
* Evaluate the model using a confusion matrix and visualize the results
  
'''Acknowledgement'''
+
|-
 +
|| '''Show slide'''
  
|| The '''Spoken Tutorial''' project was established by the Ministry of Education (MoE), Govt. Of India
+
'''About the Spoken Tutorial Project'''
|-
+
|| The video at the following link summarizes the Spoken Tutorial project.
||Only Narration.
+
 
|| This is Madhuri Ganapathi from IIT, Bombay signing off.
+
Please download and watch it.
 +
|-
 +
|| '''Show slide'''
 +
 
 +
'''Spoken Tutorial Workshops'''
 +
|| We conduct workshops using Spoken Tutorials and give certificates.
 +
 
 +
 
 +
Please contact us.
 +
|-
 +
|| '''Show Slide'''
 +
 
 +
'''Spoken Tutorial Forum to answer questions.'''
 +
 
 +
Do you have questions in THIS Spoken Tutorial?
 +
 
 +
Choose the minute and second where you have the question. Explain your question briefly.
 +
 
 +
Someone from the FOSSEE team will answer them.
 +
 
 +
Please visit this site.
 +
|| Please post your timed queries in this forum.
 +
|-
 +
|| '''Show Slide'''
 +
 
 +
'''Forum to answer questions'''
 +
|| Do you have any general/technical questions?
 +
 
 +
Please visit the forum given in the link.
 +
|-
 +
|| '''Show Slide'''
 +
 
 +
'''Textbook Companion'''
 +
|| The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.
 +
 
 +
We give certificates to those who do this.
 +
 
 +
For more details, please visit these sites.
 +
|-
 +
|| '''Show Slide'''
 +
 
 +
'''Acknowledgment'''
 +
|| The '''Spoken Tutorial''' project was established by the Ministry of Education, Govt. of India.
 +
|-  
 +
|| '''Show Slide'''
  
 +
'''Thank You'''
 +
|| This tutorial is contributed by Yate Asseke Ronald and Debatosh Chakraborthy from IIT Bombay.
  
 +
Thank you for joining.
 
|-
 
|-
 
|}
 
|}

Revision as of 13:03, 21 November 2023

Title of the script: Linear Discriminant Analysis in R

Author: YATE ASSEKE RONALD OLIVERA and Debatosh Charkraborty

Keywords: R, RStudio, machine learning, supervised, unsupervised, dimensionality reduction, LDA, video tutorial.

Visual Cue Narration
Show slide

Opening Slide

Welcome to this spoken tutorial on Linear Discriminant Analysis in R.
Show slide

Learning Objectives

In this tutorial, we will learn about:
  1. Linear Discriminant Analysis (LDA) and its implementation.
  2. Assumptions of LDA
  3. Limitations of LDA
  4. LDA on a subset of Raisin dataset
  5. Visualization of the LDA separator and its corresponding confusion matrix.


Show slide

System Specifications

This tutorial is recorded using:
  • Windows 11
  • R version 4.3.0
  • RStudio version 2023.06.1

It is recommended to install R version 4.2.0 or higher.

Show slide.

Prerequisites

https://spoken-tutorial.org

To follow this tutorial, the learner should know:
  • Basics of R programming.
  • Basics of Machine Learning using R.

If not, please access the relevant tutorials on R on this website.

Show slide.

Linear Discriminant Analysis

Linear Discriminant Analysis is a statistical method.
  • It is used for classification.
  • It constructs a data-driven line that best separates different classes.
  • It is based on maximizing the likelihood function to classify two or more classes.


Show slide.

Applications of LDA

  • The LDA technique is used in several applications, such as:
    • Fraud detection
    • Bio-imaging classification
    • Classification of patient disease state
Only Narration Let us now understand the assumptions of LDA.
Show Slide

Assumptions for LDA

Multivariate Normality:
  • All predictors are continuous and Gaussian, with the same covariance matrix for all the classes.
  • Mean vectors for each class are different.
  • Data records are independent and identically distributed within each class.
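To make these assumptions concrete, here is a minimal sketch that simulates data satisfying them and then inspects each class. The data, the variable names (toy, x1, x2) and the chosen means and covariance are all made up for illustration; they are not part of the raisin dataset.

```r
library(MASS)  # mvrnorm() draws multivariate normal samples

set.seed(1)
shared_cov <- matrix(c(1, 0.3, 0.3, 1), nrow = 2)  # same covariance for both classes

# Two classes with different mean vectors but a common covariance matrix
class_a <- mvrnorm(500, mu = c(0, 0), Sigma = shared_cov)
class_b <- mvrnorm(500, mu = c(2, 2), Sigma = shared_cov)

toy <- data.frame(
  x1 = c(class_a[, 1], class_b[, 1]),
  x2 = c(class_a[, 2], class_b[, 2]),
  class = factor(rep(c("A", "B"), each = 500))
)

# Sample covariance matrix per class; under the LDA assumptions
# these should look similar to each other
class_covs <- lapply(split(toy[, c("x1", "x2")], toy$class), cov)
class_covs

# Sample mean vector per class; these should differ
class_means <- aggregate(cbind(x1, x2) ~ class, data = toy, FUN = mean)
class_means
```

On real data, a large difference between the per-class covariance matrices is a hint that the equal covariance assumption may not hold.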
Show Slide

Limitations of LDA

Now we will see the limitations of LDA.
  • Departure from Gaussianity may increase misclassification probability in LDA.
  • LDA may perform poorly if the classes have unequal covariance matrices.
Show Slide

Implementation Of LDA

Now let us implement LDA on the raisin dataset with two chosen variables.

More information on raisin data is available in the Additional Reading material on this tutorial page.

Show slide

Download Files

We will use a script file LDA.R

Please download this file from the Code files link of this tutorial.

Make a copy and then use it for practicing.

[Computer screen]

Point to LDA.R and the folder LDA.

Point to the MLProject folder on the Desktop.


Point to the LDA folder.

I have downloaded and moved these files to the LDA folder.


This folder is in the MLProject folder on my Desktop.


I have also set the LDA folder as my working directory.

Point to the script file LDA.R. In this tutorial, we will create an LDA classifier model on the raisin dataset.


Let us switch to RStudio.

Open LDA.R in RStudio


Point to LDA.R in RStudio.

Open the script LDA.R in RStudio.

For this, click on the script LDA.R.

Script LDA.R opens in RStudio.

Highlight the command library(readxl)

Highlight the command library(MASS)

Highlight the command library(ggplot2)

Highlight the command library(caret)

Highlight all the commands.

#install.packages("package_name")

The readxl package is used to load the Excel file.


The MASS package contains the lda() function that we will use for our analysis.


The ggplot2 package is used to plot the results of our analysis.


The caret package contains the confusionMatrix() function.


It is used as a measure for the performance of the classifier.


Please note that we need to install these packages before we can import them.


Please ensure that everything is installed correctly.


You can use the command install.packages("package_name") to install the required packages.


As I have already installed these packages, I will directly import them.

[RStudio]

library(readxl)

library(MASS)

library(ggplot2)

library(caret)

library(lattice)


Select and run these commands to import the requisite packages.
Highlight the command

data <- read_xlsx("Raisin.xlsx")


Highlight the command data<-data[c("minorAL","ecc","class")]


Highlight the commands.

data <- read_xlsx("Raisin.xlsx")


data<-data[c("minorAL","ecc","class")]

We will read the Excel file and choose three columns: two features (minorAL and ecc) and one target variable (class).

Run these commands to import the raisin dataset.

Drag boundary to see the Environment tab clearly.

Point to the data variable in the Environment tab.

Click the data to load the dataset.

Drag boundary to see the Environment tab clearly.

In the Environment tab under Data heading, you will see a data variable.

Click the data variable to load the dataset in the Source window.

Drag boundary to see the Source window clearly.
[RStudio]

Type these commands in the source window.

data$class <- factor(data$class)

In the Source window type this command.
Highlight the below commands.

data$class <- factor(data$class)

Select the commands and click the Run button.

Here we are converting the variable data$class to a factor.

It ensures that the categorical data is properly encoded.

Select the command and run it.
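As a small illustration of what factor() does, the sketch below uses a made-up character vector. The labels "A" and "B" are placeholders and not the actual class names in the raisin dataset.

```r
# Hypothetical class labels stored as plain text
labels <- c("A", "B", "A", "B", "B")

# factor() records the distinct categories as levels
class_factor <- factor(labels)

levels(class_factor)      # the distinct categories, sorted alphabetically
as.integer(class_factor)  # the integer codes R stores internally
table(class_factor)       # number of observations per category
```

Functions such as lda() rely on these levels to know which groups to separate.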

Only Narration. Now we split our dataset into training and testing data.
[RStudio]

Type the command in the source window.

set.seed(1)

index_split=sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE)

In the Source window type these commands.
Highlight the command

set.seed(1)

Highlight the command sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE)

Highlight the argument replace=FALSE

Select the commands and click the Run button.

First we set a seed for reproducible results.


We will create a vector of indices using the sample() function.


It selects 70% of the rows for training, leaving the remaining 30% for testing.


The training data is chosen using simple random sampling without replacement.

Select the commands and run them.

The vector is shown in the Environment tab.
Point to train-test split. We use the indices that we previously generated to obtain our train-test split.
[RStudio]

Type the command

train_data <- data[index_split, ]

test_data <- data[-c(index_split), ]

In the Source window type these commands.
Highlight the command

train_data <- data[index_split, ]

Highlight the command

test_data <- data[-c(index_split), ]

This creates training data, consisting of 630 unique rows.


This creates testing data, consisting of 270 unique rows.

Select the commands and click the Run button.


Point to the sets in the Environment Tab

Select the commands and run them.


The data sets are shown in the Environment tab.


Click on test_data and train_data to load them in the Source window.
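The same splitting idea can be tried without the Excel file. The sketch below uses a made-up data frame named toy with 400 rows as a stand-in for the raisin data. The names toy, n_train, train_toy and test_toy are assumptions for illustration, and round() is added only as a precaution against a fractional sample size.

```r
set.seed(1)  # make the random split reproducible

# Hypothetical stand-in for the raisin data: 400 rows, two features, one label
toy <- data.frame(
  minorAL = rnorm(400),
  ecc = runif(400),
  class = factor(rep(c("A", "B"), each = 200))
)

# Draw 70% of the row numbers without replacement
n_train <- round(0.7 * nrow(toy))
index_split <- sample(seq_len(nrow(toy)), size = n_train, replace = FALSE)

train_toy <- toy[index_split, ]   # rows whose numbers were drawn
test_toy <- toy[-index_split, ]   # all remaining rows

nrow(train_toy)  # 280
nrow(test_toy)   # 120

# No row appears in both sets
length(intersect(rownames(train_toy), rownames(test_toy)))  # 0
```

Because sampling is done without replacement, every row ends up in exactly one of the two sets.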

Only Narration. Let us train our LDA model.
[RStudio]

LDA_model <- lda(class~.,data=train_data)

LDA_model

In the Source window, type these commands.
Highlight the command

LDA_model <- lda(class~.,data=train_data)

LDA_model


Highlight the command LDA_model


Click on Save and Run buttons.

Point to the output in the console window.

We pass two parameters to the lda() function.
  1. formula
  2. data on which the model should train.

Select the commands and run them.

The output is shown in the console window.

Drag boundary to see the console window clearly.
Highlight output in the console. Our model provides us with a lot of information.

Let us go through them one at a time.

Highlight Prior probabilities of groups in the output.

Highlight Group means in the output.

Highlight Coefficients of linear discriminants in the output.

The prior probabilities of groups show the distribution of classes in the training dataset.


The group means display the mean value of each predictor variable for each class.


The coefficients of linear discriminants give the linear combination of predictor variables.


The given linear combinations form the decision rule of the LDA model.
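These three parts of the output can also be read directly from the fitted object. The sketch below fits lda() on simulated two-class data. The data and the names toy and toy_model are made up for illustration; only the column names are borrowed from the tutorial.

```r
library(MASS)  # provides lda() and mvrnorm()

set.seed(1)
shared_cov <- diag(2)  # identity covariance shared by both classes

# Hypothetical two-class data; column names mirror the tutorial's features
class_a <- mvrnorm(100, mu = c(0, 0), Sigma = shared_cov)
class_b <- mvrnorm(100, mu = c(3, 3), Sigma = shared_cov)
toy <- data.frame(
  minorAL = c(class_a[, 1], class_b[, 1]),
  ecc = c(class_a[, 2], class_b[, 2]),
  class = factor(rep(c("A", "B"), each = 100))
)

toy_model <- lda(class ~ ., data = toy)

toy_model$prior    # share of each class in the data (0.5 and 0.5 here)
toy_model$means    # one row per class, one column per predictor
toy_model$scaling  # coefficients of the linear discriminant LD1
```

With two classes there is only one linear discriminant, so the scaling matrix has a single column named LD1.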

Drag boundary to see the Source window clearly.
Let us use this model to make predictions on the testing data.
[RStudio]

predicted_values <- predict(LDA_model, test_data)

In the Source window type this command and run it.

Let us check what predicted_values contains.

Click the predicted_values data in the Environment tab.


Point to the table.

Click the predicted_values data in the Environment tab.

The predicted_values table is loaded in the Source window.

[RStudio]

head(predicted_values$class)

head(predicted_values$posterior)

head(predicted_values$x)

In the Source window type these commands and run them.


The output is seen in the console window.

Highlight the output of head(predicted_values$class) in the console.


Highlight the output of head(predicted_values$posterior) in the console.


Highlight the output of head(predicted_values$x) in the console.

predicted_values$class contains the class that the model has predicted for each observation.


predicted_values$posterior contains the posterior probability of the observation belonging to each class.

predicted_values$x contains the linear discriminants for each observation.
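To see how the three components relate to each other, here is a minimal sketch on simulated data. The data, the two new observations and the names toy_model, new_obs and toy_pred are made up for illustration.

```r
library(MASS)  # provides lda() and mvrnorm()

set.seed(1)
# Hypothetical two-class training data with well separated means
class_a <- mvrnorm(100, mu = c(0, 0), Sigma = diag(2))
class_b <- mvrnorm(100, mu = c(3, 3), Sigma = diag(2))
toy <- data.frame(
  minorAL = c(class_a[, 1], class_b[, 1]),
  ecc = c(class_a[, 2], class_b[, 2]),
  class = factor(rep(c("A", "B"), each = 100))
)
toy_model <- lda(class ~ ., data = toy)

# Two new, made-up observations to classify
new_obs <- data.frame(minorAL = c(0, 3), ecc = c(0, 3))
toy_pred <- predict(toy_model, newdata = new_obs)

names(toy_pred)      # "class" "posterior" "x"
toy_pred$class       # predicted class for each observation
toy_pred$posterior   # one column per class; each row sums to 1
toy_pred$x           # value of the linear discriminant LD1

# The predicted class is the one with the largest posterior probability
best <- colnames(toy_pred$posterior)[apply(toy_pred$posterior, 1, which.max)]
all(best == as.character(toy_pred$class))
```

The last line checks that the predicted class always matches the column with the highest posterior probability.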

Only Narration. Now we will measure the performance of our model using the Confusion Matrix.
[RStudio]

confusion <- table(test_data$class, predicted_values$class)


fourfoldplot(confusion, color = c("red", "green"), conf.level = 0, margin=1)


Click on Save and Run buttons.

In the Source window type these commands.


Save and run the commands.

Highlight the command confusion <- table(test_data$class, predicted_values$class)

Highlight the command

fourfoldplot(confusion, color = c("red", "green"), conf.level = 0, margin=1)

The table() function creates a confusion matrix.


The fourfoldplot() function generates a visual plot of the confusion matrix.


The output is seen in the plot window.

Highlight the plot in the plot window. Drag boundary to see the plot window clearly.

Given the specific seed, set.seed(1), LDA has misclassified 33 out of 270 observations.

This number may change for different sets of training data.
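The misclassification count can be turned into an accuracy figure with simple arithmetic. The sketch below uses a made-up 2 x 2 confusion matrix named toy_confusion. Its cell values are hypothetical and are not the tutorial's actual results. The last line applies the same arithmetic to the 33 out of 270 figure mentioned above.

```r
# Hypothetical 2 x 2 confusion matrix: rows are actual classes,
# columns are predicted classes
toy_confusion <- matrix(
  c(40, 5,
    10, 45),
  nrow = 2,
  byrow = TRUE,
  dimnames = list(actual = c("A", "B"), predicted = c("A", "B"))
)

correct <- sum(diag(toy_confusion))   # correctly classified observations
total <- sum(toy_confusion)           # all observations
accuracy <- correct / total           # 0.85
misclassified <- total - correct      # 15

# Same arithmetic for the figure quoted in the narration
tutorial_accuracy <- 1 - 33 / 270     # about 0.878
```

The diagonal of the confusion matrix holds the correct predictions, so everything off the diagonal is a misclassification.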

Only Narration. Let us visualize how well our model separates different classes.
[RStudio]

X <- seq(min(train_data$minorAL), max(train_data$minorAL), length.out = 100)


Y <- seq(min(train_data$ecc), max(train_data$ecc), length.out = 100)


min_max <- expand.grid(minorAL = X, ecc = Y)


min_max$predicted_class <- predict(LDA_model, newdata = min_max)$class


grid <- expand.grid(minorAL = X, ecc = Y)

grid$class <- predict(LDA_model, newdata = grid)$class


grid$classnum <- as.numeric(grid$class)


Click on Save and Run buttons.

In the Source window, type these commands.


This block of code operates as a setup for visual plotting.


It builds a grid of coordinates spanning the range of the training data, along with the class predicted for each grid point.


The seq function generates a sequence of evenly spaced values between the smallest and largest values of the 'minorAL' and 'ecc' variables in the training data.


The 'grid' variable contains the generated data including the prediction of the LDA_model on it.


The as.numeric function encodes the predicted class labels into numeric values.


Select the commands and run them.

Point to the Environment tab. Drag boundary to see the details in the Environment tab.


These variables contain the data for the visualization of the linear discriminants.

Click the grid data in the Environment tab.

The grid data table is loaded in the Source window.
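The grid construction is easier to follow on a tiny example. The sketch below uses made-up ranges and made-up class labels; the names x_vals, y_vals, toy_grid and toy_labels are assumptions for illustration.

```r
# Evenly spaced values over two hypothetical ranges
x_vals <- seq(0, 1, length.out = 3)     # 0.0, 0.5, 1.0
y_vals <- seq(10, 20, length.out = 2)   # 10, 20

# expand.grid() pairs every x value with every y value
toy_grid <- expand.grid(minorAL = x_vals, ecc = y_vals)
toy_grid         # 3 x 2 = 6 rows
nrow(toy_grid)   # 6

# Made-up class labels standing in for model predictions
toy_labels <- factor(c("A", "B", "B", "A", "A", "B"))
toy_grid$class <- toy_labels

# as.numeric() on a factor returns the level codes
toy_grid$classnum <- as.numeric(toy_grid$class)   # 1 2 2 1 1 2
```

Since the numeric codes take only the values 1 and 2, a contour drawn between them marks where the predicted class changes.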

[RStudio]


ggplot() +

geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 3) +

geom_point(data = min_max, aes(x = minorAL, y = ecc, color = predicted_class), size = 1, alpha = 0.3) +

theme_minimal()


ggplot() +

geom_raster(data=grid, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +

geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 2) +

geom_contour(data= grid, aes(x=minorAL, y=ecc, z = classnum), colour="black", linewidth = 1.2) +

scale_fill_manual(values = c("#ffff46", "#FF46e9")) +

scale_color_manual(values = c("red", "blue")) +

labs(title = "LDA Decision Boundary") +

theme_minimal()

In the Source window, type these commands.
Highlight the command

ggplot() +

geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 3) +

geom_point(data = min_max, aes(x = minorAL, y = ecc, color = predicted_class), size = 1, alpha = 0.3) +

theme_minimal()


ggplot() +

geom_raster(data=grid, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +

geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 2) +

geom_contour(data= grid, aes(x=minorAL, y=ecc, z = classnum), colour="black", linewidth = 1.2) +

scale_fill_manual(values = c("#ffff46", "#FF46e9")) +

scale_color_manual(values = c("red", "blue")) +

labs(title = "LDA Decision Boundary") +

theme_minimal()


Select the commands and run them.

This command creates the decision boundary plot.


It plots the grid points with colors indicating the predicted classes.

geom_raster creates a colour map indicating the predicted classes of the grid points.

geom_contour creates the decision boundary of the LDA.

The scale_color_manual and scale_fill_manual functions assign specific colors to the classes.


The overall plot provides a visual representation of the decision boundary and the distribution of training data points of the model.


Select and run these commands.


Drag boundaries to see the plot window clearly.

Point to the output in the Plots window. We can see that our model has separated most of the data points clearly.
Only Narration. With this, we come to the end of this tutorial.

Let us summarize.

Show Slide

Summary

In this tutorial, we have learnt about:
  • Linear Discriminant Analysis (LDA) and its implementation. 
  • Assumptions of LDA
  • Limitations of LDA
  • LDA on a subset of Raisin dataset
  • Visualization of the LDA separator and its corresponding confusion matrix


Now we will suggest an assignment for this Spoken Tutorial.
Show Slide

Assignment

  • Perform LDA on the inbuilt PlantGrowth dataset
  • Evaluate the model using a confusion matrix and visualize the results
Show slide

About the Spoken Tutorial Project

The video at the following link summarizes the Spoken Tutorial project.

Please download and watch it.

Show slide

Spoken Tutorial Workshops

We conduct workshops using Spoken Tutorials and give certificates.


Please contact us.

Show Slide

Spoken Tutorial Forum to answer questions.

Do you have questions in THIS Spoken Tutorial?

Choose the minute and second where you have the question. Explain your question briefly.

Someone from the FOSSEE team will answer them.

Please visit this site.

Please post your timed queries in this forum.
Show Slide

Forum to answer questions

Do you have any general/technical questions?

Please visit the forum given in the link.

Show Slide

Textbook Companion

The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.

We give certificates to those who do this.

For more details, please visit these sites.

Show Slide

Acknowledgment

The Spoken Tutorial project was established by the Ministry of Education, Govt. of India.
Show Slide

Thank You

This tutorial is contributed by Yate Asseke Ronald and Debatosh Chakraborthy from IIT Bombay.

Thank you for joining.

Contributors and Content Editors

Madhurig