<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="https://script.spoken-tutorial.org/skins/common/feed.css?303"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
		<id>https://script.spoken-tutorial.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Ushav</id>
		<title>Script | Spoken-Tutorial - User contributions [en]</title>
		<link rel="self" type="application/atom+xml" href="https://script.spoken-tutorial.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Ushav"/>
		<link rel="alternate" type="text/html" href="https://script.spoken-tutorial.org/index.php/Special:Contributions/Ushav"/>
		<updated>2026-04-09T03:39:16Z</updated>
		<subtitle>User contributions</subtitle>
		<generator>MediaWiki 1.23.17</generator>

	<entry>
		<id>https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C3/Bagging-in-R/English</id>
		<title>Machine-Learning-using-R/C3/Bagging-in-R/English</title>
		<link rel="alternate" type="text/html" href="https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C3/Bagging-in-R/English"/>
				<updated>2024-11-27T06:18:17Z</updated>
		
		<summary type="html">&lt;p&gt;Ushav: Created page with &amp;quot;'''Title of the script''': Bagging Algorithm for Decision Tree using R  '''Author''': Debatosh Chakraboty and YATE ASSEKE RONALD RONALD.  '''Keywords''': R, RStudio, Bagging A...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title of the script''': Bagging Algorithm for Decision Tree using R&lt;br /&gt;
&lt;br /&gt;
'''Author''': Debatosh Chakraborty and Yate Asseke Ronald Olivera.&lt;br /&gt;
&lt;br /&gt;
'''Keywords''': R, RStudio, Bagging Algorithm, machine learning, supervised, unsupervised, dataset, video tutorial.&lt;br /&gt;
&lt;br /&gt;
{|border=1&lt;br /&gt;
|-&lt;br /&gt;
|| '''Visual Cue'''&lt;br /&gt;
|| '''Narration'''&lt;br /&gt;
|-&lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Opening Slide'''&lt;br /&gt;
|| Welcome to this Spoken Tutorial on '''Bagging in R.'''&lt;br /&gt;
|-&lt;br /&gt;
||'''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Learning Objectives'''&lt;br /&gt;
|| In this tutorial, we will learn about: &lt;br /&gt;
* Bagging.&lt;br /&gt;
* Assumptions for Bagging.&lt;br /&gt;
* Advantages of Bagging.&lt;br /&gt;
* Implementation of Bagging using Decision Tree in R. &lt;br /&gt;
* Model Evaluation.&lt;br /&gt;
* Limitations of Bagging.&lt;br /&gt;
|-&lt;br /&gt;
||'''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''System Specifications'''&lt;br /&gt;
|| This tutorial is recorded using,&lt;br /&gt;
* '''Windows 11 '''&lt;br /&gt;
* '''R '''version''' 4.3.0'''&lt;br /&gt;
* '''RStudio''' version '''2023.06.1'''&lt;br /&gt;
&lt;br /&gt;
It is recommended to install '''R''' version '''4.2.0''' or higher. &lt;br /&gt;
|-&lt;br /&gt;
||'''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Prerequisites '''&lt;br /&gt;
&lt;br /&gt;
'''https://spoken-tutorial.org'''&lt;br /&gt;
|| To follow this tutorial, the learner should know:&lt;br /&gt;
Basic programming in '''R'''.&lt;br /&gt;
Basics of '''Machine Learning'''. &lt;br /&gt;
&lt;br /&gt;
If not, please access the relevant tutorials on this website.&lt;br /&gt;
|-&lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Bootstrap aggregation (Bagging) '''&lt;br /&gt;
|| Now let us learn about '''Bootstrap aggregation '''or '''Bagging'''.&lt;br /&gt;
* A classification model fitted on several training data subsets should ideally have consistent decision boundaries. &lt;br /&gt;
* Large variation in the decision boundaries indicates higher variability of the classification model.&lt;br /&gt;
* Bagging is a commonly used ensemble technique to reduce this variation.&lt;br /&gt;
* In Bagging, random subsets of the training data are repeatedly chosen to construct multiple classifiers.&lt;br /&gt;
* The Bootstrap classifiers constructed from chosen subsets are then aggregated.&lt;br /&gt;
* For bagging of the decision tree classifier, the aggregation is done by a majority vote of the class predicted by Bootstrap trees.&lt;br /&gt;
|-&lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Assumptions of Bagging'''&lt;br /&gt;
* Each observation is independent.&lt;br /&gt;
* The assumption of the chosen classifier is satisfied.&lt;br /&gt;
|| For bagging, the assumptions of the chosen base classifier must be satisfied.&lt;br /&gt;
|-&lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Advantages of Bagging'''&lt;br /&gt;
|| Advantages of Bagging include:&lt;br /&gt;
* Bagging reduces the variation of the chosen model.&lt;br /&gt;
* Bagging improves the performance (accuracy) of the decision tree classifier in general.&lt;br /&gt;
|-&lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Implementation of Bagging'''&lt;br /&gt;
|| Now we will perform '''Bagging of Decision Tree classifier '''on the '''Raisin''' dataset with two chosen variables.&lt;br /&gt;
|-&lt;br /&gt;
|| '''Show slide '''&lt;br /&gt;
&lt;br /&gt;
'''Download Files'''&lt;br /&gt;
|| For this tutorial, I will use a script file''' Bagging-Decision-Tree.R'''.&lt;br /&gt;
&lt;br /&gt;
'''Raisin Dataset 'raisin.xlsx'.'''&lt;br /&gt;
&lt;br /&gt;
Please download these files from the''' Code files''' link of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Make a copy and then use them while practicing.&lt;br /&gt;
|-&lt;br /&gt;
|| [Computer screen]&lt;br /&gt;
&lt;br /&gt;
Highlight '''Bagging-Decision-Tree.R''' and the folder &lt;br /&gt;
|| I have downloaded and moved these files to the '''Bagging''' folder.&lt;br /&gt;
&lt;br /&gt;
The '''Bagging''' folder is in the '''MLProject''' folder.&lt;br /&gt;
&lt;br /&gt;
I have also set the '''Bagging''' folder as my working directory.&lt;br /&gt;
|-&lt;br /&gt;
|| &lt;br /&gt;
|| Let us switch to '''RStudio'''. &lt;br /&gt;
|-&lt;br /&gt;
|| Double click '''Bagging-Decision-Tree.R''' in RStudio&lt;br /&gt;
&lt;br /&gt;
Point to '''Bagging-Decision-Tree.R''' in RStudio.&lt;br /&gt;
|| Open the script '''Bagging-Decision-Tree.R''' in '''RStudio'''.&lt;br /&gt;
&lt;br /&gt;
Script '''Bagging-Decision-Tree.R''' opens in '''RStudio'''.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''library(readxl)'''&lt;br /&gt;
&lt;br /&gt;
'''library(ipred)'''&lt;br /&gt;
&lt;br /&gt;
'''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
'''library(cvms)'''&lt;br /&gt;
&lt;br /&gt;
'''library(rpart)'''&lt;br /&gt;
||  Select and run these commands to import the necessary packages.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight '''library(ipred)'''&lt;br /&gt;
&lt;br /&gt;
Highlight''' library(rpart)'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''library(cvms)'''&lt;br /&gt;
||  The''' ipred '''library contains the '''bagging()''' function.&lt;br /&gt;
&lt;br /&gt;
The '''rpart '''library will be used to implement the decision tree model for bagging.&lt;br /&gt;
&lt;br /&gt;
We will use the '''cvms''' package for plotting the confusion matrix.&lt;br /&gt;
&lt;br /&gt;
As I have already installed these packages, I have imported them directly.&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''data &amp;lt;- read_xlsx(&amp;quot;Raisin.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;-data[c(&amp;quot;minorAL&amp;quot;,&amp;quot;ecc&amp;quot;,&amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
|| Run these commands to import the '''raisin''' dataset and prepare it for model building.&lt;br /&gt;
&lt;br /&gt;
Click on '''data''' in the '''Environment''' tab to load it in the '''Source''' window.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
'''index_split=sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE)'''&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| Type these commands in the source window to perform the train-test split.&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight '''set.seed(1)'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE)'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''replace=FALSE'''&lt;br /&gt;
&lt;br /&gt;
Select the commands and click the Run button.&lt;br /&gt;
||  Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
The data sets will be shown in the Environment tab.&lt;br /&gt;
|-&lt;br /&gt;
|| &lt;br /&gt;
|| Let us now create our '''Bagging''' model.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''bagging_model &amp;lt;- bagging(class ~ ., data = train_data, coob = TRUE, nbagg = 200,control = rpart.control(cp = 0.00001, xval = 10, maxdepth = 2))'''&lt;br /&gt;
||  In the source window, type these commands.&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight &lt;br /&gt;
&lt;br /&gt;
'''bagging_model &amp;lt;- bagging(class ~ ., data = train_data, coob = TRUE, nbagg = 200,control = rpart.control(cp = 0.00001, xval = 10, maxdepth = 2))'''&lt;br /&gt;
|| '''bagging():''' The bagging() function is used to create a bagging ensemble model.&lt;br /&gt;
&lt;br /&gt;
'''class ~ .''': This formula indicates that the model should predict the 'class' variable.&lt;br /&gt;
&lt;br /&gt;
It uses all other variables in the train_data as predictors.&lt;br /&gt;
&lt;br /&gt;
'''data:''' The dataset used for building the model. It is specified as train_data.&lt;br /&gt;
&lt;br /&gt;
'''coob:''' When '''coob''' is TRUE, the out-of-bag (OOB) error estimate is computed. &lt;br /&gt;
&lt;br /&gt;
OOB error measures the error of the generated bootstrap classifiers on the observations left out of each bootstrap sample.&lt;br /&gt;
&lt;br /&gt;
'''nbagg:''' Sets the number of bootstrap replicates for bagging. It is set to 200 in this case.&lt;br /&gt;
&lt;br /&gt;
The '''rpart.control''' argument allows us to set the hyperparameters of the base classifier. &lt;br /&gt;
&lt;br /&gt;
'''cp '''denotes the complexity parameter which is set to 0.00001.&lt;br /&gt;
&lt;br /&gt;
'''xval''' is the number of cross-validations, which is set to 10. &lt;br /&gt;
&lt;br /&gt;
'''maxdepth''' is the maximum depth of any node of the final tree. It is limited to 2 in this case.&lt;br /&gt;
&lt;br /&gt;
Select and run the command to train the model.&lt;br /&gt;
|-&lt;br /&gt;
|| '''print(bagging_model)'''&lt;br /&gt;
|| In the '''Source''' window type and run this command.&lt;br /&gt;
|-&lt;br /&gt;
|| Point to the console window.&lt;br /&gt;
|| The output is shown in the console window.&lt;br /&gt;
&lt;br /&gt;
Drag the boundary to see the console window clearly.&lt;br /&gt;
|-&lt;br /&gt;
||  Highlight&lt;br /&gt;
&lt;br /&gt;
'''Out-of-bag estimate of misclassification error: 0.1746'''&lt;br /&gt;
|| We can confirm that our model is trained successfully.&lt;br /&gt;
&lt;br /&gt;
The out-of-bag misclassification error of the model is 0.1746.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''predictions &amp;lt;- predict(bagging_model, newdata = test_data, type = &amp;quot;class&amp;quot;)'''&lt;br /&gt;
|| Let us now use our model for prediction.&lt;br /&gt;
&lt;br /&gt;
In the source window, type and run the command.&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight &lt;br /&gt;
&lt;br /&gt;
'''predictions &amp;lt;- predict(bagging_model, newdata = test_data, type = &amp;quot;class&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
Click on '''Save''' and '''Run '''buttons.&lt;br /&gt;
|| This command stores the prediction of the model bagging_model on test data in a variable '''predictions'''. &lt;br /&gt;
|-&lt;br /&gt;
|| &lt;br /&gt;
|| Let's now evaluate our model.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''confusion_matrix &amp;lt;- confusionMatrix(predictions, test_data$class)'''&lt;br /&gt;
|| Type this command in the '''Source''' window.&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight &lt;br /&gt;
&lt;br /&gt;
'''confusion_matrix &amp;lt;- confusionMatrix(predictions, test_data$class)'''&lt;br /&gt;
|| This command will create a confusion matrix list.&lt;br /&gt;
&lt;br /&gt;
The list will contain the different evaluation metrics.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''confusion_matrix$overall[&amp;quot;Accuracy&amp;quot;]'''&lt;br /&gt;
|| Now, let us type this command.&lt;br /&gt;
&lt;br /&gt;
This command will display the accuracy of the model.&lt;br /&gt;
&lt;br /&gt;
It retrieves the value from the confusion_matrix list created earlier.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight '''0.8407'''&lt;br /&gt;
|| We can see that our model has 84 percent accuracy.&lt;br /&gt;
&lt;br /&gt;
Note that we can achieve higher accuracy by not manually limiting the '''maxdepth''' parameter.&lt;br /&gt;
|-&lt;br /&gt;
|| '''confusion_table &amp;lt;- data.frame(confusion_matrix$table)'''&lt;br /&gt;
|| In the source window, type this command.&lt;br /&gt;
&lt;br /&gt;
This will create a data-frame of the confusion matrix table.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
&lt;br /&gt;
Click on confusion_table in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Notice that it displays the number of correct and incorrect predictions for each class.&lt;br /&gt;
|-&lt;br /&gt;
|| Cursor in the source window.&lt;br /&gt;
|| In the source window, type these commands to plot the confusion matrix.&lt;br /&gt;
|-&lt;br /&gt;
|| '''plot_confusion_matrix(confusion_table, '''&lt;br /&gt;
&lt;br /&gt;
'''target_col = &amp;quot;Reference&amp;quot;,'''&lt;br /&gt;
&lt;br /&gt;
'''prediction_col = &amp;quot;Prediction&amp;quot;,'''&lt;br /&gt;
&lt;br /&gt;
'''counts_col = &amp;quot;Freq&amp;quot;,'''&lt;br /&gt;
&lt;br /&gt;
'''palette = list(&amp;quot;low&amp;quot; = &amp;quot;pink1&amp;quot;,&amp;quot;high&amp;quot; = &amp;quot;green1&amp;quot;),'''&lt;br /&gt;
&lt;br /&gt;
'''add_normalized = FALSE,'''&lt;br /&gt;
&lt;br /&gt;
'''add_row_percentages = FALSE,'''&lt;br /&gt;
&lt;br /&gt;
'''add_col_percentages = FALSE)'''&lt;br /&gt;
|| We use the '''plot_confusion_matrix '''function from the''' cvms '''package.&lt;br /&gt;
&lt;br /&gt;
We will use the created data frame '''confusion_table'''.&lt;br /&gt;
&lt;br /&gt;
'''target_col''' is the column in the dataframe with the reference labels.&lt;br /&gt;
&lt;br /&gt;
'''prediction_col''' is the column in the dataframe with the predicted labels.&lt;br /&gt;
&lt;br /&gt;
'''counts_col''' is the column in the dataframe with the number of correct and incorrect labels.&lt;br /&gt;
&lt;br /&gt;
The palette will plot the correct and incorrect predictions in different colours. &lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
The output is seen in the plot window.&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight output in '''plot window'''&lt;br /&gt;
|| 24 '''Besni''' samples have been incorrectly classified.&lt;br /&gt;
&lt;br /&gt;
19 '''Kecimen''' samples have been incorrectly classified. &lt;br /&gt;
&lt;br /&gt;
Overall, the model has misclassified only 43 samples.&lt;br /&gt;
|-&lt;br /&gt;
|| &lt;br /&gt;
|| Let us plot our model decision boundary.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''grid &amp;lt;- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 200),'''&lt;br /&gt;
&lt;br /&gt;
'''ecc = seq(min(data$ecc), max(data$ecc), length = 200)) '''&lt;br /&gt;
&lt;br /&gt;
'''grid$class = predict(bagging_model, newdata = grid, type = &amp;quot;class&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''grid$classnum &amp;lt;- as.numeric(grid$class)'''&lt;br /&gt;
||  In the '''Source''' window, type these commands.&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight &lt;br /&gt;
&lt;br /&gt;
'''grid &amp;lt;- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 200),'''&lt;br /&gt;
&lt;br /&gt;
'''ecc = seq(min(data$ecc), max(data$ecc), length = 200)) '''&lt;br /&gt;
&lt;br /&gt;
'''&amp;lt;nowiki&amp;gt;# Predict classes&amp;lt;/nowiki&amp;gt;'''&lt;br /&gt;
&lt;br /&gt;
'''grid$class = predict(bagging_model, newdata = grid, type = &amp;quot;class&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''grid$classnum &amp;lt;- as.numeric(grid$class)'''&lt;br /&gt;
||  This code first creates a '''grid '''of points spanning the feature space.&lt;br /&gt;
&lt;br /&gt;
The '''Bagging '''model then predicts the class of each point in this grid.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
ggplot() +&lt;br /&gt;
&lt;br /&gt;
geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +&lt;br /&gt;
&lt;br /&gt;
geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class)) +&lt;br /&gt;
&lt;br /&gt;
geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),&lt;br /&gt;
&lt;br /&gt;
colour = &amp;quot;black&amp;quot;, linewidth = 0.7) +&lt;br /&gt;
&lt;br /&gt;
scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +&lt;br /&gt;
&lt;br /&gt;
scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +&lt;br /&gt;
&lt;br /&gt;
labs(x = &amp;quot;MinorAL&amp;quot;, y = &amp;quot;ecc&amp;quot;, title = &amp;quot;Decision Boundary of Bootstrap Bagging&amp;quot;) +&lt;br /&gt;
&lt;br /&gt;
theme_minimal()&lt;br /&gt;
|| In the '''Source''' window, type these commands.&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight &lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),'''&lt;br /&gt;
&lt;br /&gt;
'''colour = &amp;quot;black&amp;quot;, linewidth = 0.7) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(x = &amp;quot;MinorAL&amp;quot;, y = &amp;quot;ecc&amp;quot;, title = &amp;quot;Decision Boundary of Bootstrap Bagging&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
|| We plot the decision boundary using the predicted classes of the grid.&lt;br /&gt;
&lt;br /&gt;
This command plots the decision boundary and the distribution of data points, with colors indicating the predicted classes.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
|-&lt;br /&gt;
|| Drag boundaries.&lt;br /&gt;
|| Drag boundaries to see the plot window clearly.&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight output in plot window&lt;br /&gt;
|| We observe that the model has separated most of the data points clearly.&lt;br /&gt;
&lt;br /&gt;
Note that after applying bagging to the decision tree classifier, the decision boundary looks similar to that of a single decision tree.&lt;br /&gt;
&lt;br /&gt;
However, it is more robust and more complex.&lt;br /&gt;
|-&lt;br /&gt;
|| '''Limitations of Bagging'''&lt;br /&gt;
* Bagging is hard to interpret.&lt;br /&gt;
* Requires more computational time.&lt;br /&gt;
* Bagging doesn’t improve model bias.&lt;br /&gt;
|| These are the limitations of Bagging.&lt;br /&gt;
|-&lt;br /&gt;
|| Only Narration&lt;br /&gt;
|| With this we come to the end of this tutorial. &lt;br /&gt;
&lt;br /&gt;
Let us summarize. &lt;br /&gt;
|-&lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Summary'''&lt;br /&gt;
|| In this tutorial we have learnt about:&lt;br /&gt;
* Bagging &lt;br /&gt;
* Assumptions for Bagging&lt;br /&gt;
* Advantages of Bagging&lt;br /&gt;
* Implementation of Bagging using Decision Tree in R &lt;br /&gt;
* Model Evaluation&lt;br /&gt;
* Limitations of Bagging&lt;br /&gt;
|-&lt;br /&gt;
|| &lt;br /&gt;
|| Now we will suggest the assignment for this Spoken Tutorial.&lt;br /&gt;
|-&lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Assignment'''&lt;br /&gt;
|| &lt;br /&gt;
* Apply Bagging using Decision Tree on '''PimaIndiansDiabetes''' dataset &lt;br /&gt;
* Install the '''pdp''' package and import the dataset using the '''data(pima)''' command&lt;br /&gt;
* Visualize the decision boundary and measure the accuracy of the model&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''About the Spoken Tutorial Project'''&lt;br /&gt;
|| The video at the following link summarizes the Spoken Tutorial project. &lt;br /&gt;
&lt;br /&gt;
Please download and watch it.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Spoken Tutorial Workshops'''&lt;br /&gt;
|| We conduct workshops using Spoken Tutorials and give certificates.&lt;br /&gt;
&lt;br /&gt;
Please contact us.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Forum to answer questions&lt;br /&gt;
&lt;br /&gt;
Do you have questions in THIS Spoken Tutorial?&lt;br /&gt;
&lt;br /&gt;
Choose the minute and second where you have the question.&lt;br /&gt;
&lt;br /&gt;
Explain your question briefly.&lt;br /&gt;
&lt;br /&gt;
Someone from the FOSSEE team will answer them.&lt;br /&gt;
&lt;br /&gt;
Please visit this site.&lt;br /&gt;
|| Please post your timed queries in this forum.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Forum to answer questions'''&lt;br /&gt;
|| Do you have any general/technical questions?&lt;br /&gt;
&lt;br /&gt;
Please visit the forum given in the link.&lt;br /&gt;
|- &lt;br /&gt;
||  '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Textbook Companion'''&lt;br /&gt;
|| The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.&lt;br /&gt;
&lt;br /&gt;
We give certificates to those who do this.&lt;br /&gt;
&lt;br /&gt;
For more details, please visit these sites.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Acknowledgment'''&lt;br /&gt;
|| The '''Spoken Tutorial Project''' was established by the Ministry of Education, Government of India.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
Thank You&lt;br /&gt;
||This tutorial is contributed by Debatosh Chakraborty and Yate Asseke Ronald O from IIT Bombay.&lt;br /&gt;
&lt;br /&gt;
Thank you for joining.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Ushav</name></author>	</entry>

	<entry>
		<id>https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Decision-Tree-in-R/English</id>
		<title>Machine-Learning-using-R/C2/Decision-Tree-in-R/English</title>
		<link rel="alternate" type="text/html" href="https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Decision-Tree-in-R/English"/>
				<updated>2024-11-27T06:10:45Z</updated>
		
		<summary type="html">&lt;p&gt;Ushav: Created page with &amp;quot;'''Title of the script''': Decision Tree in R  '''Author''': Debatosh Chakraborty and Yate Asseke Ronald Olivera  '''Keywords''': R, RStudio, machine learning, supervised, uns...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title of the script''': Decision Tree in R&lt;br /&gt;
&lt;br /&gt;
'''Author''': Debatosh Chakraborty and Yate Asseke Ronald Olivera&lt;br /&gt;
&lt;br /&gt;
'''Keywords''': R, RStudio, machine learning, supervised, unsupervised, classification, regression, decision tree, video tutorial.&lt;br /&gt;
&lt;br /&gt;
{|border=1&lt;br /&gt;
|-&lt;br /&gt;
|| '''Visual Cue'''&lt;br /&gt;
|| '''Narration'''&lt;br /&gt;
|-&lt;br /&gt;
||'''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Opening Slide'''&lt;br /&gt;
|| Welcome to this Spoken Tutorial on '''Decision Tree''' '''in R.'''&lt;br /&gt;
|-&lt;br /&gt;
||'''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Learning Objectives'''&lt;br /&gt;
|| In this tutorial, we will learn about:&lt;br /&gt;
* '''Decision Tree'''&lt;br /&gt;
* Assumptions for '''Decision Tree'''&lt;br /&gt;
* Advantages of '''Decision Tree'''&lt;br /&gt;
* Implementation of '''Decision Tree''' in '''R'''.&lt;br /&gt;
* Plotting the decision tree model&lt;br /&gt;
* Evaluation of the model'''.'''&lt;br /&gt;
* Visualizing the model decision boundary&lt;br /&gt;
* Limitations of '''Decision Tree'''.&lt;br /&gt;
|-&lt;br /&gt;
||'''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''System Specifications'''&lt;br /&gt;
||This tutorial is recorded using,&lt;br /&gt;
* '''Windows 11'''&lt;br /&gt;
* '''R '''version '''4.3.0'''&lt;br /&gt;
* '''RStudio''' version '''2023.06.1'''&lt;br /&gt;
&lt;br /&gt;
It is recommended to install '''R''' version '''4.2.0''' or higher.&lt;br /&gt;
|-&lt;br /&gt;
||'''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Prerequisites'''&lt;br /&gt;
&lt;br /&gt;
'''https://spoken-tutorial.org'''&lt;br /&gt;
|| To follow this tutorial, the learner should know:&lt;br /&gt;
* '''Basic programming in R'''&lt;br /&gt;
* '''Basics of Machine Learning'''&lt;br /&gt;
&lt;br /&gt;
If not, please access the relevant tutorials on this website.&lt;br /&gt;
|-&lt;br /&gt;
||'''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''What is a Decision Tree?'''&lt;br /&gt;
||Let us see what a decision tree is.&lt;br /&gt;
* It uses a binary tree to split the feature space into several sub-regions.&lt;br /&gt;
* The nodes of the tree are the locations at which the feature space splits.&lt;br /&gt;
* Misclassification error, Gini index, and entropy aid in identifying ideal splits.&lt;br /&gt;
* The decision boundaries in the Decision Tree model are nonlinear.&lt;br /&gt;
|-&lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Assumptions of Decision Tree'''&lt;br /&gt;
* The root node of the tree consists of the entire training set.&lt;br /&gt;
* The model does not assume any specific distribution of features.&lt;br /&gt;
* Each observation is independent.&lt;br /&gt;
|| The assumptions of the decision tree model are as follows.&lt;br /&gt;
|-&lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Advantages of Decision Tree'''&lt;br /&gt;
|| The advantages of the decision tree model include:&lt;br /&gt;
* It does not require feature variables to be necessarily continuous&lt;br /&gt;
* Decision trees are intuitive and easy to visualize&lt;br /&gt;
* When the response is continuous, the decision tree methodology can be easily implemented as a regression tree&lt;br /&gt;
&lt;br /&gt;
The regression tree method will be discussed in a separate tutorial.&lt;br /&gt;
|-&lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Implementation of Decision Tree'''&lt;br /&gt;
|| Now we will construct a '''Decision Tree '''on the '''Raisin''' dataset with two chosen variables.&lt;br /&gt;
|-&lt;br /&gt;
||'''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Download Files'''&lt;br /&gt;
|| For this tutorial, we will use&lt;br /&gt;
&lt;br /&gt;
A script file '''DecisionTree.R'''.&lt;br /&gt;
&lt;br /&gt;
Raisin Dataset 'raisin.xlsx'&lt;br /&gt;
&lt;br /&gt;
Please download these files from the '''Code files''' link of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Make a copy and then use them while practicing.&lt;br /&gt;
|-&lt;br /&gt;
|| [Computer screen]&lt;br /&gt;
&lt;br /&gt;
Highlight '''DecisionTree.R'''&lt;br /&gt;
|| I have downloaded and moved these files to the '''Decision Tree '''folder.&lt;br /&gt;
&lt;br /&gt;
We will create a Decision Tree classifier model on the '''raisin''' dataset.&lt;br /&gt;
|-&lt;br /&gt;
|| &lt;br /&gt;
|| Let us switch to '''RStudio'''.&lt;br /&gt;
|-&lt;br /&gt;
|| Double-click '''DecisionTree.R''' on RStudio&lt;br /&gt;
&lt;br /&gt;
Point to '''DecisionTree.R''' in RStudio.&lt;br /&gt;
|| Open the script '''DecisionTree.R''' in '''RStudio'''.&lt;br /&gt;
&lt;br /&gt;
Script '''DecisionTree.R''' opens in '''RStudio'''.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight '''library(readxl)'''&lt;br /&gt;
&lt;br /&gt;
'''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
'''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''library(rpart)'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''library(rpart.plot)'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''library(cvms)'''&lt;br /&gt;
&lt;br /&gt;
'''&amp;lt;nowiki&amp;gt;#install.packages(“package_name”)&amp;lt;/nowiki&amp;gt;'''&lt;br /&gt;
&lt;br /&gt;
'''Point to the command.'''&lt;br /&gt;
|| Select and run these commands to import the packages.&lt;br /&gt;
&lt;br /&gt;
These packages will be used to aid the building and evaluation of the classifier.&lt;br /&gt;
&lt;br /&gt;
We will use the '''rpart''' package to create the decision tree classifier.&lt;br /&gt;
&lt;br /&gt;
We will use the '''rpart.plot '''package for plotting the '''decision tree'''.&lt;br /&gt;
&lt;br /&gt;
We will use the '''cvms''' package for plotting the confusion matrix.&lt;br /&gt;
&lt;br /&gt;
Please ensure that all the packages are installed correctly.&lt;br /&gt;
&lt;br /&gt;
As I have already installed the packages, I have imported them directly.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''data &amp;lt;- read_xlsx(&amp;quot;Raisin.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''data &amp;lt;- data[c(&amp;quot;minorAL&amp;quot;,&amp;quot;ecc&amp;quot;,&amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
Select the commands and click the Run button&lt;br /&gt;
|| These commands will load the '''Raisin''' dataset.&lt;br /&gt;
&lt;br /&gt;
They will also prepare the dataset for model building.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|-&lt;br /&gt;
|| Click on '''data''' in the '''Environment''' tab to load the dataset.&lt;br /&gt;
&lt;br /&gt;
Point to the Source window.&lt;br /&gt;
|| Click on '''data''' in the '''Environment''' tab to load the modified data in the '''Source''' window.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1)'''&lt;br /&gt;
&lt;br /&gt;
'''trainIndex &amp;lt;- createDataPartition(data$class, p = 0.7, list = FALSE)'''&lt;br /&gt;
&lt;br /&gt;
'''train &amp;lt;- data[trainIndex, ]'''&lt;br /&gt;
&lt;br /&gt;
'''test &amp;lt;- data[-trainIndex, ]'''&lt;br /&gt;
|| In the '''Source''' window type these commands.&lt;br /&gt;
|-&lt;br /&gt;
||Highlight&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1)'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''trainIndex &amp;lt;- createDataPartition(data$class, p = 0.7, list = FALSE)'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''train &amp;lt;- data[trainIndex, ]'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''test &amp;lt;- data[-trainIndex, ]'''&lt;br /&gt;
|| This will split our dataset into training and testing data.&lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
|-&lt;br /&gt;
|| &lt;br /&gt;
|| Let us now create our '''Decision Tree''' model.&lt;br /&gt;
|-&lt;br /&gt;
||'''decision_model &amp;lt;- rpart(class ~ ., data = train, method = 'class','''&lt;br /&gt;
&lt;br /&gt;
'''control = rpart.control(cp = .00001, xval = 10, maxdepth = 2),'''&lt;br /&gt;
&lt;br /&gt;
'''parms = list(split = &amp;quot;gini&amp;quot;))'''&lt;br /&gt;
&lt;br /&gt;
'''summary(decision_model)'''&lt;br /&gt;
|| In the source window, type these commands.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight '''formula = class ~ .'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''data=train'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''method = 'class''''&lt;br /&gt;
&lt;br /&gt;
Highlight '''parms = list(split = &amp;quot;gini&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''maxdepth = 2'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''xval = 10'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''cp = .00001'''&lt;br /&gt;
&lt;br /&gt;
Click the Run button.&lt;br /&gt;
&lt;br /&gt;
Point to Environment tab.&lt;br /&gt;
||  This is the formula we use for this model.&lt;br /&gt;
&lt;br /&gt;
'''class''' is taken as the dependent variable.&lt;br /&gt;
&lt;br /&gt;
The remaining attributes are independent variables.&lt;br /&gt;
&lt;br /&gt;
'''data = train''' uses the '''training''' partition of the dataset to train our model.&lt;br /&gt;
&lt;br /&gt;
This tells our model that we are doing a classification task.&lt;br /&gt;
&lt;br /&gt;
The '''Gini index''' will be used to determine the best split at each node.&lt;br /&gt;
&lt;br /&gt;
This determines the maximum depth of the tree.&lt;br /&gt;
&lt;br /&gt;
This is the number of cross-validations for each split.&lt;br /&gt;
&lt;br /&gt;
'''cp''' is the complexity parameter: a split must improve the fit by at least this factor to be attempted.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
The model data is shown in the '''Environment''' tab.&lt;br /&gt;
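The Gini index mentioned above is simple to compute directly; here is a minimal base-R sketch (the helper name '''gini_index''' is ours, not part of '''rpart'''):

```r
# Gini impurity of a node: 1 minus the sum of squared class proportions.
# 0 means the node is pure; 0.5 is the maximum for two classes.
gini_index = function(labels) {
  p = table(labels) / length(labels)  # class proportions in the node
  1 - sum(p^2)
}

gini_index(rep("Kecimen", 4))                           # pure node -> 0
gini_index(c("Kecimen", "Besni", "Kecimen", "Besni"))   # 50/50 split -> 0.5
```

'''rpart''' evaluates candidate splits by how much they reduce this impurity.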
|-&lt;br /&gt;
||  '''Highlight''' CP&lt;br /&gt;
&lt;br /&gt;
'''Highlight '''Node Information&lt;br /&gt;
&lt;br /&gt;
'''Highlight '''n=630&lt;br /&gt;
&lt;br /&gt;
'''Highlight '''class counts&lt;br /&gt;
&lt;br /&gt;
'''Highlight '''probabilities&lt;br /&gt;
&lt;br /&gt;
'''Highlight '''Predicted class&lt;br /&gt;
&lt;br /&gt;
'''Highlight '''Primary splits&lt;br /&gt;
|| The summary of the model created is shown in the '''Console''' window.&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the '''Console''' window clearly.&lt;br /&gt;
&lt;br /&gt;
'''CP''' displays the complexity table for the trees created in the final model.&lt;br /&gt;
&lt;br /&gt;
This displays the information about each node created.&lt;br /&gt;
&lt;br /&gt;
This includes,&lt;br /&gt;
&lt;br /&gt;
Total observations used to create the node.&lt;br /&gt;
&lt;br /&gt;
The distribution of observations for each class in the node.&lt;br /&gt;
&lt;br /&gt;
The probability of each class.&lt;br /&gt;
&lt;br /&gt;
The class with the highest probability is the predicted class for the node.&lt;br /&gt;
&lt;br /&gt;
This denotes the split information for that particular node.&lt;br /&gt;
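The node quantities described above (class counts, probabilities, predicted class) can be reproduced in base R; the labels below are made up for illustration:

```r
# Class labels of the observations that reached one node (illustrative).
node_labels = c("Kecimen", "Kecimen", "Besni", "Kecimen")

counts = table(node_labels)             # class counts in the node
probs  = counts / length(node_labels)   # probability of each class
predicted = names(which.max(probs))     # class with the highest probability

predicted  # "Kecimen"
```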
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''rpart.plot(decision_model)'''&lt;br /&gt;
||Now let us visualize the decision tree model.&lt;br /&gt;
&lt;br /&gt;
In the '''Source '''window type this command and run it.&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the plot window clearly&lt;br /&gt;
|-&lt;br /&gt;
||  '''Hover '''Kecimen&lt;br /&gt;
&lt;br /&gt;
'''Hover''' 0.71&lt;br /&gt;
&lt;br /&gt;
'''Hover''' 48% 52%&lt;br /&gt;
||The trained decision tree model is shown in the plot window.&lt;br /&gt;
&lt;br /&gt;
For each node,&lt;br /&gt;
&lt;br /&gt;
the predicted class,&lt;br /&gt;
&lt;br /&gt;
its probability,&lt;br /&gt;
&lt;br /&gt;
and the percentage of total observations are shown.&lt;br /&gt;
&lt;br /&gt;
Note that the modeled tree is easy to interpret.&lt;br /&gt;
&lt;br /&gt;
This is because the maximum depth of the tree is manually restricted.&lt;br /&gt;
&lt;br /&gt;
But this comes at the cost of underfitting and an increase in misclassification error.&lt;br /&gt;
|-&lt;br /&gt;
|| &lt;br /&gt;
|| Now let us use the model to make predictions on the testing data partition.&lt;br /&gt;
|-&lt;br /&gt;
||[RStudio]&lt;br /&gt;
&lt;br /&gt;
'''predictions &amp;lt;- predict(decision_model, newdata = test, type = &amp;quot;class&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
Select and run this command.&lt;br /&gt;
||In the source window type this command and run it.&lt;br /&gt;
&lt;br /&gt;
This command generates the predicted classes from the trained decision tree model.&lt;br /&gt;
|-&lt;br /&gt;
|| &lt;br /&gt;
|| Let's now evaluate our model.&lt;br /&gt;
|-&lt;br /&gt;
||[RStudio]&lt;br /&gt;
&lt;br /&gt;
'''confusion_matrix &amp;lt;- confusionMatrix(predictions, test$class)'''&lt;br /&gt;
&lt;br /&gt;
|| Type this command in the '''Source''' window&lt;br /&gt;
|-&lt;br /&gt;
||Highlight&lt;br /&gt;
&lt;br /&gt;
'''confusion_matrix &amp;lt;- confusionMatrix(predictions, test$class)'''&lt;br /&gt;
||This command will create a confusion matrix list.&lt;br /&gt;
&lt;br /&gt;
The list will contain the different evaluation metrics.&lt;br /&gt;
&lt;br /&gt;
Select and run the command&lt;br /&gt;
|-&lt;br /&gt;
||[RStudio]&lt;br /&gt;
&lt;br /&gt;
'''confusion_matrix$overall[&amp;quot;Accuracy&amp;quot;]'''&lt;br /&gt;
||Now, let us type this command.&lt;br /&gt;
&lt;br /&gt;
This command will display the accuracy of the model by retrieving it from the '''confusion_matrix''' list.&lt;br /&gt;
&lt;br /&gt;
Select and run the command&lt;br /&gt;
|-&lt;br /&gt;
|| '''Highlight '''0.807&lt;br /&gt;
||We can see that our model has 80 percent accuracy.&lt;br /&gt;
&lt;br /&gt;
Note that the misclassifications are higher because we manually restricted the '''maxdepth''' attribute.&lt;br /&gt;
&lt;br /&gt;
Choosing a higher value will reduce the misclassification error but make the model less interpretable.&lt;br /&gt;
|-&lt;br /&gt;
|| '''confusion_table &amp;lt;- data.frame(confusion_matrix$table)'''&lt;br /&gt;
||In the '''Source''' window, type this command.&lt;br /&gt;
&lt;br /&gt;
This will create a data-frame of the confusion matrix table.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
&lt;br /&gt;
Click on confusion_table in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
We notice that it displays the number of correct and incorrect predictions for each class.&lt;br /&gt;
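The accuracy retrieved earlier can also be computed by hand from a confusion matrix; here is a base-R sketch with toy vectors (not the tutorial's actual predictions):

```r
# Toy truth and prediction vectors standing in for test$class and predictions.
truth = factor(c("Kecimen", "Kecimen", "Besni", "Besni", "Besni"))
preds = factor(c("Kecimen", "Besni",  "Besni", "Besni", "Kecimen"))

cm = table(Prediction = preds, Reference = truth)  # confusion matrix
accuracy = sum(diag(cm)) / sum(cm)                 # correct / total

accuracy  # 3 of 5 correct -> 0.6
```

The diagonal of the table holds the correct predictions for each class; the off-diagonal cells are the misclassifications.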
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
|| In the '''Source''' window, type these commands to plot the confusion matrix.&lt;br /&gt;
&lt;br /&gt;
It will represent the number of correct and incorrect predictions using different colors.&lt;br /&gt;
|-&lt;br /&gt;
||'''plot_confusion_matrix(confusion_table,'''&lt;br /&gt;
&lt;br /&gt;
'''target_col = &amp;quot;Reference&amp;quot;,'''&lt;br /&gt;
&lt;br /&gt;
'''prediction_col = &amp;quot;Prediction&amp;quot;,'''&lt;br /&gt;
&lt;br /&gt;
'''counts_col = &amp;quot;Freq&amp;quot;,'''&lt;br /&gt;
&lt;br /&gt;
'''palette = list(&amp;quot;low&amp;quot; = &amp;quot;pink1&amp;quot;,&amp;quot;high&amp;quot; = &amp;quot;green1&amp;quot;),'''&lt;br /&gt;
&lt;br /&gt;
'''add_normalized = FALSE,'''&lt;br /&gt;
&lt;br /&gt;
'''add_row_percentages = FALSE,'''&lt;br /&gt;
&lt;br /&gt;
'''add_col_percentages = FALSE)'''&lt;br /&gt;
|| We use the '''plot_confusion_matrix '''function from the cvms package.&lt;br /&gt;
&lt;br /&gt;
We will use the dataframe '''confusion_table'''.&lt;br /&gt;
&lt;br /&gt;
'''target_col''' is the '''Reference''' column in the dataframe '''confusion_table''', containing the true labels.&lt;br /&gt;
&lt;br /&gt;
'''prediction_col''' is the '''Prediction''' column in the dataframe '''confusion_table''', containing the predicted labels.&lt;br /&gt;
&lt;br /&gt;
'''counts_col''' is the '''Freq''' column in the dataframe '''confusion_table''', containing the counts of correct and incorrect predictions.&lt;br /&gt;
&lt;br /&gt;
The palette will plot the correct and incorrect predictions in different colours.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
The output is seen in the plot window.&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight '''Output in Plot window.'''&lt;br /&gt;
||This plot shows how well our model predicted the testing data.&lt;br /&gt;
&lt;br /&gt;
We observe that:&lt;br /&gt;
&lt;br /&gt;
'''Kecimen''' class: '''18''' misclassifications&lt;br /&gt;
&lt;br /&gt;
'''Besni''' class: '''34''' misclassifications&lt;br /&gt;
|-&lt;br /&gt;
|| &lt;br /&gt;
||Now let us visualize the decision boundary of the model.&lt;br /&gt;
|-&lt;br /&gt;
||[RStudio]&lt;br /&gt;
&lt;br /&gt;
'''grid &amp;lt;- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500),'''&lt;br /&gt;
&lt;br /&gt;
'''ecc = seq(min(data$ecc), max(data$ecc), length = 500))'''&lt;br /&gt;
&lt;br /&gt;
'''grid$class = predict(decision_model, newdata = grid, type = &amp;quot;class&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''grid$classnum &amp;lt;- as.numeric(grid$class)'''&lt;br /&gt;
||In the source window type these commands&lt;br /&gt;
|-&lt;br /&gt;
||Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''grid &amp;lt;- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500),'''&lt;br /&gt;
&lt;br /&gt;
'''ecc = seq(min(data$ecc), max(data$ecc), length = 500))'''&lt;br /&gt;
&lt;br /&gt;
'''grid$class = predict(decision_model, newdata = grid, type = &amp;quot;class&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''grid$classnum &amp;lt;- as.numeric(grid$class)'''&lt;br /&gt;
|| This code creates a '''grid '''of points spanning the range of '''minorAL '''and '''ecc '''features in the dataset.&lt;br /&gt;
&lt;br /&gt;
It then uses the '''Decision Tree''' model to predict the class of each point in this grid.&lt;br /&gt;
&lt;br /&gt;
It stores these predictions as a new column '''class''' in the '''grid''' dataframe.&lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
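The grid construction relies on '''expand.grid''', which forms the Cartesian product of its arguments; here is a tiny sketch (3 x 2 instead of the tutorial's 500 x 500):

```r
# expand.grid returns one row per combination of the supplied values,
# so a 3-value range crossed with a 2-value range gives 6 grid points.
grid = expand.grid(minorAL = c(200, 250, 300),
                   ecc     = c(0.6, 0.9))

nrow(grid)  # 6 rows: every combination of minorAL and ecc
```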
|-&lt;br /&gt;
||[RStudio]&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train, aes(x = minorAL, y = ecc, color = class)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),'''&lt;br /&gt;
&lt;br /&gt;
'''colour = &amp;quot;black&amp;quot;, linewidth = 0.7) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(x = &amp;quot;MinorAL&amp;quot;, y = &amp;quot;ecc&amp;quot;, title = &amp;quot;Decision Tree Decision Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
||To visualise the generated data, type these commands&lt;br /&gt;
|-&lt;br /&gt;
||Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train, aes(x = minorAL, y = ecc, color = class)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),'''&lt;br /&gt;
&lt;br /&gt;
'''colour = &amp;quot;black&amp;quot;, linewidth = 0.7) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(x = &amp;quot;MinorAL&amp;quot;, y = &amp;quot;ecc&amp;quot;, title = &amp;quot;Decision Tree Decision Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
||This command creates the decision boundary plot of the decision tree model.&lt;br /&gt;
&lt;br /&gt;
It shows the distribution of training data points.&lt;br /&gt;
&lt;br /&gt;
It plots the grid points with colors indicating the predicted classes using '''ggplot2'''.&lt;br /&gt;
&lt;br /&gt;
Select and run these commands.&lt;br /&gt;
&lt;br /&gt;
Drag boundaries to see the plot window clearly.&lt;br /&gt;
|-&lt;br /&gt;
|| Point to the plot&lt;br /&gt;
||It shows that the decision boundary of a decision tree model is non-linear.&lt;br /&gt;
&lt;br /&gt;
The complexity of the decision boundary increases with the complexity of the decision tree.&lt;br /&gt;
|-&lt;br /&gt;
||'''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Limitations of Decision tree'''&lt;br /&gt;
* If the tree is too complex, it can overfit data.&lt;br /&gt;
* Small variations in data can result in a different tree.&lt;br /&gt;
* Large trees are difficult to interpret.&lt;br /&gt;
* Noisy data may cause inaccurate splits.&lt;br /&gt;
&lt;br /&gt;
|| Here are some of the limitations of '''Decision Tree.'''&lt;br /&gt;
|-&lt;br /&gt;
|| Only Narration &lt;br /&gt;
||With this we come to the end of the tutorial.&lt;br /&gt;
&lt;br /&gt;
Let us summarize.&lt;br /&gt;
|-&lt;br /&gt;
||'''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Summary'''&lt;br /&gt;
|| In this tutorial we have learnt about:&lt;br /&gt;
* '''Decision Tree'''&lt;br /&gt;
* Assumptions of '''Decision Tree'''&lt;br /&gt;
* Advantages of '''Decision Tree.'''&lt;br /&gt;
* Implementation of '''Decision Tree '''in '''R'''.&lt;br /&gt;
* Plotting the decision tree model.&lt;br /&gt;
* Evaluation of the model.&lt;br /&gt;
* Visualizing the model decision boundary&lt;br /&gt;
* Limitations of Decision Tree.&lt;br /&gt;
|-&lt;br /&gt;
|| &lt;br /&gt;
|| Now we will suggest the assignment for this Spoken Tutorial.&lt;br /&gt;
|- &lt;br /&gt;
||'''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Assignment'''&lt;br /&gt;
||&lt;br /&gt;
* Apply Decision Tree on '''PimaIndiansDiabetes''' dataset&lt;br /&gt;
* Install the '''pdp''' package and import the dataset using the '''data(pima)''' command&lt;br /&gt;
* Visualize the decision tree and measure the accuracy of the model&lt;br /&gt;
|-&lt;br /&gt;
||'''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''About the Spoken Tutorial Project'''&lt;br /&gt;
||The video at the following link summarizes the Spoken Tutorial project.&lt;br /&gt;
&lt;br /&gt;
Please download and watch it.&lt;br /&gt;
|-&lt;br /&gt;
||'''Show slide'''&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Workshops&lt;br /&gt;
||We conduct workshops using Spoken Tutorials and give certificates.&lt;br /&gt;
&lt;br /&gt;
For more details, please contact us.&lt;br /&gt;
|-&lt;br /&gt;
||'''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Spoken Tutorial Forum to answer questions'''&lt;br /&gt;
|| Please post your timed queries in this forum.&lt;br /&gt;
|-&lt;br /&gt;
||'''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Forum to answer questions'''&lt;br /&gt;
||Do you have any general/technical questions?&lt;br /&gt;
&lt;br /&gt;
Please visit the forum given in the link.&lt;br /&gt;
|-&lt;br /&gt;
||  '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Textbook Companion'''&lt;br /&gt;
|| The '''FOSSEE '''team coordinates the coding of solved examples of popular books and case study projects.&lt;br /&gt;
&lt;br /&gt;
We give certificates to those who do this.&lt;br /&gt;
&lt;br /&gt;
For more details, please visit these sites.&lt;br /&gt;
|-&lt;br /&gt;
||'''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Acknowledgment'''&lt;br /&gt;
|| The Spoken Tutorial project was established by the Ministry of Education, Government of India.&lt;br /&gt;
|-&lt;br /&gt;
||'''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
Thank You&lt;br /&gt;
||This tutorial is contributed by Debatosh Chakraborty and Yate Asseke Ronald O. from IIT Bombay.&lt;br /&gt;
&lt;br /&gt;
Thank you for joining.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Ushav</name></author>	</entry>

	<entry>
		<id>https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English</id>
		<title>Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English</title>
		<link rel="alternate" type="text/html" href="https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English"/>
				<updated>2024-06-04T10:23:05Z</updated>
		
		<summary type="html">&lt;p&gt;Ushav: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title of the script''': Introduction to Machine Learning in R&lt;br /&gt;
&lt;br /&gt;
'''Author''': Debatosh Chakraborty&lt;br /&gt;
&lt;br /&gt;
'''Keywords''': R, RStudio, machine learning, supervised, unsupervised, video tutorial.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=1&lt;br /&gt;
|- &lt;br /&gt;
| align=center| '''Visual Cue'''&lt;br /&gt;
| align=center| '''Narration'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Opening Slide'''&lt;br /&gt;
|| Welcome to this spoken tutorial on''' Introduction to Machine Learning in R'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Learning Objectives'''&lt;br /&gt;
&lt;br /&gt;
|| In this tutorial, we will learn about: &lt;br /&gt;
* Machine Learning&lt;br /&gt;
* Supervised and Unsupervised Learning&lt;br /&gt;
* Workflow of ML Classifier Algorithm&lt;br /&gt;
* Visualizing Feature Space&lt;br /&gt;
* Constructing a dummy classifier&lt;br /&gt;
* Evaluation of the chosen dummy classifier&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''System Specifications'''&lt;br /&gt;
|| This tutorial is recorded using,&lt;br /&gt;
&lt;br /&gt;
* '''Windows 11 '''&lt;br /&gt;
* '''R '''version''' 4.3.0'''&lt;br /&gt;
* '''RStudio''' version '''2023.06.1'''&lt;br /&gt;
&lt;br /&gt;
It is recommended to install '''R''' version '''4.2.0''' or higher.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Prerequisites '''&lt;br /&gt;
&lt;br /&gt;
'''https://spoken-tutorial.org'''&lt;br /&gt;
|| To follow this tutorial, the learner should know&lt;br /&gt;
* Basic programming in '''R'''.&lt;br /&gt;
* How to use the '''ggplot2''' and '''dplyr''' packages.&lt;br /&gt;
&lt;br /&gt;
If not, please access the relevant tutorials on this website.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Machine Learning'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| About machine learning&lt;br /&gt;
&lt;br /&gt;
* ML enables computers to learn from data.&lt;br /&gt;
* ML algorithms automate the learning process from data through patterns.&lt;br /&gt;
* Their primary role is prediction, classification or clustering of data.&lt;br /&gt;
* ML algorithms are applied in several applications.&lt;br /&gt;
* For example, natural language processing, image and speech recognition, etc.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Types of Machine Learning''' &lt;br /&gt;
|| ML algorithms include the following types and tasks: &lt;br /&gt;
* '''Supervised '''learning: Prediction and Classification''',''' &lt;br /&gt;
* '''Unsupervised '''learning''': '''Clustering''','''&lt;br /&gt;
* '''Semi-supervised '''learning&lt;br /&gt;
* '''Reinforcement '''learning'''.'''&lt;br /&gt;
&lt;br /&gt;
In this series, we will focus on '''Supervised''' and '''Unsupervised''' learning algorithms. &lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Supervised and Unsupervised Learning'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| Supervised learning: Labeled data &lt;br /&gt;
* ML algorithms predict labels for unseen features &lt;br /&gt;
* They predict based on given features and labels of data.&lt;br /&gt;
&lt;br /&gt;
Unsupervised learning: Unlabeled data&lt;br /&gt;
* ML algorithms develop a mechanism to group similar features into clusters.&lt;br /&gt;
* And label them for future analysis.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slides'''&lt;br /&gt;
&lt;br /&gt;
'''Classification and Regression'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
* Supervised learning consists of Regression and Classification.&lt;br /&gt;
* '''Regression''' is applied to predict and learn continuous-valued responses from features. &lt;br /&gt;
* Regression techniques include Linear, Spline, Ridge, Lasso, and others.&lt;br /&gt;
* '''Classification''' is applied to predict the class of a discrete (labeled) response from features. &lt;br /&gt;
* Classification techniques include Logistic Regression, Decision Tree, SVM, and others.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slides'''&lt;br /&gt;
&lt;br /&gt;
'''Workflow of an ML Classifier algorithm'''&lt;br /&gt;
|| The workflow of an ML classifier algorithm includes:&lt;br /&gt;
* Feature Space: Collection of all possible values of the features.&lt;br /&gt;
* A classification algorithm partitions the feature space into a number of classes.&lt;br /&gt;
* Data is split into training and testing sets to learn and evaluate the algorithm.&lt;br /&gt;
* The model learns from the training data to create partitions of feature space.&lt;br /&gt;
* The model is evaluated on the test dataset through performance metrics.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Dataset'''&lt;br /&gt;
&lt;br /&gt;
|| Let’s use '''Raisin dataset '''with two chosen variables or features to understand a classification problem.&lt;br /&gt;
&lt;br /&gt;
For more information on Raisin data please refer to Additional Reading Material on this tutorial page.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide '''&lt;br /&gt;
&lt;br /&gt;
'''Download Files '''&lt;br /&gt;
|| We will use a script file '''Intro.R '''and '''Raisin Dataset ‘raisin.xlsx’'''&lt;br /&gt;
&lt;br /&gt;
Please download these files from the''' Code files''' link of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Make a copy and then use them while practicing.&lt;br /&gt;
|- &lt;br /&gt;
|| [Computer screen]&lt;br /&gt;
&lt;br /&gt;
point to '''Intro.R''' and the folder '''Introduction.'''&lt;br /&gt;
&lt;br /&gt;
Point to the''' MLProject folder '''on the '''Desktop.'''&lt;br /&gt;
&lt;br /&gt;
|| I have downloaded and moved these files to the '''Introduction '''folder. &lt;br /&gt;
&lt;br /&gt;
This folder is located in the '''MLProject''' folder on my '''Desktop'''.&lt;br /&gt;
&lt;br /&gt;
I have also set the '''Introduction''' folder as my working Directory.&lt;br /&gt;
&lt;br /&gt;
In this tutorial, we will introduce classification on the '''raisin''' dataset. &lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us switch to '''RStudio'''. &lt;br /&gt;
|- &lt;br /&gt;
|| Click Intro.R in RStudio&lt;br /&gt;
&lt;br /&gt;
Point to Intro.R in RStudio.&lt;br /&gt;
|| Let us open the script '''Intro.R''' in '''RStudio'''.&lt;br /&gt;
&lt;br /&gt;
Script '''Intro.R''' opens in '''RStudio'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' library(readxl)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
'''&amp;lt;nowiki&amp;gt;#install.packages(&amp;quot;package_name&amp;quot;)&amp;lt;/nowiki&amp;gt;'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Point to the command.'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select and run these commands to import the packages.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''readxl''' package to load the Excel file of our '''Raisin Dataset'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''caret''' package to create the '''confusion matrix.'''&lt;br /&gt;
&lt;br /&gt;
The '''ggplot2''' package will be used to create the '''decision boundary plot.'''&lt;br /&gt;
&lt;br /&gt;
Please ensure that all the packages are installed correctly.&lt;br /&gt;
&lt;br /&gt;
As I have already installed the packages, I have imported them directly. &lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' '''&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;- read_xlsx(&amp;quot;Raisin.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
|| Run this command to load the '''Raisin '''dataset.&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the '''Environment''' tab clearly.&lt;br /&gt;
&lt;br /&gt;
In the Environment tab below Data, you will see the '''data '''variable.&lt;br /&gt;
&lt;br /&gt;
Click on '''data '''to load the dataset in the Source window. &lt;br /&gt;
&lt;br /&gt;
Click on '''Intro.R''' in the Source window and close the tab.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command.&lt;br /&gt;
&lt;br /&gt;
'''data &amp;lt;- data[c(&amp;quot;minorAL&amp;quot;, &amp;quot;ecc&amp;quot;, &amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
Select the commands and click the Run button&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We now select three columns from data.&lt;br /&gt;
&lt;br /&gt;
Two columns (&amp;quot;minorAL&amp;quot;, &amp;quot;ecc&amp;quot;) are chosen as features.&lt;br /&gt;
&lt;br /&gt;
The class column is chosen as a target variable.&lt;br /&gt;
&lt;br /&gt;
We convert the target variable '''data$class '''to a factor. &lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|- &lt;br /&gt;
|| Click on the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click on '''data.'''&lt;br /&gt;
|| Click on '''data '''to load the modified data in the Source window.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| We will now understand the feature space of this data.&lt;br /&gt;
|- &lt;br /&gt;
|| '''range_minor_al &amp;lt;- range(data$minorAL)'''&lt;br /&gt;
&lt;br /&gt;
'''range_ecc &amp;lt;- range(data$ecc)'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''range_minor_al &amp;lt;- range(data$minorAL)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''range_ecc &amp;lt;- range(data$ecc)'''&lt;br /&gt;
|| These commands show the range of the feature variables '''minorAL''' and''' ecc.'''&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the environment tab clearly.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The minimum and maximum values of '''minorAL''' and '''ecc''' are stored in their range variables.&lt;br /&gt;
|- &lt;br /&gt;
|| '''X &amp;lt;- seq(min(data$minorAL), max(data$minorAL), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''Y &amp;lt;- seq(min(data$ecc), max(data$ecc), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''feature &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
&lt;br /&gt;
|| We will now use the range to generate grid points to construct the feature space.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''X &amp;lt;- seq(min(data$minorAL), max(data$minorAL), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''Y &amp;lt;- seq(min(data$ecc), max(data$ecc), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
|| This command generates a sequence of points spanning the range of '''minorAL '''and''' ecc'''.&lt;br /&gt;
&lt;br /&gt;
This command creates a Cartesian product of the two features to create the feature space.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|-&lt;br /&gt;
|| '''ggplot(data = data, aes(x = minorAL, y = ecc)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(aes(color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Feature Space&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| We will now plot the feature space created&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
|| '''ggplot(data = data, aes(x = minorAL, y = ecc)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(aes(color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Feature Space&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
|| These commands plot the data points in the feature space.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|-&lt;br /&gt;
|| Drag boundaries.&lt;br /&gt;
|| Drag boundaries to see the plot window clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the data.&lt;br /&gt;
|| Now let us split our data into training and testing data.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Click on '''Intro.R''' in the Source window, and type these commands.&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| This creates training data, consisting of 630 unique rows.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This creates testing data, consisting of 270 unique rows.&lt;br /&gt;
|-&lt;br /&gt;
|| Select the commands and click the Run button.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to the sets in the Environment Tab&lt;br /&gt;
&lt;br /&gt;
Click the '''test_data ''' and '''train_data '''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
The data sets are shown in the '''Environment '''tab.&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the Environment tab clearly&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Click on '''test_data ''' and '''train_data ''' to load them in the Source window.&lt;br /&gt;
|-&lt;br /&gt;
|| &lt;br /&gt;
|| Here we try to partition the '''feature space''' to construct the classifier.&lt;br /&gt;
&lt;br /&gt;
To begin with, one might construct a '''heuristic '''line to build the classifier.&lt;br /&gt;
|- &lt;br /&gt;
|| [Rstudio]&lt;br /&gt;
&lt;br /&gt;
'''fit = function(x)((x * (-0.0021)) + 1.445)'''&lt;br /&gt;
&lt;br /&gt;
'''model_predict &amp;lt;- function(x){'''&lt;br /&gt;
&lt;br /&gt;
'''factor(ifelse(x$ecc &amp;lt; fit(x$minorAL), &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
|| In the Source window type these commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''fit = function(x)((x * (-0.0021)) + 1.445)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''model_predict &amp;lt;- function(x){'''&lt;br /&gt;
&lt;br /&gt;
'''factor(ifelse(x$ecc &amp;lt; fit(x$minorAL), &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
Click Save and Click Run buttons. &lt;br /&gt;
|| Let us describe the steps of the classification algorithm.&lt;br /&gt;
&lt;br /&gt;
For that we will define a line to partition the data as a dummy classifier.&lt;br /&gt;
&lt;br /&gt;
It does not involve training on the data, so its performance may be poor.&lt;br /&gt;
&lt;br /&gt;
We define a function that separates data points belonging to either side of the line.&lt;br /&gt;
&lt;br /&gt;
Click Save.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
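The heuristic classifier described above can be exercised end to end on a couple of made-up points (the two points below are illustrative only):

```r
# The heuristic line from the tutorial: ecc = -0.0021 * minorAL + 1.445.
fit = function(x) (x * (-0.0021)) + 1.445

# Points on or above the line are labelled "Besni", points below "Kecimen"
# (equivalent to the tutorial's test, with the comparison inverted).
model_predict = function(x) {
  factor(ifelse(x$ecc >= fit(x$minorAL), "Besni", "Kecimen"))
}

# Two illustrative points: fit(200) = 1.025 and fit(300) = 0.815.
pts = data.frame(minorAL = c(200, 300), ecc = c(0.5, 1.2))
as.character(model_predict(pts))  # "Kecimen" "Besni"
```

The first point lies below the line, so it is labelled Kecimen; the second lies above it and is labelled Besni.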
|- &lt;br /&gt;
|| '''feature$class &amp;lt;- model_predict(feature)'''&lt;br /&gt;
&lt;br /&gt;
'''feature$classnum &amp;lt;- as.numeric(feature$class)'''&lt;br /&gt;
&lt;br /&gt;
|| Let’s use the line to classify the feature space and draw the decision boundary.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature$class &amp;lt;- model_predict(feature)'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature$classnum &amp;lt;- as.numeric(feature$class)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
This command uses the line we created to predict the class of every point in the feature-space grid.&lt;br /&gt;
&lt;br /&gt;
This command encodes the class string labels into numbers suitable for plotting.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Click on '''feature''' in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Point to the data in the Source window.&lt;br /&gt;
|| Drag boundary to see the Environment window.&lt;br /&gt;
&lt;br /&gt;
Click on '''feature '''in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
The '''feature set '''with the predicted classes loads in the source window.&lt;br /&gt;
|- &lt;br /&gt;
|| '''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data= feature, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_abline(slope = -0.0021, intercept = 1.445, size = 1.2)+'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Data Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data= feature, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_abline(slope = -0.0021, intercept = 1.445, size = 1.2)+'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Data Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We are visualising the feature space and the partition line using ggplot2. &lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the plot window.&lt;br /&gt;
|| Drag boundary to see the plot window clearly.&lt;br /&gt;
&lt;br /&gt;
The overall plot shows that the chosen line approximately separates the two classes in the training data.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
'''prediction_test = model_predict(test_data)'''&lt;br /&gt;
|| Let us see how well the partition performs on the testing dataset.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''prediction_test = model_predict(test_data)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We predict the classes of the testing data and store them in the '''prediction_test''' variable.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us now measure the performance of the classification.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix &amp;lt;- confusionMatrix(test_data$class,prediction_test)'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window, type the command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix &amp;lt;- confusionMatrix(test_data$class,prediction_test)'''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| We use the '''confusionMatrix''' function from the '''caret''' package to calculate the performance metrics.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
|- &lt;br /&gt;
|| '''test_confusion_matrix$overall[&amp;quot;Accuracy&amp;quot;]'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window, type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$overall[&amp;quot;Accuracy&amp;quot;]'''&lt;br /&gt;
|| This fetches the accuracy metric from the list we created.&lt;br /&gt;
&lt;br /&gt;
Select and run the command&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Drag boundary to see the console window clearly&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''Accuracy'''&lt;br /&gt;
&lt;br /&gt;
0.6962963&lt;br /&gt;
&lt;br /&gt;
|| The accuracy on the testing dataset is about 69%.&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the source window clearly&lt;br /&gt;
&lt;br /&gt;
|| Drag boundary to see the source window clearly&lt;br /&gt;
&lt;br /&gt;
Let us now view the confusion matrix of the testing dataset&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| Select and run the command.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Point the output in the '''console window'''&lt;br /&gt;
&lt;br /&gt;
Reference&lt;br /&gt;
&lt;br /&gt;
Prediction Besni Kecimen&lt;br /&gt;
&lt;br /&gt;
Besni 50 82&lt;br /&gt;
&lt;br /&gt;
Kecimen 0 138&lt;br /&gt;
&lt;br /&gt;
|| Drag boundary to see the console window clearly.&lt;br /&gt;
&lt;br /&gt;
The output is seen in the '''console''' window.&lt;br /&gt;
&lt;br /&gt;
Observe that: &lt;br /&gt;
&lt;br /&gt;
0 samples of class Besni have been incorrectly classified.&lt;br /&gt;
&lt;br /&gt;
82 samples of class Kecimen have been incorrectly classified. &lt;br /&gt;
&lt;br /&gt;
We can see that our partition line is skewed: many Kecimen samples are misclassified as Besni.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Many different partitions can be drawn for the same problem.&lt;br /&gt;
&lt;br /&gt;
We can choose a complicated partition to reduce the training misclassification error.&lt;br /&gt;
&lt;br /&gt;
But that gives no control over the error on the test data.&lt;br /&gt;
&lt;br /&gt;
Instead, we aim for a simple classifier with a smaller test misclassification error.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| With this, we come to the end of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Let us summarize.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Summary&lt;br /&gt;
|| In this tutorial we have learned about:&lt;br /&gt;
* Machine Learning&lt;br /&gt;
* Supervised and Unsupervised Learning &lt;br /&gt;
* Workflow of an ML Classifier Algorithm&lt;br /&gt;
* Visualizing Feature Space&lt;br /&gt;
* Constructing a dummy classifier&lt;br /&gt;
* Evaluation of the chosen dummy classifier&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Here is an assignment for you.&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Assignment&lt;br /&gt;
|| &lt;br /&gt;
*Use a vertical line as a classifier to partition the feature space.&lt;br /&gt;
* Plot the decision boundary for the same.&lt;br /&gt;
* Evaluate the classifier on the test dataset&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
About the Spoken Tutorial Project&lt;br /&gt;
|| The video at the following link summarizes the Spoken Tutorial project. &lt;br /&gt;
&lt;br /&gt;
Please download and watch it.&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Workshops&lt;br /&gt;
|| We conduct workshops using Spoken Tutorials and give certificates.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please contact us.&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Forum to answer questions&lt;br /&gt;
&lt;br /&gt;
Do you have questions in THIS Spoken Tutorial?&lt;br /&gt;
&lt;br /&gt;
Choose the minute and second where you have the question.&lt;br /&gt;
&lt;br /&gt;
Explain your question briefly.&lt;br /&gt;
&lt;br /&gt;
Someone from our team will answer them.&lt;br /&gt;
&lt;br /&gt;
Please visit this site.&lt;br /&gt;
|| Please post your timed queries in this forum.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Forum to answer questions&lt;br /&gt;
|| Do you have any general/technical questions?&lt;br /&gt;
&lt;br /&gt;
Please visit the forum given in the link.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
R Activities&lt;br /&gt;
&lt;br /&gt;
|| The FOSSEE team coordinates the Textbook Companion, Lab Migration and the Case Study Projects.&lt;br /&gt;
&lt;br /&gt;
We give certificates to those who do this.&lt;br /&gt;
&lt;br /&gt;
For more details, please visit the website.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Acknowledgment&lt;br /&gt;
|| The '''Spoken Tutorial''' project was established by the Ministry of Education, Government of India.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Thank You&lt;br /&gt;
|| This tutorial is contributed by Debatosh Chakraborty from IIT Bombay.&lt;br /&gt;
&lt;br /&gt;
Thank you for joining.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Ushav</name></author>	</entry>

	<entry>
		<id>https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English</id>
		<title>Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English</title>
		<link rel="alternate" type="text/html" href="https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English"/>
				<updated>2024-06-04T10:19:39Z</updated>
		
		<summary type="html">&lt;p&gt;Ushav: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title of the script''': Introduction to Machine Learning in R&lt;br /&gt;
&lt;br /&gt;
'''Author''': Debatosh Chakraborty&lt;br /&gt;
&lt;br /&gt;
'''Keywords''': R, RStudio, machine learning, supervised, unsupervised, video tutorial.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=1&lt;br /&gt;
|- &lt;br /&gt;
| align=center| '''Visual Cue'''&lt;br /&gt;
| align=center| '''Narration'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Opening Slide'''&lt;br /&gt;
|| Welcome to this spoken tutorial on''' Introduction to Machine Learning in R'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Learning Objectives'''&lt;br /&gt;
&lt;br /&gt;
|| In this tutorial, we will learn about: &lt;br /&gt;
* Machine Learning&lt;br /&gt;
* Supervised and Unsupervised Learning&lt;br /&gt;
* Workflow of an ML Classifier Algorithm&lt;br /&gt;
* Visualizing Feature Space&lt;br /&gt;
* Constructing a dummy classifier&lt;br /&gt;
* Evaluation of the chosen dummy classifier&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''System Specifications'''&lt;br /&gt;
|| This tutorial is recorded using,&lt;br /&gt;
&lt;br /&gt;
* '''Windows 11 '''&lt;br /&gt;
* '''R '''version''' 4.3.0'''&lt;br /&gt;
* '''RStudio''' version '''2023.06.1'''&lt;br /&gt;
&lt;br /&gt;
It is recommended to install '''R''' version '''4.2.0''' or higher.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Prerequisites '''&lt;br /&gt;
&lt;br /&gt;
'''https://spoken-tutorial.org'''&lt;br /&gt;
|| To follow this tutorial, the learner should know&lt;br /&gt;
* Basic programming in '''R'''.&lt;br /&gt;
* How to use the '''ggplot2''' and '''dplyr''' packages.&lt;br /&gt;
&lt;br /&gt;
If not, please access the relevant tutorials on this website.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Machine Learning'''&lt;br /&gt;
&lt;br /&gt;
'''   '''&lt;br /&gt;
&lt;br /&gt;
|| About machine learning:&lt;br /&gt;
&lt;br /&gt;
* ML enables computers to learn from data.&lt;br /&gt;
* ML algorithms automate learning by finding patterns in data.&lt;br /&gt;
* Their primary role is prediction, classification or clustering of data.&lt;br /&gt;
* ML is used in many applications.&lt;br /&gt;
* For example Natural Language Processing, image and speech recognition, etc.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Types of Machine Learning''' &lt;br /&gt;
|| ML algorithms include the following types and tasks: &lt;br /&gt;
* '''Supervised '''learning: Prediction and Classification''',''' &lt;br /&gt;
* '''Unsupervised '''learning''': '''Clustering''','''&lt;br /&gt;
* '''Semi-supervised '''learning&lt;br /&gt;
* '''Reinforcement '''learning'''.'''&lt;br /&gt;
&lt;br /&gt;
In this series, we will focus on '''Supervised''' and '''Unsupervised''' learning algorithms. &lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Supervised and Unsupervised Learning'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''   '''&lt;br /&gt;
|| Supervised learning: Labeled data &lt;br /&gt;
* ML algorithms predict labels for unseen features &lt;br /&gt;
* They predict based on given features and labels of data.&lt;br /&gt;
&lt;br /&gt;
Unsupervised learning: Unlabeled data&lt;br /&gt;
* ML algorithms develop a mechanism to group similar features into clusters.&lt;br /&gt;
* And label them for future analysis.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slides'''&lt;br /&gt;
&lt;br /&gt;
'''Classification and Regression'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
* Supervised learning consists of Regression and Classification.&lt;br /&gt;
* '''Regression''' is applied to predict and learn continuous-valued responses from features. &lt;br /&gt;
* Regression techniques include Linear, Spline, Ridge, Lasso, and others.&lt;br /&gt;
* '''Classification''' is applied to predict the class of a discrete (labeled) response from features. &lt;br /&gt;
* Classification techniques include Logistic Regression, Decision Tree, SVM, and others.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slides'''&lt;br /&gt;
&lt;br /&gt;
'''Workflow of an ML Classifier algorithm'''&lt;br /&gt;
|| The workflow of an ML classifier algorithm includes:&lt;br /&gt;
* Feature Space: Collection of all possible values of the features.&lt;br /&gt;
* A classification algorithm partitions the feature space into a number of classes.&lt;br /&gt;
* Data is split into training and testing sets to learn and evaluate the algorithm.&lt;br /&gt;
* The model learns from the training data to create partitions of feature space.&lt;br /&gt;
* The model is evaluated on the test dataset through performance metrics.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Dataset'''&lt;br /&gt;
&lt;br /&gt;
|| Let’s use the '''Raisin dataset''' with two chosen variables or features to understand a classification problem.&lt;br /&gt;
&lt;br /&gt;
For more information on Raisin data please refer to Additional Reading Material on this tutorial page.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide '''&lt;br /&gt;
&lt;br /&gt;
'''Download Files '''&lt;br /&gt;
|| We will use a script file '''Intro.R '''and '''Raisin Dataset ‘raisin.xlsx’'''&lt;br /&gt;
&lt;br /&gt;
Please download these files from the''' Code files''' link of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Make a copy and then use them while practicing.&lt;br /&gt;
|- &lt;br /&gt;
|| [Computer screen]&lt;br /&gt;
&lt;br /&gt;
point to '''Intro.R''' and the folder '''Introduction.'''&lt;br /&gt;
&lt;br /&gt;
Point to the''' MLProject folder '''on the '''Desktop.'''&lt;br /&gt;
&lt;br /&gt;
|| I have downloaded and moved these files to the '''Introduction '''folder. &lt;br /&gt;
&lt;br /&gt;
This folder is located in the '''MLProject''' folder on my '''Desktop'''.&lt;br /&gt;
&lt;br /&gt;
I have also set the '''Introduction''' folder as my working directory.&lt;br /&gt;
&lt;br /&gt;
In this tutorial, we will introduce classification on the '''raisin''' dataset. &lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us switch to '''RStudio'''. &lt;br /&gt;
|- &lt;br /&gt;
|| Click Intro.R in RStudio&lt;br /&gt;
&lt;br /&gt;
Point to Intro.R in RStudio.&lt;br /&gt;
|| Let us open the script '''Intro.R''' in '''RStudio'''.&lt;br /&gt;
&lt;br /&gt;
Script '''Intro.R''' opens in '''RStudio'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' library(readxl)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
'''&amp;lt;nowiki&amp;gt;#install.packages(“package_name”)&amp;lt;/nowiki&amp;gt;'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Point to the command.'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select and run these commands to import the packages.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''readxl''' package to load the Excel file of our '''Raisin Dataset'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''caret''' package to create the '''confusion matrix.'''&lt;br /&gt;
&lt;br /&gt;
The '''ggplot2''' package will be used to create the '''decision boundary plot.'''&lt;br /&gt;
&lt;br /&gt;
Please ensure that all the packages are installed correctly.&lt;br /&gt;
&lt;br /&gt;
As I have already installed the packages, I have imported them directly. &lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' '''&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;- read_xlsx(&amp;quot;Raisin.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
|| Run this command to load the '''Raisin '''dataset.&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the '''Environment''' tab clearly.&lt;br /&gt;
&lt;br /&gt;
In the Environment tab below Data, you will see the '''data '''variable.&lt;br /&gt;
&lt;br /&gt;
Click on '''data '''to load the dataset in the Source window. &lt;br /&gt;
&lt;br /&gt;
Click on '''Intro.R''' in the Source window and close the tab.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command.&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;-data[c(&amp;quot;minorAL&amp;quot;,&amp;quot;ecc&amp;quot;,&amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
Select the commands and click the Run button&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We now select three columns from data.&lt;br /&gt;
&lt;br /&gt;
Two columns (&amp;quot;minorAL&amp;quot; and &amp;quot;ecc&amp;quot;) are chosen as features.&lt;br /&gt;
&lt;br /&gt;
The class column is chosen as a target variable.&lt;br /&gt;
&lt;br /&gt;
We convert the target variable '''data$class '''to a factor. &lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|- &lt;br /&gt;
|| Click on the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click on '''data.'''&lt;br /&gt;
|| Click on '''data '''to load the modified data in the Source window.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| We will now understand the feature space of this data.&lt;br /&gt;
|- &lt;br /&gt;
|| '''range_minor_al &amp;lt;- range(data$minorAL)'''&lt;br /&gt;
&lt;br /&gt;
'''range_ecc &amp;lt;- range(data$ecc)'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''range_minor_al &amp;lt;- range(data$minorAL)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''range_ecc &amp;lt;- range(data$ecc)'''&lt;br /&gt;
|| These commands show the range of the feature variables '''minorAL''' and''' ecc.'''&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the environment tab clearly.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The minimum and maximum values of '''minorAL''' and '''ecc''' are stored in their range variables.&lt;br /&gt;
|- &lt;br /&gt;
|| '''X &amp;lt;- seq(min(data$minorAL), max(data$minorAL), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''Y &amp;lt;- seq(min(data$ecc), max(data$ecc), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''feature &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
&lt;br /&gt;
|| We will now use the range to generate grid points to construct the feature space.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''X &amp;lt;- seq(min(data$minorAL), max(data$minorAL), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''Y &amp;lt;- seq(min(data$ecc), max(data$ecc), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
|| These commands generate sequences of 100 points spanning the ranges of '''minorAL''' and '''ecc'''.&lt;br /&gt;
&lt;br /&gt;
This command creates the Cartesian product of the two sequences to form the feature-space grid.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|-&lt;br /&gt;
|| '''ggplot(data = data, aes(x = minorAL, y = ecc)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(aes(color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Feature Space&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| We will now plot the feature space we created.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| '''ggplot(data = data, aes(x = minorAL, y = ecc)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(aes(color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Feature Space&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
|| These commands plot the data points in the feature space.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|-&lt;br /&gt;
|| Drag boundaries.&lt;br /&gt;
|| Drag boundaries to see the plot window clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the data.&lt;br /&gt;
|| Now let us split our data into training and testing data.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Click on '''Intro.R''' in the Source window, and type these commands.&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| This creates training data, consisting of 630 unique rows.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This creates testing data, consisting of 270 unique rows.&lt;br /&gt;
|-&lt;br /&gt;
|| Select the commands and click the Run button.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to the sets in the Environment Tab&lt;br /&gt;
&lt;br /&gt;
Click the '''test_data ''' and '''train_data '''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
The data sets are shown in the '''Environment '''tab.&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the Environment tab clearly&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Click on '''test_data ''' and '''train_data ''' to load them in the Source window.&lt;br /&gt;
|-&lt;br /&gt;
|| &lt;br /&gt;
|| Here we try to partition the '''feature space''' to construct the classifier.&lt;br /&gt;
&lt;br /&gt;
To begin with, one might construct a '''heuristic '''line to build the classifier.&lt;br /&gt;
|- &lt;br /&gt;
|| [Rstudio]&lt;br /&gt;
&lt;br /&gt;
'''fit = function(x)((x * (-0.0021)) + 1.445)'''&lt;br /&gt;
&lt;br /&gt;
'''model_predict &amp;lt;- function(x){'''&lt;br /&gt;
&lt;br /&gt;
'''factor(ifelse(x$ecc &amp;lt; fit(x$minorAL), &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
|| In the Source window type these commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''fit = function(x)((x * (-0.0021)) + 1.445)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''model_predict &amp;lt;- function(x){'''&lt;br /&gt;
&lt;br /&gt;
'''factor(ifelse(x$ecc &amp;lt; fit(x$minorAL), &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
Click Save and Click Run buttons. &lt;br /&gt;
|| Let us describe the steps of the classification algorithm.&lt;br /&gt;
&lt;br /&gt;
For that, we will define a line that partitions the data, acting as a dummy classifier.&lt;br /&gt;
&lt;br /&gt;
It does not involve the training data, so its performance may be poor.&lt;br /&gt;
&lt;br /&gt;
We define a function that assigns each data point a class based on which side of the line it falls on.&lt;br /&gt;
&lt;br /&gt;
Click Save.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''feature$class &amp;lt;- model_predict(feature)'''&lt;br /&gt;
&lt;br /&gt;
'''feature$classnum &amp;lt;- as.numeric(feature$class)'''&lt;br /&gt;
&lt;br /&gt;
|| Let’s use the line to classify the feature space and draw the decision boundary.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature$class &amp;lt;- model_predict(feature)'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature$classnum &amp;lt;- as.numeric(feature$class)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
This command uses the line we created to predict the class of every point in the feature-space grid.&lt;br /&gt;
&lt;br /&gt;
This command encodes the class string labels into numbers suitable for plotting.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Click on '''feature''' in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Point to the data in the Source window.&lt;br /&gt;
|| Drag boundary to see the Environment window.&lt;br /&gt;
&lt;br /&gt;
Click on '''feature '''in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
The '''feature set '''with the predicted classes loads in the source window.&lt;br /&gt;
|- &lt;br /&gt;
|| '''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data= feature, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_abline(slope = -0.0021, intercept = 1.445, size = 1.2)+'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Data Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data= feature, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_abline(slope = -0.0021, intercept = 1.445, size = 1.2)+'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Data Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We are visualising the feature space and the partition line using ggplot2. &lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the plot window.&lt;br /&gt;
|| Drag boundary to see the plot window clearly.&lt;br /&gt;
&lt;br /&gt;
The overall plot shows that the chosen line approximately separates the two classes in the training data.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
'''prediction_test = model_predict(test_data)'''&lt;br /&gt;
|| Let us see how well the partition performs on the testing dataset.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''prediction_test = model_predict(test_data)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We predict the classes of the testing data and store them in the '''prediction_test''' variable.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us now measure the performance of the classification.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix &amp;lt;- confusionMatrix(test_data$class,prediction_test)'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window, type the command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix &amp;lt;- confusionMatrix(test_data$class,prediction_test)'''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| We use the '''confusionMatrix''' function from the '''caret''' package to calculate the performance metrics.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
|- &lt;br /&gt;
|| '''test_confusion_matrix$overall[&amp;quot;Accuracy&amp;quot;]'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window, type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$overall[&amp;quot;Accuracy&amp;quot;]'''&lt;br /&gt;
|| This fetches the accuracy metric from the list we created.&lt;br /&gt;
&lt;br /&gt;
Select and run the command&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Drag boundary to see the console window clearly&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''Accuracy'''&lt;br /&gt;
&lt;br /&gt;
0.6962963&lt;br /&gt;
&lt;br /&gt;
|| The accuracy on the testing dataset is about 69%.&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the source window clearly&lt;br /&gt;
&lt;br /&gt;
|| Drag boundary to see the source window clearly&lt;br /&gt;
&lt;br /&gt;
Let us now view the confusion matrix of the testing dataset&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| Select and run the command.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Point the output in the '''console window'''&lt;br /&gt;
&lt;br /&gt;
Reference&lt;br /&gt;
&lt;br /&gt;
Prediction Besni Kecimen&lt;br /&gt;
&lt;br /&gt;
Besni 50 82&lt;br /&gt;
&lt;br /&gt;
Kecimen 0 138&lt;br /&gt;
&lt;br /&gt;
|| Drag boundary to see the console window clearly.&lt;br /&gt;
&lt;br /&gt;
The output is seen in the '''console''' window.&lt;br /&gt;
&lt;br /&gt;
Observe that: &lt;br /&gt;
&lt;br /&gt;
0 samples of class Besni have been incorrectly classified.&lt;br /&gt;
&lt;br /&gt;
82 samples of class Kecimen have been incorrectly classified. &lt;br /&gt;
&lt;br /&gt;
We can see that our partition line is skewed.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| For the same problem, many partitions can be drawn.&lt;br /&gt;
&lt;br /&gt;
We can choose a complicated partition to reduce the training misclassification error.&lt;br /&gt;
&lt;br /&gt;
But that gives no control over the error on the test data.&lt;br /&gt;
&lt;br /&gt;
We should aim for a simple classifier with a small test misclassification error.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| With this, we come to the end of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Let us summarize.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Summary&lt;br /&gt;
|| In this tutorial we have learned about:&lt;br /&gt;
* Machine Learning&lt;br /&gt;
* Classification and Regression Problems&lt;br /&gt;
* Workflow of an ML Classifier Algorithm&lt;br /&gt;
* Visualizing Feature Space&lt;br /&gt;
* Constructing a dummy classifier&lt;br /&gt;
* Evaluation of an ML algorithm&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Here is an assignment for you.&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Assignment&lt;br /&gt;
|| &lt;br /&gt;
*Use a vertical line as a classifier to partition the feature space.&lt;br /&gt;
* Plot the decision boundary for the same.&lt;br /&gt;
* Evaluate the classifier on the test dataset&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
About the Spoken Tutorial Project&lt;br /&gt;
|| The video at the following link summarizes the Spoken Tutorial project. &lt;br /&gt;
&lt;br /&gt;
Please download and watch it.&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Workshops&lt;br /&gt;
|| We conduct workshops using Spoken Tutorials and give certificates.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please contact us.&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Forum to answer questions&lt;br /&gt;
&lt;br /&gt;
Do you have questions in THIS Spoken Tutorial?&lt;br /&gt;
&lt;br /&gt;
Choose the minute and second where you have the question.&lt;br /&gt;
&lt;br /&gt;
Explain your question briefly.&lt;br /&gt;
&lt;br /&gt;
Someone from our team will answer them.&lt;br /&gt;
&lt;br /&gt;
Please visit this site.&lt;br /&gt;
|| Please post your timed queries in this forum.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Forum to answer questions&lt;br /&gt;
|| Do you have any general/technical questions?&lt;br /&gt;
&lt;br /&gt;
Please visit the forum given in the link.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
R Activities&lt;br /&gt;
&lt;br /&gt;
|| The FOSSEE team coordinates the Textbook Companion, Lab Migration and the Case Study Projects.&lt;br /&gt;
&lt;br /&gt;
We give certificates to those who do this.&lt;br /&gt;
&lt;br /&gt;
For more details, please visit the website.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Acknowledgment&lt;br /&gt;
|| The '''Spoken Tutorial''' project was established by the Ministry of Education, Govt. of India.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Thank You&lt;br /&gt;
|| This tutorial is contributed by Debatosh Chakraborty from IIT Bombay.&lt;br /&gt;
&lt;br /&gt;
Thank you for joining.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Ushav</name></author>	</entry>

	<entry>
		<id>https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English</id>
		<title>Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English</title>
		<link rel="alternate" type="text/html" href="https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English"/>
				<updated>2024-06-04T10:06:46Z</updated>
		
		<summary type="html">&lt;p&gt;Ushav: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title of the script''': Introduction to Machine Learning in R&lt;br /&gt;
&lt;br /&gt;
'''Author''': Debatosh Chakraborty&lt;br /&gt;
&lt;br /&gt;
'''Keywords''': R, RStudio, machine learning, supervised, unsupervised, video tutorial.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=1&lt;br /&gt;
|- &lt;br /&gt;
| align=center| '''Visual Cue'''&lt;br /&gt;
| align=center| '''Narration'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Opening Slide'''&lt;br /&gt;
|| Welcome to this spoken tutorial on''' Introduction to Machine Learning in R'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Learning Objectives'''&lt;br /&gt;
&lt;br /&gt;
|| In this tutorial, we will learn about: &lt;br /&gt;
* Machine Learning&lt;br /&gt;
* Supervised and Unsupervised Learning&lt;br /&gt;
* Workflow of an ML Classifier Algorithm&lt;br /&gt;
* Visualizing Feature Space&lt;br /&gt;
* Constructing a dummy classifier&lt;br /&gt;
* Evaluation of the chosen dummy classifier&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''System Specifications'''&lt;br /&gt;
|| This tutorial is recorded using,&lt;br /&gt;
&lt;br /&gt;
* '''Windows 11 '''&lt;br /&gt;
* '''R '''version''' 4.3.0'''&lt;br /&gt;
* '''RStudio''' version '''2023.06.1'''&lt;br /&gt;
&lt;br /&gt;
It is recommended to install '''R''' version '''4.2.0''' or higher.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Prerequisites '''&lt;br /&gt;
&lt;br /&gt;
'''https://spoken-tutorial.org'''&lt;br /&gt;
|| To follow this tutorial, the learner should know&lt;br /&gt;
* Basic programming in '''R'''.&lt;br /&gt;
* How to use the ggplot2 and dplyr packages.&lt;br /&gt;
&lt;br /&gt;
If not, please access the relevant tutorials on this website.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Machine Learning'''&lt;br /&gt;
&lt;br /&gt;
'''   '''&lt;br /&gt;
&lt;br /&gt;
|| About machine learning&lt;br /&gt;
&lt;br /&gt;
* ML enables computers to learn from data.&lt;br /&gt;
* ML algorithms automate the learning process from data through patterns.&lt;br /&gt;
* Their primary role is prediction, classification or clustering of data.&lt;br /&gt;
* ML algorithms are applied in many areas.&lt;br /&gt;
* For example, Natural Language Processing, image and speech recognition, etc.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Types of Machine Learning''' &lt;br /&gt;
|| ML algorithms include the following types and tasks: &lt;br /&gt;
* '''Supervised '''learning: Prediction and Classification''',''' &lt;br /&gt;
* '''Unsupervised '''learning''': '''Clustering''','''&lt;br /&gt;
* '''Semi-supervised '''learning&lt;br /&gt;
* '''Reinforcement '''learning'''.'''&lt;br /&gt;
&lt;br /&gt;
In this series, we will focus on '''Supervised''' and '''Unsupervised''' learning algorithms. &lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Supervised and Unsupervised Learning'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''   '''&lt;br /&gt;
|| Supervised learning: Labeled data &lt;br /&gt;
* ML algorithms predict labels for unseen features &lt;br /&gt;
* They predict based on given features and labels of data.&lt;br /&gt;
&lt;br /&gt;
Unsupervised learning: Unlabeled data&lt;br /&gt;
* ML algorithms develop a mechanism to group similar features into clusters.&lt;br /&gt;
* And label them for future analysis.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slides'''&lt;br /&gt;
&lt;br /&gt;
'''Classification and Regression'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
* Supervised learning consists of Regression and Classification.&lt;br /&gt;
* '''Regression''' is applied to predict and learn continuous-valued responses from features. &lt;br /&gt;
* Regression techniques include Linear, Spline, Ridge, Lasso, and others.&lt;br /&gt;
* '''Classification''' is applied to predict the class of a discrete (labeled) response from features. &lt;br /&gt;
* Classification techniques include Logistic Regression, Decision Tree, SVM, and others.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slides'''&lt;br /&gt;
&lt;br /&gt;
'''Workflow of an ML Classifier algorithm'''&lt;br /&gt;
|| The workflow of an ML Classifier algorithm includes&lt;br /&gt;
* Feature Space: Collection of all possible values of the features.&lt;br /&gt;
* A classification algorithm partitions the feature space into a number of classes.&lt;br /&gt;
* Data is split into training and testing sets to learn and evaluate the algorithm.&lt;br /&gt;
* The model learns from the training data to create partitions of feature space.&lt;br /&gt;
* The model is evaluated on the test dataset through performance metrics.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Dataset'''&lt;br /&gt;
&lt;br /&gt;
|| Let’s use the '''Raisin dataset''' with two chosen variables or features to understand a classification problem.&lt;br /&gt;
&lt;br /&gt;
For more information on Raisin data please refer to Additional Reading Material on this tutorial page.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide '''&lt;br /&gt;
&lt;br /&gt;
'''Download Files '''&lt;br /&gt;
|| We will use a script file '''Intro.R '''and '''Raisin Dataset ‘raisin.xlsx’'''&lt;br /&gt;
&lt;br /&gt;
Please download these files from the''' Code files''' link of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Make a copy and then use them while practicing.&lt;br /&gt;
|- &lt;br /&gt;
|| [Computer screen]&lt;br /&gt;
&lt;br /&gt;
point to '''Intro.R''' and the folder '''Introduction.'''&lt;br /&gt;
&lt;br /&gt;
Point to the''' MLProject folder '''on the '''Desktop.'''&lt;br /&gt;
&lt;br /&gt;
|| I have downloaded and moved these files to the '''Introduction '''folder. &lt;br /&gt;
&lt;br /&gt;
This folder is located in the '''MLProject''' folder on my '''Desktop'''.&lt;br /&gt;
&lt;br /&gt;
I have also set the '''Introduction''' folder as my working Directory.&lt;br /&gt;
&lt;br /&gt;
In this tutorial, we will introduce classification on the '''raisin''' dataset. &lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us switch to '''RStudio'''. &lt;br /&gt;
|- &lt;br /&gt;
|| Click Intro.R in RStudio&lt;br /&gt;
&lt;br /&gt;
Point to Intro.R in RStudio.&lt;br /&gt;
|| Let us open the script '''Intro.R''' in '''RStudio'''.&lt;br /&gt;
&lt;br /&gt;
Script '''Intro.R''' opens in '''RStudio'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' library(readxl)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
'''&amp;lt;nowiki&amp;gt;#install.packages(“package_name”)&amp;lt;/nowiki&amp;gt;'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Point to the command.'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select and run these commands to import the packages.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''readxl''' package to load the excel file of our '''Raisin Dataset'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''caret''' package to create the '''confusion matrix.'''&lt;br /&gt;
&lt;br /&gt;
The '''ggplot2''' package will be used to create the '''decision boundary plot.'''&lt;br /&gt;
&lt;br /&gt;
Please ensure that all the packages are installed correctly.&lt;br /&gt;
&lt;br /&gt;
As I have already installed the packages, I have imported them directly. &lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' '''&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;- read_xlsx(&amp;quot;Raisin.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
|| Run this command to load the '''Raisin '''dataset.&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the '''Environment''' tab clearly.&lt;br /&gt;
&lt;br /&gt;
In the Environment tab below Data, you will see the '''data '''variable.&lt;br /&gt;
&lt;br /&gt;
Click on '''data '''to load the dataset in the Source window. &lt;br /&gt;
&lt;br /&gt;
Click on '''Intro.R''' in the Source window and close the tab.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command.&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;-data[c(&amp;quot;minorAL&amp;quot;,&amp;quot;ecc&amp;quot;,&amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
Select the commands and click the Run button&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We now select three columns from data.&lt;br /&gt;
&lt;br /&gt;
2 columns (&amp;quot;minorAL&amp;quot;, &amp;quot;ecc&amp;quot;) are chosen as features.&lt;br /&gt;
&lt;br /&gt;
The class column is chosen as a target variable.&lt;br /&gt;
&lt;br /&gt;
We convert the target variable '''data$class '''to a factor. &lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|- &lt;br /&gt;
|| Click on the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click on '''data.'''&lt;br /&gt;
|| Click on '''data '''to load the modified data in the Source window.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| We will now understand the feature space of this data.&lt;br /&gt;
|- &lt;br /&gt;
|| '''range_minor_al &amp;lt;- range(data$minorAL)'''&lt;br /&gt;
&lt;br /&gt;
'''range_ecc &amp;lt;- range(data$ecc)'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''range_minor_al &amp;lt;- range(data$minorAL)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''range_ecc &amp;lt;- range(data$ecc)'''&lt;br /&gt;
|| These commands show the range of the feature variables '''minorAL''' and''' ecc.'''&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the environment tab clearly.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The minimum and maximum values of '''minorAL''' and '''ecc''' are stored in their respective range variables.&lt;br /&gt;
|- &lt;br /&gt;
|| '''X &amp;lt;- seq(min(data$minorAL), max(data$minorAL), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''Y &amp;lt;- seq(min(data$ecc), max(data$ecc), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''feature &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
&lt;br /&gt;
|| We will now use the range to generate grid points to construct the feature space.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''X &amp;lt;- seq(min(data$minorAL), max(data$minorAL), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''Y &amp;lt;- seq(min(data$ecc), max(data$ecc), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
|| This command generates a sequence of points spanning the range of '''minorAL '''and''' ecc'''.&lt;br /&gt;
&lt;br /&gt;
This command creates a cartesian product of the two features to create a feature space.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
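To see what '''expand.grid''' produces, here is a tiny sketch with made-up ranges (the values are illustrative only, not the raisin data):&lt;br /&gt;

```r
# Illustrative only: a tiny grid over two hypothetical feature ranges
X = seq(0, 1, length.out = 3)    # 3 points for the first feature
Y = seq(10, 20, length.out = 2)  # 2 points for the second feature
grid = expand.grid(minorAL = X, ecc = Y)
nrow(grid)                       # 6: one row per combination of X and Y
```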
|-&lt;br /&gt;
|  | '''ggplot(data = data, aes(x = minorAL, y = ecc)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(aes(color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Feature Space&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| We will now plot the feature space created&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| '''ggplot(data = data, aes(x = minorAL, y = ecc)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(aes(color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Feature Space&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
|| These commands plot the data points in the feature space.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|-&lt;br /&gt;
|  | Drag boundaries.&lt;br /&gt;
|| Drag boundaries to see the plot window clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the data.&lt;br /&gt;
|| Now let us split our data into training and testing data.&lt;br /&gt;
|-&lt;br /&gt;
|  | [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Click on '''Intro.R''' in the Source window, and type these commands.&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  | Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| This creates training data, consisting of 630 unique rows.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This creates testing data, consisting of 270 unique rows.&lt;br /&gt;
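The 630 and 270 figures follow from the 70/30 split of the 900-row raisin data; a quick sketch of the arithmetic:&lt;br /&gt;

```r
# Illustrative check of the split sizes for the 900-row dataset
n = 900
size_train = 0.7 * n         # 630 rows sampled for training
size_test  = n - size_train  # 270 remaining rows for testing
```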
|-&lt;br /&gt;
|| Select the commands and click the Run button.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to the sets in the Environment Tab&lt;br /&gt;
&lt;br /&gt;
Click the '''test_data ''' and '''train_data '''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
The data sets are shown in the '''Environment '''tab.&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the Environment tab clearly&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Click on '''test_data ''' and '''train_data ''' to load them in the Source window.&lt;br /&gt;
|-&lt;br /&gt;
|| &lt;br /&gt;
|| Here we try to partition the '''feature space''' to construct the classifier.&lt;br /&gt;
&lt;br /&gt;
To begin with, one might construct a '''heuristic '''line to build the classifier.&lt;br /&gt;
|- &lt;br /&gt;
|| [Rstudio]&lt;br /&gt;
&lt;br /&gt;
'''fit = function(x)((x * (-0.0021)) + 1.445)'''&lt;br /&gt;
&lt;br /&gt;
'''model_predict &amp;lt;- function(x){'''&lt;br /&gt;
&lt;br /&gt;
'''factor(ifelse(x$ecc &amp;lt; fit(x$minorAL), &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
|| In the Source window type these commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''fit = function(x)((x * (-0.0021)) + 1.445)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''model_predict &amp;lt;- function(x){'''&lt;br /&gt;
&lt;br /&gt;
'''factor(ifelse(x$ecc &amp;lt; fit(x$minorAL), &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
Click Save and Click Run buttons. &lt;br /&gt;
|| Let us describe the steps of the classification algorithm.&lt;br /&gt;
&lt;br /&gt;
For that we will define a line to partition the data as a dummy classifier.&lt;br /&gt;
&lt;br /&gt;
Since it is not learned from the training data, its performance may be poor.&lt;br /&gt;
&lt;br /&gt;
We define a function that separates data points belonging to either side of the line.&lt;br /&gt;
&lt;br /&gt;
Click Save.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
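To see how this line assigns a label, consider one hypothetical point (the value 250 for '''minorAL''' is made up for illustration):&lt;br /&gt;

```r
# Hypothetical example: where does the boundary lie at minorAL = 250?
fit = function(x) (x * (-0.0021)) + 1.445
fit(250)  # 0.92: a sample with ecc below 0.92 is labelled Kecimen,
          # one with ecc above 0.92 is labelled Besni
```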
&lt;br /&gt;
|- &lt;br /&gt;
|| '''feature$class &amp;lt;- model_predict(feature)'''&lt;br /&gt;
&lt;br /&gt;
'''feature$classnum &amp;lt;- as.numeric(feature$class)'''&lt;br /&gt;
&lt;br /&gt;
|| Let’s use the line to classify the feature space and draw the decision boundary.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature$class &amp;lt;- model_predict(feature)'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature$classnum &amp;lt;- as.numeric(feature$class)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
This command will use the line created to predict the class of every point in the grid of feature space.&lt;br /&gt;
&lt;br /&gt;
This command encodes the class string labels into numbers suitable for plotting.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Click on '''feature''' in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Point to the data in the Source window.&lt;br /&gt;
|| Drag boundary to see the Environment window.&lt;br /&gt;
&lt;br /&gt;
Click on '''feature '''in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
The '''feature set '''with the predicted classes loads in the source window.&lt;br /&gt;
|- &lt;br /&gt;
|| '''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data= feature, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_abline(slope = -0.0021, intercept = 1.445, size = 1.2)+'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Data Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data= feature, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_abline(slope = -0.0021, intercept = 1.445, size = 1.2)+'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Data Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We are visualising the feature space and the partition line using GGPlot2. &lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the plot window.&lt;br /&gt;
|| Drag boundary to see the plot window clearly.&lt;br /&gt;
&lt;br /&gt;
The plot shows that the chosen line approximately separates the classes in the training data.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
'''prediction_test = model_predict(test_data)'''&lt;br /&gt;
|| Let us see how well the partition performs on the testing dataset.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''prediction_test = model_predict(test_data)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We predict the classes of the testing data and store them in the '''prediction_test''' variable.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us now measure the performance of the classification.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix &amp;lt;- confusionMatrix(test_data$class,prediction_test)'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window, type the command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix &amp;lt;- confusionMatrix(test_data$class,prediction_test)'''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| We use the '''confusionMatrix''' function from the '''caret''' package to calculate the performance metrics.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
|- &lt;br /&gt;
|| '''test_confusion_matrix$overall[&amp;quot;Accuracy&amp;quot;]'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window, type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$overall[&amp;quot;Accuracy&amp;quot;]'''&lt;br /&gt;
|| This fetches the accuracy metric from the confusion matrix object created earlier.&lt;br /&gt;
&lt;br /&gt;
Select and run the command&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Drag boundary to see the console window clearly&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''Accuracy'''&lt;br /&gt;
&lt;br /&gt;
0.6962963&lt;br /&gt;
&lt;br /&gt;
|| The accuracy on the testing dataset is approximately 69.6%.&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the source window clearly&lt;br /&gt;
&lt;br /&gt;
|| Drag boundary to see the source window clearly&lt;br /&gt;
&lt;br /&gt;
Let us now view the confusion matrix of the testing dataset&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| Select and run the command.&lt;br /&gt;
&lt;br /&gt;
The output is seen in the '''console''' window&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Point the output in the '''console window'''&lt;br /&gt;
&lt;br /&gt;
Reference&lt;br /&gt;
&lt;br /&gt;
Prediction Besni Kecimen&lt;br /&gt;
&lt;br /&gt;
Besni 50 82&lt;br /&gt;
&lt;br /&gt;
Kecimen 0 138&lt;br /&gt;
&lt;br /&gt;
|| Drag boundary to see the console window clearly &lt;br /&gt;
&lt;br /&gt;
Observe that: &lt;br /&gt;
&lt;br /&gt;
0 samples of class Besni have been incorrectly classified.&lt;br /&gt;
&lt;br /&gt;
82 samples of class Kecimen have been incorrectly classified. &lt;br /&gt;
&lt;br /&gt;
We can see that our partition line is skewed.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| For the same problem, many partitions can be drawn.&lt;br /&gt;
&lt;br /&gt;
We can choose a complicated partition to reduce the training misclassification error.&lt;br /&gt;
&lt;br /&gt;
But that gives no control over the error on the test data.&lt;br /&gt;
&lt;br /&gt;
We should aim for a simple classifier with a small test misclassification error.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| With this, we come to the end of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Let us summarize.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Summary&lt;br /&gt;
|| In this tutorial we have learned about:&lt;br /&gt;
* Machine Learning&lt;br /&gt;
* Classification and Regression Problems&lt;br /&gt;
* Workflow of an ML Classifier Algorithm&lt;br /&gt;
* Visualizing Feature Space&lt;br /&gt;
* Constructing a dummy classifier&lt;br /&gt;
* Evaluation of an ML algorithm&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Here is an assignment for you.&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Assignment&lt;br /&gt;
|| &lt;br /&gt;
*Use a vertical line as a classifier to partition the feature space.&lt;br /&gt;
* Plot the decision boundary for the same.&lt;br /&gt;
* Evaluate the classifier on the test dataset&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
About the Spoken Tutorial Project&lt;br /&gt;
|| The video at the following link summarizes the Spoken Tutorial project. &lt;br /&gt;
&lt;br /&gt;
Please download and watch it.&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Workshops&lt;br /&gt;
|| We conduct workshops using Spoken Tutorials and give certificates.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please contact us.&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Forum to answer questions&lt;br /&gt;
&lt;br /&gt;
Do you have questions in THIS Spoken Tutorial?&lt;br /&gt;
&lt;br /&gt;
Choose the minute and second where you have the question.&lt;br /&gt;
&lt;br /&gt;
Explain your question briefly.&lt;br /&gt;
&lt;br /&gt;
Someone from our team will answer them.&lt;br /&gt;
&lt;br /&gt;
Please visit this site.&lt;br /&gt;
|| Please post your timed queries in this forum.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Forum to answer questions&lt;br /&gt;
|| Do you have any general/technical questions?&lt;br /&gt;
&lt;br /&gt;
Please visit the forum given in the link.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
R Activities&lt;br /&gt;
&lt;br /&gt;
|| The FOSSEE team coordinates the Textbook Companion, Lab Migration and the Case Study Projects.&lt;br /&gt;
&lt;br /&gt;
We give certificates to those who do this.&lt;br /&gt;
&lt;br /&gt;
For more details, please visit the website.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Acknowledgment&lt;br /&gt;
|| The '''Spoken Tutorial''' project was established by the Ministry of Education, Govt. of India.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Thank You&lt;br /&gt;
|| This tutorial is contributed by Debatosh Chakraborty from IIT Bombay.&lt;br /&gt;
&lt;br /&gt;
Thank you for joining.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Ushav</name></author>	</entry>

	<entry>
		<id>https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English</id>
		<title>Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English</title>
		<link rel="alternate" type="text/html" href="https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English"/>
				<updated>2024-06-04T09:57:37Z</updated>
		
		<summary type="html">&lt;p&gt;Ushav: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title of the script''': Introduction to Machine Learning in R&lt;br /&gt;
&lt;br /&gt;
'''Author''': Debatosh Chakraborty&lt;br /&gt;
&lt;br /&gt;
'''Keywords''': R, RStudio, machine learning, supervised, unsupervised, video tutorial.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=1&lt;br /&gt;
|- &lt;br /&gt;
| align=center| '''Visual Cue'''&lt;br /&gt;
| align=center| '''Narration'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Opening Slide'''&lt;br /&gt;
|| Welcome to this spoken tutorial on''' Introduction to Machine Learning in R'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Learning Objectives'''&lt;br /&gt;
&lt;br /&gt;
|| In this tutorial, we will learn about: &lt;br /&gt;
* Machine Learning&lt;br /&gt;
* Supervised and Unsupervised Learning&lt;br /&gt;
* Workflow of an ML Classifier Algorithm&lt;br /&gt;
* Visualizing Feature Space&lt;br /&gt;
* Constructing a dummy classifier&lt;br /&gt;
* Evaluation of the chosen dummy classifier&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''System Specifications'''&lt;br /&gt;
|| This tutorial is recorded using,&lt;br /&gt;
&lt;br /&gt;
* '''Windows 11 '''&lt;br /&gt;
* '''R '''version''' 4.3.0'''&lt;br /&gt;
* '''RStudio''' version '''2023.06.1'''&lt;br /&gt;
&lt;br /&gt;
It is recommended to install '''R''' version '''4.2.0''' or higher.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Prerequisites '''&lt;br /&gt;
&lt;br /&gt;
'''https://spoken-tutorial.org'''&lt;br /&gt;
|| To follow this tutorial, the learner should know&lt;br /&gt;
* Basic programming in '''R'''.&lt;br /&gt;
* Usage of the ggplot2 and dplyr packages.&lt;br /&gt;
&lt;br /&gt;
If not, please access the relevant tutorials on this website.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Machine Learning'''&lt;br /&gt;
&lt;br /&gt;
'''   '''&lt;br /&gt;
&lt;br /&gt;
|| About machine learning&lt;br /&gt;
&lt;br /&gt;
* ML enables computers to learn from data.&lt;br /&gt;
* ML algorithms automate the learning process from data through patterns.&lt;br /&gt;
* Their primary role is prediction, classification or clustering of data.&lt;br /&gt;
* ML algorithms are applied in several applications.&lt;br /&gt;
* For example Natural Language Processing, Image and speech recognition, etc.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Types of Machine Learning''' &lt;br /&gt;
|| ML algorithms include the following types and tasks: &lt;br /&gt;
* '''Supervised''' learning: Prediction and Classification,&lt;br /&gt;
* '''Unsupervised''' learning: Clustering,&lt;br /&gt;
* '''Semi-supervised''' learning,&lt;br /&gt;
* '''Reinforcement''' learning.&lt;br /&gt;
&lt;br /&gt;
In this series, we will focus on '''Supervised''' and '''Unsupervised''' learning algorithms. &lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Supervised and Unsupervised Learning'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''   '''&lt;br /&gt;
|| Supervised learning: Labeled data &lt;br /&gt;
* ML algorithms predict labels for unseen features &lt;br /&gt;
* They predict based on given features and labels of data.&lt;br /&gt;
&lt;br /&gt;
Unsupervised learning: Unlabeled data&lt;br /&gt;
* ML algorithms develop a mechanism to group similar features into clusters.&lt;br /&gt;
* And label them for future analysis.&lt;br /&gt;
&lt;br /&gt;
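The contrast above can be sketched with a few lines of base R; this is a minimal illustration on synthetic, unlabeled data (the variable names are hypothetical, not part of the tutorial's script):

```r
# Unsupervised example: k-means groups unlabeled points into clusters
set.seed(1)
pts = matrix(rnorm(100), ncol = 2)  # 50 unlabeled 2-D observations
km  = kmeans(pts, centers = 2)      # group similar points into 2 clusters
km$cluster                          # cluster labels, usable for future analysis
```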
|- &lt;br /&gt;
|| '''Show Slides'''&lt;br /&gt;
&lt;br /&gt;
'''Classification and Regression'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
* Supervised learning consists of Regression and Classification.&lt;br /&gt;
* '''Regression''' is applied to predict and learn continuous-valued responses from features. &lt;br /&gt;
* Regression techniques include Linear, Spline, Ridge, Lasso, and others.&lt;br /&gt;
* '''Classification''' is applied to predict the class of a discrete (labeled) response from features. &lt;br /&gt;
* Classification techniques include Logistic Regression, Decision Tree, SVM, and others.&lt;br /&gt;
&lt;br /&gt;
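The two supervised tasks above map onto two standard base R model calls; this is a minimal sketch on toy data (all variable names here are hypothetical):

```r
set.seed(1)
x     = runif(50)
y_num = 2 * x + rnorm(50, sd = 0.1)        # continuous response
y_lab = factor(ifelse(x > 0.5, "A", "B"))  # discrete (labeled) response

# Regression: predict a continuous value from the feature
reg_model = lm(y_num ~ x)

# Classification: predict a class label (logistic regression)
clf_model = glm(y_lab ~ x, family = binomial)
```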
|- &lt;br /&gt;
|| '''Show Slides'''&lt;br /&gt;
&lt;br /&gt;
'''Workflow of an ML Classifier algorithm'''&lt;br /&gt;
|| The workflow of an ML classifier algorithm includes:&lt;br /&gt;
* Feature Space: Collection of all possible values of the features.&lt;br /&gt;
* A classification algorithm partitions the feature space into a number of classes.&lt;br /&gt;
* Data is split into training and testing sets to learn and evaluate the algorithm.&lt;br /&gt;
* The model learns from the training data to create partitions of feature space.&lt;br /&gt;
* The model is evaluated on the test dataset through performance metrics.&lt;br /&gt;
&lt;br /&gt;
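The workflow steps above can be summarized as a short skeleton in base R; this is only a sketch, with a placeholder data frame '''df''' containing a factor column '''class''' (not the tutorial's actual dataset):

```r
# Split the data into training and testing sets
set.seed(1)
idx   = sample(1:nrow(df), size = floor(0.7 * nrow(df)))  # 70% for training
train = df[idx, ]
test  = df[-idx, ]

# Learn a partition of the feature space from the training data
model = glm(class ~ ., data = train, family = binomial)

# Evaluate the model on the held-out test set
pred = predict(model, newdata = test, type = "response")
```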
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Dataset'''&lt;br /&gt;
&lt;br /&gt;
|| Let’s use the '''Raisin dataset''' with two chosen variables or features to understand a classification problem.&lt;br /&gt;
&lt;br /&gt;
For more information on Raisin data please refer to Additional Reading Material on this tutorial page.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide '''&lt;br /&gt;
&lt;br /&gt;
'''Download Files '''&lt;br /&gt;
|| We will use a script file '''Intro.R '''and '''Raisin Dataset ‘raisin.xlsx’'''&lt;br /&gt;
&lt;br /&gt;
Please download these files from the''' Code files''' link of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Make a copy and then use them while practicing.&lt;br /&gt;
|- &lt;br /&gt;
|| [Computer screen]&lt;br /&gt;
&lt;br /&gt;
point to '''Intro.R''' and the folder '''Introduction.'''&lt;br /&gt;
&lt;br /&gt;
Point to the''' MLProject folder '''on the '''Desktop.'''&lt;br /&gt;
&lt;br /&gt;
|| I have downloaded and moved these files to the '''Introduction '''folder. &lt;br /&gt;
&lt;br /&gt;
This folder is located in the '''MLProject''' folder on my '''Desktop'''.&lt;br /&gt;
&lt;br /&gt;
I have also set the '''Introduction''' folder as my working directory.&lt;br /&gt;
&lt;br /&gt;
In this tutorial, we will introduce classification on the '''raisin''' dataset. &lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us switch to '''RStudio'''. &lt;br /&gt;
|- &lt;br /&gt;
|| Click Intro.R in RStudio&lt;br /&gt;
&lt;br /&gt;
Point to Intro.R in RStudio.&lt;br /&gt;
|| Let us open the script '''Intro.R''' in '''RStudio'''.&lt;br /&gt;
&lt;br /&gt;
Script '''Intro.R''' opens in '''RStudio'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' library(readxl)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
'''&amp;lt;nowiki&amp;gt;#install.packages(“package_name”)&amp;lt;/nowiki&amp;gt;'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Point to the command.'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select and run these commands to import the packages.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''readxl''' package to load the Excel file of our '''Raisin Dataset'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''caret''' package to create the '''confusion matrix.'''&lt;br /&gt;
&lt;br /&gt;
The '''ggplot2''' package will be used to create the '''decision boundary plot.'''&lt;br /&gt;
&lt;br /&gt;
Please ensure that all the packages are installed correctly.&lt;br /&gt;
&lt;br /&gt;
As I have already installed the packages, I have imported them directly. &lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' '''&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;- read_xlsx(&amp;quot;Raisin.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
|| Run this command to load the '''Raisin '''dataset.&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the '''Environment''' tab clearly.&lt;br /&gt;
&lt;br /&gt;
In the Environment tab below Data, you will see the '''data '''variable.&lt;br /&gt;
&lt;br /&gt;
Click on '''data '''to load the dataset in the Source window. &lt;br /&gt;
&lt;br /&gt;
Click on '''Intro.R''' in the Source window and close the tab.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command.&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;-data[c(&amp;quot;minorAL&amp;quot;,&amp;quot;ecc&amp;quot;,&amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
Select the commands and click the Run button&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We now select three columns from data.&lt;br /&gt;
&lt;br /&gt;
Two columns (&amp;quot;minorAL&amp;quot; and &amp;quot;ecc&amp;quot;) are chosen as features.&lt;br /&gt;
&lt;br /&gt;
The class column is chosen as a target variable.&lt;br /&gt;
&lt;br /&gt;
We convert the target variable '''data$class '''to a factor. &lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|- &lt;br /&gt;
|| Click on the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click on '''data.'''&lt;br /&gt;
|| Click on '''data '''to load the modified data in the Source window.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| We will now understand the feature space of this data.&lt;br /&gt;
|- &lt;br /&gt;
|| '''range_minor_al &amp;lt;- range(data$minorAL)'''&lt;br /&gt;
&lt;br /&gt;
'''range_ecc &amp;lt;- range(data$ecc)'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''range_minor_al &amp;lt;- range(data$minorAL)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''range_ecc &amp;lt;- range(data$ecc)'''&lt;br /&gt;
|| These commands show the range of the feature variables '''minorAL''' and''' ecc.'''&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the environment tab clearly.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The minimum and maximum values of '''minorAL''' and '''ecc''' are shown in their range variables.&lt;br /&gt;
|- &lt;br /&gt;
|| '''X &amp;lt;- seq(min(data$minorAL), max(data$minorAL), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''Y &amp;lt;- seq(min(data$ecc), max(data$ecc), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''feature &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
&lt;br /&gt;
|| We will now use the range to generate grid points to construct the feature space.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''X &amp;lt;- seq(min(data$minorAL), max(data$minorAL), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''Y &amp;lt;- seq(min(data$ecc), max(data$ecc), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
|| These commands generate sequences of 100 points spanning the ranges of '''minorAL''' and '''ecc'''.&lt;br /&gt;
&lt;br /&gt;
This command creates a Cartesian product of the two features to create the feature space.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|-&lt;br /&gt;
|| '''ggplot(data = data, aes(x = minorAL, y = ecc)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(aes(color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Feature Space&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| We will now plot the feature space created&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| '''ggplot(data = data, aes(x = minorAL, y = ecc)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(aes(color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Feature Space&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
|| These commands plot the data points in the feature space.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|-&lt;br /&gt;
|| Drag boundaries.&lt;br /&gt;
|| Drag boundaries to see the plot window clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the data.&lt;br /&gt;
|| Now let us split our data into training and testing data.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Click on '''Intro.R''' in the Source window, and type these commands.&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| This creates training data, consisting of 630 unique rows.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This creates testing data, consisting of 270 unique rows.&lt;br /&gt;
|-&lt;br /&gt;
|| Select the commands and click the Run button.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to the sets in the Environment Tab&lt;br /&gt;
&lt;br /&gt;
Click the '''test_data ''' and '''train_data '''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
The data sets are shown in the '''Environment '''tab.&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the Environment tab clearly&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Click on '''test_data ''' and '''train_data ''' to load them in the Source window.&lt;br /&gt;
|-&lt;br /&gt;
|| &lt;br /&gt;
|| Here we try to partition the '''feature space''' to construct the classifier.&lt;br /&gt;
&lt;br /&gt;
To begin with, one might construct a '''heuristic '''line to build the classifier.&lt;br /&gt;
|- &lt;br /&gt;
|| [Rstudio]&lt;br /&gt;
&lt;br /&gt;
'''fit = function(x)((x * (-0.0021)) + 1.445)'''&lt;br /&gt;
&lt;br /&gt;
'''model_predict &amp;lt;- function(x){'''&lt;br /&gt;
&lt;br /&gt;
'''factor(ifelse(x$ecc &amp;lt; fit(x$minorAL), &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
|| In the Source window type these commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''fit = function(x)((x * (-0.0021)) + 1.445)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''model_predict &amp;lt;- function(x){'''&lt;br /&gt;
&lt;br /&gt;
'''factor(ifelse(x$ecc &amp;lt; fit(x$minorAL), &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
Click Save and Click Run buttons. &lt;br /&gt;
|| Let us describe the steps of the classification algorithm.&lt;br /&gt;
&lt;br /&gt;
For that we will define a line to partition the data as a dummy classifier.&lt;br /&gt;
&lt;br /&gt;
It is not learned from the training data, so its performance may be poor.&lt;br /&gt;
&lt;br /&gt;
We define a function that separates data points belonging to either side of the line.&lt;br /&gt;
&lt;br /&gt;
Click Save.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''feature$class &amp;lt;- model_predict(feature)'''&lt;br /&gt;
&lt;br /&gt;
'''feature$classnum &amp;lt;- as.numeric(feature$class)'''&lt;br /&gt;
&lt;br /&gt;
|| Let’s use the line to classify the feature space and draw the decision boundary.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature$class &amp;lt;- model_predict(feature)'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature$classnum &amp;lt;- as.numeric(feature$class)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
This command will use the line created to predict the class of every point in the grid of feature space.&lt;br /&gt;
&lt;br /&gt;
This command encodes the class string labels into numbers suitable for plotting.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Click on '''feature''' in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Point to the data in the Source window.&lt;br /&gt;
|| Drag boundary to see the Environment window.&lt;br /&gt;
&lt;br /&gt;
Click on '''feature '''in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
The '''feature set '''with the predicted classes loads in the source window.&lt;br /&gt;
|- &lt;br /&gt;
|| '''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data= feature, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_abline(slope = -0.0021, intercept = 1.445, size = 1.2)+'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Data Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data= feature, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_abline(slope = -0.0021, intercept = 1.445, size = 1.2)+'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Data Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We are visualising the feature space and the partition line using GGPlot2. &lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the plot window.&lt;br /&gt;
|| Drag boundary to see the plot window clearly.&lt;br /&gt;
&lt;br /&gt;
The overall plot shows that the chosen line approximately separates the two classes of the data.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
'''prediction_test = model_predict(test_data)'''&lt;br /&gt;
|| Let us see how well the partition performs on the testing dataset.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''prediction_test = model_predict(test_data)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We predict the classes from testing data and store it in the '''prediction_test '''variable.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us now measure the performance of the classification.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix &amp;lt;- confusionMatrix(test_data$class,prediction_test)'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window, type the command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix &amp;lt;- confusionMatrix(test_data$class,prediction_test)'''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| We use the '''confusionMatrix''' function from the '''caret''' package to calculate the performance metrics.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
|- &lt;br /&gt;
|| '''test_confusion_matrix$overall[&amp;quot;Accuracy&amp;quot;]'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window, type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$overall[&amp;quot;Accuracy&amp;quot;]'''&lt;br /&gt;
|| This fetches the accuracy metric from the confusion matrix object created.&lt;br /&gt;
&lt;br /&gt;
Select and run the command&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Drag boundary to see the console window clearly&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''Accuracy'''&lt;br /&gt;
&lt;br /&gt;
0.6962963&lt;br /&gt;
&lt;br /&gt;
|| The accuracy on the testing dataset is about 69%.&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the source window clearly&lt;br /&gt;
&lt;br /&gt;
|| Drag boundary to see the source window clearly&lt;br /&gt;
&lt;br /&gt;
Let us now view the confusion matrix of the testing dataset&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| Select and run the command.&lt;br /&gt;
&lt;br /&gt;
The output is seen in the '''console''' window&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Point the output in the '''console window'''&lt;br /&gt;
&lt;br /&gt;
Reference&lt;br /&gt;
&lt;br /&gt;
Prediction Besni Kecimen&lt;br /&gt;
&lt;br /&gt;
Besni 50 82&lt;br /&gt;
&lt;br /&gt;
Kecimen 0 138&lt;br /&gt;
&lt;br /&gt;
|| Drag boundary to see the console window clearly &lt;br /&gt;
&lt;br /&gt;
Observe that: &lt;br /&gt;
&lt;br /&gt;
0 samples of class Besni have been incorrectly classified.&lt;br /&gt;
&lt;br /&gt;
82 samples of class Kecimen have been incorrectly classified. &lt;br /&gt;
&lt;br /&gt;
We can see that our partition line is skewed.&lt;br /&gt;
&lt;br /&gt;
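The reported accuracy can be checked directly from the confusion matrix shown above: correct predictions (the diagonal) divided by all test samples.

```r
# Accuracy = correct predictions / total test samples
correct = 50 + 138        # diagonal entries of the confusion matrix
total   = 50 + 82 + 0 + 138
correct / total           # 188 / 270 = 0.6962963, the ~69% reported above
```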
|- &lt;br /&gt;
|| &lt;br /&gt;
|| For the same problem, many partitions can be drawn.&lt;br /&gt;
&lt;br /&gt;
We could choose a complicated partition to reduce the training misclassification error.&lt;br /&gt;
&lt;br /&gt;
But that gives no control over the error on test data.&lt;br /&gt;
&lt;br /&gt;
Instead, we should aim for a simple classifier with a smaller test misclassification error.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| With this, we come to the end of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Let us summarize.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Summary&lt;br /&gt;
|| In this tutorial we have learned about:&lt;br /&gt;
* Machine Learning&lt;br /&gt;
* Classification and Regression Problems&lt;br /&gt;
* Workflow of an ML Classifier Algorithm&lt;br /&gt;
* Visualizing Feature Space&lt;br /&gt;
* Constructing a dummy classifier&lt;br /&gt;
* Evaluation of an ML algorithm&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Here is an assignment for you.&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Assignment&lt;br /&gt;
|| &lt;br /&gt;
*Use a vertical line as a classifier to partition the feature space.&lt;br /&gt;
* Plot the decision boundary for the same.&lt;br /&gt;
* Evaluate the classifier on the test dataset&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
About the Spoken Tutorial Project&lt;br /&gt;
|| The video at the following link summarizes the Spoken Tutorial project. &lt;br /&gt;
&lt;br /&gt;
Please download and watch it.&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Workshops&lt;br /&gt;
|| We conduct workshops using Spoken Tutorials and give certificates.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please contact us.&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Forum to answer questions&lt;br /&gt;
&lt;br /&gt;
Do you have questions in THIS Spoken Tutorial?&lt;br /&gt;
&lt;br /&gt;
Choose the minute and second where you have the question.&lt;br /&gt;
&lt;br /&gt;
Explain your question briefly.&lt;br /&gt;
&lt;br /&gt;
Someone from our team will answer them.&lt;br /&gt;
&lt;br /&gt;
Please visit this site.&lt;br /&gt;
|| Please post your timed queries in this forum.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Forum to answer questions&lt;br /&gt;
|| Do you have any general/technical questions?&lt;br /&gt;
&lt;br /&gt;
Please visit the forum given in the link.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
R Activities&lt;br /&gt;
&lt;br /&gt;
|| The FOSSEE team coordinates the Textbook Companion, Lab Migration and the Case Study Projects.&lt;br /&gt;
&lt;br /&gt;
We provide certificates to those who participate.&lt;br /&gt;
&lt;br /&gt;
For more details, please visit the website.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Acknowledgment&lt;br /&gt;
|| The '''Spoken Tutorial''' project was established by the Ministry of Education, Government of India.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Thank You&lt;br /&gt;
|| This tutorial is contributed by Debatosh Chakraborty from IIT Bombay.&lt;br /&gt;
&lt;br /&gt;
Thank you for joining.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Ushav</name></author>	</entry>

	<entry>
		<id>https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English</id>
		<title>Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English</title>
		<link rel="alternate" type="text/html" href="https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English"/>
				<updated>2024-06-04T09:13:15Z</updated>
		
		<summary type="html">&lt;p&gt;Ushav: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title of the script''': Introduction to Machine Learning in R&lt;br /&gt;
&lt;br /&gt;
'''Author''': Debatosh Chakraborty&lt;br /&gt;
&lt;br /&gt;
'''Keywords''': R, RStudio, machine learning, supervised, unsupervised, video tutorial.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=1&lt;br /&gt;
|- &lt;br /&gt;
| align=center| '''Visual Cue'''&lt;br /&gt;
| align=center| '''Narration'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Opening Slide'''&lt;br /&gt;
|| Welcome to this spoken tutorial on''' Introduction to Machine Learning in R'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Learning Objectives'''&lt;br /&gt;
&lt;br /&gt;
|| In this tutorial, we will learn about: &lt;br /&gt;
* Machine Learning&lt;br /&gt;
* Supervised and Unsupervised Learning&lt;br /&gt;
* Workflow of an ML Classifier Algorithm&lt;br /&gt;
* Visualizing Feature Space&lt;br /&gt;
* Constructing a dummy classifier&lt;br /&gt;
* Evaluation of the chosen dummy classifier&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''System Specifications'''&lt;br /&gt;
|| This tutorial is recorded using,&lt;br /&gt;
&lt;br /&gt;
* '''Windows 11 '''&lt;br /&gt;
* '''R '''version''' 4.3.0'''&lt;br /&gt;
* '''RStudio''' version '''2023.06.1'''&lt;br /&gt;
&lt;br /&gt;
It is recommended to install '''R''' version '''4.2.0''' or higher.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Prerequisites '''&lt;br /&gt;
&lt;br /&gt;
'''https://spoken-tutorial.org'''&lt;br /&gt;
|| To follow this tutorial, the learner should know&lt;br /&gt;
* Basic programming in '''R'''.&lt;br /&gt;
* Usage of the ggplot2 and dplyr packages.&lt;br /&gt;
&lt;br /&gt;
If not, please access the relevant tutorials on this website.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Machine Learning'''&lt;br /&gt;
&lt;br /&gt;
'''   '''&lt;br /&gt;
&lt;br /&gt;
|| About machine learning&lt;br /&gt;
&lt;br /&gt;
* ML enables computers to learn from data.&lt;br /&gt;
* ML algorithms automate the learning process from data through patterns.&lt;br /&gt;
* Their primary role is prediction, classification or clustering of data.&lt;br /&gt;
* ML algorithms are applied in several applications.&lt;br /&gt;
* For example Natural Language Processing, Image and speech recognition, etc.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Types of Machine Learning''' &lt;br /&gt;
|| ML algorithms include the following types and tasks: &lt;br /&gt;
* '''Supervised''' learning: Prediction and Classification,&lt;br /&gt;
* '''Unsupervised''' learning: Clustering,&lt;br /&gt;
* '''Semi-supervised''' learning,&lt;br /&gt;
* '''Reinforcement''' learning.&lt;br /&gt;
&lt;br /&gt;
In this series, we will focus on '''Supervised''' and '''Unsupervised''' learning algorithms. &lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Supervised and Unsupervised Learning'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''   '''&lt;br /&gt;
|| Supervised learning: Labeled data &lt;br /&gt;
* ML algorithms predict labels for unseen features &lt;br /&gt;
* They predict based on given features and labels of data.&lt;br /&gt;
&lt;br /&gt;
Unsupervised learning: Unlabeled data&lt;br /&gt;
* ML algorithms develop a mechanism to group similar features into clusters.&lt;br /&gt;
* And label them for future analysis.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slides'''&lt;br /&gt;
&lt;br /&gt;
'''Classification and Regression'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
* Supervised learning consists of Regression and Classification.&lt;br /&gt;
* '''Regression''' is applied to predict and learn continuous-valued responses from features. &lt;br /&gt;
* Regression techniques include Linear, Spline, Ridge, Lasso, and others.&lt;br /&gt;
* '''Classification''' is applied to predict the class of a discrete (labeled) response from features. &lt;br /&gt;
* Classification techniques include Logistic Regression, Decision Tree, SVM, and others.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slides'''&lt;br /&gt;
&lt;br /&gt;
'''Workflow of an ML Classifier algorithm'''&lt;br /&gt;
|| The workflow of an ML classifier algorithm includes:&lt;br /&gt;
* Feature Space: Collection of all possible values of the features.&lt;br /&gt;
* A classification algorithm partitions the feature space into a number of classes.&lt;br /&gt;
* Data is split into training and testing sets to learn and evaluate the algorithm.&lt;br /&gt;
* The model learns from the training data to create partitions of feature space.&lt;br /&gt;
* The model is evaluated on the test dataset through performance metrics.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Dataset'''&lt;br /&gt;
&lt;br /&gt;
|| Let’s use the '''Raisin dataset''' with two chosen variables or features to understand a classification problem.&lt;br /&gt;
&lt;br /&gt;
For more information on Raisin data please refer to Additional Reading Material on this tutorial page.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide '''&lt;br /&gt;
&lt;br /&gt;
'''Download Files '''&lt;br /&gt;
|| We will use the script file '''Intro.R''' and the Raisin dataset file '''‘raisin.xlsx’'''.&lt;br /&gt;
&lt;br /&gt;
Please download these files from the''' Code files''' link of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Make a copy and then use them while practicing.&lt;br /&gt;
|- &lt;br /&gt;
|| [Computer screen]&lt;br /&gt;
&lt;br /&gt;
point to '''Intro.R''' and the folder '''Introduction.'''&lt;br /&gt;
&lt;br /&gt;
Point to the''' MLProject folder '''on the '''Desktop.'''&lt;br /&gt;
&lt;br /&gt;
|| I have downloaded and moved these files to the '''Introduction '''folder. &lt;br /&gt;
&lt;br /&gt;
This folder is located in the '''MLProject''' folder on my '''Desktop'''.&lt;br /&gt;
&lt;br /&gt;
I have also set the '''Introduction''' folder as my working directory.&lt;br /&gt;
&lt;br /&gt;
In this tutorial, we will introduce classification on the '''raisin''' dataset. &lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us switch to '''RStudio'''. &lt;br /&gt;
|- &lt;br /&gt;
|| Click Intro.R in RStudio&lt;br /&gt;
&lt;br /&gt;
Point to Intro.R in RStudio.&lt;br /&gt;
|| Let us open the script '''Intro.R''' in '''RStudio'''.&lt;br /&gt;
&lt;br /&gt;
Script '''Intro.R''' opens in '''RStudio'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' library(readxl)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
'''&amp;lt;nowiki&amp;gt;#install.packages(&amp;quot;package_name&amp;quot;)&amp;lt;/nowiki&amp;gt;'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Point to the command.'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select and run these commands to import the packages.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''readxl''' package to load the Excel file of our '''Raisin Dataset'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''caret''' package to create the '''confusion matrix.'''&lt;br /&gt;
&lt;br /&gt;
The '''ggplot2''' package will be used to create the '''decision boundary plot.'''&lt;br /&gt;
&lt;br /&gt;
Please ensure that all the packages are installed correctly.&lt;br /&gt;
&lt;br /&gt;
As I have already installed the packages, I have imported them directly. &lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;- read_xlsx(&amp;quot;Raisin.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
|| Run this command to load the '''Raisin '''dataset.&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the '''Environment''' tab clearly.&lt;br /&gt;
&lt;br /&gt;
In the Environment tab below Data, you will see the '''data '''variable.&lt;br /&gt;
&lt;br /&gt;
Click on '''data '''to load the dataset in the Source window. &lt;br /&gt;
&lt;br /&gt;
Click on '''Intro.R''' in the Source window and close the tab.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command.&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;-data[c(&amp;quot;minorAL&amp;quot;,&amp;quot;ecc&amp;quot;,&amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
Select the commands and click the Run button&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We now select three columns from data.&lt;br /&gt;
&lt;br /&gt;
Two columns (&amp;quot;minorAL&amp;quot; and &amp;quot;ecc&amp;quot;) are chosen as features.&lt;br /&gt;
&lt;br /&gt;
The class column is chosen as a target variable.&lt;br /&gt;
&lt;br /&gt;
We convert the target variable '''data$class '''to a factor. &lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|- &lt;br /&gt;
|| Click on the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click on '''data.'''&lt;br /&gt;
|| Click on '''data '''to load the modified data in the Source window.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| We will now understand the feature space of this data.&lt;br /&gt;
|- &lt;br /&gt;
|| '''range_minor_al &amp;lt;- range(data$minorAL)'''&lt;br /&gt;
&lt;br /&gt;
'''range_ecc &amp;lt;- range(data$ecc)'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''range_minor_al &amp;lt;- range(data$minorAL)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''range_ecc &amp;lt;- range(data$ecc)'''&lt;br /&gt;
|| These commands show the range of the feature variables '''minorAL''' and''' ecc.'''&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the environment tab clearly.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The minimum and maximum values of '''minorAL''' and '''ecc''' are shown in their range variables.&lt;br /&gt;
|- &lt;br /&gt;
|| '''X &amp;lt;- seq(min(data$minorAL), max(data$minorAL), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''Y &amp;lt;- seq(min(data$ecc), max(data$ecc), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''feature &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
&lt;br /&gt;
|| We will now use the range to generate grid points to construct the feature space.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''X &amp;lt;- seq(min(data$minorAL), max(data$minorAL), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''Y &amp;lt;- seq(min(data$ecc), max(data$ecc), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
|| These commands generate sequences of 100 evenly spaced points spanning the ranges of '''minorAL''' and '''ecc'''.&lt;br /&gt;
&lt;br /&gt;
This command creates the Cartesian product of the two sequences to form the feature space grid.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
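As a self-contained sketch of these two steps, here is the same grid construction with made-up ranges standing in for range(data$minorAL) and range(data$ecc):&lt;br /&gt;

```r
# Illustrative sketch: build a 100 x 100 grid over two feature ranges.
# The numeric ranges below are invented; the tutorial derives them from the data.
X <- seq(200, 300, length.out = 100)   # stand-in for range(data$minorAL)
Y <- seq(0.3, 0.9, length.out = 100)   # stand-in for range(data$ecc)

# expand.grid() returns the Cartesian product: one row per (X, Y) pair.
feature <- expand.grid(minorAL = X, ecc = Y)
nrow(feature)   # 100 * 100 = 10000 grid points
```

Each of the 10,000 rows of '''feature''' is one grid point at which the classifier will later be evaluated.&lt;br /&gt;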
|-&lt;br /&gt;
|| '''ggplot(data = data, aes(x = minorAL, y = ecc)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(aes(color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Feature Space&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| We will now plot the data points in the feature space we created.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| '''ggplot(data = data, aes(x = minorAL, y = ecc)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(aes(color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Feature Space&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
|| These commands plot the data points in the feature space.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|-&lt;br /&gt;
|| Drag boundaries.&lt;br /&gt;
|| Drag boundaries to see the plot window clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the data.&lt;br /&gt;
|| Now let us split our data into training and testing data.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Click on '''Intro.R''' in the Source window, and type these commands.&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| This creates training data, consisting of 630 unique rows.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This creates testing data, consisting of 270 unique rows.&lt;br /&gt;
|-&lt;br /&gt;
|| Select the commands and click the Run button.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to the sets in the Environment Tab&lt;br /&gt;
&lt;br /&gt;
Click the '''test_data ''' and '''train_data '''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
The data sets are shown in the '''Environment '''tab.&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the Environment tab clearly&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Click on '''test_data ''' and '''train_data ''' to load them in the Source window.&lt;br /&gt;
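The row counts quoted above (630 training, 270 testing) can be verified with a toy 900-row data frame; the raisin data has 900 rows, but the values here are random stand-ins:&lt;br /&gt;

```r
# Toy data frame with 900 rows, mimicking the raisin data's size.
data <- data.frame(x = rnorm(900), y = rnorm(900))

set.seed(1)  # make the random split reproducible
index_split <- sample(1:nrow(data), size = 0.7 * nrow(data), replace = FALSE)

train_data <- data[index_split, ]    # the 630 sampled rows
test_data  <- data[-index_split, ]   # the remaining 270 rows
c(nrow(train_data), nrow(test_data))
```

Because '''replace = FALSE''', no row appears in both sets: every row goes to exactly one of the two.&lt;br /&gt;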
|-&lt;br /&gt;
|| &lt;br /&gt;
|| Here we try to partition the '''feature space''' to construct the classifier.&lt;br /&gt;
&lt;br /&gt;
To begin with, one might construct a '''heuristic '''line to build the classifier.&lt;br /&gt;
|- &lt;br /&gt;
|| [Rstudio]&lt;br /&gt;
&lt;br /&gt;
'''fit = function(x)((x * (-0.0021)) + 1.445)'''&lt;br /&gt;
&lt;br /&gt;
'''model_predict &amp;lt;- function(x){'''&lt;br /&gt;
&lt;br /&gt;
'''factor(ifelse(x$ecc &amp;lt; fit(x$minorAL), &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
|| In the Source window type these commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''fit = function(x)((x * (-0.0021)) + 1.445)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''model_predict &amp;lt;- function(x){'''&lt;br /&gt;
&lt;br /&gt;
'''factor(ifelse(x$ecc &amp;lt; fit(x$minorAL), &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
Click Save and Click Run buttons. &lt;br /&gt;
|| Let us describe the steps of the classification algorithm.&lt;br /&gt;
&lt;br /&gt;
For that, we define a line that partitions the data, acting as a dummy classifier.&lt;br /&gt;
&lt;br /&gt;
Since this line is not learned from the training data, its performance may be poor.&lt;br /&gt;
&lt;br /&gt;
We define a function that separates data points belonging to either side of the line.&lt;br /&gt;
&lt;br /&gt;
Click Save.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
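As a minimal sketch, the same heuristic line and rule can be applied to two invented points, one on each side of the line:&lt;br /&gt;

```r
# The heuristic line: ecc = -0.0021 * minorAL + 1.445 (chosen by eye, not fitted).
fit <- function(x) ((x * (-0.0021)) + 1.445)

# Points below the line are labelled "Kecimen", points above it "Besni".
model_predict <- function(x) {
  factor(ifelse(x$ecc < fit(x$minorAL), "Kecimen", "Besni"))
}

# Two illustrative points: at minorAL = 200 the line sits at ecc = 1.025,
# so ecc = 0.5 falls below it and ecc = 1.2 falls above it.
toy <- data.frame(minorAL = c(200, 200), ecc = c(0.5, 1.2))
model_predict(toy)   # labels: Kecimen, Besni
```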
&lt;br /&gt;
|- &lt;br /&gt;
|| '''feature$class &amp;lt;- model_predict(feature)'''&lt;br /&gt;
&lt;br /&gt;
'''feature$classnum &amp;lt;- as.numeric(feature$class)'''&lt;br /&gt;
&lt;br /&gt;
|| Let’s use the line to classify the feature space and draw the decision boundary.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature$class &amp;lt;- model_predict(feature)'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature$classnum &amp;lt;- as.numeric(feature$class)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
This command will use the line created to predict the class of every point in the grid of feature space.&lt;br /&gt;
&lt;br /&gt;
This command encodes the class string labels into numbers suitable for plotting&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Click on '''feature''' in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Point to the data in the Source window.&lt;br /&gt;
|| Drag boundary to see the Environment window.&lt;br /&gt;
&lt;br /&gt;
Click on '''feature '''in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
The '''feature set '''with the predicted classes loads in the source window.&lt;br /&gt;
|- &lt;br /&gt;
|| '''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data= feature, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_abline(slope = -0.0021, intercept = 1.445, size = 1.2)+'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Data Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data= feature, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_abline(slope = -0.0021, intercept = 1.445, size = 1.2)+'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Data Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We visualise the feature space and the partition line using '''ggplot2'''.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the plot window.&lt;br /&gt;
|| Drag boundary to see the plot window clearly.&lt;br /&gt;
&lt;br /&gt;
The plot shows that the chosen line approximately separates the two classes of the data.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
'''prediction_test = model_predict(test_data)'''&lt;br /&gt;
|| Let us see how well the partition performs on the testing dataset.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''prediction_test = model_predict(test_data)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We predict the classes of the testing data and store them in the '''prediction_test''' variable.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us now measure the performance of the classification.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix &amp;lt;- confusionMatrix(test_data$class,prediction_test)'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window, type the command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix &amp;lt;- confusionMatrix(test_data$class,prediction_test)'''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| We use the '''confusionMatrix''' function from the '''caret''' package to calculate the performance metrics.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
|- &lt;br /&gt;
|| '''test_confusion_matrix$overall[&amp;quot;Accuracy&amp;quot;]'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window, type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$overall[&amp;quot;Accuracy&amp;quot;]'''&lt;br /&gt;
|| This fetches the accuracy metric from the confusion matrix object.&lt;br /&gt;
&lt;br /&gt;
Select and run the command&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Drag boundary to see the console window clearly&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''Accuracy'''&lt;br /&gt;
&lt;br /&gt;
0.6962963&lt;br /&gt;
&lt;br /&gt;
|| The accuracy on the testing dataset is approximately 69%.&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the source window clearly&lt;br /&gt;
&lt;br /&gt;
|| Drag boundary to see the source window clearly&lt;br /&gt;
&lt;br /&gt;
Let us now view the confusion matrix of the testing dataset&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| Select and run the command.&lt;br /&gt;
&lt;br /&gt;
The output is seen in the '''console''' window&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Point the output in the '''console window'''&lt;br /&gt;
&lt;br /&gt;
Reference&lt;br /&gt;
&lt;br /&gt;
Prediction Besni Kecimen&lt;br /&gt;
&lt;br /&gt;
Besni 50 82&lt;br /&gt;
&lt;br /&gt;
Kecimen 0 138&lt;br /&gt;
&lt;br /&gt;
|| Drag boundary to see the console window clearly &lt;br /&gt;
&lt;br /&gt;
Observe that: &lt;br /&gt;
&lt;br /&gt;
0 samples of class Besni have been incorrectly classified.&lt;br /&gt;
&lt;br /&gt;
82 samples of class Kecimen have been incorrectly classified. &lt;br /&gt;
&lt;br /&gt;
We can see that our heuristic partition line is biased toward one class.&lt;br /&gt;
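The reported accuracy can be recomputed by hand from these counts; of the 270 test samples, 50 + 138 = 188 lie on the diagonal. A sketch building the same table as a plain matrix:&lt;br /&gt;

```r
# Confusion matrix counts from the console output:
# rows = prediction, columns = reference.
cm <- matrix(c(50, 0, 82, 138), nrow = 2,
             dimnames = list(Prediction = c("Besni", "Kecimen"),
                             Reference  = c("Besni", "Kecimen")))

# Accuracy = correct predictions / all predictions.
accuracy <- sum(diag(cm)) / sum(cm)
round(accuracy, 7)   # 0.6962963

# Off-diagonal counts per column give the misclassifications per true class.
misclassified <- colSums(cm) - diag(cm)   # Besni: 0, Kecimen: 82
```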
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| For the same problem many partitions can be drawn.&lt;br /&gt;
&lt;br /&gt;
We can choose a complicated partition to reduce train misclassification error.&lt;br /&gt;
&lt;br /&gt;
But there will be no control over the error on test data.&lt;br /&gt;
&lt;br /&gt;
Instead, we aim to choose a simple classifier with a smaller test misclassification error.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| With this, we come to the end of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Let us summarize.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Summary&lt;br /&gt;
|| In this tutorial we have learned about:&lt;br /&gt;
* Machine Learning&lt;br /&gt;
* Classification and Regression Problems&lt;br /&gt;
* Workflow of an ML Classifier Algorithm&lt;br /&gt;
* Visualizing Feature Space&lt;br /&gt;
* Constructing a dummy classifier&lt;br /&gt;
* Evaluation of an ML algorithm&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Here is an assignment for you.&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Assignment&lt;br /&gt;
|| &lt;br /&gt;
*Use a vertical line as a classifier to partition the feature space.&lt;br /&gt;
* Plot the decision boundary for the same.&lt;br /&gt;
* Evaluate the classifier on the test dataset&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
About the Spoken Tutorial Project&lt;br /&gt;
|| The video at the following link summarizes the Spoken Tutorial project. &lt;br /&gt;
&lt;br /&gt;
Please download and watch it.&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Workshops&lt;br /&gt;
|| We conduct workshops using Spoken Tutorials and give certificates.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please contact us.&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Forum to answer questions&lt;br /&gt;
&lt;br /&gt;
Do you have questions in THIS Spoken Tutorial?&lt;br /&gt;
&lt;br /&gt;
Choose the minute and second where you have the question.&lt;br /&gt;
&lt;br /&gt;
Explain your question briefly.&lt;br /&gt;
&lt;br /&gt;
Someone from our team will answer them.&lt;br /&gt;
&lt;br /&gt;
Please visit this site.&lt;br /&gt;
|| Please post your timed queries in this forum.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Forum to answer questions&lt;br /&gt;
|| Do you have any general/technical questions?&lt;br /&gt;
&lt;br /&gt;
Please visit the forum given in the link.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
R Activities&lt;br /&gt;
&lt;br /&gt;
|| The FOSSEE team coordinates the Textbook Companion, Lab Migration and the Case Study Projects.&lt;br /&gt;
&lt;br /&gt;
We give certificates to those who do this.&lt;br /&gt;
&lt;br /&gt;
For more details, please visit the website.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Acknowledgment&lt;br /&gt;
|| The '''Spoken Tutorial''' project was established by the Ministry of Education, Government of India.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Thank You&lt;br /&gt;
|| This tutorial is contributed by Debatosh Chakraborty from IIT Bombay.&lt;br /&gt;
&lt;br /&gt;
Thank you for joining.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Ushav</name></author>	</entry>

	<entry>
		<id>https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English</id>
		<title>Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English</title>
		<link rel="alternate" type="text/html" href="https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English"/>
				<updated>2024-06-04T09:08:46Z</updated>
		
		<summary type="html">&lt;p&gt;Ushav: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title of the script''': Introduction to Machine Learning in R&lt;br /&gt;
&lt;br /&gt;
'''Author''': Debatosh Chakraborty&lt;br /&gt;
&lt;br /&gt;
'''Keywords''': R, RStudio, machine learning, supervised, unsupervised, video tutorial.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=1&lt;br /&gt;
|- &lt;br /&gt;
| align=center| '''Visual Cue'''&lt;br /&gt;
| align=center| '''Narration'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Opening Slide'''&lt;br /&gt;
|| Welcome to this spoken tutorial on''' Introduction to Machine Learning in R'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Learning Objectives'''&lt;br /&gt;
&lt;br /&gt;
|| In this tutorial, we will learn about: &lt;br /&gt;
* Machine Learning&lt;br /&gt;
* Supervised and Unsupervised Learning&lt;br /&gt;
* Workflow of an ML Classifier Algorithm&lt;br /&gt;
* Visualizing Feature Space&lt;br /&gt;
* Constructing a dummy classifier&lt;br /&gt;
* Evaluation of the chosen dummy classifier&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''System Specifications'''&lt;br /&gt;
|| This tutorial is recorded using,&lt;br /&gt;
&lt;br /&gt;
* '''Windows 11 '''&lt;br /&gt;
* '''R '''version''' 4.3.0'''&lt;br /&gt;
* '''RStudio''' version '''2023.06.1'''&lt;br /&gt;
&lt;br /&gt;
It is recommended to install '''R''' version '''4.2.0''' or higher.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Prerequisites '''&lt;br /&gt;
&lt;br /&gt;
'''https://spoken-tutorial.org'''&lt;br /&gt;
|| To follow this tutorial, the learner should know&lt;br /&gt;
* Basic programming in '''R'''.&lt;br /&gt;
* How to use the '''ggplot2''' and '''dplyr''' packages.&lt;br /&gt;
&lt;br /&gt;
If not, please access the relevant tutorials on this website.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Machine Learning'''&lt;br /&gt;
&lt;br /&gt;
'''   '''&lt;br /&gt;
&lt;br /&gt;
|| About machine learning&lt;br /&gt;
&lt;br /&gt;
* ML enables computers to learn from data.&lt;br /&gt;
* ML algorithms automate the learning process from data through patterns.&lt;br /&gt;
* Their primary role is prediction, classification or clustering of data.&lt;br /&gt;
* ML algorithms are applied in several applications.&lt;br /&gt;
* For example, Natural Language Processing, and image and speech recognition.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Types of Machine Learning''' &lt;br /&gt;
|| ML algorithms include the following types and tasks: &lt;br /&gt;
* '''Supervised '''learning: Prediction and Classification''',''' &lt;br /&gt;
* '''Unsupervised '''learning''': '''Clustering''','''&lt;br /&gt;
* '''Semi-supervised '''learning&lt;br /&gt;
* '''Reinforcement '''learning'''.'''&lt;br /&gt;
&lt;br /&gt;
In this series, we will focus on '''Supervised''' and '''Unsupervised''' learning algorithms. &lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Supervised and Unsupervised Learning'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''   '''&lt;br /&gt;
|| Supervised learning: Labeled data &lt;br /&gt;
* ML algorithms predict labels for unseen features.&lt;br /&gt;
* They learn to do this from the given features and labels of the data.&lt;br /&gt;
&lt;br /&gt;
Unsupervised learning: Unlabeled data&lt;br /&gt;
* ML algorithms develop a mechanism to group similar features into clusters.&lt;br /&gt;
* These clusters can then be labeled for future analysis.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slides'''&lt;br /&gt;
&lt;br /&gt;
'''Classification and Regression'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
* Supervised learning consists of Regression and Classification.&lt;br /&gt;
* '''Regression''' is applied to predict and learn continuous-valued responses from features. &lt;br /&gt;
* Regression techniques include Linear, Spline, Ridge, Lasso, and others.&lt;br /&gt;
* '''Classification''' is applied to predict the class of a discrete (labeled) response from features. &lt;br /&gt;
* Classification techniques include Logistic Regression, Decision Tree, SVM, and others.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slides'''&lt;br /&gt;
&lt;br /&gt;
'''Workflow of an ML Classifier algorithm'''&lt;br /&gt;
|| The Workflow of an ML Classifier algorithm includes:&lt;br /&gt;
* Feature Space: Collection of all possible values of the features.&lt;br /&gt;
* A classification algorithm partitions the feature space into a number of classes.&lt;br /&gt;
* Data is split into training and testing sets to learn and evaluate the algorithm.&lt;br /&gt;
* The model learns from the training data to create partitions of feature space.&lt;br /&gt;
* The model is evaluated on the test dataset through performance metrics.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Dataset'''&lt;br /&gt;
&lt;br /&gt;
|| Let’s use the '''Raisin dataset''' with two chosen variables, or features, to understand a classification problem.&lt;br /&gt;
&lt;br /&gt;
For more information on the Raisin data, please refer to the Additional Reading Material on this tutorial page.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide '''&lt;br /&gt;
&lt;br /&gt;
'''Download Files '''&lt;br /&gt;
|| We will use the script file '''Intro.R''' and the Raisin dataset file '''‘raisin.xlsx’'''.&lt;br /&gt;
&lt;br /&gt;
Please download these files from the''' Code files''' link of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Make a copy and then use them while practicing.&lt;br /&gt;
|- &lt;br /&gt;
|| [Computer screen]&lt;br /&gt;
&lt;br /&gt;
point to '''Intro.R''' and the folder '''Introduction.'''&lt;br /&gt;
&lt;br /&gt;
Point to the''' MLProject folder '''on the '''Desktop.'''&lt;br /&gt;
&lt;br /&gt;
|| I have downloaded and moved these files to the '''Introduction '''folder. &lt;br /&gt;
&lt;br /&gt;
This folder is located in the '''MLProject''' folder on my '''Desktop'''.&lt;br /&gt;
&lt;br /&gt;
I have also set the '''Introduction''' folder as my working directory.&lt;br /&gt;
&lt;br /&gt;
In this tutorial, we will introduce classification on the '''raisin''' dataset. &lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us switch to '''RStudio'''. &lt;br /&gt;
|- &lt;br /&gt;
|| Click Intro.R in RStudio&lt;br /&gt;
&lt;br /&gt;
Point to Intro.R in RStudio.&lt;br /&gt;
|| Let us open the script '''Intro.R''' in '''RStudio'''.&lt;br /&gt;
&lt;br /&gt;
Script '''Intro.R''' opens in '''RStudio'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' library(readxl)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
'''&amp;lt;nowiki&amp;gt;#install.packages(&amp;quot;package_name&amp;quot;)&amp;lt;/nowiki&amp;gt;'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Point to the command.'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select and run these commands to import the packages.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''readxl''' package to load the Excel file of our '''Raisin Dataset'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''caret''' package to create the '''confusion matrix.'''&lt;br /&gt;
&lt;br /&gt;
The '''ggplot2''' package will be used to create the '''decision boundary plot.'''&lt;br /&gt;
&lt;br /&gt;
Please ensure that all the packages are installed correctly.&lt;br /&gt;
&lt;br /&gt;
As I have already installed the packages, I have imported them directly. &lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;- read_xlsx(&amp;quot;Raisin.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
|| Run this command to load the '''Raisin '''dataset.&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the '''Environment''' tab clearly.&lt;br /&gt;
&lt;br /&gt;
In the Environment tab below Data, you will see the '''data '''variable.&lt;br /&gt;
&lt;br /&gt;
Click on '''data '''to load the dataset in the Source window. &lt;br /&gt;
&lt;br /&gt;
Click on '''Intro.R''' in the Source window and close the tab.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command.&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;-data[c(&amp;quot;minorAL&amp;quot;,&amp;quot;ecc&amp;quot;,&amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
Select the commands and click the Run button&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We now select three columns from data.&lt;br /&gt;
&lt;br /&gt;
Two columns (&amp;quot;minorAL&amp;quot; and &amp;quot;ecc&amp;quot;) are chosen as features.&lt;br /&gt;
&lt;br /&gt;
The class column is chosen as a target variable.&lt;br /&gt;
&lt;br /&gt;
We convert the target variable '''data$class '''to a factor. &lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
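These two preprocessing steps can be tried on a toy data frame; the column names follow the raisin data, but the values and the extra '''area''' column are invented:&lt;br /&gt;

```r
# Toy data frame with one extra column, mimicking the raw raisin data.
data <- data.frame(minorAL = c(250, 310),
                   ecc     = c(0.55, 0.80),
                   area    = c(9000, 12000),        # an unused extra column
                   class   = c("Kecimen", "Besni"))

# Keep only the two features and the target.
data <- data[c("minorAL", "ecc", "class")]

# Encode the target as a factor so classifiers treat it as categorical.
data$class <- factor(data$class)

names(data)          # "minorAL" "ecc" "class"
levels(data$class)   # "Besni" "Kecimen" (alphabetical by default)
```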
|- &lt;br /&gt;
|| Click on the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click on '''data.'''&lt;br /&gt;
|| Click on '''data '''to load the modified data in the Source window.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| We will now understand the feature space of this data.&lt;br /&gt;
|- &lt;br /&gt;
|| '''range_minor_al &amp;lt;- range(data$minorAL)'''&lt;br /&gt;
&lt;br /&gt;
'''range_ecc &amp;lt;- range(data$ecc)'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''range_minor_al &amp;lt;- range(data$minorAL)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''range_ecc &amp;lt;- range(data$ecc)'''&lt;br /&gt;
|| These commands show the range of the feature variables '''minorAL''' and''' ecc.'''&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the environment tab clearly.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The minimum and maximum values of '''minorAL''' and '''ecc''' are shown in their range variables.&lt;br /&gt;
|- &lt;br /&gt;
|| '''X &amp;lt;- seq(min(data$minorAL), max(data$minorAL), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''Y &amp;lt;- seq(min(data$ecc), max(data$ecc), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''feature &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
&lt;br /&gt;
|| We will now use the range to generate grid points to construct the feature space.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''X &amp;lt;- seq(min(data$minorAL), max(data$minorAL), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''Y &amp;lt;- seq(min(data$ecc), max(data$ecc), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
|| This command generates a sequence of points spanning the range of '''minorAL '''and''' ecc'''.&lt;br /&gt;
&lt;br /&gt;
This command creates a Cartesian product of the two features to form the feature space.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
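As a side sketch of the same idea (toy ranges and made-up numbers, not the raisin data), the grid construction can be tried on its own:

```r
# Sketch of the grid construction with toy ranges (not the raisin data).
# seq() spans each feature's range; expand.grid() forms the Cartesian
# product, i.e. every (minorAL, ecc) combination appears once.
X = seq(0, 1, length.out = 5)        # stand-in for range(data$minorAL)
Y = seq(10, 20, length.out = 4)      # stand-in for range(data$ecc)
feature = expand.grid(minorAL = X, ecc = Y)

nrow(feature)   # 5 * 4 = 20 grid points
```

With 100 points per axis, as in the tutorial, the grid would have 10,000 points.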
|-&lt;br /&gt;
|| '''ggplot(data = data, aes(x = minorAL, y = ecc)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(aes(color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Feature Space&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| We will now plot the feature space created&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| '''ggplot(data = data, aes(x = minorAL, y = ecc)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(aes(color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Feature Space&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
|| These commands plot the data points in the feature space.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|-&lt;br /&gt;
|| Drag boundaries.&lt;br /&gt;
|| Drag boundaries to see the plot window clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the data.&lt;br /&gt;
|| Now let us split our data into training and testing data.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Click on '''Intro.R''' in the Source window, and type these commands.&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| This creates training data, consisting of 630 unique rows.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This creates testing data, consisting of 270 unique rows.&lt;br /&gt;
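The split itself can be sketched independently; here a toy 900-row data frame stands in for the raisin data, and the indices are sampled exactly as in the tutorial:

```r
# 70/30 train/test split by sampling row indices without replacement
# (a toy 900-row data frame stands in for the raisin data).
set.seed(1)
d = data.frame(x = 1:900)
index_split = sample(1:nrow(d), size = 0.7 * nrow(d), replace = FALSE)

train_data = d[index_split, , drop = FALSE]    # 630 rows
test_data  = d[-index_split, , drop = FALSE]   # the remaining 270 rows
```

Sampling without replacement guarantees the two sets are disjoint and together cover all 900 rows.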
|-&lt;br /&gt;
|| Select the commands and click the Run button.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to the sets in the Environment Tab&lt;br /&gt;
&lt;br /&gt;
Click the '''test_data ''' and '''train_data '''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
The data sets are shown in the '''Environment '''tab.&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the Environment tab clearly&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Click on '''test_data ''' and '''train_data ''' to load them in the Source window.&lt;br /&gt;
|-&lt;br /&gt;
|| &lt;br /&gt;
|| Here we try to partition the '''feature space''' to construct the classifier.&lt;br /&gt;
&lt;br /&gt;
To begin with, one might construct a '''heuristic '''line to build the classifier.&lt;br /&gt;
|- &lt;br /&gt;
|| [Rstudio]&lt;br /&gt;
&lt;br /&gt;
'''fit = function(x)((x * (-0.0021)) + 1.445)'''&lt;br /&gt;
&lt;br /&gt;
'''model_predict &amp;lt;- function(x){'''&lt;br /&gt;
&lt;br /&gt;
'''factor(ifelse(x$ecc &amp;lt; fit(x$minorAL), &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
|| In the Source window type these commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''fit = function(x)((x * (-0.0021)) + 1.445)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''model_predict &amp;lt;- function(x){'''&lt;br /&gt;
&lt;br /&gt;
'''factor(ifelse(x$ecc &amp;lt; fit(x$minorAL), &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
Click Save and Click Run buttons. &lt;br /&gt;
|| Let us describe the steps of the classification algorithm.&lt;br /&gt;
&lt;br /&gt;
For that, we will define a line that partitions the data, serving as a dummy classifier.&lt;br /&gt;
&lt;br /&gt;
It is not fitted to the training data, so its performance may be poor.&lt;br /&gt;
&lt;br /&gt;
We define a function that separates data points belonging to either side of the line.&lt;br /&gt;
&lt;br /&gt;
Click Save.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
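As a self-contained sketch of the dummy classifier (the line's coefficients are the ones used in the tutorial; the two test points are made up):

```r
# The heuristic line and the rule derived from it: points below the
# line are labelled Kecimen, points above it Besni.
fit = function(x) (x * (-0.0021)) + 1.445
model_predict = function(x) {
  factor(ifelse(fit(x$minorAL) > x$ecc, "Kecimen", "Besni"))
}

# Two made-up points: fit(100) is 1.235, so ecc = 1.0 falls below the
# line (Kecimen) and ecc = 1.4 falls above it (Besni).
toy = data.frame(minorAL = c(100, 100), ecc = c(1.0, 1.4))
model_predict(toy)   # Kecimen Besni
```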
&lt;br /&gt;
|- &lt;br /&gt;
|| '''feature$class &amp;lt;- model_predict(feature)'''&lt;br /&gt;
&lt;br /&gt;
'''feature$classnum &amp;lt;- as.numeric(feature$class)'''&lt;br /&gt;
&lt;br /&gt;
|| Let’s use the line to classify the feature space and draw the decision boundary.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature$class &amp;lt;- model_predict(feature)'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature$classnum &amp;lt;- as.numeric(feature$class)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
This command will use the line created to predict the class of every point in the grid of feature space.&lt;br /&gt;
&lt;br /&gt;
This command encodes the class string labels into numbers suitable for plotting&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Click on '''feature''' in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Point to the data in the Source window.&lt;br /&gt;
|| Drag boundary to see the Environment window.&lt;br /&gt;
&lt;br /&gt;
Click on '''feature '''in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
The '''feature set '''with the predicted classes loads in the source window.&lt;br /&gt;
|- &lt;br /&gt;
|| '''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data= feature, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_abline(slope = -0.0021, intercept = 1.445, size = 1.2)+'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Data Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data= feature, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_abline(slope = -0.0021, intercept = 1.445, size = 1.2)+'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Data Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We are visualising the feature space and the partition line using '''ggplot2'''.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the plot window.&lt;br /&gt;
|| Drag boundary to see the plot window clearly.&lt;br /&gt;
&lt;br /&gt;
The overall plot shows that the chosen line approximately separates the two classes of the data.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
'''prediction_test = model_predict(test_data)'''&lt;br /&gt;
|| Let us see how well the partition performs on the testing dataset.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''prediction_test = model_predict(test_data)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We predict the classes from testing data and store it in the '''prediction_test '''variable.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us now measure the performance of the classification.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix &amp;lt;- confusionMatrix(test_data$class,prediction_test)'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window, type the command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix &amp;lt;- confusionMatrix(test_data$class,prediction_test)'''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| We use the '''confusionMatrix''' function from the '''caret''' package to calculate performance metrics.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
|- &lt;br /&gt;
|| '''test_confusion_matrix$overall[&amp;quot;Accuracy&amp;quot;]'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window, type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$overall[&amp;quot;Accuracy&amp;quot;]'''&lt;br /&gt;
|| This fetches the accuracy metric from the confusion matrix object.&lt;br /&gt;
&lt;br /&gt;
Select and run the command&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Drag boundary to see the console window clearly&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''Accuracy'''&lt;br /&gt;
&lt;br /&gt;
0.6962963&lt;br /&gt;
&lt;br /&gt;
|| The accuracy on the testing dataset is approximately 69.6%.&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the source window clearly&lt;br /&gt;
&lt;br /&gt;
|| Drag boundary to see the source window clearly&lt;br /&gt;
&lt;br /&gt;
Let us now view the confusion matrix of the testing dataset&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| Select and run the command.&lt;br /&gt;
&lt;br /&gt;
The output is seen in the '''console''' window&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Point the output in the '''console window'''&lt;br /&gt;
&lt;br /&gt;
Reference&lt;br /&gt;
&lt;br /&gt;
Prediction Besni Kecimen&lt;br /&gt;
&lt;br /&gt;
Besni 50 82&lt;br /&gt;
&lt;br /&gt;
Kecimen 0 138&lt;br /&gt;
&lt;br /&gt;
|| Drag boundary to see the console window clearly &lt;br /&gt;
&lt;br /&gt;
Observe that: &lt;br /&gt;
&lt;br /&gt;
0 samples of class Besni have been incorrectly classified.&lt;br /&gt;
&lt;br /&gt;
82 samples of class Kecimen have been incorrectly classified. &lt;br /&gt;
&lt;br /&gt;
We can see that our partition line is skewed.&lt;br /&gt;
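The reported accuracy follows directly from these counts; a quick hand check, rebuilding the matrix from the console output:

```r
# Rebuild the confusion matrix from the console counts and verify the
# accuracy: correct predictions (the diagonal) over the total.
cm = matrix(c(50, 0, 82, 138), nrow = 2,
            dimnames = list(Prediction = c("Besni", "Kecimen"),
                            Reference  = c("Besni", "Kecimen")))
accuracy = sum(diag(cm)) / sum(cm)
round(accuracy, 7)   # 0.6962963
```

That is 188 correct predictions out of 270 test samples.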
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| For the same problem, many different partitions can be drawn.&lt;br /&gt;
&lt;br /&gt;
We could choose a complicated partition to reduce the training misclassification error.&lt;br /&gt;
&lt;br /&gt;
But that gives us no control over the error on the test data.&lt;br /&gt;
&lt;br /&gt;
Instead, we aim for a classifier that is simple and has a small test misclassification error.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| With this, we come to the end of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Let us summarize.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Summary&lt;br /&gt;
|| In this tutorial we have learned about:&lt;br /&gt;
* Machine Learning&lt;br /&gt;
* Classification and Regression Problems&lt;br /&gt;
* Workflow of an ML Classifier Algorithm&lt;br /&gt;
* Visualizing Feature Space&lt;br /&gt;
* Constructing a dummy classifier&lt;br /&gt;
* Evaluation of an ML algorithm&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Here is an assignment for you.&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Assignment&lt;br /&gt;
|| &lt;br /&gt;
*Use a vertical line as a classifier to partition the feature space.&lt;br /&gt;
* Plot the decision boundary for the same.&lt;br /&gt;
* Evaluate the classifier on the test dataset&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
About the Spoken Tutorial Project&lt;br /&gt;
|| The video at the following link summarizes the Spoken Tutorial project. &lt;br /&gt;
&lt;br /&gt;
Please download and watch it.&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Workshops&lt;br /&gt;
|| We conduct workshops using Spoken Tutorials and give certificates.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please contact us.&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Forum to answer questions&lt;br /&gt;
&lt;br /&gt;
Do you have questions in THIS Spoken Tutorial?&lt;br /&gt;
&lt;br /&gt;
Choose the minute and second where you have the question.&lt;br /&gt;
&lt;br /&gt;
Explain your question briefly.&lt;br /&gt;
&lt;br /&gt;
Someone from our team will answer them.&lt;br /&gt;
&lt;br /&gt;
Please visit this site.&lt;br /&gt;
|| Please post your timed queries in this forum.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Forum to answer questions&lt;br /&gt;
|| Do you have any general/technical questions?&lt;br /&gt;
&lt;br /&gt;
Please visit the forum given in the link.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
R Activities&lt;br /&gt;
&lt;br /&gt;
|| The FOSSEE team coordinates the Textbook Companion, Lab Migration and the Case Study Projects.&lt;br /&gt;
&lt;br /&gt;
We give certificates to those who do this.&lt;br /&gt;
&lt;br /&gt;
For more details, please visit the website.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Acknowledgment&lt;br /&gt;
|| The '''Spoken Tutorial''' project was established by the Ministry of Education, Government of India.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Thank You&lt;br /&gt;
|| This tutorial is contributed by Debatosh Chakraborty from IIT Bombay.&lt;br /&gt;
&lt;br /&gt;
Thank you for joining.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Ushav</name></author>	</entry>

	<entry>
		<id>https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English</id>
		<title>Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English</title>
		<link rel="alternate" type="text/html" href="https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English"/>
				<updated>2024-06-04T09:06:50Z</updated>
		
		<summary type="html">&lt;p&gt;Ushav: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title of the script''': Introduction to Machine Learning in R&lt;br /&gt;
&lt;br /&gt;
'''Author''': Debatosh Chakraborty&lt;br /&gt;
&lt;br /&gt;
'''Keywords''': R, RStudio, machine learning, supervised, unsupervised, video tutorial.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=1&lt;br /&gt;
|- &lt;br /&gt;
| align=center| '''Visual Cue'''&lt;br /&gt;
| align=center| '''Narration'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Opening Slide'''&lt;br /&gt;
|| Welcome to this spoken tutorial on''' Introduction to Machine Learning in R'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Learning Objectives'''&lt;br /&gt;
&lt;br /&gt;
|| In this tutorial, we will learn about: &lt;br /&gt;
* Machine Learning&lt;br /&gt;
* Supervised and Unsupervised Learning&lt;br /&gt;
* Workflow of an ML Classifier Algorithm&lt;br /&gt;
* Visualizing Feature Space&lt;br /&gt;
* Constructing a dummy classifier&lt;br /&gt;
* Evaluation of the chosen dummy classifier&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''System Specifications'''&lt;br /&gt;
|| This tutorial is recorded using,&lt;br /&gt;
&lt;br /&gt;
* '''Windows 11 '''&lt;br /&gt;
* '''R '''version''' 4.3.0'''&lt;br /&gt;
* '''RStudio''' version '''2023.06.1'''&lt;br /&gt;
&lt;br /&gt;
It is recommended to install '''R''' version '''4.2.0''' or higher.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Prerequisites '''&lt;br /&gt;
&lt;br /&gt;
'''https://spoken-tutorial.org'''&lt;br /&gt;
|| To follow this tutorial, the learner should know&lt;br /&gt;
* Basic programming in '''R'''.&lt;br /&gt;
* How to use the ggplot2 and dplyr packages.&lt;br /&gt;
&lt;br /&gt;
If not, please access the relevant tutorials on this website.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Machine Learning'''&lt;br /&gt;
&lt;br /&gt;
'''   '''&lt;br /&gt;
&lt;br /&gt;
|| About machine learning:&lt;br /&gt;
&lt;br /&gt;
* ML enables computers to learn from data.&lt;br /&gt;
* ML algorithms automate the learning process by finding patterns in data.&lt;br /&gt;
* Their primary role is prediction, classification or clustering of data.&lt;br /&gt;
* ML algorithms are used in many applications.&lt;br /&gt;
* For example, natural language processing, image and speech recognition, etc.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Types of Machine Learning''' &lt;br /&gt;
|| ML algorithms include the following types and tasks: &lt;br /&gt;
* '''Supervised '''learning: Prediction and Classification''',''' &lt;br /&gt;
* '''Unsupervised '''learning''': '''Clustering''','''&lt;br /&gt;
* '''Semi-supervised '''learning&lt;br /&gt;
* '''Reinforcement '''learning'''.'''&lt;br /&gt;
&lt;br /&gt;
In this series, we will focus on '''Supervised''' and '''Unsupervised''' learning algorithms. &lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Supervised and Unsupervised Learning'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''   '''&lt;br /&gt;
|| Supervised learning: Labeled data &lt;br /&gt;
* ML algorithms learn from the given features and their labels. &lt;br /&gt;
* They then predict labels for unseen features.&lt;br /&gt;
&lt;br /&gt;
Unsupervised learning: Unlabeled data&lt;br /&gt;
* ML algorithms develop a mechanism to group similar features into clusters.&lt;br /&gt;
* The clusters are then labeled for future analysis.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slides'''&lt;br /&gt;
&lt;br /&gt;
'''Classification and Regression'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
* Supervised learning consists of Regression and Classification.&lt;br /&gt;
* '''Regression''' is applied to predict and learn continuous-valued responses from features. &lt;br /&gt;
* Regression techniques include Linear, Spline, Ridge, Lasso, and others.&lt;br /&gt;
* '''Classification''' is applied to predict the class of a discrete (labeled) response from features. &lt;br /&gt;
* Classification techniques include Logistic Regression, Decision Tree, SVM, and others.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slides'''&lt;br /&gt;
&lt;br /&gt;
'''Workflow of an ML Classifier algorithm'''&lt;br /&gt;
|| The workflow of an ML Classifier algorithm includes:&lt;br /&gt;
* Feature Space: Collection of all possible values of the features.&lt;br /&gt;
* A classification algorithm partitions the feature space into a number of classes.&lt;br /&gt;
* Data is split into training and testing sets to learn and evaluate the algorithm.&lt;br /&gt;
* The model learns from the training data to create partitions of feature space.&lt;br /&gt;
* The model is evaluated on the test dataset through performance metrics.&lt;br /&gt;
&lt;br /&gt;
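The workflow above can be sketched end to end with toy data and a trivial threshold rule (all names and data here are made up for illustration, not the tutorial's raisin example):

```r
# Toy end-to-end classifier workflow: split, "learn" a partition,
# predict on held-out data, evaluate. All names and data are made up.
set.seed(1)
d = data.frame(x = runif(100))
d$y = factor(ifelse(d$x > 0.5, "A", "B"))   # labels follow a known rule

idx   = sample(1:nrow(d), size = 0.7 * nrow(d))   # split: 70 train, 30 test
train = d[idx, ]
test  = d[-idx, ]

rule = function(x) factor(ifelse(x > 0.5, "A", "B"), levels = levels(d$y))
pred = rule(test$x)                # predict on the held-out features
mean(pred == test$y)               # evaluate: accuracy, here 1 by design
```

Because the labels were generated by the same rule, accuracy is 1 here; with real data the rule must be learned from the training set.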
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Dataset'''&lt;br /&gt;
&lt;br /&gt;
|| Let’s use the '''Raisin dataset''' with two chosen variables, or features, to understand a classification problem.&lt;br /&gt;
&lt;br /&gt;
For more information on Raisin data please refer to Additional Reading Material on this tutorial page.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide '''&lt;br /&gt;
&lt;br /&gt;
'''Download Files '''&lt;br /&gt;
|| We will use the script file '''Intro.R''' and the Raisin dataset '''Raisin.xlsx'''.&lt;br /&gt;
&lt;br /&gt;
Please download these files from the''' Code files''' link of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Make a copy and then use them while practicing.&lt;br /&gt;
|- &lt;br /&gt;
|| [Computer screen]&lt;br /&gt;
&lt;br /&gt;
point to '''Intro.R''' and the folder '''Introduction.'''&lt;br /&gt;
&lt;br /&gt;
Point to the''' MLProject folder '''on the '''Desktop.'''&lt;br /&gt;
&lt;br /&gt;
|| I have downloaded and moved these files to the '''Introduction '''folder. &lt;br /&gt;
&lt;br /&gt;
This folder is located in the '''MLProject''' folder on my '''Desktop'''.&lt;br /&gt;
&lt;br /&gt;
I have also set the '''Introduction''' folder as my working Directory.&lt;br /&gt;
&lt;br /&gt;
In this tutorial, we will introduce classification on the '''raisin''' dataset. &lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us switch to '''RStudio'''. &lt;br /&gt;
|- &lt;br /&gt;
|| Click Intro.R in RStudio&lt;br /&gt;
&lt;br /&gt;
Point to Intro.R in RStudio.&lt;br /&gt;
|| Let us open the script '''Intro.R''' in '''RStudio'''.&lt;br /&gt;
&lt;br /&gt;
Script '''Intro.R''' opens in '''RStudio'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' library(readxl)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
'''&amp;lt;nowiki&amp;gt;#install.packages(“package_name”)&amp;lt;/nowiki&amp;gt;'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Point to the command.'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select and run these commands to import the packages.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''readxl''' package to load the excel file of our '''Raisin Dataset'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''caret''' package to create the '''confusion matrix.'''&lt;br /&gt;
&lt;br /&gt;
The '''ggplot2''' package will be used to create the '''decision boundary plot.'''&lt;br /&gt;
&lt;br /&gt;
Please ensure that all the packages are installed correctly.&lt;br /&gt;
&lt;br /&gt;
As I have already installed the packages, I have imported them directly. &lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' '''&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;- read_xlsx(&amp;quot;Raisin.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
|| Run this command to load the '''Raisin '''dataset.&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the '''Environment''' tab clearly.&lt;br /&gt;
&lt;br /&gt;
In the Environment tab below Data, you will see the '''data '''variable.&lt;br /&gt;
&lt;br /&gt;
Click on '''data '''to load the dataset in the Source window. &lt;br /&gt;
&lt;br /&gt;
Click on '''Intro.R''' in the Source window and close the tab.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command.&lt;br /&gt;
&lt;br /&gt;
'''data &amp;lt;- data[c(&amp;quot;minorAL&amp;quot;, &amp;quot;ecc&amp;quot;, &amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
Select the commands and click the Run button&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We now select three columns from data.&lt;br /&gt;
&lt;br /&gt;
2 columns (&amp;quot;minorAL&amp;quot;, &amp;quot;ecc&amp;quot;) are chosen as features.&lt;br /&gt;
&lt;br /&gt;
The class column is chosen as a target variable.&lt;br /&gt;
&lt;br /&gt;
We convert the target variable '''data$class '''to a factor. &lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|- &lt;br /&gt;
|| Click on the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click on '''data.'''&lt;br /&gt;
|| Click on '''data '''to load the modified data in the Source window.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| We will now understand the feature space of this data.&lt;br /&gt;
|- &lt;br /&gt;
|| '''range_minor_al &amp;lt;- range(data$minorAL)'''&lt;br /&gt;
&lt;br /&gt;
'''range_ecc &amp;lt;- range(data$ecc)'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''range_minor_al &amp;lt;- range(data$minorAL)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''range_ecc &amp;lt;- range(data$ecc)'''&lt;br /&gt;
|| These commands show the range of the feature variables '''minorAL''' and''' ecc.'''&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the environment tab clearly.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The minimum and maximum values of '''minorAL''' and '''ecc''' are stored in their respective range variables.&lt;br /&gt;
|- &lt;br /&gt;
|| '''X &amp;lt;- seq(min(data$minorAL), max(data$minorAL), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''Y &amp;lt;- seq(min(data$ecc), max(data$ecc), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''feature &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
&lt;br /&gt;
|| We will now use the range to generate grid points to construct the feature space.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''X &amp;lt;- seq(min(data$minorAL), max(data$minorAL), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''Y &amp;lt;- seq(min(data$ecc), max(data$ecc), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
|| This command generates a sequence of points spanning the range of '''minorAL '''and''' ecc'''.&lt;br /&gt;
&lt;br /&gt;
This command creates a Cartesian product of the two features to form the feature space.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|-&lt;br /&gt;
|| '''ggplot(data = data, aes(x = minorAL, y = ecc)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(aes(color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Feature Space&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| We will now plot the feature space created&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| '''ggplot(data = data, aes(x = minorAL, y = ecc)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(aes(color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Feature Space&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
|| These commands plot the data points in the feature space.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|-&lt;br /&gt;
|| Drag boundaries.&lt;br /&gt;
|| Drag boundaries to see the plot window clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the data.&lt;br /&gt;
|| Now let us split our data into training and testing data.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Click on '''Intro.R''' in the Source window, and type these commands.&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| This creates training data, consisting of 630 unique rows.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This creates testing data, consisting of 270 unique rows.&lt;br /&gt;
|-&lt;br /&gt;
|| Select the commands and click the Run button.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to the sets in the Environment Tab&lt;br /&gt;
&lt;br /&gt;
Click the '''test_data ''' and '''train_data '''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
The data sets are shown in the '''Environment '''tab.&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the Environment tab clearly&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Click on '''test_data ''' and '''train_data ''' to load them in the Source window.&lt;br /&gt;
|-&lt;br /&gt;
|| &lt;br /&gt;
|| Here we try to partition the '''feature space''' to construct the classifier.&lt;br /&gt;
&lt;br /&gt;
To begin with, one might construct a '''heuristic '''line to build the classifier.&lt;br /&gt;
|- &lt;br /&gt;
|| [Rstudio]&lt;br /&gt;
&lt;br /&gt;
'''fit = function(x)((x * (-0.0021)) + 1.445)'''&lt;br /&gt;
&lt;br /&gt;
'''model_predict &amp;lt;- function(x){'''&lt;br /&gt;
&lt;br /&gt;
'''factor(ifelse(x$ecc &amp;lt; fit(x$minorAL), &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window, type these commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''fit = function(x)((x * (-0.0021)) + 1.445)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''model_predict &amp;lt;- function(x){'''&lt;br /&gt;
&lt;br /&gt;
'''factor(ifelse(x$ecc &amp;lt; fit(x$minorAL), &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
Click Save and Click Run buttons. &lt;br /&gt;
|| Let us describe the steps of the classification algorithm.&lt;br /&gt;
&lt;br /&gt;
For that we will define a line to partition the data as a dummy classifier.&lt;br /&gt;
&lt;br /&gt;
It doesn’t involve training data so performance may be poor.&lt;br /&gt;
&lt;br /&gt;
We define a function that separates data points belonging to either side of the line.&lt;br /&gt;
&lt;br /&gt;
Click Save.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''feature$class &amp;lt;- model_predict(feature)'''&lt;br /&gt;
&lt;br /&gt;
'''feature$classnum &amp;lt;- as.numeric(feature$class)'''&lt;br /&gt;
&lt;br /&gt;
|| Let’s use the line to classify the feature space and draw the decision boundary.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature$class &amp;lt;- model_predict(feature)'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature$classnum &amp;lt;- as.numeric(feature$class)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
This command will use the line created to predict the class of every point in the grid of feature space.&lt;br /&gt;
&lt;br /&gt;
This command encodes the class string labels into numbers suitable for plotting&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Click on '''feature''' in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Point to the data in the Source window.&lt;br /&gt;
|| Drag boundary to see the Environment window.&lt;br /&gt;
&lt;br /&gt;
Click on '''feature '''in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
The '''feature set '''with the predicted classes loads in the source window.&lt;br /&gt;
|- &lt;br /&gt;
|| '''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data= feature, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_abline(slope = -0.0021, intercept = 1.445, size = 1.2)+'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Data Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data= feature, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_abline(slope = -0.0021, intercept = 1.445, size = 1.2)+'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Data Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We are visualising the feature space and the partition line using GGPlot2. &lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the plot window.&lt;br /&gt;
|| Drag boundary to see the plot window clearly.&lt;br /&gt;
&lt;br /&gt;
The overall plot shows that the chosen line approximately separates the training data classes.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
'''prediction_test = model_predict(test_data)'''&lt;br /&gt;
|| Let us see how well the partition performs on the testing dataset.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''prediction_test = model_predict(test_data)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We predict the classes for the testing data and store them in the '''prediction_test''' variable.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us now measure the performance of the classification.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix &amp;lt;- confusionMatrix(test_data$class,prediction_test)'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window, type the command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix &amp;lt;- confusionMatrix(test_data$class,prediction_test)'''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| We use the '''confusionMatrix''' function from the '''caret''' package to calculate performance metrics.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
|- &lt;br /&gt;
|| '''test_confusion_matrix$overall[&amp;quot;Accuracy&amp;quot;]'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window, type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$overall[&amp;quot;Accuracy&amp;quot;]'''&lt;br /&gt;
|| This fetches the accuracy metric from the list we created.&lt;br /&gt;
&lt;br /&gt;
Select and run the command&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Drag boundary to see the console window clearly&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''Accuracy'''&lt;br /&gt;
&lt;br /&gt;
0.6962963&lt;br /&gt;
&lt;br /&gt;
|| The accuracy on the testing dataset is approximately 69%.&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the source window clearly&lt;br /&gt;
&lt;br /&gt;
|| Drag boundary to see the source window clearly&lt;br /&gt;
&lt;br /&gt;
Let us now view the confusion matrix of the testing dataset&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| Select and run the command.&lt;br /&gt;
&lt;br /&gt;
The output is seen in the '''console''' window&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Point the output in the '''console window'''&lt;br /&gt;
&lt;br /&gt;
Reference&lt;br /&gt;
&lt;br /&gt;
Prediction Besni Kecimen&lt;br /&gt;
&lt;br /&gt;
Besni 50 82&lt;br /&gt;
&lt;br /&gt;
Kecimen 0 138&lt;br /&gt;
&lt;br /&gt;
|| Drag boundary to see the console window clearly &lt;br /&gt;
&lt;br /&gt;
Observe that: &lt;br /&gt;
&lt;br /&gt;
0 samples of class Besni have been incorrectly classified.&lt;br /&gt;
&lt;br /&gt;
82 samples of class Kecimen have been incorrectly classified. &lt;br /&gt;
&lt;br /&gt;
We can see that our partition line is skewed.&lt;br /&gt;
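The observations above can be checked by recomputing the metrics directly from the printed counts. This is a minimal R sketch using only the numbers shown in the console output (following caret's layout, rows are the predicted class and columns the reference class):

```r
# Recompute the metrics from the confusion-matrix counts shown above.
cm <- matrix(c(50, 0,       # reference Besni:   predicted Besni / Kecimen
               82, 138),    # reference Kecimen: predicted Besni / Kecimen
             nrow = 2,
             dimnames = list(Prediction = c("Besni", "Kecimen"),
                             Reference  = c("Besni", "Kecimen")))

# Accuracy = correctly classified / total = (50 + 138) / 270
accuracy <- sum(diag(cm)) / sum(cm)
print(round(accuracy, 7))   # 0.6962963

# Misclassified samples per true class: Besni 0, Kecimen 82
misclassified <- colSums(cm) - diag(cm)
print(misclassified)
```

This reproduces both the 69% accuracy and the skew noted above: every error comes from Kecimen samples being pushed to the Besni side of the line.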
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| For the same problem many partitions can be drawn.&lt;br /&gt;
&lt;br /&gt;
We can choose a complicated partition to reduce train misclassification error.&lt;br /&gt;
&lt;br /&gt;
But then we have no control over the error on the test data.&lt;br /&gt;
&lt;br /&gt;
We can aim to choose a classifier which is simple with a smaller test misclassification error.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| With this, we come to the end of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Let us summarize.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Summary&lt;br /&gt;
|| In this tutorial we have learned about:&lt;br /&gt;
* Machine Learning&lt;br /&gt;
* Classification and Regression Problems&lt;br /&gt;
* Workflow of an ML Classifier Algorithm&lt;br /&gt;
* Visualizing Feature Space&lt;br /&gt;
* Constructing a dummy classifier&lt;br /&gt;
* Evaluation of an ML algorithm&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Here is an assignment for you.&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Assignment&lt;br /&gt;
|| &lt;br /&gt;
* Use a vertical line as a classifier to partition the feature space.&lt;br /&gt;
* Plot the decision boundary for the same.&lt;br /&gt;
* Evaluate the classifier on the test dataset&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
About the Spoken Tutorial Project&lt;br /&gt;
|| The video at the following link summarizes the Spoken Tutorial project. &lt;br /&gt;
&lt;br /&gt;
Please download and watch it.&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Workshops&lt;br /&gt;
|| We conduct workshops using Spoken Tutorials and give certificates.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please contact us.&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Forum to answer questions&lt;br /&gt;
&lt;br /&gt;
Do you have questions in THIS Spoken Tutorial?&lt;br /&gt;
&lt;br /&gt;
Choose the minute and second where you have the question.&lt;br /&gt;
&lt;br /&gt;
Explain your question briefly.&lt;br /&gt;
&lt;br /&gt;
Someone from our team will answer them.&lt;br /&gt;
&lt;br /&gt;
Please visit this site.&lt;br /&gt;
|| Please post your timed queries in this forum.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Forum to answer questions&lt;br /&gt;
|| Do you have any general/technical questions?&lt;br /&gt;
&lt;br /&gt;
Please visit the forum given in the link.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
R Activities&lt;br /&gt;
&lt;br /&gt;
|| The FOSSEE team coordinates the Textbook Companion, Lab Migration and the Case Study Projects.&lt;br /&gt;
&lt;br /&gt;
We give certificates to those who do this.&lt;br /&gt;
&lt;br /&gt;
For more details, please visit the website.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Acknowledgment&lt;br /&gt;
|| The '''Spoken Tutorial''' project was established by the Ministry of Education, Government of India.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Thank You&lt;br /&gt;
|| This tutorial is contributed by Debatosh Chakraborty from IIT Bombay.&lt;br /&gt;
&lt;br /&gt;
Thank you for joining.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Ushav</name></author>	</entry>

	<entry>
		<id>https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English</id>
		<title>Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English</title>
		<link rel="alternate" type="text/html" href="https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English"/>
				<updated>2024-06-04T08:53:16Z</updated>
		
		<summary type="html">&lt;p&gt;Ushav: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title of the script''': Introduction to Machine Learning in R&lt;br /&gt;
&lt;br /&gt;
'''Author''': Debatosh Chakraborty&lt;br /&gt;
&lt;br /&gt;
'''Keywords''': R, RStudio, machine learning, supervised, unsupervised, video tutorial.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=1&lt;br /&gt;
|- &lt;br /&gt;
| align=center| '''Visual Cue'''&lt;br /&gt;
| align=center| '''Narration'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Opening Slide'''&lt;br /&gt;
|| Welcome to this spoken tutorial on''' Introduction to Machine Learning in R'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Learning Objectives'''&lt;br /&gt;
&lt;br /&gt;
|| In this tutorial, we will learn about: &lt;br /&gt;
* Machine Learning&lt;br /&gt;
* Supervised and Unsupervised Learning&lt;br /&gt;
* Workflow of an ML Classifier Algorithm&lt;br /&gt;
* Visualizing Feature Space&lt;br /&gt;
* Constructing a dummy classifier&lt;br /&gt;
* Evaluation of the chosen dummy classifier&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''System Specifications'''&lt;br /&gt;
|| This tutorial is recorded using,&lt;br /&gt;
&lt;br /&gt;
* '''Windows 11 '''&lt;br /&gt;
* '''R '''version''' 4.3.0'''&lt;br /&gt;
* '''RStudio''' version '''2023.06.1'''&lt;br /&gt;
&lt;br /&gt;
It is recommended to install '''R''' version '''4.2.0''' or higher.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Prerequisites '''&lt;br /&gt;
&lt;br /&gt;
'''https://spoken-tutorial.org'''&lt;br /&gt;
|| To follow this tutorial, the learner should know&lt;br /&gt;
* Basic programming in '''R'''.&lt;br /&gt;
* To use GGPlot2 and dplyr package.&lt;br /&gt;
&lt;br /&gt;
If not, please access the relevant tutorials on this website.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Machine Learning'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| About machine learning&lt;br /&gt;
&lt;br /&gt;
* ML enables computers to learn from data.&lt;br /&gt;
* ML algorithms automate the learning process from data through patterns.&lt;br /&gt;
* Their primary role is prediction, classification or clustering of data.&lt;br /&gt;
* ML algorithms are applied in several applications.&lt;br /&gt;
* For example Natural Language Processing, Image and speech recognition, etc.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Types of Machine Learning''' &lt;br /&gt;
|| ML algorithms include the following types and tasks: &lt;br /&gt;
* '''Supervised '''learning: Prediction and Classification''',''' &lt;br /&gt;
* '''Unsupervised '''learning''': '''Clustering''','''&lt;br /&gt;
* '''Semi-supervised '''learning&lt;br /&gt;
* '''Reinforcement '''learning'''.'''&lt;br /&gt;
&lt;br /&gt;
In this series, we will focus on '''Supervised''' and '''Unsupervised''' learning algorithms. &lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Supervised and Unsupervised Learning'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| Supervised learning: Labeled data &lt;br /&gt;
* ML algorithms predict labels for unseen features &lt;br /&gt;
* They predict based on given features and labels of data.&lt;br /&gt;
&lt;br /&gt;
Unsupervised learning: Unlabeled data&lt;br /&gt;
* ML algorithms develop a mechanism to group similar features into clusters.&lt;br /&gt;
* And label them for future analysis.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slides'''&lt;br /&gt;
&lt;br /&gt;
'''Classification and Regression'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
* Supervised learning consists of Regression and Classification.&lt;br /&gt;
* '''Regression''' is applied to predict and learn continuous-valued responses from features. &lt;br /&gt;
* Regression techniques include Linear, Spline, Ridge, Lasso, and others.&lt;br /&gt;
* '''Classification''' is applied to predict the class of a discrete (labeled) response from features. &lt;br /&gt;
* Classification techniques include Logistic Regression, Decision Tree, SVM, and others.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slides'''&lt;br /&gt;
&lt;br /&gt;
'''Workflow of an ML Classifier algorithm'''&lt;br /&gt;
|| The workflow of an ML Classifier algorithm includes:&lt;br /&gt;
* Feature Space: Collection of all possible values of the features.&lt;br /&gt;
* A classification algorithm partitions the feature space into a number of classes.&lt;br /&gt;
* Data is split into training and testing sets to learn and evaluate the algorithm.&lt;br /&gt;
* The model learns from the training data to create partitions of feature space.&lt;br /&gt;
* The model is evaluated on the test dataset through performance metrics.&lt;br /&gt;
&lt;br /&gt;
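The workflow bullets above can be sketched end to end in a few lines of base R. This is an illustrative example only, using the built-in iris dataset and a deliberately weak majority-class dummy classifier rather than the tutorial's Raisin data:

```r
# A minimal sketch of the ML classifier workflow, using built-in iris data.
# (Illustrative only; the tutorial itself uses the Raisin dataset.)
set.seed(1)

# Split the data into 70% training and 30% testing sets.
idx   <- sample(seq_len(nrow(iris)), size = 0.7 * nrow(iris))
train <- iris[idx, ]
test  <- iris[-idx, ]

# A dummy classifier: always predict the most frequent training class.
majority <- names(which.max(table(train$Species)))
pred     <- factor(rep(majority, nrow(test)), levels = levels(iris$Species))

# Evaluate on the held-out test set with a simple accuracy metric.
accuracy <- mean(pred == test$Species)
print(accuracy)
```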
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Dataset'''&lt;br /&gt;
&lt;br /&gt;
|| Let’s use '''Raisin dataset '''with two chosen variables or features to understand a classification problem.&lt;br /&gt;
&lt;br /&gt;
For more information on Raisin data please refer to Additional Reading Material on this tutorial page.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide '''&lt;br /&gt;
&lt;br /&gt;
'''Download Files '''&lt;br /&gt;
|| We will use the script file '''Intro.R''' and the Raisin dataset '''Raisin.xlsx'''.&lt;br /&gt;
&lt;br /&gt;
Please download these files from the''' Code files''' link of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Make a copy and then use them while practicing.&lt;br /&gt;
|- &lt;br /&gt;
|| [Computer screen]&lt;br /&gt;
&lt;br /&gt;
point to '''Intro.R''' and the folder '''Introduction.'''&lt;br /&gt;
&lt;br /&gt;
Point to the''' MLProject folder '''on the '''Desktop.'''&lt;br /&gt;
&lt;br /&gt;
|| I have downloaded and moved these files to the '''Introduction '''folder. &lt;br /&gt;
&lt;br /&gt;
This folder is located in the '''MLProject''' folder on my '''Desktop'''.&lt;br /&gt;
&lt;br /&gt;
I have also set the '''Introduction''' folder as my working directory.&lt;br /&gt;
&lt;br /&gt;
In this tutorial, we will introduce classification on the '''raisin''' dataset. &lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us switch to '''RStudio'''. &lt;br /&gt;
|- &lt;br /&gt;
|| Click Intro.R in RStudio&lt;br /&gt;
&lt;br /&gt;
Point to Intro.R in RStudio.&lt;br /&gt;
|| Let us open the script '''Intro.R''' in '''RStudio'''.&lt;br /&gt;
&lt;br /&gt;
Script '''Intro.R''' opens in '''RStudio'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' library(readxl)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
'''&amp;lt;nowiki&amp;gt;#install.packages(“package_name”)&amp;lt;/nowiki&amp;gt;'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Point to the command.'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select and run these commands to import the packages.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''readxl''' package to load the Excel file of our '''Raisin Dataset'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''caret''' package to create the '''confusion matrix.'''&lt;br /&gt;
&lt;br /&gt;
The '''ggplot2''' package will be used to create the '''decision boundary plot.'''&lt;br /&gt;
&lt;br /&gt;
Please ensure that all the packages are installed correctly.&lt;br /&gt;
&lt;br /&gt;
As I have already installed the packages, I have imported them directly. &lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' '''&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;- read_xlsx(&amp;quot;Raisin.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
|| Run this command to load the '''Raisin '''dataset.&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the '''Environment''' tab clearly.&lt;br /&gt;
&lt;br /&gt;
In the Environment tab below Data, you will see the '''data '''variable.&lt;br /&gt;
&lt;br /&gt;
Click on '''data '''to load the dataset in the Source window. &lt;br /&gt;
&lt;br /&gt;
Click on '''Intro.R''' in the Source window and close the tab.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command.&lt;br /&gt;
&lt;br /&gt;
'''data &amp;lt;- data[c(&amp;quot;minorAL&amp;quot;, &amp;quot;ecc&amp;quot;, &amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
Select the commands and click the Run button&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We now select three columns from data.&lt;br /&gt;
&lt;br /&gt;
2 columns (&amp;quot;minorAL&amp;quot;, &amp;quot;ecc&amp;quot;) are chosen as features.&lt;br /&gt;
&lt;br /&gt;
The class column is chosen as a target variable.&lt;br /&gt;
&lt;br /&gt;
We convert the target variable '''data$class '''to a factor. &lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|- &lt;br /&gt;
|| Click on the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click on '''data.'''&lt;br /&gt;
|| Click on '''data '''to load the modified data in the Source window.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| We will now understand the feature space of this data.&lt;br /&gt;
|- &lt;br /&gt;
|| '''range_minor_al &amp;lt;- range(data$minorAL)'''&lt;br /&gt;
&lt;br /&gt;
'''range_ecc &amp;lt;- range(data$ecc)'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''range_minor_al &amp;lt;- range(data$minorAL)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''range_ecc &amp;lt;- range(data$ecc)'''&lt;br /&gt;
|| These commands show the range of the feature variables '''minorAL''' and''' ecc.'''&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the environment tab clearly.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The minimum and maximum values of '''minorAL''' and '''ecc''' are shown in their range variables.&lt;br /&gt;
|- &lt;br /&gt;
|| '''X &amp;lt;- seq(min(data$minorAL), max(data$minorAL), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''Y &amp;lt;- seq(min(data$ecc), max(data$ecc), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''feature &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
&lt;br /&gt;
|| We will now use the range to generate grid points to construct the feature space.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''X &amp;lt;- seq(min(data$minorAL), max(data$minorAL), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''Y &amp;lt;- seq(min(data$ecc), max(data$ecc), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
|| This command generates a sequence of points spanning the range of '''minorAL '''and''' ecc'''.&lt;br /&gt;
&lt;br /&gt;
This command creates a cartesian product of the two features to create a feature space.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
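The grid construction can also be checked on its own. A small self-contained sketch with hypothetical ranges shows that '''expand.grid''' of two length-100 sequences yields 100 x 100 = 10,000 grid points:

```r
# A small self-contained check of the grid construction.
# (The ranges here are hypothetical; the tutorial uses range(data$minorAL)
# and range(data$ecc) instead.)
X <- seq(0, 1, length.out = 100)
Y <- seq(0, 1, length.out = 100)

# expand.grid forms the cartesian product of the two sequences.
grid <- expand.grid(minorAL = X, ecc = Y)

nrow(grid)   # 100 * 100 = 10000 grid points
head(grid)   # the first factor (minorAL) varies fastest
```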
|-&lt;br /&gt;
|| '''ggplot(data = data, aes(x = minorAL, y = ecc)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(aes(color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Feature Space&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| We will now plot the feature space created&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| '''ggplot(data = data, aes(x = minorAL, y = ecc)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(aes(color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Feature Space&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
|| These commands plot the data points in the feature space.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|-&lt;br /&gt;
|| Drag boundaries.&lt;br /&gt;
|| Drag boundaries to see the plot window clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the data.&lt;br /&gt;
|| Now let us split our data into training and testing data.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Click on '''Intro.R''' in the Source window, and type these commands.&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| This creates training data, consisting of 630 unique rows.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This creates testing data, consisting of 270 unique rows.&lt;br /&gt;
|-&lt;br /&gt;
|| Select the commands and click the Run button.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to the sets in the Environment Tab&lt;br /&gt;
&lt;br /&gt;
Click the '''test_data ''' and '''train_data '''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
The data sets are shown in the '''Environment '''tab.&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the Environment tab clearly&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Click on '''train_data '''and '''test_data '''to load them in the Source window.&lt;br /&gt;
|-&lt;br /&gt;
|| &lt;br /&gt;
|| Here we try to partition the '''feature space''' to construct the classifier.&lt;br /&gt;
&lt;br /&gt;
To begin with, one might construct a '''heuristic '''line to build the classifier.&lt;br /&gt;
|- &lt;br /&gt;
|| [Rstudio]&lt;br /&gt;
&lt;br /&gt;
'''fit = function(x)((x * (-0.0021)) + 1.445)'''&lt;br /&gt;
&lt;br /&gt;
'''model_predict &amp;lt;- function(x){'''&lt;br /&gt;
&lt;br /&gt;
'''factor(ifelse(x$ecc &amp;lt; fit(x$minorAL), &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window, type these commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''fit = function(x)((x * (-0.0021)) + 1.445)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''model_predict &amp;lt;- function(x){'''&lt;br /&gt;
&lt;br /&gt;
'''factor(ifelse(x$ecc &amp;lt; fit(x$minorAL), &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
Click Save and Click Run buttons. &lt;br /&gt;
|| Let us describe the steps of the classification algorithm.&lt;br /&gt;
&lt;br /&gt;
For that we will define a line to partition the data as a dummy classifier.&lt;br /&gt;
&lt;br /&gt;
It doesn’t involve training data so performance may be poor.&lt;br /&gt;
&lt;br /&gt;
We define a function that separates data points belonging to either side of the line.&lt;br /&gt;
&lt;br /&gt;
Click Save.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''feature$class &amp;lt;- model_predict(feature)'''&lt;br /&gt;
&lt;br /&gt;
'''feature$classnum &amp;lt;- as.numeric(feature$class)'''&lt;br /&gt;
&lt;br /&gt;
|| Let’s use the line to classify the feature space and draw the decision boundary.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature$class &amp;lt;- model_predict(feature)'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature$classnum &amp;lt;- as.numeric(feature$class)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
This command will use the line created to predict the class of every point in the grid of feature space.&lt;br /&gt;
&lt;br /&gt;
This command encodes the class string labels into numbers suitable for plotting&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Click on '''feature''' in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Point to the data in the Source window.&lt;br /&gt;
|| Drag boundary to see the Environment window.&lt;br /&gt;
&lt;br /&gt;
Click on '''feature '''in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
The '''feature set '''with the predicted classes loads in the source window.&lt;br /&gt;
|- &lt;br /&gt;
|| '''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data= feature, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_abline(slope = -0.0021, intercept = 1.445, size = 1.2)+'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Data Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data= feature, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_abline(slope = -0.0021, intercept = 1.445, size = 1.2)+'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Data Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We are visualising the feature space and the partition line using GGPlot2. &lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the plot window.&lt;br /&gt;
|| Drag boundary to see the plot window clearly.&lt;br /&gt;
&lt;br /&gt;
The overall plot shows that the chosen line approximately separates the training data classes.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
'''prediction_test = model_predict(test_data)'''&lt;br /&gt;
|| Let us see how well the partition performs on the testing dataset.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''prediction_test = model_predict(test_data)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We predict the classes for the testing data and store them in the '''prediction_test''' variable.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us now measure the performance of the classification.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix &amp;lt;- confusionMatrix(test_data$class,prediction_test)'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window, type the command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix &amp;lt;- confusionMatrix(test_data$class,prediction_test)'''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| We use the '''confusionMatrix''' function from the '''caret''' package to calculate performance metrics.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
|- &lt;br /&gt;
|| '''test_confusion_matrix$overall[&amp;quot;Accuracy&amp;quot;]'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window, type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$overall[&amp;quot;Accuracy&amp;quot;]'''&lt;br /&gt;
|| This fetches the accuracy metric from the confusion matrix object created earlier.&lt;br /&gt;
&lt;br /&gt;
Select and run the command&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Drag boundary to see the console window clearly&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''Accuracy'''&lt;br /&gt;
&lt;br /&gt;
0.6962963&lt;br /&gt;
&lt;br /&gt;
|| The accuracy on the testing dataset is approximately 69%.&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the source window clearly&lt;br /&gt;
&lt;br /&gt;
|| Drag boundary to see the source window clearly&lt;br /&gt;
&lt;br /&gt;
Let us now view the confusion matrix of the testing dataset&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| Select and run the command.&lt;br /&gt;
&lt;br /&gt;
The output is seen in the '''console''' window&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the output in the '''console''' window&lt;br /&gt;
&lt;br /&gt;
Reference&lt;br /&gt;
&lt;br /&gt;
Prediction Besni Kecimen&lt;br /&gt;
&lt;br /&gt;
Besni 50 82&lt;br /&gt;
&lt;br /&gt;
Kecimen 0 138&lt;br /&gt;
&lt;br /&gt;
|| Drag boundary to see the console window clearly &lt;br /&gt;
&lt;br /&gt;
Observe that: &lt;br /&gt;
&lt;br /&gt;
0 samples of class Besni have been incorrectly classified.&lt;br /&gt;
&lt;br /&gt;
82 samples of class Kecimen have been incorrectly classified. &lt;br /&gt;
&lt;br /&gt;
We can see that our partition line is skewed.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| For the same problem, many different partitions can be drawn.&lt;br /&gt;
&lt;br /&gt;
We can choose a complicated partition to reduce the training misclassification error.&lt;br /&gt;
&lt;br /&gt;
But this gives no control over the error on the test data.&lt;br /&gt;
&lt;br /&gt;
Instead, we should aim for a classifier that is simple and has a smaller test misclassification error.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| With this, we come to the end of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Let us summarize.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Summary&lt;br /&gt;
|| In this tutorial we have learned about:&lt;br /&gt;
* Machine Learning&lt;br /&gt;
* Classification and Regression Problems&lt;br /&gt;
* Workflow of an ML Classifier Algorithm&lt;br /&gt;
* Visualizing Feature Space&lt;br /&gt;
* Constructing a dummy classifier&lt;br /&gt;
* Evaluation of an ML algorithm&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Here is an assignment for you.&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Assignment&lt;br /&gt;
|| &lt;br /&gt;
*Use a vertical line as a classifier to partition the feature space.&lt;br /&gt;
* Plot the decision boundary for the same.&lt;br /&gt;
* Evaluate the classifier on the test dataset&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
About the Spoken Tutorial Project&lt;br /&gt;
|| The video at the following link summarizes the Spoken Tutorial project. &lt;br /&gt;
&lt;br /&gt;
Please download and watch it.&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Workshops&lt;br /&gt;
|| We conduct workshops using Spoken Tutorials and give certificates.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please contact us.&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Forum to answer questions&lt;br /&gt;
&lt;br /&gt;
Do you have questions in THIS Spoken Tutorial?&lt;br /&gt;
&lt;br /&gt;
Choose the minute and second where you have the question.&lt;br /&gt;
&lt;br /&gt;
Explain your question briefly.&lt;br /&gt;
&lt;br /&gt;
Someone from our team will answer them.&lt;br /&gt;
&lt;br /&gt;
Please visit this site.&lt;br /&gt;
|| Please post your timed queries in this forum.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Forum to answer questions&lt;br /&gt;
|| Do you have any general/technical questions?&lt;br /&gt;
&lt;br /&gt;
Please visit the forum given in the link.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
R Activities&lt;br /&gt;
&lt;br /&gt;
|| The FOSSEE team coordinates the Textbook Companion, Lab Migration and the Case Study Projects.&lt;br /&gt;
&lt;br /&gt;
We give certificates to those who do this.&lt;br /&gt;
&lt;br /&gt;
For more details, please visit the website.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Acknowledgment&lt;br /&gt;
|| The '''Spoken Tutorial''' project was established by the Ministry of Education, Government of India.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Thank You&lt;br /&gt;
|| This tutorial is contributed by Debatosh Chakraborty from IIT Bombay.&lt;br /&gt;
&lt;br /&gt;
Thank you for joining.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Ushav</name></author>	</entry>

	<entry>
		<id>https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English</id>
		<title>Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English</title>
		<link rel="alternate" type="text/html" href="https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English"/>
				<updated>2024-06-04T08:51:08Z</updated>
		
		<summary type="html">&lt;p&gt;Ushav: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title of the script''': Introduction to Machine Learning in R&lt;br /&gt;
&lt;br /&gt;
'''Author''': Debatosh Chakraborty&lt;br /&gt;
&lt;br /&gt;
'''Keywords''': R, RStudio, machine learning, supervised, unsupervised, video tutorial.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=1&lt;br /&gt;
|- &lt;br /&gt;
| align=center| '''Visual Cue'''&lt;br /&gt;
| align=center| '''Narration'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Opening Slide'''&lt;br /&gt;
|| Welcome to this spoken tutorial on''' Introduction to Machine Learning in R'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Learning Objectives'''&lt;br /&gt;
&lt;br /&gt;
|| In this tutorial, we will learn about: &lt;br /&gt;
* Machine Learning&lt;br /&gt;
* Supervised and Unsupervised Learning&lt;br /&gt;
* Workflow of an ML Classifier Algorithm&lt;br /&gt;
* Visualizing Feature Space&lt;br /&gt;
* Constructing a dummy classifier&lt;br /&gt;
* Evaluation of the chosen dummy classifier&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''System Specifications'''&lt;br /&gt;
|| This tutorial is recorded using,&lt;br /&gt;
&lt;br /&gt;
* '''Windows 11 '''&lt;br /&gt;
* '''R '''version''' 4.3.0'''&lt;br /&gt;
* '''RStudio''' version '''2023.06.1'''&lt;br /&gt;
&lt;br /&gt;
It is recommended to install '''R''' version '''4.2.0''' or higher.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Prerequisites '''&lt;br /&gt;
&lt;br /&gt;
'''https://spoken-tutorial.org'''&lt;br /&gt;
|| To follow this tutorial, the learner should know&lt;br /&gt;
* Basic programming in '''R'''.&lt;br /&gt;
* How to use the '''ggplot2''' and '''dplyr''' packages.&lt;br /&gt;
&lt;br /&gt;
If not, please access the relevant tutorials on this website.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Machine Learning'''&lt;br /&gt;
&lt;br /&gt;
'''   '''&lt;br /&gt;
&lt;br /&gt;
|| About machine learning&lt;br /&gt;
&lt;br /&gt;
* ML enables computers to learn from data.&lt;br /&gt;
* ML algorithms automate the learning process from data through patterns.&lt;br /&gt;
* Their primary role is prediction, classification or clustering of data.&lt;br /&gt;
* ML algorithms are applied in many areas.&lt;br /&gt;
* Examples include natural language processing and image and speech recognition.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Types of Machine Learning''' &lt;br /&gt;
|| ML algorithms include the following types and tasks: &lt;br /&gt;
* '''Supervised '''learning: Prediction and Classification''',''' &lt;br /&gt;
* '''Unsupervised '''learning''': '''Clustering''','''&lt;br /&gt;
* '''Semi-supervised '''learning&lt;br /&gt;
* '''Reinforcement '''learning'''.'''&lt;br /&gt;
&lt;br /&gt;
In this series, we will focus on '''Supervised''' and '''Unsupervised''' learning algorithms. &lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Supervised and Unsupervised Learning'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''   '''&lt;br /&gt;
|| Supervised learning: Labeled data &lt;br /&gt;
* ML algorithms predict labels for unseen features &lt;br /&gt;
* They predict based on given features and labels of data.&lt;br /&gt;
&lt;br /&gt;
Unsupervised learning: Unlabeled data&lt;br /&gt;
* ML algorithms develop a mechanism to group similar features into clusters.&lt;br /&gt;
* And label them for future analysis.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slides'''&lt;br /&gt;
&lt;br /&gt;
'''Classification and Regression'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
* Supervised learning consists of Regression and Classification.&lt;br /&gt;
* '''Regression''' is applied to predict and learn continuous-valued responses from features. &lt;br /&gt;
* Regression techniques include Linear, Spline, Ridge, Lasso, and others.&lt;br /&gt;
* '''Classification''' is applied to predict the class of a discrete (labeled) response from features. &lt;br /&gt;
* Classification techniques include Logistic Regression, Decision Tree, SVM, and others.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slides'''&lt;br /&gt;
&lt;br /&gt;
'''Workflow of an ML Classifier algorithm'''&lt;br /&gt;
|| The workflow of an ML Classifier algorithm includes:&lt;br /&gt;
* Feature Space: Collection of all possible values of the features.&lt;br /&gt;
* A classification algorithm partitions the feature space into a number of classes.&lt;br /&gt;
* Data is split into training and testing sets to learn and evaluate the algorithm.&lt;br /&gt;
* The model learns from the training data to create partitions of feature space.&lt;br /&gt;
* The model is evaluated on the test dataset through performance metrics.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Dataset'''&lt;br /&gt;
&lt;br /&gt;
|| Let’s use the '''Raisin dataset''' with two chosen variables or features to understand a classification problem.&lt;br /&gt;
&lt;br /&gt;
For more information on Raisin data please refer to Additional Reading Material on this tutorial page.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide '''&lt;br /&gt;
&lt;br /&gt;
'''Download Files '''&lt;br /&gt;
|| We will use the script file '''Intro.R''' and the '''Raisin dataset''' '''Raisin.xlsx'''&lt;br /&gt;
&lt;br /&gt;
Please download these files from the''' Code files''' link of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Make a copy and then use them while practicing.&lt;br /&gt;
|- &lt;br /&gt;
|| [Computer screen]&lt;br /&gt;
&lt;br /&gt;
point to '''Intro.R''' and the folder '''Introduction.'''&lt;br /&gt;
&lt;br /&gt;
Point to the''' MLProject folder '''on the '''Desktop.'''&lt;br /&gt;
&lt;br /&gt;
|| I have downloaded and moved these files to the '''Introduction '''folder. &lt;br /&gt;
&lt;br /&gt;
This folder is located in the '''MLProject''' folder on my '''Desktop'''.&lt;br /&gt;
&lt;br /&gt;
I have also set the '''Introduction''' folder as my working Directory.&lt;br /&gt;
&lt;br /&gt;
In this tutorial, we will introduce classification on the '''raisin''' dataset. &lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us switch to '''RStudio'''. &lt;br /&gt;
|- &lt;br /&gt;
|| Click Intro.R in RStudio&lt;br /&gt;
&lt;br /&gt;
Point to Intro.R in RStudio.&lt;br /&gt;
|| Let us open the script '''Intro.R''' in '''RStudio'''.&lt;br /&gt;
&lt;br /&gt;
Script '''Intro.R''' opens in '''RStudio'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' library(readxl)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
'''&amp;lt;nowiki&amp;gt;#install.packages(&amp;quot;package_name&amp;quot;)&amp;lt;/nowiki&amp;gt;'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Point to the command.'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select and run these commands to import the packages.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''readxl''' package to load the Excel file of our '''Raisin dataset'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''caret''' package to create the '''confusion matrix.'''&lt;br /&gt;
&lt;br /&gt;
The '''ggplot2''' package will be used to create the '''decision boundary plot.'''&lt;br /&gt;
&lt;br /&gt;
Please ensure that all the packages are installed correctly.&lt;br /&gt;
&lt;br /&gt;
As I have already installed the packages, I have imported them directly. &lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' '''&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;- read_xlsx(&amp;quot;Raisin.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
|| Run this command to load the '''Raisin '''dataset.&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the '''Environment''' tab clearly.&lt;br /&gt;
&lt;br /&gt;
In the Environment tab below Data, you will see the '''data '''variable.&lt;br /&gt;
&lt;br /&gt;
Click on '''data '''to load the dataset in the Source window. &lt;br /&gt;
&lt;br /&gt;
Click on '''Intro.R''' in the Source window and close the tab.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command.&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;-data[c(&amp;quot;minorAL&amp;quot;,&amp;quot;ecc&amp;quot;,&amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
Select the commands and click the Run button&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We now select three columns from data.&lt;br /&gt;
&lt;br /&gt;
2 columns (&amp;quot;minorAL&amp;quot;, &amp;quot;ecc&amp;quot;) are chosen as features.&lt;br /&gt;
&lt;br /&gt;
The class column is chosen as a target variable.&lt;br /&gt;
&lt;br /&gt;
We convert the target variable '''data$class '''to a factor. &lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|- &lt;br /&gt;
|| Click on the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click on '''data.'''&lt;br /&gt;
|| Click on '''data '''to load the modified data in the Source window.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| We will now understand the feature space of this data.&lt;br /&gt;
|- &lt;br /&gt;
|| '''range_minor_al &amp;lt;- range(data$minorAL)'''&lt;br /&gt;
&lt;br /&gt;
'''range_ecc &amp;lt;- range(data$ecc)'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''range_minor_al &amp;lt;- range(data$minorAL)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''range_ecc &amp;lt;- range(data$ecc)'''&lt;br /&gt;
|| These commands compute the range of the feature variables '''minorAL''' and '''ecc'''.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the environment tab clearly.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The minimum and maximum values of '''minorAL''' and '''ecc''' are stored in their range variables.&lt;br /&gt;
|- &lt;br /&gt;
|| '''X &amp;lt;- seq(min(data$minorAL), max(data$minorAL), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''Y &amp;lt;- seq(min(data$ecc), max(data$ecc), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''feature &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
&lt;br /&gt;
|| We will now use the range to generate grid points to construct the feature space.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''X &amp;lt;- seq(min(data$minorAL), max(data$minorAL), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''Y &amp;lt;- seq(min(data$ecc), max(data$ecc), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
|| These commands generate sequences of 100 points spanning the ranges of '''minorAL''' and '''ecc'''.&lt;br /&gt;
&lt;br /&gt;
This command creates a cartesian product of the two features to create a feature space.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|-&lt;br /&gt;
|  | '''ggplot(data = data, aes(x = minorAL, y = ecc)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(aes(color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Feature Space&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| We will now plot the feature space created&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| '''ggplot(data = data, aes(x = minorAL, y = ecc)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(aes(color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Feature Space&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
|| These commands plot the data points in the feature space.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|-&lt;br /&gt;
|  | Drag boundaries.&lt;br /&gt;
|| Drag boundaries to see the plot window clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the data.&lt;br /&gt;
|| Now let us split our data into training and testing data.&lt;br /&gt;
|-&lt;br /&gt;
|  | [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Click on '''Intro.R''' in the Source window, and type these commands.&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  | Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| This creates training data, consisting of 630 unique rows.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This creates testing data, consisting of 270 unique rows.&lt;br /&gt;
|-&lt;br /&gt;
|| Select the commands and click the Run button.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to the sets in the Environment Tab&lt;br /&gt;
&lt;br /&gt;
Click the '''train_data '''and '''test_data '''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
The data sets are shown in the '''Environment '''tab.&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the Environment tab clearly&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Click on '''train_data '''and '''test_data '''to load them in the Source window.&lt;br /&gt;
|-&lt;br /&gt;
|| &lt;br /&gt;
|| Here we try to partition the '''feature space''' to construct the classifier.&lt;br /&gt;
&lt;br /&gt;
To begin with, one might construct a '''heuristic '''line to build the classifier.&lt;br /&gt;
|- &lt;br /&gt;
|| [Rstudio]&lt;br /&gt;
&lt;br /&gt;
'''fit = function(x)((x * (-0.0021)) + 1.445)'''&lt;br /&gt;
&lt;br /&gt;
'''model_predict &amp;lt;- function(x){'''&lt;br /&gt;
&lt;br /&gt;
'''factor(ifelse(x$ecc &amp;lt; fit(x$minorAL), &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window, type these commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''fit = function(x)((x * (-0.0021)) + 1.445)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''model_predict &amp;lt;- function(x){'''&lt;br /&gt;
&lt;br /&gt;
'''factor(ifelse(x$ecc &amp;lt; fit(x$minorAL), &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
Click Save and Click Run buttons. &lt;br /&gt;
|| Let us describe the steps of the classification algorithm.&lt;br /&gt;
&lt;br /&gt;
For that, we will define a line that partitions the data, as a dummy classifier.&lt;br /&gt;
&lt;br /&gt;
It does not involve learning from the training data, so its performance may be poor.&lt;br /&gt;
&lt;br /&gt;
We define a function that separates data points belonging to either side of the line.&lt;br /&gt;
&lt;br /&gt;
Click Save.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''feature$class &amp;lt;- model_predict(feature)'''&lt;br /&gt;
&lt;br /&gt;
'''feature$classnum &amp;lt;- as.numeric(feature$class)'''&lt;br /&gt;
&lt;br /&gt;
|| Let’s use the line to classify the feature space and draw the decision boundary.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature$class &amp;lt;- model_predict(feature)'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature$classnum &amp;lt;- as.numeric(feature$class)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
This command will use the line created to predict the class of every point in the grid of feature space.&lt;br /&gt;
&lt;br /&gt;
This command encodes the class string labels into numbers suitable for plotting.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Click on '''feature''' in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Point to the data in the Source window.&lt;br /&gt;
|| Drag boundary to see the Environment window.&lt;br /&gt;
&lt;br /&gt;
Click on '''feature '''in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
The '''feature set '''with the predicted classes loads in the source window.&lt;br /&gt;
|- &lt;br /&gt;
|| '''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data= feature, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_abline(slope = -0.0021, intercept = 1.445, size = 1.2)+'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Data Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data= feature, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_abline(slope = -0.0021, intercept = 1.445, size = 1.2)+'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Data Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We are visualising the feature space and the partition line using '''ggplot2'''. &lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the plot window.&lt;br /&gt;
|| Drag boundary to see the plot window clearly.&lt;br /&gt;
&lt;br /&gt;
Overall plot shows that the chosen line approximately separates the training data classes.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
'''prediction_test = model_predict(test_data)'''&lt;br /&gt;
|| Let us see how well the partition performs on the testing dataset.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''prediction_test = model_predict(test_data)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We predict the classes for the testing data and store them in the '''prediction_test''' variable.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us now measure the performance of the classification.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix &amp;lt;- confusionMatrix(test_data$class,prediction_test)'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window, type the command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix &amp;lt;- confusionMatrix(test_data$class,prediction_test)'''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| We use the '''confusionMatrix''' function from the '''caret''' package to calculate performance metrics.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
|- &lt;br /&gt;
|| '''test_confusion_matrix$overall[&amp;quot;Accuracy&amp;quot;]'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window, type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$overall[&amp;quot;Accuracy&amp;quot;]'''&lt;br /&gt;
|| This fetches the accuracy metric from the confusion matrix object created earlier.&lt;br /&gt;
&lt;br /&gt;
Select and run the command&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Drag boundary to see the console window clearly&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''Accuracy'''&lt;br /&gt;
&lt;br /&gt;
0.6962963&lt;br /&gt;
&lt;br /&gt;
|| The accuracy on the testing dataset is approximately 69%.&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the source window clearly&lt;br /&gt;
&lt;br /&gt;
|| Drag boundary to see the source window clearly&lt;br /&gt;
&lt;br /&gt;
Let us now view the confusion matrix of the testing dataset&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| Select and run the command.&lt;br /&gt;
&lt;br /&gt;
The output is seen in the '''console''' window&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the output in the '''console''' window&lt;br /&gt;
&lt;br /&gt;
Reference&lt;br /&gt;
&lt;br /&gt;
Prediction Besni Kecimen&lt;br /&gt;
&lt;br /&gt;
Besni 50 82&lt;br /&gt;
&lt;br /&gt;
Kecimen 0 138&lt;br /&gt;
&lt;br /&gt;
|| Drag boundary to see the console window clearly &lt;br /&gt;
&lt;br /&gt;
Observe that: &lt;br /&gt;
&lt;br /&gt;
0 samples of class Besni have been incorrectly classified.&lt;br /&gt;
&lt;br /&gt;
82 samples of class Kecimen have been incorrectly classified. &lt;br /&gt;
&lt;br /&gt;
We can see that our partition line is skewed.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| For the same problem, many different partitions can be drawn.&lt;br /&gt;
&lt;br /&gt;
We can choose a complicated partition to reduce the training misclassification error.&lt;br /&gt;
&lt;br /&gt;
But this gives no control over the error on the test data.&lt;br /&gt;
&lt;br /&gt;
Instead, we should aim for a classifier that is simple and has a smaller test misclassification error.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| With this, we come to the end of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Let us summarize.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Summary&lt;br /&gt;
|| In this tutorial we have learned about:&lt;br /&gt;
* Machine Learning&lt;br /&gt;
* Classification and Regression Problems&lt;br /&gt;
* Workflow of an ML Classifier Algorithm&lt;br /&gt;
* Visualizing Feature Space&lt;br /&gt;
* Constructing a dummy classifier&lt;br /&gt;
* Evaluation of an ML algorithm&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Here is an assignment for you.&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Assignment&lt;br /&gt;
|| &lt;br /&gt;
*Use a vertical line as a classifier to partition the feature space.&lt;br /&gt;
* Plot the decision boundary for the same.&lt;br /&gt;
* Evaluate the classifier on the test dataset&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
About the Spoken Tutorial Project&lt;br /&gt;
|| The video at the following link summarizes the Spoken Tutorial project. &lt;br /&gt;
&lt;br /&gt;
Please download and watch it.&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Workshops&lt;br /&gt;
|| We conduct workshops using Spoken Tutorials and give certificates.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please contact us.&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Forum to answer questions&lt;br /&gt;
&lt;br /&gt;
Do you have questions in THIS Spoken Tutorial?&lt;br /&gt;
&lt;br /&gt;
Choose the minute and second where you have the question.&lt;br /&gt;
&lt;br /&gt;
Explain your question briefly.&lt;br /&gt;
&lt;br /&gt;
Someone from our team will answer them.&lt;br /&gt;
&lt;br /&gt;
Please visit this site.&lt;br /&gt;
|| Please post your timed queries in this forum.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Forum to answer questions&lt;br /&gt;
|| Do you have any general/technical questions?&lt;br /&gt;
&lt;br /&gt;
Please visit the forum given in the link.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
R Activities&lt;br /&gt;
&lt;br /&gt;
|| The FOSSEE team coordinates the Textbook Companion, Lab Migration and the Case Study Projects.&lt;br /&gt;
&lt;br /&gt;
We give certificates to those who do this.&lt;br /&gt;
&lt;br /&gt;
For more details, please visit the website.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Acknowledgment&lt;br /&gt;
|| The '''Spoken Tutorial''' project was established by the Ministry of Education, Government of India.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Thank You&lt;br /&gt;
|| This tutorial is contributed by Debatosh Chakraborty from IIT Bombay.&lt;br /&gt;
&lt;br /&gt;
Thank you for joining.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Ushav</name></author>	</entry>

	<entry>
		<id>https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English</id>
		<title>Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English</title>
		<link rel="alternate" type="text/html" href="https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English"/>
				<updated>2024-06-04T08:45:03Z</updated>
		
		<summary type="html">&lt;p&gt;Ushav: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title of the script''': Introduction to Machine Learning in R&lt;br /&gt;
&lt;br /&gt;
'''Author''': Debatosh Chakraborty&lt;br /&gt;
&lt;br /&gt;
'''Keywords''': R, RStudio, machine learning, supervised, unsupervised, video tutorial.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=1&lt;br /&gt;
|- &lt;br /&gt;
| align=center| '''Visual Cue'''&lt;br /&gt;
| align=center| '''Narration'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Opening Slide'''&lt;br /&gt;
|| Welcome to this spoken tutorial on''' Introduction to Machine Learning in R'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Learning Objectives'''&lt;br /&gt;
&lt;br /&gt;
|| In this tutorial, we will learn about: &lt;br /&gt;
* Machine Learning&lt;br /&gt;
* Supervised and Unsupervised Learning&lt;br /&gt;
* Workflow of an ML Classifier Algorithm&lt;br /&gt;
* Visualizing Feature Space&lt;br /&gt;
* Constructing a dummy classifier&lt;br /&gt;
* Evaluation of the chosen dummy classifier&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''System Specifications'''&lt;br /&gt;
|| This tutorial is recorded using,&lt;br /&gt;
&lt;br /&gt;
* '''Windows 11 '''&lt;br /&gt;
* '''R '''version''' 4.3.0'''&lt;br /&gt;
* '''RStudio''' version '''2023.06.1'''&lt;br /&gt;
&lt;br /&gt;
It is recommended to install '''R''' version '''4.2.0''' or higher.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Prerequisites '''&lt;br /&gt;
&lt;br /&gt;
'''https://spoken-tutorial.org'''&lt;br /&gt;
|| To follow this tutorial, the learner should know&lt;br /&gt;
* Basic programming in '''R'''.&lt;br /&gt;
* How to use the '''ggplot2''' and '''dplyr''' packages.&lt;br /&gt;
&lt;br /&gt;
If not, please access the relevant tutorials on this website.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Machine Learning'''&lt;br /&gt;
&lt;br /&gt;
'''   '''&lt;br /&gt;
&lt;br /&gt;
|| About machine learning&lt;br /&gt;
&lt;br /&gt;
* ML enables computers to learn from data.&lt;br /&gt;
* ML algorithms automate the learning process from data through patterns.&lt;br /&gt;
* Their primary role is prediction, classification or clustering of data.&lt;br /&gt;
* ML algorithms are applied in several applications.&lt;br /&gt;
* For example Natural Language Processing, Image and speech recognition, etc.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Types of Machine Learning''' &lt;br /&gt;
|| ML algorithms include the following types and tasks: &lt;br /&gt;
* '''Supervised '''learning: Prediction and Classification''',''' &lt;br /&gt;
* '''Unsupervised '''learning''': '''Clustering''','''&lt;br /&gt;
* '''Semi-supervised '''learning&lt;br /&gt;
* '''Reinforcement '''learning'''.'''&lt;br /&gt;
&lt;br /&gt;
In this series, we will focus on '''Supervised''' and '''Unsupervised''' learning algorithms. &lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Supervised and Unsupervised Learning'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''   '''&lt;br /&gt;
|| Supervised learning: Labeled data &lt;br /&gt;
* ML algorithms predict labels for unseen features &lt;br /&gt;
* They predict based on given features and labels of data.&lt;br /&gt;
&lt;br /&gt;
Unsupervised learning: Unlabeled data&lt;br /&gt;
* ML algorithms develop a mechanism to group similar features into clusters.&lt;br /&gt;
* And label them for future analysis.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slides'''&lt;br /&gt;
&lt;br /&gt;
'''Classification and Regression'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
* Supervised learning consists of Regression and Classification.&lt;br /&gt;
* '''Regression''' is applied to predict and learn continuous-valued responses from features. &lt;br /&gt;
* Regression techniques include Linear, Spline, Ridge, Lasso, and others.&lt;br /&gt;
* '''Classification''' is applied to predict the class of a discrete (labeled) response from features. &lt;br /&gt;
* Classification techniques include Logistic Regression, Decision Tree, SVM, and others.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slides'''&lt;br /&gt;
&lt;br /&gt;
'''Workflow of an ML Classifier algorithm'''&lt;br /&gt;
|| The workflow of an ML Classifier algorithm includes&lt;br /&gt;
* Feature Space: Collection of all possible values of the features.&lt;br /&gt;
* A classification algorithm partitions the feature space into a number of classes.&lt;br /&gt;
* Data is split into training and testing sets to learn and evaluate the algorithm.&lt;br /&gt;
* The model learns from the training data to create partitions of feature space.&lt;br /&gt;
* The model is evaluated on the test dataset through performance metrics.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Dataset'''&lt;br /&gt;
&lt;br /&gt;
|| Let’s use '''Raisin dataset '''with two chosen variables or features to understand a classification problem.&lt;br /&gt;
&lt;br /&gt;
For more information on Raisin data please refer to Additional Reading Material on this tutorial page.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide '''&lt;br /&gt;
&lt;br /&gt;
'''Download Files '''&lt;br /&gt;
|| We will use a script file '''Intro.R''' and the '''Raisin Dataset''' '''‘Raisin.xlsx’'''&lt;br /&gt;
&lt;br /&gt;
Please download these files from the''' Code files''' link of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Make a copy and then use them while practicing.&lt;br /&gt;
|- &lt;br /&gt;
|| [Computer screen]&lt;br /&gt;
&lt;br /&gt;
point to '''Intro.R''' and the folder '''Introduction.'''&lt;br /&gt;
&lt;br /&gt;
Point to the''' MLProject folder '''on the '''Desktop.'''&lt;br /&gt;
&lt;br /&gt;
|| I have downloaded and moved these files to the '''Introduction '''folder. &lt;br /&gt;
&lt;br /&gt;
This folder is located in the '''MLProject''' folder on my '''Desktop'''.&lt;br /&gt;
&lt;br /&gt;
I have also set the '''Introduction''' folder as my working Directory.&lt;br /&gt;
&lt;br /&gt;
In this tutorial, we will introduce classification on the '''raisin''' dataset. &lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us switch to '''RStudio'''. &lt;br /&gt;
|- &lt;br /&gt;
|| Click Intro.R in RStudio&lt;br /&gt;
&lt;br /&gt;
Point to Intro.R in RStudio.&lt;br /&gt;
|| Let us open the script '''Intro.R''' in '''RStudio'''.&lt;br /&gt;
&lt;br /&gt;
Script '''Intro.R''' opens in '''RStudio'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' library(readxl)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
'''&amp;lt;nowiki&amp;gt;#install.packages(“package_name”)&amp;lt;/nowiki&amp;gt;'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Point to the command.'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select and run these commands to import the packages.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''readxl''' package to load the excel file of our '''Raisin Dataset'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''caret''' package to create the '''confusion matrix.'''&lt;br /&gt;
&lt;br /&gt;
The '''ggplot2''' package will be used to create the '''decision boundary plot.'''&lt;br /&gt;
&lt;br /&gt;
Please ensure that all the packages are installed correctly.&lt;br /&gt;
&lt;br /&gt;
As I have already installed the packages, I have imported them directly. &lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' '''&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;- read_xlsx(&amp;quot;Raisin.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
|| Run this command to load the '''Raisin '''dataset.&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the '''Environment''' tab clearly.&lt;br /&gt;
&lt;br /&gt;
In the Environment tab below Data, you will see the '''data '''variable.&lt;br /&gt;
&lt;br /&gt;
Click on '''data '''to load the dataset in the Source window. &lt;br /&gt;
&lt;br /&gt;
Click on '''Intro.R''' in the Source window and close the tab.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command.&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;-data[c(&amp;quot;minorAL&amp;quot;,&amp;quot;ecc&amp;quot;,&amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
Select the commands and click the Run button&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We now select three columns from data.&lt;br /&gt;
&lt;br /&gt;
2 columns (&amp;quot;minorAL&amp;quot;, &amp;quot;ecc&amp;quot;) are chosen as features.&lt;br /&gt;
&lt;br /&gt;
The class column is chosen as a target variable.&lt;br /&gt;
&lt;br /&gt;
We convert the target variable '''data$class '''to a factor. &lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|- &lt;br /&gt;
|| Click on the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click on '''data.'''&lt;br /&gt;
|| Click on '''data '''to load the modified data in the Source window.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| We will now understand the feature space of this data.&lt;br /&gt;
|- &lt;br /&gt;
|| '''range_minor_al &amp;lt;- range(data$minorAL)'''&lt;br /&gt;
&lt;br /&gt;
'''range_ecc &amp;lt;- range(data$ecc)'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''range_minor_al &amp;lt;- range(data$minorAL)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''range_ecc &amp;lt;- range(data$ecc)'''&lt;br /&gt;
|| These commands show the range of the feature variables '''minorAL''' and''' ecc.'''&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the environment tab clearly.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The minimum and maximum values of '''minorAL''' and '''ecc''' are stored in their range variables.&lt;br /&gt;
|- &lt;br /&gt;
|| '''X &amp;lt;- seq(min(data$minorAL), max(data$minorAL), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''Y &amp;lt;- seq(min(data$ecc), max(data$ecc), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''feature &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
&lt;br /&gt;
|| We will now use the range to generate grid points to construct the feature space.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''X &amp;lt;- seq(min(data$minorAL), max(data$minorAL), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''Y &amp;lt;- seq(min(data$ecc), max(data$ecc), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
|| This command generates a sequence of points spanning the range of '''minorAL '''and''' ecc'''.&lt;br /&gt;
&lt;br /&gt;
This command creates a Cartesian product of the two features to create the feature space.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
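As a side note (not part of the recorded steps), the grid construction can be checked on toy values; the names minorAL and ecc follow the tutorial, but the values here are made up:

```r
# Toy illustration: expand.grid forms the Cartesian product of two sequences
X = c(1, 2)        # made-up values standing in for the minorAL sequence
Y = c(0.1, 0.2)    # made-up values standing in for the ecc sequence
grid = expand.grid(minorAL = X, ecc = Y)
nrow(grid)         # 4 combinations: every pairing of X with Y
```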
|-&lt;br /&gt;
|| '''ggplot(data = data, aes(x = minorAL, y = ecc)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(aes(color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Feature Space&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| We will now plot the feature space created&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| '''ggplot(data = data, aes(x = minorAL, y = ecc)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(aes(color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Feature Space&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
|| These commands plot the data points in the feature space.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|-&lt;br /&gt;
|| Drag boundaries.&lt;br /&gt;
|| Drag boundaries to see the plot window clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the data.&lt;br /&gt;
|| Now let us split our data into training and testing data.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Click on '''Intro.R''' in the Source window, and type these commands.&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| This creates training data, consisting of 630 unique rows.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This creates testing data, consisting of 270 unique rows.&lt;br /&gt;
|-&lt;br /&gt;
|| Select the commands and click the Run button.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to the sets in the Environment Tab&lt;br /&gt;
&lt;br /&gt;
Click the '''train_data '''and '''test_data '''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
The data sets are shown in the '''Environment '''tab.&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the Environment window clearly&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Click on '''train_data '''and '''test_data '''to load them in the Source window.&lt;br /&gt;
|-&lt;br /&gt;
|| &lt;br /&gt;
|| Here we try to partition the '''feature space''' to construct the classifier.&lt;br /&gt;
&lt;br /&gt;
To begin with, one might construct a '''heuristic '''line to build the classifier.&lt;br /&gt;
|- &lt;br /&gt;
|| [Rstudio]&lt;br /&gt;
&lt;br /&gt;
'''fit = function(x)((x * (-0.0021)) + 1.445)'''&lt;br /&gt;
&lt;br /&gt;
'''model_predict &amp;lt;- function(x){'''&lt;br /&gt;
&lt;br /&gt;
'''factor(ifelse(x$ecc &amp;lt; fit(x$minorAL), &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window, type these commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''fit = function(x)((x * (-0.0021)) + 1.445)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''model_predict &amp;lt;- function(x){'''&lt;br /&gt;
&lt;br /&gt;
'''factor(ifelse(x$ecc &amp;lt; fit(x$minorAL), &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
Click Save and Click Run buttons. &lt;br /&gt;
|| Let us describe the steps of the classification algorithm.&lt;br /&gt;
&lt;br /&gt;
For that we will define a line to partition the data as a dummy classifier.&lt;br /&gt;
&lt;br /&gt;
Since it does not learn from the training data, its performance may be poor.&lt;br /&gt;
&lt;br /&gt;
We define a function that separates data points belonging to either side of the line.&lt;br /&gt;
&lt;br /&gt;
Click Save.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
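As an aside, the heuristic classifier above can be exercised on a tiny hand-made data frame (a sketch with made-up inputs; it uses = assignment and an equivalent reversed comparison, not the exact commands from the script):

```r
# Sketch: the heuristic line classifier, applied to two toy points
fit = function(x) (x * (-0.0021)) + 1.445
model_predict = function(x) {
  # points above the line are labelled Besni, points below it Kecimen
  factor(ifelse(x$ecc > fit(x$minorAL), "Besni", "Kecimen"))
}
toy = data.frame(minorAL = c(220, 300), ecc = c(0.75, 0.95))
model_predict(toy)   # Kecimen (below the line), Besni (above it)
```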
|- &lt;br /&gt;
|| '''feature$class &amp;lt;- model_predict(feature)'''&lt;br /&gt;
&lt;br /&gt;
'''feature$classnum &amp;lt;- as.numeric(feature$class)'''&lt;br /&gt;
&lt;br /&gt;
|| Let’s use the line to classify the feature space and draw the decision boundary.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature$class &amp;lt;- model_predict(feature)'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature$classnum &amp;lt;- as.numeric(feature$class)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
This command will use the line created to predict the class of every point in the grid of feature space.&lt;br /&gt;
&lt;br /&gt;
This command encodes the class string labels into numbers suitable for plotting.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Click on '''feature''' in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Point to the data in the Source window.&lt;br /&gt;
|| Drag boundary to see the Environment window.&lt;br /&gt;
&lt;br /&gt;
Click on '''feature '''in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
The '''feature set '''with the predicted classes loads in the source window.&lt;br /&gt;
|- &lt;br /&gt;
|| '''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data= feature, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_abline(slope = -0.0021, intercept = 1.445, size = 1.2)+'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Data Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data= feature, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_abline(slope = -0.0021, intercept = 1.445, size = 1.2)+'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Data Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We are visualising the feature space and the partition line using GGPlot2. &lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the plot window.&lt;br /&gt;
|| Drag boundary to see the plot window clearly.&lt;br /&gt;
&lt;br /&gt;
The overall plot shows that the chosen line approximately separates the two classes.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
'''prediction_test = model_predict(test_data)'''&lt;br /&gt;
|| Let us see how well the partition performs on the testing dataset.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''prediction_test = model_predict(test_data)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We predict the classes from testing data and store it in the '''prediction_test '''variable.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us now measure the performance of the classification.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix &amp;lt;- confusionMatrix(test_data$class,prediction_test)'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window, type the command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix &amp;lt;- confusionMatrix(test_data$class,prediction_test)'''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| We use the '''confusionMatrix''' function from the '''caret''' package to calculate performance metrics.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
|- &lt;br /&gt;
|| '''test_confusion_matrix$overall[&amp;quot;Accuracy&amp;quot;]'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window, type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$overall[&amp;quot;Accuracy&amp;quot;]'''&lt;br /&gt;
|| It fetches the accuracy metric from the confusion matrix object.&lt;br /&gt;
&lt;br /&gt;
Select and run the command&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Drag boundary to see the console window clearly&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''Accuracy'''&lt;br /&gt;
&lt;br /&gt;
0.6962963&lt;br /&gt;
&lt;br /&gt;
|| The accuracy on the testing dataset is approximately 69%.&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the source window clearly&lt;br /&gt;
&lt;br /&gt;
|| Drag boundary to see the source window clearly&lt;br /&gt;
&lt;br /&gt;
Let us now view the confusion matrix of the testing dataset&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| Select and run the command.&lt;br /&gt;
&lt;br /&gt;
The output is seen in the '''console''' window&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Point the output in the '''console window'''&lt;br /&gt;
&lt;br /&gt;
Reference&lt;br /&gt;
&lt;br /&gt;
Prediction Besni Kecimen&lt;br /&gt;
&lt;br /&gt;
Besni 50 82&lt;br /&gt;
&lt;br /&gt;
Kecimen 0 138&lt;br /&gt;
&lt;br /&gt;
|| Drag boundary to see the console window clearly &lt;br /&gt;
&lt;br /&gt;
Observe that: &lt;br /&gt;
&lt;br /&gt;
82 samples of class Besni have been incorrectly classified.&lt;br /&gt;
&lt;br /&gt;
0 samples of class Kecimen have been incorrectly classified. &lt;br /&gt;
&lt;br /&gt;
We can see that our partition line is skewed.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| For the same problem many partitions can be drawn.&lt;br /&gt;
&lt;br /&gt;
We can choose a complicated partition to reduce the training misclassification error.&lt;br /&gt;
&lt;br /&gt;
But that gives no control over the error on the test data.&lt;br /&gt;
&lt;br /&gt;
Instead, we aim to choose a simple classifier with a smaller test misclassification error.&lt;br /&gt;
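One way to make "test misclassification error" concrete is the fraction of disagreeing labels; this toy sketch (made-up labels, not the recorded session) shows the idea:

```r
# Toy sketch: misclassification error is the fraction of labels that disagree
misclass_rate = function(truth, pred) mean(as.character(truth) != as.character(pred))
truth = factor(c("Besni", "Kecimen", "Kecimen", "Besni"))
pred  = factor(c("Besni", "Besni",   "Kecimen", "Besni"))
misclass_rate(truth, pred)   # 0.25: one of four labels is wrong
```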
|- &lt;br /&gt;
|| &lt;br /&gt;
|| With this, we come to the end of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Let us summarize.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Summary&lt;br /&gt;
|| In this tutorial we have learned about:&lt;br /&gt;
* Machine Learning&lt;br /&gt;
* Classification and Regression Problems&lt;br /&gt;
* Workflow of an ML Classifier Algorithm&lt;br /&gt;
* Visualizing Feature Space&lt;br /&gt;
* Constructing a dummy classifier&lt;br /&gt;
* Evaluation of an ML algorithm&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Here is an assignment for you.&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Assignment&lt;br /&gt;
|| &lt;br /&gt;
*Use a vertical line as a classifier to partition the feature space.&lt;br /&gt;
* Plot the decision boundary for the same.&lt;br /&gt;
* Evaluate the classifier on the test dataset&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
About the Spoken Tutorial Project&lt;br /&gt;
|| The video at the following link summarizes the Spoken Tutorial project. &lt;br /&gt;
&lt;br /&gt;
Please download and watch it.&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Workshops&lt;br /&gt;
|| We conduct workshops using Spoken Tutorials and give certificates.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please contact us.&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Forum to answer questions&lt;br /&gt;
&lt;br /&gt;
Do you have questions in THIS Spoken Tutorial?&lt;br /&gt;
&lt;br /&gt;
Choose the minute and second where you have the question.&lt;br /&gt;
&lt;br /&gt;
Explain your question briefly.&lt;br /&gt;
&lt;br /&gt;
Someone from our team will answer them.&lt;br /&gt;
&lt;br /&gt;
Please visit this site.&lt;br /&gt;
|| Please post your timed queries in this forum.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Forum to answer questions&lt;br /&gt;
|| Do you have any general/technical questions?&lt;br /&gt;
&lt;br /&gt;
Please visit the forum given in the link.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
R Activities&lt;br /&gt;
&lt;br /&gt;
|| The FOSSEE team coordinates the Textbook Companion, Lab Migration and the Case Study Projects.&lt;br /&gt;
&lt;br /&gt;
We give certificates to those who do this.&lt;br /&gt;
&lt;br /&gt;
For more details, please visit the website.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Acknowledgment&lt;br /&gt;
|| The '''Spoken Tutorial''' project was established by the Ministry of Education, Government of India.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Thank You&lt;br /&gt;
|| This tutorial is contributed by Debatosh Chakraborty from IIT Bombay.&lt;br /&gt;
&lt;br /&gt;
Thank you for joining.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Ushav</name></author>	</entry>

	<entry>
		<id>https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English</id>
		<title>Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English</title>
		<link rel="alternate" type="text/html" href="https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English"/>
				<updated>2024-06-04T08:43:34Z</updated>
		
		<summary type="html">&lt;p&gt;Ushav: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title of the script''': Introduction to Machine Learning in R&lt;br /&gt;
&lt;br /&gt;
'''Author''': Debatosh Chakraborty&lt;br /&gt;
&lt;br /&gt;
'''Keywords''': R, RStudio, machine learning, supervised, unsupervised, video tutorial.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=1&lt;br /&gt;
|- &lt;br /&gt;
| align=center| '''Visual Cue'''&lt;br /&gt;
| align=center| '''Narration'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Opening Slide'''&lt;br /&gt;
|| Welcome to this spoken tutorial on''' Introduction to Machine Learning in R'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Learning Objectives'''&lt;br /&gt;
&lt;br /&gt;
|| In this tutorial, we will learn about: &lt;br /&gt;
* Machine Learning&lt;br /&gt;
* Supervised and Unsupervised Learning&lt;br /&gt;
* Workflow of an ML Classifier Algorithm&lt;br /&gt;
* Visualizing Feature Space&lt;br /&gt;
* Constructing a dummy classifier&lt;br /&gt;
* Evaluation of the chosen dummy classifier&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''System Specifications'''&lt;br /&gt;
|| This tutorial is recorded using,&lt;br /&gt;
&lt;br /&gt;
* '''Windows 11 '''&lt;br /&gt;
* '''R '''version''' 4.3.0'''&lt;br /&gt;
* '''RStudio''' version '''2023.06.1'''&lt;br /&gt;
&lt;br /&gt;
It is recommended to install '''R''' version '''4.2.0''' or higher.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Prerequisites '''&lt;br /&gt;
&lt;br /&gt;
'''https://spoken-tutorial.org'''&lt;br /&gt;
|| To follow this tutorial, the learner should know&lt;br /&gt;
* Basic programming in '''R'''.&lt;br /&gt;
* How to use the '''ggplot2''' and '''dplyr''' packages.&lt;br /&gt;
&lt;br /&gt;
If not, please access the relevant tutorials on this website.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Machine Learning'''&lt;br /&gt;
&lt;br /&gt;
'''   '''&lt;br /&gt;
&lt;br /&gt;
|| About machine learning&lt;br /&gt;
&lt;br /&gt;
* ML enables computers to learn from data.&lt;br /&gt;
* ML algorithms automate the learning process from data through patterns.&lt;br /&gt;
* Their primary role is prediction, classification or clustering of data.&lt;br /&gt;
* ML algorithms are applied in several applications.&lt;br /&gt;
* For example Natural Language Processing, Image and speech recognition, etc.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Types of Machine Learning''' &lt;br /&gt;
|| ML algorithms include the following types and tasks: &lt;br /&gt;
* '''Supervised '''learning: Prediction and Classification''',''' &lt;br /&gt;
* '''Unsupervised '''learning''': '''Clustering''','''&lt;br /&gt;
* '''Semi-supervised '''learning&lt;br /&gt;
* '''Reinforcement '''learning'''.'''&lt;br /&gt;
&lt;br /&gt;
In this series, we will focus on '''Supervised''' and '''Unsupervised''' learning algorithms. &lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Supervised and Unsupervised Learning'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''   '''&lt;br /&gt;
|| Supervised learning: Labeled data &lt;br /&gt;
* ML algorithms predict labels for unseen features &lt;br /&gt;
* They predict based on given features and labels of data.&lt;br /&gt;
&lt;br /&gt;
Unsupervised learning: Unlabeled data&lt;br /&gt;
* ML algorithms develop a mechanism to group similar features into clusters.&lt;br /&gt;
* And label them for future analysis.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slides'''&lt;br /&gt;
&lt;br /&gt;
'''Classification and Regression'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
* Supervised learning consists of Regression and Classification.&lt;br /&gt;
* '''Regression''' is applied to predict and learn continuous-valued responses from features. &lt;br /&gt;
* Regression techniques include Linear, Spline, Ridge, Lasso, and others.&lt;br /&gt;
* '''Classification''' is applied to predict the class of a discrete (labeled) response from features. &lt;br /&gt;
* Classification techniques include Logistic Regression, Decision Tree, SVM, and others.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slides'''&lt;br /&gt;
&lt;br /&gt;
'''Workflow of an ML Classifier algorithm'''&lt;br /&gt;
|| The workflow of an ML classifier algorithm includes:&lt;br /&gt;
* Feature Space: Collection of all possible values of the features.&lt;br /&gt;
* A classification algorithm partitions the feature space into a number of classes.&lt;br /&gt;
* Data is split into training and testing sets to learn and evaluate the algorithm.&lt;br /&gt;
* The model learns from the training data to create partitions of feature space.&lt;br /&gt;
* The model is evaluated on the test dataset through performance metrics.&lt;br /&gt;
&lt;br /&gt;
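The steps above can be sketched end to end in R. This is a minimal illustration on the built-in '''iris''' data with a nearest-class-mean rule; the dataset and the rule are stand-ins, not the Raisin data or the classifier used later in this tutorial.

```r
# Minimal sketch of the classifier workflow: split, learn a partition, evaluate.
# Uses the built-in iris data; the nearest-class-mean rule is only illustrative.
set.seed(1)
idx = sample(1:nrow(iris), size = 0.7 * nrow(iris))  # 70/30 train-test split
train = iris[idx, ]
test  = iris[-idx, ]

# "Learn" a partition of the feature space: one class mean per species
centers = tapply(train$Petal.Length, train$Species, mean)

# Predict by assigning each point to the nearest class mean
predict_species = function(x) {
  names(centers)[apply(abs(outer(x, centers, "-")), 1, which.min)]
}

pred = predict_species(test$Petal.Length)
accuracy = mean(pred == as.character(test$Species))  # evaluation on the test set
```

The same four stages (split, learn, predict, evaluate) recur with the Raisin data below.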
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Dataset'''&lt;br /&gt;
&lt;br /&gt;
|| Let’s use '''Raisin dataset '''with two chosen variables to understand a classification problem.&lt;br /&gt;
&lt;br /&gt;
For more information on Raisin data please refer to Additional Reading Material on this tutorial page.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide '''&lt;br /&gt;
&lt;br /&gt;
'''Download Files '''&lt;br /&gt;
|| We will use a script file '''Intro.R '''and '''Raisin Dataset ‘raisin.xlsx’'''&lt;br /&gt;
&lt;br /&gt;
Please download these files from the''' Code files''' link of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Make a copy and then use them while practicing.&lt;br /&gt;
|- &lt;br /&gt;
|| [Computer screen]&lt;br /&gt;
&lt;br /&gt;
point to '''Intro.R''' and the folder '''Introduction.'''&lt;br /&gt;
&lt;br /&gt;
Point to the''' MLProject folder '''on the '''Desktop.'''&lt;br /&gt;
&lt;br /&gt;
|| I have downloaded and moved these files to the '''Introduction '''folder. &lt;br /&gt;
&lt;br /&gt;
This folder is located in the '''MLProject''' folder on my '''Desktop'''.&lt;br /&gt;
&lt;br /&gt;
I have also set the '''Introduction''' folder as my working Directory.&lt;br /&gt;
&lt;br /&gt;
In this tutorial, we will introduce classification on the '''raisin''' dataset. &lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us switch to '''RStudio'''. &lt;br /&gt;
|- &lt;br /&gt;
|| Click Intro.R in RStudio&lt;br /&gt;
&lt;br /&gt;
Point to Intro.R in RStudio.&lt;br /&gt;
|| Let us open the script '''Intro.R''' in '''RStudio'''.&lt;br /&gt;
&lt;br /&gt;
Script '''Intro.R''' opens in '''RStudio'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' library(readxl)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
'''&amp;lt;nowiki&amp;gt;#install.packages(&amp;quot;package_name&amp;quot;)&amp;lt;/nowiki&amp;gt;'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Point to the command.'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select and run these commands to import the packages.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''readxl''' package to load the Excel file of our '''Raisin Dataset'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''caret''' package to create the '''confusion matrix.'''&lt;br /&gt;
&lt;br /&gt;
The '''ggplot2''' package will be used to create the '''decision boundary plot.'''&lt;br /&gt;
&lt;br /&gt;
Please ensure that all the packages are installed correctly.&lt;br /&gt;
&lt;br /&gt;
As I have already installed the packages, I have imported them directly. &lt;br /&gt;
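A missing package can be detected programmatically before calling '''library()'''. The following sketch only reports what still needs installing, rather than installing it directly:

```r
# Report which of the required packages (if any) still need to be installed
pkgs = c("readxl", "caret", "ggplot2")
to_install = pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(to_install) > 0) {
  message("Run: install.packages(c(", toString(shQuote(to_install)), "))")
}
```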
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' '''&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;- read_xlsx(&amp;quot;Raisin.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
|| Run this command to load the '''Raisin '''dataset.&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the '''Environment''' tab clearly.&lt;br /&gt;
&lt;br /&gt;
In the Environment tab below Data, you will see the '''data '''variable.&lt;br /&gt;
&lt;br /&gt;
Click on '''data '''to load the dataset in the Source window. &lt;br /&gt;
&lt;br /&gt;
Click on '''Intro.R''' in the Source window and close the tab.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command.&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;-data[c(&amp;quot;minorAL&amp;quot;,&amp;quot;ecc&amp;quot;,&amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
Select the commands and click the Run button&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We now select three columns from data.&lt;br /&gt;
&lt;br /&gt;
2 columns (&amp;quot;minorAL&amp;quot;, &amp;quot;ecc&amp;quot;) are chosen as features.&lt;br /&gt;
&lt;br /&gt;
The class column is chosen as a target variable.&lt;br /&gt;
&lt;br /&gt;
We convert the target variable '''data$class '''to a factor. &lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
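What '''factor()''' does to the class column can be seen on a toy vector (the values below are invented for illustration):

```r
# factor() turns character labels into a categorical variable with fixed levels
cls = c("Besni", "Kecimen", "Besni")
f = factor(cls)
levels(f)       # the two distinct class labels
as.numeric(f)   # integer codes per observation, as used later for plotting
```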
|- &lt;br /&gt;
|| Click on the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click on '''data.'''&lt;br /&gt;
|| Click on '''data '''to load the modified data in the Source window.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| We will now understand the feature space of this data.&lt;br /&gt;
|- &lt;br /&gt;
|| '''range_minor_al &amp;lt;- range(data$minorAL)'''&lt;br /&gt;
&lt;br /&gt;
'''range_ecc &amp;lt;- range(data$ecc)'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''range_minor_al &amp;lt;- range(data$minorAL)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''range_ecc &amp;lt;- range(data$ecc)'''&lt;br /&gt;
|| These commands show the range of the feature variables '''minorAL''' and''' ecc.'''&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the environment tab clearly.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The minimum and maximum values of '''minorAL''' and '''ecc''' are shown in their range variables.&lt;br /&gt;
|- &lt;br /&gt;
|| '''X &amp;lt;- seq(min(data$minorAL), max(data$minorAL), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''Y &amp;lt;- seq(min(data$ecc), max(data$ecc), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''feature &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
&lt;br /&gt;
|| We will now use the range to generate grid points to construct the feature space.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''X &amp;lt;- seq(min(data$minorAL), max(data$minorAL), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''Y &amp;lt;- seq(min(data$ecc), max(data$ecc), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
|| This command generates a sequence of points spanning the range of '''minorAL '''and''' ecc'''.&lt;br /&gt;
&lt;br /&gt;
This command creates a cartesian product of the two features to create a feature space.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
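The effect of '''expand.grid()''' is easy to see on a tiny grid (the sequences below are stand-ins, not the real ranges of minorAL and ecc):

```r
# expand.grid() forms the Cartesian product of the two sequences
X = seq(0, 1, length.out = 3)     # 3 stand-in values for minorAL
Y = seq(10, 20, length.out = 2)   # 2 stand-in values for ecc
grid = expand.grid(minorAL = X, ecc = Y)
nrow(grid)  # 3 * 2 = 6 grid points covering the toy feature space
```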
|-&lt;br /&gt;
|| '''ggplot(data = data, aes(x = minorAL, y = ecc)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(aes(color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Feature Space&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| We will now plot the feature space created&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| '''ggplot(data = data, aes(x = minorAL, y = ecc)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(aes(color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Feature Space&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
|| These commands plot the data points in the feature space.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|-&lt;br /&gt;
|| Drag boundaries.&lt;br /&gt;
|| Drag boundaries to see the plot window clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the data.&lt;br /&gt;
|| Now let us split our data into training and testing data.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Click on '''Intro.R''' in the Source window, and type these commands.&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| This creates training data, consisting of 630 unique rows.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This creates testing data, consisting of 270 unique rows.&lt;br /&gt;
|-&lt;br /&gt;
|| Select the commands and click the Run button.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to the sets in the Environment Tab&lt;br /&gt;
&lt;br /&gt;
Click the '''train_data '''and '''test_data '''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
The data sets are shown in the '''Environment '''tab.&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the Environment window clearly&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Click on '''train_data '''and '''test_data '''to load them in the Source window.&lt;br /&gt;
|-&lt;br /&gt;
|| &lt;br /&gt;
|| Here we try to partition the '''feature space''' to construct the classifier.&lt;br /&gt;
&lt;br /&gt;
To begin with, one might construct a '''heuristic '''line to build the classifier.&lt;br /&gt;
|- &lt;br /&gt;
|| [Rstudio]&lt;br /&gt;
&lt;br /&gt;
'''fit = function(x)((x * (-0.0021)) + 1.445)'''&lt;br /&gt;
&lt;br /&gt;
'''model_predict &amp;lt;- function(x){'''&lt;br /&gt;
&lt;br /&gt;
'''factor(ifelse(x$ecc &amp;lt; fit(x$minorAL), &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window, type these commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''fit = function(x)((x * (-0.0021)) + 1.445)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''model_predict &amp;lt;- function(x){'''&lt;br /&gt;
&lt;br /&gt;
'''factor(ifelse(x$ecc &amp;lt; fit(x$minorAL), &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
Click Save and Click Run buttons. &lt;br /&gt;
|| Let us describe the steps of the classification algorithm.&lt;br /&gt;
&lt;br /&gt;
For that we will define a line to partition the data as a dummy classifier.&lt;br /&gt;
&lt;br /&gt;
It doesn’t involve training data so performance may be poor.&lt;br /&gt;
&lt;br /&gt;
We define a function that separates data points belonging to either side of the line.&lt;br /&gt;
&lt;br /&gt;
Click Save.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
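The rule can be checked on a couple of hand-made points (the values below are invented; the comparison is written with ">=", which is the tutorial's test with the two branches swapped):

```r
# The heuristic line and decision rule from the tutorial, on invented points.
# Points below the line are labelled Kecimen, points on or above it Besni.
fit = function(x) ((x * (-0.0021)) + 1.445)
model_predict = function(x) {
  factor(ifelse(x$ecc >= fit(x$minorAL), "Besni", "Kecimen"))
}
toy = data.frame(minorAL = c(200, 400), ecc = c(0.5, 1.2))
model_predict(toy)  # Kecimen, Besni
```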
|- &lt;br /&gt;
|| '''feature$class &amp;lt;- model_predict(feature)'''&lt;br /&gt;
&lt;br /&gt;
'''feature$classnum &amp;lt;- as.numeric(feature$class)'''&lt;br /&gt;
&lt;br /&gt;
|| Let’s use the line to classify the feature space and draw the decision boundary.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature$class &amp;lt;- model_predict(feature)'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature$classnum &amp;lt;- as.numeric(feature$class)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
This command will use the line created to predict the class of every point in the grid of feature space.&lt;br /&gt;
&lt;br /&gt;
This command encodes the class string labels into numbers suitable for plotting&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Click on '''feature''' in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Point to the data in the Source window.&lt;br /&gt;
|| Drag boundary to see the Environment window.&lt;br /&gt;
&lt;br /&gt;
Click on '''feature '''in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
The '''feature set '''with the predicted classes loads in the source window.&lt;br /&gt;
|- &lt;br /&gt;
|| '''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data= feature, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_abline(slope = -0.0021, intercept = 1.445, size = 1.2)+'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Data Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data= feature, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_abline(slope = -0.0021, intercept = 1.445, size = 1.2)+'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Data Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We are visualising the feature space and the partition line using '''ggplot2'''. &lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the plot window.&lt;br /&gt;
|| Drag boundary to see the plot window clearly.&lt;br /&gt;
&lt;br /&gt;
The overall plot shows that the chosen line approximately separates the two classes.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
'''prediction_test = model_predict(test_data)'''&lt;br /&gt;
|| Let us see how well the partition performs on the testing dataset.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''prediction_test = model_predict(test_data)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We predict the classes from testing data and store it in the '''prediction_test '''variable.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us now measure the performance of the classification.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix &amp;lt;- confusionMatrix(test_data$class,prediction_test)'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window, type the command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix &amp;lt;- confusionMatrix(test_data$class,prediction_test)'''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| We use the '''confusionMatrix''' function from the '''caret''' package to calculate performance metrics.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
|- &lt;br /&gt;
|| '''test_confusion_matrix$overall[&amp;quot;Accuracy&amp;quot;]'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window, type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$overall[&amp;quot;Accuracy&amp;quot;]'''&lt;br /&gt;
|| This fetches the accuracy metric from the confusion matrix object.&lt;br /&gt;
&lt;br /&gt;
Select and run the command&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Drag boundary to see the console window clearly&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''Accuracy'''&lt;br /&gt;
&lt;br /&gt;
0.6962963&lt;br /&gt;
&lt;br /&gt;
|| The accuracy on the testing dataset is approximately 69%.&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the source window clearly&lt;br /&gt;
&lt;br /&gt;
|| Drag boundary to see the source window clearly&lt;br /&gt;
&lt;br /&gt;
Let us now view the confusion matrix of the testing dataset&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| Select and run the command.&lt;br /&gt;
&lt;br /&gt;
The output is seen in the '''console''' window&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Point the output in the '''console window'''&lt;br /&gt;
&lt;br /&gt;
Reference&lt;br /&gt;
&lt;br /&gt;
Prediction Besni Kecimen&lt;br /&gt;
&lt;br /&gt;
Besni 50 82&lt;br /&gt;
&lt;br /&gt;
Kecimen 0 138&lt;br /&gt;
&lt;br /&gt;
|| Drag boundary to see the console window clearly &lt;br /&gt;
&lt;br /&gt;
Observe that: &lt;br /&gt;
&lt;br /&gt;
0 samples of class Besni have been incorrectly classified.&lt;br /&gt;
&lt;br /&gt;
82 samples of class Kecimen have been incorrectly classified. &lt;br /&gt;
&lt;br /&gt;
We can see that our partition line is skewed.&lt;br /&gt;
&lt;br /&gt;
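The reported accuracy can be recomputed by hand from the confusion matrix shown above:

```r
# Rebuild the confusion matrix printed in the console and recompute accuracy
cm = matrix(c(50, 0, 82, 138), nrow = 2,
            dimnames = list(Prediction = c("Besni", "Kecimen"),
                            Reference  = c("Besni", "Kecimen")))
accuracy = sum(diag(cm)) / sum(cm)  # (50 + 138) / 270
round(accuracy, 7)                  # 0.6962963
```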
|- &lt;br /&gt;
|| &lt;br /&gt;
|| For the same problem many partitions can be drawn.&lt;br /&gt;
&lt;br /&gt;
We can choose a complicated partition to reduce the training misclassification error.&lt;br /&gt;
&lt;br /&gt;
But that gives us no control over the error on the test data.&lt;br /&gt;
&lt;br /&gt;
Instead, we aim to choose a simple classifier with a small test misclassification error.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| With this, we come to the end of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Let us summarize.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Summary&lt;br /&gt;
|| In this tutorial we have learned about:&lt;br /&gt;
* Machine Learning&lt;br /&gt;
* Classification and Regression Problems&lt;br /&gt;
* Workflow of an ML Classifier Algorithm&lt;br /&gt;
* Visualizing Feature Space&lt;br /&gt;
* Constructing a dummy classifier&lt;br /&gt;
* Evaluation of an ML algorithm&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Here is an assignment for you.&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Assignment&lt;br /&gt;
|| &lt;br /&gt;
*Use a vertical line as a classifier to partition the feature space.&lt;br /&gt;
* Plot the decision boundary for the same.&lt;br /&gt;
* Evaluate the classifier on the test dataset&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
About the Spoken Tutorial Project&lt;br /&gt;
|| The video at the following link summarizes the Spoken Tutorial project. &lt;br /&gt;
&lt;br /&gt;
Please download and watch it.&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Workshops&lt;br /&gt;
|| We conduct workshops using Spoken Tutorials and give certificates.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please contact us.&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Forum to answer questions&lt;br /&gt;
&lt;br /&gt;
Do you have questions in THIS Spoken Tutorial?&lt;br /&gt;
&lt;br /&gt;
Choose the minute and second where you have the question.&lt;br /&gt;
&lt;br /&gt;
Explain your question briefly.&lt;br /&gt;
&lt;br /&gt;
Someone from our team will answer them.&lt;br /&gt;
&lt;br /&gt;
Please visit this site.&lt;br /&gt;
|| Please post your timed queries in this forum.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Forum to answer questions&lt;br /&gt;
|| Do you have any general/technical questions?&lt;br /&gt;
&lt;br /&gt;
Please visit the forum given in the link.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
R Activities&lt;br /&gt;
&lt;br /&gt;
|| The FOSSEE team coordinates the Textbook Companion, Lab Migration and the Case Study Projects.&lt;br /&gt;
&lt;br /&gt;
We give certificates to those who do this.&lt;br /&gt;
&lt;br /&gt;
For more details, please visit the website.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Acknowledgment&lt;br /&gt;
|| The '''Spoken Tutorial''' project was established by the Ministry of Education, Government of India.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Thank You&lt;br /&gt;
|| This tutorial is contributed by Debatosh Chakraborty from IIT Bombay.&lt;br /&gt;
&lt;br /&gt;
Thank you for joining.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Ushav</name></author>	</entry>

	<entry>
		<id>https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English</id>
		<title>Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English</title>
		<link rel="alternate" type="text/html" href="https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English"/>
				<updated>2024-06-04T08:37:47Z</updated>
		
		<summary type="html">&lt;p&gt;Ushav: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title of the script''': Introduction to Machine Learning in R&lt;br /&gt;
&lt;br /&gt;
'''Author''': Debatosh Chakraborty&lt;br /&gt;
&lt;br /&gt;
'''Keywords''': R, RStudio, machine learning, supervised, unsupervised, video tutorial.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=1&lt;br /&gt;
|- &lt;br /&gt;
| align=center| '''Visual Cue'''&lt;br /&gt;
| align=center| '''Narration'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Opening Slide'''&lt;br /&gt;
|| Welcome to this spoken tutorial on''' Introduction to Machine Learning in R'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Learning Objectives'''&lt;br /&gt;
&lt;br /&gt;
|| In this tutorial, we will learn about: &lt;br /&gt;
* Machine Learning&lt;br /&gt;
* Supervised and Unsupervised Learning&lt;br /&gt;
* Workflow of an ML Classifier Algorithm&lt;br /&gt;
* Visualizing Feature Space&lt;br /&gt;
* Constructing a dummy classifier&lt;br /&gt;
* Evaluation of the chosen dummy classifier&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''System Specifications'''&lt;br /&gt;
|| This tutorial is recorded using,&lt;br /&gt;
&lt;br /&gt;
* '''Windows 11 '''&lt;br /&gt;
* '''R '''version''' 4.3.0'''&lt;br /&gt;
* '''RStudio''' version '''2023.06.1'''&lt;br /&gt;
&lt;br /&gt;
It is recommended to install '''R''' version '''4.2.0''' or higher.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Prerequisites '''&lt;br /&gt;
&lt;br /&gt;
'''https://spoken-tutorial.org'''&lt;br /&gt;
|| To follow this tutorial, the learner should know&lt;br /&gt;
* Basic programming in '''R'''.&lt;br /&gt;
* How to use the ggplot2 and dplyr packages.&lt;br /&gt;
&lt;br /&gt;
If not, please access the relevant tutorials on this website.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Machine Learning'''&lt;br /&gt;
&lt;br /&gt;
'''   '''&lt;br /&gt;
&lt;br /&gt;
|| About machine learning&lt;br /&gt;
&lt;br /&gt;
* ML enables computers to learn from data.&lt;br /&gt;
* ML algorithms automate the learning process from data through patterns.&lt;br /&gt;
* Their primary role is prediction, classification or clustering of data.&lt;br /&gt;
* ML algorithms are applied in several applications.&lt;br /&gt;
* For example Natural Language Processing, Image and speech recognition, etc.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Types of Machine Learning''' &lt;br /&gt;
|| ML algorithms include the following types and tasks: &lt;br /&gt;
* '''Supervised '''learning: Prediction and Classification''',''' &lt;br /&gt;
* '''Unsupervised '''learning''': '''Clustering''','''&lt;br /&gt;
* '''Semi-supervised '''learning&lt;br /&gt;
* '''Reinforcement '''learning'''.'''&lt;br /&gt;
&lt;br /&gt;
In this series, we will focus on '''Supervised''' and '''Unsupervised''' learning algorithms. &lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Supervised and Unsupervised Learning'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''   '''&lt;br /&gt;
|| Supervised learning: Labeled data &lt;br /&gt;
* ML algorithms predict labels for unseen features &lt;br /&gt;
* They predict based on given features and labels of data.&lt;br /&gt;
&lt;br /&gt;
Unsupervised learning: Unlabeled data&lt;br /&gt;
* ML algorithms develop a mechanism to group similar features into clusters.&lt;br /&gt;
* And label them for future analysis.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slides'''&lt;br /&gt;
&lt;br /&gt;
'''Classification and Regression'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
* Supervised learning consists of Regression and Classification.&lt;br /&gt;
* '''Regression''' is applied to predict and learn continuous-valued responses from features. &lt;br /&gt;
* Regression techniques include Linear, Spline, Ridge, Lasso, and others.&lt;br /&gt;
* '''Classification''' is applied to predict the class of a discrete (labeled) response from features. &lt;br /&gt;
* Classification techniques include Logistic Regression, Decision Tree, SVM, and others.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slides'''&lt;br /&gt;
&lt;br /&gt;
'''Workflow of an ML Classifier algorithm'''&lt;br /&gt;
|| The workflow of an ML classifier algorithm includes:&lt;br /&gt;
* Feature Space: Collection of all possible values of the features.&lt;br /&gt;
* A classification algorithm partitions the feature space into a number of classes.&lt;br /&gt;
* Data is split into training and testing sets to learn and evaluate the algorithm.&lt;br /&gt;
* The model learns from the training data to create partitions of feature space.&lt;br /&gt;
* The model is evaluated on the test dataset through performance metrics.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Dataset'''&lt;br /&gt;
&lt;br /&gt;
|| Let’s use '''Raisin dataset '''with two chosen variables to understand a classification problem.&lt;br /&gt;
&lt;br /&gt;
For more information on Raisin data please refer to Additional Reading Material on this tutorial page.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide '''&lt;br /&gt;
&lt;br /&gt;
'''Download Files '''&lt;br /&gt;
|| We will use a script file '''Intro.R '''and '''Raisin Dataset ‘raisin.xlsx’'''&lt;br /&gt;
&lt;br /&gt;
Please download these files from the''' Code files''' link of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Make a copy and then use them while practicing.&lt;br /&gt;
|- &lt;br /&gt;
|| [Computer screen]&lt;br /&gt;
&lt;br /&gt;
point to '''Intro.R''' and the folder '''Introduction.'''&lt;br /&gt;
&lt;br /&gt;
Point to the''' MLProject folder '''on the '''Desktop.'''&lt;br /&gt;
&lt;br /&gt;
|| I have downloaded and moved these files to the '''Introduction '''folder. &lt;br /&gt;
&lt;br /&gt;
This folder is located in the '''MLProject''' folder on my '''Desktop'''.&lt;br /&gt;
&lt;br /&gt;
I have also set the '''Introduction''' folder as my working Directory.&lt;br /&gt;
&lt;br /&gt;
In this tutorial, we will introduce classification on the '''raisin''' dataset. &lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us switch to '''RStudio'''. &lt;br /&gt;
|- &lt;br /&gt;
|| Click Intro.R in RStudio&lt;br /&gt;
&lt;br /&gt;
Point to Intro.R in RStudio.&lt;br /&gt;
|| Let us open the script '''Intro.R''' in '''RStudio'''.&lt;br /&gt;
&lt;br /&gt;
Script '''Intro.R''' opens in '''RStudio'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' library(readxl)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
'''&amp;lt;nowiki&amp;gt;#install.packages(&amp;quot;package_name&amp;quot;)&amp;lt;/nowiki&amp;gt;'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Point to the command.'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select and run these commands to import the packages.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''readxl''' package to load the Excel file of our '''Raisin Dataset'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''caret''' package to create the '''confusion matrix.'''&lt;br /&gt;
&lt;br /&gt;
The '''ggplot2''' package will be used to create the '''decision boundary plot.'''&lt;br /&gt;
&lt;br /&gt;
Please ensure that all the packages are installed correctly.&lt;br /&gt;
&lt;br /&gt;
As I have already installed the packages, I have imported them directly. &lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' '''&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;- read_xlsx(&amp;quot;Raisin.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
|| Run this command to load the '''Raisin '''dataset.&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the '''Environment''' tab clearly.&lt;br /&gt;
&lt;br /&gt;
In the Environment tab below Data, you will see the '''data '''variable.&lt;br /&gt;
&lt;br /&gt;
Click on '''data '''to load the dataset in the Source window. &lt;br /&gt;
&lt;br /&gt;
Click on '''Intro.R''' in the Source window and close the tab.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command.&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;-data[c(&amp;quot;minorAL&amp;quot;,&amp;quot;ecc&amp;quot;,&amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
Select the commands and click the Run button&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We now select three columns from data.&lt;br /&gt;
&lt;br /&gt;
2 columns (&amp;quot;minorAL&amp;quot;, &amp;quot;ecc&amp;quot;) are chosen as features.&lt;br /&gt;
&lt;br /&gt;
The class column is chosen as a target variable.&lt;br /&gt;
&lt;br /&gt;
We convert the target variable '''data$class '''to a factor. &lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|- &lt;br /&gt;
|| Click on the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click on '''data.'''&lt;br /&gt;
|| Click on '''data '''to load the modified data in the Source window.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| We will now understand the feature space of this data.&lt;br /&gt;
|- &lt;br /&gt;
|| '''range_minor_al &amp;lt;- range(data$minorAL)'''&lt;br /&gt;
&lt;br /&gt;
'''range_ecc &amp;lt;- range(data$ecc)'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''range_minor_al &amp;lt;- range(data$minorAL)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''range_ecc &amp;lt;- range(data$ecc)'''&lt;br /&gt;
|| These commands show the range of the feature variables '''minorAL''' and''' ecc.'''&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the environment tab clearly.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The minimum and maximum values of '''minorAL''' and '''ecc''' are shown in their range variables.&lt;br /&gt;
|- &lt;br /&gt;
|| '''X &amp;lt;- seq(min(data$minorAL), max(data$minorAL), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''Y &amp;lt;- seq(min(data$ecc), max(data$ecc), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''feature &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
&lt;br /&gt;
|| We will now use the range to generate grid points to construct the feature space.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''X &amp;lt;- seq(min(data$minorAL), max(data$minorAL), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''Y &amp;lt;- seq(min(data$ecc), max(data$ecc), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
|| These commands generate sequences of points spanning the ranges of '''minorAL '''and''' ecc'''.&lt;br /&gt;
&lt;br /&gt;
This command takes the Cartesian product of the two sequences to create the feature space.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|-&lt;br /&gt;
|  | '''ggplot(data = data, aes(x = minorAL, y = ecc)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(aes(color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Feature Space&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| We will now plot the feature space created.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| '''ggplot(data = data, aes(x = minorAL, y = ecc)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(aes(color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Feature Space&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
|| These commands plot the data points in the feature space.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|-&lt;br /&gt;
|  | Drag boundaries.&lt;br /&gt;
|| Drag boundaries to see the plot window clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the data.&lt;br /&gt;
|| Now let us split our data into training and testing data.&lt;br /&gt;
|-&lt;br /&gt;
|  | [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Click on '''Intro.R''' in the Source window, and type these commands.&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  | Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| This creates training data, consisting of 630 unique rows.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This creates testing data, consisting of 270 unique rows.&lt;br /&gt;
|-&lt;br /&gt;
|| Select the commands and click the Run button.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to the sets in the Environment Tab&lt;br /&gt;
&lt;br /&gt;
Click the '''train_data '''and '''test_data '''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
The data sets are shown in the '''Environment '''tab.&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the Environment window clearly&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Click on '''train_data '''and '''test_data '''to load them in the Source window.&lt;br /&gt;
|-&lt;br /&gt;
|| &lt;br /&gt;
|| Here we try to partition the '''feature space''' to construct the classifier.&lt;br /&gt;
&lt;br /&gt;
To begin with, one might construct a '''heuristic '''line to build the classifier.&lt;br /&gt;
|- &lt;br /&gt;
|| [Rstudio]&lt;br /&gt;
&lt;br /&gt;
'''fit = function(x)((x * (-0.0021)) + 1.445)'''&lt;br /&gt;
&lt;br /&gt;
'''model_predict &amp;lt;- function(x){'''&lt;br /&gt;
&lt;br /&gt;
'''factor(ifelse(x$ecc &amp;lt; fit(x$minorAL), &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window, type these commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''fit = function(x)((x * (-0.0021)) + 1.445)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''model_predict &amp;lt;- function(x){'''&lt;br /&gt;
&lt;br /&gt;
'''factor(ifelse(x$ecc &amp;lt; fit(x$minorAL), &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
Click Save and Click Run buttons. &lt;br /&gt;
|| Let us describe the steps of the classification algorithm.&lt;br /&gt;
&lt;br /&gt;
For that we will define a line to partition the data as a dummy classifier.&lt;br /&gt;
&lt;br /&gt;
It doesn’t involve the training data, so its performance may be poor.&lt;br /&gt;
&lt;br /&gt;
We define a function that separates data points belonging to either side of the line.&lt;br /&gt;
&lt;br /&gt;
Click Save.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''feature$class &amp;lt;- model_predict(feature)'''&lt;br /&gt;
&lt;br /&gt;
'''feature$classnum &amp;lt;- as.numeric(feature$class)'''&lt;br /&gt;
&lt;br /&gt;
|| Let’s use the line to classify the feature space and draw the decision boundary.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature$class &amp;lt;- model_predict(feature)'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature$classnum &amp;lt;- as.numeric(feature$class)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
This command will use the line created to predict the class of every point in the grid of feature space.&lt;br /&gt;
&lt;br /&gt;
This command encodes the class string labels as numbers suitable for plotting.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Click on '''feature''' in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Point to the data in the Source window.&lt;br /&gt;
|| Drag boundary to see the Environment window.&lt;br /&gt;
&lt;br /&gt;
Click on '''feature '''in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
The '''feature set '''with the predicted classes loads in the source window.&lt;br /&gt;
|- &lt;br /&gt;
|| '''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data= feature, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_abline(slope = -0.0021, intercept = 1.445, size = 1.2)+'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Data Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data= feature, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_abline(slope = -0.0021, intercept = 1.445, size = 1.2)+'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Data Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We are visualising the feature space and the partition line using '''ggplot2'''. &lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the plot window.&lt;br /&gt;
|| Drag boundary to see the plot window clearly.&lt;br /&gt;
&lt;br /&gt;
The overall plot shows that the chosen line approximately separates the training data classes.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
'''prediction_test = model_predict(test_data)'''&lt;br /&gt;
|| Let us see how well the partition performs on the testing dataset.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''prediction_test = model_predict(test_data)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We predict the classes from testing data and store it in the '''prediction_test '''variable.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us now measure the performance of the classification.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix &amp;lt;- confusionMatrix(test_data$class,prediction_test)'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window, type the command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix &amp;lt;- confusionMatrix(test_data$class,prediction_test)'''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| We use the '''confusionMatrix''' function from the '''caret''' package to calculate performance metrics.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
|- &lt;br /&gt;
|| '''test_confusion_matrix$overall[&amp;quot;Accuracy&amp;quot;]'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window, type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$overall[&amp;quot;Accuracy&amp;quot;]'''&lt;br /&gt;
|| It fetches the accuracy metric from the list created.&lt;br /&gt;
&lt;br /&gt;
Select and run the command&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Drag boundary to see the console window clearly&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''Accuracy'''&lt;br /&gt;
&lt;br /&gt;
0.6962963&lt;br /&gt;
&lt;br /&gt;
|| The accuracy on the testing dataset is about 69%.&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the source window clearly&lt;br /&gt;
&lt;br /&gt;
|| Drag boundary to see the source window clearly&lt;br /&gt;
&lt;br /&gt;
Let us now view the confusion matrix of the testing dataset&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| Select and run the command.&lt;br /&gt;
&lt;br /&gt;
The output is seen in the '''console''' window&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Point the output in the '''console window'''&lt;br /&gt;
&lt;br /&gt;
Reference&lt;br /&gt;
&lt;br /&gt;
Prediction Besni Kecimen&lt;br /&gt;
&lt;br /&gt;
Besni 50 82&lt;br /&gt;
&lt;br /&gt;
Kecimen 0 138&lt;br /&gt;
&lt;br /&gt;
|| Drag boundary to see the console window clearly &lt;br /&gt;
&lt;br /&gt;
Observe that: &lt;br /&gt;
&lt;br /&gt;
0 samples of class Besni have been incorrectly classified.&lt;br /&gt;
&lt;br /&gt;
82 samples of class Kecimen have been incorrectly classified. &lt;br /&gt;
&lt;br /&gt;
We can see that our partition line is skewed.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| For the same problem many partitions can be drawn.&lt;br /&gt;
&lt;br /&gt;
We can choose a complicated partition to reduce train misclassification error.&lt;br /&gt;
&lt;br /&gt;
But there will be no control over the error on test data.&lt;br /&gt;
&lt;br /&gt;
We can aim to choose a classifier which is simple with a smaller test misclassification error.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| With this, we come to the end of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Let us summarize.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Summary&lt;br /&gt;
|| In this tutorial we have learned about:&lt;br /&gt;
* Machine Learning&lt;br /&gt;
* Classification and Regression Problems&lt;br /&gt;
* Workflow of an ML Classifier Algorithm&lt;br /&gt;
* Visualizing Feature Space&lt;br /&gt;
* Constructing a dummy classifier&lt;br /&gt;
* Evaluation of an ML algorithm&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Here is an assignment for you.&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Assignment&lt;br /&gt;
|| &lt;br /&gt;
* Use a vertical line as a classifier to partition the feature space.&lt;br /&gt;
* Plot the decision boundary for the same.&lt;br /&gt;
* Evaluate the classifier on the test dataset&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
About the Spoken Tutorial Project&lt;br /&gt;
|| The video at the following link summarizes the Spoken Tutorial project. &lt;br /&gt;
&lt;br /&gt;
Please download and watch it.&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Workshops&lt;br /&gt;
|| We conduct workshops using Spoken Tutorials and give certificates.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please contact us.&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Forum to answer questions&lt;br /&gt;
&lt;br /&gt;
Do you have questions in THIS Spoken Tutorial?&lt;br /&gt;
&lt;br /&gt;
Choose the minute and second where you have the question.&lt;br /&gt;
&lt;br /&gt;
Explain your question briefly.&lt;br /&gt;
&lt;br /&gt;
Someone from our team will answer them.&lt;br /&gt;
&lt;br /&gt;
Please visit this site.&lt;br /&gt;
|| Please post your timed queries in this forum.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Forum to answer questions&lt;br /&gt;
|| Do you have any general/technical questions?&lt;br /&gt;
&lt;br /&gt;
Please visit the forum given in the link.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
R Activities&lt;br /&gt;
&lt;br /&gt;
|| The FOSSEE team coordinates the Textbook Companion, Lab Migration and the Case Study Projects.&lt;br /&gt;
&lt;br /&gt;
We give certificates to those who do this.&lt;br /&gt;
&lt;br /&gt;
For more details, please visit the website.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Acknowledgment&lt;br /&gt;
|| The '''Spoken Tutorial''' project was established by the Ministry of Education, Govt. of India.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Thank You&lt;br /&gt;
|| This tutorial is contributed by Debatosh Chakraborty from IIT Bombay.&lt;br /&gt;
&lt;br /&gt;
Thank you for joining.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Ushav</name></author>	</entry>

	<entry>
		<id>https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English</id>
		<title>Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English</title>
		<link rel="alternate" type="text/html" href="https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English"/>
				<updated>2024-06-04T08:34:47Z</updated>
		
		<summary type="html">&lt;p&gt;Ushav: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title of the script''': Introduction to Machine Learning in R&lt;br /&gt;
&lt;br /&gt;
'''Author''': Debatosh Chakraborty&lt;br /&gt;
&lt;br /&gt;
'''Keywords''': R, RStudio, machine learning, supervised, unsupervised, video tutorial.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=1&lt;br /&gt;
|- &lt;br /&gt;
| align=center| '''Visual Cue'''&lt;br /&gt;
| align=center| '''Narration'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Opening Slide'''&lt;br /&gt;
|| Welcome to this spoken tutorial on''' Introduction to Machine Learning in R'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Learning Objectives'''&lt;br /&gt;
&lt;br /&gt;
|| In this tutorial, we will learn about: &lt;br /&gt;
* Machine Learning&lt;br /&gt;
* Supervised and Unsupervised Learning&lt;br /&gt;
* Workflow of an ML Classifier Algorithm&lt;br /&gt;
* Visualizing Feature Space&lt;br /&gt;
* Constructing a dummy classifier&lt;br /&gt;
* Evaluation of the dummy classifier&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''System Specifications'''&lt;br /&gt;
|| This tutorial is recorded using,&lt;br /&gt;
&lt;br /&gt;
* '''Windows 11 '''&lt;br /&gt;
* '''R '''version''' 4.3.0'''&lt;br /&gt;
* '''RStudio''' version '''2023.06.1'''&lt;br /&gt;
&lt;br /&gt;
It is recommended to install '''R''' version '''4.2.0''' or higher.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Prerequisites '''&lt;br /&gt;
&lt;br /&gt;
'''https://spoken-tutorial.org'''&lt;br /&gt;
|| To follow this tutorial, the learner should know&lt;br /&gt;
* Basic programming in '''R'''.&lt;br /&gt;
* How to use the '''ggplot2''' and '''dplyr''' packages.&lt;br /&gt;
&lt;br /&gt;
If not, please access the relevant tutorials on this website.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Machine Learning'''&lt;br /&gt;
&lt;br /&gt;
'''   '''&lt;br /&gt;
&lt;br /&gt;
|| About machine learning&lt;br /&gt;
&lt;br /&gt;
* ML enables computers to learn from data.&lt;br /&gt;
* ML algorithms automatically learn patterns from data.&lt;br /&gt;
* Their primary role is prediction, classification or clustering of data.&lt;br /&gt;
* ML algorithms are applied in several applications.&lt;br /&gt;
* For example Natural Language Processing, Image and speech recognition, etc.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Types of Machine Learning''' &lt;br /&gt;
|| ML algorithms include the following types and tasks: &lt;br /&gt;
* '''Supervised '''learning: Prediction and Classification''',''' &lt;br /&gt;
* '''Unsupervised '''learning''': '''Clustering''','''&lt;br /&gt;
* '''Semi-supervised '''learning&lt;br /&gt;
* '''Reinforcement '''learning'''.'''&lt;br /&gt;
&lt;br /&gt;
In this series, we will focus on '''Supervised''' and '''Unsupervised''' learning algorithms. &lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Supervised and Unsupervised Learning'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''   '''&lt;br /&gt;
|| Supervised learning: Labeled data &lt;br /&gt;
* ML algorithms predict labels for unseen features &lt;br /&gt;
* They predict based on given features and labels of data.&lt;br /&gt;
&lt;br /&gt;
Unsupervised learning: Unlabeled data&lt;br /&gt;
* ML algorithms develop a mechanism to group similar features into clusters.&lt;br /&gt;
* And label them for future analysis.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slides'''&lt;br /&gt;
&lt;br /&gt;
'''Classification and Regression'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
* Supervised learning consists of Regression and Classification.&lt;br /&gt;
* '''Regression''' is applied to predict and learn continuous-valued responses from features. &lt;br /&gt;
* Regression techniques include Linear, Spline, Ridge, Lasso, and others.&lt;br /&gt;
* '''Classification''' is applied to predict the class of a discrete (labeled) response from features. &lt;br /&gt;
* Classification techniques include Logistic Regression, Decision Tree, SVM, and others.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slides'''&lt;br /&gt;
&lt;br /&gt;
'''Workflow of an ML Classifier algorithm'''&lt;br /&gt;
|| The Workflow of an ML Classifier algorithm&lt;br /&gt;
* Feature Space: Collection of all possible values of the features.&lt;br /&gt;
* A classification algorithm partitions the feature space into a number of classes.&lt;br /&gt;
* Data is split into training and testing sets to learn and evaluate the algorithm.&lt;br /&gt;
* The model learns from the training data to create partitions of feature space.&lt;br /&gt;
* The model is evaluated on the test dataset through performance metrics.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Dataset'''&lt;br /&gt;
&lt;br /&gt;
|| Let’s use '''Raisin dataset '''with two chosen variables to understand a classification problem.&lt;br /&gt;
&lt;br /&gt;
For more information on Raisin data please refer to Additional Reading Material on this tutorial page.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide '''&lt;br /&gt;
&lt;br /&gt;
'''Download Files '''&lt;br /&gt;
|| We will use a script file '''Intro.R''' and the '''Raisin Dataset ‘Raisin.xlsx’'''&lt;br /&gt;
&lt;br /&gt;
Please download these files from the''' Code files''' link of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Make a copy and then use them while practicing.&lt;br /&gt;
|- &lt;br /&gt;
|| [Computer screen]&lt;br /&gt;
&lt;br /&gt;
point to '''Intro.R''' and the folder '''Introduction.'''&lt;br /&gt;
&lt;br /&gt;
Point to the''' MLProject folder '''on the '''Desktop.'''&lt;br /&gt;
&lt;br /&gt;
|| I have downloaded and moved these files to the '''Introduction '''folder. &lt;br /&gt;
&lt;br /&gt;
This folder is located in the '''MLProject''' folder on my '''Desktop'''.&lt;br /&gt;
&lt;br /&gt;
I have also set the '''Introduction''' folder as my working Directory.&lt;br /&gt;
&lt;br /&gt;
In this tutorial, we will introduce classification on the '''raisin''' dataset. &lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us switch to '''RStudio'''. &lt;br /&gt;
|- &lt;br /&gt;
|| Click Intro.R in RStudio&lt;br /&gt;
&lt;br /&gt;
Point to Intro.R in RStudio.&lt;br /&gt;
|| Let us open the script '''Intro.R''' in '''RStudio'''.&lt;br /&gt;
&lt;br /&gt;
Script '''Intro.R''' opens in '''RStudio'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' library(readxl)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
'''&amp;lt;nowiki&amp;gt;#install.packages(“package_name”)&amp;lt;/nowiki&amp;gt;'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Point to the command.'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select and run these commands to import the packages.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''readxl''' package to load the Excel file of our '''Raisin Dataset'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''caret''' package to create the '''confusion matrix.'''&lt;br /&gt;
&lt;br /&gt;
The '''ggplot2''' package will be used to create the '''decision boundary plot.'''&lt;br /&gt;
&lt;br /&gt;
Please ensure that all the packages are installed correctly.&lt;br /&gt;
&lt;br /&gt;
As I have already installed the packages, I have imported them directly. &lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' '''&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;- read_xlsx(&amp;quot;Raisin.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
|| Run this command to load the '''Raisin '''dataset.&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the '''Environment''' tab clearly.&lt;br /&gt;
&lt;br /&gt;
In the Environment tab below Data, you will see the '''data '''variable.&lt;br /&gt;
&lt;br /&gt;
Click on '''data '''to load the dataset in the Source window. &lt;br /&gt;
&lt;br /&gt;
Click on '''Intro.R''' in the Source window and close the tab.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command.&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;-data[c(&amp;quot;minorAL&amp;quot;,&amp;quot;ecc&amp;quot;,&amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
Select the commands and click the Run button&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We now select three columns from data.&lt;br /&gt;
&lt;br /&gt;
2 columns (&amp;quot;minorAL&amp;quot;, &amp;quot;ecc&amp;quot;) are chosen as features.&lt;br /&gt;
&lt;br /&gt;
The class column is chosen as a target variable.&lt;br /&gt;
&lt;br /&gt;
We convert the target variable '''data$class '''to a factor. &lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|- &lt;br /&gt;
|| Click on the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click on '''data.'''&lt;br /&gt;
|| Click on '''data '''to load the modified data in the Source window.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| We will now understand the feature space of this data.&lt;br /&gt;
|- &lt;br /&gt;
|| '''range_minor_al &amp;lt;- range(data$minorAL)'''&lt;br /&gt;
&lt;br /&gt;
'''range_ecc &amp;lt;- range(data$ecc)'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''range_minor_al &amp;lt;- range(data$minorAL)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''range_ecc &amp;lt;- range(data$ecc)'''&lt;br /&gt;
|| These commands show the range of the feature variables '''minorAL''' and''' ecc.'''&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the environment tab clearly.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The minimum and maximum values of '''minorAL''' and '''ecc''' are shown in their range variables.&lt;br /&gt;
|- &lt;br /&gt;
|| '''X &amp;lt;- seq(min(data$minorAL), max(data$minorAL), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''Y &amp;lt;- seq(min(data$ecc), max(data$ecc), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''feature &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
&lt;br /&gt;
|| We will now use the range to generate grid points to construct the feature space.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''X &amp;lt;- seq(min(data$minorAL), max(data$minorAL), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''Y &amp;lt;- seq(min(data$ecc), max(data$ecc), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
|| These commands generate sequences of points spanning the ranges of '''minorAL '''and''' ecc'''.&lt;br /&gt;
&lt;br /&gt;
This command takes the Cartesian product of the two sequences to create the feature space.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|-&lt;br /&gt;
|  | '''ggplot(data = data, aes(x = minorAL, y = ecc)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(aes(color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Feature Space&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| We will now plot the feature space created&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| '''ggplot(data = data, aes(x = minorAL, y = ecc)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(aes(color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Feature Space&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
|| These commands plot the data points in the feature space.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|-&lt;br /&gt;
|  | Drag boundaries.&lt;br /&gt;
|| Drag boundaries to see the plot window clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the data.&lt;br /&gt;
|| Now let us split our data into training and testing data.&lt;br /&gt;
|-&lt;br /&gt;
|  | [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Click on '''Intro.R''' in the Source window, and type these commands.&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  | Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| This creates training data, consisting of 630 unique rows.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This creates testing data, consisting of 270 unique rows.&lt;br /&gt;
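The 630/270 counts follow from a 70/30 split; a sketch of the arithmetic (the 900-row total is assumed from the counts quoted in this narration):

```r
# Sketch: a 70/30 split of 900 rows (total assumed from the narration)
n = 900
train_n = floor(0.7 * n)   # 630 rows for training
test_n = n - train_n       # 270 rows for testing
c(train_n, test_n)
```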
|-&lt;br /&gt;
|| Select the commands and click the Run button.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to the sets in the Environment Tab&lt;br /&gt;
&lt;br /&gt;
Click the '''train_data '''and '''test_data '''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
The data sets are shown in the '''Environment '''tab.&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the Environment window clearly&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Click on '''train_data '''and '''test_data '''to load them in the Source window.&lt;br /&gt;
|-&lt;br /&gt;
|| &lt;br /&gt;
|| Here we try to partition the '''feature space''' to construct the classifier.&lt;br /&gt;
&lt;br /&gt;
To begin with, one might construct a '''heuristic '''line to build the classifier.&lt;br /&gt;
|- &lt;br /&gt;
|| [Rstudio]&lt;br /&gt;
&lt;br /&gt;
'''fit = function(x)((x * (-0.0021)) + 1.445)'''&lt;br /&gt;
&lt;br /&gt;
'''model_predict &amp;lt;- function(x){'''&lt;br /&gt;
&lt;br /&gt;
'''factor(ifelse(x$ecc &amp;lt; fit(x$minorAL), &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window, type these commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''fit = function(x)((x * (-0.0021)) + 1.445)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''model_predict &amp;lt;- function(x){'''&lt;br /&gt;
&lt;br /&gt;
'''factor(ifelse(x$ecc &amp;lt; fit(x$minorAL), &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
Click Save and Click Run buttons. &lt;br /&gt;
|| Let us describe the steps of the classification algorithm.&lt;br /&gt;
&lt;br /&gt;
For that we will define a line to partition the data as a dummy classifier.&lt;br /&gt;
&lt;br /&gt;
It doesn’t involve the training data, so its performance may be poor.&lt;br /&gt;
&lt;br /&gt;
We define a function that separates data points belonging to either side of the line.&lt;br /&gt;
&lt;br /&gt;
Click Save.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''feature$class &amp;lt;- model_predict(feature)'''&lt;br /&gt;
&lt;br /&gt;
'''feature$classnum &amp;lt;- as.numeric(feature$class)'''&lt;br /&gt;
&lt;br /&gt;
|| Let’s use the line to classify the feature space and draw the decision boundary.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature$class &amp;lt;- model_predict(feature)'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature$classnum &amp;lt;- as.numeric(feature$class)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
This command will use the line created to predict the class of every point in the grid of feature space.&lt;br /&gt;
&lt;br /&gt;
This command encodes the class string labels into numbers suitable for plotting.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
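The factor-to-number encoding works because '''as.numeric''' on a factor returns its integer level codes; a minimal sketch:

```r
# Sketch: as.numeric on a factor gives its integer level codes
f = factor(c("Besni", "Kecimen", "Besni"))
as.numeric(f)   # 1 2 1, since factor levels sort alphabetically
```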
|- &lt;br /&gt;
|| Click on '''feature''' in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Point to the data in the Source window.&lt;br /&gt;
|| Drag boundary to see the Environment window.&lt;br /&gt;
&lt;br /&gt;
Click on '''feature '''in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
The '''feature set '''with the predicted classes loads in the source window.&lt;br /&gt;
|- &lt;br /&gt;
|| '''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data= feature, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_abline(slope = -0.0021, intercept = 1.445, size = 1.2)+'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Data Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data= feature, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_abline(slope = -0.0021, intercept = 1.445, size = 1.2)+'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Data Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We are visualising the feature space and the partition line using ggplot2. &lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the plot window.&lt;br /&gt;
|| Drag boundary to see the plot window clearly.&lt;br /&gt;
&lt;br /&gt;
The overall plot shows that the chosen line approximately separates the training data classes.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
'''prediction_test = model_predict(test_data)'''&lt;br /&gt;
|| Let us see how well the partition performs on the testing dataset.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''prediction_test = model_predict(test_data)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We predict the classes for the testing data and store them in the '''prediction_test '''variable.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us now measure the performance of the classification.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix &amp;lt;- confusionMatrix(test_data$class,prediction_test)'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window, type the command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix &amp;lt;- confusionMatrix(test_data$class,prediction_test)'''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| We use the '''confusionMatrix''' function from the '''caret''' package to calculate performance metrics.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
|- &lt;br /&gt;
|| '''test_confusion_matrix$overall[&amp;quot;Accuracy&amp;quot;]'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window, type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$overall[&amp;quot;Accuracy&amp;quot;]'''&lt;br /&gt;
|| It fetches the accuracy metric from the confusion matrix object created.&lt;br /&gt;
&lt;br /&gt;
Select and run the command&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Drag boundary to see the console window clearly&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''Accuracy'''&lt;br /&gt;
&lt;br /&gt;
0.6962963&lt;br /&gt;
&lt;br /&gt;
|| The accuracy on the testing dataset is approximately 69%.&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the source window clearly&lt;br /&gt;
&lt;br /&gt;
|| Drag boundary to see the source window clearly&lt;br /&gt;
&lt;br /&gt;
Let us now view the confusion matrix of the testing dataset&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| Select and run the command.&lt;br /&gt;
&lt;br /&gt;
The output is seen in the '''console''' window.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Point the output in the '''console window'''&lt;br /&gt;
&lt;br /&gt;
Reference&lt;br /&gt;
&lt;br /&gt;
Prediction Besni Kecimen&lt;br /&gt;
&lt;br /&gt;
Besni 50 82&lt;br /&gt;
&lt;br /&gt;
Kecimen 0 138&lt;br /&gt;
&lt;br /&gt;
|| Drag boundary to see the console window clearly &lt;br /&gt;
&lt;br /&gt;
Observe that: &lt;br /&gt;
&lt;br /&gt;
0 samples of class Besni have been incorrectly classified.&lt;br /&gt;
&lt;br /&gt;
82 samples of class Kecimen have been incorrectly classified. &lt;br /&gt;
&lt;br /&gt;
We can see that our partition line is skewed.&lt;br /&gt;
&lt;br /&gt;
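The 69% accuracy quoted earlier can be checked directly from these counts; a sketch of the arithmetic:

```r
# Sketch: accuracy from the confusion matrix counts shown above
correct = 50 + 138          # diagonal: correctly classified samples
total = 50 + 82 + 0 + 138   # all 270 testing samples
round(correct / total, 7)   # 0.6962963
```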
|- &lt;br /&gt;
|| &lt;br /&gt;
|| For the same problem many partitions can be drawn.&lt;br /&gt;
&lt;br /&gt;
We can choose a complicated partition to reduce train misclassification error.&lt;br /&gt;
&lt;br /&gt;
But then there would be no control over the misclassification error on test data.&lt;br /&gt;
&lt;br /&gt;
We can aim to choose a classifier which is simple with a smaller test misclassification error.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| With this, we come to the end of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Let us summarize.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Summary&lt;br /&gt;
|| In this tutorial we have learned about:&lt;br /&gt;
* Machine Learning&lt;br /&gt;
* Classification and Regression Problems&lt;br /&gt;
* Workflow of an ML Classifier Algorithm&lt;br /&gt;
* Visualizing Feature Space&lt;br /&gt;
* Constructing a dummy classifier&lt;br /&gt;
* Evaluation of an ML algorithm&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Here is an assignment for you.&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Assignment&lt;br /&gt;
|| &lt;br /&gt;
*Use a vertical line as a classifier to partition the feature space.&lt;br /&gt;
* Plot the decision boundary for the same.&lt;br /&gt;
* Evaluate the classifier on the test dataset&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
About the Spoken Tutorial Project&lt;br /&gt;
|| The video at the following link summarizes the Spoken Tutorial project. &lt;br /&gt;
&lt;br /&gt;
Please download and watch it.&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Workshops&lt;br /&gt;
|| We conduct workshops using Spoken Tutorials and give certificates.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please contact us.&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Forum to answer questions&lt;br /&gt;
&lt;br /&gt;
Do you have questions in THIS Spoken Tutorial?&lt;br /&gt;
&lt;br /&gt;
Choose the minute and second where you have the question.&lt;br /&gt;
&lt;br /&gt;
Explain your question briefly.&lt;br /&gt;
&lt;br /&gt;
Someone from our team will answer them.&lt;br /&gt;
&lt;br /&gt;
Please visit this site.&lt;br /&gt;
|| Please post your timed queries in this forum.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Forum to answer questions&lt;br /&gt;
|| Do you have any general/technical questions?&lt;br /&gt;
&lt;br /&gt;
Please visit the forum given in the link.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
R Activities&lt;br /&gt;
&lt;br /&gt;
|| The FOSSEE team coordinates the Textbook Companion, Lab Migration and the Case Study Projects.&lt;br /&gt;
&lt;br /&gt;
We give certificates to those who do this.&lt;br /&gt;
&lt;br /&gt;
For more details, please visit the website.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Acknowledgment&lt;br /&gt;
|| The '''Spoken Tutorial''' project was established by the Ministry of Education, Government of India.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Thank You&lt;br /&gt;
|| This tutorial is contributed by Debatosh Chakraborty from IIT Bombay.&lt;br /&gt;
&lt;br /&gt;
Thank you for joining.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Ushav</name></author>	</entry>

	<entry>
		<id>https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English</id>
		<title>Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English</title>
		<link rel="alternate" type="text/html" href="https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English"/>
				<updated>2024-06-04T07:09:53Z</updated>
		
		<summary type="html">&lt;p&gt;Ushav: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title of the script''': Introduction to Machine Learning in R&lt;br /&gt;
&lt;br /&gt;
'''Author''': Debatosh Chakraborty&lt;br /&gt;
&lt;br /&gt;
'''Keywords''': R, RStudio, machine learning, supervised, unsupervised, video tutorial.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=1&lt;br /&gt;
|- &lt;br /&gt;
| align=center| '''Visual Cue'''&lt;br /&gt;
| align=center| '''Narration'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Opening Slide'''&lt;br /&gt;
|| Welcome to this spoken tutorial on''' Introduction to Machine Learning in R'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Learning Objectives'''&lt;br /&gt;
&lt;br /&gt;
|| In this tutorial, we will learn about: &lt;br /&gt;
* Machine Learning&lt;br /&gt;
* Supervised and Unsupervised Learning&lt;br /&gt;
* Workflow of an ML Classifier Algorithm&lt;br /&gt;
* Visualizing Feature Space&lt;br /&gt;
* Constructing a dummy classifier&lt;br /&gt;
* Evaluation of an ML algorithm&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''System Specifications'''&lt;br /&gt;
|| This tutorial is recorded using,&lt;br /&gt;
&lt;br /&gt;
* '''Windows 11 '''&lt;br /&gt;
* '''R '''version''' 4.3.0'''&lt;br /&gt;
* '''RStudio''' version '''2023.06.1'''&lt;br /&gt;
&lt;br /&gt;
It is recommended to install '''R''' version '''4.2.0''' or higher.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Prerequisites '''&lt;br /&gt;
&lt;br /&gt;
'''https://spoken-tutorial.org'''&lt;br /&gt;
|| To follow this tutorial, the learner should know&lt;br /&gt;
* Basic programming in '''R'''.&lt;br /&gt;
* How to use the ggplot2 and dplyr packages.&lt;br /&gt;
&lt;br /&gt;
If not, please access the relevant tutorials on this website.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Machine Learning'''&lt;br /&gt;
&lt;br /&gt;
'''   '''&lt;br /&gt;
&lt;br /&gt;
|| About machine learning&lt;br /&gt;
&lt;br /&gt;
* ML enables computers to learn from data.&lt;br /&gt;
* ML algorithms automate the learning process from data through patterns.&lt;br /&gt;
* Their primary role is prediction, classification or clustering of data.&lt;br /&gt;
* ML algorithms are applied in several applications.&lt;br /&gt;
* For example Natural Language Processing, Image and speech recognition, etc.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Types of Machine Learning''' &lt;br /&gt;
|| ML algorithms include the following types and tasks: &lt;br /&gt;
* '''Supervised '''learning: Prediction and Classification''',''' &lt;br /&gt;
* '''Unsupervised '''learning''': '''Clustering''','''&lt;br /&gt;
* '''Semi-supervised '''learning&lt;br /&gt;
* '''Reinforcement '''learning'''.'''&lt;br /&gt;
&lt;br /&gt;
In this series, we will focus on '''Supervised''' and '''Unsupervised''' learning algorithms. &lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Supervised and Unsupervised Learning'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''   '''&lt;br /&gt;
|| Supervised learning: Labeled data &lt;br /&gt;
* ML algorithms predict labels for unseen features &lt;br /&gt;
* They predict based on given features and labels of data.&lt;br /&gt;
&lt;br /&gt;
Unsupervised learning: Unlabeled data&lt;br /&gt;
* ML algorithms develop a mechanism to group similar features into clusters.&lt;br /&gt;
* And label them for future analysis.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slides'''&lt;br /&gt;
&lt;br /&gt;
'''Classification and Regression'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
* Supervised learning consists of Regression and Classification.&lt;br /&gt;
* '''Regression''' is applied to predict and learn continuous-valued responses from features. &lt;br /&gt;
* Regression techniques include Linear, Spline, Ridge, Lasso, and others.&lt;br /&gt;
* '''Classification''' is applied to predict the class of a discrete (labeled) response from features. &lt;br /&gt;
* Classification techniques include Logistic Regression, Decision Tree, SVM, and others.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slides'''&lt;br /&gt;
&lt;br /&gt;
'''Workflow of an ML Classifier algorithm'''&lt;br /&gt;
|| The Workflow of an ML Classifier algorithm&lt;br /&gt;
* Feature Space: Collection of all possible values of the features.&lt;br /&gt;
* A classification algorithm partitions the feature space into a number of classes.&lt;br /&gt;
* Data is split into training and testing sets to learn and evaluate the algorithm.&lt;br /&gt;
* The model learns from the training data to create partitions of feature space.&lt;br /&gt;
* The model is evaluated on the test dataset through performance metrics.&lt;br /&gt;
&lt;br /&gt;
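The workflow steps above can be sketched as one generic helper (the function names, arguments and the 70/30 ratio are illustrative placeholders, not from this tutorial):

```r
# Sketch of the generic workflow: split, fit, predict, score
# (fit_fn and predict_fn are hypothetical placeholders)
run_classifier = function(data, fit_fn, predict_fn, seed = 1) {
  set.seed(seed)
  idx = sample(seq_len(nrow(data)), size = floor(0.7 * nrow(data)))
  model = fit_fn(data[idx, ])                 # learn from the training split
  preds = predict_fn(model, data[-idx, ])     # predict on the testing split
  mean(preds == data[-idx, ]$class)           # accuracy metric
}
```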
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Dataset'''&lt;br /&gt;
&lt;br /&gt;
|| Let’s use the '''Raisin dataset''' with two chosen variables to understand a classification problem.&lt;br /&gt;
&lt;br /&gt;
For more information on Raisin data please refer to Additional Reading Material on this tutorial page.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide '''&lt;br /&gt;
&lt;br /&gt;
'''Download Files '''&lt;br /&gt;
|| We will use a script file '''Intro.R '''and '''Raisin Dataset ‘raisin.xlsx’'''&lt;br /&gt;
&lt;br /&gt;
Please download these files from the''' Code files''' link of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Make a copy and then use them while practicing.&lt;br /&gt;
|- &lt;br /&gt;
|| [Computer screen]&lt;br /&gt;
&lt;br /&gt;
point to '''Intro.R''' and the folder '''Introduction.'''&lt;br /&gt;
&lt;br /&gt;
Point to the''' MLProject folder '''on the '''Desktop.'''&lt;br /&gt;
&lt;br /&gt;
|| I have downloaded and moved these files to the '''Introduction '''folder. &lt;br /&gt;
&lt;br /&gt;
This folder is located in the '''MLProject''' folder on my '''Desktop'''.&lt;br /&gt;
&lt;br /&gt;
I have also set the '''Introduction''' folder as my working directory.&lt;br /&gt;
&lt;br /&gt;
In this tutorial, we will introduce classification on the '''raisin''' dataset. &lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us switch to '''RStudio'''. &lt;br /&gt;
|- &lt;br /&gt;
|| Click Intro.R in RStudio&lt;br /&gt;
&lt;br /&gt;
Point to Intro.R in RStudio.&lt;br /&gt;
|| Let us open the script '''Intro.R''' in '''RStudio'''.&lt;br /&gt;
&lt;br /&gt;
Script '''Intro.R''' opens in '''RStudio'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' library(readxl)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
'''&amp;lt;nowiki&amp;gt;#install.packages(“package_name”)&amp;lt;/nowiki&amp;gt;'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Point to the command.'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select and run these commands to import the packages.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''readxl''' package to load the excel file of our '''Raisin Dataset'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''caret''' package to create the '''confusion matrix.'''&lt;br /&gt;
&lt;br /&gt;
The '''ggplot2''' package will be used to create the '''decision boundary plot.'''&lt;br /&gt;
&lt;br /&gt;
Please ensure that all the packages are installed correctly.&lt;br /&gt;
&lt;br /&gt;
As I have already installed the packages, I have imported them directly. &lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' '''&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;- read_xlsx(&amp;quot;Raisin.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
|| Run this command to load the '''Raisin '''dataset.&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the '''Environment''' tab clearly.&lt;br /&gt;
&lt;br /&gt;
In the Environment tab below Data, you will see the '''data '''variable.&lt;br /&gt;
&lt;br /&gt;
Click on '''data '''to load the dataset in the Source window. &lt;br /&gt;
&lt;br /&gt;
Click on '''Intro.R''' in the Source window and close the tab.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command.&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;-data[c(&amp;quot;minorAL&amp;quot;,&amp;quot;ecc&amp;quot;,&amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
Select the commands and click the Run button&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We now select three columns from data.&lt;br /&gt;
&lt;br /&gt;
2 columns (&amp;quot;minorAL&amp;quot;, &amp;quot;ecc&amp;quot;) are chosen as features.&lt;br /&gt;
&lt;br /&gt;
The class column is chosen as a target variable.&lt;br /&gt;
&lt;br /&gt;
We convert the target variable '''data$class '''to a factor. &lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|- &lt;br /&gt;
|| Click on the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click on '''data.'''&lt;br /&gt;
|| Click on '''data '''to load the modified data in the Source window.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| We will now understand the feature space of this data.&lt;br /&gt;
|- &lt;br /&gt;
|| '''range_minor_al &amp;lt;- range(data$minorAL)'''&lt;br /&gt;
&lt;br /&gt;
'''range_ecc &amp;lt;- range(data$ecc)'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''range_minor_al &amp;lt;- range(data$minorAL)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''range_ecc &amp;lt;- range(data$ecc)'''&lt;br /&gt;
|| These commands show the range of the feature variables '''minorAL''' and''' ecc.'''&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the environment tab clearly.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The minimum and maximum values of '''minorAL''' and '''ecc''' are shown in their range variables.&lt;br /&gt;
|- &lt;br /&gt;
|| '''X &amp;lt;- seq(min(data$minorAL), max(data$minorAL), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''Y &amp;lt;- seq(min(data$ecc), max(data$ecc), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''feature &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
&lt;br /&gt;
|| We will now use the range to generate grid points to construct the feature space.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''X &amp;lt;- seq(min(data$minorAL), max(data$minorAL), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''Y &amp;lt;- seq(min(data$ecc), max(data$ecc), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
|| These commands generate sequences of 100 points each, spanning the ranges of '''minorAL '''and''' ecc'''.&lt;br /&gt;
&lt;br /&gt;
This command creates a cartesian product of the two features to create a feature space.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
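'''expand.grid''' builds the cartesian product described above; a tiny sketch of its behaviour (values are illustrative):

```r
# Sketch: expand.grid forms every combination of its inputs
g = expand.grid(minorAL = c(1, 2), ecc = c(0.5, 0.9))
nrow(g)   # 4 rows, one per (minorAL, ecc) pair
```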
|-&lt;br /&gt;
|  | '''ggplot(data = data, aes(x = minorAL, y = ecc)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(aes(color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Feature Space&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| We will now plot the feature space created&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| '''ggplot(data = data, aes(x = minorAL, y = ecc)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(aes(color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Feature Space&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
|| These commands plot the data points in the feature space.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|-&lt;br /&gt;
|  | Drag boundaries.&lt;br /&gt;
|| Drag boundaries to see the plot window clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the data.&lt;br /&gt;
|| Now let us split our data into training and testing data.&lt;br /&gt;
|-&lt;br /&gt;
|  | [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Click on '''Intro.R''' in the Source window, and type these commands.&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  | Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| This creates training data, consisting of 630 unique rows.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This creates testing data, consisting of 270 unique rows.&lt;br /&gt;
|-&lt;br /&gt;
|| Select the commands and click the Run button.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to the sets in the Environment Tab&lt;br /&gt;
&lt;br /&gt;
Click the '''train_data '''and '''test_data '''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
The data sets are shown in the '''Environment '''tab.&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the Environment window clearly&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Click on '''train_data '''and '''test_data '''to load them in the Source window.&lt;br /&gt;
|-&lt;br /&gt;
|| &lt;br /&gt;
|| Here we try to partition the '''feature space''' to construct the classifier.&lt;br /&gt;
&lt;br /&gt;
To begin with, one might construct a '''heuristic '''line to build the classifier.&lt;br /&gt;
|- &lt;br /&gt;
|| [Rstudio]&lt;br /&gt;
&lt;br /&gt;
'''fit = function(x)((x * (-0.0021)) + 1.445)'''&lt;br /&gt;
&lt;br /&gt;
'''model_predict &amp;lt;- function(x){'''&lt;br /&gt;
&lt;br /&gt;
'''factor(ifelse(x$ecc &amp;lt; fit(x$minorAL), &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window, type these commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''fit = function(x)((x * (-0.0021)) + 1.445)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''model_predict &amp;lt;- function(x){'''&lt;br /&gt;
&lt;br /&gt;
'''factor(ifelse(x$ecc &amp;lt; fit(x$minorAL), &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
Click Save and Click Run buttons. &lt;br /&gt;
|| Let us describe the steps of the classification algorithm.&lt;br /&gt;
&lt;br /&gt;
For that we will define a line to partition the data as a dummy classifier.&lt;br /&gt;
&lt;br /&gt;
It doesn’t involve the training data, so its performance may be poor.&lt;br /&gt;
&lt;br /&gt;
We define a function that separates data points belonging to either side of the line.&lt;br /&gt;
&lt;br /&gt;
Click Save.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''feature$class &amp;lt;- model_predict(feature)'''&lt;br /&gt;
&lt;br /&gt;
'''feature$classnum &amp;lt;- as.numeric(feature$class)'''&lt;br /&gt;
&lt;br /&gt;
|| Let’s use the line to classify the feature space and draw the decision boundary.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature$class &amp;lt;- model_predict(feature)'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature$classnum &amp;lt;- as.numeric(feature$class)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
This command will use the line created to predict the class of every point in the grid of feature space.&lt;br /&gt;
&lt;br /&gt;
This command encodes the class string labels into numbers suitable for plotting.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Click on '''feature''' in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Point to the data in the Source window.&lt;br /&gt;
|| Drag boundary to see the Environment window.&lt;br /&gt;
&lt;br /&gt;
Click on '''feature '''in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
The '''feature set '''with the predicted classes loads in the source window.&lt;br /&gt;
|- &lt;br /&gt;
|| '''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data= feature, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_abline(slope = -0.0021, intercept = 1.445, size = 1.2)+'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Data Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data= feature, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_abline(slope = -0.0021, intercept = 1.445, size = 1.2)+'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Data Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We are visualising the feature space and the partition line using ggplot2. &lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the plot window.&lt;br /&gt;
|| Drag boundary to see the plot window clearly.&lt;br /&gt;
&lt;br /&gt;
Overall plot shows that the chosen line approximately separates the training data classes.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
'''prediction_test = model_predict(test_data)'''&lt;br /&gt;
|| Let us see how well the partition performs on the testing dataset.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''prediction_test = model_predict(test_data)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We predict the classes for the testing data and store them in the '''prediction_test''' variable.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us now measure the performance of the classification.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix &amp;lt;- confusionMatrix(test_data$class,prediction_test)'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window, type the command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix &amp;lt;- confusionMatrix(test_data$class,prediction_test)'''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| We use the '''confusionMatrix''' function from the '''caret''' package to calculate performance metrics.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
|- &lt;br /&gt;
|| '''test_confusion_matrix$overall[&amp;quot;Accuracy&amp;quot;]'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window, type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$overall[&amp;quot;Accuracy&amp;quot;]'''&lt;br /&gt;
|| This fetches the accuracy metric from the confusion matrix object created earlier.&lt;br /&gt;
&lt;br /&gt;
Select and run the command&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Drag boundary to see the console window clearly&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''Accuracy'''&lt;br /&gt;
&lt;br /&gt;
0.6962963&lt;br /&gt;
&lt;br /&gt;
|| The accuracy on the testing dataset is about 69%.&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the source window clearly&lt;br /&gt;
&lt;br /&gt;
|| Drag boundary to see the source window clearly&lt;br /&gt;
&lt;br /&gt;
Let us now view the confusion matrix of the testing dataset&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| Select and run the command.&lt;br /&gt;
&lt;br /&gt;
The output is seen in the '''console''' window&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Point the output in the '''console window'''&lt;br /&gt;
&lt;br /&gt;
Reference&lt;br /&gt;
&lt;br /&gt;
Prediction Besni Kecimen&lt;br /&gt;
&lt;br /&gt;
Besni 50 82&lt;br /&gt;
&lt;br /&gt;
Kecimen 0 138&lt;br /&gt;
&lt;br /&gt;
|| Drag boundary to see the console window clearly &lt;br /&gt;
&lt;br /&gt;
Observe that: &lt;br /&gt;
&lt;br /&gt;
0 samples of class Besni have been incorrectly classified.&lt;br /&gt;
&lt;br /&gt;
82 samples of class Kecimen have been incorrectly classified. &lt;br /&gt;
&lt;br /&gt;
We can see that our partition line is skewed.&lt;br /&gt;
&lt;br /&gt;
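The two misclassification counts can also be read off programmatically. This standalone sketch rebuilds the printed table as a small matrix; the counts are copied from the output above, and the row/column orientation follows caret's convention of rows as predictions:

```r
# Confusion table: rows are predictions, columns are the reference.
cm = matrix(c(50, 0, 82, 138), nrow = 2,
            dimnames = list(Prediction = c("Besni", "Kecimen"),
                            Reference  = c("Besni", "Kecimen")))
cm["Kecimen", "Besni"]   # Besni samples predicted as Kecimen: 0
cm["Besni", "Kecimen"]   # Kecimen samples predicted as Besni: 82
```

The off-diagonal entries are exactly the misclassification counts quoted in the narration.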
|- &lt;br /&gt;
|| &lt;br /&gt;
|| For the same problem, many different partitions can be drawn.&lt;br /&gt;
&lt;br /&gt;
We could choose a complicated partition to reduce the training misclassification error.&lt;br /&gt;
&lt;br /&gt;
But that gives us no control over the error on the test data.&lt;br /&gt;
&lt;br /&gt;
Instead, we should aim for a simple classifier with a smaller test misclassification error.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| With this, we come to the end of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Let us summarize.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Summary&lt;br /&gt;
|| In this tutorial we have learned about:&lt;br /&gt;
* Machine Learning&lt;br /&gt;
* Classification and Regression Problems&lt;br /&gt;
* Workflow of an ML Classifier Algorithm&lt;br /&gt;
* Visualizing Feature Space&lt;br /&gt;
* Constructing a dummy classifier&lt;br /&gt;
* Evaluation of an ML algorithm&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Here is an assignment for you.&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Assignment&lt;br /&gt;
|| &lt;br /&gt;
*Use a vertical line as a classifier to partition the feature space.&lt;br /&gt;
* Plot the decision boundary for the same.&lt;br /&gt;
* Evaluate the classifier on the test dataset&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
About the Spoken Tutorial Project&lt;br /&gt;
|| The video at the following link summarizes the Spoken Tutorial project. &lt;br /&gt;
&lt;br /&gt;
Please download and watch it.&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Workshops&lt;br /&gt;
|| We conduct workshops using Spoken Tutorials and give certificates.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please contact us.&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Forum to answer questions&lt;br /&gt;
&lt;br /&gt;
Do you have questions in THIS Spoken Tutorial?&lt;br /&gt;
&lt;br /&gt;
Choose the minute and second where you have the question.&lt;br /&gt;
&lt;br /&gt;
Explain your question briefly.&lt;br /&gt;
&lt;br /&gt;
Someone from our team will answer them.&lt;br /&gt;
&lt;br /&gt;
Please visit this site.&lt;br /&gt;
|| Please post your timed queries in this forum.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Forum to answer questions&lt;br /&gt;
|| Do you have any general/technical questions?&lt;br /&gt;
&lt;br /&gt;
Please visit the forum given in the link.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
R Activities&lt;br /&gt;
&lt;br /&gt;
|| The FOSSEE team coordinates the Textbook Companion, Lab Migration and the Case Study Projects.&lt;br /&gt;
&lt;br /&gt;
We give certificates to those who participate.&lt;br /&gt;
&lt;br /&gt;
For more details, please visit the website.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Acknowledgment&lt;br /&gt;
|| The '''Spoken Tutorial''' project was established by the Ministry of Education, Government of India.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Thank You&lt;br /&gt;
|| This tutorial is contributed by Debatosh Chakraborty from IIT Bombay.&lt;br /&gt;
&lt;br /&gt;
Thank you for joining.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Ushav</name></author>	</entry>

	<entry>
		<id>https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English</id>
		<title>Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English</title>
		<link rel="alternate" type="text/html" href="https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English"/>
				<updated>2024-06-04T07:08:27Z</updated>
		
		<summary type="html">&lt;p&gt;Ushav: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title of the script''': Introduction to Machine Learning in R&lt;br /&gt;
&lt;br /&gt;
'''Author''': Debatosh Chakraborty&lt;br /&gt;
&lt;br /&gt;
'''Keywords''': R, RStudio, machine learning, supervised, unsupervised, video tutorial.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=1&lt;br /&gt;
|- &lt;br /&gt;
| align=center| '''Visual Cue'''&lt;br /&gt;
| align=center| '''Narration'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Opening Slide'''&lt;br /&gt;
|| Welcome to this spoken tutorial on''' Introduction to Machine Learning in R'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Learning Objectives'''&lt;br /&gt;
&lt;br /&gt;
|| In this tutorial, we will learn about: &lt;br /&gt;
* Machine Learning&lt;br /&gt;
* Supervised and Unsupervised Learning&lt;br /&gt;
* Workflow of an ML Classifier Algorithm&lt;br /&gt;
* Visualizing Feature Space&lt;br /&gt;
* Constructing a dummy classifier&lt;br /&gt;
* Evaluation of an ML algorithm&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''System Specifications'''&lt;br /&gt;
|| This tutorial is recorded using,&lt;br /&gt;
&lt;br /&gt;
* '''Windows 11 '''&lt;br /&gt;
* '''R '''version''' 4.3.0'''&lt;br /&gt;
* '''RStudio''' version '''2023.06.1'''&lt;br /&gt;
&lt;br /&gt;
It is recommended to install '''R''' version '''4.2.0''' or higher.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Prerequisites '''&lt;br /&gt;
&lt;br /&gt;
'''https://spoken-tutorial.org'''&lt;br /&gt;
|| To follow this tutorial, the learner should know&lt;br /&gt;
* Basic programming in '''R'''.&lt;br /&gt;
* How to use the '''ggplot2''' and '''dplyr''' packages.&lt;br /&gt;
&lt;br /&gt;
If not, please access the relevant tutorials on this website.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Machine Learning'''&lt;br /&gt;
&lt;br /&gt;
'''   '''&lt;br /&gt;
&lt;br /&gt;
|| About machine learning&lt;br /&gt;
&lt;br /&gt;
* ML enables computers to learn without being explicitly programmed.&lt;br /&gt;
* ML algorithms automate the learning process from data through patterns.&lt;br /&gt;
* Their primary role is prediction, classification or clustering of data.&lt;br /&gt;
* ML algorithms are used in many applications.&lt;br /&gt;
* For example, Natural Language Processing, image and speech recognition, etc.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Types of Machine Learning''' &lt;br /&gt;
|| ML algorithms include the following types and tasks: &lt;br /&gt;
* '''Supervised '''learning: Prediction and Classification''',''' &lt;br /&gt;
* '''Unsupervised '''learning''': '''Clustering''','''&lt;br /&gt;
* '''Semi-supervised '''learning&lt;br /&gt;
* '''Reinforcement '''learning'''.'''&lt;br /&gt;
&lt;br /&gt;
In this series, we will focus on '''Supervised''' and '''Unsupervised''' learning algorithms. &lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Supervised and Unsupervised Learning'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''   '''&lt;br /&gt;
|| Supervised learning: Labeled data &lt;br /&gt;
* ML algorithms predict labels for unseen features &lt;br /&gt;
* They predict based on given features and labels of data.&lt;br /&gt;
&lt;br /&gt;
Unsupervised learning: Unlabeled data&lt;br /&gt;
* ML algorithms develop a mechanism to group similar features into clusters.&lt;br /&gt;
* And label them for future analysis.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slides'''&lt;br /&gt;
&lt;br /&gt;
'''Classification and Regression'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
* Supervised learning consists of Regression and Classification.&lt;br /&gt;
* '''Regression''' is applied to predict and learn continuous-valued responses from features. &lt;br /&gt;
* Regression techniques include Linear, Spline, Ridge, Lasso, and others.&lt;br /&gt;
* '''Classification''' is applied to predict the class of a discrete (labeled) response from features. &lt;br /&gt;
* Classification techniques include Logistic Regression, Decision Tree, SVM, and others.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slides'''&lt;br /&gt;
&lt;br /&gt;
'''Workflow of an ML Classifier algorithm'''&lt;br /&gt;
|| The Workflow of an ML Classifier algorithm&lt;br /&gt;
* Feature Space: Collection of all possible values of the features.&lt;br /&gt;
* A classification algorithm partitions the feature space into a number of classes.&lt;br /&gt;
* Data is split into training and testing sets to learn and evaluate the algorithm.&lt;br /&gt;
* The model learns from the training data to create partitions of feature space.&lt;br /&gt;
* The model is evaluated on the test dataset through performance metrics.&lt;br /&gt;
&lt;br /&gt;
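The workflow above can be sketched end to end in R on a tiny made-up data frame. The toy values and the trivial decision rule below are illustrative assumptions, not the tutorial's Raisin model:

```r
# Toy end-to-end classifier workflow: split, "train", evaluate.
# All values here are made up for illustration only.
data = data.frame(x = c(1, 2, 3, 4, 5, 6),
                  class = factor(c("A", "A", "A", "B", "B", "B")))

set.seed(1)
index_split = sample(1:nrow(data), size = 0.5 * nrow(data), replace = FALSE)
train_data = data[index_split, ]    # used to learn the partition
test_data  = data[-index_split, ]   # held out for evaluation

# A trivial rule standing in for a learned model:
# label a point "B" when x - 3.5 is positive, else "A".
predict_fn = function(d) factor(ifelse(sign(d$x - 3.5) == 1, "B", "A"),
                                levels = c("A", "B"))

accuracy = mean(predict_fn(test_data) == test_data$class)
```

The same four steps, partition, split, predict, evaluate, are carried out on the Raisin data in the rest of this tutorial.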
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Dataset'''&lt;br /&gt;
&lt;br /&gt;
|| Let’s use '''Raisin dataset '''with two chosen variables to understand a classification problem.&lt;br /&gt;
&lt;br /&gt;
For more information on Raisin data please refer to Additional Reading Material on this tutorial page.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide '''&lt;br /&gt;
&lt;br /&gt;
'''Download Files '''&lt;br /&gt;
|| We will use a script file '''Intro.R '''and '''Raisin Dataset ‘raisin.xlsx’'''&lt;br /&gt;
&lt;br /&gt;
Please download these files from the''' Code files''' link of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Make a copy and then use them while practicing.&lt;br /&gt;
|- &lt;br /&gt;
|| [Computer screen]&lt;br /&gt;
&lt;br /&gt;
point to '''Intro.R''' and the folder '''Introduction.'''&lt;br /&gt;
&lt;br /&gt;
Point to the''' MLProject folder '''on the '''Desktop.'''&lt;br /&gt;
&lt;br /&gt;
|| I have downloaded and moved these files to the '''Introduction '''folder. &lt;br /&gt;
&lt;br /&gt;
This folder is located in the '''MLProject''' folder on my '''Desktop'''.&lt;br /&gt;
&lt;br /&gt;
I have also set the '''Introduction''' folder as my working Directory.&lt;br /&gt;
&lt;br /&gt;
In this tutorial, we will introduce classification on the '''raisin''' dataset. &lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us switch to '''RStudio'''. &lt;br /&gt;
|- &lt;br /&gt;
|| Click Intro.R in RStudio&lt;br /&gt;
&lt;br /&gt;
Point to Intro.R in RStudio.&lt;br /&gt;
|| Let us open the script '''Intro.R''' in '''RStudio'''.&lt;br /&gt;
&lt;br /&gt;
Script '''Intro.R''' opens in '''RStudio'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' library(readxl)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
'''&amp;lt;nowiki&amp;gt;#install.packages(&amp;quot;package_name&amp;quot;)&amp;lt;/nowiki&amp;gt;'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Point to the command.'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select and run these commands to import the packages.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''readxl''' package to load the excel file of our '''Raisin Dataset'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''caret''' package to create the '''confusion matrix.'''&lt;br /&gt;
&lt;br /&gt;
The '''ggplot2''' package will be used to create the '''decision boundary plot.'''&lt;br /&gt;
&lt;br /&gt;
Please ensure that all the packages are installed correctly.&lt;br /&gt;
&lt;br /&gt;
As I have already installed the packages, I have imported them directly. &lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' '''&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;- read_xlsx(&amp;quot;Raisin.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
|| Run this command to load the '''Raisin '''dataset.&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the '''Environment''' tab clearly.&lt;br /&gt;
&lt;br /&gt;
In the Environment tab below Data, you will see the '''data '''variable.&lt;br /&gt;
&lt;br /&gt;
Click on '''data '''to load the dataset in the Source window. &lt;br /&gt;
&lt;br /&gt;
Click on '''Intro.R''' in the Source window and close the tab.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command.&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;-data[c(&amp;quot;minorAL&amp;quot;,&amp;quot;ecc&amp;quot;,&amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
Select the commands and click the Run button&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We now select three columns from data.&lt;br /&gt;
&lt;br /&gt;
2 columns (&amp;quot;minorAL&amp;quot;, &amp;quot;ecc&amp;quot;) are chosen as features.&lt;br /&gt;
&lt;br /&gt;
The class column is chosen as a target variable.&lt;br /&gt;
&lt;br /&gt;
We convert the target variable '''data$class '''to a factor. &lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|- &lt;br /&gt;
|| Click on the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click on '''data.'''&lt;br /&gt;
|| Click on '''data '''to load the modified data in the Source window.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| We will now understand the feature space of this data.&lt;br /&gt;
|- &lt;br /&gt;
|| '''range_minor_al &amp;lt;- range(data$minorAL)'''&lt;br /&gt;
&lt;br /&gt;
'''range_ecc &amp;lt;- range(data$ecc)'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''range_minor_al &amp;lt;- range(data$minorAL)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''range_ecc &amp;lt;- range(data$ecc)'''&lt;br /&gt;
|| These commands show the range of the feature variables '''minorAL''' and''' ecc.'''&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the environment tab clearly.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The minimum and maximum values of '''minorAL''' and '''ecc''' are stored in their range variables.&lt;br /&gt;
|- &lt;br /&gt;
|| '''X &amp;lt;- seq(min(data$minorAL), max(data$minorAL), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''Y &amp;lt;- seq(min(data$ecc), max(data$ecc), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''feature &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
&lt;br /&gt;
|| We will now use the range to generate grid points to construct the feature space.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''X &amp;lt;- seq(min(data$minorAL), max(data$minorAL), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''Y &amp;lt;- seq(min(data$ecc), max(data$ecc), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
|| These commands generate sequences of points spanning the ranges of '''minorAL''' and '''ecc'''.&lt;br /&gt;
&lt;br /&gt;
This command creates the Cartesian product of the two sequences to form the feature space grid.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
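To see what expand.grid does on its own, here is a minimal sketch with two tiny toy vectors (the values are assumptions for illustration, not the Raisin ranges):

```r
# expand.grid pairs every value of X with every value of Y,
# giving one grid point per combination (toy values).
X = c(1, 2)
Y = c(10, 20, 30)
grid = expand.grid(minorAL = X, ecc = Y)
nrow(grid)   # 6 rows: the 2 x 3 cartesian product
```

In the tutorial each sequence has 100 points, so the feature grid holds 100 x 100 = 10,000 rows.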
|-&lt;br /&gt;
|  | '''ggplot(data = data, aes(x = minorAL, y = ecc)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(aes(color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Feature Space&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| We will now plot the feature space created&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| '''ggplot(data = data, aes(x = minorAL, y = ecc)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(aes(color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Feature Space&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
|| These commands plot the data points in the feature space.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|-&lt;br /&gt;
|  | Drag boundaries.&lt;br /&gt;
|| Drag boundaries to see the plot window clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the data.&lt;br /&gt;
|| Now let us split our data into training and testing data.&lt;br /&gt;
|-&lt;br /&gt;
|  | [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Click on '''Intro.R''' in the Source window, and type these commands.&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  | Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| This creates training data, consisting of 630 unique rows.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This creates testing data, consisting of 270 unique rows.&lt;br /&gt;
|-&lt;br /&gt;
|| Select the commands and click the Run button.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to the sets in the Environment Tab&lt;br /&gt;
&lt;br /&gt;
Click the '''train_data '''and '''test_data '''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
The data sets are shown in the '''Environment '''tab.&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the Environment window clearly&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Click on '''train_data '''and '''test_data '''to load them in the Source window.&lt;br /&gt;
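The 630/270 figures can be checked in isolation; this sketch assumes only that the dataset has 900 rows, as the Raisin data does:

```r
# Verifying the 70/30 split arithmetic for a 900-row dataset.
set.seed(1)
n = 900
index_split = sample(1:n, size = 0.7 * n, replace = FALSE)
length(index_split)       # 630 unique training indices
n - length(index_split)   # 270 remaining testing rows
```

Because sample() draws without replacement, the training indices are unique and the negative subscript selects exactly the remaining rows.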
|-&lt;br /&gt;
|| &lt;br /&gt;
|| Here we try to partition the '''feature space''' to construct the classifier.&lt;br /&gt;
&lt;br /&gt;
To begin with, one might construct a '''heuristic '''line to build the classifier.&lt;br /&gt;
|- &lt;br /&gt;
|| [Rstudio]&lt;br /&gt;
&lt;br /&gt;
'''fit = function(x)((x * (-0.0021)) + 1.445)'''&lt;br /&gt;
&lt;br /&gt;
'''model_predict &amp;lt;- function(x){'''&lt;br /&gt;
&lt;br /&gt;
'''factor(ifelse(x$ecc &amp;lt; fit(x$minorAL), &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
|| Click in the '''Source''' window and type these commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''fit = function(x)((x * (-0.0021)) + 1.445)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''model_predict &amp;lt;- function(x){'''&lt;br /&gt;
&lt;br /&gt;
'''factor(ifelse(x$ecc &amp;lt; fit(x$minorAL), &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
Click Save and Click Run buttons. &lt;br /&gt;
|| Let us describe the steps of the classification algorithm.&lt;br /&gt;
&lt;br /&gt;
For that we will define a line to partition the data as a dummy classifier.&lt;br /&gt;
&lt;br /&gt;
It does not learn from the training data, so its performance may be poor.&lt;br /&gt;
&lt;br /&gt;
We define a function that separates data points belonging to either side of the line.&lt;br /&gt;
&lt;br /&gt;
Click Save.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
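As a standalone sketch, the same heuristic line can be applied to two made-up points, one on each side of it. Here sign() replaces the script's less-than comparison, and the toy points are assumptions chosen to land on opposite sides:

```r
# The heuristic partition line and the dummy classifier.
fit = function(x) ((x * (-0.0021)) + 1.445)

# A point is "Kecimen" when its ecc lies below the line,
# i.e. when fit(minorAL) - ecc is positive.
model_predict = function(x) {
  factor(ifelse(sign(fit(x$minorAL) - x$ecc) == 1, "Kecimen", "Besni"))
}

toy = data.frame(minorAL = c(200, 200), ecc = c(0.5, 1.5))
model_predict(toy)   # Kecimen Besni
```

fit(200) is about 1.025, so the first point (ecc = 0.5) falls below the line and the second (ecc = 1.5) above it.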
|- &lt;br /&gt;
|| '''feature$class &amp;lt;- model_predict(feature)'''&lt;br /&gt;
&lt;br /&gt;
'''feature$classnum &amp;lt;- as.numeric(feature$class)'''&lt;br /&gt;
&lt;br /&gt;
|| Let’s use the line to classify the feature space and draw the decision boundary.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature$class &amp;lt;- model_predict(feature)'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature$classnum &amp;lt;- as.numeric(feature$class)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
This command will use the line created to predict the class of every point in the grid of feature space.&lt;br /&gt;
&lt;br /&gt;
This command encodes the class string labels into numbers suitable for plotting.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Click on '''feature''' in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Point to the data in the Source window.&lt;br /&gt;
|| Drag boundary to see the Environment window.&lt;br /&gt;
&lt;br /&gt;
Click on '''feature '''in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
The '''feature set '''with the predicted classes loads in the source window.&lt;br /&gt;
|- &lt;br /&gt;
|| '''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data= feature, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_abline(slope = -0.0021, intercept = 1.445, size = 1.2)+'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Data Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data= feature, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_abline(slope = -0.0021, intercept = 1.445, size = 1.2)+'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Data Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We are visualising the feature space and the partition line using ggplot2. &lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the plot window.&lt;br /&gt;
|| Drag boundary to see the plot window clearly.&lt;br /&gt;
&lt;br /&gt;
Overall plot shows that the chosen line approximately separates the training data classes.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
'''prediction_test = model_predict(test_data)'''&lt;br /&gt;
|| Let us see how well the partition performs on the testing dataset.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''prediction_test = model_predict(test_data)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We predict the classes for the testing data and store them in the '''prediction_test''' variable.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us now measure the performance of the classification.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix &amp;lt;- confusionMatrix(test_data$class,prediction_test)'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window, type the command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix &amp;lt;- confusionMatrix(test_data$class,prediction_test)'''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| We use the '''confusionMatrix''' function from the '''caret''' package to calculate performance metrics.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
|- &lt;br /&gt;
|| '''test_confusion_matrix$overall[&amp;quot;Accuracy&amp;quot;]'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window, type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$overall[&amp;quot;Accuracy&amp;quot;]'''&lt;br /&gt;
|| This fetches the accuracy metric from the confusion matrix object created earlier.&lt;br /&gt;
&lt;br /&gt;
Select and run the command&lt;br /&gt;
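What overall[&quot;Accuracy&quot;] returns can be reproduced by hand from the confusion counts printed later in this tutorial: correct predictions divided by all predictions.

```r
# Accuracy from the tutorial's confusion table counts.
correct = 50 + 138            # diagonal: correctly classified samples
total   = 50 + 82 + 0 + 138   # all 270 test samples
correct / total               # 0.6962963, about 69%
```

This matches the value shown in the console in the next step.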
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Drag boundary to see the console window clearly&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''Accuracy'''&lt;br /&gt;
&lt;br /&gt;
0.6962963&lt;br /&gt;
&lt;br /&gt;
|| The accuracy on the testing dataset is about 69%.&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the source window clearly&lt;br /&gt;
&lt;br /&gt;
|| Drag boundary to see the source window clearly&lt;br /&gt;
&lt;br /&gt;
Let us now view the confusion matrix of the testing dataset&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| Select and run the command.&lt;br /&gt;
&lt;br /&gt;
The output is seen in the '''console''' window&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Point the output in the '''console window'''&lt;br /&gt;
&lt;br /&gt;
Reference&lt;br /&gt;
&lt;br /&gt;
Prediction Besni Kecimen&lt;br /&gt;
&lt;br /&gt;
Besni 50 82&lt;br /&gt;
&lt;br /&gt;
Kecimen 0 138&lt;br /&gt;
&lt;br /&gt;
|| Drag boundary to see the console window clearly &lt;br /&gt;
&lt;br /&gt;
Observe that: &lt;br /&gt;
&lt;br /&gt;
0 samples of class Besni have been incorrectly classified.&lt;br /&gt;
&lt;br /&gt;
82 samples of class Kecimen have been incorrectly classified. &lt;br /&gt;
&lt;br /&gt;
We can see that our partition line is skewed.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| For the same problem, many different partitions can be drawn.&lt;br /&gt;
&lt;br /&gt;
We could choose a complicated partition to reduce the training misclassification error.&lt;br /&gt;
&lt;br /&gt;
But that gives us no control over the error on the test data.&lt;br /&gt;
&lt;br /&gt;
Instead, we should aim for a simple classifier with a smaller test misclassification error.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| With this, we come to the end of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Let us summarize.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Summary&lt;br /&gt;
|| In this tutorial we have learned about:&lt;br /&gt;
* Machine Learning&lt;br /&gt;
* Classification and Regression Problems&lt;br /&gt;
* Workflow of an ML Classifier Algorithm&lt;br /&gt;
* Visualizing Feature Space&lt;br /&gt;
* Constructing a dummy classifier&lt;br /&gt;
* Evaluation of an ML algorithm&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Here is an assignment for you.&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Assignment&lt;br /&gt;
|| &lt;br /&gt;
*Use a vertical line as a classifier to partition the feature space.&lt;br /&gt;
* Plot the decision boundary for the same.&lt;br /&gt;
* Evaluate the classifier on the test dataset&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
About the Spoken Tutorial Project&lt;br /&gt;
|| The video at the following link summarizes the Spoken Tutorial project. &lt;br /&gt;
&lt;br /&gt;
Please download and watch it.&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Workshops&lt;br /&gt;
|| We conduct workshops using Spoken Tutorials and give certificates.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please contact us.&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Forum to answer questions&lt;br /&gt;
&lt;br /&gt;
Do you have questions in THIS Spoken Tutorial?&lt;br /&gt;
&lt;br /&gt;
Choose the minute and second where you have the question.&lt;br /&gt;
&lt;br /&gt;
Explain your question briefly.&lt;br /&gt;
&lt;br /&gt;
Someone from our team will answer them.&lt;br /&gt;
&lt;br /&gt;
Please visit this site.&lt;br /&gt;
|| Please post your timed queries in this forum.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Forum to answer questions&lt;br /&gt;
|| Do you have any general/technical questions?&lt;br /&gt;
&lt;br /&gt;
Please visit the forum given in the link.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
R Activities&lt;br /&gt;
&lt;br /&gt;
|| The FOSSEE team coordinates the Textbook Companion, Lab Migration and the Case Study Projects.&lt;br /&gt;
&lt;br /&gt;
We give certificates to those who participate.&lt;br /&gt;
&lt;br /&gt;
For more details, please visit the website.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Acknowledgment&lt;br /&gt;
|| The '''Spoken Tutorial''' project was established by the Ministry of Education, Govt. of India.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Thank You&lt;br /&gt;
|| This tutorial is contributed by Debatosh Chakraborty from IIT Bombay.&lt;br /&gt;
&lt;br /&gt;
Thank you for joining.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Ushav</name></author>	</entry>

	<entry>
		<id>https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English</id>
		<title>Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English</title>
		<link rel="alternate" type="text/html" href="https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English"/>
				<updated>2024-06-04T07:06:19Z</updated>
		
		<summary type="html">&lt;p&gt;Ushav: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title of the script''': Introduction to Machine Learning in R&lt;br /&gt;
&lt;br /&gt;
'''Author''': Debatosh Chakraborty&lt;br /&gt;
&lt;br /&gt;
'''Keywords''': R, RStudio, machine learning, supervised, unsupervised, video tutorial.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=1&lt;br /&gt;
|- &lt;br /&gt;
| align=center| '''Visual Cue'''&lt;br /&gt;
| align=center| '''Narration'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Opening Slide'''&lt;br /&gt;
|| Welcome to this spoken tutorial on''' Introduction to Machine Learning in R'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Learning Objectives'''&lt;br /&gt;
&lt;br /&gt;
|| In this tutorial, we will learn about: &lt;br /&gt;
* Machine Learning&lt;br /&gt;
* Supervised and Unsupervised Learning&lt;br /&gt;
* Workflow of an ML Classifier Algorithm&lt;br /&gt;
* Visualizing Feature Space&lt;br /&gt;
* Constructing a dummy classifier&lt;br /&gt;
* Evaluation of an ML algorithm&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''System Specifications'''&lt;br /&gt;
|| This tutorial is recorded using,&lt;br /&gt;
&lt;br /&gt;
* '''Windows 11 '''&lt;br /&gt;
* '''R '''version''' 4.3.0'''&lt;br /&gt;
* '''RStudio''' version '''2023.06.1'''&lt;br /&gt;
&lt;br /&gt;
It is recommended to install '''R''' version '''4.2.0''' or higher.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Prerequisites '''&lt;br /&gt;
&lt;br /&gt;
'''https://spoken-tutorial.org'''&lt;br /&gt;
|| To follow this tutorial, the learner should know&lt;br /&gt;
* Basic programming in '''R'''.&lt;br /&gt;
* Using GGPlot2 and dplyr package.&lt;br /&gt;
&lt;br /&gt;
If not, please access the relevant tutorials on this website.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Machine Learning'''&lt;br /&gt;
&lt;br /&gt;
'''   '''&lt;br /&gt;
&lt;br /&gt;
|| About machine learning&lt;br /&gt;
&lt;br /&gt;
* ML enables computers to learn without being explicitly programmed.&lt;br /&gt;
* ML algorithms automatically learn patterns from data.&lt;br /&gt;
* Their primary role is the prediction, classification or clustering of data.&lt;br /&gt;
* ML is applied in many domains.&lt;br /&gt;
* Examples include Natural Language Processing and image and speech recognition.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Types of Machine Learning''' &lt;br /&gt;
|| ML algorithms include the following types and tasks: &lt;br /&gt;
* '''Supervised '''learning: Prediction and Classification''',''' &lt;br /&gt;
* '''Unsupervised '''learning''': '''Clustering''','''&lt;br /&gt;
* '''Semi-supervised '''learning&lt;br /&gt;
* '''Reinforcement '''learning'''.'''&lt;br /&gt;
&lt;br /&gt;
In this series, we will focus on '''Supervised''' and '''Unsupervised''' learning algorithms. &lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Supervised and Unsupervised Learning'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''   '''&lt;br /&gt;
|| Supervised learning: Labeled data &lt;br /&gt;
* The algorithm learns from the given features and their labels.&lt;br /&gt;
* It then predicts labels for unseen features.&lt;br /&gt;
&lt;br /&gt;
Unsupervised learning: Unlabeled data&lt;br /&gt;
* ML algorithms develop a mechanism to group similar features into clusters.&lt;br /&gt;
* And label them for future analysis.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slides'''&lt;br /&gt;
&lt;br /&gt;
'''Classification and Regression'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
* Supervised learning consists of Regression and Classification.&lt;br /&gt;
* '''Regression''' is applied to predict and learn continuous-valued responses from features. &lt;br /&gt;
* Regression techniques include Linear, Spline, Ridge, Lasso, and others.&lt;br /&gt;
* '''Classification''' is applied to predict the class of a discrete (labeled) response from features. &lt;br /&gt;
* Classification techniques include Logistic Regression, Decision Tree, SVM, and others.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slides'''&lt;br /&gt;
&lt;br /&gt;
'''Workflow of an ML Classifier algorithm'''&lt;br /&gt;
|| The Workflow of an ML Classifier algorithm&lt;br /&gt;
* Feature Space: Collection of all possible values of the features.&lt;br /&gt;
* A classification algorithm partitions the feature space into a number of classes.&lt;br /&gt;
* Data is split into training and testing sets to learn and evaluate the algorithm.&lt;br /&gt;
* The model learns from the training data to create partitions of feature space.&lt;br /&gt;
* The model is evaluated on the test dataset through performance metrics.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Dataset'''&lt;br /&gt;
&lt;br /&gt;
|| Let’s use '''Raisin dataset '''with two chosen variables to understand a classification problem.&lt;br /&gt;
&lt;br /&gt;
For more information on Raisin data please refer to Additional Reading Material on this tutorial page.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide '''&lt;br /&gt;
&lt;br /&gt;
'''Download Files '''&lt;br /&gt;
|| We will use a script file '''Intro.R '''and the '''Raisin Dataset ‘Raisin.xlsx’'''&lt;br /&gt;
&lt;br /&gt;
Please download these files from the''' Code files''' link of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Make a copy and then use them while practicing.&lt;br /&gt;
|- &lt;br /&gt;
|| [Computer screen]&lt;br /&gt;
&lt;br /&gt;
point to '''Intro.R''' and the folder '''Introduction.'''&lt;br /&gt;
&lt;br /&gt;
Point to the''' MLProject folder '''on the '''Desktop.'''&lt;br /&gt;
&lt;br /&gt;
|| I have downloaded and moved these files to the '''Introduction '''folder. &lt;br /&gt;
&lt;br /&gt;
This folder is located in the '''MLProject''' folder on my '''Desktop'''.&lt;br /&gt;
&lt;br /&gt;
I have also set the '''Introduction''' folder as my working directory.&lt;br /&gt;
&lt;br /&gt;
In this tutorial, we will introduce classification on the '''raisin''' dataset. &lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us switch to '''RStudio'''. &lt;br /&gt;
|- &lt;br /&gt;
|| Click Intro.R in RStudio&lt;br /&gt;
&lt;br /&gt;
Point to Intro.R in RStudio.&lt;br /&gt;
|| Let us open the script '''Intro.R''' in '''RStudio'''.&lt;br /&gt;
&lt;br /&gt;
Script '''Intro.R''' opens in '''RStudio'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' library(readxl)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
'''&amp;lt;nowiki&amp;gt;#install.packages(“package_name”)&amp;lt;/nowiki&amp;gt;'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Point to the command.'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select and run these commands to import the packages.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''readxl''' package to load the excel file of our '''Raisin Dataset'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''caret''' package to create the '''confusion matrix.'''&lt;br /&gt;
&lt;br /&gt;
The '''ggplot2''' package will be used to create the '''decision boundary plot.'''&lt;br /&gt;
&lt;br /&gt;
Please ensure that all the packages are installed correctly.&lt;br /&gt;
&lt;br /&gt;
As I have already installed the packages, I have imported them directly. &lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' '''&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;- read_xlsx(&amp;quot;Raisin.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
|| Run this command to load the '''Raisin '''dataset.&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the '''Environment''' tab clearly.&lt;br /&gt;
&lt;br /&gt;
In the Environment tab below Data, you will see the '''data '''variable.&lt;br /&gt;
&lt;br /&gt;
Click on '''data '''to load the dataset in the Source window. &lt;br /&gt;
&lt;br /&gt;
Click on '''Intro.R''' in the Source window and close the tab.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command.&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;-data[c(&amp;quot;minorAL&amp;quot;,&amp;quot;ecc&amp;quot;,&amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
Select the commands and click the Run button&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We now select three columns from data.&lt;br /&gt;
&lt;br /&gt;
2 columns (&amp;quot;minorAL&amp;quot;, &amp;quot;ecc&amp;quot;) are chosen as features.&lt;br /&gt;
&lt;br /&gt;
The class column is chosen as a target variable.&lt;br /&gt;
&lt;br /&gt;
We convert the target variable '''data$class '''to a factor. &lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
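As an aside, the effect of '''factor()''' can be seen on a tiny hypothetical vector (the values below are illustrative, not taken from the dataset):&lt;br /&gt;

```r
# factor() converts character labels into a categorical variable.
cls &lt;- c("Kecimen", "Besni", "Besni", "Kecimen")
f &lt;- factor(cls)
levels(f)      # "Besni" "Kecimen" -- levels are sorted alphabetically
as.numeric(f)  # 2 1 1 2 -- each label is encoded by its level index
```

This numeric encoding is what classifiers and plotting functions rely on internally.&lt;br /&gt;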
|- &lt;br /&gt;
|| Click on the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click on '''data.'''&lt;br /&gt;
|| Click on '''data '''to load the modified data in the Source window.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| We will now understand the feature space of this data.&lt;br /&gt;
|- &lt;br /&gt;
|| '''range_minor_al &amp;lt;- range(data$minorAL)'''&lt;br /&gt;
&lt;br /&gt;
'''range_ecc &amp;lt;- range(data$ecc)'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''range_minor_al &amp;lt;- range(data$minorAL)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''range_ecc &amp;lt;- range(data$ecc)'''&lt;br /&gt;
|| These commands show the range of the feature variables '''minorAL''' and''' ecc.'''&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the environment tab clearly.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The minimum and maximum values of '''minorAL''' and '''ecc''' are stored in their respective range variables.&lt;br /&gt;
|- &lt;br /&gt;
|| '''X &amp;lt;- seq(min(data$minorAL), max(data$minorAL), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''Y &amp;lt;- seq(min(data$ecc), max(data$ecc), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''feature &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
&lt;br /&gt;
|| We will now use the range to generate grid points to construct the feature space.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''X &amp;lt;- seq(min(data$minorAL), max(data$minorAL), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''Y &amp;lt;- seq(min(data$ecc), max(data$ecc), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
|| This command generates a sequence of points spanning the range of '''minorAL '''and''' ecc'''.&lt;br /&gt;
&lt;br /&gt;
This command forms the Cartesian product of the two features to create the feature space grid.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
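A small sketch of how '''seq()''' and '''expand.grid()''' build the grid; the sizes are shrunk from 100 and the numbers are arbitrary, purely for illustration:&lt;br /&gt;

```r
# expand.grid forms the Cartesian product of the coordinate vectors.
X &lt;- seq(0, 1, length.out = 3)     # 3 points instead of 100
Y &lt;- seq(10, 20, length.out = 2)   # 2 points instead of 100
grid &lt;- expand.grid(minorAL = X, ecc = Y)
nrow(grid)   # 3 * 2 = 6 grid points
```

With 100 points per axis, the tutorial's grid therefore has 100 * 100 = 10,000 points.&lt;br /&gt;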
|-&lt;br /&gt;
|  | '''ggplot(data = data, aes(x = minorAL, y = ecc)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(aes(color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Feature Space&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| We will now plot the feature space created&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| '''ggplot(data = data, aes(x = minorAL, y = ecc)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(aes(color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Feature Space&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
|| These commands plot the data points in the feature space.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|-&lt;br /&gt;
|  | Drag boundaries.&lt;br /&gt;
|| Drag boundaries to see the plot window clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the data.&lt;br /&gt;
|| Now let us split our data into training and testing data.&lt;br /&gt;
|-&lt;br /&gt;
|  | [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Click on '''Intro.R''' in the Source window, and type these commands.&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  | Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| This creates training data, consisting of 630 unique rows.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This creates testing data, consisting of 270 unique rows.&lt;br /&gt;
|-&lt;br /&gt;
|| Select the commands and click the Run button.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to the sets in the Environment Tab&lt;br /&gt;
&lt;br /&gt;
Click the '''train_data '''and '''test_data '''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
The data sets are shown in the '''Environment '''tab.&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the Environment window clearly&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Click on '''train_data '''and '''test_data '''to load them in the Source window.&lt;br /&gt;
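The same split logic can be tried on a toy data frame first; everything below ('''df''', '''idx''', the 10 rows) is a made-up illustration of '''sample()''' with replace = FALSE:&lt;br /&gt;

```r
# Reproducible 70/30 split with base R.
df &lt;- data.frame(x = 1:10)
set.seed(1)                    # fix the random draw
idx &lt;- sample(1:nrow(df), size = 0.7 * nrow(df), replace = FALSE)
train &lt;- df[idx, , drop = FALSE]    # 7 sampled rows
test  &lt;- df[-idx, , drop = FALSE]   # the 3 remaining rows
c(nrow(train), nrow(test))     # 7 3
```

Because replace = FALSE, no row appears in both sets, which is why the tutorial's 900-row dataset yields 630 training and 270 testing rows.&lt;br /&gt;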
|-&lt;br /&gt;
|| &lt;br /&gt;
|| Here we try to partition the '''feature space''' to construct the classifier.&lt;br /&gt;
&lt;br /&gt;
To begin with, one might construct a '''heuristic '''line to build the classifier.&lt;br /&gt;
|- &lt;br /&gt;
|| [Rstudio]&lt;br /&gt;
&lt;br /&gt;
'''fit = function(x)((x * (-0.0021)) + 1.445)'''&lt;br /&gt;
&lt;br /&gt;
'''model_predict &amp;lt;- function(x){'''&lt;br /&gt;
&lt;br /&gt;
'''factor(ifelse(x$ecc &amp;lt; fit(x$minorAL), &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window, type these commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''fit = function(x)((x * (-0.0021)) + 1.445)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''model_predict &amp;lt;- function(x){'''&lt;br /&gt;
&lt;br /&gt;
'''factor(ifelse(x$ecc &amp;lt; fit(x$minorAL), &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
Click Save and Click Run buttons. &lt;br /&gt;
|| Let us describe the steps of the classification algorithm.&lt;br /&gt;
&lt;br /&gt;
For that we will define a line to partition the data as a dummy classifier.&lt;br /&gt;
&lt;br /&gt;
Since it does not learn from the training data, its performance may be poor.&lt;br /&gt;
&lt;br /&gt;
We define a function that separates data points belonging to either side of the line.&lt;br /&gt;
&lt;br /&gt;
Click Save.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
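To see the rule in action, the two functions above can be applied to a couple of hypothetical points (the coordinates are invented for illustration):&lt;br /&gt;

```r
# The line ecc = -0.0021 * minorAL + 1.445 partitions the plane:
# points below it are labelled "Kecimen", points above it "Besni".
fit &lt;- function(x) (x * (-0.0021)) + 1.445
model_predict &lt;- function(d) {
  factor(ifelse(d$ecc &lt; fit(d$minorAL), "Kecimen", "Besni"))
}
pts &lt;- data.frame(minorAL = c(200, 400), ecc = c(0.5, 0.9))
model_predict(pts)   # Kecimen Besni
```

For the first point, fit(200) = 1.025 and 0.5 lies below it, so the label is Kecimen; for the second, fit(400) = 0.605 and 0.9 lies above it, so the label is Besni.&lt;br /&gt;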
|- &lt;br /&gt;
|| '''feature$class &amp;lt;- model_predict(feature)'''&lt;br /&gt;
&lt;br /&gt;
'''feature$classnum &amp;lt;- as.numeric(feature$class)'''&lt;br /&gt;
&lt;br /&gt;
|| Let’s use the line to classify the feature space and draw the decision boundary.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature$class &amp;lt;- model_predict(feature)'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature$classnum &amp;lt;- as.numeric(feature$class)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
This command will use the line created to predict the class of every point in the grid of feature space.&lt;br /&gt;
&lt;br /&gt;
This command encodes the class string labels into numbers suitable for plotting.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Click on '''feature''' in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Point to the data in the Source window.&lt;br /&gt;
|| Drag boundary to see the Environment window.&lt;br /&gt;
&lt;br /&gt;
Click on '''feature '''in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
The '''feature set '''with the predicted classes loads in the source window.&lt;br /&gt;
|- &lt;br /&gt;
|| '''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data= feature, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_abline(slope = -0.0021, intercept = 1.445, size = 1.2)+'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Data Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data= feature, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_abline(slope = -0.0021, intercept = 1.445, size = 1.2)+'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Data Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We are visualizing the feature space and the partition line using GGPlot2.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the plot window.&lt;br /&gt;
|| Drag boundary to see the plot window clearly.&lt;br /&gt;
&lt;br /&gt;
The overall plot shows that the chosen line approximately separates the two classes.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
'''prediction_test = model_predict(test_data)'''&lt;br /&gt;
|| Let us see how well the partition performs on the testing dataset.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''prediction_test = model_predict(test_data)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We predict the classes for the testing data and store them in the '''prediction_test '''variable.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us now measure the performance of the classification.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix &amp;lt;- confusionMatrix(test_data$class,prediction_test)'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window, type the command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix &amp;lt;- confusionMatrix(test_data$class,prediction_test)'''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| We use the '''confusionMatrix''' function from the '''caret''' package to calculate performance metrics.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
|- &lt;br /&gt;
|| '''test_confusion_matrix$overall[&amp;quot;Accuracy&amp;quot;]'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window, type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$overall[&amp;quot;Accuracy&amp;quot;]'''&lt;br /&gt;
|| This fetches the accuracy metric from the confusion matrix object.&lt;br /&gt;
&lt;br /&gt;
Select and run the command&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Drag boundary to see the console window clearly&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''Accuracy'''&lt;br /&gt;
&lt;br /&gt;
0.6962963&lt;br /&gt;
&lt;br /&gt;
|| The accuracy on the testing dataset is approximately 69.6%.&lt;br /&gt;
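The same number can be recovered by hand from the confusion table shown in the console: correct predictions lie on the diagonal. The matrix below is rebuilt manually to mirror that output:&lt;br /&gt;

```r
# Rebuild the printed confusion table and compute accuracy from it.
tab &lt;- matrix(c(50, 0, 82, 138), nrow = 2,
              dimnames = list(Prediction = c("Besni", "Kecimen"),
                              Reference  = c("Besni", "Kecimen")))
sum(diag(tab)) / sum(tab)   # (50 + 138) / 270 = 0.6962963
```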
|- &lt;br /&gt;
|| Drag boundary to see the source window clearly&lt;br /&gt;
&lt;br /&gt;
|| Drag boundary to see the source window clearly&lt;br /&gt;
&lt;br /&gt;
Let us now view the confusion matrix of the testing dataset&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| Select and run the command.&lt;br /&gt;
&lt;br /&gt;
The output is seen in the '''console''' window&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the output in the '''console window'''&lt;br /&gt;
&lt;br /&gt;
Reference&lt;br /&gt;
&lt;br /&gt;
Prediction Besni Kecimen&lt;br /&gt;
&lt;br /&gt;
Besni 50 82&lt;br /&gt;
&lt;br /&gt;
Kecimen 0 138&lt;br /&gt;
&lt;br /&gt;
|| Drag boundary to see the console window clearly &lt;br /&gt;
&lt;br /&gt;
Observe that: &lt;br /&gt;
&lt;br /&gt;
0 samples of class Besni have been incorrectly classified.&lt;br /&gt;
&lt;br /&gt;
82 samples of class Kecimen have been incorrectly classified. &lt;br /&gt;
&lt;br /&gt;
We can see that our partition line is skewed.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| For the same problem, many different partitions can be drawn.&lt;br /&gt;
&lt;br /&gt;
We could choose a complicated partition to reduce the training misclassification error.&lt;br /&gt;
&lt;br /&gt;
But that gives us no control over the error on the test data.&lt;br /&gt;
&lt;br /&gt;
Instead, we aim for a classifier that is simple and has a smaller test misclassification error.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| With this, we come to the end of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Let us summarize.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Summary&lt;br /&gt;
|| In this tutorial we have learned about:&lt;br /&gt;
* Machine Learning&lt;br /&gt;
* Classification and Regression Problems&lt;br /&gt;
* Workflow of an ML Classifier Algorithm&lt;br /&gt;
* Visualizing Feature Space&lt;br /&gt;
* Constructing a dummy classifier&lt;br /&gt;
* Evaluation of an ML algorithm&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Here is an assignment for you.&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Assignment&lt;br /&gt;
|| &lt;br /&gt;
*Use a vertical line as a classifier to partition the feature space.&lt;br /&gt;
* Plot the decision boundary for the same.&lt;br /&gt;
* Evaluate the classifier on the test dataset&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
About the Spoken Tutorial Project&lt;br /&gt;
|| The video at the following link summarizes the Spoken Tutorial project. &lt;br /&gt;
&lt;br /&gt;
Please download and watch it.&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Workshops&lt;br /&gt;
|| We conduct workshops using Spoken Tutorials and give certificates.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please contact us.&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Forum to answer questions&lt;br /&gt;
&lt;br /&gt;
Do you have questions in THIS Spoken Tutorial?&lt;br /&gt;
&lt;br /&gt;
Choose the minute and second where you have the question.&lt;br /&gt;
&lt;br /&gt;
Explain your question briefly.&lt;br /&gt;
&lt;br /&gt;
Someone from our team will answer them.&lt;br /&gt;
&lt;br /&gt;
Please visit this site.&lt;br /&gt;
|| Please post your timed queries in this forum.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Forum to answer questions&lt;br /&gt;
|| Do you have any general/technical questions?&lt;br /&gt;
&lt;br /&gt;
Please visit the forum given in the link.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
R Activities&lt;br /&gt;
&lt;br /&gt;
|| The FOSSEE team coordinates the Textbook Companion, Lab Migration and the Case Study Projects.&lt;br /&gt;
&lt;br /&gt;
We give certificates to those who participate.&lt;br /&gt;
&lt;br /&gt;
For more details, please visit the website.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Acknowledgment&lt;br /&gt;
|| The '''Spoken Tutorial''' project was established by the Ministry of Education, Govt. of India.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Thank You&lt;br /&gt;
|| This tutorial is contributed by Debatosh Chakraborty from IIT Bombay.&lt;br /&gt;
&lt;br /&gt;
Thank you for joining.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Ushav</name></author>	</entry>

	<entry>
		<id>https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English</id>
		<title>Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English</title>
		<link rel="alternate" type="text/html" href="https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Introduction-to-Machine-Learning-in-R/English"/>
				<updated>2024-06-03T13:02:54Z</updated>
		
		<summary type="html">&lt;p&gt;Ushav: Created page with &amp;quot;'''Title of the script''': Introduction to Machine Learning in R  '''Author''': Debatosh Chakraborty  '''Keywords''': R, RStudio, machine learning, supervised, unsupervised, v...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title of the script''': Introduction to Machine Learning in R&lt;br /&gt;
&lt;br /&gt;
'''Author''': Debatosh Chakraborty&lt;br /&gt;
&lt;br /&gt;
'''Keywords''': R, RStudio, machine learning, supervised, unsupervised, video tutorial.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=1&lt;br /&gt;
|- &lt;br /&gt;
| align=center| '''Visual Cue'''&lt;br /&gt;
| align=center| '''Narration'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Opening Slide'''&lt;br /&gt;
|| Welcome to this spoken tutorial on''' Introduction to Machine Learning in R'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Learning Objectives'''&lt;br /&gt;
&lt;br /&gt;
|| In this tutorial, we will learn about: &lt;br /&gt;
* Machine Learning&lt;br /&gt;
* Classification and Regression Problems&lt;br /&gt;
* Workflow of an ML Classifier Algorithm&lt;br /&gt;
* Visualizing Feature Space&lt;br /&gt;
* Constructing a dummy classifier&lt;br /&gt;
* Evaluation of an ML algorithm&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''System Specifications'''&lt;br /&gt;
|| This tutorial is recorded using,&lt;br /&gt;
&lt;br /&gt;
* '''Windows 11 '''&lt;br /&gt;
* '''R '''version''' 4.3.0'''&lt;br /&gt;
* '''RStudio''' version '''2023.06.1'''&lt;br /&gt;
&lt;br /&gt;
It is recommended to install '''R''' version '''4.2.0''' or higher.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Prerequisites '''&lt;br /&gt;
&lt;br /&gt;
'''https://spoken-tutorial.org'''&lt;br /&gt;
|| To follow this tutorial, the learner should know&lt;br /&gt;
* Basic programming in '''R'''.&lt;br /&gt;
* Using GGPlot2 and dplyr package.&lt;br /&gt;
&lt;br /&gt;
If not, please access the relevant tutorials on this website.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Machine Learning'''&lt;br /&gt;
&lt;br /&gt;
'''   '''&lt;br /&gt;
&lt;br /&gt;
|| About machine learning&lt;br /&gt;
&lt;br /&gt;
* ML enables computers to learn without being explicitly programmed.&lt;br /&gt;
* ML algorithms automatically learn patterns from data.&lt;br /&gt;
* Their primary role is the prediction, classification or clustering of data.&lt;br /&gt;
* ML is applied in many domains.&lt;br /&gt;
* Examples include Natural Language Processing and image and speech recognition.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Types of Machine Learning''' &lt;br /&gt;
|| ML algorithms include the following types and tasks: &lt;br /&gt;
* '''Supervised''' learning: Prediction and Classification,&lt;br /&gt;
* '''Unsupervised''' learning: Clustering,&lt;br /&gt;
* '''Semi-supervised''' learning,&lt;br /&gt;
* '''Reinforcement''' learning.&lt;br /&gt;
&lt;br /&gt;
In this series, we will focus on '''Supervised''' and '''Unsupervised''' learning algorithms. &lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Supervised and Unsupervised Learning'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| Supervised learning: Labeled data &lt;br /&gt;
* ML algorithms learn from the given features and labels of the data.&lt;br /&gt;
* They then predict labels for unseen features.&lt;br /&gt;
&lt;br /&gt;
Unsupervised learning: Unlabeled data&lt;br /&gt;
* ML algorithms develop a mechanism to group similar features into clusters.&lt;br /&gt;
* These clusters are then labeled for future analysis.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slides'''&lt;br /&gt;
&lt;br /&gt;
'''Classification and Regression'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
* Supervised learning consists of Regression and Classification.&lt;br /&gt;
* '''Regression''' is applied to predict and learn continuous-valued responses from features. &lt;br /&gt;
* Regression techniques include Linear, Spline, Ridge, Lasso, and others.&lt;br /&gt;
* '''Classification''' is applied to predict the class of a discrete (labeled) response from features. &lt;br /&gt;
* Classification techniques include Logistic Regression, Decision Tree, SVM, and others.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slides'''&lt;br /&gt;
&lt;br /&gt;
'''Workflow of an ML Classifier algorithm'''&lt;br /&gt;
|| The Workflow of an ML Classifier algorithm&lt;br /&gt;
* Feature Space: Collection of all possible values of the features.&lt;br /&gt;
* A classification algorithm partitions the feature space into a number of classes.&lt;br /&gt;
* Data is split into training and testing sets to learn and evaluate the algorithm.&lt;br /&gt;
* The model learns from the training data to create partitions of feature space.&lt;br /&gt;
* The model is evaluated on the test dataset through performance metrics.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Dataset'''&lt;br /&gt;
&lt;br /&gt;
|| Let’s use the '''Raisin''' dataset with two chosen variables to understand a classification problem.&lt;br /&gt;
&lt;br /&gt;
For more information on the Raisin data, please refer to the Additional Reading Material on this tutorial page.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide '''&lt;br /&gt;
&lt;br /&gt;
'''Download Files '''&lt;br /&gt;
|| We will use a script file '''Intro.R '''and '''Raisin Dataset ‘raisin.xlsx’'''&lt;br /&gt;
&lt;br /&gt;
Please download these files from the''' Code files''' link of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Make a copy and then use them while practicing.&lt;br /&gt;
|- &lt;br /&gt;
|| [Computer screen]&lt;br /&gt;
&lt;br /&gt;
point to '''Intro.R''' and the folder '''Introduction.'''&lt;br /&gt;
&lt;br /&gt;
Point to the''' MLProject folder '''on the '''Desktop.'''&lt;br /&gt;
&lt;br /&gt;
|| I have downloaded and moved these files to the '''Introduction '''folder. &lt;br /&gt;
&lt;br /&gt;
This folder is located in the '''MLProject''' folder on my '''Desktop'''.&lt;br /&gt;
&lt;br /&gt;
I have also set the '''Introduction''' folder as my Working Directory.&lt;br /&gt;
&lt;br /&gt;
In this tutorial, we will introduce classification on the '''raisin''' dataset. &lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us switch to '''RStudio'''. &lt;br /&gt;
|- &lt;br /&gt;
|| Click Intro.R in RStudio&lt;br /&gt;
&lt;br /&gt;
Point to Intro.R in RStudio.&lt;br /&gt;
|| Let us open the script '''Intro.R''' in '''RStudio'''.&lt;br /&gt;
&lt;br /&gt;
Script '''Intro.R''' opens in '''RStudio'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' library(readxl)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
'''&amp;lt;nowiki&amp;gt;#install.packages(“package_name”)&amp;lt;/nowiki&amp;gt;'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Point to the command.'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select and run these commands to import the packages.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''readxl''' package to load the excel file of our '''Raisin Dataset'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''caret''' package to create the '''confusion matrix.'''&lt;br /&gt;
&lt;br /&gt;
The '''ggplot2''' package will be used to create the '''decision boundary plot.'''&lt;br /&gt;
&lt;br /&gt;
Please ensure that all the packages are installed correctly.&lt;br /&gt;
&lt;br /&gt;
As I have already installed the packages, I have imported them directly. &lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' '''&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;- read_xlsx(&amp;quot;Raisin.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
|| Run this command to load the '''Raisin '''dataset.&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the '''Environment''' tab clearly.&lt;br /&gt;
&lt;br /&gt;
In the Environment tab below Data, you will see the '''data '''variable.&lt;br /&gt;
&lt;br /&gt;
Click on '''data '''to load the dataset in the Source window. &lt;br /&gt;
&lt;br /&gt;
Click on '''Intro.R''' in the Source window and close the tab.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command.&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;-data[c(&amp;quot;minorAL&amp;quot;,&amp;quot;ecc&amp;quot;,&amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
Select the commands and click the Run button&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We now select three columns from data.&lt;br /&gt;
&lt;br /&gt;
Two columns (&amp;quot;minorAL&amp;quot;, &amp;quot;ecc&amp;quot;) are chosen as features.&lt;br /&gt;
&lt;br /&gt;
The class column is chosen as a target variable.&lt;br /&gt;
&lt;br /&gt;
We convert the target variable '''data$class '''to a factor. &lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|- &lt;br /&gt;
|| Click on the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click on '''data.'''&lt;br /&gt;
|| Click on '''data '''to load the modified data in the Source window.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| We will now understand the feature space of this data.&lt;br /&gt;
|- &lt;br /&gt;
|| '''range_minor_al &amp;lt;- range(data$minorAL)'''&lt;br /&gt;
&lt;br /&gt;
'''range_ecc &amp;lt;- range(data$ecc)'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''range_minor_al &amp;lt;- range(data$minorAL)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''range_ecc &amp;lt;- range(data$ecc)'''&lt;br /&gt;
|| These commands show the range of the feature variables '''minorAL''' and''' ecc.'''&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the environment tab clearly.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The minimum and maximum values of '''minorAL''' and '''ecc''' are stored in their range variables.&lt;br /&gt;
|- &lt;br /&gt;
|| '''X &amp;lt;- seq(min(data$minorAL), max(data$minorAL), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''Y &amp;lt;- seq(min(data$ecc), max(data$ecc), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''feature &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
&lt;br /&gt;
|| We will now use the range to generate grid points to construct the feature space.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''X &amp;lt;- seq(min(data$minorAL), max(data$minorAL), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
'''Y &amp;lt;- seq(min(data$ecc), max(data$ecc), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
|| These commands generate sequences of 100 points spanning the ranges of '''minorAL '''and''' ecc'''.&lt;br /&gt;
&lt;br /&gt;
This command creates a Cartesian product of the two features to form the feature space grid.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
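As a side check, the Cartesian-product behaviour of '''expand.grid''' can be seen on a tiny example; the values below are illustrative only and are not taken from the Raisin data.&lt;br /&gt;

```r
# Illustrative values only, not the Raisin data
X = c(1, 2, 3)       # three candidate minorAL values
Y = c(0.1, 0.2)      # two candidate ecc values
grid = expand.grid(minorAL = X, ecc = Y)
nrow(grid)           # 6, one row per (minorAL, ecc) combination
```

In the tutorial itself each sequence has 100 points, so the resulting feature grid has 100 x 100 = 10000 rows.&lt;br /&gt;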
|-&lt;br /&gt;
|| '''ggplot(data = data, aes(x = minorAL, y = ecc)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(aes(color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Feature Space&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| We will now plot the created feature space.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| '''ggplot(data = data, aes(x = minorAL, y = ecc)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(aes(color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Feature Space&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
|| These commands plot the data points in the feature space.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|-&lt;br /&gt;
|| Drag boundaries.&lt;br /&gt;
|| Drag boundaries to see the plot window clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the data.&lt;br /&gt;
|| Now let us split our data into training and testing data.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Click on '''Intro.R''' in the Source window, and type these commands.&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| This creates training data, consisting of 630 unique rows.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This creates testing data, consisting of 270 unique rows.&lt;br /&gt;
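These row counts follow from the 70/30 split of the 900-row dataset; a minimal sketch of the arithmetic:&lt;br /&gt;

```r
n = 900       # rows in the Raisin dataset
0.7 * n       # 630 rows are sampled for training
n - 0.7 * n   # 270 remaining rows are used for testing
```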
|-&lt;br /&gt;
|| Select the commands and click the Run button.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to the sets in the Environment Tab&lt;br /&gt;
&lt;br /&gt;
Click the '''train_data '''and '''test_data '''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
The data sets are shown in the '''Environment '''tab.&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the Environment window clearly&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Click on '''train_data '''and '''test_data '''to load them in the Source window.&lt;br /&gt;
|-&lt;br /&gt;
|| &lt;br /&gt;
|| Here we try to partition the '''feature space''' to construct the classifier.&lt;br /&gt;
&lt;br /&gt;
To begin with, one might construct a '''heuristic '''line to build the classifier.&lt;br /&gt;
|- &lt;br /&gt;
|| [Rstudio]&lt;br /&gt;
&lt;br /&gt;
'''fit = function(x)((x * (-0.0021)) + 1.445)'''&lt;br /&gt;
&lt;br /&gt;
'''model_predict &amp;lt;- function(x){'''&lt;br /&gt;
&lt;br /&gt;
'''factor(ifelse(x$ecc &amp;lt; fit(x$minorAL), &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window, type these commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''fit = function(x)((x * (-0.0021)) + 1.445)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''model_predict &amp;lt;- function(x){'''&lt;br /&gt;
&lt;br /&gt;
'''factor(ifelse(x$ecc &amp;lt; fit(x$minorAL), &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
Click Save and Click Run buttons. &lt;br /&gt;
|| Let us describe the steps of the classification algorithm.&lt;br /&gt;
&lt;br /&gt;
For that we will define a line to partition the data as a dummy classifier.&lt;br /&gt;
&lt;br /&gt;
It is not fitted to the training data, so its performance may be poor.&lt;br /&gt;
&lt;br /&gt;
We define a function that separates data points belonging to either side of the line.&lt;br /&gt;
&lt;br /&gt;
Click Save.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
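To see what this heuristic line does, we can evaluate it at one point; the input value 300 below is an arbitrary illustration, not a value from the script.&lt;br /&gt;

```r
# The heuristic partition line defined in the script
fit = function(x) ((x * (-0.0021)) + 1.445)
fit(300)   # 0.815: a point with minorAL = 300 is labelled Kecimen
           # when its ecc is below 0.815, and Besni otherwise
```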
&lt;br /&gt;
|- &lt;br /&gt;
|| '''feature$class &amp;lt;- model_predict(feature)'''&lt;br /&gt;
&lt;br /&gt;
'''feature$classnum &amp;lt;- as.numeric(feature$class)'''&lt;br /&gt;
&lt;br /&gt;
|| Let’s use the line to classify the feature space and draw the decision boundary.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature$class &amp;lt;- model_predict(feature)'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''feature$classnum &amp;lt;- as.numeric(feature$class)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
This command will use the line created to predict the class of every point in the grid of feature space.&lt;br /&gt;
&lt;br /&gt;
This command encodes the class string labels into numbers suitable for plotting.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Click on '''feature''' in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Point to the data in the Source window.&lt;br /&gt;
|| Drag boundary to see the Environment window.&lt;br /&gt;
&lt;br /&gt;
Click on '''feature '''in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
The '''feature set '''with the predicted classes loads in the source window.&lt;br /&gt;
|- &lt;br /&gt;
|| '''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data= feature, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_abline(slope = -0.0021, intercept = 1.445, size = 1.2)+'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Data Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data= feature, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_abline(slope = -0.0021, intercept = 1.445, size = 1.2)+'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;Data Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We are visualising the feature space and the partition line using GGPlot2. &lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the plot window.&lt;br /&gt;
|| Drag boundary to see the plot window clearly.&lt;br /&gt;
&lt;br /&gt;
The overall plot shows that the chosen line approximately separates the training data classes.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
'''prediction_test = model_predict(test_data)'''&lt;br /&gt;
|| Let us see how well the partition performs on the testing dataset.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''prediction_test = model_predict(test_data)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We predict the classes from testing data and store it in the '''prediction_test '''variable.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us now measure the performance of the classification.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix &amp;lt;- confusionMatrix(test_data$class,prediction_test)'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window, type the command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix &amp;lt;- confusionMatrix(test_data$class,prediction_test)'''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| We use the '''confusionMatrix''' function from the '''caret''' package to calculate performance metrics.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
|- &lt;br /&gt;
|| '''test_confusion_matrix$overall[&amp;quot;Accuracy&amp;quot;]'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window, type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$overall[&amp;quot;Accuracy&amp;quot;]'''&lt;br /&gt;
|| It fetches the accuracy metric from the confusion matrix object.&lt;br /&gt;
&lt;br /&gt;
Select and run the command&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Drag boundary to see the console window clearly&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''Accuracy'''&lt;br /&gt;
&lt;br /&gt;
0.6962963&lt;br /&gt;
&lt;br /&gt;
|| The accuracy on the testing dataset is about 69%.&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the source window clearly&lt;br /&gt;
&lt;br /&gt;
|| Drag boundary to see the source window clearly&lt;br /&gt;
&lt;br /&gt;
Let us now view the confusion matrix of the testing dataset&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''test_confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| Select and run the command.&lt;br /&gt;
&lt;br /&gt;
The output is seen in the '''console''' window&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Point the output in the '''console window'''&lt;br /&gt;
&lt;br /&gt;
Reference&lt;br /&gt;
&lt;br /&gt;
Prediction Besni Kecimen&lt;br /&gt;
&lt;br /&gt;
Besni 50 82&lt;br /&gt;
&lt;br /&gt;
Kecimen 0 138&lt;br /&gt;
&lt;br /&gt;
|| Drag boundary to see the console window clearly &lt;br /&gt;
&lt;br /&gt;
Observe that: &lt;br /&gt;
&lt;br /&gt;
0 samples of class Besni have been incorrectly classified.&lt;br /&gt;
&lt;br /&gt;
82 samples of class Kecimen have been incorrectly classified. &lt;br /&gt;
&lt;br /&gt;
We can see that our partition line is skewed.&lt;br /&gt;
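The 69% accuracy reported earlier can be recomputed from these counts; this is a quick sanity check, using the numbers from the console output above.&lt;br /&gt;

```r
# Counts taken from the confusion matrix above
correct = 50 + 138            # diagonal: correctly classified samples
total = 50 + 82 + 0 + 138     # all 270 test samples
correct / total               # 0.6962963, about 69 percent
```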
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| For the same problem, many partitions can be drawn.&lt;br /&gt;
&lt;br /&gt;
We can choose a complicated partition to reduce the training misclassification error.&lt;br /&gt;
&lt;br /&gt;
But that gives no guarantee on the test data.&lt;br /&gt;
&lt;br /&gt;
We should aim to choose a simple classifier with a smaller test misclassification error.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| With this, we come to the end of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Let us summarize.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Summary&lt;br /&gt;
|| In this tutorial we have learned about:&lt;br /&gt;
* Machine Learning&lt;br /&gt;
* Classification and Regression Problems&lt;br /&gt;
* Workflow of an ML Classifier Algorithm&lt;br /&gt;
* Visualizing Feature Space&lt;br /&gt;
* Constructing a dummy classifier&lt;br /&gt;
* Evaluation of an ML algorithm&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Here is an assignment for you.&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Assignment&lt;br /&gt;
|| &lt;br /&gt;
*Use a vertical line as a classifier to partition the feature space.&lt;br /&gt;
* Plot the decision boundary for the same.&lt;br /&gt;
* Evaluate the classifier on the test dataset&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
About the Spoken Tutorial Project&lt;br /&gt;
|| The video at the following link summarizes the Spoken Tutorial project. &lt;br /&gt;
&lt;br /&gt;
Please download and watch it.&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Workshops&lt;br /&gt;
|| We conduct workshops using Spoken Tutorials and give certificates.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please contact us.&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Forum to answer questions&lt;br /&gt;
&lt;br /&gt;
Do you have questions in THIS Spoken Tutorial?&lt;br /&gt;
&lt;br /&gt;
Choose the minute and second where you have the question.&lt;br /&gt;
&lt;br /&gt;
Explain your question briefly.&lt;br /&gt;
&lt;br /&gt;
Someone from our team will answer them.&lt;br /&gt;
&lt;br /&gt;
Please visit this site.&lt;br /&gt;
|| Please post your timed queries in this forum.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Forum to answer questions&lt;br /&gt;
|| Do you have any general/technical questions?&lt;br /&gt;
&lt;br /&gt;
Please visit the forum given in the link.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
R Activities&lt;br /&gt;
&lt;br /&gt;
|| The FOSSEE team coordinates the Textbook Companion, Lab Migration and the Case Study Projects.&lt;br /&gt;
&lt;br /&gt;
We give certificates to those who do this.&lt;br /&gt;
&lt;br /&gt;
For more details, please visit the website.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Acknowledgment&lt;br /&gt;
|| The '''Spoken Tutorial''' project was established by the Ministry of Education, Government of India.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Thank You&lt;br /&gt;
|| This tutorial is contributed by Debatosh Chakraborty from IIT Bombay.&lt;br /&gt;
&lt;br /&gt;
Thank you for joining.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Ushav</name></author>	</entry>

	<entry>
		<id>https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Logistic-Regression-in-R/English</id>
		<title>Machine-Learning-using-R/C2/Logistic-Regression-in-R/English</title>
		<link rel="alternate" type="text/html" href="https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Logistic-Regression-in-R/English"/>
				<updated>2024-05-31T10:31:10Z</updated>
		
		<summary type="html">&lt;p&gt;Ushav: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title of the script''': Logistic Regression&lt;br /&gt;
&lt;br /&gt;
'''Author''': Yate Asseke Ronald Olivera and Debatosh Chakraborty&lt;br /&gt;
&lt;br /&gt;
'''Keywords''': R, RStudio, machine learning, supervised, unsupervised, classification, logistic regression, video tutorial.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=1&lt;br /&gt;
| align=center| '''Visual Cue'''&lt;br /&gt;
| align=center| '''Narration'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Opening Slide'''&lt;br /&gt;
|| Welcome to this spoken tutorial on '''Logistic Regression in R.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Learning Objectives'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| In this tutorial, we will learn about &lt;br /&gt;
* Logistic Regression&lt;br /&gt;
* Assumptions of Logistic Regression&lt;br /&gt;
* Advantages of Logistic Regression&lt;br /&gt;
* Implementation of Logistic Regression in '''R''' using '''Raisin '''dataset'''.'''&lt;br /&gt;
* Model Evaluation.&lt;br /&gt;
* Visualization of the model Decision Boundary&lt;br /&gt;
* Limitations of Logistic Regression&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''System Specifications'''&lt;br /&gt;
|| This tutorial is recorded using,&lt;br /&gt;
* '''Windows 11 '''&lt;br /&gt;
* '''R '''version''' 4.3.0'''&lt;br /&gt;
* '''RStudio''' version '''2023.06.1'''&lt;br /&gt;
&lt;br /&gt;
It is recommended to install '''R''' version '''4.2.0''' or higher.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Prerequisites '''&lt;br /&gt;
|| To follow this tutorial, the learner should know:&lt;br /&gt;
* Basic programming in '''R'''.&lt;br /&gt;
* '''Basics of Machine Learning'''.&lt;br /&gt;
&lt;br /&gt;
If not, please access the relevant tutorials on this website.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us learn what '''logistic regression''' is.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Logistic Regression'''&lt;br /&gt;
&lt;br /&gt;
|| Logistic regression is a statistical model used for classification.&lt;br /&gt;
&lt;br /&gt;
It models the probability of success as a function of the explanatory variables.&lt;br /&gt;
&lt;br /&gt;
* It predicts the probability, unlike the response in linear regression.&lt;br /&gt;
* The predicted probability is used as a classifier.&lt;br /&gt;
* The probability of success is modeled using the '''logit (log odds)''' function.&lt;br /&gt;
* It is a linear classifier, as the logistic regression model has a linear logit.&lt;br /&gt;
* It is often used when the response variable is categorical.&lt;br /&gt;
&lt;br /&gt;
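The logit and its inverse, the sigmoid, can be sketched directly; this is standard background material, not code from the tutorial script.&lt;br /&gt;

```r
# logit maps a probability p in (0, 1) to the whole real line
logit = function(p) log(p / (1 - p))
# its inverse, the sigmoid, maps a linear predictor back to a probability
sigmoid = function(z) 1 / (1 + exp(-z))
logit(0.5)    # 0
sigmoid(0)    # 0.5
```

A linear model for the logit is exactly what makes logistic regression a linear classifier.&lt;br /&gt;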
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Assumptions of Logistic Regression'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* The distribution of the dependent variable is Bernoulli.&lt;br /&gt;
* The data records are independent.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| The dependent variable's distribution is typically assumed to be a Bernoulli distribution in logistic regression.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Advantages of Logistic Regression'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* It provides estimates of regression coefficients along with their standard errors.&lt;br /&gt;
* It also provides the predicted probability which in turn is used as a classifier.&lt;br /&gt;
* It doesn’t need explanatory variables to be necessarily continuous. &lt;br /&gt;
* In this sense, it is a more general classifier than LDA and QDA.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| Logistic regression offers a significant advantage in that continuous explanatory variables are not a requirement.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Implementation Of Logistic Regression'''&lt;br /&gt;
|| We will implement '''logistic regression''' using the '''Raisin '''dataset. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The additional reading material has more details on the '''Raisin dataset'''.&lt;br /&gt;
&lt;br /&gt;
Please refer to it.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide '''&lt;br /&gt;
&lt;br /&gt;
'''Download Files '''&lt;br /&gt;
|| We will use a script file '''LogisticRegression.R '''and '''Raisin Dataset ‘raisin.xlsx’'''&lt;br /&gt;
&lt;br /&gt;
Please download these files from the''' Code files''' link of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Make a copy and then use them while practicing.&lt;br /&gt;
|- &lt;br /&gt;
|| [Computer screen]&lt;br /&gt;
&lt;br /&gt;
Highlight LogisticRegression.R &lt;br /&gt;
&lt;br /&gt;
Logistic Regression folder.&lt;br /&gt;
|| I have downloaded and moved these files to the '''Logistic Regression''' folder. &lt;br /&gt;
&lt;br /&gt;
This folder is located in the '''MLProject '''folder. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
I have also set the '''Logistic Regression''' folder as my Working Directory.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Let’s create a '''Logistic Regression''' classifier model on the '''raisin''' dataset. &lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us switch to '''RStudio'''. &lt;br /&gt;
|- &lt;br /&gt;
|| Click LogisticRegression.R in RStudio&lt;br /&gt;
&lt;br /&gt;
Point to LogisticRegression.R in RStudio.&lt;br /&gt;
|| Open the script '''LogisticRegression.R''' in '''RStudio'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For this, click on the script '''LogisticRegression.R.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Script '''LogisticRegression.R''' opens in '''RStudio'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [Rstudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the commands&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''library(readxl)'''&lt;br /&gt;
&lt;br /&gt;
'''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
'''library(VGAM)'''&lt;br /&gt;
&lt;br /&gt;
'''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
'''library(dplyr)'''&lt;br /&gt;
&lt;br /&gt;
'''&amp;lt;nowiki&amp;gt;#install.packages(“package_name”)&amp;lt;/nowiki&amp;gt;'''&lt;br /&gt;
&lt;br /&gt;
'''Point to the command.'''&lt;br /&gt;
&lt;br /&gt;
|| Select and run these commands to import the necessary packages.&lt;br /&gt;
&lt;br /&gt;
The '''glm()''' function required to create our classifier is available in base '''R''' (the '''stats''' package).&lt;br /&gt;
&lt;br /&gt;
As I have already installed the packages, I have imported them directly. &lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight &lt;br /&gt;
&lt;br /&gt;
'''data &amp;lt;- read_xlsx(&amp;quot;Raisin_Dataset.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''data &amp;lt;- data[c(&amp;quot;minorAL&amp;quot;,&amp;quot;ecc&amp;quot;,&amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Highlight the commands.'''&lt;br /&gt;
|| These commands will load the '''Raisin dataset.'''&lt;br /&gt;
&lt;br /&gt;
They will also prepare the dataset for model building.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click on '''data '''on the Environment tab.&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Click on '''data '''in the '''Environment '''tab.&lt;br /&gt;
&lt;br /&gt;
It loads the modified dataset in the '''Source''' window. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the data.&lt;br /&gt;
|| Now we split our dataset into training and testing data.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
'''trainIndex&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
'''train &amp;lt;- data[trainIndex, ]'''&lt;br /&gt;
&lt;br /&gt;
'''test &amp;lt;- data[-trainIndex, ]'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''trainIndex &amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''train &amp;lt;- data[trainIndex, ]'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''test &amp;lt;- data[-trainIndex, ]'''&lt;br /&gt;
&lt;br /&gt;
Click on Save and Run buttons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Click on '''train''' and '''test''' to load them in the Source window.&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us create a '''Logistic Regression '''model on the '''training dataset'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''Logistic_model &amp;lt;- glm(class ~ ., data = train, family = &amp;quot;binomial&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''summary(Logistic_model)$coef'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight glm()&lt;br /&gt;
&lt;br /&gt;
Highlight '''class ~ .'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''family = binomial'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''train''' &lt;br /&gt;
|| The function glm() represents generalized linear models. &lt;br /&gt;
&lt;br /&gt;
Logistic regression is one of the models it can fit. &lt;br /&gt;
&lt;br /&gt;
This is the formula for our model. &lt;br /&gt;
&lt;br /&gt;
We try to predict the target variable '''class''' based on the '''minorAL''' and '''ecc''' features.&lt;br /&gt;
&lt;br /&gt;
This ensures that our model predicts the probability for 2 classes.&lt;br /&gt;
&lt;br /&gt;
It ensures that, out of all the models in glm, the logistic regression model is fit.&lt;br /&gt;
&lt;br /&gt;
This is the data used to train our model.&lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
The output is shown in the '''console '''window.&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the console window.&lt;br /&gt;
|| Drag boundary to see the '''console '''window. &lt;br /&gt;
|- &lt;br /&gt;
|| Point the output in the '''console'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''Coefficients'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''Pr(&amp;gt;|z|)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
'''Coefficients''' denote the coefficients of the logit function.&lt;br /&gt;
&lt;br /&gt;
That means the log-odds of class change by -0.04 for every unit change in minorAL.&lt;br /&gt;
&lt;br /&gt;
The lower p-values suggest that the effects are statistically significant.&lt;br /&gt;
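The coefficient interpretation above can be checked with a quick calculation. This is a hedged, language-agnostic sketch (shown in Python for brevity; the value -0.04 is the illustrative minorAL coefficient quoted in the narration, not an exact model output):&lt;br /&gt;

```python
import math

# Illustrative coefficient for minorAL quoted in the narration (assumed value).
beta_minorAL = -0.04

# A one-unit increase in minorAL multiplies the odds of the positive class
# by exp(beta): here roughly a 3.9% decrease in the odds.
odds_ratio = math.exp(beta_minorAL)
print(round(odds_ratio, 4))  # 0.9608
```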
&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the '''Source '''window.&lt;br /&gt;
|| Drag boundary to see the '''Source''' window.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us now use our model to make predictions on test data.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''Predicted.prob &amp;lt;- predict(Logistic_model, test, type=&amp;quot;response&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''View(Predicted.prob)'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''Predicted.prob &amp;lt;- predict(Logistic_model, test, type=&amp;quot;response&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''type = &amp;quot;response&amp;quot;'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| This command provides the predicted probability of the logistic regression model on the test dataset.&lt;br /&gt;
&lt;br /&gt;
This argument ensures the outcome is a probability.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands&lt;br /&gt;
|- &lt;br /&gt;
|| Point&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Value&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
'''Predicted.prob '''stores the predicted probability of each observation belonging to a certain class.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''predicted.classes &amp;lt;- factor(ifelse(Predicted.prob &amp;gt; 0.5, &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
|| In the source window type the following commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight &lt;br /&gt;
&lt;br /&gt;
'''predicted.classes &amp;lt;- factor(ifelse(Predicted.prob &amp;gt; 0.5, &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
|| This retrieves the predicted classes from the probabilities. &lt;br /&gt;
&lt;br /&gt;
If the probability is greater than 0.5, the '''Kecimen''' class is chosen; otherwise, the '''Besni''' class is chosen.&lt;br /&gt;
&lt;br /&gt;
We also convert the output to a '''factor''' datatype to fit in the Confusion matrix function.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands&lt;br /&gt;
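The 0.5 thresholding rule described above is easy to sketch outside R as well. A minimal illustration (in Python, with made-up probabilities; the class names come from the tutorial's Raisin dataset):&lt;br /&gt;

```python
# Convert predicted probabilities to class labels using the 0.5 cutoff
# described in the narration (the probabilities here are made up).
probs = [0.91, 0.23, 0.55]
classes = ["Kecimen" if p > 0.5 else "Besni" for p in probs]
print(classes)  # ['Kecimen', 'Besni', 'Kecimen']
```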
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us measure the accuracy of our model. &lt;br /&gt;
|- &lt;br /&gt;
|| '''confusion_matrix &amp;lt;- confusionMatrix(predicted.classes, test$class)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command '''confusionMatrix(predicted.classes, test$class)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to '''confusion_matrix''' in the Environment tab&lt;br /&gt;
&lt;br /&gt;
Highlight the attribute&lt;br /&gt;
&lt;br /&gt;
'''table'''&lt;br /&gt;
|| This command creates a confusion matrix list.&lt;br /&gt;
&lt;br /&gt;
The list is created from the actual and predicted class labels.&lt;br /&gt;
&lt;br /&gt;
And it is stored in the confusion_matrix variable.&lt;br /&gt;
&lt;br /&gt;
It helps to assess the classification model's performance and accuracy.&lt;br /&gt;
&lt;br /&gt;
Select and run these commands&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''plot_confusion_matrix &amp;lt;- function(confusion_matrix){'''&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
'''tab = as.data.frame(tab)'''&lt;br /&gt;
&lt;br /&gt;
'''tab$Prediction &amp;lt;- factor(tab$Prediction, levels = rev(levels(tab$Prediction)))'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- tab %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''rename(Actual = Reference) %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''mutate(cor = if_else(Actual == Prediction, 1,0))'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''tab$cor &amp;lt;- as.factor(tab$cor)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''ggplot(tab, aes(Actual,Prediction)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_tile(aes(fill= cor),alpha = 0.4) + geom_text(aes(label=Freq)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;red&amp;quot;,&amp;quot;green&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_light() +'''&lt;br /&gt;
&lt;br /&gt;
'''theme(legend.position = &amp;quot;None&amp;quot;,'''&lt;br /&gt;
&lt;br /&gt;
'''line = element_blank()) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_x_discrete(position = &amp;quot;top&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Highlight '''the command &lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
'''Highlight '''the command&lt;br /&gt;
&lt;br /&gt;
'''tab = as.data.frame(tab)'''&lt;br /&gt;
&lt;br /&gt;
'''tab$Prediction &amp;lt;- factor(tab$Prediction, levels = rev(levels(tab$Prediction)))'''&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- tab %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''rename(Actual = Reference) %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''mutate(cor = if_else(Actual == Prediction, 1,0))'''&lt;br /&gt;
&lt;br /&gt;
'''tab$cor &amp;lt;- as.factor(tab$cor)'''&lt;br /&gt;
&lt;br /&gt;
'''Highlight '''the command&lt;br /&gt;
&lt;br /&gt;
'''ggplot(tab, aes(Actual,Prediction)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_tile(aes(fill= cor),alpha = 0.4) + geom_text(aes(label=Freq)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;red&amp;quot;,&amp;quot;green&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_light() +'''&lt;br /&gt;
&lt;br /&gt;
'''theme(legend.position = &amp;quot;None&amp;quot;,'''&lt;br /&gt;
&lt;br /&gt;
'''line = element_blank()) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_x_discrete(position = &amp;quot;top&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
|| These commands create a function '''plot_confusion_matrix '''to display the confusion matrix from the confusion matrix list created.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It fetches the confusion matrix table from the list.&lt;br /&gt;
&lt;br /&gt;
It creates a data frame from the table which is suitable for plotting using '''GGPlot2'''.&lt;br /&gt;
&lt;br /&gt;
It plots the confusion matrix using the data frame created.&lt;br /&gt;
&lt;br /&gt;
It represents correct and incorrect predictions using different colors.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''plot_confusion_matrix(confusion_matrix)'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''plot_confusion_matrix(confusion_matrix)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We use the '''plot_confusion_matrix()''' function to generate a visual plot from the confusion matrix list created.&lt;br /&gt;
&lt;br /&gt;
Select and run the command&lt;br /&gt;
&lt;br /&gt;
The output is seen in the '''plot''' window&lt;br /&gt;
|- &lt;br /&gt;
|| '''Output in Plot window.'''&lt;br /&gt;
&lt;br /&gt;
|| This plot shows how well our model predicted the testing data.&lt;br /&gt;
&lt;br /&gt;
We observe that:&lt;br /&gt;
&lt;br /&gt;
'''21''' misclassifications of the '''Besni''' class.&lt;br /&gt;
&lt;br /&gt;
'''13''' misclassifications of the '''Kecimen''' class.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid &amp;lt;- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500),'''&lt;br /&gt;
&lt;br /&gt;
'''ecc = seq(min(data$ecc), max(data$ecc), length = 500)) '''&lt;br /&gt;
&lt;br /&gt;
'''grid$prob &amp;lt;- predict(Logistic_model, newdata = grid, type = &amp;quot;response&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid$class &amp;lt;- ifelse(grid$prob &amp;gt; 0.5, 'Kecimen', 'Besni')'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid$classnum &amp;lt;- as.numeric(as.factor(grid$class))'''&lt;br /&gt;
&lt;br /&gt;
|| We will now visualize the decision boundary of the model.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''grid &amp;lt;- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500),'''&lt;br /&gt;
&lt;br /&gt;
'''ecc = seq(min(data$ecc), max(data$ecc), length = 500)) '''&lt;br /&gt;
&lt;br /&gt;
'''grid$prob &amp;lt;- predict(Logistic_model, newdata = grid, type = &amp;quot;response&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''grid$class &amp;lt;- ifelse(grid$prob &amp;gt; 0.5, 'Kecimen', 'Besni')'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid$classnum &amp;lt;- as.numeric(as.factor(grid$class))'''&lt;br /&gt;
|| This code first generates a '''grid '''of points spanning the range of '''minorAL '''and '''ecc''' features in the dataset. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Then, it uses the '''Logistic Regression''' model to predict the probability of each point in this grid, storing these predictions as a new column '''prob''' in the '''grid''' dataframe. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It converts the predicted probabilities of the points into classes.&lt;br /&gt;
&lt;br /&gt;
If the probability exceeds 0.5, the '''Kecimen''' class is chosen; otherwise, the '''Besni''' class is chosen.&lt;br /&gt;
&lt;br /&gt;
The predicted classes are stored in the '''class''' column of the '''grid''' data frame.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The '''as.numeric''' function encodes the predicted class labels as numeric values.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Select and run the commands&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Click on grid in the Environment tab to load the generated data in the Source window.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train, aes(x = minorAL, y = ecc, color = class)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),'''&lt;br /&gt;
&lt;br /&gt;
'''colour = &amp;quot;black&amp;quot;, linewidth = 0.7) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(x = &amp;quot;MinorAL&amp;quot;, y = &amp;quot;ecc&amp;quot;, title = &amp;quot;Logistic Regression Decision Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source '''window type these commands &lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train, aes(x = minorAL, y = ecc, color = class)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),'''&lt;br /&gt;
&lt;br /&gt;
'''colour = &amp;quot;black&amp;quot;, linewidth = 0.7) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(x = &amp;quot;MinorAL&amp;quot;, y = &amp;quot;ecc&amp;quot;, title = &amp;quot;Logistic Regression Decision Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| We are creating the decision boundary plot using GGPlot2 from the data generated. &lt;br /&gt;
&lt;br /&gt;
It plots the grid points with colors indicating the predicted classes. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The overall plot provides a visual representation of the decision boundary and the distribution of training data points of the '''model'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Select and run these commands.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Drag boundaries to see the plot window clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| We can conclude that the decision boundary of logistic regression is a straight line.&lt;br /&gt;
&lt;br /&gt;
The line separates the data points clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
Limitations of Logistic Regression&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* It is sensitive to outliers, which can affect the accuracy of the classifier.&lt;br /&gt;
* It can perform poorly in the presence of multicollinearity among explanatory variables.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| Here are some of the limitations of Logistic Regression&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Now let us summarize what we have learned.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Summary&lt;br /&gt;
|| In this tutorial we have learned about:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Logistic Regression&lt;br /&gt;
* Assumptions of Logistic Regression&lt;br /&gt;
* Advantages of Logistic Regression&lt;br /&gt;
* Implementation of Logistic Regression using '''Raisin '''dataset'''.'''&lt;br /&gt;
* Model Evaluation.&lt;br /&gt;
* Visualization of the model Decision Boundary&lt;br /&gt;
* Limitations of Logistic Regression Model&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Now we will suggest an assignment for this Spoken Tutorial.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Assignment&lt;br /&gt;
|| &lt;br /&gt;
* Apply logistic regression on the '''Wine '''dataset. &lt;br /&gt;
* This dataset can be found in the '''HDclassif''' package. &lt;br /&gt;
* Install the package and import the dataset using the '''data()''' command.&lt;br /&gt;
* Measure the accuracy of the model&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
About the Spoken Tutorial Project&lt;br /&gt;
|| The video at the following link summarizes the Spoken Tutorial project. Please download and watch it.&lt;br /&gt;
|- &lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Workshops&lt;br /&gt;
|| We conduct workshops using Spoken Tutorials and give certificates.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please contact us.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Forum to answer questions&lt;br /&gt;
|| Please post your timed queries in this forum.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Forum to answer questions&lt;br /&gt;
|| Do you have any general/technical questions?&lt;br /&gt;
&lt;br /&gt;
Please visit the forum given in the link.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Textbook Companion&lt;br /&gt;
|| The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.&lt;br /&gt;
&lt;br /&gt;
We give certificates to those who do this.&lt;br /&gt;
&lt;br /&gt;
For more details, please visit these sites.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Acknowledgment&lt;br /&gt;
|| The '''Spoken Tutorial''' project was established by the Ministry of Education, Government of India. &lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Thank You&lt;br /&gt;
|| This tutorial is contributed by Yate Asseke Ronald O. and Debatosh Chakraborty from IIT Bombay.&lt;br /&gt;
&lt;br /&gt;
Thank you for joining.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Ushav</name></author>	</entry>

	<entry>
		<id>https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Logistic-Regression-in-R/English</id>
		<title>Machine-Learning-using-R/C2/Logistic-Regression-in-R/English</title>
		<link rel="alternate" type="text/html" href="https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Logistic-Regression-in-R/English"/>
				<updated>2024-05-31T10:25:47Z</updated>
		
		<summary type="html">&lt;p&gt;Ushav: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title of the script''': Logistic Regression&lt;br /&gt;
&lt;br /&gt;
'''Author''': Yate Asseke Ronald Olivera and Debatosh Chakraborty&lt;br /&gt;
&lt;br /&gt;
'''Keywords''': R, RStudio, machine learning, supervised, unsupervised, classification, logistic regression, video tutorial.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=1&lt;br /&gt;
| align=center| '''Visual Cue'''&lt;br /&gt;
| align=center| '''Narration'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Opening Slide'''&lt;br /&gt;
|| Welcome to this spoken tutorial on '''Logistic Regression in R.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Learning Objectives'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| In this tutorial, we will learn about &lt;br /&gt;
* Logistic Regression&lt;br /&gt;
* Assumptions of Logistic Regression&lt;br /&gt;
* Advantages of Logistic Regression&lt;br /&gt;
* Implementation of Logistic Regression in '''R''' using '''Raisin '''dataset'''.'''&lt;br /&gt;
* Model Evaluation.&lt;br /&gt;
* Visualization of the model Decision Boundary&lt;br /&gt;
* Limitations of Logistic Regression&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''System Specifications'''&lt;br /&gt;
|| This tutorial is recorded using,&lt;br /&gt;
* '''Windows 11 '''&lt;br /&gt;
* '''R '''version''' 4.3.0'''&lt;br /&gt;
* '''RStudio''' version '''2023.06.1'''&lt;br /&gt;
&lt;br /&gt;
It is recommended to install '''R''' version '''4.2.0''' or higher.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Prerequisites '''&lt;br /&gt;
|| To follow this tutorial, the learner should know:&lt;br /&gt;
* Basic programming in '''R'''.&lt;br /&gt;
* '''Basics of Machine Learning'''.&lt;br /&gt;
&lt;br /&gt;
If not, please access the relevant tutorials on this website.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us learn what '''logistic regression''' is.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Logistic Regression'''&lt;br /&gt;
&lt;br /&gt;
|| Logistic regression is a statistical model used for classification.&lt;br /&gt;
&lt;br /&gt;
It models the probability of success given the explanatory variables.&lt;br /&gt;
&lt;br /&gt;
* It predicts the probability, unlike the response in linear regression.&lt;br /&gt;
* The predicted probability is used as a classifier.&lt;br /&gt;
* The probability of success is modeled using the''' logit or (log odds) '''function.&lt;br /&gt;
* It is a linear classifier, as the logistic regression model has a linear logit.&lt;br /&gt;
* It is often used when the response variable is categorical.&lt;br /&gt;
&lt;br /&gt;
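The logit link mentioned above can be sketched numerically. A minimal, hedged illustration (in Python; z stands for the linear predictor β0 + β1x, with an illustrative value):&lt;br /&gt;

```python
import math

# The model is linear in the log-odds (logit); the inverse logit (sigmoid)
# maps the linear predictor back to a probability of success.
def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))

z = 0.0                 # illustrative linear predictor value
print(inv_logit(z))     # 0.5 -- log-odds of 0 means even odds
```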
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Assumptions of Logistic Regression'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* The distribution of the dependent variable is Bernoulli.&lt;br /&gt;
* The data records are independent.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| The dependent variable's distribution is typically assumed to be a Bernoulli distribution in logistic regression.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Advantages of Logistic Regression'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* It provides estimates of regression coefficients along with their standard errors.&lt;br /&gt;
* It also provides the predicted probability which in turn is used as a classifier.&lt;br /&gt;
* It does not require the explanatory variables to be continuous. &lt;br /&gt;
* In this sense, it is a more general classifier than LDA and QDA.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| Logistic regression offers a significant advantage in that continuous explanatory variables are not a requirement.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Implementation Of Logistic Regression'''&lt;br /&gt;
|| We will implement '''logistic regression''' using the '''Raisin '''dataset. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The additional reading material has more details on the '''Raisin dataset'''.&lt;br /&gt;
&lt;br /&gt;
Please refer to it.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide '''&lt;br /&gt;
&lt;br /&gt;
'''Download Files '''&lt;br /&gt;
|| We will use the script file '''LogisticRegression.R''' and the '''Raisin''' dataset '''‘Raisin_Dataset.xlsx’'''&lt;br /&gt;
&lt;br /&gt;
Please download these files from the''' Code files''' link of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Make a copy and then use them while practicing.&lt;br /&gt;
|- &lt;br /&gt;
|| [Computer screen]&lt;br /&gt;
&lt;br /&gt;
Highlight LogisticRegression.R &lt;br /&gt;
&lt;br /&gt;
Logistic Regression folder.&lt;br /&gt;
|| I have downloaded and moved these files to the '''Logistic Regression''' folder. &lt;br /&gt;
&lt;br /&gt;
This folder is located in the '''MLProject '''folder. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
I have also set the '''Logistic Regression''' folder as my Working Directory.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Let’s create a '''Logistic Regression''' classifier model on the '''raisin''' dataset. &lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us switch to '''RStudio'''. &lt;br /&gt;
|- &lt;br /&gt;
|| Click LogisticRegression.R in RStudio&lt;br /&gt;
&lt;br /&gt;
Point to LogisticRegression.R in RStudio.&lt;br /&gt;
|| Open the script '''LogisticRegression.R''' in '''RStudio'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For this, click on the script '''LogisticRegression.R.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Script '''LogisticRegression.R''' opens in '''RStudio'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [Rstudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the commands&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''library(readxl)'''&lt;br /&gt;
&lt;br /&gt;
'''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
'''library(VGAM)'''&lt;br /&gt;
&lt;br /&gt;
'''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
'''library(dplyr)'''&lt;br /&gt;
&lt;br /&gt;
'''&amp;lt;nowiki&amp;gt;#install.packages(&amp;quot;package_name&amp;quot;)&amp;lt;/nowiki&amp;gt;'''&lt;br /&gt;
&lt;br /&gt;
'''Point to the command.'''&lt;br /&gt;
&lt;br /&gt;
|| Select and run these commands to import the necessary packages.&lt;br /&gt;
&lt;br /&gt;
The '''glm()''' function required to create our classifier comes from base '''R'''.&lt;br /&gt;
&lt;br /&gt;
As I have already installed the packages, I have directly imported them. &lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight &lt;br /&gt;
&lt;br /&gt;
'''data &amp;lt;- read_xlsx(&amp;quot;Raisin_Dataset.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''data &amp;lt;- data[c(&amp;quot;minorAL&amp;quot;,&amp;quot;ecc&amp;quot;,&amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Highlight the commands.'''&lt;br /&gt;
|| These commands will load the '''Raisin dataset.'''&lt;br /&gt;
&lt;br /&gt;
They will also prepare the dataset for model building.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click on '''data '''on the Environment tab.&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Click on '''data '''in the '''Environment '''tab.&lt;br /&gt;
&lt;br /&gt;
It loads the modified dataset in the '''Source''' window. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the data.&lt;br /&gt;
|| Now we split our dataset into training and testing data.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
'''trainIndex&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
'''train &amp;lt;- data[trainIndex, ]'''&lt;br /&gt;
&lt;br /&gt;
'''test &amp;lt;- data[-trainIndex, ]'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''trainIndex &amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''train &amp;lt;- data[trainIndex, ]'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''test &amp;lt;- data[-trainIndex, ]'''&lt;br /&gt;
&lt;br /&gt;
Click on Save and Run buttons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Click on '''train''' and '''test''' to load them in the Source window.&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us create a '''Logistic Regression '''model on the '''training dataset'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''Logistic_model &amp;lt;- glm(class ~ ., data = train, family = &amp;quot;binomial&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''summary(Logistic_model)$coef'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight glm()&lt;br /&gt;
&lt;br /&gt;
Highlight '''class ~ .'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''family = binomial'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''train''' &lt;br /&gt;
|| The function glm() represents generalized linear models. &lt;br /&gt;
&lt;br /&gt;
Logistic regression is one of the models it can fit. &lt;br /&gt;
&lt;br /&gt;
This is the formula for our model. &lt;br /&gt;
&lt;br /&gt;
We try to predict the target variable '''class''' based on the '''minorAL''' and '''ecc''' features.&lt;br /&gt;
&lt;br /&gt;
This ensures that our model predicts the probability for 2 classes.&lt;br /&gt;
&lt;br /&gt;
It ensures that, out of all the models in glm, the logistic regression model is fit.&lt;br /&gt;
&lt;br /&gt;
This is the data used to train our model.&lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
The output is shown in the '''console '''window.&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the console window.&lt;br /&gt;
|| Drag boundary to see the '''console '''window. &lt;br /&gt;
|- &lt;br /&gt;
|| Point the output in the '''console'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''Coefficients'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''Pr(&amp;gt;|z|)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
'''Coefficients''' denote the coefficients of the logit function.&lt;br /&gt;
&lt;br /&gt;
That means the log-odds of class change by -0.04 for every unit change in minorAL.&lt;br /&gt;
&lt;br /&gt;
The lower p-values suggest that the effects are statistically significant.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the '''Source '''window.&lt;br /&gt;
|| Drag boundary to see the '''Source''' window.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us now use our model to make predictions on test data.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''Predicted.prob &amp;lt;- predict(Logistic_model, test, type=&amp;quot;response&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''View(Predicted.prob)'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''Predicted.prob &amp;lt;- predict(Logistic_model, test, type=&amp;quot;response&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''type = &amp;quot;response&amp;quot;'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| This command provides the predicted probability of the logistic regression model on the test dataset.&lt;br /&gt;
&lt;br /&gt;
This argument ensures the outcome is a probability.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands&lt;br /&gt;
|- &lt;br /&gt;
|| Point&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Value&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
'''Predicted.prob '''stores the predicted probability of each observation belonging to a certain class.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''predicted.classes &amp;lt;- factor(ifelse(Predicted.prob &amp;gt; 0.5, &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
|| In the source window type the following commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight &lt;br /&gt;
&lt;br /&gt;
'''predicted.classes &amp;lt;- factor(ifelse(Predicted.prob &amp;gt; 0.5, &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
|| This retrieves the predicted classes from the probabilities. &lt;br /&gt;
&lt;br /&gt;
If the probability is greater than 0.5, the '''Kecimen''' class is chosen; otherwise the '''Besni''' class is chosen.&lt;br /&gt;
&lt;br /&gt;
We also convert the output to a '''factor''' datatype so that it can be passed to the '''confusionMatrix()''' function.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands&lt;br /&gt;
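The thresholding step can be illustrated on a toy probability vector (not the tutorial's data):

```r
# Toy probabilities; values above 0.5 map to "Kecimen", the rest to "Besni"
p = c(0.2, 0.7, 0.5, 0.9)
cls = factor(ifelse(p > 0.5, "Kecimen", "Besni"))
cls  # Besni Kecimen Besni Kecimen (0.5 is not greater than 0.5)
```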
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us measure the accuracy of our model. &lt;br /&gt;
|- &lt;br /&gt;
|| '''confusion_matrix &amp;lt;- confusionMatrix(predicted.classes, test$class)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command '''confusionMatrix(predicted.classes, test$class)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to '''confusion_matrix''' in the '''Environment''' tab&lt;br /&gt;
&lt;br /&gt;
Highlight the attribute&lt;br /&gt;
&lt;br /&gt;
'''table'''&lt;br /&gt;
|| This command creates a confusion matrix list.&lt;br /&gt;
&lt;br /&gt;
The list is created from the actual and predicted class labels and stored in the '''confusion_matrix''' variable.&lt;br /&gt;
&lt;br /&gt;
It helps to assess the classification model's performance and accuracy.&lt;br /&gt;
&lt;br /&gt;
Select and run these commands&lt;br /&gt;
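The accuracy reported by the confusion matrix is the share of predictions on the diagonal of the table. A minimal sketch with toy labels (not the Raisin data):

```r
# Four toy observations: one Besni is misclassified as Kecimen
actual    = factor(c("Besni", "Besni", "Kecimen", "Kecimen"))
predicted = factor(c("Besni", "Kecimen", "Kecimen", "Kecimen"))
tab = table(predicted, actual)
sum(diag(tab)) / sum(tab)  # 0.75: three of four predictions are correct
```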
&lt;br /&gt;
|- &lt;br /&gt;
|| '''plot_confusion_matrix &amp;lt;- function(confusion_matrix){'''&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
'''tab = as.data.frame(tab)'''&lt;br /&gt;
&lt;br /&gt;
'''tab$Prediction &amp;lt;- factor(tab$Prediction, levels = rev(levels(tab$Prediction)))'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- tab %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''rename(Actual = Reference) %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''mutate(cor = if_else(Actual == Prediction, 1,0))'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''tab$cor &amp;lt;- as.factor(tab$cor)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''ggplot(tab, aes(Actual,Prediction)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_tile(aes(fill= cor),alpha = 0.4) + geom_text(aes(label=Freq)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;red&amp;quot;,&amp;quot;green&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_light() +'''&lt;br /&gt;
&lt;br /&gt;
'''theme(legend.position = &amp;quot;None&amp;quot;,'''&lt;br /&gt;
&lt;br /&gt;
'''line = element_blank()) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_x_discrete(position = &amp;quot;top&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Highlight '''the command &lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
'''Highlight '''the command&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
'''tab = as.data.frame(tab)'''&lt;br /&gt;
&lt;br /&gt;
'''tab$Prediction &amp;lt;- factor(tab$Prediction, levels = rev(levels(tab$Prediction)))'''&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- tab %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''rename(Actual = Reference) %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''mutate(cor = if_else(Actual == Prediction, 1,0))'''&lt;br /&gt;
&lt;br /&gt;
'''tab$cor &amp;lt;- as.factor(tab$cor)'''&lt;br /&gt;
&lt;br /&gt;
'''Highlight '''the command&lt;br /&gt;
&lt;br /&gt;
'''ggplot(tab, aes(Actual,Prediction)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_tile(aes(fill= cor),alpha = 0.4) + geom_text(aes(label=Freq)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;red&amp;quot;,&amp;quot;green&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_light() +'''&lt;br /&gt;
&lt;br /&gt;
'''theme(legend.position = &amp;quot;None&amp;quot;,'''&lt;br /&gt;
&lt;br /&gt;
'''line = element_blank()) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_x_discrete(position = &amp;quot;top&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
|| These commands create a function '''plot_confusion_matrix '''that displays the confusion matrix stored in the list created earlier.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It fetches the confusion matrix table from the list.&lt;br /&gt;
&lt;br /&gt;
It creates a data frame from the table which is suitable for plotting using '''GGPlot2'''.&lt;br /&gt;
&lt;br /&gt;
It plots the confusion matrix using the data frame created.&lt;br /&gt;
&lt;br /&gt;
It represents correct and incorrect predictions using different colors.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''plot_confusion_matrix(confusion_matrix)'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''plot_confusion_matrix(confusion_matrix)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We use the '''plot_confusion_matrix()''' function to generate a visual plot of the '''confusion matrix list created.'''&lt;br /&gt;
&lt;br /&gt;
Select and run the command&lt;br /&gt;
&lt;br /&gt;
The output is seen in the '''plot''' window&lt;br /&gt;
|- &lt;br /&gt;
|| '''Output in Plot window.'''&lt;br /&gt;
&lt;br /&gt;
|| This plot shows how well our model predicted the testing data.&lt;br /&gt;
&lt;br /&gt;
We observe that:&lt;br /&gt;
&lt;br /&gt;
'''21 '''misclassifications of Besni Class.&lt;br /&gt;
&lt;br /&gt;
'''13 '''misclassifications of Kecimen class.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid &amp;lt;- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500),'''&lt;br /&gt;
&lt;br /&gt;
'''ecc = seq(min(data$ecc), max(data$ecc), length = 500)) '''&lt;br /&gt;
&lt;br /&gt;
'''grid$prob &amp;lt;- predict(Logistic_model, newdata = grid, type = &amp;quot;response&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid$class &amp;lt;- ifelse(grid$prob &amp;gt; 0.5, 'Kecimen', 'Besni')'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid$classnum &amp;lt;- as.numeric(as.factor(grid$class))'''&lt;br /&gt;
&lt;br /&gt;
|| We will now visualize the decision boundary of the model.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''grid &amp;lt;- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500),'''&lt;br /&gt;
&lt;br /&gt;
'''ecc = seq(min(data$ecc), max(data$ecc), length = 500)) '''&lt;br /&gt;
&lt;br /&gt;
'''grid$prob &amp;lt;- predict(Logistic_model, newdata = grid, type = &amp;quot;response&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''grid$class &amp;lt;- ifelse(grid$prob &amp;gt; 0.5, 'Kecimen', 'Besni')'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid$classnum &amp;lt;- as.numeric(as.factor(grid$class))'''&lt;br /&gt;
|| This code first generates a '''grid '''of points spanning the range of '''minorAL '''and '''ecc''' features in the dataset. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Then, it uses the '''Logistic Regression '''model to predict the probability of each point in this grid, storing these predictions as a new column ''''prob' '''in the '''grid '''dataframe. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It converts the predicted probabilities of the points into classes.&lt;br /&gt;
&lt;br /&gt;
If the probability exceeds 0.5 then '''Kecimen '''class otherwise '''Besni '''Class is chosen.&lt;br /&gt;
&lt;br /&gt;
The predicted classes are stored in the '''class''' column of the '''grid''' data frame.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The '''as.numeric''' function encodes the predicted class string labels as numeric values.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Select and run the commands&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Click on grid in the Environment tab to load the generated data in the Source window.&lt;br /&gt;
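The role of '''expand.grid()''' can be seen on a tiny example; it returns every combination of the supplied sequences:

```r
# 3 x-values crossed with 2 y-values gives 6 grid points
g = expand.grid(x = seq(0, 1, length = 3), y = seq(0, 1, length = 2))
nrow(g)  # 6
```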
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),'''&lt;br /&gt;
&lt;br /&gt;
'''colour = &amp;quot;black&amp;quot;, linewidth = 0.7) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(x = &amp;quot;MinorAL&amp;quot;, y = &amp;quot;ecc&amp;quot;, title = &amp;quot;Logistic Regression Decision Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source '''window type these commands &lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),'''&lt;br /&gt;
&lt;br /&gt;
'''colour = &amp;quot;black&amp;quot;, linewidth = 0.7) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(x = &amp;quot;MinorAL&amp;quot;, y = &amp;quot;ecc&amp;quot;, title = &amp;quot;Logistic Regression Decision Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| We are creating the decision boundary plot using GGPlot2 from the data generated. &lt;br /&gt;
&lt;br /&gt;
It plots the grid points with colors indicating the predicted classes. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The overall plot provides a visual representation of the decision boundary and the distribution of training data points of the '''model'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Select and run these commands.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Drag boundaries to see the plot window clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| We can conclude that the decision boundary of logistic regression is a straight line.&lt;br /&gt;
&lt;br /&gt;
The line separates the data points clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
Limitations of Logistic Regression&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* It’s sensitive to outliers which can affect the accuracy of the classifier.&lt;br /&gt;
* It can perform poorly in the presence of multicollinearity among explanatory variables.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| Here are some of the limitations of Logistic Regression&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Now let us summarize what we have learned.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Summary&lt;br /&gt;
|| In this tutorial we have learned about:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Logistic Regression&lt;br /&gt;
* Assumptions of Logistic Regression&lt;br /&gt;
* Advantages of Logistic Regression&lt;br /&gt;
* Implementation of Logistic Regression using '''Raisin '''dataset'''.'''&lt;br /&gt;
* Model Evaluation.&lt;br /&gt;
* Visualization of the model Decision Boundary&lt;br /&gt;
* Limitations of Logistic Regression&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Now we will suggest an assignment for this Spoken Tutorial.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Assignment&lt;br /&gt;
|| &lt;br /&gt;
* Apply logistic regression on the '''Wine '''dataset. &lt;br /&gt;
* This dataset can be found in the '''HDclassif''' package. &lt;br /&gt;
* Install the package and import the dataset using the '''data()''' command.&lt;br /&gt;
* Measure the accuracy of the model&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
About the Spoken Tutorial Project&lt;br /&gt;
|| The video at the following link summarizes the Spoken Tutorial project. Please download and watch it.&lt;br /&gt;
|- &lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Workshops&lt;br /&gt;
|| We conduct workshops using Spoken Tutorials and give certificates.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please contact us.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Forum to answer questions&lt;br /&gt;
|| Please post your timed queries in this forum.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Forum to answer questions&lt;br /&gt;
|| Do you have any general/technical questions?&lt;br /&gt;
&lt;br /&gt;
Please visit the forum given in the link.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Textbook Companion&lt;br /&gt;
|| The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.&lt;br /&gt;
&lt;br /&gt;
We give certificates to those who do this.&lt;br /&gt;
&lt;br /&gt;
For more details, please visit these sites.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Acknowledgment&lt;br /&gt;
|| The '''Spoken Tutorial Project''' was established by the Ministry of Education, Government of India. &lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Thank You&lt;br /&gt;
|| This tutorial is contributed by Yate Asseke Ronald O. and Debatosh Chakraborty from IIT Bombay.&lt;br /&gt;
&lt;br /&gt;
Thank you for joining.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Ushav</name></author>	</entry>

	<entry>
		<id>https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Logistic-Regression-in-R/English</id>
		<title>Machine-Learning-using-R/C2/Logistic-Regression-in-R/English</title>
		<link rel="alternate" type="text/html" href="https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Logistic-Regression-in-R/English"/>
				<updated>2024-05-31T10:19:19Z</updated>
		
		<summary type="html">&lt;p&gt;Ushav: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title of the script''': Logistic Regression&lt;br /&gt;
&lt;br /&gt;
'''Author''': Yate Asseke Ronald Olivera and Debatosh Chakraborty&lt;br /&gt;
&lt;br /&gt;
'''Keywords''': R, RStudio, machine learning, supervised, unsupervised, classification, logistic regression, video tutorial.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=1&lt;br /&gt;
| align=center| '''Visual Cue'''&lt;br /&gt;
| align=center| '''Narration'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Opening Slide'''&lt;br /&gt;
|| Welcome to this spoken tutorial on '''Logistic Regression in R.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Learning Objectives'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| In this tutorial, we will learn about &lt;br /&gt;
* Logistic Regression&lt;br /&gt;
* Assumptions of Logistic Regression&lt;br /&gt;
* Advantages of Logistic Regression&lt;br /&gt;
* Implementation of Logistic Regression in '''R''' using '''Raisin '''dataset'''.'''&lt;br /&gt;
* Model Evaluation.&lt;br /&gt;
* Visualization of the model Decision Boundary&lt;br /&gt;
* Limitations of Logistic Regression&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''System Specifications'''&lt;br /&gt;
|| This tutorial is recorded using,&lt;br /&gt;
* '''Windows 11 '''&lt;br /&gt;
* '''R '''version''' 4.3.0'''&lt;br /&gt;
* '''RStudio''' version '''2023.06.1'''&lt;br /&gt;
&lt;br /&gt;
It is recommended to install '''R''' version '''4.2.0''' or higher.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Prerequisites '''&lt;br /&gt;
|| To follow this tutorial, the learner should know:&lt;br /&gt;
* Basic programming in '''R'''.&lt;br /&gt;
* '''Basics of Machine Learning'''.&lt;br /&gt;
&lt;br /&gt;
If not, please access the relevant tutorials on this website.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us learn what '''logistic regression''' is&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Logistic Regression'''&lt;br /&gt;
&lt;br /&gt;
|| Logistic regression is a statistical model used for classification.&lt;br /&gt;
&lt;br /&gt;
It models the probability of success for the explanatory variable.&lt;br /&gt;
&lt;br /&gt;
* It predicts a probability, rather than a continuous response as in linear regression.&lt;br /&gt;
* The predicted probability is used as a classifier.&lt;br /&gt;
* The probability of success is modeled using the''' logit or (log odds) '''function.&lt;br /&gt;
* It is a linear classifier, as the logistic regression model has a linear logit.&lt;br /&gt;
* It is often used when the response variable is categorical.&lt;br /&gt;
&lt;br /&gt;
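The logit link and its inverse, the logistic (sigmoid) function, can be sketched directly in R (notation assumed, not from the tutorial):

```r
# logit maps a probability to log-odds; sigmoid maps log-odds back
logit   = function(p) log(p / (1 - p))
sigmoid = function(z) 1 / (1 + exp(-z))
sigmoid(logit(0.8))  # 0.8: the two functions are inverses
```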
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Assumptions of Logistic Regression'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* The distribution of the dependent variable is Bernoulli.&lt;br /&gt;
* The data records are independent.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| The dependent variable's distribution is typically assumed to be a Bernoulli distribution in logistic regression.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Advantages of Logistic Regression'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* It provides estimates of regression coefficients along with their standard errors.&lt;br /&gt;
* It also provides the predicted probability which in turn is used as a classifier.&lt;br /&gt;
* It doesn’t need explanatory variables to be necessarily continuous. &lt;br /&gt;
* In this sense, it is a more general classifier than LDA and QDA.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| Logistic regression offers a significant advantage in that continuous explanatory variables are not a requirement.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Implementation Of Logistic Regression'''&lt;br /&gt;
|| We will implement '''logistic regression''' using the '''Raisin '''dataset. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The additional reading material has more details on the '''Raisin dataset'''.&lt;br /&gt;
&lt;br /&gt;
Please refer to it.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide '''&lt;br /&gt;
&lt;br /&gt;
'''Download Files '''&lt;br /&gt;
|| We will use a script file '''LogisticRegression.R '''and '''Raisin Dataset ‘raisin.xlsx’'''&lt;br /&gt;
&lt;br /&gt;
Please download these files from the''' Code files''' link of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Make a copy and then use them while practicing.&lt;br /&gt;
|- &lt;br /&gt;
|| [Computer screen]&lt;br /&gt;
&lt;br /&gt;
Highlight LogisticRegression.R &lt;br /&gt;
&lt;br /&gt;
Logistic Regression folder.&lt;br /&gt;
|| I have downloaded and moved these files to the '''Logistic Regression''' folder. &lt;br /&gt;
&lt;br /&gt;
This folder is located in the '''MLProject '''folder. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
I have also set the '''Logistic Regression''' folder as my Working Directory.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Let’s create a '''Logistic Regression''' classifier model on the '''raisin''' dataset. &lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us switch to '''RStudio'''. &lt;br /&gt;
|- &lt;br /&gt;
|| Click LogisticRegression.R in RStudio&lt;br /&gt;
&lt;br /&gt;
Point to LogisticRegression.R in RStudio.&lt;br /&gt;
|| Open the script '''LogisticRegression.R''' in '''RStudio'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For this, click on the script '''LogisticRegression.R.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Script '''LogisticRegression.R''' opens in '''RStudio'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [Rstudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the commands&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''library(readxl)'''&lt;br /&gt;
&lt;br /&gt;
'''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
'''library(VGAM)'''&lt;br /&gt;
&lt;br /&gt;
'''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
'''library(dplyr)'''&lt;br /&gt;
&lt;br /&gt;
'''&amp;lt;nowiki&amp;gt;#install.packages(“package_name”)&amp;lt;/nowiki&amp;gt;'''&lt;br /&gt;
&lt;br /&gt;
'''Point to the command.'''&lt;br /&gt;
&lt;br /&gt;
|| Select and run these commands to import the necessary packages.&lt;br /&gt;
&lt;br /&gt;
The '''glm()''' function required to create our classifier comes with base '''R''', in the '''stats''' package.&lt;br /&gt;
&lt;br /&gt;
As I have already installed the packages, I have imported them directly.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight &lt;br /&gt;
&lt;br /&gt;
'''data &amp;lt;- read_xlsx(&amp;quot;Raisin_Dataset.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''data &amp;lt;- data[c(&amp;quot;minorAL&amp;quot;,&amp;quot;ecc&amp;quot;,&amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Highlight the commands.'''&lt;br /&gt;
|| These commands will load the '''Raisin dataset.'''&lt;br /&gt;
&lt;br /&gt;
They will also prepare the dataset for model building.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click on '''data '''on the Environment tab.&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Click on '''data '''in the '''Environment '''tab.&lt;br /&gt;
&lt;br /&gt;
It loads the modified dataset in the '''Source''' window. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the data.&lt;br /&gt;
|| Now we split our dataset into training and testing data.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
'''trainIndex&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
'''train &amp;lt;- data[trainIndex, ]'''&lt;br /&gt;
&lt;br /&gt;
'''test &amp;lt;- data[-trainIndex, ]'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''trainIndex &amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''train &amp;lt;- data[trainIndex, ]'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''test &amp;lt;- data[-trainIndex, ]'''&lt;br /&gt;
&lt;br /&gt;
Click on Save and Run buttons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Click on '''train '''and '''test '''to load them in the Source window.&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
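The split logic can be checked on a toy data frame (same pattern, not the Raisin data):

```r
set.seed(1)
d = data.frame(x = 1:10)
# 70% of row indices, sampled without replacement
idx = sample(1:nrow(d), size = 0.7 * nrow(d), replace = FALSE)
tr = d[idx, , drop = FALSE]
te = d[-idx, , drop = FALSE]
c(nrow(tr), nrow(te))  # 7 3: the two parts partition the data
```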
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us create a '''Logistic Regression '''model on the '''training dataset'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''Logistic_model &amp;lt;- glm(class ~ ., data = train, family = &amp;quot;binomial&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''summary(Logistic_model)$coef'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
|  | Highlight glm()&lt;br /&gt;
&lt;br /&gt;
Highlight '''class ~ .'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''family = binomial'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''train''' &lt;br /&gt;
|| The function '''glm()''' fits generalized linear models. &lt;br /&gt;
&lt;br /&gt;
Logistic regression is among the class of models that it fits. &lt;br /&gt;
&lt;br /&gt;
This is the formula for our model. &lt;br /&gt;
&lt;br /&gt;
We try to predict target variable '''class''' based on '''minorAL '''and '''ecc '''features.&lt;br /&gt;
&lt;br /&gt;
This ensures that our model predicts the probability for two classes.&lt;br /&gt;
&lt;br /&gt;
It ensures that, out of all the models '''glm()''' can fit, the logistic regression model is chosen.&lt;br /&gt;
&lt;br /&gt;
This is the data used to train our model.&lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
The output is shown in the '''console '''window.&lt;br /&gt;
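A self-contained sketch of the same '''glm()''' call on simulated data (toy variables, not the Raisin features):

```r
set.seed(1)
x = rnorm(50)
y = rbinom(50, 1, plogis(x))           # binary response driven by x
m = glm(y ~ x, family = "binomial")    # logistic regression, as in the tutorial
length(coef(m))  # 2: an intercept and one slope
```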
|- &lt;br /&gt;
|| Drag boundary to see the console window.&lt;br /&gt;
|| Drag boundary to see the '''console '''window. &lt;br /&gt;
|- &lt;br /&gt;
|| Point the output in the '''console'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''Coefficients'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''Pr(&amp;gt;|z|)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
'''Coefficients''' denote the coefficients of the logit function.&lt;br /&gt;
&lt;br /&gt;
That means the log-odds of the class change by -0.04 for every unit change in '''minorAL'''.&lt;br /&gt;
&lt;br /&gt;
The lower p-values suggest that the effects are statistically significant.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the '''Source '''window.&lt;br /&gt;
|| Drag boundary to see the '''Source''' window.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us now use our model to make predictions on test data.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''Predicted.prob &amp;lt;- predict(Logistic_model, test, type=&amp;quot;response&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''View(Predicted.prob)'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''Predicted.prob &amp;lt;- predict(Logistic_model, test, type=&amp;quot;response&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''type = &amp;quot;response&amp;quot;'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| This command computes the predicted probabilities of the logistic regression model on the test dataset.&lt;br /&gt;
&lt;br /&gt;
The '''type = &amp;quot;response&amp;quot;''' argument ensures the output is a probability rather than a log-odds value.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands&lt;br /&gt;
|- &lt;br /&gt;
|| Point&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Value&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
'''Predicted.prob '''stores the predicted probability of each observation belonging to the '''Kecimen''' class.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''predicted.classes &amp;lt;- factor(ifelse(Predicted.prob &amp;gt; 0.5, &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
|| In the source window type the following commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight &lt;br /&gt;
&lt;br /&gt;
'''predicted.classes &amp;lt;- factor(ifelse(Predicted.prob &amp;gt; 0.5, &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
|| This retrieves the predicted classes from the probabilities. &lt;br /&gt;
&lt;br /&gt;
If the probability is greater than 0.5, the '''Kecimen''' class is chosen; otherwise the '''Besni''' class is chosen.&lt;br /&gt;
&lt;br /&gt;
We also convert the output to a '''factor''' datatype so that it can be passed to the '''confusionMatrix()''' function.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us measure the accuracy of our model. &lt;br /&gt;
|- &lt;br /&gt;
|| '''confusion_matrix &amp;lt;- confusionMatrix(predicted.classes, test$class)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command '''confusionMatrix(predicted.classes, test$class)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to '''confusion_matrix''' in the '''Environment''' tab&lt;br /&gt;
&lt;br /&gt;
Highlight the attribute&lt;br /&gt;
&lt;br /&gt;
'''table'''&lt;br /&gt;
|| This command creates a confusion matrix list.&lt;br /&gt;
&lt;br /&gt;
The list is created from the actual and predicted class labels and stored in the '''confusion_matrix''' variable.&lt;br /&gt;
&lt;br /&gt;
It helps to assess the classification model's performance and accuracy.&lt;br /&gt;
&lt;br /&gt;
Select and run these commands&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''plot_confusion_matrix &amp;lt;- function(confusion_matrix){'''&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
'''tab = as.data.frame(tab)'''&lt;br /&gt;
&lt;br /&gt;
'''tab$Prediction &amp;lt;- factor(tab$Prediction, levels = rev(levels(tab$Prediction)))'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- tab %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''rename(Actual = Reference) %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''mutate(cor = if_else(Actual == Prediction, 1,0))'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''tab$cor &amp;lt;- as.factor(tab$cor)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''ggplot(tab, aes(Actual,Prediction)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_tile(aes(fill= cor),alpha = 0.4) + geom_text(aes(label=Freq)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;red&amp;quot;,&amp;quot;green&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_light() +'''&lt;br /&gt;
&lt;br /&gt;
'''theme(legend.position = &amp;quot;None&amp;quot;,'''&lt;br /&gt;
&lt;br /&gt;
'''line = element_blank()) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_x_discrete(position = &amp;quot;top&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Highlight '''the command &lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
'''Highlight '''the command&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
'''tab = as.data.frame(tab)'''&lt;br /&gt;
&lt;br /&gt;
'''tab$Prediction &amp;lt;- factor(tab$Prediction, levels = rev(levels(tab$Prediction)))'''&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- tab %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''rename(Actual = Reference) %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''mutate(cor = if_else(Actual == Prediction, 1,0))'''&lt;br /&gt;
&lt;br /&gt;
'''tab$cor &amp;lt;- as.factor(tab$cor)'''&lt;br /&gt;
&lt;br /&gt;
'''Highlight '''the command&lt;br /&gt;
&lt;br /&gt;
'''ggplot(tab, aes(Actual,Prediction)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_tile(aes(fill= cor),alpha = 0.4) + geom_text(aes(label=Freq)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;red&amp;quot;,&amp;quot;green&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_light() +'''&lt;br /&gt;
&lt;br /&gt;
'''theme(legend.position = &amp;quot;None&amp;quot;,'''&lt;br /&gt;
&lt;br /&gt;
'''line = element_blank()) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_x_discrete(position = &amp;quot;top&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
|| These commands create a function '''plot_confusion_matrix '''that displays the confusion matrix stored in the list created earlier.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It fetches the confusion matrix table from the list.&lt;br /&gt;
&lt;br /&gt;
It creates a data frame from the table which is suitable for plotting using '''GGPlot2'''.&lt;br /&gt;
&lt;br /&gt;
It plots the confusion matrix using the data frame created.&lt;br /&gt;
&lt;br /&gt;
It represents correct and incorrect predictions using different colors.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''plot_confusion_matrix(confusion_matrix)'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''plot_confusion_matrix(confusion_matrix)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We use the '''plot_confusion_matrix()''' function to generate a visual plot of the '''confusion matrix list created.'''&lt;br /&gt;
&lt;br /&gt;
Select and run the command&lt;br /&gt;
&lt;br /&gt;
The output is seen in the '''plot''' window&lt;br /&gt;
|- &lt;br /&gt;
|| '''Output in Plot window.'''&lt;br /&gt;
&lt;br /&gt;
|| This plot shows how well our model predicted the testing data.&lt;br /&gt;
&lt;br /&gt;
We observe that:&lt;br /&gt;
&lt;br /&gt;
'''21 '''misclassifications of the Besni class and&lt;br /&gt;
&lt;br /&gt;
'''13 '''misclassifications of the Kecimen class.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid &amp;lt;- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500),'''&lt;br /&gt;
&lt;br /&gt;
'''ecc = seq(min(data$ecc), max(data$ecc), length = 500)) '''&lt;br /&gt;
&lt;br /&gt;
'''grid$prob &amp;lt;- predict(Logistic_model, newdata = grid, type = &amp;quot;response&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid$class &amp;lt;- ifelse(grid$prob &amp;gt; 0.5, 'Kecimen', 'Besni')'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid$classnum &amp;lt;- as.numeric(as.factor(grid$class))'''&lt;br /&gt;
&lt;br /&gt;
|| We will now visualize the decision boundary of the model.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''grid &amp;lt;- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500),'''&lt;br /&gt;
&lt;br /&gt;
'''ecc = seq(min(data$ecc), max(data$ecc), length = 500)) '''&lt;br /&gt;
&lt;br /&gt;
'''grid$prob &amp;lt;- predict(Logistic_model, newdata = grid, type = &amp;quot;response&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''grid$class &amp;lt;- ifelse(grid$prob &amp;gt; 0.5, 'Kecimen', 'Besni')'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid$classnum &amp;lt;- as.numeric(as.factor(grid$class))'''&lt;br /&gt;
|| This code first generates a '''grid '''of points spanning the range of '''minorAL '''and '''ecc''' features in the dataset. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Then, it uses the '''Logistic Regression '''model to predict the probability of each point in this grid, storing these predictions as a new column ''''prob' '''in the '''grid '''dataframe. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It converts the predicted probabilities of the points into classes.&lt;br /&gt;
&lt;br /&gt;
If the probability exceeds 0.5, the '''Kecimen '''class is chosen; otherwise, the '''Besni '''class is chosen.&lt;br /&gt;
&lt;br /&gt;
The predicted classes are stored in the ‘class’ column of the grid data frame.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The '''as.numeric''' function encodes the predicted classes' string labels into numeric values.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Select and run the commands&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Click on grid in the Environment tab to load the generated data in the Source window.&lt;br /&gt;
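The grid construction above (R's expand.grid over evenly spaced feature values) can be sketched in Python. This is an illustrative aside only; the ranges below are made up, and the tutorial itself works in R.

```python
from itertools import product

def seq(lo, hi, n):
    # n evenly spaced values from lo to hi, like R's seq(lo, hi, length = n)
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]

def make_grid(x_range, y_range, steps):
    # Every combination of the two feature sequences, like expand.grid
    xs = seq(x_range[0], x_range[1], steps)
    ys = seq(y_range[0], y_range[1], steps)
    return [(x, y) for x, y in product(xs, ys)]

grid = make_grid((0.0, 1.0), (0.0, 2.0), 5)
assert len(grid) == 25  # 5 x 5 combinations
```

Each grid point would then be scored by the fitted model, exactly as the R code does with predict().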
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train, aes(x = minorAL, y = ecc, color = class)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),'''&lt;br /&gt;
&lt;br /&gt;
'''colour = &amp;quot;black&amp;quot;, linewidth = 0.7) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(x = &amp;quot;MinorAL&amp;quot;, y = &amp;quot;ecc&amp;quot;, title = &amp;quot;Logistic Regression Decision Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source '''window type these commands &lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train, aes(x = minorAL, y = ecc, color = class)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),'''&lt;br /&gt;
&lt;br /&gt;
'''colour = &amp;quot;black&amp;quot;, linewidth = 0.7) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(x = &amp;quot;MinorAL&amp;quot;, y = &amp;quot;ecc&amp;quot;, title = &amp;quot;Logistic Regression Decision Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| We are creating the decision boundary plot using GGPlot2 from the data generated. &lt;br /&gt;
&lt;br /&gt;
It plots the grid points with colors indicating the predicted classes. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The overall plot provides a visual representation of the decision boundary and the distribution of training data points of the '''model'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Select and run these commands.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Drag boundaries to see the plot window clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| We can conclude that the decision boundary of logistic regression is a straight line.&lt;br /&gt;
&lt;br /&gt;
The line separates the data points clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
Limitations of Logistic Regression&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* It’s sensitive to outliers which can affect the accuracy of the classifier.&lt;br /&gt;
* It can perform poorly in the presence of multicollinearity among explanatory variables.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| Here are some of the limitations of Logistic Regression&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us summarize what we have learned.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Summary&lt;br /&gt;
|| In this tutorial we have learned about:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Logistic Regression&lt;br /&gt;
* Assumptions of Logistic Regression&lt;br /&gt;
* Advantages of Logistic Regression&lt;br /&gt;
* Implementation of Logistic Regression in '''R''' using '''Raisin '''dataset'''.'''&lt;br /&gt;
* Model Evaluation.&lt;br /&gt;
* Visualization of the model Decision Boundary&lt;br /&gt;
* Limitations of Logistic Regression&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Now we will suggest an assignment for this Spoken Tutorial.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Assignment&lt;br /&gt;
|| &lt;br /&gt;
* Apply logistic regression on the '''Wine '''dataset. &lt;br /&gt;
* This dataset can be found in the '''HDclassif''' package. &lt;br /&gt;
* Install the package and import the dataset using the '''data()''' command.&lt;br /&gt;
* Measure the accuracy of the model&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
About the Spoken Tutorial Project&lt;br /&gt;
|| The video at the following link summarizes the Spoken Tutorial project. Please download and watch it.&lt;br /&gt;
|- &lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Workshops&lt;br /&gt;
|| We conduct workshops using Spoken Tutorials and give certificates.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please contact us.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Forum to answer questions&lt;br /&gt;
|| Please post your timed queries in this forum.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Forum to answer questions&lt;br /&gt;
|| Do you have any general/technical questions?&lt;br /&gt;
&lt;br /&gt;
Please visit the forum given in the link.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Textbook Companion&lt;br /&gt;
|| The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.&lt;br /&gt;
&lt;br /&gt;
We give certificates to those who do this.&lt;br /&gt;
&lt;br /&gt;
For more details, please visit these sites.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Acknowledgment&lt;br /&gt;
|| The '''Spoken Tutorial Project''' was established by the Ministry of Education, Government of India. &lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Thank You&lt;br /&gt;
|| This tutorial is contributed by Yate Asseke Ronald O. and Debatosh Chakraborty from IIT Bombay.&lt;br /&gt;
&lt;br /&gt;
Thank you for joining.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Ushav</name></author>	</entry>

	<entry>
		<id>https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Logistic-Regression-in-R/English</id>
		<title>Machine-Learning-using-R/C2/Logistic-Regression-in-R/English</title>
		<link rel="alternate" type="text/html" href="https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Logistic-Regression-in-R/English"/>
				<updated>2024-05-31T09:17:24Z</updated>
		
		<summary type="html">&lt;p&gt;Ushav: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title of the script''': Logistic Regression&lt;br /&gt;
&lt;br /&gt;
'''Author''': Yate Asseke Ronald Olivera and Debatosh Chakraborty&lt;br /&gt;
&lt;br /&gt;
'''Keywords''': R, RStudio, machine learning, supervised, unsupervised, classification, logistic regression, video tutorial.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=1&lt;br /&gt;
| align=center| '''Visual Cue'''&lt;br /&gt;
| align=center| '''Narration'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Opening Slide'''&lt;br /&gt;
|| Welcome to this spoken tutorial on '''Logistic Regression in R.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Learning Objectives'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| In this tutorial, we will learn about &lt;br /&gt;
* Logistic Regression&lt;br /&gt;
* Assumptions of Logistic Regression&lt;br /&gt;
* Advantages of Logistic Regression&lt;br /&gt;
* Implementation of Logistic Regression in '''R''' using '''Raisin '''dataset'''.'''&lt;br /&gt;
* Model Evaluation.&lt;br /&gt;
* Visualization of the model Decision Boundary&lt;br /&gt;
* Limitations of Logistic Regression&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''System Specifications'''&lt;br /&gt;
|| This tutorial is recorded using,&lt;br /&gt;
* '''Windows 11 '''&lt;br /&gt;
* '''R '''version''' 4.3.0'''&lt;br /&gt;
* '''RStudio''' version '''2023.06.1'''&lt;br /&gt;
&lt;br /&gt;
It is recommended to install '''R''' version '''4.2.0''' or higher.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Prerequisites '''&lt;br /&gt;
|| To follow this tutorial, the learner should know:&lt;br /&gt;
* Basic programming in '''R'''.&lt;br /&gt;
* '''Basics of Machine Learning'''.&lt;br /&gt;
&lt;br /&gt;
If not, please access the relevant tutorials on this website.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us learn what '''logistic regression''' is&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Logistic Regression'''&lt;br /&gt;
&lt;br /&gt;
|| Logistic regression is a statistical model used for classification.&lt;br /&gt;
&lt;br /&gt;
It models the probability of success as a function of the explanatory variables.&lt;br /&gt;
&lt;br /&gt;
* It predicts the probability, unlike the response in linear regression.&lt;br /&gt;
* The predicted probability is used as a classifier.&lt;br /&gt;
* The probability of success is modeled using the''' logit or (log odds) '''function.&lt;br /&gt;
* It is a linear classifier, as the logistic regression model has a linear logit.&lt;br /&gt;
* It is often used when the response variable is categorical.&lt;br /&gt;
&lt;br /&gt;
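The logit relationship described above can be sketched in Python (an illustrative aside; the tutorial itself uses R):

```python
import math

def sigmoid(z):
    # Logistic function: maps any real-valued z to a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    # Log-odds of p: the inverse of the logistic function
    return math.log(p / (1.0 - p))

# In logistic regression, the logit of the predicted probability is a
# linear function of the features, which is why it is a linear classifier.
p = sigmoid(0.7)
assert math.isclose(logit(p), 0.7)
```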
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Assumptions of Logistic Regression'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* The distribution of the dependent variable is Bernoulli.&lt;br /&gt;
* The data records are independent.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| The dependent variable's distribution is typically assumed to be a Bernoulli distribution in logistic regression.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Advantages of Logistic Regression'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* It provides estimates of regression coefficients along with their standard errors.&lt;br /&gt;
* It also provides the predicted probability which in turn is used as a classifier.&lt;br /&gt;
* It does not require the explanatory variables to be continuous. &lt;br /&gt;
* In this sense, it is a more general classifier than LDA and QDA.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| Logistic regression offers a significant advantage in that continuous explanatory variables are not a requirement.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Implementation Of Logistic Regression'''&lt;br /&gt;
|| We will implement '''logistic regression''' using the '''Raisin '''dataset. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The additional reading material has more details on the '''Raisin dataset'''.&lt;br /&gt;
&lt;br /&gt;
Please refer to it.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide '''&lt;br /&gt;
&lt;br /&gt;
'''Download Files '''&lt;br /&gt;
|| We will use a script file '''LogisticRegression.R '''and the '''Raisin''' dataset '''‘Raisin_Dataset.xlsx’'''&lt;br /&gt;
&lt;br /&gt;
Please download these files from the''' Code files''' link of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Make a copy and then use them while practicing.&lt;br /&gt;
|- &lt;br /&gt;
|| [Computer screen]&lt;br /&gt;
&lt;br /&gt;
Highlight LogisticRegression.R &lt;br /&gt;
&lt;br /&gt;
Logistic Regression folder.&lt;br /&gt;
|| I have downloaded and moved these files to the '''Logistic Regression''' folder. &lt;br /&gt;
&lt;br /&gt;
This folder is located in the '''MLProject '''folder. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
I have also set the '''Logistic Regression''' folder as my Working Directory.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Let’s create a '''Logistic Regression''' classifier model on the '''raisin''' dataset. &lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us switch to '''RStudio'''. &lt;br /&gt;
|- &lt;br /&gt;
|| Click LogisticRegression.R in RStudio&lt;br /&gt;
&lt;br /&gt;
Point to LogisticRegression.R in RStudio.&lt;br /&gt;
|| Open the script '''LogisticRegression.R''' in '''RStudio'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For this, click on the script '''LogisticRegression.R.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Script '''LogisticRegression.R''' opens in '''RStudio'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [Rstudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the commands&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''library(readxl)'''&lt;br /&gt;
&lt;br /&gt;
'''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
'''library(VGAM)'''&lt;br /&gt;
&lt;br /&gt;
'''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
'''library(dplyr)'''&lt;br /&gt;
&lt;br /&gt;
'''&amp;lt;nowiki&amp;gt;#install.packages(“package_name”)&amp;lt;/nowiki&amp;gt;'''&lt;br /&gt;
&lt;br /&gt;
'''Point to the command.'''&lt;br /&gt;
&lt;br /&gt;
|| Select and run these commands to import the necessary packages.&lt;br /&gt;
&lt;br /&gt;
The '''glm()''' function required to create our classifier is available in base '''R'''; the '''VGAM''' package provides extended GLM functionality.&lt;br /&gt;
&lt;br /&gt;
As I have already installed the packages, I have directly imported them. &lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight &lt;br /&gt;
&lt;br /&gt;
'''data &amp;lt;- read_xlsx(&amp;quot;Raisin_Dataset.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''data &amp;lt;- data[c(&amp;quot;minorAL&amp;quot;, &amp;quot;ecc&amp;quot;, &amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Highlight the commands.'''&lt;br /&gt;
|| These commands will load the '''Raisin dataset.'''&lt;br /&gt;
&lt;br /&gt;
They will also prepare the dataset for model building.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click on '''data '''on the Environment tab.&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Click on '''data '''in the '''Environment '''tab.&lt;br /&gt;
&lt;br /&gt;
It loads the modified dataset in the '''Source''' window. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the data.&lt;br /&gt;
|| Now we split our dataset into training and testing data.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
'''trainIndex&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
'''train &amp;lt;- data[trainIndex, ]'''&lt;br /&gt;
&lt;br /&gt;
'''test &amp;lt;- data[-trainIndex, ]'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''trainIndex &amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''train &amp;lt;- data[trainIndex, ]'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''test &amp;lt;- data[-trainIndex, ]'''&lt;br /&gt;
&lt;br /&gt;
Click on Save and Run buttons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Click on '''train '''and '''test '''to load them in the Source window.&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
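The 70/30 split performed by sample() above can be mirrored in Python, shown here only to make the sampling-without-replacement idea concrete (the tutorial uses R):

```python
import random

def train_test_split(rows, train_frac=0.7, seed=1):
    # Sample row indices without replacement, mirroring R's
    # sample(1:nrow(data), size = 0.7 * nrow(data), replace = FALSE)
    rng = random.Random(seed)
    n_train = int(train_frac * len(rows))
    train_idx = set(rng.sample(range(len(rows)), n_train))
    train = [row for i, row in enumerate(rows) if i in train_idx]
    test = [row for i, row in enumerate(rows) if i not in train_idx]
    return train, test

train, test = train_test_split(list(range(10)))
assert len(train) == 7 and len(test) == 3
```

Fixing the seed plays the same role as set.seed(1) in R: the split is reproducible across runs.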
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us create a '''Logistic Regression '''model on the '''training dataset'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''Logistic_model &amp;lt;- glm(class ~ ., data = train, family = &amp;quot;binomial&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''summary(Logistic_model)$coef'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight glm()&lt;br /&gt;
&lt;br /&gt;
Highlight '''class ~ .'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''family = binomial'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''train''' &lt;br /&gt;
|| The function glm() represents generalized linear models. &lt;br /&gt;
&lt;br /&gt;
Logistic regression is among the class of models that it fits. &lt;br /&gt;
&lt;br /&gt;
This is the formula for our model. &lt;br /&gt;
&lt;br /&gt;
We try to predict the target variable '''class''' based on the '''minorAL '''and '''ecc '''features.&lt;br /&gt;
&lt;br /&gt;
This ensures that our model predicts the probability for 2 classes.&lt;br /&gt;
&lt;br /&gt;
It ensures that, out of all the models in glm, the logistic regression model is fit.&lt;br /&gt;
&lt;br /&gt;
This is the data used to train our model.&lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
The output is shown in the '''console '''window.&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the console window.&lt;br /&gt;
|| Drag boundary to see the '''console '''window. &lt;br /&gt;
|- &lt;br /&gt;
|| Point the output in the '''console'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''Coefficients'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''Pr(&amp;gt;|z|)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
'''Coefficients''' denote the coefficients of the logit function.&lt;br /&gt;
&lt;br /&gt;
That means the log-odds of class change by -0.04 for every unit change in minorAL.&lt;br /&gt;
&lt;br /&gt;
The lower p-values suggest that the effects are statistically significant.&lt;br /&gt;
&lt;br /&gt;
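The coefficient reading above can be made concrete with a short Python aside (illustrative, not part of the R script): exponentiating a log-odds change gives the multiplicative change in the odds.

```python
import math

# Coefficient of minorAL quoted in the narration (about -0.04): a
# one-unit increase in minorAL changes the log-odds of the class by
# the coefficient value.
coef_minor_al = -0.04

# Exponentiating a log-odds change gives the multiplicative change
# in the odds themselves.
odds_ratio = math.exp(coef_minor_al)

# odds_ratio is about 0.96, i.e. below 1, so a higher minorAL value
# lowers the odds of the positive class.
assert math.isclose(odds_ratio, 0.9608, abs_tol=1e-3)
```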
|- &lt;br /&gt;
|| Drag boundary to see the '''Source '''window.&lt;br /&gt;
|| Drag boundary to see the '''Source''' window.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us now use our model to make predictions on test data.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''Predicted.prob &amp;lt;- predict(Logistic_model, test, type=&amp;quot;response&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''View(Predicted.prob)'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''Predicted.prob &amp;lt;- predict(Logistic_model, test, type=&amp;quot;response&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''type = &amp;quot;response&amp;quot;''' &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| This command provides the predicted probability of the logistic regression model on the test dataset.&lt;br /&gt;
&lt;br /&gt;
This argument ensures the outcome is a probability.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands&lt;br /&gt;
|- &lt;br /&gt;
|| Point&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Value&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
'''Predicted.prob '''stores the predicted probability of each observation belonging to a certain class.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''predicted.classes &amp;lt;- factor(ifelse(Predicted.prob &amp;gt; 0.5, &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
|| In the '''Source''' window type the following command&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight &lt;br /&gt;
&lt;br /&gt;
'''predicted.classes &amp;lt;- factor(ifelse(Predicted.prob &amp;gt; 0.5, &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
|| This retrieves the predicted classes from the probabilities. &lt;br /&gt;
&lt;br /&gt;
If the probability is greater than 0.5, the '''Kecimen '''class is chosen; otherwise, the '''Besni '''class is chosen.&lt;br /&gt;
&lt;br /&gt;
We also convert the output to a '''factor''' datatype to fit in the Confusion matrix function.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands&lt;br /&gt;
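The 0.5 thresholding step can be sketched in Python (illustrative only; the class labels follow the tutorial's Raisin dataset):

```python
def to_class(prob, threshold=0.5):
    # Mirror of R's ifelse(prob > 0.5, "Kecimen", "Besni"):
    # probabilities strictly above the threshold map to "Kecimen"
    return "Kecimen" if prob > threshold else "Besni"

probs = [0.91, 0.32, 0.50, 0.77]
classes = [to_class(p) for p in probs]
assert classes == ["Kecimen", "Besni", "Besni", "Kecimen"]
```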
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us measure the accuracy of our model. &lt;br /&gt;
|- &lt;br /&gt;
|| '''confusion_matrix &amp;lt;- confusionMatrix(predicted.classes, test$class)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command '''confusionMatrix(predicted.classes, test$class)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to '''confusion_matrix''' in the Environment tab&lt;br /&gt;
&lt;br /&gt;
Highlight the attribute&lt;br /&gt;
&lt;br /&gt;
'''table'''&lt;br /&gt;
|| This command creates a confusion matrix list.&lt;br /&gt;
&lt;br /&gt;
The list is created from the actual and predicted class labels.&lt;br /&gt;
&lt;br /&gt;
And it is stored in the confusion_matrix variable.&lt;br /&gt;
&lt;br /&gt;
It helps to assess the classification model's performance and accuracy.&lt;br /&gt;
&lt;br /&gt;
Select and run these commands&lt;br /&gt;
&lt;br /&gt;
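What confusionMatrix() tabulates can be sketched in Python as a minimal count of (actual, predicted) pairs. This is a hedged illustration of the idea, not the caret implementation:

```python
from collections import Counter

def confusion_counts(actual, predicted):
    # Tally (actual, predicted) label pairs; pairs with equal labels
    # are correct predictions, unequal pairs are misclassifications
    return Counter(zip(actual, predicted))

actual    = ["Kecimen", "Kecimen", "Besni", "Besni", "Besni"]
predicted = ["Kecimen", "Besni",   "Besni", "Besni", "Kecimen"]

tab = confusion_counts(actual, predicted)
correct = sum(n for (a, p), n in tab.items() if a == p)
accuracy = correct / len(actual)
assert accuracy == 0.6  # 3 of 5 predictions are correct
```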
|- &lt;br /&gt;
|| '''plot_confusion_matrix &amp;lt;- function(confusion_matrix){'''&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
'''tab = as.data.frame(tab)'''&lt;br /&gt;
&lt;br /&gt;
'''tab$Prediction &amp;lt;- factor(tab$Prediction, levels = rev(levels(tab$Prediction)))'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- tab %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''rename(Actual = Reference) %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''mutate(cor = if_else(Actual == Prediction, 1,0))'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''tab$cor &amp;lt;- as.factor(tab$cor)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''ggplot(tab, aes(Actual,Prediction)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_tile(aes(fill= cor),alpha = 0.4) + geom_text(aes(label=Freq)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;red&amp;quot;,&amp;quot;green&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_light() +'''&lt;br /&gt;
&lt;br /&gt;
'''theme(legend.position = &amp;quot;none&amp;quot;,'''&lt;br /&gt;
&lt;br /&gt;
'''line = element_blank()) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_x_discrete(position = &amp;quot;top&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Highlight '''the command&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
'''tab = as.data.frame(tab)'''&lt;br /&gt;
&lt;br /&gt;
'''tab$Prediction &amp;lt;- factor(tab$Prediction, levels = rev(levels(tab$Prediction)))'''&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- tab %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''rename(Actual = Reference) %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''mutate(cor = if_else(Actual == Prediction, 1,0))'''&lt;br /&gt;
&lt;br /&gt;
'''tab$cor &amp;lt;- as.factor(tab$cor)'''&lt;br /&gt;
&lt;br /&gt;
'''Highlight '''the command&lt;br /&gt;
&lt;br /&gt;
'''ggplot(tab, aes(Actual,Prediction)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_tile(aes(fill= cor),alpha = 0.4) + geom_text(aes(label=Freq)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;red&amp;quot;,&amp;quot;green&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_light() +'''&lt;br /&gt;
&lt;br /&gt;
'''theme(legend.position = &amp;quot;none&amp;quot;,'''&lt;br /&gt;
&lt;br /&gt;
'''line = element_blank()) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_x_discrete(position = &amp;quot;top&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
|| These commands create a function, '''plot_confusion_matrix''', to display the confusion matrix from the list created earlier.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It fetches the confusion matrix table from the list.&lt;br /&gt;
&lt;br /&gt;
It creates a data frame from the table which is suitable for plotting using '''GGPlot2'''.&lt;br /&gt;
&lt;br /&gt;
It plots the confusion matrix using the data frame created.&lt;br /&gt;
&lt;br /&gt;
It represents correct and incorrect predictions using different colors.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''plot_confusion_matrix(confusion_matrix)'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''plot_confusion_matrix(confusion_matrix)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We use the '''plot_confusion_matrix()''' function to generate a visual plot of the '''confusion matrix list created.'''&lt;br /&gt;
&lt;br /&gt;
Select and run the command&lt;br /&gt;
&lt;br /&gt;
The output is seen in the '''plot''' window&lt;br /&gt;
|- &lt;br /&gt;
|| '''Output in Plot window.'''&lt;br /&gt;
&lt;br /&gt;
|| This plot shows how well our model predicted the testing data.&lt;br /&gt;
&lt;br /&gt;
We observe that:&lt;br /&gt;
&lt;br /&gt;
'''21 '''misclassifications of the Besni class and&lt;br /&gt;
&lt;br /&gt;
'''13 '''misclassifications of the Kecimen class.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid &amp;lt;- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500),'''&lt;br /&gt;
&lt;br /&gt;
'''ecc = seq(min(data$ecc), max(data$ecc), length = 500)) '''&lt;br /&gt;
&lt;br /&gt;
'''grid$prob &amp;lt;- predict(Logistic_model, newdata = grid, type = &amp;quot;response&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid$class &amp;lt;- ifelse(grid$prob &amp;gt; 0.5, 'Kecimen', 'Besni')'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid$classnum &amp;lt;- as.numeric(as.factor(grid$class))'''&lt;br /&gt;
&lt;br /&gt;
|| We will now visualize the decision boundary of the model.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''grid &amp;lt;- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500),'''&lt;br /&gt;
&lt;br /&gt;
'''ecc = seq(min(data$ecc), max(data$ecc), length = 500)) '''&lt;br /&gt;
&lt;br /&gt;
'''grid$prob &amp;lt;- predict(Logistic_model, newdata = grid, type = &amp;quot;response&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''grid$class &amp;lt;- ifelse(grid$prob &amp;gt; 0.5, 'Kecimen', 'Besni')'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid$classnum &amp;lt;- as.numeric(as.factor(grid$class))'''&lt;br /&gt;
|| This code first generates a '''grid '''of points spanning the range of '''minorAL '''and '''ecc''' features in the dataset. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Then, it uses the '''Logistic Regression '''model to predict the probability of each point in this grid, storing these predictions as a new column ''''prob' '''in the '''grid '''dataframe. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It converts the predicted probabilities of the points into classes.&lt;br /&gt;
&lt;br /&gt;
If the probability exceeds 0.5, the '''Kecimen '''class is chosen; otherwise, the '''Besni '''class is chosen.&lt;br /&gt;
&lt;br /&gt;
The predicted classes are stored in the ‘class’ column of the grid data frame.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The '''as.numeric''' function encodes the predicted classes' string labels into numeric values.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Select and run the commands&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Click on grid in the Environment tab to load the generated data in the Source window.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train, aes(x = minorAL, y = ecc, color = class)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),'''&lt;br /&gt;
&lt;br /&gt;
'''colour = &amp;quot;black&amp;quot;, linewidth = 0.7) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(x = &amp;quot;MinorAL&amp;quot;, y = &amp;quot;ecc&amp;quot;, title = &amp;quot;Logistic Regression Decision Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source '''window type these commands &lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train, aes(x = minorAL, y = ecc, color = class)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),'''&lt;br /&gt;
&lt;br /&gt;
'''colour = &amp;quot;black&amp;quot;, linewidth = 0.7) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(x = &amp;quot;MinorAL&amp;quot;, y = &amp;quot;ecc&amp;quot;, title = &amp;quot;Logistic Regression Decision Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| We are creating the decision boundary plot using '''ggplot2''' from the data generated. &lt;br /&gt;
&lt;br /&gt;
It plots the grid points with colors indicating the predicted classes. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The overall plot provides a visual representation of the model's decision boundary and the distribution of the training data points.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Select and run these commands.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Drag boundaries to see the plot window clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| We can conclude that the decision boundary of logistic regression is a straight line.&lt;br /&gt;
&lt;br /&gt;
The line separates the data points clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
Limitations of Logistic Regression&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* It is sensitive to outliers, which can affect the accuracy of the classifier.&lt;br /&gt;
* It can perform poorly in the presence of multicollinearity among explanatory variables.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| Here are some of the limitations of Logistic Regression&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us summarize what we have learned.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Summary&lt;br /&gt;
|| In this tutorial we have learned about:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Logistic Regression&lt;br /&gt;
* Assumptions of Logistic Regression&lt;br /&gt;
* Advantages of Logistic Regression&lt;br /&gt;
* Implementation of Logistic Regression in '''R''' using the '''Raisin''' dataset.&lt;br /&gt;
* Model Evaluation.&lt;br /&gt;
* Visualization of the model Decision Boundary&lt;br /&gt;
* Limitations of Logistic Regression&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Now we will suggest an assignment for this Spoken Tutorial.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Assignment&lt;br /&gt;
|| &lt;br /&gt;
* Apply logistic regression on the '''Wine '''dataset. &lt;br /&gt;
* This dataset can be found in the '''HDclassif''' package. &lt;br /&gt;
* Install the package and import the dataset using the '''data()''' command.&lt;br /&gt;
* Measure the accuracy of the model&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
About the Spoken Tutorial Project&lt;br /&gt;
|| The video at the following link summarizes the Spoken Tutorial project. Please download and watch it.&lt;br /&gt;
|- &lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Workshops&lt;br /&gt;
|| We conduct workshops using Spoken Tutorials and give certificates.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please contact us.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Forum to answer questions&lt;br /&gt;
|| Please post your timed queries in this forum.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Forum to answer questions&lt;br /&gt;
|| Do you have any general/technical questions?&lt;br /&gt;
&lt;br /&gt;
Please visit the forum given in the link.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Textbook Companion&lt;br /&gt;
|| The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.&lt;br /&gt;
&lt;br /&gt;
We give certificates to those who do this.&lt;br /&gt;
&lt;br /&gt;
For more details, please visit these sites.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Acknowledgment&lt;br /&gt;
|| The '''Spoken Tutorial Project''' was established by the Ministry of Education, Govt. of India. &lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Thank You&lt;br /&gt;
|| This tutorial is contributed by Yate Asseke Ronald O. and Debatosh Chakraborty from IIT Bombay.&lt;br /&gt;
&lt;br /&gt;
Thank you for joining.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Ushav</name></author>	</entry>

	<entry>
		<id>https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Logistic-Regression-in-R/English</id>
		<title>Machine-Learning-using-R/C2/Logistic-Regression-in-R/English</title>
		<link rel="alternate" type="text/html" href="https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Logistic-Regression-in-R/English"/>
				<updated>2024-05-31T09:14:24Z</updated>
		
		<summary type="html">&lt;p&gt;Ushav: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title of the script''': Logistic Regression&lt;br /&gt;
&lt;br /&gt;
'''Author''': Yate Asseke Ronald Olivera and Debatosh Chakraborty&lt;br /&gt;
&lt;br /&gt;
'''Keywords''': R, RStudio, machine learning, supervised, unsupervised, classification, logistic regression, video tutorial.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=1&lt;br /&gt;
| align=center| '''Visual Cue'''&lt;br /&gt;
| align=center| '''Narration'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Opening Slide'''&lt;br /&gt;
|| Welcome to this spoken tutorial on '''Logistic Regression in R.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Learning Objectives'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| In this tutorial, we will learn about &lt;br /&gt;
* Logistic Regression&lt;br /&gt;
* Assumptions of Logistic Regression&lt;br /&gt;
* Advantages of Logistic Regression&lt;br /&gt;
* Implementation of Logistic Regression in '''R''' using the '''Raisin''' dataset.&lt;br /&gt;
* Model Evaluation.&lt;br /&gt;
* Visualization of the model Decision Boundary&lt;br /&gt;
* Limitations of Logistic Regression&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''System Specifications'''&lt;br /&gt;
|| This tutorial is recorded using,&lt;br /&gt;
* '''Windows 11 '''&lt;br /&gt;
* '''R '''version''' 4.3.0'''&lt;br /&gt;
* '''RStudio''' version '''2023.06.1'''&lt;br /&gt;
&lt;br /&gt;
It is recommended to install '''R''' version '''4.2.0''' or higher.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Prerequisites '''&lt;br /&gt;
|| To follow this tutorial, the learner should know:&lt;br /&gt;
* Basic programming in '''R'''.&lt;br /&gt;
* '''Basics of Machine Learning'''.&lt;br /&gt;
&lt;br /&gt;
If not, please access the relevant tutorials on this website.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us learn what '''logistic regression''' is&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Logistic Regression'''&lt;br /&gt;
&lt;br /&gt;
|| Logistic regression is a statistical model used for classification.&lt;br /&gt;
&lt;br /&gt;
It models the probability of success as a function of the explanatory variables.&lt;br /&gt;
&lt;br /&gt;
* It predicts a probability, unlike linear regression, which predicts the response directly.&lt;br /&gt;
* The predicted probability is used as a classifier.&lt;br /&gt;
* The probability of success is modeled using the '''logit (log-odds)''' function.&lt;br /&gt;
* It is a linear classifier, as the logistic regression model has a linear logit.&lt;br /&gt;
* It is often used when the response variable is categorical.&lt;br /&gt;
&lt;br /&gt;
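As an illustration (not part of the recorded tutorial), the relation between the linear logit and the predicted probability can be sketched in '''R'''; the names '''sigmoid''' and '''eta''' below are illustrative choices, not from the script.&lt;br /&gt;
&lt;br /&gt;
```r
# Illustrative sketch: the logistic model maps a linear predictor
# eta = b0 + b1*x to a probability via the sigmoid (inverse logit).
sigmoid = function(eta) 1 / (1 + exp(-eta))
p = sigmoid(0.8)             # probability of success when eta = 0.8
log_odds = log(p / (1 - p))  # the logit recovers eta, which is linear in x
```
&lt;br /&gt;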
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Assumptions of Logistic Regression'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* The distribution of the dependent variable is Bernoulli.&lt;br /&gt;
* The data records are independent.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| The dependent variable's distribution is typically assumed to be a Bernoulli distribution in logistic regression.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Advantages of Logistic Regression'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* It provides estimates of regression coefficients along with their standard errors.&lt;br /&gt;
* It also provides the predicted probability which in turn is used as a classifier.&lt;br /&gt;
* It does not require the explanatory variables to be continuous.&lt;br /&gt;
* In this sense, it is a more general classifier than LDA and QDA.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| Logistic regression offers a significant advantage in that continuous explanatory variables are not a requirement.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Implementation Of Logistic Regression'''&lt;br /&gt;
|| We will implement '''logistic regression''' using the '''Raisin '''dataset. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The additional reading material has more details on the '''Raisin dataset'''.&lt;br /&gt;
&lt;br /&gt;
Please refer to it.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide '''&lt;br /&gt;
&lt;br /&gt;
'''Download Files '''&lt;br /&gt;
|| We will use the script file '''LogisticRegression.R''' and the '''Raisin''' dataset '''raisin.xlsx'''&lt;br /&gt;
&lt;br /&gt;
Please download these files from the''' Code files''' link of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Make a copy and then use them while practicing.&lt;br /&gt;
|- &lt;br /&gt;
|| [Computer screen]&lt;br /&gt;
&lt;br /&gt;
Highlight LogisticRegression.R &lt;br /&gt;
&lt;br /&gt;
Logistic Regression folder.&lt;br /&gt;
|| I have downloaded and moved these files to the '''Logistic Regression''' folder. &lt;br /&gt;
&lt;br /&gt;
This folder is located in the '''MLProject '''folder. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
I have also set the '''Logistic Regression''' folder as my Working Directory.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Let’s create a '''Logistic Regression''' classifier model on the '''raisin''' dataset. &lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us switch to '''RStudio'''. &lt;br /&gt;
|- &lt;br /&gt;
|| Click LogisticRegression.R in RStudio&lt;br /&gt;
&lt;br /&gt;
Point to LogisticRegression.R in RStudio.&lt;br /&gt;
|| Open the script '''LogisticRegression.R''' in '''RStudio'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For this, click on the script '''LogisticRegression.R.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Script '''LogisticRegression.R''' opens in '''RStudio'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [Rstudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the commands&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''library(readxl)'''&lt;br /&gt;
&lt;br /&gt;
'''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
'''library(VGAM)'''&lt;br /&gt;
&lt;br /&gt;
'''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
'''library(dplyr)'''&lt;br /&gt;
&lt;br /&gt;
'''&amp;lt;nowiki&amp;gt;#install.packages(&amp;quot;package_name&amp;quot;)&amp;lt;/nowiki&amp;gt;'''&lt;br /&gt;
&lt;br /&gt;
'''Point to the command.'''&lt;br /&gt;
&lt;br /&gt;
|| Select and run these commands to import the necessary packages.&lt;br /&gt;
&lt;br /&gt;
The '''glm()''' function required to create our classifier comes with base '''R'''; the '''VGAM''' package provides related extensions of generalized linear models.&lt;br /&gt;
&lt;br /&gt;
As I have already installed the packages, I have directly imported them. &lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight &lt;br /&gt;
&lt;br /&gt;
'''data &amp;lt;- read_xlsx(&amp;quot;Raisin_Dataset.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''data &amp;lt;- data[c(&amp;quot;minorAL&amp;quot;, &amp;quot;ecc&amp;quot;, &amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Highlight the commands.'''&lt;br /&gt;
|| These commands will load the '''Raisin dataset.'''&lt;br /&gt;
&lt;br /&gt;
They will also prepare the dataset for model building.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click on '''data '''on the Environment tab.&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Click on '''data '''in the '''Environment '''tab.&lt;br /&gt;
&lt;br /&gt;
It loads the modified dataset in the '''Source''' window. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the data.&lt;br /&gt;
|| Now we split our dataset into training and testing data.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
'''trainIndex&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
'''train &amp;lt;- data[trainIndex, ]'''&lt;br /&gt;
&lt;br /&gt;
'''test &amp;lt;- data[-trainIndex, ]'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''trainIndex &amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''train &amp;lt;- data[trainIndex, ]'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''test &amp;lt;- data[-trainIndex, ]'''&lt;br /&gt;
&lt;br /&gt;
Click on Save and Run buttons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Click on '''train''' and '''test''' to load them in the Source window.&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us create a '''Logistic Regression '''model on the '''training dataset'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''Logistic_model &amp;lt;- glm(class ~ ., data = train, family = &amp;quot;binomial&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''summary(Logistic_model)$coef'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
|  | Highlight glm()&lt;br /&gt;
&lt;br /&gt;
Highlight '''class ~ .'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''family = binomial'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''train''' &lt;br /&gt;
|| The function '''glm()''' fits generalized linear models. &lt;br /&gt;
&lt;br /&gt;
Logistic regression is among the class of models that it fits. &lt;br /&gt;
&lt;br /&gt;
This is the formula for our model. &lt;br /&gt;
&lt;br /&gt;
We try to predict the target variable '''class''' based on the '''minorAL''' and '''ecc''' features.&lt;br /&gt;
&lt;br /&gt;
This ensures that our model predicts the probability for 2 classes.&lt;br /&gt;
&lt;br /&gt;
It ensures that, out of all the models in glm, the logistic regression model is fit.&lt;br /&gt;
&lt;br /&gt;
This is the data used to train our model.&lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
The output is shown in the '''console '''window.&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the console window.&lt;br /&gt;
|| Drag boundary to see the '''console '''window. &lt;br /&gt;
|- &lt;br /&gt;
|| Point the output in the '''console'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''Coefficients'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''Pr(&amp;gt;|z|)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
'''Coefficients''' denote the coefficients of the logit function.&lt;br /&gt;
&lt;br /&gt;
That means the log-odds of class change by -0.04 for every unit change in minorAL.&lt;br /&gt;
&lt;br /&gt;
The lower p-values suggest that the effects are statistically significant.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the '''Source '''window.&lt;br /&gt;
|| Drag boundary to see the '''Source''' window.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us now use our model to make predictions on test data.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''Predicted.prob &amp;lt;- predict(Logistic_model, test, type=&amp;quot;response&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''View(Predicted.prob)'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''Predicted.prob &amp;lt;- predict(Logistic_model, test, type=&amp;quot;response&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''type = &amp;quot;response&amp;quot;'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| This command provides the predicted probability of the logistic regression model on the test dataset.&lt;br /&gt;
&lt;br /&gt;
This argument ensures that the output is a probability.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands&lt;br /&gt;
|- &lt;br /&gt;
|| Point&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Value&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
'''Predicted.prob '''stores the predicted probability of each observation belonging to a certain class.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''predicted.classes &amp;lt;- factor(ifelse(Predicted.prob &amp;gt; 0.5, &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
|| In the source window type the following commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight &lt;br /&gt;
&lt;br /&gt;
'''predicted.classes &amp;lt;- factor(ifelse(Predicted.prob &amp;gt; 0.5, &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
|| This retrieves the predicted classes from the probabilities. &lt;br /&gt;
&lt;br /&gt;
If the probability is greater than 0.5, the '''Kecimen''' class is chosen; otherwise, the '''Besni''' class is chosen.&lt;br /&gt;
&lt;br /&gt;
We also convert the output to a '''factor''' datatype to fit in the Confusion matrix function.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us measure the accuracy of our model. &lt;br /&gt;
|- &lt;br /&gt;
|| '''confusion_matrix &amp;lt;- confusionMatrix(predicted.classes, test$class)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command '''confusionMatrix(predicted.classes, test$class)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to '''confusion_matrix''' in the Environment tab&lt;br /&gt;
&lt;br /&gt;
Highlight the attribute&lt;br /&gt;
&lt;br /&gt;
'''table'''&lt;br /&gt;
|| This command creates a confusion matrix list.&lt;br /&gt;
&lt;br /&gt;
The list is created from the actual and predicted class labels.&lt;br /&gt;
&lt;br /&gt;
And it is stored in the confusion_matrix variable.&lt;br /&gt;
&lt;br /&gt;
It helps to assess the classification model's performance and accuracy.&lt;br /&gt;
&lt;br /&gt;
Select and run these commands&lt;br /&gt;
&lt;br /&gt;
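As a small aside (assuming the '''confusion_matrix''' object created above), the model's accuracy can be read directly from the list returned by '''confusionMatrix()''':&lt;br /&gt;
&lt;br /&gt;
```r
# Sketch: caret's confusionMatrix() result stores overall statistics,
# including accuracy, in its "overall" element.
accuracy = confusion_matrix$overall["Accuracy"]
print(accuracy)
```
&lt;br /&gt;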
|- &lt;br /&gt;
|| '''plot_confusion_matrix &amp;lt;- function(confusion_matrix){'''&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
'''tab = as.data.frame(tab)'''&lt;br /&gt;
&lt;br /&gt;
'''tab$Prediction &amp;lt;- factor(tab$Prediction, levels = rev(levels(tab$Prediction)))'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- tab %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''rename(Actual = Reference) %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''mutate(cor = if_else(Actual == Prediction, 1,0))'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''tab$cor &amp;lt;- as.factor(tab$cor)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''ggplot(tab, aes(Actual,Prediction)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_tile(aes(fill= cor),alpha = 0.4) + geom_text(aes(label=Freq)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;red&amp;quot;,&amp;quot;green&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_light() +'''&lt;br /&gt;
&lt;br /&gt;
'''theme(legend.position = &amp;quot;None&amp;quot;,'''&lt;br /&gt;
&lt;br /&gt;
'''line = element_blank()) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_x_discrete(position = &amp;quot;top&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Highlight '''the command &lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
'''Highlight '''the command&lt;br /&gt;
&lt;br /&gt;
'''tab = as.data.frame(tab)'''&lt;br /&gt;
&lt;br /&gt;
'''tab$Prediction &amp;lt;- factor(tab$Prediction, levels = rev(levels(tab$Prediction)))'''&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- tab %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''rename(Actual = Reference) %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''mutate(cor = if_else(Actual == Prediction, 1,0))'''&lt;br /&gt;
&lt;br /&gt;
'''tab$cor &amp;lt;- as.factor(tab$cor)'''&lt;br /&gt;
&lt;br /&gt;
'''Highlight '''the command&lt;br /&gt;
&lt;br /&gt;
'''ggplot(tab, aes(Actual,Prediction)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_tile(aes(fill= cor),alpha = 0.4) + geom_text(aes(label=Freq)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;red&amp;quot;,&amp;quot;green&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_light() +'''&lt;br /&gt;
&lt;br /&gt;
'''theme(legend.position = &amp;quot;None&amp;quot;,'''&lt;br /&gt;
&lt;br /&gt;
'''line = element_blank()) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_x_discrete(position = &amp;quot;top&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
|| These commands create a function '''plot_confusion_matrix '''to display the confusion matrix from the confusion matrix list created.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It fetches the confusion matrix table from the list.&lt;br /&gt;
&lt;br /&gt;
It creates a data frame from the table which is suitable for plotting using '''ggplot2'''.&lt;br /&gt;
&lt;br /&gt;
It plots the confusion matrix using the data frame created.&lt;br /&gt;
&lt;br /&gt;
It represents correct and incorrect predictions using different colors.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''plot_confusion_matrix(confusion_matrix)'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''plot_confusion_matrix(confusion_matrix)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We use the '''plot_confusion_matrix()''' function to generate a visual plot of the '''confusion matrix list created.'''&lt;br /&gt;
&lt;br /&gt;
Select and run the command&lt;br /&gt;
&lt;br /&gt;
The output is seen in the '''plot''' window&lt;br /&gt;
|- &lt;br /&gt;
|| '''Output in Plot window.'''&lt;br /&gt;
&lt;br /&gt;
|| This plot shows how well our model predicted the testing data.&lt;br /&gt;
&lt;br /&gt;
We observe that:&lt;br /&gt;
&lt;br /&gt;
'''21''' misclassifications of the '''Besni''' class.&lt;br /&gt;
&lt;br /&gt;
'''13''' misclassifications of the '''Kecimen''' class.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid &amp;lt;- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500),'''&lt;br /&gt;
&lt;br /&gt;
'''ecc = seq(min(data$ecc), max(data$ecc), length = 500)) '''&lt;br /&gt;
&lt;br /&gt;
'''grid$prob &amp;lt;- predict(Logistic_model, newdata = grid, type = &amp;quot;response&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid$class &amp;lt;- ifelse(grid$prob &amp;gt; 0.5, 'Kecimen', 'Besni')'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid$classnum &amp;lt;- as.numeric(as.factor(grid$class))'''&lt;br /&gt;
&lt;br /&gt;
|| We will visualize the decision boundary of the model.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''grid &amp;lt;- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500),'''&lt;br /&gt;
&lt;br /&gt;
'''ecc = seq(min(data$ecc), max(data$ecc), length = 500)) '''&lt;br /&gt;
&lt;br /&gt;
'''grid$prob &amp;lt;- predict(Logistic_model, newdata = grid, type = &amp;quot;response&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''grid$class &amp;lt;- ifelse(grid$prob &amp;gt; 0.5, 'Kecimen', 'Besni')'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid$classnum &amp;lt;- as.numeric(as.factor(grid$class))'''&lt;br /&gt;
|| This code first generates a '''grid '''of points spanning the range of '''minorAL '''and '''ecc''' features in the dataset. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Then, it uses the '''Logistic Regression''' model to predict the probability of each point in this grid, storing these predictions as a new column ''''prob' '''in the '''grid''' dataframe. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It converts the predicted probabilities of the points into classes.&lt;br /&gt;
&lt;br /&gt;
If the probability exceeds 0.5, the '''Kecimen''' class is chosen; otherwise, the '''Besni''' class is chosen.&lt;br /&gt;
&lt;br /&gt;
The predicted classes are stored in the '''class''' column of the '''grid''' data frame.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The '''as.numeric''' function encodes the predicted classes' string labels as numeric values.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Select and run the commands&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Click on grid in the Environment tab to load the generated data in the Source window.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train, aes(x = minorAL, y = ecc, color = class)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),'''&lt;br /&gt;
&lt;br /&gt;
'''colour = &amp;quot;black&amp;quot;, linewidth = 0.7) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(x = &amp;quot;MinorAL&amp;quot;, y = &amp;quot;ecc&amp;quot;, title = &amp;quot;Logistic Regression Decision Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source '''window type these commands &lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train, aes(x = minorAL, y = ecc, color = class)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),'''&lt;br /&gt;
&lt;br /&gt;
'''colour = &amp;quot;black&amp;quot;, linewidth = 0.7) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(x = &amp;quot;MinorAL&amp;quot;, y = &amp;quot;ecc&amp;quot;, title = &amp;quot;Logistic Regression Decision Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| We are creating the decision boundary plot using '''ggplot2''' from the data generated. &lt;br /&gt;
&lt;br /&gt;
It plots the grid points with colors indicating the predicted classes. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The overall plot provides a visual representation of the model's decision boundary and the distribution of the training data points.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Select and run these commands.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Drag boundaries to see the plot window clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| We can conclude that the decision boundary of logistic regression is a straight line.&lt;br /&gt;
&lt;br /&gt;
The line separates the data points clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
Limitations of Logistic Regression&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* It is sensitive to outliers, which can affect the accuracy of the classifier.&lt;br /&gt;
* It can perform poorly in the presence of multicollinearity among explanatory variables.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| Here are some of the limitations of Logistic Regression&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us summarize what we have learned.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Summary&lt;br /&gt;
|| In this tutorial we have learned about:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Logistic Regression&lt;br /&gt;
* Assumptions of Logistic Regression&lt;br /&gt;
* Advantages of Logistic Regression&lt;br /&gt;
* Implementation of Logistic Regression in '''R''' using the '''Raisin''' dataset.&lt;br /&gt;
* Model Evaluation.&lt;br /&gt;
* Visualization of the model Decision Boundary&lt;br /&gt;
* Limitations of Logistic Regression&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Now we will suggest an assignment for this Spoken Tutorial.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Assignment&lt;br /&gt;
|| &lt;br /&gt;
* Apply logistic regression on the '''Wine '''dataset. &lt;br /&gt;
* This dataset can be found in the '''HDclassif''' package. &lt;br /&gt;
* Install the package and import the dataset using the '''data()''' command.&lt;br /&gt;
* Measure the accuracy of the model&lt;br /&gt;
&lt;br /&gt;
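A minimal sketch of how the assignment could be started (assuming the '''HDclassif''' package is installed; the '''wine''' data has three classes, so one option is to keep only two of them for binary logistic regression; the names '''wine2''' and '''model''' are illustrative):&lt;br /&gt;
&lt;br /&gt;
```r
# Sketch for the assignment; keep two classes for a binary classifier.
library(HDclassif)
data(wine)                                # loads the wine data frame
wine2 = subset(wine, class %in% c(1, 2))  # keep classes 1 and 2 only
wine2$class = factor(wine2$class)
model = glm(class ~ ., data = wine2, family = "binomial")
```
&lt;br /&gt;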
|- &lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
About the Spoken Tutorial Project&lt;br /&gt;
|| The video at the following link summarizes the Spoken Tutorial project. Please download and watch it.&lt;br /&gt;
|- &lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Workshops&lt;br /&gt;
|| We conduct workshops using Spoken Tutorials and give certificates.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please contact us.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Forum to answer questions&lt;br /&gt;
|| Please post your timed queries in this forum.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Forum to answer questions&lt;br /&gt;
|| Do you have any general/technical questions?&lt;br /&gt;
&lt;br /&gt;
Please visit the forum given in the link.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Textbook Companion&lt;br /&gt;
|| The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.&lt;br /&gt;
&lt;br /&gt;
We give certificates to those who do this.&lt;br /&gt;
&lt;br /&gt;
For more details, please visit these sites.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Acknowledgment&lt;br /&gt;
|| The '''Spoken Tutorial Project''' was established by the Ministry of Education, Govt. of India. &lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Thank You&lt;br /&gt;
|| This tutorial is contributed by Yate Asseke Ronald O. and Debatosh Chakraborty from IIT Bombay.&lt;br /&gt;
&lt;br /&gt;
Thank you for joining.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Ushav</name></author>	</entry>

	<entry>
		<id>https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Logistic-Regression-in-R/English</id>
		<title>Machine-Learning-using-R/C2/Logistic-Regression-in-R/English</title>
		<link rel="alternate" type="text/html" href="https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Logistic-Regression-in-R/English"/>
				<updated>2024-05-31T09:06:03Z</updated>
		
		<summary type="html">&lt;p&gt;Ushav: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title of the script''': Logistic Regression&lt;br /&gt;
&lt;br /&gt;
'''Author''': Yate Asseke Ronald Olivera and Debatosh Chakraborty&lt;br /&gt;
&lt;br /&gt;
'''Keywords''': R, RStudio, machine learning, supervised, unsupervised, classification, logistic regression, video tutorial.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=1&lt;br /&gt;
| align=center| '''Visual Cue'''&lt;br /&gt;
| align=center| '''Narration'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Opening Slide'''&lt;br /&gt;
|| Welcome to this spoken tutorial on '''Logistic Regression in R.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Learning Objectives'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| In this tutorial, we will learn about &lt;br /&gt;
* Logistic Regression&lt;br /&gt;
* Assumptions of Logistic Regression&lt;br /&gt;
* Advantages of Logistic Regression&lt;br /&gt;
* Implementation of Logistic Regression in '''R''' using the '''Raisin''' dataset.&lt;br /&gt;
* Model Evaluation.&lt;br /&gt;
* Visualization of the model Decision Boundary&lt;br /&gt;
* Limitations of Logistic Regression&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''System Specifications'''&lt;br /&gt;
|| This tutorial is recorded using,&lt;br /&gt;
* '''Windows 11 '''&lt;br /&gt;
* '''R '''version''' 4.3.0'''&lt;br /&gt;
* '''RStudio''' version '''2023.06.1'''&lt;br /&gt;
&lt;br /&gt;
It is recommended to install '''R''' version '''4.2.0''' or higher.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Prerequisites '''&lt;br /&gt;
|| To follow this tutorial, the learner should know:&lt;br /&gt;
* Basic programming in '''R'''.&lt;br /&gt;
* '''Basics of Machine Learning'''.&lt;br /&gt;
&lt;br /&gt;
If not, please access the relevant tutorials on this website.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us learn what '''logistic regression''' is.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Logistic Regression'''&lt;br /&gt;
&lt;br /&gt;
|| Logistic regression is a statistical model used for classification.&lt;br /&gt;
&lt;br /&gt;
It models the probability of success as a function of the explanatory variables.&lt;br /&gt;
&lt;br /&gt;
* It predicts the probability, unlike the response in linear regression.&lt;br /&gt;
* The predicted probability is used as a classifier.&lt;br /&gt;
* The probability of success is modeled using the''' logit or (log odds) '''function.&lt;br /&gt;
* It is a linear classifier, as the logistic regression model has a linear logit.&lt;br /&gt;
* It is often used when the response variable is categorical.&lt;br /&gt;
&lt;br /&gt;
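The bullet points above can be written out as a short mathematical sketch; this is the standard logit formulation (the two predictors shown are an assumption for illustration, matching the '''minorAL''' and '''ecc''' features used later in this script):

```latex
% Logistic regression models the log-odds (logit) of success as a
% linear function of the explanatory variables x_1, x_2:
\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2
% Inverting the logit gives the predicted probability of success:
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}}
```

Because the logit is linear in the features, the boundary where p = 0.5 is a straight line, which is why logistic regression is called a linear classifier.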
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Assumptions of Logistic Regression'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* The distribution of the dependent variable is Bernoulli.&lt;br /&gt;
* The data records are independent.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| The dependent variable's distribution is typically assumed to be a Bernoulli distribution in logistic regression.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Advantages of Logistic Regression'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* It provides estimates of regression coefficients along with their standard errors.&lt;br /&gt;
* It also provides the predicted probability which in turn is used as a classifier.&lt;br /&gt;
* It doesn’t need explanatory variables to be necessarily continuous. &lt;br /&gt;
* In this sense, it is a more general classifier than LDA and QDA.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| Logistic regression offers a significant advantage in that continuous explanatory variables are not a requirement.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Implementation Of Logistic Regression'''&lt;br /&gt;
|| We will implement '''logistic regression''' using the '''Raisin '''dataset. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The additional reading material has more details on the '''Raisin dataset'''.&lt;br /&gt;
&lt;br /&gt;
Please refer to it.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide '''&lt;br /&gt;
&lt;br /&gt;
'''Download Files '''&lt;br /&gt;
|| We will use the script file '''LogisticRegression.R''' and the '''Raisin''' dataset file '''Raisin_Dataset.xlsx'''&lt;br /&gt;
&lt;br /&gt;
Please download these files from the''' Code files''' link of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Make a copy and then use them while practicing.&lt;br /&gt;
|- &lt;br /&gt;
|| [Computer screen]&lt;br /&gt;
&lt;br /&gt;
Highlight LogisticRegression.R &lt;br /&gt;
&lt;br /&gt;
Logistic Regression folder.&lt;br /&gt;
|| I have downloaded and moved these files to the '''Logistic Regression''' folder. &lt;br /&gt;
&lt;br /&gt;
This folder is located in the '''MLProject '''folder. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
I have also set the '''Logistic Regression''' folder as my Working Directory.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Let’s create a '''Logistic Regression''' classifier model on the '''raisin''' dataset. &lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us switch to '''RStudio'''. &lt;br /&gt;
|- &lt;br /&gt;
|| Click LogisticRegression.R in RStudio&lt;br /&gt;
&lt;br /&gt;
Point to LogisticRegression.R in RStudio.&lt;br /&gt;
|| Open the script '''LogisticRegression.R''' in '''RStudio'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For this, click on the script '''LogisticRegression.R.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Script '''LogisticRegression.R''' opens in '''RStudio'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [Rstudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the commands&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''library(readxl)'''&lt;br /&gt;
&lt;br /&gt;
'''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
'''library(VGAM)'''&lt;br /&gt;
&lt;br /&gt;
'''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
'''library(dplyr)'''&lt;br /&gt;
&lt;br /&gt;
'''&amp;lt;nowiki&amp;gt;#install.packages(&amp;quot;package_name&amp;quot;)&amp;lt;/nowiki&amp;gt;'''&lt;br /&gt;
&lt;br /&gt;
'''Point to the command.'''&lt;br /&gt;
&lt;br /&gt;
|| Select and run these commands to import the necessary packages.&lt;br /&gt;
&lt;br /&gt;
The '''glm()''' function that we will use to create our classifier is available in base '''R''' ('''stats''' package).&lt;br /&gt;
&lt;br /&gt;
As I have already installed the packages, I have directly imported them.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight &lt;br /&gt;
&lt;br /&gt;
'''data &amp;lt;- read_xlsx(&amp;quot;Raisin_Dataset.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''data &amp;lt;- data[c(&amp;quot;minorAL&amp;quot;,&amp;quot;ecc&amp;quot;,&amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Highlight the commands.'''&lt;br /&gt;
|| These commands will load the '''Raisin dataset.'''&lt;br /&gt;
&lt;br /&gt;
They will also prepare the dataset for model building.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click on '''data '''on the Environment tab.&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Click on '''data '''in the '''Environment '''tab.&lt;br /&gt;
&lt;br /&gt;
It loads the modified dataset in the '''Source''' window. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the data.&lt;br /&gt;
|| Now we split our dataset into training and testing data.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
'''trainIndex&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
'''train &amp;lt;- data[trainIndex, ]'''&lt;br /&gt;
&lt;br /&gt;
'''test &amp;lt;- data[-trainIndex, ]'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''trainIndex &amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''train &amp;lt;- data[trainIndex, ]'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''test &amp;lt;- data[-trainIndex, ]'''&lt;br /&gt;
&lt;br /&gt;
Click on Save and Run buttons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Click on '''train''' and '''test''' to load them in the Source window.&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us create a '''Logistic Regression '''model on the '''training dataset'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''Logistic_model &amp;lt;- glm(class ~ ., data = train, family = &amp;quot;binomial&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''summary(Logistic_model)$coef'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight '''glm()'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''class ~ .'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''family = binomial'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''train''' &lt;br /&gt;
|| The function '''glm()''' fits '''generalized linear models'''. &lt;br /&gt;
&lt;br /&gt;
Logistic regression is among the class of models that it fits. &lt;br /&gt;
&lt;br /&gt;
This is the formula for our model. &lt;br /&gt;
&lt;br /&gt;
We try to predict the target variable '''class''' based on the '''minorAL''' and '''ecc''' features.&lt;br /&gt;
&lt;br /&gt;
The '''family = &amp;quot;binomial&amp;quot;''' argument ensures that our model predicts the probability for 2 classes.&lt;br /&gt;
&lt;br /&gt;
It ensures that, among all the models '''glm()''' can fit, a logistic regression model is fitted.&lt;br /&gt;
&lt;br /&gt;
This is the data used to train our model.&lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
The output is shown in the '''console '''window.&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the console window.&lt;br /&gt;
|| Drag boundary to see the '''console '''window. &lt;br /&gt;
|- &lt;br /&gt;
|| Point the output in the '''console'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''Coefficients'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''Pr(&amp;gt;|z|)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
'''Coefficients''' denote the coefficients of the logit function.&lt;br /&gt;
&lt;br /&gt;
This means that the log-odds change by -0.04 for every unit increase in '''minorAL'''.&lt;br /&gt;
&lt;br /&gt;
The lower p-values suggest that the effects are statistically significant.&lt;br /&gt;
&lt;br /&gt;
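As a brief aside, a logit coefficient such as the -0.04 quoted above can also be read on the odds scale:

```latex
% A one-unit increase in minorAL multiplies the odds of the modeled
% class by e^{\beta}, holding the other feature fixed:
e^{-0.04} \approx 0.961
% i.e. the odds decrease by roughly 4\% per unit of minorAL.
```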
|- &lt;br /&gt;
|| Drag boundary to see the '''Source '''window.&lt;br /&gt;
|| Drag boundary to see the '''Source''' window.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us now use our model to make predictions on test data.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''Predicted.prob &amp;lt;- predict(Logistic_model, test, type=&amp;quot;response&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''View(Predicted.prob)'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''Predicted.prob &amp;lt;- predict(Logistic_model, test, type=&amp;quot;response&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''type = &amp;quot;response&amp;quot;'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| This command provides the predicted probability of the logistic regression model on the test dataset.&lt;br /&gt;
&lt;br /&gt;
The '''type = &amp;quot;response&amp;quot;''' argument ensures the output is a probability.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands&lt;br /&gt;
|- &lt;br /&gt;
|| Point&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Value&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
'''Predicted.prob '''stores the predicted probability of each observation belonging to a certain class.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''predicted.classes &amp;lt;- factor(ifelse(Predicted.prob &amp;gt; 0.5, &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
|| In the '''Source''' window type this command&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight &lt;br /&gt;
&lt;br /&gt;
'''predicted.classes &amp;lt;- factor(ifelse(Predicted.prob &amp;gt; 0.5, &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
|| This retrieves the predicted classes from the probabilities. &lt;br /&gt;
&lt;br /&gt;
If the probability is greater than 0.5, the '''Kecimen''' class is chosen; otherwise, the '''Besni''' class.&lt;br /&gt;
&lt;br /&gt;
We also convert the output to a '''factor''' datatype, as required by the '''confusionMatrix()''' function.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us measure the accuracy of our model. &lt;br /&gt;
|- &lt;br /&gt;
|| '''confusion_matrix &amp;lt;- confusionMatrix(predicted.classes,test$class)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command '''confusionMatrix(predicted.classes,test$class)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to the confusion in the Environment Tab&lt;br /&gt;
&lt;br /&gt;
Highlight the attribute&lt;br /&gt;
&lt;br /&gt;
'''table'''&lt;br /&gt;
|| This command creates a confusion matrix list.&lt;br /&gt;
&lt;br /&gt;
The list is created from the actual and predicted class labels.&lt;br /&gt;
&lt;br /&gt;
It is stored in the '''confusion_matrix''' variable.&lt;br /&gt;
&lt;br /&gt;
It helps to assess the classification model's performance and accuracy.&lt;br /&gt;
&lt;br /&gt;
Select and run these commands&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''plot_confusion_matrix &amp;lt;- function(confusion_matrix){'''&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
'''tab = as.data.frame(tab)'''&lt;br /&gt;
&lt;br /&gt;
'''tab$Prediction &amp;lt;- factor(tab$Prediction, levels = rev(levels(tab$Prediction)))'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- tab %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''rename(Actual = Reference) %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''mutate(cor = if_else(Actual == Prediction, 1,0))'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''tab$cor &amp;lt;- as.factor(tab$cor)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''ggplot(tab, aes(Actual,Prediction)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_tile(aes(fill= cor),alpha = 0.4) + geom_text(aes(label=Freq)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;red&amp;quot;,&amp;quot;green&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_light() +'''&lt;br /&gt;
&lt;br /&gt;
'''theme(legend.position = &amp;quot;none&amp;quot;,'''&lt;br /&gt;
&lt;br /&gt;
'''line = element_blank()) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_x_discrete(position = &amp;quot;top&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Highlight '''the command &lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
'''Highlight '''the command&lt;br /&gt;
&lt;br /&gt;
'''tab = as.data.frame(tab)'''&lt;br /&gt;
&lt;br /&gt;
'''tab$Prediction &amp;lt;- factor(tab$Prediction, levels = rev(levels(tab$Prediction)))'''&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- tab %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''rename(Actual = Reference) %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''mutate(cor = if_else(Actual == Prediction, 1,0))'''&lt;br /&gt;
&lt;br /&gt;
'''tab$cor &amp;lt;- as.factor(tab$cor)'''&lt;br /&gt;
&lt;br /&gt;
'''Highlight '''the command&lt;br /&gt;
&lt;br /&gt;
'''ggplot(tab, aes(Actual,Prediction)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_tile(aes(fill= cor),alpha = 0.4) + geom_text(aes(label=Freq)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;red&amp;quot;,&amp;quot;green&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_light() +'''&lt;br /&gt;
&lt;br /&gt;
'''theme(legend.position = &amp;quot;none&amp;quot;,'''&lt;br /&gt;
&lt;br /&gt;
'''line = element_blank()) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_x_discrete(position = &amp;quot;top&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
|| These commands create a function '''plot_confusion_matrix '''to display the confusion matrix from the confusion matrix list created.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It fetches the confusion matrix table from the list.&lt;br /&gt;
&lt;br /&gt;
It creates a data frame from the table which is suitable for plotting using '''ggplot2'''.&lt;br /&gt;
&lt;br /&gt;
It plots the confusion matrix using the data frame created.&lt;br /&gt;
&lt;br /&gt;
It represents correct and incorrect predictions using different colors.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''plot_confusion_matrix(confusion_matrix)'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''plot_confusion_matrix(confusion_matrix)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We use the '''plot_confusion_matrix()''' function to generate a visual plot of the '''confusion matrix list created.'''&lt;br /&gt;
&lt;br /&gt;
Select and run the command&lt;br /&gt;
&lt;br /&gt;
The output is seen in the '''plot''' window&lt;br /&gt;
|- &lt;br /&gt;
|| '''Output in Plot window.'''&lt;br /&gt;
&lt;br /&gt;
|| This plot shows how well our model predicted the testing data.&lt;br /&gt;
&lt;br /&gt;
We observe that:&lt;br /&gt;
&lt;br /&gt;
'''21''' misclassifications of the '''Besni''' class.&lt;br /&gt;
&lt;br /&gt;
'''13''' misclassifications of the '''Kecimen''' class.&lt;br /&gt;
&lt;br /&gt;
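For reference, these misclassification counts imply an overall accuracy; the figure below assumes the standard 900-row Raisin dataset, whose 70/30 split leaves 270 test observations:

```latex
% Accuracy = correct test predictions / total test observations
\text{accuracy} = \frac{270 - (21 + 13)}{270} = \frac{236}{270} \approx 0.874
```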
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid &amp;lt;- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500),'''&lt;br /&gt;
&lt;br /&gt;
'''ecc = seq(min(data$ecc), max(data$ecc), length = 500)) '''&lt;br /&gt;
&lt;br /&gt;
'''grid$prob &amp;lt;- predict(Logistic_model, newdata = grid, type = &amp;quot;response&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid$class &amp;lt;- ifelse(grid$prob &amp;gt; 0.5, 'Kecimen', 'Besni')'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid$classnum &amp;lt;- as.numeric(as.factor(grid$class))'''&lt;br /&gt;
&lt;br /&gt;
|| We will visualize the decision boundary of the model.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''grid &amp;lt;- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500),'''&lt;br /&gt;
&lt;br /&gt;
'''ecc = seq(min(data$ecc), max(data$ecc), length = 500)) '''&lt;br /&gt;
&lt;br /&gt;
'''grid$prob &amp;lt;- predict(Logistic_model, newdata = grid, type = &amp;quot;response&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''grid$class &amp;lt;- ifelse(grid$prob &amp;gt; 0.5, 'Kecimen', 'Besni')'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid$classnum &amp;lt;- as.numeric(as.factor(grid$class))'''&lt;br /&gt;
|| This code first generates a '''grid '''of points spanning the range of '''minorAL '''and '''ecc''' features in the dataset. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Then, it uses the '''Logistic Regression''' model to predict the probability of each point in this grid, storing these predictions as a new column '''prob''' in the '''grid''' dataframe. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It converts the predicted probabilities of the points into classes.&lt;br /&gt;
&lt;br /&gt;
If the probability exceeds 0.5, the '''Kecimen''' class is chosen; otherwise, the '''Besni''' class.&lt;br /&gt;
&lt;br /&gt;
The predicted classes are stored in the '''class''' column of the '''grid''' data frame.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The '''as.numeric''' function encodes the predicted class labels into numeric values.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Select and run the commands&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Click on grid in the Environment tab to load the generated data in the Source window.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train, aes(x = minorAL, y = ecc, color = class)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),'''&lt;br /&gt;
&lt;br /&gt;
'''colour = &amp;quot;black&amp;quot;, linewidth = 0.7) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(x = &amp;quot;MinorAL&amp;quot;, y = &amp;quot;ecc&amp;quot;, title = &amp;quot;Logistic Regression Decision Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source '''window type these commands &lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train, aes(x = minorAL, y = ecc, color = class)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),'''&lt;br /&gt;
&lt;br /&gt;
'''colour = &amp;quot;black&amp;quot;, linewidth = 0.7) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(x = &amp;quot;MinorAL&amp;quot;, y = &amp;quot;ecc&amp;quot;, title = &amp;quot;Logistic Regression Decision Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| We are creating the decision boundary plot using '''ggplot2''' from the data generated. &lt;br /&gt;
&lt;br /&gt;
It plots the grid points with colors indicating the predicted classes. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The overall plot provides a visual representation of the decision boundary and the distribution of training data points of the '''model'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Select and run these commands.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Drag boundaries to see the plot window clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| We can conclude that the decision boundary of logistic regression is a straight line.&lt;br /&gt;
&lt;br /&gt;
The line separates the two classes reasonably well.&lt;br /&gt;
|- &lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
Limitations of Logistic Regression&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* It’s sensitive to outliers which can affect the accuracy of the classifier.&lt;br /&gt;
* It can perform poorly in the presence of multicollinearity among explanatory variables.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| Here are some of the limitations of Logistic Regression&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us summarize what we have learned.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Summary&lt;br /&gt;
|| In this tutorial we have learned about:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Logistic Regression&lt;br /&gt;
* Assumptions of Logistic Regression&lt;br /&gt;
* Advantages of Logistic Regression&lt;br /&gt;
* Implementation of Logistic Regression in '''R''' using the '''Raisin''' dataset.&lt;br /&gt;
* Model Evaluation.&lt;br /&gt;
* Visualization of the model Decision Boundary&lt;br /&gt;
* Limitations of Logistic Regression&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Now we will suggest an assignment for this Spoken Tutorial.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Assignment&lt;br /&gt;
|| &lt;br /&gt;
* Apply logistic regression on the '''Wine '''dataset. &lt;br /&gt;
* This dataset can be found in the '''HDclassif''' package. &lt;br /&gt;
* Install the package and import the dataset using the '''data()''' command.&lt;br /&gt;
* Measure the accuracy of the model&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
About the Spoken Tutorial Project&lt;br /&gt;
|| The video at the following link summarizes the Spoken Tutorial project. Please download and watch it.&lt;br /&gt;
|- &lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Workshops&lt;br /&gt;
|| We conduct workshops using Spoken Tutorials and give certificates.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please contact us.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Forum to answer questions&lt;br /&gt;
|| Please post your timed queries in this forum.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Forum to answer questions&lt;br /&gt;
|| Do you have any general/technical questions?&lt;br /&gt;
&lt;br /&gt;
Please visit the forum given in the link.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Textbook Companion&lt;br /&gt;
|| The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.&lt;br /&gt;
&lt;br /&gt;
We give certificates to those who do this.&lt;br /&gt;
&lt;br /&gt;
For more details, please visit these sites.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Acknowledgment&lt;br /&gt;
|| The '''Spoken Tutorial Project''' is funded by the Ministry of Education, Govt. of India. &lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Thank You&lt;br /&gt;
|| This tutorial is contributed by Yate Asseke Ronald O. and Debatosh Chakraborty from IIT Bombay.&lt;br /&gt;
&lt;br /&gt;
Thank you for joining.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Ushav</name></author>	</entry>

	<entry>
		<id>https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Logistic-Regression-in-R/English</id>
		<title>Machine-Learning-using-R/C2/Logistic-Regression-in-R/English</title>
		<link rel="alternate" type="text/html" href="https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Logistic-Regression-in-R/English"/>
				<updated>2024-05-31T09:00:11Z</updated>
		
		<summary type="html">&lt;p&gt;Ushav: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title of the script''': Logistic Regression&lt;br /&gt;
&lt;br /&gt;
'''Author''': Yate Asseke Ronald Olivera and Debatosh Chakraborty&lt;br /&gt;
&lt;br /&gt;
'''Keywords''': R, RStudio, machine learning, supervised, unsupervised, classification, logistic regression, video tutorial.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=1&lt;br /&gt;
| align=center| '''Visual Cue'''&lt;br /&gt;
| align=center| '''Narration'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Opening Slide'''&lt;br /&gt;
|| Welcome to this spoken tutorial on '''Logistic Regression in R.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Learning Objectives'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| In this tutorial, we will learn about &lt;br /&gt;
* Logistic Regression&lt;br /&gt;
* Assumptions of Logistic Regression&lt;br /&gt;
* Advantages of Logistic Regression&lt;br /&gt;
* Implementation of Logistic Regression in '''R''' using the '''Raisin''' dataset.&lt;br /&gt;
* Model Evaluation.&lt;br /&gt;
* Visualization of the model Decision Boundary&lt;br /&gt;
* Limitations of Logistic Regression&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''System Specifications'''&lt;br /&gt;
|| This tutorial is recorded using,&lt;br /&gt;
* '''Windows 11 '''&lt;br /&gt;
* '''R '''version''' 4.3.0'''&lt;br /&gt;
* '''RStudio''' version '''2023.06.1'''&lt;br /&gt;
&lt;br /&gt;
It is recommended to install '''R''' version '''4.2.0''' or higher.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Prerequisites '''&lt;br /&gt;
|| To follow this tutorial, the learner should know:&lt;br /&gt;
* Basic programming in '''R'''.&lt;br /&gt;
* '''Basics of Machine Learning'''.&lt;br /&gt;
&lt;br /&gt;
If not, please access the relevant tutorials on this website.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us learn what '''logistic regression''' is.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Logistic Regression'''&lt;br /&gt;
&lt;br /&gt;
|| Logistic regression is a statistical model used for classification.&lt;br /&gt;
&lt;br /&gt;
It models the probability of success as a function of the explanatory variables.&lt;br /&gt;
&lt;br /&gt;
* It predicts the probability, unlike the response in linear regression.&lt;br /&gt;
* The predicted probability is used as a classifier.&lt;br /&gt;
* The probability of success is modeled using the''' logit or (log odds) '''function.&lt;br /&gt;
* It is a linear classifier, as the logistic regression model has a linear logit.&lt;br /&gt;
* It is often used when the response variable is categorical.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Assumptions of Logistic Regression'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* The distribution of the dependent variable is Bernoulli.&lt;br /&gt;
* The data records are independent.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| The dependent variable's distribution is typically assumed to be a Bernoulli distribution in logistic regression.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Advantages of Logistic Regression'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* It provides estimates of regression coefficients along with their standard errors.&lt;br /&gt;
* It also provides the predicted probability which in turn is used as a classifier.&lt;br /&gt;
* It doesn’t need explanatory variables to be necessarily continuous. &lt;br /&gt;
* In this sense, it is a more general classifier than LDA and QDA.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| Logistic regression offers a significant advantage in that continuous explanatory variables are not a requirement.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Implementation Of Logistic Regression'''&lt;br /&gt;
|| We will implement '''logistic regression''' using the '''Raisin '''dataset. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The additional reading material has more details on the '''Raisin dataset'''.&lt;br /&gt;
&lt;br /&gt;
Please refer to it.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide '''&lt;br /&gt;
&lt;br /&gt;
'''Download Files '''&lt;br /&gt;
|| We will use the script file '''LogisticRegression.R''' and the '''Raisin''' dataset file '''Raisin_Dataset.xlsx'''&lt;br /&gt;
&lt;br /&gt;
Please download these files from the''' Code files''' link of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Make a copy and then use them while practicing.&lt;br /&gt;
|- &lt;br /&gt;
|| [Computer screen]&lt;br /&gt;
&lt;br /&gt;
Highlight LogisticRegression.R &lt;br /&gt;
&lt;br /&gt;
Logistic Regression folder.&lt;br /&gt;
|| I have downloaded and moved these files to the '''Logistic Regression''' folder. &lt;br /&gt;
&lt;br /&gt;
This folder is located in the '''MLProject '''folder on the '''Desktop'''. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
I have also set the '''Logistic Regression''' folder as my Working Directory.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Let’s create a '''Logistic Regression''' classifier model on the '''raisin''' dataset. &lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us switch to '''RStudio'''. &lt;br /&gt;
|- &lt;br /&gt;
|| Click LogisticRegression.R in RStudio&lt;br /&gt;
&lt;br /&gt;
Point to LogisticRegression.R in RStudio.&lt;br /&gt;
|| Open the script '''LogisticRegression.R''' in '''RStudio'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For this, click on the script '''LogisticRegression.R.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Script '''LogisticRegression.R''' opens in '''RStudio'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [Rstudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the commands&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''library(readxl)'''&lt;br /&gt;
&lt;br /&gt;
'''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
'''library(VGAM)'''&lt;br /&gt;
&lt;br /&gt;
'''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
'''library(dplyr)'''&lt;br /&gt;
&lt;br /&gt;
'''&amp;lt;nowiki&amp;gt;#install.packages(&amp;quot;package_name&amp;quot;)&amp;lt;/nowiki&amp;gt;'''&lt;br /&gt;
&lt;br /&gt;
'''Point to the command.'''&lt;br /&gt;
&lt;br /&gt;
|| Select and run these commands to import the necessary packages.&lt;br /&gt;
&lt;br /&gt;
The '''glm()''' function required to create our classifier is part of base '''R''', in the '''stats''' package.&lt;br /&gt;
&lt;br /&gt;
As I have already installed the packages, I have directly imported them.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight &lt;br /&gt;
&lt;br /&gt;
'''data &amp;lt;- read_xlsx(&amp;quot;Raisin_Dataset.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''data &amp;lt;- data[c(&amp;quot;minorAL&amp;quot;,&amp;quot;ecc&amp;quot;,&amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Highlight the commands.'''&lt;br /&gt;
|| These commands will load the '''Raisin dataset.'''&lt;br /&gt;
&lt;br /&gt;
They will also prepare the dataset for model building.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click on '''data '''on the Environment tab.&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Click on '''data '''in the '''Environment '''tab.&lt;br /&gt;
&lt;br /&gt;
It loads the modified dataset in the '''Source''' window. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the data.&lt;br /&gt;
|| Now we split our dataset into training and testing data.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
'''trainIndex &amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
'''train &amp;lt;- data[trainIndex, ]'''&lt;br /&gt;
&lt;br /&gt;
'''test &amp;lt;- data[-trainIndex, ]'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''trainIndex &amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''train &amp;lt;- data[trainIndex, ]'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''test &amp;lt;- data[-trainIndex, ]'''&lt;br /&gt;
&lt;br /&gt;
Click on Save and Run buttons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Click on '''train '''and '''test '''to load them in the Source window.&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
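&lt;br /&gt;
The split above can be sketched at a small scale. The 10-row data frame below is a toy stand-in for the Raisin data; only the 70/30 '''sample()''' pattern mirrors the script.&lt;br /&gt;

```r
# A minimal sketch of the 70/30 split performed above,
# using a toy 10-row data frame instead of the Raisin data.
set.seed(1)
toy = data.frame(x = 1:10)
idx = sample(1:nrow(toy), size = 0.7 * nrow(toy), replace = FALSE)
tr  = toy[idx, , drop = FALSE]    # 7 training rows
te  = toy[-idx, , drop = FALSE]   # 3 testing rows
c(nrow(tr), nrow(te))             # 7 3
```
&lt;br /&gt;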
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us create a '''Logistic Regression '''model on the '''training dataset'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''Logistic_model &amp;lt;- glm(class ~ ., data = train, family = &amp;quot;binomial&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''summary(Logistic_model)$coef'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight glm()&lt;br /&gt;
&lt;br /&gt;
Highlight '''class ~ .'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''family = binomial'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''train''' &lt;br /&gt;
|| The function glm() represents generalized linear models. &lt;br /&gt;
&lt;br /&gt;
Logistic regression is among the class of models that it fits. &lt;br /&gt;
&lt;br /&gt;
This is the formula for our model. &lt;br /&gt;
&lt;br /&gt;
We try to predict the target variable '''class''' based on the '''minorAL '''and '''ecc '''features.&lt;br /&gt;
&lt;br /&gt;
Setting '''family''' to '''binomial''' ensures that our model predicts the probability for the two classes.&lt;br /&gt;
&lt;br /&gt;
It ensures that, out of all the models '''glm()''' can fit, the logistic regression model is chosen.&lt;br /&gt;
&lt;br /&gt;
This is the data used to train our model.&lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
The output is shown in the '''console '''window.&lt;br /&gt;
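&lt;br /&gt;
The same call pattern can be illustrated on simulated data. The '''x''' and '''y''' below are made up for the sketch; only the '''glm()''' call with '''family = binomial''' mirrors the script.&lt;br /&gt;

```r
# A minimal sketch: glm() with a binomial family fits a logistic model.
# x and y are simulated toy data, not the Raisin features.
set.seed(2)
x = rnorm(40)
y = rbinom(40, 1, 1 / (1 + exp(-x)))
m = glm(y ~ x, family = "binomial")
length(coef(m))   # 2: an intercept and a slope for x
```
&lt;br /&gt;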
|- &lt;br /&gt;
|| Drag boundary to see the console window.&lt;br /&gt;
|| Drag boundary to see the '''console '''window. &lt;br /&gt;
|- &lt;br /&gt;
|| Point the output in the '''console'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''Coefficients'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''Pr(&amp;gt;|z|)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
'''Coefficients''' denote the coefficients of the logit function.&lt;br /&gt;
&lt;br /&gt;
That means the log-odds of the class change by '''-0.04''' for every unit change in '''minorAL'''.&lt;br /&gt;
&lt;br /&gt;
The lower p-values suggest that the effects are statistically significant.&lt;br /&gt;
&lt;br /&gt;
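&lt;br /&gt;
A coefficient on the logit scale can also be read as an odds ratio by exponentiating it. The sketch below assumes the '''-0.04''' value quoted in the narration; it is illustrative, not freshly computed model output.&lt;br /&gt;

```r
# Exponentiating a logit coefficient gives the multiplicative
# change in the odds per unit change of the feature.
b_minorAL  = -0.04      # value quoted in the narration (assumed)
odds_ratio = exp(b_minorAL)
round(odds_ratio, 3)    # roughly 0.961
```
&lt;br /&gt;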
|- &lt;br /&gt;
|| Drag boundary to see the '''Source '''window.&lt;br /&gt;
|| Drag boundary to see the '''Source''' window.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us now use our model to make predictions on test data.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''Predicted.prob &amp;lt;- predict(Logistic_model, test, type=&amp;quot;response&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''View(Predicted.prob)'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''Predicted.prob &amp;lt;- predict(Logistic_model, test, type=&amp;quot;response&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''type = &amp;quot;response&amp;quot;'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| This command provides the predicted probability of the logistic regression model on the test dataset.&lt;br /&gt;
&lt;br /&gt;
This argument ensures the output is a probability rather than a log-odds value.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands&lt;br /&gt;
|- &lt;br /&gt;
|| Point&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Value&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
'''Predicted.prob '''stores the predicted probability of each observation belonging to a certain class.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''predicted.classes &amp;lt;- factor(ifelse(Predicted.prob &amp;gt; 0.5, &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
|| In the '''Source''' window type the following command.&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight &lt;br /&gt;
&lt;br /&gt;
'''predicted.classes &amp;lt;- factor(ifelse(Predicted.prob &amp;gt; 0.5, &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
|| This retrieves the predicted classes from the probabilities. &lt;br /&gt;
&lt;br /&gt;
If the probability is greater than 0.5, the '''Kecimen '''class is chosen; otherwise, the '''Besni '''class is chosen.&lt;br /&gt;
&lt;br /&gt;
We also convert the output to a '''factor''' datatype, as required by the '''confusionMatrix()''' function.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
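&lt;br /&gt;
The thresholding step can be sketched on toy probabilities. The four values below are made up for illustration and are not real model output.&lt;br /&gt;

```r
# A minimal sketch of the 0.5 cutoff used above, on toy probabilities.
probs   = c(0.92, 0.31, 0.58, 0.07)   # illustrative values, not model output
classes = factor(ifelse(probs > 0.5, "Kecimen", "Besni"))
as.character(classes)   # "Kecimen" "Besni" "Kecimen" "Besni"
```
&lt;br /&gt;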
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us measure the accuracy of our model. &lt;br /&gt;
|- &lt;br /&gt;
|| '''confusion_matrix &amp;lt;- confusionMatrix(predicted.classes,test$class)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type this command.&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command '''confusionMatrix(predicted.classes,test$class)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to '''confusion_matrix''' in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Highlight the attribute&lt;br /&gt;
&lt;br /&gt;
'''table'''&lt;br /&gt;
|| This command creates a confusion matrix list.&lt;br /&gt;
&lt;br /&gt;
The list is created from the actual and predicted class labels.&lt;br /&gt;
&lt;br /&gt;
It is stored in the '''confusion_matrix''' variable.&lt;br /&gt;
&lt;br /&gt;
It helps to assess the classification model's performance and accuracy.&lt;br /&gt;
&lt;br /&gt;
Select and run these commands&lt;br /&gt;
&lt;br /&gt;
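&lt;br /&gt;
What '''confusionMatrix()''' tabulates can be sketched with base '''table()''' on toy labels. The five labels below are assumptions for illustration only.&lt;br /&gt;

```r
# A minimal sketch of a confusion table and accuracy on toy labels,
# analogous to what caret's confusionMatrix() reports.
actual    = factor(c("Besni", "Besni", "Kecimen", "Kecimen", "Kecimen"))
predicted = factor(c("Besni", "Kecimen", "Kecimen", "Kecimen", "Besni"))
tab = table(Prediction = predicted, Actual = actual)
accuracy = sum(diag(tab)) / sum(tab)   # correct predictions / all predictions
accuracy   # 0.6
```
&lt;br /&gt;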
|- &lt;br /&gt;
|| '''plot_confusion_matrix &amp;lt;- function(confusion_matrix){'''&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
'''tab = as.data.frame(tab)'''&lt;br /&gt;
&lt;br /&gt;
'''tab$Prediction &amp;lt;- factor(tab$Prediction, levels = rev(levels(tab$Prediction)))'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- tab %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''rename(Actual = Reference) %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''mutate(cor = if_else(Actual == Prediction, 1,0))'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''tab$cor &amp;lt;- as.factor(tab$cor)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''ggplot(tab, aes(Actual,Prediction)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_tile(aes(fill= cor),alpha = 0.4) + geom_text(aes(label=Freq)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;red&amp;quot;,&amp;quot;green&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_light() +'''&lt;br /&gt;
&lt;br /&gt;
'''theme(legend.position = &amp;quot;None&amp;quot;,'''&lt;br /&gt;
&lt;br /&gt;
'''line = element_blank()) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_x_discrete(position = &amp;quot;top&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Highlight '''the command &lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
'''Highlight '''the command&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
'''tab = as.data.frame(tab)'''&lt;br /&gt;
&lt;br /&gt;
'''tab$Prediction &amp;lt;- factor(tab$Prediction, levels = rev(levels(tab$Prediction)))'''&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- tab %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''rename(Actual = Reference) %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''mutate(cor = if_else(Actual == Prediction, 1,0))'''&lt;br /&gt;
&lt;br /&gt;
'''tab$cor &amp;lt;- as.factor(tab$cor)'''&lt;br /&gt;
&lt;br /&gt;
'''Highlight '''the command&lt;br /&gt;
&lt;br /&gt;
'''ggplot(tab, aes(Actual,Prediction)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_tile(aes(fill= cor),alpha = 0.4) + geom_text(aes(label=Freq)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;red&amp;quot;,&amp;quot;green&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_light() +'''&lt;br /&gt;
&lt;br /&gt;
'''theme(legend.position = &amp;quot;None&amp;quot;,'''&lt;br /&gt;
&lt;br /&gt;
'''line = element_blank()) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_x_discrete(position = &amp;quot;top&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
|| These commands create a function, '''plot_confusion_matrix''', that displays the confusion matrix stored in the list created earlier.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It fetches the confusion matrix table from the list.&lt;br /&gt;
&lt;br /&gt;
It creates a data frame from the table which is suitable for plotting using '''GGPlot2'''.&lt;br /&gt;
&lt;br /&gt;
It plots the confusion matrix using the data frame created.&lt;br /&gt;
&lt;br /&gt;
It represents correct and incorrect predictions using different colors.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''plot_confusion_matrix(confusion_matrix)'''&lt;br /&gt;
&lt;br /&gt;
|| Click on '''LogisticRegression.R''' in the '''Source '''window.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''plot_confusion_matrix(confusion_matrix)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We use the '''plot_confusion_matrix()''' function to generate a visual plot of the '''confusion matrix list created.'''&lt;br /&gt;
&lt;br /&gt;
Select and run the command&lt;br /&gt;
&lt;br /&gt;
The output is seen in the '''plot''' window&lt;br /&gt;
|- &lt;br /&gt;
|| '''Output in Plot window.'''&lt;br /&gt;
&lt;br /&gt;
|| This plot shows how well our model predicted the testing data.&lt;br /&gt;
&lt;br /&gt;
We observe that:&lt;br /&gt;
&lt;br /&gt;
'''21 '''misclassifications of the '''Besni '''class.&lt;br /&gt;
&lt;br /&gt;
'''13 '''misclassifications of the '''Kecimen '''class.&lt;br /&gt;
&lt;br /&gt;
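&lt;br /&gt;
These counts can be turned into a rough test accuracy. The arithmetic below assumes the full Raisin dataset has 900 rows, so the 30% test split holds 270 rows; that size is an assumption, not stated in the script.&lt;br /&gt;

```r
# Back-of-the-envelope accuracy from the misclassification counts,
# assuming 270 test rows (30% of an assumed 900-row dataset).
n_test   = 270
misclass = 21 + 13
accuracy = 1 - misclass / n_test
round(accuracy, 3)   # roughly 0.874
```
&lt;br /&gt;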
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid &amp;lt;- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500),'''&lt;br /&gt;
&lt;br /&gt;
'''ecc = seq(min(data$ecc), max(data$ecc), length = 500)) '''&lt;br /&gt;
&lt;br /&gt;
'''grid$prob &amp;lt;- predict(Logistic_model, newdata = grid, type = &amp;quot;response&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid$class &amp;lt;- ifelse(grid$prob &amp;gt; 0.5, 'Kecimen', 'Besni')'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid$classnum &amp;lt;- as.numeric(as.factor(grid$class))'''&lt;br /&gt;
&lt;br /&gt;
|| We will visualize the decision boundary of the model.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''grid &amp;lt;- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500),'''&lt;br /&gt;
&lt;br /&gt;
'''ecc = seq(min(data$ecc), max(data$ecc), length = 500)) '''&lt;br /&gt;
&lt;br /&gt;
'''grid$prob &amp;lt;- predict(Logistic_model, newdata = grid, type = &amp;quot;response&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''grid$class &amp;lt;- ifelse(grid$prob &amp;gt; 0.5, 'Kecimen', 'Besni')'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid$classnum &amp;lt;- as.numeric(as.factor(grid$class))'''&lt;br /&gt;
|| This code first generates a '''grid '''of points spanning the range of '''minorAL '''and '''ecc''' features in the dataset. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Then, it uses the '''Logistic Regression '''model to predict the probability of each point in this grid, storing these predictions as a new column ''''prob' '''in the '''grid '''dataframe. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It converts the predicted probabilities of the points into classes.&lt;br /&gt;
&lt;br /&gt;
If the probability exceeds 0.5, the '''Kecimen '''class is chosen; otherwise, the '''Besni '''class is chosen.&lt;br /&gt;
&lt;br /&gt;
The predicted classes are stored in the ‘class’ column of the '''grid '''data frame.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The '''as.numeric()''' function encodes the predicted class labels into numeric values.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Select and run the commands&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Click on grid in the Environment tab to load the generated data in the Source window.&lt;br /&gt;
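&lt;br /&gt;
The grid construction can be sketched at a small scale. The 3x3 toy grid over a unit square below stands in for the 500x500 grid used in the script.&lt;br /&gt;

```r
# expand.grid() builds every combination of the two sequences;
# a 3x3 toy version of the 500x500 grid used in the script.
g = expand.grid(minorAL = seq(0, 1, length = 3),
                ecc     = seq(0, 1, length = 3))
nrow(g)   # 9 points covering the toy feature plane
```
&lt;br /&gt;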
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train, aes(x = minorAL, y = ecc, color = class)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),'''&lt;br /&gt;
&lt;br /&gt;
'''colour = &amp;quot;black&amp;quot;, linewidth = 0.7) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(x = &amp;quot;MinorAL&amp;quot;, y = &amp;quot;ecc&amp;quot;, title = &amp;quot;Logistic Regression Decision Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source '''window type these commands &lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train, aes(x = minorAL, y = ecc, color = class)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),'''&lt;br /&gt;
&lt;br /&gt;
'''colour = &amp;quot;black&amp;quot;, linewidth = 0.7) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(x = &amp;quot;MinorAL&amp;quot;, y = &amp;quot;ecc&amp;quot;, title = &amp;quot;Logistic Regression Decision Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| We are creating the decision boundary plot using GGPlot2 from the data generated. &lt;br /&gt;
&lt;br /&gt;
It plots the grid points with colors indicating the predicted classes. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The overall plot provides a visual representation of the model's decision boundary and the distribution of the training data points.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Select and run these commands.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Drag boundaries to see the plot window clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| We can conclude that the decision boundary of logistic regression is a straight line.&lt;br /&gt;
&lt;br /&gt;
The line separates the data points clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
Limitations of Logistic Regression&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* It’s sensitive to outliers which can affect the accuracy of the classifier.&lt;br /&gt;
* It can perform poorly in the presence of multicollinearity among explanatory variables.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| Here are some of the limitations of Logistic Regression&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us summarize what we have learned.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Summary&lt;br /&gt;
|| In this tutorial we have learned about:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Logistic Regression&lt;br /&gt;
* Assumptions of Logistic Regression&lt;br /&gt;
* Advantages of Logistic Regression&lt;br /&gt;
* Implementation of Logistic Regression in '''R''' using '''Raisin '''dataset'''.'''&lt;br /&gt;
* Model Evaluation.&lt;br /&gt;
* Visualization of the model Decision Boundary&lt;br /&gt;
* Limitations of Logistic Regression&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Now we will suggest an assignment for this Spoken Tutorial.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Assignment&lt;br /&gt;
|| &lt;br /&gt;
* Apply logistic regression on the '''Wine '''dataset. &lt;br /&gt;
* This dataset can be found in the '''HDclassif''' package. &lt;br /&gt;
* Install the package and import the dataset using the '''data()''' command.&lt;br /&gt;
* Measure the accuracy of the model&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
About the Spoken Tutorial Project&lt;br /&gt;
|| The video at the following link summarizes the Spoken Tutorial project. Please download and watch it.&lt;br /&gt;
|- &lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Workshops&lt;br /&gt;
|| We conduct workshops using Spoken Tutorials and give certificates.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please contact us.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Forum to answer questions&lt;br /&gt;
|| Please post your timed queries in this forum.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Forum to answer questions&lt;br /&gt;
|| Do you have any general/technical questions?&lt;br /&gt;
&lt;br /&gt;
Please visit the forum given in the link.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Textbook Companion&lt;br /&gt;
|| The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.&lt;br /&gt;
&lt;br /&gt;
We give certificates to those who do this.&lt;br /&gt;
&lt;br /&gt;
For more details, please visit these sites.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Acknowledgment&lt;br /&gt;
|| The '''Spoken Tutorial Project''' was established by the Ministry of Education, Government of India. &lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Thank You&lt;br /&gt;
|| This tutorial is contributed by Yate Asseke Ronald O. and Debatosh Chakraborty from IIT Bombay.&lt;br /&gt;
&lt;br /&gt;
Thank you for joining.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Ushav</name></author>	</entry>

	<entry>
		<id>https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Logistic-Regression-in-R/English</id>
		<title>Machine-Learning-using-R/C2/Logistic-Regression-in-R/English</title>
		<link rel="alternate" type="text/html" href="https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Logistic-Regression-in-R/English"/>
				<updated>2024-05-31T08:55:38Z</updated>
		
		<summary type="html">&lt;p&gt;Ushav: Created page with &amp;quot;'''Title of the script''': Logistic Regression  '''Author''': Yate Asseke Ronald Olivera and Debatosh Chakraborty  '''Keywords''': R, RStudio, machine learning, supervised, un...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title of the script''': Logistic Regression&lt;br /&gt;
&lt;br /&gt;
'''Author''': Yate Asseke Ronald Olivera and Debatosh Chakraborty&lt;br /&gt;
&lt;br /&gt;
'''Keywords''': R, RStudio, machine learning, supervised, unsupervised, classification, logistic regression, video tutorial.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=1&lt;br /&gt;
| align=center| '''Visual Cue'''&lt;br /&gt;
| align=center| '''Narration'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Opening Slide'''&lt;br /&gt;
|| Welcome to this spoken tutorial on '''Logistic Regression in R.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Learning Objectives'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| In this tutorial, we will learn about &lt;br /&gt;
* Logistic Regression&lt;br /&gt;
* Assumptions of Logistic Regression&lt;br /&gt;
* Advantages of Logistic Regression&lt;br /&gt;
* Implementation of Logistic Regression in '''R''' using '''Raisin '''dataset'''.'''&lt;br /&gt;
* Model Evaluation.&lt;br /&gt;
* Visualization of the model Decision Boundary&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''System Specifications'''&lt;br /&gt;
|| This tutorial is recorded using,&lt;br /&gt;
* '''Windows 11 '''&lt;br /&gt;
* '''R '''version''' 4.3.0'''&lt;br /&gt;
* '''RStudio''' version '''2023.06.1'''&lt;br /&gt;
&lt;br /&gt;
It is recommended to install '''R''' version '''4.2.0''' or higher.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Prerequisites '''&lt;br /&gt;
|| To follow this tutorial, the learner should know:&lt;br /&gt;
* Basic programming in '''R'''.&lt;br /&gt;
* '''Basics of Machine Learning'''.&lt;br /&gt;
&lt;br /&gt;
If not, please access the relevant tutorials on this website.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us learn what '''logistic regression''' is&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Logistic Regression'''&lt;br /&gt;
&lt;br /&gt;
|| Logistic regression is a statistical model used for classification.&lt;br /&gt;
&lt;br /&gt;
It models the probability of success as a function of the explanatory variables.&lt;br /&gt;
&lt;br /&gt;
* It predicts the probability, unlike the response in linear regression.&lt;br /&gt;
* The predicted probability is used as a classifier.&lt;br /&gt;
* The probability of success is modeled using the''' logit or (log odds) '''function.&lt;br /&gt;
* It is a linear classifier, as the logistic regression model has a linear logit.&lt;br /&gt;
* It is often used when the response variable is categorical.&lt;br /&gt;
&lt;br /&gt;
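&lt;br /&gt;
The logit relationship in the points above can be sketched with the logistic (inverse-logit) function, which maps any linear predictor to a probability.&lt;br /&gt;

```r
# The logistic function maps any linear predictor z to a
# probability strictly between 0 and 1.
sigmoid = function(z) 1 / (1 + exp(-z))
sigmoid(0)             # 0.5, the decision-boundary value
round(sigmoid(2), 3)   # roughly 0.881
```
&lt;br /&gt;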
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Assumptions of Logistic Regression'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* The distribution of the dependent variable is Bernoulli.&lt;br /&gt;
* The data records are independent.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| The dependent variable's distribution is typically assumed to be a Bernoulli distribution in logistic regression.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Advantages of Logistic Regression'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* It provides estimates of regression coefficients along with their standard errors.&lt;br /&gt;
* It also provides the predicted probability which in turn is used as a classifier.&lt;br /&gt;
* It doesn’t need explanatory variables to be necessarily continuous. &lt;br /&gt;
* In this sense, it is a more general classifier than LDA and QDA.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| Logistic regression offers a significant advantage in that continuous explanatory variables are not a requirement.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Implementation Of Logistic Regression'''&lt;br /&gt;
|| We will implement '''logistic regression''' using the '''Raisin '''dataset. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The additional reading material has more details on the '''Raisin dataset'''.&lt;br /&gt;
&lt;br /&gt;
Please refer to it.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide '''&lt;br /&gt;
&lt;br /&gt;
'''Download Files '''&lt;br /&gt;
|| We will use a script file '''LogisticRegression.R '''and '''Raisin Dataset ‘raisin.xlsx’'''&lt;br /&gt;
&lt;br /&gt;
Please download these files from the''' Code files''' link of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Make a copy and then use them while practicing.&lt;br /&gt;
|- &lt;br /&gt;
|| [Computer screen]&lt;br /&gt;
&lt;br /&gt;
Highlight LogisticRegression.R &lt;br /&gt;
&lt;br /&gt;
Logistic Regression folder.&lt;br /&gt;
|| I have downloaded and moved these files to the '''Logistic Regression''' folder. &lt;br /&gt;
&lt;br /&gt;
This folder is located in the '''MLProject '''folder on the '''Desktop'''. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
I have also set the '''Logistic Regression''' folder as my Working Directory.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Let’s create a '''Logistic Regression''' classifier model on the '''raisin''' dataset. &lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us switch to '''RStudio'''. &lt;br /&gt;
|- &lt;br /&gt;
|| Click LogisticRegression.R in RStudio&lt;br /&gt;
&lt;br /&gt;
Point to LogisticRegression.R in RStudio.&lt;br /&gt;
|| Open the script '''LogisticRegression.R''' in '''RStudio'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For this, click on the script '''LogisticRegression.R.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Script '''LogisticRegression.R''' opens in '''RStudio'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [Rstudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the commands&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''library(readxl)'''&lt;br /&gt;
&lt;br /&gt;
'''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
'''library(VGAM)'''&lt;br /&gt;
&lt;br /&gt;
'''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
'''library(dplyr)'''&lt;br /&gt;
&lt;br /&gt;
'''&amp;lt;nowiki&amp;gt;#install.packages(&amp;quot;package_name&amp;quot;)&amp;lt;/nowiki&amp;gt;'''&lt;br /&gt;
&lt;br /&gt;
'''Point to the command.'''&lt;br /&gt;
&lt;br /&gt;
|| Select and run these commands to import the necessary packages.&lt;br /&gt;
&lt;br /&gt;
The '''glm()''' function required to create our classifier is part of base '''R''', in the '''stats''' package.&lt;br /&gt;
&lt;br /&gt;
As I have already installed the packages, I have directly imported them.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight &lt;br /&gt;
&lt;br /&gt;
'''data &amp;lt;- read_xlsx(&amp;quot;Raisin_Dataset.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''data &amp;lt;- data[c(&amp;quot;minorAL&amp;quot;,&amp;quot;ecc&amp;quot;,&amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Highlight the commands.'''&lt;br /&gt;
|| These commands will load the '''Raisin dataset.'''&lt;br /&gt;
&lt;br /&gt;
They will also prepare the dataset for model building.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click on '''data '''on the Environment tab.&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Click on '''data '''in the '''Environment '''tab.&lt;br /&gt;
&lt;br /&gt;
It loads the modified dataset in the '''Source''' window. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the data.&lt;br /&gt;
|| Now we split our dataset into training and testing data.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
'''trainIndex &amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
'''train &amp;lt;- data[trainIndex, ]'''&lt;br /&gt;
&lt;br /&gt;
'''test &amp;lt;- data[-trainIndex, ]'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''trainIndex &amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''train &amp;lt;- data[trainIndex, ]'''&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''test &amp;lt;- data[-trainIndex, ]'''&lt;br /&gt;
&lt;br /&gt;
Click on Save and Run buttons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Click on '''train '''and '''test '''to load them in the Source window.&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us create a '''Logistic Regression '''model on the '''training dataset'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''Logistic_model &amp;lt;- glm(class ~ ., data = train, family = &amp;quot;binomial&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''summary(Logistic_model)$coef'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight glm()&lt;br /&gt;
&lt;br /&gt;
Highlight '''class ~ .'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''family = binomial'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''train''' &lt;br /&gt;
|| The function glm() represents generalized linear models. &lt;br /&gt;
&lt;br /&gt;
Logistic regression is among the class of models that it fits. &lt;br /&gt;
&lt;br /&gt;
This is the formula for our model. &lt;br /&gt;
&lt;br /&gt;
We try to predict target variable '''class''' based on '''minorAL '''and '''ecc '''features.&lt;br /&gt;
&lt;br /&gt;
This ensures that our model predicts the probability for 2 classes.&lt;br /&gt;
&lt;br /&gt;
It ensures that, of all the models glm() can fit, a logistic regression model is fit.&lt;br /&gt;
&lt;br /&gt;
This is the data used to train our model.&lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
The output is shown in the '''console '''window.&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the console window.&lt;br /&gt;
|| Drag boundary to see the '''console '''window. &lt;br /&gt;
|- &lt;br /&gt;
|| Point the output in the '''console'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''Coefficients'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''Pr(&amp;gt;|z|)'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
'''Coefficients''' denote the coefficients of the logit function.&lt;br /&gt;
&lt;br /&gt;
That means the log-odds of '''class''' change by -0.04 for every unit change in '''minorAL'''.&lt;br /&gt;
&lt;br /&gt;
The lower p-values suggest that the effects are statistically significant.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
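As a self-contained illustration of the glm() call described above (synthetic two-class data, not the Raisin dataset; the variable names are hypothetical):

```r
# glm() with family = "binomial" fits a logistic regression;
# summary()$coef gives estimates, std. errors, z values and p-values.
set.seed(1)
train = data.frame(
  minorAL = rnorm(100),              # hypothetical predictor values
  class   = factor(rep(c("Besni", "Kecimen"), 50))
)
m = glm(class ~ ., data = train, family = "binomial")
coefs = summary(m)$coef              # coefficient table, one row per term
```

The coefficient table has columns Estimate, Std. Error, z value and Pr(&gt;|z|), as shown in the console output.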
|| Drag boundary to see the '''Source '''window.&lt;br /&gt;
|| Drag boundary to see the '''Source''' window.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us now use our model to make predictions on test data.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''Predicted.prob &amp;lt;- predict(Logistic_model, test, type=&amp;quot;response&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''View(Predicted.prob)'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight&lt;br /&gt;
&lt;br /&gt;
'''Predicted.prob &amp;lt;- predict(Logistic_model, test, type=&amp;quot;response&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight&lt;br /&gt;
&lt;br /&gt;
'''type = &amp;quot;response&amp;quot;'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| This command provides the predicted probability of the logistic regression model on the test dataset.&lt;br /&gt;
&lt;br /&gt;
This argument ensures that the output is a probability rather than a log-odds value.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands&lt;br /&gt;
|- &lt;br /&gt;
|| Point&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Value&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
'''Predicted.prob '''stores the predicted probability of each observation belonging to a certain class.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''predicted.classes &amp;lt;- factor(ifelse(Predicted.prob &amp;gt; 0.5, &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
|| In the '''Source''' window type the following command&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight &lt;br /&gt;
&lt;br /&gt;
'''predicted.classes &amp;lt;- factor(ifelse(Predicted.prob &amp;gt; 0.5, &amp;quot;Kecimen&amp;quot;, &amp;quot;Besni&amp;quot;))'''&lt;br /&gt;
|| This retrieves the predicted classes from the probabilities. &lt;br /&gt;
&lt;br /&gt;
If the probability is greater than 0.5, the '''Kecimen '''class is chosen; otherwise, the '''Besni '''class is chosen.&lt;br /&gt;
&lt;br /&gt;
We also convert the output to a '''factor''' datatype so that it can be passed to the confusionMatrix() function.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands&lt;br /&gt;
|- &lt;br /&gt;
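The thresholding step can be seen in isolation with a few illustrative probability values:

```r
# Converting predicted probabilities into class labels at the
# 0.5 threshold (illustrative probability values, not model output).
Predicted.prob = c(0.91, 0.12, 0.67, 0.40)
predicted.classes = factor(ifelse(Predicted.prob > 0.5, "Kecimen", "Besni"))
# predicted.classes: Kecimen Besni Kecimen Besni
```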
|| &lt;br /&gt;
|| Let us measure the accuracy of our model. &lt;br /&gt;
|- &lt;br /&gt;
|| '''confusion_matrix &amp;lt;- confusionMatrix(predicted.classes,test$class)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command '''confusionMatrix(predicted.classes,test$class)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to '''confusion_matrix''' in the Environment tab&lt;br /&gt;
&lt;br /&gt;
Highlight the attribute&lt;br /&gt;
&lt;br /&gt;
'''table'''&lt;br /&gt;
|| This command creates a confusion matrix list.&lt;br /&gt;
&lt;br /&gt;
The list is created from the actual and predicted class labels.&lt;br /&gt;
&lt;br /&gt;
It is stored in the '''confusion_matrix''' variable.&lt;br /&gt;
&lt;br /&gt;
It helps to assess the classification model's performance and accuracy.&lt;br /&gt;
&lt;br /&gt;
Select and run these commands&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
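What confusionMatrix() tabulates can be sketched with base R's table(); caret's version adds accuracy and other statistics on top (the class labels here are illustrative):

```r
# A confusion matrix built with base R's table(): rows are the
# predicted labels, columns the actual labels.
predicted = factor(c("Besni", "Kecimen", "Kecimen", "Besni"))
actual    = factor(c("Besni", "Kecimen", "Besni",   "Besni"))
tab = table(Prediction = predicted, Actual = actual)
accuracy = sum(diag(tab)) / sum(tab)   # 3 of 4 correct
```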
|| '''plot_confusion_matrix &amp;lt;- function(confusion_matrix){'''&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
'''tab = as.data.frame(tab)'''&lt;br /&gt;
&lt;br /&gt;
'''tab$Prediction &amp;lt;- factor(tab$Prediction, levels = rev(levels(tab$Prediction)))'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- tab %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''rename(Actual = Reference) %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''mutate(cor = if_else(Actual == Prediction, 1,0))'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''tab$cor &amp;lt;- as.factor(tab$cor)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''ggplot(tab, aes(Actual,Prediction)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_tile(aes(fill= cor),alpha = 0.4) + geom_text(aes(label=Freq)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;red&amp;quot;,&amp;quot;green&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_light() +'''&lt;br /&gt;
&lt;br /&gt;
'''theme(legend.position = &amp;quot;None&amp;quot;,'''&lt;br /&gt;
&lt;br /&gt;
'''line = element_blank()) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_x_discrete(position = &amp;quot;top&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Highlight '''the command &lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
'''Highlight '''the command&lt;br /&gt;
&lt;br /&gt;
'''tab = as.data.frame(tab)'''&lt;br /&gt;
&lt;br /&gt;
'''tab$Prediction &amp;lt;- factor(tab$Prediction, levels = rev(levels(tab$Prediction)))'''&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- tab %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''rename(Actual = Reference) %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''mutate(cor = if_else(Actual == Prediction, 1,0))'''&lt;br /&gt;
&lt;br /&gt;
'''tab$cor &amp;lt;- as.factor(tab$cor)'''&lt;br /&gt;
&lt;br /&gt;
'''Highlight '''the command&lt;br /&gt;
&lt;br /&gt;
'''ggplot(tab, aes(Actual,Prediction)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_tile(aes(fill= cor),alpha = 0.4) + geom_text(aes(label=Freq)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;red&amp;quot;,&amp;quot;green&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_light() +'''&lt;br /&gt;
&lt;br /&gt;
'''theme(legend.position = &amp;quot;None&amp;quot;,'''&lt;br /&gt;
&lt;br /&gt;
'''line = element_blank()) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_x_discrete(position = &amp;quot;top&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
|| These commands define a function, '''plot_confusion_matrix''', to display the confusion matrix from the confusion-matrix list created earlier.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It fetches the confusion matrix table from the list.&lt;br /&gt;
&lt;br /&gt;
It creates a data frame from the table, which is suitable for plotting using '''ggplot2'''.&lt;br /&gt;
&lt;br /&gt;
It plots the confusion matrix using the data frame created.&lt;br /&gt;
&lt;br /&gt;
It represents correct and incorrect predictions using different colors.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''plot_confusion_matrix(confusion_matrix)'''&lt;br /&gt;
&lt;br /&gt;
|| Click on '''QDA.R''' in the '''Source '''window.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type this command&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''plot_confusion_matrix(confusion_matrix)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We use the '''plot_confusion_matrix()''' function to generate a visual plot from the '''confusion matrix list '''we created.&lt;br /&gt;
&lt;br /&gt;
Select and run the command&lt;br /&gt;
&lt;br /&gt;
The output is seen in the '''plot''' window&lt;br /&gt;
|- &lt;br /&gt;
|| '''Output in Plot window.'''&lt;br /&gt;
&lt;br /&gt;
|| This plot shows how well our model predicted the testing data.&lt;br /&gt;
&lt;br /&gt;
We observe that:&lt;br /&gt;
&lt;br /&gt;
'''21 '''misclassifications of the '''Besni '''class.&lt;br /&gt;
&lt;br /&gt;
'''13 '''misclassifications of the '''Kecimen '''class.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
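The grid construction in the next step can be seen on a tiny scale, with 5-point sequences instead of 500:

```r
# expand.grid() builds every combination of the supplied sequences;
# here 5 x 5 = 25 points instead of the tutorial's 500 x 500.
grid = expand.grid(minorAL = seq(0, 1, length = 5),
                   ecc     = seq(0, 1, length = 5))
```

Each row of grid is one (minorAL, ecc) point at which the model will be evaluated.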
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid &amp;lt;- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500),'''&lt;br /&gt;
&lt;br /&gt;
'''ecc = seq(min(data$ecc), max(data$ecc), length = 500)) '''&lt;br /&gt;
&lt;br /&gt;
'''grid$prob &amp;lt;- predict(Logistic_model, newdata = grid, type = &amp;quot;response&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid$class &amp;lt;- ifelse(grid$prob &amp;gt; 0.5, 'Kecimen', 'Besni')'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid$classnum &amp;lt;- as.numeric(as.factor(grid$class))'''&lt;br /&gt;
&lt;br /&gt;
|| We will visualize the decision boundary of the model.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''grid &amp;lt;- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500),'''&lt;br /&gt;
&lt;br /&gt;
'''ecc = seq(min(data$ecc), max(data$ecc), length = 500)) '''&lt;br /&gt;
&lt;br /&gt;
'''grid$prob &amp;lt;- predict(Logistic_model, newdata = grid, type = &amp;quot;response&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''grid$class &amp;lt;- ifelse(grid$prob &amp;gt; 0.5, 'Kecimen', 'Besni')'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid$classnum &amp;lt;- as.numeric(as.factor(grid$class))'''&lt;br /&gt;
|| This code first generates a '''grid '''of points spanning the range of '''minorAL '''and '''ecc''' features in the dataset. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Then, it uses the '''Logistic Regression '''model to predict the probability of each point in this grid, storing these predictions as a new column ''''prob' '''in the '''grid '''dataframe. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It converts the predicted probabilities of the points into classes.&lt;br /&gt;
&lt;br /&gt;
If the probability exceeds 0.5, the '''Kecimen '''class is chosen; otherwise, the '''Besni '''class is chosen.&lt;br /&gt;
&lt;br /&gt;
The predicted classes are stored in the '''class''' column of the '''grid''' data frame.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The '''as.numeric()''' function encodes the predicted class labels, converted to a factor, into numeric values.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Select and run the commands&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Click on grid in the Environment tab to load the generated data in the Source window.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train, aes(x = minorAL, y = ecc, color = class)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),'''&lt;br /&gt;
&lt;br /&gt;
'''colour = &amp;quot;black&amp;quot;, linewidth = 0.7) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(x = &amp;quot;MinorAL&amp;quot;, y = &amp;quot;ecc&amp;quot;, title = &amp;quot;Logistic Regression Decision Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source '''window type these commands &lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train, aes(x = minorAL, y = ecc, color = class)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),'''&lt;br /&gt;
&lt;br /&gt;
'''colour = &amp;quot;black&amp;quot;, linewidth = 0.7) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(x = &amp;quot;MinorAL&amp;quot;, y = &amp;quot;ecc&amp;quot;, title = &amp;quot;Logistic Regression Decision Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| We create the decision boundary plot using '''ggplot2''' from the data generated. &lt;br /&gt;
&lt;br /&gt;
It plots the grid points with colors indicating the predicted classes. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The overall plot provides a visual representation of the decision boundary and the distribution of training data points of the '''model'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Select and run these commands.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Drag boundaries to see the plot window clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| We can conclude that the decision boundary of logistic regression is a straight line.&lt;br /&gt;
&lt;br /&gt;
The line separates most of the data points, consistent with the misclassifications seen earlier.&lt;br /&gt;
|- &lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
Limitations of Logistic Regression&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* It is sensitive to outliers, which can affect the accuracy of the classifier.&lt;br /&gt;
* It can perform poorly in the presence of multicollinearity among explanatory variables.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| Here are some of the limitations of Logistic Regression&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us summarize what we have learned.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Summary&lt;br /&gt;
|| In this tutorial we have learned about:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Logistic Regression&lt;br /&gt;
* Assumptions of Logistic Regression&lt;br /&gt;
* Advantages of Logistic Regression&lt;br /&gt;
* Implementation of Logistic Regression in '''R''' using the '''Raisin''' dataset.&lt;br /&gt;
* Model Evaluation.&lt;br /&gt;
* Visualization of the model Decision Boundary&lt;br /&gt;
* Limitations of Logistic Regression&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Now we will suggest an assignment for this Spoken Tutorial.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Assignment&lt;br /&gt;
|| &lt;br /&gt;
* Apply logistic regression on the '''Wine '''dataset. &lt;br /&gt;
* This dataset can be found in the '''HDclassif''' package. &lt;br /&gt;
* Install the package and import the dataset using the '''data()''' command.&lt;br /&gt;
* Measure the accuracy of the model&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
About the Spoken Tutorial Project&lt;br /&gt;
|| The video at the following link summarizes the Spoken Tutorial project. Please download and watch it.&lt;br /&gt;
|- &lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Workshops&lt;br /&gt;
|| We conduct workshops using Spoken Tutorials and give certificates.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please contact us.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Forum to answer questions&lt;br /&gt;
|| Please post your timed queries in this forum.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Forum to answer questions&lt;br /&gt;
|| Do you have any general/technical questions?&lt;br /&gt;
&lt;br /&gt;
Please visit the forum given in the link.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Textbook Companion&lt;br /&gt;
|| The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.&lt;br /&gt;
&lt;br /&gt;
We give certificates to those who do this.&lt;br /&gt;
&lt;br /&gt;
For more details, please visit these sites.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Acknowledgment&lt;br /&gt;
|| The '''Spoken Tutorial''' project was established by the Ministry of Education, Government of India. &lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Thank You&lt;br /&gt;
|| This tutorial is contributed by Yate Asseke Ronald. O and Debatosh Chakraborty from IIT Bombay.&lt;br /&gt;
&lt;br /&gt;
Thank you for joining.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Ushav</name></author>	</entry>

	<entry>
		<id>https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Quadratic-Discriminant-Analysis-in-R/English</id>
		<title>Machine-Learning-using-R/C2/Quadratic-Discriminant-Analysis-in-R/English</title>
		<link rel="alternate" type="text/html" href="https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Quadratic-Discriminant-Analysis-in-R/English"/>
				<updated>2024-05-31T05:32:55Z</updated>
		
		<summary type="html">&lt;p&gt;Ushav: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title of the script''': Quadratic Discriminant Analysis in R&lt;br /&gt;
&lt;br /&gt;
'''Author''': Yate Asseke Ronald Olivera and Debatosh Chakraborty&lt;br /&gt;
&lt;br /&gt;
'''Keywords''': R, RStudio, machine learning, supervised, unsupervised, QDA, quadratic discriminant analysis, video tutorial.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=1&lt;br /&gt;
|- &lt;br /&gt;
|| '''Visual Cue'''&lt;br /&gt;
|| '''Narration'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Opening Slide'''&lt;br /&gt;
|| Welcome to this spoken tutorial on''' Quadratic Discriminant Analysis in R'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Learning Objectives'''&lt;br /&gt;
&lt;br /&gt;
|| In this tutorial, we will learn about: &lt;br /&gt;
* Quadratic Discriminant Analysis (QDA).&lt;br /&gt;
* Comparison between '''QDA '''and''' LDA'''.&lt;br /&gt;
* Assumptions for QDA.&lt;br /&gt;
* Applications of QDA&lt;br /&gt;
* Implementation of QDA using''' Raisin''' Dataset'''.'''&lt;br /&gt;
* Visualization of the '''QDA '''separator&lt;br /&gt;
* Limitations of QDA&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''System Specifications'''&lt;br /&gt;
|| This tutorial is recorded using,&lt;br /&gt;
* '''Windows 11 '''&lt;br /&gt;
* '''R '''version''' 4.3.0'''&lt;br /&gt;
* '''RStudio''' version '''2023.06.1'''&lt;br /&gt;
&lt;br /&gt;
It is recommended to install '''R''' version '''4.2.0''' or higher.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Prerequisites '''&lt;br /&gt;
&lt;br /&gt;
'''https://spoken-tutorial.org'''&lt;br /&gt;
|| To follow this tutorial, the learner should know&lt;br /&gt;
* Basic programming in '''R'''.&lt;br /&gt;
* '''Basics of Machine Learning'''.&lt;br /&gt;
&lt;br /&gt;
If not, please access the relevant tutorials on this website.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Quadratic Discriminant Analysis'''&lt;br /&gt;
|| * Quadratic discriminant analysis is a statistical method used for classification.&lt;br /&gt;
* QDA constructs a data-driven non-linear separator between two classes.&lt;br /&gt;
* The covariance matrices of different classes are not assumed to be equal. &lt;br /&gt;
* A quadratic function describes the decision boundary between each pair of classes.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Differences between LDA and QDA'''&lt;br /&gt;
|| Now let’s see the differences between LDA and QDA&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* '''LDA''' assumes that each class has the same covariance matrix.&lt;br /&gt;
* '''QDA''' relaxes the assumption of an equal covariance matrix for all the classes.&lt;br /&gt;
* '''LDA''' constructs a linear boundary, while '''QDA '''constructs a non-linear boundary.&lt;br /&gt;
* When the covariance matrices of different classes are the same, '''QDA '''reduces to '''LDA'''.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slides'''&lt;br /&gt;
&lt;br /&gt;
'''Assumptions for QDA'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''QDA '''is primarily used when data is multivariate Gaussian.&lt;br /&gt;
&lt;br /&gt;
'''QDA''' assumes that each class has its own covariance matrix.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| Now let us see the assumptions of QDA&lt;br /&gt;
&lt;br /&gt;
QDA is used when data is multivariate Gaussian and each class has its own covariance matrix.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide.'''&lt;br /&gt;
&lt;br /&gt;
'''Applications of QDA'''&lt;br /&gt;
&lt;br /&gt;
* Medical Diagnosis.&lt;br /&gt;
* Bio-Imaging classification.&lt;br /&gt;
* Fraud Detection.&lt;br /&gt;
&lt;br /&gt;
|| The QDA technique is used in several applications.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Implementation Of QDA'''&lt;br /&gt;
|| Let us implement '''QDA '''on the '''Raisin''' '''dataset '''with two chosen variables'''.'''&lt;br /&gt;
&lt;br /&gt;
For more information on the Raisin dataset, please see the Additional Reading material on this tutorial page.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide '''&lt;br /&gt;
&lt;br /&gt;
'''Download Files '''&lt;br /&gt;
|| We will use a script file '''QDA.R '''and '''Raisin Dataset ‘raisin.xlsx’'''&lt;br /&gt;
&lt;br /&gt;
Please download these files from the''' Code files''' link of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Make a copy and then use them while practicing.&lt;br /&gt;
|- &lt;br /&gt;
|| [Computer screen]&lt;br /&gt;
&lt;br /&gt;
Point to '''QDA.R''' and the folder '''QDA.'''&lt;br /&gt;
&lt;br /&gt;
Point to the''' MLProject folder '''on the '''Desktop.'''&lt;br /&gt;
&lt;br /&gt;
|| I have downloaded and moved these files to the '''QDA '''folder. &lt;br /&gt;
&lt;br /&gt;
This folder is located in the '''MLProject''' folder on my '''Desktop'''.&lt;br /&gt;
&lt;br /&gt;
I have also set the '''QDA''' folder as my working directory.&lt;br /&gt;
&lt;br /&gt;
In this tutorial, we will create a '''QDA''' classifier model on the '''raisin''' dataset. &lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us switch to '''RStudio'''. &lt;br /&gt;
|- &lt;br /&gt;
|| Click QDA.R in RStudio&lt;br /&gt;
&lt;br /&gt;
Point to QDA.R in RStudio.&lt;br /&gt;
|| Let us open the script '''QDA.R''' in '''RStudio'''.&lt;br /&gt;
&lt;br /&gt;
For this, click on the script '''QDA.R.'''&lt;br /&gt;
&lt;br /&gt;
Script '''QDA.R''' opens in '''RStudio'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' library(readxl)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' library(MASS)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
'''library(dplyr)'''&lt;br /&gt;
&lt;br /&gt;
'''&amp;lt;nowiki&amp;gt;#install.packages(“package_name”)&amp;lt;/nowiki&amp;gt;'''&lt;br /&gt;
&lt;br /&gt;
'''Point to the command.'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select and run these commands to import the packages.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''readxl''' package to load the excel file of our '''Raisin Dataset'''.&lt;br /&gt;
&lt;br /&gt;
The '''MASS''' package contains the '''qda()''' function to create our classifier.&lt;br /&gt;
&lt;br /&gt;
We will use the '''caret''' package to create the '''confusion matrix.'''&lt;br /&gt;
&lt;br /&gt;
The '''ggplot2''' package will be used to create the '''decision boundary plot.'''&lt;br /&gt;
&lt;br /&gt;
We will use the '''dplyr''' package to aid the visualisation of the confusion matrix.&lt;br /&gt;
&lt;br /&gt;
Please ensure that all the packages are installed correctly.&lt;br /&gt;
&lt;br /&gt;
As I have already installed the packages, I have imported them directly. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;- read_xlsx(&amp;quot;Raisin.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command''' data&amp;lt;- read_xlsx(&amp;quot;Raisin.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
|| Run this command to load the '''Raisin '''dataset.&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the Environment tab clearly.&lt;br /&gt;
&lt;br /&gt;
In the Environment tab below Data, you will see the '''data '''variable.&lt;br /&gt;
&lt;br /&gt;
Then click on '''data '''to load the dataset in the Source window. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [Rstudio]&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
|| Click on '''QDA.R''' in the Source window and close the tab.&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command.&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;-data[c(&amp;quot;minorAL&amp;quot;,&amp;quot;ecc&amp;quot;,&amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
Select the commands and click the Run button&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We now select three columns from data and convert the variable '''data$class '''to a factor. &lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|- &lt;br /&gt;
|| Click on the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click on '''data.'''&lt;br /&gt;
|| Click on '''data '''to load the modified data in the Source window.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the data.&lt;br /&gt;
|| Now let us split our data into training and testing data.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Click on '''QDA.R''' in the Source window.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| First we set a seed for reproducible results.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will create a vector of indices using '''sample() '''function.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It will contain 70% of the total number of rows for training; the remaining 30% are used for testing.&lt;br /&gt;
&lt;br /&gt;
The training data is chosen using simple random sampling without replacement.&lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| This creates training data, consisting of 630 unique rows.&lt;br /&gt;
&lt;br /&gt;
This creates testing data, consisting of 270 unique rows.&lt;br /&gt;
|-&lt;br /&gt;
|| Select the commands and click the Run button.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to the sets in the Environment Tab&lt;br /&gt;
  &lt;br /&gt;
Click the '''train_data '''and '''test_data '''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
The data sets are shown in the '''Environment '''tab.&lt;br /&gt;
&lt;br /&gt;
Click on '''train_data '''and '''test_data '''to load them in the Source window.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let’s perform '''QDA''' on the '''training''' dataset.&lt;br /&gt;
|- &lt;br /&gt;
|| [Rstudio]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''QDA_model &amp;lt;- qda(class~.,data=train_data)'''&lt;br /&gt;
|| Click on '''QDA.R''' in the Source window.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window&lt;br /&gt;
&lt;br /&gt;
type these commands&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''QDA_model &amp;lt;- qda(class~.,data=train_data)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''QDA_model '''&lt;br /&gt;
&lt;br /&gt;
Click Save and Click Run buttons. &lt;br /&gt;
|| We use this command to create '''QDA Model'''&lt;br /&gt;
&lt;br /&gt;
We pass two parameters to the '''qda()''' function.&lt;br /&gt;
# formula &lt;br /&gt;
# data on which the model should train.&lt;br /&gt;
&lt;br /&gt;
Click Save.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
The output is shown in the '''console '''window.&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Point the output in the '''console '''&lt;br /&gt;
&lt;br /&gt;
Highlight the output '''Prior probabilities of groups'''&lt;br /&gt;
&lt;br /&gt;
Highlight the output '''Group means'''&lt;br /&gt;
|| These are the parameters of our model.&lt;br /&gt;
&lt;br /&gt;
This indicates the composition of classes in the training data.&lt;br /&gt;
&lt;br /&gt;
These indicate the mean values of the predictor variables for each class.&lt;br /&gt;
|- &lt;br /&gt;
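The qda() call can also be sketched on R's built-in iris data (illustrative only; the tutorial uses the Raisin dataset):

```r
# qda() from the MASS package, sketched on the built-in iris data.
library(MASS)
model = qda(Species ~ ., data = iris)
pred  = predict(model, iris)   # list with $class and $posterior
train_acc = mean(pred$class == iris$Species)
```

As the narration notes, the prediction is a list: $class holds the predicted labels and $posterior the per-class posterior probabilities.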
|| Drag boundary to see the '''Source '''window.&lt;br /&gt;
|| Drag boundary to see the '''Source''' window.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us now use our model to make predictions on test data.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''predicted_values &amp;lt;- predict(QDA_model, test_data)'''&lt;br /&gt;
&lt;br /&gt;
'''predicted_values '''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''predicted_values &amp;lt;- predict(QDA_model, test_data)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''predicted_values '''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| Let’s use this command to predict the class variable from the test data using the trained QDA model.&lt;br /&gt;
&lt;br /&gt;
This predicts the class and posterior probability for the testing data.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Click on '''predicted_values '''in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Point the output in the '''console'''&lt;br /&gt;
&lt;br /&gt;
Highlight '''class'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight '''posterior'''&lt;br /&gt;
|| Click on '''predicted_values''' in the Environment tab&lt;br /&gt;
&lt;br /&gt;
This shows us that our predicted variable has two components.&lt;br /&gt;
&lt;br /&gt;
'''class''' contains the predicted '''classes '''of the testing data.&lt;br /&gt;
&lt;br /&gt;
'''posterior''' contains the '''posterior probability''' of an observation belonging to each class.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us compute the accuracy of our model.&lt;br /&gt;
|- &lt;br /&gt;
|| '''confusion &amp;lt;- confusionMatrix(test_data$class,predicted_values$class)'''&lt;br /&gt;
&lt;br /&gt;
|| Click on '''QDA.R''' in the source window.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command '''confusionMatrix(test_data$class,predicted_values$class)'''&lt;br /&gt;
&lt;br /&gt;
Point to the confusion in the Environment Tab&lt;br /&gt;
&lt;br /&gt;
Highlight the attribute&lt;br /&gt;
&lt;br /&gt;
'''table'''&lt;br /&gt;
|| This command creates a confusion matrix list.&lt;br /&gt;
&lt;br /&gt;
The list is created from the actual and predicted class labels of the testing data and is stored in the '''confusion''' variable.&lt;br /&gt;
&lt;br /&gt;
It helps to assess the classification model's performance and accuracy.&lt;br /&gt;
&lt;br /&gt;
Select and run the command. &lt;br /&gt;
&lt;br /&gt;
The confusion matrix list is shown in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click '''confusion '''to load it in the''' Source '''window.&lt;br /&gt;
&lt;br /&gt;
'''confusion '''list contains a component table containing the required confusion matrix.&lt;br /&gt;
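The confusion-matrix arithmetic described above can also be sketched in plain R without the '''caret''' package: '''table()''' cross-tabulates the labels, and the diagonal counts the correct predictions. The label vectors below are toy values made up for illustration, not output from the Raisin model.

```r
# Hand-built confusion matrix and accuracy (toy labels, for illustration only).
actual    = factor(c("Besni", "Besni", "Kecimen", "Kecimen", "Kecimen"))
predicted = factor(c("Besni", "Kecimen", "Kecimen", "Kecimen", "Besni"))

# Rows = predicted class, columns = actual class.
tab = table(Prediction = predicted, Actual = actual)
print(tab)

# Accuracy = correct predictions (diagonal) / all predictions.
accuracy = sum(diag(tab)) / sum(tab)
print(accuracy)   # 3 of the 5 toy labels match
```

caret's '''confusionMatrix()''' wraps this same cross-tabulation together with accuracy and related statistics.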
|- &lt;br /&gt;
|| '''plot_confusion_matrix &amp;lt;- function(confusion_matrix){'''&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
'''tab = as.data.frame(tab)'''&lt;br /&gt;
&lt;br /&gt;
'''tab$Prediction &amp;lt;- factor(tab$Prediction, levels = rev(levels(tab$Prediction)))'''&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- tab %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''rename(Actual = Reference) %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''mutate(cor = if_else(Actual == Prediction, 1,0))'''&lt;br /&gt;
&lt;br /&gt;
'''tab$cor &amp;lt;- as.factor(tab$cor)'''&lt;br /&gt;
&lt;br /&gt;
'''ggplot(tab, aes(Actual,Prediction)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_tile(aes(fill= cor),alpha = 0.4) + geom_text(aes(label=Freq)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;red&amp;quot;,&amp;quot;green&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_light() +'''&lt;br /&gt;
&lt;br /&gt;
'''theme(legend.position = &amp;quot;None&amp;quot;,'''&lt;br /&gt;
&lt;br /&gt;
'''line = element_blank()) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_x_discrete(position = &amp;quot;top&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
|| Now let’s plot the confusion matrix from the table.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
'''Highlight '''the command &lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Highlight '''the command&lt;br /&gt;
&lt;br /&gt;
'''tab = as.data.frame(tab)'''&lt;br /&gt;
&lt;br /&gt;
'''tab$Prediction &amp;lt;- factor(tab$Prediction, levels = rev(levels(tab$Prediction)))'''&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- tab %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''rename(Actual = Reference) %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''mutate(cor = if_else(Actual == Prediction, 1,0))'''&lt;br /&gt;
&lt;br /&gt;
'''tab$cor &amp;lt;- as.factor(tab$cor)'''&lt;br /&gt;
&lt;br /&gt;
'''Highlight '''the command&lt;br /&gt;
&lt;br /&gt;
'''ggplot(tab, aes(Actual,Prediction)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_tile(aes(fill= cor),alpha = 0.4) + geom_text(aes(label=Freq)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;red&amp;quot;,&amp;quot;green&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_light() +'''&lt;br /&gt;
&lt;br /&gt;
'''theme(legend.position = &amp;quot;None&amp;quot;,'''&lt;br /&gt;
&lt;br /&gt;
'''line = element_blank()) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_x_discrete(position = &amp;quot;top&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
|| These commands create a function '''plot_confusion_matrix '''to display the confusion matrix from the confusion matrix list created.&lt;br /&gt;
&lt;br /&gt;
It fetches the confusion matrix table from the list.&lt;br /&gt;
&lt;br /&gt;
It creates a data frame from the table which is suitable for plotting using '''GGPlot2'''.&lt;br /&gt;
&lt;br /&gt;
It plots the confusion matrix using the data frame created.&lt;br /&gt;
&lt;br /&gt;
It represents correct and incorrect predictions using different colors.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''plot_confusion_matrix(confusion)'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''plot_confusion_matrix(confusion)'''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We are using the created '''plot_confusion_matrix()''' function to generate the visual plot of the confusion matrix in the '''confusion''' variable.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
&lt;br /&gt;
The output is seen in the '''plot''' window.&lt;br /&gt;
|- &lt;br /&gt;
|| Point the output in the '''plot window'''&lt;br /&gt;
|| Drag the boundary to see the plot window clearly.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Observe that: &lt;br /&gt;
&lt;br /&gt;
22 samples of class Kecimen have been incorrectly classified.&lt;br /&gt;
&lt;br /&gt;
11 samples of class Besni have been incorrectly classified. &lt;br /&gt;
&lt;br /&gt;
Overall, the model has misclassified only '''33''' out of '''270 '''samples.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''grid &amp;lt;- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500),'''&lt;br /&gt;
&lt;br /&gt;
'''ecc = seq(min(data$ecc), max(data$ecc), length = 500)) '''&lt;br /&gt;
&lt;br /&gt;
'''grid$class = predict(QDA_model, newdata = grid)$class'''&lt;br /&gt;
&lt;br /&gt;
'''grid$classnum &amp;lt;- as.numeric(grid$class)'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''grid &amp;lt;- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500),'''&lt;br /&gt;
&lt;br /&gt;
'''ecc = seq(min(data$ecc), max(data$ecc), length = 500)) '''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''grid$class = predict(QDA_model, newdata = grid)$class'''&lt;br /&gt;
&lt;br /&gt;
'''grid$classnum &amp;lt;- as.numeric(grid$class)'''&lt;br /&gt;
|| This block of code first creates a '''grid '''of points spanning the range of '''minorAL '''and '''ecc '''features in the dataset.&lt;br /&gt;
&lt;br /&gt;
It stores it in a variable '''grid'''. &lt;br /&gt;
&lt;br /&gt;
Then, it uses the QDA model to predict the class of each point in this grid.&lt;br /&gt;
&lt;br /&gt;
It stores these predictions as a new column '''class''' in the '''grid''' dataframe. &lt;br /&gt;
&lt;br /&gt;
The '''as.numeric''' function encodes the predicted classes' string labels as numeric values.&lt;br /&gt;
&lt;br /&gt;
The resulting grid of points and their predicted classes will be used to visualize the decision boundaries of the QDA model.&lt;br /&gt;
&lt;br /&gt;
Select and run these commands.&lt;br /&gt;
&lt;br /&gt;
Click '''grid''' on the Environment tab to load the grid dataframe in the source window.&lt;br /&gt;
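The grid construction above can be tried in miniature: '''expand.grid()''' builds every combination of the supplied values. The numeric ranges below are arbitrary placeholders, not values taken from the Raisin dataset, and a 3 x 3 grid stands in for the tutorial's 500 x 500 one.

```r
# Miniature version of the prediction grid (3 x 3 instead of 500 x 500).
# The numeric ranges are placeholders, not real Raisin feature ranges.
grid = expand.grid(minorAL = seq(200, 400, length.out = 3),
                   ecc     = seq(0.4, 0.9, length.out = 3))

print(nrow(grid))   # 9 rows: every combination of the 3 x 3 values
print(head(grid))
```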
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),'''&lt;br /&gt;
&lt;br /&gt;
'''colour = &amp;quot;black&amp;quot;, linewidth = 0.7) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(x = &amp;quot;MinorAL&amp;quot;, y = &amp;quot;ecc&amp;quot;, title = &amp;quot;QDA Decision Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| Click on '''QDA.R''' in the Source window.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),'''&lt;br /&gt;
&lt;br /&gt;
'''colour = &amp;quot;black&amp;quot;, linewidth = 0.7) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(x = &amp;quot;MinorAL&amp;quot;, y = &amp;quot;ecc&amp;quot;, title = &amp;quot;QDA Decision Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
|| We are creating the decision boundary plot using '''ggplot2.''' &lt;br /&gt;
&lt;br /&gt;
It plots the grid points with colors indicating the predicted classes. &lt;br /&gt;
&lt;br /&gt;
'''geom_raster '''creates a colour map indicating the predicted classes of the grid points.&lt;br /&gt;
&lt;br /&gt;
'''geom_point '''plots the training data points in the plot.&lt;br /&gt;
&lt;br /&gt;
'''geom_contour''' creates the decision boundary of the QDA.&lt;br /&gt;
&lt;br /&gt;
The '''scale_fill_manual''' and '''scale_color_manual''' functions assign specific colors to the classes.&lt;br /&gt;
&lt;br /&gt;
The overall plot provides a visual representation of the decision boundary and the distribution of the training data points of the '''model'''.&lt;br /&gt;
&lt;br /&gt;
Select and run these commands.&lt;br /&gt;
&lt;br /&gt;
Drag boundaries to see the plot window clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| We can see that the decision boundary of our model is non-linear.&lt;br /&gt;
&lt;br /&gt;
And our model has separated most of the data points clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide.'''&lt;br /&gt;
&lt;br /&gt;
'''Limitations of QDA'''&lt;br /&gt;
&lt;br /&gt;
* Multicollinearity among predictors may lead to poor performance.&lt;br /&gt;
* The presence of outliers in data may also lead to poor performance. &lt;br /&gt;
&lt;br /&gt;
|| These are the limitations of QDA.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| With this, we come to the end of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Let us summarize.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Summary&lt;br /&gt;
|| In this tutorial, we have learned about:&lt;br /&gt;
* Quadratic Discriminant Analysis (QDA).&lt;br /&gt;
* Comparison between '''QDA '''and''' LDA'''.&lt;br /&gt;
* Assumptions for QDA.&lt;br /&gt;
* Applications of QDA&lt;br /&gt;
* Implementation of QDA using the '''Raisin''' dataset.&lt;br /&gt;
* Visualization of the '''QDA '''separator&lt;br /&gt;
* Limitations of QDA&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Here is an assignment for you.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Assignment&lt;br /&gt;
|| &lt;br /&gt;
* Apply '''QDA''' on the '''wine''' dataset.&lt;br /&gt;
* Measure the accuracy of the model.&lt;br /&gt;
&lt;br /&gt;
This dataset can be found in the '''HDclassif '''package. &lt;br /&gt;
&lt;br /&gt;
Install the package and import the dataset using the '''data()''' command.&lt;br /&gt;
|- &lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
About the Spoken Tutorial Project&lt;br /&gt;
|| The video at the following link summarizes the Spoken Tutorial project. &lt;br /&gt;
&lt;br /&gt;
Please download and watch it.&lt;br /&gt;
|- &lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Workshops&lt;br /&gt;
|| We conduct workshops using Spoken Tutorials and give certificates.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please contact us.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Forum to answer questions&lt;br /&gt;
&lt;br /&gt;
Do you have questions in THIS Spoken Tutorial?&lt;br /&gt;
&lt;br /&gt;
Choose the minute and second where you have the question.&lt;br /&gt;
&lt;br /&gt;
Explain your question briefly.&lt;br /&gt;
&lt;br /&gt;
Someone from the FOSSEE team will answer them.&lt;br /&gt;
&lt;br /&gt;
Please visit this site.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| Please post your timed queries in this forum.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Forum to answer questions&lt;br /&gt;
|| Do you have any general/technical questions?&lt;br /&gt;
&lt;br /&gt;
Please visit the forum given in the link.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Show Slide&lt;br /&gt;
&lt;br /&gt;
Textbook Companion&lt;br /&gt;
&lt;br /&gt;
|| The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.&lt;br /&gt;
&lt;br /&gt;
We give certificates to those who do this.&lt;br /&gt;
&lt;br /&gt;
For more details, please visit these sites.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Acknowledgment&lt;br /&gt;
|| The '''Spoken Tutorial''' project was established by the Ministry of Education, Government of India.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Thank You&lt;br /&gt;
|| This tutorial is contributed by Yate Asseke Ronald and Debatosh Chakraborty from IIT Bombay.&lt;br /&gt;
&lt;br /&gt;
Thank you for joining.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Ushav</name></author>	</entry>

	<entry>
		<id>https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Quadratic-Discriminant-Analysis-in-R/English</id>
		<title>Machine-Learning-using-R/C2/Quadratic-Discriminant-Analysis-in-R/English</title>
		<link rel="alternate" type="text/html" href="https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Quadratic-Discriminant-Analysis-in-R/English"/>
				<updated>2024-05-31T05:30:05Z</updated>
		
		<summary type="html">&lt;p&gt;Ushav: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title of the script''': Quadratic Discriminant Analysis in R&lt;br /&gt;
&lt;br /&gt;
'''Author''': Yate Asseke Ronald Olivera and Debatosh Chakraborty&lt;br /&gt;
&lt;br /&gt;
'''Keywords''': R, RStudio, machine learning, supervised, unsupervised, QDA, quadratic discriminant analysis, video tutorial.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=1&lt;br /&gt;
|- &lt;br /&gt;
|| '''Visual Cue'''&lt;br /&gt;
|| '''Narration'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Opening Slide'''&lt;br /&gt;
|| Welcome to this spoken tutorial on''' Quadratic Discriminant Analysis in R'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Learning Objectives'''&lt;br /&gt;
&lt;br /&gt;
|| In this tutorial, we will learn about: &lt;br /&gt;
* Quadratic Discriminant Analysis (QDA).&lt;br /&gt;
* Comparison between '''QDA '''and''' LDA'''.&lt;br /&gt;
* Assumptions for QDA.&lt;br /&gt;
* Applications of QDA&lt;br /&gt;
* Implementation of QDA using the '''Raisin''' dataset.&lt;br /&gt;
* Visualization of the '''QDA '''separator&lt;br /&gt;
* Limitations of QDA&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''System Specifications'''&lt;br /&gt;
|| This tutorial is recorded using,&lt;br /&gt;
* '''Windows 11 '''&lt;br /&gt;
* '''R '''version''' 4.3.0'''&lt;br /&gt;
* '''RStudio''' version '''2023.06.1'''&lt;br /&gt;
&lt;br /&gt;
It is recommended to install '''R''' version '''4.2.0''' or higher.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Prerequisites '''&lt;br /&gt;
&lt;br /&gt;
'''https://spoken-tutorial.org'''&lt;br /&gt;
|| To follow this tutorial, the learner should know&lt;br /&gt;
* Basic programming in '''R'''.&lt;br /&gt;
* '''Basics of Machine Learning'''.&lt;br /&gt;
&lt;br /&gt;
If not, please access the relevant tutorials on this website.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Quadratic Discriminant Analysis'''&lt;br /&gt;
|| * Quadratic discriminant analysis is a statistical method used for classification.&lt;br /&gt;
* QDA constructs a data-driven non-linear separator between two classes.&lt;br /&gt;
* The covariance matrices of the different classes are not assumed to be equal. &lt;br /&gt;
* A quadratic function describes the decision boundary between each pair of classes.&lt;br /&gt;
&lt;br /&gt;
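As a concrete illustration of these points, a QDA classifier can be fit in a few lines with the '''qda()''' function from the '''MASS''' package. The sketch below uses the built-in '''iris''' data rather than this tutorial's Raisin dataset, so it runs without any downloaded files.

```r
# Minimal QDA sketch on built-in data (iris, not the Raisin dataset).
library(MASS)

# Fit QDA with two predictors; each class gets its own covariance matrix.
model = qda(Species ~ Sepal.Length + Petal.Length, data = iris)

# predict() returns a list: $class (predicted labels) and
# $posterior (per-class membership probabilities).
preds = predict(model, iris)

print(head(preds$class))
print(mean(preds$class == iris$Species))   # training accuracy
```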
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Differences between LDA and QDA'''&lt;br /&gt;
|| Now let’s see the differences between LDA and QDA&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* '''LDA''' assumes that each class has the same covariance matrix.&lt;br /&gt;
* '''QDA''' relaxes the assumption of an equal covariance matrix for all the classes.&lt;br /&gt;
* '''LDA''' constructs a linear boundary, while '''QDA '''constructs a non-linear boundary.&lt;br /&gt;
* When the covariance matrices of different classes are the same, '''QDA '''reduces to '''LDA'''.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slides'''&lt;br /&gt;
&lt;br /&gt;
'''Assumptions for QDA'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''QDA '''is primarily used when data is multivariate Gaussian.&lt;br /&gt;
&lt;br /&gt;
'''QDA''' assumes that each class has its own covariance matrix.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| Now let us see the assumptions of QDA.&lt;br /&gt;
&lt;br /&gt;
QDA is used when data is multivariate Gaussian and each class has its own covariance matrix.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide.'''&lt;br /&gt;
&lt;br /&gt;
'''Applications of QDA'''&lt;br /&gt;
&lt;br /&gt;
* Medical Diagnosis.&lt;br /&gt;
* Bio-Imaging classification.&lt;br /&gt;
* Fraud Detection.&lt;br /&gt;
&lt;br /&gt;
|| The QDA technique is used in several applications.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Implementation Of QDA'''&lt;br /&gt;
|| Let us implement '''QDA''' on the '''Raisin dataset''' with two chosen variables.&lt;br /&gt;
&lt;br /&gt;
For more information on the Raisin dataset, please see the Additional Reading material on this tutorial page.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide '''&lt;br /&gt;
&lt;br /&gt;
'''Download Files '''&lt;br /&gt;
|| We will use a script file '''QDA.R '''and '''Raisin Dataset ‘raisin.xlsx’'''&lt;br /&gt;
&lt;br /&gt;
Please download these files from the''' Code files''' link of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Make a copy and then use them while practicing.&lt;br /&gt;
|- &lt;br /&gt;
|| [Computer screen]&lt;br /&gt;
&lt;br /&gt;
point to '''QDA.R''' and the folder '''QDA.'''&lt;br /&gt;
&lt;br /&gt;
Point to the''' MLProject folder '''on the '''Desktop.'''&lt;br /&gt;
&lt;br /&gt;
|| I have downloaded and moved these files to the '''QDA '''folder. &lt;br /&gt;
&lt;br /&gt;
This folder is located in the '''MLProject''' folder on my '''Desktop'''.&lt;br /&gt;
&lt;br /&gt;
I have also set the '''QDA''' folder as my working directory.&lt;br /&gt;
&lt;br /&gt;
In this tutorial, we will create a '''QDA''' classifier model on the '''raisin''' dataset. &lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us switch to '''RStudio'''. &lt;br /&gt;
|- &lt;br /&gt;
|| Click QDA.R in RStudio&lt;br /&gt;
&lt;br /&gt;
Point to QDA.R in RStudio.&lt;br /&gt;
|| Let us open the script '''QDA.R''' in '''RStudio'''.&lt;br /&gt;
&lt;br /&gt;
For this, click on the script '''QDA.R.'''&lt;br /&gt;
&lt;br /&gt;
Script '''QDA.R''' opens in '''RStudio'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' library(readxl)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' library(MASS)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
'''library(dplyr)'''&lt;br /&gt;
&lt;br /&gt;
'''&amp;lt;nowiki&amp;gt;#install.packages(“package_name”)&amp;lt;/nowiki&amp;gt;'''&lt;br /&gt;
&lt;br /&gt;
'''Point to the command.'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select and run these commands to import the packages.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''readxl''' package to load the excel file of our '''Raisin Dataset'''.&lt;br /&gt;
&lt;br /&gt;
The '''MASS''' package contains the '''qda()''' function to create our classifier.&lt;br /&gt;
&lt;br /&gt;
We will use the '''caret''' package to create the '''confusion matrix.'''&lt;br /&gt;
&lt;br /&gt;
The '''ggplot2''' package will be used to create the '''decision boundary plot.'''&lt;br /&gt;
&lt;br /&gt;
We will use the '''dplyr''' package to aid the visualisation of the confusion matrix.&lt;br /&gt;
&lt;br /&gt;
Please ensure that all the packages are installed correctly.&lt;br /&gt;
&lt;br /&gt;
As I have already installed the packages, I have directly imported them. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;- read_xlsx(&amp;quot;Raisin.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command''' data&amp;lt;- read_xlsx(&amp;quot;Raisin.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
|| Run this command to load the '''Raisin '''dataset.&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the Environment tab clearly.&lt;br /&gt;
&lt;br /&gt;
In the Environment tab below Data, you will see the '''data '''variable.&lt;br /&gt;
&lt;br /&gt;
Then click on '''data '''to load the dataset in the Source window. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [Rstudio]&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
|| Click on '''QDA.R''' in the Source window and close the tab.&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command.&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;-data[c(&amp;quot;minorAL&amp;quot;,&amp;quot;ecc&amp;quot;,&amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
Select the commands and click the Run button&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We now select three columns from data and convert the variable '''data$class '''to a factor. &lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|- &lt;br /&gt;
|| Click on the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click on '''data.'''&lt;br /&gt;
|| Click on '''data '''to load the modified data in the Source window.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the data.&lt;br /&gt;
|| Now let us split our data into training and testing data.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Click on '''QDA.R''' in the Source window.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| First we set a seed for reproducible results.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will create a vector of indices using the '''sample()''' function.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It will contain 70% of the rows for training; the remaining 30% will be used for testing.&lt;br /&gt;
&lt;br /&gt;
The training data is chosen using simple random sampling without replacement.&lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
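The split described above can be run standalone. The sketch below uses the built-in '''iris''' data (150 rows) in place of the Raisin dataset; '''round()''' is added around the 70% size to guard against floating-point truncation of 0.7 times the row count.

```r
# 70/30 train/test split, as in the tutorial, on built-in iris data.
set.seed(1)                      # reproducible sampling
n = nrow(iris)                   # 150 rows

# round() avoids 0.7 * n landing just below an integer in floating point.
index_split = sample(1:n, size = round(0.7 * n), replace = FALSE)

train_data = iris[index_split, ]   # 105 sampled rows
test_data  = iris[-index_split, ]  # the remaining 45 rows

print(nrow(train_data))
print(nrow(test_data))
```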
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| This creates training data, consisting of 630 unique rows.&lt;br /&gt;
&lt;br /&gt;
This creates testing data, consisting of 270 unique rows.&lt;br /&gt;
|-&lt;br /&gt;
|| Select the commands and click the Run button.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to the sets in the Environment Tab&lt;br /&gt;
  &lt;br /&gt;
Click the '''train_data '''and '''test_data '''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
The data sets are shown in the '''Environment '''tab.&lt;br /&gt;
&lt;br /&gt;
Click on '''train_data '''and '''test_data '''to load them in the Source window.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let’s perform '''QDA''' on the '''training''' dataset.&lt;br /&gt;
|- &lt;br /&gt;
|| [Rstudio]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''QDA_model &amp;lt;- qda(class~.,data=train_data)'''&lt;br /&gt;
|| Click on '''QDA.R''' in the Source window.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window&lt;br /&gt;
&lt;br /&gt;
type these commands&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''QDA_model &amp;lt;- qda(class~.,data=train_data)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''QDA_model '''&lt;br /&gt;
&lt;br /&gt;
Click the '''Save''' and '''Run''' buttons. &lt;br /&gt;
|| We use this command to create '''QDA Model'''&lt;br /&gt;
&lt;br /&gt;
We pass two parameters to the '''qda()''' function.&lt;br /&gt;
# formula &lt;br /&gt;
# data on which the model should train.&lt;br /&gt;
&lt;br /&gt;
Click Save.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
The output is shown in the '''console '''window.&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Point the output in the '''console '''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''Prior probabilities of group'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''Group means'''&lt;br /&gt;
|| These are the parameters of our model.&lt;br /&gt;
&lt;br /&gt;
This indicates the composition of classes in the training data.&lt;br /&gt;
&lt;br /&gt;
These indicate the mean values of the predictor variables for each class.&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the '''Source '''window.&lt;br /&gt;
|| Drag boundary to see the '''Source''' window.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us now use our model to make predictions on test data.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''predicted_values &amp;lt;- predict(QDA_model, test_data)'''&lt;br /&gt;
&lt;br /&gt;
'''predicted_values '''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''predicted_values &amp;lt;- predict(QDA_model, test_data)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''predicted_values '''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| Let’s use this command to predict the class variable from the test data using the trained QDA model.&lt;br /&gt;
&lt;br /&gt;
This predicts the class and posterior probability for the testing data.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Click on '''predicted_values '''in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Point the output in the '''console'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''class'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''posterior'''&lt;br /&gt;
|| Click on '''predicted_values''' in the Environment tab&lt;br /&gt;
&lt;br /&gt;
This shows us that our predicted variable has two components.&lt;br /&gt;
&lt;br /&gt;
'''class''' contains the predicted '''classes '''of the testing data.&lt;br /&gt;
&lt;br /&gt;
'''posterior''' contains the '''posterior probability''' of an observation belonging to each class.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us compute the accuracy of our model.&lt;br /&gt;
|- &lt;br /&gt;
|| '''confusion &amp;lt;- confusionMatrix(test_data$class,predicted_values$class)'''&lt;br /&gt;
&lt;br /&gt;
|| Click on '''QDA.R''' in the source window.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command '''confusionMatrix(test_data$class,predicted_values$class)'''&lt;br /&gt;
&lt;br /&gt;
Point to the confusion in the Environment Tab&lt;br /&gt;
&lt;br /&gt;
Highlight the attribute&lt;br /&gt;
&lt;br /&gt;
'''table'''&lt;br /&gt;
|| This command creates a confusion matrix list.&lt;br /&gt;
&lt;br /&gt;
The list is created from the actual and predicted class labels of the testing data and is stored in the '''confusion''' variable.&lt;br /&gt;
&lt;br /&gt;
It helps to assess the classification model's performance and accuracy.&lt;br /&gt;
&lt;br /&gt;
Select and run the command. &lt;br /&gt;
&lt;br /&gt;
The confusion matrix list is shown in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click '''confusion '''to load it in the''' Source '''window.&lt;br /&gt;
&lt;br /&gt;
'''confusion '''list contains a component table containing the required confusion matrix.&lt;br /&gt;
|- &lt;br /&gt;
|| '''plot_confusion_matrix &amp;lt;- function(confusion_matrix){'''&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
'''tab = as.data.frame(tab)'''&lt;br /&gt;
&lt;br /&gt;
'''tab$Prediction &amp;lt;- factor(tab$Prediction, levels = rev(levels(tab$Prediction)))'''&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- tab %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''rename(Actual = Reference) %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''mutate(cor = if_else(Actual == Prediction, 1,0))'''&lt;br /&gt;
&lt;br /&gt;
'''tab$cor &amp;lt;- as.factor(tab$cor)'''&lt;br /&gt;
&lt;br /&gt;
'''ggplot(tab, aes(Actual,Prediction)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_tile(aes(fill= cor),alpha = 0.4) + geom_text(aes(label=Freq)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;red&amp;quot;,&amp;quot;green&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_light() +'''&lt;br /&gt;
&lt;br /&gt;
'''theme(legend.position = &amp;quot;None&amp;quot;,'''&lt;br /&gt;
&lt;br /&gt;
'''line = element_blank()) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_x_discrete(position = &amp;quot;top&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
|| Now let’s plot the confusion matrix from the table.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
'''Highlight '''the command &lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Highlight '''the command&lt;br /&gt;
&lt;br /&gt;
'''tab = as.data.frame(tab)'''&lt;br /&gt;
&lt;br /&gt;
'''tab$Prediction &amp;lt;- factor(tab$Prediction, levels = rev(levels(tab$Prediction)))'''&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- tab %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''rename(Actual = Reference) %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''mutate(cor = if_else(Actual == Prediction, 1,0))'''&lt;br /&gt;
&lt;br /&gt;
'''tab$cor &amp;lt;- as.factor(tab$cor)'''&lt;br /&gt;
&lt;br /&gt;
'''Highlight '''the command&lt;br /&gt;
&lt;br /&gt;
'''ggplot(tab, aes(Actual,Prediction)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_tile(aes(fill= cor),alpha = 0.4) + geom_text(aes(label=Freq)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;red&amp;quot;,&amp;quot;green&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_light() +'''&lt;br /&gt;
&lt;br /&gt;
'''theme(legend.position = &amp;quot;None&amp;quot;,'''&lt;br /&gt;
&lt;br /&gt;
'''line = element_blank()) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_x_discrete(position = &amp;quot;top&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
|| These commands create a function '''plot_confusion_matrix '''to display the confusion matrix from the confusion matrix list created.&lt;br /&gt;
&lt;br /&gt;
It fetches the confusion matrix table from the list.&lt;br /&gt;
&lt;br /&gt;
It creates a data frame from the table which is suitable for plotting using '''GGPlot2'''.&lt;br /&gt;
&lt;br /&gt;
It plots the confusion matrix using the data frame created.&lt;br /&gt;
&lt;br /&gt;
It represents correct and incorrect predictions using different colors.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
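The correct/incorrect flag used to colour the tiles can be sketched in a few lines of base R (a toy example with hypothetical labels, not part of the tutorial's code files):&lt;br /&gt;

```r
# Toy sketch of the cor flag used to colour the confusion-matrix tiles:
# 1 marks a correct prediction, 0 an incorrect one (hypothetical labels)
Actual = c("Besni", "Besni", "Kecimen", "Kecimen")
Prediction = c("Besni", "Kecimen", "Besni", "Kecimen")
cor_flag = ifelse(Actual == Prediction, 1, 0)
sum(cor_flag)  # 2 of the 4 toy predictions are correct
```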
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''plot_confusion_matrix(confusion)'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''plot_confusion_matrix(confusion)'''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We are using the created '''plot_confusion_matrix()''' function to generate the visual plot of the confusion matrix stored in the '''confusion''' variable.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
&lt;br /&gt;
The output is seen in the '''plot''' window.&lt;br /&gt;
|- &lt;br /&gt;
|| Point the output in the '''plot window'''&lt;br /&gt;
|| Drag boundary to see the plot window clearly &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Observe that: &lt;br /&gt;
&lt;br /&gt;
22 samples of class Kecimen have been incorrectly classified.&lt;br /&gt;
&lt;br /&gt;
11 samples of class Besni have been incorrectly classified. &lt;br /&gt;
&lt;br /&gt;
Overall, the model has misclassified only '''33''' out of '''270 '''samples.&lt;br /&gt;
&lt;br /&gt;
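As a quick sanity check, the accuracy implied by these counts can be computed directly (a minimal sketch using only the numbers quoted above):&lt;br /&gt;

```r
# Accuracy implied by the narration: 33 misclassified out of 270 test samples
total = 270
misclassified = 33
accuracy = (total - misclassified) / total
round(accuracy, 4)  # 0.8778, i.e. roughly 87.8% accuracy
```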
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''grid &amp;lt;- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500),'''&lt;br /&gt;
&lt;br /&gt;
'''ecc = seq(min(data$ecc), max(data$ecc), length = 500)) '''&lt;br /&gt;
&lt;br /&gt;
'''grid$class = predict(QDA_model, newdata = grid)$class'''&lt;br /&gt;
&lt;br /&gt;
'''grid$classnum &amp;lt;- as.numeric(grid$class)'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''grid &amp;lt;- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500),'''&lt;br /&gt;
&lt;br /&gt;
'''ecc = seq(min(data$ecc), max(data$ecc), length = 500)) '''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''grid$class = predict(QDA_model, newdata = grid)$class'''&lt;br /&gt;
&lt;br /&gt;
'''grid$classnum &amp;lt;- as.numeric(grid$class)'''&lt;br /&gt;
&lt;br /&gt;
|| This block of code first creates a '''grid '''of points spanning the range of '''minorAL '''and '''ecc '''features in the dataset.&lt;br /&gt;
&lt;br /&gt;
It stores it in a variable ''''grid''''. &lt;br /&gt;
&lt;br /&gt;
Then, it uses the QDA model to predict the class of each point in this grid.&lt;br /&gt;
&lt;br /&gt;
It stores these predictions as a new column ''''class' '''in the '''grid '''dataframe. &lt;br /&gt;
&lt;br /&gt;
The '''as.numeric''' function converts the predicted classes' string labels into numeric codes.&lt;br /&gt;
&lt;br /&gt;
The resulting grid of points and their predicted classes will be used to visualize the decision boundaries of the QDA model.&lt;br /&gt;
&lt;br /&gt;
Select and run these commands.&lt;br /&gt;
&lt;br /&gt;
Click '''grid''' on the Environment tab to load the grid dataframe in the source window.&lt;br /&gt;
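The behaviour of '''expand.grid''' can be seen on a toy range (a sketch with made-up numbers, not the Raisin data):&lt;br /&gt;

```r
# expand.grid returns every combination of its input vectors,
# which is how the 500 x 500 prediction grid is built
g = expand.grid(minorAL = seq(1, 2, length = 3),
                ecc = seq(0.5, 0.9, length = 3))
nrow(g)  # 9: each of the 3 minorAL values paired with each of the 3 ecc values
```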
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),'''&lt;br /&gt;
&lt;br /&gt;
'''colour = &amp;quot;black&amp;quot;, linewidth = 0.7) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(x = &amp;quot;MinorAL&amp;quot;, y = &amp;quot;ecc&amp;quot;, title = &amp;quot;QDA Decision Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| Click on '''QDA.R''' in the Source window.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),'''&lt;br /&gt;
&lt;br /&gt;
'''colour = &amp;quot;black&amp;quot;, linewidth = 0.7) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(x = &amp;quot;MinorAL&amp;quot;, y = &amp;quot;ecc&amp;quot;, title = &amp;quot;QDA Decision Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
|| We are creating the decision boundary plot using '''ggplot2.''' &lt;br /&gt;
&lt;br /&gt;
It plots the grid points with colors indicating the predicted classes. &lt;br /&gt;
&lt;br /&gt;
'''geom_raster '''creates a colour map indicating the predicted classes of the grid points&lt;br /&gt;
&lt;br /&gt;
'''geom_point '''plots the training data points in the plot.&lt;br /&gt;
&lt;br /&gt;
'''geom_contour''' creates the decision boundary of the QDA.&lt;br /&gt;
&lt;br /&gt;
The '''scale_fill_manual''' function assigns specific colors to the classes, as does the '''scale_color_manual''' function.&lt;br /&gt;
&lt;br /&gt;
The overall plot provides a visual representation of the decision boundary.&lt;br /&gt;
&lt;br /&gt;
It also shows the distribution of the training data points used to fit the model.&lt;br /&gt;
&lt;br /&gt;
Select and run these commands.&lt;br /&gt;
&lt;br /&gt;
Drag boundaries to see the plot window clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| We can see that the decision boundary of our model is non-linear.&lt;br /&gt;
&lt;br /&gt;
And our model has separated most of the data points clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide.'''&lt;br /&gt;
&lt;br /&gt;
'''Limitations of QDA'''&lt;br /&gt;
&lt;br /&gt;
* Multicollinearity among predictors may lead to poor performance.&lt;br /&gt;
* The presence of outliers in data may also lead to poor performance. &lt;br /&gt;
&lt;br /&gt;
|| These are the limitations of QDA&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| With this, we come to the end of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Let us summarize.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Summary&lt;br /&gt;
|| In this tutorial, we have learned about:&lt;br /&gt;
* Quadratic Discriminant Analysis (QDA).&lt;br /&gt;
* Comparison between '''QDA '''and''' LDA'''.&lt;br /&gt;
* Assumptions for QDA.&lt;br /&gt;
* Applications of QDA&lt;br /&gt;
* Implementation Of QDA using''' Raisin''' Dataset'''.'''&lt;br /&gt;
* Visualization of the '''QDA '''separator&lt;br /&gt;
* Limitations of QDA&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Here is an assignment for you.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Assignment&lt;br /&gt;
|| &lt;br /&gt;
* Apply '''QDA''' on the '''wine''' dataset.&lt;br /&gt;
* Measure the accuracy of the model.&lt;br /&gt;
&lt;br /&gt;
This dataset can be found in the '''HDclassif '''package. &lt;br /&gt;
&lt;br /&gt;
Install the package and import the dataset using the '''data() '''command&lt;br /&gt;
|- &lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
About the Spoken Tutorial Project&lt;br /&gt;
|| The video at the following link summarizes the Spoken Tutorial project. &lt;br /&gt;
&lt;br /&gt;
Please download and watch it.&lt;br /&gt;
|- &lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Workshops&lt;br /&gt;
|| We conduct workshops using Spoken Tutorials and give certificates.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please contact us.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Forum to answer questions&lt;br /&gt;
&lt;br /&gt;
Do you have questions in THIS Spoken Tutorial?&lt;br /&gt;
&lt;br /&gt;
Choose the minute and second where you have the question.&lt;br /&gt;
&lt;br /&gt;
Explain your question briefly.&lt;br /&gt;
&lt;br /&gt;
Someone from the FOSSEE team will answer them.&lt;br /&gt;
&lt;br /&gt;
Please visit this site.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| Please post your timed queries in this forum.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Forum to answer questions&lt;br /&gt;
|| Do you have any general/technical questions?&lt;br /&gt;
&lt;br /&gt;
Please visit the forum given in the link.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Show Slide&lt;br /&gt;
&lt;br /&gt;
Textbook Companion&lt;br /&gt;
&lt;br /&gt;
|| The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.&lt;br /&gt;
&lt;br /&gt;
We give certificates to those who do this.&lt;br /&gt;
&lt;br /&gt;
For more details, please visit these sites.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Acknowledgment&lt;br /&gt;
|| The '''Spoken Tutorial''' project was established by the Ministry of Education, Government of India.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Thank You&lt;br /&gt;
|| This tutorial is contributed by Yate Asseke Ronald and Debatosh Chakraborty from IIT Bombay.&lt;br /&gt;
&lt;br /&gt;
Thank you for joining.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Ushav</name></author>	</entry>

	<entry>
		<id>https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Quadratic-Discriminant-Analysis-in-R/English</id>
		<title>Machine-Learning-using-R/C2/Quadratic-Discriminant-Analysis-in-R/English</title>
		<link rel="alternate" type="text/html" href="https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Quadratic-Discriminant-Analysis-in-R/English"/>
				<updated>2024-05-31T05:26:41Z</updated>
		
		<summary type="html">&lt;p&gt;Ushav: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title of the script''': Quadratic Discriminant Analysis in R&lt;br /&gt;
&lt;br /&gt;
'''Author''': Yate Asseke Ronald Olivera and Debatosh Chakraborty&lt;br /&gt;
&lt;br /&gt;
'''Keywords''': R, RStudio, machine learning, supervised, unsupervised, QDA, quadratic discriminant analysis, video tutorial.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=1&lt;br /&gt;
|- &lt;br /&gt;
|| '''Visual Cue'''&lt;br /&gt;
|| '''Narration'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Opening Slide'''&lt;br /&gt;
|| Welcome to this spoken tutorial on''' Quadratic Discriminant Analysis in R'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Learning Objectives'''&lt;br /&gt;
&lt;br /&gt;
|| In this tutorial, we will learn about: &lt;br /&gt;
* Quadratic Discriminant Analysis (QDA).&lt;br /&gt;
* Comparison between '''QDA '''and''' LDA'''.&lt;br /&gt;
* Assumptions for QDA.&lt;br /&gt;
* Applications of QDA&lt;br /&gt;
* Implementation of QDA using''' Raisin''' Dataset'''.'''&lt;br /&gt;
* Visualization of the '''QDA '''separator&lt;br /&gt;
* Limitations of QDA&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''System Specifications'''&lt;br /&gt;
|| This tutorial is recorded using,&lt;br /&gt;
* '''Windows 11 '''&lt;br /&gt;
* '''R '''version''' 4.3.0'''&lt;br /&gt;
* '''RStudio''' version '''2023.06.1'''&lt;br /&gt;
&lt;br /&gt;
It is recommended to install '''R''' version '''4.2.0''' or higher.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Prerequisites '''&lt;br /&gt;
&lt;br /&gt;
'''https://spoken-tutorial.org'''&lt;br /&gt;
|| To follow this tutorial, the learner should know&lt;br /&gt;
* Basic programming in '''R'''.&lt;br /&gt;
* '''Basics of Machine Learning'''.&lt;br /&gt;
&lt;br /&gt;
If not, please access the relevant tutorials on this website.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Quadratic Discriminant Analysis'''&lt;br /&gt;
|| * Quadratic discriminant analysis is a statistical method used for classification.&lt;br /&gt;
* QDA constructs a data-driven non-linear separator between two classes.&lt;br /&gt;
* The covariance matrix for different classes is not necessarily equal. &lt;br /&gt;
* A quadratic function describes the decision boundary between each pair of classes.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Differences between LDA and QDA'''&lt;br /&gt;
|| Now let’s see the differences between LDA and QDA&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* '''LDA''' assumes that each class has the same covariance matrix.&lt;br /&gt;
* '''QDA''' relaxes the assumption of an equal covariance matrix for all the classes.&lt;br /&gt;
* '''LDA''' constructs a linear boundary, while '''QDA '''constructs a non-linear boundary.&lt;br /&gt;
* When the covariance matrices of different classes are the same, '''QDA '''reduces to '''LDA'''.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slides'''&lt;br /&gt;
&lt;br /&gt;
'''Assumptions for QDA'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''QDA '''is primarily used when data is multivariate Gaussian.&lt;br /&gt;
&lt;br /&gt;
'''QDA''' assumes that each class has its own covariance matrix.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| Now let us see the assumption of QDA&lt;br /&gt;
&lt;br /&gt;
QDA is used when data is multivariate Gaussian and each class has its own covariance matrix.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide.'''&lt;br /&gt;
&lt;br /&gt;
'''Applications of QDA'''&lt;br /&gt;
&lt;br /&gt;
* Medical Diagnosis.&lt;br /&gt;
* Bio-Imaging classification.&lt;br /&gt;
* Fraud Detection.&lt;br /&gt;
&lt;br /&gt;
|| QDA technique is used in several applications.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Implementation Of QDA'''&lt;br /&gt;
|| Let us implement '''QDA '''on the '''Raisin''' '''dataset '''with two chosen variables'''.'''&lt;br /&gt;
&lt;br /&gt;
For more information on Raisin data please see the Additional Reading material on this tutorial page.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide '''&lt;br /&gt;
&lt;br /&gt;
'''Download Files '''&lt;br /&gt;
|| We will use a script file '''QDA.R '''and '''Raisin Dataset ‘raisin.xlsx’'''&lt;br /&gt;
&lt;br /&gt;
Please download these files from the''' Code files''' link of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Make a copy and then use them while practicing.&lt;br /&gt;
|- &lt;br /&gt;
|| [Computer screen]&lt;br /&gt;
&lt;br /&gt;
point to '''QDA.R''' and the folder '''QDA.'''&lt;br /&gt;
&lt;br /&gt;
Point to the''' MLProject folder '''on the '''Desktop.'''&lt;br /&gt;
&lt;br /&gt;
|| I have downloaded and moved these files to the '''QDA '''folder. &lt;br /&gt;
&lt;br /&gt;
This folder is located in the '''MLProject''' folder on my '''Desktop'''.&lt;br /&gt;
&lt;br /&gt;
I have also set the '''QDA''' folder as my working Directory.&lt;br /&gt;
&lt;br /&gt;
In this tutorial, we will create a '''QDA''' classifier model on the '''raisin''' dataset. &lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us switch to '''RStudio'''. &lt;br /&gt;
|- &lt;br /&gt;
|| Click QDA.R in RStudio&lt;br /&gt;
&lt;br /&gt;
Point to QDA.R in RStudio.&lt;br /&gt;
|| Let us open the script '''QDA.R''' in '''RStudio'''.&lt;br /&gt;
&lt;br /&gt;
For this, click on the script '''QDA.R.'''&lt;br /&gt;
&lt;br /&gt;
Script '''QDA.R''' opens in '''RStudio'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' library(readxl)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' library(MASS)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
'''library(dplyr)'''&lt;br /&gt;
&lt;br /&gt;
'''&amp;lt;nowiki&amp;gt;#install.packages(“package_name”)&amp;lt;/nowiki&amp;gt;'''&lt;br /&gt;
&lt;br /&gt;
'''Point to the command.'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select and run these commands to import the packages.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''readxl''' package to load the excel file of our '''Raisin Dataset'''.&lt;br /&gt;
&lt;br /&gt;
The '''MASS''' package contains the '''qda()''' function to create our classifier.&lt;br /&gt;
&lt;br /&gt;
We will use the '''caret''' package to create the '''confusion matrix.'''&lt;br /&gt;
&lt;br /&gt;
The '''ggplot2''' package will be used to create the '''decision boundary plot.'''&lt;br /&gt;
&lt;br /&gt;
We will use the '''dplyr''' package to aid the visualisation of the confusion matrix.&lt;br /&gt;
&lt;br /&gt;
Please ensure that all the packages are installed correctly.&lt;br /&gt;
&lt;br /&gt;
As I have already installed the packages, I have directly imported them.&lt;br /&gt;
&lt;br /&gt;
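A defensive way to check that all five packages are installed before sourcing the script (a sketch; '''requireNamespace''' is base R, so this runs even when a package is absent):&lt;br /&gt;

```r
# List the packages this tutorial imports and collect any that are missing
needed = c("readxl", "MASS", "caret", "ggplot2", "dplyr")
missing = needed[!vapply(needed, requireNamespace, logical(1), quietly = TRUE)]
# install.packages(missing) would then install only the absent ones
length(needed)  # 5 packages are used in this tutorial
```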
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;- read_xlsx(&amp;quot;Raisin.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command''' data&amp;lt;- read_xlsx(&amp;quot;Raisin.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
|| Run this command to load the '''Raisin '''dataset.&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the Environment tab clearly.&lt;br /&gt;
&lt;br /&gt;
In the Environment tab below Data, you will see the '''data '''variable.&lt;br /&gt;
&lt;br /&gt;
Then click on '''data '''to load the dataset in the Source window. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [Rstudio]&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
|| Click on '''QDA.R''' in the Source window and close the tab.&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command.&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;-data[c(&amp;quot;minorAL&amp;quot;,&amp;quot;ecc&amp;quot;,&amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
Select the commands and click the Run button&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We now select three columns from data and convert the variable '''data$class '''to a factor. &lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|- &lt;br /&gt;
|| Click on the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click on '''data.'''&lt;br /&gt;
|| Click on '''data '''to load the modified data in the Source window.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the data.&lt;br /&gt;
|| Now let us split our data into training and testing data.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Click on '''QDA.R''' in the Source window.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| First we set a seed for reproducible results.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will create a vector of indices using '''sample() '''function.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It will contain 70% of the rows for training; the remaining 30% will be used for testing.&lt;br /&gt;
&lt;br /&gt;
The training data is chosen using simple random sampling without replacement.&lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
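The index-based split can be sketched without loading the dataset itself; the Raisin data has 900 rows, so with seed 1 the split is reproducible:&lt;br /&gt;

```r
# Reproducible 70/30 split of 900 row indices, as described above
set.seed(1)
n = 900
index_split = sample(1:n, size = 0.7 * n, replace = FALSE)
length(index_split)      # 630 indices for training
n - length(index_split)  # 270 rows left for testing
```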
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| This creates training data, consisting of 630 unique rows.&lt;br /&gt;
&lt;br /&gt;
This creates testing data, consisting of 270 unique rows.&lt;br /&gt;
|-&lt;br /&gt;
|| Select the commands and click the Run button.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to the sets in the Environment Tab&lt;br /&gt;
  &lt;br /&gt;
Click the '''train_data '''and '''test_data '''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
The data sets are shown in the '''Environment '''tab.&lt;br /&gt;
&lt;br /&gt;
Click on '''train_data '''and '''test_data '''to load them in the Source window.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let’s perform '''QDA''' on the '''training''' dataset.&lt;br /&gt;
|- &lt;br /&gt;
|| [Rstudio]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''QDA_model &amp;lt;- qda(class~.,data=train_data)'''&lt;br /&gt;
|| Click on '''QDA.R''' in the Source window.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window&lt;br /&gt;
&lt;br /&gt;
type these commands&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''QDA_model &amp;lt;- qda(class~.,data=train_data)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''QDA_model '''&lt;br /&gt;
&lt;br /&gt;
Click Save and Click Run buttons. &lt;br /&gt;
|| We use this command to create the '''QDA''' model.&lt;br /&gt;
&lt;br /&gt;
We pass two parameters to the '''qda()''' function.&lt;br /&gt;
# formula &lt;br /&gt;
# data on which the model should train.&lt;br /&gt;
&lt;br /&gt;
Click Save.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
The output is shown in the '''console '''window.&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
|| Point the output in the '''console '''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''Prior probabilities of group'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''Group means'''&lt;br /&gt;
|| These are the parameters of our model.&lt;br /&gt;
&lt;br /&gt;
This indicates the composition of classes in the training data.&lt;br /&gt;
&lt;br /&gt;
These indicate the mean values of the predictor variables for each class.&lt;br /&gt;
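The prior probabilities reported by '''qda()''' are simply the class proportions of the training data; a toy illustration with hypothetical labels:&lt;br /&gt;

```r
# Class proportions play the role of prior probabilities in qda()
cls = factor(c("Besni", "Besni", "Besni", "Kecimen"))
priors = prop.table(table(cls))
priors  # Besni 0.75, Kecimen 0.25
```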
|- &lt;br /&gt;
|| Drag boundary to see the '''Source '''window.&lt;br /&gt;
|| Drag boundary to see the '''Source''' window.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us now use our model to make predictions on test data.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''predicted_values &amp;lt;- predict(QDA_model, test_data)'''&lt;br /&gt;
&lt;br /&gt;
'''predicted_values '''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Click on '''QDA.R''' in the Source window.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''predicted_values &amp;lt;- predict(QDA_model, test_data)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''predicted_values '''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| Let’s use this command to predict the class variable from the test data using the trained QDA model.&lt;br /&gt;
&lt;br /&gt;
This predicts the class and posterior probability for the testing data.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Click on '''predicted_values '''in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Point the output in the '''console'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''class'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''posterior'''&lt;br /&gt;
|| Click on '''predicted_values''' in the Environment tab&lt;br /&gt;
&lt;br /&gt;
This shows us that our predicted variable has two components.&lt;br /&gt;
&lt;br /&gt;
'''class''' contains the predicted '''classes '''of the testing data.&lt;br /&gt;
&lt;br /&gt;
'''posterior''' contains the '''posterior probability''' of an observation belonging to each class.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us compute the accuracy of our model.&lt;br /&gt;
|- &lt;br /&gt;
|| '''confusion &amp;lt;- confusionMatrix(test_data$class,predicted_values$class)'''&lt;br /&gt;
&lt;br /&gt;
|| Click on '''QDA.R''' in the source window.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command '''confusionMatrix(test_data$class,predicted_values$class)'''&lt;br /&gt;
&lt;br /&gt;
Point to the confusion in the Environment Tab&lt;br /&gt;
&lt;br /&gt;
Highlight the attribute&lt;br /&gt;
&lt;br /&gt;
'''table'''&lt;br /&gt;
|| This command creates a confusion matrix list.&lt;br /&gt;
&lt;br /&gt;
The list is created from the actual and predicted class labels of testing data.&lt;br /&gt;
&lt;br /&gt;
And it is stored in the confusion variable.&lt;br /&gt;
&lt;br /&gt;
It helps to assess the classification model's performance and accuracy.&lt;br /&gt;
&lt;br /&gt;
Select and run the command. &lt;br /&gt;
&lt;br /&gt;
The confusion matrix list is shown in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click '''confusion '''to load it in the''' Source '''window.&lt;br /&gt;
&lt;br /&gt;
'''confusion '''list contains a component table containing the required confusion matrix.&lt;br /&gt;
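What the '''table''' component holds can be mimicked with base '''table()''' on toy labels (a sketch; '''confusionMatrix()''' from caret adds accuracy statistics on top of this cross-tabulation):&lt;br /&gt;

```r
# Cross-tabulation of predicted vs actual labels: the core of a confusion matrix
actual    = factor(c("Besni", "Besni", "Kecimen", "Kecimen"))
predicted = factor(c("Besni", "Kecimen", "Kecimen", "Kecimen"))
tab = table(Prediction = predicted, Reference = actual)
sum(diag(tab))  # 3 correct predictions lie on the diagonal
```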
|- &lt;br /&gt;
|| '''plot_confusion_matrix &amp;lt;- function(confusion_matrix){'''&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
'''tab = as.data.frame(tab)'''&lt;br /&gt;
&lt;br /&gt;
'''tab$Prediction &amp;lt;- factor(tab$Prediction, levels = rev(levels(tab$Prediction)))'''&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- tab %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''rename(Actual = Reference) %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''mutate(cor = if_else(Actual == Prediction, 1,0))'''&lt;br /&gt;
&lt;br /&gt;
'''tab$cor &amp;lt;- as.factor(tab$cor)'''&lt;br /&gt;
&lt;br /&gt;
'''ggplot(tab, aes(Actual,Prediction)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_tile(aes(fill= cor),alpha = 0.4) + geom_text(aes(label=Freq)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;red&amp;quot;,&amp;quot;green&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_light() +'''&lt;br /&gt;
&lt;br /&gt;
'''theme(legend.position = &amp;quot;None&amp;quot;,'''&lt;br /&gt;
&lt;br /&gt;
'''line = element_blank()) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_x_discrete(position = &amp;quot;top&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
|| Now let’s plot the confusion matrix from the table.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
'''Highlight '''the command&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
'''tab = as.data.frame(tab)'''&lt;br /&gt;
&lt;br /&gt;
'''tab$Prediction &amp;lt;- factor(tab$Prediction, levels = rev(levels(tab$Prediction)))'''&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- tab %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''rename(Actual = Reference) %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''mutate(cor = if_else(Actual == Prediction, 1,0))'''&lt;br /&gt;
&lt;br /&gt;
'''tab$cor &amp;lt;- as.factor(tab$cor)'''&lt;br /&gt;
&lt;br /&gt;
'''Highlight '''the command&lt;br /&gt;
&lt;br /&gt;
'''ggplot(tab, aes(Actual,Prediction)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_tile(aes(fill= cor),alpha = 0.4) + geom_text(aes(label=Freq)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;red&amp;quot;,&amp;quot;green&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_light() +'''&lt;br /&gt;
&lt;br /&gt;
'''theme(legend.position = &amp;quot;None&amp;quot;,'''&lt;br /&gt;
&lt;br /&gt;
'''line = element_blank()) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_x_discrete(position = &amp;quot;top&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
|| These commands create a function '''plot_confusion_matrix '''to display the confusion matrix from the confusion matrix list created.&lt;br /&gt;
&lt;br /&gt;
It fetches the confusion matrix table from the list.&lt;br /&gt;
&lt;br /&gt;
It creates a data frame from the table which is suitable for plotting using '''GGPlot2'''.&lt;br /&gt;
&lt;br /&gt;
It plots the confusion matrix using the data frame created.&lt;br /&gt;
&lt;br /&gt;
It represents correct and incorrect predictions using different colors.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''plot_confusion_matrix(confusion)'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''plot_confusion_matrix(confusion)'''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We are using the created '''plot_confusion_matrix()''' function to generate the visual plot of the confusion matrix stored in the '''confusion''' variable.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
&lt;br /&gt;
The output is seen in the '''plot''' window.&lt;br /&gt;
|- &lt;br /&gt;
|| Point the output in the '''plot window'''&lt;br /&gt;
|| Drag boundary to see the plot window clearly &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Observe that: &lt;br /&gt;
&lt;br /&gt;
22 samples of class Kecimen have been incorrectly classified.&lt;br /&gt;
&lt;br /&gt;
11 samples of class Besni have been incorrectly classified. &lt;br /&gt;
&lt;br /&gt;
Overall, the model has misclassified only '''33''' out of '''270 '''samples.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''grid &amp;lt;- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500),'''&lt;br /&gt;
&lt;br /&gt;
'''ecc = seq(min(data$ecc), max(data$ecc), length = 500)) '''&lt;br /&gt;
&lt;br /&gt;
'''grid$class = predict(QDA_model, newdata = grid)$class'''&lt;br /&gt;
&lt;br /&gt;
'''grid$classnum &amp;lt;- as.numeric(grid$class)'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''grid &amp;lt;- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500),'''&lt;br /&gt;
&lt;br /&gt;
'''ecc = seq(min(data$ecc), max(data$ecc), length = 500)) '''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''grid$class = predict(QDA_model, newdata = grid)$class'''&lt;br /&gt;
&lt;br /&gt;
'''grid$classnum &amp;lt;- as.numeric(grid$class)'''&lt;br /&gt;
&lt;br /&gt;
|| This block of code first creates a '''grid '''of points spanning the range of '''minorAL '''and '''ecc '''features in the dataset.&lt;br /&gt;
&lt;br /&gt;
It stores it in a variable ''''grid''''. &lt;br /&gt;
&lt;br /&gt;
Then, it uses the QDA model to predict the class of each point in this grid.&lt;br /&gt;
&lt;br /&gt;
It stores these predictions as a new column ''''class' '''in the '''grid '''dataframe. &lt;br /&gt;
&lt;br /&gt;
The '''as.numeric''' function encodes the predicted classes' string labels as numeric values.&lt;br /&gt;
&lt;br /&gt;
The resulting grid of points and their predicted classes will be used to visualize the decision boundaries of the QDA model.&lt;br /&gt;
&lt;br /&gt;
Select and run these commands.&lt;br /&gt;
&lt;br /&gt;
Click '''grid''' on the Environment tab to load the grid dataframe in the source window.&lt;br /&gt;
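The way '''expand.grid''' builds this lattice of points can be seen on a small scale. The sketch below uses a toy data frame with hypothetical '''minorAL''' and '''ecc''' values and 5-point sequences instead of 500, so it only illustrates the mechanism.

```r
# Toy stand-ins for the two predictors (hypothetical values)
data = data.frame(minorAL = c(200, 250, 300, 350, 400),
                  ecc     = c(0.60, 0.70, 0.75, 0.80, 0.90))

# Same construction as the tutorial, but with length = 5 instead of 500
grid = expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 5),
                   ecc     = seq(min(data$ecc),     max(data$ecc),     length = 5))

print(nrow(grid))   # 25: every combination of the 5 x 5 axis values
```

With length = 500 per axis, the same call produces 250,000 grid points.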
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),'''&lt;br /&gt;
&lt;br /&gt;
'''colour = &amp;quot;black&amp;quot;, linewidth = 0.7) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(x = &amp;quot;MinorAL&amp;quot;, y = &amp;quot;ecc&amp;quot;, title = &amp;quot;QDA Decision Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| Click on '''QDA.R''' in the Source window.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),'''&lt;br /&gt;
&lt;br /&gt;
'''colour = &amp;quot;black&amp;quot;, linewidth = 0.7) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(x = &amp;quot;MinorAL&amp;quot;, y = &amp;quot;ecc&amp;quot;, title = &amp;quot;QDA Decision Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| We are creating the decision boundary plot using '''ggplot2.''' &lt;br /&gt;
&lt;br /&gt;
It plots the grid points with colors indicating the predicted classes. &lt;br /&gt;
&lt;br /&gt;
'''geom_raster '''creates a colour map indicating the predicted classes of the grid points&lt;br /&gt;
&lt;br /&gt;
'''geom_point '''plots the training data points in the plot.&lt;br /&gt;
&lt;br /&gt;
'''geom_contour''' creates the decision boundary of the QDA.&lt;br /&gt;
&lt;br /&gt;
The '''scale_fill_manual''' and '''scale_color_manual''' functions assign specific colors to the classes.&lt;br /&gt;
&lt;br /&gt;
The overall plot provides a visual representation of the decision boundary&lt;br /&gt;
&lt;br /&gt;
and the distribution of the training data points.&lt;br /&gt;
&lt;br /&gt;
Select and run these commands.&lt;br /&gt;
&lt;br /&gt;
Drag boundaries to see the plot window clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| We can see that the decision boundary of our model is non-linear.&lt;br /&gt;
&lt;br /&gt;
And our model has separated most of the data points clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide.'''&lt;br /&gt;
&lt;br /&gt;
'''Limitations of QDA'''&lt;br /&gt;
&lt;br /&gt;
* Multicollinearity among predictors may lead to poor performance.&lt;br /&gt;
* The presence of outliers in data may also lead to poor performance. &lt;br /&gt;
&lt;br /&gt;
|| These are the limitations of QDA.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| With this, we come to the end of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Let us summarize.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Summary&lt;br /&gt;
|| In this tutorial, we have learned about:&lt;br /&gt;
* Quadratic Discriminant Analysis (QDA).&lt;br /&gt;
* Comparison between '''QDA '''and''' LDA'''.&lt;br /&gt;
* Assumptions for QDA.&lt;br /&gt;
* Applications of QDA&lt;br /&gt;
* Implementation of QDA using''' Raisin''' Dataset'''.'''&lt;br /&gt;
* Visualization of the '''QDA '''separator&lt;br /&gt;
* Limitations of QDA&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Here is an assignment for you.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Assignment&lt;br /&gt;
|| &lt;br /&gt;
* Apply '''QDA''' on the '''wine''' dataset.&lt;br /&gt;
* Measure the accuracy of the model.&lt;br /&gt;
&lt;br /&gt;
This dataset can be found in the '''HDclassif '''package. &lt;br /&gt;
&lt;br /&gt;
Install the package and import the dataset using the '''data()''' command.&lt;br /&gt;
|- &lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
About the Spoken Tutorial Project&lt;br /&gt;
|| The video at the following link summarizes the Spoken Tutorial project. &lt;br /&gt;
&lt;br /&gt;
Please download and watch it.&lt;br /&gt;
|- &lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Workshops&lt;br /&gt;
|| We conduct workshops using Spoken Tutorials and give certificates.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please contact us.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Forum to answer questions&lt;br /&gt;
&lt;br /&gt;
Do you have questions in THIS Spoken Tutorial?&lt;br /&gt;
&lt;br /&gt;
Choose the minute and second where you have the question.&lt;br /&gt;
&lt;br /&gt;
Explain your question briefly.&lt;br /&gt;
&lt;br /&gt;
Someone from the FOSSEE team will answer them.&lt;br /&gt;
&lt;br /&gt;
Please visit this site.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| Please post your timed queries in this forum.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Forum to answer questions&lt;br /&gt;
|| Do you have any general/technical questions?&lt;br /&gt;
&lt;br /&gt;
Please visit the forum given in the link.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Show Slide&lt;br /&gt;
&lt;br /&gt;
Textbook Companion&lt;br /&gt;
&lt;br /&gt;
|| The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.&lt;br /&gt;
&lt;br /&gt;
We give certificates to those who do this.&lt;br /&gt;
&lt;br /&gt;
For more details, please visit these sites.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Acknowledgment&lt;br /&gt;
|| The '''Spoken Tutorial''' project was established by the Ministry of Education, Government of India.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Thank You&lt;br /&gt;
|| This tutorial is contributed by Yate Asseke Ronald and Debatosh Chakraborty from IIT Bombay.&lt;br /&gt;
&lt;br /&gt;
Thank you for joining.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Ushav</name></author>	</entry>

	<entry>
		<id>https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Quadratic-Discriminant-Analysis-in-R/English</id>
		<title>Machine-Learning-using-R/C2/Quadratic-Discriminant-Analysis-in-R/English</title>
		<link rel="alternate" type="text/html" href="https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Quadratic-Discriminant-Analysis-in-R/English"/>
				<updated>2024-05-31T05:14:46Z</updated>
		
		<summary type="html">&lt;p&gt;Ushav: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title of the script''': Quadratic Discriminant Analysis in R&lt;br /&gt;
&lt;br /&gt;
'''Author''': Yate Asseke Ronald Olivera and Debatosh Chakraborty&lt;br /&gt;
&lt;br /&gt;
'''Keywords''': R, RStudio, machine learning, supervised, unsupervised, QDA, quadratic discriminant analysis, video tutorial.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=1&lt;br /&gt;
|- &lt;br /&gt;
|| '''Visual Cue'''&lt;br /&gt;
|| '''Narration'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Opening Slide'''&lt;br /&gt;
|| Welcome to this spoken tutorial on''' Quadratic Discriminant Analysis in R'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Learning Objectives'''&lt;br /&gt;
&lt;br /&gt;
|| In this tutorial, we will learn about: &lt;br /&gt;
* Quadratic Discriminant Analysis (QDA).&lt;br /&gt;
* Comparison between '''QDA '''and''' LDA'''.&lt;br /&gt;
* Assumptions for QDA.&lt;br /&gt;
* Applications of QDA&lt;br /&gt;
* Implementation of QDA using''' Raisin''' Dataset'''.'''&lt;br /&gt;
* Visualization of the '''QDA '''separator&lt;br /&gt;
* Limitations of QDA&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''System Specifications'''&lt;br /&gt;
|| This tutorial is recorded using,&lt;br /&gt;
* '''Windows 11 '''&lt;br /&gt;
* '''R '''version''' 4.3.0'''&lt;br /&gt;
* '''RStudio''' version '''2023.06.1'''&lt;br /&gt;
&lt;br /&gt;
It is recommended to install '''R''' version '''4.2.0''' or higher.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Prerequisites '''&lt;br /&gt;
&lt;br /&gt;
'''https://spoken-tutorial.org'''&lt;br /&gt;
|| To follow this tutorial, the learner should know&lt;br /&gt;
* Basic programming in '''R'''.&lt;br /&gt;
* '''Basics of Machine Learning'''.&lt;br /&gt;
&lt;br /&gt;
If not, please access the relevant tutorials on this website.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Quadratic Discriminant Analysis'''&lt;br /&gt;
|| * Quadratic discriminant analysis is a statistical method used for classification.&lt;br /&gt;
* QDA constructs a data-driven non-linear separator between two classes.&lt;br /&gt;
* The covariance matrices of the different classes are not assumed to be equal. &lt;br /&gt;
* A quadratic function describes the decision boundary between each pair of classes.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Differences between LDA and QDA'''&lt;br /&gt;
|| Now let’s see the differences between LDA and QDA&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* '''LDA''' assumes that each class has the same covariance matrix.&lt;br /&gt;
* '''QDA''' relaxes the assumption of an equal covariance matrix for all the classes.&lt;br /&gt;
* '''LDA''' constructs a linear boundary, while '''QDA '''constructs a non-linear boundary.&lt;br /&gt;
* When the covariance matrices of different classes are the same, '''QDA '''reduces to '''LDA'''.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slides'''&lt;br /&gt;
&lt;br /&gt;
'''Assumptions for QDA'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''QDA '''is primarily used when data is multivariate Gaussian.&lt;br /&gt;
&lt;br /&gt;
'''QDA''' assumes that each class has its own covariance matrix.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| Now let us see the assumptions of QDA.&lt;br /&gt;
&lt;br /&gt;
QDA is used when data is multivariate Gaussian and each class has its own covariance matrix.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide.'''&lt;br /&gt;
&lt;br /&gt;
'''Applications of QDA'''&lt;br /&gt;
&lt;br /&gt;
* Medical Diagnosis.&lt;br /&gt;
* Bio-Imaging classification.&lt;br /&gt;
* Fraud Detection.&lt;br /&gt;
&lt;br /&gt;
|| The QDA technique is used in several applications.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Implementation Of QDA'''&lt;br /&gt;
|| Let us implement '''QDA '''on the '''Raisin''' '''dataset '''with two chosen variables'''.'''&lt;br /&gt;
&lt;br /&gt;
For more information on Raisin data please see the Additional Reading material on this tutorial page.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide '''&lt;br /&gt;
&lt;br /&gt;
'''Download Files '''&lt;br /&gt;
|| We will use the script file '''QDA.R''' and the '''Raisin''' dataset '''raisin.xlsx'''.&lt;br /&gt;
&lt;br /&gt;
Please download these files from the''' Code files''' link of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Make a copy and then use them while practicing.&lt;br /&gt;
|- &lt;br /&gt;
|| [Computer screen]&lt;br /&gt;
&lt;br /&gt;
point to '''QDA.R''' and the folder '''QDA.'''&lt;br /&gt;
&lt;br /&gt;
Point to the''' MLProject folder '''on the '''Desktop.'''&lt;br /&gt;
&lt;br /&gt;
|| I have downloaded and moved these files to the '''QDA '''folder. &lt;br /&gt;
&lt;br /&gt;
This folder is located in the '''MLProject''' folder on my '''Desktop'''.&lt;br /&gt;
&lt;br /&gt;
I have also set the '''QDA''' folder as my working directory.&lt;br /&gt;
&lt;br /&gt;
In this tutorial, we will create a '''QDA''' classifier model on the '''raisin''' dataset. &lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us switch to '''RStudio'''. &lt;br /&gt;
|- &lt;br /&gt;
|| Click QDA.R in RStudio&lt;br /&gt;
&lt;br /&gt;
Point to QDA.R in RStudio.&lt;br /&gt;
|| Let us open the script '''QDA.R''' in '''RStudio'''.&lt;br /&gt;
&lt;br /&gt;
For this, click on the script '''QDA.R.'''&lt;br /&gt;
&lt;br /&gt;
Script '''QDA.R''' opens in '''RStudio'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' library(readxl)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' library(MASS)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
'''library(dplyr)'''&lt;br /&gt;
&lt;br /&gt;
'''&amp;lt;nowiki&amp;gt;# install.packages(&amp;quot;package_name&amp;quot;)&amp;lt;/nowiki&amp;gt;'''&lt;br /&gt;
&lt;br /&gt;
'''Point to the command.'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select and run these commands to import the packages.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''readxl''' package to load the excel file of our '''Raisin Dataset'''.&lt;br /&gt;
&lt;br /&gt;
The '''MASS''' package contains the '''qda()''' function to create our classifier.&lt;br /&gt;
&lt;br /&gt;
We will use the '''caret''' package to create the '''confusion matrix.'''&lt;br /&gt;
&lt;br /&gt;
The '''ggplot2''' package will be used to create the '''decision boundary plot.'''&lt;br /&gt;
&lt;br /&gt;
We will use the '''dplyr''' package to aid the visualisation of the confusion matrix.&lt;br /&gt;
&lt;br /&gt;
Please ensure that all the packages are installed correctly.&lt;br /&gt;
&lt;br /&gt;
As I have already installed the packages, I have directly imported them.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;- read_xlsx(&amp;quot;Raisin.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command''' data&amp;lt;- read_xlsx(&amp;quot;Raisin.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
|| Run this command to load the '''Raisin '''dataset.&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the Environment tab clearly.&lt;br /&gt;
&lt;br /&gt;
In the Environment tab below Data, you will see the '''data '''variable.&lt;br /&gt;
&lt;br /&gt;
Then click on '''data '''to load the dataset in the Source window. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [Rstudio]&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
|| Click on '''QDA.R''' in the Source window and close the tab.&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command.&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;-data[c(&amp;quot;minorAL&amp;quot;,&amp;quot;ecc&amp;quot;,&amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
Select the commands and click the Run button&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We now select three columns from data and convert the variable '''data$class '''to a factor. &lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|- &lt;br /&gt;
|| Click on the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click on '''data.'''&lt;br /&gt;
|| Click on '''data '''to load the modified data in the Source window.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the data.&lt;br /&gt;
|| Now let us split our data into training and testing data.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Click on '''QDA.R''' in the Source window.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| First we set a seed for reproducible results.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will create a vector of indices using '''sample() '''function.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It will contain 70% of the total number of rows for training; the remaining 30% will be used for testing.&lt;br /&gt;
&lt;br /&gt;
The training data is chosen using simple random sampling without replacement.&lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
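The same sampling step can be tried on a toy data frame. A minimal sketch, assuming a hypothetical 10-row stand-in for the dataset:

```r
set.seed(1)
toy = data.frame(id = 1:10)   # hypothetical stand-in for the Raisin data

# 70% of the row indices, drawn without replacement
index_split = sample(1:nrow(toy), size = 0.7 * nrow(toy), replace = FALSE)

# Indexing with index_split keeps those rows; negative indexing drops them
train = toy[index_split, , drop = FALSE]
test  = toy[-index_split, , drop = FALSE]

print(nrow(train))   # 7
print(nrow(test))    # 3
```

Because the indices are drawn without replacement, no row appears in both sets.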
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| This creates training data, consisting of 630 unique rows.&lt;br /&gt;
&lt;br /&gt;
This creates testing data, consisting of 270 unique rows.&lt;br /&gt;
|-&lt;br /&gt;
|| Select the commands and click the Run button.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to the sets in the Environment Tab&lt;br /&gt;
  &lt;br /&gt;
Click the '''train_data '''and '''test_data '''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
The data sets are shown in the '''Environment '''tab.&lt;br /&gt;
&lt;br /&gt;
Click on '''train_data '''and '''test_data '''to load them in the Source window.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let’s perform '''QDA''' on the '''training''' dataset.&lt;br /&gt;
|- &lt;br /&gt;
|| [Rstudio]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''QDA_model &amp;lt;- qda(class~.,data=train_data)'''&lt;br /&gt;
|| Click on '''QDA.R''' in the Source window.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window&lt;br /&gt;
&lt;br /&gt;
type these commands&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''QDA_model &amp;lt;- qda(class~.,data=train_data)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''QDA_model '''&lt;br /&gt;
&lt;br /&gt;
Click Save and Click Run buttons. &lt;br /&gt;
|| We use this command to create the '''QDA''' model.&lt;br /&gt;
&lt;br /&gt;
We pass two parameters to the '''qda()''' function.&lt;br /&gt;
# formula &lt;br /&gt;
# data on which the model should train.&lt;br /&gt;
&lt;br /&gt;
Click Save.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
The output is shown in the '''console '''window.&lt;br /&gt;
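The same two-argument call can be tried on a dataset that ships with R. This sketch fits a QDA model to the built-in '''iris''' data rather than the Raisin data, and only assumes the '''MASS''' package is installed:

```r
library(MASS)   # provides qda()

# Fit QDA with a formula and a data frame, as in the tutorial
toy_model = qda(Species ~ ., data = iris)

# predict() on a qda fit returns a list with components class and posterior
pred = predict(toy_model, iris)
print(names(pred))                        # "class" "posterior"
print(mean(pred$class == iris$Species))   # training accuracy, roughly 0.98
```

The printed model object similarly reports the prior probabilities of the groups and the group means.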
|- &lt;br /&gt;
|| Drag boundary to see the console window.&lt;br /&gt;
|| Drag boundary to see the '''console '''window. &lt;br /&gt;
|- &lt;br /&gt;
|| Point the output in the '''console '''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''Prior probabilities of group'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''Group means'''&lt;br /&gt;
|| These are the parameters of our model.&lt;br /&gt;
&lt;br /&gt;
This indicates the composition of classes in the training data.&lt;br /&gt;
&lt;br /&gt;
These indicate the mean values of the predictor variables for each class.&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the '''Source '''window.&lt;br /&gt;
|| Drag boundary to see the '''Source''' window.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us now use our model to make predictions on test data.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''predicted_values &amp;lt;- predict(QDA_model, test_data)'''&lt;br /&gt;
&lt;br /&gt;
'''predicted_values '''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Click on '''QDA.R''' in the Source window.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''predicted_values &amp;lt;- predict(QDA_model, test_data)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''predicted_values '''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| Let’s use this command to predict the class variable from the test data using the trained QDA model.&lt;br /&gt;
&lt;br /&gt;
This predicts the class and posterior probability for the testing data.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Click on '''predicted_values '''in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Point the output in the '''console'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''class'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''posterior'''&lt;br /&gt;
|| Click on '''predicted_values''' in the Environment tab&lt;br /&gt;
&lt;br /&gt;
This shows us that our predicted variable has two components.&lt;br /&gt;
&lt;br /&gt;
'''class''' contains the predicted '''classes '''of the testing data.&lt;br /&gt;
&lt;br /&gt;
'''posterior''' contains the '''posterior probability''' of an observation belonging to each class.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us compute the accuracy of our model.&lt;br /&gt;
|- &lt;br /&gt;
|| '''confusion &amp;lt;- confusionMatrix(test_data$class,predicted_values$class)'''&lt;br /&gt;
&lt;br /&gt;
|| Click on '''QDA.R''' in the source window.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command '''confusionMatrix(test_data$class,predicted_values$class)'''&lt;br /&gt;
&lt;br /&gt;
Point to the confusion in the Environment Tab&lt;br /&gt;
&lt;br /&gt;
Highlight the attribute&lt;br /&gt;
&lt;br /&gt;
'''table'''&lt;br /&gt;
|| This command creates a confusion matrix list.&lt;br /&gt;
&lt;br /&gt;
The list is created from the actual and predicted class labels of testing data.&lt;br /&gt;
&lt;br /&gt;
And it is stored in the confusion variable.&lt;br /&gt;
&lt;br /&gt;
It helps to assess the classification model's performance and accuracy.&lt;br /&gt;
&lt;br /&gt;
Select and run the command. &lt;br /&gt;
&lt;br /&gt;
The confusion matrix list is shown in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click '''confusion '''to load it in the''' Source '''window.&lt;br /&gt;
&lt;br /&gt;
The '''confusion''' list contains a component '''table''' holding the required confusion matrix.&lt;br /&gt;
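What '''confusionMatrix()''' tabulates can be reproduced in miniature with base R's '''table()'''. A sketch with hypothetical actual and predicted labels; it does not require the '''caret''' package:

```r
# Hypothetical labels for five test samples
actual    = factor(c("Besni", "Besni", "Kecimen", "Kecimen", "Kecimen"))
predicted = factor(c("Besni", "Kecimen", "Kecimen", "Kecimen", "Besni"))

# Rows: actual classes; columns: predicted classes
tab = table(Actual = actual, Predicted = predicted)
print(tab)

# Accuracy is the share of counts on the diagonal
accuracy = sum(diag(tab)) / sum(tab)
print(accuracy)   # 0.6
```

Here two of the five toy samples fall off the diagonal, giving 60% accuracy.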
|- &lt;br /&gt;
|| '''plot_confusion_matrix &amp;lt;- function(confusion_matrix){'''&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
'''tab = as.data.frame(tab)'''&lt;br /&gt;
&lt;br /&gt;
'''tab$Prediction &amp;lt;- factor(tab$Prediction, levels = rev(levels(tab$Prediction)))'''&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- tab %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''rename(Actual = Reference) %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''mutate(cor = if_else(Actual == Prediction, 1,0))'''&lt;br /&gt;
&lt;br /&gt;
'''tab$cor &amp;lt;- as.factor(tab$cor)'''&lt;br /&gt;
&lt;br /&gt;
'''ggplot(tab, aes(Actual,Prediction)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_tile(aes(fill= cor),alpha = 0.4) + geom_text(aes(label=Freq)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;red&amp;quot;,&amp;quot;green&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_light() +'''&lt;br /&gt;
&lt;br /&gt;
'''theme(legend.position = &amp;quot;None&amp;quot;,'''&lt;br /&gt;
&lt;br /&gt;
'''line = element_blank()) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_x_discrete(position = &amp;quot;top&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
|| Now let’s plot the confusion matrix from the table.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
'''Highlight '''the command &lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Highlight '''the command&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
'''tab = as.data.frame(tab)'''&lt;br /&gt;
&lt;br /&gt;
'''tab$Prediction &amp;lt;- factor(tab$Prediction, levels = rev(levels(tab$Prediction)))'''&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- tab %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''rename(Actual = Reference) %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''mutate(cor = if_else(Actual == Prediction, 1,0))'''&lt;br /&gt;
&lt;br /&gt;
'''tab$cor &amp;lt;- as.factor(tab$cor)'''&lt;br /&gt;
&lt;br /&gt;
'''Highlight '''the command&lt;br /&gt;
&lt;br /&gt;
'''ggplot(tab, aes(Actual,Prediction)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_tile(aes(fill= cor),alpha = 0.4) + geom_text(aes(label=Freq)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;red&amp;quot;,&amp;quot;green&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_light() +'''&lt;br /&gt;
&lt;br /&gt;
'''theme(legend.position = &amp;quot;None&amp;quot;,'''&lt;br /&gt;
&lt;br /&gt;
'''line = element_blank()) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_x_discrete(position = &amp;quot;top&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
|| These commands create a function '''plot_confusion_matrix '''to display the confusion matrix from the confusion matrix list created.&lt;br /&gt;
&lt;br /&gt;
It fetches the confusion matrix table from the list.&lt;br /&gt;
&lt;br /&gt;
It creates a data frame from the table which is suitable for plotting using '''ggplot2'''.&lt;br /&gt;
&lt;br /&gt;
It plots the confusion matrix using the data frame created.&lt;br /&gt;
&lt;br /&gt;
It represents correct and incorrect predictions using different colors.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''plot_confusion_matrix(confusion)'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''plot_confusion_matrix(confusion)'''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We are using the created '''plot_confusion_matrix()''' function to generate the visual plot of the confusion matrix stored in the '''confusion''' variable.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
&lt;br /&gt;
The output is seen in the '''plot''' window.&lt;br /&gt;
|- &lt;br /&gt;
|| Point the output in the '''plot window'''&lt;br /&gt;
|| Drag boundary to see the plot window clearly.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Observe that: &lt;br /&gt;
&lt;br /&gt;
22 samples of class Kecimen have been incorrectly classified.&lt;br /&gt;
&lt;br /&gt;
11 samples of class Besni have been incorrectly classified. &lt;br /&gt;
&lt;br /&gt;
Overall, the model has misclassified only '''33''' out of '''270 '''samples.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''grid &amp;lt;- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500),'''&lt;br /&gt;
&lt;br /&gt;
'''ecc = seq(min(data$ecc), max(data$ecc), length = 500)) '''&lt;br /&gt;
&lt;br /&gt;
'''grid$class = predict(QDA_model, newdata = grid)$class'''&lt;br /&gt;
&lt;br /&gt;
'''grid$classnum &amp;lt;- as.numeric(grid$class)'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''grid &amp;lt;- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500),'''&lt;br /&gt;
&lt;br /&gt;
'''ecc = seq(min(data$ecc), max(data$ecc), length = 500)) '''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''grid$class = predict(QDA_model, newdata = grid)$class'''&lt;br /&gt;
&lt;br /&gt;
'''grid$classnum &amp;lt;- as.numeric(grid$class)'''&lt;br /&gt;
&lt;br /&gt;
|| This block of code first creates a '''grid '''of points spanning the range of '''minorAL '''and '''ecc '''features in the dataset.&lt;br /&gt;
&lt;br /&gt;
It stores it in a variable ''''grid''''. &lt;br /&gt;
&lt;br /&gt;
Then, it uses the QDA model to predict the class of each point in this grid.&lt;br /&gt;
&lt;br /&gt;
It stores these predictions as a new column ''''class' '''in the '''grid '''dataframe. &lt;br /&gt;
&lt;br /&gt;
The '''as.numeric''' function encodes the predicted classes' string labels as numeric values.&lt;br /&gt;
&lt;br /&gt;
The resulting grid of points and their predicted classes will be used to visualize the decision boundaries of the QDA model.&lt;br /&gt;
&lt;br /&gt;
Select and run these commands.&lt;br /&gt;
&lt;br /&gt;
Click '''grid''' on the Environment tab to load the grid dataframe in the source window.&lt;br /&gt;
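The '''as.numeric(factor(...))''' encoding step can be seen in isolation. A sketch with hypothetical class labels:

```r
# Hypothetical predicted labels for a few grid points
class_labels = factor(c("Besni", "Kecimen", "Kecimen", "Besni"))

# Factor levels are sorted alphabetically, so Besni = 1 and Kecimen = 2
classnum = as.numeric(class_labels)
print(classnum)   # 1 2 2 1
```

'''geom_contour''' can then draw the decision boundary where this numeric coding changes value across the grid.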
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),'''&lt;br /&gt;
&lt;br /&gt;
'''colour = &amp;quot;black&amp;quot;, linewidth = 0.7) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(x = &amp;quot;MinorAL&amp;quot;, y = &amp;quot;ecc&amp;quot;, title = &amp;quot;QDA Decision Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| Click on '''QDA.R''' in the Source window.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),'''&lt;br /&gt;
&lt;br /&gt;
'''colour = &amp;quot;black&amp;quot;, linewidth = 0.7) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(x = &amp;quot;MinorAL&amp;quot;, y = &amp;quot;ecc&amp;quot;, title = &amp;quot;QDA Decision Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| We are creating the decision boundary plot using '''ggplot2.''' &lt;br /&gt;
&lt;br /&gt;
It plots the grid points with colors indicating the predicted classes. &lt;br /&gt;
&lt;br /&gt;
'''geom_raster '''creates a colour map indicating the predicted classes of the grid points.&lt;br /&gt;
&lt;br /&gt;
'''geom_point '''plots the training data points in the plot.&lt;br /&gt;
&lt;br /&gt;
'''geom_contour''' creates the decision boundary of the QDA.&lt;br /&gt;
&lt;br /&gt;
The '''scale_fill_manual''' function assigns specific colors to the classes, as does the '''scale_color_manual''' function.&lt;br /&gt;
&lt;br /&gt;
The overall plot provides a visual representation of the decision boundary and the distribution of the training data points.&lt;br /&gt;
&lt;br /&gt;
Select and run these commands.&lt;br /&gt;
&lt;br /&gt;
Drag boundaries to see the plot window clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| We can see that the decision boundary of our model is non-linear.&lt;br /&gt;
&lt;br /&gt;
And our model has separated most of the data points clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide.'''&lt;br /&gt;
&lt;br /&gt;
'''Limitations of QDA'''&lt;br /&gt;
&lt;br /&gt;
* Multicollinearity among predictors may lead to poor performance.&lt;br /&gt;
* The presence of outliers in data may also lead to poor performance. &lt;br /&gt;
&lt;br /&gt;
|| These are the limitations of QDA.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| With this, we come to the end of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Let us summarize.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Summary&lt;br /&gt;
|| In this tutorial, we have learned about:&lt;br /&gt;
* Quadratic Discriminant Analysis (QDA).&lt;br /&gt;
* Comparison between '''QDA '''and''' LDA'''.&lt;br /&gt;
* Assumptions for QDA.&lt;br /&gt;
* Applications of QDA&lt;br /&gt;
* Implementation of QDA using the '''Raisin''' dataset.&lt;br /&gt;
* Visualization of the '''QDA '''separator&lt;br /&gt;
* Limitations of QDA&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Here is an assignment for you.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Assignment&lt;br /&gt;
|| &lt;br /&gt;
* Apply '''QDA''' on the '''wine''' dataset.&lt;br /&gt;
* Measure the accuracy of the model.&lt;br /&gt;
&lt;br /&gt;
This dataset can be found in the '''HDclassif '''package. &lt;br /&gt;
&lt;br /&gt;
Install the package and import the dataset using the '''data()''' command.&lt;br /&gt;
|- &lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
About the Spoken Tutorial Project&lt;br /&gt;
|| The video at the following link summarizes the Spoken Tutorial project. &lt;br /&gt;
&lt;br /&gt;
Please download and watch it.&lt;br /&gt;
|- &lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Workshops&lt;br /&gt;
|| We conduct workshops using Spoken Tutorials and give certificates.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please contact us.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Forum to answer questions&lt;br /&gt;
&lt;br /&gt;
Do you have questions in THIS Spoken Tutorial?&lt;br /&gt;
&lt;br /&gt;
Choose the minute and second where you have the question.&lt;br /&gt;
&lt;br /&gt;
Explain your question briefly.&lt;br /&gt;
&lt;br /&gt;
Someone from the FOSSEE team will answer them.&lt;br /&gt;
&lt;br /&gt;
Please visit this site.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| Please post your timed queries in this forum.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Forum to answer questions&lt;br /&gt;
|| Do you have any general/technical questions?&lt;br /&gt;
&lt;br /&gt;
Please visit the forum given in the link.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Show Slide&lt;br /&gt;
&lt;br /&gt;
Textbook Companion&lt;br /&gt;
&lt;br /&gt;
|| The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.&lt;br /&gt;
&lt;br /&gt;
We give certificates to those who do this.&lt;br /&gt;
&lt;br /&gt;
For more details, please visit these sites.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Acknowledgment&lt;br /&gt;
|| The '''Spoken Tutorial''' project was established by the Ministry of Education, Government of India.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Thank You&lt;br /&gt;
|| This tutorial is contributed by Yate Asseke Ronald and Debatosh Chakraborty from IIT Bombay.&lt;br /&gt;
&lt;br /&gt;
Thank you for joining.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Ushav</name></author>	</entry>

	<entry>
		<id>https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Quadratic-Discriminant-Analysis-in-R/English</id>
		<title>Machine-Learning-using-R/C2/Quadratic-Discriminant-Analysis-in-R/English</title>
		<link rel="alternate" type="text/html" href="https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Quadratic-Discriminant-Analysis-in-R/English"/>
				<updated>2024-05-30T12:23:53Z</updated>
		
		<summary type="html">&lt;p&gt;Ushav: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title of the script''': Quadratic Discriminant Analysis in R&lt;br /&gt;
&lt;br /&gt;
'''Author''': Yate Asseke Ronald Olivera and Debatosh Chakraborty&lt;br /&gt;
&lt;br /&gt;
'''Keywords''': R, RStudio, machine learning, supervised, unsupervised, QDA, quadratic discriminant analysis, video tutorial.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=1&lt;br /&gt;
|- &lt;br /&gt;
|| '''Visual Cue'''&lt;br /&gt;
|| '''Narration'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Opening Slide'''&lt;br /&gt;
|| Welcome to this Spoken Tutorial on '''Quadratic Discriminant Analysis in R'''.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Learning Objectives'''&lt;br /&gt;
&lt;br /&gt;
|| In this tutorial, we will learn about: &lt;br /&gt;
* Quadratic Discriminant Analysis (QDA).&lt;br /&gt;
* Comparison between '''QDA '''and''' LDA'''.&lt;br /&gt;
* Assumptions for QDA.&lt;br /&gt;
* Applications of QDA&lt;br /&gt;
* Implementation of QDA using the '''Raisin''' dataset.&lt;br /&gt;
* Visualization of the '''QDA '''separator&lt;br /&gt;
* Limitations of QDA&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''System Specifications'''&lt;br /&gt;
|| This tutorial is recorded using,&lt;br /&gt;
* '''Windows 11 '''&lt;br /&gt;
* '''R '''version''' 4.3.0'''&lt;br /&gt;
* '''RStudio''' version '''2023.06.1'''&lt;br /&gt;
&lt;br /&gt;
It is recommended to install '''R''' version '''4.2.0''' or higher.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Prerequisites '''&lt;br /&gt;
&lt;br /&gt;
'''https://spoken-tutorial.org'''&lt;br /&gt;
|| To follow this tutorial, the learner should know&lt;br /&gt;
* Basic programming in '''R'''.&lt;br /&gt;
* '''Basics of Machine Learning'''.&lt;br /&gt;
&lt;br /&gt;
If not, please access the relevant tutorials on this website.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Quadratic Discriminant Analysis'''&lt;br /&gt;
||
* Quadratic discriminant analysis is a statistical method used for classification.&lt;br /&gt;
* QDA constructs a data-driven non-linear separator between two classes.&lt;br /&gt;
* The covariance matrix for different classes is not necessarily equal. &lt;br /&gt;
* A quadratic function describes the decision boundary between each pair of classes.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Differences between LDA and QDA'''&lt;br /&gt;
|| Now let’s see the differences between LDA and QDA&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* '''LDA''' assumes that each class has the same covariance matrix.&lt;br /&gt;
* '''QDA''' relaxes the assumption of an equal covariance matrix for all the classes.&lt;br /&gt;
* '''LDA''' constructs a linear boundary, while '''QDA '''constructs a non-linear boundary.&lt;br /&gt;
* When the covariance matrices of different classes are the same, '''QDA '''reduces to '''LDA'''.&lt;br /&gt;
&lt;br /&gt;
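As an illustrative aside (not part of the original script), both classifiers can be fit with nearly identical calls from the '''MASS''' package; the sketch below uses R's built-in iris data, not the Raisin dataset:

```r
# Sketch: LDA vs QDA fit on R's built-in iris data (illustrative only).
library(MASS)

# Same formula interface; LDA pools a single covariance matrix across classes,
# while QDA estimates one covariance matrix per class, which is what bends
# its decision boundary into a quadratic curve.
lda_model <- lda(Species ~ Sepal.Length + Sepal.Width, data = iris)
qda_model <- qda(Species ~ Sepal.Length + Sepal.Width, data = iris)
```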
|- &lt;br /&gt;
|| '''Show Slides'''&lt;br /&gt;
&lt;br /&gt;
'''Assumptions for QDA'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''QDA '''is primarily used when data is multivariate Gaussian.&lt;br /&gt;
&lt;br /&gt;
'''QDA''' assumes that each class has its own covariance matrix.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| Now let us see the assumptions of QDA.&lt;br /&gt;
&lt;br /&gt;
QDA is used when data is multivariate Gaussian and each class has its own covariance matrix.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide.'''&lt;br /&gt;
&lt;br /&gt;
'''Applications of QDA'''&lt;br /&gt;
&lt;br /&gt;
* Medical Diagnosis.&lt;br /&gt;
* Bio-Imaging classification.&lt;br /&gt;
* Fraud Detection.&lt;br /&gt;
&lt;br /&gt;
|| The QDA technique is used in several applications.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Implementation Of QDA'''&lt;br /&gt;
|| Let us implement '''QDA '''on the '''Raisin''' '''dataset '''with two chosen variables'''.'''&lt;br /&gt;
&lt;br /&gt;
For more information on Raisin data please see the Additional Reading material on this tutorial page.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide '''&lt;br /&gt;
&lt;br /&gt;
'''Download Files '''&lt;br /&gt;
|| We will use the script file '''QDA.R''' and the '''Raisin''' dataset '''raisin.xlsx'''.&lt;br /&gt;
&lt;br /&gt;
Please download these files from the''' Code files''' link of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Make a copy and then use them while practicing.&lt;br /&gt;
|- &lt;br /&gt;
|| [Computer screen]&lt;br /&gt;
&lt;br /&gt;
Point to '''QDA.R''' and the folder '''QDA.'''&lt;br /&gt;
&lt;br /&gt;
Point to the''' MLProject folder '''on the '''Desktop.'''&lt;br /&gt;
&lt;br /&gt;
|| I have downloaded and moved these files to the '''QDA '''folder. &lt;br /&gt;
&lt;br /&gt;
This folder is located in the '''MLProject''' folder on my '''Desktop'''.&lt;br /&gt;
&lt;br /&gt;
I have also set the '''QDA''' folder as my working directory.&lt;br /&gt;
&lt;br /&gt;
In this tutorial, we will create a '''QDA''' classifier model on the '''raisin''' dataset. &lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us switch to '''RStudio'''. &lt;br /&gt;
|- &lt;br /&gt;
|| Click QDA.R in RStudio&lt;br /&gt;
&lt;br /&gt;
Point to QDA.R in RStudio.&lt;br /&gt;
|| Let us open the script '''QDA.R''' in '''RStudio'''.&lt;br /&gt;
&lt;br /&gt;
For this, click on the script '''QDA.R.'''&lt;br /&gt;
&lt;br /&gt;
Script '''QDA.R''' opens in '''RStudio'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' library(readxl)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' library(MASS)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
'''library(dplyr)'''&lt;br /&gt;
&lt;br /&gt;
'''&amp;lt;nowiki&amp;gt;#install.packages(&amp;quot;package_name&amp;quot;)&amp;lt;/nowiki&amp;gt;'''&lt;br /&gt;
&lt;br /&gt;
'''Point to the command.'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select and run these commands to import the packages.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''readxl''' package to load the Excel file of our '''Raisin Dataset'''.&lt;br /&gt;
&lt;br /&gt;
The '''MASS''' package contains the '''qda()''' function to create our classifier.&lt;br /&gt;
&lt;br /&gt;
We will use the '''caret''' package to create the '''confusion matrix.'''&lt;br /&gt;
&lt;br /&gt;
The '''ggplot2''' package will be used to create the '''decision boundary plot.'''&lt;br /&gt;
&lt;br /&gt;
We will use the '''dplyr''' package to aid the visualisation of the confusion matrix.&lt;br /&gt;
&lt;br /&gt;
Please ensure that all the packages are installed correctly.&lt;br /&gt;
&lt;br /&gt;
As I have already installed the packages, I have directly imported them.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;- read_xlsx(&amp;quot;Raisin.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command''' data&amp;lt;- read_xlsx(&amp;quot;Raisin.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
|| Run this command to load the '''Raisin '''dataset.&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the Environment tab clearly.&lt;br /&gt;
&lt;br /&gt;
In the Environment tab below Data, you will see the '''data '''variable.&lt;br /&gt;
&lt;br /&gt;
Then click on '''data '''to load the dataset in the Source window. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [Rstudio]&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
|| Click on '''QDA.R''' in the Source window and close the tab.&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command.&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;-data[c(&amp;quot;minorAL&amp;quot;,&amp;quot;ecc&amp;quot;,&amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
Select the commands and click the Run button&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We now select three columns from data and convert the variable '''data$class '''to a factor. &lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
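The column selection and factor conversion can be sketched on a toy data frame (column names mirror the script; the values are made up):

```r
# Sketch: keep two predictors plus the label, then make the label a factor.
data <- data.frame(minorAL = c(250, 260), ecc = c(0.5, 0.7),
                   class = c("Kecimen", "Besni"), extra = c(1, 2))

data <- data[c("minorAL", "ecc", "class")]  # drop unused columns
data$class <- factor(data$class)            # labels as a factor for qda()
```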
|- &lt;br /&gt;
|| Click on the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click on '''data.'''&lt;br /&gt;
|| Click on '''data '''to load the modified data in the Source window.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the data.&lt;br /&gt;
|| Now let us split our data into training and testing data.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Click on '''QDA.R''' in the Source window.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
|| First we set a seed for reproducible results.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will create a vector of indices using '''sample() '''function.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It will contain 70% of the total rows for training; the remaining 30% will be used for testing.&lt;br /&gt;
&lt;br /&gt;
The training data is chosen using simple random sampling without replacement.&lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
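The same split can be sketched on a small simulated data frame (variable names mirror the script; the data here is made up):

```r
# Sketch: reproducible 70/30 split by sampling row indices without replacement.
set.seed(1)
data <- data.frame(x = rnorm(100), class = factor(rep(c("A", "B"), 50)))

index_split <- sample(1:nrow(data), size = 0.7 * nrow(data), replace = FALSE)
train_data  <- data[index_split, ]    # 70 sampled rows for training
test_data   <- data[-index_split, ]   # the remaining 30 rows for testing
```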
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| This creates training data, consisting of 630 unique rows.&lt;br /&gt;
&lt;br /&gt;
This creates testing data, consisting of 270 unique rows.&lt;br /&gt;
|-&lt;br /&gt;
|| Select the commands and click the Run button.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to the sets in the Environment Tab&lt;br /&gt;
  &lt;br /&gt;
Click the '''train_data '''and '''test_data '''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
The data sets are shown in the '''Environment '''tab.&lt;br /&gt;
&lt;br /&gt;
Click on '''train_data '''and '''test_data '''to load them in the Source window.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let’s perform '''QDA''' on the '''training''' dataset.&lt;br /&gt;
|- &lt;br /&gt;
|| [Rstudio]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''QDA_model &amp;lt;- qda(class~.,data=train_data)'''&lt;br /&gt;
|| Click on '''QDA.R''' in the Source window.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window&lt;br /&gt;
&lt;br /&gt;
type these commands&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''QDA_model &amp;lt;- qda(class~.,data=train_data)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''QDA_model '''&lt;br /&gt;
&lt;br /&gt;
Click Save and Click Run buttons. &lt;br /&gt;
|| We use this command to create the '''QDA''' model.&lt;br /&gt;
&lt;br /&gt;
We pass two parameters to the '''qda()''' function.&lt;br /&gt;
# formula &lt;br /&gt;
# data on which the model should train.&lt;br /&gt;
&lt;br /&gt;
Click Save.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
The output is shown in the '''console '''window.&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the console window.&lt;br /&gt;
|| Drag boundary to see the '''console '''window. &lt;br /&gt;
|- &lt;br /&gt;
|| Point the output in the '''console '''&lt;br /&gt;
&lt;br /&gt;
Highlight the output '''Prior probabilities of groups'''&lt;br /&gt;
&lt;br /&gt;
Highlight the output '''Group means'''&lt;br /&gt;
|| These are the parameters of our model.&lt;br /&gt;
&lt;br /&gt;
This indicates the composition of classes in the training data.&lt;br /&gt;
&lt;br /&gt;
These indicate the mean values of the predictor variables for each class.&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the '''Source '''window.&lt;br /&gt;
|| Drag boundary to see the '''Source''' window.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us now use our model to make predictions on test data.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''predicted_values &amp;lt;- predict(QDA_model, test_data)'''&lt;br /&gt;
&lt;br /&gt;
'''predicted_values '''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Click on '''QDA.R''' in the Source window.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''predicted_values &amp;lt;- predict(QDA_model, test_data)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''predicted_values '''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| Let’s use this command to predict the class variable from the test data using the trained QDA model.&lt;br /&gt;
&lt;br /&gt;
This predicts the class and posterior probability for the testing data.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Click on '''predicted_values '''in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Point the output in the '''console'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''class'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''posterior'''&lt;br /&gt;
|| Click on '''predicted_values''' in the Environment tab&lt;br /&gt;
&lt;br /&gt;
This shows us that our predicted variable has two components.&lt;br /&gt;
&lt;br /&gt;
'''class''' contains the predicted '''classes '''of the testing data.&lt;br /&gt;
&lt;br /&gt;
'''posterior''' contains the '''posterior probability''' of an observation belonging to each class.&lt;br /&gt;
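As a small self-contained sketch (using iris rather than the Raisin data), '''predict()''' on a QDA fit returns both components:

```r
# Sketch: predict() on a qda fit returns a list with $class and $posterior.
library(MASS)

fit  <- qda(Species ~ Sepal.Length + Petal.Length, data = iris)
pred <- predict(fit, iris)

head(pred$class)       # predicted class labels (a factor)
head(pred$posterior)   # per-class probabilities; each row sums to 1
```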
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us compute the accuracy of our model.&lt;br /&gt;
|- &lt;br /&gt;
|| '''confusion &amp;lt;- confusionMatrix(test_data$class,predicted_values$class)'''&lt;br /&gt;
&lt;br /&gt;
|| Click on '''QDA.R''' in the source window.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command '''confusionMatrix(test_data$class,predicted_values$class)'''&lt;br /&gt;
&lt;br /&gt;
Point to the confusion in the Environment Tab&lt;br /&gt;
&lt;br /&gt;
Highlight the attribute&lt;br /&gt;
&lt;br /&gt;
'''table'''&lt;br /&gt;
|| This command creates a confusion matrix list.&lt;br /&gt;
&lt;br /&gt;
The list is created from the actual and predicted class labels of testing data.&lt;br /&gt;
&lt;br /&gt;
And it is stored in the confusion variable.&lt;br /&gt;
&lt;br /&gt;
It helps to assess the classification model's performance and accuracy.&lt;br /&gt;
&lt;br /&gt;
Select and run the command. &lt;br /&gt;
&lt;br /&gt;
The confusion matrix list is shown in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click '''confusion '''to load it in the''' Source '''window.&lt;br /&gt;
&lt;br /&gt;
'''confusion '''list contains a component table containing the required confusion matrix.&lt;br /&gt;
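For intuition (a toy example, independent of the script), accuracy can also be read straight off a base-R confusion table as correct predictions over total:

```r
# Sketch: confusion matrix and accuracy with base R only (no caret needed).
actual    <- factor(c("Besni", "Besni", "Kecimen", "Kecimen"))
predicted <- factor(c("Besni", "Kecimen", "Kecimen", "Kecimen"))

tab <- table(Actual = actual, Predicted = predicted)
accuracy <- sum(diag(tab)) / sum(tab)   # diagonal = correct predictions
```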
|- &lt;br /&gt;
|| '''plot_confusion_matrix &amp;lt;- function(confusion_matrix){'''&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
'''tab = as.data.frame(tab)'''&lt;br /&gt;
&lt;br /&gt;
'''tab$Prediction &amp;lt;- factor(tab$Prediction, levels = rev(levels(tab$Prediction)))'''&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- tab %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''rename(Actual = Reference) %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''mutate(cor = if_else(Actual == Prediction, 1,0))'''&lt;br /&gt;
&lt;br /&gt;
'''tab$cor &amp;lt;- as.factor(tab$cor)'''&lt;br /&gt;
&lt;br /&gt;
'''ggplot(tab, aes(Actual,Prediction)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_tile(aes(fill= cor),alpha = 0.4) + geom_text(aes(label=Freq)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;red&amp;quot;,&amp;quot;green&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_light() +'''&lt;br /&gt;
&lt;br /&gt;
'''theme(legend.position = &amp;quot;none&amp;quot;,'''&lt;br /&gt;
&lt;br /&gt;
'''line = element_blank()) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_x_discrete(position = &amp;quot;top&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
|| Now let’s plot the confusion matrix from the table.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
'''Highlight '''the command &lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Highlight '''the command&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
'''tab = as.data.frame(tab)'''&lt;br /&gt;
&lt;br /&gt;
'''tab$Prediction &amp;lt;- factor(tab$Prediction, levels = rev(levels(tab$Prediction)))'''&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- tab %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''rename(Actual = Reference) %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''mutate(cor = if_else(Actual == Prediction, 1,0))'''&lt;br /&gt;
&lt;br /&gt;
'''tab$cor &amp;lt;- as.factor(tab$cor)'''&lt;br /&gt;
&lt;br /&gt;
'''Highlight '''the command&lt;br /&gt;
&lt;br /&gt;
'''ggplot(tab, aes(Actual,Prediction)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_tile(aes(fill= cor),alpha = 0.4) + geom_text(aes(label=Freq)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;red&amp;quot;,&amp;quot;green&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_light() +'''&lt;br /&gt;
&lt;br /&gt;
'''theme(legend.position = &amp;quot;none&amp;quot;,'''&lt;br /&gt;
&lt;br /&gt;
'''line = element_blank()) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_x_discrete(position = &amp;quot;top&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
|| These commands create a function '''plot_confusion_matrix '''to display the confusion matrix from the list created earlier.&lt;br /&gt;
&lt;br /&gt;
It fetches the confusion matrix table from the list.&lt;br /&gt;
&lt;br /&gt;
It creates a data frame from the table which is suitable for plotting using '''ggplot2'''.&lt;br /&gt;
&lt;br /&gt;
It plots the confusion matrix using the data frame created.&lt;br /&gt;
&lt;br /&gt;
It represents correct and incorrect predictions using different colors.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''plot_confusion_matrix(confusion)'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''plot_confusion_matrix(confusion)'''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
We are using the created '''plot_confusion_matrix()''' function to generate a visual plot of the confusion matrix stored in the '''confusion''' variable.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
&lt;br /&gt;
The output is seen in the '''plot''' window.&lt;br /&gt;
|- &lt;br /&gt;
|| Point the output in the '''plot window'''&lt;br /&gt;
|| Drag boundary to see the plot window clearly.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Observe that: &lt;br /&gt;
&lt;br /&gt;
22 samples of class Kecimen have been incorrectly classified.&lt;br /&gt;
&lt;br /&gt;
11 samples of class Besni have been incorrectly classified. &lt;br /&gt;
&lt;br /&gt;
Overall, the model has misclassified only '''33''' out of '''270 '''samples.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''grid &amp;lt;- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500),'''&lt;br /&gt;
&lt;br /&gt;
'''ecc = seq(min(data$ecc), max(data$ecc), length = 500)) '''&lt;br /&gt;
&lt;br /&gt;
'''grid$class = predict(QDA_model, newdata = grid)$class'''&lt;br /&gt;
&lt;br /&gt;
'''grid$classnum &amp;lt;- as.numeric(grid$class)'''&lt;br /&gt;
&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''grid &amp;lt;- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500),'''&lt;br /&gt;
&lt;br /&gt;
'''ecc = seq(min(data$ecc), max(data$ecc), length = 500)) '''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''grid$class = predict(QDA_model, newdata = grid)$class'''&lt;br /&gt;
&lt;br /&gt;
'''grid$classnum &amp;lt;- as.numeric(grid$class)'''&lt;br /&gt;
&lt;br /&gt;
|| This block of code first creates a '''grid '''of points spanning the range of '''minorAL '''and '''ecc '''features in the dataset.&lt;br /&gt;
&lt;br /&gt;
It stores it in a variable ''''grid''''. &lt;br /&gt;
&lt;br /&gt;
Then, it uses the QDA model to predict the class of each point in this grid.&lt;br /&gt;
&lt;br /&gt;
It stores these predictions as a new column ''''class' '''in the '''grid '''dataframe. &lt;br /&gt;
&lt;br /&gt;
The '''as.numeric''' function encodes the predicted classes' string labels as numeric values.&lt;br /&gt;
&lt;br /&gt;
The resulting grid of points and their predicted classes will be used to visualize the decision boundaries of the QDA model.&lt;br /&gt;
&lt;br /&gt;
Select and run these commands.&lt;br /&gt;
&lt;br /&gt;
Click '''grid''' on the Environment tab to load the grid dataframe in the Source window.&lt;br /&gt;
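The grid construction can be sketched on made-up feature ranges; '''expand.grid()''' returns every combination of the two sequences, i.e. a 500 x 500 lattice:

```r
# Sketch: a dense lattice of points over hypothetical minorAL and ecc ranges.
grid <- expand.grid(minorAL = seq(200, 300, length.out = 500),
                    ecc     = seq(0.4, 0.9, length.out = 500))
nrow(grid)   # one row per (minorAL, ecc) combination
```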
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),'''&lt;br /&gt;
&lt;br /&gt;
'''colour = &amp;quot;black&amp;quot;, linewidth = 0.7) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(x = &amp;quot;MinorAL&amp;quot;, y = &amp;quot;ecc&amp;quot;, title = &amp;quot;QDA Decision Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| Click on '''QDA.R''' in the Source window.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),'''&lt;br /&gt;
&lt;br /&gt;
'''colour = &amp;quot;black&amp;quot;, linewidth = 0.7) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(x = &amp;quot;MinorAL&amp;quot;, y = &amp;quot;ecc&amp;quot;, title = &amp;quot;QDA Decision Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| We are creating the decision boundary plot using '''ggplot2.''' &lt;br /&gt;
&lt;br /&gt;
It plots the grid points with colors indicating the predicted classes. &lt;br /&gt;
&lt;br /&gt;
'''geom_raster '''creates a colour map indicating the predicted classes of the grid points.&lt;br /&gt;
&lt;br /&gt;
'''geom_point '''plots the training data points in the plot.&lt;br /&gt;
&lt;br /&gt;
'''geom_contour''' creates the decision boundary of the QDA.&lt;br /&gt;
&lt;br /&gt;
The '''scale_fill_manual''' function assigns specific colors to the classes, as does the '''scale_color_manual''' function.&lt;br /&gt;
&lt;br /&gt;
The overall plot provides a visual representation of the decision boundary and the distribution of the training data points.&lt;br /&gt;
&lt;br /&gt;
Select and run these commands.&lt;br /&gt;
&lt;br /&gt;
Drag boundaries to see the plot window clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| We can see that the decision boundary of our model is non-linear.&lt;br /&gt;
&lt;br /&gt;
And our model has separated most of the data points clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide.'''&lt;br /&gt;
&lt;br /&gt;
'''Limitations of QDA'''&lt;br /&gt;
&lt;br /&gt;
* Multicollinearity among predictors may lead to poor performance.&lt;br /&gt;
* The presence of outliers in data may also lead to poor performance. &lt;br /&gt;
&lt;br /&gt;
|| These are the limitations of QDA.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| With this, we come to the end of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Let us summarize.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Summary&lt;br /&gt;
|| In this tutorial, we have learned about:&lt;br /&gt;
* Quadratic Discriminant Analysis (QDA).&lt;br /&gt;
* Comparison between '''QDA '''and''' LDA'''.&lt;br /&gt;
* Assumptions for QDA.&lt;br /&gt;
* Applications of QDA&lt;br /&gt;
* Implementation of QDA using the '''Raisin''' dataset.&lt;br /&gt;
* Visualization of the '''QDA '''separator&lt;br /&gt;
* Limitations of QDA&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Here is an assignment for you.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Assignment&lt;br /&gt;
|| &lt;br /&gt;
* Apply '''QDA''' on the '''wine''' dataset.&lt;br /&gt;
* Measure the accuracy of the model.&lt;br /&gt;
&lt;br /&gt;
This dataset can be found in the '''HDclassif '''package. &lt;br /&gt;
&lt;br /&gt;
Install the package and import the dataset using the '''data()''' command.&lt;br /&gt;
|- &lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
About the Spoken Tutorial Project&lt;br /&gt;
|| The video at the following link summarizes the Spoken Tutorial project. &lt;br /&gt;
&lt;br /&gt;
Please download and watch it.&lt;br /&gt;
|- &lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Workshops&lt;br /&gt;
|| We conduct workshops using Spoken Tutorials and give certificates.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please contact us.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Forum to answer questions&lt;br /&gt;
&lt;br /&gt;
Do you have questions in THIS Spoken Tutorial?&lt;br /&gt;
&lt;br /&gt;
Choose the minute and second where you have the question.&lt;br /&gt;
&lt;br /&gt;
Explain your question briefly.&lt;br /&gt;
&lt;br /&gt;
Someone from the FOSSEE team will answer them.&lt;br /&gt;
&lt;br /&gt;
Please visit this site.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| Please post your timed queries in this forum.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Forum to answer questions&lt;br /&gt;
|| Do you have any general/technical questions?&lt;br /&gt;
&lt;br /&gt;
Please visit the forum given in the link.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Show Slide&lt;br /&gt;
&lt;br /&gt;
Textbook Companion&lt;br /&gt;
&lt;br /&gt;
|| The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.&lt;br /&gt;
&lt;br /&gt;
We give certificates to those who do this.&lt;br /&gt;
&lt;br /&gt;
For more details, please visit these sites.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Acknowledgment&lt;br /&gt;
|| The '''Spoken Tutorial''' project was established by the Ministry of Education, Govt. of India.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Thank You&lt;br /&gt;
|| This tutorial is contributed by Yate Asseke Ronald and Debatosh Chakraborty from IIT Bombay.&lt;br /&gt;
&lt;br /&gt;
Thank you for joining.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Ushav</name></author>	</entry>

	<entry>
		<id>https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Linear-Discriminant-Analysis-in-R/English</id>
		<title>Machine-Learning-using-R/C2/Linear-Discriminant-Analysis-in-R/English</title>
		<link rel="alternate" type="text/html" href="https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Linear-Discriminant-Analysis-in-R/English"/>
				<updated>2024-05-30T05:22:41Z</updated>
		
		<summary type="html">&lt;p&gt;Ushav: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title of the script''': Linear Discriminant Analysis in R&lt;br /&gt;
&lt;br /&gt;
'''Author''': YATE ASSEKE RONALD OLIVERA and Debatosh Chakraborty&lt;br /&gt;
&lt;br /&gt;
'''Keywords''':  R, RStudio, machine learning, supervised, unsupervised, dimensionality reduction, confusion matrix, console, LDA, video tutorial.&lt;br /&gt;
&lt;br /&gt;
{| border=1&lt;br /&gt;
|- &lt;br /&gt;
|| '''Visual Cue'''&lt;br /&gt;
|| '''Narration'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Opening Slide'''&lt;br /&gt;
|| Welcome to this spoken tutorial on '''Linear Discriminant Analysis in R.'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Learning Objectives'''&lt;br /&gt;
&lt;br /&gt;
|| In this tutorial, we will learn about: &lt;br /&gt;
# Linear Discriminant Analysis ('''LDA''') and its implementation.&lt;br /&gt;
# Assumptions of LDA&lt;br /&gt;
# Limitations of LDA&lt;br /&gt;
# LDA on a subset of Raisin dataset&lt;br /&gt;
# Visualization of the '''LDA''' separator and its corresponding confusion matrix.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''System Specifications'''&lt;br /&gt;
|| This tutorial is recorded using,&lt;br /&gt;
* '''Windows 11 '''&lt;br /&gt;
* '''R '''version''' 4.3.0'''&lt;br /&gt;
* '''RStudio''' version '''2023.06.1'''&lt;br /&gt;
&lt;br /&gt;
It is recommended to install '''R''' version '''4.2.0''' or higher. &lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide.'''&lt;br /&gt;
&lt;br /&gt;
'''Prerequisites '''&lt;br /&gt;
&lt;br /&gt;
'''https://spoken-tutorial.org'''&lt;br /&gt;
|| To follow this tutorial, the learner should know:&lt;br /&gt;
&lt;br /&gt;
* Basics of '''R''' programming. &lt;br /&gt;
* Basics of '''Machine Learning '''using '''R'''. &lt;br /&gt;
&lt;br /&gt;
If not, please access the relevant tutorials on '''R '''on this website.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide.'''&lt;br /&gt;
&lt;br /&gt;
'''Linear Discriminant Analysis'''&lt;br /&gt;
|| Linear Discriminant Analysis is a statistical method.&lt;br /&gt;
* It is used for classification. &lt;br /&gt;
* It constructs a data-driven line that best separates different classes.&lt;br /&gt;
* It is based on maximizing the likelihood function to classify two or more classes.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide.'''&lt;br /&gt;
&lt;br /&gt;
'''Applications of LDA'''&lt;br /&gt;
|| &lt;br /&gt;
* The LDA technique is used in several applications, such as&lt;br /&gt;
&lt;br /&gt;
** Fraud Detection&lt;br /&gt;
** Bio-Imaging classification&lt;br /&gt;
** Classification of patient disease state&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Only Narration&lt;br /&gt;
|| Let us now understand the assumptions of LDA.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide '''&lt;br /&gt;
&lt;br /&gt;
'''Assumptions for LDA'''&lt;br /&gt;
|| '''Multivariate Normality: '''&lt;br /&gt;
&lt;br /&gt;
* All predictors are continuous and Gaussian, with an equal covariance matrix across all classes.&lt;br /&gt;
* Mean vectors for each class are different. &lt;br /&gt;
* Data records are independent and identically distributed among each class.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide '''&lt;br /&gt;
&lt;br /&gt;
'''Limitations of LDA'''&lt;br /&gt;
|| Now we will see the limitations of LDA.&lt;br /&gt;
&lt;br /&gt;
* Departure from Gaussianity can increase misclassification probability in LDA.&lt;br /&gt;
* '''LDA''' may perform poorly if the data has unequal class covariance matrices.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Implementation Of LDA'''&lt;br /&gt;
|| Now let us implement '''LDA''' on the '''raisin''' dataset with two chosen variables.&lt;br /&gt;
&lt;br /&gt;
More information on '''raisin''' data is available in the '''Additional Reading material''' on this tutorial page.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide '''&lt;br /&gt;
&lt;br /&gt;
'''Download Files''' &lt;br /&gt;
|| We will use a script file '''LDA.R'''&lt;br /&gt;
&lt;br /&gt;
Please download this file from the''' Code files''' link of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Make a copy and then use it for practising.&lt;br /&gt;
|- &lt;br /&gt;
|| [Computer screen]&lt;br /&gt;
&lt;br /&gt;
Point to '''LDA.R''' and the folder '''LDA.'''&lt;br /&gt;
&lt;br /&gt;
Point to the''' MLProject folder '''on the '''Desktop.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to the''' LDA folder.'''&lt;br /&gt;
|| I have downloaded and moved these files to the '''LDA '''folder.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This folder is in the '''MLProject''' folder on my '''Desktop'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
I have also set the '''LDA''' folder as my working''' directory'''.&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the script file '''LDA.R.'''&lt;br /&gt;
|| In this tutorial, we will create an '''LDA''' classifier model on the '''raisin''' dataset. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Let us switch to '''RStudio'''.&lt;br /&gt;
|- &lt;br /&gt;
|| Open '''LDA.R '''in '''RStudio'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to''' LDA.R''' in '''RStudio'''.&lt;br /&gt;
|| Open the script '''LDA.R''' in '''RStudio'''.&lt;br /&gt;
&lt;br /&gt;
For this, click on the script '''LDA.R.'''&lt;br /&gt;
&lt;br /&gt;
Script '''LDA.R''' opens in '''RStudio'''.&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the '''readxl''' package.&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(MASS) '''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(lattice)'''&lt;br /&gt;
&lt;br /&gt;
Highlight all the commands.&lt;br /&gt;
&lt;br /&gt;
'''&amp;lt;nowiki&amp;gt;#install.packages(“package_name”)&amp;lt;/nowiki&amp;gt;'''&lt;br /&gt;
|| The '''readxl''' package is used to load the '''Excel''' file.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The''' MASS package''' contains the '''lda()''' function that we will use for our analysis.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The '''ggplot2 package''' is used to plot the results of our analysis.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The '''caret''' package contains the '''confusionMatrix''' function.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It is used as a measure for the performance of the classifier.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please note that in order to import these libraries, we need to install them.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please ensure that everything is installed correctly. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You can use the command '''install.packages(“package_name”)''' to install the required packages.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
As I have already installed these packages, I will directly import them. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''library(readxl)'''&lt;br /&gt;
&lt;br /&gt;
'''library(MASS)'''&lt;br /&gt;
&lt;br /&gt;
'''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
'''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
'''library(lattice)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| Select and run these commands to import the requisite packages.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command''' '''&lt;br /&gt;
&lt;br /&gt;
'''data &amp;lt;- read_xlsx(&amp;quot;Raisin.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' data&amp;lt;-data[c(&amp;quot;minorAL&amp;quot;,&amp;quot;ecc&amp;quot;,&amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the commands.&lt;br /&gt;
&lt;br /&gt;
'''data &amp;lt;- read_xlsx(&amp;quot;Raisin.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;-data[c(&amp;quot;minorAL&amp;quot;,&amp;quot;ecc&amp;quot;,&amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
|| We will read the Excel file and choose three columns: two features ('''minorAL''', '''ecc''') and one target ('''class''') variable.&lt;br /&gt;
&lt;br /&gt;
Run these commands to import the '''raisin''' dataset.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the '''Environment '''tab clearly.&lt;br /&gt;
&lt;br /&gt;
Point to the data variable in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click the data to load the dataset.&lt;br /&gt;
&lt;br /&gt;
|| Drag boundary to see the Environment tab clearly.&lt;br /&gt;
&lt;br /&gt;
In the Environment tab under '''Data '''heading, you will see a '''data '''variable.&lt;br /&gt;
&lt;br /&gt;
Click the data''' variable''' to load the dataset in the '''Source''' window. &lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the Source window clearly.&lt;br /&gt;
|| Drag boundary to see the '''Source '''window clearly.&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
||[RStudio]&lt;br /&gt;
&lt;br /&gt;
Type these commands in the source window.&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type this command.&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
||Highlight the below commands.&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
Select the commands and click the Run button.&lt;br /&gt;
&lt;br /&gt;
||Here we are converting the variable '''data$class''' to a factor.&lt;br /&gt;
&lt;br /&gt;
It ensures that the categorical data is properly encoded. &lt;br /&gt;
&lt;br /&gt;
Select the command and run it. &lt;br /&gt;
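A minimal, self-contained sketch of this step, using illustrative stand-in labels for the Raisin classes rather than the actual file:&lt;br /&gt;

```r
# Converting a character target to a factor, as done for data$class.
# "Kecimen" and "Besni" are illustrative stand-ins for the Raisin classes.
class_raw = c("Kecimen", "Besni", "Kecimen")
class_fac = factor(class_raw)
levels(class_fac)  # factor levels are stored in sorted order
```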
|-&lt;br /&gt;
||Only Narration.&lt;br /&gt;
|| Now we split our dataset into training and testing data.&lt;br /&gt;
|-&lt;br /&gt;
||[RStudio]&lt;br /&gt;
&lt;br /&gt;
Type the command in the source window.&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
'''index_split=sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE)'''&lt;br /&gt;
||In the '''Source''' window type these commands.&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
||Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''set.seed(1)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''replace=FALSE'''&lt;br /&gt;
&lt;br /&gt;
Select the commands and click the Run button.&lt;br /&gt;
||First we set a seed for reproducible results.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will create a vector of indices using the '''sample()''' function.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The split will be 70% for training and 30% for testing.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The training data is chosen using simple random sampling without replacement. &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
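The same split can be sketched on R's built-in '''iris''' data, which stands in here for the Raisin dataset downloaded separately:&lt;br /&gt;

```r
# Reproducible 70/30 train-test split, sketched on the built-in iris data.
set.seed(1)
index_split = sample(1:nrow(iris), size = 0.7 * nrow(iris), replace = FALSE)
train_data = iris[index_split, ]   # 70% of rows, sampled without replacement
test_data  = iris[-index_split, ]  # the remaining 30%
nrow(train_data)  # 105 rows, 70% of 150
nrow(test_data)   # 45 rows
```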
|-&lt;br /&gt;
|&lt;br /&gt;
|| The vector is shown in the''' Environment '''tab.&lt;br /&gt;
|-&lt;br /&gt;
||Point to train-test split.&lt;br /&gt;
|| We use the indices that we previously generated to obtain our train-test split.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Type the command&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data [index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| In the '''Source '''window type these commands.&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| This creates training data, consisting of 630 unique rows.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This creates testing data, consisting of 270 unique rows.&lt;br /&gt;
|- &lt;br /&gt;
|| Select the commands and click the Run button.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to the sets in the Environment Tab&lt;br /&gt;
|| Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The data sets are shown in the Environment tab.&lt;br /&gt;
  &lt;br /&gt;
&lt;br /&gt;
Click on '''test_data '''and '''train_data '''to load them in the Source window.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Only Narration.&lt;br /&gt;
|| Let us train our '''LDA''' model.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''LDA_model &amp;lt;- lda(class~.,data=train_data)'''&lt;br /&gt;
&lt;br /&gt;
'''LDA_model'''&lt;br /&gt;
|| In the '''Source '''window, type these commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''LDA_model &amp;lt;- lda(class~.,data=train_data)'''&lt;br /&gt;
&lt;br /&gt;
'''LDA_model'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''LDA_model'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Click on Save and Run buttons.&lt;br /&gt;
&lt;br /&gt;
Point to the output in the '''console '''window.&lt;br /&gt;
|| We pass two parameters to the '''lda()''' function.&lt;br /&gt;
# formula &lt;br /&gt;
# data on which the model should train.&lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
The output is shown in the '''console''' window.&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the '''console''' window.&lt;br /&gt;
|| Drag boundary to see the '''console '''window clearly.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight '''output''' in the '''console.'''&lt;br /&gt;
|| Our '''model''' provides us with a lot of information.&lt;br /&gt;
&lt;br /&gt;
Let us go through them one at a time.&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command '''Prior probabilities of groups. '''&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' Group means.'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''Coefficients of linear discriminants '''&lt;br /&gt;
&lt;br /&gt;
|| These explain the distribution of classes in the training dataset.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
These display the mean values of each '''predictor''' variable for each '''class'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
These display the '''linear combination of predictor''' variables. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The given linear combinations form the decision rule of the '''LDA''' model.&lt;br /&gt;
&lt;br /&gt;
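These components can be inspected directly on any fitted '''lda''' object; a sketch using the built-in '''iris''' data in place of the tutorial's '''train_data''':&lt;br /&gt;

```r
library(MASS)  # provides lda()

# Fit an LDA model on iris (stand-in for the Raisin training data).
model = lda(Species ~ ., data = iris)
model$prior    # prior probabilities: class proportions in the data
model$means    # group means: per-class mean of each predictor
model$scaling  # coefficients of the linear discriminants
```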
|- &lt;br /&gt;
|| Drag boundary to see the Source window.&lt;br /&gt;
|| Drag boundary to see the '''Source '''window clearly.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us use this model to make predictions on the testing data.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''predicted_values &amp;lt;- predict(LDA_model, test_data)'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source '''window type this command and run it. &lt;br /&gt;
&lt;br /&gt;
Let us check what '''predicted_values''' contain.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Click the '''predicted_values '''data in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to the table.&lt;br /&gt;
|| Click the '''predicted_values '''data in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
The '''predicted_values '''table is loaded in the '''Source''' window.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''head(predicted_values$class)'''&lt;br /&gt;
&lt;br /&gt;
'''head(predicted_values$posterior)'''&lt;br /&gt;
&lt;br /&gt;
'''head(predicted_values$x)'''&lt;br /&gt;
|| In the '''Source''' window type these commands and run them.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The output is seen in the''' console''' window.&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command output of '''head(predicted_values$class) '''in the '''console.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command output of '''head(predicted_values$posterior)''' in the '''console.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command output of '''head(predicted_values$x) '''in '''console'''&lt;br /&gt;
|| It contains the class that the model has predicted for each observation.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It contains the '''posterior probability''' of the observation belonging to each class.&lt;br /&gt;
&lt;br /&gt;
This contains the linear discriminants for each observation.&lt;br /&gt;
&lt;br /&gt;
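The three components returned by '''predict()''' can be sketched on the built-in '''iris''' data:&lt;br /&gt;

```r
library(MASS)

# predict() on an lda object returns a list with three components.
model = lda(Species ~ ., data = iris)
pred = predict(model, iris)
names(pred)           # "class", "posterior", "x"
head(pred$class)      # predicted class labels
head(pred$posterior)  # posterior probability for each class
head(pred$x)          # values of the linear discriminants
```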
|- &lt;br /&gt;
|| Only Narration.&lt;br /&gt;
|| Now we will measure the performance of our model using the '''Confusion Matrix'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''confusion &amp;lt;-table(test_data$class,predicted_values$class)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''fourfoldplot(confusion, color = c(&amp;quot;red&amp;quot;, &amp;quot;green&amp;quot;), conf.level = 0, margin=1)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Click on '''Save '''and''' Run''' buttons.&lt;br /&gt;
|| In the '''Source '''window type these commands.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Save and run the commands.&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command '''confusion &amp;lt;- table(test_data$class, predicted_values$class)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''fourfoldplot(confusion, color = c(&amp;quot;red&amp;quot;, &amp;quot;green&amp;quot;), conf.level = 0, margin=1)'''&lt;br /&gt;
&lt;br /&gt;
|| This table creates a confusion matrix.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The '''fourfoldplot()''' function generates a visual plot of the confusion matrix.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The output is seen in the '''plot''' window.&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the plot in '''plot window '''&lt;br /&gt;
|| Drag boundary to see the plot window clearly.&lt;br /&gt;
&lt;br /&gt;
Given the specific seed ('''set.seed(1)'''), LDA has misclassified 33 out of 270 observations. &lt;br /&gt;
&lt;br /&gt;
This number may change for different sets of training data. &lt;br /&gt;
&lt;br /&gt;
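Counting misclassifications from the confusion matrix can be sketched on the built-in '''iris''' data (off-diagonal cells are the errors):&lt;br /&gt;

```r
library(MASS)

# Confusion matrix and misclassification count, sketched on iris.
model = lda(Species ~ ., data = iris)
pred = predict(model, iris)$class
confusion = table(iris$Species, pred)  # rows: truth, columns: prediction
misclassified = sum(confusion) - sum(diag(confusion))
misclassified  # number of observations the model got wrong
```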
|- &lt;br /&gt;
|| Only Narration.&lt;br /&gt;
|| Let us visualize how well our model separates different classes.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
[RStudio]&lt;br /&gt;
&lt;br /&gt;
'''X &amp;lt;- seq(min(train_data$minorAL), max(train_data$minorAL), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Y &amp;lt;- seq(min(train_data$ecc), max(train_data$ecc), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''min_max &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''min_max$predicted_class &amp;lt;- predict(LDA_model, newdata = min_max)$class'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
&lt;br /&gt;
'''grid$class &amp;lt;- predict(LDA_model, newdata = grid)$class'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid$classnum &amp;lt;- as.numeric(grid$class)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Click on Save and Run buttons.&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window, type these commands.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This block of code operates as a setup for visual plotting.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It creates a square grid of coordinates spanning the range of the training data, together with the model's prediction at each grid point.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The '''seq()''' function generates a sequence of evenly spaced values between the smallest and largest values of the '''minorAL''' and '''ecc''' variables from the training data.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The '''grid''' variable contains the generated grid points along with the predictions of '''LDA_model''' on them.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The '''as.numeric''' function encodes the predicted class labels as numeric values.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
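The grid construction can be sketched with two '''iris''' predictors standing in for '''minorAL''' and '''ecc''':&lt;br /&gt;

```r
# Building a prediction grid with seq() and expand.grid().
X = seq(min(iris$Petal.Length), max(iris$Petal.Length), length.out = 100)
Y = seq(min(iris$Petal.Width),  max(iris$Petal.Width),  length.out = 100)
grid = expand.grid(Petal.Length = X, Petal.Width = Y)
nrow(grid)  # 10000 points: every combination of the 100 x 100 values
```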
|- &lt;br /&gt;
|| Point to the Environment tab.&lt;br /&gt;
|| Drag boundary to see the details in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
These variables contain the data for the visualization of the linear discriminants.&lt;br /&gt;
&lt;br /&gt;
Click the '''grid''' '''data''' in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
The '''grid data''' table is loaded in the '''Source''' window.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = min_max, aes(x = minorAL, y = ecc, color = predicted_class), size = 1, alpha = 0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data=grid, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_contour(data= grid, aes(x=minorAL, y=ecc, z = classnum), colour=&amp;quot;black&amp;quot;, linewidth = 1.2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;LDA Decision Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window, type these commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = min_max, aes(x = minorAL, y = ecc, color = predicted_class), size = 1, alpha = 0.3) +theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data=grid, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_contour(data= grid, aes(x=minorAL, y=ecc, z = classnum), colour=&amp;quot;black&amp;quot;, linewidth = 1.2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;LDA Decision Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
|| This command creates the decision boundary plot.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It plots the '''grid''' points with colors indicating the predicted classes. &lt;br /&gt;
&lt;br /&gt;
'''geom_raster''' creates a colour map indicating the predicted classes of the grid points.&lt;br /&gt;
&lt;br /&gt;
'''geom_contour '''creates the decision boundary of the LDA.&lt;br /&gt;
&lt;br /&gt;
The '''scale_color_manual''' function assigns specific colors to the classes, and so does the '''scale_fill_manual''' function.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The overall plot provides a visual representation of the decision boundary and the distribution of training data points of the '''model'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Select and run these commands.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Drag boundaries to see the plot window clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| Point the output in the '''Plots '''window&lt;br /&gt;
|| We can see that our model has separated most of the data points clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| Only Narration&lt;br /&gt;
|| With this, we come to the end of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Let us summarize.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Summary'''&lt;br /&gt;
|| In this tutorial we have learnt:&lt;br /&gt;
&lt;br /&gt;
* Linear Discriminant Analysis ('''LDA''') and its implementation.&lt;br /&gt;
* Assumptions of LDA&lt;br /&gt;
* Limitations of LDA&lt;br /&gt;
* LDA on a subset of Raisin dataset&lt;br /&gt;
* Visualization of the '''LDA''' separator and its corresponding confusion matrix&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Now we will suggest an assignment for this Spoken Tutorial.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Assignment'''&lt;br /&gt;
|| &lt;br /&gt;
* Perform LDA on the inbuilt '''PlantGrowth''' dataset.&lt;br /&gt;
* Evaluate the model using a confusion matrix and visualize the results.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''About the Spoken Tutorial Project'''&lt;br /&gt;
|| The video at the following link summarizes the Spoken Tutorial project. &lt;br /&gt;
&lt;br /&gt;
Please download and watch it.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Spoken Tutorial Workshops'''&lt;br /&gt;
|| We conduct workshops using Spoken Tutorials and give certificates.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please contact us.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Spoken Tutorial Forum to answer questions.'''&lt;br /&gt;
&lt;br /&gt;
Do you have questions in THIS Spoken Tutorial?&lt;br /&gt;
&lt;br /&gt;
Choose the minute and second where you have the question. Explain your question briefly.&lt;br /&gt;
&lt;br /&gt;
Someone from the FOSSEE team will answer them.&lt;br /&gt;
&lt;br /&gt;
Please visit this site.&lt;br /&gt;
|| Please post your timed queries in this forum.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Forum to answer questions'''&lt;br /&gt;
|| Do you have any general/technical questions?&lt;br /&gt;
&lt;br /&gt;
Please visit the forum given in the link.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Textbook Companion'''&lt;br /&gt;
|| The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.&lt;br /&gt;
&lt;br /&gt;
We give certificates to those who do this.&lt;br /&gt;
&lt;br /&gt;
For more details, please visit these sites.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Acknowledgment'''&lt;br /&gt;
|| The '''Spoken Tutorial''' project was established by the Ministry of Education, Govt. of India.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Thank You'''&lt;br /&gt;
|| This tutorial is contributed by Yate Asseke Ronald and Debatosh Chakraborty from IIT Bombay.&lt;br /&gt;
&lt;br /&gt;
Thank you for joining.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Ushav</name></author>	</entry>

	<entry>
		<id>https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Linear-Discriminant-Analysis-in-R/English</id>
		<title>Machine-Learning-using-R/C2/Linear-Discriminant-Analysis-in-R/English</title>
		<link rel="alternate" type="text/html" href="https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Linear-Discriminant-Analysis-in-R/English"/>
				<updated>2024-05-30T05:03:40Z</updated>
		
		<summary type="html">&lt;p&gt;Ushav: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title of the script''': Linear Discriminant Analysis in R&lt;br /&gt;
&lt;br /&gt;
'''Author''': YATE ASSEKE RONALD OLIVERA and Debatosh Chakraborty&lt;br /&gt;
&lt;br /&gt;
'''Keywords''':  R, RStudio, machine learning, supervised, unsupervised, dimensionality reduction, confusion matrix, console, LDA, video tutorial.&lt;br /&gt;
&lt;br /&gt;
{| border=1&lt;br /&gt;
|- &lt;br /&gt;
|| '''Visual Cue'''&lt;br /&gt;
|| '''Narration'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Opening Slide'''&lt;br /&gt;
|| Welcome to this spoken tutorial on '''Linear Discriminant Analysis in R.'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Learning Objectives'''&lt;br /&gt;
&lt;br /&gt;
|| In this tutorial, we will learn about: &lt;br /&gt;
# Linear Discriminant Analysis ('''LDA''') and its implementation.&lt;br /&gt;
# Assumptions of LDA&lt;br /&gt;
# Limitations of LDA&lt;br /&gt;
# LDA on a subset of Raisin dataset&lt;br /&gt;
# Visualization of the '''LDA''' separator and its corresponding confusion matrix.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''System Specifications'''&lt;br /&gt;
|| This tutorial is recorded using,&lt;br /&gt;
* '''Windows 11 '''&lt;br /&gt;
* '''R '''version''' 4.3.0'''&lt;br /&gt;
* '''RStudio''' version '''2023.06.1'''&lt;br /&gt;
&lt;br /&gt;
It is recommended to install '''R''' version '''4.2.0''' or higher. &lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide.'''&lt;br /&gt;
&lt;br /&gt;
'''Prerequisites '''&lt;br /&gt;
&lt;br /&gt;
'''https://spoken-tutorial.org'''&lt;br /&gt;
|| To follow this tutorial, the learner should know:&lt;br /&gt;
&lt;br /&gt;
* Basics of '''R''' programming. &lt;br /&gt;
* Basics of '''Machine Learning '''using '''R'''. &lt;br /&gt;
&lt;br /&gt;
If not, please access the relevant tutorials on '''R '''on this website.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide.'''&lt;br /&gt;
&lt;br /&gt;
'''Linear Discriminant Analysis'''&lt;br /&gt;
|| Linear Discriminant Analysis is a statistical method.&lt;br /&gt;
* It is used for classification. &lt;br /&gt;
* It constructs a data driven line that best separates different classes.&lt;br /&gt;
* It is based on maximizing the likelihood function to classify observations into two or more classes.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide.'''&lt;br /&gt;
&lt;br /&gt;
'''Applications of LDA'''&lt;br /&gt;
|| &lt;br /&gt;
* The LDA technique is used in several applications, such as&lt;br /&gt;
&lt;br /&gt;
** Fraud Detection&lt;br /&gt;
** Bio-Imaging classification&lt;br /&gt;
** Classification of patient disease states&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Only Narration&lt;br /&gt;
|| Let us now understand the assumptions of LDA.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide '''&lt;br /&gt;
&lt;br /&gt;
'''Assumptions for LDA'''&lt;br /&gt;
|| '''Multivariate Normality: '''&lt;br /&gt;
&lt;br /&gt;
* All features are continuous and Gaussian, with a common covariance matrix across all the classes.&lt;br /&gt;
* Mean vectors for each class are different. &lt;br /&gt;
* Data records are independent and identically distributed among each class.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide '''&lt;br /&gt;
&lt;br /&gt;
'''Limitations of LDA'''&lt;br /&gt;
|| Now we will see the limitations of LDA.&lt;br /&gt;
&lt;br /&gt;
* Departure from Gaussianity can increase misclassification probability in LDA.&lt;br /&gt;
* '''LDA''' may perform poorly if the classes have unequal covariance matrices.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Implementation Of LDA'''&lt;br /&gt;
|| Now let us implement '''LDA''' on the '''raisin dataset '''with two chosen variables'''.'''&lt;br /&gt;
&lt;br /&gt;
More information on '''raisin''' data is available in the '''Additional Reading material''' on this tutorial page.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide '''&lt;br /&gt;
&lt;br /&gt;
'''Download Files''' &lt;br /&gt;
|| We will use a script file '''LDA.R'''&lt;br /&gt;
&lt;br /&gt;
Please download this file from the''' Code files''' link of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Make a copy and then use it for practising.&lt;br /&gt;
|- &lt;br /&gt;
|| [Computer screen]&lt;br /&gt;
&lt;br /&gt;
Point to '''LDA.R''' and the folder '''LDA.'''&lt;br /&gt;
&lt;br /&gt;
Point to the''' MLProject folder '''on the '''Desktop.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to the''' LDA folder.'''&lt;br /&gt;
|| I have downloaded and moved these files to the '''LDA '''folder.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This folder is in the '''MLProject''' folder on my '''Desktop'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
I have also set the '''LDA''' folder as my working''' directory'''.&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the script file '''LDA.R.'''&lt;br /&gt;
|| In this tutorial, we will create a '''LDA''' classifier model on the '''raisin''' dataset. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Let us switch to '''RStudio'''.&lt;br /&gt;
|- &lt;br /&gt;
|| Open '''LDA.R '''in '''RStudio'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to''' LDA.R''' in '''RStudio'''.&lt;br /&gt;
|| Open the script '''LDA.R''' in '''RStudio'''.&lt;br /&gt;
&lt;br /&gt;
For this, click on the script '''LDA.R.'''&lt;br /&gt;
&lt;br /&gt;
Script '''LDA.R''' opens in '''RStudio'''.&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the '''Readxl package.'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(MASS) '''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
Highlight all the commands.&lt;br /&gt;
&lt;br /&gt;
'''&amp;lt;nowiki&amp;gt;#install.packages(“package_name”)&amp;lt;/nowiki&amp;gt;'''&lt;br /&gt;
|| '''Readxl package''' is used to load the '''Excel''' file.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The''' MASS package''' contains the '''lda()''' function that we will use for our analysis.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The '''ggplot2 package''' is used to plot the results of our analysis.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The '''caret package''' contains the&lt;br /&gt;
&lt;br /&gt;
'''confusionMatrix''' function.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It is used as a measure for the performance of the classifier.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please note that in order to import these libraries, we need to install them.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please ensure that everything is installed correctly. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You can use the command '''install.packages(“package_name”)''' to install the required packages.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
As I have already installed these packages, I will directly import them. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''library(readxl)'''&lt;br /&gt;
&lt;br /&gt;
'''library(MASS)'''&lt;br /&gt;
&lt;br /&gt;
'''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
'''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
'''library(lattice)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| Select and run these commands to import the requisite packages.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command''' '''&lt;br /&gt;
&lt;br /&gt;
'''data &amp;lt;- read_xlsx(&amp;quot;Raisin.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' data&amp;lt;-data[c(&amp;quot;minorAL&amp;quot;,&amp;quot;ecc&amp;quot;,&amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the commands.&lt;br /&gt;
&lt;br /&gt;
'''data &amp;lt;- read_xlsx(&amp;quot;Raisin.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;-data[c(&amp;quot;minorAL&amp;quot;,&amp;quot;ecc&amp;quot;,&amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
|| We will read the Excel file and choose three columns: two feature variables ('''minorAL''', '''ecc''') and one target variable ('''class''').&lt;br /&gt;
&lt;br /&gt;
Run these commands to import the '''raisin''' dataset.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the '''Environment '''tab clearly.&lt;br /&gt;
&lt;br /&gt;
Point to the data variable in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click the data to load the dataset.&lt;br /&gt;
&lt;br /&gt;
|| Drag boundary to see the Environment tab clearly.&lt;br /&gt;
&lt;br /&gt;
In the Environment tab under '''Data '''heading, you will see a '''data '''variable.&lt;br /&gt;
&lt;br /&gt;
Click the data''' variable''' to load the dataset in the '''Source''' window. &lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the Source window clearly.&lt;br /&gt;
|| Drag boundary to see the '''Source '''window clearly.&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
||[RStudio]&lt;br /&gt;
&lt;br /&gt;
Type these commands in the source window.&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type this command.&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
||Highlight the below commands.&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
Select the commands and click the Run button.&lt;br /&gt;
&lt;br /&gt;
||Here we are converting the variable '''data$class''' to a factor.&lt;br /&gt;
&lt;br /&gt;
It ensures that the categorical data is properly encoded. &lt;br /&gt;
&lt;br /&gt;
Select the command and run it. &lt;br /&gt;
|-&lt;br /&gt;
||Only Narration.&lt;br /&gt;
|| Now we split our dataset into training and testing data.&lt;br /&gt;
|-&lt;br /&gt;
||[RStudio]&lt;br /&gt;
&lt;br /&gt;
Type the command in the source window.&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
'''index_split=sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE)'''&lt;br /&gt;
||In the '''Source''' window type these commands.&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
||Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''set.seed(1)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''replace=FALSE'''&lt;br /&gt;
&lt;br /&gt;
Select the commands and click the Run button.&lt;br /&gt;
||First we set a seed for reproducible results.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will create a vector of indices using '''sample() '''function.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This will be 70% for training and 30% for testing.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The training data is chosen using simple random sampling without replacement. &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
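The split above can be sketched on a small stand-in data frame (the data frame and its size here are illustrative, not the raisin dataset):

```r
# Sketch of the 70/30 split on a toy data frame (illustrative only).
set.seed(1)                                   # reproducible sampling
toy <- data.frame(minorAL = rnorm(10), ecc = rnorm(10))
idx <- sample(1:nrow(toy), size = 0.7 * nrow(toy), replace = FALSE)
train <- toy[idx, ]                           # 70% of rows, sampled without replacement
test  <- toy[-idx, ]                          # the remaining 30%
```

With 10 rows, this yields 7 training rows and 3 testing rows; no row appears in both sets.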
|-&lt;br /&gt;
|| The vector is shown in the''' Environment '''tab.&lt;br /&gt;
|-&lt;br /&gt;
||Point to train-test split.&lt;br /&gt;
|| We use the indices that we previously generated to obtain our train-test split.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Type the command&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data [index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| In the '''Source '''window type these commands.&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| This creates training data, consisting of 630 unique rows.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This creates testing data, consisting of 270 unique rows.&lt;br /&gt;
|- &lt;br /&gt;
|| Select the commands and click the Run button.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to the sets in the Environment Tab&lt;br /&gt;
|| Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The data sets are shown in the Environment tab.&lt;br /&gt;
  &lt;br /&gt;
&lt;br /&gt;
Click on '''test_data '''and '''train_data '''to load them in the Source window.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Only Narration.&lt;br /&gt;
|| Let us train our '''LDA''' model.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''LDA_model &amp;lt;- lda(class~.,data=train_data)'''&lt;br /&gt;
&lt;br /&gt;
'''LDA_model'''&lt;br /&gt;
|| In the '''Source '''window, type these commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''LDA_model &amp;lt;- lda(class~.,data=train_data)'''&lt;br /&gt;
&lt;br /&gt;
'''LDA_model'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''LDA_model'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Click on Save and Run buttons.&lt;br /&gt;
&lt;br /&gt;
Point to the output in the '''console '''window.&lt;br /&gt;
|| We pass two parameters to the '''lda()''' function.&lt;br /&gt;
# formula &lt;br /&gt;
# data on which the model should train.&lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
The output is shown in the '''console''' window.&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the '''console''' window.&lt;br /&gt;
|| Drag boundary to see the '''console '''window clearly.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight '''output''' in the '''console.'''&lt;br /&gt;
|| Our '''model''' provides us with a lot of information.&lt;br /&gt;
&lt;br /&gt;
Let us go through it one item at a time.&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight '''Prior probabilities of groups''' in the output.&lt;br /&gt;
&lt;br /&gt;
Highlight '''Group means''' in the output.&lt;br /&gt;
&lt;br /&gt;
Highlight '''Coefficients of linear discriminants''' in the output.&lt;br /&gt;
&lt;br /&gt;
|| These show the proportion of each class in the training dataset.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
These display the mean values of each '''predictor''' variable for each '''class'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
These display the '''linear combination of predictor''' variables. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The given linear combinations form the decision rule of the '''LDA''' model.&lt;br /&gt;
&lt;br /&gt;
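As background, the decision rule these coefficients represent is the standard LDA discriminant score from the textbook formulation (not output shown by '''lda()'''):

```latex
% Standard LDA discriminant score for class k, with shared covariance \Sigma,
% class mean \mu_k and prior probability \pi_k.
% An observation x is assigned to the class with the largest score.
\delta_k(x) = x^{\top}\Sigma^{-1}\mu_k
  - \tfrac{1}{2}\,\mu_k^{\top}\Sigma^{-1}\mu_k
  + \log \pi_k
```

Because the covariance matrix is shared across classes, the boundary between any two classes is linear in x.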
|- &lt;br /&gt;
|| Drag boundary to see the Source window.&lt;br /&gt;
|| Drag boundary to see the '''Source '''window clearly.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us use this model to make predictions on the testing data.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''predicted_values &amp;lt;- predict(LDA_model, test_data)'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source '''window type this command and run it. &lt;br /&gt;
&lt;br /&gt;
Let us check what '''predicted_values''' contain.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Click the '''predicted_values '''data in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to the table.&lt;br /&gt;
|| Click the '''predicted_values '''data in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
The '''predicted_values '''table is loaded in the '''Source''' window.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''head(predicted_values$class)'''&lt;br /&gt;
&lt;br /&gt;
'''head(predicted_values$posterior)'''&lt;br /&gt;
&lt;br /&gt;
'''head(predicted_values$x)'''&lt;br /&gt;
|| In the '''Source''' window type these commands and run them.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The output is seen in the''' console''' window.&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command output of '''head(predicted_values$class) '''in the '''console.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command output of '''head(predicted_values$posterior)''' in the '''console.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command output of '''head(predicted_values$x) '''in '''console'''&lt;br /&gt;
|| It contains the class that the model has predicted for each observation.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It contains the '''posterior probability''' of the observation belonging to each class.&lt;br /&gt;
&lt;br /&gt;
This contains the linear discriminants for each observation.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Only Narration.&lt;br /&gt;
|| Now we will measure the performance of our model using the '''Confusion Matrix'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''confusion &amp;lt;-table(test_data$class,predicted_values$class)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''fourfoldplot(confusion, color = c(&amp;quot;red&amp;quot;, &amp;quot;green&amp;quot;), conf.level = 0, margin=1)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Click on '''Save '''and''' Run''' buttons.&lt;br /&gt;
|| In the '''Source '''window type these commands.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Save and run the commands.&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command '''confusion &amp;lt;- table(test_data$class, predicted_values$class)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''fourfoldplot(confusion, color = c(&amp;quot;red&amp;quot;, &amp;quot;green&amp;quot;), conf.level = 0, margin=1)'''&lt;br /&gt;
&lt;br /&gt;
|| This table creates a confusion matrix.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The '''fourfoldplot()''' function generates a visual plot of the confusion matrix.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The output is seen in the '''plot''' window.&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the plot in '''plot window '''&lt;br /&gt;
|| Drag boundary to see the plot window clearly.&lt;br /&gt;
&lt;br /&gt;
Given the specific seed, '''set.seed(1)''', LDA has misclassified 33 out of 270 observations. &lt;br /&gt;
&lt;br /&gt;
This number may change for different sets of training data. &lt;br /&gt;
&lt;br /&gt;
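The misclassification count can be read off any confusion matrix as the sum of its off-diagonal entries; a sketch with hypothetical counts (chosen only to total 270 test rows, not the tutorial's actual table):

```r
# Misclassification count and accuracy from a 2x2 confusion matrix.
# The counts are hypothetical; row/column labels are illustrative.
confusion <- matrix(c(120, 15, 18, 117), nrow = 2,
                    dimnames = list(actual    = c("A", "B"),
                                    predicted = c("A", "B")))
misclassified <- sum(confusion) - sum(diag(confusion))   # off-diagonal sum
accuracy      <- sum(diag(confusion)) / sum(confusion)   # diagonal fraction
```

Here the diagonal holds the correct predictions, so `misclassified` is 33 of 270 and `accuracy` is 237/270.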
|- &lt;br /&gt;
|| Only Narration.&lt;br /&gt;
|| Let us visualize how well our model separates different classes.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''X &amp;lt;- seq(min(train_data$minorAL), max(train_data$minorAL), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Y &amp;lt;- seq(min(train_data$ecc), max(train_data$ecc), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''min_max &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''min_max$predicted_class &amp;lt;- predict(LDA_model, newdata = min_max)$class'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
&lt;br /&gt;
'''grid$class &amp;lt;- predict(LDA_model, newdata = grid)$class'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid$classnum &amp;lt;- as.numeric(grid$class)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Click on Save and Run buttons.&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window, type these commands.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This block of code operates as a setup for visual plotting.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It builds a square grid of coordinates spanning the range of the training data, together with the class predicted at each grid point.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The '''seq()''' function generates evenly spaced values between the smallest and largest values of the 'minorAL' and 'ecc' variables in the training data.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The '''grid''' variable contains the generated data along with the predictions of '''LDA_model''' on it.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The '''as.numeric()''' function encodes the predicted class labels as numeric values.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the Environment tab.&lt;br /&gt;
|| Drag boundary to see the details in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
These variables contain the data for the visualization of the linear discriminants.&lt;br /&gt;
&lt;br /&gt;
Click the '''grid''' '''data''' in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
The '''grid data''' table is loaded in the '''Source''' window.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = min_max, aes(x = minorAL, y = ecc, color = predicted_class), size = 1, alpha = 0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data=grid, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_contour(data= grid, aes(x=minorAL, y=ecc, z = classnum), colour=&amp;quot;black&amp;quot;, linewidth = 1.2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;LDA Decision Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window, type these commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = min_max, aes(x = minorAL, y = ecc, color = predicted_class), size = 1, alpha = 0.3) +theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data=grid, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_contour(data= grid, aes(x=minorAL, y=ecc, z = classnum), colour=&amp;quot;black&amp;quot;, linewidth = 1.2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;LDA Decision Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
|| This command creates the decision boundary plot.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It plots the '''grid''' points with colors indicating the predicted classes. &lt;br /&gt;
&lt;br /&gt;
'''geom_raster''' creates a colour map indicating the predicted classes of the grid points.&lt;br /&gt;
&lt;br /&gt;
'''geom_contour''' creates the decision boundary of the LDA.&lt;br /&gt;
&lt;br /&gt;
The '''scale_color_manual''' and '''scale_fill_manual''' functions assign specific colors and fills to the classes.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The overall plot provides a visual representation of the decision boundary and the distribution of training data points of the '''model'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Select and run these commands.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Drag boundaries to see the plot window clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| Point the output in the '''Plots '''window&lt;br /&gt;
|| We can see that our model has separated most of the data points clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| Only Narration&lt;br /&gt;
|| With this, we come to the end of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Let us summarize.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Summary'''&lt;br /&gt;
|| In this tutorial we have learnt:&lt;br /&gt;
&lt;br /&gt;
* Linear Discriminant Analysis ('''LDA''') and its implementation.&amp;amp;nbsp;&lt;br /&gt;
* Assumptions of LDA&lt;br /&gt;
* Limitations of LDA&lt;br /&gt;
* LDA on a subset of Raisin dataset&lt;br /&gt;
* Visualization of the '''LDA''' separator and its corresponding confusion matrix&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Now we will suggest an assignment for this Spoken Tutorial.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Assignment'''&lt;br /&gt;
|| &lt;br /&gt;
* Perform LDA on the inbuilt '''PlantGrowth''' dataset&lt;br /&gt;
* Evaluate the model using a confusion matrix and visualize the results&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''About the Spoken Tutorial Project'''&lt;br /&gt;
|| The video at the following link summarizes the Spoken Tutorial project. &lt;br /&gt;
&lt;br /&gt;
Please download and watch it.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Spoken Tutorial Workshops'''&lt;br /&gt;
|| We conduct workshops using Spoken Tutorials and give certificates.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please contact us.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Spoken Tutorial Forum to answer questions.'''&lt;br /&gt;
&lt;br /&gt;
Do you have questions in THIS Spoken Tutorial?&lt;br /&gt;
&lt;br /&gt;
Choose the minute and second where you have the question. Explain your question briefly.&lt;br /&gt;
&lt;br /&gt;
Someone from the FOSSEE team will answer them.&lt;br /&gt;
&lt;br /&gt;
Please visit this site.&lt;br /&gt;
|| Please post your timed queries in this forum.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Forum to answer questions'''&lt;br /&gt;
|| Do you have any general/technical questions?&lt;br /&gt;
&lt;br /&gt;
Please visit the forum given in the link.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Textbook Companion'''&lt;br /&gt;
|| The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.&lt;br /&gt;
&lt;br /&gt;
We give certificates to those who do this.&lt;br /&gt;
&lt;br /&gt;
For more details, please visit these sites.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Acknowledgment'''&lt;br /&gt;
|| The '''Spoken Tutorial''' project was established by the Ministry of Education, Govt. of India.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Thank You'''&lt;br /&gt;
|| This tutorial is contributed by Yate Asseke Ronald and Debatosh Chakraborthy from IIT Bombay.&lt;br /&gt;
&lt;br /&gt;
Thank you for joining.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Ushav</name></author>	</entry>

	<entry>
		<id>https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Linear-Discriminant-Analysis-in-R/English</id>
		<title>Machine-Learning-using-R/C2/Linear-Discriminant-Analysis-in-R/English</title>
		<link rel="alternate" type="text/html" href="https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Linear-Discriminant-Analysis-in-R/English"/>
				<updated>2024-05-28T13:41:48Z</updated>
		
		<summary type="html">&lt;p&gt;Ushav: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title of the script''': Linear Discriminant Analysis in R&lt;br /&gt;
&lt;br /&gt;
'''Author''': YATE ASSEKE RONALD OLIVERA  and Debatosh Charkraborty&lt;br /&gt;
&lt;br /&gt;
'''Keywords''':  R, RStudio, machine learning, supervised, unsupervised, dimensionality reduction, confusion matrix, console, LDA, video tutorial.&lt;br /&gt;
&lt;br /&gt;
{| border=1&lt;br /&gt;
|- &lt;br /&gt;
|| '''Visual Cue'''&lt;br /&gt;
|| '''Narration'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Opening Slide'''&lt;br /&gt;
|| Welcome to this spoken tutorial on '''Linear Discriminant Analysis in R.'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Learning Objectives'''&lt;br /&gt;
&lt;br /&gt;
|| In this tutorial, we will learn about: &lt;br /&gt;
# Linear Discriminant Analysis ('''LDA''') and its implementation.&lt;br /&gt;
# Assumptions of LDA&lt;br /&gt;
# Limitations of LDA&lt;br /&gt;
# LDA on a subset of Raisin dataset&lt;br /&gt;
# Visualization of the '''LDA''' separator and its corresponding confusion matrix.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''System Specifications'''&lt;br /&gt;
|| This tutorial is recorded using,&lt;br /&gt;
* '''Windows 11 '''&lt;br /&gt;
* '''R '''version''' 4.3.0'''&lt;br /&gt;
* '''RStudio''' version '''2023.06.1'''&lt;br /&gt;
&lt;br /&gt;
It is recommended to install '''R''' version '''4.2.0''' or higher. &lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide.'''&lt;br /&gt;
&lt;br /&gt;
'''Prerequisites '''&lt;br /&gt;
&lt;br /&gt;
'''https://spoken-tutorial.org'''&lt;br /&gt;
|| To follow this tutorial, the learner should know:&lt;br /&gt;
&lt;br /&gt;
* Basics of '''R''' programming. &lt;br /&gt;
* Basics of '''Machine Learning '''using '''R'''. &lt;br /&gt;
&lt;br /&gt;
If not, please access the relevant tutorials on '''R '''on this website.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide.'''&lt;br /&gt;
&lt;br /&gt;
'''Linear Discriminant Analysis'''&lt;br /&gt;
|| Linear Discriminant Analysis is a statistical method.&lt;br /&gt;
* It is used for classification. &lt;br /&gt;
* It constructs a data driven line that best separates different classes.&lt;br /&gt;
* It is based on maximizing the likelihood function to classify observations into two or more classes.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide.'''&lt;br /&gt;
&lt;br /&gt;
'''Applications of LDA'''&lt;br /&gt;
|| &lt;br /&gt;
* The LDA technique is used in several applications, such as&lt;br /&gt;
&lt;br /&gt;
** Fraud Detection&lt;br /&gt;
** Bio-Imaging classification&lt;br /&gt;
** Classification of patient disease states&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Only Narration&lt;br /&gt;
|| Let us now understand the assumptions of LDA.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide '''&lt;br /&gt;
&lt;br /&gt;
'''Assumptions for LDA'''&lt;br /&gt;
|| '''Multivariate Normality: '''&lt;br /&gt;
&lt;br /&gt;
* All features are continuous and Gaussian, with a common covariance matrix across all the classes.&lt;br /&gt;
* Mean vectors for each class are different. &lt;br /&gt;
* Data records are independent and identically distributed among each class.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide '''&lt;br /&gt;
&lt;br /&gt;
'''Limitations of LDA'''&lt;br /&gt;
|| Now we will see the limitations of LDA.&lt;br /&gt;
&lt;br /&gt;
* Departure from Gaussianity can increase misclassification probability in LDA.&lt;br /&gt;
* '''LDA''' may perform poorly if the classes have unequal covariance matrices.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Implementation Of LDA'''&lt;br /&gt;
|| Now let us implement '''LDA''' on the '''raisin dataset '''with two chosen variables'''.'''&lt;br /&gt;
&lt;br /&gt;
More information on '''raisin''' data is available in the '''Additional Reading material''' on this tutorial page.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide '''&lt;br /&gt;
&lt;br /&gt;
'''Download Files''' &lt;br /&gt;
|| We will use a script file '''LDA.R'''&lt;br /&gt;
&lt;br /&gt;
Please download this file from the''' Code files''' link of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Make a copy and then use it for practising.&lt;br /&gt;
|- &lt;br /&gt;
|| [Computer screen]&lt;br /&gt;
&lt;br /&gt;
Point to '''LDA.R''' and the folder '''LDA.'''&lt;br /&gt;
&lt;br /&gt;
Point to the''' MLProject folder '''on the '''Desktop.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to the''' LDA folder.'''&lt;br /&gt;
|| I have downloaded and moved these files to the '''LDA '''folder.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This folder is in the '''MLProject''' folder on my '''Desktop'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
I have also set the '''LDA''' folder as my working''' directory'''.&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the script file '''LDA.R.'''&lt;br /&gt;
|| In this tutorial, we will create a '''LDA''' classifier model on the '''raisin''' dataset. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Let us switch to '''RStudio'''.&lt;br /&gt;
|- &lt;br /&gt;
|| Open '''LDA.R '''in '''RStudio'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to''' LDA.R''' in '''RStudio'''.&lt;br /&gt;
|| Open the script '''LDA.R''' in '''RStudio'''.&lt;br /&gt;
&lt;br /&gt;
For this, click on the script '''LDA.R.'''&lt;br /&gt;
&lt;br /&gt;
Script '''LDA.R''' opens in '''RStudio'''.&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the '''Readxl package.'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(MASS) '''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
Highlight all the commands.&lt;br /&gt;
&lt;br /&gt;
'''&amp;lt;nowiki&amp;gt;#install.packages(“package_name”)&amp;lt;/nowiki&amp;gt;'''&lt;br /&gt;
|| '''Readxl package''' is used to load the '''Excel''' file.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The''' MASS package''' contains the '''lda()''' function that we will use for our analysis.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The '''ggplot2 package''' is used to plot the results of our analysis.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The '''caret package''' contains the&lt;br /&gt;
&lt;br /&gt;
'''confusionMatrix''' function.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It is used as a measure for the performance of the classifier.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please note that in order to import these libraries, we need to install them.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please ensure that everything is installed correctly. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You can use the command '''install.packages(“package_name”)''' to install the required packages.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
As I have already installed these packages, I will directly import them. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''library(readxl)'''&lt;br /&gt;
&lt;br /&gt;
'''library(MASS)'''&lt;br /&gt;
&lt;br /&gt;
'''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
'''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
'''library(lattice)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| Select and run these commands to import the requisite packages.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command''' '''&lt;br /&gt;
&lt;br /&gt;
'''data &amp;lt;- read_xlsx(&amp;quot;Raisin.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' data&amp;lt;-data[c(&amp;quot;minorAL&amp;quot;,&amp;quot;ecc&amp;quot;,&amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the commands.&lt;br /&gt;
&lt;br /&gt;
'''data &amp;lt;- read_xlsx(&amp;quot;Raisin.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;-data[c(&amp;quot;minorAL&amp;quot;,&amp;quot;ecc&amp;quot;,&amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
|| We will read the '''Excel''' file and choose 3 columns: two features ('''minorAL''', '''ecc''') and one target variable ('''class''').&lt;br /&gt;
&lt;br /&gt;
Run these commands to import the '''raisin''' dataset.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the '''Environment '''tab clearly.&lt;br /&gt;
&lt;br /&gt;
Point to the data variable in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click the data to load the dataset.&lt;br /&gt;
&lt;br /&gt;
|| Drag boundary to see the Environment tab clearly.&lt;br /&gt;
&lt;br /&gt;
In the Environment tab under '''Data '''heading, you will see a '''data '''variable.&lt;br /&gt;
&lt;br /&gt;
Click the data''' variable''' to load the dataset in the '''Source''' window. &lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the Source window clearly.&lt;br /&gt;
|| Drag boundary to see the '''Source '''window clearly.&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
||[RStudio]&lt;br /&gt;
&lt;br /&gt;
Type these commands in the source window.&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type this command.&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
||Highlight the below commands.&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
Select the commands and click the Run button.&lt;br /&gt;
&lt;br /&gt;
||Here we are converting the variable '''data$class''' to a factor.&lt;br /&gt;
&lt;br /&gt;
It ensures that the categorical data is properly encoded. &lt;br /&gt;
&lt;br /&gt;
Select the command and run it.&lt;br /&gt;
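As an aside, a minimal sketch of what '''factor()''' does; the class labels here are illustrative (the actual labels come from the '''Raisin''' data):

```r
# Minimal sketch of factor(): a character vector becomes a categorical
# variable with explicit levels. The labels here are illustrative.
cls = factor(c("Kecimen", "Besni", "Kecimen"))
levels(cls)   # "Besni" "Kecimen" (levels are sorted alphabetically)
nlevels(cls)  # 2
```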
|-&lt;br /&gt;
||Only Narration.&lt;br /&gt;
|| Now we split our dataset into training and testing data.&lt;br /&gt;
|-&lt;br /&gt;
||[RStudio]&lt;br /&gt;
&lt;br /&gt;
Type the command in the source window.&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
'''index_split=sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE)'''&lt;br /&gt;
||In the '''Source''' window type these commands.&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
||Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''set.seed(1)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''replace=FALSE'''&lt;br /&gt;
&lt;br /&gt;
Select the commands and click the Run button.&lt;br /&gt;
||First we set a seed for reproducible results.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will create a vector of indices using the '''sample()''' function.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
These indices select 70% of the rows for training; the remaining 30% will be used for testing.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The training data is chosen using simple random sampling without replacement. &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
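As a quick check, the arithmetic behind the split can be sketched as follows (assuming the '''Raisin''' data has 900 rows, as implied by the 630 and 270 row counts mentioned in this tutorial):

```r
# Sketch: sampling 70% of 900 row indices without replacement leaves
# exactly 630 rows for training and 270 for testing.
n = 900
set.seed(1)
idx = sample(1:n, size = 0.7 * n, replace = FALSE)
length(idx)      # 630 training rows
n - length(idx)  # 270 testing rows
```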
|-&lt;br /&gt;
|| Point to the '''index_split''' vector in the Environment tab.&lt;br /&gt;
|| The vector is shown in the''' Environment '''tab.&lt;br /&gt;
|-&lt;br /&gt;
||Point to train-test split.&lt;br /&gt;
|| We use the indices that we previously generated to obtain our train-test split.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Type the command&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data [index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| In the '''Source '''window type these commands.&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| This creates training data, consisting of 630 unique rows.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This creates testing data, consisting of 270 unique rows.&lt;br /&gt;
|- &lt;br /&gt;
|| Select the commands and click the Run button.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to the sets in the Environment Tab&lt;br /&gt;
|| Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The data sets are shown in the Environment tab.&lt;br /&gt;
  &lt;br /&gt;
&lt;br /&gt;
Click on '''test_data '''and '''train_data '''to load them in the Source window.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Only Narration.&lt;br /&gt;
|| Let us train our '''LDA''' model.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''LDA_model &amp;lt;- lda(class~.,data=train_data)'''&lt;br /&gt;
&lt;br /&gt;
'''LDA_model'''&lt;br /&gt;
|| In the '''Source '''window, type these commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''LDA_model &amp;lt;- lda(class~.,data=train_data)'''&lt;br /&gt;
&lt;br /&gt;
'''LDA_model'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''LDA_model'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Click on Save and Run buttons.&lt;br /&gt;
&lt;br /&gt;
Point to the output in the '''console '''window.&lt;br /&gt;
|| We pass two parameters to the '''lda()''' function.&lt;br /&gt;
# formula &lt;br /&gt;
# data on which the model should train.&lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
The output is shown in the '''console''' window.&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the '''console''' window.&lt;br /&gt;
|| Drag boundary to see the '''console '''window clearly.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight '''output''' in the '''console.'''&lt;br /&gt;
|| Our '''model''' provides us with a lot of information.&lt;br /&gt;
&lt;br /&gt;
Let us go through them one at a time.&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight '''Prior probabilities of groups''' in the output.&lt;br /&gt;
&lt;br /&gt;
Highlight''' Group means''' in the output.&lt;br /&gt;
&lt;br /&gt;
Highlight '''Coefficients of linear discriminants''' in the output.&lt;br /&gt;
&lt;br /&gt;
|| These explain the distribution of classes in the training dataset.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
These display the mean values of each '''predictor '''variable for each '''class'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
These display the '''linear combination of predictor''' variables. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The given linear combinations form the decision rule of the '''LDA''' model.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the Source window.&lt;br /&gt;
|| Drag boundary to see the '''Source '''window clearly.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us use this model to make predictions on the testing data.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''predicted_values &amp;lt;- predict(LDA_model, test_data)'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source '''window type this command and run it. &lt;br /&gt;
&lt;br /&gt;
Let us check what '''predicted_values''' contains.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Click the '''predicted_values '''data in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to the table.&lt;br /&gt;
|| Click the '''predicted_values '''data in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
The '''predicted_values '''table is loaded in the '''Source''' window.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''head(predicted_values$class)'''&lt;br /&gt;
&lt;br /&gt;
'''head(predicted_values$posterior)'''&lt;br /&gt;
&lt;br /&gt;
'''head(predicted_values$x)'''&lt;br /&gt;
|| In the '''Source''' window type these commands and run them.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The output is seen in the''' console''' window.&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command output of '''head(predicted_values$class) '''in the '''console.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command output of '''head(predicted_values$posterior)''' in the '''console.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command output of '''head(predicted_values$x) '''in '''console'''&lt;br /&gt;
|| It contains the class that the model has predicted for each observation.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It contains the '''posterior probability''' of the observation belonging to each class.&lt;br /&gt;
&lt;br /&gt;
This contains the linear discriminants for each observation.&lt;br /&gt;
&lt;br /&gt;
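The three components can also be seen on a toy model; this sketch uses the built-in '''iris''' data reduced to two classes to mirror the tutorial's binary setup:

```r
# Sketch: predict() on an lda model returns a list with three components:
# class (predicted labels), posterior (class probabilities), x (discriminants).
library(MASS)
d = droplevels(subset(iris, Species != "setosa"))
m = lda(Species ~ Sepal.Length + Sepal.Width, data = d)
p = predict(m, d)
names(p)  # "class" "posterior" "x"
```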
|- &lt;br /&gt;
|| Only Narration.&lt;br /&gt;
|| Now we will measure the performance of our model using the '''Confusion Matrix'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''confusion &amp;lt;-table(test_data$class,predicted_values$class)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''fourfoldplot(confusion, color = c(&amp;quot;red&amp;quot;, &amp;quot;green&amp;quot;), conf.level = 0, margin=1)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Click on '''Save '''and''' Run''' buttons.&lt;br /&gt;
|| In the '''Source '''window type these commands.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Save and run the commands.&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command '''confusion &amp;lt;- table(test_data$class, predicted_values$class)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''fourfoldplot(confusion, color = c(&amp;quot;red&amp;quot;, &amp;quot;green&amp;quot;), conf.level = 0, margin=1)'''&lt;br /&gt;
&lt;br /&gt;
|| The '''table()''' function creates a confusion matrix.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The '''fourfoldplot()''' function generates a visual plot of the confusion matrix. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The output is seen in the '''plot''' window.&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the plot in '''plot window '''&lt;br /&gt;
|| Drag boundary to see the plot window clearly.&lt;br /&gt;
&lt;br /&gt;
Given the specific seed ('''set.seed(1)'''), LDA has misclassified 33 out of 270 observations. &lt;br /&gt;
&lt;br /&gt;
This number may change for different sets of training data. &lt;br /&gt;
&lt;br /&gt;
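The reported error count can be turned into an accuracy figure with a small sketch (the 33 misclassifications out of 270 are taken from this run of the tutorial):

```r
# Sketch: accuracy from the misclassification count reported in this run.
n_test = 270
n_wrong = 33
accuracy = (n_test - n_wrong) / n_test
round(accuracy, 3)  # 0.878
```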
|- &lt;br /&gt;
|| Only Narration.&lt;br /&gt;
|| Let us visualize how well our model separates different classes.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''X &amp;lt;- seq(min(train_data$minorAL), max(train_data$minorAL), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Y &amp;lt;- seq(min(train_data$ecc), max(train_data$ecc), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''min_max &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''min_max$predicted_class &amp;lt;- predict(LDA_model, newdata = min_max)$class'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
&lt;br /&gt;
'''grid$class &amp;lt;- predict(LDA_model, newdata = grid)$class'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid$classnum &amp;lt;- as.numeric(grid$class)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Click on Save and Run buttons.&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window, type these commands.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This block of code sets up the data for visual plotting.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It consists of square grid coordinates spanning the range of the training data, together with their predicted classes.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The '''seq()''' function generates a sequence of 100 evenly spaced values between the smallest and largest values of the '''minorAL''' and '''ecc''' variables in the training data.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The '''grid''' variable contains the generated grid points along with the predictions of '''LDA_model''' on them.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The '''as.numeric''' function encodes the predicted class labels into numeric values.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
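A minimal sketch of how '''expand.grid()''' builds the grid; with two sequences of 100 values it produces every combination, that is, 10000 rows:

```r
# Sketch: expand.grid() forms all (minorAL, ecc) combinations, so two
# 100-value sequences yield a 100 x 100 = 10000-row data frame.
X = seq(0, 1, length.out = 100)
Y = seq(0, 1, length.out = 100)
g = expand.grid(minorAL = X, ecc = Y)
nrow(g)  # 10000
```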
|- &lt;br /&gt;
|| Point to the Environment tab.&lt;br /&gt;
|| Drag boundary to see the details in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
These variables contain the data for the visualization of the linear discriminants.&lt;br /&gt;
&lt;br /&gt;
Click the '''grid''' '''data''' in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
The '''grid data''' table is loaded in the '''Source''' window.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = min_max, aes(x = minorAL, y = ecc, color = predicted_class), size = 1, alpha = 0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data=grid, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_contour(data= grid, aes(x=minorAL, y=ecc, z = classnum), colour=&amp;quot;black&amp;quot;, linewidth = 1.2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;LDA Decision Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window, type these commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = min_max, aes(x = minorAL, y = ecc, color = predicted_class), size = 1, alpha = 0.3) + theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data=grid, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_contour(data= grid, aes(x=minorAL, y=ecc, z = classnum), colour=&amp;quot;black&amp;quot;, linewidth = 1.2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;LDA Decision Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
|| This command creates the decision boundary plot.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It plots the '''grid''' points with colors indicating the predicted classes. &lt;br /&gt;
&lt;br /&gt;
'''geom_raster '''creates a colour map indicating the predicted classes of the grid points.&lt;br /&gt;
&lt;br /&gt;
'''geom_contour '''creates the decision boundary of the LDA.&lt;br /&gt;
&lt;br /&gt;
The '''scale_color_manual''' and '''scale_fill_manual''' functions assign specific colors to the classes.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The overall plot provides a visual representation of the decision boundary of the '''model''' and the distribution of training data points.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Select and run these commands.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Drag boundaries to see the plot window clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the output in the '''Plots '''window&lt;br /&gt;
|| We can see that our model has separated most of the data points clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| Only Narration&lt;br /&gt;
|| With this, we come to the end of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Let us summarize.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Summary'''&lt;br /&gt;
|| In this tutorial we have learnt:&lt;br /&gt;
&lt;br /&gt;
* Linear Discriminant Analysis ('''LDA''') and its implementation.&lt;br /&gt;
* Assumptions of LDA&lt;br /&gt;
* Limitations of LDA&lt;br /&gt;
* LDA on a subset of Raisin dataset&lt;br /&gt;
* Visualization of the '''LDA''' separator and its corresponding confusion matrix&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Now we will suggest an assignment for this Spoken Tutorial.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Assignment'''&lt;br /&gt;
|| &lt;br /&gt;
* Perform LDA on the inbuilt '''PlantGrowth''' dataset&lt;br /&gt;
* Evaluate the model using a confusion matrix and visualize the results&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''About the Spoken Tutorial Project'''&lt;br /&gt;
|| The video at the following link summarizes the Spoken Tutorial project. &lt;br /&gt;
&lt;br /&gt;
Please download and watch it.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Spoken Tutorial Workshops'''&lt;br /&gt;
|| We conduct workshops using Spoken Tutorials and give certificates.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please contact us.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Spoken Tutorial Forum to answer questions.'''&lt;br /&gt;
&lt;br /&gt;
Do you have questions in THIS Spoken Tutorial?&lt;br /&gt;
&lt;br /&gt;
Choose the minute and second where you have the question. Explain your question briefly.&lt;br /&gt;
&lt;br /&gt;
Someone from the FOSSEE team will answer them.&lt;br /&gt;
&lt;br /&gt;
Please visit this site.&lt;br /&gt;
|| Please post your timed queries in this forum.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Forum to answer questions'''&lt;br /&gt;
|| Do you have any general/technical questions?&lt;br /&gt;
&lt;br /&gt;
Please visit the forum given in the link.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Textbook Companion'''&lt;br /&gt;
|| The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.&lt;br /&gt;
&lt;br /&gt;
We give certificates to those who do this.&lt;br /&gt;
&lt;br /&gt;
For more details, please visit these sites.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Acknowledgment'''&lt;br /&gt;
|| The '''Spoken Tutorial''' project was established by the Ministry of Education, Govt. of India.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Thank You'''&lt;br /&gt;
|| This tutorial is contributed by Yate Asseke Ronald and Debatosh Chakraborty from IIT Bombay.&lt;br /&gt;
&lt;br /&gt;
Thank you for joining.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Ushav</name></author>	</entry>

	<entry>
		<id>https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Linear-Discriminant-Analysis-in-R/English</id>
		<title>Machine-Learning-using-R/C2/Linear-Discriminant-Analysis-in-R/English</title>
		<link rel="alternate" type="text/html" href="https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Linear-Discriminant-Analysis-in-R/English"/>
				<updated>2024-05-28T13:39:36Z</updated>
		
		<summary type="html">&lt;p&gt;Ushav: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title of the script''': Linear Discriminant Analysis in R&lt;br /&gt;
&lt;br /&gt;
'''Author''': YATE ASSEKE RONALD OLIVERA and Debatosh Chakraborty&lt;br /&gt;
&lt;br /&gt;
'''Keywords''':  R, RStudio, machine learning, supervised, unsupervised, dimensionality reduction, confusion matrix, console, LDA, video tutorial.&lt;br /&gt;
&lt;br /&gt;
{| border=1&lt;br /&gt;
|- &lt;br /&gt;
|| '''Visual Cue'''&lt;br /&gt;
|| '''Narration'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Opening Slide'''&lt;br /&gt;
|| Welcome to this spoken tutorial on '''Linear Discriminant Analysis in R.'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Learning Objectives'''&lt;br /&gt;
&lt;br /&gt;
|| In this tutorial, we will learn about: &lt;br /&gt;
# Linear Discriminant Analysis ('''LDA''') and its implementation.&lt;br /&gt;
# Assumptions of LDA&lt;br /&gt;
# Limitations of LDA&lt;br /&gt;
# LDA on a subset of Raisin dataset&lt;br /&gt;
# Visualization of the '''LDA''' separator and its corresponding confusion matrix.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''System Specifications'''&lt;br /&gt;
|| This tutorial is recorded using:&lt;br /&gt;
* '''Windows 11 '''&lt;br /&gt;
* '''R '''version''' 4.3.0'''&lt;br /&gt;
* '''RStudio''' version '''2023.06.1'''&lt;br /&gt;
&lt;br /&gt;
It is recommended to install '''R''' version '''4.2.0''' or higher. &lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide.'''&lt;br /&gt;
&lt;br /&gt;
'''Prerequisites '''&lt;br /&gt;
&lt;br /&gt;
'''https://spoken-tutorial.org'''&lt;br /&gt;
|| To follow this tutorial, the learner should know:&lt;br /&gt;
&lt;br /&gt;
* Basics of '''R''' programming. &lt;br /&gt;
* Basics of '''Machine Learning '''using '''R'''. &lt;br /&gt;
&lt;br /&gt;
If not, please access the relevant tutorials on '''R '''on this website.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide.'''&lt;br /&gt;
&lt;br /&gt;
'''Linear Discriminant Analysis'''&lt;br /&gt;
|| Linear Discriminant Analysis is a statistical method.&lt;br /&gt;
* It is used for classification. &lt;br /&gt;
* It constructs a data-driven line that best separates different classes.&lt;br /&gt;
* It is based on maximizing the likelihood function to classify two or more classes.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide.'''&lt;br /&gt;
&lt;br /&gt;
'''Applications of LDA'''&lt;br /&gt;
|| &lt;br /&gt;
* The LDA technique is used in several applications, such as&lt;br /&gt;
&lt;br /&gt;
** Fraud Detection&lt;br /&gt;
** Bio-Imaging classification&lt;br /&gt;
** Classification of patient disease states&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Only Narration&lt;br /&gt;
|| Let us now understand the assumptions of LDA.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide '''&lt;br /&gt;
&lt;br /&gt;
'''Assumptions for LDA'''&lt;br /&gt;
|| '''Multivariate Normality: '''&lt;br /&gt;
&lt;br /&gt;
* All features are continuous and Gaussian, with an equal covariance matrix across all classes.&lt;br /&gt;
* The mean vectors of the classes are different. &lt;br /&gt;
* Data records are independent and identically distributed within each class.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide '''&lt;br /&gt;
&lt;br /&gt;
'''Limitations of LDA'''&lt;br /&gt;
|| Now we will see the limitations of LDA.&lt;br /&gt;
&lt;br /&gt;
* Departure from Gaussianity can increase the misclassification probability of LDA.&lt;br /&gt;
* '''LDA''' may perform poorly if the data has unequal class covariance matrices.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Implementation Of LDA'''&lt;br /&gt;
|| Now let us implement '''LDA''' on the '''raisin dataset''' with two chosen variables.&lt;br /&gt;
&lt;br /&gt;
More information on '''raisin''' data is available in the '''Additional Reading material''' on this tutorial page.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide '''&lt;br /&gt;
&lt;br /&gt;
'''Download Files''' &lt;br /&gt;
|| We will use a script file '''LDA.R'''&lt;br /&gt;
&lt;br /&gt;
Please download this file from the''' Code files''' link of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Make a copy and then use it for practicing.&lt;br /&gt;
|- &lt;br /&gt;
|| [Computer screen]&lt;br /&gt;
&lt;br /&gt;
Point to '''LDA.R''' and the folder '''LDA.'''&lt;br /&gt;
&lt;br /&gt;
Point to the''' MLProject folder '''on the '''Desktop.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to the''' LDA folder.'''&lt;br /&gt;
|| I have downloaded and moved these files to the '''LDA '''folder.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This folder is in the '''MLProject''' folder on my '''Desktop'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
I have also set the '''LDA''' folder as my working''' directory'''.&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the script file '''LDA.R.'''&lt;br /&gt;
|| In this tutorial, we will create an '''LDA''' classifier model on the '''raisin''' dataset. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Let us switch to '''RStudio'''.&lt;br /&gt;
|- &lt;br /&gt;
|| Open '''LDA.R '''in '''RStudio'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to''' LDA.R''' in '''RStudio'''.&lt;br /&gt;
|| Open the script '''LDA.R''' in '''RStudio'''.&lt;br /&gt;
&lt;br /&gt;
For this, click on the script '''LDA.R.'''&lt;br /&gt;
&lt;br /&gt;
Script '''LDA.R''' opens in '''RStudio'''.&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the '''Readxl package.'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(MASS) '''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
Highlight all the commands.&lt;br /&gt;
&lt;br /&gt;
'''&amp;lt;nowiki&amp;gt;#install.packages(“package_name”)&amp;lt;/nowiki&amp;gt;'''&lt;br /&gt;
|| '''Readxl package''' is used to load the '''Excel''' file.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The''' MASS package''' contains the '''lda()''' function that we will use for our analysis.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The '''ggplot2 package''' is used to plot the results of our analysis.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The '''caret package''' contains the&lt;br /&gt;
&lt;br /&gt;
'''confusionMatrix''' function.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It is used as a measure for the performance of the classifier.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please note that in order to import these libraries, we need to install them.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please ensure that everything is installed correctly. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You can use the command '''install.packages(“package_name”)''' to install the required packages.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
As I have already installed these packages, I will directly import them. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''library(readxl)'''&lt;br /&gt;
&lt;br /&gt;
'''library(MASS)'''&lt;br /&gt;
&lt;br /&gt;
'''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
'''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
'''library(lattice)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| Select and run these commands to import the requisite packages.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command''' '''&lt;br /&gt;
&lt;br /&gt;
'''data &amp;lt;- read_xlsx(&amp;quot;Raisin.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' data&amp;lt;-data[c(&amp;quot;minorAL&amp;quot;,&amp;quot;ecc&amp;quot;,&amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the commands.&lt;br /&gt;
&lt;br /&gt;
'''data &amp;lt;- read_xlsx(&amp;quot;Raisin.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;-data[c(&amp;quot;minorAL&amp;quot;,&amp;quot;ecc&amp;quot;,&amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
|| We will read the '''Excel''' file and choose 3 columns: two features ('''minorAL''', '''ecc''') and one target variable ('''class''').&lt;br /&gt;
&lt;br /&gt;
Run these commands to import the '''raisin''' dataset.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the '''Environment '''tab clearly.&lt;br /&gt;
&lt;br /&gt;
Point to the data variable in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click the data to load the dataset.&lt;br /&gt;
&lt;br /&gt;
|| Drag boundary to see the Environment tab clearly.&lt;br /&gt;
&lt;br /&gt;
In the Environment tab under '''Data '''heading, you will see a '''data '''variable.&lt;br /&gt;
&lt;br /&gt;
Click the data''' variable''' to load the dataset in the '''Source''' window. &lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the Source window clearly.&lt;br /&gt;
|| Drag boundary to see the '''Source '''window clearly.&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
||[RStudio]&lt;br /&gt;
&lt;br /&gt;
Type these commands in the source window.&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type this command.&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
||Highlight the below commands.&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
Select the commands and click the Run button.&lt;br /&gt;
&lt;br /&gt;
||Here we are converting the variable '''data$class''' to a factor.&lt;br /&gt;
&lt;br /&gt;
It ensures that the categorical data is properly encoded. &lt;br /&gt;
&lt;br /&gt;
Select the command and run it.&lt;br /&gt;
|-&lt;br /&gt;
||Only Narration.&lt;br /&gt;
|| Now we split our dataset into training and testing data.&lt;br /&gt;
|-&lt;br /&gt;
||[RStudio]&lt;br /&gt;
&lt;br /&gt;
Type the command in the source window.&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
'''index_split=sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE)'''&lt;br /&gt;
||In the '''Source''' window type these commands.&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
||Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''set.seed(1)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''replace=FALSE'''&lt;br /&gt;
&lt;br /&gt;
Select the commands and click the Run button.&lt;br /&gt;
||First we set a seed for reproducible results.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will create a vector of indices using the '''sample()''' function.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
These indices select 70% of the rows for training; the remaining 30% will be used for testing.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The training data is chosen using simple random sampling without replacement. &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
|-&lt;br /&gt;
|| Point to the '''index_split''' vector in the Environment tab.&lt;br /&gt;
|| The vector is shown in the''' Environment '''tab.&lt;br /&gt;
|-&lt;br /&gt;
||Point to train-test split.&lt;br /&gt;
|| We use the indices that we previously generated to obtain our train-test split.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Type the command&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data [index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| In the '''Source '''window type these commands.&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| This creates training data, consisting of 630 unique rows.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This creates testing data, consisting of 270 unique rows.&lt;br /&gt;
|- &lt;br /&gt;
|| Select the commands and click the Run button.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to the sets in the Environment Tab&lt;br /&gt;
|| Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The data sets are shown in the Environment tab.&lt;br /&gt;
  &lt;br /&gt;
&lt;br /&gt;
Click on '''test_data '''and '''train_data '''to load them in the Source window.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Only Narration.&lt;br /&gt;
|| Let us train our '''LDA''' model.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''LDA_model &amp;lt;- lda(class~.,data=train_data)'''&lt;br /&gt;
&lt;br /&gt;
'''LDA_model'''&lt;br /&gt;
|| In the '''Source '''window, type these commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''LDA_model &amp;lt;- lda(class~.,data=train_data)'''&lt;br /&gt;
&lt;br /&gt;
'''LDA_model'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''LDA_model'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Click on Save and Run buttons.&lt;br /&gt;
&lt;br /&gt;
Point to the output in the '''console '''window.&lt;br /&gt;
|| We pass two parameters to the '''lda()''' function.&lt;br /&gt;
# formula &lt;br /&gt;
# data on which the model should train.&lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
The output is shown in the '''console''' window.&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the '''console''' window.&lt;br /&gt;
|| Drag boundary to see the '''console '''window clearly.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight '''output''' in the '''console.'''&lt;br /&gt;
|| Our '''model''' provides us with a lot of information.&lt;br /&gt;
&lt;br /&gt;
Let us go through them one at a time.&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command '''Prior probabilities of groups. '''&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' Group means.'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''Coefficients of linear discriminants '''&lt;br /&gt;
&lt;br /&gt;
|| These explain the distribution of classes in the training dataset.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
These display the mean values of each '''predictor '''variable for each '''class'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
These display the '''linear combination of predictor''' variables. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The given linear combinations form the decision rule of the '''LDA''' model.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the Source window.&lt;br /&gt;
|| Drag boundary to see the '''Source '''window clearly.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us use this model to make predictions on the testing data.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''predicted_values &amp;lt;- predict(LDA_model, test_data)'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source '''window type this command and run it. &lt;br /&gt;
&lt;br /&gt;
Let us check what '''predicted_values''' contain.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Click the '''predicted_values '''data in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to the table.&lt;br /&gt;
|| Click the '''predicted_values '''data in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
The '''predicted_values '''table is loaded in the '''Source''' window.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''head(predicted_values$class)'''&lt;br /&gt;
&lt;br /&gt;
'''head(predicted_values$posterior)'''&lt;br /&gt;
&lt;br /&gt;
'''head(predicted_values$x)'''&lt;br /&gt;
|| In the '''Source''' window type these commands and run them.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The output is seen in the''' console''' window.&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command output of '''head(predicted_values$class) '''in the '''console.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command output of '''head(predicted_values$posterior)''' in the '''console.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command output of '''head(predicted_values$x) '''in '''console'''&lt;br /&gt;
|| It contains the class that the model has predicted for each observation.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It contains the '''posterior probability''' of the observation belonging to each class.&lt;br /&gt;
&lt;br /&gt;
This contains the linear discriminants for each observation.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Only Narration.&lt;br /&gt;
|| Now we will measure the performance of our model using the '''Confusion Matrix'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''confusion &amp;lt;-table(test_data$class,predicted_values$class)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''fourfoldplot(confusion, color = c(&amp;quot;red&amp;quot;, &amp;quot;green&amp;quot;), conf.level = 0, margin=1)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Click on '''Save '''and''' Run''' buttons.&lt;br /&gt;
|| In the '''Source '''window type these commands.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Save and run the commands.&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command '''confusion &amp;lt;- table(test_data$class, predicted_values$class)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''fourfoldplot(confusion, color = c(&amp;quot;red&amp;quot;, &amp;quot;green&amp;quot;), conf.level = 0, margin=1)'''&lt;br /&gt;
&lt;br /&gt;
|| This table creates a confusion matrix.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The '''fourfoldplot()''' function generates a visual plot of the confusion matrix.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The output is seen in the '''plot''' window.&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the plot in '''plot window '''&lt;br /&gt;
|| Drag boundary to see the plot window clearly.&lt;br /&gt;
&lt;br /&gt;
Given the specific seed ('''set.seed(1)'''), LDA has misclassified 33 out of 270 observations. &lt;br /&gt;
&lt;br /&gt;
This number may change for different sets of training data. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Only Narration.&lt;br /&gt;
|| Let us visualize how well our model separates different classes.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''X &amp;lt;- seq(min(train_data$minorAL), max(train_data$minorAL), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Y &amp;lt;- seq(min(train_data$ecc), max(train_data$ecc), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''min_max &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''min_max$predicted_class &amp;lt;- predict(LDA_model, newdata = min_max)$class'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
&lt;br /&gt;
'''grid$class &amp;lt;- predict(LDA_model, newdata = grid)$class'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid$classnum &amp;lt;- as.numeric(grid$class)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Click on Save and Run buttons.&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window, type these commands.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This block of code operates as a setup for visual plotting.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It consists of square grid coordinates in the range of training data and their predicted linear discriminants.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The '''seq()''' function generates evenly spaced values between the smallest and largest values of the 'minorAL' and 'ecc' variables in the training data.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The '''grid''' variable contains the generated data along with the predictions of the LDA_model on it.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The '''as.numeric''' function encodes the predicted class labels as numeric values.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the Environment tab.&lt;br /&gt;
|| Drag boundary to see the details in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
These variables contain the data for the visualization of the linear discriminants.&lt;br /&gt;
&lt;br /&gt;
Click the '''grid''' '''data''' in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
The '''grid data''' table is loaded in the '''Source''' window.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = min_max, aes(x = minorAL, y = ecc, color = predicted_class), size = 1, alpha = 0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data=grid, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_contour(data= grid, aes(x=minorAL, y=ecc, z = classnum), colour=&amp;quot;black&amp;quot;, linewidth = 1.2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;LDA Decision Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window, type these commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = min_max, aes(x = minorAL, y = ecc, color = predicted_class), size = 1, alpha = 0.3) +theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data=grid, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_contour(data= grid, aes(x=minorAL, y=ecc, z = classnum), colour=&amp;quot;black&amp;quot;, linewidth = 1.2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;LDA Decision Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
|| This code creates the decision boundary plot.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It plots the '''grid''' points with colors indicating the predicted classes. &lt;br /&gt;
&lt;br /&gt;
'''geom_raster '''creates a colour map indicating the predicted classes of the grid points.&lt;br /&gt;
&lt;br /&gt;
'''geom_contour '''creates the decision boundary of the LDA.&lt;br /&gt;
&lt;br /&gt;
The '''scale_color_manual''' and '''scale_fill_manual''' functions assign specific colors to the classes.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The overall plot provides a visual representation of the model's decision boundary and the distribution of the training data points.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Select and run these commands.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Drag boundaries to see the plot window clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the output in the '''Plots '''window&lt;br /&gt;
|| We can see that our model has separated most of the data points clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| Only Narration&lt;br /&gt;
|| With this, we come to the end of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Let us summarize.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Summary'''&lt;br /&gt;
|| In this tutorial we have learnt:&lt;br /&gt;
&lt;br /&gt;
* Linear Discriminant Analysis ('''LDA''') and its implementation.&lt;br /&gt;
* Assumptions of LDA&lt;br /&gt;
* Limitations of LDA&lt;br /&gt;
* LDA on a subset of Raisin dataset&lt;br /&gt;
* Visualization of the '''LDA''' separator and its corresponding confusion matrix&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Now we will suggest an assignment for this Spoken Tutorial.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Assignment'''&lt;br /&gt;
|| &lt;br /&gt;
* Perform LDA on the inbuilt '''PlantGrowth''' dataset&lt;br /&gt;
* Evaluate the model using a confusion matrix and visualize the results&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''About the Spoken Tutorial Project'''&lt;br /&gt;
|| The video at the following link summarizes the Spoken Tutorial project. &lt;br /&gt;
&lt;br /&gt;
Please download and watch it.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Spoken Tutorial Workshops'''&lt;br /&gt;
|| We conduct workshops using Spoken Tutorials and give certificates.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please contact us.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Spoken Tutorial Forum to answer questions.'''&lt;br /&gt;
&lt;br /&gt;
Do you have questions in THIS Spoken Tutorial?&lt;br /&gt;
&lt;br /&gt;
Choose the minute and second where you have the question. Explain your question briefly.&lt;br /&gt;
&lt;br /&gt;
Someone from the FOSSEE team will answer them.&lt;br /&gt;
&lt;br /&gt;
Please visit this site.&lt;br /&gt;
|| Please post your timed queries in this forum.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Forum to answer questions'''&lt;br /&gt;
|| Do you have any general/technical questions?&lt;br /&gt;
&lt;br /&gt;
Please visit the forum given in the link.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Textbook Companion'''&lt;br /&gt;
|| The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.&lt;br /&gt;
&lt;br /&gt;
We give certificates to those who do this.&lt;br /&gt;
&lt;br /&gt;
For more details, please visit these sites.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Acknowledgment'''&lt;br /&gt;
|| The '''Spoken Tutorial''' project was established by the Ministry of Education, Government of India.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Thank You'''&lt;br /&gt;
|| This tutorial is contributed by Yate Asseke Ronald and Debatosh Chakraborty from IIT Bombay.&lt;br /&gt;
&lt;br /&gt;
Thank you for joining.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Ushav</name></author>	</entry>

	<entry>
		<id>https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Quadratic-Discriminant-Analysis-in-R/English</id>
		<title>Machine-Learning-using-R/C2/Quadratic-Discriminant-Analysis-in-R/English</title>
		<link rel="alternate" type="text/html" href="https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Quadratic-Discriminant-Analysis-in-R/English"/>
				<updated>2024-05-16T12:44:19Z</updated>
		
		<summary type="html">&lt;p&gt;Ushav: Created page with &amp;quot;'''Title of the script''': Quadratic Discriminant Analysis in R  '''Author''': Yate Asseke Ronald Olivera and Debatosh Chakraborty  &amp;lt;div style=&amp;quot;margin-right:-1.27cm;&amp;quot;&amp;gt;'''Keywo...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title of the script''': Quadratic Discriminant Analysis in R&lt;br /&gt;
&lt;br /&gt;
'''Author''': Yate Asseke Ronald Olivera and Debatosh Chakraborty&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;margin-right:-1.27cm;&amp;quot;&amp;gt;'''Keywords''': R, RStudio, machine learning, supervised, unsupervised, QDA, quadratic discriminant analysis, video tutorial.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=1&lt;br /&gt;
|- &lt;br /&gt;
| align=center| '''Visual Cue'''&lt;br /&gt;
| align=center| '''Narration'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Opening Slide'''&lt;br /&gt;
|| Welcome to this spoken tutorial on''' Quadratic Discriminant Analysis in R'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Learning Objectives'''&lt;br /&gt;
&lt;br /&gt;
|| In this tutorial, we will learn about:&lt;br /&gt;
* Quadratic Discriminant Analysis (QDA).&lt;br /&gt;
* Comparison between '''QDA '''and''' LDA'''.&lt;br /&gt;
* Assumptions for QDA.&lt;br /&gt;
* Limitations of QDA&lt;br /&gt;
* Applications of QDA&lt;br /&gt;
* Implementation of QDA using the '''Raisin''' dataset.&lt;br /&gt;
* Visualization of the '''QDA '''separator&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''System Specifications'''&lt;br /&gt;
|| This tutorial is recorded using,&lt;br /&gt;
* '''Windows 11 '''&lt;br /&gt;
* '''R '''version''' 4.3.0'''&lt;br /&gt;
* '''RStudio''' version '''2023.06.1'''&lt;br /&gt;
&lt;br /&gt;
It is recommended to install '''R''' version '''4.2.0''' or higher.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Prerequisites '''&lt;br /&gt;
&lt;br /&gt;
'''https://spoken-tutorial.org'''&lt;br /&gt;
|| To follow this tutorial, the learner should know,&lt;br /&gt;
* Basic programming in '''R'''.&lt;br /&gt;
* '''Basics of Machine Learning'''.&lt;br /&gt;
&lt;br /&gt;
If not, please access the relevant tutorials on this website.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Quadratic Discriminant Analysis'''&lt;br /&gt;
||&lt;br /&gt;
* Quadratic discriminant analysis is a statistical method used for classification.&lt;br /&gt;
* QDA constructs a data-driven non-linear separator between two classes.&lt;br /&gt;
* The covariance matrix for different classes is not necessarily equal. &lt;br /&gt;
* A quadratic function describes the decision boundary between each pair of classes.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Differences between LDA and QDA'''&lt;br /&gt;
|| Now let’s see the differences between LDA and QDA&lt;br /&gt;
&lt;br /&gt;
* '''LDA''' assumes that each class has the same covariance matrix.&lt;br /&gt;
* '''QDA''' relaxes the assumption of an equal covariance matrix for all the classes.&lt;br /&gt;
* '''LDA''' constructs a linear boundary, while '''QDA '''constructs a non-linear boundary.&lt;br /&gt;
* When the covariance matrices of different classes are the same, '''QDA '''reduces to '''LDA'''.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slides'''&lt;br /&gt;
&lt;br /&gt;
'''Assumptions for QDA'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''QDA '''is primarily used when data is multivariate Gaussian.&lt;br /&gt;
&lt;br /&gt;
'''QDA''' assumes that each class has its own covariance matrix.&lt;br /&gt;
&lt;br /&gt;
|| Now let us see the assumption of QDA&lt;br /&gt;
&lt;br /&gt;
QDA is used when data is multivariate Gaussian and each class has its own covariance matrix.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide.'''&lt;br /&gt;
&lt;br /&gt;
'''Limitations of QDA'''&lt;br /&gt;
&lt;br /&gt;
* Multicollinearity among predictors may lead to poor performance.&lt;br /&gt;
* The presence of outliers in data may also lead to poor performance. &lt;br /&gt;
&lt;br /&gt;
|| These are the limitations of QDA&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide.'''&lt;br /&gt;
&lt;br /&gt;
'''Applications of QDA'''&lt;br /&gt;
&lt;br /&gt;
* Medical Diagnosis.&lt;br /&gt;
* Bio-Imaging classification.&lt;br /&gt;
* Fraud Detection.&lt;br /&gt;
&lt;br /&gt;
|| QDA technique is used in several applications.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Implementation Of QDA'''&lt;br /&gt;
|| Let us implement '''QDA '''on the '''Raisin''' '''dataset '''with two chosen variables'''.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For more information on the Raisin data, please see the Additional Reading material on this tutorial page.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide '''&lt;br /&gt;
&lt;br /&gt;
'''Download Files '''&lt;br /&gt;
|| We will use a script file '''QDA.R '''and '''Raisin Dataset ‘raisin.xlsx’'''&lt;br /&gt;
&lt;br /&gt;
Please download these files from the''' Code files''' link of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Make a copy and then use them while practicing.&lt;br /&gt;
|- &lt;br /&gt;
|| [Computer screen]&lt;br /&gt;
&lt;br /&gt;
point to '''QDA.R''' and the folder '''QDA.'''&lt;br /&gt;
&lt;br /&gt;
Point to the''' MLProject folder '''on the '''Desktop.'''&lt;br /&gt;
&lt;br /&gt;
|| I have downloaded and moved these files to the '''QDA '''folder. &lt;br /&gt;
&lt;br /&gt;
This folder is located in the '''MLProject''' folder on my '''Desktop'''.&lt;br /&gt;
&lt;br /&gt;
I have also set the '''QDA''' folder as my working directory.&lt;br /&gt;
&lt;br /&gt;
In this tutorial, we will create a '''QDA''' classifier model on the '''raisin''' dataset. &lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us switch to '''RStudio'''. &lt;br /&gt;
|- &lt;br /&gt;
|| Click QDA.R in RStudio&lt;br /&gt;
&lt;br /&gt;
Point to QDA.R in RStudio.&lt;br /&gt;
|| Let us open the script '''QDA.R''' in '''RStudio'''.&lt;br /&gt;
&lt;br /&gt;
For this, click on the script '''QDA.R.'''&lt;br /&gt;
&lt;br /&gt;
Script '''QDA.R''' opens in '''RStudio'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' library(readxl)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' library(MASS)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
'''library(dplyr)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''&amp;lt;nowiki&amp;gt;#install.packages(&amp;quot;package_name&amp;quot;)&amp;lt;/nowiki&amp;gt;'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Point to the command.'''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select and run these commands to import the packages.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''readxl''' package to load the excel file of our '''Raisin Dataset'''.&lt;br /&gt;
&lt;br /&gt;
The '''MASS''' package contains the '''qda()''' function to create our classifier.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''caret''' package to create the '''confusion matrix.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The '''ggplot2''' package will be used to create the '''decision boundary plot.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will use the '''dplyr''' package to aid the visualisation of the confusion matrix.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please ensure that all the packages are installed correctly.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
As I have already installed the packages, I have directly imported them. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;- read_xlsx(&amp;quot;Raisin.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
|| Click on '''QDA.R''' in the Source window.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command''' data&amp;lt;- read_xlsx(&amp;quot;Raisin.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''data&amp;lt;-data[c(&amp;quot;minorAL&amp;quot;,&amp;quot;ecc&amp;quot;,&amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
|| Run this command to load the '''Raisin '''dataset.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Drag boundary to see the Environment tab clearly.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In the Environment tab below Data, you will see the '''data '''variable.&lt;br /&gt;
&lt;br /&gt;
Then click on '''data '''to load the dataset in the Source window. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [Rstudio]&lt;br /&gt;
&lt;br /&gt;
Type these commands in RStudio.&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
|| Click on '''QDA.R''' in the Source window and close the tab.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command.&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;-data[c(&amp;quot;minorAL&amp;quot;,&amp;quot;ecc&amp;quot;,&amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
Select the commands and click the Run button&lt;br /&gt;
|| We now select three columns from data and convert the variable '''data$class '''to a factor. &lt;br /&gt;
&lt;br /&gt;
Select and run the commands.&lt;br /&gt;
|- &lt;br /&gt;
|| Click on the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click on '''data.'''&lt;br /&gt;
|| Click on '''data '''to load the modified data in the Source window.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the data.&lt;br /&gt;
|| Now let us split our data into training and testing data.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Click on '''QDA.R''' in the Source window.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''index_split&amp;lt;- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) '''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| First we set a seed for reproducible results.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will create a vector of indices using '''sample() '''function.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It will be 70% of the total number of rows for training and 30% for testing.&lt;br /&gt;
&lt;br /&gt;
The training data is chosen using simple random sampling without replacement.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
|-&lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| This creates training data, consisting of 630 unique rows.&lt;br /&gt;
&lt;br /&gt;
This creates testing data, consisting of 270 unique rows.&lt;br /&gt;
|-&lt;br /&gt;
|| Select the commands and click the Run button.&lt;br /&gt;
&lt;br /&gt;
Point to the sets in the Environment Tab&lt;br /&gt;
&lt;br /&gt;
Click the '''train_data '''and '''test_data '''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
The data sets are shown in the '''Environment '''tab.&lt;br /&gt;
&lt;br /&gt;
Click on '''train_data '''and '''test_data '''to load them in the Source window.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let’s perform '''QDA''' on the '''training''' dataset.&lt;br /&gt;
|- &lt;br /&gt;
|| [Rstudio]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''QDA_model &amp;lt;- qda(class~.,data=train_data)'''&lt;br /&gt;
|| Click on '''QDA.R''' in the Source window.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window&lt;br /&gt;
&lt;br /&gt;
type these commands&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''QDA_model &amp;lt;- qda(class~.,data=train_data)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''QDA_model '''&lt;br /&gt;
&lt;br /&gt;
Click Save and Click Run buttons. &lt;br /&gt;
|| We use this command to create the '''QDA''' model.&lt;br /&gt;
&lt;br /&gt;
We pass two parameters to the '''qda()''' function.&lt;br /&gt;
# formula &lt;br /&gt;
# data on which the model should train.&lt;br /&gt;
&lt;br /&gt;
Click Save.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
The output is shown in the '''console '''window.&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the console window.&lt;br /&gt;
|| Drag boundary to see the '''console '''window. &lt;br /&gt;
|- &lt;br /&gt;
|| Point to the output in the '''console '''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''Prior probabilities of groups'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''Group means'''&lt;br /&gt;
|| These are the parameters of our model.&lt;br /&gt;
&lt;br /&gt;
This indicates the composition of classes in the training data.&lt;br /&gt;
&lt;br /&gt;
These indicate the mean values of the predictor variables for each class.&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the '''Source '''window.&lt;br /&gt;
|| Drag boundary to see the '''Source''' window.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us now use our model to make predictions on test data.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''predicted_values &amp;lt;- predict(QDA_model, test_data)'''&lt;br /&gt;
&lt;br /&gt;
'''predicted_values '''&lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Click on '''QDA.R''' in the Source window.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''predicted_values &amp;lt;- predict(QDA_model, test_data)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''predicted_values '''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| Let’s use this command to predict the class variable from the test data using the trained QDA model.&lt;br /&gt;
&lt;br /&gt;
Printing '''predicted_values''' will give us more information about the model, such as '''class''' and '''posterior'''.&lt;br /&gt;
&lt;br /&gt;
This predicts the class and posterior probability for the testing data.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Click on '''predicted_values '''in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Point to the output in the '''console'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''class'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''posterior'''&lt;br /&gt;
|| Click on '''predicted_values''' in the Environment tab&lt;br /&gt;
&lt;br /&gt;
This shows us that our predicted variable has two components.&lt;br /&gt;
&lt;br /&gt;
'''class''' contains the predicted '''classes '''of the testing data.&lt;br /&gt;
&lt;br /&gt;
'''Posterior''' contains the '''posterior probability''' of an observation belonging to each class.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us compute the accuracy of our model.&lt;br /&gt;
|- &lt;br /&gt;
|| '''confusion &amp;lt;- confusionMatrix(test_data$class,predicted_values$class)'''&lt;br /&gt;
&lt;br /&gt;
|| Click on '''QDA.R''' in the source window.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command '''confusionMatrix(test_data$class,predicted_values$class)'''&lt;br /&gt;
&lt;br /&gt;
Point to the confusion in the Environment Tab&lt;br /&gt;
&lt;br /&gt;
Highlight the attribute&lt;br /&gt;
&lt;br /&gt;
'''table'''&lt;br /&gt;
|| This command creates a confusion matrix list.&lt;br /&gt;
&lt;br /&gt;
The list is created from the actual and predicted class labels of testing data.&lt;br /&gt;
&lt;br /&gt;
And it is stored in the confusion variable.&lt;br /&gt;
&lt;br /&gt;
It helps to assess the classification model's performance and accuracy.&lt;br /&gt;
&lt;br /&gt;
Select and run the command. &lt;br /&gt;
&lt;br /&gt;
The confusion matrix list is shown in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click '''confusion '''to load it in the''' Source '''window.&lt;br /&gt;
&lt;br /&gt;
The '''confusion''' list contains a component '''table''' that holds the required confusion matrix.&lt;br /&gt;
|- &lt;br /&gt;
|| '''plot_confusion_matrix &amp;lt;- function(confusion_matrix){'''&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
'''tab = as.data.frame(tab)'''&lt;br /&gt;
&lt;br /&gt;
'''tab$Prediction &amp;lt;- factor(tab$Prediction, levels = rev(levels(tab$Prediction)))'''&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- tab %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''rename(Actual = Reference) %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''mutate(cor = if_else(Actual == Prediction, 1,0))'''&lt;br /&gt;
&lt;br /&gt;
'''tab$cor &amp;lt;- as.factor(tab$cor)'''&lt;br /&gt;
&lt;br /&gt;
'''ggplot(tab, aes(Actual,Prediction)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_tile(aes(fill= cor),alpha = 0.4) + geom_text(aes(label=Freq)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;red&amp;quot;,&amp;quot;green&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_light() +'''&lt;br /&gt;
&lt;br /&gt;
'''theme(legend.position = &amp;quot;None&amp;quot;,'''&lt;br /&gt;
&lt;br /&gt;
'''line = element_blank()) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_x_discrete(position = &amp;quot;top&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
|| Now let’s plot the confusion matrix from the table.&lt;br /&gt;
&lt;br /&gt;
Click on '''QDA.R''' in the source window.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
'''Highlight '''the command &lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- confusion_matrix$table'''&lt;br /&gt;
&lt;br /&gt;
'''Highlight '''the command&lt;br /&gt;
&lt;br /&gt;
'''tab = as.data.frame(tab)'''&lt;br /&gt;
&lt;br /&gt;
'''tab$Prediction &amp;lt;- factor(tab$Prediction, levels = rev(levels(tab$Prediction)))'''&lt;br /&gt;
&lt;br /&gt;
'''tab &amp;lt;- tab %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''rename(Actual = Reference) %&amp;gt;%'''&lt;br /&gt;
&lt;br /&gt;
'''mutate(cor = if_else(Actual == Prediction, 1,0))'''&lt;br /&gt;
&lt;br /&gt;
'''tab$cor &amp;lt;- as.factor(tab$cor)'''&lt;br /&gt;
&lt;br /&gt;
'''Highlight '''the command&lt;br /&gt;
&lt;br /&gt;
'''ggplot(tab, aes(Actual,Prediction)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_tile(aes(fill= cor),alpha = 0.4) + geom_text(aes(label=Freq)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;red&amp;quot;,&amp;quot;green&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_light() +'''&lt;br /&gt;
&lt;br /&gt;
'''theme(legend.position = &amp;quot;None&amp;quot;,'''&lt;br /&gt;
&lt;br /&gt;
'''line = element_blank()) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_x_discrete(position = &amp;quot;top&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
'''}'''&lt;br /&gt;
&lt;br /&gt;
|| These commands create a function '''plot_confusion_matrix '''to display the confusion matrix from the confusion matrix list created.&lt;br /&gt;
&lt;br /&gt;
It fetches the confusion matrix table from the list.&lt;br /&gt;
&lt;br /&gt;
It creates a data frame from the table, which is suitable for plotting using '''ggplot2'''.&lt;br /&gt;
&lt;br /&gt;
It plots the confusion matrix using the data frame created.&lt;br /&gt;
&lt;br /&gt;
It represents correct and incorrect predictions using different colors.&lt;br /&gt;
&lt;br /&gt;
Select and run the commands. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''plot_confusion_matrix(confusion)'''&lt;br /&gt;
&lt;br /&gt;
|| Click on '''QDA.R''' in the '''Source '''window.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''plot_confusion_matrix(confusion)'''&lt;br /&gt;
&lt;br /&gt;
Click on''' Save '''and '''Run '''buttons.&lt;br /&gt;
|| We are using the created '''plot_confusion_matrix()''' function to generate the visual plot of the confusion matrix stored in the '''confusion''' variable.&lt;br /&gt;
&lt;br /&gt;
Select and run the command.&lt;br /&gt;
&lt;br /&gt;
The output is seen in the '''plot''' window.&lt;br /&gt;
|- &lt;br /&gt;
|| Point the output in the '''plot window'''&lt;br /&gt;
|| Drag boundary to see the plot window clearly &lt;br /&gt;
&lt;br /&gt;
Observe that: &lt;br /&gt;
&lt;br /&gt;
22 samples of class Kecimen have been incorrectly classified.&lt;br /&gt;
&lt;br /&gt;
11 samples of class Besni have been incorrectly classified. &lt;br /&gt;
&lt;br /&gt;
Overall, the model has misclassified only '''33''' out of '''270 '''samples.&lt;br /&gt;
&lt;br /&gt;
We can say that our model performs well.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''grid &amp;lt;- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500),'''&lt;br /&gt;
&lt;br /&gt;
'''ecc = seq(min(data$ecc), max(data$ecc), length = 500)) '''&lt;br /&gt;
&lt;br /&gt;
'''grid$class = predict(QDA_model, newdata = grid)$class'''&lt;br /&gt;
&lt;br /&gt;
'''grid$classnum &amp;lt;- as.numeric(grid$class)'''&lt;br /&gt;
&lt;br /&gt;
|| Drag boundary to see the source window clearly.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''grid &amp;lt;- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500),'''&lt;br /&gt;
&lt;br /&gt;
'''ecc = seq(min(data$ecc), max(data$ecc), length = 500)) '''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''grid$class = predict(QDA_model, newdata = grid)$class'''&lt;br /&gt;
&lt;br /&gt;
'''grid$classnum &amp;lt;- as.numeric(grid$class)'''&lt;br /&gt;
&lt;br /&gt;
|| This block of code first creates a '''grid '''of points spanning the range of '''minorAL '''and '''ecc '''features in the dataset.&lt;br /&gt;
&lt;br /&gt;
It stores it in a variable ''''grid''''. &lt;br /&gt;
&lt;br /&gt;
Then, it uses the QDA model to predict the class of each point in this grid.&lt;br /&gt;
&lt;br /&gt;
It stores these predictions as a new column ''''class' '''in the '''grid '''dataframe. &lt;br /&gt;
&lt;br /&gt;
The '''as.numeric''' function encodes the predicted class string labels into numeric values.&lt;br /&gt;
&lt;br /&gt;
The resulting grid of points and their predicted classes will be used to visualize the decision boundaries of the QDA model.&lt;br /&gt;
&lt;br /&gt;
Select and run these commands.&lt;br /&gt;
&lt;br /&gt;
Click '''grid''' on the Environment tab to load the grid dataframe in the source window.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),'''&lt;br /&gt;
&lt;br /&gt;
'''colour = &amp;quot;black&amp;quot;, linewidth = 0.7) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(x = &amp;quot;MinorAL&amp;quot;, y = &amp;quot;ecc&amp;quot;, title = &amp;quot;QDA Decision Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| Click on '''QDA.R''' in the Source window.&lt;br /&gt;
&lt;br /&gt;
In the '''Source''' window type these commands&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class)) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),'''&lt;br /&gt;
&lt;br /&gt;
'''colour = &amp;quot;black&amp;quot;, linewidth = 0.7) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(x = &amp;quot;MinorAL&amp;quot;, y = &amp;quot;ecc&amp;quot;, title = &amp;quot;QDA Decision Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| We are creating the decision boundary plot using '''ggplot2'''. &lt;br /&gt;
&lt;br /&gt;
It plots the grid points with colors indicating the predicted classes. &lt;br /&gt;
&lt;br /&gt;
'''geom_raster '''creates a colour map indicating the predicted classes of the grid points&lt;br /&gt;
&lt;br /&gt;
'''geom_point '''plots the training data points in the plot.&lt;br /&gt;
&lt;br /&gt;
'''geom_contour''' creates the decision boundary of the QDA.&lt;br /&gt;
&lt;br /&gt;
The '''scale_fill_manual''' function assigns specific colors to the classes and so does '''scale_color_manual''' function.&lt;br /&gt;
&lt;br /&gt;
The overall plot provides a visual representation of the decision boundary.&lt;br /&gt;
&lt;br /&gt;
And the distribution of training data points of the '''model'''.&lt;br /&gt;
&lt;br /&gt;
Select and run these commands.&lt;br /&gt;
&lt;br /&gt;
Drag boundaries to see the plot window clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| We can see that the decision boundary of our model is a non-linear curve.&lt;br /&gt;
&lt;br /&gt;
And our model has separated most of the data points clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| With this, we come to the end of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Let us summarize.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Summary&lt;br /&gt;
|| In this tutorial we have learned about:&lt;br /&gt;
* Quadratic Discriminant Analysis (QDA).&lt;br /&gt;
* Comparison between '''QDA '''and''' LDA'''.&lt;br /&gt;
* Assumptions for QDA.&lt;br /&gt;
* Limitations of QDA&lt;br /&gt;
* Applications of QDA&lt;br /&gt;
* Implementation Of QDA using''' Raisin''' Dataset'''.'''&lt;br /&gt;
* Visualization of the '''QDA '''separator&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Here is an assignment for you.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Assignment&lt;br /&gt;
||&lt;br /&gt;
* Apply '''QDA''' on the '''wine''' dataset.&lt;br /&gt;
* Measure the accuracy of the model.&lt;br /&gt;
&lt;br /&gt;
This dataset can be found in the '''HDclassif '''package. &lt;br /&gt;
&lt;br /&gt;
Install the package and import the dataset using the '''data()''' command.&lt;br /&gt;
|- &lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
About the Spoken Tutorial Project&lt;br /&gt;
|| The video at the following link summarizes the Spoken Tutorial project. &lt;br /&gt;
&lt;br /&gt;
Please download and watch it.&lt;br /&gt;
|- &lt;br /&gt;
|| Show slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Workshops&lt;br /&gt;
|| We conduct workshops using Spoken Tutorials and give certificates.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please contact us.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Spoken Tutorial Forum to answer questions&lt;br /&gt;
&lt;br /&gt;
Do you have questions in THIS Spoken Tutorial?&lt;br /&gt;
&lt;br /&gt;
Choose the minute and second where you have the question.&lt;br /&gt;
&lt;br /&gt;
Explain your question briefly.&lt;br /&gt;
&lt;br /&gt;
Someone from the FOSSEE team will answer them.&lt;br /&gt;
&lt;br /&gt;
Please visit this site.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| Please post your timed queries in this forum.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Forum to answer questions&lt;br /&gt;
|| Do you have any general/technical questions?&lt;br /&gt;
&lt;br /&gt;
Please visit the forum given in the link.&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
&lt;br /&gt;
Show Slide&lt;br /&gt;
&lt;br /&gt;
Textbook Companion&lt;br /&gt;
&lt;br /&gt;
|| The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.&lt;br /&gt;
&lt;br /&gt;
We give certificates to those who do this.&lt;br /&gt;
&lt;br /&gt;
For more details, please visit these sites.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Acknowledgment&lt;br /&gt;
|| The '''Spoken Tutorial''' project was established by the Ministry of Education, Government of India.&lt;br /&gt;
|- &lt;br /&gt;
|| Show Slide&lt;br /&gt;
&lt;br /&gt;
Thank You&lt;br /&gt;
|| This tutorial is contributed by Yate Asseke Ronald and Debatosh Chakraborty from IIT Bombay.&lt;br /&gt;
&lt;br /&gt;
Thank you for joining.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Ushav</name></author>	</entry>

	<entry>
		<id>https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Linear-Discriminant-Analysis-in-R/English</id>
		<title>Machine-Learning-using-R/C2/Linear-Discriminant-Analysis-in-R/English</title>
		<link rel="alternate" type="text/html" href="https://script.spoken-tutorial.org/index.php/Machine-Learning-using-R/C2/Linear-Discriminant-Analysis-in-R/English"/>
				<updated>2023-11-30T09:50:47Z</updated>
		
		<summary type="html">&lt;p&gt;Ushav: Created page with &amp;quot;'''Title of the script''': Linear Discriminant Analysis in R  '''Author''': YATE ASSEKE RONALD OLIVERA  and Debatosh Charkraborty  '''Keywords''':  R, RStudio, machine learnin...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title of the script''': Linear Discriminant Analysis in R&lt;br /&gt;
&lt;br /&gt;
'''Author''': YATE ASSEKE RONALD OLIVERA  and Debatosh Chakraborty&lt;br /&gt;
&lt;br /&gt;
'''Keywords''':  R, RStudio, machine learning, supervised, unsupervised, dimensionality reduction, confusion matrix, console, LDA, video tutorial.&lt;br /&gt;
&lt;br /&gt;
{| border=1&lt;br /&gt;
|- &lt;br /&gt;
|| '''Visual Cue'''&lt;br /&gt;
|| '''Narration'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Opening Slide'''&lt;br /&gt;
|| Welcome to this spoken tutorial on '''Linear Discriminant Analysis in R.'''&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Learning Objectives'''&lt;br /&gt;
&lt;br /&gt;
|| In this tutorial, we will learn about: &lt;br /&gt;
# Linear Discriminant Analysis ('''LDA''') and its implementation.&lt;br /&gt;
# Assumptions of LDA&lt;br /&gt;
# Limitations of LDA&lt;br /&gt;
# LDA on a subset of Raisin dataset&lt;br /&gt;
# Visualization of the '''LDA''' separator and its corresponding confusion matrix.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''System Specifications'''&lt;br /&gt;
|| This tutorial is recorded using,&lt;br /&gt;
* '''Windows 11 '''&lt;br /&gt;
* '''R '''version''' 4.3.0'''&lt;br /&gt;
* '''RStudio''' version '''2023.06.1'''&lt;br /&gt;
&lt;br /&gt;
It is recommended to install '''R''' version '''4.2.0''' or higher. &lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide.'''&lt;br /&gt;
&lt;br /&gt;
'''Prerequisites '''&lt;br /&gt;
&lt;br /&gt;
'''https://spoken-tutorial.org'''&lt;br /&gt;
|| To follow this tutorial, the learner should know:&lt;br /&gt;
&lt;br /&gt;
* Basics of '''R''' programming. &lt;br /&gt;
* Basics of '''Machine Learning '''using '''R'''. &lt;br /&gt;
&lt;br /&gt;
If not, please access the relevant tutorials on '''R '''on this website.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide.'''&lt;br /&gt;
&lt;br /&gt;
'''Linear Discriminant Analysis'''&lt;br /&gt;
|| Linear Discriminant Analysis is a statistical method.&lt;br /&gt;
* It is used for classification. &lt;br /&gt;
* It constructs a data driven line that best separates different classes.&lt;br /&gt;
* It is based on maximizing the likelihood function to classify two or more classes.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide.'''&lt;br /&gt;
&lt;br /&gt;
'''Applications of LDA'''&lt;br /&gt;
|| &lt;br /&gt;
* LDA technique is used in several applications like&lt;br /&gt;
&lt;br /&gt;
** Fraud Detection&lt;br /&gt;
** Bio-Imaging classification&lt;br /&gt;
** Classify patient disease state&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Only Narration&lt;br /&gt;
|| Let us now understand the assumptions of LDA.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide '''&lt;br /&gt;
&lt;br /&gt;
'''Assumptions for LDA'''&lt;br /&gt;
|| '''Multivariate Normality: '''&lt;br /&gt;
&lt;br /&gt;
* All data entries are continuous, Gaussian, with equal covariance matrix for all the classes.&lt;br /&gt;
* Mean vectors for each class are different. &lt;br /&gt;
* Data records are independent and identically distributed among each class.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide '''&lt;br /&gt;
&lt;br /&gt;
'''Limitations of LDA'''&lt;br /&gt;
|| Now we will see the limitations of LDA.&lt;br /&gt;
&lt;br /&gt;
* Departure from Gaussianity may increase misclassification probability in LDA.&lt;br /&gt;
* '''LDA''' may perform poorly if data has unequal class covariance matrix.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Implementation Of LDA'''&lt;br /&gt;
|| Now let us implement '''LDA''' on the '''raisin dataset '''with two chosen variables'''.'''&lt;br /&gt;
&lt;br /&gt;
More information on '''raisin''' data is available in the '''Additional Reading material''' on this tutorial page.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide '''&lt;br /&gt;
&lt;br /&gt;
'''Download Files''' &lt;br /&gt;
|| We will use a script file '''LDA.R'''&lt;br /&gt;
&lt;br /&gt;
Please download this file from the''' Code files''' link of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Make a copy and then use it for practicing.&lt;br /&gt;
|- &lt;br /&gt;
|| [Computer screen]&lt;br /&gt;
&lt;br /&gt;
Point to '''LDA.R''' and the folder '''LDA.'''&lt;br /&gt;
&lt;br /&gt;
Point to the''' MLProject folder '''on the '''Desktop.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to the''' LDA folder.'''&lt;br /&gt;
|| I have downloaded and moved these files to the '''LDA '''folder.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This folder is in the '''MLProject''' folder on my '''Desktop'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
I have also set the '''LDA''' folder as my working''' directory'''.&lt;br /&gt;
|- &lt;br /&gt;
|| Point to the script file '''LDA.R.'''&lt;br /&gt;
|| In this tutorial, we will create a '''LDA''' classifier model on the '''raisin''' dataset. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Let us switch to '''RStudio'''.&lt;br /&gt;
|- &lt;br /&gt;
|| Open '''LDA.R '''in '''RStudio'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to''' LDA.R''' in '''RStudio'''.&lt;br /&gt;
|| Open the script '''LDA.R''' in '''RStudio'''.&lt;br /&gt;
&lt;br /&gt;
For this, click on the script '''LDA.R.'''&lt;br /&gt;
&lt;br /&gt;
Script '''LDA.R''' opens in '''RStudio'''.&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the '''Readxl package.'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(MASS) '''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''library(lattice)'''&lt;br /&gt;
&lt;br /&gt;
Highlight all the commands.&lt;br /&gt;
&lt;br /&gt;
'''&amp;lt;nowiki&amp;gt;#install.packages(“package_name”)&amp;lt;/nowiki&amp;gt;'''&lt;br /&gt;
|| '''Readxl package''' is used to load the '''Excel''' file.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The''' MASS package''' contains the '''lda()''' function that we will use for our analysis.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The '''ggplot2 package''' is used to plot the results of our analysis.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The '''caret package''' contains the&lt;br /&gt;
&lt;br /&gt;
'''confusionMatrix''' function.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It is used as a measure for the performance of the classifier.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please note that in order to import these libraries, we need to install them.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please ensure that everything is installed correctly. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You can use the command '''install.packages(“package_name”)''' to install the required packages.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
As I have already installed these packages, I will directly import them. &lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''library(readxl)'''&lt;br /&gt;
&lt;br /&gt;
'''library(MASS)'''&lt;br /&gt;
&lt;br /&gt;
'''library(ggplot2)'''&lt;br /&gt;
&lt;br /&gt;
'''library(caret)'''&lt;br /&gt;
&lt;br /&gt;
'''library(lattice)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|| Select and run these commands to import the requisite packages.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command''' '''&lt;br /&gt;
&lt;br /&gt;
'''data &amp;lt;- read_xlsx(&amp;quot;Raisin.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' data&amp;lt;-data[c(&amp;quot;minorAL&amp;quot;,&amp;quot;ecc&amp;quot;,&amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the commands.&lt;br /&gt;
&lt;br /&gt;
'''data &amp;lt;- read_xlsx(&amp;quot;Raisin.xlsx&amp;quot;)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''data&amp;lt;-data[c(&amp;quot;minorAL&amp;quot;,&amp;quot;ecc&amp;quot;,&amp;quot;class&amp;quot;)]'''&lt;br /&gt;
&lt;br /&gt;
|| We will read the '''Excel''' file and choose 3 columns: two features ('''minorAL, ecc''') and one target ('''class''') variable.&lt;br /&gt;
&lt;br /&gt;
Run these commands to import the '''raisin''' dataset.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the '''Environment '''tab clearly.&lt;br /&gt;
&lt;br /&gt;
Point to the data variable in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
Click the data to load the dataset.&lt;br /&gt;
&lt;br /&gt;
|| Drag boundary to see the Environment tab clearly.&lt;br /&gt;
&lt;br /&gt;
In the Environment tab under '''Data '''heading, you will see a '''data '''variable.&lt;br /&gt;
&lt;br /&gt;
Click the data''' variable''' to load the dataset in the '''Source''' window. &lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the Source window clearly.&lt;br /&gt;
|| Drag boundary to see the '''Source '''window clearly.&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
||[RStudio]&lt;br /&gt;
&lt;br /&gt;
Type these commands in the source window.&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window type this command.&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
||Highlight the below commands.&lt;br /&gt;
&lt;br /&gt;
'''data$class &amp;lt;- factor(data$class)'''&lt;br /&gt;
&lt;br /&gt;
Select the commands and click the Run button.&lt;br /&gt;
&lt;br /&gt;
||Here we are converting the variable '''data$class''' to a factor.&lt;br /&gt;
&lt;br /&gt;
It ensures that the categorical data is properly encoded. &lt;br /&gt;
&lt;br /&gt;
Select the command and run it.&lt;br /&gt;
|-&lt;br /&gt;
||Only Narration.&lt;br /&gt;
|| Now we split our dataset into training and testing data.&lt;br /&gt;
|-&lt;br /&gt;
||[RStudio]&lt;br /&gt;
&lt;br /&gt;
Type the command in the source window.&lt;br /&gt;
&lt;br /&gt;
'''set.seed(1) '''&lt;br /&gt;
&lt;br /&gt;
'''index_split=sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE)'''&lt;br /&gt;
||In the '''Source''' window type these commands.&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
||Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''set.seed(1)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''replace=FALSE'''&lt;br /&gt;
&lt;br /&gt;
Select the commands and click the Run button.&lt;br /&gt;
||First we set a seed for reproducible results.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will create a vector of indices using '''sample() '''function.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This will be 70% for training and 30% for testing.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The training data is chosen using simple random sampling without replacement. &lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
|-&lt;br /&gt;
|| &lt;br /&gt;
|| The vector is shown in the''' Environment '''tab.&lt;br /&gt;
|-&lt;br /&gt;
||Point to train-test split.&lt;br /&gt;
|| We use the indices that we previously generated to obtain our train-test split.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
Type the command&lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data [index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| In the '''Source '''window type these commands.&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''train_data &amp;lt;- data[index_split, ]'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''test_data &amp;lt;- data[-c(index_split), ]'''&lt;br /&gt;
|| This creates training data, consisting of 630 unique rows.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This creates testing data, consisting of 270 unique rows.&lt;br /&gt;
|- &lt;br /&gt;
|| Select the commands and click the Run button.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to the sets in the Environment Tab&lt;br /&gt;
|| Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The data sets are shown in the Environment tab.&lt;br /&gt;
  &lt;br /&gt;
&lt;br /&gt;
Click on '''test_data '''and '''train_data '''to load them in the Source window.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Only Narration.&lt;br /&gt;
|| Let us train our '''LDA''' model.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''LDA_model &amp;lt;- lda(class~.,data=train_data)'''&lt;br /&gt;
&lt;br /&gt;
'''LDA_model'''&lt;br /&gt;
|| In the '''Source '''window, type these commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''LDA_model &amp;lt;- lda(class~.,data=train_data)'''&lt;br /&gt;
&lt;br /&gt;
'''LDA_model'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''LDA_model'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Click on Save and Run buttons.&lt;br /&gt;
&lt;br /&gt;
Point to the output in the '''console '''window.&lt;br /&gt;
|| We pass two parameters to the '''lda()''' function.&lt;br /&gt;
# formula &lt;br /&gt;
# data on which the model should train.&lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
The output is shown in the '''console''' window.&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the '''console''' window.&lt;br /&gt;
|| Drag boundary to see the '''console '''window clearly.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight '''output''' in the '''console.'''&lt;br /&gt;
|| Our '''model''' provides us with a lot of information.&lt;br /&gt;
&lt;br /&gt;
Let us go through them one at a time.&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command '''Prior probabilities of groups. '''&lt;br /&gt;
&lt;br /&gt;
Highlight the command''' Group means.'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command '''Coefficients of linear discriminants '''&lt;br /&gt;
&lt;br /&gt;
|| These explain the distribution of classes in the training dataset.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
These display the mean values of each '''predictor '''variable for each '''class'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
These display the '''linear combination of predictor''' variables. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The given linear combinations form the decision rule of the '''LDA''' model.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Drag boundary to see the Source window.&lt;br /&gt;
|| Drag boundary to see the '''Source '''window clearly.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Let us use this model to make predictions on the testing data.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''predicted_values &amp;lt;- predict(LDA_model, test_data)'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source '''window type this command and run it. &lt;br /&gt;
&lt;br /&gt;
Let us check what '''predicted_values''' contain.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Click the '''predicted_values '''data in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Point to the table.&lt;br /&gt;
|| Click the '''predicted_values '''data in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
The '''predicted_values '''table is loaded in the '''Source''' window.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''head(predicted_values$class)'''&lt;br /&gt;
&lt;br /&gt;
'''head(predicted_values$posterior)'''&lt;br /&gt;
&lt;br /&gt;
'''head(predicted_values$x)'''&lt;br /&gt;
|| In the '''Source''' window type these commands and run them.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The output is seen in the''' console''' window.&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command output of '''head(predicted_values$class) '''in the '''console.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command output of '''head(predicted_values$posterior)''' in the '''console.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Highlight the command output of '''head(predicted_values$x) '''in '''console'''&lt;br /&gt;
|| It contains the class that the model has predicted for each observation.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It contains the '''posterior probability''' of the observation belonging to each class.&lt;br /&gt;
&lt;br /&gt;
This contains the linear discriminants for each observation.&lt;br /&gt;
&lt;br /&gt;
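As a side note, the three components described above can be inspected directly. This is a minimal sketch, assuming '''LDA_model''' was fitted with the '''MASS''' package's '''lda()''' function (whose '''predict()''' method returns exactly these components) and '''test_data''' is the held-out data frame from the earlier steps:

```r
# Sketch, assuming LDA_model was fitted with MASS::lda() and
# test_data is the held-out data frame from earlier steps.
library(MASS)

predicted_values <- predict(LDA_model, test_data)

str(predicted_values$class)      # factor of predicted class labels
str(predicted_values$posterior)  # matrix of per-class posterior probabilities
str(predicted_values$x)          # matrix of linear discriminant scores
```

Running '''str()''' on each component confirms its type before it is used in later steps.&lt;br /&gt;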
|- &lt;br /&gt;
|| Only Narration.&lt;br /&gt;
|| Now we will measure the performance of our model using the '''Confusion Matrix'''.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''confusion &amp;lt;- table(test_data$class, predicted_values$class)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''fourfoldplot(confusion, color = c(&amp;quot;red&amp;quot;, &amp;quot;green&amp;quot;), conf.level = 0, margin=1)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Click on '''Save '''and''' Run''' buttons.&lt;br /&gt;
|| In the '''Source '''window type these commands.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Save and run the commands.&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command '''confusion &amp;lt;- table(test_data$class, predicted_values$class)'''&lt;br /&gt;
&lt;br /&gt;
Highlight the command&lt;br /&gt;
&lt;br /&gt;
'''fourfoldplot(confusion, color = c(&amp;quot;red&amp;quot;, &amp;quot;green&amp;quot;), conf.level = 0, margin=1)'''&lt;br /&gt;
&lt;br /&gt;
|| The '''table()''' function creates a confusion matrix.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The '''fourfoldplot()''' function generates a visual plot of the confusion matrix.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The output is seen in the '''plot''' window.&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the plot in '''plot window '''&lt;br /&gt;
|| Drag boundary to see the plot window clearly.&lt;br /&gt;
&lt;br /&gt;
Given the specific seed set with '''set.seed(1)''', LDA has misclassified 33 out of 270 observations.&lt;br /&gt;
&lt;br /&gt;
This number may change for different sets of training data. &lt;br /&gt;
&lt;br /&gt;
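The misclassification count quoted above can also be checked directly from the confusion matrix. A minimal sketch, assuming '''confusion''' is the table built earlier with true classes in rows and predictions in columns:

```r
# Sketch: derive counts from the confusion matrix built with table().
# Assumes `confusion` has true classes in rows, predictions in columns.
correct <- sum(diag(confusion))   # diagonal entries are correct predictions
total <- sum(confusion)           # all test observations
misclassified <- total - correct
accuracy <- correct / total
cat("Misclassified:", misclassified, "of", total,
    "- accuracy:", round(accuracy, 3), "\n")
```

The diagonal of the matrix holds the correctly classified observations, so accuracy follows directly from it.&lt;br /&gt;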
|- &lt;br /&gt;
|| Only Narration.&lt;br /&gt;
|| Let us visualize how well our model separates different classes.&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
'''X &amp;lt;- seq(min(train_data$minorAL), max(train_data$minorAL), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Y &amp;lt;- seq(min(train_data$ecc), max(train_data$ecc), length.out = 100)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''min_max &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''min_max$predicted_class &amp;lt;- predict(LDA_model, newdata = min_max)$class'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid &amp;lt;- expand.grid(minorAL = X, ecc = Y)'''&lt;br /&gt;
&lt;br /&gt;
'''grid$class &amp;lt;- predict(LDA_model, newdata = grid)$class'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''grid$classnum &amp;lt;- as.numeric(grid$class)'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Click on Save and Run buttons.&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window, type these commands.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This block of code sets up the data for plotting the decision boundary.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It builds a square grid of coordinates spanning the range of the training data, together with the predicted class for each grid point.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The '''seq''' function generates a sequence of evenly spaced values between the smallest and largest values of the 'minorAL' and 'ecc' variables in the training data.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The '''grid''' variable contains the generated grid points along with the '''LDA_model''' predictions on them.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The '''as.numeric''' function encodes the predicted class labels as numeric values.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
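The grid-building pattern described above can be illustrated on a tiny example. The ranges below are illustrative only, not the actual 'minorAL' and 'ecc' ranges:

```r
# Sketch of the seq() + expand.grid() pattern used above, on a tiny range.
X <- seq(0, 1, length.out = 3)     # evenly spaced values: 0.0, 0.5, 1.0
Y <- seq(10, 20, length.out = 3)   # evenly spaced values: 10, 15, 20
grid <- expand.grid(minorAL = X, ecc = Y)

nrow(grid)   # 3 x 3 = 9 grid points, one row per (minorAL, ecc) pair
head(grid)
```

'''expand.grid''' returns every combination of the two sequences, which is what lets the model be evaluated over the whole plotting region.&lt;br /&gt;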
|- &lt;br /&gt;
|| Point to the Environment tab.&lt;br /&gt;
|| Drag boundary to see the details in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
These variables contain the data for visualizing the decision boundary.&lt;br /&gt;
&lt;br /&gt;
Click the '''grid''' '''data''' in the Environment tab.&lt;br /&gt;
&lt;br /&gt;
The '''grid data''' table is loaded in the '''Source''' window.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| [RStudio]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = min_max, aes(x = minorAL, y = ecc, color = predicted_class), size = 1, alpha = 0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data=grid, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_contour(data= grid, aes(x=minorAL, y=ecc, z = classnum), colour=&amp;quot;black&amp;quot;, linewidth = 1.2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;LDA Decision Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
|| In the '''Source''' window, type these commands.&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| Highlight the command &lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = min_max, aes(x = minorAL, y = ecc, color = predicted_class), size = 1, alpha = 0.3) +theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''ggplot() +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_raster(data=grid, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class), size = 2) +'''&lt;br /&gt;
&lt;br /&gt;
'''geom_contour(data= grid, aes(x=minorAL, y=ecc, z = classnum), colour=&amp;quot;black&amp;quot;, linewidth = 1.2) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_fill_manual(values = c(&amp;quot;#ffff46&amp;quot;, &amp;quot;#FF46e9&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''scale_color_manual(values = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)) +'''&lt;br /&gt;
&lt;br /&gt;
'''labs(title = &amp;quot;LDA Decision Boundary&amp;quot;) +'''&lt;br /&gt;
&lt;br /&gt;
'''theme_minimal()'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Select the commands and run them.&lt;br /&gt;
&lt;br /&gt;
|| This command creates the decision boundary plot.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It plots the '''grid''' points with colors indicating the predicted classes.&lt;br /&gt;
&lt;br /&gt;
'''geom_raster''' creates a colour map indicating the predicted classes of the grid points.&lt;br /&gt;
&lt;br /&gt;
'''geom_contour''' draws the decision boundary of the LDA model.&lt;br /&gt;
&lt;br /&gt;
The '''scale_color_manual''' and '''scale_fill_manual''' functions assign specific colors to the point and fill classes respectively.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The overall plot provides a visual representation of the model's decision boundary and the distribution of the training data points.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Select and run these commands.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Drag boundaries to see the plot window clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| Point the output in the '''Plots '''window&lt;br /&gt;
|| We can see that our model has separated most of the data points clearly.&lt;br /&gt;
|- &lt;br /&gt;
|| Only Narration&lt;br /&gt;
|| With this, we come to the end of this tutorial.&lt;br /&gt;
&lt;br /&gt;
Let us summarize.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Summary'''&lt;br /&gt;
|| In this tutorial we have learnt:&lt;br /&gt;
&lt;br /&gt;
* Linear Discriminant Analysis ('''LDA''') and its implementation&lt;br /&gt;
* Assumptions of LDA&lt;br /&gt;
* Limitations of LDA&lt;br /&gt;
* LDA on a subset of the Raisin dataset&lt;br /&gt;
* Visualization of the '''LDA''' separator and its corresponding confusion matrix&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| &lt;br /&gt;
|| Now we will suggest an assignment for this Spoken Tutorial.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Assignment'''&lt;br /&gt;
|| &lt;br /&gt;
* Perform LDA on the inbuilt '''PlantGrowth''' dataset&lt;br /&gt;
* Evaluate the model using a confusion matrix and visualize the results&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''About the Spoken Tutorial Project'''&lt;br /&gt;
|| The video at the following link summarizes the Spoken Tutorial project. &lt;br /&gt;
&lt;br /&gt;
Please download and watch it.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show slide'''&lt;br /&gt;
&lt;br /&gt;
'''Spoken Tutorial Workshops'''&lt;br /&gt;
|| We conduct workshops using Spoken Tutorials and give certificates.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please contact us.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Spoken Tutorial Forum to answer questions.'''&lt;br /&gt;
&lt;br /&gt;
Do you have questions in THIS Spoken Tutorial?&lt;br /&gt;
&lt;br /&gt;
Choose the minute and second where you have the question. Explain your question briefly.&lt;br /&gt;
&lt;br /&gt;
Someone from the FOSSEE team will answer them.&lt;br /&gt;
&lt;br /&gt;
Please visit this site.&lt;br /&gt;
|| Please post your timed queries in this forum.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Forum to answer questions'''&lt;br /&gt;
|| Do you have any general/technical questions?&lt;br /&gt;
&lt;br /&gt;
Please visit the forum given in the link.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Textbook Companion'''&lt;br /&gt;
|| The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.&lt;br /&gt;
&lt;br /&gt;
We give certificates to those who do this.&lt;br /&gt;
&lt;br /&gt;
For more details, please visit these sites.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Acknowledgment'''&lt;br /&gt;
|| The '''Spoken Tutorial''' project was established by the Ministry of Education, Government of India.&lt;br /&gt;
|- &lt;br /&gt;
|| '''Show Slide'''&lt;br /&gt;
&lt;br /&gt;
'''Thank You'''&lt;br /&gt;
|| This tutorial is contributed by Yate Asseke Ronald and Debatosh Chakraborthy from IIT Bombay.&lt;br /&gt;
&lt;br /&gt;
Thank you for joining.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Ushav</name></author>	</entry>

	</feed>