Difference between revisions of "Machine-Learning-using-R/C2/Logistic-Regression-in-R/English"

Latest revision as of 16:01, 31 May 2024

Title of the script: Logistic Regression

Author: Yate Asseke Ronald Olivera and Debatosh Chakraborty

Keywords: R, RStudio, machine learning, supervised, unsupervised, classification, logistic regression, video tutorial.

Visual Cue	Narration
Show slide Opening Slide	Welcome to this spoken tutorial on Logistic Regression in R.
Show slide Learning Objectives	In this tutorial, we will learn about: Logistic Regression, Assumptions of Logistic Regression, Advantages of Logistic Regression, Implementation of Logistic Regression in R using the Raisin dataset, Model Evaluation, Visualization of the model Decision Boundary, and Limitations of Logistic Regression.
Show slide System Specifications	This tutorial is recorded using Windows 11, R version 4.3.0, and RStudio version 2023.06.1. It is recommended to install R version 4.2.0 or higher.
Show slide Prerequisites	To follow this tutorial, the learner should know: Basic programming in R. Basics of Machine Learning. If not, please access the relevant tutorials on this website.
	Let us learn what logistic regression is
Show slide Logistic Regression	Logistic regression is a statistical model used for classification. It models the probability of success as a function of the explanatory variables. Unlike linear regression, which predicts the response itself, it predicts a probability. The predicted probability is then used as a classifier. The probability of success is modeled using the logit (log-odds) function. It is a linear classifier, as the logistic regression model has a linear logit. It is often used when the response variable is categorical with two classes.
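The link between the linear logit and the predicted probability described above can be sketched in a few lines of base R. This is only an illustration: the coefficient values b0 and b1 and the inputs x are made up and do not come from the Raisin model.

```r
# Logit (log-odds) and its inverse, using base R only.
p <- c(0.1, 0.5, 0.9)

# Log-odds computed by hand and with the built-in qlogis().
log_odds <- log(p / (1 - p))
stopifnot(isTRUE(all.equal(log_odds, qlogis(p))))

# plogis() maps a linear predictor back to a probability.
# b0 and b1 are hypothetical coefficients chosen for illustration.
b0 <- -1
b1 <- 2
x    <- c(-1, 0, 0.5, 3)
eta  <- b0 + b1 * x     # the logit: linear in x
prob <- plogis(eta)     # always strictly between 0 and 1

print(data.frame(x, eta, prob))
```

Wherever the logit is zero (here at x = 0.5) the probability is exactly 0.5, which is why a 0.5 threshold on the probability corresponds to a linear boundary in the features.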
Show slide Assumptions of Logistic Regression The distribution of the dependent variable is Bernoulli. The data records are independent.	The dependent variable's distribution is typically assumed to be a Bernoulli distribution in logistic regression.
Show slide Advantages of Logistic Regression It provides estimates of regression coefficients along with their standard errors. It also provides the predicted probability which in turn is used as a classifier. It doesn’t need explanatory variables to be necessarily continuous. In this sense, it is a more general classifier than LDA and QDA.	Logistic regression offers a significant advantage in that continuous explanatory variables are not a requirement.
Show Slide Implementation Of Logistic Regression	We will implement logistic regression using the Raisin dataset. The additional reading material has more details on the Raisin dataset. Please refer to it.
Show slide Download Files	We will use the script file LogisticRegression.R and the Raisin dataset file ‘raisin.xlsx’. Please download these files from the Code files link of this tutorial. Make a copy and then use them while practicing.
[Computer screen] Highlight LogisticRegression.R Logistic Regression folder.	I have downloaded and moved these files to the Logistic Regression folder. This folder is located in the MLProject folder. I have also set the Logistic Regression folder as my Working Directory. Let’s create a Logistic Regression classifier model on the raisin dataset.
	Let us switch to RStudio.
Click LogisticRegression.R in RStudio Point to LogisticRegression.R in RStudio.	Open the script LogisticRegression.R in RStudio. For this, click on the script LogisticRegression.R. Script LogisticRegression.R opens in RStudio.
[RStudio] Highlight the commands library(readxl) library(caret) library(VGAM) library(ggplot2) library(dplyr) #install.packages("package_name") Point to the command.	Select and run these commands to import the necessary packages. The glm() function required to create our classifier belongs to the stats package, which is loaded with R by default. As I have already installed the packages, I have directly imported them.
[RStudio] Highlight data <- read_xlsx("Raisin_Dataset.xlsx") data <- data[c("minorAL", "ecc", "class")] data$class <- factor(data$class) Highlight the commands.	These commands will load the Raisin dataset. They keep only the minorAL, ecc and class columns and convert class to a factor, which prepares the dataset for model building. Select and run the commands.
Drag boundary to see the Environment tab. Click on data on the Environment tab.	Click on data in the Environment tab. It loads the modified dataset in the Source window.
Point to the data.	Now we split our dataset into training and testing data.
[RStudio] set.seed(1) trainIndex <- sample(1:nrow(data), size = 0.7 * nrow(data), replace = FALSE) train <- data[trainIndex, ] test <- data[-trainIndex, ]	In the Source window type these commands.
Highlight set.seed(1) Highlight trainIndex <- sample(1:nrow(data), size = 0.7 * nrow(data), replace = FALSE) Highlight train <- data[trainIndex, ] Highlight test <- data[-trainIndex, ] Click on Save and Run buttons. Click on train and test to load them in the Source window.	We set a seed so that the split is reproducible. We use 70% of the rows for training and the remaining 30% for testing. Select the commands and run them.
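The 70/30 split above can be tried without the dataset file. The sketch below uses a made-up data frame named toy as a stand-in; its size, column values and the names toy_train and toy_test are assumptions for illustration only.

```r
# Hypothetical stand-in data: two numeric features and a two-level class.
set.seed(1)
toy <- data.frame(
  minorAL = rnorm(900, mean = 250, sd = 50),
  ecc     = runif(900, min = 0.3, max = 0.95),
  class   = factor(rep(c("Besni", "Kecimen"), each = 450))
)

# Draw 70% of the row indices without replacement for training.
n_train     <- round(0.7 * nrow(toy))
train_index <- sample(seq_len(nrow(toy)), size = n_train, replace = FALSE)

toy_train <- toy[train_index, ]    # rows selected for training
toy_test  <- toy[-train_index, ]   # all remaining rows

# The two parts are disjoint and together cover every row.
print(c(train = nrow(toy_train), test = nrow(toy_test)))
```

Negative indexing with -train_index is what guarantees that no row appears in both parts.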
	Let us create a Logistic Regression model on the training dataset.
[RStudio] Logistic_model <- glm(class ~ ., data = train, family = "binomial") summary(Logistic_model)$coef	In the Source window type these commands
Highlight glm() Highlight class ~ . Highlight family = "binomial" Highlight train	The function glm() fits generalized linear models. Logistic regression is among the class of models that it fits. The formula class ~ . specifies our model. We try to predict the target variable class based on the minorAL and ecc features. The argument family = "binomial" ensures that our model predicts the probability for 2 classes. It ensures that, out of all the models in glm, the logistic regression model is fit. The argument data = train gives the data used to train our model. Select the commands and run them. The output is shown in the console window.
Drag boundary to see the console window.	Drag boundary to see the console window.
Point the output in the console Highlight Coefficients Highlight Pr(>|z|)	The estimates are the coefficients of the logit function. For example, the log-odds of the Kecimen class change by -0.04 for every unit increase in minorAL, with ecc held fixed. The small p-values suggest that the effects are statistically significant.
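To make the reading of the coefficient table concrete, here is a minimal sketch on simulated data. The variables x1, x2 and y and the true coefficients used in the simulation are assumptions, not values from the Raisin model. Exponentiating a coefficient turns a change in log-odds into a multiplicative change in the odds.

```r
set.seed(42)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)

# Simulated binary response with true log-odds 0.5 - 1.2 * x1 + 0.8 * x2.
y   <- rbinom(n, size = 1, prob = plogis(0.5 - 1.2 * x1 + 0.8 * x2))
toy <- data.frame(x1, x2, y)

fit <- glm(y ~ x1 + x2, data = toy, family = binomial)

# Columns: Estimate, Std. Error, z value, Pr(>|z|)
coef_table <- summary(fit)$coefficients

# Odds ratio: factor by which the odds change per unit increase in a feature.
odds_ratio <- exp(coef(fit))

print(round(coef_table, 3))
print(round(odds_ratio, 3))
```

A negative estimate gives an odds ratio below 1, meaning the odds of the modeled class shrink as that feature grows.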
Drag boundary to see the Source window.	Drag boundary to see the Source window.
	Let us now use our model to make predictions on test data.
[RStudio] Predicted.prob <- predict(Logistic_model, test, type="response") View(Predicted.prob)	In the Source window type these commands
Highlight Predicted.prob <- predict(Logistic_model, test, type="response") Highlight type = "response"	This command provides the predicted probability of the logistic regression model on the test dataset. The argument type = "response" ensures the outcome is a probability. Select and run the commands.
Point Value	Predicted.prob stores the predicted probability of each observation belonging to the Kecimen class.
predicted.classes <- factor(ifelse(Predicted.prob > 0.5, "Kecimen", "Besni"))	In the Source window type the following command.
Highlight predicted.classes <- factor(ifelse(Predicted.prob > 0.5, "Kecimen", "Besni"))	This retrieves the predicted classes from the probabilities. If the probability is greater than 0.5, the Kecimen class is chosen; otherwise the Besni class is chosen. We also convert the output to a factor datatype to fit in the confusionMatrix() function. Select and run the command.
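The thresholding step can be checked on a handful of made-up probabilities. In this sketch the vector prob is hypothetical, and passing explicit levels to factor() is a suggested safeguard rather than part of the tutorial script. It keeps both classes in the factor even when one of them is never predicted.

```r
# Made-up predicted probabilities of the second class ("Kecimen").
prob <- c(0.92, 0.10, 0.50, 0.73, 0.49)

# Fixing the levels keeps both classes present in the factor,
# which matters when it is compared with the reference labels later.
class_levels <- c("Besni", "Kecimen")
pred <- factor(ifelse(prob > 0.5, "Kecimen", "Besni"), levels = class_levels)

print(pred)
print(table(pred))
```

Note that a probability of exactly 0.5 is not greater than the threshold, so that observation is assigned to Besni.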
	Let us measure the accuracy of our model.
confusion_matrix <- confusionMatrix(predicted.classes, test$class)	In the Source window type this command.
Highlight the command confusionMatrix(predicted.classes, test$class) Point to confusion_matrix in the Environment Tab Highlight the attribute table	This command creates a confusion matrix list. The list is created from the actual and predicted class labels, and it is stored in the confusion_matrix variable. It helps to assess the classification model's performance and accuracy. Select and run the command.
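The table inside the confusion matrix list and the accuracy derived from it can be reproduced with base R. The sketch below uses eight made-up labels; the vectors actual and predicted are assumptions for illustration. It cross-tabulates them the same way, with predictions in rows and reference labels in columns.

```r
class_levels <- c("Besni", "Kecimen")

# Hypothetical true and predicted labels for eight observations.
actual <- factor(
  c("Besni", "Besni", "Besni", "Kecimen", "Kecimen", "Kecimen", "Kecimen", "Besni"),
  levels = class_levels
)
predicted <- factor(
  c("Besni", "Kecimen", "Besni", "Kecimen", "Kecimen", "Besni", "Kecimen", "Besni"),
  levels = class_levels
)

# Rows are predictions, columns are the reference (actual) labels.
cm <- table(Prediction = predicted, Reference = actual)

# Accuracy is the share of observations on the diagonal.
accuracy <- sum(diag(cm)) / sum(cm)

print(cm)
print(accuracy)
```

The off-diagonal cells are the misclassifications, one cell per kind of error.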
plot_confusion_matrix <- function(confusion_matrix){ tab <- confusion_matrix$table tab <- as.data.frame(tab) tab$Prediction <- factor(tab$Prediction, levels = rev(levels(tab$Prediction))) tab <- tab %>% rename(Actual = Reference) %>% mutate(cor = if_else(Actual == Prediction, 1,0)) tab$cor <- as.factor(tab$cor) ggplot(tab, aes(Actual,Prediction)) + geom_tile(aes(fill= cor),alpha = 0.4) + geom_text(aes(label=Freq)) + scale_fill_manual(values = c("red","green")) + theme_light() + theme(legend.position = "none", line = element_blank()) + scale_x_discrete(position = "top") }	In the Source window type these commands.
Highlight the command tab <- confusion_matrix$table Highlight the command tab <- as.data.frame(tab) tab$Prediction <- factor(tab$Prediction, levels = rev(levels(tab$Prediction))) tab <- tab %>% rename(Actual = Reference) %>% mutate(cor = if_else(Actual == Prediction, 1,0)) tab$cor <- as.factor(tab$cor) Highlight the command ggplot(tab, aes(Actual,Prediction)) + geom_tile(aes(fill= cor),alpha = 0.4) + geom_text(aes(label=Freq)) + scale_fill_manual(values = c("red","green")) + theme_light() + theme(legend.position = "none", line = element_blank()) + scale_x_discrete(position = "top") }	These commands create a function plot_confusion_matrix() to display the confusion matrix from the confusion matrix list created. It fetches the confusion matrix table from the list. It creates a data frame from the table which is suitable for plotting using ggplot2. It plots the confusion matrix using the data frame created. It represents correct and incorrect predictions using different colors. Select and run the commands.
[RStudio] plot_confusion_matrix(confusion_matrix)	In the Source window type this command.
Highlight the command plot_confusion_matrix(confusion_matrix) Click on Save and Run buttons.	We use the plot_confusion_matrix() function to generate a visual plot of the confusion matrix list created. Select and run the command. The output is seen in the plot window.
Output in Plot window.	This plot shows how well our model predicted the testing data. We observe 21 misclassifications of the Besni class and 13 misclassifications of the Kecimen class.
[RStudio] grid <- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500), ecc = seq(min(data$ecc), max(data$ecc), length = 500)) grid$prob <- predict(Logistic_model, newdata = grid, type = "response") grid$class <- ifelse(grid$prob > 0.5, 'Kecimen', 'Besni') grid$classnum <- as.numeric(as.factor(grid$class))	We will now visualize the decision boundary of the model. In the Source window type these commands.
Highlight the command grid <- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500), ecc = seq(min(data$ecc), max(data$ecc), length = 500)) grid$prob <- predict(Logistic_model, newdata = grid, type = "response") grid$class <- ifelse(grid$prob > 0.5, 'Kecimen', 'Besni') grid$classnum <- as.numeric(as.factor(grid$class))	This code first generates a grid of points spanning the range of the minorAL and ecc features in the dataset. Then, it uses the Logistic Regression model to predict the probability for each point in this grid, storing these predictions as a new column 'prob' in the grid data frame. It converts the predicted probabilities of the points into classes. If the probability exceeds 0.5, the Kecimen class is chosen; otherwise the Besni class is chosen. The predicted classes are stored in the ‘class’ column of the grid data frame. The as.numeric() function encodes the string labels of the predicted classes as numeric values. Select and run the commands. Click on grid in the Environment tab to load the generated data in the Source window.
[RStudio] ggplot() + geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) + geom_point(data = train, aes(x = minorAL, y = ecc, color = class)) + geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum), colour = "black", linewidth = 0.7) + scale_fill_manual(values = c("#ffff46", "#FF46e9")) + scale_color_manual(values = c("red", "blue")) + labs(x = "MinorAL", y = "ecc", title = "Logistic Regression Decision Boundary") + theme_minimal()	In the Source window type these commands.
Highlight the command ggplot() + geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) + geom_point(data = train, aes(x = minorAL, y = ecc, color = class)) + geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum), colour = "black", linewidth = 0.7) + scale_fill_manual(values = c("#ffff46", "#FF46e9")) + scale_color_manual(values = c("red", "blue")) + labs(x = "MinorAL", y = "ecc", title = "Logistic Regression Decision Boundary") + theme_minimal()	We are creating the decision boundary plot using ggplot2 from the data generated. It plots the grid points with colors indicating the predicted classes, overlays the training data points, and draws the boundary between the two predicted classes as a black contour. The overall plot provides a visual representation of the decision boundary and the distribution of the training data points of the model. Select and run these commands. Drag boundaries to see the plot window clearly.
	We can conclude that the decision boundary of logistic regression is a straight line. The line separates the data points clearly.
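Why the boundary is a straight line can be verified numerically. The sketch below fits a model to simulated data; the variables x1, x2, y and the helper boundary_x2() are hypothetical. It then solves b0 + b1*x1 + b2*x2 = 0 for x2 and checks that points on that line receive a predicted probability of 0.5.

```r
set.seed(7)
n  <- 400
x1 <- rnorm(n)
x2 <- rnorm(n)

# Simulated binary response with a linear logit in x1 and x2.
y   <- rbinom(n, size = 1, prob = plogis(0.3 + 1.5 * x1 - 1.0 * x2))
toy <- data.frame(x1, x2, y)

fit <- glm(y ~ x1 + x2, data = toy, family = binomial)
b   <- coef(fit)   # named vector: (Intercept), x1, x2

# On the boundary the logit is zero: b0 + b1*x1 + b2*x2 = 0,
# so x2 = -(b0 + b1*x1) / b2, a straight line in the (x1, x2) plane.
boundary_x2 <- function(x1) {
  -(b[["(Intercept)"]] + b[["x1"]] * x1) / b[["x2"]]
}

line_pts      <- data.frame(x1 = c(-2, 0, 2))
line_pts$x2   <- boundary_x2(line_pts$x1)
line_pts$prob <- predict(fit, newdata = line_pts, type = "response")

print(line_pts)   # the prob column is 0.5 up to rounding error
```

The same algebra applies to any two-feature logistic regression fit with glm(), with the fitted coefficients in place of b.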
Show slide Limitations of Logistic Regression It’s sensitive to outliers which can affect the accuracy of the classifier. It can perform poorly in the presence of multicollinearity among explanatory variables.	Here are some of the limitations of Logistic Regression
	Now let us summarize what we have learned.
Show Slide Summary	In this tutorial we have learned about: Logistic Regression, Assumptions of Logistic Regression, Advantages of Logistic Regression, Implementation of Logistic Regression using the Raisin dataset, Model Evaluation, Visualization of the model Decision Boundary, and Limitations of the Logistic Regression Model.
	Now we will suggest an assignment for this Spoken Tutorial.
Show Slide Assignment	Apply logistic regression on the Wine dataset. This dataset can be found in the HDclassif package. Install the package and import the dataset using the data() command. Measure the accuracy of the model.
Show slide About the Spoken Tutorial Project	The video at the following link summarizes the Spoken Tutorial project. Please download and watch it.
Show slide Spoken Tutorial Workshops	We conduct workshops using Spoken Tutorials and give certificates. Please contact us.
Show Slide Spoken Tutorial Forum to answer questions	Please post your timed queries in this forum.
Show Slide Forum to answer questions	Do you have any general/technical questions? Please visit the forum given in the link.
Show Slide Textbook Companion	The FOSSEE team coordinates the coding of solved examples of popular books and case study projects. We give certificates to those who do this. For more details, please visit these sites.
Show Slide Acknowledgment	The Spoken Tutorial project was established by the Ministry of Education Govt of India.
Show Slide Thank You	This tutorial is contributed by Yate Asseke Ronald. O and Debatosh Chakraborty from IIT Bombay. Thank you for joining.

Contributors and Content Editors

Ushav

@@ Line 552: / Line 552: @@
-It converts the predicted probabilities of he points into classes.
+It converts the predicted probabilities of the points into classes.
 If the probability exceeds 0.5 then '''Kecimen '''class otherwise '''Besni '''Class is chosen.
@@ Line 640: / Line 640: @@
 |-
 ||
-|| Let us summarize what we have learned.
+|| Now let us summarize what we have learned.
 |-
 || Show Slide
@@ Line 651: / Line 651: @@
 * Assumptions of Logistic Regression
 * Advantages of Logistic Regression
-* Implementation of Logistic Regression in '''R''' using '''Raisin '''dataset'''.'''
+* Implementation of Logistic Regression using '''Raisin '''dataset'''.'''
 * Model Evaluation.
 * Visualization of the model Decision Boundary
-* Limitations of Logistic Regression
+* Limitations of Logistic Regression Model
 |-
@@ Line 707: / Line 707: @@
 Acknowledgment
-|| The '''Spoken Tutorial''' was established by the Ministry of Education Govt of India.
+|| The '''Spoken Tutorial''' project was established by the Ministry of Education Govt of India.
 |-
 || Show Slide
