Machine-Learning-using-R/C2/Logistic-Regression-in-R/English
Title of the script: Logistic Regression
Author: Yate Asseke Ronald Olivera and Debatosh Chakraborty
Keywords: R, RStudio, machine learning, supervised, unsupervised, classification, logistic regression, video tutorial.
Visual Cue | Narration |
Show slide
Opening Slide |
Welcome to this spoken tutorial on Logistic Regression in R.
|
Show slide
Learning Objectives
|
In this tutorial, we will learn about
|
Show slide
System Specifications |
This tutorial is recorded using,
It is recommended to install R version 4.2.0 or higher. |
Show slide
Prerequisites |
To follow this tutorial, the learner should know:
If not, please access the relevant tutorials on this website. |
Let us learn what logistic regression is | |
Show slide
Logistic Regression |
Logistic regression is a statistical model used for classification.
It models the probability of success of the response variable as a function of the explanatory variables.
|
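As a minimal sketch of this idea, the logistic function below turns a linear combination of an explanatory variable into a probability. The coefficients b0 and b1 and the values in x are hypothetical and chosen only for illustration; they are not taken from the Raisin dataset.

```r
# Logistic (sigmoid) function: maps any real-valued log-odds to a probability in (0, 1).
logistic <- function(z) 1 / (1 + exp(-z))

# Hypothetical coefficients, chosen only for illustration.
b0 <- -1.5
b1 <- 0.8

# Hypothetical values of one explanatory variable.
x <- c(-2, 0, 2, 4)

# Probability of success for each value of x.
p <- logistic(b0 + b1 * x)
print(round(p, 3))

# Base R's plogis() computes the same function.
print(all.equal(p, plogis(b0 + b1 * x)))
```

Because b1 is positive in this sketch, the probability of success increases as x increases.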
Show slide
Assumptions of Logistic Regression
|
The dependent variable's distribution is typically assumed to be a Bernoulli distribution in logistic regression. |
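To illustrate what a Bernoulli-distributed dependent variable looks like, here is a small simulation sketch. The success probability 0.3 and the sample size are assumptions made for this example only.

```r
set.seed(1)

# Hypothetical success probability for the illustration.
p_success <- 0.3

# A Bernoulli variable is a binomial variable with a single trial (size = 1).
y <- rbinom(1000, size = 1, prob = p_success)

# Each observation is either 0 (failure) or 1 (success).
print(table(y))

# The sample mean estimates the success probability.
print(mean(y))
```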
Show slide
Advantages of Logistic Regression
|
Logistic regression offers a significant advantage in that continuous explanatory variables are not a requirement. |
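The following sketch shows a logistic regression with one categorical and one continuous explanatory variable. The data frame d, its columns colour and size, and the coefficients used to simulate it are all hypothetical.

```r
set.seed(1)

# Hypothetical data: one categorical and one continuous explanatory variable.
d <- data.frame(
  colour = factor(sample(c("dark", "light"), 200, replace = TRUE)),
  size = rnorm(200)
)

# Simulate a 0/1 outcome whose log-odds depend on both variables.
d$y <- rbinom(200, 1, plogis(-0.5 + 1.2 * (d$colour == "light") + 0.8 * d$size))

# glm() converts the factor into a dummy variable automatically.
fit_cat <- glm(y ~ colour + size, data = d, family = binomial)
print(coef(fit_cat))
```

The coefficient named colourlight compares the light level with the reference level dark.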
Show Slide
Implementation Of Logistic Regression |
We will implement logistic regression using the Raisin dataset.
Please refer to it. |
Show slide
Download Files |
We will use the script file LogisticRegression.R and the Raisin dataset ‘raisin.xlsx’.
Please download these files from the Code files link of this tutorial. Make a copy and then use them while practicing. |
[Computer screen]
Highlight LogisticRegression.R in the Logistic Regression folder. |
I have downloaded and moved these files to the Logistic Regression folder.
This folder is located in the MLProject folder.
|
Let us switch to RStudio. | |
Click LogisticRegression.R in RStudio
Point to LogisticRegression.R in RStudio. |
Open the script LogisticRegression.R in RStudio.
|
[Rstudio]
Highlight the commands
library(caret)
library(VGAM)
library(ggplot2)
library(dplyr)
library(readxl)
#install.packages("package_name")
Point to the command. |
Select and run these commands to import the necessary packages.
The glm() function required to create our classifier comes from the stats package, which is loaded with R by default. As I have already installed the packages, I have directly imported them. |
[RStudio]
Highlight data <- read_xlsx("Raisin_Dataset.xlsx")
|
These commands will load the Raisin dataset.
They will also prepare the dataset for model building. Select and run the commands. |
Drag boundary to see the Environment tab.
Click on data on the Environment tab. |
Click on data in the Environment tab. It loads the modified dataset in the Source window. |
Point to the data. | Now we split our dataset into training and testing data. |
[RStudio]
set.seed(1)
trainIndex <- sample(1:nrow(data), size = 0.7*nrow(data), replace = FALSE)
train <- data[trainIndex, ]
test <- data[-trainIndex, ] |
In the Source window type these commands. |
Highlight
set.seed(1)
trainIndex <- sample(1:nrow(data), size = 0.7*nrow(data), replace = FALSE)
Highlight train <- data[trainIndex, ]
Highlight test <- data[-trainIndex, ]
Click on Save and Run buttons.
|
Select the commands and run them. |
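The same splitting pattern can be tried on a small made-up data frame. In this sketch, toy and its columns are hypothetical, and round() is used so that the training size is a whole number.

```r
set.seed(1)

# Hypothetical data frame with 100 rows.
toy <- data.frame(id = 1:100, value = rnorm(100))

# Draw 70% of the row numbers at random, without replacement.
n_train <- round(0.7 * nrow(toy))
train_index <- sample(1:nrow(toy), size = n_train, replace = FALSE)

# Rows drawn go to the training set, the remaining rows to the testing set.
toy_train <- toy[train_index, ]
toy_test <- toy[-train_index, ]

print(c(train = nrow(toy_train), test = nrow(toy_test)))

# No row appears in both sets.
print(length(intersect(toy_train$id, toy_test$id)))
```

Setting the seed makes the random split reproducible across runs.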
Let us create a Logistic Regression model on the training dataset. | |
[RStudio]
Logistic_model <- glm(class ~ ., data = train, family = "binomial")
summary(Logistic_model)$coef |
In the Source window type these commands |
Highlight glm()
Highlight class ~ . Highlight family = binomial Highlight train |
The function glm() represents generalized linear models.
Logistic regression is among the class of models that it fits. This is the formula for our model. We try to predict the target variable class based on the minorAL and ecc features. Setting family to binomial ensures that our model predicts probabilities for two classes. Out of all the models that glm() can fit, this selects the logistic regression model. This is the data used to train our model. Select the commands and run them. The output is shown in the console window. |
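Here is a minimal sketch of the same glm() call on simulated data. The data frame toy, its columns x1, x2 and label, and the class names "A" and "B" are hypothetical. For a factor response, glm() treats the first level as failure and models the probability of the other level.

```r
set.seed(1)
n <- 300

# Hypothetical features and a two-class label simulated from a known model.
x1 <- rnorm(n)
x2 <- rnorm(n)
p <- plogis(0.5 + 2 * x1 - 1 * x2)
label <- factor(ifelse(rbinom(n, 1, p) == 1, "B", "A"))
toy <- data.frame(x1, x2, label)

# Same pattern as in the tutorial: predict the label from all other columns.
fit <- glm(label ~ ., data = toy, family = "binomial")

# Coefficient table: estimate, standard error, z value and p-value.
coefs <- summary(fit)$coef
print(coefs)

# The first level ("A") is the reference; the model gives P(label == "B").
print(levels(toy$label))
```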
Drag boundary to see the console window. | Drag boundary to see the console window. |
Point to the output in the console
Highlight Coefficients Highlight Pr(>|z|) |
The Estimate column gives the coefficients of the logit function. For example, the log-odds of class change by -0.04 for every unit increase in minorAL. The low p-values suggest that the effects are statistically significant. |
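A coefficient on the log-odds scale is often easier to read as an odds ratio. The sketch below uses the value -0.04 mentioned above; the variable names are introduced only for this example.

```r
# Coefficient of minorAL on the log-odds scale, as quoted in the narration.
beta_minorAL <- -0.04

# Exponentiating gives the multiplicative change in the odds per unit increase.
odds_ratio <- exp(beta_minorAL)
print(round(odds_ratio, 3))

# Percentage change in the odds for a one-unit increase in minorAL.
percent_change <- (odds_ratio - 1) * 100
print(round(percent_change, 1))
```

An odds ratio below 1 means that the odds decrease as the feature increases.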
Drag boundary to see the Source window. | Drag boundary to see the Source window. |
Let us now use our model to make predictions on test data. | |
[RStudio]
Predicted.prob <- predict(Logistic_model, test, type="response")
View(Predicted.prob) |
In the Source window type these commands
|
Highlight
Predicted.prob <- predict(Logistic_model, test, type="response")
type = "response"
|
This command provides the predicted probabilities from the logistic regression model on the test dataset.
The type = "response" argument ensures the outcome is a probability. Select and run the commands. |
Point
|
Predicted.prob stores the predicted probability of each observation belonging to a certain class. |
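The effect of type = "response" can be checked on a small simulated model. In this sketch, x, y, fit_demo and new_points are hypothetical names. Without this argument, predict() returns values on the log-odds scale.

```r
set.seed(1)

# Hypothetical data with a 0/1 outcome.
x <- rnorm(100)
y <- rbinom(100, 1, plogis(1 + 2 * x))
fit_demo <- glm(y ~ x, family = binomial)

# New observations to predict.
new_points <- data.frame(x = c(-1, 0, 1))

# type = "link" (the default) returns log-odds.
log_odds <- predict(fit_demo, new_points, type = "link")

# type = "response" returns probabilities.
prob <- predict(fit_demo, new_points, type = "response")

print(log_odds)
print(prob)

# Applying the logistic function to the log-odds gives the probabilities.
print(all.equal(plogis(log_odds), prob))
```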
predicted.classes <- factor(ifelse(Predicted.prob > 0.5, "Kecimen", "Besni")) | In the Source window type the following commands. |
Highlight
predicted.classes <- factor(ifelse(Predicted.prob > 0.5, "Kecimen", "Besni")) |
This retrieves the predicted classes from the probabilities.
If the probability is greater than 0.5, the Kecimen class is chosen; otherwise, the Besni class is chosen. We also convert the output to a factor datatype so that it fits the confusionMatrix() function. Select and run the commands. |
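The thresholding step can be tried on a few made-up probabilities. The vector demo_prob is hypothetical. Setting the levels explicitly is an optional choice in this sketch; it keeps both classes as levels even if one of them is never predicted.

```r
# Hypothetical predicted probabilities.
demo_prob <- c(0.10, 0.45, 0.50, 0.51, 0.90)

# Probabilities strictly greater than 0.5 map to "Kecimen", the rest to "Besni".
demo_classes <- factor(
  ifelse(demo_prob > 0.5, "Kecimen", "Besni"),
  levels = c("Besni", "Kecimen")
)

print(demo_classes)
print(table(demo_classes))
```

Note that a probability of exactly 0.5 is assigned to Besni, because the comparison is strict.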
Let us measure the accuracy of our model. | |
confusion_matrix <- confusionMatrix(predicted.classes, test$class)
|
In the Source window type these commands |
Highlight the command confusionMatrix(predicted.classes, test$class)
Highlight the attribute table |
This command creates a confusion matrix list.
The list is created from the actual and predicted class labels, and it is stored in the confusion_matrix variable. It helps to assess the classification model's performance and accuracy. Select and run these commands. |
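To see what the table inside this list contains, here is a sketch that builds a confusion table with base R only. The label vectors are made up for illustration, and this is not the output of confusionMatrix() itself.

```r
class_levels <- c("Besni", "Kecimen")

# Hypothetical actual and predicted labels for six observations.
actual <- factor(c("Besni", "Besni", "Kecimen", "Kecimen", "Kecimen", "Besni"),
                 levels = class_levels)
predicted <- factor(c("Besni", "Kecimen", "Kecimen", "Kecimen", "Besni", "Besni"),
                    levels = class_levels)

# Rows are predictions, columns are the reference (actual) labels.
cm <- table(Prediction = predicted, Reference = actual)
print(cm)

# Accuracy is the share of observations on the diagonal.
accuracy <- sum(diag(cm)) / sum(cm)
print(round(accuracy, 3))
```

The off-diagonal cells count the misclassifications of each class.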
plot_confusion_matrix <- function(confusion_matrix){
  # Convert the confusion table to a data frame for plotting
  tab <- confusion_matrix$table
  tab <- as.data.frame(tab)
  tab$Prediction <- factor(tab$Prediction, levels = rev(levels(tab$Prediction)))
  # Mark correct (1) and incorrect (0) cells
  tab <- tab %>%
    rename(Actual = Reference) %>%
    mutate(cor = if_else(Actual == Prediction, 1, 0))
  tab$cor <- as.factor(tab$cor)
  # Draw the tiles with the counts as labels
  ggplot(tab, aes(Actual, Prediction)) +
    geom_tile(aes(fill = cor), alpha = 0.4) +
    geom_text(aes(label = Freq)) +
    scale_fill_manual(values = c("red", "green")) +
    theme_light() +
    theme(legend.position = "none", line = element_blank()) +
    scale_x_discrete(position = "top")
}
|
In the Source window type these commands |
Highlight the command
tab <- confusion_matrix$table
tab <- as.data.frame(tab)
tab$Prediction <- factor(tab$Prediction, levels = rev(levels(tab$Prediction)))
tab <- tab %>% rename(Actual = Reference) %>% mutate(cor = if_else(Actual == Prediction, 1, 0))
tab$cor <- as.factor(tab$cor)
Highlight the command
ggplot(tab, aes(Actual, Prediction)) + geom_tile(aes(fill = cor), alpha = 0.4) + geom_text(aes(label = Freq)) + scale_fill_manual(values = c("red", "green")) + theme_light() + theme(legend.position = "none", line = element_blank()) + scale_x_discrete(position = "top") } |
These commands create a function plot_confusion_matrix to display the confusion matrix from the confusion matrix list created.
It creates a data frame from the table, which is suitable for plotting using ggplot2. It plots the confusion matrix using the data frame created. It represents correct and incorrect predictions using different colors. Select and run the commands. |
[RStudio]
|
In the Source window type this command |
Highlight the command plot_confusion_matrix(confusion_matrix)
|
We use the plot_confusion_matrix() function to generate a visual plot of the confusion matrix list created. Select and run the command. The output is seen in the plot window. |
Output in Plot window. | This plot shows how well our model predicted the testing data.
We observe that there are 21 misclassifications of the Besni class and 13 misclassifications of the Kecimen class. |
[RStudio]
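grid <- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length.out = 500),
                    ecc = seq(min(data$ecc), max(data$ecc), length.out = 500))
grid$prob <- predict(Logistic_model, newdata = grid, type = "response")
grid$class <- ifelse(grid$prob > 0.5, 'Kecimen', 'Besni')
# Numeric class codes, needed by geom_contour() in the plot
grid$classnum <- as.numeric(as.factor(grid$class))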
|
We will now visualize the decision boundary of the model.
In the Source window type these commands |
Highlight the command
grid <- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length.out = 500),
                    ecc = seq(min(data$ecc), max(data$ecc), length.out = 500))
grid$prob <- predict(Logistic_model, newdata = grid, type = "response")
grid$class <- ifelse(grid$prob > 0.5, 'Kecimen', 'Besni')
# Numeric class codes, needed by geom_contour() in the plot
grid$classnum <- as.numeric(as.factor(grid$class))
|
This code first generates a grid of points spanning the range of the minorAL and ecc features in the dataset.
It then uses the model to predict the probability for each grid point. If the probability exceeds 0.5, the Kecimen class is chosen; otherwise, the Besni class is chosen. The predicted classes are stored in the ‘class’ column of the grid data frame.
|
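To see how expand.grid() builds such a grid, here is a tiny sketch. The column names a and b and their ranges are hypothetical; the tutorial's grid uses 500 values per feature instead of a handful.

```r
# Three values for the first feature and two for the second.
grid_demo <- expand.grid(
  a = seq(0, 1, length.out = 3),
  b = seq(10, 20, length.out = 2)
)

# Every combination of a and b appears exactly once: 3 * 2 rows.
print(grid_demo)
print(nrow(grid_demo))
```

With 500 values per feature, the same call produces 250000 grid points.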
[RStudio]
ggplot() +
  geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +
  geom_point(data = train, aes(x = minorAL, y = ecc, color = class)) +
  geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum), colour = "black", linewidth = 0.7) +
  scale_fill_manual(values = c("#ffff46", "#FF46e9")) +
  scale_color_manual(values = c("red", "blue")) +
  labs(x = "MinorAL", y = "ecc", title = "Logistic Regression Decision Boundary") +
  theme_minimal() |
In the Source window type these commands |
Highlight the command
ggplot() +
  geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +
  geom_point(data = train, aes(x = minorAL, y = ecc, color = class)) +
  geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum), colour = "black", linewidth = 0.7) +
  scale_fill_manual(values = c("#ffff46", "#FF46e9")) +
  scale_color_manual(values = c("red", "blue")) +
  labs(x = "MinorAL", y = "ecc", title = "Logistic Regression Decision Boundary") +
  theme_minimal() |
We are creating the decision boundary plot using ggplot2 from the data generated.
It plots the grid points with colors indicating the predicted classes.
|
We can conclude that the decision boundary of logistic regression is a straight line.
The line separates most of the data points of the two classes. | |
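The straight-line boundary follows from the model itself: the predicted probability equals 0.5 exactly where the log-odds are zero, which is a linear equation in the two features. The sketch below checks this on simulated data; toy2, x1, x2 and the class names "A" and "B" are hypothetical.

```r
set.seed(1)
n <- 300

# Hypothetical two-feature data with a two-class label.
toy2 <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy2$label <- factor(
  ifelse(rbinom(n, 1, plogis(0.5 + 2 * toy2$x1 - toy2$x2)) == 1, "B", "A")
)

fit2 <- glm(label ~ x1 + x2, data = toy2, family = "binomial")

# Extract the fitted coefficients as plain numbers.
b0 <- coef(fit2)[["(Intercept)"]]
b1 <- coef(fit2)[["x1"]]
b2 <- coef(fit2)[["x2"]]

# Boundary: b0 + b1 * x1 + b2 * x2 = 0, solved for x2.
x1_line <- seq(-2, 2, length.out = 5)
x2_line <- -(b0 + b1 * x1_line) / b2

# Predicted probabilities for points on this line.
p_on_line <- predict(fit2, data.frame(x1 = x1_line, x2 = x2_line), type = "response")
print(round(unname(p_on_line), 6))
```

Every point on the line has a predicted probability of 0.5, so points on either side are assigned to different classes.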
Show slide
Limitations of Logistic Regression
|
Here are some of the limitations of Logistic Regression |
Let us summarize what we have learned. | |
Show Slide
Summary |
In this tutorial we have learned about:
|
Now we will suggest an assignment for this Spoken Tutorial. | |
Show Slide
Assignment |
|
Show slide
About the Spoken Tutorial Project |
The video at the following link summarizes the Spoken Tutorial project. Please download and watch it. |
Show slide
Spoken Tutorial Workshops |
We conduct workshops using Spoken Tutorials and give certificates.
|
Show Slide
Spoken Tutorial Forum to answer questions |
Please post your timed queries in this forum. |
Show Slide
Forum to answer questions |
Do you have any general/technical questions?
Please visit the forum given in the link. |
Show Slide
Textbook Companion |
The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.
We give certificates to those who do this. For more details, please visit these sites. |
Show Slide
Acknowledgment |
The Spoken Tutorial project was established by the Ministry of Education, Govt. of India. |
Show Slide
Thank You |
This tutorial is contributed by Yate Asseke Ronald. O and Debatosh Chakraborty from IIT Bombay.
Thank you for joining. |