Difference between revisions of "Machine-Learning-using-R/C2/Logistic-Regression-in-R/English"

Revision as of 15:49, 31 May 2024

Title of the script: Logistic Regression

Author: Yate Asseke Ronald Olivera and Debatosh Chakraborty

Keywords: R, RStudio, machine learning, supervised, unsupervised, classification, logistic regression, video tutorial.


Visual Cue Narration
Show slide

Opening Slide

Welcome to this spoken tutorial on Logistic Regression in R.


Show slide

Learning Objectives


In this tutorial, we will learn about
  • Logistic Regression
  • Assumptions of Logistic Regression
  • Advantages of Logistic Regression
  • Implementation of Logistic Regression in R using Raisin dataset.
  • Model Evaluation.
  • Visualization of the model Decision Boundary
  • Limitations of Logistic Regression
Show slide

System Specifications

This tutorial is recorded using,
  • Windows 11
  • R version 4.3.0
  • RStudio version 2023.06.1

It is recommended to install R version 4.2.0 or higher.

Show slide

Prerequisites

To follow this tutorial, the learner should know:
  • Basic programming in R.
  • Basics of Machine Learning.

If not, please access the relevant tutorials on this website.

Let us learn what logistic regression is
Show slide

Logistic Regression

Logistic regression is a statistical model used for classification.

It models the probability of success of the response, given the explanatory variables.

  • It predicts a probability rather than the response value itself, unlike linear regression.
  • The predicted probability is used as a classifier.
  • The probability of success is modeled using the logit (log-odds) function.
  • It is a linear classifier, as the logistic regression model has a linear logit.
  • It is often used when the response variable is categorical.
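
The logit relationship described in the points above can be written out explicitly. The following is a sketch for two explanatory variables. The symbols p, x_1, x_2 and \beta_0, \beta_1, \beta_2 are notation assumed here for illustration and do not appear in the original script.

```latex
\[
\operatorname{logit}(p) = \log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2,
\qquad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}}
\]
```

Because the logit is linear in x_1 and x_2, the set of points where p = 0.5 satisfies \beta_0 + \beta_1 x_1 + \beta_2 x_2 = 0, which is a straight line. This is the sense in which the model is a linear classifier.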
Show slide

Assumptions of Logistic Regression


  • The distribution of the dependent variable is Bernoulli.
  • The data records are independent.


The dependent variable's distribution is typically assumed to be a Bernoulli distribution in logistic regression.
Show slide

Advantages of Logistic Regression


  • It provides estimates of regression coefficients along with their standard errors.
  • It also provides the predicted probability which in turn is used as a classifier.
  • It doesn’t need explanatory variables to be necessarily continuous.
  • In this sense, it is a more general classifier than LDA and QDA.


Logistic regression offers a significant advantage in that continuous explanatory variables are not a requirement.
Show Slide

Implementation Of Logistic Regression

We will implement logistic regression using the Raisin dataset.


The additional reading material has more details on the Raisin dataset.

Please refer to it.

Show slide

Download Files

We will use the script file LogisticRegression.R and the Raisin dataset file ‘Raisin_Dataset.xlsx’.

Please download these files from the Code files link of this tutorial.

Make a copy and then use them while practicing.

[Computer screen]

Highlight LogisticRegression.R

Logistic Regression folder.

I have downloaded and moved these files to the Logistic Regression folder.

This folder is located in the MLProject folder.


I have also set the Logistic Regression folder as my Working Directory.


Let’s create a Logistic Regression classifier model on the raisin dataset.

Let us switch to RStudio.
Click LogisticRegression.R in RStudio

Point to LogisticRegression.R in RStudio.

Open the script LogisticRegression.R in RStudio.


For this, click on the script LogisticRegression.R.


Script LogisticRegression.R opens in RStudio.

[Rstudio]

Highlight the commands


library(readxl)

library(caret)

library(VGAM)

library(ggplot2)

library(dplyr)

#install.packages("package_name")

Point to the command.

Select and run these commands to import the necessary packages.

The glm() function required to create our classifier belongs to the stats package, which R loads by default.

As I have already installed the packages, I have directly imported them.

[RStudio]

Highlight

data <- read_xlsx("Raisin_Dataset.xlsx")


data <- data[c("minorAL", "ecc", "class")]


data$class <- factor(data$class)


Highlight the commands.

These commands will load the Raisin dataset.

They will also prepare the dataset for model building.

Select and run the commands.

Drag boundary to see the Environment tab.

Click on data on the Environment tab.

Click on data in the Environment tab.

It loads the modified dataset in the Source window.

Point to the data. Now we split our dataset into training and testing data.
[RStudio]

set.seed(1)

trainIndex<- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE)

train <- data[trainIndex, ]

test <- data[-trainIndex, ]

In the Source window type these commands.

Highlight

set.seed(1)


Highlight

trainIndex <- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE)

Highlight

train <- data[trainIndex, ]

Highlight

test <- data[-trainIndex, ]

Click on Save and Run buttons.


Click on train and test to load them in the Source window.

Select the commands and run them.

Let us create a Logistic Regression model on the training dataset.
[RStudio]

Logistic_model <- glm(class ~ ., data = train, family = "binomial")

summary(Logistic_model)$coef

In the Source window type these commands

Highlight glm()

Highlight class ~ .

Highlight family = binomial

Highlight train

The function glm() fits generalized linear models.

Logistic regression is among the class of models that it fits.

This is the formula for our model.

We try to predict the target variable class based on the minorAL and ecc features.

This ensures that our model predicts the probability for 2 classes.

It ensures that, out of all the models in glm, the logistic regression model is fit.

This is the data used to train our model.

Select the commands and run them.

The output is shown in the console window.

Drag boundary to see the console window. Drag boundary to see the console window.
Point to the output in the console

Highlight Coefficients

Highlight Pr(>|z|)

Coefficients denote the coefficients of the logit function.

That means the log-odds of the Kecimen class change by -0.04 for every unit increase in minorAL.

The low p-values suggest that the effects are statistically significant.
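
A coefficient on the log-odds scale is often easier to read after exponentiation, which gives an odds ratio. The sketch below illustrates this on a small simulated dataset, not on the Raisin data. The names toy, x, y, fit, coef_table and odds_ratio are hypothetical and are introduced only for this example.

```r
# Simulated data for illustration only (not the Raisin dataset).
set.seed(1)
toy <- data.frame(x = rnorm(200))
# Assumed true model: log-odds of success = 0.5 - 1.2 * x
toy$y <- rbinom(200, size = 1, prob = plogis(0.5 - 1.2 * toy$x))

fit <- glm(y ~ x, data = toy, family = "binomial")

# Estimates, standard errors, z values and p-values
coef_table <- summary(fit)$coef
print(coef_table)

# exp() turns a log-odds coefficient into an odds ratio:
# the factor by which the odds are multiplied per unit increase in x
odds_ratio <- exp(coef(fit))
print(odds_ratio)
```

An odds ratio below 1 corresponds to a negative coefficient, meaning the odds of success fall as the variable increases.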

Drag boundary to see the Source window. Drag boundary to see the Source window.
Let us now use our model to make predictions on test data.
[RStudio]

Predicted.prob <- predict(Logistic_model, test, type="response")

View(Predicted.prob)

In the Source window type these commands


Highlight

Predicted.prob <- predict(Logistic_model, test, type="response")


Highlight

type = "response"


This command provides the predicted probability of the logistic regression model on the test dataset.

This command ensures the outcome is a probability.

Select and run the commands

Point


Value

Predicted.prob stores the predicted probability of each observation belonging to the Kecimen class.

predicted.classes <- factor(ifelse(Predicted.prob > 0.5, "Kecimen", "Besni"))

In the Source window type the following command.

Highlight

predicted.classes <- factor(ifelse(Predicted.prob > 0.5, "Kecimen", "Besni"))

This retrieves the predicted classes from the probabilities.

If the probability is greater than 0.5, the Kecimen class is chosen; otherwise, the Besni class is chosen.

We also convert the output to a factor datatype to fit in the Confusion matrix function.

Select and run the commands
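
The thresholding step can be tried on its own with a few made-up values. This is a minimal sketch: the vector example.prob and the object example.classes are hypothetical and are not part of the tutorial script.

```r
# Hypothetical predicted probabilities of the Kecimen class
example.prob <- c(0.91, 0.12, 0.50, 0.73, 0.49)

# Probabilities strictly greater than 0.5 map to "Kecimen", the rest to "Besni".
# Fixing the levels keeps both classes present even if one is never predicted.
example.classes <- factor(ifelse(example.prob > 0.5, "Kecimen", "Besni"),
                          levels = c("Besni", "Kecimen"))
print(example.classes)
```

Note that a probability of exactly 0.5 is assigned to Besni, because the comparison is strict.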

Let us measure the accuracy of our model.
confusion_matrix <- confusionMatrix(predicted.classes, test$class)


In the Source window type this command.
Highlight the command confusionMatrix(predicted.classes, test$class)


Point to confusion_matrix in the Environment tab

Highlight the attribute

table

This command creates a confusion matrix list.

The list is created from the actual and predicted class labels.

And it is stored in the confusion_matrix variable.

It helps to assess the classification model's performance and accuracy.

Select and run these commands
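
To see how accuracy follows from a confusion matrix, the calculation can be done by hand in base R. The sketch below uses made-up labels. The objects actual, predicted, tab and accuracy are hypothetical, and the result is the same quantity that confusionMatrix() reports as Accuracy.

```r
# Hypothetical actual and predicted labels, for illustration only
actual    <- factor(c("Besni", "Besni", "Kecimen", "Kecimen", "Kecimen", "Besni"),
                    levels = c("Besni", "Kecimen"))
predicted <- factor(c("Besni", "Kecimen", "Kecimen", "Kecimen", "Besni", "Besni"),
                    levels = c("Besni", "Kecimen"))

# Rows are predictions and columns are actual classes
tab <- table(Prediction = predicted, Reference = actual)
print(tab)

# Accuracy = correctly classified observations / all observations
accuracy <- sum(diag(tab)) / sum(tab)
print(accuracy)
```

The diagonal of the table holds the correct predictions, and the off-diagonal cells hold the misclassifications of each class.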

plot_confusion_matrix <- function(confusion_matrix){

tab <- confusion_matrix$table

tab = as.data.frame(tab)

tab$Prediction <- factor(tab$Prediction, levels = rev(levels(tab$Prediction)))


tab <- tab %>%

rename(Actual = Reference) %>%

mutate(cor = if_else(Actual == Prediction, 1,0))


tab$cor <- as.factor(tab$cor)


ggplot(tab, aes(Actual,Prediction)) +

geom_tile(aes(fill= cor),alpha = 0.4) + geom_text(aes(label=Freq)) +

scale_fill_manual(values = c("red","green")) +

theme_light() +

theme(legend.position = "none",

line = element_blank()) +

scale_x_discrete(position = "top")

}


In the Source window type these commands


Highlight the command

tab <- confusion_matrix$table

Highlight the command

tab <- confusion_matrix$table

tab = as.data.frame(tab)

tab$Prediction <- factor(tab$Prediction, levels = rev(levels(tab$Prediction)))

tab <- tab %>%

rename(Actual = Reference) %>%

mutate(cor = if_else(Actual == Prediction, 1,0))

tab$cor <- as.factor(tab$cor)

Highlight the command

ggplot(tab, aes(Actual,Prediction)) +

geom_tile(aes(fill= cor),alpha = 0.4) + geom_text(aes(label=Freq)) +

scale_fill_manual(values = c("red","green")) +

theme_light() +

theme(legend.position = "none",

line = element_blank()) +

scale_x_discrete(position = "top")

}

These commands create a function plot_confusion_matrix to display the confusion matrix from the confusion matrix list created.


It fetches the confusion matrix table from the list.

It creates a data frame from the table, which is suitable for plotting with ggplot2.

It plots the confusion matrix using the data frame created.

It represents correct and incorrect predictions using different colors.

Select and run the commands

[RStudio]


plot_confusion_matrix(confusion_matrix)

In the Source window type this command

Highlight the command

plot_confusion_matrix(confusion_matrix)


Click on Save and Run buttons.

We use the plot_confusion_matrix() function to generate a visual plot of the confusion matrix list created.

Select and run the command

The output is seen in the plot window

Output in Plot window. This plot shows how well our model predicted the testing data.

We observe that there are:

21 misclassifications of the Besni class.

13 misclassifications of the Kecimen class.

[RStudio]


grid <- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500),

ecc = seq(min(data$ecc), max(data$ecc), length = 500))

grid$prob <- predict(Logistic_model, newdata = grid, type = "response")


grid$class <- ifelse(grid$prob > 0.5, 'Kecimen', 'Besni')


grid$classnum <- as.numeric(as.factor(grid$class))

We will now visualize the decision boundary of the model.

In the Source window type these commands

Highlight the command

grid <- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500),

ecc = seq(min(data$ecc), max(data$ecc), length = 500))

grid$prob <- predict(Logistic_model, newdata = grid, type = "response")

grid$class <- ifelse(grid$prob > 0.5, 'Kecimen', 'Besni')


grid$classnum <- as.numeric(as.factor(grid$class))

This code first generates a grid of points spanning the range of minorAL and ecc features in the dataset.


Then, it uses the logistic regression model to predict the probability for each point in this grid.

These predictions are stored as a new column 'prob' in the grid data frame.


It converts the predicted probabilities of the points into classes.

If the probability exceeds 0.5, the Kecimen class is chosen; otherwise, the Besni class is chosen.

The predicted classes are stored in the ‘class’ column of the grid data frame.


The as.numeric function encodes the string labels of the predicted classes as numeric values.


Select and run the commands


Click on grid in the Environment tab to load the generated data in the Source window.

[RStudio]


ggplot() +

geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +

geom_point(data = train, aes(x = minorAL, y = ecc, color = class)) +

geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),

colour = "black", linewidth = 0.7) +

scale_fill_manual(values = c("#ffff46", "#FF46e9")) +

scale_color_manual(values = c("red", "blue")) +

labs(x = "MinorAL", y = "ecc", title = "Logistic Regression Decision Boundary") +

theme_minimal()

In the Source window type these commands
Highlight the command

ggplot() +

geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +

geom_point(data = train, aes(x = minorAL, y = ecc, color = class)) +

geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),

colour = "black", linewidth = 0.7) +

scale_fill_manual(values = c("#ffff46", "#FF46e9")) +

scale_color_manual(values = c("red", "blue")) +

labs(x = "MinorAL", y = "ecc", title = "Logistic Regression Decision Boundary") +

theme_minimal()

We are creating the decision boundary plot using ggplot2 from the data generated.

It plots the grid points with colors indicating the predicted classes.


The overall plot provides a visual representation of the decision boundary and the distribution of training data points of the model.


Select and run these commands.


Drag boundaries to see the plot window clearly.

We can conclude that the decision boundary of logistic regression is a straight line.

The line separates most of the data points of the two classes.
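
The straight-line boundary can also be derived from the fitted coefficients, since the boundary is where the logit equals zero. The sketch below demonstrates this on simulated data with two features, not on the Raisin data. The names toy, x1, x2, fit, b, boundary_intercept, boundary_slope and on_line are hypothetical.

```r
# Simulated two-feature data for illustration only (not the Raisin dataset)
set.seed(1)
toy <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
toy$y <- rbinom(300, size = 1, prob = plogis(1 + 2 * toy$x1 - 1.5 * toy$x2))

fit <- glm(y ~ x1 + x2, data = toy, family = "binomial")
b <- coef(fit)   # b[1] = intercept, b[2] = x1 coefficient, b[3] = x2 coefficient

# On the boundary the logit is zero: b0 + b1 * x1 + b2 * x2 = 0,
# so x2 = -(b0 + b1 * x1) / b2, which is a straight line in the (x1, x2) plane
boundary_intercept <- unname(-b[1] / b[3])
boundary_slope     <- unname(-b[2] / b[3])

# Any point on that line should get a predicted probability of 0.5
on_line <- data.frame(x1 = c(-1, 0, 1))
on_line$x2 <- boundary_intercept + boundary_slope * on_line$x1
boundary_prob <- predict(fit, newdata = on_line, type = "response")
print(boundary_prob)
```

The same algebra would apply to a model fitted on minorAL and ecc, with ecc expressed as a linear function of minorAL along the boundary.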

Show slide

Limitations of Logistic Regression


  • It’s sensitive to outliers which can affect the accuracy of the classifier.
  • It can perform poorly in the presence of multicollinearity among explanatory variables.


Here are some of the limitations of Logistic Regression.
Let us summarize what we have learned.
Show Slide

Summary

In this tutorial we have learned about:


  • Logistic Regression
  • Assumptions of Logistic Regression
  • Advantages of Logistic Regression
  • Implementation of Logistic Regression in R using Raisin dataset.
  • Model Evaluation.
  • Visualization of the model Decision Boundary
  • Limitations of Logistic Regression
Now we will suggest an assignment for this Spoken Tutorial.
Show Slide

Assignment

  • Apply logistic regression on the Wine dataset.
  • This dataset can be found in the HDclassif package.
  • Install the package and import the dataset using the data() command.
  • Measure the accuracy of the model
Show slide

About the Spoken Tutorial Project

The video at the following link summarizes the Spoken Tutorial project. Please download and watch it.
Show slide

Spoken Tutorial Workshops

We conduct workshops using Spoken Tutorials and give certificates.


Please contact us.

Show Slide

Spoken Tutorial Forum to answer questions

Please post your timed queries in this forum.
Show Slide

Forum to answer questions

Do you have any general/technical questions?

Please visit the forum given in the link.

Show Slide

Textbook Companion

The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.

We give certificates to those who do this.

For more details, please visit these sites.

Show Slide

Acknowledgment

The Spoken Tutorial was established by the Ministry of Education, Govt. of India.
Show Slide

Thank You

This tutorial is contributed by Yate Asseke Ronald. O and Debatosh Chakraborty from IIT Bombay.

Thank you for joining.

Contributors and Content Editors

Ushav