Difference between revisions of "Python-for-Machine-Learning/C2/Logistic-Regression-Binary-Classification/English"

(Created page with " <div style="margin-left:2.54cm;margin-right: <div style="margin-left:1.27cm;margin-right:0cm;"></div> {| border="1" |- || '''Visual Cue''' || '''Narration''' |- || Show sli...")
 
Line 1: Line 1:
 +
{| border="1"
  
 
<div style="margin-left:2.54cm;margin-right:
 
 
<div style="margin-left:1.27cm;margin-right:0cm;"></div>
 
{| border="1"
 
|-
 
 
|| '''Visual Cue'''
 
|| '''Visual Cue'''
 
|| '''Narration'''
 
|| '''Narration'''
Line 13: Line 8:
 
'''Welcome'''
 
'''Welcome'''
 
|| Welcome to the Spoken Tutorial on '''Logistic Regression - Binary Classification.'''
 
|| Welcome to the Spoken Tutorial on '''Logistic Regression - Binary Classification.'''
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Show slide:
 
|| Show slide:
  
 
'''Learning Objectives'''
 
'''Learning Objectives'''
|| In this tutorial, we will learn about
+
|| In this tutorial, we will learn about:
* <div style="color:#000000;margin-left:1.905cm;margin-right:0cm;">Logistic Regression</div>
+
* Logistic Regression
* <div style="color:#000000;margin-left:1.905cm;margin-right:0cm;">Binary Classification</div>
+
* Binary Classification
* <div style="color:#000000;margin-left:1.905cm;margin-right:0cm;">Multiclass Classification</div>
+
* Multiclass Classification
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Show slide:
 
|| Show slide:
  
 
'''System Requirements'''
 
'''System Requirements'''
 
|| To record this tutorial, I am using
 
|| To record this tutorial, I am using
* <div style="margin-left:1.905cm;margin-right:0cm;"><span style="color:#000000;">'''Ubuntu Linux '''</span><span style="color:#000000;">OS version</span><span style="color:#000000;">''' 24.04'''</span></div>
+
* '''Ubuntu Linux '''OS version''' 24.04'''
* <div style="margin-left:1.905cm;margin-right:0cm;"><span style="color:#000000;">'''Jupyter Notebook '''</span><span style="color:#000000;">IDE</span></div>
+
* '''Jupyter Notebook '''IDE
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Show slide:  
 
|| Show slide:  
  
'''Prerequisite'''
+
'''Prerequisites'''
 
|| To follow this tutorial,
 
|| To follow this tutorial,
  
* <div style="margin-left:1.905cm;margin-right:0cm;"><span style="color:#000000;">The learner must have basic knowledge of </span><span style="color:#000000;">'''Python.'''</span></div>
+
* The learner must have basic knowledge of '''Python.'''
* <div style="margin-left:1.905cm;margin-right:0cm;"><span style="color:#000000;">For pre-requisite </span><span style="color:#000000;">'''Python'''</span><span style="color:#000000;"> tutorials, please visit this website.</span></div>
+
* For pre-requisite '''Python''' tutorials, please visit this website.
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Show slide:
 
|| Show slide:
  
 
'''Code files'''
 
'''Code files'''
 
||
 
||
* <div style="margin-left:1.27cm;margin-right:0cm;">The files used in this tutorial are provided in the '''Code files '''link.</div>
+
* The files used in this tutorial are provided in the '''Code files '''link.
* <div style="margin-left:1.27cm;margin-right:0cm;">Please download and extract the files.</div>
+
* Please download and extract the files.
* <div style="margin-left:1.27cm;margin-right:0cm;">Make a copy and then use them while practicing.</div>
+
* Make a copy and then use them while practicing.
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Show slide:
 
|| Show slide:
  
 
'''Logistic Regression'''
 
'''Logistic Regression'''
 
||
 
||
* <div style="color:#000000;margin-left:1.27cm;margin-right:0cm;">Logistic regression is a machine learning algorithm used for classification tasks.</div>
+
* Logistic regression is a machine learning algorithm used for classification tasks.
* <div style="color:#000000;margin-left:1.27cm;margin-right:0cm;">The goal is to predict a binary outcome (like yes/no, true/false) based on input features.</div>
+
* The goal is to predict a binary outcome (like yes/no, true/false) based on input features.
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Show slide:
 
|| Show slide:
  
 
'''Logistic Regression'''
 
'''Logistic Regression'''
 
||
 
||
* <div style="margin-left:1.27cm;margin-right:0cm;"><span style="color:#000000;">Logistic regression uses the </span><span style="color:#000000;">'''logit'''</span><span style="color:#000000;"> function.</span></div>
+
* Logistic regression uses the '''logistic (sigmoid)''' function, which is the inverse of the '''logit''' function.
* <div style="color:#000000;margin-left:1.27cm;margin-right:0cm;">This function maps the linear combination of features to probabilities zero and one.</div>
+
* This function maps the linear combination of features to a probability between zero and one.
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Show Slide:'''Types of classification'''
 
|| Show Slide:'''Types of classification'''
 
|| There are two types of classification. They are
 
|| There are two types of classification. They are
* <div style="margin-left:1.27cm;margin-right:0cm;">'''Binary Classification'''</div>
+
* '''Binary Classification'''
* <div style="margin-left:1.27cm;margin-right:0cm;">'''Multiclass Classification'''</div>
+
* '''Multiclass Classification'''
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Show slide:
 
|| Show slide:
  
 
'''Binary classification'''
 
'''Binary classification'''
 
||
 
||
* <div style="margin-left:1.27cm;margin-right:0cm;">Binary classification is used for modeling a binary target variable.</div>
+
* Binary classification is used for modeling a binary target variable.
* <div style="margin-left:1.27cm;margin-right:0cm;">The target variable has only two possible outcomes.</div>
+
* The target variable has only two possible outcomes.
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Show slide:
 
|| Show slide:
  
 
'''Multiclass classification'''
 
'''Multiclass classification'''
 
||  
 
||  
* <div style="margin-left:1.27cm;margin-right:0cm;">Multiclass classification is an extension of binary classification.</div>
+
* Multiclass classification is an extension of binary classification.
* <div style="margin-left:1.27cm;margin-right:0cm;">The target variable can have two or more possible outcomes.</div>
+
* The target variable can have more than two possible outcomes.
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Open the file ads.csv and point to the fields as per narration.
 
|| Open the file ads.csv and point to the fields as per narration.
 
|| To implement the '''Binary classification model, '''we use the '''Ads dot csv '''dataset.
 
|| To implement the '''Binary classification model, '''we use the '''Ads dot csv '''dataset.
  
 
Here, we analyze customers to predict if they will make a purchase in the store.
 
Here, we analyze customers to predict if they will make a purchase in the store.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Point to the '''LR_Binary.ipynb'''
 
|| Point to the '''LR_Binary.ipynb'''
 
|| '''LR underscore Binary''' '''dot ipynb '''is the Python notebook file for this demonstration.
 
|| '''LR underscore Binary''' '''dot ipynb '''is the Python notebook file for this demonstration.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Press '''Ctrl+Alt'''+'''T '''keys
 
|| Press '''Ctrl+Alt'''+'''T '''keys
 
Type '''conda activate ml'''
 
Type '''conda activate ml'''
Line 101: Line 96:
  
 
Activate the machine learning environment as shown.
 
Activate the machine learning environment as shown.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Type '''cd Downloads'''
 
|| Type '''cd Downloads'''
  
Line 112: Line 107:
  
 
Then type, '''jupyter space notebook '''and press''' Enter.'''
 
Then type, '''jupyter space notebook '''and press''' Enter.'''
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Show Jupyter Notebook Home page:
 
|| Show Jupyter Notebook Home page:
  
Line 120: Line 115:
 
Click the '''LR underscore Binary dot ipynb '''file to open it.
 
Click the '''LR underscore Binary dot ipynb '''file to open it.
  
<div style="color:#000000;">Note that each cell will have the output displayed in this file.</div>
+
Note that each cell will have the output displayed in this file.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight '''import pandas as pd '''
 
|| Highlight '''import pandas as pd '''
  
Line 130: Line 125:
  
 
Make sure to press''' Shift '''and''' Enter''' to execute the code in each cell.
 
Make sure to press''' Shift '''and''' Enter''' to execute the code in each cell.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| '''Highlight ads = pd.read_csv(r"Ads.csv") '''
 
|| '''Highlight ads = pd.read_csv(r"Ads.csv") '''
 
'''ads.head() '''
 
'''ads.head() '''
 
|| We load the dataset into the variable '''ads''' using the '''pd dot read underscore csv''' function.
 
|| We load the dataset into the variable '''ads''' using the '''pd dot read underscore csv''' function.
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight '''Data Exploration'''
 
|| Highlight '''Data Exploration'''
 
|| Let us explore the dataset.
 
|| Let us explore the dataset.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight
 
|| Highlight
  
Line 147: Line 142:
  
 
Then we summarize the dataset, including rows, columns, and missing values using '''ads dot info'''.  
 
Then we summarize the dataset, including rows, columns, and missing values using '''ads dot info'''.  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight '''plt.figure(figsize=(6, 4))'''
 
|| Highlight '''plt.figure(figsize=(6, 4))'''
  
Line 156: Line 151:
  
 
In the output cell, ignore the warning if you get any.
 
In the output cell, ignore the warning if you get any.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight '''Data Preprocessing'''
 
|| Highlight '''Data Preprocessing'''
 
|| Let us preprocess the dataset.
 
|| Let us preprocess the dataset.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight '''ads.drop(columns=['User ID'])'''
 
|| Highlight '''ads.drop(columns=['User ID'])'''
  
Line 166: Line 161:
  
 
Let us display the first few rows of the updated data and verify.
 
Let us display the first few rows of the updated data and verify.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Only narration
 
|| Only narration
 
|| In the dataset, the column '''Gender''' contains string data type.
 
|| In the dataset, the column '''Gender''' contains string data type.
  
 
The '''fit method''' in '''sklearn''' can't train models with string data.
 
The '''fit method''' in '''sklearn''' can't train models with string data.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight '''l = LabelEncoder()'''
 
|| Highlight '''l = LabelEncoder()'''
 
|| So, we use the '''LabelEncoder class''' to convert the string values into integers.
 
|| So, we use the '''LabelEncoder class''' to convert the string values into integers.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight '''ads['Gender'] = l.fit_transform(ads['Gender'])'''
 
|| Highlight '''ads['Gender'] = l.fit_transform(ads['Gender'])'''
  
Line 182: Line 177:
  
 
Have a look at the encoded data.
 
Have a look at the encoded data.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Only narration
 
|| Only narration
  
Line 195: Line 190:
  
 
Notice that we have removed the '''Purchased''' column.
 
Notice that we have removed the '''Purchased''' column.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight  
 
|| Highlight  
  
Line 204: Line 199:
  
 
The variable '''y''' has only the class labels as shown.
 
The variable '''y''' has only the class labels as shown.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Only narration
 
|| Only narration
  
Line 212: Line 207:
  
 
First, we create an instance of the '''MinMaxScaler''' class.
 
First, we create an instance of the '''MinMaxScaler''' class.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight '''scaled_x = pd.DataFrame(mms.fit_transform(x),columns=x.columns)'''
 
|| Highlight '''scaled_x = pd.DataFrame(mms.fit_transform(x),columns=x.columns)'''
 
|| '''mms dot fit underscore transform '''method is used for scaling each feature.
 
|| '''mms dot fit underscore transform '''method is used for scaling each feature.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight '''scaled_x.head()'''
 
|| Highlight '''scaled_x.head()'''
 
|| Now we see the scaled data for the features in '''x.'''
 
|| Now we see the scaled data for the features in '''x.'''
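With its default settings, '''MinMaxScaler''' rescales each feature to the range 0 to 1 using (x - min) / (max - min). The sketch below reproduces that arithmetic by hand for one column. The helper name min_max_scale and the sample ages are illustrative assumptions, not part of the notebook.

```python
def min_max_scale(values):
    """Rescale numbers linearly so the smallest becomes 0 and the largest becomes 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread, so this sketch maps every entry to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

ages = [19, 35, 26, 47]            # illustrative values only
scaled_ages = min_max_scale(ages)
print(scaled_ages)
```

Scaling keeps features with large numeric ranges from dominating features with small ranges during training.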
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight '''Train and Test Split'''
 
|| Highlight '''Train and Test Split'''
 
|| Next, we split the data into training and testing sets.
 
|| Next, we split the data into training and testing sets.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight '''train_test_split(scaled_x,y,test_size=0.3,random_state=0)'''
 
|| Highlight '''train_test_split(scaled_x,y,test_size=0.3,random_state=0)'''
 
|| '''scaled underscore x '''contains the preprocessed features, and y is the target variable.  
 
|| '''scaled underscore x '''contains the preprocessed features, and y is the target variable.  
Line 228: Line 223:
  
 
The remaining 70 percent is used for training.
 
The remaining 70 percent is used for training.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
||  
+
|| Point to the code.
 
|| '''x underscore train and y underscore train''' are training features and labels.
 
|| '''x underscore train and y underscore train''' are training features and labels.
  
Line 237: Line 232:
  
 
Test data is used to evaluate the model performance.
 
Test data is used to evaluate the model performance.
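A minimal sketch of the 70/30 split, assuming scikit-learn is installed. The toy features and labels lists are illustrative and are not taken from the Ads dataset.

```python
from sklearn.model_selection import train_test_split

features = [[i, i * 2] for i in range(10)]   # 10 toy samples with 2 features each
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# test_size=0.3 holds out 30 percent; random_state fixes the shuffle for repeatability
x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, random_state=0
)
print(len(x_train), len(x_test))
```

Fixing random_state means the same rows land in the training and test sets on every run, which makes the reported accuracies reproducible.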
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight '''Model Instantiation of Binary Classification and Model training'''
 
|| Highlight '''Model Instantiation of Binary Classification and Model training'''
  
Line 244: Line 239:
  
 
We create an instance of '''LogisticRegression''' from the '''sklearn''' library.
 
We create an instance of '''LogisticRegression''' from the '''sklearn''' library.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight '''lr.fit(x_train,y_train)'''
 
|| Highlight '''lr.fit(x_train,y_train)'''
 
|| Now we train the model using the '''fit '''method on the '''training data'''.
 
|| Now we train the model using the '''fit '''method on the '''training data'''.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight '''y_train_pred = lr.predict(x_train) '''
 
|| Highlight '''y_train_pred = lr.predict(x_train) '''
 
|| Next let us calculate the '''training accuracy.'''
 
|| Next let us calculate the '''training accuracy.'''
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight '''Training Accuracy: 0.814'''
 
|| Highlight '''Training Accuracy: 0.814'''
 
|| We see the '''training accuracy''' is 0.814 which is pretty good.
 
|| We see the '''training accuracy''' is 0.814 which is pretty good.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight '''y_train_pred_proba = lr.predict_proba(x_train)[:, 1]'''
 
|| Highlight '''y_train_pred_proba = lr.predict_proba(x_train)[:, 1]'''
  
Line 262: Line 257:
  
 
It will return the predicted probabilities for each class.
 
It will return the predicted probabilities for each class.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight '''roc_auc_train = roc_auc_score(y_train, y_train_pred_proba)'''
 
|| Highlight '''roc_auc_train = roc_auc_score(y_train, y_train_pred_proba)'''
  
Line 272: Line 267:
  
 
A higher score indicates better performance.
 
A higher score indicates better performance.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight '''logloss_train = log_loss(y_train, y_train_pred_proba)'''
 
|| Highlight '''logloss_train = log_loss(y_train, y_train_pred_proba)'''
 
 
|| We also calculate the '''cross entropy loss''' for the training data.  
 
|| We also calculate the '''cross entropy loss''' for the training data.  
  
Line 280: Line 274:
  
 
A lower value indicates better model performance.
 
A lower value indicates better model performance.
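Cross-entropy (log) loss averages the negative log of the probability that the model assigned to the true class. The hand-written sketch below shows that arithmetic for binary labels. It is not the library's implementation, and the function name and sample probabilities are assumptions chosen for illustration.

```python
import math

def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Average cross-entropy for binary labels; y_prob is the probability of class 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)    # clip so that log(0) never occurs
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

y_true = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]   # probabilities close to the true labels
unsure = [0.6, 0.4, 0.5, 0.5]      # probabilities near 0.5

print(binary_log_loss(y_true, confident))
print(binary_log_loss(y_true, unsure))
```

The confident, correct probabilities yield the smaller loss, which is why a lower value indicates better performance.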
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight '''ROC-AUC Score: 0.917'''
 
|| Highlight '''ROC-AUC Score: 0.917'''
  
Line 291: Line 285:
  
 
It shows how well the model’s predictions match the actual labels.
 
It shows how well the model’s predictions match the actual labels.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight: '''Predictions for Test Data'''
 
|| Highlight: '''Predictions for Test Data'''
 
|| Further, we predict the label of a single sample from '''x underscore test'''.
 
|| Further, we predict the label of a single sample from '''x underscore test'''.
  
 
We store that sample in '''test underscore data''' and predict its class.
 
We store that sample in '''test underscore data''' and predict its class.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight: '''test_data = x_test.iloc[10].values.reshape(1, -1)'''
 
|| Highlight: '''test_data = x_test.iloc[10].values.reshape(1, -1)'''
 
|| Next, we select the features of the test data point at index 10 in '''x underscore test''' and reshape them into a single row.
 
|| Next, we select the features of the test data point at index 10 in '''x underscore test''' and reshape them into a single row.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight: '''predicted_class = lr.predict(test_data)'''
 
|| Highlight: '''predicted_class = lr.predict(test_data)'''
 
|| Then we use the '''Binary classification model''' to predict the classes.
 
|| Then we use the '''Binary classification model''' to predict the classes.
  
 
We predict the classes based on test underscore data.
 
We predict the classes based on test underscore data.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight: '''actual_class = y_test[10]'''
 
|| Highlight: '''actual_class = y_test[10]'''
 
|| '''actual underscore class''' has the actual class of the test data point.
 
|| '''actual underscore class''' has the actual class of the test data point.
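One point worth checking here, assuming y underscore test is a pandas Series (the notebook may hold it differently): after a shuffled split, a Series keeps its original index labels, so square-bracket indexing with an integer looks up a label rather than a position. The toy Series below is a hypothetical illustration of the difference.

```python
import pandas as pd

# A shuffled split leaves the original row labels in place (toy values)
y_test = pd.Series([0, 1, 0], index=[42, 7, 1])

by_position = y_test.iloc[1]   # second row of the test set, whatever its label
by_label = y_test.loc[1]       # row whose original index label is 1

print(by_position, by_label)
```

Under that assumption, using iloc for the label as well keeps it aligned with the row taken by x underscore test dot iloc.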
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight: '''print(f"Predicted class:'''
 
|| Highlight: '''print(f"Predicted class:'''
 
|| Finally, we print the predicted and actual class.
 
|| Finally, we print the predicted and actual class.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight: '''Predicted class: 0, Actual class: 0'''
 
|| Highlight: '''Predicted class: 0, Actual class: 0'''
 
 
|| We get the output, '''predicted value''' as '''0''' and the '''actual value''' as '''0'''.
 
|| We get the output, '''predicted value''' as '''0''' and the '''actual value''' as '''0'''.
  
 
In the output cell, ignore the warning if you get any.
 
In the output cell, ignore the warning if you get any.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight '''y_pred = lr.predict(x_test)'''
 
|| Highlight '''y_pred = lr.predict(x_test)'''
 
|| '''y underscore pred''' stores the predicted labels for the test data.
 
|| '''y underscore pred''' stores the predicted labels for the test data.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight: '''print("Binary classification - Actual vs Predicted:")'''
 
|| Highlight: '''print("Binary classification - Actual vs Predicted:")'''
  
 
|| We print actual and predicted class labels.
 
|| We print actual and predicted class labels.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight: '''test_accuracy = accuracy_score(y_test, y_pred)'''
 
|| Highlight: '''test_accuracy = accuracy_score(y_test, y_pred)'''
 
 
|| Then we also calculate the '''test accuracy.'''
 
|| Then we also calculate the '''test accuracy.'''
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight: '''Test Accuracy: 0.833'''
 
|| Highlight: '''Test Accuracy: 0.833'''
 
|| The '''test accuracy''' is approximately 0.833, which is also pretty good.
 
|| The '''test accuracy''' is approximately 0.833, which is also pretty good.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight
 
|| Highlight
  
Line 337: Line 329:
  
 
|| We calculate the '''ROC-AUC score''' and '''cross-entropy loss''' for the test data.
 
|| We calculate the '''ROC-AUC score''' and '''cross-entropy loss''' for the test data.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight '''ROC-AUC Score: 0.949'''
 
|| Highlight '''ROC-AUC Score: 0.949'''
  
Line 346: Line 338:
  
 
A Cross Entropy Loss of '''0.355''' shows fairly accurate predictions.
 
A Cross Entropy Loss of '''0.355''' shows fairly accurate predictions.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight
 
|| Highlight
  
Line 354: Line 346:
  
 
It shows how well the model is correctly classifying the instances.
 
It shows how well the model is correctly classifying the instances.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Show output plot
 
|| Show output plot
 
|| The '''confusion matrix''' shows '''76 non-buyers''' are correctly predicted.
 
|| The '''confusion matrix''' shows '''76 non-buyers''' are correctly predicted.
Line 368: Line 360:
 
This means the model performs well but struggles with some misclassifications.
 
This means the model performs well but struggles with some misclassifications.
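To make the four cells of a binary confusion matrix concrete, the sketch below counts them by hand. The function name and the toy labels are assumptions for illustration and do not reproduce the tutorial's results.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 is the positive class."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1          # buyer correctly predicted as buyer
        elif actual == 0 and predicted == 0:
            tn += 1          # non-buyer correctly predicted as non-buyer
        elif actual == 0 and predicted == 1:
            fp += 1          # non-buyer wrongly predicted as buyer
        else:
            fn += 1          # buyer wrongly predicted as non-buyer
    return tn, fp, fn, tp

y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tn + tp) / len(y_true)   # correct predictions over all predictions
print(tn, fp, fn, tp, round(accuracy, 3))
```

The correct predictions sit on the diagonal (tn and tp), and the two off-diagonal counts are the misclassifications.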
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Only narration
 
|| Only narration
 
|| We conclude our implementation of '''Binary classification'''.
 
|| We conclude our implementation of '''Binary classification'''.
  
 
We have successfully predicted if a given user will make a purchase in the store.
 
We have successfully predicted if a given user will make a purchase in the store.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Show slide:
 
|| Show slide:
  
Line 381: Line 373:
  
 
|| In this tutorial, we have learnt about
 
|| In this tutorial, we have learnt about
* <div style="margin-left:1.27cm;margin-right:0cm;">Logistic Regression</div>
+
* Logistic Regression
* <div style="margin-left:1.27cm;margin-right:0cm;">Binary Classification</div>
+
* Binary Classification
* <div style="margin-left:1.27cm;margin-right:0cm;">Multiclass Classification</div>
+
* Multiclass Classification
  
 
In the next tutorial, we’ll learn how to implement Multiclass classification for Logistic Regression.
 
In the next tutorial, we’ll learn how to implement Multiclass classification for Logistic Regression.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Show slide:  
 
|| Show slide:  
  
 
'''Assignment '''
 
'''Assignment '''
  
<span style="background-color:#ffffff;">As an assignment, please do the following:</span>
+
As an assignment, please do the following:
 
|| As an assignment,
 
|| As an assignment,
  
<span style="background-color:#ffffff;">Replace the </span><span style="background-color:#ffffff;">'''y underscore pred '''</span><span style="background-color:#ffffff;">code with the code as shown here.</span>
+
Replace the '''y underscore pred ''' code with the code as shown here.
  
 
Observe the changes in Training and Testing accuracy.
 
Observe the changes in Training and Testing accuracy.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
 
|| Show slide image:
 
|| Show slide image:
  
Line 404: Line 396:
 
'''binary.png'''
 
'''binary.png'''
 
|| After completing the assignment, the output should match the expected result.
 
|| After completing the assignment, the output should match the expected result.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
|| <div style="color:#252525;">Show Slide:</div>
+
|| Show Slide:
  
<div style="color:#252525;">'''FOSSEE Forum'''</div>
+
'''FOSSEE Forum'''
|| <span style="background-color:#ffffff;color:#000000;">For any general or technical questions on </span><span style="color:#000000;">'''Python for'''</span>
+
|| For any general or technical questions on '''Python for'''
  
<span style="color:#000000;">'''Machine Learning'''</span><span style="background-color:#ffffff;color:#000000;">, visit the</span><span style="background-color:#ffffff;color:#000000;">''' FOSSEE forum'''</span><span style="background-color:#ffffff;color:#000000;"> and post your question.</span>
+
'''Machine Learning''', visit the ''' FOSSEE forum''' and post your question.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.199cm;padding-right:0.191cm;"
+
|-  
|| <div style="color:#000000;">Show slide:</div>
+
|| Show slide:
  
<div style="color:#000000;">'''Thank You'''</div>
+
'''Thank You'''
|| <span style="background-color:#ffffff;color:#000000;">This is </span><span style="background-color:#ffffff;color:#000000;">'''Anvita Thadavoose Manjummel'''</span><span style="background-color:#ffffff;color:#000000;">, a FOSSEE Summer Fellow 2025, IIT Bombay signing off.</span>
+
|| This is '''Anvita Thadavoose Manjummel''', a FOSSEE Summer Fellow 2025, IIT Bombay signing off.
  
<div style="color:#000000;">Thanks for joining.</div>
+
Thanks for joining.
 
|-
 
|-
 
|}
 
|}

Revision as of 12:54, 14 July 2025

Visual Cue Narration
Show slide:

Welcome

Welcome to the Spoken Tutorial on Logistic Regression - Binary Classification.
Show slide:

Learning Objectives

In this tutorial, we will learn about:
  • Logistic Regression
  • Binary Classification
  • Multiclass Classification
Show slide:

System Requirements

To record this tutorial, I am using
  • Ubuntu Linux OS version 24.04
  • Jupyter Notebook IDE
Show slide:

Prerequisites

To follow this tutorial,
  • The learner must have basic knowledge of Python.
  • For pre-requisite Python tutorials, please visit this website.
Show slide:

Code files

  • The files used in this tutorial are provided in the Code files link.
  • Please download and extract the files.
  • Make a copy and then use them while practicing.
Show slide:

Logistic Regression

  • Logistic regression is a machine learning algorithm used for classification tasks.
  • The goal is to predict a binary outcome (like yes/no, true/false) based on input features.
Show slide:

Logistic Regression

  • Logistic regression uses the logistic (sigmoid) function, which is the inverse of the logit function.
  • This function maps the linear combination of features to a probability between zero and one.
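Logistic regression turns the linear combination of features into a probability through the logistic (sigmoid) function, the inverse of the logit. As a rough illustration, the sketch below implements it in plain Python. The function name sigmoid and the sample scores are illustrative assumptions, not part of the tutorial's notebook.

```python
import math

def sigmoid(z):
    """Map a real-valued score z to a probability between 0 and 1."""
    # Branch on the sign of z so math.exp never receives a large positive argument
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# z stands in for the linear combination of features (values chosen for illustration)
for z in (-4.0, 0.0, 4.0):
    print(z, round(sigmoid(z), 3))
```

A score of zero maps to a probability of 0.5. A common convention is to predict the positive class when the probability is at least 0.5.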
Show Slide: Types of classification There are two types of classification. They are:
  • Binary Classification
  • Multiclass Classification
Show slide:

Binary classification

  • Binary classification is used for modeling a binary target variable.
  • The target variable has only two possible outcomes.
Show slide:

Multiclass classification

  • Multiclass classification is an extension of binary classification.
  • The target variable can have more than two possible outcomes.
Open the file ads.csv and point to the fields as per narration. To implement the Binary classification model, we use the Ads dot csv dataset.

Here, we analyze customers to predict if they will make a purchase in the store.

Point to the LR_Binary.ipynb LR underscore Binary dot ipynb is the Python notebook file for this demonstration.
Press Ctrl+Alt+T keys

Type conda activate ml Press Enter

Let us open the Linux terminal by pressing the Ctrl, Alt and T keys together.

Activate the machine learning environment as shown.

Type cd Downloads

Type jupyter notebook

Press Enter

I have saved my code file in the Downloads folder.

Please navigate to the respective folder of your code file location.

Then type, jupyter space notebook and press Enter.

Show Jupyter Notebook Home page:

Double Click on LR_Binary.ipynb file

We can see the Jupyter Notebook Home page has opened in the web browser.

Click the LR underscore Binary dot ipynb file to open it.

Note that each cell will have the output displayed in this file.

Highlight import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

We have imported the necessary libraries for Binary classification.

Make sure to Press Shift and Enter to execute the code in each cell.

Highlight ads = pd.read_csv(r" Ads.csv")

ads.head()

We load the dataset into the variable ads using the pd dot read underscore csv method.
Highlight Data Exploration Let us explore the dataset.
Highlight

print(f"Shape of the dataset{ads.shape}")

print(ads.info())

First, we display the number of rows and columns of the dataset using ads dot shape.

Then we summarize the dataset, including rows, columns, and missing values using ads dot info.

Highlight plt.figure(figsize=(6, 4))

sns.countplot(x='Purchased', data=ads, palette='viridis')

Next, we visualize the dataset by plotting the count of the Purchased attribute.

This attribute represents the target variable.

In the output cell, ignore the warning if you get any.

Highlight Data Preprocessing Let us preprocess the dataset.
Highlight ads = ads.drop(columns=['User ID'])

ads.head()

We delete the User ID column, as it is not required for the prediction.

Let us display the first few rows of the updated data and verify.

Only narration In the dataset, the column Gender contains string data type.

The fit method in sklearn can't train models with string data.

Highlight l = LabelEncoder() So, we use the LabelEncoder class to convert the string data into integers.
Highlight ads['Gender'] = l.fit_transform(ads['Gender']) Using the fit underscore transform method, we encode the Gender column.

Now, female is encoded as 0 and male as 1.

Have a look at the encoded data.
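Why female becomes 0 and male becomes 1 can be seen from a minimal sketch of label encoding. It assumes the labels are numbered in sorted order, which is how LabelEncoder orders its classes. The helper name fit_label_encoding and the sample values are hypothetical.

```python
def fit_label_encoding(values):
    """Assign each distinct label an integer, in sorted order of the labels."""
    classes = sorted(set(values))
    return {label: index for index, label in enumerate(classes)}


# Hypothetical sample of the Gender column, for illustration only.
gender = ["Male", "Female", "Female", "Male"]

mapping = fit_label_encoding(gender)
encoded = [mapping[g] for g in gender]

print(mapping)
print(encoded)
```

Since "Female" sorts before "Male", it receives the smaller integer.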

Only narration

Highlight x = ads.drop(columns=["Purchased"])

x.head()

Now, let us prepare the data for training.

First we remove the target column Purchased from the ads dataset.

Then, we copy the remaining features into the variable x.

Notice that we have removed the Purchased column.

Highlight

y = ads.Purchased

y.head()

Next, let us assign the Purchased column, which is the target variable, to y.

The variable y has only the class labels as shown.

Only narration

Highlight mms = MinMaxScaler()

Next, we perform Feature Scaling which is used to normalize the features.

First, we create an instance of the MinMaxScaler class.

Highlight scaled_x = pd.DataFrame(mms.fit_transform(x),columns=x.columns) The mms dot fit underscore transform method is used for scaling each feature.
Highlight scaled_x.head() Now we see the scaled data for the features in x.
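The arithmetic behind min-max scaling is simple enough to sketch by hand. Each value is shifted by the column minimum and divided by the column range, so the result lies between 0 and 1. The helper min_max_scale and the sample ages are hypothetical and only illustrate the formula.

```python
def min_max_scale(column):
    """Rescale a list of numbers to the range [0, 1]."""
    lowest, highest = min(column), max(column)
    span = highest - lowest
    if span == 0:
        # A constant column has no spread; this sketch maps every value to 0.
        return [0.0 for _ in column]
    return [(value - lowest) / span for value in column]


# Hypothetical ages, for illustration only.
ages = [20, 35, 50, 60]
scaled = min_max_scale(ages)
print(scaled)
```

The smallest value maps to 0 and the largest to 1, so features measured in very different units end up on a comparable scale.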
Highlight Train and Test Split Next, we split the data into training and testing sets.
Highlight train_test_split(scaled_x,y,test_size=0.3,random_state=0) scaled underscore x contains the preprocessed features, and y is the target variable.

Setting test underscore size to 0.3 means 30 percent of the data is allocated for testing.

The remaining 70 percent is used for training.

Point to the code. x underscore train and y underscore train are training features and labels.

Training data is used to train the model.

x underscore test and y underscore test are test features and labels.

Test data is used to evaluate the model performance.
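The idea of a 70/30 split can be sketched with the standard library alone. This is a simplified illustration, not the sklearn implementation. The function simple_train_test_split is hypothetical, and its seeded shuffle will not select the same rows as train underscore test underscore split.

```python
import math
import random


def simple_train_test_split(rows, test_size=0.3, seed=0):
    """Shuffle row indices and hold out a fraction of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # fixed seed makes the split repeatable
    n_test = math.ceil(test_size * len(rows))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = [rows[i] for i in train_idx]
    test = [rows[i] for i in test_idx]
    return train, test


# Ten hypothetical rows, identified here only by their row number.
rows = list(range(10))
train_rows, test_rows = simple_train_test_split(rows)
print(len(train_rows), len(test_rows))
```

Fixing the seed plays the same role as random underscore state: the same rows land in the test set on every run.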

Highlight Model Instantiation of Binary Classification and Model training

Highlight lr = LogisticRegression()

Now let’s train the Binary Classification Model using Logistic Regression.

We create an instance of LogisticRegression from the sklearn library.

Highlight lr.fit(x_train,y_train) Now we train the model using the fit method on the training data.
Highlight y_train_pred = lr.predict(x_train) Next let us calculate the training accuracy.
Highlight Training Accuracy: 0.814 We see the training accuracy is 0.814 which is pretty good.
Highlight y_train_pred_proba = lr.predict_proba(x_train)[:, 1] The trained logistic regression model is used to predict the probabilities for the training data.

The predict underscore proba method returns the predicted probabilities for each class.

We select the second column, which holds the probability of class 1.

Highlight roc_auc_train = roc_auc_score(y_train, y_train_pred_proba) We calculate the ROC-AUC score for the model’s performance on the training data.

ROC is Receiver Operating Characteristic, and AUC is Area Under the Curve.

It measures how well the model distinguishes between the two classes.

A higher score indicates better performance.
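One way to read the ROC-AUC score is as the chance that a randomly chosen positive example receives a higher predicted probability than a randomly chosen negative one. The sketch below computes it by comparing every such pair. The function roc_auc_pairwise and the sample labels and scores are hypothetical.

```python
def roc_auc_pairwise(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted probabilities, for illustration only.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

print(roc_auc_pairwise(y_true, y_score))
```

Here three of the four positive-negative pairs are ranked correctly, so the score is 0.75. A score of 0.5 would mean the ranking is no better than chance.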

Highlight logloss_train = log_loss(y_train, y_train_pred_proba) We also calculate the cross entropy loss for the training data.

It measures how close the predicted probabilities are to the actual class labels.

A lower value indicates better model performance.
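The cross entropy loss for binary labels can be written out directly. The sketch below averages the negative log of the probability assigned to the true class. The function binary_log_loss, the clipping constant eps, and the sample values are assumptions made for illustration.

```python
import math


def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip the probability so that log(0) is never evaluated.
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Hypothetical labels and predicted probabilities of class 1.
y_true = [1, 0, 1, 0]
good_probs = [0.9, 0.1, 0.8, 0.3]
poor_probs = [0.4, 0.6, 0.3, 0.7]

print(round(binary_log_loss(y_true, good_probs), 3))
print(round(binary_log_loss(y_true, poor_probs), 3))
```

Confident predictions that agree with the labels give a small loss, while predictions that lean towards the wrong class are penalised heavily.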

Highlight ROC-AUC Score: 0.917

Cross Entropy Loss: 0.406

The ROC-AUC Score is 0.917.

This shows the model effectively distinguishes between the classes.

The cross-entropy loss is 0.406.

It shows how well the model’s predictions match the actual labels.

Highlight: Predictions for Test Data Further, we predict labels for x underscore test.

First, we predict the class of a single test data point, test underscore data.

Highlight: test_data = x_test.iloc[10].values.reshape(1, -1) Next, we extract the features of the test data point at position 10 in x underscore test.
Highlight: predicted_class = lr.predict(test_data) Then we use the Binary classification model to predict the class.

The prediction is based on test underscore data.

Highlight: actual_class = y_test.iloc[10] actual underscore class has the actual class of the same test data point.
Highlight: print(f"Predicted class: Finally, we print the predicted and actual class.
Highlight: Predicted class: 0, Actual class: 0 We get the output, predicted value as 0 and the actual value as 0.

In the output cell, ignore the warning if you get any.

Highlight y_pred = lr.predict(x_test) y underscore pred stores the predicted labels for the test data.
Highlight: print("Binary classification - Actual vs Predicted:") We print actual and predicted class labels.
Highlight: test_accuracy = accuracy_score(y_test, y_pred) Then we also calculate the test accuracy.
Highlight: Test Accuracy: 0.833 The test accuracy is approximately 0.833, which is also pretty good.
Highlight

y_test_pred_proba = lr.predict_proba(x_test)[:, 1]

We calculate the ROC-AUC score and cross-entropy loss for the test data.
Highlight ROC-AUC Score: 0.949

Cross Entropy Loss: 0.355

The ROC-AUC score is 0.949.

The model demonstrates excellent performance in distinguishing between classes.

A Cross Entropy Loss of 0.355 shows fairly accurate predictions.

Highlight

conf_matrix = confusion_matrix(y_test, y_pred)

We can also visualize the confusion matrix of the model’s performance.

It shows how many instances the model classifies correctly and incorrectly.

Show output plot The confusion matrix shows 76 non-buyers are correctly predicted.

However, three non-buyers are incorrectly classified as buyers.

Similarly, 24 actual buyers are correctly identified.

But 17 were misclassified as non-buyers.

A higher number in the diagonal indicates better accuracy.

This means the model performs well but struggles with some misclassifications.
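The four counts from the confusion matrix are enough to recompute the test accuracy by hand. The sketch below uses the values read from the plot. Precision and recall are added here as extra illustrations and are not part of the notebook output.

```python
# Counts read from the confusion matrix described above.
tn, fp = 76, 3    # non-buyers: correctly predicted, wrongly predicted as buyers
fn, tp = 17, 24   # buyers: wrongly predicted as non-buyers, correctly predicted

total = tn + fp + fn + tp

accuracy = (tp + tn) / total   # share of all predictions that are correct
precision = tp / (tp + fp)     # share of predicted buyers who are actual buyers
recall = tp / (tp + fn)        # share of actual buyers who are found

print(f"Total test samples: {total}")
print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
```

The accuracy works out to about 0.833, which agrees with the test accuracy reported earlier. The lower recall reflects the 17 buyers who were misclassified as non-buyers.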

Only narration We conclude our implementation of Binary classification.

We have successfully predicted if a given user will make a purchase in the store.

Show slide:

Summary

Only Narration

In this tutorial, we have learnt about
  • Logistic Regression
  • Binary Classification
  • Multiclass Classification

In the next tutorial, we’ll learn how to implement Multiclass classification for Logistic Regression.

Show slide:

Assignment

As an assignment, please do the following:
Replace the y underscore pred code with the code as shown here.

Observe the changes in Training and Testing accuracy.

Show slide image:

Assignment Solution

binary.png

After completing the assignment, the output should match the expected result.
Show Slide:
FOSSEE Forum
For any general or technical questions on Python for Machine Learning, visit the FOSSEE forum and post your question.

Show slide:

Thank You

This is Anvita Thadavoose Manjummel, a FOSSEE Summer Fellow 2025, IIT Bombay signing off.

Thanks for joining.

Contributors and Content Editors

Madhurig, Nirmala Venkat