Python-for-Machine-Learning/C2/Logistic-Regression-Binary-Classification/English
| Visual Cue | Narration |
| Show slide:
Welcome |
Welcome to the Spoken Tutorial on Logistic Regression - Binary Classification. |
| Show slide:
Learning Objectives |
In this tutorial, we will learn about:
Logistic Regression and the implementation of Binary classification. |
| Show slide:
System Requirements |
To record this tutorial, I am using
|
| Show slide:
Prerequisites |
To follow this tutorial,
|
| Show slide:
Code files |
|
| Show slide:
Logistic Regression |
|
| Show slide:
Logistic Regression |
|
| Show Slide: Types of classification | There are two types of classification. They are
Binary classification and Multiclass classification. |
| Show slide:
Binary classification |
|
| Show slide:
Multiclass classification |
|
| Open the file Ads.csv and point to the fields as per narration. | To implement the Binary classification model, we use the Ads dot csv dataset.
Here, we analyze customers to predict if they will make a purchase in the store. |
| Point to the LR_Binary.ipynb | LR underscore Binary dot ipynb is the python notebook file for this demonstration. |
| Press Ctrl+Alt+T keys
Type conda activate ml Press Enter |
Let us open the Linux terminal by pressing Ctrl, Alt and T keys together.
Activate the machine learning environment as shown. |
| Type cd Downloads
Type jupyter notebook Press Enter |
I have saved my code file in the Downloads folder.
Please navigate to the folder where your code file is saved. Then type jupyter space notebook and press Enter. |
| Show Jupyter Notebook Home page:
Double Click on LR_Binary.ipynb file |
We can see the Jupyter Notebook Home page has opened in the web browser.
Click the LR underscore Binary dot ipynb file to open it. Note that each cell will have the output displayed in this file. |
| Highlight import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns |
We have imported the necessary libraries for Binary classification.
Make sure to press Shift and Enter to execute the code in each cell. |
| Highlight ads = pd.read_csv(r" Ads.csv")
ads.head() |
We load the dataset into a variable ads using the method pd dot read underscore csv. |
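A minimal sketch of this loading step, assuming Ads.csv sits in the current working directory:

 import pandas as pd

 # Read the advertisement dataset into a DataFrame named ads
 ads = pd.read_csv("Ads.csv")
 print(ads.head())  # preview the first five rows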
| Highlight Data Exploration | Let us explore the dataset. |
| Highlight
print(f"Shape of the dataset{ads.shape}") print(ads.info()) |
First, we display the number of rows and columns of the dataset using ads dot shape.
Then we summarize the dataset, including rows, columns, and missing values using ads dot info. |
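The exploration step, sketched with the ads DataFrame from above:

 # ads.shape is a (rows, columns) tuple
 print(f"Shape of the dataset: {ads.shape}")

 # ads.info() prints column names, dtypes and non-null counts
 ads.info()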
| Highlight plt.figure(figsize=(6, 4))
sns.countplot(x='Purchased', data=ads, palette='viridis') |
Next, we visualize the dataset by plotting the count of the Purchased attribute.
This attribute represents the target variable. In the output cell, ignore the warning if you get any. |
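A sketch of the count plot, assuming matplotlib and seaborn are available as imported earlier:

 import matplotlib.pyplot as plt
 import seaborn as sns

 plt.figure(figsize=(6, 4))
 # Bar chart of how many rows fall into each Purchased class (0 or 1)
 sns.countplot(x='Purchased', data=ads, palette='viridis')
 plt.show()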
| Highlight Data Preprocessing | Let us preprocess the dataset. |
| Highlight ads = ads.drop(columns=['User ID'])
ads.head() |
We delete the User ID column, as it is not required for the prediction.
Let us display the first few rows of the updated data and verify. |
| Only narration | In the dataset, the column Gender contains string values.
The fit method in sklearn cannot train models on string data. |
| Highlight l = LabelEncoder() | So, we use the LabelEncoder class to convert the string values into integers. |
| Highlight ads['Gender'] = l.fit_transform(ads['Gender']) | Using the fit underscore transform we encode the Gender column.
Now, female is encoded as 0 and male as 1. Have a look at the encoded data. |
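A minimal sketch of the encoding step; LabelEncoder assigns integer codes in alphabetical order, which is why Female maps to 0 and Male to 1:

 from sklearn.preprocessing import LabelEncoder

 l = LabelEncoder()
 # Encode string labels as integers: Female -> 0, Male -> 1
 ads['Gender'] = l.fit_transform(ads['Gender'])
 print(ads.head())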
| Only narration
Highlight x = ads.drop(columns=["Purchased"])
x.head() |
Now, let us prepare the data for training.
First, we remove the target column Purchased from the ads dataset. Then, we copy the remaining features into the variable x. Notice that we have removed the Purchased column. |
| Highlight
y = ads.Purchased y.head() |
Next, let us assign the Purchased column, which is the target feature, to y.
The variable y has only the class labels as shown. |
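Feature and target preparation, sketched with the tutorial's variable names:

 # Features: every column except the target
 x = ads.drop(columns=["Purchased"])

 # Target: the class labels to predict
 y = ads.Purchased

 print(x.head())
 print(y.head())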
| Only narration
Highlight mms = MinMaxScaler() |
Next, we perform Feature Scaling, which normalizes the features to a common range.
First, we create an instance of MinMaxScaler. |
| Highlight scaled_x = pd.DataFrame(mms.fit_transform(x),columns=x.columns) | The mms dot fit underscore transform method is used to scale each feature. |
| Highlight scaled_x.head() | Now we see the scaled data for the feature x. |
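A sketch of the scaling step; each feature is rescaled to the range 0 to 1:

 import pandas as pd
 from sklearn.preprocessing import MinMaxScaler

 mms = MinMaxScaler()
 # fit_transform returns a NumPy array, so wrap it back into a DataFrame
 scaled_x = pd.DataFrame(mms.fit_transform(x), columns=x.columns)
 print(scaled_x.head())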
| Highlight Train and Test Split | Next, we split the data into training and testing sets. |
| Highlight train_test_split(scaled_x,y,test_size=0.3,random_state=0) | scaled underscore x contains the preprocessed features, and y is the target variable.
Setting test underscore size to 0.3 means 30 percent of the data is allocated for testing. The remaining 70 percent is used for training. |
| Point to the code. | x underscore train and y underscore train are training features and labels.
Training data is used to train the model. x underscore test and y underscore test are test features and labels. Test data is used to evaluate the model performance. |
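The split, sketched with the full assignment that the highlighted call belongs to:

 from sklearn.model_selection import train_test_split

 # 70% of the rows for training, 30% for testing;
 # random_state=0 makes the split reproducible
 x_train, x_test, y_train, y_test = train_test_split(
     scaled_x, y, test_size=0.3, random_state=0)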
| Highlight Model Instantiation of Binary Classification and Model training
Highlight lr = LogisticRegression() |
Now let’s train the Binary Classification Model using Logistic Regression.
We create an instance of LogisticRegression from the sklearn library. |
| Highlight lr.fit(x_train,y_train) | Now we train the model using the fit method on the training data. |
| Highlight y_train_pred = lr.predict(x_train) | Next, we predict labels for the training data to calculate the training accuracy. |
| Highlight Training Accuracy: 0.814 | We see the training accuracy is 0.814 which is pretty good. |
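Model creation, training and the accuracy check together, as a minimal sketch:

 from sklearn.linear_model import LogisticRegression
 from sklearn.metrics import accuracy_score

 lr = LogisticRegression()
 lr.fit(x_train, y_train)            # learn weights from the training data

 y_train_pred = lr.predict(x_train)  # predicted labels on the training set
 print(f"Training Accuracy: {accuracy_score(y_train, y_train_pred):.3f}")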
| Highlight y_train_pred_proba = lr.predict_proba(x_train)[:, 1] | The trained logistic regression model is used to predict probabilities for the training data.
It returns a probability for each class, and we keep the probability of the positive class. |
| Highlight roc_auc_train = roc_auc_score(y_train, y_train_pred_proba) | We calculate the ROC-AUC score for the model’s performance on the training data.
ROC is Receiver Operating Characteristic, and AUC is Area Under the Curve. It measures how well the model distinguishes between the two classes. A higher score indicates better performance. |
| Highlight logloss_train = log_loss(y_train, y_train_pred_proba) | We also calculate the cross entropy loss for the training data.
It measures how close the predicted probabilities are to the actual class labels. A lower value indicates better model performance. |
| Highlight ROC-AUC Score: 0.917
Cross Entropy Loss: 0.406 |
The ROC-AUC Score is 0.917.
This shows the model effectively distinguishes between the classes. The cross-entropy loss is 0.406. It shows how well the model’s predictions match the actual labels. |
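Both training-set metrics, sketched together; predict_proba returns one column per class, and column 1 holds the probability of the positive class:

 from sklearn.metrics import roc_auc_score, log_loss

 # Probability of class 1 for every training sample
 y_train_pred_proba = lr.predict_proba(x_train)[:, 1]

 roc_auc_train = roc_auc_score(y_train, y_train_pred_proba)
 logloss_train = log_loss(y_train, y_train_pred_proba)
 print(f"ROC-AUC Score: {roc_auc_train:.3f}")
 print(f"Cross Entropy Loss: {logloss_train:.3f}")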
| Highlight: Predictions for Test Data | Next, we make predictions on the test data.
For prediction, we use a single data point stored in test underscore data. |
| Highlight: test_data = x_test.iloc[10].values.reshape(1, -1) | Next, we select the features of the 10th test data point in x underscore test and reshape them for prediction. |
| Highlight: predicted_class = lr.predict(test_data) | Then we use the Binary classification model to predict the class of test underscore data. |
| Highlight: actual_class = y_test[10] | actual underscore class has the actual class of the test data point. |
| Highlight: print(f"Predicted class: | Finally, we print the predicted and actual class. |
| Highlight: Predicted class: 0, Actual class: 0 | We get the output, predicted value as 0 and the actual value as 0.
In the output cell, ignore the warning if you get any. |
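The single-point prediction, sketched end to end; here iloc is used for both x and y so that the same positional row is compared (the highlighted y_test[10] relies on label-based indexing instead):

 # Select the 10th test row and reshape it into a 2-D array of one sample
 test_data = x_test.iloc[10].values.reshape(1, -1)

 predicted_class = lr.predict(test_data)[0]  # model's predicted label
 actual_class = y_test.iloc[10]              # true label for the same row
 print(f"Predicted class: {predicted_class}, Actual class: {actual_class}")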
| Highlight y_pred = lr.predict(x_test) | y underscore pred predicts the target variables. |
| Highlight: print("Binary classification - Actual vs Predicted:") | We print actual and predicted class labels. |
| Highlight: test_accuracy = accuracy_score(y_test, y_pred) | Then we also calculate the test accuracy. |
| Highlight: Test Accuracy: 0.833 | The test accuracy is approximately 0.833, which is also pretty good. |
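Test-set evaluation as a minimal sketch; the Actual-vs-Predicted table layout here is an assumption, not the notebook's exact output:

 import pandas as pd
 from sklearn.metrics import accuracy_score

 y_pred = lr.predict(x_test)  # predicted labels for the whole test set
 print("Binary classification - Actual vs Predicted:")
 print(pd.DataFrame({"Actual": y_test.values, "Predicted": y_pred}).head())

 test_accuracy = accuracy_score(y_test, y_pred)
 print(f"Test Accuracy: {test_accuracy:.3f}")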
| Highlight
y_test_pred_proba = lr.predict_proba(x_test)[:, 1] |
We calculate the ROC-AUC score and cross-entropy loss for the test data. |
| Highlight ROC-AUC Score: 0.949
Cross Entropy Loss: 0.355 |
The ROC-AUC score is 0.949.
The model demonstrates excellent performance in distinguishing between classes. A Cross Entropy Loss of 0.355 shows fairly accurate predictions. |
| Highlight
conf_matrix = confusion_matrix(y_test, y_pred) |
We can also visualize the confusion matrix of the model’s performance.
It shows how well the model is correctly classifying the instances. |
| Show output plot | The confusion matrix shows 76 non-buyers are correctly predicted.
However, 3 non-buyers are incorrectly classified as buyers. Similarly, 24 actual buyers are correctly identified. But 17 buyers were misclassified as non-buyers. Higher numbers on the diagonal indicate better accuracy. This means the model performs well but struggles with some misclassifications. |
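A sketch of the confusion-matrix plot; the heatmap styling below is an assumption, as the notebook's exact plotting code is not shown:

 from sklearn.metrics import confusion_matrix

 conf_matrix = confusion_matrix(y_test, y_pred)

 # Rows are actual classes, columns are predicted classes;
 # the diagonal holds the correctly classified counts
 plt.figure(figsize=(5, 4))
 sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
 plt.xlabel('Predicted label')
 plt.ylabel('Actual label')
 plt.show()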
| Only narration | We conclude our implementation of Binary classification.
We have successfully predicted if a given user will make a purchase in the store. |
| Show slide:
Summary Only Narration |
In this tutorial, we have learnt about Logistic Regression and the implementation of Binary classification.
In the next tutorial, we’ll learn how to implement Multiclass classification using Logistic Regression. |
| Show slide:
Assignment As an assignment, please do the following: |
As an assignment,
replace the y underscore pred code with the code shown here. Observe the changes in Training and Testing accuracy. |
| Show slide image:
Assignment Solution binary.png |
After completing the assignment, your output should match the expected result shown here. |
| Show Slide:
FOSSEE Forum |
For any general or technical questions on Python for
Machine Learning, visit the FOSSEE forum and post your question. |
| Show slide:
Thank You |
This is Anvita Thadavoose Manjummel, a FOSSEE Summer Fellow 2025, IIT Bombay signing off.
Thanks for joining. |