Python-for-Machine-Learning/C2/Linear-Regression/English
Visual Cue Narration
Show slide:

Welcome

Welcome to the Spoken Tutorial on Linear Regression.
Show Slide:

Learning Objectives

In this tutorial, we will learn about
  • Linear Regression
  • Simple Linear Regression
  • Multiple Linear Regression
  • Evaluation Metrics
Show Slide:

System Requirements

To record this tutorial, I am using
  • Ubuntu Linux operating system 24.04
  • Jupyter Notebook IDE
Show Slide:

Prerequisite

To follow this tutorial,
  • The learner must have basic knowledge of Python.
  • For prerequisite Python tutorials, please visit this website.
Show Slide:

Code files

  • The files used in this tutorial are provided in the Code files link.
  • Please download and extract the files.
  • Make a copy and then use them while practicing.
Show Slide:

Linear Regression

  • Linear regression is a predictive technique used in machine learning.
  • It models the relationship between a dependent variable and one or more independent variables.
  • Linear regression is categorized into Simple and Multiple Linear Regression.
Show Slide:

Simple Linear Regression

  • Simple Linear Regression is a way to find relationships between two variables.
  • It studies how one independent variable affects one dependent variable.
Show Slide:

Multiple Linear Regression

  • Multiple Linear Regression is an extension of Simple Linear Regression.
  • It examines how multiple factors influence a single outcome.
Show Slide:

Evaluation Metrics

  • To assess the model’s performance, we use evaluation metrics.
  • These metrics indicate how well the regression model fits the data.
  • The two common metrics are Mean Absolute Error and R squared score.
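As a reference for the two metrics just listed, here is a minimal sketch of how they are computed by hand with NumPy; the helper names mae and r2 are hypothetical and not part of the tutorial code.

import numpy as np

def mae(y_true, y_pred):
    # Mean Absolute Error: the average magnitude of the prediction errors
    return np.mean(np.abs(y_true - y_pred))

def r2(y_true, y_pred):
    # R squared: the fraction of the variance in y_true explained by the model
    ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1 - ss_res / ss_tot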
Hover over the files

I have created the required files for the demonstration of Linear Regression.
Open the file salaries.csv and point to the fields as per narration.

Open the file salaries_mlr.csv and point to the fields as per narration.

To implement Simple Linear Regression, we use the salaries dot csv dataset.

This dataset contains salaries based on years of experience.

We use salaries underscore mlr dot csv dataset for Multiple Linear Regression.

This dataset contains multiple columns as shown.

Point to the LinearRegression.ipynb

LinearRegression dot ipynb is the Python notebook file for this demonstration.
Press Ctrl, Alt and T keys

Type conda activate ml

Press Enter

Let us open the Linux terminal. Press Ctrl, Alt and T keys together.

First, we need to activate the machine learning environment.

Run the command conda space activate space ml.

Press Enter.

Go to the Downloads folder

Type cd Downloads

Press Enter

Type jupyter notebook

Press Enter

I have saved my code file in the Downloads folder.

Please navigate to the respective folder of your code file location.

Then type, jupyter space notebook and press Enter.

Show Jupyter Notebook Home page

Click on LinearRegression.ipynb file

We see the Jupyter Notebook Home page.

Click the LinearRegression dot ipynb file to open it.

Note that each cell will have the output displayed in this file.

Highlight the lines:

import numpy as np

import pandas as pd

Press Shift+Enter

We start by importing the required libraries for Simple Linear Regression.

Make sure to press Shift and Enter to execute the code in each cell.
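Note that the later cells also rely on matplotlib, seaborn, and scikit-learn; a minimal sketch of the additional imports the notebook is assumed to contain:

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score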

Highlight the lines:

df_salary=pd.read_csv("salaries.csv")

Let us load the dataset into a variable called df underscore salary.
Highlight the lines:

df_salary.head()

Next, we display the first few rows of the data.
Highlight the lines:

df_salary.describe()

Now, we generate summary statistics for the numerical columns.
Highlight the lines:

sns.heatmap(df_salary.corr(), annot=True, cmap="coolwarm")

plt.show()

The correlation heatmap shows how attributes in the dataset are related.
Narration: Correlation measures how two variables are related to each other.

The correlation values range from -1 to 1.

Show the Correlation matrix output

Here, experience and income have a correlation of 0.97. This means that as experience increases, income also increases strongly.

Let us understand the correlation value ranges.

Show Slide:

Correlation Matrix

  • A value of 1 means a perfect positive correlation.
  • A value of -1 means a perfect negative correlation.
  • A value of 0 means no correlation.
Highlight the lines:

plt.figure(figsize=(6,4))

Now we create a boxplot to visualize the income distribution.
Show the output

This image is a boxplot of income before removing outliers.

Outliers are extreme values that differ significantly from other data points.

They are the small circles on the right side of the boxplot.

Here, incomes around 60,000 to 65,000 are considered as outliers.

The line inside the box is the median.

Highlight the lines:

Q1 = df_salary[['experience', 'income']].quantile(0.25)  

Q3 = df_salary[['experience', 'income']].quantile(0.75)  

IQR = Q3 - Q1

Next, we will remove these outliers using the Interquartile Range method.

We calculate first quartile Q1 and third quartile Q3 for experience and income.

Then, we compute the IQR and remove the outliers.
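The filtering step itself is not highlighted above; a minimal sketch of the standard 1.5 times IQR rule applied to both columns:

cols = ['experience', 'income']
# Keep only the rows that fall inside the IQR fences for every column
mask = ~((df_salary[cols] < (Q1 - 1.5 * IQR)) |
         (df_salary[cols] > (Q3 + 1.5 * IQR))).any(axis=1)
df_salary = df_salary[mask]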

Highlight the lines:

plt.figure(figsize=(6,4))

Now, we plot the income distribution after removing outliers.
Show the output

Observe that the small circles are gone, showing outliers were removed.
Highlight the lines:

x=df_salary['experience']

y=df_salary['income']

Now, we define x as experience and y as income from the dataset.
Highlight the lines:

Then, we split the data into training and testing sets.
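The split call itself is not shown in this revision of the script; a minimal sketch, assuming the same test_size and random_state used later for Multiple Linear Regression:

# test_size and random_state are assumptions borrowed from the MLR section
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)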
Highlight the lines:

x_train=np.array(x_train).reshape(-1,1)

x_test=np.array(x_test).reshape(-1,1)

We then reshape the x underscore train list into a 2D array.

The same is done for x underscore test for compatibility.
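For instance, reshape(-1, 1) turns a flat array into a column vector, which is the 2D shape scikit-learn expects for a single feature:

import numpy as np

a = np.array([1, 2, 3])
print(a.shape)                 # (3,)  - a 1D array
print(a.reshape(-1, 1).shape)  # (3, 1) - a 2D column vector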

Highlight the lines:

lr=LinearRegression()

lr.fit(x_train,y_train)

Now, we initialize a Linear Regression model and train it using training data.
Highlight the lines:

print("Intercept (W0):", lr.intercept_)

print("Coefficient (W1):", lr.coef_)

Then, we print the intercept W0 and coefficient W1 of the model.

These define the model’s intercept and slope, capturing the relationship between experience and income.
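In other words, the model predicts income as W0 plus W1 times experience. A minimal sketch of a manual prediction from the learned values; the variable years is hypothetical:

years = 5  # hypothetical years of experience
predicted_income = lr.intercept_ + lr.coef_[0] * years
print(predicted_income)  # matches lr.predict([[years]])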

Highlight the lines:

y_pred_train = lr.predict(x_train)

y_pred_train = y_pred_train.round().astype(int)

y_pred_train

Now, we use the trained model to make predictions on the training data.

We round the predictions to whole numbers for better readability.

Then, we display the rounded predictions.

Highlight the lines:

mae_train = mean_absolute_error(y_train, y_pred_train)

print("MAE (Training):", mae_train)

Next, we calculate the Mean Absolute Error on the training data.

Mean Absolute Error measures the average size of the prediction errors; lower values indicate better predictions.

Highlight the lines:

r2_score(y_train, y_pred_train)

Then, we compute the R squared score to evaluate the model’s performance.

R squared score measures how well the model explains the variance in the data.

A value closer to 1 indicates a stronger fit.

Highlight the lines:

y_pred_test = lr.predict(x_test)

y_pred_test = y_pred_test.round().astype(int)

y_pred_test

Now, we make predictions on the test data.
Highlight the lines:

plt.scatter(x_test,y_test)

To visualize performance, we create a scatter plot of the actual test values along with the fitted line.
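Only the scatter call is highlighted; a minimal sketch of how the plotting cell presumably completes, drawing the fitted regression line over the actual test points:

plt.scatter(x_test, y_test, label='Actual')  # actual incomes in the test set
plt.plot(x_test, lr.predict(x_test), color='red', label='Fitted line')
plt.xlabel('Experience')
plt.ylabel('Income')
plt.legend()
plt.show()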
Show the output

In the output, we can see that most points are close to the line.

It shows a positive correlation.

Highlight the lines:

mean_absolute_error(y_test,y_pred_test)

Now, we compute the Mean Absolute Error on the test data.
Highlight the lines:

r2_score(y_test, y_pred_test)

Then, we calculate and display the R squared score.
Narration

The model has a Mean Absolute Error of 1626.41, indicating the average size of its prediction errors.

The R-squared score of 0.87 shows the model explains most of the variance.

Overall, the model performs well but has some prediction errors.

Now let us see the implementation of Multiple Linear Regression.
Highlight the lines:

df_salaries = pd.read_csv("salaries_mlr.csv")

First, we load the dataset for Multiple Linear Regression.
Highlight the lines:

df_salaries.tail()

Then, we display the last five rows.
Highlight the lines:

df_salaries.dtypes

Next, we check the data types of each column in the dataset.
Highlight the lines:

df_salaries.isnull().sum()

We also check for missing values by counting the nulls in each column.
Highlight the lines:

df_salaries['gender'] = df_salaries['gender'].map({'m': 1, 'f': 0})

Now, we convert the gender column to numeric values: 1 for male and 0 for female.
Highlight the lines:

X = df_salaries.drop(columns='income')

y = df_salaries['income']

Then, we separate the features X and the target variable y for prediction.
Highlight the lines:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Now, we split the data into training and testing sets.
Highlight the lines:

model = LinearRegression()

model.fit(X_train, y_train)

We initialize a Linear Regression model and train it using the training data.
Highlight the lines:

coefficients = pd.DataFrame({'Feature': X.columns, 'Coefficient': model.coef_})

Next, we print the model's coefficients and intercept.
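The intercept is not part of the highlighted line; a minimal sketch of how the cell presumably completes:

print(coefficients)                     # one row per feature with its coefficient
print('Intercept:', model.intercept_)  # the model's intercept term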
Highlight the lines:

y_train_pred = model.predict(X_train)

y_train_pred = y_train_pred.round().astype(int)

y_train_pred

Now, we make predictions on the training data.
Highlight the lines:

mae_train = mean_absolute_error(y_train, y_train_pred)

print(f'Training data MAE: {mae_train}')

Next, we compute the Mean Absolute Error for training data.
Highlight the lines:

r2_train = r2_score(y_train, y_train_pred)

n_train = len(y_train)

Then, we compute the R squared score to measure the model performance.

After that, we compute and print the adjusted R squared score.
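The adjusted R squared computation itself is not highlighted; a minimal sketch using the standard formula, where k is the number of features:

k_train = X_train.shape[1]  # number of features
adj_r2_train = 1 - (1 - r2_train) * (n_train - 1) / (n_train - k_train - 1)
print(f'Training data Adjusted R2: {adj_r2_train}')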

Highlight the lines:

y_test_pred = model.predict(X_test)

y_test_pred = y_test_pred.round().astype(int)

Moving forward, we make predictions on the test data.
Highlight the lines:

plt.scatter(y_test, y_test_pred, color='red', label='Predicted')

plt.scatter(y_test, y_test, color='blue', alpha=0.5, label='Actual')

We compare actual vs predicted income using a scatter plot.
Highlight the lines:

mae_test = mean_absolute_error(y_test, y_test_pred)

print(f'Testing data MAE: {mae_test}')

Then, we compute the Mean Absolute Error for the test data.
Highlight the lines:

r2_test = r2_score(y_test, y_test_pred)

n_test = len(y_test)

k_test = X_test.shape[1]

Next, we calculate the R squared score for the test data.
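The adjusted score on the test side follows from the same formula; a minimal sketch:

adj_r2_test = 1 - (1 - r2_test) * (n_test - 1) / (n_test - k_test - 1)
print(f'Testing data Adjusted R2: {adj_r2_test}')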
Narration

The model has an MAE of 1700.15, showing the average prediction error in income.

The Adjusted R squared score is 0.921.

It indicates the model explains 92.1 percent of income variance.

Highlight the lines:

residuals = y_test - y_test_pred

plt.show()

Now, we analyse the residuals to check model errors.

We create a scatter plot of predicted values versus residuals.
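The plotting calls between the two highlighted lines are not shown; a minimal sketch that matches the described output, with a red dashed zero-residual line:

plt.scatter(y_test_pred, residuals)            # one residual per test prediction
plt.axhline(y=0, color='red', linestyle='--')  # zero-residual reference line
plt.xlabel('Predicted income')
plt.ylabel('Residuals')
plt.show()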

Highlight the output

This is a residual plot for the regression model.

The red dashed line represents zero residual.

Points above the line mean predictions are lower than actual values.

Points below the line mean predictions are higher than actual values.

Most residuals are close to zero, meaning predictions are fairly accurate.

Narration

Thus, we successfully implemented Multiple Linear Regression.
Show slide:

Summary

This brings us to the end of the tutorial. Let us summarize.

In this tutorial, we have learnt about

  • Linear Regression
  • Simple Linear Regression
  • Multiple Linear Regression
  • Evaluation Metrics
Show Slide:

Assignment

In Multiple Linear Regression code,

  • Replace the test_size parameter as shown here.
  • Observe the change in MAE and Adjusted R squared score.


Show Slide:

Assignment Solution

Show s1 img file

After completing the assignment, the output should match the expected result.
Show Slide:

FOSSEE Forum

For any general or technical questions on Python for Machine Learning, visit the FOSSEE forum and post your question.
Show Slide:

Thank you

This is Harini Theiveegan, a FOSSEE Summer Fellow 2025, IIT Bombay, signing off.

Thanks for joining.
