Python-for-Machine-Learning/C2/Linear-Regression/English
| Visual Cue | Narration |
| Show Slide:
Welcome |
Welcome to the Spoken Tutorial on Linear Regression. |
| Show Slide:
Learning Objectives |
In this tutorial, we will learn about
* Linear Regression
* Simple Linear Regression
* Multiple Linear Regression
* Evaluation Metrics |
| Show Slide:
System Requirements |
To record this tutorial, I am using
* Ubuntu Linux operating system 24.04
* Jupyter Notebook IDE |
| Show Slide:
Prerequisite |
To follow this tutorial,
* The learner must have basic knowledge of Python.
* For prerequisite Python tutorials, please visit this website. |
| Show Slide:
Code files |
* The files used in this tutorial are provided in the Code files link.
* Please download and extract the files.
* Make a copy and then use them while practicing. |
| Show Slide:
Linear Regression |
* Linear regression is a predictive technique used in machine learning.
* It models the relationship between a dependent and an independent variable.
* Linear regression is categorized into Simple and Multiple linear regression. |
| Show Slide:
Simple Linear Regression |
* Simple Linear Regression is a way to find the relationship between two variables.
* It studies how one independent variable affects one dependent variable. |
| Show Slide:
Multiple Linear Regression |
* Multiple Linear Regression is an extension of simple linear regression.
* It examines how multiple factors influence a single outcome. |
| Show Slide:
Evaluation Metrics |
* To assess the model's performance, we use evaluation metrics.
* These metrics indicate how well the regression model fits the data.
* The two common metrics are Mean Absolute Error and R squared score. |
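For reference, both metrics are available in scikit-learn. A minimal sketch with illustrative placeholder arrays (y_true and y_pred below are not from the tutorial's dataset):

    # Sketch: the two evaluation metrics, computed with scikit-learn.
    # y_true and y_pred are illustrative placeholder values.
    from sklearn.metrics import mean_absolute_error, r2_score

    y_true = [30000, 35000, 40000, 45000]
    y_pred = [31000, 34000, 41000, 44000]

    # MAE: the average absolute difference between actual and predicted values.
    print(mean_absolute_error(y_true, y_pred))  # 1000.0

    # R squared: the fraction of variance in y_true explained by the predictions.
    print(r2_score(y_true, y_pred))  # closer to 1 means a better fit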
| Hover over the files | I have created required files for the demonstration of Linear Regression. |
| Open the file salaries.csv and point to the fields as per narration.
Open the file salaries_mlr.csv and point to the fields as per narration. |
To implement Simple Linear Regression, we use the salaries dot csv dataset.
This dataset contains salaries based on years of experience. We use the salaries underscore mlr dot csv dataset for Multiple Linear Regression. This dataset contains multiple columns as shown. |
| Point to the LinearRegression.ipynb | LinearRegression dot ipynb is the Python notebook file for this demonstration. |
| Press Ctrl+Alt+T keys
Type conda activate ml
Press Enter |
Let us open the Linux terminal. Press Ctrl, Alt and T keys together.
First, we need to activate the machine learning environment. Run the command conda space activate space ml. Press Enter. |
| Go to the Downloads folder
Type cd Downloads
Press Enter
Type jupyter notebook
Press Enter |
I have saved my code file in the Downloads folder.
Please navigate to the respective folder of your code file location. Then type, jupyter space notebook and press Enter. |
| Show Jupyter Notebook Home page
Click on the LinearRegression.ipynb file |
We see the Jupyter Notebook Home page.
Click the LinearRegression dot ipynb file to open it. Note that each cell will have the output displayed in this file. |
| Highlight the lines:
import numpy as np
import pandas as pd
Press Shift+Enter |
We start by importing the required libraries for Simple Linear Regression.
Make sure to press Shift and Enter to execute the code in each cell. |
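The cell on screen highlights only numpy and pandas, but later cells also use seaborn, matplotlib, and scikit-learn. A sketch of what the full import cell presumably contains (the notebook's exact cell may differ):

    # Sketch of the full set of imports this workflow relies on.
    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error, r2_score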
| Highlight the lines:
df_salary=pd.read_csv("salaries.csv") |
Let us load the dataset into a variable called df underscore salary. |
| Highlight the lines:
df_salary.head() |
Next, we display the first few rows of the data. |
| Highlight the lines:
df_salary.describe() |
Now, we generate summary statistics for the numerical columns. |
| Highlight the lines:
sns.heatmap(df_salary.corr(), annot=True, cmap="coolwarm")
plt.show() |
Correlation heatmap shows how attributes in the dataset are related. |
| Narration: | Correlation measures how two variables are related to each other.
The correlation values range from -1 to 1. |
| Show the Correlation matrix output 4.47 | Here, experience and income have a correlation of 0.97. This means that as experience increases, income also increases strongly.
Let us understand the correlation value ranges. |
| Show Slide:
Correlation Matrix |
* A value of 1 means a perfect positive correlation.
* A value of -1 means a perfect negative correlation.
* A value of 0 means no correlation. |
| Highlight the lines:
plt.figure(figsize=(6,4)) |
Now we create a boxplot to visualize the income distribution. |
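Only the figure setup is quoted above; the plotting call itself is not shown. With seaborn, the boxplot can be drawn roughly as follows (a sketch under that assumption):

    # Sketch: boxplot of income before outlier removal
    # (the notebook's exact plotting call is not shown on screen).
    plt.figure(figsize=(6, 4))
    sns.boxplot(x=df_salary['income'])
    plt.title('Income distribution before removing outliers')
    plt.show()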
| Show the output | This image is a boxplot of income before removing outliers.
Outliers are extreme values that differ significantly from other data points. They are the small circles on the right side of the boxplot. Here, incomes around 60,000 to 65,000 are considered as outliers. The line inside the box is the median. |
| Highlight the lines:
Q1 = df_salary[['experience', 'income']].quantile(0.25)
Q3 = df_salary[['experience', 'income']].quantile(0.75)
IQR = Q3 - Q1 |
Next, we will remove these outliers using the Interquartile Range method.
We calculate the first quartile Q1 and third quartile Q3 for experience and income.
We calculate first quartile Q1 and third quartile Q3 for experience and income. Then, we compute the IQR and remove the outliers. |
| Highlight the lines:
plt.figure(figsize=(6,4)) |
Now, we plot the income distribution after removing outliers. |
| Show the output | Observe that the small circles are gone, showing outliers were removed. |
| Highlight the lines:
x=df_salary['experience']
y=df_salary['income'] |
Now, we define x as experience and y as income from the dataset. |
| Highlight the lines: | Then, we split the data into training and testing sets. |
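The split itself is not quoted on screen. With scikit-learn it is typically done as below; the parameter values here (test_size=0.3, random_state=42) are assumptions carried over from the Multiple Linear Regression cell later in the notebook:

    # Sketch of the train/test split (exact parameters are not shown on screen).
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=0.3, random_state=42)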
| Highlight the lines:
x_train=np.array(x_train).reshape(-1,1)
x_test=np.array(x_test).reshape(-1,1) |
We then reshape x underscore train into a 2D array.
The same is done for x underscore test for compatibility. |
| Highlight the lines:
lr=LinearRegression()
lr.fit(x_train,y_train) |
Now, we initialize a Linear Regression model and train it using training data. |
| Highlight the lines:
print("Intercept (W0):", lr.intercept_) print("Coefficient (W1):", lr.coef_) |
Then, we print the intercept W0 and coefficient W1 of the model.
These define the fitted line's position and slope, capturing the relationship between experience and income. |
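Concretely, the trained model predicts with a straight line. A sketch of how the printed values map to a prediction (the experience value here is illustrative):

    # Sketch: the fitted line. For an experience value x0,
    # the model predicts W0 + W1 * x0.
    x0 = 5  # e.g., 5 years of experience (illustrative value)
    predicted_income = lr.intercept_ + lr.coef_[0] * x0
    print(predicted_income)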
| Highlight the lines:
y_pred_train = lr.predict(x_train)
y_pred_train = y_pred_train.round().astype(int)
y_pred_train |
Now, we use the trained model to make predictions on the training data.
We round the predictions to whole numbers for better readability. Then, we display the rounded predictions. |
| Highlight the lines:
mae_train = mean_absolute_error(y_train, y_pred_train)
print("MAE (Training):", mae_train) |
Next, we calculate the Mean Absolute Error on the training data.
Mean Absolute Error measures prediction accuracy. |
| Highlight the lines:
r2_score(y_train, y_pred_train) |
Then, we compute the R squared score to evaluate the model’s performance.
R squared score measures how well the model explains the variance in the data. A value closer to 1 indicates a stronger fit. |
| Highlight the lines:
y_pred_test = lr.predict(x_test)
y_pred_test = y_pred_test.round().astype(int)
y_pred_test |
Now, we make predictions on the test data. |
| Highlight the lines:
plt.scatter(x_test,y_test) |
To visualize performance, we create a scatter plot of actual vs predicted values. |
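Only the first scatter call is quoted; the fitted line the narration refers to is presumably overlaid with something like the following (a sketch):

    # Sketch: test points with the fitted regression line overlaid.
    plt.scatter(x_test, y_test, label='Actual')
    plt.plot(x_test, y_pred_test, color='red', label='Fitted line')
    plt.xlabel('experience')
    plt.ylabel('income')
    plt.legend()
    plt.show()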
| Show the output | In the output we can see that most points are close to the line.
It shows a positive correlation. |
| Highlight the lines:
mean_absolute_error(y_test,y_pred_test) |
Now, we compute the Mean Absolute Error on the test data. |
| Highlight the lines:
r2_score(y_test, y_pred_test) |
Then, we calculate and display the R squared score. |
| Narration | The model has a Mean Absolute Error of 1626.41, the average size of its prediction errors.
The R squared score of 0.87 shows the model explains most of the variance. Overall, the model performs well but has some prediction errors. |
| Now let us see the implementation of Multiple Linear Regression. | |
| Highlight the lines:
df_salaries = pd.read_csv(r"salaries_mlr.csv") |
First, load the dataset for Multiple Linear Regression. |
| Highlight the lines:
df_salaries.tail() |
Then, we display the last five rows. |
| Highlight the lines:
df_salaries.dtypes |
Next, we check the data types of each column in the dataset. |
| Highlight the lines:
df_salaries.isnull().sum() |
We also check for any missing values in the dataset by summing them. |
| Highlight the lines:
df_salaries['gender'] = df_salaries['gender'].map({'m': 1, 'f': 0}) |
Now, we convert the gender column to numeric values, 1 for Male and 0 for Female. |
| Highlight the lines:
X = df_salaries.drop(columns='income')
y = df_salaries['income'] |
Then, we separate the features X and the target variable y for prediction. |
| Highlight the lines:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) |
Now, we split the data into training and testing sets. |
| Highlight the lines:
model = LinearRegression()
model.fit(X_train, y_train) |
We initialize a Linear Regression model and train it using the training data. |
| Highlight the lines:
coefficients = pd.DataFrame({'Feature': X.columns, 'Coefficient': model.coef_}) |
Next, we print the model's coefficients and intercept. |
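The quoted cell builds the coefficient table; printing the intercept alongside it presumably takes one more line, roughly (a sketch):

    # Sketch: display the per-feature coefficients and the intercept.
    print(coefficients)
    print('Intercept:', model.intercept_)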
| Highlight the lines:
y_train_pred = model.predict(X_train)
y_train_pred = y_train_pred.round().astype(int)
y_train_pred |
Now, we make predictions on the training data. |
| Highlight the lines:
mae_train = mean_absolute_error(y_train, y_train_pred)
print(f'Training data MAE: {mae_train}') |
Next, we compute the Mean Absolute Error for training data. |
| Highlight the lines:
r2_train = r2_score(y_train, y_train_pred)
n_train = len(y_train) |
Then, we compute the R squared score to measure the model performance.
After that, we compute and print the adjusted R squared score. |
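The adjusted R squared corrects R squared for the number of features k and the sample size n. Assuming the standard formula, the computation from the quoted variables looks like this (a sketch; the notebook's cell is only partially shown):

    # Sketch: adjusted R squared using the standard formula.
    r2_train = r2_score(y_train, y_train_pred)
    n_train = len(y_train)      # number of training samples
    k_train = X_train.shape[1]  # number of features
    adj_r2_train = 1 - (1 - r2_train) * (n_train - 1) / (n_train - k_train - 1)
    print('Adjusted R squared (training):', adj_r2_train)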
| Highlight the lines:
y_test_pred = model.predict(X_test)
y_test_pred = y_test_pred.round().astype(int) |
Moving forward, we make predictions on the test data. |
| Highlight the lines:
plt.scatter(y_test, y_test_pred, color='red', label='Predicted')
plt.scatter(y_test, y_test, color='blue', alpha=0.5, label='Actual') |
We compare actual vs predicted income using a scatter plot. |
| Highlight the lines:
mae_test = mean_absolute_error(y_test, y_test_pred)
print(f'Testing data MAE: {mae_test}') |
Then, we compute the Mean Absolute Error for the test data. |
| Highlight the lines:
r2_test = r2_score(y_test, y_test_pred)
n_test = len(y_test)
k_test = X_test.shape[1] |
Next, we calculate the R squared score for the test data. |
| Narration | The model has an MAE of 1700.15, showing the average prediction error in income. The Adjusted R squared score is 0.921.
It indicates the model explains 92.1 percent of income variance. |
| Highlight the lines:
residuals = y_test - y_test_pred
plt.show() |
Now, we analyse the residuals to check model errors.
We create a scatter plot of predicted values versus residuals. |
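The plotting calls between the two quoted lines are not shown; a typical residual plot is drawn roughly as follows (a sketch):

    # Sketch: residuals versus predicted values,
    # with a dashed zero-residual reference line.
    residuals = y_test - y_test_pred
    plt.scatter(y_test_pred, residuals, alpha=0.6)
    plt.axhline(y=0, color='red', linestyle='--')
    plt.xlabel('Predicted income')
    plt.ylabel('Residual (actual minus predicted)')
    plt.show()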
| Highlight the output | This is a residual plot for the regression model.
The red dashed line represents zero residual. Points above the line mean predictions are lower than actual values. Points below the line mean predictions are higher than actual values. Most residuals are close to zero, meaning predictions are fairly accurate. |
| Narration | Thus, we successfully implemented Multiple Linear Regression. |
| Show Slide:
Summary |
This brings us to the end of the tutorial. Let us summarize.
In this tutorial, we have learnt about
* Linear Regression
* Simple Linear Regression
* Multiple Linear Regression
* Evaluation Metrics |
| Show Slide:
Assignment
In Multiple Linear Regression code,
* Replace the test_size parameter as shown here. |
In Multiple Linear Regression code,
* Replace the test_size parameter as shown here.
* Observe the change in MAE and Adjusted R squared score. |
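The replacement value appears on the slide and is not reproduced here. As an illustration only, changing the split to an assumed value such as test_size=0.2 would look like this:

    # Sketch: assignment change. 0.2 is an illustrative value;
    # use the value shown on the slide.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)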
| Show Slide:
Assignment Solution Show s1 img file |
After completing the assignment, the output should match the expected result. |
| Show Slide:
FOSSEE Forum |
For any general or technical questions on Python for Machine Learning, visit the FOSSEE forum and post your question. |
| Show Slide:
Thank you |
This is Harini Theiveegan, a FOSSEE Summer Fellow 2025 at IIT Bombay, signing off.
Thanks for joining. |