Difference between revisions of "Python-for-Machine-Learning/C2/Linear-Regression/English"

From Script | Spoken-Tutorial
(Created page with " <div style="margin-left:1.27cm;margin-right:0cm;"></div> {| border="1" |- || '''Visual Cue''' || '''Narration''' |- |- style="border:0.5pt solid #000000;padding-top:0cm;padd...")
 
Line 1: Line 1:
  
  
<div style="margin-left:1.27cm;margin-right:0cm;"></div>
+
 
 
{| border="1"
 
{| border="1"
 
|-
 
|-
Line 7: Line 7:
 
|| '''Narration'''
 
|| '''Narration'''
 
|-
 
|-
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
|| <div style="color:#000000;">Show slide:</div>
+
|| Show slide:
  
<div style="color:#000000;">'''Welcome'''</div>
+
'''Welcome'''
 
|| Welcome to the Spoken Tutorial on''' Linear Regression'''.
 
|| Welcome to the Spoken Tutorial on''' Linear Regression'''.
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Show Slide:  
 
|| Show Slide:  
  
Line 19: Line 19:
  
 
|| In this tutorial, we will learn about
 
|| In this tutorial, we will learn about
* <div style="margin-left:1.27cm;margin-right:0cm;"><span style="color:#000000;">'''Linear Regression'''</span></div>
+
* '''Linear Regression'''
* <div style="margin-left:1.27cm;margin-right:0cm;"><span style="color:#000000;">'''Simple Linear Regression'''</span></div>
+
* '''Simple Linear Regression'''
* <div style="margin-left:1.27cm;margin-right:0cm;">'''Multiple Linear Regression'''</div>
+
* '''Multiple Linear Regression'''
* <div style="margin-left:1.27cm;margin-right:0cm;"><span style="color:#000000;">'''Evaluation Metrics'''</span></div>
+
* '''Evaluation Metrics'''
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Show Slide:  
 
|| Show Slide:  
  
Line 30: Line 30:
  
 
|| To record this tutorial, I am using
 
|| To record this tutorial, I am using
* <div style="margin-left:1.27cm;margin-right:0cm;"><span style="color:#000000;">'''Ubuntu Linux operating system 24.04'''</span></div>
+
* '''Ubuntu Linux operating system 24.04'''
* <div style="margin-left:1.27cm;margin-right:0cm;"><span style="background-color:transparent;color:#000000;">'''Jupyter Notebook'''</span><span style="background-color:transparent;color:#000000;"> </span><span style="background-color:transparent;color:#000000;">'''IDE'''</span></div>
+
* '''Jupyter Notebook''' '''IDE'''
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Show Slide:  
 
|| Show Slide:  
  
 
'''Prerequisite'''
 
'''Prerequisite'''
 
|| To follow this tutorial,
 
|| To follow this tutorial,
* <div style="margin-left:1.27cm;margin-right:0cm;"><span style="background-color:transparent;color:#000000;">The learner must have basic knowledge of </span><span style="background-color:transparent;color:#000000;">'''Python.'''</span></div>
+
* The learner must have basic knowledge of '''Python.'''
* <div style="margin-left:1.27cm;margin-right:0cm;">For prerequisite '''Python''' tutorials, please visit this website.</div>
+
* For prerequisite '''Python''' tutorials, please visit this website.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Show Slide:
 
|| Show Slide:
  
 
'''Code files'''
 
'''Code files'''
 
||
 
||
* <div style="margin-left:1.27cm;margin-right:0cm;"><span style="background-color:transparent;color:#000000;">The files used in this tutorial are provided in the </span><span style="background-color:transparent;color:#000000;">'''Code files '''</span><span style="background-color:transparent;color:#000000;">link.</span></div>
+
* The files used in this tutorial are provided in the '''Code files '''link.
* <div style="margin-left:1.27cm;margin-right:0cm;"><span style="background-color:transparent;color:#252525;">Please download and extract the files.</span></div>
+
* Please download and extract the files.
* <div style="margin-left:1.27cm;margin-right:0cm;"><span style="background-color:transparent;color:#252525;">Make a copy and then use them while practicing.</span></div>
+
* Make a copy and then use them while practicing.
 
+
|-  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
 
|| Show Slide:  
 
|| Show Slide:  
  
Line 55: Line 54:
  
 
||  
 
||  
* <div style="margin-left:1.27cm;margin-right:0cm;">'''Linear regression''' is a predictive technique used in machine learning. </div>
+
* '''Linear regression''' is a predictive technique used in machine learning.  
* <div style="margin-left:1.27cm;margin-right:0cm;">It builds the relationship between a '''dependent''' and '''independent''' variable.</div>
+
* It models the relationship between a '''dependent''' variable and one or more '''independent''' variables.
* <div style="margin-left:1.27cm;margin-right:0cm;"><span style="background-color:transparent;color:#000000;">Linear regression is categorized into </span><span style="background-color:transparent;color:#000000;">'''Simple'''</span><span style="background-color:transparent;color:#000000;"> and </span><span style="background-color:transparent;color:#000000;">'''Multiple linear regression'''</span><span style="background-color:transparent;color:#000000;">.</span></div>
+
* Linear regression is categorized into '''Simple''' and '''Multiple linear regression'''.
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Show Slide:  
 
|| Show Slide:  
  
Line 65: Line 64:
  
 
||  
 
||  
* <div style="margin-left:1.27cm;margin-right:0cm;"><span style="background-color:transparent;">'''Simple Linear Regression '''</span><span style="background-color:transparent;">is a way to find </span>relationships<span style="background-color:transparent;"> between two variables.</span></div>
+
* '''Simple Linear Regression '''is a way to find relationships between two variables.
* <div style="margin-left:1.27cm;margin-right:0cm;">It studies how one '''independent variable''' affects one '''dependent variable'''.</div>
+
* It studies how one '''independent variable''' affects one '''dependent variable'''.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Show Slide:  
 
|| Show Slide:  
  
 
'''Multiple Linear Regression'''
 
'''Multiple Linear Regression'''
 
||
 
||
* <div style="margin-left:1.27cm;margin-right:0cm;"><span style="background-color:transparent;">'''Multiple linear Regression'''</span><span style="background-color:transparent;"> is an extension of simple linear regression.</span></div>
+
* '''Multiple linear Regression''' is an extension of simple linear regression.
* <div style="margin-left:1.27cm;margin-right:0cm;">It examines how multiple factors influence a single outcome.</div>
+
* It examines how multiple factors influence a single outcome.
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Show Slide:
 
|| Show Slide:
  
 
'''Evaluation Metrics'''
 
'''Evaluation Metrics'''
 
||
 
||
* <div style="margin-left:1.27cm;margin-right:0cm;">To assess the model’s performance, we use '''evaluation metrics'''.</div>
+
* To assess the model’s performance, we use '''evaluation metrics'''.
* <div style="margin-left:1.27cm;margin-right:0cm;">These metrics indicate how well the '''regression model''' fits the data. </div>
+
* These metrics indicate how well the '''regression model''' fits the data.  
* <div style="margin-left:1.27cm;margin-right:0cm;"><span style="background-color:transparent;">The two common metrics are </span><span style="background-color:transparent;">'''Mean Absolute Error '''</span><span style="background-color:transparent;">and </span><span style="background-color:transparent;">'''R squared score.'''</span></div>
+
* The two common metrics are '''Mean Absolute Error '''and '''R squared score.'''
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Hover over the files
 
|| Hover over the files
 
|| I have created the required files for the demonstration of '''Linear Regression'''.
 
|| I have created the required files for the demonstration of '''Linear Regression'''.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Open the file salaries.csv and point to the fields as per narration.
 
|| Open the file salaries.csv and point to the fields as per narration.
  
Line 98: Line 97:
  
 
This dataset contains multiple columns as shown.
 
This dataset contains multiple columns as shown.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Point to the '''LinearRegression.ipynb '''
 
|| Point to the '''LinearRegression.ipynb '''
 
|| '''LinearRegression dot ipynb''' is the Python notebook file for this demonstration.
 
|| '''LinearRegression dot ipynb''' is the Python notebook file for this demonstration.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Press '''Ctrl'''+'''Alt'''+'''T''' keys
 
|| Press '''Ctrl'''+'''Alt'''+'''T''' keys
  
Line 115: Line 114:
 
Press '''Enter.'''
 
Press '''Enter.'''
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Go to the '''Downloads '''folder
 
|| Go to the '''Downloads '''folder
  
Line 131: Line 130:
 
Then type '''jupyter space notebook''' and press '''Enter'''.
 
Then type '''jupyter space notebook''' and press '''Enter'''.
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Show '''Jupyter Notebook Home page'''
 
|| Show '''Jupyter Notebook Home page'''
  
Line 139: Line 138:
 
Click the '''LinearRegression dot ipynb''' file to open it.
 
Click the '''LinearRegression dot ipynb''' file to open it.
  
<div style="color:#000000;">Note that each cell will have the output displayed in this file.</div>
+
Note that each cell will have the output displayed in this file.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight the lines:
 
|| Highlight the lines:
  
Line 153: Line 152:
 
Make sure to press '''Shift''' and '''Enter''' to execute the code in each cell.
 
Make sure to press '''Shift''' and '''Enter''' to execute the code in each cell.
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight the lines:
 
|| Highlight the lines:
  
Line 159: Line 158:
  
 
|| Let us load the dataset into a variable called '''df underscore salary.'''
 
|| Let us load the dataset into a variable called '''df underscore salary.'''
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight the lines:
 
|| Highlight the lines:
  
Line 165: Line 164:
  
 
|| Next, we display the '''first few rows''' of the data.
 
|| Next, we display the '''first few rows''' of the data.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight the lines:
 
|| Highlight the lines:
  
Line 171: Line 170:
  
 
|| Now, we generate '''summary statistics''' for the numerical columns.
 
|| Now, we generate '''summary statistics''' for the numerical columns.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight the lines:
 
|| Highlight the lines:
  
Line 179: Line 178:
  
 
|| '''Correlation heatmap''' shows how attributes in the dataset are related.
 
|| '''Correlation heatmap''' shows how attributes in the dataset are related.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Narration:
 
|| Narration:
 
|| '''Correlation''' measures how two variables are related to each other.
 
|| '''Correlation''' measures how two variables are related to each other.
Line 185: Line 184:
 
'''Correlation''' measures the relationship between two variables
 
'''Correlation''' measures the relationship between two variables
  
<div style="color:#000000;">The '''correlation''' '''values range from -1 to 1'''.</div>
+
The '''correlation''' '''values range from -1 to 1'''.
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Show the Correlation matrix output 4.47
 
|| Show the Correlation matrix output 4.47
  
|| <div style="color:#000000;">Here, experience and income have a correlation of '''0.97'''. This indicates a strong positive relationship: as '''experience increases''', '''income also increases'''.</div>
+
|| Here, experience and income have a correlation of '''0.97'''. This indicates a strong positive relationship: as '''experience increases''', '''income also increases'''.
  
 
Let us understand the correlation value ranges.
 
Let us understand the correlation value ranges.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Show Slide:
 
|| Show Slide:
  
 
'''Correlation Matrix'''
 
'''Correlation Matrix'''
 
||
 
||
* <div style="margin-left:1.27cm;margin-right:0cm;">A value of '''1''' means a '''perfect''' '''positive correlation'''.</div>
+
* A value of '''1''' means a '''perfect''' '''positive correlation'''.
* <div style="margin-left:1.27cm;margin-right:0cm;">A value of '''-1''' means a '''perfect negative correlation'''.</div>
+
* A value of '''-1''' means a '''perfect negative correlation'''.
* <div style="margin-left:1.27cm;margin-right:0cm;">A value of '''0 '''means '''no correlation'''</div>
+
* A value of '''0''' means '''no linear correlation'''.
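
The three reference values above can be reproduced with a small sketch. The helper name pearson and the sample numbers are hypothetical and are not taken from salaries.csv. The sketch uses only the standard library to show the arithmetic behind a correlation value.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of x and y around their means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical values chosen so the relationship is exactly linear
experience = [1, 2, 3, 4, 5]
rising = [30, 35, 40, 45, 50]    # income grows with experience
falling = [50, 45, 40, 35, 30]   # income shrinks with experience

print(pearson(experience, rising))   # perfect positive correlation
print(pearson(experience, falling))  # perfect negative correlation
```

A value near 0 would appear when the points show no straight-line trend at all.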
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight the lines:
 
|| Highlight the lines:
  
Line 208: Line 207:
  
 
|| Now we create a '''boxplot''' to visualize the income distribution.
 
|| Now we create a '''boxplot''' to visualize the income distribution.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Show the output
 
|| Show the output
  
Line 221: Line 220:
 
The line inside the box is the '''median'''.
 
The line inside the box is the '''median'''.
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight the lines:
 
|| Highlight the lines:
  
Line 236: Line 235:
 
Then, we compute the''' IQR '''and remove the outliers.
 
Then, we compute the''' IQR '''and remove the outliers.
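
As a rough illustration of IQR-based outlier removal, here is a minimal sketch. It assumes the conventional fences of 1.5 times the IQR below the first quartile and above the third quartile. The multiplier, the helper name remove_outliers_iqr and the sample values are assumptions, not details confirmed by the notebook.

```python
import statistics

def remove_outliers_iqr(values, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    # "inclusive" interpolates between data points within the observed range
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lower <= v <= upper]

# Hypothetical incomes with one extreme value
incomes = [10, 12, 11, 13, 12, 11, 100]
cleaned = remove_outliers_iqr(incomes)
print(cleaned)  # the value 100 falls outside the fences and is dropped
```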
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight the lines:
 
|| Highlight the lines:
  
Line 243: Line 242:
 
|| Now, we plot the income distribution after '''removing outliers'''.
 
|| Now, we plot the income distribution after '''removing outliers'''.
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Show the output
 
|| Show the output
  
 
|| Observe that the small circles are gone, showing outliers were removed.
 
|| Observe that the small circles are gone, showing outliers were removed.
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight the lines:
 
|| Highlight the lines:
  
Line 255: Line 254:
 
'''y=df_salary['income'] '''
 
'''y=df_salary['income'] '''
 
|| Now, we define '''x''' as '''experience''' and '''y''' as '''income''' from the dataset.
 
|| Now, we define '''x''' as '''experience''' and '''y''' as '''income''' from the dataset.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight the lines:
 
|| Highlight the lines:
  
 
|| Then, we split the data into '''training''' and '''testing''' '''sets'''.
 
|| Then, we split the data into '''training''' and '''testing''' '''sets'''.
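
The idea of a train/test split can be sketched without any library. The 80/20 ratio, the seed and the helper name split_train_test are assumptions made for illustration. The notebook may use a library function and different settings.

```python
import random

def split_train_test(rows, test_size=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    # A seeded generator makes the split reproducible
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))          # stand-in for ten dataset rows
train, test = split_train_test(rows)
print(len(train), len(test))    # 8 2
```

The model is fitted on the training rows only, so the test rows act as unseen data when the metrics are computed.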
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight the lines:
 
|| Highlight the lines:
  
Line 270: Line 269:
 
The same is done for '''x underscore test''' for '''compatibility.'''
 
The same is done for '''x underscore test''' for '''compatibility.'''
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight the lines:
 
|| Highlight the lines:
  
Line 278: Line 277:
  
 
|| Now, we initialize a '''Linear Regression model''' and train it using training data.
 
|| Now, we initialize a '''Linear Regression model''' and train it using training data.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight the lines:
 
|| Highlight the lines:
  
Line 289: Line 288:
 
These define the model’s '''slope and relationship''' between experience and income.
 
These define the model’s '''slope and relationship''' between experience and income.
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight the lines:
 
|| Highlight the lines:
  
Line 303: Line 302:
 
Then, we display the rounded predictions.
 
Then, we display the rounded predictions.
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight the lines:
 
|| Highlight the lines:
  
Line 314: Line 313:
 
'''Mean Absolute Error''' measures the '''average absolute difference''' between predicted and actual values.
 
'''Mean Absolute Error''' measures the '''average absolute difference''' between predicted and actual values.
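
The arithmetic behind this metric is short enough to write by hand. The helper name mae and the numbers below are hypothetical, shown only to make the calculation concrete.

```python
def mae(y_true, y_pred):
    """Mean of the absolute differences between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical actual and predicted incomes
actual = [30, 40, 50]
predicted = [32, 39, 47]
print(mae(actual, predicted))  # (2 + 1 + 3) / 3 = 2.0
```

The result is in the same unit as the target, so a lower value means predictions are closer to the actual incomes on average.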
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight the lines:
 
|| Highlight the lines:
  
Line 324: Line 323:
 
A '''value closer to''' '''1''' indicates a '''stronger fit.'''
 
A '''value closer to''' '''1''' indicates a '''stronger fit.'''
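
To see where the score comes from, here is a minimal sketch of the usual definition, 1 minus the ratio of residual to total sum of squares. The helper name r_squared and the sample values are hypothetical. Note that the actual values are passed first: the total sum of squares is computed from them, so swapping the arguments changes the result.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

actual = [1, 2, 3]
predicted = [1, 2, 4]
print(r_squared(actual, predicted))  # 1 - 1/2 = 0.5
print(r_squared(predicted, actual))  # a different value: argument order matters
print(r_squared(actual, actual))     # 1.0 for a perfect fit
```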
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight the lines:
 
|| Highlight the lines:
  
Line 334: Line 333:
  
 
|| Now, we make predictions on the test data.
 
|| Now, we make predictions on the test data.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight the lines:
 
|| Highlight the lines:
  
Line 341: Line 340:
 
|| To visualize performance, we create a '''scatter plot of actual vs predicted values'''.
 
|| To visualize performance, we create a '''scatter plot of actual vs predicted values'''.
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Show the output
 
|| Show the output
 
|| In the output we can see that most points are close to the line.
 
|| In the output we can see that most points are close to the line.
Line 347: Line 346:
 
It shows a '''positive correlation.'''
 
It shows a '''positive correlation.'''
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight the lines:
 
|| Highlight the lines:
  
Line 354: Line 353:
 
|| Now, compute the '''Mean Absolute Error '''on the test data.
 
|| Now, compute the '''Mean Absolute Error '''on the test data.
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight the lines:
 
|| Highlight the lines:
  
 
'''r2_score(y_test, y_pred_test) '''
 
'''r2_score(y_test, y_pred_test) '''
 
|| Then, we calculate and display the '''R squared score'''.
 
|| Then, we calculate and display the '''R squared score'''.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Narration
 
|| Narration
  
Line 368: Line 367:
 
Overall, the model performs well but has some prediction errors.
 
Overall, the model performs well but has some prediction errors.
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
||  
 
||  
 
|| Now let us see the implementation of '''Multiple Linear Regression'''.
 
|| Now let us see the implementation of '''Multiple Linear Regression'''.
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight the lines:
 
|| Highlight the lines:
  
Line 379: Line 378:
 
|| First, load the dataset for '''Multiple Linear Regression'''.
 
|| First, load the dataset for '''Multiple Linear Regression'''.
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight the lines:
 
|| Highlight the lines:
  
Line 385: Line 384:
  
 
|| Then, we display the '''last five rows.'''
 
|| Then, we display the '''last five rows.'''
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight the lines:
 
|| Highlight the lines:
  
Line 391: Line 390:
  
 
|| Next, we check the '''data types''' of each column in the dataset.
 
|| Next, we check the '''data types''' of each column in the dataset.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight the lines:
 
|| Highlight the lines:
  
Line 397: Line 396:
 
|| We also check for any '''missing values''' in the dataset by summing them.
 
|| We also check for any '''missing values''' in the dataset by summing them.
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight the lines:
 
|| Highlight the lines:
  
Line 403: Line 402:
  
 
|| Now, we convert '''gender column''' to numeric values, '''1 for Male''' and '''0 for Female'''.
 
|| Now, we convert '''gender column''' to numeric values, '''1 for Male''' and '''0 for Female'''.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight the lines:
 
|| Highlight the lines:
  
Line 411: Line 410:
  
 
|| Then, we separate the '''features X''' and the '''target variable y''' for prediction.
 
|| Then, we separate the '''features X''' and the '''target variable y''' for prediction.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight the lines:
 
|| Highlight the lines:
  
Line 417: Line 416:
  
 
|| Now, we split the data into '''training and testing sets.'''
 
|| Now, we split the data into '''training and testing sets.'''
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight the lines:
 
|| Highlight the lines:
  
Line 425: Line 424:
  
 
|| We initialize a''' Linear Regression model''' and train it using the training data.
 
|| We initialize a''' Linear Regression model''' and train it using the training data.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight the lines:
 
|| Highlight the lines:
  
Line 432: Line 431:
 
|| Next, we print the model's '''coefficients and intercept'''.  
 
|| Next, we print the model's '''coefficients and intercept'''.  
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight the lines:
 
|| Highlight the lines:
  
Line 441: Line 440:
 
'''y_train_pred'''
 
'''y_train_pred'''
 
|| Now, we make '''predictions on the training data.'''
 
|| Now, we make '''predictions on the training data.'''
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight the lines:
 
|| Highlight the lines:
  
Line 447: Line 446:
  
 
|| Next, we compute the '''Mean Absolute Error for training data'''.
 
|| Next, we compute the '''Mean Absolute Error for training data'''.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight the lines:
 
|| Highlight the lines:
  
Line 457: Line 456:
 
After that, we compute and print the '''adjusted R squared '''score.
 
After that, we compute and print the '''adjusted R squared '''score.
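
The adjusted score penalises R squared for the number of features. A minimal sketch of the standard formula follows. The helper name adjusted_r_squared and the inputs are hypothetical and are not values from the notebook.

```python
def adjusted_r_squared(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Hypothetical case: R^2 of 0.9 from 100 rows and 3 features
print(adjusted_r_squared(0.9, n_samples=100, n_features=3))  # about 0.8969
```

With many rows and few features the adjustment is small. It grows as features are added without improving the fit.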
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight the lines:
 
|| Highlight the lines:
  
Line 465: Line 464:
  
 
|| Moving forward, we make '''predictions on the test data.'''
 
|| Moving forward, we make '''predictions on the test data.'''
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight the lines:
 
|| Highlight the lines:
  
Line 472: Line 471:
 
'''plt.scatter(y_test, y_test, color='blue', alpha=0.5, label='Actual') '''
 
'''plt.scatter(y_test, y_test, color='blue', alpha=0.5, label='Actual') '''
 
|| We compare '''actual vs predicted income''' using a '''scatter plot.'''
 
|| We compare '''actual vs predicted income''' using a '''scatter plot.'''
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight the lines:
 
|| Highlight the lines:
  
Line 478: Line 477:
  
 
|| Then, we compute the '''Mean Absolute Error''' for the test data.
 
|| Then, we compute the '''Mean Absolute Error''' for the test data.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight the lines:
 
|| Highlight the lines:
  
Line 487: Line 486:
 
'''k_test = X_test.shape[1] '''
 
'''k_test = X_test.shape[1] '''
 
|| Next, we calculate the '''R squared score '''for the test data.
 
|| Next, we calculate the '''R squared score '''for the test data.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Narration
 
|| Narration
 
|| The model has an '''MAE''' of '''1700.15''', showing the average prediction error in income. The '''Adjusted R squared score''' is '''0.921'''.
 
|| The model has an '''MAE''' of '''1700.15''', showing the average prediction error in income. The '''Adjusted R squared score''' is '''0.921'''.
Line 493: Line 492:
 
It indicates the model explains '''92.1 percent''' of income variance.
 
It indicates the model explains '''92.1 percent''' of income variance.
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight the lines:
 
|| Highlight the lines:
  
Line 503: Line 502:
  
 
We create a '''scatter plot''' of '''predicted values versus residuals.'''
 
We create a '''scatter plot''' of '''predicted values versus residuals.'''
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Highlight the output
 
|| Highlight the output
  
Line 516: Line 515:
 
Most '''residuals''' are '''close to zero''', meaning predictions are fairly accurate.
 
Most '''residuals''' are '''close to zero''', meaning predictions are fairly accurate.
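
A residual is commonly defined as the actual value minus the predicted value. The sketch below uses that convention with hypothetical numbers. The helper name residuals is an assumption, and the plotting step is left out.

```python
def residuals(y_true, y_pred):
    """Actual minus predicted, one residual per observation."""
    return [t - p for t, p in zip(y_true, y_pred)]

# Hypothetical actual and predicted incomes
actual = [30, 40, 50, 60]
predicted = [31, 39, 52, 58]
res = residuals(actual, predicted)
print(res)                  # [-1, 1, -2, 2]
print(sum(res) / len(res))  # 0.0, residuals balance around zero
```

Residuals scattered evenly around zero, with no visible pattern against the predicted values, suggest that a straight-line model is a reasonable choice.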
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Narration
 
|| Narration
 
|| Thus, we successfully implemented '''Multiple Linear Regression'''.
 
|| Thus, we successfully implemented '''Multiple Linear Regression'''.
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Show slide:
 
|| Show slide:
  
Line 527: Line 526:
  
 
In this tutorial, we have learnt about
 
In this tutorial, we have learnt about
* <div style="margin-left:1.27cm;margin-right:0cm;">'''Linear Regression'''</div>
+
* '''Linear Regression'''
* <div style="margin-left:1.27cm;margin-right:0cm;">'''Simple Linear Regression'''</div>
+
* '''Simple Linear Regression'''
* <div style="margin-left:1.27cm;margin-right:0cm;">'''Multiple Linear Regression'''</div>
+
* '''Multiple Linear Regression'''
* <div style="margin-left:1.27cm;margin-right:0cm;"><span style="background-color:transparent;color:#000000;">'''Evaluation Metrics'''</span></div>
+
* '''Evaluation Metrics'''
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Show Slide:
 
|| Show Slide:
  
Line 538: Line 537:
  
 
In Multiple Linear Regression code,
 
In Multiple Linear Regression code,
* <div style="margin-left:1.27cm;margin-right:0cm;"><span style="background-color:transparent;color:#000000;">Replace the test_size parameter as shown here.</span></div>
+
* Replace the test_size parameter as shown here.
 +
 
  
<div style="margin-left:1.27cm;margin-right:0cm;"></div>
 
 
|| In Multiple Linear Regression code,  
 
|| In Multiple Linear Regression code,  
  
* <div style="margin-left:1.27cm;margin-right:0cm;">R<span style="background-color:transparent;color:#000000;">eplace the </span><span style="background-color:transparent;color:#000000;">'''test_size parameter'''</span><span style="background-color:transparent;color:#000000;"> as shown here.</span></div>
+
* Replace the '''test_size parameter''' as shown here.
* <div style="margin-left:1.27cm;margin-right:0cm;">Ob<span style="background-color:transparent;color:#000000;">serve the change in </span><span style="background-color:transparent;color:#000000;">'''MAE '''</span><span style="background-color:transparent;color:#000000;">and </span><span style="background-color:transparent;color:#000000;">'''Adjusted R squared score'''</span><span style="background-color:transparent;color:#000000;">.</span></div>
+
* Observe the change in '''MAE '''and '''Adjusted R squared score'''.
 +
 
  
<div style="color:#000000;"></div>
+
|-  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
 
|| Show Slide:
 
|| Show Slide:
  
Line 554: Line 553:
 
Show s1 img file
 
Show s1 img file
 
|| After completing the assignment, the output should match the expected result.
 
|| After completing the assignment, the output should match the expected result.
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Show Slide:
 
|| Show Slide:
  
 
'''FOSSEE Forum'''
 
'''FOSSEE Forum'''
|| For any general or technical questions on <span style="background-color:#ffffff;">'''Python</span><span style="background-color:#ffffff;"> for Machine Learning'''</span>, visit the''' FOSSEE forum''' and post your question.
+
|| For any general or technical questions on '''Python for Machine Learning''', visit the''' FOSSEE forum''' and post your question.
  
|- style="border:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.191cm;padding-right:0.191cm;"
+
|-  
 
|| Show Slide:  
 
|| Show Slide:  
  
 
'''Thank you'''
 
'''Thank you'''
|| <div style="color:#000000;">This is '''Harini Theiveegan''', a FOSSEE Summer Fellow 2025, IIT Bombay signing off</div>
+
|| This is '''Harini Theiveegan''', a FOSSEE Summer Fellow 2025, IIT Bombay signing off
  
 
Thanks for joining.
 
Thanks for joining.

Revision as of 15:48, 4 July 2025


Visual Cue Narration
Show slide:

Welcome

Welcome to the Spoken Tutorial on Linear Regression.
Show Slide:

Learning Objectives

In this tutorial, we will learn about
  • Linear Regression
  • Simple Linear Regression
  • Multiple Linear Regression
  • Evaluation Metrics
Show Slide:

System Requirements

To record this tutorial, I am using
  • Ubuntu Linux operating system 24.04
  • Jupyter Notebook IDE
Show Slide:

Prerequisite

To follow this tutorial,
  • The learner must have basic knowledge of Python.
  • For prerequisite Python tutorials, please visit this website.
Show Slide:

Code files

  • The files used in this tutorial are provided in the Code files link.
  • Please download and extract the files.
  • Make a copy and then use them while practicing.
Show Slide:

Linear Regression

  • Linear regression is a predictive technique used in machine learning.
  • It builds the relationship between a dependent and independent variable.
  • Linear regression is categorized into Simple and Multiple linear regression.
Show Slide:

Simple Linear Regression

  • Simple Linear Regression is a way to find relationships between two variables.
  • It studies how one independent variable affects one dependent variable.
Show Slide:

Multiple Linear Regression

  • Multiple linear Regression is an extension of simple linear regression.
  • It examines how multiple factors influence a single outcome.
Show Slide:

Evaluation Metrics

  • To assess the model’s performance, we use evaluation metrics.
  • These metrics indicate how well the regression model fits the data.
  • The two common metrics are Mean Absolute Error and R squared score.
Hover over the files

I have created the required files for the demonstration of Linear Regression.
Open the file salaries.csv and point to the fields as per narration.

Open the file salaries_mlr.csv and point to the fields as per narration.

To implement Simple Linear Regression, we use the salaries dot csv dataset.

This dataset contains salaries based on years of experience.

We use salaries underscore mlr dot csv dataset for Multiple Linear Regression.

This dataset contains multiple columns as shown.

Point to the LinearRegression.ipynb

LinearRegression dot ipynb is the Python notebook file for this demonstration.
Press Ctrl+Alt+T keys

Type conda activate ml

Press Enter

Let us open the Linux terminal. Press Ctrl, Alt and T keys together.

First, we need to activate the machine learning environment.

Run the command conda space activate space ml.

Press Enter.

Go to the Downloads folder

Type cd Downloads

Press Enter

Type jupyter notebook

Press Enter

I have saved my code file in the Downloads folder.

Please navigate to the respective folder of your code file location.

Then type, jupyter space notebook and press Enter.

Show Jupyter Notebook Home page

Click on the LinearRegression.ipynb file

We see the Jupyter Notebook Home page.

Click the LinearRegression dot ipynb file to open it.

Note that each cell will have the output displayed in this file.

Highlight the lines:

import numpy as np

import pandas as pd

Press Shift+Enter

We start by importing the required libraries for Simple Linear Regression.

Make sure to press Shift and Enter to execute the code in each cell.

Highlight the lines:

df_salary=pd.read_csv("salaries.csv")

Let us load the dataset into a variable called df underscore salary.
Highlight the lines:

df_salary.head()

Next, we display the first few rows of the data.
Highlight the lines:

df_salary.describe()

Now, we generate summary statistics for the numerical columns.
Highlight the lines:

sns.heatmap(df_salary.corr(), annot=True, cmap="coolwarm")

plt.show()

The correlation heatmap shows how attributes in the dataset are related.
Narration:

Correlation measures how two variables are related to each other.

The correlation values range from -1 to 1.

Show the Correlation matrix output 4.47

Here, experience and income have a correlation of 0.97. This means that as experience increases, income also increases strongly.

Let us understand the correlation value ranges.

Show Slide:

Correlation Matrix

  • A value of 1 means a perfect positive correlation.
  • A value of -1 means a perfect negative correlation.
  • A value of 0 means no correlation.
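
The two extreme values on this slide can be checked by hand. The sketch below is only an illustration: the numbers are made-up toy values, not rows from salaries dot csv, and the helper name pearson is hypothetical.

```python
import numpy as np

# Toy data (hypothetical values, not taken from salaries.csv).
experience = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
income_up = 1000.0 * experience + 500.0      # rises exactly with experience
income_down = -1000.0 * experience + 9000.0  # falls exactly with experience


def pearson(a, b):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    numerator = (a_centered * b_centered).sum()
    denominator = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return numerator / denominator


r_up = pearson(experience, income_up)      # perfect positive relationship
r_down = pearson(experience, income_down)  # perfect negative relationship
print(r_up, r_down)
```

The call df underscore salary dot corr in the notebook computes this same quantity for every pair of numerical columns.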
Highlight the lines:

plt.figure(figsize=(6,4))

Now we create a boxplot to visualize the income distribution.
Show the output

This image is a boxplot of income before removing outliers.

Outliers are extreme values that differ significantly from other data points.

They are the small circles on the right side of the boxplot.

Here, incomes around 60,000 to 65,000 are considered outliers.

The line inside the box is the median.

Highlight the lines:

Q1 = df_salary[['experience', 'income']].quantile(0.25)  

Q3 = df_salary[['experience', 'income']].quantile(0.75)  

IQR = Q3 - Q1

Next, we will remove these outliers using the Interquartile Range method.

We calculate first quartile Q1 and third quartile Q3 for experience and income.

Then, we compute the IQR and remove the outliers.
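
The highlighted lines stop at the IQR, so the filtering step itself is not shown. A minimal sketch of one common way to do it follows, assuming the usual 1.5 times IQR rule. The toy DataFrame and the names lower, upper, is_outlier and df_clean are assumptions for illustration and may differ from the notebook.

```python
import pandas as pd

# Toy data (hypothetical); the last row has an extreme income.
df = pd.DataFrame({
    "experience": [1, 2, 3, 4, 5, 6],
    "income": [30000, 32000, 34000, 36000, 38000, 90000],
})

cols = ["experience", "income"]
Q1 = df[cols].quantile(0.25)
Q3 = df[cols].quantile(0.75)
IQR = Q3 - Q1

# Assumed rule: a row is an outlier if any column lies more than
# 1.5 * IQR below Q1 or above Q3.
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
is_outlier = ((df[cols] < lower) | (df[cols] > upper)).any(axis=1)

# Keep only the rows that are not flagged.
df_clean = df[~is_outlier]
print(df_clean)
```

In this toy example only the row with an income of 90000 is dropped.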

Highlight the lines:

plt.figure(figsize=(6,4))

Now, we plot the income distribution after removing outliers.
Show the output

Observe that the small circles are gone, showing outliers were removed.
Highlight the lines:

x=df_salary['experience']

y=df_salary['income']

Now, we define x as experience and y as income from the dataset.
Highlight the lines:

Then, we split the data into training and testing sets.
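
The split code for this step is not reproduced in the script. A hedged sketch using train underscore test underscore split from scikit-learn is given here. The toy arrays, the test_size of 0.25 and the random_state of 42 are assumptions for illustration, not values confirmed for this cell.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the experience and income columns (hypothetical values).
x = np.arange(1, 21)       # 20 samples
y = 1000 * x + 30000

# Assumed settings: hold out 25 percent of the rows for testing and
# fix random_state so the same split is produced on every run.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=42
)

print(len(x_train), len(x_test))
```

Each x value stays paired with its own y value after shuffling, which is why both arrays are passed to the same call.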
Highlight the lines:

x_train=np.array(x_train).reshape(-1,1)

x_test=np.array(x_test).reshape(-1,1)

We then reshape x underscore train into a 2D array.

The same is done for x underscore test for compatibility.
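
scikit-learn expects the features as a 2D array with one row per sample and one column per feature. The short sketch below shows what reshape with minus 1 and 1 does to a one-dimensional list. The values are made up for illustration.

```python
import numpy as np

# A flat list of four experience values (hypothetical).
x_train = [1.5, 3.0, 4.5, 6.0]

# -1 lets NumPy infer the number of rows; 1 fixes a single column.
x_train_2d = np.array(x_train).reshape(-1, 1)

print(np.array(x_train).shape)  # one dimension
print(x_train_2d.shape)         # rows by columns
```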

Highlight the lines:

lr=LinearRegression()

lr.fit(x_train,y_train)

Now, we initialize a Linear Regression model and train it using training data.
Highlight the lines:

print("Intercept (W0):", lr.intercept_)

print("Coefficient (W1):", lr.coef_)

Then, we print the intercept W0 and coefficient W1 of the model.

These define the fitted line: W0 is the intercept and W1 is the slope that relates experience to income.
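
To see what the two printed numbers mean, we can fit a model on data that lies exactly on a known line. This is a sketch on made-up values, assuming a line with an intercept of 25000 and a slope of 2000. The names w0, w1 and manual_prediction are for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data lying exactly on income = 25000 + 2000 * experience (hypothetical).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0]).reshape(-1, 1)
y = 25000.0 + 2000.0 * x.ravel()

lr = LinearRegression()
lr.fit(x, y)

w0 = lr.intercept_  # predicted income at zero experience
w1 = lr.coef_[0]    # change in income per extra year of experience

# A prediction is simply W0 + W1 * experience.
manual_prediction = w0 + w1 * 6.0
print(w0, w1, manual_prediction)
```

The manual prediction agrees with what lr dot predict returns for six years of experience.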

Highlight the lines:

y_pred_train = lr.predict(x_train)

y_pred_train = y_pred_train.round().astype(int)

y_pred_train

Now, we use the trained model to make predictions on the training data.

We round the predictions to whole numbers for better readability.

Then, we display the rounded predictions.

Highlight the lines:

mae_train = mean_absolute_error(y_train, y_pred_train)

print("MAE (Training):", mae_train)

Next, we calculate the Mean Absolute Error on the training data.

Mean Absolute Error is the average absolute difference between the actual and predicted values.
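
The metric can be reproduced in one line of NumPy. The sketch below uses made-up actual and predicted incomes, not values from the notebook.

```python
import numpy as np

# Hypothetical actual and predicted incomes.
y_true = np.array([30000, 35000, 40000, 45000])
y_pred = np.array([31000, 34000, 42000, 45000])

# Absolute errors are 1000, 1000, 2000 and 0; MAE is their mean.
mae = np.mean(np.abs(y_true - y_pred))
print(mae)
```

Because the error is expressed in the same unit as income, a lower value means predictions are closer to the actual incomes.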

Highlight the lines:

r2_score(y_train, y_pred_train)

Then, we compute the R squared score to evaluate the model’s performance.

R squared score measures how well the model explains the variance in the data.

A value closer to 1 indicates a stronger fit.
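
R squared is one minus the ratio of the residual sum of squares to the total sum of squares of the actual values. The sketch below computes it by hand on made-up numbers. The helper name r_squared is hypothetical. It also shows that the score is not symmetric, so the actual values should be passed first, as in scikit-learn's r2 underscore score.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot, with SS_tot measured around the mean of y_true."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Hypothetical actual and predicted values.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 3.0, 3.5])

r2_correct = r_squared(y_true, y_pred)  # actual values first
r2_swapped = r_squared(y_pred, y_true)  # arguments reversed
print(r2_correct, r2_swapped)
```

The two results differ because the total sum of squares is taken around the mean of whichever array is passed first.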

Highlight the lines:

y_pred_test = lr.predict(x_test)

y_pred_test = y_pred_test.round().astype(int)

y_pred_test

Now, we make predictions on the test data.
Highlight the lines:

plt.scatter(x_test,y_test)

To visualize performance, we create a scatter plot of actual vs predicted values.
Show the output

In the output we can see that most points are close to the line.

It shows a positive correlation.

Highlight the lines:

mean_absolute_error(y_test,y_pred_test)

Now, we compute the Mean Absolute Error on the test data.
Highlight the lines:

r2_score(y_test, y_pred_test)

Then, we calculate and display the R squared score.
Narration

The model has a Mean Absolute Error of 1626.41, meaning predictions are off by about 1626 on average.

The R-squared score of 0.87 shows the model explains most of the variance.

Overall, the model performs well but has some prediction errors.

Now let us see the implementation of Multiple Linear Regression.
Highlight the lines:

df_salaries = pd.read_csv(r"salaries_mlr.csv")

First, load the dataset for Multiple Linear Regression.
Highlight the lines:

df_salaries.tail()

Then, we display the last five rows.
Highlight the lines:

df_salaries.dtypes

Next, we check the data types of each column in the dataset.
Highlight the lines:

df_salaries.isnull().sum()

We also check for any missing values in the dataset by summing them.
Highlight the lines:

df_salaries['gender'] = df_salaries['gender'].map({'m': 1, 'f': 0})

Now, we convert the gender column to numeric values, 1 for Male and 0 for Female.
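
A small sketch of this mapping on a made-up column is shown here. One point worth checking, as an assumption about how the real data might look: any value that is not a key in the dictionary becomes NaN after map, so it can help to confirm that no missing values were introduced.

```python
import pandas as pd

# Toy gender column (hypothetical values).
df = pd.DataFrame({"gender": ["m", "f", "f", "m"]})

# Replace each label with its numeric code.
df["gender"] = df["gender"].map({"m": 1, "f": 0})

# Labels missing from the dictionary would show up here as NaN.
unmapped = df["gender"].isnull().sum()
print(df["gender"].tolist(), unmapped)
```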
Highlight the lines:

X = df_salaries.drop(columns='income')

y = df_salaries['income']

Then, we separate the features X and the target variable y for prediction.
Highlight the lines:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Now, we split the data into training and testing sets.
Highlight the lines:

model = LinearRegression()

model.fit(X_train, y_train)

We initialize a Linear Regression model and train it using the training data.
Highlight the lines:

coefficients = pd.DataFrame({'Feature': X.columns, 'Coefficient': model.coef_})

Next, we print the model's coefficients and intercept.
Highlight the lines:

y_train_pred = model.predict(X_train)

y_train_pred = y_train_pred.round().astype(int)

y_train_pred

Now, we make predictions on the training data.
Highlight the lines:

mae_train = mean_absolute_error(y_train, y_train_pred)

print(f'Training data MAE: {mae_train}')

Next, we compute the Mean Absolute Error for training data.
Highlight the lines:

r2_train = r2_score(y_train, y_train_pred)

n_train = len(y_train)

Then, we compute the R squared score to measure the model performance.

After that, we compute and print the adjusted R squared score.
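
The adjusted score corrects R squared for the number of features, so adding an uninformative column does not raise it automatically. The sketch below uses the standard formula, with n as the number of samples and k as the number of features. The function name adjusted_r2 and the example numbers are assumptions for illustration, not the notebook's code or results.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R squared = 1 - (1 - R2) * (n - 1) / (n - k - 1)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Hypothetical example: R squared of 0.9 from 21 samples and 4 features.
example = adjusted_r2(0.9, n=21, k=4)
print(example)

# With the same R squared, more features give a lower adjusted score.
print(adjusted_r2(0.9, n=21, k=8))
```

In the notebook, n would come from the length of the target and k from the number of columns in the feature matrix.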

Highlight the lines:

y_test_pred = model.predict(X_test)

y_test_pred = y_test_pred.round().astype(int)

Moving forward, we make predictions on the test data.
Highlight the lines:

plt.scatter(y_test, y_test_pred, color='red', label='Predicted')

plt.scatter(y_test, y_test, color='blue', alpha=0.5, label='Actual')

We compare actual vs predicted income using a scatter plot.
Highlight the lines:

mae_test = mean_absolute_error(y_test, y_test_pred)

print(f'Testing data MAE: {mae_test}')

Then, we compute the Mean Absolute Error for the test data.
Highlight the lines:

r2_test = r2_score(y_test, y_test_pred)

n_test = len(y_test)

k_test = X_test.shape[1]

Next, we calculate the R squared score for the test data.
Narration

The model has an MAE of 1700.15, showing the average prediction error in income.

The Adjusted R squared score is 0.921.

It indicates the model explains 92.1 percent of income variance.

Highlight the lines:

residuals = y_test - y_test_pred

plt.show()

Now, we analyse the residuals to check model errors.

We create a scatter plot of predicted values versus residuals.

Highlight the output

This is a residual plot for the regression model.

The red dashed line represents zero residual.

Points above the line mean predictions are lower than actual values.

Points below the line mean predictions are higher than actual values.

Most residuals are close to zero, meaning predictions are fairly accurate.
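
The sign convention described above follows directly from how the residuals are defined. A minimal sketch on made-up values is given here. The names under_predicted and over_predicted are for illustration only.

```python
import numpy as np

# Hypothetical actual and predicted incomes for three test rows.
y_test = np.array([40000, 50000, 60000])
y_test_pred = np.array([38000, 50500, 60000])

# Residual = actual minus predicted.
residuals = y_test - y_test_pred  # [2000, -500, 0]

# Positive residual: the point sits above the zero line, prediction too low.
under_predicted = residuals > 0
# Negative residual: the point sits below the zero line, prediction too high.
over_predicted = residuals < 0

print(residuals, under_predicted, over_predicted)
```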

Narration

Thus, we successfully implemented Multiple Linear Regression.
Show slide:

Summary

This brings us to the end of the tutorial. Let us summarize.

In this tutorial, we have learnt about

  • Linear Regression
  • Simple Linear Regression
  • Multiple Linear Regression
  • Evaluation Metrics
Show Slide:

Assignment

In Multiple Linear Regression code,
  • Replace the test_size parameter as shown here.
  • Observe the change in MAE and Adjusted R squared score.


Show Slide:

Assignment Solution

Show s1 img file

After completing the assignment, the output should match the expected result.
Show Slide:

FOSSEE Forum

For any general or technical questions on Python for Machine Learning, visit the FOSSEE forum and post your question.
Show Slide:

Thank you

This is Harini Theiveegan, a FOSSEE Summer Fellow 2025, IIT Bombay signing off.

Thanks for joining.

Contributors and Content Editors

Madhurig, Nirmala Venkat