Python-for-Machine-Learning/C2/Linear-Regression/English
| Visual Cue | Narration |
| Show slide:
Welcome |
Welcome to the Spoken Tutorial on Linear Regression. |
| Show Slide:
Learning Objectives |
In this tutorial, we will learn about
Linear Regression
Simple Linear Regression
Multiple Linear Regression
Evaluation metrics such as Mean Absolute Error and R squared score |
| Show Slide:
System Requirements |
To record this tutorial, I am using
|
| Show Slide:
Prerequisite |
To follow this tutorial, you should have basic knowledge of Python.
If not, please go through the relevant Python tutorials on this website. |
| Show Slide:
Code files |
The files used in this tutorial are available in the Code files link on this tutorial page.
Please download and extract them. |
| Show Slide:
Linear Regression |
Linear Regression is a supervised learning method.
It models the relationship between a dependent variable and one or more independent variables by fitting a straight line. |
| Show Slide:
Simple Linear Regression |
Simple Linear Regression uses a single independent variable to predict the dependent variable.
For example, predicting income from years of experience. |
| Show Slide:
Multiple Linear Regression |
Multiple Linear Regression uses two or more independent variables to predict the dependent variable. |
| Show Slide:
Evaluation Metrics |
Evaluation metrics measure how well a model performs.
In this tutorial, we use Mean Absolute Error and the R squared score. |
| Hover over the files | I have created the required files for the demonstration of Linear Regression. |
| Open the file salaries.csv and point to the fields as per narration.
Open the file salaries_mlr.csv and point to the fields as per narration. |
To implement Simple Linear Regression, we use the salaries dot csv dataset.
This dataset contains salaries based on years of experience. For Multiple Linear Regression, we use the salaries underscore mlr dot csv dataset. This dataset contains multiple columns, as shown. |
| Point to the LinearRegression.ipynb | LinearRegression dot ipynb is the Python notebook file for this demonstration. |
| Press Ctrl, Alt and T keys
Type conda activate ml
Press Enter |
Let us open the Linux terminal. Press Ctrl, Alt and T keys together.
First, we need to activate the machine learning environment. Run the command conda space activate space ml. Press Enter. |
| Go to the Downloads folder
Type cd Downloads
Press Enter
Type jupyter notebook
Press Enter |
I have saved my code file in the Downloads folder.
Please navigate to the folder where your code file is saved. Then type jupyter space notebook and press Enter. |
| Show Jupyter Notebook Home page
Click on LinearRegression.ipynb file |
We see the Jupyter Notebook Home page.
Click the LinearRegression dot ipynb file to open it. Note that in this file, each cell already has its output displayed. |
| Highlight the lines:
import numpy as np
import pandas as pd
Press Shift+Enter |
We start by importing the required libraries for Simple Linear Regression.
Make sure to Press Shift and Enter to execute the code in each cell. |
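Later cells in this notebook also call seaborn, matplotlib and scikit-learn functions. A fuller import cell covering everything used below might look like this sketch; the exact imports in the notebook may differ:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score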
| Highlight the lines:
df_salary = pd.read_csv("salaries.csv") |
Let us load the dataset into a variable called df underscore salary. |
| Highlight the lines:
df_salary.head() |
Next, we display the first few rows of the data. |
| Highlight the lines:
df_salary.describe() |
Now, we generate summary statistics for the numerical columns. |
| Highlight the lines:
sns.heatmap(df_salary.corr(), annot=True, cmap="coolwarm")
plt.show() |
The correlation heatmap shows how the attributes in the dataset are related. |
| Narration: | Correlation measures how two variables are related to each other.
Correlation values range from -1 to 1. |
| Show the Correlation matrix output | Here, experience and income have a correlation of 0.97. This means that as experience increases, income also increases strongly.
Let us understand the correlation value ranges. |
| Show Slide:
Correlation Matrix |
A value of +1 indicates a perfect positive correlation, 0 indicates no correlation, and -1 indicates a perfect negative correlation.
Values in between indicate weaker positive or negative relationships. |
| Highlight the lines:
plt.figure(figsize=(6,4)) |
Now we create a boxplot to visualize the income distribution. |
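The highlighted line shows only the figure setup; the boxplot call itself is not visible in the cue. A minimal sketch of such a cell, assuming seaborn is used as in the heatmap cell (the title text is illustrative):

plt.figure(figsize=(6,4))
sns.boxplot(x=df_salary['income'])  # horizontal boxplot of income
plt.title('Income distribution before outlier removal')
plt.show()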
| Show the output | This image is a boxplot of income before removing outliers.
Outliers are extreme values that differ significantly from other data points. They are the small circles on the right side of the boxplot. Here, incomes around 60,000 to 65,000 are considered outliers. The line inside the box is the median. |
| Highlight the lines:
Q1 = df_salary[['experience', 'income']].quantile(0.25)
Q3 = df_salary[['experience', 'income']].quantile(0.75)
IQR = Q3 - Q1 |
Next, we will remove these outliers using the Interquartile Range method.
We calculate first quartile Q1 and third quartile Q3 for experience and income. Then, we compute the IQR and remove the outliers. |
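The removal step itself is not shown in the highlighted lines. A sketch of the usual IQR filter, assuming the conventional 1.5 multiplier:

Q1 = df_salary[['experience', 'income']].quantile(0.25)
Q3 = df_salary[['experience', 'income']].quantile(0.75)
IQR = Q3 - Q1
# keep only rows whose experience and income lie within 1.5 * IQR of the quartiles
within = ~((df_salary[['experience', 'income']] < (Q1 - 1.5 * IQR)) |
           (df_salary[['experience', 'income']] > (Q3 + 1.5 * IQR))).any(axis=1)
df_salary = df_salary[within]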
| Highlight the lines:
plt.figure(figsize=(6,4)) |
Now, we plot the income distribution after removing outliers. |
| Show the output | Observe that the small circles are gone, showing outliers were removed. |
| Highlight the lines:
x = df_salary['experience']
y = df_salary['income'] |
Now, we define x as experience and y as income from the dataset. |
| Highlight the lines: | Then, we split the data into training and testing sets. |
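The split cell is not shown in the cue above. A typical call, where the test size and random seed are assumptions chosen to match the Multiple Linear Regression section later in this tutorial:

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)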
| Highlight the lines:
x_train = np.array(x_train).reshape(-1,1)
x_test = np.array(x_test).reshape(-1,1) |
We then reshape x underscore train into a 2D array, since scikit-learn expects a two-dimensional feature array.
The same is done for x underscore test for compatibility. |
| Highlight the lines:
lr = LinearRegression()
lr.fit(x_train, y_train) |
Now, we initialize a Linear Regression model and train it using training data. |
| Highlight the lines:
print("Intercept (W0):", lr.intercept_) print("Coefficient (W1):", lr.coef_) |
Then, we print the intercept W0 and coefficient W1 of the model.
W0 is the predicted income at zero experience, and W1 is the slope relating experience to income. |
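With these two numbers, the fitted line can be written out directly. As a sketch (the helper name is illustrative):

# predicted income for a given experience, from the fitted line y = W0 + W1 * x
def predict_income(years):
    return lr.intercept_ + lr.coef_[0] * years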
| Highlight the lines:
y_pred_train = lr.predict(x_train)
y_pred_train = y_pred_train.round().astype(int)
y_pred_train |
Now, we use the trained model to make predictions on the training data.
We round the predictions to whole numbers for better readability. Then, we display the rounded predictions. |
| Highlight the lines:
mae_train = mean_absolute_error(y_train, y_pred_train)
print("MAE (Training):", mae_train) |
Next, we calculate the Mean Absolute Error on the training data.
Mean Absolute Error is the average absolute difference between the predicted and actual values; lower is better. |
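For intuition, the same value can also be computed directly with numpy; this sketch is equivalent to scikit-learn's mean_absolute_error:

mae_manual = np.mean(np.abs(y_train - y_pred_train))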
| Highlight the lines:
r2_score(y_train, y_pred_train) |
Then, we compute the R squared score to evaluate the model’s performance.
R squared score measures how well the model explains the variance in the data. A value closer to 1 indicates a stronger fit. |
| Highlight the lines:
y_pred_test = lr.predict(x_test)
y_pred_test = y_pred_test.round().astype(int)
y_pred_test |
Now, we make predictions on the test data. |
| Highlight the lines:
plt.scatter(x_test,y_test) |
To visualize performance, we plot the test data as a scatter plot along with the fitted regression line. |
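Only the scatter call is highlighted; the fitted line visible in the output can be drawn over it. A sketch, with the axis labels as assumptions:

plt.scatter(x_test, y_test, label='Actual')
order = x_test.ravel().argsort()  # sort so the line is drawn left to right
plt.plot(x_test.ravel()[order], y_pred_test[order], color='red', label='Fitted line')
plt.xlabel('Experience')
plt.ylabel('Income')
plt.legend()
plt.show()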
| Show the output | In the output, we can see that most points are close to the line.
This shows a positive correlation. |
| Highlight the lines:
mean_absolute_error(y_test,y_pred_test) |
Now, we compute the Mean Absolute Error on the test data. |
| Highlight the lines:
r2_score(y_test, y_pred_test) |
Then, we calculate and display the R squared score. |
| Narration | The model has a Mean Absolute Error of 1626.41, that is, predictions are off by about 1626 on average.
The R-squared score of 0.87 shows the model explains most of the variance. Overall, the model performs well, with some prediction error. |
| Narration | Now let us see the implementation of Multiple Linear Regression. |
| Highlight the lines:
df_salaries = pd.read_csv(r"salaries_mlr.csv") |
First, load the dataset for Multiple Linear Regression. |
| Highlight the lines:
df_salaries.tail() |
Then, we display the last five rows. |
| Highlight the lines:
df_salaries.dtypes |
Next, we check the data types of each column in the dataset. |
| Highlight the lines:
df_salaries.isnull().sum() |
We also check the dataset for missing values by counting the nulls in each column. |
| Highlight the lines:
df_salaries['gender'] = df_salaries['gender'].map({'m': 1, 'f': 0}) |
Now, we convert the gender column to numeric values: 1 for male and 0 for female. |
| Highlight the lines:
X = df_salaries.drop(columns='income')
y = df_salaries['income'] |
Then, we separate the features X and the target variable y for prediction. |
| Highlight the lines:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) |
Now, we split the data into training and testing sets. |
| Highlight the lines:
model = LinearRegression()
model.fit(X_train, y_train) |
We initialize a Linear Regression model and train it using the training data. |
| Highlight the lines:
coefficients = pd.DataFrame({'Feature': X.columns, 'Coefficient': model.coef_}) |
Next, we print the model's coefficients and intercept. |
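The highlighted line only builds the coefficient table; the printing itself is not visible in the cue. A sketch:

coefficients = pd.DataFrame({'Feature': X.columns, 'Coefficient': model.coef_})
print(coefficients)
print('Intercept:', model.intercept_)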
| Highlight the lines:
y_train_pred = model.predict(X_train)
y_train_pred = y_train_pred.round().astype(int)
y_train_pred |
Now, we make predictions on the training data. |
| Highlight the lines:
mae_train = mean_absolute_error(y_train, y_train_pred)
print(f'Training data MAE: {mae_train}') |
Next, we compute the Mean Absolute Error for training data. |
| Highlight the lines:
r2_train = r2_score(y_train, y_train_pred)
n_train = len(y_train) |
Then, we compute the R squared score to measure the model performance.
After that, we compute and print the adjusted R squared score. |
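The adjusted R squared computation is cut off in the cue above. The standard formula, as a sketch consistent with the variables already defined:

k_train = X_train.shape[1]  # number of features
adj_r2_train = 1 - (1 - r2_train) * (n_train - 1) / (n_train - k_train - 1)
print(f'Training data Adjusted R2: {adj_r2_train}')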
| Highlight the lines:
y_test_pred = model.predict(X_test)
y_test_pred = y_test_pred.round().astype(int) |
Moving forward, we make predictions on the test data. |
| Highlight the lines:
plt.scatter(y_test, y_test_pred, color='red', label='Predicted')
plt.scatter(y_test, y_test, color='blue', alpha=0.5, label='Actual') |
We compare actual vs predicted income using a scatter plot. |
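The cue shows only the two scatter calls; the labels and legend seen in the output come from the usual finishing calls, as a sketch with assumed label text:

plt.xlabel('Actual income')
plt.ylabel('Income')
plt.legend()
plt.show()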
| Highlight the lines:
mae_test = mean_absolute_error(y_test, y_test_pred)
print(f'Testing data MAE: {mae_test}') |
Then, we compute the Mean Absolute Error for the test data. |
| Highlight the lines:
r2_test = r2_score(y_test, y_test_pred)
n_test = len(y_test)
k_test = X_test.shape[1] |
Next, we calculate the R squared score for the test data. |
| Narration | The model has an MAE of 1700.15, showing the average prediction error in income. The Adjusted R squared score is 0.921.
It indicates the model explains 92.1 percent of income variance. |
| Highlight the lines:
residuals = y_test - y_test_pred
plt.show() |
Now, we analyse the residuals to check model errors.
We create a scatter plot of predicted values versus residuals. |
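The plotting calls between the two highlighted lines are not visible in the cue. A sketch matching the description of the output, with the zero-residual line drawn dashed in red:

residuals = y_test - y_test_pred
plt.scatter(y_test_pred, residuals)
plt.axhline(y=0, color='red', linestyle='--')  # zero-residual reference line
plt.xlabel('Predicted income')
plt.ylabel('Residual (actual minus predicted)')
plt.show()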
| Highlight the output | This is a residual plot for the regression model.
The red dashed line represents zero residual. Points above the line mean predictions are lower than actual values. Points below the line mean predictions are higher than actual values. Most residuals are close to zero, meaning predictions are fairly accurate. |
| Narration | Thus, we successfully implemented Multiple Linear Regression. |
| Show slide:
Summary |
This brings us to the end of the tutorial. Let us summarize.
In this tutorial, we have learnt about
Linear Regression
Simple Linear Regression
Multiple Linear Regression
Evaluation metrics such as Mean Absolute Error, R squared and Adjusted R squared score |
| Show Slide:
Assignment In Multiple Linear Regression code,
|
In Multiple Linear Regression code,
|
| Show Slide:
Assignment Solution
Show s1 img file |
After completing the assignment, the output should match the expected result. |
| Show Slide:
FOSSEE Forum |
For any general or technical questions on Python for Machine Learning, visit the FOSSEE forum and post your question. |
| Show Slide:
Thank you |
This is Harini Theiveegan, a FOSSEE Summer Fellow 2025, IIT Bombay, signing off.
Thanks for joining. |