Difference between revisions of "Python-for-Machine-Learning/C3/Random-Forest/English"
Revision as of 15:38, 16 July 2025
| Visual Cue | Narration |
| Show slide:
Welcome |
Welcome to the Spoken Tutorial on Random Forest. |
| Show Slide:
Learning Objectives |
In this tutorial, we will learn about
* Ensemble Learning and
* Random Forest |
| Show Slide: | To Record this tutorial, I am using
* Ubuntu Linux operating system 24.04
* Jupyter Notebook IDE |
| Show Slide:
Prerequisite |
To follow this tutorial,
* The learner must have basic knowledge of Python.
* For pre-requisite Python tutorials, please visit this website. |
| Show Slide:
Code files |
* The files used in this tutorial are provided in the Code files link.
* Please download and extract the files.
* Make a copy and then use them while practicing. |
| Show slide:
Ensemble Learning |
* Ensemble learning combines multiple models to improve performance.
* It reduces errors by averaging predictions from different models.
* One popular ensemble method is Random Forest, which uses decision trees. |
| Show slide:
Random Forest |
* Random Forest builds multiple trees and takes the majority vote (or, for regression, the average of the trees' predictions).
* This improves accuracy and reduces overfitting compared to a single tree.
* It is widely used for classification and regression tasks. |
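The averaging idea above can be illustrated with a minimal sketch comparing a single decision tree to a Random Forest. This is not the tutorial's notebook code; it uses a small synthetic dataset purely to show that averaging many trees typically generalizes better than one fully grown tree.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

# synthetic noisy nonlinear data (illustrative only)
rng = np.random.RandomState(0)
X = rng.rand(300, 5)
y = np.sin(4 * X[:, 0]) + 0.3 * rng.normal(size=300)

# cross-validated R^2: a single tree overfits the noise,
# while the forest averages away much of that variance
tree_score = cross_val_score(DecisionTreeRegressor(random_state=0), X, y, cv=5).mean()
forest_score = cross_val_score(RandomForestRegressor(n_estimators=100, random_state=0), X, y, cv=5).mean()
print(f"single tree R2: {tree_score:.3f}, random forest R2: {forest_score:.3f}")
```

On data like this, the forest's cross-validated score should come out clearly higher than the single tree's.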
| Point to the RandomForest.ipynb |
RandomForest.ipynb is the Python notebook file for the demonstration of Random Forest. |
| Press Ctrl, Alt and T keys

Type conda activate ml and press Enter |
Let us open the Linux terminal. Press Ctrl, Alt and T keys together.

Activate the machine learning environment by typing conda space activate space ml. Then press Enter. |
| Go to the Downloads folder
Type cd Downloads and press Enter

Type jupyter notebook and press Enter |
I have saved my code file in the Downloads folder.
Please navigate to the respective folder of your code file location. Then type, jupyter space notebook and press Enter to open Jupyter Notebook. |
| Show Jupyter Notebook Home page:
Click on RandomForest.ipynb file |
We can see the Jupyter Notebook Home page has opened in the web browser.
Click the RandomForest dot ipynb file to open it. Note that each cell will have the output displayed in this file. |
| Highlight the lines:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error |
We start by importing the required libraries for the Random Forest model.
Please remember to Execute each cell by pressing Shift and Enter to get output. |
| Highlight the lines:
california = fetch_california_housing()
housing = pd.DataFrame(california.data, columns=california.feature_names)
housing['MedHouseVal'] = california.target |
We use the California Housing dataset from the sklearn library.

It has housing data from California districts. We analyze the dataset and predict MedHouseVal. Now, we load the California Housing dataset into a Pandas DataFrame. We add the target column, MedHouseVal, which represents the median house value. |
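The loading step described above can be sketched as follows. Note that fetch_california_housing downloads the dataset from the internet on first use and caches it locally; the exact notebook may differ slightly.

```python
import pandas as pd
from sklearn.datasets import fetch_california_housing

# downloads the data on first use and caches it locally
california = fetch_california_housing()
housing = pd.DataFrame(california.data, columns=california.feature_names)
housing["MedHouseVal"] = california.target  # target: median house value, in units of $100,000
print(housing.shape)  # 20640 rows, 8 features + 1 target column
```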
| Highlight the lines:
housing.head() |
To check the dataset, we display the first few rows using the head function. |
| Highlight the lines:
housing.shape |
Next, we use the shape function to check the number of rows and columns. |
| Highlight the lines:
housing.info() |
Now, we print dataset details to understand its structure and statistics. |
| Highlight the lines:
housing_sorted = housing.sort_values(by="HouseAge") |
Let's visualize house age vs median value using a line plot. |
| Highlight the output: | The plot shows newer homes have higher values, then prices stabilize.
The shaded region represents price variability in different age groups. |
| Highlight the lines:
skewed_features = ['AveRooms', 'AveBedrms', 'Population', 'AveOccup'] |
Before training, we check for skewed features that may affect predictions.
Next, we apply log transformation to reduce skewness in these features. |
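The log transformation above can be sketched like this. The feature values here are hypothetical, and np.log1p (log(1 + x)) is one common choice since it handles zeros gracefully; the notebook's exact transform may differ.

```python
import numpy as np
import pandas as pd

# hypothetical values for one skewed feature (AveOccup), with an extreme outlier
df = pd.DataFrame({"AveOccup": [1.5, 2.0, 3.2, 250.0]})
df["AveOccup"] = np.log1p(df["AveOccup"])  # log(1 + x) compresses the large value
print(df["AveOccup"].round(2).tolist())
```

After the transform, the outlier is pulled much closer to the rest of the values, reducing skewness.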
| Highlight the lines:
scaler = StandardScaler() |
Now, we scale the feature values using StandardScaler.
This ensures all features have a similar range for better modeling. |
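The scaling step can be sketched as follows, using a tiny made-up matrix of two features on very different scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# two features on very different scales
X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column now has zero mean and unit variance
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```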
| Highlight the lines:
x = housing.drop(columns=["MedHouseVal"])
x.head() |
Then we separate the features by dropping the target column.
After that, we display the first few rows to check the data. |
| Highlight the lines:
y = housing.MedHouseVal
y.head() |
Next, we extract the target variable and store it in y. |
| Highlight the lines:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42) |
Now, we split the dataset into training and testing sets. |
| Highlight the lines:
rf = |
Next, we create a Random Forest Regressor with 100 trees.
Then, we train the model using the training data. |
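The split, model creation, and training steps can be sketched together as follows. Synthetic data stands in for the housing features here, and the constructor arguments (n_estimators=100, random_state=42) are assumptions consistent with the narration, not necessarily the notebook's exact call.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# synthetic stand-in for the housing features and target
rng = np.random.RandomState(42)
X = rng.rand(500, 8)
y = X @ rng.rand(8) + rng.normal(scale=0.1, size=500)

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
rf = RandomForestRegressor(n_estimators=100, random_state=42)  # 100 trees
rf.fit(x_train, y_train)
print(len(rf.estimators_))  # one fitted DecisionTreeRegressor per tree
```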
| Highlight the lines:
y_train_pred = rf.predict(x_train) |
After training, we predict house values for the training set. |
| Highlight the lines:
training_mse = |
Now, we compute the Mean Squared Error for training data.
We also compute the adjusted R squared score for better evaluation. |
| Highlight the output: | The Training MSE is 0.037, showing a low prediction error.

The Training Adjusted R squared Score is 0.972. A high Adjusted R squared score suggests the model explains most of the variance. |
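scikit-learn provides r2_score but has no built-in adjusted R squared function, so the notebook presumably computes it from the standard formula. A hedged sketch of such a helper:

```python
from sklearn.metrics import r2_score

def adjusted_r2(y_true, y_pred, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1),
    where n is the sample count and p the number of features."""
    r2 = r2_score(y_true, y_pred)
    n = len(y_true)
    return 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

# perfect predictions give an adjusted R^2 of exactly 1.0
print(adjusted_r2([1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 2.0, 3.0, 4.0, 5.0], n_features=1))  # prints 1.0
```

Adjusted R squared penalizes extra features, so it is always at most the plain R squared.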
| Highlight the lines:
y_pred = rf.predict(x_test) |
Now, we predict house prices for the test dataset. |
| Highlight the lines:
print("Random Forest - Actual vs Predicted:") |
Next, we compare actual vs predicted values for the test set. |
| Highlight the lines:
test_mse = mean_squared_error(y_test,y_pred) |
After that, we calculate the MSE for the test predictions.
Then, we compute the Adjusted R squared score for the test set. |
| Highlight the output | For the test set, the MSE is 0.257, indicating reasonable accuracy.
The Test Adjusted R squared Score is 0.804. A higher Adjusted R Squared score indicates a good fit. |
| Highlight the lines:
residuals = y_test - y_pred
plt.show() |
To further analyze performance, we examine the residuals. |
| Highlight the output: | Residuals are the difference between actual and predicted values.
The red dashed line marks zero residuals for reference. Most residuals are near zero, meaning the prediction error is low. |
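The residual plot described above can be sketched as follows. The actual and predicted values here are hypothetical placeholders, not the tutorial's outputs; the Agg backend and the output filename are also assumptions for non-interactive use.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt

# hypothetical actual and predicted test values
y_test = np.array([2.0, 1.5, 3.0, 2.5])
y_pred = np.array([1.9, 1.6, 2.8, 2.6])
residuals = y_test - y_pred  # difference between actual and predicted

plt.scatter(y_pred, residuals)
plt.axhline(0, color="red", linestyle="--")  # zero-residual reference line
plt.xlabel("Predicted value")
plt.ylabel("Residual")
plt.savefig("residuals.png")
```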
| Highlight the lines:
feature_importances = rf.feature_importances_
plt.show() |
Besides accuracy, we analyze feature importance to see key prediction factors.
Feature importance shows how much each feature impacts predictions. Higher values mean a feature is more important for the model. |
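The feature-importance idea can be demonstrated on a small synthetic dataset where we know which feature matters most (this stands in for the housing data, where MedInc dominates):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(300, 3)
# the target depends strongly on column 0, weakly on column 1, not at all on column 2
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.05, size=300)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
importances = rf.feature_importances_  # non-negative values that sum to 1
print(importances)  # column 0 should rank highest
```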
| Show the output: | The plot shows MedInc has the highest impact on house price predictions.
Other features contribute less but still affect model performance. |
| Narration | Thus, we successfully built a Random Forest model for house price prediction.
The model showed high accuracy, indicating strong performance. |
| Show slide:
Summary |
This brings us to the end of the tutorial. Let us summarize.
In this tutorial, we have learnt about
* Ensemble Learning and
* Random Forest |
| Show Slide:
Assignment |
As an assignment, please do the following:
* Replace the test_size parameter as shown here.
* Observe the change in MSE and Adjusted R squared score. |
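The assignment above can be sketched like this: loop over a few test_size values and compare the resulting test MSE. Synthetic data and the specific test_size values (0.2, 0.3, 0.4) are assumptions for illustration; the assignment's exact value may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# synthetic stand-in for the housing data
rng = np.random.RandomState(42)
X = rng.rand(400, 4)
y = X @ np.array([2.0, 1.0, 0.5, 0.1]) + rng.normal(scale=0.1, size=400)

results = {}
for test_size in (0.2, 0.3, 0.4):
    x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=test_size, random_state=42)
    rf = RandomForestRegressor(n_estimators=50, random_state=42).fit(x_tr, y_tr)
    results[test_size] = mean_squared_error(y_te, rf.predict(x_te))
    print(f"test_size={test_size}: MSE={results[test_size]:.4f}")
```

Larger test sizes leave less data for training, so the MSE typically shifts as test_size grows.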
| Show Slide:
Assignment Solution Show x img |
After completing the assignment, the output should match the expected result. |
| Show Slide:
FOSSEE Forum |
For any general or technical questions on Python for Machine Learning, visit the FOSSEE forum and post your question. |
| Show Slide:
Thank you |
This is Harini Theiveegan, a FOSSEE Summer Fellow 2025, IIT Bombay, signing off.

Thanks for joining. |