Difference between revisions of "Python-for-Machine-Learning/C3/Random-Forest/English"

 
Latest revision as of 19:48, 17 July 2025

Visual Cue Narration
Show slide:

Welcome

Welcome to the Spoken Tutorial on Random Forest.
Show Slide:

Learning Objectives

In this tutorial, we will learn about
  • Ensemble Learning and
  • Random Forest
Show Slide:

To record this tutorial, I am using
  • Ubuntu Linux operating system 24.04
  • Jupyter Notebook IDE
Show Slide:

Prerequisite

To follow this tutorial,
  • The learner must have basic knowledge of Python.
  • For pre-requisite Python tutorials, please visit this website.
Show Slide:

Code files

  • The files used in this tutorial are provided in the Code files link.
  • Please download and extract the files.
  • Make a copy and then use them while practicing.
Show slide:

Ensemble Learning

  • Ensemble learning combines multiple models to improve performance.
  • It reduces errors by averaging predictions from different models.
  • One popular ensemble method is Random Forest, which uses decision trees.
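
The averaging idea above can be illustrated with a minimal sketch. It assumes five hypothetical models whose errors are independent random noise; the names `noisy_model` and `mse` are made up for this illustration and are not part of the tutorial notebook.

```python
import random
import statistics

random.seed(0)

# Hypothetical ground truth for 1000 samples.
true_values = [float(i % 10) for i in range(1000)]

def noisy_model(values, noise_std=1.0):
    """Hypothetical model: the true value plus independent Gaussian noise."""
    return [v + random.gauss(0.0, noise_std) for v in values]

def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return statistics.fmean((a - p) ** 2 for a, p in zip(actual, predicted))

# Five independent "models", each wrong in its own random way.
all_predictions = [noisy_model(true_values) for _ in range(5)]

# The ensemble prediction is the per-sample average across the models.
ensemble = [statistics.fmean(column) for column in zip(*all_predictions)]

individual_mses = [mse(true_values, p) for p in all_predictions]
ensemble_mse = mse(true_values, ensemble)

print("individual MSEs:", [round(m, 3) for m in individual_mses])
print("ensemble MSE:   ", round(ensemble_mse, 3))
```

Because the errors of the individual models partly cancel out, the averaged prediction should show a lower MSE than any single model in this setup.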
Show slide:

Random Forest

  • Random Forest builds multiple trees and combines their outputs: a majority vote for classification, or an average for regression.
  • This improves accuracy and reduces overfitting compared to a single tree.
  • It is widely used for classification and regression tasks.
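
As a rough sketch of how the outputs of several trees might be combined: for classification the most common class wins, while for a regression task like the one in this tutorial the numeric outputs are averaged. The function names and the sample values below are hypothetical.

```python
from collections import Counter
from statistics import fmean

def aggregate_classification(tree_votes):
    """Return the class predicted by the most trees (majority vote)."""
    return Counter(tree_votes).most_common(1)[0][0]

def aggregate_regression(tree_outputs):
    """Return the average of the trees' numeric predictions."""
    return fmean(tree_outputs)

# Hypothetical outputs of five trees for one sample.
class_votes = ["high", "low", "high", "high", "low"]
value_outputs = [2.0, 2.5, 3.0, 2.5, 2.0]

print(aggregate_classification(class_votes))  # high
print(aggregate_regression(value_outputs))    # 2.4
```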
Point to the RandomForest.ipynb

RandomForest.ipynb is the Python notebook file for the demonstration of Random Forest.

Press Ctrl, Alt and T keys

Type conda activate ml

Press Enter

Let us open the Linux terminal. Press Ctrl, Alt and T keys together.

Activate the machine learning environment by typing

conda space activate space ml

Press Enter.

Go to the Downloads folder

Type cd Downloads

Press Enter

Type jupyter notebook

Press Enter

I have saved my code file in the Downloads folder.

Please navigate to the respective folder of your code file location.

Then type, jupyter space notebook and press Enter to open Jupyter Notebook.

Show Jupyter Notebook Home page:

Click on RandomForest.ipynb file

We can see the Jupyter Notebook Home page has opened in the web browser.

Click the RandomForest dot ipynb file to open it.

Note that each cell will have the output displayed in this file.

Highlight the lines:

import pandas as pd

import numpy as np

from sklearn.metrics import accuracy_score

We start by importing the required libraries for the Random Forest model.

Please remember to execute each cell by pressing Shift and Enter to get the output.

Highlight the lines:

california = fetch_california_housing()  

housing = pd.DataFrame(california.data, columns=california.feature_names)

housing['MedHouseVal'] = california.target

We use the California Housing dataset from the sklearn library.

It has housing data from California districts.

We analyze the dataset and predict MedHouseVal, the median house value.

Now, we load the California Housing dataset into a Pandas DataFrame.

We add the target column, MedHouseVal, which represents median house value.

Highlight the lines:

housing.head()

To check the dataset, we display the first few rows using the head function.
Highlight the lines:

housing.shape

Next, we use the shape attribute to check the number of rows and columns.
Highlight the lines:

housing.info()

Now, we print the dataset details, such as column types and non-null counts, to understand its structure.
Highlight the lines:

housing_sorted = housing.sort_values(by="HouseAge")

Let's visualize house age vs median value using a line plot.
Highlight the output:

The plot shows newer homes have higher values, then prices stabilize.

The shaded region represents price variability in different age groups.

Highlight the lines:

skewed_features = ['AveRooms', 'AveBedrms', 'Population', 'AveOccup']

Before training, we check for skewed features that may affect predictions.

Next, we apply log transformation to reduce skewness in these features.
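
To see why a log transformation helps, here is a minimal sketch on a made-up right-skewed column. The notebook's own transformation code is not shown in this script, so the use of `log1p` here is an assumption, and the `skewness` helper and the sample values are hypothetical.

```python
import math
from statistics import fmean, pstdev

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mu, sigma = fmean(values), pstdev(values)
    return fmean(((v - mu) / sigma) ** 3 for v in values)

# Hypothetical right-skewed column: mostly small values, a few very large.
population = [300, 450, 500, 620, 800, 950, 1200, 1500, 9000, 28000]

# log1p(x) = log(1 + x); it is defined at x = 0 and compresses large values.
log_population = [math.log1p(v) for v in population]

print("skew before:", round(skewness(population), 2))
print("skew after: ", round(skewness(log_population), 2))
```

The skewness after the transformation should be noticeably smaller, because the few very large values are pulled closer to the rest.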

Highlight the lines:

scaler = StandardScaler()

Now, we scale the feature values using StandardScaler.

This ensures all features have a similar range for better modeling.
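
StandardScaler rescales each feature to zero mean and unit variance. The sketch below reproduces that calculation by hand for a single column, using the population standard deviation; the `standardize` helper and the sample values are hypothetical and only meant to show the arithmetic.

```python
from statistics import fmean, pstdev

def standardize(column):
    """Rescale a column to zero mean and unit variance (z-scores)."""
    mu = fmean(column)
    sigma = pstdev(column)  # population standard deviation
    return [(v - mu) / sigma for v in column]

median_income = [2.0, 4.0, 6.0, 8.0]  # hypothetical feature values
scaled = standardize(median_income)

print([round(v, 3) for v in scaled])
print("mean:", round(fmean(scaled), 6), "std:", round(pstdev(scaled), 6))
```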

Highlight the lines:

x = housing.drop(columns=["MedHouseVal"])

x.head()

Then we separate the features by dropping the target column.

After that, we display the first few rows to check the data.

Highlight the lines:

y = housing.MedHouseVal

y.head()

Next, we extract the target variable and store it in y.
Highlight the lines:

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

Now, we split the dataset into training and testing sets.
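
The idea behind the split can be sketched in plain Python: shuffle the rows with a fixed seed, then hold out 30 percent for testing. This is only an illustration of the concept; the `split_rows` helper is hypothetical and does not reproduce the exact row order that train_test_split would produce.

```python
import random

def split_rows(rows, test_size=0.3, seed=42):
    """Shuffle the rows reproducibly, then hold out a fraction for testing."""
    shuffled = rows[:]                      # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed gives a repeatable split
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))                      # stand-in for 10 dataset rows
train_rows, test_rows = split_rows(rows)

print(len(train_rows), len(test_rows))      # 7 3
```

Using the same seed again returns the same split, which is what makes results comparable between runs.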
Highlight the lines:

rf =

Next, we create a Random Forest Regressor with 100 trees.

Then, we train the model using the training data.
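
A minimal sketch of creating and training such a model is given below. It uses small synthetic data instead of the housing dataset. Only the 100 trees come from the narration; the `random_state` value and the names `x_demo`, `y_demo` and `rf_demo` are assumptions made for this illustration, not the notebook's actual code.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data: 200 rows, 3 features, target driven by feature 0.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 3))
y_demo = 3.0 * x_demo[:, 0] + rng.normal(scale=0.1, size=200)

# n_estimators=100 matches the "100 trees" in the narration;
# random_state=42 is an assumption made here for repeatability.
rf_demo = RandomForestRegressor(n_estimators=100, random_state=42)
rf_demo.fit(x_demo, y_demo)

print(len(rf_demo.estimators_))       # number of fitted trees
print(rf_demo.predict(x_demo[:3]))    # predictions for the first three rows
```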

Highlight the lines:

y_train_pred = rf.predict(x_train)

After training, we predict house values for the training set.
Highlight the lines:

training_mse =

Now, we compute the Mean Squared Error for training data.

We also compute the adjusted R squared score for better evaluation.
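
The two metrics can be written out by hand as a sketch. The adjusted R squared formula below is the standard definition, 1 - (1 - R squared)(n - 1)/(n - p - 1), where n is the number of samples and p the number of features. The notebook's own helper is not shown in this script, so the function names and sample values here are hypothetical.

```python
from statistics import fmean

def mean_squared_error_manual(actual, predicted):
    """Average of the squared prediction errors."""
    return fmean((a - p) ** 2 for a, p in zip(actual, predicted))

def r_squared(actual, predicted):
    """1 minus the ratio of residual to total sum of squares."""
    mean_actual = fmean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n_samples, n_features):
    """Penalize R squared for the number of features used."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Hypothetical targets and predictions for six samples and two features.
actual = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.1, 5.9]

mse_value = mean_squared_error_manual(actual, predicted)
r2_value = r_squared(actual, predicted)
adj_r2_value = adjusted_r_squared(r2_value, n_samples=len(actual), n_features=2)

print("MSE:", round(mse_value, 4))
print("R squared:", round(r2_value, 4))
print("Adjusted R squared:", round(adj_r2_value, 4))
```

The adjusted score is never higher than the plain R squared score, since it accounts for the number of features in the model.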

Highlight the output:

The Training MSE is 0.037, showing a low prediction error.

The Training Adjusted R squared Score is 0.972.

A high Adjusted R squared score suggests the model explains most variance.

Highlight the lines:

y_pred = rf.predict(x_test)

Now, we predict house prices for the test dataset.
Highlight the lines:

print("Random Forest - Actual vs Predicted:")

Next, we compare actual vs predicted values for the test set.
Highlight the lines:

test_mse = mean_squared_error(y_test,y_pred)

After that, we calculate the MSE for the test predictions.

Then, we compute the Adjusted R squared score for the test set.

Highlight the output:

For the test set, the MSE is 0.257, indicating reasonable accuracy.

The Test Adjusted R squared Score is 0.804.

A high Adjusted R squared score indicates a good fit.

Highlight the lines:

residuals = y_test - y_pred

plt.show()

To further analyze performance, we examine the residuals.
Highlight the output:

Residuals are the difference between actual and predicted values.

The red dashed line marks zero residuals for reference.

Most residuals are near zero, meaning the prediction error is low.
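
As a small worked example of the residual calculation, the sketch below uses made-up actual and predicted values. The tolerance of 0.25 for "near zero" is an arbitrary choice for this illustration.

```python
actual = [2.5, 1.8, 3.2, 0.9, 4.1]        # hypothetical test targets
predicted = [2.4, 2.0, 3.1, 1.5, 4.0]     # hypothetical model outputs

# Residual = actual - predicted, the same order as in the notebook line.
residuals = [a - p for a, p in zip(actual, predicted)]

# Share of predictions whose error is within a chosen tolerance of zero.
tolerance = 0.25
near_zero = sum(abs(r) <= tolerance for r in residuals) / len(residuals)

print([round(r, 2) for r in residuals])
print("share near zero:", near_zero)
```

A positive residual means the model predicted too low, and a negative residual means it predicted too high.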

Highlight the lines:

feature_importances = rf.feature_importances_

plt.show()

Besides accuracy, we analyze feature importance to see key prediction factors.

Feature importance shows how much each feature impacts predictions.

A higher value means a feature is more important for the model.

Show the output:

The plot shows MedInc has the highest impact on house price predictions.

Other features contribute less but still affect model performance.
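
To illustrate how feature importance scores can be read, here is a sketch on synthetic data where one feature clearly drives the target. The feature names, the data and the model settings are all hypothetical; only the `feature_importances_` attribute is taken from the notebook line above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
feature_names = ["signal", "weak", "noise"]   # hypothetical feature names
x_demo = rng.normal(size=(300, 3))
# Target depends strongly on column 0, weakly on column 1, not on column 2.
y_demo = 5.0 * x_demo[:, 0] + 0.5 * x_demo[:, 1] + rng.normal(scale=0.1, size=300)

rf_demo = RandomForestRegressor(n_estimators=50, random_state=0).fit(x_demo, y_demo)

importances = rf_demo.feature_importances_    # one score per feature, summing to 1
ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

The feature that drives the target should appear first in the ranking, in the same way that MedInc stands out in the tutorial's plot.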

Narration:

Thus, we have successfully built a Random Forest model for house price prediction.

The model achieved a Test Adjusted R squared Score of 0.804, indicating strong performance.

Show slide:

Summary


In this tutorial, we have learnt about

  • Ensemble Learning and
  • Random Forest
This brings us to the end of the tutorial. Let us summarize.


Show Slide:

Assignment

As an assignment, please do the following:
  • Replace the test_size parameter as shown here.
  • Observe the change in MSE and Adjusted R squared score
Show Slide:

Assignment Solution

Show x img

After completing the assignment, the output should match the expected result.
Show Slide:

FOSSEE Forum

For any general or technical questions on Python for Machine Learning, visit the FOSSEE forum and post your question.
Show Slide:

Thank you

This is Harini Theiveegan, a FOSSEE Summer Fellow 2025, IIT Bombay, signing off.

Thanks for joining.

Contributors and Content Editors

Madhurig, Nirmala Venkat