Difference between revisions of "Python-for-Machine-Learning/C3/Random-Forest/English"

Latest revision as of 19:48, 17 July 2025

Visual Cue	Narration
Show slide: Welcome	Welcome to the Spoken Tutorial on Random Forest.
Show Slide: Learning Objectives	In this tutorial, we will learn about Ensemble Learning and Random Forest.
Show Slide:	To record this tutorial, I am using Ubuntu Linux operating system version 24.04 and Jupyter Notebook IDE.
Show Slide: Prerequisite	To follow this tutorial, the learner must have basic knowledge of Python. For prerequisite Python tutorials, please visit this website.
Show Slide: Code files	The files used in this tutorial are provided in the Code files link. Please download and extract the files. Make a copy and then use them while practicing.
Show slide: Ensemble Learning	Ensemble learning combines multiple models to improve performance. It reduces errors by averaging predictions from different models. One popular ensemble method is Random Forest, which uses decision trees.
Show slide: Random Forest	Random Forest builds multiple decision trees and combines their outputs. It takes the majority vote for classification and the average prediction for regression. This improves accuracy and reduces overfitting compared to a single tree. It is widely used for classification and regression tasks.
Point to the RandomForest.ipynb	RandomForest.ipynb is the Python notebook file for the demonstration of Random Forest.
Press Ctrl,Alt and T keys Type conda activate ml Press Enter	Let us open the Linux terminal. Press Ctrl, Alt and T keys together. Activate the machine learning environment by typing conda space activate space ml Press Enter.
Go to the Downloads folder Type cd Downloads Press Enter Type jupyter notebook Press Enter	I have saved my code file in the Downloads folder. Please navigate to the respective folder of your code file location. Then type, jupyter space notebook and press Enter to open Jupyter Notebook.
Show Jupyter Notebook Home page: Click on RandomForest.ipynb file	We can see the Jupyter Notebook Home page has opened in the web browser. Click the RandomForest dot ipynb file to open it. Note that each cell will have the output displayed in this file.
Highlight the lines: import pandas as pd import numpy as np from sklearn.metrics import accuracy_score	We start by importing the required libraries for the Random Forest model. Please remember to execute each cell by pressing Shift and Enter to get the output.
Highlight the lines: california = fetch_california_housing() housing = pd.DataFrame(california.data, columns=california.feature_names) housing['MedHouseVal'] = california.target	We use the California Housing dataset from the sklearn library. It has housing data from California districts. We analyze the dataset and predict MedHouseVal. Now, we load the California Housing dataset into a Pandas DataFrame. We add the target column, MedHouseVal, which represents the median house value.
Highlight the lines: housing.head()	To check the dataset, we display the first few rows using the head method.
Highlight the lines: housing.shape	Next, we use the shape attribute to check the number of rows and columns.
Highlight the lines: housing.info()	Now, we print dataset details to understand its structure, such as column data types and non-null counts.
Highlight the lines: housing_sorted = housing.sort_values(by="HouseAge")	Let's visualize house age vs median value using a line plot.
Highlight the output:	The plot shows newer homes have higher values, then prices stabilize. The shaded region represents price variability in different age groups.
Highlight the lines: skewed_features = ['AveRooms', 'AveBedrms', 'Population', 'AveOccup']	Before training, we check for skewed features that may affect predictions. Next, we apply log transformation to reduce skewness in these features.
Highlight the lines: scaler = StandardScaler()	Now, we scale the feature values using StandardScaler. This ensures all features have a similar range for better modeling.
Highlight the lines: x = housing.drop(columns=["MedHouseVal"]) x.head()	Then we separate the features by dropping the target column. After that, we display the first few rows to check the data.
Highlight the lines: y = housing.MedHouseVal y.head()	Next, we extract the target variable and store it in y.
Highlight the lines: x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)	Now, we split the dataset into training and testing sets.
Highlight the lines: rf =	Next, we create a Random Forest Regressor with 100 trees. Then, we train the model using the training data.
Highlight the lines: y_train_pred = rf.predict(x_train)	After training, we predict house values for the training set.
Highlight the lines: training_mse =	Now, we compute the Mean Squared Error for the training data. We also compute the Adjusted R squared score for better evaluation.
Highlight the output:	The Training MSE is 0.037, showing a low prediction error. The Training Adjusted R squared Score is 0.972. A high Adjusted R squared score suggests the model explains most variance.
Highlight the lines: y_pred = rf.predict(x_test)	Now, we predict house prices for the test dataset.
Highlight the lines: print("Random Forest - Actual vs Predicted:")	Next, we compare actual vs predicted values for the test set.
Highlight the lines: test_mse = mean_squared_error(y_test,y_pred)	After that, we calculate the MSE for the test predictions. Then, we compute the Adjusted R squared score for the test set.
Highlight the output	For the test set, the MSE is 0.257, indicating reasonable accuracy. The Test Adjusted R squared Score is 0.804. A higher Adjusted R Squared score indicates a good fit.
Highlight the lines: residuals = y_test - y_pred plt.show()	To further analyze performance, we examine the residuals.
Highlight the output:	Residuals are the difference between actual and predicted values. The red dashed line marks zero residuals for reference. Most residuals are near zero, meaning the prediction error is low.
Highlight the lines: feature_importances = rf.feature_importances_ plt.show()	Besides accuracy, we analyze feature importance to see the key prediction factors. Feature importance shows how much each feature impacts predictions. Higher values mean a feature is more important for the model.
Show the output:	The plot shows MedInc has the highest impact on house price predictions. Other features contribute less but still affect model performance.
Narration	Thus, we successfully built a Random Forest model for house price prediction. The model showed high accuracy, indicating strong performance.
Show slide: Summary In this tutorial, we have learnt about Ensemble Learning and Random Forest	This brings us to the end of the tutorial. Let us summarize.
Show Slide: Assignment	As an assignment, please do the following: Replace the test_size parameter as shown here. Observe the change in the MSE and the Adjusted R squared score.
Show Slide: Assignment Solution Show x img	After completing the assignment, the output should match the expected result.
Show Slide: FOSSEE Forum	For any general or technical questions on Python for Machine Learning, visit the FOSSEE forum and post your question.
Show Slide: Thank you	This is Harini Theiveegan, a FOSSEE Summer Fellow 2025, IIT Bombay, signing off. Thanks for joining.
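
The training and evaluation steps in this script can be sketched end to end as follows. To keep the example runnable without downloading data, it assumes a synthetic dataset from make_regression instead of the California Housing dataset, so the printed numbers will not match the values in the narration. The 100 trees, test_size=0.3 and random_state=42 for the split follow the script. Setting random_state=42 on the model itself is an assumption added for repeatable results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (assumption, not the tutorial dataset)
x, y = make_regression(n_samples=500, n_features=8, n_informative=4,
                       noise=10.0, random_state=42)

# Hold out 30% of the rows for testing, as in the script
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=42)

# Random Forest Regressor with 100 trees, trained on the training set
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(x_train, y_train)

# Predictions for the training and test sets
y_train_pred = rf.predict(x_train)
y_pred = rf.predict(x_test)

# Error metrics for both sets
training_mse = mean_squared_error(y_train, y_train_pred)
test_mse = mean_squared_error(y_test, y_pred)
test_r2 = r2_score(y_test, y_pred)

# For regression, the forest prediction is the average of its trees' predictions
tree_mean = np.mean([tree.predict(x_test) for tree in rf.estimators_], axis=0)

# Residuals and per-feature importance scores
residuals = y_test - y_pred
feature_importances = rf.feature_importances_

print("Training MSE:", training_mse)
print("Test MSE:", test_mse)
print("Test R squared:", test_r2)
print("Feature importances:", feature_importances)
```

Comparing the training and test MSE, as the narration does, gives a quick check for overfitting. A much lower training error means the model fits the training data more closely than unseen data.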

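The narration reports the MSE and the Adjusted R squared score, but this script does not show the full cells that compute them. The sketch below is one common way to compute both in plain Python. It assumes the standard formula, Adjusted R squared = 1 - (1 - R squared) * (n - 1) / (n - p - 1), where n is the number of samples and p is the number of features. The function names mse, r2 and adjusted_r2 and the sample values are hypothetical and are not taken from the notebook.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """R squared: 1 minus residual sum of squares over total sum of squares."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((a - p) ** 2 for a, p in zip(y_true, y_pred))
    ss_tot = sum((a - mean_y) ** 2 for a in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r2(y_true, y_pred, n_features):
    """R squared adjusted for the number of features used by the model."""
    n = len(y_true)
    return 1 - (1 - r2(y_true, y_pred)) * (n - 1) / (n - n_features - 1)


# Hypothetical actual and predicted values, for illustration only
y_true = [3.0, 1.5, 2.0, 4.5, 2.5]
y_pred = [2.5, 1.5, 2.5, 4.0, 3.0]

print("MSE:", mse(y_true, y_pred))
print("R squared:", r2(y_true, y_pred))
print("Adjusted R squared:", adjusted_r2(y_true, y_pred, n_features=2))
```

Adjusted R squared penalizes extra features, so it stays below plain R squared whenever the model uses at least one feature and R squared is below 1.
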
Contributors and Content Editors

Madhurig, Nirmala Venkat

@@ Line 90: / Line 90: @@
 Please navigate to the respective folder of your code file location.
-Then type, '''jupyter space notebook '''and press Enter to open Jupyter Notebook.
+Then type, '''jupyter space notebook '''and press Enter to open '''Jupyter Notebook'''.
 |-
 || Show '''Jupyter Notebook Home page''':
@@ Line 106: / Line 106: @@
 '''import numpy as np'''
 '''from sklearn.metrics import accuracy_score'''
 || We start by importing the required libraries for the Random Forest model.
@@ Line 217: / Line 218: @@
 || The '''Training MSE '''is '''0.037''', showing a low prediction error.
-The '''Training Adjusted R squared Score''' is '''0.972'''.A high Adjusted R squared score suggests the model explains most variance.
+The '''Training Adjusted R squared Score''' is '''0.972'''.
+A high Adjusted R squared score suggests the model explains most variance.
 |-
 || Highlight the lines:
@@ Line 238: / Line 241: @@
 |-
 || Highlight the output
-|| For the test set, the '''MSE''' is '''0.257''', indicating reasonable accuracy'''.'''
+|| For the test set, the '''MSE''' is '''0.257''', indicating reasonable accuracy.
 The '''Test Adjusted R squared Score''' is '''0.804'''.
@@ Line 252: / Line 255: @@
 |-
 || Highlight the output:
-|| Residuals are the difference between '''actual and predicted values.'''
+|| Residuals are the difference between actual and predicted values.
 The red dashed line marks zero residuals for reference.
@@ Line 282: / Line 285: @@
 '''Summary'''
-|| This brings us to the end of the tutorial. Let us summarize.
 In this tutorial, we have learnt about
 * '''Ensemble Learning and'''
 *  '''Random Forest'''
+|| This brings us to the end of the tutorial. Let us summarize.
 |-
