Difference between revisions of "Python-for-Machine-Learning/C3/Random-Forest/English"
Latest revision as of 19:48, 17 July 2025
| Visual Cue | Narration |
| Show slide:
Welcome |
Welcome to the Spoken Tutorial on Random Forest. |
| Show Slide:
Learning Objectives |
In this tutorial, we will learn about
|
| Show Slide: | To record this tutorial, I am using
|
| Show Slide:
Prerequisite |
To follow this tutorial,
|
| Show Slide:
Code files |
|
| Show slide:
Ensemble Learning |
|
| Show slide:
Random Forest |
|
| Point to the RandomForest.ipynb |
RandomForest.ipynb is the Python notebook file for the demonstration of Random Forest. |
| Press Ctrl, Alt and T keys
Type conda activate ml Press Enter |
Let us open the Linux terminal. Press Ctrl, Alt and T keys together.
Activate the machine learning environment by typing conda space activate space ml, then press Enter. |
| Go to the Downloads folder
Type cd Downloads Press Enter Type jupyter notebook Press Enter |
I have saved my code file in the Downloads folder.
Please navigate to the respective folder of your code file location. Then type, jupyter space notebook and press Enter to open Jupyter Notebook. |
| Show Jupyter Notebook Home page:
Click on RandomForest.ipynb file |
We can see the Jupyter Notebook Home page has opened in the web browser.
Click the RandomForest dot ipynb file to open it. Note that each cell will have the output displayed in this file. |
| Highlight the lines:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score |
We start by importing the required libraries for the Random Forest model.
Please remember to execute each cell by pressing Shift and Enter to get the output. |
| Highlight the lines:
california = fetch_california_housing()
housing = pd.DataFrame(california.data, columns=california.feature_names)
housing['MedHouseVal'] = california.target |
We use the California Housing dataset from the sklearn library.
It has housing data from California districts. We analyze the dataset and predict MedHouseVal. Now, we load the California Housing dataset into a Pandas DataFrame. We add the target column, MedHouseVal, which represents the median house value. |
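The loading step above can be sketched on a small synthetic table. This is only an illustration: the random arrays stand in for california.data and california.target, and the three feature names are a shortened, assumed subset so that the sketch runs without downloading the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins (assumption): in the notebook these arrays come from
# fetch_california_housing(), which downloads the real dataset.
rng = np.random.default_rng(0)
feature_names = ["MedInc", "HouseAge", "AveRooms"]
data = rng.random((5, 3))
target = rng.random(5)

# Build the feature table, then append the target as one more column.
housing = pd.DataFrame(data, columns=feature_names)
housing["MedHouseVal"] = target

print(housing.shape)          # (5, 4): 5 rows, 3 features plus the target
print(list(housing.columns))
```

The same two steps apply to the real dataset: the feature matrix becomes the DataFrame and the target array becomes the MedHouseVal column.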
| Highlight the lines:
housing.head() |
To check the dataset, we display the first few rows using the head function. |
| Highlight the lines:
housing.shape |
Next, we use the shape attribute to check the number of rows and columns. |
| Highlight the lines:
housing.info() |
Now, we print dataset details to understand its structure, such as column types and non-null counts. |
| Highlight the lines:
housing_sorted = housing.sort_values(by="HouseAge") |
Let's visualize house age vs median value using a line plot. |
| Highlight the output: | The plot shows newer homes have higher values, then prices stabilize.
The shaded region represents price variability in different age groups. |
| Highlight the lines:
skewed_features = ['AveRooms', 'AveBedrms', 'Population', 'AveOccup'] |
Before training, we check for skewed features that may affect predictions.
Next, we apply log transformation to reduce skewness in these features. |
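A log transformation of skewed columns could look like the sketch below. The narration does not say which log function the notebook uses, so np.log1p (log of 1 + x) is an assumption here, as are the helper name log_transform and the tiny demo table.

```python
import numpy as np
import pandas as pd

def log_transform(df, columns):
    """Return a copy of df with log(1 + x) applied to the given columns."""
    out = df.copy()
    out[columns] = np.log1p(out[columns])
    return out

# Hypothetical demo data: Population is strongly right-skewed.
demo = pd.DataFrame({
    "Population": [10.0, 100.0, 1000.0, 100000.0],
    "HouseAge": [5.0, 10.0, 20.0, 40.0],
})
transformed = log_transform(demo, ["Population"])

# Skewness drops after the transform; untouched columns stay the same.
print(demo["Population"].skew(), transformed["Population"].skew())
```

np.log1p is a common choice because it is defined at zero, but any log variant has the same effect of compressing large values.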
| Highlight the lines:
scaler = StandardScaler() |
Now, we scale the feature values using StandardScaler.
This standardizes each feature to zero mean and unit variance, so all features are on a similar scale for better modeling. |
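To see what StandardScaler does, here is a minimal sketch on a made-up array with two columns of very different magnitude. Which columns the notebook actually scales is not shown in this script, so the array is purely illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # subtract column mean, divide by column std

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```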
| Highlight the lines:
x = housing.drop(columns=["MedHouseVal"])
x.head() |
Then we separate the features by dropping the target column.
After that, we display the first few rows to check the data. |
| Highlight the lines:
y = housing.MedHouseVal
y.head() |
Next, we extract the target variable and store it in y. |
| Highlight the lines:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42) |
Now, we split the dataset into training and testing sets, keeping 30 percent of the rows for testing. |
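The effect of test_size=0.3 can be checked on a small synthetic dataset. The arrays below are placeholders, not the housing data; the train_test_split arguments are the ones shown in the highlighted line.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 rows, 2 features.
x = np.arange(200).reshape(100, 2)
y = np.arange(100)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=42
)

print(x_train.shape, x_test.shape)  # (70, 2) (30, 2)
```

Fixing random_state makes the split reproducible, so the same rows land in the test set on every run.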
| Highlight the lines:
rf = |
Next, we create a Random Forest Regressor with 100 trees.
Then, we train the model using the training data. |
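As a rough sketch of this step, the code below fits a 100-tree regressor on synthetic data. The highlighted line is cut off in this script, so the exact arguments used in the notebook are not known; n_estimators=100 follows the narration, while random_state=42 and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic training data (assumption): the target depends mostly on feature 0.
rng = np.random.default_rng(0)
x_train = rng.random((200, 3))
y_train = 3.0 * x_train[:, 0] + rng.normal(0, 0.05, 200)

# 100 trees, as stated in the narration; random_state is an assumed choice.
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(x_train, y_train)

# Each prediction is the average of the individual trees' predictions.
y_train_pred = rf.predict(x_train)
print(len(rf.estimators_), y_train_pred.shape)
```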
| Highlight the lines:
y_train_pred = rf.predict(x_train) |
After training, we predict house values for the training set. |
| Highlight the lines:
training_mse = |
Now, we compute the Mean Squared Error for training data.
We also compute the adjusted R squared score for better evaluation. |
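The two metrics can be written out by hand to show what they measure. The notebook's own code for this cell is truncated here, so the helper names below (mse, r_squared, adjusted_r_squared) and the toy values are hypothetical; the adjusted score uses the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of samples and p the number of features.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Fraction of the variance in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n_samples, n_features):
    """Penalize R squared for the number of features used."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Toy values for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.0]

r2 = r_squared(y_true, y_pred)
print(mse(y_true, y_pred), r2, adjusted_r_squared(r2, n_samples=5, n_features=1))
```

Because the penalty grows with the number of features, the adjusted score is never higher than the plain R squared score when at least one feature is used.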
| Highlight the output: | The Training MSE is 0.037, showing a low prediction error.
The Training Adjusted R squared Score is 0.972. A high Adjusted R squared score suggests the model explains most variance. |
| Highlight the lines:
y_pred = rf.predict(x_test) |
Now, we predict house prices for the test dataset. |
| Highlight the lines:
print("Random Forest - Actual vs Predicted:") |
Next, we compare actual vs predicted values for the test set. |
| Highlight the lines:
test_mse = mean_squared_error(y_test, y_pred) |
After that, we calculate the MSE for the test predictions.
Then, we compute the Adjusted R squared score for the test set. |
| Highlight the output | For the test set, the MSE is 0.257, indicating reasonable accuracy.
The Test Adjusted R squared Score is 0.804. A higher Adjusted R squared score indicates a better fit. |
| Highlight the lines:
residuals = y_test - y_pred
plt.show() |
To further analyze performance, we examine the residuals. |
| Highlight the output: | Residuals are the difference between actual and predicted values.
The red dashed line marks zero residuals for reference. Most residuals are near zero, meaning the prediction error is low. |
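The sign convention for residuals is easy to check on a few made-up numbers. The values below are not from the housing data; only the expression y_test - y_pred comes from the notebook.

```python
import numpy as np

# Hypothetical actual and predicted values.
y_test = np.array([2.0, 1.5, 3.0, 0.8])
y_pred = np.array([1.8, 1.6, 3.4, 0.8])

# Positive residual: the model predicted too low.
# Negative residual: the model predicted too high.
residuals = y_test - y_pred
print(residuals)

# Average size of the error, ignoring its sign.
mean_abs_residual = np.abs(residuals).mean()
print(mean_abs_residual)
```

A residual plot places these values on the vertical axis, so points close to the zero line correspond to accurate predictions.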
| Highlight the lines:
feature_importances = rf.feature_importances_
plt.show() |
Besides accuracy, we analyze feature importance to see key prediction factors.
Feature importance shows how much each feature impacts predictions. A higher value means a feature is more important for the model. |
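Here is a small sketch of reading feature importances, assuming synthetic data in which one feature is built to dominate. The column names are borrowed from the housing dataset only as labels, and random_state=42 is an assumed choice.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (assumption): the target depends strongly on MedInc,
# weakly on HouseAge, and not at all on AveRooms.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((300, 3)), columns=["MedInc", "HouseAge", "AveRooms"])
y = 5.0 * X["MedInc"] + 0.5 * X["HouseAge"] + rng.normal(0, 0.05, 300)

rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# Importances are non-negative and sum to 1 across all features.
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Sorting the importances first makes a bar plot easier to read, since the most influential feature appears at one end.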
| Show the output: | The plot shows MedInc has the highest impact on house price predictions.
Other features contribute less but still affect model performance. |
| Narration | Thus, we successfully built a Random Forest model for house price prediction.
The model achieved a Test Adjusted R squared Score of 0.804, indicating strong performance. |
| Show slide:
Summary
|
This brings us to the end of the tutorial. Let us summarize.
|
| Show Slide:
Assignment |
As an assignment, please do the following:
|
| Show Slide:
Assignment Solution Show x img |
After completing the assignment, the output should match the expected result. |
| Show Slide:
FOSSEE Forum |
For any general or technical questions on Python for Machine Learning, visit the FOSSEE forum and post your question. |
| Show Slide:
Thank you |
This is Harini Theiveegan, a FOSSEE Summer Fellow 2025, IIT Bombay, signing off.
Thanks for joining. |