Machine-Learning-using-R - old 2022/C4/Random-Forest-using-R/English


Title of the script: Random Forest using R

Author: Tanmay Srinath

Keywords: R, RStudio, machine learning, supervised, unsupervised, classification, random forest, bagging, decision tree, video tutorial.

Visual Cue Narration
Show slide

Opening Slide

Welcome to this spoken tutorial on Random Forest using R.
Show slide

Learning Objectives

In this tutorial, we will learn about:
  • Random Forest
  • Bagging
  • Benefits of Random Forest
  • Applications of Random Forest
  • Random Forest on the iris dataset
  • Tuning a Random Forest model.
Show slide

System Specifications

This tutorial is recorded using,
  • Ubuntu Linux OS version 20.04
  • R version 4.2.0
  • RStudio version 2022.02.3


Show slide

Prerequisites

To follow this tutorial, the learner should know:
  • Basics of R programming.
  • Basics of Machine Learning


If not, please access the relevant tutorials on this website.

Show slide

Random Forest

Let us begin with Random Forest.
  • It is a powerful and versatile supervised machine learning algorithm.
  • It grows and combines multiple decision trees to create a “forest”.
  • It can be used for both classification and regression problems.


We will learn some benefits of using it.
Show slide

Benefits of Random Forest

Random forests build each tree on a random subset of the data.

The final output is the average (for regression) or the majority vote (for classification) of the individual trees.

Hence the problem of overfitting is mitigated.

Show slide

Bagging

Random Forest uses a concept called bagging to improve its performance.

Let us learn more about it.

Show slide

Bagging

Bagging
  • It is used to reduce the variance of statistical learning methods.
  • It works by creating multiple decision trees on multiple bootstrapped datasets.
Show slide

Bagging

  • Each of these decision trees is grown deep and left unpruned.
  • The results from these trees are averaged to provide the final output.


Show slide

Bagging in Random Forests

  • The bagging process is used to decorrelate the trees that make up a random forest.
  • A random sample of predictors is chosen at each split in each decision tree.
  • Thus, the average of the trees will be less variable and more reliable.
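Note: to make the idea of bootstrapped datasets concrete, here is a minimal base-R sketch of drawing one bootstrap sample from iris. The names n, boot_idx, boot_sample and oob_frac are illustrative choices, not part of the tutorial script.

# Draw one bootstrapped dataset by sampling rows with replacement.
set.seed(42)
n <- nrow(iris)
boot_idx <- sample(seq_len(n), size = n, replace = TRUE)
boot_sample <- iris[boot_idx, ]
# Rows never drawn are "out-of-bag"; on average about 37% of the data.
oob_frac <- length(setdiff(seq_len(n), boot_idx)) / n
oob_frac

Each tree in the forest is grown on a different such sample, and its out-of-bag rows provide an error estimate for that tree.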
Now let us learn about a few applications of Random Forest.
Show Slide

Applications of Random Forest

  • It is used in customer segmentation.

https://archive.ics.uci.edu/ml/datasets/Online+Retail+II

  • It is used in cancer diagnosis.

https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic)

Now let us implement Random Forest on the iris dataset.
Show slide

Download Files

For this tutorial, we will use a script file RandomForest.R.

Please download this file from the Code files link of this tutorial.

Make a copy and then use it for practising.

[Computer screen]

Highlight RandomForest.R and the folder RandomForest.

I have downloaded and moved this file to the RandomForest folder.


This folder is located in the MLProject folder on my Desktop.


I have also set the RandomForest folder as my Working Directory.

Let us switch to RStudio.
Click Script RandomForest.R

Point to RandomForest.R in RStudio.

Open the script RandomForest.R in RStudio.

Script RandomForest.R opens in RStudio.

[RStudio]

Highlight library(randomForest)

We will be using the randomForest package for creating our model.


Since I have already installed the package, I will directly import it.


If you have not installed the package, please install it before importing.
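Note: if the package is missing, it can be installed from CRAN with a single command:

install.packages("randomForest")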

[RStudio]

library(randomForest)

data(iris)

Select and run these commands.
Click in the Environment tab to load the iris dataset.
Now let us create our Random Forest model.
[RStudio]

set.seed(1007)

model=randomForest(formula = Species ~ ., data = iris, ntree=1000)

print(model)

Type the following commands.
Highlight set.seed(1007)


Highlight model=randomForest(formula = Species ~ ., data = iris, ntree=1000)


Highlight formula = Species ~ .

Highlight data = iris

Highlight ntree=1000

We set a seed for reproducible results.


This command creates a Random Forest model.


We are using this formula.


Here Species is taken as the dependent variable.

The remaining attributes are independent variables.


We are using the entire iris dataset to train the model.


The number of trees we are growing for this model is set to 1000.

Click Save and Run buttons. Save and run the commands.


The output is shown in the console window.

Drag boundary to see the console window clearly.
Highlight output in console


Highlight Type of random forest: classification


Highlight Number of trees: 1000.

Our model’s specifications are displayed here.


This tells us that our model performs classification.


This gives the number of trees grown in the forest.

Highlight output in console


Highlight No. of variables tried at each split: 2

This gives the number of variables sampled randomly at each split.


The default value for classification is sqrt(p), where p is the number of features.

For iris there are p = 4 features, so sqrt(4) = 2.

Highlight output in console


Highlight OOB estimate of error rate: 5.33%

Highlight Confusion matrix

This gives the OOB or out-of-bag error rate for our model.

It estimates the prediction error by testing each sample only on the trees that did not see it during training.

Our model has misclassified 8 samples: 3 in versicolor and 5 in virginica.
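Note: these numbers can also be read directly from the fitted object; a short sketch using components that classification randomForest objects carry:

model$confusion              # confusion matrix with per-class error
tail(model$err.rate, 1)      # OOB and per-class error rates after all 1000 trees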

Drag boundary to see the Source window clearly.
Let us plot our model now.
[RStudio]

plot(model)


Drag boundaries to see the plot clearly.

Type and run this command to get a plot.


Drag boundaries to see the plot clearly.

Highlight Output in Plot window. This plot compares the model's error against the number of trees.


Our model's error reaches a minimum at around 380 trees.

This knowledge will be used to tune our Random Forest.
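Note: the tree count at which the OOB error bottoms out can also be read off programmatically; a minimal sketch (the exact number depends on the random seed):

which.min(model$err.rate[, "OOB"])   # tree count with the lowest OOB error
min(model$err.rate[, "OOB"])         # the lowest OOB error itself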

Show Slide

Tune a Random Forest

Tune a Random Forest
  • Sometimes the default parameters of the model are not optimal.
  • Thus we need to tune our Random Forest by changing a few parameters.
  • In R, this is done using the tuneRF() function.


Drag boundaries to see the Source window clearly. Now let us switch to RStudio.


Drag boundaries to see the Source window clearly.

[RStudio]

tuneRF(iris[,-5], iris[,5], stepFactor = 0.5, plot = TRUE, ntreeTry = 380, improve = 0.01)


Highlight iris[,-5], iris[,5]

Highlight stepFactor = 0.5

Highlight ntreeTry = 380


Highlight improve = 0.01

Click the Run button.

Type this command.

This is the data that we pass as input: iris[,-5] contains the predictor columns.

We are predicting the species, given here by iris[,5].


At each step of the search, the mtry value is inflated or deflated by this factor.


We will try 380 trees for our model.


The search continues only if the OOB error improves by at least this relative amount.


Run the command to see the output in the console window.

Highlight output in console

And plot window.

We see that our model performs about as well with 2 variables per split as with 4.

So we will stick with 2 variables.
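Note: tuneRF() also returns a matrix of the mtry values tried and their OOB errors, so the best value can be picked programmatically; a minimal sketch, where tuned and best_mtry are our own variable names:

set.seed(1007)
tuned <- tuneRF(iris[,-5], iris[,5], stepFactor = 0.5,
                plot = TRUE, ntreeTry = 380, improve = 0.01)
best_mtry <- tuned[which.min(tuned[, "OOBError"]), "mtry"]
best_mtry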

Now let us create the optimised model.
[RStudio]

model_new=randomForest(Species~., data=iris, ntree = 380, mtry = 2)

print(model_new)

Type and run these commands.
Highlight output in console We can see that the OOB error has dropped to 4%.
Let us create a variable importance plot using the varImpPlot() function.
[RStudio]

varImpPlot(model_new)


Highlight output in plot window.

Type this command.


Run the command to see the output in the Plot window.


We see that Petal.Length is the most important variable for the Random Forest model.
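Note: the numeric scores behind this plot can be printed with the importance() function from the same package:

importance(model_new)   # MeanDecreaseGini for each predictor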

Only Narration. With this we come to the end of this tutorial. Let us summarize.
Show Slide

Summary

In this tutorial we have learnt:
  • Random Forest
  • Bagging
  • Benefits of Random Forest
  • Applications of Random Forest
  • Random Forest on the iris dataset
  • Tuning a Random Forest model


Show Slide

Assignment

Now we will suggest an assignment for this Spoken Tutorial.


Create a Random Forest model for the PimaIndiansDiabetes dataset.

Tune the model using the tuneRF() function.
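Note: the PimaIndiansDiabetes dataset ships with the mlbench package; a minimal starting point, where pima_model is our own variable name:

# install.packages("mlbench")   # if not already installed
library(mlbench)
library(randomForest)
data(PimaIndiansDiabetes)
set.seed(1007)
pima_model <- randomForest(diabetes ~ ., data = PimaIndiansDiabetes)
print(pima_model)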

Show slide

About the Spoken Tutorial Project

The video at the following link summarises the Spoken Tutorial project.

Please download and watch it.

Show slide

Spoken Tutorial Workshops

We conduct workshops using Spoken Tutorials and give certificates.


For more details, please contact us.

Show Slide

Spoken Tutorial Forum to answer questions

Please post your timed queries in this forum.
Show Slide

Forum to answer questions

Do you have any general/technical questions?

Please visit the forum given in the link.

Show Slide

Textbook Companion

The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.


We give certificates to those who do this.


For more details, please visit these sites.

Show Slide

Acknowledgment

Spoken Tutorial and FOSSEE projects are funded by the Ministry of Education, Government of India.
Show Slide

Thank You

This tutorial is contributed by Tanmay Srinath and Madhuri Ganapathi from IIT Bombay.

Thank you for watching.
