Machine-Learning-using-R - old 2022/C4/Random-Forest-using-R/English


Title of the script: Random Forest using R

Author: Tanmay Srinath

Keywords: R, RStudio, machine learning, supervised, unsupervised, classification, random forest, bagging, decision tree, video tutorial.

Visual Cue Narration
Show slide

Opening Slide

Welcome to this spoken tutorial on Random Forest using R.
Show slide

Learning Objectives

In this tutorial, we will learn about:
  • Random Forest
  • Bagging
  • Benefits of Random Forest
  • Applications of Random Forest
  • Random Forest on the iris dataset
  • Tuning a Random Forest model.
Show slide

System Specifications

This tutorial is recorded using,
  • Ubuntu Linux OS version 20.04
  • R version 4.2.0
  • RStudio version 2022.02.3


Show slide

Prerequisites

To follow this tutorial, the learner should know:
  • Basics of R programming.
  • Basics of Machine Learning


If not, please access the relevant tutorials on this website.

Show slide

Random Forest

Let us begin with Random Forest.
  • It is a powerful and versatile supervised machine learning algorithm.
  • It grows and combines multiple decision trees to create a “forest”.
  • It can be used for both classification and regression problems.


We will learn some benefits of using it.
Show slide

Benefits of Random Forest

Random forests build each tree on a random subset of the data.

The final output is the average (for regression) or the majority vote (for classification) of the individual trees.

Hence the problem of overfitting is mitigated.

Show slide

Bagging

Random Forest uses a concept called bagging to improve its performance.

Let us learn more about it.

Show slide

Bagging

Bagging
  • It is used to reduce the variance of statistical learning methods.
  • It works by creating multiple decision trees on multiple bootstrapped datasets.
Show slide

Bagging

  • Each of these decision trees is grown deep and left unpruned.
  • The results from these trees are averaged to provide the final output.


Show slide

Bagging in Random Forests

  • The bagging process is used to decorrelate the trees that make up a random forest.
  • A random sample of predictors is chosen at each split in each decision tree.
  • Thus, the average of the trees will be less variable and more reliable.
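Note: to make the idea of bootstrapped datasets concrete, here is a minimal base-R sketch of drawing one bootstrap sample from iris. The names n, boot_idx, boot_sample and oob_frac are illustrative choices, not part of the tutorial script.

# Draw one bootstrapped dataset by sampling rows with replacement.
set.seed(42)
n <- nrow(iris)
boot_idx <- sample(seq_len(n), size = n, replace = TRUE)
boot_sample <- iris[boot_idx, ]
# Rows never drawn are "out-of-bag"; on average about 37% of the data.
oob_frac <- length(setdiff(seq_len(n), boot_idx)) / n
oob_frac

Each tree in the forest is grown on a different such sample, and its out-of-bag rows provide an error estimate for that tree.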
Now let us learn about a few applications of Random Forest.
Show Slide

Applications of Random Forest

  • It is used in customer segmentation.

https://archive.ics.uci.edu/ml/datasets/Online+Retail+II

  • It is used in cancer diagnosis.

https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic)

Now let us implement Random Forest on the iris dataset.
Show slide

Download Files

For this tutorial, we will use a script file RandomForest.R.

Please download this file from the Code files link of this tutorial.

Make a copy and then use it for practising.

[Computer screen]

Highlight RandomForest.R and the folder RandomForest.

I have downloaded and moved this file to the RandomForest folder.


This folder is located in the MLProject folder on my Desktop.


I have also set the RandomForest folder as my Working Directory.

Let us switch to RStudio.
Click Script RandomForest.R

Point to RandomForest.R in RStudio.

Open the script RandomForest.R in RStudio.

Script RandomForest.R opens in RStudio.

[RStudio]

Highlight library(randomForest)

We will be using the randomForest package for creating our model.


Since I have already installed the package, I will directly import it.


If you have not installed the package, please install it before importing.
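Note: if the package is missing, it can be installed from CRAN with a single command:

install.packages("randomForest")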

[RStudio]

library(randomForest)

data(iris)

Select and run these commands.
Click in the Environment tab to load the iris dataset.
Now let us create our Random Forest model.
[RStudio]

set.seed(1007)

model=randomForest(formula = Species ~ ., data = iris, ntree=1000)

print(model)

Type the following commands.
Highlight set.seed(1007)


Highlight model=randomForest(formula = Species ~ ., data = iris, ntree=1000)


Highlight formula = Species ~ .

Highlight data = iris

Highlight ntree=1000

We set a seed for reproducible results.


This command creates a Random Forest model.


We are using this formula.


Here Species is taken as the dependent variable.

The remaining attributes are independent variables.


We are using the entire iris dataset to train the model.


The number of trees we are growing for this model is set to 1000.

Click Save and Run buttons. Save and run the commands.


The output is shown in the console window.

Drag boundary to see the console window clearly.
Highlight output in console


Highlight Type of random forest: classification


Highlight Number of trees: 1000.

Our model’s specifications are displayed here.


This tells us that our model performs classification.


This gives the number of trees grown in the forest.

Highlight output in console


Highlight No. of variables tried at each split: 2

This gives the number of variables sampled randomly at each split.


The default value for classification is sqrt(p), where p is the number of features.

For iris there are p = 4 features, so sqrt(4) = 2.

Highlight output in console


Highlight OOB estimate of error rate: 5.33%

Highlight Confusion matrix

This gives the OOB or out-of-bag error rate for our model.

It estimates the prediction error by testing each sample only on the trees that did not see it during training.

Our model has misclassified 8 samples: 3 in versicolor and 5 in virginica.
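Note: these numbers can also be read directly from the fitted object; a short sketch using components that classification randomForest objects carry:

model$confusion              # confusion matrix with per-class error
tail(model$err.rate, 1)      # OOB and per-class error rates after all 1000 trees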

Drag boundary to see the Source window clearly.
Let us plot our model now.
[RStudio]

plot(model)


Drag boundaries to see the plot clearly.

Type and run this command to get a plot.


Drag boundaries to see the plot clearly.

Highlight Output in Plot window. This plot compares the model's error against the number of trees.


Our model's error reaches a minimum at around 380 trees.

This knowledge will be used to tune our Random Forest.
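Note: the tree count at which the OOB error bottoms out can also be read off programmatically; a minimal sketch (the exact number depends on the random seed):

which.min(model$err.rate[, "OOB"])   # tree count with the lowest OOB error
min(model$err.rate[, "OOB"])         # the lowest OOB error itself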

Show Slide

Tune a Random Forest

Tune a Random Forest
  • Sometimes the default parameters of the model are not optimal.
  • Thus we need to tune our Random Forest by changing a few parameters.
  • In R, this is done using the tuneRF() function.


Drag boundaries to see the Source window clearly. Now let us switch to RStudio.


Drag boundaries to see the Source window clearly.

[RStudio]

tuneRF(iris[,-5], iris[,5], stepFactor = 0.5, plot = TRUE, ntreeTry = 380, improve = 0.01)


Highlight iris[,-5], iris[,5]

Highlight stepFactor = 0.5

Highlight ntreeTry = 380


Highlight improve = 0.01

Click the Run button.

Type this command.

This is the data that we pass as input: iris[,-5] contains the predictor columns.

We are predicting the species, given here by iris[,5].


At each step of the search, the mtry value is inflated or deflated by this factor.


We will try 380 trees for our model.


The search continues only if the OOB error improves by at least this relative amount.


Run the command to see the output in the console window.

Highlight output in console

And plot window.

We see that our model performs about as well with 2 variables per split as with 4.

So we will stick with 2 variables.
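Note: tuneRF() also returns a matrix of the mtry values tried and their OOB errors, so the best value can be picked programmatically; a minimal sketch, where tuned and best_mtry are our own variable names:

set.seed(1007)
tuned <- tuneRF(iris[,-5], iris[,5], stepFactor = 0.5,
                plot = TRUE, ntreeTry = 380, improve = 0.01)
best_mtry <- tuned[which.min(tuned[, "OOBError"]), "mtry"]
best_mtry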

Now let us create the optimised model.
[RStudio]

model_new=randomForest(Species~., data=iris, ntree = 380, mtry = 2)

print(model_new)

Type and run these commands.
Highlight output in console We can see that the OOB error has dropped to 4%.
Let us create a variable importance plot using the varImpPlot() function.
[RStudio]

varImpPlot(model_new)


Highlight output in plot window.

Type this command.


Run the command to see the output in the Plot window.


We see that Petal.Length is the most important variable for the Random Forest model.
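Note: the numeric scores behind this plot can be printed with the importance() function from the same package:

importance(model_new)   # MeanDecreaseGini for each predictor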

Only Narration. With this we come to the end of this tutorial. Let us summarize.
Show Slide

Summary

In this tutorial we have learnt:
  • Random Forest
  • Bagging
  • Benefits of Random Forest
  • Applications of Random Forest
  • Random Forest on the iris dataset
  • Tuning a Random Forest model


Show Slide

Assignment

Now we will suggest an assignment for this Spoken Tutorial.


Create a Random Forest model for the PimaIndiansDiabetes dataset.

Tune the model using the tuneRF() function.
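Note: the PimaIndiansDiabetes dataset ships with the mlbench package; a minimal starting point, where pima_model is our own variable name:

# install.packages("mlbench")   # if not already installed
library(mlbench)
library(randomForest)
data(PimaIndiansDiabetes)
set.seed(1007)
pima_model <- randomForest(diabetes ~ ., data = PimaIndiansDiabetes)
print(pima_model)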

Show slide

About the Spoken Tutorial Project

The video at the following link summarises the Spoken Tutorial project.

Please download and watch it.

Show slide

Spoken Tutorial Workshops

We conduct workshops using Spoken Tutorials and give certificates.


For more details, please contact us.

Show Slide

Spoken Tutorial Forum to answer questions

Please post your timed queries in this forum.
Show Slide

Forum to answer questions

Do you have any general/technical questions?

Please visit the forum given in the link.

Show Slide

Textbook Companion

The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.


We give certificates to those who do this.


For more details, please visit these sites.

Show Slide

Acknowledgment

Spoken Tutorial and FOSSEE projects are funded by the Ministry of Education, Government of India.
Show Slide

Thank You

This tutorial is contributed by Tanmay Srinath and Madhuri Ganapathi from IIT Bombay.

Thank you for watching.
