Machine-Learning-using-R/C3/Bagging-in-R/English
Title of the script: Bagging Algorithm for Decision Tree using R
Author: Debatosh Chakraborty and Yate Asseke Ronald O.
Keywords: R, RStudio, Bagging Algorithm, machine learning, supervised, unsupervised, dataset, video tutorial.
Visual Cue | Narration |
Show slide
Opening Slide |
Welcome to this Spoken Tutorial on Bagging in R. |
Show slide
Learning Objectives |
In this tutorial, we will learn about:
Bagging (Bootstrap aggregation).
Implementation of Bagging of a Decision Tree classifier on the Raisin dataset using R.
|
Show slide
System Specifications |
This tutorial is recorded using,
It is recommended to install R version 4.2.0 or higher. |
Show slide
Prerequisites |
To follow this tutorial, the learner should know:
Basic programming in R. Basics of Machine Learning. If not, please access the relevant tutorials on this website. |
Show slide
Bootstrap aggregation (Bagging) |
Now let us learn about Bootstrap aggregation or Bagging.
Bagging is an ensemble technique that trains several base classifiers on bootstrap samples of the training data.
Their predictions are aggregated, typically by majority vote, to give the final prediction, as in the sketch below.
|
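Below is a minimal sketch of this idea, assuming only the rpart package and R's built-in iris dataset (not part of this tutorial's files): each tree is trained on a bootstrap sample, and the ensemble predicts by majority vote.

library(rpart)
set.seed(1)
n_trees <- 25
# Train one decision tree on each bootstrap sample of the data
models <- lapply(1:n_trees, function(i) {
  boot_idx <- sample(nrow(iris), replace = TRUE)
  rpart(Species ~ ., data = iris[boot_idx, ])
})
# Aggregate: majority vote across the individual trees
votes <- sapply(models, function(m) as.character(predict(m, iris, type = "class")))
pred <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(pred == iris$Species)  # accuracy of the ensemble on the full data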
Show slide
Assumptions of Bagging
|
Primarily, the assumptions of the chosen base classifier must be satisfied for Bagging. |
Show slide
Advantages of Bagging |
Advantages of Bagging include:
It reduces the variance of the base classifier.
It helps to prevent overfitting.
It improves the stability and accuracy of predictions.
|
Show slide
Implementation of Bagging |
Now we will perform Bagging of a Decision Tree classifier on the Raisin dataset with two chosen variables. |
Show slide
Download Files |
For this tutorial, I will use a script file Bagging-Decision-Tree.R and the Raisin dataset 'Raisin.xlsx'.
Please download these files from the Code files link of this tutorial. Make a copy and then use them while practicing. |
[Computer screen]
Highlight Bagging-Decision-Tree.R and the folder |
I have downloaded and moved these files to the Bagging folder.
The Bagging folder is in the MLProject folder. I have also set the Bagging folder as my working directory. |
Let us switch to RStudio. | |
Double click Bagging-Decision-Tree.R in RStudio
Point to Bagging-Decision-Tree.R in RStudio. |
Open the script Bagging-Decision-Tree.R in RStudio.
The script Bagging-Decision-Tree.R opens in RStudio. |
[RStudio]
library(readxl)
library(ipred)
library(caret)
library(cvms)
library(rpart) |
Select and run these commands to import the necessary packages. |
[RStudio]
Highlight library(ipred) Highlight library(rpart) Highlight library(cvms) |
The ipred library contains the bagging() function.
The rpart library will be used to implement the decision tree model for Bagging.
We will use the cvms package for plotting the confusion matrix.
As I have already installed these packages, I have directly imported them. |
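If any of these packages are missing on your machine, they can be installed first; a minimal sketch, assuming an internet connection:

# Run once if the packages are not yet installed
install.packages(c("readxl", "ipred", "caret", "cvms", "rpart"))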
Highlight
data <- read_xlsx("Raisin.xlsx")
data <- data[c("minorAL", "ecc", "class")]
data$class <- factor(data$class) |
Run these commands to import the Raisin dataset and prepare it for model building.
Click on data in the Environment tab to load it in the Source window. |
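Optionally, the prepared data can also be inspected in the console; a small sketch, assuming the data object created above:

head(data)  # first few rows of minorAL, ecc and class
str(data)   # confirms that class is now a factor with two levels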
[RStudio]
set.seed(1)
index_split=sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE)
train_data <- data[index_split, ]
test_data <- data[-c(index_split), ] |
Type these commands in the Source window to perform the train-test split. |
Highlight set.seed(1)
Highlight sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE) Highlight replace=FALSE Select the commands and click the Run button. |
Select and run the commands.
The data sets will be shown in the Environment tab. |
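Optionally, the split can be verified in the console; a small sketch, assuming the objects created above:

nrow(train_data)          # about 70 percent of the rows
nrow(test_data)           # the remaining 30 percent
table(train_data$class)   # class balance in the training set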
Let us now create our Bagging model. | |
[RStudio]
bagging_model <- bagging(class ~ ., data = train_data, coob = TRUE, nbagg = 200,
control = rpart.control(cp = 0.00001, xval = 10, maxdepth = 2)) |
In the source window type these commands. |
Highlight
bagging_model <- bagging(class ~ ., data = train_data, coob = TRUE, nbagg = 200,
control = rpart.control(cp = 0.00001, xval = 10, maxdepth = 2)) |
bagging(): The bagging() function is used to create a bagging ensemble model.
class ~ .: This formula indicates that the model should predict the 'class' variable. It uses all other variables in train_data as predictors.
data: The dataset used for building the model, specified as train_data.
coob: When coob is TRUE, the out-of-bag (OOB) error estimate is computed. OOB error is a technique to measure the error of the generated bootstrap classifiers.
nbagg: Sets the number of bootstrap replicates for bagging. It is set to 200 in this case.
The rpart.control argument allows us to set the hyperparameters of the base classifier.
cp denotes the complexity parameter, which is set to 0.00001.
xval is the number of cross-validations, which is set to 10.
maxdepth is the maximum depth of any node of the final tree. It is limited to 2 in this case.
Select and run the command to train the model. |
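As an aside, the effect of nbagg on the OOB error can be explored with a short loop; a hedged sketch, assuming train_data exists as created above and that ipred stores the OOB estimate in the err component:

# Compare the OOB error for a few values of nbagg
for (n in c(50, 100, 200)) {
  m <- bagging(class ~ ., data = train_data, coob = TRUE, nbagg = n,
               control = rpart.control(cp = 0.00001, xval = 10, maxdepth = 2))
  cat("nbagg =", n, "OOB error =", m$err, "\n")
}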
print(bagging_model) | In the Source window type and run this command. |
Point to the console window. | The output is shown in the console window.
Drag the boundary to see the console window clearly. |
Highlight
Out-of-bag estimate of misclassification error: 0.1746 |
We can confirm that our model is trained successfully.
The out-of-bag misclassification error of the model is 0.1746. |
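The same value can be read directly from the model object; a small sketch, assuming ipred stores the OOB estimate in the err component when coob = TRUE:

bagging_model$err   # out-of-bag misclassification error, about 0.1746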
[RStudio]
predictions <- predict(bagging_model, newdata = test_data, type = "class") |
Let us now use our model for prediction.
In the Source window, type and run the command. |
Highlight
predictions <- predict(bagging_model, newdata = test_data, type = "class") Click on Save and Run buttons. |
This command stores the predictions of the model bagging_model on the test data in a variable named predictions. |
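Optionally, the predictions can be inspected before evaluation; a small sketch, assuming the objects created above:

head(predictions)    # predicted class of the first few test samples
table(predictions)   # number of test samples predicted in each class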
Let's now evaluate our model. | |
[RStudio]
confusion_matrix <- confusionMatrix(predictions, test_data$class) |
Type this command in the Source window. |
Highlight
confusion_matrix <- confusionMatrix(predictions, test_data$class) |
This command will create a confusion matrix list.
The list will contain the different evaluation metrics. Select and run the command. |
[RStudio]
confusion_matrix$overall["Accuracy"] |
Now, let us type this command.
This command will display the accuracy of the model. It retrieves it from the confusion_matrix list created earlier. Select and run the command. |
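Other evaluation metrics can be retrieved from the same list; a small sketch, assuming caret's confusionMatrix() output created above:

confusion_matrix$byClass[c("Sensitivity", "Specificity")]   # per-class metrics
confusion_matrix$table                                      # raw confusion matrix counts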
Highlight 0.8407 | We can see that our model has 84 percent accuracy.
Note that we can achieve higher accuracy by not manually limiting the maxdepth parameter. |
confusion_table <- data.frame(confusion_matrix$table) | In the Source window, type this command.
This will create a data frame of the confusion matrix table. Select and run the command. Click on confusion_table in the Environment tab. Notice that it displays the number of correct and incorrect predictions for each class. |
Cursor in the source window. | In the Source window, type these commands to plot the confusion matrix. |
plot_confusion_matrix(confusion_table,
                      target_col = "Reference",
                      prediction_col = "Prediction",
                      counts_col = "Freq",
                      palette = list("low" = "pink1", "high" = "green1"),
                      add_normalized = FALSE,
                      add_row_percentages = FALSE,
                      add_col_percentages = FALSE) |
We use the plot_confusion_matrix() function from the cvms package.
We will use the created data frame confusion_table.
target_col is the column in the data frame with the reference labels.
prediction_col is the column in the data frame with the predicted labels.
counts_col is the column in the data frame with the number of correct and incorrect labels.
The palette will plot the correct and incorrect predictions in different colours.
Select and run the commands.
The output is seen in the plot window. |
Highlight output in plot window | 24 Besni samples have been incorrectly classified.
19 Kecimen samples have been incorrectly classified. Overall, the model has misclassified only 43 samples. |
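These counts can also be verified programmatically; a small sketch, assuming the objects created above:

sum(predictions != test_data$class)                       # 43 misclassified samples in total
table(Predicted = predictions, Actual = test_data$class)  # per-class breakdown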
Let us plot our model decision boundary. | |
[RStudio]
grid <- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 200),
                    ecc = seq(min(data$ecc), max(data$ecc), length = 200))
grid$class <- predict(bagging_model, newdata = grid, type = "class")
grid$classnum <- as.numeric(grid$class) |
In the Source window, type these commands. |
Highlight
grid <- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 200),
                    ecc = seq(min(data$ecc), max(data$ecc), length = 200))
# Predict classes
grid$class <- predict(bagging_model, newdata = grid, type = "class")
grid$classnum <- as.numeric(grid$class) |
This code first creates a grid of points spanning the feature space.
The Bagging model then predicts the class of each point in this grid. Select and run the commands. |
[RStudio]
ggplot() +
  geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +
  geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class)) +
  geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum), colour = "black", linewidth = 0.7) +
  scale_fill_manual(values = c("#ffff46", "#FF46e9")) +
  scale_color_manual(values = c("red", "blue")) +
  labs(x = "MinorAL", y = "ecc", title = "Decision Boundary of Bootstrap Bagging") +
  theme_minimal() |
In the Source window, type these commands. |
Highlight
ggplot() +
  geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +
  geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class)) +
  geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum), colour = "black", linewidth = 0.7) +
  scale_fill_manual(values = c("#ffff46", "#FF46e9")) +
  scale_color_manual(values = c("red", "blue")) +
  labs(x = "MinorAL", y = "ecc", title = "Decision Boundary of Bootstrap Bagging") +
  theme_minimal() |
We plot the decision boundary using predicted classes of the grid.
This command plots the decision boundary and the distribution of the data points, with colours indicating the predicted classes. Select and run the command. |
Drag boundaries. | Drag boundaries to see the plot window clearly. |
Highlight output in plot window | We observe that the model has separated most of the data points clearly.
Note that after applying Bagging to the decision tree classifier, the decision boundary looks similar to that of a single decision tree, but it is more robust and complex, as the comparison sketch below shows. |
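To see this effect concretely, one can compare the bagged ensemble against a single decision tree trained with the same hyperparameters; a minimal sketch, assuming the objects from this tutorial:

single_tree <- rpart(class ~ ., data = train_data,
                     control = rpart.control(cp = 0.00001, xval = 10, maxdepth = 2))
tree_pred <- predict(single_tree, newdata = test_data, type = "class")
mean(tree_pred == test_data$class)    # accuracy of the single tree
mean(predictions == test_data$class)  # accuracy of the bagged ensemble, about 0.84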
Show slide
Limitations of Bagging
|
These are the limitations of Bagging:
Bagging reduces the interpretability of the model compared to a single decision tree.
It is computationally more expensive, since many base models have to be trained.
It may not improve performance when the base classifier is already stable. |
Only Narration | With this we come to the end of this tutorial.
Let us summarize. |
Show Slide
Summary |
In this tutorial we have learnt about:
Bagging (Bootstrap aggregation).
Implementation of Bagging of a Decision Tree classifier on the Raisin dataset using R.
|
Now we will suggest the assignment for this Spoken Tutorial. | |
Show Slide
Assignment |
|
Show slide
About the Spoken Tutorial Project |
The video at the following link summarizes the Spoken Tutorial project.
Please download and watch it. |
Show slide
Spoken Tutorial Workshops |
We conduct workshops using Spoken Tutorials and give certificates.
Please contact us. |
Show Slide
Spoken Tutorial Forum to answer questions
Do you have questions in THIS Spoken Tutorial?
Choose the minute and second where you have the question.
Explain your question briefly.
Someone from the FOSSEE team will answer them.
Please visit this site. |
Please post your timed queries in this forum. |
Show Slide
Forum to answer questions |
Do you have any general/technical questions?
Please visit the forum given in the link. |
Show Slide
Textbook Companion |
The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.
We give certificates to those who do this. For more details, please visit these sites. |
Show Slide
Acknowledgment |
The Spoken Tutorial Project is funded by the Ministry of Education, Government of India. |
Show Slide
Thank You |
This tutorial is contributed by Debatosh Chakraborty and Yate Asseke Ronald O from IIT Bombay.
Thank you for joining. |