Machine-Learning-using-R/C3/Bagging-in-R/English


Title of the script: Bagging Algorithm for Decision Tree using R

Author: Debatosh Chakraborty and Yate Asseke Ronald O.

Keywords: R, RStudio, Bagging Algorithm, machine learning, supervised, unsupervised, dataset, video tutorial.

Visual Cue | Narration
Show slide

Opening Slide

Welcome to this Spoken Tutorial on Bagging in R.
Show slide

Learning Objectives

In this tutorial, we will learn about:
  • Bagging.
  • Assumptions for Bagging.
  • Advantages of Bagging.
  • Implementation of Bagging using Decision Tree in R.
  • Model Evaluation.
  • Limitations of Bagging.
Show slide

System Specifications

This tutorial is recorded using,
  • Windows 11
  • R version 4.3.0
  • RStudio version 2023.06.1

It is recommended to install R version 4.2.0 or higher.

Show slide

Prerequisites

https://spoken-tutorial.org

To follow this tutorial, the learner should know:

  • Basic programming in R.
  • Basics of Machine Learning.

If not, please access the relevant tutorials on this website.

Show slide

Bootstrap aggregation (Bagging)

Now let us learn about Bootstrap aggregation or Bagging.
  • A classification model fitted across several training data subsets should ideally produce consistent decision boundaries.
  • Large variation in the decision boundaries indicates higher variability of the classification model.
  • Bagging is a commonly used ensemble technique to reduce this variation.
  • In Bagging, random subsets of the training data are repeatedly chosen to construct multiple classifiers.
  • The Bootstrap classifiers constructed from chosen subsets are then aggregated.
  • For bagging of the decision tree classifier, the aggregation is done by a majority vote of the classes predicted by the bootstrap trees, as illustrated in the sketch below.
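To make the idea concrete, here is a minimal sketch of bagging by hand in base R. It assumes the train_data and test_data objects created later in this tutorial; n_trees, boot_idx, votes and majority are illustrative names only.

library(rpart)

# Minimal sketch: bagging by hand via bootstrap samples and a majority vote
set.seed(1)
n_trees <- 25
trees <- vector("list", n_trees)

for (i in 1:n_trees) {
  # Draw a bootstrap sample: rows sampled with replacement
  boot_idx <- sample(nrow(train_data), replace = TRUE)
  trees[[i]] <- rpart(class ~ ., data = train_data[boot_idx, ])
}

# Aggregate: each bootstrap tree votes, the majority class wins
votes <- sapply(trees, function(t) as.character(predict(t, test_data, type = "class")))
majority <- apply(votes, 1, function(v) names(which.max(table(v))))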
Show slide

Assumptions of Bagging

  • Each observation is independent.
  • The assumptions of the chosen classifier are satisfied.
Primarily, the assumptions of the chosen classifier must be satisfied for bagging.
Show slide

Advantages of Bagging

Advantages of Bagging include:
  • Bagging reduces the variation of the chosen model.
  • Bagging improves the performance (accuracy) of the decision tree classifier in general.
Show slide

Implementation of Bagging

Now we will perform Bagging of the Decision Tree classifier on the Raisin dataset with two chosen variables.
Show slide

Download Files

For this tutorial, I will use a script file Bagging-Decision-Tree.R.

Raisin dataset 'Raisin.xlsx'.

Please download these files from the Code files link of this tutorial.

Make a copy and then use them while practicing.

[Computer screen]

Highlight Bagging-Decision-Tree.R and the folder

I have downloaded and moved these files to the Bagging folder.

The Bagging folder is in the MLProject folder.

I have also set the Bagging folder as my working directory.

Let us switch to RStudio.
Double click Bagging-Decision-Tree.R in RStudio

Point to Bagging-Decision-Tree.R in RStudio.

Open the script Bagging-Decision-Tree.R in RStudio.

Script Bagging-Decision-Tree.R opens in RStudio.

[RStudio]

library(readxl)

library(ipred)

library(caret)

library(cvms)

library(rpart)

Select and run these commands to import the necessary packages.
[RStudio]

Highlight library(ipred)

Highlight library(rpart)

Highlight library(cvms)

The ipred library contains the bagging() function.

The rpart library will be used to implement the decision tree model for bagging.

We will use the cvms package for plotting the confusion matrix.

As I have already installed these packages, I have imported them directly.

Highlight

data <- read_xlsx("Raisin.xlsx")

data <- data[c("minorAL", "ecc", "class")]

data$class <- factor(data$class)

Run these commands to import the raisin dataset and prepare it for model building.

Click on data in the Environment tab to load it in the Source Window

[RStudio]

set.seed(1)

index_split <- sample(1:nrow(data), size = 0.7*nrow(data), replace = FALSE)

train_data <- data[index_split, ]

test_data <- data[-c(index_split), ]

Type these commands in the Source window to perform the train-test split.
Highlight set.seed(1)

Highlight sample(1:nrow(data), size = 0.7*nrow(data), replace = FALSE)

Highlight replace=FALSE

Select the commands and click the Run button.

Select and run the commands.

The data sets will be shown in the Environment tab.

Let us now create our Bagging model.
[RStudio]

bagging_model <- bagging(class ~ ., data = train_data, coob = TRUE, nbagg = 200, control = rpart.control(cp = 0.00001, xval = 10, maxdepth = 2))

In the source window type these commands.
Highlight

bagging_model <- bagging(class ~ ., data = train_data, coob = TRUE, nbagg = 200, control = rpart.control(cp = 0.00001, xval = 10, maxdepth = 2))

bagging(): The bagging() function is used to create a bagging ensemble model.

class ~ .: This formula indicates that the model should predict the 'class' variable.

It uses all other variables in the train_data as predictors.

data: The dataset used for building the model is specified as train_data.

coob: When coob is TRUE, the out-of-bag (OOB) error estimate is computed.

The OOB error measures the error of the bootstrap classifiers on the observations left out of each bootstrap sample.

nbagg: Sets the number of bootstrap replicates for bagging. It is set to 200 in this case.

The rpart.control argument allows us to set the hyperparameters of the base classifier.

cp denotes the complexity parameter which is set to 0.00001.

xval is the number of cross-validations, which is set to 10.

maxdepth is the maximum depth of any node of the final tree. It is limited to 2 in this case.

Select and run the command to train the model.

print(bagging_model) In the Source window type and run this command.
Point to the console window. The output is shown in the console window.

Drag boundary to see the console window clearly.

Highlight

Out-of-bag estimate of misclassification error: 0.1746

We can confirm that our model is trained successfully.

The training misclassification error of the model is 0.1746.
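If the numeric value is needed for later computation, it is also stored on the fitted model object; a quick check, assuming ipred keeps the OOB estimate in the err component of its classbagg objects when coob = TRUE:

# OOB misclassification error stored on the model (requires coob = TRUE)
bagging_model$err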

[RStudio]

predictions <- predict(bagging_model, newdata = test_data, type = "class")

Let us now use our model for prediction.

In the source window type and run the command

Highlight

predictions <- predict(bagging_model, newdata = test_data, type = "class")

Click on Save and Run buttons.

This command stores the prediction of the model bagging_model on test data in a variable predictions.
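Before the formal evaluation, a quick sanity check is to cross-tabulate the predictions against the actual classes, using only base R:

# Quick check: predicted vs. actual classes on the test set
table(Predicted = predictions, Actual = test_data$class)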
Let's now evaluate our model.
[RStudio]

confusion_matrix <- confusionMatrix(predictions, test_data$class)

Type this command in the Source window
Highlight

confusion_matrix <- confusionMatrix(predictions, test_data$class)

This command will create a confusion matrix list.

The list will contain the different evaluation metrics.

Select and run the command

[RStudio]

confusion_matrix$overall["Accuracy"]

Now, let us type this command.

This command will display the accuracy of the model.

It retrieves the value from the confusion matrix list created earlier.

Select and run the command

Highlight 0.8407 We can see that our model has 84 percent accuracy.

Note that we may achieve higher accuracy by not manually restricting the maxdepth parameter, as sketched below.
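A minimal sketch of that variant, assuming the same data and packages as above (deep_model is an illustrative name; the exact accuracy will vary from run to run):

# Sketch: bagging with unrestricted tree depth
deep_model <- bagging(class ~ ., data = train_data, coob = TRUE, nbagg = 200,
                      control = rpart.control(cp = 0.00001, xval = 10))
print(deep_model)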

confusion_table <- data.frame(confusion_matrix$table) In the source window, type this command.

This will create a data-frame of the confusion matrix table.

Select and run the command.

Click on confusion_table in the Environment tab.

Notice that it displays the number of correct and incorrect predictions for each class.

Cursor in the source window. In the source window, type these commands to plot the confusion matrix
plot_confusion_matrix(confusion_table,

target_col = "Reference",

prediction_col = "Prediction",

counts_col = "Freq",

palette = list("low" = "pink1", "high" = "green1"),

add_normalized = FALSE,

add_row_percentages = FALSE,

add_col_percentages = FALSE)

We use the plot_confusion_matrix function from the cvms package.

We will use the created data frame confusion_table.

target_col is the column in the dataframe with the reference labels.

prediction_col is the column in the dataframe with the predicted labels.

counts_col is the column in the dataframe with the number of correct and incorrect labels.

The palette will plot the correct and incorrect predictions in different colours.

Select and run the commands

The output is seen in the plot window

Highlight output in plot window 24 Besni samples have been incorrectly classified.

19 Kecimen samples have been incorrectly classified.

Overall, the model has misclassified only 43 samples.
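This total can also be computed directly from the confusion table, using a one-line check in base R:

# Total misclassified samples: sum the off-diagonal frequencies
sum(confusion_table$Freq[confusion_table$Prediction != confusion_table$Reference])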

Let us plot our model decision boundary.
[RStudio]

grid <- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 200),

ecc = seq(min(data$ecc), max(data$ecc), length = 200))

grid$class <- predict(bagging_model, newdata = grid, type = "class")

grid$classnum <- as.numeric(grid$class)

In the Source window type these commands
Highlight

grid <- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 200),

ecc = seq(min(data$ecc), max(data$ecc), length = 200))

# Predict classes

grid$class <- predict(bagging_model, newdata = grid, type = "class")

grid$classnum <- as.numeric(grid$class)

This code first creates a grid of points spanning the feature space.

The Bagging model then predicts the class of each point in this grid.

Select and run the commands

[RStudio]

ggplot() +

geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +

geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class)) +

geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),

colour = "black", linewidth = 0.7) +

scale_fill_manual(values = c("#ffff46", "#FF46e9")) +

scale_color_manual(values = c("red", "blue")) +

labs(x = "MinorAL", y = "ecc", title = "Decision Boundary of Bootstrap Bagging") +

theme_minimal()

In the Source window, type these commands.
Highlight

ggplot() +

geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +

geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class)) +

geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),

colour = "black", linewidth = 0.7) +

scale_fill_manual(values = c("#ffff46", "#FF46e9")) +

scale_color_manual(values = c("red", "blue")) +

labs(x = "MinorAL", y = "ecc", title = "Decision Boundary of Bootstrap Bagging") +

theme_minimal()

We plot the decision boundary using the predicted classes of the grid.

These commands draw the decision boundary and the distribution of data points, with colours indicating the predicted classes.

Select and run the command.

Drag boundaries. Drag boundaries to see the plot window clearly.
Highlight output in plot window We observe that the model has separated most of the data points clearly.

Note that after applying bagging to the decision tree classifier, the decision boundary looks similar to that of a single decision tree.

However, it is more robust, though more complex.

Show slide

Limitations of Bagging
  • Bagging is hard to interpret.
  • Requires more computational time.
  • Bagging does not reduce model bias.
These are the limitations of Bagging.
Only Narration With this we come to the end of this tutorial.

Let us summarize.

Show Slide

Summary

In this tutorial we have learnt about:
  • Bagging
  • Assumptions for Bagging
  • Advantages of Bagging
  • Implementation of Bagging using Decision Tree in R
  • Model Evaluation
  • Limitations of Bagging
Now we will suggest the assignment for this Spoken Tutorial.
Show Slide

Assignment

  • Apply Bagging using Decision Tree on PimaIndiansDiabetes dataset
  • Install the pdp package and import the dataset using the data(pima) command
  • Visualize the decision boundary and measure the accuracy of the model (a starter sketch follows this list)
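A minimal starter sketch for this assignment, assuming the pdp package provides the pima dataset via data(pima) with diabetes as the class column (pima_model is an illustrative name):

library(pdp)    # provides the 'pima' dataset
library(ipred)

data(pima)                 # load the Pima Indians Diabetes data
pima <- na.omit(pima)      # drop rows with missing values

pima_model <- bagging(diabetes ~ ., data = pima, coob = TRUE, nbagg = 200)
print(pima_model)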
Show slide

About the Spoken Tutorial Project

The video at the following link summarizes the Spoken Tutorial project.

Please download and watch it.

Show slide

Spoken Tutorial Workshops

We conduct workshops using Spoken Tutorials and give certificates.

Please contact us.

Show Slide

Spoken Tutorial Forum to answer questions

Do you have questions in THIS Spoken Tutorial?

Choose the minute and second where you have the question.

Explain your question briefly.

Someone from the FOSSEE team will answer them.

Please visit this site.

Please post your timed queries in this forum.
Show Slide

Forum to answer questions

Do you have any general/technical questions?

Please visit the forum given in the link.

Show Slide

Textbook Companion

The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.

We give certificates to those who do this.

For more details, please visit these sites.

Show Slide

Acknowledgment

The Spoken Tutorial Project is funded by the Ministry of Education, Government of India.
Show Slide

Thank You

This tutorial is contributed by Debatosh Chakraborty and Yate Asseke Ronald O from IIT Bombay.

Thank you for joining.
