Machine-Learning-using-R/C3/Bagging-in-R/English


Title of the script: Bagging Algorithm for Decision Tree using R

Author: Debatosh Chakraborty and Yate Asseke Ronald O.

Keywords: R, RStudio, Bagging Algorithm, machine learning, supervised, unsupervised, dataset, video tutorial.

Visual Cue | Narration
Show slide

Opening Slide

Welcome to this Spoken Tutorial on Bagging in R.
Show slide

Learning Objectives

In this tutorial, we will learn about:
  • Bagging.
  • Assumptions for Bagging.
  • Advantages of Bagging.
  • Implementation of Bagging using Decision Tree in R.
  • Model Evaluation.
  • Limitations of Bagging.
Show slide

System Specifications

This tutorial is recorded using,
  • Windows 11
  • R version 4.3.0
  • RStudio version 2023.06.1

It is recommended to install R version 4.2.0 or higher.

Show slide

Prerequisites

https://spoken-tutorial.org

To follow this tutorial, the learner should know:

  • Basic programming in R.
  • Basics of Machine Learning.

If not, please access the relevant tutorials on this website.

Show slide

Bootstrap aggregation (Bagging)

Now let us learn about Bootstrap aggregation or Bagging.
  • A classification model fitted across several training data subsets should ideally produce consistent decision boundaries.
  • Large variation in the decision boundaries indicates higher variability of the classification model.
  • Bagging is a commonly used ensemble technique to reduce this variation.
  • In Bagging, random subsets of the training data are repeatedly chosen to construct multiple classifiers.
  • The Bootstrap classifiers constructed from chosen subsets are then aggregated.
  • For bagging of the decision tree classifier, the aggregation is done by a majority vote of the classes predicted by the bootstrap trees, as illustrated in the sketch below.
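To make the idea concrete, here is a minimal sketch of bagging by hand in base R. It assumes the train_data and test_data objects created later in this tutorial; n_trees, boot_idx, votes and majority are illustrative names only.

library(rpart)

# Minimal sketch: bagging by hand via bootstrap samples and a majority vote
set.seed(1)
n_trees <- 25
trees <- vector("list", n_trees)

for (i in 1:n_trees) {
  # Draw a bootstrap sample: rows sampled with replacement
  boot_idx <- sample(nrow(train_data), replace = TRUE)
  trees[[i]] <- rpart(class ~ ., data = train_data[boot_idx, ])
}

# Aggregate: each bootstrap tree votes, the majority class wins
votes <- sapply(trees, function(t) as.character(predict(t, test_data, type = "class")))
majority <- apply(votes, 1, function(v) names(which.max(table(v))))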
Show slide

Assumptions of Bagging

  • Each observation is independent.
  • The assumptions of the chosen classifier are satisfied.
Primarily, the assumptions of the chosen classifier must be satisfied for bagging.
Show slide

Advantages of Bagging

Advantages of Bagging include:
  • Bagging reduces the variation of the chosen model.
  • Bagging improves the performance (accuracy) of the decision tree classifier in general.
Show slide

Implementation of Bagging

Now we will perform Bagging of the Decision Tree classifier on the Raisin dataset with two chosen variables.
Show slide

Download Files

For this tutorial, I will use a script file Bagging-Decision-Tree.R.

Raisin dataset 'Raisin.xlsx'.

Please download these files from the Code files link of this tutorial.

Make a copy and then use them while practicing.

[Computer screen]

Highlight Bagging-Decision-Tree.R and the folder

I have downloaded and moved these files to the Bagging folder.

The Bagging folder is in the MLProject folder.

I have also set the Bagging folder as my working directory.

Let us switch to RStudio.
Double click Bagging-Decision-Tree.R in RStudio

Point to Bagging-Decision-Tree.R in RStudio.

Open the script Bagging-Decision-Tree.R in RStudio.

Script Bagging-Decision-Tree.R opens in RStudio.

[RStudio]

library(readxl)

library(ipred)

library(caret)

library(cvms)

library(rpart)

Select and run these commands to import the necessary packages.
[RStudio]

Highlight library(ipred)

Highlight library(rpart)

Highlight library(cvms)

The ipred library contains the bagging() function.

The rpart library will be used to implement the decision tree model for bagging.

We will use the cvms package for plotting the confusion matrix.

As I have already installed these packages, I have imported them directly.

Highlight

data <- read_xlsx("Raisin.xlsx")

data <- data[c("minorAL", "ecc", "class")]

data$class <- factor(data$class)

Run these commands to import the raisin dataset and prepare it for model building.

Click on data in the Environment tab to load it in the Source Window

[RStudio]

set.seed(1)

index_split <- sample(1:nrow(data), size = 0.7*nrow(data), replace = FALSE)

train_data <- data[index_split, ]

test_data <- data[-c(index_split), ]

Type these commands in the Source window to perform the train-test split.
Highlight set.seed(1)

Highlight sample(1:nrow(data), size = 0.7*nrow(data), replace = FALSE)

Highlight replace=FALSE

Select the commands and click the Run button.

Select and run the commands.

The data sets will be shown in the Environment tab.

Let us now create our Bagging model.
[RStudio]

bagging_model <- bagging(class ~ ., data = train_data, coob = TRUE, nbagg = 200, control = rpart.control(cp = 0.00001, xval = 10, maxdepth = 2))

In the source window type these commands.
Highlight

bagging_model <- bagging(class ~ ., data = train_data, coob = TRUE, nbagg = 200, control = rpart.control(cp = 0.00001, xval = 10, maxdepth = 2))

bagging(): The bagging() function is used to create a bagging ensemble model.

class ~ .: This formula indicates that the model should predict the 'class' variable.

It uses all other variables in the train_data as predictors.

data: The dataset used for building the model is specified as train_data.

coob: When coob is TRUE, the out-of-bag (OOB) error estimate is computed.

The OOB error measures the error of the bootstrap classifiers on the observations left out of each bootstrap sample.

nbagg: Sets the number of bootstrap replicates for bagging. It is set to 200 in this case.

The rpart.control argument allows us to set the hyperparameters of the base classifier.

cp denotes the complexity parameter which is set to 0.00001.

xval is the number of cross-validations, which is set to 10.

maxdepth is the maximum depth of any node of the final tree. It is limited to 2 in this case.

Select and run the command to train the model.

print(bagging_model) In the Source window type and run this command.
Point to the console window. The output is shown in the console window.

Drag boundary to see the console window clearly.

Highlight

Out-of-bag estimate of misclassification error: 0.1746

We can confirm that our model is trained successfully.

The training misclassification error of the model is 0.1746.
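If the numeric value is needed for later computation, it is also stored on the fitted model object; a quick check, assuming ipred keeps the OOB estimate in the err component of its classbagg objects when coob = TRUE:

# OOB misclassification error stored on the model (requires coob = TRUE)
bagging_model$err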

[RStudio]

predictions <- predict(bagging_model, newdata = test_data, type = "class")

Let us now use our model for prediction.

In the source window type and run the command

Highlight

predictions <- predict(bagging_model, newdata = test_data, type = "class")

Click on Save and Run buttons.

This command stores the prediction of the model bagging_model on test data in a variable predictions.
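Before the formal evaluation, a quick sanity check is to cross-tabulate the predictions against the actual classes, using only base R:

# Quick check: predicted vs. actual classes on the test set
table(Predicted = predictions, Actual = test_data$class)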
Let's now evaluate our model.
[RStudio]

confusion_matrix <- confusionMatrix(predictions, test_data$class)

Type this command in the Source window
Highlight

confusion_matrix <- confusionMatrix(predictions, test_data$class)

This command will create a confusion matrix list.

The list will contain the different evaluation metrics.

Select and run the command

[RStudio]

confusion_matrix$overall["Accuracy"]

Now, let us type this command.

This command will display the accuracy of the model.

It retrieves the value from the confusion matrix list created earlier.

Select and run the command

Highlight 0.8407 We can see that our model has 84 percent accuracy.

Note that we may achieve higher accuracy by not manually restricting the maxdepth parameter, as sketched below.
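A minimal sketch of that variant, assuming the same data and packages as above (deep_model is an illustrative name; the exact accuracy will vary from run to run):

# Sketch: bagging with unrestricted tree depth
deep_model <- bagging(class ~ ., data = train_data, coob = TRUE, nbagg = 200,
                      control = rpart.control(cp = 0.00001, xval = 10))
print(deep_model)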

confusion_table <- data.frame(confusion_matrix$table) In the source window, type this command.

This will create a data-frame of the confusion matrix table.

Select and run the command.

Click on confusion_table in the Environment tab.

Notice that it displays the number of correct and incorrect predictions for each class.

Cursor in the source window. In the source window, type these commands to plot the confusion matrix
plot_confusion_matrix(confusion_table,

target_col = "Reference",

prediction_col = "Prediction",

counts_col = "Freq",

palette = list("low" = "pink1", "high" = "green1"),

add_normalized = FALSE,

add_row_percentages = FALSE,

add_col_percentages = FALSE)

We use the plot_confusion_matrix function from the cvms package.

We will use the created data frame confusion_table.

target_col is the column in the dataframe with the reference labels.

prediction_col is the column in the dataframe with the predicted labels.

counts_col is the column in the dataframe with the number of correct and incorrect labels.

The palette will plot the correct and incorrect predictions in different colours.

Select and run the commands

The output is seen in the plot window

Highlight output in plot window 24 Besni samples have been incorrectly classified.

19 Kecimen samples have been incorrectly classified.

Overall, the model has misclassified only 43 samples.
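This total can also be computed directly from the confusion table, using a one-line check in base R:

# Total misclassified samples: sum the off-diagonal frequencies
sum(confusion_table$Freq[confusion_table$Prediction != confusion_table$Reference])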

Let us plot our model decision boundary.
[RStudio]

grid <- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 200),

ecc = seq(min(data$ecc), max(data$ecc), length = 200))

grid$class <- predict(bagging_model, newdata = grid, type = "class")

grid$classnum <- as.numeric(grid$class)

In the Source window type these commands
Highlight

grid <- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 200),

ecc = seq(min(data$ecc), max(data$ecc), length = 200))

# Predict classes

grid$class <- predict(bagging_model, newdata = grid, type = "class")

grid$classnum <- as.numeric(grid$class)

This code first creates a grid of points spanning the feature space.

The Bagging model then predicts the class of each point in this grid.

Select and run the commands

[RStudio]

ggplot() +

geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +

geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class)) +

geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),

colour = "black", linewidth = 0.7) +

scale_fill_manual(values = c("#ffff46", "#FF46e9")) +

scale_color_manual(values = c("red", "blue")) +

labs(x = "MinorAL", y = "ecc", title = "Decision Boundary of Bootstrap Bagging") +

theme_minimal()

In the Source window, type these commands.
Highlight

ggplot() +

geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +

geom_point(data = train_data, aes(x = minorAL, y = ecc, color = class)) +

geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),

colour = "black", linewidth = 0.7) +

scale_fill_manual(values = c("#ffff46", "#FF46e9")) +

scale_color_manual(values = c("red", "blue")) +

labs(x = "MinorAL", y = "ecc", title = "Decision Boundary of Bootstrap Bagging") +

theme_minimal()

We plot the decision boundary using the predicted classes of the grid.

These commands draw the decision boundary and the distribution of data points, with colours indicating the predicted classes.

Select and run the command.

Drag boundaries. Drag boundaries to see the plot window clearly.
Highlight output in plot window We observe that the model has separated most of the data points clearly.

Note that after applying bagging to the decision tree classifier, the decision boundary looks similar to that of a single decision tree.

However, it is more robust, though more complex.

Show slide

Limitations of Bagging
  • Bagging is hard to interpret.
  • Requires more computational time.
  • Bagging does not reduce model bias.
These are the limitations of Bagging.
Only Narration With this we come to the end of this tutorial.

Let us summarize.

Show Slide

Summary

In this tutorial we have learnt about:
  • Bagging
  • Assumptions for Bagging
  • Advantages of Bagging
  • Implementation of Bagging using Decision Tree in R
  • Model Evaluation
  • Limitations of Bagging
Now we will suggest the assignment for this Spoken Tutorial.
Show Slide

Assignment

  • Apply Bagging using Decision Tree on PimaIndiansDiabetes dataset
  • Install the pdp package and import the dataset using the data(pima) command
  • Visualize the decision boundary and measure the accuracy of the model (a starter sketch follows this list)
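A minimal starter sketch for this assignment, assuming the pdp package provides the pima dataset via data(pima) with diabetes as the class column (pima_model is an illustrative name):

library(pdp)    # provides the 'pima' dataset
library(ipred)

data(pima)                 # load the Pima Indians Diabetes data
pima <- na.omit(pima)      # drop rows with missing values

pima_model <- bagging(diabetes ~ ., data = pima, coob = TRUE, nbagg = 200)
print(pima_model)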
Show slide

About the Spoken Tutorial Project

The video at the following link summarizes the Spoken Tutorial project.

Please download and watch it.

Show slide

Spoken Tutorial Workshops

We conduct workshops using Spoken Tutorials and give certificates.

Please contact us.

Show Slide

Spoken Tutorial Forum to answer questions

Do you have questions in THIS Spoken Tutorial?

Choose the minute and second where you have the question.

Explain your question briefly.

Someone from the FOSSEE team will answer them.

Please visit this site.

Please post your timed queries in this forum.
Show Slide

Forum to answer questions

Do you have any general/technical questions?

Please visit the forum given in the link.

Show Slide

Textbook Companion

The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.

We give certificates to those who do this.

For more details, please visit these sites.

Show Slide

Acknowledgment

The Spoken Tutorial Project is funded by the Ministry of Education, Government of India.
Show Slide

Thank You

This tutorial is contributed by Debatosh Chakraborty and Yate Asseke Ronald O from IIT Bombay.

Thank you for joining.
