Machine-Learning-using-R/C2/Decision-Tree-in-R/English


Title of the script: Decision Tree in R

Author: Debatosh Chakraborty and Yate Asseke Ronald Olivera

Keywords: R, RStudio, machine learning, supervised, unsupervised, classification, regression, decision tree, video tutorial.

Visual Cue Narration
Show slide

Opening Slide

Welcome to this Spoken Tutorial on Decision Tree in R.
Show slide

Learning Objectives

In this tutorial, we will learn about:
  • Decision Tree
  • Assumptions for Decision Tree
  • Advantages of Decision Tree
  • Implementation of Decision Tree in R.
  • Plotting the decision tree model
  • Evaluation of the model.
  • Visualizing the model decision boundary
  • Limitations of Decision Tree.
Show slide

System Specifications

This tutorial is recorded using,
  • Windows 11
  • R version 4.3.0
  • RStudio version 2023.06.1

It is recommended to install R version 4.2.0 or higher.

Show slide

Prerequisites

https://spoken-tutorial.org

To follow this tutorial, the learner should know:
  • Basic programming in R
  • Basics of Machine Learning

If not, please access the relevant tutorials on this website.

Show slide

What is a Decision Tree?

Let us see what a decision tree is.
  • It uses a binary tree to split the feature space into several sub-regions
  • The nodes of the tree are the locations at which the feature space splits
  • Misclassification error, Gini index, and entropy aid in identifying ideal splits (a small worked sketch follows this list).
  • The decision boundaries in the Decision Tree model are nonlinear
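For intuition, here is a small worked sketch (not part of the original script) showing how these impurity measures can be computed in R for a hypothetical node containing 40 Kecimen and 10 Besni observations:

# Hypothetical node with 40 Kecimen and 10 Besni observations
p <- c(40, 10) / 50                  # class proportions in the node
gini     <- 1 - sum(p^2)             # Gini index: 0 for a pure node
entropy  <- -sum(p * log2(p))        # entropy: 0 for a pure node
misclass <- 1 - max(p)               # misclassification error
c(gini = gini, entropy = entropy, misclass = misclass)

A split is considered good when it produces child nodes whose weighted impurity is lower than that of the parent node.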
Show slide

Assumptions of Decision Tree

  • The root node of the tree consists of the entire training set.
  • The model does not assume any specific distribution of features.
  • Each observation is independent.
The assumptions of the decision tree model are as follows.
Show slide

Advantages of Decision Tree

The advantages of decision tree model include:
  • Feature variables are not required to be continuous
  • Decision trees are intuitive and easy to visualize
  • When the response is continuous, the decision tree methodology can be easily implemented as a regression tree

The regression tree method will be discussed in a separate tutorial.

Show Slide

Implementation of Decision Tree

Now we will construct a Decision Tree on the Raisin dataset with two chosen variables.
Show slide

Download Files

For this tutorial, we will use

A script file DecisionTree.R.

Raisin Dataset 'raisin.xlsx'

Please download these files from the Code files link of this tutorial.

Make a copy and then use them while practicing.

[Computer screen]

Highlight DecisionTree.R

I have downloaded and moved these files to the Decision Tree folder.

We will create a Decision Tree classifier model on the raisin dataset.

Let us switch to RStudio.
Double-click DecisionTree.R on RStudio

Point to DecisionTree.R in RStudio.

Open the script DecisionTree.R in RStudio.

Script DecisionTree.R opens in RStudio.

[RStudio]

Highlight library(readxl)

library(ggplot2)

library(caret)

Highlight library(rpart)

Highlight library(rpart.plot)

Highlight library(cvms)

# install.packages("package_name")

Point to the command.

Select and run these commands to import the packages.

These packages will be used to aid the building and evaluation of the classifier.

We will use the rpart package to create the decision tree classifier.

We will use the rpart.plot package for plotting the decision tree.

We will use the cvms package for plotting the confusion matrix.

Please ensure that all the packages are installed correctly.

As I have already installed the packages, I have imported them directly.
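For reference, a consolidated sketch of how the import section of DecisionTree.R might look (the downloaded script may differ slightly):

library(readxl)      # read the Raisin dataset from the Excel file
library(ggplot2)     # plot the decision boundary
library(caret)       # createDataPartition() and confusionMatrix()
library(rpart)       # fit the decision tree classifier
library(rpart.plot)  # plot the fitted tree
library(cvms)        # plot_confusion_matrix()
# install.packages("package_name")   # run first for any package that is not yet installed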

[RStudio]

Highlight

data <- read_xlsx("Raisin.xlsx")

Highlight

data <- data[c("minorAL",”ecc”,"class")]

Highlight

data$class <- factor(data$class)

Select the commands and click the Run button

These commands will load the Raisin dataset.

They will also prepare the dataset for model building.

Select and run the commands.

Click on data in the Environment tab to load the dataset.

Point to the Source window.

Click on data in the Environment tab to load the modified data in the Source window.
[RStudio]

set.seed(1)

trainIndex <- createDataPartition(data$class, p = 0.7, list = FALSE)

train <- data[trainIndex, ]

test <- data[-trainIndex, ]

In the Source window type these commands.
Highlight

set.seed(1)

Highlight

trainIndex <- createDataPartition(data$class, p = 0.7, list = FALSE)

Highlight

train <- data[trainIndex, ]

Highlight

test <- data[-trainIndex, ]

This will split our dataset into training and testing data.

Select the commands and run them.

Let us now create our Decision Tree model.
decision_model <- rpart(class ~ ., data = train, method = 'class',

control = rpart.control(cp = .00001, xval = 10, maxdepth = 2),

parms = list(split = "gini"))

summary(decision_model)

In the source window, type these commands.
[RStudio]

Highlight formula = class ~ .

Highlight data=train

Highlight method = 'class'

Highlight parms = list(split = "gini")

Highlight maxdepth = 2

Highlight xval = 10

Highlight cp = .00001

Click the Run button.

Point to Environment tab.

This is the formula we use for this model.

class is taken as the dependent variable.

The remaining attributes are independent variables.

data = train uses the training partition of the dataset to train our model.

This tells our model that we are doing a classification task.

The Gini index will be used to determine the best split at each node.

This determines the maximum depth of the tree.

This is the number of cross-validations performed while growing the tree.

This is the complexity parameter: a split must improve the fit by at least this factor to be attempted.

Select and run the commands.

The model data is shown in the Environment tab.

Highlight CP

Highlight Node Information

Highlight n=630

Highlight class counts

Highlight probabilities

Highlight Predicted class

Highlight Primary splits

The summary of the model created is shown in the console window

Drag boundary to see the console window clearly

CP displays the complexity table for the trees created in the final model.

This displays the information about each node created.

This includes,

Total observations used to create the node.

The distribution of observations for each class in the node.

The probability of each class.

The class with the highest probability is the predicted class for the node.

This denotes the split information for that particular node.
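As a side note, the complexity table can also be inspected on its own using standard rpart helpers; a minimal sketch, assuming the fitted model is stored in decision_model:

printcp(decision_model)   # print the complexity parameter (CP) table
plotcp(decision_model)    # plot cross-validated error against cp in the Plots window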

[RStudio]

rpart.plot(decision_model)

Now let us visualize the decision tree model.

In the Source window type this command and run it.

Drag boundary to see the plot window clearly

Hover Kecimen

Hover 0.71

Hover 48% 52%

The trained decision tree model is shown in the plot window

For each node,

Predicted Class

Its probability

And the Percentage of total observations is shown

One must note that the modeled tree is interpretable.

This is because the max depth of the tree is manually specified.

But it comes at the cost of underfitting and an increase in misclassification error.

Now let us use the model to make predictions on the testing data partition.
[RStudio]

predictions <- predict(decision_model, newdata = test, type = "class")

Select and run this command.

In the source window type this command and run it.

This command generates the predicted classes from the trained decision tree model.

Let's now evaluate our model.
[RStudio]

confusion_matrix <- confusionMatrix(predictions, test$class)

Type this command in the Source window
Highlight

confusion_matrix <- confusionMatrix(predictions, test$class)

This command will create a confusion matrix list.

The list will contain the different evaluation metrics.

Select and run the command

[RStudio]

confusion_matrix$overall["Accuracy"]

Now, let us type this command.

This command will display the accuracy of the model by retrieving it from the confusion matrix list.

Select and run the command

Highlight 0.807 We can see that our model has 80 percent accuracy.

Note that the misclassification rate is higher because of the manually specified maxdepth attribute.

Choosing a higher value will reduce the misclassification error but make the model less interpretable.
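Other metrics stored in the confusion matrix list can be retrieved in the same way; a small optional sketch (not part of the original script):

confusion_matrix$byClass["Sensitivity"]   # true positive rate
confusion_matrix$byClass["Specificity"]   # true negative rate
confusion_matrix$overall["Kappa"]         # Cohen's kappa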

confusion_table <- data.frame(confusion_matrix$table) In the source window, type this command.

This will create a data-frame of the confusion matrix table.

Select and run the command.

Click on confusion_table in the Environment tab.

We notice that it displays the number of correct and incorrect predictions for each class.

[RStudio] To plot the confusion matrix, type these commands in the Source window.

It will represent the number of correct and incorrect predictions using different colors.

plot_confusion_matrix(confusion_table,

target_col = "Reference",

prediction_col = "Prediction",

counts_col = "Freq",

palette = list("low" = "pink1","high" = "green1"),

add_normalized = FALSE,

add_row_percentages = FALSE,

add_col_percentages = FALSE)

We use the plot_confusion_matrix function from the cvms package.

We will use the dataframe confusion_table.

target_col is the column Reference in the dataframe confusion_table with the reference (true) labels.

prediction_col is the column Prediction in the dataframe confusion_table with the predicted labels.

counts_col is the column Freq in the dataframe confusion_table with the number of correct and incorrect predictions.

The palette will plot the correct and incorrect predictions in different colours.

Select and run the commands.

The output is seen in the plot window

Highlight Output in Plot window. This plot shows how well our model predicted the testing data.

We observe that:

Kecimen class: 18 misclassifications

Besni class: 34 misclassifications

Now let us visualize the decision boundary of the model.
[RStudio]

grid <- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500),

ecc = seq(min(data$ecc), max(data$ecc), length = 500))

grid$class = predict(decision_model, newdata = grid, type = "class")

grid$classnum <- as.numeric(grid$class)

In the source window type these commands
Highlight the command

grid <- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500),

ecc = seq(min(data$ecc), max(data$ecc), length = 500))

grid$class = predict(decision_model, newdata = grid, type = "class")

grid$classnum <- as.numeric(grid$class)

This code creates a grid of points spanning the range of minorAL and ecc features in the dataset.

It then uses the Decision Tree model to predict the class of each point in this grid.

It stores these predictions as a new column 'class' in the grid dataframe.

Select the commands and run them.

[RStudio]

ggplot() +

geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +

geom_point(data = train, aes(x = minorAL, y = ecc, color = class)) +

geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),

colour = "black", linewidth = 0.7) +

scale_fill_manual(values = c("#ffff46", "#FF46e9")) +

scale_color_manual(values = c("red", "blue")) +

labs(x = "MinorAL", y = "ecc", title = "QDA Decision Boundary") +

theme_minimal()

To visualise the generated data, type these commands
Highlight the command

ggplot() +

geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +

geom_point(data = train, aes(x = minorAL, y = ecc, color = class)) +

geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),

colour = "black", linewidth = 0.7) +

scale_fill_manual(values = c("#ffff46", "#FF46e9")) +

scale_color_manual(values = c("red", "blue")) +

labs(x = "MinorAL", y = "ecc", title = "QDA Decision Boundary") +

theme_minimal()

This command creates the decision boundary plot of the decision tree model.

It shows the distribution of training data points.

It plots the grid points with colors indicating the predicted classes using ggplot2.

Select and run these commands.

Drag boundaries to see the plot window clearly.

Point to the plot. It shows that the decision boundary of a decision tree model is non-linear.

The complexity of the decision boundary increases with the complexity of the decision tree.
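To see this effect, one could optionally refit the model with a larger maxdepth and redraw the boundary; a hedged sketch reusing the objects created above:

deeper_model <- rpart(class ~ ., data = train, method = 'class',
                      control = rpart.control(cp = .00001, xval = 10, maxdepth = 5),
                      parms = list(split = "gini"))
grid$class <- predict(deeper_model, newdata = grid, type = "class")
grid$classnum <- as.numeric(grid$class)
# Re-running the ggplot commands above now shows a more fragmented, more complex boundary.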

Show Slide

Limitations of Decision tree

  • If the tree is too complex, it can overfit data.
  • Small variations in data can result in a different tree.
  • Large trees are difficult to interpret
  • Noisy data may cause inaccurate splits
Here are some of the limitations of Decision Tree.
Only Narration With this we come to the end of the tutorial.

Let us summarize.

Show Slide

Summary

In this tutorial we have learnt about:
  • Decision Tree
  • Assumptions of Decision Tree
  • Advantages of Decision Tree.
  • Implementation of Decision Tree in R.
  • Plotting the decision tree model.
  • Evaluation of the model.
  • Visualizing the model decision boundary
  • Limitations of Decision Tree.
Now we will suggest the assignment for this Spoken Tutorial.
Show Slide

Assignment

  • Apply Decision Tree on PimaIndiansDiabetes dataset
  • Install the pdp package and import the dataset using the data(pima) command
  • Visualize the decision tree and measure the accuracy of the model (a starter sketch follows this list)
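A possible starting point for the assignment, assuming the pima dataset from the pdp package contains a factor column named diabetes (check the column names after loading, as they may differ):

# install.packages("pdp")          # if not already installed
library(pdp)
library(rpart)
library(rpart.plot)
library(caret)

data(pima)                         # Pima Indians Diabetes data from the pdp package
pima <- na.omit(pima)              # drop rows with missing values

set.seed(1)
idx <- createDataPartition(pima$diabetes, p = 0.7, list = FALSE)
train_pima <- pima[idx, ]
test_pima  <- pima[-idx, ]

pima_model <- rpart(diabetes ~ ., data = train_pima, method = 'class')
rpart.plot(pima_model)             # visualize the fitted tree

pred <- predict(pima_model, newdata = test_pima, type = "class")
confusionMatrix(pred, test_pima$diabetes)$overall["Accuracy"]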
Show slide

About the Spoken Tutorial Project

The video at the following link summarizes the Spoken Tutorial project.

Please download and watch it.

Show slide

Spoken Tutorial Workshops

We conduct workshops using Spoken Tutorials and give certificates.

For more details, please contact us.

Show Slide

Spoken Tutorial Forum to answer questions

Please post your timed queries in this forum.
Show Slide

Forum to answer questions

Do you have any general/technical questions?

Please visit the forum given in the link.

Show Slide

Textbook Companion

The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.

We give certificates to those who do this.

For more details, please visit these sites.

Show Slide

Acknowledgment

The Spoken Tutorial project was established by the Ministry of Education, Government of India.
Show Slide

Thank You

This tutorial is contributed by Debatosh Chakraborty and Yate Asseke Ronald Olivera from IIT Bombay.

Thank you for joining.
