Machine-Learning-using-R/C2/Decision-Tree-in-R/English
Title of the script: Decision Tree in R
Author: Debatosh Chakraborty and Yate Asseke Ronald Olivera
Keywords: R, RStudio, machine learning, supervised, unsupervised, classification, regression, decision tree, video tutorial.
Visual Cue | Narration |
Show slide
Opening Slide |
Welcome to this Spoken Tutorial on Decision Tree in R. |
Show slide
Learning Objectives |
In this tutorial, we will learn about:
|
Show slide
System Specifications |
This tutorial is recorded using,
It is recommended to install R version 4.2.0 or higher. |
Show slide
Prerequisites |
To follow this tutorial, learner should know:
If not, please access the relevant tutorials on this website. |
Show slide
What is a Decision Tree? |
Let us see what a decision tree is.
|
Show slide
Assumptions of Decision Tree
|
The assumptions of the decision tree model are as follows. |
Show slide
Advantages of Decision Tree |
The advantages of decision tree model include:
The regression tree method will be discussed in a separate tutorial. |
Show Slide
Implementation of Decision Tree |
Now we will construct a Decision Tree on the Raisin dataset with two chosen variables. |
Show slide
Download Files |
For this tutorial, we will use
A script file DecisionTree.R and the Raisin dataset 'Raisin.xlsx'. Please download these files from the Code files link of this tutorial. Make a copy and then use them while practicing. |
[Computer screen]
Highlight DecisionTree.R |
I have downloaded and moved these files to the Decision Tree folder.
We will create a Decision Tree classifier model on the raisin dataset. |
Let us switch to RStudio. | |
Double-click DecisionTree.R on RStudio
Point to DecisionTree.R in RStudio. |
Open the script DecisionTree.R in RStudio.
Script DecisionTree.R opens in RStudio. |
[RStudio]
Highlight library(readxl) library(ggplot2) library(caret) Highlight library(rpart) Highlight library(rpart.plot) Highlight library(cvms) #install.packages("package_name") Point to the command. |
Select and run these commands to import the packages.
These packages will be used to aid the building and evaluation of the classifier. We will use the rpart package to create the decision tree classifier. We will use the rpart.plot package for plotting the decision tree. We will use the cvms package for plotting the confusion matrix. Please ensure that all the packages are installed correctly. As I have already installed the packages, I have imported them directly. |
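Optional note (not part of the original script): if any of these packages are missing, one way to install all of them at once is sketched below, assuming an internet connection is available.

# Install any required packages that are not yet present
packages <- c("readxl", "ggplot2", "caret", "rpart", "rpart.plot", "cvms")
install.packages(setdiff(packages, rownames(installed.packages())))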
[RStudio]
Highlight data <- read_xlsx("Raisin.xlsx") Highlight data <- data[c("minorAL","ecc","class")] Highlight data$class <- factor(data$class) Select the commands and click the Run button |
These commands will load the Raisin dataset.
They will also prepare the dataset for model building. Select and run the commands. |
Click on data in the Environment tab to load the dataset.
Point to the Source window. |
Click on data in the Environment tab to load the modified data in the Source window. |
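Optional check (not part of the original script): the loaded data can also be inspected from the Console.

str(data)          # structure: two numeric predictors (minorAL, ecc) and the class factor
table(data$class)  # number of observations in each class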
[RStudio]
set.seed(1) trainIndex <- createDataPartition(data$class, p = 0.7, list = FALSE) train <- data[trainIndex, ] test <- data[-trainIndex, ] |
In the Source window type these commands. |
Highlight
set.seed(1) Highlight trainIndex <- createDataPartition(data$class, p = 0.7, list = FALSE) Highlight train <- data[trainIndex, ] Highlight test <- data[-trainIndex, ] |
This will split our dataset into training (70%) and testing (30%) partitions.
Select the commands and run them. |
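Optional check (not part of the original script): these commands report the size of each partition and the class balance of the training set.

nrow(train)                      # number of training observations
nrow(test)                       # number of testing observations
prop.table(table(train$class))   # class proportions in the training partition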
Let us now create our Decision Tree model. | |
decision_model <- rpart(class ~ ., data = train, method = 'class',
control = rpart.control(cp = .00001, xval = 10, maxdepth = 2), parms = list(split = "gini")) summary(decision_model) |
In the source window, type these commands. |
[RStudio]
Highlight formula = class ~ . Highlight data=train Highlight method = 'class' Highlight parms = list(split = "gini") Highlight maxdepth = 2 Highlight xval = 10 Highlight cp = .00001 Click the Run button. Point to Environment tab. |
This is the formula we use for this model.
class is taken as the dependent variable. The remaining attributes are independent variables. data = train uses the training partition of the dataset to train our model. method = 'class' tells our model that we are doing a classification task. The Gini index will be used to determine the best splits of the nodes. maxdepth determines the maximum depth of the tree. xval is the number of cross-validations performed for each split. cp is the complexity parameter, the minimum improvement a split must achieve to be attempted. Select and run the commands. The model data is shown in the Environment tab. |
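For reference, the same model command is restated below with inline comments; this is only an annotated sketch of the call typed above.

# Annotated restatement of the rpart call (same model as above)
decision_model <- rpart(
  class ~ .,                       # predict class from all remaining columns
  data = train,                    # fit on the training partition only
  method = "class",                # classification tree
  parms = list(split = "gini"),    # use the Gini index to choose splits
  control = rpart.control(
    cp = 0.00001,                  # minimum improvement required to attempt a split
    xval = 10,                     # number of cross-validations per split
    maxdepth = 2                   # restrict the tree to a depth of two
  )
)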
Highlight CP
HIghlight Node Information Highlight n=630 Highlight class counts Highlight probabilities Highlight Predicted class Highlight Primary splits |
The summary of the created model is shown in the Console window.
Drag the boundary to see the Console window clearly. CP displays the complexity table for the trees created in the final model. This displays the information about each node created. This includes the total observations used to create the node, the distribution of observations for each class in the node, and the probability of each class. The class with the highest probability is the predicted class for the node. This denotes the split information for that particular node. |
[RStudio]
rpart.plot(decision_model) |
Now let us visualize the decision tree model.
In the Source window, type this command and run it. Drag the boundary to see the plot window clearly. |
Hover Kecimen
Hover 0.71 Hover 48% 52% |
The trained decision tree model is shown in the plot window.
For each node, the predicted class, its probability, and the percentage of total observations are shown. Note that the modeled tree is easy to interpret because the maximum depth of the tree is manually specified. However, this comes at the cost of underfitting and an increase in misclassification error. |
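Optional variation (not part of the original script): rpart.plot accepts display options; assuming the package defaults, the following call shows the per-class probabilities along with the percentage of observations in each node.

# Illustrative variant of the tree plot
rpart.plot(decision_model, type = 2, extra = 104)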
Now let us use the model to make predictions on the testing data partition. | |
[RStudio]
predictions <- predict(decision_model, newdata = test, type = "class") Select and run this command. |
In the source window type this command and run it.
This command generates the predicted classes from the trained decision tree model. |
Let's now evaluate our model. | |
[RStudio]
confusion_matrix <- confusionMatrix(predictions, test$class) |
Type this command in the Source window |
Highlight
confusion_matrix <- confusionMatrix(predictions, test$class) |
This command will create a confusion matrix list.
The list will contain the different evaluation metrics. Select and run the command |
[RStudio]
confusion_matrix$overall["Accuracy"] |
Now, let us type this command.
This command will display the accuracy of the model by retrieving it from the confusion matrix list. Select and run the command. |
Highlight 0.807 | We can see that our model has 80 percent accuracy.
Note that the misclassifications are higher because of the manually specified maxdepth attribute. Choosing a higher value reduces the misclassification error but makes the model less interpretable, as the optional sketch below illustrates. |
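Optional experiment (not part of the original script): retraining the tree with a larger maxdepth and comparing test accuracy illustrates this trade-off. The name deeper_model below is illustrative.

# Retrain with maxdepth = 5 and compare test accuracy with the depth-2 model
deeper_model <- rpart(class ~ ., data = train, method = "class",
                      parms = list(split = "gini"),
                      control = rpart.control(cp = 0.00001, xval = 10, maxdepth = 5))
deeper_pred <- predict(deeper_model, newdata = test, type = "class")
confusionMatrix(deeper_pred, test$class)$overall["Accuracy"]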
confusion_table <- data.frame(confusion_matrix$table) | In the Source window, type this command.
This will create a data frame of the confusion matrix table. Select and run the command. Click on confusion_table in the Environment tab. We notice that it displays the number of correct and incorrect predictions for each class. |
[RStudio] | In the Source window, type these commands to plot the confusion matrix.
It will represent the number of correct and incorrect predictions using different colors. |
plot_confusion_matrix(confusion_table,
target_col = "Reference", prediction_col = "Prediction", counts_col = "Freq", palette = list("low" = "pink1","high" = "green1"), add_normalized = FALSE, add_row_percentages = FALSE, add_col_percentages = FALSE) |
We use the plot_confusion_matrix function from the cvms package.
We will use the dataframe confusion_table. target_col is the column Reference of confusion_table, which holds the actual class labels, and prediction_col is the column Prediction, which holds the predicted labels.
counts_col is the column Freq, which holds the number of correct and incorrect predictions for each pair of labels. The palette will plot the correct and incorrect predictions in different colours. Select and run the commands. The output is seen in the plot window. |
Highlight Output in Plot window. | This plot shows how well our model predicted the testing data.
We observe that: Kecimen class: 18 misclassifications Besni class: 34 misclassifications |
Now let us visualize the decision boundary of the model. | |
[RStudio]
grid <- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500), ecc = seq(min(data$ecc), max(data$ecc), length = 500)) grid$class = predict(decision_model, newdata = grid, type = "class") grid$classnum <- as.numeric(grid$class) |
In the source window type these commands |
Highlight the command
grid <- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500), ecc = seq(min(data$ecc), max(data$ecc), length = 500)) grid$class = predict(decision_model, newdata = grid, type = "class") grid$classnum <- as.numeric(grid$class) |
This code creates a grid of points spanning the range of minorAL and ecc features in the dataset.
It then uses the Decision Tree model to predict the class of each point in this grid. It stores these predictions as a new column 'class' in the grid dataframe, and also as a numeric column 'classnum', which is used to draw the boundary contour. Select the commands and run them. |
[RStudio]
ggplot() + geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) + geom_point(data = train, aes(x = minorAL, y = ecc, color = class)) + geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum), colour = "black", linewidth = 0.7) + scale_fill_manual(values = c("#ffff46", "#FF46e9")) + scale_color_manual(values = c("red", "blue")) + labs(x = "MinorAL", y = "ecc", title = "Decision Tree Decision Boundary") + theme_minimal() |
To visualize the generated data, type these commands. |
Highlight the command
ggplot() + geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) + geom_point(data = train, aes(x = minorAL, y = ecc, color = class)) + geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum), colour = "black", linewidth = 0.7) + scale_fill_manual(values = c("#ffff46", "#FF46e9")) + scale_color_manual(values = c("red", "blue")) + labs(x = "MinorAL", y = "ecc", title = "Decision Tree Decision Boundary") + theme_minimal() |
This command creates the decision boundary plot of the decision tree model.
It shows the distribution of the training data points. It plots the grid points with colors indicating the predicted classes using ggplot2. Select and run these commands. Drag the boundaries to see the plot window clearly. |
Point to the plot | It shows that the decision boundary of a decision tree model is non-linear.
The complexity of the decision boundary increases with the complexity of the decision tree. |
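Optional follow-up (not part of the original script): if the deeper_model from the earlier sketch was trained, predicting the same grid with it shows how the boundary grows more complex as the tree deepens.

# Predict the grid with the deeper tree and see where the two models disagree
grid$class_deep <- predict(deeper_model, newdata = grid, type = "class")
table(shallow = grid$class, deep = grid$class_deep)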
Show Slide
Limitations of Decision tree
|
Here are some of the limitations of Decision Tree. |
Only Narration | With this we come to the end of the tutorial.
Let us summarize. |
Show Slide
Summary |
In this tutorial we have learnt about:
|
Now we will suggest the assignment for this Spoken Tutorial. | |
Show Slide
Assignment |
|
Show slide
About the Spoken Tutorial Project |
The video at the following link summarizes the Spoken Tutorial project.
Please download and watch it. |
Show slide
Spoken Tutorial Workshops |
We conduct workshops using Spoken Tutorials and give certificates.
For more details, please contact us. |
Show Slide
Spoken Tutorial Forum to answer questions |
Please post your timed queries in this forum. |
Show Slide
Forum to answer questions |
Do you have any general/technical questions?
Please visit the forum given in the link. |
Show Slide
Textbook Companion |
The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.
We give certificates to those who do this. For more details, please visit these sites. |
Show Slide
Acknowledgment |
The Spoken Tutorial Project was established by the Ministry of Education, Government of India. |
Show Slide
Thank You |
This tutorial is contributed by Debatosh Chakraborty and Yate Asseke Ronald Olivera from IIT Bombay.
Thank you for joining. |