Machine-Learning-using-R/C2/Decision-Tree-in-R/English


Title of the script: Decision Tree in R

Author: Debatosh Chakraborty and Yate Asseke Ronald Olivera

Keywords: R, RStudio, machine learning, supervised, unsupervised, classification, regression, decision tree, video tutorial.

Visual Cue Narration
Show slide

Opening Slide

Welcome to this Spoken Tutorial on Decision Tree in R.
Show slide

Learning Objectives

In this tutorial, we will learn about:
  • Decision Tree
  • Assumptions for Decision Tree
  • Advantages of Decision Tree
  • Implementation of Decision Tree in R.
  • Plotting the decision tree model
  • Evaluation of the model.
  • Visualizing the model decision boundary
  • Limitations of Decision Tree.
Show slide

System Specifications

This tutorial is recorded using,
  • Windows 11
  • R version 4.3.0
  • RStudio version 2023.06.1

It is recommended to install R version 4.2.0 or higher.

Show slide

Prerequisites

https://spoken-tutorial.org

To follow this tutorial, the learner should know:
  • Basic programming in R
  • Basics of Machine Learning

If not, please access the relevant tutorials on this website.

Show slide

What is a Decision Tree?

Let us see what a decision tree is.
  • It uses a binary tree to split the feature space into several sub-regions
  • The nodes of the tree are the locations at which the feature space splits
  • Misclassification error, Gini index, and entropy aid in identifying ideal splits (a small worked sketch follows this list).
  • The decision boundaries in the Decision Tree model are nonlinear
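For intuition, here is a small worked sketch (not part of the original script) showing how these impurity measures can be computed in R for a hypothetical node containing 40 Kecimen and 10 Besni observations:

# Hypothetical node with 40 Kecimen and 10 Besni observations
p <- c(40, 10) / 50                  # class proportions in the node
gini     <- 1 - sum(p^2)             # Gini index: 0 for a pure node
entropy  <- -sum(p * log2(p))        # entropy: 0 for a pure node
misclass <- 1 - max(p)               # misclassification error
c(gini = gini, entropy = entropy, misclass = misclass)

A split is considered good when it produces child nodes whose weighted impurity is lower than that of the parent node.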
Show slide

Assumptions of Decision Tree

  • The root node of the tree consists of the entire training set.
  • The model does not assume any specific distribution of features.
  • Each observation is independent.
The assumptions of the decision tree model are as follows.
Show slide

Advantages of Decision Tree

The advantages of decision tree model include:
  • Feature variables are not required to be continuous
  • Decision trees are intuitive and easy to visualize
  • When the response is continuous, the decision tree methodology can be easily implemented as a regression tree

The regression tree method will be discussed in a separate tutorial.

Show Slide

Implementation of Decision Tree

Now we will construct a Decision Tree on the Raisin dataset with two chosen variables.
Show slide

Download Files

For this tutorial, we will use

A script file DecisionTree.R.

Raisin Dataset 'raisin.xlsx'

Please download these files from the Code files link of this tutorial.

Make a copy and then use them while practicing.

[Computer screen]

Highlight DecisionTree.R

I have downloaded and moved these files to the Decision Tree folder.

We will create a Decision Tree classifier model on the raisin dataset.

Let us switch to RStudio.
Double-click DecisionTree.R on RStudio

Point to DecisionTree.R in RStudio.

Open the script DecisionTree.R in RStudio.

Script DecisionTree.R opens in RStudio.

[RStudio]

Highlight library(readxl)

library(ggplot2)

library(caret)

Highlight library(rpart)

Highlight library(rpart.plot)

Highlight library(cvms)

# install.packages("package_name")

Point to the command.

Select and run these commands to import the packages.

These packages will be used to aid the building and evaluation of the classifier.

We will use the rpart package to create the decision tree classifier.

We will use the rpart.plot package for plotting the decision tree.

We will use the cvms package for plotting the confusion matrix.

Please ensure that all the packages are installed correctly.

As I have already installed the packages, I have imported them directly.
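For reference, a consolidated sketch of how the import section of DecisionTree.R might look (the downloaded script may differ slightly):

library(readxl)      # read the Raisin dataset from the Excel file
library(ggplot2)     # plot the decision boundary
library(caret)       # createDataPartition() and confusionMatrix()
library(rpart)       # fit the decision tree classifier
library(rpart.plot)  # plot the fitted tree
library(cvms)        # plot_confusion_matrix()
# install.packages("package_name")   # run first for any package that is not yet installed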

[RStudio]

Highlight

data <- read_xlsx("Raisin.xlsx")

Highlight

data <- data[c("minorAL",”ecc”,"class")]

Highlight

data$class <- factor(data$class)

Select the commands and click the Run button

These commands will load the Raisin dataset.

They will also prepare the dataset for model building.

Select and run the commands.

Click on data in the Environment tab to load the dataset.

Point to the Source window.

Click on data in the Environment tab to load the modified data in the Source window.
[RStudio]

set.seed(1)

trainIndex <- createDataPartition(data$class, p = 0.7, list = FALSE)

train <- data[trainIndex, ]

test <- data[-trainIndex, ]

In the Source window type these commands.
Highlight

set.seed(1)

Highlight

trainIndex <- createDataPartition(data$class, p = 0.7, list = FALSE)

Highlight

train <- data[trainIndex, ]

Highlight

test <- data[-trainIndex, ]

This will split our dataset into training and testing data.

Select the commands and run them.

Let us now create our Decision Tree model.
decision_model <- rpart(class ~ ., data = train, method = 'class',

control = rpart.control(cp = .00001, xval = 10, maxdepth = 2),

parms = list(split = "gini"))

summary(decision_model)

In the source window, type these commands.
[RStudio]

Highlight formula = class ~ .

Highlight data=train

Highlight method = 'class'

Highlight parms = list(split = "gini")

Highlight maxdepth = 2

Highlight xval = 10

Highlight cp = .00001

Click the Run button.

Point to Environment tab.

This is the formula we use for this model.

class is taken as the dependent variable.

The remaining attributes are independent variables.

data = train uses the training partition of the dataset to train our model.

This tells our model that we are doing a classification task.

The Gini index will be used to determine the best split at each node.

This determines the maximum depth of the tree.

This is the number of cross-validations performed while growing the tree.

This is the complexity parameter: a split must improve the fit by at least this factor to be attempted.

Select and run the commands.

The model data is shown in the Environment tab.

Highlight CP

Highlight Node Information

Highlight n=630

Highlight class counts

Highlight probabilities

Highlight Predicted class

Highlight Primary splits

The summary of the model created is shown in the console window

Drag boundary to see the console window clearly

CP displays the complexity table for the trees created in the final model.

This displays the information about each node created.

This includes,

Total observations used to create the node.

The distribution of observations for each class in the node.

The probability of each class.

The class with the highest probability is the predicted class for the node.

This denotes the split information for that particular node.
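As a side note, the complexity table can also be inspected on its own using standard rpart helpers; a minimal sketch, assuming the fitted model is stored in decision_model:

printcp(decision_model)   # print the complexity parameter (CP) table
plotcp(decision_model)    # plot cross-validated error against cp in the Plots window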

[RStudio]

rpart.plot(decision_model)

Now let us visualize the decision tree model.

In the Source window type this command and run it.

Drag boundary to see the plot window clearly

Hover Kecimen

Hover 0.71

Hover 48% 52%

The trained decision tree model is shown in the plot window

For each node,

Predicted Class

Its probability

And the Percentage of total observations is shown

One must note that the modeled tree is interpretable.

This is because the max depth of the tree is manually specified.

But it comes at the cost of underfitting and an increase in misclassification error.

Now let us use the model to make predictions on the testing data partition.
[RStudio]

predictions <- predict(decision_model, newdata = test, type = "class")

Select and run this command.

In the source window type this command and run it.

This command generates the predicted classes from the trained decision tree model.

Let's now evaluate our model.
[RStudio]

confusion_matrix <- confusionMatrix(predictions, test$class)

Type this command in the Source window
Highlight

confusion_matrix <- confusionMatrix(predictions, test$class)

This command will create a confusion matrix list.

The list will contain the different evaluation metrics.

Select and run the command

[RStudio]

confusion_matrix$overall["Accuracy"]

Now, let us type this command.

This command will display the accuracy of the model by retrieving it from the confusion matrix list.

Select and run the command

Highlight 0.807 We can see that our model has 80 percent accuracy.

Note that the misclassification rate is higher because of the manually specified maxdepth attribute.

Choosing a higher value will reduce the misclassification error but make the model less interpretable.
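Other metrics stored in the confusion matrix list can be retrieved in the same way; a small optional sketch (not part of the original script):

confusion_matrix$byClass["Sensitivity"]   # true positive rate
confusion_matrix$byClass["Specificity"]   # true negative rate
confusion_matrix$overall["Kappa"]         # Cohen's kappa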

confusion_table <- data.frame(confusion_matrix$table) In the source window, type this command.

This will create a data-frame of the confusion matrix table.

Select and run the command.

Click on confusion_table in the Environment tab.

We notice that it displays the number of correct and incorrect predictions for each class.

[RStudio] To plot the confusion matrix, type these commands in the Source window.

It will represent the number of correct and incorrect predictions using different colors.

plot_confusion_matrix(confusion_table,

target_col = "Reference",

prediction_col = "Prediction",

counts_col = "Freq",

palette = list("low" = "pink1","high" = "green1"),

add_normalized = FALSE,

add_row_percentages = FALSE,

add_col_percentages = FALSE)

We use the plot_confusion_matrix function from the cvms package.

We will use the dataframe confusion_table.

target_col is the column Reference in the dataframe confusion_table with the reference (true) labels.

prediction_col is the column Prediction in the dataframe confusion_table with the predicted labels.

counts_col is the column Freq in the dataframe confusion_table with the number of correct and incorrect predictions.

The palette will plot the correct and incorrect predictions in different colours.

Select and run the commands.

The output is seen in the plot window

Highlight Output in Plot window. This plot shows how well our model predicted the testing data.

We observe that:

Kecimen class: 18 misclassifications

Besni class: 34 misclassifications

Now let us visualize the decision boundary of the model.
[RStudio]

grid <- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500),

ecc = seq(min(data$ecc), max(data$ecc), length = 500))

grid$class = predict(decision_model, newdata = grid, type = "class")

grid$classnum <- as.numeric(grid$class)

In the source window type these commands
Highlight the command

grid <- expand.grid(minorAL = seq(min(data$minorAL), max(data$minorAL), length = 500),

ecc = seq(min(data$ecc), max(data$ecc), length = 500))

grid$class = predict(decision_model, newdata = grid, type = "class")

grid$classnum <- as.numeric(grid$class)

This code creates a grid of points spanning the range of minorAL and ecc features in the dataset.

It then uses the Decision Tree model to predict the class of each point in this grid.

It stores these predictions as a new column 'class' in the grid dataframe.

Select the commands and run them.

[RStudio]

ggplot() +

geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +

geom_point(data = train, aes(x = minorAL, y = ecc, color = class)) +

geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),

colour = "black", linewidth = 0.7) +

scale_fill_manual(values = c("#ffff46", "#FF46e9")) +

scale_color_manual(values = c("red", "blue")) +

labs(x = "MinorAL", y = "ecc", title = "QDA Decision Boundary") +

theme_minimal()

To visualise the generated data, type these commands
Highlight the command

ggplot() +

geom_raster(data = grid, aes(x = minorAL, y = ecc, fill = class), alpha = 0.4) +

geom_point(data = train, aes(x = minorAL, y = ecc, color = class)) +

geom_contour(data = grid, aes(x = minorAL, y = ecc, z = classnum),

colour = "black", linewidth = 0.7) +

scale_fill_manual(values = c("#ffff46", "#FF46e9")) +

scale_color_manual(values = c("red", "blue")) +

labs(x = "MinorAL", y = "ecc", title = "QDA Decision Boundary") +

theme_minimal()

This command creates the decision boundary plot of the decision tree model.

It shows the distribution of training data points.

It plots the grid points with colors indicating the predicted classes using ggplot2.

Select and run these commands.

Drag boundaries to see the plot window clearly.

Point to the plot. It shows that the decision boundary of a decision tree model is non-linear.

The complexity of the decision boundary increases with the complexity of the decision tree.
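To see this effect, one could optionally refit the model with a larger maxdepth and redraw the boundary; a hedged sketch reusing the objects created above:

deeper_model <- rpart(class ~ ., data = train, method = 'class',
                      control = rpart.control(cp = .00001, xval = 10, maxdepth = 5),
                      parms = list(split = "gini"))
grid$class <- predict(deeper_model, newdata = grid, type = "class")
grid$classnum <- as.numeric(grid$class)
# Re-running the ggplot commands above now shows a more fragmented, more complex boundary.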

Show Slide

Limitations of Decision tree

  • If the tree is too complex, it can overfit data.
  • Small variations in data can result in a different tree.
  • Large trees are difficult to interpret
  • Noisy data may cause inaccurate splits
Here are some of the limitations of Decision Tree.
Only Narration With this we come to the end of the tutorial.

Let us summarize.

Show Slide

Summary

In this tutorial we have learnt about:
  • Decision Tree
  • Assumptions of Decision Tree
  • Advantages of Decision Tree.
  • Implementation of Decision Tree in R.
  • Plotting the decision tree model.
  • Evaluation of the model.
  • Visualizing the model decision boundary
  • Limitations of Decision Tree.
Now we will suggest the assignment for this Spoken Tutorial.
Show Slide

Assignment

  • Apply Decision Tree on PimaIndiansDiabetes dataset
  • Install the pdp package and import the dataset using the data(pima) command
  • Visualize the decision tree and measure the accuracy of the model (a starter sketch follows this list)
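A possible starting point for the assignment, assuming the pima dataset from the pdp package contains a factor column named diabetes (check the column names after loading, as they may differ):

# install.packages("pdp")          # if not already installed
library(pdp)
library(rpart)
library(rpart.plot)
library(caret)

data(pima)                         # Pima Indians Diabetes data from the pdp package
pima <- na.omit(pima)              # drop rows with missing values

set.seed(1)
idx <- createDataPartition(pima$diabetes, p = 0.7, list = FALSE)
train_pima <- pima[idx, ]
test_pima  <- pima[-idx, ]

pima_model <- rpart(diabetes ~ ., data = train_pima, method = 'class')
rpart.plot(pima_model)             # visualize the fitted tree

pred <- predict(pima_model, newdata = test_pima, type = "class")
confusionMatrix(pred, test_pima$diabetes)$overall["Accuracy"]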
Show slide

About the Spoken Tutorial Project

The video at the following link summarizes the Spoken Tutorial project.

Please download and watch it.

Show slide

Spoken Tutorial Workshops

We conduct workshops using Spoken Tutorials and give certificates.

For more details, please contact us.

Show Slide

Spoken Tutorial Forum to answer questions

Please post your timed queries in this forum.
Show Slide

Forum to answer questions

Do you have any general/technical questions?

Please visit the forum given in the link.

Show Slide

Textbook Companion

The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.

We give certificates to those who do this.

For more details, please visit these sites.

Show Slide

Acknowledgment

The Spoken Tutorial project was established by the Ministry of Education, Government of India.
Show Slide

Thank You

This tutorial is contributed by Debatosh Chakraborty and Yate Asseke Ronald Olivera from IIT Bombay.

Thank you for joining.
