Title of the script: Hierarchical Clustering
Author: Tanmay Srinath
Keywords: R, RStudio, machine learning, supervised, unsupervised, classification, hierarchical clustering, tree of clusters, Agglomerative clustering, Dendrogram, video tutorial.
Visual Cue | Narration |
Show slide
Opening Slide |
Welcome to this spoken tutorial on Hierarchical Clustering in R. |
Show slide
Learning Objectives |
In this tutorial, we will learn about:
|
Show slide
System Specifications |
This tutorial is recorded using,
It is recommended to install R version 4.1.0 or higher. |
Show Slide
Prerequisites https://spoken-tutorial.org |
To follow this tutorial, the learner should know:
If not, please access the relevant tutorials on this website. |
Show Slide
Hierarchical Clustering |
Hierarchical Clustering
|
Show slide
Dendrogram |
Dendrogram
|
Show slide
Types of Hierarchical Clustering |
There are two types of hierarchical clustering:
Agglomerative clustering and Divisive clustering. |
Show slide
Agglomerative clustering |
Agglomerative clustering
It uses a bottom-up approach. Each data point starts in its own cluster. |
Show slide
Divisive clustering |
Divisive clustering
It uses a top-down approach. All data points start in the same cluster. |
Show slide
Agglomerative clustering |
In this tutorial, we will focus on agglomerative clustering. |
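The bottom-up idea can be seen on a very small example. The sketch below is only an illustration: the five points and the names pts and toy_model are made up for this purpose and are not part of HClust.R.

```r
# Toy data: five points on a line (hypothetical values, for illustration only)
pts <- matrix(c(1, 2, 6, 7, 15), ncol = 1,
              dimnames = list(c("a", "b", "c", "d", "e"), "x"))

# Agglomerative clustering; hclust() uses complete linkage by default
toy_model <- hclust(dist(pts))

# Each row of $merge is one merge step.
# Negative numbers are single observations,
# positive numbers refer to clusters formed in earlier rows.
print(toy_model$merge)

# $height holds the distance at which each merge happened
print(toy_model$height)
```

With n data points there are always n - 1 merge steps, because every step joins exactly two clusters until only one remains.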
Show Slide
Agglomerative clustering |
|
Show Slide
Linkage |
We will learn about linkage and its types.
|
Show Slide
Types of Linkage |
There are primarily 4 types of linkage.
|
Show Slide
Types of Linkage |
|
Now we will learn some advantages of using Hierarchical clustering. | |
Show slide
Advantages of Hierarchical Clustering |
|
Now let us learn a few applications of Hierarchical Clustering. | |
Show Slide
Applications of Hierarchical Clustering |
|
Show Slide
Hierarchical Clustering |
Now let us implement hierarchical clustering on the iris dataset. |
Show slide
Download Files |
For this tutorial, we will use a script file HClust.R.
Please download this file from the Code files link of this tutorial. Make a copy and then use it for practising. |
[Computer screen]
Highlight HClust.R and the folder HClust. |
I have downloaded and moved this file to the HClust folder.
|
Let us switch to RStudio. | |
Click HClust.R in RStudio
Point to HClust.R in RStudio. |
Open the script HClust.R in RStudio.
|
[RStudio]
Highlight library(ggplot2) |
We will use the ggplot2 package for plotting our results.
|
[RStudio]
library(ggplot2)
data(iris) |
Select and run these commands. |
Click in the Environment tab to load the iris dataset. | Click in the Environment tab to load the iris dataset. |
Now let us create our model. | |
[RStudio]
set.seed(121)
model <- hclust(dist(iris[, 3:4]))
plot(model) |
Type these commands. |
Highlight
set.seed(121)
model <- hclust(dist(iris[, 3:4]))
|
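We set a seed as a standard practice for reproducibility. Note that hclust() itself is deterministic and does not depend on the seed.
We will use petal length and petal width, columns 3 and 4 of iris, as our two features.
The dist() function computes the distance between every pair of iris data points. By default, it uses Euclidean distance.
The hclust() function then builds the cluster tree from these distances.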
|
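To see what dist() passes on to hclust(), it may help to look at a small part of the distance data. This is a minimal sketch; the names petals, d_small and d_full are chosen here for illustration and do not appear in HClust.R.

```r
data(iris)
petals <- iris[, 3:4]            # Petal.Length and Petal.Width

# Pairwise Euclidean distances between the first five flowers only
d_small <- dist(petals[1:5, ])
print(round(as.matrix(d_small), 2))

# For all 150 flowers, dist() stores one value per pair: 150 * 149 / 2
d_full <- dist(petals)
print(length(d_full))

# hclust() uses complete linkage unless a method is given
model <- hclust(d_full)
print(model$method)
```

The dist object stores only the lower triangle of the distance matrix, which is why its length is the number of pairs and not 150 times 150.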
Point to the Dendrogram.
|
A dendrogram is seen in the plot window.
|
Highlight output in plot window
|
This is the dendrogram produced.
We can see that there are mainly 3 to 4 clusters.
|
Drag boundaries to see the Source window clearly. | Drag boundaries to see the Source window clearly. |
[RStudio]
cut_1 <- cutree(model, 3)
Highlight cutree(model, 3)
Click the Save and Run buttons. |
Type this command.
We pass two arguments: the model and the number of clusters we want.
|
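The cutree() function can cut the tree either by the number of clusters (k) or by a height on the dendrogram (h). The sketch below shows both; the names cut_k, cut_h and h3 are chosen here for illustration, and the height is read from the model rather than typed by hand.

```r
data(iris)
model <- hclust(dist(iris[, 3:4]))

# Cut by number of clusters; same as cutree(model, 3)
cut_k <- cutree(model, k = 3)
print(table(cut_k))               # number of flowers in each cluster

# Cut by height instead.
# model$height is sorted in increasing order, so the last entries are the top merges.
# A height between the merge that leaves 3 clusters and the merge that leaves 2
# clusters gives 3 clusters.
top <- rev(model$height)
h3 <- mean(top[2:3])
cut_h <- cutree(model, h = h3)

# Both cuts assign every flower to the same cluster
print(all(cut_k == cut_h))
```

Cutting by k is convenient when the number of groups is known. Cutting by h is useful when the dendrogram is read visually.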
Now we shall tabulate the results of our model. | |
[RStudio]
table(cut_1, iris$Species)
|
Type this command.
The table shows that the model has problems when it needs to isolate the versicolor species. |
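The table can also be summarised in a single number. The sketch below computes a simple purity score, which is an addition made here for illustration and not part of HClust.R. For each cluster it counts the most common species and divides the total by the number of flowers.

```r
data(iris)
model <- hclust(dist(iris[, 3:4]))
cut_1 <- cutree(model, 3)

# Rows are clusters, columns are the true species
tab <- table(cluster = cut_1, species = iris$Species)
print(tab)

# Purity: share of flowers that belong to the majority species of their cluster
purity <- sum(apply(tab, 1, max)) / sum(tab)
print(purity)

# Number of clusters in which each species appears.
# A value above 1 means the species is split across clusters.
print(colSums(tab > 0))
```

A purity of 1 would mean that every cluster contains only one species. Note that the species labels are used only to judge the result, not to build the clusters.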
Let us now try to improve our model. | |
[RStudio]
model_improved <- hclust(dist(iris[, 3:4]), method = 'average')
plot(model_improved)
|
Type these commands.
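This command uses the average linkage method to find the clusters. In average linkage, the distance between two clusters is the mean of the distances between all pairs of their points.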
|
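One possible way to compare the two linkage methods is the cophenetic correlation. It measures how closely the heights in the dendrogram follow the original distances. This is a sketch added for illustration; the names d, model_complete and model_average are not part of HClust.R.

```r
data(iris)
d <- dist(iris[, 3:4])

model_complete <- hclust(d)                       # default: complete linkage
model_average  <- hclust(d, method = "average")   # average linkage

# cophenetic() returns, for every pair of points,
# the height at which they are first joined in the tree
coph_cor <- c(
  complete = cor(d, cophenetic(model_complete)),
  average  = cor(d, cophenetic(model_average))
)
print(round(coph_cor, 3))
```

A value closer to 1 suggests that the tree preserves the original distances better. It is one indicator only and does not by itself show which clusters match the species.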
Point to the Dendrogram.
|
A dendrogram is seen in the plot window.
|
Highlight output in plot window.
|
We can see that there are 3 clear clusters in the dendrogram.
Let us cut the tree accordingly.
|
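Before cutting, the chosen clusters can be marked on the dendrogram itself. The sketch below uses rect.hclust() from base R for this; it is an optional addition and not part of HClust.R. The name groups is chosen here for illustration.

```r
data(iris)
model_improved <- hclust(dist(iris[, 3:4]), method = "average")

# Draw the dendrogram without the 150 leaf labels to keep it readable
plot(model_improved, labels = FALSE)

# Draw a rectangle around each of the 3 clusters.
# The function also returns the members of each cluster as a list.
groups <- rect.hclust(model_improved, k = 3, border = "red")

# Number of flowers inside each rectangle, from left to right
print(lengths(groups))
```

The rectangles show the same grouping that cutree() returns for k = 3, so they are a quick visual check of the cut.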
[RStudio]
cut_2 <- cutree(model_improved, 3)
table(cut_2, iris$Species)
Highlight output in console |
Type these commands.
|
Now let us plot our results. | |
[RStudio]
ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) +
geom_point(size = 5) +
geom_point(col = cut_2) |
Type this command.
|
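An alternative way to draw the same comparison is to store the cluster number as a factor and map it to the point shape. The plot then gets a legend for the clusters as well. This is a sketch under the assumption that ggplot2 is installed; the names plot_data, Cluster and p are chosen here for illustration.

```r
library(ggplot2)
data(iris)

model_improved <- hclust(dist(iris[, 3:4]), method = "average")

# Copy iris and add the cluster number as a factor column
plot_data <- iris
plot_data$Cluster <- factor(cutree(model_improved, 3))

# Colour shows the true species, shape shows the assigned cluster
p <- ggplot(plot_data,
            aes(Petal.Length, Petal.Width, color = Species, shape = Cluster)) +
  geom_point(size = 3)
print(p)
```

Points whose shape differs from the other points of the same colour are the flowers that were placed in a different cluster from most of their species.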
Drag boundary to see the plot window clearly. | Drag boundaries to see the plot clearly. |
Highlight output in plot window | The clusters produced have largely separated the 3 species.
We can see that setosa is clearly separated. Most of the versicolor and virginica data points are also placed in separate clusters. |
Only Narration. | With this we come to the end of this tutorial.
Let us summarise. |
Show Slide
Summary |
In this tutorial we have learnt about:
|
Show Slide
Assignment |
Now we will suggest an assignment for this Spoken Tutorial.
|
Show Slide
About the Spoken Tutorial Project |
The video at the following link summarises the Spoken Tutorial project.
Please download and watch it. |
Show Slide
Spoken Tutorial Workshops |
We conduct workshops using Spoken Tutorials and give certificates.
|
Show Slide
Spoken Tutorial Forum to answer questions |
Please post your timed queries in this forum. |
Show Slide
Forum to answer questions |
Do you have any general/technical questions?
Please visit the forum given in the link. |
Show Slide
Textbook Companion |
The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.
|
Show Slide
Acknowledgment |
The Spoken Tutorial and FOSSEE projects are funded by the Ministry of Education, Govt. of India. |
Show Slide
Thank You |
This tutorial is contributed by Tanmay Srinath and Madhuri Ganapathi from IIT Bombay. Thank you for watching. |