Machine-Learning-using-R - old 2022/C4/Hierarchical-Clustering-in-R/English

Title of the script: Hierarchical Clustering

Author: Tanmay Srinath

Keywords: R, RStudio, machine learning, supervised, unsupervised, classification, hierarchical clustering, tree of clusters, Agglomerative clustering, Dendrogram, video tutorial.


Visual Cue Narration
Show slide

Opening Slide

Welcome to this spoken tutorial on Hierarchical Clustering in R.
Show slide

Learning Objectives

In this tutorial, we will learn about:
  • Hierarchical Clustering
  • Types of Hierarchical Clustering
  • Advantages of Hierarchical Clustering
  • Linkage and its types
  • Applications of Hierarchical Clustering
  • Hierarchical Clustering on iris dataset.
Show slide

System Specifications

This tutorial is recorded using:
  • Ubuntu Linux OS version 20.04
  • R version 4.1.2
  • RStudio version 1.4.1717

It is recommended to install R version 4.1.0 or higher.

Show Slide

Prerequisites https://spoken-tutorial.org

To follow this tutorial, the learner should know:
  • Basics of R programming
  • Basics of Machine Learning.

If not, please access the relevant tutorials on this website.

Show Slide

Hierarchical Clustering

Hierarchical Clustering
  • It is a method that works by grouping data into a tree of clusters.
  • It begins by treating every data point as a separate cluster.
  • Then it forms a dendrogram by repeatedly combining the two closest clusters.
Show slide

Dendrogram

Dendrogram
  • It is a tree-like diagram.
  • It represents the clusters formed by hierarchical clustering.
Show slide

Types of Hierarchical Clustering

There are two types of hierarchical clustering:

Agglomerative clustering and Divisive clustering.

Show slide

Agglomerative clustering

Agglomerative clustering

It uses a bottom-up approach.

Each data point starts in its own cluster.

Show slide

Divisive clustering

Divisive clustering

It uses a top-down approach.

All data points start in the same cluster.

Show slide

Agglomerative clustering

In this tutorial, we will focus on agglomerative clustering.
Show Slide

Agglomerative clustering

  • This is the most common type of hierarchical clustering.
  • In this type, the dendrogram is built by starting from the leaves.
  • Clusters are then combined up the trunk.
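
The bottom-up merging described above can be seen on a tiny example. This is a minimal sketch that assumes five made-up one-dimensional points (x, toy_model and the values are illustrative and not part of HClust.R). It shows how hclust() records each merge, starting from the individual points.

```r
# Five hypothetical one-dimensional points (illustrative data only).
x <- c(a = 1, b = 2, c = 6, d = 7, e = 15)

# hclust() starts with every point in its own cluster (the leaves).
toy_model <- hclust(dist(x))

# Each row of $merge is one merge step. Negative numbers are single
# observations; positive numbers refer to clusters formed in earlier rows.
print(toy_model$merge)

# $height gives the dissimilarity at which each merge happened.
print(toy_model$height)
```

With n points there are always n - 1 merge steps, so this example has 4 rows in the merge matrix.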
Show Slide

Linkage

Now we will learn about linkage and its types.
  • Linkage defines the dissimilarity between two groups of observations.


Show Slide

Types of Linkage

There are four main types of linkage.
  • Complete: It is the maximum pairwise dissimilarity between observations in two different clusters.
  • Single: It is the minimum pairwise dissimilarity between observations in two different clusters.
Show Slide

Types of Linkage

  • Average: It is the mean pairwise dissimilarity between observations in two different clusters.
  • Centroid: It is the dissimilarity between centroids of two different clusters.
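
The four definitions can be checked by hand on a small example. This is a minimal sketch, assuming two made-up clusters of two points each and Euclidean distance (cluster_a, cluster_b and the coordinates are illustrative and not part of HClust.R).

```r
# Two small hypothetical clusters of 2-D points (illustrative data only).
cluster_a <- matrix(c(0, 0,
                      0, 2), ncol = 2, byrow = TRUE)
cluster_b <- matrix(c(3, 0,
                      5, 0), ncol = 2, byrow = TRUE)

# Euclidean distance between every point in A (rows) and every point in B (columns).
pairwise <- as.matrix(dist(rbind(cluster_a, cluster_b)))[1:2, 3:4]

complete_link <- max(pairwise)    # largest pairwise dissimilarity
single_link   <- min(pairwise)    # smallest pairwise dissimilarity
average_link  <- mean(pairwise)   # mean pairwise dissimilarity

# Centroid linkage: distance between the two cluster means.
centroid_link <- sqrt(sum((colMeans(cluster_a) - colMeans(cluster_b))^2))

print(c(complete = complete_link, single = single_link,
        average = average_link, centroid = centroid_link))
```

The same pair of clusters gets a different dissimilarity under each linkage, which is why the choice of linkage can change the dendrogram.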


Now we will learn some advantages of using Hierarchical clustering.
Show slide

Advantages of Hierarchical Clustering

  • It gives homogeneous clusters.
  • One can decide the number of clusters based on dendrograms.
  • It is mathematically very easy to understand.


Now let us learn a few applications of Hierarchical Clustering.
Show Slide

Applications of Hierarchical Clustering

  • It can be used to cluster shoppers based on past shopping history.
  • It can be used to build phylogenetic trees that show evolutionary relationships.


Show Slide

Hierarchical Clustering

Now let us implement hierarchical clustering on the iris dataset.
Show slide

Download Files

For this tutorial, we will use a script file HClust.R.

Please download this file from the Code files link of this tutorial.

Make a copy and then use it for practising.

[Computer screen]

Highlight HClust.R and the folder HClust.

I have downloaded and moved this file to the HClust folder.


This folder is located in the MLProject folder on my Desktop.


I have also set the HClust folder as my working Directory.

Let us switch to RStudio.
Click HClust.R in RStudio

Point to HClust.R in RStudio.

Open the script HClust.R in RStudio.


For this, click on the script HClust.R


Script HClust.R opens in RStudio.

[RStudio]

Highlight library(ggplot2)

We will use the ggplot2 package for plotting our results.


Since I have already installed the package, I will load it directly.

[RStudio]

library(ggplot2)

data(iris)

Select and run these commands.
Click iris in the Environment tab to load the iris dataset.
Now let us create our model.
[RStudio]

set.seed(121)

model <- hclust(dist(iris[, 3:4]))

plot(model)

Type these commands.
Highlight

set.seed(121)


Highlight

model <- hclust(dist(iris[, 3:4]))


Highlight dist(iris[, 3:4])


Select and click on the Run button.

We set a seed for reproducible results. Note that hclust() itself does not use random numbers, so the clustering is the same on every run.


This is the command to create a hierarchical clustering model.

We will use petal length and petal width as our two features.


Hierarchical clustering works by repeatedly finding and merging the two closest clusters.

For that we use the dist() function.

This function computes the distance between every pair of iris data points.
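
To see what dist() returns, it helps to try it on a few rows first. This is a minimal sketch that uses only the first three flowers (the names petals, d and by_hand are illustrative and not part of HClust.R). By default, dist() computes Euclidean distances.

```r
data(iris)

# Petal length and petal width of the first three flowers.
petals <- iris[1:3, 3:4]

# dist() uses Euclidean distance by default and returns the lower
# triangle of the pairwise distance matrix.
d <- dist(petals)
print(d)

# Check one entry by hand: distance between flower 1 and flower 3.
by_hand <- sqrt(sum((petals[1, ] - petals[3, ])^2))
print(by_hand)

# For the full dataset, 150 points give 150 * 149 / 2 pairwise distances.
n_pairs <- length(dist(iris[, 3:4]))
print(n_pairs)
```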


Select and run the commands.

Point to the Dendrogram.


Drag boundary to see the plot window clearly.

A dendrogram is seen in the plot window.


Drag boundaries to see the plot clearly.

Highlight output in plot window


Point to the dendrogram.


This is the dendrogram produced.

We can see that there are mainly 3 to 4 clusters.


Let us cut our dendrogram at the required number of clusters.
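
Before cutting, it can help to see where a 3-cluster cut falls on the dendrogram. This is an optional sketch and is not part of HClust.R. It uses rect.hclust() from the stats package to draw a box around each of the 3 clusters. The labels, hang and border arguments are only cosmetic choices.

```r
data(iris)
model <- hclust(dist(iris[, 3:4]))

# Redraw the dendrogram without the 150 leaf labels so it is easier to read.
plot(model, labels = FALSE, hang = -1)

# Draw one rectangle per cluster for a 3-cluster cut.
# rect.hclust() also returns the members of each boxed cluster.
boxes <- rect.hclust(model, k = 3, border = 2:4)

# Number of observations inside each box.
print(lengths(boxes))
```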

Drag boundaries to see the Source window clearly.
[RStudio]

cut_1 <- cutree(model, 3)

Highlight cutree(model, 3)

Click the Save and Run buttons.

Type this command.


We use the cutree() command to cut our dendrogram.

We pass two parameters - a model and the number of clusters we want.


Save and run the command.


We can see the details in the Environment tab.
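
cutree() can also cut the tree at a height instead of at a fixed number of clusters. This is a minimal sketch and not part of HClust.R (cut_k, cut_h and h_between are illustrative names). It picks a height between the two merges that take the tree from 3 clusters down to 2, so both calls should give the same grouping.

```r
data(iris)
model <- hclust(dist(iris[, 3:4]))

# Cut by number of clusters: one cluster label per observation.
cut_k <- cutree(model, k = 3)
print(table(cut_k))

# Cut by height: choose a height halfway between the merge that leaves
# 3 clusters and the merge that leaves 2 clusters.
n_merges <- length(model$height)
h_between <- mean(model$height[c(n_merges - 2, n_merges - 1)])
cut_h <- cutree(model, h = h_between)
print(table(cut_h))
```

Cutting by k is convenient when the number of groups is known. Cutting by h is useful when a dissimilarity threshold is more meaningful than a cluster count.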

Now we shall tabulate the results of our model.
[RStudio]

table(cut_1, iris$Species)


Highlight output in console

Type this command


Run the command to see the output in the console window.


Our model clearly distinguishes setosa and virginica.

But it has trouble isolating the versicolor species.

Let us now try to improve our model.
[RStudio]

model_improved <- hclust(dist(iris[, 3:4]), method = 'average')

plot(model_improved)


Highlight method = 'average'


Click the Run button.

Type these commands.

This command uses the average (mean) linkage method to find the clusters.


Select and run the commands.

Point to the Dendrogram.


Drag boundary to see the plot window clearly.

A dendrogram is seen in the plot window.


Drag boundaries to see the plot window clearly.

Highlight output in plot window.


Drag boundaries to see the Source window clearly.

We can see that there are 3 clear clusters in the dendrogram.

Let us cut the tree accordingly.


Drag boundaries to see the Source window clearly.

[RStudio]

cut_2 <- cutree(model_improved, 3)

table(cut_2, iris$Species)


Click the Save and Run buttons.

Highlight output in console

Type these commands.


Save and run the commands.


We see that the new model has misclassified only 6 samples.
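
The number of misclassified samples can also be computed from the table instead of being read off by eye. This is a minimal sketch and not part of HClust.R. It assumes that each cluster is labelled with its most frequent species and that every other sample in that cluster counts as a mismatch (confusion and mismatches are illustrative names).

```r
data(iris)
model_improved <- hclust(dist(iris[, 3:4]), method = "average")
cut_2 <- cutree(model_improved, 3)

# Rows are clusters, columns are species.
confusion <- table(cut_2, iris$Species)
print(confusion)

# Treat the most frequent species in each cluster as that cluster's label;
# every other sample in the same row counts as a mismatch.
mismatches <- sum(confusion) - sum(apply(confusion, 1, max))
print(mismatches)
```

The printed value can be compared with the figure quoted in the narration. The same two lines can be reused on cut_1 to compare the two models.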

Now let us plot our results.
[RStudio]

ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) + geom_point(size = 5) + geom_point(col = cut_2)

Type this command.


Save and run the command.

Drag boundary to see the plot window clearly.

Drag boundaries to see the plot clearly.

Highlight output in plot window.

The clusters produced have clearly separated the 3 species.

We can see that setosa is clearly separated.

Most of the versicolor and virginica data points are also clearly separated.

Only Narration. With this we come to the end of this tutorial.

Let us summarise.

Show Slide

Summary

In this tutorial we have learnt about:
  • Hierarchical Clustering
  • Types of Hierarchical Clustering
  • Advantages of Hierarchical Clustering
  • Linkage and its types
  • Applications of Hierarchical Clustering
  • Hierarchical Clustering on iris dataset
Show Slide

Assignment

Now we will suggest an assignment for this Spoken Tutorial.
  • Apply hierarchical clustering on the PimaIndiansDiabetes2 dataset.
  • Install and import the mlbench package.
  • Run the data("PimaIndiansDiabetes2") command to load the dataset.
  • Compare the various linkage methods.
Show Slide

About the Spoken Tutorial Project

The video at the following link summarises the Spoken Tutorial project.

Please download and watch it.

Show Slide

Spoken Tutorial Workshops

We conduct workshops using Spoken Tutorials and give certificates.


For more details, please contact us.

Show Slide

Spoken Tutorial Forum to answer questions

Please post your timed queries in this forum.
Show Slide

Forum to answer questions

Do you have any general/technical questions?

Please visit the forum given in the link.

Show Slide

Textbook Companion

The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.


We give certificates to those who do this.


For more details, please visit these sites.

Show Slide

Acknowledgment

The Spoken Tutorial and FOSSEE projects are funded by the Ministry of Education, Govt. of India.
Show Slide

Thank You

This tutorial is contributed by Tanmay Srinath and Madhuri Ganapathi from IIT Bombay. Thank you for watching.

Contributors and Content Editors

Madhurig, Nancyvarkey