Machine-Learning-using-R - old 2022/C4/Hierarchical-Clustering-in-R/English

Title of the script: Hierarchical Clustering

Author: Tanmay Srinath

Keywords: R, RStudio, machine learning, supervised, unsupervised, classification, hierarchical clustering, tree of clusters, Agglomerative clustering, Dendrogram, video tutorial.

Visual Cue	Narration
Show slide Opening Slide	Welcome to this spoken tutorial on Hierarchical Clustering in R.
Show slide Learning Objectives	In this tutorial, we will learn about: Hierarchical Clustering, types of Hierarchical Clustering, advantages of Hierarchical Clustering, linkage and its types, applications of Hierarchical Clustering, and Hierarchical Clustering on the iris dataset.
Show slide System Specifications	This tutorial is recorded using: Ubuntu Linux OS version 20.04, R version 4.1.2, and RStudio version 1.4.1717. It is recommended to install R version 4.1.0 or higher.
Show Slide Prerequisites https://spoken-tutorial.org	To follow this tutorial, the learner should know: Basics of R programming and Basics of Machine Learning. If not, please access the relevant tutorials on this website.
Show Slide Hierarchical Clustering	Hierarchical Clustering: It is a method that works by grouping data into a tree of clusters. It begins by treating every data point as a separate cluster. Then it repeatedly combines the two closest clusters until only one cluster remains. The sequence of merges forms a dendrogram.
Show slide Dendrogram	Dendrogram: It is a tree-like diagram. It represents the clusters formed by hierarchical clustering.
Show slide Types of Hierarchical Clustering	There are two types of hierarchical clustering: Agglomerative clustering and Divisive clustering.
Show slide Agglomerative clustering	Agglomerative clustering: It uses a bottom-up approach. Each data point starts in its own cluster.
Show slide Divisive clustering	Divisive clustering: It uses a top-down approach. All data points start in the same cluster.
Show slide Agglomerative clustering	In this tutorial, we will focus on agglomerative clustering.
Show Slide Agglomerative clustering	This is the most common type of hierarchical clustering. In this type, the dendrogram is built by starting from the leaves. Clusters are then combined up the trunk.
Show Slide Linkage	Now we will learn about linkage and its types. Linkage defines the dissimilarity between two groups of observations.
Show Slide Types of Linkage	Primarily there are 4 types of Linkages. Complete: It is the maximum pairwise dissimilarity between observations in two different clusters. Single: It is the minimum pairwise dissimilarity between observations in two different clusters.
Show Slide Types of Linkage	Average: It is the mean pairwise dissimilarity between observations in two different clusters. Centroid: It is the dissimilarity between centroids of two different clusters.
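Optional note for learners: the four linkage definitions above can be checked by hand on a toy example. The following is a minimal sketch in R, assuming two small hypothetical clusters A and B whose coordinates are chosen only for illustration. It is not part of the HClust.R script.

```r
# Two small hypothetical clusters of 2-D points (values chosen only for illustration)
A <- matrix(c(0, 0,
              0, 1), ncol = 2, byrow = TRUE)
B <- matrix(c(3, 0,
              4, 0), ncol = 2, byrow = TRUE)

# Euclidean distances between all points, then keep only the A-to-B block
all_dist <- as.matrix(dist(rbind(A, B)))
rows_A <- seq_len(nrow(A))
cols_B <- nrow(A) + seq_len(nrow(B))
pairwise <- all_dist[rows_A, cols_B]

complete_link <- max(pairwise)    # maximum pairwise dissimilarity
single_link   <- min(pairwise)    # minimum pairwise dissimilarity
average_link  <- mean(pairwise)   # mean pairwise dissimilarity
# Dissimilarity between the centroids of the two clusters
centroid_link <- sqrt(sum((colMeans(A) - colMeans(B))^2))

print(c(complete = complete_link, single = single_link,
        average = average_link, centroid = centroid_link))
```

In hclust(), these four choices correspond to the method values "complete", "single", "average" and "centroid".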
	Now we will learn some advantages of using Hierarchical clustering.
Show slide Advantages of Hierarchical Clustering	It gives homogeneous clusters. One can decide the number of clusters based on dendrograms. It is mathematically very easy to understand.
	Now let us learn a few applications of Hierarchical Clustering.
Show Slide Applications of Hierarchical Clustering	It can be used to cluster shoppers based on past shopping history. It can be used to build phylogenetic trees that show evolutionary relationships.
Show Slide Hierarchical Clustering	Now let us implement Hierarchical clustering on the iris dataset.
Show slide Download Files	For this tutorial, we will use a script file HClust.R. Please download this file from the Code files link of this tutorial. Make a copy and then use it for practising.
[Computer screen] Highlight HClust.R and the folder HClust.	I have downloaded and moved this file to the HClust folder. This folder is located in the MLProject folder on my Desktop. I have also set the HClust folder as my working directory.
	Let us switch to RStudio.
Click HClust.R in RStudio Point to HClust.R in RStudio.	Open the script HClust.R in RStudio. For this, click on the script HClust.R. The script HClust.R opens in RStudio.
[RStudio] Highlight library(ggplot2)	We will use the ggplot2 package for plotting our results. Since I have already installed the package I will directly import it.
[RStudio] library(ggplot2) data(iris)	Select and run these commands.
Click iris in the Environment tab to load the iris dataset.	Click iris in the Environment tab to load the iris dataset.
	Now let us create our model.
[RStudio] set.seed(121) model <- hclust(dist(iris[, 3:4])) plot(model)	Type these commands.
Highlight set.seed(121) Highlight model <- hclust(dist(iris[, 3:4])) Highlight dist(iris[, 3:4]) Select and click on the Run button.	We set a seed for reproducible results. This is the command to create a hierarchical clustering model. We will use petal length and petal width as our two features. Hierarchical clustering works by finding the two closest clusters. For that we use the dist() function. This function finds the distance between every pair of iris data points. Select and run the commands.
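Optional note for learners: to see what dist() and hclust() produce in the step above, the objects can be inspected directly. This is a minimal sketch that uses only base R and the built-in iris dataset. The variable names d and model_check are chosen here for illustration and do not appear in HClust.R.

```r
data(iris)

# Pairwise distances between all 150 flowers, using petal length and petal width
d <- dist(iris[, 3:4])

attr(d, "method")   # distance measure used by dist() when none is specified
length(d)           # one distance per pair of observations: 150 * 149 / 2

# Build the hierarchy from the distance object
model_check <- hclust(d)

model_check$method      # linkage used by hclust() when no method is given
dim(model_check$merge)  # one row per merge step: 149 merges for 150 points
```

With no extra arguments, dist() computes Euclidean distances and hclust() uses complete linkage.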
Point to the Dendrogram. Drag boundary to see the plot window clearly.	A dendrogram is seen in the plot window. Drag boundaries to see the plot clearly.
Highlight output in plot window Point to the dendrogram.	This is the dendrogram produced. We can see that there are mainly 3 to 4 clusters. Let us cut our dendrogram at the required number of clusters.
Drag boundaries to see the Source window clearly.	Drag boundaries to see the Source window clearly.
[RStudio] cut_1 <- cutree(model, 3) Highlight cutree(model, 3) Click the Save and Run buttons.	Type this command. We use the cutree() function to cut our dendrogram. We pass two parameters - a model and the number of clusters we want. Save and run the command. We can see the details in the Environment tab.
	Now we shall tabulate the results of our model.
[RStudio] table(cut_1, iris$Species) Highlight output in console	Type this command. Run the command to see the output in the console window. Our model clearly distinguishes setosa and virginica. But it has problems when it needs to isolate versicolor species.
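Optional note for learners: the effect of cutree() can also be seen on the dendrogram itself. The following is a minimal sketch, assuming the same two petal features as in the tutorial. The name cut_k is chosen here for illustration. rect.hclust() is a base R function that draws a box around each cluster of an existing dendrogram plot.

```r
data(iris)
model <- hclust(dist(iris[, 3:4]))

# Assign each of the 150 observations to one of 3 clusters
cut_k <- cutree(model, k = 3)
table(cut_k)   # number of observations in each cluster

# Draw the dendrogram without leaf labels, then outline the 3 clusters
plot(model, labels = FALSE)
rect.hclust(model, k = 3, border = "red")
```

cutree() returns one cluster number per observation, in the same order as the rows of the data.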
	Let us now try to improve our model.
[RStudio] model_improved <- hclust(dist(iris[, 3:4]), method = 'average') plot(model_improved) Highlight method = 'average' Click the Run button.	Type these commands. This command uses the average linkage method, instead of the default complete linkage, to find the clusters. Select and run the commands.
Point to the Dendrogram. Drag boundary to see the plot window clearly.	A dendrogram is seen in the plot window. Drag boundaries to see the plot window clearly.
Highlight output in plot window. Drag boundaries to see the Source window clearly.	We can see that there are 3 clear clusters in the dendrogram. Let us cut the tree accordingly. Drag boundaries to see the Source window clearly.
[RStudio] cut_2 <- cutree(model_improved, 3) table(cut_2, iris$Species) Click the Save and Run buttons. Highlight output in console	Type these commands. Save and run the commands. We see that the new model has misclassified only 6 samples.
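Optional note for learners: the count of misclassified samples can be computed instead of being read off the table. The following is a minimal sketch, assuming that each cluster is labelled with the species that occurs most often in it. The helper count_mismatches is hypothetical and is not part of HClust.R.

```r
data(iris)
d <- dist(iris[, 3:4])

# Hypothetical helper: cluster with the given linkage, cut into k clusters,
# and count the points that do not belong to the majority species of their cluster
count_mismatches <- function(method, k = 3) {
  clusters <- cutree(hclust(d, method = method), k = k)
  tab <- table(clusters, iris$Species)
  sum(tab) - sum(apply(tab, 1, max))
}

# Compare three linkage methods on the same distance matrix
mismatches <- sapply(c("complete", "single", "average"), count_mismatches)
print(mismatches)
```

The same helper can be reused in the assignment to compare linkage methods on another dataset.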
	Now let us plot our results.
[RStudio] ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) + geom_point(size = 5) + geom_point(col = cut_2)	Type this command. Save and run the command.
Drag boundary to see the plot window clearly.	Drag boundaries to see the plot clearly.
Highlight output in plot window	In this plot, the large points are coloured by species and the small points by cluster. The clusters produced largely match the 3 species. We can see that setosa is clearly separated. Most of the versicolor and virginica data points are also assigned to the correct clusters.
Only Narration.	With this we come to the end of this tutorial. Let us summarise.
Show Slide Summary	In this tutorial we have learnt about: Hierarchical Clustering, types of Hierarchical Clustering, advantages of Hierarchical Clustering, linkage and its types, applications of Hierarchical Clustering, and Hierarchical Clustering on the iris dataset.
Show Slide Assignment	Now we will suggest an assignment for this Spoken Tutorial. Apply hierarchical clustering on the PimaIndiansDiabetes2 dataset. Install and import the mlbench package. Run the data("PimaIndiansDiabetes2") command to load the dataset. Compare the various linkage methods.
Show Slide About the Spoken Tutorial Project	The video at the following link summarises the Spoken Tutorial project. Please download and watch it.
Show Slide Spoken Tutorial Workshops	We conduct workshops using Spoken Tutorials and give certificates. For more details, please contact us.
Show Slide Spoken Tutorial Forum to answer questions	Please post your timed queries in this forum.
Show Slide Forum to answer questions	Do you have any general/technical questions? Please visit the forum given in the link.
Show Slide Textbook Companion	The FOSSEE team coordinates the coding of solved examples of popular books and case study projects. We give certificates to those who do this. For more details, please visit these sites.
Show Slide Acknowledgment	The Spoken Tutorial and FOSSEE projects are funded by the Ministry of Education, Govt. of India.
Show Slide Thank You	This tutorial is contributed by Tanmay Srinath and Madhuri Ganapathi from IIT Bombay. Thank you for watching.

Contributors and Content Editors

Madhurig, Nancyvarkey
