Machine-Learning-using-R - old 2022/C3/K-Means-Clustering-in-R/English

Title of the script: k-means Clustering in R.

Author: Tanmay Srinath and Sudhakar Kumar (IIT Bombay)

Keywords: R, RStudio, machine learning, supervised, unsupervised, classification, k-means clustering, k-means++, video tutorial.

Visual Cue	Narration
Show slide Opening Slide	Welcome to this spoken tutorial on k-means Clustering in R.
Show slide Learning Objectives	In this tutorial, we will learn about: k-means Clustering, benefits of k-means Clustering, applications of k-means Clustering, k-means++ clustering, and different k-means++ models on the iris data.
Show slide System Specifications	This tutorial is recorded using: Ubuntu Linux OS version 20.04, R version 4.1.2, and RStudio version 1.4.1717. It is recommended to install R version 4.1.0 or higher.
Show slide Prerequisites https://spoken-tutorial.org/	To follow this tutorial, the learner should know: Basics of R programming. Basics of Machine Learning. If not, please access the relevant tutorials on this website.
Show slide k-means Clustering	k-means Clustering partitions n observations into k clusters, so that the observations within each cluster are similar to one another. Each observation belongs to the cluster with the nearest cluster mean. k-means Clustering is a powerful algorithm.
	Here we will learn some benefits of using it.
Show slide Benefits of k-means Clustering	k-means Clustering is relatively simple to implement. k-means Clustering scales well on large datasets.
	Now let us learn a few applications of k-means Clustering.
Show Slide Applications of k-means Clustering Customer Segmentation. https://archive.ics.uci.edu/ml/datasets/online+retail Ailment Diagnosis. https://archive.ics.uci.edu/ml/datasets/ILPD+(Indian+Liver+Patient+Dataset)	Customer Segmentation. Ailment Diagnosis.
Show Slide Optimising k-means	The basic form of k-means clustering does not guarantee an optimal result. Its output depends heavily on the initial choice of cluster centers. To overcome this drawback, we will use k-means++, which improves the initialisation.
Show Slide k-means++ algorithm	k-means++ is an algorithm for choosing the initial centroid locations. The first center is chosen at random from the data points. Each subsequent center is selected from the data points with a certain probability. In the standard formulation, this probability is proportional to the squared distance from the closest center already chosen. By spreading the initial centers apart, it usually converges faster than purely random initialisation.
Show Slide k-means++ Model	In this tutorial we will create 3 k-means++ models and compare their results. Now let us implement k-means++ on the iris dataset.
Show Slide Download Files	For this tutorial, we will use a script file K-means.R. Please download this file from the Code files link of this tutorial. Make a copy and then use it for practising.
[Computer screen] Highlight K_means.R and the folder K-means	I have downloaded and moved this file to the K-means folder. This folder is located in the MLProject folder on my Desktop. I have also set the K-means folder as my Working Directory.
	Let us switch to RStudio.
Click Kmeans.R in RStudio Point to Kmeans.R in RStudio.	Open the script K-means.R in RStudio. The script Kmeans.R opens in RStudio.
[RStudio] Highlight library(LICORS)	We will use the LICORS package for creating our k-means++ models.
[RStudio] library(LICORS) data(iris)	Since I have already installed the package, I will load it directly. If you don’t have the LICORS package, install it before loading it.
[RStudio] library(LICORS) data(iris)	Select and run these commands.
Click in the Environment tab to load the iris dataset.	Click in the Environment tab to load the iris dataset.
Point to Sepal.Length and Sepal.Width columns.	Now let us create our first k-means++ model. We will use sepal length and sepal width to separate different species.
[RStudio] set.seed(121) km_1=kmeanspp(iris[,1:2],3,iter.max=100,nstart=10) print(km_1)	Type these commands.
Highlight set.seed(121) km_1=kmeanspp(iris[,1:2],3,iter.max=100) Highlight iris[,1:2] Highlight 3 Highlight iter.max=100	We set a seed for reproducible results. This command creates a k-means++ model. This selects the sepal length and sepal width columns from the iris dataset. 3 denotes the number of clusters in our model. This denotes the maximum number of iterations that the algorithm is allowed to run.
Click the Run button.	Run these commands. The output is shown in the console window.
Drag boundary to see the console window clearly.	Drag boundary to see the console window clearly.
Highlight output in console Highlight Cluster means Highlight Clustering vector Highlight Within cluster sum of squares by cluster	Our model’s specifications are displayed here. This tells us the location of each centroid. This gives us the cluster assigned to each observation. These are the sums of squared distances within each cluster.
Drag boundary to see the Source window clearly.	Drag boundary to see the Source window clearly.
Cursor in the Source window.	Now we shall tabulate the results of our model.
[RStudio] table(km_1$cluster,iris$Species) Click on Save and Run buttons.	Type this command. Save and run the command to see the output in the console window.
Highlight output in console.	Our model has misclassified 27 data points. It shows that sepal length and sepal width are not good parameters for separating the species.
Point to Petal.Length and Petal.Width columns.	Let us now try another model. This time, we will use petal length and petal width as our parameters.
[RStudio] km_2=kmeanspp(iris[,3:4],3,iter.max=100) table(km_2$cluster,iris$Species)	Type the following commands. Save and run the commands.
Highlight output in console	The output is shown in the console window. Here our model has misclassified only 6 data points. It shows that petal length and petal width distinguish the species much more accurately.
Cursor in the Source window.	Finally, let us create a model that uses all 4 measurements of an iris flower.
[RStudio] km_3=kmeanspp(iris[,1:4],3,iter.max=100) table(km_3$cluster,iris$Species)	Type these commands. Run the commands to see the output in the console window.
Highlight output in console	This model cannot properly distinguish between versicolor and virginica. It shows that using all the parameters for k-means++ is detrimental on this dataset. We have trained 3 models to illustrate how the clustering depends on the parameters chosen. k-means++ works best here when the petal length and petal width of the iris flower are chosen.
Only Narration	With this we come to the end of this tutorial. Let us summarise.
Show Slide Summary	In this tutorial we have learnt about: k-means Clustering, benefits of k-means Clustering, applications of k-means Clustering, k-means++ clustering, and different k-means++ models on the iris data.
Show Slide Assignment	Now we will suggest the assignment for this Spoken Tutorial. Apply k-means++ on the PimaIndiansDiabetes dataset. Install and import the mlbench package. Run the data("PimaIndiansDiabetes2") command to load the dataset. Compare the models built with different input parameters.
Show slide About the Spoken Tutorial Project	The video at the following link summarises the Spoken Tutorial project. Please download and watch it.
Show slide Spoken Tutorial Workshops	We conduct workshops using Spoken Tutorials and give certificates. For more details, please contact us.
Show Slide Spoken Tutorial Forum to answer questions	Please post your timed queries in this forum.
Show Slide Forum to answer questions	Do you have any general/technical questions? Please visit the forum given in the link.
Show Slide Textbook Companion	The FOSSEE team coordinates the coding of solved examples of popular books and case study projects. We give certificates to those who do this. For more details, please visit these sites.
Show Slide Acknowledgment	The Spoken Tutorial and FOSSEE projects are funded by the Ministry of Education, Govt. of India.
Show Slide Thank You	This tutorial is contributed by Tanmay Srinath and Madhuri Ganapathi from IIT Bombay. Thank you for watching.
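
The seeding rule described for k-means++ and the species-versus-cluster tabulation used in this tutorial can be sketched in base R as shown here. This is only a minimal illustration and not the LICORS implementation. The helper name kmeanspp_centers is hypothetical. The sketch assumes the standard k-means++ rule, in which each new center is drawn with probability proportional to the squared distance from the nearest center already chosen, and it passes the seeds to base R's kmeans(). Cluster numbers are arbitrary labels, so the sketch counts mismatches by pairing each cluster with its most frequent species.

```r
# Hypothetical helper: pick k initial centres using k-means++ style seeding.
# Assumption: probability proportional to squared Euclidean distance.
kmeanspp_centers <- function(x, k) {
  x <- as.matrix(x)
  n <- nrow(x)
  idx <- integer(k)
  idx[1] <- sample.int(n, 1)      # first centre: chosen uniformly at random
  d2 <- rep(Inf, n)               # squared distance to the nearest chosen centre
  for (j in seq_len(k - 1)) {
    centre <- x[idx[j], ]
    # update the nearest-centre distance using the centre picked last
    d2 <- pmin(d2, rowSums(sweep(x, 2, centre)^2))
    # far-away points are more likely to become the next centre
    idx[j + 1] <- sample.int(n, 1, prob = d2)
  }
  x[idx, , drop = FALSE]
}

set.seed(121)
x <- iris[, 3:4]                  # petal length and petal width
init <- kmeanspp_centers(x, 3)
km <- kmeans(x, centers = init, iter.max = 100)

# Rows are clusters, columns are species
tab <- table(km$cluster, iris$Species)
print(tab)

# Points outside the most frequent species of their cluster
mismatched <- sum(tab) - sum(apply(tab, 1, max))
print(mismatched)
```

Changing the column selection to iris[, 1:2] or iris[, 1:4] allows the same comparison between parameter choices that was made in this tutorial. The exact counts may differ from those shown with kmeanspp().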

Contributors and Content Editors

Madhurig, Nancyvarkey
