Machine-Learning-using-R - old 2022/C3/K-Means-Clustering-in-R/English

From Script | Spoken-Tutorial

Title of the script: k-means Clustering in R.

Author: Tanmay Srinath and Sudhakar Kumar (IIT Bombay)

Keywords: R, RStudio, machine learning, unsupervised learning, clustering, k-means clustering, k-means++, video tutorial.


Visual Cue | Narration
Show slide

Opening Slide

Welcome to this spoken tutorial on K-Means Clustering in R.
Show slide

Learning Objectives

In this tutorial, we will learn about:
  • k-means Clustering
  • Benefits of k-means Clustering
  • Applications of k-means Clustering
  • k-means++ clustering
  • Different k-means++ models on iris data.


Show slide

System Specifications

This tutorial is recorded using,
  • Ubuntu Linux OS version 20.04
  • R version 4.1.2
  • RStudio version 1.4.1717

It is recommended to install R version 4.1.0 or higher.

Show slide

Prerequisites


https://spoken-tutorial.org/

To follow this tutorial, the learner should know:
  • Basics of R programming.
  • Basics of Machine Learning.

If not, please access the relevant tutorials on this website.

Show slide

k-means Clustering

k-means Clustering
  • It partitions n observations into k clusters.
  • Observations within each cluster are therefore relatively homogeneous.
  • Each observation belongs to the cluster with the nearest cluster mean (centroid).

k-means Clustering is a powerful algorithm.

Here we will learn some benefits of using it.
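As a quick illustration (not part of the downloadable script), the idea can be tried with base R's built-in kmeans() function, which uses standard random initialisation:

```r
# Standard k-means with base R's kmeans() (package stats, nothing to install).
# We ask for 3 clusters on two numeric columns of the built-in iris data.
data(iris)
set.seed(42)                                  # reproducible initialisation
fit <- kmeans(iris[, c("Sepal.Length", "Sepal.Width")], centers = 3)

fit$centers          # one mean (centroid) per cluster
table(fit$cluster)   # how many observations fell into each cluster
```

Each of the 150 observations is assigned to exactly one of the 3 clusters.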
Show slide

Benefits of k-means Clustering

  • k-means Clustering is relatively simple to implement.
  • k-means Clustering scales well on large datasets.
Now let us learn a few applications of k-means Clustering.
Show Slide

Applications of k-means Clustering


  • Customer Segmentation.

https://archive.ics.uci.edu/ml/datasets/online+retail

  • Ailment Diagnosis.

https://archive.ics.uci.edu/ml/datasets/ILPD+(Indian+Liver+Patient+Dataset)

Show Slide

Optimising k-means

  • The basic form of k-means clustering can converge to poor solutions.
  • Its results depend heavily on how the initial cluster centers are chosen.
  • To overcome this drawback, we will use an optimised algorithm, k-means++.
Show Slide

k-means++ algorithm

  • It is an algorithm for choosing the initial centroid locations.
  • The first center is chosen uniformly at random.
  • Each subsequent center is selected with a certain probability.
  • This probability is proportional to the squared distance from the closest center already chosen.
  • By avoiding purely random initialisation, it converges faster and tends to find better clusters.
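The seeding rule above can be sketched in a few lines of base R. This is an illustrative sketch only (the function name kmeanspp_seeds is ours); it is not the implementation used by the LICORS package later in this tutorial:

```r
# k-means++ seeding: the first center is chosen uniformly at random; each
# later center is drawn with probability proportional to the squared
# distance to its nearest already-chosen center.
kmeanspp_seeds <- function(X, k) {
  X <- as.matrix(X)
  centers <- X[sample(nrow(X), 1), , drop = FALSE]
  while (nrow(centers) < k) {
    # squared distance of every point to its closest current center
    d2 <- apply(X, 1, function(x) min(colSums((t(centers) - x)^2)))
    pick <- sample(nrow(X), 1, prob = d2)   # distance-weighted draw
    centers <- rbind(centers, X[pick, , drop = FALSE])
  }
  centers
}

data(iris)
set.seed(1)
seeds <- kmeanspp_seeds(iris[, 3:4], 3)
```

These seeds can then be handed to kmeans(iris[, 3:4], centers = seeds) to run the usual k-means iterations from a good starting point.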


Show Slide

k-means++ Model

In this tutorial we will create 3 k-means++ models and compare their results.

Now let us implement k-means++ on the iris dataset.

Show Slide

Download Files

For this tutorial, we will use a script file K-means.R.

Please download this file from the Code files link of this tutorial.

Make a copy and then use it for practising.

[Computer screen]

Highlight K-means.R and the folder K-means

I have downloaded and moved this file to the K-means folder.

This folder is located in the MLProject folder on my Desktop.

I have also set the K-means folder as my Working Directory.

Let us switch to RStudio.
Click K-means.R in RStudio

Point to K-means.R in RStudio.

Open the script K-means.R in RStudio.

The script K-means.R opens in RStudio.

[RStudio]

Highlight library(LICORS)

We will use the LICORS package for creating our k-means++ models.


[RStudio]

library(LICORS)

data(iris)

Since I have already installed the package, I will import it directly.

If you don’t have the LICORS library, install it before importing.
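If needed, the install-then-load step can be written defensively as below. This snippet is a generic R pattern, not part of the downloaded script:

```r
# Install LICORS from CRAN only if it is not already available, then load it.
if (!requireNamespace("LICORS", quietly = TRUE)) {
  install.packages("LICORS")
}
library(LICORS)
```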

[RStudio]

library(LICORS)

data(iris)

Select and run these commands.
Click in the Environment tab to load the iris dataset. Click in the Environment tab to load the iris dataset.
Point to Sepal.Length and Sepal.Width columns. Now let us create our first k-means++ model.

We will use sepal length and sepal width to separate different species.

[RStudio]

set.seed(121)

km_1=kmeanspp(iris[,1:2],3,iter.max=100,nstart=10)

print(km_1)

Type these commands.
Highlight set.seed(121)


km_1=kmeanspp(iris[,1:2],3,iter.max=100,nstart=10)


Highlight iris[,1:2]

Highlight 3

Highlight iter.max=100

We set a seed for reproducible results.


This command creates a k-means++ model.


This subsets the sepal length and sepal width columns from the iris dataset.

3 denotes the number of clusters of our model.


This denotes the maximum number of iterations for which our model will train.

Click the Run button. Run these commands.

The output is shown in the console window.

Drag boundary to see the console window clearly. Drag boundary to see the console window clearly.
Highlight output in console


Highlight Cluster means


Highlight Clustering vector


Highlight Within cluster sum of squares by cluster

Our model’s specifications are displayed here.


This tells us the location of each centroid.


This gives us the predicted classes of species.


These are the sum of squared distances within clusters.
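As a sanity check, those per-cluster sums of squares can be recomputed by hand. The sketch below uses base R's kmeans() so it runs without LICORS; the same arithmetic applies to a kmeanspp() fit:

```r
data(iris)
set.seed(121)
fit <- kmeans(iris[, 1:2], centers = 3, nstart = 10)

# For each cluster: sum of squared distances of its points to its centroid
wss <- sapply(seq_len(3), function(j) {
  pts <- iris[fit$cluster == j, 1:2]
  sum(sweep(pts, 2, fit$centers[j, ])^2)
})

all.equal(unname(wss), fit$withinss)   # matches the reported withinss values
```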

Drag boundary to see the Source window clearly. Drag boundary to see the Source window clearly.
Cursor in the Source window. Now we shall tabulate the results of our model.
[RStudio]

table(km_1$cluster,iris$Species)


Click on Save and Run buttons.

Type this command.


Save and run the command to see the output in the console window.

Highlight output in console. Our model has misclassified 27 data points.


This shows that sepal length and sepal width alone do not separate the species well.
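Because cluster numbers are arbitrary labels, a convenient way to count misclassifications from such a table is to subtract each cluster's majority species from the total. The helper name below is ours, for illustration:

```r
# Count points that do not belong to the majority species of their cluster
misclassified <- function(clusters, truth) {
  tab <- table(clusters, truth)
  sum(tab) - sum(apply(tab, 1, max))
}
```

For example, misclassified(km_1$cluster, iris$Species) gives the misclassification count read off the table above.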

Point to Petal.Length and Petal.Width columns. Let us now try another model.


This time, we will use petal length and petal width as our parameters.

[RStudio]

km_2=kmeanspp(iris[,3:4],3,iter.max=100)


table(km_2$cluster,iris$Species)

Type the following commands


Save and run the commands.

Highlight output in console The output is shown in the console window.


Here our model has misclassified only 6 data points.


It shows that petal length and petal width distinguish the species accurately.

Cursor in the Source window. Finally, let us create a model that uses all the 4 dimensions of an iris flower.
[RStudio]

km_3=kmeanspp(iris[,1:4],3,iter.max=100)


table(km_3$cluster,iris$Species)

Type these commands.


Run the commands to see the output in the console window.

Highlight output in console This model cannot properly distinguish between versicolor and virginica.


It shows that using all four parameters for k-means++ is actually detrimental here.


We have trained 3 models to illustrate how classification depends on parameters chosen.


k-means++ works best when the petal length and petal width of the iris flower are chosen.
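The three comparisons can also be condensed into one loop. The sketch below substitutes base kmeans() with many restarts for kmeanspp(), so it runs without extra packages; the exact error counts may differ slightly from the tutorial's:

```r
data(iris)
subsets <- list(sepal = 1:2, petal = 3:4, all = 1:4)

set.seed(121)
errors <- sapply(subsets, function(cols) {
  fit <- kmeans(iris[, cols], centers = 3, iter.max = 100, nstart = 25)
  tab <- table(fit$cluster, iris$Species)
  sum(tab) - sum(apply(tab, 1, max))   # points outside their cluster's majority
})

errors   # the petal-based model should show the fewest errors
```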

Only Narration With this we come to the end of this tutorial.

Let us summarise.

Show Slide

Summary

In this tutorial we have learnt about:
  • k-means Clustering
  • Benefits of k-means Clustering
  • Applications of k-means Clustering
  • k-means++ clustering
  • Different k-means++ models on iris data
Show Slide

Assignment

Now we will suggest the assignment for this Spoken Tutorial.
  • Apply k-means++ on the PimaIndiansDiabetes2 dataset.
  • Install and import the mlbench package.
  • Run the data("PimaIndiansDiabetes2") command to load the dataset.
  • Compare models built with different input parameters.


Show slide

About the Spoken Tutorial Project

The video at the following link summarises the Spoken Tutorial project.

Please download and watch it.

Show slide

Spoken Tutorial Workshops

We conduct workshops using Spoken Tutorials and give certificates.


For more details, please contact us.

Show Slide

Spoken Tutorial Forum to answer questions

Please post your timed queries in this forum.
Show Slide

Forum to answer questions

Do you have any general/technical questions?

Please visit the forum given in the link.

Show Slide

Textbook Companion

The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.


We give certificates to those who do this.


For more details, please visit these sites.

Show Slide

Acknowledgment

The Spoken Tutorial and FOSSEE projects are funded by the Ministry of Education, Government of India.
Show Slide

Thank You

This tutorial is contributed by Tanmay Srinath and Madhuri Ganapathi from IIT Bombay.

Thank you for watching.

Contributors and Content Editors

Madhurig, Nancyvarkey