Machine-Learning-using-R - old 2022/C2/Unsupervised-Learning/English

Revision as of 15:34, 22 February 2022 by Nancyvarkey (Talk | contribs)


Title of the script: Unsupervised Learning

Author: Tanmay Srinath

Keywords: R, RStudio, machine learning, load libraries, ggplot2, mclust, unsupervised learning, classification, k-means clustering, iris dataset, adjusted rand index, video tutorial.


Visual Cue Narration
Show Slide

Opening Slide

Welcome to this spoken tutorial on Unsupervised Learning.
Show Slide

Learning Objectives

In this tutorial, we will learn about:
  • Unsupervised Learning and its applications
  • k-means clustering on iris data set
  • Measuring performance using the Adjusted Rand Index.
Show Slide

System Specifications

This tutorial is recorded using:
  • Ubuntu Linux OS version 20.04
  • R version 4.1.2
  • RStudio version 1.4.1717

It is recommended to install R version 4.1.0 or higher.

Show Slide

Prerequisites

To follow this tutorial, the learner should know:
  • Basics of programming in R.
  • Basics of Machine Learning.

If not, please access the relevant tutorials on R on this website.

Let us now learn about Unsupervised learning.
Show Slide

Unsupervised Learning

  • It is a technique that is applied to unlabelled datasets.
  • It uses Machine Learning algorithms to analyse and cluster unlabelled data.
  • It deals with finding groups of data points with similar characteristics.
Show Slide

Types of Unsupervised Learning

Common types of Unsupervised Learning include:
  • Clustering: groups similar items together, for example, grouping search engine results.
  • Anomaly Detection: identifies unusual data points, for example, detecting fraudulent transactions.
Show Slide

k-means Clustering

Now let us implement k-means clustering on the iris data set.


To know more about clustering and its types, please refer to Additional Reading Material.

Show Slide

Download Files

For this tutorial, we will use a script file Clustering.R.


Please download this file from the Code files link of this tutorial.


Make a copy and then use it for practising.

[Computer screen]

Highlight Clustering.R and the folder UnsupervisedLearning

I have downloaded and moved this file to the UnsupervisedLearning folder.


This folder is located in the MLProject folder on my Desktop.


I have also set the UnsupervisedLearning folder as my Working directory.

Let us switch to RStudio.
Double-click Clustering.R

Point to Clustering.R in RStudio.

Open the script Clustering.R in RStudio.


Script Clustering.R opens in RStudio.

[RStudio]

data("iris")

View(iris)

Click the Run button.

Point to the iris data set.

Select the given commands.


Click the Run button to see the iris data table.

[RStudio]

Highlight Iris in source window.

Highlight Species column

Scroll the table to show the species.

Highlight 3 species

Here we are using the labeled iris data set.

It contains 3 species - Setosa, Versicolor and Virginica.


There are 50 samples of each species for a total of 150 samples.
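
The sample counts mentioned above can be checked from the Console. The following is a minimal sketch using only base R; it is an illustration and not part of Clustering.R.

```r
# Load the built-in iris data set into the workspace
data("iris")

# Number of rows (samples) and columns (4 measurements + Species)
dims <- dim(iris)
print(dims)

# Count how many samples belong to each species
species_counts <- table(iris$Species)
print(species_counts)
```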

Show Slide

Posing the Problem

Can we group data based on sepal and petal dimensions?

If so, do these groups represent the original species label accurately?

Show Slide

Solution

The answer to this problem is to use a clustering algorithm.
Show Slide

Finding Number of Clusters

  • For real-life unlabelled data, we should find the optimal number of clusters.
  • For this, we use the Elbow Method.

To know more about it, please refer to the Additional Reading Material.
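
As a rough illustration of the Elbow Method, the sketch below runs k-means for several candidate values of k and plots the total within-cluster sum of squares. The candidate range 1 to 6, the seed, and the variable names k_values and wss are assumptions made for this example; they are not part of Clustering.R.

```r
data("iris")
set.seed(42)  # arbitrary seed, only to make the sketch reproducible

# Candidate numbers of clusters (the range 1 to 6 is an assumption)
k_values <- 1:6

# Total within-cluster sum of squares for each candidate k,
# using Petal.Length and Petal.Width (columns 3 and 4)
wss <- sapply(k_values, function(k) {
  kmeans(iris[, 3:4], centers = k, nstart = 20)$tot.withinss
})

# Plot the sum of squares against k; the "elbow" is the point
# after which adding more clusters gives only a small improvement
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The value of k at the bend of the curve is usually taken as a reasonable number of clusters.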

Let us switch to RStudio.
Highlight Clustering.R in the Source window. In the Source window, click on the script Clustering.R.
[RStudio]


ggplot2

mclust

install.packages()

I will import the necessary packages.


  • ggplot2 to visualize the data
  • mclust to measure the accuracy.

As I have already installed these packages, I will directly import them.


If you have not installed these libraries, please install them using the install.packages() function.

Type

library(ggplot2)

library(mclust)


Press Ctrl + S keys to save the script.

At the top of the script, type the following commands.


Press Ctrl + S keys to save the script.

Highlight

library(ggplot2)

library(mclust)

Click the Run button.

Click the Run button to load these libraries.


[RStudio]

Highlight

Iris data set in the source window.


Click on the iris data set.


For the k-means clustering, we need to select two features of the data set.


So, we need to find the features that separate the data points clearly.


Hence, we will plot two graphs:

  • Sepal Length versus Sepal Width
  • Petal Length versus Petal Width
Point to Sepal Length versus Sepal Width.


[RStudio]

Type

ggplot(iris,aes(x = Sepal.Length, y = Sepal.Width, col= Species)) + geom_point()


Click on Save button.

First, we will plot Sepal Length versus Sepal Width.

Type the following command.

Save the script.

Highlight

ggplot(iris,aes(x = Sepal.Length, y = Sepal.Width, col= Species)) + geom_point()


Click the Run button.

Select this command and run it to get a plot.
Drag Boundary I will drag the boundary to see the plot clearly.
Highlight

Output in Plot Window

As we can see, setosa is clearly distinguished.


But virginica and versicolor are not clearly separated.

Drag the boundary. Drag the boundary to see the Source window clearly.
Point to Petal Length versus Petal Width.


Type

ggplot(iris,aes(x = Petal.Length, y = Petal.Width, col= Species)) + geom_point()

Now we will plot Petal Length versus Petal Width.


Type the following command.

Highlight

ggplot(iris,aes(x = Petal.Length, y = Petal.Width, col= Species)) + geom_point()


Click on Save and Run buttons.

Save and run this command to see the plot.
Drag boundary to see the plot window clearly. Drag the boundary to see the plot window clearly.
Highlight Output in Plot Window Notice that Petal Length and Petal Width clearly separate the 3 species of iris.


Hence we should choose these two features.
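
The impression from the plot can be cross-checked numerically. The sketch below, an illustration that is not part of Clustering.R, uses base R to print the smallest and largest value of each petal measurement per species. The names length_ranges and width_ranges are chosen only for this example.

```r
data("iris")

# Row 1 = minimum, row 2 = maximum Petal.Length, one column per species
length_ranges <- sapply(split(iris$Petal.Length, iris$Species), range)
print(length_ranges)

# Same summary for Petal.Width
width_ranges <- sapply(split(iris$Petal.Width, iris$Species), range)
print(width_ranges)
```

The setosa range does not overlap with the other two species. The versicolor and virginica ranges overlap slightly at their boundary, so a few samples near that boundary can be expected to fall into the wrong cluster.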

Drag the boundary. Drag the boundary to see the Source window clearly.
Highlight Iris in source window


Highlight Species column

In this data set we have three species of iris.


In this case we already know that we need three clusters.


Hence, we will not be using the Elbow method.


To know more about it, please refer to the Additional Reading Material on this tutorial page.

[RStudio]

Type

km <- kmeans(iris[,3:4],3,nstart = 20)

Now let’s use the kmeans() function to perform k-means clustering.


Type the following command.

Highlight iris[,3:4]


Highlight nstart=20

Let me explain the parameters of this function.


The first argument, iris[,3:4], selects Petal Length and Petal Width for k-means clustering.


The second argument, 3, is the number of clusters.


The nstart = 20 argument is used to find the best starting points of the centroids.

[RStudio]

Click on Save and Run buttons.


Highlight km <- kmeans(iris[,3:4],3,nstart = 20)

Save the command and run it.


It runs the algorithm 20 times, each time starting from randomly chosen centroids.


Then it keeps the run with the lowest total within-cluster sum of squares.
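
To see what the fitted object contains, the following sketch repeats the clustering and prints a few of its components. The call to set.seed() with the value 123 is an assumption added here so that repeated runs give the same cluster numbering; it is not part of Clustering.R.

```r
data("iris")
set.seed(123)  # arbitrary seed (an assumption) for reproducible results

# Cluster on Petal.Length and Petal.Width (columns 3 and 4) into 3 groups,
# trying 20 random sets of starting centroids and keeping the best run
km <- kmeans(iris[, 3:4], centers = 3, nstart = 20)

print(km$centers)       # coordinates of the 3 final centroids
print(km$size)          # number of points in each cluster
print(km$tot.withinss)  # total within-cluster sum of squares of the kept run
```

The cluster numbers 1, 2 and 3 are arbitrary labels, so they may be assigned to the species in a different order from one run to another.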

[RStudio]

Click the iris data set table.

We need to analyze our model to determine its performance.


Since the iris data set used here has labels, we can tabulate the clusters against the species labels.

[RStudio]

Type

table(km$cluster,iris$Species)

Click on Save and Run buttons.

Type the following command.

Save the script and run the command.

Drag boundary to see the console window. Drag the boundary to see the Console window clearly.
[RStudio]

Highlight Row and Column in output

This table compares the clusters found by k-means with the actual species.


Each row contains the data points of one cluster.


Each column corresponds to one species.


Hence, each cell gives the number of points of a particular species assigned to a particular cluster.

[RStudio]

Highlight Output in the Console


Point to Setosa, versicolor and virginica.

50 Setosa samples have been clustered together.


2 Versicolor samples have been incorrectly clustered.


4 Virginica samples have been incorrectly clustered.


Overall, the model has misclassified only 6 samples.
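
The count of misclassified samples can also be derived from the table in code. The sketch below assumes that each cluster stands for the species that appears most often in it; the seed and the names tab, correct and misclassified are chosen only for this illustration.

```r
data("iris")
set.seed(123)  # arbitrary seed (an assumption) for reproducible results
km <- kmeans(iris[, 3:4], centers = 3, nstart = 20)

# Cross-tabulate cluster numbers (rows) against species (columns)
tab <- table(km$cluster, iris$Species)

# Treat the largest count in each row as that cluster's matching species
correct <- sum(apply(tab, 1, max))

# All remaining points sit in a cluster dominated by another species
misclassified <- sum(tab) - correct
print(misclassified)
```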

Cursor in RStudio window. Now we will measure how well the clusters agree with the species labels.


For this we will use the Adjusted Rand Index.

Show Slide

Adjusted Rand Index

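  • This is a measure of similarity between two clusterings of the same data.
  • Its maximum value is +1, which means the two clusterings agree perfectly.
  • Values near 0 mean the agreement is no better than chance, and negative values mean it is worse than chance.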
Let us switch to RStudio.
[RStudio]

Point to mclust library.


Type

adjustedRandIndex(km$cluster, iris$Species)


The mclust library contains the adjustedRandIndex() function to calculate the Adjusted Rand Index.


Type the following command.

Click on Save and Run buttons. Save and run the command.
[RStudio]

Highlight Output on Console

The Adjusted Rand Index is about 0.89, which is close to 1.


This means that our model has performed very well on this data set.
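
To make the index less of a black box, the sketch below reproduces the calculation by hand from the table of clusters against species, using only base R. It counts pairs of points that are grouped together, then corrects for the agreement expected by chance. The seed and all variable names are assumptions made for this illustration, and the result should be close to the value reported by adjustedRandIndex().

```r
data("iris")
set.seed(123)  # arbitrary seed (an assumption) for reproducible results
km <- kmeans(iris[, 3:4], centers = 3, nstart = 20)
tab <- table(km$cluster, iris$Species)

# Pairs of points that share both a cluster and a species
sum_cells <- sum(choose(as.vector(tab), 2))

# Pairs that share a cluster, and pairs that share a species
sum_rows <- sum(choose(rowSums(tab), 2))
sum_cols <- sum(choose(colSums(tab), 2))

# Agreement expected by chance, and the maximum possible agreement
expected <- sum_rows * sum_cols / choose(sum(tab), 2)
maximum  <- (sum_rows + sum_cols) / 2

# Adjusted Rand Index: chance-corrected share of agreeing pairs
ari <- (sum_cells - expected) / (maximum - expected)
print(ari)
```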

With this we come to the end of this tutorial. Let us summarize.
Show Slide

Summary

In this tutorial we have learnt:
  • Unsupervised Learning and its applications
  • k-means clustering on iris data set
  • Measuring performance using the Adjusted Rand Index.
Here is an assignment for you.
Show Slide

Assignment

  • Using the inbuilt PlantGrowth data set, perform k-means clustering.
  • Evaluate the result using the Adjusted Rand Index.
Show Slide

About the Spoken Tutorial Project

The video at the following link summarises the Spoken Tutorial project.

Please download and watch it.

Show Slide

Spoken Tutorial Workshops

We conduct workshops using Spoken Tutorials and give certificates.


Please contact us.

Show Slide

Spoken Tutorial Forum to answer questions


Do you have questions in THIS Spoken Tutorial?

Please visit this site.

Choose the minute and second where you have the question.

Explain your question briefly.

The FOSSEE project will ensure an answer.

You will have to register to ask questions.

Show Slide

Forum to answer questions

Do you have any general/technical questions?

Please visit the forum given in the link.

Show Slide

Textbook Companion


The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.


We give certificates to those who do this.


For more details, please visit these sites.

Show Slide

Acknowledgment

The Spoken Tutorial and FOSSEE projects are funded by the Ministry of Education, Govt. of India.
Show Slide

Thank You

This tutorial is contributed by Tanmay Srinath and Madhuri Ganapathi from IIT Bombay. Thank you for watching.

Contributors and Content Editors

Madhurig, Nancyvarkey