Difference between revisions of "Machine-Learning-using-R - old 2022/C2/Unsupervised-Learning/English"
(Created page with "'''Title of the script''': '''Unsupervised Learning''' '''Author''': Tanmay Srinath '''Keywords''': R, RStudio, machine learning, load libraries, ggplot2, mclust, unsupervis...") |
Nancyvarkey (Talk | contribs) m (Nancyvarkey moved page Machine-Learning-using-R/C2/Unsupervised-Learning/English to Machine-Learning-using-R - old 2022/C2/Unsupervised-Learning/English without leaving a redirect: Archiving previous version because new version will be created) |
||
(5 intermediate revisions by 2 users not shown) | |||
Line 3: | Line 3: | ||
'''Author''': Tanmay Srinath | '''Author''': Tanmay Srinath | ||
− | '''Keywords''': R, RStudio, machine learning, load libraries, ggplot2, mclust, unsupervised, classification, k-means clustering, iris dataset, adjusted rand index, video tutorial. | + | '''Keywords''': R, RStudio, machine learning, load libraries, ggplot2, mclust, unsupervised learning, classification, k-means clustering, iris dataset, adjusted rand index, spoken tutorial, video tutorial. |
Line 23: | Line 23: | ||
|| In this tutorial, we will learn about: | || In this tutorial, we will learn about: | ||
* '''Unsupervised Learning ''' and its applications | * '''Unsupervised Learning ''' and its applications | ||
− | * '''k-means clustering '''on '''iris''' | + | * '''k-means clustering '''on '''iris data set''' |
* Measure the performance using '''Adjusted RAND Index'''. | * Measure the performance using '''Adjusted RAND Index'''. | ||
Line 31: | Line 31: | ||
'''System Specifications''' | '''System Specifications''' | ||
|| This tutorial is recorded using, | || This tutorial is recorded using, | ||
− | * '''Ubuntu Linux ''' | + | * '''Ubuntu Linux OS''' version 20.04 |
− | * '''R '''version | + | * '''R ''' version 4.1.2 |
− | * '''RStudio''' version | + | * '''RStudio''' version 1.4.1717 |
− | It is recommended to install '''R''' version | + | It is recommended to install '''R''' version 4.1.0 or higher. |
|- | |- | ||
|| '''Show Slide''' | || '''Show Slide''' | ||
Line 48: | Line 48: | ||
|- | |- | ||
− | || | + | || '''Show Slide''' |
+ | |||
+ | '''Unsupervised Learning''' | ||
|| Let us now learn about '''Unsupervised learning'''. | || Let us now learn about '''Unsupervised learning'''. | ||
|- | |- | ||
Line 55: | Line 57: | ||
'''Unsupervised Learning''' | '''Unsupervised Learning''' | ||
|| | || | ||
− | * It is a technique that is applied on unlabelled datasets. | + | * It is a technique that is applied on unlabelled '''datasets'''. |
− | * It uses '''Machine Learning algorithms''' to analyse and cluster unlabelled data. | + | * It uses '''Machine Learning algorithms''' to analyse and cluster unlabelled '''data'''. |
− | * It deals with finding groups of data points with similar characteristics. | + | * It deals with finding groups of '''data points''' with similar characteristics. |
|- | |- | ||
Line 63: | Line 65: | ||
'''Types of Unsupervised Learning''' | '''Types of Unsupervised Learning''' | ||
− | || Types of '''Unsupervised learning | + | || Types of '''Unsupervised learning'''. |
− | * '''Clustering''': it is used for grouping search engine results. | + | * '''Clustering''': it is used for grouping '''search engine''' results. |
* '''Anomaly Detection''': It is used to detect fraudulent transactions. | * '''Anomaly Detection''': It is used to detect fraudulent transactions. | ||
− | |||
|- | |- | ||
Line 73: | Line 74: | ||
'''k-means Clustering''' | '''k-means Clustering''' | ||
− | || Now let us implement '''k-means clustering''' on the '''iris | + | || Now let us implement '''k-means clustering''' on the '''iris data set.''' |
Line 81: | Line 82: | ||
'''Download Files''' | '''Download Files''' | ||
− | || For this tutorial, we will use a script file''' Clustering.R'''. | + | || For this tutorial, we will use a '''script''' file''' Clustering.R'''. |
− | Please download this file from the''' Code files''' link of this tutorial. | + | Please download this file from the ''' Code files''' link of this tutorial. |
Line 95: | Line 96: | ||
− | This folder is located in the '''MLProject''' folder on my Desktop. | + | This folder is located in the '''MLProject''' folder on my '''Desktop'''. |
− | I have also set the '''UnsupervisedLearning''' folder as my | + | I have also set the '''UnsupervisedLearning''' folder as my '''Working directory'''. |
|- | |- | ||
− | || | + | || Cursor in the '''UnsupervisedLearning''' folder. |
|| Let us switch to '''RStudio'''. | || Let us switch to '''RStudio'''. | ||
|- | |- | ||
Line 106: | Line 107: | ||
Point to '''Clustering.R''' in''' RStudio.''' | Point to '''Clustering.R''' in''' RStudio.''' | ||
− | || Open the | + | || Open the '''script Clustering.R''' in '''RStudio'''. |
− | + | '''Script Clustering.R''' opens in '''RStudio'''. | |
|- | |- | ||
|| [RStudio] | || [RStudio] | ||
Line 120: | Line 121: | ||
Point to the iris data set. | Point to the iris data set. | ||
+ | || Select the given '''commands'''. | ||
− | + | Click the '''Run''' button to see the''' iris data table'''. | |
− | + | ||
− | + | ||
− | Click the '''Run''' button to see the''' iris''' | + | |
|- | |- | ||
|| [RStudio] | || [RStudio] | ||
Line 136: | Line 135: | ||
Highlight 3 species | Highlight 3 species | ||
− | || Here we are using the labeled''' iris ''' | + | || Here we are using the labeled''' iris data set'''. |
It contains 3 species - '''Setosa''', '''Versicolor''' and '''Virginica. ''' | It contains 3 species - '''Setosa''', '''Versicolor''' and '''Virginica. ''' | ||
Line 147: | Line 146: | ||
'''Posing the Problem ''' | '''Posing the Problem ''' | ||
− | || Can we group data based on sepal and petal dimensions? | + | || Can we group '''data''' based on '''sepal''' and '''petal''' dimensions? |
If so, do these groups represent the original species label accurately? | If so, do these groups represent the original species label accurately? | ||
Line 161: | Line 160: | ||
'''Finding Number of Clusters''' | '''Finding Number of Clusters''' | ||
|| | || | ||
− | * For real-life unlabelled data, we should find the optimal number of clusters. | + | * For real-life unlabelled '''data''', we should find the optimal number of '''clusters'''. |
* For this, we use the '''Elbow Method'''. | * For this, we use the '''Elbow Method'''. | ||
Line 170: | Line 169: | ||
|- | |- | ||
|| Highlight '''Clustering.R''' in the '''Source''' window button | || Highlight '''Clustering.R''' in the '''Source''' window button | ||
− | || In the '''Source''' window, click on the | + | || In the '''Source''' window, click on the '''script Clustering.R'''. |
|- | |- | ||
|| [RStudio] | || [RStudio] | ||
Line 181: | Line 180: | ||
'''install.packages()''' | '''install.packages()''' | ||
− | || I will import the necessary packages. | + | || I will import the necessary '''packages'''. |
− | * '''ggplot2''' to visualize the data | + | * '''ggplot2''' to visualize the '''data''' |
* '''mclust''' to measure the accuracy. | * '''mclust''' to measure the accuracy. | ||
− | As I have already installed these packages, I will directly import them. | + | As I have already installed these '''packages''', I will directly import them. |
− | If you have not installed these libraries, please install them using '''install.packages ''' | + | If you have not installed these '''libraries''', please install them using '''install.packages function'''. |
|- | |- | ||
|| Type | || Type | ||
− | |||
− | |||
'''library(ggplot2)''' | '''library(ggplot2)''' | ||
Line 203: | Line 200: | ||
Press '''Ctrl''' + '''S''' keys to save the script. | Press '''Ctrl''' + '''S''' keys to save the script. | ||
− | || At the top of the script, type the following commands. | + | || At the top of the '''script''', type the following '''commands'''. |
− | Press '''Ctrl''' + '''S''' keys to save the script. | + | Press '''Ctrl''' + '''S''' keys to save the '''script'''. |
|- | |- | ||
Line 216: | Line 213: | ||
Click the '''Run''' button. | Click the '''Run''' button. | ||
− | || Click the '''Run''' button to load these libraries. | + | || Click the '''Run''' button to load these '''libraries'''. |
Line 227: | Line 224: | ||
− | || Click on the '''iris ''' | + | || Click on the '''iris data set'''. |
− | For the '''k-means clustering''', we need to select two features of the data set. | + | For the '''k-means clustering''', we need to select two features of the '''data set'''. |
− | So, we need to find the features that separate the data points clearly. | + | So, we need to find the features that separate the '''data points''' clearly. |
Line 239: | Line 236: | ||
* '''Sepal Length''' versus '''Sepal Width''' | * '''Sepal Length''' versus '''Sepal Width''' | ||
* '''Petal Length''' versus '''Petal Width''' | * '''Petal Length''' versus '''Petal Width''' | ||
− | |||
|- | |- | ||
Line 253: | Line 249: | ||
Click on '''Save''' button. | Click on '''Save''' button. | ||
− | || First, we will plot Sepal Length versus Sepal Width | + | || First, we will plot '''Sepal Length''' versus '''Sepal Width'''. |
− | + | ||
− | + | ||
− | + | Type the following '''command'''. | |
+ | Save the '''script'''. | ||
|- | |- | ||
Line 267: | Line 262: | ||
Click the '''Run''' button. | Click the '''Run''' button. | ||
− | || Select this command and run it to get a plot. | + | || Select this '''command''' and '''run''' it to get a plot. |
|- | |- | ||
|| Drag Boundary | || Drag Boundary | ||
− | || I will drag the boundary to see the | + | || I will drag the boundary to see the plot clearly. |
|- | |- | ||
Line 291: | Line 286: | ||
|- | |- | ||
− | || Point to '''Petal Length''' versus | + | || Point to '''Petal Length''' versus '''Petal Width.''' |
Line 297: | Line 292: | ||
'''ggplot(iris,aes(x = Petal.Length, y = Petal.Width, col= Species)) + geom_point()''' | '''ggplot(iris,aes(x = Petal.Length, y = Petal.Width, col= Species)) + geom_point()''' | ||
− | || Now we will plot '''Petal Length''' versus | + | || Now we will plot '''Petal Length''' versus '''Petal Width'''. |
− | Type the following command. | + | Type the following '''command'''. |
|- | |- | ||
Line 310: | Line 305: | ||
Click on '''Save '''and '''Run '''buttons. | Click on '''Save '''and '''Run '''buttons. | ||
− | || Save and run this command to see the plot. | + | || Save and '''run''' this '''command''' to see the plot. |
|- | |- | ||
|| Drag boundary to see the plot window clearly. | || Drag boundary to see the plot window clearly. | ||
− | || Drag the boundary to see the | + | || Drag the boundary to see the plot window clearly. |
|- | |- | ||
|| Highlight Output in '''Plot''' Window | || Highlight Output in '''Plot''' Window | ||
− | || Notice that '''Petal Length''' and '''Petal Width''' clearly separate the 3 species of iris. | + | || Notice that '''Petal Length''' and '''Petal Width''' clearly separate the 3 species of '''iris'''. |
Line 335: | Line 330: | ||
Highlight '''Species''' column | Highlight '''Species''' column | ||
− | || In this data set we have three species of iris. | + | || In this '''data set''' we have three species of '''iris'''. |
− | In this case we already know that we need three clusters. | + | In this case we already know that we need three '''clusters'''. |
Line 354: | Line 349: | ||
'''km <- kmeans(iris[,3:4],3,nstart = 20)''' | '''km <- kmeans(iris[,3:4],3,nstart = 20)''' | ||
− | || Now let’s use the '''kmeans() ''' | + | || Now let’s use the '''kmeans() function''' to perform '''k-means clustering'''. |
− | Type the following command. | + | Type the following '''command'''. |
|- | |- | ||
Line 365: | Line 360: | ||
Highlight '''nstart=20''' | Highlight '''nstart=20''' | ||
− | || Let me explain the parameters of this function. | + | || Let me explain the '''parameters''' of this '''function'''. |
− | This command uses '''Petal Length '''and '''Petal | + | This command uses '''Petal Length '''and '''Petal Width''' for '''k-means clustering'''. |
− | This is used to find the best starting points of the centroids. | + | This is used to find the best starting points of the '''centroids'''. |
|- | |- | ||
Line 382: | Line 377: | ||
Highlight km <- kmeans(iris[,3:4],3,nstart = 20) | Highlight km <- kmeans(iris[,3:4],3,nstart = 20) | ||
− | || Save the command and run it. | + | || Save the '''command''' and '''run''' it. |
− | It runs 20 iterations and randomly initializes centroids. | + | It runs 20 iterations and randomly initializes '''centroids'''. |
− | Then it chooses the best configuration of centroids for clustering. | + | Then it chooses the best configuration of '''centroids''' for '''clustering'''. |
|- | |- | ||
|| [RStudio] | || [RStudio] | ||
Line 397: | Line 392: | ||
− | Since the iris data set used here has labels, we will just tabulate the data. | + | Since the '''iris data set''' used here has labels, we will just tabulate the '''data'''. |
|- | |- | ||
Line 407: | Line 402: | ||
Click on '''Save '''and '''Run '''buttons. | Click on '''Save '''and '''Run '''buttons. | ||
− | || Type the following command. | + | || Type the following '''command'''. |
− | Save the script and run the command. | + | Save the '''script''' and '''run''' the '''command'''. |
|- | |- | ||
|| Drag boundary to see the '''console''' window. | || Drag boundary to see the '''console''' window. | ||
Line 420: | Line 415: | ||
− | Each row contains data points of 1 '''cluster'''. | + | Each row contains '''data points''' of 1 '''cluster'''. |
Line 437: | Line 432: | ||
Point to Setosa, versicolor and virginica. | Point to Setosa, versicolor and virginica. | ||
− | || 50 '''Setosa''' samples have been clustered together. | + | || 50 '''Setosa''' samples have been '''clustered''' together. |
− | 2 '''Versicolor''' samples have been incorrectly clustered. | + | 2 '''Versicolor''' samples have been incorrectly '''clustered'''. |
− | 4 '''Virginica''' samples have been incorrectly clustered. | + | 4 '''Virginica''' samples have been incorrectly '''clustered'''. |
Line 454: | Line 449: | ||
− | For this we will use the '''Adjusted Rand Index | + | For this we will use the '''Adjusted Rand Index'''. |
|- | |- | ||
Line 462: | Line 457: | ||
'''Adjusted Rand Index''' | '''Adjusted Rand Index''' | ||
|| | || | ||
− | * This is a measure of similarity between two clusters. | + | * This is a measure of similarity between two '''clusters'''. |
* It ranges from -1 to +1, where -1 is bad and +1 is good. | * It ranges from -1 to +1, where -1 is bad and +1 is good. | ||
− | |||
|- | |- | ||
Line 483: | Line 477: | ||
− | || '''mclust ''' | + | || '''mclust library''' contains the '''function''' to calculate the '''adjusted RAND index'''. |
− | Type the following command. | + | Type the following '''command'''. |
|- | |- | ||
|| Click on '''Save '''and '''Run '''buttons. | || Click on '''Save '''and '''Run '''buttons. | ||
− | || Save and run the command. | + | || Save and '''run''' the '''command'''. |
|- | |- | ||
Line 502: | Line 496: | ||
− | This means that our model has performed very well on this data set. | + | This means that our model has performed very well on this '''data set'''. |
|- | |- | ||
− | || | + | || Only Narration |
|| With this we come to the end of this tutorial. Let us summarize. | || With this we come to the end of this tutorial. Let us summarize. | ||
Line 515: | Line 509: | ||
'''Summary''' | '''Summary''' | ||
|| In this tutorial we have learnt: | || In this tutorial we have learnt: | ||
− | * '''Unsupervised Learning '''and its applications | + | * '''Unsupervised Learning ''' and its applications |
− | * '''k-means clustering '''on '''iris''' | + | * '''k-means clustering ''' on '''iris data set''' |
* Measure the performance using '''Adjusted RAND Index'''. | * Measure the performance using '''Adjusted RAND Index'''. | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
|- | |- | ||
Line 530: | Line 518: | ||
'''Assignment''' | '''Assignment''' | ||
− | || | + | ||Here is an assignment for you. |
− | + | ||
− | + | ||
+ | * Using '''inbuilt PlantGrowth data set''', perform '''k-means clustering'''. | ||
+ | * Evaluate using '''Adjusted RAND index'''. | ||
|- | |- | ||
Line 600: | Line 588: | ||
'''Acknowledgment''' | '''Acknowledgment''' | ||
− | || The '''Spoken Tutorial''' | + | || The '''Spoken Tutorial''' and '''FOSSEE''' projects are funded by the Ministry of Education Govt of India. |
|- | |- | ||
Line 606: | Line 594: | ||
'''Thank You''' | '''Thank You''' | ||
− | || This tutorial is contributed by Tanmay Srinath and Madhuri Ganapathi from IIT Bombay. Thank you for watching. | + | || This tutorial is contributed by Tanmay Srinath and Madhuri Ganapathi from IIT Bombay. |
+ | |||
+ | Thank you for watching. | ||
|- | |- | ||
|} | |} |
Latest revision as of 08:26, 9 October 2023
Title of the script: Unsupervised Learning
Author: Tanmay Srinath
Keywords: R, RStudio, machine learning, load libraries, ggplot2, mclust, unsupervised learning, classification, k-means clustering, iris dataset, adjusted rand index, spoken tutorial, video tutorial.
Visual Cue | Narration |
Show Slide
Opening Slide |
Welcome to this spoken tutorial on Unsupervised Learning. |
Show Slide
Learning Objectives |
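In this tutorial, we will learn about:
* Unsupervised Learning and its applications
* k-means clustering on the iris data set
* Measuring the performance using the Adjusted Rand Index.
|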
Show Slide
System Specifications |
This tutorial is recorded using:
* Ubuntu Linux OS version 20.04
* R version 4.1.2
* RStudio version 1.4.1717
It is recommended to install R version 4.1.0 or higher. |
Show Slide
Prerequisites |
To follow this tutorial, the learner should know:
If not, please access the relevant tutorials on R on this website. |
Show Slide
Unsupervised Learning |
Let us now learn about Unsupervised learning. |
Show Slide
Unsupervised Learning |
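* It is a technique that is applied to unlabelled datasets.
* It uses Machine Learning algorithms to analyse and cluster unlabelled data.
* It deals with finding groups of data points with similar characteristics.
|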
Show Slide
Types of Unsupervised Learning |
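Types of Unsupervised learning:
* Clustering: It is used for grouping search engine results.
* Anomaly Detection: It is used to detect fraudulent transactions.
|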
Show Slide
k-means Clustering |
Now let us implement k-means clustering on the iris data set.
|
Show Slide
Download Files |
For this tutorial, we will use a script file Clustering.R.
Please download this file from the Code files link of this tutorial.
|
[Computer screen]
Highlight Clustering.R and the folder UnsupervisedLearning |
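I have downloaded and moved this file to the UnsupervisedLearning folder.
This folder is located in the MLProject folder on my Desktop.
I have also set the UnsupervisedLearning folder as my Working directory.
|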
Cursor in the UnsupervisedLearning folder. | Let us switch to RStudio. |
Double-click Clustering.R
Point to Clustering.R in RStudio. |
Open the script Clustering.R in RStudio.
Script Clustering.R opens in RStudio.
|
[RStudio]
data("iris") View(iris) Click the Run button. Point to the iris data set. |
Select the given commands.
Click the Run button to see the iris data table.
|
[RStudio]
Highlight Iris in source window. Highlight Species column Scroll the table to show the species. Highlight 3 species |
Here we are using the labeled iris data set.
It contains 3 species - Setosa, Versicolor and Virginica.
|
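If the RStudio data viewer is not available, the same data set can be inspected from the console. This is only a minimal sketch using base R functions; it is not part of the Clustering.R script.

```r
# Load the built-in iris data set (ships with base R's datasets package).
data("iris")

# Structure: 150 observations of 5 variables (4 numeric, 1 factor).
str(iris)

# First six rows, as an alternative to View(iris).
head(iris)

# Number of samples for each of the three species.
table(iris$Species)
```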
Show Slide
Posing the Problem |
Can we group data based on sepal and petal dimensions?
If so, do these groups represent the original species label accurately? |
Show Slide
Solution |
The answer to this problem is to use a clustering algorithm. |
Show Slide
Finding Number of Clusters |
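* For real-life unlabelled data, we should find the optimal number of clusters.
* For this, we use the Elbow Method.
To know more about it, please refer to the Additional Reading Material. |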
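The Elbow Method is not demonstrated in this tutorial, so the following is only an illustrative sketch of one common way to apply it. It assumes the petal columns of iris are used, and the seed value and the range of k from 1 to 10 are arbitrary choices made for this example.

```r
data("iris")
set.seed(42)  # assumed seed, only to make the random starts repeatable

# Candidate numbers of clusters (an arbitrary range for illustration).
k_values <- 1:10

# For each k, run k-means and record the total within-cluster sum of squares.
wss <- sapply(k_values, function(k) {
  kmeans(iris[, 3:4], centers = k, nstart = 20)$tot.withinss
})

# Plot the values; the "elbow" is where the curve stops dropping sharply.
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The value of k at the bend of the curve is usually taken as a reasonable number of clusters.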
Let us switch to RStudio. | |
Highlight Clustering.R in the Source window button | In the Source window, click on the script Clustering.R. |
[RStudio]
mclust install.packages() |
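I will import the necessary packages.
* ggplot2 to visualize the data
* mclust to measure the accuracy.
As I have already installed these packages, I will directly import them.
If you have not installed these libraries, please install them using the install.packages function.
|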
Type
library(ggplot2) library(mclust)
|
At the top of the script, type the following commands.
Press Ctrl + S keys to save the script.
|
Highlight
library(ggplot2) library(mclust) Click the Run button. |
Click the Run button to load these libraries.
|
[RStudio]
Highlight Iris data set in the source window.
|
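Click on the iris data set.
For the k-means clustering, we need to select two features of the data set.
So, we need to find the features that separate the data points clearly.
|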
Point to Sepal Length versus Sepal Width.
Type ggplot(iris,aes(x = Sepal.Length, y = Sepal.Width, col= Species)) + geom_point()
|
First, we will plot Sepal Length versus Sepal Width.
Type the following command. Save the script. |
Highlight
ggplot(iris,aes(x = Sepal.Length, y = Sepal.Width, col= Species)) + geom_point()
|
Select this command and run it to get a plot. |
Drag Boundary | I will drag the boundary to see the plot clearly. |
Highlight
Output in Plot Window |
As we can see, setosa is clearly distinguished.
|
Drag the boundary. | Drag the boundary to see the Source window clearly. |
Point to Petal Length versus Petal Width.
ggplot(iris,aes(x = Petal.Length, y = Petal.Width, col= Species)) + geom_point() |
Now we will plot Petal Length versus Petal Width.
Type the following command.
|
Highlight
ggplot(iris,aes(x = Petal.Length, y = Petal.Width, col= Species)) + geom_point()
|
Save and run this command to see the plot. |
Drag boundary to see the plot window clearly. | Drag the boundary to see the plot window clearly. |
Highlight Output in Plot Window | Notice that Petal Length and Petal Width clearly separate the 3 species of iris.
|
Drag the boundary. | Drag the boundary to see the Source window clearly. |
Highlight Iris in source window
|
In this data set we have three species of iris.
In this case we already know that we need three clusters.
|
[RStudio]
Type km <- kmeans(iris[,3:4],3,nstart = 20) |
Now let’s use the kmeans() function to perform k-means clustering.
Type the following command.
|
Highlight iris[,3:4]
|
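Let me explain the parameters of this function.
This command uses Petal Length and Petal Width for k-means clustering.
nstart = 20 is used to find the best starting points of the centroids.
|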
[RStudio]
Click on Save and Run buttons.
|
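Save the command and run it.
It runs the algorithm 20 times, each time with randomly initialized centroids.
Then it chooses the best configuration of centroids for clustering.
|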
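To see what the fitted object contains, the following sketch repeats the same kmeans() call and prints a few of its components. The seed is an assumption added here so that the random starts are repeatable; it is not part of the tutorial's script.

```r
data("iris")
set.seed(42)  # assumed seed, not in the original script

# Columns 3 and 4 are Petal.Length and Petal.Width.
# centers = 3 asks for three clusters; nstart = 20 tries 20 random
# sets of starting centroids and keeps the best result.
km <- kmeans(iris[, 3:4], centers = 3, nstart = 20)

km$centers       # coordinates of the three final centroids
km$size          # number of data points in each cluster
km$tot.withinss  # total within-cluster sum of squares (lower is tighter)
```

Note that the cluster numbers (1, 2, 3) are arbitrary labels and may differ from run to run.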
[RStudio]
Click the iris data set table. |
We need to analyze our model to determine its performance.
Since the iris data set used here has labels, we will just tabulate the data.
|
[RStudio]
Type table(km$cluster,iris$Species) Click on Save and Run buttons. |
Type the following command.
Save the script and run the command. |
Drag boundary to see the console window. | Drag the boundary to see the Console window clearly. |
[RStudio]
Highlight Row and Column in output |
This table compares the predicted clusters with the actual species.
Each row contains data points of 1 cluster.
|
[RStudio]
Highlight Output in the Console
|
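50 Setosa samples have been clustered together.
2 Versicolor samples have been incorrectly clustered.
4 Virginica samples have been incorrectly clustered.
|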
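The counts read off the table can also be computed. The sketch below is one possible way to do this, assuming that each cluster is matched to the species that appears most often in it; the seed and the variable names tab, majority and misplaced are choices made for this example.

```r
data("iris")
set.seed(42)  # assumed seed, only for repeatable random starts
km <- kmeans(iris[, 3:4], centers = 3, nstart = 20)

# Rows are clusters, columns are the actual species.
tab <- table(cluster = km$cluster, species = iris$Species)
print(tab)

# For each cluster (row), take the count of its most frequent species.
majority <- apply(tab, 1, max)

# Everything outside the majority species of its cluster is misplaced.
misplaced <- sum(tab) - sum(majority)
misplaced
```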
Cursor in RStudio window. | Now we will calculate the accuracy of the model.
For this we will use the Adjusted Rand Index.
|
Show Slide
Adjusted Rand Index |
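* This is a measure of similarity between two clusterings of the same data.
* It ranges from -1 to +1, where -1 is bad and +1 is good.
|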
Let us switch to RStudio. | |
[RStudio]
Point to mclust library.
adjustedRandIndex(km$cluster, iris$Species)
|
The mclust library contains the function to calculate the Adjusted Rand Index.
Type the following command.
|
Click on Save and Run buttons. | Save and run the command. |
[RStudio]
Highlight Output on Console |
The Adjusted Rand Index is very close to 1.
This means that our model has performed very well on this data set.
|
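To show what the index measures, here is a sketch that computes it in base R from the table of clusters versus species, using the standard pair-counting formula. The function name adjusted_rand is hypothetical and is not part of mclust; it is intended to give the same value as adjustedRandIndex() but has not been checked against it here.

```r
# Hypothetical helper: Adjusted Rand Index from two label vectors.
adjusted_rand <- function(x, y) {
  tab <- table(x, y)                        # contingency table of the labels
  n <- sum(tab)
  sum_cells <- sum(choose(tab, 2))          # pairs grouped together in both
  sum_rows <- sum(choose(rowSums(tab), 2))  # pairs grouped together in x
  sum_cols <- sum(choose(colSums(tab), 2))  # pairs grouped together in y
  expected <- sum_rows * sum_cols / choose(n, 2)  # agreement by chance
  max_index <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (max_index - expected)
}

data("iris")
set.seed(42)  # assumed seed, only for repeatable random starts
km <- kmeans(iris[, 3:4], centers = 3, nstart = 20)
adjusted_rand(km$cluster, iris$Species)
```

A value of 1 means the two groupings agree perfectly, and a value near 0 means the agreement is about what random grouping would give.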
Only Narration | With this we come to the end of this tutorial. Let us summarize. |
Show Slide
Summary |
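In this tutorial we have learnt:
* Unsupervised Learning and its applications
* k-means clustering on the iris data set
* Measuring the performance using the Adjusted Rand Index.
|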
Show Slide
Assignment |
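Here is an assignment for you.
* Using the inbuilt PlantGrowth data set, perform k-means clustering.
* Evaluate using the Adjusted Rand Index.
|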
Show Slide
About the Spoken Tutorial Project |
The video at the following link summarises the Spoken Tutorial project.
Please download and watch it. |
Show Slide
Spoken Tutorial Workshops |
We conduct workshops using Spoken Tutorials and give certificates.
|
Show Slide
Spoken Tutorial Forum to answer questions
|
Do you have questions in THIS Spoken Tutorial?
Please visit this site. Choose the minute and second where you have the question. Explain your question briefly. The FOSSEE project will ensure an answer. You will have to register to ask questions. |
Show Slide
Forum to answer questions |
Do you have any general/technical questions?
Please visit the forum given in the link. |
Show Slide
Textbook Companion
|
The FOSSEE team coordinates the coding of solved examples of popular books and case study projects.
|
Show Slide
Acknowledgment |
The Spoken Tutorial and FOSSEE projects are funded by the Ministry of Education, Govt. of India. |
Show Slide
Thank You |
This tutorial is contributed by Tanmay Srinath and Madhuri Ganapathi from IIT Bombay.
Thank you for watching. |