Difference between revisions of "Python-for-Machine-Learning/C3/K-Means-Clustering/English"

From Script | Spoken-Tutorial

Revision as of 15:25, 9 July 2025


Visual Cue Narration
Show Slide: Welcome to the Spoken Tutorial on K-means Clustering.
Show Slide: In this tutorial, we will learn about
  • Introduction to K-means Clustering
  • Steps involved in K-means Algorithm
  • Choosing the Optimal number of Clusters (k)
Show Slide:

System Requirements

To record this tutorial, I am using
  • Ubuntu Linux OS version 24.04
  • Jupyter Notebook IDE
Show Slide: To follow this tutorial,
  • The Learner must have basic knowledge of Python
  • For prerequisite Python tutorials, please visit this website
Show Slide:

Code files

  • The files used in this tutorial are provided in the Code files link.
  • Please download and extract the files.
  • Make a copy and then use them while practicing.
Show Slide:

K-means Clustering

  • K-means clustering is an unsupervised machine learning algorithm.
  • It partitions data into a predetermined number of clusters.
  • This algorithm aims to group similar data points together.
Show Slide:

K-means Clustering

  • It creates distinct clusters by separating dissimilar points.

Show Slide:

Working working.png

Let us see the working of K-means clustering.
  • First, choose the number of clusters, k, to divide the data.
  • Then, randomly pick k data points as the initial centroids. A centroid is the center of a cluster.
  • It represents the average position of all points in that cluster.
Show Slide:

Working working.png

  • Each data point is assigned to the nearest cluster centroid.
  • Update the centroids by averaging the data points in each cluster.
Show Slide:

Working working.png

  • New centroids replace the old ones based on these mean values.
  • This is repeated until the cluster assignments no longer change.
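
The steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the code used in the tutorial notebook: the function name kmeans, its parameters, and the toy data are assumptions made for this example, and the tutorial itself relies on scikit-learn's KMeans.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means sketch: returns (centroids, labels) for X of shape (n, d)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Stop when the cluster assignments no longer change.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of the points assigned to it.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Hypothetical toy data: two visibly separate groups of points.
points = [[1, 1], [1.5, 2], [1, 0.5], [8, 8], [9, 9.5], [8.5, 9]]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

The label numbers themselves are arbitrary; what matters is that the first three points share one label and the last three share the other.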
Hover over the files

Point to KMeansClustering.ipynb

I have created the required files for the demonstration of K-Means clustering.

K means clustering dot ipynb is the Python notebook file used in this tutorial.

Open Customers.csv dataset

For this tutorial, we use a dataset of customer income and spending scores.

The goal is to group individuals based on income and spending patterns.

Press Ctrl+Alt+T keys

Type conda activate ml

Press Enter

Let us open the Linux terminal by pressing the Ctrl, Alt and T keys together.

Activate the machine learning environment as shown.

To go to the Downloads folder,

Type cd Downloads

Type jupyter notebook

I have saved my code file in the Downloads folder.

Please navigate to the respective folder of your code file location.

Then type, jupyter space notebook and press Enter.

Show Jupyter Notebook Home Page:

Double click on KMeansClustering.ipynb file

We can see the Jupyter Notebook Home page has opened in the web browser.

Click the K means clustering dot ipynb file to open it.

Note that each cell will have the output displayed in this file.

Let us see the implementation of the K-Means Clustering.

Highlight:

import numpy as np

import pandas as pd

First, we import the necessary libraries required for K-means Clustering.

Make sure to press Shift and Enter to execute the code in each cell.

Highlight:

We load the customers dataset into the df dataframe.

Let us display the first few rows.

Highlight:

print(df.describe())

Hover over the output

To understand the data distribution, we use the describe function.

The dataset contains 200 entries for Annual Income and Spending Score.

Income ranges from 15K to 137K, while the spending score varies from 1 to 99.

Highlight:

df.isnull()

Hover over the output

We check for missing values in the dataset using the isnull function.

The output shows False for all entries, confirming no missing values.
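
As a compact alternative to scanning a table of True and False values, the missing values can be counted per column. The sketch below uses a small made-up dataframe; the column names and values are assumptions for illustration and are not read from Customers.csv.

```python
import pandas as pd

# Hypothetical stand-in for the customers data (not the real file).
df = pd.DataFrame({
    "Annual Income": [15, 54, 78, 137],
    "Spending Score": [39, 1, 50, 99],
})

# Count missing values in each column; 0 means the column is complete.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Keep only the rows of describe() that give the entry count and the range.
summary = df.describe().loc[["count", "min", "max"]]
print(summary)
```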

To visualize the dataset, we plot histograms for each numerical feature.

We also use boxplots to detect potential outliers.

Hover over the output plots as per the narration

The histogram reveals the distribution of annual income and spending scores.

Income is more spread out, while spending scores form distinct clusters.

The boxplot identifies an outlier in the Annual Income feature.

It is indicated by a point above the upper whisker.

The Spending Score feature does not exhibit clear outliers.
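
A boxplot drawn with matplotlib's default settings places the whiskers using the 1.5 × IQR rule, so a point above the upper whisker is a value larger than Q3 + 1.5 × IQR. The sketch below applies that rule directly; the income values are made up for illustration and are not taken from Customers.csv.

```python
import numpy as np

# Hypothetical incomes in thousands (not the real dataset).
income = np.array([15, 28, 40, 54, 60, 62, 70, 78, 87, 137])

# Quartiles and interquartile range.
q1, q3 = np.percentile(income, [25, 75])
iqr = q3 - q1

# Values beyond these fences are drawn as individual outlier points.
upper_fence = q3 + 1.5 * iqr
lower_fence = q1 - 1.5 * iqr

outliers = income[(income > upper_fence) | (income < lower_fence)]
print(upper_fence)  # 124.75
print(outliers)     # [137]
```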

Only narration

Highlight:

scaler = MinMaxScaler()

scaled_df = scaler.fit_transform(df)

Now, let us preprocess the dataset.

First, we normalize the dataset using MinMaxScaler to scale feature values.

This transformation ensures all values fall between 0 and 1.

Because K-means groups points by distance, putting both features on the same scale keeps the larger-valued income feature from dominating the clustering.
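
Min-max scaling maps each feature with the formula (x - min) / (max - min), computed column by column. The sketch below checks that this formula matches the output of MinMaxScaler on a small made-up array; the values are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical (income, spending score) rows, not the real dataset.
X = np.array([[15.0, 39.0], [54.0, 1.0], [137.0, 99.0], [78.0, 50.0]])

# Manual min-max scaling, applied to each column separately.
col_min = X.min(axis=0)
col_max = X.max(axis=0)
manual = (X - col_min) / (col_max - col_min)

# The same transformation using scikit-learn.
scaled = MinMaxScaler().fit_transform(X)

print(np.allclose(manual, scaled))
print(scaled.min(axis=0), scaled.max(axis=0))
```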

Highlight:

silhouette_scores = []

K_range = range(2, 11)

for K in K_range:

    kmeans = KMeans(n_clusters=K, random_state=42, n_init=10)

    cluster_labels = kmeans.fit_predict(scaled_df)

    score = silhouette_score(scaled_df, cluster_labels)

    silhouette_scores.append(score)

plt.show()

Next, we find the optimal number of clusters using the Silhouette Score method.

The highest Silhouette score suggests the best k value.

To do this, we initialize an empty list to store silhouette scores.

The for loop iterates through a range of K values to perform clustering.

For each K, a KMeans model is created and trained.

Cluster labels are predicted, and the silhouette score is calculated.

The scores are stored for further analysis.

Further, we generate a plot to visualize the silhouette scores.

Only narration

Hover over output plot

Ignore the warning in the output cell.

The plot shows silhouette scores for different cluster values.

A higher score means better defined clusters with less overlap.

In this plot, the optimal k is 5 as it has the highest silhouette score.

Highlight:

optimal_k = 5

We store the optimal k value in the optimal underscore k variable.
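
Instead of reading the best value off the plot and typing it in, the k with the highest silhouette score can also be picked programmatically. This is a sketch under stated assumptions: it runs on synthetic data from make_blobs with four well-separated groups, not on the customers dataset, so the value it selects differs from the tutorial's 5.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: four tight, well-separated groups (an assumption).
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X, _ = make_blobs(n_samples=200, centers=centers, cluster_std=0.5, random_state=42)

K_range = range(2, 11)
silhouette_scores = []
for K in K_range:
    labels = KMeans(n_clusters=K, random_state=42, n_init=10).fit_predict(X)
    silhouette_scores.append(silhouette_score(X, labels))

# Pick the k whose silhouette score is the largest.
optimal_k = K_range[int(np.argmax(silhouette_scores))]
print(optimal_k)
```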
Only Narration

Highlight:

Next, we create a K-means model using the optimal k value.

The model is fitted to the scaled data, assigning cluster labels.

Highlight:

We extract the cluster centers from the trained K-means model.

The coordinates of these centers are printed for analysis.

Show output and hover over it

Each cluster center represents the average value of points in that cluster.

Each row corresponds to a centroid’s position for income and spending score.
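
Since the model is fitted on scaled data, the printed centers are in the 0 to 1 range rather than in income and score units. If the original units are wanted, the scaler's inverse_transform can map the centers back. The sketch below uses made-up rows and two clusters purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Hypothetical (income, spending score) rows, not the real dataset.
X = np.array([[15, 80], [18, 75], [20, 85],
              [120, 10], [130, 15], [137, 5]], dtype=float)

scaler = MinMaxScaler()
scaled = scaler.fit_transform(X)

kmeans = KMeans(n_clusters=2, random_state=42, n_init=10).fit(scaled)

# Centers as the model sees them (scaled units, between 0 and 1).
centers_scaled = kmeans.cluster_centers_

# The same centers expressed in the original units.
centers_original = scaler.inverse_transform(centers_scaled)
print(centers_scaled)
print(centers_original)
```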

Highlight:

df['Cluster'] = kmeans.labels_

We then assign each data point to its respective cluster.

A new column Cluster is added to store cluster labels.

Highlight:

silhouette_avg = silhouette_score(scaled_df, kmeans.labels_)

print("Silhouette Score: ", silhouette_avg)

Next, we calculate the silhouette score to evaluate clustering quality.

A higher score indicates well-separated and dense clusters.

Show the output

The silhouette score of 0.559 suggests moderate clustering quality.

A score close to -1 means the data point is poorly clustered.

A score close to 1 means the data point is well-clustered.

A value of 0.559 therefore suggests that while clusters are distinguishable, some overlap may exist.
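
The overall silhouette score is the mean of the per-point silhouette values. For one point, with a the mean distance to the other points in its own cluster and b the mean distance to the points of the nearest other cluster, the value is (b - a) / max(a, b). The sketch below shows this relationship with silhouette_samples on synthetic make_blobs data, which is an assumption and not the customers dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic stand-in data (an assumption for this sketch).
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=1.0, random_state=42)
labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X)

# One silhouette value per data point, each between -1 and 1.
per_point = silhouette_samples(X, labels)

# The single score reported for the whole clustering.
overall = silhouette_score(X, labels)

print(per_point.min(), per_point.max())
print(np.isclose(per_point.mean(), overall))
```

Points with low or negative per-point values are the ones that sit near a neighbouring cluster.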

Highlight:

plt.figure(figsize=(8, 5))

plt.show()

Finally, we plot data points using the scaled income and spending score values.

To enhance clarity, each point is color-coded based on its cluster assignment.

The cluster centers are marked with a red X for easy identification.

This visualization helps to understand how data points are grouped.

Show the output plot

The scatter plot visualizes K-means clustering results.

Each dot represents a data point, color-coded by its assigned cluster.

The clusters show distinct groupings based on annual income and spending score.
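
A plot of this kind could be produced roughly as follows. This is a sketch, not the notebook's plotting cell: the synthetic make_blobs data, the viridis colormap, the marker sizes, and the axis labels are all assumptions chosen for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled income and spending score values.
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=1.0, random_state=42)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X)

fig, ax = plt.subplots(figsize=(8, 5))

# Data points, color-coded by their assigned cluster.
ax.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap="viridis", s=30)

# Cluster centers, marked with a red X.
centers = kmeans.cluster_centers_
ax.scatter(centers[:, 0], centers[:, 1], c="red", marker="X", s=200, label="Centroids")

ax.set_xlabel("Feature 1 (scaled income in the tutorial)")
ax.set_ylabel("Feature 2 (scaled spending score in the tutorial)")
ax.set_title("K-means clustering result")
ax.legend()
```

In a Jupyter Notebook the figure is displayed when the cell finishes; in a plain script, call plt.show() at the end.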

Show Slide:

Summary

This brings us to the end of the tutorial.

Let us summarize.

Show Slide:

Assignment

As an assignment,
  • Use the Elbow method instead of the Silhouette method to find the optimal k
  • Use inertia to identify it
  • Then, plot the elbow curve
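
One possible way to approach this assignment is sketched below. It is not the solution code supplied with the tutorial: it uses synthetic make_blobs data with four groups as a stand-in, and the range of k values is an assumption. Inertia is the sum of squared distances from each point to its cluster center, available as the inertia_ attribute after fitting.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data: four tight, well-separated groups (an assumption).
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X, _ = make_blobs(n_samples=200, centers=centers, cluster_std=0.5, random_state=42)

K_range = range(1, 11)
inertias = []
for K in K_range:
    kmeans = KMeans(n_clusters=K, random_state=42, n_init=10).fit(X)
    # Sum of squared distances of points to their closest cluster center.
    inertias.append(kmeans.inertia_)

# Elbow curve: inertia drops sharply up to the natural k, then flattens.
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(list(K_range), inertias, marker="o")
ax.set_xlabel("Number of clusters (k)")
ax.set_ylabel("Inertia")
ax.set_title("Elbow curve")
```

The elbow is the k after which adding more clusters gives only a small further drop in inertia.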


Show Slide:

Assignment

Show elbow.png

Refer to the code given:
Show Slide:

Assignment Solution

Show plot.png

After completing the assignment, we’ll get the following plot.
Show Slide:

FOSSEE Forum

For any general or technical questions on Python for Machine Learning, visit the FOSSEE forum and post your question.
Show Slide:

Thank you

This is Anvita Thadavoose Manjummel, a FOSSEE Summer Fellow 2025, IIT Bombay, signing off.

Thanks for joining.

Contributors and Content Editors

Madhurig, Nirmala Venkat

Retrieved from "https://script.spoken-tutorial.org/index.php?title=Python-for-Machine-Learning/C3/K-Means-Clustering/English&oldid=57020"