Python-for-Machine-Learning/C3/K-Means-Clustering/English

{| border="1"
|-
|| '''Visual Cue'''
|| '''Narration'''
|-
|| Show Slide:
|| Welcome to the Spoken Tutorial on '''K-means Clustering'''.
|-
|| Show Slide:
|| In this tutorial, we will learn about
* Introduction to '''K-means Clustering'''
* Steps involved in the '''K-means Algorithm'''
* Choosing the optimal number of '''clusters (k)'''
|-
|| Show Slide:
'''System Requirements'''
|| To record this tutorial, I am using
* '''Ubuntu Linux OS version 24.04'''
* '''Jupyter Notebook IDE'''
|-
|| Show Slide:
|| To follow this tutorial,
* the learner must have basic knowledge of '''Python'''
* for prerequisite '''Python''' tutorials, please visit this website.
|-
|| Show Slide:
'''Code files'''
||
* The files used in this tutorial are provided in the '''Code files''' link.
* Please download and extract the files.
* Make a copy, and then use them while practicing.
|-
|| Show Slide:
'''K-means Clustering'''
||
* '''K-means clustering''' is an '''unsupervised''' machine learning algorithm.
* It partitions data into a predetermined number of '''clusters'''.
* The algorithm aims to group similar '''data points''' together.
|-
|| Show Slide:
'''K-means Clustering'''
||
* It creates '''distinct clusters''' by separating dissimilar points.
|-
|| Show Slide:
'''Working'''
'''working.png'''
|| Let us see the working of '''K-means clustering'''.
* First, choose the number of clusters, '''k''', into which the data will be divided.
* Then, randomly pick '''k data points''' as the '''initial centroids''', one center per cluster.
* A '''centroid''' represents the average position of all the points in its '''cluster'''.
|-
|| Show Slide:
'''Working'''
'''working.png'''
||
* Each '''data point''' is assigned to the nearest '''cluster centroid'''.
* The '''centroids''' are then updated by averaging the '''data points''' in each '''cluster'''.
|-
|| Show Slide:
'''Working'''
'''working.png'''
||
* New '''centroids''' replace the old ones based on these mean values.
* These steps repeat until the '''cluster''' assignments no longer change.
|-
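||
|| The steps above can be sketched directly in NumPy. This is a minimal illustration of the algorithm, not the scikit-learn implementation used later in the tutorial; the toy data and helper names are assumptions made for the sketch.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=42):
    """Minimal K-means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Steps 1-2: randomly pick k data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign every point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its old centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids, and hence assignments, settle.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated synthetic blobs as toy data.
rng_data = np.random.default_rng(0)
pts = np.vstack([rng_data.normal(0.0, 0.5, (20, 2)),
                 rng_data.normal(10.0, 0.5, (20, 2))])
centers, labels = kmeans(pts, k=2)
```

With blobs this far apart, the sketch should recover one cluster per blob even when both initial centroids are drawn from the same blob.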
|| Hover over the files
Point to '''KMeansClustering.ipynb'''
|| I have created the required files for the demonstration of '''K-means clustering'''.
'''K means clustering dot ipynb''' is the Python notebook file used in this tutorial.
|-
|| Open the '''Customers.csv''' dataset
|| For this tutorial, we use a '''dataset''' of customer incomes and spending scores.
The goal is to '''group individuals''' based on their '''income''' and '''spending patterns'''.
|-
|| Press the '''Ctrl+Alt+T''' keys
Type '''conda activate ml'''
Press '''Enter'''
|| Let us open the Linux terminal by pressing the '''Ctrl''', '''Alt''' and '''T''' keys together.
Activate the machine learning environment as shown.
|-
|| To go to the '''Downloads''' folder,
Type '''cd Downloads'''
Type '''jupyter notebook'''
|| I have saved my code file in the '''Downloads''' folder.
Please navigate to the folder containing your code file.
Then type '''jupyter space notebook''' and press '''Enter'''.
|-
|| Show Jupyter Notebook Home Page:
Double click on the '''KMeansClustering.ipynb''' file
|| We can see that the Jupyter Notebook Home page has opened in the web browser.
Click the '''K means clustering dot ipynb''' file to open it.
Note that each cell already has its output displayed in this file.
Let us see the implementation of '''K-means clustering'''.
|-
|| Highlight:
'''import numpy as np'''
'''import pandas as pd'''
|| First, we import the libraries required for '''K-means clustering'''.
Make sure to press '''Shift''' and '''Enter''' to execute the code in each cell.
|-
|| Highlight:
|| We load the '''customers dataset''' into the '''df''' dataframe.
Let us display the first few rows.
|-
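||
|| A self-contained sketch of this loading step. Since Customers.csv is not bundled here, an in-memory stand-in replaces the file, and the column names are hypothetical:

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for Customers.csv; the tutorial's real file
# has 200 rows of annual income and spending score.
csv_text = StringIO(
    "Annual Income (k$),Spending Score (1-100)\n"
    "15,39\n"
    "16,81\n"
    "17,6\n"
    "18,77\n"
)

# In the tutorial this would be: df = pd.read_csv("Customers.csv")
df = pd.read_csv(csv_text)
print(df.head())  # display the first few rows
```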
|| Highlight:
'''print(df.describe())'''
Hover over the output
|| To understand the data distribution, we use the '''describe''' function.
The dataset contains 200 entries for '''Annual Income''' and '''Spending Score'''.
'''Income''' ranges from '''15K to 137K''', while the '''spending score''' varies from '''1 to 99'''.
|-
|| Highlight:
'''df.isnull()'''
Hover over the output
|| We check for missing values in the dataset using the '''isnull''' function.
The output shows '''False''' for all entries, confirming there are '''no missing values'''.
|-
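||
|| As a sketch of this check: isnull yields an element-wise boolean frame, and chaining a sum condenses it into a per-column count, which is easier to read than scanning 200 rows of False. The toy values are assumptions:

```python
import pandas as pd

# Toy frame standing in for the customers data (hypothetical values).
df = pd.DataFrame({
    "Annual Income (k$)": [15, 16, 17, 18],
    "Spending Score (1-100)": [39, 81, 6, 77],
})

mask = df.isnull()       # True wherever a value is missing
per_column = mask.sum()  # number of missing values in each column
print(per_column)
```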
||
|| To visualize the dataset, we plot '''histograms''' for each numerical feature.
We also use '''boxplots''' to detect potential '''outliers'''.
|-
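||
|| The visualization step can be sketched as follows; the synthetic data, figure layout, and column names are assumptions, and the Agg backend is selected so the sketch also runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the customers data.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "Annual Income (k$)": rng.integers(15, 138, 200),
    "Spending Score (1-100)": rng.integers(1, 100, 200),
})

fig, axes = plt.subplots(2, 2, figsize=(10, 6))
for ax, col in zip(axes[0], df.columns):
    ax.hist(df[col], bins=20)  # distribution of each feature
    ax.set_title(col)
for ax, col in zip(axes[1], df.columns):
    ax.boxplot(df[col])        # points beyond the whiskers suggest outliers
    ax.set_title(col)
plt.tight_layout()
```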
|| Hover over the output plots as per the narration
|| The '''histogram''' reveals the distribution of '''annual income''' and '''spending scores'''.
'''Income''' is more spread out, while '''spending scores''' form distinct clusters.
The '''boxplot''' identifies an outlier in the '''Annual Income''' feature.
It is indicated by a point above the upper whisker.
The '''Spending Score''' feature does not exhibit clear outliers.
|-
|| Only narration
Highlight:
'''scaler = MinMaxScaler()'''
'''scaled_df = scaler.fit_transform(df)'''
|| Now, let us preprocess the dataset.
First, we normalize the dataset using '''MinMaxScaler''' to scale the feature values.
This transformation ensures that all values fall between 0 and 1.
It improves uniformity and enhances the clustering performance.
|-
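||
|| A runnable sketch of the scaling step, using toy values in place of the customers frame:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy income/score pairs (hypothetical values).
data = np.array([[15.0, 39.0],
                 [60.0, 50.0],
                 [137.0, 99.0]])

scaler = MinMaxScaler()
scaled_df = scaler.fit_transform(data)  # each column mapped onto [0, 1]
```

MinMaxScaler maps each column with (x - min) / (max - min), so the smallest value in a column becomes 0 and the largest becomes 1.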
|| Highlight:
'''silhouette_scores = []'''
'''K_range = range(2, 11)'''
'''for K in K_range:'''
'''kmeans = KMeans(n_clusters=K, random_state=42, n_init=10)'''
'''cluster_labels = kmeans.fit_predict(scaled_df)'''
'''score = silhouette_score(scaled_df, cluster_labels)'''
'''silhouette_scores.append(score)'''
'''plt.show()'''
|| Next, we find the optimal number of clusters using the '''Silhouette Score''' method.
The highest silhouette score suggests the best '''k''' value.
To do this, we initialize an empty list to store the silhouette scores.
The '''for''' loop iterates through a range of '''K''' values to perform clustering.
For each '''K''', a '''KMeans''' model is created and trained.
Cluster labels are predicted, and the '''silhouette score''' is calculated.
The scores are stored for further analysis.
Further, we generate a plot to visualize the '''silhouette scores'''.
|-
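||
|| The same model-selection loop runs below on a synthetic stand-in for the scaled data. Three tight blobs are used, so the expected optimum here is k = 3; the tutorial's own dataset yields k = 5 instead:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic scaled data: three well-separated blobs inside [0, 1] x [0, 1].
rng = np.random.default_rng(42)
scaled_df = np.vstack([rng.normal(loc, 0.05, (30, 2)) for loc in (0.1, 0.5, 0.9)])

silhouette_scores = []
K_range = range(2, 11)
for K in K_range:
    kmeans = KMeans(n_clusters=K, random_state=42, n_init=10)
    cluster_labels = kmeans.fit_predict(scaled_df)
    # Silhouette compares each point's own-cluster cohesion with its
    # separation from the nearest other cluster.
    silhouette_scores.append(silhouette_score(scaled_df, cluster_labels))

best_k = K_range[int(np.argmax(silhouette_scores))]
```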
|| Only narration
Hover over the output plot
|| Ignore the warning in the output cell.
The plot shows the silhouette scores for different cluster counts.
A higher score means better-defined clusters with less overlap.
In this plot, the optimal '''k''' is 5, as it has the highest '''silhouette score'''.
|-
|| Highlight:
'''optimal_k = 5'''
|| We store the optimal '''k''' value in the '''optimal underscore k''' variable.
|-
|| Only narration
Highlight:
|| Next, we create a '''K-means''' model using the optimal '''k''' value.
The model is fitted to the '''scaled data''', assigning cluster labels.
|-
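||
|| This fitting step can be sketched with synthetic scaled data; the variable names follow the tutorial, while the data and the k value are assumptions for the sketch:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic scaled data with two clear groups.
rng = np.random.default_rng(0)
scaled_df = np.vstack([rng.normal(0.2, 0.05, (25, 2)),
                       rng.normal(0.8, 0.05, (25, 2))])

optimal_k = 2  # the tutorial's dataset gives 5; this toy data gives 2
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
kmeans.fit(scaled_df)              # fitting assigns a label to every row
labels = kmeans.labels_            # one cluster label per data point
centers = kmeans.cluster_centers_  # one centroid per cluster
```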
|| Highlight:
|| We extract the '''cluster centers''' from the trained '''K-means''' model.
The coordinates of these centers are printed for analysis.
|-
|| Show the output and hover over it
|| Each '''cluster center''' represents the average value of the points in that cluster.
Each row corresponds to a '''centroid's position''' for '''income''' and '''spending score'''.
|-
|| Highlight:
'''df['Cluster'] = kmeans.labels_'''
|| We then assign each data point to its respective cluster.
A new column, '''Cluster''', is added to store the cluster labels.
|-
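||
|| Once the labels are stored in a column, per-cluster averages summarize each customer segment. A sketch on a hypothetical mini frame:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical mini customers frame: three high-score/low-income rows
# and three low-score/high-income rows.
df = pd.DataFrame({"Income": [15, 16, 17, 90, 95, 99],
                   "Score": [80, 85, 90, 10, 15, 5]})

kmeans = KMeans(n_clusters=2, random_state=42, n_init=10).fit(df)
df["Cluster"] = kmeans.labels_          # new column with each row's cluster
profile = df.groupby("Cluster").mean()  # average income/score per cluster
print(profile)
```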
|| Highlight:
'''silhouette_avg = silhouette_score(scaled_df, kmeans.labels_)'''
'''print("Silhouette Score: ", silhouette_avg)'''
|| Next, we calculate the '''silhouette score''' to evaluate the clustering quality.
A higher score indicates well-separated and dense clusters.
|-
|| Show the output
|| The '''silhouette score''' of '''0.559''' suggests moderate clustering quality.
A score close to '''-1''' means a data point is '''poorly clustered'''.
A score close to '''1''' means a data point is '''well clustered'''.
This suggests that while the '''clusters''' are distinguishable, some overlap may exist.
|-
|| Highlight:
'''plt.figure(figsize=(8, 5))'''
'''plt.show()'''
|| Finally, we plot the data points using the scaled income and spending score values.
To enhance clarity, each point is color-coded based on its cluster assignment.
The cluster centers are marked with a red '''X''' for easy identification.
This visualization helps us understand how the data points are grouped.
|-
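||
|| The plotting described here can be sketched as below. The synthetic data and styling choices (viridis colormap, marker size) are assumptions, and the Agg backend keeps the sketch runnable without a display:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

# Synthetic scaled data with two groups.
rng = np.random.default_rng(3)
scaled_df = np.vstack([rng.normal(0.2, 0.05, (30, 2)),
                       rng.normal(0.8, 0.05, (30, 2))])
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10).fit(scaled_df)

fig = plt.figure(figsize=(8, 5))
# Colour each point by its cluster label.
plt.scatter(scaled_df[:, 0], scaled_df[:, 1], c=kmeans.labels_, cmap="viridis")
# Mark the cluster centers with a red X.
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c="red", marker="X", s=200)
plt.xlabel("Annual Income (scaled)")
plt.ylabel("Spending Score (scaled)")
```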
|| Show the output plot
|| The scatter plot visualizes the '''K-means clustering''' results.
Each dot represents a data point, color-coded by its assigned cluster.
The '''clusters''' show distinct groupings based on '''annual income''' and '''spending score'''.
|-
|| Show Slide:
'''Summary'''
|| This brings us to the end of the tutorial.
Let us summarize.
|-
|| Show Slide:
'''Assignment'''
|| As an assignment,
* use the '''Elbow''' method instead of the '''Silhouette''' method to find the optimal '''k'''
* use '''inertia''' to identify it
* then, plot the '''elbow''' curve
|-
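||
|| The elbow approach asked for above can be sketched as follows: inertia (the within-cluster sum of squared distances, exposed by scikit-learn as inertia_) always falls as k grows, and the "elbow" is where the drop levels off. Synthetic data is used as a stand-in:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic scaled data: three blobs, so the elbow should appear at k = 3.
rng = np.random.default_rng(7)
scaled_df = np.vstack([rng.normal(loc, 0.04, (25, 2)) for loc in (0.1, 0.5, 0.9)])

inertias = []
K_range = range(1, 11)
for K in K_range:
    km = KMeans(n_clusters=K, random_state=42, n_init=10).fit(scaled_df)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

# To plot: plt.plot(list(K_range), inertias, marker="o") and look for the bend.
```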
|| Show Slide:
'''Assignment'''
Show elbow.png
|| Refer to the code given.
|-
|| Show Slide:
'''Assignment Solution'''
Show plot.png
|| After completing the assignment, we will get the following plot.
|-
|| Show Slide:
'''FOSSEE Forum'''
|| For any general or technical questions on '''Python for Machine Learning''', visit the '''FOSSEE forum''' and post your question.
|-
|| Show Slide:
'''Thank you'''
|| This is '''Anvita Thadavoose Manjummel''', a FOSSEE Summer Fellow 2025 at IIT Bombay, signing off.
Thanks for joining.
|}
Revision as of 15:25, 9 July 2025