Python-for-Machine-Learning/C3/K-Means-Clustering/English
Visual Cue | Narration |
Show Slide: | Welcome to the Spoken Tutorial on K-means Clustering. |
Show Slide: | In this tutorial, we will learn about
|
Show Slide:
System Requirements |
To record this tutorial, I am using
|
Show Slide: | To follow this tutorial,
|
Show Slide:
Code files |
|
Show Slide:
K-means Clustering |
|
Show Slide:
K-means Clustering |
|
Show Slide:
Working working.png |
Let us see the working of K-means clustering.
|
Show Slide:
Working working.png |
|
Show Slide:
Working working.png |
|
Hover over the files
Point to KMeansClustering.ipynb |
I have created the required files for the demonstration of K-Means clustering.
K means clustering dot ipynb is the Python notebook file used in this tutorial. |
Open Customers.csv dataset | For this tutorial, we use a dataset of customer income and spending scores.
The goal is to group individuals based on income and spending patterns. |
Press Ctrl+Alt+T keys
Type conda activate ml Press Enter |
Let us open the Linux terminal by pressing Ctrl, Alt and T keys together.
Activate the machine learning environment as shown. |
To go to the Downloads folder,
Type cd Downloads Type jupyter notebook |
I have saved my code file in the Downloads folder.
Please navigate to the respective folder of your code file location. Then type, jupyter space notebook and press Enter. |
Show Jupyter Notebook Home Page: Double click on
KMeansClustering.ipynb file |
We can see the Jupyter Notebook Home page has opened in the web browser.
Click the K means clustering dot ipynb file to open it. Note that each cell will have the output displayed in this file. Let us see the implementation of K-Means Clustering. |
Highlight:
import numpy as np
import pandas as pd |
First, we import the necessary libraries required for K-means Clustering.
Make sure to Press Shift and Enter to execute the code in each cell. |
Highlight: | We load the customers dataset into the df dataframe.
Let us display the first few rows. |
Highlight:
print(df.describe())
Hover over the output |
To understand the data distribution, we use the describe function.
The dataset contains 200 entries for Annual Income and Spending Score. Income ranges from 15K to 137K, while the spending score varies from 1 to 99. |
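The loading and inspection step can be sketched as follows. Since the Customers.csv file is not included here, this sketch uses a small stand-in DataFrame; the column names are illustrative, not taken from the actual file.

```python
import pandas as pd

# Small stand-in for the Customers.csv dataset
# (column names here are illustrative assumptions).
df = pd.DataFrame({
    "Annual Income (k$)": [15, 16, 17, 120, 126, 137],
    "Spending Score (1-100)": [39, 81, 6, 79, 28, 99],
})

print(df.head())        # first few rows of the data
print(df.describe())    # count, mean, min, max and quartiles per column
```

With the real dataset, describe reports 200 entries per column; here the counts reflect the six stand-in rows.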
Highlight:
df.isnull()
Hover over the output |
We check for missing values in the dataset using the isnull function.
The output shows False for all entries, confirming no missing values. |
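The missing-value check above can be sketched on a tiny stand-in frame; the real tutorial runs this on the full customer dataset.

```python
import pandas as pd

# Tiny stand-in frame with no missing entries.
df = pd.DataFrame({"Annual Income": [15, 16, 17],
                   "Spending Score": [39, 81, 6]})

missing = df.isnull()     # boolean DataFrame: True wherever a value is missing
print(missing)
print(missing.sum())      # per-column count of missing entries
```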
Highlight the code. | To visualize the dataset, we plot histograms for each numerical feature.
We also use boxplots to detect potential outliers. |
Hover over the output plots as per the narration | The histogram reveals the distribution of annual income and spending scores.
Income is more spread out, while spending scores form distinct clusters. The boxplot identifies an outlier in the Annual Income feature. It is indicated by a point above the upper whisker. The Spending Score feature does not exhibit clear outliers. |
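The visualization step can be sketched as below, using randomly generated income and spending values as a stand-in for the dataset; the exact figure layout in the notebook may differ.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for the income/spending columns.
df = pd.DataFrame({
    "Annual Income": rng.integers(15, 138, size=200),
    "Spending Score": rng.integers(1, 100, size=200),
})

fig, axes = plt.subplots(2, 2, figsize=(10, 6))
for ax, col in zip(axes[0], df.columns):
    ax.hist(df[col], bins=20)          # distribution of the feature
    ax.set_title(f"Histogram: {col}")
for ax, col in zip(axes[1], df.columns):
    ax.boxplot(df[col])                # outliers appear beyond the whiskers
    ax.set_title(f"Boxplot: {col}")
plt.tight_layout()
plt.show()
```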
Only narration
Highlight:
scaler = MinMaxScaler()
scaled_df = scaler.fit_transform(df) |
Now, let us preprocess the dataset.
First, we normalize the dataset using MinMaxScaler to scale feature values. This transformation ensures all values fall between 0 and 1. It improves uniformity and enhances the clustering performance. |
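The normalization step can be verified on a few sample rows. This sketch uses made-up income and spending values to show that MinMaxScaler maps each column exactly onto the range 0 to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up sample rows: [income, spending score].
data = np.array([[15.0, 39.0], [137.0, 99.0], [76.0, 1.0]])

scaler = MinMaxScaler()
scaled = scaler.fit_transform(data)   # each column mapped to [0, 1]

print(scaled.min(axis=0))  # [0. 0.]
print(scaled.max(axis=0))  # [1. 1.]
```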
Highlight:
silhouette_scores = []
K_range = range(2, 11)
for K in K_range:
    kmeans = KMeans(n_clusters=K, random_state=42, n_init=10)
    cluster_labels = kmeans.fit_predict(scaled_df)
    score = silhouette_score(scaled_df, cluster_labels)
    silhouette_scores.append(score)
plt.show() |
Next, we find the optimal number of clusters using the Silhouette Score method.
The highest Silhouette score suggests the best k value. To do this, we initialize an empty list to store silhouette scores. The for loop iterates through a range of K values to perform clustering. For each K, a KMeans model is created and trained. Cluster labels are predicted, and the silhouette score is calculated. The scores are stored for further analysis. Further, we generate a plot to visualize the silhouette scores. |
Only narration
Hover over output plot |
Ignore the warning in the output cell.
The plot shows silhouette scores for different cluster values. A higher score means better defined clusters with less overlap. In this plot, the optimal k is 5 as it has the highest silhouette score. |
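The silhouette-based selection of k can be run end to end on synthetic data. This sketch builds five clearly separated blobs as a stand-in for the scaled customer data, so the score should peak at K = 5, mirroring the result in the tutorial.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Five well-separated synthetic clusters (stand-in for scaled_df).
centers = [[0, 0], [10, 0], [0, 10], [10, 10], [5, 5]]
X, _ = make_blobs(n_samples=200, centers=centers,
                  cluster_std=0.5, random_state=42)

silhouette_scores = []
K_range = range(2, 11)
for K in K_range:
    kmeans = KMeans(n_clusters=K, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X)
    silhouette_scores.append(silhouette_score(X, labels))

# The K with the highest silhouette score is the candidate optimum.
optimal_k = K_range[int(np.argmax(silhouette_scores))]
print("Optimal K:", optimal_k)
```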
Highlight:
optimal_k = 5 |
We store the optimal k value in the optimal underscore k variable. |
Only Narration
Highlight: |
Next, we create a K-means model using the optimal k value.
The model is fitted to the scaled data, assigning cluster labels. |
Highlight: | We extract the cluster centers from the trained K-means model.
The coordinates of these centers are printed for analysis. |
Show output and hover over it | Each cluster center represents the average value of points in that cluster.
Each row corresponds to a centroid’s position for income and spending score. |
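Fitting the model and reading off the centroids can be sketched on a toy example; the two obvious groups below stand in for the scaled customer data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D points forming two obvious groups (stand-in for scaled_df).
scaled_df = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                      [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]])

kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
kmeans.fit(scaled_df)

# One row per centroid: its coordinates in feature space,
# i.e. the mean of the points assigned to that cluster.
print(kmeans.cluster_centers_)
```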
Highlight:
df['Cluster'] = kmeans.labels_ |
We then assign each data point to its respective cluster.
A new column Cluster is added to store cluster labels. |
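Attaching the labels to the DataFrame can be sketched as follows, again with made-up income and spending rows standing in for the dataset.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up rows: two low-income and two high-income customers.
df = pd.DataFrame({"Annual Income": [15, 16, 120, 126],
                   "Spending Score": [39, 81, 79, 28]})

kmeans = KMeans(n_clusters=2, random_state=42, n_init=10).fit(df.values)

# Store each point's cluster id alongside the original features.
df["Cluster"] = kmeans.labels_
print(df)
```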
Highlight:
silhouette_avg = silhouette_score(scaled_df, kmeans.labels_)
print("Silhouette Score: ", silhouette_avg) |
Next, we calculate the silhouette score to evaluate clustering quality.
A higher score indicates well-separated and dense clusters. |
Show the output | The silhouette score of 0.559 suggests a moderate clustering quality.
A score close to -1 means the data point is poorly clustered. A score close to 1 means the data point is well-clustered. This suggests that while clusters are distinguishable, some overlap may exist. |
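The evaluation step can be sketched on four toy points forming two tight pairs; on such well-separated data the silhouette score comes out close to 1, higher than the 0.559 seen in the tutorial.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two tight, well-separated pairs of points.
X = np.array([[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [0.9, 0.9]])
labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X)

# Ranges from -1 (misassigned) through 0 (overlapping) to +1 (well separated).
score = silhouette_score(X, labels)
print("Silhouette Score:", round(score, 3))
```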
Highlight:
plt.figure(figsize=(8, 5))
plt.show() |
Finally, we plot data points using the scaled income and spending score values.
To enhance clarity, each point is color-coded based on its cluster assignment. The cluster centers are marked with red X for easy identification. This visualization helps to understand how data points are grouped. |
Show the output plot | The scatter plot visualizes K-means clustering results.
Each dot represents a data point, color-coded by its assigned cluster. The clusters show distinct groupings based on annual income and spending score. |
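The final visualization can be sketched end to end on synthetic data; the three generated groups stand in for the scaled customer features, and the axis labels are illustrative.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Three synthetic groups in [0, 1] x [0, 1] (stand-in for scaled data).
X = np.vstack([rng.normal(loc, 0.05, size=(40, 2))
               for loc in ([0.2, 0.2], [0.8, 0.2], [0.5, 0.8])])

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X)

plt.figure(figsize=(8, 5))
# Colour each point by its cluster id; mark centroids with a red X.
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap="viridis", s=20)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c="red", marker="X", s=150)
plt.xlabel("Scaled Annual Income")
plt.ylabel("Scaled Spending Score")
plt.show()
```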
Show Slide:
Summary |
This brings us to the end of the tutorial.
Let us summarize. |
Show Slide:
Assignment |
As an assignment,
|
Show Slide:
Assignment Show elbow.png |
Refer to the code given: |
Show Slide:
Assignment Solution Show plot.png |
After completing the assignment, we’ll get the following plot. |
Show Slide:
FOSSEE Forum |
For any general or technical questions on Python for Machine Learning, visit the FOSSEE forum and post your question. |
Show Slide:
Thank you |
This is Anvita Thadavoose Manjummel, a FOSSEE Summer Fellow 2025, IIT Bombay, signing off.
Thanks for joining. |