Python-for-Machine-Learning/C3/K-Means-Clustering/English

From Script | Spoken-Tutorial
Visual Cue | Narration
Show Slide: Welcome to the Spoken Tutorial on K-means Clustering.
Show Slide: In this tutorial, we will learn about
  • Introduction to K-means Clustering
  • Steps involved in K-means Algorithm
  • Choosing the Optimal number of Clusters (k)
Show Slide:

System Requirements

To Record this tutorial, I am using
  • Ubuntu Linux OS version 24.04
  • Jupyter Notebook IDE
Show Slide: To follow this tutorial,
  • The Learner must have basic knowledge of Python
  • For prerequisite Python tutorials, please visit this website.
Show Slide:

Code files

  • The files used in this tutorial are provided in the Code files link.
  • Please download and extract the files.
  • Make a copy and then use them while practicing.
Show Slide:

K-means Clustering

  • K-means clustering is an unsupervised machine learning algorithm.
  • It partitions data into a predetermined number of clusters.
  • This algorithm aims to group similar data points together.
Show Slide:

K-means Clustering

  • It creates distinct clusters by separating dissimilar points.
Show Slide:

Working

working.png

Let us see the working of K-means clustering.
  • First, choose the number of clusters, k, to divide the data into.
  • Then, randomly pick k data points as the initial centroids, one for each cluster.
  • Each centroid represents the average position of all points in its cluster.
Show Slide:

Working

working.png

  • Each data point is assigned to the nearest cluster centroid.
  • Update the centroids by averaging the data points in each cluster.
Show Slide:

Working

working.png

  • New centroids replace the old ones based on these mean values.
  • This is repeated until the cluster assignments no longer change.
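The steps described above can be sketched in a few lines of NumPy. This is a minimal illustration of the algorithm only, not the scikit-learn implementation used later in this tutorial:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=42):
    """Minimal K-means sketch: assign points to the nearest centroid,
    then recompute centroids, until assignments stabilize."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: update each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster happens to be empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```
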
Hover over the files

Point to KMeansClustering.ipynb

I have created the required files for the demonstration of K-Means clustering.

K means clustering dot ipynb is the Python notebook file used in this tutorial.

Open Customers.csv dataset

For this tutorial, we use a dataset of customer income and spending scores.

The goal is to group individuals based on income and spending patterns.

Press Ctrl+Alt+T keys

Type conda activate ml

Press Enter

Let us open the Linux terminal by pressing Ctrl, Alt and T keys together.

Activate the machine learning environment as shown.

To go to the Downloads folder,

Type cd Downloads

Type jupyter notebook

I have saved my code file in the Downloads folder.

Please navigate to the respective folder of your code file location.

Then type, jupyter space notebook and press Enter.

Show Jupyter Notebook Home Page: Double click on

KMeansClustering.ipynb file

We can see the Jupyter Notebook Home page has opened in the web browser.

Click the K means clustering dot ipynb file to open it.

Note that each cell will have the output displayed in this file.

Let us see the implementation of the K-Means Clustering.

Highlight:

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

from sklearn.preprocessing import MinMaxScaler

from sklearn.cluster import KMeans

from sklearn.metrics import silhouette_score

First, we import the necessary libraries required for K-means Clustering.

Make sure to Press Shift and Enter to execute the code in each cell.

Highlight the code

We load the customers dataset into the df dataframe.

Let us display the first few rows.
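The loading step looks roughly like the sketch below. The tutorial reads Customers.csv; here a tiny inline sample stands in for it, and the column names are assumptions that should be matched to your CSV:

```python
import io
import pandas as pd

# Inline stand-in for Customers.csv (column names are assumptions)
csv_text = """CustomerID,Annual Income,Spending Score
1,15,39
2,16,81
3,17,6
"""
df = pd.read_csv(io.StringIO(csv_text))  # in the notebook: pd.read_csv("Customers.csv")
print(df.head())  # display the first few rows
```
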

Highlight:

print(df.describe())

Hover over the output

To understand the data distribution, we use the describe function.

The dataset contains 200 entries for Annual Income and Spending Score.

Income ranges from 15K to 137K, while the spending score varies from 1 to 99.

Highlight:

df.isnull()

Hover over the output

We check for missing values in the dataset using the isnull function.

The output shows False for all entries, confirming no missing values.

Highlight the code

To visualize the dataset, we plot histograms for each numerical feature.

We also use boxplots to detect potential outliers.

Hover over the output plots as per the narration

The histogram reveals the distribution of annual income and spending scores.

Income is more spread out, while spending scores form distinct clusters.

The boxplot identifies an outlier in the Annual Income feature.

It is indicated by a point above the upper whisker.

The Spending Score feature does not exhibit clear outliers.

Only narration

Highlight:

scaler = MinMaxScaler()

scaled_df = scaler.fit_transform(df)

Now, let us preprocess the dataset.

First, we normalize the dataset using MinMaxScaler to scale feature values.

This transformation ensures all values fall between 0 and 1.

It improves uniformity and enhances the clustering performance.
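On a small example the effect of MinMaxScaler is easy to see: each column is rescaled independently via (x - col_min) / (col_max - col_min). The sample values below are illustrative only:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two features with very different ranges (illustrative values)
data = np.array([[15.0, 39.0],
                 [16.0, 81.0],
                 [137.0, 1.0]])

# Each column is mapped to [0, 1]: (x - col_min) / (col_max - col_min)
scaled = MinMaxScaler().fit_transform(data)
print(scaled)
```
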

Highlight:

silhouette_scores = []

K_range = range(2, 11)

for K in K_range:

kmeans = KMeans(n_clusters=K, random_state=42, n_init=10)

cluster_labels = kmeans.fit_predict(scaled_df)

score = silhouette_score(scaled_df, cluster_labels)

silhouette_scores.append(score)

plt.show()

Next, we find the optimal number of clusters using the Silhouette Score method.

The highest Silhouette score suggests the best k value.

To do this, we initialize an empty list to store silhouette scores.

The for loop iterates through a range of K values to perform clustering.

For each K, a KMeans model is created and trained.

Cluster labels are predicted, and the silhouette score is calculated.

The scores are stored for further analysis.

Further, we generate a plot to visualize the silhouette scores.
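Put together, the loop above plus the plotting step can be run end to end. The code below uses a synthetic stand-in for scaled_df (three tight blobs) and a headless matplotlib backend; both are assumptions for a scripted run, not part of the tutorial's notebook:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (an assumption for scripted runs)
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled customer data: three tight blobs
rng = np.random.default_rng(42)
scaled_df = np.vstack([rng.normal(c, 0.05, (40, 2)) for c in (0.1, 0.5, 0.9)])

silhouette_scores = []
K_range = range(2, 11)
for K in K_range:
    kmeans = KMeans(n_clusters=K, random_state=42, n_init=10)
    cluster_labels = kmeans.fit_predict(scaled_df)
    silhouette_scores.append(silhouette_score(scaled_df, cluster_labels))

plt.plot(list(K_range), silhouette_scores, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Silhouette score")
plt.savefig("silhouette_plot.png")  # plt.show() in the notebook
```

On this synthetic data the curve peaks at k = 3, matching the three blobs; on the customer data the peak is at k = 5.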

Only narration

Hover over output plot

Ignore the warning in the output cell.

The plot shows silhouette scores for different cluster values.

A higher score means better defined clusters with less overlap.

In this plot, the optimal k is 5 as it has the highest silhouette score.

Highlight:

optimal_k = 5

We store the optimal k value in the optimal underscore k variable.
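Instead of reading k off the plot, the best value can also be picked programmatically. The silhouette values below are illustrative placeholders, not real outputs:

```python
import numpy as np

# Illustrative silhouette scores for k = 2..10 (placeholder values)
K_range = range(2, 11)
silhouette_scores = [0.41, 0.48, 0.50, 0.56, 0.53, 0.50, 0.47, 0.45, 0.44]

# Pick the k with the highest silhouette score
optimal_k = list(K_range)[int(np.argmax(silhouette_scores))]
print(optimal_k)
```
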
Only narration

Highlight:

Next, we create a K-means model using the optimal k value.

The model is fitted to the scaled data, assigning cluster labels.
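The fitting step looks roughly like this sketch, with a synthetic stand-in for scaled_df (in the notebook, scaled_df comes from the MinMaxScaler cell):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled customer data
rng = np.random.default_rng(42)
scaled_df = rng.random((60, 2))
optimal_k = 5

# Fit K-means with the optimal k and assign cluster labels
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(scaled_df)
print(kmeans.cluster_centers_)  # one row per centroid
```
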

Highlight the code

We extract the cluster centers from the trained K-means model.

The coordinates of these centers are printed for analysis.

Show output and hover over it

Each cluster center represents the average value of points in that cluster.

Each row corresponds to a centroid’s position for income and spending score.

Highlight:

df['Cluster'] = kmeans.labels_

We then assign each data point to its respective cluster.

A new column Cluster is added to store cluster labels.

Highlight:

silhouette_avg = silhouette_score(scaled_df, kmeans.labels_)

print("Silhouette Score: ", silhouette_avg)

Next, we calculate the silhouette score to evaluate clustering quality.

A higher score indicates well-separated and dense clusters.

Show the output

The silhouette score of 0.559 suggests a moderate clustering quality.

A score close to -1 means the data point is poorly clustered.

A score close to 1 means the data point is well-clustered.

This suggests that while clusters are distinguishable, some overlap may exist.

Highlight:

plt.figure(figsize=(8, 5))

plt.show()

Finally, we plot data points using the scaled income and spending score values.

To enhance clarity, each point is color-coded based on its cluster assignment.

The cluster centers are marked with red X for easy identification.

This visualization helps to understand how data points are grouped.
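A plotting cell along these lines produces such a figure. The data below is a synthetic stand-in, and the headless backend and file name are assumptions for a scripted run:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (an assumption for scripted runs)
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled customer data: five blobs
rng = np.random.default_rng(42)
scaled_df = np.vstack([rng.normal(c, 0.05, (30, 2))
                       for c in (0.1, 0.3, 0.5, 0.7, 0.9)])
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10).fit(scaled_df)

plt.figure(figsize=(8, 5))
# Color-code each point by its assigned cluster
plt.scatter(scaled_df[:, 0], scaled_df[:, 1], c=kmeans.labels_, cmap="viridis")
# Mark the cluster centers with red X markers
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c="red", marker="X", s=200, label="Centroids")
plt.xlabel("Annual Income (scaled)")
plt.ylabel("Spending Score (scaled)")
plt.legend()
plt.savefig("clusters.png")  # plt.show() in the notebook
```
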

Show the output plot

The scatter plot visualizes K-means clustering results.

Each dot represents a data point, color-coded by its assigned cluster.

The clusters show distinct groupings based on annual income and spending score.

Show Slide:

Summary

This brings us to the end of the tutorial.

Let us summarize. In this tutorial, we learned about
  • Introduction to K-means Clustering
  • Steps involved in the K-means Algorithm
  • Choosing the Optimal number of Clusters (k)

Show Slide:

Assignment

As an assignment,
  • Use the Elbow method instead of the Silhouette method to find the optimal k
  • Use inertia to identify it
  • Then, plot the elbow curve


Show Slide:

Assignment

Show elbow.png

Refer to the code given:
Show Slide:

Assignment Solution

Show plot.png

After completing the assignment, we’ll get the following plot.
Show Slide:

FOSSEE Forum

For any general or technical questions on Python for Machine Learning, visit the FOSSEE forum and post your question.
Show Slide:

Thank you

This is Anvita Thadavoose Manjummel, a FOSSEE Summer Fellow 2025, IIT Bombay, signing off.

Thanks for joining.

Contributors and Content Editors

Madhurig, Nirmala Venkat