Python-for-Machine-Learning/C2/K-Nearest-Neighbor-Classification/English



Visual Cue | Narration
Show Slide:

Welcome and Title Slide

Welcome to the Spoken Tutorial on K Nearest Neighbors Classification.
Show Slide:

Learning Objectives

In this tutorial, we will learn about
  • The fundamentals of KNN Algorithm
  • Implementing KNN for Classification using Iris dataset
  • Evaluating the performance of the trained model
Show Slide:

System Requirements

To record this tutorial, I am using
  • Ubuntu Linux OS version 24.04
  • Jupyter Notebook IDE
Show Slide:

Pre-requisites

To follow this tutorial,
  • The learner must have basic knowledge of Python.
  • For pre-requisite Python tutorials, please visit this website.
Show Slide:

Code Files

  • The files used in this tutorial are provided in the Code files link.
  • Please download and extract the files.
  • Make a copy and then use them while practicing.
Show Slide:

KNN

  • KNN stands for K Nearest Neighbors.
  • The Nearest Neighbor algorithm makes predictions using the most similar (closest) training data points.
  • K indicates the number of neighboring points to be considered for prediction.
Show Slide:

KNN Classification

  • The features of the K nearest samples are compared to determine similarity.
  • The new data point gets the most frequent class among its neighbors.
  • KNN is versatile and can effectively handle multi-class classification problems.
Show Slide:

Iris Dataset irisflowers.png

Hover over setosa, versicolor and virginica images

Hover over sepal length, sepal width, petal length and petal width

In this tutorial, we are using the Iris plants dataset.

It has 3 distinct flower classes.

The classes are distinguished by four features: sepal length, sepal width, petal length and petal width.

Show image

Iris Dataset iris.png

The three flower classes appear as clusters in different colors on the graph.

We will use the four features of the flower to classify the three distinct classes.

A black dot represents a flower in the dataset that lacks a defined class.

Our goal is to predict the black dot’s class based on its nearest neighbors.

The closest points to the black dot are called its neighbors.

These neighbors determine the black dot’s class.
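
The idea can be illustrated with a few lines of plain Python. This is only an illustrative sketch of the distance-and-vote rule, not part of the tutorial's notebook; the array names and the choice of Euclidean distance are assumptions.

from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k nearest training points
    nearest = np.argsort(distances)[:k]
    # The most frequent class among the neighbours becomes the prediction
    return Counter(y_train[nearest]).most_common(1)[0][0]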

Hover over the files

Point to KNN classification.ipynb

I have created the required files for the demonstration of KNN classification.

KNN classification dot ipynb is the Python notebook file for this demonstration.

Press Ctrl+Alt+T keys

Type conda activate ml

Press Enter

Let us open the Linux terminal by pressing Ctrl, Alt and T keys together.

Activate the machine learning environment by typing

conda space activate space ml

Press Enter.

Type cd Downloads

Press Enter

Type jupyter notebook

Press Enter

I have saved my code file in the Downloads folder.

Please navigate to the folder where your code file is saved.

Type, jupyter space notebook and press Enter to open Jupyter Notebook.

The Jupyter Notebook home page will open.

Click on KNNClassification.ipynb

We can see the homepage of the Jupyter notebook has opened in the web browser.

Locate the KNN classification dot ipynb file.

Open the file by clicking on it.

Note that each cell will have the output displayed in this file.

Let us see the implementation of the KNN classification model.

Highlight

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

We import these libraries for KNN Classification.

Please press Shift plus Enter to execute the code in each cell.
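
The highlighted cell shows only the data-handling and plotting imports. The scikit-learn functions used later in this notebook also have to be imported somewhere; a minimal sketch of such an import cell is given below. The exact cell is not shown in the script, so treat this grouping as an assumption.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# scikit-learn pieces used later in this notebook (assumed to be imported here)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.preprocessing import label_binarize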

Highlight

iris = load_iris()

First, we load the dataset into a variable named iris.

The iris dataset is a built-in dataset available in the Scikit-learn library.

Only narration

Highlight iris.feature_names

Let us explore the dataset.
We list out the feature names of the Iris dataset as shown here.

Highlight iris.target_names

Next, let us list out the target names.

The target classes represent the different species of the iris flower.

Setosa is shown as 0, versicolor as 1 and virginica as 2.
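
For reference, these two exploration steps amount to printing the attributes of the loaded dataset object; a minimal sketch, assuming the variable name iris from the previous cell:

# Feature names: the four sepal and petal measurements in cm
print(iris.feature_names)
# Target names: the three species, encoded as 0, 1 and 2
print(iris.target_names)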

Highlight iris_df = pd.DataFrame(iris.data, columns=iris.feature_names) First, we create a DataFrame named iris underscore df.

We load the Iris dataset into the dataframe.

At this point, it holds only the four feature columns; the target class is added next.

Highlight iris_df.head()

Highlight output

Using the head method, the first few rows are displayed.

The default is 5 rows, but this can be changed by passing a different number as an argument.

Highlight iris_df['target'] = iris.target

iris_df.head()

We add a new target column with class labels to the dataframe.
Highlight iris_df.shape The shape attribute gives the dimensions of the dataframe as rows and columns.
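
Gathering the highlighted DataFrame steps into one place, a minimal sketch (assuming the same variable names as in the notebook):

# Build a DataFrame of the four features
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
# Add the encoded species as a new 'target' column
iris_df['target'] = iris.target
# Inspect the first rows and the overall dimensions
print(iris_df.head())   # first 5 rows by default
print(iris_df.shape)    # (150, 5): 150 samples, 4 features plus the target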
Only narration

Highlight

sns.pairplot(iris_df, hue='target', diag_kind='kde', palette='colorblind')

plt.show()

Next, we plot a pairplot to visualize the iris dataset.

It visualizes relationships between different features.

It creates scatterplots for each pair of features, colored by class labels.

plt dot show is used to display the generated plots.

Show output plots We see the pairplots visualizing feature relationships.

Scatter plots compare two features and help to identify patterns.

Diagonal KDE plots show the distribution of each feature for different classes.

Different colors represent different target classes.

The clusters indicate which species overlap with each other.

Only Narration

Highlight

X = iris_df.drop('target', axis=1)

Now, we split the dataset into X and y to prepare the data for training.

First, we drop the column named target.

Then we copy the remaining features into the variable X.

Highlight y = iris_df['target'] Next, we assign the target column to y.
Highlight X We see that X contains all the features, without the target column.
Highlight y

Highlight the output

We see that y contains the target classes.

It is the species of the iris flower.

Only narration.

Highlight train_test_split(X, y, test_size=0.4, random_state=42)

Next, we split the data into training and testing sets.

We use the train underscore test underscore split method.

The split ratio is adjustable through the test underscore size parameter.

We set the test underscore size as 0.4.

Here, we use 40 percent of the data for testing and 60 percent for training.

Setting random state equal to 42 ensures the split is reproducible.

It guarantees we get the same result across multiple executions.

Highlight X_train, X_test, y_train, y_test We assign the split data into four variables.

X underscore train contains the features of the training data.

It is used for model training.

y underscore train contains the target values for the training data.

X underscore test contains features for the test data.

It is used for making predictions.

y underscore test contains the actual class labels for the test data.

It is used for evaluating the model performance.
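
For reference, the split shown in the highlights can be written as a single call; a minimal sketch, assuming the variable names used above:

# 60 percent training data, 40 percent testing data, reproducible split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42)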

Highlight knn = KNeighborsClassifier(n_neighbors=7) Now, we create the KNN classifier using KNeighborsClassifier with 7 neighbors.
Highlight knn.fit(X_train, y_train) We train the KNN classifier using the fit method on the training data.

The fit method prepares the model using the training data; for KNN, it stores the training samples for later comparison.

Highlight y_train_pred = knn.predict(X_train) We predict the labels for the training data.
Highlight training_accuracy = accuracy_score(y_train, y_train_pred)

print("Training Accuracy: {training_accuracy:.3f}")

We calculate and print the accuracy of the model.

Accuracy is the ratio of correct predictions to the total number of instances.

It helps to measure how well the model is performing.
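
Gathering the highlighted training steps into one cell for reference, a minimal sketch (the f-string in the print call is assumed, since the output below shows a formatted number):

# Create and train a KNN classifier with 7 neighbours
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)

# Accuracy on the data the model was trained on
y_train_pred = knn.predict(X_train)
training_accuracy = accuracy_score(y_train, y_train_pred)
print(f"Training Accuracy: {training_accuracy:.3f}")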

Highlight

Training Accuracy: 0.956

The accuracy is 0.956, which is quite good.
Highlight

print("\nClassification Report:")

print(classification_report(y_train, y_train_pred))


Next, we print the classification report.

The classification report helps to evaluate how well the model is performing.

Precision tells how many of the positive predictions made by the model were correct.

Recall shows how well the model detects actual positive cases.

F1 Score is the balance between precision and recall.

Support is the count of true instances of each class in the dataset.
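
For reference, the same quantities can also be computed individually with scikit-learn's metric functions; this small sketch is not part of the highlighted notebook, and the macro average is only one possible choice:

# Per-class metrics summarised with the macro average
from sklearn.metrics import precision_score, recall_score, f1_score
print(precision_score(y_train, y_train_pred, average='macro'))
print(recall_score(y_train, y_train_pred, average='macro'))
print(f1_score(y_train, y_train_pred, average='macro'))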

Show output table

Box for the data

From the report, we conclude that precision and recall are high for all classes.

F1-Score shows good overall performance across all classes.

The accuracy is 96 percent.

This means the model made correct predictions for 96 percent of the instances.

The macro and weighted averages reflect consistent performance across the dataset.

Only narration Next, we evaluate the model on the testing data.

First, we predict the class label for a single test sample.

Highlight sample_test_data = We extract the 10th row of the test dataset and reshape it for prediction.

We use the predict method to predict the class label using the trained model.

We store the actual class label in the actual underscore class variable.

Then, we print the predicted and actual class labels of the data sample.
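
The highlighted line is truncated in this script; a minimal sketch of the single-sample prediction, assuming the 10th row refers to positional index 9 and that these variable names match the notebook:

# Take one row from the test set and reshape it to (1, number of features)
sample_test_data = X_test.iloc[9].values.reshape(1, -1)
predicted_class = knn.predict(sample_test_data)[0]
actual_class = y_test.iloc[9]
print("Predicted class:", predicted_class)
print("Actual class:", actual_class)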

Highlight output The predicted class for unseen data is 2, which is virginica.

The actual class is also 2, indicating the prediction is correct.

Highlight accuracy = accuracy_score(y_test, y_pred)

print(f"Testing Accuracy: {accuracy:.3f}")

Now, we calculate and print the accuracy.
Highlight Testing Accuracy: 0.983 The accuracy is approximately 0.983.

We can conclude that the model is performing well on unseen data.

Only Narration

Highlight y_test_bin = label_binarize(y_test, classes=[0, 1, 2])

Finally, we plot the precision-recall curve to visualize the performance for each class.

It calculates precision and recall for each class.

It also computes the average precision, which summarizes the precision-recall curve into a single score.

Highlight plt.figure(figsize=(10, 6)) plt dot plot function plots the precision recall curve.
Highlight plt.xlabel("Recall")

plt.ylabel("Precision")

plt.title("Precision-Recall Curve")

plt.show()

plt dot show displays the final precision recall curve.
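
Only parts of the plotting cell are highlighted above; a minimal sketch of a one-vs-rest precision-recall plot, assuming the curves are drawn per class from the classifier's predicted probabilities:

# Binarise the test labels for one-vs-rest curves
y_test_bin = label_binarize(y_test, classes=[0, 1, 2])
y_score = knn.predict_proba(X_test)

plt.figure(figsize=(10, 6))
for i, name in enumerate(iris.target_names):
    precision, recall, _ = precision_recall_curve(y_test_bin[:, i], y_score[:, i])
    ap = average_precision_score(y_test_bin[:, i], y_score[:, i])
    plt.plot(recall, precision, label=f"{name} (AP = {ap:.2f})")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.legend()
plt.show()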
Show output plot The Precision-Recall curve shows the trade-off between precision and recall.

KNN classifier achieved perfect precision-recall for all classes.

A high average precision (AP) indicates the model performs well in distinguishing all three classes.

Highlight print("\nClassification Report:")

print(classification_report(y_test, y_pred))

Finally, we evaluate the performance using the classification report.

The report offers a detailed assessment of the model’s performance.

Show Slide:

Summary

This brings us to the end of the tutorial. Let us summarize.

In this tutorial, we learnt about
  • The fundamentals of the KNN algorithm
  • Implementing KNN for classification using the Iris dataset
  • Evaluating the performance of the trained model
Show Slide:

Assignment

As an assignment, please do the following
  • Use K as 7 and test size as 0.2.
  • Evaluate the model performance using the classification report.
Show Slide:

Assignment Solution

K = 7, test_size = 0.2

Here is the classification report for K equals 7 with a 0.2 train-test split.

We will get an accuracy of 96 percent.
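
A minimal sketch of the assignment setup, assuming only the test size changes from the tutorial and random_state stays at 42:

# Assignment: K = 7 with an 80/20 train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)
print(classification_report(y_test, knn.predict(X_test)))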

Show Slide:

FOSSEE Forum

For any general or technical questions on Python for Machine Learning, visit the FOSSEE forum and post your question.
Show Slide:

Thank You

This is Anvita Thadavoose Manjummel, a FOSSEE Summer Fellow 2025, IIT Bombay.

Thanks for joining.

Contributors and Content Editors

Madhurig, Nirmala Venkat