Python-for-Machine-Learning/C2/K-Nearest-Neighbor-Classification/English
| Visual Cue | Narration |
| Show Slide:
Welcome and Title Slide |
Welcome to the Spoken Tutorial on K Nearest Neighbors Classification. |
| Show Slide:
Learning Objectives |
In this tutorial, we will learn about K Nearest Neighbors classification. We will implement a KNN classifier on the Iris dataset using Scikit-learn and evaluate its performance. |
| Show Slide:
System Requirements |
To record this tutorial, I am using the Ubuntu Linux operating system and Jupyter Notebook, with the required Python libraries installed in a conda environment. |
| Show Slide:
Pre- requisites |
To follow this tutorial, you should have a working knowledge of Python and basic machine learning concepts. |
| Show Slide:
Code Files |
The code file used in this tutorial is available in the Code files link on this tutorial page. Please download and use it. |
| Show Slide:
KNN |
K Nearest Neighbors, or KNN, is a supervised machine learning algorithm. It classifies a data point based on the classes of the points closest to it in the training data. |
| Show Slide:
KNN Classification |
In KNN classification, we choose a value K. To classify a new sample, the algorithm finds the K nearest training points and assigns the class held by the majority of them. |
| Show Slide:
Iris Dataset irisflowers.png Hover over setosa, versicolor and virginica images Hover over sepal length, sepal width, petal length and petal width |
In this tutorial, we are using the Iris plants dataset.
It has 3 distinct flower classes: setosa, versicolor and virginica. Each flower is described by four features: sepal length, sepal width, petal length and petal width. |
| Show image
Iris Dataset iris.png |
The three flower classes appear as clusters in different colors on the graph.
We will use the four features of the flower to classify the three distinct classes. A black dot represents a flower whose class is not yet known. Our goal is to predict the black dot’s class based on its nearest neighbors. The closest points to the black dot are called its neighbors, and these neighbors determine its class. |
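Before moving to the notebook, here is a minimal from-scratch sketch of this voting idea. It is illustrative only, not the tutorial's code; the 2-D points, labels and K value are invented for the example.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every labelled training point.
    distances = [(math.dist(p, query), label)
                 for p, label in zip(train_points, train_labels)]
    # Keep the k closest points and count their class labels.
    nearest = sorted(distances)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Invented 2-D points: two loose clusters and an unlabelled query (the "black dot").
points = [(1.0, 1.1), (1.2, 0.9), (0.8, 1.0), (3.0, 3.2), (3.1, 2.9)]
labels = ["setosa", "virginica", "setosa", "virginica", "virginica"]
print(knn_predict(points, labels, query=(1.1, 1.0), k=3))  # prints "setosa"
```

Even though one nearby point is labelled virginica, the majority of the three nearest neighbors are setosa, so the query is classified as setosa.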
| Hover over the files
Point to KNN classification.ipynb |
I have created the required files for the demonstration of KNN classification.
KNN classification dot ipynb is the python notebook file for this demonstration. |
| Press Ctrl+Alt+T keys
Type conda activate ml Press Enter |
Let us open the Linux terminal by pressing Ctrl, Alt and T keys together.
Activate the machine learning environment by typing conda space activate space ml and press Enter. |
| Type cd Downloads
Press Enter Type jupyter notebook Press Enter |
I have saved my code file in the Downloads folder.
Please navigate to the folder where your code file is saved. Type jupyter space notebook and press Enter to open Jupyter Notebook. |
| Jupyter Notebook Home Page will be opened.
Click on KNNClassification.ipynb |
We can see that the Jupyter Notebook homepage has opened in the web browser.
Locate the KNN classification dot ipynb file and open it by clicking on it. Note that in this file, every cell already has its output displayed. Let us see the implementation of the KNN classification model. |
| Highlight
import pandas as pd import matplotlib.pyplot as plt import seaborn as sns |
We import these libraries for KNN Classification.
Please press Shift plus Enter to execute the code in each cell. |
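The highlight shows only the plotting and data-handling imports. Judging from the functions highlighted in the later cells, the full import cell plausibly looks like this sketch; the exact grouping and order are assumptions.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn pieces used in the later cells (assumed from the highlights).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (accuracy_score, classification_report,
                             precision_recall_curve, average_precision_score)
from sklearn.preprocessing import label_binarize
```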
| Highlight
iris = load_iris() |
First, we load the dataset into a variable named iris.
The Iris dataset is a built-in dataset available in the Scikit-learn library. |
| Only narration
Highlight iris.feature_names |
Let us explore the dataset. |
| Highlight | We list out the feature names of the Iris dataset as shown here. |
| Highlight iris.target_names | Next let us list out the target names. |
| Highlight | The target classes represent the different species of the iris flower.
Setosa is shown as 0, versicolor as 1 and virginica as 2. |
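A minimal sketch of this exploration step, using the standard load_iris interface from Scikit-learn:

```python
from sklearn.datasets import load_iris

iris = load_iris()  # built-in copy of the Iris dataset

print(iris.feature_names)
# ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
print(iris.target_names)
# ['setosa' 'versicolor' 'virginica'] -- encoded as 0, 1 and 2 in iris.target
```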
| Highlight iris_df = pd.DataFrame(iris.data,columns=iris.feature_names) | First, we create a DataFrame named iris underscore df.
We load the Iris dataset into the dataframe. It holds the features and target class. |
| Highlight iris_df.head()
Highlight output |
The head method displays the first few rows of the dataframe.
By default it shows 5 rows; a different number can be passed as an argument. |
| Highlight iris_df['target'] = iris.target
iris_df.head() |
We add a new target column with class labels to the dataframe. |
| Highlight iris_df.shape | The shape attribute gives the shape of the dataframe as rows and columns. |
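Putting the last few cells together, a runnable sketch of the DataFrame construction step:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
# One row per flower, one column per feature; then append the class labels.
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['target'] = iris.target

print(iris_df.head())   # first 5 rows by default; pass e.g. head(10) for more
print(iris_df.shape)    # (150, 5): 150 flowers, 4 features plus the target column
```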
| Only narration
Highlight sns.pairplot( iris_df, hue='target', diag_kind='kde', palette='colorblind' ) plt.show() |
Next, we plot a pairplot to visualize the iris dataset.
It visualizes relationships between different features. It creates scatterplots for each pair of features, colored by class labels. plt dot show is used to display the generated plots. |
| Show output plots | We see the pairplots visualizing feature relationships.
Scatter plots compare two features and help to identify patterns. Diagonal KDE plots show the distribution of each feature for each class. Different colors represent different target classes. Overlapping clusters show which species are harder to separate. |
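A self-contained sketch of the pairplot cell, matching the highlighted parameters:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['target'] = iris.target

# Scatter plots for every pair of features, KDE curves on the diagonal,
# coloured by target class (0 = setosa, 1 = versicolor, 2 = virginica).
sns.pairplot(iris_df, hue='target', diag_kind='kde', palette='colorblind')
plt.show()
```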
| Only Narration
Highlight X = iris_df.drop('target', axis=1) |
Now, we split the dataset into X and y to prepare the data for training.
First, we drop the column named target. The remaining features are stored in the variable X. |
| Highlight y = iris_df['target'] | Next, we assign the target column to y. |
| Highlight X | We see that X contains all the features except the target species. |
| Highlight y
Highlight the output |
We see that y contains the target classes.
It is the species of the iris flower. |
| Only narration.
Highlight train_test_split(X, y, test_size=0.4, random_state=42) |
Next, we split the data into training and testing sets.
We use the train underscore test underscore split method. The split ratio is adjustable through the test underscore size parameter. We set the test underscore size as 0.4. Here, we use 40 percent of the data for testing and 60 percent for training. Setting random underscore state equal to 42 ensures the split is reproducible. It guarantees we get the same result across multiple executions. |
| Highlight X_train, X_test, y_train, y_test | We assign the split data into four variables.
X underscore train contains the features of the training data. It is used for model training. y underscore train contains the target values for the training data. X underscore test contains features for the test data. It is used for making predictions. y underscore test contains the actual class labels for the test data. It is used for evaluating the model performance. |
| Highlight knn = KNeighborsClassifier(n_neighbors=7) | Now, we train the KNN classifier using KNeighborsClassifier with 7 neighbors. |
| Highlight knn.fit(X_train, y_train) | We train the KNN classifier using the fit method on the training data.
For KNN, the fit method simply stores the training data, which is later searched to find the nearest neighbors of a new sample. |
| Highlight y_train_pred = knn.predict(X_train) | We predict the labels for the training data. |
| Highlight training_accuracy = accuracy_score(y_train, y_train_pred)
print(f"Training Accuracy: {training_accuracy:.3f}") |
We calculate and print the accuracy of the model.
Accuracy is the ratio of correct predictions to the total number of instances. It helps to measure how well the model is performing. |
| Highlight
Training Accuracy: 0.956 |
The accuracy is 0.956, which is quite good. |
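A runnable sketch of the training and training-accuracy cells, assuming the split variables from the previous step:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# K = 7: each prediction is a majority vote among the 7 nearest training points.
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)  # for KNN, fit essentially stores the training data

y_train_pred = knn.predict(X_train)
training_accuracy = accuracy_score(y_train, y_train_pred)
print(f"Training Accuracy: {training_accuracy:.3f}")  # 0.956 in the tutorial
```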
| Highlight
print("\nClassification Report:") print(classification_report(y_train, y_train_pred))
|
Next, we print the classification report.
The classification report helps to evaluate how well the model is performing. Precision tells how many of the model's positive predictions were correct. Recall shows how well the model detects actual positive cases. F1 Score is the balance between precision and recall. Support is the count of true instances of each class in the dataset. |
| Show output table
Box for the data |
From the report, we conclude that precision and recall are high for all classes.
The F1-Score shows good overall performance across all classes. The accuracy is 96 percent. This means the model made correct predictions for 96 percent of the instances. The macro and weighted averages reflect consistent performance across the dataset. |
| Only narration | Next, we evaluate the model on the testing data.
First, we predict the class label for a single test sample. |
| Highlight sample_test_data = | We extract the 10th row of the test dataset and reshape it for prediction.
We use the predict method to predict the class label using the trained model. We store the actual class label in the actual underscore class variable. Then, we print the predicted and actual class labels of the data sample. |
| Highlight output | The predicted class for unseen data is 2, which is virginica.
The actual class is also 2, indicating the prediction is correct. |
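The highlight truncates the extraction code. A plausible sketch, assuming the notebook takes row 10 of the test set and reshapes it with NumPy-style reshape; the exact indexing is an assumption:

```python
# Take one row of the test set; reshape to (1, n_features) because
# predict expects a 2-D array. Row index 10 is an assumption here.
sample_test_data = X_test.iloc[10].values.reshape(1, -1)

predicted_class = knn.predict(sample_test_data)[0]
actual_class = y_test.iloc[10]
print("Predicted:", predicted_class, "Actual:", actual_class)  # both 2 -> virginica
```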
| Highlight accuracy = accuracy_score(y_test, y_pred)
print(f"Testing Accuracy: {accuracy:.3f}") |
Now, we calculate and print the accuracy. |
| Highlight Testing Accuracy: 0.983 | The accuracy is approximately 0.983.
We can conclude that the model is performing well on unseen data. |
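The variable y underscore pred is not defined in any highlighted cell; presumably it comes from predicting on the full test set, as in this sketch:

```python
from sklearn.metrics import accuracy_score

y_pred = knn.predict(X_test)                # predictions for the whole test set
accuracy = accuracy_score(y_test, y_pred)
print(f"Testing Accuracy: {accuracy:.3f}")  # 0.983 in the tutorial
```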
| Only Narration
Highlight y_test_bin = label_binarize(y_test, classes=[0, 1, 2]) |
Finally, we plot the precision-recall curve to visualize the performance for each class.
label underscore binarize converts the target labels into one binary column per class. We then calculate precision and recall for each class. The average precision summarizes each precision-recall curve into a single score. |
| Highlight plt.figure(figsize=(10, 6)) | plt dot figure sets the figure size, and plt dot plot draws the precision recall curve for each class. |
| Highlight plt.xlabel("Recall")
plt.ylabel("Precision") plt.title("Precision-Recall plt.show() |
plt dot show displays the final precision recall curve. |
| Show output plot | The Precision-Recall curve shows the trade-off between precision and recall.
The KNN classifier achieved perfect precision-recall scores for all classes. A high AP, or average precision, indicates the model performs well in distinguishing all three classes. |
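Only fragments of the plotting cell are highlighted. Below is a self-contained sketch of a per-class precision-recall curve consistent with those fragments; using predict_proba as the score source is an assumption.

```python
import matplotlib.pyplot as plt
from sklearn.preprocessing import label_binarize
from sklearn.metrics import precision_recall_curve, average_precision_score

# One binary column per class, so each class gets its own curve.
y_test_bin = label_binarize(y_test, classes=[0, 1, 2])
# Class-membership probabilities from the trained model (assumed score source).
y_score = knn.predict_proba(X_test)

plt.figure(figsize=(10, 6))
for i, name in enumerate(iris.target_names):
    precision, recall, _ = precision_recall_curve(y_test_bin[:, i], y_score[:, i])
    ap = average_precision_score(y_test_bin[:, i], y_score[:, i])  # one-number summary
    plt.plot(recall, precision, label=f"{name} (AP = {ap:.2f})")

plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.legend()
plt.show()
```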
| Highlight print("\nClassification Report:")
print(classification_report(y_test, y_pred)) |
Finally, we evaluate the performance using the classification report.
The report offers a detailed assessment of the model’s performance. |
| Show Slide:
Summary |
This brings us to the end of the tutorial. Let us summarize. In this tutorial, we learnt about KNN classification and implemented a KNN classifier on the Iris dataset using Scikit-learn. |
| Show Slide:
Assignment |
As an assignment, train the KNN classifier with K equals 7 and a train-test split of 0.2. Then generate the classification report and observe the accuracy. |
| Show Slide:
Assignment Solution K=7, train_test_split = 0.2 |
Here is the classification report for K equals 7 with a 0.2 train-test split.
We will get an accuracy of 96 percent. |
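To reproduce the assignment result, only the split size changes; a sketch assuming X and y from the tutorial:

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# Same pipeline, but hold out only 20% of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
knn = KNeighborsClassifier(n_neighbors=7).fit(X_train, y_train)
print(classification_report(y_test, knn.predict(X_test)))  # accuracy ~0.96
```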
| Show Slide:
FOSSEE Forum |
For any general or technical questions on Python for Machine Learning, visit the FOSSEE forum and post your question. |
| Show Slide:
Thank You |
This is Anvita Thadavoose Manjummel, a FOSSEE Summer Fellow 2025 at IIT Bombay.
Thanks for joining. |