Python-for-Machine-Learning/C2/K-Nearest-Neighbor-Classification/English



Visual Cue | Narration
Show Slide:

Welcome and Title Slide

Welcome to the Spoken Tutorial on K Nearest Neighbors Classification.
Show Slide:

Learning Objectives

In this tutorial, we will learn about
  • The fundamentals of KNN Algorithm
  • Implementing KNN for Classification using Iris dataset
  • Evaluating the performance of the trained model
Show Slide:

System Requirements

To record this tutorial, I am using
  • Ubuntu Linux OS version 24.04
  • Jupyter Notebook IDE
Show Slide:

Pre-requisites

To follow this tutorial,
  • The learner must have basic knowledge of Python.
  • For pre-requisite Python tutorials, please visit this website.
Show Slide:

Code Files

  • The files used in this tutorial are provided in the Code files link.
  • Please download and extract the files.
  • Make a copy and then use them while practicing.
Show Slide:

KNN

  • KNN stands for K Nearest Neighbors.
  • The Nearest Neighbor algorithm makes predictions using the most similar (closest) training data points.
  • K indicates the number of neighboring points to be considered for prediction.
Show Slide:

KNN Classification

  • The features of the K nearest samples are compared to determine similarity.
  • The new data point gets the most frequent class among its neighbors.
  • KNN is versatile and can effectively handle multi-class classification problems.
Show Slide:

Iris Dataset irisflowers.png

Hover over setosa, versicolor and virginica images

Hover over sepal length, sepal width, petal length and petal width

In this tutorial, we are using the Iris plants dataset.

It has 3 distinct flower classes.

The classes are distinguished by four features: sepal length, sepal width, petal length and petal width.

Show image

Iris Dataset iris.png

The three flower classes appear as clusters in different colors on the graph.

We will use the four features of the flower to classify the three distinct classes.

A black dot represents a flower in the dataset that lacks a defined class.

Our goal is to predict the black dot’s class based on its nearest neighbors.

The closest points to the black dot are called its neighbors.

These neighbors determine the black dot’s class.
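
The idea can be illustrated with a few lines of plain Python. This is only an illustrative sketch of the distance-and-vote rule, not part of the tutorial's notebook; the array names and the choice of Euclidean distance are assumptions.

from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k nearest training points
    nearest = np.argsort(distances)[:k]
    # The most frequent class among the neighbours becomes the prediction
    return Counter(y_train[nearest]).most_common(1)[0][0]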

Hover over the files

Point to KNN classification.ipynb

I have created the required files for the demonstration of KNN classification.

KNN classification dot ipynb is the Python notebook file for this demonstration.

Press Ctrl+Alt+T keys

Type conda activate ml

Press Enter

Let us open the Linux terminal by pressing Ctrl, Alt and T keys together.

Activate the machine learning environment by typing

conda space activate space ml

Press Enter.

Type cd Downloads

Press Enter

Type jupyter notebook

Press Enter

I have saved my code file in the Downloads folder.

Please navigate to the folder where your code file is saved.

Type, jupyter space notebook and press Enter to open Jupyter Notebook.

The Jupyter Notebook home page will open.

Click on KNNClassification.ipynb

We can see the homepage of the Jupyter notebook has opened in the web browser.

Locate the KNN classification dot ipynb file.

Open the file by clicking on it.

Note that each cell will have the output displayed in this file.

Let us see the implementation of the KNN classification model.

Highlight

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

We import these libraries for KNN Classification.

Please press Shift plus Enter to execute the code in each cell.
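
The highlighted cell shows only the data-handling and plotting imports. The scikit-learn functions used later in this notebook also have to be imported somewhere; a minimal sketch of such an import cell is given below. The exact cell is not shown in the script, so treat this grouping as an assumption.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# scikit-learn pieces used later in this notebook (assumed to be imported here)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.preprocessing import label_binarize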

Highlight

iris = load_iris()

First, we load the dataset into a variable named iris.

The iris dataset is a built-in dataset available in the Scikit-learn library.

Only narration

Highlight iris.feature_names

Let us explore the dataset.
We list out the feature names of the Iris dataset as shown here.

Highlight iris.target_names

Next, let us list out the target names.

The target classes represent the different species of the iris flower.

Setosa is shown as 0, versicolor as 1 and virginica as 2.
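
For reference, these two exploration steps amount to printing the attributes of the loaded dataset object; a minimal sketch, assuming the variable name iris from the previous cell:

# Feature names: the four sepal and petal measurements in cm
print(iris.feature_names)
# Target names: the three species, encoded as 0, 1 and 2
print(iris.target_names)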

Highlight iris_df = pd.DataFrame(iris.data, columns=iris.feature_names) First, we create a DataFrame named iris underscore df.

We load the Iris dataset into the dataframe.

At this point, it holds only the four feature columns; the target class is added next.

Highlight iris_df.head()

Highlight output

Using the head method, the first few rows are displayed.

The default is 5 rows, but this can be changed by passing a different number as an argument.

Highlight iris_df['target'] = iris.target

iris_df.head()

We add a new target column with class labels to the dataframe.
Highlight iris_df.shape The shape attribute gives the dimensions of the dataframe as rows and columns.
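
Gathering the highlighted DataFrame steps into one place, a minimal sketch (assuming the same variable names as in the notebook):

# Build a DataFrame of the four features
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
# Add the encoded species as a new 'target' column
iris_df['target'] = iris.target
# Inspect the first rows and the overall dimensions
print(iris_df.head())   # first 5 rows by default
print(iris_df.shape)    # (150, 5): 150 samples, 4 features plus the target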
Only narration

Highlight

sns.pairplot(iris_df, hue='target', diag_kind='kde', palette='colorblind')

plt.show()

Next, we plot a pairplot to visualize the iris dataset.

It visualizes relationships between different features.

It creates scatterplots for each pair of features, colored by class labels.

plt dot show is used to display the generated plots.

Show output plots We see the pairplots visualizing feature relationships.

Scatter plots compare two features and help to identify patterns.

Diagonal KDE plots show the distribution of each feature for different classes.

Different colors represent different target classes.

The clusters indicate which species overlap with each other.

Only Narration

Highlight

X = iris_df.drop('target', axis=1)

Now, we split the dataset into X and y to prepare the data for training.

First, we drop the column named target.

Then we copy the remaining features into the variable X.

Highlight y = iris_df['target'] Next, we assign the target column to y.
Highlight X We see that X contains all the features, without the target column.
Highlight y

Highlight the output

We see that y contains the target classes.

It is the species of the iris flower.

Only narration.

Highlight train_test_split(X, y, test_size=0.4, random_state=42)

Next, we split the data into training and testing sets.

We use the train underscore test underscore split method.

The split ratio is adjustable through the test underscore size parameter.

We set the test underscore size as 0.4.

Here, we use 40 percent of the data for testing and 60 percent for training.

Setting random state equal to 42 ensures the split is reproducible.

It guarantees we get the same result across multiple executions.

Highlight X_train, X_test, y_train, y_test We assign the split data into four variables.

X underscore train contains the features of the training data.

It is used for model training.

y underscore train contains the target values for the training data.

X underscore test contains features for the test data.

It is used for making predictions.

y underscore test contains the actual class labels for the test data.

It is used for evaluating the model performance.
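
For reference, the split shown in the highlights can be written as a single call; a minimal sketch, assuming the variable names used above:

# 60 percent training data, 40 percent testing data, reproducible split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42)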

Highlight knn = KNeighborsClassifier(n_neighbors=7) Now, we create the KNN classifier using KNeighborsClassifier with 7 neighbors.
Highlight knn.fit(X_train, y_train) We train the KNN classifier using the fit method on the training data.

The fit method prepares the model using the training data; for KNN, it stores the training samples for later comparison.

Highlight y_train_pred = knn.predict(X_train) We predict the labels for the training data.
Highlight training_accuracy = accuracy_score(y_train, y_train_pred)

print("Training Accuracy: {training_accuracy:.3f}")

We calculate and print the accuracy of the model.

Accuracy is the ratio of correct predictions to the total number of instances.

It helps to measure how well the model is performing.
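
Gathering the highlighted training steps into one cell for reference, a minimal sketch (the f-string in the print call is assumed, since the output below shows a formatted number):

# Create and train a KNN classifier with 7 neighbours
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)

# Accuracy on the data the model was trained on
y_train_pred = knn.predict(X_train)
training_accuracy = accuracy_score(y_train, y_train_pred)
print(f"Training Accuracy: {training_accuracy:.3f}")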

Highlight

Training Accuracy: 0.956

The accuracy is 0.956, which is quite good.
Highlight

print("\nClassification Report:")

print(classification_report(y_train, y_train_pred))


Next, we print the classification report.

The classification report helps to evaluate how well the model is performing.

Precision tells how many of the positive predictions made by the model were correct.

Recall shows how well the model detects actual positive cases.

F1 Score is the balance between precision and recall.

Support is the count of true instances of each class in the dataset.
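
For reference, the same quantities can also be computed individually with scikit-learn's metric functions; this small sketch is not part of the highlighted notebook, and the macro average is only one possible choice:

# Per-class metrics summarised with the macro average
from sklearn.metrics import precision_score, recall_score, f1_score
print(precision_score(y_train, y_train_pred, average='macro'))
print(recall_score(y_train, y_train_pred, average='macro'))
print(f1_score(y_train, y_train_pred, average='macro'))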

Show output table

Box for the data

From the report, we conclude that precision and recall are high for all classes.

F1-Score shows good overall performance across all classes.

The accuracy is 96 percent.

This means the model made correct predictions for 96 percent of the instances.

The macro and weighted averages reflect consistent performance across the dataset.

Only narration Next, we evaluate the model on the testing data.

First, we predict the class label for a single test sample.

Highlight sample_test_data = We extract the 10th row of the test dataset and reshape it for prediction.

We use the predict method to predict the class label using the trained model.

We store the actual class label in the actual underscore class variable.

Then, we print the predicted and actual class labels of the data sample.
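
The highlighted line is truncated in this script; a minimal sketch of the single-sample prediction, assuming the 10th row refers to positional index 9 and that these variable names match the notebook:

# Take one row from the test set and reshape it to (1, number of features)
sample_test_data = X_test.iloc[9].values.reshape(1, -1)
predicted_class = knn.predict(sample_test_data)[0]
actual_class = y_test.iloc[9]
print("Predicted class:", predicted_class)
print("Actual class:", actual_class)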

Highlight output The predicted class for unseen data is 2, which is virginica.

The actual class is also 2, indicating the prediction is correct.

Highlight accuracy = accuracy_score(y_test, y_pred)

print(f"Testing Accuracy: {accuracy:.3f}")

Now, we calculate and print the accuracy.
Highlight Testing Accuracy: 0.983 The accuracy is approximately 0.983.

We can conclude that the model is performing well on unseen data.

Only Narration

Highlight y_test_bin = label_binarize(y_test, classes=[0, 1, 2])

Finally, we plot the precision-recall curve to visualize the performance for each class.

It calculates precision and recall for each class.

It also computes the average precision, which summarizes the precision-recall curve into a single score.

Highlight plt.figure(figsize=(10, 6)) plt dot plot function plots the precision recall curve.
Highlight plt.xlabel("Recall")

plt.ylabel("Precision")

plt.title("Precision-Recall Curve")

plt.show()

plt dot show displays the final precision recall curve.
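
Only parts of the plotting cell are highlighted above; a minimal sketch of a one-vs-rest precision-recall plot, assuming the curves are drawn per class from the classifier's predicted probabilities:

# Binarise the test labels for one-vs-rest curves
y_test_bin = label_binarize(y_test, classes=[0, 1, 2])
y_score = knn.predict_proba(X_test)

plt.figure(figsize=(10, 6))
for i, name in enumerate(iris.target_names):
    precision, recall, _ = precision_recall_curve(y_test_bin[:, i], y_score[:, i])
    ap = average_precision_score(y_test_bin[:, i], y_score[:, i])
    plt.plot(recall, precision, label=f"{name} (AP = {ap:.2f})")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.legend()
plt.show()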
Show output plot The Precision-Recall curve shows the trade-off between precision and recall.

KNN classifier achieved perfect precision-recall for all classes.

A high average precision (AP) indicates the model performs well in distinguishing all three classes.

Highlight print("\nClassification Report:")

print(classification_report(y_test, y_pred))

Finally, we evaluate the performance using the classification report.

The report offers a detailed assessment of the model’s performance.

Show Slide:

Summary

This brings us to the end of the tutorial. Let us summarize.

In this tutorial, we learnt about
  • The fundamentals of the KNN algorithm
  • Implementing KNN for classification using the Iris dataset
  • Evaluating the performance of the trained model
Show Slide:

Assignment

As an assignment, please do the following
  • Use K as 7 and test size as 0.2.
  • Evaluate the model performance using the classification report.
Show Slide:

Assignment Solution

K = 7, test_size = 0.2

Here is the classification report for K equals 7 with a 0.2 train-test split.

We will get an accuracy of 96 percent.
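
A minimal sketch of the assignment setup, assuming only the test size changes from the tutorial and random_state stays at 42:

# Assignment: K = 7 with an 80/20 train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)
print(classification_report(y_test, knn.predict(X_test)))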

Show Slide:

FOSSEE Forum

For any general or technical questions on Python for Machine Learning, visit the FOSSEE forum and post your question.
Show Slide:

Thank You

This is Anvita Thadavoose Manjummel, a FOSSEE Summer Fellow 2025, IIT Bombay.

Thanks for joining.

Contributors and Content Editors

Madhurig, Nirmala Venkat