Python-for-Machine-Learning/C3/Decision-Tree/English
Latest revision as of 11:59, 4 July 2025
| Visual Cue | Narration |
| Show slide:
Welcome |
Welcome to the Spoken Tutorial on Decision Tree. |
| Show Slide:
Learning Objectives |
In this tutorial, we will learn about
|
| Show Slide:
System Requirements |
To record this tutorial, I am using
|
| Show Slide:
Prerequisite To follow this tutorial, |
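To follow this tutorial,
* The learner must have basic knowledge of Python.
* For prerequisite Python tutorials, please visit this website. |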
| Show Slide:
Code files |
|
| Show Slide:
Decision Tree |
|
| Show Slide:
Working of Decision Tree Show dt.png img |
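* The root node starts the decision tree with a question or condition.
* Based on the answer, we follow a branch to another node.
* A branch connects nodes, where each node represents a condition with outcomes. |
| Show dt.png img |
* This new node poses another question or condition. |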
| Hover over the files | I have created the required files for the demonstration of Decision Tree. |
| Open the file drug200.csv and point to the fields as per narration. | To implement the Decision Tree model, we use the drug200 dot csv dataset.
Here, we analyze patients' data to predict the most suitable drug for each of them. The drug200 dataset contains Age, Sex, BP, Cholesterol, Na to K ratio and Drug. |
| Point to the DecisionTree.ipynb | DecisionTree dot ipynb is the IPython notebook file for this demonstration. |
| Press Ctrl, Alt and T keys
Type conda activate ml
Press Enter |
Let us open the Linux terminal by pressing Ctrl, Alt and T keys together.
Activate the machine learning environment by typing conda space activate space ml Press Enter. |
| Go to the Downloads folder
Type cd Downloads
Press Enter
Type jupyter notebook
Press Enter |
I have saved my code file in the Downloads folder.
Please navigate to the respective folder of your code file location. Type, jupyter space notebook and press Enter to open Jupyter Notebook. |
| Show Jupyter Notebook Home page:
Click on DecisionTree.ipynb file |
We can see the Jupyter Notebook Home page has opened in the web browser.
Click the DecisionTree dot ipynb file to open it. Note that each cell will have the output displayed in this file. |
| Highlight The lines:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
Press Shift+Enter |
These are the necessary libraries to be imported for the Decision Tree.
Please remember to execute each cell by pressing Shift and Enter to get output. |
| Highlight the lines:
df_drug = pd.read_csv("drug200.csv")
df_drug.head()
Press Shift+Enter |
We start by loading the dataset from a CSV file and displaying the first few rows.
|
| Highlight The lines:
print("Length of Dataset:", len(df_drug))
print("Dataset Shape:", df_drug.shape)
Press Shift+Enter |
Then we print the number of rows and the shape of the dataset. |
| Highlight the lines:
le_sex = LabelEncoder()
le_BP = LabelEncoder()
Press Shift+Enter |
Next, we encode the categorical variables like Sex and BP into numerical values. |
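The highlighted lines only create the encoders. A minimal sketch of how they might then be applied is given below; the small DataFrame `df_demo` and its values are invented for illustration and are not rows from drug200.csv.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows with the same kind of categorical columns as the dataset
df_demo = pd.DataFrame({
    "Sex": ["F", "M", "M", "F"],
    "BP": ["HIGH", "LOW", "NORMAL", "HIGH"],
})

le_sex = LabelEncoder()
le_BP = LabelEncoder()

# fit_transform learns the sorted set of labels and maps each label to its index
df_demo["Sex"] = le_sex.fit_transform(df_demo["Sex"])
df_demo["BP"] = le_BP.fit_transform(df_demo["BP"])

print(df_demo)
print("BP classes:", list(le_BP.classes_))
```

LabelEncoder assigns integers in sorted label order, so the mapping can be read back from the `classes_` attribute of each encoder.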
| Highlight The lines:
x = df_drug.drop(columns=['Drug']).values
y = df_drug['Drug'].values
Press Shift+Enter |
We then separate the features x and the target variable y. |
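As a rough illustration of loading and separating the data, the sketch below reads a few rows from an in-memory string instead of the real file. The rows and the column name `Na_to_K` are assumptions made for this example; only the `Drug` column name is taken from the code above.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for drug200.csv
csv_text = """Age,Sex,BP,Cholesterol,Na_to_K,Drug
23,F,HIGH,HIGH,25.355,drugY
47,M,LOW,HIGH,13.093,drugC
28,F,NORMAL,HIGH,7.798,drugX
"""

df_demo = pd.read_csv(io.StringIO(csv_text))

# Every column except Drug is a feature; Drug is the target
x_demo = df_demo.drop(columns=["Drug"]).values
y_demo = df_demo["Drug"].values

print(x_demo.shape)  # rows, feature columns
print(y_demo)
```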
| Highlight the lines:
x |
Now we print the values of features. |
| Highlight the lines:
y |
Similarly, we print the values of the target. |
| Highlight the lines:
print("\nNumber of Duplicate Rows:", df_drug.duplicated().sum())
df = df_drug.drop_duplicates()
print("Dataset Shape After Removing Duplicates:", df.shape) |
Next, we check for duplicate rows and remove them if found.
We also print the dataset shape after removing the duplicates. |
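The duplicate check can be tried on a tiny, made-up DataFrame, as sketched below; the values are invented for illustration. Note that, in the cells shown, x and y are created from df_drug before this step, so they would reflect the removal only if they were recreated from df.

```python
import pandas as pd

# Hypothetical data in which the third row repeats the first
df_demo = pd.DataFrame({
    "Age": [23, 47, 23],
    "Drug": ["drugY", "drugC", "drugY"],
})

# duplicated() marks every repeat of an earlier row as True
n_dup = int(df_demo.duplicated().sum())
df_clean = df_demo.drop_duplicates()

print("Number of Duplicate Rows:", n_dup)
print("Shape After Removing Duplicates:", df_clean.shape)
```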
| Highlight the lines:
numerical_columns = df.select_dtypes(include=['int64', 'float64']).columns
scaler = StandardScaler() |
Now we use StandardScaler to standardize the numerical columns. |
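The highlighted lines select the numerical columns and create the scaler. One way the scaling itself might be carried out is sketched below on invented values; the DataFrame `df_demo` and the name `scaled` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical frame with two numerical columns and one text column
df_demo = pd.DataFrame({
    "Age": [20, 40, 60],
    "Na_to_K": [10.0, 20.0, 30.0],
    "Sex": ["F", "M", "F"],
})

numerical_columns = df_demo.select_dtypes(include=["int64", "float64"]).columns
scaler = StandardScaler()

# Each numerical column is rescaled to zero mean and unit variance
scaled = pd.DataFrame(
    scaler.fit_transform(df_demo[numerical_columns]),
    columns=numerical_columns,
)
print(scaled)
```

Decision trees split on thresholds, so they are generally not sensitive to feature scaling; the step mainly matters if the same data is later fed to scale-sensitive models.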
| Highlight the lines:
df.hist(figsize=(10, 8), bins=20, color='skyblue', edgecolor='black')
plt.suptitle("Distribution of Numerical Features")
plt.show() |
We then visualize the data distribution for numerical columns. |
| Highlight the lines:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=100)
Press Shift+Enter |
Now, we split the data into training and testing sets. |
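To see what test_size=0.3 means in practice, the sketch below splits 200 placeholder rows, assuming a dataset of 200 rows as the file name drug200 suggests. The arrays `x_demo` and `y_demo` are synthetic and carry no real patient data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 200 rows, 2 feature columns, 5 class labels
x_demo = np.arange(400).reshape(200, 2)
y_demo = np.arange(200) % 5

x_train, x_test, y_train, y_test = train_test_split(
    x_demo, y_demo, test_size=0.3, random_state=100
)

# 30 percent of the rows go to the test set, the rest to training
print(len(x_train), len(x_test))
```

Fixing random_state makes the split reproducible, so repeated runs give the same training and test rows.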
| Highlight the lines:
clf_entropy = DecisionTreeClassifier(criterion="entropy", random_state=5, max_depth=4, min_samples_leaf=5)
Press Shift+Enter |
After that, we create a decision tree classifier using entropy as the criterion.
Entropy is a measure of disorder, or class impurity, in the dataset. The classifier uses it to choose which feature to split on at the root and at each branch of the decision tree. By navigating through the root and branches, we arrive at a class decision. |
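The idea of entropy can be made concrete with a small calculation. The sketch below is an illustration written for this script, not the internal code of scikit-learn; the helper names `entropy` and `information_gain` and the sample labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    total = len(labels)
    return sum(
        -(count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical node holding two drugs in equal numbers
parent = ["drugA", "drugA", "drugX", "drugX"]
# A split that separates the two drugs completely
children = [["drugA", "drugA"], ["drugX", "drugX"]]

print("Parent entropy:", entropy(parent))
print("Information gain:", information_gain(parent, children))
```

A node with a single class has entropy zero, and a split is preferred when it reduces entropy the most, that is, when its information gain is highest.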
| Highlight the lines:
clf_entropy.fit(x_train, y_train)
y_pred_train = clf_entropy.predict(x_train) |
We then fit the classifier to the training data.
Once trained, we predict the target values for the training set. |
| Highlight the lines
print("Training Accuracy is", accuracy_score(y_train, y_pred_train) * 100) |
Now we print the training accuracy. |
| Highlight the lines:
y_pred_en = clf_entropy.predict(x_test)
y_pred_en |
Next, we make predictions on the test data. |
| Highlight The lines:
accuracy = round(accuracy_score(y_test, y_pred_en) * 100, 3)
print("Accuracy is", accuracy)
Press Shift+Enter |
Then we print the test accuracy. |
| Highlight the output | The accuracy is 98.33 percent, which indicates that the model performs very well on the test data. |
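Comparing training and test accuracy side by side is a common way to check for overfitting. The sketch below repeats the fit, predict and score steps end to end; it uses the iris dataset bundled with scikit-learn as a stand-in, because drug200.csv is not available here, so its numbers will differ from those in the tutorial.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset (assumption): iris instead of drug200.csv
x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=100
)

# Same hyperparameters as in the tutorial
clf = DecisionTreeClassifier(
    criterion="entropy", random_state=5, max_depth=4, min_samples_leaf=5
)
clf.fit(x_train, y_train)

train_acc = accuracy_score(y_train, clf.predict(x_train)) * 100
test_acc = accuracy_score(y_test, clf.predict(x_test)) * 100

print("Training Accuracy is", round(train_acc, 3))
print("Test Accuracy is", round(test_acc, 3))
```

A training accuracy far above the test accuracy would suggest that the tree has memorized the training data.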
| Highlight the lines:
cm = confusion_matrix(y_test, y_pred_en)
plt.figure(figsize=(10, 7))
Press Shift+Enter |
To analyze model performance further, we create and display a confusion matrix.
It shows how many instances of each class the model classifies correctly. |
| Highlight the lines:
report = classification_report(y_test, y_pred_en, zero_division=0)
print("Classification Report:")
Press Shift+Enter |
We also generate and print the classification report.
This report gives precision, recall, f1-score, and support for each class. |
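How the confusion matrix and the classification report relate to the predictions can be seen on a handful of labels. In the sketch below, the lists `y_true` and `y_pred` are invented for illustration and are not outputs of the tutorial model.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true and predicted labels; one drugA sample is predicted as drugX
y_true = ["drugA", "drugA", "drugX", "drugX", "drugY", "drugY"]
y_pred = ["drugA", "drugX", "drugX", "drugX", "drugY", "drugY"]
labels = ["drugA", "drugX", "drugY"]

# Rows are true classes, columns are predicted classes, in the order of labels
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

report = classification_report(y_true, y_pred, zero_division=0)
print("Classification Report:")
print(report)
```

Diagonal entries count correct predictions, and every off-diagonal entry is a misclassification from the row class to the column class.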
| Highlight the lines:
classes = np.unique(y)
y_test_bin = label_binarize(y_test, classes=classes)
y_score = clf_entropy.predict_proba(x_test) |
Now we get all unique target classes from the dataset.
Next, we binarize y underscore test, because multi class ROC plotting needs the labels in binary format. Then, we predict class probabilities using the model. |
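What binarizing does to the labels is easiest to see on a few values. The labels in `y_demo` below are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import label_binarize

# Hypothetical target values with three classes
y_demo = np.array(["drugA", "drugX", "drugY", "drugX"])
classes = np.unique(y_demo)

# One column per class; a 1 marks the class of that row
y_bin = label_binarize(y_demo, classes=classes)

print(classes)
print(y_bin)
```

Each column can then be treated as a separate one-versus-rest problem, which is the form a per-class ROC curve needs.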
| Highlight the lines:
fpr = dict()
tpr = dict()
roc_auc = dict()
n_classes = len(classes)
for i in range(n_classes):
plt.legend(loc="lower right")
plt.grid(True)
plt.show() |
We plot the ROC curve for each class using the predicted probabilities.
The ROC curve shows how well the model distinguishes between classes. |
| Show the output: | The output displays the multi class ROC curve of our classifier.
Let us look at the AUC score, that is, the area under the curve. DrugA, DrugB and DrugC have an AUC score of 1, indicating perfect classification. DrugX and DrugY have AUC scores of 0.96 and 0.98, which are very high. The closer the curve is to the top left, the better the model performs. Here, all curves are close to the top left corner of the plot. So, our classifier performs very well for all classes. |
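The per-class AUC scores can also be computed without plotting. The sketch below is one possible way to do it and may differ from the notebook; it uses the bundled iris dataset as a stand-in for drug200.csv and adds stratify=y, which the tutorial does not use, so that every class is present in the test set.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset (assumption): iris instead of drug200.csv
x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=100, stratify=y
)

clf = DecisionTreeClassifier(
    criterion="entropy", random_state=5, max_depth=4, min_samples_leaf=5
)
clf.fit(x_train, y_train)

classes = np.unique(y)
y_test_bin = label_binarize(y_test, classes=classes)
y_score = clf.predict_proba(x_test)

# One-versus-rest ROC curve and AUC for each class
roc_auc = dict()
for i in range(len(classes)):
    fpr, tpr, _ = roc_curve(y_test_bin[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr, tpr)
    print("Class", classes[i], "AUC:", round(roc_auc[i], 3))
```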
| Highlight the lines:
feature_names = df_drug.columns[:-1]
plt.figure(figsize=(29, 10)) |
Then we extract the column names excluding Drug for tree visualization.
Next, we set the figure size and plot the decision tree. Then we save and display the tree visualization as a PNG image file. |
| Show the output | The tree helps classify which drug to give based on patient features.
It first checks the Na to K value, then splits further using BP, Age, and Cholesterol. Each colored box shows sample count and predicted drug type. The tree splits until leaves mostly contain one drug, showing zero entropy. |
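The tutorial shows the tree as an image. As a text-only alternative, which is not what the tutorial itself uses, scikit-learn's export_text prints the same kind of split rules. The sketch below uses the bundled iris dataset as a stand-in, so the feature names differ from those of the drug dataset.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in dataset (assumption): iris instead of drug200.csv
iris = load_iris()

clf = DecisionTreeClassifier(
    criterion="entropy", random_state=5, max_depth=4, min_samples_leaf=5
)
clf.fit(iris.data, iris.target)

# Each indented line is a split condition; leaf lines show the predicted class
tree_text = export_text(clf, feature_names=list(iris.feature_names))
print(tree_text)
```

Reading the rules from the top line downward corresponds to following the tree from the root node to a leaf.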
| Narration | Thus, we built a decision tree to predict drug types based on patient data.
The model showed high accuracy, indicating strong predictive performance. |
| Show slide:
Summary
|
This brings us to the end of the tutorial. Let us summarize.
|
| Show Slide:
Assignment As an assignment, please do the following |
As an assignment, please do the following:
Replace the max underscore depth as shown here. Observe the change in Testing accuracy. |
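One way to observe the effect of max underscore depth is to train the same classifier with several depths and compare test accuracy. The sketch below uses the bundled iris dataset as a stand-in for drug200.csv, and the depth values 1, 2 and 4 are chosen only for illustration; they are not the values from the assignment slide.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset (assumption): iris instead of drug200.csv
x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=100
)

results = dict()
for depth in [1, 2, 4]:  # illustrative depths, not from the slide
    clf = DecisionTreeClassifier(
        criterion="entropy", random_state=5, max_depth=depth, min_samples_leaf=5
    )
    clf.fit(x_train, y_train)
    results[depth] = accuracy_score(y_test, clf.predict(x_test)) * 100
    print("max_depth =", depth, "Test accuracy:", round(results[depth], 3))
```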
| Show Slide:
Assignment Solution Show x img |
After completing the assignment, the output should match the expected result. |
| Show Slide:
FOSSEE Forum |
For any general or technical questions on Python for Machine Learning, visit the FOSSEE forum and post your question. |
| Show Slide:
Thank you |
This is Harini Theiveegan, a FOSSEE Summer Fellow 2025, IIT Bombay, signing off.
Thanks for joining. |