From e55fd209339f3c916771c9602e9aa9de85b34ca8 Mon Sep 17 00:00:00 2001 From: pratik3336 Date: Thu, 20 Nov 2025 13:56:21 -0500 Subject: [PATCH 01/29] Create 03_decision_trees.md --- .../03_ML_classification/03_decision_trees.md | 241 ++++++++++++++++++ 1 file changed, 241 insertions(+) create mode 100644 lessons/03_ML_classification/03_decision_trees.md diff --git a/lessons/03_ML_classification/03_decision_trees.md b/lessons/03_ML_classification/03_decision_trees.md new file mode 100644 index 0000000..0c40f71 --- /dev/null +++ b/lessons/03_ML_classification/03_decision_trees.md @@ -0,0 +1,241 @@ +# Lesson 3 +### CTD Python 200 +**Decision Trees and Ensemble Learning (Random Forests)** + +## Why trees? + +Imagine being asked to identify a flower. You might think: + +> “If the petals are tiny, it’s probably a setosa. +> If not, check petal width… big petals? Probably virginica.” + +That thought process — breaking a decision into simple **yes/no** questions — is exactly what a **Decision Tree** does. + +Deep learning models are often described as “black boxes.” +Decision Trees are the opposite: **transparent and interpretable.** +You can literally trace any prediction step-by-step. + +But there’s a catch: +A single tree can become *too* confident in the training data. +It memorizes noise → **overfitting** → looks great in training, worse in real life. + +Random Forests fix this. +But first, let’s build (and visualize!) a single tree. + +--- + +## What you’ll learn today + +By the end of this lesson, you will: + +- Train a **Decision Tree Classifier** and understand its structure +- Interpret decisions directly from the tree visualization +- Learn why trees tend to **overfit** +- Use a **Random Forest** (an ensemble of trees) to improve performance +- Evaluate with **accuracy + classification report** (from KNN lesson) +- Compare performance on **Iris and Digits datasets** + +--- + +## Step 1 — Setup + +```python +from sklearn.datasets import load_iris, load_digits +from sklearn.model_selection import train_test_split +from sklearn.metrics import accuracy_score, classification_report +from sklearn.tree import DecisionTreeClassifier, plot_tree +from sklearn.ensemble import RandomForestClassifier + +import matplotlib.pyplot as plt +import seaborn as sns +import pandas as pd +```` + +Load datasets: + +```python +iris = load_iris(as_frame=True) +X_iris, y_iris = iris.data, iris.target + +digits = load_digits() +X_digits, y_digits = digits.data, digits.target +``` + +Split the data: + +```python +def split(X, y): + return train_test_split(X, y, test_size=0.2, random_state=42) + +X_train_i, X_test_i, y_train_i, y_test_i = split(X_iris, y_iris) +X_train_d, X_test_d, y_train_d, y_test_d = split(X_digits, y_digits) +``` + +We now have two datasets: + +| Dataset | Type | Classes | Why use it? 
| +| ---------- | --------------- | ------- | ------------------------ | +| **Iris** | Tabular numeric | 3 | Simple + visualizable | +| **Digits** | Image-like grid | 10 | More realistic challenge | + +--- + +## Part A — Decision Tree Classifier + +Train the tree: + +```python +tree_clf = DecisionTreeClassifier(random_state=42) +tree_clf.fit(X_train_i, y_train_i) + +preds_tree = tree_clf.predict(X_test_i) +print("Decision Tree Accuracy (Iris):", accuracy_score(y_test_i, preds_tree)) +print(classification_report(y_test_i, preds_tree)) +``` + +### Visualize the decision-making + +```python +plt.figure(figsize=(16, 10)) +plot_tree(tree_clf, filled=True, + feature_names=iris.feature_names, + class_names=iris.target_names) +plt.title("Iris Decision Tree") +plt.show() +``` + +Look how the model makes decisions: + +> “Is petal width ≤ 0.8? +> → Yes → Setosa.” + +Each **split** reduces uncertainty in the data. +Technically, trees measure uncertainty using **Gini Impurity** — +but you can think of it as: *“How mixed is this node?”* +Pure nodes = confident predictions. + +--- + +## The Overfitting Problem + +Decision trees love **deep**, very specific rules. + +Example: +“If petal width = 1.75 and sepal width > 3.2 and…” +…that rule might only apply to *one* weird training sample. + +Accuracy in training: 🚀 +Accuracy on new data: 😬 + +Let’s test the same tree on **Digits** — a harder task: + +```python +preds_digits_tree = tree_clf.fit(X_train_d, y_train_d).predict(X_test_d) +print("Decision Tree Accuracy (Digits):", accuracy_score(y_test_d, preds_digits_tree)) +``` + +You’ll likely see poorer generalization — proof of **overfitting**. + +--- + +## Part B — Enter the Random Forest 🌲🌲🌲 + +> A forest is a group of trees that never all overfit in the same way. + +How it works (intuition only — no rocket science): + +1. It samples slightly different training subsets +2. Builds a tree for each subset +3. They **vote** on predictions + +This reduces **variance**, producing robust results. + +```python +rf_clf = RandomForestClassifier(n_estimators=100, random_state=42) +rf_clf.fit(X_train_i, y_train_i) +preds_rf = rf_clf.predict(X_test_i) + +print("Random Forest Accuracy (Iris):", accuracy_score(y_test_i, preds_rf)) +print(classification_report(y_test_i, preds_rf)) +``` + +Now compare on Digits: + +```python +preds_rf_digits = rf_clf.fit(X_train_d, y_train_d).predict(X_test_d) +print("Random Forest Accuracy (Digits):", accuracy_score(y_test_d, preds_rf_digits)) +``` + +You should see **noticeable improvement** — especially for digits. + +--- + +## Comparing performance + +| Model | Interpretability | Overfitting Risk | Iris Accuracy | Digits Accuracy | +| -------------------- | ------------------------ | ---------------- | ------------- | --------------- | +| Single Decision Tree | ⭐⭐⭐⭐⭐ (human readable) | High | Good | Weaker | +| Random Forest | ⭐⭐ (harder to visualize) | Much lower | Great | Much better | + +We’ve gained reliability — at the small cost of transparency. 
+ +--- + +## Understanding which features matter + +Random Forests can show what features they rely on: + +```python +feat_df = pd.DataFrame({ + "feature": X_iris.columns, + "importance": rf_clf.feature_importances_ +}) +sns.barplot(data=feat_df, x="importance", y="feature") +plt.title("Iris Feature Importance") +plt.show() +``` + +This is a powerful middle ground: +**better performance + partial explainability.** + +--- + +## Key ideas to take with you + +* **Decision Trees** mimic human reasoning but **overfit** easily +* **Gini Impurity** measures how “mixed” a node is +* **Random Forest = many trees voting** + → more stable, less overfitting +* Same metrics from KNN lesson help us compare fairly +* On complex data (Digits), forests shine even more + +--- + +## Explore More + +Optional but recommended for intuition: + +* Decision Trees + [https://www.ibm.com/think/topics/decision-trees](https://www.ibm.com/think/topics/decision-trees) + [https://www.youtube.com/watch?v=JcI5E2Ng6r4](https://www.youtube.com/watch?v=JcI5E2Ng6r4) + [https://www.youtube.com/watch?v=u4IxOk2ijSs](https://www.youtube.com/watch?v=u4IxOk2ijSs) + [https://www.youtube.com/watch?v=zs6yHVtxyv8](https://www.youtube.com/watch?v=zs6yHVtxyv8) +* Random Forests + [https://www.youtube.com/watch?v=gkXX4h3qYm4](https://www.youtube.com/watch?v=gkXX4h3qYm4) +* scikit-learn demo + [https://scikit-learn.org/stable/auto_examples/tree/plot_iris_dtc.html](https://scikit-learn.org/stable/auto_examples/tree/plot_iris_dtc.html) + +--- + +## Next up + +Preventing overfitting with **hyperparameter tuning**: + +* `max_depth` +* `min_samples_split` +* cross-validation +* confusion matrices + +``` + +--- From 3cdc9c7a18fc4a71351fe174a26fb1adad2410ae Mon Sep 17 00:00:00 2001 From: Pratik Karunakar Poojari <76115015+pratik3336@users.noreply.github.com> Date: Thu, 20 Nov 2025 14:03:02 -0500 Subject: [PATCH 02/29] Update 03_decision_trees.md --- .../03_ML_classification/03_decision_trees.md | 189 +++++++++--------- 1 file changed, 90 insertions(+), 99 deletions(-) diff --git a/lessons/03_ML_classification/03_decision_trees.md b/lessons/03_ML_classification/03_decision_trees.md index 0c40f71..4f425ca 100644 --- a/lessons/03_ML_classification/03_decision_trees.md +++ b/lessons/03_ML_classification/03_decision_trees.md @@ -6,38 +6,67 @@ Imagine being asked to identify a flower. You might think: -> “If the petals are tiny, it’s probably a setosa. +> “If the petals are tiny, it’s probably setosa. > If not, check petal width… big petals? Probably virginica.” -That thought process — breaking a decision into simple **yes/no** questions — is exactly what a **Decision Tree** does. +That thought process — breaking a decision into a sequence of **yes/no questions** — is exactly how a **Decision Tree** works. + +image + +**Image credit: [scikit-learn.org documentation](https://www.geeksforgeeks.org/machine-learning/decision-tree/)** Deep learning models are often described as “black boxes.” -Decision Trees are the opposite: **transparent and interpretable.** -You can literally trace any prediction step-by-step. +Decision Trees are the opposite: **transparent and interpretable**. +You can literally trace any prediction step-by-step by following the tree branches. -But there’s a catch: -A single tree can become *too* confident in the training data. -It memorizes noise → **overfitting** → looks great in training, worse in real life. +However, a single decision tree can become too confident in the training data. 
+This leads to overfitting: excellent performance on seen examples, weaker performance on new ones. -Random Forests fix this. -But first, let’s build (and visualize!) a single tree. +Random Forests solve this problem by combining many trees. + + +## How Decision Trees Work? +1. Start with the Root Node: It begins with a main question at the root node which is derived from the dataset’s features. + +2. Ask Yes/No Questions: From the root, the tree asks a series of yes/no questions to split the data into subsets based on specific attributes. + +3. Branching Based on Answers: Each question leads to different branches: + +If the answer is yes, the tree follows one path. +If the answer is no, the tree follows another path. +4. Continue Splitting: This branching continues through further decisions helps in reducing the data down step-by-step. + +5. Reach the Leaf Node: The process ends when there are no more useful questions to ask leading to the leaf node where the final decision or prediction is made. + +Let’s look at a simple example to understand how it works. Imagine we need to decide whether to drink coffee based on the time of day and how tired we feel. The tree first checks the time: + +1. In the morning: It asks “Tired?” + +If yes, the tree suggests drinking coffee. +If no, it says no coffee is needed. +2. In the afternoon: It asks again “Tired?” + +If yes, it suggests drinking coffee. +If no, no coffee is needed. + +image + +**Image credit: [scikit-learn.org documentation](https://www.geeksforgeeks.org/machine-learning/decision-tree/)** ---- ## What you’ll learn today By the end of this lesson, you will: -- Train a **Decision Tree Classifier** and understand its structure -- Interpret decisions directly from the tree visualization -- Learn why trees tend to **overfit** -- Use a **Random Forest** (an ensemble of trees) to improve performance -- Evaluate with **accuracy + classification report** (from KNN lesson) -- Compare performance on **Iris and Digits datasets** +- Train and interpret a **Decision Tree Classifier** +- Visualize how splitting decisions are made +- Understand **Gini Impurity** as a measure of node “mixedness” +- Learn why individual trees can **overfit** +- Use a **Random Forest** to improve stability and accuracy +- Compare both models with evaluation metrics from the KNN lesson +- Test both **Iris** (simple) and **Digits** (more complex) datasets ---- - -## Step 1 — Setup +## Setup ```python from sklearn.datasets import load_iris, load_digits @@ -73,12 +102,10 @@ X_train_d, X_test_d, y_train_d, y_test_d = split(X_digits, y_digits) We now have two datasets: -| Dataset | Type | Classes | Why use it? | -| ---------- | --------------- | ------- | ------------------------ | -| **Iris** | Tabular numeric | 3 | Simple + visualizable | -| **Digits** | Image-like grid | 10 | More realistic challenge | - ---- +| Dataset | Type | Classes | Why use it? 
| +| ---------- | --------------- | ------- | ----------------------------------- | +| **Iris** | Tabular numeric | 3 | Simple, easy to visualize decisions | +| **Digits** | Image-like grid | 10 | More realistic, harder to classify | ## Part A — Decision Tree Classifier @@ -93,62 +120,39 @@ print("Decision Tree Accuracy (Iris):", accuracy_score(y_test_i, preds_tree)) print(classification_report(y_test_i, preds_tree)) ``` -### Visualize the decision-making +### Visualizing decisions ```python plt.figure(figsize=(16, 10)) -plot_tree(tree_clf, filled=True, - feature_names=iris.feature_names, +plot_tree(tree_clf, filled=True, + feature_names=iris.feature_names, class_names=iris.target_names) plt.title("Iris Decision Tree") plt.show() ``` -Look how the model makes decisions: - -> “Is petal width ≤ 0.8? -> → Yes → Setosa.” +Decision Trees measure uncertainty in each split using **Gini Impurity**: +low impurity = more confident prediction. -Each **split** reduces uncertainty in the data. -Technically, trees measure uncertainty using **Gini Impurity** — -but you can think of it as: *“How mixed is this node?”* -Pure nodes = confident predictions. - ---- +They continue splitting until the data is as “pure” as possible. ## The Overfitting Problem -Decision trees love **deep**, very specific rules. - -Example: -“If petal width = 1.75 and sepal width > 3.2 and…” -…that rule might only apply to *one* weird training sample. - -Accuracy in training: 🚀 -Accuracy on new data: 😬 - -Let’s test the same tree on **Digits** — a harder task: +Trees can become overly specific to the training data: ```python preds_digits_tree = tree_clf.fit(X_train_d, y_train_d).predict(X_test_d) print("Decision Tree Accuracy (Digits):", accuracy_score(y_test_d, preds_digits_tree)) ``` -You’ll likely see poorer generalization — proof of **overfitting**. - ---- - -## Part B — Enter the Random Forest 🌲🌲🌲 - -> A forest is a group of trees that never all overfit in the same way. +Notice that accuracy generally drops on the more complex Digits dataset — a sign of overfitting. -How it works (intuition only — no rocket science): +## Part B — Random Forest (Ensemble Method) -1. It samples slightly different training subsets -2. Builds a tree for each subset -3. They **vote** on predictions +A **Random Forest** builds many trees on slightly different samples of the training data. +Each tree votes, and the most common answer wins. -This reduces **variance**, producing robust results. +This reduces overfitting because the trees do not make the same mistakes. ```python rf_clf = RandomForestClassifier(n_estimators=100, random_state=42) @@ -159,31 +163,27 @@ print("Random Forest Accuracy (Iris):", accuracy_score(y_test_i, preds_rf)) print(classification_report(y_test_i, preds_rf)) ``` -Now compare on Digits: +Test on Digits as well: ```python preds_rf_digits = rf_clf.fit(X_train_d, y_train_d).predict(X_test_d) print("Random Forest Accuracy (Digits):", accuracy_score(y_test_d, preds_rf_digits)) ``` -You should see **noticeable improvement** — especially for digits. +You should see significantly improved generalization. 
---- - -## Comparing performance +## Model comparison -| Model | Interpretability | Overfitting Risk | Iris Accuracy | Digits Accuracy | -| -------------------- | ------------------------ | ---------------- | ------------- | --------------- | -| Single Decision Tree | ⭐⭐⭐⭐⭐ (human readable) | High | Good | Weaker | -| Random Forest | ⭐⭐ (harder to visualize) | Much lower | Great | Much better | +| Model | Interpretability | Overfitting Risk | Iris Accuracy | Digits Accuracy | +| ------------- | --------------------------- | ---------------- | ------------- | --------------- | +| Decision Tree | Very high | High | Good | Moderate | +| Random Forest | More difficult to interpret | Much lower | Very good | Strong | -We’ve gained reliability — at the small cost of transparency. +Interpretability decreases slightly, but robustness increases substantially. ---- +## Feature importance (Random Forest) -## Understanding which features matter - -Random Forests can show what features they rely on: +Which features matter most? ```python feat_df = pd.DataFrame({ @@ -195,47 +195,38 @@ plt.title("Iris Feature Importance") plt.show() ``` -This is a powerful middle ground: -**better performance + partial explainability.** - ---- +This provides insight into which measurements are driving decisions. ## Key ideas to take with you -* **Decision Trees** mimic human reasoning but **overfit** easily -* **Gini Impurity** measures how “mixed” a node is -* **Random Forest = many trees voting** - → more stable, less overfitting -* Same metrics from KNN lesson help us compare fairly -* On complex data (Digits), forests shine even more - ---- +* **Decision Trees** are interpretable and mimic human decision-making +* They tend to **overfit** without constraints +* **Random Forests** combine many trees to reduce variance and improve accuracy +* Same evaluation tools from the KNN lesson let us compare fairly +* Forests shine on complex, higher-dimensional data like Digits ## Explore More -Optional but recommended for intuition: +Optional resources for deeper intuition: * Decision Trees [https://www.ibm.com/think/topics/decision-trees](https://www.ibm.com/think/topics/decision-trees) [https://www.youtube.com/watch?v=JcI5E2Ng6r4](https://www.youtube.com/watch?v=JcI5E2Ng6r4) [https://www.youtube.com/watch?v=u4IxOk2ijSs](https://www.youtube.com/watch?v=u4IxOk2ijSs) [https://www.youtube.com/watch?v=zs6yHVtxyv8](https://www.youtube.com/watch?v=zs6yHVtxyv8) + * Random Forests [https://www.youtube.com/watch?v=gkXX4h3qYm4](https://www.youtube.com/watch?v=gkXX4h3qYm4) -* scikit-learn demo - [https://scikit-learn.org/stable/auto_examples/tree/plot_iris_dtc.html](https://scikit-learn.org/stable/auto_examples/tree/plot_iris_dtc.html) ---- - -## Next up +* scikit-learn example + [https://scikit-learn.org/stable/auto_examples/tree/plot_iris_dtc.html](https://scikit-learn.org/stable/auto_examples/tree/plot_iris_dtc.html) -Preventing overfitting with **hyperparameter tuning**: +## Next steps -* `max_depth` -* `min_samples_split` -* cross-validation -* confusion matrices - -``` +In the next lesson, we’ll build on this foundation and explore: +* Controlling model complexity (`max_depth`, `min_samples_split`) +* Reducing overfitting with **cross-validation** +* Confusion matrices and evaluation tradeoffs --- + From 0c04bde57bb4054ada1fd9ad9ce7a4e60308d15b Mon Sep 17 00:00:00 2001 From: Pratik Karunakar Poojari <76115015+pratik3336@users.noreply.github.com> Date: Thu, 20 Nov 2025 14:04:21 -0500 Subject: [PATCH 03/29] Update 
03_decision_trees.md --- lessons/03_ML_classification/03_decision_trees.md | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/lessons/03_ML_classification/03_decision_trees.md b/lessons/03_ML_classification/03_decision_trees.md index 4f425ca..2fb4795 100644 --- a/lessons/03_ML_classification/03_decision_trees.md +++ b/lessons/03_ML_classification/03_decision_trees.md @@ -26,14 +26,16 @@ Random Forests solve this problem by combining many trees. ## How Decision Trees Work? + 1. Start with the Root Node: It begins with a main question at the root node which is derived from the dataset’s features. 2. Ask Yes/No Questions: From the root, the tree asks a series of yes/no questions to split the data into subsets based on specific attributes. 3. Branching Based on Answers: Each question leads to different branches: -If the answer is yes, the tree follows one path. -If the answer is no, the tree follows another path. + If the answer is yes, the tree follows one path. + If the answer is no, the tree follows another path. + 4. Continue Splitting: This branching continues through further decisions helps in reducing the data down step-by-step. 5. Reach the Leaf Node: The process ends when there are no more useful questions to ask leading to the leaf node where the final decision or prediction is made. @@ -42,12 +44,13 @@ Let’s look at a simple example to understand how it works. Imagine we need to 1. In the morning: It asks “Tired?” -If yes, the tree suggests drinking coffee. -If no, it says no coffee is needed. + If yes, the tree suggests drinking coffee. + If no, it says no coffee is needed. + 2. In the afternoon: It asks again “Tired?” -If yes, it suggests drinking coffee. -If no, no coffee is needed. + If yes, it suggests drinking coffee. + If no, no coffee is needed. image From ccf0b9502c78cee1cf399554944168513fc0008f Mon Sep 17 00:00:00 2001 From: Pratik Karunakar Poojari <76115015+pratik3336@users.noreply.github.com> Date: Thu, 20 Nov 2025 14:05:14 -0500 Subject: [PATCH 04/29] Update 03_decision_trees.md --- lessons/03_ML_classification/03_decision_trees.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/lessons/03_ML_classification/03_decision_trees.md b/lessons/03_ML_classification/03_decision_trees.md index 2fb4795..97272e7 100644 --- a/lessons/03_ML_classification/03_decision_trees.md +++ b/lessons/03_ML_classification/03_decision_trees.md @@ -13,7 +13,7 @@ That thought process — breaking a decision into a sequence of **yes/no questio image -**Image credit: [scikit-learn.org documentation](https://www.geeksforgeeks.org/machine-learning/decision-tree/)** +**Image credit: [decision-tree-geeks documentation](https://www.geeksforgeeks.org/machine-learning/decision-tree/)** Deep learning models are often described as “black boxes.” Decision Trees are the opposite: **transparent and interpretable**. @@ -54,7 +54,7 @@ Let’s look at a simple example to understand how it works. 
Imagine we need to image -**Image credit: [scikit-learn.org documentation](https://www.geeksforgeeks.org/machine-learning/decision-tree/)** +**Image credit: [decision-tree-geeks documentation](https://www.geeksforgeeks.org/machine-learning/decision-tree/)** ## What you’ll learn today From a5600867e6e35021005fcf93f5009c71ccb7f98a Mon Sep 17 00:00:00 2001 From: Pratik Karunakar Poojari <76115015+pratik3336@users.noreply.github.com> Date: Thu, 20 Nov 2025 14:09:10 -0500 Subject: [PATCH 05/29] Update 03_decision_trees.md --- lessons/03_ML_classification/03_decision_trees.md | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/lessons/03_ML_classification/03_decision_trees.md b/lessons/03_ML_classification/03_decision_trees.md index 97272e7..9b7e440 100644 --- a/lessons/03_ML_classification/03_decision_trees.md +++ b/lessons/03_ML_classification/03_decision_trees.md @@ -155,6 +155,17 @@ Notice that accuracy generally drops on the more complex Digits dataset — a si A **Random Forest** builds many trees on slightly different samples of the training data. Each tree votes, and the most common answer wins. +### How Random Forest Classification Works + +Imagine you have a complex problem to solve, and you gather a group of experts from different fields to provide their input. Each expert provides their opinion based on their expertise and experience. Then, the experts would vote to arrive at a final decision. + +In a random forest classification, multiple decision trees are created using different random subsets of the data and features. Each decision tree is like an expert, providing its opinion on how to classify the data. Predictions are made by calculating the prediction for each decision tree and then taking the most popular result. (For regression, predictions use an averaging technique instead.) + +Screenshot 2025-11-20 at 2 08 19 PM + +**Image credit: [random-forest-data-camp documentation](https://www.datacamp.com/tutorial/random-forests-classifier-python)** + + This reduces overfitting because the trees do not make the same mistakes. ```python From c6561b03d88242205868a4ba74daf2182d40041c Mon Sep 17 00:00:00 2001 From: Pratik Karunakar Poojari <76115015+pratik3336@users.noreply.github.com> Date: Thu, 20 Nov 2025 14:14:14 -0500 Subject: [PATCH 06/29] Update 03_decision_trees.md --- lessons/03_ML_classification/03_decision_trees.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/lessons/03_ML_classification/03_decision_trees.md b/lessons/03_ML_classification/03_decision_trees.md index 9b7e440..3cb9d86 100644 --- a/lessons/03_ML_classification/03_decision_trees.md +++ b/lessons/03_ML_classification/03_decision_trees.md @@ -134,6 +134,14 @@ plt.title("Iris Decision Tree") plt.show() ``` +Screenshot 2025-11-20 at 2 12 56 PM + +awe + + +**Image credit: Google Colab** + + Decision Trees measure uncertainty in each split using **Gini Impurity**: low impurity = more confident prediction. 
From c53ec03915dfe93fbda93590e916d8280c3edd6b Mon Sep 17 00:00:00 2001 From: Pratik Karunakar Poojari <76115015+pratik3336@users.noreply.github.com> Date: Thu, 20 Nov 2025 14:22:06 -0500 Subject: [PATCH 07/29] Update 03_decision_trees.md --- .../03_ML_classification/03_decision_trees.md | 262 ++++++++++-------- 1 file changed, 146 insertions(+), 116 deletions(-) diff --git a/lessons/03_ML_classification/03_decision_trees.md b/lessons/03_ML_classification/03_decision_trees.md index 3cb9d86..37ceaf1 100644 --- a/lessons/03_ML_classification/03_decision_trees.md +++ b/lessons/03_ML_classification/03_decision_trees.md @@ -1,4 +1,3 @@ -# Lesson 3 ### CTD Python 200 **Decision Trees and Ensemble Learning (Random Forests)** @@ -17,73 +16,65 @@ That thought process — breaking a decision into a sequence of **yes/no questio Deep learning models are often described as “black boxes.” Decision Trees are the opposite: **transparent and interpretable**. -You can literally trace any prediction step-by-step by following the tree branches. +You can literally trace any prediction step-by-step. However, a single decision tree can become too confident in the training data. -This leads to overfitting: excellent performance on seen examples, weaker performance on new ones. +This leads to **overfitting**: great performance on seen examples, worse on new data. Random Forests solve this problem by combining many trees. +--- -## How Decision Trees Work? - -1. Start with the Root Node: It begins with a main question at the root node which is derived from the dataset’s features. - -2. Ask Yes/No Questions: From the root, the tree asks a series of yes/no questions to split the data into subsets based on specific attributes. - -3. Branching Based on Answers: Each question leads to different branches: - - If the answer is yes, the tree follows one path. - If the answer is no, the tree follows another path. - -4. Continue Splitting: This branching continues through further decisions helps in reducing the data down step-by-step. - -5. Reach the Leaf Node: The process ends when there are no more useful questions to ask leading to the leaf node where the final decision or prediction is made. - -Let’s look at a simple example to understand how it works. Imagine we need to decide whether to drink coffee based on the time of day and how tired we feel. The tree first checks the time: - -1. In the morning: It asks “Tired?” - - If yes, the tree suggests drinking coffee. - If no, it says no coffee is needed. - -2. In the afternoon: It asks again “Tired?” +## How Decision Trees Work - If yes, it suggests drinking coffee. - If no, no coffee is needed. +1. Start with a root node using one feature to split data +2. Ask a “yes/no” question that reduces label mixing +3. Continue splitting until no more useful splits exist +4. 
Reach a leaf node → make a prediction image -**Image credit: [decision-tree-geeks documentation](https://www.geeksforgeeks.org/machine-learning/decision-tree/)** - +--- ## What you’ll learn today -By the end of this lesson, you will: - - Train and interpret a **Decision Tree Classifier** -- Visualize how splitting decisions are made -- Understand **Gini Impurity** as a measure of node “mixedness” -- Learn why individual trees can **overfit** -- Use a **Random Forest** to improve stability and accuracy -- Compare both models with evaluation metrics from the KNN lesson -- Test both **Iris** (simple) and **Digits** (more complex) datasets +- Measure node uncertainty using **Gini Impurity** +- Why trees tend to **overfit** +- Use a **Random Forest** to improve accuracy +- Evaluate all models using metrics from KNN lesson +- Compare performance on **Iris** and **Digits** + +--- ## Setup ```python +!pip install seaborn -q + from sklearn.datasets import load_iris, load_digits from sklearn.model_selection import train_test_split -from sklearn.metrics import accuracy_score, classification_report +from sklearn.metrics import ( + accuracy_score, + classification_report, + confusion_matrix, + ConfusionMatrixDisplay, +) from sklearn.tree import DecisionTreeClassifier, plot_tree from sklearn.ensemble import RandomForestClassifier +from sklearn.neighbors import KNeighborsClassifier import matplotlib.pyplot as plt import seaborn as sns import pandas as pd +import numpy as np + +plt.rcParams["figure.dpi"] = 120 ```` -Load datasets: +--- + +## Load datasets ```python iris = load_iris(as_frame=True) @@ -91,28 +82,28 @@ X_iris, y_iris = iris.data, iris.target digits = load_digits() X_digits, y_digits = digits.data, digits.target + +print("Iris shape:", X_iris.shape) +print("Digits shape:", X_digits.shape) ``` -Split the data: +--- + +## Split data ```python def split(X, y): - return train_test_split(X, y, test_size=0.2, random_state=42) + return train_test_split(X, y, test_size=0.2, random_state=42, stratify=y) X_train_i, X_test_i, y_train_i, y_test_i = split(X_iris, y_iris) X_train_d, X_test_d, y_train_d, y_test_d = split(X_digits, y_digits) ``` -We now have two datasets: - -| Dataset | Type | Classes | Why use it? | -| ---------- | --------------- | ------- | ----------------------------------- | -| **Iris** | Tabular numeric | 3 | Simple, easy to visualize decisions | -| **Digits** | Image-like grid | 10 | More realistic, harder to classify | +--- ## Part A — Decision Tree Classifier -Train the tree: +### Train the tree ```python tree_clf = DecisionTreeClassifier(random_state=42) @@ -126,55 +117,37 @@ print(classification_report(y_test_i, preds_tree)) ### Visualizing decisions ```python -plt.figure(figsize=(16, 10)) -plot_tree(tree_clf, filled=True, - feature_names=iris.feature_names, - class_names=iris.target_names) -plt.title("Iris Decision Tree") +plt.figure(figsize=(18, 12)) +plot_tree(tree_clf, filled=True, + feature_names=iris.feature_names, + class_names=iris.target_names, + rounded=True, fontsize=9) +plt.title("Decision Tree - Iris Dataset") plt.show() ``` -Screenshot 2025-11-20 at 2 12 56 PM - awe +**Decision Trees measure impurity** → less mixed = more confident. -**Image credit: Google Colab** - - -Decision Trees measure uncertainty in each split using **Gini Impurity**: -low impurity = more confident prediction. - -They continue splitting until the data is as “pure” as possible. 
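
If you want to see the number behind “impurity”, the Gini formula is just `1 - sum(p_k**2)` over the class proportions in a node. A minimal sketch — the `gini` helper below is our own, not a scikit-learn function:

```python
import numpy as np

def gini(class_counts):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

print(gini([50, 50, 50]))  # evenly mixed 3-class node -> about 0.667 (very impure)
print(gini([50, 0, 0]))    # pure node -> 0.0 (completely confident)
```

Each split the tree chooses is the one that lowers this number the most for its child nodes.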
+--- ## The Overfitting Problem -Trees can become overly specific to the training data: - ```python preds_digits_tree = tree_clf.fit(X_train_d, y_train_d).predict(X_test_d) print("Decision Tree Accuracy (Digits):", accuracy_score(y_test_d, preds_digits_tree)) ``` -Notice that accuracy generally drops on the more complex Digits dataset — a sign of overfitting. - -## Part B — Random Forest (Ensemble Method) - -A **Random Forest** builds many trees on slightly different samples of the training data. -Each tree votes, and the most common answer wins. - -### How Random Forest Classification Works - -Imagine you have a complex problem to solve, and you gather a group of experts from different fields to provide their input. Each expert provides their opinion based on their expertise and experience. Then, the experts would vote to arrive at a final decision. - -In a random forest classification, multiple decision trees are created using different random subsets of the data and features. Each decision tree is like an expert, providing its opinion on how to classify the data. Predictions are made by calculating the prediction for each decision tree and then taking the most popular result. (For regression, predictions use an averaging technique instead.) +Decision Tree performance typically drops on complex datasets like Digits. -Screenshot 2025-11-20 at 2 08 19 PM +--- -**Image credit: [random-forest-data-camp documentation](https://www.datacamp.com/tutorial/random-forests-classifier-python)** +## Part B — Random Forest (Ensemble Method) +Random Forest = Many trees voting → **reduced overfitting** -This reduces overfitting because the trees do not make the same mistakes. +Screenshot 2025-11-20 at 2 08 19 PM ```python rf_clf = RandomForestClassifier(n_estimators=100, random_state=42) @@ -185,70 +158,127 @@ print("Random Forest Accuracy (Iris):", accuracy_score(y_test_i, preds_rf)) print(classification_report(y_test_i, preds_rf)) ``` -Test on Digits as well: +Test on Digits: ```python preds_rf_digits = rf_clf.fit(X_train_d, y_train_d).predict(X_test_d) print("Random Forest Accuracy (Digits):", accuracy_score(y_test_d, preds_rf_digits)) ``` -You should see significantly improved generalization. 
+--- + +## Confusion Matrices (Iris + Digits) + +```python +def plot_confusions(model, model_name): + # Iris + model.fit(X_train_i, y_train_i) + preds_i = model.predict(X_test_i) + cm_i = confusion_matrix(y_test_i, preds_i) + disp_i = ConfusionMatrixDisplay(cm_i, display_labels=iris.target_names) + + # Digits + model.fit(X_train_d, y_train_d) + preds_d = model.predict(X_test_d) + cm_d = confusion_matrix(y_test_d, preds_d) + disp_d = ConfusionMatrixDisplay(cm_d) + + fig, axes = plt.subplots(1, 2, figsize=(12, 4)) + disp_i.plot(ax=axes[0], colorbar=False) + axes[0].set_title(f"{model_name} - Iris") + + disp_d.plot(ax=axes[1], colorbar=False) + axes[1].set_title(f"{model_name} - Digits") + + plt.tight_layout() + plt.show() + +plot_confusions(DecisionTreeClassifier(random_state=42), "Decision Tree") +plot_confusions(RandomForestClassifier(n_estimators=100, random_state=42), "Random Forest") +``` + +--- + +## Accuracy Comparison (KNN vs Tree vs Forest) + +```python +models = [ + ("KNN", KNeighborsClassifier(n_neighbors=5)), + ("Decision Tree", DecisionTreeClassifier(random_state=42)), + ("Random Forest", RandomForestClassifier(n_estimators=100, random_state=42)), +] -## Model comparison +def evaluate(models, X_train, X_test, y_train, y_test): + scores = [] + for name, model in models: + model.fit(X_train, y_train) + preds = model.predict(X_test) + scores.append({"model": name, "accuracy": accuracy_score(y_test, preds)}) + return pd.DataFrame(scores) -| Model | Interpretability | Overfitting Risk | Iris Accuracy | Digits Accuracy | -| ------------- | --------------------------- | ---------------- | ------------- | --------------- | -| Decision Tree | Very high | High | Good | Moderate | -| Random Forest | More difficult to interpret | Much lower | Very good | Strong | +df_iris = evaluate(models, X_train_i, X_test_i, y_train_i, y_test_i) +df_digits = evaluate(models, X_train_d, X_test_d, y_train_d, y_test_d) -Interpretability decreases slightly, but robustness increases substantially. +fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True) -## Feature importance (Random Forest) +sns.barplot(data=df_iris, x="model", y="accuracy", ax=axes[0]) +axes[0].set_title("Accuracy Comparison - Iris") -Which features matter most? +sns.barplot(data=df_digits, x="model", y="accuracy", ax=axes[1]) +axes[1].set_title("Accuracy Comparison - Digits") + +plt.tight_layout() +plt.show() +``` + +--- + +## Feature Importance (Random Forest) ```python +rf_clf.fit(X_train_i, y_train_i) feat_df = pd.DataFrame({ "feature": X_iris.columns, "importance": rf_clf.feature_importances_ -}) +}).sort_values(by="importance", ascending=False) + +plt.figure(figsize=(8, 5)) sns.barplot(data=feat_df, x="importance", y="feature") -plt.title("Iris Feature Importance") +plt.title("Feature Importance — Random Forest (Iris)") +plt.tight_layout() plt.show() ``` -This provides insight into which measurements are driving decisions. +Petal length & width are most informative in Iris. 
-## Key ideas to take with you - -* **Decision Trees** are interpretable and mimic human decision-making -* They tend to **overfit** without constraints -* **Random Forests** combine many trees to reduce variance and improve accuracy -* Same evaluation tools from the KNN lesson let us compare fairly -* Forests shine on complex, higher-dimensional data like Digits +--- -## Explore More +## Key takeaways -Optional resources for deeper intuition: +* **Decision Trees** are interpretable but can **overfit** +* **Random Forests** combine many trees → **better generalization** +* Model evaluation should include: -* Decision Trees - [https://www.ibm.com/think/topics/decision-trees](https://www.ibm.com/think/topics/decision-trees) - [https://www.youtube.com/watch?v=JcI5E2Ng6r4](https://www.youtube.com/watch?v=JcI5E2Ng6r4) - [https://www.youtube.com/watch?v=u4IxOk2ijSs](https://www.youtube.com/watch?v=u4IxOk2ijSs) - [https://www.youtube.com/watch?v=zs6yHVtxyv8](https://www.youtube.com/watch?v=zs6yHVtxyv8) + * Accuracy + * Classification reports + * Confusion matrices +* Random Forest especially improves performance on complex data like Digits -* Random Forests - [https://www.youtube.com/watch?v=gkXX4h3qYm4](https://www.youtube.com/watch?v=gkXX4h3qYm4) +--- -* scikit-learn example - [https://scikit-learn.org/stable/auto_examples/tree/plot_iris_dtc.html](https://scikit-learn.org/stable/auto_examples/tree/plot_iris_dtc.html) +## Explore More -## Next steps +Optional resources: -In the next lesson, we’ll build on this foundation and explore: +* [https://www.ibm.com/think/topics/decision-trees](https://www.ibm.com/think/topics/decision-trees) +* [https://www.youtube.com/watch?v=JcI5E2Ng6r4](https://www.youtube.com/watch?v=JcI5E2Ng6r4) +* [https://www.youtube.com/watch?v=gkXX4h3qYm4](https://www.youtube.com/watch?v=gkXX4h3qYm4) +* [https://scikit-learn.org/stable/auto_examples/tree/plot_iris_dtc.html](https://scikit-learn.org/stable/auto_examples/tree/plot_iris_dtc.html) -* Controlling model complexity (`max_depth`, `min_samples_split`) -* Reducing overfitting with **cross-validation** -* Confusion matrices and evaluation tradeoffs --- +## Next steps + +* Hyperparameter tuning (`max_depth`, `min_samples_split`, `n_estimators`) +* Preventing overfitting with **cross-validation** +* Confusion matrices: error trade-offs From 26016cc6ab53e2edf7c8492eae52ec1d4d49051b Mon Sep 17 00:00:00 2001 From: Pratik Karunakar Poojari <76115015+pratik3336@users.noreply.github.com> Date: Thu, 20 Nov 2025 14:23:38 -0500 Subject: [PATCH 08/29] Update 03_decision_trees.md --- lessons/03_ML_classification/03_decision_trees.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/lessons/03_ML_classification/03_decision_trees.md b/lessons/03_ML_classification/03_decision_trees.md index 37ceaf1..a743ae3 100644 --- a/lessons/03_ML_classification/03_decision_trees.md +++ b/lessons/03_ML_classification/03_decision_trees.md @@ -34,6 +34,9 @@ Random Forests solve this problem by combining many trees. 
image +**Image credit: [decision-tree-geeks documentation](https://www.geeksforgeeks.org/machine-learning/decision-tree/)** + + --- ## What you’ll learn today From a28e4950b2b3cd410874c7adac0aeeb32a1d7155 Mon Sep 17 00:00:00 2001 From: Pratik Karunakar Poojari <76115015+pratik3336@users.noreply.github.com> Date: Thu, 20 Nov 2025 14:24:00 -0500 Subject: [PATCH 09/29] Update 03_decision_trees.md --- lessons/03_ML_classification/03_decision_trees.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/lessons/03_ML_classification/03_decision_trees.md b/lessons/03_ML_classification/03_decision_trees.md index a743ae3..14c91fb 100644 --- a/lessons/03_ML_classification/03_decision_trees.md +++ b/lessons/03_ML_classification/03_decision_trees.md @@ -37,8 +37,6 @@ Random Forests solve this problem by combining many trees. **Image credit: [decision-tree-geeks documentation](https://www.geeksforgeeks.org/machine-learning/decision-tree/)** ---- - ## What you’ll learn today - Train and interpret a **Decision Tree Classifier** From 58577b9dc4856031488945ddb993e5bb7d395c14 Mon Sep 17 00:00:00 2001 From: Pratik Karunakar Poojari <76115015+pratik3336@users.noreply.github.com> Date: Thu, 20 Nov 2025 14:32:27 -0500 Subject: [PATCH 10/29] Update 03_decision_trees.md --- .../03_ML_classification/03_decision_trees.md | 116 +++++++++++++----- 1 file changed, 83 insertions(+), 33 deletions(-) diff --git a/lessons/03_ML_classification/03_decision_trees.md b/lessons/03_ML_classification/03_decision_trees.md index 14c91fb..fc7faa0 100644 --- a/lessons/03_ML_classification/03_decision_trees.md +++ b/lessons/03_ML_classification/03_decision_trees.md @@ -1,3 +1,5 @@ +````md +# Lesson 3 ### CTD Python 200 **Decision Trees and Ensemble Learning (Random Forests)** @@ -27,24 +29,27 @@ Random Forests solve this problem by combining many trees. ## How Decision Trees Work -1. Start with a root node using one feature to split data -2. Ask a “yes/no” question that reduces label mixing -3. Continue splitting until no more useful splits exist -4. Reach a leaf node → make a prediction +1. Start with a root node using one feature to split data +2. Ask a “yes/no” question that reduces label mixing +3. Continue splitting until no more useful splits exist +4. Reach a leaf node → make a prediction image **Image credit: [decision-tree-geeks documentation](https://www.geeksforgeeks.org/machine-learning/decision-tree/)** +--- ## What you’ll learn today -- Train and interpret a **Decision Tree Classifier** -- Measure node uncertainty using **Gini Impurity** -- Why trees tend to **overfit** -- Use a **Random Forest** to improve accuracy -- Evaluate all models using metrics from KNN lesson -- Compare performance on **Iris** and **Digits** +By the end of this lesson, you will be able to: + +- Train and interpret a **Decision Tree Classifier** +- Measure node uncertainty using **Gini Impurity** +- Explain why trees tend to **overfit** +- Use a **Random Forest** to improve accuracy and robustness +- Evaluate models using the metrics from the KNN lesson +- Compare performance on the **Iris** and **Digits** datasets --- @@ -100,11 +105,18 @@ X_train_i, X_test_i, y_train_i, y_test_i = split(X_iris, y_iris) X_train_d, X_test_d, y_train_d, y_test_d = split(X_digits, y_digits) ``` +We now have two datasets: + +| Dataset | Type | Classes | Why use it? 
| +| ------- | --------------- | ------- | ----------------------------------- | +| Iris | Tabular numeric | 3 | Simple, easy to visualize decisions | +| Digits | Image-like grid | 10 | More realistic, harder to classify | + --- ## Part A — Decision Tree Classifier -### Train the tree +### Train the tree (Iris) ```python tree_clf = DecisionTreeClassifier(random_state=42) @@ -119,34 +131,42 @@ print(classification_report(y_test_i, preds_tree)) ```python plt.figure(figsize=(18, 12)) -plot_tree(tree_clf, filled=True, - feature_names=iris.feature_names, - class_names=iris.target_names, - rounded=True, fontsize=9) +plot_tree( + tree_clf, + filled=True, + feature_names=iris.feature_names, + class_names=iris.target_names, + rounded=True, + fontsize=9 +) plt.title("Decision Tree - Iris Dataset") plt.show() ``` -awe - -**Decision Trees measure impurity** → less mixed = more confident. +Decision Trees measure **impurity** at each node: +less mixed = more confident prediction. --- ## The Overfitting Problem +A tree can become too specialized to the training data. + ```python preds_digits_tree = tree_clf.fit(X_train_d, y_train_d).predict(X_test_d) print("Decision Tree Accuracy (Digits):", accuracy_score(y_test_d, preds_digits_tree)) ``` -Decision Tree performance typically drops on complex datasets like Digits. +On the more complex **Digits** dataset, accuracy typically drops — a sign of overfitting and poorer generalization. --- ## Part B — Random Forest (Ensemble Method) -Random Forest = Many trees voting → **reduced overfitting** +A **Random Forest** builds many trees on slightly different samples of the training data. +Each tree votes, and the most common answer wins. + +Random Forest = many trees voting → reduced overfitting. Screenshot 2025-11-20 at 2 08 19 PM @@ -159,17 +179,21 @@ print("Random Forest Accuracy (Iris):", accuracy_score(y_test_i, preds_rf)) print(classification_report(y_test_i, preds_rf)) ``` -Test on Digits: +Test on Digits as well: ```python preds_rf_digits = rf_clf.fit(X_train_d, y_train_d).predict(X_test_d) print("Random Forest Accuracy (Digits):", accuracy_score(y_test_d, preds_rf_digits)) ``` +You should see improved generalization compared to a single tree, especially on Digits. + --- ## Confusion Matrices (Iris + Digits) +Confusion matrices help you see **which classes** are being confused with which. + ```python def plot_confusions(model, model_name): # Iris @@ -202,6 +226,12 @@ plot_confusions(RandomForestClassifier(n_estimators=100, random_state=42), "Rand ## Accuracy Comparison (KNN vs Tree vs Forest) +Now we compare three models side by side: + +* K-Nearest Neighbors (KNN) +* Decision Tree +* Random Forest + ```python models = [ ("KNN", KNeighborsClassifier(n_neighbors=5)), @@ -224,18 +254,24 @@ fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True) sns.barplot(data=df_iris, x="model", y="accuracy", ax=axes[0]) axes[0].set_title("Accuracy Comparison - Iris") +axes[0].set_ylim(0, 1.05) sns.barplot(data=df_digits, x="model", y="accuracy", ax=axes[1]) axes[1].set_title("Accuracy Comparison - Digits") +axes[1].set_ylim(0, 1.05) plt.tight_layout() plt.show() ``` +This gives a visual summary of how each model performs on both datasets. + --- ## Feature Importance (Random Forest) +Random Forests can also tell us which features matter most. + ```python rf_clf.fit(X_train_i, y_train_i) feat_df = pd.DataFrame({ @@ -250,20 +286,21 @@ plt.tight_layout() plt.show() ``` -Petal length & width are most informative in Iris. 
+In the Iris dataset, petal length and petal width are usually the most important features. --- ## Key takeaways -* **Decision Trees** are interpretable but can **overfit** -* **Random Forests** combine many trees → **better generalization** -* Model evaluation should include: +* **Decision Trees** are highly interpretable and mimic human decision-making +* They tend to **overfit** if we let them grow unchecked +* **Random Forests** combine many trees to reduce variance and improve accuracy +* Evaluation should include: * Accuracy * Classification reports * Confusion matrices -* Random Forest especially improves performance on complex data like Digits +* Random Forests often shine on more complex, higher-dimensional data like **Digits** --- @@ -271,15 +308,28 @@ Petal length & width are most informative in Iris. Optional resources: -* [https://www.ibm.com/think/topics/decision-trees](https://www.ibm.com/think/topics/decision-trees) -* [https://www.youtube.com/watch?v=JcI5E2Ng6r4](https://www.youtube.com/watch?v=JcI5E2Ng6r4) -* [https://www.youtube.com/watch?v=gkXX4h3qYm4](https://www.youtube.com/watch?v=gkXX4h3qYm4) -* [https://scikit-learn.org/stable/auto_examples/tree/plot_iris_dtc.html](https://scikit-learn.org/stable/auto_examples/tree/plot_iris_dtc.html) +* Decision Trees + + * [https://www.ibm.com/think/topics/decision-trees](https://www.ibm.com/think/topics/decision-trees) + * [https://www.youtube.com/watch?v=JcI5E2Ng6r4](https://www.youtube.com/watch?v=JcI5E2Ng6r4) + +* Random Forests + + * [https://www.youtube.com/watch?v=gkXX4h3qYm4](https://www.youtube.com/watch?v=gkXX4h3qYm4) + +* scikit-learn demo + + * [https://scikit-learn.org/stable/auto_examples/tree/plot_iris_dtc.html](https://scikit-learn.org/stable/auto_examples/tree/plot_iris_dtc.html) --- ## Next steps -* Hyperparameter tuning (`max_depth`, `min_samples_split`, `n_estimators`) -* Preventing overfitting with **cross-validation** -* Confusion matrices: error trade-offs +In upcoming lessons, we will: + +* Tune hyperparameters such as `max_depth`, `min_samples_split`, and `n_estimators` +* Use **cross-validation** to better estimate model performance +* Explore confusion matrices in more depth and discuss error trade-offs + +``` +``` From c5efa19163c5244fae76bb3760d400f3285ffcab Mon Sep 17 00:00:00 2001 From: Pratik Karunakar Poojari <76115015+pratik3336@users.noreply.github.com> Date: Thu, 20 Nov 2025 14:32:43 -0500 Subject: [PATCH 11/29] Update 03_decision_trees.md --- lessons/03_ML_classification/03_decision_trees.md | 3 --- 1 file changed, 3 deletions(-) diff --git a/lessons/03_ML_classification/03_decision_trees.md b/lessons/03_ML_classification/03_decision_trees.md index fc7faa0..4f3386e 100644 --- a/lessons/03_ML_classification/03_decision_trees.md +++ b/lessons/03_ML_classification/03_decision_trees.md @@ -1,4 +1,3 @@ -````md # Lesson 3 ### CTD Python 200 **Decision Trees and Ensemble Learning (Random Forests)** @@ -331,5 +330,3 @@ In upcoming lessons, we will: * Use **cross-validation** to better estimate model performance * Explore confusion matrices in more depth and discuss error trade-offs -``` -``` From 578c80de6a3329f2d14fc0e7325fbc5e3123bb9a Mon Sep 17 00:00:00 2001 From: Pratik Karunakar Poojari <76115015+pratik3336@users.noreply.github.com> Date: Thu, 20 Nov 2025 14:38:21 -0500 Subject: [PATCH 12/29] Update 03_decision_trees.md --- .../03_ML_classification/03_decision_trees.md | 25 ++++++++++++++++--- 1 file changed, 22 insertions(+), 3 deletions(-) diff --git 
a/lessons/03_ML_classification/03_decision_trees.md b/lessons/03_ML_classification/03_decision_trees.md index 4f3386e..79655db 100644 --- a/lessons/03_ML_classification/03_decision_trees.md +++ b/lessons/03_ML_classification/03_decision_trees.md @@ -81,6 +81,16 @@ plt.rcParams["figure.dpi"] = 120 ## Load datasets +The Iris dataset is a small table of 150 flowers. +For each flower, we have 4 numbers: sepal length, sepal width, petal length, and petal width (all in cm). +Each row is labeled as one of 3 species: setosa, versicolor, or virginica. +It’s simple, clean, and perfect for learning classification. + +The Digits dataset is like tiny black-and-white pictures of handwritten numbers. +Each digit image is 8×8 pixels, flattened into 64 numbers that say how dark each pixel is. +The label tells us which digit (0–9) the image represents. +It’s more complex than Iris and closer to a “real” machine learning problem. + ```python iris = load_iris(as_frame=True) X_iris, y_iris = iris.data, iris.target @@ -96,6 +106,10 @@ print("Digits shape:", X_digits.shape) ## Split data +Here we split each dataset into a training set and a test set. +The model sees the training data and learns from it, but never sees the test data during training. +Later, we use the test set to check how well the model generalizes to new, unseen examples. + ```python def split(X, y): return train_test_split(X, y, test_size=0.2, random_state=42, stratify=y) @@ -126,6 +140,11 @@ print("Decision Tree Accuracy (Iris):", accuracy_score(y_test_i, preds_tree)) print(classification_report(y_test_i, preds_tree)) ``` +In this section, we train a single decision tree on the Iris data. +We then use it to make predictions on the test set and measure how often it’s correct. +This gives us a first baseline: how good is one tree by itself? + + ### Visualizing decisions ```python @@ -150,6 +169,9 @@ less mixed = more confident prediction. ## The Overfitting Problem A tree can become too specialized to the training data. +Now we test the same tree on the more complex Digits dataset. +We’ll see that accuracy is worse, because the tree tends to memorize specific patterns from the training set. +This introduces the idea of overfitting: doing great on training data but worse on new data. ```python preds_digits_tree = tree_clf.fit(X_train_d, y_train_d).predict(X_test_d) @@ -304,9 +326,6 @@ In the Iris dataset, petal length and petal width are usually the most important --- ## Explore More - -Optional resources: - * Decision Trees * [https://www.ibm.com/think/topics/decision-trees](https://www.ibm.com/think/topics/decision-trees) From c82c646d345ebeb09716a54993e8137ac82a1127 Mon Sep 17 00:00:00 2001 From: Pratik Karunakar Poojari <76115015+pratik3336@users.noreply.github.com> Date: Thu, 20 Nov 2025 14:40:21 -0500 Subject: [PATCH 13/29] Update 03_decision_trees.md --- lessons/03_ML_classification/03_decision_trees.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/lessons/03_ML_classification/03_decision_trees.md b/lessons/03_ML_classification/03_decision_trees.md index 79655db..59c0285 100644 --- a/lessons/03_ML_classification/03_decision_trees.md +++ b/lessons/03_ML_classification/03_decision_trees.md @@ -54,6 +54,10 @@ By the end of this lesson, you will be able to: ## Setup +This part is your roadmap for the lesson. +We tell you which models you’ll use (Decision Tree and Random Forest) and which datasets you’ll work with. +You’ll also see how to measure performance and compare models fairly using the same metrics. 
+ ```python !pip install seaborn -q From bda432fcc3ea62b28e164cdb0fa40e85a407356c Mon Sep 17 00:00:00 2001 From: Pratik Karunakar Poojari <76115015+pratik3336@users.noreply.github.com> Date: Thu, 20 Nov 2025 14:42:35 -0500 Subject: [PATCH 14/29] Update 03_decision_trees.md --- lessons/03_ML_classification/03_decision_trees.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/lessons/03_ML_classification/03_decision_trees.md b/lessons/03_ML_classification/03_decision_trees.md index 59c0285..257e82e 100644 --- a/lessons/03_ML_classification/03_decision_trees.md +++ b/lessons/03_ML_classification/03_decision_trees.md @@ -144,6 +144,10 @@ print("Decision Tree Accuracy (Iris):", accuracy_score(y_test_i, preds_tree)) print(classification_report(y_test_i, preds_tree)) ``` +Screenshot 2025-11-20 at 2 41 28 PM + +**Image Credit: Google Colab** + In this section, we train a single decision tree on the Iris data. We then use it to make predictions on the test set and measure how often it’s correct. This gives us a first baseline: how good is one tree by itself? @@ -165,6 +169,10 @@ plt.title("Decision Tree - Iris Dataset") plt.show() ``` + +**Image Credit: Google Colab** + + Decision Trees measure **impurity** at each node: less mixed = more confident prediction. From a751d0a7f55f192716584d1887bb2eb993553495 Mon Sep 17 00:00:00 2001 From: Pratik Karunakar Poojari <76115015+pratik3336@users.noreply.github.com> Date: Thu, 20 Nov 2025 14:43:15 -0500 Subject: [PATCH 15/29] Update 03_decision_trees.md --- lessons/03_ML_classification/03_decision_trees.md | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/lessons/03_ML_classification/03_decision_trees.md b/lessons/03_ML_classification/03_decision_trees.md index 257e82e..648588b 100644 --- a/lessons/03_ML_classification/03_decision_trees.md +++ b/lessons/03_ML_classification/03_decision_trees.md @@ -135,6 +135,10 @@ We now have two datasets: ### Train the tree (Iris) +In this section, we train a single decision tree on the Iris data. +We then use it to make predictions on the test set and measure how often it’s correct. +This gives us a first baseline: how good is one tree by itself? + ```python tree_clf = DecisionTreeClassifier(random_state=42) tree_clf.fit(X_train_i, y_train_i) @@ -148,11 +152,6 @@ print(classification_report(y_test_i, preds_tree)) **Image Credit: Google Colab** -In this section, we train a single decision tree on the Iris data. -We then use it to make predictions on the test set and measure how often it’s correct. -This gives us a first baseline: how good is one tree by itself? - - ### Visualizing decisions ```python From ac1a3f8425e21fe4d5a3cb726222ffa95760b001 Mon Sep 17 00:00:00 2001 From: Pratik Karunakar Poojari <76115015+pratik3336@users.noreply.github.com> Date: Thu, 20 Nov 2025 14:48:19 -0500 Subject: [PATCH 16/29] Update 03_decision_trees.md --- .../03_ML_classification/03_decision_trees.md | 27 ++++++++++++++++--- 1 file changed, 24 insertions(+), 3 deletions(-) diff --git a/lessons/03_ML_classification/03_decision_trees.md b/lessons/03_ML_classification/03_decision_trees.md index 648588b..75aab31 100644 --- a/lessons/03_ML_classification/03_decision_trees.md +++ b/lessons/03_ML_classification/03_decision_trees.md @@ -168,12 +168,12 @@ plt.title("Decision Tree - Iris Dataset") plt.show() ``` +Visualize **Image Credit: Google Colab** - -Decision Trees measure **impurity** at each node: -less mixed = more confident prediction. +Decision Trees measure impurity at each split. 
+Lower impurity = clearer separation between classes → more confident prediction. --- @@ -189,6 +189,27 @@ preds_digits_tree = tree_clf.fit(X_train_d, y_train_d).predict(X_test_d) print("Decision Tree Accuracy (Digits):", accuracy_score(y_test_d, preds_digits_tree)) ``` +Result- Decision Tree Accuracy (Digits): 0.825 + +### What do we learn from this result? + +When we trained the Decision Tree on the simple Iris dataset, it performed very well — almost perfect accuracy. + +But when we tried the **same model** on the more complex Digits dataset, accuracy dropped to around **82–83%**. +This shows that: + +* The tree **memorized** many tiny details in the training data +* But those details **did not apply** to the new digit images +* So the model **struggles** when the task is harder and more varied + +This behavior is called **overfitting**: + +> The model is smart on training data… +> but not very smart on new data. + +In real machine learning work, we want models that **generalize**, not just memorize — +and this is exactly why we introduce **Random Forests** next. + On the more complex **Digits** dataset, accuracy typically drops — a sign of overfitting and poorer generalization. --- From fd6b5b199fa572c339fb29297ccd2063821e735f Mon Sep 17 00:00:00 2001 From: Pratik Karunakar Poojari <76115015+pratik3336@users.noreply.github.com> Date: Thu, 20 Nov 2025 14:48:59 -0500 Subject: [PATCH 17/29] Update 03_decision_trees.md --- lessons/03_ML_classification/03_decision_trees.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/lessons/03_ML_classification/03_decision_trees.md b/lessons/03_ML_classification/03_decision_trees.md index 75aab31..5eee6fc 100644 --- a/lessons/03_ML_classification/03_decision_trees.md +++ b/lessons/03_ML_classification/03_decision_trees.md @@ -152,6 +152,8 @@ print(classification_report(y_test_i, preds_tree)) **Image Credit: Google Colab** +--- + ### Visualizing decisions ```python From 51f730744dea65f12377fa5fab9bebb9ecffeaa9 Mon Sep 17 00:00:00 2001 From: Pratik Karunakar Poojari <76115015+pratik3336@users.noreply.github.com> Date: Thu, 20 Nov 2025 14:51:31 -0500 Subject: [PATCH 18/29] Update 03_decision_trees.md --- lessons/03_ML_classification/03_decision_trees.md | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/lessons/03_ML_classification/03_decision_trees.md b/lessons/03_ML_classification/03_decision_trees.md index 5eee6fc..82e9cda 100644 --- a/lessons/03_ML_classification/03_decision_trees.md +++ b/lessons/03_ML_classification/03_decision_trees.md @@ -221,10 +221,19 @@ On the more complex **Digits** dataset, accuracy typically drops — a sign of o A **Random Forest** builds many trees on slightly different samples of the training data. Each tree votes, and the most common answer wins. -Random Forest = many trees voting → reduced overfitting. +Imagine you have a complex problem to solve, and you gather a group of experts from different fields to provide their input. Each expert provides their opinion based on their expertise and experience. Then, the experts would vote to arrive at a final decision. + +In a random forest classification, multiple decision trees are created using different random subsets of the data and features. Each decision tree is like an expert, providing its opinion on how to classify the data. Predictions are made by calculating the prediction for each decision tree and then taking the most popular result. (For regression, predictions use an averaging technique instead.) 
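
To see the voting idea in code, here is a hand-rolled sketch: five trees, each trained on its own bootstrap sample of the Iris training data, then a simple majority vote. This is illustration only — `RandomForestClassifier` does the same thing for you, and adds random feature selection at each split.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

votes = []
for seed in range(5):
    # Bootstrap sample: draw rows with replacement from the training set.
    X_boot, y_boot = resample(X_train_i, y_train_i, random_state=seed)
    tree = DecisionTreeClassifier(random_state=seed)
    tree.fit(X_boot, y_boot)
    votes.append(tree.predict(X_test_i))

votes = np.stack(votes)  # shape: (n_trees, n_test_samples)
# Majority vote per test sample: the most common label in each column.
majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
print("First 10 ensemble predictions:", majority[:10])
```

With only five trees the vote can still be shaky; the real forest uses a hundred or more, which is where the stability comes from.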
+ +**Example** +In the diagram below, we have a random forest with n decision trees, and we’ve shown the first 5, along with their predictions (either “Dog” or “Cat”). Each tree is exposed to a different number of features and a different sample of the original dataset, and as such, every tree can be different. Each tree makes a prediction. + +Looking at the first 5 trees, we can see that 4/5 predicted the sample was a Cat. The green circles indicate a hypothetical path the tree took to reach its decision. The random forest would count the number of predictions from decision trees for Cat and for Dog, and choose the most popular prediction. Screenshot 2025-11-20 at 2 08 19 PM +**Image Credits:[https://www.datacamp.com/tutorial/random-forests-classifier-python]** + ```python rf_clf = RandomForestClassifier(n_estimators=100, random_state=42) rf_clf.fit(X_train_i, y_train_i) From b4313ee89e01384e6216e4e9362965f0424b707e Mon Sep 17 00:00:00 2001 From: Pratik Karunakar Poojari <76115015+pratik3336@users.noreply.github.com> Date: Thu, 20 Nov 2025 14:52:49 -0500 Subject: [PATCH 19/29] Update 03_decision_trees.md --- lessons/03_ML_classification/03_decision_trees.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/lessons/03_ML_classification/03_decision_trees.md b/lessons/03_ML_classification/03_decision_trees.md index 82e9cda..345cfd2 100644 --- a/lessons/03_ML_classification/03_decision_trees.md +++ b/lessons/03_ML_classification/03_decision_trees.md @@ -223,8 +223,6 @@ Each tree votes, and the most common answer wins. Imagine you have a complex problem to solve, and you gather a group of experts from different fields to provide their input. Each expert provides their opinion based on their expertise and experience. Then, the experts would vote to arrive at a final decision. -In a random forest classification, multiple decision trees are created using different random subsets of the data and features. Each decision tree is like an expert, providing its opinion on how to classify the data. Predictions are made by calculating the prediction for each decision tree and then taking the most popular result. (For regression, predictions use an averaging technique instead.) - **Example** In the diagram below, we have a random forest with n decision trees, and we’ve shown the first 5, along with their predictions (either “Dog” or “Cat”). Each tree is exposed to a different number of features and a different sample of the original dataset, and as such, every tree can be different. Each tree makes a prediction. 
From 650800db0e8b8dca811f61c4bbc2e53f3c8a94d5 Mon Sep 17 00:00:00 2001 From: Pratik Karunakar Poojari <76115015+pratik3336@users.noreply.github.com> Date: Thu, 20 Nov 2025 14:56:54 -0500 Subject: [PATCH 20/29] Update 03_decision_trees.md --- lessons/03_ML_classification/03_decision_trees.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/lessons/03_ML_classification/03_decision_trees.md b/lessons/03_ML_classification/03_decision_trees.md index 345cfd2..998337e 100644 --- a/lessons/03_ML_classification/03_decision_trees.md +++ b/lessons/03_ML_classification/03_decision_trees.md @@ -241,6 +241,10 @@ print("Random Forest Accuracy (Iris):", accuracy_score(y_test_i, preds_rf)) print(classification_report(y_test_i, preds_rf)) ``` +Screenshot 2025-11-20 at 2 55 11 PM + +**Image Credits:Google Colab** + Test on Digits as well: ```python @@ -248,6 +252,10 @@ preds_rf_digits = rf_clf.fit(X_train_d, y_train_d).predict(X_test_d) print("Random Forest Accuracy (Digits):", accuracy_score(y_test_d, preds_rf_digits)) ``` +Screenshot 2025-11-20 at 2 56 06 PM + +**Image Credits:Google Colab** + You should see improved generalization compared to a single tree, especially on Digits. --- From 62134ee34994dec83c72dc72345a4e2d25bf6a8b Mon Sep 17 00:00:00 2001 From: Pratik Karunakar Poojari <76115015+pratik3336@users.noreply.github.com> Date: Thu, 20 Nov 2025 15:29:14 -0500 Subject: [PATCH 21/29] Update 03_decision_trees.md --- lessons/03_ML_classification/03_decision_trees.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/lessons/03_ML_classification/03_decision_trees.md b/lessons/03_ML_classification/03_decision_trees.md index 998337e..0df0a45 100644 --- a/lessons/03_ML_classification/03_decision_trees.md +++ b/lessons/03_ML_classification/03_decision_trees.md @@ -241,7 +241,7 @@ print("Random Forest Accuracy (Iris):", accuracy_score(y_test_i, preds_rf)) print(classification_report(y_test_i, preds_rf)) ``` -Screenshot 2025-11-20 at 2 55 11 PM +Screenshot 2025-11-20 at 3 27 38 PM **Image Credits:Google Colab** @@ -252,7 +252,7 @@ preds_rf_digits = rf_clf.fit(X_train_d, y_train_d).predict(X_test_d) print("Random Forest Accuracy (Digits):", accuracy_score(y_test_d, preds_rf_digits)) ``` -Screenshot 2025-11-20 at 2 56 06 PM +Screenshot 2025-11-20 at 3 28 53 PM **Image Credits:Google Colab** From 2e65a18607bc0d0f8c79d3f681f2bf32c2a2bd62 Mon Sep 17 00:00:00 2001 From: Pratik Karunakar Poojari <76115015+pratik3336@users.noreply.github.com> Date: Thu, 20 Nov 2025 15:32:17 -0500 Subject: [PATCH 22/29] Update 03_decision_trees.md --- lessons/03_ML_classification/03_decision_trees.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/lessons/03_ML_classification/03_decision_trees.md b/lessons/03_ML_classification/03_decision_trees.md index 0df0a45..8a7264b 100644 --- a/lessons/03_ML_classification/03_decision_trees.md +++ b/lessons/03_ML_classification/03_decision_trees.md @@ -292,6 +292,10 @@ plot_confusions(DecisionTreeClassifier(random_state=42), "Decision Tree") plot_confusions(RandomForestClassifier(n_estimators=100, random_state=42), "Random Forest") ``` +Screenshot 2025-11-20 at 3 31 40 PM + +**Image Credits:Google Colab** + --- ## Accuracy Comparison (KNN vs Tree vs Forest) From 9eb2f69615ba62de16c4e4f8158989311eb85e8e Mon Sep 17 00:00:00 2001 From: Pratik Karunakar Poojari <76115015+pratik3336@users.noreply.github.com> Date: Thu, 20 Nov 2025 15:34:41 -0500 Subject: [PATCH 23/29] Update 03_decision_trees.md --- lessons/03_ML_classification/03_decision_trees.md | 5 +++++ 1 file 
changed, 5 insertions(+) diff --git a/lessons/03_ML_classification/03_decision_trees.md b/lessons/03_ML_classification/03_decision_trees.md index 8a7264b..df017e3 100644 --- a/lessons/03_ML_classification/03_decision_trees.md +++ b/lessons/03_ML_classification/03_decision_trees.md @@ -338,6 +338,11 @@ plt.tight_layout() plt.show() ``` +Screenshot 2025-11-20 at 3 33 31 PM + +**Image Credits:Google Colab** + + This gives a visual summary of how each model performs on both datasets. --- From bae5a615668bd1e48def08929ceaddf1529663af Mon Sep 17 00:00:00 2001 From: Pratik Karunakar Poojari <76115015+pratik3336@users.noreply.github.com> Date: Thu, 20 Nov 2025 15:36:11 -0500 Subject: [PATCH 24/29] Update 03_decision_trees.md --- .../03_ML_classification/03_decision_trees.md | 23 ------------------- 1 file changed, 23 deletions(-) diff --git a/lessons/03_ML_classification/03_decision_trees.md b/lessons/03_ML_classification/03_decision_trees.md index df017e3..d990a81 100644 --- a/lessons/03_ML_classification/03_decision_trees.md +++ b/lessons/03_ML_classification/03_decision_trees.md @@ -342,33 +342,10 @@ plt.show() **Image Credits:Google Colab** - This gives a visual summary of how each model performs on both datasets. --- -## Feature Importance (Random Forest) - -Random Forests can also tell us which features matter most. - -```python -rf_clf.fit(X_train_i, y_train_i) -feat_df = pd.DataFrame({ - "feature": X_iris.columns, - "importance": rf_clf.feature_importances_ -}).sort_values(by="importance", ascending=False) - -plt.figure(figsize=(8, 5)) -sns.barplot(data=feat_df, x="importance", y="feature") -plt.title("Feature Importance — Random Forest (Iris)") -plt.tight_layout() -plt.show() -``` - -In the Iris dataset, petal length and petal width are usually the most important features. - ---- - ## Key takeaways * **Decision Trees** are highly interpretable and mimic human decision-making From 5c08b822ee6144588ca292737c966bace285110b Mon Sep 17 00:00:00 2001 From: Pratik Karunakar Poojari <76115015+pratik3336@users.noreply.github.com> Date: Thu, 20 Nov 2025 15:39:17 -0500 Subject: [PATCH 25/29] Update 03_decision_trees.md --- .../03_ML_classification/03_decision_trees.md | 47 +++++++++++++++---- 1 file changed, 39 insertions(+), 8 deletions(-) diff --git a/lessons/03_ML_classification/03_decision_trees.md b/lessons/03_ML_classification/03_decision_trees.md index d990a81..81e3274 100644 --- a/lessons/03_ML_classification/03_decision_trees.md +++ b/lessons/03_ML_classification/03_decision_trees.md @@ -346,18 +346,50 @@ This gives a visual summary of how each model performs on both datasets. 
--- +## Findings + +Absolutely — here’s a clean and student-friendly **Final Takeaways** section you can paste right after the evaluation visuals: 👇 + +--- + +## Final Takeaways + +Here’s what we learned by testing different models on the Iris and Digits datasets: + +### Decision Trees + +* Very easy to understand — you can read every split like a flowchart +* Work great on **simple** datasets like Iris +* But tend to **overfit** — memorize instead of generalize + +### Random Forests + +* Combine many trees → group decision (voting) +* Much **better generalization**, especially on harder tasks like Digits +* Higher accuracy + fewer mistakes across classes +* Still gives some interpretability (feature importance) + +--- + +### 🔑 Big Picture Lessons + +| Concept | What we learned | +| ------------------ | ---------------------------------------------------------------------- | +| Interpretability | Trees are easy to explain to humans | +| Overfitting | Trees fit noise on complex data | +| Ensemble learning | “Many weak learners = one strong learner” | +| Evaluation matters | Accuracy alone doesn’t tell the whole story — confusion matrices help! | + +> If your goal is reliability in the real world: **Random Forests are a safer choice than a single Decision Tree.** + +--- + ## Key takeaways * **Decision Trees** are highly interpretable and mimic human decision-making * They tend to **overfit** if we let them grow unchecked * **Random Forests** combine many trees to reduce variance and improve accuracy -* Evaluation should include: - - * Accuracy - * Classification reports - * Confusion matrices -* Random Forests often shine on more complex, higher-dimensional data like **Digits** - + --- ## Explore More @@ -380,7 +412,6 @@ This gives a visual summary of how each model performs on both datasets. In upcoming lessons, we will: -* Tune hyperparameters such as `max_depth`, `min_samples_split`, and `n_estimators` * Use **cross-validation** to better estimate model performance * Explore confusion matrices in more depth and discuss error trade-offs From ff308e1de2dac9e8c6607917e8c81e9834bf7094 Mon Sep 17 00:00:00 2001 From: Pratik Karunakar Poojari <76115015+pratik3336@users.noreply.github.com> Date: Thu, 20 Nov 2025 15:40:33 -0500 Subject: [PATCH 26/29] Update 03_decision_trees.md --- lessons/03_ML_classification/03_decision_trees.md | 9 --------- 1 file changed, 9 deletions(-) diff --git a/lessons/03_ML_classification/03_decision_trees.md b/lessons/03_ML_classification/03_decision_trees.md index 81e3274..8e3f48f 100644 --- a/lessons/03_ML_classification/03_decision_trees.md +++ b/lessons/03_ML_classification/03_decision_trees.md @@ -345,13 +345,6 @@ plt.show() This gives a visual summary of how each model performs on both datasets. 
--- - -## Findings - -Absolutely — here’s a clean and student-friendly **Final Takeaways** section you can paste right after the evaluation visuals: 👇 - ---- - ## Final Takeaways Here’s what we learned by testing different models on the Iris and Digits datasets: @@ -369,8 +362,6 @@ Here’s what we learned by testing different models on the Iris and Digits data * Higher accuracy + fewer mistakes across classes * Still gives some interpretability (feature importance) ---- - ### 🔑 Big Picture Lessons | Concept | What we learned | From 3b20ab67085558d94a226126672cfb8ef933dd55 Mon Sep 17 00:00:00 2001 From: Pratik Karunakar Poojari <76115015+pratik3336@users.noreply.github.com> Date: Thu, 20 Nov 2025 15:41:12 -0500 Subject: [PATCH 27/29] Update 03_decision_trees.md --- lessons/03_ML_classification/03_decision_trees.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/lessons/03_ML_classification/03_decision_trees.md b/lessons/03_ML_classification/03_decision_trees.md index 8e3f48f..fffd618 100644 --- a/lessons/03_ML_classification/03_decision_trees.md +++ b/lessons/03_ML_classification/03_decision_trees.md @@ -362,8 +362,6 @@ Here’s what we learned by testing different models on the Iris and Digits data * Higher accuracy + fewer mistakes across classes * Still gives some interpretability (feature importance) -### 🔑 Big Picture Lessons - | Concept | What we learned | | ------------------ | ---------------------------------------------------------------------- | | Interpretability | Trees are easy to explain to humans | From f0e27d834329a2d238bf1a68e613779c00d55fe6 Mon Sep 17 00:00:00 2001 From: Pratik Karunakar Poojari <76115015+pratik3336@users.noreply.github.com> Date: Mon, 5 Jan 2026 16:10:14 -0500 Subject: [PATCH 28/29] Update 03_decision_trees.md --- .../03_ML_classification/03_decision_trees.md | 471 +++++++----------- 1 file changed, 190 insertions(+), 281 deletions(-) diff --git a/lessons/03_ML_classification/03_decision_trees.md b/lessons/03_ML_classification/03_decision_trees.md index fffd618..f7a2235 100644 --- a/lessons/03_ML_classification/03_decision_trees.md +++ b/lessons/03_ML_classification/03_decision_trees.md @@ -2,405 +2,314 @@ ### CTD Python 200 **Decision Trees and Ensemble Learning (Random Forests)** -## Why trees? +--- + +## Why Trees? + +In the previous lesson, we learned about **K-Nearest Neighbors (KNN)**. +KNN makes predictions by comparing distances between data points. -Imagine being asked to identify a flower. You might think: +That works well in some settings — but not all. -> “If the petals are tiny, it’s probably setosa. -> If not, check petal width… big petals? Probably virginica.” +Now imagine a different way of thinking: -That thought process — breaking a decision into a sequence of **yes/no questions** — is exactly how a **Decision Tree** works. +> “If an email has lots of dollar signs and exclamation points, it might be spam. +> If it also contains words like *free* or *remove*, that makes spam even more likely.” -image +That style of reasoning — asking a **sequence of yes/no questions** — is exactly how a **Decision Tree** works. -**Image credit: [decision-tree-geeks documentation](https://www.geeksforgeeks.org/machine-learning/decision-tree/)** +decision tree -Deep learning models are often described as “black boxes.” -Decision Trees are the opposite: **transparent and interpretable**. -You can literally trace any prediction step-by-step. +**Image credit:** GeeksForGeeks -However, a single decision tree can become too confident in the training data. 
-This leads to **overfitting**: great performance on seen examples, worse on new data. +Unlike many machine learning models that behave like black boxes, decision trees are: -Random Forests solve this problem by combining many trees. +- **Interpretable** — you can read every decision +- **Human-like** — they resemble flowcharts +- **Powerful on tabular data** + +But they also have a weakness… --- -## How Decision Trees Work +## The Big Idea -1. Start with a root node using one feature to split data -2. Ask a “yes/no” question that reduces label mixing -3. Continue splitting until no more useful splits exist -4. Reach a leaf node → make a prediction +A **single decision tree** can become too confident. +If allowed to grow without constraints, it may memorize the training data. -image +This problem is called **overfitting**. -**Image credit: [decision-tree-geeks documentation](https://www.geeksforgeeks.org/machine-learning/decision-tree/)** +To solve it, we use **Random Forests**, which combine many trees into one stronger model. --- -## What you’ll learn today +## What You’ll Learn Today By the end of this lesson, you will be able to: -- Train and interpret a **Decision Tree Classifier** -- Measure node uncertainty using **Gini Impurity** -- Explain why trees tend to **overfit** -- Use a **Random Forest** to improve accuracy and robustness -- Evaluate models using the metrics from the KNN lesson -- Compare performance on the **Iris** and **Digits** datasets +- Compare **KNN vs Decision Trees vs Random Forests** +- Explain why trees outperform KNN on tabular data +- Understand **overfitting** in decision trees +- Use **Random Forests** to improve generalization +- Interpret results using **precision, recall, and F1 score** +- Connect model behavior to real-world intuition (spam detection) --- -## Setup +## Dataset: Spambase (Real-World Tabular Data) -This part is your roadmap for the lesson. -We tell you which models you’ll use (Decision Tree and Random Forest) and which datasets you’ll work with. -You’ll also see how to measure performance and compare models fairly using the same metrics. +In this lesson we use the **Spambase dataset** from the UCI Machine Learning Repository. -```python -!pip install seaborn -q +Each row represents an **email**. +The features are numeric signals such as: -from sklearn.datasets import load_iris, load_digits -from sklearn.model_selection import train_test_split -from sklearn.metrics import ( - accuracy_score, - classification_report, - confusion_matrix, - ConfusionMatrixDisplay, -) -from sklearn.tree import DecisionTreeClassifier, plot_tree -from sklearn.ensemble import RandomForestClassifier -from sklearn.neighbors import KNeighborsClassifier +- How often words like `"free"`, `"remove"`, `"your"` appear +- Frequency of characters like `"!"` and `"$"` +- Statistics about capital letter usage -import matplotlib.pyplot as plt -import seaborn as sns -import pandas as pd -import numpy as np +The label tells us whether the email is: -plt.rcParams["figure.dpi"] = 120 -```` +- `1` → spam +- `0` → not spam (ham) ---- +This dataset is ideal because: +- It’s realistic +- It has many features +- It clearly shows differences between models -## Load datasets - -The Iris dataset is a small table of 150 flowers. -For each flower, we have 4 numbers: sepal length, sepal width, petal length, and petal width (all in cm). -Each row is labeled as one of 3 species: setosa, versicolor, or virginica. -It’s simple, clean, and perfect for learning classification. 
+--- -The Digits dataset is like tiny black-and-white pictures of handwritten numbers. -Each digit image is 8×8 pixels, flattened into 64 numbers that say how dark each pixel is. -The label tells us which digit (0–9) the image represents. -It’s more complex than Iris and closer to a “real” machine learning problem. +## Setup ```python -iris = load_iris(as_frame=True) -X_iris, y_iris = iris.data, iris.target - -digits = load_digits() -X_digits, y_digits = digits.data, digits.target +import pandas as pd +import numpy as np +import matplotlib.pyplot as plt -print("Iris shape:", X_iris.shape) -print("Digits shape:", X_digits.shape) -``` +from sklearn.model_selection import train_test_split +from sklearn.metrics import classification_report, f1_score +from sklearn.neighbors import KNeighborsClassifier +from sklearn.tree import DecisionTreeClassifier +from sklearn.ensemble import RandomForestClassifier +from sklearn.preprocessing import StandardScaler +from sklearn.pipeline import Pipeline +```` --- -## Split data - -Here we split each dataset into a training set and a test set. -The model sees the training data and learns from it, but never sees the test data during training. -Later, we use the test set to check how well the model generalizes to new, unseen examples. +## Load the Dataset ```python -def split(X, y): - return train_test_split(X, y, test_size=0.2, random_state=42, stratify=y) +from io import BytesIO +import requests + +url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data" +response = requests.get(url) +response.raise_for_status() -X_train_i, X_test_i, y_train_i, y_test_i = split(X_iris, y_iris) -X_train_d, X_test_d, y_train_d, y_test_d = split(X_digits, y_digits) +df = pd.read_csv(BytesIO(response.content), header=None) ``` -We now have two datasets: +The dataset contains **4601 emails** and **58 columns** +(57 features + 1 label). -| Dataset | Type | Classes | Why use it? | -| ------- | --------------- | ------- | ----------------------------------- | -| Iris | Tabular numeric | 3 | Simple, easy to visualize decisions | -| Digits | Image-like grid | 10 | More realistic, harder to classify | +```python +print(df.shape) +df.head() +``` --- -## Part A — Decision Tree Classifier - -### Train the tree (Iris) +## Train / Test Split -In this section, we train a single decision tree on the Iris data. -We then use it to make predictions on the test set and measure how often it’s correct. -This gives us a first baseline: how good is one tree by itself? +We separate features (`X`) from labels (`y`) and use a **stratified split**. +This keeps the spam ratio similar in both sets. ```python -tree_clf = DecisionTreeClassifier(random_state=42) -tree_clf.fit(X_train_i, y_train_i) - -preds_tree = tree_clf.predict(X_test_i) -print("Decision Tree Accuracy (Iris):", accuracy_score(y_test_i, preds_tree)) -print(classification_report(y_test_i, preds_tree)) +X = df.iloc[:, :-1] +y = df.iloc[:, -1] + +X_train, X_test, y_train, y_test = train_test_split( + X, + y, + test_size=0.25, + random_state=0, + stratify=y +) ``` -Screenshot 2025-11-20 at 2 41 28 PM - -**Image Credit: Google Colab** - --- -### Visualizing decisions +## Model 1 — KNN (Scaled Baseline) -```python -plt.figure(figsize=(18, 12)) -plot_tree( - tree_clf, - filled=True, - feature_names=iris.feature_names, - class_names=iris.target_names, - rounded=True, - fontsize=9 -) -plt.title("Decision Tree - Iris Dataset") -plt.show() -``` +We start with **KNN**, using proper feature scaling. 
+This is our **baseline** model. -Visualize +```python +knn_scaled = Pipeline([ + ("scaler", StandardScaler()), + ("knn", KNeighborsClassifier(n_neighbors=5)) +]) -**Image Credit: Google Colab** +knn_scaled.fit(X_train, y_train) +pred_knn = knn_scaled.predict(X_test) -Decision Trees measure impurity at each split. -Lower impurity = clearer separation between classes → more confident prediction. +print(classification_report(y_test, pred_knn, digits=3)) +``` ---- +### What to Notice -## The Overfitting Problem +* KNN works reasonably well +* But performance is limited on high-dimensional tabular data +* Distance alone is not enough to capture complex patterns -A tree can become too specialized to the training data. -Now we test the same tree on the more complex Digits dataset. -We’ll see that accuracy is worse, because the tree tends to memorize specific patterns from the training set. -This introduces the idea of overfitting: doing great on training data but worse on new data. +This sets the stage for trees. -```python -preds_digits_tree = tree_clf.fit(X_train_d, y_train_d).predict(X_test_d) -print("Decision Tree Accuracy (Digits):", accuracy_score(y_test_d, preds_digits_tree)) -``` +--- -Result- Decision Tree Accuracy (Digits): 0.825 +## Model 2 — Decision Tree -### What do we learn from this result? +Decision Trees do **not use distance**. +Instead, they learn **rules** like: -When we trained the Decision Tree on the simple Iris dataset, it performed very well — almost perfect accuracy. +> “Is the frequency of `$` greater than X?” -But when we tried the **same model** on the more complex Digits dataset, accuracy dropped to around **82–83%**. -This shows that: +This makes them very effective for tabular datasets like spam detection. -* The tree **memorized** many tiny details in the training data -* But those details **did not apply** to the new digit images -* So the model **struggles** when the task is harder and more varied +```python +tree = DecisionTreeClassifier(random_state=0) +tree.fit(X_train, y_train) +pred_tree = tree.predict(X_test) -This behavior is called **overfitting**: +print(classification_report(y_test, pred_tree, digits=3)) +``` -> The model is smart on training data… -> but not very smart on new data. +### Why Trees Often Beat KNN Here -In real machine learning work, we want models that **generalize**, not just memorize — -and this is exactly why we introduce **Random Forests** next. +* Each feature is evaluated independently +* Trees naturally model non-linear relationships +* No scaling required +* Well-suited for mixed and sparse signals -On the more complex **Digits** dataset, accuracy typically drops — a sign of overfitting and poorer generalization. +But there’s a problem… --- -## Part B — Random Forest (Ensemble Method) +## Overfitting Warning ⚠️ -A **Random Forest** builds many trees on slightly different samples of the training data. -Each tree votes, and the most common answer wins. +A decision tree can keep splitting until it perfectly classifies the training data. -Imagine you have a complex problem to solve, and you gather a group of experts from different fields to provide their input. Each expert provides their opinion based on their expertise and experience. Then, the experts would vote to arrive at a final decision. +That means: -**Example** -In the diagram below, we have a random forest with n decision trees, and we’ve shown the first 5, along with their predictions (either “Dog” or “Cat”). 
Each tree is exposed to a different number of features and a different sample of the original dataset, and as such, every tree can be different. Each tree makes a prediction. +* Very low training error +* Worse performance on new data -Looking at the first 5 trees, we can see that 4/5 predicted the sample was a Cat. The green circles indicate a hypothetical path the tree took to reach its decision. The random forest would count the number of predictions from decision trees for Cat and for Dog, and choose the most popular prediction. +This is **high variance** behavior. -Screenshot 2025-11-20 at 2 08 19 PM +To fix this, we use ensembles. -**Image Credits:[https://www.datacamp.com/tutorial/random-forests-classifier-python]** +--- -```python -rf_clf = RandomForestClassifier(n_estimators=100, random_state=42) -rf_clf.fit(X_train_i, y_train_i) -preds_rf = rf_clf.predict(X_test_i) +## Model 3 — Random Forest 🌲🌲🌲 -print("Random Forest Accuracy (Iris):", accuracy_score(y_test_i, preds_rf)) -print(classification_report(y_test_i, preds_rf)) -``` +A **Random Forest** is a collection of decision trees. -Screenshot 2025-11-20 at 3 27 38 PM +Each tree: -**Image Credits:Google Colab** +* Sees a random sample of the data +* Uses a random subset of features +* Makes its own prediction -Test on Digits as well: +The forest **votes**, and the majority wins. ```python -preds_rf_digits = rf_clf.fit(X_train_d, y_train_d).predict(X_test_d) -print("Random Forest Accuracy (Digits):", accuracy_score(y_test_d, preds_rf_digits)) -``` - -Screenshot 2025-11-20 at 3 28 53 PM +rf = RandomForestClassifier( + n_estimators=100, + random_state=0 +) -**Image Credits:Google Colab** +rf.fit(X_train, y_train) +pred_rf = rf.predict(X_test) -You should see improved generalization compared to a single tree, especially on Digits. +print(classification_report(y_test, pred_rf, digits=3)) +``` --- -## Confusion Matrices (Iris + Digits) - -Confusion matrices help you see **which classes** are being confused with which. +## Why Random Forests Work Better -```python -def plot_confusions(model, model_name): - # Iris - model.fit(X_train_i, y_train_i) - preds_i = model.predict(X_test_i) - cm_i = confusion_matrix(y_test_i, preds_i) - disp_i = ConfusionMatrixDisplay(cm_i, display_labels=iris.target_names) - - # Digits - model.fit(X_train_d, y_train_d) - preds_d = model.predict(X_test_d) - cm_d = confusion_matrix(y_test_d, preds_d) - disp_d = ConfusionMatrixDisplay(cm_d) - - fig, axes = plt.subplots(1, 2, figsize=(12, 4)) - disp_i.plot(ax=axes[0], colorbar=False) - axes[0].set_title(f"{model_name} - Iris") - - disp_d.plot(ax=axes[1], colorbar=False) - axes[1].set_title(f"{model_name} - Digits") - - plt.tight_layout() - plt.show() - -plot_confusions(DecisionTreeClassifier(random_state=42), "Decision Tree") -plot_confusions(RandomForestClassifier(n_estimators=100, random_state=42), "Random Forest") -``` +* Individual trees make different mistakes +* Voting cancels out errors +* Variance is reduced +* Generalization improves -Screenshot 2025-11-20 at 3 31 40 PM +This is called **ensemble learning**: -**Image Credits:Google Colab** +> Many weak learners → one strong learner --- -## Accuracy Comparison (KNN vs Tree vs Forest) +## Comparing F1 Scores -Now we compare three models side by side: +Accuracy alone can be misleading for spam detection. +We care about both: -* K-Nearest Neighbors (KNN) -* Decision Tree -* Random Forest +* Catching spam (recall) +* Not blocking real emails (precision) + +The **F1 score** balances both. 
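Concretely, F1 is the harmonic mean of the two: `F1 = 2 · precision · recall / (precision + recall)`. A tiny sketch with made-up numbers (not outputs from our models) shows how one weak side drags the score down:

```python
def f1(precision, recall):
    # Harmonic mean: heavily penalizes an imbalance between the two
    return 2 * precision * recall / (precision + recall)

print(f1(0.90, 0.90))  # 0.900 -> both sides strong
print(f1(0.90, 0.50))  # ~0.64 -> high precision can't make up for weak recall
```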
```python -models = [ - ("KNN", KNeighborsClassifier(n_neighbors=5)), - ("Decision Tree", DecisionTreeClassifier(random_state=42)), - ("Random Forest", RandomForestClassifier(n_estimators=100, random_state=42)), -] - -def evaluate(models, X_train, X_test, y_train, y_test): - scores = [] - for name, model in models: - model.fit(X_train, y_train) - preds = model.predict(X_test) - scores.append({"model": name, "accuracy": accuracy_score(y_test, preds)}) - return pd.DataFrame(scores) - -df_iris = evaluate(models, X_train_i, X_test_i, y_train_i, y_test_i) -df_digits = evaluate(models, X_train_d, X_test_d, y_train_d, y_test_d) - -fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True) - -sns.barplot(data=df_iris, x="model", y="accuracy", ax=axes[0]) -axes[0].set_title("Accuracy Comparison - Iris") -axes[0].set_ylim(0, 1.05) - -sns.barplot(data=df_digits, x="model", y="accuracy", ax=axes[1]) -axes[1].set_title("Accuracy Comparison - Digits") -axes[1].set_ylim(0, 1.05) - -plt.tight_layout() -plt.show() +models = { + "KNN (scaled)": pred_knn, + "Decision Tree": pred_tree, + "Random Forest": pred_rf +} + +for name, preds in models.items(): + score = f1_score(y_test, preds) + print(f"{name:15s} F1 = {score:.3f}") ``` -Screenshot 2025-11-20 at 3 33 31 PM - -**Image Credits:Google Colab** +### Typical Pattern You’ll See -This gives a visual summary of how each model performs on both datasets. +``` +KNN (scaled) < Decision Tree < Random Forest +``` --- -## Final Takeaways -Here’s what we learned by testing different models on the Iris and Digits datasets: +## Final Takeaways ### Decision Trees -* Very easy to understand — you can read every split like a flowchart -* Work great on **simple** datasets like Iris -* But tend to **overfit** — memorize instead of generalize +* Easy to understand and visualize +* Excellent for tabular data +* High variance → prone to overfitting ### Random Forests -* Combine many trees → group decision (voting) -* Much **better generalization**, especially on harder tasks like Digits -* Higher accuracy + fewer mistakes across classes -* Still gives some interpretability (feature importance) +* Reduce overfitting +* Improve reliability +* Often the best default choice for tabular ML -| Concept | What we learned | -| ------------------ | ---------------------------------------------------------------------- | -| Interpretability | Trees are easy to explain to humans | -| Overfitting | Trees fit noise on complex data | -| Ensemble learning | “Many weak learners = one strong learner” | -| Evaluation matters | Accuracy alone doesn’t tell the whole story — confusion matrices help! 
| +| Concept | Lesson | +| -------- | ----------------------------- | +| KNN | Distance-based, needs scaling | +| Trees | Rule-based, interpretable | +| Forests | Ensembles reduce variance | +| F1 Score | Balances precision & recall | -> If your goal is reliability in the real world: **Random Forests are a safer choice than a single Decision Tree.** +> **If you want strong real-world performance on tabular data, Random Forests are a safer choice than a single tree.** --- -## Key takeaways - -* **Decision Trees** are highly interpretable and mimic human decision-making -* They tend to **overfit** if we let them grow unchecked -* **Random Forests** combine many trees to reduce variance and improve accuracy - ---- - -## Explore More -* Decision Trees - - * [https://www.ibm.com/think/topics/decision-trees](https://www.ibm.com/think/topics/decision-trees) - * [https://www.youtube.com/watch?v=JcI5E2Ng6r4](https://www.youtube.com/watch?v=JcI5E2Ng6r4) - -* Random Forests - - * [https://www.youtube.com/watch?v=gkXX4h3qYm4](https://www.youtube.com/watch?v=gkXX4h3qYm4) - -* scikit-learn demo - - * [https://scikit-learn.org/stable/auto_examples/tree/plot_iris_dtc.html](https://scikit-learn.org/stable/auto_examples/tree/plot_iris_dtc.html) - ---- - -## Next steps +## Next Steps In upcoming lessons, we will: -* Use **cross-validation** to better estimate model performance -* Explore confusion matrices in more depth and discuss error trade-offs - +* Use **cross-validation** for more reliable evaluation +* Tune hyperparameters +* Explore **feature importance** and model interpretation +* Discuss trade-offs between different error types +--- From f08d7e2c13b409a52f163abef10561d2bb771ebb Mon Sep 17 00:00:00 2001 From: Pratik Karunakar Poojari <76115015+pratik3336@users.noreply.github.com> Date: Mon, 5 Jan 2026 16:21:32 -0500 Subject: [PATCH 29/29] Update 03_decision_trees.md --- lessons/03_ML_classification/03_decision_trees.md | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/lessons/03_ML_classification/03_decision_trees.md b/lessons/03_ML_classification/03_decision_trees.md index f7a2235..ba058e4 100644 --- a/lessons/03_ML_classification/03_decision_trees.md +++ b/lessons/03_ML_classification/03_decision_trees.md @@ -118,6 +118,8 @@ print(df.shape) df.head() ``` +Screenshot 2026-01-05 at 4 17 33 PM + --- ## Train / Test Split @@ -157,6 +159,8 @@ pred_knn = knn_scaled.predict(X_test) print(classification_report(y_test, pred_knn, digits=3)) ``` +Screenshot 2026-01-05 at 4 18 28 PM + ### What to Notice * KNN works reasonably well @@ -184,6 +188,9 @@ pred_tree = tree.predict(X_test) print(classification_report(y_test, pred_tree, digits=3)) ``` +Screenshot 2026-01-05 at 4 19 28 PM + + ### Why Trees Often Beat KNN Here * Each feature is evaluated independently @@ -234,6 +241,9 @@ pred_rf = rf.predict(X_test) print(classification_report(y_test, pred_rf, digits=3)) ``` +Screenshot 2026-01-05 at 4 20 15 PM + + --- ## Why Random Forests Work Better @@ -271,6 +281,9 @@ for name, preds in models.items(): print(f"{name:15s} F1 = {score:.3f}") ``` +Screenshot 2026-01-05 at 4 21 13 PM + + ### Typical Pattern You’ll See ```