# Lesson 3
### CTD Python 200
**Decision Trees and Ensemble Learning (Random Forests)**

---

## Why Trees?

In the previous lesson, we learned about **K-Nearest Neighbors (KNN)**.
KNN makes predictions by comparing distances between data points.

That works well in some settings — but not all.

Now imagine a different way of thinking:

> “If an email has lots of dollar signs and exclamation points, it might be spam.
> If it also contains words like *free* or *remove*, that makes spam even more likely.”

That style of reasoning — asking a **sequence of yes/no questions** — is exactly how a **Decision Tree** works.

<img width="800" height="400" alt="decision tree" src="https://github.com/user-attachments/assets/3cd4e0d6-8da7-4dc3-b05b-f1379fae0f4c" />

**Image credit:** GeeksForGeeks
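
To make that intuition concrete, here is a tiny hand-written version of such a "tree": just nested yes/no checks. The feature names and thresholds are made up for illustration; a real decision tree learns them from the data.

```python
def looks_like_spam(dollar_freq, exclam_freq, mentions_free):
    """Hypothetical hand-rolled decision path (illustration only)."""
    if dollar_freq > 0.1:            # lots of dollar signs?
        if mentions_free:            # also mentions "free"?
            return True              # very likely spam
        return exclam_freq > 0.5     # otherwise, many "!" tips the balance
    return False                     # few dollar signs: probably ham
```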

Unlike many machine learning models that behave like black boxes, decision trees are:

- **Interpretable** — you can read every decision
- **Human-like** — they resemble flowcharts
- **Powerful on tabular data**

But they also have a weakness…

---

## The Big Idea

A **single decision tree** can become too confident.
If allowed to grow without constraints, it may memorize the training data.

This problem is called **overfitting**.

To solve it, we use **Random Forests**, which combine many trees into one stronger model.

---

## What You’ll Learn Today

By the end of this lesson, you will be able to:

- Compare **KNN vs Decision Trees vs Random Forests**
- Explain why trees outperform KNN on tabular data
- Understand **overfitting** in decision trees
- Use **Random Forests** to improve generalization
- Interpret results using **precision, recall, and F1 score**
- Connect model behavior to real-world intuition (spam detection)

---

## Dataset: Spambase (Real-World Tabular Data)

In this lesson we use the **Spambase dataset** from the UCI Machine Learning Repository.

Each row represents an **email**.
The features are numeric signals such as:

- How often words like `"free"`, `"remove"`, `"your"` appear
- Frequency of characters like `"!"` and `"$"`
- Statistics about capital letter usage

The label tells us whether the email is:

- `1` → spam
- `0` → not spam (ham)

This dataset is ideal because:
- It’s realistic
- It has many features
- It clearly shows differences between models

---

## Setup

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
```

---

## Load the Dataset

```python
from io import BytesIO
import requests

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"
response = requests.get(url)
response.raise_for_status()

df = pd.read_csv(BytesIO(response.content), header=None)
```

The dataset contains **4601 emails** and **58 columns**
(57 features + 1 label).

```python
print(df.shape)
df.head()
```

<img width="930" height="251" alt="Screenshot 2026-01-05 at 4 17 33 PM" src="https://github.com/user-attachments/assets/b49624ed-cc21-4d6a-8a05-4efe4f2ce950" />

---

## Train / Test Split

We separate features (`X`) from labels (`y`) and use a **stratified split**.
This keeps the spam ratio similar in both sets.

```python
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    random_state=0,
    stratify=y
)
```
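
As a quick sanity check, you can confirm that stratification kept the spam ratio nearly identical in both splits:

```python
# With stratify=y, the spam proportion should match across train and test
print(f"Train spam ratio: {y_train.mean():.3f}")
print(f"Test spam ratio:  {y_test.mean():.3f}")
```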

---

## Model 1 — KNN (Scaled Baseline)

We start with **KNN**, using proper feature scaling.
This is our **baseline** model.

```python
knn_scaled = Pipeline([
("scaler", StandardScaler()),
("knn", KNeighborsClassifier(n_neighbors=5))
])

knn_scaled.fit(X_train, y_train)
pred_knn = knn_scaled.predict(X_test)

print(classification_report(y_test, pred_knn, digits=3))
```

<img width="537" height="156" alt="Screenshot 2026-01-05 at 4 18 28 PM" src="https://github.com/user-attachments/assets/d35ea1c0-fdf9-4b80-94a1-53a008ad89e7" />

### What to Notice

* KNN gives a reasonable baseline
* But with 57 numeric features, distances become less informative (the curse of dimensionality)
* Distance alone cannot capture the rule-like patterns that separate spam from ham

This sets the stage for trees.

---

## Model 2 — Decision Tree

Decision Trees do **not use distance**.
Instead, they learn **rules** like:

> “Is the frequency of `$` greater than X?”

This makes them very effective for tabular datasets like spam detection.

```python
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)
pred_tree = tree.predict(X_test)

print(classification_report(y_test, pred_tree, digits=3))
```

<img width="522" height="154" alt="Screenshot 2026-01-05 at 4 19 28 PM" src="https://github.com/user-attachments/assets/7f87bfcf-ab0e-4567-a17f-1e40689f2d66" />


### Why Trees Often Beat KNN Here

* Each feature is evaluated independently
* Trees naturally model non-linear relationships
* No scaling required
* Well-suited for mixed and sparse signals

But there’s a problem…

---

## Overfitting Warning ⚠️

A decision tree can keep splitting until it perfectly classifies the training data.

That means:

* Very low training error
* Worse performance on new data

This is **high variance** behavior.
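
You can see this gap directly by comparing the fitted tree's accuracy on the training set and on the test set:

```python
# An unconstrained tree usually scores near 100% on the data it memorized,
# but noticeably lower on held-out data: the overfitting gap.
print(f"Train accuracy: {tree.score(X_train, y_train):.3f}")
print(f"Test accuracy:  {tree.score(X_test, y_test):.3f}")
```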

To fix this, we use ensembles.

---

## Model 3 — Random Forest 🌲🌲🌲

A **Random Forest** is a collection of decision trees.

Each tree:

* Is trained on a random (bootstrap) sample of the training data
* Considers only a random subset of features at each split
* Makes its own prediction

The forest **votes**, and the majority wins.

```python
rf = RandomForestClassifier(
    n_estimators=100,
    random_state=0
)

rf.fit(X_train, y_train)
pred_rf = rf.predict(X_test)

print(classification_report(y_test, pred_rf, digits=3))
```

<img width="517" height="163" alt="Screenshot 2026-01-05 at 4 20 15 PM" src="https://github.com/user-attachments/assets/70956555-6be3-430e-a4f8-dca201c3c261" />


---

## Why Random Forests Work Better

* Individual trees make different mistakes
* Voting cancels out errors
* Variance is reduced
* Generalization improves

This is called **ensemble learning**:

> Many weak learners → one strong learner
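
One way to see the effect of bootstrap sampling is the out-of-bag (OOB) score: each tree is evaluated on the training rows it never saw, giving a built-in estimate of generalization without touching the test set. A minimal sketch, assuming you refit with `oob_score=True`:

```python
# Each tree trains on a bootstrap sample, leaving out ~37% of the rows;
# scoring on those left-out rows estimates generalization for free.
rf_oob = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf_oob.fit(X_train, y_train)
print(f"Out-of-bag accuracy: {rf_oob.oob_score_:.3f}")
```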

---

## Comparing F1 Scores

Accuracy alone can be misleading for spam detection.
We care about both:

* Catching spam (recall)
* Not blocking real emails (precision)

The **F1 score** balances both: it is the harmonic mean of precision and recall, F1 = 2 · (precision · recall) / (precision + recall).

```python
models = {
"KNN (scaled)": pred_knn,
"Decision Tree": pred_tree,
"Random Forest": pred_rf
}

for name, preds in models.items():
    score = f1_score(y_test, preds)
    print(f"{name:15s} F1 = {score:.3f}")
```

<img width="297" height="65" alt="Screenshot 2026-01-05 at 4 21 13 PM" src="https://github.com/user-attachments/assets/81688ef4-279c-4f0d-a38c-f08c81944f15" />


### Typical Pattern You’ll See

```
KNN (scaled) < Decision Tree < Random Forest
```

---

## Final Takeaways

### Decision Trees

* Easy to understand and visualize
* Excellent for tabular data
* High variance → prone to overfitting

### Random Forests

* Reduce overfitting
* Improve reliability
* Often the best default choice for tabular ML

| Concept | Takeaway |
| -------- | ----------------------------- |
| KNN | Distance-based, needs scaling |
| Trees | Rule-based, interpretable |
| Forests | Ensembles reduce variance |
| F1 Score | Balances precision & recall |

> **If you want strong real-world performance on tabular data, Random Forests are a safer choice than a single tree.**

---

## Next Steps

In upcoming lessons, we will:

* Use **cross-validation** for more reliable evaluation
* Tune hyperparameters
* Explore **feature importance** and model interpretation
* Discuss trade-offs between different error types

---