
Conversation

@EricThomson
Collaborator

Closes #58

A lot of this is review from Python 100. The one-hot encoding and PCA material is new content.

@EricThomson requested review from roshansuresh and removed the review request for Shubham-Agarwall on January 5, 2026 at 15:19.
Collaborator

@roshansuresh left a comment


Overall this is a good lesson. The demo is great and instructive, but the example before that could also be approached the same way. PCA is confusing to explain since the combination of features does not always happen in a predictive manner. My apologies for the large number of comments. Hopefully they are helpful!

- Feature scaling (standardization, normalization)
- Encoding categorical variables (one hot encoding)
- Creating new features (feature engineering)
- Dimensionality reduction and Principal component analysis

"Dimensionality reduction using Principal Component Analysis (PCA)"

- Creating new features (feature engineering)
- Dimensionality reduction and Principal component analysis

Some of the material will be review of what you learned in Python 100, but specifically geared toward optimizing data for consumption by classifiers.

"will be a review of"


"optimizing data for consumption by classifiers" is a bit confusing. Maybe say "geared towards preparing the data to train a classifier"


## 1. Numerical vs Categorical Features

[Video on numeric vs categorical data](https://www.youtube.com/watch?v=rodnzzR4dLo)

Perhaps include the link in a sentence introducing it?


Before we can train a classifier, we need to understand the kind of data we are giving it. Machine learning models only know how to work with _numbers_, so every feature in our dataset eventually has to be represented numerically.

**Numerical features** are things that are *already* numbers: age, height, temperature, income. They are typically represented as floats or ints in Python. Models generally work well with these, but many algorithms still need the numbers to be put on a similar scale before training. We will cover scaling next.

generally not a fan of using "things"
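To make the scaling point concrete, here is a minimal sketch of standardization with scikit-learn's `StandardScaler` (the feature values below are made up for illustration, and scikit-learn is assumed to be available in the lesson environment):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numerical features on very different scales:
# column 0 is age in years, column 1 is income in dollars.
X = np.array([
    [25,  40_000],
    [32,  55_000],
    [47,  90_000],
    [51, 120_000],
], dtype=float)

# Standardization: subtract each column's mean and divide by its std,
# so every feature ends up with mean ~0 and standard deviation ~1.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # approximately [0, 0]
print(X_scaled.std(axis=0))   # approximately [1, 1]
```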


**Categorical features** describe types or labels instead of quantities. They are often represented as `strings` in Python: city name, animal species, shirt size. These values mean something to *us*, but such raw text is not useful to a machine learning model. We need to convert these categories into a numerical format that a classifier can learn from. That is where things like one-hot encoding come in (which we will cover below).
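As a preview, here is a minimal sketch of one-hot encoding with pandas (the `shirt_size` values are invented for illustration; the lesson's own encoding section below may use a different tool, such as scikit-learn's `OneHotEncoder`):

```python
import pandas as pd

# A hypothetical categorical feature stored as strings.
df = pd.DataFrame({"shirt_size": ["small", "large", "medium", "small"]})

# One-hot encoding: one 0/1 column per category, which a classifier can learn from.
encoded = pd.get_dummies(df, columns=["shirt_size"], dtype=int)
print(encoded)
#    shirt_size_large  shirt_size_medium  shirt_size_small
# 0                 0                  0                 1
# 1                 1                  0                 0
# 2                 0                  1                 0
# 3                 0                  0                 1
```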

Even large language models (LLMs) do not work with strings: as we will see in future weeks when we cover LLMs, linguistic inputs must be converted to numerical arrays before large language models can get traction with them.

"get traction with them" doesn't seem right here. Perhaps "work with them?"


![PCA Room](resources/jellyfish_room_pca.jpg)

Now imagine in that video of the room that on the desk there is a small jellyfish toy with a built-in light that cycles between deep violet and almost-black. The group of pixels that represent the jellyfish all brighten and darken together in their own violet rhythm, independently of the room's background illumination. This creates a localized cluster of highly correlated pixel changes that are not explained by the global brightness pattern. In a case like this, with pixels that fluctuate together independently of the background illumination, PCA would automatically identify the jellyfish pixel cluster as the *second* principal component.

There's an image attached here, not a video.
Also, not sure why this would be the "second" principal component.


In this way, PCA acts like a very smart form of compression. Instead of throwing away random pixels or selecting every third column of the image, it builds new features that preserve as much of the original information as possible based on which pixels are correlated with each other in a dataset.
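Here is a minimal sketch of that compression idea with scikit-learn's `PCA`, using small synthetic data rather than actual video frames (the two underlying signals are stand-ins for "background brightness" and "jellyfish brightness"):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic "frames": 200 samples of 50 correlated features,
# generated from just two underlying signals plus a little noise.
signals = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 50))
X = signals @ mixing + 0.05 * rng.normal(size=(200, 50))

# Compress each 50-feature sample down to 2 component scores.
pca = PCA(n_components=2)
scores = pca.fit_transform(X)

# Approximately rebuild the original features from those 2 numbers.
X_approx = pca.inverse_transform(scores)

print(scores.shape)                         # (200, 2)
print(pca.explained_variance_ratio_.sum())  # close to 1.0 for this data
print(np.abs(X - X_approx).max())           # small reconstruction error
```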

PCA offers a way to reconstruct the original dataset from these compressed features. In the jellyfish room example, PCA would let us represent each frame of the video using just two numbers: one for the overall background brightness level (the first principal component) and one for the brightness of the jellyfish toy (the second principal component). This is a huge reduction from millions of pixel values to just two numbers, while still capturing the essential content of each frame.

Yeah, not a fan of this example unfortunately. Perhaps it's just best to explain using the demo.


My main issue is with PCA reducing the entire image to two values, with certainty about what those values represent.

plt.show()
```

The mean face is the average of all faces in the dataset. You can think of the eigenfaces as basic building blocks for constructing individual faces. PC1 is the eigenface that captures the most correlated activity among the pixels, PC2 the second most, and so on. Each eigenface shows the discovered pattern of correlated pixel intensities. Red regions mean "add brightness to the mean" when you move in the direction of that component, and blue regions mean "subtract brightness here".
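In code terms, the mean face and the eigenfaces are just arrays attached to the fitted PCA object (a sketch only: the 64x64 image size, the number of components, and the stand-in `faces` array are assumptions, not the lesson's exact setup):

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data: 400 flattened 64x64 face images (the lesson uses a real face dataset).
rng = np.random.default_rng(0)
faces = rng.random((400, 64 * 64))

pca = PCA(n_components=10)
pca.fit(faces)

mean_face = pca.mean_.reshape(64, 64)             # the average of all faces
eigenfaces = pca.components_.reshape(-1, 64, 64)  # one 64x64 pattern per component

print(mean_face.shape)   # (64, 64)
print(eigenfaces.shape)  # (10, 64, 64) -- PC1 is eigenfaces[0], PC2 is eigenfaces[1], ...
```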

This contradicts what was mentioned in the previous example. Each principal component is essentially an image of the same shape as the images in the dataset, whereas the previous example says PCA can reduce an entire image to two numbers.


"PC1 is the eigenface that captures the most correlated features among the pixels"


"Red regions mean "add brightness to the mean" when you move in the direction of that component, and blue regions mean "subtract brightness here"." Direction of the component is a bit confusing here. Maybe remove?


fig, axes = plt.subplots(len(components_list), len(rand_indices), figsize=(10, 7))

for i, n in enumerate(components_list):

Code needs some more comments explaining the different steps
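For reference, here is one possible shape for that loop with step-by-step comments (a sketch only: `faces`, `components_list`, `rand_indices`, and the 64x64 image size are stand-ins for the lesson's actual definitions):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Stand-in setup: flattened 64x64 face images and the loop inputs.
rng = np.random.default_rng(0)
faces = rng.random((400, 64 * 64))
components_list = [10, 50, 150]                               # PCs to keep per row
rand_indices = rng.choice(len(faces), size=5, replace=False)  # which faces to show

fig, axes = plt.subplots(len(components_list), len(rand_indices), figsize=(10, 7))

for i, n in enumerate(components_list):
    # Fit PCA keeping only the first n principal components.
    pca = PCA(n_components=n).fit(faces)
    # Compress the selected faces to n scores each, then reconstruct the pixels.
    reconstructed = pca.inverse_transform(pca.transform(faces[rand_indices]))
    for j, face in enumerate(reconstructed):
        # One reconstructed face per column in row i.
        axes[i, j].imshow(face.reshape(64, 64), cmap="gray")
        axes[i, j].set_xticks([])
        axes[i, j].set_yticks([])
    axes[i, 0].set_ylabel(f"{n} PCs")  # label each row by the number of components kept

plt.show()
```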



Development

Successfully merging this pull request may close these issues.

ML Module Week 3 Topic 1: Data preprocessing and Feature engineering
