Preprocessing for classification (week 3) #78
base: main
Conversation
roshansuresh left a comment:
Overall this is a good lesson. The demo is great and instructive, but the example before it could be approached in the same way. PCA is tricky to explain, since the way the features get combined is not always predictable. My apologies for the large number of comments; hopefully they are helpful!
> - Feature scaling (standardization, normalization)
> - Encoding categorical variables (one hot encoding)
> - Creating new features (feature engineering)
> - Dimensionality reduction and Principal component analysis
"Dimensionality reduction using Principal Component Analysis (PCA)"
> Some of the material will be review of what you learned in Python 100, but specifically geared toward optimizing data for consumption by classifiers.
"will be a review of"
"optimizing data for consumption by classifiers" is a bit confusing. Maybe say "geared towards preparing the data to train a classifier"
> ## 1. Numerical vs Categorical Features
> [Video on numeric vs categorical data](https://www.youtube.com/watch?v=rodnzzR4dLo)
Perhaps include the link in a sentence introducing it?
> Before we can train a classifier, we need to understand the kind of data we are giving it. Machine learning models only know how to work with _numbers_, so every feature in our dataset eventually has to be represented numerically.
> **Numerical features** are things that are *already* numbers: age, height, temperature, income. They are typically represented as floats or ints in Python. Models generally work well with these, but many algorithms still need the numbers to be put on a similar scale before training. We will cover scaling next.
generally not a fan of using "things"
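On the scaling point in that paragraph: for concreteness, here is a minimal sketch of what putting numbers on a similar scale can look like, assuming scikit-learn's `StandardScaler` (the feature values are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numerical features on very different scales:
# age in years and income in dollars.
X = np.array([
    [25,  30_000],
    [40,  85_000],
    [58, 120_000],
], dtype=float)

# Standardization rescales each column to mean 0 and unit variance,
# so no single feature dominates just because its units are larger.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.round(2))
```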
> **Categorical features** describe types or labels instead of quantities. They are often represented as `strings` in Python: city name, animal species, shirt size. These values mean something to *us*, but such raw text is not useful to a machine learning model. We need to convert these categories into a numerical format that a classifier can learn from. That is where things like one-hot encoding come in (which we will cover below).
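For the one-hot encoding mention, a minimal sketch with pandas (the column and category names are made up for illustration):

```python
import pandas as pd

# Hypothetical categorical feature stored as strings.
df = pd.DataFrame({"size": ["S", "M", "L", "M"]})

# One-hot encoding replaces the single string column with one 0/1
# column per category, which a classifier can consume directly.
encoded = pd.get_dummies(df, columns=["size"], dtype=int)
print(encoded)
```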
> Even large language models (LLMs) do not work with strings: as we will see in future weeks when we cover LLMs, linguistic inputs must be converted to numerical arrays before large language models can get traction with them.
"get traction with them" doesn't seem right here. Perhaps "work with them?"
|  | ||
> Now imagine in that video of the room that on the desk there is a small jellyfish toy with a built-in light that cycles between deep violet and almost-black. The group of pixels that represent the jellyfish all brighten and darken together in their own violet rhythm, independently of the room's background illumination. This creates a localized cluster of highly correlated pixel changes that are not explained by the global brightness pattern. In a case like this, with pixels that fluctuate together independently of the background illumination, PCA would automatically identify the jellyfish pixel cluster as the *second* principal component.
There's an image attached here, not a video.
Also, not sure why this would be the "second" principal component.
> In this way, PCA acts like a very smart form of compression. Instead of throwing away random pixels or selecting every third column of the image, it builds new features that preserve as much of the original information as possible based on which pixels are correlated with each other in a dataset.
> PCA offers a way to reconstruct the original dataset from these compressed features. In the jellyfish room example, PCA would let us represent each frame of the video using just two numbers: one for the overall background brightness level (the first principal component) and one for the brightness of the jellyfish toy (the second principal component). This is a huge reduction from millions of pixel values to just two numbers, while still capturing the essential content of each frame.
Yeah, not a fan of this example unfortunately. Perhaps it's just best to explain using the demo.
My main issue is with PCA reducing the entire image to two values with certainty on what those values represent
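For what it's worth, the compress-and-reconstruct idea can be shown without the jellyfish framing. A minimal sketch with scikit-learn on synthetic data (the choice of two components is purely illustrative; in general you inspect `explained_variance_ratio_` rather than assume in advance what each component represents):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic "frames": 200 samples of 1,000 features whose variance mostly
# lies along two hidden directions, plus a little noise.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 1_000))
frames = latent @ mixing + 0.01 * rng.normal(size=(200, 1_000))

pca = PCA(n_components=2)
codes = pca.fit_transform(frames)             # each frame compressed to 2 numbers
reconstructed = pca.inverse_transform(codes)  # approximate original 1,000 features

print(codes.shape)                            # (200, 2)
print(pca.explained_variance_ratio_.sum())    # variance kept by the 2 components
print(np.abs(frames - reconstructed).max())   # reconstruction error is small here
```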
> The mean face is the average of all faces in the dataset. You can think of the eigenfaces as basic building blocks for constructing individual faces. PC1 is the eigenface that captures the most correlated activity among the pixels, PC2 the second most, and so on. Each eigenface shows the discovered pattern of correlated pixel intensities. Red regions mean "add brightness to the mean" when you move in the direction of that component, and blue regions mean "subtract brightness here".
This contradicts what was mentioned in the previous example. Each principal component is essentially an image of the same shape as the images in the dataset, whereas earlier it says PCA can reduce an entire image to two numbers.
"PC1 is the eigenface that captures the most correlated features among the pixels"
"Red regions mean "add brightness to the mean" when you move in the direction of that component, and blue regions mean "subtract brightness here"." Direction of the component is a bit confusing here. Maybe remove?
> ```python
> fig, axes = plt.subplots(len(components_list), len(rand_indices), figsize=(10, 7))
>
> for i, n in enumerate(components_list):
> ```
Code needs some more comments explaining the different steps
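As one possible shape for those comments, here is a guess at the surrounding code based only on the two quoted lines (`components_list`, `rand_indices`, and the reconstruction-grid idea are assumptions; adjust to whatever the lesson's code actually does), written with the kind of step-by-step comments that would help:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA

faces = fetch_olivetti_faces()
X = faces.data                                 # (400, 4096) flattened 64x64 faces

components_list = [10, 50, 200]                # numbers of components to compare
rng = np.random.default_rng(0)
rand_indices = rng.choice(len(X), size=4, replace=False)  # which faces to show

# One row per choice of n_components, one column per sampled face.
fig, axes = plt.subplots(len(components_list), len(rand_indices), figsize=(10, 7))

for i, n in enumerate(components_list):
    # Fit PCA keeping n components, compress every face, then reconstruct.
    pca = PCA(n_components=n)
    reconstructed = pca.inverse_transform(pca.fit_transform(X))

    for j, idx in enumerate(rand_indices):
        # Reshape the reconstructed row back into a 64x64 image and plot it.
        axes[i, j].imshow(reconstructed[idx].reshape(64, 64), cmap="gray")
        axes[i, j].axis("off")

    # Label each row with how many components were kept.
    axes[i, 0].set_title(f"{n} components", fontsize=9, loc="left")

plt.tight_layout()
plt.show()
```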
Closes #58
A lot of this is review from Python 100; the one-hot encoding and PCA material is new content.