Preprocessing for classification (week 3) #78
base: main
Conversation
roshansuresh left a comment:
Overall this is a good lesson. The demo is great and instructive, but the example before it could be approached in the same way. PCA is tricky to explain, since the way the features get combined is not always predictable. My apologies for the large number of comments; hopefully they are helpful!
> - Feature scaling (standardization, normalization)
> - Encoding categorical variables (one hot encoding)
> - Creating new features (feature engineering)
> - Dimensionality reduction and Principal component analysis
"Dimensionality reduction using Principal Component Analysis (PCA)"
> Some of the material will be review of what you learned in Python 100, but specifically geared toward optimizing data for consumption by classifiers.
"will be a review of"
"optimizing data for consumption by classifiers" is a bit confusing. Maybe say "geared towards preparing the data to train a classifier"
> ## 1. Numerical vs Categorical Features
> [Video on numeric vs categorical data](https://www.youtube.com/watch?v=rodnzzR4dLo)
Perhaps include the link in a sentence introducing it?
> Before we can train a classifier, we need to understand the kind of data we are giving it. Machine learning models only know how to work with _numbers_, so every feature in our dataset eventually has to be represented numerically.
> **Numerical features** are things that are *already* numbers: age, height, temperature, income. They are typically represented as floats or ints in Python. Models generally work well with these, but many algorithms still need the numbers to be put on a similar scale before training. We will cover scaling next.
generally not a fan of using "things"
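On the scaling point in that paragraph: for concreteness, here is a minimal sketch of what putting numbers on a similar scale can look like, assuming scikit-learn's `StandardScaler` (the feature values are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numerical features on very different scales:
# age in years and income in dollars.
X = np.array([
    [25,  30_000],
    [40,  85_000],
    [58, 120_000],
], dtype=float)

# Standardization rescales each column to mean 0 and unit variance,
# so no single feature dominates just because its units are larger.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.round(2))
```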
> **Categorical features** describe types or labels instead of quantities. They are often represented as `strings` in Python: city name, animal species, shirt size. These values mean something to *us*, but such raw text is not useful to a machine learning model. We need to convert these categories into a numerical format that a classifier can learn from. That is where things like one-hot encoding come in (which we will cover below).
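For the one-hot encoding mention, a minimal sketch with pandas (the column and category names are made up for illustration):

```python
import pandas as pd

# Hypothetical categorical feature stored as strings.
df = pd.DataFrame({"size": ["S", "M", "L", "M"]})

# One-hot encoding replaces the single string column with one 0/1
# column per category, which a classifier can consume directly.
encoded = pd.get_dummies(df, columns=["size"], dtype=int)
print(encoded)
```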
> Even large language models (LLMs) do not work with strings: as we will see in future weeks when we cover LLMs, linguistic inputs must be converted to numerical arrays before large language models can get traction with them.
"get traction with them" doesn't seem right here. Perhaps "work with them?"
|  | ||
> Now imagine in that video of the room that on the desk there is a small jellyfish toy with a built-in light that cycles between deep violet and almost-black. The group of pixels that represent the jellyfish all brighten and darken together in their own violet rhythm, independently of the room's background illumination. This creates a localized cluster of highly correlated pixel changes that are not explained by the global brightness pattern. In a case like this, with pixels that fluctuate together independently of the background illumination, PCA would automatically identify the jellyfish pixel cluster as the *second* principal component.
There's an image attached here, not a video.
Also, not sure why this would be the "second" principal component.
> In this way, PCA acts like a very smart form of compression. Instead of throwing away random pixels or selecting every third column of the image, it builds new features that preserve as much of the original information as possible based on which pixels are correlated with each other in a dataset.
> PCA offers a way to reconstruct the original dataset from these compressed features. In the jellyfish room example, PCA would let us represent each frame of the video using just two numbers: one for the overall background brightness level (the first principal component) and one for the brightness of the jellyfish toy (the second principal component). This is a huge reduction from millions of pixel values to just two numbers, while still capturing the essential content of each frame.
Yeah, not a fan of this example unfortunately. Perhaps it's just best to explain using the demo.
My main issue is with PCA reducing the entire image to two values with certainty on what those values represent
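For what it's worth, the compress-and-reconstruct idea can be shown without the jellyfish framing. A minimal sketch with scikit-learn on synthetic data (the choice of two components is purely illustrative; in general you inspect `explained_variance_ratio_` rather than assume in advance what each component represents):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic "frames": 200 samples of 1,000 features whose variance mostly
# lies along two hidden directions, plus a little noise.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 1_000))
frames = latent @ mixing + 0.01 * rng.normal(size=(200, 1_000))

pca = PCA(n_components=2)
codes = pca.fit_transform(frames)             # each frame compressed to 2 numbers
reconstructed = pca.inverse_transform(codes)  # approximate original 1,000 features

print(codes.shape)                            # (200, 2)
print(pca.explained_variance_ratio_.sum())    # variance kept by the 2 components
print(np.abs(frames - reconstructed).max())   # reconstruction error is small here
```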
> The mean face is the average of all faces in the dataset. You can think of the eigenfaces as basic building blocks for constructing individual faces. PC1 is the eigenface that captures the most correlated activity among the pixels, PC2 the second most, and so on. Each eigenface shows the discovered pattern of correlated pixel intensities. Red regions mean "add brightness to the mean" when you move in the direction of that component, and blue regions mean "subtract brightness here".
This contradicts what was mentioned in the previous example. Each principal component is essentially an image of the same shape as the images in the dataset, whereas earlier it says PCA can reduce an entire image to two numbers.
"PC1 is the eigenface that captures the most correlated features among the pixels"
"Red regions mean "add brightness to the mean" when you move in the direction of that component, and blue regions mean "subtract brightness here"." Direction of the component is a bit confusing here. Maybe remove?
> ```python
> fig, axes = plt.subplots(len(components_list), len(rand_indices), figsize=(10, 7))
>
> for i, n in enumerate(components_list):
> ```
Code needs some more comments explaining the different steps
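As one possible shape for those comments, here is a guess at the surrounding code based only on the two quoted lines (`components_list`, `rand_indices`, and the reconstruction-grid idea are assumptions; adjust to whatever the lesson's code actually does), written with the kind of step-by-step comments that would help:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA

faces = fetch_olivetti_faces()
X = faces.data                                 # (400, 4096) flattened 64x64 faces

components_list = [10, 50, 200]                # numbers of components to compare
rng = np.random.default_rng(0)
rand_indices = rng.choice(len(X), size=4, replace=False)  # which faces to show

# One row per choice of n_components, one column per sampled face.
fig, axes = plt.subplots(len(components_list), len(rand_indices), figsize=(10, 7))

for i, n in enumerate(components_list):
    # Fit PCA keeping n components, compress every face, then reconstruct.
    pca = PCA(n_components=n)
    reconstructed = pca.inverse_transform(pca.fit_transform(X))

    for j, idx in enumerate(rand_indices):
        # Reshape the reconstructed row back into a 64x64 image and plot it.
        axes[i, j].imshow(reconstructed[idx].reshape(64, 64), cmap="gray")
        axes[i, j].axis("off")

    # Label each row with how many components were kept.
    axes[i, 0].set_title(f"{n} components", fontsize=9, loc="left")

plt.tight_layout()
plt.show()
```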
Closes #58
A lot of this is review from Python 100; the one-hot encoding and PCA material is new content.