Applying buscar to CFReT Screen #56
base: main
Conversation
wli51 left a comment
LGTM! Great work! Had nitpicks and one major (but maybe out of scope) suggestion for treatment filtering.
| # These preprocessing steps ensure that all datasets are standardized, well-documented, and ready for comparative and integrative analyses.
|
| # In[1]:
| # In[2]:
bro your cell run numbers are off
| if shared_features is not None:
|     # Only select metadata and shared features
|     return profile_df.select(meta_cols + shared_features)
|     if not shared_contains_meta:
|         meta_cols, _ = split_meta_and_features(profile_df)
|         return profile_df.select(meta_cols + shared_features)
|
|     return profile_df.select(shared_features)
| return profile_df
Consider changing these nested ifs to guard clauses (I know it's only one nested layer, but the `if ... is not None:` / `if not ...:` pattern is already somewhat unreadable).
I think what you are trying to achieve here is:
```python
def load_profile(profile_path: pathlib.Path) -> pl.DataFrame:
    """Internal function to load a single profile file."""
    profile_df = pl.read_parquet(profile_path)
    print(f"Loaded profile {profile_path.name} with shape {profile_df.shape}")

    # If no shared features are provided, return the full DataFrame
    if shared_features is None:
        return profile_df

    # Shared features already include meta → directly select them
    if shared_contains_meta:
        return profile_df.select(shared_features)

    # Otherwise: extract meta and prepend them
    meta_cols, _ = split_meta_and_features(profile_df)
    return profile_df.select(meta_cols + shared_features)
```

| if all(isinstance(p, str) for p in profile_paths):
|     profile_paths = [pathlib.Path(p) for p in profile_paths]
I think this fails to do the conversion when one of the paths is a string but the others are pathlib Paths, which only matters if your downstream code depends on pathlib functionality (it doesn't seem like it does). I also believe that, at least on more recent Python versions, pathlib.Path() is simply a no-op on pathlib objects, so I think you can safely remove the if statement and just cast everything to pathlib.
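A minimal sketch of the unconditional cast (the example file names below are made up, not from the repo):

```python
import pathlib

# mixed input: some entries are str, others are already pathlib.Path objects
profile_paths = ["data/plate_1.parquet", pathlib.Path("data/plate_2.parquet")]

# pathlib.Path() returns an equivalent Path when given a Path, so an
# unconditional cast normalizes the whole list without any isinstance check
profile_paths = [pathlib.Path(p) for p in profile_paths]
```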
| on_emd = ot.emd2(weights_ref, weights_exp, on_M)
| off_emd = ot.emd2(weights_ref, weights_exp, off_M)
| # compute on and off emd scores with increased numItermax to avoid warnings
| on_emd = ot.emd2(weights_ref, weights_exp, on_M, numItermax=100000)
Maybe make this configurable via compute_earth_movers_distance and set the numItermax as a default?
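Rough sketch of what that could look like; the compute_earth_movers_distance signature here is assumed for illustration, not the actual one in this repo:

```python
import ot  # POT: Python Optimal Transport


def compute_earth_movers_distance(
    weights_ref, weights_exp, cost_matrix, num_iter_max: int = 100_000
) -> float:
    """Hypothetical wrapper exposing numItermax as a configurable default."""
    # ot.emd2 returns the optimal transport cost (EMD) for the given cost matrix
    return ot.emd2(weights_ref, weights_exp, cost_matrix, numItermax=num_iter_max)


# callers get the higher iteration cap by default but can still override it, e.g.
# on_emd = compute_earth_movers_distance(weights_ref, weights_exp, on_M)
```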
| if exp_cluster is None:
|     exp_cluster_population_df = exp_profiles.filter(
|         (pl.col(treatment_col) == treatment)
|         & (pl.col(cluster_col).is_null())
Maybe it is worth documenting why you would expect to get None exp_clusters, and why we continue the analysis here with profiles filtered only to the treatment level and not to a reference cluster? I am somewhat confused reading this.
| # ## Parameters
| #
| # Below are the parameters used for this notebook. The CFReT-screen dataset contains two hearts: **Healthy (Heart 7)** and **Failing (Heart 19)**, which has been diagnosed with dilated cardiomyopathy.
| #
| # DMSO Control Naming Convention
| #
| # To distinguish between control conditions from different heart sources, the `Metadata_treatment` column values are modified as follows:
| # - **Healthy controls** (Heart 7 + DMSO): `"DMSO_heart_7"`
| # - **Failing controls** (Heart 19 + DMSO): `"DMSO_heart_19"`
| #
| # Parameter Definitions:
| # - **`healthy_ref_treatment`**: Reference treatment name for healthy controls
| # - **`failing_ref_treatment`**: Reference treatment name for failing heart controls
| # - **`treatment_col`**: Column name containing treatment metadata
| # - **`cfret_screen_cluster_param_grid`**: Dictionary defining the hyperparameter search space for clustering optimization when assessing heterogeneity across treatments. Includes:
| #   - `cluster_resolution`: Granularity of clusters (float, 0.1–2.2)
| #   - `n_neighbors`: Number of neighbors for graph construction (int, 5–100)
| #   - `cluster_method`: Clustering algorithm (categorical: leiden)
| #   - `neighbor_distance_metric`: Distance metric for neighbor computation (categorical: euclidean, cosine, manhattan)
Great documentation!
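For anyone skimming, the grid described above might look roughly like this; the concrete values below are illustrative placeholders based only on the documented ranges, not the actual grid in the notebook:

```python
import numpy as np

# illustrative placeholder values spanning the documented ranges
cfret_screen_cluster_param_grid = {
    "cluster_resolution": np.round(np.arange(0.1, 2.3, 0.3), 1).tolist(),  # float, 0.1-2.2
    "n_neighbors": [5, 15, 30, 50, 100],                                   # int, 5-100
    "cluster_method": ["leiden"],
    "neighbor_distance_metric": ["euclidean", "cosine", "manhattan"],
}
```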
| # using numpy to calculate 10th percentile
| tenth_percentile = np.round(np.percentile(counts["count"], 10), 3)
The 10th percentile cell count threshold seems like a reasonable heuristic here, but it might not be the most robust or statistically rigorous approach. From the cell count distribution plot, it looks like the CFReT data has a somewhat bi-modal distribution of cell counts across treatment groups, where the first mode is towards the low end (presumably the highly toxic compounds) and the second mode is the actual valid mode.
If this kind of toxic + valid bi-modal count distribution is what we would generally expect for buscar, and we consistently want to drop the treatments around the low mode, it might be worth developing a dedicated count filtering step. If not, please disregard the rest of this comment. If so, I would recommend first fitting a GaussianMixture and falling back to the percentile or elbow method:
```python
import numpy as np
from sklearn.mixture import GaussianMixture

# GaussianMixture expects a 2D array; counts["count"] holds per-treatment cell counts
counts = np.asarray(counts["count"]).reshape(-1, 1)

# bimodal fit
gm2 = GaussianMixture(n_components=2).fit(counts)
# also fit a unimodal model as a baseline
gm1 = GaussianMixture(n_components=1).fit(counts)

# analyze how the two fits compare
# uses the Bayesian Information Criterion here (Akaike Information Criterion could also work)
if gm2.bic(counts) >= gm1.bic(counts):
    # model with 2 components is not better than 1
    # fall back to percentile or elbow
    ...
else:
    # optional separation check as a t/z test
    mean_diff = abs(gm2.means_[0] - gm2.means_[1])[0]
    pooled_sd = np.sqrt(gm2.covariances_.mean())
    if mean_diff < 1.0 * pooled_sd:
        # fall back
        ...
    # keep only treatments assigned to the higher-valued component
    labels = gm2.predict(counts)
    high_component = gm2.means_.argmax()
    keep_mask = labels == high_component
```

See here for extra gmm model selection tips.
Let me know your thoughts on this. If you actually decide to adopt this, it is probably out of scope of this PR and worth doing in a separate follow-up PR.
This PR updates the CFReT screen analysis by incorporating the new clustering implementation.