
Conversation

@axiomcura (Member)

This PR updates the CFReT screen analysis by incorporating the new clustering implementation.

@review-notebook-app

Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter notebooks.

@axiomcura axiomcura marked this pull request as ready for review November 24, 2025 20:38
@axiomcura axiomcura requested a review from wli51 November 24, 2025 20:54
@wli51 left a comment (Contributor)

LGTM! Great work! Had nitpicks and one major (but maybe out of scope) suggestion for treatment filtering.

# These preprocessing steps ensure that all datasets are standardized, well-documented, and ready for comparative and integrative analyses.

# In[1]:
# In[2]:

bro your cell run numbers are off

Comment on lines 90 to 96

if shared_features is not None:
    # Only select metadata and shared features
    if not shared_contains_meta:
        meta_cols, _ = split_meta_and_features(profile_df)
        return profile_df.select(meta_cols + shared_features)
    return profile_df.select(shared_features)
return profile_df

Consider changing these nested ifs to guard clauses (I know it's only one nested layer, but the `if ... is not None: ... if not ...:` pattern is already somewhat unreadable).

I think what you are trying to achieve here is:

def load_profile(profile_path: pathlib.Path) -> pl.DataFrame:
    """Internal function to load a single profile file."""

    profile_df = pl.read_parquet(profile_path)
    print(f"Loaded profile {profile_path.name} with shape {profile_df.shape}")

    # If no shared features are provided, return the full DataFrame
    if shared_features is None:
        return profile_df

    # Shared features already include meta → directly select them
    if shared_contains_meta:
        return profile_df.select(shared_features)

    # Otherwise: extract meta and prepend them
    meta_cols, _ = split_meta_and_features(profile_df)
    return profile_df.select(meta_cols + shared_features)

Comment on lines +207 to +208

if all(isinstance(p, str) for p in profile_paths):
    profile_paths = [pathlib.Path(p) for p in profile_paths]

I think this fails to do the conversion when one of the paths is a string but the others are pathlib Paths, which matters if your downstream code depends on pathlib functions (it doesn't seem to). I also believe that, at least in recent Python versions, pathlib.Path() is simply a no-op on pathlib objects, so I think you can safely remove the if statement and just cast everything to pathlib.
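
Something like this (a minimal sketch of the unconditional cast; assumes profile_paths is a flat list of str and/or Path values):

import pathlib

# Path(p) is effectively a no-op when p is already a Path,
# so no isinstance check is needed before casting
profile_paths = [pathlib.Path(p) for p in profile_paths]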

- on_emd = ot.emd2(weights_ref, weights_exp, on_M)
- off_emd = ot.emd2(weights_ref, weights_exp, off_M)
+ # compute on and off emd scores with increased numItermax to avoid warnings
+ on_emd = ot.emd2(weights_ref, weights_exp, on_M, numItermax=100000)

maybe make this configurable via compute_earth_movers_distance and set the numItermax value as a default parameter?
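
Something along these lines (a sketch only; `compute_earth_movers_distance` is the function named above, but its exact signature here is an assumption):

import ot

def compute_earth_movers_distance(
    weights_ref, weights_exp, cost_matrix, num_iter_max: int = 100_000
) -> float:
    """Wraps ot.emd2, exposing the solver's iteration cap as a keyword default."""
    # forwarding numItermax lets callers raise the cap without editing call sites
    return ot.emd2(weights_ref, weights_exp, cost_matrix, numItermax=num_iter_max)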

if exp_cluster is None:
    exp_cluster_population_df = exp_profiles.filter(
        (pl.col(treatment_col) == treatment)
        & (pl.col(cluster_col).is_null())

maybe it is worth documenting why you would expect to get None exp_clusters, and why the analysis continues with profiles filtered to only the treatment level rather than the reference cluster here? I am somewhat confused reading this.

Comment on lines +33 to +51
# ## Parameters
#
# Below are the parameters used for this notebook. The CFReT-screen dataset contains two hearts: **Healthy (Heart 7)** and **Failing (Heart 19)**, which has been diagnosed with dilated cardiomyopathy.
#
# DMSO Control Naming Convention
#
# To distinguish between control conditions from different heart sources, the `Metadata_treatment` column values are modified as follows:
# - **Healthy controls** (Heart 7 + DMSO): `"DMSO_heart_7"`
# - **Failing controls** (Heart 19 + DMSO): `"DMSO_heart_19"`
#
# Parameter Definitions:
# - **`healthy_ref_treatment`**: Reference treatment name for healthy controls
# - **`failing_ref_treatment`**: Reference treatment name for failing heart controls
# - **`treatment_col`**: Column name containing treatment metadata
# - **`cfret_screen_cluster_param_grid`**: Dictionary defining the hyperparameter search space for clustering optimization when assessing heterogeneity across treatments. Includes:
# - `cluster_resolution`: Granularity of clusters (float, 0.1–2.2)
# - `n_neighbors`: Number of neighbors for graph construction (int, 5–100)
# - `cluster_method`: Clustering algorithm (categorical: leiden)
# - `neighbor_distance_metric`: Distance metric for neighbor computation (categorical: euclidean, cosine, manhattan)
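
For illustration, a hypothetical sketch of such a grid (key names follow the definitions above; the exact values and step sizes are assumptions, not the notebook's actual settings):

import numpy as np

cfret_screen_cluster_param_grid = {
    # granularity of clusters (float, 0.1–2.2)
    "cluster_resolution": np.round(np.arange(0.1, 2.3, 0.1), 1).tolist(),
    # number of neighbors for graph construction (int, 5–100)
    "n_neighbors": [5, 10, 25, 50, 100],
    # clustering algorithm
    "cluster_method": ["leiden"],
    # distance metric for neighbor computation
    "neighbor_distance_metric": ["euclidean", "cosine", "manhattan"],
}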

great documentation!



# using numpy to calculate 10th percentile
tenth_percentile = np.round(np.percentile(counts["count"], 10), 3)

The 10th-percentile cell count threshold seems like a reasonable heuristic here, but it might not be the most robust or statistically rigorous approach. From the cell count distribution plot, it looks like the CFReT data has a sort of bi-modal distribution of cell counts across treatment groups, where the first mode is towards the low end (presumably the highly toxic compounds) and the second mode is the actual valid mode.

If this kind of toxic + valid bi-modal count distribution is what we would generally expect for buscar, and we consistently want to drop the treatments around the low mode, it might be worth developing a dedicated count-filtering step. If not, please disregard the rest of this comment. If so, I would recommend first trying to fit a GaussianMixture and falling back to the percentile or elbow method:

import numpy as np
from sklearn.mixture import GaussianMixture

# GaussianMixture expects a 2D array of shape (n_samples, n_features)
X = counts["count"].to_numpy().reshape(-1, 1)

# bimodal fit
gm2 = GaussianMixture(n_components=2).fit(X)

# also want to fit a unimodal model as a baseline
gm1 = GaussianMixture(n_components=1).fit(X)

# analyze how the two fits compare
# uses the Bayesian Information Criterion here (Akaike Information Criterion could also work)
if gm2.bic(X) >= gm1.bic(X):
    # the 2-component model is not better than 1 component
    # fall back to percentile or elbow
    ...
else:
    # optional separation check as a t/z test
    mean_diff = abs(gm2.means_[0] - gm2.means_[1])[0]
    pooled_sd = np.sqrt(gm2.covariances_.mean())
    if mean_diff < 1.0 * pooled_sd:
        # components overlap too much: fall back
        ...
    # keep only treatments assigned to the higher-valued component
    labels = gm2.predict(X)
    high_component = gm2.means_.argmax()
    keep_mask = labels == high_component

See here for extra GMM model-selection tips.

Let me know your thoughts on this. If you do decide to adopt it, it is probably out of scope for this PR and worth doing in a separate one.
