
Conversation

@axiomcura (Member)

This PR updates the CFReT screen analysis by incorporating the new clustering implementation.

@review-notebook-app

Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter notebooks.

@axiomcura axiomcura marked this pull request as ready for review November 24, 2025 20:38
@axiomcura axiomcura requested a review from wli51 November 24, 2025 20:54
@wli51 left a comment (Contributor)

LGTM! Great work! Had nitpicks and one major (but maybe out of scope) suggestion for treatment filtering.

# These preprocessing steps ensure that all datasets are standardized, well-documented, and ready for comparative and integrative analyses.

# In[1]:
# In[2]:

bro your cell run numbers are off

Comment on lines 90 to 96

if shared_features is not None:
    # Only select metadata and shared features
    if not shared_contains_meta:
        meta_cols, _ = split_meta_and_features(profile_df)
        return profile_df.select(meta_cols + shared_features)
    return profile_df.select(shared_features)
return profile_df

Consider changing these nested ifs to guard clauses (I know it's only one nested layer, but the `if ... is not None: ... if not ...:` pattern is already somewhat unreadable).

I think what you are trying to achieve here is:

def load_profile(profile_path: pathlib.Path) -> pl.DataFrame:
    """Internal function to load a single profile file."""

    profile_df = pl.read_parquet(profile_path)
    print(f"Loaded profile {profile_path.name} with shape {profile_df.shape}")

    # If no shared features are provided, return the full DataFrame
    if shared_features is None:
        return profile_df

    # Shared features already include meta → directly select them
    if shared_contains_meta:
        return profile_df.select(shared_features)

    # Otherwise: extract meta and prepend them
    meta_cols, _ = split_meta_and_features(profile_df)
    return profile_df.select(meta_cols + shared_features)

Comment on lines +207 to +208

if all(isinstance(p, str) for p in profile_paths):
    profile_paths = [pathlib.Path(p) for p in profile_paths]

I think this fails to do the conversion when one of the paths is a string but the others are pathlib Paths, which matters if your downstream code depends on pathlib functions (it doesn't seem to). I also believe that, at least in recent Python versions, pathlib.Path() is simply a no-op on pathlib objects, so I think you can safely remove the if statement and just cast everything to pathlib.
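
Something like this (a minimal sketch of the unconditional cast; assumes profile_paths is a flat list of str and/or Path values):

import pathlib

# Path(p) is effectively a no-op when p is already a Path,
# so no isinstance check is needed before casting
profile_paths = [pathlib.Path(p) for p in profile_paths]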

- on_emd = ot.emd2(weights_ref, weights_exp, on_M)
- off_emd = ot.emd2(weights_ref, weights_exp, off_M)
+ # compute on and off emd scores with increased numItermax to avoid warnings
+ on_emd = ot.emd2(weights_ref, weights_exp, on_M, numItermax=100000)

maybe make this configurable via compute_earth_movers_distance and set the numItermax value as a default parameter?
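
Something along these lines (a sketch only; `compute_earth_movers_distance` is the function named above, but its exact signature here is an assumption):

import ot

def compute_earth_movers_distance(
    weights_ref, weights_exp, cost_matrix, num_iter_max: int = 100_000
) -> float:
    """Wraps ot.emd2, exposing the solver's iteration cap as a keyword default."""
    # forwarding numItermax lets callers raise the cap without editing call sites
    return ot.emd2(weights_ref, weights_exp, cost_matrix, numItermax=num_iter_max)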

if exp_cluster is None:
    exp_cluster_population_df = exp_profiles.filter(
        (pl.col(treatment_col) == treatment)
        & (pl.col(cluster_col).is_null())

maybe it is worth documenting why you would expect to get None exp_clusters, and why the analysis continues with profiles filtered to only the treatment level rather than the reference cluster here? I am somewhat confused reading this.

Comment on lines +33 to +51
# ## Parameters
#
# Below are the parameters used for this notebook. The CFReT-screen dataset contains two hearts: **Healthy (Heart 7)** and **Failing (Heart 19)**, which has been diagnosed with dilated cardiomyopathy.
#
# DMSO Control Naming Convention
#
# To distinguish between control conditions from different heart sources, the `Metadata_treatment` column values are modified as follows:
# - **Healthy controls** (Heart 7 + DMSO): `"DMSO_heart_7"`
# - **Failing controls** (Heart 19 + DMSO): `"DMSO_heart_19"`
#
# Parameter Definitions:
# - **`healthy_ref_treatment`**: Reference treatment name for healthy controls
# - **`failing_ref_treatment`**: Reference treatment name for failing heart controls
# - **`treatment_col`**: Column name containing treatment metadata
# - **`cfret_screen_cluster_param_grid`**: Dictionary defining the hyperparameter search space for clustering optimization when assessing heterogeneity across treatments. Includes:
# - `cluster_resolution`: Granularity of clusters (float, 0.1–2.2)
# - `n_neighbors`: Number of neighbors for graph construction (int, 5–100)
# - `cluster_method`: Clustering algorithm (categorical: leiden)
# - `neighbor_distance_metric`: Distance metric for neighbor computation (categorical: euclidean, cosine, manhattan)
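
For illustration, a hypothetical sketch of such a grid (key names follow the definitions above; the exact values and step sizes are assumptions, not the notebook's actual settings):

import numpy as np

cfret_screen_cluster_param_grid = {
    # granularity of clusters (float, 0.1–2.2)
    "cluster_resolution": np.round(np.arange(0.1, 2.3, 0.1), 1).tolist(),
    # number of neighbors for graph construction (int, 5–100)
    "n_neighbors": [5, 10, 25, 50, 100],
    # clustering algorithm
    "cluster_method": ["leiden"],
    # distance metric for neighbor computation
    "neighbor_distance_metric": ["euclidean", "cosine", "manhattan"],
}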

great documentation!



# using numpy to calculate 10th percentile
tenth_percentile = np.round(np.percentile(counts["count"], 10), 3)

The 10th-percentile cell count threshold seems like a reasonable heuristic here, but it might not be the most robust or statistically rigorous approach. From the cell count distribution plot, it looks like the CFReT data has a sort of bi-modal distribution of cell counts across treatment groups, where the first mode is towards the low end (presumably the highly toxic compounds) and the second mode is the actual valid mode.

If this kind of toxic + valid bi-modal count distribution is what we would generally expect for buscar, and we consistently want to drop the treatments around the low mode, it might be worth developing a dedicated count-filtering step. If not, please disregard the rest of this comment. If so, I would recommend first trying to fit a GaussianMixture and falling back to the percentile or elbow method:

import numpy as np
from sklearn.mixture import GaussianMixture

# GaussianMixture expects a 2D array of shape (n_samples, n_features)
X = counts["count"].to_numpy().reshape(-1, 1)

# bimodal fit
gm2 = GaussianMixture(n_components=2).fit(X)

# also want to fit a unimodal model as a baseline
gm1 = GaussianMixture(n_components=1).fit(X)

# analyze how the two fits compare
# uses the Bayesian Information Criterion here (Akaike Information Criterion could also work)
if gm2.bic(X) >= gm1.bic(X):
    # the 2-component model is not better than 1 component
    # fall back to percentile or elbow
    ...
else:
    # optional separation check as a t/z test
    mean_diff = abs(gm2.means_[0] - gm2.means_[1])[0]
    pooled_sd = np.sqrt(gm2.covariances_.mean())
    if mean_diff < 1.0 * pooled_sd:
        # components overlap too much: fall back
        ...
    # keep only treatments assigned to the higher-valued component
    labels = gm2.predict(X)
    high_component = gm2.means_.argmax()
    keep_mask = labels == high_component

See here for extra GMM model-selection tips.

Let me know your thoughts on this. If you do decide to adopt it, it is probably out of scope for this PR and worth doing in a separate one.
