Skip to content

Conversation

@Chloe-Thangavelu
Copy link
Contributor

@Chloe-Thangavelu Chloe-Thangavelu commented May 29, 2025

Summary:
Modified downsample_cells function to accept anndata.AnnData objects as input. When an AnnData object is provided, .X and .obs data are combined into a pandas DataFrame, before applying the rest of the downsampling function.

Changes:

  • Added code to convert anndata objects into pandas dataframes
  • Returns error message if input is neither an anndata object or pandas dataframe
  • Created unit test to check anndata object is correctly processes

Modified downsample_cells function to accept anndata.AnnData objects as input. When an AnnData object is provided, .X` and .obs data are combined into a pandas DataFrame, before applying the rest of the downsampling function.
@georgezakinih georgezakinih requested a review from fangliu117 May 29, 2025 18:12
This commit modifies the 'downsample_cells' function and adds a  helper function '_get_downsampled_indices' to provide cell downsampling capabilities for both AnnData objects and Pandas DataFrames.
@Chloe-Thangavelu Chloe-Thangavelu changed the title feat(downsample):AnnData input for downsample_cells feat(downsample):AnnData input and output for downsample_cells Jun 13, 2025
This test now checks anndata objects are accepted, downsampled correctly, and returned as annadata objects.
Reordering the code to convert annotations to a list before extracting annotation information, ensuring it is in DataFrame format as required by subsequent downsample_cells code.
@Chloe-Thangavelu
Copy link
Contributor Author

Chloe-Thangavelu commented Jun 13, 2025

Summary:
Modified downsample_cells function to accept and return anndata.AnnData objects. When an AnnData object or pandas dataframe is provided, annotation information is extracted and used in a helper function to determine which cell indexes to keep.

Changes:

  • Added code to handle two different input data types: annData object or pandas dataframe
  • Placed downsampling logic into a helper function that returns which cell ID indexes to keep.
  • The input data are subset according to these returned indexes in the main function.
  • Created unit test to check anndata objects are correctly processed and returned

@fangliu117 fangliu117 requested a review from Copilot June 24, 2025 02:55
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR updates the downsample_cells function to support input as an AnnData object, converting its .X and .obs into a DataFrame for downsampling, while maintaining compatibility with pandas DataFrames.

  • Added conversion logic for AnnData input
  • Updated error handling and documentation for input types
  • Introduced a new unit test to validate AnnData processing

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
tests/test_data_utils/test_downsample_cells.py Added a new unit test to ensure downsample_cells correctly processes AnnData objects
src/spac/data_utils.py Refactored downsample_cells logic to handle both pandas DataFrame and AnnData inputs, updating docstrings and internal variable usage

Comment on lines 704 to 707
logging.basicConfig(level=logging.WARNING)
# Convert annotations to list if it's a string
if isinstance(annotations, str):
annotations = [annotations]

# Check if the columns to downsample on exist
missing_columns = [
col for col in annotations if col not in input_data.columns
]
if missing_columns:
raise ValueError(
f"Columns {missing_columns} do not exist in the dataframe"
)

# If n_samples is None, return the input data without processing
if n_samples is None:
return input_data.copy()

# Combine annotations into a single column if multiple annotations
if len(annotations) > 1:
Copy link

Copilot AI Jun 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The repeated call to logging.basicConfig in both downsample_cells and _get_downsampled_indexes may cause configuration conflicts; consider configuring logging once at application startup.

Suggested change
logging.basicConfig(level=logging.WARNING)
# Convert annotations to list if it's a string
if isinstance(annotations, str):
annotations = [annotations]
# Check if the columns to downsample on exist
missing_columns = [
col for col in annotations if col not in input_data.columns
]
if missing_columns:
raise ValueError(
f"Columns {missing_columns} do not exist in the dataframe"
)
# If n_samples is None, return the input data without processing
if n_samples is None:
return input_data.copy()
# Combine annotations into a single column if multiple annotations
if len(annotations) > 1:
# Combine annotations into a single column if multiple annotations
if len(annotations) > 1:

Copilot uses AI. Check for mistakes.
Comment on lines +629 to +631
else:
raise TypeError("Input data must be a Pandas DataFrame or Anndata Object.")

Copy link

Copilot AI Jun 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency and clarity, update the error message to refer to 'AnnData' (with proper casing) instead of 'Anndata'.

Suggested change
else:
raise TypeError("Input data must be a Pandas DataFrame or Anndata Object.")
else:
raise TypeError("Input data must be a Pandas DataFrame or AnnData Object.")

Copilot uses AI. Check for mistakes.
Otherwise, choose the first n cells.
combined_col_name : str, default='_combined_'
Name of the column that will store combined values when multiple
annotation columns are provided.
Copy link
Collaborator

@fangliu117 fangliu117 Jun 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "combined_col_name" parameter is documented but never used in the code.
Since grouping_col is a pd.Series and not a new column in the cell_data DataFrame, the combined_col_name parameter isn't strictly necessary. May either remove or assign the name to the Series.

if len(annotations) > 1:
input_data[combined_col_name] = input_data[annotations].apply(
grouping_col = cell_data[annotations].apply(
lambda row: '_'.join(row.values.astype(str)), axis=1)
Copy link
Collaborator

@fangliu117 fangliu117 Jun 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The apply method for combining annotations is readable but can be slow on very large datasets. A more performant, vectorized approach is to use str.cat or agg.

grouping_col = cell_data[annotations].astype(str).agg('_'.join, axis=1)
(Ensure all columns are string type first)

lambda row: '_'.join(row.values.astype(str)), axis=1)
grouping_col = combined_col_name
else:
grouping_col = annotations[0]
Copy link
Collaborator

@fangliu117 fangliu117 Jun 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable grouping_col is used inconsistently.
Suggested:
if len(annotations) > 1:
cell_data[combined_col_name] = cell_data[annotations].apply(
lambda row: '_'.join(row.values.astype(str)), axis=1)
grouping_col = combined_col_name
else:
grouping_col = annotations[0] # This is a string, not a Series

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, I will go through & add these suggestions today

combined_col_name= '_combined_',
min_threshold= 5
)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test calls the function with stratify=False and min_threshold=5. In the downsample_cells function, the min_threshold parameter is only used when stratify=True.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes should be complete now. I have tested the code and it looks like its working. Let me know if any other fixes need to be made.

When multiple annotations are provided, a new temporary column (named by `combined_col_name`) is now explicitly added to the `cell_data` DataFrame, making `grouping_col` consistently a column name (string).
Replaced slow `DataFrame.apply` with the vectorized `DataFrame.astype(str).agg('_'.join, axis=1)` for combining annotations
@Chloe-Thangavelu Chloe-Thangavelu force-pushed the feature/add-downsample-anndata-compatibility branch from bb4b7a3 to 1915dca Compare June 26, 2025 03:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants