feat(downsample):AnnData input and output for downsample_cells by Chloe-Thangavelu · Pull Request #349 · FNLCR-DMAP/SCSAWorkflow

Chloe-Thangavelu · 2025-05-29T04:18:24Z

Summary:
Modified downsample_cells function to accept anndata.AnnData objects as input. When an AnnData object is provided, .X and .obs data are combined into a pandas DataFrame, before applying the rest of the downsampling function.

Changes:

Added code to convert AnnData objects into pandas DataFrames
Returns an error message if the input is neither an AnnData object nor a pandas DataFrame
Created a unit test to check that AnnData objects are processed correctly
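
A minimal sketch of the conversion described in this summary, assuming a dense expression matrix. The function name `combine_expression_and_annotations` and the toy data are hypothetical and not taken from the PR; with a real AnnData object the three inputs would be `adata.X`, `adata.var_names`, and `adata.obs`.

```python
import numpy as np
import pandas as pd


def combine_expression_and_annotations(X, var_names, obs):
    """Join a dense cells-by-features matrix with per-cell annotations."""
    features = pd.DataFrame(
        np.asarray(X), index=obs.index, columns=list(var_names)
    )
    # Both frames share the cell index, so a column-wise concat aligns rows.
    return pd.concat([features, obs], axis=1)


# Toy stand-ins for adata.X, adata.var_names and adata.obs (illustrative only).
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
var_names = ["CD3", "CD8"]
obs = pd.DataFrame(
    {"phenotype": ["T", "B", "T"]},
    index=["cell_0", "cell_1", "cell_2"],
)

combined = combine_expression_and_annotations(X, var_names, obs)
```

A sparse `.X` would need to be densified first, which this sketch does not handle.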

Modified downsample_cells function to accept anndata.AnnData objects as input. When an AnnData object is provided, .X and .obs data are combined into a pandas DataFrame, before applying the rest of the downsampling function.

This commit modifies the 'downsample_cells' function and adds a helper function '_get_downsampled_indices' to provide cell downsampling capabilities for both AnnData objects and Pandas DataFrames.

This test now checks that AnnData objects are accepted, downsampled correctly, and returned as AnnData objects.

Reordering the code to convert annotations to a list before extracting annotation information, ensuring it is in DataFrame format as required by subsequent downsample_cells code.

Chloe-Thangavelu · 2025-06-13T05:33:41Z

Summary:
Modified downsample_cells function to accept and return anndata.AnnData objects. When an AnnData object or pandas dataframe is provided, annotation information is extracted and used in a helper function to determine which cell indexes to keep.

Changes:

Added code to handle two different input data types: AnnData object or pandas DataFrame
Placed downsampling logic into a helper function that returns which cell ID indexes to keep.
The input data are subset according to these returned indexes in the main function.
Created unit test to check anndata objects are correctly processed and returned
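
The split between a helper that returns indexes and a main function that subsets the data could look roughly like the sketch below. It is not the PR's `_get_downsampled_indices`: the name `pick_indices_to_keep`, the `seed` argument, and the simple per-group cap are assumptions made for illustration.

```python
import pandas as pd


def pick_indices_to_keep(annotation_values, n_samples, seed=0):
    """Return index labels of up to n_samples cells per annotation group."""
    kept = []
    # Group the annotation Series by its own values (one group per label).
    for _, group in annotation_values.groupby(annotation_values.to_numpy()):
        take = min(n_samples, len(group))
        kept.extend(group.sample(n=take, random_state=seed).index)
    return pd.Index(kept)


cells = pd.DataFrame(
    {
        "CD3": [0.1, 0.4, 0.9, 0.3, 0.7],
        "phenotype": ["T", "T", "T", "B", "B"],
    },
    index=[f"cell_{i}" for i in range(5)],
)

keep = pick_indices_to_keep(cells["phenotype"], n_samples=2)

# The caller subsets its own container with the returned labels.
downsampled = cells.loc[keep]
```

Because the helper only sees annotation values and returns labels, the same labels could in principle be used to subset an AnnData object by its observation names, so the input type is preserved on return.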

Copilot

Pull Request Overview

This PR updates the downsample_cells function to support input as an AnnData object, converting its .X and .obs into a DataFrame for downsampling, while maintaining compatibility with pandas DataFrames.

Added conversion logic for AnnData input
Updated error handling and documentation for input types
Introduced a new unit test to validate AnnData processing

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
tests/test_data_utils/test_downsample_cells.py	Added a new unit test to ensure downsample_cells correctly processes AnnData objects
src/spac/data_utils.py	Refactored downsample_cells logic to handle both pandas DataFrame and AnnData inputs, updating docstrings and internal variable usage

Copilot · 2025-06-24T02:56:06Z

src/spac/data_utils.py

    logging.basicConfig(level=logging.WARNING)
-    # Convert annotations to list if it's a string
-    if isinstance(annotations, str):
-        annotations = [annotations]
-
-    # Check if the columns to downsample on exist
-    missing_columns = [
-        col for col in annotations if col not in input_data.columns
-    ]
-    if missing_columns:
-        raise ValueError(
-            f"Columns {missing_columns} do not exist in the dataframe"
-        )
-
-    # If n_samples is None, return the input data without processing
-    if n_samples is None:
-        return input_data.copy()

    # Combine annotations into a single column if multiple annotations
    if len(annotations) > 1:


The repeated call to logging.basicConfig in both downsample_cells and _get_downsampled_indexes may cause configuration conflicts; consider configuring logging once at application startup.

Suggested change

-    logging.basicConfig(level=logging.WARNING)
-    # Convert annotations to list if it's a string
-    if isinstance(annotations, str):
-        annotations = [annotations]
-
-    # Check if the columns to downsample on exist
-    missing_columns = [
-        col for col in annotations if col not in input_data.columns
-    ]
-    if missing_columns:
-        raise ValueError(
-            f"Columns {missing_columns} do not exist in the dataframe"
-        )
-
-    # If n_samples is None, return the input data without processing
-    if n_samples is None:
-        return input_data.copy()
-
-    # Combine annotations into a single column if multiple annotations
-    if len(annotations) > 1:
+    # Combine annotations into a single column if multiple annotations
+    if len(annotations) > 1:
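
For context on the logging comment: `logging.basicConfig` does nothing when the root logger already has handlers (unless `force=True` is passed), so repeated calls inside library functions are at best redundant. A common pattern, sketched here with a hypothetical logger name and helper function, is to create a module-level logger in the library and configure logging once in the application entry point.

```python
import logging

# Library module: create a named logger, but do not configure handlers here.
logger = logging.getLogger("spac_example.data_utils")  # hypothetical name


def warn_small_group(group, size):
    """Emit a warning through the module logger and return the size."""
    logger.warning("Group %s has only %d cells", group, size)
    return size


def main():
    # Application entry point: configure logging exactly once.
    logging.basicConfig(level=logging.WARNING)
    warn_small_group("B", 3)


if __name__ == "__main__":
    main()
```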

Copilot · 2025-06-24T02:56:06Z

src/spac/data_utils.py

+    else:
+        raise TypeError("Input data must be a Pandas DataFrame or Anndata Object.")
+


For consistency and clarity, update the error message to refer to 'AnnData' (with proper casing) instead of 'Anndata'.

Suggested change

-    else:
-        raise TypeError("Input data must be a Pandas DataFrame or Anndata Object.")
+    else:
+        raise TypeError("Input data must be a Pandas DataFrame or AnnData Object.")

fangliu117 · 2025-06-24T07:18:10Z

src/spac/data_utils.py

@@ -586,62 +691,32 @@ def downsample_cells(input_data, annotations, n_samples=None, stratify=False,
        annotation columns are provided.


The "combined_col_name" parameter is documented but never used in the code.
Since grouping_col is a pd.Series and not a new column in the cell_data DataFrame, the combined_col_name parameter isn't strictly necessary. Either remove the parameter or assign the name to the Series.
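
One way to read the second option (assigning the name to the Series) is sketched below with toy data. The column names and values are made up for illustration; only `combined_col_name`, `cell_data`, and `grouping_col` come from the discussion.

```python
import pandas as pd

cell_data = pd.DataFrame(
    {"phenotype": ["T", "B"], "region": ["core", "edge"]}
)
combined_col_name = "_combined_"

# Build the combined labels, then attach the requested name to the Series.
# cell_data itself is left unchanged.
grouping_col = (
    cell_data[["phenotype", "region"]]
    .astype(str)
    .agg("_".join, axis=1)
    .rename(combined_col_name)
)
```

With this variant the parameter is actually used, and the input DataFrame does not gain an extra column.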

fangliu117 · 2025-06-24T07:23:03Z

src/spac/data_utils.py

    if len(annotations) > 1:
-        input_data[combined_col_name] = input_data[annotations].apply(
+        grouping_col = cell_data[annotations].apply(
            lambda row: '_'.join(row.values.astype(str)), axis=1)


The apply method for combining annotations is readable but can be slow on very large datasets. A more performant, vectorized approach is to use str.cat or agg.

grouping_col = cell_data[annotations].astype(str).agg('_'.join, axis=1)
(Ensure all columns are string type first)
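
A small sketch comparing the three ways of combining annotation columns mentioned here, on toy data (the `slide` column is made up). It only checks that the results agree; which form is fastest depends on the pandas version and the data, so it is worth timing on a realistic dataset before switching.

```python
import pandas as pd

cell_data = pd.DataFrame(
    {"phenotype": ["T", "B", "T"], "slide": [1, 2, 1]}
)
annotations = ["phenotype", "slide"]

# Cast once so non-string columns (here: slide) can be joined as text.
as_text = cell_data[annotations].astype(str)

# Original row-wise approach from the diff.
via_apply = cell_data[annotations].apply(
    lambda row: "_".join(row.values.astype(str)), axis=1
)

# Reviewer's suggestion.
via_agg = as_text.agg("_".join, axis=1)

# Column-wise alternative: concatenate the first column with the rest.
via_str_cat = as_text[annotations[0]].str.cat(
    as_text[annotations[1:]], sep="_"
)
```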

fangliu117 · 2025-06-24T07:30:02Z

src/spac/data_utils.py

            lambda row: '_'.join(row.values.astype(str)), axis=1)
-        grouping_col = combined_col_name
    else:
        grouping_col = annotations[0]


The variable grouping_col is used inconsistently.
Suggested:
    if len(annotations) > 1:
        cell_data[combined_col_name] = cell_data[annotations].apply(
            lambda row: '_'.join(row.values.astype(str)), axis=1)
        grouping_col = combined_col_name
    else:
        grouping_col = annotations[0]  # This is a string, not a Series

Thank you, I will go through and add these suggestions today.

fangliu117 · 2025-06-24T07:51:57Z

tests/test_data_utils/test_downsample_cells.py

+            combined_col_name= '_combined_',
+            min_threshold= 5
+        )
+


The test calls the function with stratify=False and min_threshold=5. In the downsample_cells function, the min_threshold parameter is only used when stratify=True.

The changes should be complete now. I have tested the code and it looks like it's working. Let me know if any other fixes need to be made.

When multiple annotations are provided, a new temporary column (named by `combined_col_name`) is now explicitly added to the `cell_data` DataFrame, making `grouping_col` consistently a column name (string). Replaced the slow `DataFrame.apply` with `DataFrame.astype(str).agg('_'.join, axis=1)` for combining annotations.

feat(downsample):anndata.AnnData input for downsample_cells

520934e

Modified downsample_cells function to accept anndata.AnnData objects as input. When an AnnData object is provided, .X and .obs data are combined into a pandas DataFrame, before applying the rest of the downsampling function.

georgezakinih requested a review from fangliu117 May 29, 2025 18:12

feat(downsample):AnnData input and output for downsample_cells

507ef7f

This commit modifies the 'downsample_cells' function and adds a helper function '_get_downsampled_indices' to provide cell downsampling capabilities for both AnnData objects and Pandas DataFrames.

Chloe-Thangavelu changed the title ~~feat(downsample):AnnData input for downsample_cells~~ feat(downsample):AnnData input and output for downsample_cells Jun 13, 2025

Chloe-Thangavelu added 2 commits June 12, 2025 22:13

Update test_downsample_cells.py

3e84383

This test now checks that AnnData objects are accepted, downsampled correctly, and returned as AnnData objects.

Fix downsample_cells in data_utils.py

c110994

Reordering the code to convert annotations to a list before extracting annotation information, ensuring it is in DataFrame format as required by subsequent downsample_cells code.

fangliu117 requested a review from Copilot June 24, 2025 02:55

Copilot AI reviewed Jun 24, 2025


fangliu117 reviewed Jun 24, 2025


Chloe-Thangavelu force-pushed the feature/add-downsample-anndata-compatibility branch from bb4b7a3 to 1915dca Compare June 26, 2025 03:10

Chloe-Thangavelu wants to merge 5 commits into FNLCR-DMAP:dev from Chloe-Thangavelu:feature/add-downsample-anndata-compatibility
