[QST] Question about how to save the files #524

@liangyf24


What is your question?
Hi there,

First of all, thank you very much for maintaining this repository. I really appreciate the effort that has gone into making GPU-accelerated single-cell analysis possible.

I am writing to ask for advice regarding an issue I encountered when saving results.

Dataset context
• Dataset size: ~880,000 cells
• Workflow involves GPU-based preprocessing (via the rapids_singlecell multi-GPU workflow)
• One of the layers (e.g. adata.layers["scaled"]) is very large
• According to Dask diagnostics, this layer corresponds to ~94 GB of data
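As a rough sanity check on that figure, the number of features implied by ~94 GB can be estimated from the cell count. This is only a back-of-the-envelope sketch: it assumes the scaled layer is dense float32 (4 bytes per value), which is not confirmed above, and the helper name `implied_features` is hypothetical. Both decimal GB and binary GiB are shown because it is unclear which unit the diagnostics report.

```python
def implied_features(total_bytes: float, n_cells: int, itemsize: int = 4) -> float:
    """Number of columns a dense (n_cells x n_features) array would need
    to occupy total_bytes, given itemsize bytes per element."""
    return total_bytes / (n_cells * itemsize)


N_CELLS = 880_000  # approximate cell count from the dataset context

# Assumption: dense float32 storage (itemsize = 4).
features_if_gb = implied_features(94e9, N_CELLS)        # 94 decimal gigabytes
features_if_gib = implied_features(94 * 2**30, N_CELLS)  # 94 binary gibibytes

print(f"implied features (GB):  {features_if_gb:,.0f}")
print(f"implied features (GiB): {features_if_gib:,.0f}")
```

Either way the layer works out to a few tens of thousands of dense columns per cell, which is why materializing it in memory in one piece is impractical here.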

The issue:
When I try to save the AnnData object, writing fails for both the h5ad and zarr formats. For example, when writing to h5ad, I get errors like:
File ~/miniconda3/envs/rapids_singlecell/lib/python3.13/site-packages/dask/local.py:191, in start_state_from_dask(dsk, cache, sortkey, keys)
189 if task is None:
190 if dependents[key] and not cache.get(key):
--> 191 raise ValueError(
192 "Missing dependency {} for dependents {}".format(
193 key, dependents[key]
194 )
195 )
196 continue
197 elif isinstance(task, DataNode):

ValueError: Missing dependency ('scale_kernel_center-0940b9c96e20102f885e498c537f15dd', 9, 0) for dependents {('store-map-cc0b083db7973a40ae52a6e4cdb83699', 9, 0)}
Error raised while writing key 'scaled' of <class 'h5py._hl.group.Group'> to /layers

I also considered the following:
• Writing to Zarr instead of H5AD, which also fails.
• Calling adata.layers['scaled'].compute() before saving, which is not feasible in my case due to memory constraints (the layer alone is ~94 GB).

My question:
What would be the recommended way to store final results in this situation?

Specifically:
• Are large intermediate layers (e.g. scaled data) expected to be dropped before saving?
• Is there a recommended pattern for persisting results when working with very large GPU/Dask-backed layers?
• Are there best practices for saving AnnData objects at this scale that I may be missing?

Any guidance on how you would handle this in a production-scale workflow would be greatly appreciated.

Thank you again for your time and for maintaining this project.
