Collaborator

@ilan-gold ilan-gold commented Nov 3, 2025

Calling to_native_dtype + __str__ came up as one of the only Python-side, CPU-bound hotspots when I did some benchmarking. My use case is quite contrived (generating thousands of WithSubset objects), but I think it's worth investigating getting rid of these calls. Some observations:

1. I wonder if all of getting the dtype and fill_val could be wrapped up by just relying on https://docs.rs/zarrs/latest/zarrs/array/struct.Array.html#method.open and then using the values directly (there are probably other benefits of doing this), but I think this is a separate PR. So it turns out the array actually doesn't have to be created yet when the pipeline is generated, so I ended up doing something similar with the metadata (making it an actual struct and then working with that to get the fill value and dtype).
2. Regardless, most of this refactor is around removing Basic anyway, so that chunk handling is independent of the ability. I noticed that ChunkRepresentation requires ownership over its arguments, which means we copy per chunk. I'm not sure what would go into making that a reference, but it's no worse than the previous situation, where I think we were generating copies repeatedly, but from PyO3 calling Python --> Hooray! No longer!

The benefit wasn't crazy (~5%), but I think going in this direction is good (see point 1).
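To illustrate the general idea of resolving the dtype and fill value once and sharing them across chunks, here is a minimal sketch. All names in it (`ArrayMeta`, `ChunkItem`, `chunk_items`) are hypothetical and are not the types used in this PR or in zarrs; in particular, a PyO3 `#[pyclass]` cannot carry a lifetime parameter, so the real code necessarily looks different.

```rust
/// Hypothetical array-level metadata, resolved once when the pipeline is built.
#[derive(Debug, Clone, PartialEq)]
pub struct ArrayMeta {
    pub data_type: String,
    pub fill_value: Vec<u8>,
}

/// Hypothetical per-chunk item that borrows the shared metadata
/// instead of re-deriving or cloning it for every chunk.
#[derive(Debug)]
pub struct ChunkItem<'a> {
    pub key: String,
    pub data_type: &'a str,
    pub fill_value: &'a [u8],
}

/// Build one item per chunk key; only the key is allocated per chunk.
pub fn chunk_items<'a>(meta: &'a ArrayMeta, keys: &[&str]) -> Vec<ChunkItem<'a>> {
    keys.iter()
        .map(|key| ChunkItem {
            key: (*key).to_string(),
            data_type: meta.data_type.as_str(),
            fill_value: meta.fill_value.as_slice(),
        })
        .collect()
}
```

With this shape, generating thousands of items costs one key allocation each, while the dtype and fill value are stored once in `ArrayMeta`.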

@ilan-gold ilan-gold marked this pull request as draft November 3, 2025 14:49
@LDeakin LDeakin mentioned this pull request Jan 1, 2026
@ilan-gold ilan-gold changed the base branch from main to ld/zarrs_0.23.0 January 3, 2026 19:08
@ilan-gold ilan-gold changed the title from "(feat): remove dtype + fill val handling per chunk" to "perf: remove dtype + fill val handling per chunk" Jan 5, 2026
Comment on lines +74 to +75
&self.data_type,
&self.fill_value,
Collaborator Author

In general, I'm not sure what the best practice is here: should these go behind a method that returns the reference, or is this OK?

Collaborator

This is fine! I’d also remove the key and shape functions and just make them pub for consistency.
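For readers weighing the two options discussed here, a small sketch of both styles follows. The struct and field names are illustrative only (the module names `accessors` and `public_fields` and the `String`/`Vec<u8>` field types are assumptions, not the PR's actual definitions).

```rust
/// Style 1: private fields behind accessor methods that return references.
pub mod accessors {
    pub struct ChunkItem {
        data_type: String,
        fill_value: Vec<u8>,
    }

    impl ChunkItem {
        pub fn new(data_type: String, fill_value: Vec<u8>) -> Self {
            Self { data_type, fill_value }
        }

        /// Borrow the data type without cloning it.
        pub fn data_type(&self) -> &str {
            &self.data_type
        }

        /// Borrow the fill value bytes without cloning them.
        pub fn fill_value(&self) -> &[u8] {
            &self.fill_value
        }
    }
}

/// Style 2: public fields, read with `&item.data_type` at the call site.
pub mod public_fields {
    pub struct ChunkItem {
        pub data_type: String,
        pub fill_value: Vec<u8>,
    }
}
```

Neither style copies data when reading. Accessors let a type keep invariants over its fields, while public fields are shorter and let callers borrow different fields independently, which is usually sufficient for a crate-private struct.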

@ilan-gold ilan-gold marked this pull request as ready for review January 5, 2026 11:28
@pytest.mark.filterwarnings(
# TODO: Fix handling of string fill values for Zarr v2 bytes data
"ignore:Array is unsupported by ZarrsCodecPipeline. incompatible fill value .eAAAAAAAAA==. for data type bytes:UserWarning"
"ignore:Array is unsupported by ZarrsCodecPipeline. incompatible fill value 0 for data type r56:UserWarning"
Collaborator

what’s a “r56”?

Collaborator Author

That's probably a question for @LDeakin. Presumably it's the Rust-side representation of V7, where 56 is the number of bits and r is the equivalent of V?

Collaborator

Is there a human-readable repr we can emit instead? (Assuming we emit that message and not zarrs.)

Member

That error is emitted by zarrs, and r56 is the v3 name for the "raw bits" data type: https://zarr-specs.readthedocs.io/en/latest/v3/data-types/index.html#core-data-types

I can have a look into tightening that error so it displays the V2 name when running on a V2 array.
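As a rough illustration of how the two names relate, the sketch below converts between a NumPy-style void dtype name and the v3 raw-bits name, assuming (as guessed earlier in this thread) that the V2 dtype in question is `V7`, i.e. 7 bytes, so 7 × 8 = 56 bits. The function names are hypothetical helpers written for this example; they are not part of zarrs.

```rust
/// Convert a NumPy-style void dtype name ("V7" or "|V7") to the
/// Zarr v3 raw-bits name ("r56"). Returns None if the name does not parse.
pub fn v2_void_to_v3_raw_bits(v2_name: &str) -> Option<String> {
    // The byte-order marker "|" is optional.
    let without_order = v2_name.strip_prefix('|').unwrap_or(v2_name);
    let n_bytes: usize = without_order.strip_prefix('V')?.parse().ok()?;
    // The v3 name counts bits, the V2 name counts bytes.
    let n_bits = n_bytes.checked_mul(8)?;
    Some(format!("r{}", n_bits))
}

/// Convert a Zarr v3 raw-bits name ("r56") back to a NumPy-style
/// void dtype name ("|V7"). Returns None unless the bit count is a
/// multiple of 8.
pub fn v3_raw_bits_to_v2_void(v3_name: &str) -> Option<String> {
    let n_bits: usize = v3_name.strip_prefix('r')?.parse().ok()?;
    if n_bits % 8 != 0 {
        return None;
    }
    Some(format!("|V{}", n_bits / 8))
}
```

A V2-aware error message could use a mapping along these lines to print `|V7` instead of `r56` when the array is a V2 array.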

#[derive(Clone)]
#[gen_stub_pyclass]
#[pyclass]
pub(crate) struct WithSubset {
Collaborator

Maybe rename this to ChunkItem or ChunkSlice now that there's only one?

4 participants