
Conversation

@Eta0 (Collaborator) commented on Aug 15, 2025

Support for tensors with long dimensions

This change adds support to tensorizer for serializing tensors with individual dimensions spanning $2^{32}$ or more elements. It subsequently fixes a bug (#198) in the tensorizer.torch_compat module that prevented it from serializing many large tensors, even ones without any single dimension that large, because tensorizer.torch_compat serializes all data internally as flattened 1-dimensional arrays.

Resolves #198.
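To illustrate the torch_compat case, here is a small example (the shape is arbitrary and chosen only for illustration) of a tensor whose individual dimensions each fit comfortably in 32 bits, but whose flattened storage length does not:

```python
import math

# Hypothetical shape: no single dimension reaches 2**32 elements,
# but the flattened 1-dimensional storage that torch_compat
# serializes is larger than that.
shape = (1 << 17, 1 << 16)  # 131072 x 65536
flat_elements = math.prod(shape)

print(flat_elements)             # 8589934592 == 2**33
print(flat_elements < (1 << 32)) # False: too large for a 32-bit shape entry
```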

Problem with the existing code

Tensorizer currently stores each tensor's total size as a 64-bit integer, but stores its shape as an array of 32-bit integers. This can handle tensors of $2^{32}$ bytes or more whose elements are spread across multiple dimensions, such as a shape of ($2^{31}$, $2^{31}$), but not a 1-dimensional tensor whose shape is simply ($2^{32}$). tensorizer.torch_compat does not serialize torch.Tensor objects directly, but rather torch storage objects. A storage is the underlying 1-dimensional memory region spanned by a tensor, without shape information, so serializing a storage hits this integer overflow for anything whose total size is $2^{32}$ bytes/elements or more, not just tensors with a single dimension that large.
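As a purely illustrative demonstration of the width problem (this does not reproduce tensorizer's actual header layout), a 32-bit field simply cannot hold a dimension of $2^{32}$:

```python
import struct

dim = 1 << 32  # a single dimension of 2**32 elements

# A 64-bit field (like the one used for total tensor size) holds it fine:
struct.pack("<Q", dim)

# A 32-bit field (like the current shape entries) overflows:
try:
    struct.pack("<I", dim)
except struct.error as exc:
    print(f"32-bit shape entry overflows: {exc}")
```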

Changes implemented to fix it

The naïve fix for this issue is obviously to replace the 32-bit integers used to store the shape with 64-bit integers. However, this changes the data format to one that older tensorizer versions cannot read, so the actual implementation is complicated by the need to preserve backwards compatibility for any file that does not require extra-long dimension support.

We would normally mark such a change by incrementing the file's data version number when the serializer is given arguments implying that the feature will be needed. In this case, however, the need for extra-long dimension support is only discovered after serialization has started, at some unknown point when the first extra-long tensor is encountered. Data may already have been written by then, and rewriting existing portions of the file to change the integer width of the shape field in prior headers is impractical, as it would entail shifting all the tensor data in the file.

To circumvent this, the implementation here adds version numbers to each tensor header stub in the file header, prior to the tensor data, when this new data version is discovered to be needed. By editing this section, it is possible to "rewrite history," so to speak: this section contains only headers, not headers densely packed with tensor data, so it is performant to rewrite the entire header block if needed. The code in this PR uses a new internal class (_MetadataHandler) to track the state of headers throughout the serialization process and rewrite parts of the file as needed. In this way, backwards compatibility is preserved for all existing use cases, and the newer data format is employed only when needed.
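The following is a minimal sketch of the header-rewriting idea, under the simplifying assumption that bulk tensor data is only written after the header-stub block is finalized; all class, field, and format names here are hypothetical and do not reflect the actual _MetadataHandler or tensorizer's real on-disk layout:

```python
import io
import struct

# Sketch only: header stubs live in a contiguous block near the start of
# the file, before any bulk tensor data, so when the first extra-long
# dimension is encountered, the whole block can be re-emitted in place
# with 64-bit shape entries and a bumped per-entry version marker.

class HeaderBlockSketch:
    def __init__(self, file: io.BufferedRandom):
        self.file = file
        self.block_start = file.tell()
        self.shapes = []          # shapes of every stub written so far
        self.wide_shapes = False  # switch to 64-bit entries when needed

    def _pack_stub(self, shape) -> bytes:
        version = 1 if self.wide_shapes else 0  # hypothetical version marker
        fmt = "<Q" if self.wide_shapes else "<I"
        body = b"".join(struct.pack(fmt, dim) for dim in shape)
        return struct.pack("<BB", version, len(shape)) + body

    def add_stub(self, shape) -> None:
        if not self.wide_shapes and any(dim >= 1 << 32 for dim in shape):
            # "Rewrite history": re-emit every earlier stub in wide format.
            self.wide_shapes = True
            self._rewrite_block()
        self.shapes.append(shape)
        self.file.write(self._pack_stub(shape))

    def _rewrite_block(self) -> None:
        # Cheap because the block holds only small headers, not tensor data.
        self.file.seek(self.block_start)
        for shape in self.shapes:
            self.file.write(self._pack_stub(shape))
```

In this sketch, encountering the first shape with a dimension of $2^{32}$ or more triggers a single in-place rewrite of the header block; everything already written is re-emitted with the wider entries, and all subsequent stubs are written in the wide format from the start.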

Versioning

This change updates the code version to v2.12.0a0 and updates the changelog to account for the new feature.

@Eta0 requested a review from @wbrown on August 15, 2025 16:38
@Eta0 self-assigned this on Aug 15, 2025
@Eta0 added the enhancement label on Aug 15, 2025
@wbrown (Contributor) left a comment


Great job. 👍

@wbrown merged commit 71cf174 into main on Aug 17, 2025
2 checks passed
@Eta0 deleted the eta/long-tensors branch on August 19, 2025