
Conversation

@Eta0 (Collaborator) commented on Aug 15, 2025

Support for tensors with long dimensions

This change adds support to tensorizer for serializing tensors with individual dimensions spanning $2^{32}$ or more elements. It subsequently fixes a bug (#198) in the tensorizer.torch_compat module that prevented it from serializing many large tensors, even ones without any single dimension that large, because tensorizer.torch_compat serializes all data internally as flattened 1-dimensional arrays.

Resolves #198.
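To illustrate the torch_compat case, here is a small example (the shape is arbitrary and chosen only for illustration) of a tensor whose individual dimensions each fit comfortably in 32 bits, but whose flattened storage length does not:

```python
import math

# Hypothetical shape: no single dimension reaches 2**32 elements,
# but the flattened 1-dimensional storage that torch_compat
# serializes is larger than that.
shape = (1 << 17, 1 << 16)  # 131072 x 65536
flat_elements = math.prod(shape)

print(flat_elements)             # 8589934592 == 2**33
print(flat_elements < (1 << 32)) # False: too large for a 32-bit shape entry
```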

Problem with the existing code

Tensorizer currently stores each tensor's total size as a 64-bit integer, but stores its shape as an array of 32-bit integers. This can handle tensors of $2^{32}$ bytes or more whose elements are spread across multiple dimensions, such as a shape of ($2^{31}$, $2^{31}$), but not a 1-dimensional tensor whose shape is simply ($2^{32}$). tensorizer.torch_compat does not serialize torch.Tensor objects directly, but rather torch storage objects. A storage is the underlying 1-dimensional memory region spanned by a tensor, without shape information, so serializing a storage hits this integer overflow for anything whose total size is $2^{32}$ bytes/elements or more, not just tensors with a single dimension that large.
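As a purely illustrative demonstration of the width problem (this does not reproduce tensorizer's actual header layout), a 32-bit field simply cannot hold a dimension of $2^{32}$:

```python
import struct

dim = 1 << 32  # a single dimension of 2**32 elements

# A 64-bit field (like the one used for total tensor size) holds it fine:
struct.pack("<Q", dim)

# A 32-bit field (like the current shape entries) overflows:
try:
    struct.pack("<I", dim)
except struct.error as exc:
    print(f"32-bit shape entry overflows: {exc}")
```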

Changes implemented to fix it

The naïve fix for this issue is obviously to replace the 32-bit integers used to store the shape with 64-bit integers. However, this changes the data format to one that older tensorizer versions cannot read, so the actual implementation is complicated by the need to preserve backwards compatibility for any file that does not require extra-long dimension support.

We would normally mark such a change by incrementing the file's data version number when the serializer is given arguments implying that the feature will be needed. In this case, however, the need for extra-long dimension support is only discovered after serialization has started, at some unknown point when the first extra-long tensor is encountered. Data may already have been written by then, and rewriting existing portions of the file to change the integer width of the shape field in prior headers is impractical, as it would entail shifting all the tensor data in the file.

To circumvent this, the implementation here adds version numbers to each tensor header stub in the file header, prior to the tensor data, when this new data version is discovered to be needed. By editing this section, it is possible to "rewrite history," so to speak: this section contains only headers, not headers densely packed with tensor data, so it is performant to rewrite the entire header block if needed. The code in this PR uses a new internal class (_MetadataHandler) to track the state of headers throughout the serialization process and rewrite parts of the file as needed. In this way, backwards compatibility is preserved for all existing use cases, and the newer data format is employed only when needed.
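The following is a minimal sketch of the header-rewriting idea, under the simplifying assumption that bulk tensor data is only written after the header-stub block is finalized; all class, field, and format names here are hypothetical and do not reflect the actual _MetadataHandler or tensorizer's real on-disk layout:

```python
import io
import struct

# Sketch only: header stubs live in a contiguous block near the start of
# the file, before any bulk tensor data, so when the first extra-long
# dimension is encountered, the whole block can be re-emitted in place
# with 64-bit shape entries and a bumped per-entry version marker.

class HeaderBlockSketch:
    def __init__(self, file: io.BufferedRandom):
        self.file = file
        self.block_start = file.tell()
        self.shapes = []          # shapes of every stub written so far
        self.wide_shapes = False  # switch to 64-bit entries when needed

    def _pack_stub(self, shape) -> bytes:
        version = 1 if self.wide_shapes else 0  # hypothetical version marker
        fmt = "<Q" if self.wide_shapes else "<I"
        body = b"".join(struct.pack(fmt, dim) for dim in shape)
        return struct.pack("<BB", version, len(shape)) + body

    def add_stub(self, shape) -> None:
        if not self.wide_shapes and any(dim >= 1 << 32 for dim in shape):
            # "Rewrite history": re-emit every earlier stub in wide format.
            self.wide_shapes = True
            self._rewrite_block()
        self.shapes.append(shape)
        self.file.write(self._pack_stub(shape))

    def _rewrite_block(self) -> None:
        # Cheap because the block holds only small headers, not tensor data.
        self.file.seek(self.block_start)
        for shape in self.shapes:
            self.file.write(self._pack_stub(shape))
```

In this sketch, encountering the first shape with a dimension of $2^{32}$ or more triggers a single in-place rewrite of the header block; everything already written is re-emitted with the wider entries, and all subsequent stubs are written in the wide format from the start.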

Versioning

This change updates the code version to v2.12.0a0 and updates the changelog to account for the new feature.

@Eta0 requested a review from @wbrown on August 15, 2025 16:38
@Eta0 self-assigned this on Aug 15, 2025
@Eta0 added the enhancement label on Aug 15, 2025
@wbrown (Contributor) left a comment


Great job. 👍

@wbrown merged commit 71cf174 into main on Aug 17, 2025
2 checks passed
@Eta0 deleted the eta/long-tensors branch on August 19, 2025