feat(serialization): Support tensors with very long dimensions #199
Support for tensors with long dimensions
This change adds support to tensorizer for serializing tensors with individual dimensions spanning $2^{32}$ or more elements. This in turn fixes a bug (#198) with the `tensorizer.torch_compat` module that prevented it from serializing many large tensors, even ones without any single dimension that large, since `tensorizer.torch_compat` serializes all data internally as flattened, 1-dimensional arrays.
Resolves #198.
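For context, here is a minimal sketch of the use case this enables, using tensorizer's documented `TensorSerializer.write_module` API. Note that the buffer really is 4 GiB, and whether `write_module` serializes registered buffers by default is an assumption here:

```python
import torch
from tensorizer import TensorSerializer

# A tensor with a single dimension of 2**32 elements (4 GiB as uint8),
# which the previous format's 32-bit shape entries could not represent.
module = torch.nn.Module()
module.register_buffer("big", torch.zeros(2**32, dtype=torch.uint8))

serializer = TensorSerializer("big.tensors")
serializer.write_module(module)  # with this change, no longer overflows
serializer.close()
```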
Problem with the existing code
Tensorizer currently stores total tensor size as a 64-bit integer, but stores tensor shape as an array of 32-bit integers. While this can handle tensors $2^{32}$ bytes or larger with dimensions like $(2^{31}, 2^{31})$, it can't handle them as 1-dimensional tensors with shapes that are just $(2^{32},)$.
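To illustrate the limit, assuming shape entries are packed as little-endian unsigned 32-bit integers (the exact header layout is simplified here):

```python
import struct

# A (2**31, 2**31) shape fits: each entry stays below 2**32.
struct.pack("<2I", 2**31, 2**31)

# A single dimension of 2**32 elements does not fit in 32 bits:
try:
    struct.pack("<I", 2**32)
except struct.error as e:
    print(e)  # 'I' format requires 0 <= number <= 4294967295
```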
As for `tensorizer.torch_compat`, it does not serialize `torch.Tensor` objects, but `torch` storage objects. Storages are the underlying 1-dimensional memory region spanned by a tensor, without shape information, meaning that serializing a storage runs into this integer overflow issue for anything with a total size that is $2^{32}$ bytes/elements or larger, instead of just the ones with a single dimension like that.
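The following illustrates why this matters for storages, using only standard `torch` APIs (the large shape in the comment is not actually allocated):

```python
import torch

# A 2-D tensor is a view over a flat, 1-D storage with no shape of its own.
t = torch.zeros(4, 8)                # shape (4, 8), float32
print(t.untyped_storage().nbytes())  # 128 == 4 * 8 * 4 bytes, flattened

# By the same flattening, a (65536, 65536) float32 tensor has a 2**34-byte
# storage: serializing the storage as one flat region overflows a 32-bit
# length field even though neither original dimension reaches 2**32.
```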
Changes implemented to fix it
The naïve fix for this issue is obviously to swap out all of the 32-bit integers used to store the shape for 64-bit integers. However, this changes the data format to one unreadable by older tensorizer versions, so the actual implementation is complicated by the need to preserve backwards compatibility for any file that does not require extra-long dimension support. We would normally mark such a change by incrementing the file's data version number when the serializer is given arguments implying that the feature will be needed, but in this case, extra-long dimension support is only known to be needed after serialization has started, at some unknown point when the first extra-long tensor is encountered. Data may already have been written by that point, and it is impractical to then rewrite existing portions of the file to switch prior headers to a different integer width in the shape field, as that would entail shifting around all the tensor data in the file. To circumvent this, the implementation here adds version numbers to each tensor header stub in the file header, prior to the tensor data, when this new data version is discovered to be needed. By editing this section, it is possible to "rewrite history," so to speak: the section contains only densely packed headers with no interleaved tensor data, so it is performant to rewrite even the entire header block if needed (as sketched below). The code in this PR uses a new internal class (`_MetadataHandler`) to track the state of headers throughout the serialization process and rewrite parts of the file as needed. In this way, backwards compatibility is preserved for all existing use cases, and the newer data format is employed only when needed.
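`_MetadataHandler` is internal to this PR; the following is a minimal sketch of the rewrite-in-place idea it enables, with a hypothetical name and field layout rather than the real format:

```python
import io
import struct

class MetadataRewriter:
    """Sketch: track header-stub offsets so they can be rewritten in place.

    Hypothetical simplification of the idea behind _MetadataHandler; the
    names and the single 32-bit version field are illustrative assumptions.
    """

    def __init__(self, file):
        self.file = file
        self.offsets = []  # byte offset of each tensor header stub

    def write_stub(self, version: int) -> None:
        # Record where this header stub begins, then write its version field.
        self.offsets.append(self.file.tell())
        self.file.write(struct.pack("<I", version))

    def upgrade_all(self, new_version: int) -> None:
        # "Rewrite history": bump the version field of every stub written so
        # far. The stubs sit in the file header, packed together and apart
        # from tensor data, so rewriting them shifts no data and stays cheap.
        end = self.file.tell()
        for offset in self.offsets:
            self.file.seek(offset)
            self.file.write(struct.pack("<I", new_version))
        self.file.seek(end)

# Example: two stubs written at version 2, then upgraded to version 3.
buf = io.BytesIO()
handler = MetadataRewriter(buf)
handler.write_stub(2)
handler.write_stub(2)
handler.upgrade_all(3)
```

When the first extra-long tensor is encountered mid-serialization, a single upgrade pass over the recorded offsets is enough to make every previously written header advertise the new version.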
Versioning
This change updates the code version to v2.12.0a0 and updates the changelog to account for the new feature.