Preparatory splitstream format changes for ostree support #185
Force-pushed 3803dc3 → 9aedd96
crates/composefs-oci/src/skopeo.rs (outdated):

```rust
use crate::{sha256_from_descriptor, sha256_from_digest, tar::split_async, ContentAndVerity};

pub const TAR_LAYER_CONTENT_TYPE: u64 = 0x2a037edfcae1ffea;
```
Can you add a comment for where these came from? I'm guessing random? If so just add a comment // Random unique ID ?
That said I wonder if it wouldn't be nicer to store (variable length) strings for this in the format? Maybe it could go all the way to literally suggested to be the mediaType from OCI (if applicable)
Yes. These are random.
But, I'd rather avoid having variable length things in the header. That makes parsing it much more tricky.
We could make it a real uuid tho
I'm fine keeping as u64 too.
I added some comments here
crates/composefs/src/repository.rs (outdated):

```rust
pub fn has_named_stream(&self, name: &str) -> bool {
    let stream_path = format!("streams/refs/{}", name);

    readlinkat(&self.repository, &stream_path, []).is_ok()
```
I don't like "swallowing" errors like this, I'd say call stat instead and require it's S_IFLNK
I redid this with stat()
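The stat()-based variant being discussed could look roughly like the sketch below. Note this is illustrative only: the real code operates on rustix file handles relative to the repository directory fd, whereas this sketch uses `std::fs` paths. `symlink_metadata()` is an lstat(), so it lets us require that the entry itself is a symlink (S_IFLNK) instead of swallowing every possible readlink error.

```rust
use std::fs;
use std::path::Path;

// Sketch only: the real code uses rustix relative to the repository fd.
// symlink_metadata() does not follow the link, so a regular file (or
// anything else) at that path is correctly reported as "no stream".
fn has_named_stream(repo: &Path, name: &str) -> bool {
    let stream_path = repo.join("streams/refs").join(name);
    match fs::symlink_metadata(&stream_path) {
        Ok(meta) => meta.file_type().is_symlink(),
        Err(_) => false, // ENOENT etc: the stream simply isn't there
    }
}
```

Usage is the same as before: `has_named_stream(&repo_path, "oci-layer-sha256:...")` returns true only when a symlink ref exists.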
crates/composefs/src/splitstream.rs (outdated):

```rust
#[derive(Clone, Debug, FromBytes, Immutable, IntoBytes, KnownLayout)]
#[repr(C)]
pub struct MappingEntry {
    pub body: Sha256Digest,
```
If we're changing the format here...I think it'd be nice to make this one extensible.
However...bigger picture there's another consideration: There's obviously a metric ton of binary serialization formats out there. A custom one isn't wrong necessarily but...how about say CBOR ? It has some usage and a proper RFC etc.
I guess a dividing line is "do we care about mmap()"? Probably not?
splitstreams are essentially thin wrappers of existing binary formats (tars, ostree objects, etc), adding just references to other composefs repo objects. I'm not sure it's overly helpful to use a complicated binary format for the wrapping, especially one which is completely different from the inner format.
That said, I agree that we should make it at least a bit extensible. This MR adds a magic header, but also adding a version field and a few bytes of unused/unparsed space does seem quite useful.
We discussed mmap, and the end result was, no, we don't want it.
> splitstreams are essentially thin wrappers of existing binary formats (tars, ostree objects, etc), adding just references to other composefs repo objects. I'm not sure it's overly helpful to use a complicated binary format for the wrapping, especially one which is completely different from the inner format.
Yeah, though the nice thing about Rust here is that for stuff like this there's a lot of well-done crates.
It also makes it a lot more obvious and easy to parse from other languages too if we can say "it's just CBOR" (or whatever).
Anyways: I'm basically fine with this as is too.
Although what about the algorithm agility? There's been some thoughts that for post-quantum crypto we may need to get away from sha256 in theory as far as I understand things.
I think the right thing is to add a header size field, and skip parts we don't understand. Then we can easily extend this later.
I added a new extension_size field which we skip on read.
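The skip-on-read behavior described here could be sketched like this (the `extension_size` name comes from the comment above; the helper itself is hypothetical, not the actual code):

```rust
use std::io::{Read, Result, Seek, SeekFrom};

// Forward-compat sketch: a reader that doesn't understand the extension
// bytes simply seeks past them, so newer writers can append fields to the
// header without breaking older readers.  Returns the new file position.
fn skip_extension<R: Read + Seek>(reader: &mut R, extension_size: u64) -> Result<u64> {
    reader.seek(SeekFrom::Current(extension_size as i64))
}
```

A reader would call this right after parsing the fixed-size part of the header, before moving on to the sections it does understand.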
Force-pushed 057121b → bed66dc
allisonkarlitskaya left a comment:
Let's talk about this. I might have a bit of bandwidth to work on this if you like.
doc/splitstream.md (outdated):

```rust
struct SplitstreamHeader {
    magic: [u8; 7],  // Contains SPLITSTREAM_MAGIC
    algorithm: u8,   // The fs-verity algorithm used, 1 == sha256, 2 == sha512
```
We should also do fs-verity block size here. That's usually expressed as a bit-shift count, so 12 or 16...
We could also write it like "fsverity-sha256-12" or so as a string... some relevant discussion in #181.
doc/splitstream.md (outdated):

```rust
    n_refs: u64,
    n_mappings: u64,
    refs: [ObjectID; n_refs],             // sorted
    mappings: [MappingEntry; n_mappings], // sorted by body
```
It's so sketch that we hardcode sha256 here... I think that's probably OK, but maybe we'd add an extension mechanism so we could add new types of mappings tables...
Something like this from the start:

- magic
- n_sections
- n_sections × (section_start, section_size_in_bytes)
We could name the sections but I think it's quite OK to just know what the numbers mean and require that they're all present, in order. An empty section would be denoted by a zero size.
We could also get into compat vs. uncompat extensions... not sure how far it's worth going here...
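The sections-table proposal above could be sketched as follows. Everything here is illustrative (the field layout, the little-endian encoding, the names): sections are identified purely by position, all must be present and in order, and an empty section has size zero.

```rust
// Hypothetical on-disk layout: magic (u64), n_sections (u64), then one
// (start, size) pair of u64s per section, all little-endian.
#[derive(Debug, PartialEq)]
struct Section {
    start: u64,
    size: u64, // 0 denotes an empty (but still present) section
}

fn parse_sections(buf: &[u8]) -> Option<Vec<Section>> {
    let u64_at = |off: usize| -> Option<u64> {
        Some(u64::from_le_bytes(buf.get(off..off + 8)?.try_into().ok()?))
    };
    let _magic = u64_at(0)?; // a real parser would verify this
    let n = u64_at(8)? as usize;
    let mut sections = Vec::with_capacity(n);
    for i in 0..n {
        let off = 16 + i * 16;
        sections.push(Section { start: u64_at(off)?, size: u64_at(off + 8)? });
    }
    Some(sections)
}
```

Because readers know what each position means, adding a new section later only requires bumping n_sections; old readers can refuse (or ignore the tail, depending on the compat rules chosen).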
We should get this in probably very soon and then I think declare the format stable?

I have a branch..... lol

One of the things that I'm tormenting myself on a bit right now is the sha256 mapping. I'm considering changing it to a general-purpose "named object reference" mapping: we could then have a hashmap mapping names like "sha256:12345" to the object ID in question, and adding sha512 would be seamless. I think I'd choose to encode that as a nul-separated series of strings.

The alternative is to stay with what we have now, more or less, but it's much less flexible and is gonna be cruft one day, I'm sure. That being said: we have an extensibility mechanism now, at least...

The other thing that really needs fixing in @alexlarsson's work vs. the current version of the branch is that we should really take advantage of the fact that we have the references array out front now and use indexes into it from the splitstream content instead of repeating the whole object ID. It would have the additional advantage of ensuring that it becomes physically impossible to refer to an object that wasn't listed in the "depends" header (which we'll use during GC), and yet another advantage in that we could use the 64-bit "starting word" for each internal/external section in the stream for both cases (or with the high bit as a flag or whatever). It's just "work" to get this over the line....

After that, I think we need to figure out a way to kill off the content-sha256-based naming of splitstreams and perhaps even consider getting rid of the streams/ directory entirely... each backend would have its own way of 'caching' interesting splitstream objects for itself. I'm not entirely sure how I'd do that for the OCI backend.

In a related conversation with @Johan-Liebert1 we discussed having the layers (and possibly the config) ((and possibly possibly the manifest some day)) referenced from the erofs image itself as a way to optionally prevent those things from being GC'd. We'd probably want some sort of a better "lookup table" still, though... but the key difference is that this table would be unique to OCI, not some global "streams" directory that we try to pretend is sharable by everyone on equal footing...
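The "one 64-bit starting word for both cases, with the high bit as a flag" idea could be sketched like this. The flag choice and names are purely illustrative, not part of the actual format:

```rust
// High bit distinguishes the two kinds of sections: clear = inline chunk
// (payload is the size in bytes), set = external reference (payload is an
// index into the references array at the front of the file).
const EXTERNAL_FLAG: u64 = 1 << 63;

enum Chunk {
    Inline { size: u64 },
    External { ref_index: u64 },
}

fn encode_inline(size: u64) -> u64 {
    assert!(size & EXTERNAL_FLAG == 0);
    size
}

fn encode_external(ref_index: u64) -> u64 {
    assert!(ref_index & EXTERNAL_FLAG == 0);
    EXTERNAL_FLAG | ref_index
}

fn decode(word: u64) -> Chunk {
    if word & EXTERNAL_FLAG != 0 {
        Chunk::External { ref_index: word & !EXTERNAL_FLAG }
    } else {
        Chunk::Inline { size: word }
    }
}
```

A side effect of the index encoding is exactly the guarantee mentioned above: a stream physically cannot name an object that isn't in the references array, because all it can store is an index into that array.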
Force-pushed bed66dc → bcafffa
```rust
use hex::FromHexError;
use sha2::{digest::FixedOutputReset, digest::Output, Digest, Sha256, Sha512};
use std::cmp::Ord;
```
I would move this change to a separate commit
Actually, I think we can drop this. We're not sorting the objects arrays anymore.
I took a look at this, and the file format looks ok to me. I don't think the content type is needed anymore with the new structured names, though.
```rust
let config = if verity.is_none() {
    // No verity means we need to verify the content hash
    let mut data = vec![];
    stream.read_to_end(&mut data)?;
```
Maybe we could use std::io::copy for the hashing?
Not a bad idea, although it means we can't use our existing hash() function...
I tried to do this but I don't think it's a good idea: we need the data for two things, the hashing and the creation of the ImageConfiguration. Reading it once into a buffer lets us do both.
Force-pushed c4b6ae7 → 65ade39
.github/workflows/bootc-revdep.yml (outdated):

```diff
 - name: Build sealed bootc image
   working-directory: bootc
-  run: just build-sealed
+  run: just build
```
Sorry, it's actually `just variant=composefs-sealeduki-sdboot build` now.
Thanks!
I just ran into this too and then remembered it from here. I threw up bootc-dev/bootc#1787 to propose bringing back the old target as a wrapper.
```rust
struct FileRange {
    start: U64,
```
Why is this U uppercase?
That's from zerocopy: it's a little endian u64.
Are you asking for a clarifying note in the docs? That's a good suggestion. I actually thought I had added one...
The thing I'll raise here again is that I'm still uncertain about the "centrality" of splitstream for composefs-rs; this was previously discussed in #14 (comment).

To continue and flesh that out, I still feel it'd be a lot easier to understand if what we always parsed was EROFS images, and splitstream was just a lookaside that could help map from the EROFS back bit-for-bit, only for heavyweight serializations like tar, and would only be used when we actually needed to return back a tar. Storing small data objects like manifests and config JSON would just be saved bit-for-bit as-is.
Basically everyone dealing with composefs+OCI needs to deal with two binary serialization formats: tar and erofs. tar-split is, funnily enough, not binary, but perhaps arguably should be... man, I hadn't looked at this in a while, but the choice of JSON with base64-encoded payloads is a bit wasteful... but eh, I guess compression still does well with it anyways.

Anyways, splitstream is a third binary format competing for mental bandwidth with both tar and erofs.
Splitstream, at its core, is really just tar-split: fs-verity content-addressed, in binary for performance reasons, with enough information for GC. A small addition is named references for things like OCI configs....

tar-split exists in the container space for a good reason.... and we've discussed this before. The canonical representation of the container depends on the ability to reconstruct the tar stream, and this isn't just for uploading: you need to be able to reconstruct that stream to validate it for local use as well...

You've mentioned a few times that you think canonical tar might be a better way here (so that we can rebuild the tar without storing the original tar stream in some fashion), but I'm not sure how we square that with the world as it is... So, splitstream...
We need some new features in systemd-repart and mkfs.ext4. We were pulling those from feature branches and commit IDs in the past, but these features are now available in stable releases. Build those release versions instead. Signed-off-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>
This patch is machine-generated. Signed-off-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>
This is only used from tests and not exported, so conditionalize it. Signed-off-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>
This wasn't yet stabilized when the code was first written but newer patches have already added the use of this function in other parts of the same file, so this ought to be safe by now. Signed-off-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>
This comment is an overview, so move it to a higher level. Change it a bit to make it more accurate. Also: move the declaration of the buffer outside of the loop body to avoid having to re-zero it each time. Signed-off-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>
We're going to start referring to OCI images by their names starting with `sha256:` (as podman names them in the `--iidfile`) soon. Skopeo doesn't like that, so add a workaround. This will soon let us get rid of some of the hacking about we do in our `examples/` build scripts. Signed-off-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>
Let's just have users write the padding as a separate inline section after they write the external data. This makes things a lot easier and reduces thrashing of the internal buffer. Signed-off-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>
Force-pushed 2a8007f → fd91a26
Sorry, remind me why one would need to do that outside of a push operation?
Canonical tar would be great, I think, but we definitely need to be able to store existing containers too. So let's set that aside.

Again, my point is that our EROFS is already a well-known binary format that we have userspace (and obviously kernel-side) parsers for, and that can contain named references to objects via fsverity digests. EROFS doesn't help with the "reconstruct exact original bits" case, but we can design a format that operates on the EROFS representation (in some ways this may be harder).

Or, a different approach here is to keep splitstreams mostly as-is, but just not make them the Source Of Truth for garbage collection.
Because the layers are referred to by their diffid from the config, which is the sha256 over their content, including the tar headers. When we construct the image we want to ensure that the file we're constructing from hasn't been tampered with, which means we need to measure it...
So splitstreams contain references to the objects, which means that we need some way to keep those objects around if the splitstream exists. But it's not the only story here: if you create an erofs image (eg. ...). GC on the erofs is unfortunately substantially more expensive, since we need to walk the tree to find and parse all the xattrs....
Note an interesting discussion in containers/skopeo#2750 right now about whether referring to containers with ...
I know we discussed later changing how the refs/* stuff works, but let's keep it as-is for now. And the use of refs/* is currently broken in this MR. For example, if you ...
```rust
}

fn read_range(file: &mut File, range: FileRange) -> Result<Vec<u8>> {
    let size: usize = (range.len().try_into())
        .context("Unable to allocate buffer for implausibly large splitstream section")?;
```
The saturating_sub in range.len() will make an illegal range (end < start) return an "ok" zero-size buffer here. Can we instead make it return an error? We're already returning a Result here, so that seems "cheap" to do.
I did that to match the semantics of ranges in Rust: (4..0).len() is 0.
I agree that this is undesirable for reading a file format, though: this is an error in the file format that we should detect. Will change it.
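The agreed-on fix could look roughly like this sketch. The error type and names here are made up for illustration (the real code uses anyhow); the point is that `checked_sub` turns an inverted range into an error instead of silently saturating to zero:

```rust
use std::fmt;

#[derive(Debug, Clone, Copy)]
struct FileRange {
    start: u64,
    end: u64,
}

// Hypothetical error type; the real code would wrap this in its usual
// Result/anyhow machinery.
#[derive(Debug, PartialEq)]
struct CorruptRange;

impl fmt::Display for CorruptRange {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "splitstream section has end < start")
    }
}

// end < start is a file-format error, not a zero-length read.
fn checked_len(range: FileRange) -> Result<u64, CorruptRange> {
    range.end.checked_sub(range.start).ok_or(CorruptRange)
}
```

This keeps Rust-range semantics out of the on-disk format: a file claiming an inverted section is detected as corrupt rather than read as empty.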
```rust
    pub object_refs: FileRange, // location of the object references array
    pub stream: FileRange,      // location of the zstd-compressed stream within the file
    pub named_refs: FileRange,  // location of the compressed named references
    pub content_type: U64,      // user can put whatever magic identifier they want there
```
I thought you were looking at removing the content type? I think we should do that given the new stream type prefixes.
There's one thing that makes me want to keep content type: some future "deep fsck" mode where we not only ensure basic connectivity at the repository level, but make sure that the splitstreams have the expected content. We need to know what kind of file we're dealing with in order to do that.
alexlarsson left a comment:
I looked mainly at the core splitstream and repo changes, not in detail at the oci changes. But overall I think this looks good. I thought you wanted to remove the content type, and I think we should. Also, I thought you wanted to add the ability to store the content identifier in the splitstream itself. I'm not sure how useful that is in practice, but it wouldn't hurt.
One question that comes up, though, is: do we ever want to support streaming writes of a splitstream? I.e. do we ever expect individual inline chunks to be large (in comparison with the total RAM size)? Because the file format as defined is not really amenable to streaming: we build up the inline_buffer in RAM before flushing it, and we can't flush early because we need to first write out the chunk size, and we can't back-patch the chunk size at flush time because the data is compressed.
We could make it streamable by doing a "virtual flush" when the inline buffer is at some max size. However, that will not be read back the same way, because we have APIs and code relying on the chunking being represented perfectly (for example read_inline_exact() and expected_bytes in ensure_chunk()). If we want to be streamable, yet exactly reproduce the chunk boundaries we need to be able to mark chunks as being virtual or real.
I think I broke this semi-intentionally... "refs" haven't really been used by anything yet... the entire idea is totally half-baked...

Alternatively, we could just not guarantee that chunking is 100% reproducible, but then some code needs changing to handle that.
This is definitely in scope and something that Colin and I have discussed in the past. I designed the file format specifically to make this possible (ie: we can move the metadata, refs and info sections to the end of the file after writing out the stream).
I don't consider this to be a problem. We absolutely can flush the stream manually, but I don't think we'll ever hit an amount of data that makes that necessary. Keep in mind that this is tar headers between file contents... maybe we'll have an extended header or a few parent directories or something....

My concern about the streaming is more on the level of downloading a large OCI layer with millions of files in it: we need to put the entire splitstream into RAM and fs-verity it before we start writing it out... that's the real problem. And even now: it's not such a big problem...

I already consider splitstreams to be a non-canonical/non-deterministic representation because of the zstd compression. I filed facebook/zstd#4173 about that a long time ago, but they turned it down...
"construct the image" = construct when and which image, precisely? The erofs... at pull time? The whole OCI config + layers... at push time? Isn't the whole point, again, of https://github.com/containers/composefs-rs/blob/main/doc/plans/oci-sealing-spec.md that we can construct the erofs and verify its integrity without also checksumming the entire tar?
In addition to that, there are discussions in this MR about things like version fields, extending headers, and moving the header to the end of the file when using streaming writes. These all make the splitstream digest non-canonical. The stable-between-systems identifier must be the external ID (basically, the stream content identity string), not the local fs-verity digest of the splitstream object.
But this assumes that all images we consume would have this label set on them... including base layers that we might want to build on, or images that we import to seal for ourselves...

There is an interesting argument here, I suppose: we don't care about the integrity of the OCI layers at all, because the only thing that ultimately matters is that the assembled image has the correct bits in it. By that logic, someone could (wild example) sneak a rootkit into one layer and delete it in another, and the checksum of the image in the end would still be the same... and indeed, the system image wouldn't contain the rootkit...

We also need to worry a bit about parsing untrusted data... you could corrupt a layer by playing with header bits (or whatever) in an attempt to exploit a bug in composefs-rs without changing the resulting image... Rust helps a whole lot here, but it's not perfect...

So ya, there's some very specific scenario in which what you're proposing might be OK. And we might even be able to make that "the normal case" in the future... but all told, I think I'm not quite ready to walk away from the idea that diffids are meaningful and should be validated...
This is a substantial change to the splitstream file format to add more
features (required for ostree support) and to add forwards- and
backwards- compatibility mechanisms for future changes. This change
aims to finalize the file format so we can start shipping this to the
systems of real users without future "breaks everything" changes.
This change itself breaks everything: you'll need to delete your
repository and start over. Hopefully this is the last time.
The file format is substantially more fleshed-out at this point. Here's
an overview of the changes:
- there is a header with a magic value, a version, a flags field, and the
  fs-verity algorithm number and block size in use
- everything else in the file can be freely located which will help if
we ever want to create a version of the writer that streams data to
disk as it goes: in that case we may want to store the stream before
the associated metadata
- there is an expandable "info" section which contains most other
information about the stream and is intended to be used as the primary
mechanism for making compatible changes to the file format in the
future
- the info section stores the total decompressed/reassembled stream
size and a unique identifier value for the file type stored in the
stream
- the referenced external objects and splitstreams are now stored in a
flat array of binary fs-verity hash values to improve the performance
of garbage collection operations in large repositories (informed by
Alex's battlescars from dealing with GC on Flathub)
- it is possible to add arbitrary external object and stream references
- the "sha256 mapping" has been replaced with a more flexible "named
stream refs" mechanism that allows assigning arbitrary names to
associated streams. This will be useful if we ever want to support
formats that are based on anything other than SHA-256 (including
future OCI versions which may start using SHA-512 or something else).
- whereas the previous implementation concerned itself with ensuring
the correct SHA-256 content hash of the stream and creating a link to
the stream with that hash value from the `streams/` directory, the new
implementation requires that the user perform whatever hashing they
consider appropriate and name their streams with a "content
identifier".
This change, taken together with the above change, removes all SHA-256
specific logic from the implementation.
The main reason for this change is that a SHA-256 content hash over a
file isn't a sufficiently unique identifier to locate the relevant
splitstream for that file. Each different file type is split into a
splitstream in a different way. It just so happens that OCI JSON
documents, `.tar` files, and GVariant OSTree commit objects have no
possible overlaps (which means that SHA-256 content hashes have
uniquely identified the files up to this point), but this is mostly a
coincidence. Each file type is now responsible for naming its streams
with a sufficiently unique "content identifier" based on the component
name, the file name, and a content hash, for example:
- `oci-commit-sha256:...`
- `oci-layer-sha256:...`
- `ostree-commit-...`
- &c.
Having the repository itself no longer care about the content hashes
means that the OCI code can now trust the SHA-256 verification
performed by skopeo, and we don't need to recompute it, which is a
nice win.
Update the file format documentation.
Update the repository code and the users of splitstream (ie: OCI) to
adjust to the post-sha256-hardcoded future.
Adjust the way we deal with verification of OCI objects when we lack
fs-verity digests: instead of having an "open" operation which verifies
everything and a "shallow open" which doesn't, just have the open
operation verify only the config and move the verification of the layers
to when we access them.
Co-authored-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>
The `fdatasync()` per written object thing is an unmitigated performance disaster and we need to get rid of it. It forces the filesystem to create thousands upon thousands of tiny commits.

I tried another approach where we would reopen and `fsync()` each object file referred to from a splitstream *after* all of the files were written, but before we wrote the splitstream data, but it was also quite bad.

syncfs() is a really, really dumb hammer, and it could get us into trouble if other users on the system have massive outstanding amounts of IO. On the other hand, it works: it's almost instantaneous, and is a massive performance improvement over what we were doing before. Let's just go with that for now. Maybe some day filesystems will have a mechanism for this which isn't horrible.

Signed-off-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>
Yes. I think this gets into an interesting topic of what security scanners do, but I think most of them operate on the final flattened image.

It does make sense to validate the diffid at the time we pull an image. And also in some kind of expensive ... But past that... I think there are two cases: ...
Force-pushed fd91a26 → 91a3579
New year, same old PR... I want to land this, preferably today. It's 99%. There's too much outstanding work and a lot of it is going to conflict. We shouldn't have such a large change out for so long...
> After the header comes a number of data blocks. Each block starts with a u64 le "size" field followed by some amount of data.
>
> The values are not in any particular order, but implementations should produce
Wouldn't it be useful to sort these so that checking "does a splitstream reference a particular object" could be done via binary search?
It probably doesn't matter that much in practice vs just reading in all of the references I guess, but...since we're already moving to a model where splitstreams are not generated incrementally, why not just require sorting this?
The reason for having it unsorted is that we want to write stable indexes into the stream as we go, and if we do a sorted insert (or sort at the end) then the indexes can change. Alex's first patch did sort this (and had a lookup function for a second pass at the end) for ostree, but he never tried to apply it to the content of the splitstream itself. I made the change when we did that.
> pub _flags: U16, // is currently always 0 (but ignored)
> pub algorithm: u8, // kernel fs-verity algorithm identifier (1 = sha256, 2 = sha512)
> pub lg_blocksize: u8, // log2 of the fs-verity block size (12 = 4k, 16 = 64k)
> pub info: FileRange, // can be used to expand/move the info section in the future
Given that FileRange is effectively a pointer, I think it'd be helpful to include at least pseudo-types for the values here, like `pub info: FileRange // (points to SplitStreamInfo)`.
That's an interesting suggestion. We might even add some phantom data and make it a proper generic. I'll take a look.
I tried to do this with proper generics but it turned into the "everything needs an objectid generic" thing because some of those ranges refer to the array of object IDs. I think I'm going to leave this one alone.
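For the record, the phantom-data idea discussed above might have looked something like this sketch (hypothetical names and layout; as noted, the real code keeps a plain `FileRange`):

```rust
use std::marker::PhantomData;

/// Hypothetical typed file range: an (offset, size) pair tagged with
/// the type of data it points at. The tag is purely compile-time:
/// PhantomData is zero-sized, so the on-disk layout stays 16 bytes.
#[repr(C)]
struct FileRange<T> {
    offset: u64, // little-endian on disk
    size: u64,   // little-endian on disk
    _points_to: PhantomData<T>,
}

/// Placeholder target type, standing in for the real info struct.
struct SplitstreamInfo;

// A header field would then document its target in the type itself:
//     pub info: FileRange<SplitstreamInfo>,
```

The drawback, as described in the thread, is that ranges pointing at the object-ID array would drag an object-ID generic parameter through everything.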
> struct SplitstreamInfo {
>     pub stream_refs: FileRange, // location of the stream references array
>     pub object_refs: FileRange, // location of the object references array
Per above, I think it'd be helpful to declare this something like `object_refs: FileRange // (points to ObjectReference[])`,
and declare here `type ObjectReference = u8[FsverityDigestSize]` or so.
> contains large chunks of embedded data
>
> - it's based on storing external objects in the composefs object store
> - in addition to the ability to split out chunks of file content (like files
Just noting here that this is the part I like the least: I personally would find it way clearer not to store manifests/configs as splitstreams. But if we (or, in some theoretical world, something else using splitstreams) decide not to do that, that's fine; it just stays empty here.
> struct SplitstreamInfo {
>     pub stream_refs: FileRange, // location of the stream references array
>     pub object_refs: FileRange, // location of the object references array
>     pub stream: FileRange, // location of the zstd-compressed stream within the file
Would we ever want to support having anything after this?
It might be cleaner to just define this as implicitly following, i.e. the structure is `[Header][Info][ObjectReferences][InternalReferences][Stream]`; then of course we don't need a FileRange for Info, we just need a u64 for its size.
But it's fine by me as-is too... it adds more flexibility to the layout, and at least FileRange is unambiguous, whereas a fixed layout is more implicit.
The idea is that Info might grow, indeed.
The idea with putting it at the end is that we will probably want to change the writer in the future to not write the entire file in a single go (which involves keeping it all in memory) but to stream it to the disk as we go... in that case it would be very valuable to be able to put the info struct at the end and update a single pointer in the header to point to it.
It's also just "one less concept" — we use FileRange for everything else, so why not here as well?
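The "stream it to disk, put the info struct at the end, and update a single pointer in the header" idea could be sketched roughly like this. This is a buffer-based illustration with made-up offsets, not the crate's actual writer:

```rust
/// Append the serialized info struct at the end of the file image and
/// patch the header's `info` FileRange (offset: u64-le, size: u64-le)
/// to point at it. `header_info_field_offset` is the byte offset of
/// that FileRange slot within the fixed-size header.
fn append_info(file: &mut Vec<u8>, info_bytes: &[u8], header_info_field_offset: usize) {
    let offset = file.len() as u64;
    let size = info_bytes.len() as u64;
    file.extend_from_slice(info_bytes);
    // Patch the two u64-le fields of the FileRange in the header.
    file[header_info_field_offset..header_info_field_offset + 8]
        .copy_from_slice(&offset.to_le_bytes());
    file[header_info_field_offset + 8..header_info_field_offset + 16]
        .copy_from_slice(&size.to_le_bytes());
}
```

Because only the header slot is rewritten, everything before the info struct can be streamed out once and never touched again.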
> - "External" blocks (`size == 0`): in this case the length of the data is 32
>   bytes. This is the binary form of a sha256 hash value and is a reference
>   to an object in the composefs repository (by its fs-verity digest).
>
> This section is also zstd-compressed, and is a number of nul-terminated text
It seems like compressing this is perhaps unnecessary...it's going to be a very small amount of text and otherwise uncompressible digests, right?
It's going to be hex strings, which would compress well. Alex mentioned this as well, but I don't think it makes a whole lot of difference either way... the only time those references matter to us is when we're accessing the content, and that's also compressed...

The reason I did it is that everything in the splitstream right now is either:

- header stuff (magic numbers, offsets, etc.)
- hashes in binary form (uncompressible)
- other data (which is compressed)

For the fast "do GC" thing we have separate flat binary arrays.
> pub algorithm: u8, // kernel fs-verity algorithm identifier (1 = sha256, 2 = sha512)
> pub lg_blocksize: u8, // log2 of the fs-verity block size (12 = 4k, 16 = 64k)
Also, I think it might be cleaner to fully sever splitstream's references to fs-verity, since it's not responsible itself for hashing any data.
I have a local patch (with the goal of adding more unit tests) that severs SplitStreamWriter's references to Repository, which relates to this.
I think all we need here is a `pub digest_identifier: u64` (much like content_type) that is an externally-defined number for the digest type in use.
We could probably declare a canonical mapping for things like classic sha256/512 and fs-verity at different algorithms and block sizes.
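The canonical mapping suggested here might look something like the following sketch. All names and numeric values are illustrative, not an agreed-upon spec:

```rust
/// Hypothetical externally-defined digest identifiers: one canonical
/// u64 per (algorithm, parameters) combination, stored in the header
/// much like content_type. Values here are made up for illustration.
#[repr(u64)]
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum DigestId {
    Sha256 = 1,
    Sha512 = 2,
    FsveritySha256Block4k = 3,
    FsveritySha512Block4k = 4,
}

/// The header would then carry just the identifier, with no
/// fs-verity-specific algorithm/blocksize fields.
struct HeaderSketch {
    digest_identifier: u64, // e.g. DigestId::FsveritySha256Block4k as u64
}
```

This would keep the splitstream format itself agnostic about how the digests it stores were computed.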
Ya! This is interesting! I was noticing earlier that we don't really need FsVerityHashValue here, for example, but rather a trait with many fewer constraints. I tried to make that change but it failed: the splitstream code calls into the repository code for storing the external objects, and that code does the hashing, so we definitely need a FsVerityHashValue here, at least for the time being. That being said, I have a vague long-term plan that we might have some sort of an ObjectStore trait that has a more flexible identifier...
@alexlarsson made this addition so I'd let him comment on it.
> That being said, I have a vague long term plan that we might have some sort of an ObjectStore trait that has a more flexible identifier...
Yep #202 adds such a trait, but needs more work to actually abstract the digest stuff - and would be more of a conflict fest here, so I didn't try hard on it. We can look at that once this lands.
allisonkarlitskaya left a comment
From me: let's land this. And deal with the conflicts in the other PRs. And then sort out CI which is broken independently of this.
For the record, I think it would have made sense to split the sync() thing out into a separate PR.
A regression in this was fixed in d04f9e2.
You're probably right about that. It was a bit of a yolo at the end there...
Related to the syncfs thing: one thing I did in ostree is that there's a concept of a "staging" dir, named after the boot-id, that contains objects which are not fsync'd. The idea is that instead of fdatasync per object or a global sync, we write everything into the staging dir, letting writeback do its thing over time. Then at some later point we could choose to persist, and AIUI calling fdatasync per object at that point would be relatively cheap, as the inodes should already be clean.

Think of it from the perspective of a user deploying a container or flatpak app: they install and run it, but if the power gets cut 15 seconds after downloading (before writeback), it's potentially OK in many cases if the app or update didn't persist, at least if nothing else depended on it.

I think we will eventually need to do this, as "queue OS upgrade while database is running" is a totally valid use case, and the sync() could cause unexpected latency for the database (though of course many production use cases will have distinct OS filesystems from data).
I tried doing this after the operation completed and it was still insanely slow. I do believe it would be fast to do it later, though, for some value of "later".
This changes the splitstream format a bit, with the goal of allowing splitstreams to support ostree files as well (see #144), but it is imho a generally nice change.
The primary differences are:
The ability to reference file objects in the repo even if they are not part of the splitstream "content" will be useful for the ostree support to reference file content objects.
This change also allows more efficient GC enumeration, because we don't have to parse the entire splitstream to find the referenced objects.
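The GC point can be illustrated concretely: because the object references live in a flat binary array located by a FileRange, a collector can enumerate everything a splitstream pins without decompressing or parsing the stream itself. A minimal sketch, with illustrative layout details (32-byte sha256 fs-verity digests, offsets supplied by the caller):

```rust
/// Sketch of the fast GC path: slice the flat object-references array
/// out of the file image via its (offset, size) range and split it
/// into 32-byte digests. No decompression of the stream is needed.
fn enumerate_object_refs(file: &[u8], refs_offset: usize, refs_size: usize) -> Vec<[u8; 32]> {
    file[refs_offset..refs_offset + refs_size]
        .chunks_exact(32)
        .map(|chunk| chunk.try_into().unwrap())
        .collect()
}
```

A collector would run this over every splitstream and union the results to get the live-object set, touching only a small slice of each file.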