Preparatory splitstream format changes for ostree support #185
Force-pushed 3803dc3 → 9aedd96
crates/composefs-oci/src/skopeo.rs (outdated):

```rust
use crate::{sha256_from_descriptor, sha256_from_digest, tar::split_async, ContentAndVerity};

pub const TAR_LAYER_CONTENT_TYPE: u64 = 0x2a037edfcae1ffea;
```
Can you add a comment for where these came from? I'm guessing random? If so just add a comment // Random unique ID ?
That said I wonder if it wouldn't be nicer to store (variable length) strings for this in the format? Maybe it could go all the way to literally suggested to be the mediaType from OCI (if applicable)
Yes. These are random.
But, I'd rather avoid having variable length things in the header. That makes parsing it much more tricky.
We could make it a real uuid tho
I'm fine keeping as u64 too.
I added some comments here
crates/composefs/src/repository.rs (outdated):

```rust
pub fn has_named_stream(&self, name: &str) -> bool {
    let stream_path = format!("streams/refs/{}", name);

    readlinkat(&self.repository, &stream_path, []).is_ok()
```
I don't like "swallowing" errors like this, I'd say call stat instead and require it's S_IFLNK
I redid this with stat()
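The stat()-based variant being discussed could look roughly like the sketch below. Note this is illustrative only: the real code operates on rustix file handles relative to the repository directory fd, whereas this sketch uses `std::fs` paths. `symlink_metadata()` is an lstat(), so it lets us require that the entry itself is a symlink (S_IFLNK) instead of swallowing every possible readlink error.

```rust
use std::fs;
use std::path::Path;

// Sketch only: the real code uses rustix relative to the repository fd.
// symlink_metadata() does not follow the link, so a regular file (or
// anything else) at that path is correctly reported as "no stream".
fn has_named_stream(repo: &Path, name: &str) -> bool {
    let stream_path = repo.join("streams/refs").join(name);
    match fs::symlink_metadata(&stream_path) {
        Ok(meta) => meta.file_type().is_symlink(),
        Err(_) => false, // ENOENT etc: the stream simply isn't there
    }
}
```

Usage is the same as before: `has_named_stream(&repo_path, "oci-layer-sha256:...")` returns true only when a symlink ref exists.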
crates/composefs/src/splitstream.rs (outdated):

```rust
#[derive(Clone, Debug, FromBytes, Immutable, IntoBytes, KnownLayout)]
#[repr(C)]
pub struct MappingEntry {
    pub body: Sha256Digest,
```
If we're changing the format here...I think it'd be nice to make this one extensible.
However...bigger picture there's another consideration: There's obviously a metric ton of binary serialization formats out there. A custom one isn't wrong necessarily but...how about say CBOR ? It has some usage and a proper RFC etc.
I guess a dividing line is "do we care about mmap()"? Probably not?
splitstreams are essentially thin wrappers of existing binary formats (tars, ostree objects, etc), adding just references to other composefs repo objects. I'm not sure it's overly helpful to use a complicated binary format for the wrapping, especially one which is completely different from the inner format.
That said, I agree that we should make it at least a bit extensible. This MR adds a magic header, but also adding a version field and a few bytes of unused/unparsed space does seem quite useful.
We discussed mmap, and the end result was, no, we don't want it.
> splitstreams are essentially thin wrappers of existing binary formats (tars, ostree objects, etc), adding just references to other composefs repo objects. I'm not sure it's overly helpful to use a complicated binary format for the wrapping, especially one which is completely different from the inner format.
Yeah, though the nice thing about Rust here is that for stuff like this there's a lot of well-done crates.
It also makes it a lot more obvious and easy to parse from other languages too if we can say "it's just CBOR" (or whatever).
Anyways: I'm basically fine with this as is too.
Although what about the algorithm agility? There's been some thoughts that for post-quantum crypto we may need to get away from sha256 in theory as far as I understand things.
I think the right thing is to add a header size field, and skip parts we don't understand. Then we can easily extend this later.
I added a new extension_size field which we skip on read.
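The skip-on-read behavior described here could be sketched like this (the `extension_size` name comes from the comment above; the helper itself is hypothetical, not the actual code):

```rust
use std::io::{Read, Result, Seek, SeekFrom};

// Forward-compat sketch: a reader that doesn't understand the extension
// bytes simply seeks past them, so newer writers can append fields to the
// header without breaking older readers.  Returns the new file position.
fn skip_extension<R: Read + Seek>(reader: &mut R, extension_size: u64) -> Result<u64> {
    reader.seek(SeekFrom::Current(extension_size as i64))
}
```

A reader would call this right after parsing the fixed-size part of the header, before moving on to the sections it does understand.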
Force-pushed 057121b → bed66dc
allisonkarlitskaya left a comment:
Let's talk about this. I might have a bit of bandwidth to work on this if you like.
doc/splitstream.md (outdated):

```rust
struct SplitstreamHeader {
    magic: [u8; 7],  // Contains SPLITSTREAM_MAGIC
    algorithm: u8,   // The fs-verity algorithm used, 1 == sha256, 2 == sha512
```
We should also do fs-verity block size here. That's usually expressed as a bit-shift count, so 12 or 16...
We could also write it like "fsverity-sha256-12" or so as a string... some relevant discussion in #181.
doc/splitstream.md (outdated):

```rust
    n_refs: u64,
    n_mappings: u64,
    refs: [ObjectID; n_refs],             // sorted
    mappings: [MappingEntry; n_mappings], // sorted by body
```
It's so sketch that we hardcode sha256 here... I think that's probably OK, but maybe we'd add an extension mechanism so we could add new types of mappings tables...
Something like this from the start:

- magic
- n_sections
- n_sections × (section_start, section_size_in_bytes)
We could name the sections but I think it's quite OK to just know what the numbers mean and require that they're all present, in order. An empty section would be denoted by a zero size.
We could also get into compat vs. uncompat extensions... not sure how far it's worth going here...
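The sections-table proposal above could be sketched as follows. Everything here is illustrative (the field layout, the little-endian encoding, the names): sections are identified purely by position, all must be present and in order, and an empty section has size zero.

```rust
// Hypothetical on-disk layout: magic (u64), n_sections (u64), then one
// (start, size) pair of u64s per section, all little-endian.
#[derive(Debug, PartialEq)]
struct Section {
    start: u64,
    size: u64, // 0 denotes an empty (but still present) section
}

fn parse_sections(buf: &[u8]) -> Option<Vec<Section>> {
    let u64_at = |off: usize| -> Option<u64> {
        Some(u64::from_le_bytes(buf.get(off..off + 8)?.try_into().ok()?))
    };
    let _magic = u64_at(0)?; // a real parser would verify this
    let n = u64_at(8)? as usize;
    let mut sections = Vec::with_capacity(n);
    for i in 0..n {
        let off = 16 + i * 16;
        sections.push(Section { start: u64_at(off)?, size: u64_at(off + 8)? });
    }
    Some(sections)
}
```

Because readers know what each position means, adding a new section later only requires bumping n_sections; old readers can refuse (or ignore the tail, depending on the compat rules chosen).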
We should get this in probably very soon and then I think declare the format stable?

I have a branch..... lol

One of the things that I'm tormenting myself on a bit right now is the sha256 mapping. I'm considering changing it to a general-purpose "named object reference" mapping: we could then have a hashmap mapping names like "sha256:12345" to the object ID in question, and adding sha512 would be seamless. I think I'd choose to encode that as a nul-separated series of strings.

The alternative is to stay with what we have now, more or less, but it's much less flexible and is gonna be cruft one day, I'm sure. That being said: we have an extensibility mechanism now, at least...

The other thing that really needs fixing in @alexlarsson's work vs. the current version of the branch is that we should really take advantage of the fact that we have the references array out front now and use indexes into it from the splitstream content instead of repeating the whole object ID. It would have the additional advantage of ensuring that it becomes physically impossible to refer to an object that wasn't listed in the "depends" header (which we'll use during GC), and yet another advantage in that we could use the 64-bit "starting word" for each internal/external section in the stream for both cases (or with the high bit as a flag or whatever). It's just "work" to get this over the line....

After that, I think we need to figure out a way to kill off the content-sha256-based naming of splitstreams and perhaps even consider getting rid of the streams/ directory entirely... each backend would have its own way of 'caching' interesting splitstream objects for itself. I'm not entirely sure how I'd do that for the OCI backend.

In a related conversation with @Johan-Liebert1 we discussed having the layers (and possibly the config) ((and possibly possibly the manifest some day)) referenced from the erofs image itself as a way to optionally prevent those things from being GC'd. We'd probably want some sort of a better "lookup table" still, though... but the key difference is that this table would be unique to OCI, not some global "streams" directory that we try to pretend is sharable by everyone on equal footing...
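The "one 64-bit starting word for both cases, with the high bit as a flag" idea could be sketched like this. The flag choice and names are purely illustrative, not part of the actual format:

```rust
// High bit distinguishes the two kinds of sections: clear = inline chunk
// (payload is the size in bytes), set = external reference (payload is an
// index into the references array at the front of the file).
const EXTERNAL_FLAG: u64 = 1 << 63;

enum Chunk {
    Inline { size: u64 },
    External { ref_index: u64 },
}

fn encode_inline(size: u64) -> u64 {
    assert!(size & EXTERNAL_FLAG == 0);
    size
}

fn encode_external(ref_index: u64) -> u64 {
    assert!(ref_index & EXTERNAL_FLAG == 0);
    EXTERNAL_FLAG | ref_index
}

fn decode(word: u64) -> Chunk {
    if word & EXTERNAL_FLAG != 0 {
        Chunk::External { ref_index: word & !EXTERNAL_FLAG }
    } else {
        Chunk::Inline { size: word }
    }
}
```

A side effect of the index encoding is exactly the guarantee mentioned above: a stream physically cannot name an object that isn't in the references array, because all it can store is an index into that array.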
Force-pushed bed66dc → bcafffa
```rust
use hex::FromHexError;
use sha2::{digest::FixedOutputReset, digest::Output, Digest, Sha256, Sha512};
use std::cmp::Ord;
```
I would move this change to a separate commit
Actually, I think we can drop this. We're not sorting the objects arrays anymore.
I took a look at this, and the file format looks ok to me. I don't think the content type is needed anymore with the new structured names, though.
```rust
let config = if verity.is_none() {
    // No verity means we need to verify the content hash
    let mut data = vec![];
    stream.read_to_end(&mut data)?;
```
Maybe we could use std::io::copy for the hashing?
Not a bad idea, although it means we can't use our existing hash() function...
I tried to do this but I don't think it's a good idea: we need the data for two things, the hashing and the creation of the ImageConfiguration. Reading it once into a buffer lets us do both.
Force-pushed c4b6ae7 → 65ade39
.github/workflows/bootc-revdep.yml (outdated):

```diff
 - name: Build sealed bootc image
   working-directory: bootc
-  run: just build-sealed
+  run: just build
```
Sorry, it's actually `just variant=composefs-sealeduki-sdboot build` now.
Thanks!
I just ran into this too and then remembered it from here. I threw up bootc-dev/bootc#1787 to propose bringing back the old target as a wrapper.
```rust
struct FileRange {
    start: U64,
```
Why is this U uppercase?
That's from zerocopy: it's a little endian u64.
Are you asking for a clarifying note in the docs? That's a good suggestion. I actually thought I had added one...
The thing I'll raise here again is that I'm still uncertain about the "centrality" of splitstream for composefs-rs; this was previously discussed in #14 (comment).

To continue and flesh that out, I still feel it'd be a lot easier to understand if what we always parsed was EROFS images, and splitstream was just a lookaside that could help map from the EROFS back bit-for-bit, only for heavyweight serializations like tar, and would only be used when we actually needed to return back a tar. Storing small data objects like manifests and config JSON would just be saved bit-for-bit as-is.
Basically everyone dealing with composefs+OCI needs to deal with two binary serialization formats: tar and erofs. tar-split is, funnily enough, not binary, but perhaps arguably should be... man, I hadn't looked at this in a while, but the choice of JSON with base64-encoded payloads is a bit wasteful... but eh, I guess compression still does well with it anyways.

Anyways, splitstream is a third binary format competing for mental bandwidth with both tar and erofs.
Splitstream, at its core, is really just tar-split: fs-verity content-addressed, in binary for performance reasons, with enough information for GC. A small addition is named references for things like OCI configs....

tar-split exists in the container space for a good reason.... and we've discussed this before. The canonical representation of the container depends on the ability to reconstruct the tar stream, and this isn't just for uploading: you need to be able to reconstruct that stream to validate it for local use as well...

You've mentioned a few times that you think canonical tar might be a better way here (so that we can rebuild the tar without storing the original tar stream in some fashion), but I'm not sure how we square that with the world as it is... So, splitstream...
We need some new features in systemd-repart and mkfs.ext4. We were pulling those from feature branches and commit IDs in the past, but these features are now available in stable releases. Build those release versions instead. Signed-off-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>
This patch is machine-generated. Signed-off-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>
This is only used from tests and not exported, so conditionalize it. Signed-off-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>
This wasn't yet stabilized when the code was first written but newer patches have already added the use of this function in other parts of the same file, so this ought to be safe by now. Signed-off-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>
This comment is an overview, so move it to a higher level. Change it a bit to make it more accurate. Also: move the declaration of the buffer outside of the loop body to avoid having to re-zero it each time. Signed-off-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>
We're going to start referring to OCI images by their names starting with `sha256:` (as podman names them in the `--iidfile`) soon. Skopeo doesn't like that, so add a workaround. This will soon let us get rid of some of the hacking about we do in our `examples/` build scripts. Signed-off-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>
Let's just have users write the padding as a separate inline section after they write the external data. This makes things a lot easier and reduces thrashing of the internal buffer. Signed-off-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>
Force-pushed 2a8007f → fd91a26
Sorry, remind me why one would need to do that outside of a push operation?
Canonical tar would be great, I think, but we definitely need to be able to store existing containers too. So let's set that aside.

Again, my point is that our EROFS is already a well-known binary format that we have userspace (and obviously kernel-side) parsers for, and that can contain named references to objects via fsverity digests. EROFS doesn't help with the "reconstruct exact original bits" case, but we can design a format that operates on the EROFS representation (in some ways this may be harder).

Or, a different approach here is to keep splitstreams mostly as-is, but just not make them the Source Of Truth for garbage collection.
Because the layers are referred to by their diffid from the config, which is the sha256 over their content, including the tar headers. When we construct the image we want to ensure that the file we're constructing from hasn't been tampered with, which means we need to measure it...
So splitstreams contain references to the objects, which means that we need some way to keep those objects around if the splitstream exists. But it's not the only story here: if you create an erofs image (eg. ...). GC on the erofs is unfortunately substantially more expensive, since we need to walk the tree to find and parse all the xattrs....
Note an interesting discussion in containers/skopeo#2750 right now about whether referring to containers with ...
I know we discussed later changing how the refs/* stuff works, but let's keep it as-is for now. And the use of refs/* is currently broken in this MR. For example, if you ...
```rust
}

fn read_range(file: &mut File, range: FileRange) -> Result<Vec<u8>> {
    let size: usize = (range.len().try_into())
        .context("Unable to allocate buffer for implausibly large splitstream section")?;
```
The saturating_sub in range.len() will make an illegal range (end < start) return an "ok" zero-size buffer here. Can we instead make it return an error? We're already returning a Result here, so that seems "cheap" to do.
I did that to match the semantics of ranges in Rust: (4..0).len() is 0.
I agree that this is undesirable for reading a file format, though: this is an error in the file format that we should detect. Will change it.
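The agreed-on fix could look roughly like this sketch. The error type and names here are made up for illustration (the real code uses anyhow); the point is that `checked_sub` turns an inverted range into an error instead of silently saturating to zero:

```rust
use std::fmt;

#[derive(Debug, Clone, Copy)]
struct FileRange {
    start: u64,
    end: u64,
}

// Hypothetical error type; the real code would wrap this in its usual
// Result/anyhow machinery.
#[derive(Debug, PartialEq)]
struct CorruptRange;

impl fmt::Display for CorruptRange {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "splitstream section has end < start")
    }
}

// end < start is a file-format error, not a zero-length read.
fn checked_len(range: FileRange) -> Result<u64, CorruptRange> {
    range.end.checked_sub(range.start).ok_or(CorruptRange)
}
```

This keeps Rust-range semantics out of the on-disk format: a file claiming an inverted section is detected as corrupt rather than read as empty.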
```rust
    pub object_refs: FileRange, // location of the object references array
    pub stream: FileRange,      // location of the zstd-compressed stream within the file
    pub named_refs: FileRange,  // location of the compressed named references
    pub content_type: U64,      // user can put whatever magic identifier they want there
```
I thought you were looking at removing the content type? I think we should do that given the new stream type prefixes.
There's one thing that makes me want to keep content type: some future "deep fsck" mode where we not only ensure basic connectivity at the repository level, but make sure that the splitstreams have the expected content. We need to know what kind of file we're dealing with in order to do that.
alexlarsson left a comment:
I looked mainly at the core splitstream and repo changes, not in detail at the oci changes. But overall I think this looks good. I thought you wanted to remove the content type, and I think we should. Also, I thought you wanted to add the ability to store the content identifier in the splitstream itself. I'm not sure how useful that is in practice, but it wouldn't hurt.
One question that comes up, though, is: do we ever want to support streaming writes of a splitstream? I.e. do we ever expect individual inline chunks to be large (in comparison with the total RAM size)? Because the file format as defined is not really amenable to streaming: we build up the inline_buffer in RAM before flushing it, and we can't flush early because we need to first write out the chunk size, and we can't back-patch the chunk size at flush time because the data is compressed.
We could make it streamable by doing a "virtual flush" when the inline buffer is at some max size. However, that will not be read back the same way, because we have APIs and code relying on the chunking being represented perfectly (for example read_inline_exact() and expected_bytes in ensure_chunk()). If we want to be streamable, yet exactly reproduce the chunk boundaries we need to be able to mark chunks as being virtual or real.
I think I broke this semi-intentionally... "refs" haven't really been used by anything yet... the entire idea is totally half-baked...

Alternatively, we could just not guarantee that chunking is 100% reproducible, but then some code needs changing to handle that.
This is definitely in scope and something that Colin and I have discussed in the past. I designed the file format specifically to make this possible (ie: we can move the metadata, refs and info sections to the end of the file after writing out the stream).
I don't consider this to be a problem. We absolutely can flush the stream manually, but I don't think we'll ever hit an amount of data that makes that necessary. Keep in mind that this is tar headers between file contents... maybe we'll have an extended header or a few parent directories or something....

My concern about the streaming is more on the level of downloading a large OCI layer with millions of files in it: we need to put the entire splitstream into RAM and fs-verity it before we start writing it out... that's the real problem. And even now: it's not such a big problem...

I already consider splitstreams to be a non-canonical/non-deterministic representation because of the zstd compression. I filed facebook/zstd#4173 about that a long time ago, but they turned it down...
"construct the image" = construct when and which image, precisely? The erofs... at pull time? The whole OCI config + layers... at push time? Isn't the whole point, again, of https://github.com/containers/composefs-rs/blob/main/doc/plans/oci-sealing-spec.md that we can construct the erofs and verify its integrity without also checksumming the entire tar?
In addition to that, there are discussions in this MR about things like version fields, extending headers, and moving the header to the end of the file when using streaming writes. These all make the splitstream digest non-canonical. The stable-between-systems identifier must be the external ID (basically, the stream content identity string), not the local fs-verity digest of the splitstream object.
But this assumes that all images we consume would have this label set on them... including base layers that we might want to build on, or images that we import to seal for ourselves...

There is an interesting argument here, I suppose: we don't care about the integrity of the OCI layers at all, because the only thing that ultimately matters is that the assembled image has the correct bits in it. By that logic, someone could (wild example) sneak a rootkit into one layer and delete it in another, and the checksum of the image in the end would still be the same... and indeed, the system image wouldn't contain the rootkit...

We also need to worry a bit about parsing untrusted data... you could corrupt a layer by playing with header bits (or whatever) in an attempt to exploit a bug in composefs-rs without changing the resulting image... Rust helps a whole lot here, but it's not perfect...

So ya, there's some very specific scenario in which what you're proposing might be OK. And we might even be able to make that "the normal case" in the future... but all told, I think I'm not quite ready to walk away from the idea that diffids are meaningful and should be validated...
This is a substantial change to the splitstream file format to add more
features (required for ostree support) and to add forwards- and
backwards- compatibility mechanisms for future changes. This change
aims to finalize the file format so we can start shipping this to the
systems of real users without future "breaks everything" changes.
This change itself breaks everything: you'll need to delete your
repository and start over. Hopefully this is the last time.
The file format is substantially more fleshed-out at this point. Here's
an overview of the changes:
- there is a header with a magic value, a version, a flags field, and the
  fs-verity algorithm number and block size in use
- everything else in the file can be freely located which will help if
we ever want to create a version of the writer that streams data to
disk as it goes: in that case we may want to store the stream before
the associated metadata
- there is an expandable "info" section which contains most other
information about the stream and is intended to be used as the primary
mechanism for making compatible changes to the file format in the
future
- the info section stores the total decompressed/reassembled stream
size and a unique identifier value for the file type stored in the
stream
- the referenced external objects and splitstreams are now stored in a
flat array of binary fs-verity hash values to improve the performance
of garbage collection operations in large repositories (informed by
Alex's battlescars from dealing with GC on Flathub)
- it is possible to add arbitrary external object and stream references
- the "sha256 mapping" has been replaced with a more flexible "named
stream refs" mechanism that allows assigning arbitrary names to
associated streams. This will be useful if we ever want to support
formats that are based on anything other than SHA-256 (including
future OCI versions which may start using SHA-512 or something else).
- whereas the previous implementation concerned itself with ensuring
the correct SHA-256 content hash of the stream and creating a link to
the stream with that hash value from the `streams/` directory, the new
implementation requires that the user perform whatever hashing they
consider appropriate and name their streams with a "content
identifier".
This change, taken together with the above change, removes all SHA-256
specific logic from the implementation.
The main reason for this change is that a SHA-256 content hash over a
file isn't a sufficiently unique identifier to locate the relevant
splitstream for that file. Each different file type is split into a
splitstream in a different way. It just so happens that OCI JSON
documents, `.tar` files, and GVariant OSTree commit objects have no
possible overlaps (which means that SHA-256 content hashes have
uniquely identified the files up to this point), but this is mostly a
coincidence. Each file type is now responsible for naming its streams
with a sufficiently unique "content identifier" based on the component
name, the file name, and a content hash, for example:
- `oci-commit-sha256:...`
- `oci-layer-sha256:...`
- `ostree-commit-...`
- &c.
Having the repository itself no longer care about the content hashes
means that the OCI code can now trust the SHA-256 verification
performed by skopeo, and we don't need to recompute it, which is a
nice win.
Update the file format documentation.
Update the repository code and the users of splitstream (ie: OCI) to
adjust to the post-sha256-hardcoded future.
Adjust the way we deal with verification of OCI objects when we lack
fs-verity digests: instead of having an "open" operation which verifies
everything and a "shallow open" which doesn't, just have the open
operation verify only the config and move the verification of the layers
to when we access them.
Co-authored-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>
The `fdatasync()` per written object thing is an unmitigated performance disaster and we need to get rid of it. It forces the filesystem to create thousands upon thousands of tiny commits.

I tried another approach where we would reopen and `fsync()` each object file referred to from a splitstream *after* all of the files were written, but before we wrote the splitstream data, but it was also quite bad.

syncfs() is a really, really dumb hammer, and it could get us into trouble if other users on the system have massive outstanding amounts of IO. On the other hand, it works: it's almost instantaneous, and is a massive performance improvement over what we were doing before. Let's just go with that for now. Maybe some day filesystems will have a mechanism for this which isn't horrible.

Signed-off-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>
Yes. I think this gets into an interesting topic of what security scanners do, but I think most of them operate on the final flattened image.

It does make sense to validate the diffid at the time we pull an image. And also in some kind of expensive ... But past that... I think there are two cases: ...
Force-pushed fd91a26 → 91a3579
New year, same old PR... I want to land this, preferably today. It's 99%. There's too much outstanding work and a lot of it is going to conflict. We shouldn't have such a large change out for so long...
> After the header comes a number of data blocks. Each block starts with a u64 le "size" field followed by some amount of data.
>
> The values are not in any particular order, but implementations should produce
Wouldn't it be useful to sort these so that checking "does a splitstream reference a particular object" could be done via binary search?
It probably doesn't matter that much in practice vs just reading in all of the references I guess, but...since we're already moving to a model where splitstreams are not generated incrementally, why not just require sorting this?
The reason for having it unsorted is that we want to write stable indexes into the stream as we go, and if we do a sorted insert (or sort at the end) then the indexes can change. Alex's first patch did sort this (and had a lookup function for a second pass at the end) for ostree, but he never tried to apply it to the content of the splitstream itself. I made the change when we did that.
> pub _flags: U16, // is currently always 0 (but ignored)
> pub algorithm: u8, // kernel fs-verity algorithm identifier (1 = sha256, 2 = sha512)
> pub lg_blocksize: u8, // log2 of the fs-verity block size (12 = 4k, 16 = 64k)
> pub info: FileRange, // can be used to expand/move the info section in the future
Given that FileRange is effectively a pointer, I think it'd be helpful to include at least pseudo-types for the values here, like `pub info: FileRange // (points to SplitStreamInfo)`.
That's an interesting suggestion. We might even add some phantom data and make it a proper generic. I'll take a look.
I tried to do this with proper generics but it turned into the "everything needs an objectid generic" thing because some of those ranges refer to the array of object IDs. I think I'm going to leave this one alone.
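For the record, the phantom-data idea discussed above might have looked something like this sketch (hypothetical names and layout; as noted, the real code keeps a plain `FileRange`):

```rust
use std::marker::PhantomData;

/// Hypothetical typed file range: an (offset, size) pair tagged with
/// the type of data it points at. The tag is purely compile-time:
/// PhantomData is zero-sized, so the on-disk layout stays 16 bytes.
#[repr(C)]
struct FileRange<T> {
    offset: u64, // little-endian on disk
    size: u64,   // little-endian on disk
    _points_to: PhantomData<T>,
}

/// Placeholder target type, standing in for the real info struct.
struct SplitstreamInfo;

// A header field would then document its target in the type itself:
//     pub info: FileRange<SplitstreamInfo>,
```

The drawback, as described in the thread, is that ranges pointing at the object-ID array would drag an object-ID generic parameter through everything.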
> struct SplitstreamInfo {
>     pub stream_refs: FileRange, // location of the stream references array
>     pub object_refs: FileRange, // location of the object references array
Per above, I think it'd be helpful to declare this something like `object_refs: FileRange // (points to ObjectReference[])`,
and declare here `type ObjectReference = u8[FsverityDigestSize]` or so.
> contains large chunks of embedded data
>
> - it's based on storing external objects in the composefs object store
> - in addition to the ability to split out chunks of file content (like files
Just noting here that this is the part I like the least: I personally would find it way clearer not to store manifests/configs as splitstreams. But if we (or, in some theoretical world, something else using splitstreams) decide not to do that, that's fine; it just stays empty here.
> struct SplitstreamInfo {
>     pub stream_refs: FileRange, // location of the stream references array
>     pub object_refs: FileRange, // location of the object references array
>     pub stream: FileRange, // location of the zstd-compressed stream within the file
Would we ever want to support having anything after this?
It might be cleaner to just define this as implicitly following, i.e. the structure is `[Header][Info][ObjectReferences][InternalReferences][Stream]`; then of course we don't need a FileRange for Info, we just need a u64 for its size.
But it's fine by me as-is too... it adds more flexibility to the layout, and at least FileRange is unambiguous, whereas a fixed layout is more implicit.
The idea is that Info might grow, indeed.
The idea with putting it at the end is that we will probably want to change the writer in the future to not write the entire file in a single go (which involves keeping it all in memory) but to stream it to the disk as we go... in that case it would be very valuable to be able to put the info struct at the end and update a single pointer in the header to point to it.
It's also just "one less concept" — we use FileRange for everything else, so why not here as well?
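The "stream it to disk, put the info struct at the end, and update a single pointer in the header" idea could be sketched roughly like this. This is a buffer-based illustration with made-up offsets, not the crate's actual writer:

```rust
/// Append the serialized info struct at the end of the file image and
/// patch the header's `info` FileRange (offset: u64-le, size: u64-le)
/// to point at it. `header_info_field_offset` is the byte offset of
/// that FileRange slot within the fixed-size header.
fn append_info(file: &mut Vec<u8>, info_bytes: &[u8], header_info_field_offset: usize) {
    let offset = file.len() as u64;
    let size = info_bytes.len() as u64;
    file.extend_from_slice(info_bytes);
    // Patch the two u64-le fields of the FileRange in the header.
    file[header_info_field_offset..header_info_field_offset + 8]
        .copy_from_slice(&offset.to_le_bytes());
    file[header_info_field_offset + 8..header_info_field_offset + 16]
        .copy_from_slice(&size.to_le_bytes());
}
```

Because only the header slot is rewritten, everything before the info struct can be streamed out once and never touched again.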
> - "External" blocks (`size == 0`): in this case the length of the data is 32
>   bytes. This is the binary form of a sha256 hash value and is a reference
>   to an object in the composefs repository (by its fs-verity digest).
>
> This section is also zstd-compressed, and is a number of nul-terminated text
It seems like compressing this is perhaps unnecessary...it's going to be a very small amount of text and otherwise uncompressible digests, right?
It's going to be hex strings, which would compress well. Alex mentioned this as well, but I don't think it makes a whole lot of difference either way... the only time those references matter to us is when we're accessing the content, and that's also compressed...

The reason I did it is that everything in the splitstream right now is either:

- header stuff (magic numbers, offsets, etc.)
- hashes in binary form (uncompressible)
- other data (which is compressed)

For the fast "do GC" thing we have separate flat binary arrays.
> pub algorithm: u8, // kernel fs-verity algorithm identifier (1 = sha256, 2 = sha512)
> pub lg_blocksize: u8, // log2 of the fs-verity block size (12 = 4k, 16 = 64k)
Also, I think it might be cleaner to fully sever splitstream's references to fs-verity, since it's not responsible itself for hashing any data.
I have a local patch (with the goal of adding more unit tests) that severs SplitStreamWriter's references to Repository, which relates to this.
I think all we need here is a `pub digest_identifier: u64` (much like content_type) that is an externally-defined number for the digest type in use.
We could probably declare a canonical mapping for things like classic sha256/512 and fs-verity at different algorithms and block sizes.
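The canonical mapping suggested here might look something like the following sketch. All names and numeric values are illustrative, not an agreed-upon spec:

```rust
/// Hypothetical externally-defined digest identifiers: one canonical
/// u64 per (algorithm, parameters) combination, stored in the header
/// much like content_type. Values here are made up for illustration.
#[repr(u64)]
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum DigestId {
    Sha256 = 1,
    Sha512 = 2,
    FsveritySha256Block4k = 3,
    FsveritySha512Block4k = 4,
}

/// The header would then carry just the identifier, with no
/// fs-verity-specific algorithm/blocksize fields.
struct HeaderSketch {
    digest_identifier: u64, // e.g. DigestId::FsveritySha256Block4k as u64
}
```

This would keep the splitstream format itself agnostic about how the digests it stores were computed.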
Ya! This is interesting! I was noticing earlier that we don't really need FsVerityHashValue here, for example, but rather a trait with many fewer constraints. I tried to make that change but it failed: the splitstream code calls into the repository code for storing the external objects, and that code does the hashing, so we definitely need a FsVerityHashValue here, at least for the time being. That being said, I have a vague long-term plan that we might have some sort of an ObjectStore trait that has a more flexible identifier...
@alexlarsson made this addition so I'd let him comment on it.
> That being said, I have a vague long term plan that we might have some sort of an ObjectStore trait that has a more flexible identifier...
Yep #202 adds such a trait, but needs more work to actually abstract the digest stuff - and would be more of a conflict fest here, so I didn't try hard on it. We can look at that once this lands.
allisonkarlitskaya left a comment
From me: let's land this. And deal with the conflicts in the other PRs. And then sort out CI which is broken independently of this.
For the record, I think it would have made sense to split the sync() thing out into a separate PR.
A regression in this was fixed in d04f9e2.
You're probably right about that. It was a bit of a yolo at the end there...
Related to the syncfs thing: one thing I did in ostree is that there's a concept of a "staging" dir, named after the boot-id, that contains objects which are not fsync'd. The idea is that instead of fdatasync per object or a global sync, we write everything into the staging dir, letting writeback do its thing over time. Then at some later point we could choose to persist, and AIUI calling fdatasync per object at that point would be relatively cheap, as the inodes should already be clean.

Think of it from the perspective of a user deploying a container or flatpak app: they install and run it, but if the power gets cut 15 seconds after downloading (before writeback), it's potentially OK in many cases if the app or update didn't persist, at least if nothing else depended on it.

I think we will eventually need to do this, as "queue OS upgrade while database is running" is a totally valid use case, and the sync() could cause unexpected latency for the database (though of course many production use cases will have distinct OS filesystems from data).
I tried doing this after the operation completed and it was still insanely slow. I do believe it would be fast to do it later, though, for some value of "later".
This changes the splitstream format a bit, with the goal of allowing splitstreams to support ostree files as well (see #144), but it is imho a generally nice change.
The primary differences are:
The ability to reference file objects in the repo even if they are not part of the splitstream "content" will be useful for the ostree support to reference file content objects.
This change also allows more efficient GC enumeration, because we don't have to parse the entire splitstream to find the referenced objects.
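The GC point can be illustrated concretely: because the object references live in a flat binary array located by a FileRange, a collector can enumerate everything a splitstream pins without decompressing or parsing the stream itself. A minimal sketch, with illustrative layout details (32-byte sha256 fs-verity digests, offsets supplied by the caller):

```rust
/// Sketch of the fast GC path: slice the flat object-references array
/// out of the file image via its (offset, size) range and split it
/// into 32-byte digests. No decompression of the stream is needed.
fn enumerate_object_refs(file: &[u8], refs_offset: usize, refs_size: usize) -> Vec<[u8; 32]> {
    file[refs_offset..refs_offset + refs_size]
        .chunks_exact(32)
        .map(|chunk| chunk.try_into().unwrap())
        .collect()
}
```

A collector would run this over every splitstream and union the results to get the live-object set, touching only a small slice of each file.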