Reducing overhead for payload encoding #63
Comments
IMO this needs to be fleshed out more.
|
Oh good call! As another alternative, the bundle format selected by in-toto, JSON lines, also offers a compression mode. |
Cc @dstufft who has an interest in this subject and has run a bunch of experiments |
These are my unedited notes when I was looking into this previously for another project:
I was specifically looking at TUF, and I was looking primarily at the snapshot role since that role will almost certainly be the largest file in my particular use case (TUF on PyPI), and I created a sort of torture test with ~500k delegations. You can see the actual files of that at dstufft/tuf-serialization (requires git-lfs) but the results basically end up looking like this:
I've gone ahead and added what I think this issue is proposing, which is serializing the DSSE payload as a UTF-8 text string when it's JSON. As you can see, the current DSSE serialization, when used with JSON, ends up producing a 101M TUF snapshot, but if you compress that it drops down to 36-45M. Without compression the ion binary encoding is the smallest at 55M; it's also the smallest compressed, but by a much smaller margin. The proposed UTF-8 encoding reduces the snapshot role from 101M down to 81M, and likewise there is a decrease in compressed size as well. Sorry for the brain dump, but hopefully this is useful in some way? |
Oh, one other note: a possibly interesting property here is that the proposed JSON + UTF-8 encoding, when compressed, isn't the absolute smallest in the snapshot test, but it's closer than the current JSON + Base64 is, and on the smaller root role test it's basically tied for the smallest. To me, that ends up representing a really nice trade-off, because the other serialization options, while generally available, are not nearly as ubiquitous as JSON. If you're worried about space constraints, the fact that wrapping the entire thing in any of the very ubiquitous compression algorithms brings this JSON + UTF-8 encoding scheme in line with the smallest of the other options is a strong incentive to use it.
One thing of note is that my tests above assume you're using the same serialization scheme for both the payload and the DSSE envelope, but you're only compressing the final output of the DSSE. Arguably you might want to compress just the payload, since it's bound to be the largest part of the DSSE output, and that would mean that you can validate the signatures prior to decompressing (decompressing first should be safe in DSSE, I think, but I'm always nervous around cryptography + compression). |
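To make the two options in the comment above concrete, here is a minimal sketch that builds one synthetic JSON payload and compares (1) compressing the whole envelope with the payload embedded as a UTF-8 string against (2) compressing only the payload and base64-encoding the result inside the envelope. Everything in it is an assumption for illustration: the data is synthetic (not the TUF snapshot from the experiments), the `payloadUtf8` field is only the name proposed later in this thread, the media types are hypothetical, and zlib stands in for whichever compressor one would actually use.

```python
import base64
import hashlib
import json
import zlib

# Synthetic stand-in for a large JSON payload (not real TUF or in-toto data).
payload_obj = {
    "subject": [
        {"name": f"file-{i}.bin",
         "digest": {"sha256": hashlib.sha256(str(i).encode()).hexdigest()}}
        for i in range(200)
    ]
}
payload = json.dumps(payload_obj).encode("utf-8")
PAYLOAD_TYPE = "application/vnd.example+json"  # hypothetical media type

# Option 1: embed the payload as a UTF-8 string ("payloadUtf8" is the field
# name proposed in this thread, not an accepted DSSE field), then compress
# the whole envelope.
whole_envelope = json.dumps(
    {"payloadType": PAYLOAD_TYPE,
     "payloadUtf8": payload.decode("utf-8"),
     "signatures": []}
).encode("utf-8")
compressed_whole = zlib.compress(whole_envelope, 9)

# Option 2: compress only the payload. The result is binary, so it has to be
# base64-encoded again to live inside a JSON envelope.
payload_only = json.dumps(
    {"payloadType": PAYLOAD_TYPE + "+zlib",  # hypothetical type marker
     "payload": base64.b64encode(zlib.compress(payload, 9)).decode("ascii"),
     "signatures": []}
).encode("utf-8")


def recover_whole(blob: bytes) -> bytes:
    """Decompress the envelope, then read the UTF-8 payload."""
    envelope = json.loads(zlib.decompress(blob))
    return envelope["payloadUtf8"].encode("utf-8")


def recover_payload_only(blob: bytes) -> bytes:
    """Parse the envelope first, then decompress just the payload."""
    envelope = json.loads(blob)
    return zlib.decompress(base64.b64decode(envelope["payload"]))


print("raw payload:            ", len(payload))
print("whole envelope, zlib:   ", len(compressed_whole))
print("payload-only zlib + b64:", len(payload_only))
```

On this synthetic input, the payload-only variant pays the base64 expansion on top of the compressed bytes, while the whole-envelope variant does not. The trade-off raised above still applies: with payload-only compression the envelope can be parsed before anything is decompressed.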
Excellent, thank you @dstufft! Before we decide on anything here, I'd love to see: (1) requirements, (2) constraints, and (3) numbers (like Donald produced) to back up findings. |
@dstufft: maybe this is a bit out of scope strictly for DSSE, but in the in-toto thread I described a rationale for why whole-DSSE compression (deferred to the upper-level context) could practically work out better than payload-only compression: in-toto has selected JSON Lines as the attestation bundle format, and a bundle will likely contain multiple attestations of various kinds, all for the same set of artifacts. So every in-toto statement's "subject" array will contain a copy of identical content (though possibly not in the same presentation, since in-toto has no ordering requirement for subjects and JSON dict serialization does not guarantee ordering). For some builds that generate 1000s of files or more, the "subject" array constitutes the bulk of the attestation size. If we compress the payload, or compress the DSSE envelope piece-wise:
If we were to embrace bundle level compression, I think the best DSSE can do is to preserve the original data as much as possible. To that end, UTF8 encoding works better than Base64 because there is less "clobbering" (that is, when the payload is already UTF8-compatible). For example, there are two attestations in a bundle, both for the same set of two files:
UTF8 encoding will allow a compression algorithm to easily discover the duplicated information; However, after Base64 encoding, the data will look like:
It would take a really smart compression algorithm to tell that the middle parts actually contain the same information. |
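The argument above can be checked with a small experiment. The sketch below builds two synthetic statements that share the same "subject" array but start it at different byte offsets, puts them in a two-line bundle, and compresses the bundle under each payload encoding. All of it is an assumption for illustration: the subjects and predicate types are made up, `payloadUtf8` is only the field name proposed in this thread, and zlib stands in for the bundle-level compressor.

```python
import base64
import hashlib
import json
import zlib

# The same set of subjects is shared by both attestations (synthetic data).
subjects = [
    {"name": f"out/file-{i}.o",
     "digest": {"sha256": hashlib.sha256(str(i).encode()).hexdigest()}}
    for i in range(50)
]


def statement(predicate_type: str) -> bytes:
    # predicateType is serialized before subject on purpose: its length
    # differs between the two statements, so the subject array starts at a
    # different byte offset in each one.
    return json.dumps({
        "_type": "https://in-toto.io/Statement/v1",
        "predicateType": predicate_type,  # hypothetical predicate types
        "subject": subjects,
        "predicate": {},
    }).encode("utf-8")


statements = [
    statement("https://example.com/build/v1"),
    statement("https://example.com/test/v1"),
]


def bundle(encode_payload) -> bytes:
    """Build a JSON Lines bundle: one envelope per line."""
    lines = [
        json.dumps({"payloadType": "application/vnd.in-toto+json",
                    **encode_payload(s),
                    "signatures": []})
        for s in statements
    ]
    return "\n".join(lines).encode("utf-8")


b64_bundle = bundle(lambda s: {"payload": base64.b64encode(s).decode("ascii")})
utf8_bundle = bundle(lambda s: {"payloadUtf8": s.decode("utf-8")})

b64_compressed = len(zlib.compress(b64_bundle, 9))
utf8_compressed = len(zlib.compress(utf8_bundle, 9))

print("base64 bundle:", len(b64_bundle), "->", b64_compressed)
print("UTF-8 bundle: ", len(utf8_bundle), "->", utf8_compressed)
```

Because base64 works on 3-byte groups, shifting identical input by one byte produces entirely different output characters, so the compressor cannot match the second copy of the subjects against the first. In the UTF-8 bundle the second copy is byte-for-byte identical text and can be encoded as a back-reference.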
@AdamZWu I would not conflate in-toto Bundle-level compression with in-toto Attestation-level compression, so if you're looking for the former, I argue that it is out of the scope of this project. |
@trishankatdatadog given the interactions between DSSE and in-toto, I think it is not completely out of scope. Since in-toto attestations (and bundles) are a major user of DSSE, I think helping in-toto reduce bundle size efficiently is in both specs' best interest. For that, my current thinking is to add a "UTF-8 encoding" mode for the payload, which allows minimal "clobbering" of a payload that is already UTF-8 encoded (which is always the case for in-toto). Doing so would allow bundle-level compression to discover data duplication across multiple envelopes more effectively (see my post above), and that duplication is one of the big sources of bloat in an attestation bundle. |
OK, I'm fairly convinced by the fact that the compressed UTF-8-encoded JSON is only ~5% larger than the compressed proto encoding, while compressed base64-encoded JSON is ~20-80% larger than compressed proto.
It would still be nice to gather a larger corpus of real-world data (not just two files) and do the comparison, but assuming it holds, then that's fairly compelling. I'm assuming we'd add a new field like this?
message Envelope {
  // Message to be signed.
  // REQUIRED.
  oneof payload_encoding {
    // Raw bytes. In JSON, this is encoded as base64.
    bytes payload = 1;
    // Unicode string, where the signed byte stream is the UTF-8 encoding. In JSON, this is a regular unicode string.
    string payloadUtf8 = 4;
  }
}
Note: The signature algorithm does not change at all, and existing signatures could be re-encoded with this new field without invalidating them. |
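The note about signatures surviving re-encoding follows from the fact that DSSE signs the pre-authentication encoding (PAE) of the payload type and the raw payload bytes, not the envelope's JSON text. The sketch below illustrates this under stated assumptions: `payloadUtf8` is the proposed field from the snippet above (not part of the accepted envelope), `reencode_as_utf8` is a hypothetical helper, and HMAC-SHA256 with a hard-coded key is only a stand-in for a real signature scheme.

```python
import base64
import hashlib
import hmac


def pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE pre-authentication encoding: the bytes that actually get signed."""
    type_bytes = payload_type.encode("utf-8")
    return b"DSSEv1 %d %b %d %b" % (
        len(type_bytes), type_bytes, len(payload), payload)


def payload_bytes(envelope: dict) -> bytes:
    """Recover the signed byte stream from either payload encoding."""
    if "payloadUtf8" in envelope:  # proposed field, assumption
        return envelope["payloadUtf8"].encode("utf-8")
    return base64.b64decode(envelope["payload"])


KEY = b"demo-key"  # stand-in secret; real DSSE uses proper signing keys


def sign(payload_type: str, payload: bytes) -> str:
    # HMAC is used here only so the sketch has no third-party dependencies.
    return hmac.new(KEY, pae(payload_type, payload), hashlib.sha256).hexdigest()


def verify(envelope: dict) -> bool:
    expected = sign(envelope["payloadType"], payload_bytes(envelope))
    return any(hmac.compare_digest(expected, s["sig"])
               for s in envelope["signatures"])


def reencode_as_utf8(envelope: dict) -> dict:
    """Hypothetical helper: move a base64 payload into the proposed field."""
    # Raises UnicodeDecodeError if the payload is not valid UTF-8.
    text = base64.b64decode(envelope["payload"]).decode("utf-8")
    out = {k: v for k, v in envelope.items() if k != "payload"}
    out["payloadUtf8"] = text
    return out


body = b'{"hello": "world"}'
original = {
    "payloadType": "application/vnd.example+json",  # hypothetical type
    "payload": base64.b64encode(body).decode("ascii"),
    "signatures": [{"sig": sign("application/vnd.example+json", body)}],
}
reencoded = reencode_as_utf8(original)
print(verify(original), verify(reencoded))
```

Both envelopes decode to the same byte stream, so the same signature entries verify against either one. The re-encoding is only possible when the payload is valid UTF-8; arbitrary binary payloads would still need the base64 field.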
Adam, do you want to put together a PR that implements this? I think you'd need to edit the proto, the envelope.md, and other references to the payload. (I don't think protocol needs to be updated.) The other question here is backwards compatibility. Old consumers won't be able to parse DSSEs that use the new field. I don't see any security concern, but I don't know how best to roll this out. (Edit: To clarify, I'm not saying this is accepted by DSSE, rather that it's helpful to have a concrete proposal that we can discuss for making a group decision.) |
Sounds good. I will put something up for review next week. :D |
Adam, are you arguing that Attestation-level compression will automatically help with Bundle-level compression? If so, then yes, I agree. However, the two levels of compression are distinct from each other. |
Not exactly. Yes, what DSSE does to the payload will definitely affect bundle-level compression performance. But it looks to me that the more complex the processing DSSE does, the worse bundle-level compression will probably perform, because complex mutations hide cross-envelope data duplication. So I think DSSE could offer a mode that does less, e.g. allowing UTF-8 encoding for the payload, so that a JSON-serialized in-toto statement, which is already UTF-8, can be presented pretty much unchanged (except for JSON string escapes). And yes, DSSE's payload encoding is completely orthogonal to in-toto attestation bundle compression. It's just that some encodings (e.g. UTF-8) are much friendlier to bundle-level compression than others (e.g. base64). |
Oh, I see, thanks for the clarification! Hmm, now I'm curious about how the Attestation-level choice of payload encoding would affect the efficacy of Bundle-level compression. As you suggest, UTF-8 should work better for this. |
The payload field is currently defined as base64 encoded data, which is a reasonable choice for holding arbitrary data.
However, when the payload content is already a well-formed text string, the 33% size increase induced by the base64 encoding starts to feel a bit costly (see in-toto/attestation#289).
Could DSSE offer an "unencoded" mode, where users can put a raw text string directly in the payload?
Or are there other alternatives / recommendations?
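For reference, the 33% figure comes from base64 mapping every 3 input bytes to 4 output characters. The sketch below checks that arithmetic and, for comparison, measures what embedding the same text directly as a JSON string would cost in escape characters. The sample statement is made up for illustration, and `base64_len` is a hypothetical helper name.

```python
import base64
import json


def base64_len(n: int) -> int:
    """Length of standard padded base64 output for n input bytes."""
    return 4 * ((n + 2) // 3)


# Made-up sample payload: a small JSON document that is already UTF-8 text.
text = json.dumps({
    "_type": "https://in-toto.io/Statement/v1",
    "subject": [{"name": "a.txt", "digest": {"sha256": "00" * 32}}],
})
raw = text.encode("utf-8")

b64 = base64.b64encode(raw)
assert len(b64) == base64_len(len(raw))

# Embedding the text directly as a JSON string still costs something:
# every double quote and backslash in the payload needs an escape character.
escaped_len = len(json.dumps(text)) - 2  # drop the surrounding quotes

print(f"raw bytes:       {len(raw)}")
print(f"base64:          {len(b64)} (+{len(b64) / len(raw) - 1:.0%})")
print(f"escaped string:  {escaped_len} (+{escaped_len / len(raw) - 1:.0%})")
```

The escape overhead depends on how quote-heavy the payload is, which is why an "unencoded" JSON-in-JSON payload is smaller than base64 but not free.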