Reducing overhead for payload encoding #63

Open
AdamZWu opened this issue Oct 5, 2023 · 15 comments
Comments

@AdamZWu commented Oct 5, 2023

The payload field is currently defined as base64 encoded data, which is a reasonable choice for holding arbitrary data.

However, when the payload content is already a well-formed text string, the ~33% size increase from base64 encoding starts to feel costly (see in-toto/attestation#289).

Could DSSE offer an "unencoded" mode, where users can put a raw text string directly in the payload?

Or are there other alternatives / recommendations?

@MarkLodato (Collaborator)

IMO this needs to be fleshed out more.

  • For real-world data, how much better is UTF8 vs base64?
  • What is a real-world example where this extra size overhead matters? The linked issue mentions "some complex artifacts may generate 1GB provenance data" but doesn't provide details. I feel that if something is generating a 1GB SLSA Provenance, it's doing it wrong. (Then again, a 1GB SBOM might make sense.) An alternative would be to have the big payload "detached", which has other trade-offs.
  • How does the proposal compare to compression under the following scenarios? (See the rough sketch after this list.)
    • Compressing the whole DSSE as-is (base64)
    • Compressing the whole DSSE using the proposal (UTF8)
    • Compressing before base64 (which is invasive since the payloadType would change)
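
As a rough illustration of the three scenarios above (a sketch only: the statement contents, the dummy signature, and the payloadRaw field name are made up for this example, not part of any spec, so the printed numbers are merely indicative):

import base64
import gzip
import json

# Placeholder in-toto-style statement used as the DSSE payload.
statement = json.dumps({
    "_type": "https://in-toto.io/Statement/v1",
    "subject": [{"name": f"file{i}", "digest": {"sha256": 64 * "a"}} for i in range(1000)],
    "predicateType": "https://example.com/predicate/v1",
    "predicate": {},
}).encode()

def envelope(field, value):
    # Minimal DSSE-shaped JSON envelope; the signature is a dummy placeholder.
    return json.dumps({
        "payloadType": "application/vnd.in-toto+json",
        field: value,
        "signatures": [{"keyid": "key1", "sig": base64.b64encode(64 * b"\x00").decode()}],
    }).encode()

# 1. Whole DSSE as-is: payload is base64-encoded.
as_is = envelope("payload", base64.b64encode(statement).decode())
# 2. Whole DSSE under the proposal: payload carried as a UTF-8 JSON string
#    ("payloadRaw" is a hypothetical field name).
utf8 = envelope("payloadRaw", statement.decode())
# 3. Payload compressed before base64 (the payloadType would have to change).
pre = envelope("payload", base64.b64encode(gzip.compress(statement)).decode())

for name, blob in [("1. base64", as_is), ("2. utf8", utf8), ("3. pre-gzip", pre)]:
    print(f"{name:12} raw={len(blob):>8}  gzip={len(gzip.compress(blob)):>8}")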

@AdamZWu (Author) commented Oct 17, 2023

Oh good call!

As another alternative, the bundle format selected by in-toto, JSON Lines, also offers a compression mode.

@trishankatdatadog (Collaborator)

Cc @dstufft, who has an interest in this subject and has run a bunch of experiments.

@dstufft commented Oct 17, 2023

These are my unedited notes when I was looking into this previously for another project:

I've only skimmed the DSSE repository, but it looks like using DSSE turns all of the TUF metadata "opaque" anyway, by wrapping the signed data into a base64 blob that gets stuck in an envelope (could be JSON, could be something else).

If that's the case, there's very little reason to stick with JSON, since the primary benefit is human readability, and we should definitely explore more compact, typically binary, options. We could try CBOR, Protobuf, Smile, BSON, MessagePack, or Ion and see what kind of results we get.

We could also consider compression here as well.

TUF is defined in a serialization-independent format, which allows particular applications to select the serialization format that makes the most sense for them, so we need to settle on one that makes sense for us.

Our Constraints and Goals:

  • Serialized size should be as small as reasonably possible.
  • Deserialization must be both memory and CPU efficient.
  • Deserialization must be able to be done in pure Python or using the standard library.
  • For convenience, we are ignoring how signatures themselves are constructed for this.
  • We cannot assume that we have control over the deployed version of either the serializer or the deserializer, and we need to maintain as much forwards and backwards compatibility as possible.
  • We can consider compression of the serialized content.

Options considered (a rough size-comparison sketch follows this list):

  • “Canonical JSON”, as the format initially used by TUF.
    • Pro: Can be implemented using nothing but the standard library.
    • Pro: Fully human readable.
    • Pro: No schema to synchronize between consumers and producers.
    • Con: Uses a canonicalization scheme for signing, which is more fragile than traditional signing envelopes.
  • DSSE + JSON
    • Pro: Can be implemented using nothing but the standard library.
    • Pro: No schema to synchronize between consumers and producers.
    • Pro: Traditional signing envelope that is misuse resistant.
    • Con: Envelope treats the inner payload as binary, which makes the non-envelope content opaque until after opening the envelope.
  • DSSE + MessagePack
    • Pro: No schema to synchronize between consumers and producers.
    • Pro: Traditional signing envelope that is misuse resistant.
    • Pro: Binary serialization, can represent binary values without using base64 encoding.
    • Con: Requires a library with a C extension for speed, but it does have a Pure Python fallback available.
  • DSSE + Ion:
    • Pro: No schema to synchronize between consumers and producers.
    • Pro: Traditional signing envelope that is misuse resistant.
    • Pro: Binary and Text serialization are both available, allowing tuning between human readable or binary compactness.
    • Con: Requires a library with a C extension for speed, but it does have a pure Python fallback.
    • Con: Packaging appears to fetch the library by downloading it from a URL in the setup.py.
  • DSSE + CBOR:
    • Pro: No schema to synchronize between consumers and producers.
    • Pro: Traditional signing envelope that is misuse resistant.
    • Pro: Binary serialization, can represent binary values without using base64 encoding.
    • Con: Requires a library with a C extension for speed, but it does have a Pure Python fallback available.
  • DSSE + Protobuf
    • Pro: Traditional signing envelope that is misuse resistant.
    • Pro: Binary serialization, can represent binary values without using base64 encoding.
    • Pro: Able to be backwards and forwards compatible.
    • Con: Compatibility requires being somewhat careful when evolving the schema.
    • Con: Requires distributing a schema (in the form of a .proto file) to producers and consumers.
    • Con: Requires a library with a C extension for speed, but it does have a Pure Python fallback available.

I was specifically looking at TUF, and I was looking primarily at the snapshot role since that role will almost certainly be the largest file in my particular use case (TUF on PyPI), and I created a sort of torture test with ~500k delegations. You can see the actual files of that at dstufft/tuf-serialization (requires git-lfs) but the results basically end up looking like this:

❯ ls -lhR output
output/root:
Permissions Size User    Group   Date Modified    Git Name
.rw-r--r--  3.5k dstufft dstufft 2023-10-17 10:52  --  root.canonical.json
.rw-r--r--  1.2k dstufft dstufft 2023-10-17 10:52  --  root.canonical.json.br
.rw-r--r--  1.4k dstufft dstufft 2023-10-17 10:52  --  root.canonical.json.gz
.rw-r--r--  1.4k dstufft dstufft 2023-10-17 10:52  --  root.canonical.json.xz
.rw-r--r--  1.3k dstufft dstufft 2023-10-17 10:52  --  root.canonical.json.zst
.rw-r--r--  3.0k dstufft dstufft 2023-10-17 10:52  --  root.dsse.cbor
.rw-r--r--  1.2k dstufft dstufft 2023-10-17 10:52  --  root.dsse.cbor.br
.rw-r--r--  1.4k dstufft dstufft 2023-10-17 10:52  --  root.dsse.cbor.gz
.rw-r--r--  1.4k dstufft dstufft 2023-10-17 10:52  --  root.dsse.cbor.xz
.rw-r--r--  1.4k dstufft dstufft 2023-10-17 10:52  --  root.dsse.cbor.zst
.rw-r--r--  2.8k dstufft dstufft 2023-10-17 10:52  --  root.dsse.ionb
.rw-r--r--  1.3k dstufft dstufft 2023-10-17 10:52  --  root.dsse.ionb.br
.rw-r--r--  1.5k dstufft dstufft 2023-10-17 10:52  --  root.dsse.ionb.gz
.rw-r--r--  1.4k dstufft dstufft 2023-10-17 10:52  --  root.dsse.ionb.xz
.rw-r--r--  1.5k dstufft dstufft 2023-10-17 10:52  --  root.dsse.ionb.zst
.rw-r--r--  4.0k dstufft dstufft 2023-10-17 10:52  --  root.dsse.iont
.rw-r--r--  2.2k dstufft dstufft 2023-10-17 10:52  --  root.dsse.iont.br
.rw-r--r--  2.3k dstufft dstufft 2023-10-17 10:52  --  root.dsse.iont.gz
.rw-r--r--  2.4k dstufft dstufft 2023-10-17 10:52  --  root.dsse.iont.xz
.rw-r--r--  2.3k dstufft dstufft 2023-10-17 10:52  --  root.dsse.iont.zst
.rw-r--r--  4.1k dstufft dstufft 2023-10-17 10:52  --  root.dsse.json
.rw-r--r--  2.2k dstufft dstufft 2023-10-17 10:52  --  root.dsse.json.br
.rw-r--r--  2.3k dstufft dstufft 2023-10-17 10:52  --  root.dsse.json.gz
.rw-r--r--  2.4k dstufft dstufft 2023-10-17 10:52  --  root.dsse.json.xz
.rw-r--r--  2.4k dstufft dstufft 2023-10-17 10:52  --  root.dsse.json.zst
.rw-r--r--  3.5k dstufft dstufft 2023-10-17 10:52  --  root.dsse.jsont
.rw-r--r--  1.3k dstufft dstufft 2023-10-17 10:52  --  root.dsse.jsont.br
.rw-r--r--  1.4k dstufft dstufft 2023-10-17 10:52  --  root.dsse.jsont.gz
.rw-r--r--  1.4k dstufft dstufft 2023-10-17 10:52  --  root.dsse.jsont.xz
.rw-r--r--  1.4k dstufft dstufft 2023-10-17 10:52  --  root.dsse.jsont.zst
.rw-r--r--  3.0k dstufft dstufft 2023-10-17 10:52  --  root.dsse.msgpack
.rw-r--r--  1.2k dstufft dstufft 2023-10-17 10:52  --  root.dsse.msgpack.br
.rw-r--r--  1.4k dstufft dstufft 2023-10-17 10:52  --  root.dsse.msgpack.gz
.rw-r--r--  1.4k dstufft dstufft 2023-10-17 10:52  --  root.dsse.msgpack.xz
.rw-r--r--  1.4k dstufft dstufft 2023-10-17 10:52  --  root.dsse.msgpack.zst
.rw-r--r--  2.6k dstufft dstufft 2023-10-17 10:52  --  root.dsse.proto
.rw-r--r--  1.2k dstufft dstufft 2023-10-17 10:52  --  root.dsse.proto.br
.rw-r--r--  1.3k dstufft dstufft 2023-10-17 10:52  --  root.dsse.proto.gz
.rw-r--r--  1.3k dstufft dstufft 2023-10-17 10:52  --  root.dsse.proto.xz
.rw-r--r--  1.3k dstufft dstufft 2023-10-17 10:52  --  root.dsse.proto.zst
.rw-r--r--  2.7k dstufft dstufft 2023-10-17 10:52  --  root.dsse.sionb
.rw-r--r--  1.2k dstufft dstufft 2023-10-17 10:52  --  root.dsse.sionb.br
.rw-r--r--  1.4k dstufft dstufft 2023-10-17 10:52  --  root.dsse.sionb.gz
.rw-r--r--  1.4k dstufft dstufft 2023-10-17 10:52  --  root.dsse.sionb.xz
.rw-r--r--  1.4k dstufft dstufft 2023-10-17 10:52  --  root.dsse.sionb.zst
.rw-r--r--  4.0k dstufft dstufft 2023-10-17 10:52  --  root.dsse.siont
.rw-r--r--  2.2k dstufft dstufft 2023-10-17 10:52  --  root.dsse.siont.br
.rw-r--r--  2.3k dstufft dstufft 2023-10-17 10:52  --  root.dsse.siont.gz
.rw-r--r--  2.4k dstufft dstufft 2023-10-17 10:52  --  root.dsse.siont.xz
.rw-r--r--  2.3k dstufft dstufft 2023-10-17 10:52  --  root.dsse.siont.zst

output/snapshot:
Permissions Size User    Group   Date Modified    Git Name
.rw-r--r--   75M dstufft dstufft 2023-10-17 10:52  --  snapshot.canonical.json
.rw-r--r--   32M dstufft dstufft 2023-10-17 10:54  --  snapshot.canonical.json.br
.rw-r--r--   37M dstufft dstufft 2023-10-17 10:52  --  snapshot.canonical.json.gz
.rw-r--r--   33M dstufft dstufft 2023-10-17 10:53  --  snapshot.canonical.json.xz
.rw-r--r--   33M dstufft dstufft 2023-10-17 10:55  --  snapshot.canonical.json.zst
.rw-r--r--   65M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.cbor
.rw-r--r--   31M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.cbor.br
.rw-r--r--   37M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.cbor.gz
.rw-r--r--   33M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.cbor.xz
.rw-r--r--   34M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.cbor.zst
.rw-r--r--   55M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.ionb
.rw-r--r--   30M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.ionb.br
.rw-r--r--   36M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.ionb.gz
.rw-r--r--   32M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.ionb.xz
.rw-r--r--   33M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.ionb.zst
.rw-r--r--   94M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.iont
.rw-r--r--   37M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.iont.br
.rw-r--r--   45M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.iont.gz
.rw-r--r--   36M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.iont.xz
.rw-r--r--   40M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.iont.zst
.rw-r--r--  101M dstufft dstufft 2023-10-17 10:55  --  snapshot.dsse.json
.rw-r--r--   37M dstufft dstufft 2023-10-17 10:58  --  snapshot.dsse.json.br
.rw-r--r--   45M dstufft dstufft 2023-10-17 10:55  --  snapshot.dsse.json.gz
.rw-r--r--   36M dstufft dstufft 2023-10-17 10:56  --  snapshot.dsse.json.xz
.rw-r--r--   40M dstufft dstufft 2023-10-17 10:59  --  snapshot.dsse.json.zst
.rw-r--r--   81M dstufft dstufft 2023-10-17 10:59  --  snapshot.dsse.jsont
.rw-r--r--   32M dstufft dstufft 2023-10-17 11:02  --  snapshot.dsse.jsont.br
.rw-r--r--   37M dstufft dstufft 2023-10-17 10:59  --  snapshot.dsse.jsont.gz
.rw-r--r--   33M dstufft dstufft 2023-10-17 11:00  --  snapshot.dsse.jsont.xz
.rw-r--r--   33M dstufft dstufft 2023-10-17 11:03  --  snapshot.dsse.jsont.zst
.rw-r--r--   65M dstufft dstufft 2023-10-17 11:03  --  snapshot.dsse.msgpack
.rw-r--r--   31M dstufft dstufft 2023-10-17 11:06  --  snapshot.dsse.msgpack.br
.rw-r--r--   37M dstufft dstufft 2023-10-17 11:03  --  snapshot.dsse.msgpack.gz
.rw-r--r--   33M dstufft dstufft 2023-10-17 11:03  --  snapshot.dsse.msgpack.xz
.rw-r--r--   34M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.msgpack.zst
.rw-r--r--   57M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.proto
.rw-r--r--   31M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.proto.br
.rw-r--r--   37M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.proto.gz
.rw-r--r--   32M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.proto.xz
.rw-r--r--   35M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.proto.zst
.rw-r--r--   55M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.sionb
.rw-r--r--   30M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.sionb.br
.rw-r--r--   36M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.sionb.gz
.rw-r--r--   32M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.sionb.xz
.rw-r--r--   33M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.sionb.zst
.rw-r--r--   94M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.siont
.rw-r--r--   37M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.siont.br
.rw-r--r--   45M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.siont.gz
.rw-r--r--   36M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.siont.xz
.rw-r--r--   40M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.siont.zst

I've gone ahead and added what I think this issue is proposing, which is serializing the DSSE payload as a UTF-8 text string when it's JSON (that's .jsont). I've left the signatures themselves encoded as base64 in this case since they are proper binary data (it's possible there's a more compact encoding than base64, but there are only a few signatures, so it's pretty low value to find a better serialization for them).

As you can see, the current DSSE serialization, when used with JSON, ends up producing a 101M TUF snapshot, but if you compress that it drops down to 36-45M. Without compression the Ion binary encoding is the smallest at 55M; it's also the smallest compressed, but by a much smaller margin.

The proposed utf8 encoding reduces the snapshot role from 101M down to 81M, and likewise there is a decrease in compressed size as well.

Sorry for the brain dump, but hopefully this is useful in some way?

@dstufft commented Oct 17, 2023

Oh, one other note: a possibly interesting property here is that the proposed JSON + UTF-8 encoding, when compressed, isn't the absolute smallest in the snapshot test, but it's closer than the current JSON + base64, and on the smaller root role test it's basically tied for the smallest.

To me, that represents a really nice trade-off: the other serialization options, while generally available, are not nearly as ubiquitous as JSON, and if you're worried about space constraints, the fact that wrapping the entire thing in any of the very ubiquitous compression algorithms brings this JSON + UTF-8 encoding scheme in line with the smallest of the other options is a strong incentive to use it.

One thing of note is that my tests above assume you're using the same serialization scheme for both the payload and the DSSE envelope, but you're only compressing the final output of the DSSE. Arguably you might want to compress just the payload since it's bound to be the largest part of the DSSE output, and that would mean you can validate the signatures prior to decompressing (decompressing first should be safe in DSSE, I think, but I'm always nervous around cryptography + compression).
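
A sketch of that payload-only-compression variant, signing the compressed bytes so verification can happen before decompression (HMAC stands in for a real signature algorithm, the payloadType media type is made up, and real DSSE signs the PAE over payloadType and payload, which is elided here for brevity):

import base64
import gzip
import hashlib
import hmac
import json

KEY = b"demo-key"  # stand-in for a real signing key

def sign(message: bytes) -> bytes:
    # HMAC as a placeholder for the actual signature scheme.
    return hmac.new(KEY, message, hashlib.sha256).digest()

statement = json.dumps({"_type": "https://in-toto.io/Statement/v1", "subject": []}).encode()

# Compress the payload first; the signature covers the compressed bytes.
compressed = gzip.compress(statement)
envelope = {
    "payloadType": "application/vnd.example+json+gzip",  # hypothetical media type
    "payload": base64.b64encode(compressed).decode(),
    "signatures": [{"keyid": "k", "sig": base64.b64encode(sign(compressed)).decode()}],
}

# A consumer can verify before ever decompressing.
received = base64.b64decode(envelope["payload"])
expected = base64.b64decode(envelope["signatures"][0]["sig"])
assert hmac.compare_digest(sign(received), expected)
statement_back = gzip.decompress(received)  # only decompress after verification
assert statement_back == statement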

@trishankatdatadog (Collaborator)

Excellent, thank you @dstufft! Before we decide on anything here, I'd love to see: (1) requirements, (2) constraints, and (3) numbers (like Donald produced) to back up findings.

@AdamZWu (Author) commented Oct 18, 2023

Arguably you might want to compress just the payload since it's bound to be the largest part of the DSSE output

@dstufft: maybe this is a bit out of scope, strictly speaking, for DSSE, but in the in-toto thread I described a rationale for why whole-DSSE compression (deferred to the upper-level context) could practically work out better than payload-only compression: in-toto has selected JSON Lines as the attestation bundle format, and a bundle will likely contain multiple attestations of various kinds, all for the same set of artifacts. So every in-toto statement's "subject" array will contain a copy of identical content (though perhaps not in the same presentation, since in-toto has no ordering requirement for subjects, nor does JSON dict serialization guarantee ordering).

(And for some builds that generate thousands of files or more, the "subject" array constitutes the bulk of the attestation size.)

If we compress the payload, or compress the DSSE envelope piece-wise:

  1. The compression will not have visibility into the "subject" duplication across other envelopes' payloads;
  2. The resulting compressed data may be scrambled such that the "subject" duplication is no longer evident at the bundle level, preventing bundle-level compression from achieving an effective reduction.

If we were to embrace bundle-level compression, I think the best DSSE can do is preserve the original data as much as possible. To that end, UTF-8 encoding works better than base64 because there is less "clobbering" (when the payload is already UTF-8-compatible, that is).

For example, suppose there are two attestations in a bundle, both for the same set of two files:

  • The first attestation lists the subject:
    [{"name":"main", "digest":{"sha256":"1234abcd...."}},
     {"name":"aux_dat.xml", "digest":{"sha256":"cdef5678...."}}]`
    
  • The second attestation lists the subject (in the other order):
    [{"name":"aux_dat.xml", "digest":{"sha256":"cdef5678...."}},
     {"name":"main", "digest":{"sha256":"1234abcd...."}}]`
    

UTF-8 encoding will allow a compression algorithm to easily discover the duplicated information; after base64 encoding, however, the data will look like:

  • The first attestation subjects:
    W3sibmFtZSI6Im
    1haW4iLCAiZGlnZXN0Ijp7InNoYTI1NiI6IjEyMzRhYmNkLi4uLiJ9fSx7Im5hbWUiOiJhdXhfZGF0LnhtbC
    IsICJkaWdlc3QiOnsic2hhMjU2IjoiY2RlZjU2NzguLi4uIn19XQo=
    
  • The second attestation subjects:
    W3sibmFtZSI6Im
    F1eF9kYXQueG1sIiwgImRpZ2VzdCI6eyJzaGEyNTYiOiJjZGVmNTY3OC4uLi4ifX0seyJuYW1lIjoibWFpbi
    IsICJkaWdlc3QiOnsic2hhMjU2IjoiMTIzNGFiY2QuLi4uIn19XQo=
    

It would take a really smart compression algorithm to tell that the middle parts actually contain the same information.
(I did a quick test throwing the original subjects and the base64-encoded ones at gzip, xz, and lzma; in all cases the compressed base64 is ~2x the size of the compressed original, indicating those algorithms failed to detect the underlying redundancy -- probably not a very realistic test, just to illustrate the problem 😄.)
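
That quick test can be reproduced with something like this sketch (using the toy subjects above and gzip only; absolute numbers at this size are not meaningful, only the relative difference):

import base64
import gzip
import json

subjects_a = [{"name": "main", "digest": {"sha256": 8 * "1234abcd"}},
              {"name": "aux_dat.xml", "digest": {"sha256": 8 * "cdef5678"}}]
subjects_b = list(reversed(subjects_a))  # same content, different order

# Two "attestations" side by side: once as raw UTF-8 JSON, once base64-encoded (as DSSE does today).
raw = b"\n".join(json.dumps(s).encode() for s in (subjects_a, subjects_b))
b64 = b"\n".join(base64.b64encode(json.dumps(s).encode()) for s in (subjects_a, subjects_b))

print("utf-8 :", len(raw), "->", len(gzip.compress(raw)))
print("base64:", len(b64), "->", len(gzip.compress(b64)))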

@trishankatdatadog (Collaborator)

@AdamZWu I would not conflate in-toto Bundle-level compression with in-toto Attestation-level compression, so if you're looking for the former, I argue that it is out of the scope of this project.

@AdamZWu (Author) commented Oct 27, 2023

@trishankatdatadog given the interactions between DSSE and in-toto, I think it is not completely out of scope.

Since in-toto attestations (and bundles) are a major application of DSSE, I think helping in-toto efficiently reduce bundle size is in both specs' best interest. To that end, my current thinking is to add a "UTF-8 encoding" mode for the payload, which introduces minimal "clobbering" for a payload that is already UTF-8 encoded (which is always the case for in-toto).

Doing so would allow bundle-level compression to more effectively discover data duplication across multiple envelopes (see my post above), which is one of the big sources of bloat in an attestation bundle.

@MarkLodato (Collaborator)

OK, I'm fairly convinced by the fact that the compressed UTF-8-encoded JSON is only ~5% larger than the compressed proto encoding, while compressed base64-encoded JSON is ~20-80% larger than compressed proto.

root.dsse.json.gz    2.3k  (~75% overhead)
root.dsse.jsont.gz   1.4k  (~5% overhead)
root.dsse.proto.gz   1.3k

snapshot.dsse.json.gz   45M  (~20% overhead)
snapshot.dsse.jsont.gz  37M  (~0% overhead)
snapshot.dsse.proto.gz  37M

It would still be nice to gather a larger corpus of real-world data (not just two files) and do the comparison, but assuming it holds, that's fairly compelling.

I'm assuming we'd add a new field like this?

message Envelope {
  // Message to be signed.
  // REQUIRED.
  oneof payload_encoding {
    // Raw bytes. In JSON, this is encoded as base64.
    bytes payload = 1;
    // Unicode string, where the signed byte stream is the UTF-8 encoding. In JSON, this is a regular unicode string.
    string payloadUtf8 = 4;
  }
}

Note: The signature algorithm does not change at all, and existing signatures could be re-encoded with this new field without invalidating them.
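
To spell out why re-encoding preserves signatures: the signature is computed over the PAE of the payload type and the raw payload bytes, and those bytes are identical whether the envelope carries them base64-encoded or as a UTF-8 string. A minimal sketch (PAE as defined in protocol.md; payloadUtf8 is the field proposed above; no real key, just showing the signed byte stream is unchanged):

import base64

def pae(payload_type: str, payload: bytes) -> bytes:
    # DSSE v1 pre-authentication encoding:
    # "DSSEv1" SP LEN(type) SP type SP LEN(payload) SP payload
    t = payload_type.encode()
    return b"DSSEv1 %d %b %d %b" % (len(t), t, len(payload), payload)

payload_type = "application/vnd.in-toto+json"
statement = b'{"_type": "https://in-toto.io/Statement/v1"}'

old_envelope = {"payloadType": payload_type,
                "payload": base64.b64encode(statement).decode()}
new_envelope = {"payloadType": payload_type,
                "payloadUtf8": statement.decode()}

# Both envelopes yield the same signed byte stream, so existing signatures stay valid.
assert pae(payload_type, base64.b64decode(old_envelope["payload"])) == \
       pae(payload_type, new_envelope["payloadUtf8"].encode())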

@MarkLodato (Collaborator) commented Oct 27, 2023

Adam, do you want to put together a PR that implements this? I think you'd need to edit the proto, the envelope.md, and other references to the payload. (I don't think protocol needs to be updated.)

The other question here is backwards compatibility. Old consumers won't be able to read DSSEs with the new field. I don't see any security concern, but I don't know how best to roll this out.

(Edit: To clarify, I'm not saying this is accepted by DSSE, rather that it's helpful to have a concrete proposal that we can discuss for making a group decision.)

@AdamZWu (Author) commented Oct 27, 2023

Sounds good. I will put something up for review next week. :D

@trishankatdatadog (Collaborator)

Doing so would allow bundle-level compression to more effectively discover data duplication across multiple envelopes (see my post above), which is one of the big sources of bloat in an attestation bundle.

Adam, are you arguing that Attestation-level compression will automatically help with Bundle-level compression? If so, then yes, I agree. However, the two levels of compression are distinct from each other.

@AdamZWu (Author) commented Oct 27, 2023

Doing so would allow bundle-level compression to more effectively discover data duplication across multiple envelopes (see my post above), which is one of the big sources of bloat in an attestation bundle.

Adam, are you arguing that Attestation-level compression will automatically help with Bundle-level compression? If so, then yes, I agree. However, the two levels of compression are distinct from each other.

Not exactly.

Yes, what DSSE does to the payload will definitely affect bundle-level compression performance. But it looks to me like the more complex the processing DSSE does, the worse the bundle-level compression will probably perform, because complex mutations will hide cross-envelope data duplication. So I think DSSE could offer a mode that does less, e.g. allowing UTF-8 encoding for the payload, so that a JSON-serialized in-toto statement, which is already UTF-8, can be presented pretty much unchanged (except for JSON string escapes).

And yes, DSSE's payload encoding is completely orthogonal to in-toto attestation bundle compression; it's just that some encodings (e.g. UTF-8) are much friendlier to bundle-level compression than others (e.g. base64).

@trishankatdatadog (Collaborator)

Yes, what DSSE does to the payload will definitely affect bundle-level compression performance. But it looks to me like the more complex the processing DSSE does, the worse the bundle-level compression will probably perform, because complex mutations will hide cross-envelope data duplication. So I think DSSE could offer a mode that does less, e.g. allowing UTF-8 encoding for the payload, so that a JSON-serialized in-toto statement, which is already UTF-8, can be presented pretty much unchanged (except for JSON string escapes).

Oh, I see, thanks for the clarification! Hmm, now I'm curious about how the Attestation-level choice of payload encoding would affect the efficacy of Bundle-level compression. As you suggest, UTF-8 should work better for this.
