v3 codec structure in zarr.json #298

d-v-b · 2024-05-31T11:59:04Z

The v3 spec states that codecs are stored in a JSON array under the key codecs. But the spec also states that the list of codecs is structured:

...the list of codecs must be of the following form:

zero or more array -> array codecs; followed by
exactly one array -> bytes codec; followed by
zero or more bytes -> bytes codecs.

This is actually a lot of semantic load for something simple like a JSON array. Instead of using a JSON array, I believe that the above structure could be expressed much better (where "expressed better" means "conveys intent more clearly, with no loss of information, and minimal added complexity") by using a JSON object with the following structure:

{
  "codecs": {
    "array_array": [], # array of array -> array codecs, possibly empty
    "array_bytes": {"name": "bytes"}, # a single array -> bytes codec, required
    "bytes_bytes": [] # array of bytes -> bytes codecs, possibly empty
  }
}

I am noting this because over in the zarr-python v3 implementation effort, we have written something like the above data structure as part of the basic parsing of the contents of zarr.json. In fact I think this data structure will arise in any implementation, because implementations must represent the structure of the codecs, and that structure is not captured at all by the JSON array representation. But, as I show here, it is trivial to describe the codec structure explicitly with JSON. A corollary benefit is that the above proposed data structure expresses much better the constraint that there be just 1 array -> bytes codec, which would reduce some validation burden from implementations.
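To make the point concrete, here is a minimal Python sketch of the kind of parsing step described above: it takes the flat JSON array and sorts it into the three-part structure. The function name `group_codecs` and the `CODEC_KINDS` lookup table are hypothetical and only cover a few codec names for illustration; this is not the zarr-python code.

```python
from typing import Any

# Illustrative subset of codec names mapped to their kind. This table is an
# assumption for the sketch, not a registry taken from any implementation.
CODEC_KINDS = {
    "transpose": "array_array",
    "bytes": "array_bytes",
    "sharding_indexed": "array_bytes",
    "gzip": "bytes_bytes",
    "blosc": "bytes_bytes",
    "crc32c": "bytes_bytes",
}


def group_codecs(codecs: list[dict[str, Any]]) -> dict[str, Any]:
    """Split a flat codec list into array_array / array_bytes / bytes_bytes.

    Raises ValueError if the list violates the ordering required by the spec.
    An unknown codec name raises KeyError, since this sketch has no registry.
    """
    grouped: dict[str, Any] = {
        "array_array": [],
        "array_bytes": None,
        "bytes_bytes": [],
    }
    for index, codec in enumerate(codecs):
        kind = CODEC_KINDS[codec["name"]]
        if kind == "array_array":
            # array -> array codecs may only appear before the array -> bytes codec
            if grouped["array_bytes"] is not None:
                raise ValueError(
                    f"codecs[{index}]: array -> array codec after the array -> bytes codec"
                )
            grouped["array_array"].append(codec)
        elif kind == "array_bytes":
            if grouped["array_bytes"] is not None:
                raise ValueError(f"codecs[{index}]: second array -> bytes codec")
            grouped["array_bytes"] = codec
        else:
            # bytes -> bytes codecs may only appear after the array -> bytes codec
            if grouped["array_bytes"] is None:
                raise ValueError(
                    f"codecs[{index}]: bytes -> bytes codec before the array -> bytes codec"
                )
            grouped["bytes_bytes"].append(codec)
    if grouped["array_bytes"] is None:
        raise ValueError("exactly one array -> bytes codec is required")
    return grouped
```

With the object form proposed above, most of these checks would be unnecessary, because the three keys already say which codec belongs where.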

So, if we care about making this easier for implementations (and I think making it easy for implementations also makes it easier for users), we should consider this change to zarr.json. There is no change to the semantics of the spec, but it makes zarr.json clearer. I understand that people may not want to change the spec. But I consider that a separate question from whether the current spec has defects that could in principle be fixed, such as the one described here.


normanrz · 2024-05-31T16:04:33Z

My take is that this kind of change is hard to get in at this point.

Also, after having implemented codec pipelines in Python, Java and Scala, it doesn't feel like an undue burden to implement the validation logic. In zarr-python the validation code is really just a few lines of code.
I think it is fine for implementations to have their own internal data structures that don't exactly match the json schema.
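As a rough sketch of what I mean by an internal data structure that differs from the JSON schema: an implementation can hold the three groups separately and still write the flat array the spec requires. The class name `CodecPipeline` and its fields are hypothetical here, not the actual zarr-python classes.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class CodecPipeline:
    """Hypothetical internal representation of a codec pipeline."""

    array_bytes: dict[str, Any]  # the single required array -> bytes codec
    array_array: list[dict[str, Any]] = field(default_factory=list)
    bytes_bytes: list[dict[str, Any]] = field(default_factory=list)

    def to_json(self) -> list[dict[str, Any]]:
        """Serialize to the flat, ordered array stored under "codecs"."""
        return [*self.array_array, self.array_bytes, *self.bytes_bytes]


pipeline = CodecPipeline(
    array_bytes={"name": "bytes"},
    bytes_bytes=[{"name": "gzip"}],
)
flat = pipeline.to_json()
```

Because `array_bytes` is a required single field, the "exactly one" constraint holds by construction once the metadata has been parsed.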

d-v-b · 2024-05-31T16:08:07Z

It's also not an undue burden for us to make material improvements to the v3 spec when we can, and I think the changes I'm proposing here are definitely improvements. If the spec is effectively immutable, then we can table this for zarr v4 :)

ghidalgo3 · 2024-07-10T15:58:33Z

Even if it's too late to change course, I would prefer the additional properties under codecs or a type property for each codec. Then if users create the wrong kind of codec pipeline, the error message seen side-by-side with the zarr.json is easily understood without having to read the codecs specification.
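A small Python sketch of the second option, assuming each codec entry carried a hypothetical `type` property (this property is not in the current v3 spec, and `check_pipeline` is a made-up name). The point is that the messages can quote the same words the user sees in zarr.json.

```python
from typing import Any

# Allowed order of codec types. The "type" property on each codec is the
# hypothetical addition discussed above, not part of the current spec.
TYPE_ORDER = ["array_array", "array_bytes", "bytes_bytes"]


def check_pipeline(codecs: list[dict[str, Any]]) -> list[str]:
    """Return human-readable problems with the pipeline; empty means valid."""
    problems = []
    stage = 0  # index into TYPE_ORDER of the latest stage reached so far
    n_array_bytes = 0
    for index, codec in enumerate(codecs):
        rank = TYPE_ORDER.index(codec["type"])
        if rank < stage:
            problems.append(
                f"codecs[{index}] ({codec['name']}) has type {codec['type']} "
                f"but follows a {TYPE_ORDER[stage]} codec"
            )
        else:
            stage = rank
        if codec["type"] == "array_bytes":
            n_array_bytes += 1
    if n_array_bytes != 1:
        problems.append(
            f"expected exactly one array_bytes codec, found {n_array_bytes}"
        )
    return problems
```

For a pipeline that lists gzip before bytes, the function reports that `codecs[1]` has type `array_bytes` but follows a `bytes_bytes` codec, which can be checked against the file directly.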

This was referenced Jul 8, 2024

Fix Example JSON document for Array metadata. #302

Merged

Replace VirtualiZarr.ZArray with zarr ArrayMetadata zarr-developers/VirtualiZarr#175

Draft

d-v-b mentioned this issue Sep 13, 2024

Narrow JSON type, ensure that to_dict always returns a dict, and v2 filter / compressor parsing zarr-developers/zarr-python#2179

Merged

