Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v3 codec structure in zarr.json #298

Open
d-v-b opened this issue May 31, 2024 · 3 comments
Open

v3 codec structure in zarr.json #298

d-v-b opened this issue May 31, 2024 · 3 comments

Comments

@d-v-b
Copy link
Contributor

d-v-b commented May 31, 2024

the v3 spec states that codecs are stored in a JSON array under the key codecs. But the spec also states that the list of codecs is structured:

...the list of codecs must be of the following form:

zero or more array -> array codecs; followed by
exactly one array -> bytes codec; followed by
zero or more bytes -> bytes codecs.

This is actually a lot of semantic load for something simple like a JSON array. Instead of using a JSON array, I believe that the above structure could be expressed much better (where "expressed better" means "conveys intent more clearly, with no loss of information, and minimal added complexity") by using a JSON object with the following structure:

{
"codecs": {
    "array_array": [], # array of array -> array codecs, possibly empty
    "array_bytes": {"name": "bytes"}, # single of array -> bytes codec, required
    "bytes_bytes": [], # array of bytes -> bytes codecs, possibly empty
}

I am noting this because over in the zarr-python v3 implementation effort, we have written something like the above data structure as part of the basic parsing of the contents of zarr.json. In fact I think this data structure will arise in any implementation, because implementations must represent the structure of the codecs, and that structure is not captured at all by the JSON array representation. But, as I show here, it is trivial to describe the codec structure explicitly with JSON. A corollary benefit is that the above proposed data structure expresses much better the constraint that there be just 1 array -> bytes codec, which would reduce some validation burden from implementations.

So, if we care about making this easier for implementations (and I think making it easy for implementations also makes it easier for users), we should considering this change to zarr.json. There is no change to the semantics of the spec, but it makes zarr.json more clear. I understand that people may not want to change the spec. But I consider that a separate question from whether the current spec has defects that could in principle be fixed, such as the one described here.

@normanrz
Copy link
Contributor

My take is that this kind of change is hard to get in at this point.

Also, after having implemented codec pipelines in Python, Java and Scala, it doesn't feel like an undue burden to implement the validation logic. In zarr-python the validation code is really just a few lines of code.
I think it is fine for implementations to have their own internal data structures that don't exactly match the json schema.

@d-v-b
Copy link
Contributor Author

d-v-b commented May 31, 2024

It's also not an undue burden for us to make material improvements to the v3 spec when we can, and I think the changes I'm proposing here are definitely improvements. If the spec is effectively immutable, then we can table this for zarr v4 :)

@ghidalgo3
Copy link

ghidalgo3 commented Jul 10, 2024

Even if it's too late to change course, I would prefer the additional properties under codecs or a type property for each codec. Then if users create the wrong kind of codec pipeline, the error message seen side-by-side with the zarr.json is easily understood without having to read the codecs specification.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants