Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Define the list of codecs in the v3 spec #312

Closed
wants to merge 11 commits into from
9 changes: 6 additions & 3 deletions docs/v3/codecs.rst
Original file line number Diff line number Diff line change
Expand Up @@ -2,12 +2,15 @@
Codecs
======

Under construction.
Core codecs
-----------

The following documents specify the core codecs. Read more about core codecs in the :ref:`core codecs <core-codecs>` section of the Zarr version 3 specification.

.. toctree::
:glob:
:maxdepth: 1
:titlesonly:
:caption: Contents:
:caption: Core codecs:

codecs/*/*
codecs/core/*/*
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
79 changes: 40 additions & 39 deletions docs/v3/core/v3.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -295,7 +295,7 @@ The following figure illustrates the first part of the terminology:

*Codec*

The list of *codecs* specified for an array_ determine the encoded byte
The list of *codecs* specified for an array_ determines the encoded byte
representation of each chunk in the store_.

.. _metadata document:
Expand Down Expand Up @@ -634,13 +634,12 @@ mandatory names:
``codecs``
^^^^^^^^^^

Specifies a list of codecs to be used for encoding and decoding chunks. The
value must be an array of objects, each object containing a member with
``name`` whose value is a string referring to a v3 codec specification. The
codec object may also contain a ``configuration`` object which consists of
the parameter names and values as defined by the corresponding codec
specification. Since an ``array -> bytes`` codec must be specified, the
list cannot be empty.
Specifies a list of codecs to be used for encoding and decoding chunks. The
value MUST be an array of objects, where each object contains a member with the name
``name`` whose value is a string. The ``name`` member identifies the specification for the codec. A

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously the requirement was name refers to a v3 codec spec, so I guess there would be a catalogue of v3 codecs where I could go an look up the specification.

Now the name is abitrary, so given the name how do I know where to find the codec specification, so if I get a bit of zarr data I can find out how to decode it.

And how do we avoid naming conflicts between community codecs? Or is this not our concern?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my concerns are resolved by changes under "Extension codecs" below, so maybe worth either repeating a bit of that or signposting that section here?

codec object MAY contain a member named ``configuration``, which is an object defined by the respective codec

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be changed back to "parameter names and values" instead of "object", since this is a) out of scope of this PR, and b) I think parameter names and values makes more sense because this section is taking about a JSON document (I think??).

specification. Because ``codecs`` MUST contain an ``array -> bytes`` codec, the
list cannot be empty (See :ref:`codecs <codecs>`).

The following members are optional:

Expand Down Expand Up @@ -1211,40 +1210,42 @@ the following procedure:

4. The chunk array ``A`` is equal to ``EC[0]``.

.. _codec-specification:

Specifying codecs
-----------------

To allow for flexibility to define and implement new codecs, this
specification does not define any codecs, nor restrict the set of
codecs that may be used. Each codec must be defined via a separate
specification. In order to refer to codecs in array metadata
documents, each codec must have a unique identifier, which is a URI
that dereferences to a human-readable specification of the codec. A
codec specification must declare the codec identifier, and describe
(or cite documents that describe) the encoding and decoding algorithms
and the format of the encoded data.

A codec may have configuration parameters which modify the behaviour
of the codec in some way. For example, a compression codec may have a
compression level parameter, which is an integer that affects the
resulting compression ratio of the data. Configuration parameters must
be declared in the codec specification, including a definition of how
configuration parameters are represented as JSON.

The Zarr core development team maintains a repository of codec
specifications, which are hosted alongside this specification in the
`zarr-specs GitHub repository`_, and which are
published on the `zarr-specs documentation Web site
<https://zarr-specs.readthedocs.io/>`_. For ease of discovery, it is
recommended that codec specifications are contributed to the
zarr-specs GitHub repository. However, codec specifications may be
maintained by any group or organisation and published in any location
on the Web. For further details of the process for contributing a
codec specification to the zarr-specs GitHub repository, see
`ZEP 0 <https://zarr.dev/zeps/active/ZEP0000.html>`_ which describes
the process for Zarr specification changes.

Further details of how codecs are configured for an array are given in the `Array metadata`_ section.
.. _core-codecs:

Core codecs
-----------

This spec defines a set of codecs ("core codecs") which all Zarr implementations SHOULD implement in
order to ensure a minimal level of interoperability between Zarr implementations.
Each core codec is each defined by specification documents which are hosted in the
`zarr-specs GitHub repository`_, and are published on the `zarr-specs documentation web site
<https://zarr-specs.readthedocs.io/>`_. The list of core codecs is part of the Zarr v3 specification.
Changes to the list of core codecs MUST be made via the same protocol used for
changing the Zarr v3 specification. Changes to the list of core codecs SHOULD be made carefully and
in close collaboration with extant Zarr v3 implementations. A new core codec SHOULD be added to the
d-v-b marked this conversation as resolved.
Show resolved Hide resolved
list when a sufficient number of Zarr implementations support or intend to support that codec.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is sufficient? 50%? 80%? 100%? This should be clarified.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any number we pick will be arbitrary. I think we have to rely on people's judgment here.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think picking and agreeing on an arbitrary number is better than not picking a number and then leaving the door open to discussions/disagreements in the future on what a "sufficient number" means.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess because it's a "SHOULD" we can put in a completely arbitrary number and then ignore it. How about 60%

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or "an overwhelming majority"?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's worth removing this sentence completely? I don't particularly like adding an arbitrary number either as per your comments 😄 . I think it would be fine to remove this sentence, and if someone wants to add a core codec they can justify it as part of the normal ZEP process.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove text completely -- I like this solution :)

An existing core codec SHOULD be removed from the list when a sufficient number of implementation
developers and Zarr users deem the codec worth removing, e.g. because of a technical flaw in the
algorithm underlying the codec.
Comment on lines +1232 to +1234

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the main goals (to me) of specifying core codecs is to ensure backwards compatibility long into the future. Removing a codec reduces backwards compatibility, so I think there should be very strict conditions under which a codec should be removed.

Suggested change
An existing core codec SHOULD be removed from the list when a sufficient number of implementation
developers and Zarr users deem the codec worth removing, e.g. because of a technical flaw in the
algorithm underlying the codec.
Existing core codecs SHOULD NOT be removed from the list to ensure future backwards compatibiility.
A core codec MAY be removed from the list if it has fundamental security flaws that cannot be resolved and the removal is agreed by the Zarr Implementation Council.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need a reference to the ZIC here, since elsewhere in the PR I state that the list of core codecs should be changed according to the same process for changing the spec itself

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
An existing core codec SHOULD be removed from the list when a sufficient number of implementation
developers and Zarr users deem the codec worth removing, e.g. because of a technical flaw in the
algorithm underlying the codec.
Existing core codecs SHOULD NOT be removed from the list to ensure future backwards compatibiility.
A core codec MAY be removed from the list if it has fundamental security flaws that cannot be resolved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"To ensure backwards compatibility, existing core codecs SHOULD NOT be removed from the list." reads better to me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also something I thought of last night: if a codec is removed from the list of core codecs, it is still available as an extension codec. So I don't think there's actually a large backwards compatibility risk.


Community codecs

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "Community" used elsewhere? Maybe "Extension", which I think is used in other places ("extension data types" at least).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forget how I picked the word "community"... I don't think it's a term of art in the spec. If "extension" is a better fit I'm happy to switch to that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I favor "extension" over community because it fits with the existing Zarr extension framework.

----------------

Zarr implementations MAY support a codec that is not in the list of core codecs
(hereafter termed a "community codec"), provided the community codec does not use an identifier

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth namespacing codec identifiers somehow? I'm slightly against, since ideally codecs can graduate from "community" (or "extension") to "core" (after careful consideration). It'd be nice to avoid needing to rewrite metadata files when that happens.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do any namespacing of the identifiers, we should not use any information about whether the codec is core or community / extension, for the reasons you note. Otherwise I don't have a preference.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what is written here is fine. Namespacing codecs sounds complicated.

that is already used by a core codec, as the identifiers of core codecs are reserved.

This specification places no other constraints on community codecs. It is possible that separate
developers may define distinct codecs that use the same identifier.
d-v-b marked this conversation as resolved.
Show resolved Hide resolved
To minimize the impact of such name collisions, codec developers are strongly encouraged
to publish their codec specifications as additions to the "community codecs" section of Zarr v3 specification.
Publication in the "community codecs" section does not confer primacy or an official designation to a codec.
The list of community codecs exists expressly as a tool to enable coordinated codec development.

Stores
======
Expand Down