feature(stores): draft zip file store specification #311

Open
wants to merge 1 commit into
base: main

Conversation

jhamman
Member

@jhamman jhamman commented Sep 11, 2024

This is a working draft of the v3 ZIP file store specification.

xref:

Comment on lines +88 to +90
* Delete a file.

* Delete a directory.
Member Author

See #103

@joshmoore
Member

In my experience, the root of the zip is one of the trickiest parts for data creators (and I assume implementers) to get right, e.g.,

@joshmoore
Member

cc: @DennisHeimbigner

@zoj613
Contributor

zoj613 commented Sep 11, 2024

How useful is a ZipStore in practice? Are there a lot of use cases for it? Given how limited it is (no rename/deletion, etc.), I am wondering if it's worth having a spec for it.

@DennisHeimbigner

I have support equivalent to zipstore in nczarr in the netcdf-c library. I agree that it does not appear to be
very useful, but the basic idea behind it is reasonable: a single file containing a complete zarr file tree,
with the component files compressed to save space.
Personally, I think that using a single file file system (SFFS) with added compression makes more sense.
There are several implementations available, and it is easy enough to write your own.

@jhamman
Member Author

jhamman commented Sep 12, 2024

In my experience, the root of the zip is one of the trickiest parts for data creators (and I assume implementers) to get right...

@joshmoore - do you have suggestions for the spec document that would make this clearer?


@zoj613 and @DennisHeimbigner - let's try to avoid making this about alternatives to the ZIP store concept. There are practical reasons to add this (Zarr-Python has long supported a ZIP store interface).

Remember, Zarr can support many storage backends. If there are alternatives to experiment with, let's do that in a separate issue.


@DennisHeimbigner - I would like to get your feedback on the spec as written. Is it aligned with your netcdf-c implementation?

Comment on lines +107 to +111
* ``get(key) -> value`` : Read and return the contents of the object at
within the archive at path ``key``.

* ``set(key, value)`` : Write ``value`` as the contents of the file at
into the archive at path ``key .
Contributor

Is the use of "at within" and "at into" in these lines intentional? It sounds like a typo.
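
For readers following along, the two operations quoted above can be sketched with Python's standard `zipfile` module. This is only an illustration, not the draft's reference implementation: the class name `MiniZipStore` is hypothetical, and the sketch assumes that `set` is only ever called for keys that do not yet exist.

```python
import io
import zipfile


class MiniZipStore:
    """Hypothetical, illustration-only store: keys are archive member paths."""

    def __init__(self, target):
        # `target` may be a filename or a seekable binary file-like object.
        self._target = target

    def set(self, key: str, value: bytes) -> None:
        # Append mode adds a new member; it does not rewrite existing ones.
        with zipfile.ZipFile(self._target, mode="a") as zf:
            zf.writestr(key, value)

    def get(self, key: str) -> bytes:
        # Raises KeyError if no member with this name exists.
        with zipfile.ZipFile(self._target, mode="r") as zf:
            return zf.read(key)


buffer = io.BytesIO()
store = MiniZipStore(buffer)
store.set("zarr.json", b'{"zarr_format": 3, "node_type": "group"}')
store.set("a/c/0", b"\x00\x01")
chunk = store.get("a/c/0")
```

Reopening the archive on every call keeps the sketch short; a real implementation would more likely hold one open handle.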

@joshmoore
Member

@joshmoore - do you have suggestions for the spec document that would make this clearer?

Thoughts that I have revolving in my head that include:

  • is it really a store spec or is it a format spec?
  • for the format, the most important item I know of is "don't include the top-level directory" (though I have run into some complaints about that from various repositories, since the behavior differs between implementations, e.g. on Windows)
  • for v2, I fully see getting this written down ASAP; for v3, I wonder if a general "archival" format that can be extended for, say, zip/tar/whatever wouldn't be something to consider

Comment on lines +63 to +64
* Each key has a name (sequence of characters) and contents
(sequence of bytes).
Member Author

Note that the keys are relative paths (not prefixed with a /).
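
A small check along these lines could enforce that rule. The function name `is_valid_key` is hypothetical, and rejecting empty, `.` and `..` segments is an extra assumption of this sketch rather than something the draft states.

```python
def is_valid_key(key: str) -> bool:
    """Return True if `key` looks like a relative archive path."""
    # Keys must be non-empty and must not start with "/".
    if not key or key.startswith("/"):
        return False
    # Assumption of this sketch: no empty, "." or ".." path segments.
    return all(part not in ("", ".", "..") for part in key.split("/"))
```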

@DennisHeimbigner

for the format, the most important item I know of is "don't include the top-level directory" (though I have run into some complaints about that from various repositories, since the behavior differs between implementations, e.g. on Windows)

I think I have always used either linux zip or cygwin zip to create zarr zip files. What native windows program could I use to create a pure windows zip file?
As for the top-level directory, I think it is better to always include it. I say this so that my rule holds, namely:
1. unzipping a zip store creates a directory tree usable by the zarr directory tree storage manager.
2. zipping a zarr directory tree creates a zip store conforming to the proposed zip spec.
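
To make the two layouts under discussion concrete, here is a minimal sketch using Python's `zipfile` and `pathlib`. The helper `zip_tree` and the directory name `example.zarr` are hypothetical; the only difference between the two archives is the base path that member names are made relative to.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_tree(root: Path, dest: Path, include_root: bool) -> list[str]:
    """Zip the directory `root` into `dest` and return the member names."""
    # Relative to the parent: names start with "example.zarr/...".
    # Relative to the directory itself: names start at the zarr root.
    base = root.parent if include_root else root
    with zipfile.ZipFile(dest, mode="w") as zf:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                zf.write(path, arcname=path.relative_to(base).as_posix())
        return zf.namelist()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "example.zarr"
    (root / "a").mkdir(parents=True)
    (root / "zarr.json").write_text("{}")
    (root / "a" / "zarr.json").write_text("{}")
    with_root = zip_tree(root, Path(tmp) / "with_root.zip", include_root=True)
    without_root = zip_tree(root, Path(tmp) / "without_root.zip", include_root=False)
```

With `include_root=True`, unzipping recreates the `example.zarr` directory; with `include_root=False`, `zarr.json` sits at the top of the archive and the extractor has to choose the target directory.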

@jhamman
Member Author

jhamman commented Oct 7, 2024

I think I have always used either linux zip or cygwin zip to create zarr zip files. What native windows program could I use to create a pure windows zip file?

🤷

As for the top-level directory, I think it is better to always include it. I say this so that my rule holds, namely:
1. unzipping a zip store creates a directory tree usable by the zarr directory tree storage manager.
2. zipping a zarr directory tree creates a zip store conforming to the proposed zip spec.

👍

@joshmoore
Member

A few downsides of adding the directory:

  • It's a change from v2 😄 though that can be handled.
  • You must know/lookup that value before you open the zip.
  • What is the behavior if there are multiple .zarr directories within the zip?

zoj613 added a commit to zoj613/zarr-ml that referenced this pull request Nov 6, 2024
This commit adds a ZipStore storage backend as described in the
specification zarr-developers/zarr-specs#311 .
Note that the implementation loads the entire zip archive into memory so
care must be taken to ensure the zip archive is not too big to fit into
the machine's memory. To use a ZipStore implementation that does not
load the archive into memory see `examples/zipstore.ml`.
zoj613 added a commit to zoj613/zarr-ml that referenced this pull request Nov 7, 2024
This commit adds a ZipStore storage backend as described in the
specification zarr-developers/zarr-specs#311 .
Note that the implementation loads the entire zip archive into memory so
care must be taken to ensure the zip archive is not too big to fit into
the machine's memory. To use a ZipStore implementation that does not
load the archive into memory see `examples/zipstore.ml`.
@jwindhager

At the recent OME-NGFF Workflows Hackathon, a team has been discussing possible paths towards a "single-file" OME-NGFF standard. Our preferred path would be to build upon zipped zarrs, i.e. related to this PR. Please find below relevant points extracted from our discussions. Apologies for the long reply, happy to turn it into GitHub review/suggestion style if necessary.

TL;DR:

  • Yes to "ZipStore" (perhaps later generalize to "ArchiveStore", remove interface definition, add resource identifier scheme)
  • No to root directory (but readers may implement additional discovery strategies)
  • Specify that existing keys may not be overwritten (?)
  • More narrowly specify ZIP file format (see below)?

Replies to previous comments:

How useful is a ZipStore in practice? Are there a lot of use cases for it?

In the bioimaging domain, many researchers tend to prefer individual files over file system directories when handling small to medium-sized image data (cf. TIFF), not least for practical reasons (e.g., file sharing using traditional means, double-click-open/drag-and-drop support in existing tools) and because existing tooling largely isn't ready for handling file system stores. We'd argue this applies to other domains as well.

In practice, the limitations of file system stores when handling small data mean that people will archive (i.e., "zip") zarrs either way, independent of whether this is part of the specification or not. Specifying just how zarrs should be archived would enable tool developers to readily implement support for spec-compliant zarr archives, making Zarr a good choice also for their users.

Personally, I think that using a single file file system (SFFS) with added compression makes more sense.

We did not specifically discuss this idea. What would be the benefit of a full-fledged SFFS over archive file formats (which we'd argue are specific instances of SFFSs)? Regarding compression, Zarr itself already supports several codecs.

As a side note, we instead discussed the related idea of using a single-file container format (e.g. HDF5) for a second implementation of the OME-NGFF specification (in addition to Zarr) to enable single-file images. However, this would come at the cost of significant development overhead, would eventually necessitate conversion between different "backends", and would risk fragmenting the community (particularly if there are discrepancies in interpretation), so we'd strongly prefer to stay within Zarr territory for single-file OME-NGFF (which the ZipStore would allow us to do).

But, as @jhamman rightfully wrote, let's save further discussion on alternatives for another time.

I wonder if a general "archival" format that can be extended for, say, zip/tar/whatever wouldn't be something to consider

We agree that, depending on the scope of the specification, this draft could (at least in part) apply more generally to any archive file format. Perhaps this could be generalized in a second step, once the ZipStore has been added?

For now, we propose limiting the scope of this draft to a specific file format and endorse ZIP for the following reasons:

  • widely supported, including native support by several OS
  • several implementations already exist (zarr-python, zarrs_zip)
  • compression performance not as relevant, due to zarr's compression codecs
  • has a file index (important for remote access, e.g. using HTTP range requests)
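
As an illustration of the file-index point, the sketch below reads the ZIP central directory with Python's `zipfile` and records where each member starts and how many bytes it occupies. It uses an in-memory archive and makes no network calls; mapping these numbers onto HTTP range requests is left as an assumption about how a remote reader might proceed.

```python
import io
import zipfile

# Build a small uncompressed archive in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr("zarr.json", b"{}")
    zf.writestr("c/0", b"\x00" * 16)

# Opening the archive only parses the central directory at the end of the file.
with zipfile.ZipFile(buffer) as zf:
    index = {
        # header_offset points at the member's local file header;
        # the payload of compress_size bytes follows that header.
        info.filename: (info.header_offset, info.compress_size)
        for info in zf.infolist()
    }
```

A reader with such an index only needs the tail of the file plus one byte range per key, rather than the whole archive.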

Is it really a store spec or is it a format spec?

We too were wondering if it would make sense - in the long term - to separate the interface definition from the on-disk representation. Perhaps the interface definition could be considered an implementation detail, whereas the on-disk representation is more essential to ensure data portability? Not explicitly specifying store operations would also address compatibility issues (e.g. ZIP possibly not supporting in-place update/delete operations).

More generally, with "non-file system stores" defined, we think that the current specification is missing consistent resource identifier (e.g. URI) schemes and/or alternative means (e.g., file suffix, mime type, magic number, user decision) for delineating on-disk representations/stores. This is particularly relevant in the case of OME-NGFF, where OME-Zarrs may contain multiple images and users may therefore need to specify the path to a specific image within the zip (e.g. for visualization), ideally as part of the resource identifier pointing to the zip file. However, this is not specific to the ZipStore, should in our view not be mixed with the storage specification either, and may well be an "upstream problem" for a more general specification. We thus propose to leave it up to implementations to decide what "store" to use for a given resource for now.

For the format, the most important item I know of is "don't include the top-level directory" (though I have run into some complaints about that from various repositories, since the behavior differs between implementations, e.g. on Windows) ... and related comments

Having a root directory inside a zip file (with the same name as the zip file itself) can quickly become confusing/out of sync if the zip files have been renamed automatically (e.g. upon re-downloading an already existing file) and/or manually. We'd argue that not being able to unpack zip files into the same directory without first (automatically?) creating target root directories is far less confusing than ending up with directory names that may not match the zip file names (and just as in the case of no root folder, depending on tooling, one could still end up accidentally overwriting "competing" root folders if they happen to have the same name). We therefore propose to NOT use root directories for archiving zarrs.

Specifically, for zarr-specific zip writer implementations, we propose to REQUIRE the creation of archives without a root directory (for above reasons and consistency, also with Zarr v2). However, since zarrs may also be archived using zarr-agnostic tooling, we propose to specify that zarr reader implementations MAY additionally check for single root directories or recursively scan for zarr.json files if no zarr.json file can be found in the zip file's root. We are not aware of a strong use case for introducing an optional root-folder metadata file pointing to an arbitrary specific location within a zip file for zarr.json discovery.
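
The optional reader-side discovery described here could look roughly like the following sketch. The function name `find_zarr_root` and the tie-breaking rule (give up when several equally shallow candidates exist) are assumptions for illustration, not part of the proposal.

```python
from typing import Optional


def find_zarr_root(names: list[str]) -> Optional[str]:
    """Return the key prefix of the zarr root ("" for the archive root), or None."""
    # Preferred layout: zarr.json directly at the archive root.
    if "zarr.json" in names:
        return ""
    # Fallback 1: everything lives under a single top-level directory.
    top_level = {name.split("/", 1)[0] for name in names}
    if len(top_level) == 1:
        (only,) = top_level
        if f"{only}/zarr.json" in names:
            return f"{only}/"
    # Fallback 2: scan for nested zarr.json files and accept the shallowest
    # one only if it is unambiguous.
    hits = [name for name in names if name.endswith("/zarr.json")]
    if not hits:
        return None
    shallowest = min(hit.count("/") for hit in hits)
    best = [hit for hit in hits if hit.count("/") == shallowest]
    if len(best) != 1:
        return None
    return best[0][: -len("zarr.json")]
```

For an archive containing both `a.zarr/zarr.json` and `b.zarr/zarr.json`, this sketch returns `None`, which is one possible answer to the earlier question about multiple .zarr directories within one zip.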

Additional remarks:

The current draft does not specify that existing keys cannot generally be overwritten (to our understanding, this is not generally possible according to the ZIP standard).
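
A short experiment with Python's `zipfile` illustrates what happens in practice when the same key is written twice. The observation applies to this one library; other ZIP readers may resolve duplicate names differently.

```python
import io
import warnings
import zipfile

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as zf:
    zf.writestr("zarr.json", b"old")
    with warnings.catch_warnings():
        # zipfile emits a UserWarning ("Duplicate name") here instead of failing.
        warnings.simplefilter("ignore", UserWarning)
        zf.writestr("zarr.json", b"new")

with zipfile.ZipFile(buffer) as zf:
    names = zf.namelist()           # both entries are still in the archive
    latest = zf.read("zarr.json")   # lookup by name resolves to the last entry
```

The second write does not replace the first entry; it appends another member with the same name, so the archive grows and the stale bytes remain.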

Should the draft specify the archive file format a bit more precisely, e.g., ZIP64 support (yes), support of empty or spanned zip files (no), supported compression formats (if any)? Perhaps writers should be required to support writing uncompressed ZIP64 files, whereas readers MAY support further compression algorithms?
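
As a rough sketch of what "uncompressed, ZIP64-capable" could mean for a writer, Python's `zipfile` exposes both choices as constructor arguments. The helper name `write_uncompressed` is hypothetical. Note that `allowZip64=True` only permits ZIP64 extensions when sizes require them; small archives like this one are written without them.

```python
import io
import zipfile


def write_uncompressed(dest, members: dict[str, bytes]) -> None:
    """Write `members` to `dest` without ZIP-level compression."""
    with zipfile.ZipFile(
        dest,
        mode="w",
        compression=zipfile.ZIP_STORED,  # store bytes as-is; chunks are already encoded by zarr codecs
        allowZip64=True,                 # allow ZIP64 extensions for large archives
    ) as zf:
        for key, value in members.items():
            zf.writestr(key, value)


buffer = io.BytesIO()
write_uncompressed(buffer, {"zarr.json": b"{}", "c/0": b"\x00" * 8})
with zipfile.ZipFile(buffer) as zf:
    methods = {info.filename: info.compress_type for info in zf.infolist()}
```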

-- @bpavie @jwindhager @leoschwarz @retogerber

Store limitations
=================

The following limitations for this store are know:
Contributor

Suggested change
The following limitations for this store are know:
The following limitations for this store are known:

@d-v-b
Contributor

d-v-b commented Nov 25, 2024

thanks for the writeup @jwindhager et al. It might be clarifying to factor out the following two discussions, as I think they are logically separable:

  • question 1: how should Zarr implementations design APIs for the storage of Zarr data inside single-file archive formats (namely zip)? I think this is the question @jhamman is attempting to resolve with this particular PR. I think it may not be possible to answer this particular question in a way that provides normative guidelines for applications that target end users.
  • question 2: can we define best practices (or even a spec) for storing Zarr data inside single-file archives (namely, zip)? I think this is what @jwindhager et al are looking for (please correct me if I'm misreading). Such best practices / spec might reasonably carve out a restricted subset of the space expressible with a generic Zarr + zip API, and include things like file extensions, URI conventions, etc.

I think question 2 is the kind of thing that really ought to be explored alongside at least one implementation. So my pitch is as follows: we implement an opinionated zarr archive function in zarr-python. The function I am imagining takes a zarr array or group and saves all the stuff in that array / group in a destination zip file, according to a set of guidelines that we simultaneously document as a spec-like document in the zarr-python repo. I think this could be really productive, and developing the spec alongside the implementation will keep them both balanced. At the end of this process, we should have both a useful implementation and a well-tested spec, which could be re-published in a broader venue than zarr-python.

Would anyone here be interested in working on such an effort? I do think this requires at least one champion to push forward. As a zarr-python maintainer I would be willing to put in the time for reviewing PRs and whatnot.

@jbms
Contributor

jbms commented Nov 25, 2024

I'm in favor of supporting archive formats and zip. Tensorstore already supports reading but not writing.

Zip has some disadvantages in its design but I think they are outweighed by it being such a common format.

I agree that there should be no implicit root directory, and while some implementations may do auto-discovery, there should be a canonical url that makes any sub-directories within the zip file explicit.

The spec says the canonical url is just a file url, file:///path/to/file.zip. While that is reasonable for implementations that do auto-discovery, I don't think that is a good idea as the canonical url since it does not explicitly indicate the zip format at all, and would rely on implementations detecting it either by the filename or content.

Previously I proposed a different url syntax (zarr-developers/zeps#48) which allows nested formats like zip to be specified explicitly.

@DennisHeimbigner

IMO zip is not a good single file storage format. Other choices like the various
single file file systems (SFFS) seem to me to be a much more flexible choice, especially
in the face of writing.

@d-v-b
Contributor

d-v-b commented Nov 25, 2024

IMO zip is not a good single file storage format. Other choices like the various
single file file systems (SFFS) seem to me to be a much more flexible choice, especially
in the face of writing.

This might be true but I think it's orthogonal to the discussion at hand -- we are not trying to find the best archive file format, but rather devise standards to improve the utility of a popular archive file format (zip). Zip being sub-par doesn't bear on the fact that people want to use it, and that latter fact is what we should build around IMO.

@zoj613
Contributor

zoj613 commented Nov 26, 2024

The current draft does not specify that existing keys cannot generally be overwritten (to our understanding, this is not generally possible according to the ZIP standard).

I think this should be left to the implementation to decide whether to support write operations. Some people might want to rename/overwrite/delete entries in a ZIP as a convenience just like any other store. Sure, the ZIP standard does not support this, but there are ways to work around this limitation (although quite inefficient). For example, I have 2 ZipStore implementations that support the full zarr v3 abstract store interface as defined in the core spec. I don't see the benefit of imposing this limitation on implementations.
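
One such workaround, sketched here with Python's `zipfile`, is to rewrite the whole archive while skipping or replacing the affected members. This is an assumed approach for illustration and not necessarily how the implementations mentioned above work; the function name `rewrite` is hypothetical. It also shows the inefficiency: every update copies all untouched members.

```python
import io
import zipfile
from typing import Optional


def rewrite(source: bytes, updates: dict[str, Optional[bytes]]) -> bytes:
    """Return a new archive with `updates` applied (a value of None deletes the key)."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(source)) as src, zipfile.ZipFile(out, "w") as dst:
        for info in src.infolist():
            if info.filename in updates:
                continue  # dropped here; replacements are written below
            dst.writestr(info, src.read(info))  # copy untouched members
        for key, value in updates.items():
            if value is not None:
                dst.writestr(key, value)
    return out.getvalue()


original = io.BytesIO()
with zipfile.ZipFile(original, "w") as zf:
    zf.writestr("zarr.json", b"old")
    zf.writestr("c/0", b"x")

updated = rewrite(original.getvalue(), {"zarr.json": b"new", "c/0": None})
with zipfile.ZipFile(io.BytesIO(updated)) as zf:
    remaining = zf.namelist()
    metadata = zf.read("zarr.json")
```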

@jbms
Contributor

jbms commented Nov 26, 2024

The current draft does not specify that existing keys cannot generally be overwritten (to our understanding, this is not generally possible according to the ZIP standard).

I think this should be left to the implementation to decide whether to support write operations. Some people might want to rename/overwrite/delete entries in a ZIP as a convenience just like any other store. Sure, the ZIP standard does not support this, but there are ways to work around this limitation (although quite inefficient). For example, I have 2 ZipStore implementations that support the full zarr v3 abstract store interface as defined in the core spec. I don't see the benefit of imposing this limitation on implementations.

Agreed, and in fact I think this spec could be made much more concise. I don't think it is necessary to list the supported operations.

@jwindhager

@d-v-b I agree that we could factor out the two discussions (this is what we meant with "separate the interface definition from the on-disk representation") and I would also add the "canonical url" proposal by @jbms to the list (this is roughly what we referred to as "consistent resource identifier schemes"). However, given the way stores are specified in the current spec, I'd pragmatically argue that it's probably easier to get this PR merged in its current form rather than to further broaden the scope / branch out.

I also agree that we could push the on-disk representation aspect on the zarr-python side ("guidelines"), but we'd need to make sure to simultaneously push this on the spec side (i.e. here) as well. Otherwise, we may risk ending up with a Python-specific de-facto standard that in the worst case collides with future upstream specs. But perhaps this worry is unfounded, in which case I'd be happy to contribute within my (unfortunately limited) time constraints.


@jbms Thanks a lot for pointing us to your ZEP! We somehow completely missed it in our group's discussions. I think this could address many important issues, also related to zipped Zarr (and thus single-file OME-NGFF), and will give this a read asap.


The current draft does not specify that existing keys cannot generally be overwritten (to our understanding, this is not generally possible according to the ZIP standard).

I think this should be left to the implementation to decide whether to support write operations. Some people might want to rename/overwrite/delete entries in a ZIP as a convenience just like any other store. Sure, the ZIP standard does not support this, but there are ways to work around this limitation (although quite inefficient). For example, I have 2 ZipStore implementations that support the full zarr v3 abstract store interface as defined in the core spec. I don't see the benefit of imposing this limitation on implementations.

Agreed, and in fact I think this spec could be made much more concise. I don't think it is necessary to list the supported operations.

Fair point. I still think that one could clarify this under "Store limitations" using the right phraseology, but no strong feelings.

@d-v-b
Contributor

d-v-b commented Nov 26, 2024

I also agree that we could push the on-disk representation aspect on the zarr-python side ("guidelines"), but we'd need to make sure to simultaneously push this on the spec side (i.e. here) as well. Otherwise, we may risk ending up with a Python-specific de-facto standard that in the worst case collides with future upstream specs. But perhaps this worry is unfounded, in which case I'd be happy to contribute within my (unfortunately limited) time constraints.

We can iterate quickly in zarr-python, including on the spec side. What I'm imagining is "incubating" the zip convention in zarr-python (and / or any other zarr implementation), developing a spec and an implementation that are both tested, and finally proposing something to this repo with that experience in hand. Until the spec is published here, we would publish it in zarr-python, and the spec itself would not be python-specific, because we would design it that way :)

7 participants