feature(stores): draft zip file store specification #311

jhamman · 2024-09-11T06:01:33Z

This is a working draft of the v3 ZIP file store specification.

xref:

jhamman · 2024-09-11T06:20:26Z

docs/v3/stores/zipfile/v1.0.rst

+* Delete a file.
+
+* Delete a directory.


joshmoore · 2024-09-11T08:09:40Z

In my experience, the root of the zip is one of the trickiest parts for data creators (and I assume implementers) to get right, e.g.,

joshmoore · 2024-09-11T08:10:23Z

cc: @DennisHeimbigner

zoj613 · 2024-09-11T20:14:25Z

How useful is a ZipStore in practice? Are there a lot of use cases for it? Given how limited it is (no rename/deletion, etc.), I am wondering if it's worth having a spec for it.

DennisHeimbigner · 2024-09-12T00:51:27Z

I have support equivalent to zipstore in nczarr in the netcdf-c library. I agree that it does not appear to be
very useful, but the basic idea behind it is reasonable: a single file containing a complete zarr file tree,
with compression applied to the component files to save space.
Personally, I think that using a single-file file system (SFFS) with added compression makes more sense.
There are several implementations available, and it is easy enough to write your own,

jhamman · 2024-09-12T15:35:02Z

> In my experience, the root of the zip is one of the trickiest parts for data creators (and I assume implementers) to get right...

@joshmoore - do you have suggestions for the spec document that would make this clearer?

@zoj613 and @DennisHeimbigner - let's try to avoid making this about alternatives to the ZIP store concept. There are practical reasons to add this (Zarr-Python has long supported a ZIP store interface).

Remember, Zarr can support many storage backends. If there are alternatives to experiment with, let's do that in a separate issue.

@DennisHeimbigner - I would like to get your feedback on the spec as written. Is it aligned with your netcdf-c implementation?

zoj613 · 2024-09-12T20:39:01Z

docs/v3/stores/zipfile/v1.0.rst

+* ``get(key) -> value`` : Read and return the contents of the object at
+  within the archive at path ``key``. 
+
+* ``set(key, value)`` : Write ``value`` as the contents of the file at
+  into the archive at path ``key .


Is the use of "at within" and "at into" in these lines intentional? It looks like a typo.

joshmoore · 2024-09-13T10:13:56Z

> @joshmoore - do you have suggestions for the spec document that would make this clearer?

Thoughts revolving in my head include:

* is it really a store spec or is it a format spec?
* for the format, the most important item I know of is "don't include the top-level directory" (though I have run into some complaints about that from various repositories, since the behavior differs between implementations, e.g. on Windows)
* for v2, I fully see getting this written down ASAP; for v3, I wonder if a general "archival" format that can be extended for, say, zip/tar/whatever wouldn't be something to consider

jhamman · 2024-09-29T20:30:53Z

docs/v3/stores/zipfile/v1.0.rst

+* Each key has a name (sequence of characters) and contents
+  (sequence of bytes).


Note that the keys are relative paths (not prefixed with a /).
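
A minimal sketch of what "relative keys" means in practice, using only Python's standard `zipfile` module. The helper name `is_relative_key` and the example entries are hypothetical and not taken from the draft; the check is one possible reading of the note above, not a normative rule.

```python
import io
import zipfile


def is_relative_key(name: str) -> bool:
    """Return True if a ZIP member name looks like a relative store key."""
    # Assumption: a key has no leading "/", no backslashes, and no
    # empty, "." or ".." path segments.
    if name.startswith("/") or "\\" in name:
        return False
    return all(part not in ("", ".", "..") for part in name.split("/"))


# Build a tiny in-memory archive so the example is self-contained.
buf = io.BytesIO()
with zipfile.ZipFile(buf, mode="w") as zf:
    zf.writestr("zarr.json", b"{}")
    zf.writestr("foo/c/0/0", b"\x00")

# List the file entries; these names are the store keys.
with zipfile.ZipFile(buf) as zf:
    keys = [info.filename for info in zf.infolist() if not info.is_dir()]

print(keys)
print(all(is_relative_key(k) for k in keys))
```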

DennisHeimbigner · 2024-09-29T20:56:02Z

> for the format, the most important item I know of is "don't include the top-level directory" (though I have run into some complaints about that from various repositories, since the behavior differs between implementations, e.g. on Windows)

I think I have always used either linux zip or cygwin zip to create zarr zip files. What native windows program could I use to create a pure windows zip file?
As for the top-level directory, I think it is better to always include it. I say this so that my rule holds, namely:
1. unzipping a zip store creates a directory tree usable by the zarr directory tree storage manager.
2. zipping a zarr directory tree creates a zip store conforming to the proposed zip spec.

jhamman · 2024-10-07T17:32:35Z

> I think I have always used either linux zip or cygwin zip to create zarr zip files. What native windows program could I use to create a pure windows zip file?

🤷

> As for the top-level directory, I think it is better to always include it. I say this so that my rule holds, namely:
> 1. unzipping a zip store creates a directory tree usable by the zarr directory tree storage manager.
> 2. zipping a zarr directory tree creates a zip store conforming to the proposed zip spec.

👍

joshmoore · 2024-10-08T07:34:41Z

A few downsides of adding the directory:

* It's a change from v2 😄 though that can be handled.
* You must know/lookup that value before you open the zip.
* What is the behavior if there are multiple .zarr directories within the zip?

This commit adds a ZipStore storage backend as described in the specification zarr-developers/zarr-specs#311. Note that the implementation loads the entire zip archive into memory, so care must be taken to ensure the zip archive is not too big to fit into the machine's memory. To use a ZipStore implementation that does not load the archive into memory, see `examples/zipstore.ml`.

jwindhager · 2024-11-25T13:48:34Z

At the recent OME-NGFF Workflows Hackathon, a team has been discussing possible paths towards a "single-file" OME-NGFF standard. Our preferred path would be to build upon zipped zarrs, i.e. related to this PR. Please find below relevant points extracted from our discussions. Apologies for the long reply, happy to turn it into GitHub review/suggestion style if necessary.

TL;DR:

* Yes to "ZipStore" (perhaps later generalize to "ArchiveStore", remove interface definition, add resource identifier scheme)
* No to root directory (but readers may implement additional discovery strategies)
* Specify that existing keys may not be overwritten (?)
* More narrowly specify ZIP file format (see below)?

Replies to previous comments:

> How useful is a ZipStore in practice? Are there a lot of use cases for it?

In the bioimaging domain, many researchers tend to prefer individual files over file system directories when handling small to medium-sized image data (cf. TIFF), not least for practical reasons (e.g., file sharing using traditional means, double-click-open/drag-and-drop support in existing tools) and because existing tooling largely isn't ready for handling file system stores. We'd argue this applies to other domains as well.

In practice, the limitations of file system stores when handling small data mean that people will archive (i.e., "zip") zarrs either way, independent of whether this is part of the specification or not. Specifying just how zarrs should be archived would enable tool developers to readily implement support for spec-compliant zarr archives, making Zarr a good choice also for their users.

> Personally, I think that using a single file file system (SFFS) with added compression makes more sense.

We did not specifically discuss this idea. What would be the benefit of a full-fledged SFFS over archive file formats (which we'd argue are specific instances of SFFSs)? Regarding compression, Zarr itself already supports several codecs.

As a side note, we instead discussed the related idea of using a single-file container format (e.g. HDF5) for a second implementation of the OME-NGFF specification (in addition to Zarr) to enable single-file images. However, this would come at the cost of significant development overhead, would eventually necessitate conversion between different "backends", and would risk fragmenting the community (particularly if there are discrepancies in interpretation), so we'd strongly prefer to stay within Zarr territory for single-file OME-NGFF (which the ZipStore would allow us to do).

But, as @jhamman rightfully wrote, let's save further discussion on alternatives for another time.

> I wonder if a general "archival" format that can be extended for, say, zip/tar/whatever wouldn't be something to consider

We agree that, depending on the scope of the specification, this draft could (at least in part) apply more generally to any archive file format. Perhaps this could be generalized in a second step, once the ZipStore has been added?

For now, we propose limiting the scope of this draft to a specific file format and endorse ZIP for the following reasons:

* widely supported, including native support by several operating systems
* several implementations already exist (zarr-python, zarrs_zip)
* compression performance not as relevant, due to zarr's compression codecs
* has a file index (important for remote access, e.g. using HTTP range requests)
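
To make the last point concrete, here is a minimal sketch using only Python's standard `zipfile` module. It reads the central directory of a small in-memory archive and records, for each entry, the offset of its local file header and its stored size. The archive contents and the `index` name are made up for illustration; a remote reader would presumably fetch the central directory from the end of the file first and then issue range requests for the entries it needs.

```python
import io
import zipfile

# Build a small uncompressed archive in memory (contents are illustrative).
metadata = b'{"zarr_format": 3}'
chunk = b"\x01\x02\x03\x04"
buf = io.BytesIO()
with zipfile.ZipFile(buf, mode="w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr("zarr.json", metadata)
    zf.writestr("c/0", chunk)

# The central directory acts as the file index: no entry data is read here.
with zipfile.ZipFile(buf) as zf:
    index = {
        # header_offset points at the entry's local file header; the payload
        # follows that header, so a reader still has to skip over it.
        info.filename: (info.header_offset, info.compress_size)
        for info in zf.infolist()
    }

for key, (offset, size) in index.items():
    print(key, offset, size)
```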

> Is it really a store spec or is it a format spec?

We too were wondering if it would make sense - in the long term - to separate the interface definition from the on-disk representation. Perhaps the interface definition could be considered an implementation detail, whereas the on-disk representation is more essential to ensure data portability? Not explicitly specifying store operations would also address compatibility issues (e.g. ZIP possibly not supporting in-place update/delete operations).

More generally, with "non-file system stores" defined, we think that the current specification is missing consistent resource identifier (e.g. URI) schemes and/or alternative means (e.g., file suffix, mime type, magic number, user decision) for delineating on-disk representations/stores. This is particularly relevant in the case of OME-NGFF, where OME-Zarrs may contain multiple images and users may therefore need to specify the path to a specific image within the zip (e.g. for visualization), ideally as part of the resource identifier pointing to the zip file. However, this is not specific to the ZipStore, should in our view not be mixed with the storage specification either, and may well be an "upstream problem" for a more general specification. We thus propose to leave it up to implementations to decide what "store" to use for a given resource for now.

> For the format, the most important item I know of is "don't include the top-level directory" (though I have run into some complaints about that from various repositories, since the behavior differs between implementations, e.g. on Windows) ... and related comments

Having a root directory inside a zip file (with the same name as the zip file itself) can quickly become confusing/out of sync if the zip files have been renamed automatically (e.g. upon re-downloading an already existing file) and/or manually. We'd argue that not being able to unpack zip files into the same directory without first (automatically?) creating target root directories is far less confusing than ending up with directory names that may not match the zip file names (and just as in the case of no root folder, depending on tooling, one could still end up accidentally overriding "competing" root folders if they happen to have the same name). We therefore propose to NOT use root directories for archiving zarrs.

Specifically, for zarr-specific zip writer implementations, we propose to REQUIRE the creation of archives without a root directory (for above reasons and consistency, also with Zarr v2). However, since zarrs may also be archived using zarr-agnostic tooling, we propose to specify that zarr reader implementations MAY additionally check for single root directories or recursively scan for zarr.json files if no zarr.json file can be found in the zip file's root. We are not aware of a strong use case for introducing an optional root-folder metadata file pointing to an arbitrary specific location within a zip file for zarr.json discovery.
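
The optional reader behaviour described above could look roughly like the following sketch, which uses only Python's standard `zipfile` module. The function name `find_zarr_root`, the order of the fallbacks, and the "shallowest match wins" tie-break are assumptions made for illustration; none of them is prescribed by the draft or by this proposal.

```python
import io
import zipfile
from typing import Optional


def find_zarr_root(zf: zipfile.ZipFile) -> Optional[str]:
    """Return the key prefix of the Zarr root inside the archive, or None."""
    names = [n for n in zf.namelist() if not n.endswith("/")]
    # 1. Preferred layout: zarr.json sits at the root of the archive.
    if "zarr.json" in names:
        return ""
    # 2. Fallback: a single top-level directory that contains zarr.json.
    top_level = {n.split("/", 1)[0] for n in names}
    if len(top_level) == 1:
        (root,) = top_level
        if f"{root}/zarr.json" in names:
            return root + "/"
    # 3. Last resort: the shallowest zarr.json anywhere in the archive.
    candidates = [n for n in names if n.endswith("/zarr.json")]
    if candidates:
        best = min(candidates, key=lambda n: (n.count("/"), n))
        return best[: -len("zarr.json")]
    return None


def make_zip(entries):
    """Build an in-memory archive with the given (illustrative) entry names."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in entries:
            zf.writestr(name, b"{}")
    return zipfile.ZipFile(buf)


print(repr(find_zarr_root(make_zip(["zarr.json", "a/zarr.json"]))))
print(repr(find_zarr_root(make_zip(["data.zarr/zarr.json", "data.zarr/a/zarr.json"]))))
```

A reader that finds no `zarr.json` at all gets `None` back and can report that the archive is not a Zarr hierarchy.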

Additional remarks:

The current draft does not specify that existing keys cannot generally be overwritten (to our understanding, this is not generally possible according to the ZIP standard).

Should the draft specify the archive file format a bit more precisely, e.g., ZIP64 support (yes), support of empty or spanned zip files (no), supported compression formats (if any)? Perhaps writers should be required to support writing uncompressed ZIP64 files, whereas readers MAY support further compression algorithms?
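
As a rough illustration of these options, the sketch below writes an uncompressed archive with ZIP64 extensions permitted and then checks which entries use a compression method outside an allowed set. It relies only on Python's standard `zipfile` module. The `ALLOWED_METHODS` policy and the `unsupported_entries` helper are hypothetical; the draft does not define them. Note that `allowZip64=True` only permits the extensions, and they are emitted only when an archive actually needs them.

```python
import io
import zipfile

# Hypothetical reader policy: only uncompressed ("stored") entries are accepted.
ALLOWED_METHODS = {zipfile.ZIP_STORED}


def unsupported_entries(zf: zipfile.ZipFile) -> list:
    """Return the names of entries whose compression method is not allowed."""
    return [
        info.filename
        for info in zf.infolist()
        if info.compress_type not in ALLOWED_METHODS
    ]


buf = io.BytesIO()
with zipfile.ZipFile(
    buf,
    mode="w",
    compression=zipfile.ZIP_STORED,  # no ZIP-level compression by default
    allowZip64=True,  # permit ZIP64 extensions for large archives
) as zf:
    zf.writestr("zarr.json", b'{"zarr_format": 3, "node_type": "group"}')
    # One entry deliberately written with DEFLATE to exercise the check.
    zf.writestr("c/0", b"\x00" * 64, compress_type=zipfile.ZIP_DEFLATED)

with zipfile.ZipFile(buf) as zf:
    rejected = unsupported_entries(zf)

print(rejected)
```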

-- @bpavie @jwindhager @leoschwarz @retogerber

d-v-b · 2024-11-25T13:52:04Z

docs/v3/stores/zipfile/v1.0.rst

+Store limitations
+=================
+
+The following limitations for this store are know:


Suggested change

- The following limitations for this store are know:
+ The following limitations for this store are known:

d-v-b · 2024-11-25T14:47:53Z

Thanks for the writeup @jwindhager et al. It might be clarifying to factor out the following two discussions, as I think they are logically separable:

* Question 1: how should Zarr implementations design APIs for the storage of Zarr data inside single-file archive formats (namely zip)? I think this is the question @jhamman is attempting to resolve with this particular PR. I think it may not be possible to answer this particular question in a way that provides normative guidelines for applications that target end users.
* Question 2: can we define best practices (or even a spec) for storing Zarr data inside single-file archives (namely, zip)? I think this is what @jwindhager et al are looking for (please correct me if I'm misreading). Such best practices / spec might reasonably carve out a restricted subset of the space expressible with a generic Zarr + zip API, and include things like file extensions, URI conventions, etc.

I think question 2 is the kind of thing that really ought to be explored alongside at least one implementation. So my pitch is as follows: we implement an opinionated zarr archive function in zarr-python. The function I am imagining takes a zarr array or group and saves all the stuff in that array / group in a destination zip file, according to a set of guidelines that we simultaneously document as a spec-like-document in the zarr-python repo. I think this could be really productive, and developing the spec alongside the implementation will keep them both balanced. At the end of this process, we should have both a useful implementation and a well-tested spec, which could be re-published in a broader venue than zarr-python.
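
A very rough sketch of what such an archive function might look like follows. To avoid guessing at zarr-python's API, it operates on an on-disk directory store rather than on a zarr array or group object, and it uses only the Python standard library. The function name `archive_directory_store`, the choice of uncompressed entries, and the "no root directory" layout are assumptions drawn from the proposals in this thread, not an agreed guideline.

```python
import os
import tempfile
import zipfile


def archive_directory_store(store_dir: str, zip_path: str) -> None:
    """Zip the contents of store_dir so that zarr.json lands at the archive root."""
    with zipfile.ZipFile(zip_path, mode="w", compression=zipfile.ZIP_STORED) as zf:
        for dirpath, dirnames, filenames in os.walk(store_dir):
            dirnames.sort()  # deterministic traversal order
            for filename in sorted(filenames):
                full_path = os.path.join(dirpath, filename)
                # Key relative to the store root, always using "/" separators,
                # so the store directory itself is not part of any key.
                key = os.path.relpath(full_path, store_dir).replace(os.sep, "/")
                zf.write(full_path, arcname=key)


# Usage with a throwaway directory store (file contents are placeholders).
with tempfile.TemporaryDirectory() as tmp:
    store = os.path.join(tmp, "example.zarr")
    os.makedirs(os.path.join(store, "c"))
    with open(os.path.join(store, "zarr.json"), "w") as f:
        f.write("{}")
    with open(os.path.join(store, "c", "0"), "wb") as f:
        f.write(b"\x00" * 8)

    zip_path = os.path.join(tmp, "example.zarr.zip")
    archive_directory_store(store, zip_path)
    with zipfile.ZipFile(zip_path) as zf:
        archived = sorted(zf.namelist())

print(archived)
```

The archive is written next to the store rather than inside it, so the walk never picks up the zip file itself.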

Would anyone here be interested in working on such an effort? I do think this requires at least one champion to push forward. As a zarr-python maintainer I would be willing to put in the time for reviewing PRs and whatnot.

jbms · 2024-11-25T15:35:14Z

I'm in favor of supporting archive formats and zip. Tensorstore already supports reading but not writing.

Zip has some disadvantages in its design but I think they are outweighed by it being such a common format.

I agree that there should be no implicit root directory, and while some implementations may do auto-discovery, there should be a canonical url that makes any sub-directories within the zip file explicit.

The spec says the canonical url is just a file url, file:///path/to/file.zip. While that is reasonable for implementations that do auto-discovery, I don't think that is a good idea as the canonical url since it does not explicitly indicate the zip format at all, and would rely on implementations detecting it either by the filename or content.

Previously I proposed a different url syntax (zarr-developers/zeps#48) which allows nested formats like zip to be specified explicitly.

DennisHeimbigner · 2024-11-25T18:02:55Z

IMO zip is not a good single file storage format. Other choices like the various
single file file systems (SFFS) seem to me to be a much more flexible choice, especially
in the face of writing.

d-v-b · 2024-11-25T18:06:07Z

> IMO zip is not a good single file storage format. Other choices like the various
> single file file systems (SFFS) seem to me to be a much more flexible choice, especially
> in the face of writing.

This might be true but I think it's orthogonal to the discussion at hand -- we are not trying to find the best archive file format, but rather devise standards to improve the utility of a popular archive file format (zip). Zip being sub-par doesn't bear on the fact that people want to use it, and that latter fact is what we should build around IMO.

zoj613 · 2024-11-26T10:38:25Z

> The current draft does not specify that existing keys cannot generally be overwritten (to our understanding, this is not generally possible according to the ZIP standard).

I think it should be left to the implementation to decide whether to support write operations. Some people might want to rename/overwrite/delete entries in a ZIP as a convenience, just like with any other store. Sure, the ZIP standard does not support this, but there are ways to work around this limitation (although they are quite inefficient). For example, I have 2 ZipStore implementations that support the full zarr v3 abstract store interface as defined in the core spec. I don't see the benefit of imposing this limitation on implementations.
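
One such workaround, sketched with Python's standard `zipfile` module: since an existing entry cannot be updated in place, the whole archive is rewritten with the entry replaced. The function name `replace_entry` and the sample data are hypothetical, and this is not a description of the implementations mentioned above. The cost grows with the size of the archive, which is the inefficiency referred to.

```python
import io
import zipfile


def replace_entry(archive: bytes, key: str, value: bytes) -> bytes:
    """Return a new archive in which `key` holds `value`."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(archive)) as src, zipfile.ZipFile(out, "w") as dst:
        # Copy every other entry unchanged, keeping its compression method.
        for info in src.infolist():
            if info.filename != key:
                dst.writestr(
                    info.filename,
                    src.read(info.filename),
                    compress_type=info.compress_type,
                )
        # Write the new value last; the old entry is simply not copied.
        dst.writestr(key, value)
    return out.getvalue()


# Usage: build a small archive, then "overwrite" one key.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("zarr.json", b"old")
    zf.writestr("c/0", b"chunk")

updated = replace_entry(buf.getvalue(), "zarr.json", b"new")
with zipfile.ZipFile(io.BytesIO(updated)) as zf:
    names = zf.namelist()
    contents = zf.read("zarr.json")

print(names, contents)
```

Deleting a key works the same way with the final `writestr` call left out.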

jbms · 2024-11-26T14:48:09Z

> > The current draft does not specify that existing keys cannot generally be overwritten (to our understanding, this is not generally possible according to the ZIP standard).

> I think this should be left to the implementation to decide whether to support write operations. Some people might want to rename/overwrite/delete entries in a ZIP as a convenience just like any other store. Sure the ZIP standard does not support this but there are ways to workaround this limitation (although quite inefficient). For example, I have 2 ZipStore implementations that support the full zarr v3 abstract store interface as defined in the core spec. I don't see the benefit of imposing this limitation to implementations.

Agreed, and in fact I think this spec could be made much more concise. I don't think it is necessary to list the supported operations.

jwindhager · 2024-11-26T16:01:08Z

@d-v-b I agree that we could factor out the two discussions (this is what we meant with "separate the interface definition from the on-disk representation") and I would also add the "canonical url" proposal by @jbms to the list (this is roughly what we referred to as "consistent resource identifier schemes"). However, given the way stores are specified in the current spec, I'd pragmatically argue that it's probably easier to get this PR merged in its current form rather than to further broaden the scope / branch out.

I also agree that we could push the on-disk representation aspect on the zarr-python side ("guidelines"), but we'd need to make sure to simultaneously push this on the spec side (i.e. here) as well. Otherwise, we may risk ending up with a Python-specific de-facto standard that in the worst case collides with future upstream specs. But perhaps this worry is unfounded, in which case I'd be happy to contribute within my (unfortunately limited) time constraints.

@jbms Thanks a lot for pointing us to your ZEP! We somehow completely missed it in our group's discussions. I think this could address many important issues, also related to zipped Zarr (and thus single-file OME-NGFF), and will give this a read asap.

> > > The current draft does not specify that existing keys cannot generally be overwritten (to our understanding, this is not generally possible according to the ZIP standard).

> > I think this should be left to the implementation to decide whether to support write operations. Some people might want to rename/overwrite/delete entries in a ZIP as a convenience just like any other store. Sure the ZIP standard does not support this but there are ways to workaround this limitation (although quite inefficient). For example, I have 2 ZipStore implementations that support the full zarr v3 abstract store interface as defined in the core spec. I don't see the benefit of imposing this limitation to implementations.

> Agreed, and in fact I think this spec be made much more concise. I don't think it is necessary to list the supported operations.

Fair point. I still think that one could clarify this under "Store limitations" using the right phraseology, but no strong feelings.

d-v-b · 2024-11-26T16:21:53Z

> I also agree that we could push the on-disk representation aspect on the zarr-python side ("guidelines"), but we'd need to make sure to simultaneously push this on the spec side (i.e. here) as well. Otherwise, we may risk ending up with a Python-specific de-facto standard that in the worst case collides with future upstream specs. But perhaps this worry is unfounded, in which case I'd be happy to contribute within my (unfortunately limited) time constraints.

We can iterate quickly in zarr-python, including on the spec side. What I'm imagining is "incubating" the zip convention in zarr-python (and / or any other zarr implementation), developing a spec and an implementation that are both tested, and finally propose something to this repo with that experience in hand. Until the spec is published here, we would publish it in zarr-python, and the spec itself would not be python-specific, because we would design it that way :)

feature(stores): draft zip file store specification

95d18f6

jhamman mentioned this pull request Sep 11, 2024

feature(store): V3 ZipStore zarr-developers/zarr-python#2078

Merged

6 tasks

jhamman commented Sep 11, 2024

docs/v3/stores/zipfile/v1.0.rst

Comment on lines +88 to +90

* Delete a file.

* Delete a directory.

See #103

LDeakin mentioned this pull request Sep 12, 2024

Add zarr_python_compat_zip_store test LDeakin/zarrs#71

Merged

joshmoore mentioned this pull request Oct 4, 2024

Feat/store paths zarr-developers/zarr-python#2272

Open

6 tasks

zoj613 mentioned this pull request Nov 6, 2024

Add ZipStore zarr storage backend. zoj613/zarr-ml#79

Merged
