feature(stores): draft zip file store specification #311
* Delete a file. | ||
|
||
* Delete a directory. |
See #103
In my experience, the root of the zip is one of the trickiest parts for data creators (and I assume implementers) to get right, e.g., |
How useful is a ZipStore in practice? Are there a lot of use cases for it? Given how limited it is (no rename/deletion, etc.), I am wondering if it's worth having a spec for it.
I have support equivalent to zipstore in nczarr in the netcdf-c library. I agree that it does not appear to be |
@joshmoore - do you have suggestions for the spec document that would make this clearer? @zoj613 and @DennisHeimbigner - let's try to avoid making this about alternatives to the ZIP store concept. There are practical reasons to add this (Zarr-Python has long supported a ZIP store interface). Remember, Zarr can support many storage backends. If there are alternatives to experiment with, let's do that in a separate issue. @DennisHeimbigner - I would like to get your feedback on the spec as written. Is it aligned with your netcdf-c implementation? |
* ``get(key) -> value`` : Read and return the contents of the object at | ||
within the archive at path ``key``. | ||
|
||
* ``set(key, value)`` : Write ``value`` as the contents of the file at | ||
into the archive at path ``key . |
Is the use of `at within` and `at into` in these lines intentional? Sounds like a typo.
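For readers following the thread, the two operations quoted above can be illustrated with a minimal sketch built on Python's standard `zipfile` module. The helper names `zip_get` and `zip_set` are hypothetical and chosen only to mirror the draft's wording; this is not the Zarr-Python implementation.

```python
import io
import zipfile


def zip_set(archive: zipfile.ZipFile, key: str, value: bytes) -> None:
    """Write ``value`` into the archive at path ``key`` (hypothetical helper)."""
    archive.writestr(key, value)


def zip_get(archive: zipfile.ZipFile, key: str) -> bytes:
    """Read and return the contents of the archive member at path ``key``."""
    return archive.read(key)


# Round trip through an in-memory archive.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_STORED) as zf:
    zip_set(zf, "zarr.json", b'{"zarr_format": 3}')

with zipfile.ZipFile(buffer, mode="r") as zf:
    data = zip_get(zf, "zarr.json")
```

Note that `writestr` on an existing name appends a second entry rather than replacing the first, which is one reason overwrite semantics come up later in this discussion.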
Thoughts revolving in my head include:
|
* Each key has a name (sequence of characters) and contents | ||
(sequence of bytes). |
Note that the keys are relative paths (not prefixed with a `/`).
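A reader or writer could enforce the relative-path rule with a small check like the one sketched below. The function name `is_valid_key` is hypothetical, and rejecting empty, `.` and `..` segments is an extra assumption that goes beyond what the comment states.

```python
def is_valid_key(key: str) -> bool:
    """Return True if ``key`` is a relative path usable as an archive member name."""
    # Keys are relative paths: they must not be empty or start with "/".
    if not key or key.startswith("/"):
        return False
    # Assumption (not from the draft): also reject empty, "." and ".." segments.
    return all(part not in ("", ".", "..") for part in key.split("/"))


relative_ok = is_valid_key("group/array/zarr.json")
absolute_ok = is_valid_key("/group/array/zarr.json")
```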
I think I have always used either linux zip or cygwin zip to create zarr zip files. What native windows program could I use to create a pure windows zip file? |
🤷
👍 |
A few downsides of adding the directory:
|
This commit adds a ZipStore storage backend as described in the specification zarr-developers/zarr-specs#311. Note that the implementation loads the entire zip archive into memory, so care must be taken to ensure the zip archive is not too big to fit into the machine's memory. To use a ZipStore implementation that does not load the archive into memory, see `examples/zipstore.ml`.
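The commit referenced above is for an OCaml implementation. Purely as an illustration of the trade-off it describes, the following Python sketch contrasts reading the whole archive into memory with opening it from disk so that entries are read on demand. The function names `open_in_memory` and `open_lazily` are hypothetical and do not correspond to anything in that commit.

```python
import io
import os
import tempfile
import zipfile


def open_in_memory(path: str) -> zipfile.ZipFile:
    """Read the whole archive into memory first; simple, but bounded by RAM."""
    with open(path, "rb") as f:
        return zipfile.ZipFile(io.BytesIO(f.read()), mode="r")


def open_lazily(path: str) -> zipfile.ZipFile:
    """Open the archive from disk; entry contents are read only when requested."""
    return zipfile.ZipFile(path, mode="r")


# Build a tiny archive on disk and open it both ways.
path = os.path.join(tempfile.mkdtemp(), "example.zip")
with zipfile.ZipFile(path, mode="w") as zf:
    zf.writestr("zarr.json", b"{}")

with open_in_memory(path) as eager, open_lazily(path) as lazy:
    eager_names = eager.namelist()
    lazy_names = lazy.namelist()
```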
At the recent OME-NGFF Workflows Hackathon, a team has been discussing possible paths towards a "single-file" OME-NGFF standard. Our preferred path would be to build upon zipped zarrs, i.e. related to this PR. Please find below relevant points extracted from our discussions. Apologies for the long reply, happy to turn it into GitHub review/suggestion style if necessary. TL;DR:
Replies to previous comments:
In the bioimaging domain, many researchers tend to prefer individual files over file system directories when handling small to medium-sized image data (cf. TIFF), not least for practical reasons (e.g., file sharing using traditional means, double-click-open/drag-and-drop support in existing tools) and because existing tooling largely isn't ready for handling file system stores. We'd argue this applies to other domains as well. In practice, the limitations of file system stores when handling small data mean that people will archive (i.e., "zip") zarrs either way, independent of whether this is part of the specification or not. Specifying just how zarrs should be archived would enable tool developers to readily implement support for spec-compliant zarr archives, making Zarr a good choice also for their users.
We did not specifically discuss this idea. What would be the benefit of a full-fledged SFFS over archive file formats (which we'd argue are specific instances of SFFSs)? Regarding compression, Zarr itself already supports several codecs. As a side note, we instead discussed the related idea of using a single-file container format (e.g. HDF5) for a second implementation of the OME-NGFF specification (in addition to Zarr) to enable single-file images. However, this would come at the cost of significant development overhead, would eventually necessitate conversion between different "backends", and would risk fragmenting the community (particularly if there are discrepancies in interpretation), so we'd strongly prefer to stay within Zarr territory for single-file OME-NGFF (which the ZipStore would allow us to do). But, as @jhamman rightfully wrote, let's save further discussion on alternatives for another time.
We agree that, depending on the scope of the specification, this draft could (at least in part) apply more generally to any archive file format. Perhaps this could be generalized in a second step, once the ZipStore has been added? For now, we propose limiting the scope of this draft to a specific file format and endorse ZIP for the following reasons:
We too were wondering if it would make sense - in the long term - to separate the interface definition from the on-disk representation. Perhaps the interface definition could be considered an implementation detail, whereas the on-disk representation is more essential to ensure data portability? Not explicitly specifying store operations would also address compatibility issues (e.g. ZIP possibly not supporting in-place update/delete operations). More generally, with "non-file system stores" defined, we think that the current specification is missing consistent resource identifier (e.g. URI) schemes and/or alternative means (e.g., file suffix, MIME type, magic number, user decision) for delineating on-disk representations/stores. This is particularly relevant in the case of OME-NGFF, where OME-Zarrs may contain multiple images and users may therefore need to specify the path to a specific image within the zip (e.g. for visualization), ideally as part of the resource identifier pointing to the zip file. However, this is not specific to the ZipStore, should in our view not be mixed with the storage specification either, and may well be an "upstream problem" for a more general specification. We thus propose to leave it up to implementations to decide what "store" to use for a given resource for now.
Having a root directory inside a zip file (with the same name as the zip file itself) can quickly become confusing/out of sync if the zip files have been renamed automatically (e.g. upon re-downloading an already existing file) and/or manually. We'd argue that not being able to unpack zip files into the same directory without first (automatically?) creating target root directories is far less confusing than ending up with directory names that may not match the zip file names (and just as in the case of no root folder, depending on tooling, one could still end up accidentally overriding "competing" root folders if they happen to have the same name).

We therefore propose to NOT use root directories for archiving zarrs. Specifically, for zarr-specific zip writer implementations, we propose to REQUIRE the creation of archives without a root directory (for above reasons and consistency, also with Zarr v2). However, since zarrs may also be archived using zarr-agnostic tooling, we propose to specify that zarr reader implementations MAY additionally check for single root directories or recursively scan for

Additional remarks: The current draft does not specify that existing keys cannot generally be overwritten (to our understanding, this is not generally possible according to the ZIP standard). Should the draft specify the archive file format a bit more precisely, e.g., ZIP64 support (yes), support of empty or spanned zip files (no), supported compression formats (if any)? Perhaps writers should be required to support writing uncompressed ZIP64 files, whereas readers MAY support further compression algorithms?
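To make the proposal above concrete, here is a minimal Python sketch of a writer that emits uncompressed, ZIP64-capable archives without a root directory, and of a reader-side check for a single root directory. The names `write_archive` and `find_root_prefix` are hypothetical, and the detection rule is only one possible reading of "check for single root directories", not something the draft specifies.

```python
import io
import zipfile


def write_archive(fileobj, entries: dict) -> None:
    """Write entries at the archive root, uncompressed, with ZIP64 allowed."""
    with zipfile.ZipFile(
        fileobj, mode="w", compression=zipfile.ZIP_STORED, allowZip64=True
    ) as zf:
        for key, value in entries.items():
            zf.writestr(key, value)


def find_root_prefix(names) -> str:
    """Return "" if entries sit at the archive root, or "dir/" if every
    entry lives under one single top-level directory."""
    tops = set()
    for name in names:
        head, sep, _ = name.partition("/")
        if not sep:
            return ""  # a file directly at the archive root
        tops.add(head)
    return tops.pop() + "/" if len(tops) == 1 else ""


flat, nested = io.BytesIO(), io.BytesIO()
write_archive(flat, {"zarr.json": b"{}", "c/0": b"\x00"})
write_archive(nested, {"image.zarr/zarr.json": b"{}", "image.zarr/c/0": b"\x00"})

with zipfile.ZipFile(flat) as zf:
    flat_prefix = find_root_prefix(zf.namelist())
with zipfile.ZipFile(nested) as zf:
    nested_prefix = find_root_prefix(zf.namelist())
```

A reader that finds a non-empty prefix could prepend it to every key before lookup, which keeps archives produced by zarr-agnostic tooling readable without changing what writers produce.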
Store limitations | ||
================= | ||
|
||
The following limitations for this store are know: |
The following limitations for this store are know: | |
The following limitations for this store are known: |
thanks for the writeup @jwindhager et al. It might be clarifying to factor out the following two discussions, as I think they are logically separable:
I think question 2 is the kind of thing that really ought to be explored alongside at least one implementation. So my pitch is as follows: we implement an opinionated zarr archive function in Would anyone here be interested in working on such an effort? I do think this requires at least one champion to push forward. As a |
I'm in favor of supporting archive formats and zip. Tensorstore already supports reading but not writing. Zip has some disadvantages in its design but I think they are outweighed by it being such a common format. I agree that there should be no implicit root directory, and while some implementations may do auto-discovery, there should be a canonical url that makes any sub-directories within the zip file explicit. The spec says the canonical url is just a file url, file:///path/to/file.zip. While that is reasonable for implementations that do auto-discovery, I don't think that is a good idea as the canonical url since it does not explicitly indicate the zip format at all, and would rely on implementations detecting it either by the filename or content. Previously I proposed a different url syntax (zarr-developers/zeps#48) which allows nested formats like zip to be specified explicitly. |
IMO zip is not a good single file storage format. Other choices like the various |
This might be true but I think it's orthogonal to the discussion at hand -- we are not trying to find the best archive file format, but rather devise standards to improve the utility of a popular archive file format (zip). Zip being sub-par doesn't bear on the fact that people want to use it, and that latter fact is what we should build around IMO.
I think this should be left to the implementation to decide whether to support write operations. Some people might want to rename/overwrite/delete entries in a ZIP as a convenience just like any other store. Sure, the ZIP standard does not support this, but there are ways to work around this limitation (although quite inefficient). For example, I have 2 ZipStore implementations that support the full zarr v3 abstract store interface as defined in the core spec. I don't see the benefit of imposing this limitation on implementations.
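One inefficient workaround of the kind mentioned above is to rewrite the archive while skipping the entries to be removed. The Python sketch below shows that idea; the function name `rewrite_without` is hypothetical, and this is not the code of the implementations the commenter refers to.

```python
import io
import zipfile


def rewrite_without(src, dst, keys_to_drop) -> None:
    """Copy every entry of ``src`` into ``dst`` except those in ``keys_to_drop``."""
    with zipfile.ZipFile(src, mode="r") as zin, zipfile.ZipFile(
        dst, mode="w", compression=zipfile.ZIP_STORED
    ) as zout:
        for info in zin.infolist():
            if info.filename not in keys_to_drop:
                zout.writestr(info.filename, zin.read(info.filename))


# Build an archive with two entries, then "delete" one by rewriting.
original = io.BytesIO()
with zipfile.ZipFile(original, mode="w") as zf:
    zf.writestr("zarr.json", b"{}")
    zf.writestr("c/0", b"\x00")

rewritten = io.BytesIO()
rewrite_without(original, rewritten, {"c/0"})

with zipfile.ZipFile(rewritten) as zf:
    remaining = zf.namelist()
```

The cost is proportional to the size of the whole archive for every deletion, which is the inefficiency the comment acknowledges. An overwrite can be expressed the same way by dropping the old entry and then writing the new value.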
Agreed, and in fact I think this spec could be made much more concise. I don't think it is necessary to list the supported operations.
@d-v-b I agree that we could factor out the two discussions (this is what we meant with "separate the interface definition from the on-disk representation") and I would also add the "canonical url" proposal by @jbms to the list (this is roughly what we referred to as "consistent resource identifier schemes"). However, given the way stores are specified in the current spec, I'd pragmatically argue that it's probably easier to get this PR merged in its current form rather than to further broaden the scope / branch out. I also agree that we could push the on-disk representation aspect on the @jbms Thanks a lot for pointing us to your ZEP! We somehow completely missed it in our group's discussions. I think this could address many important issues, also related to zipped Zarr (and thus single-file OME-NGFF), and will give this a read asap.
Fair point. I still think that one could clarify this under "Store limitations" using the right phraseology, but no strong feelings. |
We can iterate quickly in |
This is a working draft of the v3 ZIP file store specification.
xref: