Extension proposal for zarr accumulation #205

Open · wants to merge 29 commits into main
Conversation

@hailiangzhang commented Feb 14, 2023

This PR contains a Zarr extension proposal for a method that we developed at NASA GES DISC -- Zarr-based Chunk-level Accumulation -- which provides fast and cost-efficient data analysis services based on chunk-level statistics. The proposal mainly covers the Zarr group structure and attribute schema for the chunk-level statistics.

The related PR against the zeps repo is here.

@jbms (Contributor) commented Feb 16, 2023

Thanks for this proposal. It seems like it may make sense to specify the stride in units of elements rather than chunks. Then this would more generally also cover downsampling.

@hailiangzhang (Author) commented:
> Thanks for this proposal. It seems like it may make sense to specify the stride in units of elements rather than chunks. Then this would more generally also cover downsampling.

Thanks for your feedback, @jbms!

Yes, it would be great to make this approach more flexible by specifying the stride in units of elements rather than chunks, but I think it is probably okay to store the aggregation statistics at the chunk level, since:

  1. The chunk is the smallest transaction unit (I/O, compression, filtering, etc.), so performance-wise it should be fine to align the aggregation statistics along chunk boundaries (we need to load the whole chunk anyway).

  2. Setting strides at the element level could increase the coding complexity considerably.

  3. More importantly, setting strides at the chunk level doesn't mean the averaging/aggregation service can only be provided at chunk granularity. End users can specify any temporal range or bounding box, and the internal stride setup is transparent to them. The application locates the strides at the corners of the averaging boundary, and the averaging/aggregation result is a simple linear combination of the accumulation values at those strides and the values inside the boundary chunks (see the sketch below).
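
To make point 3 concrete, here is a minimal 1-D sketch of that decomposition (an editorial illustration with hypothetical function names, not the proposal's actual API): the sum over an arbitrary range is the difference of stored accumulations across the fully covered chunks, plus direct sums over the two partially covered boundary chunks.

```python
import numpy as np

def chunk_accumulation(x, chunk):
    """acc[c] = x[:(c + 1) * chunk].sum() -- cumulative sums sampled at chunk boundaries."""
    n_chunks = -(-len(x) // chunk)              # ceil(len(x) / chunk)
    padded = np.zeros(n_chunks * chunk)         # pad so a partial last chunk is handled uniformly
    padded[:len(x)] = x
    return padded.reshape(n_chunks, chunk).sum(axis=1).cumsum()

def range_sum(x, acc, chunk, start, stop):
    """Sum of x[start:stop] from chunk-level accumulations plus the two boundary chunks."""
    c0, c1 = start // chunk, (stop - 1) // chunk
    if c0 == c1:                                # both endpoints fall inside one chunk
        return x[start:stop].sum()
    head = x[start:(c0 + 1) * chunk].sum()      # partial chunk on the left boundary
    tail = x[c1 * chunk:stop].sum()             # partial chunk on the right boundary
    interior = acc[c1 - 1] - acc[c0]            # fully covered chunks, from stored sums only
    return head + interior + tail

x = np.random.rand(1000)
acc = chunk_accumulation(x, chunk=100)
assert np.isclose(range_sum(x, acc, 100, 37, 812), x[37:812].sum())
```

Only the two boundary chunks are read from the base array, so the query cost is independent of how many chunks the requested range spans.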

@jbms (Contributor) commented Mar 23, 2023

@hailiangzhang As I mentioned at the meeting today, it would be very helpful if you could provide some mathematical formulas or pseudocode describing which cumulative sums you are storing. Your proposal describes the attribute schema but doesn't seem to describe exactly which sums are actually being stored, or how to compute the weighted sums based on the stored values.

It occurs to me that the following strategy would accomplish the goals of your proposal and might be similar to what you are doing:

  • A summed area table A for a variable X(i, j, k) is defined by: A(i, j, k) = sum_{i1<=i, j1<=j, k1<=k} X(i1, j1, k1). I think this is the strategy you described in your presentation as the standard accumulation strategy. This can be computed in linear time, and allows the sum over any region to be computed with 2^n reads of A, where n is the number of dimensions (3 in this case), but as you noted can run into numerical precision issues. (A runnable sketch of this variant appears after this comment.)
  • Instead of storing A fully, we can store only the values A(i, j, k) where i, j, or k is a multiple of some stride ci, cj, ck (e.g. the zarr chunk size), i.e. just the positions on a chunk boundary. This can be computed with the same cost as computing A fully. The storage cost will be approximately 1 / ci + 1 / cj + 1 / ck.
  • To compute the sum over any region, we need to be able to read the 2^n values from A. To do this, for each read we can reconstruct the necessary chunk of A by reading the stored values of A at the boundary of the chunk, and reading the content of the corresponding chunk of X. This takes time linear in the size of the chunk.
  • To efficiently store the values of A only at chunk boundaries, we need to somehow pack it into some number of dense arrays. The simplest way to do that would be to have a separate array for each dimension, i.e. Ai(ic, j, k) = A(ic * ci, j, k), Aj(i, jc, k) = A(i, jc * cj, k), Ak(i, j, kc) = A(i, j, kc * ck), which is stored for all i, j, k, ic, jc, kc. This does mean we redundantly store values that lie on the boundary in more than one dimension, though; a more complicated stored representation could potentially avoid that.
  • This strategy still does not solve the numerical precision issues, so I suspect it is not exactly the same as what you are doing.

Any clarification you can offer would be very helpful.
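
For reference, here is a runnable sketch of the full summed-area-table variant from the first bullet above (an editorial illustration with hypothetical names; it stores A densely rather than only at chunk boundaries):

```python
import itertools
import numpy as np

def summed_area_table(x):
    """A[i, j, k] = sum of x[:i + 1, :j + 1, :k + 1]; one cumsum per axis, linear time."""
    a = x.astype(np.float64)
    for axis in range(x.ndim):
        a = a.cumsum(axis=axis)
    return a

def region_sum(a, lo, hi):
    """Sum of x[lo[0]:hi[0], lo[1]:hi[1], ...] via 2^n reads of A (inclusion-exclusion)."""
    total = 0.0
    for corner in itertools.product((0, 1), repeat=a.ndim):
        idx = tuple(h - 1 if c else l - 1 for c, l, h in zip(corner, lo, hi))
        if min(idx) < 0:          # corner lies outside the array; its A value is 0
            continue
        total += (-1) ** (a.ndim - sum(corner)) * a[idx]
    return total

x = np.random.rand(20, 30, 40)
a = summed_area_table(x)
assert np.isclose(region_sum(a, (3, 5, 7), (15, 22, 31)), x[3:15, 5:22, 7:31].sum())
```

The chunk-boundary variant in the second and fourth bullets stores only slices of this table and reconstructs the needed corner values from one chunk of X, as described in the third bullet.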

@hailiangzhang (Author) commented:
> @hailiangzhang As I mentioned at the meeting today, it would be very helpful if you could provide some mathematical formulas or pseudocode describing which cumulative sums you are storing. Your proposal describes the attribute schema but doesn't seem to describe exactly which sums are actually being stored, or how to compute the weighted sums based on the stored values.
>
> It occurs to me that the following strategy would accomplish the goals of your proposal and might be similar to what you are doing:
>
>   • A summed area table A for a variable X(i, j, k) is defined by: A(i, j, k) = sum_{i1<=i, j1<=j, k1<=k} X(i1, j1, k1). I think this is the strategy you described in your presentation as the standard accumulation strategy. This can be computed in linear time, and allows the sum over any region to be computed with 2^n reads of A, where n is the number of dimensions (3 in this case), but as you noted can run into numerical precision issues.
>   • Instead of storing A fully, we can store only the values A(i, j, k) where i, j, or k is a multiple of some stride ci, cj, ck (e.g. the zarr chunk size), i.e. just the positions on a chunk boundary. This can be computed with the same cost as computing A fully. The storage cost will be approximately 1 / ci + 1 / cj + 1 / ck.
>   • To compute the sum over any region, we need to be able to read the 2^n values from A. To do this, for each read we can reconstruct the necessary chunk of A by reading the stored values of A at the boundary of the chunk, and reading the content of the corresponding chunk of X. This takes time linear in the size of the chunk.
>   • To efficiently store the values of A only at chunk boundaries, we need to somehow pack it into some number of dense arrays. The simplest way to do that would be to have a separate array for each dimension, i.e. Ai(i, j, k) = A(i * ci, j, k), Aj(i, j, k) = A(i, j * cj, k), Ak(i, j, k) = A(i, j, k * ck). This does mean we redundantly store values that lie on the boundary in more than one dimension, though; a more complicated stored representation could potentially avoid that.
>   • This strategy still does not solve the numerical precision issues, so I suspect it is not exactly the same as what you are doing.
>
> Any clarification you can offer would be very helpful.

Hi @jbms, thank you so much for writing down what you have in mind! I believe we're mostly aligned, except for some lower-level details. As you suggested, I will provide some code so you can look into it closely, although, as I mentioned, that may take some time; I can try to provide the formulas (similar to what you have done) ASAP.
Meanwhile, just to make sure we are on the same page: in the formula you listed above, Ai(i, j, k) = A(i * ci, j, k), are i, j, and k still "a multiple of some stride ci, cj, ck (e.g. the zarr chunk size)"?

@jbms (Contributor) commented Mar 23, 2023

> Meanwhile, just to make sure we are on the same page: in the formula you listed above, Ai(i, j, k) = A(i * ci, j, k), are i, j, and k still "a multiple of some stride ci, cj, ck (e.g. the zarr chunk size)"?

Sorry, I realize my notation wasn't very clear. I updated my comment to use the notation: Ai(ic, j, k) = A(ic * ci, j, k), so that I'm not using i in two different ways, and added an example with the sizes of the different arrays.
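
To illustrate the sizes involved, here is a worked example under assumed concrete dimensions (purely hypothetical numbers, not taken from the comment):

```python
I, J, K = 100, 200, 300            # assumed shape of X (illustrative only)
ci, cj, ck = 10, 20, 30            # assumed strides, e.g. the zarr chunk sizes

shape_Ai = (I // ci, J, K)         # (10, 200, 300) -- A sampled on i-chunk boundaries
shape_Aj = (I, J // cj, K)         # (100, 10, 300) -- A sampled on j-chunk boundaries
shape_Ak = (I, J, K // ck)         # (100, 200, 10) -- A sampled on k-chunk boundaries

relative_cost = 1 / ci + 1 / cj + 1 / ck   # ~0.18 of storing the full table A
```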

@jbms (Contributor) commented Mar 23, 2023

Independent of these details, I think it is clear now that this exists as an additional layer entirely on top of zarr itself. I'm still interested in learning more about this proposal, and happy to provide further review/feedback, but per discussion at the ZEP meeting today, it appears that this proposal may not be a good fit for the ZEP process itself.

The ZEP process is intended for changes to the zarr specification itself, and is intended to gather feedback from zarr implementors. This proposal more naturally exists as a layer entirely on top of zarr itself and requires no changes to zarr implementations like zarr-python. Therefore, it may make more sense for you to publish this standard independent of the zarr specification, as an external extension, similar to OME-zarr. If in the future we have a website that links to external extensions, it could be included there.

@hailiangzhang (Author) commented:
> Independent of these details, I think it is clear now that this exists as an additional layer entirely on top of zarr itself. I'm still interested in learning more about this proposal, and happy to provide further review/feedback, but per discussion at the ZEP meeting today, it appears that this proposal may not be a good fit for the ZEP process itself.
>
> The ZEP process is intended for changes to the zarr specification itself, and is intended to gather feedback from zarr implementors. This proposal more naturally exists as a layer entirely on top of zarr itself and requires no changes to zarr implementations like zarr-python. Therefore, it may make more sense for you to publish this standard independent of the zarr specification, as an external extension, similar to OME-zarr. If in the future we have a website that links to external extensions, it could be included there.

Hi @jbms, yes, I was aware that this proposal sits on top of Zarr itself. I actually raised this concern at some point and was advised to file it as a Zarr extension proposal rather than a spec proposal. Honestly, I also had the feeling that this proposal may not need to live in the Zarr core spec repo; we can host the spec as an external extension, similar to OME-Zarr, and I will provide a link when it's in place. Thanks a lot for your suggestions!

@rabernat (Contributor) commented:
> The ZEP process is intended for changes to the zarr specification itself, and is intended to gather feedback from zarr implementors.

Apologies for coming so late to this conversation.

I am the one who encouraged @hailiangzhang to consider proposing this as an extension. I wanted to state that, to me, this does seem like a good fit for an optional extension. I see this proposal as comparable to parquet's row-group statistics. Implementations can use this information to implement more efficient queries, e.g. predicate pushdown, or, in Hailiang's case, more efficient sums / means over regions of the array.

@jbms (Contributor) commented May 17, 2023

A more precise specification of this proposal would definitely help, as the current proposal does not seem to indicate precisely what calculated values are stored.

To me, a zarr extension is something that must be implemented by the zarr implementation itself, e.g. zarr-python. In contrast, something like ome-zarr is "layered on top of zarr", and makes use of a zarr implementation but does not require any changes to the zarr implementation. In principle the same package could implement both zarr and ome-zarr together, but likely would still employ a layered architecture where there is a lower-level "zarr" layer and then an ome-zarr layer on top.

This proposal seems like it falls under the "layered on top of zarr" category:

  • It can be, and has been, implemented without any changes to the zarr implementation.
  • The additional information is stored as additional regular zarr arrays and zarr user attributes.
  • It relates to an operation (computing certain weighted sums) that is not normally provided by the zarr implementation itself.
  • The additional information is, I think, computed as a batch operation after the base array has already been written. Likely this would be done using a distributed processing framework like dask, which might be outside the scope of a zarr implementation. (A minimal sketch of such a batch pass follows this comment.)
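
As an editorial illustration of that last point, here is a minimal sketch of such a batch pass, assuming the zarr-python 2.x API; the group layout and attribute names are hypothetical, not the proposal's actual schema:

```python
import numpy as np
import zarr

# Compute chunk-boundary cumulative sums for an existing 1-D zarr array in one
# linear pass, and store them as an ordinary sibling array plus user attributes.
root = zarr.open_group("example.zarr", mode="w")
x = root.zeros("x", shape=(1_000_000,), chunks=(10_000,), dtype="f8")
x[:] = np.random.rand(1_000_000)

chunk = x.chunks[0]
n_chunks = x.shape[0] // chunk
acc = np.empty(n_chunks)
running = 0.0
for c in range(n_chunks):                      # one pass over the base array's chunks
    running += x[c * chunk:(c + 1) * chunk].sum()
    acc[c] = running                           # cumulative sum at each chunk boundary

stats = root.create_group("accumulation")      # hypothetical group name
stats.array("sum", acc, chunks=(n_chunks,))
stats.attrs["stride_chunks"] = 1               # hypothetical attribute schema
stats.attrs["source_array"] = "x"
```

Everything written here is an ordinary zarr array and ordinary user attributes, which is what makes the layered architecture possible.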

@hailiangzhang (Author) commented:
Thanks for your responses, @rabernat and @jbms!
Actually, this proposal is probably comparable to the approach used by the Xarray library, which introduces specific fields such as _ARRAY_DIMENSIONS within Zarr attributes so that it can label and operate on named dimensions of Zarr arrays. Similarly, this proposal introduces schemas in the data group and array attributes to describe the chunk-level statistics and facilitate accumulation-based data analysis (illustrated briefly below).
I remember at some point it was mentioned that the Xarray spec should have been incorporated into the Zarr specification, so I thought this proposal might be a candidate too. However, I must acknowledge that there may be some finer details or guidelines that I've overlooked. Regardless, I am comfortable proceeding in whichever direction aligns best with the existing guidelines :)
BTW, @jbms, I remember you have been asking for some source code to help you understand the low-level details. My apologies for not being able to dedicate more time to this, but a cleaner version of the library that prepares the accumulation data should be coming very soon, and I will keep you updated.
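
As a brief illustration of the _ARRAY_DIMENSIONS analogy above (the first attribute below is the real convention written by xarray's zarr backend; the second is purely hypothetical):

```python
import zarr

g = zarr.open_group("analogy.zarr", mode="a")
z = g.require_dataset("precip", shape=(24, 180, 360), chunks=(6, 90, 90), dtype="f4")
z.attrs["_ARRAY_DIMENSIONS"] = ["time", "lat", "lon"]  # real xarray convention
z.attrs["_ACCUMULATION_GROUP"] = "accumulation"        # hypothetical attribute, illustration only
```

Both are plain user attributes: implementations that understand the convention can use them, and everything else simply ignores them.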

@briannapagan commented:
Hi @MSanKeys963 @jbms @joshmoore @rabernat - we would like to submit this work for consideration in The White House Office of Science & Technology Policy's Open Science Recognition Challenge - https://www.challenge.gov/?challenge=ostp-year-of-open-science-recognition-challenge. Do I have your permission to list you all as collaborators? The form is rather simple; there is no formal sign-off (i.e. adding emails etc.), I just wanted to at least list names/affiliations. Also, am I missing anyone else who may have helped push this forward (@hailiangzhang)?

@jbms (Contributor) commented Nov 20, 2023

I don't think my own involvement in reviewing this proposal is sufficient to be listed as a collaborator.

I actually would still very much welcome some clarifications about precisely how the cumulative sums are stored during the precomputation step and how they are used to compute the result of a query, so that I can better understand this proposal.

@joshmoore (Member) commented:
Hi @briannapagan. No objection to being listed, and generally happy to do what I can to recognize open science! 😄

@rabernat (Contributor) commented Nov 21, 2023

Same as Jeremy, I don't think I've done anything to merit authorship here. IMO lots more work is needed to move this forward.

@briannapagan commented:
Thank you for the responses, all! @hailiangzhang will provide some updates, as there's been quite a bit of work done since March. However, I think at this time we will skip submitting anything for recognition, but it might be of interest for your other zarr efforts! Cheers.

@MSanKeys963 (Member) commented Nov 30, 2023

Thanks for the update, @briannapagan.

Looking forward to the updates here.

This is a bit short notice, but it'd be great if you or @hailiangzhang could join the ZEP meeting today or the next one on 12/14 so that we can move this forward and assist you in any way possible. Thanks!
