Build virtual Zarr store using Xarray's dataset.to_zarr(region="...") model #308

maxrjones · 2024-11-19T17:05:43Z

I'm wondering how we can build virtualization pipelines as a map rather than a map-reduce process. The map paradigm would follow the structure below from Earthmover's excellent blog post on serverless datacube pipelines, where the virtual dataset from each worker would get written directly to an Icechunk Virtual Store rather than being transferred back to the coordination node for a concatenation step. Similar to their post, the parallelization between workers during virtualization could leverage a serverless approach like lithops or coiled. Dask concurrency could speed up virtualization within each worker. I think the main feature needed for this to work is an analogue to xr.Dataset.to_zarr(region="..."), complementary to xr.Dataset.to_zarr(append_dim="...")(documented in https://docs.xarray.dev/en/latest/generated/xarray.Dataset.to_zarr.html) (xref #21, #272).

I tried to check that this feature request doesn't already exist, but apologies if I missed something.

The text was updated successfully, but these errors were encountered:

TomNicholas · 2024-11-19T17:17:27Z

This is a really extremely good question that I have also thought about a bit.

I think a big difference is that this "map" approach assumes more knowledge the structure of the dataset up front. To use zarr regions you have to know how big the regions are in advance to create an empty zarr array that will be filles, but In the "map-reduce" approach you are able to concatenate files whose shapes you don't know until you open them. Whether that actually matters much in practice is another question.

One thing to note is that IIUC having each worker write a region of virtual references would mean one icechunk commit per region, whereas transferring everything back to one node means one commit per logical version of the dataset, which seems a little nicer.

I think the main feature needed for this to work is an analogue to xr.Dataset.to_zarr(region="...")

I agree.

norlandrhagen · 2024-11-19T17:18:52Z

I really like the idea behind this approach @maxrjones! Being able to skip transferring all the references back to a 'reduce' step seem great. Also wondering about how all the commits between workers would work with icechunk.

norlandrhagen · 2024-11-19T17:19:38Z

This also seems like a good way to protect against a single failed reference killing the entire job.

dcherian · 2024-11-19T17:36:30Z

I think a big difference is that this "map" approach assumes more knowledge the structure of the dataset up front.

👍

IIUC having each worker write a region of virtual references would mean one icechunk commit per region,

It does not, once earth-mover/icechunk#357 is in. Though in practice, that also does map-reduce on changesets which will contain virtual references in this case...

maxrjones · 2024-11-19T22:24:13Z

IIUC having each worker write a region of virtual references would mean one icechunk commit per region,

It does not, once earth-mover/icechunk#357 is in. Though in practice, that also does map-reduce on changesets which will contain virtual references in this case...

Thanks for this information! Do you think it's accurate that map-reduce on changesets would still be more performant than map-reduce on virtual datasets when operating on thousands or more of files, such that this feature would be helpful despite not truly by a map operation? Also, do you anticipate there being other prerequisites in addition to your PR before VirtualiZarr could have something like dataset_to_icechunk(region="...")?

maxrjones added the enhancement New feature or request label Nov 19, 2024

TomNicholas added the Icechunk 🧊 Relates to Icechunk library / spec label Nov 19, 2024

maxrjones added the performance label Nov 22, 2024

maxrjones mentioned this issue Dec 4, 2024

Virtual chunks design document earth-mover/icechunk#436

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Build virtual Zarr store using Xarray's dataset.to_zarr(region="...") model #308

Build virtual Zarr store using Xarray's dataset.to_zarr(region="...") model #308

maxrjones commented Nov 19, 2024

TomNicholas commented Nov 19, 2024

norlandrhagen commented Nov 19, 2024

norlandrhagen commented Nov 19, 2024

dcherian commented Nov 19, 2024 •

edited

Loading

maxrjones commented Nov 19, 2024

Build virtual Zarr store using Xarray's dataset.to_zarr(region="...") model #308

Build virtual Zarr store using Xarray's dataset.to_zarr(region="...") model #308

Comments

maxrjones commented Nov 19, 2024

TomNicholas commented Nov 19, 2024

norlandrhagen commented Nov 19, 2024

norlandrhagen commented Nov 19, 2024

dcherian commented Nov 19, 2024 • edited Loading

maxrjones commented Nov 19, 2024

dcherian commented Nov 19, 2024 •

edited

Loading