Replies: 3 comments 6 replies
-
For a little more context: the error I get after reading back a combined parquet reference file, built from a list of vds objects created with multiprocessing, is zarr.errors.MetadataError: error decoding metadata. I do not get this when using a list comprehension.
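Roughly, the failing step looks like the following sketch (the file names, combine call, and read-back route here are simplified placeholders for what I'm actually doing):

```python
import fsspec
import xarray as xr
from virtualizarr import open_virtual_dataset

# Placeholder file list; in my case it's a month of hourly files.
files = [f"data/hourly_{i:03d}.nc" for i in range(3)]
vds_list = [open_virtual_dataset(f, indexes={}) for f in files]  # or built with multiprocessing

# Combine the per-file virtual datasets and write the references out as parquet.
combined = xr.combine_nested(vds_list, concat_dim="time")
combined.virtualize.to_kerchunk("combined_refs.parquet", format="parquet")

# Reading the combined references back is where the error appears:
fs = fsspec.filesystem("reference", fo="combined_refs.parquet")
ds = xr.open_dataset(
    fs.get_mapper(""), engine="zarr", backend_kwargs={"consolidated": False}
)
# -> zarr.errors.MetadataError: error decoding metadata
#    (only when vds_list was built with multiprocessing; the
#     list-comprehension version reads back fine)
```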
-
Hi Callum! Note that what you're trying to do is similar to #95; #95 (comment) might also be relevant ("I'm planning to use dask / cubed to parallelize a really big set of ...").
These sound like unrelated issues, though - once you have an in-memory vds, writing out the metadata doesn't depend on how you opened it. In general, I can't offer more help on your specific issue without an MCVE.
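For reference, a minimal sketch of that dask-based pattern might look like this (file paths and the open_virtual_dataset arguments are only illustrative):

```python
import dask
import xarray as xr
from virtualizarr import open_virtual_dataset

files = [f"data/hourly_{i:03d}.nc" for i in range(3)]  # illustrative file list

# One delayed task per file, so reference generation runs in parallel on
# whatever dask scheduler is active (threads, processes, or a cluster).
tasks = [dask.delayed(open_virtual_dataset)(f, indexes={}) for f in files]
vds_list = list(dask.compute(*tasks))

# The resulting virtual datasets can then be combined as usual.
combined = xr.combine_nested(vds_list, concat_dim="time")
```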
-
Just adding on in case it's helpful for you @CRWayman: we have two examples so far of parallel reference generation with VirtualiZarr.
-
Hello!
I am trying to build a vds list for a month of hourly data, and I thought I'd try to speed up that process by using multiprocessing to map open_virtual_dataset() over a list of files. Doing this with a list comprehension definitely works, but I was surprised to see that it doesn't seem to work cleanly with multiprocessing. The only way I can see to do it is by passing the function and the indexes argument into functools.partial, and then applying pool.map() to the resulting partial object and the list of files. For some reason this results in the metadata not writing out correctly.
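To make that concrete, here is a simplified sketch of the two approaches (the file paths are placeholders):

```python
from functools import partial
from multiprocessing import Pool

from virtualizarr import open_virtual_dataset

# Placeholder list standing in for a month of hourly files.
files = [f"data/hourly_{i:03d}.nc" for i in range(744)]
indexes = {}

# List-comprehension version - this works fine:
# vds_list = [open_virtual_dataset(f, indexes=indexes) for f in files]

# Multiprocessing version - the references built from this later fail to decode:
open_virtual = partial(open_virtual_dataset, indexes=indexes)
with Pool() as pool:
    vds_list = pool.map(open_virtual, files)
```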
Any ideas or advice would be greatly appreciated! If sticking with the list comprehension is my best bet, then I'll do that.
Thanks,
Callum