Use cloudpathlib instead of fsspec? #172

Open
TomNicholas opened this issue Jul 1, 2024 · 6 comments
Labels
remote files Reading references from non-local files

Comments

@TomNicholas
Member

AFAIK the only filesystems we need to read from are local and cloud, so could we just use pathlib and cloudpathlib?

@TomNicholas
Member Author

In particular we could just call cloudpathlib.AnyPath

https://cloudpathlib.drivendata.org/stable/anypath-polymorphism/

@norlandrhagen
Collaborator

This would be really cool @TomNicholas!

Seems like it can read over s3 into xarray:

import xarray as xr
from cloudpathlib import CloudPath

# CloudPath is path-like, so xarray can open it directly.
cloudpath = CloudPath("s3://carbonplan-share/air_temp.nc")
ds = xr.open_dataset(cloudpath)

@norlandrhagen
Collaborator

A little more exploration. It looks like SingleHdf5ToZarr works for both S3 and local files.

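import io

from cloudpathlib import CloudPath
from kerchunk.hdf import SingleHdf5ToZarr

# AnyPath could be swapped in to accept local paths as well.
# from cloudpathlib import AnyPath

cloudpath = CloudPath("s3://carbonplan-share/air_temp.nc")

# Read the whole object into memory, then let kerchunk scan the bytes.
with cloudpath.open("rb") as f:
    contents = f.read()

# Pass url so the generated references point back at the source object;
# with a file-like input and no url, they carry no source path.
refs = SingleHdf5ToZarr(io.BytesIO(contents), url=str(cloudpath)).translate()
refs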

@TomNicholas
Member Author

Some more thoughts - one way to smooth this transition would be to replace all uses of UPath (which is based on fsspec) with cloudpathlib's AnyPath. They are both very similar - for example they both implement a .stat method, which is used in https://github.com/zarr-developers/VirtualiZarr/pull/187/files#r1678802398.
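
Because both path types expose `.stat`, code that only needs a file size can stay agnostic about which library produced the path. A minimal sketch, using the standard library's `pathlib.Path` as a stand-in for either one; `file_size` is a hypothetical helper, not part of VirtualiZarr, and whether every cloud backend populates `st_size` is an assumption worth verifying:

```python
import tempfile
from pathlib import Path


def file_size(path) -> int:
    """Return the size in bytes of any path object that has a .stat() method."""
    return path.stat().st_size


# Demonstrate with a local temporary file; pathlib.Path stands in for
# UPath or AnyPath here.
with tempfile.TemporaryDirectory() as tmp:
    local = Path(tmp) / "example.bin"
    local.write_bytes(b"\x00" * 16)
    size = file_size(local)
```

The helper never imports fsspec or cloudpathlib, so swapping one path class for the other would not touch it.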

The snag here is that I don't think cloudpathlib supports https...

@TomNicholas TomNicholas added the remote files Reading references from non-local files label Jul 21, 2024
@TomNicholas
Member Author

> The snag here is that I don't think cloudpathlib supports https...

I raised drivendataorg/cloudpathlib#455
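
To make the snag concrete, here is a rough sketch of how a caller might sort input locations by URL scheme before choosing a backend. It uses only the standard library; `classify_location` and `CLOUD_SCHEMES` are hypothetical names, and treating `http`/`https` as a separate case reflects this thread's understanding that cloudpathlib did not handle those schemes at the time:

```python
from urllib.parse import urlparse

# URL schemes for cloudpathlib's built-in cloud providers (S3, GCS, Azure).
CLOUD_SCHEMES = {"s3", "gs", "az"}


def classify_location(location: str) -> str:
    """Classify a location string as 'cloud', 'http', 'local', or 'unknown'."""
    scheme = urlparse(location).scheme.lower()
    if scheme in CLOUD_SCHEMES:
        return "cloud"
    if scheme in {"http", "https"}:
        return "http"
    # No scheme, file://, or a single letter (a Windows drive such as C:)
    # all indicate a local path.
    if scheme in {"", "file"} or len(scheme) == 1:
        return "local"
    return "unknown"


kinds = [
    classify_location("s3://carbonplan-share/air_temp.nc"),
    classify_location("https://example.com/air_temp.nc"),
    classify_location("/data/air_temp.nc"),
]
```

Under that assumption, anything classified as `"http"` would still need fsspec or another reader.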

@scottyhq
Contributor

I've just started using cloudpathlib and it provides a nice, straightforward interface. However, it does seem to read entire remote files from object storage up front (https://cloudpathlib.drivendata.org/stable/caching/). As far as I can tell, this is equivalent to the fsspec 'filecache' machinery.

Conceptually, pulling the full files is straightforward and has max compatibility with other libraries... but if most of the metadata is at the front of the file, you could get everything you need by transferring just a small subset (ideally the first 'n' bytes; see https://discourse.pangeo.io/t/pangeo-showcase-hdf5-at-the-speed-of-zarr/4084).

So this is another issue to track for range-reads in cloudpathlib: drivendataorg/cloudpathlib#9
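
As a rough illustration of what a range read buys, the sketch below builds an HTTP `Range` header for the first `n` bytes and checks an HDF5 signature from just the head of a file. It uses only the standard library and an in-memory stand-in for the remote object; `range_header` and `read_head` are hypothetical helpers, not cloudpathlib or fsspec APIs:

```python
import io

# The 8-byte HDF5 format signature, found at offset 0 when the file
# has no user block.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def range_header(start: int, length: int) -> dict:
    """Build an HTTP Range header; both ends of the byte range are inclusive."""
    return {"Range": f"bytes={start}-{start + length - 1}"}


def read_head(fileobj, n: int) -> bytes:
    """Read only the first n bytes of a seekable binary file object."""
    fileobj.seek(0)
    return fileobj.read(n)


# In-memory stand-in for a remote file: signature followed by padding.
fake_file = io.BytesIO(HDF5_SIGNATURE + b"\x00" * 1024)
head = read_head(fake_file, len(HDF5_SIGNATURE))
is_hdf5 = head == HDF5_SIGNATURE
```

Against real object storage, the header from `range_header(0, n)` would be sent with the GET request so that only those bytes cross the network, instead of the whole file being cached first.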
