Use cloudpathlib instead of fsspec? #172
Comments
In particular we could just call https://cloudpathlib.drivendata.org/stable/anypath-polymorphism/ |
This would be really cool, @TomNicholas! It seems it can read over S3 into xarray:
from cloudpathlib import CloudPath
import xarray as xr
cloudpath = CloudPath("s3://carbonplan-share/air_temp.nc")
ds = xr.open_dataset(cloudpath) |
A little more exploration: it looks like SingleHdf5ToZarr works for both S3 and local files.
from kerchunk.hdf import SingleHdf5ToZarr
import io
from cloudpathlib import CloudPath
import xarray as xr
# from cloudpathlib import AnyPath
cloudpath = CloudPath("s3://carbonplan-share/air_temp.nc")
# open() accepts the CloudPath as a path-like object; the whole file is read into memory
with open(cloudpath, 'rb') as f:
    contents = f.read()
refs = SingleHdf5ToZarr(io.BytesIO(contents)).translate()
refs |
Some more thoughts - one way to smooth this transition would be to replace all uses of The snag here is that I don't think cloudpathlib supports https... |
I raised drivendataorg/cloudpathlib#455 |
I've just started using cloudpathlib, and it provides a nice, straightforward interface. However, it does seem to read entire remote files from object storage up front (https://cloudpathlib.drivendata.org/stable/caching/). As far as I can tell, this is equivalent to the fsspec 'filecache' machinery.
Conceptually, pulling the full files is straightforward and has maximum compatibility with other libraries. But if most of the metadata is at the front of the file, you could get everything you need by transferring only a small subset (ideally the first 'n' bytes; see https://discourse.pangeo.io/t/pangeo-showcase-hdf5-at-the-speed-of-zarr/4084).
So this is another issue to track for range reads in cloudpathlib: drivendataorg/cloudpathlib#9 |
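To make the "first 'n' bytes" point concrete, here is a minimal sketch that uses only the standard library. `read_head` and `looks_like_hdf5` are hypothetical helpers, not part of cloudpathlib, fsspec, or kerchunk, and an in-memory buffer stands in for a remote object. The assumption is that a file object backed by range requests would transfer only the bytes requested here, whereas a whole-file cache downloads everything before the first read.

```python
import io

# The 8-byte signature that opens an HDF5 file whose superblock sits at offset 0.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def read_head(fileobj, n_bytes):
    """Return at most the first n_bytes of a seekable binary file object."""
    fileobj.seek(0)
    return fileobj.read(n_bytes)


def looks_like_hdf5(fileobj):
    """Check the leading signature without reading the rest of the file."""
    return read_head(fileobj, len(HDF5_SIGNATURE)) == HDF5_SIGNATURE


# Stand-in for a remote file (assumption): a signature followed by 1 MiB of padding.
fake_remote = io.BytesIO(HDF5_SIGNATURE + b"\x00" * (1024 * 1024))
head = read_head(fake_remote, 8)
is_hdf5 = looks_like_hdf5(fake_remote)
```

Only 8 bytes are read from the buffer in either call; whether the same holds for a cloud object depends on the library supporting range reads rather than caching the full file.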
AFAIK the only filesystems we need to read from are local and cloud, so could we just use pathlib and cloudpathlib?
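As a rough illustration of the "local or cloud" split, and of the https snag raised in this thread, the sketch below classifies a location string by its URL scheme using only the standard library. `classify_location` and `CLOUD_SCHEMES` are hypothetical names; the scheme set is an assumption based on the `s3://`, `gs://`, and `az://` prefixes that cloudpathlib documents, and anything else (including `https://`) is reported as unsupported.

```python
from urllib.parse import urlparse

# Assumption: URL schemes that cloudpathlib documents for S3, GCS, and Azure.
CLOUD_SCHEMES = {"s3", "gs", "az"}


def classify_location(location):
    """Return 'local', 'cloud', or 'unsupported' for a path or URL string."""
    scheme = urlparse(str(location)).scheme.lower()
    if scheme in CLOUD_SCHEMES:
        return "cloud"
    # No scheme, an explicit file:// URL, or a one-letter Windows drive
    # (e.g. "C:\\data") are all treated as local paths.
    if scheme in ("", "file") or len(scheme) == 1:
        return "local"
    return "unsupported"


s3_kind = classify_location("s3://carbonplan-share/air_temp.nc")
local_kind = classify_location("/tmp/air_temp.nc")
https_kind = classify_location("https://example.com/air_temp.nc")
```

In a real integration, the 'local' branch would hand the string to `pathlib.Path` and the 'cloud' branch to cloudpathlib; the 'unsupported' branch is where https URLs would still need another reader.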