Question on using UPath and fs copy/put #309

Open
scholtalbers opened this issue Nov 7, 2024 · 2 comments
Labels
question ❓ Further information is requested

Comments

@scholtalbers

I'm not sure if this is the right place to ask this question; I debated whether it should go in the community Q&A instead.

Anyway, I want to leverage UPath and fsspec to work with multiple storage backends, for now limited to local disk and s3. One of the main use cases is to copy files and folders from one to the other.
So triggered by the answer, I feel I may have some misunderstanding of how to use the library effectively. I have read some of the documentation, but I may have missed something obvious.

What I started doing is along these lines:

from upath import UPath

class StorageVolume:
    # (…)
    def get_fsspec_storage_options(self) -> dict:
        if self.storage_options:
            return {"endpoint_url": self.endpoint, "key": self.key, "secret": self.secret}
        return {}

    def get_upath(self, file_obj) -> UPath:
        return UPath(file_obj.uri, **self.get_fsspec_storage_options())

src_obj = FileObject.objects.get(uri="/tmp/dir/myfile.txt")
src = src_obj.storagevolume.get_upath(src_obj) 
dest_obj = FileObject.objects.get(uri="s3://bucket1/")
dest = dest_obj.storagevolume.get_upath(dest_obj)

if both_local or src_obj.storagevolume == dest_obj.storagevolume:
  dest.fs.copy(str(src), str(dest), recursive=src.is_dir())
elif not is_remote_upath(src):
  dest.fs.put(str(src), str(dest), recursive=src.is_dir())

I also explored the generic filesystem with something like

  # try generic 
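  # set_generic_fs takes storage options as keyword arguments, not a filesystem instance
  generic.set_generic_fs(src.protocol, **src.storage_options)
  generic.set_generic_fs(dest.protocol, **dest.storage_options)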
  fs = fsspec.filesystem("generic", default_method="generic")
  fs.copy(str(src), str(dest), recursive=src.is_dir())

It feels like at least the dest.fs.put(str(src), str(dest)) part is not the right way for my use case. Is there a better way?

@ap--
Collaborator

ap-- commented Nov 8, 2024

Hello @scholtalbers

Here is an explanation of how all of this works together. Some of it skips over details, so I really recommend reading the filesystem_spec docs thoroughly and looking at the fsspec.spec.AbstractFileSystem implementation in filesystem_spec.

fsspec

All the filesystem abstractions and filesystem operations are implemented and defined in filesystem_spec. The base class of all these filesystems is in fsspec.spec.AbstractFileSystem.

To use a filesystem registered in fsspec, you instantiate the specific subclass of AbstractFileSystem you want to use (S3FileSystem, for example) by providing storage options to the class constructor.
The instantiated filesystem then takes paths (or paths prefixed with the same protocol) as arguments in all its methods.

When you want to reference an object on a filesystem, you need 3 pieces of information:

  1. Which filesystem? This is provided by the protocol, since subclasses of AbstractFileSystem can register their supported protocols in fsspec.registry (for example: "s3").
  2. What filesystem configuration? This is provided by the storage_options. They are the constructor parameters for building the filesystem instance (this could be your AWS credentials).
  3. Which object? This is provided by the path, a string that references an object for the filesystem instance (for example: "mybucket/my/special/key").

If you have all three pieces of information, you can provide access to the information stored in the object you're referencing.
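As a minimal sketch of how the three pieces fit together, the snippet below uses fsspec's built-in in-memory filesystem so that it runs without credentials. The path and payload are made up for illustration; with S3 the storage options would carry things like your key and secret instead of being empty.

```python
import fsspec

protocol = "memory"                 # 1. which filesystem
storage_options = {}                # 2. its configuration (the memory fs needs none)
path = "/mybucket/my/special/key"   # 3. which object (illustrative path)

# Build the filesystem instance from the protocol and storage options.
fs = fsspec.filesystem(protocol, **storage_options)

# All filesystem methods take path strings as arguments.
fs.pipe_file(path, b"hello")        # write bytes to the object
data = fs.cat_file(path)            # read them back
```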

Some filesystems in fsspec support combining the 3 pieces of information into a single urlpath string, usually of the form: protocol://path?storage_option1=1&storage_option2=2
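A small sketch of going from a single urlpath back to a filesystem instance and a path, using fsspec.core.url_to_fs with the in-memory filesystem. The URL is illustrative. Whether options can be encoded in the query string depends on the filesystem, so this sketch assumes options are passed as keyword arguments instead.

```python
from fsspec.core import url_to_fs

# The protocol and the path are combined in one string; any storage options
# would be passed as extra keyword arguments to url_to_fs.
fs, path = url_to_fs("memory://mybucket/my/special/key")

fs_class_name = type(fs).__name__   # the filesystem class chosen from the protocol
```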

upath

The UPath class does 2 things:

  1. It provides a way to store all 3 pieces of information in one object, which makes it convenient to pass the information along.
  2. It provides the pathlib.PurePath and pathlib.Path interface for modifying the path and for reading from and writing to it. This allows you to easily add support for arbitrary filesystems if your existing code uses the pathlib.Path interface. (Note that there will be a minor but significant change with "Inherit from PathBase instead of Path" (#193), which will basically remove the __fspath__ method from the UPath interface.)

The copy functionality between filesystems will be available in UPath once #227 is completed, which relies on #193. And once that interface exists, the internal implementation will likely be based on the generic filesystem, but that's tbd.

when to use what

UPath simplifies passing around paths in your code and it's a convenient tool for building uris for supported protocols. It's not performance optimized, and adds (in some cases a lot of) overhead to operations.

  • If your only concern is performance: use fsspec exclusively.
  • If you like the pathlib interface: use UPath

For cases where you need to move between them, UPath provides you with an easy way to move to fsspec:

UPath().protocol  # the protocol string used by fsspec
UPath().storage_options  # the storage_options required to create the fsspec filesystem instance
UPath().path  # a path string that can be used by the fsspec instance
UPath().fs  # a convenience helper to do fsspec.filesystem(pth.protocol, **pth.storage_options)

Regarding your question

It feels (at least) the part dest.fs.put(str(src), str(dest)) is not the right way for my use case, is there a better way

As mentioned above: not yet. But hopefully soon.
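Until then, one possible workaround for a single file is to stream bytes between two filesystem instances yourself. The sketch below is only an illustration under stated assumptions, not the planned UPath API: copy_file_between is a hypothetical helper, and the demo copies from local disk to the in-memory filesystem standing in for S3.

```python
import os
import shutil
import tempfile

import fsspec


def copy_file_between(src_fs, src_path, dest_fs, dest_path, chunk_size=1024 * 1024):
    """Hypothetical helper: stream one file from src_fs to dest_fs in chunks."""
    with src_fs.open(src_path, "rb") as fsrc, dest_fs.open(dest_path, "wb") as fdst:
        shutil.copyfileobj(fsrc, fdst, chunk_size)


# Demo: create a local temporary file and copy it into the memory filesystem.
local_fs = fsspec.filesystem("file")
memory_fs = fsspec.filesystem("memory")

with tempfile.TemporaryDirectory() as tmpdir:
    local_path = os.path.join(tmpdir, "myfile.txt")
    with open(local_path, "wb") as f:
        f.write(b"some content")
    copy_file_between(local_fs, local_path, memory_fs, "/bucket1/myfile.txt")

copied = memory_fs.cat_file("/bucket1/myfile.txt")
```

For directories you would still need to walk the source tree, for example with src_fs.find, and call the helper once per file.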

Let me know if this helped,
Cheers,
Andreas 😃

@ap-- ap-- added the question ❓ Further information is requested label Nov 9, 2024
@scholtalbers
Author

Yes, thanks a lot Andreas, this clarifies a few things for me and I'm looking forward to the upcoming features!
