Directories and delimiter handling #562
Thanks for opening this, and let's take the chance to nail down how we would like this to work. In most cases, paths are stripped of any trailing "/". However, since the appearance of recursive operations and one-shot find, there is a certain amount of awkward code to back-fill inferred directories. There are known failure cases where paths end with "/", which would be illegal on posix, but they do appear when using the web consoles and some frameworks (spark?). Note that the dircache stores listings keyed by their (pseudo-)directory, and that these keys don't end in "/"; that means we don't know which listing to put a name ending in "/" into. So I suggest:
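As an illustrative aside on the dircache point above (the helper below is invented, not taken from any of the libraries discussed), the keying problem looks like this:

```python
def parent_key(path: str) -> str:
    # dircache keys carry no trailing delimiter, so both spellings of a
    # pseudo-directory must normalize to the same parent key.
    path = path.rstrip("/")
    return path.rsplit("/", 1)[0] if "/" in path else ""

# Without the rstrip, "bucket/folder/" would file under the listing for
# "bucket/folder" rather than the listing for "bucket":
assert parent_key("bucket/folder/") == parent_key("bucket/folder") == "bucket"
```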
I should add that ..., but I don't think there's a way around this; the even more unsatisfactory situation (involving more calls) still fails intuition in the opposite direction.
Originally, adlfs implemented ... It should resolve ...

It sounds as if your preference would be to omit the trailing delimiter from a pseudo-directory and leave mkdir as a no-op. Is that correct?
It seems to me to be the intent of the API; the advantage being avoiding the need for extra calls to make the objects. I like the metadata idea, but presumably we'd also have to account for trees that were not written with this convention.
Thinking about ...

This also means that: ... correct? This means ...

Agree the reduction in calls is an advantage of the API, but are users OK with this behavior?
Yes, that's exactly what currently happens for s3fs. I can't remember whether rmdir gives a helpful "directory not empty" message.
Tagging @isidentical, @ldacey, and @jorisvandenbossche. Hoping for input on the above discussion from some users of adlfs who have given input on its development.
Support for empty directories is really important for our use case. Not as in creating them with ...
I've been testing the current behavior in Azure. If I go to the portal and create:

Then I create a container_client and call

But if I call

The list_blobs() response is consistent with a "pseudo-folder" in Azure Blob being an empty blob. Uploading a blob, using either the client or the portal, to: ... This is not consistent with s3 as described above. The Azure SDK performs the following:

So leaving ...
This would not be true if the placeholder "adlfs/test_folder/test_inner" had not been previously created though, right? |
Yes, it is true. See the following, using the Azure Storage Blob Python SDK directly:
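A minimal sketch of that kind of sequence (the connection string is a placeholder; the container name matches the paths above):

```python
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    "<connection-string>", container_name="adlfs"
)

# The pseudo-directory placeholder: an empty BlockBlob.
container.upload_blob(name="test_folder/test_inner", data=b"")

# A real blob "inside" the pseudo-directory.
container.upload_blob(name="test_folder/test_inner/data.csv", data=b"a,b\n1,2\n")

for blob in container.list_blobs(name_starts_with="test_folder/"):
    print(blob.name, blob.size)
# test_folder/test_inner 0             <- the placeholder remains
# test_folder/test_inner/data.csv 8
```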
Notice that the pseudo-directory remains in place, and it's a BlockBlob with size = 0. Both of the following operations return a StorageError, "Specified blob already exists":
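Plausibly something like the following (a reconstruction, not the original snippet), with the same placeholder client as above:

```python
# Without overwrite=True, re-uploading either name raises
# ResourceExistsError ("The specified blob already exists").
container.upload_blob(name="test_folder/test_inner", data=b"")
container.upload_blob(name="test_folder/test_inner/data.csv", data=b"")
```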
I'm on board with mkdir being a no-op when the bucket exists. This means we would expect the following behavior, if ...
Additionally, once the bucket is created:

And the following should be true when the bucket does not exist:
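As a hypothetical sketch of the kind of semantics being converged on (mkdir as a no-op, directories implied by keys); the filesystem and all names below are illustrative only:

```python
import fsspec

fs = fsspec.filesystem("s3")  # or "gcs"/"abfs"; purely illustrative

fs.mkdir("bucket/folder")              # no-op when "bucket" already exists
assert not fs.exists("bucket/folder")  # nothing was actually created
fs.pipe("bucket/folder/key", b"data")  # writing a key implies its directories
assert fs.isdir("bucket/folder")       # now inferred from the key above
fs.rm("bucket/folder/key")
assert not fs.isdir("bucket/folder")   # vanishes with its last key
```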
Currently in s3fs, mkdir on a path within a bucket silently passes whether the bucket exists or not; I agree that it ought to check, but we can't actually know whether we have write privileges to that bucket without trying to write something. I agree with the rest. Note that ...
Hopefully @jorisvandenbossche will be able to comment more about what ds.write_dataset() expects to happen, since that is the primary use case I have right now. Right now, on the 0.5.9 version of adlfs, we end up having duplicate blob paths: one path is an empty file up until the partition "folder", and the other includes the filename and data.
I think the asyn_close branch for adlfs matches my expectations. Writing a dataset does not duplicate the blob paths. Listing a partitioned directory clearly shows which blobs are "virtual directories" based on the trailing slash.
It sounds like the trailing slash might introduce other inconsistencies, though. FYI, I tried to create a new virtual directory with Azure Storage Explorer and it forces you to add a file before the directory actually exists. I am not sure if it works this way with GCS and S3.
Thank you, Azure, for the message! Since you don't add anything in that dialog, I wonder whether it does, in fact, create something anyway. Unfortunately, the S3 and GCS web interfaces create an empty blob with "/" at the end of the key, and consider this a directory. By the way, one thing that I tried to toy with was having mkdir (non-bucket) create a folder in the filesystem instance only; this sort of worked for some things, but I thought it ended up being even more confusing.
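For illustration, the consoles' "create folder" amounts to roughly this (the bucket name is made up):

```python
import boto3

s3 = boto3.client("s3")

# What the S3 web console effectively does for "Create folder": write a
# zero-byte object whose key ends with the delimiter.
s3.put_object(Bucket="my-bucket", Key="new-folder/")
```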
When you create the new directory (using Storage Explorer), you are unable to see anything until you load a file:
I think an empty blob with a trailing "/" is fine if we can avoid the empty blobs without the slash, because those are duplicate files which really add up when partitioning a large dataset (literally double the number of blobs). I need to remove these because I have a tool (Power BI) connecting directly to these datasets, and other teams are using the data and get confused by the empty blobs.
Not super familiar with that code, but I think the Arrow dataset writing functionality basically assumes it can subsequently create a directory and then write files into it.
Something that we do in DVC right now is that we have a special base class, inherited by all prefix-based object storages (with no real directories), that tries to handle empty directories in only two functions. For example, the isdir() check:

```python
entry = self.info(path_info)
return entry["type"] == "directory" or (
    entry["size"] == 0
    and entry["type"] == "file"
    and entry["name"].endswith("/")
)
```

Same goes for find():

```python
# When calling find() on a file, it returns the same file in a list.
# For object-based storages, the same behavior applies to empty
# directories since they are represented as files. This condition
# checks whether we should yield an empty list (if it is an empty
# directory) or just yield the file itself.
if len(files) == 1 and files[0] == path and self.isdir(path_info):
    return None
yield from self._strip_buckets(files, detail=detail)
```

I'm not sure whether this should be done upstream or downstream, but either way it would be really beneficial for everyone to be consistent among the major filesystems (s3, gs, azure, etc.).
I definitely agree with this sentiment! I think a function like the one above can definitely be included in fsspec, perhaps in utils as a function or a mixin class. Perhaps more important would be to flesh out #651 to define what we think the behaviour ought to look like, and write some tests that span all the object stores. Anaconda is having a "hack week" and there are holidays too, so I may not be too responsive this week. |
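As a hypothetical sketch of such a helper (the name and signature below are invented, not an existing fsspec API):

```python
def entry_is_dir(entry: dict) -> bool:
    """Treat a size-0 'file' whose name ends with '/' as an empty-directory
    marker, mirroring the DVC logic quoted above."""
    return entry["type"] == "directory" or (
        entry.get("size") == 0
        and entry["type"] == "file"
        and entry["name"].endswith("/")
    )
```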
Opened based on #554.
Azure Storage consists of a two-level hierarchy, where the top level is a container and all blobs underneath a container have a flat structure. Delimiters ("/") that denote folders in traditional filesystems have no meaning. Azure implements a BlobPrefix object for convenience, but a BlobPrefix cannot be created directly, and it is immediately removed when all blobs underneath the prefix are deleted. This creates challenges for filesystem operations like mkdir(), because making an empty directory can only be done by creating an empty blob. The result is that an empty blob named "folder" and an empty blob named "folder/" both appear as size=0 blobs, and are unique objects. Choosing the former convention creates issues with partitioned parquet files, as described here, while the latter approach runs counter to the convention of removing a trailing "/" when listing directories and/or storing them in dircache, as evidenced by #554.
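To make the two conventions concrete (connection details below are placeholders):

```python
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    "<connection-string>", container_name="mycontainer"
)

# Two ways to materialize an "empty directory" as a size-0 blob; only the
# trailing delimiter distinguishes one from the other, or from a genuinely
# empty file.
container.upload_blob(name="folder", data=b"")   # former convention
container.upload_blob(name="folder/", data=b"")  # latter convention
```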
It's my understanding this is an issue with s3fs and gcsfs. Do these filesystems exhibit similar challenges, and if so, how are they handled there? Hoping to align on an approach that provides consistency for users, and with the use of dircache.
Thanks,
Greg