
Directories and delimiter handling #562

Open
hayesgb opened this issue Mar 9, 2021 · 20 comments

@hayesgb
Contributor

hayesgb commented Mar 9, 2021

Opened based on #554 .

Azure Storage consists of a two-level hierarchy: the top level is a container, and all blobs underneath a container have a flat structure. Delimiters ("/") that denote folders in traditional filesystems have no meaning. Azure implements a BlobPrefix object for convenience, but a BlobPrefix cannot be created directly and is removed as soon as all blobs underneath the prefix are deleted.

This creates challenges for filesystem operations like mkdir(), because an empty directory can only be represented by creating an empty blob. The result is that:

fs.mkdir("container/blob_folder")
fs.mkdir("container/blob_folder/")

both appear as size=0 blobs, and are distinct objects. Choosing the former convention creates issues, as described here, with partitioned parquet files, while the latter runs counter to the convention of removing a trailing "/" when listing directories and/or storing them in dircache, as evidenced by #554.

It's my understanding that this is also an issue for s3fs and gcsfs. Do those filesystems exhibit similar challenges, and if so, how are they handled there? I'm hoping to align on an approach that provides consistency for users and works with dircache.

Thanks,
Greg

@martindurant
Member

Thanks for opening this, and let's take the chance to nail down how we would like this to work.
Yes, s3fs and gcsfs face the exact same structure (buckets/keys and prefix/delimiter emulation) and problems. I'll take a moment to describe the current general conventions we have.

In most cases, paths are stripped of any trailing "/". mkdir is a no-op for anything below buckets, and directories are only "created" by containing other keys. In comments, these are usually called "pseudo-directories".

However, since the appearance of recursive operations and one-shot find, there is a certain amount of awkward code to back-fill inferred directories. There are known failure cases where paths end with "/", which would be illegal on POSIX, but they do appear when using the web consoles and some frameworks (Spark?). Note that dircache stores listings keyed by their (pseudo-)directory, and those keys don't end in "/", which means we don't know which listing to put a name ending in "/" into.
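For illustration, a minimal sketch of the dircache convention (the paths and entries here are hypothetical):

    # dircache maps a pseudo-directory path, stored WITHOUT the trailing "/",
    # to the listing of its immediate children
    fs.dircache["bucket/path"] = [
        {"name": "bucket/path/file1", "size": 10, "type": "file"},
        {"name": "bucket/path/sub", "size": 0, "type": "directory"},
    ]
    # a key like "bucket/path/" never appears, so an entry named
    # "bucket/key/" has no obvious parent listing to be filed under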

So I suggest:

  • a set of short tests showing what ls, glob and find should return in various situations, with or without "/"-ended path names, and at various depths including the bucket level (NB: a bucket name cannot contain "/"); a minimal sketch of such a test follows below
  • for the same set of circumstances, explicitly checking that dircache contains the expected entries, i.e., repeated calls to ls/find/glob should still return the right output.
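A minimal sketch of the kind of test meant above; the fs fixture and the bucket/key names are hypothetical, and the expectations simply mirror the "strip the trailing '/'" convention discussed in this thread:

    import pytest

    @pytest.mark.parametrize("path", ["bucket/dir", "bucket/dir/"])
    def test_ls_ignores_trailing_slash(fs, path):
        fs.touch("bucket/dir/file")
        # listing with or without the trailing "/" should give the same result
        assert fs.ls(path) == ["bucket/dir/file"]
        # and dircache should be keyed without the trailing "/"
        assert "bucket/dir" in fs.dircache
        assert "bucket/dir/" not in fs.dircache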

@martindurant
Member

I should add that mkdir being a no-op means you sometimes get counterintuitive behaviour:

>>> s3.mkdir("bucket/path")
>>> s3.exists("bucket/path")
False

but I don't think there's a way around this; the even more unsatisfactory alternative (involving more calls) still fails intuition, in the opposite direction:

>>> s3.exists("bucket/path")
False
>>> s3.touch("bucket/path/file")  # OK
>>> s3.isdir("bucket/path")
True

@hayesgb
Contributor Author

hayesgb commented Mar 9, 2021

Originally, adlfs implemented fs.mkdir() by creating a blob (rather than a no-op) without the trailing delimiter and with size=0. This eventually led to empty blobs being created by pyarrow to_dataset() operations. The current version in master creates an empty blob with a trailing delimiter, which has resolved the problem of empty blobs, but means that:

    ## these are containers
    assert fs.ls("") == ["data/"]
    assert fs.ls("/") == ["data/"]
    assert fs.ls(".") == ["data/"]

    ## these are top-level directories and files
    assert fs.ls("data") == ["data/root/", "data/top_file.txt"]
    assert fs.ls("/data") == ["data/root/", "data/top_file.txt"]

    # root contains files and directories
    assert fs.ls("data/root") == [
        "data/root/a/",
        "data/root/a1/",
        "data/root/b/",
        "data/root/c/",
        "data/root/d/",
        "data/root/rfile.txt",
    ]

It should resolve fs.exists(), because an object does exist in that location, but it can create confounding situations where fs.isdir() or fs.isfile() return incorrect results. However, Azure blobs provide a metadata parameter, and for mkdir and AzureBlobFile objects I've been looking at setting metadata["is_directory"] = "true" (or "false") when an object is written.
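A rough sketch of how that could look with the Azure SDK (not the actual adlfs implementation; the connection string and container/blob names are placeholders):

    from azure.storage.blob import ContainerClient

    conn_str = "..."  # placeholder for a storage account connection string
    cc = ContainerClient.from_connection_string(conn_str, container_name="data")
    # write the zero-byte marker blob with a flag so that isdir()/isfile()
    # can later distinguish it from a genuine empty file
    cc.upload_blob(name="root/a/", data=b"", metadata={"is_directory": "true"})

    props = cc.get_blob_client("root/a/").get_blob_properties()
    assert props.metadata.get("is_directory") == "true"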

It sounds as if your preference would be to omit the trailing delimiter from a pseudo-directory and leave mkdir as a no-op. Is that correct?

@martindurant
Member

It sounds as if your preference would be to omit the trailing delimiter from a pseudo-directory and leave mkdir as a no-op. Is that correct?

It seems to me to be the intent of the API; the advantage is avoiding extra calls to create the objects. I like the metadata idea, but presumably we'd also have to account for trees that were not written with this convention.
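A minimal sketch of that fallback, assuming the marker blobs carry the is_directory flag described above; trees written without the convention fall back to a prefix listing (function and parameter names are illustrative):

    from azure.core.exceptions import ResourceNotFoundError

    def looks_like_directory(container_client, path):
        # 1) trust the metadata flag on a marker blob, if one exists
        try:
            props = container_client.get_blob_client(path).get_blob_properties()
            if props.metadata and props.metadata.get("is_directory") == "true":
                return True
        except ResourceNotFoundError:
            pass
        # 2) otherwise infer a pseudo-directory from the existence of children
        children = container_client.list_blobs(name_starts_with=path.rstrip("/") + "/")
        return next(iter(children), None) is not None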

@hayesgb
Contributor Author

hayesgb commented Mar 10, 2021

Thinking about mkdir() being a no-op, from above:

>>> s3.mkdir("bucket/path")
>>> s3.exists("bucket/path")

This also means that:

>>> s3.ls("bucket")
[]
>>> s3.touch("bucket/path/file")   # creates a BlobPrefix for path
>>> s3.ls("bucket")                # now the pseudo-directory appears
["path"]
>>> s3.rm("bucket/path/file")      # removing the file automatically deletes the BlobPrefix
>>> s3.ls("bucket")
[]

correct?

This means s3.rmdir() should also be a no-op, for anything except buckets/containers, right?

Agree the reduction in calls is an advantage of the API, but are users OK with this behavior?

@martindurant
Member

Yes, that's exactly what currently happens for s3fs. I can't remember; rmdir might give a helpful "directory not empty" message.
In practice, I think people do rm(path, recursive=True), which may attempt to also delete the non-existent pseudo-directory and ignore the failure.
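In fsspec terms that pattern looks roughly like this (a sketch, not s3fs's actual code; the paths are illustrative):

    from contextlib import suppress

    # delete every real key under the prefix ...
    for key in fs.find("bucket/path"):
        fs.rm(key)
    # ... then try the pseudo-directory itself and ignore the failure,
    # since it may never have existed as an object
    with suppress(FileNotFoundError):
        fs.rm("bucket/path")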

@hayesgb
Contributor Author

hayesgb commented Mar 10, 2021

Tagging @isidentical , @ldacey, and @jorisvandenbossche

Hoping for input on the above discussion from some adlfs users who have contributed to its development.

@isidentical
Member

Support for empty directories is really important for our use case: not as in creating them with .mkdir(), but rather being able to do fs.isdir('empty/dir/') and fs.ls('empty/dir/'), so that when someone creates a directory using the provider's UI (for example the "Create Directory" option in Google Cloud) we can recognise it and create a local directory. Regarding .mkdir(), I have no comments.

@hayesgb
Contributor Author

hayesgb commented Mar 11, 2021

I've been testing the current behavior in Azure. If I go to the portal, and create:

adlfs/test_folder/test_inner

Then, when I create a container_client and call container_client.walk_blobs(name_starts_with="test_folder"), I get back a generator that, when iterated and printed, yields:

<class 'azure.storage.blob._list_blobs_helper.BlobPrefix'>
{'name': 'testfolder/', 'prefix': 'testfolder/', 'results_per_page': None, 'container': 'adlfs', 'delimiter': None, 'location_mode': 'primary'}
<class 'azure.storage.blob._list_blobs_helper.BlobPrefix'>
{'name': 'testfolder/test_inner/', 'prefix': 'testfolder/test_inner/', 'results_per_page': None, 'container': 'adlfs', 'delimiter': None, 'location_mode': 'primary'}

But if I call container_client.list_blobs(name_starts_with="test_folder"), I get back an iterator that returns:

<class 'azure.storage.blob._models.BlobProperties'>
{'name': 'testfolder', 'container': 'adlfs', 'snapshot': None, 'version_id': None, 'is_current_version': None, 'blob_type': <BlobType.BlockBlob: 'BlockBlob'>, 'metadata': {}, 'encrypted_metadata': None, 'last_modified': datetime.datetime(2021, 3, 10, 20, 30, 3, tzinfo=datetime.timezone.utc), 'etag': '0x8D8E403491339A9', 'size': 0, 'content_range': None, 'append_blob_committed_block_count': None, 'is_append_blob_sealed': None, 'page_blob_sequence_number': None, 'server_encrypted': True, 'copy': {'id': None, 'source': None, 'status': None, 'progress': None, 'completion_time': None, 'status_description': None, 'incremental_copy': None, 'destination_snapshot': None}, 'content_settings': {'content_type': 'application/octet-stream', 'content_encoding': None, 'content_language': None, 'content_md5': None, 'content_disposition': None, 'cache_control': 'no-cache'}, 'lease': {'status': 'unlocked', 'state': 'available', 'duration': None}, 'blob_tier': 'Hot', 'rehydrate_priority': None, 'blob_tier_change_time': None, 'blob_tier_inferred': True, 'deleted': None, 'deleted_time': None, 'remaining_retention_days': None, 'creation_time': datetime.datetime(2021, 3, 10, 20, 30, 3, tzinfo=datetime.timezone.utc), 'archive_status': None, 'encryption_key_sha256': None, 'encryption_scope': None, 'request_server_encrypted': None, 'object_replication_source_properties': [], 'object_replication_destination_policy': None, 'last_accessed_on': None, 'tag_count': None, 'tags': None}

The list_blobs() response is consistent with a "pseudo-folder" in Azure Blob being an empty blob.

Uploading a blob using either the client or the portal to testfolder2/test_inner2/myfile.txt, and subsequently deleting testfolder2/test_inner2/myfile.txt, removes only myfile.txt and leaves the nested folders in place.

This is not consistent with S3 as described above. With the Azure SDK:

cc.delete_blob(blob="testfolder2/test_inner2/myfile.rtf")
cc.list_blobs(name_starts_with="testfolder2/test_inner2")  # returns a blob of size=0 with name="testfolder2/test_inner2"

So leaving mkdir as a no-op would work for writes, but deletes would leave the "pseudo-folder" blobs in place.
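If we wanted deletes to behave the way S3 does, adlfs would presumably have to sweep up those leftover markers itself; a hypothetical sketch against the SDK (cc is a ContainerClient as in the examples above):

    def delete_prefix(cc, prefix):
        # delete the real blobs first, then any zero-size "pseudo-folder"
        # markers that Azure leaves behind under the prefix
        for blob in cc.list_blobs(name_starts_with=prefix):
            if blob.size > 0:
                cc.delete_blob(blob.name)
        for blob in cc.list_blobs(name_starts_with=prefix):
            if blob.size == 0:
                cc.delete_blob(blob.name)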

@martindurant
Member

Uploading a blob using either the client or the portal to testfolder2/test_inner2/myfile.txt, and subsequently deleting testfolder2/test_inner2/myfile.txt, removes only myfile.txt and leaves the nested folders in place.

This would not be true if the placeholder "adlfs/test_folder/test_inner" had not been previously created though, right?

@hayesgb
Contributor Author

hayesgb commented Mar 12, 2021

Yes, it is true. See the following, using the Azure Storage Blob Python SDK directly:

>>> sc = BlobServiceClient(account_url = <my_url>, credential = <token>)
>>> cc = sc.get_container_client('my-container')
>>> cc.create_container()  # creates a new container
>>> cc.upload_blob(name="testfolder/test_inner/myfile.rtf", data=b"012346789")  # Create a nested blob.

>>> blobs = cc.list_blobs(name_starts_with="testfolder/test_inner")
>>> for blob in blobs:
>>>     print(blob.name)

testfolder/test_inner
testfolder/test_inner/myfile.rtf

>>> cc.delete_blob(blob = "testfolder/test_inner/myfile.rtf")
>>> blobs2 = cc.list_blobs(name_starts_with="testfolder/test_inner")
>>> for blob in blobs2:
>>>     print(blob)

{'name': 'testfolder/test_inner', 'container': 'my-container', 'snapshot': None, 'version_id': None, 'is_current_version': None, 'blob_type': <BlobType.BlockBlob: 'BlockBlob'>, 'metadata': {}, 'encrypted_metadata': None, 'last_modified': datetime.datetime(2021, 3, 12, 3, 19, 4, tzinfo=datetime.timezone.utc), 'etag': '0x8D8E505973EE89A', 'size': 0, 'content_range': None, 'append_blob_committed_block_count': None, 'is_append_blob_sealed': None, 'page_blob_sequence_number': None, 'server_encrypted': True, 'copy': {'id': None, 'source': None, 'status': None, 'progress': None, 'completion_time': None, 'status_description': None, 'incremental_copy': None, 'destination_snapshot': None}, 'content_settings': {'content_type': None, 'content_encoding': None, 'content_language': None, 'content_md5': None, 'content_disposition': None, 'cache_control': None}, 'lease': {'status': 'unlocked', 'state': 'available', 'duration': None}, 'blob_tier': 'Hot', 'rehydrate_priority': None, 'blob_tier_change_time': None, 'blob_tier_inferred': True, 'deleted': None, 'deleted_time': None, 'remaining_retention_days': None, 'creation_time': datetime.datetime(2021, 3, 12, 3, 19, 4, tzinfo=datetime.timezone.utc), 'archive_status': None, 'encryption_key_sha256': None, 'encryption_scope': None, 'request_server_encrypted': None, 'object_replication_source_properties': [], 'object_replication_destination_policy': None, 'last_accessed_on': None, 'tag_count': None, 'tags': None}

Notice that the pseudo-directory remains in place, and it's a BlockBlob with size = 0.

Both of the following operations raise a StorageError, "Specified blob already exists":

cc.upload_blob(name = "testfolder/test_inner/", data="")
cc.upload_blob(name = "testfolder/test_inner", data="")

@hayesgb
Contributor Author

hayesgb commented Mar 13, 2021

I'm on board with mkdir being a no-op when the bucket exists. This means we would expect the following behavior if the bucket does not exist:

>>> fs.mkdir("bucket")  # Creates bucket
>>> fs.mkdir("bucket/dir/nested_dir", create_parents=False)  # Raises a PermissionError
>>> fs.mkdir("bucket/dir/nested_dir", create_parents=True)   # I would expect this to create bucket, but none of the inner pseudo-directories.  Thoughts?

@hayesgb
Contributor Author

hayesgb commented Mar 13, 2021

Additionally, once bucket is created:

>>> assert "bucket" in fs.ls(".")
>>> assert "bucket/" not in fs.ls(".")

And the following should be true once bucket exists:

fs.makedir("bucket", exist_ok=False)  # raises a FileExistsError
fs.makedir("bucket", exist_ok=True)   # passes silently

@martindurant
Member

cf fsspec/s3fs#439

Currently in s3fs,

>>> fs.mkdir("bucket/dir/nested_dir", create_parents=False)

silently passes whether the bucket exists or not; I agree that it ought to check, but we can't actually know if we have write privileges to that bucket without trying to write something.

I agree with the rest. Note that ls("") (or "." or "/") lists only owned buckets; there's no way to know about other buckets we may have read and/or write access to without specifically trying. Also, for S3, ListBuckets and ListObjects are separate privileges that a role may not have, so ls("") should either error or return empty, and listing a bucket may fail even when accessing specific files works. GCS has the same access pattern, but it seems less likely that privileges are set up in such a fine-grained way.
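To illustrate that access pattern (the bucket and key names are hypothetical, and this assumes the backend surfaces the access error as a PermissionError):

    try:
        entries = fs.ls("restricted-bucket")
    except PermissionError:
        # no list privilege on the bucket; a specific key may still be readable
        entries = [fs.info("restricted-bucket/known/key.parquet")]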

@ldacey

ldacey commented Mar 15, 2021

Hopefully @jorisvandenbossche will be able to comment more about what ds.write_dataset() expects to happen since that is the primary use case I have right now. Right now, on the 0.5.9 version of adlfs, we end up with duplicate blob paths: one path is an empty blob up to the partition "folder", and the other includes the filename and data.

container/folder/date=2020-12-31
container/folder/date=2020-12-31/part-0.parquet

I think the asyn_close branch for adlfs matches my expectations. Writing a dataset does not duplicate the blob paths. Listing a partitioned directory clearly shows which blobs are "virtual directories" based on the trailing slash.

# ls() shows the directories within the directory I am listing
fs.ls("dev/close")
['dev/close/month_id=202101/', 'dev/close/month_id=202102/']

fs.find("dev/close")
['dev/close/month_id=202101/date_id=20210103/partition-1-20210315081223-0.parquet',
 'dev/close/month_id=202101/date_id=20210103/partition-2-20210315081223-0.parquet',
 'dev/close/month_id=202102/date_id=20210201/partition-1-20210315081223-1.parquet',
 'dev/close/month_id=202102/date_id=20210211/partition-2-20210315081223-1.parquet',
 'dev/close/month_id=202102/date_id=20210214/partition-1-20210315081223-2.parquet',
 'dev/close/month_id=202102/date_id=20210217/partition-2-20210315081223-2.parquet',
 'dev/close/month_id=202102/date_id=20210219/partition-1-20210315081223-3.parquet',
 'dev/close/month_id=202102/date_id=20210219/partition-2-20210315081223-3.parquet']

fs.info("dev/close")
{'name': 'dev/close/', 'size': 0, 'type': 'directory'}

# this doesn't seem to work, and no error is raised
fs.rm("dev/close/month_id=202101")

# this removes all blob paths with month_id=202101, and works as I would expect
fs.rm("dev/close/month_id=202101", recursive=True)

It sounds like the trailing slash might introduce other inconsistencies though.

FYI, I tried to create a new virtual directory with Azure Storage Explorer and it forces you to add a file before the directory actually exists. I am not sure if it works this way with GCS and S3.

[screenshot: Azure Storage Explorer "Create New Virtual Directory" dialog, noting that the folder will not exist until a file is added to it]

@martindurant
Member

Thank you Azure for the message! Since you don't add anything in that dialog, I wonder whether it does, in fact, create something anyway.
Except that we don't know if it's a "folder" or a "directory"!

Unfortunately, the S3 and GCS web interfaces create an empty blob with "/" at the end of the name and consider this a directory.

By the way, one thing I toyed with was having mkdir (non-bucket) create a folder in the filesystem instance only; this sort of worked for some things, but I thought it ended up being even more confusing.

@ldacey

ldacey commented Mar 15, 2021

When you create the new directory (using Storage Explorer), you are unable to see anything until you load a file:

[screenshot: Azure Storage Explorer showing the newly created virtual directory, which is not visible until a file is uploaded]

  • If I use fs.ls() on that container I cannot see that virtual directory "test" which I created
  • If I hit the back button there is no folder named "test"
  • If I create the directory and then upload a blob, the folder is visible (in the GUI of Explorer and if I use fs.ls())
  • If I delete the only blob, then the folder "test" is removed as well

I think an empty blob with a trailing "/" is fine if we can avoid the empty blobs without the slash, because those are duplicate files which really add up when partitioning a large dataset (literally double the number of blobs). I need to remove them because I have a tool connecting directly to these datasets (Power BI), and other teams using the data get confused by the empty blobs.
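For reference, a hypothetical cleanup sketch for those duplicates (cc is an Azure ContainerClient, as in the SDK examples earlier in this thread): a zero-size blob whose name is also the prefix of other blobs is one of the markers described above.

    sizes = {blob.name: blob.size for blob in cc.list_blobs()}
    for name, size in sizes.items():
        # zero-size blob that also acts as a "directory" prefix for real blobs
        if size == 0 and any(other.startswith(name + "/") for other in sizes):
            cc.delete_blob(name)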

@jorisvandenbossche
Contributor

Hopefully @jorisvandenbossche will be able to comment more about what ds.write_dataset() expects to happen since that is the primary use case I have right now.

Not super familiar with that code, but I think the Arrow dataset writing functionality basically assumes it can create a directory and then write files into it (mkdir and open(path, mode="wb") in fsspec terms).
So whether this is achieved by mkdir being a no-op and the subsequent open not failing to write into a directory that doesn't exist yet, or by mkdir already creating some pseudo-directory object (and not listing those pseudo objects as files), shouldn't matter much from pyarrow's point of view, as long as this chain of actions works together.
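In other words, the chain of actions the writer relies on looks something like this in fsspec terms (the paths and data are illustrative):

    data = b"..."                                      # placeholder for the file bytes
    fs.mkdir("container/dataset/month_id=202103")      # may be a no-op on object stores
    with fs.open("container/dataset/month_id=202103/part-0.parquet", mode="wb") as f:
        f.write(data)                                  # must succeed even if mkdir created nothing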

@isidentical
Member

isidentical commented Jun 28, 2021

Something we do in DVC right now: we have a special base class, inherited by all the prefix-based object storages (with no real directories), that tries to handle empty directories in just two functions, isdir() and find():
https://github.com/iterative/dvc/blob/7c13897674214df1a4a20a0d1c9e0fd8ccff3e47/dvc/fs/fsspec_wrapper.py#L158-L196

For example, isdir() checks whether the entry is a directory, or a zero-size file whose name ends with a trailing slash:

        entry = self.info(path_info)
        return entry["type"] == "directory" or (
            entry["size"] == 0
            and entry["type"] == "file"
            and entry["name"].endswith("/")
        )

The same goes for find() (and extra handling is probably also needed for ls() etc.). When you pass a pseudo-dir (an empty file with a trailing slash) to find(), it assumes it is a regular file and returns [fs.info(file)]. So for those, we check whether find() returned a single file that is actually a directory, and if so we simply stop the generator.

        # When calling find() on a file, it returns the same file in a list.
        # For object-based storages, the same behavior applies to empty
        # directories since they are represented as files. This condition
        # checks whether we should yield an empty list (if it is an empty
        # directory) or just yield the file itself.
        if len(files) == 1 and files[0] == path and self.isdir(path_info):
            return None

        yield from self._strip_buckets(files, detail=detail)

I'm not sure if this should be done upstream or downstream, but either way it would be really beneficial for everyone to be consistent across the major filesystems (s3, gs, azure, etc.).

@martindurant
Member

I'm not sure if this should be done upstream or downstream, but either way it would be really beneficial for everyone to be consistent across the major filesystems (s3, gs, azure, etc.).

I definitely agree with this sentiment! I think a function like the one above can definitely be included in fsspec, perhaps in utils as a function or a mixin class. Perhaps more important would be to flesh out #651 to define what we think the behaviour ought to look like, and write some tests that span all the object stores.
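For concreteness, a rough sketch of what such a mixin might look like (the class name is hypothetical, not an existing fsspec API), reusing the check quoted above:

    class PseudoDirMixin:
        # treat a zero-size object whose name ends in "/" as a directory
        def isdir(self, path):
            entry = self.info(path)
            return entry["type"] == "directory" or (
                entry["type"] == "file"
                and entry["size"] == 0
                and entry["name"].endswith("/")
            )

    # an object-store filesystem would then inherit it, e.g.
    # class SomeObjectStoreFileSystem(PseudoDirMixin, AbstractFileSystem): ...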

Anaconda is having a "hack week" and there are holidays too, so I may not be too responsive this week.
