
Directories and delimiter handling #562

Open
hayesgb opened this issue Mar 9, 2021 · 20 comments

@hayesgb
Contributor

hayesgb commented Mar 9, 2021

Opened based on #554 .

Azure Storage consists of a two-level hierarchy: the top level is a container, and all blobs underneath a container have a flat structure. Delimiters ("/") that denote folders in traditional filesystems have no meaning. Azure implements a BlobPrefix object for convenience, but a BlobPrefix cannot be created directly and is removed as soon as all blobs underneath the prefix are deleted.

This creates challenges for filesystem operations like mkdir(), because an empty directory can only be represented by creating an empty blob. The result is that:

fs.mkdir("container/blob_folder")
fs.mkdir("container/blob_folder/")

both appear as size=0 blobs, and are distinct objects. Choosing the former convention creates issues, as described here, with partitioned parquet files, while the latter runs counter to the convention of removing a trailing "/" when listing directories and/or storing them in dircache, as evidenced by #554.

It's my understanding that this is also an issue for s3fs and gcsfs. Do those filesystems exhibit similar challenges, and if so, how are they handled there? I'm hoping to align on an approach that provides consistency for users and works with dircache.

Thanks,
Greg

@martindurant
Member

Thanks for opening this, and let's take the chance to nail down how we would like this to work.
Yes, s3fs and gcsfs face the exact same structure (buckets/keys and prefix/delimiter emulation) and problems. I'll take a moment to describe the current general conventions we have.

In most cases, paths are stripped of any trailing "/". mkdir is a no-op for anything below buckets, and directories are only "created" by containing other keys. In comments, these are usually called "pseudo-directories".

However, since the appearance of recursive operations and one-shot find, there is a certain amount of awkward code to back-fill inferred directories. There are known failure cases where paths end with "/", which would be illegal on POSIX, but they do appear when using the web consoles and some frameworks (Spark?). Note that dircache stores listings keyed by their (pseudo-)directory, and those keys don't end in "/", which means we don't know which listing to put a name ending in "/" into.
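For illustration, a minimal sketch of the dircache convention (the paths and entries here are hypothetical):

    # dircache maps a pseudo-directory path, stored WITHOUT the trailing "/",
    # to the listing of its immediate children
    fs.dircache["bucket/path"] = [
        {"name": "bucket/path/file1", "size": 10, "type": "file"},
        {"name": "bucket/path/sub", "size": 0, "type": "directory"},
    ]
    # a key like "bucket/path/" never appears, so an entry named
    # "bucket/key/" has no obvious parent listing to be filed under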

So I suggest:

  • a set of short tests showing what ls, glob and find should return in various situations, with or without "/"-ended path names, and at various depths including the bucket level (NB: a bucket name cannot contain "/"); a minimal sketch of such a test follows below
  • for the same set of circumstances, explicitly checking that dircache contains the expected entries, i.e., repeated calls to ls/find/glob should still return the right output.
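A minimal sketch of the kind of test meant above; the fs fixture and the bucket/key names are hypothetical, and the expectations simply mirror the "strip the trailing '/'" convention discussed in this thread:

    import pytest

    @pytest.mark.parametrize("path", ["bucket/dir", "bucket/dir/"])
    def test_ls_ignores_trailing_slash(fs, path):
        fs.touch("bucket/dir/file")
        # listing with or without the trailing "/" should give the same result
        assert fs.ls(path) == ["bucket/dir/file"]
        # and dircache should be keyed without the trailing "/"
        assert "bucket/dir" in fs.dircache
        assert "bucket/dir/" not in fs.dircache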

@martindurant
Member

I should add that mkdir being a no-op means you sometimes get counterintuitive behaviour:

>>> s3.mkdir("bucket/path")
>>> s3.exists("bucket/path")
False

but I don't think there's a way around this; the even more unsatisfactory alternative (involving more calls) still fails intuition, in the opposite direction:

>>> s3.exists("bucket/path")
False
>>> s3.touch("bucket/path/file")  # OK
>>> s3.isdir("bucket/path")
True

@hayesgb
Contributor Author

hayesgb commented Mar 9, 2021

Originally, adlfs implemented fs.mkdir() by creating a blob (rather than a no-op) without the trailing delimiter and with size=0. This eventually led to empty blobs being created by pyarrow to_dataset() operations. The current version in master creates an empty blob with a trailing delimiter, which has resolved the problem of empty blobs, but means that:

    ## these are containers
    assert fs.ls("") == ["data/"]
    assert fs.ls("/") == ["data/"]
    assert fs.ls(".") == ["data/"]

    ## these are top-level directories and files
    assert fs.ls("data") == ["data/root/", "data/top_file.txt"]
    assert fs.ls("/data") == ["data/root/", "data/top_file.txt"]

    # root contains files and directories
    assert fs.ls("data/root") == [
        "data/root/a/",
        "data/root/a1/",
        "data/root/b/",
        "data/root/c/",
        "data/root/d/",
        "data/root/rfile.txt",
    ]

It should resolve fs.exists(), because an object does exist in that location, but it can create confounding situations where fs.isdir() or fs.isfile() return incorrect results. However, Azure blobs provide a metadata parameter, and for mkdir and AzureBlobFile objects I've been looking at setting metadata["is_directory"] = "true" (or "false") when an object is written.
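A rough sketch of how that could look with the Azure SDK (not the actual adlfs implementation; the connection string and container/blob names are placeholders):

    from azure.storage.blob import ContainerClient

    conn_str = "..."  # placeholder for a storage account connection string
    cc = ContainerClient.from_connection_string(conn_str, container_name="data")
    # write the zero-byte marker blob with a flag so that isdir()/isfile()
    # can later distinguish it from a genuine empty file
    cc.upload_blob(name="root/a/", data=b"", metadata={"is_directory": "true"})

    props = cc.get_blob_client("root/a/").get_blob_properties()
    assert props.metadata.get("is_directory") == "true"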

It sounds as if your preference would be to omit the trailing delimiter from a pseudo-directory and leave mkdir as a no-op. Is that correct?

@martindurant
Member

It sounds as if your preference would be to omit the trailing delimiter from a pseudo-directory and leave mkdir as a no-op. Is that correct?

It seems to me to be the intent of the API; the advantage is avoiding extra calls to create the objects. I like the metadata idea, but presumably we'd also have to account for trees that were not written with this convention.
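A minimal sketch of that fallback, assuming the marker blobs carry the is_directory flag described above; trees written without the convention fall back to a prefix listing (function and parameter names are illustrative):

    from azure.core.exceptions import ResourceNotFoundError

    def looks_like_directory(container_client, path):
        # 1) trust the metadata flag on a marker blob, if one exists
        try:
            props = container_client.get_blob_client(path).get_blob_properties()
            if props.metadata and props.metadata.get("is_directory") == "true":
                return True
        except ResourceNotFoundError:
            pass
        # 2) otherwise infer a pseudo-directory from the existence of children
        children = container_client.list_blobs(name_starts_with=path.rstrip("/") + "/")
        return next(iter(children), None) is not None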

@hayesgb
Contributor Author

hayesgb commented Mar 10, 2021

Thinking about mkdir() being a no-op, from above:

>>> s3.mkdir("bucket/path")
>>> s3.exists("bucket/path")

This also means that:

>>> s3.ls("bucket")
[]
>>> s3.touch("bucket/path/file")   # creates a BlobPrefix for path
>>> s3.ls("bucket")                # now the pseudo-directory appears
["path"]
>>> s3.rm("bucket/path/file")      # removing the file automatically deletes the BlobPrefix
>>> s3.ls("bucket")
[]

correct?

This means s3.rmdir() should also be a no-op, for anything except buckets/containers, right?

Agree the reduction in calls is an advantage of the API, but are users OK with this behavior?

@martindurant
Member

Yes, that's exactly what currently happens for s3fs. I can't remember; rmdir might give a helpful "directory not empty" message.
In practice, I think people do rm(path, recursive=True), which may attempt to also delete the non-existent pseudo-directory and ignore the failure.
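In fsspec terms that pattern looks roughly like this (a sketch, not s3fs's actual code; the paths are illustrative):

    from contextlib import suppress

    # delete every real key under the prefix ...
    for key in fs.find("bucket/path"):
        fs.rm(key)
    # ... then try the pseudo-directory itself and ignore the failure,
    # since it may never have existed as an object
    with suppress(FileNotFoundError):
        fs.rm("bucket/path")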

@hayesgb
Contributor Author

hayesgb commented Mar 10, 2021

Tagging @isidentical , @ldacey, and @jorisvandenbossche

Hoping for input on the above discussion from some adlfs users who have contributed to its development.

@isidentical
Member

Support for empty directories is really important for our use case: not as in creating them with .mkdir(), but rather being able to do fs.isdir('empty/dir/') and fs.ls('empty/dir/'), so that when someone creates a directory using the provider's UI (for example the "Create Directory" option in Google Cloud) we can recognise it and create a local directory. Regarding .mkdir(), I have no comments.

@hayesgb
Contributor Author

hayesgb commented Mar 11, 2021

I've been testing the current behavior in Azure. If I go to the portal, and create:

adlfs/test_folder/test_inner

Then, when I create a container_client and call container_client.walk_blobs(name_starts_with="test_folder"), I get back a generator that, when iterated and printed, yields:

<class 'azure.storage.blob._list_blobs_helper.BlobPrefix'>
{'name': 'testfolder/', 'prefix': 'testfolder/', 'results_per_page': None, 'container': 'adlfs', 'delimiter': None, 'location_mode': 'primary'}
<class 'azure.storage.blob._list_blobs_helper.BlobPrefix'>
{'name': 'testfolder/test_inner/', 'prefix': 'testfolder/test_inner/', 'results_per_page': None, 'container': 'adlfs', 'delimiter': None, 'location_mode': 'primary'}

But if I call container_client.list_blobs(name_starts_with="test_folder"), I get back an iterator that returns:

<class 'azure.storage.blob._models.BlobProperties'>
{'name': 'testfolder', 'container': 'adlfs', 'snapshot': None, 'version_id': None, 'is_current_version': None, 'blob_type': <BlobType.BlockBlob: 'BlockBlob'>, 'metadata': {}, 'encrypted_metadata': None, 'last_modified': datetime.datetime(2021, 3, 10, 20, 30, 3, tzinfo=datetime.timezone.utc), 'etag': '0x8D8E403491339A9', 'size': 0, 'content_range': None, 'append_blob_committed_block_count': None, 'is_append_blob_sealed': None, 'page_blob_sequence_number': None, 'server_encrypted': True, 'copy': {'id': None, 'source': None, 'status': None, 'progress': None, 'completion_time': None, 'status_description': None, 'incremental_copy': None, 'destination_snapshot': None}, 'content_settings': {'content_type': 'application/octet-stream', 'content_encoding': None, 'content_language': None, 'content_md5': None, 'content_disposition': None, 'cache_control': 'no-cache'}, 'lease': {'status': 'unlocked', 'state': 'available', 'duration': None}, 'blob_tier': 'Hot', 'rehydrate_priority': None, 'blob_tier_change_time': None, 'blob_tier_inferred': True, 'deleted': None, 'deleted_time': None, 'remaining_retention_days': None, 'creation_time': datetime.datetime(2021, 3, 10, 20, 30, 3, tzinfo=datetime.timezone.utc), 'archive_status': None, 'encryption_key_sha256': None, 'encryption_scope': None, 'request_server_encrypted': None, 'object_replication_source_properties': [], 'object_replication_destination_policy': None, 'last_accessed_on': None, 'tag_count': None, 'tags': None}

The list_blobs() response is consistent with a "pseudo-folder" in Azure Blob being an empty blob.

Uploading a blob using either the client or the portal to testfolder2/test_inner2/myfile.txt, and subsequently deleting testfolder2/test_inner2/myfile.txt, removes only myfile.txt and leaves the nested folders in place.

This is not consistent with S3 as described above. With the Azure SDK:

cc.delete_blob(blob="testfolder2/test_inner2/myfile.rtf")
cc.list_blobs(name_starts_with="testfolder2/test_inner2")  # returns a blob of size=0 with name="testfolder2/test_inner2"

So leaving mkdir as a no-op would work for writes, but deletes would leave the "pseudo-folder" blobs in place.
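If we wanted deletes to behave the way S3 does, adlfs would presumably have to sweep up those leftover markers itself; a hypothetical sketch against the SDK (cc is a ContainerClient as in the examples above):

    def delete_prefix(cc, prefix):
        # delete the real blobs first, then any zero-size "pseudo-folder"
        # markers that Azure leaves behind under the prefix
        for blob in cc.list_blobs(name_starts_with=prefix):
            if blob.size > 0:
                cc.delete_blob(blob.name)
        for blob in cc.list_blobs(name_starts_with=prefix):
            if blob.size == 0:
                cc.delete_blob(blob.name)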

@martindurant
Member

Uploading a blob using either the client or the portal to testfolder2/test_inner2/myfile.txt, and subsequently deleting testfolder2/test_inner2/myfile.txt, removes only myfile.txt and leaves the nested folders in place.

This would not be true if the placeholder "adlfs/test_folder/test_inner" had not been previously created though, right?

@hayesgb
Contributor Author

hayesgb commented Mar 12, 2021

Yes, it is true. See the following, using the Azure Storage Blob Python SDK directly:

>>> sc = BlobServiceClient(account_url = <my_url>, credential = <token>)
>>> cc = sc.get_container_client('my-container')
>>> cc.create_container()  # creates a new container
>>> cc.upload_blob(name="testfolder/test_inner/myfile.rtf", data=b"012346789")  # Create a nested blob.

>>> blobs = cc.list_blobs(name_starts_with="testfolder/test_inner")
>>> for blob in blobs:
>>>     print(blob.name)

testfolder/test_inner
testfolder/test_inner/myfile.rtf

>>> cc.delete_blob(blob = "testfolder/test_inner/myfile.rtf")
>>> blobs2 = cc.list_blobs(name_starts_with="testfolder/test_inner")
>>> for blob in blobs2:
>>>     print(blob)

{'name': 'testfolder/test_inner', 'container': 'my-container', 'snapshot': None, 'version_id': None, 'is_current_version': None, 'blob_type': <BlobType.BlockBlob: 'BlockBlob'>, 'metadata': {}, 'encrypted_metadata': None, 'last_modified': datetime.datetime(2021, 3, 12, 3, 19, 4, tzinfo=datetime.timezone.utc), 'etag': '0x8D8E505973EE89A', 'size': 0, 'content_range': None, 'append_blob_committed_block_count': None, 'is_append_blob_sealed': None, 'page_blob_sequence_number': None, 'server_encrypted': True, 'copy': {'id': None, 'source': None, 'status': None, 'progress': None, 'completion_time': None, 'status_description': None, 'incremental_copy': None, 'destination_snapshot': None}, 'content_settings': {'content_type': None, 'content_encoding': None, 'content_language': None, 'content_md5': None, 'content_disposition': None, 'cache_control': None}, 'lease': {'status': 'unlocked', 'state': 'available', 'duration': None}, 'blob_tier': 'Hot', 'rehydrate_priority': None, 'blob_tier_change_time': None, 'blob_tier_inferred': True, 'deleted': None, 'deleted_time': None, 'remaining_retention_days': None, 'creation_time': datetime.datetime(2021, 3, 12, 3, 19, 4, tzinfo=datetime.timezone.utc), 'archive_status': None, 'encryption_key_sha256': None, 'encryption_scope': None, 'request_server_encrypted': None, 'object_replication_source_properties': [], 'object_replication_destination_policy': None, 'last_accessed_on': None, 'tag_count': None, 'tags': None}

Notice that the pseudo-directory remains in place, and it's a BlockBlob with size = 0.

Both of the following operations raise a StorageError, "Specified blob already exists":

cc.upload_blob(name = "testfolder/test_inner/", data="")
cc.upload_blob(name = "testfolder/test_inner", data="")

@hayesgb
Contributor Author

hayesgb commented Mar 13, 2021

I'm on board with mkdir being a no-op when the bucket exists. This means we would expect the following behavior if the bucket does not exist:

>>> fs.mkdir("bucket")  # Creates bucket
>>> fs.mkdir("bucket/dir/nested_dir", create_parents=False)  # Raises a PermissionError
>>> fs.mkdir("bucket/dir/nested_dir", create_parents=True)   # I would expect this to create bucket, but none of the inner pseudo-directories.  Thoughts?

@hayesgb
Contributor Author

hayesgb commented Mar 13, 2021

Additionally, once bucket is created:

>>> assert "bucket" in fs.ls(".")
>>> assert "bucket/" not in fs.ls(".")

And the following should be true once bucket exists:

fs.makedir("bucket", exist_ok=False)  # raises a FileExistsError
fs.makedir("bucket", exist_ok=True)   # passes silently

@martindurant
Member

cf fsspec/s3fs#439

Currently in s3fs,

>>> fs.mkdir("bucket/dir/nested_dir", create_parents=False)

silently passes whether the bucket exists or not; I agree that it ought to check, but we can't actually know if we have write privileges to that bucket without trying to write something.

I agree with the rest. Note that ls("") (or "." or "/") lists only owned buckets; there's no way to know about other buckets we may have read and/or write access to without specifically trying. Also, for S3, ListBuckets and ListObjects are separate privileges that a role may not have, so ls("") should either error or return empty, and listing a bucket may fail even when accessing specific files works. GCS has the same access pattern, but it seems less likely that privileges are set up in such a fine-grained way.
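To illustrate that access pattern (the bucket and key names are hypothetical, and this assumes the backend surfaces the access error as a PermissionError):

    try:
        entries = fs.ls("restricted-bucket")
    except PermissionError:
        # no list privilege on the bucket; a specific key may still be readable
        entries = [fs.info("restricted-bucket/known/key.parquet")]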

@ldacey

ldacey commented Mar 15, 2021

Hopefully @jorisvandenbossche will be able to comment more about what ds.write_dataset() expects to happen since that is the primary use case I have right now. Right now, on the 0.5.9 version of adlfs, we end up with duplicate blob paths: one path is an empty blob up to the partition "folder", and the other includes the filename and data.

container/folder/date=2020-12-31
container/folder/date=2020-12-31/part-0.parquet

I think the asyn_close branch for adlfs matches my expectations. Writing a dataset does not duplicate the blob paths. Listing a partitioned directory clearly shows which blobs are "virtual directories" based on the trailing slash.

# ls() shows the directories within the directory I am listing
fs.ls("dev/close")
['dev/close/month_id=202101/', 'dev/close/month_id=202102/']

fs.find("dev/close")
['dev/close/month_id=202101/date_id=20210103/partition-1-20210315081223-0.parquet',
 'dev/close/month_id=202101/date_id=20210103/partition-2-20210315081223-0.parquet',
 'dev/close/month_id=202102/date_id=20210201/partition-1-20210315081223-1.parquet',
 'dev/close/month_id=202102/date_id=20210211/partition-2-20210315081223-1.parquet',
 'dev/close/month_id=202102/date_id=20210214/partition-1-20210315081223-2.parquet',
 'dev/close/month_id=202102/date_id=20210217/partition-2-20210315081223-2.parquet',
 'dev/close/month_id=202102/date_id=20210219/partition-1-20210315081223-3.parquet',
 'dev/close/month_id=202102/date_id=20210219/partition-2-20210315081223-3.parquet']

fs.info("dev/close")
{'name': 'dev/close/', 'size': 0, 'type': 'directory'}

# this doesn't seem to work, and no error is raised
fs.rm("dev/close/month_id=202101")

# this removes all blob paths with month_id=202101, and works as I would expect
fs.rm("dev/close/month_id=202101", recursive=True)

It sounds like the trailing slash might introduce other inconsistencies though.

FYI, I tried to create a new virtual directory with Azure Storage Explorer and it forces you to add a file before the directory actually exists. I am not sure if it works this way with GCS and S3.

[screenshot: Azure Storage Explorer "Create New Virtual Directory" dialog, noting that the folder will not exist until a file is added to it]

@martindurant
Member

Thank you Azure for the message! Since you don't add anything in that dialog, I wonder whether it does, in fact, create something anyway.
Except that we don't know if it's a "folder" or a "directory"!

Unfortunately, the S3 and GCS web interfaces create an empty blob with "/" at the end of the name and consider this a directory.

By the way, one thing I toyed with was having mkdir (non-bucket) create a folder in the filesystem instance only; this sort of worked for some things, but I thought it ended up being even more confusing.

@ldacey

ldacey commented Mar 15, 2021

When you create the new directory (using Storage Explorer), you are unable to see anything until you load a file:

[screenshot: Azure Storage Explorer showing the newly created virtual directory, which is not visible until a file is uploaded]

  • If I use fs.ls() on that container I cannot see that virtual directory "test" which I created
  • If I hit the back button there is no folder named "test"
  • If I create the directory and then upload a blob, the folder is visible (in the GUI of Explorer and if I use fs.ls())
  • If I delete the only blob, then the folder "test" is removed as well

I think an empty blob with a trailing "/" is fine if we can avoid the empty blobs without the slash, because those are duplicate files which really add up when partitioning a large dataset (literally double the number of blobs). I need to remove them because I have a tool connecting directly to these datasets (Power BI), and other teams using the data get confused by the empty blobs.
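For reference, a hypothetical cleanup sketch for those duplicates (cc is an Azure ContainerClient, as in the SDK examples earlier in this thread): a zero-size blob whose name is also the prefix of other blobs is one of the markers described above.

    sizes = {blob.name: blob.size for blob in cc.list_blobs()}
    for name, size in sizes.items():
        # zero-size blob that also acts as a "directory" prefix for real blobs
        if size == 0 and any(other.startswith(name + "/") for other in sizes):
            cc.delete_blob(name)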

@jorisvandenbossche
Contributor

Hopefully @jorisvandenbossche will be able to comment more about what ds.write_dataset() expects to happen since that is the primary use case I have right now.

Not super familiar with that code, but I think the Arrow dataset writing functionality basically assumes it can create a directory and then write files into it (mkdir and open(path, mode="wb") in fsspec terms).
So whether this is achieved by mkdir being a no-op and the subsequent open not failing to write into a directory that doesn't exist yet, or by mkdir already creating some pseudo-directory object (and not listing those pseudo objects as files), shouldn't matter much from pyarrow's point of view, as long as this chain of actions works together.
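In other words, the chain of actions the writer relies on looks something like this in fsspec terms (the paths and data are illustrative):

    data = b"..."                                      # placeholder for the file bytes
    fs.mkdir("container/dataset/month_id=202103")      # may be a no-op on object stores
    with fs.open("container/dataset/month_id=202103/part-0.parquet", mode="wb") as f:
        f.write(data)                                  # must succeed even if mkdir created nothing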

@isidentical
Member

isidentical commented Jun 28, 2021

Something we do in DVC right now: we have a special base class, inherited by all the prefix-based object storages (with no real directories), that tries to handle empty directories in just two functions, isdir() and find():
https://github.com/iterative/dvc/blob/7c13897674214df1a4a20a0d1c9e0fd8ccff3e47/dvc/fs/fsspec_wrapper.py#L158-L196

For example, isdir() checks whether the entry is a directory, or a zero-size file whose name ends with a trailing slash:

        entry = self.info(path_info)
        return entry["type"] == "directory" or (
            entry["size"] == 0
            and entry["type"] == "file"
            and entry["name"].endswith("/")
        )

The same goes for find() (and extra handling is probably also needed for ls() etc.). When you pass a pseudo-dir (an empty file with a trailing slash) to find(), it assumes it is a regular file and returns [fs.info(file)]. So for those, we check whether find() returned a single file that is actually a directory, and if so we simply stop the generator.

        # When calling find() on a file, it returns the same file in a list.
        # For object-based storages, the same behavior applies to empty
        # directories since they are represented as files. This condition
        # checks whether we should yield an empty list (if it is an empty
        # directory) or just yield the file itself.
        if len(files) == 1 and files[0] == path and self.isdir(path_info):
            return None

        yield from self._strip_buckets(files, detail=detail)

I'm not sure if this should be done upstream or downstream, but either way it would be really beneficial for everyone to be consistent across the major filesystems (s3, gs, azure, etc.).

@martindurant
Member

I'm not sure if this should be done upstream or downstream, but either way it would be really beneficial for everyone to be consistent across the major filesystems (s3, gs, azure, etc.).

I definitely agree with this sentiment! I think a function like the one above can definitely be included in fsspec, perhaps in utils as a function or a mixin class. Perhaps more important would be to flesh out #651 to define what we think the behaviour ought to look like, and write some tests that span all the object stores.
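For concreteness, a rough sketch of what such a mixin might look like (the class name is hypothetical, not an existing fsspec API), reusing the check quoted above:

    class PseudoDirMixin:
        # treat a zero-size object whose name ends in "/" as a directory
        def isdir(self, path):
            entry = self.info(path)
            return entry["type"] == "directory" or (
                entry["type"] == "file"
                and entry["size"] == 0
                and entry["name"].endswith("/")
            )

    # an object-store filesystem would then inherit it, e.g.
    # class SomeObjectStoreFileSystem(PseudoDirMixin, AbstractFileSystem): ...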

Anaconda is having a "hack week" and there are holidays too, so I may not be too responsive this week.
