[Question] is hdf5 the best format for data storage? #98
Comments
I haven't tested it myself, but it looks like hdf5 and h5py should be able to support multi-process reads (https://docs.h5py.org/en/latest/swmr.html?#multiprocess-concurrent-write-and-read), although multithreading doesn't seem to be possible as far as I can see.

I hadn't heard of Zarr before, but I did some googling on Zarr vs hdf5 and saw a few issues saying it was slower than hdf5, plus this paper, which seems to reach the same conclusion: https://arxiv.org/pdf/2207.09503.pdf (they only mention multithreading once at the beginning, so I'm guessing it isn't tested extensively, and that does seem to be one of the main advantages of Zarr). As you say, though, Zarr does have advantages in concurrency and chunking, which sounds useful (https://pythonspeed.com/articles/mmap-vs-zarr-hdf5/). Here's something comparing it with parquet: https://sites.google.com/view/raybellwaves/blog/zarr-vs-parquet

We talked about this a bit in the lambda meeting today. The plan is to reach out to different people and see what they would prefer, to avoid changing the format arbitrarily and then having to change it again in the future.

It seems like Apache Arrow may be a good choice: it's used by huggingface datasets, as well as ray data and, as of recently, pandas. You can save data to disk as either Arrow/Feather files (uncompressed afaik, but fast to read) or parquet files (compressed and more intended for long-term storage). It looks like huggingface saves datasets directly as arrow files, so that seems like a reasonable thing to do here too imo. Converting between parquet and arrow is supposed to be very easy, and I think parquet would be flexible enough to support complex nested data like minari has, but maybe not (more info: https://arrow.apache.org/blog/2022/10/17/arrow-parquet-encoding-part-3/). John mentioned something today about an issue with storing tuples in huggingface datasets though, so it sounds like there may be some unexpected issues to work out.
I posted in the dev channel, but it feels important to figure out these sorts of things early on, as the longer it's delayed, the more painful it will be to switch formats.
I spent some months researching formats and their advantages. In the end, I found hdf5 with h5py to be the best option. It is a well-established standard, arguably more so than other file formats.
Thanks for the feedback, it seems to be the most widely used in the field of offline RL as well, so in terms of compatibility and standardizing things it’s probably the best choice. There’s definitely an argument to be made for Zarr, but imo the best option is to support alternative file formats like that as an option while still maintaining compatibility with HDF5 as the standard.
safetensors could be a good option. Also, I would store each transition as a separate file. A single file will be huge when the observation space is large (e.g. images).
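To put a rough number on the file-size concern, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration: the helper name `dataset_bytes` is hypothetical, and the 84x84x3 uint8 observation shape and the one million transitions are made-up figures, not measurements from any real dataset. It also ignores compression, actions, rewards, and metadata.

```python
def dataset_bytes(n_transitions, obs_shape, bytes_per_element=1, stores_next_obs=True):
    """Estimate uncompressed observation storage for a dataset of transitions.

    Hypothetical helper: counts only observation data, nothing else.
    """
    per_obs = bytes_per_element
    for dim in obs_shape:
        per_obs *= dim
    # A transition may store both the observation and the next observation.
    obs_per_transition = 2 if stores_next_obs else 1
    return n_transitions * per_obs * obs_per_transition


# Assumed example: 1M transitions of 84x84 RGB uint8 images.
total = dataset_bytes(1_000_000, (84, 84, 3))
print(f"{total / 1e9:.1f} GB uncompressed")
```

Under these assumptions a single monolithic file would be in the tens of gigabytes, which is the scale at which splitting the data into smaller units starts to matter.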
Question
While hdf5 and h5py are the most popular approach for storing multi-dimensional arrays, they have some major limitations. For example, the inability to read data from multiple processes / threads simultaneously, which can be important for implementing efficient data loading.
There is an alternative, Zarr, which is very similar but a bit more capable. I think a discussion on this would be useful to the community.
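As a rough illustration of why a chunked layout helps with parallel loading, here is a minimal sketch of the index arithmetic only. It does not use the Zarr or h5py APIs; the function names, the chunk length of 100 rows, and the round-robin assignment are all assumptions made for the example. The idea is that if rows are grouped into fixed-size chunks that can each be read on their own, workers can be given disjoint sets of chunks and never touch the same one.

```python
import math


def chunks_for_slice(start, stop, chunk_len):
    """Return the indices of the chunks covering rows [start, stop)."""
    if stop <= start:
        return []
    first = start // chunk_len
    last = (stop - 1) // chunk_len
    return list(range(first, last + 1))


def split_among_workers(n_rows, chunk_len, n_workers):
    """Assign whole chunks to workers round-robin so no chunk is shared."""
    n_chunks = math.ceil(n_rows / chunk_len)
    return {w: list(range(w, n_chunks, n_workers)) for w in range(n_workers)}


# Assumed example: 1000 rows stored in chunks of 100 rows, read by 4 workers.
print(chunks_for_slice(250, 1050, 100))
print(split_among_workers(1000, 100, 4))
```

Whether this translates into real speedups depends on the storage backend and on how the library handles locking, which is exactly the part worth benchmarking.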