
[Question] is hdf5 the best format for data storage? #98

Open
Howuhh opened this issue Jun 23, 2023 · 4 comments

Comments
@Howuhh (Contributor) commented Jun 23, 2023

Question

While HDF5 (via h5py) is the most popular approach for storing multi-dimensional arrays, it has some major limitations. For example, data cannot be read from multiple processes/threads simultaneously, which can be important for implementing efficient data loading.

There is an alternative, Zarr, which is very similar but somewhat more capable. I think a discussion on this would be useful to the community.
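For illustration, a minimal sketch of the Zarr pattern (the store path, shapes, and chunk sizes here are made up): because Zarr stores each chunk as a separate object, plain concurrent reads work without any special mode.

```python
import numpy as np
import zarr

# Create a chunked array on disk; each chunk is a separate file/object,
# so different processes can read different chunks independently.
z = zarr.open(
    "episodes.zarr", mode="w",
    shape=(10_000, 84, 84), chunks=(100, 84, 84), dtype="uint8",
)
z[:100] = np.zeros((100, 84, 84), dtype="uint8")  # fills exactly one chunk

# Any number of reader processes can do this concurrently:
r = zarr.open("episodes.zarr", mode="r")
batch = r[:100]  # only the chunks covering this slice are loaded
```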

@elliottower (Member) commented Jun 23, 2023

Haven't tested it myself, but it looks like HDF5 and h5py should be able to support multi-process reads via SWMR (https://docs.h5py.org/en/latest/swmr.html?#multiprocess-concurrent-write-and-read), although multithreaded access doesn't seem to be possible as far as I can see.
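For reference, a rough sketch of the SWMR pattern from those docs (file and dataset names are hypothetical); the writer must open the file with `libver="latest"` and create its datasets before enabling SWMR mode.

```python
import h5py
import numpy as np

# --- writer process ---
f = h5py.File("episodes.h5", "w", libver="latest")
dset = f.create_dataset("obs", shape=(0,), maxshape=(None,), dtype="f4")
f.swmr_mode = True            # from now on, readers may attach concurrently

dset.resize((100,))
dset[:] = np.random.rand(100).astype("f4")
dset.flush()                  # make the new data visible to readers

# --- reader process (a separate process in practice) ---
r = h5py.File("episodes.h5", "r", libver="latest", swmr=True)
obs = r["obs"]
obs.id.refresh()              # pick up data flushed after the reader opened
print(obs[:10])
r.close()
```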

Hadn't heard of Zarr before, but some searching turned up a few issues saying it was slower than HDF5, and this paper seems to reach the same conclusion: https://arxiv.org/pdf/2207.09503.pdf (though it only mentions multithreading once at the beginning, so I'm guessing it doesn't test that extensively, and that does seem to be one of Zarr's main advantages). As you say, Zarr does have advantages in concurrency and chunking, which sounds useful (https://pythonspeed.com/articles/mmap-vs-zarr-hdf5/). Here's a comparison with Parquet: https://sites.google.com/view/raybellwaves/blog/zarr-vs-parquet

We talked about this a bit in the Lambda meeting today; the plan is to reach out to different people and see what they would prefer, to avoid arbitrarily changing formats now and then changing again in the future.

It seems like Apache Arrow may be a good choice: it's used by Hugging Face datasets, Ray Data, and, as of recently, pandas. You can save tables to disk as either Arrow/Feather files (uncompressed AFAIK, but fast to read) or Parquet files (compressed and more intended for long-term storage). It looks like Hugging Face saves datasets directly as Arrow files, so that seems like a reasonable thing to do here too, IMO. Converting between Parquet and Arrow is supposed to be very easy, and I think Parquet would be flexible enough to support complex nested data like Minari has, but maybe not (more info: https://arrow.apache.org/blog/2022/10/17/arrow-parquet-encoding-part-3/)
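A hedged sketch of that Arrow/Feather/Parquet round-trip with pyarrow (the column names are invented for illustration, not Minari's actual schema):

```python
import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.parquet as pq

table = pa.table({
    "episode_id": [0, 0, 1],
    "reward": [0.0, 1.0, 0.5],
    # nested data can be represented as list columns
    "obs": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
})

feather.write_feather(table, "data.arrow")  # fast-to-read IPC file
pq.write_table(table, "data.parquet")       # compressed, long-term storage

table2 = pq.read_table("data.parquet")      # converting back is trivial
```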

John mentioned something today about an issue with storing tuples in Hugging Face datasets, though, so it sounds like there may be some unexpected issues to work out. I posted in the dev channel, but it feels important to figure these things out early on: the longer it's delayed, the more painful it will be to switch formats.

@jamartinh (Contributor)

I have done several months of research to identify formats and their advantages.

In the end, I have found HDF5 (with h5py) to be the best option:

- It is a well-established standard, even more so than other file formats.
- It allows saving NumPy arrays directly and easily.
- It supports "single writer, multiple readers" (SWMR) directly.
- It supports compression.
- If done carefully, it only puts into RAM the data you are currently reading, such as a single episode. This lets you open a file, read the stats, and filter, i.e. load only the episodes of interest, without consuming all the RAM, which allows for big files and also makes data access faster.
- I use HDF5 for multiprocessing: each instance opens the same file but loads only the data it needs, so RAM stays safe for big files and multi-process workloads (see the sketch below).
- My tests indicate that opening an HDF5 file just to read a single episode is faster than with any other file type.
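A minimal sketch of this lazy, per-episode access pattern with h5py (the file, group, dataset, and attribute names are hypothetical, not Minari's actual layout):

```python
import h5py

with h5py.File("dataset.h5", "r") as f:
    # Read stats/attributes without loading any array data.
    total_episodes = f.attrs.get("total_episodes", len(f.keys()))

    # Load a single episode; each process can do this independently
    # on the same file, and only these slices are materialized in RAM.
    ep = f["episode_42"]
    observations = ep["observations"][:]
    rewards = ep["rewards"][:]
```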

@elliottower (Member)


Thanks for the feedback; it also seems to be the most widely used format in offline RL, so in terms of compatibility and standardization it's probably the best choice. There's definitely an argument to be made for Zarr, but IMO the best approach is to support alternative file formats like that as an option, while still maintaining HDF5 as the standard.

@eugeneteoh

safetensors could be a good option.

Also, I would store each transition as a separate file, since file sizes will be huge when the observation space is large (e.g. images).
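A minimal sketch of that idea with the safetensors NumPy API (the per-transition file layout and key names are just an assumption for illustration):

```python
import numpy as np
from safetensors.numpy import load_file, save_file

# One transition per file; keys and shapes are made up for this example.
transition = {
    "observation": np.zeros((84, 84, 3), dtype=np.uint8),
    "action": np.array([1], dtype=np.int64),
    "reward": np.array([0.5], dtype=np.float32),
}
save_file(transition, "transition_000001.safetensors")

# Loading reads only this one small file, not the whole dataset.
loaded = load_file("transition_000001.safetensors")
```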
