
h5py file writing is slow #16

Open
siemdejong opened this issue Mar 2, 2023 · 6 comments
Labels
enhancement New feature or request

Comments

@siemdejong
Owner

Is your feature request related to a problem? Please describe.
Writing to the hdf5 file takes very long: about an hour for a fold.

Describe the solution you'd like
Concurrent file writing.

Describe alternatives you've considered
HDF5 for Python (h5py) allows for concurrent writing via MPI [1].

Additional context
[1] https://docs.h5py.org/en/stable/mpi.html
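
A minimal sketch of parallel writing with h5py (assuming h5py is built with MPI support and mpi4py is installed; file and dataset names are arbitrary):

```python
# Run with e.g. `mpiexec -n 4 python parallel_write.py`.
from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# The mpio driver lets all ranks write to the same file collectively.
with h5py.File("parallel.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("test", (comm.Get_size(),), dtype="i")
    dset[rank] = rank  # each rank writes its own element
```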

siemdejong added the enhancement (New feature or request) label Mar 2, 2023
@siemdejong
Owner Author

MPI must be available for parallel h5py writing.

@siemdejong
Owner Author

siemdejong commented Mar 14, 2023

If a fold contains all images (slow/fast) of medulloblastoma and pilocytic astrocytoma, it takes 6+ hours for only the training set. This is infeasible.

The transfer is slow for the following reasons:

  1. A temporary file is used to store all individual tiles.
  2. All individual tiles are presented to the backbone in sequence (one at a time).

Other considerations:

  1. When building an h5 dataset for other folds consisting of the same data, the data is written again. Instead, HDF5 virtual datasets may be used.

New solution:

  1. Only one h5 file is needed. This file has the following structure:
    /case_id/img_id/data/<concatenated tiles> and /case_id/img_id/_other_/<concatenated metadata>
  2. Tiles are presented to the backbone in batches, allowing for faster output computation.
  3. All feature vectors calculated from tiles are stored only once. Virtual datasets fetch the appropriate tiles for particular train/test/val folds.

How?:

  • PMCHHGDataset's attribute dlup_dataset is a ConcatDataset, which has the datasets attribute. Every dataset in PMCHHGDataset.dlup_dataset.datasets contains tiles belonging to one image. All tiles from one image can be presented to the backbone as a batch. The model computes embeddings, the embeddings are concatenated, and the result is stored to an h5 dataset at /case_id/img_id/data/<array>, with metadata in similar datasets (/case_id/img_id/_other_/<concatenated metadata>). See the sketch after this list.
  • When reading datasets, make use of h5py Virtual Datasets. Read img_id from the fold file and use that to create a virtual dataset behaving as if it were a standalone dataset. Filter self.dataset_indices with the case_id/img_ids belonging to the fold that the dataset belongs to.
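
A minimal sketch of the batched writing step (assuming a PyTorch backbone; the case_id/img_id attributes and the tile tensors yielded by each per-image dataset are assumptions):

```python
# Sketch: compile embeddings per image in batches into one h5 file with
# the /case_id/img_id/data layout described above. The backbone, the
# per-image datasets, and their case_id/img_id attributes are assumed.
import h5py
import torch
from torch.utils.data import DataLoader

@torch.no_grad()
def compile_features(per_image_datasets, backbone, out_path, batch_size=64):
    backbone.eval()
    with h5py.File(out_path, "w") as f:
        for ds in per_image_datasets:  # one dataset per image
            embeddings = []
            for tiles in DataLoader(ds, batch_size=batch_size):
                embeddings.append(backbone(tiles))  # batched forward pass
            arr = torch.cat(embeddings).cpu().numpy()
            group = f.require_group(f"{ds.case_id}/{ds.img_id}")
            group.create_dataset("data", data=arr)
            # metadata would go under f"{ds.case_id}/{ds.img_id}/_other_"
```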

Limitations:
As long as processes only read from the output h5 file, there is no problem.

siemdejong changed the title from "Parallel h5py writing" to "h5py file writing is slow" Mar 14, 2023
siemdejong added a commit that referenced this issue Mar 14, 2023
Fixes part of #16, namely the writing part.
@siemdejong
Owner Author

siemdejong commented Mar 20, 2023

Although the feature compilation has been sped up, compiling features on the CPU still takes approximately 2 hours for all images.

@siemdejong
Owner Author

Concerning reading from the file: _H5ls can list all datasets in the compiled file. If self.dataset_indices is filtered to only the case_id/img_ids that are given to PMCHHGH5Dataset via the paths_and_targets keyword (pointing to the same label files as for the tile dataset), no HDF5 virtual dataset is needed.
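
A sketch of such a listing helper (the actual _H5ls implementation may differ; the fold keys below are hypothetical):

```python
# Sketch: list every dataset path in the compiled file, similar in
# spirit to _H5ls, then filter to the case_id/img_ids of one fold.
import h5py

def list_datasets(path):
    names = []
    def visit(name, obj):
        if isinstance(obj, h5py.Dataset):
            names.append(name)  # e.g. "case_id/img_id/data"
    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return names

# Hypothetical fold filtering by case_id/img_id prefix:
fold_keys = {"case1/img3", "case2/img1"}
kept = [n for n in list_datasets("features.h5")
        if n.rsplit("/", 1)[0] in fold_keys]
```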

@siemdejong
Owner Author

Using the GPU, it now takes 20 minutes.

@siemdejong
Owner Author

Compiling each image's features to its own dataset (so one hdf5 file per image) and later creating a virtual dataset that combines the separate hdf5 files will allow for concurrent writing, as sketched below.

Multiple instances of the model fit on one GPU. Every model instance can concurrently calculate embeddings per image and store the embeddings of one image in one hdf5 file before moving on to the next.
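
A sketch of stitching the per-image files together afterwards (the "embeddings" dataset name, file names, and the embedding dimension are assumptions):

```python
# Sketch: combine per-image hdf5 files into one virtual dataset, assuming
# each file stores an (n_tiles, dim) array at "embeddings".
import h5py

dim = 512  # embedding dimension (assumption)
sources = [("img_001.h5", 120), ("img_002.h5", 80)]  # (path, n_tiles)

layout = h5py.VirtualLayout(shape=(sum(n for _, n in sources), dim),
                            dtype="f4")
offset = 0
for path, n_tiles in sources:
    layout[offset:offset + n_tiles] = h5py.VirtualSource(
        path, "embeddings", shape=(n_tiles, dim))
    offset += n_tiles

with h5py.File("all_features.h5", "w") as f:
    f.create_virtual_dataset("embeddings", layout)
```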
