
h5py file writing is slow #16

Open
siemdejong opened this issue Mar 2, 2023 · 6 comments
Labels
enhancement New feature or request

Comments

@siemdejong
Owner

Is your feature request related to a problem? Please describe.
Writing to the hdf5 file takes very long: about an hour for a fold.

Describe the solution you'd like
Concurrent file writing.

Describe alternatives you've considered
HDF5 for Python (h5py) allows for concurrent writing via MPI [1].

Additional context
[1] https://docs.h5py.org/en/stable/mpi.html
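
A minimal sketch of parallel writing with h5py (assuming h5py is built with MPI support and mpi4py is installed; file and dataset names are arbitrary):

```python
# Run with e.g. `mpiexec -n 4 python parallel_write.py`.
from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# The mpio driver lets all ranks write to the same file collectively.
with h5py.File("parallel.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("test", (comm.Get_size(),), dtype="i")
    dset[rank] = rank  # each rank writes its own element
```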

siemdejong added the enhancement (New feature or request) label Mar 2, 2023
@siemdejong
Owner Author

MPI must be available for parallel h5py writing.

@siemdejong
Owner Author

siemdejong commented Mar 14, 2023

If a fold contains all images (slow/fast) of medulloblastoma and pilocytic astrocytoma, it takes 6+ hours for only the training set. This is infeasible.

The transfer is slow for the following reasons:

  1. A temporary file is used to store all individual tiles.
  2. All individual tiles are presented to the backbone in sequence (one at a time).

Other considerations:

  1. When building an h5 dataset for other folds consisting of the same data, the data is written again. Instead, HDF5 virtual datasets may be used.

New solution:

  1. Only one h5 file is needed. This file has the following structure:
    /case_id/img_id/data/<concatenated tiles> and /case_id/img_id/_other_/<concatenated metadata>
  2. Tiles are presented to the backbone in batches, allowing for faster output computation.
  3. All feature vectors calculated from tiles are stored only once. Virtual datasets fetch the appropriate tiles for particular train/test/val folds.

How?:

  • PMCHHGDataset's attribute dlup_dataset is a ConcatDataset, which has the datasets attribute. Every dataset in PMCHHGDataset.dlup_dataset.datasets contains tiles belonging to one image. All tiles from one image can be presented to the backbone as a batch. The model computes embeddings, the embeddings are concatenated, and the result is stored to an h5 dataset at /case_id/img_id/data/<array>, with metadata in similar datasets (/case_id/img_id/_other_/<concatenated metadata>). See the sketch after this list.
  • When reading datasets, make use of h5py Virtual Datasets. Read img_id from the fold file and use that to create a virtual dataset behaving as if it were a standalone dataset. Filter self.dataset_indices with the case_id/img_ids belonging to the fold that the dataset belongs to.
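
A minimal sketch of the batched writing step (assuming a PyTorch backbone; the case_id/img_id attributes and the tile tensors yielded by each per-image dataset are assumptions):

```python
# Sketch: compile embeddings per image in batches into one h5 file with
# the /case_id/img_id/data layout described above. The backbone, the
# per-image datasets, and their case_id/img_id attributes are assumed.
import h5py
import torch
from torch.utils.data import DataLoader

@torch.no_grad()
def compile_features(per_image_datasets, backbone, out_path, batch_size=64):
    backbone.eval()
    with h5py.File(out_path, "w") as f:
        for ds in per_image_datasets:  # one dataset per image
            embeddings = []
            for tiles in DataLoader(ds, batch_size=batch_size):
                embeddings.append(backbone(tiles))  # batched forward pass
            arr = torch.cat(embeddings).cpu().numpy()
            group = f.require_group(f"{ds.case_id}/{ds.img_id}")
            group.create_dataset("data", data=arr)
            # metadata would go under f"{ds.case_id}/{ds.img_id}/_other_"
```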

Limitations:
As long as processes only read from the output h5 file, there is no problem.

siemdejong changed the title from "Parallel h5py writing" to "h5py file writing is slow" Mar 14, 2023
siemdejong added a commit that referenced this issue Mar 14, 2023
Fixes part of #16, namely the writing part.
@siemdejong
Owner Author

siemdejong commented Mar 20, 2023

Although the feature compilation has been sped up, compiling features on the CPU still takes approximately 2 hours for all images.

@siemdejong
Owner Author

Concerning reading from the file: _H5ls can list all datasets in the compiled file. If self.dataset_indices is filtered to only the case_id/img_ids that are given to PMCHHGH5Dataset via the paths_and_targets keyword (pointing to the same label files as for the tile dataset), no HDF5 virtual dataset is needed.
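
A sketch of such a listing helper (the actual _H5ls implementation may differ; the fold keys below are hypothetical):

```python
# Sketch: list every dataset path in the compiled file, similar in
# spirit to _H5ls, then filter to the case_id/img_ids of one fold.
import h5py

def list_datasets(path):
    names = []
    def visit(name, obj):
        if isinstance(obj, h5py.Dataset):
            names.append(name)  # e.g. "case_id/img_id/data"
    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return names

# Hypothetical fold filtering by case_id/img_id prefix:
fold_keys = {"case1/img3", "case2/img1"}
kept = [n for n in list_datasets("features.h5")
        if n.rsplit("/", 1)[0] in fold_keys]
```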

@siemdejong
Owner Author

Using the GPU, it now takes 20 minutes.

@siemdejong
Owner Author

Compiling each image's features to its own dataset (so one hdf5 file per image) and later creating a virtual dataset that combines the separate hdf5 files will allow for concurrent writing, as sketched below.

Multiple instances of the model fit on one GPU. Every model instance can concurrently calculate embeddings per image and store the embeddings of one image in one hdf5 file before moving on to the next.
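
A sketch of stitching the per-image files together afterwards (the "embeddings" dataset name, file names, and the embedding dimension are assumptions):

```python
# Sketch: combine per-image hdf5 files into one virtual dataset, assuming
# each file stores an (n_tiles, dim) array at "embeddings".
import h5py

dim = 512  # embedding dimension (assumption)
sources = [("img_001.h5", 120), ("img_002.h5", 80)]  # (path, n_tiles)

layout = h5py.VirtualLayout(shape=(sum(n for _, n in sources), dim),
                            dtype="f4")
offset = 0
for path, n_tiles in sources:
    layout[offset:offset + n_tiles] = h5py.VirtualSource(
        path, "embeddings", shape=(n_tiles, dim))
    offset += n_tiles

with h5py.File("all_features.h5", "w") as f:
    f.create_virtual_dataset("embeddings", layout)
```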
