Disk space - IO vs compute #333

Open

jonwright opened this issue Oct 11, 2024 · 0 comments
We appear to be using up a lot of disk space without much benefit (e.g. the IO is slower than the compute). It might be a glitch on this specific experiment.

For a real example (hc5590), all the raw pixels are only 2 GB compressed:

2.0G  LMGO_BT6_01_slice_01_sparse.h5

The peaks table is saved without any compression:

5.0G LMGO_BT6_01_slice_01_peaks_table.h5
 h5ls LMGO_BT6_01_slice_01_peaks_table.h5/pks2d
glabel                   Dataset {111725522}
ipk                      Dataset {542}
npk                      Dataset {541, 3}
pk_props                 Dataset {5, 111725522}

This can be repacked into a 1 GB file using gzip:1 on glabel and pk_props. The problem is that the 1 GB file loads in 18 seconds versus 3 seconds for the uncompressed one (an HDF5 limitation: decompression is single-threaded). Regenerating it from the sparse pixels took about 1 minute on a 20-core machine, so this file is worth keeping.
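For reference, a repack along those lines can be done with plain h5py (a minimal sketch, not the ImageD11 code; it assumes the dataset names from the h5ls output above and an illustrative output filename):

import h5py

SRC = "LMGO_BT6_01_slice_01_peaks_table.h5"
DST = "peaks_table_repacked.h5"          # illustrative output name

with h5py.File(SRC, "r") as src, h5py.File(DST, "w") as dst:
    g = dst.create_group("pks2d")
    # gzip level 1 on the two big arrays
    for name in ("glabel", "pk_props"):
        g.create_dataset(name, data=src["pks2d"][name][()],
                         compression="gzip", compression_opts=1, chunks=True)
    # the small index arrays are copied uncompressed
    for name in ("ipk", "npk"):
        g.create_dataset(name, data=src["pks2d"][name][()])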

The 4D merged peaks have ~9 spare columns out of 19 (~50%). It seems risky to write the computed columns, as they should change with later parameter updates.

4.8G  LMGO_BT6_01_slice_01_peaks_4d.h5
     h5ls LMGO_BT6_01_slice_01_peaks_4d.h5/peaks
     Dataset {33205383}
Needed/data = Number_of_pixels, dty, f_raw, fc, npk2d, omega, s_raw, sc, spot3d_id, sum_intensity
Computed/result = ds, eta, gx, gy, gz, tth, xl, yl, zl
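Writing only the measured columns could look something like this (a rough sketch with plain h5py, assuming a dict of column name -> array such as the one spat(p4d) returns in the cell below; the function and file layout are made up for illustration):

import h5py
import numpy as np

NEEDED = ("Number_of_pixels", "dty", "f_raw", "fc", "npk2d",
          "omega", "s_raw", "sc", "spot3d_id", "sum_intensity")

def write_needed_columns(cols, fname, gzip_level=1):
    # cols: dict of column name -> 1D numpy array (e.g. the output of spat(p4d))
    with h5py.File(fname, "w") as h:
        g = h.create_group("peaks")
        for name in NEEDED:
            g.create_dataset(name, data=np.asarray(cols[name]),
                             compression="gzip", compression_opts=gzip_level,
                             chunks=True)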

Reading the uncompressed columnfile takes 6 seconds versus recomputing it from the pks_table:

%%time
# p0 is the pks_table, ds the dataset object from the earlier processing
p4d = p0.pk2dmerge( ds.omega, ds.dty )
spat = ImageD11.blobcorrector.eiger_spatial(dxfile=ds.e2dxfile, dyfile=ds.e2dyfile)
cf_4new  = ImageD11.columnfile.colfile_from_dict( spat(p4d) )
cf_4new.parameters.loadparameters('LMGO_small_cubic.par')
cf_4new.updateGeometry()
CPU times: user 7.24 s, sys: 1.9 s, total: 9.15 s
Wall time: 4.38 s

The 2D unmerged peaks also have ~9 spare columns out of 19 (~50%) that are not needed. The file is also uncompressed and takes 16 seconds to read:

15G Oct 11 15:00 LMGO_BT6_01_slice_01_peaks_2d.h5
Needed (9): Number_of_pixels dty f_raw fc omega s_raw sc spot3d_id sum_intensity
Results (9): ds eta gx gy gz tth xl yl zl

These can be regenerated from the pks_table faster than they can be read (although the time to read the pks_table, currently 3 s, should be added on):

%%time
# same as above, but keeping the 2D peaks without the merge to 4D
cf2d = p0.pk2d( ds.omega, ds.dty )
spat = ImageD11.blobcorrector.eiger_spatial(dxfile=ds.e2dxfile, dyfile=ds.e2dyfile)
cf_new  = ImageD11.columnfile.colfile_from_dict( spat(cf2d) )
cf_new.parameters.loadparameters('LMGO_small_cubic.par')
cf_new.updateGeometry()

CPU times: user 17.3 s, sys: 5.5 s, total: 22.8 s
Wall time: 3.2 s

In total, we are writing 25 GB of data that were derived from 2 GB of pixels. To be checked/verified:

  • how fast is IO compared to computation in general? In a sane universe, we should be able to compute peak properties faster than reading them over the network.
  • is there a fast HDF5 compression plugin that is better suited for columnfiles? Perhaps blosc (a quick test is sketched below).
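A rough way to check both points (a sketch only; it assumes the hdf5plugin package for Blosc, which is not currently used here, and a random array as a stand-in for pk_props):

import time
import h5py
import hdf5plugin
import numpy as np

data = np.random.random((5, 1_000_000))      # stand-in for a pk_props-like array

# write with blosc/lz4 compression via the hdf5plugin filter
with h5py.File("blosc_test.h5", "w") as h:
    h.create_dataset("pk_props", data=data, chunks=True,
                     **hdf5plugin.Blosc(cname="lz4", clevel=5,
                                        shuffle=hdf5plugin.Blosc.SHUFFLE))

# time a full read back, to compare against the recompute timings above
t0 = time.time()
with h5py.File("blosc_test.h5", "r") as h:
    _ = h["pk_props"][()]
print("read back in %.2f s" % (time.time() - t0))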