The HDF5 library (libhdf5) includes a read-only virtual file driver for the AWS Simple Storage Service (S3) or any S3-compatible storage system. The driver, called Read-Only S3 (ROS3), can be used to read data from an HDF5 file stored as an S3 object. Since libhdf5 version 1.14.1, the driver can log various information about its S3 operations. This information can be very helpful when deciding which HDF5 file or library features to use for better data access performance.
This repository contains a simple dashboard for the driver's log data about S3 (HTTP range GET) requests. These requests represent individual libhdf5 data read operations and directly affect performance. The dashboard is implemented as a Panel web app in a Jupyter notebook. It takes a ROS3 log file and displays statistics and two plots describing the HTTP requests performed to read the data. Only log files up to 10 megabytes are accepted due to a current Panel limitation. The easiest way to try the dashboard is via the Binder service links above.
Producing ROS3 logs requires building libhdf5 with the ROS3 driver from source because it is not currently possible to enable this logging any other way. Download the libhdf5 source from its GitHub repository; using the latest release is strongly recommended.
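A minimal build sketch, assuming an Autotools-based build of an unpacked libhdf5 source tree; the directory name and install prefix are placeholders to adjust for your environment (CMake builds use -DHDF5_ENABLE_ROS3_VFD=ON instead of the configure flag):

```shell
# Configure and build libhdf5 with the ROS3 virtual file driver enabled.
# The ROS3 driver needs libcurl and OpenSSL development packages installed.
cd hdf5-1.14.4                 # unpacked libhdf5 source directory (placeholder)
./configure --prefix="$HOME/.local/hdf5" --enable-ros3-vfd
make -j4
make install
```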
Enabling ROS3 logging requires modifying a single line of the H5FDs3comms.c file. For libhdf5 versions before 1.14.4, change

#define S3COMMS_DEBUG 0

to

#define S3COMMS_DEBUG 1

For versions 1.14.4 and later, change

#define S3COMMS_DEBUG 0

to

#define S3COMMS_DEBUG 4

After saving the change, build the library with the ROS3 driver according to the instructions. Every time the ROS3 driver fulfills a read operation, the logging information is printed to stdout. Redirect stdout to a file and you have something to upload to the dashboard.
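The one-line edit can also be scripted; a sketch with sed, assuming the file sits at src/H5FDs3comms.c in the source tree and targeting the level-4 setting for libhdf5 1.14.4 and later:

```shell
# Switch S3COMMS_DEBUG from 0 to 4 in place
# (use 1 instead of 4 for libhdf5 versions before 1.14.4).
sed -i 's/#define S3COMMS_DEBUG 0/#define S3COMMS_DEBUG 4/' src/H5FDs3comms.c
```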
Three ROS3 driver log files for the same data processing task are in this repository:
The original log file contains S3 requests for an HDF5 file created with default (typical) settings. The optimized log file shows the effects of applying the paged aggregation file space strategy to create a copy of the original file and then using the library's page buffering when reading data. Note the significantly reduced number of S3 requests in the optimized log file, which directly translates into much faster data access. The libhdf5-1.14.3 optimized log file shows the new feature introduced in libhdf5 version 1.14.3, where the ROS3 driver reads and caches the first 16 MB of the file on open. This helps to reduce S3 requests even further for certain use cases. Download these files if you want to use the dashboard without collecting ROS3 logs yourself.
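The paged-aggregation copy described above can be produced with the h5repack tool shipped with libhdf5; a sketch, with the file names and the 8 MiB page size as illustrative assumptions:

```shell
# Rewrite the file with the PAGE file space strategy (-S) and an
# 8 MiB file space page size (-G, in bytes).
h5repack -S PAGE -G 8388608 original.h5 optimized.h5
```

When reading the repacked file, enabling the library's page buffer (H5Pset_page_buffer_size in C, or h5py's page_buf_size file option) is what actually reduces the request count.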
Another way to access HDF5 data in the cloud is through fsspec, a Python package that emulates many POSIX file operations on remote files. fsspec logs can be saved and analyzed with the dashboard as well.
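To collect such a log, one option is to route the debug messages of the fsspec and s3fs loggers to a file using only the standard library; a sketch, where the logger names are assumptions based on those packages' logging setup and the log file name is a placeholder:

```python
import logging

# Send DEBUG-level records from the "fsspec" and "s3fs" loggers, which
# include the 'read: start - end' messages, to a file named fsspec.log.
handler = logging.FileHandler("fsspec.log")
handler.setFormatter(logging.Formatter("%(message)s"))
for name in ("fsspec", "s3fs"):
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.addHandler(handler)
```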
A log from fsspec looks like:
<File-like object S3FileSystem, URL> read: 0 - 8
<File-like object S3FileSystem, URL> read: 8 - 16
...
To make the log compatible with the dashboard's reader, the file size must be appended to the log:
<File-like object S3FileSystem, URL> read: 0 - 8
<File-like object S3FileSystem, URL> read: 8 - 16
<File-like object S3FileSystem, URL> read: 16 - 32
...
FileSize: 736000000
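Appending that line can be automated; a minimal sketch, where inject_file_size is a hypothetical helper (the size itself can be obtained from fsspec, e.g. via the filesystem's size method):

```python
def inject_file_size(log_path: str, file_size: int) -> None:
    """Append the 'FileSize:' line the dashboard's reader expects
    to an fsspec log file (helper name is hypothetical)."""
    with open(log_path, "a") as f:
        f.write(f"FileSize: {file_size}\n")


# Example: inject_file_size("fsspec.log", 736000000)
```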
Note: A caveat with fsspec logs is that they do not distinguish cache hits from actual S3 requests, so the number of real requests is likely lower than the total reported.
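The kind of per-request statistics the dashboard derives can be sketched from such a log in a few lines of Python; request_stats is a hypothetical helper that assumes the end offset is exclusive, as the consecutive 0 - 8 and 8 - 16 entries above suggest:

```python
import re

# Matches the byte range in fsspec 'read: start - end' log lines.
_READ = re.compile(r"read: (\d+) - (\d+)")


def request_stats(lines):
    """Return (number of read requests, total bytes requested)."""
    spans = [(int(m[1]), int(m[2])) for m in map(_READ.search, lines) if m]
    return len(spans), sum(end - start for start, end in spans)
```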
The recommended way to run the dashboard is with either the conda or mamba package manager. This repository includes a configuration file for installing all the required Python packages:
conda env create --file environment.yml --name VENV_NAME
or
mamba env create --file environment.yml --name VENV_NAME
The dashboard can be run as a typical Jupyter notebook, or as a standalone app in a browser with this command:
panel serve ros3vfd-log-info.ipynb --show