-
Notifications
You must be signed in to change notification settings - Fork 241
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Imaris Reader: support for LZ4 compression and performance improvements #4249
base: develop
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @marcobitplane, this definitely looks like a reasonable approach.
Two higher-level questions before we proceed with more thorough review and testing:
lz4-java
is added as a dependency, but the existing dependency on aircompressor via ome-codecs should already support LZ4. Is there a reason to uselz4-java
specifically, or couldaircompressor
be used instead?- The new
LZ4Filter
uses theucar.nc2.filter
package; if at all possible, I'd prefer to see this in the existing Bio-Formats package structure (loci.formats.*
) rather thanucar.nc2.filter
. A brief read of https://docs.unidata.ucar.edu/netcdf-java/5.5/userguide/reading_zarr.html#implementing-a-filter suggests the package name of the custom filter itself doesn't matter, as long as it extends/implements the correct class/interface. Could you please either change the package name, or explain a bit why puttingLZ4Filter
inucar.nc2.filter
is necessary?
Finally, in order to merge this we would need a signed Contributor License Agreement. That's not urgent at this point, but it would be good to have sooner rather than later.
Thank you @melissalinkert! I modified the PR to use aircompressor instead of lz4-java and moved LZ4Filter to loci/formats/filter, let me know if that is not ideal or there's more to change there. I'll send the CLA today. |
Adding to tonight's build, so we should know in the morning if there are any test failures on existing data. @marcobitplane : maybe I missed this, but I don't see here or in #4217 a reference to lz4 test files. Do you have any Imaris HDF files with lz4 compression that we can use to test this? |
Thank you @melissalinkert, here are two lz4-compressed cropped versions of Imaris demo images (retina and CellDemoMembrane3D). Feel free to let me know if you would prefer to have more images or have them shared in a different way. |
@marcobitplane thanks for sharing some sample data. Our preferred way to collect such samples is to upload them to the Bio-Formats Zenodo community collection. For the OME team and the community, has the benefit of unambiguously assigning a license for the distribution and re-usage of these samples. In addition, it is possible to add additional samples to the upload and create new versions of the dataset as necessary during the review process. I had a quick go at testing these with the proposed changed and the Bio-Formats command-line utility. With the just released 8.0.1 command-line tools, Bio-Formats fails with
With a local build of Bio-Formats using the HEAD of this PR, I have
adding the
The other sample |
Thank you @sbesson for the clarification, I'll check with our team if it is ok to share the images on Zenodo for usage by the community. I think it should be fine since we already distribute retina.ims and CellDemoMembrane3D.ims with our free ImarisViewer, but they have been modified in the version I shared with you so I'd first like to make sure everything is fine there. Regarding the error with showinf, apologies, I have been testing the changes to the reader in Fiji (by replacing the formats-gpl jar) and have no issues there with the filter, but I indeed have the same error with the command line tool. I will investigate that and hope to understand the reason for this as soon as possible. |
The issue with showinf was a conflict between META-INF/services/ucar.nc2.filter.FilterProvider files. Bioformats_package.jar bundles many things including the NetCDF-Java library, which contains its own META-INF/services/ucar.nc2.filter.FilterProvider file providing the default filters like deflate and shuffle. Now we would also have a META-INF/services/ucar.nc2.filter.FilterProvider for the lz4 filter coming from formats-gpl, which creates a conflict since there can be only one file with the same path and name in a jar. Merging the two files seems to me a good solution here, and I read here one way to do it using the maven assembly plugin and Descriptor Handlers. I already committed a fix for the command line issue, let me know if you see problems with this approach. |
With 2c4dfbd, I still see the same error as in #4249 (comment). The
|
I tried a fresh build of Bio-Formats using the HEAD of this PR and I get:
I'm now thinking, I built Bio-Formats with maven for this test, maybe @melissalinkert your build of |
Yes, that's correct, I used |
Thank you, I committed a fix for the ant build, let me know if you encounter any other issues. The two lz4-compressed ims demo images have also been uploaded to Zenodo. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @marcobitplane. I can now run showinf
on the test files after building bioformats_package.jar
with Ant, and see images displayed without error.
I have re-included this for automated testing, together with the new public sample files uploaded to Zenodo. I don't expect any test failures at this point, but we'll know for sure tomorrow.
The few in-line comments I have are fairly minor, but should be addressed before we merge.
<groupId>io.airlift</groupId> | ||
<artifactId>aircompressor</artifactId> | ||
<version>0.27</version> | ||
</dependency> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was initially going to suggest removing this and updating ome-codecs instead, but realized we don't have a released version of ome-codecs that includes ome/ome-codecs#39.
In the context of this PR, I think I'm OK with leaving this dependency as-is. Separately, though, and before 8.1.0, we'll need to get ome-codecs released and updated so that we don't have two different versions of aircompressor.
components/formats-gpl/src/loci/formats/services/NetCDFServiceImpl.java
Outdated
Show resolved
Hide resolved
Thank you again for the help and feedback, I added a few commits to address the requested changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Latest changes look good to me, and most recent build is passing: https://merge-ci.openmicroscopy.org/jenkins/job/BIOFORMATS-test-repo/264/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @marcobitplane for sharing the samples using the Zenodo community (https://zenodo.org/records/14197622). We have copied them to our curated QA repository so that they can be integrated in the nightly CI repository tests. As indicated by @melissalinkert above, these have been passing for the last execution under all Java LTS versions.
Thanks for the changes to the Ant build infrastructure. I don't know off-hand of a better solution that to manually unzip and combine as you have done. Confirmed that the new filter is now included in the JAR bundle created either by Maven or Ant
sbesson@Sebastiens-MacBook-Pro-3 bioformats % unzip -qc artifacts/bioformats_package.jar META-INF/services/ucar.nc2.filter.FilterProvider
ucar.nc2.filter.Blosc$Provider
ucar.nc2.filter.Deflate$Provider
ucar.nc2.filter.Checksum32$Fletcher32Provider
ucar.nc2.filter.Checksum32$Adler32Provider
ucar.nc2.filter.Checksum32$CRC32Provider
ucar.nc2.filter.ScaleOffset$Provider
ucar.nc2.filter.Shuffle$Provider
loci.formats.filter.LZ4Filter$LZ4FilterProvider
sbesson@Sebastiens-MacBook-Pro-3 bioformats % unzip -qc components/bundles/bioformats_package/target/bioformats_package-8.1.0-SNAPSHOT.jar META-INF/services/ucar.nc2.filter.FilterProvider
loci.formats.filter.LZ4Filter$LZ4FilterProvider
ucar.nc2.filter.Blosc$Provider
ucar.nc2.filter.Deflate$Provider
ucar.nc2.filter.Checksum32$Fletcher32Provider
ucar.nc2.filter.Checksum32$Adler32Provider
ucar.nc2.filter.Checksum32$CRC32Provider
ucar.nc2.filter.ScaleOffset$Provider
ucar.nc2.filter.Shuffle$Provider
On the functional testing of the LZ4 compression, I confirm that the two sample LZ4 compressed Imaris files cannot be opened with Bio-Formats 8.0.1 with a RuntimeException
due to an unknown filter. With this PR included both samples can be initialized using the command-line utility and the binary data can be read and compared to the ground truth as shown in Imaris Viewer.
Do you have specific samples and testing scenarios that would allow us to test the performance/caching improvements?
int[] idcs = new int[] {blockNumber * blockSizeZPerResolution[resolutionIndex], 0, 0}; | ||
try { | ||
String path; | ||
for (int ch = 0; ch < getSizeC(); ch++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I understand the rationale for caching an entire block for a given channel given the way the data is stored, my understanding is that different channels are stored separately.
While there are scenarios where preloading all channels makes sense, would this add some unnecessary overhead in some others e.g. when only one channel is requested or in the case of a large number of channels where only a few are rendered?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Different channels are indeed stored separately.
If, when reading an image, planes were requested sequentially from a single channel (therefore moving along z before moving along channels) then the buffer could simply hold a stack of planes from that single channel. However, in my testing i saw that Bio-Formats (or Fiji?) requests the same image plane for all channels before moving to the next plane (moving along channels before moving along z). If the buffer just cached a stack from a single channel, the buffer would need to be updated every time, making it useless.
Let me know if I misunderstood your comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The order in which planes are being retrieved including the index of the channels to be retrieved is fundamentally the decision of the caller. In the case of Fiji, I assume a single reader will indeed iterate through all the active channels for the given timepoint/Z-section before moving onto the next ones.
Other applications might make different decisions or even use concurrent requests in which case the benefit of caching all the channels might be limited. We will need to assess the cost/benefit of the caching logic across these scenarios as part of the testing.
Thank you @sbesson! With any 3D dataset you should be able to see a significant performance boost, for example an image of a few GBs should be read in a few seconds instead of a few minutes. How much faster depends on the chunk z size. For smaller images it might not be as obvious, I can look to provide an example of a larger ims file if you don't have one in the QA repository, just let me know. |
If you could share some sample IMS files that you have used for validating the cache improvements, it would be really useful to review these changes. |
I uploaded an example dataset to Zenodo (I created a new submission and set the access to restricted since this is not one of our demo images and I would again have to ensure it can be uploaded to a public repository). It is a typical 200MB, 3D image with two channels which should be read about 10 times faster with the caching mechanism. |
The new dataset on Zenodo is now open access. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The latest uploaded sample image, now available at https://zenodo.org/records/14235726, was used for testing the caching improvements. Measurements were made using the command-line tools using the different options available for reading different ranges of data.
Plane read times were compared between Bio-Formats built with this PR and Bio-Formats 8.0.1 using the average/standard deviation computed from three consecutive executions of the showinf
command. In all scenarios, the reader setId
time was unmodified by this PR. The output of the command-line utility e.g. 2.825s elapsed (14.413265ms per plane)
was used as the source of the times.
Bio-Formats 8.0.1 | This PR | |
---|---|---|
all planes / first resolution (no options) | 17.74s +/- 0.21s | 3.41s +/- 0.08s |
1/4 of all planes / first resolution (-crop 512,512,512,512) | 6.95s +/ 0.02s | 2.84s +/- 0.03s |
first plane / first resolution (-range 0 0) | 0.289s +/- 0.003s | 0.172s +/- 0.001s |
first plane / second resolution (-noflat -resolution 1 -range 0 0) | 0.200s +/- 0.001s | 0.195s +/- 0.003s |
51st plane / first resolution (-range 50 50) | 0.264s +/- 0.001s | 0.341s +/- 0.007s |
101st plane / first resolution (-range 100 100) | 0.299s +/- 0.003s | 0.182s +/- 0.004s |
last plane / first resolution (-range 195 195) | 0.221s +/- 0.005s | 0.279s +/- 0.001s |
As shown in the two first rows, the benefit of the caching is most visible in the scenario where multiple / all planes are being read using the same initialized reader. In the most advantageous scenario, each block will only be read once and the internal cache re-used for all subsequent reads. In practice, this should be the behavior of the Fiji/ImageJ Bio-Formats plugin when using the Hyperstack
view without the Use virtual stack
option selected since the stack order is XYCZT
.
For individual planes, the read times with this PR included are also consistent with the measurements using the latest release of Bio-Formats taking ~200ms per plane. It is worth noting the table above show an increase in the read times variance with this PR included. Although this might result from the different way the underlying netcdf
library is used, my proposal would be to follow-up with a more systematic measurement of the openBytes
time for every plane index of the sample dataset.
Hello,
this pull request adds support for LZ4 compressed ims files and modifies ImarisHDFReader to avoid multiple reads of the same 3D chunks. See also #4217.
LZ4 support is added using NetCDF-Java's ucar.nc2.filter package, which provides a mechanism for user-supplied filters as described here.
Regarding performance, ImarisHDFReader is modified to have a caching mechanism that reads a stack of planes (as many planes as the chunk z-size) from all channels into a buffer, which only needs to be updated after all data in it has been read. If the size of the buffer would exceed 1GB, the reader falls back to reading the requested plane only. The exact performance improvement will depend on the details of the dataset: in our testing, for 3D datasets the new reader can be over an order of magnitude faster than the existing implementation.