
Benchmark run without optimization? #12

Open · h-2 opened this issue Jan 5, 2017 · 9 comments

@h-2 commented Jan 5, 2017

Hi,
this is Hannes from SeqAn. From your benchmark Makefile it looks like you are neither setting C++ optimization levels nor deactivating debug assertions:
https://github.com/walaj/SeqLib/blob/master/benchmark/Makefile
Were the published benchmark results indeed run without -O3 -DNDEBUG?
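(Editorial aside: for context, `-DNDEBUG` disables `assert`-style debug checks at compile time, and SeqAn's internal assertions are off by default when `NDEBUG` is defined. A trivial sketch:)

```cpp
#include <cassert>

// With -DNDEBUG this assert compiles away entirely; without it, every
// such check is executed at run time. Library-internal sanity checks
// (e.g. SeqAn's) behave similarly when they key off NDEBUG.
int main() {
    assert(2 + 2 == 4);  // removed by the preprocessor under -DNDEBUG
    return 0;
}
```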

If so, could you please repeat them? As the other libraries are included pre-compiled, it is unclear which options they were built with.
Other third-party benchmarks have previously had similar problems, where optimized builds of htslib were tested against unoptimized builds of SeqAn (the difference is huge!):
wilzbach/bam-perf-test@5aea37e
Since all our tests show that we are faster than htslib in almost all cases, we would of course be interested in your test data if it turns out we indeed perform notably worse.

If, on the other hand, the results are closer to what we are seeing, we would kindly ask you to also update the table in the paper.

Thank you very much,
Hannes

@walaj (Owner) commented Jan 5, 2017

Hi Hannes,

Thank you, this is an excellent point that I missed, and I agree that it should be fixed. I reran all of the tests with the flags you suggested and am getting the following results:

SeqAn with better flags: 40.6s user time
SeqAn w/o better flags: 94.3s user time
SeqLib: 13.6s user time
BamTools (core only): 21.3s user time
BamTools: 57.8s user time

So indeed there is a difference. The SeqAn API is amazing, and I think its functionality is largely complementary to SeqLib's. But for this benchmark, I don't want to compare with the wrong approach.

Does this perhaps sync more with what you are seeing?

Best,
Jeremiah

@h-2 (Author) commented Jan 5, 2017

Hi Jeremiah,

thanks for the quick reply and for rerunning the benchmark! Is this for the jumping test, the reading test, or both combined? And what does "core only" mean? Is that without decompressing the strings, and if so, which "strings" does that refer to: query name, query sequence, and/or CIGAR?

Could you provide an overview of:

  1. reading the full record
  2. reading without the strings, if that is important (SeqAn cannot do this right now)
  3. jumping with index

That would help me a lot. Thank you very much!

@walaj (Owner) commented Jan 5, 2017

Hi Hannes,

This was for the reading test. The "core" concept is from BamTools: it is the option to read in an alignment without decompressing all of the string data, for instance if one only wants the position or insert size. For SeqLib this is basically the default; most of the functionality interacts with the htslib object directly and may not need to be converted to another form (e.g. CIGAR operations). But if one wanted to use the sequence data in SeqLib, they would likely have to decompress it from the packed 4-bit encoding to a C++ string or char array, so I added this to the benchmarking table as well, to be more in line with the BamTools options.
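(Editorial aside: a minimal sketch of that distinction in BamTools, with a hypothetical file name; `GetNextAlignmentCore` and `BuildCharData` are the relevant BamTools calls.)

```cpp
#include <iostream>
#include "api/BamReader.h"  // BamTools

int main() {
    BamTools::BamReader reader;
    if (!reader.Open("example.bam"))  // hypothetical input file
        return 1;

    BamTools::BamAlignment al;
    size_t n = 0;

    // Core mode: fills the fixed-size fields (position, flag, insert size, ...)
    // but leaves read name, sequence, qualities, and tags undecoded.
    while (reader.GetNextAlignmentCore(al)) {
        ++n;
        // al.BuildCharData();  // decode the string data on demand, if needed
    }

    std::cout << n << " records\n";
    reader.Close();
    return 0;
}
```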

I do think that having the initial load be as lightweight as possible, and then allowing the user to expand the pieces as necessary, is an important component of SeqLib. For instance, tools for calculating coverage don't need the read name or sequence, just the position. For some other tools I have written, it was much more efficient to look at an alignment first and only expand the sequence at the last second, if it ended up being necessary.
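(Editorial aside: a sketch of that lazy-expansion pattern with SeqLib's reader, using method names from the SeqLib README; the file name and position cutoff are hypothetical.)

```cpp
#include <iostream>
#include "SeqLib/BamReader.h"

int main() {
    SeqLib::BamReader r;
    if (!r.Open("example.bam"))  // hypothetical input file
        return 1;

    SeqLib::BamRecord rec;
    while (r.GetNextRecord(rec)) {
        // Cheap accessors read straight from the underlying htslib struct.
        if (rec.Position() < 1000000)  // hypothetical filter on position only
            continue;
        // Only now pay for unpacking the 4-bit bases into a std::string.
        std::cout << rec.Sequence() << "\n";
    }
    return 0;
}
```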

The jumping test picks 1000 random locations and grabs a single read from each. I think the published values had SeqAn as the fastest, although the libraries didn't differ by much.
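(Editorial aside: a single jump of that kind in SeqLib might look like the following, with `GenomicRegion`/`SetRegion` as in the SeqLib README; the file name and region string are hypothetical.)

```cpp
#include "SeqLib/BamReader.h"

int main() {
    SeqLib::BamReader r;
    if (!r.Open("example.bam"))  // hypothetical indexed BAM
        return 1;

    // One jump; the benchmark repeats this for 1000 random locations.
    SeqLib::GenomicRegion gr("1:1,000,000-1,001,000", r.Header());  // hypothetical region
    if (!r.SetRegion(gr))
        return 1;

    SeqLib::BamRecord rec;
    r.GetNextRecord(rec);  // grab a single read from the region
    return 0;
}
```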

Hopefully this helps! I wonder if the string data loading is the source of the benchmarking difference.

@h-2 (Author) commented Jan 5, 2017

Hi Jeremiah,

what I meant was: could you provide the benchmark times for the individual steps I mentioned above?
So for reading the full record, SeqLib required 14s and SeqAn required 40s? That's interesting.
I assume SeqLib+htslib is different from just using htslib? Did you try benchmarking against "pure htslib" as well? [1]

Thanks for your time and your comments!

[1] Similar to https://github.com/wilzbach/bam-perf-test/
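(Editorial aside: for concreteness, a "pure htslib" read loop in the spirit of [1] might look like the following; the file name is hypothetical, and the calls are standard htslib API.)

```cpp
#include <cstdio>
#include <htslib/sam.h>

int main() {
    samFile *in = sam_open("example.bam", "r");  // hypothetical input file
    if (!in)
        return 1;

    bam_hdr_t *hdr = sam_hdr_read(in);
    bam1_t *b = bam_init1();

    long n = 0;
    while (sam_read1(in, hdr, b) >= 0)  // raw records; nothing is decoded
        ++n;

    std::printf("%ld records\n", n);

    bam_destroy1(b);
    bam_hdr_destroy(hdr);
    sam_close(in);
    return 0;
}
```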

@walaj (Owner) commented Jan 5, 2017

Latest benchmarks:

  1. SeqLib + sequence expansion: 18.0s
  2. SeqLib default (as above): 13.6s
  3. Jumping test: SeqLib 9.3s, SeqAn 13.3s
  4. htslib alone (test copied from bam-perf-test): 9.46s

This is basically in line with the published table (this machine is a little slower), except that SeqAn should be closer to ~35s, not ~80s. The wall time on the SeqAn jumping test is smaller, but it looks like it's using multiple CPUs; either way, I'm still not seeing much of a difference for the jumping test. Adding in the htslib test, there looks to be a small amount of overhead in SeqLib (shared pointers, etc.) vs. htslib, which is expected.

I'm not sure where the 17s for SeqAn in the bam-perf-test README came from; the log shows a much slower time (7m+). I guess that log file was committed before the -O3 addition. A 25x speed-up doesn't match what I'm seeing (2x). It's still an improvement over the version with non-optimal flags, but I don't see any way SeqAn could be faster than htslib.

@h-2 (Author) commented Jan 6, 2017

> but it looks like it's using multiple CPUs.

Yes, SeqAn always uses multiple threads for BAM I/O. Since (de)compression is a large part of the I/O work, this is an important feature (especially since every modern computer has multiple cores). Depending on how many cores you have, you will see more or less speed-up. The benchmark I mentioned above was apparently run on eight cores, so it may have benefited more than yours.
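(Editorial aside: for reference, the SeqAn 2.x read loop looks roughly like this, with a hypothetical file name; the threading is internal and not visible at the API level.)

```cpp
#include <seqan/bam_io.h>

int main() {
    seqan::BamFileIn bamFileIn;
    if (!open(bamFileIn, "example.bam"))  // hypothetical input file
        return 1;

    seqan::BamHeader header;
    readHeader(header, bamFileIn);

    seqan::BamAlignmentRecord record;
    while (!atEnd(bamFileIn))
        readRecord(record, bamFileIn);  // decompression runs on internal worker threads

    return 0;
}
```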

@walaj (Owner) commented Jan 6, 2017

OK, great, I think we're converging on what is going on! Thank you for the quick and courteous note about the benchmarking, and I'll reach out today to get the table fixed in the corrected proof with the right compiler flags.

I also agree that the embedded multi-threading is an important feature for many developers who want fast code without dealing with pthreads, but I still think CPU time is the right benchmark. Modern computers have multiple cores, but they are often shared systems, so the extra CPU isn't free (and a lot of work is now done on VM instances that may allocate only one core). A developer may also want their own parallelization framework covering more than BAM I/O, and could be puzzled by a subcomponent spawning threads on its own.

So both approaches are useful for different needs, since I suppose it's not possible to make something that works out of the box for everything.
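(Editorial aside: to make the user-time vs. wall-time distinction concrete, a minimal measurement sketch in standard C++; the workload is a placeholder.)

```cpp
#include <chrono>
#include <ctime>
#include <iostream>

int main() {
    std::clock_t cpu0 = std::clock();  // CPU time, summed over all threads (on POSIX)
    auto wall0 = std::chrono::steady_clock::now();

    // ... run the BAM-reading workload here (placeholder) ...

    double cpu_s = double(std::clock() - cpu0) / CLOCKS_PER_SEC;
    double wall_s = std::chrono::duration<double>(
                        std::chrono::steady_clock::now() - wall0).count();

    // A multi-threaded reader can show wall_s well below cpu_s; on a shared
    // machine or a single-core VM, cpu_s is what you actually pay for.
    std::cout << "cpu: " << cpu_s << " s, wall: " << wall_s << " s\n";
    return 0;
}
```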

@h-2 (Author) commented Jan 6, 2017

> Thank you for the quick and courteous note about the benchmarking, and I'll reach out today to get the table fixed in the corrected proof with the right compiler flags.

Ok, cool! If you get the table updated, could you also note that SeqAn decodes strings, too?

| Method | Memory (GB) | CPU (s) |
| --- | --- | --- |
| SeqLib | 3.15 | 11.77 |
| SeqLib (with string data) | 3.86 | 14.33 |
| BamTools | 6.93 | 17.52 |
| BamTools (with string data) | 13.32 | 48.74 |
| SeqAn (with string data) | 8.35 | ... |

or maybe it would be even clearer to mark those that don't read the full record:

| Method | Memory (GB) | CPU (s) |
| --- | --- | --- |
| SeqLib (without string data) | 3.15 | 11.77 |
| SeqLib | 3.86 | 14.33 |
| BamTools (without string data) | 6.93 | 17.52 |
| BamTools | 13.32 | 48.74 |
| SeqAn | 8.35 | ... |

=> otherwise people will intuitively compare SeqAn to the "wrong" table entries.
It would probably also help if you mentioned the hardware the benchmark was run on, and briefly noted that SeqAn's performance scales with the number of CPU cores; otherwise it might be difficult for others to reproduce your results.

> A developer may also want their own parallelization framework covering more than BAM I/O, and could be puzzled by a subcomponent spawning threads on its own.

Actually, we do some sophisticated probing, work balancing, and CPU yielding to prevent ever using more threads than there are free cores. It shouldn't affect anything else negatively 😃
But, yes, we also plan to make this configurable by the developer/user in the future!

Thanks for being so quick to respond and for discussing these things openly!

@walaj (Owner) commented Jan 6, 2017

Hi Hannes,

OK, sounds good; I agree about the table annotation. I've contacted them and will see what needs to be done to make the change. I'll also make a note in the README about the differences between these approaches (i.e. threading) and about the updated benchmarking, so folks can decide which API is best for their needs. My sense is that a combination of the two would actually be powerful for most projects: there is a great deal of functionality in SeqAn (it's a much bigger project) that is not in SeqLib, but the BWA / BLAT / Fermi / htslib wrappers in SeqLib are incredibly useful as well.

Best,
Jeremiah
