
-p/--threads option always uses 1 more thread than is specified. #465

Open
leowill01 opened this issue Feb 15, 2024 · 9 comments

leowill01 commented Feb 15, 2024

When I run bowtie2 and specify the number of threads to use (e.g. -p 1, -p 4), CPU usage always shows one more core in use than the number of threads specified.
For example, if I specify -p 1:

bowtie2 --no-unal -p 1 \
-x 'ref-genome-idx' \
-1 reads-R1_001.fastq.gz \
-2 reads-R2_001.fastq.gz

the CPU usage (via htop) for the process will be at ~200%, indicating use of 2 threads, not 1:

 CPU% Command
  2.8 └─ /usr/bin/perl5.34 /opt/homebrew/bin/bowtie2 --no-unal -p 1 -x PAO1-- ...
196.0 	└─ /opt/homebrew/bin/../Cellar/bowtie2/2.5.2/bin/bowtie2-align-s --w ...

Likewise, for any other number of threads, it always uses n+1 CPU cores.
For -p 4:

bowtie2 --no-unal -p 4 \
-x 'ref-genome-idx' \
-1 reads-R1_001.fastq.gz \
-2 reads-R2_001.fastq.gz

CPU usage (via htop):

 CPU% Command
  2.4 └─ /usr/bin/perl5.34 /opt/homebrew/bin/bowtie2 --no-unal -p 4 -x
445.4 	└─ /opt/homebrew/bin/../Cellar/bowtie2/2.5.2/bin/bowtie2-align

(between 400-500% core usage because I have other jobs running; on a bare system it would be ~500%, indicating use of 5 CPU cores)

This is effectively cutting my sample throughput in half: when bowtie2 is run with 1 thread specified, every job actually occupies 2 cores, so only half as many jobs fit on the machine.
Is this a bug or a feature? Am I missing something?

Tested on:

  • macos sonoma 14.2.1
  • mac studio m2 ultra
  • bowtie2 version:
bowtie2 --version
/opt/homebrew/bin/../Cellar/bowtie2/2.5.2/bin/bowtie2-align-s version 2.5.2
64-bit
Built on Ventura-arm64.local
Sat Oct 14 18:03:18 UTC 2023
Compiler: InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
Options: -O3  -funroll-loops -g3 -std=c++11 -fopenmp-simd -DNO_SPINLOCK -DWITH_QUEUELOCK=1
Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8}
ch4rr0 (Collaborator) commented Feb 27, 2024

Hello,

As of bowtie2 v2.5.0 we moved to an async model (a dedicated thread) for processing I/O. The --threads flag specifies the number of computational (alignment) threads that bowtie2 will use, hence the total thread count will always be one more than the default or the value specified.
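
For anyone puzzled by where the extra thread comes from, here is a minimal self-contained sketch of this producer/consumer layout (plain C++, not bowtie2's actual source; ReadBatch, the batch count, and the printf are stand-ins for real FASTQ parsing and alignment work):

#include <cstdio>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct ReadBatch { int id; };                   // stand-in for a chunk of FASTQ records

std::queue<ReadBatch> q;
std::mutex m;
std::condition_variable cv;
bool done = false;

void io_thread(int total_batches) {             // the "+1" thread: does only input parsing
    for (int i = 0; i < total_batches; ++i) {
        { std::lock_guard<std::mutex> lk(m); q.push(ReadBatch{i}); }
        cv.notify_one();
    }
    { std::lock_guard<std::mutex> lk(m); done = true; }
    cv.notify_all();
}

void align_thread() {                           // one of the -p alignment threads
    for (;;) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return !q.empty() || done; });
        if (q.empty()) return;                  // input exhausted
        ReadBatch b = q.front();
        q.pop();
        lk.unlock();
        std::printf("aligning batch %d\n", b.id);   // alignment work would happen here
    }
}

int main() {
    const int p = 4;                            // the value passed as -p
    std::thread reader(io_thread, 32);
    std::vector<std::thread> workers;
    for (int i = 0; i < p; ++i) workers.emplace_back(align_thread);
    reader.join();
    for (auto& w : workers) w.join();           // p alignment threads + 1 I/O thread
}

With -p N this layout always occupies N alignment threads plus the one reader, which matches the N+1 cores observed in htop.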

sfiligoi (Contributor) commented Sep 11, 2024

@ch4rr0 The extra CPU core's worth of usage was added in commit 6f6458c,
which moved from using a condition variable to pure polling:

bowtie2/pat.h, line 1344 in 4cf8d52:

while (!q_.try_dequeue(ret)) ;
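
For context, here is a rough illustration of the difference between the two wait strategies (illustrative only, not the actual bowtie2 code; a mutex-guarded std::queue stands in for moodycamel's lock-free queue, the names are hypothetical, and end-of-input handling is omitted). The polling consumer keeps a core at ~100% even when the producer has nothing ready, while the condition-variable consumer sleeps until notified:

#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>

template <typename T>
class WorkQueue {
public:
    void push(T v) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(v)); }
        cv_.notify_one();
    }

    // Polling style, as introduced by 6f6458c: spin until an item shows up.
    // The consumer burns a full core even while the queue is empty.
    T dequeue_polling() {
        for (;;) {
            if (std::optional<T> v = try_dequeue()) return std::move(*v);
            // busy-wait: immediately try again
        }
    }

    // Condition-variable style, as before the commit: the consumer sleeps until
    // push() notifies it, costing essentially no CPU while idle.
    T dequeue_blocking() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        T v = std::move(q_.front());
        q_.pop();
        return v;
    }

private:
    std::optional<T> try_dequeue() {
        std::lock_guard<std::mutex> lk(m_);
        if (q_.empty()) return std::nullopt;
        T v = std::move(q_.front());
        q_.pop();
        return v;
    }

    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
};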

I benchmarked before and after the commit with -p 12: before it would use 12.2 cores, afterwards it uses 13.1.

I believe we should revert that commit.

sfiligoi (Contributor) commented:

For completeness, I had to revert 3 commits to avoid revert conflicts when testing:

    Revert "Use moodycamel concurrentqueue for better performance"
    
    This reverts commit 6f6458c5dae6ef8931c6024b1a20b5c2e275607d.
...
    Revert "Potential fix for LeakSanitizer failure"
    
    This reverts commit 84afe73740187a10749d236190e1c9056885d3d6.
...
    Revert "Second attempt at fixing LeakSanitizer failure"
    
    This reverts commit d4fb35a7b24a4c8d03b08d0e7672a27a8533fa34.

Not sure if there are any other side effects, but CPU utilization was definitely drastically reduced.

ch4rr0 (Collaborator) commented Sep 11, 2024

Hello Igor,

We put this change in place after a user reported an issue with poor thread scaling, which was due to contention on the condition variable. Why does the n+1 CPU utilization come as a surprise? As I recall it was part of the async reads pull request. moody_camel is a concurrent, lock-free queue and should not spawn any additional threads on its own.

sfiligoi (Contributor) commented Sep 11, 2024

The async reads were added as a way to increase performance with minimal overhead.
Using a whole CPU just to wait for input defeats the original purpose, especially at low thread counts.

The number of CPU cores in a node is limited, so any CPU cycles wasted on polling are CPU time that cannot be used for actual alignment work.

sfiligoi (Contributor) commented:

If condition variables were really a problem at large thread counts (can you point me to the related issue, please?),
we could use moody_camel above a certain thread count (polling is much less of a problem at high thread counts)
but fall back to the more efficient condition variables at (more typical) lower thread counts.
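
A rough sketch of what that hybrid could look like (hypothetical names and threshold, not an actual bowtie2 patch; a mutex-guarded std::queue again stands in for the lock-free queue on the polling path, and end-of-input handling is omitted). The wait strategy is chosen once, based on the -p value:

#include <condition_variable>
#include <mutex>
#include <queue>

template <typename T>
class AdaptiveQueue {
public:
    // Above this (guessed) threshold, condition-variable contention is assumed to
    // cost more than the CPU burned by polling; below it, blocking is cheaper.
    explicit AdaptiveQueue(int n_threads, int poll_threshold = 16)
        : use_polling_(n_threads >= poll_threshold) {}

    void push(T v) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(v)); }
        if (!use_polling_) cv_.notify_one();    // consumers only sleep in blocking mode
    }

    T pop() {
        if (use_polling_) {
            // high -p: poll, like the current moodycamel-based code path
            for (;;) {
                std::lock_guard<std::mutex> lk(m_);
                if (!q_.empty()) { T v = std::move(q_.front()); q_.pop(); return v; }
            }
        }
        // low -p: block on the condition variable, like the pre-6f6458c code
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        T v = std::move(q_.front());
        q_.pop();
        return v;
    }

private:
    const bool use_polling_;
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
};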

@ch4rr0 I am willing to do the coding work, if you agree with the direction.

ch4rr0 (Collaborator) commented Sep 12, 2024

This is the issue: #437.

I am grateful that you are willing to do the work on this. If it's not too much to ask, can you also provide some performance numbers for future reference?

sfiligoi (Contributor) commented Sep 12, 2024

As a starting point, here are my benchmark numbers on the kind of problems relevant to my group (QIITA).

The CPU used is an AMD EPYC 7302 16-Core Processor.
(Given the size of the problems, we almost always use server-class setups. This node has 2 CPUs; I am using the 2nd one.)

I am using the Woltka WoLr1 DB (~40GB of memory use) and the Song53_24613_host_filtered.fastq.gz input file from QIITA:
https://qiita.ucsd.edu/public/?artifact_id=101636

The command used for a single 16-core run is:

INDIR=/qmounts/qiita_data/per_sample_FASTQ/101636
INF1=Song53_24613_host_filtered.fastq.gz
time taskset -c 16-31  ./bowtie2 -p 16 -x /scratch/qp-woltka/WoLr1/WoLr1 -q ${INDIR}/${INF1},${INDIR}/${INF1},${INDIR}/${INF1},${INDIR}/${INF1} --seed 42 --very-sensitive -k 16 --np 1 --mp "1,1" --rdg "0,1" --rfg "0,1" --score-min  "L,0,-0.05" --no-head --no-unal >/dev/null

(aligning 4x the same file, so I can compare with aligning them in parallel below)

and for the 4x4-core runs I use

(time taskset -c 16-19 ./bowtie2 -p 4 -x /scratch/qp-woltka/WoLr1/WoLr1 -q ${INDIR}/${INF1} --seed 42 --very-sensitive -k 16 --np 1 --mp "1,1" --rdg "0,1" --rfg "0,1" --score-min  "L,0,-0.05" --no-head --no-unal >/dev/null)&
(time taskset -c 20-23 ./bowtie2 -p 4 -x /scratch/qp-woltka/WoLr1/WoLr1 -q ${INDIR}/${INF1} --seed 42 --very-sensitive -k 16 --np 1 --mp "1,1" --rdg "0,1" --rfg "0,1" --score-min  "L,0,-0.05" --no-head --no-unal >/dev/null)&
(time taskset -c 24-27 ./bowtie2 -p 4 -x /scratch/qp-woltka/WoLr1/WoLr1 -q ${INDIR}/${INF1} --seed 42 --very-sensitive -k 16 --np 1 --mp "1,1" --rdg "0,1" --rfg "0,1" --score-min  "L,0,-0.05" --no-head --no-unal >/dev/null)&
(time taskset -c 28-31 ./bowtie2 -p 4 -x /scratch/qp-woltka/WoLr1/WoLr1 -q ${INDIR}/${INF1} --seed 42 --very-sensitive -k 16 --np 1 --mp "1,1" --rdg "0,1" --rfg "0,1" --score-min  "L,0,-0.05" --no-head --no-unal >/dev/null)&
wait

With the current master, I get:
1x16 - 740s
2x8 - 790s
4x4 - 950s
(as expected, using -p N-1 gives almost identical timings)

If I revert the moodycamel commits I get:
1x16 - 690s
2x8 - 700s
4x4 - 740s

Two takeaways:
a) It is more effective to use a single process with 16 cores than many processes each using a subset of the cores; i.e., I don't see any slowdown at larger thread counts, quite the opposite.
b) The version without moodycamel is always faster, and the difference is most notable when -p is smaller. The original async code was indeed adding only a minor overhead, while the polling moodycamel approach slows small -p setups down drastically.

leowill01 (Author) commented:

@sfiligoi many thanks for taking the time to work on this. This would give me a throughput boost of 20-30%.
