Issues integrating GPU-accelerated search in colabfold alignment protocol #904

clami66 · 2024-11-22T14:23:53Z

I am trying to integrate the new GPU-accelerated search in colabfold_search. From what I can see, only search and easy-search are GPU-accelerated. However, the colabfold_search alignment protocol also includes an expandaln step (among others).

Unfortunately, it seems like expandaln is incompatible with the padded sequence DB generated and indexed for GPU, as running mmseqs expandaln on this database will cause it to crash. I think this is because the database .idx.index file lacks rows 24-25, i.e. ALNINDEX, ALNDATA as defined here: https://github.com/soedinglab/MMseqs2/blob/266c894c117a9bd650450974747424ce51124bf5/src/prefiltering/PrefilteringIndexReader.cpp#L33C1-L34C52

I thought that this was due to using the --index-subset 2 flag when running mmseqs createindex as recommended in the guide, but even using --index-subset 0 doesn't fix the issue for me.

Now I am wondering if the whole alignment protocol should change (e.g. by removing expandaln altogether) or perhaps there is something I am doing incorrectly when setting the database up? Thanks for any help on this!

Steps to Reproduce (for bugs)

Generate the padded DB:
mmseqs makepaddedseqdb uniref30_2302_db uniref30_2302_db_gpu
Generate the index (either with --index-subset 0 or --index-subset 2):

$ mmseqs createindex uniref30_2302_db_gpu tmp --split 0 --index-subset 0
...
Write VERSION (0)
Write META (1)
Write SCOREMATRIXNAME (2)
Write SPACEDPATTERN (23)
Write GENERATOR (22)
Write DBR1INDEX (5)
Write DBR1DATA (6)
Write HDR1INDEX (18)
Write HDR1DATA (19)
Write SCOREMATRIX3MER (4)
Write SCOREMATRIX2MER (3)
...
Write ENTRIES (9)
Write ENTRIESOFFSETS (10)
Write SEQINDEXDATASIZE (15)
Write SEQINDEXSEQOFFSET (16)
Write SEQINDEXDATA (14)
Write ENTRIESNUM (12)
Write SEQCOUNT (13)

The resulting .idx.index file lacks rows 24-25:

$ tail uniref30_2302_db_gpu.idx.index
...
21      10770190336     105711065
22      20480   41
23      16384   1
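
A quick way to confirm which entries are absent is to read the first column of the .idx.index file directly. The following is only a minimal sketch: it assumes the usual whitespace-separated MMseqs2 index layout (key, offset, length) and takes the key numbers 24 (ALNINDEX) and 25 (ALNDATA) from the PrefilteringIndexReader.cpp link above. The function names are hypothetical helpers, not part of MMseqs2.

```python
# Assumption: keys 24/25 correspond to ALNINDEX/ALNDATA, as stated in the issue text.
EXPECTED_ALN_KEYS = {24: "ALNINDEX", 25: "ALNDATA"}


def read_index_keys(index_path):
    """Collect the integer keys found in the first column of an .index file."""
    keys = set()
    with open(index_path) as handle:
        for line in handle:
            fields = line.split()
            if fields:  # skip blank lines
                keys.add(int(fields[0]))
    return keys


def missing_alignment_entries(index_path):
    """Return the names of the expected alignment entries that are not in the index."""
    present = read_index_keys(index_path)
    return [name for key, name in sorted(EXPECTED_ALN_KEYS.items())
            if key not in present]
```

If rows 24 and 25 are absent, as the tail output above suggests, calling missing_alignment_entries("uniref30_2302_db_gpu.idx.index") should return both names; an empty list would mean the alignment entries were written.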

Run mmseqs expandaln:

mmseqs expandaln ./example/qdb colabfold_databases/uniref30_2302_db_gpu.idx ./example/res colabfold_databases/uniref30_2302_db_gpu.idx ./res_exp

MMseqs Output

expandaln crashes while attempting to load the index:

MMseqs Version:                 dc7395810db17ec7de8adf32599562452b0c4d78
Expansion mode                  0
Substitution matrix             aa:blosum62.out,nucl:nucleotide.out
Gap open cost                   aa:11,nucl:5
Gap extension cost              aa:1,nucl:2
Max sequence length             65535
Score bias                      0
Compositional bias              1
Compositional bias              1
E-value threshold               0.001
Seq. id. threshold              0
Coverage threshold              0
Coverage mode                   0
Pseudo count mode               0
Pseudo count a                  substitution:1.100,context:1.400
Pseudo count b                  substitution:4.100,context:5.800
Expand filter clusters          0
Use filter only at N seqs       0
Maximum seq. id. threshold      0.9
Minimum seq. id.                0.0
Minimum score per column        -20
Minimum coverage                0
Select N most diverse seqs      1000
Preload mode                    0
Compressed                      0
Threads                         128
Verbosity                       3

Index version: 16
Generated by:  dc7395810db17ec7de8adf32599562452b0c4d78
ScoreMatrix:  VTML80.out
Index version: 16
Generated by:  dc7395810db17ec7de8adf32599562452b0c4d78
ScoreMatrix:  VTML80.out
Invalid database read for database data file=colabfold_databases/uniref30_2302_db_gpu.idx, database index=colabfold_databases/uniref30_2302_db_gpu.idx.index
getData: local id (4294967295) >= db size (22)

Your Environment

MMseqs2 commit: dc73958
Compiled with -DENABLE_CUDA=1 -DCMAKE_CUDA_ARCHITECTURES="75;80;86;89;90"
CUDA environment spec: gcccuda/12.1.1-gcc12.3.0
System: NVIDIA SuperPOD/DGX-A100 - Linux


milot-mirdita · 2024-11-22T14:29:32Z

Still working on it, we'll likely release the changes to do ColabFold with MMseqs2-GPU this weekend. colabfold_search doesn't actually require any changes directly. The new protocol can be activated with environment variables only, after building GPU databases.

clami66 · 2024-11-22T14:34:21Z

Thanks for responding so quickly. I will keep an eye out for the updates.
