Error with very small inputs: Please use a computer with more main memory #901

Open
ktmeaton opened this issue Nov 12, 2024 · 2 comments

@ktmeaton

Expected Behavior

I'm running cluster unit tests on a very small file. I'm trying to control the memory usage with --split-memory-limit, but it errors unless I give it at least 9GB of memory. This seems like a very disproportionate amount of memory. I haven't been able to create a test dataset that will run with less than 9G of memory, which suggests to me this might be a bug?

Current Behavior

When I try to cluster a very small number of sequences with less than 9G of memory, I get the error: Please use a computer with more main memory.

>seq1
GTTTATTTTCTCCTGTTAAATTGTCAGGCCAGAACGGCCAGTTTTCACGGGGTTCAGATA
>seq2
GTTTATTTTCTCCTGTTAAATTGTCAGGCCAGAACGGCCAGTTTTCACGGGGTTCAGATA
>seq3
TATCTGAACCCCGTGAAAACTGGCCGTTCTGGCCTGACAATTTAACAGGAGAAAATAAAC

I've tried easy-cluster and createdb + cluster. I've tried running through docker and conda, and I've tried the latest docker image from master. So far they all raise this error.

Steps to Reproduce (for bugs)

The following commands raise the error. It can only be fixed by using at least 9G of memory (--split-memory-limit 9G).

# Docker
docker run --rm -v $(pwd):/data ghcr.io/soedinglab/mmseqs2:15-6f452 easy-cluster /data/test.txt /data/mmseqs tmp --split-memory-limit 8G --threads 1

# Conda
micromamba create -n mmseqs2 bioconda::mmseqs2=15.6f452
micromamba run -n mmseqs2 mmseqs easy-cluster test.txt mmseqs tmp --split-memory-limit 8G --threads 1

MMseqs Output (for bugs)

Context

Your Environment

Include as many relevant details about the environment you experienced the bug in.

  • Git commit used: 6f45232
  • MMseqs version: conda and docker image from GitHub package registry.
  • For self-compiled and Homebrew: Not self-compiled.
  • Server specifications: 32GB Memory, AVX2 support, although I'm not sure how that interacts with the containers.
  • Operating system and version: WSL2, Ubuntu 20.04.
@milot-mirdita
Member

The default k-mer size for nucleotides is 15, which indeed requires more than 8GB of RAM.
You can reduce the k-mer size to 13 (-k 13) so that the k-mer data structures fit in less than 4GB RAM.
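
As a rough back-of-the-envelope sketch of why k matters so much: there are 4^k possible nucleotide k-mers, so any table with one slot per k-mer grows 16-fold each time k increases by 2. The 8 bytes per slot below is an assumption for illustration, not MMseqs2's actual memory accounting, and `kmer_table_bytes` is a hypothetical helper.

```python
def kmer_table_bytes(k, alphabet_size=4, bytes_per_slot=8):
    """Size in bytes of a dense table with one slot per possible k-mer.

    bytes_per_slot=8 is an assumed value for illustration only.
    """
    return alphabet_size ** k * bytes_per_slot


for k in (13, 14, 15):
    gib = kmer_table_bytes(k) / 2**30
    print(f"k={k}: {4**k:,} possible k-mers, {gib:.1f} GiB")
```

Under that assumption the table alone is 8 GiB at k=15 and 0.5 GiB at k=13, independent of how many sequences are in the input.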

Additionally (unrelated to memory use), I recommend disabling spaced k-mers for nucleotides (--spaced-kmer-mode 0). We have discovered a sensitivity issue with spaced k-mers on nucleotide sequences and are currently reworking this.

@ktmeaton
Author

Memory

Thank you so much for your help! I understand now what's happening with the memory, and -k 13 fixes it!

mmseqs easy-cluster test.txt mmseqs tmp --split-memory-limit 2G --threads 1 -k 13

Spaced Kmers

Regarding --spaced-kmer-mode 0, I'm finding that setting is fragmenting my clusters. I wonder if this is at all related to #489?

In this example data, seq2, seq3, and seq4 are identical. seq1 differs from them at position 10 (T in seq1, A in the others) and, in the sequences as pasted here, also at position 20 (A in seq1, T in the others).

>seq1
CGACGTCAGTGCAGTCGCTAACGTGGCAG
>seq2
CGACGTCAGAGCAGTCGCTTACGTGGCAG
>seq3
CGACGTCAGAGCAGTCGCTTACGTGGCAG
>seq4
CGACGTCAGAGCAGTCGCTTACGTGGCAG
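
A quick way to double-check where the sequences differ is a position-by-position comparison. This is a minimal sketch using only the four sequences pasted above; `diff_positions` is a hypothetical helper name, not part of MMseqs2.

```python
seqs = {
    "seq1": "CGACGTCAGTGCAGTCGCTAACGTGGCAG",
    "seq2": "CGACGTCAGAGCAGTCGCTTACGTGGCAG",
    "seq3": "CGACGTCAGAGCAGTCGCTTACGTGGCAG",
    "seq4": "CGACGTCAGAGCAGTCGCTTACGTGGCAG",
}


def diff_positions(a, b):
    """Return the 1-based positions where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b), start=1) if x != y]


for name in ("seq2", "seq3", "seq4"):
    print("seq1 vs", name, diff_positions(seqs["seq1"], seqs[name]))
```

For the sequences as pasted, each comparison prints `[10, 20]`.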

When I run with --spaced-kmer-mode 1, I get the desired clustering result (all are grouped together in one cluster).

mmseqs easy-cluster snp_example.fasta spaced_1 tmp --spaced-kmer-mode 1

# spaced_1_cluster.tsv
seq2    seq2
seq2    seq1
seq2    seq3
seq2    seq4

When I run with --spaced-kmer-mode 0, seq1 is split off into its own cluster, apart from the other three sequences.

mmseqs easy-cluster snp_example.fasta spaced_0 tmp --spaced-kmer-mode 0

# spaced_0_cluster.tsv
seq1    seq1
seq2    seq2
seq2    seq3
seq2    seq4
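
To compare the two runs without reading the TSV by eye, the rows can be grouped by their first column. This is a minimal sketch that assumes each row is a whitespace-separated "representative, member" pair as in the listings above; `read_clusters` is a hypothetical helper, and the sample text is the spaced_0 result copied from this comment.

```python
from collections import defaultdict


def read_clusters(lines):
    """Group cluster TSV rows (representative, member) by representative."""
    clusters = defaultdict(list)
    for line in lines:
        if line.startswith("#"):
            continue  # skip comment lines such as a file-name header
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed rows
        representative, member = fields
        clusters[representative].append(member)
    return dict(clusters)


spaced_0 = """\
seq1    seq1
seq2    seq2
seq2    seq3
seq2    seq4
"""

clusters = read_clusters(spaced_0.splitlines())
for representative, members in clusters.items():
    print(representative, len(members), members)
```

With a real output file, the same function can be fed the lines of an open file handle instead of the inline sample.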

I can't seem to find any other parameters that will group them all back together again. I am still reading through the manual, but just wanted to document this example data in the meantime.
