-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Slow RNA variant calling #118
Comments
Is there like one particular region that take forever to finish? |
Thank you for your reply. There are many problematic regions, not just one. Do you have any other idea what can I try? |
Can you send me one of those regions of bam file and bed file? |
Of course. https://filesender.cesnet.cz/?s=download&token=d17a2d32-5950-46e4-9a9a-bfd2755a01b2 |
Sorry just gotten around to it and the link expired. |
No problem. I'm sending you the new link: The SomaticSeq itself was slow. Callers took some time but it wasn't something crazy considering the size of the dataset. |
Hello @litaifang, I just wanted to follow up on our conversation regarding the slow performance of SomaticSeq on RNA data. I understand you may have been busy, but I wanted to check if you had the chance to look into the files I shared with you. By the way, I wanted to mention that I managed to analyze my current RNA data using alternative methods, but I am still curious about resolving the performance issue in SomaticSeq. I wanted to keep exploring it because in the future, there might be more RNA data that I'll need to analyze. If you've had the opportunity to review the files, I would greatly appreciate any insights or suggestions you may have. |
Just got back from vacation. I'll take a look in the next couple of weeks when I get more time. |
Hello,
I am trying to run SomaticSeq on RNA data (single-end reads) but it's really slow. It never finished because I had to kill it after 7 days. I have many large samples (~10 GB) and if the smallest one (3.0 GB) takes more than 7 days to finish, I can't use it like this. It could run for months. I’m using Kubernetes so I have enough computing capacity.
First, I used SplitNCigarReads tool (https://gatk.broadinstitute.org/hc/en-us/articles/360036858811-SplitNCigarReads) on my mapped RNA and after that, I ran variant callers (lofreq, mutect2, strelka, vardict, varscan). I ran SomaticSeq as the last with command:
somaticseq_parallel.py --threads 20 --output-directory somatic_varcalls/sample1 --genome-reference GRCh38-p10.fa --inclusion-region wgs.bed --minimum-num-callers 0.4 single --bam-file sample1.RNAsplit.bam --mutect2-vcf somatic_varcalls/sample1/MuTect2.vcf --vardict-vcf somatic_varcalls/sample1/VarDict.vcf --lofreq-vcf somatic_varcalls/sample1/Lofreq.vcf --strelka-vcf somatic_varcalls/sample1/variants.vcf.gz --varscan-vcf somatic_varcalls/sample1/VarScan2.vcf
I also tried to run SomaticSeq only with one variant caller. First, just with Vardict and it took 51 hours to finish. Second, just with Strelka and it took 32 hours to finish. I also tried to use a smaller bed file (only a few exome positions) but nothing changed.
My theory is that SomaticSeq has a problem when it encounters heavily covered reads because when it splits bed file for parallelization, some were counted fast but some took many hours or days. I think that maybe if I will do some subset of those heavily covered areas, It could help but I still don't know how to approach this.
Do you have any idea or advice on what can I do with it? I used SomaticSeq on DNA data many times before, so I know that normally it ran from a few minutes to a few hours.
The text was updated successfully, but these errors were encountered: