Replies: 2 comments
-
Hi Jan, interesting questions and good thinking! Lowering --seedSearchStartLmax should not generally have an adverse effect (except mapping slowdown) until its value becomes very small, roughly <1/4 of the read length. --outSJfilterOverhangMin only affects novel (unannotated) junctions. Note that it only affects splice junction output to SJ.out.tab and not the BAM file, unless you use the --outFilterType BySJout option. Note also that by definition the overhang cannot be larger than readLength/2, so a value of 30 will completely prohibit the non-canonical junctions, and even 12 is very restrictive. If you are interested in novel junctions, it would be better to reduce these values. 2-step mapping is advisable for short reads, as it will help to recover more spliced reads for novel junctions. The filtering you have in mind makes sense, but with such a large sample you may still have too many junctions. In that case, you can do some additional filtering, e.g. based on the number of reads per junction. Cheers
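As a rough illustration of the "additional filtering based on the number of reads per junction" idea, here is a minimal Python sketch. It assumes the standard 9-column, tab-separated SJ.out.tab layout (column 5 = intron motif with 0 meaning non-canonical, column 6 = annotation flag, column 7 = uniquely mapping reads). The function name and the threshold of 3 reads are hypothetical choices for the example, not STAR defaults or recommendations.

```python
from collections import defaultdict


def merge_and_filter_junctions(sj_lines, min_unique_reads=3):
    """Pool novel canonical junctions across cells and keep well-supported ones.

    sj_lines: iterable of SJ.out.tab lines, e.g. concatenated from all cells.
    min_unique_reads: hypothetical cutoff on pooled uniquely mapping reads.
    """
    support = defaultdict(int)
    for line in sj_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip malformed or empty lines
        chrom, start, end, strand, motif, annotated, n_unique = fields[:7]
        if annotated == "1":
            continue  # annotated junctions are already in the GTF
        if motif == "0":
            continue  # drop non-canonical junctions
        support[(chrom, int(start), int(end), strand)] += int(n_unique)
    # Keep junctions whose pooled unique-read support reaches the cutoff.
    return sorted(key for key, n in support.items() if n >= min_unique_reads)
```

The surviving junctions could then be written back out in SJ.out.tab-like form and supplied to the second mapping pass; how strict the cutoff should be depends on the data.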
-
Hi Alex, sorry for the late response. Thank you so much for your advice, it's great that you help out STAR's user base like this! I did end up using 1/2 of the read length for --seedSearchStartLmax and reduced the minimal overhang for canonical splice junctions to 10. I left the one for non-canonical junctions at 30 because I was planning to filter those out anyway... I actually ended up doing the filtering as planned. The 2nd pass did take quite a bit longer than the 1st, but it wasn't too bad, so I didn't do any further filtering. This way, around 10^6 new junctions were added in the second step. Best,
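For anyone wanting to reproduce these settings for a different read length, the arithmetic can be sketched in Python as below. The helper names are hypothetical; only the two STAR options and the values discussed in this thread (half the read length for the seed length, 10 for canonical and 30 for non-canonical overhangs) come from the discussion. The check relies on the point that the overhang cannot exceed half the read length.

```python
def max_possible_overhang(read_length):
    """Largest overhang a junction can have: the shorter side is at most half the read."""
    return read_length // 2


def short_read_star_args(read_length, canonical_overhang=10, noncanonical_overhang=30):
    """Build the two STAR options discussed in this thread (hypothetical helper)."""
    limit = max_possible_overhang(read_length)
    if canonical_overhang > limit:
        # Such a value would filter out every novel canonical junction.
        raise ValueError(
            f"canonical overhang {canonical_overhang} exceeds the maximum {limit}"
        )
    return [
        "--seedSearchStartLmax", str(read_length // 2),
        # Order: non-canonical, GT/AG, GC/AG, AT/AC motifs.
        "--outSJfilterOverhangMin",
        str(noncanonical_overhang),
        str(canonical_overhang),
        str(canonical_overhang),
        str(canonical_overhang),
    ]


if __name__ == "__main__":
    # 38 bp reads, as in this thread.
    print(" ".join(short_read_star_args(38)))
```

With 38 bp reads the maximum overhang is 19, so the non-canonical value of 30 can never be met and novel non-canonical junctions are excluded, which is the intended effect here.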
-
Hello,
I want to use STAR to align some human paired-end scRNA-seq data generated with the SMART-Seq2 protocol (around 9000 cells) in order to perform genome-guided transcriptome assembly with StringTie afterwards. I was wondering about the best values to choose for some parameters in order to maximize the potential for novel isoform and transcript discovery. I would be really grateful to get some input on this. The main issue is that the read length is only 38 bp (of course not optimal at all for my goal), which prompted me to think that some of the parameters might need optimization.
Mainly, I was wondering about --seedSearchStartLmax, which I was thinking I should set to a lower number than the default of 50. I tried different values starting from 10 and observed slower mapping speeds at the lower values, but in turn higher mapping rates and more detected splice junctions. Is there any other drawback to using lower values apart from the computational cost? I am planning on using 19 as a balance between speed and mapping rate...
How low could I go in theory?
Also, for the option --outSJfilterOverhangMin with default 30 12 12 12, is it ok to lower the values for annotated junctions a bit in order to gain some sensitivity for novel splice junction detection, or will this easily result in too many false positives? I was thinking I could use 30 10 10 10, mainly since the reads are so short... I would leave the first value at 30 (or anything >19) to exclude novel non-canonical junctions altogether (these can usually be discarded anyway, right?).
Finally, I want to do 2 pass mapping, is the following strategy ok for this:
Might this cause problems for STAR because of the single cell nature of the data which results in a high number of files?
Is there anything else that I should change considering the type of data and my goal of looking for novel isoforms and transcripts specifically among long non-coding RNAs?
Thank you in advance!!
Jan