SMART-seq2 data with short read length #1336

janhoelzl · 2021-08-25T00:48:55Z

Hello,

I want to use STAR to align human paired-end scRNA-seq data generated with the SMART-Seq2 protocol (around 9000 cells), followed by genome-guided transcriptome assembly with StringTie. I am wondering which parameter values would maximize the potential for novel isoform and transcript discovery, and I would be really grateful for some input. The main issue is that the read length is only 38 bp (of course not optimal at all for my goal), which prompted me to think that some of the parameters might need tuning.

Mainly, I was wondering about --seedSearchStartLmax, which I think I should set lower than the default of 50. I tried different values starting from 10 and observed slower mapping at the lower values, but in turn higher mapping rates and more detected splice junctions. Is there any drawback to lower values apart from the computational cost? I am planning to use 19 as a balance between speed and mapping rate...
How low could I go in theory?

Also, for the option --outSJfilterOverhangMin with default 30 12 12 12: is it OK to lower the values for annotated junctions a bit in order to gain some sensitivity for novel splice junction detection, or will this easily result in too many false positives? I was thinking I could use 30 10 10 10, mainly since the reads are so short... I would leave the first value at 30 (or anything >19) to exclude novel non-canonical junctions altogether (they can usually be discarded anyway, right?).

Finally, I want to do 2-pass mapping. Is the following strategy OK for this?

- filter all SJ.out.tab files from the first pass (there will be around 9000)
- keep only novel junctions on chr1-22, X, Y that have at least one uniquely mapping read
- provide these filtered junction files in the second mapping run

Might this cause problems for STAR, given that the single-cell nature of the data results in such a high number of files?

Is there anything else that I should change considering the type of data and my goal of looking for novel isoforms and transcripts specifically among long non-coding RNAs?

Thank you in advance!!
Jan

alexdobin (Maintainer) · 2021-08-30T13:37:11Z

Hi Jan,

interesting questions and good thinking!

Lowering --seedSearchStartLmax should not generally have an adverse effect (except for slower mapping) unless its value becomes very small, below roughly 1/4 of the read length.
With such small values, it may in some cases produce too many seeds, which can prevent mapping.
Setting it to 1/2 of the read length is a good choice.
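
Editor's note: the rule of thumb above can be written down as a small helper. This is only a sketch of the arithmetic; the function names are made up for illustration and are not part of STAR.

```python
def suggest_seed_search_start_lmax(read_length: int) -> int:
    """Return half the read length, following the rule of thumb above."""
    if read_length < 2:
        raise ValueError("read_length must be at least 2")
    return read_length // 2


def is_below_quarter(seed_lmax: int, read_length: int) -> bool:
    """True if the value falls under ~1/4 of the read length,
    the range described above as risky (too many seeds)."""
    return seed_lmax < read_length / 4


# For the 38 bp reads discussed in this thread:
print(suggest_seed_search_start_lmax(38))  # 19
print(is_below_quarter(10, 38))            # False (10 >= 9.5)
```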

--outSJfilterOverhangMin only affects novel (unannotated) junctions. Note that it only affects the splice junction output in SJ.out.tab and not the BAM file, unless you use the --outFilterType BySJout option. Note also that by definition the overhang cannot be larger than readLength/2, so a value of 30 will completely prohibit non-canonical junctions, and even 12 is very restrictive. If you are interested in novel junctions, it would be better to reduce these values.
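
Editor's note: a quick way to see which thresholds can be met at all for a given read length. The sketch assumes the overhang is the shorter flank of a read spanning one junction, so it can be at most half the read length, and that each mate of a pair is considered on its own. The four positions follow the order documented for --outSJfilterOverhangMin (non-canonical, GT/AG, GC/AG, AT/AC motifs); the function name is hypothetical.

```python
MOTIF_CLASSES = ("noncanonical", "GT/AG", "GC/AG", "AT/AC")


def reachable_overhang_thresholds(thresholds, read_length):
    """Map each motif class to True if its minimum-overhang threshold
    can be satisfied by a read of the given length (assumption: the
    overhang never exceeds read_length // 2)."""
    if len(thresholds) != len(MOTIF_CLASSES):
        raise ValueError("expected four threshold values")
    limit = read_length // 2
    return {motif: t <= limit for motif, t in zip(MOTIF_CLASSES, thresholds)}


# "30 10 10 10" with 38 bp reads: the non-canonical class is switched off entirely.
print(reachable_overhang_thresholds((30, 10, 10, 10), 38))
```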

2-step mapping is advisable for short reads, as it will help to recover more spliced reads for novel junctions. The filtering you have in mind makes sense, but with such a large sample you may still have too many junctions. In that case, you can do some additional filtering, e.g. based on the number of reads per junction.
It would probably be easier to concatenate 9000 SJ.out.tab files into one file.
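
Editor's note: a minimal sketch of pooling the per-cell SJ.out.tab files into one junction list with the filters discussed in this thread (novel junctions only, chr1-22/X/Y, at least one uniquely mapping read, plus an optional pooled read-count cutoff). It relies on the SJ.out.tab column layout from the STAR manual (column 4 strand, column 6 annotation flag, column 7 uniquely mapping reads) and writes the four-column chr/start/end/strand format accepted by --sjdbFileChrStartEnd. The function names and the "chr"-prefixed chromosome names are assumptions; adjust them to your genome.

```python
from collections import defaultdict

# Assumes "chr"-prefixed chromosome names; change if the genome uses "1", "2", ...
KEEP_CHROMS = {f"chr{i}" for i in range(1, 23)} | {"chrX", "chrY"}
# SJ.out.tab encodes strand as 0 (undefined), 1 (+), 2 (-).
STRAND = {"0": ".", "1": "+", "2": "-"}


def merge_novel_junctions(sj_files, min_total_unique=1):
    """Pool novel junctions across SJ.out.tab files.

    Returns {(chrom, start, end, strand): unique reads summed over files},
    keeping junctions whose pooled count is >= min_total_unique.
    """
    totals = defaultdict(int)
    for path in sj_files:
        with open(path) as handle:
            for line in handle:
                fields = line.rstrip("\n").split("\t")
                if len(fields) < 9:
                    continue  # skip malformed or empty lines
                chrom, start, end, strand, _motif, annotated, n_unique = fields[:7]
                if annotated != "0":          # keep unannotated junctions only
                    continue
                if chrom not in KEEP_CHROMS:  # drop chrM, scaffolds, etc.
                    continue
                if int(n_unique) < 1:         # need a uniquely mapping read in this cell
                    continue
                key = (chrom, int(start), int(end), STRAND[strand])
                totals[key] += int(n_unique)
    return {k: v for k, v in totals.items() if v >= min_total_unique}


def write_sjdb(junctions, out_path):
    """Write junctions as chr<TAB>start<TAB>end<TAB>strand, sorted."""
    with open(out_path, "w") as out:
        for chrom, start, end, strand in sorted(junctions):
            out.write(f"{chrom}\t{start}\t{end}\t{strand}\n")
```

If the pooled list is still too large, raising `min_total_unique` is one way to apply the read-count filter mentioned above; the resulting single file can then be passed to the second pass instead of thousands of separate files.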

Cheers
Alex

janhoelzl (Author) · 2021-09-13T15:40:48Z

Hi Alex,

sorry for the late response. Thank you so much for your advice, it's great that you help out STAR's user base like this!

I did end up using 1/2 read length for --seedSearchStartLmax and reduced the minimal overhang for canonical splice junctions to 10. I left the one for non-canonical junctions at 30 because I was planning to filter those out anyway...

I actually ended up doing the filtering as planned. The 2nd pass did take quite a bit longer than the 1st, but it wasn't too bad, so I didn't do any further filtering. This way, around 10^6 new junctions were added in the second step.

Best,
Jan
