- Output enhancements:
- Summary.csv statistics output for raw and filtered cells useful for quick run quality assessment.
- --soloCellFilter option for basic filtering of the cells, similar to the methods used by CellRanger 2.2.x.
- Better compatibility with CellRanger 3.x.x:
- --soloUMIfiltering MultiGeneUMI option introduced in CellRanger 3.x.x for filtering UMI collisions between different genes.
- --soloCBmatchWLtype 1MM_multi_pseudocounts option, introduced in CellRanger 3.x.x, which slightly changes the posterior probability calculation for CB with 1 mismatch.
- Velocyto spliced/unspliced/ambiguous quantification:
- --soloFeatures Velocyto option to produce Spliced, Unspliced, and Ambiguous counts similar to the velocyto.py tool developed by LaManno et al. This option is under active development and the results may change in the future versions.
- Support for complex barcodes, e.g. inDrop:
- Complex barcodes in STARsolo with --soloType CB_UMI_Complex, --soloCBmatchWLtype --soloAdapterSequence, --soloAdapterMismatchesNmax, --soloCBposition,--soloUMIposition
- BAM tags:
- CB/UB for corrected CellBarcode/UMI
- GX/GN for gene ID/name
STARsolo is a turnkey solution for analyzing droplet single cell RNA sequencing data (e.g. 10X Genomics Chromium System) built directly into STAR code. STARsolo inputs the raw FASTQ reads files, and performs the following operations
- error correction and demultiplexing of cell barcodes using user-input whitelist
- mapping the reads to the reference genome using the standard STAR spliced read alignment algorithm
- error correction and collapsing (deduplication) of Unique Molecular Identifiers (UMIa)
- quantification of per-cell gene expression by counting the number of reads per gene
- quantification of other transcriptomic features: splice junctions; pre-mRNA; spliced/unspliced reads similar to Velocyto
STARsolo output is designed to be a drop-in replacement for 10X CellRanger gene quantification output. It follows CellRanger logic for cell barcode whitelisting and UMI deduplication, and produces nearly identical gene counts in the same format. At the same time STARsolo is ~10 times faster than the CellRanger.
-
STARsolo is run the same way as normal STAR run, with addition of several STARsolo parameters:
/path/to/STAR --genomeDir /path/to/genome/dir/ --readFilesIn ... [...other parameters...] --soloType ... --soloCBwhitelist ...
The genome index is the same as for normal STAR runs.
The parameters required to run STARsolo on 10X Chromium data are described below: -
The STAR solo algorithm is turned on with:
--soloType Droplet
or, since 2.7.3a, with more descriptive:
--soloType CB_UMI_Simple
-
The CellBarcode whitelist has to be provided with:
--soloCBwhitelist /path/to/cell/barcode/whitelist
The 10X Chromium whitelist file can be found inside the CellRanger distribution, e.g. 10X-whitelist. Please make sure that the whitelist is compatible with the specific version of the 10X chemistry: V2 or V3. For instance, in CellRanger 3.1.0, the V2 whitelist is:
cellranger-cs/3.1.0/lib/python/cellranger/barcodes/737K-august-2016.txt
and V3 whitelist (gunzip it for STAR):
cellranger-cs/3.1.0/lib/python/cellranger/barcodes/3M-february-2018.txt.gz
-
The default barcode lengths (CB=16b, UMI=10b) work for 10X Chromium V2. For V3, specify:
--soloUMIlen 12
-
Importantly, in the --readFilesIn option, the 1st file has to be cDNA read, and the 2nd file has to be the barcode (cell+UMI) read, i.e.
--readFilesIn cDNAfragmentSequence.fastq.gz CellBarcodeUMIsequence.fastq.gz
For instance, standard 10X runs have cDNA as Read2 and barcode as Read1:
--readFilesIn Read2.fastq.gz Read1.fastq.gz
For multiple lanes, use commas separated lists for Read2 and Read1:
--readFilesIn Read2_Lane1.fastq.gz,Read2_Lane2.fastq.gz,Read2_Lane3.fastq.gz Read1_Lane1.fastq.gz,Read1_Lane2.fastq.gz,Read1_Lane3.fastq.gz
-
CellRanger uses its own "filtered" version of annotations (GTF file) which is a subset of ENSEMBL annotations, with several gene biotypes removed (mostly small non-coding RNA). Annotations affect the counts, and to match CellRanger counts CellRanger annotations have to be used.
-
10X provides several versions of the CellRanger annotations:
https://support.10xgenomics.com/single-cell-gene-expression/software/downloads/latest
For the best match, the annotations in CellRanger run and STARsolo run should be exactly the same. -
The FASTA and GTF files
refdata-cellranger-GRCh38-3.0.0/genes/genes.gtf refdata-cellranger-GRCh38-3.0.0/genes/genome.fa
have to be used in STAR genome index generation step before mapping:
STAR --runMode genomeGenerate --runThreadN ... --genomeDir ./ --genomeFastaFiles /path/to/genome.fa --sjdbGTFfile /path/to/genes.gtf
-
If you want to use your own GTF (e.g. newer version of ENSEMBL or GENCODE), you can generate the "filtered" GTF file using 10X's mkref tool:
https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/advanced/references -
To make the agreement between STARsolo and CellRanger even more perfect, you can add
--genomeSAsparseD 3
to the genome generation options, which is used by CellRanger to generate STAR genomes. It will generate sparse suffixs array whic has an additional benefit of fitting into 16GB of RAM. However, it also results in 30-50% reduction of speed.
-
The considerations above are for raw counts, i.e. when cell filtering is not performed. To get filtered results, refer to Basic cell filtering section.
- By default, cell barcode and UMI collapsing parameters are designed to give the best agreement with CellRanger 2.x.x. CellRanger 3.x.x introduced some minor changes to this algorithm. To get the best agreement between STARsolo and CellRanger 3.x.x, add these parameters:
--soloUMIfiltering MultiGeneUMI --soloCBmatchWLtype 1MM_multi_pseudocounts
- Simple barcode lengths and start positions on barcode reads are described with
which works with
--soloCBstart, --soloCBlen, --soloUMIstart, --soloUMIlen
--soloType CB_UMI_Simple (a.k.a Droplet)
- More complex barcodes are activated with
--soloType CB_UMI_complex
and are described with the following parameterssoloCBposition - strings(s) position of Cell Barcode(s) on the barcode read. Presently only works with --soloType CB_UMI_Complex, and barcodes are assumed to be on Read2. Format for each barcode: startAnchor_startDistance_endAnchor_endDistance start(end)Anchor defines the anchor base for the CB: 0: read start; 1: read end; 2: adapter start; 3: adapter end start(end)Distance is the distance from the CB start(end) to the Anchor base String for different barcodes are separated by space. Example: inDrop (Zilionis et al, Nat. Protocols, 2017): --soloCBposition 0_0_2_-1 3_1_3_8 soloUMIposition - string position of the UMI on the barcode read, same as soloCBposition Example: inDrop (Zilionis et al, Nat. Protocols, 2017): --soloCBposition 3_9_3_14 soloAdapterSequence - string: adapter sequence to anchor barcodes. soloAdapterMismatchesNmax 1 int>0: maximum number of mismatches allowed in adapter sequence
- Since 2.7.3a, in addition to raw, unfiltered output of gene/cell counts, STARsolo performs simple (knee-like) filtering of the cells, similar to the methods used by CellRanger 2.2.x. This is turned on by default and is controlled by:
soloCellFilter CellRanger2.2 3000 0.99 10 string(s): cell filtering type and parameters CellRanger2.2 ... simple filtering of CellRanger 2.2, followed by thre numbers: number of expected cells, robust maximum percentile for UMI count, maximum to minimum ratio for UMI count TopCells ... only report top cells by UMI count, followed by the excat number of cells None ... do not output filtered cells
- This filtering is used to produce summary statistics for filtered cells in the Summary.csv file, which is similar to CellRanger's summary and is useful for Quality Control.
- Recent versions of CellRanger switched to more advanced filtering done with the EmptyDrop tool developed by Lun et al. To obtain filtered counts similar to recent CellRanger versions, we need to run this tools on raw STARsolo output
- In addition to the gene counts (deafult), STARsolo can calculate counts for other transcriptomic features:
- pre-mRNA counts, useful for single-nucleus RNA-seq. This counts all read that overlap gene loci, i.e. included both exonic and intronic reads:
--soloFeatures GeneFull
- Counts for annotated and novel splice junctions:
--soloFeatures SJ
-
This option will calculate Spliced, Unspliced, and Ambiguous counts per cell per gene similar to the velocyto.py tool developed by LaManno et al. This option is under active development and the results may change in the future versions.
Note that Velocyto quantification requires Gene features
--soloFeatures Gene Velocyto
- All the features can be conveniently quantified in one run:
--soloFeatures Gene GeneFull SJ Velocyto
- pre-mRNA counts, useful for single-nucleus RNA-seq. This counts all read that overlap gene loci, i.e. included both exonic and intronic reads:
- To output BAM tags into SAM/BAM file, add them to the list of standard tags in
Any combinations of tags can be used.
--outSAMattributes NH HI nM AS CR UR CB UB GX GN sS sQ sM
- CR/UR: raw (uncorrected) CellBarcode/UMI
- CY/UY: quality score for CellBarcode/UMI
- GX/GN: for gene ID/names
- sS/sQ: for sequence/quality combined CellBarcode and UMI; sM for barcode match status.
- CB/UB: corrected CellBarcode/UMI. Note, that these tags require sorted BAM output, i.e. we need to add:
--outSAMtype BAM SortedByCoordinate
For completenes, all parameters that control STARsolo output are listed again below with defaults and short descriptions:
soloType None
string(s): type of single-cell RNA-seq
CB_UMI_Simple ... (a.k.a. Droplet) one UMI and one Cell Barcode of fixed length in read2, e.g. Drop-seq and 10X Chromium
CB_UMI_Complex ... one UMI of fixed length, but multiple Cell Barcodes of varying length, as well as adapters sequences are allowed in read2 only, e.g. inDrop.
soloCBwhitelist -
string(s): file(s) with whitelist(s) of cell barcodes. Only one file allowed with
soloCBstart 1
int>0: cell barcode start base
soloCBlen 16
int>0: cell barcode length
soloUMIstart 17
int>0: UMI start base
soloUMIlen 10
int>0: UMI length
soloBarcodeReadLength 1
int: length of the barcode read
1 ... equal to sum of soloCBlen+soloUMIlen
0 ... not defined, do not check
soloCBposition -
strings(s) position of Cell Barcode(s) on the barcode read.
Presently only works with --soloType CB_UMI_Complex, and barcodes are assumed to be on Read2.
Format for each barcode: startAnchor_startDistance_endAnchor_endDistance
start(end)Anchor defines the anchor base for the CB: 0: read start; 1: read end; 2: adapter start; 3: adapter end
start(end)Distance is the distance from the CB start(end) to the Anchor base
String for different barcodes are separated by space.
Example: inDrop (Zilionis et al, Nat. Protocols, 2017):
--soloCBposition 0_0_2_-1 3_1_3_8
soloUMIposition -
string position of the UMI on the barcode read, same as soloCBposition
Example: inDrop (Zilionis et al, Nat. Protocols, 2017):
--soloCBposition 3_9_3_14
soloAdapterSequence -
string: adapter sequence to anchor barcodes.
soloAdapterMismatchesNmax 1
int>0: maximum number of mismatches allowed in adapter sequence.
soloCBmatchWLtype 1MM_multi
string: matching the Cell Barcodes to the WhiteList
Exact ... only exact matches allowed
1MM ... only one match in whitelist with 1 mismatched base allowed. Allowed CBs have to have at least one read with exact match.
1MM_multi ... multiple matches in whitelist with 1 mismatched base allowed, posterior probability calculation is used choose one of the matches.
Allowed CBs have to have at least one read with exact match. Similar to CellRanger 2.2.0
1MM_multi_pseudocounts ... same as 1MM_Multi, but pseudocounts of 1 are added to all whitelist barcodes.
Similar to CellRanger 3.x.x
soloStrand Forward
string: strandedness of the solo libraries:
Unstranded ... no strand information
Forward ... read strand same as the original RNA molecule
Reverse ... read strand opposite to the original RNA molecule
soloFeatures Gene
string(s): genomic features for which the UMI counts per Cell Barcode are collected
Gene ... genes: reads match the gene transcript
SJ ... splice junctions: reported in SJ.out.tab
GeneFull ... full genes: count all reads overlapping genes' exons and introns
Transcript3p ... quantification of transcript for 3' protocols
soloUMIdedup 1MM_All
string(s): type of UMI deduplication (collapsing) algorithm
1MM_All ... all UMIs with 1 mismatch distance to each other are collapsed (i.e. counted once)
1MM_Directional ... follows the "directional" method from the UMI-tools by Smith, Heger and Sudbery (Genome Research 2017).
Exact ... only exactly matching UMIs are collapsed
soloUMIfiltering -
string(s) type of UMI filtering
- ... basic filtering: remove UMIs with N and homopolymers (similar to CellRanger 2.2.0)
MultiGeneUMI ... remove lower-count UMIs that map to more than one gene (introduced in CellRanger 3.x.x)
soloOutFileNames Solo.out/ features.tsv barcodes.tsv matrix.mtx
string(s) file names for STARsolo output:
file_name_prefix gene_names barcode_sequences cell_feature_count_matrix
soloCellFilter CellRanger2.2 3000 0.99 10
string(s): cell filtering type and parameters
CellRanger2.2 ... simple filtering of CellRanger 2.2, followed by thre numbers: number of expected cells, robust maximum percentile for UMI count, maximum to minimum ratio for UMI count
TopCells ... only report top cells by UMI count, followed by the excat number of cells
None ... do not output filtered cells