Skip to content

Latest commit

 

History

History
171 lines (135 loc) · 7.12 KB

documentation.md

File metadata and controls

171 lines (135 loc) · 7.12 KB

Technical Documentation

What follows is a technical documentation of each step in the pipeline.

The steps get_cellranger, get_reference, unpack_bcl2fastq, and clean_names are not here covered, as the simply move or rename files.

Rule: counts

Uses Cellranger counts to generate the counts matrix for each sample.

The --noseconday flag is passed, since we are (obviously) performing our own.

The --include-introns flag is controlled by a user-provide boolean switch, allowing flexibility for choosing between single-cell and single-nucleus data.

Currently, only a --sample flag is passed, collating all counts into a single matrix. Future work will also pass a --lane flag, which will allow for batch tracking, even when samples have been pooled and de-multiplexed.

The --expect-cells flag takes an integer count that must be specified by the user based on knowledge about experimental design. In my experience, Cellranger always over estimates the number of cells, regardless of this flag, hence the custom implementation of emptyDrops.

Rule: filter_empty

Runs DropletUtils::emptyDrops on the 10x output.

The challenge in implementing this function is determining the correct value for lower, Set it too high, and you will lose droplets that are actual cells and confound the identification of the rest. Set it too low, and you will gain cells at the cost of increased error in the cell call (as garbage will be included). In my experience, the most effective way to set this in an unbiased manner is to take the median of counts between 25 and 2000 (to prevent 0 inflation) and set the bound at 2 standard deviations above this mean.

Rule: qc

Standard sc/snRNAseq quality control metrics run using Scanpy.

This is run on a per sample basis. Filtering occurs on the following criteria:

  1. Any droplet whose FDR from filter_empty is > 0.005 is dropped.
  2. Any feature not found in at least 1 cell is dropped.
  3. Any cell with more than the user-defined threshold of percent mitochondrial counts is dropped.
  • This defaults to 5%. This is ~ correct for single nucleus data. 20% is better for single cell.
  1. Any cell with more than the user-defined threshold of total counts is dropped
  2. Any cell with less than the user-defined threshold of genes by counts is dropped
  3. If Cellranger included introns, then all mitochondrial genes are dropped.

Finally, the sample label is extracted from the Snakemake wildcard to label the sample for future downstream analysis.

Rule: doublets

Doublet detection performed using Solo implemented in the scVI package.

Briefly, Solo uses an (unsupervised) variational auto-encoder to embed cells then appends a feed-forward neural network to form a supervised classifier. Please see their original manuscript for more details.

As a neural network, it has the usual litany of strengths and weaknesses. It is GPU accelerable, allowing it perform well on large datasets. However, it can be prone to over-fitting, a problem addressed in this pipeline by the use of early-stopping.

Solo is run on a lane-by-lane basis, as each lane will have different technical variation.

Rule: dim_reduc

Dimensional reduction and bath-correction using Scanpy and Harmony.

At this point in the analysis, all samples and lanes are concatenated into a single Anndata object. Lane and sample labels are retained for identification downstream, where necessary.

The number of highly variable genes to use is a key choice in sc/snRNAseq analysis, and one that frequently can feel quite arbitrary. In effort to make this more quantitative, I default to half the number of cells or 10,000 gene, whichever is less, while enforcing a floor of 1000 genes Should the user desire, this default can be overridden by specifying nHVG in the configuration file.

Highly expressed genes - those accounting for more than 5% of counts in cells - are excluded from the calculation of the normalisation factor to prevent skew in the normalised values of other genes. Frequently, this simply excludes MALAT1, though the user may check the log files for the exact details.

After a log1p transformation, the effects of total counts and percent counts from mitochondria are regressed out. The latter can be debated as to whether it is a relevant step. I find that, unless there is a strong reason to suspect a phenotype defined by mitochondrial content, it reduces noise in the final data. Additionally, for single nucleus data, these genes can be considered a source of technical error, as they should not be detected.

50 PCs are calculated on the variable features. To automate the number of PCs to keep, those PCs accounting for 85% of the variance are kept. This can be overridden by the user, should there be a motivation to keep more or less PCs.

Batch correction is performed over lanes using Harmony. The correct application of batch correction is a discussion beyond the scope of these technical notes. A number of papers (see here and here) suggest that Harmony is a robust choice for batch correction. Here, it is run on all PCs selected in the previous step until model convergence.

Rule: cluster

Clustering using Leiden with Scanpy.

A neighbor network is calculated using 15 neighbors on the Harmony-corrected PCs. The Leiden algorithm for clustering is used on this network at a user-defined resolution. A basic differential expression test is performed in a 1-vs-all manner across all clusters, using a T-test with overestimated variance and a Benjamini-Hochberg multiple correction. The user is advised that these should be consider with a pinch of salt, as they are not batch corrected. For more robust differential expression analysis across a condition or cell type, the user is directed to tools such as MAST that can be integrated into this pipeline.

Rule: densities

Density embeddings using Scanpy.

This step allows the user to definge categorical covariates on which density embeddings will be calculated. I find this a useful sanity check on my analysis, as well as providing an example of how the basic workflow may be extended.