After processing the long-read data with nanoranger and performing the pileup of reads (perform_pileup.py) aligning with the locus of interest, the R package will be used for the actual variant calling.
To extract single nucleotide variants, we need to first define a cut-off that will be used to identify cell barcodes that likely represent high-quality events in the targeted sequencing library.
In general, the higher the cut-off, the more stringent low-quality cells represented by chimeric reads will be excluded.
knee_plot(x = data.table::fread('./vignettes/data/1002_pileup_TP53_1.csv.gz'), cutoff = 20)
knee_plot(x = data.table::fread('./vignettes/data/1002_pileup_U2AF1.csv.gz'), cutoff = 20)
After having identified a cut-off, the actual variant calling can be performed. By using the downsample = ...
parameter,
we can speed things up for illustration purposes (in that case, the cut-off should also be changed).
TP53 = extract_mutation(BC.data.file = './vignettes/data/1002_pileup_TP53_1.csv.gz', REF = 'G', ALT = 'T', FILTER = 20, downsample = 100000)
U2AF1 = extract_mutation(BC.data.file = './vignettes/data/1002_pileup_U2AF1.csv.gz', REF = 'G', ALT = 'T', FILTER = 20, downsample = 100000)
For indels or insertions, the same workflow applies as for single nucleotide variants. After defining a cut-off for
cells we consider high-quality, the piled up reads are processed to screen for cells with an insertion or deletion at the indicated position.
Using CONSENSUS = ...
we define how many bases are deleted or inserted at the indicated position.
knee_plot(x = data.table::fread('./vignettes/data/1007_pileup_STAG2.csv.gz'), cutoff = 20)
extract_indel(BC.data.file = './vignettes/data/1007_pileup_STAG2.csv.gz', REF = 'C', CONSENSUS = -2, FILTER = 20)
For the identification of cells expressing fusion versus wildtype genes, reads are aligned against both fusion partners with
the nanoranger script fusion_gene.py. The output is then processed with extract_fusion_gene()
.
The approach assumes that as the locus-specific primer targets the fusion partner at the 3' end, reads aligning against the fusion partner at the 5' end derive from fusion events.
Thus, the 3' (WILDTYPE = ...
) and 5' (FUSION = ...
) partners need to be indicated when identifying cells expressing the fusion like so:
CAR.ONT = extract_fusion_gene('./vignettes/data/97_6_CART_CD28.csv.gz', WILDTYPE = 'CD28', FUSION = 'CARTmod')
This works in the same way with fusions like BCR::ABL1, where the parameters would be WILDTYPE = 'ABL1'
and FUSION = 'BCR'
For identifying differential expression of exons within genes, the nanoranger script isoforms.py
is used.
This will provide a list of reads mapping to the gene of interest alongside information with how many nucleotides
each read overlaps with the exon in question. If the overlap of a particular read with the exon is <50%, it will be
considered non-expressed. For example, in the gene PTPRC, exon4 is relevant for identifying CD45RA versus CD45RO.
Again, we need to look at the coverage first to identify a cut-off.
And then we are able to identify cells expressing both isoforms.
PTPRC.1 = extract_isoforms('./vignettes/data/1007_1_exons.csv.gz', GENE = 'PTPRC', EXON = 'exon4', filter = 10)