Output
When running the main MINTIE pipeline to completion, the final results for a given sample will be under:
<sample>/<sample>_results.tsv
The following files will also be of interest for loading into IGV (note that these files may include some variants that are subsequently filtered out in the final results output):
- <sample>/novel_contigs.vcf
- <sample>/novel_contigs.bam
See the VCF output page for details of how we implement this format. The tsv file contains the following fields.
The results text file shows a list of all variants of interest, and includes a VAF estimate and information from the equivalence class DE analysis.
Output fields are:
- chr[1-2], pos[1-2], strand[1-2]: position information for ends 1 and 2 for the given variant.
- variant_type: MINTIE's estimated classification of the variant type. See VCF output.
- overlapping_genes: any genes that the contig overlaps. Where gene names are separated by colons, this indicates the gene(s) overlapped by each hard-clipped segment.
- sample: sample this variant belongs to.
- variant_id: an assigned variant ID (matches the VCF file).
- partner_id: the ID of the partner variant; fusions and junctions will have two coordinates and thus two variants.
- VAF: a rough variant allele frequency estimate. See VAF.
- vars_in_contig: the number of variants found on the aligned contig.
- varsize: size of the variant on the genome.
- contig_varsize: variant size on the contig.
- cpos: position on the contig where the variant occurs (note: this is the variant position on the contig irrespective of alignment direction).
- large_varsize: whether the variant size passes the min_clip threshold (default 30).
- is_contig_spliced: whether the contig is spliced in any way (contains gaps).
- spliced_exon: whether this is a novel/extended exon variant with a matching junction.
- overlaps_exon: whether this variant overlaps any reference exon.
- overlaps_gene: whether this variant overlaps any reference gene.
- valid_motif: whether the variant has a valid splice motif. Some variants will not be tested (e.g. TSVs or splice variants at known boundaries).
- logFC: maximum log fold change of the EC(s) associated with this variant vs. controls.
- Pvalue: minimum p value of the EC(s) associated with this variant vs. controls.
- FDR: corrected p value.
- TPM: length-corrected transcript-per-million estimate for the variant contig.
- mean_WT_TPM: mean length-corrected transcript-per-million estimate of all wild-type genes associated with this EC.
- ec_names: unique names of all equivalence classes associated with this variant contig.
- n_contigs_in_ec: number of de novo assembled contigs in the given equivalence class.
- case_reads: total counts for all ECs associated with the contig in the case sample.
- controls_total_reads: total read counts for all ECs associated with the contig in the control samples.
- contig_id: contig name from de novo assembly.
- unique_contig_ID: for the visualisation output (see below), this will be the ID of the modified SuperTranscript for this variant contig.
- contig_len: length of the contig (sequence length).
- contig_cigar: the contig's genome mapping CIGAR string (may have two strings in the case of hard-clips).
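As a minimal sketch of working with these fields, the results file can be read with Python's standard `csv` module. The two rows below are made-up values using only a handful of the columns listed above, the variant IDs and gene names are placeholders, and the FDR cutoff of 0.05 is an illustrative choice rather than anything MINTIE prescribes.

```python
import csv
import io

# Made-up excerpt with a few of the documented columns; a real
# <sample>_results.tsv contains many more columns and real identifiers.
EXAMPLE_TSV = (
    "variant_id\tvariant_type\toverlapping_genes\tVAF\tFDR\n"
    "V1\tFUS\tGENE_A:GENE_B\t0.42\t0.001\n"
    "V2\tDEL\tGENE_C\t0.10\t0.2\n"
)


def read_results(handle):
    """Parse a tab-separated results file into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


def filter_by_fdr(rows, max_fdr=0.05):
    """Keep rows whose FDR is at or below max_fdr (illustrative cutoff)."""
    return [row for row in rows if float(row["FDR"]) <= max_fdr]


rows = read_results(io.StringIO(EXAMPLE_TSV))
significant = filter_by_fdr(rows)
for row in significant:
    # Colons separate the gene(s) overlapped by each hard-clipped segment.
    genes = row["overlapping_genes"].split(":")
    print(row["variant_id"], row["variant_type"], genes, row["VAF"])
```

To read a real file, pass `open("<sample>/<sample>_results.tsv", newline="")` in place of the `io.StringIO` object. Real files may contain empty or non-numeric values in some columns, so the `float` conversion may need guarding.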
MINTIE applies a number of rules to classify variants and to decide which variants are of interest. Firstly, the de novo assembled contig must have a length of at least 100, and the contig ECC must have a CPM > 0.1 and, if using controls, be significant with logFC > 5. Secondly, upon alignment, the contig must map to the genome with at least 30bp and 30% of its total length (by default). All these thresholds are defaults and can be adjusted in the params.txt file. Once the contig passes all these checks, more variant-specific criteria are applied:
variant | clipping | spliced contig | varsize | overlaps gene | overlaps exon | spliced exon | valid motif | special criteria |
---|---|---|---|---|---|---|---|---|
FUS | hard | - | >min_clip | Y | - | - | - | - |
IGR | hard | - | >min_clip | Y | - | - | - | within same gene |
UN | soft | - | >min_clip | Y | - | - | - | - |
DEL | - | Y | >min_gap | Y | Y | - | - | |
INS | - | Y | >min_gap | Y | Y | - | - | |
AS | - | Y | - | Y | Y | - | - | novel splice btw. 2 known sites |
NE/EE | - | Y | >min_clip | Y | N | Y | Y | |
NE (intergenic) | - | Y | >min_clip | N | N | Y | - | |
NEJ/PNJ (truncated exon/novel intron) | - | Y | >min_gap | Y | Y | Y | Y | |
RI | - | Y | >min_clip | Y | N | - | - | |
See variant types. In the table '-' indicates that the criterion is not checked. 'Spliced contig' refers to splicing of any kind occurring in the alignment (at least one gap in the alignment). 'Spliced exon' refers to a novel block (EE/NE) with a matched splice junction (NEJ/PNJ), or truncated exon/novel intron variant.
One other special criterion is that any variant directly at a fusion (i.e. hard-clip) boundary is automatically retained as a variant of interest, regardless of other criteria.
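The filtering logic described in this section can be sketched in Python as follows. This is an illustration of the rules as written on this page, not MINTIE's actual implementation: all function, field and parameter names are hypothetical (they are not the names used in params.txt), the "special criteria" column is not modelled, and `min_gap` has no default here because this page does not state one.

```python
from collections import namedtuple


def passes_contig_checks(contig_len, cpm, aligned_bp, log_fc=None, significant=None,
                         min_contig_len=100, min_cpm=0.1, min_logfc=5,
                         min_aligned_bp=30, min_aligned_frac=0.3):
    """Contig-level checks; pass log_fc and significant only when controls were used."""
    if contig_len < min_contig_len or cpm <= min_cpm:
        return False
    if log_fc is not None and (not significant or log_fc <= min_logfc):
        return False
    # The contig must align with at least 30bp AND 30% of its length (defaults).
    return aligned_bp >= min_aligned_bp and aligned_bp >= min_aligned_frac * contig_len


# One entry per table row; None means "not checked" ('-' in the table).
Rule = namedtuple(
    "Rule",
    "clipping spliced_contig size overlaps_gene overlaps_exon spliced_exon valid_motif",
)
RULES = {
    "FUS": Rule("hard", None, "min_clip", True, None, None, None),
    "IGR": Rule("hard", None, "min_clip", True, None, None, None),
    "UN": Rule("soft", None, "min_clip", True, None, None, None),
    "DEL": Rule(None, True, "min_gap", True, True, None, None),
    "INS": Rule(None, True, "min_gap", True, True, None, None),
    "AS": Rule(None, True, None, True, True, None, None),
    "NE/EE": Rule(None, True, "min_clip", True, False, True, True),
    "NE (intergenic)": Rule(None, True, "min_clip", False, False, True, None),
    "NEJ/PNJ": Rule(None, True, "min_gap", True, True, True, True),
    "RI": Rule(None, True, "min_clip", True, False, None, None),
}
FLAG_FIELDS = ("clipping", "spliced_contig", "overlaps_gene",
               "overlaps_exon", "spliced_exon", "valid_motif")


def meets_variant_criteria(variant_type, obs, min_gap, min_clip=30,
                           at_fusion_boundary=False):
    """Check an observed variant (a dict) against its table row."""
    if at_fusion_boundary:
        # Variants directly at a hard-clip boundary are always retained.
        return True
    rule = RULES[variant_type]
    for field in FLAG_FIELDS:
        required = getattr(rule, field)
        if required is not None and obs[field] != required:
            return False
    if rule.size is not None:
        threshold = min_clip if rule.size == "min_clip" else min_gap
        if obs["varsize"] <= threshold:
            return False
    return True


# Hypothetical observation for a fusion candidate; min_gap=7 is a placeholder value.
fusion = dict(clipping="hard", spliced_contig=False, varsize=120, overlaps_gene=True,
              overlaps_exon=True, spliced_exon=False, valid_motif=False)
print(passes_contig_checks(200, 1.0, 150, log_fc=6.0, significant=True))
print(meets_variant_criteria("FUS", fusion, min_gap=7))
```

Encoding the table as data keeps each row visible next to its source and makes the '-' entries explicit as `None`, so a change to one variant type does not touch the checking logic.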
The variant allele frequency (VAF) measures the proportion of supporting reads for the variant, over the total reads. A true VAF calculation is difficult to obtain without aligning reads to the genome. We therefore calculate an approximate VAF using Salmon's transcript quantification output. We obtain transcripts-per-million (TPM) estimates of the variant contig, as well as the mean TPM of all wild-type (WT) genes associated with all ECs associated with the given contig. TPMs are obtained from tximport. The VAF calculation is then simply the contig TPM, divided by the contig TPM plus the mean WT TPM. Keep in mind that this is a rough estimate of the VAF, and will not be highly accurate.
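The calculation described above amounts to a single ratio. The sketch below uses hypothetical function and argument names, and returning `None` when both TPM values are zero is a choice made for this example, not documented MINTIE behaviour.

```python
def approximate_vaf(contig_tpm, mean_wt_tpm):
    """Approximate VAF: contig TPM / (contig TPM + mean wild-type TPM)."""
    total = contig_tpm + mean_wt_tpm
    if total == 0:
        # No expression on either side; the ratio is undefined.
        return None
    return contig_tpm / total


# A contig at 5 TPM against a mean wild-type TPM of 15 gives 5 / (5 + 15).
print(approximate_vaf(5.0, 15.0))  # 0.25
```

In terms of the results fields, `contig_tpm` corresponds to the TPM column and `mean_wt_tpm` to the mean_WT_TPM column.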
NOTE: this refers to running MINTIE using the experimental visualisation step. Please see Visualisation.
Contigs containing cryptic variants can be visualised using IGV. MINTIE creates a variant SuperTranscript (ST) for each novel contig, inserting any new variants into the reference ST. To load these, load the following as a reference genome in IGV:
<sample1>_<sample2>...<sampleN>_collated/supersupertranscript.fasta
Under the alignment sub-folder, you will find these files (load them into IGV):
- <sample>_hisatAligned.bam: reads aligned to the variant STs
- <sample>_novel_contigs_st_aligned.bam: contigs aligned to the variant STs
Back in your sample folder, you will also need to load these files:
- <sample>/<sample>_blocks_supertranscript.bed: marks merged exonic regions (the block number is not necessarily the reference exon number)
- <sample>/<sample>_genes_supertranscript.bed: marks where the gene would sit on the ST reference.
You can use the expected_ST_alignment and real_ST_alignment fields in your results files to check each ST of interest.