Marek Cmero edited this page Nov 30, 2022 · 8 revisions

Frequently Asked Questions

I don't have any controls, can I still run MINTIE?

Yes! See Running MINTIE without controls.

Why couldn't MINTIE find my variant?

If you're looking for a specific variant that you know is in your data, and MINTIE doesn't find it, there are a number of possible reasons for this. For example:

  1. Your variant is not novel: we use the CHESS reference, which is a comprehensive set of normal transcripts. Make sure to cross-check your expected novel variant against this reference.
  2. Your variant is lowly expressed: your variant may fail to be assembled, or may be filtered out at the DE step, if it is lowly expressed.
  3. Your variant is present in the controls: checking the control read alignments in the area of interest may indicate this. Even if only a small number of controls have this variant, it may be missed.
  4. A common variant has obscured your variant of interest: in rare cases, an assembled contig containing your variant may also contain a proximal common SNP or INDEL. If this variant is also in your controls, reads may be assigned to the variant EC over the reference transcript. In such cases, the variant EC may not be significant and will be filtered out.

To find the exact reason, some digging through the results is required. The first thing to try is to check your novel_contigs_info.tsv file. Your variant may be present but not marked as a 'variant of interest'. If this is the case, some of the criteria for marking the variant as novel have not been met; looking at the other boolean columns may indicate why. If you have ruled out reason 1, MINTIE can be run without the DE step to get around any potential issues with the controls (see Running MINTIE without controls).

This output will allow you to check whether the variant was assembled. Open your aligned_contigs_against_genome.bam file in IGV, and check the target variant location. If your variant was not assembled, this could be due to low expression. You could also try different kmer sizes, or a different assembler.

If your variant is assembled, note the contig ID (IGV will tell you the read name if you click on the target contig; it should look like k49_123 if using SOAPdenovotrans). Now check your original (DE enabled) output for why your variant was not found:

  1. ec_tx_table.txt: check if the last column is 'true' for any records where your contig ID appears. If not, your variant matches the reference. If so, go to step 2.
  2. counts_summary.txt: check that the last column is 'true' for any fields where your contig appears. If not, your contig was filtered out due to low CPM. If so, go to step 3.
  3. full_edgeR_results.txt: check that the last two fields are 'true' for any rows containing your contig ID. If not, your contig was either filtered out by FDR significance threshold, or because it was below the logFC threshold. If your contig passed DE, go to step 4.
  4. annotate.log: check if your contig ID appears. The log file should state a reason why the contig was filtered out.
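
The first three checks above can be scripted. The following is a minimal Python sketch, assuming the files are tab-delimited and that the pass/fail flags are written as 'true'/'TRUE' in the trailing columns; the exact file layouts are not described here, so check them against your own output. The function names and the example rows are made up for illustration.

```python
import re


def rows_for_contig(lines, contig_id):
    """Return tab-split rows that mention contig_id as a whole token.

    The boundary checks stop k49_123 from also matching k49_1234.
    """
    pattern = re.compile(
        r"(?<![A-Za-z0-9_])" + re.escape(contig_id) + r"(?![A-Za-z0-9_])"
    )
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if pattern.search(line):
            rows.append(line.split("\t"))
    return rows


def last_fields_true(row, n_fields=1):
    """True if the last n_fields values of a row all read 'true' (any case)."""
    return all(field.strip().lower() == "true" for field in row[-n_fields:])


def any_row_passes(lines, contig_id, n_fields=1):
    """True if at least one row mentioning the contig has all trailing flags set."""
    return any(
        last_fields_true(row, n_fields) for row in rows_for_contig(lines, contig_id)
    )


# Made-up rows standing in for a real file; the layout is an assumption.
example = [
    "contig\tec\tflag",
    "k49_123\tEC1\tTRUE",
    "k49_1234\tEC2\tFALSE",
]
print(any_row_passes(example, "k49_123"))
```

To use it on real output, pass an open file handle as `lines`, for example `any_row_passes(open("full_edgeR_results.txt"), "k49_123", n_fields=2)` for step 3, where the last two fields are checked.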

If none of these steps work, please report an issue and we can look into it!

MINTIE gives me too many results! How do I prioritise my variants?

In some cases, MINTIE may return a large number of results, especially when not using controls. To address this, MINTIE has some standard ways of filtering in the params.txt file, although there is some risk of removing true positives when using these features:

  • Gene lists: if you are interested in only a subset of genes, you can supply a file of genes (one per line) and specify its path in the parameters file.
  • Variant type: if you are, for example, uninterested in splice variants, you can filter on certain variant types. These can be specified as a comma-separated list.
  • CPM & logFC: the CPM cutoff and minimum logFC can be changed.

These measures may still be inadequate, so MINTIE provides a number of extra fields that can be used to filter the results down using something like R, Python or Excel:

  • VAF: this is a rough variant allele frequency estimate. Filtering beyond a low fraction (e.g. >0.1) may help refine results.
  • case_reads: similar to filtering CPM, a higher threshold can be added to the number of reads within the variant EC. >10 for example might be a good threshold.
  • control_reads: if controls were used, try filtering on low control counts, for example <10 control reads.
  • vars_in_contig: this is the number of variants found on the given de novo assembled contig. High numbers of variants likely point to poor alignment. Filtering, for example, <10 for vars_in_contig may help remove false positives.

Sorting the list by p-value (if controls were used) or case_reads (if no controls were used) may help with prioritisation.
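
As a rough illustration of the filtering and sorting described above, here is a minimal Python sketch using only the standard library. The column names VAF, case_reads, control_reads and vars_in_contig are the fields listed above; the contig_id column, the example rows, the function name and the default thresholds are assumptions for illustration, and the thresholds simply mirror the example values given in the list.

```python
import csv
import io

# Made-up results table; real MINTIE output has more columns.
EXAMPLE_TSV = (
    "contig_id\tVAF\tcase_reads\tcontrol_reads\tvars_in_contig\n"
    "k49_1\t0.45\t120\t0\t1\n"
    "k49_2\t0.05\t300\t2\t1\n"
    "k49_3\t0.30\t8\t0\t2\n"
    "k49_4\t0.60\t40\t25\t1\n"
    "k49_5\t0.20\t15\t1\t14\n"
    "k49_6\t0.25\t55\t3\t2\n"
)


def filter_variants(rows, min_vaf=0.1, min_case_reads=10,
                    max_control_reads=10, max_vars_in_contig=10,
                    has_controls=True):
    """Keep rows passing every threshold, sorted by case_reads (highest first)."""
    kept = []
    for row in rows:
        if float(row["VAF"]) <= min_vaf:
            continue  # allele frequency estimate too low
        if float(row["case_reads"]) <= min_case_reads:
            continue  # too few reads supporting the variant EC
        if has_controls and float(row["control_reads"]) >= max_control_reads:
            continue  # too many control reads
        if float(row["vars_in_contig"]) >= max_vars_in_contig:
            continue  # many variants on one contig suggests poor alignment
        kept.append(row)
    kept.sort(key=lambda r: float(r["case_reads"]), reverse=True)
    return kept


rows = list(csv.DictReader(io.StringIO(EXAMPLE_TSV), delimiter="\t"))
for row in filter_variants(rows):
    print(row["contig_id"], row["VAF"], row["case_reads"])
```

If controls were used, you may prefer to sort on the p-value column of your results instead of case_reads. Rows with missing values would need handling before the float conversions.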

Of course, there is always a risk of removing true positives when using these strategies, especially if your variant of interest is lowly expressed, so caution is advised.

How do the config parameters k-mer (Ks), minimum read length (min_read_length) and minimum contig length (min_contig_len) influence sensitivity?

Using a range of k-mer sizes, all shorter than your read length, is recommended. In general, shorter k-mers may aid in reconstructing contigs as they are more likely to appear in your reads. The consequence of this is that shorter sequences are more likely to be repeated in your reference transcriptome. Shorter k-mers are thus likely to create more false positive results, while longer k-mers may result in the variant contig not being assembled. We therefore recommend a range of k-mer values, calibrated to your read length.

Minimum read length is used for trimming reads and should be greater than your smallest k-mer value. Longer reads will improve your assembly in general, but only up to a point (see Chang et al.), so keep this parameter as close to your read length as possible, while considering the base quality profile of your reads.

Minimum contig length filters out anything below this size from the transcriptome assembly. This should be set higher than your read size. Setting this parameter too high may remove true positive transcripts and decrease sensitivity, particularly with SOAPdenovotrans for example, which can assemble short variant contigs. Setting this to something like 1.5x your read length (of a single read) would be a good rule of thumb.
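
The rules of thumb in this section can be written down as a small sanity check. This is a sketch only: the function names are hypothetical, it is not part of MINTIE, and the k-mer values in the example are arbitrary rather than recommended settings.

```python
import math


def check_assembly_params(read_length, ks, min_read_length, min_contig_len):
    """Return a list of warnings for settings that break the rules of thumb."""
    warnings = []
    if any(k >= read_length for k in ks):
        warnings.append("all k-mer sizes should be shorter than the read length")
    if min_read_length <= min(ks):
        warnings.append("min_read_length should be greater than the smallest k-mer")
    if min_contig_len <= read_length:
        warnings.append("min_contig_len should be higher than the read length")
    return warnings


def suggested_min_contig_len(read_length):
    """Roughly 1.5x the length of a single read, rounded up."""
    return math.ceil(1.5 * read_length)


# Example for 100 bp reads with an arbitrary range of k-mer sizes.
print(check_assembly_params(100, [49, 61, 73], 90, 150))
print(suggested_min_contig_len(100))
```

An empty list means none of the three rules is violated; it does not mean the settings are optimal for your data.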

Can I run MINTIE using a reference other than hg38?

Yes, please see here. There's one instance of hg38 hard-coding that I really should fix, so for now you'll have to turn off motif checking for this to work (splice_motif_mismatch=4).

Does MINTIE work on my favourite non-human organism?

In theory, there is nothing human-specific about MINTIE's design, other than that the default references are human. That said, we have not tested MINTIE in other organisms, so caveat emptor. You will have to use a custom reference (see above) and make sure you turn off motif checking, otherwise you'll likely get an error.