Skip to content

Use case test

Merly Escalona edited this page Feb 15, 2018 · 10 revisions

Effect of the variation of depth of coverage on SNP recovery

One of the possible uses of NGSphy might be the optimization of depth of coverage for a given purpose. In this case we designed a small experiment to visualize the potential trade-off between NGS coverage and SNP discovery. In this case we used NGSphy to simulate a single sequence alignment from a given gene-tree (Figure 1) and from it we generated 100 NGS datasets at different depths of coverage.

Use case test tree

Figure 1: Gene-tree used for the use case simulation. It represents four species with two individuals per species. Numbers above the branches represent branch lengths, in expected number of substitutions.

The sequence alignments were simulated under a JC69 substitution model (Jukes and Cantor, 1969), with equal base frequencies and a length of 10.000 bp. The Illumina runs generated 150 bp paired-end reads for all individuals at a coverage of 2X, 10X, 50X, 100X and 200X (100 replicates for each level). The detailed settings are shown in Table 1.

Table 1: NGSphy settings parameters. varies for 2, 10, 20, 100 and 200x.

Block Option Value
general path ./output/<coverage>
output_folder_path NGSphy_test2_<coverage>
ploidy 1
data inputmode 1
gene_tree_file files/supp.test2.tree
indelible_control_file files/control.supp.test2.txt
coverage experiment <coverage>
ngs-reads-art fcov true
l 150
m 250
p true
q true
s 50
ss HS25
execution environment bash
runART on
threads 2

Mapping was carried out using the MEM algorithm of BWA Version 0.7.7-r441 (Li and Durbin, 2009), against a randomly chosen reference (sequence 1_0_2 in all cases). Following a standardized best-practices pipeline (Van der Auwera and Carneiro, 2013) mapped reads from all replicates were independently processed, performing local realignment around indels and removing PCR duplicates. Variant calling was made with GATK (Mckenna et al., 2010), using the single-sample variant-calling joint-genotyping framework using the HaplotypeCaller and GenotypeVCF modules. SNP calls from each replicate were compared to the true variant sites, showing that SNP recovery increased very rapidly until 10X, when almost all true variants were called (Figure 2).

Figure 2: Called variants and true positives (mean and Q1/Q3) at different depths of coverage. The true number of SNPs is 386. At 100x and 200x only 1 site (average) is not recovered.

Execution

The script use to carry out this test can be found under ngsphy/manuscript/supp.material/scripts/supp.test2.sh.

To run this test you need the following files:

To execute this script it is required to have installed:

Steps

  1. Following the script file, first we have to state:
    • Where the NGSphy repository is located
    • Where will the output be written
    • The name of the output folder
    • A random seed number
    • Paths for GATK and PICARD jar packages
    • Coverage levels
  2. Organize the data to run the simulation and the analysis, copying the setting files, reference sequence and the reference allele file into the analysis folder.
  3. Extract the true variants from the dataset, and we do that using NGSphy with the NGS read counts
  4. Select the sequence with the label 1_0_0, and put it into a separate file, this will be the reference sequence.
  5. Run 100 replicates of NGSphy with the different coverage levels, each:
  6. Index the reference with bwa,samtools and Picard:
  7. Map all the datasets to the selected reference:
  8. Sort and covert into BAM file
  9. Mark duplicates
  10. Realignm around indels
  11. Single sample variant calling with each dataset, using HaplotypeCaller mode GCVF
  12. Call joint genotypes with GenotypeGVCFs
  13. Count discovered variants
  14. Compare discovered variatns with true variants

References

  • Jukes, T.H. and Cantor, C.R. (1969) Evolution of Protein Molecules. In, Mammalian Protein Metabolism., pp. 21–132.
  • Li, H. and Durbin, R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754–1760.
  • Mckenna, A. et al. (2010) The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20, 1297–1303.
  • Van der Auwera, G.A. and Carneiro,M.O. (2013) From FastQ data to high‐confidence variant calls: the genome analysis toolkit best practices pipeline. Current protocols in Bioinformatics.