title | date | bibliography | csl |
---|---|---|---|
havic user manual | 21 September 2020 | paper.bib | harvard-the-university-of-melbourne.csl |
Detect Hepatitis A Virus Infection Clusters from virus consensus sequences. havic
allows objective, fast and automated detection of infection clusters from clinical virus samples.
havic
is a bioinformatics pipeline for detecting infection clusters in clinical Hepatitis A Virus (HAV) samples from DNA or cDNA sequence data. The pipeline is written in python3
and uses ruffus
to connect a number of open-source software tools to achieve this task. The user feeds havic
some query files via a yaml
config file, waits for the program to run and then checks the output folder for results. havic
allows fast and objective detection of infection clusters in clinical virus sample sequences. The figure below is a schematic representation of the pipeline.
The user is free to modify parameters of a havic
run through modifying a config file. The above pipeline is summarised briefly here.
- create output directory to receive results files
- QC query sequences
  - collect queries into a single set
  - discard duplicate sequences based on seqIDs
    - seqIDs are sequence headers up until the first space character
    - duplicate seqIDs are reported to file
  - replace 'troublesome' characters in sequence headers (character replacements reported to file)
- map query sequences to reference sequence
  - reverse complement as required
- extract alignment from mapping file
- optionally, trim sequences to target region
- perform Maximum Likelihood phylogenetic inference on alignment
- pick infection clusters based on tree and alignment
- optionally, visualise:
  - phylogenetic tree next to alignment with samples of interest and infection clusters highlighted
  - a heatmap of genetic distances between samples in alignment with infection clusters highlighted
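The duplicate-handling rule in the QC step can be illustrated with a short Python sketch. This is not havic's own code: the function names are hypothetical, and keeping the first occurrence of each seqID is an assumption, since the manual only says that duplicates are discarded and reported.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_duplicate_seqids(records):
    """Keep the first record per seqID (assumption); return (kept, duplicates)."""
    kept, seen, duplicates = [], set(), []
    for header, seq in records:
        # The seqID is the header up until the first space character.
        seq_id = header.split(" ", 1)[0]
        if seq_id in seen:
            duplicates.append(seq_id)  # these would be reported to file
            continue
        seen.add(seq_id)
        kept.append((seq_id, seq))
    return kept, duplicates
```

Two headers such as `>s1 batch A` and `>s1 batch B` share the seqID `s1`, so only one of them would survive this step.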
havic
has been optimised for analysis of the VP1/P2A amplicon, which is the genomic marker recommended by the Hepatitis A Virus Network (HAVNet). The VP1/P2A region is shown here in the context of the HAV genome:
The bed coordinates of the HAVNet VP1/P2A amplicon are 2915 to 3374.
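As a rough illustration of how these coordinates translate into a slice of the genome, the sketch below assumes the usual BED convention (0-based start, end not included). The helper name and the placeholder genome are hypothetical. Under this convention the interval spans 459 bases; if the coordinates are instead read as 1-based and inclusive, the span is 460 bases.

```python
def slice_bed(sequence, start, end):
    """Return sequence[start:end] under the BED convention (0-based, half-open)."""
    if not 0 <= start <= end <= len(sequence):
        raise ValueError("BED interval falls outside the sequence")
    return sequence[start:end]


# Placeholder genome of 7,500 bases, used only to show the coordinate arithmetic.
genome = "N" * 7500
amplicon = slice_bed(genome, 2915, 3374)
print(len(amplicon))  # 459
```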
Installation of havic
requires Miniconda and git. After installing these packages, simply do:
git clone https://github.com/schultzm/havic.git
cd havic
. install.sh
The installation process will take up to 30 minutes with verbose output printed to screen during the install. If the installation fails, read the screen output to determine the error via traceback. Submit installation issues to github. Installation has been tested via continuous integration on CircleCI and tested inside a conda environment. At installation time, a test suite is run. The suite analyses pre-packaged HAV amplicon data, HAV whole genome sequence (WGS) data and measles WGS data. After the install, the user is free to delete the test output folder if desired using rm -r havic/havic_test_results/
.
After installing, activate the conda environment by doing conda activate havic_env
. The most basic usage of havic
is to type havic
on the command line and hit enter/return. If the install has worked correctly, the user should see:
usage: havic [-h] ...
optional arguments:
-h, --help show this help message and exit
Sub-commands help:
detect Detect infection clusters from cDNA or DNA consensus sequences.
version Print version.
test Run havic test using pre-packaged example data.
The program is accessed via three subcommands, with help via the -h
suffix.
havic detect
is the main sub-command. Use this for detecting infection clusters from user-specified cDNA or DNA consensus sequences.
havic version
will print the installed version to stdout
.
havic test
will run havic detect
on a pre-packaged test dataset. If successful, the analyst should see ok
at the end of each test.
The results in this example were obtained using the command havic test
. Let's walk-through this test analysis of HAV VP1/P2A amplicons, using the same config.yaml
as havic test
. With the user's own config.yaml file, the command would be havic detect path/to/config.yaml
.
havic detect
receives instructions from a yaml
config file via the command havic detect path/to/yaml.yaml
. The test.yaml
file from havic/havic/data/hav_amplicon.yaml
is presented below as an example:
---
FORCE_OVERWRITE_AND_RE_RUN:
  Yes # Yes for full re-run, No to start from an interrupted run,
DEFAULT_REFS:
  Yes # Yes if using havic pre-packaged SUBJECT test data, No otherwise
DEFAULT_QUERIES:
  Yes # Yes if using havic pre-packaged QUERY test data, No otherwise
SUBJECT_FILE: # the "SUBJECT" sequence in BLAST terms, i.e., reference genome
  data/NC_001489.fa # relative or absolute paths to fasta file
  # if DEFAULT_REFS is Yes, path will be prefixed to use pre-packaged data
SUBJECT_TARGET_REGION: # the target region of the genome to focus on
  data/havnet_amplicon.fa # in fasta format, relative or absolute paths okay
  # if DEFAULT_REFS is Yes, path will be prefixed to use pre-packaged data
OUTDIR: # the parent directory for the results folders
  havic_test_results/amplicon # relative or absolute path to parent result folder
TREE_ROOT:
  midpoint # sequence name to root iqtree on, or midpoint for midpoint root
RUN_PREFIX:
  HAV_amplicon_
PLOTS:
  Yes # Yes to make plots (slow for large runs), No otherwise.
MAPPER_SETTINGS:
  executable:
    minimap2 # https://github.com/lh3/minimap2
  other:
    -c --cs --secondary=no
  k_mer: # select an odd number, between 3 and 27 inclusive
    -k 5 # 5 has been good for the HAV amplicon seqs, adjust sensibly
IQTREE2_SETTINGS: # http://www.iqtree.org/doc/iqtree-doc.pdf
  executable:
    iqtree # command to call iqtree2
  other:
    '-T AUTO -m MFP+FO --ufboot 1000 -pers 0.2 -nstop 500'
CLUSTER_PICKER_SETTINGS: # https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4228337/
  executable:
    ClusterPicker
  coarse_subtree_support: # divide tree into subtrees at/above this threshold
    70
  fine_cluster_support: # branch support minimum value for clusters of tips
    95
  distance_fraction: # float please, genetic distance
    0.01 # (e.g., 1 SNP in 100 bp = 0.01)
  large_cluster_threshold:
    15
  distance_method:
    valid # options are ambiguity, valid, gap, or abs
HIGHLIGHT_TIP:
  - CmvAXJTIqH # Specify tip name to highlight in final plot
  - CCHkiFhcxG # Specify tip name to highlight in final plot
  - PAvYXhYkLM # Specify tip name to highlight in final plot
TRIM_SEQS: # these sequences will be trimmed to length of SUBJECT_AMPLICON
  - AY644337_55443_seq_1 # these are sequences in the QUERY_FILES
  - RIVM-HAV171_64913_seq_2_MapsOutsideTrimRegionSoEmpty
  - nDNLdjtgha#HashInSeqName
  - '' # give it nothing
  - xyzyx # give it a non-name
QUERY_FILES:
  - data/example1.fa # relative or absolute paths to fasta files
  - data/example2.fa
  - xyz # to test a dud file name
  - '' # to test an empty file name (which would return a folder, not file)
...
Before starting a run, cd
to a working directory (preferably not inside the git cloned folder). Either copy the above yaml
file, or use wget https://raw.githubusercontent.com/schultzm/havic/master/havic/data/hav_amplicon.yaml
. For more information on the yaml
standard, refer to https://yaml.org/.
Let's go through the yaml
step-by-step.
- yaml code blocks open with --- and close with ... (three dots). Ensure your file includes both lines.
- Indents are two spaces; start a new line and indent by a further two spaces to increase a nesting level.
- Special characters or numbers in tip names or folders can be correctly parsed by enclosing values in single-quotes to allow string interpretation of values.
FORCE_OVERWRITE_AND_RE_RUN:
Yes
havic
manages tasks via ruffus
, and out-of-date stages of the pipeline will be re-run as required. To start a new run or force overwrite files in the OUTDIR, set FORCE_OVERWRITE_AND_RE_RUN
to Yes
. Otherwise to start off from the last point, set to No
.
DEFAULT_SUBJECT:
Yes # Yes if using havic pre-packaged SUBJECT (i.e., 'reference') sequence and region test data, No otherwise
DEFAULT_QUERIES:
Yes # Yes if using havic pre-packaged QUERY test data, No otherwise
If DEFAULT_SUBJECT is set to Yes, havic will prefix the file paths in SUBJECT_FILE and SUBJECT_TARGET_REGION with the havic install path (using pkg_resources.resource_filename) for the pre-packaged data. The same logic applies to DEFAULT_QUERIES and QUERY_FILES. To specify a custom path to SUBJECT_FILE and SUBJECT_TARGET_REGION, set DEFAULT_SUBJECT to No. To specify custom QUERY_FILES, set DEFAULT_QUERIES to No.
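The prefixing rule can be pictured with the small sketch below. It is only an illustration under stated assumptions: `resolve_input` and `package_root` are hypothetical names, and a plain path join stands in for the pkg_resources lookup that havic performs.

```python
from pathlib import PurePosixPath


def resolve_input(path, use_packaged, package_root):
    """Return the path that would be read under the assumed prefixing rule.

    package_root is a hypothetical stand-in for the havic install path.
    """
    candidate = PurePosixPath(path)
    if use_packaged:
        # Pre-packaged data: prefix the configured path with the install path.
        return PurePosixPath(package_root) / candidate
    # Custom data: use the configured path exactly as written.
    return candidate
```

With the switch set to Yes, `data/NC_001489.fa` would resolve to a file inside the installed package; with No, the same value is read relative to the working directory.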
SUBJECT_FILE: # the "SUBJECT" sequence in BLAST terms, i.e., reference genome
data/NC_001489.fa # relative or absolute paths to fasta file
# if DEFAULT_SUBJECT is Yes, path will be prefixed to use pre-packaged data
havic
will use this fasta sequence as the subject/reference sequence. If a different reference is required, change the path value. The subject sequence may only be a single consensus sequence.
SUBJECT_TARGET_REGION: # the target region of the genome to focus on
data/havnet_amplicon.fa # in fasta format, relative or absolute paths okay
# if DEFAULT_REFS is Yes, path will be prefixed to use pre-packaged data
This region guides trimming of the alignment. In this example, the VP1/P2A region is the target region. Sequences whose names are listed in TRIM_SEQS will be trimmed to match the boundaries of this region. A sequence is used here instead of a bed coordinates file because the exact boundaries of the target region in the final alignment are not always obvious. After mapping this region to the subject sequence, the boundaries become obvious. Automatic delineation of this region removes the need for the analyst to manually search for and define the boundaries.
OUTDIR: # the parent directory for the results folders
havic_test_results/amplicon # relative or absolute path to parent result folder
Specify the path to the output directory. The files listed in the table below will be sent to this directory as the run progresses.
Stage number | Stage name | File or directory name |
---|---|---|
1 | create_outdir | havic_test_results/amplicon |
2 | compile_input_fasta | HAV_amplicon_duplicate_seqs.txt |
2 | compile_input_fasta | HAV_amplicon_seq_id_replace.tsv |
2 | compile_input_fasta | HAV_amplicon_tmpfasta.fa |
3 | map_input_fasta_to_ref | HAV_amplicon_map.bam |
3 | map_input_fasta_to_ref | HAV_amplicon_map.bam.bai |
4 | bam2fasta | HAV_amplicon_map.bam2fasta.R |
4 | bam2fasta | HAV_amplicon_map.bam2fasta.Rout |
4 | bam2fasta | HAV_amplicon_map.stack.fa |
5 | get_cleaned_fasta | HAV_amplicon_map.stack.trimmed.fa |
6 | run_iqtree | HAV_amplicon_map.stack.trimmed.fa.bionj |
6 | run_iqtree | HAV_amplicon_map.stack.trimmed.fa.ckp.gz |
6 | run_iqtree | HAV_amplicon_map.stack.trimmed.fa.contree |
6 | run_iqtree | HAV_amplicon_map.stack.trimmed.fa.iqtree |
6 | run_iqtree | HAV_amplicon_map.stack.trimmed.fa.log |
6 | run_iqtree | HAV_amplicon_map.stack.trimmed.fa.mldist |
6 | run_iqtree | HAV_amplicon_map.stack.trimmed.fa.model.gz |
6 | run_iqtree | HAV_amplicon_map.stack.trimmed.fa.splits.nex |
6 | run_iqtree | HAV_amplicon_map.stack.trimmed.fa.treefile |
6 | run_iqtree | HAV_amplicon_map.stack.trimmed.fa.ufboot |
6 | run_iqtree | HAV_amplicon_map.stack.trimmed.fa.uniqueseq.phy |
7 | root_iqtree | HAV_amplicon_map.stack.trimmed.fa.rooted.treefile |
8 | clusterpick_from_rooted_iqtree_and_cleaned_fasta | HAV_amplicon_map.stack.trimmed.fa_HAV_amplicon_map.stack.trimmed.fa.rooted_clusterPicks_cluster4_sequenceList.txt |
8 | clusterpick_from_rooted_iqtree_and_cleaned_fasta | HAV_amplicon_map.stack.trimmed.fa_HAV_amplicon_map.stack.trimmed.fa.rooted_clusterPicks.fas |
8 | clusterpick_from_rooted_iqtree_and_cleaned_fasta | HAV_amplicon_map.stack.trimmed.fa.rooted_clusterPicks_list.txt |
8 | clusterpick_from_rooted_iqtree_and_cleaned_fasta | HAV_amplicon_map.stack.trimmed.fa.rooted_clusterPicks_log.txt |
8 | clusterpick_from_rooted_iqtree_and_cleaned_fasta | HAV_amplicon_map.stack.trimmed.fa.rooted_clusterPicks.nwk |
8 | clusterpick_from_rooted_iqtree_and_cleaned_fasta | HAV_amplicon_map.stack.trimmed.fa.rooted_clusterPicks.nwk.figTree |
9 | summarise_cluster_assignments | HAV_amplicon_map.stack.trimmed.fa.rooted_clusterPicks_summarised.txt |
10 | plot_results_ggtree | HAV_amplicon_map.stack.trimmed.fa_SNPcountsOverAlignLength.csv |
10 | plot_results_ggtree | HAV_amplicon_map.stack.trimmed.fa_SNPdists.csv |
10 | plot_results_ggtree | HAV_amplicon_map.stack.trimmed.fa_SNPdists.pdf |
10 | plot_results_ggtree | HAV_amplicon_map.stack.trimmed.fa.rooted.treefile_1percent_divergence_valid_msa.pdf |
10 | plot_results_ggtree | HAV_amplicon_map.stack.trimmed.fa.Rplot.R |
10 | plot_results_ggtree | HAV_amplicon_map.stack.trimmed.fa.Rplot.Rout |
11 | pipeline_printout_graph | pipeline_graph.svg |
TREE_ROOT:
midpoint # sequence name to root iqtree on, or midpoint for midpoint root
For visual representation only, the tree root is set to orientate the plot in HAV_amplicon_map.stack.trimmed.fa.rooted.treefile_1percent_divergence_valid_msa.pdf
. The tree root does not affect cluster definitions.
RUN_PREFIX:
HAV_amplicon_
To facilitate tracking of output files, the user is able to specify a custom prefix for output files.
PLOTS:
Yes # Yes to make plots (slow for large runs), No otherwise.
This setting controls the drawing of output plots. The plots (shown below) are helpful to understand how the multiple sequence alignment affects tree topology, cluster detection and pairwise SNP distances.
Input query sequences should be in fasta format with one sequence per sample. Multiple samples may be included per file, and/or multiple files may be passed to havic
. Query sequences within files will be reverse complemented as necessary during their mapping to the subject/reference. If the query sequence files are named batch1.fa
, batch2.fa
, batch3.fa
, edit the QUERY_FILES
section of the yaml
file as follows:
QUERY_FILES:
- batch1.fa # relative or absolute paths to fasta files
- batch2.fa
- batch3.fa
To trim input queries to the reference VP1/P2A amplicon, list the sequence name of the query under TRIM_SEQS
, otherwise ignore this section.
MAPPER_SETTINGS: # https://github.com/lh3/minimap2
IQTREE2_SETTINGS: # http://www.iqtree.org/doc/iqtree-doc.pdf
CLUSTER_PICKER_SETTINGS: # https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4228337/
Use these variables to set parameters for Minimap2
, IQ-Tree2
and ClusterPicker
. For further information, refer to the user manuals for each software in the above links.
To highlight query sequences in the final plots, list the sequence names under HIGHLIGHT_TIP
in the yaml
, otherwise ignore this section.
HIGHLIGHT_TIP:
- CmvAXJTIqH # Specify tip name to highlight in final plot
- CCHkiFhcxG # Specify tip name to highlight in final plot
- PAvYXhYkLM # Specify tip name to highlight in final plot
Samples listed under HIGHLIGHT_TIP will be annotated in the final tree plot with a red dot, as shown below.
TRIM_SEQS: # these sequences will be trimmed to length of SUBJECT_AMPLICON
- AY644337_55443_seq_1 # these are sequences in the QUERY_FILES
- RIVM-HAV171_64913_seq_2_MapsOutsideTrimRegionSoEmpty
- nDNLdjtgha#HashInSeqName
Sometimes query sequences are whole genome, off target, or longer than the target regions. By supplying those sequence names here, havic
will trim the aligned sequence to the SUBJECT_TARGET_REGION. This list may be long, which is why it is placed toward the end of the yaml
file.
QUERY_FILES:
- data/example1.fa # relative or absolute paths to fasta files
- data/example2.fa
- xyz # to test a dud file name
- '' # to test an empty file name (which would return a folder, not file)
Provide relative or absolute paths to files containing query sequences. Each sample may only consist of a single sequence. Each file may contain one or more samples. Multiple files may be input to havic
via this option.
For larger datasets, when runtimes are prohibitive, it is preferable to perform analyses by subtype. HAV sub-genotypes (or 'subtypes') infecting humans are IA, IB, IIA, IIB, IIIA and IIIB. Typically, the minimum genetic divergence between the subtypes is around 0.076 (i.e., more than 7.6 nucleotides in 100 nucleotides are different between subtypes in pairwise comparisons). havic
can be used to approximately type samples. Here we describe the process to subset data for analysis of a single VP1/P2A query sequence in the context of thousands of VP1/P2A sequences obtained from NCBI GenBank.
First we need to run the analysis in fast mode to obtain the subtype for the query sequence. Within the havic
pipeline, this will require tweaking the settings for IQTree
and, consequently, ClusterPicker
. A run in fast mode might look like:
IQTREE2_SETTINGS: # http://www.iqtree.org/doc/iqtree-doc.pdf
  executable:
    iqtree
  other:
    '-T 4 -m GTR+I+G --fast -bnni -alrt 2000'
CLUSTER_PICKER_SETTINGS: # https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4228337/
  executable:
    ClusterPicker
  coarse_subtree_support: # divide tree into subtrees at/above this threshold
    70
  fine_cluster_support: # branch support minimum value for clusters of tips
    80 # as we are using -alrt in iqtree, use a value of 80 here instead of 95 as we would for --ufboot values
  distance_fraction: # float please, genetic distance
    0.076 # genetic distance between subtypes is roughly equal to this
  large_cluster_threshold:
    15
  distance_method:
    valid # options are ambiguity, valid, gap, or abs
The --fast iqtree command -T 4 -m GTR+I+G --fast -bnni -alrt 2000 is explained more thoroughly in the IQTree2 User Manual. Briefly, compute time is reduced by fixing the number of threads instead of using AUTO to search for the best threading strategy, and by fixing the substitution model instead of searching for the best-fit model. Working with a short amplicon of 460 bp, we can safely choose a small, fixed number of threads (-T 4 in the settings above). Because the highly-parameterised GTR+I+G model is chosen pre-emptively, -bnni is used to compensate for any severe model violations. In --fast mode, we also need an alternative to the --ufboot branch support method, so we use the -alrt single branch test.
To reiterate, our aim for the fast analysis is to find clades that approximately correspond to HAV subtype, and then pick the subtype/clade that contains our novel query sequence. Given that we have used -alrt
as a proxy for branch support, we need to lower our acceptance threshold for branch support. That is, in ClusterPicker
we set fine_cluster_support
to 80 (as opposed to 95 for UFBoot
) to find the well supported clades (please refer to the IQTree manual for further advice on this), and we increase the genetic divergence threshold to 0.076 (7.6%) to cluster the subtypes.
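To make the distance threshold concrete, the sketch below computes a simple pairwise distance as the fraction of differing sites and compares it with distance_fraction. It is not ClusterPicker's code. Counting only sites where both sequences carry A, C, G or T is our reading of the valid distance method and should be treated as an assumption, as should the function names.

```python
VALID_BASES = set("ACGT")


def p_distance(seq_a, seq_b):
    """Fraction of differing sites among sites where both bases are A, C, G or T."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in VALID_BASES and b in VALID_BASES:
            compared += 1
            if a != b:
                differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def within_threshold(seq_a, seq_b, distance_fraction):
    """True when the pair is no more divergent than distance_fraction."""
    return p_distance(seq_a, seq_b) <= distance_fraction
```

Under this sketch, two 100 bp sequences that differ at one site have a distance of 0.01 and pass the default threshold, while a pair differing at eight sites (0.08) fails even the relaxed 0.076 threshold used for subtyping.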
To subset the dataset, after running havic
in fast mode, open the output file <RUN_PREFIX>map.stack.trimmed.fa.rooted_clusterPicks.nwk.figTree
in FigTree
. Search for the sample of interest. Select the appropriate subset to give context to the sample of interest. Note, sample names are sanitised by havic
to remove problematic characters from fasta headers. Original sample names are in <RUN_PREFIX>seq_id_replace.tsv
. Use the list of original sample names to subset the input data and modify the yaml
file accordingly.
After selecting the subset of interest, re-run the analysis in 'slow' mode at least three times.
A re-run in slow mode might look like:
IQTREE2_SETTINGS: # http://www.iqtree.org/doc/iqtree-doc.pdf
  executable:
    iqtree # command to call iqtree2
  other: # threads
    '-T AUTO -m MFP+FO --ufboot 1000 -pers 0.2 -nstop 200'
CLUSTER_PICKER_SETTINGS: # https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4228337/
  executable:
    ClusterPicker
  coarse_subtree_support: # divide tree into subtrees at/above this threshold
    70
  fine_cluster_support: # UFBoot minimum value for clusters of tips
    95
  distance_fraction: # float please, genetic distance
    0.01 # (e.g., 1 SNP in 100 bp = 0.01)
  large_cluster_threshold:
    15
To run the pipeline from a user-specified stage, delete or re-prefix the files in the output directory and set FORCE_OVERWRITE_AND_RE_RUN
to No
. For example, to re-run the pipeline from the ClusterPicker stage, firstly set FORCE_OVERWRITE_AND_RE_RUN
to No
and then delete files shown in the Output files table (above) numbered 8 and larger. Alternatively, re-prefix the files numbered 8 and larger with an underscore. Note, when FORCE_OVERWRITE_AND_RE_RUN
is set to Yes
, all files in OUTDIR
with the prefix as per RUN_PREFIX
will be deleted.
During development of havic
, it was recognised that HAV surveillance will move to whole genome sequencing in the near future. To improve utility of havic
over the coming years, havic
is written to allow the user to pass in any query and subject sequences. Prior to phylogenetic analysis, query headers listed under TRIM_SEQS
will be trimmed to the subject target region given by SUBJECT_TARGET_REGION
. To avoid cropping the alignment, either set the value of SUBJECT_TARGET_REGION
to SUBJECT_FILE
or set TRIM_SEQS
to ''
.
To include the subject sequence in the final alignment, just add the path to the subject file to the list in the QUERY_FILES
block.
As havic
implements ML phylogenetic inference (via IQ-Tree2
), there is a chance of arriving at a local optimum; hence, the analysis should be run multiple times (>3) to more completely explore tree space. Epidemiological conclusions should be based on the consensus of multiple runs and patient metadata (e.g., contact tracing, travel history).
Insertions in alleles relative to the reference will be deleted in the alternative allele during output to alignment. For example, if the REF
has ACCCCCCCCT
and the ALT
has ACCCCCCCCCCT
, the final alignment will be:
>REF
ACCCCCCCCT
>ALT
ACCCCCCCCT
Note that the two extra C bases inserted in ALT
have been deleted in the alignment above
. 4644M1I8M1D3M7I287M1D10760M
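The effect can be reproduced with a small sketch that projects a query onto reference coordinates using a CIGAR string. This is an illustration only, not the code havic uses to extract the alignment; the function name and the CIGAR string `9M2I1M` are assumptions chosen to match the REF/ALT example.

```python
import re


def project_onto_reference(query, cigar):
    """Return the query in reference coordinates: insertions dropped, deletions gapped."""
    out = []
    pos = 0  # current position in the query sequence
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(length)
        if op in "M=X":    # aligned bases: copy them across
            out.append(query[pos:pos + n])
            pos += n
        elif op in "IS":   # insertion or soft clip: skip the query bases
            pos += n
        elif op in "DN":   # deletion or skipped region: pad with gaps
            out.append("-" * n)
        # H and P consume neither query nor reference
    return "".join(out)


ref = "A" + "C" * 8 + "T"
alt = "A" + "C" * 10 + "T"
print(project_onto_reference(alt, "9M2I1M") == ref)  # True
```

Because inserted bases have no column in the reference coordinate system, they cannot appear in an alignment that is stacked against the reference, which is why ALT and REF look identical here.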
Pre-release.
Why the name havic
?
havic
is an acronym for Hepatitis A Virus Infection Cluster (HAVIC). The VIC also acknowledges that the development team hails from Victoria, Australia.
Who is havic
for?
havic
is for molecular epidemiologists working in public health laboratories who want to discover infection clusters in their virus sample cDNA or DNA sequences.
What is havic
for?
havic
is for bioinformatic analysis of Hepatitis A Virus genome sequences. It takes fasta files as input (QUERIES
), maps the QUERIES
to a reference (SUBJECT
), extracts the alignment from the binary alignment map (bam) file, infers a phylogenetic tree from the alignment, and picks infection clusters within the QUERIES
using the tree and alignment as evidence. Theoretically, havic
can be used on other viral genomes though testing on non-HAV samples has so far been limited to Measles and SARS-CoV-2.
How do you define SUBJECT
and QUERY
sequences?
To maintain consistency with already established methods, SUBJECT (BLAST nomenclature) is used interchangeably with REFERENCE, REF or reference allele (.vcf standard). SUBJECT
is the backbone onto which all QUERY
sequences will be mapped. QUERY
(BLAST nomenclature) is used interchangeably with ALTERNATE
or ALT
or alternate allele (.vcf
standard).
Can havic be used with a custom SUBJECT
sequence?
Yes. The havic pipeline is expected to work for any non-segmented virus genome.
Can the SUBJECT
file consist of multiple contigs?
No. The SUBJECT
sequence needs to be a single consensus sequence from a single sample.
Can input QUERY
samples be comprised of multiple consensus sequences from the same sample?
No. A QUERY
file may NOT consist of multiple contigs from the same sample. However, a QUERY
file may consist of multiple sequences, one sequence from each sample.
Can input QUERY
files consist of multiple sequences?
Yes. A QUERY
file may either be a single consensus sequence from a single sample, or multiple samples with a single consensus sequence for each sample. A single QUERY
file can be input to havic
, but the program is designed to accept as many QUERY
files as you wish to feed it.
What's all this talk about consensus sequences? I'm used to talking about contigs.
In the 2020 pandemic era, virus genome sequencing is dominated by tiled-PCR-amplicon Illumina paired-end sequencing and/or Oxford Nanopore Technologies (ONT) long read sequencing. The typically low input nucleic acid quantity from clinical samples means that Illumina sequencing of tiled PCR amplicons is the preferred method for whole genome sequencing of clinical virus samples. Tiled amplicon Illumina sequencing allows mapping of reads from a single sample to a single reference, with the final sample genome sequence called as the consensus variants against the reference, padded by inter-variant reference bases. The final sample sequence is not produced from a de novo assembly of reads, so it is referred to as a consensus sequence. Further, in diagnostic laboratories worldwide, quantitative Reverse Transcriptase Real-time PCR (qRT-PCR, qPCR or sometimes just RT-PCR) is used to detect positive cases. Due to difficulties associated with whole genome sequencing, diagnostic laboratories often use Sanger sequencing of PCR products to call the strain of virus. havic
was originally written to discover and characterise outbreak clusters from short amplicon Sanger sequences, but is now also capable of analysing virus whole genome consensus sequences.
Will havic
work on organisms other than viruses?
Probably. havic has been designed and tested specifically to work on Hepatitis A Virus (HAV, genome size ~7.5kb) genomes. However, havic
should work on any non-segmented virus genome, and successful test analyses have been performed on Measles (~15.9kb) and SARS-CoV-2 (~30kb) genomes. Ultimately it is up to the analyst to decide whether havic
's treatment of the data makes biological sense.
What is the minimum number of sequences that can be analysed using havic
?
The answer is 3. To obtain context sequences for the query sample/s, go to NCBI's GenBank or RIVM's HAVNet. It is recommended to use entrez e-utils for obtaining large numbers of sequences and associated metadata.
Acronym | Expansion |
---|---|
HAV | Hepatitis A Virus |
MSA | Multiple Sequence Alignment |
HAVNet | Hepatitis A Virus Network |
ML | Maximum Likelihood |
NCBI | National Center for Biotechnology Information |
RIVM | Rijksinstituut voor Volksgezondheid en Milieu |
PCR | Polymerase Chain Reaction |
cDNA | complementary DNA |
WGS | Whole genome sequence/ing |