Update README.md
KevinKuchinski authored Sep 4, 2024
1 parent 63148af commit 8eda8ad
Showing 1 changed file with 41 additions and 29 deletions.
70 changes: 41 additions & 29 deletions README.md
@@ -1,13 +1,13 @@
# FluViewer

FluViewer is an automated pipeline for generating influenza A virus (IAV) genome sequences from FASTQ data. If provided with a sufficiently diverse and representative database of IAV reference sequences, it can generate sequences regardless of host and subtype without any human intervention required.
FluViewer is an automated pipeline for generating influenza A and B virus (flu) genome sequences from FASTQ data. If provided with a sufficiently diverse and representative database of reference sequences, it can generate sequences regardless of host and subtype/lineage without human intervention.

Here is a brief description of the FluViewer process. First, the provided reads are normalized and downsampled using a kmer-based approach to reduce any excessive coverage of certain genome regions. Next, the normalized/downsampled reads are assembled de novo into contigs. The contigs are then aligned to a database of IAV reference sequences. These alignments are used to trim contigs and roughly position them within their respective genome segment. Afterwards, a multiple sequencing alignment in conducted on the trimmed/positioned contigs, generating scaffold sequences for each IAV genome segment. Next, these scaffolds are aligned to the IAV reference sequence database to find their best matches. These best matches are used to fill in any missing regions in the scaffold, creating mapping references. The normalized/downsampled reads are mapped to these mapping references, then variants are called and the final consensus genomes are produced.
Here is a brief description of the FluViewer process. First, the provided reads are assembled de novo into contigs. The contigs are then aligned to a database of flu reference sequences. These alignments are used to trim contigs and roughly position them within their respective genome segments. Afterwards, a multiple sequence alignment is conducted on the trimmed/positioned contigs, generating scaffold sequences for each genome segment. Next, these scaffolds are aligned to the reference sequence database to find their best matches. These best matches are used to fill in any missing regions in the scaffolds, thereby creating mapping references. The provided reads are mapped to these mapping references, then variants are called, low-coverage positions are masked, and final consensus sequences are generated for each genome segment.
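
The low-coverage masking step at the end of the process can be pictured with a small sketch. This is not FluViewer's own code: the function name `mask_low_coverage`, the `min_depth` value, the use of `N` as the mask character, and the depth values are all assumptions made for illustration.

```python
def mask_low_coverage(consensus: str, depths: list[int],
                      min_depth: int, mask_char: str = "N") -> str:
    """Replace bases whose sequencing depth is below min_depth.

    Hypothetical helper: FluViewer's real masking logic may differ.
    """
    if len(consensus) != len(depths):
        raise ValueError("consensus and depths must have the same length")
    return "".join(
        base if depth >= min_depth else mask_char
        for base, depth in zip(consensus, depths)
    )


# Made-up consensus and per-position depths, for illustration only.
masked = mask_low_coverage("ATGCATGC", [30, 30, 5, 2, 30, 30, 30, 1], min_depth=20)
print(masked)  # ATNNATGN
```

Positions 3, 4, and 8 fall below the assumed depth threshold of 20, so they are the only ones masked.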

## Installation
1. Create a virtual environment and install the necessary dependencies using the YAML file provided in this repository. For example, if using conda:
```
conda env create -n FluViewer -f FluViewer_v_0_1_x.yaml
conda env create -n FluViewer -f FluViewer_v_0_2_x.yaml
```

2. Activate the FluViewer environment created in the previous step. For example, if using conda:
@@ -20,7 +20,7 @@ conda activate FluViewer
pip3 install FluViewer
```

4. Download and unzip the default FluViewer DB (FluViewer_db.fa.gz) provided in this repository. Custom DBs can be created and used as well (instructions below).
4. Download and unzip the default FluViewer DB (FluViewer_db_v_0_2_0.fa.gz) provided in this repository. Custom DBs can be created and used as well (instructions below).

## Usage
```
@@ -52,75 +52,87 @@ FluViewer -f <path_to_fwd_reads> -r <path_to_rev_reads> -d <path_to_db_file> -n

-V : Variant allele fraction threshold for masking ambiguous variants (float, default = 0.25, min = 0, max = 1)

-N : Target depth for pre-normalization of reads (int, default = 200, min = 1)
-L : Coverage depth limit for variant calling (int, default = 100, min = 1)

-L : Coverage depth limit for variant calling (int, default = 200, min = 1)
-t : Length tolerance for consensus sequences (percentage, default = 1, min = 0, max = 100)

-T : Threads used for BLAST alignments (int, default = 1, min = 1)


<b>Optional flags:</b>

-m : Allow analysis of mixed infections

-g : Disable garbage collection and retain intermediate analysis files


## FluViewer Database
FluViewer requires a curated FASTA file "database" of IAV reference sequences. Headers for these sequences must be formatted and annotated as follows:
FluViewer requires a curated FASTA file "database" of flu reference sequences. Headers for these sequences must be formatted and annotated as follows:
```
>unique_id|strain_name(strain_subtype)|sequence_segment|sequence_subtype
>unique_id|strain_name(strain_subtype)|sequence_species|sequence_segment|sequence_subtype
```
Here are some example entries:
```
>CY230322|A/Washington/32/2017(H3N2)|PB2|none
>CY230322|A/Washington/32/2017(H3N2)|A|PB2|none
TCAATTATATTCAGCATGGAAAGAATAAAAGAACTACGGAATCTAATGTCGCAGTCTCGCACTCGCGA...
>JX309816|A/Singapore/TT454/2010(H1N1)|HA|H1
>JX309816|A/Singapore/TT454/2010(H1N1)|A|HA|H1
CAAAAGCAACAAAAATGAAGGCAATACTAGTAGTTCTGCTATATACATTTACAACCGCAAATGCAGACA...
>MH669720|A/Iowa/52/2018(H3N2)|NA|N2
>MH669720|A/Iowa/52/2018(H3N2)|A|NA|N2
AGGAAAGATGAATCCAAATCAAAAGATAATAACGATTGGCTCTGTTTCTCTCACCATTTCCACAATATG...
>EPI_ISL_413816|B/Iowa/08/2020(yamagata)|B|PB2|none
GTTTTCAAGATGACATTGGCTAAAATTGAATTGTTAAAGCAACTGTTAAGGGACAATGAAGCCAAAACA...
>EPI_ISL_413816|B/Iowa/08/2020(yamagata)|B|HA|Yamagata
ATTTTCTAATATCCACAAAATGAAGGCAATAATTGTACTACTCATGGTAGTAACATCCAATGCAGACCG...
>EPI_ISL_413816|B/Iowa/08/2020(yamagata)|B|NA|Yamagata
ATCTTCTCAAAAACTGAGGCAAATAGGCCAAAAATGAACAATGCTACCTTCAACTATACAAACGTTAAC...
```
For HA and NA segments, strain_subtype should reflect the HA and NA subtypes of the isolate (eg H1N1), but sequence_subtype should only indicate the HA or NA subtype of the segment sequence of the entry (eg H1 for an HA sequence or N1 for an NA sequence).
For influenza A and influenza B viruses, strain_subtype should reflect the HA/NA subtype or lineage of the isolate (e.g. H1N1 or Yamagata).
For HA segments of influenza A viruses, sequence_subtype should reflect only the HA subtype of the isolate (e.g. H3 for the HA segment of an H3N2 virus). Similarly, for NA segments of influenza A viruses, sequence_subtype should reflect only the NA subtype of the isolate (e.g. N2 for the NA segment of an H3N2 virus). For HA and NA segments of influenza B viruses, sequence_subtype should reflect the lineage of the isolate (e.g. Yamagata).

For internal segments (i.e. PB2, PB1, PA, NP, M, and NS), strain_subtype should reflect the HA/NA subtypes of the isolate, but 'none' should be entered for sequence_subtype. If strain_subtype is unknown, 'none' should be entered there as well.
For internal segments (i.e. PB2, PB1, PA, NP, M, and NS), strain_subtype should reflect the subtypes/lineage of the isolate, but 'none' should be entered for sequence_subtype.

FluViewer will only accept reference sequences composed entirely of uppercase canonical nucleotides (i.e. A, T, G, and C).
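
Before building a custom DB, it can help to check entries against the five-field header format and the uppercase A/T/G/C rule described above. The following is a minimal sketch, not FluViewer's own validation: `DbHeader`, `parse_db_header`, and `is_valid_sequence` are hypothetical names, and the segment list is taken from the segment names mentioned in this README.

```python
import re
from typing import NamedTuple


class DbHeader(NamedTuple):
    unique_id: str
    strain_name: str
    strain_subtype: str
    species: str
    segment: str
    sequence_subtype: str


# >unique_id|strain_name(strain_subtype)|sequence_species|sequence_segment|sequence_subtype
HEADER_RE = re.compile(
    r"^>(?P<unique_id>[^|]+)\|"
    r"(?P<strain_name>[^|]+)\((?P<strain_subtype>[^()|]*)\)\|"
    r"(?P<species>[^|]+)\|"
    r"(?P<segment>[^|]+)\|"
    r"(?P<sequence_subtype>[^|]+)$"
)

# Segment names as listed in this README.
SEGMENTS = {"PB2", "PB1", "PA", "HA", "NP", "NA", "M", "NS"}


def parse_db_header(line: str) -> DbHeader:
    """Split one FASTA header into its fields, or raise ValueError."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"malformed header: {line!r}")
    header = DbHeader(**match.groupdict())
    if header.segment not in SEGMENTS:
        raise ValueError(f"unknown segment: {header.segment!r}")
    return header


def is_valid_sequence(seq: str) -> bool:
    """True if seq is non-empty and uses only uppercase A, T, G, C."""
    return bool(seq) and set(seq) <= set("ATGC")


header = parse_db_header(">JX309816|A/Singapore/TT454/2010(H1N1)|A|HA|H1")
print(header.segment, header.sequence_subtype)  # HA H1
```

A header in the older four-field format (without the species field) does not match the pattern and raises `ValueError`, which is a quick way to spot entries that still need updating.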

## FluViewer Output
FluViewer generates four main output files for each library:
1. A FASTA file containing consensus sequences for the IAV genome segments
2. A sorted BAM file with reads mapped to the mapping references generated for that library (the mapping reference is also retained)
1. A FASTA file containing consensus sequences for each genome segment
2. A sorted BAM file with reads mapped to the mapping references generated for that library
3. A report TSV file describing segment, subtype, and sequencing metrics for each consensus sequence generated
4. Depth of coverage plots for each segment

Headers in the FASTA file have the following format:
Headers in the consensus sequences FASTA file have the following format:
```
>output_name|segment|subject
>output_name|species|segment|subject|
```
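
For downstream scripts, the consensus header can be split into its fields. This is a minimal sketch under stated assumptions: `parse_consensus_header` is a hypothetical helper, the sample header is made up, and the code tolerates the trailing `|` shown in the format above. Because the README does not say whether `subject` can itself contain `|`, only the first three separators are split.

```python
def parse_consensus_header(line: str) -> dict[str, str]:
    """Split '>output_name|species|segment|subject|' into named fields."""
    text = line.strip()
    if not text.startswith(">"):
        raise ValueError(f"not a FASTA header: {line!r}")
    body = text[1:].rstrip("|")          # drop '>' and any trailing '|'
    fields = body.split("|", 3)          # keep any '|' inside subject intact
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    return dict(zip(("output_name", "species", "segment", "subject"), fields))


# Hypothetical header, for illustration only.
parsed = parse_consensus_header(">sample01|A|HA|H3|")
print(parsed["segment"], parsed["subject"])  # HA H3
```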


The report TSV files contain the following columns:

<b>seq_name</b> : the name of the consensus sequence described by this row

<b>segment</b> : IAV genome segment (PB2, PB1, PA, HA, NP, NA, M, NS)
<b>seq_length</b> : the estimated length of the genome segment described by this row

<b>subtype</b> : HA or NA subtype ("none" for internal segments)
<b>reads_mapped</b> : the number of sequencing reads mapped to this segment

<b>reads_mapped</b> : the number of sequencing reads mapped to this segment (post-normalization/downsampling)
<b>scaffold_completeness</b> : the percentage of nucleotide positions in the genome segment that were present in the scaffold assembled from the provided reads

<b>seq_length</b> : the length (in nucleotides) of the consensus sequence generated by FluViewer
<b>consensus_completeness</b> : the percentage of nucleotide positions that were sequenced to sufficient depth in the consensus sequence generated for this genome segment

<b>scaffold_completeness</b> : the number of nucleotide positions in the scaffold that were assembled from the provided reads (post-normalization/downsampling)
<b>low_cov_perc</b> : the percentage of nucleotide positions that were masked in the consensus sequence due to insufficient sequencing depth (determined by the depth threshold set by -D)

<b>consensus_completeness</b> : the number of nucleotide positions in the consensus with a succesful base call (e.g. A, T, G, or C)
<b>ambig_perc</b> : the percentage of nucleotide positions that were masked in the consensus sequence generated for this genome segment because of mixed base calls (determined by the VAF thresholds set by - and -V)

<b>ref_seq_used</b> : the unique ID and strain name of the scaffold's best-matching reference sequence used for filling in missing regions in the scaffold (if the scaffold completeness was 100%, then this is provided pro forma as none of it was used to create the mapping reference)
<b>variant_perc</b> : the percentage of nucleotide positions in the consensus sequence that were called as variants (in relation to the mapping reference)

<b>ref_seq</b> : the unique ID and strain name of the scaffold's best-matching reference sequence used for filling in missing regions in the scaffold (if the scaffold completeness was 100%, then this is provided pro forma as none of it was used to create the mapping reference)
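
The report columns lend themselves to simple automated QC. The sketch below reads a report with Python's `csv` module and lists segments whose `low_cov_perc` or `ambig_perc` exceed a cutoff. The function name `flag_segments`, the cutoff values, the sample rows, and the assumption that percentages are stored as plain numbers are all hypothetical; only the column names come from the list above.

```python
import csv
import io

# Made-up report content for illustration; a real report has more columns.
SAMPLE_REPORT = (
    "seq_name\tlow_cov_perc\tambig_perc\n"
    "sample01_HA\t0.0\t0.1\n"
    "sample01_NA\t12.5\t0.0\n"
    "sample01_PB2\t1.0\t3.2\n"
)


def flag_segments(tsv_handle, max_low_cov: float = 5.0,
                  max_ambig: float = 1.0) -> list[str]:
    """Return seq_name values whose masking percentages exceed the cutoffs."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    flagged = []
    for row in reader:
        too_low = float(row["low_cov_perc"]) > max_low_cov
        too_ambiguous = float(row["ambig_perc"]) > max_ambig
        if too_low or too_ambiguous:
            flagged.append(row["seq_name"])
    return flagged


flagged = flag_segments(io.StringIO(SAMPLE_REPORT))
print(flagged)  # ['sample01_NA', 'sample01_PB2']
```

To use this on a real report, pass an open file handle instead of the `io.StringIO` wrapper and choose cutoffs that suit the sequencing protocol.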

The depth of coverage plots contain the following elements:
- A black line indicating the depth of coverage pre-variant calling
- A grey line indicating the depth of coverage post-variant calling
- Red shading covering positions where coverage was too low for base calling
- Orange lines indicating positions where excess variation resulted in an ambiguous base call
- Blue lines indicating positions where a variant was called
- Red shading covering positions where masking was applied because coverage was too low
- Blue shading covering positions where masking was applied because base calls were ambiguous
- Green shading covering positions with variants (in relation to mapping reference)
