AGapEs

Gap filling based on template sequence for fragmented ancient DNA (aDNA) assemblies [1].

Requirements

python2
bwa
samtools
bedtools
FPSAC (install by running install_FPSAC.sh)

Quickstart

install_FPSAC.sh
snakemake --snakefile preprocessing.snakefile -j <N>
#Checkpoint: check templates
snakemake --snakefile run_gapFilling.snakefile -j <N>
snakemake --snakefile run_partial_gapFilling.snakefile -j <N>
#Checkpoint: solve conflicts
snakemake --snakefile finishing.snakefile -j <N>

Note: AGapEs has not been tested in compute cluster environments yet.

Pipeline

We split the pipeline into 4 snakefiles that should be run successively. If possible, we suggest to allow running the snakefile with multiple cores (parameter -j), since many jobs in the pipeline can be run in parallel. Each snakefile ends with a "checkpoint" that can be used to e.g. check template sequences or solve conflicts based on gap filling results. The following paragraphs provide detailed information for each pipeline step and how to run the snakefiles.

Input

aDNA reads in fasta/fastq format
assembled contigs of aDNA reads in fasta format (multi-fasta file)
assembled reference sequences in fasta format (multi-fasta file)
phylogenetic tree in newick format, placement of the ancient sample marked with @
IS annotations for all considered IS families in extant genomes, format >Genome_ID:start-stop IS_family_ID

Before running the pipeline, the location of these input files has to be specified in the file config.yaml.

Preprocessing

Markers and gap template sequences can be computed using the FPSAC [2] pipeline.

install_FPSAC.sh
snakemake --snakefile preprocessing.snakefile -j <N>

FPSAC computes the template sequences based on a multiple alignment of the respective extant gap sequences.

Important files created in directory $DIR='AGapEs_analysis':

$DIR/<X>.info - overview files for X={simple|conflicting|IS} gaps. Contains for each potential adjacency: marker extremities, gap ID, ingroup and outgroup occurrences and positions, length differences in the extant gaps, IS annotations, conflicting components
$DIR/results/contigs/families_with_contig_names - markers computed (FPSAC)
$DIR/results/contigs/families.fasta - markers sequences (FPSAC)
$DIR/results/finishing/alignments/ - template sequences for all gaps and underlying extant gap sequences

CHECKPOINT 1: If gaps get too large because of inadequate marker coverage, the multiple alignment of extant gap sequences can fail. In these cases, no template gap sequence will be used in the subsequent gap filling step. If a multiple alignment has been successfully obtained with other software, the file can simply be copied into the respective folder and will be picked up by the pipeline.

Filling gaps with AGapEs

The two following snakefiles define the full and partial gap filling steps. Since gaps are filled independently, these should be run with multiple threads if possible (-j parameter).

snakemake --snakefile run_gapFilling.snakefile -j <N>
snakemake --snakefile run_partial_gapFilling.snakefile -j <N>

Important files/directories created:

$DIR/results/assemblies/<simple|IS|conflicting>/<gapID>.fasta_assembly - template gap sequence together with its flanking marker sequences, for IS-annotated gaps all template alternatives are given. The header of each assembly file is >gapID leftMarker rightMarker length gapStart-gapEnd
$DIR/results/assemblies/<simple|IS|conflicting>/<gapID>.bam_noclips - mapping BAM files, soft-clipped reads are removed
$DIR/results/assemblies/<simple|IS|conflicting>/<gapID>.out - gap filling result. The header of each assembly file is >gapID leftMarker rightMarker length gapStart-gapEnd dist_to_template followed by the reconstructed gap sequence
$DIR/results/gapFilling_partial/ - assemblies and partial gap filling results for not completely filled simple gaps
$DIR/results/gapFilling_partial_IS/ - assemblies and partial gap filling results for not completely filled IS gaps

CHECKPOINT 2: Based on the gap filling results, the conflicting components as summarized in $DIR/conflicts.info have to be solved manually. Before running the finishing phase, the two files $DIR/results/scaffold/adjacencies_kept and $DIR/results/scaffold/adjacencies_discarded have to be adjusted accordingly.

Finishing

The last step will produce a set of scaffolds from kept adjacencies and gap sequences either based on aDNA reads or completed with template sequence. The sequence map also indicates the type of gap and support by aDNA reads.

snakemake --snakefile finishing.snakefile -j <N>

Important files created:

$DIR/results/finishing/ancestral_sequence.fasta - resulting ancestral scaffolds
$DIR/results/finishing/ancestral_sequence_map_augmented - detailed map over the resulting scaffolds, including the status of all gaps included in the assembly. For each gap we have a line: GAP gapID adjacency position orientation length length_covered_by_reads status OK/DUB We mark all gaps as dubious (DUB) if they are not covered by any reads (COV) or their extant gaps are very unconserved (EXT_LENGTH_DIFF).

Example

The pipeline has been used to obtain improved assemblies of ancient Yersinia pestis genomes as described in [1]. The two config files config_London.yaml and config_Marseille.yaml describe all input data for both assemblies. To run the pipeline, download input files from here and the aDNA read files from the respective databases, IDs given in [1].

References

[1] Nina Luhmann, Daniel Doerr, and Cedric Chauve. "Comparative scaffolding and gap filling of ancient bacterial genomes applied to two ancient Yersinia pestis genomes." Microbial genomics. 2017 Jul 8;3(9).

[2] Ashok Rajaraman, Eric Tannier, and Cedric Chauve. "FPSAC: fast phylogenetic scaffolding of ancient contigs." Bioinformatics 29.23 (2013): 2987-2994.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AGapEs

Requirements

Quickstart

Pipeline

Input

Preprocessing

Filling gaps with AGapEs

Finishing

Example

References

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 40 Commits
FPSAC_runfiles		FPSAC_runfiles
scripts		scripts
README.md		README.md
config.yaml		config.yaml
config_London.yaml		config_London.yaml
config_Marseille.yaml		config_Marseille.yaml
finishing.snakefile		finishing.snakefile
install_FPSAC.sh		install_FPSAC.sh
preprocessing.snakefile		preprocessing.snakefile
run_gapFilling.snakefile		run_gapFilling.snakefile
run_partial_gapFilling.snakefile		run_partial_gapFilling.snakefile

nluhmann/AGapEs

Folders and files

Latest commit

History

Repository files navigation

AGapEs

Requirements

Quickstart

Pipeline

Input

Preprocessing

Filling gaps with AGapEs

Finishing

Example

References

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages