Wanlin Li edited this page Apr 14, 2023 · 8 revisions

Phylogeography investigates the historical and geographic distribution of genetic variation within and among species. The geographic distribution of a species is often linked to patterns in its genes (Knowles and Maddison, 2002). Populations living in different environments with varying climatic conditions are subject to pressures that can lead to evolutionary divergence and reproductive isolation (Matthew and Smith, 1998; Schluter, 2001). Phylogeography therefore integrates geographic data, such as climate data, with molecular genetic data, such as nucleotide or amino acid sequences, to infer the evolutionary history and geographic distribution of a set of species. Phylogeographic analyses provide insight into how past climate change, geological events, and dispersal barriers have shaped the genetic diversity and distribution of species.

aPhyloGeo applies a sliding-window strategy to scan genetic sequence data and examine how patterns of variation within species coincide with geographic features, such as climatic features. First, the multiple sequence alignment (MSA) is partitioned into windows according to a specified window size and step size, and a phylogenetic tree is constructed for each window. Second, a cluster analysis is performed for each geographic factor: a distance matrix is calculated, and a reference tree is built from it with the Neighbor-Joining clustering method (Cardoso et al., 2022). Reference trees (based on geographic factors) and phylogenetic trees (based on sliding windows) are defined on the same set of leaves (i.e., names of species). Third, the correlation between phylogenetic and reference trees is evaluated using the Robinson-Foulds (RF) distance. As depicted in the figure, an RF distance is calculated for each combination of phylogenetic tree and reference tree. Finally, bootstrap and RF thresholds are applied to identify gene fragments in which patterns of variation within species coincide with a particular geographic feature. These fragments can serve as informative reference points for future studies.
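The window partitioning step can be illustrated with a minimal sketch. This is not the aPhyloGeo implementation: the function name `sliding_windows`, the dictionary-based alignment, and the toy sequences are assumptions made for illustration, and only full-length windows are produced here.

```python
def sliding_windows(alignment, window_size, step_size):
    """Yield (start, end, sub_alignment) for each full-length window.

    `alignment` maps a species name to its aligned sequence (hypothetical
    in-memory representation). Positions are 0-based and end-exclusive.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    if window_size <= 0 or step_size <= 0:
        raise ValueError("window_size and step_size must be positive")
    alignment_length = lengths.pop()
    # Slide from the left edge; stop before a window would run past the end.
    for start in range(0, alignment_length - window_size + 1, step_size):
        end = start + window_size
        yield start, end, {name: seq[start:end] for name, seq in alignment.items()}


# Toy alignment (made-up sequences, 12 columns each).
toy_msa = {
    "sp_A": "ATGCCGTA-ACG",
    "sp_B": "ATGCTGTAAACG",
    "sp_C": "ATGACGTA-ACC",
}

windows = list(sliding_windows(toy_msa, window_size=6, step_size=3))
for start, end, sub_alignment in windows:
    print(start, end, sub_alignment["sp_A"])
```

Each yielded sub-alignment keeps the same species names, so a tree built from it shares its leaf set with the reference trees, which is what the RF comparison requires.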

[Figure: sliding-windows]
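To make the tree comparison concrete, the following sketch computes an RF distance between two trees on the same leaves. It assumes trees are given as nested tuples of leaf names and are treated as unrooted; this representation, the helper names, and the threshold values are illustrative assumptions, not aPhyloGeo's own code or defaults. The direction of the filter (low RF, high bootstrap) is likewise an assumption about how the thresholds are applied.

```python
def leaf_set(tree):
    """Return the set of leaf names in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Return the non-trivial bipartitions of a tree, viewed as unrooted.

    Each split is stored as the side that does NOT contain a fixed anchor
    leaf, so the same split always maps to the same frozenset.
    """
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if anchor in clade else clade
        # Skip trivial splits (a single leaf versus everything else).
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return clade

    visit(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Count the splits present in exactly one of the two trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must be defined on the same set of leaves")
    return len(splits(tree_a) ^ splits(tree_b))


def passes_thresholds(rf, bootstrap, max_rf, min_bootstrap):
    """Assumed filter: keep windows with low RF and high bootstrap support."""
    return rf <= max_rf and bootstrap >= min_bootstrap


# Hypothetical reference tree (from a climatic factor) and window tree.
reference_tree = ((("A", "B"), "C"), ("D", "E"))
window_tree = ((("A", "C"), "B"), ("D", "E"))

rf = rf_distance(reference_tree, window_tree)
print(rf)
print(passes_thresholds(rf, bootstrap=90, max_rf=2, min_bootstrap=70))
```

An RF distance of 0 means the window tree and the reference tree have identical topologies; larger values mean more splits differ between them.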

This process is computationally expensive, since each window must be processed independently. In addition, the window size and step size play crucial roles in the analysis, and several parameter settings often have to be tried to optimize it, which makes reproducibility critical. In response to these challenges, we designed a phylogeographic pipeline, aPhyloGeo (Analysis of PhyloGeography), to ensure smooth and efficient integration of phylogenetic and geographic analyses.
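A rough sense of the cost can be obtained by counting how many windows, and therefore how many window trees, each parameter setting produces. The sketch below assumes only full-length windows are used; the alignment length and the parameter grid are hypothetical values chosen for illustration.

```python
from itertools import product


def count_windows(alignment_length, window_size, step_size):
    """Number of full-length windows that fit in the alignment."""
    if window_size > alignment_length:
        return 0
    return (alignment_length - window_size) // step_size + 1


alignment_length = 1000          # hypothetical MSA length (columns)
window_sizes = [50, 100, 200]    # hypothetical settings to explore
step_sizes = [10, 50]

grid = {
    (w, s): count_windows(alignment_length, w, s)
    for w, s in product(window_sizes, step_sizes)
}

for (w, s), n in grid.items():
    print(f"window={w:>3} step={s:>2} -> {n} trees to build")
print("total trees across the grid:", sum(grid.values()))
```

Because every window is independent of the others, these trees can in principle be built in parallel, which is one reason a workflow manager is a good match for this analysis.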

The aPhyloGeo-pipeline uses a modern computational workflow management system called Snakemake (Köster and Rahmann, 2012). Compared to other workflow management systems, such as Galaxy (Jalili et al., 2020) and Nextflow (Spišaková et al., 2023), Snakemake is unique in that it is written in Python. This makes it highly portable, as it only requires a Python installation to run Snakefiles (Wratten et al., 2021). The aPhyloGeo-pipeline takes advantage of a range of Python packages, including Biopython (Cock et al., 2009) and Pandas (McKinney, 2011), for efficient reading and writing of sequencing data and for phylogenetic analysis. As a result, Python-based Snakemake is a natural choice for aPhyloGeo. Furthermore, a Snakemake pipeline can easily integrate with other tools through Conda or Singularity for dependency and environment management, so a single command can download and install all the dependencies aPhyloGeo needs. Another major advantage of Snakemake is its scalability: it can handle large workflows with many rules and dependencies and can run in various computing environments, including workstations, clusters, and cloud computing platforms (such as Kubernetes, Google Cloud Platform, and Amazon Web Services). Additionally, Snakemake supports the parallel execution of jobs, which can greatly accelerate the execution of the pipeline.

The aPhyloGeo-pipeline is scalable, reproducible, customizable, and easy to use, making it an effective tool for phylogeographic research. It is easily automatable from the Snakemake Workflow Catalog. Additionally, Snakemake developers can import the entire aPhyloGeo workflow or its parts into their projects via the modularization system introduced with Snakemake 6.0 (Mölder et al., 2021).


References

Cardoso,P. et al. (2022) Calculating functional diversity metrics using neighbour-joining trees. bioRxiv, 2022-11.

Cock,P.J. et al. (2009) Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25(11), 1422-1423.

Jalili,V. et al. (2020) The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2020 update. Nucleic Acids Research, 48(W1), W395-W402.

Knowles,L.L. and Maddison,W.P. (2002) Statistical phylogeography. Molecular Ecology, 11(12), 2623-2635.

Köster,J. and Rahmann,S. (2012) Snakemake-- a scalable bioinformatics workflow engine. Bioinformatics, 28(19), 2520-2522.

Matthew,R.O. and Smith,T.B. (1998) Ecology and speciation. Trends in Ecology & Evolution, 13(12), 502-506.

McKinney,W. (2011) pandas: a foundational Python library for data analysis and statistics. Python for high performance and scientific computing, 14(9), 1-9.

Mölder,F. et al. (2021) Sustainable data analysis with Snakemake. F1000Research, 10.

Schluter,D. (2001) Ecology and the origin of species. Trends in Ecology & Evolution, 16(7), 372-380.

Spišaková,V. et al. (2023) Nextflow in bioinformatics: Executors performance comparison using genomics data. Future Generation Computer Systems, 142, 328-339.

Wratten,L. et al. (2021) Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers. Nature Methods, 18(10), 1161-1168.
