Skip to content

3 Analysis example

Wanlin Li edited this page Apr 13, 2023 · 2 revisions

In 2001, Uphyrkina et al. conducted a study on the geographical partitioning of modern leopards (_Panthera pardus_) using DNA sequence variation in two mtDNA regions: 1) a 611 bp segment of the nicotinamide adenine dinucleotide hydrogen 5 (NADH-5) gene, and 2) a 116 bp segment of the control region (CR). However, the study did not include an integrated analysis of molecular genetic variation and specific environmental features, which is a crucial aspect in understanding the complex relationships between genetic variation and environmental factors.

Datasets preparation

In this example, we employed the aPhyloGeo-pipeline and sampled 31 NADH-5 gene segment sequences from the leopard subspecies mentioned in the article of Uphyrkina et al. (2001). For each leopard subspecies sample, the corresponding geographical information was provided in the article. The average temperatures of the relevant areas from 1984 to 2000 were extracted by referencing the NASA/POWER and Daily Gridded weather datasets (Marzouk, 2021). By integrating genetic and environmental data with the aPhyloGeo-pipeline, our study provides new insights into the geographical partitioning of leopards and contributes to a better understanding of the genetic mechanisms underlying their adaptation to different environments.

Genetic dataset

The 31 NADH-5 gene sequences from leopards were aligned using the MUSCLE software (Edgar, 2004) to generate a multiple sequence alignment. The alignment programs vary based on data type, amount, and study purpose. For this reason, the aPhyloGeo-pipeline does not include alignment, offering more flexibility in choosing alignment strategies for a wider range of research projects. After completing alignment, the results (in FASTA format) are used as input for aPhyloGeo-pipeline (see aln_leopard.fa).

Samples information and sequencing methods were described in the literature (Uphyrkina et al., 2001), including GenBank accession numbers (AY035260-AY035275, AY035277-AY035290, AY035292).

Environmental dataset

The monthly temperature data for regions corresponding to each leopard sample was collected from the NASA/POWER website based on the geographical information provided in the article.

The study focused on nine geographical areas, namely Amur, Central Africa, India, Java, North Chinese, Persian, South Arabian, South Chinese, and Sri Lanka. Based on the hypothesis that subspecies variation in leopards is related to seasonal temperature variations, the temperature data was categorized into four seasons, and the seasonal averages were calculated from 1984 to 2000 (see geo_leopard.csv).

Workflow deployment

Conda is an open-source package and environment management system that simplifies the installation of multiple software packages and their dependencies. The aPhyloGeo-pipeline relies on Conda for efficient and accurate dependency installation.

All dependencies required by the aPhylo-Geo-pipeline are recorded in our YAML file, which is used by Conda to install the necessary packages and minimize the risk of errors during installation. This approach provides a streamlined solution for installing and managing the dependencies of aPhyloGeo-pipeline, making it an accessible tool for researchers in various fields.

Workflow configuration

To configure the aPhyloGeo-pipeline, the YAML file must be completed with the required parameters.

The simplest way to do this is by modifying the template YAML file provided by aPhyloGeo-pipeline according to the specific research needs while ensuring the parameter and file names remain unchanged.

The parameters that need to be set include:

  • Thresholds in config.yaml:

    • bootstrap_threshold: Only sliding windows with bootstrap values greater than user-set bootstrap_threshold (value from 0 to 1) will be written to the output file.

    • rf_threshold: The tree distance between each combination of sliding windows and environmental features will be calculated. Only sliding windows with Robinson–Foulds (RF) distance below the user-set rf_threshold (value from 0 to 100) will be written to the output file.

  • params in config.yaml:

    • data_type: aa for the amino acid dataset (case insensitive); Any other values set by the user will be treated as nucleotide dataset (default)

    • step_size: the size of the Sliding Window Movement Step (bp)

    • window_size: the size of the Sliding Window (bp)

    • strategy: For constructing the phylogenetic tree, two alternative algorithms are provided, RAxML-Ng and FastTree. fasttree for the FastTree strategy (case insensitive); Any other values set by the user will be treated as RAxML-Ng strategy (default)

    • geo_file: the path of input file (the environmental data .csv )

    • seq_file: the path of input file (the Multiple Sequence Alignment data .fasta )
      Note: To use a Relative Path to describe the input file relatively to the path related to the aPhyloGeo-pipeline directory (i.e., the default Present Working Directory should be the workflow).

    • specimen_id: the name of the column containing the sample id in geo_file

    • feature_names: The names of the columns corresponding to the environmental factors that will be involved in the analysis (in geo_file)
      Note: Each column name is on a separate line, don't forget to keep the "-" in front of it.

  • input files:

    • aln_leopard.fa - Multiple Sequence Alignment for nucleotide sequences in FASTA format
    • geo_leopard.csv - Environmental data corresponding to sequencing samples

Workflow running

Running the workflow is straightforward by simply entering snakemake -- use-conda --cores all in the command line to execute the entire workflow.

Specifying --cores all means that all available CPU cores will be utilized. To use a specific number of cores, we can specify --cores N.

Output

  • .csv and related metadata will be stored in the 'results' directory.

The aPhyloGeo-pipeline generates an output file in CSV format at the end of the analysis, which includes the average bootstrap value of the phylogenetic tree for each alignment window, as well as the RF distance between the phylogenetic tree of that window and the reference tree of a particular environmental feature.

The bootstrap value estimates the confidence level of a given clade in a phylogenetic tree, while the RF distance quantifies the dissimilarity between the phylogenetic tree constructed from a sliding window and the reference tree constructed from an environmental feature.

The aPhyloGeo-pipeline also saves intermediate files, including sequence information, and Newick files for phylogenetic and reference trees, to facilitate further research.

output-visualization-1

Fig.1. Variation of bootstrap value and normalized Robinson and Foulds (RF) distance on the Multiple Sequence Alignment (MSA)

output-visualization-2

Fig. 2. Variation of normalized Robinson and Foulds (RF) distance on the Multiple Sequence Alignment (MSA) for different temperatures in four seasonal groups


References

Edgar,R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic acids research, 32(5), 1792-1797.

Marzouk,O.A. (2021) Assessment of global warming in Al Buraimi, sultanate of Oman based on statistical analysis of NASA POWER data over 39 years, and testing the reliability of NASA POWER against meteorological measurements. Heliyon, 7(3), e06625.

Uphyrkina,O. et al. (2001) Phylogenetics, genome diversity and origin of modern leopard, Panthera pardus. Molecular ecology, 10(11), 2617-2633.