2 Getting started
1. Clone this repo.
git clone https://github.com/tahiri-lab/aPhyloGeo-pipeline.git
cd aPhyloGeo-pipeline
2. 🚀 Install dependencies.
2.1 If Conda is not installed, install it with the commands below. If Conda is already installed, skip to step 2.2.
# download Miniconda3 installer
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
# install Conda (answer 'yes' when prompted)
bash miniconda.sh
# update Conda
conda update -y conda
2.2 Create a conda environment named aPhyloGeo and install all the dependencies in that environment.
# create a new environment with dependencies
conda env create -n aPhyloGeo -f environment.yaml
2.3 Activate the environment
conda activate aPhyloGeo
3. Configure the workflow
3.1 Prepare the config file:
To configure aPhyloGeo-pipeline, fill in the required parameters in the YAML config file (config.yaml).
The simplest way to do this is to edit the template YAML file provided with aPhyloGeo-pipeline to fit your research needs, keeping the parameter and file names unchanged.
The parameters that need to be set include:
- Thresholds in config.yaml:
  - bootstrap_threshold: Only sliding windows with bootstrap values greater than the user-set bootstrap_threshold (value from 0 to 1) are written to the output file.
  - rf_threshold: The tree distance between each combination of sliding window and environmental feature is calculated. Only sliding windows with a Robinson–Foulds (RF) distance below the user-set rf_threshold (value from 0 to 100) are written to the output file.
- params in config.yaml:
  - data_type: set to aa for an amino acid dataset (case insensitive); any other value is treated as a nucleotide dataset (default).
  - step_size: the step size of the sliding window movement (bp).
  - window_size: the size of the sliding window (bp).
  - strategy: the algorithm used to construct the phylogenetic trees. Two alternatives are provided, RAxML-NG and FastTree. Set to fasttree for the FastTree strategy (case insensitive); any other value is treated as the RAxML-NG strategy (default).
  - geo_file: the path of the input file containing the environmental data (.csv).
  - seq_file: the path of the input file containing the multiple sequence alignment (.fasta).
    Note: Give input file paths relative to the aPhyloGeo-pipeline directory (i.e., the default present working directory should be the workflow directory).
  - specimen_id: the name of the column in geo_file containing the sample ID.
  - feature_names: the names of the columns in geo_file corresponding to the environmental factors involved in the analysis.
    Note: Each column name goes on a separate line; keep the "-" in front of each one.
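To make the role of these parameters more concrete, here is a minimal Python sketch of how they could interact. It is not the pipeline's actual code: the nested layout of the config dictionary, the helper names (resolve_data_type, resolve_strategy, sliding_windows, keep_window), the example values, the choice to drop a trailing partial window, and the idea that a window must pass both thresholds are all assumptions made for illustration.

```python
# Hypothetical values. The key names mirror the parameters described above,
# but the exact nesting used in config.yaml is an assumption.
config = {
    "thresholds": {"bootstrap_threshold": 0.7, "rf_threshold": 60},
    "params": {
        "data_type": "AA",
        "window_size": 10,
        "step_size": 5,
        "strategy": "FastTree",
    },
}


def resolve_data_type(value):
    """Return 'aa' for amino acids (case insensitive), otherwise 'nt' (default)."""
    return "aa" if str(value).lower() == "aa" else "nt"


def resolve_strategy(value):
    """Return 'fasttree' (case insensitive), otherwise 'raxml-ng' (default)."""
    return "fasttree" if str(value).lower() == "fasttree" else "raxml-ng"


def sliding_windows(alignment_length, window_size, step_size):
    """List (start, end) positions of full windows, 0-based, end exclusive.

    Assumption: a trailing window shorter than window_size is dropped.
    """
    last_start = alignment_length - window_size
    return [(s, s + window_size) for s in range(0, last_start + 1, step_size)]


def keep_window(bootstrap, rf_distance, thresholds):
    """Assumption: a window is reported only if it passes both thresholds."""
    return (
        bootstrap > thresholds["bootstrap_threshold"]
        and rf_distance < thresholds["rf_threshold"]
    )


params = config["params"]
windows = sliding_windows(20, params["window_size"], params["step_size"])
print(resolve_data_type(params["data_type"]), resolve_strategy(params["strategy"]))
print(windows)
```

With a hypothetical alignment of 20 positions, a window_size of 10 and a step_size of 5, the sketch yields three overlapping windows. A smaller step_size gives more windows and a finer scan at the cost of more tree reconstructions.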
3.2 Prepare the input files:
- A multiple sequence alignment file in FASTA format
- A CSV file containing environmental data on the geographical habitat of the studied species
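Before running the workflow, it can help to sanity-check the two input files. The following Python sketch is one possible way to do so and is not part of aPhyloGeo-pipeline: the function names (read_fasta, check_inputs), the sample data, and the assumption that FASTA sequence IDs match the values in the specimen_id column are all hypothetical.

```python
import csv
import io


def read_fasta(handle):
    """Parse FASTA text into {sequence_id: sequence}."""
    records = {}
    current = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            # Use the first word of the header as the sequence ID.
            current = header.split()[0] if header else ""
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def check_inputs(fasta_handle, csv_handle, specimen_id, feature_names):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    sequences = read_fasta(fasta_handle)
    # In an alignment, every sequence has the same length (gaps included).
    if len({len(seq) for seq in sequences.values()}) > 1:
        problems.append("sequences differ in length, so the file is not an alignment")
    reader = csv.DictReader(csv_handle)
    columns = reader.fieldnames or []
    for name in [specimen_id, *feature_names]:
        if name not in columns:
            problems.append(f"column {name!r} is missing from the CSV file")
    if specimen_id in columns:
        # Assumption: FASTA IDs are expected to appear in the specimen_id column.
        csv_ids = {row[specimen_id] for row in reader}
        missing = sorted(set(sequences) - csv_ids)
        if missing:
            problems.append(f"no environmental data for: {', '.join(missing)}")
    return problems


# Hypothetical in-memory data standing in for seq_file and geo_file.
fasta = io.StringIO(">s1\nACGT\n>s2\nAC-T\n")
table = io.StringIO("id,temp\ns1,20.5\ns2,18.0\n")
problems = check_inputs(fasta, table, "id", ["temp", "rain"])
print(problems)
```

To check real files, pass open file handles instead of the io.StringIO objects, for example open(path, newline="") for the CSV file.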
4. Execute the workflow.
Run the workflow:
# In a conda environment where all dependencies are already installed
# Specify the maximum number of CPU cores to be used at the same time.
# To use N cores: --cores N or -cN.
snakemake --cores all
📧 Please email us at: Nadia.Tahiri@USherbrooke.ca for any questions or feedback.