💡 A new version of the Generator of clusters of phylogenetic trees with overlapping and HGT is available here.
This solution (Generator of Phylogenetic trees) generates phylogenetic trees in Newick format with a specified number of leaves and a controlled level of overlap between the trees. The generator simulates gene trees with horizontal gene transfer (HGT) and is useful for scientific experiments such as testing clustering algorithms or inferring supertrees.
📑 If you use the GPTree generator in your research or experiments, please consider citing the following paper:
Koshkarov, A., & Tahiri, N. (2023). GPTree: Generator of Phylogenetic Trees with Overlapping and Biological Events for Supertree Inference. In BIOINFORMATICS (pp. 212-219). DOI: Link to Paper
🏆 Thank you for your contribution to the community!
The generator is built on the AsymmeTree library.
- Generates phylogenetic trees with horizontal gene transfer (HGT).
- Allows users to specify the number of leaves and overlap level between trees.
- Outputs gene trees and species trees in Newick format.
- Designed to handle large datasets with configurable parameters.
The script depends on the following Python libraries:
ete3
PyQt5
asymmetree
pandas
The user needs to provide several initial parameters:
- Lmin: Minimum number of leaves per tree (integer, 5 ≤ Lmin < 500).
- Lmax: Maximum number of leaves per tree (integer, Lmin < Lmax ≤ 500).
- Ngen: Number of trees to generate (integer, 3 ≤ Ngen ≤ 500).
- plevel: Average level of overlap (common leaves) between trees, as a decimal (0.2 ≤ plevel ≤ 0.7).
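The ranges listed above can be checked before a run. The following is a minimal sketch, not part of gptree.py: the function name `validate_params` is hypothetical, and it only encodes the bounds stated in the parameter list.

```python
def validate_params(lmin: int, lmax: int, ngen: int, plevel: float) -> None:
    """Raise ValueError if a parameter is outside the documented range.

    Hypothetical helper: the bounds are copied from the parameter list,
    but this function is not taken from gptree.py itself.
    """
    if not 5 <= lmin < 500:
        raise ValueError("Lmin must satisfy 5 <= Lmin < 500")
    if not lmin < lmax <= 500:
        raise ValueError("Lmax must satisfy Lmin < Lmax <= 500")
    if not 3 <= ngen <= 500:
        raise ValueError("Ngen must satisfy 3 <= Ngen <= 500")
    if not 0.2 <= plevel <= 0.7:
        raise ValueError("plevel must satisfy 0.2 <= plevel <= 0.7")
```

Calling `validate_params(15, 25, 30, 0.5)` returns without error, while swapping Lmin and Lmax raises a `ValueError`.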
The overlap level between trees is calculated based on the number of common leaves between them, with additional controls to ensure the desired level of overlap.
Currently, the generator runs slowly for overlap levels below 0.2 or above 0.7.
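To make the idea of overlap as a share of common leaves concrete, here is a small illustrative sketch. The exact formula used by the generator is not given in this document, so the sketch assumes a Jaccard-style ratio (common leaves divided by all distinct leaves of the two trees), averaged over every pair of trees. The function names are hypothetical.

```python
from itertools import combinations


def pairwise_overlap(leaves_a, leaves_b):
    """Assumed measure: |A ∩ B| / |A ∪ B| for two leaf-name collections."""
    a, b = set(leaves_a), set(leaves_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def average_overlap(leaf_sets):
    """Mean pairwise overlap over all unordered pairs of trees."""
    pairs = list(combinations(leaf_sets, 2))
    if not pairs:
        return 0.0
    return sum(pairwise_overlap(a, b) for a, b in pairs) / len(pairs)
```

For example, two trees with leaves {a, b, c} and {b, c, d} share two of four distinct leaves, so their overlap under this assumed measure is 0.5.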
The basic workflow:
To run the script, use the following command:
python gptree.py Lmin Lmax Ngen plevel
For example:
python gptree.py 15 25 30 0.5
This will generate 30 trees, each with between 15 and 25 leaves, with an average overlap of 0.5. The trees will be saved in the following files:
- Gene trees: genetrees_50.txt
- Species trees: speciestrees_50.txt
The generated trees are saved in Newick format:
- genetrees_XX.txt: Contains the gene trees with the specified overlap level (XX = plevel * 100).
- speciestrees_XX.txt: Contains the species trees used for generating the gene trees.
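As a rough illustration of the naming rule (XX = plevel * 100) and of how leaf names might be pulled out of a Newick string for a quick check, consider the sketch below. Both helper names are hypothetical. The regular expression assumes plain, unquoted leaf labels without bracketed comments; a full Newick parser such as ete3 is more robust for real data.

```python
import re


def output_names(plevel):
    """Build the two output file names from plevel (XX = plevel * 100)."""
    xx = round(plevel * 100)
    return f"genetrees_{xx}.txt", f"speciestrees_{xx}.txt"


# A leaf label directly follows "(" or ","; labels after ")" are internal nodes.
LEAF_PATTERN = re.compile(r"[(,]\s*([^(),:;\s]+)")


def leaf_names(newick):
    """Return leaf labels of a Newick string (assumes unquoted labels)."""
    return LEAF_PATTERN.findall(newick)
```

If the output files hold one tree per line (an assumption, not confirmed here), applying `leaf_names` to each line of genetrees_50.txt gives the leaf sets needed to count leaves per tree.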
See examples of generated datasets here.
The Jupyter notebook also contains steps to validate the generated dataset (tree visualization, number of trees and leaves, and level of overlap).