Skip to content

Latest commit

 

History

History
110 lines (62 loc) · 13.1 KB

File metadata and controls

110 lines (62 loc) · 13.1 KB

Sequential trait evolution did not drive deep-time diversification in sharks

Summary

This repository's purpose is to give a means of replicability to the article "Sequential trait evolution did not drive deep-time diversification in sharks" but can be generalized to other similar data. All of the presented scripts are written in R language (R Core Team, 2022). Each script is available as both an annotated notebook (.ipynb) or a raw .r file (unannotated). If you plan to use any of these scripts, please cite "Marion et al., 2024".

Overview

This repository contains scripts and html files for performing the following comparative analyses:

1: Modeling trait evolution

2: Analyses of diversification

3: Sensitivity analyses

Data from Marion et al., (2024) are available here "10.6084/m9.figshare.24581718".

1 Modeling trait evolution

package requirement (corHMM, mclust, string, phytools, qpcR, ggtree, secsse, ggtree, phytools, treedataverse, RColorBrewer, ggplot2, ggpmisc, optional(stringr))

used script (corHMM.r; corHMM_Diet.r; Plot_ASE.r)

The purpose of this analysis is twofold. First, we want to test how traits evolved and what evolutionary scenario they followed. Second, we performed ancestral state estimation which can be later compared with the ancestral state estimation from SecSSE. In this section, we will be using the corHMM package (Boyko & Beaulieu, 2021).

1.1 Data cleaning and model construction

The first step in any comparative analysis is to clean and prepare the trait data. To do so, we must extract and isolate these values in a vector. As maximum body-size is a continuous character, we may want to discretize it. In this script, we use a Gaussian mixture model with the mclust function of the mclust (Fraley et al., 2012) package to cluster continuous data. This package automatically selects the most likely or a user-specified number of groups. Then, we name each trait value according to its species name.

The second step is to build the actual evolutionary model we want to test. Usually, three models are tested when performing trait evolution analyses: an equal rate, a symmetric rate (transition from A to B is the same that B to A), and all rates differ, which is self-explanatory.

However, these models may not be adequate when testing specific evolutionary scenarios. For example, we may want to test whether the direct transition from small size to large size is impossible. To do so, we create a custom transition matrix by indicating that direct transitions between opposed states are impossible. In our case of Shark evolution, all these models will be referred to from now on as "sequential models", namely if one wants to go from A to C, there are two transitions, A to B, and B to C. Each model parameter specification was then entered in a standardised corHMM fashion. To properly model trait evolution, one should always keep in mind that evolutionary rates may vary across the phylogeny, thus it is best practice to introduce another parameter, the "rate" (herein referred to as rate-class) parameter. This parameter allows for transition rate heterogeneity across the phylogeny, that is if one part of the phylogeny has short branching patterns as opposed to the rest of the tree, corHMM will take into account such heterogeneity, regardless of the trait state. Introducing these parameters is a great way to avoid false positives but is also very resource-consuming as these models are parameters rich (often more than twice the number of parameters for a one-rate model). To keep the likelihood space to allow and avoid over-parametrization, we did not go as far as 3-rate classes.

1.2 Selecting the likeliest model

Then, we may perform each model and save relevant metrics for comparison. This script saves each model output into a data frame filled with the log-likelihood, the number of parameters, the AICc, the $\Delta$ AICc, and the $\omega$ AICc. We then automatically save the output from the best-fitting model for each trait.

1.3 Ancestral state estimation

Lastly, one may be interested in estimating the ancestral condition of his clade of interest. Fortunately, corHMM jointly estimates transition rates and ancestral states, thus it is pretty easy to extract these values and plot directly on the phylogeny. To do this, we must use corHMM_ASE_script.r. While it is pretty easy to estimate them, ancestral state estimations are and remain estimations, thus one should always interpret them with utmost care.

2 Diversification analyses

package requirement (ape, mclust, secsse, DDD, tidyverse, qgraph, ggtree, phytools, treedataverse, RColorBrewer, ggplot2, ggpmisc, optional(stringr))

used script (SecSSE_Size.r; SecSSE_Reproduction.r; SecSSE_Habitat.r; SecSSE_Diet.r; SecSSE_ASE.r; Plot_ASE.r)

Now that we pictured trait evolution dynamics, we may want to assess whether our examined traits are responsible for extant diversity patterns. To do so we rely on SSE models (State-dependent speciation and extinction models), which are well-known for accounting for the impact that trait evolution has on patterns of lineage diversification. However, accounting for trait-dependent diversification is subject to numerous methodological biases (Beaulieu and Donoghue, 2013). Indeed, SSE models can falsely indicate an effect of the focal trait on diversification. Models with hidden traits (aka concealed traits), such as SeCSSE (Herrera-Alsina et al., 2019) or HiSSE (Beaulieu and O'Meara, 2016) can account for hidden variables in trait-dependant diversification. Thus, we will be using SecSSE, an SSE implementation for detecting trait-dependent diversification using phylogenies on multi-state characters, while being robust to false-postive.

2.1 Data cleaning and model construction

Highly similar to corHMM, we will first need to discretise our continuous traits and clean our data. However, unlike corHMM, SecSSE estimates jointly transition rates and diversification rates (speciation and extinction). This is both a good thing, as it has been shown that accounting for diversification allows for better estimation of trait evolution (Maddison, 2006), and a bad thing as it dramatically increases the number of parameters in our analyses. Over-parametrization is a common issue in statistics, but it has not often been addressed in macroevolution studies. Here, we want to avoid this as much as possible by limiting the upper number of estimated parameters. To do so, we kept the structure of the best-fitting corHMM model for each trait, while still estimating its transition rates. Namely, if the best-fitting model for reproduction is a two-rate sequential symmetrical model, we will forbid direct transition from A to C, while constraining transitions from A to B to be equal to B to A. To minimize the effect of structuration on the output of the model, we kept the same structure of the transition matrix across all variants for each trait.

For each trait, we constructed seven models, one for constant rates (CR), three for examined-trait diversification (ETD) and three for concealed-trait diversification (CTD). The three variants for CTD and ETD models include a pure speciation model (different speciation for each state, one shared extinction), a pure extinction model (different extinction for each state, one shared speciation) and a net diversification model (different speciation and extinction for each state). For each of the seven models to avoid finding a local optimum, following Herrera-Alsina et al. (2019), we used three sets of initial parameters. One set was estimated using the standard birth-death model (bd_ML function in the R package “DDD” 5.2.2; Etienne et al., 2012), one was halved, and the other was doubled.

2.2 Running and selecting the likeliest model

Then, we may perform each model and save relevant metrics for comparison. Here, we save the output of each of the 21 models for each trait in a separate directory. SecSSE does not compute the AICc directly, thus we used the following formula to estimate the AICc "2*(-(ll)+2*k+(2*k*(k+1))/(n-k-1))" where ll is the log-likelihood of the model, k its number of parameters and n the number of taxa included in the analysis. Then directly in the script, we construct a data frame filled with the log-likelihood, the number of parameters, the AICc, the $\Delta$ AICc, the $\omega$ AICc and the estimated rates.

2.3 Ancestral state estimation

After the selection of the best-fitting model, one may want to estimate ancestral states across the phylogeny. However, unlike corHMM, SecSSE does not directly infer ancestral state, rather we may require another procedure to properly estimate ancestral state. This procedure is highlighted in the script "SecSSE_ASE.r", where there is the code for filling proper parameters. This procedure is a bit tricky, but you will find in the code that everything is annotated. Finally, after running our model (which is quite fast), we may plot our ancestral state directly on the phylogeny, similarly to what was done with corHMM

3 Sensitivity analyses

package requirement (ape, mclust, secsse, DDD, tidyverse, qgraph, tidyverse, ggpubr, rstatix, optional(string))

used script (corHMM.r; corHMM_Diet.r; SecSSE_Size.r; SecSSE_Reproduction.r; SecSSE_Habitat.r; SecSSE_Diet.r; Posterior_test_comparative_analysis.r)

Both trait-dependent evolution and diversification model require phylogeny. However, phylogenetic trees are hypotheses. Consequently, a phylogeny may not accurately reflect, which may affect downstream analyses. Worse, when a calibrated phylogeny is employed the uncertainty is twofold: as both dating and phylogenetic estimation are affected. We accounted for both phylogenetic and dating uncertainty by performing sensitivity analyses on 100 trees extracted from the posterior distribution of the BEAST analysis. To perform these analyses, the user just has to replace the input tree and the data frame output manually in each script (for an example of how to do it, see the following scripts "corHMM_replicated.r", "SecSSE_Size_replicated.r", and "multiRun_script.sh"). We analyzed whether we could reliably recover the best-fitting model for each trait and analysis (corHMM, SecSSE) by performing a non-parametric alternative of the repeated measure ANOVA (Friedman test) comparing the AICc of each model and then performing pairwise comparisons of each model with a paired signed-rank Wilcoxon test using the R package “rstatix” (Kassambara, 2023) where we reported the T statistic and p-value. These tests are implemented in the script "Posterior_test_comparative_analysis.r", which will require a directory with all data frame replicates (either corHMM or SecSSE) as input.

Reference

Beaulieu, J. M. Donoghue, M. J. (2013). Fruit evolution and diversification in campanulid angiosperms: Campanulid fruit evolution. Evolution. 67(11): 3132-3144.

Beaulieu, J. M., & O’Meara, B. C. (2016). Detecting Hidden Diversification Shifts in Models of Trait-Dependent Speciation and Extinction. Systematic Biology, 65(4), 583-601.

Boyko, J. D., & Beaulieu, J. M. (2021). Generalized hidden Markov models for phylogenetic comparative datasets. Methods in Ecology and Evolution, 12(3), 468-478.

Etienne, R. S., Haegeman, B., Stadler, T., Aze, T., Pearson, P. N., Purvis, A., & Phillimore, A. B. (2012). Diversity-dependence brings molecular phylogenies closer to agreement with the fossil record. Proceedings of the Royal Society B: Biological Sciences, 279(1732), 1300-1309.

Herrera-Alsina, L. van Els, P. Etienne, R. S. (2019). Detecting the dependence of diversification on multiple traits from phylogenetic trees and trait data. Systematic Biology. 68(2): 317-328.

Fraley, C. Raftery, A. E. Murphy, T. B. & Scrucca, L. (2012). mclust Version 4 for R: normal mixture modelling for model-based clustering, classification, and density estimation. Technical Report No. 597. Seattle, WA: Department of Statistics, University of Washington.

Maddison, W. P. (2006). Confounding asymmetries in evolutionary diversification and character change. Evolution, 60(8), 1743-1746.

R Core Team (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.