---
title: "Birth-death models and deep time"
author: "Matheus Januario, Andressa Viol, and Daniel Rabosky"
date: "Jan 2024"
output: rmarkdown::html_vignette
vignette: >
 %\VignetteIndexEntry{birthdeath_deeptime}
 %\VignetteEngine{knitr::rmarkdown}
 %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(rmarkdown.html_vignette.check_title = FALSE)
```

# Learning objectives

1. The birth-death (BD) model 
2. Deterministic expectations of the birth-death model
3. Estimation under the simple birth-death
4. Effects of variation on the net diversification rate
5. The birth-death model is a stochastic process
6. Adding in extinction
7. Age-richness models and empirical richness through time

# Introduction

Today we will work with the main mathematical model used in diversity dynamics: the birth-death model. The main idea of the model is simple: at any instant in time, species can (I) split into two, (II) go extinct, or (III) do nothing. The model can be fully described (i.e. parameterized) by the rate of speciation (the "births") and the rate of extinction (the "deaths").

```{r}
library(evolved)
```

# The birth-death (BD) model

The birth-death model is a simple stochastic process that is widely used to quantitatively study diversification (speciation and extinction). It is used in other contexts as well, as in ecology or in physics.

The birth-death model we will use here has two parameters: $S$, the _rate of speciation_ (the rate at which one species will split into two species), and $E$ the _rate of extinction_. Usually, these rates are "per capita" (or "per lineage"), meaning that the total rate of speciation for a clade of N species would be $N \times S$ and $N \times E$ for extinction. 

The compound parameter $S - E$ is known as the net diversification rate: it is the rate at which new species are produced, minus the rate at which they go extinct. If $S - E > 0$, the net production of new species is positive and the clade tends to increase exponentially in diversity through time. When $S - E < 0$, the clade will lose species through time on average, meaning the clade will be strongly biased to go extinct.

### Glossary:

$S =$ speciation rate (per lineage)

$E =$ extinction rate (per lineage)

$S - E = R =$ net diversification rate

$K = \frac{E}{S} =$ relative extinction rate (becomes important later)

**pure-birth model:** a special type of birth-death model where only speciation happens (i.e., E = 0)

# Deterministic expectations of the BD model

The expected species richness under the birth-death model is:

\begin{equation}
N_t = N_0 * e^{t (S - E)}
\end{equation}

where $N_0$ is the starting number of species, $t$ is the elapsed time, and $N_t$ is the number of species at time $t$. This is just a simple exponential growth model that you've seen before (recall the intro to popgen vignette, where we discussed Malthusian growth), except now we are dealing with the birth and death rates of _species_ and not individuals.


\fbox{\begin{minipage}{48em}
Question 1:\\

Are the relative contributions of speciation and extinction to net diversification distinguishable in this simple model?\\

Hint: What is the expected richness through time curve for S = 10, E = 9, versus S = 1, E = 0?  Plot those 2 cases in a panel side by side on the same scale.
\end{minipage}}

\fbox{\begin{minipage}{48em}
Question 2:\\

Imagine that you can watch the entire history of the 2 cases from Q1 (S = 10, E = 9 vs S = 1, E = 0), such that you could see every lineage branching out and going extinct. What, if anything, would be different about diversification under these 2 scenarios?\\

Hint: Recall the lineage-through-time plots are. Imagine what a lineage-through-time plot may look like for these two cases. What is different or similar about them?

\end{minipage}}

# Estimation under the simplest BD

Let's imagine that we are interested in the diversification of mammals, which currently have around 5500 extant species. Can we estimate how fast mammals diversified through time? 

Assume that we know that the first of all mammals existed $164.9$ million years ago. At this time point, there were $N = 1$ species in this group. Thus, we initiate a model where $N_t = 5500$, $N_0 = 1$, and $t = 164.9$. What is our estimate of $R = S - E$?

We can make a simple estimator for the net diversification rate $R$ under the birth-death process just by rearranging equation (1) above. We start with:

\begin{equation}
N_t = N_0 * e^{Rt}
\end{equation}

and solve for R, giving:

\begin{equation}
R = \frac{log(N_t) - log(N_0)}{t}
\end{equation}

\fbox{\begin{minipage}{48em}
Question 3:\\

Estimate R for mammals, and plot the estimated species richness through time since 164.9 million years ago. Using this model, how many species of mammals do you think existed at the time of the K-Pg extinction? What assumptions went into this estimate?
\end{minipage}}

# Effects of variation on the net diversification rate

\fbox{\begin{minipage}{48em}
Question 4:\\

How sensitive is species richness to variation in the net diversification rate? Imagine that you are "replaying the tape" of the evolutionary history of mammals, but let's pretend now that your estimated net diversification rate is double the rate you calculated in Q3. Plot the species richness trajectory, given this new net diversification, together with the one you plotted in Q3.\\ 

Hint: You may need to use a log-scale y axis! To do so, pass your richness estimate into the {\tt log()} function, specifying the argument {\tt base=10}, before plotting. 

\end{minipage}}

# The birth-death model is a stochastic process

Let's imagine that the $R$ we've computed for mammals in Q3 is in fact the _true_ $R$. What if we replayed the tape of life for mammals many times, allowing the clade to diversify over and over again (always starting from only one species). Do you think we would get the exact same number of species every time? 

To explore this, we will simulate evolutionary histories of mammals under our value of $R$. 

Fundamentally, the birth-death process is stochastic. We've been speaking of $S$ and $E$ as rates, but it is more appropriate now to think of them as relative probabilities -- the relative probabilities that each species alive in the simulation will either split into two or die, respectively. 

Simulating the birth-death process is very easy, but the details involve some basic probability theory and calculus that go beyond the scope here. So, we have given you a function that simulates the number of species in a clade, given a set of parameters. 

To simulate a single evolutionary history starting with $N_0 = 2$ species, for $t = 10$ million years, with $S = 1$ and _no extinction_ (i.e. $E= 0$), we should type:

```{r}
simulateBirthDeathRich(t = 10, S = 1, E = 0)
```


\fbox{\begin{minipage}{48em}
Question 5:\\
 
Assume that we are modeling the evolution of mammals as a pure-birth process (i.e., E = 0, see glossary above). What then is the mathematical relationship between the value of S and the value of R? 
\end{minipage}}

Try it yourself: Simulate a history of mammals using the function `simulateBirthDeathRich()`. Did you get something close to the true value of 5500 species? 

\fbox{\begin{minipage}{48em}
Question 6:\\

Now that we can simulate the diversification of a clade, we can in a sense "replay the tape of life" and examine how the results may have been different. Using the BD model, let's simulate the history of mammals 100,000 times to see how the final number of species could vary under this model. Do the following:\\

(I) Simulate a pure-birth process for the mammal clade. Run 100,000 simulations with the same parameters. Hint: Use a loop.
(II) Plot a histogram of the final species richness in each of your simulations.\\
(II) Use the function {\tt abline()} to plot the mean species richness of all your simulations as an orange vertical line.\\
(III) Then use the {\tt abline()} function again to add the actual number of extant mammalian species.\\

What do you think? Were you expecting this range of variation in the outcomes?
\end{minipage}}

# Adding in extinction

In previous exercises, we have been ignoring extinction by setting $E = 0$. But the fossil record tells us that the extinction rate has been nearly equal to the speciation rate for much of the history of mammals. So, let's assume that E is 99% of the speciation rate, or K = E/S = 0.99. 

\fbox{\begin{minipage}{48em}
Question 7:\\

If R has the same value as before (calculated in Q3), what are the new S and E values?
\end{minipage}}

\fbox{\begin{minipage}{48em}
Question 8:\\

Repeat the simulations you did in Q6 with 100,000 clades, but now use the new new S and E values.

Examine the distribution of final species richness. How is this distribution different from the one you saw in Q6?\\

Hint: Use the function {\tt var()}, in the exact same way you use the function {\tt mean()}, to calculate the variance of the number of species in each of the scenarios you simulated.
\end{minipage}}

# Age-richness models and empirical richness through time

Now, we will apply this method to a real richness curve, and compare our estimates with actual data. We will use the dataset provided by Rabosky & Benson (2021).

First we will read in our data.

```{r}
data("timeseries_fossil")
```

We can select among many possible clades to run our analysis:

```{r}
unique(timeseries_fossil$clade)
```

`anth` = Anthozoan corals

`art` = Articulata (an extant subclass of "sea lilies" and "feather stars" within Echinodermata)

`biv` = Bivalve mollusks

`bryo` = Bryozoans

`ceph` = Cephalopod mollusks

`chon` = Cartilaginous fishes

`crin` = The extinct "sea lilies" and "feather stars"

`ech` = Sea urchins

`gast` = Gastropod mollusks

`ling` = A class of brachiopods

`ostr`  = A class of Crustaceans

`tril` = Trilobites 

`foram` = Foraminifers

`dinosauria` = Non-avian, extinct dinosaurs

`graptoloids` = An extinct group of colonial animals

What we will do is fit a birth-death model to a clade's diversification trajectory, but instead of using the present as the reference point for $N_t$, and calculating the diversity trajectory up to the present day, we will choose an arbitrary point in the past and use it as a reference point instead. This will allow us to then project our birth-death model to the actual present, and see if the model can properly predict how diversity has changed since our arbitrary reference point in the past.

To do this, we will do the following:

**Step 1:** Choose a dataset.

```{r}
# Here we will use dinosaurs:
dino_rich <- timeseries_fossil[timeseries_fossil$clade=="dinosauria",]
head(dino_rich)
```

\

**Step 2:** Choose a point in time for which we have data.

```{r}
unique(dino_rich$time_ma)
```

```{r}
# We will use the early Jurassic, around 201 Mya, as our reference for this analysis
t_obs <- 201

rich_early_J <- dino_rich$richness[dino_rich$time_ma==t_obs] 
# species richness from this time point
rich_early_J

# plotting our data and our point of observation:
plot(x = dino_rich$time_ma,
     y=dino_rich$richness,
     xlim=c(max(dino_rich$time_ma), min(dino_rich$time_ma)),
     xlab = "Millions of years ago", ylab = "Dino species richness")

# connecting the dots:
lines(x = dino_rich$time_ma,
     y=dino_rich$richness)

# highlighting the point we will pick:
points(x=t_obs, y=rich_early_J, col="red", pch=16)
abline(v=t_obs, col="red")
```

**Step 3:** Estimate R using the methods above.

As we saw above, to estimate R, we need to know $N_t$, $N_0$, and $t$. Let's use our arbitrary reference time point, 201 Mya, as the 'present', and calculate values for these terms.

```{r}
# Richness at the stem age of the dinosaurs:
rich_0 <- 1

# The start age of our clade marks time 0 for diversification:
t0 = dino_rich$stem_age[1]

# Since we now know N_t, N_0, and the elapsed time, calculate R:
R_dino = (log(rich_early_J) - log(1)) / (t0 - t_obs)
R_dino

# Plot our data and project the richness since the beginning of our time series:
time_from_t0 = t0 - dino_rich$time_ma

projected_rich = rich_0 * exp(R_dino * time_from_t0)

# Finally, we calculate the difference between our projections and the data:
projected_rich-dino_rich$richness

# Plotting the difference between our projections and the data:
plot(x = dino_rich$time_ma,
     y=dino_rich$richness,
     xlim=c(max(dino_rich$time_ma), min(dino_rich$time_ma)),
     )
# Connecting the dots:
lines(x = dino_rich$time_ma,
     y=dino_rich$richness)

# Highlighting our point of observation again:
points(x=t_obs, y=rich_early_J, col="red", pch=16)

# Now, we will add the predicted richness curve based on our estimations:
lines(x = dino_rich$time_ma,
     y=projected_rich, col="red")

#Finally, we plot the differences between our projections and our data using bars:
segments(x0 = dino_rich$time_ma,
         x1 = dino_rich$time_ma,
         y0 = dino_rich$richness,
         y1 = projected_rich,
         col="red")
```

The current axis scale does not allow us to properly visualize the discrepancy between our prediction and the data. So let's plot again, re-scaling the y-axis:

```{r}
#we can plot the data again, this time with extra care with the scale of our axes:
plot(x = dino_rich$time_ma,
     y=dino_rich$richness,
     xlim=c(max(dino_rich$time_ma), min(dino_rich$time_ma)),
     
     #we will change the y-axis limits to fit all our calculations:
     ylim=c(
       min(c(dino_rich$richness, projected_rich)), 
       max(c(dino_rich$richness, projected_rich))),
     )

#connecting the dots:
lines(x = dino_rich$time_ma,
     y=dino_rich$richness)

#highlighting our observation point:
points(x=t_obs, y=rich_early_J, col="red", pch=16)

#adding the predictions of the model:
lines(x = dino_rich$time_ma,
     y=projected_rich, col="red")

#Finally, we plot the differences between our projections and our data using bars:
segments(x0 = dino_rich$time_ma,
         x1 = dino_rich$time_ma,
         y0 = dino_rich$richness,
         y1 = projected_rich,
         col="red")
```

What do you think? Does this model fit our data well?

Now, let's divide the error of our estimates by the actual data to determine what the scale of our error is, in terms of a percentage.

```{r}
perc_error <- (projected_rich - dino_rich$richness)/dino_rich$richness * 100

min(perc_error)
max(perc_error)

hist(perc_error, xlab = "Error as a percentage of the data")
```

# Some food for thought:

(A) What do you think of an estimator that can underestimate by 87% or can overestimate the data by 32,746,816%?

(B) How many species of dinosaurs does this model predict immediately before the asteroid impact (mass extinction) at the K-Pg boundary?

```{r}
proj_rich_dinos_KPg <- (R_dino +1)^(dino_rich$stem_age[1]-66)
proj_rich_dinos_KPg
```

What do you think of this estimate? Before answering, consider that current estimates for the total number of species on Earth are fewer than 10 million species overall.


\fbox{\begin{minipage}{48em}
Challenge (extra question 1):\\

Maybe such a large difference between projection and data is a result of our choice of reference time point. Choose another moment in geologic time for which we have dinousar data, and replicate the analysis above. Do you get a more accurate prediction?
\end{minipage}}

\fbox{\begin{minipage}{48em}
Challenge (extra question 2):\\

Maybe the result we found is typical for dinosaurs only. Choose a different dataset and replicate this analysis. Do you get a more accurate prediction?
\end{minipage}}

\fbox{\begin{minipage}{48em}
Challenge (extra question 3):\\

Consider 2 clades that are the same age and that have identical diversities, i.e. both clade 1 and clade 2 have exactly 500 species each after 100 million years. Can you think of two very distinct diversification scenarios (e.g., richness-through-time plots) that would give you the same estimate of diversification rate?

\end{minipage}}

\fbox{\begin{minipage}{48em}
Question 9:\\

Do you think the estimates we derived from the BD model are a reasonable representation of clade dynamics? Why? 
\end{minipage}}

\fbox{\begin{minipage}{48em}
Question 10:\\

Statistical models (such as the birth-death estimator we used in this lab) often fail to explain data because their assumptions are unrealistic. Can you identify one or more potentially unrealistic assumptions in the analyses we did here?\\

\end{minipage}}

# References:
Rabosky, D. L., & Benson, R. B. (2021). Ecological and biogeographic drivers of biodiversity cannot be resolved using clade age-richness data. Nature communications, 12(1), 1-10.