Michael Landis

RevBayes overview

27 Jan 2016

This is the first of a series of posts to outline how to infer a molecular phylogeny in RevBayes. My goal is to demonstrate the flexibility and power of RevBayes using simple, modifiable code snippets. These posts will additionally serve as a foundation to explore advanced techniques and models in the future – which is the fun part! Topic-specific tutorials are also available online.

But what is RevBayes? RevBayes is an open source software package for Bayesian phylogenetic inference specified through probabilistic graphical models and an interactive language. This design allows researchers to estimate species relationships using a modeling interface that is simple, flexible, and efficient. The result is that phylogenetic model space becomes more compact (without shrinking), so it may be explored with ease by empirical and theoretical biologists, who may have great ideas but do not have the time or skills to translate them into computer code.

Let’s concretely examine what this means for phylogenetic models. The most widely adopted phylogenetic models rely on four submodel components (or modules): the diversification process, the molecular substitution process, the branch-rate variation model (or the relaxed clock model), and site-rate variation models. Each of these modules comes in a variety of flavors. For example, there are a vast number of substitution processes depending on what evolutionary features you wish to model – e.g. Do transitions and transversions occur at equal rates? Do all bases occur at equal frequencies? Should the process be time-reversible? Each of these models is fully described by an instantaneous rate matrix, and only differ in how the rate matrix elements are parameterized: are all rates fixed to be equal, all different, or somewhere in between? Ideally, a researcher should be able to compose her phylogenetic model not only from canonical modules described in the literature, but also to apply new types within any given class of modules that she imagines.

The following posts will explore how these modules interact and how they may be customized in RevBayes. As mentioned earlier, RevBayes specifies models through a programming language. Learning a new language begins with exposure, so a natural place to start is with a boilerplate phylogenetic model of molecular substitution.

# Read the NEXUS file
data = readDiscreteCharacterData("turtles.nex")

################################
# Create a birth-death process #
################################

# birth rate
birth ~ dnExp(10)

# death rate
death ~ dnExp(10)

# age of the root node
root_age ~ dnUniform(0, 100)

# time tree generated by a birth-death process
tree ~ dnBDP(lambda=birth, 
             mu=death,
             rootAge=root_age,
             taxa=data.taxa())

################################################
# Create a general time-reversible rate matrix #
################################################

# exchangeability rates
er ~ dnDirichlet(rep(1, 6))

# stationary frequencies
bf ~ dnDirichlet(rep(1, 4))

# GTR rate matrix
Q := fnGTR(er, bf)

###################################################
# Create an independent gamma rates relaxed clock #
###################################################

# a global molecular clock
base_rate ~ dnExp(10)

# for each branch in the tree
n_branches = 2 * data.ntaxa() - 2
for (i in 1:n_branches) {

    # create a rate multiplier for each branch
    branch_rates_mult[i] ~ dnGamma(2,2)

    # rescale the global clock by the rate multiplier
    branch_rates[i] := base_rate * branch_rates_mult[i]  
}

##################################################################
# Create a +Gamma site-rate variation model with five categories #
##################################################################

# shape and scale parameter for rate variation
alpha ~ dnUnif(0, 20)

# create a Gamma(alpha, alpha) distribution
rate_distribution = dnGamma(alpha, alpha)

# create a deterministic vector of Gamma(alpha, alpha) quintiles
n_cats = 5
site_rates := fnDiscretizeDistribution(rate_distribution, n_cats)

#################################################
# Create the phylogenetic model of substitution #
#################################################

seq ~ dnPhyloCTMC(tree=tree,
                  Q=Q,
                  branchRates=branch_rates,
                  siteRates=site_rates,
                  nSites=data.nchar(),
                  type="DNA")

###########################################
# Treat the data as observed by the model #
###########################################

seq.clamp(data)

For now, just give the code a glance and keep it in mind for the future. If you are well-versed in phylogenetic models, the structure of some variables should be familiar. The anatomy of the code will be covered in detail throughout the imminent series of posts. By the end, you’ll be comfortable reading and modifying the code, tailored to your datset and interests. Below is a tentative outline, which will be updated with post links as they’re completed.

Topics

  • Graphical models
  • Reading data
  • Phylogenetic substitution models
  • Rate matrices
  • Substitution processes
  • Diversification processes
  • Relaxed clocks
  • Site-rate variation
  • Estimating the posterior
  • Advanced analyses
  • Scripting with RevBayes

Of course, feel free send me an email or tweet with any feedback!