Michael Landis

bash scripting with RevBayes

09 Jun 2016

This post is part of a series exploring features in RevBayes.

Whether working on genomic datasets or conducting a simulation study, research projects often require identical or nearly identical jobs to be replicated across vast numbers of datasets. The front end of RevBayes is an interpreted language, which provides users with an agile and powerful interface for scripting.

Let’s set up a little scenario to make it easy on the imagination. Suppose your task is to assess the sensitivity of posterior tree probabilities for one simple and one complex model—say, a Jukes-Cantor model, where all transition rates are equal, and the Felsenstein 81 model, where base frequencies are free parameters to be estimated. You’ve stored multiple sequence alignments for all the genes of interest in the folder genes in NEXUS format. You want to use RevBayes to estimate the posterior density for each gene, once assuming the rate matrix is fnJC and a second time assuming it is fnF81. Instead of repeatedly modifying and running your RevBayes script, rb_gene_model.Rev, you can automate the job in bash using the echo command and pipes (|).

To follow along or download the scripts, issue these commands in the shell.

> git clone https://github.com/mlandis/revbayes_bash_scripts.git
> cd revbayes_bash_scripts/batch_job

First, we’ll create a RevBayes script called rb_gene_model.Rev that expects three pre-defined variables: data_file gives the name of the gene alignment, job_id gives the analysis an identifier that matches the filename, and type_Q gives the type of rate matrix to use.

# print errors for undefined variables
if (!exists("data_file")) "ERROR: `data_file` undefined"
if (!exists("type_Q"))    "ERROR: `type_Q` undefined"
if (!exists("job_id"))    "ERROR: `job_id` undefined"

# settings
out_prefix = job_id + ".Q_" + type_Q 
out_fp = "output/"

# move/monitor vector indices
mni = 1
mvi = 1

# read data
data = readDiscreteCharacterData(data_file)
taxa = data.taxa()

# define tree model
tree ~ dnBDP(lambda=1, mu=0, rootAge=1, taxa=taxa)
mv[mvi++] = mvNNI(tree, weight=taxa.size())
mv[mvi++] = mvNodeTimeSlideUniform(tree, weight=taxa.size())

# define clock model
clock ~ dnExp(10)
mv[mvi++] = mvScale(clock, weight=2)

# define rate matrix based on value of type_Q
if (type_Q=="JC") {
    Q <- fnJC(4)
} else if (type_Q=="F81") {
    bf ~ dnDirichlet([1,1,1,1])
    mv[mvi++] = mvSimplexElementScale(bf, weight=2)
    Q := fnF81(bf)
}

# create phylogenetic model distribution
seq ~ dnPhyloCTMC(tree=tree,
                  Q=Q,
                  branchRates=clock,
                  nSites=data.nchar(),
                  type="DNA")

# attach the data
seq.clamp(data)

# make the model
mdl = model(seq)

# add monitors
mn[mni++] = mnScreen(clock, printgen=100)
mn[mni++] = mnFile(tree, filename=out_fp+out_prefix+".tre", printgen=100)
mn[mni++] = mnModel(filename=out_fp+out_prefix+".params.log", printgen=100)

# construct MCMC
ch = mcmc(model=mdl, moves=mv, monitors=mn)

# burn-in then run the analysis
ch.burnin(5000, 100)
ch.run(10000)

# quit upon completion
quit()

Next we’ll create a bash script called rb_gene_job.sh. When called with arguments, RevBayes treats them as files to be sourced; it also reads commands from standard input, which means we can pipe (|) the output of echo into rb as if it were a sourced script. Perfect for scripting!

#!/bin/bash

# arg1, input filename
file=$1

# arg2, stripped filename
id=$2

# arg3, rate matrix type
type_Q=$3

# construct the command string
rb_command="data_file = \"$file\";
            job_id = \"$id\";
            type_Q = \"$type_Q\";
            source(\"rb_gene_model.Rev\");"

# pipe the command into RevBayes
echo $rb_command | rb

Third, we’ll create rb_gene_batch.sh to repeatedly call rb_gene_job.sh with different combinations of arguments. Note that debug=0 by default, which causes each RevBayes job to run in the background (&) and to keep running after the terminal is closed (nohup). These are useful features when running jobs on clusters.

#!/bin/bash
debug=0

# accepts a file list, e.g. "*.nex"
filelist=$@

# model list must match .Rev script
models=("JC" "F81")

# call `rb_gene_job.sh` for each file
for file in $filelist; do
    for m in ${models[@]}; do
        echo "Running model \"$m\" for \"$file\""
        id=$(basename ${file%%.*})
        if [ $debug == 0 ] ; then
            # run all jobs without stdout
            nohup ./rb_gene_job.sh $file $id $m &>/dev/null &
        else 
            # run serial jobs with stdout
            ./rb_gene_job.sh $file $id $m
        fi
    done
done
echo "...done!"

Everything is in place, and we can run the script.

> chmod +x *.sh
> ./rb_gene_batch.sh genes/*.nex
Running model "JC" for "genes/primates_cox2.nex"
Running model "F81" for "genes/primates_cox2.nex"
Running model "JC" for "genes/primates_cytb.nex"
Running model "F81" for "genes/primates_cytb.nex"
...done!

…and, in time, the results will appear in the output directory.

> ls -1 output
primates_cox2.Q_F81.params.log
primates_cox2.Q_F81.tre
primates_cox2.Q_JC.params.log
primates_cox2.Q_JC.tre
primates_cytb.Q_F81.params.log
primates_cytb.Q_F81.tre
primates_cytb.Q_JC.params.log
primates_cytb.Q_JC.tre

That’s it!


graphical models in RevBayes

16 May 2016

This post is part of a series exploring features in RevBayes.

You bet heads on the coin flip, and lose again. Your host wonders aloud what the chances of losing five games in a row might be. Most would reply that the odds of heads versus tails are fifty-fifty, which comes to $(1/2)^5 = 1/32$ for five trials, assuming a fair coin. But, then again, the hour is late, the carnival is closed, and your host with the pointed mustache pours you more wine. You begin to question whether it is a fair coin.

This is to ask how the data are distributed, and for what parameter values. A coin flip may be represented as a random variable, $X$, which can either be heads, $X=1$, or tails, $X=0$. The Bernoulli distribution says the random variable takes the value $X=1$ with probability $p$ (and $X=0$ with probability $1-p$). That is, $p$ is a parameter that controls how fair the coin is. Mathematically, $X \sim \text{Bernoulli}(p=0.5)$ states that the value of $X$ is distributed by a Bernoulli distribution, where heads and tails each occur with probability $0.5$. The naive and trusting model, $M_0$, is programmed in RevBayes like so

> # M0: the mark's coin-flipping model
> h ~ dnBernoulli(p=0.5)
> h.clamp(1)
> h.probability()
   0.5

Alternatively, a model anticipating a con game, $M_1$, seeks to represent that the fairness of the coin is unknown, but biased towards flipping tails (i.e. $p < 0.5$ is expected). $M_1$ sets $p = q^2$, where $q$ might take on any possible amount of unfairness, $q \in (0, 1)$. This is described mathematically with the three statements $q \sim \text{Uniform}(0, 1)$, $p = q^2$, and $X \sim \text{Bernoulli}(p)$. The corresponding RevBayes code for $M_1$ is

> # M1: the grifter's coin-flipping model
> lower <- 0.0
> upper <- 1.0
> q ~ dnUniform(lower, upper)
> p := q^2 
> h ~ dnBernoulli(p)
>
> # Assume a heads is flipped
> h.clamp(1)
> 
> # What is the likelihood if p==0.25 (q=0.5)?
> q.setValue(0.5)
> h.probability()
   0.25
>
> # What if p==0.01 (q=0.1)?
> q.setValue(0.1)
> h.probability()
   0.01

In the example for $M_1$, we see that the probability the coin is heads, h, depends on the value of p. p itself depends on the value of q, which, in turn, is drawn from a uniform distribution with bounds imposed by lower and upper.

Probabilistic graphical models

A pattern is emerging in which model complexity grows with the number of parameters and the dependencies among those variables. When a model has very few variables, as in our coin-flipping example, these parameter interactions are easy to summarize: the dependencies between all random variables in the model can be written explicitly as equations. For large and complex models, like phylogenetic models, the sheer number of parameter interactions can be overwhelming.

[Figure: graphical model]

Any model with a strong and separable dependency structure may be equivalently represented as a probabilistic graphical model, and visualized as a directed acyclic graph. RevBayes fundamentally treats models as probabilistic graphs, a perspective that offers several advantages. First, it offers a powerful visualization tool, useful both for summarizing and reducing model complexity, and for teaching. The dependency structure also lends itself to various computational efficiencies, including marginalization techniques, Markov chain Monte Carlo, and generative simulations. Graphs are also modular, in that they can be decomposed into subgraphs while retaining their internal topologies. This improves code design and reusability, both when specifying models using the Rev language and when developing the back-end architecture powering RevBayes. These sentiments are summarized in Hoehna et al. (2014) and will be reinforced throughout this series of posts.
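To make the modularity point concrete, here is a minimal Rev sketch (the file names gene_1.nex and gene_2.nex are hypothetical) in which two genes share a single tree-and-clock subgraph, while the same few lines of code build an independent rate-matrix subgraph for each gene.

# read two (hypothetical) alignments
data_1 = readDiscreteCharacterData("gene_1.nex")
data_2 = readDiscreteCharacterData("gene_2.nex")

# one tree subgraph and one clock subgraph, shared by both genes
tree ~ dnBDP(lambda=1, mu=0, rootAge=1, taxa=data_1.taxa())
clock ~ dnExp(10)

# the same lines build a separate GTR subgraph for each gene
for (i in 1:2) {
    er[i] ~ dnDirichlet(rep(1, 6))
    bf[i] ~ dnDirichlet(rep(1, 4))
    Q[i] := fnGTR(er[i], bf[i])
}

# each gene's sequence model attaches its own Q subgraph to the shared tree
seq_1 ~ dnPhyloCTMC(tree=tree, Q=Q[1], branchRates=clock, nSites=data_1.nchar(), type="DNA")
seq_2 ~ dnPhyloCTMC(tree=tree, Q=Q[2], branchRates=clock, nSites=data_2.nchar(), type="DNA")
seq_1.clamp(data_1)
seq_2.clamp(data_2)

Because each subgraph keeps its internal topology, the same pattern scales to any number of genes, or to mixing entirely different substitution modules within one analysis.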

A brief overview of specifying graphical models in RevBayes is given below.

Constant nodes: solid squares

A model variable whose value is known is a constant node. Constant nodes are fixed during inference and are not estimated. A constant node is created using the <- assignment operator.

> p <- 0.5
> p
   0.5

Stochastic nodes: solid circles

Probabilistic inference is primarily interested in learning the values of the unknown parameters that summarize the data. Each of these parameters is a stochastic node. Stochastic nodes are created using the ~ operator, where the right-hand side gives the distribution (and parameters) that measure the probability of the random variable on the left-hand side.

> p <- 0.5
> h ~ dnBernoulli(p)
>
> # h is initialized with a random value
> h
    0
> h.probability()
    0.5

RevBayes works with generative models, which allows for an exact match between the simulation model and the inference model. This greatly simplifies workflows involving simulation analyses, such as model validation and statistical power analyses.
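As a minimal sketch of this idea, the coin-flip model above doubles as a simulator. This assumes the stochastic-node method redraw(), which re-simulates a node's value from its distribution; the printed values below are just one possible outcome.

> p <- 0.25
> h ~ dnBernoulli(p)
>
> # simulate: each call to redraw() re-draws h from its distribution
> h.redraw()
> h
    0
> h.redraw()
> h
    1

Clamping the very same node to observed data, as shown further below, turns this simulation model into the inference model without rewriting any part of it.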

Deterministic nodes: dotted circles

A variable whose value is exactly determined by other model variables is a deterministic node. Deterministic nodes have a value equal to a function of any number and combination of constant, deterministic, and stochastic nodes, which is defined using the := operator. This affords great flexibility in parameterizing the model.

> q <- 0.5
> p := q^2
> p
   0.25

Deterministic nodes are immediately updated when any of their parental nodes’ values are changed.

> q <- 0.4
> p 
   0.16

The probability of a stochastic node changes whenever its distribution’s parameters change.

> q <- 0.5
> p := q^2
> h ~ dnBernoulli(p)
> h.probability()
   0.25
> q <- 0.4
> h.probability()
   0.16

Observed (stochastic) nodes: shaded circles

During inference, some random variables are observed to have particular values. For example, an individual coin (the variable) might show heads or tails (the values) after being flipped. Once the coin is observed to be, say, tails ($X=0$), the question of interest becomes what the likelihood of that outcome is under the coin's distribution. That is, an unobserved random variable might take any number of values, but an observed random variable is “clamped” to the actual value of the trial. To clamp a stochastic node to an observed value:

> q <- 0.5
> p := q^2
> h ~ dnBernoulli(p)
> obs <- 0
> h.clamp(obs)
> h.probability()
    0.75

In effect, calling clamp sets the node’s value and marks it as contributing to the model likelihood (rather than to the prior). Future posts will discuss how this distinction affects certain parts of Bayesian analyses, such as computing Bayes factors.

Plates (replicated subgraphs)

Models often contain repetitive structures, which may be represented compactly. Graphical models use plates for this purpose; in RevBayes, for loops are the simplest way to represent them.

> p <- 0.25
> for (i in 1:100) {
+     x[i] ~ dnBernoulli(p)
+ }
> x_mean := mean(x)
> x_mean
    0.22

They are also useful for succinctly imposing a Markov property between model variables.

> x[1] ~ dnNormal(mean=0, sd=1)
> dx[1] <- 0.0
> for (i in 2:10) {
+     x[i] ~ dnNormal(mean=x[i-1], sd=1)
+     dx[i] := x[i] - x[i-1]
+ }
> x
   [ 0.596103, 0.291775, 1.02862, -0.0848241, -3.08975,
     -2.90331, -3.41427, -4.46115, -5.62696, -6.33428 ]
> dx
   [ 0, -0.304328, 0.736843, -1.11344, -3.00493, 
     0.186437, -0.510954, -1.04688, -1.16581, -0.707318 ]

Parent-child relationships

The structure (or str) function is useful for learning how the nodes in a model relate to one another. For example, the node q is the parent of p and the child of lower and upper.

> lower <- 0.0
> upper <- 1.0
> q ~ dnUniform(lower, upper)
> p := q^2
> structure(q)

   _variable     = q
   _dagType      = Stochastic DAG node
   _clamped      = FALSE
   _lnProb       = 0
   _parents      = [ lower, upper ]
   _children     = [ p ]

Assigning a new value to p will cause q to lose p as a child.

> p <- 5.0
> structure(q)

   _variable     = q
   _dagType      = Stochastic DAG node
   _clamped      = FALSE
   _lnProb       = 0
   _parents      = [ lower, upper ]
   _children     = [ ]

Wrapping up

Of course, phylogenetic models are more complex than coin-flipping experiments, but the basic building blocks used to compose richer models are essentially the same. That’s the beauty of graphical models: simple relationships can generate scalable complexity. In the long run, the goal is not only to build models that formalize our understanding of nature, but also to learn about the behavior of the processes that generate biodiversity.

Next time we’ll look into RevBayes data structures.


RevBayes overview

27 Jan 2016

This is the first of a series of posts to outline how to infer a molecular phylogeny in RevBayes. My goal is to demonstrate the flexibility and power of RevBayes using simple, modifiable code snippets. These posts will additionally serve as a foundation to explore advanced techniques and models in the future – which is the fun part! Topic-specific tutorials are also available online.

But what is RevBayes? RevBayes is an open source software package for Bayesian phylogenetic inference specified through probabilistic graphical models and an interactive language. This design allows researchers to estimate species relationships using a modeling interface that is simple, flexible, and efficient. The result is that phylogenetic model space becomes more compact (without shrinking), so it may be explored with ease by empirical and theoretical biologists, who may have great ideas but do not have the time or skills to translate them into computer code.

Let’s concretely examine what this means for phylogenetic models. The most widely adopted phylogenetic models rely on four submodel components (or modules): the diversification process, the molecular substitution process, the branch-rate variation model (or relaxed clock model), and the site-rate variation model. Each of these modules comes in a variety of flavors. For example, there is a vast number of substitution processes to choose from, depending on which evolutionary features you wish to model – e.g. Do transitions and transversions occur at equal rates? Do all bases occur at equal frequencies? Should the process be time-reversible? Each of these models is fully described by an instantaneous rate matrix, and they differ only in how the rate-matrix elements are parameterized: are all rates fixed to be equal, all different, or somewhere in between? Ideally, a researcher should be able to compose her phylogenetic model not only from canonical modules described in the literature, but also from new types within any given class of modules that she imagines.
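For a concrete sense of how those parameterization choices look in Rev, here is a small sketch contrasting three rate matrices. fnJC and fnGTR appear elsewhere in these posts; fnHKY and dnLognormal are assumed here to be the standard RevBayes HKY rate-matrix function and lognormal distribution, and the priors are just one reasonable choice.

# Jukes-Cantor: all rates equal, no free parameters
Q_jc <- fnJC(4)

# HKY: a transition/transversion rate ratio plus base frequencies
kappa ~ dnLognormal(0, 1)
bf    ~ dnDirichlet(rep(1, 4))
Q_hky := fnHKY(kappa, bf)

# GTR: six exchangeability rates plus base frequencies
er    ~ dnDirichlet(rep(1, 6))
Q_gtr := fnGTR(er, bf)

Swapping one of these Q nodes into the dnPhyloCTMC call of the boilerplate model below is all it takes to change the substitution module.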

The following posts will explore how these modules interact and how they may be customized in RevBayes. As mentioned earlier, RevBayes specifies models through a programming language. Learning a new language begins with exposure, so a natural place to start is with a boilerplate phylogenetic model of molecular substitution.

# Read the NEXUS file
data = readDiscreteCharacterData("turtles.nex")

################################
# Create a birth-death process #
################################

# birth rate
birth ~ dnExp(10)

# death rate
death ~ dnExp(10)

# age of the root node
root_age ~ dnUniform(0, 100)

# time tree generated by a birth-death process
tree ~ dnBDP(lambda=birth, 
             mu=death,
             rootAge=root_age,
             taxa=data.taxa())

################################################
# Create a general time-reversible rate matrix #
################################################

# exchangeability rates
er ~ dnDirichlet(rep(1, 6))

# stationary frequencies
bf ~ dnDirichlet(rep(1, 4))

# GTR rate matrix
Q := fnGTR(er, bf)

###################################################
# Create an independent gamma rates relaxed clock #
###################################################

# a global molecular clock
base_rate ~ dnExp(10)

# for each branch in the tree
n_branches = 2 * data.ntaxa() - 2
for (i in 1:n_branches) {

    # create a rate multiplier for each branch
    branch_rates_mult[i] ~ dnGamma(2,2)

    # rescale the global clock by the rate multiplier
    branch_rates[i] := base_rate * branch_rates_mult[i]  
}

##################################################################
# Create a +Gamma site-rate variation model with five categories #
##################################################################

# shape (and rate) parameter for among-site rate variation
alpha ~ dnUnif(0, 20)

# create a Gamma(alpha, alpha) distribution
rate_distribution = dnGamma(alpha, alpha)

# create a deterministic vector of Gamma(alpha, alpha) quintiles
n_cats = 5
site_rates := fnDiscretizeDistribution(rate_distribution, n_cats)

#################################################
# Create the phylogenetic model of substitution #
#################################################

seq ~ dnPhyloCTMC(tree=tree,
                  Q=Q,
                  branchRates=branch_rates,
                  siteRates=site_rates,
                  nSites=data.nchar(),
                  type="DNA")

###########################################
# Treat the data as observed by the model #
###########################################

seq.clamp(data)

For now, just give the code a glance and keep it in mind for the future. If you are well-versed in phylogenetic models, the structure of some variables should be familiar. The anatomy of the code will be covered in detail throughout the upcoming series of posts. By the end, you’ll be comfortable reading and modifying the code to suit your dataset and interests. Below is a tentative outline, which will be updated with post links as they’re completed.

Topics

  • Graphical models
  • Reading data
  • Phylogenetic substitution models
  • Rate matrices
  • Substitution processes
  • Diversification processes
  • Relaxed clocks
  • Site-rate variation
  • Estimating the posterior
  • Advanced analyses
  • Scripting with RevBayes

Of course, feel free to send me an email or tweet with any feedback!