Michael Landis

visualizing uncertainty in admixture graphs

07 Aug 2013

I’ve been working on a Bayesian implementation of Joe Pickrell and Jonathan Pritchard’s admixture model (implemented as TreeMix). Their model extends Cavalli-Sforza and Edwards’ seminal work on phylogenetic analysis, which assumed that modern populations allele frequencies evolved according to a Brownian motion model and covary according to an underlying bifurcating tree. By permitting admixture edges, the Pickrell-Pritchard model can capture signals of gene flow in cases where a bifurcating tree describes covariance in the populations’ allele frequencies poorly (which is anything but a rare occurrence in, say, humans). These admixture edges transform our beautiful (though biologically unrealistic) bifurcating tree into a less wieldy tree-like graph. While phylogenetics have used consensus trees to describe topological uncertainty, I couldn’t find a good way to summarize uncertainty for this sort of tree-like directed acyclic graph (DAG).

One approach I’m exploring is to plot the majority rule consensus tree for the underlying bifurcating divergence tree. Then conditioning on that tree, plot the admixture edges with posterior probability greater than, say, 0.5 given that the source and destination branches exist. This gives you a conditioned majority rule consensus DAG of sorts.

Now, I’ve been interested in how well this method works for only a single diploid sample per population, so data was simulated for 20 populations with two samples per population and 100000 SNPs per sample. You can see the resulting conditioned majority rule consensus DAG below, generated by some R scripts that parse RevBayes’ MCMC output.

The tree height is one, but scaled by a parameter that captures information about the mean population size, time to the most recent common ancestor, and generation time (not shown). Branch lengths are proportional to time and widths are informative of population size relative to the population size mean (log scale). All divergence events were supported with posterior probability 1.0, so those values are not shown. Although time and population size aren’t identifiable, they are useful to separate since I model admixture events to occur instantaneously in time. Admixture edges report their posterior probability and mean posterior admixture weight.

What’s important to note is that the analysis records the admixture edges p4->p5 and p3->(p7,p8) as having high posterior probability. There’s some uncertainty in the exact placement of edge p3->(p7,p8). The order of admixture events is reversed, partly owing to the non-identifiability of age from population size. Finally, since population size and time aren’t identifiable and are inversely related, we see the model redistributes these parameter values fairly evenly (notably, the sister lineage to p0 is lengthened, possibly due to the birth-death prior on divergence times).

True graph:

True admixture graph

Inferred graph:

Inferred admixture graph


summer updates

02 Jun 2013

Evolution 2013 was very enjoyable, though exhausting (my classmates and I carpooled and camped). To share some of my “paperless” research, such as my Evolution talk, I created a Figshare account.

I received some great feedback about BayArea at the conference, both from empiricists and theorists. Designing new models is the next part, which is driving force behind increasing the number of areas per analysis. As an aside, I updated BayArea to fix a problem for proposing histories for small numbers of areas (N < 10). Thanks to Julien Vieu for mentioning the problem.

Also, I updated creepy-jerk (continuous character evolution using Lévy processes) to improve performance for computing the compound Poisson process with normally distributed jumps. While doing this, I modified the code to produce a FigTree-compatible output file that indicates the size and polarity of jumps on the phylogeny (using the posterior of sampled jumps and the signal-to-noise ratio divided by the square root of the branch length).

creepy-jerk posterior jumps

Looks nice, I think!


BayArea v1.0 release

09 May 2013

Exciting news! I just uploaded the first version of BayArea, a Bayesian method to infer ancestral species ranges using a molecular phylogeny and presence-absence data. The main focus of the method is to accommodate a very large number of discrete areas, constituting a geography over which species’ ranges span and may change with time through the gain and loss of area occupancy. BayArea allows the inclusion of hundreds to thousands of areas per analysis, a substantial improvement from the previous limit of ten to twenty areas. The source code, manual, and an example dataset are available at http://bayarea.googlecode.com. You should expect to see a Systematic Biology paper exposing the technical details soon (written with John Huelsenbeck, Nick Matzke, and Brian Moore).