# modeling git-DAG evolution

13 Jan 2014

Projects with concurrently working contributors and complex contribution schemes are common. Version control systems grant granular control over how project updates are integrated or discarded into different versions of the project. Originally designed to manage updates to the Linux kernel, git is a simple yet powerful tool used to manage these sorts of projects. The management of a git project, or repository, is facilitated in large part by three git commands:

1. git-commit allows workers to save the current state of their work
2. git-merge allows workers to merge their work with another set of work and resolve copy conflicts, and
3. git-branch allows workers to create a new copy from an existing set of work.

What’s more, git-log reports the entire history of branches, merges, and commits in the repository. Each event records both the SHA-1 hash key assign to itself and to its parent. Treating events as nodes and the parent-child relationships as edges, a repository’s history is a directed acyclic graph (DAG). Here’s an example of a git-DAG I simulated then pushed to http://github.com/mlandis/git-coal:

Evolutionary biologists also study a sizeable project with concurrently working contributors and complex contribution schemes. What might we learn about the “biodiversity” contained within version control system repositories? For instance, how might we model the branch-merge-commit process of a git-DAG?

Searching online, I couldn’t find many examples of probabilistic models of version control system evolution. So, I began by modeling a git-DAG using a simple continuous-time Markov chain. Our process value, $N_t$, equals the number of branches active at time $t$, and undergoes transitions, $m \rightarrow n$, according to the instantaneous rate matrix

Each active branch undergoes a branch, merge, or commit event at rates $r_B$, $r_M$, and $r_C$ respectively. Unlike phylogenetic inference, where the evolutionary history shared among a group of taxa is unknown, we know exactly what happens and when from git-log. This means we don’t have to integrate over unobserved intermediate git events.

The probability for an observed sequence of competing exponentially-distributed events is easy to compute. We take the initial state and time (in our case, this is $N_{t}=1$ and $t=0$). Then, proceeding one event at a time, we compute the probability that an event of any type occurs given the chain is in state $m$, which is

where

is the rate moving away from the current state, times the probability of the next event being of type $n$ given the chain is state $m$, given as

which simplifies to

The likelihood of a sequence of timed git events is simply the product of these stepwise probabilities. (There is typically a final term to compute the probability that no further events occur given the time of the final observed event. I decided to omit this factor since many git projects are intensely active then go dormant for years on end.)

To infer the parameters of this process, I set up a slapdash Markov chain Monte Carlo implemented in Python and found at https://github.com/mlandis/git-coal/blob/master/git_dag.py.

Finally, I applied this model to a simulated git-log history. Looking at the posterior from the analysis using Tracer, we see the rate of branch and commit events are approximately equal (which is true under my simulation settings).

Many disregarded factors should affect the rates of branching, merging, and committing, such as who the user is, the absolute time (e.g. of day or year), which branch is being worked on (e.g. master vs. patch), the distribution of commit sizes, which files are checked out, etc. Since git-log retains records for all these variables, they can be modeled as affecting the event rates very easily.

I’ll be playing around with this more in my spare time, but that’s all for now.

# new year, new site

04 Jan 2014

For the past year, I hosted my research progress on Google Sites. This is a wonderful free option for hosting cookie-cutter websites. Unfortunately, executing arbitrary Javascript within Google Sites was not permitted for security reasons. Because of this, I never managed to configure Google Sites to properly display syntax-colored code blocks or LaTeX-based equations.

I moved my content to GitHub Pages which is compatible with Jekyll-generated static content. Jekyll supports MathJax and Pygments, enabling LaTeX-rendering and code-syntax-highlighting, respectively.

For example, a Jekyll Markdown post containing the code

$$L_{l}^{(i)} (s) = \left[ \sum_{s_{j}} \text{Prob} (s_j \mid s, v_j) L_{j}^{(i)}(s_j) \right] \times \left[ \sum_{s_{k}} \text{Prob} (s_k \mid s, v_k) L_{k}^{(i)}(s_k) \right]$$


with MathJax produces

…and the code

{% highlight python %}
def f(L,tp,j,k,l):
for sl,ll in enumerate(L[l]):
llj = sum([ lj * tp[l,j,sl,sj] for sj,lj in enumerate(L[j]) ])
llk = sum([ lk * tp[l,k,sl,sk] for sk,lk in enumerate(L[k]) ])
L[l,sl] = llj * llk
{% endhighlight %}


with Pygments produces

…and voilà! We have a major component of Felsenstein’s pruning algorithm!

With these features working, I’m looking forward to writing practical posts containing code snippets or educational posts dissecting the math we use to model evolution.

All the code used for the site (and this post) are located at http://github.com/mlandis/mlandis.github.io.

# fall updates

18 Nov 2013

It’s been a busy semester! I’m studying in Durham, North Carolina under the NESCent graduate fellowship, where daily I grow fonder of the city’s rugged tobacco-and-cinder aesthetic. Not so fond that I’ve forgotten Oakland, of course.

While here, part of my plan is to explore new biogeographic range evolution models designed for large numbers of areas (see initial work here). The data augmentation method I’m using was originally introduced to phylogenetics for use with protein evolution by Robinson et al. (2003) from Jeff Thorne’s group and others. To my great fortune, being a short bus ride away to Raleigh from Durham, Jeff generously agreed to act as my research advisor for the semester.

In early September, I presented some past work on applying Lévy processes to phylogenetic inference (collaboration with Josh Schraiber and Mason Liang) at the Mathematics for an Evolving Biodiversity workshop hosted at the University of Montréal. A very stimulating conference! Mid-September, I presented on Bayesian biogeography for the Phylogenetics & Evolutionary Biology Seminar group at NCSU.

With the presentations aside, I’ve had time to focus more on my fellowship project. With a little effort, I now have a working version of BayArea implemented in RevBayes. It’s extremely easy to tinker with new dispersal models in this framework, so I expect my next biogeography inference method will be released solely under RevBayes.

Finally, I’m happy to say that Phylowood – what I began on a bit of a whim as a “fun and relaxing” post-qualifying exam project, then matured into a Google Summer of Code project – is now published as a Bioinformatics Application Note (link). Trevor Bedford, my GSoC mentor, was very supportive throughout the whole process, and even covered the Open Access fee.

Here’s a gif demonstrating some of the features of Phylowood:

Plenty in the pipeline to come, but that’s all to report on for now!