Wednesday 27 February 2013

The Binomial Distribution, Python and Bisulphite Sequencing

This is Part 2 in a series on Bisulphite Sequencing.
You can return to Part 1 (Post Processing Bismark Bisulphite Sequencing Data) or skip to Part 3 (Simple Visualisation of Bisulphite Sequencing Data).

Today I'm going to describe how the binomial distribution can be applied to bisulphite sequencing data in order to accurately determine the number of CpG sites which exhibit methylation in a sample.

The binomial distribution is actually pretty straightforward.  If you have a coin that has a probability $$$p$$$ to come up heads then the binomial distribution can tell us the probability that if you throw $$$n$$$ coins then what is the probability that you will get $$$k$$$ heads.  We call this the probability mass function.

Let's do the simplest example first.

If I throw one coin and it has a probability of 0.5 (an unbiased coin) then we have two outcomes; we throw a heads or we throw a tails.

Result of 1st Coin Toss Number of Heads Overall Probability
Tails (p=0.5) 0 0.5
Heads (p=0.5) 1 0.5

Ok, that's fairly trivial. We have two possible outcomes each with a possibility of 0.5. So our probability of getting exactly one heads is simply 0.5.

Let's add a second coin toss. Again we keep our probability as 0.5 chance for a heads.

Result of 1st Coin Toss Result of 2nd Coin Toss Number of Heads Overall Probability
Tails (p=0.5) Tails (p=0.5) 0 $$$0.5 * 0.5 = 0.25$$$
Tails (p=0.5) Heads (p=0.5) 1 $$$0.5 * 0.5 = 0.25$$$
Heads (p=0.5) Tails (p=0.5) 1 $$$0.5 * 0.5 = 0.25$$$
Heads (p=0.5) Heads (p=0.5) 2 $$$0.5 * 0.5 = 0.25$$$

Each coin toss still has a 0.5 probability of coming up heads but we now have four possible results.  We can work out the probability of each result by multiplying the probability of the first coin toss by the probability of the second coin toss ($$$0.5 * 0.5 = 0.25$$$ for each row above).

Note that in both examples above the probabilities for the table always add up to 1.  In the second table we have one row where we get 2 heads, two rows where we get 1 heads and one row where we get 0 heads.  So our cumulative probability for getting 2, 1 or 0 heads is 0.25, 0.5 and 0.25 respectively.

Let's change the odds.  We now have a coin that we've fixed to be strongly biased towards tails.  Our probability is now only 0.05 that we will get a heads and 0.95 that we will get a tails.

Result of 1st Coin Toss Result of 2nd Coin Toss Number of Heads Overall Probability
Tails (p=0.95) Tails (p=0.95) 0 $$$0.95 * 0.95 = 0.9025$$$
Tails (p=0.95) Heads (p=0.05) 1 $$$0.95 * 0.05 = 0.0475$$$
Heads (p=0.05) Tails (p=0.95) 1 $$$0.05 * 0.95 = 0.0475$$$
Heads (p=0.05) Heads (p=0.05) 2 $$$0.05 * 0.05 = 0.0025$$$

We expect in most cases that we will get two tails and we see that that happens about 90% of the time from above.  We can also see that our probability of getting two heads is the sum of the middle two rows -- but that each has a low probability.  Nonetheless we have a 9.5% of getting at least one heads from either coin toss ($$$0.0475+0.0475=0.095$$$). Bringing up the rear is a very small, 1/4 of a percent, chance that we get two heads.

We don't want to work out a table like above every time we want to know the probability, so, we can define a function as follows:

$$$$binom.pmf(k,n,p) = {\frac{n!}{k!(n-k)!}}p^k(1-p)^{n-k}$$$$
Where $$$k$$$=number of heads, $$$n$$$=number of coin tosses and $$$p$$$=probability of heads.  The $$${\frac{n!}{k!(n-k)!}}$$$ portion of the function above is simply the binomial coefficient or the "n choose k" function (i.e. "how many different ways can this happen"). You'll often find this on a calculator as $$$_nC_k$$$. For more details on how this function works see the Wikipedia article on the Binomial Distribution.  Otherwise, observe that:

$$$$binom.pmf(0,2,0.05) = 0.9025$$$$
$$$$binom.pmf(1,2,0.05) = 0.095$$$$
$$$$binom.pmf(2,2,0.05) = 0.0025$$$$

Luckily, we don't need to implement the distribution ourselves as most programming languages will have a stats library with this function in it somewhere.  In Python this is:

>>> import scipy.stats
>>> scipy.stats.binom.pmf([0,1,2],2,0.05)
array([ 0.9025,  0.095 ,  0.0025])

Excellent, that matches exactly what we had above.  Now to a practical application.

In bisulphite sequencing we sequence lots of CpG's and the result is either a C or T (methylated or unmethylated respectively).  We essentially have a large matrix representing the number of methylated and unmethylated reads which will look like the following (entirely made up data).

chr1    1000    5    2
chr1    1025    4    0
chr1    1026    1    10
chr1    1055    1    0
chr1    1089    0    1
chr1    1090    1    1

There are a number of places errors can occur but in particular we are worried about seeing methylated signal where no such signal exists.  This can happen for a number of reasons; including an incorrectly sequenced base (sequencing error) or incomplete bisulphite conversion.

If we did our sequencing properly we also sequenced an unmethylated spike-in control (usually lambda).  As a result we can work out an overall error rate by seeing how many methylated bases we sequence from our unmethylated control.  Divide this by the total number of (relevant) bases sequenced from our control and we have our overall probability of reading a C (methylated) when it was actually a T (unmethylated).

For each CpG we sequenced we want to estimate the probability (p-value) that we would have sequenced that number (or more) methylated bases under the assumption that the CpG in question was actually entirely unmethylated (our null hypothesis) within the cell population.

If we sequence any given CpG and find that 4 out of 10 reads are methylated, and that our error rate from our lambda sequences was 0.005, then we can calculate the probability of this happening by summing the binom.pmf for each result.  i.e. What is the probability that we would get either 4,5,6,7,8,9 or 10 methylated bases out of 10 total reads based on errors alone. The number of methylated bases is used in place of the number of heads and the number of sequenced bases is used instead of the number of coin tosses.  The error rate can then equivalent to the probability of throwing a heads.

>>> import math
>>> math.fsum(scipy.stats.binom.pmf([4,5,6,7,8,9,10],10,0.005))

We can formalise this as the cumulative distribution function which calculates the sum of the probability mass function from $$$0...k$$$ (still parametrised by $$$n$$$ and $$$p$$$).  Since all the probabilities must add to 1 then the probability of 4 or more heads is equal to 1 minus the probability of 3 or less heads (0,1,2 or 3).

$$$$binom.cdf = \sum_{i=0}^{k} {\frac{n!}{i!(n-i)!}}p^i(1-p)^{n-i}$$$$

However, the way we are working this out isn't ideal.  We introduce float precision errors for every pmf calculated as well as another precision error when we actually sum them.  (Although math.fsum does reduce this final error over a naive sum by tracking the cumulative error from intermediate partial sums). Luckily we can avoid summing the terms directly and use an implementation based on the incomplete beta integral as available in the python scipy package which should be higher precision:

>>> 1-scipy.stats.binom.cdf(4-1,10,0.005)

Note above that we calculate the cumulative distribution function of $$$k=4-1$$$ (i.e. 3) so we get the cumulative probability for 0,1,2 or 3 heads. 

The survival function is the same as above but instead of operating over the range $$$0...k$$$ it works for the range $$$(k+1)...n$$$.

$$$$binom.sf = \sum_{i=k+1}^{n} {\frac{n!}{i!(n-i)!}}p^i(1-p)^{n-i}$$$$
Which in scipy is:
>>> scipy.stats.binom.sf(4-1,10,0.005)

Note again that we use $$$k-1$$$.  This time it is because the survival function works from $$$(k+1)...n$$$ so if we want the range $$$k..n$$$ then we need to subtract 1 from $$$k$$$.

All of the above tests are one-tailed (probability of equal to or greater number of heads).  If we want to test for divergence from the expected number of heads (either greater than or less than) we want to use a two-tailed test which splits the probability across the two tails.

To calculate a two-tailed p-value we use the method of small p-values which sums all the p-values which are smaller than the probability of getting exactly $$$k$$$ out of $$$n$$$ heads.

def binom_test(k,n,p):
    # probability of exactly this result
    exactp = scipy.stats.binom.pmf(k,n,p)
    # p-vals for every possibility
    allpvalues = scipy.stats.binom.pmf(range(0,n+1),n,p)
    # p-vals smaller than exactp
    smallpvalues = [x if x <= exactp else 0 for x in allpvalues]
    return math.fsum(smallpvalues)

However, again we have the issue that we are calculating lots of individual probability mass functions and adding the results which will introduce small precision errors.  Luckily, scipy has a more optimised version of this function which uses the cumulative distribution and survival functions to increase the precision.

>>> scipy.stats.binom_test(4,10,0.005)

In the example above the expected number of heads ($$$p*n$$$) is very low so using two-tailed p-values doesn't change the p-value.  This is because the p-values are essentially ordered from largest to smallest for the range $$$0...n$$$ as the most likely value for $$$k$$$ under the null hypothesis is 0 with increasingly smaller probabilities as $$$k$$$ is increased.  As such the only p-values that contribute are those which are in the "greater than" tail (the range $$$k...n$$$).

However, if the error rate is higher then the choice of one-tailed vs two-tailed will obviously make more of a difference.

# one tailed test
>>> scipy.stats.binom.sf(10-1,20,0.3)

# two-tailed test
>>> scipy.stats.binom_test(10,20,0.3)

Now we can use the binom_test function from scipy to implement a methylation caller in Python.  This will calculate the probability that the methylation we see at any given CpG is purely the result of errors.  We use the same file format as in the previous post such that the first two columns identify the CpG and the remaining columns are pairs of methylated and unmethylated counts for a number of samples.

import getopt
import sys
import csv
from scipy.stats import binom_test
from itertools import izip_longest
from multiprocessing import Pool

def binomRow(row):
    chrm, coord, data = row

    outputRow = [chrm, coord]
    for rowextension in map(binomMethCall,data):
    return outputRow

def binomMethCall(group):
        meth,unmeth,errorrate = group
        return [meth,

def grouper(n, iterable, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

def unpackLine(chrm,coord,*data):
    return chrm.strip(), int(coord), [int(d) for d in data]

def extractLine(line):
    chrm,coord,data = unpackLine(*line)
    if not chrm.startswith("chr"):
        chrm = "chr"+chrm
    return (chrm,coord,data)
if __name__ == '__main__':
        opts, args = getopt.getopt(sys.argv[1:], "", ["combined=","errorRates=","cores="])
    except getopt.GetoptError, err:
        # print help information and exit:
        print str(err) # will print something like "option -a not recognized"
    infile = None
    cores = 4
    for o, a in opts:
        if o=="--combined":
            infile = a
            print "combined", a
        elif o=="--errorRates":
            errorRates = [float(x) for x in a.split(",")]
            print "Error Rates", errorRates
        elif o=="--cores":
            cores = int(a)
    assert infile != None
    methfile = csv.reader(open(infile, "r"), delimiter="\t")
    binomOutput = csv.writer(open(infile+".binom","w"),delimiter='\t')

    p = Pool(cores)
    rows = []    
    for count,line in enumerate(methfile):
        if count % 100000 == 0:
            print count
            for row in,rows):
            rows = []
        chrm,coord,data = extractLine(line)
        row = [chrm,coord]
        # extract in pairs
        samples = [(meth,unmeth,errorRates[sampleno]) for sampleno,(meth,unmeth) in enumerate(grouper(2,data))]        
    print count
    for row in,rows):

Executed as:

python \
--combined "Samples.1.and.2.CpG_context.aggregate" \
--errorRates 0.005,0.002 

Where the error rates are obtained from the lambda spike-in alignments for each sample.  This will create an extra column per sample.  You should correct the p-values using the False Discovery Rate method and threshold at at a suitable value (FDR<=0.01).

Update: (12 July 2013): I rewrote a chunk of the binomialMethCall python script to take better advantage of pythons multiprocessing libraries. The updated version should be better at maxing out the number of cores you requested it to use.

No comments:

Post a Comment

Note: only a member of this blog may post a comment.