Wednesday 27 February 2013

The Binomial Distribution, Python and Bisulphite Sequencing

This is Part 2 in a series on Bisulphite Sequencing.
You can return to Part 1 (Post Processing Bismark Bisulphite Sequencing Data) or skip to Part 3 (Simple Visualisation of Bisulphite Sequencing Data).


Today I'm going to describe how the binomial distribution can be applied to bisulphite sequencing data in order to accurately determine the number of CpG sites which exhibit methylation in a sample.

The binomial distribution is actually pretty straightforward.  If you have a coin that has a probability $$$p$$$ of coming up heads, then the binomial distribution tells us the probability that, if you toss $$$n$$$ coins, exactly $$$k$$$ of them will come up heads.  We call this the probability mass function.

Let's do the simplest example first.

If I toss one coin that has a probability of 0.5 of coming up heads (an unbiased coin) then we have two outcomes: we throw a heads or we throw a tails.


Result of 1st Coin Toss | Number of Heads | Overall Probability
Tails (p=0.5)           | 0               | 0.5
Heads (p=0.5)           | 1               | 0.5


Ok, that's fairly trivial. We have two possible outcomes, each with a probability of 0.5. So our probability of getting exactly one heads is simply 0.5.

Let's add a second coin toss. Again we keep the probability of a heads at 0.5.


Result of 1st Coin Toss | Result of 2nd Coin Toss | Number of Heads | Overall Probability
Tails (p=0.5)           | Tails (p=0.5)           | 0               | $$$0.5 * 0.5 = 0.25$$$
Tails (p=0.5)           | Heads (p=0.5)           | 1               | $$$0.5 * 0.5 = 0.25$$$
Heads (p=0.5)           | Tails (p=0.5)           | 1               | $$$0.5 * 0.5 = 0.25$$$
Heads (p=0.5)           | Heads (p=0.5)           | 2               | $$$0.5 * 0.5 = 0.25$$$

Each coin toss still has a 0.5 probability of coming up heads but we now have four possible results.  We can work out the probability of each result by multiplying the probability of the first coin toss by the probability of the second coin toss ($$$0.5 * 0.5 = 0.25$$$ for each row above).

Note that in both examples above the probabilities in the table always add up to 1.  In the second table we have one row where we get 0 heads, two rows where we get 1 heads and one row where we get 2 heads.  So the probability of getting 0, 1 or 2 heads is 0.25, 0.5 and 0.25 respectively.

Let's change the odds.  We now have a coin that we've fixed to be strongly biased towards tails.  Our probability is now only 0.05 that we will get a heads and 0.95 that we will get a tails.

Result of 1st Coin Toss | Result of 2nd Coin Toss | Number of Heads | Overall Probability
Tails (p=0.95)          | Tails (p=0.95)          | 0               | $$$0.95 * 0.95 = 0.9025$$$
Tails (p=0.95)          | Heads (p=0.05)          | 1               | $$$0.95 * 0.05 = 0.0475$$$
Heads (p=0.05)          | Tails (p=0.95)          | 1               | $$$0.05 * 0.95 = 0.0475$$$
Heads (p=0.05)          | Heads (p=0.05)          | 2               | $$$0.05 * 0.05 = 0.0025$$$

We expect that in most cases we will get two tails, and we can see from the table above that this happens about 90% of the time.  We can also see that the probability of getting exactly one heads is the sum of the middle two rows.  Each of those rows individually has a low probability, but together they give us a 9.5% chance of getting exactly one heads ($$$0.0475+0.0475=0.095$$$). Bringing up the rear is a very small, 1/4 of a percent, chance that we get two heads.

We don't want to work out a table like above every time we want to know the probability, so, we can define a function as follows:

$$$$binom.pmf(k,n,p) = {\frac{n!}{k!(n-k)!}}p^k(1-p)^{n-k}$$$$
Where $$$k$$$=number of heads, $$$n$$$=number of coin tosses and $$$p$$$=probability of heads.  The $$${\frac{n!}{k!(n-k)!}}$$$ portion of the function above is simply the binomial coefficient or the "n choose k" function (i.e. "how many different ways can this happen"). You'll often find this on a calculator as $$$_nC_k$$$. For more details on how this function works see the Wikipedia article on the Binomial Distribution.  Otherwise, observe that:

$$$$binom.pmf(0,2,0.05) = 0.9025$$$$
$$$$binom.pmf(1,2,0.05) = 0.095$$$$
$$$$binom.pmf(2,2,0.05) = 0.0025$$$$
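
For instance, expanding the middle case by hand: there are $$${\frac{2!}{1!(2-1)!}} = 2$$$ ways of getting exactly one heads from two tosses (heads-then-tails or tails-then-heads), so:

$$$$binom.pmf(1,2,0.05) = 2 \times 0.05^1 \times 0.95^1 = 2 \times 0.0475 = 0.095$$$$

This matches the sum of the middle two rows of the table above.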

Luckily, we don't need to implement the distribution ourselves as most programming languages will have a stats library with this function in it somewhere.  In Python this is:

>>> import scipy.stats
>>> scipy.stats.binom.pmf([0,1,2],2,0.05)
array([ 0.9025,  0.095 ,  0.0025])

Excellent, that matches exactly what we had above.  Now to a practical application.

In bisulphite sequencing we sequence lots of CpGs and the result at each position is either a C or a T (methylated or unmethylated respectively).  We essentially end up with a large matrix of methylated and unmethylated read counts which will look like the following (entirely made-up data).

chr1    1000    5    2
chr1    1025    4    0
chr1    1026    1    10
chr1    1055    1    0
chr1    1089    0    1
chr1    1090    1    1

There are a number of places where errors can occur, but in particular we are worried about seeing a methylated signal where no such signal exists.  This can happen for a number of reasons, including an incorrectly sequenced base (sequencing error) or incomplete bisulphite conversion.

If we did our sequencing properly we also sequenced an unmethylated spike-in control (usually lambda).  As a result we can work out an overall error rate by seeing how many methylated bases we sequence from our unmethylated control.  Divide this by the total number of (relevant) bases sequenced from our control and we have our overall probability of reading a C (methylated) when it was actually a T (unmethylated).
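
As a rough sketch of that calculation (assuming the lambda alignments have already been aggregated into the same chromosome/position/methylated/unmethylated format used below; the filename here is purely illustrative):

import csv

methylated = 0
total = 0

# "lambda.CpG_context.aggregate" is an illustrative filename
for row in csv.reader(open("lambda.CpG_context.aggregate", "r"), delimiter="\t"):
    meth, unmeth = int(row[2]), int(row[3])
    methylated += meth
    total += meth + unmeth

# apparent methylation on an unmethylated control is our error rate
errorrate = methylated / float(total)
print "Error rate", errorrate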

For each CpG we sequenced we want to estimate the probability (p-value) that we would have sequenced that number (or more) methylated bases under the assumption that the CpG in question was actually entirely unmethylated (our null hypothesis) within the cell population.

If we sequence any given CpG and find that 4 out of 10 reads are methylated, and that our error rate from our lambda sequences was 0.005, then we can calculate the probability of this happening by summing the binom.pmf for each result, i.e. what is the probability that we would get 4, 5, 6, 7, 8, 9 or 10 methylated bases out of 10 total reads based on errors alone?  The number of methylated bases takes the place of the number of heads and the number of sequenced bases takes the place of the number of coin tosses.  The error rate is then equivalent to the probability of throwing a heads.

>>> import math
>>> math.fsum(scipy.stats.binom.pmf([4,5,6,7,8,9,10],10,0.005))
1.2813262561414168e-07

We can formalise this as the cumulative distribution function, which calculates the sum of the probability mass function from $$$0...k$$$ (still parametrised by $$$n$$$ and $$$p$$$).  Since all the probabilities must add to 1, the probability of 4 or more heads is equal to 1 minus the probability of 3 or fewer heads (0, 1, 2 or 3).

$$$$binom.cdf = \sum_{i=0}^{k} {\frac{n!}{i!(n-i)!}}p^i(1-p)^{n-i}$$$$

However, the way we are working this out isn't ideal.  We introduce a float precision error for every pmf we calculate, as well as another when we actually sum them (although math.fsum does reduce this final error over a naive sum by tracking the rounding error from intermediate partial sums). Luckily we can avoid summing the terms directly and instead use an implementation based on the incomplete beta integral, available in the Python scipy package, which should be higher precision:

>>> 1-scipy.stats.binom.cdf(4-1,10,0.005)
1.2813262562083594e-07

Note above that we calculate the cumulative distribution function of $$$k=4-1$$$ (i.e. 3) so we get the cumulative probability for 0,1,2 or 3 heads. 

The survival function is the same as above but instead of operating over the range $$$0...k$$$ it works for the range $$$(k+1)...n$$$.

$$$$binom.sf = \sum_{i=k+1}^{n} {\frac{n!}{i!(n-i)!}}p^i(1-p)^{n-i}$$$$
Which in scipy is:
>>> scipy.stats.binom.sf(4-1,10,0.005)
1.2813262561414144e-07

Note again that we use $$$k-1$$$.  This time it is because the survival function works over $$$(k+1)...n$$$, so if we want the range $$$k...n$$$ then we need to subtract 1 from $$$k$$$.

All of the above tests are one-tailed (the probability of an equal or greater number of heads).  If we want to test for divergence from the expected number of heads in either direction (greater than or less than) we want a two-tailed test, which splits the probability across the two tails.

To calculate a two-tailed p-value we use the method of small p-values, which sums the probabilities of all outcomes that are no more likely than getting exactly $$$k$$$ out of $$$n$$$ heads.

import math
import scipy.stats

def binom_test(k,n,p):
    # probability of exactly this result
    exactp = scipy.stats.binom.pmf(k,n,p)
    
    # probabilities for every possible number of heads (0..n)
    allpvalues = scipy.stats.binom.pmf(range(0,n+1),n,p)
    
    # keep only the outcomes no more likely than the observed one
    smallpvalues = [x if x <= exactp else 0 for x in allpvalues]
    
    return math.fsum(smallpvalues)

However, again we have the issue that we are calculating lots of individual probability mass functions and adding the results, which will introduce small precision errors.  Luckily, scipy has a more optimised version of this function, which uses the cumulative distribution and survival functions to increase the precision.

>>> scipy.stats.binom_test(4,10,0.005)
1.2813262561414144e-07

In the example above the expected number of heads ($$$p*n$$$) is very low, so using a two-tailed p-value doesn't change the result.  This is because the p-values are essentially ordered from largest to smallest over the range $$$0...n$$$: the most likely value of $$$k$$$ under the null hypothesis is 0, with increasingly smaller probabilities as $$$k$$$ increases.  As such the only p-values that contribute are those in the "greater than" tail (the range $$$k...n$$$).
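
We can see this ordering directly by looking at the first few terms of the probability mass function (values rounded here for readability):

>>> [round(x, 6) for x in scipy.stats.binom.pmf([0, 1, 2, 3], 10, 0.005)]
[0.95111, 0.047794, 0.001081, 1.4e-05]

Each successive term is substantially smaller than the last, so everything from $$$k$$$ upwards falls into the small-probability set.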

However, if the error rate is higher then the choice of one-tailed vs two-tailed will obviously make more of a difference.

# one tailed test
>>> scipy.stats.binom.sf(10-1,20,0.3)
0.0479618973313434

# two-tailed test
>>> scipy.stats.binom_test(10,20,0.3)
0.08344502962981204

Now we can use the binom_test function from scipy to implement a methylation caller in Python.  This will calculate the probability that the methylation we see at any given CpG is purely the result of errors.  We use the same file format as in the previous post such that the first two columns identify the CpG and the remaining columns are pairs of methylated and unmethylated counts for a number of samples.

import getopt
import sys
import csv
from scipy.stats import binom_test
from itertools import izip_longest
from multiprocessing import Pool

def binomRow(row):
    chrm, coord, data = row

    outputRow = [chrm, coord]
    for rowextension in map(binomMethCall,data):
        outputRow.extend(rowextension)
    return outputRow

def binomMethCall(group):
    # group is a (methylated count, unmethylated count, error rate) tuple
    meth,unmeth,errorrate = group
    return [meth,
            unmeth,
            binom_test(meth,meth+unmeth,errorrate)]

def grouper(n, iterable, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

def unpackLine(chrm,coord,*data):
    return chrm.strip(), int(coord), [int(d) for d in data]

def extractLine(line):
    chrm,coord,data = unpackLine(*line)
    if not chrm.startswith("chr"):
        chrm = "chr"+chrm
    return (chrm,coord,data)
    
if __name__ == '__main__':
    try:
        opts, args = getopt.getopt(sys.argv[1:], "", ["combined=","errorRates=","cores="])
    except getopt.GetoptError, err:
        # print help information and exit:
        print str(err) # will print something like "option -a not recognized"
        sys.exit(2)
    
    infile = None
    errorRates = None
    cores = 4
    
    for o, a in opts:
        if o=="--combined":
            infile = a
            print "combined", a
        elif o=="--errorRates":
            errorRates = [float(x) for x in a.split(",")]
            print "Error Rates", errorRates
        elif o=="--cores":
            cores = int(a)
    
    assert infile != None
    assert errorRates != None
    
    methfile = csv.reader(open(infile, "r"), delimiter="\t")
    
    binomOutput = csv.writer(open(infile+".binom","w"),delimiter='\t')

    p = Pool(cores)
    
    rows = []    
    for count,line in enumerate(methfile):
        
        if count % 100000 == 0:
            print count
            # flush the accumulated batch of rows through the worker pool
            for row in p.map(binomRow,rows):
                binomOutput.writerow(row)
            rows = []
        
        chrm,coord,data = extractLine(line)
        
        row = [chrm,coord]
        
        # extract in pairs
        samples = [(meth,unmeth,errorRates[sampleno]) for sampleno,(meth,unmeth) in enumerate(grouper(2,data))]        
        row.append(samples)
        rows.append(row)
    
    print count
    for row in p.map(binomRow,rows):
        binomOutput.writerow(row)
        
        

Executed as:

python binomialMethCall.py \
--combined "Samples.1.and.2.CpG_context.aggregate" \
--errorRates 0.005,0.002 

Where the error rates are obtained from the lambda spike-in alignments for each sample.  This will create an extra p-value column per sample.  You should correct the p-values for multiple testing using the False Discovery Rate method and threshold at a suitable value (e.g. FDR <= 0.01).
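
A minimal sketch of that correction, assuming the statsmodels package is available (its multipletests function implements Benjamini-Hochberg as method="fdr_bh"; the filename and column index below follow the two-sample example above):

import csv
from statsmodels.stats.multitest import multipletests

rows = list(csv.reader(open("Samples.1.and.2.CpG_context.aggregate.binom", "r"),
                       delimiter="\t"))

# the p-value for sample 1 is the 5th column of the script's output
pvalues = [float(row[4]) for row in rows]

# Benjamini-Hochberg correction, thresholding at FDR <= 0.01
reject, corrected, _, _ = multipletests(pvalues, alpha=0.01, method="fdr_bh")

significant = [row for row, keep in zip(rows, reject) if keep]
print len(significant), "CpGs called methylated in sample 1"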

Update (12 July 2013): I rewrote a chunk of the binomialMethCall Python script to take better advantage of Python's multiprocessing libraries. The updated version should be better at maxing out the number of cores you ask it to use.

Thursday 7 February 2013

Post-Processing Bismark Bisulphite Sequencing Data

This is Part 1 in a series on Bisulphite Sequencing.
You can skip to Part 2 (Using the Binomial Distribution in Bisulphite Sequencing).


In this post I will present a couple of scripts that are useful for processing the data produced by the Bismark bisulphite sequencing aligner into a form that is a little easier to work with.

When you run Bismark it will produce output in either its own "vanilla" format (version <= 0.5x) or in SAM format (version >= 0.6x).

Either way, the first step at this point is to remove reads that might be PCR duplicates using deduplicate_bismark_alignment_output.  I usually also remove any fragments which look like they haven't been bisulphite converted (multiple apparently methylated CHH or CHG calls within a single fragment).

For Bismark vanilla output you can use something like this:

awk 'FNR>1{
      # count apparently methylated (uppercase) CHH/CHG calls in the
      # methylation call string of each read of the pair
      countl = split($8,a,"[HX]");
      countr = split($11,a,"[HX]");
      if (countl-1 <= 3 && countr-1 <= 3)
           print;
     }' $file > $file.unconvertedreadsremoved

(Where the 8th and 11th columns contain the methylation call string in Bismark format).

Next up is to extract the methylation calls for each base using the methylation_extractor script (included with Bismark) using the --no_overlap and --comprehensive options.
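
A typical paired-end invocation might look something like the following (the input filename is illustrative, and you should check the script's own help output for the exact option names in your version):

methylation_extractor -p --no_overlap --comprehensive sample_bismark.sam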

This will give you a file that looks a little like this:

FCD0KMMACXX:3:1101:17551:2000#CTTCCTCC/1    +    chr3    116402006    Z
FCD0KMMACXX:3:1101:17551:2000#CTTCCTCC/1    +    chr3    116401980    Z
FCD0KMMACXX:3:1101:17551:2000#CTTCCTCC/1    +    chr3    116401960    Z
FCD0KMMACXX:3:1101:17865:1999#CTTCCTCC/1    -    chrX    115417650    z
FCD0KMMACXX:3:1101:18710:1999#CTTCCTCC/1    -    chr15    51474193    z
FCD0KMMACXX:3:1101:18710:1999#CTTCCTCC/1    +    chr15    51474352    Z
FCD0KMMACXX:3:1101:18599:2000#CTTCCTCC/1    +    chr3    101217644    Z

First we make things a little easier by sorting by chromosome and base position, so that all the methylation calls for the same base are together, and by removing the Bismark header line.

awk 'FNR>1' CpG_context.txt | \
sort -k3,3 -k4,4n -S10G > CpG_context.sorted.txt

Again, as in the previous post, you can control sort's memory usage using the -S flag (set to 10G above). We can then sum up (aggregate) the file on a per-CpG basis.

import getopt
import sys
import csv

def parseMethLine(line):
    # columns: read id, +/- methylation state, chromosome, position, call letter
    (readid,meth,chrm,coord,methcall) = line
    return (True if meth == '+' else False, chrm, coord)

if __name__ == '__main__':
    try:
        opts, args = getopt.getopt(sys.argv[1:], "",
                                  ["bismark=", "aggregate="])
    except getopt.GetoptError, err:
        # print help information and exit
        # will print something like "option -a not recognized"
        print str(err) 
        sys.exit(2)
    
    infile = None
    aggregatefile= None
    
    for o, a in opts:
        if o=="--bismark":
            infile = a
            print "Bismark", a
        elif o=="--aggregate":
            aggregatefile = a
            print "Aggregate", a
    
    assert infile != None
    assert aggregatefile != None
    
    bismark = csv.reader(open(infile, "r"), delimiter="\t")
    
    aggregate = csv.writer(open(aggregatefile,"w"),delimiter='\t')      
    
    currentchrm = None
    currentpos = None
    
    methylated = 0
    unmethylated = 0
    
    for line in bismark:
        (methcall, chrm, coord) = parseMethLine(line)
        
        # meth calls are 1-based, ucsc / bedgraph are 0-based
        coord = int(coord) - 1
        
        # init
        if currentpos is None:
            currentchrm = chrm
            currentpos = coord
        elif currentpos != coord or currentchrm != chrm:
            # a new position or chromosome encountered
            # output our counts until now
            aggregate.writerow([currentchrm,currentpos,
                                methylated,unmethylated])
            
            # reset counts for next position
            currentchrm = chrm
            currentpos = coord
            methylated = 0
            unmethylated = 0
        
        if methcall:
            methylated += 1
        else:
            unmethylated += 1
    
    # output last line
    aggregate.writerow([currentchrm,currentpos,
                        methylated,unmethylated])

Executed as:

python aggregateBismark.py \
--bismark "CpG_context.sorted.txt" \
--aggregate "CpG_context.aggregate"

This will produce a file that's a little more useful.  Each CpG is represented by a single line with two data columns indicating the number of methylated and unmethylated calls made for that base (3rd and 4th columns respectively).

In other words we will have a file that looks like this:

chr1    1343    1    0
chr1    1625    4    0
chr1    1626    1    0
chr1    1655    1    0
chr1    1689    2    1
chr1    1690    2    0

Since we probably have more than one sample, and we will want to compare them with each other, we need to combine two (or more) of the aggregate files above.  The script below will combine any two files that have chromosome and coordinate as the first columns and are sorted the same way.

import getopt
import sys
import csv

if __name__ == '__main__':
    try:
        opts, args = getopt.getopt(sys.argv[1:], "",
                                  ["one=", "two=", "combine="])
    except getopt.GetoptError, err:
        # print help information and exit
        # will print something like "option -a not recognized"
        print str(err) 
        sys.exit(2)
    
    oneloc = None
    twoloc = None
    combineloc = None 
    
    for o, a in opts:
        if o=="--one":
            oneloc = a
            print "One", a
        elif o=="--two":
            twoloc = a
            print "Two", a
        elif o=="--combine":
            combineloc = a
            print "Combine", a
    
    assert oneloc != None
    assert twoloc != None
    assert combineloc != None
    
    left = csv.reader(open(oneloc, "r"), delimiter="\t")
    right = csv.reader(open(twoloc, "r"), delimiter="\t")
    
    combine = csv.writer(open(combineloc,"w"),delimiter='\t')      
    
    def unpackLine(chrm,coord,*data):
        return chrm.strip(), int(coord), [int(d) for d in data]
    
    def extractLine(line):
        chrm,coord,data = unpackLine(*line)
        if not chrm.startswith("chr"):
            chrm = "chr"+chrm
        return (chrm,coord,data)
    
    # get the first data points
    
    lchr,lcoord,ldata = extractLine(left.next())
    rchr,rcoord,rdata = extractLine(right.next())
    
    # find out how many data columns in each object
    # first line will tell us
    leftcols = len(ldata)
    rightcols = len(rdata)
    
    leftFinished = False
    rightFinished = False
    
    
    def spoolBoth():
        global lchr,lcoord,ldata,leftFinished
        global rchr,rcoord,rdata,rightFinished
        
        row = [lchr,lcoord]
        row.extend(ldata)
        row.extend(rdata)        
        combine.writerow(row)
        
        try:    
            lchr,lcoord,ldata = extractLine(left.next())
        except StopIteration:
            leftFinished = True    
        try:            
            rchr,rcoord,rdata = extractLine(right.next())
        except StopIteration:
            rightFinished = True 
        
    def spoolLeft():
        global lchr,lcoord,ldata,leftFinished
        
        rdata = [0 for i in range(0,rightcols)]
        
        row = [lchr,lcoord]
        row.extend(ldata)
        row.extend(rdata)        
        combine.writerow(row)
        
        try:    
            lchr,lcoord,ldata = extractLine(left.next())
        except StopIteration:
            leftFinished = True          
    
    def spoolRight():
        global rchr,rcoord,rdata,rightFinished
        
        ldata = [0 for i in range(0,leftcols)]
        
        row = [rchr,rcoord]
        row.extend(ldata)
        row.extend(rdata)
        combine.writerow(row)
        
        try:            
            rchr,rcoord,rdata = extractLine(right.next())
        except StopIteration:
            rightFinished = True    
    
    while not leftFinished or not rightFinished:
   
        if leftFinished:
            spoolRight()
        elif rightFinished:
            spoolLeft()
        else:
            if lchr == rchr and lcoord == rcoord:
                # we have a match, print it and spool both
                spoolBoth()
            elif lchr < rchr:
                spoolLeft()
            elif lchr > rchr:
                spoolRight()   
            # otherwise we are on the same chromosome but one file
            # has skipped one (or more) positions
            elif lcoord < rcoord:
                spoolLeft()
            else:
                spoolRight()

Execute this as follows:

python combineAggregate.py \
--one "Sample1.CpG_context.aggregate" \
--two "Sample2.CpG_context.aggregate" \
--combine "Samples.1.and.2.CpG_context.aggregate"


Ultimately you'll be left with a file where each CpG (or CHH, CHG) is a single line. A pair of columns represents the number of methylated and unmethylated calls in one sample, and multiple pairs indicate multiple samples, giving you something that looks like this:

chr1    1343    1    0    2    0
chr1    1625    4    0    5    3
chr1    1626    1    0    1    0
chr1    1655    1    0    1    5
chr1    1689    2    1    2    1
chr1    1690    2    0    3    1

I find this file format much easier to deal with for further downstream analysis. Once you've combined Sample 1 and Sample 2 you can then take the output and combine that with Sample 3 and so on.
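
For example, assuming Sample 3 has been aggregated in the same way as the first two samples:

python combineAggregate.py \
--one "Samples.1.and.2.CpG_context.aggregate" \
--two "Sample3.CpG_context.aggregate" \
--combine "Samples.1.2.and.3.CpG_context.aggregate"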