Tuesday, 16 July 2013

The G Test and Bisulphite Sequencing DMRs

This is Part 4 in a series on Bisulphite Sequencing.
You can return to Part 3 (Bisulphite Sequencing Simple Visualisation).

Okay, so we have our bisulphite sequencing data on the UCSC genome browser. Lets assume we have two conditions CondA and CondB, and we have two replicates of each condition. We can see what appear to be changes from our visual inspection on the browser. Now what?

Well, we probably want to compute differentially methylated regions. With 1 replicate of each that's fairly straight forward for categorical data like the bisulphite seq data we have - we use fishers exact test. With replicates a simple approach is just to pool the two replicates together and do our fishers exact test on the pooled data, but this doesn't give us a very good idea of what's going on if the two replicates disagree for a given region.

Instead we use the G-test which can be calculated for any given contingency table with the following formula:

$$$$ G = 2 \sum_{i=0}^{r} \sum_{j=0}^{c} {O_{i,j} \ln(O_{i,j}/E_{i,j}) } $$$$
Where, $$$r$$$ and $$$c$$$ are the number of rows and columns in the contingency table, $$$O_{i,j}$$$ and $$$E_{i,j}$$$ are the Observed and Expected values for cell $$$i,j$$$ in the table. I've detailed how to implement the G-test for a 2x2 contingency table in Python previously.

The main selling point of the G-test, over a regular chi-square test, is that the G value is additive. Consider the following:

G-test 1G-test 2
A B G-value
C 100 70 = 7.1
D 35 50
A B G-value
C 200 140 = 14.2
D 70 100

To find out where we have differential regions we first partition the genome into 2kb windows. There are large regions of the genome where there are no CpG's or, alternatively, where we have no sequencing coverage - we will skip those regions. We align our windows such that each window starts at an individual CpG and stops at the last CpG within 2kb of the start position.

Pooled G

We first compute a Pooled G which is the G-test performed on the pooled replicates. The contingency table would consist of columns representing each condition and rows representing methylated and unmethylated counts. The number of degrees of freedom is equal to $$$(R-1) \cdot (C-1)$$$ so as we are doing a single 2x2 contingency table this will always be equal to 1.

def pooledG(lmeth,lunmeth,rmeth,runmeth):   
    dof = 1
    pooledg = gtest(lmeth,lunmeth,rmeth,runmeth)
    return pooledg, dof, GtoP(pooledg,dof)

Total G

We then compute a G-value for each individual replicate and add them together. Each G-test we do adds 1 to the total number of degrees of freedom.

def totalG(data):
    dof = 0
    totalg = 0
    for lmeth,lunmeth,rmeth,runmeth in data:
            totalg += gtest(lmeth,lunmeth,rmeth,runmeth)
    return totalg, dof, GtoP(totalg,dof)

Heterogeneous G

If each of our three replicates reproduce exactly then Pooled G will be identical to Total G (due to the additive nature of G-test results). However, if the ratios are not identical then they will not match. The difference between the Total G-value and the Pooled G-value is the heterogeneous G-value and tells us if the difference between our replicates is significant (i.e. our replicates are not all doing the same thing.

def heterogenousG(totalg,totaldof,pooledg,pooleddof):
    heteroG = totalg-pooledg
    heterodof = totaldof-pooleddof
    return heteroG,heterodof,GtoP(heteroG,totaldof-pooleddof)

Putting it all together

Now, we can iterate through a file summarising the number of methylated and unmethylated reads at each CpG (described in a previous post) and compute the Pooled, Total and Heterogenous G-values (and corresponding p-values) for each genomic window.
import getopt
import sys
import csv
from itertools import izip_longest
from itertools import izip
from memoized import memoized

# sudo easy_install mpmath
import mpmath
#mpmath.mp.dps = 20 # decimal digits of precision

def mpmathcdf(g,df,dps=10):
    mpmath.mp.dps = dps
    x,k = mpmath.mpf(g), mpmath.mpf(df)
    cdf = mpmath.gammainc(k/2, 0, x/2, regularized=True)
    # floating point precision insufficient, use more precision
    if cdf == 1.0:
        if dps > 4000:
            return cdf # give up after a certain point
            cdf = mpmathcdf(g,df,dps*2)
    return cdf

def GtoP(g,df):
    assert g >= 0, g
    return float(1-mpmathcdf(g,df))

import math

def flnf(f):
    return f*math.log(f) if f>0.5 else 0

def gtest(a,b,c,d):
    row1 = a+b
    row2 = c+d
    col1 = a+c
    col2 = b+d
    total = flnf(a+b+c+d)
    celltotals = flnf(a)+flnf(b)+flnf(c)+flnf(d)
    rowtotals = flnf(row1)+flnf(row2)
    coltotals = flnf(col1)+flnf(col2)
    # abs is necessary to account for float precision errors when doing the subtraction
    return abs(2*((celltotals + total)-(rowtotals+coltotals)))

# extract the pooled meth counts from the sample
def pooledMethCounts(data,covs,compare):
    lmethtotal = 0
    lunmethtotal = 0
    rmethtotal = 0
    runmethtotal = 0
    if len(compare)==2:
        # if we are given covariates to use
        covleft, covright = compare    
        for cov, (meth,unmeth) in izip(covs,grouper(2,data)): # extract in pairs of meth/unmeth
            if cov == covleft:
                lmethtotal += meth
                lunmethtotal += unmeth
            elif cov == covright:
                rmethtotal += meth
                runmethtotal += unmeth
        # if we have not been given them, assume that each adjacent sample (which is one pair) is one replicate
        for lmeth,lunmeth,rmeth,runmeth in grouper(4,data): # extract in quads (2 pairs) representing one replicate
            lmethtotal += lmeth
            lunmethtotal += lunmeth
            rmethtotal += rmeth
            runmethtotal += runmeth
    return lmethtotal,lunmethtotal,rmethtotal,runmethtotal

# extract individual replicate counts from the sample and return as a list of (lmeth,lunmeth,rmeth,runmeth)
# where l and r refer to the left and right cov in the compare array
# if compare array is empty assume that data is formatted as:
# lmeth1,lunmeth1,rmeth1,runmeth1,lmeth2,lunmeth2,rmeth2,runmeth2 etc
def covReplicateMethCounts(data,covs,compare):
    if len(compare)==2:
        covleft, covright = compare
        lreplicates = []
        rreplicates = []
        for cov, (meth,unmeth) in izip(covs,grouper(2,data)): # extract in pairs of meth/unmeth
            if cov == covleft:
            elif cov == covright:

        #return [(lmeth,lunmeth,rmeth,runmeth),(),(),...]    
        return [(a,b,c,d) for (a,b),(c,d) in izip(lreplicates,rreplicates)]
        # assume that each pair of samples (i.e. 4 columns) in the input is one replicate
        # in this case our input file is in the right order and we just need to split it up.
        return grouper(4,data)

# total G: at least some of the data is different
# 1 degree of freedom per replicate (i.e. 3 in total), add G's from each seperate test
def totalG(data,covs,compare):
    dof = 0
    totalg = 0
    for lmeth,lunmeth,rmeth,runmeth in covReplicateMethCounts(data,covs,compare): # extract in quads (2 pairs)
            totalg += gtest(lmeth,lunmeth,rmeth,runmeth)
    assert totalg >= 0
    return totalg, dof, GtoP(totalg,dof)

# pooled G: the data as a whole is different
# 1 degree of freedom, pool results from all replicates and do 1 2x2 table
def pooledG(data,covs,compare):    
    lmeth,lunmeth,rmeth,runmeth = pooledMethCounts(data,covs,compare)
    dof = 1
    pooledg = gtest(lmeth,lunmeth,rmeth,runmeth)
    assert pooledg >= 0
    return pooledg, dof, GtoP(pooledg,dof)

# there is evidence that they are significantly different from each other
# heterogeneity G: total G - pooled G, df = 3-1 = 2
def heterogeneityG(totalg,totaldof,pooledg,pooleddof):
    heteroG = max(totalg-pooledg,0)
    heterodof = totaldof-pooleddof
    return heteroG,heterodof,GtoP(heteroG,totaldof-pooleddof)

def grouper(n, iterable, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

def unpackLine(chrm,coord,*data):
    return chrm.strip(), int(coord), [int(d) for d in data]

def extractLine(line):
    chrm,coord,data = unpackLine(*line)
    if not chrm.startswith("chr"):
        chrm = "chr"+chrm
    return (chrm,coord,data)

if __name__ == '__main__':
        opts, args = getopt.getopt(sys.argv[1:], "", ["combined=","cov=","compare="])
    except getopt.GetoptError, err:
        # print help information and exit:
        print str(err) # will print something like "option -a not recognized"
    infile = None
    covs = None
    compare = None
    for o, a in opts:
        if o=="--combined":
            infile = a
            print "combined", a
        elif o=="--cov":
            covs = a.split(",")
            print covs
        elif o=="--compare":
            compare = a.split(",")
            print compare
            assert len(compare)==2
    assert infile != None
    # we either have compare and covs or neither
    assert compare == None or len(compare)==2
    if compare == None: assert covs == None
    if len(compare)==2: assert covs != None
    methfile = csv.reader(open(infile, "r"), delimiter="\t")
    # ---    
    # windows
    if compare != None:
        leftcov,rightcov = compare
        windows = csv.writer(open(infile+"."+leftcov+"-vs-"+rightcov+".windows.bedGraph","w"),delimiter='\t')
        percent = csv.writer(open(infile+"."+leftcov+"-vs-"+rightcov+".percent.windows.bedGraph","w"),delimiter='\t')
        windows = csv.writer(open(infile+".windows.bedGraph","w"),delimiter='\t')
        percent = csv.writer(open(infile+".percent.windows.bedGraph","w"),delimiter='\t')
    windowSize = 2000
    currentWindowPos = None
    currentChr = None
    cs = 0
    dataCount = None
    def writeWindow(chrm,pos,epos):
        global dataCount
        global cs
        global windows
        global percent
        lmethCount,lunmethCount,rmethCount,runmethCount = pooledMethCounts(dataCount,covs,compare)
        if (lmethCount + lunmethCount) == 0 or (rmethCount + runmethCount) == 0:
            print "Unprintable window:",chrm, pos, epos, lmethCount, lunmethCount, rmethCount, runmethCount
        lPercentage = (float(lmethCount)/float(lmethCount+lunmethCount)) * 100.0
        rPercentage = (float(rmethCount)/float(rmethCount+runmethCount)) * 100.0
        total,totaldof,totalp = totalG(dataCount,covs,compare)
        pooled,pooleddof,pooledp = pooledG(dataCount,covs,compare)
        hetero,heterodof,heterop = heterogeneityG(total,totaldof,pooled,pooleddof)
        windows.writerow([chrm,pos,epos+1,'1' if rPercentage > lPercentage else '-1',totalp,pooledp,heterop,cs])
        dataCount = None
        cs = 0

    lastCoordCounted = None
    # the actual work here
    for line in methfile:
        chrm,coord,data = extractLine(line)

        if currentChr == None:
            currentChr = chrm
            currentWindowPos = coord
            lastCoordCounted = coord
            # always write a window if we change chrm
            if chrm != currentChr:
                currentChr = chrm
                currentWindowPos = coord
            # write a window if the next cpg is outside the current window
            elif coord - currentWindowPos > windowSize:
                currentChr = chrm
                currentWindowPos = coord
            # otherwise dont write any window (the current cpg is in the window)
        # add data to existing data count
        if dataCount == None:
            dataCount = data
            for index,d in enumerate(data):
                dataCount[index] += d
        lastCoordCounted = coord
        # inc number of c's counted
        cs += 1
    # after we write a window there is always at least one unmeasured coord after it so we have one more window to write

Note that the above script relies upon the Memoized decorator.

Executed as:

python replicatesGTest.py \
--combined "Conditions.1.and.2.CpG_context.aggregate" \
--cov Condition1,Condition2,Condition1,Condition2,Condition1,Condition2
--compare Condition1,Condition2 

Where --cov lists the order of the conditions in the file and --cov indicates which conditions should be compared.

For each window, this script outputs a tab separated file containing the chromosome,start and stop position of the window,'1' if it is "increasing" otherwise '-1' and p-values for computed from the Total, Pooled ahd Hetero G computations as well as the number of methylation sites considered in the window.

Note that the p-values generated are uncorrected for multiple testing and should be FDR corrected. I've previously described a suitable approach for large files such as these.

Differentially methylated regions can then be defined as regions of adjacent windows with a FDR adjusted p-value less than or equal to a fixed level (i.e. 0.05 FDR) that change in the same direction.

No comments:

Post a Comment