You can return to Part 3 (Bisulphite Sequencing Simple Visualisation).
Okay, so we have our bisulphite sequencing data on the UCSC genome browser. Lets assume we have two conditions CondA and CondB, and we have two replicates of each condition. We can see what appear to be changes from our visual inspection on the browser. Now what?
Well, we probably want to compute differentially methylated regions. With 1 replicate of each that's fairly straight forward for categorical data like the bisulphite seq data we have - we use fishers exact test. With replicates a simple approach is just to pool the two replicates together and do our fishers exact test on the pooled data, but this doesn't give us a very good idea of what's going on if the two replicates disagree for a given region.
Instead we use the G-test which can be calculated for any given contingency table with the following formula:
$$$$ G = 2 \sum_{i=0}^{r} \sum_{j=0}^{c} {O_{i,j} \ln(O_{i,j}/E_{i,j}) } $$$$
Where, $$$r$$$ and $$$c$$$ are the number of rows and columns in the contingency table, $$$O_{i,j}$$$ and $$$E_{i,j}$$$ are the Observed and Expected values for cell $$$i,j$$$ in the table. I've detailed how to implement the G-test for a 2x2 contingency table in Python previously.
The main selling point of the G-test, over a regular chi-square test, is that the G value is additive. Consider the following:
G-test 1 | G-test 2 | ||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
To find out where we have differential regions we first partition the genome into 2kb windows. There are large regions of the genome where there are no CpG's or, alternatively, where we have no sequencing coverage - we will skip those regions. We align our windows such that each window starts at an individual CpG and stops at the last CpG within 2kb of the start position.
Pooled G
We first compute a Pooled G which is the G-test performed on the pooled replicates. The contingency table would consist of columns representing each condition and rows representing methylated and unmethylated counts. The number of degrees of freedom is equal to $$$(R-1) \cdot (C-1)$$$ so as we are doing a single 2x2 contingency table this will always be equal to 1.
def pooledG(lmeth,lunmeth,rmeth,runmeth): dof = 1 pooledg = gtest(lmeth,lunmeth,rmeth,runmeth) return pooledg, dof, GtoP(pooledg,dof)
Total G
We then compute a G-value for each individual replicate and add them together. Each G-test we do adds 1 to the total number of degrees of freedom.
def totalG(data): dof = 0 totalg = 0 for lmeth,lunmeth,rmeth,runmeth in data: totalg += gtest(lmeth,lunmeth,rmeth,runmeth) dof+=1 return totalg, dof, GtoP(totalg,dof)
Heterogeneous G
If each of our three replicates reproduce exactly then Pooled G will be identical to Total G (due to the additive nature of G-test results). However, if the ratios are not identical then they will not match. The difference between the Total G-value and the Pooled G-value is the heterogeneous G-value and tells us if the difference between our replicates is significant (i.e. our replicates are not all doing the same thing.
def heterogenousG(totalg,totaldof,pooledg,pooleddof): heteroG = totalg-pooledg heterodof = totaldof-pooleddof return heteroG,heterodof,GtoP(heteroG,totaldof-pooleddof)
Putting it all together
Now, we can iterate through a file summarising the number of methylated and unmethylated reads at each CpG (described in a previous post) and compute the Pooled, Total and Heterogenous G-values (and corresponding p-values) for each genomic window.
import getopt import sys import csv from itertools import izip_longest from itertools import izip from memoized import memoized # sudo easy_install mpmath import mpmath #mpmath.mp.dps = 20 # decimal digits of precision @memoized def mpmathcdf(g,df,dps=10): mpmath.mp.dps = dps x,k = mpmath.mpf(g), mpmath.mpf(df) cdf = mpmath.gammainc(k/2, 0, x/2, regularized=True) # floating point precision insufficient, use more precision if cdf == 1.0: if dps > 4000: return cdf # give up after a certain point else: cdf = mpmathcdf(g,df,dps*2) return cdf def GtoP(g,df): assert g >= 0, g return float(1-mpmathcdf(g,df)) import math @memoized def flnf(f): return f*math.log(f) if f>0.5 else 0 @memoized def gtest(a,b,c,d): row1 = a+b row2 = c+d col1 = a+c col2 = b+d total = flnf(a+b+c+d) celltotals = flnf(a)+flnf(b)+flnf(c)+flnf(d) rowtotals = flnf(row1)+flnf(row2) coltotals = flnf(col1)+flnf(col2) # abs is necessary to account for float precision errors when doing the subtraction return abs(2*((celltotals + total)-(rowtotals+coltotals))) # extract the pooled meth counts from the sample def pooledMethCounts(data,covs,compare): lmethtotal = 0 lunmethtotal = 0 rmethtotal = 0 runmethtotal = 0 if len(compare)==2: # if we are given covariates to use covleft, covright = compare for cov, (meth,unmeth) in izip(covs,grouper(2,data)): # extract in pairs of meth/unmeth if cov == covleft: lmethtotal += meth lunmethtotal += unmeth elif cov == covright: rmethtotal += meth runmethtotal += unmeth else: # if we have not been given them, assume that each adjacent sample (which is one pair) is one replicate for lmeth,lunmeth,rmeth,runmeth in grouper(4,data): # extract in quads (2 pairs) representing one replicate lmethtotal += lmeth lunmethtotal += lunmeth rmethtotal += rmeth runmethtotal += runmeth return lmethtotal,lunmethtotal,rmethtotal,runmethtotal # extract individual replicate counts from the sample and return as a list of (lmeth,lunmeth,rmeth,runmeth) # where l and r refer to the left and right cov in the compare array # if compare array is empty assume that data is formatted as: # lmeth1,lunmeth1,rmeth1,runmeth1,lmeth2,lunmeth2,rmeth2,runmeth2 etc def covReplicateMethCounts(data,covs,compare): if len(compare)==2: covleft, covright = compare lreplicates = [] rreplicates = [] for cov, (meth,unmeth) in izip(covs,grouper(2,data)): # extract in pairs of meth/unmeth if cov == covleft: lreplicates.append((meth,unmeth)) elif cov == covright: rreplicates.append((meth,unmeth)) #return [(lmeth,lunmeth,rmeth,runmeth),(),(),...] return [(a,b,c,d) for (a,b),(c,d) in izip(lreplicates,rreplicates)] else: # assume that each pair of samples (i.e. 4 columns) in the input is one replicate # in this case our input file is in the right order and we just need to split it up. return grouper(4,data) # total G: at least some of the data is different # 1 degree of freedom per replicate (i.e. 3 in total), add G's from each seperate test def totalG(data,covs,compare): dof = 0 totalg = 0 for lmeth,lunmeth,rmeth,runmeth in covReplicateMethCounts(data,covs,compare): # extract in quads (2 pairs) totalg += gtest(lmeth,lunmeth,rmeth,runmeth) dof+=1 assert totalg >= 0 return totalg, dof, GtoP(totalg,dof) # pooled G: the data as a whole is different # 1 degree of freedom, pool results from all replicates and do 1 2x2 table def pooledG(data,covs,compare): lmeth,lunmeth,rmeth,runmeth = pooledMethCounts(data,covs,compare) dof = 1 pooledg = gtest(lmeth,lunmeth,rmeth,runmeth) assert pooledg >= 0 return pooledg, dof, GtoP(pooledg,dof) # there is evidence that they are significantly different from each other # heterogeneity G: total G - pooled G, df = 3-1 = 2 def heterogeneityG(totalg,totaldof,pooledg,pooleddof): heteroG = max(totalg-pooledg,0) heterodof = totaldof-pooleddof return heteroG,heterodof,GtoP(heteroG,totaldof-pooleddof) def grouper(n, iterable, fillvalue=None): "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx" args = [iter(iterable)] * n return izip_longest(fillvalue=fillvalue, *args) def unpackLine(chrm,coord,*data): return chrm.strip(), int(coord), [int(d) for d in data] def extractLine(line): chrm,coord,data = unpackLine(*line) if not chrm.startswith("chr"): chrm = "chr"+chrm return (chrm,coord,data) if __name__ == '__main__': try: opts, args = getopt.getopt(sys.argv[1:], "", ["combined=","cov=","compare="]) except getopt.GetoptError, err: # print help information and exit: print str(err) # will print something like "option -a not recognized" sys.exit(2) infile = None covs = None compare = None for o, a in opts: if o=="--combined": infile = a print "combined", a elif o=="--cov": covs = a.split(",") print covs elif o=="--compare": compare = a.split(",") print compare assert len(compare)==2 assert infile != None # we either have compare and covs or neither assert compare == None or len(compare)==2 if compare == None: assert covs == None if len(compare)==2: assert covs != None methfile = csv.reader(open(infile, "r"), delimiter="\t") # --- # windows if compare != None: leftcov,rightcov = compare windows = csv.writer(open(infile+"."+leftcov+"-vs-"+rightcov+".windows.bedGraph","w"),delimiter='\t') percent = csv.writer(open(infile+"."+leftcov+"-vs-"+rightcov+".percent.windows.bedGraph","w"),delimiter='\t') else: windows = csv.writer(open(infile+".windows.bedGraph","w"),delimiter='\t') percent = csv.writer(open(infile+".percent.windows.bedGraph","w"),delimiter='\t') windowSize = 2000 currentWindowPos = None currentChr = None cs = 0 dataCount = None def writeWindow(chrm,pos,epos): global dataCount global cs global windows global percent lmethCount,lunmethCount,rmethCount,runmethCount = pooledMethCounts(dataCount,covs,compare) if (lmethCount + lunmethCount) == 0 or (rmethCount + runmethCount) == 0: print "Unprintable window:",chrm, pos, epos, lmethCount, lunmethCount, rmethCount, runmethCount return lPercentage = (float(lmethCount)/float(lmethCount+lunmethCount)) * 100.0 rPercentage = (float(rmethCount)/float(rmethCount+runmethCount)) * 100.0 total,totaldof,totalp = totalG(dataCount,covs,compare) pooled,pooleddof,pooledp = pooledG(dataCount,covs,compare) hetero,heterodof,heterop = heterogeneityG(total,totaldof,pooled,pooleddof) windows.writerow([chrm,pos,epos+1,'1' if rPercentage > lPercentage else '-1',totalp,pooledp,heterop,cs]) percent.writerow([chrm,pos,epos+1,lPercentage,rPercentage]) dataCount = None cs = 0 lastCoordCounted = None # the actual work here for line in methfile: chrm,coord,data = extractLine(line) if currentChr == None: currentChr = chrm currentWindowPos = coord lastCoordCounted = coord else: # always write a window if we change chrm if chrm != currentChr: writeWindow(currentChr,currentWindowPos,lastCoordCounted) currentChr = chrm currentWindowPos = coord # write a window if the next cpg is outside the current window elif coord - currentWindowPos > windowSize: writeWindow(currentChr,currentWindowPos,lastCoordCounted) currentChr = chrm currentWindowPos = coord # otherwise dont write any window (the current cpg is in the window) #### # add data to existing data count #### if dataCount == None: dataCount = data else: for index,d in enumerate(data): dataCount[index] += d lastCoordCounted = coord # inc number of c's counted cs += 1 # after we write a window there is always at least one unmeasured coord after it so we have one more window to write writeWindow(currentChr,currentWindowPos,lastCoordCounted)
Note that the above script relies upon the Memoized decorator.
Executed as:
python replicatesGTest.py \ --combined "Conditions.1.and.2.CpG_context.aggregate" \ --cov Condition1,Condition2,Condition1,Condition2,Condition1,Condition2 --compare Condition1,Condition2
Where --cov lists the order of the conditions in the file and --cov indicates which conditions should be compared.
For each window, this script outputs a tab separated file containing the chromosome,start and stop position of the window,'1' if it is "increasing" otherwise '-1' and p-values for computed from the Total, Pooled ahd Hetero G computations as well as the number of methylation sites considered in the window.
Note that the p-values generated are uncorrected for multiple testing and should be FDR corrected. I've previously described a suitable approach for large files such as these.
Differentially methylated regions can then be defined as regions of adjacent windows with a FDR adjusted p-value less than or equal to a fixed level (i.e. 0.05 FDR) that change in the same direction.