You can return to Part 2 (Using the Binomial Distribution in Bisulphite Sequencing). or skip to Part 4 (The G Test and Bisulphite Sequencing DMRs).
Jumping back to bisulphite sequencing analysis for a bit. After we have summarised our data by individual CpG we probably want to do something useful with it.
First up, lets just display it on the UCSC genome browser. When we display bisulphite sequening data on UCSC we probably want something like this below:
We also want to filter out low confidence CpGs (those with less than a certain number of reads). bismarkToBedGraph does something pretty much like this.
However, it is important that when we filter a particular CpG from any given sample we should also filter it from the other samples displayed at the same time - otherwise we give a misleading indication that there is 0% methylation in the filtered sample and a non-negative amount of methylation in the unfiltered sample.
To address this I made a short script that can take multiple replicates/samples of bisulphite sequencing data, summarised by individual CpGs, and output bedGraph files such that the same CpGs are always reported based on coverage.
import getopt import sys import csv from itertools import izip_longest class CommentException(Exception): pass def grouper(n, iterable, fillvalue=None): "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx" args = [iter(iterable)] * n return izip_longest(fillvalue=fillvalue, *args) def unpackLine(chrm,coord,*data): return chrm.strip(), int(coord), [int(d) for d in data] def extractLine(line): if line[0].startswith("#"): raise CommentException() chrm,coord,data = unpackLine(*line) if not chrm.startswith("chr"): chrm = "chr"+chrm return (chrm,coord,data) if __name__ == '__main__': try: opts, args = getopt.getopt(sys.argv[1:], "", ["combined=","suffixes="]) except getopt.GetoptError, err: # print help information and exit: print str(err) # will print something like "option -a not recognized" sys.exit(2) infile = None suffixes = None rightsuffix = None for o, a in opts: if o=="--combined": infile = a print "combined", a elif o=="--suffixes": suffixes = a.split(",") print "Suffix", suffixes assert infile != None assert suffixes != None methfile = csv.reader(open(infile, "r"), delimiter="\t") files = {} for suffix in suffixes: percentage = csv.writer(open(infile+"-"+suffix+".percent.bedGraph","w"),delimiter='\t') coverage = csv.writer(open(infile+"-"+suffix+".coverage.bedGraph","w"),delimiter='\t') files[suffix]=(percentage,coverage) # --- # for each # coverage # percentage minReadsForPercentage = 5 minSamplesWithCoverage = 1 for line in methfile: try: (chrm,coord,data) = extractLine(line) except CommentException: continue percentages = [] coverages = [] for meth,unmeth in grouper(2,data): coverage = meth + unmeth if coverage == 0: percentage = 0 else: percentage = (float(meth)/float(meth+unmeth)) * 100.0 percentages.append(percentage) coverages.append(coverage) satisfycoveragecriteria = 0 for coverage in coverages: if coverage >= minReadsForPercentage: satisfycoveragecriteria += 1 if satisfycoveragecriteria < minSamplesWithCoverage: continue for index,(percentage,coverage) in enumerate(izip_longest(percentages,coverages)): percentfile,coveragefile = files[suffixes[index]] percentfile.writerow([chrm,coord,coord+1,percentage]) coveragefile.writerow([chrm,coord,coord+1,coverage])
Executed as:
python bisulphiteToBedGraph.py \ --combined "Samples.1.and.2.CpG_context.aggregate" \ --suffixes Sample1,Sample2,Sample3,Sample4
You can then use bedGraphToBigWig to create bigWig files suitable for UCSC custom track / track hubs.
No comments:
Post a Comment
Note: only a member of this blog may post a comment.