You can skip to Part 2 (Using the Binomial Distribution in Bisulphite Sequencing).
In this post I will present a couple of scripts that are useful for processing the data produced by the Bismark bisulphite sequencing aligner into a form that is a little easier to work with.
When you run Bismark it will produce output in either its own "vanilla" format (version <= 0.5x) or in SAM format (version >= 0.6x).
Either way, the approach at this point is to remove duplicate reads that might have occurred as PCR duplicates using deduplicate_bismark_alignment_output. I usually also remove any fragments which look like they haven't been bisulphite converted (multiple apparently methylated CHH or CHG within a single fragment).
For Bismark vanilla output you can use something like this:
awk 'FNR>1{ countl = split($8,a,"[HX]"); countr = split($11,a,"[HX]"); if (countl-1 <= 3 && countr-1 <= 3) print; }' $file > $file.unconvertedreadsremoved
(Where the 8th and 11th columns contain the methylation call string in Bismark format).
Next up is to extract the methylation calls for each base using the methylation_extractor script (included with Bismark) using the --no_overlap and --comprehensive options.
This will give you a file that looks a little like this:
FCD0KMMACXX:3:1101:17551:2000#CTTCCTCC/1 + chr3 116402006 Z FCD0KMMACXX:3:1101:17551:2000#CTTCCTCC/1 + chr3 116401980 Z FCD0KMMACXX:3:1101:17551:2000#CTTCCTCC/1 + chr3 116401960 Z FCD0KMMACXX:3:1101:17865:1999#CTTCCTCC/1 - chrX 115417650 z FCD0KMMACXX:3:1101:18710:1999#CTTCCTCC/1 - chr15 51474193 z FCD0KMMACXX:3:1101:18710:1999#CTTCCTCC/1 + chr15 51474352 Z FCD0KMMACXX:3:1101:18599:2000#CTTCCTCC/1 + chr3 101217644 Z
First we make it a little easier by sorting by chromosome and base position such that all the methylation calls for the same base are together as well as removing the Bismark header line.
awk 'FNR>1' CpG_context.txt | \ sort -k3,3 -k4,4n -S10G > CpG_context.sorted.txt
Again, as in the previous post, you can control sorts memory usage using the -S flag (set to 10G above). We can then sum up (aggregate) the file on a per CpG basis.
import getopt import sys import csv def parseMethLine(line): (readid,meth,chrm,coord,methcall) = line return (True if meth == '+' else False, chrm, coord) if __name__ == '__main__': try: opts, args = getopt.getopt(sys.argv[1:], "", ["bismark=", "aggregate="]) except getopt.GetoptError, err: # print help information and exit # will print something like "option -a not recognized" print str(err) sys.exit(2) infile = None aggregatefile= None for o, a in opts: if o=="--bismark": infile = a print "Bismark", a elif o=="--aggregate": aggregatefile = a print "Aggregate", a assert infile != None assert aggregatefile != None bismark = csv.reader(open(infile, "r"), delimiter="\t") aggregate = csv.writer(open(aggregatefile,"w"),delimiter='\t') currentchrm = None currentpos = None methylated = 0 unmethylated = 0 for line in bismark: (methcall, chrm, coord) = parseMethLine(line) # meth calls in 1 base, ucsc / bedgraph in 0 base coord = int(coord) -1 # init if currentpos == None or chrm == None: currentchrm = chrm currentpos = coord elif currentpos != coord or currentchrm != chrm: # a new position or chromosome encountered # output our counts until now aggregate.writerow([currentchrm,currentpos, methylated,unmethylated]) # reset counts for next position currentchrm = chrm currentpos = coord methylated = 0 unmethylated = 0 if methcall: methylated += 1 else: unmethylated += 1 # output last line aggregate.writerow([currentchrm,currentpos, methylated,unmethylated])
Executed as:
python aggregateBismark.py \ --bismark "CpG_context.sorted.txt" \ --aggregate "CpG_context.aggregate"
This will produce a file that's a little more useful. Each CpG is represented by a single line with two data columns indicating the number of methylated and unmethylated calls made for that base (3rd and 4th columns respectively).
In other words we will have a file that looks like this:
chr1 1343 1 0 chr1 1625 4 0 chr1 1626 1 0 chr1 1655 1 0 chr1 1689 2 1 chr1 1690 2 0
Since we probably have more than one sample that we will want to be able to compare with each other we will want to combine two (or more) of the aggregate files above. The script below will combine any two files with chromosome and coordinate as the first columns which are sorted the same way.
import getopt import sys import csv if __name__ == '__main__': try: opts, args = getopt.getopt(sys.argv[1:], "", ["one=", "two=", "combine="]) except getopt.GetoptError, err: # print help information and exit # will print something like "option -a not recognized" print str(err) sys.exit(2) oneloc = None twoloc = None combineloc = None for o, a in opts: if o=="--one": oneloc = a print "One", a elif o=="--two": twoloc = a print "Two", a elif o=="--combine": combineloc = a print "Combine", a assert oneloc != None assert twoloc != None assert combineloc != None left = csv.reader(open(oneloc, "r"), delimiter="\t") right = csv.reader(open(twoloc, "r"), delimiter="\t") combine = csv.writer(open(combineloc,"w"),delimiter='\t') def unpackLine(chrm,coord,*data): return chrm.strip(), int(coord), [int(d) for d in data] def extractLine(line): chrm,coord,data = unpackLine(*line) if not chrm.startswith("chr"): chrm = "chr"+chrm return (chrm,coord,data) # get the first data points lchr,lcoord,ldata = extractLine(left.next()) rchr,rcoord,rdata = extractLine(right.next()) # find out how many data columns in each object # first line will tell us leftcols = len(ldata) rightcols = len(rdata) leftFinished = False rightFinished = False def spoolBoth(): global lchr,lcoord,ldata,leftFinished global rchr,rcoord,rdata,rightFinished row = [lchr,lcoord] row.extend(ldata) row.extend(rdata) combine.writerow(row) try: lchr,lcoord,ldata = extractLine(left.next()) except StopIteration: leftFinished = True try: rchr,rcoord,rdata = extractLine(right.next()) except StopIteration: rightFinished = True def spoolLeft(): global lchr,lcoord,ldata,leftFinished rdata = [0 for i in range(0,rightcols)] row = [lchr,lcoord] row.extend(ldata) row.extend(rdata) combine.writerow(row) try: lchr,lcoord,ldata = extractLine(left.next()) except StopIteration: leftFinished = True def spoolRight(): global rchr,rcoord,rdata,rightFinished ldata = [0 for i in range(0,leftcols)] row = [rchr,rcoord] row.extend(ldata) row.extend(rdata) combine.writerow(row) try: rchr,rcoord,rdata = extractLine(right.next()) except StopIteration: rightFinished = True while not leftFinished or not rightFinished: if leftFinished: spoolRight() elif rightFinished: spoolLeft() else: if lchr == rchr and lcoord == rcoord: # we have a match, print it and spool both spoolBoth() elif lchr < rchr: spoolLeft() elif lchr > rchr: spoolRight() # we are on the same chromosome # but we have skipped one (or more) position elif lcoord < rcoord: spoolLeft() else: spoolRight()
Execute this as follows:
python combineAggregate.py \ --one "Sample1.CpG_context.aggregate" \ --two "Sample2.CpG_context.aggregate" \ --combine "Samples.1.and.2.CpG_context.aggregate"
Ultimately you'll be left with a file where each CpG (or CHH, CHG) is a single line. A pair of columns represents the number of methylated and unmethylated calls in one sample and multiple pairs indicate multiple samples. Giving you something that looks like this.
chr1 1343 1 0 2 0 chr1 1625 4 0 5 3 chr1 1626 1 0 1 0 chr1 1655 1 0 1 5 chr1 1689 2 1 2 1 chr1 1690 2 0 3 1
I find this file format much easier to deal with for further downstream analysis. Once you've combined Sample 1 and Sample 2 you can then take the output and combine that with Sample 3 and so on.
Hi,
ReplyDeleteBismark's methylation extractor now includes --bedGraph and --counts options, which produces a very similar output.
Hi, you are absolutely correct. Bismark now includes a bismark2bedGraph tool which does the same job as the first half of my post (but was originally included as part of the methylation extractor tool). As of v0.7.8 (01 Mar 2013) it also added a --buffer_size option to control the memory usage of the sort (as accomplished in my tool with the -S flag).
ReplyDeleteAs far as I'm aware it doesn't yet include anything for efficiently combining the count files for multiple samples for downstream analysis.
I plan to post some more on downstream analysis of bisulphite data in the near future.
This comment has been removed by the author.
ReplyDeleteSorry, I might have done something wrong. Anyway, I've ported the code to Python 3 and was able to run it smoothly. Thanks again for the very well described approach!
DeleteWhat happened to this result?
ReplyDeletechr1 1625
chr1 1626
1625 and 1626 is closed to each other? or they are from top or bottom chain?
Thanks.
Hi Shicheng, CpG's are identified by the position of the C within the CpG. This allows us to talk about the methylation of both strands separately. This also allows us to talk about methylation of C's in non CpG contexts sensibly (since it's always the cytosine base at that position which is undergoing methylation).
Delete