## Thursday, 7 February 2013

### Post-Processing Bismark Bisulphite Sequencing Data

This is Part 1 in a series on Bisulphite Sequencing.
You can skip to Part 2 (Using the Binomial Distribution in Bisulphite Sequencing)
.

In this post I will present a couple of scripts that are useful for processing the data produced by the Bismark bisulphite sequencing aligner into a form that is a little easier to work with.

When you run Bismark it will produce output in either its own "vanilla" format (version <= 0.5x) or in SAM format (version >= 0.6x).

Either way, the approach at this point is to remove duplicate reads that might have occurred as PCR duplicates using deduplicate_bismark_alignment_output.  I usually also remove any fragments which look like they haven't been bisulphite converted (multiple apparently methylated CHH or CHG within a single fragment).

For Bismark vanilla output you can use something like this:

awk 'FNR>1{
countl = split($8,a,"[HX]"); countr = split($11,a,"[HX]");
if (countl-1 <= 3 && countr-1 <= 3)
print;
}' $file >$file.unconvertedreadsremoved


(Where the 8th and 11th columns contain the methylation call string in Bismark format).

Next up is to extract the methylation calls for each base using the methylation_extractor script (included with Bismark) using the --no_overlap and --comprehensive options.

This will give you a file that looks a little like this:

FCD0KMMACXX:3:1101:17551:2000#CTTCCTCC/1    +    chr3    116402006    Z
FCD0KMMACXX:3:1101:17551:2000#CTTCCTCC/1    +    chr3    116401980    Z
FCD0KMMACXX:3:1101:17551:2000#CTTCCTCC/1    +    chr3    116401960    Z
FCD0KMMACXX:3:1101:17865:1999#CTTCCTCC/1    -    chrX    115417650    z
FCD0KMMACXX:3:1101:18710:1999#CTTCCTCC/1    -    chr15    51474193    z
FCD0KMMACXX:3:1101:18710:1999#CTTCCTCC/1    +    chr15    51474352    Z
FCD0KMMACXX:3:1101:18599:2000#CTTCCTCC/1    +    chr3    101217644    Z


First we make it a little easier by sorting by chromosome and base position such that all the methylation calls for the same base are together as well as removing the Bismark header line.

awk 'FNR>1' CpG_context.txt | \
sort -k3,3 -k4,4n -S10G > CpG_context.sorted.txt


Again, as in the previous post, you can control sorts memory usage using the -S flag (set to 10G above). We can then sum up (aggregate) the file on a per CpG basis.

import getopt
import sys
import csv

def parseMethLine(line):
return (True if meth == '+' else False, chrm, coord)

if __name__ == '__main__':
try:
opts, args = getopt.getopt(sys.argv[1:], "",
["bismark=", "aggregate="])
except getopt.GetoptError, err:
# print help information and exit
# will print something like "option -a not recognized"
print str(err)
sys.exit(2)

infile = None
aggregatefile= None

for o, a in opts:
if o=="--bismark":
infile = a
print "Bismark", a
elif o=="--aggregate":
aggregatefile = a
print "Aggregate", a

assert infile != None
assert aggregatefile != None

aggregate = csv.writer(open(aggregatefile,"w"),delimiter='\t')

currentchrm = None
currentpos = None

methylated = 0
unmethylated = 0

for line in bismark:
(methcall, chrm, coord) = parseMethLine(line)

# meth calls in 1 base, ucsc / bedgraph in 0 base
coord = int(coord) -1

# init
if currentpos == None or chrm == None:
currentchrm = chrm
currentpos = coord
elif currentpos != coord or currentchrm != chrm:
# a new position or chromosome encountered
# output our counts until now
aggregate.writerow([currentchrm,currentpos,
methylated,unmethylated])

# reset counts for next position
currentchrm = chrm
currentpos = coord
methylated = 0
unmethylated = 0

if methcall:
methylated += 1
else:
unmethylated += 1

# output last line
aggregate.writerow([currentchrm,currentpos,
methylated,unmethylated])


Executed as:

python aggregateBismark.py \
--bismark "CpG_context.sorted.txt" \
--aggregate "CpG_context.aggregate"


This will produce a file that's a little more useful.  Each CpG is represented by a single line with two data columns indicating the number of methylated and unmethylated calls made for that base (3rd and 4th columns respectively).

In other words we will have a file that looks like this:

chr1    1343    1    0
chr1    1625    4    0
chr1    1626    1    0
chr1    1655    1    0
chr1    1689    2    1
chr1    1690    2    0


Since we probably have more than one sample that we will want to be able to compare with each other we will want to combine two (or more) of the aggregate files above.  The script below will combine any two files with chromosome and coordinate as the first columns which are sorted the same way.

import getopt
import sys
import csv

if __name__ == '__main__':
try:
opts, args = getopt.getopt(sys.argv[1:], "",
["one=", "two=", "combine="])
except getopt.GetoptError, err:
# print help information and exit
# will print something like "option -a not recognized"
print str(err)
sys.exit(2)

oneloc = None
twoloc = None
combineloc = None

for o, a in opts:
if o=="--one":
oneloc = a
print "One", a
elif o=="--two":
twoloc = a
print "Two", a
elif o=="--combine":
combineloc = a
print "Combine", a

assert oneloc != None
assert twoloc != None
assert combineloc != None

combine = csv.writer(open(combineloc,"w"),delimiter='\t')

def unpackLine(chrm,coord,*data):
return chrm.strip(), int(coord), [int(d) for d in data]

def extractLine(line):
chrm,coord,data = unpackLine(*line)
if not chrm.startswith("chr"):
chrm = "chr"+chrm
return (chrm,coord,data)

# get the first data points

lchr,lcoord,ldata = extractLine(left.next())
rchr,rcoord,rdata = extractLine(right.next())

# find out how many data columns in each object
# first line will tell us
leftcols = len(ldata)
rightcols = len(rdata)

leftFinished = False
rightFinished = False

def spoolBoth():
global lchr,lcoord,ldata,leftFinished
global rchr,rcoord,rdata,rightFinished

row = [lchr,lcoord]
row.extend(ldata)
row.extend(rdata)
combine.writerow(row)

try:
lchr,lcoord,ldata = extractLine(left.next())
except StopIteration:
leftFinished = True
try:
rchr,rcoord,rdata = extractLine(right.next())
except StopIteration:
rightFinished = True

def spoolLeft():
global lchr,lcoord,ldata,leftFinished

rdata = [0 for i in range(0,rightcols)]

row = [lchr,lcoord]
row.extend(ldata)
row.extend(rdata)
combine.writerow(row)

try:
lchr,lcoord,ldata = extractLine(left.next())
except StopIteration:
leftFinished = True

def spoolRight():
global rchr,rcoord,rdata,rightFinished

ldata = [0 for i in range(0,leftcols)]

row = [rchr,rcoord]
row.extend(ldata)
row.extend(rdata)
combine.writerow(row)

try:
rchr,rcoord,rdata = extractLine(right.next())
except StopIteration:
rightFinished = True

while not leftFinished or not rightFinished:

if leftFinished:
spoolRight()
elif rightFinished:
spoolLeft()
else:
if lchr == rchr and lcoord == rcoord:
# we have a match, print it and spool both
spoolBoth()
elif lchr < rchr:
spoolLeft()
elif lchr > rchr:
spoolRight()
# we are on the same chromosome
# but we have skipped one (or more) position
elif lcoord < rcoord:
spoolLeft()
else:
spoolRight()


Execute this as follows:

python combineAggregate.py \
--one "Sample1.CpG_context.aggregate" \
--two "Sample2.CpG_context.aggregate" \
--combine "Samples.1.and.2.CpG_context.aggregate"


Ultimately you'll be left with a file where each CpG (or CHH, CHG) is a single line. A pair of columns represents the number of methylated and unmethylated calls in one sample and multiple pairs indicate multiple samples. Giving you something that looks like this.
chr1    1343    1    0    2    0
chr1    1625    4    0    5    3
chr1    1626    1    0    1    0
chr1    1655    1    0    1    5
chr1    1689    2    1    2    1
chr1    1690    2    0    3    1


I find this file format much easier to deal with for further downstream analysis. Once you've combined Sample 1 and Sample 2 you can then take the output and combine that with Sample 3 and so on.

1. Hi,
Bismark's methylation extractor now includes --bedGraph and --counts options, which produces a very similar output.

2. Hi, you are absolutely correct. Bismark now includes a bismark2bedGraph tool which does the same job as the first half of my post (but was originally included as part of the methylation extractor tool). As of v0.7.8 (01 Mar 2013) it also added a --buffer_size option to control the memory usage of the sort (as accomplished in my tool with the -S flag).

As far as I'm aware it doesn't yet include anything for efficiently combining the count files for multiple samples for downstream analysis.

I plan to post some more on downstream analysis of bisulphite data in the near future.

3. This comment has been removed by the author.

1. Sorry, I might have done something wrong. Anyway, I've ported the code to Python 3 and was able to run it smoothly. Thanks again for the very well described approach!

4. What happened to this result?

chr1 1625
chr1 1626

1625 and 1626 is closed to each other? or they are from top or bottom chain?

Thanks.

1. Hi Shicheng, CpG's are identified by the position of the C within the CpG. This allows us to talk about the methylation of both strands separately. This also allows us to talk about methylation of C's in non CpG contexts sensibly (since it's always the cytosine base at that position which is undergoing methylation).