Thursday, 7 February 2013

Post-Processing Bismark Bisulphite Sequencing Data

This is Part 1 in a series on Bisulphite Sequencing.
You can skip to Part 2 (Using the Binomial Distribution in Bisulphite Sequencing)
.


In this post I will present a couple of scripts that are useful for processing the data produced by the Bismark bisulphite sequencing aligner into a form that is a little easier to work with.

When you run Bismark it will produce output in either its own "vanilla" format (version <= 0.5x) or in SAM format (version >= 0.6x).

Either way, the approach at this point is to remove duplicate reads that might have occurred as PCR duplicates using deduplicate_bismark_alignment_output.  I usually also remove any fragments which look like they haven't been bisulphite converted (multiple apparently methylated CHH or CHG within a single fragment).

For Bismark vanilla output you can use something like this:

awk 'FNR>1{
      countl = split($8,a,"[HX]");
      countr = split($11,a,"[HX]");
      if (countl-1 <= 3 && countr-1 <= 3)
           print;
     }' $file > $file.unconvertedreadsremoved

(Where the 8th and 11th columns contain the methylation call string in Bismark format).

Next up is to extract the methylation calls for each base using the methylation_extractor script (included with Bismark) using the --no_overlap and --comprehensive options.

This will give you a file that looks a little like this:

FCD0KMMACXX:3:1101:17551:2000#CTTCCTCC/1    +    chr3    116402006    Z
FCD0KMMACXX:3:1101:17551:2000#CTTCCTCC/1    +    chr3    116401980    Z
FCD0KMMACXX:3:1101:17551:2000#CTTCCTCC/1    +    chr3    116401960    Z
FCD0KMMACXX:3:1101:17865:1999#CTTCCTCC/1    -    chrX    115417650    z
FCD0KMMACXX:3:1101:18710:1999#CTTCCTCC/1    -    chr15    51474193    z
FCD0KMMACXX:3:1101:18710:1999#CTTCCTCC/1    +    chr15    51474352    Z
FCD0KMMACXX:3:1101:18599:2000#CTTCCTCC/1    +    chr3    101217644    Z

First we make it a little easier by sorting by chromosome and base position such that all the methylation calls for the same base are together as well as removing the Bismark header line.

awk 'FNR>1' CpG_context.txt | \
sort -k3,3 -k4,4n -S10G > CpG_context.sorted.txt

Again, as in the previous post, you can control sorts memory usage using the -S flag (set to 10G above). We can then sum up (aggregate) the file on a per CpG basis.

import getopt
import sys
import csv

def parseMethLine(line):
    (readid,meth,chrm,coord,methcall) = line
    return (True if meth == '+' else False, chrm, coord)

if __name__ == '__main__':
    try:
        opts, args = getopt.getopt(sys.argv[1:], "",
                                  ["bismark=", "aggregate="])
    except getopt.GetoptError, err:
        # print help information and exit
        # will print something like "option -a not recognized"
        print str(err) 
        sys.exit(2)
    
    infile = None
    aggregatefile= None
    
    for o, a in opts:
        if o=="--bismark":
            infile = a
            print "Bismark", a
        elif o=="--aggregate":
            aggregatefile = a
            print "Aggregate", a
    
    assert infile != None
    assert aggregatefile != None
    
    bismark = csv.reader(open(infile, "r"), delimiter="\t")
    
    aggregate = csv.writer(open(aggregatefile,"w"),delimiter='\t')      
    
    currentchrm = None
    currentpos = None
    
    methylated = 0
    unmethylated = 0
    
    for line in bismark:
        (methcall, chrm, coord) = parseMethLine(line)
        
        # meth calls in 1 base, ucsc / bedgraph in 0 base
        coord = int(coord) -1 
        
        # init
        if currentpos == None or chrm == None:
            currentchrm = chrm
            currentpos = coord
        elif currentpos != coord or currentchrm != chrm:
            # a new position or chromosome encountered
            # output our counts until now
            aggregate.writerow([currentchrm,currentpos,
                                methylated,unmethylated])
            
            # reset counts for next position
            currentchrm = chrm
            currentpos = coord
            methylated = 0
            unmethylated = 0
        
        if methcall:
            methylated += 1
        else:
            unmethylated += 1
    
    # output last line
    aggregate.writerow([currentchrm,currentpos,
                        methylated,unmethylated])

Executed as:

python aggregateBismark.py \
--bismark "CpG_context.sorted.txt" \
--aggregate "CpG_context.aggregate"

This will produce a file that's a little more useful.  Each CpG is represented by a single line with two data columns indicating the number of methylated and unmethylated calls made for that base (3rd and 4th columns respectively).

In other words we will have a file that looks like this:

chr1    1343    1    0
chr1    1625    4    0
chr1    1626    1    0
chr1    1655    1    0
chr1    1689    2    1
chr1    1690    2    0

Since we probably have more than one sample that we will want to be able to compare with each other we will want to combine two (or more) of the aggregate files above.  The script below will combine any two files with chromosome and coordinate as the first columns which are sorted the same way.

import getopt
import sys
import csv

if __name__ == '__main__':
    try:
        opts, args = getopt.getopt(sys.argv[1:], "",
                                  ["one=", "two=", "combine="])
    except getopt.GetoptError, err:
        # print help information and exit
        # will print something like "option -a not recognized"
        print str(err) 
        sys.exit(2)
    
    oneloc = None
    twoloc = None
    combineloc = None 
    
    for o, a in opts:
        if o=="--one":
            oneloc = a
            print "One", a
        elif o=="--two":
            twoloc = a
            print "Two", a
        elif o=="--combine":
            combineloc = a
            print "Combine", a
    
    assert oneloc != None
    assert twoloc != None
    assert combineloc != None
    
    left = csv.reader(open(oneloc, "r"), delimiter="\t")
    right = csv.reader(open(twoloc, "r"), delimiter="\t")
    
    combine = csv.writer(open(combineloc,"w"),delimiter='\t')      
    
    def unpackLine(chrm,coord,*data):
        return chrm.strip(), int(coord), [int(d) for d in data]
    
    def extractLine(line):
        chrm,coord,data = unpackLine(*line)
        if not chrm.startswith("chr"):
            chrm = "chr"+chrm
        return (chrm,coord,data)
    
    # get the first data points
    
    lchr,lcoord,ldata = extractLine(left.next())
    rchr,rcoord,rdata = extractLine(right.next())
    
    # find out how many data columns in each object
    # first line will tell us
    leftcols = len(ldata)
    rightcols = len(rdata)
    
    leftFinished = False
    rightFinished = False
    
    
    def spoolBoth():
        global lchr,lcoord,ldata,leftFinished
        global rchr,rcoord,rdata,rightFinished
        
        row = [lchr,lcoord]
        row.extend(ldata)
        row.extend(rdata)        
        combine.writerow(row)
        
        try:    
            lchr,lcoord,ldata = extractLine(left.next())
        except StopIteration:
            leftFinished = True    
        try:            
            rchr,rcoord,rdata = extractLine(right.next())
        except StopIteration:
            rightFinished = True 
        
    def spoolLeft():
        global lchr,lcoord,ldata,leftFinished
        
        rdata = [0 for i in range(0,rightcols)]
        
        row = [lchr,lcoord]
        row.extend(ldata)
        row.extend(rdata)        
        combine.writerow(row)
        
        try:    
            lchr,lcoord,ldata = extractLine(left.next())
        except StopIteration:
            leftFinished = True          
    
    def spoolRight():
        global rchr,rcoord,rdata,rightFinished
        
        ldata = [0 for i in range(0,leftcols)]
        
        row = [rchr,rcoord]
        row.extend(ldata)
        row.extend(rdata)
        combine.writerow(row)
        
        try:            
            rchr,rcoord,rdata = extractLine(right.next())
        except StopIteration:
            rightFinished = True    
    
    while not leftFinished or not rightFinished:
   
        if leftFinished:
            spoolRight()
        elif rightFinished:
            spoolLeft()
        else:
            if lchr == rchr and lcoord == rcoord:
                # we have a match, print it and spool both
                spoolBoth()
            elif lchr < rchr:
                spoolLeft()
            elif lchr > rchr:
                spoolRight()   
            # we are on the same chromosome
            # but we have skipped one (or more) position
            elif lcoord < rcoord:
                spoolLeft()
            else:
                spoolRight()

Execute this as follows:

python combineAggregate.py \
--one "Sample1.CpG_context.aggregate" \
--two "Sample2.CpG_context.aggregate" \
--combine "Samples.1.and.2.CpG_context.aggregate"


Ultimately you'll be left with a file where each CpG (or CHH, CHG) is a single line. A pair of columns represents the number of methylated and unmethylated calls in one sample and multiple pairs indicate multiple samples. Giving you something that looks like this.
chr1    1343    1    0    2    0
chr1    1625    4    0    5    3
chr1    1626    1    0    1    0
chr1    1655    1    0    1    5
chr1    1689    2    1    2    1
chr1    1690    2    0    3    1

I find this file format much easier to deal with for further downstream analysis. Once you've combined Sample 1 and Sample 2 you can then take the output and combine that with Sample 3 and so on.

6 comments:

  1. Hi,
    Bismark's methylation extractor now includes --bedGraph and --counts options, which produces a very similar output.

    ReplyDelete
  2. Hi, you are absolutely correct. Bismark now includes a bismark2bedGraph tool which does the same job as the first half of my post (but was originally included as part of the methylation extractor tool). As of v0.7.8 (01 Mar 2013) it also added a --buffer_size option to control the memory usage of the sort (as accomplished in my tool with the -S flag).

    As far as I'm aware it doesn't yet include anything for efficiently combining the count files for multiple samples for downstream analysis.

    I plan to post some more on downstream analysis of bisulphite data in the near future.

    ReplyDelete
  3. This comment has been removed by the author.

    ReplyDelete
    Replies
    1. Sorry, I might have done something wrong. Anyway, I've ported the code to Python 3 and was able to run it smoothly. Thanks again for the very well described approach!

      Delete
  4. What happened to this result?

    chr1 1625
    chr1 1626

    1625 and 1626 is closed to each other? or they are from top or bottom chain?

    Thanks.

    ReplyDelete
    Replies
    1. Hi Shicheng, CpG's are identified by the position of the C within the CpG. This allows us to talk about the methylation of both strands separately. This also allows us to talk about methylation of C's in non CpG contexts sensibly (since it's always the cytosine base at that position which is undergoing methylation).

      Delete

Note: only a member of this blog may post a comment.