Monday, 8 July 2013

Bisulphite Sequencing Simple Visualisation (Multiple Samples)

This is Part 3 in a series on Bisulphite Sequencing.
You can return to Part 2 (Using the Binomial Distribution in Bisulphite Sequencing).
or skip to Part 4 (The G Test and Bisulphite Sequencing DMRs).

Jumping back to bisulphite sequencing analysis for a bit.  After we have summarised our data by individual CpG we probably want to do something useful with it.

First up, lets just display it on the UCSC genome browser.  When we display bisulphite sequening data on UCSC we probably want something like this below:

We also want to filter out low confidence CpGs (those with less than a certain number of reads). bismarkToBedGraph does something pretty much like this.

However, it is important that when we filter a particular CpG from any given sample we should also filter it from the other samples displayed at the same time - otherwise we give a misleading indication that there is 0% methylation in the filtered sample and a non-negative amount of methylation in the unfiltered sample.

To address this I made a short script that can take multiple replicates/samples of bisulphite sequencing data, summarised by individual CpGs, and output bedGraph files such that the same CpGs are always reported based on coverage.

import getopt
import sys
import csv
from itertools import izip_longest

class CommentException(Exception):

def grouper(n, iterable, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

def unpackLine(chrm,coord,*data):
    return chrm.strip(), int(coord), [int(d) for d in data]

def extractLine(line):
    if line[0].startswith("#"):
        raise CommentException()
    chrm,coord,data = unpackLine(*line)

    if not chrm.startswith("chr"):
        chrm = "chr"+chrm
    return (chrm,coord,data)
if __name__ == '__main__':
        opts, args = getopt.getopt(sys.argv[1:], "", ["combined=","suffixes="])
    except getopt.GetoptError, err:
        # print help information and exit:
        print str(err) # will print something like "option -a not recognized"
    infile = None
    suffixes = None
    rightsuffix = None
    for o, a in opts:
        if o=="--combined":
            infile = a
            print "combined", a
        elif o=="--suffixes":
            suffixes = a.split(",")
            print "Suffix", suffixes

    assert infile != None
    assert suffixes != None
    methfile = csv.reader(open(infile, "r"), delimiter="\t")
    files = {}
    for suffix in suffixes:
        percentage = csv.writer(open(infile+"-"+suffix+".percent.bedGraph","w"),delimiter='\t')
        coverage = csv.writer(open(infile+"-"+suffix+".coverage.bedGraph","w"),delimiter='\t')
    # ---
    # for each
    # coverage
    # percentage
    minReadsForPercentage = 5
    minSamplesWithCoverage = 1

    for line in methfile:
            (chrm,coord,data) = extractLine(line)
        except CommentException:
        percentages = []
        coverages = []
        for meth,unmeth in grouper(2,data):
            coverage = meth + unmeth
            if coverage == 0:
                percentage = 0
                percentage = (float(meth)/float(meth+unmeth)) * 100.0
        satisfycoveragecriteria = 0
        for coverage in coverages:
            if coverage >= minReadsForPercentage:
                satisfycoveragecriteria += 1
        if satisfycoveragecriteria < minSamplesWithCoverage:

        for index,(percentage,coverage) in enumerate(izip_longest(percentages,coverages)):
            percentfile,coveragefile = files[suffixes[index]]

Executed as:

python \
--combined "Samples.1.and.2.CpG_context.aggregate" \
--suffixes Sample1,Sample2,Sample3,Sample4 

You can then use bedGraphToBigWig to create bigWig files suitable for UCSC custom track / track hubs.

No comments:

Post a Comment