McBryan's Musings: genome browser

This is Part 3 in a series on Bisulphite Sequencing.
You can return to Part 2 (Using the Binomial Distribution in Bisulphite Sequencing). or skip to Part 4 (The G Test and Bisulphite Sequencing DMRs).

Jumping back to bisulphite sequencing analysis for a bit. After we have summarised our data by individual CpG we probably want to do something useful with it.

First up, lets just display it on the UCSC genome browser. When we display bisulphite sequening data on UCSC we probably want something like this below:

We also want to filter out low confidence CpGs (those with less than a certain number of reads). bismarkToBedGraph does something pretty much like this.

However, it is important that when we filter a particular CpG from any given sample we should also filter it from the other samples displayed at the same time - otherwise we give a misleading indication that there is 0% methylation in the filtered sample and a non-negative amount of methylation in the unfiltered sample.

To address this I made a short script that can take multiple replicates/samples of bisulphite sequencing data, summarised by individual CpGs, and output bedGraph files such that the same CpGs are always reported based on coverage.

import getopt
import sys
import csv
from itertools import izip_longest

class CommentException(Exception):
    pass

def grouper(n, iterable, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

def unpackLine(chrm,coord,*data):
    return chrm.strip(), int(coord), [int(d) for d in data]

def extractLine(line):
    if line[0].startswith("#"):
        raise CommentException()
    chrm,coord,data = unpackLine(*line)

    if not chrm.startswith("chr"):
        chrm = "chr"+chrm
    return (chrm,coord,data)
    
if __name__ == '__main__':
    try:
        opts, args = getopt.getopt(sys.argv[1:], "", ["combined=","suffixes="])
    except getopt.GetoptError, err:
        # print help information and exit:
        print str(err) # will print something like "option -a not recognized"
        sys.exit(2)
    
    infile = None
    suffixes = None
    rightsuffix = None
    
    for o, a in opts:
        if o=="--combined":
            infile = a
            print "combined", a
        elif o=="--suffixes":
            suffixes = a.split(",")
            print "Suffix", suffixes

    
    assert infile != None
    assert suffixes != None
    
    methfile = csv.reader(open(infile, "r"), delimiter="\t")
    
    files = {}
    
    for suffix in suffixes:
        percentage = csv.writer(open(infile+"-"+suffix+".percent.bedGraph","w"),delimiter='\t')
        coverage = csv.writer(open(infile+"-"+suffix+".coverage.bedGraph","w"),delimiter='\t')
        files[suffix]=(percentage,coverage)
    
    # ---
    # for each
    # coverage
    # percentage
    
    minReadsForPercentage = 5
    minSamplesWithCoverage = 1

    for line in methfile:
        try:
            (chrm,coord,data) = extractLine(line)
        except CommentException:
            continue
        
        percentages = []
        coverages = []
        
        for meth,unmeth in grouper(2,data):
            coverage = meth + unmeth
            if coverage == 0:
                percentage = 0
            else:           
                percentage = (float(meth)/float(meth+unmeth)) * 100.0
            percentages.append(percentage)
            coverages.append(coverage)
    
        satisfycoveragecriteria = 0
        for coverage in coverages:
            if coverage >= minReadsForPercentage:
                satisfycoveragecriteria += 1
        if satisfycoveragecriteria < minSamplesWithCoverage:
            continue

        for index,(percentage,coverage) in enumerate(izip_longest(percentages,coverages)):
            
            percentfile,coveragefile = files[suffixes[index]]
            percentfile.writerow([chrm,coord,coord+1,percentage])
            coveragefile.writerow([chrm,coord,coord+1,coverage])

Executed as:

python bisulphiteToBedGraph.py \
--combined "Samples.1.and.2.CpG_context.aggregate" \
--suffixes Sample1,Sample2,Sample3,Sample4

You can then use bedGraphToBigWig to create bigWig files suitable for UCSC custom track / track hubs.

I frequently display results within the Lab using the UCSC genome browser and I'm often asked how I do the "overlapping tracks" feature - specifically I'm usually asked for a minimal configuration that they can then extend.

It's something UCSC has been doing on their own tracks for some time (e.g. the Promotor/Enhancer Histones mark track) but was made available to 3rd parties reasonably recently through their track hub feature.

The first requirement is that you will have to have somewhere you can upload bigWig files to that UCSC can access via HTTP. You can generate bigWig files using UCSC apps bedGraphToBigWig or wigToBigWig.

Note that if you intend on displaying tracks related together they should be appropriately normalised to make comparisons make sense. If your data is already normalised (for example % methylation from Bisulfite sequencing) then this is fine. Otherwise, if you have a bedGraph file representing a pileup (number of reads covering each basepair) then something like the following should work to normalise by library size:

awk -v NUM_FRAGS=x 'BEGIN{OFS="\t"}{print $1,$2,$3,$4/(NUM_FRAGS/1000000)}' file.bedGraph > file.normalised.bedGraph

Where NUM_FRAGS=x is changed to the correct number of fragments in your dataset. At this point you can follow the instructions from UCSC to setup your first track hub. I will briefly summarise them here though.

There's a number of little files required to make it work right.. The file structure I use is something like this:

/hub.txt
/genomes.txt
/hg18/trackDb.txt
/hg18/project1.txt

Here hub.txt is the "root" of the hub and is the URL that you give to UCSC to initialise the hub and load it into the browser. The format is minimal and just describes the human readable description of your hub and the location of a "genomes" file.

/hub.txt

hub HubNameHere
shortLabel Label
longLabel Longer Label Here
genomesFile genomes.txt
email tony@mcbryan.co.uk

The genomes file is simply a list of each genome you want to be able to display data against. Valid options are any of the genomes that UCSC supports on the browser (or your mirror).

Minimally this must include one genome and link to the associated trackDB file which describes the tracks to be loaded.

/genomes.txt

genome hg18
trackDb hg18/trackDb.txt

UCSC uses trackDb files internally to format the display of tracks onto the genome. Track Hubs allow us to use a subset of the trackDb functionality to load our own tracks. UCSC has long had a Custom Tracks feature but this only ever allowed a very small fraction of the features of the trackDb while the Track Hubs allow us much more.
You can, if desired, simply keep all your track definitions in the single trackDb file but this will quickly get messy. Luckily we can import additional files.

/hg18/trackDb.txt

include project1.txt

While in project1.txt we can include only the track definitions that are relevant to that project.

/hg18/project1.txt

track SuperTrack
shortLabel Label
longLabel Long Label
superTrack on none
priority 1

track CompositeTrack
container multiWig
configurable on
shortLabel Label
longLabel Long Label
visibility hide
type bigWig 0 100
autoScale off
aggregate overlay
viewLimits 0:100
windowingFunction mean
superTrack SuperTrack full
showSubtrackColorOnUi on

track SubTrack1
type bigWig
shortLabel Label
longLabel Long Label
parent CompositeTrack
visibility full
bigDataUrl http://a.b.com/path/to/wigs/SubTrack1.bigWig
color 51,102,153

track SubTrack2
type bigWig
shortLabel Label
longLabel Long Label
parent CompositeTrack
visibility full
bigDataUrl http://user:pass@a.b.com/path/to/wigs/SubTrack2.bigWig
color 255,102,0

Therefore project1.txt is where the magic happens. We have a super track (named SuperTrack) which contains all other tracks. This gives us a single drop-down box on the UCSC browser which we can use to turn a whole projects tracks on or off at once.

This super track contains a single composite track which is of type multiWig. The multiWig is capable of overlaying multiple wig tracks on top of each other (using the aggregate option). To make this work we create regular bigWig tracks with their parent set to the composite track.

The bigDataUrl is the important one as that tells UCSC how to actually get to your data (hosted on a web server somewhere). Note that UCSC supports providing the URL along with username and password (as shown above in SubTrack2). This is compatible with usernames and passwords enforced by htaccess on Apache server.

You can fix the viewlimits on wig files (here fixed at 0 to 100 within the composite track. Note that settings on a parent track are inherited by their child tracks unless the child has a conflicting setting.

If you prefer you can have a dynamic range (may be appropriate for ChIP-Seq data etc) which will resize the view to the bounds of your data. If you use dynamic scaling you probably also want to include the alwaysZero attribute on the composite track and set it to "on" otherwise UCSC will scale from your lowest value to your highest value within the window.

You can then view the track hub by entering it on the UCSC Hub Connect page or by creating a link such as:

http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg18&hubUrl=http://user:pass@a.b.com/hub/hub.txt

You should be able to use the above as a reference point to building your own Track Hubs. For more information, and details on the other features you can use in Track Hubs, see UCSC's help pages.

McBryan's Musings

Monday, 8 July 2013

Bisulphite Sequencing Simple Visualisation (Multiple Samples)

Friday, 5 April 2013

Minimal UCSC Track Hubs