The G-Test and Python

The G-Test is a statistical test which can be used much like a chi-squared test. It is defined as: $$$$ G = 2 \sum_{i=0}^{r} \sum_{j=0}^{c} {O_{i,j} \ln(O_{i,j}/E_{i,j}) } $$$$ Where, $$$r$$$ and $$$c$$$ are the number of rows and columns in the contingency table and $$$O_{i,j}$$$ and $$$E_{i,j}$$$ are the Observed and Expected values for cell $$$i,j$$$ in the table.

To make a straightforward implementation for it in Python we first do a bit of math. For a 2x2 contingency table:

$$$ G = 2 \Bigg({O_{1,1} \ln(\frac{O_{1,1}}{E_{1,1}})} + {O_{1,2} \ln(\frac{O_{1,2}}{E_{1,2}})} + {O_{2,1} \ln(\frac{O_{2,1}}{E_{2,1}})} + {O_{2,2} \ln(\frac{O_{2,2}}{E_{2,2}})} \Bigg) $$$

Expand via log equivalences:

$$$ G = 2 \Bigg( $$$	$$$ \bigg( {O_{1,1} \ln(O_{1,1}) - O_{1,1} \ln(E_{1,1})} \bigg) + \bigg( {O_{1,2} \ln(O_{1,2}) - O_{1,2} \ln(E_{1,2})} \bigg) $$$
	$$$ + \bigg( {O_{2,1} \ln(O_{2,1}) - O_{2,1} \ln(E_{2,1})} \bigg) + \bigg( {O_{2,2} \ln(O_{2,2}) - O_{2,2} \ln(E_{2,2})} \bigg) \Bigg) $$$

Group Observed vs Expected together:

$$$ G = 2 \Bigg( $$$	$$$ \bigg( O_{1,1} \ln(O_{1,1}) + O_{1,2} \ln(O_{1,2}) + O_{2,1} \ln(O_{2,1}) + O_{2,2} \ln(O_{1,2}) \bigg)$$$
	$$$ - \bigg( O_{1,1} \ln(E_{1,1}) + O_{1,2} \ln(E_{1,2}) + O_{2,1} \ln(E_{2,1}) + O_{2,2} \ln(E_{2,2}) \Bigg) $$$

For a 2x2 contingency table; $$$E_{i,j} = \frac{T_{i} \cdot T_{j}}{T} $$$, where $$$T_{i}$$$ and $$$T_{j}$$$ are the totals for the row and column of cell $$$E_{i,j}$$$ and $$$T$$$ is the total for all cells in the table. Specifically, $$$$ \ln(E_{1,1}) = \ln \Bigg( \frac{ (O_{1,1}+O_{1,2}) \cdot (O_{1,1}+O_{2,1}) }{ O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2} } \Bigg) $$$$ $$$$ \ln(E_{1,2}) = \ln \Bigg( \frac{ (O_{1,1}+O_{1,2}) \cdot (O_{1,2}+O_{2,2}) }{ O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2} } \Bigg) $$$$ $$$$ \ln(E_{2,1}) = \ln \Bigg( \frac{ (O_{2,1}+O_{2,2}) \cdot (O_{1,1}+O_{2,1}) }{ O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2} } \Bigg) $$$$ $$$$ \ln(E_{2,2}) = \ln \Bigg( \frac{ (O_{2,1}+O_{2,2}) \cdot (O_{1,2}+O_{2,2}) }{ O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2} } \Bigg) $$$$
We then substitute these into our equation for G.

$$$ G = 2 \Bigg( $$$	$$$\bigg( O_{1,1} \ln(O_{1,1}) + O_{1,2} \ln(O_{1,2}) + O_{2,1} \ln(O_{2,1}) + O_{2,2} \ln(O_{1,2}) \bigg) - $$$
	$$$ \bigg( O_{1,1} \ln \Big( \frac{ (O_{1,1}+O_{1,2}) \cdot (O_{1,1}+O_{2,1}) }{ O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2} } \Big) + O_{1,2} \ln \Big( \frac{ (O_{1,1}+O_{1,2}) \cdot (O_{1,2}+O_{2,2}) }{ O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2} } \Big) + $$$
	$$$ O_{2,1} \ln \Big( \frac{ (O_{2,1}+O_{2,2}) \cdot (O_{1,1}+O_{2,1}) }{ O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2} } \Big) + O_{2,2} \ln \Big( \frac{ (O_{2,1}+O_{2,2}) \cdot (O_{1,2}+O_{2,2}) }{ O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2} } \Big) \bigg) \Bigg) $$$

Another log equivalence gives us this:

$$$ G = 2 \Bigg( $$$	$$$\bigg( O_{1,1} \ln(O_{1,1}) + O_{1,2} \ln(O_{1,2}) + O_{2,1} \ln(O_{2,1}) + O_{2,2} \ln(O_{1,2}) \bigg) - $$$
	$$$ \bigg( O_{1,1} \ln ( (O_{1,1}+O_{1,2}) \cdot (O_{1,1}+O_{2,1}) ) - O_{1,1} \ln ( O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2} ) + $$$
	$$$ O_{1,2} \ln ( (O_{1,1}+O_{1,2}) \cdot (O_{1,2}+O_{2,2}) ) - O_{1,2} \ln ( O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2} ) + $$$
	$$$ O_{2,1} \ln ( (O_{2,1}+O_{2,2}) \cdot (O_{1,1}+O_{2,1}) ) - O_{2,1} \ln ( O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2} ) + $$$
	$$$ O_{2,2} \ln ( (O_{2,1}+O_{2,2}) \cdot (O_{1,2}+O_{2,2}) ) - O_{2,2} \ln ( O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2} ) \bigg) \Bigg) $$$

Group common factor of $$$\ln ( O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2} )$$$:

$$$ G = 2 \Bigg( $$$	$$$ \bigg( O_{1,1} \ln(O_{1,1}) + O_{1,2} \ln(O_{1,2}) + O_{2,1} \ln(O_{2,1}) + O_{2,2} \ln(O_{1,2}) \bigg) +$$$
	$$$ \bigg( (O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2}) \cdot \ln(O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2}) \bigg) - $$$
	$$$ \bigg( O_{1,1} \ln \big( (O_{1,1}+O_{1,2}) \cdot (O_{1,1}+O_{2,1}) \big) + O_{1,2} \ln \big( (O_{1,1}+O_{1,2}) \cdot (O_{1,2}+O_{2,2}) \big) + $$$
	$$$ O_{2,1} \ln \big( (O_{2,1}+O_{2,2}) \cdot (O_{1,1}+O_{2,1}) \big) + O_{2,2} \ln \big( (O_{2,1}+O_{2,2}) \cdot (O_{1,2}+O_{2,2}) \big) \bigg) \Bigg) $$$

More log equivalences:

$$$ G = 2 \Bigg( $$$	$$$ \bigg( O_{1,1} \ln(O_{1,1}) + O_{1,2} \ln(O_{1,2}) + O_{2,1} \ln(O_{2,1}) + O_{2,2} \ln(O_{1,2}) \bigg) + $$$
	$$$ \bigg( (O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2}) \cdot \ln(O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2}) \bigg) -$$$
	$$$ \bigg( O_{1,1} \ln (O_{1,1}+O_{1,2}) + O_{1,1} \ln (O_{1,1}+O_{2,1}) + O_{1,2} \ln (O_{1,1}+O_{1,2}) + O_{1,2} \ln (O_{1,2}+O_{2,2}) + $$$
	$$$ O_{2,1} \ln (O_{2,1}+O_{2,2}) + O_{2,1} \ln (O_{1,1}+O_{2,1}) + O_{2,2} \ln (O_{2,1}+O_{2,2}) + O_{2,2} \ln (O_{1,2}+O_{2,2}) \bigg) \Bigg) $$$

Group common factors of $$$\ln (O_{1,1}+O_{1,2})$$$, $$$\ln (O_{1,1}+O_{2,1})$$$ etc. Yielding:

$$$ G = 2 \Bigg( $$$	$$$ \bigg( O_{1,1} \ln(O_{1,1}) + O_{1,2} \ln(O_{1,2}) + O_{2,1} \ln(O_{2,1}) + O_{2,2} \ln(O_{1,2}) \bigg) $$$
	$$$ + \bigg( (O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2}) \cdot \ln(O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2}) \bigg) $$$
	$$$ - \bigg( (O_{1,1} + O_{1,2}) \ln (O_{1,1}+O_{1,2}) + (O_{2,1} + O_{2,2}) \ln (O_{2,1}+O_{2,2}) \bigg) $$$
	$$$ - \bigg( (O_{1,1} + O_{2,1}) \ln (O_{1,1}+O_{2,1}) + (O_{1,2} + O_{2,2}) \ln (O_{1,2}+O_{2,2}) \bigg) \Bigg) $$$

We can then summarise the four parts of the above equation as follows:
$$$$ celltotals = O_{1,1} \ln(O_{1,1}) + O_{1,2} \ln(O_{1,2}) + O_{2,1} \ln(O_{2,1}) + O_{2,2} \ln(O_{1,2}) $$$$ $$$$ total = (O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2}) \ln(O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2}) $$$$ $$$$ rowtotals = (O_{1,1} + O_{1,2}) \ln (O_{1,1}+O_{1,2}) + (O_{2,1} + O_{2,2}) \ln (O_{2,1}+O_{2,2}) $$$$ $$$$ columntotals = (O_{1,1} + O_{2,1}) \ln (O_{1,1}+O_{2,1}) + (O_{1,2} + O_{2,2}) \ln (O_{1,2}+O_{2,2}) $$$$ where: $$$$ G = 2 \big(celltotals + total - (rowtotals + columntotals) \big) $$$$ If we assign $$$a = O_{1,1}, b = O_{1,2}, c = O_{2,1}, d = O_{2,2}$$$ then we can then implement as follows in Python:

import math

def flnf(f):
    return f*math.log(f) if f>0.5 else 0

def gtest(a,b,c,d):
    row1 = a+b
    row2 = c+d
    col1 = a+c
    col2 = b+d
    
    total = flnf(a+b+c+d)
    celltotals = flnf(a)+flnf(b)+flnf(c)+flnf(d)
    rowtotals = flnf(row1)+flnf(row2)
    coltotals = flnf(col1)+flnf(col2)
    
    return 2*(celltotals + total-(rowtotals+coltotals))

We then probably want to convert our G-value into a P-value. The G-test approximates a chi squared distribution and has the same number of degrees of freedom.

If the calculated chi-squared distribution cumulative distribution function is equal to exactly 1 (i.e. a p-value of exactly 0) then we increase the precision and try again. This means that we can find the correct order of magnitude for our p-value (which is good for when we FDR correct these p-values later on) while minimising the number of calculations we need to do (the majority of conversions don't need 4000 decimal places computed). ($$$g$$$ is the G-value and $$$df$$$ is the degrees of freedom - 1 for a 2x2 table).

import mpmath

def mpmathcdf(g,df,dps=10):
    mpmath.mp.dps = dps
    
    x,k = mpmath.mpf(g), mpmath.mpf(df)
    cdf = mpmath.gammainc(k/2, 0, x/2, regularized=True)
    
    # floating point precision insufficient, use more precision
    if cdf == 1.0:
        if dps > 4000:
            return cdf # give up after a certain point
        else:
            cdf = mpmathcdf(g,df,dps*2)
    return cdf

def GtoP(g,df):
    assert g >= 0, g
    return float(1-mpmathcdf(g,df))

Update:

Following on from the comments below I've written up a generalised version of the G-test for any sized contingency table which uses the trick from above. I've not proven explicitly that this is necessarily generalisable for any sized table but I'm pretty sure it is.

def gtest(a):
    rowtotals = []
    coltotals = [0]*len(a[0])
    cells = []
    for row in a:
        rowtotals.append(sum(row))
        for index,col in enumerate(row):
            coltotals[index] += col
            cells.append(col)
    return 2*(math.fsum([flnf(x) for x in cells]) + flnf(sum(cells)) - ( math.fsum([flnf(x) for x in rowtotals]) + math.fsum([flnf(x) for x in coltotals]) ))

For a table with 2 columns and 3 rows this can then be called as follows:

gtest([[268,807],
       [199,759],
       [42,184]])

> 7.300817075862142

Again, because I've not proven this will always work it should, of course, be used at your own risk. That said, the same technique is used for up to 10 columns and 50 rows in the G-test spreadsheet at the Handbook for Biological Statistics so it's probably safe at least that far.

5 comments:

Anonymous8 October 2013 at 12:20
Hi,

thanks for this code. Could you provide an exmaple using your solution? I am kind of confused how to actually calculate G-value and P-value for my data.
McBryan8 October 2013 at 12:37
Hi,

I've used the G-Test for calculating the significance of methylation changes in whole genome sequencing experiments (http://blog.mcbryan.co.uk/2013/07/the-g-test-and-bisulphite-sequencing.html).

That link is a specific (although complicated) example on how you might use the G-Test in practice.

If you already have a 2x2 contingency table generated in some python code of your own you should be able to just paste the two code samples on this page into your script and call gtest() to return you a G-value followed by GtoP() to convert the G-value to a p-value.

Tony
bluekeybox10 February 2014 at 03:45
Thanks. Here is a quick-and-dirty Chi-square computation for independence of two samples (for 2xN contingency tables where N is number of rows), after http://www.itl.nist.gov/div898/software/dataplot/refman1/auxillar/chi2samp.htm

Although G-test is preferable in most cases, I would argue that the way Chi-square statistic here handles zero entries is a bit neater. Apologies for the formatting, not sure how to enter pre-formatted text into these comments.

def chisqtest2(a):
"""Compute Chi-square statistic for independence of two samples"""
def compute_row(row):
row_total = math.fsum(row)
return (K1 * row[0] - K2 * row[1]) ** 2 / row_total \
if row_total > 0 \
else 0.0

sum_r = math.fsum(row[0] for row in a)
sum_s = math.fsum(row[1] for row in a)
K1 = math.sqrt(sum_s / sum_r)
K2 = math.sqrt(sum_r / sum_s)
return math.fsum(map(compute_row, a))

Note: only a member of this blog may post a comment.

McBryan's Musings

Tuesday, 9 July 2013

The G-Test and Python

5 comments: