Tuesday, 9 July 2013

The G-Test and Python

The G-Test is a statistical test which can be used much like a chi-squared test. It is defined as: $$G = 2 \sum_{i=0}^{r} \sum_{j=0}^{c} {O_{i,j} \ln(O_{i,j}/E_{i,j}) }$$ Where, $r$ and $c$ are the number of rows and columns in the contingency table and $O_{i,j}$ and $E_{i,j}$ are the Observed and Expected values for cell $i,j$ in the table.

To make a straightforward implementation for it in Python we first do a bit of math. For a 2x2 contingency table:

$G = 2 \Bigg({O_{1,1} \ln(\frac{O_{1,1}}{E_{1,1}})} + {O_{1,2} \ln(\frac{O_{1,2}}{E_{1,2}})} + {O_{2,1} \ln(\frac{O_{2,1}}{E_{2,1}})} + {O_{2,2} \ln(\frac{O_{2,2}}{E_{2,2}})} \Bigg)$

Expand via log equivalences:

 $G = 2 \Bigg($ $\bigg( {O_{1,1} \ln(O_{1,1}) - O_{1,1} \ln(E_{1,1})} \bigg) + \bigg( {O_{1,2} \ln(O_{1,2}) - O_{1,2} \ln(E_{1,2})} \bigg)$ $+ \bigg( {O_{2,1} \ln(O_{2,1}) - O_{2,1} \ln(E_{2,1})} \bigg) + \bigg( {O_{2,2} \ln(O_{2,2}) - O_{2,2} \ln(E_{2,2})} \bigg) \Bigg)$

Group Observed vs Expected together:

 $G = 2 \Bigg($ $\bigg( O_{1,1} \ln(O_{1,1}) + O_{1,2} \ln(O_{1,2}) + O_{2,1} \ln(O_{2,1}) + O_{2,2} \ln(O_{1,2}) \bigg)$ $- \bigg( O_{1,1} \ln(E_{1,1}) + O_{1,2} \ln(E_{1,2}) + O_{2,1} \ln(E_{2,1}) + O_{2,2} \ln(E_{2,2}) \Bigg)$

For a 2x2 contingency table; $E_{i,j} = \frac{T_{i} \cdot T_{j}}{T}$, where $T_{i}$ and $T_{j}$ are the totals for the row and column of cell $E_{i,j}$ and $T$ is the total for all cells in the table. Specifically, $$\ln(E_{1,1}) = \ln \Bigg( \frac{ (O_{1,1}+O_{1,2}) \cdot (O_{1,1}+O_{2,1}) }{ O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2} } \Bigg)$$ $$\ln(E_{1,2}) = \ln \Bigg( \frac{ (O_{1,1}+O_{1,2}) \cdot (O_{1,2}+O_{2,2}) }{ O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2} } \Bigg)$$ $$\ln(E_{2,1}) = \ln \Bigg( \frac{ (O_{2,1}+O_{2,2}) \cdot (O_{1,1}+O_{2,1}) }{ O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2} } \Bigg)$$ $$\ln(E_{2,2}) = \ln \Bigg( \frac{ (O_{2,1}+O_{2,2}) \cdot (O_{1,2}+O_{2,2}) }{ O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2} } \Bigg)$$
We then substitute these into our equation for G.
 $G = 2 \Bigg($ $\bigg( O_{1,1} \ln(O_{1,1}) + O_{1,2} \ln(O_{1,2}) + O_{2,1} \ln(O_{2,1}) + O_{2,2} \ln(O_{1,2}) \bigg) -$ $\bigg( O_{1,1} \ln \Big( \frac{ (O_{1,1}+O_{1,2}) \cdot (O_{1,1}+O_{2,1}) }{ O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2} } \Big) + O_{1,2} \ln \Big( \frac{ (O_{1,1}+O_{1,2}) \cdot (O_{1,2}+O_{2,2}) }{ O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2} } \Big) +$ $O_{2,1} \ln \Big( \frac{ (O_{2,1}+O_{2,2}) \cdot (O_{1,1}+O_{2,1}) }{ O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2} } \Big) + O_{2,2} \ln \Big( \frac{ (O_{2,1}+O_{2,2}) \cdot (O_{1,2}+O_{2,2}) }{ O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2} } \Big) \bigg) \Bigg)$

Another log equivalence gives us this:
 $G = 2 \Bigg($ $\bigg( O_{1,1} \ln(O_{1,1}) + O_{1,2} \ln(O_{1,2}) + O_{2,1} \ln(O_{2,1}) + O_{2,2} \ln(O_{1,2}) \bigg) -$ $\bigg( O_{1,1} \ln ( (O_{1,1}+O_{1,2}) \cdot (O_{1,1}+O_{2,1}) ) - O_{1,1} \ln ( O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2} ) +$ $O_{1,2} \ln ( (O_{1,1}+O_{1,2}) \cdot (O_{1,2}+O_{2,2}) ) - O_{1,2} \ln ( O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2} ) +$ $O_{2,1} \ln ( (O_{2,1}+O_{2,2}) \cdot (O_{1,1}+O_{2,1}) ) - O_{2,1} \ln ( O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2} ) +$ $O_{2,2} \ln ( (O_{2,1}+O_{2,2}) \cdot (O_{1,2}+O_{2,2}) ) - O_{2,2} \ln ( O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2} ) \bigg) \Bigg)$

Group common factor of $\ln ( O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2} )$:
 $G = 2 \Bigg($ $\bigg( O_{1,1} \ln(O_{1,1}) + O_{1,2} \ln(O_{1,2}) + O_{2,1} \ln(O_{2,1}) + O_{2,2} \ln(O_{1,2}) \bigg) +$ $\bigg( (O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2}) \cdot \ln(O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2}) \bigg) -$ $\bigg( O_{1,1} \ln \big( (O_{1,1}+O_{1,2}) \cdot (O_{1,1}+O_{2,1}) \big) + O_{1,2} \ln \big( (O_{1,1}+O_{1,2}) \cdot (O_{1,2}+O_{2,2}) \big) +$ $O_{2,1} \ln \big( (O_{2,1}+O_{2,2}) \cdot (O_{1,1}+O_{2,1}) \big) + O_{2,2} \ln \big( (O_{2,1}+O_{2,2}) \cdot (O_{1,2}+O_{2,2}) \big) \bigg) \Bigg)$

More log equivalences:
 $G = 2 \Bigg($ $\bigg( O_{1,1} \ln(O_{1,1}) + O_{1,2} \ln(O_{1,2}) + O_{2,1} \ln(O_{2,1}) + O_{2,2} \ln(O_{1,2}) \bigg) +$ $\bigg( (O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2}) \cdot \ln(O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2}) \bigg) -$ $\bigg( O_{1,1} \ln (O_{1,1}+O_{1,2}) + O_{1,1} \ln (O_{1,1}+O_{2,1}) + O_{1,2} \ln (O_{1,1}+O_{1,2}) + O_{1,2} \ln (O_{1,2}+O_{2,2}) +$ $O_{2,1} \ln (O_{2,1}+O_{2,2}) + O_{2,1} \ln (O_{1,1}+O_{2,1}) + O_{2,2} \ln (O_{2,1}+O_{2,2}) + O_{2,2} \ln (O_{1,2}+O_{2,2}) \bigg) \Bigg)$

Group common factors of $\ln (O_{1,1}+O_{1,2})$, $\ln (O_{1,1}+O_{2,1})$ etc. Yielding:
 $G = 2 \Bigg($ $\bigg( O_{1,1} \ln(O_{1,1}) + O_{1,2} \ln(O_{1,2}) + O_{2,1} \ln(O_{2,1}) + O_{2,2} \ln(O_{1,2}) \bigg)$ $+ \bigg( (O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2}) \cdot \ln(O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2}) \bigg)$ $- \bigg( (O_{1,1} + O_{1,2}) \ln (O_{1,1}+O_{1,2}) + (O_{2,1} + O_{2,2}) \ln (O_{2,1}+O_{2,2}) \bigg)$ $- \bigg( (O_{1,1} + O_{2,1}) \ln (O_{1,1}+O_{2,1}) + (O_{1,2} + O_{2,2}) \ln (O_{1,2}+O_{2,2}) \bigg) \Bigg)$

We can then summarise the four parts of the above equation as follows:
$$celltotals = O_{1,1} \ln(O_{1,1}) + O_{1,2} \ln(O_{1,2}) + O_{2,1} \ln(O_{2,1}) + O_{2,2} \ln(O_{1,2})$$ $$total = (O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2}) \ln(O_{1,1} + O_{1,2} + O_{2,1} + O_{2,2})$$ $$rowtotals = (O_{1,1} + O_{1,2}) \ln (O_{1,1}+O_{1,2}) + (O_{2,1} + O_{2,2}) \ln (O_{2,1}+O_{2,2})$$ $$columntotals = (O_{1,1} + O_{2,1}) \ln (O_{1,1}+O_{2,1}) + (O_{1,2} + O_{2,2}) \ln (O_{1,2}+O_{2,2})$$ where: $$G = 2 \big(celltotals + total - (rowtotals + columntotals) \big)$$ If we assign $a = O_{1,1}, b = O_{1,2}, c = O_{2,1}, d = O_{2,2}$ then we can then implement as follows in Python:

import math

def flnf(f):
return f*math.log(f) if f>0.5 else 0

def gtest(a,b,c,d):
row1 = a+b
row2 = c+d
col1 = a+c
col2 = b+d

total = flnf(a+b+c+d)
celltotals = flnf(a)+flnf(b)+flnf(c)+flnf(d)
rowtotals = flnf(row1)+flnf(row2)
coltotals = flnf(col1)+flnf(col2)

return 2*(celltotals + total-(rowtotals+coltotals))


We then probably want to convert our G-value into a P-value. The G-test approximates a chi squared distribution and has the same number of degrees of freedom.

If the calculated chi-squared distribution cumulative distribution function is equal to exactly 1 (i.e. a p-value of exactly 0) then we increase the precision and try again. This means that we can find the correct order of magnitude for our p-value (which is good for when we FDR correct these p-values later on) while minimising the number of calculations we need to do (the majority of conversions don't need 4000 decimal places computed). ($g$ is the G-value and $df$ is the degrees of freedom - 1 for a 2x2 table).

import mpmath

def mpmathcdf(g,df,dps=10):
mpmath.mp.dps = dps

x,k = mpmath.mpf(g), mpmath.mpf(df)
cdf = mpmath.gammainc(k/2, 0, x/2, regularized=True)

# floating point precision insufficient, use more precision
if cdf == 1.0:
if dps > 4000:
return cdf # give up after a certain point
else:
cdf = mpmathcdf(g,df,dps*2)
return cdf

def GtoP(g,df):
assert g >= 0, g
return float(1-mpmathcdf(g,df))


Update:

Following on from the comments below I've written up a generalised version of the G-test for any sized contingency table which uses the trick from above. I've not proven explicitly that this is necessarily generalisable for any sized table but I'm pretty sure it is.
def gtest(a):
rowtotals = []
coltotals = [0]*len(a[0])
cells = []
for row in a:
rowtotals.append(sum(row))
for index,col in enumerate(row):
coltotals[index] += col
cells.append(col)
return 2*(math.fsum([flnf(x) for x in cells]) + flnf(sum(cells)) - ( math.fsum([flnf(x) for x in rowtotals]) + math.fsum([flnf(x) for x in coltotals]) ))


For a table with 2 columns and 3 rows this can then be called as follows:
gtest([[268,807],
[199,759],
[42,184]])

> 7.300817075862142


Again, because I've not proven this will always work it should, of course, be used at your own risk. That said, the same technique is used for up to 10 columns and 50 rows in the G-test spreadsheet at the Handbook for Biological Statistics so it's probably safe at least that far.

1. Hi,

thanks for this code. Could you provide an exmaple using your solution? I am kind of confused how to actually calculate G-value and P-value for my data.

2. Hi,

I've used the G-Test for calculating the significance of methylation changes in whole genome sequencing experiments (http://blog.mcbryan.co.uk/2013/07/the-g-test-and-bisulphite-sequencing.html).

That link is a specific (although complicated) example on how you might use the G-Test in practice.

If you already have a 2x2 contingency table generated in some python code of your own you should be able to just paste the two code samples on this page into your script and call gtest() to return you a G-value followed by GtoP() to convert the G-value to a p-value.

Tony

1. Hi Tony,

So using a contingency table sized other than 2x2 (e.g. 5 columns and 12 rows) would require a different implementation, right?

Nikita

2. Hi,

I've only shown the proof for a 2x2 table above. I think, but have not proven explicitly, that the same trick can be applied to contingency tables of any size.

i.e. G = 2 * (celltotals + total - (rowtotals + columntotals) ) for any sized contingency table.

If you only have a small number of tables to do it might be easier to use a spreadsheet such as the one at the Handbook of Biological Statistics.

I notice they use exactly the same approach for a larger table in their spreadsheet so I'm obviously not the first to notice this pattern.

I've updated the post above to include a gtest function which should work for any sized contingency table if my assumption proves to be correct. Note I've not tested this latest function in any particularly robust way and it is provided entirely as is and without any warranty as to correctness.

It does seem to agree with likelihood.test from the Deducer package in R for the few samples I've tested it with.

Tony

3. Thanks. Here is a quick-and-dirty Chi-square computation for independence of two samples (for 2xN contingency tables where N is number of rows), after http://www.itl.nist.gov/div898/software/dataplot/refman1/auxillar/chi2samp.htm

Although G-test is preferable in most cases, I would argue that the way Chi-square statistic here handles zero entries is a bit neater. Apologies for the formatting, not sure how to enter pre-formatted text into these comments.

def chisqtest2(a):
"""Compute Chi-square statistic for independence of two samples"""
def compute_row(row):
row_total = math.fsum(row)
return (K1 * row[0] - K2 * row[1]) ** 2 / row_total \
if row_total > 0 \
else 0.0

sum_r = math.fsum(row[0] for row in a)
sum_s = math.fsum(row[1] for row in a)
K1 = math.sqrt(sum_s / sum_r)
K2 = math.sqrt(sum_r / sum_s)
return math.fsum(map(compute_row, a))