Binomial algorithm snippets

Introduction

Elsewhere on this blog I summarise an analysis of the performance of a broad range of different confidence interval calculations and 2 × 2 contingency tests against equivalent ‘exact’ Binomial tests calculated from first principles.

For transparency, it is necessary to show how I went about computing these results.

Many of these algorithms are summarised in mathematical terms in this paper. However, for those who wish to recreate the computation, here is the code in the programming language C.

Warning: colleagues have pointed out that this post is not for the faint-hearted!

Binomial test

Precalculated

  • long double array fact[n] contains factorial of n, written n!

A ‘long double’ is typically an 80-bit extended-precision floating point number (the exact width depends on the compiler and platform), which can express very large numbers. Since the factorial of n is the product of the series 1…n, it gets very large (see below). Even so, a long double can only hold factorials up to roughly n = 1,750; the code caps the table at n = 1,500 (the LIMIT constant), which is more than enough. A mere ‘double’ is a 64-bit floating point number, which overflows at 171!, and a ‘float’ a 32-bit one.
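The fact[] table itself is assumed to be precalculated. Here is a minimal sketch of how it might be filled in, assuming LIMIT is defined as 1,500 (the function name initFact is my own illustration, not from the paper):

#define LIMIT 1500

long double fact[LIMIT+1];

void initFact(void)   /* fill fact[] so that fact[n] = n!; call once at startup */
{
   int n;
   fact[0] = 1.0L;
   for (n = 1; n <= LIMIT; n++)
      fact[n] = fact[n-1] * n;
}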

Core functions

  • combinatorial function nCr
  • single Binomial probability BinP(p, n, x)
  • cumulative Binomial probability CumBinP(p, n, x)
  • population cumulative Binomial probability popCumBinP(n, x)
long double nCr(int n, int r)
{
   if (n>LIMIT) return -1;
   return fact[n]/(fact[r]*fact[n-r]);
}

In other words, provided n is not too large, define the combinatorial function nCr (see footnote 5 in the Binomial paper) as follows:

nCr = n! / (r! (n − r)!)

This is a fraction of two very large numbers, so we tell C to use long double arithmetic to be as accurate as possible. Bear in mind that for this paper we need to deal with n = 500:

500! = fact[500] = 1.22013682599111007 × 10¹¹³⁴.
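As a quick sanity check (my own example, not from the paper), nCr(5, 2) should print 10, the number of ways of choosing 2 items from 5:

#include <stdio.h>

int main(void)
{
   initFact();                     /* fill the factorial table (sketch above) */
   printf("%.0Lf\n", nCr(5, 2));   /* prints 10 */
   return 0;
}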

Other functions are defined similarly.

Here are the single Binomial probability, equation (5), and the cumulative Binomial probability, equation (6). powl is the long double power function from math.h.

long double BinP(float p, int n, int x)
{
   if (p == 1) return 0;   /* guard the degenerate case p = 1 */
   return nCr(n, x) * powl(p,x) * powl(1-p,n-x);
}
long double CumBinP(float p, int n, int x)
{
   register int r;
   register long double c = 0;
   for (r = x; r<=n; r++)
      c += BinP(p, n, r);
   return c;
}

Although this code can seem forbidding, in fact it is just another way of writing the mathematical formulae in the paper. Instead of using the summation sign in the above, we perform an explicit loop where r increases from x to n, and add up the result. The code ‘r++’ simply means add one to r.
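For instance (my own check, not a figure from the paper), with p = 0.5, n = 10 and x = 8 the loop adds the probabilities for r = 8, 9 and 10, i.e. (45 + 10 + 1)/1024 = 0.0546875:

#include <stdio.h>

int main(void)
{
   initFact();   /* fill the factorial table (sketch above) */
   /* upper tail of Binomial(n = 10, p = 0.5) from x = 8:
      (45 + 10 + 1)/1024 = 0.0546875 */
   printf("%.7Lf\n", CumBinP(0.5, 10, 8));
   return 0;
}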

Finally, here is one last function, popCumBinP, which computes the population cumulative Binomial. I include it because we need a way of calculating the inverse of equation (6). It simply adds up single Binomial probabilities until the total exceeds Pcrit = α/2.

float popCumBinP(int n, int x)
{
   register int r;
   register long double c = 0;   /* running cumulative probability */
   float p = x/(float)n;         /* the observed proportion */
   for (r=0; r<x; r++)
   {
      c += BinP(p, n, r);
      if (c > Pcrit)             /* Pcrit = alpha/2, a global */
          return (r-1)/(float)n;
   }
   return (r-1)/(float)n;
}

Find Binomial P

Input

  • p = observed o / total n.

This function requires a bit more explanation. It performs a simple search to find P for an exact tail area Pcrit = α/2. The paper refers in a number of places to “a search procedure”, so I thought I should give an example of what such a procedure might look like.

The algorithm is accurate to around 4 decimal places (it continues while the absolute difference d > 0.00001), with a for loop capping the search at 1,000 iterations as a stopping condition.

It works by picking a value for P and testing it. If the result is too high, it tries a lower value of P; if too low, a higher one. Provided the function we are searching is monotonic (always increasing or always decreasing), this type of procedure is guaranteed to zoom in on the correct value if you keep halving the step size.

Note that this process is quite computationally expensive, calling the CumBinP function potentially hundreds of times with different values of P until it converges on the answer. It is not the optimum approach, since the gradient could be used to improve each estimate of P and increase the rate of convergence, but it is robust, which is more important here.

float findBinP(int n, int o)
{
   float p2 = o/(float)n, p = p2/1.5;
   long double a, d, a2, a3, d2, d3, pdiff;
   int i;
   for (i=0;i<1000;i++)
   {
      a = CumBinP(p, n, o);
      d = fabsl(a-Pcrit);
      if (d>0.00001) // accurate to 4 dps
      {
         if (p+p2>1)
            d2 = 1;
         else
         {
            a2 = CumBinP(p+p2, n, o);
            d2 = fabsl(a2-Pcrit);
         }
         if (p<p2)
            d3 = 1;
         else
         {
            a3 = CumBinP(p-p2, n, o);
            d3 = fabsl(a3-Pcrit);
         }
         pdiff = p2;
         if (d3 < d2)
         {
            pdiff=-p2;
            d2 = d3;
         }
         if (d2 < d)
            p+=pdiff;
         else
            p2 = p2/2;
      }
      else return p;
   }
   return p;
}
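By way of illustration (my own example, with numbers not taken from the paper): to find the lower bound of a 95% interval on 10 observed out of 50, set the global Pcrit to α/2 and search for the P whose upper tail at x = 10 equals it, assuming the functions and factorial table above are in scope:

#include <stdio.h>

float Pcrit;   /* global tail area assumed by the functions above */

int main(void)
{
   initFact();                      /* factorial table (sketch above) */
   Pcrit = 0.05/2;                  /* alpha = 0.05, two-tailed */
   /* the P whose upper tail at x = 10, n = 50 equals Pcrit,
      i.e. the lower bound of the interval */
   printf("lower bound = %f\n", findBinP(50, 10));
   return 0;
}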

Fisher sum test

Core functions

  • Fisher point probability Fisher(a, b, c, d)
  • Fisher sum probability FisherSum(a, b, c, d)

These are equations (13) and (14) in the paper. I include them because they are two of the trickier calculations to perform.

long double Fisher(int a, int b, int c, int d)
{
   return (fact[a+c]/fact[a]) *   // arranged like this 
          (fact[b+d]/fact[b]) *   // to minimise overflow error
          (fact[a+b]/fact[c]) *
          (fact[c+d]/fact[d]) / fact[a+b+c+d];
}

The FisherSum function (equation 14) requires some explanation.

Take a 2 × 2 matrix with cells a, b, c and d:

a b
c d

The instruction is to take the Fisher point probability for that pattern and add to it the Fisher probabilities of each more extreme version of the matrix, noting that all the time the marginal totals a+b, c+d, a+c and b+d must stay constant. Consider the matrix below, where every row and column total equals 20.

5 15
15 5

A “more extreme” version of this matrix which keeps all row and column totals at 20 is the following.

4 16
16 4

This is further away from the flat matrix (all cells equal to 10), but the totals are still locked to 20. So, the way to enumerate all versions of the matrix which are more extreme is to traverse the matrix diagonally, either by increasing a and d and decreasing b and c, or the opposite (as above), until no other combination is possible. We keep adding individual Fisher probabilities until we hit the limit below.

0 20
20 0

To make this clearer, we can define row totals n₁ = a+b and n₂ = c+d.

float FisherSum(int a, int b, int c, int d)
{
   double p = 0;
   int n1 = a+b, n2 = c+d;

   if (a/(float)n1 > c/(float)n2)
      for (;(a<=n1)&&(c>=0);a++,c--)   /* step towards larger a, smaller c */
         p+=Fisher(a,n1-a,c,n2-c);
   else
      for (;(a>=0)&&(c<=n2);a--,c++)   /* step towards smaller a, larger c */
         p+=Fisher(a,n1-a,c,n2-c);

   success = p<=Pcrit;   /* global flag: significant at the Pcrit criterion */
   return p;
}
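To connect this back to the example above (my own illustration, assuming the functions and globals defined earlier are in scope): starting from the 5, 15 / 15, 5 table, FisherSum steps through 4, 16 / 16, 4 and so on down to 0, 20 / 20, 0, adding the point probability of each:

#include <stdio.h>

int main(void)
{
   initFact();       /* factorial table (sketch above) */
   Pcrit = 0.05/2;
   /* sums the point probabilities of the tables
      5,15/15,5  4,16/16,4  ...  0,20/20,0 */
   printf("Fisher tail = %f\n", FisherSum(5, 15, 15, 5));
   return 0;
}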
