Binomial confidence intervals and contingency tests

Abstract

Many statistical methods rely on an underlying mathematical model of probability which is based on a simple approximation, one that is simultaneously well-known and yet frequently poorly understood.

This approximation is the Normal approximation to the Binomial distribution, and it underpins a range of statistical tests and methods, including the calculation of accurate confidence intervals, goodness of fit and contingency tests, line- and model-fitting, and computational methods based upon these. What these methods have in common is the assumption that the distribution of error about an observation is Normal.

The assumption allows us to construct simpler methods than would otherwise be possible. However, this assumption is fundamentally flawed.

This paper is divided into two parts: fundamentals and evaluation. First, we examine the estimation of error using three approaches: the ‘Wald’ (Normal) interval, the Wilson score interval and the ‘exact’ Clopper-Pearson Binomial interval. Whereas the first two can be calculated directly from formulae, the Binomial interval must be approximated by computational search, and is computationally expensive. However, this interval provides the most precise significance test, and will therefore form the baseline for our later evaluations.
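The three intervals can be sketched in a few lines of Python. This is an illustrative sketch rather than the paper's own code: the function names are invented here, the Wald and Wilson intervals use their standard textbook formulae, and the Clopper-Pearson bounds are found by a simple bisection search on the Binomial tail probabilities, which is one of several possible search strategies.

```python
import math
from statistics import NormalDist


def z_critical(alpha=0.05):
    """Two-tailed critical value of the standard Normal distribution."""
    return NormalDist().inv_cdf(1 - alpha / 2)


def wald_interval(p, n, alpha=0.05):
    """'Wald' (Normal) interval: p +/- z * sqrt(p(1 - p) / n)."""
    e = z_critical(alpha) * math.sqrt(p * (1 - p) / n)
    return p - e, p + e


def wilson_interval(p, n, alpha=0.05):
    """Wilson score interval, computed directly from its formula."""
    z = z_critical(alpha)
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    e = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - e, centre + e


def binom_cdf(k, n, P):
    """Probability of k or fewer successes in n trials with success rate P."""
    return sum(math.comb(n, i) * P**i * (1 - P) ** (n - i) for i in range(k + 1))


def bisect(f, lo=0.0, hi=1.0, tol=1e-10):
    """Root of a monotonic function f on [lo, hi], found by repeated halving."""
    f_lo_positive = f(lo) > 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) > 0) == f_lo_positive:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson_interval(x, n, alpha=0.05):
    """'Exact' Binomial interval for x successes out of n, found by search."""
    # Lower bound: the P at which observing x or more has probability alpha / 2.
    lower = 0.0 if x == 0 else bisect(
        lambda P: (1 - binom_cdf(x - 1, n, P)) - alpha / 2)
    # Upper bound: the P at which observing x or fewer has probability alpha / 2.
    upper = 1.0 if x == n else bisect(
        lambda P: binom_cdf(x, n, P) - alpha / 2)
    return lower, upper


if __name__ == "__main__":
    x, n = 5, 10  # hypothetical observation: 5 cases out of 10
    print("Wald           ", wald_interval(x / n, n))
    print("Wilson         ", wilson_interval(x / n, n))
    print("Clopper-Pearson", clopper_pearson_interval(x, n))
```

The search is what makes the Binomial interval expensive: every step of the bisection re-sums the Binomial distribution, whereas the other two intervals need only a handful of arithmetic operations.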

We consider two further refinements: employing log-likelihood in computing intervals (also requiring search) and the effect of adding a correction for the transformation from a discrete distribution to a continuous one.
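As an illustration of the second refinement, the sketch below applies a continuity correction to the Wilson interval. It assumes the closed form usually attributed to Newcombe (1998); the function name is invented here, and p is assumed to be an observed proportion x / n for a whole number x.

```python
import math
from statistics import NormalDist


def wilson_cc_interval(p, n, alpha=0.05):
    """Wilson score interval with continuity correction (assumed Newcombe form)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    q = 1 - p
    denom = 2 * (n + z * z)
    if p == 0:
        lower = 0.0
    else:
        root = math.sqrt(z * z - 2 - 1 / n + 4 * p * (n * q + 1))
        lower = max(0.0, (2 * n * p + z * z - 1 - z * root) / denom)
    if p == 1:
        upper = 1.0
    else:
        root = math.sqrt(z * z + 2 - 1 / n + 4 * p * (n * q - 1))
        upper = min(1.0, (2 * n * p + z * z + 1 + z * root) / denom)
    return lower, upper


if __name__ == "__main__":
    # Hypothetical observation: p = 0.5 from n = 10 cases.
    print(wilson_cc_interval(0.5, 10))
```

The corrected interval is wider than the uncorrected Wilson interval for the same p and n, which makes the associated test more conservative.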

In the second part of the paper we thoroughly evaluate this range of approaches against three distinct test paradigms. These paradigms are the single interval or 2 × 1 goodness of fit test, and two variations on the common 2 × 2 contingency test. We evaluate the performance of each approach by a ‘practitioner strategy’. Since standard advice is to fall back to ‘exact’ Binomial tests in conditions where approximations are expected to fail, we simply count the number of instances where one test obtains a significant result when the equivalent exact test does not, across an exhaustive set of possible values.
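The counting idea can be sketched for the 2 × 1 case as follows. This is a simplified illustration, not the paper's evaluation procedure: it assumes a single fixed expected proportion P, a grid of sample sizes n and frequencies x, and a two-tailed exact test that compares each Binomial tail with α/2. All function names are hypothetical.

```python
import math
from statistics import NormalDist


def binom_pmf(x, n, P):
    """Binomial probability of exactly x successes in n trials."""
    return math.comb(n, x) * P**x * (1 - P) ** (n - x)


def exact_significant(x, n, P, alpha=0.05):
    """'Exact' Binomial test: is the tail at x no larger than alpha / 2?"""
    lower_tail = sum(binom_pmf(i, n, P) for i in range(0, x + 1))
    upper_tail = sum(binom_pmf(i, n, P) for i in range(x, n + 1))
    return min(lower_tail, upper_tail) <= alpha / 2


def z_significant(x, n, P, alpha=0.05, continuity=False):
    """Single-proportion z test, optionally with a continuity correction."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    diff = abs(x / n - P)
    if continuity:
        diff = max(0.0, diff - 1 / (2 * n))
    return diff / math.sqrt(P * (1 - P) / n) > z_crit


def count_excess_significant(max_n, P, continuity=False):
    """Count (n, x) pairs where the z test is significant but the exact test is not."""
    errors = 0
    for n in range(1, max_n + 1):
        for x in range(n + 1):
            if (z_significant(x, n, P, continuity=continuity)
                    and not exact_significant(x, n, P)):
                errors += 1
    return errors


if __name__ == "__main__":
    P = 0.3  # hypothetical expected proportion
    print("uncorrected:", count_excess_significant(30, P))
    print("corrected:  ", count_excess_significant(30, P, continuity=True))
```

Because the correction only ever shrinks the difference being tested, the corrected count can never exceed the uncorrected one in this sketch.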

We demonstrate that optimal methods are based on continuity-corrected versions of the Wilson interval or Yates’ test, and that commonly-held assumptions about weaknesses of χ² tests are misleading.

Log-likelihood, often proposed as an improvement on χ², performs disappointingly. At this level of precision we note that we may distinguish the two types of 2 × 2 test according to whether the independent variable partitions the data into independent populations, and we make practical recommendations for their use.


Estimating the error in an observation is the first, crucial step in inferential statistics. It allows us to make predictions about what would happen were we to repeat our experiment multiple times, and, because each observation represents a sample of the population, predict the true value in the population (Wallis 2013).

Consider an observation that a proportion p of a sample of size n is of a particular type.

For example

  • the proportion p of coin tosses in a set of n throws that are heads,
  • the proportion p of light bulbs in a production run of n bulbs that fail within a year,
  • the proportion p of patients who have a second heart attack within six months after a drug trial has started (n being the number of patients in the trial),
  • the proportion p of the n interrogative clauses in a spoken corpus that are finite.

We have one observation of p, as the result of carrying out a single experiment. We now wish to make inferences about the future. We would like to know how reliable our observation of p is without further sampling. Obviously, we don’t want to repeat a drug trial on cardiac patients if the drug may be adversely affecting their survival.


z-squared: the origin and application of χ²

Abstract

A set of statistical tests termed contingency tests, of which χ² is the best-known example, is commonly employed in linguistics research. Contingency tests compare discrete distributions, that is, data divided into two or more alternative categories, such as alternative linguistic choices of a speaker or different experimental conditions. These tests are ubiquitous, and are part of every linguistics researcher’s arsenal.

However, the mathematical underpinnings of these tests are rarely discussed in the literature in an approachable way, with the result that many researchers may apply tests inappropriately, fail to see the possibility of testing particular questions, or draw unsound conclusions. Contingency tests are also closely related to the construction of confidence intervals, which are highly useful and revealing methods for plotting the certainty of experimental observations.

This paper is organised in the following way. The foundations of the simplest type of χ² test, the 2 × 1 goodness of fit test, are introduced and related to the z test for a single observed proportion p and the Wilson score confidence interval about p. We then show how the 2 × 2 test for independence (homogeneity) is derived from two observations p₁ and p₂ and explain when each test should be used. We also briefly introduce the Newcombe-Wilson test, which ideally should be used in preference to the χ² test for observations drawn from two independent populations (such as two subcorpora). We then turn to tests for larger tables, generally termed “r × c” tests, which have multiple degrees of freedom and therefore may encompass multiple trends, and discuss strategies for their analysis. Finally, we turn briefly to the question of differentiating test results. We introduce the concept of effect size (also termed ‘measures of association’) and explain how we may perform statistical separability tests to distinguish between two sets of results.


Karl Pearson’s famous chi-square test is derived from another statistic, called the z statistic, based on the Normal distribution.

The simplest versions of χ² can be shown to be mathematically identical to equivalent z tests. The tests produce the same result in all circumstances. For all intents and purposes “chi-squared” could be called “z-squared”. The critical values of χ² for one degree of freedom are the square of the corresponding critical values of z.
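The relationship between the critical values can be checked numerically. The sketch below uses only the Python standard library; it relies on the fact that a χ² variable with one degree of freedom is the square of a standard Normal variable, so its cumulative distribution can be written with the error function. The helper name is invented for illustration.

```python
import math
from statistics import NormalDist


def chi2_cdf_1df(x):
    """CDF of chi-square with one degree of freedom: P(Z^2 <= x) = erf(sqrt(x / 2))."""
    return math.erf(math.sqrt(x / 2))


def z_squared_critical(alpha):
    """Square of the two-tailed z critical value at error level alpha."""
    return NormalDist().inv_cdf(1 - alpha / 2) ** 2


if __name__ == "__main__":
    for alpha in (0.05, 0.01):
        crit = z_squared_critical(alpha)
        # The chi-square (1 df) upper-tail probability at crit should equal alpha.
        print(alpha, round(crit, 4), round(1 - chi2_cdf_1df(crit), 6))
```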

  • The standard 2 × 2 χ² test is another way of calculating the z test for two independent proportions taken from the same population (Sheskin 1997: 226).
  • This test is based on an even simpler test. The 2 × 1 (or 1 × 2) “goodness of fit” (g.o.f.) χ² test is an implementation of one of the simplest tests in statistics, called the Binomial test, or population z test (Sheskin 1997: 118). This test compares a sample observation against a predicted value which is assumed to be Binomially distributed.
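Both equivalences can be verified with a short calculation. In this sketch the function names and the example frequencies are hypothetical; the 2 × 2 table is laid out with one sample per column, and the z test for two proportions uses the pooled estimate of variance.

```python
import math


def gof_chi2(x, n, P):
    """2 x 1 goodness of fit chi-square for x of n cases against expected proportion P."""
    observed = (x, n - x)
    expected = (n * P, n * (1 - P))
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def single_z(x, n, P):
    """z statistic for one observed proportion x / n against P."""
    return (x / n - P) / math.sqrt(P * (1 - P) / n)


def chi2_2x2(a, b, c, d):
    """Pearson chi-square for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    cells = ((a, b), (c, d))
    return sum(
        (cells[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
        for i in range(2) for j in range(2)
    )


def two_proportion_z(a, b, c, d):
    """z statistic for p1 = a / (a + c) against p2 = b / (b + d), pooled variance."""
    n1, n2 = a + c, b + d
    p1, p2 = a / n1, b / n2
    pooled = (a + b) / (n1 + n2)
    return (p1 - p2) / math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))


if __name__ == "__main__":
    # Hypothetical data: 30 of 100 cases against an expected proportion of 0.2.
    print(gof_chi2(30, 100, 0.2), single_z(30, 100, 0.2) ** 2)
    # Hypothetical 2 x 2 table: 20 of 100 cases in sample 1, 35 of 100 in sample 2.
    print(chi2_2x2(20, 35, 80, 65), two_proportion_z(20, 35, 80, 65) ** 2)
```

In each pair the χ² score and the squared z score agree up to floating-point rounding.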

If this is the case, why might we need chi-square?