A statistics crib sheet

Confidence intervalsHandout

Confidence intervals on an observed rate p should be computed using the Wilson score interval method. A confidence interval on an observation p represents the range that the true population value, P (which we cannot observe directly) may take, at a given level of confidence (e.g. 95%).

Note: Confidence intervals can be applied to onomasiological change (variation in choice) and semasiological change (variation in meaning), provided that P is free to vary from 0 to 1 (see Wallis 2012). Naturally, the interpretation of significant change in either case is different.

Methods for calculating intervals employ the Gaussian approximation to the Binomial distribution.

Confidence intervals on Expected (Population) values (P)

The Gaussian interval about P uses the mean and standard deviation as follows:

mean xP = F/N,
standard deviation S ≡ √P(1 – P)/N.

The Gaussian interval about P can be written as P ± E, where E = z.S, and z is the critical value of the standard Normal distribution at a given error level (e.g., 0.05). Although this is a bit of a mouthful, critical values of z are constant, so for any given level you can just substitute the constant for z. [z(0.05) = 1.95996 to six decimal places.]

In summary:

Gaussian intervalP ± z√P(1 – P)/N.

Confidence intervals on Observed (Sample) values (p)

We cannot use the same formula for confidence intervals about observations. Many people try to do this!

Most obviously, if p gets close to zero, the error e can exceed p, so the lower bound of the interval can fall below zero, which is clearly impossible! The problem is most apparent on smaller samples (larger intervals) and skewed values of p (close to 0 or 1).

The Gaussian is a reasonable approximation for an as-yet-unknown population probability P, it is incorrect for an interval around an observation p (Wallis 2013a). However the latter case is precisely where the Gaussian interval is used most often!

What is the correct method?

Continue reading

Comparing χ² tests for separability


Abstract Paper (PDF)

This paper describes a series of statistical meta-tests for comparing independent contingency tables for different types of significant difference. Recognising when an experiment obtains a significantly different result and when it does not is an issue frequently overlooked in research publication. Papers are frequently published citing ‘p values’ or test scores suggesting a ‘stronger effect’ substituting for sound statistical reasoning. This paper sets out a series of tests which together illustrate the correct approach to this question.

These meta-tests permit us to evaluate whether experiments have failed to replicate on new data; whether a particular data source or subcorpus obtains a significantly different result than another; or whether changing experimental parameters obtains a stronger effect.

The meta-tests are derived mathematically from the χ² test and the Wilson score interval, and consist of pairwise ‘point’ tests, ‘homogeneity’ tests and ‘goodness of fit’ tests. Meta-tests for comparing tests with one degree of freedom (e.g. ‘2 × 1’ and ‘2 × 2’ tests) are generalised to those of arbitrary size). Finally, we compare our approach with a competing approach offered by Zar (1999), which, while straightforward to calculate, turns out to be both less powerful and less robust.


Researchers often wish to compare the results of their experiments with those of others.

Alternatively they may wish to compare permutations of an experiment to see if a modification in the experimental design obtains a significantly different result. By doing so they would be able to investigate the empirical question of the effect of modifying an experimental design on reported results, as distinct from a deductive argument concerning the optimum design.

One of the reasons for carrying out such a test concerns the question of replication. Significance tests and confidence intervals rely on an a priori Binomial model predicting the likely distribution of future runs of the same experiment. However, there is a growing concern that allegedly significant results published in eminent psychology journals have failed to replicate (see, e.g. Gelman and Loken 2013). The reasons may be due to variation of the sample, or problems with the experimental design (such as unstated assumptions or baseline conditions that vary over experimental runs). The methods described here permit us to define a ‘failure to replicate’ as occurring when subsequent repetitions of the same experiment obtain statistically separable results on more occasions than predicted by the error level, ‘α’, used for the test.

Consider Table 1, taken from Aarts, Close and Wallis (2013). The two tables summarise a pair of 2 × 2 contingency tests for two different sets of British English corpus data for the modal alternation shall vs. will. The spoken data is drawn from the Diachronic Corpus of Present-day Spoken English, which contains matching data from the London-Lund Corpus and the British Component of the International Corpus of English (ICE-GB). The written data is drawn from the Lancaster-Oslo-Bergen (LOB) corpus and the matching Freiburg-Lancaster-Oslo-Bergen (FLOB) corpus.

Both 2 × 2 subtests are individually significant (χ² = 36.58 and 35.65 respectively). The results (see the effect size measures φ and percentage difference d%). appear to be different.

How might we test if the tables are significantly different from each other?

(spoken) shall will Total χ²(shall) χ²(will) summary
LLC (1960s) 124 501 625 15.28 2.49 d% = -60.70% ±19.67%

φ = 0.17

χ² = 36.58 s

ICE-GB (1990s) 46 544 590 16.18 2.63
TOTAL 170 1,045 1,215 31.46 5.12
(written) shall+ will+’ll Total χ²(shall+) χ²(will+’ll) summary
LOB (1960s) 355 2,798 3,153 15.58 1.57 d% = -39.23% ±12.88%

φ = 0.08

χ² = 35.65 s

FLOB (1990s) 200 2,723 2,923 16.81 1.69
TOTAL 555 5,521 6,076 32.40 3.26

Table 1: A pair of 2 × 2 tables for shall/will alternation, after Aarts et al. (2013): upper, spoken, lower: written, with other differences in the experimental design. Note that χ² values are almost identical but Cramér’s φ and percentage swing d% are different.

We can plot Table 1 as two independent pairs of probability observations, as in Figure 1. We calculate the proportion p = f/n in each case, and – in order to estimate the likely range of error introduced by the sampling procedure – compute Wilson score intervals at a 95% confidence level.

Figure 1: Example data in Table 1, plotted with 95% Wilson score intervals

Figure 1: Example data in Table 1, plotted with 95% Wilson score intervals (Wallis 2013a).

The intervals in Figure 1 are shown by ‘I’ shaped error bars: were the experiment to be re-run multiple times, in 95% of predicted repeated runs, observations at each point will fall within the interval. Where Wilson intervals do not overlap at all (e.g. LLC vs. LOB, marked ‘A’) we can identify the difference is significant without further testing; where they overlap such that one point is within the interval the difference is non-significant; otherwise a test must be applied.

In this paper we discuss two different analytical comparisons.

  1. ‘Point tests’ compare pairs of observations (‘points’) across the dependent variable (e.g. shall/will) and tables t = {1, 2}. To do this we compare the two points and their confidence intervals. We carry out a 2 × 2 χ² test for homogeneity or a Newcombe-Wilson test (Wallis 2013a) to compare each point. We can compare the initial 1960s data (LLC vs. LOB, indicated) in the same way as we might compare spoken 1960s and 1990s data (e.g. LLC vs. ICE-GB).
  2. ‘Gradient tests’ compare differences in ‘sizes of effect’ (e.g. a change in the ratio shall/will over time) between tables t. We might ask, is the gradient significantly steeper for the spoken data than for the written data?

Note that these tests evaluate different things and have different outcomes. If plot-lines are parallel, the gradient test will be non-significant, but the point test could still be significant at every pair of points. The two tests are complementary analytical tools.

1.1 How not to compare test results

A common, but mistaken, approach to comparing experimental results involves simply citing the output of significance tests (Goldacre 2011). Researchers frequently make claims citing, t, F or χ² scores, ‘p values’ (error levels), etc, as evidence for the strength of results. However, this fundamentally misinterprets the meaning of these measures, and comparisons between them are not legitimate.

Continue reading

Binomial confidence intervals and contingency tests

Abstract Paper (PDF)

Many statistical methods rely on an underlying mathematical model of probability which is based on a simple approximation, one that is simultaneously well-known and yet frequently poorly understood.

This approximation is the Normal approximation to the Binomial distribution, and it underpins a range of statistical tests and methods, including the calculation of accurate confidence intervals, performing goodness of fit and contingency tests, line-and model-fitting, and computational methods based upon these. What these methods have in common is the assumption that the likely distribution of error about an observation is Normally distributed.

The assumption allows us to construct simpler methods than would otherwise be possible. However this assumption is fundamentally flawed.

This paper is divided into two parts: fundamentals and evaluation. First, we examine the estimation of error using three approaches: the ‘Wald’ (Normal) interval, the Wilson score interval and the ‘exact’ Clopper-Pearson Binomial interval. Whereas the first two can be calculated directly from formulae, the Binomial interval must be approximated towards by computational search, and is computationally expensive. However this interval provides the most precise significance test, and therefore will form the baseline for our later evaluations.

We consider two further refinements: employing log-likelihood in computing intervals (also requiring search) and the effect of adding a correction for the transformation from a discrete distribution to a continuous one.

In the second part of the paper we consider a thorough evaluation of this range of approaches to three distinct test paradigms. These paradigms are the single interval or 2 × 1 goodness of fit test, and two variations on the common 2 × 2 contingency test. We evaluate the performance of each approach by a ‘practitioner strategy’. Since standard advice is to fall back to ‘exact’ Binomial tests in conditions when approximations are expected to fail, we simply count the number of instances where one test obtains a significant result when the equivalent exact test does not, across an exhaustive set of possible values.

We demonstrate that optimal methods are based on continuity-corrected versions of the Wilson interval or Yates’ test, and that commonly-held assumptions about weaknesses of χ² tests are misleading.

Log-likelihood, often proposed as an improvement on χ², performs disappointingly. At this level of precision we note that we may distinguish the two types of 2 × 2 test according to whether the independent variable partitions the data into independent populations, and we make practical recommendations for their use.


Estimating the error in an observation is the first, crucial step in inferential statistics. It allows us to make predictions about what would happen were we to repeat our experiment multiple times, and, because each observation represents a sample of the population, predict the true value in the population (Wallis 2013).

Consider an observation that a proportion p of a sample of size n is of a particular type.

For example

  • the proportion p of coin tosses in a set of n throws that are heads,
  • the proportion of light bulbs p in a production run of n bulbs that fail within a year,
  • the proportion of patients p who have a second heart attack within six months after a drug trial has started (n being the number of patients in the trial),
  • the proportion p of interrogative clauses n in a spoken corpus that are finite.

We have one observation of p, as the result of carrying out a single experiment. We now wish to infer about the future. We would like to know how reliable our observation of p is without further sampling. Obviously, we don’t want to repeat a drug trial on cardiac patients if the drug may be adversely affecting their survival.

Continue reading

z-squared: the origin and application of χ²

Abstract Paper (PDF)

A set of statistical tests termed contingency tests, of which χ² is the most well-known example, are commonly employed in linguistics research. Contingency tests compare discrete distributions, that is, data divided into two or more alternative categories, such as alternative linguistic choices of a speaker or different experimental conditions. These tests are highly ubiquitous, and are part of every linguistics researcher’s arsenal.

However the mathematical underpinnings of these tests are rarely discussed in the literature in an approachable way, with the result that many researchers may apply tests inappropriately, fail to see the possibility of testing particular questions, or draw unsound conclusions. Contingency tests are also closely related to the construction of confidence intervals, which are highly useful and revealing methods for plotting the certainty of experimental observations.

This paper is organised in the following way. The foundations of the simplest type of χ² test, the 2 × 1 goodness of fit test, are introduced and related to the z test for a single observed proportion p and the Wilson score confidence interval about p. We then show how the 2 × 2 test for independence (homogeneity) is derived from two observations p₁ and p₂ and explain when each test should be used. We also briefly introduce the Newcombe-Wilson test, which ideally should be used in preference to the χ² test for observations drawn from two independent populations (such as two subcorpora). We then turn to tests for larger tables, generally termed “r × c” tests, which have multiple degrees of freedom and therefore may encompass multiple trends, and discuss strategies for their analysis. Finally, we turn briefly to the question of differentiating test results. We introduce the concept of effect size (also termed ‘measures of association’) and finally explain how we may perform statistical separability tests to distinguish between two sets of results.


Karl Pearson’s famous chi-square test is derived from another statistic, called the z statistic, based on the Normal distribution.

The simplest versions of χ² can be shown to be mathematically identical to equivalent z tests. The tests produce the same result in all circumstances. For all intents and purposes “chi-squared” could be called “z-squared”. The critical values of χ² for one degree of freedom are the square of the corresponding critical values of z.

  • The standard 2 × 2 χ² test is another way of calculating the z test for two independent proportions taken from the same population (Sheskin 1997: 226).
  • This test is based on an even simpler test. The 2 × 1 (or 1 × 2) “goodness of fit” (g.o.f.) χ² test is an implementation of one of the simplest tests in statistics, called the Binomial test, or population z test (Sheskin 1997: 118). This test compares a sample observation against a predicted value which is assumed to be Binomially distributed.

If this is the case, why might we need chi-square? Continue reading