A statistics crib sheet

Confidence intervals

Confidence intervals on an observed rate p should be computed using the Wilson score interval method. A confidence interval on an observation p represents the range that the true population value, P (which we cannot observe directly) may take, at a given level of confidence (e.g. 95%).

Note: Confidence intervals can be applied to onomasiological change (variation in choice) and semasiological change (variation in meaning), provided that P is free to vary from 0 to 1 (see Wallis 2012). Naturally, the interpretation of significant change in either case is different.

Methods for calculating intervals employ the Gaussian approximation to the Binomial distribution.

Confidence intervals on Expected (Population) values P

The Gaussian interval about P uses the mean and standard deviation as follows:

mean $\bar{x} \equiv P$,
standard deviation $S \equiv \sqrt{P(1 - P)/n}$,

where n is the sample size.

The Gaussian interval about P can be written as P ± E, where E = z.S, and z is the two-tailed critical value of the standard Normal distribution at a given error level α (e.g., 0.05), often written $z_{\alpha/2}$. Although this is a bit of a mouthful, critical values of z are constant, so for any given level you can just substitute the constant for z. [z(0.05) = 1.959964 to six decimal places.]

In summary:

Gaussian interval: $P \pm z\sqrt{P(1 - P)/n}$.
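
The summary formula can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive implementation: the function name `gaussian_interval` and the example values (P = 0.5, n = 100) are assumptions made for the example, and z is looked up with the standard library instead of being typed in as a constant.

```python
import math
from statistics import NormalDist

def gaussian_interval(P, n, alpha=0.05):
    """Gaussian interval P ± z.S about a population value P (hypothetical helper)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value for error level alpha
    S = math.sqrt(P * (1 - P) / n)           # standard deviation S for sample size n
    E = z * S                                # error term E = z.S
    return P - E, P + E

# Illustrative values only: P = 0.5, n = 100.
lower, upper = gaussian_interval(0.5, 100)
print(f"{lower:.4f} to {upper:.4f}")
```

Because the interval is centred on P, it is always symmetric, which is reasonable for an expected value but (as discussed next) not for an observation.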

Confidence intervals on Observed (Sample) values p

We cannot use the same formula for confidence intervals about observations. Many people try to do this!

Most obviously, if p gets close to zero, the error e (the same formula with p in place of P) can exceed p, so the lower bound of the interval can fall below zero, which is clearly impossible! The problem is most apparent on smaller samples (larger intervals) and skewed values of p (close to 0 or 1).
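
The overshoot is easy to reproduce. The sketch below applies the Gaussian formula to an observation, which is exactly the mistake described above; the function name `naive_interval` and the data (1 hit in a sample of 20) are hypothetical.

```python
import math

Z = 1.959964  # two-tailed critical value of the standard Normal at error level 0.05

def naive_interval(p, n, z=Z):
    """Gaussian formula applied (incorrectly) about an observation p."""
    e = z * math.sqrt(p * (1 - p) / n)
    return p - e, p + e

# Illustrative values only: 1 hit in a sample of 20, so p = 0.05.
lower, upper = naive_interval(0.05, 20)
print(lower, upper)
print(lower < 0)  # the lower bound is an impossible negative probability
```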

Although the Gaussian is a reasonable approximation for an as-yet-unknown population probability P, it is incorrect for an interval around an observation p (Wallis 2013a). However, the latter case is precisely where the Gaussian interval is used most often!

What is the correct method?

The figure below illustrates the relationship between intervals on p and P. The skewed distribution on the right is the Wilson score interval for p. This can be thought of as the projection of Gaussian (Normal) intervals on P.


To plot accurate intervals around an observed value p, we need to calculate Wilson’s score interval:

Wilson score interval:

$$(w^-, w^+) \equiv \frac{p + \dfrac{z^2}{2n} \pm z\sqrt{\dfrac{p(1 - p)}{n} + \dfrac{z^2}{4n^2}}}{1 + \dfrac{z^2}{n}}$$

The score interval is asymmetric (except where p = 0.5) and tends towards the middle of the distribution (as the figure reveals). It cannot exceed the probability range [0, 1] and it should always be used instead of the Gaussian, particularly with skewed data and small samples (a common condition in corpus linguistics). A continuity-corrected version of Wilson’s interval should be used where n is small.
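
As a rough sketch, the Wilson score interval (without the continuity correction) can be computed as follows. The function name `wilson_interval` and the example data (p = 0.05, n = 20) are assumptions made for illustration.

```python
import math

Z = 1.959964  # two-tailed critical value of the standard Normal at error level 0.05

def wilson_interval(p, n, z=Z):
    """Wilson score interval (w-, w+) about an observed proportion p."""
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom

# Illustrative values only: p = 0.05 observed in a sample of n = 20.
lower, upper = wilson_interval(0.05, 20)
print(f"{lower:.4f} to {upper:.4f}")
```

With these values the lower bound stays just above zero and the upper bound lies much further from p than the lower bound does, which is the asymmetry described above.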

Contingency correlation tests

Contingency correlation tests, including log-likelihood, χ², and its variations, are premised on the population z test (Wallis 2013b). The 2 × 1 goodness of fit χ² test is a reformulation of a single-sample test based on an expected baseline frequency.

We might use this to check

  • whether the ratio of a ‘Type A’ term correlates with a baseline, or
  • whether two competing frequencies (proportions of values of the same variable) differ significantly.

[Figure: The single-sample population z test (goodness of fit χ² test)]
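
The equivalence between the two formulations can be checked numerically: the 2 × 1 goodness of fit χ² score equals the square of the single-sample z score. This is a sketch with invented data (60 ‘Type A’ cases out of 100 against an expected baseline of P = 0.5), and the function names are hypothetical.

```python
import math

def single_sample_z(observed_a, n, P):
    """z score for an observed proportion against an expected value P."""
    p = observed_a / n
    return (p - P) / math.sqrt(P * (1 - P) / n)

def goodness_of_fit_chisq(observed_a, n, P):
    """2 x 1 goodness of fit chi-square against the baseline P."""
    observed = [observed_a, n - observed_a]
    expected = [n * P, n * (1 - P)]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Illustrative values only.
z = single_sample_z(60, 100, 0.5)
chisq = goodness_of_fit_chisq(60, 100, 0.5)
print(z, chisq, math.isclose(z * z, chisq))
```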

Similarly, the 2 × 2 χ² test of homogeneity (independence) is identical to a two-sample z test where samples are drawn from the same population (see Wallis 2013b).
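
The same check works for the 2 × 2 case. In the sketch below the z test uses the pooled probability as its estimate of the shared population value, and its square matches the χ² score for the same table. The data (30 out of 100 against 45 out of 100) and the function names are assumptions made for illustration.

```python
import math

def two_sample_z(a1, n1, a2, n2):
    """z for p2 - p1, using the pooled probability as the population estimate."""
    p1, p2 = a1 / n1, a2 / n2
    pooled = (a1 + a2) / (n1 + n2)
    s = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p2 - p1) / s

def chisq_2x2(a1, n1, a2, n2):
    """Chi-square for the table [[a1, n1 - a1], [a2, n2 - a2]]."""
    table = [[a1, n1 - a1], [a2, n2 - a2]]
    col_totals = [sum(col) for col in zip(*table)]
    N = n1 + n2
    total = 0.0
    for row in table:
        row_total = sum(row)
        for o, col_total in zip(row, col_totals):
            e = row_total * col_total / N  # expected cell frequency
            total += (o - e) ** 2 / e
    return total

# Illustrative values only.
z = two_sample_z(30, 100, 45, 100)
chisq = chisq_2x2(30, 100, 45, 100)
print(z * z, chisq, math.isclose(z * z, chisq))
```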

These tests work by creating a new confidence interval out of the inner intervals at each point. For χ² the equivalent combined interval is based on the overall probability. So in the figure below, O₁ and O₂ represent observed distributions about two points, and the new combined interval is related to the standard deviation (a measure of spread) of each distribution.

[Figure: The 2 × 2 χ² test assumes uncertainty in both observations.]

The optimum method of calculation is to employ Yates’ χ² test. This can also be used for evaluating larger tables with more than two columns or rows. The main problem with larger r × c tables is interpretation: with more than 1 degree of freedom, a significant result merely tells you that the variables interact. The correct approach is discussed in Wallis (2013b): to restructure tables and refocus the experimental design on key areas of variation.
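
A minimal sketch of the calculation, assuming the usual form of Yates’ correction (subtracting 0.5 from each absolute difference between observed and expected frequencies before squaring). The function name `chisq`, its `yates` flag and the example table are hypothetical; the function accepts any r × c table as a list of rows.

```python
def chisq(table, yates=False):
    """Chi-square for an r x c table, optionally with Yates' correction."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    N = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / N  # expected cell frequency
            diff = abs(o - e)
            if yates:
                diff = max(diff - 0.5, 0.0)        # never correct past zero
            total += diff ** 2 / e
    return total

# Illustrative values only.
table = [[30, 70], [45, 55]]
plain = chisq(table)
corrected = chisq(table, yates=True)
print(plain, corrected)
```

The corrected score is always a little smaller than the uncorrected one, so the test is more conservative.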

The standard test can be thought of as testing if a difference d = p₂ – p₁ is other than zero (and therefore p₂ ≠ p₁).

But there are many circumstances where we wish to test d against another arbitrary difference score, D. We do this, for example, if we wish to calculate confidence intervals on d. In this situation an alternative approach should be used.

Newcombe (1998) employs Wilson’s score interval to create an accurate difference interval. The resulting test (preferably with a continuity correction) is more flexible, as it is accurate for zero but also whenever a constant difference is predicted. For this reason it is preferred in separability meta-tests (see below).
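
A sketch of this difference interval follows, omitting the continuity correction for brevity. It combines the inner widths of the two Wilson intervals by squaring and adding, following Newcombe (1998). The function names, the helper `differs_from` and the example data are assumptions made for illustration.

```python
import math

Z = 1.959964  # two-tailed critical value of the standard Normal at error level 0.05

def wilson_interval(p, n, z=Z):
    """Wilson score interval (w-, w+) about an observed proportion p."""
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom

def newcombe_wilson(p1, n1, p2, n2, z=Z):
    """Interval for d = p2 - p1 built from the two Wilson intervals."""
    l1, u1 = wilson_interval(p1, n1, z)
    l2, u2 = wilson_interval(p2, n2, z)
    d = p2 - p1
    lower = d - math.sqrt((p2 - l2) ** 2 + (u1 - p1) ** 2)
    upper = d + math.sqrt((u2 - p2) ** 2 + (p1 - l1) ** 2)
    return lower, upper

def differs_from(D, p1, n1, p2, n2, z=Z):
    """True if the predicted difference D lies outside the interval for d."""
    lower, upper = newcombe_wilson(p1, n1, p2, n2, z)
    return not (lower <= D <= upper)

# Illustrative values only: p1 = 0.30 and p2 = 0.45, each from n = 100.
lower, upper = newcombe_wilson(0.30, 100, 0.45, 100)
print(f"d = 0.15, interval {lower:.4f} to {upper:.4f}")
print(differs_from(0.0, 0.30, 100, 0.45, 100))  # test against zero
print(differs_from(0.1, 0.30, 100, 0.45, 100))  # test against a predicted D
```

Testing against zero and testing against an arbitrary D use the same interval, which is the flexibility referred to above.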

Effect size

To compare different results we can focus on these difference measures alone.

Wallis (2013b) notes that simple swing, d = p₂ – p₁, and percentage swing, d% = d/p₁, are commonly used for comparing observed probabilities, and explains how these may be plotted with confidence intervals.
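
Both measures are simple to compute. The sketch below uses hypothetical function names and invented proportions; note that percentage swing is undefined when p₁ is zero.

```python
def simple_swing(p1, p2):
    """Simple swing d = p2 - p1."""
    return p2 - p1

def percentage_swing(p1, p2):
    """Percentage swing d% = d / p1 (requires p1 > 0)."""
    if p1 == 0:
        raise ValueError("percentage swing is undefined when p1 is zero")
    return (p2 - p1) / p1

# Illustrative values only.
d = simple_swing(0.30, 0.45)
d_pct = percentage_swing(0.30, 0.45)
print(f"d = {d:.2f}, d% = {d_pct:.0%}")
```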

More advanced methods include Cramér’s φ and a modified goodness of fit φ’, both of which can be extended to assess the size of an effect of an independent variable across more than 2 dependent values. Cramér’s φ (Wallis 2012a) is a measure of association based on a χ² test of homogeneity (measuring change on both A and B over a contrast).
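
A sketch of Cramér’s φ, assuming the usual definition φ = √(χ² / (N × (k – 1))), where N is the table total and k is the smaller of the number of rows and columns. The function names and the example tables are hypothetical.

```python
import math

def chisq(table):
    """Chi-square for an r x c table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    N = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / N  # expected cell frequency
            total += (o - e) ** 2 / e
    return total

def cramers_phi(table):
    """Cramer's phi: chi-square standardised to the range [0, 1]."""
    N = sum(sum(row) for row in table)
    k = min(len(table), len(table[0]))
    return math.sqrt(chisq(table) / (N * (k - 1)))

# Illustrative values only.
print(cramers_phi([[50, 0], [0, 50]]))    # complete association
print(cramers_phi([[25, 25], [25, 25]]))  # no association
print(cramers_phi([[30, 70], [45, 55]]))  # a weak effect
```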

The other measures are designed for goodness of fit applications (estimating the degree of variation of a single term A against a fixed baseline over a contrast). φ’ can be extended to measure variation over multiple points (such as text categories), whereas difference measures can only refer to two p values. These φ measures are standardised to the probabilistic range [0, 1].

The type of problem these measures aim to address is described in Wallis (2012b). The graph below shows three distributions: O represents the frequency of present perfect verb phrases in each subtext in a corpus, and two expected distributions, E, represent the distribution of present-referring and past-referring verb phrases in the same corpus. The question researchers wish to answer is: does the present perfect correlate more strongly with present- or past-referring VPs?

[Figure: The distribution of the present perfect O, and scaled distributions E for present and past, across text categories of DCPSE.]

It is possible to define alternative fitness estimates by making different mathematical assumptions about the relative importance of variation. With 2 categories, different formulae obtain closely similar results. However, as the number of categories increases – and those categories may be uneven in size (e.g. DCPSE ‘genres’ vary from 126 to 3 texts per category: columns in the chart above vary greatly between genres) – we obtain much more varied results.

Wallis (2012b) identifies the standardised root mean square error, φp, as a robust and ‘well-behaved’ measure. This is obtained by the formula

$$\phi_p = \frac{1}{N}\sqrt{\tfrac{1}{2}\sum_i \bigl(O(i) - E(i)\bigr)^2},$$

where O(i) and E(i) represent observed (term) and expected (baseline) frequencies for category i.
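
A sketch of the calculation, under two assumptions that the text does not spell out: N is taken to be the total observed frequency, and the expected distribution is rescaled to the same total before comparison. The square root is applied to the halved sum of squares and the result is then divided by N, which is the reading that keeps the measure within [0, 1]. The function name `phi_p` and the data are hypothetical.

```python
import math

def phi_p(observed, expected):
    """Standardised root mean square error between two frequency distributions."""
    N = sum(observed)
    scale = N / sum(expected)  # rescale the baseline to the observed total (assumption)
    squared = sum((o - e * scale) ** 2 for o, e in zip(observed, expected))
    return math.sqrt(0.5 * squared) / N

# Illustrative values only.
print(phi_p([30, 10, 10], [20, 20, 10]))  # partial mismatch
print(phi_p([10, 0], [0, 10]))            # complete mismatch
print(phi_p([5, 5], [50, 50]))            # identical proportions
```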

Separability tests

Finally, Wallis (2013b) also points out that it is possible to compare a pair of 2 × 2 contingency tests for statistical separability, that is, to test if the results are significantly different from each other.

The idea is an extension of the derivation of the z test for the difference between two proportions (2 × 2 contingency test), by evaluating the difference between two differences. Wallis (2019) extends the paradigm to compare outcomes from any pair of identically-structured χ² tests (with equations for 2 × 1 goodness of fit, r × 1 goodness of fit and r × c contingency test for independence).

Suppose that you carry out the same experiment twice, but vary the conditions slightly. On the second attempt you appear to get a stronger effect than on the first. A separability test determines whether the difference between these two test outcomes is significant, i.e. that one is significantly greater than the other.

Note that just because two results are individually significant (i.e. a change is significantly different from zero) does not mean that they are significantly different from each other. Likewise, just because one result reports a numerically greater size of effect, χ² score or error level than another does not mean that results are “stronger”.
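
The point can be illustrated with a rough sketch. It is not a transcription of the equations in Wallis (2013b): it assumes that the same square-and-add combination Newcombe uses for a single difference can be applied a second time to the two difference intervals, and it omits continuity corrections. All function names and data are hypothetical.

```python
import math

Z = 1.959964  # two-tailed critical value of the standard Normal at error level 0.05

def wilson_interval(p, n, z=Z):
    """Wilson score interval (w-, w+) about an observed proportion p."""
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom

def difference_interval(p1, n1, p2, n2, z=Z):
    """Newcombe-Wilson interval for d = p2 - p1; returns (d, lower, upper)."""
    l1, u1 = wilson_interval(p1, n1, z)
    l2, u2 = wilson_interval(p2, n2, z)
    d = p2 - p1
    lower = d - math.sqrt((p2 - l2) ** 2 + (u1 - p1) ** 2)
    upper = d + math.sqrt((u2 - p2) ** 2 + (p1 - l1) ** 2)
    return d, lower, upper

def separability_interval(exp_a, exp_b, z=Z):
    """Interval for d_b - d_a, combining inner widths of the two difference
    intervals by squaring and adding (an assumption of this sketch)."""
    da, la, ua = difference_interval(*exp_a, z=z)
    db, lb, ub = difference_interval(*exp_b, z=z)
    delta = db - da
    lower = delta - math.sqrt((db - lb) ** 2 + (ua - da) ** 2)
    upper = delta + math.sqrt((ub - db) ** 2 + (da - la) ** 2)
    return lower, upper

# Illustrative values only: each experiment is (p1, n1, p2, n2).
exp_a = (0.30, 100, 0.45, 100)  # d = 0.15
exp_b = (0.30, 100, 0.60, 100)  # d = 0.30, apparently a stronger effect

for name, exp in (("A", exp_a), ("B", exp_b)):
    d, lo, hi = difference_interval(*exp)
    print(name, "significant:", not (lo <= 0 <= hi))

lower, upper = separability_interval(exp_a, exp_b)
print("A and B separable:", not (lower <= 0 <= upper))
```

With these invented figures each experiment is individually significant, yet the interval for the difference between the two differences includes zero, so the apparently stronger effect in B cannot be said to be significantly greater than that in A.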

This is also why I strongly advise against quoting χ² scores (or p error values) in papers: although it is common practice to do so, it can be very misleading. A much better approach is to pick an error level, say, 0.05, and then stick to it.

You should use appropriate tests to draw out distinctions, and cite confidence intervals around probability values (e.g. “the number of finite VPs increased as a proportion by 25% ± 10%”) when discussing change.


Newcombe, R.G. 1998. Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine 17: 873-890.
