Why ‘Wald’ is Wrong: once more on confidence intervals


The idea of plotting confidence intervals on data, which is discussed in a number of posts elsewhere on this blog, should be straightforward. Everything we observe is uncertain, but some things are more certain than others! Instead of marking an observation as a point, it's better to express it as a 'cloud', an interval representing a range of plausible values.

But the standard method for calculating intervals that most people are taught is wrong.

The reasons why are dealt with in detail in (Wallis 2013). In preparing this paper for publication, however, I came up with a new demonstration, using real data, as to why this is the case.

Plotting ‘Wald’ intervals on sparse and skewed data

First: some data.

In a paper published in a volume on the Verb Phrase in English, Aarts, Close and Wallis (2013) examined the alternation in British English between first person declarative uses of modal shall and will over a thirty-year period, plotting over time the probability of selecting shall given the choice, which we can write as p(shall | {shall, will}).

Our data is reproduced in the following table. The dataset has two notable properties: the data is sparse (the corpus is below 1 million words), and many datapoints are skewed: observed probability does not merely approach zero or 1 but actually reaches it.

Year shall will Total n p(shall) z.s e⁻ e⁺
1958 1 0 1 1.0000 0.0000 1.0000 1.0000
1959 1 0 1 1.0000 0.0000 1.0000 1.0000
1960 5 1 6 0.8333 0.2982 0.5351 1.1315
1961 7 8 15 0.4667 0.2525 0.2142 0.7191
1963 0 1 1 0.0000 0.0000 0.0000 0.0000
1964 6 0 6 1.0000 0.0000 1.0000 1.0000
1965 3 4 7 0.4286 0.3666 0.0620 0.7952
1966 7 6 13 0.5385 0.2710 0.2675 0.8095
1967 3 0 3 1.0000 0.0000 1.0000 1.0000
1969 2 2 4 0.5000 0.4900 0.0100 0.9900
1970 3 1 4 0.7500 0.4243 0.3257 1.1743
1971 12 6 18 0.6667 0.2178 0.4489 0.8844
1972 2 2 4 0.5000 0.4900 0.0100 0.9900
1973 3 0 3 1.0000 0.0000 1.0000 1.0000
1974 12 8 20 0.6000 0.2147 0.3853 0.8147
1975 26 23 49 0.5306 0.1397 0.3909 0.6703
1976 11 7 18 0.6111 0.2252 0.3859 0.8363
1990 5 8 13 0.3846 0.2645 0.1202 0.6491
1991 23 36 59 0.3898 0.1244 0.2654 0.5143
1992 8 8 16 0.5000 0.2450 0.2550 0.7450

Table 1. Alternation of first person declarative modal shall vs. will over recent time, data from the spoken DCPSE corpus (after Aarts et al. 2013).

We have added three columns to our original table. These are the Gaussian (Wald) 95% error interval width z.s, and the lower and upper bounds e⁻ and e⁺ respectively, obtained by subtracting z.s from, and adding it to, p(shall), where

mean x̄ = p = f/n,
standard deviation s ≡ √(p(1 – p)/n).

To calculate p(shall), therefore, we simply divide the number of cases of shall (the frequency f(shall) if you prefer) by the total n, and to calculate the standard deviation s we use the formula above.
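As a minimal sketch (our own illustration, not code from the paper), the Wald calculation for single rows of Table 1 can be written in a few lines of Python. The 1958 row reproduces the zero-width interval and the 1960 row the overshoot that the table exposes:

```python
import math

def wald_interval(f, n, z=1.96):
    """Wald interval for p = f/n at 95% (z = 1.96).
    This is the method the post argues against."""
    p = f / n
    s = math.sqrt(p * (1 - p) / n)   # standard deviation of p
    return p - z * s, p + z * s

# 1958 row (f = 1, n = 1): a zero-width interval at p = 1
print(wald_interval(1, 1))            # (1.0, 1.0)

# 1960 row (f = 5, n = 6): the upper bound overshoots 1
lo, hi = wald_interval(5, 6)
print(round(lo, 4), round(hi, 4))     # 0.5351 1.1315
```

The values agree with the z.s, e⁻ and e⁺ columns of Table 1.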

Fully-skewed values, i.e. where p(shall) = zero or 1, obtain zero-width intervals, which are highlighted in bold in the z.s column. However, an interval of zero width represents complete certainty. We cannot say on the basis of a single observation that it is certain that all similarly-sampled speakers in 1958 used shall in place of will in first person declarative contexts!

Secondly, this data provides two examples (1960, 1970) of overshoot, where the upper bound of the interval extends beyond the probabilistic range [0, 1]. Any part of an interval that falls outside this range cannot contain an observable value, indicating that the interval is miscalculated. We plot this data in the figure below.

Figure 1. Plot of p(shall) over time, data from Aarts et al., with 95% Wald intervals, illustrating overshoot (dotted lines), zero-width intervals (circles), and 3-sigma rule failures (empty points).

Thirdly, common statistical advice (the '3-sigma rule') outlaws extreme values and requires that p ± 3s falls within [0, 1] before the Wald interval may be used. This means that we simply give up estimating the error for low or high values of p or for small n, a situation that is hardly satisfactory! Fewer than half the values of p(shall) in the table satisfy this rule (the filled points in the figure above). Needless to say, when it comes to line-fitting or other less explicit uses of this estimate, such limits tend to be forgotten.
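To see how restrictive this rule is in practice, the check below (a sketch of our own, not from the paper) applies the p ± 3s condition to some rows of Table 1:

```python
import math

def passes_three_sigma(f, n):
    """'3-sigma rule' check: the Wald interval is only considered safe
    if p ± 3s stays inside [0, 1]; extreme p = 0 or 1 is ruled out."""
    p = f / n
    if p == 0 or p == 1:
        return False
    s = math.sqrt(p * (1 - p) / n)
    return p - 3 * s >= 0 and p + 3 * s <= 1

# Rows from Table 1: (year, f, n)
for year, f, n in [(1958, 1, 1), (1960, 5, 6), (1961, 7, 15), (1975, 26, 49)]:
    print(year, passes_three_sigma(f, n))
# 1958 and 1960 fail the rule; 1961 and 1975 pass
```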

A similar heuristic for the χ² test (the Cochran rule) avoids employing the test where expected cell values fall below 5. This has proved so unsatisfactory that a series of statisticians have proposed competing alternatives to the chi-square test such as the log-likelihood test, in a series of attempts to cope with low frequencies and skewed datasets.

Plotting Wilson’s score interval on the same data

If, however, we apply the Wilson score interval to Table 1 we can now plot credible confidence intervals on the same data which have none of the problems observed above. This interval is computed by

Wilson score interval (w⁻, w⁺) ≡ [p + z²/2n ± z√(p(1 – p)/n + z²/4n²)] / (1 + z²/n).
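In Python the same formula might look like this (a sketch; the variable names are ours, not from the paper). Note how the fully-skewed 1958 row now obtains a wide interval, and the 1960 upper bound stays inside [0, 1]:

```python
import math

def wilson_interval(f, n, z=1.96):
    """Wilson score interval for p = f/n at 95% (z = 1.96)."""
    p = f / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - spread) / denom, (centre + spread) / denom

# 1958 row (f = 1, n = 1): no longer a spuriously 'certain' zero-width interval
lo, hi = wilson_interval(1, 1)
print(round(lo, 4), round(hi, 4))   # 0.2065 1.0

# 1960 row (f = 5, n = 6): the upper bound no longer overshoots 1
lo, hi = wilson_interval(5, 6)
print(round(lo, 4), round(hi, 4))
```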

Figure 2. Plot of p(shall) over time, data from Table 1, with 95% Wilson score confidence intervals (after Aarts et al. 2013).

The figure above depicts the result of this recalculation. The previously zero-width intervals now have a large width – as one would expect, since they represent highly uncertain observations rather than certain ones – in some instances extending across nearly 80% of the probabilistic range. The overshooting 1960 and 1970 datapoints from the first graph now fall within the probability range. The intervals for 1969 and 1972, which previously extended over nearly the entire range, have shrunk.

The Wilson score interval is not perfect, but it is a tremendous start. It is possible to add a continuity correction (similar to Yates’ adjustment) to the Wilson interval, which slightly increases the widths of the intervals above. In the paper we show that, even without a continuity correction, it is a more reliable interval than those obtained with log-likelihood using complex search methods!
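One common construction of the continuity-corrected interval, which we sketch here as an illustration (see Wallis 2013 for the exact formulation), applies the Wilson formula to p – 1/2n for the lower bound and p + 1/2n for the upper, clamped to [0, 1]:

```python
import math

def wilson_bounds(p, n, z=1.96):
    """Wilson score interval as a function of an observed proportion p."""
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - spread) / denom, (centre + spread) / denom

def wilson_cc(f, n, z=1.96):
    """Continuity-corrected Wilson interval (a sketch): apply the Wilson
    formula to p -/+ 1/2n for the lower/upper bound, clamped to [0, 1]."""
    p = f / n
    lo, _ = wilson_bounds(max(0.0, p - 1 / (2 * n)), n, z)
    _, hi = wilson_bounds(min(1.0, p + 1 / (2 * n)), n, z)
    return lo, hi

# 1960 row (f = 5, n = 6): slightly wider than the uncorrected interval,
# but still inside [0, 1]
print(wilson_cc(5, 6))
print(wilson_bounds(5 / 6, 6))
```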

The Wald interval, on the other hand, is premised on a mathematical error that is corrected by Wilson’s formulation, one that is discussed in the paper, and is not a good basis for further generalisation. For this reason the Wald interval can be said not just to be problematic, but to be wrong, and should be discontinued.

See also


Aarts, B., Close, J. and Wallis, S.A. 2013. Choices over time: methodological issues in investigating current change. Chapter 2 in Aarts, B., Close, J., Leech, G. and Wallis, S.A. (eds.) The Verb Phrase in English. Cambridge: CUP.

Wallis, S.A. 2013. Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods. Journal of Quantitative Linguistics 20:3, 178-208.
