### Introduction

The idea of plotting confidence intervals on data, which is discussed in a number of posts elsewhere on this blog, should be straightforward. Everything we observe is uncertain, but some things are more certain than others! Instead of marking an observation as a point, it's better to express it as a ‘cloud’, an interval representing a range of probabilities.

But the standard method for calculating intervals that most people are taught is **wrong**.

The reasons *why* are dealt with in detail in (Wallis 2013). In preparing this paper for publication, however, I came up with a new demonstration, using real data, as to why this is the case.

### Plotting ‘Wald’ intervals on sparse and skewed data

First: some data.

In a paper published in a volume on the Verb Phrase in English, Aarts, Close and Wallis (2013) examined the alternation in British English from first person declarative uses of modal *shall* to *will* over a thirty-year period. We did this by plotting over time the probability of selecting *shall* given the choice, which we can write as *p*(*shall* | {*shall*, *will*}).

Our data is reproduced in the following table. The dataset has a number of attributes: data is **sparse** (this corpus is below 1 million words) and many datapoints are **skewed**: observed probability does not merely approach zero or 1 but reaches it.

| Year | shall | will | Total n | p(shall) | z.s | e⁻ | e⁺ |
|---|---|---|---|---|---|---|---|
| 1958 | 1 | 0 | 1 | 1.0000 | **0.0000** | 1.0000 | 1.0000 |
| 1959 | 1 | 0 | 1 | 1.0000 | **0.0000** | 1.0000 | 1.0000 |
| 1960 | 5 | 1 | 6 | 0.8333 | 0.2982 | 0.5351 | 1.1315 |
| 1961 | 7 | 8 | 15 | 0.4667 | 0.2525 | 0.2142 | 0.7191 |
| 1963 | 0 | 1 | 1 | 0.0000 | **0.0000** | 0.0000 | 0.0000 |
| 1964 | 6 | 0 | 6 | 1.0000 | **0.0000** | 1.0000 | 1.0000 |
| 1965 | 3 | 4 | 7 | 0.4286 | 0.3666 | 0.0620 | 0.7952 |
| 1966 | 7 | 6 | 13 | 0.5385 | 0.2710 | 0.2675 | 0.8095 |
| 1967 | 3 | 0 | 3 | 1.0000 | **0.0000** | 1.0000 | 1.0000 |
| 1969 | 2 | 2 | 4 | 0.5000 | 0.4900 | 0.0100 | 0.9900 |
| 1970 | 3 | 1 | 4 | 0.7500 | 0.4243 | 0.3257 | 1.1743 |
| 1971 | 12 | 6 | 18 | 0.6667 | 0.2178 | 0.4489 | 0.8844 |
| 1972 | 2 | 2 | 4 | 0.5000 | 0.4900 | 0.0100 | 0.9900 |
| 1973 | 3 | 0 | 3 | 1.0000 | **0.0000** | 1.0000 | 1.0000 |
| 1974 | 12 | 8 | 20 | 0.6000 | 0.2147 | 0.3853 | 0.8147 |
| 1975 | 26 | 23 | 49 | 0.5306 | 0.1397 | 0.3909 | 0.6703 |
| 1976 | 11 | 7 | 18 | 0.6111 | 0.2252 | 0.3859 | 0.8363 |
| 1990 | 5 | 8 | 13 | 0.3846 | 0.2645 | 0.1202 | 0.6491 |
| 1991 | 23 | 36 | 59 | 0.3898 | 0.1244 | 0.2654 | 0.5143 |
| 1992 | 8 | 8 | 16 | 0.5000 | 0.2450 | 0.2550 | 0.7450 |

We have added three columns to our original table. These are the Gaussian (Wald) 95% error interval width *z.s* (where *z* ≈ 1.96 is the two-tailed critical value of the standard Normal distribution), and the lower and upper bounds *e*⁻ and *e*⁺, obtained by subtracting *z.s* from *p*(*shall*) and adding *z.s* to it respectively, where

*mean* *x* ≡ *p* = *f*/*n*,

*standard deviation s* ≡ √(*p*(1 – *p*)/*n*).

To calculate *p*(*shall*), therefore, we simply divide the number of cases of *shall* (the frequency *f*(*shall*) if you prefer) by the total *n*, and to calculate the standard deviation *s* we use the formula above.
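The calculation just described can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the paper: the function name `wald_interval` and the constant `Z_95` are assumptions introduced here, and the example uses the 1961 row of the table (7 cases of *shall* out of 15).

```python
import math

# Two-tailed critical value of the standard Normal distribution for a
# 95% interval (the source gives no value; 1.96 to two decimal places).
Z_95 = 1.959964


def wald_interval(f, n, z=Z_95):
    """Return (p, z.s, e_minus, e_plus) for f cases out of a total n.

    Hypothetical helper name, used only for illustration.
    """
    p = f / n                        # observed probability p = f/n
    s = math.sqrt(p * (1 - p) / n)   # standard deviation s
    width = z * s                    # interval width z.s
    return p, width, p - width, p + width


# 1961: 7 cases of shall, 8 of will, so n = 15
p, zs, lower, upper = wald_interval(7, 15)
print(f"p = {p:.4f}, z.s = {zs:.4f}, interval = ({lower:.4f}, {upper:.4f})")
```

Rounded to four decimal places, the printed values should agree with the 1961 row of the table.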

Fully-skewed values, i.e. where *p*(*shall*) = zero or 1, obtain **zero-width intervals**, which are highlighted in bold in the *z.s* column. However, an interval of zero width represents complete certainty. We cannot say on the basis of a single observation that it is certain that all similarly-sampled speakers in 1958 used *shall* in place of *will* in first person declarative contexts!

Secondly, this data provides two examples (1960, 1970) of **overshoot**, where the upper bound of the interval exceeds 1 and so falls outside the probabilistic range [0, 1]. Any part of an interval outside this range simply cannot be obtained, indicating that the interval is miscalculated. We plot this data in the figure below.
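Both problems can be found mechanically. The sketch below recomputes the Wald bounds for every year and flags zero-width and overshooting intervals. The frequencies are copied from the table; the names `DATA` and `wald_bounds` are assumptions made for this illustration.

```python
import math

Z_95 = 1.959964  # assumed 95% critical value of the Normal distribution

# (year, f(shall), f(will)), copied from the table above
DATA = [
    (1958, 1, 0), (1959, 1, 0), (1960, 5, 1), (1961, 7, 8), (1963, 0, 1),
    (1964, 6, 0), (1965, 3, 4), (1966, 7, 6), (1967, 3, 0), (1969, 2, 2),
    (1970, 3, 1), (1971, 12, 6), (1972, 2, 2), (1973, 3, 0), (1974, 12, 8),
    (1975, 26, 23), (1976, 11, 7), (1990, 5, 8), (1991, 23, 36), (1992, 8, 8),
]


def wald_bounds(f, n, z=Z_95):
    """Return (width, lower, upper) of the Wald interval for f out of n."""
    p = f / n
    width = z * math.sqrt(p * (1 - p) / n)
    return width, p - width, p + width


zero_width, overshoot = [], []
for year, shall, will in DATA:
    width, lower, upper = wald_bounds(shall, shall + will)
    if width == 0:
        zero_width.append(year)      # p is exactly 0 or 1
    elif lower < 0 or upper > 1:
        overshoot.append(year)       # part of the interval is impossible

print("zero-width:", zero_width)
print("overshoot: ", overshoot)
```

On this data the zero-width list contains 1958, 1959, 1963, 1964, 1967 and 1973, and the overshoot list contains 1960 and 1970.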

Thirdly, common statistical advice (the ‘3-sigma rule’) outlaws extreme values and requires that *p* ± 3*s* ∈ [0, 1] before the Wald interval is used. This means that we simply give up estimating the error for low or high *p* values or for small *n*, a situation that is not exactly satisfactory! Fewer than half the values of *p*(*shall*) in the table satisfy this rule (the filled points in the figure above). Needless to say, when it comes to line-fitting or other less explicit uses of this estimate, such limits tend to be forgotten.
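A check of this kind might be sketched as follows. The sketch reads the rule as requiring both *p* – 3*s* and *p* + 3*s* to lie strictly inside the probability range. The strict inequalities are an assumption made here so that fully-skewed values (where *s* = 0) are excluded, and the function name `passes_three_sigma` is hypothetical.

```python
import math


def passes_three_sigma(f, n):
    """True if p - 3s and p + 3s both fall strictly between 0 and 1.

    Assumption: strict inequalities, so p = 0 or p = 1 always fails.
    """
    p = f / n
    s = math.sqrt(p * (1 - p) / n)
    return 0 < p - 3 * s and p + 3 * s < 1


# A few rows from the table: (year, f(shall), total n)
for year, f, n in [(1958, 1, 1), (1960, 5, 6), (1975, 26, 49)]:
    print(year, passes_three_sigma(f, n))
```

Of these three rows only 1975, with the largest *n* and *p* close to 0.5, passes the check.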

A similar heuristic for the χ² test (the Cochran rule) avoids employing the test where expected cell values fall below 5. This has proved so unsatisfactory that statisticians have proposed a series of competing alternatives to the chi-square test, such as the log-likelihood test, in an attempt to cope with low frequencies and skewed datasets.

### Plotting Wilson’s score interval on the same data

If, however, we apply the Wilson score interval to Table 1 we can now plot credible confidence intervals on the same data which have none of the problems observed above. This interval is computed by

*Wilson’s score interval* (*w*⁻, *w*⁺)

≡ [*p* + *z*²/(2*n*) ± *z*√(*p*(1 – *p*)/*n* + *z*²/(4*n*²))] / [1 + *z*²/*n*].
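The formula translates directly into code. The following is a minimal sketch, not the implementation used in the paper. The name `wilson_interval` and the constant `Z_95` are assumptions, and the final clamp to [0, 1] is added here only to absorb floating-point rounding when *p* is exactly 0 or 1.

```python
import math

Z_95 = 1.959964  # assumed 95% critical value of the Normal distribution


def wilson_interval(f, n, z=Z_95):
    """Return (w_minus, w_plus), the Wilson score interval for f out of n."""
    p = f / n
    z2 = z * z
    centre = p + z2 / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n))
    denom = 1 + z2 / n
    lower = (centre - spread) / denom
    upper = (centre + spread) / denom
    # Clamp only to guard against rounding error; the bounds are
    # mathematically within [0, 1] already.
    return max(0.0, lower), min(1.0, upper)


# 1960: 5 cases of shall out of 6, which overshoots under Wald
lower, upper = wilson_interval(5, 6)
print(f"1960: ({lower:.4f}, {upper:.4f})")
```

Unlike the Wald interval, the result is asymmetric about *p*: for a skewed observation the bound nearer to 0 or 1 is closer to *p* than the other.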

The figure above depicts the result of this recalculation. Previously zero-width intervals now have a large width, in some instances extending over nearly 80% of the probabilistic range. This is as one would expect: they represent highly uncertain observations rather than certain ones. The overshooting 1960 and 1970 datapoints in the first graph now fall within the probability range. The intervals for 1969 and 1972, which extended over nearly the entire range, have shrunk.
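These three observations can be checked numerically by computing both intervals side by side for the years concerned. As before, this is an illustrative sketch: `wald` and `wilson` are hypothetical helper names, and *z* is assumed to be the 95% critical value.

```python
import math

Z = 1.959964  # assumed 95% critical value of the Normal distribution


def wald(f, n, z=Z):
    """Wald (Gaussian) interval for f out of n."""
    p = f / n
    w = z * math.sqrt(p * (1 - p) / n)
    return p - w, p + w


def wilson(f, n, z=Z):
    """Wilson score interval for f out of n, clamped against rounding."""
    p = f / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return max(0.0, (centre - spread) / denom), min(1.0, (centre + spread) / denom)


# (year, f(shall), total n): zero-width, overshoot and near-full-range cases
for year, f, n in [(1958, 1, 1), (1960, 5, 6), (1969, 2, 4), (1970, 3, 4)]:
    a_lo, a_hi = wald(f, n)
    w_lo, w_hi = wilson(f, n)
    print(f"{year}: Wald ({a_lo:.4f}, {a_hi:.4f}) width {a_hi - a_lo:.4f} | "
          f"Wilson ({w_lo:.4f}, {w_hi:.4f}) width {w_hi - w_lo:.4f}")
```

For 1958 the Wilson interval runs from roughly 0.21 to 1, a width of nearly 0.8. For 1969 it is narrower than the Wald interval, and for 1960 and 1970 the upper bound stays below 1.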

The Wilson score interval is not perfect, but it is a tremendous start. It is possible to add a continuity correction (similar to Yates’ adjustment) to the Wilson interval, which slightly increases the widths of the intervals above. In the paper we show that, even without a continuity correction, it is a more reliable interval than those obtained with log-likelihood using complex search methods!

The Wald interval, on the other hand, is premised on a mathematical error, discussed in the paper, that Wilson’s formulation corrects. It is not a good basis for further generalisation. For this reason the Wald interval can be said to be not just problematic but **wrong**, and its use should be discontinued.

### References

Aarts, B., Close, J. and Wallis, S.A. 2013. Choices over time: methodological issues in investigating current change. Chapter 2 in Aarts, B., Close, J., Leech, G. and Wallis, S.A. (eds.) *The Verb Phrase in English*. Cambridge: CUP.

Wallis, S.A. 2013. Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods. *Journal of Quantitative Linguistics* **20**:3, 178-208.