### Introduction

One of the questions that keeps coming up with students is the following.

*What does the Wilson score interval represent, and why is it the right way to calculate a confidence interval based around an observation? *

In this blog post I will attempt to explain, in a series of hopefully simple steps, how we get from the Binomial distribution to the Wilson score interval. I have written about this in a more ‘academic’ style elsewhere, but I haven’t spelled it out in a blog post.

### The Binomial distribution

The Binomial distribution is the **mathematically-ideal** distribution of the total frequency obtained from a **binomial sampling procedure**.

#### The Binomial sampling procedure

Here is an example I performed in class. I asked twenty students to toss a coin ten times and count up the number of heads they obtained. I then asked them to put their hands up if they got zero heads, one head, two heads,… right up to ten heads.

The pattern I obtained was something like the following. No students reported getting all tails (no heads) or all heads (no tails). In this histogram, “Frequency” means the total number of students scoring *r* heads.

This is a pretty ragged distribution, which is actually representative of the patterns you will tend to get if you only perform the sampling process a few times.

This graph is expressed in terms of the **frequency**, *f *of throwing *r *heads, *f*(*r*), but it could be rescaled in terms of **probability** by simply dividing *f* by 20. In this case *b*(*r*) would represent the chance that, given this data, you picked a student at random from the set who threw *r* heads.

- Each student performed the same “experiment”, so
**we ran an experiment twenty times**(or sampled twenty different datasets). - Crucially (and this is the head-scratching part)
**the graph plots the distribution of the results of twenty experiments**, not the distribution of a single experiment. - Although we tossed a coin, exactly
**the same principle applies to any other sampling procedure**, e.g. to estimate the true rate of the first person declarative modal*shall*vs.*will*alternation, provided we obtain twenty samples independently. **The more samples we take, the more accurate our overall estimate of the true rate will be**, and the best estimate is obtained by summing the samples.*But the actual rate for each sample is likely to vary between samples, as shown here.*

We’ll use *b* to represent this observed Binomial probability, and *r* to represent any value from 0 to the maximum number of throws, *n*, which in this case is 10.

#### The mathematically-ideal Binomial distribution

The mathematically-ideal Binomial distribution is smoother. It looks something like this.

This graph is the expected distribution of the probability function *B*(*r*) after an **infinite** number of runs, assuming that the probability of throwing a head, *P*, is 0.5. (We use capital letters to remind ourselves these are idealised, expected distributions.)

**To be clear: this is a distribution of samples about a population mean. **Suppose the true chance of throwing a head is 0.5. If we sample this probability by tossing a coin ten times, the most likely result would be 5 out of 10 heads, but this is not the only possible outcome. It is also possible that there would be 4 out of 10, 6 out of 10, etc. The likelihood of these other outcomes is given by the heights of each column.

To calculate this graph we don’t actually perform an infinite number of coin tosses! What we need to do is work out how many different ways you can obtain zero heads, 1 head, 2 heads, etc.

First, let’s make it simpler. Imagine for a minute we only toss the coin twice. And let’s assume our coin is fair, i.e. the chance of getting one head is 0.5.

- What is the chance of getting zero heads (or two tails, i.e.
*r*=0)? We can write this pattern as TT. - Now, what is the chance of obtaining two heads (zero tails,
*r*=2), or HH? - Finally, what is the chance of obtaining one head (one tail,
*r*=1)?

We can obtain the middle pattern in **two** distinct ways – either by throwing one head, then a tail; or by one tail, then one head. To put it another way, we can get HT or TH. The frequency distribution looks something like this: *F*(*r*) = {1, 2, 1}, and the probability distribution *B*(*r*) = {¼, ½, ¼}.

OK, so this is a simple example. What if the expected probability is not 0.5? What about higher numbers than *n*=2?

Here is the full Binomial formula:

*Binomial probability B*(*r*; *n*, *P*) ≡ *nCr* . *P ^{r}*(1 −

*P*)

^{(n-r)}.

Let’s break this down. This function calculates the probability of getting any given number of heads, *r*, out of *n* cases (coin tosses), when the probability of throwing a single head is *P*. The first part of the equation, *nCr*, is the combinatorial function, which calculates the total number of ways (combinations) you can obtain *r* heads out of *n* throws. The second part is the chance of throwing just one of these combinations.

Needless to say, different values of *P* obtain different Binomial distributions:

Note that as *P* becomes closer to zero, the distribution becomes increasingly lop-sided. The mirror of this pattern would apply if *P* approached 1.

### Binomial → Normal

The main problem with the Binomial distribution is two-fold.

- Historically it was simply difficult to calculate, although modern computers can manage to calculate quite high values of
*nCr*without too many problems. This is obtained by multiplying and dividing three factorials, which are very large numbers (e.g. the factorial of 10, written 10!, is 1×2×3…×10 = 3,628,800). - If you need to compute a confidence interval, you need to calculate a
**cumulative value**of*B*, that is the total across a range of values of*B*(Wallis 2013). This can be computationally expensive to calculate, because you have to calculate a column total for every little vertical ‘strip’ for*r*.

So statisticians performed a trick. They said, **let us assume that the Binomial distribution is approximately the same as the Normal distribution. **

The Normal distribution (also called the **Gaussian**) can be expressed by two parameters: the **mean**, in this case *P*, and the **standard deviation**, which we will write as *S*.

- If
*n*is large, then it is much easier to calculate the Normal distribution than the Binomial, and the Binomial does tend to match it pretty closely, provided that*P*is not very close to zero (or 1). - For smaller values of
*n*, or if*P*is close to zero, then it may not match it every single time.

To see how this works, let us consider the cases above where *P*=0.3 and *P*=0.05.

Note: So far we have drawn the discrete Binomial distribution on an Interval scale, where it looks ‘chunky’, like a series of tall tower blocks clustered together. But the **width** of each block is undefined. The Binomial for *r* = 1.5 (for example) is undefined. In approximating the Normal to the Binomial we wish to compare it with a continuous distribution, the Normal, which must be plotted on a Real scale. In the following graphs, we compare the centre-point of the ‘chunk’, where *p* = 0.0, 0.1, etc.

The Normal distribution is continuous and symmetric. It follows the Binomial distribution fairly well.

As you would expect when substituting a continuous distribution line for a discrete one (series of integer steps), there is some slight disagreement between the two results, marked here as “error”.

What do these discrepancies mean?

- If the line is
**outside**the middle point of the ‘step’, the error is a conservative**‘Type II’ error**. This means that by using the Normal distribution instead of the Binomial we increase the chance of failing to reject a true null hypothesis. We fail to accept a result is significant when we should. - If the line is
**inside**the mid-point of the ‘step’, we have a radical**‘Type I’ error**. Using the Normal distribution may cause us to reject a null hypothesis that we would not have done had we used the more accurate Binomial distribution. In other words, we accept that a result is significant when we should not.

If you look at either tail end of the two distributions in Figure 6, we can see that the Binomial has a greater spread than the equivalent Normal distribution. Since we tend to use the tail ends in experimental science (where the area under the curve α = 0.05 / 2, say), this is where differences in the two distributions will have an effect on results.

Now let’s see what happens as *P* gets close to zero at *P*=0.05.

As we saw, the Binomial distribution is concentrated at zero heads. You can read this graph to mean that if you had a coin that was tails 95% times, and you then tossed it ten times, the most likely outcome (60% of the time you did this experiment) is that you would get no heads out of all ten tosses.

In this graph the Normal line does not match the Binomial steps as well as it did for *P*=0.3. You can see that it is reasonably accurate for 1 head, but the mid point of the Binomial is much higher than the Normal for two and three heads — risking an under-cautious Type I error.

Both the standard Normal and Binomial distributions sum to 1. That is, the total area under the curve is constant. You can see that when *P* is close to zero the Normal distribution “bunches up”, just like the Binomial. However, it also spans an impossible area to the left of the graph. There cannot be -1 heads, but the curve appears to include this probability. This means that in fact, the total area under the possible part of the Normal distribution is less than 1, and this simple fact alone means that for skewed values of *P*, the Normal distribution is increasingly ‘radical’.

The standard solution to this problem is to employ Yates’ continuity correction, which essentially expands the Normal line outwards a fraction. This reduces the number of errors arising out of this approximation to the Normal, as Wallis (2013) empirically demonstrates.

### Normal → Wilson

The final stage in our journey takes us to the Wilson score interval. So far we have computed Normal distributions about an expected population probability, *P*. However, when we carry out experiments with real data, whether linguistic data or not, we obtain a single observed rate, which we will call *p*. (In corp.ling.stats we use the simple convention that lower case refers to observations, and capital letters refers to population values.)

We can compute a **Gaussian (Normal) interval about P** using the

*mean*and

*standard deviation*as follows:

*mean* *x* ≡ *P* = *F*/*n*,

*standard deviation S* ≡ √*P*(1 – *P*)/*n*.

**The Gaussian interval about P** (

*E*⁻,

*E*⁺) can be written as

*P*±

*z.S*, and

*z*is the

**critical value of the standard Normal distribution**at a given error level (e.g., 0.05). The interval for

*P*is shown in the diagram below as a range on the horizontal axis centred on

*P*.

Although this is a bit of a mouthful, critical values of *z* are constant, so for any given level you can just substitute the constant for *z*. [*z*(0.05) = 1.95996 to six decimal places.]

However, we rarely know the true value of *P*! We might use this formula in a significance test (the single sample *z* test) where we assume a particular value of *P* and test against it, but rarely do we plot such confidence intervals.

We want to calculate confidence intervals around an **observed** value, *p*.

The first thing to note is that it is **incorrect** to insert *p* in place of *P* in the formula above. (Unfortunately, this is exactly what students have been taught to do for generations.) This approach leads to all kinds of confusion. See Why ‘Wald’ is Wrong, for more on this.

The correct approach was pointed out by Wilson (1927) in a paper which appears to have been read by relatively few people at the time. So much for Impact Factors! This paper was rediscovered in the late 1990s by medical statisticians keen to accurately estimate confidence intervals for skewed observations, that is where *p* is close to zero or 1.

Wilson points out that the correct solution involves an inversion of the formula above. When *p* is at the error limit for *P*, i.e. *p* = *E*⁻ or *E*⁺, then it is also true that *P* must be at the corresponding limit for *p*.

In Wallis (2013) I call this the “interval equality principle”, and offer the following sketch.

*w*⁻ and

*w*⁺ are the desired lower and upper bounds of a sample interval for any error level α:

*w*⁻ = *P*₁ ↔ *E*₁⁺ = *p* where *P*₁ < *p*, and

*w*⁺ = *P*₂ ↔ *E*₂⁻ = *p* where *P*₂ > *p*.

If the lower bound for *p* (labelled *w*⁻) is a possible population mean *P*₁, then the upper bound of *P*₁ would be *p*, and vice-versa.

This insight also allows us to use a computer to search for any confidence interval about *p* if we know how to calculate the interval about *P*. The computer calculates confidence intervals for possible values of *P* and tries different values until this equality holds. See Wallis (2013).

However we don’t need a search procedure in this case. It is possible to derive a single formula for calculating *w*⁻ and *w*⁺. This is the **Wilson score interval** formula:

*Wilson’s score interval*

(*w*⁻, *w*⁺) ≡ [*p* + *z*²/2*n* ± *z*√*p*(1 – *p*)/*n* + *z*²/4*n²*] / [1 + *z*²/*n*].

The score interval is **asymmetric** (except where *p*=0.5) and tends towards the middle of the distribution (as the figure above reveals). It cannot exceed the probability range [0, 1].

A continuity-corrected version of Wilson’s interval should be used where *n* is small.

### See also

- A statistics crib sheet
- Why ‘Wald’ is wrong
- Plotting confidence intervals on graphs
- Plotting the Wilson distribution

### References

Wilson, E.B. 1927. Probable inference, the law of succession, and statistical inference. *Journal of the American Statistical Association* **22**: 209-212.

Wallis, S.A. 2013. Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods. *Journal of Quantitative Linguistics* **20**:3, 178-208. » Post

Can you give a theoretical justification for the interval equality principle? I’m a mathematician in process of becoming a statistician and am trying to avoid hand waving. Some integral should equal some other integral.

Good question. The simple answer is that this principle is central to the definition of the Wilson interval itself. I suggest you start with Wilson’s (1927) paper and work through his original argument, which I have popularised here.

The lower bound of Wilson’s interval for

pis obtained by solving to findPinp=P+z√[P(1 –P)/N], wherezrefers to a particular critical value of the Normal distribution.The upper bound for

pcan be found with, as you might expect,p=P–z√[P(1 –P)/N].Other intervals can be obtained in the same way. Clopper-Pearson’s interval for

pis obtained by the same method using the ‘exact’ Binomial interval aboutP. Newcombe’s continuity-corrected Wilson interval derives from Yates’ continuity-corrected Normal, and you can obtain a log-likelihood interval by the same method. Hence I think it is reasonable to call this an interval equalityprinciple– that, at the threshold of significance, both intervals aboutPand a derived interval aboutpwill be at the same critical point.