### Confidence intervals

Confidence intervals on an observed rate *p* should be computed using the **Wilson** score interval method. A confidence interval on an observation *p* represents the range of values that the true population value *P* (which we cannot observe directly) may take, at a given level of confidence (e.g. 95%).

**Note:** Confidence intervals can be applied to onomasiological change (variation in choice) and semasiological change (variation in meaning), provided that *P* is **free to vary** from 0 to 1 (see Wallis 2012). Naturally, the interpretation of significant change in either case is different.

Methods for calculating intervals employ the **Gaussian approximation to the Binomial distribution**.

#### Confidence intervals on Expected (Population) values (*P*)

The Gaussian interval about *P* uses the *mean* and *standard deviation* as follows:

*mean* *x* ≡ *P* = *F*/*N*,

*standard deviation* *S* ≡ √(*P*(1 – *P*)/*N*).

**The Gaussian interval about *P*** can be written as *P* ± *E*, where *E* = *z.S*, and *z* is the **critical value of the standard Normal distribution** at a given error level (e.g., 0.05). Although this is a bit of a mouthful, critical values of *z* are constant, so for any given level you can just substitute the constant for *z*. [*z*(0.05) = 1.959964 to six decimal places.]
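If you prefer to compute the critical value rather than look it up, the following minimal sketch uses Python's standard library. The function name `z_critical` is our own, and it assumes the error level is split equally between the two tails of the distribution, which is the convention that yields the value quoted above.

```python
from statistics import NormalDist


def z_critical(alpha: float) -> float:
    """Two-tailed critical value of the standard Normal distribution.

    Assumption: the error level alpha is divided equally between
    the upper and lower tails.
    """
    return NormalDist().inv_cdf(1 - alpha / 2)


print(round(z_critical(0.05), 6))  # 1.959964
```

Because the result depends only on the error level, it can be computed once and reused as a constant.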

In summary:

*Gaussian interval* ≡ *P* ± *z*√(*P*(1 – *P*)/*N*).
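As an illustration, the summary formula might be sketched in Python as follows. The function name `gaussian_interval` and the example values (*P* = 0.5, *N* = 100) are hypothetical, chosen only to show the arithmetic.

```python
from math import sqrt

Z_95 = 1.959964  # two-tailed critical value of z at error level 0.05


def gaussian_interval(P: float, N: int, z: float = Z_95) -> tuple[float, float]:
    """Gaussian interval about a population probability P for sample size N."""
    S = sqrt(P * (1 - P) / N)  # standard deviation
    E = z * S                  # interval half-width
    return P - E, P + E


# Hypothetical example: P = 0.5, N = 100
lower, upper = gaussian_interval(P=0.5, N=100)
print(f"{lower:.4f}, {upper:.4f}")  # 0.4020, 0.5980
```

Here *S* = √(0.25/100) = 0.05, so *E* ≈ 0.098 and the interval is approximately (0.402, 0.598).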

#### Confidence intervals on Observed (Sample) values (*p*)

**We cannot use the same formula for confidence intervals about observations.** Many people try to do this!

Most obviously, if *p* gets close to zero, the error *e* can exceed *p*, so the lower bound of the interval can fall below zero, which is clearly impossible! The problem is most apparent on smaller samples (larger intervals) and skewed values of *p* (close to 0 or 1).
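The overshoot is easy to reproduce. The sketch below (mis)applies the Gaussian formula to an observed rate; the function name `naive_gaussian_interval` and the values *p* = 0.05, *n* = 20 are hypothetical, picked to combine a small sample with a skewed rate.

```python
from math import sqrt

Z_95 = 1.959964  # two-tailed critical value of z at error level 0.05


def naive_gaussian_interval(p: float, n: int, z: float = Z_95) -> tuple[float, float]:
    """Gaussian formula applied (incorrectly) to an observed rate p."""
    e = z * sqrt(p * (1 - p) / n)  # error term
    return p - e, p + e


# Hypothetical small, skewed sample: 1 case out of 20
lower, upper = naive_gaussian_interval(p=0.05, n=20)
print(lower < 0)  # True: the lower bound is an impossible negative probability
```

With these values the error *e* is roughly 0.096, almost twice *p* itself, so the lower bound falls to about -0.046.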

Although the Gaussian is a reasonable approximation for an as-yet-unknown population probability *P*, it is incorrect for an interval around an observation *p* (Wallis 2013a). However, the latter case is precisely where the Gaussian interval is used most often!

What is the correct method?