Plotting the Clopper-Pearson distribution


In Plotting the Wilson distribution (Wallis 2018), I showed how it is possible to plot the distribution of the Wilson interval for all values of α. This exercise is revealing in a number of ways.

First, it shows the relationship between

  1. the Normal distribution of probable Binomial observations about the population, ideal or given proportion P, and
  2. the corresponding distribution of probable values of P about an observed Binomial proportion, p, (referred to as the Wilson distribution, as it is based on the Wilson score interval).

Over the last few years I have become convinced that approaching statistical understanding from the perspective of the tangible observation p is more instructive and straightforward to conceptualise than approaching it (as is traditional) from the imaginary ‘true value’ in the population, P. In particular, whenever you conduct an experiment you want to know how reliable your results are (or, to put it another way, what range of values you might reasonably expect were you to repeat your experiment), not just whether they are statistically significantly different from some arbitrary number, P!

Second, and as a result, just as it is possible to see the closeness of fit between the Binomial and the Normal distribution, through this exercise we can visualise the inverse relationship between Normal and Wilson distributions. We can see immediately that it is a fallacy to assume that the distribution of probable values about p is Normal, although numerous statistics books still quote ‘Wald’-type intervals and many methods operate on this assumption. (I am intermittently amused by plots of otherwise sophisticated modelling algorithms with impossibly symmetric intervals in probability space.)
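The asymmetry can be checked numerically. The following is a minimal sketch, not the plotting code from the paper: the function names `wald_interval` and `wilson_interval` are illustrative, and the example values p = 0.1, n = 10, α = 0.05 are assumptions chosen to make the effect visible.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(p, n, alpha=0.05):
    """Symmetric 'Wald' interval: p +/- z * sqrt(p(1 - p)/n)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    e = z * sqrt(p * (1 - p) / n)
    return p - e, p + e


def wilson_interval(p, n, alpha=0.05):
    """Wilson score interval about an observed proportion p."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom


# Illustrative values only: a small sample with a skewed proportion.
p, n = 0.1, 10
wald_lo, wald_hi = wald_interval(p, n)
wilson_lo, wilson_hi = wilson_interval(p, n)

print(f"Wald:   ({wald_lo:.4f}, {wald_hi:.4f})")
print(f"Wilson: ({wilson_lo:.4f}, {wilson_hi:.4f})")
```

For these values the Wald lower bound falls below zero, which is impossible for a probability, whereas the Wilson bounds stay inside [0, 1] and sit further from p on the upper side than on the lower.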

Third, I showed in the paper that ‘the Wilson distribution’ is properly understood as two distributions: the distribution of probable values of P below and above p. If we employ a continuity correction, the two distributions become clearly distinct.

This issue sometimes throws people. Compare:

  1. the most probable location of P,
  2. the most probable location of P if we know that P < p (lower interval),
  3. the most probable location of P if we know that P > p (upper interval).

Wilson distributions correspond to (2) and (3) above, obtained by finding the roots of the Normal approximation. See Wallis (2013). The sum, or mean, of these is not (1), as becomes clearer when we plot other related distributions.

There are a number of other interesting and important conclusions from this work, including that the logit Wilson interval is in fact almost Normal, except for p = 0 or 1.

In this post I want to briefly comment on some recent computational work I conducted in preparation for my forthcoming book (Wallis, in press). This involves plotting the Clopper-Pearson distribution.

Correcting for continuity


Many conventional statistical methods employ the Normal approximation to the Binomial distribution (see Binomial → Normal → Wilson), either explicitly or buried in formulae.

The well-known Gaussian population interval (1) is

$$\text{Gaussian interval } (E^-, E^+) \equiv P \pm z\sqrt{\frac{P(1 - P)}{n}}, \tag{1}$$

where n represents the size of the sample, and z the two-tailed critical value for the Normal distribution at an error level α, more properly written $z_{\alpha/2}$. The standard deviation of the population proportion P is $S = \sqrt{P(1 - P)/n}$, so we could abbreviate the above to $(E^-, E^+) \equiv P \pm z \cdot S$.

When these methods require us to calculate a confidence interval about an observed proportion, p, we must invert the Normal formula using the Wilson score interval formula (Equation (2)).

$$\text{Wilson score interval } (w^-, w^+) \equiv \frac{p + \dfrac{z^2}{2n} \pm z\sqrt{\dfrac{p(1 - p)}{n} + \dfrac{z^2}{4n^2}}}{1 + \dfrac{z^2}{n}}. \tag{2}$$
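Equations (1) and (2) translate directly into code. This is a minimal sketch: the function names and the default α = 0.05 are assumptions for illustration, and z is obtained from the standard library's `NormalDist`.

```python
from math import sqrt
from statistics import NormalDist


def z_critical(alpha):
    """Two-tailed critical value of the standard Normal distribution."""
    return NormalDist().inv_cdf(1 - alpha / 2)


def gaussian_interval(P, n, alpha=0.05):
    """Equation (1): interval (E-, E+) about a population proportion P."""
    z = z_critical(alpha)
    S = sqrt(P * (1 - P) / n)  # standard deviation of P
    return P - z * S, P + z * S


def wilson_interval(p, n, alpha=0.05):
    """Equation (2): interval (w-, w+) about an observed proportion p."""
    z = z_critical(alpha)
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom


# Illustrative values: P = p = 0.5 with n = 100.
print(gaussian_interval(0.5, 100))  # roughly (0.402, 0.598)
print(wilson_interval(0.5, 100))    # roughly (0.404, 0.596)
```

At p = 0.5 the two intervals are both symmetric and nearly coincide. The difference between them grows as p moves towards 0 or 1, or as n shrinks.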

In a 2013 paper for JQL (Wallis 2013a), I referred to this inversion process as the ‘interval equality principle’. This means that if (2) is calculated for p = E⁻ (the Gaussian lower bound of P), then the upper bound that results, w⁺, will equal P. Similarly, for p = E⁺, the lower bound of p, w⁻, will equal P.

We might write this relationship as

p ≡ GaussianLower(WilsonUpper(p, n, α), n, α), or, alternatively
P ≡ WilsonLower(GaussianUpper(P, n, α), n, α), etc. (3)

where E⁻ = GaussianLower(P, n, α), w⁺ = WilsonUpper(p, n, α), etc.

Note. The parameters n and α become useful later on. At this stage the inversion concerns only the first parameter, p or P.

Nonetheless the general principle is that if you want to calculate an interval about an observed proportion p, you can derive it by inverting the function for the interval about the expected population proportion P, and swapping the bounds (so ‘Lower’ becomes ‘Upper’ and vice versa).
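The inversion can be checked without relying on the closed-form Wilson formula. The sketch below searches numerically for the value of P whose Gaussian lower bound equals p, then compares the result with the Wilson upper bound. The function names follow the notation in (3) but are otherwise illustrative, and the bisection search is an assumption of this example rather than the method used in the paper.

```python
from math import sqrt
from statistics import NormalDist


def gaussian_lower(P, n, alpha=0.05):
    """E-: lower bound of the Gaussian interval about P."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return P - z * sqrt(P * (1 - P) / n)


def wilson_upper(p, n, alpha=0.05):
    """w+: upper bound of the Wilson score interval about p."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre + spread) / (1 + z * z / n)


def invert_gaussian_lower(p, n, alpha=0.05, tol=1e-12):
    """Bisect on [p, 1] for the P satisfying gaussian_lower(P) = p.

    Assumes 0 < p < 1, so that exactly one such P lies above p.
    """
    lo, hi = p, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if gaussian_lower(mid, n, alpha) < p:
            lo = mid  # bound still below p: P must be larger
        else:
            hi = mid
    return (lo + hi) / 2


# Illustrative values only.
p, n = 0.3, 50
P_search = invert_gaussian_lower(p, n)
P_formula = wilson_upper(p, n)
print(P_search, P_formula)
```

The two values agree to within the search tolerance, which is the first identity in (3): the P found by inverting ‘Lower’ is the ‘Upper’ bound about p.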

In the paper, using this approach I performed a series of computational evaluations of the performance of different interval calculations, following in the footsteps of more notable predecessors. Comparison with the analogous interval calculated directly from the Binomial distribution showed that a continuity-corrected version of the Wilson score interval performed accurately.
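A comparison of this kind can be sketched in a few lines. Two assumptions are made here and are not taken from the paper. First, the continuity correction is applied by shifting p outwards by 1/(2n) before applying the Wilson formula. Second, the interval ‘calculated directly from the Binomial distribution’ is taken to be the Clopper-Pearson interval, found by a simple bisection search on Binomial tail sums. All function names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist


def wilson_interval(p, n, alpha=0.05):
    """Wilson score interval about an observed proportion p."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom


def wilson_cc_interval(p, n, alpha=0.05):
    """Wilson interval with p shifted by 1/(2n) (assumed correction)."""
    lo, _ = wilson_interval(max(0.0, p - 1 / (2 * n)), n, alpha)
    _, hi = wilson_interval(min(1.0, p + 1 / (2 * n)), n, alpha)
    return lo, hi


def binomial_tail(x, n, P):
    """Probability of x or more successes in n trials at rate P."""
    return sum(comb(n, k) * P**k * (1 - P) ** (n - k) for k in range(x, n + 1))


def solve_increasing(f, target, tol=1e-10):
    """Bisect on [0, 1] for f(P) = target, where f increases with P."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_binomial_interval(x, n, alpha=0.05):
    """Clopper-Pearson bounds for x successes out of n."""
    # Lower bound: P at which Pr(X >= x) equals alpha/2.
    lower = 0.0 if x == 0 else solve_increasing(
        lambda P: binomial_tail(x, n, P), alpha / 2)
    # Upper bound: P at which Pr(X <= x) equals alpha/2,
    # i.e. Pr(X >= x + 1) equals 1 - alpha/2.
    upper = 1.0 if x == n else solve_increasing(
        lambda P: binomial_tail(x + 1, n, P), 1 - alpha / 2)
    return lower, upper


# Illustrative values only: 3 successes out of 10.
x, n = 3, 10
plain = wilson_interval(x / n, n)
corrected = wilson_cc_interval(x / n, n)
exact = exact_binomial_interval(x, n)
print("Wilson:              ", plain)
print("Wilson (cc):         ", corrected)
print("Binomial (bisection):", exact)
```

For this single example the corrected bounds lie closer to the Binomial bounds than the uncorrected ones do. One example is only an illustration; the evaluation described above covers many values of p and n.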

The other end of the telescope


The standard approach to teaching (and thus thinking about) statistics is based on projecting distributions of ranges of expected values. The distribution of an expected value is a set of probabilities that predict what the value will be, according to a mathematical model of what you predict should happen.

For the experimentalist, this distribution is the imaginary distribution of very many repetitions of the same experiment that you may have just undertaken. It is the output of a mathematical model.

  • Note that this idea of a projected distribution is not the same as the term ‘expected distribution’. An expected distribution is a series of values you predict your data should match.
  • Thus in what follows we simply compare a single expected value P with an observed value p. This can be thought of as comparing the expected distribution E = {P, 1 – P} with the observed distribution O = {p, 1 – p}.
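A computer can stand in for the repetitions we cannot carry out ourselves. The following is a minimal simulation sketch: the values P = 0.3, n = 100, the 5,000 repetitions and the fixed random seed are all assumptions chosen for illustration. It draws many samples from a population with a known P and examines how the observed proportions p are distributed.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_observations(P, n, repeats, seed=1):
    """Repeat an n-trial experiment many times; return each observed p."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [sum(rng.random() < P for _ in range(n)) / n
            for _ in range(repeats)]


# Illustrative population value and sample size.
P, n = 0.3, 100
observed = simulate_observations(P, n, repeats=5000)

# Compare with the Normal approximation about P at alpha = 0.05.
z = NormalDist().inv_cdf(0.975)
S = sqrt(P * (1 - P) / n)
inside = sum(P - z * S <= p <= P + z * S for p in observed) / len(observed)
mean_p = sum(observed) / len(observed)

print(f"mean of observed p: {mean_p:.4f}")
print(f"share within P +/- z.S: {inside:.3f}")
```

The observed proportions cluster around P, and most of them fall inside the interval that the Normal model projects. The share is close to, but not exactly, 95% because the Binomial distribution is discrete.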

Thinking about this projected distribution represents a colossal feat of imagination: it is a projection of what you think would happen if only you had world enough and time to repeat your experiment, again and again. But often you can’t get more data. Perhaps the effort to collect your data was huge, or the data is from a finite set of available data (historical documents, patients with a rare condition, etc.). Actual replication may be impossible for material reasons.

In general, distributions of this kind are extremely hard to imagine, because they are not part of our directly-observed experience. See Why is statistics difficult? for more on this. So we already have an uphill task in getting to grips with this kind of reasoning.

Significant difference (often shortened to ‘significance’) refers to the difference between your observations (the ‘observed distribution’) and what you expect to see (the expected distribution). But to evaluate whether a numerical difference is significant, we have to take into account both the shape and spread of this projected distribution of expected values.

When you select a statistical test you do two things:

  • you choose a mathematical model which projects a distribution of possible values, and
  • you choose a way of calculating significant difference.

The problem is that in many cases it is very difficult to imagine this projected distribution, or — which amounts to the same thing — the implications of the statistical model.

When tests are selected, the main criterion you have to consider concerns the type of data being analysed (an ‘ordinal scale’, a ‘categorical scale’, a ‘ratio scale’, and so on). But the scale of measurement is only one of several parameters that allow us to predict how random selection might affect the resampling of data.

A mathematical model contains what are usually called assumptions, although it might be more accurate to call them ‘preconditions’ or parameters. If these assumptions about your data are incorrect, the test is likely to give an inaccurate result. This principle is not either/or, but can be thought of as a scale of ‘degradation’. The less the data conforms to these assumptions, the more likely your test is to give the wrong answer.

This is particularly problematic in some computational applications. The programmer could not imagine the projected distribution, so they tweaked various parameters until the program ‘worked’. In a ‘black-box’ algorithm this might not matter. If it appears to work, who cares if the algorithm is not very principled? Performance might be less than optimal, but it may still produce valuable and interesting results.

But in science there really should be no such excuse.

The question I have been asking myself for the last ten years or so is simply: can we do better? Is there a better way to teach (and think about) statistics than from the perspective of distributions projected by counter-intuitive mathematical models (taken on trust) and significance tests?