Correcting for continuity

Introduction

Many conventional statistical methods employ the Normal approximation to the Binomial distribution (see Binomial → Normal → Wilson), either explicitly or buried in formulae.

The well-known Gaussian population interval (1) is

Gaussian interval (E⁻, E⁺) ≡ P ± z√(P(1 – P)/n), (1)

where n represents the size of the sample, and z the two-tailed critical value for the Normal distribution at an error level α, more properly written zα/2. The standard deviation of the population proportion P is S = √(P(1 – P)/n), so we could abbreviate the above to (E⁻, E⁺) ≡ P ± zS.

When these methods require us to calculate a confidence interval about an observed proportion, p, we must invert the Normal formula, which yields the Wilson score interval formula (Equation (2)).

Wilson score interval (w⁻, w⁺) ≡ [p + z²/2n ± z√(p(1 – p)/n + z²/4n²)] / [1 + z²/n]. (2)

In a 2013 paper for JQL (Wallis 2013a), I referred to this inversion process as the ‘interval equality principle’. This means that if (2) is calculated for p = E⁻ (the Gaussian lower bound of P obtained from (1)), then the upper bound that results, w⁺, will equal P. Similarly, for p = E⁺, the lower bound of p, w⁻, will equal P.

We might write this relationship as

p ≡ GaussianLower(WilsonUpper(p, n, α), n, α), or, alternatively
P ≡ WilsonLower(GaussianUpper(P, n, α), n, α), etc. (3)

where E⁻ = GaussianLower(P, n, α), w⁺ = WilsonUpper(p, n, α), etc.

Note. The parameters n and α become useful later on. At this stage the inversion concerns only the first parameter, p or P.

Nonetheless, the general principle is that if you want to calculate an interval about an observed proportion p, you can derive it by inverting the function for the interval about the expected population proportion P, and swapping the bounds (so ‘Lower’ becomes ‘Upper’ and vice versa).
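
To make this concrete, here is a minimal Python sketch (my own illustration, not code from the paper; the function names are hypothetical) that implements the Gaussian interval (1) and the Wilson score interval (2) using the standard library’s NormalDist for the critical value, and checks the interval equality principle (3) numerically:

```python
# Minimal sketch of equations (1)-(3): the Wilson score interval inverts the
# Gaussian interval, and the interval equality principle can be checked numerically.
from math import sqrt
from statistics import NormalDist

def z_crit(alpha):
    """Two-tailed critical value z(alpha/2) of the standard Normal distribution."""
    return NormalDist().inv_cdf(1 - alpha / 2)

def gaussian_interval(P, n, alpha=0.05):
    """Equation (1): Gaussian interval (E-, E+) about a population proportion P."""
    z = z_crit(alpha)
    e = z * sqrt(P * (1 - P) / n)
    return P - e, P + e

def wilson_interval(p, n, alpha=0.05):
    """Equation (2): Wilson score interval (w-, w+) about an observed proportion p."""
    z = z_crit(alpha)
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom

if __name__ == "__main__":
    P, n, alpha = 0.3, 50, 0.05
    E_lower, E_upper = gaussian_interval(P, n, alpha)
    # Interval equality principle (3): the Wilson upper bound at p = E- recovers P,
    # and the Wilson lower bound at p = E+ recovers P.
    print(wilson_interval(E_lower, n, alpha)[1], "should equal", P)
    print(wilson_interval(E_upper, n, alpha)[0], "should equal", P)
```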

In the paper I used this approach to perform a series of computational evaluations of the performance of different interval calculations, following in the footsteps of more notable predecessors. Comparison with the analogous interval calculated directly from the Binomial distribution showed that a continuity-corrected version of the Wilson score interval performed accurately.
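
For reference, here is a hedged sketch of the continuity-corrected Wilson interval just mentioned. It uses one common formulation, shifting p outward by 1/2n before applying (2) and clamping the result to [0, 1], which is algebraically equivalent to the continuity-corrected formula reported by Newcombe (1998); it is illustrative only, not the paper’s code:

```python
# Illustrative sketch of a continuity-corrected Wilson score interval:
# shift p outward by half a unit of frequency (1/2n) before applying
# equation (2), then clamp the bounds to [0, 1].
from math import sqrt
from statistics import NormalDist

def wilson_cc_interval(p, n, alpha=0.05):
    z = NormalDist().inv_cdf(1 - alpha / 2)

    def wilson_bound(q, sign):
        centre = q + z * z / (2 * n)
        spread = z * sqrt(q * (1 - q) / n + z * z / (4 * n * n))
        return (centre + sign * spread) / (1 + z * z / n)

    # Continuity correction: widen the interval by 1/2n at each end.
    lower = 0.0 if p == 0 else wilson_bound(max(p - 1 / (2 * n), 0.0), -1)
    upper = 1.0 if p == 1 else wilson_bound(min(p + 1 / (2 * n), 1.0), +1)
    return max(lower, 0.0), min(upper, 1.0)

# Example: 5 cases out of 40 observations at the 0.05 error level.
print(wilson_cc_interval(5 / 40, 40))
```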

The other end of the telescope

Introduction

The standard approach to teaching (and thus thinking about) statistics is based on projecting distributions of ranges of expected values. The distribution of an expected value is a set of probabilities that predict what the value will be, according to a mathematical model of what you predict should happen.

For the experimentalist, this distribution is the imaginary distribution of very many repetitions of the same experiment that you may have just undertaken. It is the output of a mathematical model.

  • Note that this idea of a projected distribution is not the same as the term ‘expected distribution’. An expected distribution is a series of values you predict your data should match.
  • Thus in what follows we simply compare a single expected value P with an observed value p. This can be thought of as comparing the expected distribution E = {P, 1 – P} with the observed distribution O = {p, 1 – p}.
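
As a minimal illustration of that comparison (my own sketch, not taken from the post), the observed distribution O can be tested against the expected distribution E with a single-proportion z test or, equivalently, a 2 × 1 chi-square goodness-of-fit test:

```python
# Hedged sketch: comparing the observed distribution O = {p, 1 - p} with the
# expected distribution E = {P, 1 - P}. The z score and the 2 x 1 chi-square
# statistic are two views of the same comparison (here chi-square = z squared).
from math import sqrt
from statistics import NormalDist

def single_proportion_z(p, P, n):
    """z score for an observed proportion p against an expected proportion P."""
    return (p - P) / sqrt(P * (1 - P) / n)

def chi_square_2x1(p, P, n):
    """Chi-square goodness of fit for O = {p, 1 - p} against E = {P, 1 - P}."""
    observed = [p * n, (1 - p) * n]
    expected = [P * n, (1 - P) * n]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

p, P, n, alpha = 0.25, 0.30, 100, 0.05
z = single_proportion_z(p, P, n)
z_critical = NormalDist().inv_cdf(1 - alpha / 2)
print(abs(z) > z_critical)              # significant difference at level alpha?
print(chi_square_2x1(p, P, n), z ** 2)  # identical up to rounding
```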

Thinking about this projected distribution represents a colossal feat of imagination: it is a projection of what you think would happen if only you had world enough and time to repeat your experiment, again and again. But often you can’t get more data. Perhaps the effort to collect your data was huge, or the data is from a finite set of available data (historical documents, patients with a rare condition, etc.). Actual replication may be impossible for material reasons.

In general, distributions of this kind are extremely hard to imagine, because they are not part of our directly-observed experience. See Why is statistics difficult? for more on this. So we already have an uphill task in getting to grips with this kind of reasoning.

Significant difference (often shortened to ‘significance’) refers to the difference between your observations (the ‘observed distribution’) and what you expect to see (the expected distribution). But to evaluate whether a numerical difference is significant, we have to take into account both the shape and spread of this projected distribution of expected values.

When you select a statistical test you do two things:

  • you choose a mathematical model which projects a distribution of possible values, and
  • you choose a way of calculating significant difference.

The problem is that in many cases it is very difficult to imagine this projected distribution, or — which amounts to the same thing — the implications of the statistical model.

When tests are selected, the main criterion you have to consider concerns the type of data being analysed (an ‘ordinal scale’, a ‘categorical scale’, a ‘ratio scale’, and so on). But the scale of measurement is only one of several parameters that allow us to predict how random selection might affect the resampling of data.

A mathematical model contains what are usually called assumptions, although it might be more accurate to call them ‘preconditions’ or parameters. If these assumptions about your data are incorrect, the test is likely to give an inaccurate result. This principle is not either/or, but can be thought of as a scale of ‘degradation’. The less the data conforms to these assumptions, the more likely your test is to give the wrong answer.

This is particularly problematic in some computational applications, where the programmer could not imagine the projected distribution and so tweaked various parameters until the program ‘worked’. In a ‘black-box’ algorithm this might not matter. If it appears to work, who cares if the algorithm is not very principled? Performance might be less than optimal, but it may still produce valuable and interesting results.

But in science there really should be no such excuse.

The question I have been asking myself for the last ten years or so is simply: can we do better? Is there a better way to teach (and think about) statistics than from the perspective of distributions projected by counter-intuitive mathematical models (taken on trust) and significance tests?

Plotting the Wilson distribution

Introduction

Full Paper (PDF)

We have discussed the Wilson score interval at length elsewhere (Wallis 2013a, b). Given an observed Binomial proportion p = f / n, where f is the observed frequency out of n observations, and a confidence level 1 – α, the interval represents the two-tailed range of values where P, the true proportion in the population, is likely to be found. Note that f and n are integers, so whereas P is a probability, p is a proper fraction (a rational number).

The interval provides a robust method (Newcombe 1998, Wallis 2013a) for directly estimating confidence intervals on these simple observations. It can take a correction for continuity in circumstances where it is desired to perform a more conservative test and err on the side of caution. We have also shown how it can be employed in logistic regression (Wallis 2015).

The point of this paper is to explore methods for computing Wilson distributions, i.e. the analogue of the Normal distribution for this interval. There are at least two good reasons why we might wish to do this.

The first is to gain insight into the performance of the generating function (formula), the interval and the distribution itself. Plotting an interval means selecting a single error level α, whereas visualising the distribution allows us to see how the function performs over the range of possible values for α, for different values of p and n.
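
One way to trace the distribution numerically, offered here as a sketch under my own assumptions rather than as the paper’s method, is to sweep α and read the Wilson bounds as quantiles: w⁻(α) sits at cumulative probability α/2 and w⁺(α) at 1 – α/2, and differencing the resulting cumulative curve approximates a density. The helper names below are hypothetical:

```python
# Sketch (assumptions mine): trace a cumulative Wilson curve for an observation
# p, n by sweeping the error level alpha, then difference it to approximate a density.
from math import sqrt
from statistics import NormalDist

def wilson_bounds(p, n, alpha):
    z = NormalDist().inv_cdf(1 - alpha / 2)
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom

def wilson_cdf_points(p, n, steps=200):
    """Return (x, F(x)) pairs tracing the cumulative Wilson curve for p, n."""
    points = []
    for i in range(1, steps):
        alpha = i / steps                   # sweep alpha across (0, 1)
        lo, hi = wilson_bounds(p, n, alpha)
        points.append((lo, alpha / 2))      # lower bound: lower tail area alpha/2
        points.append((hi, 1 - alpha / 2))  # upper bound: upper tail area alpha/2
    return sorted(points)

def numerical_density(points):
    """Approximate the density by first differences of the cumulative curve."""
    return [((x0 + x1) / 2, (F1 - F0) / (x1 - x0))
            for (x0, F0), (x1, F1) in zip(points, points[1:]) if x1 > x0]

pts = wilson_cdf_points(p=0.3, n=20)
for x, d in numerical_density(pts)[::40]:   # print a few sample values
    print(round(x, 3), round(d, 3))
```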

A second good reason is to counteract the tendency, common in too many presentations of statistics, to present the Gaussian (‘Normal’) distribution as if it were some kind of ‘universal law of data’, a mistaken corollary of the Central Limit Theorem. This is particularly unwise in the case of observations of Binomial proportions, which are strictly bounded at 0 and 1.