We have discussed the Wilson score interval at length elsewhere (Wallis 2013a, b). Given an observed Binomial proportion p = f / n observations, and confidence level 1-α, the interval represents the two-tailed range of values where P, the true proportion in the population, is likely to be found. Note that f and n are integers, so whereas P is a probability, p is a proper fraction (a rational number).

The interval provides a robust method (Newcombe 1998, Wallis 2013a) for directly estimating confidence intervals on these simple observations. It can take a correction for continuity in circumstances where it is desired to perform a more conservative test and err on the side of caution. We have also shown how it can be employed in logistic regression (Wallis 2015).

The point of this paper is to explore methods for computing Wilson distributions, i.e. the analogue of the Normal distribution for this interval. There are at least two good reasons why we might wish to do this.

The first is to shed insight onto the performance of the generating function (formula), interval and distribution itself. Plotting an interval means selecting a single error level α, whereas visualising the distribution allows us to see how the function performs over the range of possible values for α, for different values of p and n.

A second good reason is to counteract the tendency, common in too many presentations of statistics, to present the Gaussian (‘Normal’) distribution as if it were some kind of ‘universal law of data’, a mistaken corollary of the Central Limit Theorem. This is particularly unwise in the case of observations of Binomial proportions, which are strictly bounded at 0 and 1. Continue reading “Plotting the Wilson distribution”→

This post is a little off-topic, as the exercise I am about to illustrate is not one that most corpus linguists will have to engage in.

However, I think it is a good example of why a mathematical approach to statistics (instead of the usual rote-learning of tests) is extremely valuable.

Case study: The declared ‘deficit’ in the USS pension scheme

At the time of writing (March 2018) nearly two hundred thousand university staff in the UK are active members of a pension scheme called USS. This scheme draws in income from these members and pays out to pensioners. Every three years the pension is valued, which is not a simple process. The valuation consists of two aspects, both uncertain:

to value the liabilities of the pension fund, which means the obligations to current pensioners and future pensioners (current active members), and

to estimate the future asset value of the pension fund when the scheme is obliged to pay out to pensioners.

What happened in 2017 (and happened in the last two valuations) is that the pension fund has been declared to be in deficit, meaning that the liabilities are greater than the assets. However, in all cases this ‘deficit’ is a projection forwards in time. We do not know how long people will actually live, so we don’t know how much it will cost to pay them a pension. And we don’t know what the future values of assets held by the pension fund will be.

The September valuation

In September 2017, the USS pension fund published a table which included two figures using the method of accounting they employed at the time to value the scheme.

They said the best estimate of the outcome was a surplus of £8.3 billion.

But they said that the deficit allowing for uncertainty (‘prudence’) was –£5.1 billion.

Now, if a pension fund is in deficit, it matters a great deal! Someone has to pay to address the deficit. Either the rules of the pension fund must change (so cutting the liabilities) or the assets must be increased (so the employers and/or employees, who pay into the pension fund must pay more). The dispute about the deficit engulfed UK universities in March 2018 with strikes by many tens of thousands of staff, lectures cancelled, etc. But is there really a ‘deficit’, and if so, what does this tell us?

The first additional bit of information we need to know is how the ‘uncertainty’ is modelled. In February 2018 I got a useful bit of information. The ‘deficit’ is the lower bound on a 33% confidence interval (α = 2/3). This is an interval that divides the distribution into thirds by area. One third is below the lower bound, one third above the upper bound, and one third is in the middle. This gives us a picture that looks something like this:

Of course, experimentalist statisticians will never use such an error-prone confidence interval. We wouldn’t touch anything below 95% (α = 0.05)! To make things a bit more confusing, the actuaries talk about this having a ‘67% level of prudence’ meaning that two-thirds of the distribution is above the lower bound. All of this is fine, but it means we must proceed with care to decode the language and avoid making mistakes.

In any case, the distribution of this interval is approximately Normal. The detailed graphs I have seen of USS’s projections are a bit more shaky (which makes them appear a bit more ‘sciency’), but let’s face it, these are projections with a great deal of uncertainty. It is reasonable to employ a Normal approximation and use a ‘Wald’ interval in this case because the interval is pretty much unbounded – the outcome variable could eventually fall over a large range. (Note that we recommend Wilson intervals on probability ranges precisely because probability p is bounded by 0 and 1.) Continue reading “Mathematical operations with the Normal distribution”→

Recently I’ve been working on a problem that besets researchers in corpus linguistics who work with samples which are not drawn randomly from the population but rather are taken from a series of sub-samples. These sub-samples (in our case, texts) may be randomly drawn, but we cannot say the same for any two cases drawn from the same sub-sample. It stands to reason that two cases taken from the same sub-sample are more likely to share a characteristic under study than two cases drawn entirely at random. I introduce the paper elsewhere on my blog.

In this post I want to focus on an interesting and non-trivial result I needed to address along the way. This concerns the concept of variance as it applies to a Binomial distribution.

Most students are familiar with the concept of variance as it applies to a Gaussian (Normal) distribution. A Normal distribution is a continuous symmetric ‘bell-curve’ distribution defined by two variables, the mean and the standard deviation (the square root of the variance). The mean specifies the position of the centre of the distribution and the standard deviation specifies the width of the distribution.

Common statistical methods on Binomial variables, from χ² tests to line fitting, employ a further step. They approximate the Binomial distribution to the Normal distribution. They say, although we know this variable is Binomially distributed, let us assume the distribution is approximately Normal. The variance of the Binomial distribution becomes the variance of the equivalent Normal distribution.

In this methodological tradition, the variance of the Binomial distribution loses its meaning with respect to the Binomial distribution itself. It seems to be only valuable insofar as it allows us to parameterise the equivalent Normal distribution.

What I want to argue is that in fact, the concept of the variance of a Binomial distribution is important in its own right, and we need to understand it with respect to the Binomial distribution, not the Normal distribution. Sometimes it is not necessary to approximate the Binomial to the Normal, and if we can avoid this approximation our results are likely to be stronger as a result.