Plotting the Wilson distribution

Introduction

We have discussed the Wilson score interval at length elsewhere (Wallis 2013a, b). Given an observed Binomial proportion p = f / n, where f is the observed frequency out of n observations, and a confidence level 1-α, the interval represents the two-tailed range of values where P, the true proportion in the population, is likely to be found. Note that f and n are integers, so whereas P is a probability, p is a proper fraction (a rational number).
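As a reminder of what the interval looks like in practice, here is a minimal Python sketch of the standard Wilson score formula without continuity correction. The function name wilson_interval and its argument names are my own choices for illustration, not taken from the paper, and the final clamp to [0, 1] is only there to guard against floating-point rounding.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(f, n, alpha=0.05):
    """Two-tailed Wilson score interval for p = f / n at error level alpha.

    Hypothetical helper for illustration; no continuity correction.
    """
    p = f / n
    # Two-tailed critical value of the standard Normal distribution.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp only to absorb rounding error; the formula itself stays in [0, 1].
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


lower, upper = wilson_interval(5, 10)
print(f"f = 5, n = 10: ({lower:.4f}, {upper:.4f})")

lower0, upper0 = wilson_interval(0, 10)
print(f"f = 0, n = 10: ({lower0:.4f}, {upper0:.4f})")
```

Note that when f = 0 the lower bound sits at zero rather than falling below it, which is the boundary behaviour we want from an interval on a proportion.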

The interval provides a robust method (Newcombe 1998, Wallis 2013a) for directly estimating confidence intervals on these simple observations. It can take a correction for continuity in circumstances where it is desired to perform a more conservative test and err on the side of caution. We have also shown how it can be employed in logistic regression (Wallis 2015).

The point of this paper is to explore methods for computing Wilson distributions, i.e. the analogue of the Normal distribution for this interval. There are at least two good reasons why we might wish to do this.

The first is to gain insight into the performance of the generating function (formula), the interval and the distribution itself. Plotting an interval means selecting a single error level α, whereas visualising the distribution allows us to see how the function performs over the range of possible values for α, for different values of p and n.

A second good reason is to counteract the tendency, common in too many presentations of statistics, to present the Gaussian (‘Normal’) distribution as if it were some kind of ‘universal law of data’, a mistaken corollary of the Central Limit Theorem. This is particularly unwise in the case of observations of Binomial proportions, which are strictly bounded at 0 and 1.
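To see why the bounds matter, consider what a Normal (‘Wald’) interval does with a skewed observation in a small sample. The following is a rough sketch only; the function name wald_interval and the example values p = 0.1, n = 10 are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(p, n, alpha=0.05):
    """Normal approximation ('Wald') interval about an observed proportion p.

    Hypothetical helper for illustration; nothing constrains it to [0, 1].
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    s = sqrt(p * (1 - p) / n)  # Normal approximation to the standard deviation
    return p - z * s, p + z * s


wald_lower, wald_upper = wald_interval(0.1, 10)
print(f"Wald interval for p = 0.1, n = 10: ({wald_lower:.4f}, {wald_upper:.4f})")
print("lower bound is below zero:", wald_lower < 0)
```

The lower bound comes out negative, which is an impossible value for a proportion. A symmetric Normal curve centred on p simply does not know that the scale ends at 0 and 1.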

Mathematical operations with the Normal distribution

This post is a little off-topic, as the exercise I am about to illustrate is not one that most corpus linguists will have to engage in.

However, I think it is a good example of why a mathematical approach to statistics (instead of the usual rote-learning of tests) is extremely valuable.

Case study: The declared ‘deficit’ in the USS pension scheme

At the time of writing (March 2018) nearly two hundred thousand university staff in the UK are active members of a pension scheme called USS. This scheme draws in income from these members and pays out to pensioners. Every three years the pension is valued, which is not a simple process. The valuation consists of two aspects, both uncertain:

  • to value the liabilities of the pension fund, which means the obligations to current pensioners and future pensioners (current active members), and
  • to estimate the future asset value of the pension fund when the scheme is obliged to pay out to pensioners.

What happened in 2017 (and in the previous two valuations) is that the pension fund was declared to be in deficit, meaning that the liabilities are greater than the assets. However, in all cases this ‘deficit’ is a projection forwards in time. We do not know how long people will actually live, so we don’t know how much it will cost to pay them a pension. And we don’t know what the future values of assets held by the pension fund will be.

The September valuation

In September 2017, the USS pension fund published a table which included two figures using the method of accounting they employed at the time to value the scheme.

  • They said the best estimate of the outcome was a surplus of £8.3 billion.
  • But they said that the outcome allowing for uncertainty (‘prudence’) was a deficit of £5.1 billion (i.e. –£5.1 billion).

Now, if a pension fund is in deficit, it matters a great deal! Someone has to pay to address the deficit. Either the rules of the pension fund must change (so cutting the liabilities) or the assets must be increased (so the employers and/or employees, who pay into the pension fund must pay more). The dispute about the deficit engulfed UK universities in March 2018 with strikes by many tens of thousands of staff, lectures cancelled, etc. But is there really a ‘deficit’, and if so, what does this tell us?

The first additional bit of information we need to know is how the ‘uncertainty’ is modelled. In February 2018 I got a useful bit of information. The ‘deficit’ is the lower bound on a 33% confidence interval (α = 2/3). This is an interval that divides the distribution into thirds by area. One third is below the lower bound, one third above the upper bound, and one third is in the middle. This gives us a picture that looks something like this:

Figure 1: Sketch of the probability distribution of the difference between USS assets and liabilities projected on September valuation assumptions (delayed ‘de-risking’).

Of course, experimentalist statisticians will never use such an error-prone confidence interval. We wouldn’t touch anything below 95% (α = 0.05)! To make things a bit more confusing, the actuaries talk about this having a ‘67% level of prudence’ meaning that two-thirds of the distribution is above the lower bound. All of this is fine, but it means we must proceed with care to decode the language and avoid making mistakes.

In any case, the distribution of this interval is approximately Normal. The detailed graphs I have seen of USS’s projections are a bit more shaky (which makes them appear a bit more ‘sciency’), but let’s face it, these are projections with a great deal of uncertainty. It is reasonable to employ a Normal approximation and use a ‘Wald’ interval in this case because the interval is pretty much unbounded – the outcome variable could eventually fall over a large range. (Note that we recommend Wilson intervals on probability ranges precisely because probability p is bounded by 0 and 1.)
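Taking the Normal approximation at face value, the two published figures are enough to pin down the whole distribution. The sketch below is my own back-of-envelope calculation under stated assumptions: the best estimate (£8.3 billion) is treated as the mean, and the ‘prudent’ figure (–£5.1 billion) as the point with one third of the area below it. The variable names are illustrative.

```python
from statistics import NormalDist

best_estimate = 8.3   # £ billion, treated as the mean (assumption)
prudent = -5.1        # £ billion, lower bound of the 33% interval

# z score that cuts off the lower third of a standard Normal distribution.
z_third = NormalDist().inv_cdf(1 / 3)

# The lower bound sits z_third standard deviations from the mean,
# so we can solve for the standard deviation.
sd = (prudent - best_estimate) / z_third

# Probability that assets minus liabilities falls below zero.
projection = NormalDist(mu=best_estimate, sigma=sd)
p_deficit = projection.cdf(0.0)

print(f"z for lower third: {z_third:.4f}")
print(f"implied standard deviation: £{sd:.1f} billion")
print(f"probability of any deficit: {p_deficit:.3f}")
```

On these assumptions the standard deviation is several times larger than the best estimate itself, and the chance of any deficit at all is somewhat above one third but well below one half.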

The variance of Binomial distributions

Introduction

Recently I’ve been working on a problem that besets researchers in corpus linguistics who work with samples which are not drawn randomly from the population but rather are taken from a series of sub-samples. These sub-samples (in our case, texts) may be randomly drawn, but we cannot say the same for any two cases drawn from the same sub-sample. It stands to reason that two cases taken from the same sub-sample are more likely to share a characteristic under study than two cases drawn entirely at random. I introduce the paper elsewhere on my blog.

In this post I want to focus on an interesting and non-trivial result I needed to address along the way. This concerns the concept of variance as it applies to a Binomial distribution.

Most students are familiar with the concept of variance as it applies to a Gaussian (Normal) distribution. A Normal distribution is a continuous symmetric ‘bell-curve’ distribution defined by two parameters, the mean and the standard deviation (the square root of the variance). The mean specifies the position of the centre of the distribution and the standard deviation specifies the width of the distribution.

Common statistical methods on Binomial variables, from χ² tests to line fitting, employ a further step. They approximate the Binomial distribution to the Normal distribution. They say, although we know this variable is Binomially distributed, let us assume the distribution is approximately Normal. The variance of the Binomial distribution becomes the variance of the equivalent Normal distribution.
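To make this step concrete, the following Python sketch computes the mean and variance of a Binomial distribution directly from its probabilities and compares them with the textbook values nP and nP(1 – P) that are handed on to the Normal approximation. The values n = 10 and P = 0.3 and the function name binomial_pmf are assumptions chosen for illustration.

```python
from math import comb


def binomial_pmf(n, P):
    """Probability of each frequency r = 0..n for a Binomial(n, P) variable."""
    return [comb(n, r) * P**r * (1 - P) ** (n - r) for r in range(n + 1)]


n, P = 10, 0.3  # illustrative values
pmf = binomial_pmf(n, P)

# Mean and variance computed from the discrete distribution itself.
mean = sum(r * prob for r, prob in enumerate(pmf))
variance = sum((r - mean) ** 2 * prob for r, prob in enumerate(pmf))

print(f"mean from distribution:     {mean:.4f}  (nP = {n * P:.4f})")
print(f"variance from distribution: {variance:.4f}  (nP(1-P) = {n * P * (1 - P):.4f})")

# On the proportion scale (dividing frequencies by n) the variance is P(1-P)/n.
print(f"variance of proportion:     {variance / n**2:.4f}")
```

The point is that the variance is a property of the Binomial distribution in its own right. No Normal curve is involved anywhere in this calculation.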

In this methodological tradition, the variance of the Binomial distribution loses its meaning with respect to the Binomial distribution itself. It seems to be only valuable insofar as it allows us to parameterise the equivalent Normal distribution.

What I want to argue is that in fact, the concept of the variance of a Binomial distribution is important in its own right, and we need to understand it with respect to the Binomial distribution, not the Normal distribution. Sometimes it is not necessary to approximate the Binomial to the Normal, and if we can avoid this approximation our results are likely to be stronger as a result.

Binomial confidence intervals and contingency tests

Abstract

Many statistical methods rely on an underlying mathematical model of probability which is based on a simple approximation, one that is simultaneously well-known and yet frequently poorly understood.

This approximation is the Normal approximation to the Binomial distribution, and it underpins a range of statistical tests and methods, including the calculation of accurate confidence intervals, performing goodness of fit and contingency tests, line- and model-fitting, and computational methods based upon these. What these methods have in common is the assumption that the likely distribution of error about an observation is Normally distributed.

The assumption allows us to construct simpler methods than would otherwise be possible. However, this assumption is fundamentally flawed.

This paper is divided into two parts: fundamentals and evaluation. First, we examine the estimation of error using three approaches: the ‘Wald’ (Normal) interval, the Wilson score interval and the ‘exact’ Clopper-Pearson Binomial interval. Whereas the first two can be calculated directly from formulae, the Binomial interval must be approximated by computational search, and is computationally expensive. However, this interval provides the most precise significance test, and therefore will form the baseline for our later evaluations.
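To illustrate what ‘computational search’ means here, the following Python sketch finds Clopper-Pearson bounds by simple bisection on the cumulative Binomial distribution. It is one possible search strategy under my own assumptions, not necessarily the procedure used in the paper, and the function names are hypothetical.

```python
from math import comb


def binomial_cdf(f, n, P):
    """P(X <= f) for X ~ Binomial(n, P)."""
    return sum(comb(n, r) * P**r * (1 - P) ** (n - r) for r in range(f + 1))


def bisect_root(func, lo=0.0, hi=1.0, tol=1e-10):
    """Find a root of a monotonic function that changes sign on [lo, hi]."""
    sign_lo = func(lo) > 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (func(mid) > 0) == sign_lo:
            lo = mid  # root lies in the upper half
        else:
            hi = mid  # root lies in the lower half
    return (lo + hi) / 2


def clopper_pearson(f, n, alpha=0.05):
    """'Exact' two-tailed Binomial interval for f successes out of n."""
    # Lower bound: the P for which observing f or more has probability alpha/2.
    lower = 0.0 if f == 0 else bisect_root(
        lambda P: (1 - binomial_cdf(f - 1, n, P)) - alpha / 2)
    # Upper bound: the P for which observing f or fewer has probability alpha/2.
    upper = 1.0 if f == n else bisect_root(
        lambda P: binomial_cdf(f, n, P) - alpha / 2)
    return lower, upper


cp_lower, cp_upper = clopper_pearson(5, 10)
print(f"f = 5, n = 10: ({cp_lower:.4f}, {cp_upper:.4f})")
```

Each bisection step sums up to n + 1 Binomial terms, and dozens of steps are needed per bound, which gives some sense of why this interval is expensive compared with a closed formula.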

We consider two further refinements: employing log-likelihood in computing intervals (also requiring search) and the effect of adding a correction for the transformation from a discrete distribution to a continuous one.

In the second part of the paper we carry out a thorough evaluation of this range of approaches against three distinct test paradigms. These paradigms are the single interval or 2 × 1 goodness of fit test, and two variations on the common 2 × 2 contingency test. We evaluate the performance of each approach by a ‘practitioner strategy’. Since standard advice is to fall back to ‘exact’ Binomial tests in conditions when approximations are expected to fail, we simply count the number of instances where one test obtains a significant result when the equivalent exact test does not, across an exhaustive set of possible values.

We demonstrate that optimal methods are based on continuity-corrected versions of the Wilson interval or Yates’ test, and that commonly-held assumptions about weaknesses of χ² tests are misleading.

Log-likelihood, often proposed as an improvement on χ², performs disappointingly. At this level of precision we note that we may distinguish the two types of 2 × 2 test according to whether the independent variable partitions the data into independent populations, and we make practical recommendations for their use.

Introduction

Estimating the error in an observation is the first, crucial step in inferential statistics. It allows us to make predictions about what would happen were we to repeat our experiment multiple times, and, because each observation represents a sample of the population, predict the true value in the population (Wallis 2013).

Consider an observation that a proportion p of a sample of size n is of a particular type.

For example

  • the proportion p of coin tosses in a set of n throws that are heads,
  • the proportion of light bulbs p in a production run of n bulbs that fail within a year,
  • the proportion of patients p who have a second heart attack within six months after a drug trial has started (n being the number of patients in the trial),
  • the proportion p of interrogative clauses n in a spoken corpus that are finite.

We have one observation of p, as the result of carrying out a single experiment. We now wish to infer about the future. We would like to know how reliable our observation of p is without further sampling. Obviously, we don’t want to repeat a drug trial on cardiac patients if the drug may be adversely affecting their survival.
