Plotting the Wilson distribution

Introduction Full Paper (PDF)

We have discussed the Wilson score interval at length elsewhere (Wallis 2013a, b). Given an observed Binomial proportion p = f / n observations, and confidence level 1-α, the interval represents the two-tailed range of values where P, the true proportion in the population, is likely to be found. Note that f and n are integers, so whereas P is a probability, p is a proper fraction (a rational number).

The interval provides a robust method (Newcombe 1998, Wallis 2013a) for directly estimating confidence intervals on these simple observations. It can take a correction for continuity in circumstances where it is desired to perform a more conservative test and err on the side of caution. We have also shown how it can be employed in logistic regression (Wallis 2015).

The point of this paper is to explore methods for computing Wilson distributions, i.e. the analogue of the Normal distribution for this interval. There are at least two good reasons why we might wish to do this.

The first is to shed insight onto the performance of the generating function (formula), interval and distribution itself. Plotting an interval means selecting a single error level α, whereas visualising the distribution allows us to see how the function performs over the range of possible values for α, for different values of p and n.

A second good reason is to counteract the tendency, common in too many presentations of statistics, to present the Gaussian (‘Normal’) distribution as if it were some kind of ‘universal law of data’, a mistaken corollary of the Central Limit Theorem. This is particularly unwise in the case of observations of Binomial proportions, which are strictly bounded at 0 and 1. Continue reading

Advertisements

Mathematical operations with the Normal distribution

This post is a little off-topic, as the exercise I am about to illustrate is not one that most corpus linguists will have to engage in.

However, I think it is a good example of why a mathematical approach to statistics (instead of the usual rote-learning of tests) is extremely valuable.

Case study: The declared ‘deficit’ in the USS pension scheme

At the time of writing nearly two hundred thousand university staff in the UK are active members of a pension scheme called USS. This scheme draws in income from these members and pays out to pensioners. Every three years the pension is valued, which is not a simple process. The valuation consists of two aspects, both uncertain:

  • to value the liabilities of the pension fund, which means the obligations to current pensioners and future pensioners (current active members), and
  • to estimate the future asset value of the pension fund when the scheme is obliged to pay out to pensioners.

What happened in 2017 (and happened in the last two valuations) is that the pension fund has declared itself to be in deficit, meaning that the liabilities are greater than the assets. However, in all cases this ‘deficit’ is a projection forwards in time. We do not know how long people will actually live, so we don’t know how much it will cost to pay them a pension. And we don’t know what the future values of assets held by the pension fund will be.

The September valuation

In September 2017, the USS pension fund published a table which included two figures using the method of accounting they employed at the time to value the scheme.

  • They said the best estimate of the outcome was a surplus of £8.3 billion.
  • But they said that the deficit allowing for uncertainty (‘prudence’) was –£5.1 billion.

Now, if a pension fund is in deficit, it matters a great deal! Someone has to pay to address the deficit. Either the rules of the pension fund must change (so cutting the liabilities) or the assets must be increased (so the employers and/or employees, who pay into the pension fund must pay more). The dispute about the deficit is now engulfing UK universities with strikes by many tens of thousands of staff, lectures cancelled, etc. But is there really a ‘deficit’, and if so, what does this tell us?

The first additional bit of information we need to know is how the ‘uncertainty’ is generated. In February 2018 I got a useful bit of information. The ‘deficit’ is the lower bound on a 33% confidence interval (α = 2/3). This is an interval that divides the distribution into thirds by area. One third is below the lower bound, one third above the upper bound, and one third is in the middle. This gives us a picture that looks something like this:

Figure 1: Sketch of the probability distribution of the difference between USS assets and liabilities projected on September valuation assumptions (delayed ‘de-risking’).

Of course, experimental statisticians will never use such an error-prone confidence interval. We wouldn’t touch anything below 95% (α = 0.05)! To make things a bit more confusing, the actuaries talk about this having a ‘67% level of prudence’ meaning that two-thirds of the distribution is above the lower bound. All of this is fine, but it means we must proceed with care to decode the language and avoid making mistakes.

In any case, the distribution of this interval is approximately Normal. The detailed graphs I have seen of USS’s projections are a bit more shaky (which makes them appear a bit more ‘sciency’), but let’s face it, these are projections with a great deal of uncertainty. It is reasonable to employ a Normal approximation and use a ‘Wald’ interval in this case because the interval is pretty much unbounded – the outcome variable could eventually fall over a large range. (Note that we recommend Wilson intervals on probability ranges precisely because probability p is bounded by 0 and 1.) Continue reading

Point tests and multi-point tests for separability of homogeneity

Introduction

I have been recently reviewing and rewriting a paper for publication that I first wrote back in 2011. The paper (Wallis forthcoming) concerns the problem of how we test whether repeated runs of the same experiment obtain essentially the same results, i.e. results are not significantly different from each other.

These meta-tests can be used to test an experiment for replication: if you repeat an experiment and obtain significantly different results on the first repetition, then, with a 1% error level, you can say there is a 99% chance that the experiment is not replicable.

These tests have other applications. You might be wishing to compare your results with those of others in the literature, compare results with different operationalisation (definitions of variables), or just compare results obtained with different data – such as comparing a grammatical distribution observed in speech with that found within writing.

The design of tests for this purpose is addressed within the t-testing ANOVA community, where tests are applied to continuously-valued variables. The solution concerns a particular version of an ANOVA, called “the test for interaction in a factorial analysis of variance” (Sheskin 1997: 489).

However, anyone using data expressed as discrete alternatives (A, B, C etc) has a problem: the classical literature does not explain what you should do.

Gradient and point tests

gradient-point-test

Figure 1: Point tests (A) and gradient tests (B), from Wallis (forthcoming).

The rewrite of the paper caused me to distinguish between two types of tests: ‘point tests’, which I describe below, and ‘gradient tests’. Continue reading