Further evaluation of Binomial confidence intervals

Abstract

Wallis (2013) provides an account of an empirical evaluation of Binomial confidence intervals and contingency test formulae. The main take-home message of that article was that it is possible to evaluate statistical methods objectively and provide advice to researchers that is based on an objective computational assessment.

In this article we develop that evaluation further by re-weighting estimates of error using Binomial and Fisher weighting, which is equivalent to an ‘exhaustive Monte-Carlo simulation’. We also develop an argument concerning key attributes of difference intervals: we are not merely concerned with whether differences are zero (conventionally equivalent to a significance test) but also with accurate estimation when differences may be non-zero (necessary for plotting data and comparing differences).

1. Introduction

All statistical procedures may be evaluated in terms of the rates of two distinct types of error.

  • Type I errors (false positives): this is evidence of so-called ‘radical’ or ‘anti-conservative’ behaviour, i.e. rejecting null hypotheses which should not have been rejected, and
  • Type II errors (false negatives): this is evidence of ‘conservative’ behaviour, i.e. retaining or failing to reject null hypotheses unnecessarily.

It is customary to treat these errors separately because the consequences of rejecting and retaining a null hypothesis are qualitatively distinct.

Deconstructing the chi-square


Elsewhere in this blog we introduce the concept of statistical significance by considering the reliability of a single sampled observation of a Binomial proportion: an estimate of the probability of selecting an item in the future. This allows us to develop an understanding of the likely distribution of the true value of that probability in the population. In short, were we to make future observations of that item, we could expect each sampled probability to fall outside a particular range – a confidence interval – only a fixed proportion of times, such as 1 in 20 or 1 in 100. This ‘fixed proportion’ is termed the ‘error level’, because we predict that the true value will be outside the range 1 in 20 or 1 in 100 times.
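As an illustrative sketch (ours, not from the post), we can check this ‘1 in 20’ error level by simulation, using the simple Normal approximation interval about an observed proportion; the function names here are our own:

```python
import math
import random

def normal_interval(p, n, z=1.959964):
    """Normal ('Wald') approximation interval about an observed proportion p."""
    s = math.sqrt(p * (1 - p) / n)
    return p - z * s, p + z * s

def coverage(P=0.5, n=100, trials=20000, seed=1):
    """Fraction of simulated samples whose interval contains the true P."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        f = sum(rng.random() < P for _ in range(n))  # one Binomial draw
        lo, hi = normal_interval(f / n, n)
        hits += lo <= P <= hi
    return hits / trials

# At an error level of 0.05, the true P should fall outside the interval
# roughly 1 time in 20, i.e. coverage should be close to 0.95.
```

Coverage of this simple interval is in fact slightly below the nominal 0.95, one of the deficiencies that motivates the better-performing intervals discussed below.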

This process of reasoning from a sample to future observations is termed ‘inferential statistics’. Our approach is to build our understanding in a series of stages based on confidence intervals about the single proportion. Here we will approach the same question by deconstructing the chi-square test.

A core idea of statistical inference is this: randomness is a fact of life. If you sample the same phenomenon multiple times, drawing on different data each time, it is unlikely that the observation will be identical, or – to put it in terms of an observed sample – it is unlikely that the mean value of the observation will be the same. But you are more likely than not to find the new mean near the original mean, and the larger the size of your sample, the more reliable your estimate will be. This, in essence, is the Central Limit Theorem.
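A minimal simulation of this principle (our own sketch, not from the post): the sample mean varies from sample to sample, but its spread shrinks as the sample grows, in proportion to 1/√n.

```python
import random
import statistics

def mean_spread(n, trials=2000, seed=2):
    """Standard deviation of the sample mean of n Uniform(0, 1) draws,
    estimated over many replicated samples."""
    rng = random.Random(seed)
    means = [statistics.fmean(rng.random() for _ in range(n))
             for _ in range(trials)]
    return statistics.stdev(means)

# The Central Limit Theorem predicts spread proportional to 1/sqrt(n),
# so quadrupling the sample size should roughly halve the spread.
ratio = mean_spread(25) / mean_spread(100)
```

Each replicated sample yields a different mean, but the means cluster ever more tightly around the true value as n increases.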

This principle applies to the central tendency of data, usually the arithmetic mean, but occasionally a median. It does not concern outliers: extreme but rare events (which, by the way, you should include in, not delete from, your data).

We are mainly concerned with Binomial or Multinomial proportions, i.e. the fraction of cases sampled which have a particular property. A Binomial proportion is a statement about the sample, a simple fraction p = f / n. But it is also the sample mean probability of selecting a value. Suppose we selected a random case from the sample. In the absence of any other knowledge about that case, the average chance that X = x₁ is also p.
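A toy example (ours, not from the post) illustrates the dual role of p: it is both the sample fraction f / n and the chance that a randomly drawn case has the property.

```python
import random

sample = [1] * 30 + [0] * 70      # f = 30 cases with the property, n = 100
p = sum(sample) / len(sample)     # Binomial proportion p = f / n = 0.3

# The chance that a randomly selected case has the property is also p:
rng = random.Random(3)
draws = [rng.choice(sample) for _ in range(50000)]
estimate = sum(draws) / len(draws)   # converges on p as draws increase
```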

The same principle applies to the mean of Real or Integer values, for which one might use Welch’s or Student’s t test, and the median rank of Ordinal data, for which a Mann-Whitney U test may be appropriate.

With this in mind, we can form an understanding of significance, or to be precise, significant difference. The ‘difference’ referred to here is the difference between an uncertain observed value and a predicted or known population value, d = p – P, or the difference between two uncertain observed values, d = p₂ – p₁. The first of these differences is found in a single-sample z test, the second in a two-sample z test. See Wallis (2013b).

Figure 1. The single-sample population z test. The statistical model assumes that future unobserved samples are Normally distributed, centred on the population mean P. Distance d is compared with a critical threshold, zα/2.S, to carry out the test.

A significance test is created by comparing an observed difference with a second element, a critical threshold extrapolated from the underlying statistical model of variation.
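The test in Figure 1 can be sketched as follows, with our own function name and the usual α = 0.05 critical value; the post gives no code, so treat this as one possible transcription:

```python
import math

def single_sample_z_test(p, P, n, z=1.959964):
    """Two-tailed single-sample z test: is the observed proportion p
    significantly different from the known population proportion P?"""
    S = math.sqrt(P * (1 - P) / n)   # standard deviation under the null
    d = p - P                        # observed difference
    return abs(d) > z * S            # True => reject the null hypothesis
```

For example, 60 successes in 100 trials tested against P = 0.5 gives d = 0.1 and z·S ≈ 0.098, a marginally significant result.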

Correcting for continuity


Many conventional statistical methods employ the Normal approximation to the Binomial distribution (see Binomial → Normal → Wilson), either explicitly or buried in formulae.

The well-known Gaussian population interval (1) is

Gaussian interval (E⁻, E⁺) ≡ P ± z√(P(1 – P)/n), (1)

where n represents the size of the sample, and z the two-tailed critical value for the Normal distribution at an error level α, more properly written zα/2. The standard deviation of the population proportion P is S = √(P(1 – P)/n), so we could abbreviate the above to (E⁻, E⁺) ≡ P ± z.S.
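Equation (1) transcribes directly; a sketch in Python (function name ours):

```python
import math

def gaussian_interval(P, n, z=1.959964):
    """Equation (1): the Gaussian interval (E-, E+) = P +/- z.S,
    where S = sqrt(P(1 - P)/n)."""
    S = math.sqrt(P * (1 - P) / n)
    return P - z * S, P + z * S
```

For P = 0.5 and n = 100 this yields approximately (0.402, 0.598), symmetric about P.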

When these methods require us to calculate a confidence interval about an observed proportion, p, we must invert the Normal formula using the Wilson score interval formula (Equation (2)).

Wilson score interval (w⁻, w⁺) ≡ (p + z²/2n ± z√(p(1 – p)/n + z²/4n²)) / (1 + z²/n). (2)
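Transcribing (2) as a sketch (function name ours):

```python
import math

def wilson_interval(p, n, z=1.959964):
    """Equation (2): the Wilson score interval (w-, w+) about an
    observed proportion p."""
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom
```

Unlike the Gaussian interval, the Wilson interval is asymmetric about p (except at p = 0.5) and cannot stray outside the range [0, 1].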

In a 2013 paper for JQL (Wallis 2013a), I referred to this inversion process as the ‘interval equality principle’. This means that if (2) is calculated for p = E⁻ (the Gaussian lower bound of P), then the upper bound that results, w⁺, will equal P. Similarly, for p = E⁺ (the Gaussian upper bound), the lower bound that results, w⁻, will equal P.

We might write this relationship as

p ≡ GaussianLower(WilsonUpper(p, n, α), n, α), or, alternatively
P ≡ WilsonLower(GaussianUpper(P, n, α), n, α), etc. (3)

where E⁻ = GaussianLower(P, n, α), w⁺ = WilsonUpper(p, n, α), etc.

Note. The parameters n and α become useful later on. At this stage the inversion concerns only the first parameter, p or P.

The general principle, then, is that if you want to calculate an interval about an observed proportion p, you can derive it by inverting the function for the interval about the expected population proportion P, and swapping the bounds (so ‘Lower’ becomes ‘Upper’ and vice versa).
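Relation (3) can be verified numerically. This sketch re-implements (1) and (2) with our own function names, mirroring GaussianLower and WilsonUpper in the text: the Wilson upper bound computed at P's Gaussian lower bound recovers P, up to floating-point error.

```python
import math

Z = 1.959964  # two-tailed critical value for alpha = 0.05

def gaussian_bounds(P, n, z=Z):
    """(E-, E+) from equation (1)."""
    S = math.sqrt(P * (1 - P) / n)
    return P - z * S, P + z * S

def wilson_bounds(p, n, z=Z):
    """(w-, w+) from equation (2)."""
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom

# Interval equality principle, both directions of relation (3):
P, n = 0.3, 100
e_lo, e_hi = gaussian_bounds(P, n)
w_hi = wilson_bounds(e_lo, n)[1]   # Wilson upper bound about p = E-
w_lo = wilson_bounds(e_hi, n)[0]   # Wilson lower bound about p = E+
```

Both w_hi and w_lo recover the original P = 0.3, which is what makes the Wilson interval the correct inversion of the Gaussian.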

In the paper, I used this approach to perform a series of computational evaluations of the performance of different interval calculations, following in the footsteps of more notable predecessors. Comparison with the analogous interval calculated directly from the Binomial distribution showed that a continuity-corrected version of the Wilson score interval performed accurately.