Continuity correction for risk ratio and other intervals

Introduction

In ‘An algebra of intervals’, we showed that we can calculate confidence intervals for formulae composed of common mathematical operators, including powers and logarithms. We employed a method proposed by Zou and Donner (2008), itself an extension of Newcombe (1998). Wallis (forthcoming) describes the method more formally.

However, Newcombe’s method is arguably better-founded mathematically than that of Zou and Donner, who make an additional assumption. They assume that the number scale on which two properties are distinguished is not material to the quality of the resulting interval.

Why might this assumption be problematic? Well, when we compute a difference interval with Newcombe’s method, we do so by summing squared inner interval widths. These are equal to independent variance terms (each multiplied by a constant, the square of the critical value of the Normal distribution, zα/2), which are Normal at the inner bounds. So far, so good. However, if such an interval is transformed onto a different number scale, but the same summation-of-variances (Bienaymé) method is then employed — Zou and Donner’s method — we are now summing terms which are by definition no longer Normal!
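
To make the summation concrete, here is a minimal sketch in Python (the function name and layout are mine, not the notation of the papers cited). It assumes Wilson score bounds (l1, u1) and (l2, u2) have already been computed for p1 and p2, and builds Newcombe-style bounds for the difference d = p2 – p1:

import math

def newcombe_difference(p1, l1, u1, p2, l2, u2):
    """Interval for d = p2 - p1, given Wilson bounds (l1, u1) and (l2, u2)."""
    d = p2 - p1
    # Sum squared inner interval widths (Bienaymé) on each side of d.
    lower = d - math.sqrt((p2 - l2) ** 2 + (u1 - p1) ** 2)
    upper = d + math.sqrt((u2 - p2) ** 2 + (p1 - l1) ** 2)
    return lower, upper

If the resulting interval excludes zero, the difference is deemed significant at the chosen error level, which is how the interval is used as a test in the evaluation below.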

I was suspicious of this assumption, which seemed to me to be optimistic at best, and I was concerned to evaluate it computationally. The method I used was as follows.

  1. Perform the same inner interval calculation for every potential value of two proportions, p1 and p2, over a range of sample sizes (1 to 200). This interval can be treated as a significance test analogous to the exact Fisher test (evaluating whether p1 is significantly different from p2). Thus, for a difference d = p2 – p1, if the resulting interval for d includes 0, the result is not significant. For a ratio, e.g. r = p1/p2, if the interval includes 1, the result is not significant.
  2. Compare the results of the two tests: the new interval test and Fisher’s.
  3. If there is a discrepancy in the outcome, it will be one of two types:
    1. Type I errors (our test improperly deemed the result significant), and
    2. Type II errors (our test improperly deemed the result non-significant, i.e. it failed to detect a result that was significant according to Fisher).
  4. To properly account for the chance of observing a particular pair of proportions, each error is weighted by Fisher scores before being summed. (A sketch of this procedure appears below.)
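
To illustrate the structure of this procedure, a sketch in Python follows. It is not the code used for the evaluation: the interval test and the Fisher-score weighting are passed in as placeholder functions, and significance under Fisher is judged here with scipy’s two-sided exact test.

from scipy.stats import fisher_exact

def weighted_error_rates(n1, n2, interval_test, weight, alpha=0.05):
    """Compare an interval-based test with the Fisher 'exact' test over every
    possible pair of observed frequencies, accumulating weighted errors."""
    type_i = type_ii = 0.0
    for a in range(n1 + 1):            # successes observed in sample 1
        for b in range(n2 + 1):        # successes observed in sample 2
            table = [[a, n1 - a], [b, n2 - b]]
            fisher_sig = fisher_exact(table)[1] < alpha
            test_sig = interval_test(a, n1, b, n2, alpha)
            w = weight(a, n1, b, n2)   # Fisher-score weighting (step 4)
            if test_sig and not fisher_sig:
                type_i += w            # improperly deemed significant
            elif fisher_sig and not test_sig:
                type_ii += w           # failed to detect a Fisher-significant result
    return type_i, type_ii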

This method evaluates the inner (mesial) interval close to the middle of the range. It does not evaluate the same interval for non-zero points, or for the outer interval. But unlike Monte Carlo methods, it is exhaustive.

What I found partly supported my suspicions. There was indeed an additional error cost introduced by these approximations, and this error differed by number scale (or, by the formula, which amounts to the same thing). The graph below demonstrates the scale of the issue. If we aim for α = 0.05 but then compute an interval with an additional Type I error ε of 0.03, this additional error is not negligible!

All of these interpolated intervals, including Newcombe’s for d, exhibit detectable errors, but there is some good news. We observed that employing a continuity correction reduces the scale of those errors.

Figure 1 shows an example plot obtained by this method (taken from a recent blog post). This includes computations for the simple difference d, Cohen’s h, risk and odds ratios, and the log ratio, each of which performs Zou and Donner’s difference calculation on a different number scale.

Figure 1. Difference d, Cohen’s h, odds, risk and log ratios evaluated by Fisher-weighted error rates for Type I errors, against the Fisher ‘exact’ test, computed for values of n1 = n2 ∈ {1, 2,… 200}, α = 0.05, with equal-sized samples.

One can make a number of observations about this graph: the saw-tooth behaviour, the ordering of intervals by performance, and so on. But if we want to minimise Type I errors (where we wrongly assess a non-significant difference as ‘significant’), this graph reveals that employing a continuity correction suppresses them.

Our previous evaluations showed that for unequal sample sizes, where n1 = 5n2, we tended to see a lower overall error rate (this does not quite hold for χ2). See also Table 1 below. The increased sample size for p1 (amounting to three times the data in the table overall) means that the discrete Fisher distribution is smoother, and therefore the ‘smoothing correction’ aspect of the continuity correction is less necessary. But an error remains.

Continue reading “Continuity correction for risk ratio and other intervals”

Confidence intervals for Cohen’s h

1. Introduction

Cohen’s h (Cohen, 2013) is an effect size for the difference of two independent proportions that is sometimes cited in the literature. h ranges between minus and plus pi, i.e. h ∈ [–π, π].

Jacob Cohen suggests that if |h| > 0.2, this is a ‘small effect size’, if |h| > 0.5, it is ‘medium’, and if |h| > 0.8 it is ‘large’. This conventional application of effect sizes – as a descriptive method for distinguishing sizes – is widespread.

The score is defined as the difference between the arcsine transforms of the square roots of the two Binomial proportions pi, i ∈ {1, 2}, hence the expanded range, ±π.

That is,

h = ψ(p1) – ψ(p2),    (1)

where the transform function ψ(p) is defined as

ψ(p) = 2 arcsin(√p).    (2)
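
As a minimal sketch, equations (1) and (2) translate directly into Python (the function names are mine):

import math

def psi(p):
    # Equation (2): the arcsine transform of the square root of p.
    return 2 * math.asin(math.sqrt(p))

def cohens_h(p1, p2):
    # Equation (1): the difference of the transformed proportions.
    return psi(p1) - psi(p2)

For example, cohens_h(0.6, 0.4) is approximately 0.40, a ‘small’ effect by the thresholds above.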

In this blog post I will explain how to derive an accurate confidence interval for this property h. The benefits of doing so are multiple.

  1. We can plot h scores with intervals, so we can visualise the reliability of their estimate, pay attention to the smallest bound, etc.
  2. We can compare two scores, h1 and h2, for significant difference. In other words, we can conclude that h2 > h1, or vice versa.
  3. We can reinterpret ‘large’ and ‘small’ effects for statistical power.
  4. We can consider whether an inner bound is greater than Cohen’s thresholds. Thus if h is positive and its lower bound exceeds 0.5, we can report that the likely population score is at least a ‘medium’ effect.

An absolute (unsigned and non-directional) version, |h|, is sometimes cited. We can also compute intervals for unsigned |h|. We will return to this question later.

Continue reading “Confidence intervals for Cohen’s h”

Confidence intervals

In this blog we identify efficient methods for computing confidence intervals for many properties.

When we observe any measure from sampled data, we do so in order to estimate the most likely value in the population of data – ‘the real world’, as it were – from which our data was sampled. This is subject to a small number of assumptions (the sample is randomly drawn without bias, for example). But this observed value is merely the best estimate we have, on the information available. Were we to repeat our experiment, sample new data and remeasure the property, we would probably obtain a different result.

A confidence interval is the range of values in which we predict that the true value in the population will likely be, based on our observed best estimate and other properties of the sample, subject to a certain acceptable level of error, say, 5% or 1%.

A confidence interval is like a blur in a photograph. We know where a feature of an object is, but it may be blurry. With more data, better lenses, a greater focus and longer exposure times, the blur reduces.

In order to make the reader’s task a little easier, I have summarised the main methods for calculating confidence intervals here. If the property you are interested in is not explicitly listed here, it may be found in other linked posts.

1. Binomial proportion p

The following methods for obtaining the confidence interval for a Binomial proportion have high performance.

  • The Clopper-Pearson interval
  • The Wilson score interval
  • The Wilson score interval with continuity correction

A Binomial proportion p ∈ [0, 1] represents the proportion of instances of a particular type of linguistic event, which we might call A, in a random sample of interchangeable events of either type, A or B. In corpus linguistics this means we need to be confident (as far as it is possible) that all instances of an event in our sample can genuinely alternate (all cases of A may be B, and vice versa).

These confidence intervals express the range of values where a possible population value, P, is not significantly different from the observed value p at a given error level α. This means that they are a visual manifestation of a simple significance test, where all points beyond the interval are considered significantly different from the observed value p. The difference between the intervals is due to the significance test they are derived from (respectively: Binomial test, Normal z test, z test with continuity correction).

As well as my book, Wallis (2021), a good place to start reading is Wallis (2013), ‘Binomial confidence intervals and contingency tests’.

The ‘exact’ Clopper-Pearson interval is obtained by a search procedure from the Binomial distribution. As a result, it is not easily generalised to larger sample sizes. Usually a better option is to employ the Wilson score interval (Wilson 1927), which inverts the Normal approximation to the Binomial and can be calculated by a formula. This interval may also accept a continuity correction and other adjustments for properties of the sample.
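
As a minimal sketch, the Wilson score interval can be computed as follows in Python (scipy is used here only for the Normal critical value); the continuity-corrected variant adjusts these bounds further (see Newcombe 1998 or Wallis 2013).

import math
from scipy.stats import norm

def wilson_interval(p, n, alpha=0.05):
    # z is the two-tailed critical value of the Normal distribution.
    z = norm.ppf(1 - alpha / 2)
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    width = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - width, centre + width

For example, wilson_interval(0.5, 10) yields an interval of roughly (0.24, 0.76).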

Continue reading “Confidence intervals”