Continuity correction for risk ratio and other intervals

Introduction

In An algebra of intervals, we showed that we can calculate confidence intervals for formulae composed of common mathematical operators, including powers and logarithms. We employed a method proposed by Zou and Donner (2008), itself an extension of Newcombe (1998). Wallis (forthcoming) describes the method more formally.

However, Newcombe’s method is arguably better-founded mathematically than that of Zou and Donner, who make an additional assumption. They assume that the number scale on which two properties are distinguished is not material to the quality of the resulting interval.

Why might this assumption be problematic? Well, when we compute a difference interval with Newcombe’s method, we do so by summing squared inner interval widths. These are equal to independent variance terms (multiplied by a constant, the square of the critical value of the Normal distribution, zα/2), which are Normal at inner bounds. So far, so good. However, if such an interval is transformed onto a different number scale, but the same summation-of-variance (Bienaymé) method is then employed — Zou and Donner’s method — we are now summing terms which are by definition no longer Normal!

I was suspicious of this assumption, which seemed to me to be optimistic at best, and I was concerned to evaluate it computationally. The method I used was as follows.

  1. Perform the same inner interval calculation for every potential value of two proportions, p1 and p2, over a range of sample sizes (1 to 200). This interval can be treated as a significance test equivalent to the exact Fisher test (evaluating if p1 is significantly different from p2). Thus, for a difference d = p2 – p1, if the resulting interval for d includes 0, the result is not significant. For a ratio, e.g. r = p1/p2, if the interval includes 1, the result is not significant.
  2. Compare the result of the two tests: our new test and Fisher’s.
  3. If there is a discrepancy in the outcome, it will be of one of two types:
    1. Type I errors (our test was improperly deemed significant) and
    2. Type II errors (our test was improperly deemed non-significant, i.e. it failed to detect a significant result according to Fisher).
  4. To properly account for the chance of observing a particular pair of proportions, each error is weighted by Fisher scores before being summed (see the sketch after this list).
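
A minimal sketch of this procedure, for the Newcombe-Wilson difference test, follows in Python. It is illustrative rather than the code actually used for these evaluations: in particular, it assumes that the ‘Fisher score’ weighting each error is the point (hypergeometric) probability of the observed table given its marginals, the function names are illustrative only, and it uses the uncorrected Wilson interval (substituting continuity-corrected bounds is a one-line change).

```python
from math import sqrt
from scipy.stats import norm, fisher_exact, hypergeom

def wilson(p, n, alpha):
    """Wilson score interval for a proportion p observed in n trials."""
    z = norm.ppf(1 - alpha / 2)
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom

def newcombe_significant(a, n1, b, n2, alpha):
    """True if the Newcombe-Wilson interval for d = p2 - p1 excludes zero."""
    p1, p2 = a / n1, b / n2
    l1, u1 = wilson(p1, n1, alpha)
    l2, u2 = wilson(p2, n2, alpha)
    d = p2 - p1
    lower = d - sqrt((p2 - l2) ** 2 + (u1 - p1) ** 2)
    upper = d + sqrt((u2 - p2) ** 2 + (p1 - l1) ** 2)
    return lower > 0 or upper < 0

def weighted_errors(n1, n2, alpha=0.05):
    """Sum weighted Type I / Type II discrepancies over every possible table."""
    type_i = type_ii = 0.0
    for a in range(n1 + 1):
        for b in range(n2 + 1):
            fisher_sig = fisher_exact([[a, n1 - a], [b, n2 - b]])[1] < alpha
            test_sig = newcombe_significant(a, n1, b, n2, alpha)
            if fisher_sig == test_sig:
                continue
            # weight: the point probability of this table given its margins
            # under the hypergeometric null (an assumed reading of 'Fisher scores')
            weight = hypergeom.pmf(a, n1 + n2, a + b, n1)
            if test_sig:      # significant here, but not by Fisher: Type I
                type_i += weight
            else:             # significant by Fisher, but not here: Type II
                type_ii += weight
    return type_i, type_ii

print(weighted_errors(20, 20))  # e.g. one point on the n1 = n2 series
```

Repeating this over the range of sample sizes, and for each formula, yields error totals of the kind reported in Table 1 below.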

This method evaluates the inner (mesial) interval close to the middle of the range. It does not evaluate the same interval for non-zero points, or for the outer interval. But unlike Monte Carlo methods, it is exhaustive.

What I found partly supported my suspicions. There was indeed an additional error cost introduced by these approximations, and this error differed by number scale (or, by the formula, which amounts to the same thing). The graph below demonstrates the scale of the issue. If we aim for α = 0.05 but then compute an interval with an additional Type I error ε of 0.03, this additional error is not negligible!

All of these interpolated intervals, including Newcombe’s for d, obtain detectable errors, but there is some good news. We observed that employing a continuity correction reduces the scale of those errors.

Figure 1 shows an example plot obtained by this method (taken from a recent blog post). This includes computations for simple difference d, Cohen’s h, risk and odds ratios, and logarithm, each of which performs Zou and Donner’s difference calculations on a different number scale.

Figure 1. Difference d, Cohen’s h, odds, risk and log ratios evaluated by Fisher-weighted error rates for Type I errors, against the Fisher ‘exact’ test, computed for values of n1 = n2 ∈ {1, 2,… 200}, α = 0.05, with equal-sized samples.

One can make a number of observations about this graph: the saw-tooth behaviour, the ordering of intervals by performance, and so on. But if we want to minimise Type I errors (where we wrongly assess a non-significant difference as ‘significant’), this graph reveals that employing a continuity correction suppresses them.

Our previous evaluations showed that for unequal-sized samples, where n1 = 5n2, we tended to see a lower overall error rate (this is not quite correct for χ2). See also Table 1 below. The increased sample size for p1 (amounting to 3 times the data in the table overall) means that the discrete Fisher distribution is smoother, and therefore the ‘smoothing correction’ aspect of the continuity correction is less necessary. But an error remains.

Continuity corrections, reprised

Statistical advice has tended to be rather ambivalent about continuity corrections (Yates, 1934).

In part, this is due to suspicion about so-called ‘exact’ tests. It is argued that they are not always beneficial, as they will err on the side of caution given a particular error level. The term ‘exact’ refers to the fact that an exact error rate for an observation given an expected distribution may be obtained. It does not mean that a test is exact: on the contrary, it means that it is conservative.

In my book (Wallis 2021: 106), I give a couple of worked examples where a target tail area of α/2 = 0.025 yields exact Binomial population intervals about P with upper tail areas of 0.0106 and 0.0115 respectively, i.e. around half the target error level.

But if we want to rely on intervals for the purpose of empirical evaluation, we should err on the side of caution when employing confidence intervals too.

Zou and Donner’s method is not dependent on which underlying ‘good coverage interval’ is employed. So we can substitute a continuity-corrected Wilson score interval for the standard one. For example, to compute a continuity-corrected risk ratio interval for r = p1/p2 we calculate the continuity-corrected Wilson intervals for p1 and p2 respectively, and apply the risk ratio formula as usual.
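
The combination step can be sketched as follows, assuming MOVER-style ratio limits of the kind associated with Zou and Donner’s approach (the exact formula used in An algebra of intervals may differ in detail). The inputs are the limits of any ‘good coverage’ interval for p1 and p2; here we assume they are continuity-corrected Wilson score bounds computed beforehand, and the function name is illustrative.

```python
# A sketch of a ratio interval for r = p1/p2 built from two proportion
# intervals (l1, u1) and (l2, u2), using MOVER-style limits (an assumption;
# degenerate cases such as l2 = 0 or u2 > 2*p2 need separate handling).
from math import sqrt

def ratio_interval(p1, l1, u1, p2, l2, u2):
    """Limits for r = p1/p2 given intervals for p1 and p2."""
    L = (p1*p2 - sqrt((p1*p2)**2 - l1*u2*(2*p1 - l1)*(2*p2 - u2))) / (u2*(2*p2 - u2))
    U = (p1*p2 + sqrt((p1*p2)**2 - u1*l2*(2*p1 - u1)*(2*p2 - l2))) / (l2*(2*p2 - l2))
    return L, U

# illustrative continuity-corrected Wilson bounds (rounded) for
# p1 = 0.3 (n1 = 40) and p2 = 0.5 (n2 = 40), alpha = 0.05
print(ratio_interval(0.3, 0.17, 0.47, 0.5, 0.34, 0.66))
```

If the resulting interval for r includes 1, the difference between p1 and p2 is not significant on this scale.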

Conventionally, a continuity correction is an adjustment to the Normal approximation to the Binomial distribution for a proportion p that moves the best estimate for p out by half a unit, i.e. c = 1/2n. We apply this adjustment to each interval calculation separately.

This has two well-known benefits:

  1. For small n, this rounding adjustment pushes the Normal ‘envelope’ out conservatively to incorporate the discrete Binomial interval.
  2. For skewed p, the adjustment absorbs errors due to the skewed distribution of the Binomial.

In Correcting for continuity, I derived an efficient and decomposable formula for adding a continuity correction to the Wilson score interval for p.  The method allows us to treat the continuity correction as a distinct element in the formula.

Essentially, we define Wilson functions for the lower and upper bounds of the Wilson score interval, like this:

w– = WilsonLower(p, n, α/2),
w+ = WilsonUpper(p, n, α/2).

The continuity-corrected versions of these intervals are then obtained by moving p half a unit outwards,

w–cc = WilsonLower(p – 1/2n, n, α/2),
w+cc = WilsonUpper(p + 1/2n, n, α/2).

This reformulation is elegant. In my book, and in other posts on this blog, I show how a correction for continuity may be applied in conditions where the sample size might need to be independently re-weighted, either to account for a small finite population, or to address the common situation in corpus linguistics where data is drawn from a random sample of texts, but where multiple instances are drawn from the same text or subtext, produced by the same person, etc., rather than being randomly obtained from a population of utterances.

In this blog post, I will show a further advantage of this reformulation. It makes it very easy to scale the correction factor itself.
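
For example, here is a minimal Python sketch of the decomposed functions, with the correction kept as a separate term and a scale factor γ applied to it (γ = 1 corresponds to Yates’s conventional half unit; larger values of γ are discussed in the next section). The function names are illustrative, not the original implementation.

```python
# A minimal sketch of the decomposed Wilson functions, with the continuity
# correction kept as a distinct, scalable term (gamma = 0: none; 1: Yates's
# half unit; 1.5, 2, ...: the more conservative corrections evaluated below).
from math import sqrt
from scipy.stats import norm

def wilson_lower(p, n, alpha=0.05):
    z = norm.ppf(1 - alpha / 2)
    return ((p + z*z/(2*n)) - z * sqrt(p*(1-p)/n + z*z/(4*n*n))) / (1 + z*z/n)

def wilson_upper(p, n, alpha=0.05):
    z = norm.ppf(1 - alpha / 2)
    return ((p + z*z/(2*n)) + z * sqrt(p*(1-p)/n + z*z/(4*n*n))) / (1 + z*z/n)

def wilson_cc(p, n, alpha=0.05, gamma=1.0):
    """Continuity-corrected Wilson interval: move p outwards by gamma * 1/2n,
    clamp to [0, 1], then apply the ordinary Wilson bounds."""
    c = gamma / (2 * n)
    lo = wilson_lower(max(p - c, 0.0), n, alpha)
    hi = wilson_upper(min(p + c, 1.0), n, alpha)
    return max(lo, 0.0), min(hi, 1.0)

# e.g. a 95% interval for p = 5/10 with Yates's correction vs. gamma = 1.5
print(wilson_cc(0.5, 10))             # gamma = 1 (Yates)
print(wilson_cc(0.5, 10, gamma=1.5))  # the more conservative correction
```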

Is greater correction required?

Given that we are seeing additional errors due to the difference theorem being applied on different number scales, the obvious question is whether we should employ a more conservative approach when computing intervals in these cases.

We will consider the impact of increasing the correction factor c = 1/2n. This reduces the number of Type I errors at the cost of more Type II errors, i.e. it will make the resulting interval more conservative.

Note that even for large n, where corrections for continuity are generally considered to be unnecessary, Figure 1 is evidence of a benefit in terms of reduced Type I errors. An excess error rate of over 0.01 for an assessment ostensibly against α = 0.05 is not negligible.

A simple initial assessment summing the errors across all 200 cases shows that this method is likely to be fruitful. If we double c, excess Type I errors are all-but eliminated.

            χ2       NW       Cohen’s h  risk ratio  logarithm  odds ratio
n1 = n2
  no c.c.   3.8476   3.3051   3.8610     4.0109      4.2974     4.3852
  c.c.      0.0000   0.5780   0.7589     0.8584      1.3264     1.4061
  2 × c.c.  0.0000   0.0000   0.0237     0.0950      0.0950     0.0950
n1 = 5n2
  no c.c.   3.4533   3.1207   3.2112     3.3081      3.4745     3.4975
  c.c.      0.4016   0.7377   0.7744     0.7944      0.7880     0.8090
  2 × c.c.  0.0000   0.0000   0.0000     0.0118      0.0300     0.0300

Table 1. Total Type I error rates, summed for n2 ∈ {1, 2,… 200}, α = 0.05.

What is the optimum adjustment? We can visualise the trade-off between Type I and II errors for different degrees of correction by simply multiplying c = 1/2n by a scale factor, γ, and repeating the evaluation. We plot mean excess mesial errors, ε̄, that is, mean additional errors for the mesial (inner) interval.

In Figure 2, γ = 0 refers to no correction. Yates’s conventional continuity correction is where γ = 1 (centre).

Figure 2. Evaluating the trade-off between mean Type I and Type II errors, with a continuity correction factor γ. Errors are evaluated against a Fisher test, with Fisher-weighted errors on equal-sized samples, n1 = n2 ∈ {1, 2,… 200}, α = 0.05.

This graph offers us a ready-reckoner for different number scales.

We might aim for a similar level of performance as Yates’s χ2 test. This is conservative, minimising Type I errors but permitting a small number of Type II errors.

To achieve comparable performance, we must increase γ to 1.5 for all evaluations except the odds ratio, where we might select γ = 1.75. (This ‘bump’ is due to an anomalous high-cost error at n2 = 1, so we might still consider γ = 1.5 to be acceptable.) Although the logarithm function, logp2(p1), appears to be more resistant to Type I error elimination than all others apart from the odds ratio, the mean error falls to 0.0008 for γ = 1.5.

Revealingly, this evaluation shows a very similar benefit for the Newcombe-Wilson difference interval, for which we had previously accepted an additional Type I mean error rate of around 0.0025 with the continuity-corrected interval (γ = 1). A recommendation to increase γ for this interval also makes sense.

Lest we be overly critical, applying Newcombe’s method to the difference between proportions with exact Clopper-Pearson intervals is not optimal either. The mean Type I error of 0.0047 for equal-sized data (a total error rate of 0.9334) is twice the mean error found with the conventionally continuity-corrected Wilson interval. See Further evaluation of Binomial confidence intervals. The Clopper-Pearson interval is exact — it computes a continuous interval where the error rate is precisely α/2 — but the act of combining intervals for each proportion generates errors. By contrast, the conservatism of the ordinary continuity correction with the Wilson interval absorbs almost half of these errors!
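
For reference, the substitution itself is straightforward. The sketch below computes Clopper-Pearson limits from the Beta quantile function and plugs them into Newcombe’s square-and-add combination for d = p2 – p1; it is illustrative, not the evaluation code used above, and the function names are my own labels.

```python
# Newcombe's square-and-add difference interval with exact Clopper-Pearson
# limits substituted for the Wilson score interval (a sketch of the
# comparison discussed above, not the original evaluation code).
from math import sqrt
from scipy.stats import beta

def clopper_pearson(k, n, alpha=0.05):
    """Exact (conservative) Binomial interval for p = k/n."""
    lo = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
    hi = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
    return lo, hi

def newcombe_diff_cp(k1, n1, k2, n2, alpha=0.05):
    """Interval for d = p2 - p1, combining the inner interval widths."""
    p1, p2 = k1 / n1, k2 / n2
    l1, u1 = clopper_pearson(k1, n1, alpha)
    l2, u2 = clopper_pearson(k2, n2, alpha)
    d = p2 - p1
    return (d - sqrt((p2 - l2)**2 + (u1 - p1)**2),
            d + sqrt((u2 - p2)**2 + (p1 - l1)**2))

print(newcombe_diff_cp(3, 20, 9, 20))  # e.g. p1 = 0.15, p2 = 0.45
```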

A correction factor of γ = 1.5 effectively eliminates mesial Type I errors for Newcombe’s (1998) difference interval in the equal-sample case. The interval is a little more conservative, but still well-behaved, as we can see in Figure 8 below. Errors identified for small samples disappear.

Figure 3 performs the same evaluation for unequal-sized samples where n1 = 5n2, i.e. where n1 ranges from 5 to 1,000. We can see that the overall mean rate tends to be lower, but Type I errors are more difficult to eliminate altogether, becoming negligible (at the expense of increased Type II errors) where γ = 2. Nonetheless, at γ = 1.5, the risk of additional Type I errors has fallen to around 0.001 (outperforming Yates’s test at the expense of more Type II errors), and at γ = 1.75, the rate is less than 0.0005.

The continuity-corrected χ2 test outperforms the other, independent interval difference formulae, although Yates’s test (γ = 1) is not error-free. On the other hand, the performance of Zou and Donner’s method over different interval scales seems to be remarkably consistent.

Figure 3. Evaluating the trade-off between mean Type I and Type II errors, with a continuity-correction factor γ. Errors are evaluated against a Fisher test, with Fisher-weighted errors on unequal-sized samples, n1 = 5n2, n2 ∈ {1, 2,… 200}, α = 0.05.

Substituting α = 0.01 has a comparable performance benefit, and the proposal to increase γ to 1.5 is robust. See below.

These mean excess mesial error rates are, of course, means. We can also examine the particular error rate for any one of these evaluations over sample size, as in Figure 1. For example, Figure 4 shows the effect of increasing γ on the likelihood of the Newcombe-Wilson difference interval obtaining a significant result when the Fisher test does not, for samples of equal size.

Figure 4. The effect of varying the degree of correction γ on the Newcombe-Wilson difference test (d ≠ 0) when compared to the Fisher ‘exact’ test, evaluated by Fisher-weighted error rates for Type I errors, for values of n1 = n2 ∈ {1, 2,… 200}, α = 0.05, with equal-sized samples.

Note: Some high-penalty errors for small n may be eliminated by checking whether the continuity-corrected Wilson centres overlap. For example, if p1 > p2, we could check whether p1 – γ·c1 < p2 + γ·c2 (where c1 and c2 are the respective continuity corrections). But this is beside the point. For this exercise we are not optimising the test, but evaluating the performance of the interval.

Varying α

Thus far we have focused on an error rate of α = 0.05 and considered the performance of 95% intervals. This is entirely proper: the correct way to approach statistical evaluation is to identify an error rate and stick to it. Some researchers consistently opt for 99% intervals, i.e. where α = 0.01.

The practice of opting for a smaller α in the belief that doing so yields ‘better’ results is based on a common but mistaken assumption about ‘p-values’. In truth, in selecting an error level we are simply exercising a precautionary trade-off. However, there are circumstances where one may reduce the error level α by dividing it by the number of evaluations one wishes to undertake.

So we should confirm that our assessment stands up with smaller error levels α < 0.05.

99% intervals

Figure 5 plots mean excess mesial errors (weighted by a Fisher prior), ε̄, against the continuity correction factor γ, for equal-sized samples and difference-from-zero tests for the same series of difference and ratio formulae, this time with α = 0.01.

Figure 5. Evaluating the trade-off between mean Type I and Type II errors, with a continuity-correction factor γ. Errors are evaluated against a Fisher test, with Fisher-weighted errors on equal-sized samples, n1 = n2 ∈ {1, 2,… 200}, α = 0.01.

Again, we see a trade-off at γ = 1.5 and an odds ratio ‘bump’ owing to an error at n2 = 1 (compare with Figure 2). For unequal samples where n1 = 5n2, our analysis obtains Figure 6.

Figure 6. Evaluating the trade-off between mean Type I and Type II errors, with a continuity-correction factor γ. Errors are evaluated against a Fisher test, with Fisher-weighted errors on unequal-sized samples, n1 = 5n2, n2 ∈ {1, 2,… 200}, α = 0.01.

To obtain comparable performance to Yates’s test (grey line at γ = 1), setting γ = 1.5 would suffice (cf. mean errors, Yates: 0.000836 and odds ratio, γ = 1.5: 0.000839), but further improvement is possible.

These graphs with a smaller error level reveal a greater relative difference between the functions: the plot lines appear more spread out. Indeed, in Figure 3, the Type I error rate with α = 0.05 appears to be little affected by formula, contrary to my initial concern about number scales. But this optimism may have been premature. With a smaller target error level, the different number scales obtain more distinct mean excess mesial error rate performances.

Smaller α

So what happens with even smaller target error levels, α, which might be used in experiments where multiple comparisons are performed? We obtained a mean Type I error rate for a range of error levels from 0.001 to 0.1 and plotted them for different correction factors.

Figure 7. Mean excess Type I mesial errors ε as a fraction of the target error level, α, for Fisher-weighted error levels from 0.001 to 0.1 (log scale).

Figure 7, computed with equal-sized samples, shows that irrespective of α, increasing γ reduces the number of errors, as we would expect. Indeed, expressed as a proportion of the target α, excess error rates are quite stable once we discount small samples n2 < 5, where a single error can disproportionately impact the mean.

But we can also see what we might describe as a ‘fanning out effect’ for small α. The smaller the error rate α (and the wider the interval), the greater the impact the formula and number scale have on the accuracy of the interval, especially for smaller samples.

These observations do not affect the rationale for increasing γ, however, which remains robust and well-justified. Irrespective of the α level, increasing the correction factor γ to 1.5 suppresses Type I errors (bottom of figure).

Conclusions

Continuity corrections are traditionally applied to significance tests to account for the fact that the discrete and asymmetric Binomial distribution is approximated by a smooth and symmetric Normal one. They increase the coverage of an interval by repositioning the best estimate of each proportion outwards by a term proportional to the reciprocal of the sample size.

Frequently, statistics books advise their use in a somewhat ad hoc manner, usually recommending them only for small samples and simple tests. But this always seemed odd to me. Thus in Wallis (2021: 237-239, 259), I observed that the original proportions in more complex meta-tests were no less discrete than those in conventional tests.

In this situation we might use a meta-test to compare two runs of the same evaluation. Suppose an original evaluation was performed with Yates’s χ2 test (or Fisher’s test). It would make sense to compare the two evaluations for significant difference by also employing a correction for continuity. Why employ a less conservative method for comparing the difference between two differences?

The same reasoning applies to algebraic functions of proportions. Indeed, as we have seen, we should be even more careful.

We have demonstrated that in fact, the benefits of employing a continuity correction extend beyond the small-sample and simple test situation to the task of optimising the performance of derived intervals.

  • We found that to obtain comparable performance to Yates’s χ2 test, we must employ a more conservative correction, multiplying the conventional half-unit constant proposed by Frank Yates, c = 1/2n, by 1.5.
  • In the case of the odds ratio, a multiple of 1.75 may be preferred, especially for very small samples.

We developed a ready-reckoner by plotting the mean Type I error rate for the inner interval against the scale factor γ, for equal-sized samples. Continuity corrections, traditionally seen as smoothing adjustments, are also an effective means of building in a small amount of additional conservatism when combining intervals.

These intervals are well-behaved, i.e. they perform plausibly across all values on the range, provided we address extreme proportions in ratios with a small delta. See An algebra of intervals.

How does this increased scale factor affect intervals in practice? In Figure 8 we consider the impact of applying such a correction on ratio, product, sum and difference intervals for equal-sized samples where n1 = n2 = 10.

As before, we plot operators of p1 and p2 with intervals, varying a table across a diagonal interpolation ϕ (we vary p1 from 0 to 1 and set p2 = 1 – p1).

Figure 8. Plotting intervals with γ = 0 (no correction), 1 (Yates) or 1.5 (Wallis). Although the difference between γ = 1 and 1.5 appears small (except, perhaps, for the ratio), this additional conservatism absorbs Type I errors generated by the Newcombe / Zou and Donner theorem.

References

Newcombe, R.G. (1998). Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine 17, 873-890.

Wallis, S.A. (2021). Statistics in Corpus Linguistics Research. New York: Routledge.

Wallis, S.A. (forthcoming). Accurate confidence intervals on Binomial proportions, functions of proportions and other related scores. » Post

Yates, F. (1934). Contingency tables involving small numbers and the chi-square test. Journal of the Royal Statistical Society, 1(2), 217-235.

Zou, G.Y. & A. Donner (2008). Construction of confidence limits about effect measures: A general approach. Statistics in Medicine 27:10, 1693-1702.
