Introduction
In An algebra of intervals, we showed that we can calculate confidence intervals for formulae composed of common mathematical operators, including powers and logarithms. We employed a method proposed by Zou and Donner (2008), itself an extension of Newcombe (1998). Wallis (forthcoming) describes the method more formally.
However, Newcombe’s method is arguably better founded mathematically than Zou and Donner’s, which relies on an additional assumption: that the number scale on which two properties are distinguished is not material to the quality of the resulting interval.
Why might this assumption be problematic? When we compute a difference interval with Newcombe’s method, we do so by summing squared inner interval widths. These are equal to independent variance terms (multiplied by a constant, the critical value of the Normal distribution, zα/2), which are Normal at the inner bounds. So far, so good. However, if such an interval is transformed onto a different number scale, but the same summation-of-variances (Bienaymé) method is then employed, as in Zou and Donner’s method, we are now summing terms which are, by definition, no longer Normal!
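To make the summation concrete: if (l1, u1) and (l2, u2) are the inner (Wilson score) intervals for p1 and p2 respectively, Newcombe’s interval for the difference d = p2 – p1 can be written as

$$L = d - \sqrt{(p_2 - l_2)^2 + (u_1 - p_1)^2}, \qquad U = d + \sqrt{(u_2 - p_2)^2 + (p_1 - l_1)^2}.$$

Each squared term is an inner interval width. Zou and Donner’s generalisation applies this same summation after the bounds have been transformed onto another number scale, which is precisely where the Normality assumption becomes questionable.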
I was suspicious of this assumption, which seemed to me optimistic at best, and I set out to evaluate it computationally. The method I used was as follows.
- Perform the same inner interval calculation for every possible value of the two proportions, p1 and p2, over a range of sample sizes (1 to 200). This interval can be treated as a significance test equivalent to the exact Fisher test (evaluating whether p1 is significantly different from p2). Thus, for a difference d = p2 – p1, if the resulting interval for d includes 0, the result is not significant; for a ratio, e.g. r = p1/p2, if the interval includes 1, the result is not significant.
- Compare the outcome of this new interval test with that of the Fisher test.
- If there is a discrepancy between the outcomes, it will be one of two types:
  - Type I errors (our test was improperly deemed significant), and
  - Type II errors (our test was improperly deemed non-significant, i.e. it failed to detect a result that Fisher deemed significant).
- To account properly for the chance of observing each particular pair of proportions, every error is weighted by its Fisher score before being summed.
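The steps above can be sketched in Python. This is an illustrative reconstruction, not the original code: it assumes Wilson score intervals as the inner intervals, implements a pure-Python two-sided Fisher exact test, and simply counts discrepancies rather than weighting them by Fisher scores as the study does.

```python
from math import comb, sqrt

def wilson(p, n, z=1.959964):
    """Wilson score interval for a proportion p observed in n trials."""
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half

def newcombe_diff(p1, n1, p2, n2, z=1.959964):
    """Newcombe's interval for the difference d = p2 - p1,
    summing squared inner interval widths."""
    l1, u1 = wilson(p1, n1, z)
    l2, u2 = wilson(p2, n2, z)
    d = p2 - p1
    lower = d - sqrt((p2 - l2) ** 2 + (u1 - p1) ** 2)
    upper = d + sqrt((u2 - p2) ** 2 + (p1 - l1) ** 2)
    return lower, upper

def fisher_p(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]:
    sum the probabilities of all tables with the same margins that are
    no more probable than the observed table."""
    n1, n2, k = a + b, c + d, a + c
    total = comb(n1 + n2, k)
    def prob(x):
        return comb(n1, x) * comb(n2, k - x) / total
    p_obs = prob(a)
    return sum(prob(x)
               for x in range(max(0, k - n2), min(k, n1) + 1)
               if prob(x) <= p_obs + 1e-12)

def compare_tests(n1, n2, alpha=0.05):
    """Exhaustively compare the interval test with Fisher for every pair
    of observable frequencies a/n1 and c/n2."""
    type_i = type_ii = 0
    for a in range(n1 + 1):
        for c in range(n2 + 1):
            lo, hi = newcombe_diff(a / n1, n1, c / n2, n2)
            interval_sig = not (lo <= 0 <= hi)
            fisher_sig = fisher_p(a, n1 - a, c, n2 - c) < alpha
            if interval_sig and not fisher_sig:
                type_i += 1    # interval test significant, Fisher not
            elif fisher_sig and not interval_sig:
                type_ii += 1   # Fisher significant, interval test not
    return type_i, type_ii
```

Because the loop visits every observable 2 × 2 table for given n1 and n2, the evaluation is exhaustive rather than sampled, which is the key advantage over Monte Carlo methods noted below.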
This method evaluates the inner (mesial) interval close to the middle of the range. It does not evaluate the same interval for non-zero points, or for the outer interval. But unlike Monte Carlo methods, it is exhaustive.
What I found partly supported my suspicions. There was indeed an additional error cost introduced by these approximations, and this error differed by number scale (or, equivalently, by formula). The graph below demonstrates the scale of the issue: if we aim for α = 0.05 but then compute an interval with an additional Type I error ε of 0.03, that additional error is far from negligible!
All of these interpolated intervals, including Newcombe’s for d, exhibit detectable errors, but there is some good news. We observed that employing a continuity correction reduces the scale of those errors.
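For reference, Newcombe (1998) gives a closed form for the continuity-corrected Wilson interval. The sketch below implements it alongside the uncorrected interval for comparison; the function names are mine, and the correction is applied to the inner intervals before any difference or ratio calculation.

```python
from math import sqrt

def wilson(p, n, z=1.959964):
    """Wilson score interval, without continuity correction."""
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half

def wilson_cc(p, n, z=1.959964):
    """Wilson score interval with continuity correction (Newcombe 1998).
    The bounds are truncated to [0, 1], and degenerate at p = 0 or 1."""
    if p == 0:
        lower = 0.0
    else:
        lower = (2 * n * p + z * z - 1 -
                 z * sqrt(z * z - 2 - 1 / n + 4 * p * (n * (1 - p) + 1))
                 ) / (2 * (n + z * z))
    if p == 1:
        upper = 1.0
    else:
        upper = (2 * n * p + z * z + 1 +
                 z * sqrt(z * z + 2 - 1 / n + 4 * p * (n * (1 - p) - 1))
                 ) / (2 * (n + z * z))
    return max(0.0, lower), min(1.0, upper)
```

The corrected interval is strictly wider than the uncorrected one, which is why it suppresses Type I errors at the cost of some additional Type II errors.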
Figure 1 shows an example plot obtained by this method (taken from a recent blog post). This includes computations for simple difference d, Cohen’s h, risk and odds ratios, and logarithm, each of which performs Zou and Donner’s difference calculation on a different number scale.
One can make a number of observations about this graph: the saw-tooth behaviour, the ordering of intervals by performance, and so on. But if we want to minimise Type I errors (wrongly assessing a non-significant difference as ‘significant’), this graph reveals that employing a continuity correction suppresses them.
Our previous evaluations showed that for unequal sample sizes, where n1 = 5n2, we tended to see a lower overall error rate (this is not quite true of χ2). See also Table 1 below. The increased sample size for p1 (amounting to 3 times the data in the table overall) means that the discrete Fisher distribution is smoother, and the ‘smoothing correction’ aspect of the continuity correction is therefore less necessary. But an error remains.