Confidence intervals for Cohen’s h

1. Introduction

Cohen’s h (Cohen, 2013) is an effect size for the difference of two independent proportions that is sometimes cited in the literature. h ranges between minus and plus pi, i.e. h ∈ [–π, π].

Jacob Cohen suggests that if |h| > 0.2, this is a ‘small effect size’, if |h| > 0.5, it is ‘medium’, and if |h| > 0.8 it is ‘large’. This conventional application of effect sizes – as a descriptive method for distinguishing sizes – is widespread.

The score is defined as the difference between the arcsine transforms of the square roots of two Binomial proportions, p1 and p2, hence the expanded range, ±π.

That is,

h = ψ(p1) – ψ(p2), (1)

where the transform function ψ(p) is defined as

ψ(p) = 2 arcsin(√p). (2)
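As a quick sketch, Equations (1) and (2) can be computed directly (function names are mine, not from the source):

```python
import math

def psi(p):
    """Arcsine transform, Equation (2): psi(p) = 2*arcsin(sqrt(p))."""
    return 2 * math.asin(math.sqrt(p))

def cohens_h(p1, p2):
    """Cohen's h, Equation (1): difference of transformed proportions."""
    return psi(p1) - psi(p2)

print(round(cohens_h(0.6, 0.4), 4))  # → 0.4027
```

Since ψ(0) = 0 and ψ(1) = π, comparing extreme proportions gives h = ±π, which is why h spans the expanded range noted above.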

In this blog post I will explain how to derive an accurate confidence interval for this property h. The benefits of doing so are multiple.

  1. We can plot h scores with intervals, so we can visualise the reliability of their estimate, pay attention to the smallest bound, etc.
  2. We can compare two scores, h1 and h2, for significant difference. In other words, we can conclude that h2 > h1, or vice versa.
  3. We can reinterpret ‘large’ and ‘small’ effects for statistical power.
  4. We can consider whether an inner bound exceeds Cohen’s thresholds. Thus if h is positive and its lower bound h⁻ > 0.5, we can report that the likely population score is at least a ‘medium’ effect.

An absolute (unsigned and non-directional) version, |h|, is also sometimes cited. We can compute intervals for unsigned |h| too; we will return to this question later.

2. Deriving an interval

2.1 Preliminaries: the Wilson score interval

We will use the Wilson score interval on Binomial proportions p at an error level α for our purposes (Wilson 1927). This is written p ∈ (w⁻, w⁺), is directly calculable by formula, and has good performance. It may be corrected for continuity and adjusted for finite populations or random-text sampling.

Once corrected for continuity, Wilson’s interval has similar performance to the ‘exact’ Clopper-Pearson interval (Wallis 2013, 2021: 311), which could be substituted into what follows. Other intervals are available, although few outperform the continuity-corrected Wilson interval (Newcombe 1998a, Wallis 2013).

The Wilson score interval is asymmetric. Unless p = 0.5, the interval width on one side of p will not be the same as the interval width on the other.
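The interval can be computed directly from Wilson’s formula. A minimal sketch, hard-coding the two-tailed critical value z ≈ 1.95996 for α = 0.05 rather than calling a statistics library (the function name is mine):

```python
import math

def wilson(p, n, z=1.95996):
    """Wilson score interval (w-, w+) for a proportion p observed in n trials.
    z is the two-tailed critical value of the Normal distribution for alpha."""
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    spread = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - spread, centre + spread

# The case plotted in Figure 1: p = 0.3, n = 10
lo, hi = wilson(0.3, 10)
```

This yields bounds of roughly (0.108, 0.603): note the asymmetry about p = 0.3, with the upper width (≈ 0.30) larger than the lower width (≈ 0.19).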

Figure 1. Wilson score interval about p = 0.3, with n = 10, obtained by inverting the Normal approximation to the Binomial about P, at 95% interval bounds (w⁻, w⁺).

2.2 Stage 1. An interval for the transform

In An Algebra of Intervals (and in my book), we noted that we can obtain an interval for any monotonic transformation of a Binomial proportion p by simply applying the same transform function to the interval bounds for p.

The method is as follows.

For a function ψ(p), we first determine if it is monotonic.

  • If it is monotonic within the range of P = [0, 1], the bounds will be ψ(w⁻) and ψ(w⁺).
    • If the function increases with p then ψ(w⁻) < ψ(w⁺).
    • If it is falling, we swap the bounds to place the lower number first.
  • If the function is not monotonic, it will contain at least one turning point (a local maximum or minimum) where the function changes direction. This complicates matters, but it does not mean that an interval is not computable. For an example of such a function, see The confidence of entropy.

The first step is therefore to examine the behaviour of Equation (2).

It turns out that ψ(p) is a monotonic function, which means that for every value p ∈ P = [0, 1] there is a unique value of ψ(p).

How do we know? We simply compute Equation (2) over P, and observe that the function always increases with increasing p, and has no local maximum or minimum along the range. See Figure 2.
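This check can itself be scripted: evaluate ψ over a fine grid spanning P and confirm that the values are strictly increasing (a numerical sanity check under my naming, not a proof):

```python
import math

def psi(p):
    """Arcsine transform, Equation (2)."""
    return 2 * math.asin(math.sqrt(p))

# Evaluate psi across P = [0, 1] and verify it always increases with p:
grid = [i / 1000 for i in range(1001)]
values = [psi(p) for p in grid]
assert all(a < b for a, b in zip(values, values[1:]))  # strictly increasing
# The endpoints also confirm the range of h: psi(0) = 0 and psi(1) = pi.
```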

Figure 2. Plotting ψ(p) against p, with confidence intervals ψ(w⁻), ψ(w⁺) for n = 10, α = 0.05.

We may now calculate confidence intervals for ψ. Since it is a rising monotonic function, we have, simply:

ψ(p) ∈ (ψ(w⁻), ψ(w⁺)). (3)
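Putting Stage 1 together: compute the Wilson bounds, then transform them, per Equation (3). A sketch restating the `psi` and `wilson` helpers (names mine, z hard-coded for α = 0.05):

```python
import math

def psi(p):
    """Arcsine transform, Equation (2)."""
    return 2 * math.asin(math.sqrt(p))

def wilson(p, n, z=1.95996):
    """Wilson score interval (w-, w+) for p observed in n trials."""
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    spread = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - spread, centre + spread

def psi_interval(p, n):
    """Equation (3): psi is rising and monotonic, so transform each bound."""
    w_lo, w_hi = wilson(p, n)
    return psi(w_lo), psi(w_hi)
```

For p = 0.6, n = 10 this gives an interval of roughly (1.19, 2.30) about ψ(p) ≈ 1.77 — visibly asymmetric, as in Figure 2.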

2.3 Stage 2. An interval for the difference

Next, we need to compute an interval for Cohen’s h (Equation (1)).

Newcombe (1998b) pointed out that when we compare two intervals on two proportions p1 and p2, we are concerned with the inner intervals, i.e. the interval on p1 close to p2, and vice versa.

For a difference d = p2p1, we may derive a zero-based interval as

0 ∈ (wd⁻, wd⁺) = (–√((u1⁻)² + (u2⁺)²), √((u1⁺)² + (u2⁻)²)), (4)

where the interval widths are u1⁻ = p1 – w1⁻ and u1⁺ = w1⁺ – p1 (and likewise for p2), and w1⁻, w1⁺, etc. are the interval bounds for p1, etc. Selecting the relevant interval width is important. As we saw, the Wilson interval for p is asymmetric. Figure 2 shows that the interval for ψ(p) is also asymmetric.

Consider the upper bound of a zero-based interval for d. If d is positive, then p2 > p1. The upper bound of the zero-based interval, wd⁺, is the smallest (positive) value of d that we may report as ‘a significant difference’. It is calculated from the square root of the sum of a pair of squared interval widths: the upper width for p1 (u1⁺) and the lower width for p2 (u2⁻). These are the inner interval widths when p2 > p1. On the other hand, if p2 < p1, d is negative, and we focus on wd⁻ and the opposite widths.

This is a long-winded way of saying: check the geometry!

We can then rewrite Equation (4) as an interval for d by simple subtraction:

d ∈ (d⁻, d⁺) = d – (wd⁻, wd⁺) = (d – wd⁺, d – wd⁻). (5)
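Equations (4) and (5) can be sketched as follows (helper names mine; z hard-coded for α = 0.05):

```python
import math

def wilson(p, n, z=1.95996):
    """Wilson score interval (w-, w+) for p observed in n trials."""
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    spread = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - spread, centre + spread

def newcombe_wilson(p1, n1, p2, n2):
    """Equations (4)-(5): interval for d = p2 - p1, combining the
    squared inner Wilson interval widths on each side."""
    w1_lo, w1_hi = wilson(p1, n1)
    w2_lo, w2_hi = wilson(p2, n2)
    d = p2 - p1
    wd_lo = -math.sqrt((p1 - w1_lo) ** 2 + (w2_hi - p2) ** 2)  # -sqrt(u1-^2 + u2+^2)
    wd_hi = math.sqrt((w1_hi - p1) ** 2 + (p2 - w2_lo) ** 2)   # sqrt(u1+^2 + u2-^2)
    return d - wd_hi, d - wd_lo
```

A difference is then reportable as significant exactly when this interval excludes zero.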

Zou and Donner (2008) generalise this formula by arguing that it can be applied to any pair of good-coverage intervals. That is, provided that the intervals are reasonably accurate, we can compute a difference interval between them by the same process of summing squared interval widths, paying attention to the inner interval. In Wallis (forthcoming), I evaluate this claim more critically.

Nonetheless, this means that to create an interval according to Zou and Donner, all we need to do is substitute p1 with ψ(p2) and p2 with ψ(p1) in Newcombe’s formula (Equation (4)). We swap indices because h is expressed as ψ(p1) – ψ(p2), rather than the other way around.

We could simply substitute u1, etc., but for clarity we will spell this out.

0 ∈ (wh⁻, wh⁺) = (–√((ψ(w1⁺) – ψ(p1))² + (ψ(p2) – ψ(w2⁻))²), √((ψ(p1) – ψ(w1⁻))² + (ψ(w2⁺) – ψ(p2))²)), (6)

and the following interval for h:

h ∈ (h⁻, h⁺) = h – (wh⁻, wh⁺) = (h – wh⁺, h – wh⁻). (7)
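The whole derivation, Equations (6) and (7), then fits in a few lines (a sketch under the same assumptions as before: my function names, z hard-coded for α = 0.05):

```python
import math

def psi(p):
    """Arcsine transform, Equation (2)."""
    return 2 * math.asin(math.sqrt(p))

def wilson(p, n, z=1.95996):
    """Wilson score interval (w-, w+) for p observed in n trials."""
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    spread = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - spread, centre + spread

def cohens_h_interval(p1, n1, p2, n2):
    """Equations (6)-(7): h = psi(p1) - psi(p2), with a Zou-Donner
    interval built from the transformed inner Wilson widths."""
    h = psi(p1) - psi(p2)
    w1_lo, w1_hi = wilson(p1, n1)
    w2_lo, w2_hi = wilson(p2, n2)
    wh_lo = -math.sqrt((psi(w1_hi) - psi(p1)) ** 2 + (psi(p2) - psi(w2_lo)) ** 2)
    wh_hi = math.sqrt((psi(p1) - psi(w1_lo)) ** 2 + (psi(w2_hi) - psi(p2)) ** 2)
    return h, h - wh_hi, h - wh_lo
```

For the 2 × 2 table [[6, 4], [4, 6]] (n1 = n2 = 10, p1 = 0.6, p2 = 0.4) this reproduces h ≈ 0.4027 with interval ≈ (–0.4251, 1.1442), the worked example in Section 4.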

We can plot d and h with their respective intervals, computed by Equations (5) and (7). To express the overall range we vary p1 from 0 to 1 with p2 = 1 – p1, which corresponds to plotting against Cramér’s ϕ along the diagonal from ϕ = –1 to 1.

Figure 3. Intervals for Cohen’s h plotted against Cramér’s ϕ for a diagonal interpolation (p2 = 1 – p1) from p1 = 0 (left) to p1 = 1 (right), n = 10, α = 0.05.

3. Evaluating the interval

Figure 3 shows that the point at which h and h+ cross the zero axis is almost the same as the equivalent point for d and d+. But we know that Zou and Donner’s method involves an approximation.

The justification for selecting the Wilson inner intervals for p1 and p2 was that these interval widths are in proportion to a Normal interval at the closest bounds of p1 and p2. See Wallis (2021: 125).

But we transformed all values of p by applying Equation (2) to p1, w1, w1+, etc. If the inner intervals were Normal before they were transformed, they will not be Normal afterwards.

The question is then how much additional error is created by applying this transformation beforehand?

One way to evaluate this is discussed in Wallis (forthcoming) and Evaluating the performance of risk ratio and odds ratio tests. Testing whether h is significantly different from zero should be equivalent to a χ² or Fisher test. This evaluation covers only a subset of all possible comparisons, but it can be performed exhaustively. Errors are weighted by a Fisher prior probability to account for the combinatorial chance of each particular outcome.

This method permits us to see two things:

  1. the scale of Type I errors introduced by the transformation – this is the risk that our new interval might exclude zero, and yet an exact Fisher test would rule it to be ‘non-significant’ – and
  2. how these errors rank Cohen’s transformation against others, such as the risk ratio or odds ratio.

We rerun our ratio evaluation, including our new interval. In Figures 4 and 5, we can see the performance of the Cohen’s h inner interval at zero against a Fisher test. Figure 4 computes errors for tables where n1 = n2 ∈ {1, 2,… 200}. Figure 5 performs the same where n1 = 5n2.

Figure 4. Cohen’s h evaluated by Fisher-weighted error rates for Type I errors, performance for h ≠ 0 against the Fisher ‘exact’ test, computed for values of n1 = n2 ∈ {1, 2,… 200}, α = 0.05, with equal-sized samples.
Figure 5. Cohen’s h evaluated by Fisher-weighted error rates for Type I errors, performance for h ≠ 0 against the Fisher ‘exact’ test, computed for values of n1 ∈ {1, 2,… 200}, α = 0.05, with unequal-sized samples, n1 = 5n2.

We can observe that there is a small additional error cost involved in employing Zou and Donner’s theorem with Cohen’s h compared to the simple difference d (Newcombe-Wilson). This error is smaller still for the unequal-sized sample, where more data supports p2.

To rank them we can sum all two hundred Type I error scores (Table 1). The rate 3.3051/200 represents an additional mean error of 0.0165 where the uncorrected Newcombe-Wilson interval finds a significant difference and the Fisher test does not. If a continuity correction is applied, this error cost falls to 0.0029.

             χ²       NW       Cohen’s h  risk ratio  logarithm  odds ratio
n1 = n2
  no c.c.    3.8476   3.3051   3.8610     4.0109      4.2974     4.3852
  c.c.       0.0000   0.5780   0.7589     0.8584      1.3264     1.4061
n1 = 5n2
  no c.c.    3.4533   3.1207   3.2112     3.3081      3.4745     3.4975
  c.c.       0.4016   0.7377   0.7744     0.7944      0.7880     0.8090

Table 1. Total Type I error rates, summed for n2 ∈ {1, 2,… 200}, α = 0.05.

The error for Cohen’s h is slightly smaller than that for the risk ratio (p1/p2), whose interval employs a logarithmic transformation function, and smaller than those for the log ratio and odds ratio. These errors can be reduced substantially by employing a correction for continuity.

An even more conservative interval which eliminates, or near-eliminates, Type I errors may be obtained by employing a larger correction.

How does it perform against an equivalent Newcombe-Wilson test? The following animation plots Type I errors generated by estimating differences on these additional scales, against the equivalent NW test, with or without continuity correction. This visualises the discrepancies that arise when the inner-interval Normal approximation is applied to different number scales.

Again, we see that Cohen’s h (here, shown as a red line) obtains fewer errors than the risk ratio (ordinary ratio). The odds ratio and logarithm (dashed) are more difficult to distinguish.

Animation 1. Evaluating Cohen’s h, risk, odds and log ratios against the equivalent Newcombe-Wilson test.

4. Unsigned Cohen’s |h|

It is quite common to see unsigned scores, |h|, cited in the literature.

Consider Figure 3. Note that an interval may include zero – indeed, this corresponds to a ‘non-significant difference’ (i.e. not significantly different from zero). The points where the interval crosses the zero axis are indicated, and all points between the two circled areas include zero.

We can derive a confidence interval for an unsigned score from a signed one. The following method is preferred. We transform the interval for signed h by paying attention to the global minimum of |h|, i.e. zero.

If the interval excludes zero, we have, simply:

|h| ∈ (|h|⁻, |h|⁺) = (min(|h⁻|, |h⁺|), max(|h⁻|, |h⁺|)).

If the interval includes zero, the lower bound |h|⁻ = 0. The interval is closed at zero, hence the square bracket:

|h| ∈ (|h|⁻, |h|⁺) = [0, max(|h⁻|, |h⁺|)). (8)
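A sketch of this fold (function name mine), taking a signed h with its interval bounds and returning the unsigned score and bounds:

```python
def unsigned_h_interval(h, h_lo, h_hi):
    """Collapse a signed interval for h onto |h|, per Equation (8).
    If the signed interval includes zero, the lower bound is 0 (closed)."""
    if h_lo <= 0 <= h_hi:
        return abs(h), 0.0, max(abs(h_lo), abs(h_hi))
    return abs(h), min(abs(h_lo), abs(h_hi)), max(abs(h_lo), abs(h_hi))
```

For example, a signed interval of (–0.4251, 1.1442) includes zero, and so collapses to [0, 1.1442).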

This transformation loses information, but this may be what we want. Consider the task of comparing two signed scores, h2 > h1. This will detect outcomes where the scores have different signs. The proposition |h2| > |h1| is a subset of these results.

The method is also conservative because the absolute function is a non-monotonic transform. The interval ‘folds back’ on itself at 0. See Confidence intervals on goodness of fit ϕ scores.

Why is this conservative? Consider the interval for the contingency table [[6, 4], [4, 6]] at α = 0.05. We have n1 = n2 = 10, p1 = 0.6, p2 = 0.4. We obtain h = 0.4027 ∈ (-0.4251, 1.1442). This predicts that if the true value is less than h, there is a 0.05 chance that it is less than -0.4251.

The transformed interval is |h| = 0.4027 ∈ [0, 1.1442).

But now we have lost that 0.05 chance of the true population value being a lower score. Instead, for this table, the threshold is equivalent to a lower bound of h = -1.1442, an event that has an infinitesimal chance (α < 0.000001) of occurring!

5. Conclusions

We have demonstrated how to derive an interval for Cohen’s h, an effect size for pairs of proportions or 2 × 2 contingency tables, like Cramér’s ϕ.

Armed with this interval we can perform any of the additional procedures outlined in the introduction, including plotting intervals on observed h-scores and comparing their significant difference. We can also identify when what Cohen calls a ‘large’ or ‘medium’ effect is supportable by inferential statistics, i.e. when the lower bound of the unsigned interval excludes the threshold value.

Cohen’s h is the difference between two arcsine-transformed proportions, a fact which necessarily transforms the number scale on which probability density functions are computed. Whereas the Newcombe-Wilson method for the difference interval relies on the observation that uncertainty is Normally distributed at the Wilson score interval bounds, the arcsine transform (Equation (2)) is non-linear (Figure 2), and therefore any Normal distribution plotted on the p-axis will become non-Normal on the transformed axis.

Consequently, Zou and Donner’s (2008) method, which generalises the Newcombe-Wilson formula from differences in p to differences in any property with a good coverage interval, will obtain slightly different outcomes due to this additional approximation. It introduces small discrepancies, classed as additional Type I or Type II errors. The question is: how substantial are these errors, and to what extent are they addressed by standard methods, e.g. continuity corrections?

We found that these errors were not negligible, but, when compared to the ‘exact’ Fisher test, were of approximately the same order overall as those obtained for the Newcombe-Wilson difference interval. The error rate was slightly greater than for the simple difference, but the method performs better than the equivalent for the risk ratio, logarithm or odds ratio.

Note: Since writing this blog post, I have discovered that these errors can be controlled by a simple method. If we include a continuity correction factor that is 1.5 times larger than normal into both Wilson intervals for p1 and p2, the resulting interval for h has very few Type I errors.

Finally, we considered intervals for unsigned |h|. Unsigned effect sizes are quite common, because they can be obtained for tables with more than one degree of freedom. However, they are conservative and lossy. When dealing with an effect size with a single degree of freedom, the optimum method is to first compute an interval for the signed score, and then collapse the interval onto the positive number scale, as we discussed in this article.

References

Cohen, J. (2013). Statistical power analysis for the behavioral sciences (2nd ed). New York: Routledge.

Newcombe, R.G. (1998a). Two-sided confidence intervals for the single proportion: comparison of seven methods. Statistics in Medicine 17, 857-872.

Newcombe, R.G. (1998b). Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine 17, 873-890.

Wallis, S.A. (2013). Binomial confidence intervals and contingency tests. Journal of Quantitative Linguistics 20:3, 178-208. » Post

Wallis, S.A. (2021). Statistics in Corpus Linguistics Research. New York: Routledge.

Wallis, S.A. (forthcoming). Accurate confidence intervals on Binomial proportions, functions of proportions and other related scores. » Post

Zou, G.Y. & A. Donner (2008). Construction of confidence limits about effect measures: A general approach. Statistics in Medicine 27:10, 1693-1702.
