An algebra of intervals

Introduction

Many researchers wish to compute confidence intervals on measures other than the simple Binomial proportion, p, or difference between two independent proportions, d = p2 – p1. On this blog we have identified the Wilson score interval on p and the repositioned Newcombe-Wilson difference interval on d as suitable for these purposes.

For example, we might want to estimate intervals for

  1. functions of single proportions, fn(p),
  2. functions of multiple independent proportions, fn(p1, p2,…),
  3. functions of related proportions, as above, but where p1 and p2 (etc) are not independent.

One reason you might wish to do this is to plot effect sizes. These are descriptive statistics about a sample, similar to proportions or differences. So in principle you should cite them with an upper and lower interval range, especially if you want to compare them or identify how robust an estimate is. I have explained how to compute intervals on ϕ. In my book, I also discuss intervals on the difference between differences (Wallis 2021: 233) and other measures. In a new paper for JQL (Wallis forthcoming), I have been exploring this question further.

Unfortunately, in the traditional statistics literature one concept dominates: standard error, taken as the standard Normal deviate at an error level α, zα/2, multiplied by the standard deviation, S or σ. Since it is directly related to standard deviation and thus variance, ‘standard error’ seems natural enough. But in fact this ‘Wald’ concept contains a whopping mistake.

The solution was pointed out by Edwin Wilson (1927), but few paid attention at the time. The error is found again and again in statistics books. The effect of this mistake is to make confidence intervals for small samples and skewed proportions particularly inaccurate, even though these are the cases when you most need statistical precision!

The first point to note is that the correct approach to the problem of computing an interval about an observed proportion, p, is to invert the interval about the true value, P. I call this the interval equality principle:

where p is at the (just significant) bound of P, P must be at the (just significant) opposite bound of p.

The idea is simple. Imagine you and I are standing apart from each other and want to measure the distance between us. It does not matter whether I measure the distance from me to you or from you to me: the result is the same.

Employing this principle with Binomial proportions means that even if a population distribution is approximately Normal about P, the resulting probability density function shape for p is no longer bell-curved (Normal), thanks to the effect of the boundaries at 0 and 1. This ‘reflection’ is distorted, but in a rather elegant way.

Figure 1. The interval equality principle. Where P is at the boundary of p, p is at the opposite boundary of P.

The resulting interval for p is guaranteed to obtain results that are consistent with the method used to compute the interval for P, such as the Binomial, Normal approximation to the Binomial (possibly with Yates’ continuity correction), or log-likelihood.

Intervals can be calculated in two ways: by formula or by search. Direct computation from a formula should be used where possible because it is fast and completes in a fixed number of steps. However, some intervals, such as the Binomial interval for P, cannot easily be inverted by algebra, and a search procedure is necessary. We simply search for values of P where p is at the opposite bound. See Plotting the Clopper-Pearson distribution. Wallis (2013) discusses this in some detail, and the subject is covered further in my new book, Wallis (2021).

After some considerable evaluation by researchers employing different methods (see Wallis 2013 for a review), it is possible to arrive at an optimum method of direct calculation for Binomial proportions about p. This is the Wilson score interval, which we will write as (w, w+).

It is directly calculable by formula, it can be corrected for continuity (correcting for the fact that actual proportions are exact fractions of sample size n), and other adjustments may be applied to it. Since it is the inverse of the Normal approximation about P, it is guaranteed to obtain identical results to the equivalent 2 × 1 χ² or z test. A nice visual introduction to the Wilson score interval is found in Plotting the Wilson distribution.

  • Aside: If you prefer an alternative interval, such as the Clopper-Pearson or Jeffreys interval, simply substitute the upper and lower interval bounds into the discussion that follows.

Robert Newcombe (1998) further showed that it was possible to compute an efficient and quite accurate interval for the difference between two independent proportions, d = p2 – p1, using the Wilson interval as a starting point.

But what about other relationships between proportions? We can calculate difference intervals, but what about sums (+), ratios (÷) and products (×)? Can we create an ‘algebra of intervals’ that would allow us to make the calculation of intervals for any function of proportions straightforward? And what should we do if proportions are not independent?

Monotonic functions of p

Elsewhere on this blog I have explained how to calculate confidence intervals on monotonic functions of p. Michael Smithson (2007) calls this method ‘the transformation principle’.

Monotonic functions always increase or decrease with their parameter. This means that there is a 1:1 relationship between p and fn(p), and it can be inverted to obtain a single solution. If we know fn(p) we can obtain p.

Consider the reciprocal function, 1/p. Suppose we want to know what the 95% confidence interval should be for the average length of a clause in a corpus, l = words / clauses. We can calculate a 95% Wilson confidence interval on p = 1/l, the probability that a word chosen at random is the first word in a clause, p = clauses / words. This gives us an interval (w, w+) on p. To obtain a confidence interval for l we simply transform the interval back to the length scale, giving us (1/w+, 1/w). For more information see Reciprocating the Wilson interval.
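The steps above can be sketched in Python (a hypothetical illustration with 500 clauses in 4,000 words; the wilson_interval helper is my own, not from the post):

```python
import math

def wilson_interval(p, n, z=1.96):
    """Wilson score interval (w-, w+) for proportion p, sample size n."""
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return ((centre - spread) / denom, (centre + spread) / denom)

clauses, words = 500, 4000
p = clauses / words                 # probability a word begins a clause
w_lo, w_hi = wilson_interval(p, words)

l = 1 / p                           # mean clause length in words
l_interval = (1 / w_hi, 1 / w_lo)   # bounds flip: 1/p has a negative gradient
```

Note how the upper bound on p becomes the lower bound on l, exactly as Figure 2 suggests.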

Figure 2. Some monotonic functions, including p² and 1/p. Note that a function with a negative gradient, such as 1/p, will flip the upper and lower bounds.

We can write the interval for a monotonic function fn(p) as

fn(p) ∈ (min(fn(w–), fn(w+)), max(fn(w–), fn(w+))). (1)

Other monotonic functions are also entirely legitimate, such as the natural logarithm ln(p), the inverse logistic logit(p), and so on. For many calculations we may need to invert the function, so it is handy to know the inverse function, to return to the original variable, e.g. l = 1/(1/l), k = ln(exp(k)), p = logistic(logit(p)), etc. (As you may have realised, the inverse of a monotonic function, such as exp(k), is also a monotonic function.)

Whereas the transformation principle is mathematically straightforward and widely accepted by statisticians, rather less attention has been paid to circumstances where functions are not monotonic. But it is also possible to calculate intervals for non-monotonic functions of p. Note that where a function of a proportion (such as p²) is monotonic over the probabilistic range P = [0, 1], we may treat it as monotonic for our purposes.

However, there are circumstances where it is useful to compute intervals for non-monotonic functions of p ∈ P.

Let us take a simple example function for demonstrative purposes. The function fn(p) = (p – 0.5)² is non-monotonic over p ∈ P, with a local minimum at a = 0.5, where fn(a) = 0. If a minimum or maximum value a falls within (w–, w+) we might write

fn(p) ∈ (min(fn(w–), fn(w+), fn(a)), max(fn(w–), fn(w+), fn(a))). (2)

Otherwise, a does not fall within the interval for p, the function is monotonic within the interval, and we employ equation (1).
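Equations (1) and (2) can be combined into a single procedure. A minimal sketch (assuming the extremum a of fn, if any, is known in advance):

```python
def fn_interval(fn, w_lo, w_hi, a=None):
    """Interval for fn(p), given the interval (w_lo, w_hi) for p.

    If fn has a local minimum or maximum at a and a falls inside the
    interval, include fn(a) as a candidate bound (Equation 2);
    otherwise fn is monotonic over the interval (Equation 1)."""
    candidates = [fn(w_lo), fn(w_hi)]
    if a is not None and w_lo < a < w_hi:
        candidates.append(fn(a))
    return (min(candidates), max(candidates))

# The example function with a local minimum at a = 0.5
fn = lambda p: (p - 0.5) ** 2
```

For instance, fn_interval(fn, 0.4, 0.7, a=0.5) yields a lower bound of fn(a) = 0, which equation (1) alone would miss.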

There are two issues to bear in mind.

Firstly, a non-monotonic transformation loses information. We cannot distinguish p = 0 and p = 1 if we compare scores involving (p – 0.5)²! Resulting measures and their intervals are ‘lossy’ and will tend to be conservative.

But they may still be desirable. It is not unusual to employ multi-dimensional effect sizes that reduce multiple degrees of freedom to a single numerical score. These are also lossy and conservative. Whether a measure is useful depends on our purpose.

Secondly, non-monotonic functions cannot be inverted to obtain a unique solution. For example, fn(p) = (p – 0.5)² = 0.25 has two solutions, p = 0 and p = 1. But even if we cannot invert a function, we can still cite and plot intervals on it.

Zou and Donner’s interval difference theorem

We have explored Newcombe’s (1998) difference interval at some length in this blog. A good starting point is Change and certainty, which discusses how to plot an interval about d.

Newcombe’s interval may be simply stated about zero, and then compared with d.

(wd–, wd+) = (–√((p1 – w1–)² + (w2+ – p2)²), √((w1+ – p1)² + (p2 – w2–)²)), (3)

where (wi–, wi+) are the Wilson score interval bounds for pi, i ∈ {1, 2}. We can reposition the interval about the difference by subtracting it from d.

d ∈ (d–, d+) = d – (wd–, wd+) = (d – wd+, d – wd–). (4)
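Equations (3) and (4) together translate directly into code. A sketch (with my own wilson_interval helper, not from the original post):

```python
import math

def wilson_interval(p, n, z=1.96):
    """Wilson score interval (w-, w+) for proportion p, sample size n."""
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return ((centre - spread) / denom, (centre + spread) / denom)

def newcombe_wilson(p1, n1, p2, n2, z=1.96):
    """Repositioned Newcombe-Wilson interval for d = p2 - p1."""
    w1_lo, w1_hi = wilson_interval(p1, n1, z)
    w2_lo, w2_hi = wilson_interval(p2, n2, z)
    d = p2 - p1
    # Equation (3): zero-based difference interval
    wd_lo = -math.sqrt((p1 - w1_lo) ** 2 + (w2_hi - p2) ** 2)
    wd_hi = math.sqrt((w1_hi - p1) ** 2 + (p2 - w2_lo) ** 2)
    # Equation (4): reposition about d
    return (d - wd_hi, d - wd_lo)
```

The subtraction in Equation (4) swaps the bounds, so the wider ‘inner’ excursion ends up on the correct side of d.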

While I was writing my book, I went back to this interval and the original sources, and carried out a number of evaluations that never appeared in the final manuscript for reasons of space. Wallis (2013) was mainly concerned with performance at the inner interval, comparing it with chi-square, Fisher and other 2 × 2 tests. This is important for consistency: otherwise we would be advocating an interval that obtained radically different results from other tests. Although there are small rounding differences that can obtain different outcomes in marginal cases, Newcombe’s Wilson-based interval performs comparably to the equivalent 2 × 2 χ².

In fact, Robert Newcombe made a number of other claims about its performance in his 1998 paper. In particular he noted that the outer interval bound was also comparably accurate, and indeed, that the interval performed well when bounds were not close to zero. In my book I show that the same is not the case for repositioned Gaussian χ² intervals.

An important extension to Newcombe’s work is to be found in Zou and Donner (2008). They offer the following interval difference theorem to obtain an interval for the difference between any two independently distributed parameters, θ̂1 and θ̂2.

(L, U) ≡ (θ̂1 – θ̂2 – √((θ̂1 – l1)² + (u2 – θ̂2)²), θ̂1 – θ̂2 + √((u1 – θ̂1)² + (θ̂2 – l2)²)), (5)

where (li, ui) are the lower and upper interval bounds for parameter θ̂i.

It should be straightforward to see that the Newcombe-Wilson interval (3), repositioned by Equation (4), is an instance of (5) (with indexes reversed). However, Equation (5) may also be applied to functions of p.
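Equation (5) is simple enough to state as a general-purpose function. A sketch (the function name is my own):

```python
import math

def zou_donner_difference(theta1, l1, u1, theta2, l2, u2):
    """Interval (L, U) for theta1 - theta2, given independent intervals
    (l1, u1) and (l2, u2) for each parameter (Zou & Donner 2008, Eq. 5)."""
    d = theta1 - theta2
    L = d - math.sqrt((theta1 - l1) ** 2 + (u2 - theta2) ** 2)
    U = d + math.sqrt((u1 - theta1) ** 2 + (theta2 - l2) ** 2)
    return (L, U)
```

Substituting θ̂1 = p2 and θ̂2 = p1 with their Wilson score bounds recovers the repositioned Newcombe-Wilson interval.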

Sums, ratios and products

The sum of two independent proportions may be obtained by simply substituting as follows into Equation (5):

  • θ̂1 = p1, (l1, u1) = (w1–, w1+), and
  • θ̂2 = –p2, (l2, u2) = (–w2+, –w2–).

Indeed, we can generalise a summation over k independent proportions as

sum ∈ (L, U) = (∑pi – √(∑(pi – wi–)²), ∑pi + √(∑(wi+ – pi)²)), (6)

where i ∈ {1, 2,… k} is the index.

  • Applications: Summing independent proportions is not very common, but a version of Equation (6) may be employed for computing intervals for goodness of fit effect sizes and other metrics that sum a finite number of constrained terms.
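Equation (6) generalises naturally over a list of proportions. A sketch (again with my own wilson_interval helper):

```python
import math

def wilson_interval(p, n, z=1.96):
    """Wilson score interval (w-, w+) for proportion p, sample size n."""
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return ((centre - spread) / denom, (centre + spread) / denom)

def sum_interval(ps, ns, z=1.96):
    """Equation (6): interval for the sum of k independent proportions."""
    bounds = [wilson_interval(p, n, z) for p, n in zip(ps, ns)]
    total = sum(ps)
    lower = total - math.sqrt(sum((p - lo) ** 2 for p, (lo, hi) in zip(ps, bounds)))
    upper = total + math.sqrt(sum((hi - p) ** 2 for p, (lo, hi) in zip(ps, bounds)))
    return (lower, upper)
```

With k = 1 the square root collapses and the result is simply the Wilson interval itself, a useful sanity check.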

Zou and Donner themselves offer the ratio of two proportions, p1 / p2, as an example. This time we employ the natural logarithm as a monotonic transform.

If you recall from school, the difference between two logarithms equals the logarithm of the ratio:

log(a / b) ≡ log(a) – log(b),

where ‘log’ is a logarithm with any base. We will use the natural logarithm, ‘ln’.

Zou and Donner substitute as follows into Equation (5).

  • θ̂1 = ln(p1), (l1, u1) = (ln(w1–), ln(w1+)), and
  • θ̂2 = ln(p2), (l2, u2) = (ln(w2–), ln(w2+)).

This gives us a difference interval on the log scale. To obtain a ratio interval, the log transform must be reversed with the exponent function (‘exp’).

ratio ∈ (L, U) = (exp[ln(p1 / p2) – √((ln(p1) – ln(w1–))² + (ln(w2+) – ln(p2))²)],
exp[ln(p1 / p2) + √((ln(w1+) – ln(p1))² + (ln(p2) – ln(w2–))²)]). (7)

Finally, let us consider the product of two independent proportions, p1 × p2. This may be calculated by combining the above two steps together, applying the sum formula to the log scale.

  • θ̂1 = ln(p1), (l1, u1) = (ln(w1–), ln(w1+)), and
  • θ̂2 = –ln(p2), (l2, u2) = (–ln(w2+), –ln(w2–)).

As above, we quote the result by applying the exponential function to the resulting interval bounds.

  • Applications: The product of two independent probabilities is their joint probability (intersection). This method is also applicable to the power function.
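Combining the two steps, the product interval can be sketched as follows (the wilson_interval helper is my own, not from the post):

```python
import math

def wilson_interval(p, n, z=1.96):
    """Wilson score interval (w-, w+) for proportion p, sample size n."""
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return ((centre - spread) / denom, (centre + spread) / denom)

def product_interval(p1, n1, p2, n2, z=1.96):
    """Interval for the product p1 * p2: the sum formula applied on the
    log scale, then transformed back with exp."""
    w1_lo, w1_hi = wilson_interval(p1, n1, z)
    w2_lo, w2_hi = wilson_interval(p2, n2, z)
    log_prod = math.log(p1) + math.log(p2)
    lower = math.exp(log_prod - math.sqrt((math.log(p1) - math.log(w1_lo)) ** 2
                                          + (math.log(p2) - math.log(w2_lo)) ** 2))
    upper = math.exp(log_prod + math.sqrt((math.log(w1_hi) - math.log(p1)) ** 2
                                          + (math.log(w2_hi) - math.log(p2)) ** 2))
    return (lower, upper)
```

Note that the lower bound combines the lower bounds of both proportions, and the upper bound the upper bounds, as we would expect for a product of positive terms.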

In Figure 3 we plot intervals over an unskewed diagonal matrix with varying 2 × 2 ϕ, using the method outlined in Measures of association for contingency tables. This obtains a table where p1 increases steadily from 0 as ϕ increases and p2 = 1 – p1. We can see that the difference declines linearly with ϕ, whereas the sum is constant.

Figure 3. 95% confidence intervals on functions of two proportions, p1 and p2, n1 = n2 = 10, obtained from a diagonal matrix with varying ϕ.

Zou and Donner’s method has one obvious limitation, which we can see in Figure 3. Due to the use of the log transform, ratio and product intervals are uncomputable if either p1 or p2 is zero.

The ratio equation naturally involves division by zero for p2 = 0, but a more general problem is that the logarithm of zero is –∞. This seems counterintuitive: observing a zero proportion is both entirely feasible and uncertain. By inspection, an effective solution in this figure substitutes the Wilson interval. However, a more general heuristic is simply to recognise that these functions are convergent. If we substitute a small delta, δp = 10⁻⁶ say, for any zero term, we obtain a result very close to the correct one. This permits us to observe, for example, that the lower bound of the ratio, r, actually converges to a figure very close to 3.6 (1/w2+) for the infinite interval on the right-hand side of Figure 3.
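The delta heuristic can be sketched as a guard applied before taking logarithms (DELTA, safe_log and the helpers are my own names, not from the post):

```python
import math

DELTA = 1e-6   # small substitute for zero terms

def wilson_interval(p, n, z=1.96):
    """Wilson score interval (w-, w+) for proportion p, sample size n."""
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return ((centre - spread) / denom, (centre + spread) / denom)

def safe_log(x):
    """ln(x), substituting DELTA when x is zero so the result stays finite."""
    return math.log(max(x, DELTA))

def ratio_interval(p1, n1, p2, n2, z=1.96):
    """Equation (7) with zero terms guarded by safe_log."""
    w1_lo, w1_hi = wilson_interval(p1, n1, z)
    w2_lo, w2_hi = wilson_interval(p2, n2, z)
    log_r = safe_log(p1) - safe_log(p2)
    lower = math.exp(log_r - math.sqrt((safe_log(p1) - safe_log(w1_lo)) ** 2
                                       + (safe_log(w2_hi) - safe_log(p2)) ** 2))
    upper = math.exp(log_r + math.sqrt((safe_log(w1_hi) - safe_log(p1)) ** 2
                                       + (safe_log(p2) - safe_log(w2_lo)) ** 2))
    return (lower, upper)
```

With the guard in place, a zero proportion yields a finite interval rather than an error.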

If you wish to experiment, the graph in Figure 3 is plotted with a spreadsheet I created for these purposes. You can experiment with different interval settings and skew the table so n1 ≠ n2.

Example: the odds ratio

Now that we have a solution for the risk ratio, we can compute an interval for the odds ratio. I tend to prefer 2 × 2 ϕ as an effect size because it is better behaved (it is restricted to [-1, 1] for a start), but you should now be able to compute the odds ratio using Zou and Donner’s theorem.

The ‘odds’ are the ratio p:(1 – p) (e.g. the odds 2:1 is where p = 2/3). We write this as

odds(p) = p / (1 – p).

This is a monotonic function of p, so we can proceed by applying the function to proportions and interval bounds. The odds ratio is simply the ratio of two independent odds,

odds ratio = odds(p1) / odds(p2).

To compute the interval for the ratio, we employ the method above, substituting into Equation (5).

  • θ̂1 = ln(odds(p1)), (l1, u1) = (ln(odds(w1–)), ln(odds(w1+))), and
  • θ̂2 = ln(odds(p2)), (l2, u2) = (ln(odds(w2–)), ln(odds(w2+))).

The result is an interval on the natural logarithm scale, so to finish we transform this interval using the inverse log (‘exp’) function as above.
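Putting the whole calculation together, a sketch (helper names are my own, not from the spreadsheet):

```python
import math

def wilson_interval(p, n, z=1.96):
    """Wilson score interval (w-, w+) for proportion p, sample size n."""
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return ((centre - spread) / denom, (centre + spread) / denom)

def odds(p):
    """The odds p : (1 - p), a monotonic function of p."""
    return p / (1 - p)

def odds_ratio_interval(p1, n1, p2, n2, z=1.96):
    """Interval for odds(p1) / odds(p2) via Equation (5) on the
    ln(odds) scale, transformed back with exp."""
    w1_lo, w1_hi = wilson_interval(p1, n1, z)
    w2_lo, w2_hi = wilson_interval(p2, n2, z)
    t1, t2 = math.log(odds(p1)), math.log(odds(p2))
    lower = math.exp(t1 - t2 - math.sqrt((t1 - math.log(odds(w1_lo))) ** 2
                                         + (math.log(odds(w2_hi)) - t2) ** 2))
    upper = math.exp(t1 - t2 + math.sqrt((math.log(odds(w1_hi)) - t1) ** 2
                                         + (t2 - math.log(odds(w2_lo))) ** 2))
    return (lower, upper)
```

Comparing the resulting interval with 1 gives the contingency-style test discussed below.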

I have included this calculation in my classic ‘2 × 2 χ² test’ spreadsheet. If we compare the interval with 1, we can create a new contingency test, with a performance similar to χ² and Newcombe-Wilson tests.

A performance comparison against the Newcombe-Wilson test sees two Type I error subtypes introduced: asymptotic behaviour when one or both proportions are extreme (p = 0 or 1) and n < 12, and a rounding error due to the difference in employing the Pythagoras approximation on a logistic rather than a probability scale. Both of these almost completely disappear in favour of Type II (conservative) errors if a continuity correction is employed.

We can convert odds ratio and risk ratio to a test equivalent to a pairwise contingency test (2 × 2 χ² or Fisher test) by testing whether the resulting interval contains 1. We can then employ this insight in an evaluation using the Fisher test as a benchmark.

This finds a small increase in errors introduced by each transformation, from simple difference (Newcombe-Wilson, P scale) to risk ratio (ln(P) scale) to odds ratio (ln(odds(P)) scale). However, these errors are small compared to those removed by employing a continuity correction. Indeed, we can reduce these errors to nearly zero by multiplying Yates’s continuity correction term by a factor of between 1.5 and 1.75.

Non-independent parameters?

Notwithstanding the limitation for ln(0), the ‘algebra’ identified above is quite robust. However, it relies on observed Binomial proportions (or parameters) being independent from each other. A much more complex question concerns cases when parameters are not independent.

Consider two competing proportions, p1 = 1 – p2: there is a deterministic relationship between p1 and p2, and a single degree of freedom.

In this case, the difference d = p2 – p1 = 2p2 – 1. Since fn(p2) = 2p2 – 1 is a monotonic function, the interval for d is the Wilson score interval for p2, appropriately transformed:

d ∈ (2w2– – 1, 2w2+ – 1). (8)

Elsewhere we have converted this insight into a test.
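Equation (8) is a one-liner once the Wilson bounds are available. A sketch (the wilson_interval helper is my own):

```python
import math

def wilson_interval(p, n, z=1.96):
    """Wilson score interval (w-, w+) for proportion p, sample size n."""
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return ((centre - spread) / denom, (centre + spread) / denom)

def competing_difference_interval(p2, n, z=1.96):
    """Equation (8): interval for d = 2*p2 - 1, where p1 = 1 - p2,
    i.e. a single degree of freedom."""
    w_lo, w_hi = wilson_interval(p2, n, z)
    return (2 * w_lo - 1, 2 * w_hi - 1)
```

Because fn(p2) = 2p2 – 1 is monotonic, no Pythagorean combination of terms is needed: the single-proportion interval is simply rescaled.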

As a general rule, before we carry out any computation of intervals, we must first identify whether terms are truly independent. If they are not, we should attempt to simplify the equation into one consisting only of independent proportions (or monotonic functions of them, like odds). This analytical reduction should be the first step, simplifying the formula as far as possible into a formula consisting of independent parameters, each mentioned once only. (A good example concerns the treatment of percentage difference.)

This does not resolve every issue. Some properties are computed from more than two competing proportions, for example categorical ‘diversity’, entropy or goodness of fit ϕ scores. These have k – 1 degrees of freedom (so, more than 1 where k > 2), but no single proportion can be determined from another. Work on intervals for Multinomial properties is discussed in Confidence intervals on goodness of fit ϕ scores. We employ a version of Equation (6) that accounts for the additional interdependency of proportions.

Other examples involving this theorem include intervals for entropy, where we find that the Multinomial approximation can create excessively conservative intervals under some conditions.

In the case of Cramér’s 2 × 2 signed ϕ interval, we first show that signed 2 × 2 ϕ is the root of the product of two deterministically-related difference measures, and then employ the same root-product formula to the Newcombe-Wilson interval bounds.

In subsequent work, we have extended this method to estimating intervals on powers and logs.

Conclusions

We summarised an algebraic method for computing intervals for properties derived from independent Binomial proportions. The method combines Zou and Donner’s (2008) interval difference theorem with monotonic transformations of the Wilson interval on proportions.

The log transform turns ratios on the P scale into differences on ln(P), allowing us to apply the difference theorem to a fraction. Although the log of the Wilson distribution is not Normal, Zou and Donner comment that this does not appear to unduly impact the performance of the resulting interval for the ratio. They report that the method is more accurate than more traditional approaches which assume that variance is Normal on some scale. I have subsequently explored this claim, and evaluated ratio formulae from the perspective of alternative difference tests.

We use this approach to derive product and sum intervals. The interval for the odds ratio, a popular 2 × 2 effect size measure (like 2 × 2 ϕ), may be calculated by taking the ratio of the odds, a monotonic transform of p. In Wallis (2021: 233) we also offer a range of difference of differences tests.

We are beginning to enumerate, tentatively, an algebra of intervals, allowing us to directly calculate accurate confidence intervals on properties such as effect sizes. This algebra is limited to combinations of independent parameters, such as monotonic functions of independent proportions, and so the first step should be to try to obtain a formula in this form.

References

Newcombe, R.G. (1998). Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine, 17, 873-890.

Smithson, M. (2007). Confidence intervals. Thousand Oaks, CA: Sage Press.

Wallis, S.A. (2013). Binomial confidence intervals and contingency tests. Journal of Quantitative Linguistics, 20:3, 178-208. » Post

Wallis, S.A. (2021). Statistics in Corpus Linguistics Research. New York: Routledge. » Announcement

Wallis, S.A. (2021, forthcoming). Accurate confidence intervals on Binomial proportions, functions of proportions and other related scores. » Post

Wilson, E.B. (1927). Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association, 22, 209-212.

Zou, G.Y. & A. Donner (2008). Construction of confidence limits about effect measures: A general approach. Statistics in Medicine, 27:10, 1693-1702.
