Reciprocating the Wilson interval

Introduction

How can we calculate confidence intervals on a property like sentence length (as measured by the number of words per sentence)?

You might want to do this to find out whether, say, spoken utterances consist of shorter or longer sentences than those found in writing.

The problem is that the average number of words per sentence is not a probability. If you think about it, this ratio will (obviously) equal or exceed 1. So methods for calculating intervals on probabilities won’t work without recalibration.

Aside: You are most likely to hit this type of problem if you want to plot a graph of some non-probabilistic property, or you wish to cite a property with an upper and lower bound for some reason. Sometimes expressing something as a probability does not seem natural. However, it is a good discipline to think in terms of probabilities, and to convert your hypotheses into hypotheses about probabilities as far as possible. As we shall see, this is exactly what you have to do to apply the Wilson score interval.

Note also that if you want to calculate confidence intervals on a property, you also have to consider whether the property is freely varying when expressed as a probability.

The Wilson score interval (w⁻, w⁺) is a robust method for computing confidence intervals about probabilistic observations p.

Elsewhere we saw that the Wilson score interval is an accurate approximation to the ‘exact’ Binomial interval on an observed probability p (which must otherwise be obtained by search). It is also well-constrained, so that neither the upper nor the lower bound can fall outside the probabilistic range [0, 1].
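
For reference, the Wilson score interval can be written as follows, where z is the two-tailed critical value of the Normal distribution for error level α (z ≈ 1.96 for α = 0.05):

$$ w^-, w^+ = \left( p + \frac{z^2}{2n} \mp z \sqrt{ \frac{p(1-p)}{n} + \frac{z^2}{4n^2} } \right) \Big/ \left( 1 + \frac{z^2}{n} \right) $$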

But the Wilson interval is based on a probability. In this post we discuss how this method can be used for other quantities.

Reciprocating the Wilson interval

Let us return to our initial question. How might we calculate confidence intervals on a property like the number of words per sentence?

Let’s call the length l. The ‘trick’ is to take the reciprocal of the property, p = 1/l, which is a probability. We can then calculate Wilson intervals on “the number of sentences per word” or, perhaps more meaningfully, the proportion of all words which are the initial words of sentences.

If this sounds a bit odd, consider the following.

  • Suppose there are l = 10 words in a sentence.
  • The probability of selecting the first word in the sentence at random (all else being equal) is p = 1/l = 1/10.
  • We can calculate the Wilson score interval for p as (w⁻, w⁺).
  • The confidence interval for l = 1/p is then simply (1/w⁺, 1/w⁻).
  • The inverse function of the reciprocal is also the reciprocal, i.e. if p = 1/l, then l = 1/p.

This method works because of an important property of the reciprocal function (1/p). It is monotonic, which means that it either always increases as p increases, or always decreases as p increases. (Since the reciprocal function actually gets smaller with increasing p, we swap the interval bounds around so the smaller number is stated first.)
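
By way of illustration, here is a minimal Python sketch of this procedure. (The function names and structure are mine, not taken from the original spreadsheet.)

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(p, n, alpha=0.05):
    """Wilson score interval (w-, w+) for an observed proportion p out of n."""
    z = NormalDist().inv_cdf(1 - alpha / 2)      # two-tailed critical value, ~1.95996
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom        # adjusted centre p'
    spread = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - spread, centre + spread

def length_interval(l, n, alpha=0.05):
    """Interval for a mean length l (e.g. words per sentence), via p = 1/l.

    n is the number of words (the denominator of p). Because 1/p is
    monotonic decreasing, the bounds swap: the interval is (1/w+, 1/w-).
    """
    w_minus, w_plus = wilson_interval(1 / l, n, alpha)
    return 1 / w_plus, 1 / w_minus
```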

We return to what this means in more detail below.

Some example data

The following data was taken from ICE-GB using ICECUP. We have three data columns: number of parse units (parsed ‘sentences’) per subcorpus, number of clauses and number of words. We also have two ratio columns: the number of words per parse unit and the number of words per clause.

              parse units    clauses      words   l = words/PU   words/CL
dialogue           43,894     57,161    374,516         8.5323     6.5519
mixed               2,443      5,648     43,632        17.8600     7.7252
monologue          13,133     27,613    225,184        17.1464     8.1550
spoken             59,470     90,422    643,332        10.8178     7.1148
non-printed         6,836     14,007    114,362        16.7294     8.1646
printed            17,099     40,750    359,634        21.0325     8.8254
written            23,935     54,757    473,996        19.8035     8.6564
TOTAL              83,405    145,179  1,117,328        13.3964     7.6962

Table 1. Raw frequencies and ratios for the number of words per sentence and clause in ICE-GB subcorpora. Raw data and calculations are in this Excel spreadsheet.

Let us now compute confidence intervals on l, the words/PU column, with an error level α = 0.05. To do this, take the reciprocal, i.e. p = 1/l = PUs/word, and n = number of words.
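
The whole of Table 2 below can be reproduced with a short loop over the subcorpora, reusing the wilson_interval function sketched above (the frequencies are those of Table 1):

```python
# (subcorpus, parse units, words) from Table 1
rows = [
    ("dialogue",    43_894,   374_516),
    ("mixed",        2_443,    43_632),
    ("monologue",   13_133,   225_184),
    ("spoken",      59_470,   643_332),
    ("non-printed",  6_836,   114_362),
    ("printed",     17_099,   359_634),
    ("written",     23_935,   473_996),
    ("TOTAL",       83_405, 1_117_328),
]

for name, pus, words in rows:
    p = pus / words                      # p = 1/l = parse units per word
    w_minus, w_plus = wilson_interval(p, words)
    print(f"{name:<12} p={p:.4f}  w-={w_minus:.4f}  w+={w_plus:.4f}")
```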

                   p          n    z²/n      p′    z·s′      w⁻      w⁺
dialogue      0.1172    374,516  0.0000  0.1172  0.0010  0.1162  0.1182
mixed         0.0560     43,632  0.0001  0.0560  0.0022  0.0539  0.0582
monologue     0.0583    225,184  0.0000  0.0583  0.0010  0.0574  0.0593
spoken        0.0924    643,332  0.0000  0.0924  0.0007  0.0917  0.0932
non-printed   0.0598    114,362  0.0000  0.0598  0.0014  0.0584  0.0612
printed       0.0475    359,634  0.0000  0.0476  0.0007  0.0469  0.0482
written       0.0505    473,996  0.0000  0.0505  0.0006  0.0499  0.0511
TOTAL         0.0746  1,117,328  0.0000  0.0746  0.0005  0.0742  0.0751

Table 2. Calculation of Wilson score intervals for 1/(words/PU) = parse units per word. (p′ is the adjusted interval centre and z·s′ the interval half-width, so that w⁻ = p′ − z·s′ and w⁺ = p′ + z·s′.)

The interval for p, the number of parse units per word, is not what we wanted, but it is a necessary intermediate step.

We can now take the reciprocal of this interval, l = 1/p, to get back to where we started, and plot the graph.

              words/PU     1/w⁻     1/w⁺
dialogue        8.5323   8.6077   8.4577
mixed          17.8600  18.5623  17.1858
monologue      17.1464  17.4335  16.8644
spoken         10.8178  10.9009  10.7353
non-printed    16.7294  17.1186  16.3495
printed        21.0325  21.3425  20.7271
written        19.8035  20.0495  19.5606
TOTAL          13.3964  13.4842  13.3093

Table 3. Computing the reciprocal of the Wilson interval.

Note that because l = 1/p declines with increasing p, 1/w⁺ is less than 1/w⁻. The inverted interval is (1/w⁺, 1/w⁻).
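
As a check, feeding the dialogue row through the length_interval sketch above reproduces these figures to four decimal places:

```python
# dialogue: l = 374,516 words / 43,894 parse units ≈ 8.5323
low, high = length_interval(374_516 / 43_894, 374_516)
print(f"({low:.4f}, {high:.4f})")   # -> (8.4577, 8.6077), as in Table 3
```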

This is what the graph looks like with intervals added.

Figure: Ratio of number of words to parse unit (‘sentence’) in ICE-GB subcorpora, with inverse Wilson score intervals. (Note that we have cropped the y axis so it does not start at zero.)

The interpretation of overlapping intervals on this graph is exactly the same as for standard Wilson score interval graphs:

  • non-overlapping intervals = significant difference,
  • one interval contains the other’s central point = non-significant difference, and
  • for everything else, carry out a Newcombe-Wilson test on p = 1/l.

We can check whether the ratios of words to parse units in the “mixed” and “monologue” subcorpora are significantly different using a Newcombe-Wilson test. In this case the difference is non-significant at the α = 0.05 level.
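
Here is a sketch of that test in the same vein, reusing wilson_interval from earlier. (The combined bounds follow Newcombe’s method of pairing independent Wilson intervals, with widths summed in quadrature; the function name is my own.)

```python
from math import sqrt

def newcombe_wilson_significant(p1, n1, p2, n2, alpha=0.05):
    """True if the difference d = p2 - p1 is significant at the given error level.

    The inner comparison interval combines the Wilson bounds of the two
    independent observations, adding the squared interval widths.
    """
    w1_minus, w1_plus = wilson_interval(p1, n1, alpha)
    w2_minus, w2_plus = wilson_interval(p2, n2, alpha)
    d = p2 - p1
    lower = -sqrt((p1 - w1_minus) ** 2 + (w2_plus - p2) ** 2)
    upper = sqrt((w1_plus - p1) ** 2 + (p2 - w2_minus) ** 2)
    return d < lower or d > upper

# "mixed" vs "monologue" (parse units per word): non-significant, as stated above
print(newcombe_wilson_significant(2_443 / 43_632, 43_632,
                                  13_133 / 225_184, 225_184))  # -> False
```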

In the first table above we also included the ratio of words per clause (CL). To avoid repetition, we have not presented the corresponding calculation and graph here, but it is included in an Excel spreadsheet containing the raw data.

Theorem

We can prove a useful general theorem which allows us to use the Wilson interval for properties other than probabilities.

For any function of p, f(p), that is monotonic over the range of p ∈ [0, 1], the Wilson interval for f(p) is

Wilson(f(p)) ≡ (f(w⁻), f(w⁺)) if f increases with p, or
Wilson(f(p)) ≡ (f(w⁺), f(w⁻)) otherwise.

Note: The term monotonic means that the function always increases with its parameter (the gradient d(f(x))/dx > 0) or always decreases with its parameter (the gradient < 0).

The slope of a sloping roof is monotonic. The top of a roof (or a flat roof) is not! The gradient (slope) of a monotonic function may change, but it may not become horizontal or change direction from positive to negative (or vice-versa).

Other example monotonic functions include any constant multiple of p, e.g. 5p (a score on a scale from 0 to 5), the alternate probability q = 1 – p, any power of p such as p², and so on. (For more example transformations see this PDF ‘cheat sheet’.)
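
To make the theorem concrete: q = 1 − p decreases with p, so its bounds swap, whereas p² increases on [0, 1], so its bounds do not:

$$ \mathrm{Wilson}(1-p) \equiv (1-w^+,\ 1-w^-), \qquad \mathrm{Wilson}(p^2) \equiv ((w^-)^2,\ (w^+)^2) $$

Using the dialogue row of Table 2 above, the interval for q = 0.8828 is (1 − 0.1182, 1 − 0.1162) = (0.8818, 0.8838).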

The function must always behave in this monotonic way over the probability range (it doesn’t matter what it does for p < 0 or p > 1). For example, across all values of x, x² is non-monotonic. However, as long as p ∈ [0, 1], p² is monotonic.

Figure: Some monotonic functions, including p² and 1/p. Note that a function with a negative gradient, such as 1/p, will flip the upper and lower bounds.

Importantly, the logistic function (that defines an ‘S’ curve) is monotonic. See this short paper for more on the relationship between the Wilson score interval and the logistic function.

Every monotonic function has a single-valued inverse, and this inverse is also monotonic. As a result, we can compute an interval on a monotonic function of p by simply computing the Wilson interval on p and then applying the function to the interval bounds (swapping them if the function is decreasing).

Note that even though 1/p is infinite when p = 0, it is still possible to apply the Wilson score interval to p and report the reciprocals of the bounds. (An observed p = 0 gives a Wilson lower bound w⁻ = 0, so the upper bound of l = 1/p is simply unbounded.)
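
Putting the theorem to work, here is a small generic helper, again my own sketch rather than anything from the source, which applies any monotonic f to a Wilson interval:

```python
def wilson_transformed(f, p, n, alpha=0.05):
    """Interval for f(p), for any f monotonic on [0, 1].

    Applying f to both Wilson bounds and sorting handles decreasing
    functions (such as f(p) = 1/p), whose bounds must be swapped.
    """
    w_minus, w_plus = wilson_interval(p, n, alpha)
    return tuple(sorted((f(w_minus), f(w_plus))))

# Examples using the dialogue row of Table 2:
print(wilson_transformed(lambda x: 1 / x, 0.1172, 374_516))   # the ratio l = 1/p
print(wilson_transformed(lambda x: x ** 2, 0.1172, 374_516))  # p squared
```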

By way of comparison, here are two examples of non-monotonic functions.

Figure: Two non-monotonic functions. In the lower curve, f(p) = (p – 0.5)², different values of p obtain the same value of f(p). The upper stepped function includes a plateau where a range of values of p obtains the same value of f(p).

Citation (book version)

Wallis, S.A. 2021. Reciprocating the Wilson Interval. Chapter 10 in Wallis, S.A. Statistics in Corpus Linguistics Research. New York: Routledge. 171-177.

4 thoughts on “Reciprocating the Wilson interval”

  1. Intriguing stuff! Can I please ask for some clarification regarding the exactitude of the above relationships? I absolutely see how the reciprocal confidence intervals are generated, and this makes good intuitive sense, but I am wondering how well the reciprocal (or related) intervals perform in practice. For example, if the Wilson score is giving excellent performance at the p level, is that same level of performance guaranteed for f(p)? Are there cases where these checks have been done? The main reason I am asking is that I have implemented some f(p) intervals and tested them via simulated numerical tests and I don’t seem to be getting great performance in all cases (e.g. for 95% intervals I may find 10%+ of the time that the true value lies outside the interval). This may be an error on my part though, hence why I am asking 🙂 It may be this is discussed in the textbook (which I have on order). Any insight you can provide would be much appreciated!

    1. Good question. The simple answer is that, for a monotonic function f(p), the interval must perform exactly as well as the interval on p. Imagine drawing a closed shape, a circle say, on a deflated balloon, then drawing a dot inside it and a dot outside it. Now inflate the balloon. The new shape may be distorted, but the dot that was inside must still be inside, and the dot outside must still be outside.

      1. Thanks for the fast (and helpful!) replies! I am still bottoming things out a bit, but I think the issue was simply that I wasn’t doing enough runs to get to converged results concerning testing of confidence interval performance (I was running 50-200 example cases, whereas it looks like convergence doesn’t start to kick in until a few thousand examples are run). The transform seems to play a role here also… while performance does appear to behave exactly as you describe, a more severe transformation seems to require more numerical tests to reach convergence of the test metric (this being the proportion lying outside the interval). I will try and remember to post again to confirm that this is indeed where I was going wrong.

        Thanks again!

    2. My advice would be to use the Wilson score interval with continuity correction or the Clopper-Pearson interval as a starting point if you want good performance for small n.

      One caveat: if you are seeing what appear to be failures to replicate, you should check that the proportion p is actually Binomial, i.e. free to vary from 0 to 1 and randomly sampled (extrapolated to a very large, if not infinite, population). In other words, if the property does not actually behave as the Binomial model predicts, why is that the case?
