Introduction
How can we calculate confidence intervals on a property like sentence length (as measured by the number of words per sentence)?
You might want to do this to find out whether or not, say, spoken utterances consist of shorter or longer sentences than those found in writing.
The problem is that the average number of words per sentence is not a probability. If you think about it, this ratio will (obviously) equal or exceed 1. So methods for calculating intervals on probabilities won’t work without recalibration.
Aside: You are most likely to hit this type of problem if you want to plot a graph of some non-probabilistic property, or you wish to cite a property with an upper and lower bound for some reason. Sometimes expressing something as a probability does not seem natural. However, it is a good discipline to think in terms of probabilities, and to convert your hypotheses into hypotheses about probabilities as far as possible. As we shall see, this is exactly what you have to do to apply the Wilson score interval.
Note also that wanting to calculate confidence intervals on a property is not enough: you also have to consider whether the property is freely varying when expressed as a probability.
The Wilson score interval (w⁻, w⁺) is a robust method for computing confidence intervals about probabilistic observations p.
Elsewhere we saw that the Wilson score interval closely approximates the ‘exact’ Binomial interval for an observed probability p, which can only be obtained by search. It is also well-constrained, so that neither the upper nor the lower bound can fall outside the probabilistic range [0, 1].
But the Wilson interval is based on a probability. In this post we discuss how this method can be used for other quantities.
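For readers who want to follow the calculations, here is a minimal Python sketch of the Wilson score interval without continuity correction. The function name `wilson_interval` and the default critical value z ≈ 1.959964 (for α = 0.05) are our own choices for illustration, not part of any particular library.

```python
import math

def wilson_interval(p, n, z=1.959964):
    """Wilson score interval (w_minus, w_plus) for an observed
    proportion p based on n cases.

    z is the two-tailed critical value of the Normal distribution;
    1.959964 corresponds to an error level alpha = 0.05.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    z2_n = z * z / n
    # Adjusted centre of the interval.
    centre = (p + z2_n / 2) / (1 + z2_n)
    # Half-width of the interval; note z²/(4n²) = (z²/n)/(4n).
    spread = z * math.sqrt(p * (1 - p) / n + z2_n / (4 * n)) / (1 + z2_n)
    return centre - spread, centre + spread

print(wilson_interval(0.5, 10))
```

For p = 0.5 and n = 10 this gives an interval of roughly (0.237, 0.763), which stays inside [0, 1] as the text describes.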
Reciprocating the Wilson interval
Let us return to our initial question. How might we calculate confidence intervals on a property like the number of words per sentence?
Let’s call the length l. In this case the ‘trick’ is to take the reciprocal of the property (p = 1/l), which is a probability p. We are able to calculate Wilson intervals on “the number of sentences per word”, or, perhaps more meaningfully, the proportion of all words which are initial words in sentences.
If this sounds a bit odd, consider the following.
Suppose there are l = 10 words in a sentence.
The probability of selecting the first word in the sentence at random (assuming everything else is equal) is p = 1/l = 1/10.
We can calculate the Wilson score interval for p as (w⁻, w⁺).
The confidence interval for l = 1/p is simply (1/w⁺, 1/w⁻).
The inverse function of the reciprocal is also the reciprocal, i.e. if p = 1/l, then l = 1/p.
This method works because of an important property of the reciprocal function (1/p). It is monotonic, which means that it either always increases as p increases, or always decreases as p increases. (Since the reciprocal function actually gets smaller with increasing p, we swap the interval bounds around so the smaller number is stated first.)
We return to what this means in more detail below.
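The steps above can be sketched in Python as follows. The sample size is an assumption made purely for illustration: we imagine a hypothetical sample of 200 words containing 20 sentences, so that l = 10. The helper names are ours.

```python
import math

def wilson_interval(p, n, z=1.959964):
    """Wilson score interval for proportion p out of n (alpha = 0.05 by default)."""
    z2_n = z * z / n
    centre = (p + z2_n / 2) / (1 + z2_n)
    spread = z * math.sqrt(p * (1 - p) / n + z2_n / (4 * n)) / (1 + z2_n)
    return centre - spread, centre + spread

def length_interval(words, sentences, z=1.959964):
    """Interval on mean sentence length l = words / sentences.

    Works on p = 1/l (sentences per word, with n = words) and then
    takes reciprocals, swapping the bounds because 1/p decreases
    as p increases. Assumes sentences > 0, so that w_minus > 0.
    """
    p = sentences / words
    w_minus, w_plus = wilson_interval(p, words, z)
    return 1 / w_plus, 1 / w_minus

# Hypothetical sample: 200 words in 20 sentences, so l = 10.
lo, hi = length_interval(200, 20)
print(f"l = 10, interval = ({lo:.2f}, {hi:.2f})")
```

Notice that the resulting interval is not symmetric about l = 10: the reciprocal stretches the upper side more than the lower.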
Some example data
The following data was taken from ICE-GB using ICECUP. We have three data columns: number of parse units (parsed ‘sentences’) per subcorpus, number of clauses and number of words. We also have two ratio columns: the number of words per parse unit and the number of words per clause.
subcorpus | parse units | clauses | words | l = words/PU | words/CL |
dialogue | 43,894 | 57,161 | 374,516 | 8.5323 | 6.5519 |
mixed | 2,443 | 5,648 | 43,632 | 17.8600 | 7.7252 |
monologue | 13,133 | 27,613 | 225,184 | 17.1464 | 8.1550 |
spoken | 59,470 | 90,422 | 643,332 | 10.8178 | 7.1148 |
non-printed | 6,836 | 14,007 | 114,362 | 16.7294 | 8.1646 |
printed | 17,099 | 40,750 | 359,634 | 21.0325 | 8.8254 |
written | 23,935 | 54,757 | 473,996 | 19.8035 | 8.6564 |
TOTAL | 83,405 | 145,179 | 1,117,328 | 13.3964 | 7.6962 |
Let us now compute confidence intervals on l, the words/PU column, with an error level α = 0.05. To do this, take the reciprocal, i.e. p = 1/l = PUs/word, and n = number of words.
subcorpus | p | n | z²/n | p’ | z.s’ | w⁻ | w⁺ |
dialogue | 0.1172 | 374,516 | 0.0000 | 0.1172 | 0.0010 | 0.1162 | 0.1182 |
mixed | 0.0560 | 43,632 | 0.0001 | 0.0560 | 0.0022 | 0.0539 | 0.0582 |
monologue | 0.0583 | 225,184 | 0.0000 | 0.0583 | 0.0010 | 0.0574 | 0.0593 |
spoken | 0.0924 | 643,332 | 0.0000 | 0.0924 | 0.0007 | 0.0917 | 0.0932 |
non-printed | 0.0598 | 114,362 | 0.0000 | 0.0598 | 0.0014 | 0.0584 | 0.0612 |
printed | 0.0475 | 359,634 | 0.0000 | 0.0476 | 0.0007 | 0.0469 | 0.0482 |
written | 0.0505 | 473,996 | 0.0000 | 0.0505 | 0.0006 | 0.0499 | 0.0511 |
TOTAL | 0.0746 | 1,117,328 | 0.0000 | 0.0746 | 0.0005 | 0.0742 | 0.0751 |
The interval for p, the number of parse units per word, is not what we wanted, but it is a necessary intermediate step.
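To show how the columns of this table relate to one another, here is a short sketch that recomputes the ‘dialogue’ row. We are assuming that p’ is the adjusted centre of the Wilson interval and z.s’ its half-width, and that z ≈ 1.959964; the variable names are ours.

```python
import math

z = 1.959964                            # critical value for alpha = 0.05
parse_units, words = 43_894, 374_516    # 'dialogue' row of the first table

p = parse_units / words                 # observed proportion, PUs per word
n = words
z2_n = z * z / n                        # the z²/n column
# Assumed meaning of p': the adjusted centre of the interval.
p_adj = (p + z2_n / 2) / (1 + z2_n)
# Assumed meaning of z.s': the half-width of the interval.
z_s = z * math.sqrt(p * (1 - p) / n + z2_n / (4 * n)) / (1 + z2_n)
w_minus, w_plus = p_adj - z_s, p_adj + z_s

print(f"{p:.4f} {z2_n:.4f} {p_adj:.4f} {z_s:.4f} {w_minus:.4f} {w_plus:.4f}")
```

With such a large n, z²/n is tiny, so p’ is almost identical to p and the interval is very nearly symmetric.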
We can now take the reciprocal of this interval, l = 1/p, to get back to where we started, and plot the graph.
subcorpus | words/PU | 1/w⁻ | 1/w⁺ |
dialogue | 8.5323 | 8.6077 | 8.4577 |
mixed | 17.8600 | 18.5623 | 17.1858 |
monologue | 17.1464 | 17.4335 | 16.8644 |
spoken | 10.8178 | 10.9009 | 10.7353 |
non-printed | 16.7294 | 17.1186 | 16.3495 |
printed | 21.0325 | 21.3425 | 20.7271 |
written | 19.8035 | 20.0495 | 19.5606 |
TOTAL | 13.3964 | 13.4842 | 13.3093 |
Note that because l = 1/p declines with increasing p, 1/w⁺ is less than 1/w⁻. The inverted interval is (1/w⁺, 1/w⁻).
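The whole calculation, from raw frequencies to the inverted interval, might be sketched like this. The parse unit and word counts are copied from the first table; the function names and the choice of z ≈ 1.959964 are our assumptions.

```python
import math

def wilson_interval(p, n, z=1.959964):
    """Wilson score interval for proportion p out of n (alpha = 0.05 by default)."""
    z2_n = z * z / n
    centre = (p + z2_n / 2) / (1 + z2_n)
    spread = z * math.sqrt(p * (1 - p) / n + z2_n / (4 * n)) / (1 + z2_n)
    return centre - spread, centre + spread

# (parse units, words) per subcorpus, from the first table.
data = {
    "dialogue":    (43_894, 374_516),
    "mixed":       (2_443, 43_632),
    "monologue":   (13_133, 225_184),
    "spoken":      (59_470, 643_332),
    "non-printed": (6_836, 114_362),
    "printed":     (17_099, 359_634),
    "written":     (23_935, 473_996),
    "TOTAL":       (83_405, 1_117_328),
}

def words_per_pu_interval(parse_units, words):
    """Inverted interval (1/w_plus, 1/w_minus) on l = words per parse unit."""
    w_minus, w_plus = wilson_interval(parse_units / words, words)
    return 1 / w_plus, 1 / w_minus

for name, (pu, words) in data.items():
    lo, hi = words_per_pu_interval(pu, words)
    print(f"{name:12s} {words / pu:8.4f}  ({lo:.4f}, {hi:.4f})")
```

Each printed interval is stated with the smaller bound first, i.e. (1/w⁺, 1/w⁻).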
This is what the graph looks like with intervals added.
The interpretation of overlapping intervals on this graph is exactly the same as for standard Wilson score interval graphs:
- non-overlapping intervals = significant difference,
- one interval overlapping the other’s central point = non-significant difference, and
- for everything else, carry out a Newcombe-Wilson test on p = 1/l.
We can check whether the ratios of words to parse units in the “mixed” and “monologue” subcorpora are significantly different using a Newcombe-Wilson test. In this case the difference is non-significant at the α = 0.05 level.
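A sketch of this check is given below. It assumes the usual form of the Newcombe-Wilson test without continuity correction, in which the interval for the difference is built from the inner widths of the two Wilson intervals. The test is carried out on p = 1/l, not on l itself. Function names are ours.

```python
import math

def wilson_interval(p, n, z=1.959964):
    """Wilson score interval for proportion p out of n (alpha = 0.05 by default)."""
    z2_n = z * z / n
    centre = (p + z2_n / 2) / (1 + z2_n)
    spread = z * math.sqrt(p * (1 - p) / n + z2_n / (4 * n)) / (1 + z2_n)
    return centre - spread, centre + spread

def newcombe_wilson(p1, n1, p2, n2, z=1.959964):
    """Return (d, lower, upper) where d = p2 - p1 and (lower, upper) is a
    zero-centred Newcombe-Wilson interval. The difference is treated as
    significant if d falls outside (lower, upper)."""
    l1, u1 = wilson_interval(p1, n1, z)
    l2, u2 = wilson_interval(p2, n2, z)
    d = p2 - p1
    lower = -math.sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = math.sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return d, lower, upper

# 'mixed' and 'monologue': p = parse units / words, n = words.
p_mixed, n_mixed = 2_443 / 43_632, 43_632
p_mono, n_mono = 13_133 / 225_184, 225_184

d, lower, upper = newcombe_wilson(p_mixed, n_mixed, p_mono, n_mono)
significant = d < lower or d > upper
print(f"d = {d:.5f}, interval = ({lower:.5f}, {upper:.5f}), significant: {significant}")
```

On these figures d falls just inside the interval, in line with the non-significant result reported above.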
In the first table above we also included the ratio of words per clause (CL). To avoid repetition, we have not presented the corresponding calculation and graph here, but it is included in an Excel spreadsheet containing the raw data.
Theorem
We can prove a useful general theorem which allows us to use the Wilson interval for properties other than probabilities.
For any function of p, f(p), that is monotonic over the range of p ∈ [0, 1], the Wilson interval for f(p) is
Wilson (f(p)) ≡ (f(w⁻), f(w⁺)) if f increases with p or
Wilson (f(p)) ≡ (f(w⁺), f(w⁻)) otherwise.
Note: The term monotonic means that the function always increases with its parameter (the gradient d(f(x))/dx > 0) or always decreases with its parameter (the gradient < 0).
The slope of a sloping roof is monotonic. The top of a roof (or a flat roof) is not! The gradient (slope) of a monotonic function may change, but it may not become horizontal or change direction from positive to negative (or vice-versa).
Other example monotonic functions include any constant multiple of p, e.g. 5p (a score on a scale from 0 to 5), the alternate probability q = 1 – p, any power of p such as p², and so on. (For more example transformations see this PDF ‘cheat sheet’.)
The function must always behave in this monotonic way over the probability range (it doesn’t matter what it does for p < 0 or p > 1). For example, across all values of x, x² is non-monotonic. However, as long as p ∈ [0, 1], p² is monotonic.
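If you are unsure whether a transformation qualifies, a rough numerical check over the probability range can help. The sketch below simply samples the function on an evenly spaced grid, so it is a heuristic, not a proof: a function could misbehave between grid points. The name `is_monotonic_on` is ours.

```python
def is_monotonic_on(f, lo, hi, steps=1000):
    """Heuristic check that f strictly increases or strictly decreases
    across [lo, hi], sampled at steps + 1 evenly spaced points."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    ys = [f(x) for x in xs]
    diffs = [b - a for a, b in zip(ys, ys[1:])]
    return all(d > 0 for d in diffs) or all(d < 0 for d in diffs)

square = lambda x: x * x

print(is_monotonic_on(square, 0, 1))    # within the probability range
print(is_monotonic_on(square, -1, 1))   # across negative and positive x
```

The first call reports True and the second False, matching the observation that x² is only monotonic once we restrict it to p ∈ [0, 1].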
Importantly, the logistic function (that defines an ‘S’ curve) is monotonic. See this short paper for more on the relationship between the Wilson score interval and the logistic function.
Every monotonic function can be inverted to yield a single solution, and this inverse is also monotonic. As a result we can compute an interval on a monotonic function of p by simply computing the interval on p and then applying the function to each bound, swapping the bounds if the function decreases with p.
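The theorem translates into a very small amount of code. The sketch below is one possible reading, with names of our own invention and a hypothetical observation of p = 0.3 from n = 50 cases. It assumes the caller supplies a function that really is monotonic over [0, 1] and is defined at both Wilson bounds.

```python
import math

def wilson_interval(p, n, z=1.959964):
    """Wilson score interval for proportion p out of n (alpha = 0.05 by default)."""
    z2_n = z * z / n
    centre = (p + z2_n / 2) / (1 + z2_n)
    spread = z * math.sqrt(p * (1 - p) / n + z2_n / (4 * n)) / (1 + z2_n)
    return centre - spread, centre + spread

def transformed_interval(f, p, n, z=1.959964):
    """Interval on f(p) for a function f that is monotonic over [0, 1].

    Applies f to both Wilson bounds and states the smaller value first,
    which covers both the increasing and the decreasing case. It does
    NOT verify monotonicity: that remains the caller's responsibility.
    """
    w_minus, w_plus = wilson_interval(p, n, z)
    a, b = f(w_minus), f(w_plus)
    return (a, b) if a <= b else (b, a)

p, n = 0.3, 50   # hypothetical observation
print(transformed_interval(lambda x: 5 * x, p, n))    # score out of 5
print(transformed_interval(lambda x: 1 - x, p, n))    # q = 1 - p
print(transformed_interval(lambda x: x ** 2, p, n))   # p squared
```

Sorting the two transformed bounds is a convenience. It would silently give a misleading answer for a non-monotonic function, which is why the monotonicity condition matters.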
Note that even though 1/p is infinite when p = 0, it is still possible to apply the Wilson score interval to p and report the reciprocal of the bounds.
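Here is a sketch of how this edge case might be handled in code. When p = 0 the lower Wilson bound is algebraically zero, so we report the corresponding bound on 1/p as infinite and avoid dividing by zero. The explicit treatment of p = 0 and the function names are our own choices.

```python
import math

def wilson_interval(p, n, z=1.959964):
    """Wilson score interval for proportion p out of n (alpha = 0.05 by default)."""
    z2_n = z * z / n
    centre = (p + z2_n / 2) / (1 + z2_n)
    spread = z * math.sqrt(p * (1 - p) / n + z2_n / (4 * n)) / (1 + z2_n)
    # At p = 0 the lower bound is exactly zero algebraically; set it
    # explicitly so floating-point rounding cannot leave a tiny residue.
    w_minus = 0.0 if p == 0 else centre - spread
    return w_minus, centre + spread

def reciprocal_interval(p, n, z=1.959964):
    """Interval (1/w_plus, 1/w_minus) on 1/p, with an infinite upper
    bound when w_minus is zero."""
    w_minus, w_plus = wilson_interval(p, n, z)
    upper = 1 / w_minus if w_minus > 0 else math.inf
    return 1 / w_plus, upper

print(reciprocal_interval(0.0, 100))
```

With no observed cases out of n = 100, the lower bound on 1/p is still finite (about 27), while the upper bound is unbounded.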
By way of comparison, here are two examples of non-monotonic functions.