Reciprocating the Wilson interval

Introduction

How can we calculate confidence intervals on a property like sentence length (as measured by the number of words per sentence)?

You might want to do this to find out whether, say, spoken utterances consist of shorter or longer sentences than those found in writing.

The problem is that the average number of words per sentence is not a probability. If you think about it, this ratio will (obviously) equal or exceed 1. So methods for calculating intervals on probabilities won’t work without recalibration.

Aside: You are most likely to hit this type of problem if you want to plot a graph of some non-probabilistic property, or you wish to cite a property with an upper and lower bound for some reason. Sometimes expressing something as a probability does not seem natural. However, it is a good discipline to think in terms of probabilities, and to convert your hypotheses into hypotheses about probabilities as far as possible. As we shall see, this is exactly what you have to do to apply the Wilson score interval.

Note also that if you want to calculate confidence intervals on a property, you also have to consider whether the property is freely varying when expressed as a probability.

The Wilson score interval (w⁻, w⁺) is a robust method for computing confidence intervals about probabilistic observations p.

Elsewhere we saw that the Wilson score interval is an accurate approximation to the ‘exact’ Binomial interval on an observed probability p (which must otherwise be obtained by search). It is also well-constrained, so that neither the upper nor the lower bound can fall outside the probabilistic range [0, 1].
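
For reference, the Wilson score interval can be written as follows, where z is the two-tailed critical value of the Normal distribution for error level α (z ≈ 1.96 for α = 0.05):

$$ w^-, w^+ = \left( p + \frac{z^2}{2n} \mp z \sqrt{ \frac{p(1-p)}{n} + \frac{z^2}{4n^2} } \right) \Big/ \left( 1 + \frac{z^2}{n} \right) $$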

But the Wilson interval is based on a probability. In this post we discuss how this method can be used for other quantities.

Reciprocating the Wilson interval

Let us return to our initial question. How might we calculate confidence intervals on a property like the number of words per sentence?

Let’s call the length l. The ‘trick’ is to take the reciprocal of the property, p = 1/l, which is a probability. We can then calculate Wilson intervals on “the number of sentences per word” or, perhaps more meaningfully, the proportion of all words which are the initial words of sentences.

If this sounds a bit odd, consider the following.

  • Suppose there are l = 10 words in a sentence.
  • The probability of selecting the first word in the sentence at random (all else being equal) is p = 1/l = 1/10.
  • We can calculate the Wilson score interval for p as (w⁻, w⁺).
  • The confidence interval for l = 1/p is then simply (1/w⁺, 1/w⁻).
  • The inverse function of the reciprocal is also the reciprocal, i.e. if p = 1/l, then l = 1/p.

This method works because of an important property of the reciprocal function (1/p). It is monotonic, which means that it either always increases as p increases, or always decreases as p increases. (Since the reciprocal function actually gets smaller with increasing p, we swap the interval bounds around so the smaller number is stated first.)
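
By way of illustration, here is a minimal Python sketch of this procedure. (The function names and structure are mine, not taken from the original spreadsheet.)

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(p, n, alpha=0.05):
    """Wilson score interval (w-, w+) for an observed proportion p out of n."""
    z = NormalDist().inv_cdf(1 - alpha / 2)      # two-tailed critical value, ~1.95996
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom        # adjusted centre p'
    spread = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - spread, centre + spread

def length_interval(l, n, alpha=0.05):
    """Interval for a mean length l (e.g. words per sentence), via p = 1/l.

    n is the number of words (the denominator of p). Because 1/p is
    monotonic decreasing, the bounds swap: the interval is (1/w+, 1/w-).
    """
    w_minus, w_plus = wilson_interval(1 / l, n, alpha)
    return 1 / w_plus, 1 / w_minus
```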

We return to what this means in more detail below.

Some example data

The following data was taken from ICE-GB using ICECUP. We have three data columns: number of parse units (parsed ‘sentences’) per subcorpus, number of clauses and number of words. We also have two ratio columns: the number of words per parse unit and the number of words per clause.

              parse units    clauses      words   l = words/PU   words/CL
dialogue           43,894     57,161    374,516         8.5323     6.5519
mixed               2,443      5,648     43,632        17.8600     7.7252
monologue          13,133     27,613    225,184        17.1464     8.1550
spoken             59,470     90,422    643,332        10.8178     7.1148
non-printed         6,836     14,007    114,362        16.7294     8.1646
printed            17,099     40,750    359,634        21.0325     8.8254
written            23,935     54,757    473,996        19.8035     8.6564
TOTAL              83,405    145,179  1,117,328        13.3964     7.6962

Table 1. Raw frequencies and ratios for the number of words per sentence and clause in ICE-GB subcorpora. Raw data and calculations are in this Excel spreadsheet.

Let us now compute confidence intervals on l, the words/PU column, with an error level α = 0.05. To do this, take the reciprocal, i.e. p = 1/l = PUs/word, and n = number of words.
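
The whole of Table 2 below can be reproduced with a short loop over the subcorpora, reusing the wilson_interval function sketched above (the frequencies are those of Table 1):

```python
# (subcorpus, parse units, words) from Table 1
rows = [
    ("dialogue",    43_894,   374_516),
    ("mixed",        2_443,    43_632),
    ("monologue",   13_133,   225_184),
    ("spoken",      59_470,   643_332),
    ("non-printed",  6_836,   114_362),
    ("printed",     17_099,   359_634),
    ("written",     23_935,   473_996),
    ("TOTAL",       83_405, 1_117_328),
]

for name, pus, words in rows:
    p = pus / words                      # p = 1/l = parse units per word
    w_minus, w_plus = wilson_interval(p, words)
    print(f"{name:<12} p={p:.4f}  w-={w_minus:.4f}  w+={w_plus:.4f}")
```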

                   p          n    z²/n      p′    z·s′      w⁻      w⁺
dialogue      0.1172    374,516  0.0000  0.1172  0.0010  0.1162  0.1182
mixed         0.0560     43,632  0.0001  0.0560  0.0022  0.0539  0.0582
monologue     0.0583    225,184  0.0000  0.0583  0.0010  0.0574  0.0593
spoken        0.0924    643,332  0.0000  0.0924  0.0007  0.0917  0.0932
non-printed   0.0598    114,362  0.0000  0.0598  0.0014  0.0584  0.0612
printed       0.0475    359,634  0.0000  0.0476  0.0007  0.0469  0.0482
written       0.0505    473,996  0.0000  0.0505  0.0006  0.0499  0.0511
TOTAL         0.0746  1,117,328  0.0000  0.0746  0.0005  0.0742  0.0751

Table 2. Calculation of Wilson score intervals for 1/(words/PU) = parse units per word. (p′ is the adjusted interval centre and z·s′ the interval half-width, so that w⁻ = p′ − z·s′ and w⁺ = p′ + z·s′.)

The interval for p, the number of parse units per word, is not what we wanted, but it is a necessary intermediate step.

We can now take the reciprocal of this interval, l = 1/p, to get back to where we started, and plot the graph.

              words/PU     1/w⁻     1/w⁺
dialogue        8.5323   8.6077   8.4577
mixed          17.8600  18.5623  17.1858
monologue      17.1464  17.4335  16.8644
spoken         10.8178  10.9009  10.7353
non-printed    16.7294  17.1186  16.3495
printed        21.0325  21.3425  20.7271
written        19.8035  20.0495  19.5606
TOTAL          13.3964  13.4842  13.3093

Table 3. Computing the reciprocal of the Wilson interval.

Note that because l = 1/p declines with increasing p, 1/w⁺ is less than 1/w⁻. The inverted interval is (1/w⁺, 1/w⁻).
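
As a check, feeding the dialogue row through the length_interval sketch above reproduces these figures to four decimal places:

```python
# dialogue: l = 374,516 words / 43,894 parse units ≈ 8.5323
low, high = length_interval(374_516 / 43_894, 374_516)
print(f"({low:.4f}, {high:.4f})")   # -> (8.4577, 8.6077), as in Table 3
```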

This is what the graph looks like with intervals added.

Figure: Ratio of number of words to parse unit (‘sentence’) in ICE-GB subcorpora, with inverse Wilson score intervals. (Note that we have cropped the y axis so it does not start at zero.)

The interpretation of overlapping intervals on this graph is exactly the same as for standard Wilson score interval graphs:

  • non-overlapping intervals = significant difference,
  • one interval contains the other’s central point = non-significant difference, and
  • for everything else, carry out a Newcombe-Wilson test on p = 1/l.

We can check whether the ratios of words to parse units in the “mixed” and “monologue” subcorpora are significantly different using a Newcombe-Wilson test. In this case the difference is non-significant at the α = 0.05 level.
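
Here is a sketch of that test in the same vein, reusing wilson_interval from earlier. (The combined bounds follow Newcombe’s method of pairing independent Wilson intervals, with widths summed in quadrature; the function name is my own.)

```python
from math import sqrt

def newcombe_wilson_significant(p1, n1, p2, n2, alpha=0.05):
    """True if the difference d = p2 - p1 is significant at the given error level.

    The inner comparison interval combines the Wilson bounds of the two
    independent observations, adding the squared interval widths.
    """
    w1_minus, w1_plus = wilson_interval(p1, n1, alpha)
    w2_minus, w2_plus = wilson_interval(p2, n2, alpha)
    d = p2 - p1
    lower = -sqrt((p1 - w1_minus) ** 2 + (w2_plus - p2) ** 2)
    upper = sqrt((w1_plus - p1) ** 2 + (p2 - w2_minus) ** 2)
    return d < lower or d > upper

# "mixed" vs "monologue" (parse units per word): non-significant, as stated above
print(newcombe_wilson_significant(2_443 / 43_632, 43_632,
                                  13_133 / 225_184, 225_184))  # -> False
```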

In the first table above we also included the ratio of words per clause (CL). To avoid repetition, we have not presented the corresponding calculation and graph here, but it is included in an Excel spreadsheet containing the raw data.

Theorem

We can prove a useful general theorem which allows us to use the Wilson interval for properties other than probabilities.

For any function of p, f(p), that is monotonic over the range of p ∈ [0, 1], the Wilson interval for f(p) is

Wilson(f(p)) ≡ (f(w⁻), f(w⁺)) if f increases with p, or
Wilson(f(p)) ≡ (f(w⁺), f(w⁻)) otherwise.

Note: The term monotonic means that the function always increases with its parameter (the gradient d(f(x))/dx > 0) or always decreases with its parameter (the gradient < 0).

The slope of a sloping roof is monotonic. The top of a roof (or a flat roof) is not! The gradient (slope) of a monotonic function may change, but it may not become horizontal or change direction from positive to negative (or vice-versa).

Other example monotonic functions include any constant multiple of p, e.g. 5p (a score on a scale from 0 to 5), the alternate probability q = 1 – p, any power of p such as p², and so on. (For more example transformations see this PDF ‘cheat sheet’.)
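
To make the theorem concrete: q = 1 − p decreases with p, so its bounds swap, whereas p² increases on [0, 1], so its bounds do not:

$$ \mathrm{Wilson}(1-p) \equiv (1-w^+,\ 1-w^-), \qquad \mathrm{Wilson}(p^2) \equiv ((w^-)^2,\ (w^+)^2) $$

Using the dialogue row of Table 2 above, the interval for q = 0.8828 is (1 − 0.1182, 1 − 0.1162) = (0.8818, 0.8838).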

The function must always behave in this monotonic way over the probability range (it doesn’t matter what it does for p < 0 or p > 1). For example, across all values of x, x² is non-monotonic. However, as long as p ∈ [0, 1], p² is monotonic.

Figure: Some monotonic functions, including p² and 1/p. Note that a function with a negative gradient, such as 1/p, will flip the upper and lower bounds.

Importantly, the logistic function (that defines an ‘S’ curve) is monotonic. See this short paper for more on the relationship between the Wilson score interval and the logistic function.

Every monotonic function has a single-valued inverse, and this inverse is also monotonic. As a result, we can compute an interval on a monotonic function of p by simply computing the Wilson interval on p and then applying the function to the interval bounds (swapping them if the function is decreasing).

Note that even though 1/p is infinite when p = 0, it is still possible to apply the Wilson score interval to p and report the reciprocals of the bounds. (An observed p = 0 gives a Wilson lower bound w⁻ = 0, so the upper bound of l = 1/p is simply unbounded.)
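
Putting the theorem to work, here is a small generic helper, again my own sketch rather than anything from the source, which applies any monotonic f to a Wilson interval:

```python
def wilson_transformed(f, p, n, alpha=0.05):
    """Interval for f(p), for any f monotonic on [0, 1].

    Applying f to both Wilson bounds and sorting handles decreasing
    functions (such as f(p) = 1/p), whose bounds must be swapped.
    """
    w_minus, w_plus = wilson_interval(p, n, alpha)
    return tuple(sorted((f(w_minus), f(w_plus))))

# Examples using the dialogue row of Table 2:
print(wilson_transformed(lambda x: 1 / x, 0.1172, 374_516))   # the ratio l = 1/p
print(wilson_transformed(lambda x: x ** 2, 0.1172, 374_516))  # p squared
```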

By way of comparison, here are two examples of non-monotonic functions.

Figure: Two non-monotonic functions. In the lower curve, f(p) = (p – 0.5)², different values of p obtain the same value of f(p). The upper stepped function includes a plateau where a range of values of p obtains the same value of f(p).

Citation (book version)

Wallis, S.A. 2021. Reciprocating the Wilson Interval. Chapter 10 in Wallis, S.A. Statistics in Corpus Linguistics Research. New York: Routledge. 171-177.

4 thoughts on “Reciprocating the Wilson interval”

  1. Intriguing stuff! Can I please ask for some clarification regarding the exactitude of the above relationships? I absolutely see how the reciprocal confidence intervals are generated, and this makes good intuitive sense, but I am wondering how well the reciprocal (or related) intervals perform in practice. For example, if the Wilson score is giving excellent performance at the p level, is that same level of performance guaranteed for f(p)? Are there cases where these checks have been done? The main reason I am asking is that I have implemented some f(p) intervals and tested them via simulated numerical tests and I don’t seem to be getting great performance in all cases (e.g. for 95% intervals I may find 10%+ of the time that the true value lies outside the interval). This may be an error on my part though, hence why I am asking 🙂 It may be this is discussed in the textbook (which I have on order). Any insight you can provide would be much appreciated!

    1. Good question. The simple answer is that, for a monotonic function f(p), the interval must perform exactly as well as the interval on p. Imagine drawing a closed shape, a circle say, on a deflated balloon, then drawing a dot inside it and a dot outside it. Now inflate the balloon. The new shape may be distorted, but the dot that was inside must still be inside, and the dot outside must still be outside.

      1. Thanks for the fast (and helpful!) replies! I am still bottoming things out a bit, but I think the issue was simply that I wasn’t doing enough runs to get to converged results concerning testing of confidence interval performance (I was running 50-200 example cases, whereas it looks like convergence doesn’t start to kick in until a few thousand examples are run). The transform seems to play a role here also… while performance does appear to behave exactly as you describe, a more severe transformation seems to require more numerical tests to reach convergence of the test metric (this being the proportion lying outside the interval). I will try and remember to post again to confirm that this is indeed where I was going wrong.

        Thanks again!

    2. My advice would be to use the Wilson score interval with continuity correction or the Clopper-Pearson interval as a starting point if you want good performance for small n.

      One caveat: if you are seeing what appear to be failures to replicate, you should check that the proportion p is actually Binomial, i.e. free to vary from 0 to 1 and randomly sampled (extrapolated to a very large, if not infinite, population). In other words, if the property does not actually behave as the Binomial model predicts, why is that the case?
