### Introduction

Occasionally it is useful to cite measures in papers other than simple probabilities or differences in probability. When we do, we should estimate confidence intervals on these measures. There are a number of ways of estimating intervals, including bootstrapping and simulation, but these are computationally demanding.

For many measures it is possible to derive intervals from the Wilson score interval by employing a little mathematics. Elsewhere in this blog I discuss how to manipulate the Wilson score interval for simple transformations of *p*, such as 1/*p*, 1 *–* *p*, etc.

Below I am going to explain how to derive an interval for grammatical diversity, *d*, which we can define as **the probability that two randomly-selected instances have different outcome classes**.

Diversity is an effect size measure of a vector of *k* values. If all values are the same, the data is evenly spread, and the score will be at its maximum. If all values except for one are zero, the chance of picking two different instances will be zero.

To compute this notion of diversity we sum across the set of outcomes (all functions, all nouns, etc.), **C**:

*diversity d*(*c* ∈ **C**) = ∑ *p*₁(*c*) × (1 – *p*₂(*c*)) if *n* > 1; 0 otherwise,

where **C** is a set of *k* > 1 categories, *p*₁(*c*) is the probability that item 1 is category *c*, and *p*₂(*c*) is the probability that item 2 is the same category *c*.

We have probabilities

*p*₁(*c*) = *F*(*c*) / *n*, *p*₂(*c*) = (*F*(*c*) – 1) / (*n* – 1),

where *n* is the total number of instances.

The formula for *p*₂ includes an adjustment for the fact that we already know that the first item is *c*. This principle is used in card-playing statistics *–* suppose I draw cards from a pack. If the first card I pick is a heart, I know that there are only 12 other hearts in the pack, so the probability of the next card I pick up being a heart is 12 out of 51, not 13 out of 52.

Note that as the set is closed, ∑*p*₁(*c*) = 1, and ∑*p*₂(*c*) ≈ 1 for large *n*.

The maximum score is (*k –* 1) / *k*. If we wished to place diversity on a scale from 0 to 1, the score could be rescaled by multiplying by *k* / (*k –* 1).
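As a minimal sketch, the definition above can be computed directly from a vector of category frequencies (the function name is ours, not the paper's):

```python
def diversity(freqs):
    """Probability that two instances drawn at random, without
    replacement, have different outcome classes (defined for n > 1)."""
    n = sum(freqs)
    return sum((f / n) * (1 - (f - 1) / (n - 1)) for f in freqs)

print(diversity([5, 0, 0]))         # 0.0: all instances share one class
print(diversity([10, 10, 10, 10]))  # ≈ 0.7692, slightly above (k-1)/k = 0.75
                                    # because sampling is without replacement
```

Note that for finite samples the evenly-spread maximum sits a little above (*k* – 1) / *k*, which is the large-*n* limit.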

### An example

In a forthcoming paper with Bas Aarts and Jill Bowie, we found that the share of functions of *–ing* clauses (‘gerunds’) appeared to change over time in the *Diachronic Corpus of Present-day Spoken English* (DCPSE).

We obtained the following graph. The bars marked ‘LLC’ refer to data drawn from the period 1956-1972; those marked ‘ICE-GB’ are from 1990-1992.

This graph considers six functions **C** = {CO, CS, OD, SU, A, PC} of the clause. It plots *p*(*c*) over **C**. Considered individually, some functions significantly increase over time and some decrease; the increases appear to be concentrated in the shorter bars (smaller *p*) and the decreases in the longer ones. Intuitively this appears to mean that over time we are seeing a greater diversity in the use of *–ing* clauses.

Here is the LLC data.

| CO | CS | SU | OD | A | PC | Total |
|----|----|----|-----|-----|-------|-------|
| 6 | 33 | 61 | 326 | 610 | 1,203 | 2,239 |

Computing diversity scores, we arrive at

*d*(LLC) = 0.6152 and *d*(ICE-GB) = 0.6440.
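The LLC figure can be checked directly from the frequency table above (a quick sketch; the ICE-GB frequencies are not reproduced here, so only the LLC score is verified):

```python
# LLC frequencies for the six functions, from the table above
freqs = [6, 33, 61, 326, 610, 1203]   # CO, CS, SU, OD, A, PC
n = sum(freqs)                        # 2,239

# d = sum of p1(c) x (1 - p2(c)) over the six categories
d = sum((f / n) * (1 - (f - 1) / (n - 1)) for f in freqs)
print(round(d, 4))  # 0.6152
```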

### Confidence intervals for *d*

Suppose next we wish to compare these two diversity measures. The first step is to estimate a confidence interval for *d*.

**Note:** A useful shortcut, which we employ here, involves the use of a **relative Wilson score interval**. Normally we quote intervals in absolute terms, such as *p*₁ is within the range (*w*₁⁻, *w*₁⁺). But to perform many mathematical generalisations we need to consider the interval **width**, *y*₁⁻ = |*p*₁ *–* *w*₁⁻|, *y*₁⁺ = |*p*₁ *–* *w*₁⁺|. For example, the Newcombe-Wilson interval takes the square root of the sum of the squares of the inner interval widths.

The formula (-*y*₁⁻, *y*₁⁺) is the Wilson interval relative to *p*₁ and is typically used to plot intervals in Excel.
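As a sketch, the Wilson score interval and its relative widths can be computed as follows (the function name is ours; *z* ≈ 1.96 gives a 95% interval). The final lines reproduce the PC column of the LLC table in the next section:

```python
import math

def wilson(p, n, z=1.959964):
    """Wilson score interval (w_lo, w_hi) for an observed proportion p
    out of n cases, at the error level implied by z."""
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - spread) / denom, (centre + spread) / denom

# PC in the LLC data: F = 1,203 of n = 2,239
p1 = 1203 / 2239
w_lo, w_hi = wilson(p1, 2239)
y_lo, y_hi = p1 - w_lo, w_hi - p1      # relative widths (y1-, y1+)
print(round(y_lo, 4), round(y_hi, 4))  # 0.0207 0.0206
```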

To compute a confidence interval for the **product** of two probabilities, *p*₁ × *p*₂, we need a formula of the following form, where each bound is the product of the corresponding pair of interval widths:

- CI(*p*₁ × *p*₂) = (*y*₁⁻ × *y*₂⁻, *y*₁⁺ × *y*₂⁺).

In our case we want the product *p*₁ × (1 – *p*₂). Since the probability (1 – *p*₂) is simply the alternate to *p*₂, its lower and upper bounds are (-*y*₂⁺, *y*₂⁻). The relative product interval is then simply

- CI(*p*₁ × (1 – *p*₂)) = (*y*₁⁻ × *y*₂⁺, *y*₁⁺ × *y*₂⁻).

As diversity *d* is the sum of independent terms each with these intervals, we add them together to estimate the confidence interval.

**Note:** In this formula, *p*₁(*c*) and *p*₂(*c*) are co-dependent, and almost identical, so each term is equivalent to a population variance estimate.

Confidence intervals on *d* are then obtained by summing each bound separately.

- CI(*d*) = (∑ *y*₁⁻(*c*) × *y*₂⁺(*c*), ∑ *y*₁⁺(*c*) × *y*₂⁻(*c*)).
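Putting the pieces together for the LLC data, a sketch of the whole computation (helper name ours; the Wilson widths for *p*₂ are computed on *n* – 1 cases):

```python
import math

def wilson_widths(p, n, z=1.959964):
    """Relative Wilson score widths (y-, y+) around p, at 95% confidence."""
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return p - (centre - spread) / denom, (centre + spread) / denom - p

freqs = [6, 33, 61, 326, 610, 1203]   # LLC: CO, CS, SU, OD, A, PC
n = sum(freqs)

# Sum the products of widths across the categories
lower = upper = 0.0
for f in freqs:
    y1_lo, y1_hi = wilson_widths(f / n, n)
    y2_lo, y2_hi = wilson_widths((f - 1) / (n - 1), n - 1)
    lower += y1_lo * y2_hi
    upper += y1_hi * y2_lo

d = sum((f / n) * (1 - (f - 1) / (n - 1)) for f in freqs)
# Both sums come out at roughly 0.00105, giving d = 0.6152 (≈0.6141, ≈0.6163)
```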

### Example data

To see how this works, let’s return to our example. The following is drawn from the LLC data (first, blue bar in the graph), at an error level α = 0.05.

| function | CO | CS | SU | OD | A | PC |
|----------|--------|--------|--------|--------|--------|--------|
| *p*₁ | 0.0027 | 0.0147 | 0.0272 | 0.1456 | 0.2724 | 0.5373 |
| *w*₁⁻ | 0.0012 | 0.0105 | 0.0213 | 0.1316 | 0.2544 | 0.5166 |
| *w*₁⁺ | 0.0058 | 0.0206 | 0.0348 | 0.1608 | 0.2913 | 0.5579 |
| *y*₁⁻ | 0.0015 | 0.0042 | 0.0060 | 0.0140 | 0.0180 | 0.0207 |
| *y*₁⁺ | 0.0032 | 0.0059 | 0.0076 | 0.0152 | 0.0188 | 0.0206 |

Next, to compute the confidence interval CI(*d*) = (*l*, *u*), we obtain the same data for *p*₂ and then carry out the computation:

*lower bound l* = ∑ *y*₁⁻(*c*) × *y*₂⁺(*c*),
*upper bound u* = ∑ *y*₁⁺(*c*) × *y*₂⁻(*c*).

The products are quite small, so we have listed these to six decimal places. The summation gives us the following lower and upper bound terms:

| function | CO | CS | SU | OD | A | PC | Total |
|----------|----------|----------|----------|----------|----------|----------|----------|
| *u* | 0.000004 | 0.000025 | 0.000045 | 0.000213 | 0.000339 | 0.000426 | 0.001052 |
| *l* | 0.000004 | 0.000024 | 0.000045 | 0.000213 | 0.000339 | 0.000426 | 0.001052 |

We can quote diversity for LLC by subtracting *l* from and adding *u* to *d* to obtain the absolute intervals:

*d*(LLC) = 0.6152 (0.6141, 0.6163), and *d*(ICE-GB) = 0.6440 (0.6431, 0.6455).

### Testing differences in diversity

In the Newcombe-Wilson test, we compare the difference between two Binomial observations *p*₁ and *p*₂ with the Pythagorean distance of the Wilson interval widths:

–√(*u*₁² + *l*₂²) < (*p*₁ – *p*₂) < √(*l*₁² + *u*₂²).
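As a sketch, the Newcombe-Wilson test can be coded directly from this inequality (function names ours; *z* ≈ 1.96; illustrative data, not from the paper):

```python
import math

def wilson_widths(p, n, z=1.959964):
    """Relative Wilson score widths (y-, y+) around p."""
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return p - (centre - spread) / denom, (centre + spread) / denom - p

def newcombe_wilson_significant(p1, n1, p2, n2):
    """True if p1 - p2 falls outside the Newcombe-Wilson interval."""
    y1_lo, y1_hi = wilson_widths(p1, n1)
    y2_lo, y2_hi = wilson_widths(p2, n2)
    diff = p1 - p2
    return (diff < -math.sqrt(y1_hi ** 2 + y2_lo ** 2)
            or diff > math.sqrt(y1_lo ** 2 + y2_hi ** 2))

print(newcombe_wilson_significant(0.5, 100, 0.3, 100))   # True
print(newcombe_wilson_significant(0.5, 100, 0.45, 100))  # False
```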

However, each limit of our diversity interval is already a ‘squared’ term: it is based on the squared Wilson interval, being a product of two interval widths, just as *d* is a sum of products of two probabilities. Each term behaves like a variance rather than a standard deviation, so no further squaring and square-rooting is required.

So to perform a significance test comparison, we simply test if

–(*u*₁ + *l*₂) < (*d*₁ – *d*₂) < (*l*₁ + *u*₂).

Or, to put it another way, **if the intervals do not overlap**, the difference is significant. In our case, *d*(ICE-GB) > *d*(LLC), so we only need test the inner interval. The upper bound of LLC diversity is 0.6163 < 0.6431 (the lower bound of *d*(ICE-GB)), so the difference is significant.
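In code, the non-overlap test reduces to a pair of comparisons (using the intervals quoted above):

```python
# Diversity estimates with their absolute intervals, as quoted above
ci_llc = (0.6141, 0.6163)   # d(LLC)    = 0.6152
ci_ice = (0.6431, 0.6455)   # d(ICE-GB) = 0.6440

# The difference is significant if the two intervals do not overlap
significant = ci_llc[1] < ci_ice[0] or ci_ice[1] < ci_llc[0]
print(significant)  # True: 0.6163 < 0.6431
```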

### Conclusions

In many scientific disciplines, such as medicine, papers that include graphs or cite figures without confidence intervals are considered incomplete and are likely to be rejected by journals. However, whereas the Wilson interval performs admirably for simple Binomial probabilities, computing confidence intervals for more complex measures typically involves a more involved computation.

We defined a diversity measure and derived a confidence interval for it. Although probabilistic (diversity is indeed a probability), it is not a *Binomial* probability. For one thing, it has a maximum below 1, of (*k –* 1) / *k*. For another, it is computed as the sum of the product of two sets of independent probabilities.

In order to derive this interval we recognised that this fact meant the intervals would correspond to a squared Wilson interval. This is a ‘variance’ measure, rather than a ‘standard deviation’ one. We could then simply sum the upper and lower variance measures together to obtain the interval. Likewise, comparing values of *d* involves simple addition of inner interval widths.

Like Cramér’s φ, diversity condenses an array with *k* – 1 degrees of freedom into a variable with a single degree of freedom. Swapping data between the smallest and largest columns would obtain exactly the same diversity score.

Testing for significant difference in diversity, therefore, is not the same as carrying out a *k* × 2 chi-square test. Such a test could be significant even when diversity scores are not significantly different. Our new diversity difference test is more conservative, and significant results may be more worthy of comment.

### See also

- Is “grammatical diversity” a useful concept?
- Reciprocating the Wilson interval
- Goodness of fit measures for discrete categorical data
- Measures of association for contingency tables