The confidence of diversity


Occasionally it is useful to cite measures in papers other than simple probabilities or differences in probability. When we do, we should estimate confidence intervals on these measures. There are a number of ways of estimating intervals, including bootstrapping and simulation, but these are computationally heavy.

For many measures it is possible to derive intervals from the Wilson score interval by employing a little mathematics. Elsewhere in this blog I discuss how to manipulate the Wilson score interval for simple transformations of p, such as 1/p, 1 – p, etc.

Below I am going to explain how to derive an interval for grammatical diversity, d, which we can define as the probability that two randomly-selected instances have different outcome classes.

Diversity is an effect size measure of a frequency distribution, i.e. a vector of k frequencies. If all frequencies are the same, the data is evenly spread, and the score will tend to a maximum. If all frequencies except one are zero, the chance of picking two different instances will of course be zero. Diversity is well-behaved except where categories have frequencies of 1.

To compute this diversity measure, we sum across the set of outcomes (all functions, all nouns, etc.), C:

  • diversity d(c ∈ C) = ∑ p₁(c).(1 – p₂(c)) if n > 1; 1 otherwise

where C is a set of > 1 disjoint categories, p₁(c) is the probability that item 1 is category c and p₂(c) is the probability that item 2 is the same category c.

We have probabilities

  • p₁(c) = F(c)/n,
  • p₂(c) = (F(c) – 1)/(n – 1) = (p₁(c).n – 1)/(n – 1),

where F(c) is the frequency of category c and n is the total number of instances.

The formula for p₂ includes an adjustment for the fact that we already know that the first item is c. This principle is used in card-playing statistics. Suppose I draw cards from a pack. If the first card I pick is a heart, I know that there are only 12 other hearts in the pack, so the probability of the next card I pick up being a heart is 12 out of 51, not 13 out of 52.

Note that as the set is closed, ∑p₁(c) = ∑p₂(c) = 1.

The maximum score is slightly in excess of (k – 1) / k: with all k frequencies equal, d = (k – 1) / k × n / (n – 1). The exception is the special case where n approaches k and there is a frequency of 1 in any category, in which case diversity can approach 1.
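The definition can be checked numerically. The following is a minimal sketch, not the author's own code: the function name `diversity` is an assumption, and exact fractions are used so that the boundary cases come out without rounding.

```python
from fractions import Fraction

def diversity(freqs):
    """Probability that two items drawn without replacement differ in category."""
    n = sum(freqs)
    if n <= 1:
        return Fraction(1)  # the definition sets d = 1 unless n > 1
    total = Fraction(0)
    for f in freqs:
        p1 = Fraction(f, n)          # first item is category c
        p2 = Fraction(f - 1, n - 1)  # second item is also c, given the first was
        total += p1 * (1 - p2)       # empty cells contribute 0 because p1 = 0
    return total

k, n = 4, 40
print(diversity([n // k] * k))                # evenly spread vector
print(Fraction(k - 1, k) * Fraction(n, n - 1))  # (k - 1)/k scaled by n/(n - 1)
print(diversity([n, 0, 0, 0]))                # everything in one category
print(diversity([1, 1, 1]))                   # n = k, every frequency is 1
```

The first two printed values agree, the third is zero, and the last illustrates the special case where every category has a frequency of 1.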

An example

In a paper with Bas Aarts and Jill Bowie (2018), we found that the share of functions of –ing clauses (‘gerunds’) appeared to change over time in the Diachronic Corpus of Present-day Spoken English (DCPSE).

We obtained the following graph. The bars marked ‘LLC’ refer to data drawn from the period 1956-1972; those marked ‘ICE-GB’ are from 1990-1992.

Changes in proportion of p(function | –ing) between LLC (1960s-70s) and ICE-GB (1990s) data in DCPSE, ordered by total frequency. After Aarts et al. (2018).

This graph considers six functions C = {CO, CS, OD, SU, A, PC} of the clause. It plots p(c) over C. Considered individually, some functions significantly increase and some decrease their share. Note also that the increases appear to be concentrated in the shorter bars (smaller p) and the decreases in the longer ones.

Intuitively this appears to mean that we are seeing –ing clauses increase in their diversity of grammatical function over time. We would like to test this proposition.

Here is the LLC data.

function   CO   CS   SU   OD    A     PC      Total
frequency  6    33   61   326   610   1,203   2,239

LLC frequency data: a simple array of k values. Diversity is computed out of the proportions occupied by each one, in this case, p(function | –ing).

Computing diversity scores, we arrive at

  • d(LLC) = 0.6152 and
  • d(ICE-GB) = 0.6443.
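As a quick check of the LLC figure, here is a short sketch that applies the definition to the frequencies above. The pairing of function labels with frequencies is inferred from the proportions reported later in the post, and the variable names are illustrative only. The ICE-GB frequencies are not reproduced in this post, so only the LLC score is computed.

```python
# LLC frequencies per function of the -ing clause (labels inferred from the
# proportions quoted later in the post).
llc = {"CO": 6, "CS": 33, "SU": 61, "OD": 326, "A": 610, "PC": 1203}
n = sum(llc.values())

# One term per category: p1(c) * (1 - p2(c)).
terms = {c: (f / n) * (1 - (f - 1) / (n - 1)) for c, f in llc.items()}
d_llc = sum(terms.values())

for c, t in terms.items():
    print(f"{c:>2}  {t:.4f}")
print("d(LLC) =", round(d_llc, 4))
```

Listing the individual terms also shows how much each function contributes to the total score.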

Confidence intervals for d

We wish to compare these two diversity measures. The first step is to estimate a confidence interval for d.

Step 1: intervals for each term

First we compute interval estimates for each term, d(c) = p₁(c).(1 – p₂(c)).

  • The Wilson score interval for a probability p is (w⁻, w⁺).

Any monotonic function of p, fn, can be applied and plotted as a simple transformation. See Reciprocating the Wilson interval. We can write

  • fn(p) ∈ (fn(w⁻), fn(w⁺)).
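To make this concrete, here is a sketch of the standard Wilson score interval and of the transformation rule. The helper names `wilson` and `transform_interval` are my own, and the critical value z ≈ 1.96 assumes a two-tailed error level of 0.05. Note that for a decreasing function such as 1/p the transformed bounds swap order, which the helper handles by sorting them.

```python
import math

def wilson(p, n, z=1.959964):
    """Wilson score interval (w-, w+) for an observed proportion p out of n."""
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - spread, centre + spread

def transform_interval(fn, p, n):
    """Interval for a monotonic fn(p): transform both bounds, then order them."""
    w_lo, w_hi = wilson(p, n)
    a, b = fn(w_lo), fn(w_hi)
    return min(a, b), max(a, b)

p, n = 0.25, 40
print(wilson(p, n))                                # interval for p
print(transform_interval(lambda x: 1 / x, p, n))   # interval for 1/p
```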

However, d(c) is not monotonic over its entire range. Indeed d(c) reaches a maximum where p = 0.5. However the axiom holds conservatively provided that the function is monotonic across the interval (w⁻, w⁺), i.e. where 0.5 is not within the interval. The following graph plots d(c) over p(c) for a two-cell vector where n = 40.

Diversity vs. probability for each cell in a 2-cell vector. Diversity is globally non-monotonic, peaking at p=0.5, but locally monotonic.

We can rewrite d(c) in terms of a probability p and n,

  • d(p, n) = p × (1 – (p × n – 1) / (n – 1)).

This has the interval

  • d(p, n) ∈ (d(w⁻, n), d(w⁺, n))

provided that w⁺ < 0.5, so that d is increasing across the whole interval. To obtain the interval we have simply plugged w⁻ and w⁺ into the formula for d(p, n) in place of p.
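Expanding the formula for d(p, n) makes its shape explicit. The following short derivation is a worked simplification of the definition, not an additional assumption:

```latex
\begin{align*}
d(p, n) &= p\left(1 - \frac{pn - 1}{n - 1}\right)
         = p \cdot \frac{(n - 1) - (pn - 1)}{n - 1}
         = \frac{n}{n - 1}\, p\,(1 - p), \\
\max_{p}\, d(p, n) &= d(0.5, n) = \frac{n}{4(n - 1)}.
\end{align*}
```

So each term is a scaled version of p(1 – p): it is symmetric about p = 0.5, rising below that point and falling above it.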

Indeed, noting the shape of d, we can derive the following.

  • d(p, n) ∈ (d(w⁻, n), d(w⁺, n)) where w⁺ < 0.5,
  • d(p, n) ∈ (d(w⁺, n), d(w⁻, n)) where w⁻ > 0.5,
  • d(p, n) ∈ (min(d(w⁻, n), d(w⁺, n)), d(0.5, n)) otherwise.

Where the interval includes the maximum of d, the upper bound will be this maximum value.
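The three cases can be sketched in a few lines. This is an illustrative implementation under the assumptions already stated (a standard Wilson interval at z ≈ 1.96); the names `d` and `term_interval` are hypothetical.

```python
import math

def wilson(p, n, z=1.959964):
    """Wilson score interval (w-, w+) for an observed proportion p out of n."""
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - spread, centre + spread

def d(p, n):
    """A single diversity term written as a function of p and n."""
    return p * (1 - (p * n - 1) / (n - 1))

def term_interval(p, n):
    """Interval for d(p, n), handling the turning point at p = 0.5."""
    w_lo, w_hi = wilson(p, n)
    if w_hi < 0.5:                 # d is increasing across the interval
        return d(w_lo, n), d(w_hi, n)
    if w_lo > 0.5:                 # d is decreasing: swap the bounds
        return d(w_hi, n), d(w_lo, n)
    # the interval straddles 0.5, so cap the upper bound at the maximum of d
    return min(d(w_lo, n), d(w_hi, n)), d(0.5, n)

n = 40
for f in (5, 20, 35):
    lo, hi = term_interval(f / n, n)
    print(f, round(lo, 4), round(d(f / n, n), 4), round(hi, 4))
```

Because d is symmetric about 0.5, the rows for f = 5 and f = 35 should print the same bounds.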

Step 2: summation of terms

Next we need to sum these intervals. To do this we need to take account of the number of degrees of freedom of the vector.

Case 1: df = 1

If we had two values (as in our graphed example), we would have one degree of freedom. Cell probabilities p(1) + p(2) = 1, so p(2) = 1 – p(1).

The relationship above is exactly the same as applies for the Wilson score interval and 2×1 χ² goodness of fit test. Observed variation across p(1) determines the variation across p(2). Suppose P(1), the true value for p(1), were at an outer limit of p(1) (say, w⁺(1)). P(2) would be at the opposite outer limit of p(2) (w⁻(2)).

This means we should simply sum the transformed Wilson scores:

  • d(c ∈ C) ∈ (∑d(w⁻(c), n), ∑d(w⁺(c), n)).

We apply simple summation where intervals are strictly dependent on each other. We can obtain relative bounds of the dependent sum as:

  • l(dep) = d – ∑d(w⁻(c), n),
  • u(dep) = ∑d(w⁺(c), n) – d.

However, in our example we have more than one degree of freedom, and this method is too conservative.

Case 2: df > 1

Where probabilities are independent, some can increase and others decrease. The chance that two independent probabilities both fall outside their intervals at a 5% error level is 0.05². So we cannot simply add together intervals. The method of independent summation is to take the Pythagorean sum (the root sum of squares) of the interval widths:

  • l(ind) = √∑[d(p(c), n) – d(w⁻(c), n)]², and
  • u(ind) = √∑[d(p(c), n) – d(w⁺(c), n)]².

However, in our case, we have what we might term semi-independent probabilities, with the level of independence determined by the number of degrees of freedom. We have df = k – 1 independent differences, so we can interpolate between the two methods in proportion to the number of cells.

  • l = (l(ind) × (k – 2) + 2l(dep)) / k, and
  • u = (u(ind) × (k – 2) + 2u(dep)) / k,
  • d(c ∈ C) ∈ (d – l, d + u).

Note that l = l(dep) where k = 2.
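The interpolation is easy to sketch once the per-cell widths are known. In the snippet below the widths are made-up numbers chosen only to show the behaviour, and the function name `combine` is hypothetical.

```python
import math

def combine(widths):
    """Blend dependent and independent sums of per-cell interval widths.

    widths: one non-negative width per cell, e.g. d(p(c), n) - d(w-(c), n).
    """
    k = len(widths)
    dep = sum(widths)                            # strictly dependent: plain sum
    ind = math.sqrt(sum(w * w for w in widths))  # independent: root sum of squares
    return (ind * (k - 2) + 2 * dep) / k

print(combine([0.003, 0.004]))   # k = 2: reduces to the dependent sum
print(combine([0.01] * 6))       # k = 6: lies between the two sums
```

As k grows, the weight on the dependent sum shrinks and the result moves towards the independent estimate.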

Example data

To see how this works, let’s return to our example. The following is drawn from the LLC data (first, blue bar in the graph), at an error level α = 0.05. Note that one of our cells (PC) has p₁ > 0.5 and w₁⁻ is also > 0.5, so we must swap the interval for this cell.

function  CO      CS      SU      OD      A       PC
p         0.0027  0.0147  0.0272  0.1456  0.2724  0.5373
w₁⁻       0.0012  0.0105  0.0213  0.1316  0.2544  0.5166
w₁⁺       0.0058  0.0206  0.0348  0.1608  0.2913  0.5579

Derived probability and 95% Wilson interval measures for the LLC frequency data above.

Next, to compute the bounds of the confidence interval for d, (d – l, d + u), we obtain the same data for p₂ and then carry out the computation.

  • l(dep) = d – ∑d(w⁻(c), n) = 0.6152 – 0.5833 = 0.0319,
  • u(dep) = ∑d(w⁺(c), n) – d = 0.6510 – 0.6152 = 0.0358,
  • l(ind) = √∑[d(p(c), n) – d(w⁻(c), n)]² = 0.0152,
  • u(ind) = √∑[d(p(c), n) – d(w⁺(c), n)]² = 0.0165.

This obtains an interval of (0.5945, 0.6382).
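The whole calculation for the LLC data can be reproduced with a short script. This is a sketch under stated assumptions rather than the code used for the paper: it uses the standard Wilson formula with z ≈ 1.96, and the function names are my own. The results should land close to the figures quoted above, allowing for rounding in the last decimal place.

```python
import math

Z = 1.959964  # two-tailed critical value for alpha = 0.05 (assumed)

def wilson(p, n, z=Z):
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - spread, centre + spread

def d(p, n):
    return p * (1 - (p * n - 1) / (n - 1))

def term_interval(p, n):
    w_lo, w_hi = wilson(p, n)
    if w_hi < 0.5:
        return d(w_lo, n), d(w_hi, n)
    if w_lo > 0.5:
        return d(w_hi, n), d(w_lo, n)   # swapped, as for the PC cell
    return min(d(w_lo, n), d(w_hi, n)), d(0.5, n)

def diversity_interval(freqs):
    n, k = sum(freqs), len(freqs)
    terms = [d(f / n, n) for f in freqs]
    bounds = [term_interval(f / n, n) for f in freqs]
    lower_w = [t - lo for t, (lo, _) in zip(terms, bounds)]
    upper_w = [hi - t for t, (_, hi) in zip(terms, bounds)]
    l_dep, u_dep = sum(lower_w), sum(upper_w)
    l_ind = math.sqrt(sum(w * w for w in lower_w))
    u_ind = math.sqrt(sum(w * w for w in upper_w))
    l = (l_ind * (k - 2) + 2 * l_dep) / k   # interpolate by number of cells
    u = (u_ind * (k - 2) + 2 * u_dep) / k
    total = sum(terms)
    return {"d": total, "l_dep": l_dep, "u_dep": u_dep,
            "l_ind": l_ind, "u_ind": u_ind,
            "lower": total - l, "upper": total + u}

result = diversity_interval([6, 33, 61, 326, 610, 1203])
for name, value in result.items():
    print(f"{name:>6} = {value:.4f}")
```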

We can quote diversity for both subcorpora with absolute intervals (d – l, d + u):

  • d(LLC) = 0.6152 (0.5945, 0.6382), and
  • d(ICE-GB) = 0.6443 (0.6248, 0.6655).

Testing for differences in diversity

In the Newcombe-Wilson test, we compare the difference between two Binomial observations p₁ and p₂ with the Pythagorean distance of the Wilson interval widths y₁⁺ = w₁⁺ – p₁, etc:

  • –√[(y₁⁺)² + (y₂⁻)²] < (p₁ – p₂) < √[(y₁⁻)² + (y₂⁺)²].

If the equation above is true, the result is not significant (the difference falls within the confidence interval).

This method operates on the assumption that the observations are independent and the intervals are approximately Normal. In our case the difference in diversity is -0.0291, and the bounds are (-0.0301, +0.0297).

Since the difference falls inside those bounds – just – we can report that the difference is not significant.
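The test can be reproduced from the two quoted intervals alone. The sketch below applies the inequality above to the rounded figures for LLC and ICE-GB, so the bounds it prints may differ from the quoted ones in the last decimal place. The function name `difference_bounds` is hypothetical.

```python
import math

def difference_bounds(d1, ci1, d2, ci2):
    """Zero-centred bounds for d1 - d2 from two absolute intervals (lower, upper)."""
    y1_lo, y1_hi = d1 - ci1[0], ci1[1] - d1   # interval widths for observation 1
    y2_lo, y2_hi = d2 - ci2[0], ci2[1] - d2   # interval widths for observation 2
    lower = -math.sqrt(y1_hi ** 2 + y2_lo ** 2)
    upper = math.sqrt(y1_lo ** 2 + y2_hi ** 2)
    return lower, upper

d_llc, ci_llc = 0.6152, (0.5945, 0.6382)
d_ice, ci_ice = 0.6443, (0.6248, 0.6655)

lower, upper = difference_bounds(d_llc, ci_llc, d_ice, ci_ice)
diff = d_llc - d_ice
significant = not (lower < diff < upper)
print(round(diff, 4), round(lower, 4), round(upper, 4), significant)
```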


In many scientific disciplines, such as medicine, papers that include graphs or cite figures without confidence intervals are considered incomplete and are likely to be rejected by journals. However, whereas the Wilson interval performs admirably for simple Binomial probabilities, computing confidence intervals for more complex measures typically involves a more involved computation.

We defined a diversity measure and derived a confidence interval for it. Although probabilistic (diversity is indeed a probability), it is not a Binomial probability. For one thing, it has a maximum below 1, slightly in excess of (k – 1) / k. For another, it is computed as the sum of the product of two sets of related probabilities.

In order to derive this interval we made the assumption of monotonicity, i.e. that the function d tends to increase along its range, or decrease along its range. However, d is decidedly not monotonic: it increases as p tends to 0.5 but falls thereafter. We employed the weaker assumption that it is monotonic within the confidence interval, or (in the case where the interval includes a change in direction) that it cannot exceed the global maximum. This has a conservative consequence: it makes the evaluation weaker than it would otherwise be.

We computed an interval by interpolating between dependent and independent estimates of variance, noting that the vector has k – 1 degrees of freedom. This is not the most accurate method (and I intend to return to this question in later posts), but it is sufficient for us to derive an interval, and, by employing Newcombe’s method, a test of significant difference.

Like Cramér’s φ, diversity condenses an array with k – 1 degrees of freedom into a variable with a single degree of freedom. Swapping data between the smallest and largest columns would obtain exactly the same diversity score.

Testing for significant difference in diversity, therefore, is not the same as carrying out a k × 2 chi-square test. Such a test could be significant even when diversity scores are not significantly different. Our new diversity difference test is more conservative, and significant results may be more worthy of comment.


Aarts, B., Wallis, S.A., and Bowie, J. (2018). –Ing clauses in spoken English: structure, usage and recent change. In Seoane, E., Acuña-Fariña, C., and Palacios-Martínez, I. (eds.) Subordination in English. Synchronic and Diachronic Perspectives. Topics in English Linguistics (TiEL) 101. Berlin: De Gruyter. 129-154.
