The confidence of diversity


Occasionally it is useful to cite measures in papers other than simple probabilities or differences in probability. When we do, we should estimate confidence intervals on these measures. There are a number of ways of estimating intervals, including bootstrapping and simulation, but these are computationally heavy.

For many measures it is possible to derive intervals from the Wilson score interval by employing a little mathematics. Elsewhere in this blog I discuss how to manipulate the Wilson score interval for simple transformations of p, such as 1/p, 1 – p, etc.

Below I am going to explain how to derive an interval for grammatical diversity, d, which we can define as the probability that two randomly-selected instances have different outcome classes.

Diversity is an effect size measure of a vector of k values. If all values are the same, the data is evenly spread, and the score will be at its maximum. If all values except for one are zero, the chance of picking two different instances will be zero.

To compute this notion of diversity we sum across the set of outcomes (all functions, all nouns, etc.), C:

  • diversity d(c ∈ C) = ∑ p₁(c).(1 – p₂(c)) if n > 1; 0 otherwise

where C is a set of k > 1 categories, p₁(c) is the probability that item 1 is category c and p₂(c) is the probability that item 2 is the same category c.

We have probabilities

  • p₁(c) = F(c)/n,
  • p₂(c) = (F(c) – 1)/(n – 1),

where n is the total number of instances.

The formula for p₂ includes an adjustment for the fact that we already know that the first item is c. This principle is used in card-playing statistics: suppose I draw cards from a pack. If the first card I pick is a heart, I know that there are only 12 other hearts in the pack, so the probability of the next card I pick up being a heart is 12 out of 51, not 13 out of 52.

Note that as the set is closed, ∑p₁(c) = 1.

The maximum score is (k – 1) / k. If we wished to place diversity on a scale from 0 to 1, the score could be rescaled by multiplying by k / (k – 1).
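The definitions above can be sketched in a few lines of Python (the function name, and treating the degenerate n ≤ 1 case as zero, are my own choices):

```python
def diversity(freqs):
    """Probability that two instances drawn at random, without
    replacement, belong to different outcome classes."""
    n = sum(freqs)
    if n <= 1:
        return 0.0  # with at most one instance, no pair can be drawn
    # d = sum over c of p1(c) * (1 - p2(c))
    return sum((f / n) * (1 - (f - 1) / (n - 1)) for f in freqs)

# LLC frequencies for {CO, CS, SU, OD, A, PC} from the example below
print(diversity([6, 33, 61, 326, 610, 1203]))  # ~0.6152
```

Note that a category with zero frequency contributes nothing to the sum, matching the observation that a maximally skewed distribution yields a diversity of zero.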

An example

In a forthcoming paper with Bas Aarts and Jill Bowie, we found that the share of functions of –ing clauses (‘gerunds’) appeared to change over time in the Diachronic Corpus of Present-day Spoken English (DCPSE).

We obtained the following graph. The bars marked ‘LLC’ refer to data drawn from the period 1956-1972; those marked ‘ICE-GB’ are from 1990-1992.

Changes in proportion of p(function | –ing) between LLC (1960s-70s) and ICE-GB (1990s) data in DCPSE, ordered by total frequency. After Aarts et al. (forthcoming).


This graph considers six functions of the clause, C = {CO, CS, OD, SU, A, PC}. It plots p(c) for each c ∈ C. Considered individually, note that some functions significantly increase and some decrease, and that the increases appear to be concentrated in the shorter bars (smaller p) and the decreases in the longer ones. Intuitively, this appears to mean that over time we are seeing greater diversity in the use of –ing clauses.

Here is the LLC data.

function CO CS SU OD A PC Total
F 6 33 61 326 610 1,203 2,239

LLC frequency data: a simple array of k values and their total, n. Diversity is computed from the proportions occupied by each value: in this case, p(function | –ing).

Computing diversity scores, we arrive at

  • d(LLC) = 0.6152 and
  • d(ICE-GB) = 0.6440.

Confidence intervals for d

Suppose next we wish to compare these two diversity measures. The first step is to estimate a confidence interval for d.

Note: A useful shortcut, which we employ here, involves the use of a relative Wilson score interval. Normally we quote intervals in absolute terms, such as p₁ is within the range (w₁⁻, w₁⁺). But to perform many mathematical generalisations we need to consider the interval widths y₁⁻ = |p₁ – w₁⁻|, y₁⁺ = |p₁ – w₁⁺|. For example, the Newcombe-Wilson interval takes the square root of the sum of the squares of the inner interval widths.

The formula (-y₁⁻, y₁⁺) is the Wilson interval relative to p₁ and is typically used to plot intervals in Excel.
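As a concrete sketch, the Wilson score interval and its relative widths can be computed as follows (the function name is mine; z = 1.96 corresponds to the two-tailed error level α = 0.05):

```python
from math import sqrt

def wilson(f, n, z=1.96):
    """Wilson score interval (w-, w+) for an observed proportion p = f/n."""
    p = f / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - spread) / denom, (centre + spread) / denom

# First LLC cell (CO): f = 6, n = 2,239
p = 6 / 2239
w_minus, w_plus = wilson(6, 2239)          # ~(0.0012, 0.0058)
y_minus, y_plus = p - w_minus, w_plus - p  # relative widths ~(0.0015, 0.0032)
```

These figures can be checked against the CO column of the LLC table further down.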

To compute a confidence interval for the product of two probabilities, p₁ × p₂ (in mathematics, two multiplied terms a × b are called a “product”), we need a formula that looks something like this: each bound should be the product of the corresponding pair of interval widths:

  • CI(p₁ × p₂) = (y₁⁻ × y₂⁻, y₁⁺ × y₂⁺).

In our case we want the product p₁ × (1 – p₂). Since the probability (1 – p₂) is simply the alternate to p₂, its lower and upper relative bounds are (-y₂⁺, y₂⁻). The relative product interval is then simply

  • CI(p₁ × (1 – p₂)) = (y₁⁻ × y₂⁺, y₁⁺ × y₂⁻).

As diversity d is the sum of independent terms each with these intervals, we add them together to estimate the confidence interval.

Note: In this formula, p₁(c) and p₂(c) are co-dependent, and almost identical, so each term is equivalent to a population variance estimate.

Confidence intervals on d are then obtained by summing each bound separately.

  • CI(d) = (∑ y₁⁻(c) × y₂⁺(c), ∑ y₁⁺(c) × y₂⁻(c)).

Example data

To see how this works, let’s return to our example. The following is drawn from the LLC data (first, blue bar in the graph), at an error level α = 0.05.

function CO CS SU OD A PC
p 0.0027 0.0147 0.0272 0.1456 0.2724 0.5373
w₁⁻ 0.0012 0.0105 0.0213 0.1316 0.2544 0.5166
w₁⁺ 0.0058 0.0206 0.0348 0.1608 0.2913 0.5379
y₁⁻ 0.0015 0.0042 0.0060 0.0140 0.0180 0.0207
y₁⁺ 0.0032 0.0059 0.0076 0.0152 0.0188 0.0206

Derived probability and 95% Wilson interval measures for the LLC frequency data above. You should be able to see that y₁⁻ = p₁ – w₁⁻, etc.

Next, to compute the bounds of the confidence interval CI(d) = (l, u), we obtain the same data for p₂ and then carry out the computation:

  • lower bound l = ∑ y₁⁻(c) × y₂⁺(c).
  • upper bound u = ∑ y₁⁺(c) × y₂⁻(c).

The products are quite small, so we have listed these to six decimal places. The summation gives us the following lower and upper bound terms:

function CO CS SU OD A PC Total
u 0.000004 0.000025 0.000045 0.000213 0.000339 0.000426 0.001052
l 0.000004 0.000024 0.000045 0.000213 0.000339 0.000426 0.001052

Calculating the bounds of the new 95% diversity interval. For large n, l and u converge.
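The whole computation can be sketched in Python. One detail is an assumption of this sketch rather than something stated above: the y₂ widths are obtained by applying the Wilson interval to p₂(c) = (F(c) – 1)/(n – 1).

```python
from math import sqrt

def wilson(f, n, z=1.96):
    """Wilson score interval (w-, w+) for p = f/n."""
    p = f / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - spread) / denom, (centre + spread) / denom

def diversity_interval(freqs, z=1.96):
    """Relative interval (l, u) on d: l = sum of y1- * y2+,
    u = sum of y1+ * y2-.  Assumes every frequency is at least 1,
    and that y2 widths come from a Wilson interval on
    p2 = (f - 1)/(n - 1) (an assumption of this sketch)."""
    n = sum(freqs)
    l = u = 0.0
    for f in freqs:
        p1, p2 = f / n, (f - 1) / (n - 1)
        w1m, w1p = wilson(f, n, z)
        w2m, w2p = wilson(f - 1, n - 1, z)
        l += (p1 - w1m) * (w2p - p2)  # y1- * y2+
        u += (w1p - p1) * (p2 - w2m)  # y1+ * y2-
    return l, u

l, u = diversity_interval([6, 33, 61, 326, 610, 1203])  # each ~0.00105
```

Both totals come out close to the 0.001052 figure in the table above.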

We can quote diversity for LLC by subtracting l from and adding u to d to obtain the absolute intervals:

  • d(LLC) = 0.6152 (0.6141, 0.6163), and
  • d(ICE-GB) = 0.6440 (0.6431, 0.6455).

Testing differences in diversity

In the Newcombe-Wilson test, we compare the difference between two Binomial observations p₁ and p₂ with the Pythagorean distance of the Wilson interval widths:

–√(u₁² + l₂²) < (p₁ – p₂) < √(l₁² + u₂²).

However, in our diversity interval each limit is already squared: each bound is the product of two interval widths, just as d is a sum of products of two probabilities. The distribution within each interval is, in effect, the square of the Wilson interval.

So to perform a significance test comparison, we simply test if

–(u₁ + l₂) < (d₁ – d₂) < (l₁ + u₂).

Or, to put it another way, if the intervals do not overlap, the difference is significant. In our case, d(ICE-GB) > d(LLC), so we only need test the inner interval. The upper bound of LLC diversity is 0.6163 < 0.6431 (the lower bound of d(ICE-GB)), so the difference is significant.
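Using the bounds quoted above, the comparison reduces to a simple overlap check (a minimal sketch; the function name is mine):

```python
def diversity_difference_significant(ci1, ci2):
    """True if two absolute intervals (d - l, d + u) do not overlap,
    i.e. the difference in diversity is significant."""
    lo1, hi1 = ci1
    lo2, hi2 = ci2
    return hi1 < lo2 or hi2 < lo1

# 95% intervals quoted above for d(LLC) and d(ICE-GB)
print(diversity_difference_significant((0.6141, 0.6163),
                                       (0.6431, 0.6455)))  # True
```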


In many scientific disciplines, such as medicine, papers that include graphs or cite figures without confidence intervals are considered incomplete and are likely to be rejected by journals. However, whereas the Wilson interval performs admirably for simple Binomial probabilities, computing confidence intervals for more complex measures typically requires a more involved derivation.

We defined a diversity measure and derived a confidence interval for it. Although probabilistic (diversity is indeed a probability), it is not a Binomial probability. For one thing, it has a maximum below 1, of (k – 1) / k. For another, it is computed as the sum of the product of two sets of independent probabilities.

In order to derive this interval we recognised that this fact meant the intervals would correspond to a squared Wilson interval. This is a ‘variance’ measure, rather than a ‘standard deviation’ one.  We could then simply sum the upper and lower variance measures together to obtain the interval. Likewise, comparing values of d involves simple addition of inner interval widths.

Like Cramér’s φ, diversity condenses an array with k – 1 degrees of freedom into a variable with a single degree of freedom. Swapping data between the smallest and largest columns would yield exactly the same diversity score.

Testing for significant difference in diversity, therefore, is not the same as carrying out a k × 2 chi-square test. Such a test could be significant even when diversity scores are not significantly different. Our new diversity difference test is more conservative, and significant results may be more worthy of comment.
