The confidence of diversity


Occasionally it is useful to cite measures other than simple probabilities or differences in probability in papers. When we do, we should estimate confidence intervals on these measures. There are a number of ways of estimating intervals, including bootstrapping and simulation, but these are computationally heavy.

For many measures it is possible to derive intervals from the Wilson score interval by employing a little mathematics. Elsewhere in this blog I discuss how to manipulate the Wilson score interval for simple transformations of p, such as 1/p, 1 – p, etc.

Below I am going to explain how to derive an interval for grammatical diversity, d, which we can define as the probability that two randomly-selected instances have different outcome classes.

Diversity is an effect size measure of a frequency distribution, i.e. a vector of k frequencies. If all frequencies are the same, the data is evenly spread, and the score will tend to a maximum. If all frequencies except one are zero, the chance of picking two different instances will of course be zero. Diversity is well-behaved except where categories have frequencies of 1.
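As a rough illustration of the definition above, the following Python sketch computes d as one minus the sum of squared class proportions. This is the form obtained when the two instances are drawn independently (with replacement). That choice is an assumption made here for simplicity, as is the function name `diversity`; the full post may define the measure slightly differently, for example by sampling without replacement.

```python
def diversity(freqs):
    """Probability that two independently drawn instances (with replacement,
    an assumption of this sketch) fall in different classes:
    d = 1 - sum(p_i ** 2), where p_i = f_i / N."""
    n = sum(freqs)
    if n == 0:
        raise ValueError("need at least one instance")
    return 1.0 - sum((f / n) ** 2 for f in freqs)


# Hypothetical frequency vectors with k = 4 classes.
even = diversity([25, 25, 25, 25])   # evenly spread: the maximum, 1 - 1/k
skewed = diversity([100, 0, 0, 0])   # all instances in one class: zero
print(even, skewed)
```

With k equally frequent classes the score reaches 1 – 1/k, and it falls to zero when only one class is attested, matching the two limiting cases described above.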


Coping with imperfect data


One of the challenges for corpus linguists is that many of the distinctions that we wish to make are either not annotated in a corpus at all or, if they are represented in the annotation, unreliably annotated. This issue frequently arises in corpora to which an algorithm has been applied, but where the results have not been checked by linguists, a situation which is unavoidable with mega-corpora. However, this is a general problem. We would always recommend that cases be reviewed for accuracy of annotation.

A version of this issue also arises when checking for the possibility of alternation, that is, to ensure that items of Type A can be replaced by Type B items, and vice-versa. An example might be epistemic modal shall vs. will. Most corpora, including richly-annotated corpora such as ICE-GB and DCPSE, do not include modal semantics in their annotation scheme. In such cases the issue is not that the annotation is “imperfect”, rather that our experiment relies on a presumption that the speaker has the choice of either type at any observed point (see Aarts et al. 2013), but that choice is conditioned by the semantic content of the utterance.


Reciprocating the Wilson interval


How can we calculate confidence intervals on a property like sentence length (as measured by the number of words per sentence)?

You might want to do this to find out whether or not, say, spoken utterances consist of shorter or longer sentences than those found in writing.

The problem is that the average number of words per sentence is not a probability. If you think about it, this ratio will (obviously) equal or exceed 1. So methods for calculating intervals on probabilities won’t work without recalibration.

Aside: You are most likely to hit this type of problem if you want to plot a graph of some non-probabilistic property, or you wish to cite a property with an upper and lower bound for some reason. Sometimes expressing something as a probability does not seem natural. However, it is a good discipline to think in terms of probabilities, and to convert your hypotheses into hypotheses about probabilities as far as possible. As we shall see, this is exactly what you have to do to apply the Wilson score interval.

Note also that if you want to calculate confidence intervals on a property, you must also consider whether the property is freely varying when expressed as a probability.

The Wilson score interval (w⁻, w⁺) is a robust method for computing confidence intervals about probabilistic observations p.

Elsewhere we saw that the Wilson score interval obtained an accurate approximation to the ‘exact’ Binomial interval based on an observed probability p, obtained by search. It is also well-constrained, so that neither upper nor lower bound can exceed the probabilistic range [0, 1].

But the Wilson interval is based on a probability. In this post we discuss how this method can be used for other quantities.
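To make this concrete, here is a minimal Python sketch of the standard Wilson score interval formula, followed by one way it might be applied to sentence length. It rests on assumptions not stated in this excerpt: that p is the number of sentences divided by the number of words (so that words per sentence is 1/p), that n is the number of words, and that the counts used are invented for illustration. The function names are hypothetical.

```python
import math

Z_95 = 1.959964  # two-tailed critical value of the standard Normal at 0.05


def wilson(p, n, z=Z_95):
    """Wilson score interval (w-, w+) for an observed proportion p out of n."""
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom


def reciprocal_interval(p, n, z=Z_95):
    """Interval for 1/p: take reciprocals of the Wilson bounds and swap them.
    Requires p > 0, otherwise the lower Wilson bound is zero."""
    w_minus, w_plus = wilson(p, n, z)
    return 1 / w_plus, 1 / w_minus


# Hypothetical counts: 500 sentences in 10,000 words, so p = 0.05
# and the mean sentence length is 1/p = 20 words.
sentences, words = 500, 10_000
p = sentences / words
w_minus, w_plus = wilson(p, words)
len_lower, len_upper = reciprocal_interval(p, words)
print(w_minus, w_plus, len_lower, len_upper)
```

Because 1/p decreases as p increases, the upper Wilson bound yields the lower bound on sentence length, and vice versa.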


Freedom to vary and significance tests


Statistical tests based on the Binomial distribution (z, χ², log-likelihood and Newcombe-Wilson tests) assume that the item in question is free to vary at each point. This simply means that

  • If we find f items under investigation (what we elsewhere refer to as ‘Type A’ cases) out of N potential instances, the statistical model of inference assumes that it must be possible for f to be any number from 0 to N.
  • Probabilities, p = f / N, are expected to fall in the range [0, 1].

Note: this constraint is a mathematical one. All we are claiming is that the true proportion in the population could conceivably range from 0 to 1. This property is not limited to strict alternation with constant meaning (onomasiological, “envelope of variation” studies). In semasiological studies, where we evaluate alternative meanings of the same word, these tests can also be legitimate.

However, it is common in corpus linguistics to see evaluations carried out against a baseline containing terms that simply cannot plausibly be exchanged with the item under investigation. The most obvious example is statements of the following type: “linguistic Item x increases per million words between category 1 and 2”, with reference to a log-likelihood or χ² significance test to justify this claim. Rarely is this appropriate.

Some terminology: If Type A represents, say, the use of modal shall, most words will not alternate with shall. For convenience, we will refer to cases that will alternate with Type A cases as Type B cases (e.g. modal will in certain contexts).

The remainder of cases (other words) are, for the purposes of our study, not evaluated. We will term these invariant cases Type C, because they cannot replace Type A or Type B.

In this post I will explain that not only does introducing such ‘Type C’ cases into an experimental design conflate opportunity and choice, but it also makes the statistical evaluation of variation more conservative. Not only may we mistake a change in opportunity as a change in the preference for the item, but we also weaken the power of statistical tests and tend to reject significant changes (in stats jargon, “Type II errors”).
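The weakening effect can be seen in a small sketch. The Python code below computes a 2 × 2 Pearson χ² for the same pair of Type A frequencies against two baselines: a choice baseline (Type A + Type B) and a per-million-word baseline that includes Type C cases. All frequencies are invented for illustration, and the helper name `chi_square_2x2` is hypothetical.

```python
def chi_square_2x2(f1, n1, f2, n2):
    """Pearson chi-square for the table [[f1, n1 - f1], [f2, n2 - f2]]."""
    observed = [[f1, n1 - f1], [f2, n2 - f2]]
    total = n1 + n2
    row_totals = [n1, n2]
    col_totals = [f1 + f2, total - (f1 + f2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


CRITICAL_95 = 3.841  # chi-square critical value, 1 degree of freedom, error level 0.05

# Hypothetical data: Type A occurs 50 times in category 1 and 70 times in category 2.
# Choice baseline: 100 Type A + Type B cases in each category.
choice = chi_square_2x2(50, 100, 70, 100)
# Word baseline: one million words in each category (mostly Type C).
per_million = chi_square_2x2(50, 1_000_000, 70, 1_000_000)

print(choice > CRITICAL_95, per_million > CRITICAL_95)
```

With these figures the change is significant against the choice baseline but not against the word baseline, even though the Type A frequencies are identical. The example holds opportunity constant across the two categories, so it isolates only the loss of power, not the conflation of opportunity and choice.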

This problem of experimental design far outweighs differences between methods for computing statistical tests.

Change and certainty: plotting confidence intervals (2)


In a previous post I discussed how to plot confidence intervals on observed probabilities. Using this method we can create graphs like the following. (Data is in the Excel spreadsheet we used previously: for this post I have added a second worksheet.)

The graph depicts both the observed probability of a particular form and the certainty that this observation is accurate. The ‘I’-shaped error bars depict the estimated range of the true value of the observation at a 95% confidence level (see Wallis 2013 for more details).

A note of caution: these probabilities are semasiological proportions (different uses of the same word) rather than onomasiological choices (see Choice vs. use).

An example graph plot showing the changing proportions of meanings of the verb think over time in the US TIME Magazine Corpus, with Wilson score intervals, after Levin (2013). Many thanks to Magnus for the data!

In this post I discuss ways in which we can plot intervals on changes (differences) rather than single probabilities.

The clearer our visualisations, the better we can understand our own data, focus our explanations on significant results and communicate our results to others.

Choosing the right test


One of the most common questions a new researcher has to deal with is the following:

what is the right statistical test for my purpose?

To answer this question we must distinguish between

  1. different experimental designs, and
  2. optimum methods for testing significance.

In corpus linguistics, many research questions involve choice. The speaker can say shall or will, choose to add a postmodifying clause to an NP or not, etc. If we want to know what factors influence this choice then these factors are termed independent variables (IVs) and the choice is the dependent variable (DV). These choices are mutually exclusive alternatives. Framing the research question like this immediately helps us focus in on the appropriate class of tests.

A statistics crib sheet

Confidence intervals

Confidence intervals on an observed rate p should be computed using the Wilson score interval method. A confidence interval on an observation p represents the range that the true population value, P (which we cannot observe directly) may take, at a given level of confidence (e.g. 95%).

Note: Confidence intervals can be applied to onomasiological change (variation in choice) and semasiological change (variation in meaning), provided that P is free to vary from 0 to 1 (see Wallis 2012). Naturally, the interpretation of significant change in either case is different.

Methods for calculating intervals employ the Gaussian approximation to the Binomial distribution.

Confidence intervals on Expected (Population) values (P)

The Gaussian interval about P uses the mean and standard deviation as follows:

mean x̄ ≡ P = F/N,
standard deviation S ≡ √(P(1 – P)/N).

The Gaussian interval about P can be written as P ± E, where E = z.S, and z is the critical value of the standard Normal distribution at a given error level (e.g., 0.05). Although this is a bit of a mouthful, critical values of z are constant, so for any given level you can just substitute the constant for z. [z(0.05) = 1.959964 to six decimal places.]

In summary:

Gaussian interval: P ± z√(P(1 – P)/N).

Confidence intervals on Observed (Sample) values (p)

We cannot use the same formula for confidence intervals about observations. Many people try to do this!

Most obviously, if p gets close to zero, the error e can exceed p, so the lower bound of the interval can fall below zero, which is clearly impossible! The problem is most apparent on smaller samples (larger intervals) and skewed values of p (close to 0 or 1).
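The overshoot is easy to reproduce. The Python sketch below implements the Gaussian formula P ± z√(P(1 – P)/N) given earlier, then misapplies it to an observed p close to zero on a small sample. The function name and the example values are assumptions chosen for illustration.

```python
import math

Z_95 = 1.959964  # two-tailed critical value of the standard Normal at 0.05


def gaussian_interval(prob, n, z=Z_95):
    """Gaussian interval prob +/- z * sqrt(prob * (1 - prob) / n)."""
    s = math.sqrt(prob * (1 - prob) / n)  # standard deviation S
    e = z * s                             # error term E = z.S
    return prob - e, prob + e


# About a population value P in the middle of the range: bounds stay inside [0, 1].
mid_lower, mid_upper = gaussian_interval(0.5, 100)

# Misapplied to a skewed observation p = 0.02 with n = 50: e exceeds p,
# so the lower bound is negative.
lower, upper = gaussian_interval(0.02, 50)
print(mid_lower, mid_upper, lower, upper)
```

In the second case the error term is roughly 0.039, almost twice the observed p, which is why the lower bound falls below zero.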

Although the Gaussian is a reasonable approximation for an as-yet-unknown population probability P, it is incorrect for an interval around an observation p (Wallis 2013a). However, the latter case is precisely where the Gaussian interval is used most often!

What is the correct method?
