Why is statistics difficult?

Imagine you are somewhere on a road that you have never been on before. Picture it. It’s peaceful and calm. A car comes down the road. As it gets to a corner, the driver appears to lose control, and the car crashes into a wall. Fortunately the lone driver is OK but they can’t recall exactly what happened.

Let’s think about what you experienced. The car crash might involve a number of variables an investigator would be interested in.

How fast was the car going? Where were the brakes applied?

Look on the road. Get out a tape measure. How long was the skid before the car finally stopped?

How big and heavy was the car? How loud was the bang when the car crashed?

These are all physical variables. We are used to thinking about the world in terms of these kinds of variables: velocity, position, length, volume and mass. They are tangible: we can see and touch them, and we have physical equipment that helps us measure them.

Coping with imperfect data


One of the challenges for corpus linguists is that many of the distinctions that we wish to make are either not annotated in a corpus at all or, if they are represented in the annotation, unreliably annotated. This issue frequently arises in corpora to which an algorithm has been applied, but where the results have not been checked by linguists, a situation which is unavoidable with mega-corpora. However, this is a general problem. We would always recommend that cases be reviewed for accuracy of annotation.

A version of this issue also arises when checking for the possibility of alternation, that is, to ensure that items of Type A can be replaced by Type B items, and vice-versa. An example might be epistemic modal shall vs. will. Most corpora, including richly-annotated corpora such as ICE-GB and DCPSE, do not include modal semantics in their annotation scheme. In such cases the issue is not that the annotation is “imperfect”, but rather that our experiment relies on a presumption that the speaker has the choice of either type at any observed point (see Aarts et al. 2013), whereas that choice is conditioned by the semantic content of the utterance.


Binomial → Normal → Wilson


One of the questions that keeps coming up with students is the following.

What does the Wilson score interval represent, and why is it the right way to calculate a confidence interval based around an observation? 

In this blog post I will attempt to explain, in a series of hopefully simple steps, how we get from the Binomial distribution to the Wilson score interval. I have written about this in a more ‘academic’ style elsewhere, but I haven’t spelled it out in a blog post.

EDS Resources

This post contains the resources for students taking the UCL English Linguistics MA, all in one place.

Session 15: Introduction to statistics

Sessions 18 and 19: Statistics Workshops

Suggested further reading

An unnatural probability?

Not everything that looks like a probability is.

Just because a variable or function ranges from 0 to 1, it does not mean that it behaves like a unitary probability over that range.

Natural probabilities

What we might term a natural probability is a proper fraction of two frequencies, which we might write as p = f/n.

  • Provided that f can be any value from 0 to n, p can range from 0 to 1.
  • In this formula, f and n must also be natural frequencies, that is, n stands for the size of the set of all cases, and f the size of a true subset of these cases.

This natural probability is expected to be a Binomial variable, and the formulae for z tests, χ² tests, Wilson intervals, etc., as well as logistic regression and similar methods, may be legitimately applied to such variables. The Binomial distribution is the expected distribution of such a variable if each observation is drawn independently at random from the population (an assumption that is not strictly true with corpus data).

Another way of putting this is that a Binomial variable expresses the number of individual events of Type A in a situation where an outcome of either A or B is possible. If we observe, say, that 8 out of 10 cases are of Type A, then we can say we have an observed probability of A being chosen, p(A | {A, B}), of 0.8. In this case, f is the frequency of A (8), and n the frequency of both A and B (10). See Wallis (2013a).
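To make the 8-out-of-10 example concrete, here is a minimal Python sketch that computes the natural probability p = f/n and a Wilson score interval around it. The function name `wilson_interval` and the default z value (about 1.96, for a 95% interval) are my own choices for illustration, and the formula is the standard Wilson score interval without continuity correction.

```python
import math


def wilson_interval(f, n, z=1.959964):
    """Wilson score interval for a natural probability p = f / n.

    f is the frequency of Type A, n the total number of cases, and z the
    two-tailed critical value of the Normal distribution (assumed here
    to be roughly 1.96, i.e. a 95% interval).
    """
    if n <= 0 or not 0 <= f <= n:
        raise ValueError("need n > 0 and 0 <= f <= n")
    p = f / n
    z2 = z * z
    denom = 1 + z2 / n
    # Centre of the interval, pulled from p towards 0.5.
    centre = (p + z2 / (2 * n)) / denom
    # Half-width of the interval.
    spread = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom
    return centre - spread, centre + spread


# The example from the text: 8 out of 10 cases are of Type A.
p = 8 / 10
lower, upper = wilson_interval(8, 10)
print(f"p = {p:.2f}, interval = ({lower:.4f}, {upper:.4f})")
```

With so few cases the interval is wide, running from roughly 0.49 to 0.94, and it is not symmetric about p = 0.8.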

Choice vs. use


Many linguistic researchers are interested in semasiological variation, that is, how the meaning of words and expressions may be observed to vary over time or space. One word might have one dominant meaning or use at one point in time, and other meanings may later supplant it. This is of obvious interest to etymology. How do new meanings come about? Why do others decline? Do old meanings die away or retain a specialist use?

Most of the research we have discussed on this blog is, by contrast, concerned with onomasiological variation, or variation in the choice of words or expressions to express the same meaning. In a linguistic choice experiment, the field of meaning is held to be constant, or approximately so, and we are concerned primarily with language production:

  • Given that a speaker (or writer, but we take speech as primary) wishes to express some thought, T, what is the probability that they will use expression E₁ out of the alternate forms {E₁, E₂,…} to express it?

This probability is meaningful in the language production process: it measures the actual use out of the options available to the speaker, at the point of utterance.

Conversely, semasiological researchers are concerned with a different type of probability:

  • Given that a speaker used an expression E, what is the probability that their meaning was T₁ out of the set of {T₁, T₂,…}?

For the hearer, this measure can also be thought of as the exposure rate: what proportion of times should a hearer (reader) interpret E as expressing T₁? This probability is meaningful to a language receiver, but it is not a meaningful statistic at the point of language production.

From the speaker’s point of view we can think of onomasiological variation as variation in choice, and semasiological variation as variation in relative proportion of use.
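The two probabilities can be read off the same table of frequencies by conditioning on different margins. The sketch below uses invented counts for two expressions (E1, E2) and two meanings (T1, T2); the data and the function names `choice_probability` and `use_probability` are hypothetical, introduced only to illustrate the distinction.

```python
from collections import Counter

# Hypothetical frequencies of (expression, meaning) pairs.
# These numbers are invented for illustration, not taken from a corpus.
counts = Counter({
    ("E1", "T1"): 30,
    ("E1", "T2"): 10,
    ("E2", "T1"): 20,
    ("E2", "T2"): 40,
})


def choice_probability(counts, expression, meaning):
    """Onomasiological: p(expression | meaning), the speaker's choice rate."""
    total = sum(f for (e, t), f in counts.items() if t == meaning)
    return counts[(expression, meaning)] / total


def use_probability(counts, meaning, expression):
    """Semasiological: p(meaning | expression), the hearer's exposure rate."""
    total = sum(f for (e, t), f in counts.items() if e == expression)
    return counts[(expression, meaning)] / total


# Same cell (E1, T1), two different baselines.
print(choice_probability(counts, "E1", "T1"))  # 30 / (30 + 20)
print(use_probability(counts, "T1", "E1"))     # 30 / (30 + 10)
```

The same cell frequency (30) gives 0.6 as a choice probability and 0.75 as a use probability, because the baseline n differs in each case.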


Reciprocating the Wilson interval


How can we calculate confidence intervals on a property like sentence length (as measured by the number of words per sentence)?

You might want to do this to find out whether, say, spoken utterances consist of shorter or longer sentences than those found in writing.

The problem is that the average number of words per sentence is not a probability. If you think about it, this ratio will (obviously) equal or exceed 1. So methods for calculating intervals on probabilities won’t work without recalibration.

Aside: You are most likely to hit this type of problem if you want to plot a graph of some non-probabilistic property, or you wish to cite a property with an upper and lower bound for some reason. Sometimes expressing something as a probability does not seem natural. However, it is a good discipline to think in terms of probabilities, and to convert your hypotheses into hypotheses about probabilities as far as possible. As we shall see, this is exactly what you have to do to apply the Wilson score interval.

Note also that wanting to calculate confidence intervals on a property is not enough: you also have to consider whether the property is freely varying when expressed as a probability.

The Wilson score interval (w⁻, w⁺) is a robust method for computing confidence intervals about probabilistic observations p.

Elsewhere we saw that the Wilson score interval provides an accurate approximation to the ‘exact’ Binomial interval, obtained by search, for an observed probability p. It is also well-constrained, so that neither the upper nor the lower bound can fall outside the probabilistic range [0, 1].

But the Wilson interval is based on a probability. In this post we discuss how this method can be used for other quantities.
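As a rough sketch of the idea suggested by the title, suppose we express mean sentence length l (words per sentence) as its reciprocal, p = 1/l = sentences/words, which does lie between 0 and 1. We can then compute a Wilson interval on p and reciprocate the bounds. Two things here are my assumptions rather than anything stated above: that the number of words is the right baseline n, and that each word can be treated as an independent trial. The counts are invented.

```python
import math


def wilson_interval(p, n, z=1.959964):
    """Standard Wilson score interval (w-, w+) for probability p, sample size n."""
    z2 = z * z
    denom = 1 + z2 / n
    centre = (p + z2 / (2 * n)) / denom
    spread = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom
    return centre - spread, centre + spread


def sentence_length_interval(sentences, words, z=1.959964):
    """Interval on mean words per sentence, via the reciprocal probability.

    Assumption: p = sentences / words is treated as a Binomial proportion
    with n = words. Reciprocating swaps the bounds: the upper bound on p
    becomes the lower bound on length, and vice versa.
    """
    if not 0 < sentences <= words:
        raise ValueError("need 0 < sentences <= words")
    p = sentences / words
    w_minus, w_plus = wilson_interval(p, words, z)
    return 1 / w_plus, 1 / w_minus


# Hypothetical counts: 1,000 sentences containing 20,000 words (mean 20).
lower, upper = sentence_length_interval(1000, 20000)
print(f"mean length = {20000 / 1000:.1f}, interval = ({lower:.2f}, {upper:.2f})")
```

Because the Wilson bounds on p cannot exceed 1, the reciprocated bounds on sentence length cannot fall below 1, which matches the constraint that a sentence contains at least one word.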
