Binomial → Normal → Wilson


One of the questions that keeps coming up with students is the following.

What does the Wilson score interval represent, and why is it the right way to calculate a confidence interval based around an observation? 

In this blog post I will attempt to explain, in a series of hopefully simple steps, how we get from the Binomial distribution to the Wilson score interval. I have written about this in a more ‘academic’ style elsewhere, but I haven’t spelled it out in a blog post.
Continue reading

An unnatural probability?

Not everything that looks like a probability is.

Just because a variable or function ranges from 0 to 1, it does not mean that it behaves like a unitary probability over that range.

Natural probabilities

What we might term a natural probability is a proper fraction of two frequencies, which we might write as p = f/n.

  • Provided that f can be any value from 0 to n, p can range from 0 to 1.
  • In this formula, f and n must also be natural frequencies, that is, n stands for the size of the set of all cases, and f the size of a true subset of these cases.

This natural probability is expected to be a Binomial variable, and the formulae for z tests, χ² tests, Wilson intervals, etc., as well as logistic regression and similar methods, may be legitimately applied to such variables. The Binomial distribution is the expected distribution of such a variable if each observation is drawn independently at random from the population (an assumption that is not strictly true with corpus data).

Another way of putting this is that a Binomial variable expresses the number of individual events of Type A in a situation where an outcome of either A and B are possible. If we observe, say 8 out of 10 cases are of Type A, then we can say we have an observed probability of A being chosen, p(A | {A, B}), of 0.8. In this case, f is the frequency of A (8), and n the frequency of both A and B (10). See Wallis (2013a). Continue reading

Comparing frequencies within a discrete distribution

This page explains how to compare observed frequencies f₁ and f₂ from the same distributionF = {f₁, f₂,…}.
To compare observed frequencies f₁ and f₂ from different distributions, i.e. where F₁ = {f₁,…} and F₂ = {f₂,…}, you need to use a chi-square or Newcombe-Wilson test.


In a recent study, my colleague Jill Bowie obtained a discrete frequency distribution by manually classifying cases in a small sample drawn from a large corpus.

Jill converted this distribution into a row of probabilities and calculated Wilson score intervals on each observation, to express the uncertainty associated with a small sample. She had one question, however:

How do we know whether the proportion of one quantity is significantly greater than another?

We might use a Newcombe-Wilson test (see Wallis 2013a), but this test assumes that we want to compare samples from independent sources. Jill’s data are drawn from the same sample, and all probabilities must sum to 1. Instead, the optimum test is a dependent-sample test.


A discrete distribution looks something like this: F = {108, 65, 6, 2}. This is the frequency data for the middle column (circled) in the following chart.

This may be converted into a probability distribution P, representing the proportion of examples in each category, by simply dividing by the total: P = {0.60, 0.36, 0.03, 0.01}, which sums to 1.

We can plot these probabilities, with Wilson score intervals, as shown below.


An example graph plot showing the changing proportions of meanings of the verb think over time in the US TIME Magazine Corpus, with Wilson score intervals, after Levin (2013). In this post we discuss the 1960s data (circled). The sum of each column probability is 1. Many thanks to Magnus for the data!

So how do we know if one proportion is significantly greater than another?

  • When comparing values diachronically (horizontally), data is drawn from independent samples. We may use the Newcombe-Wilson test, and employ the handy visual rule that if intervals do not overlap they must be significantly different.
  • However, probabilities drawn from the same sample (vertically) sum to 1 — which is not the case for independent samples! There are k−1 degrees of freedom, where k is the number of classes. It turns out that the relevant significance test we need to use is an extremely basic test, but it is rarely discussed in the literature.

Continue reading

Why ‘Wald’ is Wrong: once more on confidence intervals


The idea of plotting confidence intervals on data, which is discussed in a number of posts elsewhere on this blog, should be straightforward. Everything we observe is uncertain, but some things are more certain than others! Instead of marking an observation as a point, its better to express it as a ‘cloud’, an interval representing a range of probabilities.

But the standard method for calculating intervals that most people are taught is wrong.

The reasons why are dealt with in detail in (Wallis 2013). In preparing this paper for publication, however, I came up with a new demonstration, using real data, as to why this is the case.

Continue reading

Freedom to vary and significance tests


Statistical tests based on the Binomial distribution (z, χ², log-likelihood and Newcombe-Wilson tests) assume that the item in question is free to vary at each point. This simply means that

  • If we find f items under investigation (what we elsewhere refer to as ‘Type A’ cases) out of N potential instances, the statistical model of inference assumes that it must be possible for f to be any number from 0 to N.
  • Probabilities, p = f / N, are expected to fall in the range [0, 1].

Note: this constraint is a mathematical one. All we are claiming is that the true proportion in the population could conceivably range from 0 to 1. This property is not limited to strict alternation with constant meaning (onomasiological, “envelope of variation” studies). In semasiological studies, where we evaluate alternative meanings of the same word, these tests can also be legitimate.

However, it is common in corpus linguistics to see evaluations carried out against a baseline containing terms that simply cannot plausibly be exchanged with the item under investigation. The most obvious example is statements of the following type: “linguistic Item x increases per million words between category 1 and 2”, with reference to a log-likelihood or χ² significance test to justify this claim. Rarely is this appropriate.

Some terminology: If Type A represents say, the use of modal shall, most words will not alternate with shall. For convenience, we will refer to cases that will alternate with Type A cases as Type B cases (e.g. modal will in certain contexts).

The remainder of cases (other words) are, for the purposes of our study, not evaluated. We will term these invariant cases Type C, because they cannot replace Type A or Type B.

In this post I will explain that not only does introducing such ‘Type C’ cases into an experimental design conflate opportunity and choice, but it also makes the statistical evaluation of variation more conservative. Not only may we mistake a change in opportunity as a change in the preference for the item, but we also weaken the power of statistical tests and tend to reject significant changes (in stats jargon, “Type II errors”).

This problem of experimental design far outweighs differences between methods for computing statistical tests. Continue reading

Choosing the right test


One of the most common questions a new researcher has to deal with is the following:

what is the right statistical test for my purpose?

To answer this question we must distinguish between

  1. different experimental designs, and
  2. optimum methods for testing significance.

In corpus linguistics, many research questions involve choice. The speaker can say shall or will, choose to add a postmodifying clause to an NP or not, etc. If we want to know what factors influence this choice then these factors are termed independent variables (IVs) and the choice is  the dependent variable (DV). These choices are mutually exclusive alternatives. Framing the research question like this immediately helps us focus in on the appropriate class of tests.  Continue reading

Plotting confidence intervals on graphs

So: you’ve got some data, you’ve read up on confidence intervals and you’re convinced. Your data is a small sample from a large/infinite population (all of contemporary US English, say), and therefore you need to estimate the error in every observation. You’d like to plot a pretty graph like the one below, but you don’t know where to start.

An example graph plot showing the changing proportions of meanings of the verb think over time in the US TIME Magazine Corpus, with Wilson score intervals, after Levin (2013). Many thanks to Magnus for the data!

Of course this graph is not just pretty.

Continue reading