Freedom to vary and significance tests


Statistical tests based on the Binomial distribution (z, χ², log-likelihood and Newcombe-Wilson tests) assume that the item in question is free to vary at each point. This simply means that

  • If we find f items under investigation (what we elsewhere refer to as ‘Type A’ cases) out of N potential instances, the statistical model of inference assumes that it must be possible for f to be any number from 0 to N.
  • Probabilities, p = f / N, are expected to fall in the range [0, 1].

Note: this constraint is a mathematical one. All we are claiming is that the true proportion in the population could conceivably range from 0 to 1. This property is not limited to strict alternation with constant meaning (onomasiological, “envelope of variation” studies). In semasiological studies, where we evaluate alternative meanings of the same word, these tests can also be legitimate.

However, it is common in corpus linguistics to see evaluations carried out against a baseline containing terms that simply cannot plausibly be exchanged with the item under investigation. The most obvious example is statements of the following type: “linguistic item x increases per million words between categories 1 and 2”, with reference to a log-likelihood or χ² significance test to justify this claim. Rarely is this appropriate.

Some terminology: If Type A represents, say, the use of modal shall, most words will not alternate with shall. For convenience, we will refer to cases that can alternate with Type A cases as Type B cases (e.g. modal will in certain contexts).

The remainder of cases (other words) are, for the purposes of our study, not evaluated. We will term these invariant cases Type C, because they cannot replace Type A or Type B.

In this post I will explain that not only does introducing such ‘Type C’ cases into an experimental design conflate opportunity and choice, but it also makes the statistical evaluation of variation more conservative. Not only may we mistake a change in opportunity as a change in the preference for the item, but we also weaken the power of statistical tests and tend to reject significant changes (in stats jargon, “Type II errors”).

This problem of experimental design far outweighs differences between methods for computing statistical tests. In brief, an increase in non-alternating ‘Type C’ cases makes a test more conservative, so significant variation which would be identified with a smaller baseline will tend to be missed with a larger one. This conservatism arises from the fact that by introducing invariant terms we are gradually undermining the mathematical assumptions of the Binomial model and its approximations (Gaussian, log-likelihood, etc.).

A mathematical demonstration

We can demonstrate the problem of conservatism with rising N using a very simple test – the single sample z test (or the equivalent 2 × 1 χ² test for goodness of fit). We will use the Gaussian approximation to the Binomial interval (P – E, P + E), where the error interval width E = zα/2·√(P(1 – P)/N).

A similar result is obtained with Wilson or log-likelihood intervals, as we show below. Consider what happens to the width of the error interval E if N increases but everything else stays the same.

  • Imagine that the true rate is F Type A cases and F′ Type B cases.
  • For the sake of clarity, let us assume that, correctly sampled, our sample size N = F + F′ = 100.
  • The sample is then expanded by adding further Type C cases that could not alternate with A or B. N increases but neither F nor F′ increases.

Significance testing involves comparing observed and expected probabilities (p, P). We will compare observed p = f / N with the expected P = F / N, so to place E on the same scale as the constant frequency F, we multiply it by N/100.

What happens to this error interval as N increases? The figure below plots a standardised Gaussian error term EN/100 over exponentially increasing N for different values of F. Note that the correct interval width is shown at the starting point on the left hand side (N = 100). Every rise after that represents an increase in “desensitivity”.


Expanding Gaussian intervals (α = 0.05), E, scaled by N/100, as the number of non-alternating Type C cases increases, for F = 5, 20, 40, 60, 80 and 95 out of 100 alternating (Type A and B) cases. The correct interval width is on the left hand side of the figure. Increases in interval width to the right represent decreased statistical sensitivity. (Note that the x axis is logarithmic.)

As the baseline size N increases, the interval expands, and corresponding statistical tests become more conservative.

  • The impact of rising N differs with initial frequency F. Consider the line for F = 5, i.e. where the true rate is 5%, or five items out of a baseline of N cases. The line is almost flat, indicating that rising N barely changes the likelihood of obtaining a significant result. However as F increases to occupy a greater share of the original N = 100 cases (20, 40, 60, 80, 95), the error interval increases by a greater fraction. This means that the ‘freedom to vary’ assumption matters more for observed proportions closer to 50% (0.5) than 0%.
  • Note that this problem affects majority types (types whose proportion exceeds 50%) to a greater extent than minority types. With a simple binary choice, the probability of Type B cases is 1 – P. If Type A is in a minority out of A and B, then B is a majority type, and likely to be subject to a conservative assessment of significance. As you can see from the figure, the interval for F = 95 increases more than four-fold with rising N.
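The pattern in the figure can be reproduced with a short script. The following is a sketch using only the Python standard library; the function name and the sample values of N are ours, not taken from the original spreadsheet:

```python
from math import sqrt
from statistics import NormalDist

# Two-tailed critical value for an error level of 0.05
z = NormalDist().inv_cdf(0.975)  # approximately 1.96

def scaled_gaussian_width(F, N):
    """Gaussian ('Wald') error width E = z.sqrt(P(1-P)/N), rescaled by N/100
    so that it is on the same scale as the constant frequency F out of 100."""
    P = F / N
    return z * sqrt(P * (1 - P) / N) * (N / 100)

# F Type A cases out of 100 alternating cases; N then grows as invariant
# Type C cases are added to the baseline.
for F in (5, 50, 95):
    widths = [scaled_gaussian_width(F, N) for N in (100, 1_000, 10_000)]
    print(F, [round(w, 4) for w in widths])
```

For F = 5 the scaled width barely moves as N grows, whereas for F = 95 it increases more than four-fold by N = 10,000, in line with the figure.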

For transparency, the calculations for the Gaussian and Wilson intervals are presented in this spreadsheet.

An example

There are 885,436 words in the Diachronic Corpus of Present-Day Spoken English (DCPSE), of which 150 are first person declarative shall and 136 are will. The overall proportion of shall is therefore p = 150/286 ≈ 52%. However, were we to employ a word baseline, there would be 885,150 Type C cases (words that are neither shall nor will) – approximately 3,095 Type C cases for every alternating case (A or B).

On the same scale, the 95% Gaussian error for f rises from 0.0578, calculated against the alternating cases, to approximately 0.0839, calculated against the number of words – an increase of 45%. (We have employed a ‘Wald’ approximation for simplicity here; error intervals for observations should always use a Wilson-based interval, see below.)

However, “an increase of 45%” underestimates the scale of the problem. This increase is a change in the interval width, not the tail area under the curve. The test is actually now ten times more conservative than it should be. Recall that the error level is the proportion of the area under the Normal curve in the tail area, and it turns out that this area α = 0.0044 rather than the figure we wanted, 0.05!

The error level is based on the area under the curve. Increasing the error interval width (bottom) decreases tail areas, so an increase in interval width tends to understate the loss of sensitivity. The ideal area is 5% but in this case the actual area is around 0.44%.
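The arithmetic in this example can be verified in a few lines of Python. This is a sketch using the standard library only; the variable names are ours:

```python
from math import sqrt
from statistics import NormalDist

norm = NormalDist()
z = norm.inv_cdf(0.975)        # two-tailed critical value, approx. 1.96

f, f_b = 150, 136              # Type A (shall) and Type B (will) cases
n_alt = f + f_b                # alternation baseline: 286
n_words = 885_436              # word baseline (all of DCPSE)

def wald_width(p, n):
    """Gaussian ('Wald') error width for a proportion p out of n."""
    return z * sqrt(p * (1 - p) / n)

e_alt = wald_width(f / n_alt, n_alt)                          # approx. 0.0578
# Rescale the word-baseline interval so it is on the same scale as e_alt
e_words = wald_width(f / n_words, n_words) * n_words / n_alt  # approx. 0.0839

# Effective error level if the inflated interval is used in place of the
# correct one: the critical threshold moves further into the tail.
z_effective = z * e_words / e_alt
alpha_effective = 2 * (1 - norm.cdf(z_effective))
```

The widened interval corresponds to an effective error level of roughly 0.0044 rather than 0.05, i.e. a test around ten times more conservative than intended.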

Other interval calculations

These results are not an artefact of the Gaussian formula for an expected probability P. They are fundamental to the statistical inference model of contingent variation (i.e. the Binomial model). Methods for calculating the interval range for observations p produce essentially the same pattern. The figure below plots Wilson and log-likelihood intervals alongside the ‘Wald’ Gaussian, for f = 95. Again, the correct value is at the far left hand side (where N = 100). These intervals are asymmetric and have a different upper and lower bound (Wallis 2013). Lower and upper bounds swap over when p = 0.5 (here, at N = 2f = 190).


Expanding interval widths for Wilson and log-likelihood interval calculations on observations, f = 95 Type A cases, f′ = 5 Type B cases.
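The Wilson case can be checked with a similar sketch, again in Python. The function below implements the standard Wilson score interval (see Wallis 2013); the choice of sample sizes is ours:

```python
from math import sqrt
from statistics import NormalDist

z = NormalDist().inv_cdf(0.975)  # two-tailed critical value for alpha = 0.05

def wilson_interval(p, n):
    """Wilson score interval (w-, w+) for an observed proportion p out of n."""
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    spread = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - spread, centre + spread

f = 95  # Type A cases; f' = 5 Type B cases; Type C cases then inflate N
for n in (100, 1_000, 10_000):
    lo, hi = wilson_interval(f / n, n)
    # Rescale both bounds by n/100 to place them on the same scale as f
    print(n, round(lo * n / 100, 3), round(hi * n / 100, 3))
```

As with the Gaussian case, the rescaled interval width grows with N; because the interval is asymmetric, the upper and lower bounds expand at different rates.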


Conclusions

The statistical model underpinning Binomial-type tests and intervals assumes that observed frequencies f can take any value from 0 to N. However, many researchers in corpus linguistics use a baseline for N that includes many cases that do not alternate, a common example being the use of ‘per million word’ baselines. Most phenomena in linguistic data are not as frequent as words, and cannot plausibly be so!

In this post, we considered what happens to the statistical model if a baseline includes invariant terms. In our model, the baseline includes N – (f + f′) invariant Type C terms, but the true freedom to vary is limited to a ceiling of f + f′ = 100.

The outcome is inevitable: if frequency counts include large numbers of invariant terms and yet we employ statistical models that assume that the item in question is free to vary, then not only are these tests used inappropriately, but we can also predict that they will tend to be conservative. A corollary is that if we have a small number of invariant terms then these tests should generally be fine. This mathematical observation outweighs differences between interval computations.

In brief:

The selection of a particular statistical computation (Gaussian, Wilson, log-likelihood, etc.) is less important than correctly specifying the experimental design: when using this type of test, restrict data to a set of types whose proportions could plausibly range from 0 to 100%.

Appropriate data could include strict choice-alternates expressing the same meaning (onomasiological change), but might also include different meanings of the same word, string or structure (semasiological change). See Choice vs. use. The item must be free to vary, i.e. it must be conceivable that a sample could consist of 100% of any single given type.

This analysis is in addition to the “envelope of variation” problem, i.e. that invariant Type C cases may also vary in number over the contrast, such as time, under study. See That vexed problem of choice.

Type C cases should ideally be eliminated, but if this is not possible they should be minimised (another way of looking at the graphs above is that the problem is reduced as N approaches 100).

The presence of large numbers of invariant Type C terms would impact on many other statistical models that assume that the item being evaluated is free to vary. For example, this problem also undermines the use of logistic models (S-curves), effectively turning logistic regression into a type of linear principal component analysis.


Wallis, S.A. 2013. Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods. Journal of Quantitative Linguistics 20:3, 178-208.
