### Introduction

Statistical tests based on the Binomial distribution (*z*, χ², log-likelihood and Newcombe-Wilson tests) assume that **the item in question is free to vary at each point**. This simply means that

- If we find *f* items under investigation (what we elsewhere refer to as ‘Type A’ cases) out of *N* potential instances, the statistical model of inference assumes that it must be possible for *f* to be any number from 0 to *N*.
- Probabilities, *p* = *f*/*N*, are expected to fall in the range [0, 1].

**Note:** this constraint is a *mathematical* one. All we are claiming is that the true proportion in the population could conceivably range from 0 to 1. This property is not limited to strict alternation with constant meaning (onomasiological, “envelope of variation” studies). In semasiological studies, where we evaluate alternative meanings of the same word, these tests can also be legitimate.

**However, it is common in corpus linguistics to see evaluations carried out against a baseline containing terms that simply cannot plausibly be exchanged with the item under investigation.** The most obvious example is statements of the following type: “linguistic Item *x* increases per million words between category 1 and 2”, with reference to a log-likelihood or χ² significance test to justify this claim. **Rarely is this appropriate.**

**Some terminology:** If **Type A** represents, say, the use of modal *shall*, most words will not alternate with *shall*. For convenience, we will refer to cases that can alternate with Type A cases as **Type B** cases (e.g. modal *will* in certain contexts).

The remainder of cases (other words) are, for the purposes of our study, not evaluated. We will term these invariant cases **Type C**, because they cannot replace Type A or Type B.

In this post I will explain that not only does introducing such ‘Type C’ cases into an experimental design conflate *opportunity* and *choice*, but it also **makes the statistical evaluation of variation more conservative**. Not only may we mistake a change in opportunity as a change in the preference for the item, but we also weaken the power of statistical tests and tend to reject significant changes (in stats jargon, “Type II errors”).

This problem of **experimental design** far outweighs differences between methods for computing statistical tests. In brief, an increase in non-alternating ‘Type C’ cases makes a test more conservative, so significant variation which would be identified with a smaller baseline will tend to be missed with a larger one. This conservatism arises from the fact that by introducing invariant terms we are gradually undermining the mathematical assumptions of the Binomial model and its approximations (Gaussian, log-likelihood, etc.).

### A mathematical demonstration

We can demonstrate the problem of conservatism with rising *N* using a very simple test – the single sample *z* test (or the equivalent 2 × 1 χ² test for goodness of fit). We will use the Gaussian approximation to the Binomial interval (*P* – *E*, *P* + *E*), where the error interval width *E* = *z*_{α/2} · √(*P*(1 – *P*)/*N*).
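As a concrete reference point, here is a minimal Python sketch of this error term (the function name `gaussian_error` is ours):

```python
import math

def gaussian_error(P: float, N: int, z: float = 1.959964) -> float:
    """Gaussian (Wald) error term E = z * sqrt(P(1 - P) / N)."""
    return z * math.sqrt(P * (1 - P) / N)

# E is widest when P = 0.5; for N = 100 it is roughly 0.098:
print(round(gaussian_error(0.5, 100), 3))  # → 0.098
```

With everything else held constant, increasing *N* shrinks *E* – which is exactly what makes the dilution effect below counter-intuitive.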

A similar result is obtained with Wilson or log-likelihood intervals, as we show below. Consider what happens to the width of the error interval *E* if *N* increases but everything else stays the same.

- Imagine that the true rate is *F* Type A cases and *F*′ Type B cases.
- For the sake of clarity, let us assume that, correctly sampled, our sample size *N* = *F* + *F*′ = 100.
- But the sample is then expanded by adding further Type C cases that could not alternate with A or B. *N* increases but neither *F* nor *F*′ increases.

Significance testing involves comparing observed and expected probabilities (*p*, *P*). We will compare observed *p* = *f* / *N* with the expected *P* = *F* / *N*, so to place *E* on the same scale as the constant frequency *F*, we multiply it by *N*/100.

What happens to this error interval as *N* increases? The figure below plots a standardised Gaussian error term *EN*/100 over exponentially increasing *N* for different values of *F*. Note that the correct interval width is shown at the starting point on the left hand side (*N* = 100). Every rise after that represents an increase in “desensitivity”.

As the baseline size *N* increases, the interval expands, and corresponding statistical tests become more conservative.

- The impact of rising *N* differs with initial frequency *F*. Consider the line for *F* = 5, i.e. where the true rate is 5%, or five items out of a baseline of *N* cases. The line is almost flat, indicating that rising *N* barely changes the likelihood of obtaining a significant result. However, as *F* increases to occupy a greater share of the original *N* = 100 cases (20, 40, 60, 80, 95), the error interval increases by a greater fraction. This means that the ‘freedom to vary’ assumption matters more for observed proportions closer to 50% (0.5) than 0%.
- Note that this problem affects majority types (types whose proportion exceeds 50%) to a greater extent than minority types. With a simple binary choice, the probability of Type B cases is 1 – *P*. If Type A is in a minority out of A and B, then B is a majority type, and likely to be subject to a conservative assessment of significance. As you can see from the figure, the interval for *F* = 95 increases more than four-fold with rising *N*.
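The pattern in the figure can be reproduced numerically. A short Python sketch, under the Gaussian approximation above (the function name is ours):

```python
import math

def standardised_error(F: float, N: float, z: float = 1.959964) -> float:
    """Standardised Gaussian error term E * N/100, for a fixed frequency F
    of alternating cases diluted into a baseline of size N."""
    P = F / N
    return z * math.sqrt(P * (1 - P) / N) * N / 100

# Pad the baseline from N = 100 to N = 1,000,000 with Type C cases:
ratio_5 = standardised_error(5, 1_000_000) / standardised_error(5, 100)
ratio_95 = standardised_error(95, 1_000_000) / standardised_error(95, 100)
print(round(ratio_5, 2), round(ratio_95, 2))
```

The interval for *F* = 5 barely moves (the ratio is close to 1), whereas for *F* = 95 it grows more than four-fold.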

For transparency, the calculations for the Gaussian and Wilson intervals are presented in this spreadsheet.

### An example

There are 885,436 words in the *Diachronic Corpus of Present-Day Spoken English* (DCPSE), of which 150 are first person declarative *shall*, and 136 are *will*. The overall proportion of *shall* is *p* = 150/286 ≈ 52%. However, were we to employ a word baseline, there would be 885,150 Type C cases (words that are neither *shall* nor *will*): approximately 3,095 Type C cases for every single alternating case (A or B).

On the same scale, the 95% Gaussian error for *f* (we have employed a ‘Wald’ approximation for simplicity here: error intervals for observations should always use a Wilson-based interval, see below) rises from 0.0578 calculated against the alternating cases, to approximately 0.0839 calculated against the number of words, an increase of 45%.

However, “an increase of 45%” underestimates the scale of the problem. This increase is a change in the interval *width*, not the tail *area* under the curve. The test is actually now **ten times** more conservative than it should be. Recall that the error level is the proportion of the area under the Normal curve in the tail area, and it turns out that this area α = 0.0044 rather than the figure we wanted, 0.05!
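These figures can be checked with a few lines of Python, using only the numbers quoted above (small discrepancies in the final decimal place are rounding):

```python
import math

z = 1.959964  # two-tailed 5% critical value

# DCPSE first person declarative shall vs. will (figures from the text)
f_shall, f_will, words = 150, 136, 885_436
n_alt = f_shall + f_will          # 286 alternating cases
p = f_shall / n_alt               # ≈ 0.52

# Gaussian ('Wald') error against the alternating baseline:
E_alt = z * math.sqrt(p * (1 - p) / n_alt)

# Error against the word baseline, rescaled to the same axis
# by multiplying by words / n_alt:
P_words = f_shall / words
E_words = z * math.sqrt(P_words * (1 - P_words) / words) * words / n_alt

# The wider interval corresponds to an effective critical value of
# z * (E_words / E_alt); the tail area it actually tests is:
z_eff = z * E_words / E_alt
Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))
alpha_eff = 2 * (1 - Phi(z_eff))

print(round(E_alt, 4), round(E_words, 4), round(alpha_eff, 4))
```

The interval width grows from about 0.058 to about 0.084, and the effective error level α falls to roughly a tenth of the nominal 0.05.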

### Other interval calculations

These results are not an artefact of the Gaussian formula for an expected probability *P*. They are fundamental to the statistical inference model of contingent variation (i.e. the Binomial model). Methods for calculating the interval range for observations *p* obtain essentially the same pattern. The figure below plots Wilson and log-likelihood intervals alongside the ‘Wald’ Gaussian, for *f* = 95. Again, the correct value is at the far left hand side (where *N* = 100). These intervals are asymmetric and have a different upper and lower bound (Wallis 2013). Lower and upper bounds swap over when *p* = 0.5 (here, at *N* = *f* × 2 = 190).
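For instance, the Wilson score interval (Wallis 2013) shows the same conservatism. A sketch, assuming the standard Wilson formula:

```python
import math

def wilson(p: float, n: float, z: float = 1.959964):
    """Wilson score interval (w-, w+) for an observed proportion p
    at sample size n."""
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half

# f = 95 Type A cases; pad the baseline with Type C cases:
widths = {}
for N in (100, 1_000, 1_000_000):
    lo, hi = wilson(95 / N, N)
    widths[N] = (hi - lo) * N / 100   # standardised to the f-scale
    print(N, round(widths[N], 3))
```

As with the Gaussian, the standardised Wilson width grows more than four-fold as the baseline is inflated from 100 to a million.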

### Conclusions

The statistical model underpinning Binomial-type tests and intervals assumes that observed frequencies *f* can take any value from 0 to *N*. However, many researchers working in Corpus Linguistics use a baseline for *N* that includes many cases that do not alternate, a common example being the use of ‘per million word’ baselines. Most phenomena in linguistic data are not as frequent as words, and cannot plausibly be so!

In this post, we considered what happens to the statistical model if a baseline includes invariant terms. In our model, the baseline includes *N* – (*f* + *f*′) invariant Type C terms, but the true freedom to vary is limited to a ceiling of *f* + *f*′ = 100.

The outcome is inevitable: if frequency counts include large numbers of invariant terms and yet we employ statistical models that assume that the item in question is free to vary, then not only are these tests used inappropriately, we can predict that they will tend to be conservative. A corollary is that if we have a small number of invariant terms then these tests should generally be fine. This mathematical observation outweighs differences between interval computations.

In brief:

- The selection of a particular statistical computation (Gaussian, Wilson, log-likelihood, etc.) is **less important** than the question of correctly specifying the experimental design, i.e. in using this type of test, restricting data to a set of types that could plausibly range from 0 to 100%.
- Appropriate data could include strict choice-alternates expressing the same meaning (onomasiological change), but might also include different meanings of the same word, string or structure (semasiological change). See Choice vs. use. The **item must be free to vary**, i.e. it must be conceivable that a sample could consist of 100% of any single given type.
- This analysis is in addition to the “envelope of variation” problem, i.e. that invariant Type C cases may **also** vary in number over the contrast under study, such as time. See That vexed problem of choice.
- Type C cases should ideally be **eliminated**, but if this is not possible they should be **minimised** (another way of looking at the graphs above is that the problem is reduced as *N* approaches 100).

The presence of large numbers of invariant Type C terms would impact on many other statistical models that assume that the item being evaluated is free to vary. For example, this problem also undermines the use of logistic models (S-curves), effectively turning logistic regression into a type of linear principal component analysis.

### References

Wallis, S.A. 2013. Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods. *Journal of Quantitative Linguistics* **20**:3, 178-208. » Post