Statistical tests based on the Binomial distribution (z, χ², log-likelihood and Newcombe-Wilson tests) assume that the item in question is free to vary at each point. This simply means that
- If we find f items under investigation (what we elsewhere refer to as ‘Type A’ cases) out of N potential instances, the statistical model of inference assumes that it must be possible for f to be any number from 0 to N.
- Probabilities, p = f / N, are expected to fall in the range [0, 1].
Note: this constraint is a mathematical one. All we are claiming is that the true proportion in the population could conceivably range from 0 to 1. This property is not limited to strict alternation with constant meaning (onomasiological, “envelope of variation” studies). In semasiological studies, where we evaluate alternative meanings of the same word, these tests can also be legitimate.
However, it is common in corpus linguistics to see evaluations carried out against a baseline containing terms that simply cannot plausibly be exchanged with the item under investigation. The most obvious example is statements of the following type: “linguistic Item x increases per million words between category 1 and 2”, with reference to a log-likelihood or χ² significance test to justify this claim. Rarely is this appropriate.
Some terminology: If Type A represents say, the use of modal shall, most words will not alternate with shall. For convenience, we will refer to cases that will alternate with Type A cases as Type B cases (e.g. modal will in certain contexts).
The remainder of cases (other words) are, for the purposes of our study, not evaluated. We will term these invariant cases Type C, because they cannot replace Type A or Type B.
In this post I will explain that not only does introducing such ‘Type C’ cases into an experimental design conflate opportunity and choice, but it also makes the statistical evaluation of variation more conservative. Not only may we mistake a change in opportunity as a change in the preference for the item, but we also weaken the power of statistical tests and tend to reject significant changes (in stats jargon, “Type II errors”).
This problem of experimental design far outweighs differences between methods for computing statistical tests. Continue reading