One of the main unsolved statistical problems in corpus linguistics is the following.
Statistical methods assume that samples under study are taken from the population at random.
Text corpora are only partially random. Corpora consist of passages of running text, where words, phrases, clauses and speech acts are structured together to describe the passage.
The selection of text passages for inclusion in a corpus is potentially random. However cases within each text may not be independent.
This randomness requirement is foundationally important. It governs our ability to generalise from the sample to the population.
The corollary of random sampling is that cases are independent from each other.
I see this problem as being fundamental to corpus linguistics as a credible experimental practice (to the point that I forced myself to relearn statistics from first principles after some twenty years in order to address it). In this blog entry I’m going to try to outline the problem and what it means in practice.
The saving grace is that statistical generalisation is premised on a mathematical model. The problem is not all-or-nothing. This means that we can, with care, attempt to address it proportionately.
[Note: To actually solve the problem would require the integration of multiple sources of evidence into an a posteriori model of case interaction that computed marginal ‘independence probabilities’ for each case abstracted from the corpus. This is way beyond what any reasonable individual linguist could ever reasonably be expected to do unless an out-of-the-box solution is developed (I’m working on it, albeit slowly, so if you have ideas, don’t fail to contact me…).]
There are numerous sources of case interaction and clustering in texts, ranging from conscious repetition of topic words and themes, unconscious tendencies to reuse particular grammatical choices, and interaction along axes of, for example, embedding and co-ordination (Wallis 2012a), and structurally overlapping cases (Nelson et al 2002: 272).
In this blog post I first outline the problem and then discuss feasible good practice based on our current technology. Continue reading