A methodological progression

(with thanks to Jill Bowie)


One of the most controversial arguments in corpus linguistics concerns the relationship between a ‘variationist’ paradigm comparable with lab experiments, and a traditional corpus linguistics paradigm focusing on normalised word frequencies.

Rather than see these two approaches as diametrically opposed, we propose that it is more helpful to view them as representing different points on a methodological progression, and to recognise that we are often forced to compromise our ideal experimental practice according to the data and tools at our disposal.

Viewing these approaches as being represented along a progression allows us to step back from any single perspective and ask ourselves how different results can be reconciled and research may be improved upon. It allows us to consider the potential value in performing more computer-aided manual annotation — always an arduous task — and where such annotation effort would be usefully focused.

The idea is sketched in the figure below.

A methodological progression: from normalised word frequencies to verified alternation.

1. Per million words

There is a long tradition within corpus linguistics of examining variation or change in terms of “normalised word frequencies”, or frequencies per thousand or per million words (pmw).

This is wholly understandable. Indeed, with a plain-text lexical corpus it makes perfect sense. The only reliable baseline is the sheer number of words in a corpus, so if one wishes to compare the frequency of one word — will, let’s say — over a contrast (say, time), then it is first necessary to normalise the frequency of will in each subcorpus by dividing it by the number of words.
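As a minimal sketch of this normalisation, the following Python function converts a raw count into a per-million-word rate. The counts and subcorpus sizes are invented for illustration and are not drawn from any corpus discussed here.

```python
def per_million_words(count: int, total_words: int) -> float:
    """Return the frequency of an item normalised per million words (pmw)."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return count * 1_000_000 / total_words


# Hypothetical figures: occurrences of 'will' and subcorpus size in words.
subcorpora = {
    "earlier": {"will": 1200, "words": 400_000},
    "later": {"will": 1100, "words": 500_000},
}

pmw = {
    name: per_million_words(data["will"], data["words"])
    for name, data in subcorpora.items()
}

for name, rate in pmw.items():
    print(f"{name}: {rate:.0f} pmw")
```

Dividing by the subcorpus size is what makes the two figures comparable: the subcorpora differ in size, so the raw counts alone cannot be compared directly.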

A number of methods make use of these so-called ‘normed’ or ‘normalised’ frequencies. Biber’s Multi Dimensional (MD) Analysis method (1988) combines pmw frequencies with factor analysis to identify dimensions of register variation.

Examining frequencies per million words can be very useful. For example, Church (2000) showed that if you compared the probability of one word appearing at random in a text and the probability of the same word appearing given that it already had appeared in the text, then those words with a low initial probability but a much higher second probability — Noriega, say — were likely to be topic words in the text. Overall, the word Noriega is low frequency in a typical corpus, but it may appear multiple times in a text about Panama in the 1980s. Hey presto! A quick and effective method for identifying key words in a text. Collocational statistics work on the same principle, and exploit the tendency for words to cluster together in potentially linguistically interesting ways.

Per word, or per million word, observations are best thought of as an estimate of exposure rates, i.e. the likelihood that a hearer or reader will be exposed to a particular word if they are presented with the types of texts found in a corpus. If you are constructing a learner dictionary, an exposure rate may be exactly what you need.

However, if you are looking at speaker/writer performance, these methods tend to obtain results that are at best exploratory. That is, the patterns may arise for one of several different reasons, and deciding between these reasons would require further investigation, especially if we wish to draw deeper conclusions. MD Analysis distributes registers along generalised dimensions. It does not tell us why these distributions arise. The fact that some texts contain a higher proportion of verbs than nouns may be valuable, but it is only a first pass. To dig deeper we need to look at relative frequencies and probabilities.

Words have multiple meanings, and many words perform multiple grammatical roles. Language is not “just one damn word after another”, but highly structured. We can therefore expect that the probability of a speaker choosing a particular word will be conditioned by both the meaning desired and the grammatical repertoire available to her at the point of utterance. This means that the total number of words is rarely the optimum baseline for examining a particular change (an exception would be where we could reasonably expect a term to appear at any point in the sentence).

  • First, this recognition means that Binomial statistical methods that assume that an observed probability p can range from 0 to 1 are weak and conservative (see Freedom to vary and significance tests).
  • Secondly, it means that the chance of the choice being available at any point in a sentence is likely to vary. This is not a problem in Church’s text summarisation application because he is attempting to detect this changing opportunity, but it is a big problem in studying linguistic choices as speakers or writers make them (Wallis 2012).

Finally, note that in the Noriega example, the probability of the second occurrence of a word given that the first has already been observed in the same text, which we might write p(Noriega₂ | Noriega₁), is a relative dependent probability, whereas the first, p(Noriega), is simply the absolute per-word probability across the data. Many word-level statistical methods (such as part-of-speech tagging) combine absolute normalised observations (the rate of a word per million words, say) with relative ones, such as the likelihood that the next word will be a verb given that the previous word was to.
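The contrast between the two probabilities can be sketched with a toy calculation. The code below is a simplification made for illustration, not Church's own procedure. It assumes both probabilities are estimated over whole texts: the share of texts containing the word at least once, and, of those, the share containing it at least twice. The function name and the four tiny 'texts' are hypothetical.

```python
from collections import Counter


def adaptation_estimates(texts: list[list[str]], word: str) -> tuple[float, float]:
    """Return (p_first, p_second_given_first), estimated over whole texts.

    p_first: share of texts containing `word` at least once.
    p_second_given_first: of those texts, the share containing it at least twice.
    """
    if not texts:
        raise ValueError("at least one text is required")
    counts = [Counter(tokens)[word] for tokens in texts]
    at_least_once = sum(1 for c in counts if c >= 1)
    at_least_twice = sum(1 for c in counts if c >= 2)
    p_first = at_least_once / len(texts)
    p_second = at_least_twice / at_least_once if at_least_once else 0.0
    return p_first, p_second


# Toy 'corpus' of four tokenised texts (invented for illustration).
texts = [
    "noriega fled and noriega was later arrested".split(),
    "the committee will meet on friday".split(),
    "prices rose again this quarter".split(),
    "the match was postponed because of rain".split(),
]

p_first, p_second = adaptation_estimates(texts, "noriega")
print(f"p(first) = {p_first:.2f}, p(second | first) = {p_second:.2f}")
```

In this toy data the word is rare across texts but certain to recur once it has appeared, which is the low-then-high pattern that marks a likely topic word.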

Changing baselines and focusing on relative probabilities is not a strange thing to do. Often it is the only way to proceed.

2. Selecting a more plausible baseline

These days, few corpora are limited to plain text. Part-of-speech (POS) tagged corpora take us part-way to our goal by differentiating words by wordclass (cf. will as a noun vs. auxiliary verb vs. main verb, etc.). With part-of-speech corpora it is frequently possible to improve on a word-based baseline.

For example, the modal verb will is an auxiliary to a main verb, so a better baseline for examining change in frequency of modal will would be all main verbs, or all tensed main verbs. This baseline is less susceptible to variation by text.

It is perfectly legitimate to experiment with baselines, recognising that the closer we can get to the change we are attempting to evaluate, the better. (Of course, whenever we change the baseline we need to report this in our experimental write-up! The baseline is a crucial element of the experimental design.)

Crucially, what we are doing is attempting to distinguish the opportunity to use the expression under study from the choice of a particular subexpression, in this case modal will.

  • An illustrative example: in a legal cross-examination text, speakers are likely to be discussing past events; two speakers discussing plans for a holiday, say, may use the future will/shall much more frequently.

Indeed, a number of researchers have observed that different text categories frequently offer different opportunities to employ a particular construction (see Wallis 2012). As Biber (1988) points out, some text types (e.g. modern scientific texts) tend to be more “nouny” than the average, while others are more “verby”. However, many researchers do not draw the logical conclusion: when studying speaker performance a per word baseline is usually not a stable basis for comparison.
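A small numerical sketch may help to show why the baseline matters. The two 'texts' below are hypothetical: they have the same number of words, but one contains twice as many main verbs as the other. All counts are invented.

```python
def rates(counts: dict) -> dict:
    """Return the rate of modal 'will' against two different baselines."""
    return {
        "pmw": counts["modal_will"] * 1_000_000 / counts["words"],
        "per_main_verb": counts["modal_will"] / counts["main_verbs"],
    }


# Hypothetical counts for a 'nouny' and a 'verby' text of equal length.
texts = {
    "nouny text": {"words": 10_000, "main_verbs": 1_000, "modal_will": 20},
    "verby text": {"words": 10_000, "main_verbs": 2_000, "modal_will": 40},
}

results = {name: rates(counts) for name, counts in texts.items()}

for name, r in results.items():
    print(f"{name}: {r['pmw']:.0f} pmw, {r['per_main_verb']:.3f} per main verb")
```

On a per-million-word baseline the second text appears to use will twice as often. Against main verbs the two rates are identical: the difference lies entirely in the opportunity to use a modal, not in the choice made when that opportunity arises.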

One way to improve a baseline is to pick a higher-level structure, such as VPs (identified by counting main verbs in a POS-tagged corpus), and then eliminate cases that would not plausibly alternate.

  • Smitterberg (2005) examined change in the progressive by invoking so-called “knock-out factors” where a VP could not plausibly be recast as a progressive VP. His method improved upon a pmw score by obtaining arguably more reliable results, and also led to a different ranking between text categories in terms of which types of text exhibited a greater tendency to employ the progressive.
  • Bowie et al. (2014) showed that, in examining tensed VPs over time, not only did different text categories have different rates of tensed VP pmw (heights of column in the figure below), but also that in some text categories the rate increased over “time” (LLC / ICE-GB subcorpora), but in other categories it decreased (compare blue and red column heights).
Tensed VPs per million words, by text category, compared across the two subcorpora of DCPSE (after Bowie et al. 2014).

Irrespective of the source of this variation (sampling or a real change), the choice of baseline matters.

Aside: A potentially helpful metaphor involves relative motion. Imagine two trains moving at different speeds (possibly in opposite directions, or on parallel or divergent tracks). Suppose that we wish to study passenger movement, that is, whether passengers are moving towards the front or rear of each train. How can we do this? We factor out the train’s movement in each case by adopting this movement as the ‘origin’ or baseline of the passenger movement. If the trains are moving in opposite directions, the front of the train is different in each case. To account for diverging trains we might say the axis of motion is along the train.

What applies to different genre sub-corpora also applies to differently-sampled corpora. A very important corollary of the observation that different text categories may behave differently with different baselines is that reducing the effect of varying opportunity improves the robustness of comparisons between corpora, i.e. it improves the reproducibility of results.

3. Enumerating alternates and grammatically restricting data

A third level of refinement is often achieved by enumerating alternates (sometimes referred to as specifying the “envelope of variation”). Thus in studying modal will we ask the question, what does will alternate with? What could the speaker say rather than modal will?

We find ourselves focusing much more closely on the choice {will, shall} than when the baseline includes a wide range of verb phrases where the option of expressing modal will is unlikely to arise. So a better baseline may simply be the set of modal verbs {will, shall}. Other expressions of futurity or prediction may be included within the same alternation set. Note how we are studying “the rate of using will out of all future-expressions” rather than simply the frequency of will per verb phrase or per million words.
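Expressed as a calculation, the rate is now a proportion of the alternation set rather than a frequency per word. The sketch below assumes the set is simply {will, shall}; the function name and the counts for the two subcorpora are invented for illustration.

```python
def choice_rate(counts: dict[str, int], target: str) -> float:
    """Return the proportion of `target` out of all forms in the alternation set."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("the alternation set contains no cases")
    return counts[target] / total


# Hypothetical counts of each alternate in two subcorpora.
earlier = {"will": 60, "shall": 40}
later = {"will": 90, "shall": 30}

p_earlier = choice_rate(earlier, "will")
p_later = choice_rate(later, "will")

print(f"p(will | {{will, shall}}): earlier = {p_earlier:.2f}, later = {p_later:.2f}")
```

Adding further expressions of futurity to the alternation set would mean adding further keys to each dictionary; the calculation itself is unchanged.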

We can further improve the baseline by grammatically restricting cases to remove alternation where one form is strongly marked. To modern ears, they/you shall go… sounds archaic, and thus it may be more meaningful to eliminate all but the first person shall/will alternation. We may further wish to distinguish interrogative or negative cases for similar reasons. The idea is that speakers are more self-aware when using marked forms and therefore are doing so consciously.
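The restriction step can be sketched as a filter applied before any rate is calculated. The record structure below is hypothetical and merely stands in for whatever the corpus annotation provides. As one possible choice, the filter keeps only first person, positive, declarative cases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    modal: str           # 'will' or 'shall'
    person: int          # grammatical person of the subject: 1, 2 or 3
    negative: bool       # True if the clause is negated
    interrogative: bool  # True if the clause is a question


def restrict(cases: list[Case]) -> list[Case]:
    """Keep first person, positive, declarative cases only."""
    return [
        c for c in cases
        if c.person == 1 and not c.negative and not c.interrogative
    ]


# Invented cases for illustration.
cases = [
    Case("will", 1, False, False),
    Case("shall", 1, False, False),
    Case("shall", 1, False, True),   # e.g. a first person question
    Case("will", 3, False, False),   # third person: excluded
    Case("will", 2, True, False),    # second person negative: excluded
    Case("will", 1, False, False),
]

kept = restrict(cases)
will_rate = sum(c.modal == "will" for c in kept) / len(kept)
print(f"{len(kept)} of {len(cases)} cases kept; p(will) = {will_rate:.2f}")
```

Interrogative and negative cases are dropped here for simplicity. They could equally be collected into separate datasets and their alternation examined on its own terms.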

At this point the benefits of rich annotation, such as a parsed analysis, become obvious. With a parsed corpus such as ICE-GB or DCPSE, we can construct a grammatical structural search (called a Fuzzy Tree Fragment or FTF) and grammatically restrict cases based on the annotation. With a POS-tagged corpus we must rely on word-sequence patterns, which may be misleading. However even ICE-GB does not annotate modal auxiliary verbs by modal meaning.

The limits of the annotation are not necessarily the limits of an experiment, provided that the researcher is prepared to manually review cases. Thus in examining first person positive will vs. shall, Jo Close distinguished between Root and Epistemic meanings of will and shall, kept a tally of each, and considered their alternation separately (Aarts et al. 2013).

4. Eliminating non-alternating cases

The final level of baseline improvement involves verification, that is, examining the corpus and eliminating cases that do not alternate. This requires the researcher to go through matching cases in the corpus, or a subsample of them. In each case we need to check that instances in the dataset — cases of the primary form under study (here, will, what we have elsewhere called ‘Type A’ forms) and any alternates (‘Type B’) — could plausibly alternate. Formulaic or idiomatic instances (or those wrongly tagged) should be discarded.

In some situations, such as comparing the use of phrasal verbs vs. latinate alternates, it may be difficult to reliably identify all alternate forms, and thus a baseline of phrasal verbs plus “potentially phrasal verbs” is difficult to obtain. However it is still extremely worthwhile reviewing matching cases of potentially phrasal verbs (Wallis 2012) in order to estimate the proportion of those cases that may alternate. See also Coping with imperfect data for a worked example of this.
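One simple way to use such a review is sketched below. It is an illustration only, not the worked example referred to above. It assumes that the reviewed cases are a random subsample of the candidates, and that the proportion judged to alternate can be scaled up to the full set. The function name and all figures are hypothetical.

```python
def estimate_alternating(total_candidates: int, reviewed: list[bool]) -> tuple[float, float]:
    """Estimate how many candidate cases could plausibly alternate.

    reviewed: one manual judgement per sampled case; True means the case
    was judged able to alternate with the form under study.
    Returns (proportion alternating in the sample, estimated total).
    """
    if not reviewed:
        raise ValueError("at least one reviewed case is required")
    proportion = sum(reviewed) / len(reviewed)
    return proportion, proportion * total_candidates


# Hypothetical review: 40 cases sampled from 2,000 candidates, 30 judged to alternate.
judgements = [True] * 30 + [False] * 10
proportion, estimated_total = estimate_alternating(2_000, judgements)

print(f"{proportion:.0%} of reviewed cases alternate; "
      f"estimated alternating cases: {estimated_total:.0f}")
```

The estimate is only as good as the sample. A small or unrepresentative subsample gives an uncertain proportion, and that uncertainty carries over to any baseline built on it.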

More generally, we should always perform a ‘sanity check’ and examine cases in the corpus to see whether our results mean what we think they mean. Ever-richer levels of corpus annotation do not mean that we can simply rely on that annotation: a focus on speaker choices brings us full circle, back to examining cases in the corpus.

Although this blog is about corpus linguistics statistics, all statistical methods are only as good as the data that they are based on.


Recognising that experimental approaches may be found on a continuum or progression does not mean that one method is “as good as” another. By successively eliminating non-alternating cases from our sample we make our research more precise and reproducible.

We have outlined a series of steps that any researcher can take to improve upon a per million word baseline, which — whether we like it or not — necessarily conflates opportunity and choice. It is a simple question of logic: if you reduce the effect of variation in opportunity (sometimes termed ‘controlling for’ opportunity), you can improve the repeatability of experiments and also the meaningfulness of observations concerning choices that speakers make.

Since we are primarily concerned with such choices in our linguistic theory, whether our datasets are derived from cued and controlled lab experiments or ecologically-sampled uncued corpora, we need to explore optimum methods, to the limits of our data. In order to do this, we focus on first, when these speakers are able to make a choice; and second, when they make one choice over another.

We need to stop assuming that in studying speaker/writer performance, per million word baselines are optimal — or even a sensible default. Unfortunately, for a number of historical reasons, corpus linguistics has tended to do just that.

References

Aarts, B., Close, J. and Wallis, S.A. 2013. Choices over time: methodological issues in investigating current change. Chapter 2 in Aarts, B., Close, J., Leech, G. and Wallis, S.A. (eds.) The Verb Phrase in English. Cambridge: CUP.

Biber, D. 1988. Variation across speech and writing. Cambridge: CUP.

Bowie, J., Wallis, S.A. and Aarts, B. 2014. Contemporary change in modal usage in spoken British English: mapping the impact of “genre”. In Marín-Arrese, J.I., Carretero, M., Arús H., J. and van der Auwera, J. (eds.) English Modality: Core, Periphery and Evidentiality. Berlin: De Gruyter, 57-94.

Church, K. 2000. Empirical Estimates of Adaptation: The chance of Two Noriegas is closer to p/2 than p². Coling, pp. 173-179.

Smitterberg, E. 2005. The Progressive in 19th-century English: a Process of Integration. Amsterdam: Rodopi.

Wallis, S.A. 2012. That vexed problem of choice. London: Survey of English Usage, UCL.

1 thought on “A methodological progression”

  1. Great stuff. ‘Opportunity’ and ‘choice’ seem to be somehow coextensive with ‘variable’ and ‘variant’ (respectively) in the variationist paradigm.
