That vexed problem of choice

(with thanks to Jill Bowie and Bas Aarts)

Abstract

A key challenge in corpus linguistics concerns the difficulty of operationalising linguistic questions in terms of choices made by speakers or writers. Whereas lab researchers design an experiment around a choice, comparable corpus research implies the inference of counterfactual alternates. This non-trivial requirement leads many to rely on a per million word baseline, meaning that variation separately due to opportunity and choice cannot be distinguished.

We formalise definitions of mutual substitution and the true rate of alternation as useful idealisations, recognising they may not always hold. Analysing data from a new volume on the verb phrase, we demonstrate how a focus on choices available to speakers allows researchers to factor out the effect of changing opportunities to draw conclusions about choices.

We discuss research strategies where alternates may not be easily identified, including refining baselines by eliminating forms and surveying change against multiple baselines. Finally, we address three objections that have been made to this framework: that alternates are not reliably identifiable, that baselines are arbitrary, and that differing ecological pressures apply to different terms. Throughout, we motivate our responses by evidence from current research, demonstrating that whereas the problem of identifying choices may be ‘vexed’, it represents a highly fruitful paradigm for corpus linguistics.


Many of the research questions we typically wish a corpus to answer can be formulated in terms of variables representing a linguistic choice made by speakers or writers, sometimes called onomasiology. A complementary approach (semasiology) examines the range of meanings associated with a particular expression. The essential idea is simple: in forming utterances, speakers and writers make a series of conscious and unconscious choices.

Numerous studies have examined how frequencies of words, lexical sequences and grammatical constructions vary under the pressure of changing external sociolinguistic conditions. Papers have been published on differences between speech and writing, the impact of change over time, and so forth. The point of this paper is to reemphasise that at the heart of these studies, correctly conceived, is necessarily a model of choice.

This statement has become axiomatic in sociolinguistics (Labov 1972; Lavandera 1978) and cognitive linguistics research, but in this paper we wish to emphasise that corpus linguists of all stripes cannot avoid this question. If a speaker had no choice about the words or constructions they used, then language would be invariant. Bauer (1994: 19) comments that “change is impossible without some variation”. Logically, therefore, all studies of language variation and change should be primarily conceived as questions of choice.

It follows then that rather than simply evaluating changes in normalised frequencies of individual forms, we need to try as far as possible to frame experiments to investigate changing use within a group of alternative forms. Likewise, semasiological studies can only show variation of semantic types where alternatives to those types are taken into account.

Herein lies the difficulty. In laboratory experiments it is straightforward to constrain choices in advance: present subjects with a stimulus and ask them to press button A or B in response. The experimenter designs in the choice. However, corpus research is performed on unconstrained responses. Researchers carry out ex post facto analysis of data. Variationist ‘linguistic choice’ research therefore requires the inference of the counterfactual, i.e. alongside what subjects wrote or said, we need to infer what they could have written or spoken instead.

Unfortunately, it is frequently non-trivial to identify counterfactual ‘alternates’ (e.g. ‘non-progressive but progressivisable VPs’), and this fact has produced a number of practical objections from linguists. In this paper we explore three principal objections, what they imply, and how they can be overcome to the limits of our data. We also show that even where they cannot be wholly overcome by identifying a definitive alternate pattern in every case, the perspective of linguistic choice experiments remains optimum for obtaining linguistically meaningful results. Even if our data does not match this theoretical ideal, we can still approximate towards it. Recognising the limitations of experiments is central to responsible scientific reporting.

This is not to make a case for only focusing on strict choice. Whereas lab experiments cue choices, corpora can provide much better estimates of the overall likelihood of encountering a form ‘in the wild’. Empirically obtaining a rate of exposure per million words (‘pmw’) may help us rank forms by frequency to guide dictionary construction, design language teaching syllabi, or simply to provide valuable background information regarding which forms are more dominant. However, absolute frequency of exposure is not the same as the preference for a form given a choice of alternates. This paper explores a range of approaches that allow us to distinguish variation in the opportunity to use a form (affected by many factors including context) and variation in the choice of a particular form when that opportunity arises.
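The distinction between exposure and choice can be made concrete with a minimal sketch. The counts below are hypothetical, invented purely for illustration, and the function names `per_million_words` and `choice_rate` are our own labels rather than terms from the paper. The two samples differ only in how often the opportunity to use the form arises.

```python
def per_million_words(hits: int, words: int) -> float:
    """Exposure rate: occurrences of a form per million words of text."""
    return hits / words * 1_000_000


def choice_rate(hits: int, alternates: int) -> float:
    """Proportion of opportunities in which the form was chosen.

    'alternates' counts cases where a competing form was used instead,
    so hits + alternates is the total number of opportunities.
    """
    return hits / (hits + alternates)


# Hypothetical counts (not taken from any corpus).
samples = {
    "A": {"words": 400_000, "form": 120, "alternates": 280},
    "B": {"words": 400_000, "form": 60, "alternates": 140},
}

for name, s in samples.items():
    pmw = per_million_words(s["form"], s["words"])
    p = choice_rate(s["form"], s["alternates"])
    print(f"sample {name}: {pmw:.0f} pmw, chosen in {p:.0%} of opportunities")
```

In this invented case the pmw rate halves between the two samples, yet the form is chosen in the same proportion of opportunities in both. A per million word comparison alone would report a change where speakers' preference has not moved at all.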

As a contribution to a methodological debate, this paper is not intended to supplant linguistic concerns – far from it – but rather to discuss ways in which linguistic hypotheses may be tested against corpus data in a manner maximally commensurable with those of other types of linguistic research (see Schönefeld 2011).

As an aside, it may be worth noting that many sciences employ what we might term ecological models of choice, i.e. models of choices made by organisms in a naturalistic context. Examples include:

  • Market research: researchers are tasked with finding out in what circumstances shoppers might buy product A rather than product B. This choice is the focus of the research (the dependent variable). Other variables, such as whether they purchased other items at the same time, locations of products in the store, etc., may be considered as independent predictor variables.
  • Plant morphology: consider the choice of a rose to grow a flower from a node: not all nodes, where leaves appear, contain a flower. Different environmental factors and species cause the numbers of flowers produced to vary. The meaningful baseline for flower growth would be the number of nodes capable of producing a flower.

The second example illustrates the sense in which we employ the term ‘choice’ in this paper. Although we have referred to living organisms, mathematically the principle can be extended to the unavoidable process of selection from any set of alternative outcomes arising in a process. The capacity for conscious decision-making is not a precondition for this selection to take place.

These examples present similar challenges to the linguist attempting to extract choices from naturalistic data. If you wish to study why shoppers or roses do what they do, and what outcomes they produce, you should pose the research question in terms of a logical model of choice. Shoppers and plant nodes incapable of making the selection are discounted. More complex models may include the impact of co-occurring outcomes – repression of new growth when a plant has flowers on other stems, for example – but such models are best built on the basic choice model.

Note that we may need an explicit theory predicting the counterfactual, e.g. where a flower failed to appear. Different plants will grow flowers at different points in their structure – in the case of the rose, at the apex of the stem rather than the side, so the flower represents the end point of growth, and only the growing tip is capable of making the ‘choice’. As we shall see in section 5.2, studies of choice can include evaluating an effect downstream of the selection.

Excerpt: Refining baselines

One way we can see the effect of different baselines is simply to plot their ratio across the contrast under scrutiny. Figure 2 plots the proportions of possible baseline forms – words, tensed VPs, all modals and the pair {will, shall} – in DCPSE, across the time contrast 1960s:1990s (LLC:ICE-GB).

Figure 2. Proportion of DCPSE in the LLC (1960s) subcorpus.

Bar chart representing the degree to which potential baseline forms for examining change in uses of modal shall are found more frequently in the earlier (LLC, 1960s) component of the DCPSE corpus. If the choice of baseline did not matter, these proportions would be identical. Variation between the optimum {will, shall} baseline and the number of words default baseline constitutes variation of opportunity, which is a distracting factor if we wish to study when speakers say shall rather than will.

Around 52% of words are found in the earlier subcorpus, whereas tensed VPs are more evenly distributed at almost exactly 1:1. The proportion of all modals found in the 1960s data is 54%. However, the proportion of first person declarative will or shall combined in the LLC data is 66% of the total (roughly 2:1). Finally, 74% of all cases of first person declarative shall are found in the 1960s subcorpus.

The LLC subcorpus has a higher proportion of will / shall first person declarative forms than the set of all modals would lead one to expect. Perhaps the earlier data contains more frequent expressions of obligation or prediction. Perhaps modal use is changing over time in other ways. Irrespective, only when we alight on the set of first person declarative will / shall cases do we identify all cases of the opportunity to express shall. Employing the nearest baseline to a change allows us to factor out variation of opportunity.
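To see how much the choice of baseline matters, the quoted proportions can be turned into an apparent change in the rate of shall between the two subcorpora. The sketch below is a back-of-envelope calculation, not the paper's own method: it assumes the rounded percentages quoted above (with tensed VPs taken as exactly 50%), and the function name `rate_ratio` is our own. Because only shares are used, the corpus totals cancel out.

```python
# Share of first person declarative 'shall' found in the LLC (1960s) subcorpus.
SHALL_IN_LLC = 0.74

# Share of each candidate baseline found in the LLC subcorpus
# (rounded figures quoted in the text; tensed VPs assumed to be exactly 0.50).
BASELINES_IN_LLC = {
    "words": 0.52,
    "tensed VPs": 0.50,
    "all modals": 0.54,
    "{will, shall}": 0.66,
}


def rate_ratio(form_share: float, baseline_share: float) -> float:
    """Rate of the form per baseline unit in the later subcorpus,
    divided by the same rate in the earlier subcorpus."""
    earlier = form_share / baseline_share
    later = (1 - form_share) / (1 - baseline_share)
    return later / earlier


ratios = {
    name: rate_ratio(SHALL_IN_LLC, share)
    for name, share in BASELINES_IN_LLC.items()
}

for name, ratio in ratios.items():
    print(f"shall per {name}: 1990s rate is {ratio:.2f} x the 1960s rate")
```

On these rounded figures, shall measured per word appears to fall to roughly 0.38 of its 1960s rate, whereas measured against the {will, shall} baseline it falls to roughly 0.68. The gap between the two is the variation of opportunity that the word baseline conflates with choice.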


  1. Introduction
  2. Some preliminaries
    2.1   Mutual substitution
    2.2   True rate of alternation
    2.3   Must meaning be constant?
  3. Refining baselines and the ratio principle
    3.1   Word-based baselines
    3.2   Refining baselines
    3.3   Variation and reproducibility
  4. Surveying ‘absolute’ and ‘relative’ variation
  5. From alternation to choice
    5.1   Simple grammatical interaction
    5.2   To add or not to add
    5.3   Grammatically diverse alternates
  6. Objections
    6.1   Alternates are not reliably identifiable
      6.1.1  Bottom up – serially identifying alternates
      6.1.2  Top down – improving a verb baseline
    6.2   Baselines are arbitrary
    6.3   Multiple ecological pressures apply
      6.3.1  Multiplicity of lexical meaning
      6.3.2  Multiplicity of pressures on choices
  7. Conclusions


Wallis, S.A. forthcoming. That vexed problem of choice. London: Survey of English Usage, UCL.


Bauer, L. 1994. Watching English Change: An Introduction to the Study of Linguistic Change in Standard Englishes in the Twentieth Century. London: Longman.

Labov, W. 1972. Sociolinguistic patterns. Philadelphia: University of Pennsylvania Press.

Lavandera, B.R. 1978. Where does the sociolinguistic variable stop? Language in Society 7: 171-182.

Schönefeld, D. (ed.) 2011. Converging Evidence. Methodological and theoretical issues for linguistic research. Amsterdam: John Benjamins.

