(with thanks to Jill Bowie and Bas Aarts)
A key challenge in corpus linguistics is the difficulty of operationalising linguistic questions in terms of choices made by speakers or writers. Whereas lab researchers design an experiment around a choice, comparable corpus research requires the inference of counterfactual alternates. This non-trivial requirement leads many to rely on a per-million-word baseline, which means that variation due to opportunity cannot be distinguished from variation due to choice.
We formalise definitions of mutual substitution and the true rate of alternation as useful idealisations, recognising that they may not always hold. Analysing data from a new volume on the verb phrase, we demonstrate how a focus on the choices available to speakers allows researchers to factor out the effect of changing opportunities and so draw conclusions about choices.
We discuss research strategies for cases where alternates may not be easily identified, including refining baselines by eliminating forms and surveying change against multiple baselines. Finally, we address three objections that have been made to this framework: that alternates are not reliably identifiable, that baselines are arbitrary, and that differing ecological pressures apply to different terms. Throughout, we motivate our responses with evidence from current research, demonstrating that although the problem of identifying choices may be ‘vexed’, it represents a highly fruitful paradigm for corpus linguistics.