Three kinds of corpus evidence – and two types of constraint

Text corpora permit researchers to find evidence of three distinct kinds.

1. Frequency evidence of known terms (‘performance’)

Suppose you have a plain text corpus which you attempt to annotate automatically. You apply a computer program to the text. This program can be thought of as comprising three elements: a theoretical framework or ‘scheme’, an algorithm, and a knowledge-base (KB). Terms and constituents in this scheme are applied to the corpus according to the algorithm.

Having done so, it should be a relatively simple matter to index those terms in the corpus and obtain frequencies for each one (e.g., how many instances of ‘may’ are classed as a modal verb, a noun, etc.). The frequency evidence obtained tells you how the program performed against the real-world data in the corpus. However, if you stop at this point, you do not know whether this evidence is accurate or complete.
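As a rough illustration of this indexing step, here is a minimal sketch in Python. It assumes the annotation program has produced a list of (word, tag) pairs; the tokens, the tag names and the helper `tag_frequencies` are all invented for the example and do not belong to any particular tagger or scheme.

```python
from collections import Counter

# Hypothetical output of an automatic tagger: (word, tag) pairs.
# Both the sentences and the tag labels are invented for illustration.
tagged_tokens = [
    ("She", "PRON"), ("may", "MODAL"), ("leave", "VERB"), ("in", "PREP"),
    ("May", "NOUN"), (".", "PUNCT"),
    ("You", "PRON"), ("may", "MODAL"), ("go", "VERB"), (".", "PUNCT"),
]

def tag_frequencies(tokens, word):
    """Count how often `word` (ignoring case) receives each tag."""
    target = word.lower()
    return Counter(tag for w, tag in tokens if w.lower() == target)

freqs = tag_frequencies(tagged_tokens, "may")
for tag, n in freqs.most_common():
    print(f"may/{tag}: {n}")
```

The counts tell you what the program decided, not whether it was right: a mistagged ‘May’ would be counted just as confidently as a correct one.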

2. Factual evidence of unknown terms (‘discovery’)

The process of annotation presents the opportunity for discovery of novel linguistic events. All NLP algorithms have a particular, and inevitably less-than-perfect, performance. The system may misclassify some items, misanalyse constituents, or simply fail. Therefore

  1. first-pass frequency evidence is likely to be inaccurate (and potentially incomplete),
  2. errors may be due to inadequacies in the scheme, algorithm or knowledge-base.

In practice we have two choices: amend the system (scheme, KB or algorithm) and/or correct the corpus manually. A law of diminishing returns applies, and a certain amount of manual editing is inevitably necessary. [As a side comment, part-of-speech annotation is relatively accurate, but full parsing is prone to error. As different systems employ different frameworks, accuracy rates vary, but one can anticipate around 95% accuracy for POS-tagging and at best 70% accuracy for parsing. In any case, some errors may be impossible to address without a deeper semantic analysis of the sentence than is feasible.]


Plotting confidence intervals on graphs

So: you’ve got some data, you’ve read up on confidence intervals and you’re convinced. Your data is a small sample from a large/infinite population (all of contemporary US English, say), and therefore you need to estimate the error in every observation. You’d like to plot a pretty graph like the one below, but you don’t know where to start.

An example graph plot showing the changing proportions of meanings of the verb think over time in the US TIME Magazine Corpus, with Wilson score intervals, after Levin (2013). Many thanks to Magnus for the data!

Of course this graph is not just pretty.
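If you want to compute the intervals yourself before plotting, the following is a minimal sketch of the Wilson score interval (without continuity correction) in Python. The function name `wilson_interval`, the default critical value z ≈ 1.96 (a 95% interval) and the counts are assumptions made for illustration; they are not the data behind the graph.

```python
import math

def wilson_interval(p, n, z=1.959964):
    """Wilson score interval (no continuity correction) for an
    observed proportion p based on n cases. z is the two-tailed
    critical value of the Normal distribution (about 1.96 for 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - spread, centre + spread

# Invented example counts: (hits, total cases) per time period.
observations = {"1920s": (12, 40), "1960s": (25, 50), "2000s": (45, 60)}

for period, (hits, total) in observations.items():
    p = hits / total
    lower, upper = wilson_interval(p, total)
    print(f"{period}: p = {p:.3f}, interval = ({lower:.3f}, {upper:.3f})")
```

The interval is generally not symmetric about p, so when you pass it to a plotting library you will need separate lower and upper error-bar lengths (p − lower and upper − p) rather than a single ± value.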


Some bêtes noires

A number of issues recur in corpus linguistics papers:

  1. authors very commonly cite frequencies normalised per million or per thousand words (i.e. a per-word baseline or a multiple thereof) as their primary measure,
  2. data is usually plotted without confidence intervals, so it is not possible to see at a glance whether a perceived change might be statistically significant, and
  3. significance tests are often employed without a clear statement of what the test is evaluating.

Experimental design

The first issue may be unique to corpus linguistics, deriving from its particular historical origins.

It concerns the experimenter attempting to identify counterfactual alternates or select baselines. This is an experimental design question.

In the beginning was the Word.

Linguists examining volumes of plain text data (later supported by computing and part-of-speech tagging) invariably concentrated on the idea of the word as the unit of language. Collocation and concordancing sat alongside lexicography as the principal tools of the trade. “Statistics” here primarily concerned probabilistic measures of association between neighbouring words in order to find common patterns. This activity is of course perfectly fine, and allowed researchers to make huge gains in our understanding of language.


Without labouring the point (which I do elsewhere on this blog), the corollary of the statement that language is grammatical is that if, instead of describing the distribution of words, n-grams, etc., we wish to investigate how language is produced, the word cannot be our primary focus.

Competition between choices over time

Introduction

Measuring choices over time implies examining competition between alternates.

This is a fairly obvious statement. However, some of the mathematical properties of this system are less well known. These inform the expected behaviour of observations, helping us correctly specify null hypotheses.

  • The proportion of {shall, will} utterances where shall is chosen, p(shall | {shall, will}), is in competition with the alternative probability of will (they are mutually exclusive) and bounded on a probabilistic scale.
  • The probability associated with each member of a set of alternates X = {xᵢ}, which we might write as p(xᵢ | X), is bounded, 0 ≤ p(xᵢ | X) ≤ 1, and exhaustive, Σp(xᵢ | X) = 1.
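The two properties can be checked with a few lines of Python. This is only a sketch: the counts for shall and will are invented, and `choice_probabilities` is a hypothetical helper name.

```python
def choice_probabilities(counts):
    """Convert raw counts for a set of alternates X into p(x | X)."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no observations for this set of alternates")
    return {form: n / total for form, n in counts.items()}

# Invented counts for the alternation set {shall, will}.
counts = {"shall": 30, "will": 70}
probs = choice_probabilities(counts)

# Bounded: every probability lies between 0 and 1.
assert all(0 <= p <= 1 for p in probs.values())
# Exhaustive: the probabilities of all alternates sum to 1.
assert abs(sum(probs.values()) - 1) < 1e-12

print(probs)
```

Because the set is exhaustive, any rise in p(shall | {shall, will}) must be matched by an equal fall in p(will | {shall, will}).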

A bounded system behaves differently from an unbounded one. Every child knows that a ball bouncing in an alley behaves differently than in an open playground. ‘Walls’ direct motion toward the centre.

In this short paper we discuss two properties of competitive choice:

  1. the tendency for change to be S-shaped rather than linear, and
  2. how this has an impact on confidence intervals.
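To get a feel for the first property, here is a small sketch that assumes a logistic curve as one common model of S-shaped change on a bounded scale. The choice of the logistic function, its parameters and the time points are assumptions for illustration only, and the sketch does not address the second property.

```python
import math

def logistic(t, midpoint=0.0, slope=1.0):
    """Logistic curve: a proportion that stays strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-slope * (t - midpoint)))

# Equally spaced time points either side of the midpoint (arbitrary units).
times = list(range(-6, 7, 2))
values = [logistic(t) for t in times]

# Change in the proportion between successive time points.
steps = [later - earlier for earlier, later in zip(values, values[1:])]

for t, p in zip(times, values):
    print(f"t = {t:+d}: p = {p:.4f}")
print("steps:", [round(s, 4) for s in steps])
```

Equal steps in time give small changes in the proportion close to 0 and 1 and large changes near the middle of the scale, which is what produces the S shape.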

That vexed problem of choice

(with thanks to Jill Bowie and Bas Aarts)

Abstract

A key challenge in corpus linguistics concerns the difficulty of operationalising linguistic questions in terms of choices made by speakers or writers. Whereas lab researchers design an experiment around a choice, comparable corpus research implies the inference of counterfactual alternates. This non-trivial requirement leads many to rely on a per million word baseline, meaning that variation separately due to opportunity and choice cannot be distinguished.

We formalise definitions of mutual substitution and the true rate of alternation as useful idealisations, recognising they may not always hold. Analysing data from a new volume on the verb phrase, we demonstrate how a focus on choices available to speakers allows researchers to factor out the effect of changing opportunities to draw conclusions about choices.

We discuss research strategies where alternates may not be easily identified, including refining baselines by eliminating forms and surveying change against multiple baselines. Finally we address three objections that have been made to this framework, that alternates are not reliably identifiable, baselines are arbitrary, and differing ecological pressures apply to different terms. Throughout we motivate our responses by evidence from current research, demonstrating that whereas the problem of identifying choices may be ‘vexed’, it represents a highly fruitful paradigm for corpus linguistics.
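The difference between a per million word baseline and a choice baseline can be seen in a toy calculation. The sketch below uses invented counts for two hypothetical subcorpora of equal size, and the function names are made up for the example.

```python
def per_million_words(hits, corpus_words):
    """Frequency normalised against the number of words in the corpus."""
    return hits * 1_000_000 / corpus_words

def choice_rate(hits, opportunities):
    """Proportion of opportunities in which the form was actually chosen."""
    return hits / opportunities

# Invented figures: the number of opportunities to choose (e.g. all
# shall/will cases) doubles between periods, but speakers choose
# 'shall' in half of the cases in both periods.
periods = {
    "period A": {"words": 1_000_000, "opportunities": 200, "hits": 100},
    "period B": {"words": 1_000_000, "opportunities": 400, "hits": 200},
}

for name, d in periods.items():
    pmw = per_million_words(d["hits"], d["words"])
    rate = choice_rate(d["hits"], d["opportunities"])
    print(f"{name}: {pmw:.0f} per million words, choice rate = {rate:.2f}")
```

In this invented case the per million word figure doubles while the choice rate does not move at all: the whole of the apparent change is due to opportunity, and none of it to choice.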
