Coping with imperfect data


One of the challenges for corpus linguists is that many of the distinctions that we wish to make are either not annotated in a corpus at all or, if they are represented in the annotation, unreliably annotated. This issue frequently arises in corpora to which an algorithm has been applied, but where the results have not been checked by linguists, a situation which is unavoidable with mega-corpora. However, this is a general problem. We would always recommend that cases be reviewed for accuracy of annotation.

A version of this issue also arises when checking for the possibility of alternation, that is, to ensure that items of Type A can be replaced by Type B items, and vice versa. An example might be epistemic modal shall vs. will. Most corpora, including richly-annotated corpora such as ICE-GB and DCPSE, do not include modal semantics in their annotation scheme. In such cases the issue is not that the annotation is “imperfect”, but rather that our experiment relies on a presumption that the speaker has the choice of either type at any observed point (see Aarts et al. 2013), whereas that choice is conditioned by the semantic content of the utterance.

Is language really “a set of alternations?”

The perspective that the study of linguistic data should be driven by studies of individual speaker choices has been the subject of attack from a number of linguists.

The first set of objections has come from researchers who have traditionally focused on linguistic variation expressed in terms of rates per word, or per million words.

No such thing as free variation?

Smith and Leech (2013) write that “it is commonplace in linguistics that there is no such thing as free variation” and that, indeed, multiple differing constraints apply to each term. On the basis of this observation they propose an ‘ecological’ approach, although in their paper this approach is not clearly defined.

EDS Resources

This post contains the resources for students taking the UCL English Linguistics MA, all in one place.

Session 15: Introduction to statistics

Sessions 18 and 19: Statistics Workshops

Suggested further reading

A methodological progression

(with thanks to Jill Bowie)


One of the most controversial arguments in corpus linguistics concerns the relationship between a ‘variationist’ paradigm comparable with lab experiments, and a traditional corpus linguistics paradigm focusing on normalised word frequencies.

Rather than see these two approaches as diametrically opposed, we propose that it is more helpful to view them as representing different points on a methodological progression, and to recognise that we are often forced to compromise our ideal experimental practice according to the data and tools at our disposal.

Viewing these approaches as being represented along a progression allows us to step back from any single perspective and ask ourselves how different results can be reconciled and research may be improved upon. It allows us to consider the potential value in performing more computer-aided manual annotation — always an arduous task — and where such annotation effort would be usefully focused.

The idea is sketched in the figure below.

A methodological progression: from normalised word frequencies to verified alternation.

Verb Phrase book published

Why this book?

The grammar of English is often thought to be stable over time. However, a new book, The Verb Phrase in English: investigating recent language change with corpora (Cambridge University Press, 2013), edited by Bas Aarts, Joanne Close, Geoffrey Leech and Sean Wallis, presents a body of research showing that, using natural language corpora, one can find changes within a core element of grammar, the Verb Phrase, over a span of decades rather than centuries.

The book draws from papers first presented at a symposium on the verb phrase organised for the Survey of English Usage’s 50th anniversary and on research from the Changing English Verb Phrase project.

Capturing patterns of linguistic interaction

Numerous competing grammatical frameworks exist on paper, as algorithms and embodied in parsed corpora. However, not only is there little agreement about grammars among linguists, but there is no agreed methodology for demonstrating the benefits of one grammar over another. Consequently, the status of parsed corpora or ‘treebanks’ is suspect.

The most common approach to empirically comparing frameworks is based on the reliable retrieval of individual linguistic events from an annotated corpus. However, this method risks circularity, permits redundant terms to be added as a ‘solution’ and fails to reflect the broader structural decisions embodied in the grammar. In this paper we introduce a new methodology based on the ability of a grammar to reliably capture patterns of linguistic interaction along grammatical axes. Retrieving such patterns of interaction does not rely on atomic retrieval alone, does not risk redundancy and is no more circular than a conventional scientific reliance on auxiliary assumptions. It is also a valid experimental perspective in its own right.

We demonstrate our approach with a series of natural experiments. We find an interaction captured by a phrase structure analysis between attributive adjective phrases under a noun phrase with a noun head, such that the probability of adding successive adjective phrases falls. We note that a similar interaction (between adjectives preceding a noun) can also be found with a simple part-of-speech analysis alone. On the other hand, preverbal adverb phrases do not exhibit this interaction, a result anticipated in the literature, confirming our method.

Turning to cases of embedded postmodifying clauses, we find a similar fall in the additive probability of both successive clauses modifying the same NP and embedding clauses where the NP head is the most recent one. Sequential postmodification of the same head reveals a fall and then a rise in this additive probability. Reviewing cases, we argue that this result can only be explained as a natural phenomenon acting on language production which is expressed by the distribution of cases on an embedding axis, and that this is in fact empirical evidence for a grammatical structure embodying a series of speaker choices.

We conclude with a discussion of the implications of this methodology for a series of applications, including optimising and evaluating grammars, modelling case interaction, contrasting the grammar of multiple languages and language periods, and investigating the impact of psycholinguistic constraints on language production.

Freedom to vary and significance tests


Statistical tests based on the Binomial distribution (z, χ², log-likelihood and Newcombe-Wilson tests) assume that the item in question is free to vary at each point. This simply means that:

  • If we find f items under investigation (what we elsewhere refer to as ‘Type A’ cases) out of N potential instances, the statistical model of inference assumes that it must be possible for f to be any number from 0 to N.
  • Probabilities, p = f / N, are expected to fall in the range [0, 1].

Note: this constraint is a mathematical one. All we are claiming is that the true proportion in the population could conceivably range from 0 to 1. This property is not limited to strict alternation with constant meaning (onomasiological, “envelope of variation” studies). In semasiological studies, where we evaluate alternative meanings of the same word, these tests can also be legitimate.
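To make the constraint on p = f / N concrete, here is a minimal Python sketch of the standard Wilson score interval, on which the Newcombe-Wilson test is built. The post itself does not give this formula, so treat the function name `wilson_interval`, the default z value (a two-tailed 95% level) and the example counts as assumptions introduced purely for illustration.

```python
import math


def wilson_interval(f, n, z=1.959964):
    """Wilson score interval for an observed proportion p = f / n.

    f: number of Type A cases observed (hypothetical counts in the example)
    n: number of potential instances (Type A + Type B)
    z: critical value of the Normal distribution (assumed 95% two-tailed)
    """
    if n <= 0 or not 0 <= f <= n:
        raise ValueError("need n > 0 and 0 <= f <= n")
    p = f / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp only to guard against floating-point rounding at p = 0 or p = 1.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical example: 40 cases of 'shall' out of 100 shall/will choices.
lower, upper = wilson_interval(40, 100)
print(f"p = 0.40, interval = ({lower:.4f}, {upper:.4f})")
```

Unlike the simple Normal approximation, this interval cannot stray outside [0, 1], which is consistent with the requirement that f can take any value from 0 to N but no value beyond that range.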

However, it is common in corpus linguistics to see evaluations carried out against a baseline containing terms that simply cannot plausibly be exchanged with the item under investigation. The most obvious example is statements of the following type: “linguistic Item x increases per million words between category 1 and 2”, with reference to a log-likelihood or χ² significance test to justify this claim. Rarely is this appropriate.

Some terminology: If Type A represents, say, the use of modal shall, most words will not alternate with shall. For convenience, we will refer to cases that will alternate with Type A cases as Type B cases (e.g. modal will in certain contexts).

The remainder of cases (other words) are, for the purposes of our study, not evaluated. We will term these invariant cases Type C, because they cannot replace Type A or Type B.

In this post I will explain that not only does introducing such ‘Type C’ cases into an experimental design conflate opportunity and choice, but it also makes the statistical evaluation of variation more conservative. Not only may we mistake a change in opportunity for a change in the preference for the item, but we also weaken the power of statistical tests and tend to miss genuinely significant changes (in stats jargon, “Type II errors”).
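The loss of power can be illustrated with a small Python sketch. All counts below are invented for the purpose of the illustration and are not taken from any corpus; the helper name `chi_square_2x2` is likewise an assumption. The sketch compares a 2 × 2 χ² test of shall against will (Type A vs. Type B) with the same test where the baseline is every other word in the sample (Type A vs. Type B + Type C).

```python
def chi_square_2x2(table):
    """Pearson chi-square (no continuity correction) for a 2x2 table of counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2


# Critical value of chi-square with 1 degree of freedom at the 0.05 level.
CRITICAL_1DF_05 = 3.841

# Hypothetical counts for two subcorpora (e.g. two time periods).
shall = (40, 25)            # Type A
will = (60, 75)             # Type B
words = (100_000, 100_000)  # all words, i.e. Types A + B + C

# Baseline of alternating cases only: shall vs. will.
alternation = [[shall[0], will[0]],
               [shall[1], will[1]]]
# Per-word baseline: shall vs. every other word.
per_word = [[shall[0], words[0] - shall[0]],
            [shall[1], words[1] - shall[1]]]

chi2_alternation = chi_square_2x2(alternation)
chi2_per_word = chi_square_2x2(per_word)

print(f"shall vs. will:        chi2 = {chi2_alternation:.3f}")
print(f"shall vs. other words: chi2 = {chi2_per_word:.3f}")
print(f"critical value (1 df, 0.05): {CRITICAL_1DF_05}")
```

With these invented figures the opportunity to use shall or will is held constant (100 cases in each subcorpus), so only the dilution effect is visible: the test against the alternation baseline exceeds the critical value, while the per-word test on the same shall counts does not.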

This problem of experimental design far outweighs differences between methods for computing statistical tests.