Is language really “a set of alternations?”

The perspective that the study of linguistic data should be driven by studies of individual speaker choices has been the subject of attack from a number of linguists.

The first set of objections have come from researchers who have traditionally focused on linguistic variation expressed in terms of rates per word, or per million words.

No such thing as free variation?

As Smith and Leech (2013) put it: “it is commonplace in linguistics that there is no such thing as free variation” and that indeed multiple differing constraints apply to each term. On the basis of this observation they propose an ‘ecological’ approach, although in their paper this approach is not clearly defined.

Continue reading

EDS Resources

This post contains the resources for students taking the UCL English Linguistics MA, all in one place.

Session 15: Introduction to statistics

Sessions 18 and 19: Statistics Workshops

Suggested further reading

A methodological progression

(with thanks to Jill Bowie)


One of the most controversial arguments in corpus linguistics concerns the relationship between a ‘variationist’ paradigm comparable with lab experiments, and a traditional corpus linguistics paradigm focusing on normalised word frequencies.

Rather than see these two approaches as diametrically opposed, we propose that it is more helpful to view them as representing different points on a methodological progression, and to recognise that we are often forced to compromise our ideal experimental practice according to the data and tools at our disposal.

Viewing these approaches as being represented along a progression allows us to step back from any single perspective and ask ourselves how different results can be reconciled and research may be improved upon. It allows us to consider the potential value in performing more computer-aided manual annotation — always an arduous task — and where such annotation effort would be usefully focused.

The idea is sketched in the figure below.

A methodological progression

A methodological progression: from normalised word frequencies to verified alternation.

Continue reading

Verb Phrase book published

Why this book?

book coverThe grammar of English is often thought to be stable over time. However a new book, edited by Bas Aarts, Joanne Close, Geoffrey Leech and Sean Wallis, The Verb Phrase in English: investigating recent language change with corpora (Cambridge University Press, 2013) presents a body of research from linguists that shows that using natural language corpora one can find changes within a core element of grammar, the Verb Phrase, over a span of decades rather than centuries.

The book draws from papers first presented at a symposium on the verb phrase organised for the Survey of English Usage’s 50th anniversary and on research from the Changing English Verb Phrase project.

Continue reading

Capturing patterns of linguistic interaction

Abstract Full Paper (PDF)

Numerous competing grammatical frameworks exist on paper, as algorithms and embodied in parsed corpora. However, not only is there little agreement about grammars among linguists, but there is no agreed methodology for demonstrating the benefits of one grammar over another. Consequently the status of parsed corpora or ‘treebanks’ is suspect.

The most common approach to empirically comparing frameworks is based on the reliable retrieval of individual linguistic events from an annotated corpus. However this method risks circularity, permits redundant terms to be added as a ‘solution’ and fails to reflect the broader structural decisions embodied in the grammar. In this paper we introduce a new methodology based on the ability of a grammar to reliably capture patterns of linguistic interaction along grammatical axes. Retrieving such patterns of interaction does not rely on atomic retrieval alone, does not risk redundancy and is no more circular than a conventional scientific reliance on auxiliary assumptions. It is also a valid experimental perspective in its own right.

We demonstrate our approach with a series of natural experiments. We find an interaction captured by a phrase structure analysis between attributive adjective phrases under a noun phrase with a noun head, such that the probability of adding successive adjective phrases falls. We note that a similar interaction (between adjectives preceding a noun) can also be found with a simple part-of-speech analysis alone. On the other hand, preverbal adverb phrases do not exhibit this interaction, a result anticipated in the literature, confirming our method.

Turning to cases of embedded postmodifying clauses, we find a similar fall in the additive probability of both successive clauses modifying the same NP and embedding clauses where the NP head is the most recent one. Sequential postmodification of the same head reveals a fall and then a rise in this additive probability. Reviewing cases, we argue that this result can only be explained as a natural phenomenon acting on language production which is expressed by the distribution of cases on an embedding axis, and that this is in fact empirical evidence for a grammatical structure embodying a series of speaker choices.

We conclude with a discussion of the implications of this methodology for a series of applications, including optimising and evaluating grammars, modelling case interaction, contrasting the grammar of multiple languages and language periods, and investigating the impact of psycholinguistic constraints on language production.

Continue reading

Three kinds of corpus evidence – and two types of constraint

Text corpora permit researchers to find evidence of three distinct kinds.

1. Frequency evidence of known terms (‘performance’)

Suppose you have a plain text corpus which you attempt to annotate automatically. You apply a computer program to the text. This program can be thought of as comprising three elements: a theoretical framework or ‘scheme’, an algorithm, and a knowledge-base (KB). Terms and constituents in this scheme are applied to the corpus according to the algorithm.

Having done so it should be a relatively simple matter to index those terms in the corpus and obtain frequencies for each one (e.g., how many instances of may are classed as a modal verb, noun, etc). The frequency evidence obtained tells you how the program performed against the real-world data in the corpus. However, if you stop at this point you do not know whether this evidence is accurate or complete.

2. Frequency evidence of unknown terms (‘discovery’)

The process of annotation presents the opportunity for discovery of novel linguistic events. All NLP algorithms have a particular, and inevitably less-than perfect, performance. The system may misclassify some items, misanalyse constituents, or simply fail. Therefore

  1. first-pass frequency evidence is likely to be inaccurate (and potentially incomplete),
  2. errors may be due to inadequacies in the scheme, algorithm or knowledge-base.

In practice we have two choices: amend the system (scheme, KB or algorithm) and/or correct the corpus manually. A law of diminishing returns applies, and a certain amount of manual editing is inevitably necessary. [As a side comment, part-of-speech annotation is relatively accurate, but full parsing is prone to error. As different systems employ different frameworks accuracy rates vary, but one can anticipate around 95% accuracy for POS-tagging and at best 70% accuracy for parsing. In any case, some errors may be impossible to address without a deeper semantic analysis of the sentence than is feasible.]

Continue reading

Random sampling, corpora and case interaction


One of the main unsolved statistical problems in corpus linguistics is the following.

Statistical methods assume that samples under study are taken from the population at random.

Text corpora are only partially random. Corpora consist of passages of running text, where words, phrases, clauses and speech acts are structured together to describe the passage.

The selection of text passages for inclusion in a corpus is potentially random. However cases within each text may not be independent.

This randomness requirement is foundationally important. It governs our ability to generalise from the sample to the population.

The corollary of random sampling is that cases are independent from each other.

I see this problem as being fundamental to corpus linguistics as a credible experimental practice (to the point that I forced myself to relearn statistics from first principles after some twenty years in order to address it). In this blog entry I’m going to try to outline the problem and what it means in practice.

The saving grace is that statistical generalisation is premised on a mathematical model. The problem is not all-or-nothing. This means that we can, with care, attempt to address it proportionately.

[Note: To actually solve the problem would require the integration of multiple sources of evidence into an a posteriori model of case interaction that computed marginal ‘independence probabilities’ for each case abstracted from the corpus. This is way beyond what any reasonable individual linguist could ever reasonably be expected to do unless an out-of-the-box solution is developed (I’m working on it, albeit slowly, so if you have ideas, don’t fail to contact me…).]

There are numerous sources of case interaction and clustering in texts, ranging from conscious repetition of topic words and themes, unconscious tendencies to reuse particular grammatical choices, and interaction along axes of, for example, embedding and co-ordination (Wallis 2012a), and structurally overlapping cases (Nelson et al 2002: 272).

In this blog post I first outline the problem and then discuss feasible good practice based on our current technology.  Continue reading