The replication crisis: what does it mean for corpus linguistics?


Over the last year, the field of psychology has been rocked by a major public dispute about statistics. It concerns the failure of findings published in top psychology journals to replicate.

Replication is a big deal: if you publish a correlation between variable X and variable Y – say, that there is a statistically significant increase in the use of the progressive over time – you would expect that finding to be replicated were the experiment repeated.
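To see why this matters statistically, here is a minimal simulation sketch. All numbers are invented for illustration (the sample size and the two progressive rates are not drawn from any real corpus): a modest but genuine rise in the rate of the progressive is tested repeatedly with fresh samples, using a simple two-proportion z-test.

```python
import math
import random

def two_prop_z(k1, n1, k2, n2):
    """Two-proportion z-test; True if significant at alpha = 0.05 (two-tailed)."""
    p_pool = (k1 + k2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    if se == 0:
        return False
    z = (k2 / n2 - k1 / n1) / se
    return abs(z) > 1.96

random.seed(1)
n = 200                        # tokens sampled per period (hypothetical)
p_early, p_late = 0.10, 0.14   # true progressive rates: a real but modest rise

# Repeat the whole 'study' 1,000 times and count significant outcomes.
replications = sum(
    two_prop_z(sum(random.random() < p_early for _ in range(n)), n,
               sum(random.random() < p_late for _ in range(n)), n)
    for _ in range(1000)
)
print(f"significant in {replications} of 1000 repeated studies")
```

With these invented numbers the effect is real, yet only a minority of repetitions reach significance: an underpowered design produces "findings" that routinely fail to replicate even when nothing is wrong with the original analysis.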

I would strongly recommend Andrew Gelman’s brief history of the developing crisis in psychology. It is not necessary to agree with everything he says (personally, I find little to disagree with, although his argument is challenging) to recognise that he describes a serious problem here.

There may be more than one reason why published studies have failed to obtain compatible results on repetition, so it is worth teasing these reasons apart.

In this blog post I want to explore what this replication crisis is – is it one problem, or several? – and then turn to what solutions might be available and what the implications are for corpus linguistics. Continue reading

POS tagging – a corpus-driven research success story?


One of the longest-running, and in many respects the least helpful, methodological debates in corpus linguistics concerns the spat between so-called corpus-driven and corpus-based linguists.

I say that this has been largely unhelpful because it has encouraged a dichotomy which is almost certainly false, and the focus on whether it is ‘right’ to work from corpus data upwards towards theory, or from theory downwards towards text, distracts from some serious methodological challenges we need to consider (see other posts on this blog).

Usually this discussion reviews the achievements of the best-known corpus-driven linguist, John Sinclair, in building the Collins Cobuild Corpus, and deriving the Collins Cobuild Dictionary (Sinclair et al. 1987) and Grammar (Sinclair et al. 1990) from it.

In this post I propose an alternative examination.

I want to suggest that the greatest success story for corpus-driven research is the development of part-of-speech taggers (usually called a ‘POS-tagger’ or simply ‘tagger’) trained on corpus data.

These are industrial-strength, reliable algorithms that obtain good results with minimal assumptions about language.
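To illustrate the principle in miniature: a tagger can be ‘trained’ simply by counting, for each word form in a tagged corpus, which tag it most often carries. The sketch below is a toy unigram baseline with invented training data, not the algorithm of any real tagger (a production system such as CLAWS adds, at minimum, contextual tag-transition probabilities), but it shows how far raw corpus frequencies get you with no grammatical theory at all.

```python
from collections import Counter, defaultdict

# Toy training data standing in for a POS-tagged corpus (invented examples).
tagged = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("runs", "VERB")],
    [("the", "DET"), ("run", "NOUN"), ("ends", "VERB")],
]

# Count how often each word form carries each tag.
counts = defaultdict(Counter)
for sentence in tagged:
    for word, t in sentence:
        counts[word][t] += 1

def tag(words, default="NOUN"):
    """Assign each word its most frequent training tag (unigram baseline)."""
    return [(w, counts[w].most_common(1)[0][0] if w in counts else default)
            for w in words]

print(tag(["the", "dog", "runs"]))
```

Even this naive frequency lookup tags a large share of running text correctly; the corpus-trained refinements (context models, suffix guessing for unseen words) are what push real taggers to their characteristic 95%+ accuracy.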

So, who needs theory? Continue reading

Why Chomsky was Wrong About Corpus Linguistics


When the entire premise of your methodology is publicly challenged by one of the pre-eminent figures in an overarching discipline, it seems wise to have a defence. Noam Chomsky’s famous objection to corpus linguistics therefore needs a serious response.

“One of the big insights of the scientific revolution, of modern science, at least since the seventeenth century… is that arrangement of data isn’t going to get you anywhere. You have to ask probing questions of nature. That’s what is called experimentation, and then you may get some answers that mean something. Otherwise you just get junk.” (Noam Chomsky, quoted in Aarts 2001).

Chomsky has consistently argued that the systematic ex post facto analysis of natural language sentence data is incapable of taking theoretical linguistics forward. In other words, corpus linguistics is a waste of time, because it is capable of focusing only on external phenomena of language – what Chomsky has at various times described as ‘e-language’.

Instead, we should concentrate our efforts on developing new theoretical explanations for the internal language within the mind (‘i-language’). Over the years the terminology has varied, but the argument has remained the same: real linguistics is the study of i-language, not e-language. Corpus linguistics studies e-language. Ergo, it is a waste of time.

Argument 1: in science, data requires theory

Chomsky refers to what he calls ‘the Galilean Style’ to make his case. This is the argument that it is necessary to engage in theoretical abstractions in order to analyse complex data. “[P]hysicists ‘give a higher degree of reality’ to the mathematical models of the universe that they construct than to ‘the ordinary world of sensation’” (Chomsky, 2002: 98). We need a theory in order to make sense of data, as so-called ‘unfiltered’ data is open to an infinite number of possible interpretations.

In the Aristotelian model of the universe the sun orbited the earth. The same data, reframed by the Copernican model, was explained by the rotation of the earth. However, the Copernican model of the universe was not arrived at by theoretical generalisation alone, but by a combination of theory and observation.

Chomsky’s first argument contains a kernel of truth. The following statement is taken for granted across all scientific disciplines: you need theory to analyse data. To put it another way, there is no such thing as an ‘assumption-free’ science. But the second part of this argument – that the necessity of theory permits scientists to dispense with data altogether (or even to dismiss data wholesale) – is not a characterisation of the scientific method that modern scientists would recognise. Indeed, Behme (2016) argues that it is also a mischaracterisation of Galileo’s method. Galileo’s particular fame, and his persecution, came from one source: the observations he made through his telescope. Continue reading

UCL Summer School in English Corpus Linguistics 2016

I am pleased to announce the fourth annual Summer School in English Corpus Linguistics, to be held at University College London on 6–8 July.

The Summer School is an intensive three-day course aimed at PhD-level students and researchers who wish to get to grips with Corpus Linguistics. Places are deliberately limited and allocated on a first-come, first-served basis. You will be taught in a small group by a teaching team.

Each day begins with a theory lecture, followed by a guided hands-on workshop with corpora, and a more self-directed and supported practical session in the afternoon.

Aims and objectives of the course

Over the three days, participants will learn about the following:

  • the scope of Corpus Linguistics, and how we can use it to study the English Language;
  • key issues in Corpus Linguistics methodology;
  • how to use corpora to analyse issues in syntax and semantics;
  • basic elements of statistics;
  • how to navigate large and small corpora, particularly ICE-GB and DCPSE.

Learning outcomes

At the end of the course, participants will have:

  • acquired a basic but solid knowledge of the terminology, concepts and methodologies used in English Corpus Linguistics;
  • gained practical experience of working with two state-of-the-art corpora and a corpus exploration tool (ICECUP);
  • gained an understanding of the breadth of Corpus Linguistics and its potential application to research projects;
  • learned about the fundamental concepts of inferential statistics and their practical application to Corpus Linguistics.

For more information, including costs, booking information, timetable, see the website.

Is “grammatical diversity” a useful concept?


In a recent paper on the distribution of simple NPs (Aarts and Wallis, 2014), we found an interesting correlation between two independent variables across text genres in a corpus. For the purposes of this study, a “simple NP” was an NP consisting of a single-word head. What we found was a strong correlation between

  1. the probability that an NP consists of a single-word head, p(single head), and
  2. the probability that single-word heads were a personal pronoun, p(personal pronoun | single head).

Note that these two variables are independent because they do not compete, unlike, say, the probability that a single-word NP consists of a noun, vs. the probability that it is a pronoun. The scattergraph below illustrates the distribution and correlation clearly.
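The calculation itself is straightforward. The sketch below uses invented per-genre counts – not ICE-GB data – purely to show the mechanics: for each genre we compute p(single head) and p(personal pronoun | single head), then correlate the two proportions across genres.

```python
import math

# Hypothetical per-genre counts standing in for corpus frequencies:
# (all NPs, single-word-head NPs, personal-pronoun single-word NPs)
genres = {
    "conversation": (5000, 3900, 2800),
    "broadcast":    (4800, 3300, 2000),
    "press":        (5100, 2400,  900),
    "academic":     (5200, 2100,  700),
}

xs, ys = [], []
for nps, single, pronoun in genres.values():
    xs.append(single / nps)      # p(single head)
    ys.append(pronoun / single)  # p(personal pronoun | single head)

def pearson(xs, ys):
    """Pearson correlation coefficient over paired observations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(f"r = {pearson(xs, ys):.3f}")
```

Note that the second proportion is conditioned on the first variable’s numerator, so the two measures can rise or fall freely of one another – which is what makes an observed correlation between them informative rather than an artefact of shared terms.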

Scattergraph of text genres in ICE-GB; distributed (horizontally) by the proportion of all noun phrases consisting of a single word and (vertically) by the proportion of those single-word NPs that are personal pronouns; spoken and written, with selected outliers identified.

Continue reading

What might a corpus of parsed spoken data tell us about language?

Abstract • Paper (PDF)

This paper summarises a methodological perspective towards corpus linguistics that is both unifying and critical. It emphasises that the processes involved in annotating corpora and carrying out research with corpora are fundamentally cyclic, i.e. involving both bottom-up and top-down processes. Knowledge is necessarily partial and refutable.

This perspective unifies ‘corpus-driven’ and ‘theory-driven’ research as two aspects of a research cycle. We identify three distinct but linked cyclical processes: annotation, abstraction and analysis. These cycles exist at different levels and perform distinct tasks, but are linked together such that the output of one feeds the input of the next.

This subdivision of research activity into integrated cycles is particularly important in the case of working with spoken data. The act of transcription is itself an annotation, and decisions to structurally identify distinct sentences are best understood as integral with parsing. Spoken data should be preferred in linguistic research, but current corpora are dominated by large amounts of written text. We point out that this is not a necessary aspect of corpus linguistics and introduce two parsed corpora containing spoken transcriptions.

We identify three types of evidence that can be obtained from a corpus: factual, frequency and interaction evidence, representing distinct logical statements about data. Each may exist at any level of the 3A hierarchy. Moreover, enriching the annotation of a corpus allows evidence to be drawn based on those richer annotations. We demonstrate this by discussing the parsing of a corpus of spoken language data and two recent pieces of research that illustrate this perspective. Continue reading