POS tagging – a corpus-driven research success story?


One of the longest-running, and in many respects the least helpful, methodological debates in corpus linguistics concerns the spat between so-called corpus-driven and corpus-based linguists.

I say that this has been largely unhelpful because it has encouraged a dichotomy which is almost certainly false, and the focus on whether it is ‘right’ to work from corpus data upwards towards theory, or from theory downwards towards text, distracts from some serious methodological challenges we need to consider (see other posts on this blog).

Usually this discussion reviews the achievements of the best-known corpus-driven linguist, John Sinclair, in building the Collins Cobuild Corpus and deriving the Collins Cobuild Dictionary (Sinclair et al. 1987) and Grammar (Sinclair et al. 1990) from it.

In this post I propose an alternative examination.

I want to suggest that the greatest success story for corpus-driven research is the development of part-of-speech taggers (usually called ‘POS-taggers’ or simply ‘taggers’) trained on corpus data.

These are industrial-strength, reliable algorithms that obtain good results with minimal assumptions about language.

So, who needs theory?

Why Chomsky was Wrong About Corpus Linguistics


When the entire premise of your methodology is publicly challenged by one of the pre-eminent figures in an overarching discipline, it seems wise to have a defence. Noam Chomsky’s famous objection to corpus linguistics therefore needs a serious response.

“One of the big insights of the scientific revolution, of modern science, at least since the seventeenth century… is that arrangement of data isn’t going to get you anywhere. You have to ask probing questions of nature. That’s what is called experimentation, and then you may get some answers that mean something. Otherwise you just get junk.” (Noam Chomsky, quoted in Aarts 2001).

Chomsky has consistently argued that the systematic ex post facto analysis of natural language sentence data is incapable of taking theoretical linguistics forward. In other words, corpus linguistics is a waste of time, because it is capable of focusing only on external phenomena of language – what Chomsky has at various times described as ‘e-language’.

Instead we should concentrate our efforts on developing new theoretical explanations for the internal language within the mind (‘i-language’). Over the years the terminology varied, but the argument has remained the same: real linguistics is the study of i-language, not e-language. Corpus linguistics studies e-language. Ergo, it is a waste of time.

Argument 1: in science, data requires theory

Chomsky refers to what he calls ‘the Galilean Style’ to make his case. This is the argument that it is necessary to engage in theoretical abstractions in order to analyse complex data. “[P]hysicists ‘give a higher degree of reality’ to the mathematical models of the universe that they construct than to ‘the ordinary world of sensation’” (Chomsky, 2002: 98). We need a theory in order to make sense of data, as so-called ‘unfiltered’ data is open to an infinite number of possible interpretations.

In the Aristotelian model of the universe the sun orbited the earth. The same data, reframed by the Copernican model, was explained by the rotation of the earth. However, the Copernican model of the universe was not arrived at by theoretical generalisation alone, but by a combination of theory and observation.

Chomsky’s first argument contains a kernel of truth. The following statement is taken for granted across all scientific disciplines: you need theory to analyse data. To put it another way, there is no such thing as an ‘assumption free’ science. But the second part of this argument, that the necessity of theory permits scientists to dispense with engagement with data (or even allows them to dismiss data wholesale), is not a characterisation of the scientific method that modern scientists would recognise. Indeed, Behme (2016) argues that it is also a mischaracterisation of Galileo’s method. Galileo’s particular fame, and his persecution, came from one source: the observations he made through his telescope.

UCL Summer School in English Corpus Linguistics 2016

I am pleased to announce the fourth annual Summer School in English Corpus Linguistics, to be held at University College London from 6 to 8 July.

The Summer School is a short three-day intensive course aimed at PhD-level students and researchers who wish to get to grips with Corpus Linguistics. Numbers are deliberately limited on a first-come, first-served basis. You will be taught in a small group by a teaching team.

Each day begins with a theory lecture, followed by a guided hands-on workshop with corpora, and a more self-directed and supported practical session in the afternoon.

Aims and objectives of the course

Over the three days, participants will learn about the following:

  • the scope of Corpus Linguistics, and how we can use it to study the English Language;
  • key issues in Corpus Linguistics methodology;
  • how to use corpora to analyse issues in syntax and semantics;
  • basic elements of statistics;
  • how to navigate large and small corpora, particularly ICE-GB and DCPSE.

Learning outcomes

At the end of the course, participants will have:

  • acquired a basic but solid knowledge of the terminology, concepts and methodologies used in English Corpus Linguistics;
  • had practical experience working with two state-of-the-art corpora and a corpus exploration tool (ICECUP);
  • gained an understanding of the breadth of Corpus Linguistics and its potential application to projects;
  • learned about the fundamental concepts of inferential statistics and their practical application to Corpus Linguistics.

For more information, including costs, booking information, timetable, see the website.


The variance of Binomial distributions


Recently I’ve been working on a problem that besets researchers in corpus linguistics who work with samples which are not drawn randomly from the population but rather are taken from a series of sub-samples. These sub-samples (in our case, texts) may be randomly drawn, but we cannot say the same for any two cases drawn from the same sub-sample. It stands to reason that two cases taken from the same sub-sample are more likely to share a characteristic under study than two cases drawn entirely at random. I introduce the paper elsewhere on my blog.

In this post I want to focus on an interesting and non-trivial result I needed to address along the way. This concerns the concept of variance as it applies to a Binomial distribution.

Most students are familiar with the concept of variance as it applies to a Gaussian (Normal) distribution. A Normal distribution is a continuous symmetric ‘bell-curve’ distribution defined by two parameters, the mean and the standard deviation (the square root of the variance). The mean specifies the position of the centre of the distribution and the standard deviation specifies the width of the distribution.

Common statistical methods on Binomial variables, from χ² tests to line fitting, employ a further step. They approximate the Binomial distribution to the Normal distribution. They say, although we know this variable is Binomially distributed, let us assume the distribution is approximately Normal. The variance of the Binomial distribution becomes the variance of the equivalent Normal distribution.

In this methodological tradition, the variance of the Binomial distribution loses its meaning with respect to the Binomial distribution itself. It seems to be only valuable insofar as it allows us to parameterise the equivalent Normal distribution.

What I want to argue is that the concept of the variance of a Binomial distribution is important in its own right, and we need to understand it with respect to the Binomial distribution, not the Normal distribution. Sometimes it is not necessary to approximate the Binomial to the Normal, and if we can avoid this approximation our results are likely to be stronger.
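To make the comparison concrete, here is a minimal sketch in Python. It is an illustration rather than anything taken from the paper, and the values n = 10 and p = 0.1 are arbitrary choices. It computes the mean and variance directly from the exact Binomial probabilities, checks them against the textbook formulas np and np(1 − p), and then prints the exact probabilities next to the density of the Normal distribution with the same mean and variance.

```python
import math

def binomial_pmf(k, n, p):
    """Exact Binomial probability of observing k successes in n trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def binomial_moments(n, p):
    """Mean and variance computed directly from the Binomial probabilities."""
    pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
    mean = sum(k * q for k, q in enumerate(pmf))
    variance = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))
    return mean, variance

def normal_pdf(x, mean, variance):
    """Density of the Normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

# Arbitrary illustrative values: a small sample and a skewed probability.
n, p = 10, 0.1
mean, variance = binomial_moments(n, p)
print("mean:", mean, "expected:", n * p)
print("variance:", variance, "expected:", n * p * (1 - p))

# Exact Binomial probability versus the Normal approximation at each k.
for k in range(n + 1):
    print(k, round(binomial_pmf(k, n, p), 4), round(normal_pdf(k, mean, variance), 4))
```

With a small n and a p far from 0.5, the Binomial is visibly skewed while the Normal curve is symmetric and extends below zero, which is one reason to treat the Binomial variance on its own terms.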


Adapting variance for random-text sampling

Introduction

Conventional stochastic methods based on the Binomial distribution rely on a standard model of random sampling whereby freely-varying instances of a phenomenon under study can be said to be drawn randomly and independently from an infinite population of instances.

These methods include confidence intervals and contingency tests (including multinomial tests), whether computed by Fisher’s exact method or variants of log-likelihood, χ², or the Wilson score interval (Wallis 2013). These methods are also at the core of others. The Normal approximation to the Binomial allows us to compute a notion of the variance of the distribution, and is to be found in line fitting and other generalisations.
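As a point of reference, the Wilson score interval mentioned above can be sketched in a few lines of Python. This uses the standard textbook formula and is not code from the paper; the function names and the default z = 1.959964 (a two-tailed 95% level) are my own assumptions. The simpler Normal (‘Wald’) interval is included for comparison.

```python
import math

def wilson_interval(p, n, z=1.959964):
    """Wilson score interval for an observed proportion p based on n cases.

    z is the critical value of the standard Normal distribution
    (the default, assumed here, corresponds to a 95% interval).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    z2 = z * z
    denom = 1 + z2 / n
    centre = (p + z2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom
    return centre - half_width, centre + half_width

def wald_interval(p, n, z=1.959964):
    """Normal approximation ('Wald') interval centred on p, for comparison."""
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

# Arbitrary example: 2 hits out of 20 cases.
print(wilson_interval(0.1, 20))
print(wald_interval(0.1, 20))
```

For a small sample and a skewed proportion such as this one, the Wald interval extends below zero, whereas the Wilson interval stays within the probability range.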

In many empirical disciplines, samples are rarely drawn “randomly” from the population in a literal sense. Medical research tends to sample available volunteers rather than names compulsorily called up from electoral or medical records. However, provided that researchers are aware that their random sample is limited by the sampling method, and draw conclusions accordingly, such limitations are generally considered acceptable. Obtaining consent is occasionally a problematic experimental bias; actually recruiting relevant individuals is a more common problem.

However, in a number of disciplines, including corpus linguistics, samples are not drawn randomly from a population of independent instances, but instead consist of randomly-obtained contiguous subsamples. In corpus linguistics, these subsamples are drawn from coherent passages or transcribed recordings, generically termed ‘texts’. In this sampling regime, whereas any pair of instances in independent subsamples satisfies the independent-sampling requirement, pairs of instances in the same subsample are likely to be co-dependent to some degree.

To take a corpus linguistics example, a pair of grammatical clauses in the same text passage is more likely to share characteristics than a pair of clauses in two entirely independent passages. Similarly, epidemiological research often involves “cluster-based sampling”, whereby each subsample cluster is drawn from a particular location, family nexus, etc. Again, it is more likely that neighbours or family members share a characteristic under study than random individuals.

If the random-sampling assumption is undermined, a number of questions arise.

  • Are statistical methods employing this random-sample assumption simply invalid on data of this type, or do they gracefully degrade?
  • Do we have to employ very different tests, as some researchers have suggested, or can existing tests be modified in some way?
  • Can we measure the degree to which instances drawn from the same subsample are interdependent? This would help us both determine the scale of the problem and arrive at a potential solution that takes this interdependence into account.
  • Would revised methods only affect the degree of certainty of an observed score (variance, confidence intervals, etc.), or might they also affect the best estimate of the observation itself (proportions or probability scores)?
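The scale of the problem can be illustrated with a toy simulation in Python. This is a sketch under loudly stated assumptions and is not the method developed in the paper: each ‘text’ is given its own probability, drawn uniformly from the range p ± spread, and every name and parameter value below is hypothetical. The function compares the variance of the observed corpus proportion over many simulated corpora with the variance predicted by the Binomial model, p(1 − p)/n.

```python
import random
import statistics

def sample_proportion(rng, n_texts, per_text, p, spread):
    """Observed proportion in one simulated corpus.

    Toy assumption: each text has its own probability, drawn uniformly
    from [p - spread, p + spread]. spread = 0 reproduces ordinary
    independent random sampling.
    """
    hits = 0
    for _ in range(n_texts):
        p_text = rng.uniform(p - spread, p + spread)
        hits += sum(rng.random() < p_text for _ in range(per_text))
    return hits / (n_texts * per_text)

def variance_ratio(spread, n_texts=20, per_text=20, p=0.5, runs=2000, seed=1):
    """Observed variance of the proportion divided by the Binomial prediction."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = statistics.pvariance(
        [sample_proportion(rng, n_texts, per_text, p, spread) for _ in range(runs)]
    )
    predicted = p * (1 - p) / (n_texts * per_text)
    return observed / predicted

print("independent cases:", round(variance_ratio(0.0), 2))
print("clustered by text:", round(variance_ratio(0.4), 2))
```

In this toy model the ratio stays close to 1 when cases are independent and rises well above 1 when cases in the same text share a probability, so an interval computed on the random-sampling assumption would be too narrow.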


Impossible logistic multinomials


Recently, a number of linguists have begun to question the wisdom of assuming that linguistic change tends to follow an ‘S-curve’ or, more properly, logistic pattern. For example, Nevalainen (2015) offers a series of empirical observations that show that whereas data sometimes follows a continuous ‘S’, frequently this does not happen. In this short article I try to explain why this result should not be surprising.

The fundamental assumption of logistic regression is that a probability representing a true fraction, or share, of a quantity undergoing a continuous process of change by default follows a logistic pattern. This is a reasonable assumption in certain limited circumstances because an ‘S-curve’ is mathematically analogous to a straight line (cf. Newton’s first law of motion).
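The sense in which an S-curve is analogous to a straight line can be shown with a short Python sketch. The slope k and midpoint x0 below are arbitrary illustrative values, not estimates from any data. Applying the logit transformation, log(p / (1 − p)), to a logistic curve returns the straight line k(x − x0).

```python
import math

def logistic(x, k=1.0, x0=0.0):
    """Logistic ('S-curve') function with slope k and midpoint x0."""
    return 1.0 / (1.0 + math.exp(-k * (x - x0)))

def logit(p):
    """Log odds of a probability p, the inverse of the logistic function."""
    return math.log(p / (1.0 - p))

# Hypothetical parameters chosen only for illustration.
k, x0 = 0.8, 0.5

for x in [-2, -1, 0, 1, 2]:
    p = logistic(x, k, x0)
    # On the logit scale, successive values differ by the constant k.
    print(x, round(p, 4), round(logit(p), 4))
```

Fitting a logistic curve to probabilities is therefore equivalent to fitting a straight line to their log odds, which only makes sense where the probability is free to range between 0 and 1.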

Regression is a set of computational methods that attempts to find the closest match between an observed set of data and a function, such as a straight line, a polynomial, a power curve or, in this case, an S-curve. We say that the logistic curve is the underlying model we expect data to be matched against (regressed to). In another post, I comment on the feasibility of employing Wilson score intervals in an efficient logistic regression algorithm.

We have already noted that change is assumed to be continuous, which implies that the input variable (x) is real and linear, such as time (and not e.g. probabilistic). In this post we discuss different outcome variable types. What are the ‘limited circumstances’ in which logistic regression is mathematically coherent?

  • We assume probabilities are free to vary from 0 to 1.
  • The envelope of variation must be constant, i.e. it must always be possible for an observed probability to reach 1.

Taken together this also means that probabilities are Binomial, not multinomial. Let us discuss what this implies.

UCL Summer School in English Corpus Linguistics 2015

Announcing the third annual Summer School in English Corpus Linguistics, to be held at University College London from 6 to 8 July.

The Summer School is a short three-day intensive course aimed at PhD-level students and researchers who wish to get to grips with Corpus Linguistics. Numbers are deliberately limited on a first-come, first-served basis. You will be taught in a small group by a teaching team.

Each day begins with a theory lecture, followed by a guided hands-on workshop with corpora, and a more self-directed and supported practical session in the afternoon.

Aims and objectives of the course

  • The Summer School is a primer in Corpus Linguistics for students of the English language. It is designed to be both accessible and inspiring!
  • Attendees are taught by world-class researchers at the Survey of English Usage, UCL.
  • Students are expected to have a basic knowledge of English linguistics and grammar.
  • It will take place in the English Department of University College London, in the heart of Central London.

For more information, including costs, booking information, timetable, see the website.
