Some bêtes noires

There are a number of common issues in corpus linguistics papers.

  1. authors very commonly cite frequencies normalised per million or per thousand words (i.e. a per-word baseline or a multiple of it),
  2. data are usually plotted without confidence intervals, so it is impossible to see at a glance whether an apparent change might be statistically significant, and
  3. significance tests are often employed without a clear statement of what the test is actually evaluating.

Experimental design

The first issue may be unique to corpus linguistics, deriving from its particular historical origins.

It concerns how the experimenter identifies counterfactual alternates and selects baselines. This is a question of experimental design.

In the beginning was the Word.

Linguists examining volumes of plain text data (later supported by computing and part-of-speech tagging) invariably concentrated on the idea of the word as the unit of language. Collocation and concordancing sat alongside lexicography as the principal tools of the trade. “Statistics” here primarily concerned probabilistic measures of association between neighbouring words in order to find common patterns. This activity is of course perfectly fine, and allowed researchers to make huge gains in our understanding of language.


Without labouring the point (which I do elsewhere on this blog), the corollary of the statement that language is grammatical is that if, instead of describing the distribution of words, n-grams, etc, we wish to investigate how language is produced, the word cannot be our primary focus.

I suggest that we need to pose research problems in terms of choice:

  • what choice did speakers/writers have at any given point in their language production process, and what factors might have influenced that choice?

This refocusing allows corpus linguistics results to be more easily ‘commensurable’ (theoretically compatible) with those in other fields, such as cognitive linguistics. Labov and others have been very influential in helping sociolinguists focus on this question of choice.

Focusing on choice is made easier with the development of corpora containing structured analysis such as (but not necessarily limited to) parsing. Indeed the question of choice is likely to be both syntactically and semantically constrained, and therefore the result of this type of research may be a renewed interest in deeper and richer annotation schemes.

Unfortunately it is not all plain sailing. Choice-based corpus research is frequently difficult because it implies that researchers can identify the alternate counterfactual pattern at any given decision point in a text. Unlike lab experiments, we cannot design the choice in (press button A or B); we must infer it from the text, and from our knowledge of syntax and semantics.

If we cannot reliably obtain all cases where the choice was available, we may still be able to improve baselines and eliminate unlikely alternates — an exercise described in A methodological progression.
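The contrast between a per-word baseline and a choice baseline can be sketched numerically. The figures below are invented for illustration, not real corpus counts: imagine counting one modal ('shall') first against every word in the corpus, then against only those points where the shall/will choice was actually available.

```python
# Illustrative comparison of two baselines for the same observed frequency.
# All counts below are hypothetical, chosen only to make the contrast visible.

corpus_size = 1_000_000        # total words in the (hypothetical) corpus
modal_shall = 60               # occurrences of 'shall'
modal_opportunities = 300      # points where 'shall' or 'will' could occur

# Per-million-words rate: implicitly treats every word as a potential site.
pmw = modal_shall / corpus_size * 1_000_000

# Choice-based rate: the proportion of genuine opportunities where 'shall' won.
p_choice = modal_shall / modal_opportunities

print(f"{pmw:.0f} per million words")            # 60 per million words
print(f"{p_choice:.2%} of shall/will choices")   # 20.00% of shall/will choices
```

The first number can fall simply because texts contain fewer modal contexts overall; the second tracks the choice itself, which is what a claim about speaker behaviour requires.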


The second and third problems are statistical questions, and are common among self-taught researchers of any stripe.

There is a close relationship between confidence intervals and significance tests. Essentially:

  • Confidence intervals tell you the range of values on either side of an observation within which you expect the true value (in the population) to lie, at a given level of confidence (e.g. 95%); equivalently, the chance of the true value falling outside the range is the error level (e.g. 0.05).
    • confidence intervals have one degree of freedom, that is, the true value may vary on either side of the observation, but along a single dimension only.
    • confidence intervals are typically two-tailed, that is, we are concerned with the probability of the true value falling above the interval or below it.
  • Confidence intervals can therefore be employed in distinct statistical tests:
    1. by comparing the interval with an assumed value in the population (‘goodness of fit’ test),
    2. by comparing two observations and their intervals (independence or ‘homogeneity’ test), or
    3. by comparing two observed differences and their intervals (separability test).
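As a concrete sketch of the first idea, here is a Wilson score interval on an observed proportion, computed from the standard formula. The counts are hypothetical (60 'shall' out of 300 shall/will choices); a goodness of fit decision then amounts to checking whether an assumed population value lies inside the interval.

```python
import math

def wilson_interval(p, n, z=1.959964):
    """Wilson score interval for an observed proportion p out of n trials.
    z = 1.96 corresponds to a two-tailed error level of 0.05 (95% confidence)."""
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    spread = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - spread, centre + spread

# Hypothetical observation: p = 60/300 = 0.2.
lo, hi = wilson_interval(0.2, 300)
# Goodness of fit at the 0.05 level: is an assumed population value P inside?
P = 0.25
fits = lo <= P <= hi
```

Note that the interval is asymmetric about the observation, as it should be for proportions near 0 or 1.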

Which test (1-3) applies depends on the experimental design. Recognising when to use a goodness of fit test rather than a test of independence is more important than picking the optimal formula for computing the result (χ², log-likelihood, z, Wilson, etc.). I’ve done the hard work of reviewing the literature and comparing methods. All you have to do is get to grips with your experiment!
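To make the independence (homogeneity) case concrete, here is the textbook 2×2 chi-square computed from raw counts. The counts are again hypothetical: 'shall' vs. 'will' in two subcorpora (say, two time periods); the question is whether the choice proportion differs between them.

```python
def chi_square_2x2(a, b, c, d):
    """2x2 chi-square test of independence from raw counts.
    Rows: two subcorpora; columns: choice A vs. choice B."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical counts: period 1 has 60 'shall' / 240 'will';
# period 2 has 40 'shall' / 260 'will'.
chisq = chi_square_2x2(60, 240, 40, 260)

# Critical value for 1 degree of freedom at the 0.05 error level.
significant = chisq > 3.841
```

Here χ² = 4.8 exceeds 3.841, so the difference between the two periods would be judged significant at the 0.05 level; with a different design (one observed proportion against an assumed population value) the goodness of fit version of the test applies instead.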

The relationship between confidence intervals and significance tests is explained from first principles in this blog entry and paper, which I would suggest are a good place to start.

There are also a number of common statistical citation errors (problems in the way statistical results are cited), but this is a question for another blog entry.

