Three kinds of corpus evidence – and two types of constraint

Text corpora permit researchers to find evidence of three distinct kinds.

1. Frequency evidence of known terms (‘performance’)

Suppose you have a plain text corpus which you attempt to annotate automatically. You apply a computer program to the text. This program can be thought of as comprising three elements: a theoretical framework or ‘scheme’, an algorithm, and a knowledge-base (KB). Terms and constituents in this scheme are applied to the corpus according to the algorithm.

Having done so, it should be a relatively simple matter to index those terms in the corpus and obtain frequencies for each one (e.g., how many instances of ‘may’ are classed as a modal verb, a noun, etc.). The frequency evidence obtained tells you how the program performed against the real-world data in the corpus. However, if you stop at this point, you do not know whether this evidence is accurate or complete.
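As a minimal sketch of this kind of frequency count, the Python fragment below tallies the tags assigned to one word form. The toy corpus, the tag labels and the function name tag_frequencies are all assumptions made for illustration; they do not come from any particular tagger or annotation scheme.

```python
from collections import Counter

# Hypothetical toy corpus of (word, tag) pairs. The tag labels are
# illustrative only and do not follow any particular tagset.
tagged_corpus = [
    ("you", "PRON"), ("may", "MODAL"), ("leave", "VERB"),
    ("in", "PREP"), ("May", "NOUN"),
    ("she", "PRON"), ("may", "MODAL"), ("return", "VERB"),
]

def tag_frequencies(corpus, word):
    """Count how often `word` (ignoring case) receives each tag."""
    target = word.lower()
    return Counter(tag for w, tag in corpus if w.lower() == target)

freqs = tag_frequencies(tagged_corpus, "may")
for tag, count in freqs.most_common():
    print(tag, count)
```

Counts like these report what the program did with the text, not whether each classification is correct.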

2. Frequency evidence of unknown terms (‘discovery’)

The process of annotation presents the opportunity for discovery of novel linguistic events. All NLP algorithms have a particular, and inevitably less-than-perfect, performance. The system may misclassify some items, misanalyse constituents, or simply fail. Therefore:

  1. first-pass frequency evidence is likely to be inaccurate (and potentially incomplete),
  2. errors may be due to inadequacies in the scheme, algorithm or knowledge-base.

In practice, we have two choices: amend the system (scheme, KB or algorithm) and/or correct the corpus manually. A law of diminishing returns applies, and a certain amount of manual editing is inevitably necessary. [As a side comment, part-of-speech annotation is relatively accurate, but full parsing is prone to error. As different systems employ different frameworks, accuracy rates vary, but one can anticipate around 95% accuracy for POS-tagging and at best 70% accuracy for parsing. In any case, some errors may be impossible to address without a deeper semantic analysis of the sentence than is feasible.]

Some linguists (notably Sinclair 1992) have criticised the practice of manual editing, arguing that all efforts should be applied to the system. However, this objection misses the point:

In correcting a corpus, we improve the accuracy of frequency evidence of known terms (1) and obtain frequency evidence of previously unknown terms (2).

This evidence can be used to improve the knowledge base of a parsing algorithm. Indeed, the task of annotating a corpus should be seen as cyclic (Wallis 2007) rather than simply top-down (running a program over data) or bottom-up (training the parser on data). A crucial part of that cyclic process is manual review, correction and editing of the corpus annotation itself. However, I digress. Once it is complete (or near complete) an annotated corpus is a source of a third type of evidence.

3. Evidence of association and co-occurrence of terms (‘interaction’)

The third class of evidence is of a different order than frequency evidence (cf. Gries, 2009:11), and consists of identifying whether the presence of one element affects the probability of a second element being present in a given relationship. At the level of words and part-of-speech tags, this type of evidence has been used for many years in collocation studies and training probabilistic taggers. Similar evidence has been applied, with less success, in training probabilistic parsers.

However, this only scratches the surface of what is possible with a corpus.

Corpus linguistics permits us to evaluate the degree to which any potentially co-occurring pair of constituent elements coincide. Note that in order to study interaction we must focus on choice. After all, we are not merely interested in whether a term exists – this is evidence of the first two types – but in whether, having decided to employ one term or structure, a speaker chooses to employ another. As speakers and writers form utterances and sentences they make a myriad of interconnected decisions at a range of levels, including pragmatic, grammatical and lexical.

These choices constrain each other in two distinct ways:

  1. they close off possibilities absolutely (such that violating these constraints can even be said to be ‘ungrammatical’), or
  2. they influence subsequent decisions, i.e. these decisions interact.

An important area of corpus linguistics research starts with post-hoc experiments which are framed by absolute grammatical constraints (type 1) and which investigate how decisions influence each other (type 2).

For example, given a particular construction, such as a noun phrase (NP), we might wish to investigate how a decision to add an attributive adjective (e.g. ‘young’) before a noun head (‘Rodney’) influences the decision to add a second (‘thin’).

Statistical methods allow us to measure this interaction (see Wallis 2012a). Consequently, the richer the annotation, the greater the number of research questions that can be readily explored.
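To make the adjective example concrete, here is a minimal sketch in Python. It compares the probability of adding a first attributive adjective to an NP with the probability of adding a second, given that a first is present. The counts are invented purely for illustration, and the Wilson score interval is used here as one reasonable way to express uncertainty in an observed proportion; it is not presented as the specific method of Wallis (2012a).

```python
import math

def wilson_interval(p, n, z=1.96):
    """Wilson score interval for an observed proportion p out of n cases."""
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# Hypothetical counts, invented purely for illustration.
n_np = 1000       # NPs with a noun head
n_one_adj = 200   # of those, NPs with at least one attributive adjective
n_two_adj = 20    # of those, NPs with at least two attributive adjectives

# Each probability is a proportion of the cases where the choice was available.
p1 = n_one_adj / n_np        # p(first adjective | NP)
p2 = n_two_adj / n_one_adj   # p(second adjective | first adjective)

lo1, hi1 = wilson_interval(p1, n_np)
lo2, hi2 = wilson_interval(p2, n_one_adj)

print(f"p(first)  = {p1:.3f}  interval ({lo1:.3f}, {hi1:.3f})")
print(f"p(second) = {p2:.3f}  interval ({lo2:.3f}, {hi2:.3f})")
```

If the two probabilities differ by more than their intervals allow, the first decision appears to influence the second. Note that the baseline for the second probability is the set of NPs that already contain one adjective, not all NPs: this is what focusing on choice means in practice.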

For instance, in a parsed corpus it is possible to study the interaction between decisions framed by grammatical constituents. Grammatical annotation allows researchers to reliably identify adjectives in the same NP as a noun head, because the NP brackets the query. To take a simple example, a POS-tagged corpus does not distinguish between the following:

  • an adjective that merely precedes a noun, as in ‘she was young Rodney said’, and
  • an attributive adjective in an NP, as in ‘she saw young Rodney’.
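The difference between the two kinds of query can be sketched in Python as follows. The tag labels, the bracketed analyses and the function names are assumptions made for illustration only; they do not reproduce any particular parsing scheme. A tag-sequence query matches both sentences, whereas a query bracketed by the NP matches only the second.

```python
# Hypothetical POS-tagged sentences: lists of (word, tag) pairs.
tagged_1 = [("she", "PRON"), ("was", "V"), ("young", "ADJ"),
            ("Rodney", "N"), ("said", "V")]
tagged_2 = [("she", "PRON"), ("saw", "V"), ("young", "ADJ"), ("Rodney", "N")]

# Hypothetical parses: a node is (label, child, ...); a leaf is (tag, word).
parse_1 = ("S",
           ("CL", ("NP", ("PRON", "she")), ("VP", ("V", "was")),
                  ("AJP", ("ADJ", "young"))),
           ("NP", ("N", "Rodney")),
           ("VP", ("V", "said")))
parse_2 = ("S",
           ("NP", ("PRON", "she")),
           ("VP", ("V", "saw")),
           ("NP", ("ADJ", "young"), ("N", "Rodney")))

def adj_noun_bigram(tagged):
    """Tag-sequence query: is any ADJ immediately followed by an N?"""
    return any(a[1] == "ADJ" and b[1] == "N"
               for a, b in zip(tagged, tagged[1:]))

def is_leaf(node):
    return len(node) == 2 and isinstance(node[1], str)

def has_attributive_adj(node):
    """Tree query: does any NP directly contain an ADJ before an N?"""
    if is_leaf(node):
        return False
    label, children = node[0], node[1:]
    if label == "NP":
        tags = [child[0] for child in children if is_leaf(child)]
        if "ADJ" in tags and "N" in tags and tags.index("ADJ") < tags.index("N"):
            return True
    return any(has_attributive_adj(child) for child in children)

print(adj_noun_bigram(tagged_1), adj_noun_bigram(tagged_2))
print(has_attributive_adj(parse_1), has_attributive_adj(parse_2))
```

In the first sentence the adjective and the noun are adjacent but belong to different constituents, so only the tree query rejects it.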

If a corpus is annotated further, e.g. morphologically, prosodically or pragmatically, then additional research questions become possible. One can study

  1. the impact of decisions encapsulated within each level;
  2. the impact of decisions between different levels, e.g. a grammatical choice on a prosodic one; or
  3. phenomena identified across multiple levels, e.g. an NP with a particular pragmatic function.

The fact that decisions interact also causes a ‘side problem’ in sampling. The corollary of the observation that one decision may interact with another is that if these decisions are themselves the cases under study (instances of the dependent variable), then these cases are not independent (Wallis 2012b).

In conclusion, the exercise of richly annotating corpora, often at the cost of substantial human effort, is far from wasted. On the contrary, only by obtaining a complete annotation of a corpus can we identify previously unknown phenomena and gather accurate frequency evidence, both in recognising individual cases and in detecting their interaction.

Producing ever-larger ‘flat’ corpora benefits lexicology and the study of rare phenomena. However, to study language scientifically we need richer and more systematically annotated corpora. Because of the human effort involved, these corpora tend to be orders of magnitude smaller (around 1M words) than lexical mega-corpora (100M words), and researchers working with such corpora need a good understanding of optimal statistical methods to make the most of their data.

References


Gries, S. Th. 2009. Quantitative Corpus Linguistics with R. New York/London: Routledge.

Sinclair, John. 1992. The automatic analysis of corpora. Directions in Corpus Linguistics, ed. by Jan Svartvik, 379-397. (Proceedings of Nobel Symposium 82). Berlin: Mouton de Gruyter.

Wallis, S.A. 2007. Annotation, Retrieval and Experimentation. In Meurman-Solin, A. and Nurmi, A.A. (eds.) Annotating Variation and Change. Helsinki: Varieng, UoH.

Wallis, S.A. 2012a. That vexed problem of choice. Survey of English Usage, UCL.

Wallis, S.A. 2012b. Random sampling, corpora and case interaction. Survey of English Usage, UCL.

