Three kinds of corpus evidence – and two types of constraint

Text corpora offer researchers the potential to find linguistic evidence of three distinct kinds. We term these types of evidence frequency evidence, factual evidence and interaction evidence.

1. Frequency evidence of known terms (‘performance’)

Suppose you have a plain-text corpus which you attempt to annotate automatically, by applying a computer program to the text. This program, which might be a part-of-speech tagger, a semantic tagger, or some other kind of recognition-and-markup algorithm, can be thought of as comprising three elements: a theoretical framework or scheme, an algorithm, and a knowledge base (KB). Terms and constituents in the scheme are applied to the corpus according to the algorithm.
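To make these three elements concrete, here is a minimal sketch in Python. The tagset, lexicon and disambiguation rule are all invented for illustration (they belong to no real tagger): the scheme is the tagset, the knowledge base is a lexicon mapping word forms to candidate tags, and the algorithm decides which candidate to apply.

    # A minimal illustrative tagger: scheme + knowledge base + algorithm.
    # Tagset, lexicon and disambiguation rule are invented for illustration.

    SCHEME = {"MD", "NN", "VB", "DT", "JJ"}           # the annotation scheme (tagset)

    KNOWLEDGE_BASE = {                                # lexicon: word -> candidate tags
        "may":  ["MD", "NN"],
        "the":  ["DT"],
        "cat":  ["NN"],
        "purr": ["VB", "NN"],
    }

    def tag(tokens):
        """Algorithm: apply the scheme to the text using the knowledge base.
        Here 'disambiguation' simply takes the first candidate tag; a real
        tagger would use context (rules or probabilities)."""
        tagged = []
        for token in tokens:
            candidates = KNOWLEDGE_BASE.get(token.lower(), ["UNK"])  # unknown word
            tagged.append((token, candidates[0]))
        return tagged

    print(tag("The cat may purr".split()))
    # [('The', 'DT'), ('cat', 'NN'), ('may', 'MD'), ('purr', 'VB')]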

Once this process is complete, it should be a relatively simple matter to index annotated terms and obtain frequencies for each one (e.g., how many instances of may are classified as a modal verb, a noun, etc.). Ideally we obtain this frequency evidence in an organised way, so as to create a set or distribution of frequencies, for example across the modal verbs.
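As a sketch of that indexing step (the annotated word-tag pairs below are invented), one can simply tabulate the tags assigned to each word form:

    from collections import Counter

    # Invented annotated output: a list of (word, tag) pairs, as a tagger might produce.
    annotated = [("may", "MD"), ("may", "MD"), ("may", "NN"),
                 ("can", "MD"), ("cat", "NN"), ("may", "MD")]

    # Frequency evidence: how often is each word form assigned each tag?
    freq = {}
    for word, tag in annotated:
        freq.setdefault(word.lower(), Counter())[tag] += 1

    print(freq["may"])                                     # Counter({'MD': 3, 'NN': 1})
    print(freq["may"]["MD"] / sum(freq["may"].values()))   # 0.75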

The resulting frequency evidence tells you how the program performed against the real-world data in the corpus. However, if you stop at this point you do not know whether this evidence is accurate or complete.

2. Factual evidence of unknown terms (‘discovery’)

A process of annotation presents the opportunity for the discovery of novel linguistic events. All NLP algorithms have a particular (and inevitably less-than-perfect) performance. The system may misclassify some items, misanalyse constituents, or simply fail. Therefore we might note that

  1. first-pass frequency evidence is likely to be inaccurate (and potentially incomplete),
  2. errors may be due to inadequacies in the scheme, algorithm or knowledge base.

In practice we have two choices: amend the system (scheme, KB or algorithm) and/or correct the corpus manually. A law of diminishing returns applies, and a certain amount of manual editing is inevitably necessary if we wish to obtain high coverage.

As a side comment, part-of-speech annotation is relatively accurate, but full parsing is prone to error. As different systems employ different frameworks, accuracy rates vary, but one can anticipate around 95% accuracy for POS-tagging and at best 70% accuracy for parsing. In any case, some errors may be impossible to address without a deeper semantic analysis of the sentence than is feasible.

Some linguists (notably, Sinclair 1992) have criticised the practice of manual editing, arguing that all human effort should be directed to improving the system. However, this objection misses an important point:

In correcting a corpus, we improve the accuracy of frequency evidence of known terms (1) and obtain frequency evidence of previously unknown terms (2).

In other words, when algorithms fail because we have discovered phenomena that our scheme did not anticipate, we are compelled to attempt to integrate them into that scheme. The very first step is inevitably to annotate the particular instance in the corpus with a new descriptor (or a novel permutation of an existing descriptor).

This evidence can be used to improve the knowledge base of an annotation algorithm. Indeed, the task of annotating a corpus should be seen as cyclic (Wallis 2007) rather than simply top-down (running a program over data) or bottom-up (training a program on data). A crucial part of that cyclic process is the manual review, correction and editing of the corpus annotation itself.

However, I digress. Once it is complete (or near complete), an annotated corpus is a source of a third type of evidence.

3. Evidence of association and co-occurrence of terms (‘interaction’)

The third class of evidence is of a different order than frequency evidence (cf. Gries, 2009:11), and consists of identifying whether the presence of one element affects the probability of a second element being present in a given relationship. At the level of words and part-of-speech tags, this type of evidence has been used for many years in collocation studies and training probabilistic taggers. Similar evidence has been applied, with less success, in training probabilistic parsers.

However, algorithmic improvement of this kind is scratching the surface of what is possible with a corpus. Corpus linguistics permits us to evaluate the degree to which any potentially co-occurring pair of constituent elements coincide.

In order to study interaction we must inevitably focus on speaker or writer choice. After all, we are not merely interested in whether a term exists — this is evidence of the first two types — but in whether, having decided to express a thought, a speaker or writer chooses one word or construction over an alternative.
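As a minimal sketch of such interaction evidence (all counts below are invented): does a contextual condition affect the probability that a speaker chooses construction X rather than construction Y? A chi-squared test of the 2×2 contingency table is one standard way to evaluate the association.

    import numpy as np
    from scipy.stats import chi2_contingency

    # Invented counts: rows = contextual condition, columns = speaker's choice.
    #                  choice X   choice Y
    table = np.array([[120,        80],      # condition present
                      [ 60,       140]])     # condition absent

    chi2, p, dof, expected = chi2_contingency(table)
    print(f"chi-squared = {chi2:.2f}, p = {p:.4g}")

    # Probability of choosing X under each condition:
    p_x = table[:, 0] / table.sum(axis=1)
    print(p_x)   # [0.6, 0.3] - the choice appears sensitive to the condition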

Choices and constraints

The choice-based paradigm starts from the observation that as people form utterances and sentences, they make a myriad of interconnected decisions at a range of levels, including pragmatic, grammatical and lexical.

Consider two adjacent choices in an utterance. They might be said to affect each other in two distinct ways:

  1. Framing constraints: once one choice is made, the second becomes impossible or fully determined. Violating these constraints can even be said to be ‘ungrammatical’.
  2. Interaction evidence: although the second choice is still possible irrespective of the first, it turns out to be more or less likely because of the first decision.

An important area of corpus linguistics research starts with post-hoc experiments which are framed by absolute grammatical constraints (type 1) and which investigate how decisions influence each other (type 2).

Consider the serial additive probability of adding an attributive adjective before a noun head in a noun phrase (Wallis 2019), such as the young thin cat. The model assumes that the speaker first decides on the noun head (e.g., cat), and then chooses to add an adjective to the noun phrase in a repeating fashion (e.g., young, thin). At each step the speaker can choose whether to add an adjective or stop. The framing constraint is the permissible additive step (adding an attributive adjective). The interaction evidence obtained is a pattern of varying additive probability at each step.
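A small sketch of how these additive probabilities might be computed from counts of noun phrases by number of attributive adjectives (the counts below are invented, not taken from Wallis 2019):

    # Invented counts of noun phrases by number of attributive adjectives.
    # e.g. 'the cat' -> 0, 'the thin cat' -> 1, 'the young thin cat' -> 2, ...
    np_counts = {0: 15000, 1: 3000, 2: 450, 3: 40}

    # Additive probability at step i: given that the speaker has already produced
    # i adjectives, what is the probability of adding another one?
    at_least = lambda i: sum(c for n, c in np_counts.items() if n >= i)

    for i in range(max(np_counts)):
        p_add = at_least(i + 1) / at_least(i)
        print(f"p(add adjective | {i} so far) = {p_add:.3f}")
    # p(add adjective | 0 so far) = 0.189
    # p(add adjective | 1 so far) = 0.140
    # p(add adjective | 2 so far) = 0.082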

Moreover, the richer the annotation, the greater the number of research questions that can be readily explored. 

For instance, in a parsed corpus it is possible to study the interaction between decisions framed by grammatical constituents. Grammatical annotation allows researchers to reliably identify adjectives in the same NP as a noun head, because the NP brackets the query. It allows us to deal appropriately with adjective phrases, such as the young [very thin] cat, and other larger constituents such as the cat [that followed me home].
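As a rough illustration of why this bracketing matters (the bracket labels below are simplified and invented, not those of any particular parsing scheme), grammatical structure lets us count the attributive premodifier slots inside the NP, treating an adjective phrase like very thin as a single step:

    from nltk import Tree

    # Simplified, invented bracketing of 'the young very thin cat'.
    np_tree = Tree.fromstring(
        "(NP (DT the) (JJ young) (AJP (RB very) (JJ thin)) (NN cat))")

    # Count attributive premodifier 'slots' within the NP: a bare adjective and
    # an adjective phrase each count once, so 'very thin' is a single step.
    slots = [child for child in np_tree
             if isinstance(child, Tree) and child.label() in {"JJ", "AJP"}]
    print(len(slots))   # 2  -> 'young' and 'very thin'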

If a corpus is annotated further, e.g. morphologically, prosodically or pragmatically, then additional research questions become possible. One can study

  1. the impact of decisions encapsulated within each level;
  2. the impact of decisions between different levels, e.g. the impact of a grammatical choice on a prosodic one; or
  3. phenomena identified across multiple levels, e.g. what grammatical patterns are adopted by an NP with a particular pragmatic function?

Finally, since neighbouring decisions may interact, this can also cause a ‘side problem’ in sampling. The corollary of the observation that one decision may interact with another in the same text is that if we sample two interacting decisions, we flout a basic requirement of random sampling, namely to draw instances in our sample that are properly independent (Wallis 2012b).
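A toy simulation (all the numbers are invented) illustrates the problem: if instances drawn from the same text share a text-level tendency, the variance of the sample proportion exceeds the binomial variance that standard confidence interval formulae assume.

    import numpy as np

    rng = np.random.default_rng(1)
    texts, per_text, p_mean, spread, reps = 50, 10, 0.3, 0.2, 2000

    # Variance expected if all 500 instances were independent draws.
    binomial_var = p_mean * (1 - p_mean) / (texts * per_text)

    # Clustered sampling: decisions within a text share a text-level probability.
    props = []
    for _ in range(reps):
        p_text = np.clip(rng.normal(p_mean, spread, texts), 0, 1)
        hits = rng.binomial(per_text, p_text).sum()
        props.append(hits / (texts * per_text))

    print(f"binomial variance: {binomial_var:.5f}")
    print(f"observed variance: {np.var(props):.5f}")   # noticeably larger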

Conclusions

The exercise of richly annotating corpora, often with substantial human effort, is hugely valuable. Indeed, only by obtaining a complete annotation of a corpus can we identify previously unknown phenomena, gather accurate frequency evidence, and reliably detect interactions between terms.

Producing ever-larger ‘flat’ corpora benefits lexicology and the study of rare phenomena. However, to study language scientifically we need richer and more systematically annotated corpora. Thanks to the human effort involved, these corpora tend to be an order of magnitude smaller (around 1M words) than lexical mega-corpora (100M words plus), and researchers working with such corpora need a good understanding of optimal statistical methods to make the most of their data.

References

Gries, S. Th. 2009. Quantitative Corpus Linguistics with R. New York/London: Routledge.

Sinclair, J. 1992. The automatic analysis of corpora. In Svartvik, J. (ed.) Directions in Corpus Linguistics (Proceedings of Nobel Symposium 82). Berlin: Mouton de Gruyter. 379-397.

Wallis, S.A. 2007. Annotation, Retrieval and Experimentation. In Meurman-Solin, A. and Nurmi, A.A. (eds.) Annotating Variation and Change. Helsinki: Varieng, UoH. » ePublished

Wallis, S.A. 2012a. That vexed problem of choice. Survey of English Usage, UCL. » Post

Wallis, S.A. 2012b. Random sampling, corpora and case interaction. Survey of English Usage, UCL. » Post

Wallis, S.A. 2019. Investigating the additive probability of repeated language production decisions. International Journal of Corpus Linguistics 24:4, 490-521. » Post
