Three kinds of corpus evidence – and two types of constraint

Text corpora offer researchers the potential to find linguistic evidence of three distinct kinds. We term these types of evidence frequency evidence, factual evidence and interaction evidence.

1. Frequency evidence of known terms (‘performance’)

Suppose you have a plain text corpus which you attempt to annotate automatically. You begin by applying a computer program to annotate the text. This program, which might be a part-of-speech tagger, a semantic tagger, or some other kind of recognition-and-markup algorithm, can be thought of as comprising three elements: a theoretical framework or scheme, an algorithm, and a knowledge base (KB). Terms and constituents in the scheme are applied to the corpus according to the algorithm.
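
For concreteness, here is a minimal sketch of such a first annotation pass. It uses NLTK's off-the-shelf part-of-speech tagger purely as an illustration (the choice of library and the example sentence are assumptions; any recognition-and-markup tool would do). The tagset plays the role of the scheme, the tagger itself the algorithm, and its trained model the knowledge base.

    import nltk
    # Assumes the NLTK tokeniser and tagger models have been downloaded,
    # e.g. nltk.download('punkt') and nltk.download('averaged_perceptron_tagger').

    # Illustrative input; in practice this would be the plain text corpus.
    text = "The committee may review the proposal in May."

    tokens = nltk.word_tokenize(text)   # segment the plain text into tokens
    tagged = nltk.pos_tag(tokens)       # assign a part-of-speech tag to each token
    print(tagged)
    # e.g. [('The', 'DT'), ('committee', 'NN'), ('may', 'MD'), ...]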

Once this process is complete, it should be a relatively simple matter to index annotated terms and obtain frequencies for each one (e.g., how many instances of may are classified as a modal verb, a noun, etc.). Ideally we obtain this frequency evidence in an organised way, for example as a frequency distribution over modal verbs.
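
Continuing the sketch above, obtaining such frequency evidence amounts to indexing the annotated tokens. The helper functions below are hypothetical, and the corpus is assumed to be a flat list of (word, tag) pairs like the tagger output shown earlier.

    from collections import Counter

    # Assumed input format: a list of (word, tag) pairs.
    tagged_corpus = [("You", "PRP"), ("may", "MD"), ("go", "VB"),
                     ("in", "IN"), ("May", "NNP"), (".", ".")]

    def tag_distribution(tagged_corpus, word):
        """How often each tag is assigned to a given word form."""
        return Counter(tag for w, tag in tagged_corpus if w.lower() == word)

    def modal_frequencies(tagged_corpus):
        """Frequency of each word form tagged as a modal verb ('MD')."""
        return Counter(w.lower() for w, tag in tagged_corpus if tag == "MD")

    print(tag_distribution(tagged_corpus, "may"))  # Counter({'MD': 1, 'NNP': 1})
    print(modal_frequencies(tagged_corpus))        # Counter({'may': 1})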

The resulting frequency evidence tells you how the program performed against the real-world data in the corpus. However, if you stop at this point, you do not know whether this evidence is accurate or complete.

2. Factual evidence of unknown terms (‘discovery’)

A process of annotation presents the opportunity for the discovery of novel linguistic events. All NLP algorithms have a particular (and inevitably less-than-perfect) performance. The system may misclassify some items, misanalyse constituents, or simply fail. Therefore we might note that

  1. first-pass frequency evidence is likely to be inaccurate (and potentially incomplete),
  2. errors may be due to inadequacies in the scheme, algorithm or knowledge base.

In practice we have two choices: amend the system (scheme, KB or algorithm) and/or correct the corpus manually. A law of diminishing returns applies, and a certain amount of manual editing is inevitably necessary if we wish to obtain high coverage.

As a side comment, part-of-speech annotation is relatively accurate, but full parsing is prone to error. As different systems employ different frameworks, accuracy rates vary, but one can anticipate around 95% accuracy for POS-tagging and at best 70% accuracy for parsing. In any case, some errors may be impossible to address without a deeper semantic analysis of the sentence than is feasible.
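
One way to estimate such an accuracy rate is to compare the first-pass annotation with a manually corrected sample and count agreement token by token. A minimal sketch, assuming the two sequences share the same tokenisation (the four-token sample is invented for illustration):

    def tagging_accuracy(auto, gold):
        """Proportion of tokens whose automatic tag matches the corrected tag."""
        assert len(auto) == len(gold), "sequences must be token-aligned"
        matches = sum(1 for (_, a), (_, g) in zip(auto, gold) if a == g)
        return matches / len(gold)

    # Hypothetical sample: the tagger mistakes modal 'may' for a noun.
    auto = [("may", "NN"), ("I", "PRP"), ("help", "VB"), ("you", "PRP")]
    gold = [("may", "MD"), ("I", "PRP"), ("help", "VB"), ("you", "PRP")]
    print(tagging_accuracy(auto, gold))   # 0.75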
