Text corpora permit researchers to find evidence of three distinct kinds.
1. Frequency evidence of known terms (‘performance’)
Suppose you have a plain text corpus which you attempt to annotate automatically. You apply a computer program to the text. This program can be thought of as comprising three elements: a theoretical framework or ‘scheme’, an algorithm, and a knowledge-base (KB). Terms and constituents in this scheme are applied to the corpus according to the algorithm.
Having done so it should be a relatively simple matter to index those terms in the corpus and obtain frequencies for each one (e.g., how many instances of may are classed as a modal verb, noun, etc). The frequency evidence obtained tells you how the program performed against the real-world data in the corpus. However, if you stop at this point you do not know whether this evidence is accurate or complete.
2. Frequency evidence of unknown terms (‘discovery’)
The process of annotation presents the opportunity for discovery of novel linguistic events. All NLP algorithms have a particular, and inevitably less-than perfect, performance. The system may misclassify some items, misanalyse constituents, or simply fail. Therefore
- first-pass frequency evidence is likely to be inaccurate (and potentially incomplete),
- errors may be due to inadequacies in the scheme, algorithm or knowledge-base.
In practice we have two choices: amend the system (scheme, KB or algorithm) and/or correct the corpus manually. A law of diminishing returns applies, and a certain amount of manual editing is inevitably necessary. [As a side comment, part-of-speech annotation is relatively accurate, but full parsing is prone to error. As different systems employ different frameworks accuracy rates vary, but one can anticipate around 95% accuracy for POS-tagging and at best 70% accuracy for parsing. In any case, some errors may be impossible to address without a deeper semantic analysis of the sentence than is feasible.]