POS tagging – a corpus-driven research success story?


One of the longest-running, and in many respects the least helpful, methodological debates in corpus linguistics concerns the spat between so-called corpus-driven and corpus-based linguists.

I say that this has been largely unhelpful because it has encouraged a dichotomy which is almost certainly false, and the focus on whether it is ‘right’ to work from corpus data upwards towards theory, or from theory downwards towards text, distracts from some serious methodological challenges we need to consider (see other posts on this blog).

Usually this discussion reviews the achievements of the best-known corpus-driven linguist, John Sinclair, in building the Collins Cobuild Corpus, and deriving the Collins Cobuild Dictionary (Sinclair et al. 1987) and Grammar (Sinclair et al. 1990) from it.

In this post I propose an alternative examination.

I want to suggest that the greatest success story for corpus-driven research is the development of part-of-speech taggers (usually called ‘POS-taggers’ or simply ‘taggers’) trained on corpus data.

These are industrial-strength, reliable algorithms that obtain good results with minimal assumptions about language.

So, who needs theory?

How part-of-speech taggers work

Taggers consist of two parts:

  • a ‘learning’ algorithm that collects rules from training data, and
  • a ‘tagging’ algorithm which applies rules to new texts to classify words by their part of speech (word class).

The corpus-driven aspect is the ‘learning’ algorithm.

A typical rule might be that if the word old (which can be a noun/nominal adjective, as in the old, or adjective, the old man) is followed by a noun, then old is more likely to be an adjective than otherwise.
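
To make this concrete, here is a minimal sketch of how such a rule could be read off tagged data. The five-sentence corpus, the tag names (DET, ADJ, N, V, PREP) and the function name tag_given_next are all invented for illustration; a real tagger would work from a full corpus and a standard tagset.

```python
from collections import Counter

# Invented toy corpus: each sentence is a list of (word, tag) pairs.
CORPUS = [
    [("the", "DET"), ("old", "ADJ"), ("man", "N")],
    [("an", "DET"), ("old", "ADJ"), ("house", "N")],
    [("the", "DET"), ("old", "N"), ("suffer", "V")],
    [("the", "DET"), ("old", "ADJ"), ("grey", "ADJ"), ("cat", "N")],
    [("care", "N"), ("for", "PREP"), ("the", "DET"), ("old", "N")],
]

def tag_given_next(corpus, word, next_is_noun):
    """Count the tags given to `word`, split by whether the next word is a noun."""
    counts = Counter()
    for sentence in corpus:
        for i, (w, tag) in enumerate(sentence):
            if w != word:
                continue
            # Tag of the following word, or None at the end of the sentence.
            following = sentence[i + 1][1] if i + 1 < len(sentence) else None
            if (following == "N") == next_is_noun:
                counts[tag] += 1
    return counts

before_noun = tag_given_next(CORPUS, "old", True)
elsewhere = tag_given_next(CORPUS, "old", False)
```

In this toy data, old is always an adjective when a noun follows, but is more often a noun otherwise. The ‘rule’ is nothing more than this difference in frequencies.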

The tagging algorithm takes a sentence and applies these rules like a crossword solver. It classifies the words that it is most certain of before considering those it is less confident about. Thus, in the old man, the is unambiguously a determiner, whereas both old and man can belong to more than one word class.
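
The ‘most certain first’ strategy can be sketched as follows. This is an illustration only, not the algorithm of any particular tagger: the lexicon counts, the transition scores, the fallback value of 0.01 and the function names are all assumptions, and words missing from the lexicon are not handled.

```python
# Hypothetical lexicon: word -> {tag: frequency}. Counts are invented.
LEXICON = {
    "the": {"DET": 100},
    "old": {"ADJ": 80, "N": 20},
    "man": {"N": 90, "V": 10},
}

# Hypothetical scores for (previous tag, next tag) pairs.
TRANSITIONS = {
    ("DET", "ADJ"): 0.4, ("DET", "N"): 0.6,
    ("ADJ", "N"): 0.9, ("ADJ", "V"): 0.1,
    ("N", "N"): 0.2, ("N", "V"): 0.8,
}

def lexical_probs(word):
    """Relative frequency of each tag for a word in the lexicon."""
    counts = LEXICON[word]
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}

def tag_sentence(words):
    """Tag the most certain words first, then use them as context for the rest."""
    probs = [lexical_probs(w) for w in words]
    tags = [None] * len(words)
    # Visit positions in order of decreasing confidence in the best tag.
    order = sorted(range(len(words)), key=lambda i: -max(probs[i].values()))
    for i in order:
        def score(tag):
            s = probs[i][tag]
            # Use a neighbour's tag only if that neighbour is already decided.
            if i > 0 and tags[i - 1] is not None:
                s *= TRANSITIONS.get((tags[i - 1], tag), 0.01)
            if i + 1 < len(words) and tags[i + 1] is not None:
                s *= TRANSITIONS.get((tag, tags[i + 1]), 0.01)
            return s
        tags[i] = max(probs[i], key=score)
    return list(zip(words, tags))

result = tag_sentence(["the", "old", "man"])
```

Here the is fixed first, then man (which is a noun 90% of the time in the invented lexicon), and only then old, which is resolved as an adjective because it sits between a determiner and a noun.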

The learning algorithm generates summary statistics bottom-up from the training data it is given: a large number of sentences or texts that have already been tagged with the same part-of-speech scheme (i.e., a corpus).

It is not necessary to make many assumptions about the grammar of the language we are working with to obtain results comparable to the best reported in the literature. The computer does not need to ‘know’ what a noun or a verb is. It can simply obtain statistics about these different categories from the corpus.

But these algorithms do embody some assumptions about their language input. These assumptions can be enumerated as follows, although different classification schemes might vary in some details:

  1. language consists of sentences divided into lexical words;
  2. each sentence is capable of being analysed separately;
  3. words include part-words such as genitive markers and cliticised words, and compounds, where multiple words can be given the same tag;
  4. there is a fixed set of word class tags by which each particular instance of a word can be categorised – these commonly consist of word class category (noun, verb, etc.), plus secondary information (plural proper noun, copular verb, etc.);
  5. these tags were correctly applied to the training data.

Databases extracted by the learning algorithm typically consist of frequency distributions for every word-tag pattern, i.e. the number of cases in the training corpus where a given lexical word has a particular tag; and transition probabilities for each word-tag pattern if words have more than one tag.
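
A minimal sketch of such a database might look like this. The toy corpus and the names learn and transition_probability are invented, and for brevity the sketch records transitions between adjacent tags rather than between full word-tag patterns, which is a simplification of what is described above.

```python
from collections import Counter, defaultdict

# Invented toy training corpus: one list of (word, tag) pairs per sentence.
TRAINING = [
    [("the", "DET"), ("old", "ADJ"), ("man", "N")],
    [("the", "DET"), ("old", "N"), ("man", "V"), ("the", "DET"), ("boats", "N")],
    [("the", "DET"), ("old", "ADJ"), ("boats", "N")],
]

def learn(corpus):
    """Collect word-tag frequencies and tag-to-tag transition counts."""
    word_tag = defaultdict(Counter)     # word -> Counter of its tags
    transitions = defaultdict(Counter)  # previous tag -> Counter of next tags
    for sentence in corpus:
        prev = "<s>"  # assumed marker for the start of a sentence
        for word, tag in sentence:
            word_tag[word.lower()][tag] += 1
            transitions[prev][tag] += 1
            prev = tag
    return word_tag, transitions

def transition_probability(transitions, prev, tag):
    """Estimate the probability of `tag` given the previous tag, from raw counts."""
    total = sum(transitions[prev].values())
    return transitions[prev][tag] / total if total else 0.0

word_tag, transitions = learn(TRAINING)
p_adj_after_det = transition_probability(transitions, "DET", "ADJ")
```

Note that the code never needs to be told what DET or ADJ mean. It only counts how often each label occurs, and in what company.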

The performance of these linguistically unsophisticated algorithms is striking. A typical tagger trained on a million words of English using a standard set of tags will make the correct decision for new sentences of a similar type some 95% of the time.
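
A figure like this is simply the proportion of words whose predicted tag matches a hand-checked ‘gold standard’ tag. A sketch of the calculation, with a hypothetical function name and invented data:

```python
def accuracy(predicted, gold):
    """Proportion of positions where the predicted tag equals the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("tag sequences must be the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

# Invented example: 20 words, one of which is tagged wrongly.
gold_tags = ["N"] * 20
predicted_tags = ["N"] * 19 + ["V"]
score = accuracy(predicted_tags, gold_tags)
```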

Different algorithms may vary in storage efficiency. My crude simulated annealing stochastic tagger (Wallis 2012), which stores transition probabilities exhaustively, is less space-efficient than Eric Brill’s patch tagger (Brill 1992). However, they obtain similar results.

The remaining 5% of cases, which are tagged incorrectly, tend to be idiomatic, or part of a multi-word string of ambiguous words, or the result of weaknesses in the training data.

To address these weaknesses we can make a number of improvements.

  • Store a finite set of idioms, strings or compounds. This is clumsy and ad hoc, and does not scale well, but it can improve performance.
  • Add modules to the database and algorithm. The Brill tagger employs some simple ad hoc regular morphology detection at an initial stage. A more thorough approach might consist of a morphological model of ‘lemmatisation’ (identifying word stems and affixes, e.g. re-educated → re– + educate + –ed). The advantage of this step is that even if we don’t have the word re-educated in our training set we can recognise educate as a verb and the entire word as a verb form or participial adjective. Generalisation allows us to pool statistics, so we can have more reliable rules, and compress information, so we don’t have to store separate statistics for every single word.
  • Create a more general type of rule. The rules we have described were tied to particular words, such as old. It would be more efficient if we had a rule that said something like ‘for any word capable of being either an adjective or a noun, if it is followed by an adjective or noun, then it is likely to be an adjective.’ Note that to create such a rule we have to look for it (this is precisely what the Brill tagger does).
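
The second and third improvements can be sketched in a few lines. Everything here is hypothetical: the stem lexicon, the prefix list, the handling of –ed and the function names are invented for illustration, and the general rule is written by hand in the spirit of the example above rather than learned from data.

```python
# Hypothetical stem lexicon and prefix list.
KNOWN_STEMS = {"educate": "V", "walk": "V"}
PREFIXES = ("re-", "un")

def split_word(word):
    """Split a word into (prefix, stem, suffix), handling only the suffix -ed."""
    prefix = ""
    for p in PREFIXES:
        if word.startswith(p):
            prefix, word = p, word[len(p):]
            break
    if word.endswith("ed"):
        # Try 'walked' -> 'walk' and 'educated' -> 'educate'.
        for stem in (word[:-2], word[:-1]):
            if stem in KNOWN_STEMS:
                return prefix, stem, "ed"
    return prefix, word, ""

def apply_general_rule(candidates):
    """candidates: one set of possible tags per word.

    General rule: any word that could be an adjective or a noun is tagged
    as an adjective if the next word could be an adjective or a noun.
    Words the rule cannot decide are left as None.
    """
    decided = []
    for i, tags in enumerate(candidates):
        following = candidates[i + 1] if i + 1 < len(candidates) else set()
        if tags == {"ADJ", "N"} and following & {"ADJ", "N"}:
            decided.append("ADJ")
        elif len(tags) == 1:
            decided.append(next(iter(tags)))
        else:
            decided.append(None)
    return decided

parts = split_word("re-educated")
tags = apply_general_rule([{"DET"}, {"ADJ", "N"}, {"N", "V"}])
```

The rule no longer mentions old at all, so it applies equally to any other adjective/noun ambiguity. But notice what we had to supply to write it: a list of affixes, a notion of ‘stem’, and a hypothesis about which word classes matter.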

Is our algorithm corpus-driven any more?

But now let us consider where this path has taken us. Every step we have proposed to improve the performance of this corpus-driven algorithm requires the insertion of knowledge about idioms, morphology and grammar, top-down, into the algorithm.

A methodological corpus-driven purism that stated that we must work exclusively bottom-up was a little disingenuous, because we had to employ auxiliary assumptions (1) to (5) above from the start.

But now every improvement we wish to make requires further theoretical assumptions. It turns out that it is not possible to perform part-of-speech tagging without assumptions, and to improve the algorithm we need more theory.

Finally, whereas the learning algorithm might work bottom-up, the tagging algorithm itself works top-down, in that it applies its knowledge base of word-tag probabilities to new corpus data.

In conclusion

I have the utmost respect for corpus-driven linguists. The discipline of examining data with minimal assumptions is absolutely crucial! All scientists have to examine the data as it is, not compartmentalise it according to pre-given assumptions.

Over the years I have written extensively on not taking queries for granted, and directed corpus researchers to continually review the underlying sentences from which their statistics are derived.

However, it is simply not possible to work without any assumptions, even when building a bottom-up computer algorithm like a part-of-speech tagger.

So I would conclude that corpus-driven research is properly located as part of a larger research cycle, in which it is valid and reasonable to work bottom-up and top-down at different times. Corpus-driven research methods are part of a family of exploratory methods from which all corpus linguists should draw. Insights from computationally-obtained summary statistics (whether from collocations, n-grams, phrase frames, indexes, or databases of part-of-speech taggers) are important resources for further research.

But insisting that the only legitimate corpus methods are bottom-up prevents us carrying out research with a corpus which asks questions that are inevitably framed by a particular theory.


Brill, E. 1992. A simple rule-based part of speech tagger. In Proceedings of the third conference on applied natural language processing (ANLC ’92). Association for Computational Linguistics, Stroudsburg, PA, USA, 152-155.

Sinclair, J., Hanks, P., Fox, G., Moon, R., Stock, P. and others (eds.) 1987. Collins Cobuild English Language Dictionary. London: Collins.

Sinclair, J., Fox, G., Bullon, S., Krishnamurthy, R., Manning, E., Todd, J. and others (eds.) 1990. Collins Cobuild English Grammar. London: Collins.

Wallis, S.A. 2012. Tagging ICE Philippines and other corpora. London: Survey of English Usage.

