Why Chomsky was Wrong About Corpus Linguistics


When the entire premise of your methodology is publicly challenged by one of the most pre-eminent figures in an overarching discipline, it seems wise to have a defence. Noam Chomsky’s famous objection to corpus linguistics therefore needs a serious response.

“One of the big insights of the scientific revolution, of modern science, at least since the seventeenth century… is that arrangement of data isn’t going to get you anywhere. You have to ask probing questions of nature. That’s what is called experimentation, and then you may get some answers that mean something. Otherwise you just get junk.” (Noam Chomsky, quoted in Aarts 2001).

Chomsky has consistently argued that the systematic ex post facto analysis of natural language sentence data is incapable of taking theoretical linguistics forward. In other words, corpus linguistics is a waste of time, because it is capable of focusing only on external phenomena of language – what Chomsky has at various times described as ‘e-language’.

Instead we should concentrate our efforts on developing new theoretical explanations for the internal language within the mind (‘i-language’). Over the years the terminology varied, but the argument has remained the same: real linguistics is the study of i-language, not e-language. Corpus linguistics studies e-language. Ergo, it is a waste of time.

Argument 1: in science, data requires theory

Chomsky refers to what he calls ‘the Galilean Style’ to make his case. This is the argument that it is necessary to engage in theoretical abstractions in order to analyse complex data. “[P]hysicists ‘give a higher degree of reality’ to the mathematical models of the universe that they construct than to ‘the ordinary world of sensation’” (Chomsky, 2002: 98). We need a theory in order to make sense of data, as so-called ‘unfiltered’ data is open to an infinite number of possible interpretations.

In the Aristotelian model of the universe the sun orbited the earth. The same data, reframed by the Copernican model, was explained by the rotation of the earth. However, the Copernican model of the universe was not arrived at by theoretical generalisation alone, but by a combination of theory and observation.

Chomsky’s first argument contains a kernel of truth. The following statement is taken for granted across all scientific disciplines: you need theory to analyse data. To put it another way, there is no such thing as an ‘assumption free’ science. But the second part of this argument, that the necessity of theory permits scientists to dispense with engagement with data (or even allows them to dismiss data wholesale), is not a characterisation of the scientific method that modern scientists would recognise. Indeed, Beheme (2016) argues that this method is also a mischaracterisation of Galileo’s method. Galileo’s particular fame, and his persecution, came from one source: the observations he made through his telescope.

In astronomy it is necessary to build physical theories of the universe to make sense of observed data. Astronomical science must proceed by a process of theory building, attempting to account for observations within the theoretical framework. Moreover, rather than relying on naive Popperian refutation (abandoning a theory if one observation appears to contradict the theory), science tends to rely on triangulation (approaching the same theoretical generalisation from multiple sources and directions), and pluralism, i.e. the existence of competing theories such that if one fails another may replace it (Putnam 1974). Triangulation may also mean designing new experiments to test theoretical predictions as technology advances – such as viewing the earth from space, or placing atomic clocks on airliners to test special relativity.

Arguing for the necessity of theory is not an argument against corpus linguistics per se, but it is an argument of a particular type of corpus linguistics practice. The ‘Birmingham School’ of corpus linguistics, most associated with John Sinclair, has prided itself on making minimal theoretical assumptions and working bottom-up from words themselves. Some of the results of this approach are impressive. However,

  1. this type of corpus linguistics is not theory neutral or assumption free (e.g. we assume that w₁, w₂ are words, and a word is a linguistically meaningful unit);
  2. the process of validating theoretical generalisations entails a linguistic decision based on an external theory (e.g. there exists a distinct wordclass termed ‘adjective’);
  3. once theoretical generalisations are derived bottom-up (e.g. cases of w₁, w₂, etc are members of the set of adjectives), we arrive at a methodological paradox.

Sinclair’s methodological paradox is simply this: if it is true that statements of the kind ‘w₁ is an adjective’ are linguistically valuable, then it follows that when analysing new data, we should exploit this new knowledge. However, Sinclair’s method is to work inductively from new data without making such a priori assumptions. Either he has to dispense with his previous conclusions, and start from scratch, or he has to change his method.

In conclusion, the argument that you need theory to interpret data, because data has multiple possible interpretations, is correct. However this statement does not extend to permitting scientists to select data to fit their theory. Awkward and challenging results may not be ignored.

Moreover, if Chomsky’s argument were correct, no scientific field would ever arrive at a dominant scientific model. Every scientist could adopt different theoretical frameworks and premises because there was no agreed process for either refuting a theory or determining the outcome of competition between theories. Science has a pattern of both pluralistic competitive research and consensus-forming around ‘strong theories’. Chomsky’s characterisation of science may be a description of the fractious state of linguistics, but it departs from the scientific method.

I would suggest that it would be preferable to make linguistics more like science, rather than to make science more like linguistics.

Argument 2: translation is error-prone, so corpus data is epiphenomena

Chomsky’s second argument is that the process of translation from internal to external language is subject to error. Consequently, studying e-language is not a productive way to study i-language. We need to study i-language, therefore we should reject corpus data.

This argument has been more influential than the first.

It also appears to be a reasonable criticism of a certain kind of corpus linguistics. Corpus linguistics has tended to focus on word frequencies, which, in the absence of a theoretical interpretation as to why certain forms might be more frequent than others, simply becomes descriptive. Chomsky can reasonably summarise this as studying the epiphenomena of linguistics.

By contrast, theoretical linguists have tended to use an introspective method (backed up occasionally with second-party elicitation) on the grammatical acceptability of test sentences. This is a scholastic approach drawn from traditional prescriptive grammars. The method contains a significant subjective element, even when data is drawn from elicitation experiments with large numbers of test subjects. Direct introspection simply tells us that we believe a sentence to be ‘grammatical’.

Could this type of research question be posed with corpus data? No, but corpus linguists do not have to dispense with introspective insight. Corpus linguists are linguists too!

Moving from million-word to billion-word POS-tagged corpora has not generated greater insight, merely more robust results. However, this observation is properly a criticism of the research foci of much corpus linguistics as practised. (I would argue that this is a limitation of POS-tagged corpus research.) It is not an argument against corpus data.

However, there are two reasons why Chomsky’s second argument cannot hold. The first is what we might call the ‘linguists are not God’ reason.

Linguists do not have special access to i-language data. Their data is from introspection, elicitation or even corpora. But this data is also external language! If there were no systematic mapping between i-language and e-language within an individual, ‘i-linguistics’ would not be possible.

Chomsky and his followers could theorise about any number of internal models. But they could never choose between them except by appealing to some general abstract principle, such as Occam’s razor (simplicity). Linguistic data cannot penetrate the question because all linguistic data is in fact e-language data.

The best, most robust, carefully-obtained data from uncued experimental settings is still e-language. It may be collected in a more focused (and artificial) way than corpus data, but it is also no more ‘internal’ than corpus data. Introspection data elicited from experiments may elicit subjective grammatical expectations, but results are no more scientific than those from any other scientists’ introspection. Physicists do not despair of their equipment and resort to interviewing their peers! Perhaps linguists should follow their lead.

The second counter-argument is that the process of articulating i-language as e-language is a cognitive one, that is, it takes place through cognitive processes in the mind. According to Chomsky, this process exposes the pure i-language to the distorting prism of articulation, and thereby makes e-language unreliable data.

However, if this were true, the same objection would necessarily be true for the generation of i-language in the first place. If articulation of e-language is subject to error, the generation of i-language itself must also be error-prone.

Random variation, cultural bias, personal preference, processing interference, etc, can take place at either stage, because these phenomena are artefacts of actual neurological pathways. Different types of error may arise at different locations, but there is no special error-free part of the brain. Speakers under the influence of alcohol have confused thoughts and slur their words. Alcohol, like error, is not selective.

A number of corpus linguists, including Geoffrey Leech, have commented on the regular ‘grammaticality’ of even the most informal spontaneous speech data. This observation should not be surprising – if speech data did not follow grammatical rules, speakers would not understand each other, and, given the historical and ontological primacy of speech over writing, language could never develop!

There may be noise in the signal, but the signal is not exclusively noise. We should not give up on corpora just yet.

Corpus data and experimental data

Corpus data is simply uncued natural language data (sometimes termed ‘ecological’ data) as distinct from data obtained in an experimental setting. The key advantage of experimental data is that a researcher can manipulate variables under investigation and avoid variation in potentially confounding variables while obtaining data. A secondary advantage may be that one can construct a setting that provides a high frequency of sought-after phenomena that might otherwise be rare in a corpus. The disadvantages are the risk that the experimental conditions obtained are artificial (and possibly artificially cued), and the cost of obtaining and annotating data.

A corpus could contain experimental data, or data obtained by experiment could be annotated to the same level as a parsed corpus such as ICE-GB. These methods are not in competition but are complementary. A corpus can provide test data for experiments, identify potentially worth-while experiments, and provide a control for experimental outcomes.

Corpus linguistics offers three kinds of evidence to a theoretical linguist – factual evidence that phenomena exist, evidence of frequency and distribution, and ‘interaction evidence’ pertaining to the co-occurrence of phenomena (Wallis 2014).

There is no need to discount corpora as a lesser source, or one more likely to be tainted by error than other sources. It is a different source of evidence, one that requires due methodological care, but one that has the potential for both the evaluation of theory against real-world natural language and robust statistical evaluation.

What kind of corpus linguistics do we need?

If data can only be studied by first relating it to a theory, then theoretical linguists first need to pay attention to how corpora are annotated. Do corpora contain useful representations for linguistic research? Are phenomena of interest to linguists capable of being captured within the corpus?

‘Annotation’ is the process of systematically applying a theoretical description to all the texts in a corpus. A decision to annotate instances of a particular phenomenon entails significant effort. All such instances in the corpus must be identified, and each decision must be properly motivated. Like classification schemes in science (e.g the periodic table), linguistic phenomena are not simply identified, but related within a coherent annotation scheme. It follows that the entire scheme must be linguistically defended and systematically applied.

Syntacticians should pay particular attention to parsed corpora. It follows that if linguists are studying grammar then grammatically analysed corpora (‘parsed corpora’ or ‘treebanks’) are likely to be much more valuable than corpora with part-of-speech wordclass tags applied to each word. However, there is wide disagreement between theoretical linguists as to which grammatical scheme is optimal.

Inevitably the effort of annotation means that one has to choose a particular scheme at a particular point in time and systematically apply it. This poses a problem for researchers using the corpus. If they are stuck in a ‘hermeneutic trap’, only able to pose research questions within the annotation framework, and engage in circular reasoning, then corpus linguistics has a serious problem. After the huge effort of annotation you can only please a small number of linguists!

The solution to this problem offered by Wallis and Nelson (2001) is ‘abstraction’ – a process of reinterpretation of the annotated sentences from the representation in the corpus to the preferred representation of the linguist researcher, which takes place during the research process itself. Linguists do not have to accept the theoretical framework applied to a corpus in order to use it. Instead, the corpus representation is considered simply as a ‘handle on the data’, a method for systematically obtaining data across a corpus. It is not necessary to accept the framework uncritically.

In practice this means that researchers might find themselves constructing logical combinations of structural queries to retrieve a dataset aligned to their research theory and goals. But this is a small price to pay for having a grammatical framework already applied and evaluated against corpus data.

Finally abstraction is not an end goal but a means to obtaining an abstracted dataset expressed in terms commensurate with the theoretical demands of the researcher. It is this dataset that may then be subject to a third process, one we refer to as ‘analysis’, hence the ‘3A’ model of corpus linguistics, distinguishing the stages of annotation, abstraction and analysis.


Aarts, B. 2001. Corpus linguistics, Chomsky and Fuzzy Tree Fragments. In: C. Mair and M. Hundt (eds.) Corpus linguistics and linguistic theory. Amsterdam: Rodopi. 5-13.

Beheme, C. 2016. How Galilean is the ‘Galilean Method’? History and Philosophy of the Language Sciences, http://hiphilangsci.net/2016/04/02/how-galilean

Chomsky, N. 2002. On Nature and Language. Cambridge: Cambridge University Press.

Putnam, H. 1974. The ‘Corroboration’ of Scientific Theories, republished in Hacking, I. (ed.) (1981), Scientific Revolutions, Oxford Readings in Philosophy, Oxford: OUP. 60-79.

Wallis, S.A. 2014. What might a corpus of parsed spoken data tell us about language? In L. Veselovská and M. Janebová (eds.) Complex Visibles Out There. Olomouc: Palacký University, 2014. 641-662. » Post

Wallis, S.A. and Nelson G. 2001. Knowledge discovery in grammatically analysed corpora. Data Mining and Knowledge Discovery, 5: 307–340.


4 responses to “Why Chomsky was Wrong About Corpus Linguistics

  1. hi

    i was wondering about this “But the second part of this argument, that the necessity of theory permits scientists to dispense with engagement with data (or even allows them to dismiss inconvenient data wholesale), is not a characterisation of the scientific method that modern scientists would recognise.”

    where does Chomsky in the quote and in the longer interview dismiss data in general? Is he not just dismissing corpus data (even if one ignores another history of CL e.g. http://htl.linguist.univ-paris-diderot.fr/leon/leon_hs.pdf) whereas data from “experimentation” is in the quote?

  2. Thank you for this very informative article. I remember attending a UPenn conference in 2010 where Chomsky presented the same arguments against corpus linguistics using the “bee dancing” metaphor. Those of us who find corpus linguistics a useful field will not be discouraged.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s