What might a corpus of parsed spoken data tell us about language?

Abstract

This paper summarises a methodological perspective on corpus linguistics that is both unifying and critical. It emphasises that the processes involved in annotating corpora and carrying out research with corpora are fundamentally cyclic, i.e. they involve both bottom-up and top-down processes. Knowledge is necessarily partial and refutable.

This perspective unifies ‘corpus-driven’ and ‘theory-driven’ research as two aspects of a research cycle. We identify three distinct but linked cyclical processes: annotation, abstraction and analysis. These cycles exist at different levels and perform distinct tasks, but are linked together such that the output of one feeds the input of the next.

This subdivision of research activity into integrated cycles is particularly important in the case of working with spoken data. The act of transcription is itself an annotation, and decisions to structurally identify distinct sentences are best understood as integral with parsing. Spoken data should be preferred in linguistic research, but current corpora are dominated by large amounts of written text. We point out that this is not a necessary aspect of corpus linguistics and introduce two parsed corpora containing spoken transcriptions.

We identify three types of evidence that can be obtained from a corpus: factual, frequency and interaction evidence, representing distinct logical statements about data. Each may exist at any level of the 3A hierarchy. Moreover, enriching the annotation of a corpus allows evidence to be drawn based on those richer annotations. We demonstrate this by discussing the parsing of a corpus of spoken language data and two recent pieces of research that illustrate this perspective.
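As a rough illustration of how the three kinds of evidence differ as logical statements, the sketch below reads them as an existence claim, a count, and a comparison of proportions across a second variable. This reading is an assumption made for illustration; the toy records and all names (`records`, `has_shall`, `proportion_shall` and so on) are hypothetical and are not data or code from the paper.

```python
# Hypothetical toy data: each record pairs a modal verb with the
# grammatical person of its subject. These values are invented for
# illustration and are not drawn from any real corpus.
records = [
    ("shall", "first"), ("will", "first"), ("shall", "first"),
    ("will", "third"), ("will", "third"), ("will", "second"),
    ("shall", "first"), ("will", "first"),
]

# Factual evidence: does the event occur at all?
has_shall = any(modal == "shall" for modal, _ in records)

# Frequency evidence: how many times does it occur?
shall_count = sum(1 for modal, _ in records if modal == "shall")


def proportion_shall(rows):
    """Proportion of rows whose modal is 'shall' (0.0 if there are no rows)."""
    rows = list(rows)
    if not rows:
        return 0.0
    return sum(1 for modal, _ in rows if modal == "shall") / len(rows)


# Interaction evidence: does the choice of modal vary with another variable?
p_first = proportion_shall(r for r in records if r[1] == "first")
p_other = proportion_shall(r for r in records if r[1] != "first")

print(has_shall, shall_count, p_first, p_other)
```

With so few invented records the difference between the two proportions means nothing in itself; a real study would need corpus data and a test of significance before claiming an interaction.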


The field of corpus linguistics has grown in popularity in recent years. Moreover, many researchers who would not otherwise consider themselves to be corpus linguists have begun to apply corpus linguistics methods to their linguistic problems, a growth that is partly attributable to an increasing availability of corpus data and tools. It therefore seems apposite to take stock, and question what kinds of research can be done with corpora and which types of corpora and methods might yield useful results.

This methodological “turn to corpora” does not have universal support. Some theoretical linguists, including Noam Chomsky, argue that, at best, any collection of language data merely provides researchers with examples of the actual external performance of human beings in a given context (see, e.g. Aarts 2001). On this view, corpora do not provide insight into internal language or its production processes. Such a position raises questions about what data, if any, might be used to evaluate ‘deep’ theories, as linguists’ personal intuitions are no more likely to pierce the veil of consciousness. Nevertheless, this contrary position raises a serious challenge to corpus researchers. We will return to the question of the potential relevance of corpus linguistics for the study of language production by reporting on some recent research in Section 6.

What do we mean by “a corpus”? In the most general sense, corpora are simply collections of language data that have been processed to make them accessible for research purposes. The largest current corpora contain primarily written texts, that is, texts produced by authors at keyboards and screens or on paper. These are types of language that are rarely spontaneously produced, frequently edited by others, and often included in databases due to their ease of availability. They may also be written with an imagined audience, in contrast to spoken utterances produced for a co-present (and interacting) audience. Although written data of this kind is easy to obtain, and therefore large corpora are readily compiled, this sampling methodology places significant limitations on the types of inference that might be safely drawn. The ability to test hypotheses against unmediated, spontaneously produced linguistic utterances seems paramount.

However, not all corpora are collected from written sources. In this paper, we are particularly interested in what corpora of spoken data, ideally in the form of recordings aligned with an orthographic transcription, might tell us about language. Transcriptions of this kind should record the actual lexical output, e.g. including false starts, examples of self-correction and overlapping speech, unedited by the speaker. In an uncued, unrehearsed context, this kind of speech data is arguably the closest to genuinely “spontaneous” naturalistic language output that is achievable. The lexical record can be aligned with an audio and video recording, and can contain meta-linguistic information, gestural signals, and so on.

Prioritising speech over writing in linguistics research has other justifications aside from mere spontaneity, which might otherwise be achieved by simply recording every keystroke. Speech predates writing historically, both generally and also in relation to literacy spread. Child development sees children express themselves through speech earlier than they write, and many writers are aware that their writing requires a more-or-less internal speech act. Our corpus data has approximately 2,000 words spoken by participants every quarter of an hour. By contrast, the author Stephen King (2002) recommends that authors try to write 1,000 words a day. Allowing for individual variation, and with the exception of isolated individuals or those unable physiologically to produce speech, it seems likely that human beings produce much more speech than writing.
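The comparison can be made concrete with a little arithmetic. The sketch below uses only the two figures quoted above. Note that the spoken figure is a combined rate for the participants in a conversation rather than a per-speaker rate, and that the variable names are illustrative.

```python
# Figures quoted in the text above.
SPOKEN_WORDS_PER_QUARTER_HOUR = 2000  # corpus rate, all participants combined
WRITTEN_WORDS_PER_DAY = 1000          # King's recommended daily writing target

# Scale the spoken rate up to an hour of conversation.
spoken_words_per_hour = SPOKEN_WORDS_PER_QUARTER_HOUR * 4

# Minutes of conversation needed to match one day's writing target.
minutes_to_match_daily_writing = (
    WRITTEN_WORDS_PER_DAY / SPOKEN_WORDS_PER_QUARTER_HOUR * 15
)

print(spoken_words_per_hour)           # 8000
print(minutes_to_match_daily_writing)  # 7.5
```

On these figures, seven and a half minutes of conversation yields as many words as a full day's writing at King's recommended rate.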

Axiomatically, different sampling frames obtain different kinds of corpora. Spoken data may be collected for a variety of purposes, some more representative and ‘natural’ than others, such as telephone calls or air traffic control data. Some spoken data might be captured in the laboratory: collected in controlled conditions, but unnatural, potentially psychologically stressed, and not particularly representative. So when we refer to “spoken corpora”, we are fundamentally concerned with naturally-occurring speech in ‘ecological’ contexts where speech output is spontaneous, uncued, and unrehearsed. An important sub-classification concerns whether the audience is present and participating, i.e. in a monologic or dialogic setting.

The fact that the ideal corpus may be collected away from the lab does not mean that its results should not be commensurable with laboratory data. On the contrary, corpus data can be a useful complement to lab experiments. The primary distinction between laboratory and corpus data is as follows. Corpus linguistics is characterised by the multiple reuse of existing data, and the ex post facto analysis of such data, rather than a controlled data collection exercise under laboratory conditions. Corpus linguistics is thus better understood as the methodology of linguistics framed as an observational science (like astronomy, evolutionary biology or geology) rather than an experimental one.

As a result of this perspective, corpora usually contain whole passages and texts, in order to be open to multiple levels of description and evaluation. Laboratory research, by contrast, collects fresh data for each research question, and may therefore record data efficiently, capturing only those components of the output determined a priori to be relevant.

However, the lines between the lab experiment and the corpus are becoming blurred. Where data must be encoded with a rich annotation (see Section 4) such as a detailed prosodic transcription, data reuse maximises the benefits of a costly research effort. Other sciences have also begun to take data reuse seriously. Medical science has seen computer-assisted meta-analysis, where data from multiple experiments are combined and reanalysed, become increasingly standard.

Given that we have a working definition of a spoken corpus as a database of transcribed spoken data, with or without original audio files, what can such a database tell us about language? Traditional discussions of corpus linguistics methodology have tended to focus on a dichotomy between top-down ‘corpus-based’ and bottom-up ‘corpus-driven’ research. We will argue that both positions are one-sided and are usefully subsumed into an exploratory cyclic approach to research.


  1. Introduction
  2. What can a corpus tell us?
  3. The 3A cycle
  4. What can a richly annotated corpus tell us?
  5. Sociolinguistic influences: modal shall/will over time
  6. Interacting grammatical decisions: NP premodification
  7. Conclusions


Wallis, S.A. 2014. What might a corpus of parsed spoken data tell us about language? In: L. Veselovská and M. Janebová (eds.) Complex Visibles Out There. Proceedings of the Olomouc Linguistics Colloquium 2014: Language Use and Linguistic Structure. Olomouc: Palacký University. 641-662.

References


Aarts, B. 2001. Corpus linguistics, Chomsky and Fuzzy Tree Fragments. In: C. Mair and M. Hundt (eds.) Corpus linguistics and linguistic theory. Amsterdam: Rodopi. 5-13.

King, S. 2002. On Writing. New York: Pocket Books.

