What might a corpus of parsed spoken data tell us about language?

AbstractPaper (PDF)

This paper summarises a methodological perspective towards corpus linguistics that is both unifying and critical. It emphasises that the processes involved in annotating corpora and carrying out research with corpora are fundamentally cyclic, i.e. involving both bottom-up and top-down processes. Knowledge is necessarily partial and refutable.

This perspective unifies ‘corpus-driven’ and ‘theory-driven’ research as two aspects of a research cycle. We identify three distinct but linked cyclical processes: annotation, abstraction and analysis. These cycles exist at different levels and perform distinct tasks, but are linked together such that the output of one feeds the input of the next.

This subdivision of research activity into integrated cycles is particularly important in the case of working with spoken data. The act of transcription is itself an annotation, and decisions to structurally identify distinct sentences are best understood as integral with parsing. Spoken data should be preferred in linguistic research, but current corpora are dominated by large amounts of written text. We point out that this is not a necessary aspect of corpus linguistics and introduce two parsed corpora containing spoken transcriptions.

We identify three types of evidence that can be obtained from a corpus: factual, frequency and interaction evidence, representing distinct logical statements about data. Each may exist at any level of the 3A hierarchy. Moreover, enriching the annotation of a corpus allows evidence to be drawn based on those richer annotations. We demonstrate this by discussing the parsing of a corpus of spoken language data and two recent pieces of research that illustrate this perspective. Continue reading

Coping with imperfect data


One of the challenges for corpus linguists is that many of the distinctions that we wish to make are either not annotated in a corpus at all or, if they are represented in the annotation, unreliably annotated. This issue frequently arises in corpora to which an algorithm has been applied, but where the results have not been checked by linguists, a situation which is unavoidable with mega-corpora. However, this is a general problem. We would always recommend that cases be reviewed for accuracy of annotation.

A version of this issue also arises when checking for the possibility of alternation, that is, to ensure that items of Type A can be replaced by Type B items, and vice-versa. An example might be epistemic modal shall vs. will. Most corpora, including richly-annotated corpora such as ICE-GB and DCPSE, do not include modal semantics in their annotation scheme. In such cases the issue is not that the annotation is “imperfect”, rather that our experiment relies on a presumption that the speaker has the choice of either type at any observed point (see Aarts et al. 2013), but that choice is conditioned by the semantic content of the utterance.

Continue reading

ICAME talk on linguistic interaction

I spoke on Capturing patterns of linguistic interaction in a parsed corpus at ICAME 34, Santiago de Compostela, Spain, on 25 May.

The talk presents my latest research in the linguistic interaction research thread (see Wallis 2012). My slides and handout are published below.



Wallis, S.A. 2012. Capturing patterns of linguistic interaction in a parsed corpus: an insight into the empirical evaluation of grammar? London: Survey of English Usage » Post

A methodological progression

(with thanks to Jill Bowie)


One of the most controversial arguments in corpus linguistics concerns the relationship between a ‘variationist’ paradigm comparable with lab experiments, and a traditional corpus linguistics paradigm focusing on normalised word frequencies.

Rather than see these two approaches as diametrically opposed, we propose that it is more helpful to view them as representing different points on a methodological progression, and to recognise that we are often forced to compromise our ideal experimental practice according to the data and tools at our disposal.

Viewing these approaches as being represented along a progression allows us to step back from any single perspective and ask ourselves how different results can be reconciled and research may be improved upon. It allows us to consider the potential value in performing more computer-aided manual annotation — always an arduous task — and where such annotation effort would be usefully focused.

The idea is sketched in the figure below.

A methodological progression

A methodological progression: from normalised word frequencies to verified alternation.

Continue reading

Capturing patterns of linguistic interaction

Abstract Full Paper (PDF)

Numerous competing grammatical frameworks exist on paper, as algorithms and embodied in parsed corpora. However, not only is there little agreement about grammars among linguists, but there is no agreed methodology for demonstrating the benefits of one grammar over another. Consequently the status of parsed corpora or ‘treebanks’ is suspect.

The most common approach to empirically comparing frameworks is based on the reliable retrieval of individual linguistic events from an annotated corpus. However this method risks circularity, permits redundant terms to be added as a ‘solution’ and fails to reflect the broader structural decisions embodied in the grammar. In this paper we introduce a new methodology based on the ability of a grammar to reliably capture patterns of linguistic interaction along grammatical axes. Retrieving such patterns of interaction does not rely on atomic retrieval alone, does not risk redundancy and is no more circular than a conventional scientific reliance on auxiliary assumptions. It is also a valid experimental perspective in its own right.

We demonstrate our approach with a series of natural experiments. We find an interaction captured by a phrase structure analysis between attributive adjective phrases under a noun phrase with a noun head, such that the probability of adding successive adjective phrases falls. We note that a similar interaction (between adjectives preceding a noun) can also be found with a simple part-of-speech analysis alone. On the other hand, preverbal adverb phrases do not exhibit this interaction, a result anticipated in the literature, confirming our method.

Turning to cases of embedded postmodifying clauses, we find a similar fall in the additive probability of both successive clauses modifying the same NP and embedding clauses where the NP head is the most recent one. Sequential postmodification of the same head reveals a fall and then a rise in this additive probability. Reviewing cases, we argue that this result can only be explained as a natural phenomenon acting on language production which is expressed by the distribution of cases on an embedding axis, and that this is in fact empirical evidence for a grammatical structure embodying a series of speaker choices.

We conclude with a discussion of the implications of this methodology for a series of applications, including optimising and evaluating grammars, modelling case interaction, contrasting the grammar of multiple languages and language periods, and investigating the impact of psycholinguistic constraints on language production.

Continue reading

Three kinds of corpus evidence – and two types of constraint

Text corpora permit researchers to find evidence of three distinct kinds.

1. Frequency evidence of known terms (‘performance’)

Suppose you have a plain text corpus which you attempt to annotate automatically. You apply a computer program to the text. This program can be thought of as comprising three elements: a theoretical framework or ‘scheme’, an algorithm, and a knowledge-base (KB). Terms and constituents in this scheme are applied to the corpus according to the algorithm.

Having done so it should be a relatively simple matter to index those terms in the corpus and obtain frequencies for each one (e.g., how many instances of may are classed as a modal verb, noun, etc). The frequency evidence obtained tells you how the program performed against the real-world data in the corpus. However, if you stop at this point you do not know whether this evidence is accurate or complete.

2. Frequency evidence of unknown terms (‘discovery’)

The process of annotation presents the opportunity for discovery of novel linguistic events. All NLP algorithms have a particular, and inevitably less-than perfect, performance. The system may misclassify some items, misanalyse constituents, or simply fail. Therefore

  1. first-pass frequency evidence is likely to be inaccurate (and potentially incomplete),
  2. errors may be due to inadequacies in the scheme, algorithm or knowledge-base.

In practice we have two choices: amend the system (scheme, KB or algorithm) and/or correct the corpus manually. A law of diminishing returns applies, and a certain amount of manual editing is inevitably necessary. [As a side comment, part-of-speech annotation is relatively accurate, but full parsing is prone to error. As different systems employ different frameworks accuracy rates vary, but one can anticipate around 95% accuracy for POS-tagging and at best 70% accuracy for parsing. In any case, some errors may be impossible to address without a deeper semantic analysis of the sentence than is feasible.]

Continue reading

That vexed problem of choice

(with thanks to Jill Bowie and Bas Aarts)

AbstractPaper (PDF)

A key challenge in corpus linguistics concerns the difficulty of operationalising linguistic questions in terms of choices made by speakers or writers. Whereas lab researchers design an experiment around a choice, comparable corpus research implies the inference of counterfactual alternates. This non-trivial requirement leads many to rely on a per million word baseline, meaning that variation separately due to opportunity and choice cannot be distinguished.

We formalise definitions of mutual substitution and the true rate of alternation as useful idealisations, recognising they may not always hold. Analysing data from a new volume on the verb phrase, we demonstrate how a focus on choices available to speakers allows researchers to factor out the effect of changing opportunities to draw conclusions about choices.

We discuss research strategies where alternates may not be easily identified, including refining baselines by eliminating forms and surveying change against multiple baselines. Finally we address three objections that have been made to this framework, that alternates are not reliably identifiable, baselines are arbitrary, and differing ecological pressures apply to different terms. Throughout we motivate our responses by evidence from current research, demonstrating that whereas the problem of identifying choices may be ‘vexed’, it represents a highly fruitful paradigm for corpus linguistics.

Continue reading