ICAME talk on rebalancing corpora

I will be speaking on problems of corpus sampling and the evaluation of independent variable interaction at the 35th ICAME conference in Nottingham this week.

My slides are available here.

ICAME talk on linguistic interaction

I spoke on Capturing patterns of linguistic interaction in a parsed corpus at ICAME 34, Santiago de Compostela, Spain, on 25 May.

The talk presents my latest research in the linguistic interaction research thread (see Wallis 2012). My slides and handout are published below.

Resources

References

Wallis, S.A. 2012. Capturing patterns of linguistic interaction in a parsed corpus: an insight into the empirical evaluation of grammar? London: Survey of English Usage.

Capturing patterns of linguistic interaction

Abstract

Numerous competing grammatical frameworks exist on paper, as algorithms and embodied in parsed corpora. However, not only is there little agreement about grammars among linguists, but there is no agreed methodology for demonstrating the benefits of one grammar over another. Consequently, the status of parsed corpora or ‘treebanks’ is suspect.

The most common approach to empirically comparing frameworks is based on the reliable retrieval of individual linguistic events from an annotated corpus. However, this method risks circularity, permits redundant terms to be added as a ‘solution’ and fails to reflect the broader structural decisions embodied in the grammar. In this paper we introduce a new methodology based on the ability of a grammar to reliably capture patterns of linguistic interaction along grammatical axes. Retrieving such patterns of interaction does not rely on atomic retrieval alone, does not risk redundancy and is no more circular than a conventional scientific reliance on auxiliary assumptions. It is also a valid experimental perspective in its own right.

We demonstrate our approach with a series of natural experiments. We find an interaction captured by a phrase structure analysis between attributive adjective phrases under a noun phrase with a noun head, such that the probability of adding successive adjective phrases falls. We note that a similar interaction (between adjectives preceding a noun) can also be found with a simple part-of-speech analysis alone. On the other hand, preverbal adverb phrases do not exhibit this interaction, a result anticipated in the literature, confirming our method.

Turning to cases of embedded postmodifying clauses, we find a similar fall in the additive probability of both successive clauses modifying the same NP and embedding clauses where the NP head is the most recent one. Sequential postmodification of the same head reveals a fall and then a rise in this additive probability. Reviewing cases, we argue that this result can only be explained as a natural phenomenon acting on language production which is expressed by the distribution of cases on an embedding axis, and that this is in fact empirical evidence for a grammatical structure embodying a series of speaker choices.

We conclude with a discussion of the implications of this methodology for a series of applications, including optimising and evaluating grammars, modelling case interaction, contrasting the grammar of multiple languages and language periods, and investigating the impact of psycholinguistic constraints on language production.
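The 'additive probability' mentioned in the abstract can be illustrated with a minimal sketch. It assumes that F(x) is the number of cases with at least x successive items (for example, attributive adjective phrases before a noun head) and that the additive probability is p(x) = F(x) / F(x-1). The abstract does not give this definition, so it is an assumption. The function name and the counts are hypothetical and chosen only for illustration; they are not taken from any corpus.

```python
def additive_probabilities(at_least_counts):
    """Return [p(1), p(2), ...] where p(x) = F(x) / F(x-1).

    at_least_counts[x] is assumed to hold F(x), the number of cases
    containing at least x successive items (F(0) = all cases).
    """
    probs = []
    for previous, current in zip(at_least_counts, at_least_counts[1:]):
        if previous == 0:
            break  # no cases left to extend, so p(x) is undefined
        probs.append(current / previous)
    return probs


# Hypothetical frequencies (NOT real corpus data): F(0), F(1), F(2), F(3)
hypothetical_counts = [10000, 2000, 200, 10]

for x, p in enumerate(additive_probabilities(hypothetical_counts), start=1):
    print(f"p({x}) = {p:.3f}")
```

With these made-up counts each successive probability is lower than the last, which is the kind of pattern the abstract describes as a fall in additive probability. Whether a fall in real data is significant would require a statistical test, which this sketch does not attempt.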


Inferential statistics – and other animals

Introduction

Inferential statistics is a methodology of extrapolation from data. It rests on a mathematical model which allows us to predict values in the population based on observations in a sample drawn from that population.

Central to this methodology is the idea of reporting not just the observation itself but also the certainty of that observation. In some cases we can observe the population directly and make statements about it.

  • We can cite the 10 most frequent words in Shakespeare’s First Folio with complete certainty (allowing for spelling variations). Such statements would simply be facts.
  • Similarly, we could take a corpus like ICE-GB and report that in it, there are 14,275 adverbs ending in -ly out of 1,061,263 words.

Provided that we limit the scope of our remarks to the corpus itself, we do not need to worry about degrees of certainty because these statements are simply facts. Statements about the corpus are sometimes called descriptive statistics (the word statistic here being used in its most general sense, i.e. a number).
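The contrast between a descriptive statement and an inferential one can be sketched with the ICE-GB figures above. The proportion itself is a plain fact about the corpus. The interval is an illustration only: it assumes, as an idealisation, that each word is an independent random draw from a wider population, and it uses the Wilson score interval, which is one common choice and is not named in the text above. The function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a proportion (z = 1.959964 for ~95%)."""
    p = successes / n
    denominator = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denominator
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denominator
    return centre - half_width, centre + half_width


# Figures quoted above for ICE-GB
ly_adverbs = 14275
total_words = 1061263

# Descriptive statistic: a fact about the corpus itself
proportion = ly_adverbs / total_words
print(f"Proportion of -ly adverbs in the corpus: {proportion:.6f}")

# Inferential statement: only meaningful under the sampling assumption above
lower, upper = wilson_interval(ly_adverbs, total_words)
print(f"Approximate 95% interval for the population: ({lower:.6f}, {upper:.6f})")
```

The first number needs no statement of certainty as long as we speak only about ICE-GB. The interval is needed only once we extrapolate beyond the corpus, and it is only as trustworthy as the sampling assumption behind it.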