ICAME talk on linguistic interaction

A quick announcement.

I will be speaking on ‘Capturing patterns of linguistic interaction in a parsed corpus’ at ICAME 34, Santiago de Compostela, Spain, on Saturday afternoon (25 May).

The talk presents my latest research in the linguistic interaction research thread (see Wallis 2012).

If you are unable to attend, or desire a sneak preview, my slides are published below.

Resources

References

Wallis, S.A. 2012. Capturing patterns of linguistic interaction in a parsed corpus: an insight into the empirical evaluation of grammar? London: Survey of English Usage.

Comparing frequencies within a discrete distribution

Introduction

In a recent study, my colleague Jill Bowie obtained a discrete frequency distribution by manually classifying cases in a small sample drawn from a large corpus.

Jill converted this distribution into a row of probabilities and calculated Wilson score intervals on each observation, to express the uncertainty associated with a small sample. She had one question, however:

How do we know whether the proportion of one quantity is significantly greater than another?

We could simply use a Newcombe-Wilson test (see Wallis forthcoming), but this assumes that samples are drawn from independent sources. In Jill’s example, data is drawn from the same sample, and all probabilities must sum to 1. We need to employ a stricter test.

Example

A discrete distribution looks something like this: O = {108, 65, 6, 2}. This is the frequency data for the middle column in the following chart. The probabilities {0.60, 0.36, 0.03, 0.01} sum to 1.
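As a rough illustration of the step described above (turning the frequencies into probabilities and putting a Wilson score interval on each), here is a minimal Python sketch. The function name `wilson_interval`, the choice of a 95% error level (z ≈ 1.96) and the omission of a continuity correction are my assumptions for the example, not details taken from Jill's study.

```python
import math

def wilson_interval(p, n, z=1.959964):
    """Wilson score interval (no continuity correction) for an observed
    proportion p out of n cases. z = 1.959964 gives a 95% interval
    (an assumption made for this sketch)."""
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - spread, centre + spread

# Frequencies from the example distribution O.
observed = [108, 65, 6, 2]
n = sum(observed)

# Convert the frequencies into a row of probabilities that sum to 1.
probs = [f / n for f in observed]

# One interval per observation, all based on the same sample size n.
intervals = [wilson_interval(p, n) for p in probs]

for f, p, (lower, upper) in zip(observed, probs, intervals):
    print(f"f = {f:3d}  p = {p:.4f}  interval = ({lower:.4f}, {upper:.4f})")
```

Note that every interval uses the same n, because all four observations come from a single sample. This is exactly why the intervals are not independent of one another.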

An example plot showing the changing proportions of meanings of the verb ‘think’ over time in the US TIME Magazine Corpus, with Wilson score intervals, after Levin (2013). Many thanks to Magnus for the data!

So how do we know if one proportion is significantly greater than another?

  • When comparing values diachronically (horizontally), data is drawn from independent samples. We can use the Newcombe-Wilson test, and employ the handy visual rule that if intervals do not overlap they must be significantly different.
  • However, probabilities drawn from the same sample (vertically) must sum to 1, which is not the case for independent samples! The number of degrees of freedom is k − 1, where k is the number of classes. It turns out that, to check this, the test we need to use is even more primitive than the 2 × 1 goodness of fit χ² test.
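For the independent-sample (horizontal) case in the first bullet, the Newcombe-Wilson test might be sketched as follows. This is only an illustration: the second sample (70 out of 160) is invented, since the chart data for the other time periods is not reproduced here, and the function names and the 95% error level are my assumptions. It does not apply to the same-sample (vertical) comparison in the second bullet.

```python
import math

def wilson_interval(p, n, z=1.959964):
    """Wilson score interval (no continuity correction), 95% by default."""
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - spread, centre + spread

def newcombe_wilson(f1, n1, f2, n2, z=1.959964):
    """Newcombe-Wilson interval for the difference p1 - p2 between two
    proportions drawn from INDEPENDENT samples."""
    p1, p2 = f1 / n1, f2 / n2
    l1, u1 = wilson_interval(p1, n1, z)
    l2, u2 = wilson_interval(p2, n2, z)
    d = p1 - p2
    lower = d - math.sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = d + math.sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return d, lower, upper

# 108 out of 181 is the first class in the example distribution.
# 70 out of 160 is a HYPOTHETICAL observation from another time period.
d, lower, upper = newcombe_wilson(108, 181, 70, 160)

# The difference is significant if the interval excludes zero.
significant = lower > 0 or upper < 0
print(f"d = {d:.4f}, interval = ({lower:.4f}, {upper:.4f}), significant = {significant}")
```

The visual rule follows from the same arithmetic: if the two Wilson intervals do not overlap, the difference interval cannot include zero. The reverse does not hold, so overlapping intervals still need the test.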

A methodological progression

(with thanks to Jill Bowie)

Introduction

One of the most controversial arguments in corpus linguistics concerns the relationship between a ‘variationist’ paradigm comparable with lab experiments, and a traditional corpus linguistics paradigm focusing on normalised word frequencies.

Rather than see these two approaches as diametrically opposed, we propose that it is more helpful to view them as representing different points on a methodological progression, and to recognise that we are often forced to compromise our ideal experimental practice according to the data and tools at our disposal.

Viewing these approaches as points along a progression allows us to step back from any single perspective and ask how different results can be reconciled and how research may be improved. It also allows us to consider the potential value of performing more computer-aided manual annotation (always an arduous task) and where such annotation effort would be most usefully focused.

The idea is sketched in the figure below.

A methodological progression: from normalised word frequencies to verified alternation.

Choice vs. use

Introduction

Many linguistic researchers are interested in semasiological variation, that is, how the meaning of words and expressions may be observed to vary over time or space. A word might have one dominant meaning or use at one point in time, which other meanings may later supplant. This is of obvious interest to etymology. How do new meanings come about? Why do others decline? Do old meanings die away or retain a specialist use?

Most of the research we have discussed on this blog is, by contrast, concerned with onomasiological variation, or variation in the choice of words or expressions to express the same meaning. In a linguistic choice experiment, the field of meaning is held to be constant, or approximately so, and we are concerned primarily with language production:

  • Given that a speaker (or writer, but we take speech as primary) wishes to express some thought, T, what is the probability that they will use expression E₁ out of the alternate forms {E₁, E₂,…} to express it?

This probability is meaningful in the language production process: it measures the actual use out of the options available to the speaker, at the point of utterance.

Conversely, semasiological researchers are concerned with a different type of probability:

  • Given that a speaker used an expression E, what is the probability that their meaning was T₁ out of the set of {T₁, T₂,…}?

For the hearer, this measure can also be thought of as the exposure rate: what proportion of times should a hearer (reader) interpret E as expressing T₁? This probability is meaningful to a language receiver, but it is not a meaningful statistic at the point of language production.

From the speaker’s point of view we can think of onomasiological variation as variation in choice, and semasiological variation as variation in relative proportion of use.
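The contrast between the two probabilities can be made concrete with a small Python sketch. The counts below are invented purely for illustration, and the names `choice_probability` and `use_probability` are labels I have made up. The point is only that the same cell of a table is divided by a different total in each case.

```python
# HYPOTHETICAL counts of (expression, meaning) pairs in a sample.
counts = {
    ("E1", "T1"): 60,
    ("E1", "T2"): 20,
    ("E2", "T1"): 40,
    ("E2", "T2"): 80,
}

def choice_probability(counts, expression, meaning):
    """Onomasiological: P(expression | meaning). Out of all the cases
    expressing this meaning, how often was this expression chosen?"""
    total = sum(f for (e, t), f in counts.items() if t == meaning)
    return counts.get((expression, meaning), 0) / total

def use_probability(counts, expression, meaning):
    """Semasiological: P(meaning | expression). Out of all the uses of
    this expression, how often did it carry this meaning?"""
    total = sum(f for (e, t), f in counts.items() if e == expression)
    return counts.get((expression, meaning), 0) / total

print("choice: P(E1 | T1) =", choice_probability(counts, "E1", "T1"))
print("use:    P(T1 | E1) =", use_probability(counts, "E1", "T1"))
```

With these made-up figures the two values differ even though they share the numerator of 60, because the baseline is the set of alternates for one and the set of uses for the other.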

Verb Phrase book published

Why this book?

The grammar of English is often thought to be stable over time. However, a new book edited by Bas Aarts, Joanne Close, Geoffrey Leech and Sean Wallis, The Verb Phrase in English: investigating recent language change with corpora (Cambridge University Press, 2013), presents a body of research from linguists showing that, using natural language corpora, one can find changes within a core element of grammar, the Verb Phrase, over a span of decades rather than centuries.

The book draws from papers first presented at a symposium on the verb phrase organised for the Survey of English Usage’s 50th anniversary and on research from the Changing English Verb Phrase project.

Capturing patterns of linguistic interaction

Abstract

Numerous competing grammatical frameworks exist on paper, as algorithms and embodied in parsed corpora. However, not only is there little agreement about grammars among linguists, but there is no agreed methodology for demonstrating the benefits of one grammar over another. Consequently the status of parsed corpora or ‘treebanks’ is suspect.

The most common approach to empirically comparing frameworks is based on the reliable retrieval of individual linguistic events from an annotated corpus. However, this method risks circularity, permits redundant terms to be added as a ‘solution’ and fails to reflect the broader structural decisions embodied in the grammar. In this paper we introduce a new methodology based on the ability of a grammar to reliably capture patterns of linguistic interaction along grammatical axes. Retrieving such patterns of interaction does not rely on atomic retrieval alone, does not risk redundancy and is no more circular than a conventional scientific reliance on auxiliary assumptions. It is also a valid experimental perspective in its own right.

We demonstrate our approach with a series of natural experiments. We find an interaction captured by a phrase structure analysis between attributive adjective phrases under a noun phrase with a noun head, such that the probability of adding successive adjective phrases falls. We note that a similar interaction (between adjectives preceding a noun) can also be found with a simple part-of-speech analysis alone. On the other hand, preverbal adverb phrases do not exhibit this interaction, a result anticipated in the literature, confirming our method.
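One way to read ‘the probability of adding successive adjective phrases’ is as a ratio of cumulative frequencies: out of all noun phrases with at least x − 1 attributive adjective phrases, what proportion have at least x? The Python sketch below works through that reading. Both the formula p(x) = F(x) / F(x − 1) and the counts are assumptions made for illustration, not figures or definitions quoted from the abstract.

```python
# HYPOTHETICAL counts of noun phrases containing exactly x attributive
# adjective phrases, indexed by x = 0, 1, 2, 3.
exactly = [8000, 1800, 190, 10]

# F(x): number of noun phrases with AT LEAST x adjective phrases.
at_least = [sum(exactly[x:]) for x in range(len(exactly))]

# Additive probability p(x) = F(x) / F(x - 1), for x = 1, 2, 3
# (assumed definition for this sketch).
additive = [at_least[x] / at_least[x - 1] for x in range(1, len(at_least))]

for x, p in enumerate(additive, start=1):
    print(f"p({x}) = {at_least[x]} / {at_least[x - 1]} = {p:.3f}")

# An interaction of the kind described would show up as p(x) changing
# with x; independent additions would leave it roughly constant.
falling = all(a > b for a, b in zip(additive, additive[1:]))
print("probability falls at each step:", falling)
```

In practice each p(x) is an observed proportion, so a fall would need to be checked with a significance test rather than read off the raw ratios as it is here.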

Turning to cases of embedded postmodifying clauses, we find a similar fall in the additive probability of both successive clauses modifying the same NP and embedding clauses where the NP head is the most recent one. Sequential postmodification of the same head reveals a fall and then a rise in this additive probability. Reviewing cases, we argue that this result can only be explained as a natural phenomenon acting on language production which is expressed by the distribution of cases on an embedding axis, and that this is in fact empirical evidence for a grammatical structure embodying a series of speaker choices.

We conclude with a discussion of the implications of this methodology for a series of applications, including optimising and evaluating grammars, modelling case interaction, contrasting the grammar of multiple languages and language periods, and investigating the impact of psycholinguistic constraints on language production.
