What might a corpus of parsed spoken data tell us about language?

Abstract | Paper (PDF)

This paper summarises a methodological perspective towards corpus linguistics that is both unifying and critical. It emphasises that the processes involved in annotating corpora and carrying out research with corpora are fundamentally cyclic, i.e. involving both bottom-up and top-down processes. Knowledge is necessarily partial and refutable.

This perspective unifies ‘corpus-driven’ and ‘theory-driven’ research as two aspects of a research cycle. We identify three distinct but linked cyclical processes: annotation, abstraction and analysis. These cycles exist at different levels and perform distinct tasks, but are linked together such that the output of one feeds the input of the next.

This subdivision of research activity into integrated cycles is particularly important in the case of working with spoken data. The act of transcription is itself an annotation, and decisions to structurally identify distinct sentences are best understood as integral with parsing. Spoken data should be preferred in linguistic research, but current corpora are dominated by large amounts of written text. We point out that this is not a necessary aspect of corpus linguistics and introduce two parsed corpora containing spoken transcriptions.

We identify three types of evidence that can be obtained from a corpus: factual, frequency and interaction evidence, representing distinct logical statements about data. Each may exist at any level of the 3A hierarchy. Moreover, enriching the annotation of a corpus allows evidence to be drawn based on those richer annotations. We demonstrate this by discussing the parsing of a corpus of spoken language data and two recent pieces of research that illustrate this perspective.

Continue reading

ICAME talk on rebalancing corpora

I will be speaking on problems of corpus sampling and the evaluation of independent variable interaction at the 35th ICAME conference in Nottingham this week.

My slides are available here.

Coping with imperfect data

Introduction

One of the challenges for corpus linguists is that many of the distinctions that we wish to make are either not annotated in a corpus at all or, if they are represented in the annotation, the quality of the annotation may be imperfect. This frequently arises in corpora that have been annotated by an algorithm whose results have not been checked by linguists. Whatever the source of the annotation, we would always recommend that cases be reviewed for accuracy.

A version of this issue also arises when checking for the possibility of alternation, that is, that items of Type A can be replaced by Type B items, and vice versa. An example might be epistemic modal shall vs. will. Most corpora, including richly-annotated corpora such as ICE-GB and DCPSE, do not include modal semantics in their annotation scheme. In such cases the key point is not that the annotation is “imperfect”, but rather that our experiment relies on a presumption that the speaker has the choice of either type at any observed point (see Aarts et al. 2013).

Continue reading

Is language really “a set of alternations?”

The perspective that the study of linguistic data should be driven by studies of individual speaker choices has been the subject of attack from a number of linguists.

The first set of objections has come from researchers who have traditionally focused on linguistic variation expressed in terms of rates per word, or per million words.

No such thing as free variation?

As Smith and Leech (2013) put it: “it is commonplace in linguistics that there is no such thing as free variation” and that indeed multiple differing constraints apply to each term. On the basis of this observation they propose an ‘ecological’ approach, although in their paper this approach is not clearly defined.

Continue reading

Summer School in English Corpus Linguistics 2014

Thanks to everyone who came to our second Summer School in English Corpus Linguistics at University College London from Monday 7 to Wednesday 9 July 2014. We hope that it was enjoyable and challenging in equal measure. There were lectures, seminars and hands-on sessions.

As a service to those who were able to attend (and a few who could not), I have published the slides from my talk on ‘Simple statistics for corpus linguistics’ and a spreadsheet for demonstrating the binomial distribution below.

If you want to try to replicate the class experience in your own time, please note that at around the half-way point, each member of the class was asked to toss a coin ten times and report the results. We then input the number of students who threw 0 heads, 1 head, 2 heads, etc. into the spreadsheet.
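
If you do not have the spreadsheet to hand, the same comparison can be approximated in a few lines of Python. This is only a rough sketch, not a copy of the spreadsheet: it assumes a fair coin (p = 0.5), and the class size of 20 and the function names are made up for illustration. It prints how many students we would expect to report each number of heads if the results follow the binomial distribution.

```python
from math import comb


def binomial_pmf(r, n=10, p=0.5):
    """Probability of exactly r heads in n tosses when P(heads) = p."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)


def expected_counts(class_size, n=10, p=0.5):
    """Expected number of students reporting 0, 1, ..., n heads."""
    return [class_size * binomial_pmf(r, n, p) for r in range(n + 1)]


if __name__ == "__main__":
    class_size = 20  # hypothetical class size, not the real attendance
    for r, expected in enumerate(expected_counts(class_size)):
        print(f"{r:2d} heads: {expected:5.2f} students expected")
```

Comparing these expected counts with the tallies your own group reports gives a feel for how much a small sample can stray from the ideal distribution.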

Presentation

Resources

Further reading

(See also the menus along the top of the blog for more reading.)

Three-day Statistics for Linguistics workshop, University of Sussex

English Language & Linguistics at Sussex is pleased to offer a three-day workshop in Statistics for Linguistics, led by Professor Chris Butler.

  • 3–5 March 2014, University of Sussex

Continue reading

Binomial → Normal → Wilson

Introduction

One of the questions that keeps coming up with students is the following.

What does the Wilson score interval represent, and why is it the right way to calculate a confidence interval based around an observation? 

In this blog post I will attempt to explain, in a series of hopefully simple steps, how we get from the Binomial distribution to the Wilson score interval. I have written about this in a more ‘academic’ style elsewhere, but I haven’t spelled it out in a blog post.
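
For readers who want to see the end point before the derivation, here is a minimal sketch of the standard Wilson score interval formula (without continuity correction). The function name and defaults are my own choices for illustration: p is the observed proportion, n the number of cases, and z the critical value of the Normal distribution (about 1.96 for a 95% interval).

```python
from math import sqrt


def wilson_interval(p, n, z=1.959964):
    """Wilson score interval (lower, upper) for an observed proportion p.

    p: observed proportion, between 0 and 1
    n: number of cases the proportion is based on
    z: two-tailed critical value of the Normal distribution
       (1.959964 corresponds to a 95% interval)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denominator = 1 + z * z / n
    return (centre - spread) / denominator, (centre + spread) / denominator


if __name__ == "__main__":
    # e.g. 5 heads observed in 10 tosses
    lower, upper = wilson_interval(5 / 10, 10)
    print(f"95% Wilson interval: ({lower:.4f}, {upper:.4f})")
```

Note that the interval is not symmetric about p unless p = 0.5, and that it never strays outside the range from 0 to 1.
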
Continue reading