In a recent paper focusing on distributions of simple NPs (Aarts and Wallis, 2014), we found an interesting correlation across text genres in a corpus between two independent variables. For the purposes of this study, a “simple NP” was an NP consisting of a single-word head. What we found was a strong correlation between
the probability that an NP consists of a single-word head, p(single head), and
the probability that a single-word head is a personal pronoun, p(personal pronoun | single head).
Note that these two variables are independent because they do not compete, unlike, say, the probability that a single-word NP consists of a noun, vs. the probability that it is a pronoun. The scattergraph below illustrates the distribution and correlation clearly.
Figure: Scattergraph of text genres in ICE-GB; distributed (horizontally) by the proportion of all noun phrases consisting of a single word and (vertically) by the proportion of those single-word NPs that are personal pronouns; spoken and written, with selected outliers identified.
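The two proportions plotted in the scattergraph can be computed per genre from simple counts. Below is a minimal Python sketch; the genre names and counts are purely illustrative placeholders, not figures from ICE-GB.

```python
# Hypothetical per-genre counts: (total NPs, single-word-head NPs,
# personal-pronoun heads). Illustrative numbers only, not ICE-GB data.
genres = {
    "face-to-face conversation": (1000, 800, 600),
    "academic writing": (1000, 450, 120),
}

for genre, (nps, single, pronoun) in genres.items():
    p_single = single / nps                 # p(single head)
    p_pron_given_single = pronoun / single  # p(personal pronoun | single head)
    print(f"{genre}: p(single head) = {p_single:.2f}, "
          f"p(pronoun | single head) = {p_pron_given_single:.2f}")
```

Plotting each genre as a point (p(single head), p(personal pronoun | single head)) reproduces the kind of scattergraph shown above; note the two variables do not compete for the same cases, so any correlation between them is an empirical finding, not an artefact.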
This paper summarises a methodological perspective towards corpus linguistics that is both unifying and critical. It emphasises that the processes involved in annotating corpora and carrying out research with corpora are fundamentally cyclic, i.e. involving both bottom-up and top-down processes. Knowledge is necessarily partial and refutable.
This perspective unifies ‘corpus-driven’ and ‘theory-driven’ research as two aspects of a research cycle. We identify three distinct but linked cyclical processes: annotation, abstraction and analysis. These cycles exist at different levels and perform distinct tasks, but are linked together such that the output of one feeds the input of the next.
This subdivision of research activity into integrated cycles is particularly important when working with spoken data. The act of transcription is itself an annotation, and decisions to structurally identify distinct sentences are best understood as integral to parsing. Spoken data should be preferred in linguistic research, but current corpora are dominated by written text. We point out that this is not a necessary aspect of corpus linguistics and introduce two parsed corpora containing spoken transcriptions.
We identify three types of evidence that can be obtained from a corpus: factual, frequency and interaction evidence, representing distinct logical statements about data. Each may exist at any level of the 3A hierarchy. Moreover, enriching the annotation of a corpus allows evidence to be drawn based on those richer annotations. We demonstrate this by discussing the parsing of a corpus of spoken language data and two recent pieces of research that illustrate this perspective.
One of the challenges for corpus linguists is that many of the distinctions we wish to make are either not annotated in a corpus at all or, where they are represented in the annotation, the annotation may be imperfect. This frequently arises in corpora that have been annotated by an algorithm whose output has not been checked by linguists. Whatever the source of the annotation, we would always recommend that cases be reviewed for accuracy.
A version of this issue also arises when checking for the possibility of alternation, that is, whether items of Type A can be replaced by items of Type B, and vice versa. An example might be epistemic modal shall vs. will. Most corpora, including richly-annotated corpora such as ICE-GB and DCPSE, do not include modal semantics in their annotation scheme. In such cases the key point is not that the annotation is “imperfect”, but rather that our experiment relies on the presumption that the speaker has the choice of either type at any observed point (see Aarts et al. 2013).
The perspective that the study of linguistic data should be driven by studies of individual speaker choices has been the subject of attack from a number of linguists.
The first set of objections has come from researchers who have traditionally focused on linguistic variation expressed in terms of rates per word, or per million words.
No such thing as free variation?
As Smith and Leech (2013) put it, “it is commonplace in linguistics that there is no such thing as free variation”: indeed, multiple differing constraints apply to each term. On the basis of this observation they propose an ‘ecological’ approach, although in their paper this approach is not clearly defined.
Thanks to everyone who came to our second Summer School in English Corpus Linguistics at University College London from Monday 7 to Wednesday 9 July 2014. We hope that it was enjoyable and challenging in equal measure. There were lectures, seminars and hands-on sessions.
As a service to those who were able to attend (and a few who could not), I have published the slides from my talk on ‘Simple statistics for corpus linguistics’ and a spreadsheet for demonstrating the binomial distribution below.
If you want to replicate the class experience in your own time, please note that at around the half-way point, each member of the class was asked to toss a coin ten times and report the results. We then entered the number of students who threw 0 heads, 1 head, 2 heads, etc. into the spreadsheet.
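The class demonstration can also be simulated in software. The following Python sketch assumes a fair coin and a class of 30 students (both figures are illustrative, not the actual class size); it tallies how many simulated students threw 0, 1, 2, … 10 heads and compares the tally with the binomial expectation.

```python
import random
from math import comb

N_STUDENTS = 30  # illustrative class size, not the actual attendance
N_TOSSES = 10
P_HEADS = 0.5    # a fair coin is assumed

random.seed(42)  # fixed seed so the simulation is repeatable

# Each simulated student tosses a coin ten times; count their heads.
heads_per_student = [
    sum(random.random() < P_HEADS for _ in range(N_TOSSES))
    for _ in range(N_STUDENTS)
]

# Tally: how many students threw 0 heads, 1 head, 2 heads, etc.
tally = [heads_per_student.count(k) for k in range(N_TOSSES + 1)]

# Binomial expectation for each outcome k: N * C(10, k) * p^k * (1-p)^(10-k).
expected = [
    N_STUDENTS * comb(N_TOSSES, k) * P_HEADS**k * (1 - P_HEADS) ** (N_TOSSES - k)
    for k in range(N_TOSSES + 1)
]

for k, (obs, exp) in enumerate(zip(tally, expected)):
    print(f"{k:2d} heads: observed {obs:2d}, expected {exp:5.2f}")
```

As in the classroom exercise, the observed tally clusters around 5 heads but scatters about the expected values; larger simulated classes approach the binomial curve more closely.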