How might parsing spoken data present greater challenges than parsing writing?

This is a very broad question, ultimately answered empirically by the performance of a particular parser.

However to predict performance, we might consider the types of structure that a parser is likely to find difficult and then examine a parsed corpus of speech and writing for key statistics.

Variables such as mean sentence length or main clause complexity are often cited as a proxy for parsing difficulty. However, sentence length and complexity are likely to be poor guides in this case. Spoken data is not split into sentences by the speaker, rather, utterance segmentation is a matter of transcriber/annotator choice. In order to improve performance, an annotator might simply increase the number of sentence subdivisions. Complexity ‘per sentence’ is similarly potentially misleading.

In the original London Lund Corpus (LLC), spoken data was split by speaker turns, and phonetic tone units were marked. In the case of speeches, speaker turns could be very long compound ‘run-on’ sentences. In practice, when texts were parsed, speaker turns might be split at coordinators or following a sentence adverbial.

In this discussion paper we will use the British Component of the International Corpus of English (ICE-GB, Nelson et al. 2002) as a test corpus of parsed speech and writing. It is worth noting that both components were parsed together by the same tools and research team.

A very clear difference between speech and writing in ICE-GB is to be found in the degree of self-correction. The mean rate of self-correction in ICE-GB spoken data is 3.5% of words (the rate for writing is 0.4%). The spoken genre with the lowest level of self-correction is broadcast news (0.7%). By contrast, student examination scripts have around 5% of words crossed out by writers, followed by social letters and student essays, which have around 0.8% of words marked for removal.

However, self-correction can be addressed at the annotation stage, by removing it from the input to the parser, parsing this simplified sentence, and reintegrating the output with the original corpus string. To identify issues of parsing complexity, therefore we need to consider the sentence minus any self-correction. Are there other factors that may make the input stream more difficult to parse than writing? Continue reading


The confidence of diversity


Occasionally it is useful to cite measures in papers other than simple probabilities or differences in probability. When we do, we should estimate confidence intervals on these measures. There are a number of ways of estimating intervals, including bootstrapping and simulation, but these are computationally heavy.

For many measures it is possible to derive intervals from the Wilson score interval by employing a little mathematics. Elsewhere in this blog I discuss how to manipulate the Wilson score interval for simple transformations of p, such as 1/p, 1 – p, etc.

Below I am going to explain how to derive an interval for grammatical diversity, d, which we can define as the probability that two randomly-selected instances have different outcome classes.

Diversity is an effect size measure of a frequency distribution, i.e. a vector of k frequencies. If all frequencies are the same, the data is evenly spread, and the score will tend to a maximum. If all frequencies except one are zero, the chance of picking two different instances will of course be zero. Diversity is well-behaved except where categories have frequencies of 1. Continue reading

Why is statistics difficult?

Imagine you are somewhere on a road that you have never been on before. Picture it. It’s peaceful and calm. A car comes down the road. As it gets to a corner, the driver appears to lose control, and the car crashes into a wall. Fortunately the lone driver is OK but they can’t recall exactly what happened.

Let’s think about what you experienced. The car crash might involve a number of variables an investigator would be interested in.

How fast was the car going? Where were the brakes applied?

Look on the road. Get out a tape measure. How long was the skid before the car finally stopped?

How big and heavy was the car? How loud was the bang when the car crashed?

These are all physical variables. We are used to thinking about the world in terms of these kinds of variables: velocity, position, length, volume and mass. They are tangible: we can see and touch them, and we have physical equipment that helps us measure them. Continue reading

Point tests and multi-point tests for separability of homogeneity


I have been recently reviewing and rewriting a paper for publication that I first wrote back in 2011. The paper (Wallis forthcoming) concerns the problem of how we test whether repeated runs of the same experiment obtain essentially the same results, i.e. results are not significantly different from each other.

These meta-tests can be used to test an experiment for replication: if you repeat an experiment and obtain significantly different results on the first repetition, then, with a 1% error level, you can say there is a 99% chance that the experiment is not replicable.

These tests have other applications. You might be wishing to compare your results with those of others in the literature, compare results with different operationalisation (definitions of variables), or just compare results obtained with different data – such as comparing a grammatical distribution observed in speech with that found within writing.

The design of tests for this purpose is addressed within the t-testing ANOVA community, where tests are applied to continuously-valued variables. The solution concerns a particular version of an ANOVA, called “the test for interaction in a factorial analysis of variance” (Sheskin 1997: 489).

However, anyone using data expressed as discrete alternatives (A, B, C etc) has a problem: the classical literature does not explain what you should do.

Gradient and point tests


Figure 1: Point tests (A) and gradient tests (B), from Wallis (forthcoming).

The rewrite of the paper caused me to distinguish between two types of tests: ‘point tests’, which I describe below, and ‘gradient tests’. Continue reading

Detecting direction in interaction evidence

IntroductionPaper (PDF)

I have previously argued (Wallis 2014) that interaction evidence is the most fruitful type of corpus linguistics evidence for grammatical research (and doubtless for many other areas of linguistics).

Frequency evidence, which we can write as p(x), the probability of x occurring, concerns itself simply with the overall distribution of a linguistic phenomenon x – such as whether informal written English has a higher proportion of interrogative clauses than formal written English. In order to calculate frequency evidence we must define x, i.e. decide how to identify interrogative clauses. We must also pick an appropriate baseline n for this evaluation, i.e. we need to decide whether to use words, clauses, or any other structure to identify locations where an interrogative clause may occur.

Interaction evidence is different. It is a statistical correlation between a decision that a writer or speaker makes at one part of a text, which we will label point A, and a decision at another part, point B. The idea is shown schematically in Figure 1. A and B are separate ‘decision points’ in a given relationship (e.g. lexical adjacency), which can be also considered as ‘variables’.

Figure 1: Associative inference from lexico-grammatical choice variable A to variable B (sketch).

Figure 1: Associative inference from lexico-grammatical choice variable A to variable B (sketch).

This class of evidence is used in a wide range of computational algorithms. These include collocation methods, part-of-speech taggers, and probabilistic parsers. Despite the promise of interaction evidence, the majority of corpus studies tend to consist of discussions of frequency differences and distributions.

In this paper I want to look at applications of interaction evidence which are made more-or-less at the same time by the same speaker/writer. In such circumstances we cannot be sure that just because B follows A in the text, the decision relating to B was made after the decision at A. Continue reading

UCL Summer School in English Corpus Linguistics 2017

I am pleased to announce the fifth annual Summer School in English Corpus Linguistics to be held at University College London from 5-7 July.

The Summer School is a short three-day intensive course aimed at PhD-level students and researchers who wish to get to grips with Corpus Linguistics. Numbers are deliberately limited on a first-come, first-served basis. You will be taught in a small group by a teaching team.

Each day begins with a theory lecture, followed by a guided hands-on workshop with corpora, and a more self-directed and supported practical session in the afternoon.

Continue reading

The replication crisis: what does it mean for corpus linguistics?



Over the last year, the field of psychology has been rocked by a major public dispute about statistics. This concerns the failure of claims in papers, published in top psychological journals, to replicate.

Replication is a big deal: if you publish a correlation between variable X and variable Y – that there is an increase in the use of the progressive over time, say, and that increase is statistically significant, you expect that this finding would be replicated were the experiment repeated.

I would strongly recommend Andrew Gelman’s brief history of the developing crisis in psychology. It is not necessary to agree with everything he says (personally, I find little to disagree with, although his argument is challenging) to recognise that he describes a serious problem here.

There may be more than one reason why published studies have failed to obtain compatible results on repetition, and so it is worth sifting these out.

In this blog post, what I want to do is try to explore what this replication crisis is – is it one problem, or several? – and then turn to what solutions might be available and what the implications are for corpus linguistics. Continue reading