However, to predict performance, we might consider the types of structure that a parser is likely to find difficult and then examine a parsed corpus of speech and writing for key statistics.

Variables such as mean sentence length or main clause complexity are often cited as proxies for parsing difficulty. However, sentence length and complexity are likely to be poor guides in this case. Spoken data is not split into sentences by the speaker; rather, utterance segmentation is a matter of transcriber/annotator choice. In order to improve performance, an annotator might simply increase the number of sentence subdivisions. Complexity ‘per sentence’ is similarly potentially misleading.

In the original *London Lund Corpus* (LLC), spoken data was split by speaker turns, and phonetic tone units were marked. In the case of speeches, speaker turns could be very long compound ‘run-on’ sentences. In practice, when texts were parsed, speaker turns might be split at coordinators or following a sentence adverbial.

In this discussion paper we will use the *British Component of the International Corpus of English* (ICE-GB, Nelson *et al.* 2002) as a test corpus of parsed speech and writing. It is worth noting that both components were parsed together by the same tools and research team.

A very clear difference between speech and writing in ICE-GB is to be found in the degree of **self-correction**. The mean rate of self-correction in ICE-GB spoken data is 3.5% of words (the rate for writing is 0.4%). The spoken genre with the lowest level of self-correction is broadcast news (0.7%). By contrast, student examination scripts have around 5% of words crossed out by writers, followed by social letters and student essays, which have around 0.8% of words marked for removal.

However, self-correction can be addressed at the annotation stage, by removing it from the input to the parser, parsing this simplified sentence, and reintegrating the output with the original corpus string. To identify issues of parsing complexity, therefore, we need to consider the sentence minus any self-correction. Are there other factors that may make the spoken input stream more difficult to parse than writing?

Perhaps a more revealing estimate of top level complexity concerns the extent to which, following parsing, these segments, termed ‘parse units’, are not considered grammatically to be clauses. The scattergraph below plots the mean proportion of parse units that are ‘**non clauses**’ rather than clauses on the horizontal axis. The category of ‘non clause’ does not include subjectless or verbless clauses (see below), but may include standalone phrases and pragmatically meaningful utterances (sometimes called ‘clause fragments’). By contrast, the vertical axis shows the mean number of **incomplete** clauses. These are clauses that have been rendered incomplete, for example because the speaker was interrupted. (We have not included confidence intervals because we are interested in the overall scatter.)

- Overall in ICE-GB, **there are twice the proportion of ‘non clause’ parse units in the spoken data** (on average, 29% of parse units are not clauses) than the written component (14%). Business letters are an outlier, apparently due to the inclusion of full addresses and other formal ephemera. At the upper left of the written distribution, press editorials have the highest number of incomplete clauses while less than one in twenty parse units are considered non clauses.
- Comparing means, **there are over four times the proportion of incomplete clauses in spoken transcripts** compared to written text (2.15% to 0.51%). Means are shown with ‘X’ symbols in the scattergraph.

This scattergraph distinguishes written and spoken data to a much greater extent than, e.g. analysis of small phrases (Aarts *et al.* 2014). This indicates that the challenges in the parsing of speech data lie principally in high level structure. Getting the top level analysis correct is the most difficult challenge in any parsing enterprise. The sheer proportion of the number of non clauses in speech, and the relatively high proportion of incomplete clauses should cause us to be cautious about accepting performance estimates based on the parsing of written data when we are concerned with the parsing of speech.

Spoken data is not necessarily more complex in other aspects. For example, speech data is generally less likely to include subjectless or verbless clauses than writing. The following scattergraph plots the mean probabilities of clauses being **subjectless** (vertical axis) and **verbless** (horizontal axis) for ICE-GB text categories within speech and writing. The highest proportion of verbless clauses in any genre is found in spontaneous commentaries, a spoken genre which encourages concise phrasing, for example:

*England have won four* [*the Soviet Union three*] *with three drawn* [S2A-001 #167]

Compared to writing, a lower proportion of clauses in speech are analysed as compound clauses, but this seems to be an artefact of the sentence segmentation decisions we discussed earlier. In the case of ICE-GB speech data, large coordinated spoken clauses were frequently split at the coordinator, with the coordinator (*and*, *but*, etc) then treated as a connective introducing a new clause. This decision is semantic and stylistic (in writing, termed ‘avoiding run-on sentences’), although it could be argued that in the parsing of ICE-GB, annotators over-compensated.

In objective lexical terms, the spoken data has a slightly greater tendency to exhibit coordinating words. There are 15% more connectives or coordinators per word in ICE-GB spoken data compared to writing, and 4% more subordinating conjunctions.

If ICE-GB spoken utterances were over-zealously subdivided, this tendency has had a greater impact on coordinated clauses than subordinate ones, but it has had an impact on subordination nonetheless. Thus the proportion of ‘dependent’ (subordinate) clauses out of those clauses explicitly marked as either main or dependent in spoken data is actually 85% of the equivalent rate in the written data, despite the greater rate of subordinators.

In summary, the main factor that might make speech harder to parse than writing is that spoken data tends to be more grammatically incomplete than written data. The high proportion of ‘non clauses’, and the greater number of clauses marked as incomplete, both indicate that this is where the principal difficulty lies.

This incompleteness is in addition to self-correction, that is, where speakers correct their own utterances.

Aarts, B. and S.A. Wallis 2014. Noun phrase simplicity in spoken English. In L. Veselovská and M. Janebová (eds.) *Complex Visibles Out There. Proceedings of the Olomouc Linguistics Colloquium 2014: Language Use and Linguistic Structure.* Olomouc: Palacký University. pp 501-511.

Nelson, G., Wallis, S.A. and Aarts, B. 2002. *Exploring Natural Language: Working with the British Component of the International Corpus of English*. Amsterdam: John Benjamins.


Occasionally it is useful to cite measures in papers other than simple probabilities or differences in probability. When we do, we should estimate confidence intervals on these measures. There are a number of ways of estimating intervals, including bootstrapping and simulation, but these are computationally heavy.

For many measures it is possible to derive intervals from the Wilson score interval by employing a little mathematics. Elsewhere in this blog I discuss how to manipulate the Wilson score interval for simple transformations of *p*, such as 1/*p*, 1 – *p*, etc.

Below I am going to explain how to derive an interval for grammatical diversity, *d*, which we can define as **the probability that two randomly-selected instances have different outcome classes**.

Diversity is an effect size measure of a frequency distribution, i.e. a vector of *k* frequencies. If all frequencies are the same, the data is evenly spread, and the score will tend to a maximum. If all frequencies except one are zero, the chance of picking two different instances will of course be zero. Diversity is well-behaved except where categories have frequencies of 1.

To compute this diversity measure, we sum across the set of outcomes (all functions, all nouns, etc.), **C**:

*diversity d*(*c* ∈ **C**) = ∑ *p*₁(*c*).(1 – *p*₂(*c*)) if *n* > 1; 1 otherwise

where **C** is a set of *k* > 1 disjoint categories, *p*₁(*c*) is the probability that item 1 is category *c* and *p*₂(*c*) is the probability that item 2 is the same category *c*.

We have probabilities

*p*₁(*c*) = *F*(*c*) / *n*,

*p*₂(*c*) = (*F*(*c*) – 1) / (*n* – 1) = (*p*₁(*c*).*n* – 1) / (*n* – 1),

where *n* is the total number of instances.

The formula for *p*₂ includes an adjustment for the fact that we already know that the first item is *c*. This principle is used in card-playing statistics. Suppose I draw cards from a pack. If the first card I pick is a heart, I know that there are only 12 other hearts in the pack, so the probability of the next card I pick up being a heart is 12 out of 51, not 13 out of 52.

Note that as the set is closed, ∑*p*₁(*c*) = ∑*p*₂(*c*) = 1.

The maximum score is slightly in excess of (*k* – 1) / *k* (with *k* equal frequencies, *d* = ((*k* – 1) / *k*) × *n* / (*n* – 1)), except in the special case where *n* approaches *k* and there is a frequency of 1 in any category, in which case diversity can approach 1.

In a forthcoming paper with Bas Aarts and Jill Bowie, we found that the share of functions of *–ing* clauses (‘gerunds’) appeared to change over time in the *Diachronic Corpus of Present-day Spoken English* (DCPSE).

We obtained the following graph. The bars marked ‘LLC’ refer to data drawn from the period 1956-1972; those marked ‘ICE-GB’ are from 1990-1992.

This graph considers six functions **C** = {CO, CS, OD, SU, A, PC} of the clause. It plots *p*(*c*) over **C**. Considered individually, some functions significantly increase and some decrease their share. Note also that the increases appear to be concentrated in the shorter bars (smaller *p*) and the decreases in the longer ones.

Intuitively this appears to mean that we are seeing *–ing* clauses increase in their diversity of grammatical function over time. We would like to test this proposition.

Here is the LLC data.

| CO | CS | SU | OD | A | PC | Total |
|---|---|---|---|---|---|---|
| 6 | 33 | 61 | 326 | 610 | 1,203 | 2,239 |

Computing diversity scores, we arrive at

*d*(LLC) = 0.6152 and *d*(ICE-GB) = 0.6443.
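The calculation can be sketched in a few lines of Python. This is a minimal illustration of the formula defined above, not the spreadsheet used to produce the published figures; the function name `diversity` is an illustrative choice. Applied to the LLC frequencies it should reproduce the score quoted above to four decimal places.

```python
from typing import Sequence


def diversity(freqs: Sequence[int]) -> float:
    """Probability that two items drawn without replacement differ in category."""
    n = sum(freqs)
    if n <= 1:
        # The definition assigns d = 1 when there are fewer than two items.
        return 1.0
    total = 0.0
    for f in freqs:
        p1 = f / n              # chance that the first item is category c
        p2 = (f - 1) / (n - 1)  # chance that the second item is also c
        total += p1 * (1 - p2)
    return total


# LLC frequencies in the order CO, CS, SU, OD, A, PC.
llc = [6, 33, 61, 326, 610, 1203]
print(round(diversity(llc), 4))
```

A category with zero frequency contributes nothing to the sum, because its *p*₁ is zero.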

We wish to compare these two diversity measures. The first step is to estimate a confidence interval for *d*.

First we compute interval estimates for each term, *d*(*c*) = *p*₁(*c*).(1 – *p*₂(*c*)).

- The Wilson score interval for a probability *p* is (*w*⁻, *w*⁺).

Any monotonic function of *p*, *fn*, can be applied and plotted as a simple transformation. See Reciprocating the Wilson interval. We can write

*fn*(*p*) ∈ (*fn*(*w*⁻), *fn*(*w*⁺)).
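To make the idea concrete, here is a hedged sketch of the Wilson score interval (without continuity correction) and of applying a monotonic function to its bounds. The names `wilson` and `transform_interval` are illustrative, and the critical value *z* = 1.959964 assumes an error level of 0.05. Note that a decreasing function such as 1 – *p* reverses the order of the bounds, so the helper sorts them.

```python
import math
from typing import Callable, Tuple


def wilson(p: float, n: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval (w-, w+) for an observed proportion p out of n."""
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom


def transform_interval(fn: Callable[[float], float],
                       bounds: Tuple[float, float]) -> Tuple[float, float]:
    """Apply a monotonic fn to both bounds, returning them as (low, high)."""
    a, b = fn(bounds[0]), fn(bounds[1])
    return (min(a, b), max(a, b))


# Example: the PC cell of the LLC table above (1,203 cases out of 2,239).
w = wilson(1203 / 2239, 2239)
print(tuple(round(x, 4) for x in w))
print(transform_interval(lambda p: 1 - p, w))
```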

However, *d*(*c*) is not monotonic over its entire range. Indeed *d*(*c*) reaches a maximum where *p* = 0.5. However the axiom holds conservatively provided that the function is monotonic across the interval (*w*⁻, *w*⁺), i.e. where 0.5 is not within the interval. The following graph plots *d*(*c*) over *p*(*c*) for a two-cell vector where *n* = 40.

We can rewrite *d*(*c*) in terms of a probability *p* and *n*,

*d*(*p*, *n*) = *p* × (1 – (*p* × *n* – 1) / (*n* – 1)).

This has the interval

*d*(*p*, *n*) ∈ (*d*(*w*⁻, *n*), *d*(*w*⁺, *n*))

provided that *w*⁺ < 0.5, i.e. the whole interval lies on the rising side of the curve. To obtain the interval we have simply plugged *w*⁻ and *w*⁺ into the formula for *d*(*p*, *n*) in place of *p*.

Indeed, noting the shape of *d*, we can derive the following.

*d*(*p*, *n*) ∈ (*d*(*w*⁻, *n*), *d*(*w*⁺, *n*)) where *w*⁺ < 0.5,

*d*(*p*, *n*) ∈ (*d*(*w*⁺, *n*), *d*(*w*⁻, *n*)) where *w*⁻ > 0.5,

*d*(*p*, *n*) ∈ (min(*d*(*w*⁻, *n*), *d*(*w*⁺, *n*)), *d*(0.5, *n*)) otherwise.
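The three cases translate directly into code. The following is a sketch only, with illustrative function names; it takes the Wilson bounds as inputs rather than computing them.

```python
from typing import Tuple


def d_term(p: float, n: int) -> float:
    """Single-category diversity term d(p, n) = p * (1 - (p*n - 1) / (n - 1))."""
    return p * (1 - (p * n - 1) / (n - 1))


def d_term_interval(w_lo: float, w_hi: float, n: int) -> Tuple[float, float]:
    """Map a Wilson interval (w-, w+) for p onto an interval for d(p, n)."""
    a, b = d_term(w_lo, n), d_term(w_hi, n)
    if w_hi < 0.5:
        return a, b   # d rises across the whole interval
    if w_lo > 0.5:
        return b, a   # d falls across the interval, so the bounds swap
    # The interval straddles 0.5: cap the upper bound at the maximum d(0.5, n).
    return min(a, b), d_term(0.5, n)


# An interval that straddles 0.5, for a two-cell vector where n = 40.
print(d_term_interval(0.4, 0.6, 40))
```

The third branch is the conservative case: the upper bound cannot exceed the peak of the curve, whichever side of 0.5 the true value lies on.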

Next we need to sum these intervals. To do this we need to take account of the number of degrees of freedom of the vector.

Case 1: *df* = 1

If we had two values (as in our graphed example), we would have one degree of freedom. Cell probabilities *p*(1) + *p*(2) = 1, so *p*(2) = 1 – *p*(1).

The relationship above is exactly the same as applies for the Wilson score interval and 2×1 χ² goodness of fit test. Observed variation across *p*(1) **determines** the variation across *p*(2). Suppose *P*(1), the true value for *p*(1), were at an outer limit of *p*(1) (say, *w*⁺(1)). *P*(2) would be at the opposite outer limit of *p*(2) (*w*⁻(2)).

This means we should simply sum the transformed Wilson scores:

*d*(*c* ∈ **C**) ∈ (∑*d*(*w*⁻(*c*), *n*), ∑*d*(*w*⁺(*c*), *n*)).

We apply simple summation where intervals are strictly dependent on each other. We can obtain relative bounds of the dependent sum as:

*l*(dep) = *d* – ∑*d*(*w*⁻(*c*), *n*),

*u*(dep) = ∑*d*(*w*⁺(*c*), *n*) – *d*.

However, in our example we have more than one degree of freedom, and this method is too conservative.

Case 2: *df* > 1

Where probabilities are independent, some can increase and others decrease. The chance that two independent probabilities both fall within a 5% error level is 0.05². So we cannot simply add together intervals. The method of independent summation is to sum Pythagorean interval widths:

*l*(ind) = √∑[*d*(*p*(*c*), *n*) – *d*(*w*⁻(*c*), *n*)]², and

*u*(ind) = √∑[*d*(*p*(*c*), *n*) – *d*(*w*⁺(*c*), *n*)]².

However, in our case, we have what we might term semi-independent probabilities, with the level of independence determined by the number of degrees of freedom. We have *df* = *k* – 1 independent differences, so we can interpolate between the two methods in proportion to the number of cells.

*l* = (*l*(ind) × (*k* – 2) + 2*l*(dep)) / *k*, and

*u* = (*u*(ind) × (*k* – 2) + 2*u*(dep)) / *k*,

*d*(*c* ∈ **C**) ∈ (*d* – *l*, *d* + *u*).

Note that *l* = *l*(dep) where *k* = 2.

To see how this works, let’s return to our example. The following is drawn from the LLC data (first, blue bar in the graph), at an error level α = 0.05. Note that one of our cells (PC) has *p*₁ > 0.5 and *w*₁⁻ is also > 0.5, so we must swap the interval for this cell.

| function | CO | CS | SU | OD | A | PC |
|---|---|---|---|---|---|---|
| *p*₁ | 0.0027 | 0.0147 | 0.0272 | 0.1456 | 0.2724 | 0.5373 |
| *w*₁⁻ | 0.0012 | 0.0105 | 0.0213 | 0.1316 | 0.2544 | 0.5166 |
| *w*₁⁺ | 0.0058 | 0.0206 | 0.0348 | 0.1608 | 0.2913 | 0.5579 |

Next, to compute the bounds of the confidence interval CI(*d*) = (*d* – *l*, *d* + *u*), we obtain the same data for *p*₂ and then carry out the computation.

*l*(dep) = *d* – ∑*d*(*w*⁻(*c*), *n*) = 0.6152 – 0.5833 = 0.0319,

*u*(dep) = ∑*d*(*w*⁺(*c*), *n*) – *d* = 0.6510 – 0.6152 = 0.0359,

*l*(ind) = √∑[*d*(*p*(*c*), *n*) – *d*(*w*⁻(*c*), *n*)]² = 0.0152,

*u*(ind) = √∑[*d*(*p*(*c*), *n*) – *d*(*w*⁺(*c*), *n*)]² = 0.0165.

This obtains an interval of (0.5945, 0.6382).
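The whole procedure can be put together as follows. This is a sketch under stated assumptions rather than the original spreadsheet: it uses the uncorrected Wilson interval with *z* = 1.959964, swaps the bounds of any term whose interval lies above 0.5, and interpolates between the dependent and independent sums in the way described above. All function names are illustrative. Run on the LLC frequencies, it should agree with the interval quoted above to about three decimal places.

```python
import math
from typing import Sequence, Tuple


def wilson(p: float, n: int, z: float) -> Tuple[float, float]:
    """Wilson score interval (w-, w+) for p out of n."""
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom


def d_term(p: float, n: int) -> float:
    """Single-category diversity term d(p, n)."""
    return p * (1 - (p * n - 1) / (n - 1))


def d_term_interval(w_lo: float, w_hi: float, n: int) -> Tuple[float, float]:
    """Interval for d(p, n), allowing for the peak of the curve at p = 0.5."""
    a, b = d_term(w_lo, n), d_term(w_hi, n)
    if w_hi < 0.5:
        return a, b
    if w_lo > 0.5:
        return b, a
    return min(a, b), d_term(0.5, n)


def diversity_interval(freqs: Sequence[int],
                       z: float = 1.959964) -> Tuple[float, float, float]:
    """Return (d, lower bound, upper bound) for a vector of k frequencies."""
    n, k = sum(freqs), len(freqs)
    d = l_dep = u_dep = l_ind_sq = u_ind_sq = 0.0
    for f in freqs:
        p = f / n
        term = d_term(p, n)
        lo, hi = d_term_interval(*wilson(p, n, z), n)
        d += term
        l_dep += term - lo              # dependent sum: add the widths
        u_dep += hi - term
        l_ind_sq += (term - lo) ** 2    # independent sum: add squared widths
        u_ind_sq += (hi - term) ** 2
    # Interpolate between the two sums according to the number of cells k.
    l = (math.sqrt(l_ind_sq) * (k - 2) + 2 * l_dep) / k
    u = (math.sqrt(u_ind_sq) * (k - 2) + 2 * u_dep) / k
    return d, d - l, d + u


d, lower, upper = diversity_interval([6, 33, 61, 326, 610, 1203])
print(round(d, 4), round(lower, 4), round(upper, 4))
```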

We can quote diversity for LLC with absolute intervals (*d* – *l*, *d* + *u*):

*d*(LLC) = 0.6152 (0.5945, 0.6382), and

*d*(ICE-GB) = 0.6443 (0.6248, 0.6655).

In the Newcombe-Wilson test, we compare the difference between two Binomial observations *p*₁ and *p*₂ with the Pythagorean distance of the Wilson interval widths *y*₁⁺ = *w*₁⁺ – *p*₁, etc:

–√((*y*₁⁺)² + (*y*₂⁻)²) < (*p*₁ – *p*₂) < √((*y*₁⁻)² + (*y*₂⁺)²).

If the equation above is true, the result is not significant (the difference falls within the confidence interval).

This method operates on the assumption that the observations are independent and the intervals are approximately Normal. In our case the difference in diversity is -0.0291, and the bounds are (-0.0301, +0.0297).

Since the difference falls inside those bounds – just – we can report that the difference is not significant.
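The comparison can be checked with a short sketch. It takes each diversity estimate as a (value, lower, upper) triple and applies the Newcombe-Wilson inequality given above; the names are illustrative. Because the inputs here are the rounded figures quoted in the text, the computed bounds may differ from the published ones in the fourth decimal place.

```python
import math
from typing import Tuple

Estimate = Tuple[float, float, float]  # (value, lower bound, upper bound)


def difference_bounds(a: Estimate, b: Estimate) -> Tuple[float, float]:
    """Zero-based bounds within which the difference a - b is not significant."""
    (va, la, ua), (vb, lb, ub) = a, b
    lower = -math.sqrt((ua - va) ** 2 + (vb - lb) ** 2)  # uses y1+ and y2-
    upper = math.sqrt((va - la) ** 2 + (ub - vb) ** 2)   # uses y1- and y2+
    return lower, upper


llc = (0.6152, 0.5945, 0.6382)
ice_gb = (0.6443, 0.6248, 0.6655)

diff = llc[0] - ice_gb[0]
lower, upper = difference_bounds(llc, ice_gb)
significant = not (lower < diff < upper)
print(round(diff, 4), round(lower, 4), round(upper, 4), significant)
```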

In many scientific disciplines, such as medicine, papers that include graphs or cite figures without confidence intervals are considered incomplete and are likely to be rejected by journals. However, whereas the Wilson interval performs admirably for simple Binomial probabilities, computing confidence intervals for more complex measures typically involves a more involved computation.

We defined a diversity measure and derived a confidence interval for it. Although probabilistic (diversity is indeed a probability), it is not a *Binomial* probability. For one thing, it has a maximum below 1, slightly in excess of (*k* – 1) / *k*. For another, it is computed as the sum of the product of two sets of related probabilities.

In order to derive this interval we made the assumption of monotonicity, i.e. that the function *d* tends to increase along its range, or decrease along its range. However, *d* is decidedly **not** monotonic *–* it increases as *p* tends to 0.5 but falls thereafter. We employed the weaker assumption that it is monotonic within the confidence interval, or – in the case where the interval includes a change in direction – that it cannot exceed the global maximum. This has a conservative consequence: it makes the evaluation weaker than it would otherwise be.

We computed an interval by interpolating between dependent and independent estimates of variance, noting that the vector has *k* – 1 degrees of freedom. This is not the most accurate method (and I intend to return to this question in later posts), but it is sufficient for us to derive an interval, and, by employing Newcombe’s method, a test of significant difference.

Like Cramér’s φ, diversity condenses an array with *k* – 1 degrees of freedom into a variable with a single degree of freedom. Swapping data between the smallest and largest columns would obtain exactly the same diversity score.

Testing for significant difference in diversity, therefore, is not the same as carrying out a *k* × 2 chi-square test. Such a test could be significant even when diversity scores are not significantly different. Our new diversity difference test is more conservative, and significant results may be more worthy of comment.

Aarts, B., Wallis, S.A., and Bowie, J. (forthcoming). *–Ing clauses in spoken English: structure, usage and recent change*.

Let’s think about what you experienced. The car crash might involve a number of variables an investigator would be interested in.

How fast was the car going? Where were the brakes applied?

Look on the road. Get out a tape measure. How long was the skid before the car finally stopped?

How big and heavy was the car? How loud was the bang when the car crashed?

These are all **physical variables**. We are used to thinking about the world in terms of these kinds of variables: velocity, position, length, volume and mass. They are tangible: we can see and touch them, and we have physical equipment that helps us measure them.

To this list we might add variables we can’t see, such as how loud the bang was. We might not be able to see it, but we can appreciate that loudness is a variable that ranges from very quiet to extremely loud indeed! With a decibel meter we can get an accurate reading, but if you are trying to explain how loud something was to the Police from memory, the best you might be able to do is a rough-and-ready assessment.

We are also used to thinking about some other variables that might be relevant to our car crash investigation. If we are investigating on behalf of the insurance company, we might want to know the answers to some rather less tangible variables. What was the value of the car before the accident? How wealthy is the driver? How dangerous is that stretch of road?

We are used to thinking about the world in terms of physical variables but we are also brought up in a social world of economic value. The value of the car, the wealth of the driver. These **social variables** are a bit more ‘slippery’ than the physical variables. ‘Value’ can be highly subjective: the car might have been vintage, and different buyers might place a different value on it. The buyer, being canny, might then resell it for a higher value. Nonetheless everyone brought up in a world of trade and capital understands the idea that a car can be sold and in that process a price attached to it. Likewise, ‘wealth’ might be measured in different ways, or in different currencies. So although these are not physical variables, we are comfortable with the idea that they are tangible to us.

But what about that last variable? I asked, *how dangerous is that stretch of road?* This variable is a risk value. It is a **probability**. We can rephrase my question as “what is the probability that a car coming down this road will crash?” If we can measure this in some way, and make repeat measurements elsewhere, we could make comparisons. Perhaps we have discovered an accident ‘black spot’: somewhere where there is a greater chance of a road accident than at other locations.

**But a probability cannot be calculated on the strength of a single accident.** It can only be measured by a different, more patient, process of observation. We have to observe *many* cars driving down the road, count the ones that crash, and build up a set of observations. Probability is not a tangible variable, and it takes an effort of imagination to think about.

I want to argue that the first thing that makes the subject of statistics difficult, compared to, say, engineering, is that even the most elementary variable we use, observed probability, is not physically tangible.

Let us think about our car crash for a minute. I said that you have never been on this road before. You have no data on the probability of a crash on that road. But it would be very easy to assume from the simple fact that you saw a crash that, if the road surface seemed poor, or it was raining, these facts contributed to the accident and made it more likely. But you have only one data point to draw from. This kind of inference is not valid. It is an over-extrapolation. It is little more than a guess.

Our natural instinct is to form explanations in our mind, hypotheses, and to look for patterns and causes in the world. (Part of our training as scientists is to be suspicious of that inclination. Of course we might be right, but we have to be relentlessly careful and self-critical before we can conclude that we are.)

If we wanted to make a case that this location is an accident black spot, we would need to set up equipment and monitor the road for accidents. We would need to continue to observe the road over a substantial period of time to get the data we needed. This is called a **natural experiment**, where we don’t attempt to interfere with the conditions of the road but simply observe driver behaviour and car crashes.

Alternatively, we might **conduct an actual experiment** and drive various cars down the road to see how they handled. Either way, we would need to observe many cars going past before we could make a realistic estimate of the chance of a crash.
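A simulated version of this observation process shows why one car tells us nothing and many cars tell us a lot. The sketch below is purely illustrative: the ‘true’ crash probability of 0.02 is an invented figure, and the function name is hypothetical. With a single car the observed rate can only be 0 or 1; as the number of cars grows, the observed rate settles near the true value.

```python
import random


def observed_crash_rate(true_p: float, cars: int, rng: random.Random) -> float:
    """Observe `cars` simulated cars and return the proportion that crash."""
    crashes = sum(1 for _ in range(cars) if rng.random() < true_p)
    return crashes / cars


TRUE_P = 0.02            # invented 'true' probability, for illustration only
rng = random.Random(1)   # fixed seed so that the run is repeatable

for cars in (1, 10, 100, 10_000, 100_000):
    print(cars, observed_crash_rate(TRUE_P, cars, rng))
```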

If probability is difficult to observe directly, this has an effect on our ability to think about it. Probability is more difficult to conceive of in the way we conceive of length, say. We all vary in our spatial reasoning abilities, but we experience reinforcement learning from daily observations, tape measures and practice. As we have seen, probability is much more elusive because it is only observed from many observations. This makes it difficult to reliably estimate probability in advance, or to reason with probabilities.

Even experienced researchers make mistakes. The psychologists Tversky and Kahneman (1971) reported the findings from a questionnaire they gave to professional psychologists. The questions concerned the decisions they would make in research based on statements about probability. They showed not only that their expert subjects were unreliable, but also that their answers provided evidence of persistent biases in human cognition, including the one we mentioned earlier: a belief in the reliability of their own observations, even when they had few observations on which to base their conclusions.

So if you are struggling with statistical concepts, **don’t worry**. You are not alone. Indeed, I have come to the conclusion that *it is necessary to struggle with probability*. We have all been there, and one of my main criticisms of traditional statistics teaching is that most treatments skate over the core concepts and go straight to statistical testing methods that the experimenter, with no conceptual grounding (never mind mathematical underpinnings), simply takes on faith.

Probability is difficult to observe. It is an abstract mathematical concept that can only be measured indirectly, from many observations. And simple observed probability is just the beginning. In discussing inferential statistics I try to keep to three notions of probability and a simple labelling system: observed probability, for which I will use the label lower-case *p*, the ‘true’ population probability, capital *P*, and a third type, the probability that our observation is unreliable (the error level), which we denote with α. Many people make mistakes reasoning about that last little variable. But we are getting ahead of ourselves.

The best way to get to grips with probability is to replace my thought experiment with a physical one.

But: **safety first!** Please don’t crash an actual car — use a Scalextric instead!

Tversky, A., and Kahneman, D. 1971. Belief in the law of small numbers. *Psychological Bulletin* **76**:2, 105-110.

I have been recently reviewing and rewriting a paper for publication that I first wrote back in 2011. The paper (Wallis forthcoming) concerns the problem of how we test whether repeated runs of the same experiment obtain essentially the same results, i.e. results are not significantly different from each other.

These meta-tests can be used to test an experiment for replication: if you repeat an experiment and obtain significantly different results on the first repetition, then, with a 1% error level, you can say there is a 99% chance that the experiment is not replicable.

These tests have other applications. You might be wishing to compare your results with those of others in the literature, compare results with different operationalisation (definitions of variables), or just compare results obtained with different data – such as comparing a grammatical distribution observed in speech with that found within writing.

The design of tests for this purpose is addressed within the *t*-testing ANOVA community, where tests are applied to continuously-valued variables. The solution concerns a particular version of an ANOVA, called “the test for interaction in a factorial analysis of variance” (Sheskin 1997: 489).

However, anyone using data expressed as discrete alternatives (A, B, C etc) has a problem: the classical literature does not explain what you should do.

The rewrite of the paper caused me to distinguish between two types of tests: ‘point tests’, which I describe below, and ‘gradient tests’.

These tests can be used to compare results drawn from 2 × 2 or *r* × *c* χ² tests for homogeneity (also known as tests for independence). This is the most common type of contingency test, which can be computed using Fisher’s exact method or as a Newcombe-Wilson difference interval.

- A **gradient test** (B) evaluates whether the *gradient* or difference between point 1 and point 2 differs between runs of an experiment, *d* = *p*₁ – *p*₂. This concerns whether claims about the rate of change, or size of effect, observed are replicable. Gradient tests can be extended, with increasing degrees of freedom, into tests comparing *patterns* of effect.
- A **point test** (A) simply asks whether data at either point, evaluated separately, differs between experimental runs. This concerns whether single observations, such as *p*₁, are replicable. Point tests can be extended into ‘multi-point’ tests, which we discuss below.

Point tests only apply to homogeneity data. If you wish to compare outcomes from goodness of fit tests, you need a version of the gradient test, to compare differences from an expected *P*, *d* = *p*₁ – *P*. Since different data sets may have different expected *P*, a distinct ‘point test for goodness of fit’ would be meaningless.

The earlier version of the paper, which has been published on this blog since its launch in 2012, focused on gradient tests. The possibility of carrying out a point test was mentioned in passing. In this blog post I want to focus on point tests.

The obvious problem with gradient tests is that two experimental runs might obtain the same gradient but in fact be very different in start and end points. Consider the following graph.

The data in Figure 1 is calculated from two 2 × 2 tables drawn from a paper by Aarts, Close and Wallis (2013).

**Note:** To obtain Figure 2, I simply replaced one frequency in the first table: 46 with 100. The data is also found on the 2×2 homogeneity tab in this Excel spreadsheet, which contains a wide range of separability tests.

To make our exposition clearer, Table 1 uses the same format as in the Excel spreadsheet (with the dependent variable distributed vertically) rather than the format in the paper.

| spoken | LLC (1960s) | ICE-GB (1990s) | Total |
|---|---|---|---|
| shall | 124 | 46 | 170 |
| will | 501 | 544 | 1,045 |
| Total | 625 | 590 | 1,215 |

| written | LOB (1960s) | FLOB (1990s) | Total |
|---|---|---|---|
| shall | 355 | 200 | 555 |
| will | 2,798 | 2,723 | 5,521 |
| Total | 3,153 | 2,923 | 6,076 |

Aarts *et al*. carried out 2 × 2 homogeneity tests for the two tables separately. These test whether modal *shall* declines as a proportion of the modal *shall/will* alternation between the two time points. In other words, we compare LLC with ICE-GB data, and LOB with FLOB data.

To carry out a point test we simply rotate the test 90 degrees, e.g. to compare data at the 1960s point we compare LLC with LOB.

As I have explained elsewhere (Wallis 2013), there are a number of different methods for carrying out this comparison.

These include:

- The
*z*test for two independent proportions (Sheskin 1997: 226). - The Newcombe-Wilson interval test (Newcombe 1998).
- The 2 × 2 χ² test for homogeneity (independence).

These are all standard tests and each is discussed in papers and elsewhere on this blog.

The advantage of the third approach is that it is extensible to *c*-way multinomial observations by using a 2 × *c* χ² test.

The tests listed above can be used to compare the 1960s and 1990s intervals in Figure 1 separately.
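As an illustration of the first method, here is a sketch of the *z* test for two independent proportions (with a pooled variance estimate) applied to the 1960s point, comparing LLC with LOB. The function name is illustrative. For a 2 × 2 table the square of this *z* score equals the Pearson χ² score for homogeneity, which is why the first and third methods in the list agree.

```python
import math


def z_two_proportions(f1: int, n1: int, f2: int, n2: int) -> float:
    """z score for the difference between two independent proportions."""
    p1, p2 = f1 / n1, f2 / n2
    pooled = (f1 + f2) / (n1 + n2)  # pooled estimate under the null hypothesis
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


# 1960s point: shall out of shall + will, LLC (spoken) against LOB (written).
z = z_two_proportions(124, 625, 355, 3153)
print(round(z, 4), round(z * z, 4))

# Two-tailed test at an error level of 0.05.
print("significant" if abs(z) > 1.959964 else "not significant")
```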

However, in many cases it would be helpful to have a method that evaluated both pairs of observations in a single test. This can be generalised to a series of *r* observations. To do this, in (Wallis forthcoming) I propose what I call a multi-point test.

We generalise the χ² formula by summing over *i* = 1..*r*:

χ_{d}² = ∑χ²(*i*)

where χ²(*i*) represents the χ² score for homogeneity for each set of data at position *i* in the distribution.

This test has *r* × df(*i*) degrees of freedom, where df(*i*) is the degrees of freedom for each χ² point test. So, in the worked example we have seen, the summed test has two degrees of freedom:

| spoken | LLC (1960s) | ICE-GB (1990s) | Total |
|---|---|---|---|
| shall | 124 | 46 | 170 |
| will | 501 | 544 | 1,045 |
| Total | 625 | 590 | 1,215 |
| **written** | **LOB (1960s)** | **FLOB (1990s)** | **Total** |
| shall | 355 | 200 | 555 |
| will | 2,798 | 2,723 | 5,521 |
| Total | 3,153 | 2,923 | 6,076 |
| χ² | 34.6906 | 0.6865 | 35.3772 |

Since the computation sums independently-calculated χ² scores, each score may be individually considered for significant difference (with df(*i*) degrees of freedom). Hence we can see above the large score for the 1960s data (individually significant) and the small score for 1990s (individually non-significant).
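The summed test is easy to reproduce. The sketch below computes an ordinary Pearson χ² (no continuity correction) for each point test and adds the scores; the function and variable names are illustrative. The critical values 3.841 and 5.991 are the standard χ² thresholds at an error level of 0.05 for one and two degrees of freedom respectively.

```python
from typing import Sequence


def chi_square(table: Sequence[Sequence[float]]) -> float:
    """Pearson chi-square for an r x c contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    score = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            score += (observed - expected) ** 2 / expected
    return score


# Each point test compares spoken with written data at one time period:
# rows are shall / will, columns are spoken / written.
points = {
    "1960s": [[124, 355], [501, 2798]],  # LLC against LOB
    "1990s": [[46, 200], [544, 2723]],   # ICE-GB against FLOB
}
scores = {period: chi_square(table) for period, table in points.items()}
total = sum(scores.values())
df = len(points) * 1  # r point tests, each with one degree of freedom

CRITICAL_05 = {1: 3.841, 2: 5.991}
for period, score in scores.items():
    print(period, round(score, 4), score > CRITICAL_05[1])
print("sum", round(total, 4), total > CRITICAL_05[df])
```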

**Note:** Whereas χ² is generally associative (non-directional), the summed equation (χ* _{d}*²) is not. Nor is this computation the same as a 3 dimensional test (

- The multi-point test factors out variation between tests over the independent variable (in this instance: time). This means that if there is a lot more data in one table at a particular time period, this fact does not skew the results.
- On the other hand, it does not factor out variation over the dependent variable – after all, this is precisely what we wish to examine!

Naturally, like the point test, this test may be generalised to multinomial observations.

An alternative multi-point test for binomial (two-way) variables employs a sum of χ² values abstracted from Newcombe-Wilson tests.

- Carry out Newcombe-Wilson tests for each point test *i* at a given error level α, obtaining *D*_{i}, *W*_{i}⁻ and *W*_{i}⁺.
- Identify the inner interval width *W*_{i} for each test:
  - if *D*_{i} < 0, *W*_{i} = *W*_{i}⁻; *W*_{i} = *W*_{i}⁺ otherwise.
- Use the difference *D*_{i} and inner interval *W*_{i} to compute χ² scores:
  - χ²(*i*) = (*D*_{i} · *z*_{α/2} / *W*_{i})².

It is then possible to sum χ²(*i*) as before.

Using the data in the worked example we obtain:

**1960s:** *D _{i}* = 0.0858,

Since *D*_{i} is positive in both cases, we use the upper interval width each time. This gives us χ² scores of 28.4076 and 1.3769 respectively, which sum to 29.78. Compared to the first method above, this approach tends to downplay extreme differences.
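The three steps can be sketched in Python for a single point test. This is an illustration under stated assumptions, not the original spreadsheet: the function names are mine, I take written data (LOB) as sample 1 and spoken data (LLC) as sample 2 so that the difference comes out positive as in the text, and interval widths are stored as magnitudes. Only the 1960s point is computed here.

```python
from math import sqrt
from statistics import NormalDist


def wilson(p, n, z):
    """Wilson score interval (lower, upper) for an observed proportion p out of n."""
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom


def newcombe_wilson_chi_square(f1, n1, f2, n2, alpha=0.05):
    """Return (D, W, chi-square) with D = p2 - p1 and W the inner interval width."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p1, p2 = f1 / n1, f2 / n2
    lo1, hi1 = wilson(p1, n1, z)
    lo2, hi2 = wilson(p2, n2, z)
    d = p2 - p1
    w_minus = sqrt((p1 - lo1) ** 2 + (hi2 - p2) ** 2)  # width tested when D < 0
    w_plus = sqrt((hi1 - p1) ** 2 + (p2 - lo2) ** 2)   # width tested when D > 0
    w = w_minus if d < 0 else w_plus
    return d, w, (d * z / w) ** 2


# 1960s point: shall out of {shall, will}; assumed order: written (LOB), then spoken (LLC).
d_1960s, w_1960s, chi_sq_1960s = newcombe_wilson_chi_square(355, 3153, 124, 625)
print(round(d_1960s, 4), round(w_1960s, 4), round(chi_sq_1960s, 2))
```

Under these assumptions the difference is 0.0858 and the score should land close to the 1960s figure quoted above.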

The point test and the additive generalisation of this test into a ‘multi-point test’ represent a method of contrasting multiple runs of the same experiment, comparing observed changes in different subcorpora or genres, or examining the empirical effect of changing definitions of variables.

These tests consider the null hypothesis that **individual observations** are not different; or, in the multi-point case, that **in general** the observations are not different.

- They do not evaluate the gradient between points or the size of effect. If we wish to compare **sizes of effect** we would need to use one of the methods for this purpose described in (Wallis forthcoming).
- The method only applies to comparing tests for homogeneity (independence). To compare **goodness of fit** data, a different approach is required (also described in Wallis forthcoming).

Nonetheless, these tests are useful meta-tests that build on classical Pearson χ² tests, and they are useful tools in our analytical armoury.

Sheskin, D.J. 1997. *Handbook of Parametric and Nonparametric Statistical Procedures*. Boca Raton, Fl: CRC Press.

Newcombe, R.G. 1998. Interval estimation for the difference between independent proportions: comparison of eleven methods. *Statistics in Medicine* **17**: 873-890.

Wallis, S.A. 2013. *z*-squared: the origin and application of χ². *Journal of Quantitative Linguistics* **20**:4, 350-378. » Post

Wallis, S.A. forthcoming (first published 2011). *Comparing χ² tables for separability of distribution and effect*. London: Survey of English Usage. » Post

I have previously argued (Wallis 2014) that interaction evidence is the most fruitful type of corpus linguistics evidence for grammatical research (and doubtless for many other areas of linguistics).

Frequency evidence, which we can write as *p*(*x*), the probability of *x* occurring, concerns itself simply with the overall distribution of a linguistic phenomenon *x* – such as whether informal written English has a higher proportion of interrogative clauses than formal written English. In order to calculate frequency evidence we must define *x*, i.e. decide how to identify interrogative clauses. We must also pick an appropriate baseline *n* for this evaluation, i.e. we need to decide whether to use words, clauses, or any other structure to identify locations where an interrogative clause may occur.

**Interaction evidence** is different. It is a statistical correlation between a decision that a writer or speaker makes at one part of a text, which we will label point *A*, and a decision at another part, point *B*. The idea is shown schematically in Figure 1. *A* and *B* are separate ‘decision points’ in a given relationship (e.g. lexical adjacency), which can be also considered as ‘variables’.
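The contrast between the two kinds of evidence can be made concrete with a 2 × 2 table of co-occurring choices. The counts below are invented purely for illustration (they are not corpus data), and φ is used here simply as a familiar association measure for a 2 × 2 table.

```python
from math import sqrt

# Invented co-occurrence counts, for illustration only (not corpus data):
# rows = choice made at A (a1 / a2), columns = choice made at B (b1 / b2).
table = [[30, 70],
         [10, 90]]

(a, b), (c, d) = table
n = a + b + c + d

# Frequency evidence: how often b1 is chosen overall.
p_b1 = (a + c) / n

# Interaction evidence: does the rate of b1 depend on the choice at A?
p_b1_given_a1 = a / (a + b)
p_b1_given_a2 = c / (c + d)

# phi: strength of association in a 2 x 2 table (0 = none, +/-1 = complete).
phi = (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

print(p_b1, p_b1_given_a1, p_b1_given_a2, round(phi, 3))
```

Here the overall rate of *b1* is 0.2, but it is 0.3 after *a1* and 0.1 after *a2*: the frequency figure alone hides the dependency between the two decision points.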

This class of evidence is used in a wide range of computational algorithms. These include collocation methods, part-of-speech taggers, and probabilistic parsers. Despite the promise of interaction evidence, the majority of corpus studies tend to consist of discussions of frequency differences and distributions.

In this paper I want to look at applications of interaction evidence which are made more-or-less at the same time by the same speaker/writer. In such circumstances we cannot be sure that just because *B* follows *A* in the text, the decision relating to *B* was made after the decision at *A*.

For example, in studying the premodification of noun phrases by attributive adjectives in English – which adjective is applied first in assembling an NP like *the old tall green ship*, for instance – **we cannot be sure that adjectives are selected by the speaker in sentence order**. It is also perfectly plausible that adjectives were chosen in an alternative or parallel order in the mind of the speaker, and then assembled in the final order during the language production process.

Of course, in cases where points *A* and *B* are separated substantively in time (as in many instances of structural self-priming) or where *B* is spoken in response to *A* by another speaker (structural priming of another’s language), there is unlikely to be any ambiguity about decision order. Moreover, if *A* licenses *B*, then the order is unambiguous.

However, in circumstances where *A* and *B* are proximal, and where the order of decisions made by the speaker/writer cannot be presumed, we wish to consider whether there are mathematical or statistical methods for predicting the most likely order in which decisions were made.

Such a method would have considerable value in experimental design in cognitive corpus linguistics. For example, since Heads of NPs, VPs etc. are conceived of as determining their complements, it may not be too much of a stretch to argue that if this method works, we may have found a way of empirically evaluating this grammatical concept.

1. Introduction
2. A collocation example
   - 2.1 Employing chi-square and phi
   - 2.2 Directional statistics
   - 2.3 Significantly directional?
3. A grammatical example
   - 3.1 Testing for difference under alternation
   - 3.2 Comparing Newcombe-Wilson intervals for direction
   - 3.3 Optimising the difference interval
4. Mapping significance of association and direction
5. Concluding remarks
6. References

Wallis, S.A. 2017. *Detecting direction in interaction evidence*. London: Survey of English Usage. **»** Paper (PDF)

- Excel spreadsheets

Wallis, S.A. 2011. *Comparing χ² tests for separability*. London: Survey of English Usage, UCL. **»** post

Wallis, S.A. 2012. *Goodness of fit measures for discrete categorical data*. London: Survey of English Usage, UCL. **»** post

Wallis, S.A. 2013a. *z*-squared: the origin and application of χ². *Journal of Quantitative Linguistics* **20**:4, 350-378. **»** post

Wallis, S.A. 2013b. Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods. *Journal of Quantitative Linguistics* **20**:3, 178-208. **»** post

Wallis, S.A. 2014. What might a corpus of parsed spoken data tell us about language? In L. Veselovská and M. Janebová (eds.) *Complex Visibles Out There. Proceedings of the Olomouc Linguistics Colloquium 2014: Language Use and Linguistic Structure.* Olomouc: Palacký University, 2014. pp 641-662. **»** post

Wallis, S.A. forthcoming. *That vexed problem of choice*. London: Survey of English Usage, UCL. **»** post

The Summer School is a short three-day intensive course aimed at PhD-level students and researchers who wish to get to grips with Corpus Linguistics. Numbers are deliberately limited on a first-come, first-served basis. You will be taught in a small group by a teaching team.

Each day begins with a theory lecture, followed by a guided hands-on workshop with corpora, and a more self-directed and supported practical session in the afternoon.

Over the three days, participants will learn about the following:

- the scope of Corpus Linguistics, and how we can use it to study the English Language;
- key issues in Corpus Linguistics methodology;
- how to use corpora to analyse issues in syntax and semantics;
- basic elements of statistics;
- how to navigate large and small corpora, particularly ICE-GB and DCPSE.

At the end of the course, participants will have:

- acquired a basic but solid knowledge of the terminology, concepts and methodologies used in English Corpus Linguistics;
- had practical experience working with two state-of-the-art corpora and a corpus exploration tool (ICECUP);
- gained an understanding of the breadth of Corpus Linguistics and the potential application for projects;
- learned about the fundamental concepts of inferential statistics and their practical application to Corpus Linguistics.

For more information, including costs, booking information, timetable, see the website.

Over the last year, the field of psychology has been rocked by a major public dispute about statistics. This concerns the failure of claims in papers, published in top psychological journals, to replicate.

Replication is a big deal: if you publish a correlation between variable *X* and variable *Y* – that there is an increase in the use of the progressive over time, say – and that increase is statistically significant, you expect that this finding would be replicated were the experiment repeated.

I would strongly recommend Andrew Gelman’s brief history of the developing crisis in psychology. It is not necessary to agree with everything he says (personally, I find little to disagree with, although his argument is challenging) to recognise that he describes a serious problem here.

There may be more than one reason why published studies have failed to obtain compatible results on repetition, and so it is worth sifting these out.

In this blog post, what I want to do is try to explore what this replication crisis is – is it one problem, or several? – and then turn to what solutions might be available and what the implications are for corpus linguistics.

The debate between Neil Millar and Geoff Leech regarding the alleged increase (Millar 2009) and decline (Leech 2011) of the modal auxiliary verbs is an example of this problem.

Millar based his conclusions on the TIME corpus, discovering that the rate of modal verbs per million words tended to increase over time. Leech, using the Brown series of US English corpora, discovered the opposite. Both applied statistical methods to their data but obtained very different conclusions.

Inferential statistics operates by predicting the result of repeated runs of the same experiment, i.e. on samples of data drawn from the same population.

Stating that something “significantly increases over time” can be reformulated as:

- subject to caveats of **random sampling** (the sample is, or approximates to, a random sample of utterances drawn from the same population), and **Binomial variables** (observations are free to vary from 0 to 1),
- we can calculate a **confidence interval** at a given error rate (say 1 in 20 times for a 5% error rate / 95% interval) on the difference in two observations of variable *X* taken at two time points 1 and 2, *x*₂ – *x*₁,
- **all points** within this interval (including the lower bound) are greater than 0,
- **on repeated runs of the same experiment we can expect to see an observation fall outside of the confidence interval of the difference at the predicted rate** (here, 1 time in 20).

**Note:** For the purposes of this blog post, I am focusing on the last bullet point – when we say that something “fails to replicate”, we mean that on a repetition the result falls outside the confidence interval of the difference *on the very next occasion!* More precisely, we mean that the results are statistically separable.

Leech obtained a different result from Millar on the first attempted repetition of this experiment. This could be a fluke, but it seems to be a failure to replicate. There should only be a 1 in 20 chance of this happening.

Observing such a replication failure should lead us to ask some searching questions about these two studies, many of which are discussed elsewhere in this blog.

Much of the controversy can be summed up by the bottom row in this table, drawn from Millar (2009). This appears to show a 23% increase in modal use between the 1920s and 2000s. With a lot of data and a sizeable effect, this increase seems bound to be significant.

| | 1920s | 1930s | 1940s | 1950s | 1960s | 1970s | 1980s | 1990s | 2000s | % diff 1920s-2000s |
|---|---|---|---|---|---|---|---|---|---|---|
| will | 2,194.63 | 1,681.76 | 1,856.40 | 1,988.37 | 1,965.76 | 2,135.73 | 2,057.43 | 2,273.23 | 2,362.52 | +7.7% |
| would | 1,690.70 | 1,665.01 | 2,095.76 | 1,669.18 | 1,513.30 | 1,828.92 | 1,758.44 | 1,797.03 | 1,693.19 | +0.1% |
| can | 832.91 | 742.30 | 955.73 | 1,093.39 | 1,233.13 | 1,305.82 | 1,231.99 | 1,475.95 | 1,777.07 | +113.4% |
| could | 661.33 | 822.72 | 1,188.24 | 998.83 | 950.73 | 1,106.25 | 1,156.61 | 1,378.39 | 1,342.56 | +103.0% |
| may | 583.59 | 515.12 | 496.93 | 502.74 | 628.13 | 743.66 | 775.92 | 937.08 | 931.91 | +59.7% |
| should | 577.46 | 450.07 | 454.87 | 495.26 | 441.96 | 475.50 | 453.33 | 521.46 | 593.27 | +2.7% |
| must | 485.31 | 418.03 | 456.57 | 417.62 | 401.36 | 390.47 | 347.02 | 306.69 | 250.59 | -48.4% |
| might | 374.52 | 375.40 | 500.33 | 408.90 | 399.80 | 458.99 | 416.81 | 474.23 | 433.34 | +15.7% |
| shall | 212.19 | 120.79 | 96.42 | 70.52 | 50.48 | 35.65 | 25.93 | 16.09 | 9.26 | -95.6% |
| ought | 50.22 | 37.94 | 39.31 | 40.34 | 36.91 | 34.29 | 28.27 | 34.90 | 27.65 | -44.9% |
| Total | 7,662.86 | 6,829.14 | 8,140.56 | 7,685.15 | 7,621.56 | 8,515.28 | 8,251.75 | 9,215.05 | 9,421.36 | +22.9% |

In attempting to identify why Leech and Millar obtain different results, the following questions should be considered.

- **Are the two samples drawn from the same population, or are they drawn from two distinct populations?** To put it another way, are there characteristics of the TIME data that make it distinct from the general written data in the Brown corpora? For example, does TIME have a ‘house style’, with subeditors enforcing it, which has led to a greater frequency of modal use? Has TIME tended to curate more stories with more modal hedges than the overall trend? Jill Bowie (Bowie *et al* 2013) reported that genre subdivisions within the spoken DCPSE corpus often exposed different modal trends.
- **Does Millar’s data support a general observation of increased modal use?** Bowie observes that Millar’s aggregate data fluctuates over the entire time period (see Table, bottom row), and some changes in sub-periods appear to be consistent with the trend reported by Leech in an earlier study in 2003. According to this observation, simply expressing the trend as an increase in modal verb use seems misleading.
- **Is it legitimate to aggregate all modals together?** In one sense, modals are a well-defined category of verb: a closed category, especially if one excludes the semi-modals. So “modal use” is a legitimate variable. But we can also see that different modal verbs are undergoing different patterns of change over time (see Table). Millar reports that *shall* and *must* are in decline in his data while *will* and *can* are increasing. Whereas *shall* and *will* may be alternates in some contexts, this does not mean that bundling all modal trends together is particularly meaningful. Moreover, since the synchronic distribution of modals (like most linguistic variables) is sensitive to genre, this issue also interacts with my first bullet point, i.e. the fact that there are known differences between corpora.
- **How reliable is a per-million-word measure?** What does the data look like if we use a different baseline, for example, modal use per tensed verb phrase (or tensed main verb)? Doing this allows us to factor out variation in ‘tensed VP density’ (i.e. the variation in potential sites for modals to be deployed) between texts. Neither Leech nor Millar does this, which means that they are not measuring when writers **choose** to use modal verbs, but the rate at which we, the reader, are **exposed** to them. See That vexed problem of choice.

  If VP density in text samples changes over time in either corpus, this may explain these different results – not as a result of increasing or declining modal use but as a result of increasing or declining tensed VP density (or declining / increasing density of other constituents). More generally, word-based baselines almost always conflate opportunity and use because the option to insert the element is not available following every other word (exceptions might include pauses or expletives, but these exceptions prove the rule). This conflation undermines the Binomial model and increases the risk that results will not replicate. The solution is to focus on identifying each choice-point as much as possible.
- **Does per word (per-million-word) data conform to the Binomial statistical model?** Since the entire corpus cannot consist of modal verbs, observations of modal verbs can never approach 100%, so the answer has to be no. However, the effect of this inappropriate model is that it tends to lead to the underreporting of otherwise significant results. See Freedom to vary and statistical tests. This may be a problem, but logically, it cannot be an explanation for obtaining two different ‘significant’ results in opposite directions!
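The baseline question can be illustrated with a toy calculation. The figures below are invented (they come from neither the TIME nor the Brown data) and the function names are mine: the point is only that a per-million-word rate can rise by 20% while the rate per tensed VP, the choice rate, does not move at all.

```python
# Invented figures for two time periods, for illustration only (not corpus data).
periods = {
    "earlier": {"words": 1_000_000, "tensed_vps": 100_000, "modals": 8_000},
    "later":   {"words": 1_000_000, "tensed_vps": 120_000, "modals": 9_600},
}


def per_million_words(counts):
    """Exposure rate: modals per million words."""
    return counts["modals"] * 1_000_000 / counts["words"]


def per_tensed_vp(counts):
    """Choice rate: proportion of tensed VPs that contain a modal."""
    return counts["modals"] / counts["tensed_vps"]


for label, counts in periods.items():
    print(label, per_million_words(counts), per_tensed_vp(counts))
```

In this sketch the whole of the apparent increase is due to a change in tensed VP density, which a word-based baseline cannot distinguish from a change in modal use.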

All of the above are reasons to be unsurprised at the fact that Millar’s summary finding was not replicated in Leech’s data. But to be fair, many of Millar’s individual trends *do* appear to be consistent with results found in the Brown corpus.

As we shall see, the problem of replication is not that *all* results in one study are not reproduced in another study; rather, it is that *some* results are not reproduced. But this observation raises an obvious question: which results should we cite?

Moreover, if our most remarked-upon finding is not replicated, we have an obvious problem.

The replication crisis has been most discussed in psychology and the social sciences. In psychology, some published findings have been controversial to say the least. Claims that ‘Engineers have more sons; nurses have more daughters’ have tended to attract the interest of other psychologists relatively quickly. But this is shooting fish in a barrel.

In psychology, it is common to perform studies with small numbers of participants – 10 per experimental condition is usually cited as a minimum, which means that between 20 and 40 participants becomes the norm. Many kinds of failure to replicate are due to what statisticians tend to call ‘basic errors’, such as using an inappropriate statistical test. I discuss this elsewhere in this blog.

- The most common error is applying a mathematical model to data that does not conform to it. For example, applying a Binomial model that assumes that an observed probability is free to vary from 0 to 1 to a variable that can only vary between 0 and 0.001 (say), is mathematically unsound. No method that makes this assumption will work the way that the Binomial model predicts when it comes to replication.
- Corpus linguistics has a particular historical problem due to the ubiquity of studies employing word-based baselines (per million words, per thousand words etc). It is not possible to adjust an error level to fix this problem, because the problem is one of missing data — in this case, frequency data for a meaningful choice baseline (ideally, the frequency of alternate forms). Bravo for variationism.

This is why in this blog I have tended to argue for applying the simplest possible experimental designs (2 × 2 contingency tests, for example) over multivariate regression algorithms which may work, but are treated as ‘black boxes’ by almost all who use them. Such algorithms may ‘over fit’ data, i.e. they match the data more closely than is mathematically justified. But more importantly, they (and the assumptions underpinning them) are not transparent to their users.

I argue that if you don’t understand how your results were derived, you are taking them on faith.

This does not mean that I think multi-variable methods are never theoretically superior to, or potentially more powerful than, simpler tests. Rather, my objection is that before we use any statistical method we need to be sure that we understand what it is doing with our data. We have to ask ourselves constantly, *what do our results mean?*

However, the replication problem does not go away entirely once we have dealt with these so-called basic errors.

Andrew Gelman and Eric Loken (2013) raise a more fundamental problem that, if valid, is particularly problematic for corpus linguists. This concerns a question that goes to the heart of the post-hoc analysis of data, and the fundamental philosophy of statistical claims and the scientific method.

Essentially their argument goes like this.

- All data contains random noise, and thus every variable in a dataset (extracted from a corpus) will contain random noise. Researchers tend to assume that by employing a significance test we ‘control’ for this noise. But this is a mischaracterisation. Faced with a dataset consisting of pure noise, we would detect a ‘significant’ result 1 in 20 times (at a 0.05 threshold). Another way of thinking about this is that statistical methods can find patterns in data (correlations) even when there are no patterns to be found.
- Any data set may contain multiple variables, there are multiple potential definitions of these variables, and there are multiple analyses we could perform on the data. In a corpus we could modify definitions of variables, perform new queries, change baselines, etc., to perform new analyses.
- It follows that there is a very large number of potential hypotheses we *could* test against the data. (Note: this is not an argument against exploring the hypothesis space in order to choose a better baseline on theoretical grounds!)
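The first point in this list, that pure noise yields ‘significant’ results about 1 time in 20, is easy to check by simulation. The sketch below is my own illustration with arbitrary settings (2,000 trials, samples of 500, a true rate of 0.3): it draws two samples from the same population each time and applies a two-proportion *z* test, which is equivalent to a 2 × 2 χ² test without continuity correction.

```python
import random
from math import sqrt
from statistics import NormalDist


def z_test_two_proportions(f1, n1, f2, n2):
    """z score for a difference in two independent proportions (pooled variance)."""
    p1, p2 = f1 / n1, f2 / n2
    pooled = (f1 + f2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p2 - p1) / se if se > 0 else 0.0


def false_positive_rate(trials=2000, n=500, p=0.3, alpha=0.05, seed=1):
    """Share of 'significant' differences when both samples share the same true rate p."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(trials):
        f1 = sum(rng.random() < p for _ in range(n))
        f2 = sum(rng.random() < p for _ in range(n))
        if abs(z_test_two_proportions(f1, n, f2, n)) > z_crit:
            hits += 1
    return hits / trials


rate = false_positive_rate()
print(rate)
```

The printed rate should be in the region of 0.05: there is no difference to find, yet roughly one comparison in twenty is flagged.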

This part of the argument is not very controversial. However, Gelman and Loken’s more provocative claim is as follows.

- Few researchers would admit to running very many tests against data and reporting results, which the authors term ‘fishing’ for significant results, or ‘p-hacking’. There are some algorithms that do this (multivariate logistic regression anyone?), but most research is not like this.
- Unfortunately, the authors argue, **standard post-hoc analysis methods – exploring data, graphing results and reporting significant results – do much the same thing.** We dispense with blind alleys (what they call ‘forking paths’), because we can see that they are not likely to produce significant results. Although we don’t actually run these dead-end tests, for mathematical purposes *our educated eyeballing of data to focus on interesting phenomena has done the same thing*.
- As a result, we overestimate the robustness of our results, and often, they fail to replicate.

Gelman and Loken are not alone in making this criticism. Cumming (2014) objects to ‘NHST’ (null hypothesis significance testing), interpreted as an imperative that

“explains selective publication, motivates data selection and tweaking until the *p* value is sufficiently small, and deludes us into thinking that any finding that meets the criterion of statistical significance is true and does not require replication.”

Since it would be unfair to criticise others for a problem that my own work may be prone to, let us consider the following graph that we used while writing Bowie and Wallis (2016). The graph does not appear in the final version of the paper – not because we didn’t like it, but because we decided to adopt a different baseline in breaking down an overall pattern of change into sub-components. But it is typical of the kind of graph we might be interested in examining.

There are two critical questions that follow from Gelman and Loken’s critique.

- *In plotting this kind of graph and reporting confidence intervals, are we misrepresenting the level of certainty found in the graph?*
- *Are we engaging in, or encouraging, retrospective cherry-picking of contrasts between observations and confidence intervals?*

In the following graph there are 19 decades and 5 trend lines, i.e. 95 confidence intervals. There are 171 × 5 potential pairwise comparisons along the trend lines, and 10 × 19 vertical pairwise comparisons between them. So there are, let’s say, 1,045 potential statistical pairwise tests which would be reasonable to carry out. With a 1 in 20 error rate, we would expect around 52 ‘significant’ pairwise comparisons to be incapable of replication.
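The arithmetic behind these figures is simply a count of pairs. A short sketch (variable names are mine):

```python
from math import comb

decades, trend_lines = 19, 5

intervals = decades * trend_lines               # 95 confidence intervals
along_lines = comb(decades, 2) * trend_lines    # 171 x 5 = 855 comparisons along lines
between_lines = comb(trend_lines, 2) * decades  # 10 x 19 = 190 vertical comparisons
total_tests = along_lines + between_lines       # 1,045 potential pairwise tests

# Expected number of spurious 'significant' results at a 1 in 20 error rate,
# if there were no real differences anywhere in the graph.
expected_false_positives = total_tests / 20     # 52.25

print(intervals, along_lines, between_lines, total_tests, expected_false_positives)
```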

Gelman, Loken, Cumming *et al.* would argue that by selecting a few statistically significant claims from this graph, we have committed precisely the error they object to.

However, I have to defend this graph, and others like it, by arguing that **this is not our method**. We don’t sift through 1,045 possible comparisons and then report significant results selectively! In the paper, and in our work more generally, we really don’t encourage this kind of cherry-picking (the human equivalent of over-fitting). We are more concerned with the overall patterns that we see, general trends, etc., which are more likely to be replicable in broad terms.

Thus, for example, in that paper we don’t pull out specific significant pairwise comparisons to make strong claims. In this particular graph we can see an apparently statistically significant sharp decline between 1900 and 1930 in the tendency of writers to use the verb SAY (as in *he is said to have stayed behind*) before a *to-*infinitive perfect, compared to the other verbs in the group. This observation may be replicable, but **the conclusions of the paper do not depend on this observation**. This claim, and similar claims, do not appear in the paper.

Similarly, if we turn back to Neil Millar’s modals-per-million-word data for a moment, Bowie’s observation that the data does not show a consistent increase over time is interesting. Millar did not select the time period in order to report that modals were on the increase – on the contrary, he non-arbitrarily took the start and end point of the timeframe sampled. But the conclusion that ‘modals increased over the entire period’ was only one statement that described the data. In shorter periods there was a significant fall, and different modal verbs behaved differently. Indeed, the complexity of his results is best summed up by the detailed graphs within his paper!

**In conclusion:** it is better to present and discuss the pattern, not just the end point – or the slogan.

Nonetheless we may still have the sneaking suspicion that what we are doing is a kind of researcher bias. We tend to report statistically significant results and ignore those inconvenient non-significant ones. The fear is that results assumed to be due to chance 1 in 20 times are more likely due to chance 1 in 5 times (say), simply because we have – inadvertently and unconsciously – already preselected our data and methods to obtain significant results.

Some highly experienced researchers have suggested that we fix this problem by adopting tougher error levels – adopt a 1 in 100 level and we might arrive at 1 in 25. The problem is that this assumes we know the appropriate multiplier to apply.

It is entirely legitimate to adjust an error level to ensure that multiple independent tests are simultaneously significant, as some fitting algorithms do. But if a statistical model is incorrectly applied to data, logically the solution must lie in correcting the model, not the error level.

Gelman and Loken suggest instead that published studies should always involve a replication process. They argue it is preferable that researchers publish half as many experiments and include a replication step than publish non-replicable results.

**Suggested method:** Before you start, create two random subcorpora A and B by randomly drawing texts from the corpus and assigning them to A and B in turn. You may wish to control for balance, e.g. to ensure subsampling is drawn equitably from each genre category. Perform the study on A, and summarise the results. Without changing a single query, variable or analysis step, apply exactly the same analysis to B.

Do we get **compatible results**, i.e. *results that fall within the confidence intervals of the first experiment*? More precisely, are the results statistically separable?
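One way to set up the split is sketched below. This is an illustration only: the text identifiers and genre labels are hypothetical, and `split_corpus` is my own helper rather than a feature of any corpus tool. Texts are shuffled within each genre and then dealt to A and B in turn, which keeps the two subcorpora balanced by genre.

```python
import random
from collections import defaultdict


def split_corpus(texts, seed=0):
    """Assign texts to subcorpora A and B in turn, after shuffling within each genre.

    `texts` is a list of (text_id, genre) pairs.
    """
    rng = random.Random(seed)
    by_genre = defaultdict(list)
    for text_id, genre in texts:
        by_genre[genre].append(text_id)

    sub_a, sub_b = [], []
    for genre in sorted(by_genre):
        ids = by_genre[genre]
        rng.shuffle(ids)
        sub_a.extend(ids[0::2])  # 1st, 3rd, 5th ... text drawn
        sub_b.extend(ids[1::2])  # 2nd, 4th, 6th ... text drawn
    return sub_a, sub_b


# Hypothetical text identifiers and genre labels, for illustration only.
texts = [(f"S-{i:03d}", "spoken") for i in range(1, 11)] + \
        [(f"W-{i:03d}", "written") for i in range(1, 9)]
sub_a, sub_b = split_corpus(texts)
print(len(sub_a), len(sub_b))
```

Fixing the random seed and recording it with the study means that the same partition can be regenerated later by anyone attempting the replication.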

An alternative to formal replication is to repeat the experiment with well-defined, as distinct from randomly generated, subcorpora.

**Sampling subcorpora:** Suppose you apply an analysis to spoken data in ICE-GB, and then repeat it with written data. Do we get broadly similar results? If we obtain comparable results for two subcorpora with a known difference in sampling, it is probable they would pass a replication test where two subsamples were not sampled differently. On the other hand, if results *are* different, this would justify further investigation.

Even where replication is not carried out (for reasons of insufficient data, perhaps), an uncontroversial corollary of this argument is that your research method should be sufficiently transparent so that it can be replicated by others.

As a general principle, authors should make raw frequency data available to permit a reanalysis by other analysis methods. I find it frustrating when papers publish per million word frequencies in tables, when what is needed for a reanalysis is raw frequency data!

Another of Gelman and Loken’s recommendations is that researchers need to spend more time focusing on sizes of effect, rather than just reporting statistical significance. With lots of data and large effect sizes, the problem is reduced. Certainly we should be wary of citing just-significant results with a small effect size.

Where does this leave the arguments I have made elsewhere in favour of visualising data with confidence intervals? One of the implications of the ‘forking paths’ argument is that we tend not to report dead-end, non-significant results. But well considered graphs can visualise all data in a given frame, rather than selected data (of course we have to ‘frame’ this data, select variables, etc.).

One advantage of graphing data with confidence intervals is that we apply the same criteria to all data points and allow the reader to interpret the graph. Significant and non-significant contrasts are available to be viewed. We also visualise effect sizes and the weight of evidence (confidence intervals), even if it is arguable that our model is insufficiently conservative.

Thus a strength of Millar’s paper is the reporting of trends and graphs. In the graph above, the confidence intervals improve our understanding of the overall trends we see.

We just should not assume that every significant difference will be replicable.

This is really one of mine, but I suggest it is implicit in the argument above.

It seems to me to be an absolutely essential requirement for any empirical scientist to play devil’s advocate to their own hypothesis.

That is, it is not sufficient to ‘find something interesting in data’, and publish. What we are really trying to do is detect meaningful phenomena in data, or to put it another way, we are trying to find robust evidence of phenomena that have implications for linguistic theory. We are trying to move from observed correlation to a hypothesised underlying cause.

Statistics is a tool to help us do this. But logic also plays an essential part.

Without wishing to create a checklist for empirical linguistics (such that a researcher is convinced of the validity of their results simply because they can tick off the list), we might argue that the following steps are necessary in all empirical research.

- **Identify the underlying research question**, framed in general theoretical terms.
- **Operationalise the research question** as a series of testable hypotheses or predictions, and evaluate them. Plot graphs! Visualising data with confidence intervals allows us to visualise expected variation and make more robust claims.
- **Focus reporting on global patterns** across the entire dataset. If your research ends up prioritising an apparently unusual local pattern in a selected part of the data, consider whether this may be an artefact of sampling.
- **Critique the results of this evaluation** in terms of the original research question, and play devil’s advocate: what other possible underlying explanations might there be for the observed results?
- **Consider alternative hypotheses** and test them. Try to design new experiments to separate out different possible explanations for the observed phenomenon.
- **Plan to include a replication step** prior to publication. This means being prepared to partition the data in the way described above, dividing the corpus into different pools of source texts.

Whether or not Gelman and Loken’s argument applies to your corpus linguistics study (and we have to eliminate basic errors first), the principal conclusion is that it is difficult to overstate the importance of **reporting accuracy and transparency**. If the study does not appear to replicate in the future, possible reasons must be capable of exploration by future researchers. It would not have been possible to explore the differences between Leech and Millar’s data had Neil Millar simply summarised a few trends and reported some statistically significant findings.

It is incumbent on all of us to properly describe the limitations of data and sampling; definitions of variables and abstraction (query) methods for populating them; as well as graphing data to reveal both significant and non-significant patterns at the same time.

A typical mistake is to refer to ‘British English’ (say) as shorthand for ‘data drawn from British English texts sampled according to the sampling frame defined in Section 3’. Many failures to replicate in psychology can be attributed to precisely this type of logical error – that the experimental dataset is not a reliable model for the population claimed.

Finally, Cumming (2014) makes an important distinction between **exploratory research** and **prespecified research**. Corpus linguistics is almost inevitably exploratory, as it is impossible to prespecify data collection in post-hoc analysis. In a natural experiment we cannot control for confounding variables, and we must frame our conclusions accordingly.

Bowie, J., Wallis, S.A. and Aarts, B. 2013. Contemporary change in modal usage in spoken British English: mapping the impact of “genre”. In Marín-Arrese, J.I., Carretero, M., Arús H.J. and van der Auwera, J. (eds.) *English Modality*, Berlin: De Gruyter, 57-94.

Bowie, J. and Wallis, S.A. 2016. The *to*-infinitival perfect: A study of decline. In Werner, V., Seoane, E., and Suárez-Gómez, C. (eds.) *Re-assessing the Present Perfect*, Topics in English Linguistics (TiEL) 91. Berlin: De Gruyter, 43-94.

Cumming, G. 2014. The New Statistics: Why and How, *Psychological Science*, 25(1), 7-29.

Gelman, A. and Loken, E. 2013. The garden of forking paths. Columbia University. **»** ePublished.

Leech, G. 2011. The modals ARE declining: reply to Neil Millar’s ‘Modal verbs in TIME: frequency changes 1923–2006’. *International Journal of Corpus Linguistics* 16(4).

Millar, N. 2009. Modal verbs in TIME: frequency changes 1923–2006. *International Journal of Corpus Linguistics* 14(2), 191–220.

One of the longest-running, and in many respects the least helpful, methodological debates in corpus linguistics concerns the spat between so-called **corpus-driven** and **corpus-based** linguists.

I say that this has been largely unhelpful because it has encouraged a dichotomy which is almost certainly false, and the focus on whether it is ‘right’ to work from corpus data upwards towards theory, or from theory downwards towards text, distracts from some serious methodological challenges we need to consider (see other posts on this blog).

Usually this discussion reviews the achievements of the most well-known corpus-driven linguist, John Sinclair, in building the *Collins Cobuild Corpus*, and deriving the *Collins Cobuild Dictionary* (Sinclair *et al*. 1987) and *Grammar* (Sinclair *et al*. 1990) from it.

**In this post I propose an alternative examination.**

I want to suggest that *the greatest success story for corpus-driven research is the development of part-of-speech taggers* (usually called ‘POS-taggers’ or simply ‘taggers’) trained on corpus data.

These are industrial-strength, reliable algorithms that obtain good results with minimal assumptions about language.

So, *who needs theory?*

Taggers consist of two parts:

- **a ‘learning’ algorithm** that collects rules from training data, and
- **a ‘tagging’ algorithm** which applies rules to new texts to classify words by their part of speech (word class).

The corpus-driven aspect is the ‘learning’ algorithm.

A typical rule might be that if the word *old* (which can be a noun/nominal adjective, as in *the old*, or adjective, *the old man*) is followed by a noun, then *old* is more likely to be an adjective than otherwise.

The tagging algorithm takes a sentence and applies these rules like a crossword solver. It classifies the words that it is most certain of before considering those it is less confident about. Thus, in *the old man*, *the* is unambiguously a determiner, whereas both *old* and *man* can belong to more than one word class.
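The ‘crossword solver’ ordering can be illustrated with a minimal Python sketch. It is not the algorithm of any particular tagger: the word-tag counts in `WORD_TAGS`, the tag-pair counts in `TAG_PAIRS`, the tag labels and the scoring scheme are all assumptions invented for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy database (all counts are invented for illustration).
WORD_TAGS: Dict[str, Dict[str, int]] = {
    "the": {"DET": 100},
    "old": {"ADJ": 90, "N": 10},
    "man": {"N": 95, "V": 5},
}
# Hypothetical counts of one tag being immediately followed by another.
TAG_PAIRS: Dict[Tuple[str, str], int] = {
    ("DET", "ADJ"): 40, ("DET", "N"): 60,
    ("ADJ", "N"): 80, ("N", "N"): 10, ("N", "V"): 30,
}


def confidence(word: str) -> float:
    """Share of the word's training instances that carry its commonest tag."""
    counts = WORD_TAGS[word]
    return max(counts.values()) / sum(counts.values())


def resolution_order(words: List[str]) -> List[int]:
    """Word positions sorted from most to least certain."""
    return sorted(range(len(words)), key=lambda i: confidence(words[i]), reverse=True)


def tag_sentence(words: List[str]) -> List[Optional[str]]:
    """Tag the most certain words first, then use decided neighbours as evidence."""
    tags: List[Optional[str]] = [None] * len(words)
    for i in resolution_order(words):
        best_tag, best_score = None, -1.0
        for tag, count in WORD_TAGS[words[i]].items():
            score = float(count)
            # Neighbours contribute only if they have already been decided.
            if i > 0 and tags[i - 1] is not None:
                score *= TAG_PAIRS.get((tags[i - 1], tag), 0) + 1
            if i + 1 < len(words) and tags[i + 1] is not None:
                score *= TAG_PAIRS.get((tag, tags[i + 1]), 0) + 1
            if score > best_score:
                best_tag, best_score = tag, score
        tags[i] = best_tag
    return tags


print(resolution_order(["the", "old", "man"]))  # positions, most certain first
print(tag_sentence(["the", "old", "man"]))
```

With these invented counts, *the* is settled first, then *man*, and *old* is resolved last using both of its neighbours. Words missing from the toy database would raise a `KeyError`; a real tagger needs a strategy for unknown words.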

The learning algorithm generates summary statistics bottom-up from the training data it is given: a large number of sentences or texts which have already been tagged with the same part-of-speech scheme (i.e., a corpus).

It is not necessary to make many assumptions about the grammar of the language we are working with to obtain results comparable to the best reported in the literature. The computer does not need to ‘know’ what a noun or a verb is. It can simply obtain statistics about these different categories from the corpus.

But these algorithms *do* embody some assumptions about their language input. These assumptions can be enumerated as follows, although different classification schemes might vary in some details:

1. language consists of **sentences** divided into lexical **words**;
2. each **sentence** is capable of being analysed separately;
3. **words** include part-words such as genitive markers and cliticised words, and compounds, where multiple words can be given the same tag;
4. there is a fixed set of **word class tags** that each particular instance of a word can be categorised by – these commonly consist of word class category (noun, verb, etc.), plus secondary information (plural proper noun, copular verb, etc.);
5. these tags were correctly applied to the **training data**.

Databases extracted by the learning algorithm typically consist of **frequency distributions** for every word-tag pattern, i.e. the number of cases in the training corpus where a given lexical word has a particular tag; and **transition probabilities** for each word-tag pattern if words have more than one tag.
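As a rough illustration of what such a database might contain, the following Python sketch collects word-tag frequencies and tag-to-tag transition counts from a tiny tagged corpus. The training sentences, tag labels and the function name `learn` are invented for illustration, and the sketch simplifies by storing raw counts of adjacent tag pairs rather than probabilities for each word-tag pattern.

```python
from collections import Counter, defaultdict

# Hypothetical toy training corpus: each sentence is a list of (word, tag) pairs.
TRAINING = [
    [("the", "DET"), ("old", "ADJ"), ("man", "N")],
    [("the", "DET"), ("old", "N"), ("man", "V"), ("the", "DET"), ("boats", "N")],
    [("the", "DET"), ("man", "N"), ("sleeps", "V")],
]


def learn(tagged_sentences):
    """Collect summary statistics bottom-up from tagged sentences."""
    word_tag_freq = defaultdict(Counter)  # word -> Counter of tags
    tag_transitions = Counter()           # (tag, next tag) -> count
    for sentence in tagged_sentences:
        for word, tag in sentence:
            word_tag_freq[word][tag] += 1
        # Pair each tag with the tag of the following word.
        for (_, tag), (_, next_tag) in zip(sentence, sentence[1:]):
            tag_transitions[(tag, next_tag)] += 1
    return word_tag_freq, tag_transitions


word_tag_freq, tag_transitions = learn(TRAINING)
print(dict(word_tag_freq["old"]))   # ambiguous: seen as both ADJ and N
print(dict(word_tag_freq["the"]))   # unambiguous in this toy corpus
print(tag_transitions.most_common(3))
```

Note that the function never needs to ‘know’ what a noun or a verb is: the tag labels are opaque strings, and the only assumptions built in are those listed above.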

The performance of these linguistically unsophisticated algorithms is striking. **A typical tagger trained on a million words of English using a standard set of tags will make the correct decision for new sentences of a similar type some 95% of the time.**

Different algorithms may vary in storage efficiency. My crude simulated annealing stochastic tagger (Wallis 2012), which stores transition probabilities exhaustively, is less space-efficient than Eric Brill’s patch tagger (Brill 1992). *However, they obtain similar results.*

The residual 5% of incorrectly tagged cases tend to be idiomatic, or part of a multi-word string of ambiguous words, or the result of weaknesses in the training data.

To address these weaknesses we can make a number of improvements.

- **Store a finite set of idioms, strings or compounds.** This is a bit clumsy and *ad hoc*, and doesn’t scale well, but can actually improve performance.
- **Add modules to the database and algorithm.** The Brill tagger employs some simple *ad hoc* regular morphology detection at an initial stage. A more thorough approach might consist of a morphological model of ‘lemmatisation’ (identifying word stems and affixes, e.g. *re-educated* → *re–* + *educate* + –*ed*). The advantage of this step is that even if we don’t have the word *re-educated* in our training set we can recognise *educate* as a verb and the entire word as an adjective or verb. Generalisation allows us to pool statistics, so we can have more reliable rules, and compress information, so we don’t have to store separate statistics for every single word.
- **Create a more general type of rule.** The rules we have described were tied to particular words, such as *old*. It would be more efficient if we had a rule that said something like ‘for any word capable of being either an adjective or a noun, if it is followed by an adjective or noun, then it is likely to be an adjective.’ *Note that to create such a rule we have to look for it* (this is precisely what the Brill tagger does).
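To make the morphological step concrete, here is a minimal Python sketch of affix stripping against a stem lexicon. It is an assumption-laden toy, not the Brill tagger’s method: `KNOWN_STEMS`, `PREFIXES`, `SUFFIXES` and the function `analyse` are all hypothetical, and the only spelling rule handled is a final *e* dropped before a suffix.

```python
from typing import Dict, Optional, Tuple

# Hypothetical resources: a stem lexicon and affix lists invented for illustration.
KNOWN_STEMS: Dict[str, str] = {"educate": "V", "walk": "V"}
PREFIXES = ("re-", "un")
SUFFIXES = ("ed", "ing", "s", "")  # the empty suffix means 'no suffix'


def analyse(word: str) -> Optional[Tuple[str, str, str, str]]:
    """Return (prefix, stem, suffix, stem word class), or None if no known stem is found."""
    # Try each matching prefix, then fall back to no prefix at all.
    prefixes = [p for p in PREFIXES if word.startswith(p)] + [""]
    for prefix in prefixes:
        rest = word[len(prefix):]
        for suffix in SUFFIXES:
            if not rest.endswith(suffix):
                continue
            base = rest[: len(rest) - len(suffix)]
            # Try the bare base, then restore a final 'e' dropped before the suffix.
            for stem in (base, base + "e"):
                if stem in KNOWN_STEMS:
                    return prefix, stem, suffix, KNOWN_STEMS[stem]
    return None


print(analyse("re-educated"))  # stem 'educate' is found although the full word was never stored
print(analyse("walking"))
print(analyse("old"))          # no known stem in this toy lexicon
```

The point of the sketch is the one made above: every line of it encodes knowledge about English morphology that was supplied top-down rather than learned from the corpus.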

But now let us consider where this path has taken us. Every step we have proposed to improve the performance of this corpus-driven algorithm requires the insertion of knowledge about idioms, morphology and grammar, top-down, into the algorithm.

A methodological corpus-driven purism that stated that we must work exclusively bottom-up was a little disingenuous, because we had to employ auxiliary assumptions (1) to (5) above from the start.

But now every improvement we wish to make requires further theoretical assumptions. It turns out that it is not possible to perform part-of-speech tagging without assumptions, and to improve the algorithm we need more theory.

Finally, whereas the learning algorithm might work bottom-up, the tagging algorithm itself works top-down, in that it applies its knowledge base of word-tag probabilities to new corpus data.

I have the utmost respect for corpus-driven linguists. The discipline of examining data with minimal assumptions is absolutely crucial! All scientists have to examine the data *as it is*, not compartmentalise it according to pre-given assumptions.

Over the years I have written extensively on not taking queries for granted, and directed corpus researchers to continually review the underlying sentences from which their statistics are derived.

However, it is simply not possible to work without *any* assumptions, even when building a bottom-up computer algorithm like a part-of-speech tagger.

So I would conclude that corpus-driven research is properly located as part of a larger research cycle, in which it is valid and reasonable to work bottom-up and top-down at different times. Corpus-driven research methods are part of a family of exploratory methods from which all corpus linguists should draw. Insights from computationally-obtained summary statistics (whether from collocations, *n*-grams, phrase frames, indexes, or databases of part of speech taggers) are important resources for further research.

But insisting that the only legitimate corpus methods are bottom-up prevents us carrying out research with a corpus which asks questions that are inevitably framed by a particular theory.

Brill, E. 1992. A simple rule-based part of speech tagger. In *Proceedings of the third conference on applied natural language processing* (ANLC ’92). Association for Computational Linguistics, Stroudsburg, PA, USA, 152-155.

Sinclair, J., Hanks, P., Fox, G., Moon, R., Stock, P. and others (eds.) 1987. *Collins Cobuild English Language Dictionary*. London: Collins.

Sinclair, J., Fox, G., Bullon, S., Krishnamurthy, R., Manning, E., Todd, J. and others (eds.) 1990. *Collins Cobuild English Grammar*. London: Collins.

Wallis, S.A. 2012. *Tagging ICE Philippines and other corpora*. London: Survey of English Usage. **»** ePublished

When the entire premise of your methodology is publicly challenged by one of the most pre-eminent figures in an overarching discipline, it seems wise to have a defence. Noam Chomsky’s famous objection to corpus linguistics therefore needs a serious response.

“One of the big insights of the scientific revolution, of modern science, at least since the seventeenth century… is that arrangement of data isn’t going to get you anywhere. You have to ask probing questions of nature. That’s what is called experimentation, and then you may get some answers that mean something. Otherwise you just get junk.” (Noam Chomsky, quoted in Aarts 2001).

Chomsky has consistently argued that the systematic *ex post facto* analysis of natural language sentence data is incapable of taking theoretical linguistics forward. In other words, corpus linguistics is a waste of time, because it is capable of focusing only on external phenomena of language – what Chomsky has at various times described as ‘e-language’.

Instead we should concentrate our efforts on developing new theoretical explanations for the internal language within the mind (‘i-language’). Over the years the terminology has varied, but the argument has remained the same: real linguistics is the study of i-language, not e-language. Corpus linguistics studies e-language. Ergo, it is a waste of time.

Chomsky refers to what he calls ‘the Galilean Style’ to make his case. This is the argument that it is necessary to engage in theoretical abstractions in order to analyse complex data. “[P]hysicists ‘give a higher degree of reality’ to the mathematical models of the universe that they construct than to ‘the ordinary world of sensation’” (Chomsky, 2002: 98). We need a theory in order to make sense of data, as so-called ‘unfiltered’ data is open to an infinite number of possible interpretations.

In the Aristotelian model of the universe the sun orbited the earth. The same data, reframed by the Copernican model, was explained by the rotation of the earth. However, the Copernican model of the universe was not arrived at by theoretical generalisation alone, but by a combination of theory and observation.

Chomsky’s first argument contains a kernel of truth. The following statement is taken for granted across all scientific disciplines: **you need theory to analyse data**. To put it another way, there is no such thing as an ‘assumption free’ science. But the second part of this argument, that the necessity of theory permits scientists to dispense with engagement with data (or even allows them to dismiss data wholesale), is not a characterisation of the scientific method that modern scientists would recognise. Indeed, Beheme (2016) argues that it is also a mischaracterisation of Galileo’s method. Galileo’s particular fame, and his persecution, came from one source: the observations he made through his telescope.

In astronomy it is necessary to build physical theories of the universe to make sense of observed data. Astronomical science must proceed by a process of theory building, attempting to account for observations within the theoretical framework. Moreover, rather than relying on naive Popperian refutation (abandoning a theory if one observation appears to contradict the theory), science tends to rely on **triangulation** (approaching the same theoretical generalisation from multiple sources and directions), and **pluralism**, i.e. the existence of competing theories such that if one fails another may replace it (Putnam 1974). Triangulation may also mean designing new experiments to test theoretical predictions as technology advances – such as viewing the earth from space, or placing atomic clocks on airliners to test special relativity.

Arguing for the necessity of theory is not an argument against corpus linguistics *per se*, but it is an argument against a particular type of corpus linguistics practice. The ‘Birmingham School’ of corpus linguistics, most associated with John Sinclair, has prided itself on making minimal theoretical assumptions and working bottom-up from words themselves. Some of the results of this approach are impressive. However,

- this type of corpus linguistics is not theory neutral or assumption free (e.g. we assume that *w*₁, *w*₂ are words, and a word is a linguistically meaningful unit);
- the process of validating theoretical generalisations entails a linguistic decision based on an external theory (e.g. there exists a distinct wordclass termed ‘adjective’);
- once theoretical generalisations are derived bottom-up (e.g. cases of *w*₁, *w*₂, etc. are members of the set of adjectives), we arrive at a methodological paradox.

Sinclair’s methodological paradox is simply this: if it is true that statements of the kind ‘*w*₁ is an adjective’ are linguistically valuable, then it follows that when analysing new data, we should exploit this new knowledge. However, Sinclair’s method is to work inductively from new data without making such *a priori* assumptions. Either he has to dispense with his previous conclusions, and start from scratch, or he has to change his method.

In conclusion, the argument that you need theory to interpret data, because data has multiple possible interpretations, is correct. However, this statement does not extend to permitting scientists to select data to fit their theory. Awkward and challenging results may not be ignored.

Moreover, if Chomsky’s argument were correct, no scientific field would ever arrive at a dominant scientific model. Every scientist could adopt different theoretical frameworks and premises because there would be no agreed process for either refuting a theory or determining the outcome of competition between theories. Science has a pattern of both pluralistic competitive research *and* consensus-forming around ‘strong theories’. Chomsky’s characterisation of science may be a description of the fractious state of linguistics, but it departs from the scientific method.

I would suggest that it would be preferable to make linguistics more like science, rather than to make science more like linguistics.

Chomsky’s second argument is that the process of translation from internal to external language is subject to error. Consequently, studying e-language is not a productive way to study i-language. We need to study i-language, therefore we should reject corpus data.

This argument has been more influential than the first.

It also appears to be a reasonable criticism of a certain kind of corpus linguistics. Corpus linguistics has tended to focus on word frequencies, which, in the absence of a theoretical interpretation as to *why* certain forms might be more frequent than others, simply becomes descriptive. Chomsky can reasonably summarise this as studying the epiphenomena of linguistics.

By contrast, theoretical linguists have tended to use an introspective method (backed up occasionally with second-party elicitation) on the grammatical acceptability of test sentences. This is a scholastic approach drawn from traditional prescriptive grammars. The method contains a significant subjective element, even when data is drawn from elicitation experiments with large numbers of test subjects. Direct introspection simply tells us that we *believe* a sentence to be ‘grammatical’.

Could this type of research question be posed with corpus data? No, but corpus linguists do not have to dispense with introspective insight. Corpus linguists are linguists too!

Moving from million-word to billion-word POS-tagged corpora has not generated greater insight, merely more robust results. However, this observation is properly a criticism of the research foci of much corpus linguistics as practised. (I would argue that this is a limitation of POS-tagged corpus research.) It is not an argument against corpus *data*.

However, there are two reasons why Chomsky’s second argument cannot hold. The first is what we might call **the ‘linguists are not God’ reason**.

Linguists do not have special access to i-language data. Their data is from introspection, elicitation or even corpora. But *this* data is also external language! If there were no systematic mapping between i-language and e-language within an individual, ‘i-linguistics’ would not be possible.

Chomsky and his followers could theorise about any number of internal models. But they could never choose between them except by appealing to some general abstract principle, such as Occam’s razor (simplicity). Introspection and experiment cannot penetrate the question because *all* linguistic data is in fact e-language data.

The best, most robust, carefully-obtained data from uncued experimental settings is still e-language. It may be collected in a more focused (and artificial) way than corpus data, but it is also no more ‘internal’ than corpus data. Introspection data obtained in experiments may capture subjective grammatical expectations, but the results are no more scientific than those from any other scientist’s introspection. Physicists do not despair of their equipment and resort to interviewing their peers! Perhaps linguists should follow their lead.

The second counter-argument is that the process of articulating i-language as e-language is a *cognitive* one, that is, it takes place through cognitive processes in the mind. According to Chomsky, this process exposes the pure i-language to the distorting prism of articulation, and thereby makes e-language unreliable data.

However, if this were true, the same objection would necessarily be true for the generation of i-language in the first place. **If articulation of e-language is subject to error, the generation of i-language itself is also bound to be error-prone.**

Random variation, cultural bias, personal preference, processing interference, etc, can take place at either stage, because these phenomena are artefacts of actual neurological pathways. Different types of error may arise at different locations, but there is no special error-free part of the brain. Speakers under the influence of alcohol have confused thoughts *and* slur their words. Alcohol, like error, is not selective.

On the other hand, a number of corpus linguists, including Geoffrey Leech, have commented on the regular ‘grammaticality’ of even the most informal spontaneous speech data. This observation should not be surprising – if speech data did not follow grammatical rules, speakers would not understand each other, and, given the historical and ontological primacy of speech over writing, language could never develop!

There may be noise in the signal, but the signal is not exclusively noise. We should not give up on corpora just yet.

Corpus data is simply uncued natural language data (sometimes termed ‘ecological’ data) as distinct from data obtained in an experimental setting. The key advantage of experimental data is that a researcher can manipulate variables under investigation and avoid variation in potentially confounding variables while obtaining data. A secondary advantage may be that one can construct a setting that provides a high frequency of sought-after phenomena that might otherwise be rare in a corpus. The disadvantages are the risk that the experimental conditions obtained are artificial (and possibly artificially *cued*), and the cost of obtaining and annotating data.

A corpus could contain experimental data, or data obtained by experiment could be annotated to the same level as a parsed corpus such as ICE-GB. These methods are not in competition but are complementary. A corpus can provide test data for experiments, identify potentially worthwhile experiments, and provide a control for experimental outcomes.

Corpus linguistics offers three kinds of evidence to a theoretical linguist – factual evidence that phenomena exist, evidence of frequency and distribution, and ‘interaction evidence’ pertaining to the co-occurrence of phenomena (Wallis 2014).

There is no need to discount corpora as a lesser source, or one more likely to be tainted by error than other sources. It is a *different* source of evidence, one that requires due methodological care, but one that has the potential for both the evaluation of theory against real-world natural language and robust statistical evaluation.

If data can only be studied by first relating it to a theory, then theoretical linguists first need to pay attention to how corpora are annotated. Do corpora contain useful representations for linguistic research? Are phenomena of interest to linguists capable of being captured within the corpus?

‘Annotation’ is the process of systematically applying a theoretical description to all the texts in a corpus. A decision to annotate instances of a particular phenomenon entails significant effort. All such instances in the corpus must be identified, and each decision must be properly motivated. As with classification schemes in science (e.g. the periodic table), linguistic phenomena are not simply identified, but related within a coherent annotation scheme. It follows that the entire scheme must be linguistically defended and systematically applied.

Syntacticians should pay particular attention to parsed corpora. It follows that if linguists are studying grammar then grammatically analysed corpora (‘parsed corpora’ or ‘treebanks’) are likely to be much more valuable than corpora with part-of-speech wordclass tags applied to each word. However, there is wide disagreement between theoretical linguists as to which grammatical scheme is optimal.

Inevitably the effort of annotation means that one has to choose a particular scheme at a particular point in time and systematically apply it. This poses a problem for researchers using the corpus. If they are stuck in a ‘hermeneutic trap’, only able to pose research questions within the annotation framework, and engage in circular reasoning, then corpus linguistics has a serious problem. After the huge effort of annotation you can only please a small number of linguists!

The solution to this problem offered by Wallis and Nelson (2001) is ‘abstraction’ – a process of reinterpretation of the annotated sentences from the representation in the corpus to the preferred representation of the linguist researcher, which takes place during the research process itself. Linguists do not have to accept the theoretical framework applied to a corpus in order to use it. Instead, the corpus representation is considered simply as a ‘handle on the data’, a method for systematically obtaining data across a corpus. It is not necessary to accept the framework uncritically.

In practice this means that researchers might find themselves constructing logical combinations of structural queries to retrieve a dataset aligned to their research theory and goals. But this is a small price to pay for having a grammatical framework already applied and evaluated against corpus data.

Finally, abstraction is not an end goal but a means to obtaining an abstracted dataset expressed in terms commensurate with the theoretical demands of the researcher. It is this dataset that may then be subject to a third process, one we refer to as ‘analysis’, hence the ‘3A’ model of corpus linguistics, distinguishing the stages of annotation, abstraction and analysis.

Aarts, B. 2001. Corpus linguistics, Chomsky and Fuzzy Tree Fragments. In: C. Mair and M. Hundt (eds.) *Corpus linguistics and linguistic theory*. Amsterdam: Rodopi. 5-13.

Beheme, C. 2016. How Galilean is the ‘Galilean Method’? *History and Philosophy of the Language Sciences*, http://hiphilangsci.net/2016/04/02/how-galilean

Chomsky, N. 2002. *On Nature and Language*. Cambridge: Cambridge University Press.

Putnam, H. 1974. The ‘Corroboration’ of Scientific Theories, republished in Hacking, I. (ed.) (1981), *Scientific Revolutions*, Oxford Readings in Philosophy, Oxford: OUP. 60-79.

Wallis, S.A. 2014. What might a corpus of parsed spoken data tell us about language? In L. Veselovská and M. Janebová (eds.) *Complex Visibles Out There*. Olomouc: Palacký University, 2014. 641-662. **»** Post

Wallis, S.A. and Nelson G. 2001. Knowledge discovery in grammatically analysed corpora. *Data Mining and Knowledge Discovery*, **5**: 307–340.

Recently I’ve been working on a problem that besets researchers in corpus linguistics who work with samples which are not drawn randomly from the population but rather are taken from a series of sub-samples. These sub-samples (in our case, texts) may be randomly drawn, but we cannot say the same for any two cases drawn from the same sub-sample. It stands to reason that two cases taken from the same sub-sample are more likely to share a characteristic under study than two cases drawn entirely at random. I introduce the paper elsewhere on my blog.

In this post I want to focus on an interesting and non-trivial result I needed to address along the way. This concerns the concept of **variance** as it applies to a Binomial distribution.

Most students are familiar with the concept of variance as it applies to a Gaussian (Normal) distribution. A Normal distribution is a continuous symmetric ‘bell-curve’ distribution defined by two parameters, the **mean** and the **standard deviation** (the square root of the variance). The mean specifies the position of the centre of the distribution and the standard deviation specifies the width of the distribution.

Common statistical methods on Binomial variables, from χ² tests to line fitting, employ a further step. They approximate the Binomial distribution to the Normal distribution. They say, *although we know this variable is Binomially distributed, let us assume the distribution is approximately Normal*. The variance of the Binomial distribution becomes the variance of the equivalent Normal distribution.

In this methodological tradition, the variance of the Binomial distribution loses its meaning with respect to the Binomial distribution itself. It seems to be only valuable insofar as it allows us to parameterise the equivalent Normal distribution.

What I want to argue is that in fact, the concept of the variance of a Binomial distribution is important in its own right, and we need to understand it with respect to the Binomial distribution, not the Normal distribution. Sometimes it is not necessary to approximate the Binomial to the Normal, and if we can avoid this approximation our results are likely to be stronger as a result.

Every fundamental primer in statistics approaches the problem in the following way.

A Binomial variable is a two-valued variable (hence ‘bi-nomial’). The values can be anything, but let us simply call them, according to coin-tossing tradition, ‘heads’ and ‘tails’. The proportion of cases that are heads in any randomly-drawn sample of size *n* taken from a population, which we might term *p*, is free to vary from 0 to 1. That is, all *n* cases in the sample may be heads (*p* = 1) or all may be tails (*p* = 0).

Now, suppose we know, Zeus-like, the actual proportion in the population, *P.* We don’t *have* to be a deity – we might assume that our coin is unbiased so *P* = 0.5 (heads and tails are equally probable) – but a common error is when people get big *P* (true value in the population) and little *p* (observed value in a sample) muddled up. Let’s leave observed *p* aside for a minute.

We can calculate the distribution for *P* and *n* using the following Binomial formula:

*Binomial distribution B*(*r*) = *nCr* *P*^{r} (1 – *P*)^{(n – r)}, (1)

where *r* ranges from 0 to *n*. This means that the probability of obtaining exactly *r* heads out of *n* coin tosses is calculated by multiplying

- the **combinatorial function** *nCr* (the number of unique ways we can obtain exactly *r* cases out of *n* cases);
- the **probability that *r* cases are heads**, *P*^{r}; and
- the **probability that the remainder are tails**, (1 – *P*)^{(n – r)}.

This formula obtains the ideal Binomial distribution.
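As a minimal sketch, the Binomial formula can be computed directly in Python. The function name `binomial_distribution` is an assumption introduced here; `math.comb` supplies the combinatorial function *nCr*.

```python
from math import comb
from typing import List


def binomial_distribution(n: int, P: float) -> List[float]:
    """Return [B(0), ..., B(n)]: the chance of exactly r heads in n tosses."""
    return [comb(n, r) * P**r * (1 - P) ** (n - r) for r in range(n + 1)]


# Ten tosses of an unbiased coin.
dist = binomial_distribution(10, 0.5)
for r, b in enumerate(dist):
    # r is the integer scale; r / 10 is the same outcome on the probabilistic scale.
    print(f"r = {r:2d}  p = {r / 10:.1f}  B(r) = {b:.4f}")

# The columns of the distribution sum to 1 (up to floating-point rounding).
print(sum(dist))
```

Calling `binomial_distribution(10, 0.9)` gives the equivalent distribution for a ‘trick’ coin, which is no longer symmetric.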

The graph below shows what this looks like for ten tosses of an unbiased coin, where *P* = 0.5 and *n* = 10. The mean of this distribution is *nP*, i.e. 0.5 × 10 = 5.

**Note.** Equation (1) also works for a ‘trick’ coin, e.g. where *P* = 0.9 (9 times out of 10 we obtain heads). Although most primers first show a graph of *P* = 0.5, few real-world Binomial variables are equiprobable. (Don’t be misled by the symmetry of this graph.)

This distribution has a number of important characteristics.

- The most obvious characteristic is that it is **discrete** – the only possible values of *r* are integer values from 0 to *n*. Therefore if we sample 10 coin tosses, an observed probability *p* could be 0, 0.1, 0.2, right up to 1. If the true value of *P* was 0.45, we could not observe *p* = 0.45 if we only had ten coin tosses.
- A less obvious, but important, characteristic is that this distribution is **probabilistic** – the sum of all columns ∑*B*(*r*) = 1.
- Finally, for all values of *P* other than 0.5, the distribution is **asymmetric**. See below.

You can also see how unlikely it is that all coins are heads or all tails. The chance of this happening is not zero, but it is small. There is only one possible combination of heads and tails where all ten coins are heads (HHHHHHHHHH) out of 1,024 (2^{n}) possible patterns. The probability of observing *p* = 1 is therefore 1 in 1,024, and the same is true for *p* = 0 (all tails).

There are ten ways that one coin will be a tail and nine heads (THHHHHHHHH, HTHHHHHHHH,… HHHHHHHHHT), and so on.

The combinatorial function *nCr* tells us exactly how many different ways we can obtain *r* cases out of *n* potential cases. The full formula is given in equation (2) below, where *x*! means the factorial of *x*, or *x*(*x*-1)(*x*-2)…(1).

*combinatorial function nCr* = *n*! / ((*n* – *r*)! *r*!). (2)

You should be able to see that in cases where *r* = 0 or *r* = *n*, *nCr* = 1; where *r* = 1 or *r* = *n*-1, *nCr* = *n*.
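The counts described above can be checked by brute force. The sketch below enumerates every pattern of ten tosses and compares the number of patterns containing *r* heads with equation (2). The helper name `n_choose_r` is an illustrative choice.

```python
from itertools import product
from math import factorial

def n_choose_r(n, r):
    """Equation (2): n! / ((n - r)! r!)."""
    return factorial(n) // (factorial(n - r) * factorial(r))

n = 10
# Enumerate all 2^n patterns of heads (H) and tails (T),
# and count how many patterns contain exactly r heads.
counts = [0] * (n + 1)
for pattern in product("HT", repeat=n):
    counts[pattern.count("H")] += 1

for r in range(n + 1):
    print(f"r = {r:2d}  enumerated = {counts[r]:4d}  nCr = {n_choose_r(n, r):4d}")
```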

If *P* = 0.5 then the Binomial function (1) above becomes simply

*B*(*r*) = *nCr P*^{r} (1 – *P*)^{(n – r)} = *nCr* × 0.5^{n}.

However, the general function is much more flexible. It allows us to consider distributions for different values of *P*. (Again, these are plotted on an integer scale.)

Note that these distributions are clearly asymmetric, being centred at *P* < 0.5 and bounded by 0 and *n*. As *P* approaches zero this asymmetry becomes more acute.

Another aspect we can immediately see from the graphs above is that, as *P* approaches zero, the distribution becomes not only less symmetric but also more concentrated. We say that the variance of the distribution decreases.

The variance of a Binomial distribution on the integer scale *r* = 0…*n* can be obtained from the function

*(integer) variance S*² = *nP*(1 – *P*).

To compare different-sized samples, we obviously need to use the same scale. The simplest standardisation is to adopt a probabilistic scale, i.e. where *p* = 0…1. To do this we divide this formula by *n*². The variance of a Binomial distribution on a **probabilistic scale** is obtained from the function

*(probabilistic) variance S*² = *P*(1 – *P*)/*n*. (3)

Thus if *P* = 0.5 and *n* = 10, *S*² = 0.025. If *P* = 0.1 and *n* = 10, *S*² = 0.009. (You shouldn’t need a calculator to work this out!) This formula has the following properties.

- For the same *n* > 1, as *P* tends to zero, *P*(1 – *P*) will also tend to 0. (Consider: if a coin had zero chance of being a head, it will always be a tail!)
- For the same *P* > 0, as *n* increases, *P*(1 – *P*)/*n* decreases. (Obviously if *P* = 0 then *S*² cannot decrease!)
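Both properties can be checked numerically. The following sketch computes the variance on the integer and probabilistic scales. The function names, and the particular values of *P* and *n* looped over, are illustrative choices.

```python
def integer_variance(n, P):
    """Variance on the integer scale r = 0..n: nP(1 - P)."""
    return n * P * (1 - P)

def probabilistic_variance(n, P):
    """Equation (3): variance on the probabilistic scale p = 0..1."""
    return P * (1 - P) / n

# As P tends to zero the variance falls; as n rises the variance falls.
for n in (10, 100):
    for P in (0.5, 0.3, 0.1, 0.01):
        print(f"n = {n:3d}  P = {P:4.2f}  "
              f"integer = {integer_variance(n, P):8.4f}  "
              f"probabilistic = {probabilistic_variance(n, P):.6f}")
```

Dividing the integer-scale variance by *n*² gives the probabilistic-scale value in every row.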

Variance is simply the square of the standard deviation of the same distribution:

*standard deviation S* ≡ √(*P*(1 – *P*)/*n*).

The concepts of variance and standard deviation are usually applied to the **Normal distribution**. Here they have immediate meaning because, as we noted in the introduction, a Normal distribution can be described by two parameters: the **mean**, in this case *P*, and the **standard deviation**, *S*.

Indeed, in the same statistics primers, at around this point we are encouraged to set aside what we have learned about the Binomial distribution and simply assume that it is ‘close to’ the Normal distribution *N*(*P*, *S*). We might see comments that this is an acceptable step for large *n* or where both *nP* and *n*(1 – *P*) > 5.

It is worth emphasising: this step (due to an observation by de Moivre in the 18th Century) is an **approximation**. The Binomial and Normal distributions are different. Here is the distribution for *P* = 0.3 again, but this time with a Normal distribution approximated to it. There is a small difference between the two mid-points, which we have labelled as ‘error’.

- Most obviously, the Normal distribution is **continuous** rather than discrete. This means we can obtain an estimate for the expected probability that *p* = 0.45.
- Like the Binomial distribution, the standardised Normal distribution is also **probabilistic**, i.e. the area under the curve sums to 1.
- Finally, the Normal distribution is **symmetric**. Moreover, it assumes that the observed variable is unbounded. An unbounded variable is free to vary from minus infinity (-∞) to plus infinity (+∞). (This is a corollary: if the variable was bounded, it could not be symmetric.)
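To make the contrast concrete, the sketch below compares the Binomial distribution for *P* = 0.3 with a Normal distribution *N*(*P*, *S*). The sample size *n* = 10 is an assumption made for illustration (the text does not state the *n* used in the figure), and scaling the Normal density by the column width 1/*n* is one simple way of putting the two on the same footing, not necessarily the one used to draw the graph.

```python
from math import comb, sqrt
from statistics import NormalDist

P, n = 0.3, 10                      # n = 10 is assumed for illustration
S = sqrt(P * (1 - P) / n)           # standard deviation on the probabilistic scale
normal = NormalDist(mu=P, sigma=S)

def binomial_pmf(r, n, P):
    """Equation (1)."""
    return comb(n, r) * P**r * (1 - P)**(n - r)

for r in range(n + 1):
    p = r / n
    b = binomial_pmf(r, n, P)
    approx = normal.pdf(p) / n      # Normal density times column width 1/n
    print(f"p = {p:.1f}  Binomial = {b:.4f}  Normal = {approx:.4f}  "
          f"difference = {approx - b:+.4f}")

# The Normal curve is unbounded, so some of its area lies outside [0, 1].
below_zero = normal.cdf(0.0)
above_one = 1 - normal.cdf(1.0)
print(f"Normal area below 0: {below_zero:.4f}, above 1: {above_one:.6f}")
```

The Binomial columns are all zero outside 0 to *n*, whereas the Normal curve places a small but non-zero area below *p* = 0.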

It is worth considering this last point. Many statistics textbooks use example variables from the natural and physical sciences.

- For example, the height of children in a class, which we might call *H*, is usually considered to be an unbounded variable, suitable for the Normal distribution.
- But in fact, the height of children is a bounded variable.
  - **It has a lower limit.** At the risk of stating the obvious, children cannot be less than zero height(!), and indeed, to be permitted to go to school, must be of a certain age and be physically safe to do so. *H* must have a lower limit rather greater than zero.
  - **It has an upper limit.** A number of factors, from growth rates to the physical strength of bone, limit the possible height of children.
- Far from being unbounded, *H* is bounded by biology!

What everyone does is assume that the observed mean height is **so far** from the bounds that although the bounds exist, they have negligible effect on the distribution. (This is not always a healthy assumption, but it is the source of these injunctions to only approximate to the Normal distribution in cases where *nP* > 5.)

On the other hand, Binomial variables (and the Binomial distribution) are **strictly** bounded. We may write, e.g. *P* ∈ [0, 1], which simply means “*P* ranges from 0 to 1 inclusive”. The probability *P* may also be expressed as a proportion or percentage, so we might say that a rate can be any value from 0% to 100%.

So far we have discussed the *ideal* Binomial distribution. Equation (1) is the mathematical extrapolation of the likelihood, *B*(*r*), of observing *r* future results for a sample of *n* cases drawn randomly from a population if the true rate in the population was *P*.

In some circumstances we may *observe* a Binomial distribution. I do this in class with students – each student tosses a coin a fixed number of times and we note down the number of students who had 0 heads, 1 head and so on.

In the paper I am working on, I realised that this principle can also be employed to identify the extent to which a corpus sample might deviate from an ideal random sample for a given variable. This is an important question for corpus linguistics.

The first step is to partition the corpus sample into subsamples according to the text that they are drawn from. To all intents and purposes, these texts can be assumed to be random even if they were not subject to controlled sampling.

Note that two cases drawn from different texts are therefore likely to be independent and equivalent to a pair of cases in a true random sample. However two cases from the same text may share characteristics. There are all sorts of reasons why this is likely to be the case, from a shared topic to personal preferences, priming and other psycholinguistic effects. The reason does not actually matter – we just need to recognise this is likely to be the case.

**Question:** How may we measure the deviation of the corpus sample from an ideal random sample?

**Answer:** By studying the distribution of these subsamples.

Suppose the subsamples are equivalent to random samples. Even though cases are drawn from the same text, suppose it turns out that the particular variable is not sensitive to context, previous utterances, etc. In this case, we would expect these sub-samples to be Binomially distributed.

To plot the following graph we first ‘quantise’ (round up or down to a particular number) the observed probability *p*. The vertical axis, *f*, is simply the number of texts in the direct conversations category of ICE-GB where the probability that a clause is interrogative, *p*(inter), is 0, 0.01, 0.02, etc. There are 90 texts in this category. We can see that this distribution is approximately Binomial.
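The quantising step can be sketched as follows. The per-text counts are made up for illustration (they are not ICE-GB figures), and the function name `quantise` and the step of 0.01 are assumptions matching the description above.

```python
from collections import Counter

# Hypothetical (interrogative clauses, total clauses) counts, one pair per text.
texts = [(12, 310), (7, 295), (15, 402), (9, 288), (11, 350), (4, 260)]

def quantise(p, step=0.01):
    """Round p to the nearest multiple of step."""
    return round(round(p / step) * step, 10)

# f = number of texts whose quantised probability takes each value
frequencies = Counter(quantise(hits / total) for hits, total in texts)

for p in sorted(frequencies):
    print(f"p = {p:.2f}  f = {frequencies[p]}")
```

Plotting *f* against the quantised *p* gives a frequency distribution of the kind described.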

We may calculate the variance of this observed distribution with the following pair of formulae, derived from Sheskin (1997).

The first estimate (4) does not take into account the fact that samples are drawn from a population, whereas the second measure, termed the *unbiased estimate of the population variance*, does. For that reason, we here use capital *P* to refer to each probability in the first case and lower case *p* to refer to observations.

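*variance of a set of scores* *s’*_{ss}² = ∑(*P*_{i} – *P̄*)² / *t*, (4)

*observed between-subsample variance s*_{ss}² = ∑(*p*_{i} – *p̄*)² / (*t* – 1), (5)

where *p*_{i} is the observed probability for subsample *i*, *p̄* is the mean of these probabilities (*P*_{i} and *P̄* are their counterparts in the first case), and *t* is the number of subsamples.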

Equations (4) and (5) have one deficiency. They assume that each subsample is of the same size. This is fine for classroom coin-tossing. It is unlikely to be the case in a corpus sample.

The estimate of variance for a set of different-sized subsamples can be obtained from

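*variance of a set of scores (different sizes)* = *s’*_{ss}² = ∑*pr*_{i} (*P*_{i} – *P̄*)², (6)

*observed between-subsample variance s*_{ss}² = *t*/(*t* – 1) × ∑*pr*_{i} (*p*_{i} – *p̄*)², (7)

where *pr*_{i} = *n*_{i}/*n* is the proportion of all *n* cases that fall in subsample *i*.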

It is possible to prove that if *pr*_{i} is equal to the Binomial probability *B*(*r*) and *p*_{i} = *r*/*n*, then

∑*nCr P*^{r} (1 – *P*)^{(n – r)} (*r*/*n* – *P*)² = *P*(1 – *P*)/*n*.

This means that equation (6) *defines the correct mathematical relationship between a Binomial distribution on a probabilistic scale and its expected variance*. Another way of putting this is that it is legitimate to apply equations (6) and (7) to a Binomial variable.

**Example:** To illustrate this equivalence, consider the following computation for *P* = 0.3 and *n* = 2. Equation (3) simply obtains *S*² = (0.3 × 0.7)/2 = 0.105.

| r/n | r | nCr | B(r) | B(r) × (r/n – P)² |
|---|---|---|---|---|
| 0 | 0 | 1 | 0.49 | 0.0441 |
| 0.5 | 1 | 2 | 0.42 | 0.0168 |
| 1 | 2 | 1 | 0.09 | 0.0441 |
| Totals | | 4 | 1 | 0.1050 |
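The table can be reproduced with a few lines of code. This sketch sums *B*(*r*) × (*r*/*n* – *P*)² over all *r* and compares the total with equation (3). The variable names are illustrative.

```python
from math import comb

P, n = 0.3, 2
total = 0.0

print("r/n   r  nCr  B(r)   B(r) x (r/n - P)^2")
for r in range(n + 1):
    b = comb(n, r) * P**r * (1 - P)**(n - r)    # equation (1)
    term = b * (r / n - P) ** 2                 # weighted squared deviation
    total += term
    print(f"{r / n:3.1f}  {r:2d}  {comb(n, r):3d}  {b:.2f}   {term:.4f}")

print(f"sum of terms = {total:.4f}")
print(f"P(1 - P)/n   = {P * (1 - P) / n:.4f}")
```

Other values of *P* and *n* can be substituted to check that the two totals continue to agree.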

We can therefore contrast the observed subsample variance with the variance that would be predicted assuming each subsample were a random sample, i.e. the expected Binomial variance, which in this notation would be

*predicted between-subsample variance S*_{ss}² = *p*(1 – *p*)/*t*.

If the two variance scores are the same, then to all intents and purposes, our subsamples are random samples, and the entire corpus sample can be considered a random collection of random samples, i.e. a random sample.

However, if the observed subsample variance differs from that predicted, we are entitled to take this into account when considering the variance of the corpus sample. We employ the ratio of variances, *F*_{ss}, to adjust the sample size accordingly.

*cluster-adjustment ratio F*_{ss} = *S*_{ss}² / *s*_{ss}², and (6)

*corrected sample size n’* = *nF*_{ss}.

If the observed subsample variance is greater than the predicted variance, *F*_{ss} < 1. In this case there are fewer truly independent random cases in our overall corpus sample than the raw count suggests. We therefore increase the uncertainty of our cross-corpus observation: significance tests become stricter, confidence intervals wider, etc.

In the paper, we observe that sometimes *F*_{ss} > 1 and discuss reasons for this. Suffice it to say it is certainly possible, although this may at first sight appear counter-intuitive.
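The adjustment can be sketched in code under stated assumptions. The counts below are made up, the function name `cluster_adjustment` is hypothetical, and each subsample is weighted by its share of all cases (*n*_{i}/*n*), which is an assumption about how the weights are defined. The predicted variance follows the formula given in the text, *p*(1 – *p*)/*t*.

```python
def cluster_adjustment(subsamples):
    """subsamples: list of (hits, cases) pairs, one pair per text."""
    t = len(subsamples)
    n = sum(cases for _, cases in subsamples)
    p = sum(hits for hits, _ in subsamples) / n

    # Observed between-subsample variance, weighting each text by
    # pr_i = n_i / n (an assumption about how the weights are defined).
    weighted = sum((cases / n) * (hits / cases - p) ** 2
                   for hits, cases in subsamples)
    observed = t / (t - 1) * weighted

    # Predicted between-subsample variance, as given in the text.
    predicted = p * (1 - p) / t

    # Ratio of variances and corrected sample size.
    # (Undefined if every subsample has exactly the same proportion.)
    f_ss = predicted / observed
    return {"p": p, "observed": observed, "predicted": predicted,
            "F_ss": f_ss, "n_adjusted": n * f_ss}

# Made-up counts for three equal-sized texts with very different rates
result = cluster_adjustment([(0, 10), (3, 10), (9, 10)])
for name, value in result.items():
    print(f"{name:10s} = {value:.4f}")
```

With these deliberately uneven counts the observed variance exceeds the predicted variance, so *F*_{ss} comes out below 1 and the corrected sample size is smaller than the 30 cases actually counted.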

To illustrate the method, consider the following graph. This is the same data as the figure above. You can download this spreadsheet to inspect the calculation for yourself.

Note that in this case we see a close correspondence between the two predicted distributions – Binomial and Normal. The observed distribution is also approximately Normal (accepting the randomness we would anticipate in any observed distribution of course).

The method of comparing variances we employed makes no assumptions about the Binomial approximating to the Normal distribution.

However, this method usually comes under the umbrella of analysis of variance (ANOVA), which is premised on data being Normally distributed. Instead of assuming that ANOVA *might* be legitimately employed for Binomial (bounded, asymmetric, discrete) distributions, we were concerned to *prove* that our definitions of variance were applicable to the Binomial.

Why might this matter? There are two reasons.

- The approximation to the Normal distribution is an approximation, and introduces a number of ‘smoothing’ errors as a result.
- We must ensure that the method is robust for highly skewed values of *p*.

In the figure above the Normal and Binomial distributions are similar. However, this is not always the case.

Consider the following graph (Figure 4 in the paper). Here data is drawn, not from a single genre, but across the diverse genres contained within the ICE-GB corpus, from the most highly interactive speech contexts to the most didactic of written instructional texts.

The two upper dotted lines are the predicted Normal and Binomial distributions for this observed value of *p* (0.0399) and *t* = 500 texts. You can see how the Normal distribution is narrower than the predicted Binomial.

Equation (5) captures the total variance between subsamples in this figure. It is approximately 4% of the predicted variance according to equation (3).

The lower line is the Normal distribution premised on the observed subsample variance. Again, you can see a large deviation between the observed frequency distribution (bars) and this Normal distribution, which is also clearly clipped by the lower bound at *p* = 0.

If our method were dependent on the Normal distribution, we simply could not sustain it in highly-skewed contexts such as this.

Sheskin, D.J. 1997. *Handbook of Parametric and Nonparametric Statistical Procedures*. Boca Raton, FL: CRC Press.