Occasionally it is useful to cite measures in papers other than simple probabilities or differences in probability. When we do, we should estimate confidence intervals on these measures. There are a number of ways of estimating intervals, including bootstrapping and simulation, but these are computationally heavy.

For many measures it is possible to derive intervals from the Wilson score interval by employing a little mathematics. Elsewhere in this blog I discuss how to manipulate the Wilson score interval for simple transformations of *p*, such as 1/*p*, 1 – *p*, etc.

Below I am going to explain how to derive an interval for grammatical diversity, *d*, which we can define as **the probability that two randomly-selected instances have different outcome classes**.

Diversity is an effect size measure of a frequency distribution, i.e. a vector of *k* frequencies. If all frequencies are the same, the data is evenly spread, and the score will tend to a maximum. If all frequencies except one are zero, the chance of picking two different instances will of course be zero. Diversity is well-behaved except where categories have frequencies of 1.

To compute this measure of diversity, we sum across the set of outcomes (all functions, all nouns, etc.), **C**:

*diversity d*(*c* ∈ **C**) = ∑*p*₁(*c*).(1 – *p*₂(*c*)) if *n* > 1; 1 otherwise

where **C** is a set of *k* > 1 disjoint categories, *p*₁(*c*) is the probability that item 1 is category *c* and *p*₂(*c*) is the probability that item 2 is the same category *c*.

We have probabilities

*p*₁(*c*) = *F*(*c*) / *n*, *p*₂(*c*) = (*F*(*c*) – 1) / (*n* – 1) = (*p*₁(*c*).*n* – 1) / (*n* – 1),

where *F*(*c*) is the frequency of category *c* and *n* is the total number of instances.

The formula for *p*₂ includes an adjustment for the fact that we already know that the first item is *c*. This principle is used in card-playing statistics. Suppose I draw cards from a pack. If the first card I pick is a heart, I know that there are only 12 other hearts in the pack, so the probability of the next card I pick up being a heart is 12 out of 51, not 13 out of 52.

Note that as the set is closed, ∑*p*₁(*c*) = ∑*p*₂(*c*) = 1.

The maximum score is slightly greater than (*k* – 1) / *k*: where all frequencies are equal, *d* = ((*k* – 1) / *k*) × *n* / (*n* – 1). The exception is the special case where *n* approaches *k* and there is a frequency of 1 in any category, in which case diversity can approach 1.

In a forthcoming paper with Bas Aarts and Jill Bowie, we found that the share of functions of *–ing* clauses (‘gerunds’) appeared to change over time in the *Diachronic Corpus of Present-day Spoken English* (DCPSE).

We obtained the following graph. The bars marked ‘LLC’ refer to data drawn from the period 1956-1972; those marked ‘ICE-GB’ are from 1990-1992.

This graph considers six functions **C** = {CO, CS, OD, SU, A, PC} of the clause. It plots *p*(*c*) over **C**. Considered individually, some functions significantly increase and some decrease their share. Note also that the increases appear to be concentrated in the shorter bars (smaller *p*) and the decreases in the longer ones.

Intuitively this appears to mean that we are seeing *–ing* clauses increase in their diversity of grammatical function over time. We would like to test this proposition.

Here is the LLC data.

| CO | CS | SU | OD | A | PC | Total |
|---|---|---|---|---|---|---|
| 6 | 33 | 61 | 326 | 610 | 1,203 | 2,239 |

Computing diversity scores, we arrive at

*d*(LLC) = 0.6152 and *d*(ICE-GB) = 0.6443.
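As a rough check on the definition, the LLC score can be reproduced from the frequency table with a few lines of Python. This is only a sketch: the function name `diversity` is illustrative, and the ICE-GB frequencies are not given here, so only the LLC figure is computed.

```python
def diversity(freqs):
    """Probability that two instances drawn without replacement differ in category."""
    n = sum(freqs)
    if n <= 1:
        return 1.0  # by definition, d = 1 unless n > 1
    total = 0.0
    for f in freqs:
        p1 = f / n              # first item is category c
        p2 = (f - 1) / (n - 1)  # second item is also c, given that the first was
        total += p1 * (1 - p2)
    return total


llc = {"CO": 6, "CS": 33, "SU": 61, "OD": 326, "A": 610, "PC": 1203}
d_llc = diversity(list(llc.values()))
print(round(d_llc, 4))  # 0.6152, the LLC score quoted above
```

Note that a category with zero frequency contributes nothing to the sum, because its *p*₁ is zero.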

We wish to compare these two diversity measures. The first step is to estimate a confidence interval for *d*.

First we compute interval estimates for each term, *d*(*c*) = *p*₁(*c*).(1 – *p*₂(*c*)).

- The Wilson score interval for a probability *p* is (*w*⁻, *w*⁺).

Any monotonic function of *p*, *fn*, can be applied and plotted as a simple transformation. See Reciprocating the Wilson interval. We can write

*fn*(*p*) ∈ (*fn*(*w*⁻), *fn*(*w*⁺)).

However, *d*(*c*) is not monotonic over its entire range: it reaches a maximum where *p* = 0.5. Nonetheless, the axiom holds conservatively provided that the function is monotonic across the interval (*w*⁻, *w*⁺), i.e. where 0.5 is not within the interval. The following graph plots *d*(*c*) over *p*(*c*) for a two-cell vector where *n* = 40.

We can rewrite *d*(*c*) in terms of a probability *p* and *n*,

*d*(*p*, *n*) = *p* × (1 – (*p* × *n* – 1) / (*n* – 1)).

This has the interval

*d*(*p*, *n*) ∈ (*d*(*w*⁻, *n*), *d*(*w*⁺, *n*))

provided that *w*⁺ < 0.5. To obtain the interval we have simply plugged *w*⁻ and *w*⁺ into the formula for *d*(*p*, *n*) in place of *p*.

Indeed, noting the shape of *d*, we can derive the following.

*d*(*p*, *n*) ∈ (*d*(*w*⁻, *n*), *d*(*w*⁺, *n*)) where *w*⁺ < 0.5,

*d*(*p*, *n*) ∈ (*d*(*w*⁺, *n*), *d*(*w*⁻, *n*)) where *w*⁻ > 0.5,

*d*(*p*, *n*) ∈ (min(*d*(*w*⁻, *n*), *d*(*w*⁺, *n*)), *d*(0.5, *n*)) otherwise.
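The three cases can be sketched in Python as follows. This assumes the standard Wilson score interval without continuity correction and a fixed critical value *z* ≈ 1.96 for α = 0.05; the function names `wilson`, `d_term` and `d_term_interval` are illustrative rather than taken from any published implementation.

```python
import math


def wilson(p, n, z=1.959964):
    """Wilson score interval (w-, w+) for an observed proportion p out of n."""
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom


def d_term(p, n):
    """Single-cell diversity term d(p, n) = p * (1 - (p*n - 1) / (n - 1))."""
    return p * (1 - (p * n - 1) / (n - 1))


def d_term_interval(p, n, z=1.959964):
    """Transform the Wilson interval through d, allowing for the turning point at 0.5."""
    lo, hi = wilson(p, n, z)
    if hi < 0.5:  # d rises across the whole interval
        return d_term(lo, n), d_term(hi, n)
    if lo > 0.5:  # d falls across the whole interval, so the bounds swap
        return d_term(hi, n), d_term(lo, n)
    # the interval contains 0.5: cap the upper bound at the maximum of d
    return min(d_term(lo, n), d_term(hi, n)), d_term(0.5, n)


# Two-cell example with n = 40, as in the graph described above
print(d_term_interval(0.3, 40))
print(d_term_interval(0.8, 40))
```

In each case the returned pair is ordered (lower, upper) on the diversity scale, which is what the summation steps below require.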

Next we need to sum these intervals. To do this we need to take account of the number of degrees of freedom of the vector.

Case 1: *df* = 1

If we had two values (as in our graphed example), we would have one degree of freedom. Cell probabilities *p*(1) + *p*(2) = 1, so *p*(2) would depend entirely on *p*(1), and observed variation across *p*(1) determines the variation across *p*(2). In this case we should simply sum the transformed Wilson scores:

*d*(*c* ∈ **C**) ∈ (∑*d*(*w*⁻(*c*), *n*), ∑*d*(*w*⁺(*c*), *n*)).

We can apply simple summation where intervals are strictly dependent on each other. We can obtain relative bounds of the dependent sum as:

*l*(dep) = *d* – ∑*d*(*w*⁻(*c*), *n*),

*u*(dep) = ∑*d*(*w*⁺(*c*), *n*) – *d*.

In our example we have more than one degree of freedom, and this method is too conservative.

Case 2: *df* > 1

Where probabilities are independent, some can increase and others decrease. The chance that two independent probabilities both fall within a 5% error level is 0.05². So we cannot simply add together intervals. The method of independent summation is to sum Pythagorean interval widths:

*l*(ind) = √∑[*d*(*p*(*c*), *n*) – *d*(*w*⁻(*c*), *n*)]², and

*u*(ind) = √∑[*d*(*p*(*c*), *n*) – *d*(*w*⁺(*c*), *n*)]².

However, in our case, we have what we might term semi-independent probabilities, with the level of independence determined by the number of degrees of freedom. We have *df* = *k* – 1 independent differences, so we can interpolate between the two methods in proportion to the number of cells.

*l* = (*l*(ind) × (*k* – 2) + 2*l*(dep)) / *k*, and

*u* = (*u*(ind) × (*k* – 2) + 2*u*(dep)) / *k*,

*d*(*c* ∈ **C**) ∈ (*d* – *l*, *d* + *u*).

Note that *l* = *l*(dep) where *k* = 2.

To see how this works, let’s return to our example. The following is drawn from the LLC data (first, blue bar in the graph), at an error level α = 0.05. Note that for one of our cells (PC), *p*₁ > 0.5 and *w*₁⁻ > 0.5, so we must swap the interval for this cell.

| function | CO | CS | SU | OD | A | PC |
|---|---|---|---|---|---|---|
| *p*₁ | 0.0027 | 0.0147 | 0.0272 | 0.1456 | 0.2724 | 0.5373 |
| *w*₁⁻ | 0.0012 | 0.0105 | 0.0213 | 0.1316 | 0.2544 | 0.5166 |
| *w*₁⁺ | 0.0058 | 0.0206 | 0.0348 | 0.1608 | 0.2913 | 0.5579 |

Next, to compute the bounds of the confidence interval CI(*d*) = (*d* – *l*, *d* + *u*), we obtain the same data for *p*₂ and then carry out the computation.

*l*(dep) = *d* – ∑*d*(*w*⁻(*c*), *n*) = 0.6152 – 0.5833 = 0.0319,

*u*(dep) = ∑*d*(*w*⁺(*c*), *n*) – *d* = 0.6510 – 0.6152 ≈ 0.0359,

*l*(ind) = √∑[*d*(*p*(*c*), *n*) – *d*(*w*⁻(*c*), *n*)]² = 0.0152,

*u*(ind) = √∑[*d*(*p*(*c*), *n*) – *d*(*w*⁺(*c*), *n*)]² = 0.0165.

This obtains an interval of (0.5945, 0.6382).
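The whole calculation for the LLC data can be sketched in Python. The sketch makes some assumptions: it uses the Wilson interval without continuity correction, it plugs the Wilson bounds for *p*₁ into *d*(*p*, *n*) for each cell, and the function names are illustrative. It should come out close to the interval quoted above, with small differences possible in the last decimal place.

```python
import math

Z = 1.959964  # two-tailed critical value of the Normal distribution, alpha = 0.05


def wilson(p, n, z=Z):
    """Wilson score interval (w-, w+), without continuity correction."""
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom


def d_term(p, n):
    """Single-cell diversity term d(p, n)."""
    return p * (1 - (p * n - 1) / (n - 1))


def d_bounds(p, n, z=Z):
    """(lower, upper) bounds of d(p, n), swapping or capping around p = 0.5."""
    lo, hi = wilson(p, n, z)
    if hi < 0.5:
        return d_term(lo, n), d_term(hi, n)
    if lo > 0.5:
        return d_term(hi, n), d_term(lo, n)
    return min(d_term(lo, n), d_term(hi, n)), d_term(0.5, n)


def diversity_interval(freqs, z=Z):
    """Return (d, d - l, d + u), interpolating dependent and independent sums."""
    n, k = sum(freqs), len(freqs)
    terms = [d_term(f / n, n) for f in freqs]
    bounds = [d_bounds(f / n, n, z) for f in freqs]
    d = sum(terms)
    # dependent (simple) summation of the transformed bounds
    l_dep = d - sum(b[0] for b in bounds)
    u_dep = sum(b[1] for b in bounds) - d
    # independent (Pythagorean) summation of interval widths
    l_ind = math.sqrt(sum((t - b[0]) ** 2 for t, b in zip(terms, bounds)))
    u_ind = math.sqrt(sum((t - b[1]) ** 2 for t, b in zip(terms, bounds)))
    # interpolate in proportion to the number of cells k
    l = (l_ind * (k - 2) + 2 * l_dep) / k
    u = (u_ind * (k - 2) + 2 * u_dep) / k
    return d, d - l, d + u


d, lower, upper = diversity_interval([6, 33, 61, 326, 610, 1203])  # LLC
print(f"d = {d:.4f} ({lower:.4f}, {upper:.4f})")
```

With *k* = 2 the interpolation weights reduce to the dependent sum alone, as noted above.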

We can quote diversity for both corpora with absolute intervals (*d* – *l*, *d* + *u*):

*d*(LLC) = 0.6152 (0.5945, 0.6382), and *d*(ICE-GB) = 0.6443 (0.6248, 0.6655).

In the Newcombe-Wilson test, we compare the difference between two Binomial observations *p*₁ and *p*₂ with the Pythagorean distance of the Wilson interval widths *y*₁⁺ = *w*₁⁺ – *p*₁, etc:

–√((*y*₁⁺)² + (*y*₂⁻)²) < (*p*₁ – *p*₂) < √((*y*₁⁻)² + (*y*₂⁺)²).

If the inequality above holds, the result is not significant (the difference falls within the confidence interval).

This method operates on the assumption that the observations are independent and the intervals are approximately Normal. In our case the difference in diversity is -0.0291, and the bounds are (-0.0301, +0.0297).

Since the difference falls inside those bounds – just – we can report that the difference is not significant.
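This comparison can be sketched in Python, working from the rounded scores and intervals quoted above. The function name `newcombe_wilson_bounds` is illustrative, and because the inputs are rounded to four decimal places the bounds may differ from the quoted ones in the last digit.

```python
import math


def newcombe_wilson_bounds(x1, ci1, x2, ci2):
    """Zero-based bounds for the difference x1 - x2.

    ci1 and ci2 are (lower, upper) intervals for x1 and x2 respectively.
    """
    y1_lo, y1_hi = x1 - ci1[0], ci1[1] - x1  # widths below and above x1
    y2_lo, y2_hi = x2 - ci2[0], ci2[1] - x2  # widths below and above x2
    lower = -math.sqrt(y1_hi ** 2 + y2_lo ** 2)
    upper = math.sqrt(y1_lo ** 2 + y2_hi ** 2)
    return lower, upper


d_llc, ci_llc = 0.6152, (0.5945, 0.6382)
d_ice, ci_ice = 0.6443, (0.6248, 0.6655)

diff = d_llc - d_ice
lower, upper = newcombe_wilson_bounds(d_llc, ci_llc, d_ice, ci_ice)
significant = not (lower < diff < upper)
print(f"difference = {diff:.4f}, bounds = ({lower:.4f}, {upper:.4f})")
print("significant" if significant else "not significant")
```

The same function applies to any pair of independent estimates with asymmetric intervals, not only to diversity scores.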

In many scientific disciplines, such as medicine, papers that include graphs or cite figures without confidence intervals are considered incomplete and are likely to be rejected by journals. However, whereas the Wilson interval performs admirably for simple Binomial probabilities, computing confidence intervals for more complex measures typically involves a more involved computation.

We defined a diversity measure and derived a confidence interval for it. Although probabilistic (diversity is indeed a probability), it is not a *Binomial* probability. For one thing, it has a maximum below 1, slightly in excess of (*k* – 1) / *k*. For another, it is computed as the sum of the product of two sets of related probabilities.

In order to derive this interval we made the assumption of monotonicity, i.e. that the function *d* tends to increase along its range, or decrease along its range. However, *d* is decidedly **not** monotonic – it increases as *p* tends to 0.5 but falls thereafter. We employed the weaker assumption that it is monotonic within the confidence interval, or – in the case where the interval includes a change in direction – that it cannot exceed the global maximum. This has a conservative consequence: it makes the evaluation weaker than it would otherwise be.

We computed an interval by interpolating between dependent and independent estimates of variance, noting that the vector has *k* – 1 degrees of freedom. This is not the most accurate method (and I intend to return to this question in later posts), but it is sufficient for us to derive an interval, and, by employing Newcombe’s method, a test of significant difference.

Like Cramér’s φ, diversity condenses an array with *k* – 1 degrees of freedom into a variable with a single degree of freedom. Swapping data between the smallest and largest columns would obtain exactly the same diversity score.

Testing for significant difference in diversity, therefore, is not the same as carrying out a *k* × 2 chi-square test. Such a test could be significant even when diversity scores are not significantly different. Our new diversity difference test is more conservative, and significant results may be more worthy of comment.

Aarts, B., Wallis, S.A., and Bowie, J. (forthcoming). *–Ing clauses in spoken English: structure, usage and recent change*.

- Diversity interval example (Excel)
- Is “grammatical diversity” a useful concept?
- Interval arithmetic ‘cheat sheet’ (PDF)
- Reciprocating the Wilson interval
- Goodness of fit measures for discrete categorical data
- Measures of association for contingency tables

]]>

Let’s think about what you experienced. The car crash might involve a number of variables an investigator would be interested in.

How fast was the car going? Where were the brakes applied?

Look on the road. Get out a tape measure. How long was the skid before the car finally stopped?

How big and heavy was the car? How loud was the bang when the car crashed?

These are all **physical variables**. We are used to thinking about the world in terms of these kinds of variables: velocity, position, length, volume and mass. They are tangible: we can see and touch them, and we have physical equipment that helps us measure them.

To this list we might add variables we can’t see, such as how loud the bang was. We might not be able to see it, but we can appreciate that loudness is a variable that ranges from very quiet to extremely loud indeed! With a decibel meter we can get an accurate reading, but if you are trying to explain how loud something was to the Police from memory, the best you might be able to do is a rough-and-ready assessment.

We are also used to thinking about some other variables that might be relevant to our car crash investigation. If we are investigating on behalf of the insurance company, we might want to know the answers to some rather less tangible variables. What was the value of the car before the accident? How wealthy is the driver? How dangerous is that stretch of road?

We are used to thinking about the world in terms of physical variables but we are also brought up in a social world of economic value. The value of the car, the wealth of the driver. These **social variables** are a bit more ‘slippery’ than the physical variables. ‘Value’ can be highly subjective: the car might have been vintage, and different buyers might place a different value on it. The buyer, being canny, might then resell it for a higher value. Nonetheless everyone brought up in a world of trade and capital understands the idea that a car can be sold and in that process a price attached to it. Likewise, ‘wealth’ might be measured in different ways, or in different currencies. So although these are not physical variables, we are comfortable with the idea that they are tangible to us.

But what about that last variable? I asked, *how dangerous is that stretch of road?* This variable is a risk value. It is a **probability**. We can rephrase my question as “what is the probability that any given car that comes down the road crashes?” If we can measure this in some way, and make repeat measurements elsewhere, we could make comparisons. Perhaps we have discovered an accident ‘black spot’: somewhere where there is a greater chance of a road accident than at other locations.

**But a probability cannot be calculated on the strength of a single accident.** It can only be measured by a different, more patient, process of observation. We have to observe *many* cars driving down the road, count the ones that crash, and build up a set of observations. Probability is not a tangible variable, and it takes an effort of imagination to think about.

I want to argue that the first thing that makes the subject of statistics difficult, compared to, say, engineering, is that even the most elementary variable we use, observed probability, is not physically tangible.

Let us think about our car crash for a minute. I said that you have never been on this road before. You have no data on the probability of a crash on that road. But it would be very easy to assume from the simple fact that you saw a crash that, if the road surface seemed poor, or it was raining, these facts contributed to the accident and made it more likely. But you have only one data point to draw from. This kind of inference is not valid. It is an over-extrapolation. It is little more than a guess.

Our natural instinct is to form explanations in our mind, hypotheses, and to look for patterns and causes in the world. (Part of our training as scientists is to be suspicious of that inclination. Of course we might be right, but we have to be relentlessly careful and self-critical before we can conclude that we are.)

If we wanted to make a case that this location is an accident black spot, we would need to set up equipment and monitor the road for accidents. We would need to continue to observe the road over a substantial period of time to get the data we needed. This is called a **natural experiment**, where we don’t attempt to interfere with the conditions of the road but simply observe driver behaviour and car crashes.

Alternatively, we might **conduct an actual experiment** and drive various cars down the road to see how they handled. Either way, we would need to observe many cars going past before we could make a realistic estimate of the chance of a crash.

If probability is difficult to observe directly, this has an effect on our ability to think about it. Probability is more difficult to conceive of in the way we conceive of length, say. We all vary in our spatial reasoning abilities, but we experience reinforcement learning from daily observations, tape measures and practice. As we have seen, probability is much more elusive because it is only observed from many observations. This makes it difficult to reliably estimate probability in advance, or to reason with probabilities.

Even experienced researchers make mistakes. The psychologists Tversky and Kahneman (1971) reported the findings from a questionnaire they gave to professional psychologists. The questions concerned the decisions they would make in research based on statements about probability. Tversky and Kahneman showed not only that their expert subjects were unreliable, but also that they exhibited persistent biases in human cognition, including the one we mentioned earlier – a belief in the reliability of their own observations, even when they had few observations on which to base their conclusions.

So if you are struggling with statistical concepts, **don’t worry**. You are not alone. Indeed, I have come to the conclusion that *it is necessary to struggle with probability*. We have all been there, and one of my main criticisms of traditional statistics teaching is that most treatments skate over the core concepts and go straight to statistical testing methods that the experimenter, with no conceptual grounding (never mind mathematical underpinnings), simply takes on faith.

Probability is difficult to observe. It is an abstract mathematical concept that can only be measured indirectly, from many observations. And simple observed probability is just the beginning. In discussing inferential statistics I try to keep to three notions of probability and a simple labelling system: observed probability, for which I will use the label lower-case *p*, the ‘true’ population probability, capital *P*, and a third type, the probability that our observed probability is reliable, which we denote with α. Many people make mistakes reasoning about that last little variable. But we are getting ahead of ourselves.

The best way to get to grips with probability is to replace my thought experiment with a physical one.

But: **safety first!** Please don’t crash an actual car — use a Scalextric instead!

Tversky, A., and Kahneman, D. 1971. Belief in the law of small numbers. *Psychological Bulletin* **76**:2, 105-110. **»** ePublished

]]>

I have been recently reviewing and rewriting a paper for publication that I first wrote back in 2011. The paper (Wallis forthcoming) concerns the problem of how we test whether repeated runs of the same experiment obtain essentially the same results, i.e. results are not significantly different from each other.

These meta-tests can be used to test an experiment for replication: if you repeat an experiment and obtain significantly different results on the first repetition, then, with a 1% error level, you can say there is a 99% chance that the experiment is not replicable.

These tests have other applications. You might be wishing to compare your results with those of others in the literature, compare results with different operationalisation (definitions of variables), or just compare results obtained with different data – such as comparing a grammatical distribution observed in speech with that found within writing.

The design of tests for this purpose is addressed within the *t*-testing ANOVA community, where tests are applied to continuously-valued variables. The solution concerns a particular version of an ANOVA, called “the test for interaction in a factorial analysis of variance” (Sheskin 1997: 489).

However, anyone using data expressed as discrete alternatives (A, B, C etc) has a problem: the classical literature does not explain what you should do.

The rewrite of the paper caused me to distinguish between two types of tests: ‘point tests’, which I describe below, and ‘gradient tests’.

These tests can be used to compare results drawn from 2 × 2 or *r* × *c* χ² tests for homogeneity (also known as tests for independence). This is the most common type of contingency test, which can be computed using Fisher’s exact method or as a Newcombe-Wilson difference interval.

- A **gradient test** (B) evaluates whether the *gradient* or difference between point 1 and point 2 differs between runs of an experiment, *d* = *p*₁ – *p*₂. This concerns whether claims about the rate of change, or size of effect, observed are replicable. Gradient tests can be extended, with increasing degrees of freedom, into tests comparing *patterns* of effect.
- A **point test** (A) simply asks whether data at either point, evaluated separately, differs between experimental runs. This concerns whether single observations, such as *p*₁, are replicable. Point tests can be extended into ‘multi-point’ tests, which we discuss below.

Point tests only apply to homogeneity data. If you wish to compare outcomes from goodness of fit tests, you need a version of the gradient test, to compare differences from an expected *P*, *d* = *p*₁ – *P*. Since different data sets may have different expected *P*, a distinct ‘point test for goodness of fit’ would be meaningless.

The earlier version of the paper, which has been published on this blog since its launch in 2012, focused on gradient tests. The possibility of carrying out a point test was mentioned in passing. In this blog post I want to focus on point tests.

The obvious problem with gradient tests is that two experimental runs might obtain the same gradient but in fact be very different in start and end points. Consider the following graph.

The data in Figure 1 is calculated from two 2 × 2 tables drawn from a paper by Aarts, Close and Wallis (2013).

**Note:** To obtain Figure 2, I simply replaced one frequency in the first table: 46 with 100. The data is also found on the 2×2 homogeneity tab in this Excel spreadsheet, which contains a wide range of separability tests.

To make our exposition clearer, Table 1 uses the same format as in the Excel spreadsheet (with the dependent variable distributed vertically) rather than the format in the paper.

| spoken | LLC (1960s) | ICE-GB (1990s) | Total |
|---|---|---|---|
| shall | 124 | 46 | 170 |
| will | 501 | 544 | 1,045 |
| Total | 625 | 590 | 1,215 |

| written | LOB (1960s) | FLOB (1990s) | Total |
|---|---|---|---|
| shall | 355 | 200 | 555 |
| will | 2,798 | 2,723 | 5,521 |
| Total | 3,153 | 2,923 | 6,076 |

Aarts *et al*. carried out 2 × 2 homogeneity tests for the two tables separately. These test whether modal *shall* declines as a proportion of the modal *shall/will* alternation between the two time points. In other words, we compare LLC with ICE-GB data, and LOB with FLOB data.

To carry out a point test we simply rotate the test 90 degrees, e.g. to compare data at the 1960s point we compare LLC with LOB.

As I have explained elsewhere (Wallis 2013), there are a number of different methods for carrying out this comparison.

These include:

- The *z* test for two independent proportions (Sheskin 1997: 226).
- The Newcombe-Wilson interval test (Newcombe 1998).
- The 2 × 2 χ² test for homogeneity (independence).

These are all standard tests and each is discussed in papers and elsewhere on this blog.

The advantage of the third approach is that it is extensible to *c*-way multinomial observations by using a 2 × *c* χ² test.

The tests listed above can be used to compare the 1960s and 1990s intervals in Figure 1 separately.
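As an illustration of the first method, here is a minimal Python sketch of the *z* test for two independent proportions applied to the Table 1 data, rotated so that spoken is compared with written at each time point. It assumes the usual pooled estimate of the standard error; the function name `z_two_proportions` is illustrative.

```python
import math


def z_two_proportions(f1, n1, f2, n2):
    """z score for the difference between two independent proportions (pooled SE)."""
    p1, p2 = f1 / n1, f2 / n2
    pooled = (f1 + f2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


# shall out of {shall, will}: spoken vs written at each time point
z_1960s = z_two_proportions(124, 625, 355, 3153)  # LLC vs LOB
z_1990s = z_two_proportions(46, 590, 200, 2923)   # ICE-GB vs FLOB

z_crit = 1.959964  # two-tailed critical value at alpha = 0.05
for label, z in (("1960s", z_1960s), ("1990s", z_1990s)):
    verdict = "significant" if abs(z) > z_crit else "not significant"
    print(f"{label}: z = {z:.4f}, z squared = {z * z:.4f} ({verdict})")
```

Squaring *z* gives the same score as the 2 × 2 χ² test for homogeneity on the same data (Wallis 2013), which is why the first and third methods in the list reach the same conclusion.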

However, in many cases it would be helpful to have a method that evaluated both pairs of observations in a single test. This can be generalised to a series of *r* observations. To do this, in (Wallis forthcoming) I propose what I call a multi-point test.

We generalise the χ² formula by summing over *i* = 1..*r*:

- χ_{d}² = ∑χ²(*i*)

where χ²(*i*) represents the χ² score for homogeneity for each set of data at position *i* in the distribution.

This test has *r* × df(*i*) degrees of freedom, where df(*i*) is the degrees of freedom for each χ² point test. So, in the worked example we have seen, the summed test has two degrees of freedom:

| spoken | LLC (1960s) | ICE-GB (1990s) | Total |
|---|---|---|---|
| shall | 124 | 46 | 170 |
| will | 501 | 544 | 1,045 |
| Total | 625 | 590 | 1,215 |

| written | LOB (1960s) | FLOB (1990s) | Total |
|---|---|---|---|
| shall | 355 | 200 | 555 |
| will | 2,798 | 2,723 | 5,521 |
| Total | 3,153 | 2,923 | 6,076 |

| point test | 1960s | 1990s | Total |
|---|---|---|---|
| χ² | 34.6906 | 0.6865 | 35.3772 |

Since the computation sums independently-calculated χ² scores, each score may be individually considered for significant difference (with df(*i*) degrees of freedom). Hence we can see above the large score for the 1960s data (individually significant) and the small score for 1990s (individually non-significant).
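The summed test can be sketched in Python as below. The helper `chi_square` is an illustrative implementation of the ordinary Pearson χ² for an *r* × *c* table (no continuity correction), and the variable names are assumptions of this sketch. It should reproduce the χ² row above to within rounding.

```python
def chi_square(table):
    """Pearson chi-square for homogeneity on a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (obs - rt * ct / n) ** 2 / (rt * ct / n)
        for row, rt in zip(table, row_totals)
        for obs, ct in zip(row, col_totals)
    )


# One 2 x 2 table per time point. Rows: shall, will. Columns: spoken, written.
points = {
    "1960s": [[124, 355], [501, 2798]],  # LLC vs LOB
    "1990s": [[46, 200], [544, 2723]],   # ICE-GB vs FLOB
}

scores = {label: chi_square(table) for label, table in points.items()}
chi_d = sum(scores.values())
# each point test contributes (rows - 1) x (columns - 1) degrees of freedom
df = sum((len(t) - 1) * (len(t[0]) - 1) for t in points.values())

for label, score in scores.items():
    print(f"{label}: chi-square = {score:.4f}")
print(f"sum = {chi_d:.4f} with {df} degrees of freedom")
```

Because `chi_square` accepts any *r* × *c* table, the same code covers the multinomial generalisation by passing 2 × *c* tables for each point.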

**Note:** Whereas χ² is generally associative (non-directional), the summed equation (χ_{d}²) is not. Nor is this computation the same as a 3 dimensional test.

- The multi-point test factors out variation between tests over the independent variable (in this instance: time). This means that if there is a lot more data in one table at a particular time period, this fact does not skew the results.
- On the other hand, it does not factor out variation over the dependent variable – after all, this is precisely what we wish to examine!

Naturally, like the point test, this test may be generalised to multinomial observations.

An alternative multi-point test for binomial (two-way) variables employs a sum of χ² values abstracted from Newcombe-Wilson tests.

- Carry out Newcombe-Wilson tests for each point test *i* at a given error level α, obtaining *D*_{i}, *W*_{i}⁻ and *W*_{i}⁺.
- Identify the inner interval width *W*_{i} for each test:
  - if *D*_{i} < 0, *W*_{i} = *W*_{i}⁻; *W*_{i} = *W*_{i}⁺ otherwise.
- Use the difference *D*_{i} and inner interval *W*_{i} to compute χ² scores:
  - χ²(*i*) = (*D*_{i}.*z*_{α/2} / *W*_{i})².

It is then possible to sum χ²(*i*) as before.

Using the data in the worked example we obtain:

**1960s:** *D*_{i} = 0.0858,

Since *D*_{i} is positive in both cases, we use the upper interval width each time. This gives us χ² scores of 28.4076 and 1.3769 respectively, which obtains a sum of 29.78. Compared to the first method above, this approach tends to downplay extreme differences.
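The steps for a single point test can be sketched in Python for the 1960s data (LLC vs LOB). The sketch assumes Wilson intervals without continuity correction, takes *D* as the spoken proportion minus the written proportion, and forms the upper width from the lower Wilson width of the first proportion and the upper Wilson width of the second, following the Newcombe-Wilson bounds as usually defined. The function names are illustrative.

```python
import math

Z = 1.959964  # z for alpha = 0.05, two-tailed


def wilson(p, n, z=Z):
    """Wilson score interval (w-, w+), without continuity correction."""
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom


def nw_chi_square(f1, n1, f2, n2, z=Z):
    """Return (D, chi-square) abstracted from a Newcombe-Wilson test of p1 - p2."""
    p1, p2 = f1 / n1, f2 / n2
    w1_lo, w1_hi = wilson(p1, n1, z)
    w2_lo, w2_hi = wilson(p2, n2, z)
    diff = p1 - p2
    # magnitudes of the lower and upper Newcombe-Wilson bounds about zero
    width_lower = math.sqrt((w1_hi - p1) ** 2 + (p2 - w2_lo) ** 2)
    width_upper = math.sqrt((p1 - w1_lo) ** 2 + (w2_hi - p2) ** 2)
    inner = width_lower if diff < 0 else width_upper
    return diff, (diff * z / inner) ** 2


diff_1960s, chi_1960s = nw_chi_square(124, 625, 355, 3153)  # LLC vs LOB
print(f"1960s: D = {diff_1960s:.4f}, chi-square = {chi_1960s:.4f}")
```

Scores obtained this way for each point can then be summed in the same manner as the Pearson χ² scores.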

The point test and the additive generalisation of this test into a ‘multi-point test’ represent a method of contrasting multiple runs of the same experiment, comparing observed changes in different subcorpora or genres, or examining the empirical effect of changing definitions of variables.

These tests consider the null hypothesis that **individual observations** are not different; or, in the multi-point case, that **in general** the observations are not different.

- They do not evaluate the gradient between points or the size of effect. If we wish to compare **sizes of effect** we would need to use one of the methods for this purpose described in (Wallis forthcoming).
- The method only applies to comparing tests for homogeneity (independence). To compare **goodness of fit** data, a different approach is required (also described in Wallis forthcoming).

Nonetheless, these tests are useful meta-tests that build on classical Pearson χ² tests, and they are useful tools in our analytical armoury.

Sheskin, D.J. 1997. *Handbook of Parametric and Nonparametric Statistical Procedures*. Boca Raton, Fl: CRC Press.

Newcombe, R.G. 1998. Interval estimation for the difference between independent proportions: comparison of eleven methods. *Statistics in Medicine* **17**: 873-890.

Wallis, S.A. 2013. *z*-squared: the origin and application of χ². *Journal of Quantitative Linguistics* **20**:4, 350-378. » Post

Wallis, S.A. forthcoming (first published 2011). *Comparing χ² tables for separability of distribution and effect*. London: Survey of English Usage. » Post

]]>

I have previously argued (Wallis 2014) that interaction evidence is the most fruitful type of corpus linguistics evidence for grammatical research (and doubtless for many other areas of linguistics).

Frequency evidence, which we can write as *p*(*x*), the probability of *x* occurring, concerns itself simply with the overall distribution of a linguistic phenomenon *x* – such as whether informal written English has a higher proportion of interrogative clauses than formal written English. In order to calculate frequency evidence we must define *x*, i.e. decide how to identify interrogative clauses. We must also pick an appropriate baseline *n* for this evaluation, i.e. we need to decide whether to use words, clauses, or any other structure to identify locations where an interrogative clause may occur.

**Interaction evidence** is different. It is a statistical correlation between a decision that a writer or speaker makes at one part of a text, which we will label point *A*, and a decision at another part, point *B*. The idea is shown schematically in Figure 1. *A* and *B* are separate ‘decision points’ in a given relationship (e.g. lexical adjacency), which can be also considered as ‘variables’.

This class of evidence is used in a wide range of computational algorithms. These include collocation methods, part-of-speech taggers, and probabilistic parsers. Despite the promise of interaction evidence, the majority of corpus studies tend to consist of discussions of frequency differences and distributions.

In this paper I want to look at applications of interaction evidence which are made more-or-less at the same time by the same speaker/writer. In such circumstances we cannot be sure that just because *B* follows *A* in the text, the decision relating to *B* was made after the decision at *A*.

For example, in studying the premodification of noun phrases by attributive adjectives in English – which adjective is applied first in assembling an NP like *the old tall green ship*, for instance – **we cannot be sure that adjectives are selected by the speaker in sentence order**. It is also perfectly plausible that adjectives were chosen in an alternative or parallel order in the mind of the speaker, and then assembled in the final order during the language production process.

Of course, in cases where points *A* and *B* are separated substantively in time (as in many instances of structural self-priming) or where *B* is spoken in response to *A* by another speaker (structural priming of another’s language), there is unlikely to be any ambiguity about decision order. Moreover, if *A* licences *B*, then the order is unambiguous.

However, in circumstances where *A* and *B* are proximal, and where the order of decisions made by the speaker/writer cannot be presumed, we wish to consider whether there are mathematical or statistical methods for predicting the most likely order in which decisions were made.

Such a method would have considerable value in experimental design in cognitive corpus linguistics. For example, since Heads of NPs, VPs etc are conceived of as determining their complements, it may not be too much of a stretch to argue that if this method works, we may have found a way of empirically evaluating this grammatical concept.

- Introduction
- A collocation example
  - 2.1 Employing chi-square and phi
  - 2.2 Directional statistics
  - 2.3 Significantly directional?
- A grammatical example
  - 3.1 Testing for difference under alternation
  - 3.2 Comparing Newcombe-Wilson intervals for direction
  - 3.3 Optimising the difference interval
- Mapping significance of association and direction
- Concluding remarks
- References

Wallis, S.A. 2017. *Detecting direction in interaction evidence*. London: Survey of English Usage. **»** Paper (PDF)

- Excel spreadsheets

Wallis, S.A. 2011. *Comparing χ² tests for separability*. London: Survey of English Usage, UCL. **»** post

Wallis, S.A. 2012. *Goodness of fit measures for discrete categorical data*. London: Survey of English Usage, UCL. **»** post

Wallis, S.A. 2013a. *z*-squared: the origin and application of χ². *Journal of Quantitative Linguistics* **20**:4, 350-378. **»** post

Wallis, S.A. 2013b. Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods. *Journal of Quantitative Linguistics* **20**:3, 178-208. **»** post

Wallis, S.A. 2014. What might a corpus of parsed spoken data tell us about language? In L. Veselovská and M. Janebová (eds.) *Complex Visibles Out There. Proceedings of the Olomouc Linguistics Colloquium 2014: Language Use and Linguistic Structure.* Olomouc: Palacký University, 2014. pp 641-662. **»** post

Wallis, S.A. forthcoming. *That vexed problem of choice*. London: Survey of English Usage, UCL. **»** post


The Summer School is a short three-day intensive course aimed at PhD-level students and researchers who wish to get to grips with Corpus Linguistics. Numbers are deliberately limited on a first-come, first-served basis. You will be taught in a small group by a teaching team.

Each day begins with a theory lecture, followed by a guided hands-on workshop with corpora, and a more self-directed and supported practical session in the afternoon.

Over the three days, participants will learn about the following:

- the scope of Corpus Linguistics, and how we can use it to study the English Language;
- key issues in Corpus Linguistics methodology;
- how to use corpora to analyse issues in syntax and semantics;
- basic elements of statistics;
- how to navigate large and small corpora, particularly ICE-GB and DCPSE.

At the end of the course, participants will have:

- acquired a basic but solid knowledge of the terminology, concepts and methodologies used in English Corpus Linguistics;
- had practical experience working with two state-of-the-art corpora and a corpus exploration tool (ICECUP);
- gained an understanding of the breadth of Corpus Linguistics and the potential application for projects;
- learned about the fundamental concepts of inferential statistics and their practical application to Corpus Linguistics.

For more information, including costs, booking information, timetable, see the website.


Over the last year, the field of psychology has been rocked by a major public dispute about statistics. This concerns the failure of claims in papers, published in top psychological journals, to replicate.

Replication is a big deal: if you publish a correlation between variable *X* and variable *Y* – that there is an increase in the use of the progressive over time, say – and that increase is statistically significant, you expect that this finding would be replicated were the experiment repeated.

I would strongly recommend Andrew Gelman’s brief history of the developing crisis in psychology. It is not necessary to agree with everything he says (personally, I find little to disagree with, although his argument is challenging) to recognise that he describes a serious problem here.

There may be more than one reason why published studies have failed to obtain compatible results on repetition, and so it is worth sifting these out.

In this blog post, what I want to do is try to explore what this replication crisis is – is it one problem, or several? – and then turn to what solutions might be available and what the implications are for corpus linguistics.

The debate between Neil Millar and Geoff Leech regarding the alleged increase (Millar 2009) and decline (Leech 2011) of the modal auxiliary verbs is an example of this problem.

Millar based his conclusions on the TIME corpus, discovering that the rate of modal verbs per million words tended to increase over time. Leech, using the Brown series of US English corpora, discovered the opposite. Both applied statistical methods to their data but obtained very different conclusions.

Inferential statistics operates by predicting the result of repeated runs of the same experiment, i.e. on samples of data drawn from the same population.

Stating that something “significantly increases over time” can be reformulated as:

- subject to caveats of **random sampling** (the sample is, or approximates to, a random sample of utterances drawn from the same population), and **Binomial variables** (observations are free to vary from 0 to 1),
- we can calculate a **confidence interval** at a given error rate (say 1 in 20 times for a 5% error rate / 95% interval) on the difference in two observations of variable *X* taken at two time points 1 and 2, *x*₂ – *x*₁,
- **all points** within this interval (including the lower bound) are greater than 0,
- **on repeated runs of the same experiment we can expect to see an observation fall outside of the confidence interval of the difference at the predicted rate** (here, 1 time in 20).
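The reformulation above can be sketched in a few lines of Python. This is a minimal illustration, assuming the Newcombe-Wilson difference interval discussed elsewhere on this blog and true Binomial proportions with known sample sizes; the function names (`wilson`, `newcombe_wilson`, `significant_increase`) are my own and not part of any library.

```python
import math

Z_95 = 1.959964  # two-tailed critical value of z for a 5% error rate


def wilson(p, n, z=Z_95):
    """Wilson score interval (lower, upper) for an observed proportion p of n cases."""
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom


def newcombe_wilson(p1, n1, p2, n2, z=Z_95):
    """Interval on the difference d = p2 - p1, combining the inner Wilson widths."""
    l1, u1 = wilson(p1, n1, z)
    l2, u2 = wilson(p2, n2, z)
    d = p2 - p1
    lower = d - math.sqrt((p2 - l2) ** 2 + (u1 - p1) ** 2)
    upper = d + math.sqrt((u2 - p2) ** 2 + (p1 - l1) ** 2)
    return lower, upper


def significant_increase(p1, n1, p2, n2, z=Z_95):
    """True if every point in the difference interval, including the lower bound, exceeds 0."""
    lower, _ = newcombe_wilson(p1, n1, p2, n2, z)
    return lower > 0
```

For example, `significant_increase(0.2, 100, 0.4, 100)` returns `True`, whereas a rise from 0.2 to 0.22 on the same sample sizes does not pass the test.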

**Note:** For the purposes of this blog post, I am focusing on the last bullet point – when we say that something “fails to replicate”, we mean that on a repetition the result falls outside the confidence interval of the difference *on the very next occasion!*

Leech obtained a different result from Millar on the first attempted repetition of this experiment. This could be a fluke, but it seems to be a failure to replicate. There should only be a 1 in 20 chance of this happening.

Observing such a replication failure should lead us to ask some searching questions about these two studies, many of which are discussed elsewhere in this blog.

Much of the controversy can be summed up by the bottom row in this table, drawn from Millar (2009). This appears to show a 23% increase in modal use between the 1920s and 2000s. With a lot of data and a sizeable effect, this increase seems bound to be significant.

| | 1920s | 1930s | 1940s | 1950s | 1960s | 1970s | 1980s | 1990s | 2000s | % diff 1920s-2000s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| will | 2,194.63 | 1,681.76 | 1,856.40 | 1,988.37 | 1,965.76 | 2,135.73 | 2,057.43 | 2,273.23 | 2,362.52 | +7.7% |
| would | 1,690.70 | 1,665.01 | 2,095.76 | 1,669.18 | 1,513.30 | 1,828.92 | 1,758.44 | 1,797.03 | 1,693.19 | +0.1% |
| can | 832.91 | 742.30 | 955.73 | 1,093.39 | 1,233.13 | 1,305.82 | 1,231.99 | 1,475.95 | 1,777.07 | +113.4% |
| could | 661.33 | 822.72 | 1,188.24 | 998.83 | 950.73 | 1,106.25 | 1,156.61 | 1,378.39 | 1,342.56 | +103.0% |
| may | 583.59 | 515.12 | 496.93 | 502.74 | 628.13 | 743.66 | 775.92 | 937.08 | 931.91 | +59.7% |
| should | 577.46 | 450.07 | 454.87 | 495.26 | 441.96 | 475.50 | 453.33 | 521.46 | 593.27 | +2.7% |
| must | 485.31 | 418.03 | 456.57 | 417.62 | 401.36 | 390.47 | 347.02 | 306.69 | 250.59 | -48.4% |
| might | 374.52 | 375.40 | 500.33 | 408.90 | 399.80 | 458.99 | 416.81 | 474.23 | 433.34 | +15.7% |
| shall | 212.19 | 120.79 | 96.42 | 70.52 | 50.48 | 35.65 | 25.93 | 16.09 | 9.26 | -95.6% |
| ought | 50.22 | 37.94 | 39.31 | 40.34 | 36.91 | 34.29 | 28.27 | 34.90 | 27.65 | -44.9% |
| Total | 7,662.86 | 6,829.14 | 8,140.56 | 7,685.15 | 7,621.56 | 8,515.28 | 8,251.75 | 9,215.05 | 9,421.36 | +22.9% |
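To see how much the headline figure conceals, the Total row can be broken down decade by decade. The short sketch below simply recomputes percentage changes from the per-million-word totals in the table; the variable names are mine. Since these are rates rather than raw frequencies, the figures are descriptive only and no significance test is attempted.

```python
# Per-million-word totals copied from the bottom row of the table above.
totals = {
    "1920s": 7662.86, "1930s": 6829.14, "1940s": 8140.56,
    "1950s": 7685.15, "1960s": 7621.56, "1970s": 8515.28,
    "1980s": 8251.75, "1990s": 9215.05, "2000s": 9421.36,
}


def percent_change(before, after):
    """Percentage change from one rate to another."""
    return 100 * (after - before) / before


decades = list(totals)

# Change between each pair of consecutive decades.
steps = {
    f"{a}-{b}": percent_change(totals[a], totals[b])
    for a, b in zip(decades, decades[1:])
}

# Change between the first and last decade (the headline figure).
overall = percent_change(totals["1920s"], totals["2000s"])

# Consecutive-decade steps where the rate falls rather than rises.
falls = [step for step, change in steps.items() if change < 0]

for step, change in steps.items():
    print(f"{step}: {change:+.1f}%")
print(f"overall: {overall:+.1f}%")
```

Four of the eight decade-to-decade steps are falls, even though the overall change is an increase of about 23%.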

In attempting to identify why Leech and Millar obtain different results, the following questions should be considered.

- **Are the two samples drawn from the same population, or are they drawn from two distinct populations?** To put it another way, are there characteristics of the TIME data that make it distinct from the general written data in the Brown corpora? For example, does TIME have a ‘house style’, with subeditors enforcing it, which has led to a greater frequency of modal use? Has TIME tended to curate more stories with more modal hedges than the overall trend? Jill Bowie (Bowie *et al* 2013) reported that genre subdivisions within the spoken DCPSE corpus often exposed different modal trends.
- **Does Millar’s data support a general observation of increased modal use?** Bowie observes that Millar’s aggregate data fluctuates over the entire time period (see Table, bottom row), and some changes in sub-periods appear to be consistent with the trend reported by Leech in an earlier study in 2003. According to this observation, simply expressing the trend as an increase in modal verb use seems misleading.
- **Is it legitimate to aggregate all modals together?** In one sense, modals are a well-defined category of verb: a closed category, especially if one excludes the semi-modals. So “modal use” is a legitimate variable. But we can also see that different modal verbs are undergoing different patterns of change over time (see Table). Millar reports that *shall* and *must* are in decline in his data while *will* and *can* are increasing. Whereas *shall* and *will* may be alternates in some contexts, this does not mean that bundling all modal trends together is particularly meaningful. Moreover, since the synchronic distribution of modals (like most linguistic variables) is sensitive to genre, this issue also interacts with my first bullet point, i.e. the fact that there are known differences between corpora.
- **How reliable is a per-million-word measure?** What does the data look like if we use a different baseline, for example, modal use per tensed verb phrase (or tensed main verb)?

  Doing this allows us to factor out variation in ‘tensed VP density’ (i.e. the variation in potential sites for modals to be deployed) between texts. Neither Leech nor Millar does this, which means that we are not measuring when writers **choose** to use modal verbs, but the rate at which we, the reader, are **exposed** to them. See That vexed problem of choice.

  If VP density in text samples changes over time in either corpus, this may explain these different results – not as a result of increasing or declining modal use but as a result of increasing or declining tensed VP density (or declining / increasing density of other constituents). More generally, word-based baselines almost always conflate opportunity and use because the option to insert the element is not available following every other word (exceptions might include pauses or expletives, but these exceptions prove the rule). This conflation undermines the Binomial model and increases the risk that results will not replicate. The solution is to focus on identifying each choice-point as much as possible.
- **Does per word (per-million-word) data conform to the Binomial statistical model?** Since the entire corpus cannot consist of modal verbs, observations of modal verbs can never approach 100%, so the answer has to be no. However, the effect of this inappropriate model is that it tends to lead to the underreporting of otherwise significant results. See Freedom to vary and statistical tests. This may be a problem, but logically, it cannot be an explanation for obtaining two different ‘significant’ results in opposite directions!
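The point about baselines can be made concrete with a toy calculation. The counts below are hypothetical, invented purely for illustration: they are not drawn from TIME, Brown or any other corpus. They show how a rise in tensed VP density can turn a fall in modal use per tensed VP into an apparent rise per million words.

```python
# Hypothetical counts, invented for illustration only (not real corpus data).
samples = {
    "period 1": {"words": 1_000_000, "tensed_vps": 100_000, "modals": 8_000},
    "period 2": {"words": 1_000_000, "tensed_vps": 120_000, "modals": 9_000},
}


def per_million_words(sample):
    """Exposure rate: modals per million words of text."""
    return sample["modals"] / sample["words"] * 1_000_000


def per_tensed_vp(sample):
    """Choice rate: proportion of tensed VPs that contain a modal."""
    return sample["modals"] / sample["tensed_vps"]


for name, sample in samples.items():
    print(name, per_million_words(sample), round(per_tensed_vp(sample), 3))
```

With these invented figures the per-million-word rate rises from 8,000 to 9,000, while the proportion of tensed VPs containing a modal falls from 0.08 to 0.075. The two baselines tell opposite stories about the same data.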

All of the above are reasons to be unsurprised at the fact that Millar’s summary finding was not replicated in Leech’s data. But to be fair, many of Millar’s individual trends *do* appear to be consistent with results found in the Brown corpus.

As we shall see, the problem of replication is not that *all* results in one study are not reproduced in another study, rather it is that *some* results are not reproduced. But if our most remarked-upon finding is not replicated, we have a problem.

The replication crisis has been most discussed in psychology and the social sciences. In psychology, some published findings have been controversial to say the least. Claims that ‘Engineers have more sons; nurses have more daughters’ have tended to attract the interest of other psychologists relatively quickly. But this is shooting fish in a barrel.

In psychology, it is common to perform studies with small numbers of participants – 10 per experimental condition is usually cited as a minimum, which means that between 20 and 40 participants becomes the norm. Many kinds of failure to replicate are due to what statisticians tend to call ‘basic errors’, such as using an inappropriate statistical test. I discuss this elsewhere in this blog.

In this blog I have tended to argue for applying the simplest possible experimental designs (2 × 2 contingency tests, for example) over multivariate regression algorithms which may work, but are treated as ‘black boxes’ by almost all who use them. Such algorithms may ‘over fit’ data, i.e. they match the data more closely than is mathematically justified. But more importantly, they (and the assumptions underpinning them) are not transparent to their users.

I argue that if you don’t understand how your results were derived, you are taking them on faith.

This does not mean that I think multi-variable methods cannot be theoretically superior to, or potentially more powerful than, simpler tests. Rather, my objection is that before we use any statistical method we need to be sure that we understand what it is doing with our data. We have to ask ourselves constantly, *what do our results mean?*

However, the replication problem does not go away entirely once we have dealt with these so-called basic errors.

Andrew Gelman and Eric Loken (2013) raise a more fundamental problem that, if valid, is particularly problematic for corpus linguists. This concerns a question that goes to the heart of the post-hoc analysis of data, and the fundamental philosophy of statistical claims and the scientific method.

Essentially their argument goes like this.

- All data contains random noise, and thus every variable in a dataset (extracted from a corpus) will contain random noise. Researchers tend to assume that by employing a significance test we ‘control’ for this noise. But this is a mischaracterisation. Faced with a dataset consisting of pure noise, we would detect a ‘significant’ result 1 in 20 times (at a 0.05 threshold). Another way of thinking about this is that statistical methods can find patterns in data (correlations) even when there are no patterns to be found.
- Any data set may contain multiple variables, there are multiple potential definitions of these variables, and there are multiple analyses we could perform on the data. In a corpus we could modify definitions of variables, perform new queries, change baselines, etc., to perform new analyses.
- It follows that there is a very large number of potential hypotheses we *could* test against the data. (Note: this is not an argument against exploring the hypothesis space in order to choose a better baseline on theoretical grounds!)
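The first step of this argument, that pure noise yields ‘significant’ results about 1 time in 20, is easy to check by simulation. The sketch below is my own illustration, not Gelman and Loken’s: it draws two samples from the same population many times over and applies a two-proportion *z* test (equivalent to a 2 × 2 χ² test) at the 0.05 level. The sample size, probability and seed are arbitrary assumptions.

```python
import math
import random


def two_proportion_z(f1, n1, f2, n2):
    """z score for the difference between two observed proportions (pooled variance)."""
    p1, p2 = f1 / n1, f2 / n2
    pooled = (f1 + f2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return 0.0 if se == 0 else (p2 - p1) / se


def false_positive_rate(trials=2000, n=200, p=0.3, z_crit=1.959964, seed=1):
    """Proportion of trials reporting a 'significant' difference when none exists."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Both samples are drawn from the same population with probability p.
        f1 = sum(rng.random() < p for _ in range(n))
        f2 = sum(rng.random() < p for _ in range(n))
        if abs(two_proportion_z(f1, n, f2, n)) > z_crit:
            hits += 1
    return hits / trials


print(false_positive_rate())
```

The printed rate should be close to 0.05, although the exact figure depends on the seed and the number of trials.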

This part of the argument is not very controversial. However, Gelman and Loken’s more provocative claim is as follows.

- Few researchers would admit to running very many tests against data and reporting results, which the authors term ‘fishing’ for significant results, or ‘p-hacking’. There are some algorithms that do this (multivariate logistic regression anyone?), but most research is not like this.
- Unfortunately, the authors argue, **standard post-hoc analysis methods – exploring data, graphing results and reporting significant results – do much the same thing.** We dispense with blind alleys (what they call ‘forking paths’), because we can see that they are not likely to produce significant results. Although we don’t actually run these dead-end tests, for mathematical purposes *our educated eyeballing of data to focus on interesting phenomena has done the same thing*.
- As a result, we overestimate the robustness of our results, and often, they fail to replicate.

Gelman and Loken are not alone in making this criticism. Cumming (2014) objects to ‘NHST’ (null hypothesis significance testing), interpreted as an imperative that

“explains selective publication, motivates data selection and tweaking until the *p* value is sufficiently small, and deludes us into thinking that any finding that meets the criterion of statistical significance is true and does not require replication.”

Since it would be unfair to criticise others for a problem that my own work may be prone to, let us consider the following graph that we used while writing Bowie and Wallis (2016). The graph does not appear in the final version of the paper – not because we didn’t like it, but because we decided to adopt a different baseline in breaking down an overall pattern of change into sub-components. But it is typical of the kind of graph we might be interested in examining.

There are two critical questions that follow from Gelman and Loken’s critique.

- *In plotting this kind of graph and reporting confidence intervals, are we misrepresenting the level of certainty found in the graph?*
- *Are we engaging in, or encouraging, retrospective cherry-picking of contrasts between observations and confidence intervals?*

In the following graph there are 19 decades and 5 trend lines, i.e. 95 confidence intervals. There are 171 × 5 potential pairwise comparisons along the trend lines, and 10 × 19 vertical pairwise comparisons between them. So there are, let’s say, 1,045 potential statistical pairwise tests which would be reasonable to carry out. With a 1 in 20 error rate, we might expect some 52 ‘significant’ pairwise comparisons to be incapable of replication.
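The arithmetic behind these figures is a simple count of pairs. A quick check in Python, using only the numbers given in the paragraph above:

```python
from math import comb

decades, trend_lines = 19, 5
error_rate = 1 / 20

# Comparisons between any two decades on the same trend line: C(19, 2) = 171 per line.
along_lines = comb(decades, 2) * trend_lines

# Comparisons between any two trend lines within the same decade: C(5, 2) = 10 per decade.
between_lines = comb(trend_lines, 2) * decades

total_tests = along_lines + between_lines

# Number of tests expected to be 'significant' by chance alone at this error rate.
expected_by_chance = total_tests * error_rate

print(along_lines, between_lines, total_tests, expected_by_chance)
```

This gives 855 + 190 = 1,045 tests, and 1,045 × 0.05 = 52.25.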

Gelman, Loken, Cumming *et al.* would argue that by selecting a few statistically significant claims from this graph, we have committed precisely the error they object to.

However, I have to defend this graph, and others like it, by arguing that **this is not our method**. We don’t sift through 1,045 possible comparisons and then report significant results selectively! In the paper, and in our work more generally, we really don’t encourage this kind of cherry-picking (the human equivalent of over-fitting). We are more concerned with the overall patterns that we see, general trends, etc., which are more likely to be replicable in broad terms.

Thus, for example, in that paper we don’t pull out specific significant pairwise comparisons to make strong claims. In this particular graph we can see an apparently statistically significant sharp decline between 1900 and 1930 in the tendency of writers to use the verb SAY (as in *he is said to have stayed behind*) before a *to-*infinitive perfect, compared to the other verbs in the group. This observation may be replicable, but **the conclusions of the paper do not depend on this observation**. This claim, and similar claims, do not appear in the paper.

Similarly, if we turn back to Neil Millar’s modals-per-million-word data for a moment, Bowie’s observation that the data does not show a consistent increase over time is interesting. Millar did not select the time period in order to report that modals were on the increase – on the contrary, he non-arbitrarily took the start and end point of the timeframe sampled. But the conclusion that ‘modals increased over the entire period’ was only one statement that described the data. In shorter periods there was a significant fall, and different modal verbs behaved differently. Indeed, the complexity of his results is best summed up by the detailed graphs within his paper!

**In conclusion:** it is better to present and discuss the pattern, not just the end point – or the slogan.

Nonetheless we may still have the sneaking suspicion that what we are doing is a kind of researcher bias. We tend to report statistically significant results and ignore those inconvenient non-significant ones. The fear is that results assumed to be due to chance 1 in 20 times are more likely due to chance 1 in 5 times (say), simply because we have – inadvertently and unconsciously – already preselected our data and methods to obtain significant results. Some highly experienced researchers have suggested that we fix this problem by adopting tougher error levels – adopt a 1 in 100 level and we might arrive at 1 in 25. The problem is that this assumes we know the appropriate multiplier to apply.
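One way to see why the multiplier matters is to treat it as a number of implicit tests. The sketch below assumes *m* independent tests, each at error rate α, which is my simplification rather than anything claimed by the researchers mentioned above. Under that assumption the chance of at least one false positive is 1 – (1 – α)^*m*.

```python
def familywise_error(alpha, m):
    """Chance of at least one false positive across m independent tests at rate alpha."""
    return 1 - (1 - alpha) ** m


for m in (1, 4, 10):
    print(m, round(familywise_error(0.05, m), 4), round(familywise_error(0.01, m), 4))
```

With *m* = 4, a nominal 1 in 20 rate becomes roughly 1 in 5 (0.1855) and a nominal 1 in 100 rate becomes roughly 1 in 25 (0.0394), in line with the figures above. With *m* = 10 the same thresholds give quite different answers, which is exactly the difficulty: we do not know *m*.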

Gelman and Loken suggest instead that published studies should always involve a replication process. They argue it is preferable that researchers publish half as many experiments and include a replication step than publish non-replicable results.

**Suggested method:** Before you start, create two random subcorpora A and B by randomly drawing texts from the corpus and assigning them to A and B in turn. You may wish to control for balance, e.g. to ensure subsampling is drawn equitably from each genre category. Perform the study on A, and summarise the results. Without changing a single query, variable or analysis step, apply exactly the same analysis to B.
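The partitioning step might be sketched as follows. This is a minimal illustration, assuming each text is described by an identifier and a genre label; the function name `split_corpus` and the data layout are hypothetical and not part of ICECUP or any other tool.

```python
import random
from collections import defaultdict


def split_corpus(texts, seed=0):
    """Split (text_id, genre) pairs into subcorpora A and B, balanced within each genre."""
    rng = random.Random(seed)

    # Group text identifiers by genre so that each genre is divided equitably.
    by_genre = defaultdict(list)
    for text_id, genre in texts:
        by_genre[genre].append(text_id)

    a, b = [], []
    turn = 0  # carried across genres so that odd-sized genres do not all favour A
    for genre in sorted(by_genre):
        ids = by_genre[genre]
        rng.shuffle(ids)  # random order within the genre
        for text_id in ids:
            (a if turn % 2 == 0 else b).append(text_id)
            turn += 1
    return a, b


# Usage with a hypothetical corpus of 10 spoken and 7 written texts.
texts = [(f"s{i}", "spoken") for i in range(10)] + [(f"w{i}", "written") for i in range(7)]
subcorpus_a, subcorpus_b = split_corpus(texts, seed=42)
```

Fixing the seed means the split itself can be reported and reproduced by others.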

Do we get compatible results, i.e. *results that fall within the confidence intervals of the first experiment*? More precisely, are the results statistically separable?

An alternative to formal replication is to repeat the experiment with well-defined, as distinct from randomly generated, subcorpora.

**Sampling subcorpora:** Suppose you apply an analysis to spoken data in ICE-GB, and then repeat it with written data. Do we get broadly similar results? If we obtain comparable results for two subcorpora with a known difference in sampling, it is probable they would pass a replication test where two subsamples were not sampled differently. On the other hand, if results *are* different, this would justify further investigation.

Even where replication is not carried out (for reasons of insufficient data, perhaps), an uncontroversial corollary of this argument is that your research method should be sufficiently transparent so that it can be replicated by others.

As a general principle, authors should make raw frequency data available to permit a reanalysis by other analysis methods. I find it frustrating when papers publish per million word frequencies in tables, when what is needed for a reanalysis is raw frequency data!

Another of Gelman and Loken’s recommendations is that researchers need to spend more time focusing on sizes of effect, rather than just reporting statistical significance. With lots of data and large effect sizes, the problem is reduced. Certainly we should be wary of citing just-significant results with a small effect size.

Where does this leave the arguments I have made elsewhere in favour of visualising data with confidence intervals? One of the implications of the ‘forking paths’ argument is that we tend not to report dead-end, non-significant results. But well considered graphs can visualise all data in a given frame, rather than selected data (of course we have to ‘frame’ this data, select variables, etc.).

One advantage of graphing data with confidence intervals is that we apply the same criteria to all data points and allow the reader to interpret the graph. Significant and non-significant contrasts are available to be viewed. We also visualise effect sizes and the weight of evidence (confidence intervals), even if it is arguable that our model is insufficiently conservative.

Thus a strength of Millar’s paper is the reporting of trends and graphs. In the graph above, the confidence intervals improve our understanding of the overall trends we see.

We just should not assume that every significant difference will be replicable.

This is really one of mine, but I suggest it is implicit in the argument above.

It seems to me to be an absolutely essential requirement for any empirical scientist to play devil’s advocate to their own hypothesis.

That is, it is not sufficient to ‘find something interesting in data’, and publish. What we are really trying to do is detect meaningful phenomena in data, or to put it another way, we are trying to find robust evidence of phenomena that have implications for linguistic theory. We are trying to move from observed correlation to a hypothesised underlying cause.

Statistics is a tool to help us do this. But logic also plays an essential part.

Without wishing to create a checklist for empirical linguistics (such that a researcher is convinced of the validity of their results simply because they can tick off the list), we might argue that the following steps are necessary in all empirical research.

- **Identify the underlying research question**, framed in general theoretical terms.
- **Operationalise the research question** as a series of testable hypotheses or predictions, and evaluate them. Plot graphs! Visualising data with confidence intervals allows us to visualise expected variation and make more robust claims.
- **Focus reporting on global patterns** across the entire dataset. If your research ends up prioritising an apparently unusual local pattern in a selected part of the data, consider whether this may be an artefact of sampling.
- **Critique the results of this evaluation** in terms of the original research question, and play devil’s advocate: what other possible underlying explanations might there be for the observed results?
- **Consider alternative hypotheses** and test them. Try to design new experiments to separate out different possible explanations for the observed phenomenon.
- **Plan to include a replication step** prior to publication. This means being prepared to partition the data in the way described above, dividing the corpus into different pools of source texts.

Whether or not Gelman and Loken’s argument applies to your corpus linguistics study – and we have to eliminate basic errors first – the principal conclusion is that it is difficult to overstate the importance of **reporting accuracy and transparency**. If the study does not appear to replicate in the future, possible reasons must be capable of exploration by future researchers. It would not have been possible to explore the differences between Leech and Millar’s data had Neil Millar simply summarised a few trends and reported some statistically significant findings.

It is incumbent on all of us to properly describe the limitations of data and sampling; definitions of variables and abstraction (query) methods for populating them; as well as graphing data to reveal both significant and non-significant patterns at the same time.

A typical mistake is to refer to ‘British English’ (say) as a shorthand for ‘data drawn from British English texts sampled according to the sampling frame defined in Section 3’. Many failures to replicate in psychology can be attributed to precisely this type of logical error – that the experimental dataset is not a reliable model for the population claimed.

Finally, Cumming (2014) makes an important distinction between **exploratory research** and **prespecified research**. Corpus linguistics is almost inevitably exploratory, as it is impossible to prespecify data collection in post-hoc analysis. In a natural experiment we cannot control for confounding variables, and we must frame our conclusions accordingly.

Bowie, J., Wallis, S.A. and Aarts, B. 2013. Contemporary change in modal usage in spoken British English: mapping the impact of “genre”. In Marín-Arrese, J.I., Carretero, M., Arús H.J. and van der Auwera, J. (eds.) *English Modality*, Berlin: De Gruyter, 57-94.

Bowie, J. and Wallis, S.A. 2016. The *to*-infinitival perfect: A study of decline. In Werner, V., Seoane, E., and Suárez-Gómez, C. (eds.) *Re-assessing the Present Perfect*, Topics in English Linguistics (TiEL) 91. Berlin: De Gruyter, 43-94.

Cumming, G. 2014. The New Statistics: Why and How, *Psychological Science*, 25(1), 7-29.

Gelman, A. and Loken, E. 2013. The garden of forking paths. Columbia University. **»** ePublished.

Leech, G. 2011. The modals ARE declining: reply to Neil Millar’s ‘Modal verbs in TIME: frequency changes 1923–2006’. *International Journal of Corpus Linguistics* 16(4).

Millar, N. 2009. Modal verbs in TIME: frequency changes 1923–2006. *International Journal of Corpus Linguistics* 14(2), 191–220.


One of the longest-running, and in many respects the least helpful, methodological debates in corpus linguistics concerns the spat between so-called **corpus-driven** and **corpus-based** linguists.

I say that this has been largely unhelpful because it has encouraged a dichotomy which is almost certainly false, and the focus on whether it is ‘right’ to work from corpus data upwards towards theory, or from theory downwards towards text, distracts from some serious methodological challenges we need to consider (see other posts on this blog).

Usually this discussion reviews the achievements of the most well-known corpus-driven linguist, John Sinclair, in building the *Collins Cobuild Corpus*, and deriving the *Collins Cobuild Dictionary* (Sinclair *et al*. 1987) and *Grammar* (Sinclair *et al*. 1990) from it.

**In this post I propose an alternative examination.**

I want to suggest that *the greatest success story for corpus-driven research is the development of part-of-speech taggers* (usually called a ‘POS-tagger’ or simply ‘tagger’) trained on corpus data.

These are industrial-strength, reliable algorithms that obtain good results with minimal assumptions about language.

So, *who needs theory?*

Taggers consist of two parts:

- **a ‘learning’ algorithm** that collects rules from training data, and
- **a ‘tagging’ algorithm** which applies rules to new texts to classify words by their part of speech (word class).

The corpus-driven aspect is the ‘learning’ algorithm.

A typical rule might be that if the word *old* (which can be a noun/nominal adjective, as in *the old*, or adjective, *the old man*) is followed by a noun, then *old* is more likely to be an adjective than otherwise.

The tagging algorithm takes a sentence and applies these rules like a crossword solver. It classifies the words that it is most certain of before considering those it is less confident about. Thus, in *the old man*, *the* is unambiguously a determiner, whereas both *old* and *man* can belong to more than one word class.
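The ‘crossword solver’ strategy can be illustrated with a toy example. This is a deliberately crude sketch, not the algorithm of any particular tagger: the lexicon, its probabilities, the tag names and the contextual weights are all hypothetical values invented for the example.

```python
# Hypothetical lexicon: word -> {tag: probability}. All figures are invented.
LEXICON = {
    "the": {"DET": 1.0},
    "old": {"ADJ": 0.8, "N": 0.2},
    "man": {"N": 0.95, "V": 0.05},
}

# Hypothetical contextual weights for (tag, tag of following word),
# standing in for the rule that 'old' before a noun is probably an adjective.
FOLLOWED_BY = {("ADJ", "N"): 2.0, ("N", "N"): 0.5}


def tag_sentence(words):
    """Tag the most certain words first, then use them as context for the rest."""
    tags = [None] * len(words)

    # Visit positions in order of decreasing lexical certainty.
    order = sorted(range(len(words)), key=lambda i: -max(LEXICON[words[i]].values()))

    for i in order:
        # Tag of the following word, if it has already been decided.
        following = tags[i + 1] if i + 1 < len(words) else None
        scores = {
            tag: prob * FOLLOWED_BY.get((tag, following), 1.0)
            for tag, prob in LEXICON[words[i]].items()
        }
        tags[i] = max(scores, key=scores.get)
    return list(zip(words, tags))


print(tag_sentence(["the", "old", "man"]))
```

Here *the* is tagged first, then *man* (the more certain of the two ambiguous words), and only then *old*, which is resolved as an adjective because a noun follows. Words missing from the toy lexicon would raise a `KeyError`; a real tagger needs a strategy for unknown words.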

The learning algorithm generates summary statistics bottom-up from training data it is given, which are lots of sentences/texts which have already been tagged with the same part of speech scheme (i.e., a corpus).

It is not necessary to make many assumptions about the grammar of the language we are working with to obtain results comparable to the best reported in the literature. The computer does not need to ‘know’ what a noun or a verb is. It can simply obtain statistics about these different categories from the corpus.

But these algorithms *do* embody some assumptions about their language input. These assumptions can be enumerated as follows, although different classification schemes might vary in some details:

1. language consists of **sentences** divided into lexical **words**;
2. each **sentence** is capable of being analysed separately;
3. **words** include part-words such as genitive markers and cliticised words, and compounds, where multiple words can be given the same tag;
4. there is a fixed set of **word class tags** that each particular instance of a word can be categorised by – these commonly consist of word class category (noun, verb, etc.), plus secondary information (plural proper noun, copular verb, etc.);
5. these tags were correctly applied to the **training data**.

Databases extracted by the learning algorithm typically consist of **frequency distributions** for every word-tag pattern, i.e. the number of cases in the training corpus where a given lexical word has a particular tag; and **transition probabilities** for each word-tag pattern if words have more than one tag.
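A bare-bones version of such a database can be built in a few lines. The sketch below is a simplification under stated assumptions: it records tag-to-tag transitions rather than transitions for each word-tag pattern, and the three-sentence ‘corpus’ and the tag names are invented for illustration.

```python
from collections import Counter, defaultdict


def learn(tagged_sentences):
    """Collect word-tag frequencies and tag-to-tag transition counts from tagged text."""
    word_tag = defaultdict(Counter)     # word -> frequency of each tag
    transitions = defaultdict(Counter)  # previous tag -> frequency of each next tag
    for sentence in tagged_sentences:
        previous = "<s>"  # marker for the start of a sentence
        for word, tag in sentence:
            word_tag[word.lower()][tag] += 1
            transitions[previous][tag] += 1
            previous = tag
    return word_tag, transitions


def transition_probability(transitions, previous, tag):
    """Relative frequency of tag following previous in the training data."""
    total = sum(transitions[previous].values())
    return transitions[previous][tag] / total if total else 0.0


# A tiny invented training 'corpus' of tagged sentences.
corpus = [
    [("the", "DET"), ("old", "ADJ"), ("man", "N")],
    [("the", "DET"), ("old", "N")],
    [("the", "DET"), ("man", "N")],
]
word_tag, transitions = learn(corpus)
print(dict(word_tag["old"]), transition_probability(transitions, "DET", "N"))
```

Note that nothing in `learn` ‘knows’ what a noun or determiner is: the tags are just labels whose frequencies are counted.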

The performance of these linguistically unsophisticated algorithms is striking. **A typical tagger trained on a million words of English using a standard set of tags will make the correct decision for new sentences of a similar type some 95% of the time.**

Different algorithms may vary in storage efficiency. My crude simulated annealing stochastic tagger (Wallis 2012), which stores transition probabilities exhaustively, is less space-efficient than Eric Brill’s patch tagger (Brill 1992). *However, they obtain similar results.*

The remaining 5% of residual incorrect examples tend to be cases that are idiomatic, or are part of a multi-word string of ambiguous words, or are a result of weaknesses in the training data.

To address these weaknesses we can make a number of improvements.

- **Store a finite set of idioms, strings or compounds.** This is a bit clumsy and *ad hoc*, doesn’t scale well, but can actually improve performance.
- **Add modules to the database and algorithm.** The Brill tagger employs some simple *ad hoc* regular morphology detection at an initial stage. A more thorough approach might consist of a morphological model of ‘lemmatisation’ (identifying word stems and affixes, e.g. *re-educated* → *re-* + *educate* + *-ed*). The advantage of this step is that even if we don’t have the word *re-educated* in our training set we can recognise *educate* as a verb and the entire word as a gerund noun or verb. Generalisation allows us to pool statistics, so we can have more reliable rules, and compress information, so we don’t have to store separate statistics for every single word.
- **Create a more general type of rule.** The rules we have described were tied to particular words, such as *old*. It would be more efficient if we had a rule that said something like ‘for any word capable of being either an adjective or a noun, if it is followed by an adjective or noun, then it is likely to be an adjective.’ *Note that to create such a rule we have to look for it* (this is precisely what the Brill tagger does).

But now let us consider where this path has taken us. Every step we have proposed to improve the performance of this corpus-driven algorithm requires the insertion of knowledge about idioms, morphology and grammar, top-down, into the algorithm.

A methodological corpus-driven purism that stated that we must work exclusively bottom-up was a little disingenuous, because we had to employ auxiliary assumptions (1) to (5) above from the start.

But now every improvement we wish to make requires further theoretical assumptions. It turns out that it is not possible to perform part-of-speech tagging without assumptions, and to improve the algorithm we need more theory.

Finally, whereas the learning algorithm might work bottom-up, the tagging algorithm itself works top-down, in that it applies its knowledge base of word-tag probabilities to new corpus data.

I have the utmost respect for corpus-driven linguists. The discipline of examining data with minimal assumptions is absolutely crucial! All scientists have to examine the data *as it is*, not compartmentalise it according to pre-given assumptions.

Over the years I have written extensively on not taking queries for granted, and directed corpus researchers to continually review the underlying sentences from which their statistics are derived.

However, it is simply not possible to work without *any* assumptions, even when building a bottom-up computer algorithm like a part-of-speech tagger.

So I would conclude that corpus-based research is properly located as part of a larger research cycle, in which it is valid and reasonable to work bottom-up and top-down at different times. Corpus-driven research methods are part of a family of exploratory methods from which all corpus linguists should draw. Insights from computationally-obtained summary statistics (whether from collocations, *n*-grams, phrase frames, indexes, or databases of part of speech taggers) are important resources for further research.

But insisting that the only legitimate corpus methods are bottom-up prevents us carrying out research with a corpus which asks questions that are inevitably framed by a particular theory.

Brill, E. 1992. A simple rule-based part of speech tagger. In *Proceedings of the third conference on applied natural language processing* (ANLC ’92). Association for Computational Linguistics, Stroudsburg, PA, USA, 152-155.

Sinclair, J., Hanks, P., Fox, G., Moon, R. and Stock, P. and others, 1987 (eds.), Collins *Cobuild English Language Dictionary*, London: Collins.

Sinclair, J., Fox, G., Bullon, S., Krishnamurthy, R., Manning, E., Todd, J. and others, 1990 (eds.) *Collins Cobuild English Grammar*, London: Collins.

Wallis, S.A. 2012. *Tagging ICE Philippines and other corpora*. London: Survey of English Usage.


When the entire premise of your methodology is publicly challenged by one of the most pre-eminent figures in an overarching discipline, it seems wise to have a defence. Noam Chomsky’s famous objection to corpus linguistics therefore needs a serious response.

“One of the big insights of the scientific revolution, of modern science, at least since the seventeenth century… is that arrangement of data isn’t going to get you anywhere. You have to ask probing questions of nature. That’s what is called experimentation, and then you may get some answers that mean something. Otherwise you just get junk.” (Noam Chomsky, quoted in Aarts 2001).

Chomsky has consistently argued that the systematic *ex post facto* analysis of natural language sentence data is incapable of taking theoretical linguistics forward. In other words, corpus linguistics is a waste of time, because it is capable of focusing only on external phenomena of language – what Chomsky has at various times described as ‘e-language’.

Instead we should concentrate our efforts on developing new theoretical explanations for the internal language within the mind (‘i-language’). Over the years the terminology varied, but the argument has remained the same: real linguistics is the study of i-language, not e-language. Corpus linguistics studies e-language. Ergo, it is a waste of time.

Chomsky refers to what he calls ‘the Galilean Style’ to make his case. This is the argument that it is necessary to engage in theoretical abstractions in order to analyse complex data. “[P]hysicists ‘give a higher degree of reality’ to the mathematical models of the universe that they construct than to ‘the ordinary world of sensation’” (Chomsky, 2002: 98). We need a theory in order to make sense of data, as so-called ‘unfiltered’ data is open to an infinite number of possible interpretations.

In the Aristotelian model of the universe the sun orbited the earth. The same data, reframed by the Copernican model, was explained by the rotation of the earth. However, the Copernican model of the universe was not arrived at by theoretical generalisation alone, but by a combination of theory and observation.

Chomsky’s first argument contains a kernel of truth. The following statement is taken for granted across all scientific disciplines: **you need theory to analyse data**. To put it another way, there is no such thing as an ‘assumption free’ science. But the second part of this argument, that the necessity of theory permits scientists to dispense with engagement with data (or even allows them to dismiss data wholesale), is not a characterisation of the scientific method that modern scientists would recognise. Indeed, Beheme (2016) argues that this method is also a mischaracterisation of Galileo’s method. Galileo’s particular fame, and his persecution, came from one source: the observations he made through his telescope.

In astronomy it is necessary to build physical theories of the universe to make sense of observed data. Astronomical science must proceed by a process of theory building, attempting to account for observations within the theoretical framework. Moreover, rather than relying on naive Popperian refutation (abandoning a theory if one observation appears to contradict the theory), science tends to rely on **triangulation** (approaching the same theoretical generalisation from multiple sources and directions), and **pluralism**, i.e. the existence of competing theories such that if one fails another may replace it (Putnam 1974). Triangulation may also mean designing new experiments to test theoretical predictions as technology advances – such as viewing the earth from space, or placing atomic clocks on airliners to test special relativity.

Arguing for the necessity of theory is not an argument against corpus linguistics *per se*, but it is an argument against a particular type of corpus linguistics practice. The ‘Birmingham School’ of corpus linguistics, most associated with John Sinclair, has prided itself on making minimal theoretical assumptions and working bottom-up from words themselves. Some of the results of this approach are impressive. However,

- this type of corpus linguistics is not theory neutral or assumption free (e.g. we assume that *w*₁, *w*₂ are words, and a word is a linguistically meaningful unit);
- the process of validating theoretical generalisations entails a linguistic decision based on an external theory (e.g. there exists a distinct wordclass termed ‘adjective’);
- once theoretical generalisations are derived bottom-up (e.g. cases of *w*₁, *w*₂, etc. are members of the set of adjectives), we arrive at a methodological paradox.

Sinclair’s methodological paradox is simply this: if it is true that statements of the kind ‘*w*₁ is an adjective’ are linguistically valuable, then it follows that when analysing new data, we should exploit this new knowledge. However, Sinclair’s method is to work inductively from new data without making such *a priori* assumptions. Either he has to dispense with his previous conclusions, and start from scratch, or he has to change his method.

In conclusion, the argument that you need theory to interpret data, because data has multiple possible interpretations, is correct. However, this statement does not extend to permitting scientists to select data to fit their theory. Awkward and challenging results may not be ignored.

Moreover, if Chomsky’s argument were correct, no scientific field would ever arrive at a dominant scientific model. Every scientist could adopt different theoretical frameworks and premises because there was no agreed process for either refuting a theory or determining the outcome of competition between theories. Science has a pattern of both pluralistic competitive research *and* consensus-forming around ‘strong theories’. Chomsky’s characterisation of science may be a description of the fractious state of linguistics, but it departs from the scientific method.

I would suggest that it would be preferable to make linguistics more like science, rather than to make science more like linguistics.

Chomsky’s second argument is that the process of translation from internal to external language is subject to error. Consequently, studying e-language is not a productive way to study i-language. We need to study i-language, therefore we should reject corpus data.

This argument has been more influential than the first.

It also appears to be a reasonable criticism of a certain kind of corpus linguistics. Corpus linguistics has tended to focus on word frequencies, which, in the absence of a theoretical interpretation as to *why* certain forms might be more frequent than others, simply becomes descriptive. Chomsky can reasonably summarise this as studying the epiphenomena of linguistics.

By contrast, theoretical linguists have tended to use an introspective method (backed up occasionally with second-party elicitation) on the grammatical acceptability of test sentences. This is a scholastic approach drawn from traditional prescriptive grammars. The method contains a significant subjective element, even when data is drawn from elicitation experiments with large numbers of test subjects. Direct introspection simply tells us that we *believe* a sentence to be ‘grammatical’.

Could this type of research question be posed with corpus data? No, but corpus linguists do not have to dispense with introspective insight. Corpus linguists are linguists too!

Moving from million-word to billion-word POS-tagged corpora has not generated greater insight, merely more robust results. However, this observation is properly a criticism of the research foci of much corpus linguistics as practised. (I would argue that this is a limitation of POS-tagged corpus research.) It is not an argument against corpus *data*.

However, there are two reasons why Chomsky’s second argument cannot hold. The first is what we might call **the ‘linguists are not God’ reason**.

Linguists do not have special access to i-language data. Their data is from introspection, elicitation or even corpora. But *this* data is also external language! If there were no systematic mapping between i-language and e-language within an individual, ‘i-linguistics’ would not be possible.

Chomsky and his followers could theorise about any number of internal models. But they could never choose between them except by appealing to some general abstract principle, such as Occam’s razor (simplicity). Introspection and experiment cannot penetrate the question because *all* linguistic data is in fact e-language data.

The best, most robust, carefully-obtained data from uncued experimental settings is still e-language. It may be collected in a more focused (and artificial) way than corpus data, but it is also no more ‘internal’ than corpus data. Introspection data elicited from experiments may elicit subjective grammatical expectations, but results are no more scientific than those from any other scientists’ introspection. Physicists do not despair of their equipment and resort to interviewing their peers! Perhaps linguists should follow their lead.

The second counter-argument is that the process of articulating i-language as e-language is a *cognitive* one, that is, it takes place through cognitive processes in the mind. According to Chomsky, this process exposes the pure i-language to the distorting prism of articulation, and thereby makes e-language unreliable data.

However, if this were true, the same objection would necessarily be true for the generation of i-language in the first place. **If articulation of e-language is subject to error, the generation of i-language itself is also bound to be error-prone.**

Random variation, cultural bias, personal preference, processing interference, etc, can take place at either stage, because these phenomena are artefacts of actual neurological pathways. Different types of error may arise at different locations, but there is no special error-free part of the brain. Speakers under the influence of alcohol have confused thoughts *and* slur their words. Alcohol, like error, is not selective.

On the other hand, a number of corpus linguists, including Geoffrey Leech, have commented on the regular ‘grammaticality’ of even the most informal spontaneous speech data. This observation should not be surprising – if speech data did not follow grammatical rules, speakers would not understand each other, and, given the historical and ontological primacy of speech over writing, language could never develop!

There may be noise in the signal, but the signal is not exclusively noise. We should not give up on corpora just yet.

Corpus data is simply uncued natural language data (sometimes termed ‘ecological’ data) as distinct from data obtained in an experimental setting. The key advantage of experimental data is that a researcher can manipulate variables under investigation and avoid variation in potentially confounding variables while obtaining data. A secondary advantage may be that one can construct a setting that provides a high frequency of sought-after phenomena that might otherwise be rare in a corpus. The disadvantages are the risk that the experimental conditions obtained are artificial (and possibly artificially *cued*), and the cost of obtaining and annotating data.

A corpus could contain experimental data, or data obtained by experiment could be annotated to the same level as a parsed corpus such as ICE-GB. These methods are not in competition but are complementary. A corpus can provide test data for experiments, identify potentially worth-while experiments, and provide a control for experimental outcomes.

Corpus linguistics offers three kinds of evidence to a theoretical linguist – factual evidence that phenomena exist, evidence of frequency and distribution, and ‘interaction evidence’ pertaining to the co-occurrence of phenomena (Wallis 2014).

There is no need to discount corpora as a lesser source, or one more likely to be tainted by error than other sources. It is a *different* source of evidence, one that requires due methodological care, but one that has the potential for both the evaluation of theory against real-world natural language and robust statistical evaluation.

If data can only be studied by first relating it to a theory, then theoretical linguists first need to pay attention to how corpora are annotated. Do corpora contain useful representations for linguistic research? Are phenomena of interest to linguists capable of being captured within the corpus?

‘Annotation’ is the process of systematically applying a theoretical description to all the texts in a corpus. A decision to annotate instances of a particular phenomenon entails significant effort. All such instances in the corpus must be identified, and each decision must be properly motivated. Like classification schemes in science (e.g. the periodic table), linguistic phenomena are not simply identified, but related within a coherent annotation scheme. It follows that the entire scheme must be linguistically defended and systematically applied.

Syntacticians should pay particular attention to parsed corpora. It follows that if linguists are studying grammar then grammatically analysed corpora (‘parsed corpora’ or ‘treebanks’) are likely to be much more valuable than corpora with part-of-speech wordclass tags applied to each word. However, there is wide disagreement between theoretical linguists as to which grammatical scheme is optimal.

Inevitably the effort of annotation means that one has to choose a particular scheme at a particular point in time and systematically apply it. This poses a problem for researchers using the corpus. If they are stuck in a ‘hermeneutic trap’, only able to pose research questions within the annotation framework, and engage in circular reasoning, then corpus linguistics has a serious problem. After the huge effort of annotation you can only please a small number of linguists!

The solution to this problem offered by Wallis and Nelson (2001) is ‘abstraction’ – a process of reinterpretation of the annotated sentences from the representation in the corpus to the preferred representation of the linguist researcher, which takes place during the research process itself. Linguists do not have to accept the theoretical framework applied to a corpus in order to use it. Instead, the corpus representation is considered simply as a ‘handle on the data’, a method for systematically obtaining data across a corpus. It is not necessary to accept the framework uncritically.

In practice this means that researchers might find themselves constructing logical combinations of structural queries to retrieve a dataset aligned to their research theory and goals. But this is a small price to pay for having a grammatical framework already applied and evaluated against corpus data.

Finally, abstraction is not an end goal but a means to obtaining an abstracted dataset expressed in terms commensurate with the theoretical demands of the researcher. It is this dataset that may then be subject to a third process, one we refer to as ‘analysis’, hence the ‘3A’ model of corpus linguistics, distinguishing the stages of annotation, abstraction and analysis.

Aarts, B. 2001. Corpus linguistics, Chomsky and Fuzzy Tree Fragments. In: C. Mair and M. Hundt (eds.) *Corpus linguistics and linguistic theory*. Amsterdam: Rodopi. 5-13.

Beheme, C. 2016. How Galilean is the ‘Galilean Method’? *History and Philosophy of the Language Sciences*, http://hiphilangsci.net/2016/04/02/how-galilean

Chomsky, N. 2002. *On Nature and Language*. Cambridge: Cambridge University Press.

Putnam, H. 1974. The ‘Corroboration’ of Scientific Theories, republished in Hacking, I. (ed.) (1981), *Scientific Revolutions*, Oxford Readings in Philosophy, Oxford: OUP. 60-79.

Wallis, S.A. 2014. What might a corpus of parsed spoken data tell us about language? In L. Veselovská and M. Janebová (eds.) *Complex Visibles Out There*. Olomouc: Palacký University. 641-662.

Wallis, S.A. and Nelson G. 2001. Knowledge discovery in grammatically analysed corpora. *Data Mining and Knowledge Discovery*, **5**: 307–340.


Recently I’ve been working on a problem that besets researchers in corpus linguistics who work with samples which are not drawn randomly from the population but rather are taken from a series of sub-samples. These sub-samples (in our case, texts) may be randomly drawn, but we cannot say the same for any two cases drawn from the same sub-sample. It stands to reason that two cases taken from the same sub-sample are more likely to share a characteristic under study than two cases drawn entirely at random. I introduce the paper elsewhere on my blog.

In this post I want to focus on an interesting and non-trivial result I needed to address along the way. This concerns the concept of **variance** as it applies to a Binomial distribution.

Most students are familiar with the concept of variance as it applies to a Gaussian (Normal) distribution. A Normal distribution is a continuous symmetric ‘bell-curve’ distribution defined by two variables, the **mean** and the **standard deviation** (the square root of the variance). The mean specifies the position of the centre of the distribution and the standard deviation specifies the width of the distribution.

Common statistical methods on Binomial variables, from χ² tests to line fitting, employ a further step. They approximate the Binomial distribution to the Normal distribution. They say, *although we know this variable is Binomially distributed, let us assume the distribution is approximately Normal*. The variance of the Binomial distribution becomes the variance of the equivalent Normal distribution.

In this methodological tradition, the variance of the Binomial distribution loses its meaning with respect to the Binomial distribution itself. It seems to be only valuable insofar as it allows us to parameterise the equivalent Normal distribution.

What I want to argue is that in fact, the concept of the variance of a Binomial distribution is important in its own right, and we need to understand it with respect to the Binomial distribution, not the Normal distribution. Sometimes it is not necessary to approximate the Binomial to the Normal, and if we can avoid this approximation our results are likely to be stronger as a result.

Every fundamental primer in statistics approaches the problem in the following way.

A Binomial variable is a two-valued variable (hence ‘bi-nomial’). The values can be anything, but following coin-tossing tradition let us simply call them ‘heads’ and ‘tails’. The proportion of cases that are heads in any randomly-drawn sample of size *n* taken from a population, which we might term *p*, is free to vary from 0 to 1. That is, all *n* cases in the sample may be heads (*p* = 1) or all may be tails (*p* = 0).

Now, suppose we know, Zeus-like, the actual proportion in the population, *P.* We don’t *have* to be a deity – we might assume that our coin is unbiased so *P* = 0.5 (heads and tails are equally probable) – but a common error is when people get big *P* (true value in the population) and little *p* (observed value in a sample) muddled up. Let’s leave observed *p* aside for a minute.

We can calculate the distribution for *P* and *n* using the following Binomial formula:

*Binomial distribution* $B(r) = nCr \cdot P^{r} (1 - P)^{n - r}$, (1)

where *r* ranges from 0 to *n*. This means that the probability of obtaining exactly *r* heads out of *n* coin tosses is calculated by multiplying

- the **combinatorial function** *nCr* (the number of unique ways we can obtain exactly *r* cases out of *n* cases);
- the **probability that *r* cases are heads**, $P^{r}$; and
- the **probability that the remainder are tails**, $(1 - P)^{n - r}$.
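The three factors listed above can be multiplied together directly. The following is a minimal sketch, assuming Python 3.8 or later for `math.comb`; the function name `binomial` is my own label, not part of any library.

```python
from math import comb

def binomial(r, n, P):
    """Probability of exactly r heads in n tosses, given population rate P."""
    return comb(n, r) * P**r * (1 - P) ** (n - r)

# Ten tosses of an unbiased coin.
n, P = 10, 0.5
for r in range(n + 1):
    print(r, round(binomial(r, n, P), 4))
```

Changing `P` to 0.9 gives the distribution for the ‘trick’ coin without any other alteration to the code.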

This formula obtains the ideal Binomial distribution.

The graph below shows what this looks like for ten tosses of an unbiased coin, where *P* = 0.5 and *n* = 10. The mean of this distribution is *nP*, i.e. 0.5 × 10 = 5.

**Note.** Equation (1) also works for a ‘trick’ coin, e.g. where *P* = 0.9 (9 times out of 10 we obtain heads). Although most primers first show a graph of *P* = 0.5, few real-world Binomial variables are equiprobable. (Don’t be misled by the symmetry of this graph.)

This distribution has a number of important characteristics.

- The most obvious characteristic is that it is **discrete** – the only possible values of *r* are integer values from 0 to *n*. Therefore if we sample 10 coin tosses, an observed probability *p* could be 0, 0.1, 0.2, right up to 1. If the true value of *P* was 0.45, we could not observe *p* = 0.45 if we only had ten coin tosses.
- A less obvious, but important, characteristic is that this distribution is **probabilistic** – the sum of all columns ∑*B*(*r*) = 1.
- Finally, for all values of *P* other than 0.5, the distribution is **asymmetric**. See below.

You can also see how unlikely it is that all coins are heads or all tails. The chance of this happening is not zero, but it is small. There is only one possible combination of heads and tails where all ten coins are heads (HHHHHHHHHH) out of 1,024 ($2^{n}$) possible patterns, so the probability of observing *p* = 1 is 1 in 1,024. By the same reasoning, the probability of observing *p* = 0 (all tails) is also 1 in 1,024.

There are ten ways that one coin will be a tail and nine heads (THHHHHHHHH, HTHHHHHHHH,… HHHHHHHHHT), and so on.
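For a sample this small we can check these counts by brute force. The sketch below simply lists every possible sequence of ten tosses and counts how many contain each number of heads; the variable names are illustrative.

```python
from collections import Counter
from itertools import product

n = 10
# Every possible sequence of n tosses, e.g. ('H', 'T', 'H', ...).
patterns = list(product("HT", repeat=n))
heads_counts = Counter(pattern.count("H") for pattern in patterns)

print(len(patterns))      # total number of patterns, 2**n
print(heads_counts[10])   # patterns where all ten coins are heads
print(heads_counts[9])    # patterns with nine heads and one tail
```

Enumeration becomes impractical very quickly as *n* grows, which is why a formula for counting the patterns is useful.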

The combinatorial function *nCr* tells us exactly how many different ways we can obtain *r* cases out of *n* potential cases. The full formula is given in equation (2) below, where *x*! means the factorial of *x*, or *x*(*x*-1)(*x*-2)…(1).

*combinatorial function* $nCr = \dfrac{n!}{(n - r)!\, r!}$. (2)

You should be able to see that in cases where *r* = 0 or *r* = *n*, *nCr* = 1; where *r* = 1 or *r* = *n*-1, *nCr* = *n*.
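These special cases are easy to confirm numerically. The sketch below implements the factorial formula as written; `n_choose_r` is a hypothetical helper name, and in practice Python's built-in `math.comb` returns the same values.

```python
from math import comb, factorial

def n_choose_r(n, r):
    """Equation (2): n! / ((n - r)! r!), using exact integer division."""
    return factorial(n) // (factorial(n - r) * factorial(r))

n = 10
print([n_choose_r(n, r) for r in range(n + 1)])

# The hand-written formula agrees with the standard library.
assert all(n_choose_r(n, r) == comb(n, r) for r in range(n + 1))
```

The list is symmetric about its middle value, which is why *r* = 1 and *r* = *n* – 1 give the same count.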

If *P* = 0.5 then the Binomial function (1) above becomes simply

$B(r) = nCr \cdot P^{r} (1 - P)^{n - r} = nCr \cdot 0.5^{n}$.

However, the general function is much more flexible. It allows us to consider distributions for different values of *P*. (Again, these are plotted on an integer scale.)

Note that these distributions are clearly asymmetric, being centred at *nP* where *P* < 0.5, and bounded by 0 and *n*. As *P* approaches zero this asymmetry becomes more acute.

Another aspect we can immediately see from the graphs above is that, as *P* approaches zero, the distribution becomes not only less symmetric but also more concentrated. We say that the variance of the distribution decreases.

The variance of a Binomial distribution on the integer scale *r *= 0…*n* can be obtained from the function

*(integer) variance S*² = *nP*(1 – *P*).

To compare different-sized samples, we obviously need to use the same scale. The simplest standardisation is to adopt a probabilistic scale, i.e. where *p *= 0…1. To do this we divide this formula by *n*². The variance of a Binomial distribution on a **probabilistic scale** is obtained from the function

*(probabilistic) variance S*² = *P*(1 – *P*)/*n*. (3)

Thus if *P* = 0.5 and *n* = 10, *S*² = 0.025. If *P* = 0.1 and *n* = 10, *S*² = 0.009. (You shouldn’t need a calculator to work this out!) This formula has the following properties.

- For the same *n* > 1, as *P* tends to zero, *P*(1 – *P*) will also tend to 0. (Consider: if a coin had zero chance of being a head, it will always be a tail!)
- For the same *P* > 0, as *n* increases, *P*(1 – *P*)/*n* decreases. (Obviously if *P* = 0 then *S*² cannot decrease!)
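A short sketch makes the change of scale and these two properties concrete. The function names are my own, and the values of *P* and *n* are arbitrary examples.

```python
def integer_variance(n, P):
    """Variance of r on the integer scale 0..n."""
    return n * P * (1 - P)

def probabilistic_variance(n, P):
    """Equation (3): variance of r/n on the probabilistic scale 0..1."""
    return P * (1 - P) / n

# Dividing the integer-scale variance by n squared gives equation (3).
n, P = 10, 0.5
print(integer_variance(n, P) / n**2, probabilistic_variance(n, P))

# For fixed n, the variance shrinks as P tends to zero.
print([round(probabilistic_variance(10, P), 4) for P in (0.5, 0.3, 0.1, 0.01)])

# For fixed P > 0, the variance shrinks as n increases.
print([round(probabilistic_variance(n, 0.3), 5) for n in (10, 100, 1000)])
```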

Variance is simply the square of the standard deviation of the same distribution:

*standard deviation* $S \equiv \sqrt{P(1 - P)/n}$.

The concepts of variance and standard deviation are usually applied to the **Normal distribution**. Here they have immediate meaning because, as we noted in the introduction, a Normal distribution can be described by two parameters: the **mean**, in this case *P*, and the **standard deviation**, *S*.

Indeed, in the same statistics primers, at around this point we are encouraged to set aside what we have learned about the Binomial distribution and simply assume that it is ‘close to’ the Normal distribution *N*(*P*, *S*). We might see comments that this is an acceptable step for large *n* or where both *nP* and *n*(1 – *P*) > 5.

It is worth emphasising: this step (due to an observation by de Moivre in the 18th Century) is an **approximation**. The Binomial and Normal distributions are different. Here is the distribution for *P *= 0.3 again, but this time with a Normal distribution approximated to it. There is a small difference between the two mid-points, which we have labelled as ‘error’.

- Most obviously, the Normal distribution is **continuous** rather than discrete. This means we can obtain an estimate for the expected probability that *p* = 0.45.
- Like the Binomial distribution, the standardised Normal distribution is also **probabilistic**, i.e. the area under the curve sums to 1.
- Finally, the Normal distribution is **symmetric**. Moreover, it assumes that the observed variable is unbounded. An unbounded variable is free to vary from minus infinity (-∞) to plus infinity (+∞). (This is a corollary: if the variable was bounded, it could not be symmetric.)
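To see the size of the approximation error, we can tabulate both distributions side by side. This is a sketch on the integer scale for *P* = 0.3 and *n* = 10, where the Normal density at each integer *r* stands in for the Binomial column height; the helper names are illustrative.

```python
from math import comb, exp, pi, sqrt

def binomial(r, n, P):
    """Exact Binomial probability of r heads out of n."""
    return comb(n, r) * P**r * (1 - P) ** (n - r)

def normal_density(x, mean, sd):
    """Height of the Normal curve at x."""
    return exp(-((x - mean) ** 2) / (2 * sd**2)) / (sd * sqrt(2 * pi))

n, P = 10, 0.3
mean, sd = n * P, sqrt(n * P * (1 - P))  # parameters on the integer scale

for r in range(n + 1):
    b = binomial(r, n, P)
    g = normal_density(r, mean, sd)
    print(r, round(b, 4), round(g, 4), round(b - g, 4))

# The Normal curve is not bounded at zero: it has non-zero height at r = -1.
print(normal_density(-1, mean, sd) > 0)
```

The Binomial columns either side of the mean differ in height, whereas the Normal curve takes the same value at equal distances from the mean.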

It is worth considering this last point. Many statistics text books use example variables from the natural and physical sciences.

- For example, the height of children in a class, which we might call *H*, is usually considered to be an unbounded variable, suitable for the Normal distribution.
- But in fact, the height of children is a bounded variable.
  - **It has a lower limit.** At the risk of stating the obvious, children cannot be less than zero height(!), and indeed, to be permitted to go to school, must be of a certain age and be physically safe to do so. *H* must have a lower limit rather greater than zero.
  - **It has an upper limit.** A number of factors, from growth rates to the physical strength of bone, limit the possible height of children.
- Far from being unbounded, *H* is bounded by biology!

What everyone does is assume that the observed mean height is **so far** from the bounds that although the bounds exist, they have negligible effect on the distribution. (This is not always a healthy assumption, but it is the source of these injunctions to only approximate to the Normal distribution in cases where *nP* > 5.)

On the other hand, Binomial variables (and the Binomial distribution) are **strictly** bounded. We may write, e.g. *P* ∈ [0, 1], which simply means “*P* ranges from 0 to 1 inclusive”. The probability *P* may also be expressed as a proportion or percentage, so we might say that a rate can be any value from 0% to 100%.

So far we have discussed the *ideal* Binomial distribution. Equation (1) is the mathematical extrapolation of the likelihood, *B*(*r*), of observing *r* future results for a sample of *n* cases drawn randomly from a population if the true rate in the population was *P*.

In some circumstances we may *observe* a Binomial distribution. I do this in class with students – each student tosses a coin a fixed number of times and we note down the number of students who had 0 heads, 1 head and so on.

In the paper I am working on, I realised that this principle can also be employed to identify the extent to which a corpus sample might deviate from an ideal random sample for a given variable. This is an important question for corpus linguistics.

The first step is to partition the corpus sample into subsamples according to the text that they are drawn from. To all intents and purposes, these texts can be assumed to be random even if they were not subject to controlled sampling.

Note that two cases drawn from different texts are therefore likely to be independent and equivalent to a pair of cases in a true random sample. However two cases from the same text may share characteristics. There are all sorts of reasons why this is likely to be the case, from a shared topic to personal preferences, priming and other psycholinguistic effects. The reason does not actually matter – we just need to recognise this is likely to be the case.

**Question:** How may we measure the deviation of the corpus sample from an ideal random sample?

**Answer:** By studying the distribution of these subsamples.

Suppose the subsamples are equivalent to random samples. Even though cases are drawn from the same text, suppose it turns out that the particular variable is not sensitive to context, previous utterances, etc. In this case, we would expect these sub-samples to be Binomially distributed.

To plot the following graph we first ‘quantise’ (round up or down to a particular number) the observed probability *p*. The vertical axis, *f*, is simply the number of texts in the direct conversations category of ICE-GB where the probability that a clause is interrogative, *p*(inter), is 0, 0.01, 0.02, etc. There are 90 texts in this category. We can see that this distribution is approximately Binomial.
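The quantisation step can be sketched as follows. The per-text counts below are invented purely for illustration (they are not ICE-GB figures), and the bin width of 0.01 follows the description above.

```python
from collections import Counter

# Hypothetical (interrogative clauses, total clauses) counts, one pair per text.
texts = [(12, 410), (3, 388), (25, 502), (0, 297), (9, 450), (14, 430)]
STEP = 0.01  # width of each quantisation bin

def quantise(p, step=STEP):
    """Return the index of the nearest bin, so p = 0.029 falls in bin 3."""
    return round(p / step)

# f: the number of texts falling into each bin of observed probability.
frequencies = Counter(quantise(hits / total) for hits, total in texts)
for bin_index in sorted(frequencies):
    print(f"p = {bin_index * STEP:.2f}  f = {frequencies[bin_index]}")
```

Storing the bin index as an integer, rather than the rounded probability itself, avoids two texts landing in different bins because of floating-point rounding.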

We may calculate the variance of this observed distribution with the following pair of formulae, derived from Sheskin (1997).

The first estimate (4) does not take into account the fact that samples are drawn from a population, whereas the second measure, termed the *unbiased estimate of the population variance*, does. For that reason, we here use capital *P* to refer to each probability in the first case and lower case *p* to refer to observations.

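*variance of a set of scores* *s’*_{ss}² = ∑(*P*_{i} – *P*)² / *t*, (4)

*observed between-subsample variance s*_{ss}² = ∑(*p*_{i} – *p*)² / (*t* – 1), (5)

where *p*_{i} is the observed probability for subsample *i*, *P* and *p* are the respective means of these probabilities, and *t* is the number of subsamples.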

Equations (4) and (5) share one deficiency: they assume that each subsample is the same size. This is fine for classroom coin-tossing, but it is unlikely to be the case in a corpus sample.

The estimate of variance for a set of different-sized subsamples can be obtained from

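*variance of a set of scores (different sizes)* = *s’*_{ss}² = ∑*pr*_{i}(*P*_{i} – *P*)², (6)

*observed between-subsample variance s*_{ss}² = *t*/(*t* – 1) × ∑*pr*_{i}(*p*_{i} – *p*)², (7)

where *pr*_{i} = *n*_{i}/*n* is the probability that a randomly selected instance belongs to subsample *i*.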
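A minimal sketch of the weighted estimate for subsamples of different sizes follows. It assumes that the weight *pr*_{i} is each subsample's share of all instances (*n*_{i}/*n*) and that *p* is the pooled proportion. The function name and the counts are hypothetical.

```python
def weighted_between_subsample_variance(hits, sizes):
    """Return (internal, unbiased) variance of per-text proportions.

    Assumption: each text is weighted by pr_i = n_i / n, its share
    of the total number of instances.
    """
    n = sum(sizes)                 # total instances across all texts
    t = len(sizes)                 # number of texts (subsamples)
    p = sum(hits) / n              # pooled proportion
    internal = sum((n_i / n) * (h_i / n_i - p) ** 2
                   for h_i, n_i in zip(hits, sizes))
    unbiased = t / (t - 1) * internal
    return internal, unbiased

# Hypothetical counts for three texts of different sizes.
hits = [2, 6, 2]
sizes = [10, 20, 10]
internal, unbiased = weighted_between_subsample_variance(hits, sizes)
```

When every text is the same size, each weight is 1/*t* and the two results reduce to the equal-size estimates.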

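It is possible to prove that if *pr*_{i} is equal to the Binomial probability *B*(*r*) = *nCr* *P*^{r}(1 – *P*)^{(n – r)} and *P*_{i} = *r*/*n*, then

∑*nCr* *P*^{r}(1 – *P*)^{(n – r)}(*r*/*n* – *P*)² = *P*(1 – *P*)/*n*.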

This means that equation (6) *defines the correct mathematical relationship between a Binomial distribution on a probabilistic scale and its expected variance*. Another way of putting this is that it is legitimate to apply equations (6) and (7) to a Binomial variable.

**Example:** To illustrate this equivalence, consider the following computation for *P* = 0.3 and *n* = 2. Equation (3) obtains, simply *S*² = (0.3 × 0.7)/2 = 0.105.

| r/n | r | nCr | B(r) | B(r) × (r/n – P)² |
|---|---|---|---|---|
| 0 | 0 | 1 | 0.49 | 0.0441 |
| 0.5 | 1 | 2 | 0.42 | 0.0168 |
| 1 | 2 | 1 | 0.09 | 0.0441 |
| Totals | | 4 | 1 | 0.1050 |
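The worked example can be checked numerically. The sketch below sums *B*(*r*) × (*r*/*n* – *P*)² over all values of *r* and compares the total with *P*(1 – *P*)/*n*. The function name is an assumption made for illustration.

```python
from math import comb

def binomial_variance_by_summation(P, n):
    """Sum B(r) * (r/n - P)^2 over r = 0..n on the probability scale."""
    total = 0.0
    for r in range(n + 1):
        b = comb(n, r) * P**r * (1 - P) ** (n - r)   # Binomial probability B(r)
        total += b * (r / n - P) ** 2
    return total

summed = binomial_variance_by_summation(0.3, 2)
closed_form = 0.3 * (1 - 0.3) / 2
```

Both `summed` and `closed_form` agree with the total in the final column of the table (0.1050), and the same agreement holds for other choices of *P* and *n*.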

We can therefore contrast the observed subsample variance with the variance that would be predicted assuming each subsample were a random sample, i.e. the expected Binomial variance, which in this notation would be

*predicted between-subsample variance S*_{ss}² = *p*(1 – *p*)/*t*.

If the two variance scores are the same, then to all intents and purposes, our subsamples are random samples, and the entire corpus sample can be considered a random collection of random samples, i.e. a random sample.

However, if the observed subsample variance differs from that predicted, we are entitled to take this into account when considering the variance of the corpus sample. We employ the ratio of variances, *F*_{ss}, to adjust the sample size accordingly.

*cluster-adjustment ratio F*_{ss} = *S*_{ss}² / *s*_{ss}², and (6)

*corrected sample size n’* = *nF*_{ss}.
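As a rough sketch, the ratio and the corrected sample size can be computed from a list of per-text proportions. The code follows the predicted-variance formula exactly as written above, *p*(1 – *p*)/*t*, and uses the equal-size unbiased estimate for the observed variance. The function name, the proportions and *n* = 200 are hypothetical.

```python
from statistics import mean, variance

def cluster_adjustment(text_probs, n):
    """Return (F_ss, n') for per-text proportions and n total instances."""
    t = len(text_probs)
    p = mean(text_probs)
    predicted = p * (1 - p) / t        # predicted between-subsample variance
    observed = variance(text_probs)    # unbiased estimate, divides by t - 1
    f_ss = predicted / observed
    return f_ss, n * f_ss

# Hypothetical proportions from four texts that disagree strongly.
f_ss, n_adj = cluster_adjustment([0.0, 0.1, 0.9, 1.0], n=200)
```

With these widely spread proportions the ratio comes out below 1, so the corrected sample size is smaller than *n*.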

If the observed sample has a greater variance than the predicted variance, *F*_{ss} < 1, and we can say that there are fewer truly independent random cases in our overall corpus sample. As a result, we increase the uncertainty of our cross-corpus observation: significance tests become stricter, confidence intervals wider, etc.

In the paper, we observe that sometimes *F*_{ss} > 1 and discuss reasons for this. Suffice it to say it is certainly possible, although this may at first sight appear counter-intuitive.

To illustrate the method, consider the following graph. This is the same data as the figure above. You can download this spreadsheet to inspect the calculation for yourself.

Note that in this case we see a close correspondence between the two predicted distributions – Binomial and Normal. The observed distribution is also approximately Normal (accepting the randomness we would anticipate in any observed distribution of course).

The method of comparing variances we employed makes no assumptions about the Binomial approximating to the Normal distribution.

However, this method usually comes under the umbrella of analysis of variance (ANOVA), which is premised on data being Normally distributed. Instead of assuming that ANOVA *might* be legitimately employed for Binomial (bounded, asymmetric, discrete) distributions, we were concerned to *prove* that our definitions of variance were applicable to the Binomial.

Why might this matter? There are two reasons.

- The approximation to the Normal distribution is an approximation, and introduces a number of ‘smoothing’ errors as a result.
- We must ensure that the method is robust for highly skewed values of *p*.

In the figure above the Normal and Binomial distributions are similar. However, this is not always the case.

Consider the following graph (Figure 4 in the paper). Here data is drawn, not from a single genre, but across the diverse genres contained within the ICE-GB corpus, from the most highly interactive speech contexts to the most didactic of written instructional texts.

The two upper dotted lines are the predicted Normal and Binomial distributions for this observed value of *p* (0.0399) and *t* = 500 texts. You can see how the Normal distribution is narrower than the predicted Binomial.

Equation (5) captures the total variance between subsamples in this figure. It is approximately 4% of the predicted variance according to equation (3).

The lower line is the Normal distribution premised on the observed subsample variance. Again, you can see a large deviation between the observed frequency distribution (bars) and this Normal distribution, which is also clearly clipped by the lower bound at *p* = 0.

If our method were dependent on the Normal distribution, we simply could not sustain it in highly-skewed contexts such as this.

Sheskin, D.J. 1997. *Handbook of Parametric and Nonparametric Statistical Procedures*. Boca Raton, Fl: CRC Press.

]]>

Conventional stochastic methods based on the Binomial distribution rely on a standard model of random sampling whereby freely-varying instances of a phenomenon under study can be said to be drawn randomly and independently from an infinite population of instances.

These methods include confidence intervals and contingency tests (including multinomial tests), whether computed by Fisher’s exact method or variants of log-likelihood, χ², or the Wilson score interval (Wallis 2013). These methods are also at the core of others. The Normal approximation to the Binomial allows us to compute a notion of the variance of the distribution, and is to be found in line fitting and other generalisations.

In many empirical disciplines, samples are rarely drawn “randomly” from the population in a literal sense. Medical research tends to sample available volunteers rather than names compulsorily called up from electoral or medical records. However, provided that researchers are aware that their random sample is limited by the sampling method, and draw conclusions accordingly, such limitations are generally considered acceptable. Obtaining consent is occasionally a problematic experimental bias; actually recruiting relevant individuals is a more common problem.

However, in a number of disciplines, including **corpus linguistics**, samples are not drawn randomly from a population of independent instances, but instead consist of randomly-obtained contiguous subsamples. In corpus linguistics, these subsamples are drawn from coherent passages or transcribed recordings, generically termed ‘texts’. In this sampling regime, whereas any pair of instances in independent subsamples satisfies the independent-sampling requirement, pairs of instances in the same subsample are likely to be co-dependent to some degree.

To take a corpus linguistics example, a pair of grammatical clauses in the same text passage are more likely to share characteristics than a pair of clauses in two entirely independent passages. Similarly, epidemiological research often involves “cluster-based sampling”, whereby each subsample cluster is drawn from a particular location, family nexus, etc. Again, it is more likely that neighbours or family members share a characteristic under study than random individuals.

If the random-sampling assumption is undermined, a number of questions arise.

- Are statistical methods employing this random-sample assumption simply **invalid** on data of this type, or do they gracefully degrade?
- Do we have to employ very **different tests**, as some researchers have suggested, or can existing tests be modified in some way?
- Can we measure the **degree** to which instances drawn from the same subsample are interdependent? This would help us determine both the scale of the problem and arrive at a potential solution to take this interdependence into account.
- Would revised methods only affect the **degree of certainty** of an observed score (variance, confidence intervals, etc.), or might they also affect the **best estimate of the observation** itself (proportions or probability scores)?

We will employ a method related to ANOVA and F-tests, applying this method to a probabilistic rather than linear scale. This step is not taken lightly but as we shall see in section 6, it can be justified.

Consider an observation *p* drawn from a number of texts, *t*, based on *n* total instances. Conventionally we would assume that these *n* instances are randomly drawn from an infinite population, and then employ the Normal approximation to the Binomial distribution:

*standard deviation s* ≡ √(*p*(1 – *p*)/*n*),

*variance s*² = *p*(1 – *p*)/*n*, and (1)

*Wilson’s score interval* (*w*⁻, *w*⁺) ≡ [*p* + *z*_{α/2}²/2*n* ± *z*_{α/2}√(*p*(1 – *p*)/*n* + *z*_{α/2}²/4*n*²)] / [1 + *z*_{α/2}²/*n*]. (2)

where *z*_{α/2} is the critical value of the Normal distribution for a given error level α (see Wallis 2013 for a detailed discussion). Other derivations from (1) include χ² and log-likelihood tests, least-square line-fitting, and so on. The model assumes that all *n* instances are randomly drawn from an infinite (or very large) population. However, we suspect that our subsamples are not equivalent to random samples, and that this sampling method will affect the result.

To investigate this question, our approach involves two stages.

First, we measure the variance of scores between text subsamples according to two different models, one that presumes that each subsample is a random sample, and one calculated from the actual distribution of subsample scores. Consider the frequency distribution of probability scores, *p*_{i}, across all *t* texts, where

*subsample mean p* = ∑*p*_{i} / *t*.

If subsamples were randomly drawn from the population, it would follow from (1) that the variance could be **predicted** by

*between-subsample variance S*_{ss}² = *p*(1 – *p*)/*t*.(3)

To measure the **actual** variance of the distribution we employ a method derived from Sheskin (1997: 7). First, note that the variance of a series of *N* observed scores *X _{i}*, can be obtained by

*s*² = ∑(*X*_{i} – *X̄*)² / *N*, where *X̄* is the mean of the scores,

which can be rewritten as

*observed between-subsample variance s*_{ss}² = ∑(*p*_{i} – *p*)² / *t*.

This formula measures the internal variance of the series, but it fails to take into account the fact that the series is a subsample from which we wish to predict the true population value. The formula for the *unbiased estimate of the population variance* may be obtained by

*observed between-subsample variance s*_{ss}² = ∑(*p*_{i} – *p*)² / (*t* – 1), (4)

where *t* – 1 is the number of degrees of freedom of our set of observations. This is the formula we will use for our computations.
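The difference between the two estimates is easy to see with Python's standard `statistics` module, where `pvariance` divides by *t* and `variance` divides by *t* – 1. The proportions below are hypothetical and serve only to illustrate the relationship.

```python
from statistics import pvariance, variance

# Hypothetical per-text proportions (illustrative only).
text_probs = [0.02, 0.04, 0.03, 0.07, 0.04]
t = len(text_probs)

internal = pvariance(text_probs)   # internal variance of the series, / t
unbiased = variance(text_probs)    # unbiased population estimate, / (t - 1)
scale = t / (t - 1)                # factor relating the two estimates
```

Multiplying `internal` by `scale` recovers `unbiased`, so the correction matters most when the number of texts is small.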

Second, we adjust the weight of evidence according to the degree to which these two variances (equations (3) and (4)) disagree. If the observed and predicted variance estimates coincide, then the total set of subsamples is, to all intents and purposes, a random sample from the population, and no adjustment is needed to sample variances, standard deviations, confidence intervals, tests, etc.

We can expect, however, that in most cases the actual distribution has greater spread than that predicted by the randomness assumption. In such cases, we employ the **ratio of variances**, *F*_{ss}, as a scale factor for the number of random independent cases, *n*.

Gaussian variances with the same probability *p* are inversely proportional to the number of cases supporting them, *n*, i.e. *s*² ≡ *p*(1 – *p*)/*n* (cf. equation (1)). Assuming the Normal approximation to the Binomial holds for the distribution of *p*, we can estimate a corrected total independent sample size *n’*, by multiplying *n* by the ratio of variances for the same *p*.

*cluster-adjustment ratio F*_{ss} = *S*_{ss}² / *s*_{ss}², and (5)

*corrected sample size n’* = *nF*_{ss}.

To put it another way, the ratio *n’*:*n* is the same as *S*_{ss}²:*s*_{ss}². This ratio should be less than 1, and thus *n* is decreased. If we decrease *n* in equations (1) and (2), we obtain larger estimates of sample variance and wider confidence intervals. An adjusted *n* is easily generalised to contingency tests and other methods.
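The effect on the Wilson score interval can be sketched as follows. The interval function implements equation (2), and the adjusted version simply substitutes *n’* = *nF*_{ss} for *n*. The function names and the values of *p*, *n* and *F*_{ss} are hypothetical, and *z* is fixed at the two-tailed 95% critical value.

```python
from math import sqrt

def wilson(p, n, z=1.959964):
    """Wilson score interval (lower, upper) for proportion p and sample size n."""
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom

def adjusted_wilson(p, n, f_ss, z=1.959964):
    """Wilson interval computed on the corrected sample size n' = n * F_ss."""
    return wilson(p, n * f_ss, z)

# Hypothetical values, chosen only to illustrate the widening.
p, n, f_ss = 0.04, 5000, 0.25
lower, upper = wilson(p, n)
adj_lower, adj_upper = adjusted_wilson(p, n, f_ss)
width_ratio = (adj_upper - adj_lower) / (upper - lower)
```

Because the interval width shrinks roughly with the square root of the sample size, a ratio of 0.25 approximately doubles the width.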

…

Figure 6 plots the distribution of *p* with Wilson intervals across ICE-GB genre categories. The thin ‘I’-shaped error bars represent the conventional Wilson score interval for *p*, assuming random sampling. The thicker error bars represent the adjusted Wilson interval obtained using the probabilistically-weighted method of equation (7). These results are tabulated in Table 2 in the paper.

The figure reinforces observations we made earlier. Within a single text type, such as *broadcast interviews*, *p* has a compressed range and cannot plausibly approach 1. (Note that mean *p* does not exceed 0.03 in any genre.) The observed between-text distribution is smaller than that predicted by equation (3), and, armed with this information, we are able to reduce the 95% Wilson score interval for *p*. This degree of compression (or, to put it another way, the plausible value of max(*p*)) may also differ by text genre.

However, the reduction due to range-compression is offset by a countervailing tendency: pooling genres increases the variance of *p*. The distribution of texts across the entire corpus consists of the sum of the spoken and written distributions (means 0.0091 and 0.0137 respectively), and so on.

The Wilson interval for the mean *p* averaged over all of ICE-GB approximately doubles in width (*F*_{ss} = 0.2509), and the intervals for *spoken*, *dialogue*, *private*, *written* and *printed* (marked in bold in Figure 6) also expand, albeit to lesser extents. The other intervals contract (*F*_{ss} > 1), tending to generate a more consistent set of intervals over all text categories.

1. Introduction
2. Previous research
   - 2.1 Employing rank tests
   - 2.2 Case interaction models
3. Adjusting the Binomial model
4. Example 1: interrogative clause probability, direct conversations
   - 4.1 Alternative method: fitting
5. Example 2: Clauses per word, direct conversations
6. Uneven-size subsamples
7. Example 3: Interrogative clause probability, all ICE-GB data
8. Example 4: Rate of transitive complement addition
9. Conclusions

Wallis, S.A. 2015. *Adapting random-instance sampling variance estimates and Binomial models for random-text sampling*. London: Survey of English Usage, UCL. http://www.ucl.ac.uk/english-usage/statspapers/recalibrating-intervals.pdf

- Spreadsheet example (Excel)
- The variance of Binomial distributions
- Random sampling, corpora and case interaction
- Reciprocating the Wilson interval
- Freedom to vary and significance tests

Sheskin, D.J. 1997. *Handbook of Parametric and Nonparametric Statistical Procedures*. Boca Raton, Fl: CRC Press.

Wallis, S.A. 2013. Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods. *Journal of Quantitative Linguistics* **20**:3, 178-208.

]]>