We have discussed the Wilson score interval at length elsewhere (Wallis 2013a, b). Given an observed Binomial proportion *p* = *f* / *n* observations, and confidence level 1-α, the interval represents the two-tailed range of values where *P*, the true proportion in the population, is likely to be found. Note that *f* and *n* are integers, so whereas *P* is a probability, *p* is a proper fraction (a rational number).

The interval provides a robust method (Newcombe 1998, Wallis 2013a) for directly estimating confidence intervals on these simple observations. It can take a correction for continuity in circumstances where it is desired to perform a more conservative test and err on the side of caution. We have also shown how it can be employed in logistic regression (Wallis 2015).

The point of this paper is to explore methods for computing Wilson distributions, i.e. the analogue of the Normal distribution for this interval. There are at least two good reasons why we might wish to do this.

The first is to gain insight into the performance of the generating function (formula), interval and distribution itself. Plotting an interval means selecting a single error level α, whereas visualising the distribution allows us to see how the function performs over the range of possible values for α, for different values of *p* and *n*.

A second good reason is to counteract the tendency, common in too many presentations of statistics, to present the Gaussian (‘Normal’) distribution as if it were some kind of ‘universal law of data’, a mistaken corollary of the Central Limit Theorem. This is particularly unwise in the case of observations of Binomial proportions, which are strictly bounded at 0 and 1.

As we shall see, the Wilson distribution diverges from the Gaussian most dramatically as it tends towards the boundaries of the probabilistic range, i.e. where the interval approaches 0 or 1. By contrast, the Normal distribution is unbounded, and continues to plus or minus infinity.

The Wilson score interval (Wilson 1927) may be computed with the following formula.

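*Wilson score interval* (*w*⁻, *w*⁺) = (*p* + *z*²/2*n* ± *z*√(*p*(1 – *p*)/*n* + *z*²/4*n*²)) / (1 + *z*²/*n*), (1)

where *z* is the two-tailed critical value of the standard Normal distribution at error level α.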
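As a minimal sketch of equation (1), the following Python function computes the interval directly. The function name `wilson_interval` and the example values *p* = 0.3, *n* = 10 are illustrative assumptions, not taken from the paper; the critical value is obtained from the standard library's `NormalDist`.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(p, n, alpha=0.05):
    """Return (w_minus, w_plus), the Wilson score interval of equation (1)."""
    # Two-tailed critical value of the standard Normal distribution.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom


# Illustrative values only (assumed for this sketch).
w_minus, w_plus = wilson_interval(0.3, 10)
print(f"p = 0.3, n = 10: ({w_minus:.4f}, {w_plus:.4f})")
```

Note that both bounds stay inside [0, 1] whatever the values of *p* and *n*, which is the property that distinguishes this interval from the Gaussian one.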

Let us first consider cases where *P* is less than *p*. At the lower bound of this interval (*P* = *w*⁻) the upper bound for the Gaussian interval for *P*, *E*⁺, must be equal to *p* (Wallis 2013a).

We can carry out a test for significant difference between *p* and *P* by either

- calculating a Gaussian interval at *P* and testing if *p* is greater than the upper bound, or
- calculating a Wilson interval at *p* and testing if *P* is less than the lower bound.
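The equivalence of these two tests can be checked numerically. The sketch below, with assumed example values *p* = 0.3 and *n* = 20 and hypothetical helper names, sets *P* to the Wilson lower bound *w*⁻ and confirms that the upper bound of the Gaussian interval at *P* falls back on *p*.

```python
from math import sqrt
from statistics import NormalDist


def wilson_lower(p, n, z):
    """Lower bound w- of the Wilson score interval (equation 1)."""
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - spread) / (1 + z * z / n)


def gaussian_upper(P, n, z):
    """Upper bound E+ of the Gaussian (Normal) interval centred on P."""
    return P + z * sqrt(P * (1 - P) / n)


p, n = 0.3, 20                      # illustrative observation (assumed)
z = NormalDist().inv_cdf(0.975)     # alpha = 0.05, two-tailed

P = wilson_lower(p, n, z)
# At P = w-, the Gaussian upper bound should land back on p.
print(f"w- = {P:.6f}, E+ at w- = {gaussian_upper(P, n, z):.6f}, p = {p}")
```

Moving *P* slightly below *w*⁻ makes both tests report a significant difference; moving it slightly above makes both report none.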

To consider cases where *P* is greater than *p*, we simply reverse this logic. We test if *p* is smaller than the lower bound of a Gaussian interval for *P*, or *P* is greater than the upper bound of the Wilson interval for *p*. The Gaussian version of the test is called the **single proportion z test**. It can also be calculated as a

As *p* tends to 0, we obtain increasingly skewed distributions (Figure 3). The interval cannot be easily approximated by a Normal interval, and the sum of the two distributions is decidedly not Gaussian (‘Normal’).

In Figure 3, note how the mean *p* is no longer the most likely value (mode).

In plotting this distribution pair, the area on either side of *p* is projected to be of equal size, i.e. it treats as a given that the true value *P* is equally likely to be above and below *p*. This is not necessarily true! Indeed we might multiply both distributions by the probability of the prior. But this fact should not cause us to change the plot.

Note how, thanks to the proximity to the boundary at zero, the interval for *w*⁻ becomes increasingly compressed between 0 and *p*, reflected by the increased height of the curve.

The tendency to express the distribution like an exponential decline on the least bounded side reaches its limit when *p* = 0 or 1. The ‘squeezed interval’ is uncomputable and simply disappears.

1. Introduction
2. Plotting the distribution
   - 2.1 Obtaining values of *w*⁻
   - 2.2 Employing a delta approximation
3. Example plots
   - 3.1 An initial example
   - 3.2 Properties of the Wilson distributions
   - 3.3 Varying *p*
   - 3.4 Small *n*
4. Further perspectives on the distribution
   - 4.1 Percentiles of the Wilson distributions
   - 4.2 The logit Wilson distribution
   - 4.3 Continuity-corrected Wilson distributions
5. Conclusions
6. References

- Full paper (PDF)
- Spreadsheet (Excel)
- Plotting confidence intervals on graphs
- Binomial → Normal → Wilson
- Logistic regression with Wilson intervals

Newcombe, R.G. 1998. Two-sided confidence intervals for the single proportion: comparison of seven methods. *Statistics in Medicine* **17**: 857-872.

Wallis, S.A. 2013a. Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods. *Journal of Quantitative Linguistics* **20**:3, 178-208.

Wallis, S.A. 2013b. *z*-squared: the origin and application of χ². *Journal of Quantitative Linguistics* **20**:4, 350-378.

Wilson, E.B. 1927. Probable inference, the law of succession, and statistical inference. *Journal of the American Statistical Association* **22**: 209-212.

However, I think it is a good example of why a mathematical approach to statistics (instead of the usual rote-learning of tests) is extremely valuable.

At the time of writing (March 2018) nearly two hundred thousand university staff in the UK are active members of a pension scheme called USS. This scheme draws in income from these members and pays out to pensioners. Every three years the pension is valued, which is not a simple process. The valuation consists of two aspects, both uncertain:

- to value the liabilities of the pension fund, which means the obligations to current pensioners and future pensioners (current active members), and
- to estimate the future asset value of the pension fund when the scheme is obliged to pay out to pensioners.

What happened in 2017 (and happened in the last two valuations) is that the pension fund has been declared to be in deficit, meaning that the liabilities are greater than the assets. However, in all cases this ‘deficit’ is a projection forwards in time. We do not know how long people will actually live, so we don’t know how much it will cost to pay them a pension. And we don’t know what the future values of assets held by the pension fund will be.

In September 2017, the USS pension fund published a table which included two figures using the method of accounting they employed at the time to value the scheme.

- They said **the best estimate** of the outcome was a surplus of £8.3 billion.
- But they said that **the deficit allowing for uncertainty** (‘prudence’) was –£5.1 billion.

Now, if a pension fund is in deficit, it matters a great deal! Someone has to pay to address the deficit. Either the rules of the pension fund must change (so cutting the liabilities) or the assets must be increased (so the employers and/or employees, who pay into the pension fund must pay more). The dispute about the deficit engulfed UK universities in March 2018 with strikes by many tens of thousands of staff, lectures cancelled, etc. But is there really a ‘deficit’, and if so, what does this tell us?

The first additional bit of information we need to know is how the ‘uncertainty’ is modelled. In February 2018 I got a useful bit of information. The ‘deficit’ is the lower bound on a 33% confidence interval (α = 2/3). This is an interval that divides the distribution into thirds by area. One third is below the lower bound, one third above the upper bound, and one third is in the middle. This gives us a picture that looks something like this:

Of course, experimentalist statisticians will never use such an error-prone confidence interval. We wouldn’t touch anything below 95% (α = 0.05)! To make things a bit more confusing, the actuaries talk about this having a ‘67% level of prudence’ meaning that two-thirds of the distribution is above the lower bound. All of this is fine, but it means we must proceed with care to decode the language and avoid making mistakes.

In any case, the distribution of this interval is approximately Normal. The detailed graphs I have seen of USS’s projections are a bit more shaky (which makes them appear a bit more ‘sciency’), but let’s face it, these are projections with a great deal of uncertainty. It is reasonable to employ a Normal approximation and use a ‘Wald’ interval in this case because the interval is pretty much unbounded – the outcome variable could eventually fall over a large range. (Note that we recommend Wilson intervals on probability ranges precisely because probability *p* is bounded by 0 and 1.)

What do we know?

- The **best estimate** is the median of the distribution. In the case of the Normal it is also **the mean**, *v* = 8.3 (billion pounds).
- The **lower bound of the confidence interval** *v*⁻ = –5.1. This is the quoted deficit figure.
- The **error level** of the Normal distribution α = 2/3.
- The **critical value** of the Normal distribution, *z*_{α/2} = 0.4307.

We also know that *v*⁻ = *v* – *z*_{α/2}.*s*.

- So we can calculate the **standard deviation** *s* = (*v* – *v*⁻) / *z*_{α/2} = 31.1101.
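The critical value and standard deviation can be reproduced in a few lines. This is a sketch using Python's standard-library `NormalDist` rather than the spreadsheet used for the original calculation; the variable names are my own.

```python
from statistics import NormalDist

v = 8.3          # best estimate (GBP billions), September valuation
v_lower = -5.1   # quoted 'deficit' at the lower bound
alpha = 2 / 3    # error level: one third of the area in each tail

# Critical value z_{alpha/2}: the point with one third of the area above it.
z = NormalDist().inv_cdf(1 - alpha / 2)

# Rearranging v_lower = v - z * s for the standard deviation s.
s = (v - v_lower) / z

print(f"z = {z:.4f}, s = {s:.4f}")
```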

That’s a standard deviation of £31 billion! No wonder the Normal distribution looks so wide.

This tells us that a big problem with the prediction is the sheer scale of the uncertainty attached to the estimate. It is not necessarily a problem with the pension – after all, even using this valuation method it is odds-on to reach a positive outcome of £8.3 billion.

Now, it turns out that there are lots of problems with the method for valuing the pension scheme. Crucially, the entire exercise is predicated on imagining the ‘old’ UK universities (or a large proportion of them) go bankrupt. I have written about this elsewhere. It is not crucial for our statistics discussion, even if it is a costly problem for staff, employers and students impacted by the industrial action as the argument about who should pay for this type of ‘deficit’ ensues.

Irrespective of the rights and wrongs of *that* argument (and we will return to this in conclusion), this exercise should have convinced you of one thing though – with such a high level of uncertainty about the valuation, pretty much any value can be obtained!

The next thing I did was wonder, what is the break-even (zero) point on the distribution, where valuation *v* = 0? In other words, can we calculate the chance of default occurring according to this model?

This seems to me to be an important operation. Most of all it allows us to meaningfully compare different valuations, which, as we shall see in a minute, is a useful thing to do. The USS Trustees, who manage the scheme, are concerned with one thing – the **risk of default**, *p*(*v* < 0). So it strikes me that we ought to calculate it.

The zero point is the value of α/2 when *v*⁻ = *v* – *z*_{α/2}.*s* = 0.

So we need to know α when *z*_{α/2} = *v*/*s* = 0.2667.

There are various ways to compute this, but I used a poor-man’s Newton-Raphson method in Excel to find α. That is, I input different values of α until the Normal function (‘NORMSINV(1-(α/2))’) obtained a closely-similar value of *z*!

I am sure there is a neater way, but it would obtain essentially the same result. It’s the maths that count!

- In this case, this obtains **error level** α = 0.79.

This means that there would be an area of 0.21 inside the interval if *v*⁻ = 0. Another way of thinking about this is that, of the half-distribution below the mean, 21% of *that* area is where *v* >= 0.

So we can now report that the **probability of default** *p*(*v* < 0) = 0.5 – 0.21/2 = 0.395.
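The trial-and-error search for α is not strictly needed, because the cumulative Normal function gives the tail area in one step. A minimal sketch, assuming the figures quoted above and using the standard library's `NormalDist` in place of Excel:

```python
from statistics import NormalDist

v = 8.3        # best estimate (GBP billions)
s = 31.1101    # standard deviation quoted in the text

std_normal = NormalDist()
z = v / s                              # distance from the mean to zero, in units of s
alpha = 2 * (1 - std_normal.cdf(z))    # error level at which the lower bound is zero
p_default = std_normal.cdf(-z)         # area below zero, p(v < 0)

print(f"z = {z:.4f}, alpha = {alpha:.2f}, p(v < 0) = {p_default:.3f}")
```

Here `p_default` is simply α/2, the area of one tail, which is the same quantity as 0.5 – 0.21/2.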

As a result of the valuation in September there was much shaking of heads amongst employers. This level of risk seemed too great to bear. So they reported to their organisation, Universities UK (UUK), that they wanted to see less risk in their model. The first valuation employed a method that was termed gradual ‘de-risking’, meaning that the assets would be moved from a mixed stocks and shares portfolio into investments in government stocks, termed ‘gilts’. The idea is that this is less risky because these gilts are ‘low risk’ compared to stocks and shares.

As a result of this consultation, the scheme actuaries were sent away and they came up with some different figures. These were

- The **best estimate**, *v* = £5.2bn (this figure was not made very public)
- The **quoted deficit**, *v*⁻ = –£7.5bn (this was made *very* public)

Again, the same interval calculation was employed.

I was ‘leaked’ the best estimate. Knowing now how the calculation was made for our first valuation, I employed the same method.

- The **standard deviation** *s* = (*v* – *v*⁻) / *z*_{α/2} = 29.4850.

The graph now looks like this.

So, what is the probability of default, i.e. *p*(*v* < 0)? What has happened to the risk to the Pension Trustees?

We have *z*_{α/2} = *v*/*s* = 0.1764, which obtains α = 0.86.

**The probability of default**, *p*(*v* < 0), is now 0.43.

So – wait for this – **if the employers engage in what they think is a ‘risk-averse’ modelling approach, they increase the risk of default!** What is going on?

Let’s pause for a moment.

- The risk of default is an estimated risk of the likely outcome of the unravelling of the pension scheme should this prove necessary. It is like predicting the chance of an aeroplane **crashing**.
- But the ‘risk’ of stock-market investments is a different thing entirely. It is **volatility**, short-term variation, that might increase or decrease investments over the short term. To use our aeroplane analogy, it is turbulence.

**What the November valuation did was drop the altitude of the plane to avoid turbulence, but it increased the risk of crashing the plane into mountains!**

People are not used to reasoning with probability and risk, and it is easy to conflate different probabilities and different risks. Only a logical and mathematical approach to thinking about probability can rescue you from the kind of error exhibited by the university employers, when, insisting on a ‘lower level of risk’, they managed to increase the risk to the scheme and themselves.

What is quite disturbing about this argument is that I am not a professional actuary, yet I spotted the error immediately. I was not the only one.

You would think that the first thing a competent professional would do on obtaining this new calculation is critique it, wonder why this counter-intuitive outcome had been obtained, and advise those running the scheme accordingly. Yet at the time of writing in March 2018, UUK are still trying to use this November valuation to try to get their way.

So-called ‘de-risking’ increases the only risk that should matter (the risk of ultimate default), and therefore it is neither a competent investment strategy nor a good method for valuing the pension scheme!

Here we don’t have published figures from USS. But we have some information from our previous calculations.

The September valuation was obtained, not by employing no de-risking, but by modelling the effect of replacing stocks and shares with gilts after a 10-year delay. The November valuation was obtained by starting de-risking immediately. Both aim for complete de-risking by the 20-year point. See Figure 3.

We can also safely assume that the cost and yield of gilts are likely to be stable. (Indeed the low ‘long term gilt yield’ is half the problem of valuing live pension schemes like this.)

In the first place we have two valuations, **A** (September) and **B** (November).

These can be depicted like this.

We can now estimate the likely outcome for a new model, **C**, that employs a 20-year delay before total divestment, using some simple maths and the Bienaymé theorem (that independent variances may be summed).

Gilts are predicted to have a more-or-less constant low value. Stocks are predicted to be more volatile, around a given mean growth rate.

If we assume that stocks continue to perform in a similar manner over each five-year period (i.e. that the best estimate and standard deviation of the growth rate is constant) the areas under the curve for **A** and **B** are equivalent to immediate and total divestment at time points 15 and 10 years respectively (dashed vertical lines). This is because we can assume that the exposure of assets to stock market risk is considered to be constant.

Consider **B** first. This employs an immediate de-risking model which has the lowest standard deviation and variance.

- Var(**B**) = *s*₁² = 869.36

In the case of **A**, there is an additional variance term due to the stocks-and-shares uncertainty generated by delay:

- Var(**A**) = *s*₁² + *s*₂² = 967.84

Therefore the additional uncertainty due to delayed de-risking *s*₂² = 967.84 – 869.36 = 98.48.

Assuming investment performance, gilt yields, etc. are constant over time, the area between **A** and **B** is also the same area between **A** and **C**.

- Var(**C**) = *s*₁² + 2*s*₂² = 1,066.32,
- Standard deviation for **C**, *s* = 32.6545.

We obtain the best estimate for **C** by simple addition, so *v* = £11.4 billion.

This gives us a ‘deficit’ of –£2.66 billion and a probability of default of 0.3635.

Note that the gradual de-risking model (**A**) is roughly equivalent to delaying de-risking for five years and then selling stock as in model **B**. We can now compute **D** (25-year delay), **E** (30-year delay), and further models employing the same approach.

This obtains the following graphs.

In other words, even if one agreed to de-risk in twenty-five years’ time, the projected deficit would be close to zero, and thereafter, the scheme generates a surplus at this level of prudence.
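The extrapolation from **A** and **B** to **C**, **D** and **E** can be sketched as follows. This assumes, as the argument above does, that each further step of delay adds the same variance *s*₂² and the same increment to the best estimate. Counting models in ‘steps’ beyond **B** (B = 0, A = 1, C = 2, and so on) is my own labelling for the purposes of the sketch, and `model` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist().inv_cdf(2 / 3)   # critical value for alpha = 2/3

# Published figures (GBP billions): best estimate and quoted lower bound.
v_a, lower_a = 8.3, -5.1   # A: September valuation
v_b, lower_b = 5.2, -7.5   # B: November valuation

var_a = ((v_a - lower_a) / Z) ** 2
var_b = ((v_b - lower_b) / Z) ** 2
step_var = var_a - var_b   # s2 squared: extra variance per step of delay
step_v = v_a - v_b         # extra best-estimate value per step of delay


def model(steps):
    """Project a valuation `steps` steps beyond B (B=0, A=1, C=2, D=3, E=4)."""
    v = v_b + steps * step_v
    s = sqrt(var_b + steps * step_var)      # Bienayme: variances add
    lower = v - Z * s                       # 'deficit' at this level of prudence
    p_default = NormalDist(v, s).cdf(0.0)   # p(v < 0)
    return v, s, lower, p_default


for name, steps in zip("BACDE", range(5)):
    v, s, lower, p_default = model(steps)
    print(f"{name}: v = {v:5.1f}  s = {s:7.4f}  "
          f"lower = {lower:6.2f}  p(v < 0) = {p_default:.4f}")
```

Under these assumptions the lower bound rises with each step while the probability of default falls, which is the pattern described in the text.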

**Therefore not de-risking at all (performing an ongoing valuation) must obtain a surplus.** The limit of this ‘deficit’ curve exceeds zero.

If long-term gilt yields rise to beat CPI, then the benefits of increased predictability might outweigh the loss in asset performance. But we would need to perform a calculation of the trade-off based on the best evidence available at the time. What is clear is that de-risking punishes the pension scheme for an external factor – low long-term gilt and bond yields – for no good reason.

Another way to see the same result is to plot the probability of default over these different ‘de-risking horizons’. This obtains the following graph of *p*(*v* < 0).

The evidence is therefore that an assessment of the assets and liabilities of the live pension scheme (an ‘ongoing valuation’) must return a net surplus. Indeed, this is what the actuaries *First Actuarial* found by other methods (Salt and Benstead 2017).

Some might object that there were other differences between the November and September valuations, and therefore taking the difference between them is not appropriate. This may be true, but the burden of evidence has shifted. Until actuaries working for UUK and USS are transparent about their assumptions, I would suggest that I have demolished the idea that there could be an ongoing deficit by a straightforward mathematical argument.

The ability to conceptualise probability in a meaningful way is central to any rational argument about statistics and uncertainty. We can see this in the confusion between ‘de-risking’ and real risk, i.e. risk of pension default.

There is one last sting in this particular tale.

In the case of the USS pension, the entire premise of ‘de-risking’ is that a trigger event as financially destabilising as the bankruptcy of the entire pre-92 university sector takes place. This might not mean the total bankruptcy of the sector, but it would require a large number of big institutions to shut down and the remaining institutions to fail to absorb their students, staff and market share.

**Now, the probability of this event is – or should be – effectively zero.** Since *p*(*v* < 0) × 0 = 0, from a logical perspective it does not really matter what the probability of deficit actually is. However, the current UK regulatory environment still presumes that pension funds must be evaluated by ‘managing the risk of default’ (which means in practice modelling by de-risking), even if the probability of the trigger event is zero.

That an evaluation of this kind is even contemplated in the case of USS illustrates what one might call a wilful ignorance of basic mathematics. One of the Big Four accountancy firms has attached their name to various tendentious statements about the USS pension scheme, levels of prudence, etc. It is to their shame that they have done so.

**As we have demonstrated, the probability of scheme default is zero provided that the scheme is not de-risked.** *Actual* de-risking – an act of self-harm of the first order – increases the chance of default, although even in the worst case, immediate de-risking is still odds-on to leave a surplus.

The obvious solution to the current crisis is for the Government to accept that a multi-employer scheme of publicly-funded universities is not subject to the same risks as a single-employer pension fund.

Sector bankruptcy would be a national tragedy that would also constitute the simultaneous collapse of one of the UK’s leading exporting industries (higher education), the eviction of millions of students from their courses and the collapse of the UK independent research sector. It is a political issue of the utmost importance to the UK economy as well as generations of university staff and students.

Cuts in the pension benefits and increases in employer expenditure are pointless and damaging when the ‘deficit’ is so obviously an artefact of the valuation method. The obvious solution is that the Government simply guarantees the security of the pension fund, and permits the Trustees to value the scheme on an ongoing basis.

Sam Marsh uncovers that the trigger reasoning used by the pension fund USS for deciding to ‘de-risk’ (‘Test 1’) contains a colossal error. Even if the Government did not step in, USS itself has no grounds to de-risk. See also Mike Otsuka’s explanation.

However to predict performance, we might consider the types of structure that a parser is likely to find difficult and then examine a parsed corpus of speech and writing for key statistics.

Variables such as mean sentence length or main clause complexity are often cited as a proxy for parsing difficulty. However, sentence length and complexity are likely to be poor guides in this case. Spoken data is not split into sentences by the speaker; rather, utterance segmentation is a matter of transcriber/annotator choice. In order to improve performance, an annotator might simply increase the number of sentence subdivisions. Complexity ‘per sentence’ is similarly potentially misleading.

In the original *London Lund Corpus* (LLC), spoken data was split by speaker turns, and phonetic tone units were marked. In the case of speeches, speaker turns could be very long compound ‘run-on’ sentences. In practice, when texts were parsed, speaker turns might be split at coordinators or following a sentence adverbial.

In this discussion paper we will use the *British Component of the International Corpus of English* (ICE-GB, Nelson *et al.* 2002) as a test corpus of parsed speech and writing. It is worth noting that both components were parsed together by the same tools and research team.

A very clear difference between speech and writing in ICE-GB is to be found in the degree of **self-correction**. The mean rate of self-correction in ICE-GB spoken data is 3.5% of words (the rate for writing is 0.4%). The spoken genre with the lowest level of self-correction is broadcast news (0.7%). By contrast, student examination scripts have around 5% of words crossed out by writers, followed by social letters and student essays, which have around 0.8% of words marked for removal.

However, self-correction can be addressed at the annotation stage, by removing it from the input to the parser, parsing this simplified sentence, and reintegrating the output with the original corpus string. To identify issues of parsing complexity, therefore we need to consider the sentence minus any self-correction. Are there other factors that may make the input stream more difficult to parse than writing?

Perhaps a more revealing estimate of top level complexity concerns the extent to which, following parsing, these segments, termed ‘parse units’, are not considered grammatically to be clauses. The scattergraph below plots the mean proportion of parse units that are ‘**non clauses**’ rather than clauses on the horizontal axis. The category of ‘non clause’ does not include subjectless or verbless clauses (see below), but may include standalone phrases and pragmatically meaningful utterances (sometimes called ‘clause fragments’). By contrast, the vertical axis shows the mean number of **incomplete** clauses. These are clauses that have been rendered incomplete, for example because the speaker was interrupted. (We have not included confidence intervals because we are interested in the overall scatter.)

- Overall in ICE-GB **there are twice the proportion of ‘non clause’ parse units in the spoken data** (on average, 29% of parse units are not clauses) than the written component (14%). Business letters are an outlier, apparently due to the inclusion of full addresses and other formal ephemera. At the upper left of the written distribution, press editorials have the highest number of incomplete clauses while less than one in twenty parse units are considered non clauses.
- Comparing means, **there are over four times the proportion of incomplete clauses in spoken transcripts** compared to written text (2.15% to 0.51%). Means are shown with ‘X’ symbols in the scattergraph.

This scattergraph distinguishes written and spoken data to a much greater extent than, e.g. analysis of small phrases (Aarts *et al.* 2014). This indicates that the challenges in the parsing of speech data lie principally in high level structure. Getting the top level analysis correct is the most difficult challenge in any parsing enterprise. The sheer proportion of the number of non clauses in speech, and the relatively high proportion of incomplete clauses should cause us to be cautious about accepting performance estimates based on the parsing of written data when we are concerned with the parsing of speech.

Spoken data is not necessarily more complex in other aspects. For example, speech data is generally less likely to include subjectless or verbless clauses than writing. The following scattergraph plots the mean probabilities of clauses being **subjectless** (vertical axis) and **verbless** (horizontal axis) for ICE-GB text categories within speech and writing. The highest proportion of verbless clauses in any genre is found in spontaneous commentaries, a spoken genre which encourages concise phrasing, for example:

*England have won four* [*the Soviet Union three*] *with three drawn* [S2A-001 #167]

Compared to writing, a lower proportion of clauses in speech are analysed as compound clauses, but this seems to be an artefact of the sentence segmentation decisions we discussed earlier. In the case of ICE-GB speech data, large coordinated spoken clauses were frequently split at the coordinator, with the coordinator (*and*, *but*, etc) then treated as a connective introducing a new clause. This decision is semantic and stylistic (in writing, termed ‘avoiding run-on sentences’), although it could be argued that in the parsing of ICE-GB, annotators over-compensated.

In objective lexical terms, the spoken data has a slightly greater tendency to exhibit coordinating words. There are 15% more connectives or coordinators per word in ICE-GB spoken data compared to writing, and 4% more subordinating conjunctions.

If ICE-GB spoken utterances were over-zealously subdivided, this tendency has had a greater impact on coordinated clauses than subordinate ones, but it has had an impact on subordination nonetheless. Thus the proportion of ‘dependent’ (subordinate) clauses out of those clauses explicitly marked as either main or dependent in spoken data is actually 85% of the equivalent rate in the written data, despite the greater rate of subordinators.

In summary, the main factor that might make speech harder to parse than writing is that spoken data tends to be more grammatically incomplete than written data. The high proportion of ‘non clauses’, and the greater number of clauses marked as incomplete, both indicate that this is where the principal difficulty lies.

This incompleteness is in addition to self-correction, that is, where speakers correct their own utterances.

Aarts, B. and S.A. Wallis 2014. Noun phrase simplicity in spoken English. In L. Veselovská and M. Janebová (eds.) *Complex Visibles Out There. Proceedings of the Olomouc Linguistics Colloquium 2014: Language Use and Linguistic Structure.* Olomouc: Palacký University. pp 501-511.

Nelson, G., Wallis, S.A. and Aarts, B. 2002. *Exploring Natural Language: Working with the British Component of the International Corpus of English*. Amsterdam: John Benjamins.


Occasionally it is useful to cite measures in papers other than simple probabilities or differences in probability. When we do, we should estimate confidence intervals on these measures. There are a number of ways of estimating intervals, including bootstrapping and simulation, but these are computationally heavy.

For many measures it is possible to derive intervals from the Wilson score interval by employing a little mathematics. Elsewhere in this blog I discuss how to manipulate the Wilson score interval for simple transformations of *p*, such as 1/*p*, 1 – *p*, etc.

Below I am going to explain how to derive an interval for grammatical diversity, *d*, which we can define as **the probability that two randomly-selected instances have different outcome classes**.

Diversity is an effect size measure of a frequency distribution, i.e. a vector of *k* frequencies. If all frequencies are the same, the data is evenly spread, and the score will tend to a maximum. If all frequencies except one are zero, the chance of picking two different instances will of course be zero. Diversity is well-behaved except where categories have frequencies of 1.

To compute this diversity measure, we sum across the set of outcomes (all functions, all nouns, etc.), **C**:

*diversity d*(*c* ∈ **C**) = ∑ *p*₁(*c*).(1 – *p*₂(*c*)) if *n* > 1; 1 otherwise

where **C** is a set of *k* > 1 disjoint categories, *p*₁(*c*) is the probability that item 1 is category *c* and *p*₂(*c*) is the probability that item 2 is the same category *c*.

We have probabilities

*p*₁(*c*) = *F*(*c*)/*n*, *p*₂(*c*) = (*F*(*c*) – 1)/(*n* – 1) = (*p*₁(*c*).*n* – 1)/(*n* – 1),

where *n* is the total number of instances.

The formula for *p*₂ includes an adjustment for the fact that we already know that the first item is *c*. This principle is used in card-playing statistics. Suppose I draw cards from a pack. If the first card I pick is a heart, I know that there are only 12 other hearts in the pack, so the probability of the next card I pick up being a heart is 12 out of 51, not 13 out of 52.

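Note that as the set is closed, ∑*p*₁(*c*) = 1. The adjusted probabilities *p*₂(*c*) are each conditional on a different first item, so their sum, (*n* – *k*)/(*n* – 1), is slightly less than 1.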

The maximum score, obtained when all frequencies are equal, is (*k* – 1)/*k* × *n*/(*n* – 1). For large *n* this is only slightly greater than (*k* – 1)/*k*, but in the special case where *n* approaches *k*, so that categories have a frequency of 1, diversity can approach 1.

In a forthcoming paper with Bas Aarts and Jill Bowie, we found that the share of functions of *–ing* clauses (‘gerunds’) appeared to change over time in the *Diachronic Corpus of Present-day Spoken English* (DCPSE).

We obtained the following graph. The bars marked ‘LLC’ refer to data drawn from the period 1956-1972; those marked ‘ICE-GB’ are from 1990-1992.

This graph considers six functions **C** = {CO, CS, OD, SU, A, PC} of the clause. It plots *p*(*c*) over **C**. Considered individually, some functions significantly increase and some decrease their share. Note also that the increases appear to be concentrated in the shorter bars (smaller *p*) and the decreases in the longer ones.

Intuitively this appears to mean that we are seeing *–ing* clauses increase in their diversity of grammatical function over time. We would like to test this proposition.

Here is the LLC data.

| CO | CS | SU | OD | A | PC | Total |
|---|---|---|---|---|---|---|
| 6 | 33 | 61 | 326 | 610 | 1,203 | 2,239 |

Computing diversity scores, we arrive at

*d*(LLC) = 0.6152 and *d*(ICE-GB) = 0.6443.
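The diversity formula can be sketched in a few lines of Python. This is a minimal illustration rather than the original implementation: the function name `diversity` and the list `llc` are hypothetical, and the frequencies are those in the LLC table above.

```python
def diversity(freqs):
    """Chance that two items drawn without replacement fall in different categories."""
    n = sum(freqs)
    if n <= 1:
        return 1.0  # the 'otherwise' branch of the definition
    total = 0.0
    for f in freqs:
        p1 = f / n              # probability that item 1 is category c
        p2 = (f - 1) / (n - 1)  # probability that item 2 is also c, given item 1 was
        total += p1 * (1 - p2)
    return total


# LLC frequencies for CO, CS, SU, OD, A, PC (hypothetical variable name)
llc = [6, 33, 61, 326, 610, 1203]
print(round(diversity(llc), 4))  # 0.6152
```

An evenly spread vector such as `[10, 10, 10, 10]` scores (*k* – 1) / *k* × *n* / (*n* – 1), and a vector with a single non-zero cell scores zero.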

We wish to compare these two diversity measures. The first step is to estimate a confidence interval for *d*.

First we compute interval estimates for each term, *d*(*c*) = *p*₁(*c*).(1 – *p*₂(*c*)).

- The Wilson score interval for a probability *p* is (*w*⁻, *w*⁺).

Any monotonic function of *p*, *fn*, can be applied and plotted as a simple transformation. See Reciprocating the Wilson interval. We can write

*fn*(*p*) ∈ (*fn*(*w*⁻), *fn*(*w*⁺)).

However, *d*(*c*) is not monotonic over its entire range. Indeed *d*(*c*) reaches a maximum where *p* = 0.5. However the axiom holds conservatively provided that the function is monotonic across the interval (*w*⁻, *w*⁺), i.e. where 0.5 is not within the interval. The following graph plots *d*(*c*) over *p*(*c*) for a two-cell vector where *n* = 40.

We can rewrite *d*(*c*) in terms of a probability *p* and *n*,

*d*(*p*, *n*) = *p* × (1 – (*p* × *n* – 1) / (*n* – 1)).

This has the interval

*d*(*p*, *n*) ∈ (*d*(*w*⁻, *n*), *d*(*w*⁺, *n*))

provided that *w*⁺ < 0.5, i.e. that the function is increasing across the whole interval. To obtain the interval we have simply plugged *w*⁻ and *w*⁺ into the formula for *d*(*p*, *n*) in place of *p*.

Indeed, noting the shape of *d*, we can derive the following.

*d*(*p*, *n*) ∈ (*d*(*w*⁻, *n*), *d*(*w*⁺, *n*)) where *w*⁺ < 0.5,

*d*(*p*, *n*) ∈ (*d*(*w*⁺, *n*), *d*(*w*⁻, *n*)) where *w*⁻ > 0.5,

*d*(*p*, *n*) ∈ (min(*d*(*w*⁻, *n*), *d*(*w*⁺, *n*)), *d*(0.5, *n*)) otherwise.
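The three cases above can be sketched as follows. This is an illustrative sketch, not the spreadsheet implementation: the function names `wilson`, `d` and `d_interval` are assumptions, and the Wilson interval is computed without continuity correction.

```python
from math import sqrt
from statistics import NormalDist


def wilson(p, n, alpha=0.05):
    """Wilson score interval (w-, w+) for proportion p, without continuity correction."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - spread, centre + spread


def d(p, n):
    """Single diversity term d(p, n) = p * (1 - (p*n - 1) / (n - 1))."""
    return p * (1 - (p * n - 1) / (n - 1))


def d_interval(p, n, alpha=0.05):
    """Interval for d(p, n), allowing for the maximum of d at p = 0.5."""
    w_lo, w_hi = wilson(p, n, alpha)
    lo, hi = d(w_lo, n), d(w_hi, n)
    if w_hi < 0.5:           # d is increasing across the interval
        return lo, hi
    if w_lo > 0.5:           # d is decreasing across the interval: swap the bounds
        return hi, lo
    return min(lo, hi), d(0.5, n)  # interval straddles the maximum


# PC cell of the LLC data: 1,203 out of 2,239
print(wilson(1203 / 2239, 2239))
print(d_interval(1203 / 2239, 2239))
```

For the PC cell the Wilson interval is approximately (0.5166, 0.5579), which lies wholly above 0.5, so the transformed bounds are swapped.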

Next we need to sum these intervals. To do this we need to take account of the number of degrees of freedom of the vector.

Case 1: *df* = 1

If we had two values (as in our graphed example), we would have one degree of freedom. Cell probabilities *p*(1) + *p*(2) = 1, so *p*(2) = 1 – *p*(1).

The relationship above is exactly the same as applies for the Wilson score interval and 2×1 χ² goodness of fit test. Observed variation across *p*(1) **determines** the variation across *p*(2). Suppose *P*(1), the true value for *p*(1), were at an outer limit of *p*(1) (say, *w*⁺(1)). *P*(2) would be at the opposite outer limit of *p*(2) (*w*⁻(2)).

This means we should simply sum the transformed Wilson scores:

*d*(*c* ∈ **C**) ∈ (∑*d*(*w*⁻(*c*), *n*), ∑*d*(*w*⁺(*c*), *n*)).

We apply simple summation where intervals are strictly dependent on each other. We can obtain relative bounds of the dependent sum as:

*l*(dep) = *d* – ∑*d*(*w*⁻(*c*), *n*),

*u*(dep) = ∑*d*(*w*⁺(*c*), *n*) – *d*.

However, in our example we have more than one degree of freedom, and this method is too conservative.

Case 2: *df* > 1

Where probabilities are independent, some can increase and others decrease. The chance that two independent probabilities both fall within a 5% error level is 0.05². So we cannot simply add together intervals. The method of independent summation is to sum Pythagorean interval widths:

*l*(ind) = √∑[*d*(*p*(*c*), *n*) – *d*(*w*⁻(*c*), *n*)]², and

*u*(ind) = √∑[*d*(*p*(*c*), *n*) – *d*(*w*⁺(*c*), *n*)]².

However, in our case, we have what we might term semi-independent probabilities, with the level of independence determined by the number of degrees of freedom. We have *df* = *k* – 1 independent differences, so we can interpolate between the two methods in proportion to the number of cells.

*l* = (*l*(ind) × (*k* – 2) + 2*l*(dep)) / *k*, and

*u* = (*u*(ind) × (*k* – 2) + 2*u*(dep)) / *k*,

*d*(*c* ∈ **C**) ∈ (*d* – *l*, *d* + *u*).

Note that *l* = *l*(dep) where *k* = 2.

To see how this works, let’s return to our example. The following is drawn from the LLC data (first, blue bar in the graph), at an error level α = 0.05. Note that one of our cells (PC) has *p*₁ > 0.5, and *w*₁⁻ is also > 0.5, so we must swap the interval for this cell.

| function | CO | CS | SU | OD | A | PC |
|---|---|---|---|---|---|---|
| *p*₁ | 0.0027 | 0.0147 | 0.0272 | 0.1456 | 0.2724 | 0.5373 |
| *w*₁⁻ | 0.0012 | 0.0105 | 0.0213 | 0.1316 | 0.2544 | 0.5166 |
| *w*₁⁺ | 0.0058 | 0.0206 | 0.0348 | 0.1608 | 0.2913 | 0.5579 |

Next, to compute the bounds of the confidence interval CI(*d*) = (*d* – *l*, *d* + *u*), we obtain the same data for *p*₂ and then carry out the computation.

*l*(dep) = *d* – ∑*d*(*w*⁻(*c*), *n*) = 0.6152 – 0.5833 = 0.0319,

*u*(dep) = ∑*d*(*w*⁺(*c*), *n*) – *d* = 0.6510 – 0.6152 ≈ 0.0359,

*l*(ind) = √∑[*d*(*p*(*c*), *n*) – *d*(*w*⁻(*c*), *n*)]² = 0.0152,

*u*(ind) = √∑[*d*(*p*(*c*), *n*) – *d*(*w*⁺(*c*), *n*)]² = 0.0165.

This obtains an interval of (0.5945, 0.6382).
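As a rough check, the interpolation step can be sketched in Python. The helper names below are hypothetical, and the worked figures are the rounded values quoted above, so the result agrees with the quoted interval only to within rounding.

```python
from math import sqrt


def dependent_bounds(d_obs, d_lower, d_upper):
    """Relative bounds when cell intervals are strictly dependent (simple summation)."""
    return sum(d_obs) - sum(d_lower), sum(d_upper) - sum(d_obs)


def independent_bounds(d_obs, d_lower, d_upper):
    """Relative bounds when cells are independent (Pythagorean summation)."""
    l = sqrt(sum((o - lo) ** 2 for o, lo in zip(d_obs, d_lower)))
    u = sqrt(sum((o - hi) ** 2 for o, hi in zip(d_obs, d_upper)))
    return l, u


def interpolate(ind, dep, k):
    """Weight the two estimates in proportion to the number of cells k."""
    return (ind * (k - 2) + 2 * dep) / k


# Worked example: LLC data, k = 6 functions, rounded inputs from the text
d_llc, k = 0.6152, 6
l = interpolate(0.0152, 0.0319, k)
u = interpolate(0.0165, 0.0359, k)
print(d_llc - l, d_llc + u)
```

With *k* = 2 the independent term drops out and `interpolate` returns the dependent bound unchanged.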

We can quote diversity for LLC with absolute intervals (*d* – *l*, *d* + *u*):

*d*(LLC) = 0.6152 (0.5945, 0.6382), and *d*(ICE-GB) = 0.6443 (0.6248, 0.6655).

In the Newcombe-Wilson test, we compare the difference between two Binomial observations *p*₁ and *p*₂ with the Pythagorean distance of the Wilson interval widths *y*₁⁺ = *w*₁⁺ – *p*₁, etc:

–√((*y*₁⁺)² + (*y*₂⁻)²) < (*p*₁ – *p*₂) < √((*y*₁⁻)² + (*y*₂⁺)²).

If the equation above is true, the result is not significant (the difference falls within the confidence interval).

This method operates on the assumption that the observations are independent and the intervals are approximately Normal. In our case the difference in diversity is -0.0291, and the bounds are (-0.0301, +0.0297).

Since the difference falls inside those bounds – just – we can report that the difference is not significant.
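The comparison can be reproduced from the quoted intervals with a short sketch. The function name `difference_bounds` is an assumption, and the inputs are the rounded figures given above, so the bounds match the quoted ones only to about four decimal places.

```python
from math import sqrt


def difference_bounds(d1, ci1, d2, ci2):
    """Zero-centred Newcombe-Wilson style bounds for the difference d1 - d2."""
    y1_lo, y1_hi = d1 - ci1[0], ci1[1] - d1  # interval widths for observation 1
    y2_lo, y2_hi = d2 - ci2[0], ci2[1] - d2  # interval widths for observation 2
    lower = -sqrt(y1_hi ** 2 + y2_lo ** 2)
    upper = sqrt(y1_lo ** 2 + y2_hi ** 2)
    return lower, upper


llc = (0.6152, (0.5945, 0.6382))
ice_gb = (0.6443, (0.6248, 0.6655))

diff = llc[0] - ice_gb[0]
lower, upper = difference_bounds(llc[0], llc[1], ice_gb[0], ice_gb[1])
significant = not (lower < diff < upper)
print(diff, (lower, upper), significant)
```

The difference is only just inside the lower bound, so `significant` is `False`.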

In many scientific disciplines, such as medicine, papers that include graphs or cite figures without confidence intervals are considered incomplete and are likely to be rejected by journals. However, whereas the Wilson interval performs admirably for simple Binomial probabilities, computing confidence intervals for more complex measures typically requires a more elaborate computation.

We defined a diversity measure and derived a confidence interval for it. Although probabilistic (diversity is indeed a probability), it is not a *Binomial* probability. For one thing, its maximum is below 1, at slightly in excess of (*k* – 1) / *k*. For another, it is computed as the sum of the product of two sets of related probabilities.

In order to derive this interval we made the assumption of monotonicity, i.e. that the function *d* tends to increase along its range, or decrease along its range. However, *d* is decidedly **not** monotonic – it increases as *p* tends to 0.5 but falls thereafter. We employed the weaker assumption that it is monotonic within the confidence interval, or – in the case where the interval includes a change in direction – that it cannot exceed the global maximum. This has a conservative consequence: it makes the evaluation weaker than it would otherwise be.

We computed an interval by interpolating between dependent and independent estimates of variance, noting that the vector has *k* – 1 degrees of freedom. This is not the most accurate method (and I intend to return to this question in later posts), but it is sufficient for us to derive an interval, and, by employing Newcombe’s method, a test of significant difference.

Like Cramér’s φ, diversity condenses an array with *k* – 1 degrees of freedom into a variable with a single degree of freedom. Swapping data between the smallest and largest columns would obtain exactly the same diversity score.

Testing for significant difference in diversity, therefore, is not the same as carrying out a *k* × 2 chi-square test. Such a test could be significant even when diversity scores are not significantly different. Our new diversity difference test is more conservative, and significant results may be more worthy of comment.

Aarts, B., Wallis, S.A., and Bowie, J. (forthcoming). *–Ing clauses in spoken English: structure, usage and recent change*.

Let’s think about what you experienced. The car crash might involve a number of variables an investigator would be interested in.

How fast was the car going? Where were the brakes applied?

Look on the road. Get out a tape measure. How long was the skid before the car finally stopped?

How big and heavy was the car? How loud was the bang when the car crashed?

These are all **physical variables**. We are used to thinking about the world in terms of these kinds of variables: velocity, position, length, volume and mass. They are tangible: we can see and touch them, and we have physical equipment that helps us measure them.

To this list we might add variables we can’t see, such as how loud the bang was. We might not be able to see it, but we can appreciate that loudness is a variable that ranges from very quiet to extremely loud indeed! With a decibel meter we can get an accurate reading, but if you are trying to explain how loud something was to the Police from memory, the best you might be able to do is a rough-and-ready assessment.

We are also used to thinking about some other variables that might be relevant to our car crash investigation. If we are investigating on behalf of the insurance company, we might want to know the answers to some rather less tangible variables. What was the value of the car before the accident? How wealthy is the driver? How dangerous is that stretch of road?

We are used to thinking about the world in terms of physical variables but we are also brought up in a social world of economic value. The value of the car, the wealth of the driver. These **social variables** are a bit more ‘slippery’ than the physical variables. ‘Value’ can be highly subjective: the car might have been vintage, and different buyers might place a different value on it. The buyer, being canny, might then resell it for a higher value. Nonetheless everyone brought up in a world of trade and capital understands the idea that a car can be sold and in that process a price attached to it. Likewise, ‘wealth’ might be measured in different ways, or in different currencies. So although these are not physical variables, we are comfortable with the idea that they are tangible to us.

But what about that last variable? I asked, *how dangerous is that stretch of road?* This variable is a risk value. It is a **probability**. We can rephrase my question as “what is the probability that for every car that comes down the road, it crashes?” If we can measure this in some way, and make repeat measurements elsewhere, we could make comparisons. Perhaps we have discovered an accident ‘black spot’: somewhere where there is a greater chance of a road accident than at other locations.

**But a probability cannot be calculated on the strength of a single accident.** It can only be measured by a different, more patient, process of observation. We have to observe *many* cars driving down the road, count the ones that crash, and build up a set of observations. Probability is not a tangible variable, and it takes an effort of imagination to think about.

I want to argue that the first thing that makes the subject of statistics difficult, compared to, say, engineering, is that even the most elementary variable we use, observed probability, is not physically tangible.

Let us think about our car crash for a minute. I said that you have never been on this road before. You have no data on the probability of a crash on that road. But it would be very easy to assume from the simple fact that you saw a crash that, if the road surface seemed poor, or it was raining, these facts contributed to the accident and made it more likely. But you have only one data point to draw from. This kind of inference is not valid. It is an over-extrapolation. It is little more than a guess.

Our natural instinct is to form explanations in our mind, hypotheses, and to look for patterns and causes in the world. (Part of our training as scientists is to be suspicious of that inclination. Of course we might be right, but we have to be relentlessly careful and self-critical before we can conclude that we are.)

If we wanted to make a case that this location is an accident black spot, we would need to set up equipment and monitor the road for accidents. We would need to continue to observe the road over a substantial period of time to get the data we needed. This is called a **natural experiment**, where we don’t attempt to interfere with the conditions of the road but simply observe driver behaviour and car crashes.

Alternatively, we might **conduct an actual experiment** and drive various cars down the road to see how they handled. Either way, we would need to observe many cars going past before we could make a realistic estimate of the chance of a crash.

If probability is difficult to observe directly, this has an effect on our ability to think about it. Probability is more difficult to conceive of in the way we conceive of length, say. We all vary in our spatial reasoning abilities, but we experience reinforcement learning from daily observations, tape measures and practice. As we have seen, probability is much more elusive because it is only observed from many observations. This makes it difficult to reliably estimate probability in advance, or to reason with probabilities.

Even experienced researchers make mistakes. The psychologists Tversky and Kahneman (1971) reported the findings from a questionnaire they gave to professional psychologists. The questions concerned the decisions they would make in research based on statements about probability. They showed that not only were their expert subjects unreliable, they provided evidence of persistent biases in human cognition, including the one we mentioned earlier – a belief in the reliability of their own observations, even when they had few observations on which to base their conclusions.

So if you are struggling with statistical concepts, **don’t worry**. You are not alone. Indeed, I have come to the conclusion that *it is necessary to struggle with probability*. We have all been there, and one of my main criticisms of traditional statistics teaching is that most treatments skate over the core concepts and go straight to statistical testing methods that the experimenter, with no conceptual grounding (never mind mathematical underpinnings), simply takes on faith.

Probability is difficult to observe. It is an abstract mathematical concept that can only be measured indirectly, from many observations. And simple observed probability is just the beginning. In discussing inferential statistics I try to keep to three notions of probability and a simple labelling system: observed probability, for which I will use the label lower-case *p*, the ‘true’ population probability, capital *P*, and a third type, the probability that our observed probability is reliable, which we denote with α. Many people make mistakes reasoning about that last little variable. But we are getting ahead of ourselves.

The best way to get to grips with probability is to replace my thought experiment with a physical one.

But: **safety first!** Please don’t crash an actual car — use a Scalextric instead!

Tversky, A., and Kahneman, D. 1971. Belief in the law of small numbers. *Psychological Bulletin* **76**:2, 105-110. **»** ePublished

I have been recently reviewing and rewriting a paper for publication that I first wrote back in 2011. The paper (Wallis forthcoming) concerns the problem of how we test whether repeated runs of the same experiment obtain essentially the same results, i.e. results are not significantly different from each other.

These meta-tests can be used to test an experiment for replication: if you repeat an experiment and obtain significantly different results on the first repetition, then, with a 1% error level, you can say there is a 99% chance that the experiment is not replicable.

These tests have other applications. You might be wishing to compare your results with those of others in the literature, compare results with different operationalisation (definitions of variables), or just compare results obtained with different data – such as comparing a grammatical distribution observed in speech with that found within writing.

The design of tests for this purpose is addressed within the *t*-testing ANOVA community, where tests are applied to continuously-valued variables. The solution concerns a particular version of an ANOVA, called “the test for interaction in a factorial analysis of variance” (Sheskin 1997: 489).

However, anyone using data expressed as discrete alternatives (A, B, C etc) has a problem: the classical literature does not explain what you should do.

The rewrite of the paper caused me to distinguish between two types of tests: ‘point tests’, which I describe below, and ‘gradient tests’.

These tests can be used to compare results drawn from 2 × 2 or *r* × *c* χ² tests for homogeneity (also known as tests for independence). This is the most common type of contingency test, which can be computed using Fisher’s exact method or as a Newcombe-Wilson difference interval.

- A **gradient test** (B) evaluates whether the *gradient* or difference between point 1 and point 2 differs between runs of an experiment, *d* = *p*₁ – *p*₂. This concerns whether claims about the rate of change, or size of effect, observed are replicable. Gradient tests can be extended, with increasing degrees of freedom, into tests comparing *patterns* of effect.
- A **point test** (A) simply asks whether data at either point, evaluated separately, differs between experimental runs. This concerns whether single observations, such as *p*₁, are replicable. Point tests can be extended into ‘multi-point’ tests, which we discuss below.

Point tests only apply to homogeneity data. If you wish to compare outcomes from goodness of fit tests, you need a version of the gradient test, to compare differences from an expected *P*, *d* = *p*₁ – *P*. Since different data sets may have different expected *P*, a distinct ‘point test for goodness of fit’ would be meaningless.

The earlier version of the paper, which has been published on this blog since its launch in 2012, focused on gradient tests. The possibility of carrying out a point test was mentioned in passing. In this blog post I want to focus on point tests.

The obvious problem with gradient tests is that two experimental runs might obtain the same gradient but in fact be very different in start and end points. Consider the following graph.

The data in Figure 1 is calculated from two 2 × 2 tables drawn from a paper by Aarts, Close and Wallis (2013).

**Note:** To obtain Figure 2, I simply replaced one frequency in the first table: 46 with 100. The data is also found on the 2×2 homogeneity tab in this Excel spreadsheet, which contains a wide range of separability tests.

To make our exposition clearer, Table 1 uses the same format as in the Excel spreadsheet (with the dependent variable distributed vertically) rather than the format in the paper.

| spoken | LLC (1960s) | ICE-GB (1990s) | Total |
|---|---|---|---|
| *shall* | 124 | 46 | 170 |
| *will* | 501 | 544 | 1,045 |
| Total | 625 | 590 | 1,215 |

| written | LOB (1960s) | FLOB (1990s) | Total |
|---|---|---|---|
| *shall* | 355 | 200 | 555 |
| *will* | 2,798 | 2,723 | 5,521 |
| Total | 3,153 | 2,923 | 6,076 |

Aarts *et al*. carried out 2 × 2 homogeneity tests for the two tables separately. These test whether modal *shall* declines as a proportion of the modal *shall/will* alternation between the two time points. In other words, we compare LLC with ICE-GB data, and LOB with FLOB data.

To carry out a point test we simply rotate the test 90 degrees, e.g. to compare data at the 1960s point we compare LLC with LOB.

As I have explained elsewhere (Wallis 2013), there are a number of different methods for carrying out this comparison.

These include:

- The *z* test for two independent proportions (Sheskin 1997: 226).
- The Newcombe-Wilson interval test (Newcombe 1998).
- The 2 × 2 χ² test for homogeneity (independence).

These are all standard tests and each is discussed in papers and elsewhere on this blog.

The advantage of the third approach is that it is extensible to *c*-way multinomial observations by using a 2 × *c* χ² test.

The tests listed above can be used to compare the 1960s and 1990s intervals in Figure 1 separately.

However, in many cases it would be helpful to have a method that evaluated both pairs of observations in a single test. This can be generalised to a series of *r* observations. To do this, in (Wallis forthcoming) I propose what I call a multi-point test.

We generalise the χ² formula by summing over *i* = 1..*r*:

$\chi^2_d = \sum_{i=1}^{r} \chi^2(i)$

where χ²(*i*) represents the χ² score for homogeneity for each set of data at position *i* in the distribution.

This test has *r* × df(*i*) degrees of freedom, where df(*i*) is the degrees of freedom for each χ² point test. So, in the worked example we have seen, the summed test has two degrees of freedom:

| spoken | LLC (1960s) | ICE-GB (1990s) | Total |
|---|---|---|---|
| *shall* | 124 | 46 | 170 |
| *will* | 501 | 544 | 1,045 |
| Total | 625 | 590 | 1,215 |
| **written** | **LOB (1960s)** | **FLOB (1990s)** | **Total** |
| *shall* | 355 | 200 | 555 |
| *will* | 2,798 | 2,723 | 5,521 |
| Total | 3,153 | 2,923 | 6,076 |
| χ² | 34.6906 | 0.6865 | 35.3772 |

Since the computation sums independently-calculated χ² scores, each score may be individually considered for significant difference (with df(*i*) degrees of freedom). Hence we can see above the large score for the 1960s data (individually significant) and the small score for 1990s (individually non-significant).
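The summed test can be checked with a short sketch. It uses the standard closed form for a 2 × 2 χ² without continuity correction; the names `chi2_2x2` and `points` are hypothetical, and the critical values are the usual tabulated figures for α = 0.05.

```python
def chi2_2x2(a, b, c, d):
    """Pearson chi-square for the 2 x 2 table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Each point compares spoken with written data at one time period:
# (shall spoken, shall written, will spoken, will written)
points = {
    "1960s": (124, 355, 501, 2798),  # LLC vs LOB
    "1990s": (46, 200, 544, 2723),   # ICE-GB vs FLOB
}

scores = {name: chi2_2x2(*cells) for name, cells in points.items()}
total = sum(scores.values())
df = len(points) * 1  # r point tests, each with one degree of freedom

CRITICAL_DF1 = 3.841  # tabulated chi-square critical value, alpha = 0.05, df = 1
CRITICAL_DF2 = 5.991  # tabulated chi-square critical value, alpha = 0.05, df = 2

for name, score in scores.items():
    print(name, round(score, 4), score > CRITICAL_DF1)
print("sum", round(total, 4), "df", df, total > CRITICAL_DF2)
```

The individual scores come out at roughly 34.69 and 0.69, so only the 1960s point is individually significant, while the sum exceeds the critical value for two degrees of freedom.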

**Note:** Whereas χ² is generally associative (non-directional), the summed equation ($\chi^2_d$) is not. Nor is this computation the same as a 3 dimensional test (

- The multi-point test factors out variation between tests over the independent variable (in this instance: time). This means that if there is a lot more data in one table at a particular time period, this fact does not skew the results.
- On the other hand, it does not factor out variation over the dependent variable – after all, this is precisely what we wish to examine!

Naturally, like the point test, this test may be generalised to multinomial observations.

An alternative multi-point test for binomial (two-way) variables employs a sum of χ² values abstracted from Newcombe-Wilson tests.

- Carry out Newcombe-Wilson tests for each point test *i* at a given error level α, obtaining $D_i$, $W_i^-$ and $W_i^+$.
- Identify the inner interval width $W_i$ for each test: if $D_i < 0$, $W_i = W_i^-$; $W_i = W_i^+$ otherwise.
- Use the difference $D_i$ and inner interval $W_i$ to compute χ² scores: $\chi^2(i) = (D_i \cdot z_{\alpha/2} / W_i)^2$.

It is then possible to sum χ²(*i*) as before.
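The steps above can be sketched as follows. This is an illustration under stated assumptions, not the spreadsheet's implementation: the difference is taken as spoken minus written, Wilson intervals are computed without continuity correction, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wilson(p, n, z):
    """Wilson score interval (w-, w+) for proportion p, without continuity correction."""
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - spread, centre + spread


def nw_chi2(f1, n1, f2, n2, alpha=0.05):
    """Chi-square score abstracted from a Newcombe-Wilson test of D = p1 - p2."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p1, p2 = f1 / n1, f2 / n2
    lo1, hi1 = wilson(p1, n1, z)
    lo2, hi2 = wilson(p2, n2, z)
    diff = p1 - p2
    w_minus = -sqrt((hi1 - p1) ** 2 + (p2 - lo2) ** 2)  # lower bound W-
    w_plus = sqrt((p1 - lo1) ** 2 + (hi2 - p2) ** 2)    # upper bound W+
    w = w_minus if diff < 0 else w_plus                 # inner interval width
    return (diff * z / w) ** 2


# 1960s point: shall in LLC (124 of 625) vs LOB (355 of 3,153)
print(nw_chi2(124, 625, 355, 3153))
```

Under these assumptions the 1960s point gives a score of approximately 28.41. Scores for other points depend on the same conventions and may differ from quoted figures if different settings were used.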

Using the data in the worked example we obtain:

**1960s:** $D_i$ = 0.0858,

Since $D_i$ is positive in both cases, we use the upper interval width each time. This gives us χ² scores of 28.4076 and 1.3769 respectively, which obtains a sum of 29.78. Compared to the first method above, this approach tends to downplay extreme differences.

The point test and the additive generalisation of this test into a ‘multi-point test’ represent a method of contrasting multiple runs of the same experiment, comparing observed changes in different subcorpora or genres, or examining the empirical effect of changing definitions of variables.

These tests consider the null hypothesis that **individual observations** are not different; or, in the multi-point case, that **in general** the observations are not different.

- They do not evaluate the gradient between points or the size of effect. If we wish to compare **sizes of effect** we would need to use one of the methods for this purpose described in (Wallis forthcoming).
- The method only applies to comparing tests for homogeneity (independence). To compare **goodness of fit** data, a different approach is required (also described in Wallis forthcoming).

Nonetheless, these tests are useful meta-tests that build on classical Pearson χ² tests, and they are useful tools in our analytical armoury.

Sheskin, D.J. 1997. *Handbook of Parametric and Nonparametric Statistical Procedures*. Boca Raton, Fl: CRC Press.

Newcombe, R.G. 1998. Interval estimation for the difference between independent proportions: comparison of eleven methods. *Statistics in Medicine* **17**: 873-890.

Wallis, S.A. 2013. *z*-squared: the origin and application of χ². *Journal of Quantitative Linguistics* **20**:4, 350-378. » Post

Wallis, S.A. forthcoming (first published 2011). *Comparing χ² tables for separability of distribution and effect*. London: Survey of English Usage. » Post

I have previously argued (Wallis 2014) that interaction evidence is the most fruitful type of corpus linguistics evidence for grammatical research (and doubtless for many other areas of linguistics).

Frequency evidence, which we can write as *p*(*x*), the probability of *x* occurring, concerns itself simply with the overall distribution of a linguistic phenomenon *x* – such as whether informal written English has a higher proportion of interrogative clauses than formal written English. In order to calculate frequency evidence we must define *x*, i.e. decide how to identify interrogative clauses. We must also pick an appropriate baseline *n* for this evaluation, i.e. we need to decide whether to use words, clauses, or any other structure to identify locations where an interrogative clause may occur.

**Interaction evidence** is different. It is a statistical correlation between a decision that a writer or speaker makes at one part of a text, which we will label point *A*, and a decision at another part, point *B*. The idea is shown schematically in Figure 1. *A* and *B* are separate ‘decision points’ in a given relationship (e.g. lexical adjacency), which can be also considered as ‘variables’.

This class of evidence is used in a wide range of computational algorithms. These include collocation methods, part-of-speech taggers, and probabilistic parsers. Despite the promise of interaction evidence, the majority of corpus studies tend to consist of discussions of frequency differences and distributions.

In this paper I want to look at applications of interaction evidence which are made more-or-less at the same time by the same speaker/writer. In such circumstances we cannot be sure that just because *B* follows *A* in the text, the decision relating to *B* was made after the decision at *A*.

For example, in studying the premodification of noun phrases by attributive adjectives in English – which adjective is applied first in assembling an NP like *the old tall green ship*, for instance – **we cannot be sure that adjectives are selected by the speaker in sentence order**. It is also perfectly plausible that adjectives were chosen in an alternative or parallel order in the mind of the speaker, and then assembled in the final order during the language production process.

Of course, in cases where points *A* and *B* are separated substantively in time (as in many instances of structural self-priming) or where *B* is spoken in response to *A* by another speaker (structural priming of another’s language), there is unlikely to be any ambiguity about decision order. Moreover, if *A* licenses *B*, then the order is unambiguous.

However, in circumstances where *A* and *B* are proximal, and where the order of decisions made by the speaker/writer cannot be presumed, we wish to consider whether there are mathematical or statistical methods for predicting the most likely order in which decisions were made.

Such a method would have considerable value in experimental design in cognitive corpus linguistics. For example, since Heads of NPs, VPs etc are conceived of as determining their complements, it may not be too much of a stretch to argue that if this method works, we may have found a way of empirically evaluating this grammatical concept.

1. Introduction
2. A collocation example
   - 2.1 Employing chi-square and phi
   - 2.2 Directional statistics
   - 2.3 Significantly directional?
3. A grammatical example
   - 3.1 Testing for difference under alternation
   - 3.2 Comparing Newcombe-Wilson intervals for direction
   - 3.3 Optimising the difference interval
4. Mapping significance of association and direction
5. Concluding remarks
6. References

Wallis, S.A. 2017. *Detecting direction in interaction evidence*. London: Survey of English Usage. **»** Paper (PDF)

- Excel spreadsheets

Wallis, S.A. 2011. *Comparing χ² tests for separability*. London: Survey of English Usage, UCL. **»** post

Wallis, S.A. 2012. *Goodness of fit measures for discrete categorical data*. London: Survey of English Usage, UCL. **»** post

Wallis, S.A. 2013a. *z*-squared: the origin and application of χ². *Journal of Quantitative Linguistics* **20**:4, 350-378. **»** post

Wallis, S.A. 2013b. Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods. *Journal of Quantitative Linguistics* **20**:3, 178-208. **»** post

Wallis, S.A. 2014. What might a corpus of parsed spoken data tell us about language? In L. Veselovská and M. Janebová (eds.) *Complex Visibles Out There. Proceedings of the Olomouc Linguistics Colloquium 2014: Language Use and Linguistic Structure.* Olomouc: Palacký University, 2014. pp 641-662. **»** post

Wallis, S.A. forthcoming. *That vexed problem of choice*. London: Survey of English Usage, UCL. **»** post

The Summer School is a short three-day intensive course aimed at PhD-level students and researchers who wish to get to grips with Corpus Linguistics. Numbers are deliberately limited on a first-come, first-served basis. You will be taught in a small group by a teaching team.

Each day begins with a theory lecture, followed by a guided hands-on workshop with corpora, and a more self-directed and supported practical session in the afternoon.

Over the three days, participants will learn about the following:

- the scope of Corpus Linguistics, and how we can use it to study the English Language;
- key issues in Corpus Linguistics methodology;
- how to use corpora to analyse issues in syntax and semantics;
- basic elements of statistics;
- how to navigate large and small corpora, particularly ICE-GB and DCPSE.

At the end of the course, participants will have:

- acquired a basic but solid knowledge of the terminology, concepts and methodologies used in English Corpus Linguistics;
- had practical experience working with two state-of-the-art corpora and a corpus exploration tool (ICECUP);
- gained an understanding of the breadth of Corpus Linguistics and the potential application for projects;
- learned about the fundamental concepts of inferential statistics and their practical application to Corpus Linguistics.

For more information, including costs, booking information, timetable, see the website.

Over the last year, the field of psychology has been rocked by a major public dispute about statistics. This concerns the failure of claims in papers, published in top psychological journals, to replicate.

Replication is a big deal: if you publish a correlation between variable *X* and variable *Y* – that there is an increase in the use of the progressive over time, say, and that this increase is statistically significant – you expect that this finding would be replicated were the experiment repeated.

I would strongly recommend Andrew Gelman’s brief history of the developing crisis in psychology. It is not necessary to agree with everything he says (personally, I find little to disagree with, although his argument is challenging) to recognise that he describes a serious problem here.

There may be more than one reason why published studies have failed to obtain compatible results on repetition, and so it is worth sifting these out.

In this blog post, what I want to do is try to explore what this replication crisis is – is it one problem, or several? – and then turn to what solutions might be available and what the implications are for corpus linguistics.

The debate between Neil Millar and Geoff Leech regarding the alleged increase (Millar 2009) and decline (Leech 2011) of the modal auxiliary verbs is an example of this problem.

Millar based his conclusions on the TIME corpus, discovering that the rate of modal verbs per million words tended to increase over time. Leech, using the Brown series of US English corpora, discovered the opposite. Both applied statistical methods to their data but obtained very different conclusions.

Inferential statistics operates by predicting the result of repeated runs of the same experiment, i.e. on samples of data drawn from the same population.

Stating that something “significantly increases over time” can be reformulated as:

- subject to caveats of **random sampling** (the sample is, or approximates to, a random sample of utterances drawn from the same population), and **Binomial variables** (observations are free to vary from 0 to 1),
- we can calculate a **confidence interval** at a given error rate (say 1 in 20 times for a 5% error rate / 95% interval) on the difference in two observations of variable *X* taken at two time points 1 and 2, *x*₂ – *x*₁,
- **all points** within this interval (including the lower bound) are greater than 0,
- **on repeated runs of the same experiment we can expect to see an observation fall outside of the confidence interval of the difference at the predicted rate** (here, 1 time in 20).
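To make the second and third bullet points concrete, here is a minimal sketch of one way to compute an interval on the difference *x*₂ – *x*₁: the Newcombe-Wilson method (Newcombe 1998), which combines the Wilson score intervals of the two observations. The function names and the frequencies in the example are my own illustrative assumptions, not data from either study.

```python
from math import sqrt

Z_95 = 1.959964  # two-tailed Normal critical value for a 5% error rate


def wilson_interval(p, n, z=Z_95):
    """Wilson score interval (lower, upper) for an observed proportion p = f / n."""
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - spread, centre + spread


def newcombe_wilson_difference(f1, n1, f2, n2, z=Z_95):
    """Return (d, lower, upper) for the difference d = p2 - p1."""
    p1, p2 = f1 / n1, f2 / n2
    l1, u1 = wilson_interval(p1, n1, z)
    l2, u2 = wilson_interval(p2, n2, z)
    d = p2 - p1
    lower = d - sqrt((p2 - l2) ** 2 + (u1 - p1) ** 2)
    upper = d + sqrt((u2 - p2) ** 2 + (p1 - l1) ** 2)
    return d, lower, upper


# Hypothetical frequencies: 30 out of 200 cases at time 1, 60 out of 200 at time 2.
d, lower, upper = newcombe_wilson_difference(30, 200, 60, 200)
print(f"difference = {d:.4f}, interval = ({lower:.4f}, {upper:.4f})")
print("significant increase" if lower > 0 else "no significant increase")
```

If the lower bound is greater than 0, every point in the interval is greater than 0, and we would report a significant increase at this error level. Note that the calculation needs raw frequencies *f* and a meaningful baseline *n*, not per-million-word rates.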

**Note:** For the purposes of this blog post, I am focusing on the last bullet point – when we say that something “fails to replicate”, we mean that on a repetition the result falls outside the confidence interval of the difference *on the very next occasion!* More precisely, we mean that the results are statistically separable.

Leech obtained a different result from Millar on the first attempted repetition of this experiment. This could be a fluke, but it seems to be a failure to replicate. There should only be a 1 in 20 chance of this happening.

Observing such a replication failure should lead us to ask some searching questions about these two studies, many of which are discussed elsewhere in this blog.

Much of the controversy can be summed up by the bottom row in this table, drawn from Millar (2009). This appears to show a 23% increase in modal use between the 1920s and 2000s. With a lot of data and a sizeable effect, this increase seems bound to be significant.

| | 1920s | 1930s | 1940s | 1950s | 1960s | 1970s | 1980s | 1990s | 2000s | % diff 1920s-2000s |
|---|---|---|---|---|---|---|---|---|---|---|
| will | 2,194.63 | 1,681.76 | 1,856.40 | 1,988.37 | 1,965.76 | 2,135.73 | 2,057.43 | 2,273.23 | 2,362.52 | +7.7% |
| would | 1,690.70 | 1,665.01 | 2,095.76 | 1,669.18 | 1,513.30 | 1,828.92 | 1,758.44 | 1,797.03 | 1,693.19 | +0.1% |
| can | 832.91 | 742.30 | 955.73 | 1,093.39 | 1,233.13 | 1,305.82 | 1,231.99 | 1,475.95 | 1,777.07 | +113.4% |
| could | 661.33 | 822.72 | 1,188.24 | 998.83 | 950.73 | 1,106.25 | 1,156.61 | 1,378.39 | 1,342.56 | +103.0% |
| may | 583.59 | 515.12 | 496.93 | 502.74 | 628.13 | 743.66 | 775.92 | 937.08 | 931.91 | +59.7% |
| should | 577.46 | 450.07 | 454.87 | 495.26 | 441.96 | 475.50 | 453.33 | 521.46 | 593.27 | +2.7% |
| must | 485.31 | 418.03 | 456.57 | 417.62 | 401.36 | 390.47 | 347.02 | 306.69 | 250.59 | -48.4% |
| might | 374.52 | 375.40 | 500.33 | 408.90 | 399.80 | 458.99 | 416.81 | 474.23 | 433.34 | +15.7% |
| shall | 212.19 | 120.79 | 96.42 | 70.52 | 50.48 | 35.65 | 25.93 | 16.09 | 9.26 | -95.6% |
| ought | 50.22 | 37.94 | 39.31 | 40.34 | 36.91 | 34.29 | 28.27 | 34.90 | 27.65 | -44.9% |
| Total | 7,662.86 | 6,829.14 | 8,140.56 | 7,685.15 | 7,621.56 | 8,515.28 | 8,251.75 | 9,215.05 | 9,421.36 | +22.9% |
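As a quick check on the bottom row, the following sketch recomputes the headline percentage difference from the per-million-word totals in the table, and lists the decades in which the total fell relative to the previous decade. The variable names are mine. These are plain arithmetic comparisons of published rates, not significance tests, which would require raw frequencies.

```python
# Per-million-word totals for all modals, copied from the bottom row of the table.
totals = {
    "1920s": 7662.86, "1930s": 6829.14, "1940s": 8140.56,
    "1950s": 7685.15, "1960s": 7621.56, "1970s": 8515.28,
    "1980s": 8251.75, "1990s": 9215.05, "2000s": 9421.36,
}

decades = list(totals)
first, last = totals[decades[0]], totals[decades[-1]]

# Percentage difference between the first and last decade.
percent_diff = (last - first) / first * 100

# Decades where the rate is lower than in the preceding decade.
falls = [cur for prev, cur in zip(decades, decades[1:]) if totals[cur] < totals[prev]]

print(f"% diff {decades[0]}-{decades[-1]}: {percent_diff:+.1f}%")
print("decades with a fall on the previous decade:", ", ".join(falls))
```

The end-to-end figure summarises two points out of nine. The decade-by-decade comparison shows that the path between them is not a steady climb.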

In attempting to identify why Leech and Millar obtain different results, the following questions should be considered.

- **Are the two samples drawn from the same population, or are they drawn from two distinct populations?** To put it another way, are there characteristics of the TIME data that make it distinct from the general written data in the Brown corpora? For example, does TIME have a ‘house style’, with subeditors enforcing it, which has led to a greater frequency of modal use? Has TIME tended to curate more stories with more modal hedges than the overall trend? Jill Bowie (Bowie *et al.* 2013) reported that genre subdivisions within the spoken DCPSE corpus often exposed different modal trends.
- **Does Millar’s data support a general observation of increased modal use?** Bowie observes that Millar’s aggregate data fluctuates over the entire time period (see Table, bottom row), and some changes in sub-periods appear to be consistent with the trend reported by Leech in an earlier study in 2003. According to this observation, simply expressing the trend as an increase in modal verb use seems misleading.
- **Is it legitimate to aggregate all modals together?** In one sense, modals are a well-defined category of verb: a closed category, especially if one excludes the semi-modals. So “modal use” is a legitimate variable. But we can also see that different modal verbs are undergoing different patterns of change over time (see Table). Millar reports that *shall* and *must* are in decline in his data while *will* and *can* are increasing. Whereas *shall* and *will* may be alternates in some contexts, this does not mean that bundling all modal trends together is particularly meaningful. Moreover, since the synchronic distribution of modals (like most linguistic variables) is sensitive to genre, this issue also interacts with my first bullet point, i.e. the fact that there are known differences between corpora.
- **How reliable is a per-million-word measure?** What does the data look like if we use a different baseline, for example, modal use per tensed verb phrase (or tensed main verb)?

  Doing this allows us to factor out variation in ‘tensed VP density’ (i.e. the variation in potential sites for modals to be deployed) between texts. Failure to do this (as both Leech and Millar do) means that we are not measuring when writers **choose** to use modal verbs, but the rate at which we, the reader, are **exposed** to them. See That vexed problem of choice.

  If VP density in text samples changes over time in either corpus, this may explain these different results – not as a result of increasing or declining modal use but as a result of increasing or declining tensed VP density (or declining / increasing density of other constituents). More generally, word-based baselines almost always conflate opportunity and use because the option to insert the element is not available following every other word (exceptions might include pauses or expletives, but these exceptions prove the rule). This conflation undermines the Binomial model and increases the risk that results will not replicate. The solution is to focus on identifying each choice-point as much as possible.
- **Does per word (per-million-word) data conform to the Binomial statistical model?** Since the entire corpus cannot consist of modal verbs, observations of modal verbs can never approach 100%, so the answer has to be no. However, the effect of this inappropriate model is that it tends to lead to the underreporting of otherwise significant results. See Freedom to vary and statistical tests. This may be a problem, but logically, it cannot be an explanation for obtaining two different ‘significant’ results in opposite directions!
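The baseline argument can be illustrated with a toy calculation. In this sketch all the frequencies are invented for the purpose of illustration: writers choose a modal in exactly 10% of tensed VPs in both periods, but the second sample happens to contain more tensed VPs per word.

```python
def modal_rates(modals, tensed_vps, words):
    """Return (rate per million words, rate per tensed VP) for one sample."""
    per_million_words = modals / words * 1_000_000
    per_tensed_vp = modals / tensed_vps
    return per_million_words, per_tensed_vp


# Hypothetical samples of one million words each (invented figures).
period_1 = dict(modals=10_000, tensed_vps=100_000, words=1_000_000)
period_2 = dict(modals=12_000, tensed_vps=120_000, words=1_000_000)

pmw_1, choice_1 = modal_rates(**period_1)
pmw_2, choice_2 = modal_rates(**period_2)

# Exposure rate: what the reader encounters per million words.
print(f"per million words: {pmw_1:.0f} -> {pmw_2:.0f}")
# Choice rate: how often writers use a modal when they have the opportunity.
print(f"per tensed VP:     {choice_1:.3f} -> {choice_2:.3f}")
```

On a per-million-word baseline the second period shows an apparent 20% increase in modal use. On a per-tensed-VP baseline there is no change at all: the entire ‘increase’ is due to the change in VP density.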

All of the above are reasons to be unsurprised at the fact that Millar’s summary finding was not replicated in Leech’s data. But to be fair, many of Millar’s individual trends *do* appear to be consistent with results found in the Brown corpus.

As we shall see, the problem of replication is not that *all* results in one study are not reproduced in another study, rather it is that *some* results are not reproduced. But this observation raises an obvious question: which results should we cite?

Moreover, if our most remarked-upon finding is not replicated, we have an obvious problem.

The replication crisis has been most discussed in psychology and the social sciences. In psychology, some published findings have been controversial to say the least. Claims that ‘Engineers have more sons; nurses have more daughters’ have tended to attract the interest of other psychologists relatively quickly. But this is shooting fish in a barrel.

In psychology, it is common to perform studies with small numbers of participants – 10 per experimental condition is usually cited as a minimum, which means that between 20 and 40 participants becomes the norm. Many kinds of failure to replicate are due to what statisticians tend to call ‘basic errors’, such as using an inappropriate statistical test. I discuss this elsewhere in this blog.

- The most common error is applying a mathematical model to data that does not conform to it. For example, applying a Binomial model that assumes that an observed probability is free to vary from 0 to 1 to a variable that can only vary between 0 and 0.001 (say), is mathematically unsound. No method that makes this assumption will work the way that the Binomial model predicts when it comes to replication.
- Corpus linguistics has a particular historical problem due to the ubiquity of studies employing word-based baselines (per million words, per thousand words etc). It is not possible to adjust an error level to fix this problem, because the problem is one of missing data — in this case, frequency data for a meaningful choice baseline (ideally, the frequency of alternate forms). Bravo for variationism.

This is why in this blog I have tended to argue for applying the simplest possible experimental designs (2 × 2 contingency tests, for example) over multivariate regression algorithms which may work, but are treated as ‘black boxes’ by almost all who use them. Such algorithms may ‘over fit’ data, i.e. they match the data more closely than is mathematically justified. But more importantly, they (and the assumptions underpinning them) are not transparent to their users.

I argue that if you don’t understand how your results were derived, you are taking them on faith.

This does not mean that I think multi-variable methods cannot be theoretically superior to, or potentially more powerful than, simpler tests. Rather, my objection is that before we use any statistical method we need to be sure that we understand what it is doing with our data. We have to ask ourselves constantly, *what do our results mean?*

However, the replication problem does not go away entirely once we have dealt with these so-called basic errors.

Andrew Gelman and Eric Loken (2013) raise a more fundamental problem that, if valid, is particularly problematic for corpus linguists. This concerns a question that goes to the heart of the post-hoc analysis of data, and the fundamental philosophy of statistical claims and the scientific method.

Essentially their argument goes like this.

- All data contains random noise, and thus every variable in a dataset (extracted from a corpus) will contain random noise. Researchers tend to assume that by employing a significance test we ‘control’ for this noise. But this is a mischaracterisation. Faced with a dataset consisting of pure noise, we would detect a ‘significant’ result 1 in 20 times (at a 0.05 threshold). Another way of thinking about this is that statistical methods can find patterns in data (correlations) even when there are no patterns to be found.
- Any data set may contain multiple variables, there are multiple potential definitions of these variables, and there are multiple analyses we could perform on the data. In a corpus we could modify definitions of variables, perform new queries, change baselines, etc., to perform new analyses.
- It follows that there is a very large number of potential hypotheses we *could* test against the data. (Note: this is not an argument against exploring the hypothesis space in order to choose a better baseline on theoretical grounds!)
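The first point in this list is easy to demonstrate by simulation. The sketch below is my own illustration, not Gelman and Loken’s: it repeatedly draws two samples from the *same* population, so any difference between them is pure noise, and applies a standard two-proportion *z* test (equivalent to a 2 × 2 χ² test) at a 0.05 threshold. The sample size, true rate, number of trials and random seed are arbitrary assumptions.

```python
import random
from math import sqrt


def two_proportion_z(f1, n1, f2, n2):
    """z score for the difference between two observed proportions (pooled variance)."""
    p1, p2 = f1 / n1, f2 / n2
    pooled = (f1 + f2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return 0.0 if se == 0 else (p2 - p1) / se


def false_positive_rate(trials=2000, n=200, true_p=0.3, z_crit=1.959964, seed=1):
    """Proportion of 'significant' results when both samples share the same true rate."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        f1 = sum(rng.random() < true_p for _ in range(n))
        f2 = sum(rng.random() < true_p for _ in range(n))
        if abs(two_proportion_z(f1, n, f2, n)) > z_crit:
            hits += 1
    return hits / trials


rate = false_positive_rate()
print(f"'significant' results found in pure noise: {rate:.1%}")
```

The rate should come out close to 1 in 20. The test does what it promises, but what it promises is an error rate, not the elimination of noise.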

This part of the argument is not very controversial. However, Gelman and Loken’s more provocative claim is as follows.

- Few researchers would admit to running very many tests against data and reporting results, which the authors term ‘fishing’ for significant results, or ‘p-hacking’. There are some algorithms that do this (multivariate logistic regression anyone?), but most research is not like this.
- Unfortunately, the authors argue, **standard post-hoc analysis methods – exploring data, graphing results and reporting significant results – do much the same thing.** We dispense with blind alleys (what they call ‘forking paths’), because we can see that they are not likely to produce significant results. Although we don’t actually run these dead-end tests, for mathematical purposes *our educated eyeballing of data to focus on interesting phenomena has done the same thing*.

- As a result, we overestimate the robustness of our results, and often, they fail to replicate.

Gelman and Loken are not alone in making this criticism. Cumming (2014) objects to ‘NHST’ (null hypothesis significance testing), interpreted as an imperative that

“explains selective publication, motivates data selection and tweaking until the *p* value is sufficiently small, and deludes us into thinking that any finding that meets the criterion of statistical significance is true and does not require replication.”

Since it would be unfair to criticise others for a problem that my own work may be prone to, let us consider the following graph that we used while writing Bowie and Wallis (2016). The graph does not appear in the final version of the paper – not because we didn’t like it, but because we decided to adopt a different baseline in breaking down an overall pattern of change into sub-components. But it is typical of the kind of graph we might be interested in examining.

There are two critical questions that follow from Gelman and Loken’s critique.

- *In plotting this kind of graph and reporting confidence intervals, are we misrepresenting the level of certainty found in the graph?*
- *Are we engaging in, or encouraging, retrospective cherry-picking of contrasts between observations and confidence intervals?*

In the following graph there are 19 decades and 5 trend lines, i.e. 95 confidence intervals. There are 171 × 5 potential pairwise comparisons between decades along each line, and 10 × 19 vertical pairwise comparisons between lines within each decade. So there are, let’s say, 1,045 potential statistical pairwise tests which would be reasonable to carry out. With a 1 in 20 error rate, we would expect around 52 of these comparisons to be ‘significant’ by chance alone, and thus incapable of replication.
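The arithmetic behind these figures is a simple exercise in counting combinations. A minimal sketch (the variable names are mine):

```python
from math import comb

decades = 19       # points along each trend line
trend_lines = 5    # lines on the graph
error_rate = 0.05  # 1 in 20

# One confidence interval per data point.
intervals = decades * trend_lines

# Comparisons between any two decades on the same trend line.
within_line = comb(decades, 2) * trend_lines

# 'Vertical' comparisons between any two trend lines in the same decade.
between_lines = comb(trend_lines, 2) * decades

total = within_line + between_lines
expected_false_positives = total * error_rate

print(intervals, within_line, between_lines, total)
print(f"expected 'significant' results by chance alone: {expected_false_positives:.2f}")
```

This counts only comparisons within a line or within a decade. Allowing diagonal comparisons between different lines at different decades would increase the total further.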

Gelman, Loken, Cumming *et al.* would argue that by selecting a few statistically significant claims from this graph, we have committed precisely the error they object to.

However, I have to defend this graph, and others like it, by arguing that **this is not our method**. We don’t sift through 1,045 possible comparisons and then report significant results selectively! In the paper, and in our work more generally, we really don’t encourage this kind of cherry-picking (the human equivalent of over-fitting). We are more concerned with the overall patterns that we see, general trends, etc., which are more likely to be replicable in broad terms.

Thus, for example, in that paper we don’t pull out specific significant pairwise comparisons to make strong claims. In this particular graph we can see an apparently statistically significant sharp decline between 1900 and 1930 in the tendency of writers to use the verb SAY (as in *he is said to have stayed behind*) before a *to-*infinitive perfect, compared to the other verbs in the group. This observation may be replicable, but **the conclusions of the paper do not depend on this observation**. This claim, and similar claims, do not appear in the paper.

Similarly, if we turn back to Neil Millar’s modals-per-million-word data for a moment, Bowie’s observation that the data does not show a consistent increase over time is interesting. Millar did not select the time period in order to report that modals were on the increase – on the contrary, he non-arbitrarily took the start and end point of the timeframe sampled. But the conclusion that ‘modals increased over the entire period’ was only one statement that described the data. In shorter periods there was a significant fall, and different modal verbs behaved differently. Indeed, the complexity of his results is best summed up by the detailed graphs within his paper!

**In conclusion:** it is better to present and discuss the pattern, not just the end point – or the slogan.

Nonetheless we may still have the sneaking suspicion that what we are doing is a kind of researcher bias. We tend to report statistically significant results and ignore those inconvenient non-significant ones. The fear is that results assumed to be due to chance 1 in 20 times are more likely due to chance 1 in 5 times (say), simply because we have – inadvertently and unconsciously – already preselected our data and methods to obtain significant results.

Some highly experienced researchers have suggested that we fix this problem by adopting tougher error levels – adopt a 1 in 100 level and we might arrive at 1 in 25. The problem is that this assumes we know the appropriate multiplier to apply.

It is entirely legitimate to adjust an error level to ensure that multiple independent tests are simultaneously significant, as some fitting algorithms do. But if a statistical model is incorrectly applied to data, logically the solution must lie in correcting the model, not the error level.

Gelman and Loken suggest instead that published studies should always involve a replication process. They argue it is preferable that researchers publish half as many experiments and include a replication step than publish non-replicable results.

**Suggested method:** Before you start, create two random subcorpora A and B by randomly drawing texts from the corpus and assigning them to A and B in turn. You may wish to control for balance, e.g. to ensure subsampling is drawn equitably from each genre category. Perform the study on A, and summarise the results. Without changing a single query, variable or analysis step, apply exactly the same analysis to B.
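The partitioning step might be sketched as follows. This is only one possible implementation under assumptions of my own: each text is represented as a hypothetical `(text_id, genre)` pair, texts are shuffled within each genre, and then dealt alternately to A and B so that each genre is split as evenly as possible.

```python
import random
from collections import defaultdict


def split_corpus(texts, seed=0):
    """Split (text_id, genre) pairs into two genre-balanced random subcorpora."""
    rng = random.Random(seed)  # fixed seed so that the split itself is reproducible
    by_genre = defaultdict(list)
    for text_id, genre in texts:
        by_genre[genre].append(text_id)

    subcorpus_a, subcorpus_b = [], []
    for genre in sorted(by_genre):
        ids = by_genre[genre]
        rng.shuffle(ids)
        # Deal shuffled texts to A and B in turn.
        for i, text_id in enumerate(ids):
            (subcorpus_a if i % 2 == 0 else subcorpus_b).append(text_id)
    return subcorpus_a, subcorpus_b


# Hypothetical corpus: three genres with ten texts each.
texts = [(f"{genre}-{i:02d}", genre)
         for genre in ("conversation", "press", "fiction")
         for i in range(10)]

a, b = split_corpus(texts, seed=42)
print(len(a), len(b))
```

Where a genre contains an odd number of texts, this version gives the extra text to A. The important point is that the split is made *before* any analysis is carried out, and that the seed is recorded so others can reproduce it.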

Do we get **compatible results**, i.e. *results that fall within the confidence intervals of the first experiment*? More precisely, can the two sets of results be statistically separated? If not, they are compatible.

An alternative to formal replication is to repeat the experiment with well-defined, as distinct from randomly generated, subcorpora.

**Sampling subcorpora:** Suppose you apply an analysis to spoken data in ICE-GB, and then repeat it with written data. Do we get broadly similar results? If we obtain comparable results for two subcorpora with a known difference in sampling, it is probable they would pass a replication test where two subsamples were not sampled differently. On the other hand, if results *are* different, this would justify further investigation.

Even where replication is not carried out (for reasons of insufficient data, perhaps), an uncontroversial corollary of this argument is that your research method should be sufficiently transparent so that it can be replicated by others.

As a general principle, authors should make raw frequency data available to permit a reanalysis by other analysis methods. I find it frustrating when papers publish per million word frequencies in tables, when what is needed for a reanalysis is raw frequency data!

Another of Gelman and Loken’s recommendations is that researchers need to spend more time focusing on sizes of effect, rather than just reporting statistical significance. With lots of data and large effect sizes, the problem is reduced. Certainly we should be wary of citing just-significant results with a small effect size.

Where does this leave the arguments I have made elsewhere in favour of visualising data with confidence intervals? One of the implications of the ‘forking paths’ argument is that we tend not to report dead-end, non-significant results. But well considered graphs can visualise all data in a given frame, rather than selected data (of course we have to ‘frame’ this data, select variables, etc.).

One advantage of graphing data with confidence intervals is that we apply the same criteria to all data points and allow the reader to interpret the graph. Significant and non-significant contrasts are available to be viewed. We also visualise effect sizes and the weight of evidence (confidence intervals), even if it is arguable that our model is insufficiently conservative.

Thus a strength of Millar’s paper is the reporting of trends and graphs. In the graph above, the confidence intervals improve our understanding of the overall trends we see.

We just should not assume that every significant difference will be replicable.

This is really one of mine, but I suggest it is implicit in the argument above.

It seems to me to be an absolutely essential requirement for any empirical scientist to play devil’s advocate to their own hypothesis.

That is, it is not sufficient to ‘find something interesting in data’, and publish. What we are really trying to do is detect meaningful phenomena in data, or to put it another way, we are trying to find robust evidence of phenomena that have implications for linguistic theory. We are trying to move from observed correlation to a hypothesised underlying cause.

Statistics is a tool to help us do this. But logic also plays an essential part.

Without wishing to create a checklist for empirical linguistics (such that a researcher is convinced of the validity of their results simply because they can tick off the list), we might argue that the following steps are necessary in all empirical research.

- **Identify the underlying research question**, framed in general theoretical terms.
- **Operationalise the research question** as a series of testable hypotheses or predictions, and evaluate them. Plot graphs! Visualising data with confidence intervals allows us to visualise expected variation and make more robust claims.
- **Focus reporting on global patterns** across the entire dataset. If your research ends up prioritising an apparently unusual local pattern in a selected part of the data, consider whether this may be an artefact of sampling.
- **Critique the results of this evaluation** in terms of the original research question, and play devil’s advocate: what other possible underlying explanations might there be for the observed results?
- **Consider alternative hypotheses** and test them. Try to design new experiments to separate out different possible explanations for the observed phenomenon.
- **Plan to include a replication step** prior to publication. This means being prepared to partition the data in the way described above, dividing the corpus into different pools of source texts.

Whether or not Gelman and Loken’s argument applies to your corpus linguistics study – and we have to eliminate basic errors first – the principal conclusion is that it is difficult to overstate the importance of **reporting accuracy and transparency**. If the study does not appear to replicate in the future, possible reasons must be capable of exploration by future researchers. It would not have been possible to explore the differences between Leech and Millar’s data had Neil Millar simply summarised a few trends and reported some statistically significant findings.

It is incumbent on all of us to properly describe the limitations of data and sampling; definitions of variables and abstraction (query) methods for populating them; as well as graphing data to reveal both significant and non-significant patterns at the same time.

A typical mistake is to refer to ‘British English’ (say) as a shorthand for ‘data drawn from British English texts sampled according to the sampling frame defined in Section 3’. Many failures to replicate in psychology can be attributed to precisely this type of logical error – that the experimental dataset is not a reliable model for the population claimed.

Finally, Cumming (2014) makes an important distinction between **exploratory research** and **prespecified research**. Corpus linguistics is almost inevitably exploratory, as it is impossible to prespecify data collection in post-hoc analysis. In a natural experiment we cannot control for confounding variables, and we must frame our conclusions accordingly.

Bowie, J., Wallis, S.A. and Aarts, B. 2013. Contemporary change in modal usage in spoken British English: mapping the impact of “genre”. In Marín-Arrese, J.I., Carretero, M., Arús H.J. and van der Auwera, J. (eds.) *English Modality*, Berlin: De Gruyter, 57-94.

Bowie, J. and Wallis, S.A. 2016. The *to*-infinitival perfect: A study of decline. In Werner, V., Seoane, E., and Suárez-Gómez, C. (eds.) *Re-assessing the Present Perfect*, Topics in English Linguistics (TiEL) 91. Berlin: De Gruyter, 43-94.

Cumming, G. 2014. The New Statistics: Why and How, *Psychological Science*, 25(1), 7-29.

Gelman, A. and Loken, E. 2013. The garden of forking paths. Columbia University. **»** ePublished.

Leech, G. 2011. The modals ARE declining: reply to Neil Millar’s ‘Modal verbs in TIME: frequency changes 1923–2006’. *International Journal of Corpus Linguistics* 16(4).

Millar, N. 2009. Modal verbs in TIME: frequency changes 1923–2006. *International Journal of Corpus Linguistics* 14(2), 191–220.

One of the longest-running, and in many respects the least helpful, methodological debates in corpus linguistics concerns the spat between so-called **corpus-driven** and **corpus-based** linguists.

I say that this has been largely unhelpful because it has encouraged a dichotomy which is almost certainly false, and the focus on whether it is ‘right’ to work from corpus data upwards towards theory, or from theory downwards towards text, distracts from some serious methodological challenges we need to consider (see other posts on this blog).

Usually this discussion reviews the achievements of the most well-known corpus-driven linguist, John Sinclair, in building the *Collins Cobuild Corpus*, and deriving the *Collins Cobuild Dictionary* (Sinclair *et al*. 1987) and *Grammar* (Sinclair *et al*. 1990) from it.

**In this post I propose an alternative examination.**

I want to suggest that *the greatest success story for corpus-driven research is the development of part-of-speech taggers* (usually called ‘POS-taggers’ or simply ‘taggers’) trained on corpus data.

These are industrial strength, reliable algorithms, that obtain good results with minimal assumptions about language.

So, *who needs theory?*

Taggers consist of two parts:

- **a ‘learning’ algorithm** that collects rules from training data, and
- **a ‘tagging’ algorithm** which applies rules to new texts to classify words by their part of speech (word class).

The corpus-driven aspect is the ‘learning’ algorithm.

A typical rule might be that if the word *old* (which can be a noun/nominal adjective, as in *the old*, or adjective, *the old man*) is followed by a noun, then *old* is more likely to be an adjective than otherwise.

The tagging algorithm takes a sentence and applies these rules like a crossword solver. It classifies the words that it is most certain of before considering those it is less confident about. Thus, in *the old man*, *the* is unambiguously a determiner, whereas both *old* and *man* can belong to more than one word class.

The learning algorithm generates summary statistics bottom-up from training data it is given, which are lots of sentences/texts which have already been tagged with the same part of speech scheme (i.e., a corpus).

It is not necessary to make many assumptions about the grammar of the language we are working with to obtain results comparable to the best reported in the literature. The computer does not need to ‘know’ what a noun or a verb is. It can simply obtain statistics about these different categories from the corpus.

But these algorithms *do* embody some assumptions about their language input. These assumptions can be enumerated as follows, although different classification schemes might vary in some details:

1. language consists of **sentences** divided into lexical **words**;
2. each **sentence** is capable of being analysed separately;
3. **words** include part-words such as genitive markers and cliticised words, and compounds, where multiple words can be given the same tag;
4. there is a fixed set of **word class tags** that each particular instance of a word can be categorised by – these commonly consist of word class category (noun, verb, etc.), plus secondary information (plural proper noun, copular verb, etc.);
5. these tags were correctly applied to the **training data**.

Databases extracted by the learning algorithm typically consist of **frequency distributions** for every word-tag pattern, i.e. the number of cases in the training corpus where a given lexical word has a particular tag; and **transition probabilities** for each word-tag pattern if words have more than one tag.
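To give a flavour of how little the learning step needs to ‘know’, here is a deliberately minimal sketch. It is not the algorithm of any particular tagger: the tag names, the three-sentence training ‘corpus’ and the simplification of storing transitions between tags only (rather than between word-tag patterns) are all assumptions of mine.

```python
from collections import Counter, defaultdict


def learn(tagged_sentences):
    """Collect word-tag frequencies and tag-to-tag transition counts from tagged data."""
    word_tag = defaultdict(Counter)     # word -> frequency of each tag
    transitions = defaultdict(Counter)  # previous tag -> frequency of each next tag
    for sentence in tagged_sentences:
        previous = "<s>"                # marker for the start of a sentence
        for word, tag in sentence:
            word_tag[word.lower()][tag] += 1
            transitions[previous][tag] += 1
            previous = tag
    return word_tag, transitions


def transition_probability(transitions, previous, tag):
    """Estimate P(tag | previous tag) from the collected counts."""
    total = sum(transitions[previous].values())
    return transitions[previous][tag] / total if total else 0.0


# Hypothetical, hand-tagged training data.
training = [
    [("the", "DET"), ("old", "ADJ"), ("man", "N")],
    [("the", "DET"), ("old", "N"), ("man", "V"), ("the", "DET"), ("boats", "N")],
    [("the", "DET"), ("old", "ADJ"), ("dog", "N")],
]

word_tag, transitions = learn(training)
print(dict(word_tag["old"]))
print(transition_probability(transitions, "DET", "ADJ"))
```

Nothing in this code refers to what a determiner, adjective or noun *is*. The categories are simply labels found in the training data. But note that the code already embodies the assumptions listed above: it processes one sentence at a time, treats words as units, and trusts the tags it is given.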

The performance of these linguistically unsophisticated algorithms is striking. **A typical tagger trained on a million words of English using a standard set of tags will make the correct decision for new sentences of a similar type some 95% of the time.**

Different algorithms may vary in storage efficiency. My crude simulated annealing stochastic tagger (Wallis 2012), which stores transition probabilities exhaustively, is less space-efficient than Eric Brill’s patch tagger (Brill 1992). *However, they obtain similar results.*

The remaining 5% of residual incorrect examples tend to be cases that are idiomatic, or are part of a multi-word string of ambiguous words, or are a result of weaknesses in the training data.

To address these weaknesses we can make a number of improvements.

- **Store a finite set of idioms, strings or compounds.** This is a bit clumsy and *ad hoc*, doesn’t scale well, but can actually improve performance.
- **Add modules to the database and algorithm.** The Brill tagger employs some simple *ad hoc* regular morphology detection at an initial stage. A more thorough approach might consist of a morphological model of ‘lemmatisation’ (identifying word stems and affixes, e.g. *re-educated* → *re–* + *educate* + –*ed*). The advantage of this step is that even if we don’t have the word *re-educated* in our training set we can recognise *educate* as a verb and the entire word as a gerund noun or verb. Generalisation allows us to pool statistics, so we can have more reliable rules, and compress information, so we don’t have to store separate statistics for every single word.
- **Create a more general type of rule.** The rules we have described were tied to particular words, such as *old*. It would be more efficient if we had a rule that said something like ‘for any word capable of being either an adjective or a noun, if it is followed by an adjective or noun, then it is likely to be an adjective.’ *Note that to create such a rule we have to look for it* (this is precisely what the Brill tagger does).

But now let us consider where this path has taken us. Every step we have proposed to improve the performance of this corpus-driven algorithm requires the insertion of knowledge about idioms, morphology and grammar, top-down, into the algorithm.

A methodological corpus-driven purism that stated that we must work exclusively bottom-up was a little disingenuous, because we had to employ auxiliary assumptions (1) to (5) above from the start.

But now every improvement we wish to make requires further theoretical assumptions. It turns out that it is not possible to perform part-of-speech tagging without assumptions, and to improve the algorithm we need more theory.

Finally, whereas the learning algorithm might work bottom-up, the tagging algorithm itself works top-down, in that it applies its knowledge base of word-tag probabilities to new corpus data.

I have the utmost respect for corpus-driven linguists. The discipline of examining data with minimal assumptions is absolutely crucial! All scientists have to examine the data *as it is*, not compartmentalise it according to pre-given assumptions.

Over the years I have written extensively on not taking queries for granted, and directed corpus researchers to continually review the underlying sentences from which their statistics are derived.

However, it is simply not possible to work without *any* assumptions, even when building a bottom-up computer algorithm like a part-of-speech tagger.

So I would conclude that corpus-driven research is properly located as part of a larger research cycle, in which it is valid and reasonable to work bottom-up and top-down at different times. Corpus-driven research methods are part of a family of exploratory methods from which all corpus linguists should draw. Insights from computationally-obtained summary statistics (whether from collocations, *n*-grams, phrase frames, indexes, or databases of part of speech taggers) are important resources for further research.

But insisting that the only legitimate corpus methods are bottom-up prevents us carrying out research with a corpus which asks questions that are inevitably framed by a particular theory.

Brill, E. 1992. A simple rule-based part of speech tagger. In *Proceedings of the third conference on applied natural language processing* (ANLC ’92). Association for Computational Linguistics, Stroudsburg, PA, USA, 152-155.

Sinclair, J., Hanks, P., Fox, G., Moon, R. and Stock, P. and others, 1987 (eds.), Collins *Cobuild English Language Dictionary*, London: Collins.

Sinclair, J., Fox, G., Bullon, S., Krishnamurthy, R., Manning, E., Todd, J. and others, 1990 (eds.) *Collins Cobuild English Grammar*, London: Collins.

Wallis S.A. 2012. *Tagging ICE Phillipines and other corpora*. London: Survey of English Usage. **»** ePublished