Many conventional statistical methods employ **the Normal approximation to the Binomial distribution** (see Binomial → Normal → Wilson), either explicitly or buried in formulae.

The well-known Gaussian population interval (1) is **inverted** to obtain (2).

*Gaussian interval* (*E*⁻, *E*⁺) ≡ *P* ± *z*√[*P*(1 – *P*)/*n*], (1)

where *n* represents the size of the sample, and *z* the two-tailed critical value for the Normal distribution at an error level α, more properly written *z*_{α/2}. The standard deviation of the population proportion *P* is *S* = √[*P*(1 – *P*)/*n*], so we could abbreviate the above to (*E*⁻, *E*⁺) ≡ *P* ± *zS*.

When these methods require us to calculate a confidence interval about an observed proportion, *p*, we must invert the Normal formula using the Wilson score interval formula (Equation (2)).

*Wilson score interval* (*w*⁻, *w*⁺) ≡ [*p* + *z*²/2*n* ± *z*√[*p*(1 – *p*)/*n* + *z*²/4*n*²]] / [1 + *z*²/*n*]. (2)
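For readers who like to check formulae in code, Equation (2) translates directly into a few lines of Python. This is a sketch: the function name and the default critical value *z* = 1.959964 (for α = 0.05) are my own choices.

```python
import math

def wilson_interval(p, n, z=1.959964):
    """Wilson score interval (Equation (2)) for an observed proportion p
    out of n cases, with two-tailed critical value z (default: alpha = 0.05)."""
    # centre of the interval: p shifted towards 0.5 by z^2/2n
    centre = p + z * z / (2 * n)
    # half-width: z times the adjusted standard deviation
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom

lo, hi = wilson_interval(0.5, 100)
print(round(lo, 4), round(hi, 4))  # → 0.4038 0.5962
```

Note how the result is asymmetric about *p* except at *p* = 0.5, and both bounds always stay within [0, 1].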

In a 2013 paper for JQL (Wallis 2013a), I referred to this inversion principle as the ‘interval equality principle’: if (1) is calculated for *p* = *E*⁻ (the Gaussian lower bound of *P*), then the upper bound of the resulting Wilson interval, *w*⁺, will equal *P*. Similarly, for *p* = *E*⁺, the lower bound that results, *w*⁻, will equal *P*.

We might write this relationship as

*p* ≡ GaussianLower(WilsonUpper(*p*)), or

*P* ≡ WilsonLower(GaussianUpper(*P*)), etc. (3)

where we have functions *E*⁻ = GaussianLower(*P*), *w*⁺ = WilsonUpper(*p*), etc.
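We can confirm the interval equality principle numerically. The sketch below (Python; the function names are mine) recovers *P* after a round trip through the two functions:

```python
import math

Z = 1.959964  # two-tailed critical value for alpha = 0.05

def gaussian_upper(P, n, z=Z):
    """E+ from Equation (1)."""
    return P + z * math.sqrt(P * (1 - P) / n)

def wilson_lower(p, n, z=Z):
    """w- from Equation (2)."""
    return ((p + z*z/(2*n) - z * math.sqrt(p*(1-p)/n + z*z/(4*n*n)))
            / (1 + z*z/n))

# Interval equality principle: P == WilsonLower(GaussianUpper(P))
for P in (0.1, 0.3, 0.5, 0.7):
    assert abs(wilson_lower(gaussian_upper(P, 100), 100) - P) < 1e-9
```

The identity holds exactly (up to floating-point error), because the Wilson interval is the algebraic inversion of the Gaussian interval.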

In the paper, I performed a series of computational evaluations of the performance of different interval calculations, following in the footsteps of more notable predecessors. Comparison with the analogous interval calculated directly from the Binomial distribution showed that a continuity-corrected version of the Wilson score interval performed accurately.

Continuity corrections are used because the Binomial distribution is ‘chunky’. See the figure below.

All observed proportions must be whole fractions of *n*, *p* ∈ {0/*n*, 1/*n*, 2/*n*,… *n*/*n*}, and yet the interval calculation we use is based on the Normal interval (1), which is continuous. So, using a method owing to Frank Yates, we add an extra ‘half 1/*n*’ to intervals on either side of *P*.

The most famous example of a continuity correction is the one employed in the standard chi-square formula

*Yates’* χ² = Σ(|*o*_{i,j} – *e*_{i,j}| – 0.5)²/*e*_{i,j} (4)

for all cells at index positions *i*, *j* in a contingency table. This formula is expressed in units of *n* rather than 1, so the correction is simply 0.5.

Strictly speaking, Yates’ formula has a flaw. It should guarantee that if the difference between observed and expected cells, *d* = *o*_{i,j} – *e*_{i,j}, is within ±0.5, the entire term should go to zero. This makes little difference for 2 × 2 tables, but for tables with more than one degree of freedom the following is recommended.

*Yates’* χ² = Σ(DiffCorrect(*o*_{i,j} – *e*_{i,j}, 0.5))²/*e*_{i,j}, (4′)

where DiffCorrect(*d*, *c*) = *d – c* if *d* > *c*, *d + c* if *d < –c*, and 0 otherwise.

χ² is based on the Normal distribution *z* (Wallis 2013b). The standard deviation for a Gaussian population interval about a known or predicted population value *P* (Equation (1)) may be corrected for continuity to obtain Yates’ population interval.

*Yates’ interval* (*E*⁻, *E*⁺) ≡ *P* ± (*z*√[*P*(1 – *P*)/*n*] + 1/2*n*). (5)

It is easy to see the relationship between Equations (5) and (1). Moreover it is straightforward to apply other adjustments to the standard deviation or variance (the variance is simply the square of the standard deviation, so this amounts to the same thing).

The continuity-corrected Wilson score interval formula is not often presented, and when it does appear, it appears in slightly different forms in the literature. However, on the basis of Robert Newcombe’s (1998) paper, I tend to present it as Equation (6). In fact this is simplified, as it is also necessary to employ ‘min’ and ‘max’ constraints to ensure that *w*_{cc}⁻ ∈ [0, *p*] and *w*_{cc}⁺ ∈ [*p*, 1].

*w*_{cc}⁻ = [2*np* + *z*² – {*z*√[*z*² – 1/*n* + 4*np*(1 – *p*) + (4*p* – 2)] + 1}] / [2(*n* + *z*²)], and

*w*_{cc}⁺ = [2*np* + *z*² + {*z*√[*z*² – 1/*n* + 4*np*(1 – *p*) – (4*p* – 2)] + 1}] / [2(*n* + *z*²)]. (6)

Indeed, for the last ten years or so I have been working with this formula. It exists in spreadsheets I give our students. But it has two obvious problems.

First, it is not at all intuitive. How is Equation (6) related to Equation (2)? What is the difference between them? How was Equation (6) even derived?

Second – and this relates to the first problem – it is not decomposable. Which terms represent the continuity correction, and which the interval?

As we shall see, there are circumstances in which we might wish to modify the variance, and thus the width of the interval, but not adjust the correction for continuity.

Consider the **finite population correction** or ‘f.p.c.’. This is typically presented as an adjustment to standard deviation. See this post.

*Finite population correction* ν = √[(*N* – *n*)/(*N* – 1)]. (7)

As the name implies, the finite population correction is applied to an interval or test when a sample is not drawn from an infinite population as the standard model assumes, but from one of a fixed size, *N*. In particular, it is relevant if the sample is a sizeable proportion of the population, say, 5% or more. Clearly if *N* >> *n*, then the finite population correction factor ν tends to 1, and has no effect.

To apply this adjustment to Equations (1) and (5), we multiply the standard deviation term by ν.

*Gaussian interval* (*E*⁻, *E*⁺) ≡ *P* ± *zν*√[*P*(1 – *P*)/*n*], (1′)

and

*Yates’ interval* (*E*⁻, *E*⁺) ≡ *P* ± (*zν*√[*P*(1 – *P*)/*n*] + 1/2*n*). (5′)

By inspecting (1′) we can see that rather than multiply the standard deviation by ν, we could instead adjust the sample size, *n*′ = *n*/ν², and substitute *n*′ for *n* in each equation. This allows us to apply the same adjustment to the uncorrected Wilson score interval, Equation (2).
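The rescaling trick can be checked in code. This sketch (Python; names are mine) applies the f.p.c. to the uncorrected Wilson interval via *n*′ = *n*/ν²:

```python
import math

Z = 1.959964  # two-tailed critical value for alpha = 0.05

def wilson_interval(p, n, z=Z):
    """Uncorrected Wilson score interval, Equation (2)."""
    centre, denom = p + z*z/(2*n), 1 + z*z/n
    spread = z * math.sqrt(p*(1-p)/n + z*z/(4*n*n))
    return (centre - spread) / denom, (centre + spread) / denom

def fpc(N, n):
    """Finite population correction factor, Equation (7)."""
    return math.sqrt((N - n) / (N - 1))

# Apply the f.p.c. by rescaling the sample size: n' = n / nu^2.
N, n, p = 1000, 100, 0.3
nu = fpc(N, n)
n_adj = n / nu ** 2

lo, hi = wilson_interval(p, n_adj)    # corrected interval
lo0, hi0 = wilson_interval(p, n)      # uncorrected interval
assert hi - lo < hi0 - lo0            # sampling 10% of the population narrows the interval
```

As expected, ν < 1 implies *n*′ > *n*, so the corrected interval is narrower: having sampled a sizeable fraction of the population, we can be more certain.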

But we cannot use the same method with Equation (6), the continuity-corrected Wilson interval. To see why, first consider Equation (5). We need to adjust the standard deviation *S,* but not the continuity-correction term, *c* = 1/2*n*.

Why do we not rescale *c*? Answer: because the entire point of a continuity correction is to overcome the ‘chunkiness’ of the actual Binomial distribution. See above. So we should not modify *n* in the formula for *c*. The original source distribution is no less chunky. The interval is narrower because we can be more certain.

To apply this correction to a χ² test, we calculate the test in the normal way and divide the result by ν². This works for the standard test or Yates’ version (Equation (4)).

Our task is therefore to find a formula for (6) that separates out the scale of the standard deviation from the continuity-corrected term.

It turns out that the solution is extremely simple and intuitive. Indeed it is *so* simple and intuitive once you see it that it is rather surprising that papers do not simply give it in this form! (I suspect this says more about a tendency towards mathematical brevity on the one hand, and a tendency for researchers to copy formulae rather than analyse and explain them from first principles on the other.)

**Aside:** The route to a Eureka moment is not always very edifying. In my case, I could have kicked myself! After three days of struggling with algebraic reductions of Equation (6), I read back through Newcombe (1998) and his sources. Blyth and Still (1983) was also not very clear, but at least it reformulates Equations (2) and (6) differently. Then I remembered something. I had plotted Equation (6) when plotting the Wilson distribution. The corrected intervals began at *p* ± 1/2*n*. See the figure below.

Here it is (drum roll please):

Let us use functions to define the interval bounds for the uncorrected interval (Equation (2)),

*w*⁻ = WilsonLower(*p*),

*w*⁺ = WilsonUpper(*p*).

Then

*w*_{cc}⁻ = WilsonLower(*p* – 1/2*n*),

*w*_{cc}⁺ = WilsonUpper(*p* + 1/2*n*). (8)

That was not hard, was it?

This equation solves our problem. The continuity correction is applied to the origin of the interval, *p*, first. Just as with Yates’ formula (4), we can modify the variance in Equation (2) by rescaling *n*, but retain *c* = 1/2*n* without rescaling it.
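We can verify numerically that the shift method of Equation (8) reproduces the direct formula, Equation (6), exactly. A Python sketch (function names are mine):

```python
import math

Z = 1.959964  # two-tailed critical value for alpha = 0.05

def wilson_lower(p, n, z=Z):
    return ((p + z*z/(2*n) - z*math.sqrt(p*(1-p)/n + z*z/(4*n*n)))
            / (1 + z*z/n))

def wilson_upper(p, n, z=Z):
    return ((p + z*z/(2*n) + z*math.sqrt(p*(1-p)/n + z*z/(4*n*n)))
            / (1 + z*z/n))

def cc_lower_direct(p, n, z=Z):
    """Equation (6), lower bound, as presented following Newcombe (1998)."""
    num = 2*n*p + z*z - (z*math.sqrt(z*z - 1/n + 4*n*p*(1-p) + (4*p - 2)) + 1)
    return max(0.0, num / (2*(n + z*z)))

def cc_upper_direct(p, n, z=Z):
    """Equation (6), upper bound."""
    num = 2*n*p + z*z + (z*math.sqrt(z*z - 1/n + 4*n*p*(1-p) - (4*p - 2)) + 1)
    return min(1.0, num / (2*(n + z*z)))

# Equation (8): shift p by half of 1/n, then apply the uncorrected interval.
n = 20
for p in (0.1, 0.25, 0.4):
    assert abs(cc_lower_direct(p, n) - wilson_lower(p - 1/(2*n), n)) < 1e-12
    assert abs(cc_upper_direct(p, n) - wilson_upper(p + 1/(2*n), n)) < 1e-12
```

The agreement is exact: expanding WilsonLower(*p* – 1/2*n*) algebraically recovers Equation (6) term by term.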

Note that when we apply a continuity correction to the population proportion *P*, we calculate the interval on the basis of *P* first and *then* add 1/2*n* second. But when we apply a continuity correction to the observed proportion *p*, we add it to *p* first, and then calculate the interval. This is logical, because the interval equality principle also applies to the continuity-corrected interval.

Sometimes we statisticians make life unnecessarily difficult for ourselves. First, the solution above is hinted at by Blyth, Still and Newcombe, but it is certainly not presented in the way I have done above.

Secondly, it is rare to see a statistical discussion of correcting for continuity and finite population at the same time. Corrections for continuity tend to be forgotten as soon as formulae become more complex or tables gain more dimensions. However, the reasons for correcting for continuity have not suddenly disappeared! The source distribution is still ‘chunky’!

Yet with care and consideration – and some first-principles mathematics – it is possible to apply corrections for continuity and finite population to the same formulae. Other corrections, such as cluster sampling corrections (in corpora, this is usually random text sampling), can also now be applied just as easily.

Given the proven improvements in reducing Type I errors that this adjustment delivers, especially for small samples, we should apply continuity corrections whenever we carry out a significance test. Equation (2) may still be used for plotting purposes, but for comparing proportions we should employ Yates’ 2 × 2 test or the Newcombe-Wilson test with continuity correction (see Wallis 2013a, b).

Blyth, C.R. & H.A. Still. 1983. Binomial Confidence Intervals. *Journal of the American Statistical Association* **78**, 108-116.

Newcombe, R.G. 1998. Two-sided confidence intervals for the single proportion: comparison of seven methods. *Statistics in Medicine* **17**, 857-872.

Wallis, S.A. 2013a. Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods. *Journal of Quantitative Linguistics* **20**:3, 178-208.

Wallis, S.A. 2013b. *z*-squared: the origin and application of χ². *Journal of Quantitative Linguistics* **20**:4, 350-378.

The Summer School is a short three-day intensive course aimed at PhD-level students and researchers who wish to get to grips with Corpus Linguistics.

Please note that this course is very popular. Numbers are deliberately limited, and places are allocated on a first-come, first-served basis! You will be taught in a small group by a teaching team.

Each day begins with a theory lecture, followed by a guided hands-on workshop with corpora, and a more self-directed and supported practical session in the afternoon.

Over the three days, participants will learn about the following:

- the scope of Corpus Linguistics, and how we can use it to study the English Language;
- key issues in Corpus Linguistics methodology;
- how to use corpora to analyse issues in syntax and semantics;
- basic elements of statistics;
- how to navigate large and small corpora, particularly ICE-GB and DCPSE.

At the end of the course, participants will have:

- acquired a basic but solid knowledge of the terminology, concepts and methodologies used in English Corpus Linguistics;
- had practical experience working with two state-of-the-art corpora and a corpus exploration tool (ICECUP);
- gained an understanding of the breadth of Corpus Linguistics and its potential application to projects;
- learned about the fundamental concepts of inferential statistics and their practical application to Corpus Linguistics.

For more information, including costs, booking information, timetable, see the website.

The standard approach to teaching (and thus thinking about) statistics is based on **projecting distributions of ranges of expected values**. The distribution of an expected value is a set of probabilities that predict what the value will be, according to a mathematical model of what you predict should happen.

For the experimentalist, this distribution is **the imaginary distribution of very many repetitions of the same experiment that you may have just undertaken**. It is the output of a mathematical model.

- Note that this idea of a projected distribution is not the same as the term ‘expected distribution’. An expected distribution is a series of values you predict your data should match.
- Thus in what follows we simply compare a single expected value *P* with an observed value *p*. This can be thought of as comparing the expected distribution **E** = {*P*, 1 – *P*} with the observed distribution **O** = {*p*, 1 – *p*}.

Thinking about this projected distribution represents a colossal feat of imagination: it is a projection of what you think would happen if only you had world enough and time to repeat your experiment, again and again. But often you can’t get more data. Perhaps the effort to collect your data was huge, or the data is from a finite set of available data (historical documents, patients with a rare condition, etc.). *Actual* replication may be impossible for material reasons.

In general, distributions of this kind are extremely hard to imagine, because they are not part of our directly-observed experience. See Why is statistics difficult? for more on this. So we already have an uphill task in getting to grips with this kind of reasoning.

**Significant difference** (often shortened to ‘significance’) refers to the difference between your observations (the ‘observed distribution’) and what you expect to see (the expected distribution). But to evaluate whether a numerical difference is significant, we have to take into account both the shape and spread of this projected distribution of expected values.

When you select a statistical test you do two things:

- you choose a mathematical model which projects a distribution of possible values, and
- you choose a way of calculating significant difference.

The problem is that in many cases it is very difficult to imagine this projected distribution, or — which amounts to the same thing — the implications of the statistical model.

When tests are selected, the main criterion you have to consider concerns the **type of data** being analysed (an ‘ordinal scale’, a ‘categorical scale’, a ‘ratio scale’, and so on). But the scale of measurement is only one of several parameters that allow us to predict how random selection might affect the resampling of data.

A mathematical model contains what are usually called **assumptions**, although it might be more accurate to call them ‘preconditions’ or parameters. If these assumptions about your data are incorrect, the test is likely to give an inaccurate result. This principle is not either/or, but can be thought of as a scale of ‘degradation’. *The less the data conforms to these assumptions, the more likely your test is to give the wrong answer.*

This is particularly problematic in some computational applications. The programmer could not imagine the projected distribution, so they tweaked various parameters until the program ‘worked’. In a ‘black-box’ algorithm this might not matter. If it appears to work, who cares if the algorithm is not very principled? Performance might be less than optimal, but it may still produce valuable and interesting results.

**But in science there really should be no such excuse.**

The question I have been asking myself for the last ten years or so is simply *can we do better?* Is there a better way to teach (and think about) statistics than from the perspective of distributions projected by counter-intuitive mathematical models (taken on trust) and significance tests?

One of the simplest statistical models concerns **Binomial distributions**. I find myself writing again and again about this class of distributions (and the mathematical model underpinning them) because they are central to corpus linguistics research, where variables mostly concern categorical decisions.

But even if you are principally concerned with other types of statistical model, bear with me. The argument below may be applied to the Student’s *t* distribution, for example. The differences lie in the formulae for computing intervals. The reasoning process is directly comparable.

The conventional way to think about a Binomial evaluation is as follows.

- Consider the true rate of something, A, in the population out of an outcome or choice {A, B}, represented by a **population proportion** *P*.
  - We could write *P*(A | {A, B}) to make this clearer, but for brevity we will simply use *P*.
  - Note that the true rate *P* could conceivably be 0 or 1, i.e. all cases might be B, or all cases A.
- Use the Binomial function to predict, probabilistically, the **likely distribution** of *P*.
  - This is the projected distribution of *P*.
- Perform a test for **a particular observation**, *p*, that tells us how likely it is that *p* is consistent with that distribution, termed the ‘tail probability’.
  - If *P* > *p* we typically want to know how likely it is that *P* is consistent with any value from 0 to *p*.
  - If *P* < *p*, we work out the chance of randomly picking a value from *p* to 1.

This particular test is called the **Binomial test**.

Below is an example, taken from an earlier blog post, Comparing frequencies within a discrete distribution. This particular evaluation models the Binomial distribution for *P* = 0.5 and *n* = 173 (the amount of data in our sample, termed the **sample size**).

The Binomial distribution (purple hump) is the distribution we would expect to see if we repeatedly tried to sample *P*, i.e. we repeated our experiment *ad infinitum*.

That is what we mean by a ‘projected distribution’. We can’t see it, and we can’t construct it by repeated observation because we have insufficient time!

The height of each column in this distribution is the chance that we might observe any particular frequency, *r* ∈ {0, 1,… *n*}, whenever we perform our experiment. For the maths to work, we assume that every single one of the *n* cases in our sample is randomly and independently sampled from a population of cases whose mean probability is *P*.

The values we would most likely observe are 86 and 87 (173 × 0.5 = 86.5). In this case, *p* cannot be 0.5, even if this is the ‘expected value’ *P*!

However, the chance of either of these values being obtained is pretty small: about 0.06. There is a range of values to either side of *P* where we would expect to see *p* fall. What the pattern shows us is that, say, a value of *r* = 60 or less is very unlikely to have occurred by chance.

The formula for the Binomial function looks like this:

*Binomial distribution B*(*r*) = *nCr* *P*^{r} (1 – *P*)^{n–r}.

This function generates the probability that any given value of *r *will be obtained, given *P* and *n*. For more information on what these terms mean, see Wallis (2013).

Next, we consider our particular observation, which might be expressed as a frequency, *f* = 65 or proportion *p* = *f* / *n* = 0.3757.

Now, the conventional approach to this test is to add up all the columns in the area less than or equal to *p*, i.e. from 0 to 65 (see the box in the figure above). This ‘Binomial tail sum’ area turns out to be 0.000669 to six decimal places. So we can report that there is less than a 0.000669 chance that an observation this far below *P* was obtained due to mere random chance. In other words, we can say that the difference *p* – *P* is significant, at an error level α < 0.05.
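This tail sum is easy to reproduce in a few lines. A Python sketch, where `math.comb` performs the ‘adding up of combinations’:

```python
import math

n, P = 173, 0.5

def binomial(r):
    """B(r): probability of observing exactly r cases out of n, given P."""
    return math.comb(n, r) * P**r * (1 - P)**(n - r)

# The most likely observations, 86 and 87, are individually improbable:
print(round(binomial(86), 2))  # about 0.06

# Binomial tail sum for the observation f = 65:
tail = sum(binomial(r) for r in range(66))  # r = 0 .. 65
print(round(tail, 6))  # about 0.000669
```

With modern arbitrary-precision integers this is no longer ‘computationally arduous’, but the Normal approximation below remains far more convenient for algebraic manipulation.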

Since this calculation is a little time-consuming and computationally arduous to carry out with large values of *n*, for over 200 years researchers have used an approximation credited to Carl Friedrich Gauss, namely to approximate the chunky Binomial distribution to another, smooth distribution, called the Gaussian or ‘Normal’ distribution.

In the graph below, the Gaussian distribution is plotted as a dashed line. As you can see, in this case the difference between the two shapes is almost imperceptible.

But now we can dispense with all that complicated ‘adding up of combinations’ that the Binomial test requires. The Gaussian approximation calculates the standard deviation of the Normal distribution, *S*, using a very simple function. On a probabilistic scale this calculation looks like this.

*S* = √[*P*(1 − *P*)/*n*], (2)

*S* = √(0.25 / 173) = 0.0380.

The Normal distribution is a regular shape that can be specified by two parameters: the mean and the standard deviation. We have mean *P* = 0.5 and standard deviation *S* = 0.0380.

Now we can apply a further trick. To perform the test, we don’t actually need to add up the area to the left of *p*. That’s a lot of work. All we need do is work out what *p* *would need to be* in order for the difference *p* – *P* to be just at the edge between significance and non-significance. At this point, the area under the curve will equal a given threshold probability, α/2 of the total area under the curve, where α represents the acceptable ‘error level’ (e.g. 1 in 20 = 0.05, 1 in 100 = 0.01 and so on). This area is half of α because, as the graph indicates, there will be another similar ‘tail area’ at the other side of the curve.

In simple terms, the area shaded in pink in the graph above is half of 5% of the total area under the curve, or — to put it another way — if the true rate in the population *P* was 0.5, the chance of a random sample obtaining a value of *p* less than the line to the right of that area is 0.05/2 = 0.025. (In our graph we have scaled all values on the horizontal axis by the total frequency, *n*, but this just means we multiply everything on a probability scale by *n*!)

How do we work this out? Well, we use the critical value of the Normal distribution, which we can write as *z*_{α/2} or, less commonly, Φ^{-1}(α/2), where Φ(*x*) is the Normal cumulative probability distribution function. This allows us to compute an interval containing (1 – α) = 95% of the area under the curve, within *z*_{α/2} standard deviations of *P*.

For α = 0.05, this ‘two-tailed’ value is 1.95996. The Normal confidence interval about *P* is then simply the range centred on *P*:

(*P* – *z*_{α/2}.*S* … *P* … *P* + *z*_{α/2}.*S*) = (0.4255, 0.5745).

Since *p* = 0.3757 is outside this range, we can report that *p* is significantly different from *P* (or *p* – *P* is a significant difference, which amounts to the same thing). This is more informative than saying ‘the result is significant’. But crucially, it relies on us pre-identifying a value of *P*, which we cannot obtain from data!
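The whole test can be written out in a few lines (a Python sketch; *z* = 1.959964 is the two-tailed critical value quoted above):

```python
import math

P, n, z = 0.5, 173, 1.959964

# Standard deviation of the Gaussian approximation, Equation (2) above
S = math.sqrt(P * (1 - P) / n)

# The Normal confidence interval about P
lower, upper = P - z * S, P + z * S
print(round(lower, 4), round(upper, 4))  # → 0.4255 0.5745

p = 65 / 173
assert p < lower  # p falls outside the interval: the difference is significant
```

This is the classical test: everything hinges on knowing *P* in advance.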

We have marked this out in the graph above, again, multiplying by *n*.

The conventional approach to statistics focuses on the mathematical model, and the projected distribution. Is there another way?

An alternative way of thinking about statistics is to start from the user’s perspective.

Most of the time we simply do not have a population value *P*, but we always have an observation *p*. In our example we assumed *P* was 0.5 for the purposes of the test — to compare *p* with 0.5. But this is a very limited application of statistics. What if we don’t know what *P* is? We only have observations to go on.

**Conclusion:** Instead of focusing on the projected distribution of a known population value, we should focus instead on *projecting the behaviour of observed values*.

The following graph plots the Wilson score distribution about *p*, using a method I developed in an earlier blog post. That distribution (blue line) may be given a confidence interval (the Wilson score interval) with the pink dot in the centre. We have plotted the equivalent 95% interval as before, so, again, 2.5% of the area under the curve can be found in the tail area ‘triangle’ above the upper bound (vertical line), and 2.5% of the area under the curve is found in the tail area below the lower bound.

The confidence interval for *p* (indicated by the line with the pink dot) is:

95% *Wilson score interval* (*w*⁻, *w*⁺) = (0.3070, 0.4498),

using the Wilson score interval formula (Wilson 1927):

*Wilson score interval* (*w*⁻, *w*⁺) = (*p* + *z*²/2*n* ± *z*√[*p*(1 – *p*)/*n* + *z*²/4*n*²]) / [1 + *z*²/*n*], (3)

where *z* is shorthand for the critical value *z*_{α/2}, abbreviated for reasons of space.

This particular distribution looks very similar to the Normal distribution. However, it is a little squeezed on the left hand side. It is **asymmetric**, with the interval widths being unequal:

*y*⁻ = *p* – *w*⁻ = 0.3757 – 0.3070 = 0.0687, and

*y*⁺ = *w*⁺ – *p* = 0.4498 – 0.3757 = 0.0741.
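These figures can be verified in code (a Python sketch of Equation (3)):

```python
import math

n, z = 173, 1.959964
p = 65 / n  # the observed proportion, 0.3757

centre = p + z*z/(2*n)
spread = z * math.sqrt(p*(1-p)/n + z*z/(4*n*n))
denom = 1 + z*z/n
w_lo, w_hi = (centre - spread)/denom, (centre + spread)/denom
print(f"{w_lo:.4f} {w_hi:.4f}")  # close to the (0.3070, 0.4498) quoted above

# The interval is asymmetric about p: the lower width is smaller
assert (p - w_lo) < (w_hi - p)
```

The asymmetry assertion captures the ‘squeeze’ towards the nearer boundary of the probability scale.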

For more information, see Plotting the Wilson distribution.

What does this interval tell us?

In our sample, we observed *p* = 0.3757 as the proportion *p*(A | {A, B}) = 65/173.

*On the basis of this information alone*, we can predict that the range of the most likely values for *P* in the population from which the sample is drawn is between 0.3070 and 0.4498, if we make this prediction with a 95% level of confidence.

The value *w*⁻ represents the lowest possible value of *P* consistent with the observation *p*: if *P* < *p* but *P* > *w*⁻, we would report that the difference was not significant.

Similarly, the value *w*⁺ is the largest possible value of *P* consistent with *p*.

**Aside:** We can scale the interval to the frequency range (0…173), i.e. approximately (53, 78). However, since we are mostly interested in values of a proportion out of any sample size (a future sample might be twice as large, say), for practical reasons it is better to keep the interval range probabilistic.

Note that we have dispensed with any need to consider the actual population proportion, *P*. We don’t need to know what it is. Instead we view it, through our ‘Wilson telescope’, from the perspective of our observation *p*. The picture is a bit blurry, which is why we have a confidence interval that stretches over some 10% of the probability scale. But we have a reasonable estimate of where *P* is likely to be.

Consider the following thought experiment.

As an adult, you meet up with a bunch of random friends you haven’t seen for several years. Twenty in all, with nothing particular to connect them together.

For the sake of our thought experiment, let us assume this group of friends are twenty random individuals drawn from the population, but if they all went to the same school we might be concerned about whether they only represented a more limited population!

It turns out, as you chat, that** 5 out of 20** had **chicken pox** (varicella) as a child. (Chicken pox is a childhood disease, and few adults get it, so anyone over 20 can be assumed to be immune by that age).

On the basis of this observation alone, *what is the most likely rate of chicken pox in the population?* Can we be 95% confident it is less than half?

To work out the answer, we know two facts: *p* = 5 / 20 = 0.25, and *n* = 20.

Using Equation (3), this gives us

95% *Wilson score interval* (*w*⁻, *w*⁺) = (0.1119, 0.4687),

which excludes 0.5, so **the 95% interval is indeed less than half**. (With a correction for continuity, the interval becomes (0.0959, 0.4941) — still below 0.5).
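To check these figures, here is a Python sketch. The continuity-corrected bounds are obtained by shifting *p* by half of 1/*n* before applying the uncorrected formula, which is equivalent to the standard corrected formula:

```python
import math

z = 1.959964  # two-tailed critical value for alpha = 0.05

def wilson(p, n):
    """Uncorrected Wilson score interval, Equation (3)."""
    centre, denom = p + z*z/(2*n), 1 + z*z/n
    spread = z * math.sqrt(p*(1-p)/n + z*z/(4*n*n))
    return (centre - spread)/denom, (centre + spread)/denom

p, n = 5/20, 20
lo, hi = wilson(p, n)                    # uncorrected interval
cc_lo, _ = wilson(p - 1/(2*n), n)        # continuity-corrected lower bound
_, cc_hi = wilson(p + 1/(2*n), n)        # continuity-corrected upper bound

print(round(lo, 4), round(hi, 4))        # approximately (0.1119, 0.4687)
print(round(cc_lo, 4), round(cc_hi, 4))  # approximately (0.0959, 0.4941)
assert hi < 0.5 and cc_hi < 0.5          # both intervals exclude one half
```

Even with *n* = 20, both intervals exclude 0.5, so the conclusion survives the more conservative corrected test.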

If you think about it, this conclusion is at one and the same time, remarkably powerful — and counter-intuitive.

**How can it be that, with only 20 people to go on, we can be so definite in our conclusions? **

- The answer is that the Wilson interval is derived from the Normal approximation to the Binomial, and the Binomial is itself based on simple counting of the different ways we can obtain a particular frequency combination. See Binomial → Normal → Wilson.

In many ways the idea of an interval about an observation *p* is just as curious as the idea of an interval about *P.* Both are based on the idea that simple randomness leads to a predicted degree of variation when data is resampled.

- Note that we can test this question in other ways. For example, we could use the Normal approximation to the Binomial with *P* = 0.5 to perform *a test*, but this would not give us the range of likely values of *P*.

The Wilson interval on *p* has many more applications than either traditional tests or confidence intervals on *P*. This is simply because, as we noted earlier, most of the time we simply do not know what *P* is.

For example, we can compare Wilson intervals using what I have elsewhere referred to as the **Wilson score interval comparison heuristic**:

For any pair of proportions, *p*₁ and *p*₂, check the following:

- if the intervals for *p*₁ and *p*₂ do not overlap, they **must** be significantly different;
- if one point is inside the interval of the other, they **cannot** be significantly different;
- otherwise, carry out a statistical test to decide whether the result is significantly different.

What this means is that in many cases we don’t need to perform a statistical test to compare them. We can simply ‘eyeball’ the data. We can also use confidence intervals to perform tests, like the Newcombe-Wilson test.

Armed with our new-found mathematical understanding of statistics, we can also ask other, related questions.

For example, we might ask how much data we would need for an observation of *p* = 0.25 to allow us to conclude that *P* < 0.5.

To get the answer, I have plotted the upper and lower bound of the Wilson score interval for *n* as multiples of 4 (our observation concerns whole numbers, remember). For good measure I have included the error level α = 0.01 alongside 0.05. We can clearly see the asymmetry of the interval.

We can see that for α = 0.05, we only need *n* = 16 guests at our get-together to justify a claim that the population value *P* is below 50%, but at α = 0.01, we need 28 guests. (This is proof positive that anyone who demands a smaller error level needs more friends!)

**Does this all mean we should dispense with significance tests altogether and replace them with confidence interval analysis?** This is something that many in the ‘New Statistics’ movement claim. I argue against this because not all tests can be replaced by confidence interval comparisons. For example, the *z* test summarised above can also be carried out using a 2 × 1 χ² test computation. But for *r* > 2, an *r* × 1 χ² test is not the same as a series of 2 × 1 tests.

Dispensing with tests altogether is premature, but a focus on confidence intervals on observed data is a much better way to engage statistically with data than ‘black-box’ tests.

- Plotting the Wilson distribution
- Why is statistics difficult?
- Comparing frequencies within a discrete distribution
- Binomial → Normal → Wilson

Wallis, S.A. 2013. Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods. *Journal of Quantitative Linguistics* **20**:3, 178-208.

Wilson, E.B. 1927. Probable inference, the law of succession, and statistical inference. *Journal of the American Statistical Association* **22**: 209-212.

We have discussed the Wilson score interval at length elsewhere (Wallis 2013a, b). Given an observed Binomial proportion *p* = *f*/*n*, for *f* observed cases out of *n* observations, and a confidence level 1 – α, the interval represents the two-tailed range of values where *P*, the true proportion in the population, is likely to be found. Note that *f* and *n* are integers, so whereas *P* is a probability, *p* is a proper fraction (a rational number).

The interval provides a robust method (Newcombe 1998, Wallis 2013a) for directly estimating confidence intervals on these simple observations. It can take a correction for continuity in circumstances where it is desired to perform a more conservative test and err on the side of caution. We have also shown how it can be employed in logistic regression (Wallis 2015).

The point of this paper is to explore methods for computing Wilson distributions, i.e. the analogue of the Normal distribution for this interval. There are at least two good reasons why we might wish to do this.

The first is to shed light on the performance of the generating function (formula), interval and distribution itself. Plotting an interval means selecting a single error level α, whereas visualising the distribution allows us to see how the function performs over the range of possible values for α, for different values of *p* and *n*.

A second good reason is to counteract the tendency, common in too many presentations of statistics, to present the Gaussian (‘Normal’) distribution as if it were some kind of ‘universal law of data’, a mistaken corollary of the Central Limit Theorem. This is particularly unwise in the case of observations of Binomial proportions, which are strictly bounded at 0 and 1.

As we shall see, the Wilson distribution diverges from the Gaussian most dramatically as it tends towards the boundaries of the probabilistic range, i.e. where the interval approaches 0 or 1. By contrast, the Normal distribution is unbounded, and continues to plus or minus infinity.

The Wilson score interval (Wilson 1927) may be computed with the following formula.

*Wilson score interval* (*w*⁻, *w*⁺) ≡ [*p* + *z*²/2*n* ± *z*√*p*(1 – *p*)/*n* + *z*²/4*n*²] / [1 + *z*²/*n*]. (1)
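For concreteness, formula (1) can be computed directly. A minimal sketch in Python; the function name and the default critical value (two-tailed, α = 0.05) are my own choices, not from the paper:

```python
import math

def wilson_interval(p, n, z=1.959964):
    """Wilson score interval (w-, w+) for an observed proportion p = f/n,
    with two-tailed critical value z (default: alpha = 0.05)."""
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom
```

Plugging the resulting lower bound back into the Gaussian formula recovers *p* exactly, which is the interval equality principle in action.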

Let us first consider cases where *P* is less than *p*. At the lower bound of this interval (*P* = *w*⁻) the upper bound for the Gaussian interval for *P*, *E*⁺, must be equal to *p* (Wallis 2013a).

We can carry out a test for significant difference between *p* and *P* by either

- calculating a Gaussian interval at *P* and testing if *p* is greater than the upper bound, or
- calculating a Wilson interval at *p* and testing if *P* is less than the lower bound.

To consider cases where *P* is greater than *p*, we simply reverse this logic. We test if *p* is smaller than the lower bound of a Gaussian interval for *P*, or *P* is greater than the upper bound of the Wilson interval for *p*. The Gaussian version of the test is called the **single proportion *z* test**. It can also be calculated as a 2 × 1 χ² goodness of fit test.

As *p* tends to 0, we obtain increasingly skewed distributions (Figure 3). The interval cannot be easily approximated by a Normal interval, and the sum of the two distributions is decidedly not Gaussian (‘Normal’).

In Figure 3, note how the mean *p* is no longer the most likely value (mode).

In plotting this distribution pair, the area on either side of *p* is projected to be of equal size, i.e. it treats as a given that the true value *P* is equally likely to be above and below *p*. This is not necessarily true! Indeed we might multiply both distributions by the probability of the prior. But this fact should not cause us to change the plot.

Note how, thanks to the proximity to the boundary at zero, the interval for *w*⁻ becomes increasingly compressed between 0 and *p*, reflected by the increased height of the curve.

The tendency of the distribution to resemble an exponential decline on the less bounded side reaches its limit when *p* = 0 or 1. The ‘squeezed interval’ is uncomputable and simply disappears.

- Introduction
- Plotting the distribution
  - 2.1 Obtaining values of *w*⁻
  - 2.2 Employing a delta approximation
- Example plots
  - 3.1 An initial example
  - 3.2 Properties of the Wilson distributions
  - 3.3 Varying *p*
  - 3.4 Small *n*
- Further perspectives on the distribution
  - 4.1 Percentiles of the Wilson distributions
  - 4.2 The logit Wilson distribution
  - 4.3 Continuity-corrected Wilson distributions
- Conclusions
- References

- Full paper (PDF)
- Spreadsheet (Excel)
- Plotting confidence intervals on graphs
- Binomial → Normal → Wilson
- Logistic regression with Wilson intervals

Newcombe, R.G. 1998. Two-sided confidence intervals for the single proportion: comparison of seven methods. *Statistics in Medicine* **17**: 857-872.

Wallis, S.A. 2013a. Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods. *Journal of Quantitative Linguistics* **20**:3, 178-208. » Post

Wallis, S.A. 2013b. *z*-squared: the origin and application of χ². *Journal of Quantitative Linguistics* **20**:4, 350-378. » Post

Wilson, E.B. 1927. Probable inference, the law of succession, and statistical inference. *Journal of the American Statistical Association* **22**: 209-212.

However, I think it is a good example of why a mathematical approach to statistics (instead of the usual rote-learning of tests) is extremely valuable.

At the time of writing (March 2018) nearly two hundred thousand university staff in the UK are active members of a pension scheme called USS. This scheme draws in income from these members and pays out to pensioners. Every three years the pension is valued, which is not a simple process. The valuation consists of two aspects, both uncertain:

- to value the liabilities of the pension fund, which means the obligations to current pensioners and future pensioners (current active members), and
- to estimate the future asset value of the pension fund when the scheme is obliged to pay out to pensioners.

What happened in 2017 (and happened in the last two valuations) is that the pension fund has been declared to be in deficit, meaning that the liabilities are greater than the assets. However, in all cases this ‘deficit’ is a projection forwards in time. We do not know how long people will actually live, so we don’t know how much it will cost to pay them a pension. And we don’t know what the future values of assets held by the pension fund will be.

In September 2017, the USS pension fund published a table which included two figures using the method of accounting they employed at the time to value the scheme.

- They said **the best estimate** of the outcome was a surplus of £8.3 billion.
- But they said that **the deficit allowing for uncertainty** (‘prudence’) was –£5.1 billion.

Now, if a pension fund is in deficit, it matters a great deal! Someone has to pay to address the deficit. Either the rules of the pension fund must change (so cutting the liabilities) or the assets must be increased (so the employers and/or employees, who pay into the pension fund must pay more). The dispute about the deficit engulfed UK universities in March 2018 with strikes by many tens of thousands of staff, lectures cancelled, etc. But is there really a ‘deficit’, and if so, what does this tell us?

The first additional bit of information we need to know is how the ‘uncertainty’ is modelled. In February 2018 I got a useful bit of information. The ‘deficit’ is the lower bound on a 33% confidence interval (α = 2/3). This is an interval that divides the distribution into thirds by area. One third is below the lower bound, one third above the upper bound, and one third is in the middle. This gives us a picture that looks something like this:

Of course, experimentalist statisticians will never use such an error-prone confidence interval. We wouldn’t touch anything below 95% (α = 0.05)! To make things a bit more confusing, the actuaries talk about this having a ‘67% level of prudence’ meaning that two-thirds of the distribution is above the lower bound. All of this is fine, but it means we must proceed with care to decode the language and avoid making mistakes.

In any case, the distribution of this interval is approximately Normal. The detailed graphs I have seen of USS’s projections are a bit more shaky (which makes them appear a bit more ‘sciency’), but let’s face it, these are projections with a great deal of uncertainty. It is reasonable to employ a Normal approximation and use a ‘Wald’ interval in this case because the interval is pretty much unbounded – the outcome variable could eventually fall over a large range. (Note that we recommend Wilson intervals on probability ranges precisely because probability *p* is bounded by 0 and 1.)

What do we know?

- The **best estimate** is the median of the distribution. In the case of the Normal it is also **the mean**, *v* = 8.3 (billion pounds).
- The **lower bound of the confidence interval**, *v*⁻ = –5.1. This is the quoted deficit figure.
- The **error level** of the Normal distribution, α = 2/3.
- The **critical value** of the Normal distribution, *z*_{α/2} = 0.4307.

We also know that *v*⁻ = *v* – *z*_{α/2}.*s*.

- So we can calculate the **standard deviation** *s* = (*v* – *v*⁻) / *z*_{α/2} = 31.1101.

That’s a standard deviation of £31 billion! No wonder the Normal distribution looks so wide.

This tells us that a big problem with the prediction is the sheer scale of the uncertainty attached to the estimate. It is not necessarily a problem with the pension – after all, even using this valuation method it is odds-on to reach a positive outcome of £8.3 billion.

Now, it turns out that there are lots of problems with the method for valuing the pension scheme. Crucially, the entire exercise is predicated on imagining the ‘old’ UK universities (or a large proportion of them) go bankrupt. I have written about this elsewhere. It is not crucial for our statistics discussion, even if it is a costly problem for staff, employers and students impacted by the industrial action as the argument about who should pay for this type of ‘deficit’ ensues.

Irrespective of the rights and wrongs of *that* argument (and we will return to this in conclusion), this exercise should have convinced you of one thing though – with such a high level of uncertainty about the valuation, pretty much any value can be obtained!

The next thing I did was wonder, what is the break-even (zero) point on the distribution, where valuation *v* = 0? In other words, can we calculate the chance of default occurring according to this model?

This seems to me to be an important operation. Most of all it allows us to meaningfully compare different valuations, which, as we shall see in a minute, is a useful thing to do. The USS Trustees, who manage the scheme, are concerned with one thing – the **risk of default**, *p*(*v* < 0). So it strikes me that we ought to calculate it.

The zero point is the value of α/2 when *v*⁻ = *v* – *z*_{α/2}.*s* = 0.

So we need to know α when *z*_{α/2} = *v*/*s* = 0.2667.

There are various ways to compute this, but I used a poor-man’s Newton-Raphson method in Excel to find α. That is, I input different values of α until the Normal function (‘NORMSINV(1-(α/2))’) obtained a closely-similar value of *z*!

I am sure there is a neater way, but it would obtain essentially the same result. It’s the maths that count!

- In this case, this obtains the **error level** α = 0.79.

This means that there would be an area of 0.21 inside the interval if *v*⁻ = 0. Another way of thinking about this is that, of the half-distribution below the mean, 21% of *that* area is where *v* ≥ 0.

So we can now report that the **probability of default** *p*(*v* < 0) = 0.5 – 0.21/2 = 0.395.
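The arithmetic above can be checked with a few lines of code. A sketch, assuming nothing beyond the published figures; the Normal CDF is built from the standard error function, which avoids the spreadsheet search entirely:

```python
import math

def normal_cdf(x, mean=0.0, sd=1.0):
    # cumulative Normal probability, computed from the error function
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

# September 2017 valuation, figures as quoted in the text (all in £bn)
v = 8.3          # best estimate
v_lower = -5.1   # quoted 'deficit' (lower bound at a 67% level of prudence)
z = 0.4307       # two-tailed critical value for alpha = 2/3

s = (v - v_lower) / z                 # standard deviation, about 31.11
alpha = 2 * (1 - normal_cdf(v / s))   # error level at break-even, about 0.79
p_default = normal_cdf(0, v, s)       # p(v < 0), about 0.395
```

Evaluating the CDF at zero gives *p*(*v* < 0) in one step, agreeing with the 0.5 – 0.21/2 geometry.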

As a result of the valuation in September there was much shaking of heads amongst employers. This level of risk seemed too great to bear. So they reported to their organisation, Universities UK (UUK), that they wanted to see less risk in their model. The first valuation employed a method that was termed gradual ‘de-risking’, meaning that the assets would be moved from a mixed stocks and shares portfolio into investments in government stocks, termed ‘gilts’. The idea is that this is less risky because these gilts are ‘low risk’ compared to stocks and shares.

As a result of this consultation, the scheme actuaries were sent away and they came up with some different figures. These were

- The **best estimate**, *v* = £5.2bn (this figure was not made very public)
- The **quoted deficit**, *v*⁻ = –£7.5bn (this was made *very* public)

Again, the same interval calculation was employed.

I was ‘leaked’ the best estimate. Knowing now how the calculation was made for our first valuation, I employed the same method.

- The **standard deviation** *s* = (*v* – *v*⁻) / *z*_{α/2} = 29.4850.

The graph now looks like this.

So, what is the probability of default, i.e. *p*(*v* < 0)? What has happened to the risk to the Pension Trustees?

We have *z*_{α/2} = *v*/*s* = 0.1764, which obtains α = 0.86.

- **The probability of default**, *p*(*v* < 0), is now 0.43.
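The November figures can be replicated in code by the same method (a sketch using the error-function form of the Normal CDF):

```python
import math

def normal_cdf(x, mean=0.0, sd=1.0):
    # cumulative Normal probability, computed from the error function
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

# November 2017 valuation, figures as quoted in the text (all in £bn)
v = 5.2          # best estimate (the leaked figure)
v_lower = -7.5   # quoted deficit
z = 0.4307       # two-tailed critical value for alpha = 2/3

s = (v - v_lower) / z            # standard deviation, about 29.49
p_default = normal_cdf(0, v, s)  # p(v < 0), about 0.43
```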

So – wait for this – **if the employers engage in what they think is a ‘risk-averse’ modelling approach, they increase the risk of default!** What is going on?

Let’s pause for a moment.

- The risk of default is an estimated risk of the likely outcome of the unravelling of the pension scheme should this prove necessary. It is like predicting the chance of an aeroplane **crashing**.
- But the ‘risk’ of stock-market investments is a different thing entirely. It is **volatility**, short-term variation, that might increase or decrease investments over the short term. To use our aeroplane analogy, it is turbulence. **What the November valuation did was drop the altitude of the plane to avoid turbulence, but it increased the risk of crashing the plane into mountains!**

People are not used to reasoning with probability and risk, and it is easy to conflate different probabilities and different risks. Only a logical and mathematical approach to thinking about probability can rescue you from the kind of error exhibited by the university employers, when, insisting on a ‘lower level of risk’, they managed to increase the risk to the scheme and themselves.

What is quite disturbing about this argument is that I am not a professional actuary, yet I spotted the error immediately. I was not the only one.

You would think that the first thing a competent professional would do on obtaining this new calculation is critique it, wonder why this counter-intuitive outcome had been obtained, and advise those running the scheme accordingly. Yet at the time of writing in March 2018, UUK are still using this November valuation to try to get their way.

So-called ‘de-risking’ increases the only risk that should matter (the risk of ultimate default), and therefore it is neither a competent investment strategy nor a good method for valuing the pension scheme!

Here we don’t have published figures from USS. But we have some information from our previous calculations.

The September valuation was obtained, not by employing no de-risking, but by modelling the effect of replacing stocks and shares with gilts after a 10-year delay. The November valuation was obtained by starting de-risking immediately. Both aim for complete de-risking by the 20-year point. See Figure 3.

We also can safely assume that the cost and yield of gilts is likely to be stable. (Indeed the low ‘long term gilt yield’ is half the problem of valuing live pension schemes like this.)

In the first place we have two valuations, **A** (September) and **B** (November).

These can be depicted like this.

We can now estimate the likely outcome for a new model, **C**, that employs a 20-year delay before total divestment, using some simple maths and the Bienaymé theorem (that independent variances may be summed).

Gilts are predicted to have a more-or-less constant low value. Stocks are predicted to be more volatile, around a given mean growth rate.

If we assume that stocks continue to perform in a similar manner over each five-year period (i.e. that the best estimate and standard deviation of the growth rate is constant) the areas under the curve for **A** and **B** are equivalent to immediate and total divestment at time points 15 and 10 years respectively (dashed vertical lines). This is because we can assume that the exposure of assets to stock market risk is considered to be constant.

Consider **B** first. This employs an immediate de-risking model which has the lowest standard deviation and variance.

- Var(**B**) = *s*₁² = 869.36

In the case of **A**, there is an additional variance term due to the stocks-and-shares uncertainty generated by delay:

- Var(**A**) = *s*₁² + *s*₂² = 967.84

Therefore the additional uncertainty due to delayed de-risking *s*₂² = 967.84 – 869.36 = 98.48.

Assuming investment performance, gilt yields, etc. are constant over time, the area between **A** and **B** is also the same area between **A** and **C**.

- Var(**C**) = *s*₁² + 2*s*₂² = 1,066.32,
- Standard deviation for **C**, *s* = 32.6545.

We obtain the best estimate for **C** by simple addition, so *v* = £11.4 billion.

This gives us a ‘deficit’ of –£2.66 billion and a probability of default of 0.3635.
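The Bienaymé summation above can be verified mechanically. A sketch using the two published standard deviations; the variable names are mine:

```python
import math

def normal_cdf(x, mean=0.0, sd=1.0):
    # cumulative Normal probability, computed from the error function
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

z = 0.4307                     # critical value for alpha = 2/3
var_B = 29.4850 ** 2           # Var(B): immediate de-risking, s1^2
var_A = 31.1101 ** 2           # Var(A): 10-year delay, s1^2 + s2^2
var_step = var_A - var_B       # extra variance per delay step, s2^2, ~98.48

var_C = var_B + 2 * var_step   # Var(C): 20-year delay, ~1,066.32
s_C = math.sqrt(var_C)         # ~32.65
v_C = 5.2 + 2 * (8.3 - 5.2)    # best estimate for C by simple addition: 11.4
deficit_C = v_C - z * s_C      # ~ -2.66 (i.e. a 'deficit' of -£2.66bn)
p_default_C = normal_cdf(0, v_C, s_C)  # ~0.3635
```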

Note that the gradual de-risking model (**A**) is roughly equivalent to delaying de-risking for five years and then selling stock as in model **B**. We can now compute **D** (25-year delay), **E** (30-year delay), and further models employing the same approach.

This obtains the following graphs.

In other words, even if one agreed to de-risk in twenty-five years’ time, the projected deficit would be close to zero, and thereafter, the scheme generates a surplus at this level of prudence.

**Therefore not de-risking at all (performing an ongoing valuation) must obtain a surplus.** The limit of this ‘deficit’ curve exceeds zero.

If long-term gilt yields rise to beat CPI, then the benefits of increased predictability might outweigh the loss in asset performance. But we would need to perform a calculation of the trade-off based on the best evidence available at the time. What is clear is that de-risking punishes the pension scheme for an external factor – low long-term gilt and bond yields – for no good reason.

Another way to see the same result is to plot the probability of default over these different ‘de-risking horizons’. This obtains the following graph of *p*(*v* < 0).

The evidence is therefore that an assessment of the assets and liabilities of the live pension scheme (an ‘ongoing valuation’) must return a net surplus. Indeed, this is what the actuaries *First Actuarial* found by other methods (Salt and Benstead 2017).

Some might object that there were other differences between the November and September valuations, and therefore taking the difference between them is not appropriate. This may be true, but the burden of evidence has shifted. Until actuaries working for UUK and USS are transparent about their assumptions, I would suggest that I have demolished the idea that there could be an ongoing deficit by a straightforward mathematical argument.

The ability to conceptualise probability in a meaningful way is central to any rational argument about statistics and uncertainty. We can see this in the confusion between ‘de-risking’ and real risk, i.e. risk of pension default.

There is one last sting in this particular tale.

In the case of the USS pension, the entire premise of ‘de-risking’ is that a trigger event as financially destabilising as the bankruptcy of the entire pre-92 university sector takes place. This might not mean the total bankruptcy of the sector, but it would require a large number of big institutions to shut down and the remaining institutions to fail to absorb their students, staff and market share.

**Now, the probability of this event is – or should be – effectively zero.** Since *p*(*v* < 0) × 0 = 0, from a logical perspective it does not really matter what the probability of deficit actually is. However, the current UK regulatory environment still presumes that pension funds must be evaluated by ‘managing the risk of default’ (which means in practice modelling by de-risking), even if the probability of the trigger event is zero.

That an evaluation of this kind is even contemplated in the case of USS illustrates what one might call a wilful ignorance of basic mathematics. One of the Big Four accountancy firms has attached their name to various tendentious statements about the USS pension scheme, levels of prudence, etc. It is to their shame that they have done so.

**As we have demonstrated, the probability of scheme default is zero provided that the scheme is not de-risked.** *Actual* de-risking – an act of self-harm of the first order – increases the chance of default, although even in the worst case, immediate de-risking is still odds-on to leave a surplus.

The obvious solution to the current crisis is for the Government to accept that a multi-employer scheme of publicly-funded universities is not subject to the same risks as a single-employer pension fund.

Sector bankruptcy would be a national tragedy that would also constitute the simultaneous collapse of one of the UK’s leading exporting industries (higher education), the eviction of millions of students from their courses and the collapse of the UK independent research sector. It is a political issue of the utmost importance to the UK economy as well as generations of university staff and students.

Cuts in the pension benefits and increases in employer expenditure are pointless and damaging when the ‘deficit’ is so obviously an artefact of the valuation method. The obvious solution is that the Government simply guarantees the security of the pension fund, and permits the Trustees to value the scheme on an ongoing basis.

Sam Marsh uncovers that the trigger reasoning used by the pension fund USS for deciding to ‘de-risk’ (‘Test 1’) contains a colossal error. Even if the Government did not step in, USS itself has no grounds to de-risk. See also Mike Otsuka’s explanation.

However, to predict performance, we might consider the types of structure that a parser is likely to find difficult and then examine a parsed corpus of speech and writing for key statistics.

Variables such as mean sentence length or main clause complexity are often cited as a proxy for parsing difficulty. However, sentence length and complexity are likely to be poor guides in this case. Spoken data is not split into sentences by the speaker; rather, utterance segmentation is a matter of transcriber/annotator choice. In order to improve performance, an annotator might simply increase the number of sentence subdivisions. Complexity ‘per sentence’ is similarly potentially misleading.

In the original *London Lund Corpus* (LLC), spoken data was split by speaker turns, and phonetic tone units were marked. In the case of speeches, speaker turns could be very long compound ‘run-on’ sentences. In practice, when texts were parsed, speaker turns might be split at coordinators or following a sentence adverbial.

In this discussion paper we will use the *British Component of the International Corpus of English* (ICE-GB, Nelson *et al.* 2002) as a test corpus of parsed speech and writing. It is worth noting that both components were parsed together by the same tools and research team.

A very clear difference between speech and writing in ICE-GB is to be found in the degree of **self-correction**. The mean rate of self-correction in ICE-GB spoken data is 3.5% of words (the rate for writing is 0.4%). The spoken genre with the lowest level of self-correction is broadcast news (0.7%). By contrast, student examination scripts have around 5% of words crossed out by writers, followed by social letters and student essays, which have around 0.8% of words marked for removal.

However, self-correction can be addressed at the annotation stage, by removing it from the input to the parser, parsing this simplified sentence, and reintegrating the output with the original corpus string. To identify issues of parsing complexity, we therefore need to consider the sentence minus any self-correction. Are there other factors that may make the input stream more difficult to parse than writing?

Perhaps a more revealing estimate of top level complexity concerns the extent to which, following parsing, these segments, termed ‘parse units’, are not considered grammatically to be clauses. The scattergraph below plots the mean proportion of parse units that are ‘**non clauses**’ rather than clauses on the horizontal axis. The category of ‘non clause’ does not include subjectless or verbless clauses (see below), but may include standalone phrases and pragmatically meaningful utterances (sometimes called ‘clause fragments’). By contrast, the vertical axis shows the mean number of **incomplete** clauses. These are clauses that have been rendered incomplete, for example because the speaker was interrupted. (We have not included confidence intervals because we are interested in the overall scatter.)

- Overall in ICE-GB, **there are twice the proportion of ‘non clause’ parse units in the spoken data** (on average, 29% of parse units are not clauses) as in the written component (14%). Business letters are an outlier, apparently due to the inclusion of full addresses and other formal ephemera. At the upper left of the written distribution, press editorials have the highest number of incomplete clauses while less than one in twenty parse units are considered non clauses.
- Comparing means, **there are over four times the proportion of incomplete clauses in spoken transcripts** compared to written text (2.15% to 0.51%). Means are shown with ‘X’ symbols in the scattergraph.

This scattergraph distinguishes written and spoken data to a much greater extent than, e.g., analysis of small phrases (Aarts *et al.* 2014). This indicates that the challenges in the parsing of speech data lie principally in high-level structure. Getting the top-level analysis correct is the most difficult challenge in any parsing enterprise. The sheer proportion of non clauses in speech, and the relatively high proportion of incomplete clauses, should cause us to be cautious about accepting performance estimates based on the parsing of written data when we are concerned with the parsing of speech.

Spoken data is not necessarily more complex in other aspects. For example, speech data is generally less likely to include subjectless or verbless clauses than writing. The following scattergraph plots the mean probabilities of clauses being **subjectless** (vertical axis) and **verbless** (horizontal axis) for ICE-GB text categories within speech and writing. The highest proportion of verbless clauses in any genre is found in spontaneous commentaries, a spoken genre which encourages concise phrasing, for example:

*England have won four *[*the Soviet Union three*]* with three drawn* _{[S2A-001 #167]}

Compared to writing, a lower proportion of clauses in speech are analysed as compound clauses, but this seems to be an artefact of the sentence segmentation decisions we discussed earlier. In the case of ICE-GB speech data, large coordinated spoken clauses were frequently split at the coordinator, with the coordinator (*and*, *but*, etc) then treated as a connective introducing a new clause. This decision is semantic and stylistic (in writing, termed ‘avoiding run-on sentences’), although it could be argued that in the parsing of ICE-GB, annotators over-compensated.

In objective lexical terms, the spoken data has a slightly greater tendency to exhibit coordinating words. There are 15% more connectives or coordinators per word in ICE-GB spoken data compared to writing, and 4% more subordinating conjunctions.

If ICE-GB spoken utterances were over-zealously subdivided, this tendency has had a greater impact on coordinated clauses than subordinate ones, but it has had an impact on subordination nonetheless. Thus the proportion of ‘dependent’ (subordinate) clauses out of those clauses explicitly marked as either main or dependent in spoken data is actually 85% of the equivalent rate in the written data, despite the greater rate of subordinators.

In summary, the main factor that might make speech harder to parse than writing is that spoken data tends to be more grammatically incomplete than written data. The high proportion of ‘non clauses’, and the greater number of clauses marked as incomplete, both indicate that this is where the principal difficulty lies.

This incompleteness is in addition to self-correction, that is, where speakers correct their own utterances.

Aarts, B. and S.A. Wallis 2014. Noun phrase simplicity in spoken English. In L. Veselovská and M. Janebová (eds.) *Complex Visibles Out There. Proceedings of the Olomouc Linguistics Colloquium 2014: Language Use and Linguistic Structure.* Olomouc: Palacký University, 2014. pp 501-511. » Post

Nelson, G., Wallis, S.A. and Aarts, B. 2002. *Exploring Natural Language: Working with the British Component of the International Corpus of English*. Amsterdam: John Benjamins.


Occasionally it is useful to cite measures in papers other than simple probabilities or differences in probability. When we do, we should estimate confidence intervals on these measures. There are a number of ways of estimating intervals, including bootstrapping and simulation, but these are computationally heavy.

For many measures it is possible to derive intervals from the Wilson score interval by employing a little mathematics. Elsewhere in this blog I discuss how to manipulate the Wilson score interval for simple transformations of *p*, such as 1/*p*, 1 – *p*, etc.

Below I am going to explain how to derive an interval for grammatical diversity, *d*, which we can define as **the probability that two randomly-selected instances have different outcome classes**.

Diversity is an effect size measure of a frequency distribution, i.e. a vector of *k* frequencies. If all frequencies are the same, the data is evenly spread, and the score will tend to a maximum. If all frequencies except one are zero, the chance of picking two different instances will of course be zero. Diversity is well-behaved except where categories have frequencies of 1.

To compute this diversity measure, we sum across the set of outcomes (all functions, all nouns, etc.), **C**:

*diversity d*(*c* ∈ **C**) = ∑ *p*₁(*c*).(1 – *p*₂(*c*)) if *n* > 1; 1 otherwise

where **C** is a set of *k* > 1 disjoint categories, *p*₁(*c*) is the probability that item 1 is category *c* and *p*₂(*c*) is the probability that item 2 is the same category *c*.

We have probabilities

*p*₁(*c*) = *F*(*c*)/*n*, *p*₂(*c*) = (*F*(*c*) – 1)/(*n* – 1) = (*p*₁(*c*).*n* – 1)/(*n* – 1),

where *n* is the total number of instances.

The formula for *p*₂ includes an adjustment for the fact that we already know that the first item is *c*. This principle is used in card-playing statistics. Suppose I draw cards from a pack. If the first card I pick is a heart, I know that there are only 12 other hearts in the pack, so the probability of the next card I pick up being a heart is 12 out of 51, not 13 out of 52.

Note that as the set is closed, ∑*p*₁(*c*) = ∑*p*₂(*c*) = 1.

The maximum score is slightly greater than (*k* – 1) / *k*, except in the special case where *n* approaches *k* and there is a frequency of 1 in any category, in which case diversity can approach 1.

In a paper with Bas Aarts and Jill Bowie (2018), we found that the share of functions of *–ing* clauses (‘gerunds’) appeared to change over time in the *Diachronic Corpus of Present-day Spoken English* (DCPSE).

We obtained the following graph. The bars marked ‘LLC’ refer to data drawn from the period 1956-1972; those marked ‘ICE-GB’ are from 1990-1992.

This graph considers six functions **C** = {CO, CS, OD, SU, A, PC} of the clause. It plots *p*(*c*) over **C**. Considered individually, some functions significantly increase and some decrease their share. Note also that the increases appear to be concentrated in the shorter bars (smaller *p*) and the decreases in the longer ones.

Intuitively this appears to mean that we are seeing *–ing* clauses increase in their diversity of grammatical function over time. We would like to test this proposition.

Here is the LLC data.

| CO | CS | SU | OD | A | PC | Total |
|---:|---:|---:|---:|---:|---:|---:|
| 6 | 33 | 61 | 326 | 610 | 1,203 | 2,239 |

Computing diversity scores, we arrive at

*d*(LLC) = 0.6152 and *d*(ICE-GB) = 0.6443.
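The diversity computation can be sketched directly from the definitions above (the function name is illustrative):

```python
def diversity(freqs):
    """Probability that two randomly selected instances belong to
    different categories (the text defines d = 1 when n <= 1)."""
    n = sum(freqs)
    if n <= 1:
        return 1.0
    # sum p1(c) * (1 - p2(c)) over categories, with the 'card-drawing'
    # adjustment p2(c) = (F(c) - 1) / (n - 1)
    return sum((f / n) * (1 - (f - 1) / (n - 1)) for f in freqs)

llc = [6, 33, 61, 326, 610, 1203]   # CO, CS, SU, OD, A, PC
d_llc = diversity(llc)              # ~0.6152
```

Running this on the LLC frequency vector reproduces the quoted score; an evenly spread two-cell vector scores just above (*k* – 1)/*k* = 0.5, as the text notes.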

We wish to compare these two diversity measures. The first step is to estimate a confidence interval for *d*.

First we compute interval estimates for each term, *d*(*c*) = *p*₁(*c*).(1 – *p*₂(*c*)).

- The Wilson score interval for a probability *p* is (*w*⁻, *w*⁺).

Any monotonic function of *p*, *fn*, can be applied and plotted as a simple transformation. See Reciprocating the Wilson interval. We can write

*fn*(*p*) ∈ (*fn*(*w*⁻), *fn*(*w*⁺)).

However, *d*(*c*) is not monotonic over its entire range: it reaches a maximum where *p* = 0.5. The axiom still holds, conservatively, provided that the function is monotonic across the interval (*w*⁻, *w*⁺), i.e. where 0.5 is not within the interval. The following graph plots *d*(*c*) over *p*(*c*) for a two-cell vector where *n* = 40.

We can rewrite *d*(*c*) in terms of a probability *p* and *n*,

*d*(*p*, *n*) = *p* × (1 – (*p* × *n* – 1) / (*n* – 1)).

This has the interval

*d*(*p*, *n*) ∈ (*d*(*w*⁻, *n*), *d*(*w*⁺, *n*))

provided that *d*(*w*⁺, *n*) < 0.5. To obtain the interval we have simply plugged *w*⁻ and *w*⁺ into the formula for *d*(*p*, *n*) in place of *p*.

Indeed, noting the shape of *d*, we can derive the following.

*d*(*p*, *n*) ∈ (*d*(*w*⁻, *n*), *d*(*w*⁺, *n*)) where *w*⁺ < 0.5,
*d*(*p*, *n*) ∈ (*d*(*w*⁺, *n*), *d*(*w*⁻, *n*)) where *w*⁻ > 0.5,
*d*(*p*, *n*) ∈ (min(*d*(*w*⁻, *n*), *d*(*w*⁺, *n*)), *d*(0.5, *n*)) otherwise.
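A minimal sketch in Python (the function names are mine) implements the Wilson score interval and this three-case rule. The PC cell of the LLC data (*F* = 1,203 of *n* = 2,239) illustrates the swapped case, since both Wilson bounds exceed 0.5:

```python
from math import sqrt

Z = 1.96  # two-tailed critical value for alpha = 0.05

def wilson(p, n, z=Z):
    """Wilson score interval (w-, w+) for an observed proportion p."""
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - spread) / denom, (centre + spread) / denom

def d_term(p, n):
    """Single diversity term d(p, n) = p.(1 - (p.n - 1)/(n - 1))."""
    return p * (1 - (p * n - 1) / (n - 1))

def d_term_interval(p, n, z=Z):
    """Interval for d(p, n), swapping bounds where d is decreasing."""
    wm, wp = wilson(p, n, z)
    lo, hi = d_term(wm, n), d_term(wp, n)
    if wp < 0.5:          # d increasing across the whole interval
        return lo, hi
    if wm > 0.5:          # d decreasing across the interval: swap bounds
        return hi, lo
    # the interval straddles p = 0.5, so the maximum of d lies inside it
    return min(lo, hi), d_term(0.5, n)

print(d_term_interval(1203 / 2239, 2239))  # ~(0.2468, 0.2498)
```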

Next we need to sum these intervals. To do this we need to take account of the number of degrees of freedom of the vector.

Case 1: *df* = 1

If we had two values (as in our graphed example), we would have one degree of freedom. Cell probabilities *p*(1) + *p*(2) = 1, so *p*(2) = 1 – *p*(1).

The relationship above is exactly the same as applies for the Wilson score interval and 2×1 χ² goodness of fit test. Observed variation across *p*(1) **determines** the variation across *p*(2). Suppose *P*(1), the true value for *p*(1), were at an outer limit of *p*(1) (say, *w*⁺(1)). *P*(2) would be at the opposite outer limit of *p*(2) (*w*⁻(2)).

This means we should simply sum the transformed Wilson scores:

*d*(*c* ∈ **C**) ∈ (∑*d*(*w*⁻(*c*), *n*), ∑*d*(*w*⁺(*c*), *n*)).

We apply simple summation where intervals are strictly dependent on each other. We can obtain relative bounds of the dependent sum as:

*l*(dep) = *d* – ∑*d*(*w*⁻(*c*), *n*), and
*u*(dep) = ∑*d*(*w*⁺(*c*), *n*) – *d*.

However, in our example we have more than one degree of freedom, and this method is too conservative.

Case 2: *df* > 1

Where probabilities are independent, some can increase while others decrease. The chance that two independent probabilities both fall outside their intervals at a 5% error level is 0.05², so we cannot simply add intervals together. The method of independent summation is to sum Pythagorean interval widths:

*l*(ind) = √∑[*d*(*p*(*c*), *n*) – *d*(*w*⁻(*c*), *n*)]², and
*u*(ind) = √∑[*d*(*p*(*c*), *n*) – *d*(*w*⁺(*c*), *n*)]².

However, in our case, we have what we might term semi-independent probabilities, with the level of independence determined by the number of degrees of freedom. We have *df* = *k* – 1 independent differences, so we can interpolate between the two methods in proportion to the number of cells.

*l* = (*l*(ind) × (*k* – 2) + 2*l*(dep)) / *k*, and
*u* = (*u*(ind) × (*k* – 2) + 2*u*(dep)) / *k*, so
*d*(*c* ∈ **C**) ∈ (*d* – *l*, *d* + *u*).

Note that *l* = *l*(dep) where *k* = 2.

To see how this works, let’s return to our example. The following is drawn from the LLC data (the first, blue bar in the graph), at an error level α = 0.05. Note that one of our cells (PC) has *p*₁ > 0.5 and *w*₁⁻ > 0.5, so we must swap the interval for this cell.

| function | CO | CS | SU | OD | A | PC |
|---|---|---|---|---|---|---|
| *p*₁ | 0.0027 | 0.0147 | 0.0272 | 0.1456 | 0.2724 | 0.5373 |
| *w*₁⁻ | 0.0012 | 0.0105 | 0.0213 | 0.1316 | 0.2544 | 0.5166 |
| *w*₁⁺ | 0.0058 | 0.0206 | 0.0348 | 0.1608 | 0.2913 | 0.5579 |

Next, to compute the confidence interval CI(*d*) = (*d* – *l*, *d* + *u*), we obtain the same data for *p*₂ and then carry out the computation.

- *l*(dep) = *d* – ∑*d*(*w*⁻(*c*), *n*) = 0.6152 – 0.5833 = 0.0319,
- *u*(dep) = ∑*d*(*w*⁺(*c*), *n*) – *d* = 0.6510 – 0.6152 = 0.0358,
- *l*(ind) = √∑[*d*(*p*(*c*), *n*) – *d*(*w*⁻(*c*), *n*)]² = 0.0152,
- *u*(ind) = √∑[*d*(*p*(*c*), *n*) – *d*(*w*⁺(*c*), *n*)]² = 0.0165.

This obtains an interval of (0.5945, 0.6382).
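The whole computation can be sketched in Python (the function names are mine); it reproduces the figures above to within final-digit rounding:

```python
from math import sqrt

Z = 1.96  # two-tailed critical value for alpha = 0.05

def wilson(p, n, z=Z):
    """Wilson score interval (w-, w+) for an observed proportion p."""
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - spread) / denom, (centre + spread) / denom

def d_term(p, n):
    """Single diversity term d(p, n) = p.(1 - (p.n - 1)/(n - 1))."""
    return p * (1 - (p * n - 1) / (n - 1))

def d_term_interval(p, n, z=Z):
    """Interval for d(p, n), swapping bounds where d is decreasing."""
    wm, wp = wilson(p, n, z)
    lo, hi = d_term(wm, n), d_term(wp, n)
    if wp < 0.5:
        return lo, hi
    if wm > 0.5:
        return hi, lo
    return min(lo, hi), d_term(0.5, n)

def diversity_interval(freqs, z=Z):
    """Diversity d with interval (d - l, d + u), interpolating between
    dependent and independent summation in proportion to k."""
    n, k = sum(freqs), len(freqs)
    terms = [d_term(F / n, n) for F in freqs]
    bounds = [d_term_interval(F / n, n, z) for F in freqs]
    d = sum(terms)
    l_dep = d - sum(lo for lo, _ in bounds)
    u_dep = sum(hi for _, hi in bounds) - d
    l_ind = sqrt(sum((t - lo) ** 2 for t, (lo, _) in zip(terms, bounds)))
    u_ind = sqrt(sum((hi - t) ** 2 for t, (_, hi) in zip(terms, bounds)))
    l = (l_ind * (k - 2) + 2 * l_dep) / k
    u = (u_ind * (k - 2) + 2 * u_dep) / k
    return d, d - l, d + u

llc = [6, 33, 61, 326, 610, 1203]  # CO, CS, SU, OD, A, PC
d, lo, hi = diversity_interval(llc)
print(round(d, 4), round(lo, 4), round(hi, 4))
```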

We can quote diversity with absolute intervals (*d* – *l*, *d* + *u*):

*d*(LLC) = 0.6152 (0.5945, 0.6382), and
*d*(ICE-GB) = 0.6443 (0.6248, 0.6655).

In the Newcombe-Wilson test, we compare the difference between two Binomial observations *p*₁ and *p*₂ with the Pythagorean distance of the Wilson interval widths *y*₁⁺ = *w*₁⁺ – *p*₁, etc:

–√((*y*₁⁺)² + (*y*₂⁻)²) < (*p*₁ – *p*₂) < √((*y*₁⁻)² + (*y*₂⁺)²).

If the equation above is true, the result is not significant (the difference falls within the confidence interval).

This method operates on the assumption that the observations are independent and the intervals are approximately Normal. In our case the difference in diversity is -0.0291, and the bounds are (-0.0301, +0.0297).

Since the difference falls inside those bounds – just – we can report that the difference is not significant.
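Using the rounded figures quoted above, this comparison can be sketched as follows (the recomputed bounds differ from those in the text only in the final digit, because they start from rounded inputs):

```python
from math import sqrt

# Diversity scores with (lower, upper) interval bounds, as quoted above.
d1, lo1, hi1 = 0.6152, 0.5945, 0.6382   # LLC
d2, lo2, hi2 = 0.6443, 0.6248, 0.6655   # ICE-GB

diff = d1 - d2                                     # -0.0291
lower = -sqrt((hi1 - d1) ** 2 + (d2 - lo2) ** 2)   # Newcombe-style lower bound
upper = sqrt((d1 - lo1) ** 2 + (hi2 - d2) ** 2)    # Newcombe-style upper bound

# The difference is not significant if it falls inside (lower, upper).
print('significant' if not (lower < diff < upper) else 'not significant')
```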

In many scientific disciplines, such as medicine, papers that include graphs or cite figures without confidence intervals are considered incomplete and are likely to be rejected by journals. However, whereas the Wilson interval performs admirably for simple Binomial proportions, computing confidence intervals for more complex measures typically requires more elaborate calculation.

We defined a diversity measure and derived a confidence interval for it. Although probabilistic (diversity is indeed a probability), it is not a *Binomial* probability. For one thing, it has a maximum below 1, slightly in excess of (*k* – 1)/*k*. For another, it is computed as the sum of the product of two sets of related probabilities.

In order to derive this interval we made the assumption of monotonicity, i.e. that the function *d* tends to increase along its range, or decrease along its range. However, *d* is decidedly **not** monotonic *–* it increases as *p* tends to 0.5 but falls thereafter. We employed the weaker assumption that it is monotonic within the confidence interval, or – in the case where the interval includes a change in direction – that it cannot exceed the global maximum. This has a conservative consequence: it makes the evaluation weaker than it would otherwise be.

We computed an interval by interpolating between dependent and independent estimates of variance, noting that the vector has *k* – 1 degrees of freedom. This is not the most accurate method (and I intend to return to this question in later posts), but it is sufficient for us to derive an interval, and, by employing Newcombe’s method, a test of significant difference.

Like Cramér’s φ, diversity condenses an array with *k* – 1 degrees of freedom into a variable with a single degree of freedom. Swapping data between the smallest and largest columns would obtain exactly the same diversity score.

Testing for significant difference in diversity, therefore, is not the same as carrying out a *k* × 2 chi-square test. Such a test could be significant even when diversity scores are not significantly different. Our new diversity difference test is more conservative, and significant results may be more worthy of comment.

Aarts, B., Wallis, S.A., and Bowie, J. (2018). *–Ing clauses in spoken English: structure, usage and recent change*. In Seoane, E., C. Acuña-Fariña, & I. Palacios-Martínez (eds.) *Subordination in English. Synchronic and Diachronic Perspectives*. Topics in English Linguistics (TiEL) 101. Berlin: De Gruyter. 129-154.

- Diversity interval example (Excel)
- Is “grammatical diversity” a useful concept?
- Interval arithmetic ‘cheat sheet’ (PDF)
- Reciprocating the Wilson interval
- Goodness of fit measures for discrete categorical data
- Measures of association for contingency tables


Let’s think about what you experienced. The car crash might involve a number of variables an investigator would be interested in.

- How fast was the car going? Where were the brakes applied?
- Look on the road. Get out a tape measure. How long was the skid before the car finally stopped?
- How big and heavy was the car? How loud was the bang when the car crashed?

These are all **physical variables**. We are used to thinking about the world in terms of these kinds of variables: velocity, position, length, volume and mass. They are tangible: we can see and touch them, and we have physical equipment that helps us measure them.

To this list we might add variables we can’t see, such as how loud the bang was. We might not be able to see it, but we can appreciate that loudness is a variable that ranges from very quiet to extremely loud indeed! With a decibel meter we might get an accurate reading, but you were not expecting a crash, and if you are trying to explain how loud something was to the Police from memory, the best you might be able to do is a rough-and-ready assessment.

We are also used to thinking about some other variables that might be relevant to our car crash investigation. If we were investigating on behalf of the insurance company, we might want answers to some questions about slightly less tangible variables. What was the value of the car before the accident? How wealthy is the driver? How dangerous is that stretch of road?

We are used to thinking about the world in terms of physical variables but we are also brought up in a social world of economic value: the value of the car, the wealth of the driver. These **social variables** are a bit more ‘slippery’ than the physical variables. ‘Value’ can be highly subjective: the car might have been vintage, and different buyers might place a different value on it. The buyer, being canny, might resell it for a higher value. Nonetheless, everyone brought up in a world of trade and capital understands the idea that a car can be sold and, in the process, a price attached to it. Likewise, ‘wealth’ might be measured in different ways, or in different currencies. So although monetary attributes are not physical variables, we are comfortable with the idea that they are tangible to us.

But what about that last variable? I asked, *how dangerous is that stretch of road?*

This variable is a risk value. It is a **probability**. We can rephrase my question as “what is the probability that for every car that comes down the road, it crashes?” If we can measure this in some way, and make repeat measurements elsewhere, we could make comparisons. Perhaps we have discovered an accident ‘black spot’: somewhere where there is a greater chance of a road accident than at other locations.

**But a probability cannot be calculated on the strength of a single accident.** It can only be measured by a different, more patient, process of observation. We have to observe *many* cars driving down the road, count the ones that crash, and build up a set of observations. Probability is not a tangible variable, and it takes an effort of imagination to think about.
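To make this concrete, here is a toy simulation in Python (the 3% risk figure is entirely invented): a single observation tells us almost nothing, but a large sample of observations pins the probability down.

```python
import random

random.seed(1)          # fixed seed so the sketch is repeatable
P_CRASH = 0.03          # hypothetical 'true' risk for this stretch of road

def observe(cars):
    """Watch `cars` cars drive past; return the observed crash proportion."""
    crashes = sum(random.random() < P_CRASH for _ in range(cars))
    return crashes / cars

print(observe(1))       # a single car: either 0.0 or 1.0
print(observe(10000))   # many cars: close to the 'true' value 0.03
```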

**I argue that the first thing that makes the subject of statistics difficult, compared to, say, engineering, is that even the most elementary variable we use, observed probability, is not physically tangible.**

Let us think about our car crash for a minute. I said that you have never been on this road before. You have no data on the probability of a crash on that road. But it would be very easy to assume from the simple fact that you saw a crash that, if the road surface seemed poor, or it was raining, these facts contributed to the accident and made it more likely. But you have only one data point to draw from. This kind of inference is not valid. It is an over-extrapolation. It is little more than a guess.

Our natural instinct is to form explanations in our mind, hypotheses, and to look for patterns and causes in the world. (Part of our training as scientists is to be suspicious of that inclination. Of course we *might* be right, but we have to be relentlessly careful and self-critical before we can conclude that we are right.)

If we wanted to make a case that this location is an accident black spot, we would need to set up equipment and monitor the road for accidents. We would need to continue to observe the road over a prolonged period of time to get the data we needed. This is called a **natural experiment**, where we don’t attempt to interfere with the conditions of the road but simply observe driver behaviour and car crashes.

Alternatively, we might **conduct an actual experiment** and drive various cars down the road to see how they handled. Either way, we would need to observe many cars going past before we could make a realistic estimate of the chance of a crash.

If probability is difficult to observe directly, this has an effect on our ability to think about it. Probability is more difficult to conceive of in the way we conceive of length, say. We all vary in our spatial reasoning abilities, but we experience reinforcement learning from daily observations, tape measures and practice. As we have seen, probability is much more elusive because it is only observed from many observations. This makes it difficult to reliably estimate probability in advance, or to reason with probabilities.

Even experienced researchers make mistakes. The psychologists Tversky and Kahneman (1971) reported the findings from a questionnaire they gave to professional psychologists. The questions concerned the decisions they would make in research based on statements about probability. They showed that not only were their expert subjects unreliable, but their responses revealed persistent biases in human cognition, including the one we mentioned earlier – a belief in the reliability of their own observations, even when they had few observations on which to base their conclusions.

So, if you are struggling with statistical concepts, **don’t worry**. You are not alone. Indeed, I have come to the conclusion that *it is necessary to struggle with probability*. We have all been there, and one of my main criticisms of traditional statistics teaching is that most treatments skate over the core concepts and go straight to statistical testing methods that the experimenter, with no conceptual grounding (never mind mathematical underpinnings), simply takes on faith.

Probability is difficult to observe. It is an abstract mathematical concept that can only be measured indirectly, from many observations. And simple observed probability is just the beginning. In discussing inferential statistics I try to keep to three notions of probability and a simple labelling system: observed probability, for which I will use the label lower-case *p*, the ‘true’ population probability, capital *P*, and a third type, the probability that our observed probability is reliable, which we denote with α. Many people make mistakes reasoning about that last little variable. But we are getting ahead of ourselves.

The best way to get to grips with probability is to replace my thought experiment with a physical one.

But: **safety first!** Please don’t crash an actual car — use a Scalextric instead!

Tversky, A., and Kahneman, D. 1971. Belief in the law of small numbers. *Psychological Bulletin* **76**:2, 105-110. **»** ePublished

I have been recently reviewing and rewriting a paper for publication that I first wrote back in 2011. The paper (Wallis 2018) concerns the problem of how we test whether repeated runs of the same experiment obtain essentially the same results, i.e. results are not significantly different from each other.

These meta-tests can be used to test an experiment for replication: if you repeat an experiment and obtain significantly different results on the first repetition, then, with a 1% error level, you can say there is a 99% chance that the experiment is not replicable.

These tests have other applications. You might be wishing to compare your results with those of others in the literature, compare results with different operationalisation (definitions of variables), or just compare results obtained with different data – such as comparing a grammatical distribution observed in speech with that found within writing.

The design of tests for this purpose is addressed within the *t*-testing ANOVA community, where tests are applied to continuously-valued variables. The solution concerns a particular version of an ANOVA, called “the test for interaction in a factorial analysis of variance” (Sheskin 1997: 489).

However, anyone using data expressed as discrete alternatives (A, B, C etc) has a problem: the classical literature does not explain what you should do.

The rewrite of the paper caused me to distinguish between two types of tests: ‘point tests’, which I describe below, and ‘gradient tests’.

These tests can be used to compare results drawn from 2 × 2 or *r* × *c* χ² tests for homogeneity (also known as tests for independence). This is the most common type of contingency test, which can be computed using Fisher’s exact method or as a Newcombe-Wilson difference interval.

- A **gradient test** (B) evaluates whether the *gradient* or difference between point 1 and point 2 differs between runs of an experiment, *d* = *p*₁ – *p*₂. This concerns whether claims about the rate of change, or size of effect, observed are replicable. Gradient tests can be extended, with increasing degrees of freedom, into tests comparing *patterns* of effect.
- A **point test** (A) simply asks whether data at either point, evaluated separately, differs between experimental runs. This concerns whether single observations, such as *p*₁, are replicable. Point tests can be extended into ‘multi-point’ tests, which we discuss below.

Point tests only apply to homogeneity data. If you wish to compare outcomes from goodness of fit tests, you need a version of the gradient test, to compare differences from an expected *P*, *d* = *p*₁ – *P*. Since different data sets may have different expected *P*, a distinct ‘point test for goodness of fit’ would be meaningless.

The earlier version of the paper, which has been published on this blog since its launch in 2012, focused on gradient tests. The possibility of carrying out a point test was mentioned in passing. In this blog post I want to focus on point tests.

The obvious problem with gradient tests is that two experimental runs might obtain the same gradient but in fact be very different in start and end points. Consider the following graph.

The data in Figure 1 is calculated from two 2 × 2 tables drawn from a paper by Aarts, Close and Wallis (2013).

**Note:** To obtain Figure 2, I simply replaced one frequency in the first table: 46 with 100. The data is also found on the 2×2 homogeneity tab in this Excel spreadsheet, which contains a wide range of separability tests.

To make our exposition clearer, Table 1 uses the same format as in the Excel spreadsheet (with the dependent variable distributed vertically) rather than the format in the paper.

| spoken | LLC (1960s) | ICE-GB (1990s) | Total |
|---|---|---|---|
| shall | 124 | 46 | 170 |
| will | 501 | 544 | 1,045 |
| Total | 625 | 590 | 1,215 |

| written | LOB (1960s) | FLOB (1990s) | Total |
|---|---|---|---|
| shall | 355 | 200 | 555 |
| will | 2,798 | 2,723 | 5,521 |
| Total | 3,153 | 2,923 | 6,076 |

Aarts *et al*. carried out 2 × 2 homogeneity tests for the two tables separately. These test whether modal *shall* declines as a proportion of the modal *shall/will* alternation between the two time points. In other words, we compare LLC with ICE-GB data, and LOB with FLOB data.

To carry out a point test we simply rotate the test 90 degrees, e.g. to compare data at the 1960s point we compare LLC with LOB.

As I have explained elsewhere (Wallis 2013), there are a number of different methods for carrying out this comparison.

These include:

- The *z* test for two independent proportions (Sheskin 1997: 226).
- The Newcombe-Wilson interval test (Newcombe 1998).
- The 2 × 2 χ² test for homogeneity (independence).
These are all standard tests and each is discussed in papers and elsewhere on this blog.

The advantage of the third approach is that it is extensible to *c*-way multinomial observations by using a 2 × *c* χ² test.

The tests listed above can be used to compare the 1960s and 1990s intervals in Figure 1 separately.

However, in many cases it would be helpful to have a method that evaluated both pairs of observations in a single test. This can be generalised to a series of *r* observations. To do this, in (Wallis 2018) I propose what I call a multi-point test.

We generalise the χ² formula by summing over *i* = 1..*r*:

- χ_{d}² = ∑χ²(*i*)

where χ²(*i*) represents the χ² score for homogeneity for each set of data at position *i* in the distribution.

This test has *r* × df(*i*) degrees of freedom, where df(*i*) is the degrees of freedom for each χ² point test. So, in the worked example we have seen, the summed test has two degrees of freedom:

| spoken | LLC (1960s) | ICE-GB (1990s) | Total |
|---|---|---|---|
| shall | 124 | 46 | 170 |
| will | 501 | 544 | 1,045 |
| Total | 625 | 590 | 1,215 |

| written | LOB (1960s) | FLOB (1990s) | Total |
|---|---|---|---|
| shall | 355 | 200 | 555 |
| will | 2,798 | 2,723 | 5,521 |
| Total | 3,153 | 2,923 | 6,076 |

|  | 1960s | 1990s | Sum |
|---|---|---|---|
| χ² | 34.6906 | 0.6865 | 35.3772 |

Since the computation sums independently-calculated χ² scores, each score may be individually considered for significant difference (with df(*i*) degrees of freedom). Hence we can see above the large score for the 1960s data (individually significant) and the small score for 1990s (individually non-significant).
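A sketch of this summed point-test computation in Python (the function name is mine), using the tables above:

```python
def chisq_2x2(a, b, c, d):
    """Pearson chi-square for a 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Point tests rotate the comparison: spoken vs. written at each time point.
# 1960s: shall/will in LLC (124/501) vs. LOB (355/2,798).
chi_1960s = chisq_2x2(124, 355, 501, 2798)
# 1990s: shall/will in ICE-GB (46/544) vs. FLOB (200/2,723).
chi_1990s = chisq_2x2(46, 200, 544, 2723)

print(round(chi_1960s, 4))              # 34.6906
print(round(chi_1990s, 4))              # 0.6865
print(round(chi_1960s + chi_1990s, 4))  # 35.3772, with 2 degrees of freedom
```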

**Note:** Whereas χ² is generally associative (non-directional), the summed equation (χ_{d}²) is not. Nor is this computation the same as a three-dimensional test.

- The multi-point test factors out variation between tests over the independent variable (in this instance: time). This means that if there is a lot more data in one table at a particular time period, this fact does not skew the results.
- On the other hand, it does not factor out variation over the dependent variable – after all, this is precisely what we wish to examine!

Naturally, like the point test, this test may be generalised to multinomial observations.

An alternative multi-point test for binomial (two-way) variables employs a sum of χ² values abstracted from Newcombe-Wilson tests.

- Carry out Newcombe-Wilson tests for each point test *i* at a given error level α, obtaining *D*_{i}, *W*_{i}⁻ and *W*_{i}⁺.
- Identify the inner interval width *W*_{i} for each test: if *D*_{i} < 0, *W*_{i} = *W*_{i}⁻; otherwise *W*_{i} = *W*_{i}⁺.
- Use the difference *D*_{i} and inner interval *W*_{i} to compute χ² scores: χ²(*i*) = (*D*_{i}.*z*_{α/2}/*W*_{i})².

It is then possible to sum χ²(*i*) as before.

Using the data in the worked example we obtain:

**1960s:** *D*_{i} = 0.0858; **1990s:** *D*_{i} = 0.0095.

Since *D _{i}* is positive in both cases, we use the upper interval width each time. This gives us χ² scores of 28.4076 and 1.3769 respectively, which obtains a sum of 29.78. Compared to the first method above, this approach tends to downplay extreme differences.

The point test, and its additive generalisation into a ‘multi-point test’, represent a method for contrasting multiple runs of the same experiment, comparing observed changes in different subcorpora or genres, or examining the empirical effect of changing definitions of variables.

These tests consider the null hypothesis that **individual observations** are not different; or, in the multi-point case, that **in general** the observations are not different.

- They do not evaluate the gradient between points or the size of effect. If we wish to compare **sizes of effect** we would need to use one of the methods for this purpose described in (Wallis forthcoming).
- The method only applies to comparing tests for homogeneity (independence). To compare **goodness of fit** data, a different approach is required (also described in Wallis forthcoming).

Nonetheless, these tests are useful meta-tests that build on classical Pearson χ² tests, and they are useful tools in our analytical armoury.

Sheskin, D.J. 1997. *Handbook of Parametric and Nonparametric Statistical Procedures*. Boca Raton, Fl: CRC Press.

Newcombe, R.G. 1998. Interval estimation for the difference between independent proportions: comparison of eleven methods. *Statistics in Medicine* **17**: 873-890.

Wallis, S.A. 2013. *z*-squared: the origin and application of χ². *Journal of Quantitative Linguistics* **20**:4, 350-378. » Post

Wallis, S.A. 2018. Comparing χ^{2} tables for separability of distribution and effect. *Journal of Quantitative Linguistics*. DOI: 10.1080/09296174.2018.1496537 » Post

I have previously argued (Wallis 2014) that interaction evidence is the most fruitful type of corpus linguistics evidence for grammatical research (and doubtless for many other areas of linguistics).

Frequency evidence, which we can write as *p*(*x*), the probability of *x* occurring, concerns itself simply with the overall distribution of a linguistic phenomenon *x* – such as whether informal written English has a higher proportion of interrogative clauses than formal written English. In order to calculate frequency evidence we must define *x*, i.e. decide how to identify interrogative clauses. We must also pick an appropriate baseline *n* for this evaluation, i.e. we need to decide whether to use words, clauses, or any other structure to identify locations where an interrogative clause may occur.

**Interaction evidence** is different. It is a statistical correlation between a decision that a writer or speaker makes at one part of a text, which we will label point *A*, and a decision at another part, point *B*. The idea is shown schematically in Figure 1. *A* and *B* are separate ‘decision points’ in a given relationship (e.g. lexical adjacency), which can be also considered as ‘variables’.

This class of evidence is used in a wide range of computational algorithms. These include collocation methods, part-of-speech taggers, and probabilistic parsers. Despite the promise of interaction evidence, the majority of corpus studies tend to consist of discussions of frequency differences and distributions.
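For illustration (the frequencies below are entirely invented), the association between two binary decision points can be scored with φ and χ² over a 2 × 2 contingency table:

```python
from math import sqrt

# Hypothetical contingency table of decisions at points A and B:
#                 B chosen   B not chosen
# A chosen           30           10
# A not chosen       20           40
a, b, c, d = 30, 10, 20, 40
n = a + b + c + d

# phi measures the strength of association; chi-square its significance.
phi = (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))
chisq = n * phi ** 2     # for a 2x2 table, chi-square = N.phi^2

print(round(phi, 3), round(chisq, 2))  # 0.408 16.67
```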

In this paper I want to look at interactions between decisions which are made more-or-less at the same time by the same speaker/writer. In such circumstances we cannot be sure that just because *B* follows *A* in the text, the decision relating to *B* was made after the decision at *A*.

For example, in studying the premodification of noun phrases by attributive adjectives in English – which adjective is applied first in assembling an NP like *the old tall green ship*, for instance – **we cannot be sure that adjectives are selected by the speaker in sentence order**. It is also perfectly plausible that adjectives were chosen in an alternative or parallel order in the mind of the speaker, and then assembled in the final order during the language production process.

Of course, in cases where points *A* and *B* are separated substantively in time (as in many instances of structural self-priming) or where *B* is spoken in response to *A* by another speaker (structural priming of another’s language), there is unlikely to be any ambiguity about decision order. Moreover, if *A* licenses *B*, then the order is unambiguous.

However, in circumstances where *A* and *B* are proximal, and where the order of decisions made by the speaker/writer cannot be presumed, we wish to consider whether there are mathematical or statistical methods for predicting the most likely order decisions were made.

Such a method would have considerable value in experimental design in cognitive corpus linguistics. For example, since Heads of NPs, VPs, etc. are conceived of as determining their complements, it may not be too much of a stretch to argue that if this method works, we may have found a way of empirically evaluating this grammatical concept.

- Introduction
- A collocation example
  - 2.1 Employing chi-square and phi
  - 2.2 Directional statistics
  - 2.3 Significantly directional?
- A grammatical example
  - 3.1 Testing for difference under alternation
  - 3.2 Comparing Newcombe-Wilson intervals for direction
  - 3.3 Optimising the difference interval
- Mapping significance of association and direction
- Concluding remarks
- References

Wallis, S.A. 2017. *Detecting direction in interaction evidence*. London: Survey of English Usage. **»** Paper (PDF)

- Excel spreadsheets

Wallis, S.A. 2011. *Comparing χ² tests for separability*. London: Survey of English Usage, UCL. **»** post

Wallis, S.A. 2012. *Goodness of fit measures for discrete categorical data*. London: Survey of English Usage, UCL. **»** post

Wallis, S.A. 2013a. *z*-squared: the origin and application of χ². *Journal of Quantitative Linguistics* **20**:4, 350-378. **»** post

Wallis, S.A. 2013b. Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods. *Journal of Quantitative Linguistics* **20**:3, 178-208. **»** post

Wallis, S.A. 2014. What might a corpus of parsed spoken data tell us about language? In L. Veselovská and M. Janebová (eds.) *Complex Visibles Out There. Proceedings of the Olomouc Linguistics Colloquium 2014: Language Use and Linguistic Structure.* Olomouc: Palacký University, 2014. pp 641-662. **»** post

Wallis, S.A. forthcoming. *That vexed problem of choice*. London: Survey of English Usage, UCL. **»** post