I have recently been reviewing and rewriting a paper for publication that I first wrote back in 2011. The paper (Wallis forthcoming) concerns the problem of how we test whether repeated runs of the same experiment obtain essentially the same results, i.e. results that are not significantly different from each other.

These meta-tests can be used to test an experiment for replication: if you repeat an experiment and obtain significantly different results on the first repetition, then, with a 1% error level, you can say there is a 99% chance that the experiment is not replicable.

These tests have other applications. You might wish to compare your results with those of others in the literature, compare results obtained with different operationalisations (definitions of variables), or just compare results obtained with different data – such as comparing a grammatical distribution observed in speech with that found within writing.

The design of tests for this purpose is addressed within the t-test/ANOVA community, where tests are applied to continuously-valued variables. The solution concerns a particular version of an ANOVA, called “the test for interaction in a factorial analysis of variance” (Sheskin 1997: 489).

However, anyone using data expressed as discrete alternatives (A, B, C etc) has a problem: the classical literature does not explain what you should do.

Gradient and point tests


Figure 1: Point tests (A) and gradient tests (B), from Wallis (forthcoming).

The rewrite of the paper caused me to distinguish between two types of tests: ‘point tests’, which I describe below, and ‘gradient tests’.

These tests can be used to compare results drawn from 2 × 2 or r × c χ² tests for homogeneity (also known as tests for independence). This is the most common type of contingency test, which can be computed using Fisher’s exact method or as a Newcombe-Wilson difference interval.

  • A gradient test (B) evaluates whether the gradient or difference between point 1 and point 2, d = p₁ – p₂, differs between runs of an experiment. This concerns whether claims about the rate of change, or size of effect, observed are replicable. Gradient tests can be extended, with increasing degrees of freedom, into tests comparing patterns of effect.
  • A point test (A) simply asks whether data at either point, evaluated separately, differs between experimental runs. This concerns whether single observations, such as p₁, are replicable. Point tests can be extended into ‘multi-point’ tests, which we discuss below.

Point tests only apply to homogeneity data. If you wish to compare outcomes from goodness of fit tests, you need a version of the gradient test, to compare differences from an expected P, d = p₁ – P. Since different data sets may have different expected P, a distinct ‘point test for goodness of fit’ would be meaningless.

The earlier version of the paper, which has been published on this blog since its launch in 2012, focused on gradient tests. The possibility of carrying out a point test was mentioned in passing. In this blog post I want to focus on point tests.

The obvious problem with gradient tests is that two experimental runs might obtain the same gradient but in fact be very different in start and end points. Consider the following graph.


Figure 2: Why we need two different types of test: (almost) equal gradients but unequal points.

Point tests

The data in Figure 1 is calculated from two 2 × 2 tables drawn from a paper by Aarts, Close and Wallis (2013).

Note: To obtain Figure 2, I simply replaced one frequency in the first table: 46 with 100. The data is also found on the 2×2 homogeneity tab in this Excel spreadsheet, which contains a wide range of separability tests.

To make our exposition clearer, Table 1 uses the same format as in the Excel spreadsheet (with the dependent variable distributed vertically) rather than the format in the paper.

            1960s    1990s    Total
spoken      LLC      ICE-GB
shall       124      46       170
will        501      544      1,045
Total       625      590      1,215
written     LOB      FLOB
shall       355      200      555
will        2,798    2,723    5,521
Total       3,153    2,923    6,076

Table 1: Frequency data for the choice of modal shall out of the choice shall vs. will, various sources, from Aarts et al. (2013).
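The contrast between gradients and points can be checked with a few lines of arithmetic. The sketch below is only an illustration: it takes the frequencies above, applies the substitution described in the note (46 replaced by 100 in the spoken data), and computes the proportion of shall at each time point and the gradient d = p₁ – p₂ for each run. The function and variable names are my own and do not come from the paper or the spreadsheet.

```python
def proportion(shall, will):
    """Proportion of shall out of the shall/will alternation."""
    return shall / (shall + will)

# Spoken data with the substitution from the note (46 -> 100 for 1990s shall).
spoken = [proportion(124, 501), proportion(100, 544)]
# Written data, unchanged.
written = [proportion(355, 2798), proportion(200, 2723)]

# Gradient d = p1 - p2 within each run (1960s minus 1990s).
d_spoken = spoken[0] - spoken[1]
d_written = written[0] - written[1]

print(f"spoken:  p1 = {spoken[0]:.4f}  p2 = {spoken[1]:.4f}  d = {d_spoken:.4f}")
print(f"written: p1 = {written[0]:.4f}  p2 = {written[1]:.4f}  d = {d_written:.4f}")
```

The two gradients come out almost equal (roughly 0.043 and 0.044), while the proportions at each time point differ by more than 0.08. A gradient test alone would not register that difference.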

Aarts et al. carried out 2 × 2 homogeneity tests for the two tables separately. These test whether modal shall declines as a proportion of the modal shall/will alternation between the two time points. In other words, we compare LLC with ICE-GB data, and LOB with FLOB data.

To carry out a point test we simply rotate the test 90 degrees, e.g. to compare data at the 1960s point we compare LLC with LOB.

As I have explained elsewhere (Wallis 2013), there are a number of different methods for carrying out this comparison.

These include:

  1. The z test for two independent proportions (Sheskin 1997: 226).
  2. The Newcombe-Wilson interval test (Newcombe 1998).
  3. The 2 × 2 χ² test for homogeneity (independence).

These are all standard tests and each is discussed in papers and elsewhere on this blog.

The advantage of the third approach is that it is extensible to c-way multinomial observations by using a 2 × c χ² test.
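As a rough illustration of the first and third methods, the sketch below applies both to the 1960s point (LLC vs. LOB) and shows that the squared z score and the 2 × 2 χ² score coincide. It assumes the pooled-variance form of the z test and a Pearson χ² without continuity correction. The function names are hypothetical.

```python
import math

def z_two_proportions(f1, n1, f2, n2):
    """z score for two independent proportions, using the pooled estimate."""
    p1, p2 = f1 / n1, f2 / n2
    pooled = (f1 + f2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# 1960s point: shall out of shall + will, spoken (LLC) vs. written (LOB).
z = z_two_proportions(124, 625, 355, 3153)
chi2 = chi_square_2x2(124, 501, 355, 2798)

print(f"z = {z:.4f}, z squared = {z * z:.4f}, chi-square = {chi2:.4f}")
```

With one degree of freedom the critical value of χ² at α = 0.05 is about 3.84, so this point difference is significant under either method.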

The multi-point test

The tests listed above can be used to compare the 1960s and 1990s intervals in Figure 1 separately.

However, in many cases it would be helpful to have a method that evaluated both pairs of observations in a single test. This can be generalised to a series of r observations. To do this, in Wallis (forthcoming) I propose what I call a multi-point test.

We generalise the χ² formula by summing over i = 1..r:

  • χd² = ∑χ²(i)

where χ²(i) represents the χ² score for homogeneity for each set of data at position i in the distribution.

This test has r × df(i) degrees of freedom, where df(i) is the degrees of freedom for each χ² point test. So, in the worked example we have seen, the summed test has two degrees of freedom:

            1960s    1990s    Total
spoken      LLC      ICE-GB
shall       124      46       170
will        501      544      1,045
Total       625      590      1,215
written     LOB      FLOB
shall       355      200      555
will        2,798    2,723    5,521
Total       3,153    2,923    6,076
χ²          34.6906  0.6865   35.3772

Applying the generalised point test calculation to the table above. χ² = 35.38 is significant with 2 degrees of freedom and α = 0.05.

Since the computation sums independently-calculated χ² scores, each score may be individually considered for significant difference (with df(i) degrees of freedom). Hence we can see above the large score for the 1960s data (individually significant) and the small score for 1990s (individually non-significant).
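The summation can be sketched in a few lines. This is an illustration under stated assumptions rather than the spreadsheet's implementation: it uses the uncorrected Pearson χ² for each 2 × 2 point test, and the helper `p_value` is a hypothetical function that relies on closed forms of the χ² upper tail that hold only for one and two degrees of freedom.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def p_value(chi2, df):
    """Upper-tail probability of chi-square; closed forms for df = 1 and df = 2 only."""
    if df == 1:
        return math.erfc(math.sqrt(chi2 / 2))
    if df == 2:
        return math.exp(-chi2 / 2)
    raise ValueError("this sketch only handles df = 1 or df = 2")

# Each point compares spoken with written data: (shall, will, shall, will).
points = {
    "1960s": (124, 501, 355, 2798),  # LLC vs. LOB
    "1990s": (46, 544, 200, 2723),   # ICE-GB vs. FLOB
}

scores = {label: chi_square_2x2(*cells) for label, cells in points.items()}
total = sum(scores.values())
df_total = len(points) * 1  # r point tests, each with df(i) = 1

for label, score in scores.items():
    print(f"{label}: chi-square = {score:.4f}, p = {p_value(score, 1):.4g}")
print(f"sum: chi-square = {total:.4f}, df = {df_total}, p = {p_value(total, df_total):.4g}")
```

For more than two points, or for multinomial point tests, the closed forms no longer apply and a general χ² distribution function would be needed instead.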

Note: Whereas χ² is generally associative (non-directional), the summed equation (χd²) is not. Nor is this computation the same as a three-dimensional test (t × r × c). Variables are treated differently.

  • The multi-point test factors out variation between tests over the independent variable (in this instance: time). This means that if there is a lot more data in one table at a particular time period, this fact does not skew the results.
  • On the other hand, it does not factor out variation over the dependent variable – after all, this is precisely what we wish to examine!

Naturally, like the point test, this test may be generalised to multinomial observations.

A Newcombe-Wilson multi-point test

An alternative multi-point test for binomial (two-way) variables employs a sum of χ² values abstracted from Newcombe-Wilson tests.

  1. Carry out Newcombe-Wilson tests for each point test i at a given error level α, obtaining Di, Wi⁻ and Wi⁺.
  2. Identify the inner interval width Wi for each test:
    • if Di < 0, Wi = Wi⁻; Wi = Wi⁺ otherwise.
  3. Use the difference Di and inner interval Wi to compute χ² scores:
    • χ²(i) = (Di · zα/2 / Wi)².

It is then possible to sum χ²(i) as before.

Using the data in the worked example we obtain:

1960s: Di = 0.0858, Wi⁻ = -0.0347 and Wi⁺ = 0.0316 (significant).
1990s: Di = 0.0095, Wi⁻ = -0.0194 and Wi⁺ = 0.0159 (ns).

Since Di is positive in both cases, we use the upper interval width each time. This gives us χ² scores of 28.4076 and 1.3769 respectively, which obtains a sum of 29.78. Compared to the first method above, this approach tends to downplay extreme differences.
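Steps 2 and 3 can be sketched as follows. The sketch does not recompute the Newcombe-Wilson intervals. It takes the rounded values of Di, Wi⁻ and Wi⁺ reported above as inputs, so its scores agree with the figures in the text only to roughly two significant figures. It also assumes α = 0.05 (a two-tailed z of about 1.96). The function name `nw_chi_square` is hypothetical.

```python
from statistics import NormalDist

def nw_chi_square(d, w_minus, w_plus, alpha=0.05):
    """Chi-square score abstracted from a Newcombe-Wilson test (steps 2 and 3)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    w = w_minus if d < 0 else w_plus         # step 2: pick the inner interval width
    return (d * z / w) ** 2                  # step 3: rescale to a chi-square score

# Rounded values as reported in the text: (Di, Wi-, Wi+).
reported = {
    "1960s": (0.0858, -0.0347, 0.0316),
    "1990s": (0.0095, -0.0194, 0.0159),
}

scores = {label: nw_chi_square(*values) for label, values in reported.items()}
total = sum(scores.values())

for label, score in scores.items():
    print(f"{label}: chi-square = {score:.2f}")
print(f"sum: chi-square = {total:.2f}")
```

Because the score is the square of Di scaled by zα/2 / Wi, it exceeds the squared critical value exactly when Di falls outside the inner interval. Each rescaled score therefore reaches the same significance decision as the original Newcombe-Wilson test.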

In conclusion

The point test and the additive generalisation of this test into a ‘multi-point test’ represent a method of contrasting multiple runs of the same experiment, comparing observed changes in different subcorpora or genres, or examining the empirical effect of changing definitions of variables.

These tests consider the null hypothesis that individual observations are not different; or, in the multi-point case, that in general the observations are not different.

  • They do not evaluate the gradient between points or the size of effect. If we wish to compare sizes of effect we would need to use one of the methods for this purpose described in (Wallis forthcoming).
  • The method only applies to comparing tests for homogeneity (independence). To compare goodness of fit data, a different approach is required (also described in Wallis forthcoming).

Nonetheless, these meta-tests build on classical Pearson χ² tests, and they are useful tools in our analytical armoury.

References


Sheskin, D.J. 1997. Handbook of Parametric and Nonparametric Statistical Procedures. Boca Raton, Fl: CRC Press.

Newcombe, R.G. 1998. Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine 17: 873-890.

Wallis, S.A. 2013. z-squared: the origin and application of χ². Journal of Quantitative Linguistics 20:4, 350-378.

Wallis, S.A. forthcoming (first published 2011). Comparing χ² tables for separability of distribution and effect. London: Survey of English Usage.

Detecting direction in interaction evidence


I have previously argued (Wallis 2014) that interaction evidence is the most fruitful type of corpus linguistics evidence for grammatical research (and doubtless for many other areas of linguistics).

Frequency evidence, which we can write as p(x), the probability of x occurring, concerns itself simply with the overall distribution of a linguistic phenomenon x – such as whether informal written English has a higher proportion of interrogative clauses than formal written English. In order to calculate frequency evidence we must define x, i.e. decide how to identify interrogative clauses. We must also pick an appropriate baseline n for this evaluation, i.e. we need to decide whether to use words, clauses, or any other structure to identify locations where an interrogative clause may occur.

Interaction evidence is different. It is a statistical correlation between a decision that a writer or speaker makes at one part of a text, which we will label point A, and a decision at another part, point B. The idea is shown schematically in Figure 1. A and B are separate ‘decision points’ in a given relationship (e.g. lexical adjacency), which can be also considered as ‘variables’.

Figure 1: Associative inference from lexico-grammatical choice variable A to variable B (sketch).


This class of evidence is used in a wide range of computational algorithms. These include collocation methods, part-of-speech taggers, and probabilistic parsers. Despite the promise of interaction evidence, the majority of corpus studies tend to consist of discussions of frequency differences and distributions.

In this paper I want to look at applications of interaction evidence which are made more-or-less at the same time by the same speaker/writer. In such circumstances we cannot be sure that just because B follows A in the text, the decision relating to B was made after the decision at A.

UCL Summer School in English Corpus Linguistics 2017

I am pleased to announce the fifth annual Summer School in English Corpus Linguistics to be held at University College London from 5-7 July.

The Summer School is a short three-day intensive course aimed at PhD-level students and researchers who wish to get to grips with Corpus Linguistics. Numbers are deliberately limited on a first-come, first-served basis. You will be taught in a small group by a teaching team.

Each day begins with a theory lecture, followed by a guided hands-on workshop with corpora, and a more self-directed and supported practical session in the afternoon.


The replication crisis: what does it mean for corpus linguistics?


Over the last year, the field of psychology has been rocked by a major public dispute about statistics. This concerns the failure of claims in papers, published in top psychological journals, to replicate.

Replication is a big deal: if you publish a correlation between variable X and variable Y – that there is an increase in the use of the progressive over time, say, and that increase is statistically significant, you expect that this finding would be replicated were the experiment repeated.

I would strongly recommend Andrew Gelman’s brief history of the developing crisis in psychology. It is not necessary to agree with everything he says (personally, I find little to disagree with, although his argument is challenging) to recognise that he describes a serious problem here.

There may be more than one reason why published studies have failed to obtain compatible results on repetition, and so it is worth sifting these out.

In this blog post, what I want to do is try to explore what this replication crisis is – is it one problem, or several? – and then turn to what solutions might be available and what the implications are for corpus linguistics.

POS tagging – a corpus-driven research success story?


One of the longest-running, and in many respects the least helpful, methodological debates in corpus linguistics concerns the spat between so-called corpus-driven and corpus-based linguists.

I say that this has been largely unhelpful because it has encouraged a dichotomy which is almost certainly false, and the focus on whether it is ‘right’ to work from corpus data upwards towards theory, or from theory downwards towards text, distracts from some serious methodological challenges we need to consider (see other posts on this blog).

Usually this discussion reviews the achievements of the most well-known corpus-driven linguist, John Sinclair, in building the Collins Cobuild Corpus, and deriving the Collins Cobuild Dictionary (Sinclair et al. 1987) and Grammar (Sinclair et al. 1990) from it.

In this post I propose an alternative examination.

I want to suggest that the greatest success story for corpus-driven research is the development of part-of-speech taggers (usually called ‘POS-taggers’ or simply ‘taggers’) trained on corpus data.

These are industrial-strength, reliable algorithms that obtain good results with minimal assumptions about language.

So, who needs theory?

Why Chomsky was Wrong About Corpus Linguistics


When the entire premise of your methodology is publicly challenged by one of the most pre-eminent figures in an overarching discipline, it seems wise to have a defence. Noam Chomsky’s famous objection to corpus linguistics therefore needs a serious response.

“One of the big insights of the scientific revolution, of modern science, at least since the seventeenth century… is that arrangement of data isn’t going to get you anywhere. You have to ask probing questions of nature. That’s what is called experimentation, and then you may get some answers that mean something. Otherwise you just get junk.” (Noam Chomsky, quoted in Aarts 2001).

Chomsky has consistently argued that the systematic ex post facto analysis of natural language sentence data is incapable of taking theoretical linguistics forward. In other words, corpus linguistics is a waste of time, because it is capable of focusing only on external phenomena of language – what Chomsky has at various times described as ‘e-language’.

Instead we should concentrate our efforts on developing new theoretical explanations for the internal language within the mind (‘i-language’). Over the years the terminology has varied, but the argument has remained the same: real linguistics is the study of i-language, not e-language. Corpus linguistics studies e-language. Ergo, it is a waste of time.

Argument 1: in science, data requires theory

Chomsky refers to what he calls ‘the Galilean Style’ to make his case. This is the argument that it is necessary to engage in theoretical abstractions in order to analyse complex data. “[P]hysicists ‘give a higher degree of reality’ to the mathematical models of the universe that they construct than to ‘the ordinary world of sensation’” (Chomsky, 2002: 98). We need a theory in order to make sense of data, as so-called ‘unfiltered’ data is open to an infinite number of possible interpretations.

In the Aristotelian model of the universe the sun orbited the earth. The same data, reframed by the Copernican model, was explained by the rotation of the earth. However, the Copernican model of the universe was not arrived at by theoretical generalisation alone, but by a combination of theory and observation.

Chomsky’s first argument contains a kernel of truth. The following statement is taken for granted across all scientific disciplines: you need theory to analyse data. To put it another way, there is no such thing as an ‘assumption free’ science. But the second part of this argument, that the necessity of theory permits scientists to dispense with engagement with data (or even allows them to dismiss data wholesale), is not a characterisation of the scientific method that modern scientists would recognise. Indeed, Behme (2016) argues that this is also a mischaracterisation of Galileo’s method. Galileo’s particular fame, and his persecution, came from one source: the observations he made through his telescope.

The variance of Binomial distributions


Recently I’ve been working on a problem that besets researchers in corpus linguistics who work with samples which are not drawn randomly from the population but rather are taken from a series of sub-samples. These sub-samples (in our case, texts) may be randomly drawn, but we cannot say the same for any two cases drawn from the same sub-sample. It stands to reason that two cases taken from the same sub-sample are more likely to share a characteristic under study than two cases drawn entirely at random. I introduce the paper elsewhere on my blog.

In this post I want to focus on an interesting and non-trivial result I needed to address along the way. This concerns the concept of variance as it applies to a Binomial distribution.

Most students are familiar with the concept of variance as it applies to a Gaussian (Normal) distribution. A Normal distribution is a continuous symmetric ‘bell-curve’ distribution defined by two parameters, the mean and the standard deviation (the square root of the variance). The mean specifies the position of the centre of the distribution and the standard deviation specifies the width of the distribution.

Common statistical methods on Binomial variables, from χ² tests to line fitting, employ a further step. They approximate the Binomial distribution to the Normal distribution. They say, although we know this variable is Binomially distributed, let us assume the distribution is approximately Normal. The variance of the Binomial distribution becomes the variance of the equivalent Normal distribution.

In this methodological tradition, the variance of the Binomial distribution loses its meaning with respect to the Binomial distribution itself. It seems to be only valuable insofar as it allows us to parameterise the equivalent Normal distribution.

What I want to argue is that the concept of the variance of a Binomial distribution is in fact important in its own right, and we need to understand it with respect to the Binomial distribution, not the Normal distribution. Sometimes it is not necessary to approximate the Binomial to the Normal, and if we can avoid this approximation our results are likely to be stronger.
