Random sampling, corpora and case interaction


One of the main unsolved statistical problems in corpus linguistics is the following.

Statistical methods assume that samples under study are taken from the population at random.

Text corpora are only partially random. Corpora consist of passages of running text, where words, phrases, clauses and speech acts are structured together to describe the passage.

The selection of text passages for inclusion in a corpus is potentially random. However cases within each text may not be independent.

This randomness requirement is foundationally important. It governs our ability to generalise from the sample to the population.

The corollary of random sampling is that cases are independent from each other.

I see this problem as being fundamental to corpus linguistics as a credible experimental practice (to the point that I forced myself to relearn statistics from first principles after some twenty years in order to address it). In this blog entry I’m going to try to outline the problem and what it means in practice.

The saving grace is that statistical generalisation is premised on a mathematical model. The problem is not all-or-nothing. This means that we can, with care, attempt to address it proportionately.

[Note: To actually solve the problem would require the integration of multiple sources of evidence into an a posteriori model of case interaction that computed marginal ‘independence probabilities’ for each case abstracted from the corpus. This is way beyond what any reasonable individual linguist could ever reasonably be expected to do unless an out-of-the-box solution is developed (I’m working on it, albeit slowly, so if you have ideas, don’t fail to contact me…).]

There are numerous sources of case interaction and clustering in texts. They range from conscious repetition of topic words and themes, through unconscious tendencies to reuse particular grammatical choices, to interaction along axes of, for example, embedding and co-ordination (Wallis 2012a), and structurally overlapping cases (Nelson et al 2002: 272).

In this blog post I first outline the problem and then discuss feasible good practice based on our current technology. 

Two experiments

When we perform queries on a corpus and extract data (a process we have elsewhere referred to as ‘abstraction’) we obtain a new sample, but the cases in this sample may not be independent from each other. Cases which come from the same text source may interact.

For example, speakers or writers can complete a noun phrase by postmodifying it with a non-finite or relative clause, e.g.

  • people who live in Berlin (relative)
  • people living in Berlin (non-finite)

These options are relatively unmarked, so speakers may use one or the other form without any pretensions of style. Let us therefore consider the choice of relative vs. non-finite postmodifiers.

A corpus allows us to abstract data to perform the following two, perfectly legitimate, experiments:

  (a) horizontally: investigate the interaction between two postmodifying clause decisions in a given grammatical relationship (e.g. in coordinated noun phrases), and
  (b) vertically: investigate whether the choice of postmodifying clause type is affected by another factor, e.g. the utterance mode (speech or writing).

Two experiments. In the horizontal experiment (a) we try to predict one decision (e.g. choice of postmodifying clause type) from another. In experiment (b), indicated by vertical arrows, we try to predict the choice from some external factor. The problem for experiment (b) is that cases a→A, b→B do not occur independently.

Note that if experiment (a) finds that the choice of one clause form affects the other (i.e. if these postmodifying clauses interact), then, should those clauses reappear in the sample for experiment (b), they cannot in fact be independent. The figure above summarises the problem.

The interesting question becomes:

can we use evidence from experiment (a) to determine the extent of interaction in experiment (b), and factor this into our evaluation of (b)?

Weighting evidence by independence

Ideally, we would wish to identify the prior probability of each case in the sample c occurring at random, which we can call the marginal independence probability ip(c).

If we had a good estimate of this probability we could replace totals in formulae for confidence intervals, χ² tests, etc. simply:

Corrected frequency n = Σip(c), i.e. the total independent frequency replaces the simple count of cases.

The difficulty is that there is no agreed method to calculate ip(c)!

  • Almost every corpus linguistics paper assumes a priori that samples are random (or “sufficiently so”), ip(c) = 1. This is the “best case” scenario.
  • The worst case is that all cases in the same text passage (“text” or “subtext”) are dependent on each other. Then ip(c) = 1/m where m is the number of cases per passage.
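
As a minimal sketch of these two bounds, assume each case is tagged with an identifier for the passage it came from. The function name and the input format are illustrative assumptions, not part of any corpus tool.

```python
from collections import Counter

def corrected_frequency(passage_ids, worst_case=False):
    """Sum ip(c) over all cases; passage_ids[i] names the passage of case i.

    Best case:  ip(c) = 1 for every case, so n is the simple count.
    Worst case: ip(c) = 1/m, where m is the number of cases in c's passage.
    """
    if not worst_case:
        return float(len(passage_ids))
    per_passage = Counter(passage_ids)
    return sum(1.0 / per_passage[p] for p in passage_ids)

# Made-up sample: six cases drawn from three passages.
cases = ["text1", "text1", "text1", "text2", "text3", "text3"]
best = corrected_frequency(cases)                    # every case counts fully
worst = corrected_frequency(cases, worst_case=True)  # each passage counts once
```

Under the worst case the corrected frequency collapses to the number of passages that contain at least one case, which is why the two bounds can differ so sharply when cases are concentrated in a few texts.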

Let’s think about this for a moment. This worst case scenario can only occur if every case is the same.

I can think of two circumstances when this might be true (you may think of others).

  1. The speaker is consciously repeating their wording or phrasing. For example, consider the choice between modal must and ought to in an instructional passage from a government leaflet. The editor enforces a rule: to express obligation, use must. This scenario implies that the speaker/writer is aware of the decision.
  2. You are actually measuring the same instance. This can happen if you apply a query that matches the same word/phrase/clause more than once in different arrangements (see Nelson et al. 2002: 272 on overlapping cases). This is an experimental design problem concerning how cases are abstracted from the corpus.

Most examples of case interaction are between the two extremes, so 1/m < ip(c) < 1.

Measuring case interaction

In a research project that completed in 2007, I reported on an a priori approach to estimating ip(c) and applying it to every case in the corpus prior to carrying out statistical tests. However this a priori model was necessarily arbitrary. It did not take into account the fact that different choices might interact to differing extents, and it relied on the grammatical proximity of the nearest neighbour to estimate the size of the effect (Report). What follows therefore is a mathematical reassessment: an attempt to develop an a posteriori model of case interaction by gathering evidence from the corpus.

We can estimate the interdependence between two cases in experiment (b) above by carrying out a kind of linguistic interaction transmission experiment (LITE, Wallis 2009). A transmission experiment (Newton 1730) consists of three elements: a transmitter, a receiver and a medium.

The idea is sketched out below. A decision at point A (e.g. choice of clause type) transmits information to a second decision (which it influences) at point B, via an intermediate structure headed by a clause or phrase C (the medium).

In our case A and B are the same decision (see above) and we might also refer to this as a kind of ‘priming’ experiment. (LITEs are not limited in principle to repeating choices, however, so decisions at A and B could be different.) What is novel about a LITE is that we permute over the intermediate element.

A linguistic interaction transmission experiment.

Suppose A and B represent different instances of relative or non-finite postmodifying clauses in the same tree, i.e. our experiment (a) above.

There are then two questions. First, what measure should be used to measure the strength of the interaction between A and B, or, to put it another way, the amount of information from A that passes through intermediate elements to B? Second, how might we partition our data by different intermediate elements?

To measure the strength of interaction we employ Cramér’s φ. Elsewhere (Wallis 2012b) I have proved that φ measures the linear interdependence between two variables A and B. So all we need do is define A as the decision at point A and B as the decision at B. Cramér’s φ is bidirectional, so the direction of the arrow in the LITE is a little misleading. If A affects the value of B the reverse is also true.
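
To make the measure concrete, here is a small sketch that computes Cramér's φ from a contingency table of the two decisions. It uses the standard definition φ = √(χ² / ((k – 1)N)), with k the smaller of the number of rows and columns. The counts are invented for illustration, and the code assumes no row or column total is zero.

```python
import math

def chi_square(table):
    """Pearson chi-square (no continuity correction) and total N for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n

def cramers_phi(table):
    """Cramér's phi: sqrt(chi2 / ((k - 1) * N)), k = min(rows, columns)."""
    chi2, n = chi_square(table)
    k = min(len(table), len(table[0]))
    return math.sqrt(chi2 / ((k - 1) * n))

# Hypothetical counts: rows = choice at A, columns = choice at B
# (relative vs. non-finite in each position).
table = [[30, 10],
         [10, 30]]
phi = cramers_phi(table)

# Swapping the roles of A and B (transposing the table) leaves phi unchanged.
phi_transposed = cramers_phi([list(col) for col in zip(*table)])
```

The transposed call illustrates the bidirectionality noted above: the score does not distinguish which decision is treated as the transmitter.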

Having identified the measure of interdependence we can explore what happens to φ as we increase the distance δ between each case.

Association between two related decisions φ(A, B) over distance δ.

The graph partitions data from ICE-GB in two ways:

  1. whether postmodifying clauses are under a coordinating clause or not, and
  2. by measuring the distance δ, in nodes up and down, from one case to the next.

This is not an easy exercise for researchers to repeat. In order to obtain this graph I had to program software to measure the distance δ and partition data.
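
The partitioning step itself can be sketched simply, provided the hard part (measuring δ between cases in the tree) has already been done. The sketch below assumes a hypothetical list of (δ, choice at A, choice at B) tuples and the labels 'rel' and 'nonfin'. It is not the software used to produce the graph, and the data are invented.

```python
import math
from collections import defaultdict

def phi_2x2(a, b, c, d):
    """Unsigned phi for the 2 x 2 table [[a, b], [c, d]]."""
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return abs(a * d - b * c) / denom if denom else float("nan")

def phi_by_distance(pairs):
    """Group (delta, choice_a, choice_b) tuples by delta and score each group.

    Choices are 'rel' or 'nonfin' (hypothetical labels). Returns {delta: phi}.
    """
    index = {"rel": 0, "nonfin": 1}
    tables = defaultdict(lambda: [[0, 0], [0, 0]])
    for delta, choice_a, choice_b in pairs:
        tables[delta][index[choice_a]][index[choice_b]] += 1
    return {
        delta: phi_2x2(t[0][0], t[0][1], t[1][0], t[1][1])
        for delta, t in sorted(tables.items())
    }

# Invented pairs, purely to exercise the function.
pairs = (
    [(1, "rel", "rel")] * 8 + [(1, "rel", "nonfin")] * 2
    + [(1, "nonfin", "rel")] * 2 + [(1, "nonfin", "nonfin")] * 8
    + [(2, "rel", "rel")] * 5 + [(2, "rel", "nonfin")] * 5
    + [(2, "nonfin", "rel")] * 5 + [(2, "nonfin", "nonfin")] * 5
)
result = phi_by_distance(pairs)
```

A second partition, such as coordinated versus non-coordinated pairs, could be handled the same way by adding that flag to the grouping key.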

The graph allows us to see that the priming effect of other decisions on the current case is occasionally non-negligible, and that coordination supports priming to the extent that a second term has a dependent probability of approximately 0.6 on the first. We are not particularly interested in the confidence intervals on φ, although this may be useful in smoothing calculations. It is also possible to see (in passing) that coordination provides a significantly greater support for syntactic priming of non-finite or relative postmodifying clauses.

We can now infer that where we have two immediately coordinated cases the cases share this probability of dependence between them.

ip(A) = ip(B) = 1 – φ(A, B)/2 = 1 – 0.3 = 0.7.

Similarly, for non-coordinated cases at the same two nodes distance we find φ = 0.2 etc.

ip(A) = ip(B) = 1 – φ(A, B)/2 = 1 – 0.1 = 0.9.

This has given us an “order of magnitude” estimate of the scale of the problem. Case interaction is not negligible, and could make the difference between a significant research result and a non-significant one. We need to take it into account, particularly if significant results are borderline.

Due diligence for researchers

As I noted in the introduction, this is not a problem where I can offer a simple off-the-peg solution. How to obtain a good a posteriori model of case interaction (i.e. one obtained from data), and apply it to experiments without effort, is not yet resolved.

However I hope that I have demonstrated that this is an issue we can’t afford to ignore. This is particularly so when dealing with subsets of corpora involving few participants, or where cases are concentrated in a small number of texts (and therefore their presence is due to a small number of participants).

Below I have identified a number of steps to consider taking.

  1. Always define experiments in terms of dependent variable choice. The questions of choice and case interaction are distinct and tangential, but focusing on choices factors out variation of opportunity. Since opportunities also potentially influence each other, this factors out this source of case interaction. However this does not mean that case interaction per se becomes less important: focusing on choices makes experiments more precise, and potentially more sensitive to uneven sampling.
  2. Most importantly, identify cases which physically overlap. We need to ensure that we don’t count the same case twice. If you can do nothing else, eliminate one of the cases. For example, in ICE-GB and DCPSE, self-corrected cases are not counted by default. Ideally, share the evidence (reduce both probabilities to ip(c) = 0.5).
  3. Try to estimate the risk that ignoring case interaction will lead to a false positive result.
    1. Consider the confidence interval or χ² value: if the total number of cases N was reduced by half, would it make a difference to the significant result? If not, then further work is unlikely to make much difference.
    2. Measure φ = √(χ² / ((k – 1)N)), where k is the smaller of the number of rows and columns in the table, at close range to obtain an estimate of the scale of interaction between adjacent cases. If φ is small (φ < 0.1, say) then you can stop.
    3. Identify the proportion of cases that share a sentence with another case. If less than 10% of cases appear in the same sentence as another case, we probably need not worry about case interaction further.
  4. Optionally, attempt to compensate for priming.
    1. We can tentatively assume that only cases which are in the same sentence will be potentially downscaled. A simple (and approximate) approach involves calculating the ratio of the number of independent sentences containing one or more cases, n, to the total number of hits, N, i.e. s = n / N, and then scaling confidence intervals and χ² tests appropriately (multiply frequencies by s or substitute n for N). A more precise approach would assign ip(c) = 1/m for all cases, where m is the number of cases per sentence, and recount frequency by summing ip(c). See above.
    2. A more credible estimate could be somewhere in between best and worst case scenarios, so you could take s = (n + N) / 2N, ip(c) = 2/m, etc.
    3. If you have been able to compute φ you could substitute ip(c) = 1 – φ + φ/[2 + φ(m-2)] for a better estimate.
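
The compensation options in step 4 can be sketched as follows. The sentence identifiers and the value φ = 0.6 are illustrative only, and treating an isolated case (m = 1) as fully independent is my assumption, since the φ-based formula is stated for cases that share a sentence.

```python
from collections import Counter

def ip_phi(m, phi):
    """ip(c) = 1 - phi + phi / (2 + phi * (m - 2)) for m cases per sentence.

    Assumption: a case alone in its sentence (m = 1) gets ip(c) = 1.
    """
    if m < 2:
        return 1.0
    return 1.0 - phi + phi / (2.0 + phi * (m - 2))

def corrected_n(sentence_ids, phi=None):
    """Sum ip(c) over cases; phi=None uses the worst case ip(c) = 1/m."""
    per_sentence = Counter(sentence_ids)
    total = 0.0
    for sid in sentence_ids:
        m = per_sentence[sid]
        total += 1.0 / m if phi is None else ip_phi(m, phi)
    return total

# Made-up sample: six hits spread over three sentences.
sentences = ["s1", "s1", "s2", "s3", "s3", "s3"]
N = len(sentences)                  # total number of hits
n = len(set(sentences))             # sentences containing one or more cases
s_worst = n / N                     # worst-case scale factor
s_mid = (n + N) / (2 * N)           # midpoint between best and worst case
n_worst = corrected_n(sentences)    # equals n
n_phi = corrected_n(sentences, phi=0.6)
```

Two sanity checks on the φ-based formula: with m = 2 it reduces to 1 – φ/2, the value used earlier for coordinated pairs, and with φ = 1 it reduces to the worst case 1/m.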

The key point is that even without an a posteriori model and computational support, if a crucial observation in an experiment is of borderline significance then you should carry out this type of assessment. Many results are not borderline, and a sense of proportionality is required in deciding when to worry about case interaction.
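
The quickest of these checks, asking whether a result would survive if the number of cases were halved, can be sketched for a 2 × 2 table. Multiplying every frequency by a scale factor multiplies χ² by the same factor, so the check is one line. The counts are invented, and 3.841 is the usual critical value for one degree of freedom at the 0.05 level.

```python
CHI2_CRIT_DF1_05 = 3.841  # critical chi-square, 1 degree of freedom, alpha = 0.05

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def survives_downscaling(a, b, c, d, scale=0.5):
    """True if the result stays significant after scaling all frequencies."""
    return scale * chi_square_2x2(a, b, c, d) > CHI2_CRIT_DF1_05

# Hypothetical tables: rows = speech / writing, columns = relative / non-finite.
robust = survives_downscaling(60, 40, 40, 60)      # chi-square well above the threshold
borderline = survives_downscaling(58, 42, 42, 58)  # significant only at full N
```

A result like the second one is the kind that warrants the further steps above. The first would remain significant even under this pessimistic halving.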

References


Nelson, G., Wallis, S.A. & Aarts, B. 2002. Exploring Natural Language: Working with the British Component of the International Corpus of English. Varieties of English around the World series. Amsterdam: John Benjamins.

Newton, I. 1730. Opticks. London: Dover.

Wallis, S.A. 2009. Grammatical Noriegas: interaction in corpora and treebanks. Paper presented at ICAME 2009, Lancaster.

Wallis, S.A. 2012a. Capturing patterns of linguistic interaction. London: Survey of English Usage, UCL.

Wallis, S.A. 2012b. Measures of association for contingency tables. London: Survey of English Usage, UCL.
