Random sampling, corpora and case interaction

Introduction

One of the main unsolved statistical problems in corpus linguistics is the following.

  • Statistical methods assume that samples under study are taken from the population at random.
  • But text corpora are only partially random. Corpora consist of passages of running text, in which words, phrases, clauses and speech acts are structured together to form the passage.
  • The selection of text passages for inclusion in a corpus is probably random (it will be sufficiently random unless you just pick the first ones you find, like the top of a set of search results!). However, cases within each text may not be independent.

This randomness requirement is foundationally important. It governs our ability to generalise from the sample to the population.

The corollary of random sampling is that cases are independent of each other.

I see this problem as being fundamental to corpus linguistics as a credible experimental practice (to the point that I forced myself to relearn statistics from first principles after some twenty years in order to address it). In this blog entry I’m going to try to outline the problem and what it means in practice.

The saving grace is that statistical generalisation is premised on a mathematical model. The problem is not all-or-nothing. This means that we can, with care, attempt to address it proportionately.

Note: To actually solve the problem would require the integration of multiple sources of evidence into an a posteriori model of case interaction that computed marginal ‘independence probabilities’ for each case abstracted from the corpus. This is far beyond what any individual linguist could reasonably be expected to do unless an out-of-the-box solution is developed (I’m working on it, albeit slowly, so if you have ideas, please contact me…).

There are numerous sources of case interaction and clustering in texts, including conscious repetition of topic words and themes, unconscious tendencies to reuse particular grammatical choices, interaction along axes such as embedding and co-ordination (Wallis 2019), and structurally overlapping cases (Nelson et al. 2002: 272).

In this blog post I first outline the problem and then discuss feasible good practice based on our current technology.

Two experiments

When we perform queries on a corpus and extract data (a process we have elsewhere referred to as ‘abstraction’) we obtain a new sample, but the cases in this sample may not be independent from each other. Cases which come from the same text source may interact.

For example, speakers or writers can complete a noun phrase by postmodifying it with a non-finite or relative clause, e.g.

  • people who live in Berlin (relative)
  • people living in Berlin (non-finite)

These options are relatively unmarked, so speakers may use one or the other form without any pretensions of style. Let us therefore consider the choice of relative vs. non-finite postmodifiers.

A corpus allows us to abstract data to perform the following two, perfectly legitimate, experiments:

  a. horizontally: investigate the interaction between two postmodifying clause decisions in a given grammatical relationship (e.g. in coordinated noun phrases), and
  b. vertically: investigate whether the choice of postmodifying clause type is affected by another factor, e.g. the utterance mode (speech or writing).
Two experiments.

In the horizontal experiment (a), we try to predict one decision (e.g. choice of postmodifying clause type) from another. In experiment (b), indicated by vertical arrows, we try to predict the choice from some external factor. The problem for experiment (b) is that if experiment (a) finds an interaction, cases a → A, b → B do not occur independently.

So if experiment (a) finds that the choice of one clause form affects the other (i.e. if these postmodifying clauses interact), then, where those clauses reappear in the sample for experiment (b), they cannot in fact be independent! The figure above summarises the problem.

The interesting question becomes:

can we use evidence from experiment (a) to determine the extent of interaction in experiment (b), and factor this into our evaluation of (b)?

Weighting evidence by independence

Ideally, we would wish to identify the prior probability of each case c in the sample occurring at random, which we can call the marginal independence probability ip(c).

If we had a good estimate of this probability we could replace totals in formulae for confidence intervals, χ² tests, etc. simply:

Corrected frequency n = Σip(c),

i.e. the total independent frequency replaces the simple count of cases. The difficulty is that there is no agreed method to calculate ip(c)!

  • Almost every corpus linguistics paper assumes a priori that samples are random (or “sufficiently so”), i.e. ip(c) = 1. This is the “best case” scenario.
  • The worst case is that all cases in the same text passage (“text” or “subtext”) are dependent on each other. Then ip(c) = 1/m where m is the number of cases per passage. At this limit, frequency statistics become dispersion statistics.

Let’s think about this for a moment. This worst case scenario will only be detectable if every case in the same passage is the same.

I can think of two circumstances when this might be true (you may think of others).

  1. The speaker is consciously repeating their wording or phrasing. For example, consider the choice between modal must and ought to in an instructional passage from a government leaflet. The editor enforces a rule: to express obligation, use must. This scenario implies that the speaker/writer is aware of the decision.
  2. You are actually measuring the same instance (circularity). This arises if a query matches the same word/phrase/clause more than once in different arrangements (see Nelson et al. 2002: 272 on overlapping cases). This is an experimental design problem concerning how cases are abstracted from the corpus.

Most examples of case interaction are between the two extremes, so 1/m < ip(c) < 1.
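
To make the arithmetic concrete, here is a minimal Python sketch (with invented passage sizes) of the corrected frequency n = Σip(c) at each extreme:

    # Corrected frequency n = sum of marginal independence probabilities ip(c).
    # Hypothetical sample: three text passages containing 1, 3 and 2 cases.
    cases_per_passage = [1, 3, 2]
    N = sum(cases_per_passage)              # simple count of cases: 6

    # Best case: all cases are independent, ip(c) = 1, so n = N.
    n_best = float(N)                       # 6.0

    # Worst case: all cases within a passage are dependent, ip(c) = 1/m,
    # so each passage contributes m * (1/m) = 1 independent case.
    n_worst = sum(m * (1.0 / m) for m in cases_per_passage)   # 3.0

    print(N, n_best, n_worst)               # 6 6.0 3.0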

Measuring case interaction

In a research project that completed in 2007, I reported on an a priori approach to estimating ip(c), applying it to every case in the corpus prior to carrying out statistical tests. However, this a priori model was necessarily arbitrary. It did not take into account the fact that different choices might interact to differing extents, and it relied on the grammatical proximity of the nearest neighbour to estimate the size of the effect (Report). What follows therefore is a mathematical reassessment: an attempt to develop an a posteriori model of case interaction by gathering evidence from the corpus.

We can estimate the interdependence between two cases in experiment (b) above by carrying out a kind of linguistic interaction transmission experiment (LITE, Wallis 2009; Wallis 2021). A transmission experiment (Newton 1730) consists of three elements: a transmitter, a receiver and a medium.

The idea is sketched out below. A decision at point A (e.g. choice of clause type) transmits information to a second decision (which it influences) at point B, via an intermediate structure headed by a clause or phrase C (the medium).

In our case A and B are the same decision (see above) and we might also refer to this as a kind of ‘priming’ experiment. (LITEs are not limited in principle to repeating choices, however, so decisions at A and B could be different.) What is novel about a LITE is that we permute over the intermediate element.

A linguistic interaction transmission experiment.

Suppose A and B represent different instances of relative or non-finite postmodifying clauses in the same tree, i.e. our experiment (a) above.

There are then two questions. First, what measure should be used to quantify the strength of the interaction between A and B, or, to put it another way, the amount of information from A that passes through intermediate elements to B? Second, how might we partition our data by different intermediate elements?

To measure the strength of interaction we may employ Cramér’s ϕ. This is a well-founded measure of association related to chi-square, χ². Elsewhere (Wallis 2012) I proved that ϕ measures the linear interdependence between two variables A and B. So all we need to do is define A as the decision at point A and B as the decision at B. Cramér’s ϕ is bidirectional, so the direction of the arrow in the LITE is a little misleading. If the value of A correlates with the value of B, the reverse is also true.
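
By way of illustration, here is a short Python function implementing Cramér’s ϕ from its standard definition (a sketch, not the software used for the study below; note that this textbook form is unsigned, whereas the graph below plots a signed variant that distinguishes reuse, positive, from avoidance, negative):

    import math

    def cramers_phi(table):
        """Cramer's phi for an r x c contingency table:
        phi = sqrt(chi-squared / ((k - 1)N)), where k = min(r, c)."""
        row_sums = [sum(row) for row in table]
        col_sums = [sum(col) for col in zip(*table)]
        N = sum(row_sums)
        chisq = 0.0
        for i, row in enumerate(table):
            for j, observed in enumerate(row):
                expected = row_sums[i] * col_sums[j] / N
                chisq += (observed - expected) ** 2 / expected
        k = min(len(row_sums), len(col_sums))
        return math.sqrt(chisq / ((k - 1) * N))

    # Hypothetical 2 x 2 table: decision at A (rows) by decision at B (columns),
    # e.g. relative vs. non-finite postmodifying clauses.
    print(cramers_phi([[30, 10], [10, 30]]))   # 0.5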

Next, having identified the measure of interdependence, we can explore what happens to ϕ as we increase the distance γ between cases.

Plotting the association between two related decisions ϕ(G, D) over path length γ. Almost all are positive, i.e. there is a tendency to reuse rather than avoid the same construction. Uncertainty is indicated by translated 2 × 2 Newcombe-Wilson intervals and α = 0.05. From Wallis (2021).

The graph partitions data from ICE-GB in two ways:

  1. whether postmodifying clauses are under a coordinating clause or not, and
  2. by measuring the path length γ, in nodes up and down, from one case to the next.

This is not an easy exercise for researchers to repeat. In order to obtain this graph I had to program software to measure the distance γ and partition the data.
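
To give a flavour of what such software must do, here is a hypothetical Python sketch building on cramers_phi() above. The record format is invented: in reality each pair of neighbouring cases must be abstracted from the treebank along with its path length γ and coordination status.

    from collections import defaultdict

    # Invented records: (gamma, coordinated?, decision at A, decision at B),
    # where each decision is 'rel' (relative) or 'nonfin' (non-finite).
    pairs = [(2, True, 'rel', 'rel'), (2, True, 'nonfin', 'nonfin'),
             (2, True, 'rel', 'rel'), (2, False, 'rel', 'nonfin'),
             (2, False, 'nonfin', 'rel'), (4, False, 'rel', 'rel')]

    groups = defaultdict(list)
    for gamma, coord, a, b in pairs:
        groups[(gamma, coord)].append((a, b))

    for (gamma, coord), cases in sorted(groups.items()):
        if len(cases) < 2:
            continue                        # too few pairs to measure
        # 2 x 2 contingency table: decision at A (rows) by decision at B.
        table = [[sum(1 for a, b in cases if a == x and b == y)
                  for y in ('rel', 'nonfin')] for x in ('rel', 'nonfin')]
        print(gamma, coord, cramers_phi(table))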

The graph, published in Wallis (2021), allows us to see that the priming effect of other decisions on the current case is occasionally non-negligible, and that coordination supports priming to the extent that a second term has a dependent probability of approximately 0.6 on the first.

In generalising an a posteriori model, we are not particularly interested in the confidence intervals on ϕ, although they may be useful in fitting calculations. It is also possible to see (in passing) that coordination provides significantly greater support for syntactic priming of non-finite or relative postmodifying clauses, so a model that accounts for coordinating and non-coordinating ancestor clauses is likely to be more accurate than one that does not.

We can now infer that where we have two immediately coordinated cases, they share this probability of dependence between them:

ip(A) = ip(B) = 1 – ϕ(A, B)/2 = 1 – 0.3 = 0.7.

Similarly, for non-coordinated cases at the same distance of two nodes, we find ϕ = 0.2, hence

ip(A) = ip(B) = 1 – ϕ(A, B)/2 = 1 – 0.1 = 0.9.
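
Expressed as a trivial helper function (the ϕ values are read off the graph above):

    def ip_from_phi(phi):
        """Marginal independence probability for each of a pair of cases
        sharing a dependency of strength phi: ip = 1 - phi/2."""
        return 1 - phi / 2

    print(ip_from_phi(0.6))   # immediately coordinated cases:  0.7
    print(ip_from_phi(0.2))   # non-coordinated, same distance: 0.9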

This has given us an “order of magnitude” estimate of the scale of the problem. Case interaction is not negligible, and could make the difference between a significant research result and a non-significant one. We need to take it into account, particularly if significant results are borderline.

Due diligence for researchers

As I noted in the introduction, this is not a problem where I can offer a simple off-the-peg solution. (An alternative approach which avoids this type of modelling is described in Adapting variance for random-text sampling.) Obtaining a good a posteriori model of case interaction (i.e. one obtained from data), and applying it to experiments without undue effort, remains an unresolved problem.

However I hope that I have demonstrated that this is an issue we can’t afford to ignore. This is particularly so when dealing with subsets of corpora involving few participants, or where cases are concentrated in a small number of texts (and therefore their presence is due to a small number of participants).

Below I have identified a number of steps we might consider taking.

  1. Always define experiments in terms of dependent variable choice. The questions of choice and case interaction are distinct, but focusing on choices factors out variation of opportunity. Since opportunities may also influence each other, this removes one source of case interaction. However this does not mean that case interaction per se becomes less important: focusing on choices makes experiments more precise, and potentially more sensitive to uneven sampling.
  2. Most importantly, identify cases which physically overlap. We need to ensure that we don’t count the same case twice. If you can do nothing else, eliminate one of the overlapping cases. For example, in ICE-GB and DCPSE, self-corrected cases are not counted by default. Ideally, share the evidence between them (reduce both probabilities to ip(c) = 0.5).
  3. Try to estimate the risk that ignoring case interaction will lead to a false positive result.
    1. Consider the confidence interval or χ² value: if the total number of cases N were reduced by half, would the result still be significant? If so, then further adjustment for case interaction is unlikely to change your conclusion.
    2. Measure ϕ = √(χ² / ((k – 1)N)) at close range to attempt to obtain an estimate of the scale of interaction between adjacent cases. If ϕ is small (ϕ < 0.1, say) then you can stop.
    3. Identify the proportion of cases that share a sentence with another case. If fewer than 10% of cases appear in the same sentence as another case, we probably need not worry about case interaction further.
  4. Optionally, attempt to compensate for priming (a sketch follows this list).
    1. We can tentatively assume that only cases which are in the same sentence will be potentially downscaled. A simple (and approximate) approach involves calculating the ratio of the number of independent sentences containing one or more cases, n, to the total number of hits, i.e. s = n / N, and then scaling confidence intervals and χ² tests appropriately (multiply frequencies by s or substitute n for N). A more precise approach would assign ip(c) = 1/m for all cases where m is the number of cases per sentence, and recount frequency by summing ip(c). See above.
    2. A more credible estimate could be somewhere in between best and worst case scenarios, so you could take s = (n + N) / 2N, ip(c) = 2/m, etc.
    3. If you have been able to compute ϕ you could substitute ip(c) = 1 – ϕ + ϕ/[2 + ϕ(m–2)] for a better estimate.

The key point is that even without an a posteriori model and computational support, if a crucial observation in an experiment is of borderline significance then you should carry out this type of assessment. Many results are not borderline, and a sense of proportionality is required in deciding when to worry about case interaction.

References

Nelson, G., Wallis, S.A. & Aarts, B. 2002. Exploring Natural Language: Working with the British Component of the International Corpus of English. Varieties of English around the World series. Amsterdam: John Benjamins.

Newton, I. 1730. Opticks. London: Dover.

Wallis, S.A. 2009. Grammatical Noriegas: interaction in corpora and treebanks. Paper presented at ICAME 2009, Lancaster. » Slides (PowerPoint)

Wallis, S.A. 2012. Measures of association for contingency tables. London: Survey of English Usage, UCL. » Post

Wallis, S.A. 2019. Investigating the additive probability of repeated language production decisions. International Journal of Corpus Linguistics 24:4, 490-521. » Post

Wallis, S.A. 2021. Statistics in Corpus Linguistics Research. New York: Routledge. » Announcement
