Detecting direction in interaction evidence

Introduction

I have previously argued (Wallis 2014) that interaction evidence is the most fruitful type of corpus linguistics evidence for grammatical research (and doubtless for many other areas of linguistics).

Frequency evidence, which we can write as p(x), the probability of x occurring, concerns itself simply with the overall distribution of a linguistic phenomenon x – such as whether informal written English has a higher proportion of interrogative clauses than formal written English. In order to calculate frequency evidence we must define x, i.e. decide how to identify interrogative clauses. We must also pick an appropriate baseline n for this evaluation, i.e. we need to decide whether to use words, clauses, or any other structure to identify locations where an interrogative clause may occur.
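As a minimal sketch of the arithmetic involved, frequency evidence reduces to a proportion of hits over a chosen baseline. The function name `proportion` and all counts below are invented for illustration; they are not drawn from any corpus.

```python
def proportion(hits: int, baseline: int) -> float:
    """Return p(x) = hits / baseline: the observed rate of x per opportunity."""
    if baseline <= 0:
        raise ValueError("baseline n must be positive")
    if not 0 <= hits <= baseline:
        raise ValueError("hits must lie between 0 and the baseline")
    return hits / baseline


# Hypothetical counts (assumed): interrogative clauses out of all clauses.
informal = proportion(hits=120, baseline=2000)
formal = proportion(hits=45, baseline=1500)

print(f"informal p(x) = {informal:.3f}, formal p(x) = {formal:.3f}")
```

Changing the baseline, for example from clauses to words, changes n and therefore p(x), which is why the choice of baseline has to be made explicitly.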

Interaction evidence is different. It is a statistical correlation between a decision that a writer or speaker makes at one part of a text, which we will label point A, and a decision at another part, point B. The idea is shown schematically in Figure 1. A and B are separate ‘decision points’ in a given relationship (e.g. lexical adjacency), which can also be considered as ‘variables’.

Figure 1: Associative inference from lexico-grammatical choice variable A to variable B (sketch).
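For two binary choices, an association of this kind can be summarised in a 2 × 2 contingency table, with rows for the outcome at A and columns for the outcome at B. The sketch below computes the standard χ² statistic and the φ coefficient for such a table. The function name `chi_square_2x2` and the counts are assumptions made for illustration, and the code is a generic textbook calculation rather than a reproduction of the paper's procedure.

```python
from math import sqrt


def chi_square_2x2(table):
    """Return (chi2, phi) for a 2x2 table [[a, b], [c, d]] of counts.

    Rows index the choice made at point A, columns the choice made at point B.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be non-zero")
    margins = row1 * row2 * col1 * col2
    chi2 = n * (a * d - b * c) ** 2 / margins  # no continuity correction
    phi = (a * d - b * c) / sqrt(margins)      # signed effect size, -1 to 1
    return chi2, phi


# Hypothetical counts (assumed), not taken from any corpus.
table = [[30, 10],
         [20, 40]]
chi2, phi = chi_square_2x2(table)

# 3.841 is the critical value of chi-square with 1 degree of freedom at 0.05.
print(f"chi2 = {chi2:.3f}, phi = {phi:.3f}, significant: {chi2 > 3.841}")
```

Both statistics are symmetric: swapping the rows and columns of the table leaves them unchanged, so on their own they say nothing about which decision came first.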



This class of evidence is used in a wide range of computational algorithms. These include collocation methods, part-of-speech taggers, and probabilistic parsers. Despite the promise of interaction evidence, the majority of corpus studies tend to consist of discussions of frequency differences and distributions.

In this paper I want to look at applications of interaction evidence where the two decisions are made at more or less the same time by the same speaker/writer. In such circumstances we cannot be sure that, just because B follows A in the text, the decision relating to B was made after the decision at A.

For example, in studying the premodification of noun phrases by attributive adjectives in English – which adjective is applied first in assembling an NP like ‘the old tall green ship’, for instance – we cannot be sure that adjectives are selected by the speaker in sentence order. It is also perfectly plausible that adjectives are chosen in an alternative or parallel order in the mind of the speaker, and then assembled in the final order during the language production process.

Of course, in cases where points A and B are separated substantively in time (as in many instances of structural self-priming) or where B is spoken in response to A by another speaker (structural priming of another’s language), there is unlikely to be any ambiguity about decision order. Moreover, if A licenses B, then the order is unambiguous.

However, in circumstances where A and B are proximal, and where the order of decisions made by the speaker/writer cannot be presumed, we wish to consider whether there are mathematical or statistical methods for predicting the most likely order in which the decisions were made.

Such a method would have considerable value in experimental design in cognitive corpus linguistics. For example, since Heads of NPs, VPs, etc. are conceived of as determining their complements, it may not be too much of a stretch to argue that if this method works, we may have found a way of empirically evaluating this grammatical concept.
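To make the question of direction concrete, the following sketch contrasts the two conditional readings of a single 2 × 2 table: how much the choice at A shifts the probability of an outcome at B, and vice versa. It is an illustration only, under assumed counts and with a hypothetical function name `directional_differences`. It does not test whether the two differences are significantly different, and it should not be read as the method developed in the paper.

```python
def directional_differences(table):
    """Return (d_ab, d_ba) for a 2x2 table [[a, b], [c, d]] of counts.

    Rows are the two outcomes at point A, columns the two outcomes at point B.
      d_ab = p(b1 | a1) - p(b1 | a2)   change in B when A changes
      d_ba = p(a1 | b1) - p(a1 | b2)   change in A when B changes
    """
    (a, b), (c, d) = table
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be non-zero")
    d_ab = a / row1 - c / row2
    d_ba = a / col1 - b / col2
    return d_ab, d_ba


# Hypothetical counts (assumed), not taken from any corpus.
table = [[30, 10],
         [20, 40]]
d_ab, d_ba = directional_differences(table)

print(f"A -> B difference: {d_ab:.4f}")
print(f"B -> A difference: {d_ba:.4f}")
print(f"product (equals phi squared): {d_ab * d_ba:.4f}")
```

The two differences need not be equal, whereas their product is always φ² for the table. An asymmetry between them is therefore information that a symmetric measure of association discards, although any claim about direction would still need an appropriate significance test.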


  1. Introduction
  2. A collocation example
    2.1 Employing chi-square and phi
    2.2 Directional statistics
    2.3 Significantly directional?
  3. A grammatical example
    3.1 Testing for difference under alternation
    3.2 Comparing Newcombe-Wilson intervals for direction
    3.3 Optimising the difference interval
  4. Mapping significance of association and direction
  5. Concluding remarks
  6. References


Wallis, S.A. 2017. Detecting direction in interaction evidence. London: Survey of English Usage. » Paper (PDF)

See also


Wallis, S.A. 2011. Comparing χ² tests for separability. London: Survey of English Usage, UCL. » post

Wallis, S.A. 2012. Goodness of fit measures for discrete categorical data. London: Survey of English Usage, UCL. » post

Wallis, S.A. 2013a. z-squared: the origin and application of χ². Journal of Quantitative Linguistics 20:4, 350-378. » post

Wallis, S.A. 2013b. Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods. Journal of Quantitative Linguistics 20:3, 178-208. » post

Wallis, S.A. 2014. What might a corpus of parsed spoken data tell us about language? In L. Veselovská and M. Janebová (eds.) Complex Visibles Out There. Proceedings of the Olomouc Linguistics Colloquium 2014: Language Use and Linguistic Structure. Olomouc: Palacký University, 2014. pp 641-662. » post

Wallis, S.A. forthcoming. That vexed problem of choice. London: Survey of English Usage, UCL. » post


