Detecting direction in interaction evidence

Introduction · Paper (PDF)

I have previously argued (Wallis 2014) that interaction evidence is the most fruitful type of corpus linguistics evidence for grammatical research (and doubtless for many other areas of linguistics).

Frequency evidence, which we can write as p(x), the probability of x occurring, concerns itself simply with the overall distribution of a linguistic phenomenon x – such as whether informal written English has a higher proportion of interrogative clauses than formal written English. In order to calculate frequency evidence we must define x, i.e. decide how to identify interrogative clauses. We must also pick an appropriate baseline n for this evaluation, i.e. we need to decide whether to use words, clauses, or any other structure to identify locations where an interrogative clause may occur.

Interaction evidence is different. It is a statistical correlation between a decision that a writer or speaker makes at one part of a text, which we will label point A, and a decision at another part, point B. The idea is shown schematically in Figure 1. A and B are separate ‘decision points’ in a given relationship (e.g. lexical adjacency), which can also be considered ‘variables’.

Figure 1: Associative inference from lexico-grammatical choice variable A to variable B (sketch).

This class of evidence is used in a wide range of computational algorithms. These include collocation methods, part-of-speech taggers, and probabilistic parsers. Despite the promise of interaction evidence, the majority of corpus studies tend to consist of discussions of frequency differences and distributions.

In this paper I want to look at interaction evidence between decisions which are made more-or-less at the same time by the same speaker/writer. In such circumstances we cannot be sure that, just because B follows A in the text, the decision relating to B was made after the decision at A. Continue reading

Capturing patterns of linguistic interaction

Abstract Full Paper (PDF)

Numerous competing grammatical frameworks exist on paper, as algorithms and embodied in parsed corpora. However, not only is there little agreement about grammars among linguists, but there is no agreed methodology for demonstrating the benefits of one grammar over another. Consequently the status of parsed corpora or ‘treebanks’ is suspect.

The most common approach to empirically comparing frameworks is based on the reliable retrieval of individual linguistic events from an annotated corpus. However this method risks circularity, permits redundant terms to be added as a ‘solution’ and fails to reflect the broader structural decisions embodied in the grammar. In this paper we introduce a new methodology based on the ability of a grammar to reliably capture patterns of linguistic interaction along grammatical axes. Retrieving such patterns of interaction does not rely on atomic retrieval alone, does not risk redundancy and is no more circular than a conventional scientific reliance on auxiliary assumptions. It is also a valid experimental perspective in its own right.

We demonstrate our approach with a series of natural experiments. We find an interaction captured by a phrase structure analysis between attributive adjective phrases under a noun phrase with a noun head, such that the probability of adding successive adjective phrases falls. We note that a similar interaction (between adjectives preceding a noun) can also be found with a simple part-of-speech analysis alone. On the other hand, preverbal adverb phrases do not exhibit this interaction, a result anticipated in the literature, confirming our method.

Turning to cases of embedded postmodifying clauses, we find a similar fall in the additive probability of both successive clauses modifying the same NP and embedding clauses where the NP head is the most recent one. Sequential postmodification of the same head reveals a fall and then a rise in this additive probability. Reviewing cases, we argue that this result can only be explained as a natural phenomenon acting on language production which is expressed by the distribution of cases on an embedding axis, and that this is in fact empirical evidence for a grammatical structure embodying a series of speaker choices.

We conclude with a discussion of the implications of this methodology for a series of applications, including optimising and evaluating grammars, modelling case interaction, contrasting the grammar of multiple languages and language periods, and investigating the impact of psycholinguistic constraints on language production.

Continue reading

Choosing the right test

Introduction

One of the most common questions a new researcher has to deal with is the following:

what is the right statistical test for my purpose?

To answer this question we must distinguish between

  1. different experimental designs, and
  2. optimum methods for testing significance.

In corpus linguistics, many research questions involve choice. The speaker can say shall or will, choose to add a postmodifying clause to an NP or not, etc. If we want to know what factors influence this choice, then these factors are termed independent variables (IVs) and the choice is the dependent variable (DV). These choices are mutually exclusive alternatives. Framing the research question like this immediately helps us focus on the appropriate class of tests. Continue reading

Robust and sound?

When we carry out experiments and perform statistical tests we have two distinct aims.

  1. To form statistically robust conclusions about empirical data.
  2. To make logically sound arguments about experimental conclusions.

Robustness is essentially an inductive mathematical or statistical issue.

Soundness is a deductive question of experimental design and reporting.

Robust conclusions are those that are likely to be repeated if another researcher were to come along and perform the same experiment with different data sampled in much the same way. Sound arguments distinguish between what we can legitimately infer from our data, and the hypothesis we may wish to test.

Continue reading

Some bêtes noires

There are a number of common issues in corpus linguistics papers.

  1. an extremely common tendency for authors to cite frequencies normalised per million or thousand words (i.e. a per-word baseline or a multiple thereof) as their primary evidence,
  2. data is usually plotted without confidence intervals, so it is not possible to tell visually whether a perceived change might be statistically significant, and
  3. significance tests are often employed without a clear statement of what the test is evaluating.

Experimental design

The first issue may be unique to corpus linguistics, deriving from its particular historical origins.

It concerns how the experimenter identifies counterfactual alternates or selects baselines. This is an experimental design question.

In the beginning was the Word.

Linguists examining volumes of plain text data (later supported by computing and part-of-speech tagging) invariably concentrated on the idea of the word as the unit of language. Collocation and concordancing sat alongside lexicography as the principal tools of the trade. “Statistics” here primarily concerned probabilistic measures of association between neighbouring words in order to find common patterns. This activity is of course perfectly fine, and allowed researchers to make huge gains in our understanding of language.

But…

Without labouring the point (which I do elsewhere on this blog), the corollary of the statement that language is grammatical is that if, instead of describing the distribution of words, n-grams, etc, we wish to investigate how language is produced, the word cannot be our primary focus. Continue reading

A statistics crib sheet

Confidence intervals · Handout

Confidence intervals on an observed rate p should be computed using the Wilson score interval method. A confidence interval on an observation p represents the range that the true population value, P (which we cannot observe directly) may take, at a given level of confidence (e.g. 95%).

Note: Confidence intervals can be applied to onomasiological change (variation in choice) and semasiological change (variation in meaning), provided that P is free to vary from 0 to 1 (see Wallis 2012). Naturally, the interpretation of significant change in either case is different.
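For readers who want to compute this interval directly, here is a minimal Python sketch of the Wilson score interval formula (my own illustration, not code from the handout; the function name and example values are assumptions):

```python
import math

def wilson_interval(f, n, z=1.959964):
    """Wilson score interval for an observed proportion p = f/n.

    f: observed frequency, n: baseline (sample size),
    z: critical value of the standard Normal distribution
       (1.959964 corresponds to a 95% confidence level).
    Returns (lower, upper) bounds for the unobservable population value P.
    """
    p = f / n
    z2 = z * z
    centre = (p + z2 / (2 * n)) / (1 + z2 / n)      # interval centre, shifted towards 0.5
    spread = (z / (1 + z2 / n)) * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n))
    return centre - spread, centre + spread

# Example: an observed rate of 124 cases out of 625
print(wilson_interval(124, 625))   # approximately (0.169, 0.231)
```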

Methods for calculating intervals employ the Gaussian approximation to the Binomial distribution.

Confidence intervals on Expected (Population) values (P)

The Gaussian interval about P uses the mean and standard deviation as follows:

mean x̄ = P = F/N,
standard deviation S ≡ √(P(1 – P) / N).

The Gaussian interval about P can be written as P ± E, where E = z.S, and z is the critical value of the standard Normal distribution at a given error level (e.g., 0.05). Although this is a bit of a mouthful, critical values of z are constant, so for any given level you can just substitute the constant for z. [z(0.05) = 1.959964 to six decimal places.]

In summary:

Gaussian interval: P ± z√(P(1 – P) / N).
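As a concrete rendering of this formula, the following short sketch (again my own, not from the handout) computes the Gaussian interval about an expected value P:

```python
import math

def gaussian_interval(P, N, z=1.959964):
    """Gaussian interval about an expected population probability P,
    i.e. P ± z * sqrt(P(1 - P) / N)."""
    E = z * math.sqrt(P * (1 - P) / N)
    return P - E, P + E

# Example: P = 0.5, N = 100 gives approximately (0.402, 0.598)
print(gaussian_interval(0.5, 100))
```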

Confidence intervals on Observed (Sample) values (p)

We cannot use the same formula for confidence intervals about observations. Many people try to do this!

Most obviously, if p gets close to zero, the error e can exceed p, so the lower bound of the interval can fall below zero, which is clearly impossible! The problem is most apparent on smaller samples (larger intervals) and skewed values of p (close to 0 or 1).

Although the Gaussian is a reasonable approximation for an as-yet-unknown population probability P, it is incorrect for an interval around an observation p (Wallis 2013a). However, the latter case is precisely where the Gaussian interval is used most often!
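For illustration (a worked example of my own, not from the original post): suppose we observe f = 1 case out of n = 50, so p = 0.02. The Gaussian formula gives 0.02 ± 1.96 × √(0.02 × 0.98 / 50) ≈ 0.02 ± 0.039, i.e. a lower bound of roughly –0.019, below zero. The Wilson score interval for the same observation is approximately (0.004, 0.105), which stays within the legal probability range.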

What is the correct method?

Continue reading

Comparing χ² tests for separability

Abstract Paper (PDF)

This paper describes a series of statistical meta-tests for comparing independent contingency tables for different types of significant difference. Each experiment is carried out using Binomial or multinomial contingency statistics (χ², z, Fisher, log-likelihood tests, etc.). The meta-tests permit us to evaluate whether experiments have failed to replicate on new data; whether a particular data source or subcorpus obtains a significantly different result than another; or whether changing experimental parameters obtains a stronger effect.

Recognising when an experiment obtains a significantly different result and when it does not is an issue frequently overlooked in research publication. Papers are frequently published in which ‘p values’ or test scores suggesting a ‘stronger effect’ substitute for sound statistical reasoning. This paper sets out a series of tests which together illustrate the correct approach to this question.

The meta-tests presented are derived mathematically from the χ² test and the Wilson score interval, and consist of pairwise ‘point’ tests, ‘homogeneity’ tests and ‘goodness of fit’ tests. Meta-tests for comparing tests with one degree of freedom (e.g. ‘2 × 1’ and ‘2 × 2’ tests) are generalised to those of arbitrary size. Finally, we compare our approach with a competing approach offered by Zar (1999), which, while straightforward to calculate, turns out to be both less powerful and less robust.

Introduction

Researchers often wish to compare the results of their experiments with those of others.

Alternatively they may wish to compare permutations of an experiment to see if a modification in the experimental design obtains a significantly different result. By doing so they would be able to investigate the empirical question of the effect of modifying an experimental design on reported results, as distinct from a deductive argument concerning the optimum design.

One of the reasons for carrying out such a test concerns the question of replication. Significance tests and confidence intervals rely on an a priori Binomial model predicting the likely distribution of future runs of the same experiment. However, there is a growing concern that allegedly significant results published in eminent psychology journals have failed to replicate (see, e.g. Gelman and Loken 2013). The reasons may include variation of the sample, or problems with the experimental design (such as unstated assumptions or baseline conditions that vary over experimental runs). The methods described here permit us to define a ‘failure to replicate’ as occurring when subsequent repetitions of the same experiment obtain statistically separable results on more occasions than predicted by the error level, ‘α’, used for the test.

Consider Table 1, taken from Aarts, Close and Wallis (2013). The two tables summarise a pair of 2 × 2 contingency tests for two different sets of British English corpus data for the modal alternation shall vs. will. The spoken data is drawn from the Diachronic Corpus of Present-day Spoken English, which contains matching data from the London-Lund Corpus and the British Component of the International Corpus of English (ICE-GB). The written data is drawn from the Lancaster-Oslo-Bergen (LOB) corpus and the matching Freiburg-Lancaster-Oslo-Bergen (FLOB) corpus.

Both 2 × 2 subtests are individually significant (χ² = 36.58 and 35.65 respectively). The results (see the effect size measures φ and percentage difference d%) appear to be different.

How might we test if the tables are significantly different from each other?

(spoken) | shall | will | Total | χ²(shall) | χ²(will) | summary
LLC (1960s) | 124 | 501 | 625 | 15.28 | 2.49 | d% = -60.70% ±19.67%, φ = 0.17, χ² = 36.58 s
ICE-GB (1990s) | 46 | 544 | 590 | 16.18 | 2.63 |
TOTAL | 170 | 1,045 | 1,215 | 31.46 | 5.12 |

(written) | shall+ | will+’ll | Total | χ²(shall+) | χ²(will+’ll) | summary
LOB (1960s) | 355 | 2,798 | 3,153 | 15.58 | 1.57 | d% = -39.23% ±12.88%, φ = 0.08, χ² = 35.65 s
FLOB (1990s) | 200 | 2,723 | 2,923 | 16.81 | 1.69 |
TOTAL | 555 | 5,521 | 6,076 | 32.40 | 3.26 |

Table 1: A pair of 2 × 2 tables for shall/will alternation, after Aarts et al. (2013): upper: spoken, lower: written, with other differences in the experimental design. Note that χ² values are almost identical but Cramér’s φ and percentage swing d% are different.
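As a cross-check, the spoken half of Table 1 can be reproduced with a minimal 2 × 2 χ² homogeneity calculation. This sketch is my own illustration (the variable names are not from the paper); it recovers the per-cell contributions and the total χ² ≈ 36.58:

```python
# 2 x 2 chi-square test for homogeneity, spoken data from Table 1.
# Rows: LLC (1960s), ICE-GB (1990s); columns: shall, will.
observed = [[124, 501],
            [46, 544]]

row_totals = [sum(row) for row in observed]            # [625, 590]
col_totals = [sum(col) for col in zip(*observed)]      # [170, 1045]
N = sum(row_totals)                                    # 1215

chi_square = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / N          # expected cell frequency
        chi_square += (o - e) ** 2 / e                 # per-cell chi-square contribution

print(round(chi_square, 2))   # 36.58 > 3.841 (critical value, 1 d.f., error level 0.05)
```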

We can plot Table 1 as two independent pairs of probability observations, as in Figure 1. We calculate the proportion p = f/n in each case, and – in order to estimate the likely range of error introduced by the sampling procedure – compute Wilson score intervals at a 95% confidence level.
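The following brief sketch (my own; it assumes the Python statsmodels library is available) computes p = f/n and 95% Wilson score intervals for the four shall observations in Table 1, i.e. the quantities plotted in Figure 1:

```python
from statsmodels.stats.proportion import proportion_confint

# (label, f = frequency of 'shall', n = total shall/will cases), from Table 1
observations = [
    ("LLC (spoken 1960s)",    124,  625),
    ("ICE-GB (spoken 1990s)",  46,  590),
    ("LOB (written 1960s)",   355, 3153),
    ("FLOB (written 1990s)",  200, 2923),
]

for label, f, n in observations:
    p = f / n
    lower, upper = proportion_confint(f, n, alpha=0.05, method="wilson")
    print(f"{label}: p = {p:.3f}, 95% Wilson interval = ({lower:.3f}, {upper:.3f})")
```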

Figure 1: Example data in Table 1, plotted with 95% Wilson score intervals (Wallis 2013a).

The intervals in Figure 1 are shown by ‘I’-shaped error bars: were the experiment to be re-run multiple times, in 95% of predicted repeated runs, observations at each point would fall within the interval. Where Wilson intervals do not overlap at all (e.g. LLC vs. LOB, marked ‘A’) we can conclude that the difference is significant without further testing; where they overlap to the extent that one point falls within the other’s interval, the difference is non-significant; otherwise a test must be applied.

In this paper we discuss two different analytical comparisons.

  1. ‘Point tests’ compare pairs of observations (‘points’) across the dependent variable (e.g. shall/will) and tables t = {1, 2}. To do this we compare the two points and their confidence intervals. We carry out a 2 × 2 χ² test for homogeneity or a Newcombe-Wilson test (Wallis 2013a) to compare each point. We can compare the initial 1960s data (LLC vs. LOB, indicated) in the same way as we might compare spoken 1960s and 1990s data (e.g. LLC vs. ICE-GB).
  2. ‘Gradient tests’ compare differences in ‘sizes of effect’ (e.g. a change in the ratio shall/will over time) between tables t. We might ask, is the gradient significantly steeper for the spoken data than for the written data?

Note that these tests evaluate different things and have different outcomes. If plot-lines are parallel, the gradient test will be non-significant, but the point test could still be significant at every pair of points. The two tests are complementary analytical tools.
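To make the first kind of comparison concrete, here is a minimal sketch of a Newcombe-Wilson point test (my own implementation of the difference interval; function names and structure are assumptions, not code from the paper), applied to the 1960s spoken vs. written shall proportions from Table 1:

```python
import math

def wilson(f, n, z=1.959964):
    """Wilson score interval (lower, upper) for an observed proportion p = f/n."""
    p = f / n
    z2 = z * z
    centre = (p + z2 / (2 * n)) / (1 + z2 / n)
    spread = (z / (1 + z2 / n)) * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n))
    return centre - spread, centre + spread

def newcombe_wilson(f1, n1, f2, n2, z=1.959964):
    """Point test for the difference between two independent proportions.

    Returns (d, lower, upper), where d = p2 - p1 and (lower, upper) is the
    Newcombe-Wilson difference interval. If the interval excludes zero,
    the two observations are significantly different.
    """
    p1, p2 = f1 / n1, f2 / n2
    l1, u1 = wilson(f1, n1, z)
    l2, u2 = wilson(f2, n2, z)
    d = p2 - p1
    lower = d - math.sqrt((p2 - l2) ** 2 + (u1 - p1) ** 2)   # p2 at its lowest, p1 at its highest
    upper = d + math.sqrt((u2 - p2) ** 2 + (p1 - l1) ** 2)   # p2 at its highest, p1 at its lowest
    return d, lower, upper

# Point test: p(shall) in LLC (spoken 1960s) vs. LOB (written 1960s)
d, lower, upper = newcombe_wilson(124, 625, 355, 3153)
print(d, lower, upper)   # roughly -0.086, (-0.121, -0.054): excludes zero, so significant
```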

1.1 How not to compare test results

A common, but mistaken, approach to comparing experimental results involves simply citing the output of significance tests (Goldacre 2011). Researchers frequently make claims citing t, F or χ² scores, ‘p values’ (error levels), etc., as evidence for the strength of results. However, this fundamentally misinterprets the meaning of these measures, and comparisons between them are not legitimate.

Continue reading