### Abstract

This paper describes a series of statistical meta-tests for comparing independent contingency tables for different types of significant difference. Recognising when an experiment obtains a significantly different result and when it does not is an issue frequently overlooked in research publication. Papers are frequently published citing ‘*p* values’ or test scores as evidence of a ‘stronger effect’, in place of sound statistical reasoning. This paper sets out a series of tests which together illustrate the correct approach to this question.

These meta-tests permit us to evaluate whether experiments have failed to replicate on new data; whether a particular data source or subcorpus obtains a significantly different result than another; or whether changing experimental parameters obtains a stronger effect.

The meta-tests are derived mathematically from the χ² test and the Wilson score interval, and consist of pairwise ‘point’ tests, ‘homogeneity’ tests and ‘goodness of fit’ tests. Meta-tests for comparing tests with one degree of freedom (e.g. ‘2 × 1’ and ‘2 × 2’ tests) are generalised to those of arbitrary size. Finally, we compare our approach with a competing approach offered by Zar (1999), which, while straightforward to calculate, turns out to be both less powerful and less robust.

### Introduction

Researchers often wish to compare the results of their experiments with those of others.

Alternatively they may wish to compare permutations of an experiment to see if a modification in the experimental design obtains a significantly different result. By doing so they would be able to investigate the empirical question of the effect of modifying an experimental design on reported results, as distinct from a deductive argument concerning the optimum design.

One of the reasons for carrying out such a test concerns the question of replication. Significance tests and confidence intervals rely on an *a priori* Binomial model predicting the likely distribution of future runs of the same experiment. However, there is a growing concern that allegedly significant results published in eminent psychology journals have failed to replicate (see, e.g., Gelman and Loken 2013). Such failures may be due to variation of the sample, or to problems with the experimental design (such as unstated assumptions or baseline conditions that vary over experimental runs). The methods described here permit us to define a ‘failure to replicate’ as occurring when subsequent repetitions of the same experiment obtain statistically separable results on more occasions than predicted by the error level, ‘α’, used for the test.

Consider Table 1, taken from Aarts, Close and Wallis (2013). The two tables summarise a pair of 2 × 2 contingency tests for two different sets of British English corpus data for the modal alternation *shall* vs. *will*. The spoken data is drawn from the *Diachronic Corpus of Present-day Spoken English*, which contains matching data from the *London-Lund Corpus* and the *British Component of the International Corpus of English* (ICE-GB). The written data is drawn from the *Lancaster-Oslo-Bergen* (LOB) corpus and the matching *Freiburg-Lancaster-Oslo-Bergen* (FLOB) corpus.

Both 2 × 2 subtests are individually significant (χ² = 36.58 and 35.65 respectively). The results (see the effect size measures φ and percentage difference *d*<sup>%</sup>) appear to be different.

How might we test if the tables are significantly different from each other?

| (spoken) | shall | will | Total | χ²(shall) | χ²(will) | summary |
|---|---|---|---|---|---|---|
| LLC (1960s) | 124 | 501 | 625 | 15.28 | 2.49 | *d*<sup>%</sup> = -60.70% ±19.67%; φ = 0.17; χ² = 36.58 |
| ICE-GB (1990s) | 46 | 544 | 590 | 16.18 | 2.63 | |
| TOTAL | 170 | 1,045 | 1,215 | 31.46 | 5.12 | |

| (written) | shall+ | will+’ll | Total | χ²(shall+) | χ²(will+’ll) | summary |
|---|---|---|---|---|---|---|
| LOB (1960s) | 355 | 2,798 | 3,153 | 15.58 | 1.57 | *d*<sup>%</sup> = -39.23% ±12.88%; φ = 0.08; χ² = 35.65 |
| FLOB (1990s) | 200 | 2,723 | 2,923 | 16.81 | 1.69 | |
| TOTAL | 555 | 5,521 | 6,076 | 32.40 | 3.26 | |
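The summary statistics in Table 1 can be recomputed from the raw frequencies. The following is a minimal sketch rather than the authors' own code. It assumes an uncorrected Pearson χ², φ = √(χ²/*N*), and percentage difference defined as the relative change in the proportion of *shall* from the 1960s row to the 1990s row. The function names are illustrative, and the ± interval on *d*<sup>%</sup> is not computed.

```python
from math import sqrt

def chi_square_2x2(table):
    """Pearson chi-square (no continuity correction) for a 2 x 2 table of counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2

def phi(table):
    """Effect size phi = sqrt(chi-square / N) for a 2 x 2 table (assumed definition)."""
    n = sum(sum(row) for row in table)
    return sqrt(chi_square_2x2(table) / n)

def percentage_difference(table):
    """Relative change in the first-column proportion from row 1 to row 2
    (assumed definition of d%)."""
    (a, b), (c, d) = table
    p1 = a / (a + b)
    p2 = c / (c + d)
    return (p2 - p1) / p1

# Rows: 1960s, 1990s. Columns: shall, will.
spoken = [[124, 501], [46, 544]]      # LLC, ICE-GB
written = [[355, 2798], [200, 2723]]  # LOB, FLOB

for name, table in (("spoken", spoken), ("written", written)):
    print(f"{name}: chi2 = {chi_square_2x2(table):.2f}, "
          f"phi = {phi(table):.2f}, d% = {percentage_difference(table):.2%}")
```

Under these assumptions the printed values should agree with the χ², φ and *d*<sup>%</sup> figures in the summary column to two decimal places.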

We can plot Table 1 as two independent pairs of probability observations, as in Figure 1. We calculate the proportion *p* = *f*/*n* in each case, and – in order to estimate the likely range of error introduced by the sampling procedure – compute Wilson score intervals at a 95% confidence level.

The intervals in Figure 1 are shown by ‘I’ shaped error bars: were the experiment to be re-run multiple times, in 95% of predicted repeated runs, observations at each point will fall within the interval. Where Wilson intervals do not overlap at all (e.g. LLC vs. LOB, marked ‘**A**’) we can conclude that the difference is significant without further testing; where they overlap such that one point falls within the other point's interval, the difference is non-significant; otherwise a test must be applied.
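The interval calculation and the three-way visual rule above can be sketched as follows. This is an illustration only: it uses the standard Wilson score interval formula without continuity correction, and the function names `wilson_interval` and `compare_by_eye` are invented for this example.

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(f, n, confidence=0.95):
    """Wilson score interval (no continuity correction) for the proportion p = f / n."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-tailed critical value
    p = f / n
    denominator = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denominator
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denominator
    return centre - half_width, centre + half_width

def compare_by_eye(obs1, obs2, confidence=0.95):
    """Apply the visual rule: no overlap, point inside interval, or undecided."""
    (f1, n1), (f2, n2) = obs1, obs2
    p1, p2 = f1 / n1, f2 / n2
    lo1, hi1 = wilson_interval(f1, n1, confidence)
    lo2, hi2 = wilson_interval(f2, n2, confidence)
    if hi1 < lo2 or hi2 < lo1:
        return "significant"        # intervals do not overlap at all
    if lo1 <= p2 <= hi1 or lo2 <= p1 <= hi2:
        return "non-significant"    # one point lies inside the other's interval
    return "test required"          # partial overlap: apply a formal test

# Frequencies of 'shall' out of all shall/will cases, from Table 1.
llc, ice_gb = (124, 625), (46, 590)
lob, flob = (355, 3153), (200, 2923)

print(compare_by_eye(llc, lob))
print(compare_by_eye(ice_gb, flob))
```

The rule is a shortcut that avoids a formal test in the clear-cut cases; only the ‘test required’ outcome needs one.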

In this paper we discuss two different analytical comparisons.

- ‘Point tests’ compare pairs of observations (‘points’) across the dependent variable (e.g. *shall/will*) and tables *t* = {1, 2}. To do this we compare the two points and their confidence intervals. We carry out a 2 × 2 χ² test for homogeneity or a Newcombe-Wilson test (Wallis 2013a) to compare each point. We can compare the initial 1960s data (LLC vs. LOB, indicated) in the same way as we might compare spoken 1960s and 1990s data (e.g. LLC vs. ICE-GB).
- ‘Gradient tests’ compare differences in ‘sizes of effect’ (e.g. a change in the ratio *shall/will* over time) between tables *t*. We might ask, is the gradient significantly steeper for the spoken data than for the written data?
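A point test of the Newcombe-Wilson kind can be sketched as below. This follows Newcombe's standard interval for the difference between two independent proportions, built from the two Wilson intervals. The function names are illustrative and the formulation is an assumption, not necessarily the exact one used in Wallis (2013a). The difference is treated as significant when the interval excludes zero.

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(f, n, confidence=0.95):
    """Wilson score interval (no continuity correction) for p = f / n."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = f / n
    denominator = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denominator
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denominator
    return centre - half_width, centre + half_width

def newcombe_wilson(obs1, obs2, confidence=0.95):
    """Return (difference, lower, upper, significant) for p1 - p2."""
    (f1, n1), (f2, n2) = obs1, obs2
    p1, p2 = f1 / n1, f2 / n2
    lo1, hi1 = wilson_interval(f1, n1, confidence)
    lo2, hi2 = wilson_interval(f2, n2, confidence)
    diff = p1 - p2
    # Combine the inner interval widths in quadrature on each side.
    lower = diff - sqrt((p1 - lo1) ** 2 + (hi2 - p2) ** 2)
    upper = diff + sqrt((hi1 - p1) ** 2 + (p2 - lo2) ** 2)
    significant = not (lower <= 0 <= upper)
    return diff, lower, upper, significant

llc, ice_gb, lob = (124, 625), (46, 590), (355, 3153)

# Across tables (spoken vs. written, 1960s) and within a table (1960s vs. 1990s).
for label, a, b in (("LLC vs. LOB", llc, lob), ("LLC vs. ICE-GB", llc, ice_gb)):
    diff, lower, upper, significant = newcombe_wilson(a, b)
    print(f"{label}: d = {diff:.4f} ({lower:.4f}, {upper:.4f}) significant = {significant}")
```

The same function serves both kinds of comparison because a point test only needs two independent observed proportions, whichever tables they come from.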

Note that these tests evaluate different things and have different outcomes. If plot-lines are parallel, the gradient test will be non-significant, but the point test could still be significant at every pair of points. The two tests are complementary analytical tools.

#### 1.1 How not to compare test results

A common, but mistaken, approach to comparing experimental results involves simply citing the output of significance tests (Goldacre 2011). Researchers frequently make claims citing *t*, *F* or χ² scores, ‘*p* values’ (error levels), etc., as evidence for the strength of results. However, this fundamentally misinterprets the meaning of these measures, and comparisons between them are not legitimate.
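One way to see the problem is that a χ² score depends on sample size as well as on the size of the effect. The sketch below uses invented counts, not data from the paper: two tables with identical proportions, one ten times larger than the other. The test score grows tenfold while the effect size φ stays the same.

```python
from math import sqrt

def chi_square_2x2(table):
    """Pearson chi-square (no continuity correction) for a 2 x 2 table of counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def phi(table):
    """Effect size phi = sqrt(chi-square / N)."""
    n = sum(sum(row) for row in table)
    return sqrt(chi_square_2x2(table) / n)

# Hypothetical counts: the same proportions (30% vs. 20%) at two sample sizes.
small = [[30, 70], [20, 80]]
large = [[300, 700], [200, 800]]

for name, table in (("small", small), ("large", large)):
    print(f"{name}: chi2 = {chi_square_2x2(table):.2f}, phi = {phi(table):.3f}")
```

With one degree of freedom and a 0.05 error level (critical value approximately 3.84), the smaller table is not significant and the larger one is, although the observed proportions are identical.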