Introduction Paper (PDF)
Researchers often wish to compare the results of their experiments with those of others. Alternatively, they may wish to compare permutations of an experiment to see if a modification in the experimental design obtains a significantly different result.
This question concerns an empirical analysis of the effect of modifying an experimental design on reported results, rather than a deductive argument concerning the optimum design.
Many researchers attempt this type of evaluation by employing statements about their results (citing t, F or χ² values, error levels or “p values”, etc.) as benchmarks for the strength of their results, implying a comparison that is frequently misunderstood (Goldacre 2011).
Alternatively, descriptive statistics of effect size such as percentage difference, log odds ratios, or Cramér’s φ may be used for comparison. These measures adjust for the volume of data and measure the pattern of change observed. However, effect size comparisons are discussed in the literature in surprisingly crude terms, e.g. ‘strong’, ‘medium’ and ‘weak’ effects (cf. Sheskin 1997: 244). In this paper we explain how to evaluate differences in effect size statistically.
- The fact that one chi-square value or error level exceeds another merely means that reported indicators differ. It does not mean that the results are statistically separable, i.e. that the results are significantly different from each other at a given likelihood of error.
- However, if we wish to claim a difference in experimental outcomes between experimental ‘runs’, this is precisely what we must establish.
In this paper we attempt to address how this question of separability may be evaluated.
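To illustrate why one χ² value exceeding another does not by itself show that two results differ, the following minimal Python sketch computes χ² and Cramér's φ for a 2 × 2 table and for the same table with every cell doubled. The cell counts and the function names `chi_square_2x2` and `phi` are assumptions made for illustration; they are not data or code from the paper.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def phi(a, b, c, d):
    """Cramér's phi for a 2 x 2 table: sqrt(chi-square / N)."""
    n = a + b + c + d
    return math.sqrt(chi_square_2x2(a, b, c, d) / n)

# Hypothetical counts, not taken from the paper.
small = (20, 30, 30, 20)
large = tuple(2 * x for x in small)  # same proportions, twice the data

chi_small, chi_large = chi_square_2x2(*small), chi_square_2x2(*large)
phi_small, phi_large = phi(*small), phi(*large)
print(chi_small, chi_large)
print(phi_small, phi_large)
```

Doubling the data doubles χ² while φ is unchanged, so the two χ² scores differ although the observed pattern is identical.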
We begin by focusing on comparing the results of two paired contingency tests:
- two 2 × 2 tests for homogeneity (independence) and
- two 2 × 1 goodness of fit tests.
The idea is that both dependent and independent variables are matched but not necessarily identical, i.e., in both subtests we attempt to measure the same quantities by different definitions, methods or samples. The new test then compares these subtest results for separability and tells us if the effect of the change in experimental design obtains a significantly different result.
Consider the example below, from Aarts, Close and Wallis (2013). The two tables summarise contingency tests for two different sets of data. The results appear to be different, especially if we consider effect size measures φ and d%. The question is whether we can test if they are significantly different from each other.
|LLC (1960s)|124|501|625|15.28|2.49|d% = -60.70% ±19.67%; φ = 0.17; χ² = 36.58 s|
|LOB (1960s)|355|2,798|3,153|15.58|1.57|d% = -39.23% ±12.88%; φ = 0.08; χ² = 35.65 s|
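As a rough illustration of the question, the sketch below compares the two d% values using only the figures printed above. It assumes that the ± terms are symmetric interval widths at the same error level and that the two samples are independent, so that the widths can be combined as the square root of the sum of their squares. This is a simplification for illustration and not necessarily the test developed in the paper; the function name `differs` is hypothetical.

```python
import math

def differs(d1, e1, d2, e2):
    """Crude check: is |d1 - d2| larger than the combined interval width?

    Assumes e1 and e2 are symmetric interval widths at the same error level
    and that the two estimates are independent.
    """
    diff = d1 - d2
    combined = math.sqrt(e1 ** 2 + e2 ** 2)
    return diff, combined, abs(diff) > combined

# d% values and interval widths as reported in the two tables (percentage points).
diff, combined, separable = differs(-60.70, 19.67, -39.23, 12.88)
print(f"difference = {diff:.2f}, combined width = {combined:.2f}, separable = {separable}")
```

Under these assumptions the difference between the two d% values (about 21.5 percentage points) falls inside the combined width (about 23.5), so this crude check would not separate the two results. A test based on the underlying frequencies may behave differently.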
The idea is summarised by the figure below. There are two broad classes of test: those that compare the results of goodness of fit tests (“separability of fit”) and those that compare the results of tests of homogeneity (“separability of independence”).
In this paper we concentrate on 2 × 2 and 2 × 1 tests because they have one degree of freedom, so significant results can be explained by a single factor.
It is possible to employ a similar approach for evaluating pairs of larger “r × c” or “r × 1” tables (see section 4 in the paper). However, we argue elsewhere (Wallis 2013) that it is good practice that such tables, which have many degrees of freedom (and therefore contain multiple potential areas of significant variation), should be analysed by subdivision into tables with one degree of freedom to identify areas of significant difference. The simplest tests we describe here may therefore have the greatest utility.
The tests we describe here represent a kind of meta-analysis: they provide a method for comparing and summarising experimental results. Other tests for comparing contingency test results include the McNemar and Cochran Q tests (Sheskin 1997), which compare distributions but not differences, and are known to be weak tests.
Zar’s (1999: 471, 500) chi-square heterogeneity analysis is the most similar class of tests in the literature to ours. Section 5 reviews these tests and compares them with our approach. The key difference is that Zar’s method requires that data has (approximately) the same prior distribution (i.e. the same starting point), whereas our tests do not.
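For comparison, a chi-square heterogeneity analysis of the kind attributed to Zar above can be sketched as follows for two 2 × 1 goodness of fit samples. The sketch assumes the textbook form of the procedure: sum the χ² values of the individual samples, subtract the χ² of the pooled data, and evaluate the remainder on the difference in degrees of freedom, with both samples tested against the same expected distribution. The counts, the 50:50 expected distribution and the function name `gof_chi_square` are hypothetical.

```python
def gof_chi_square(observed, expected_props):
    """Pearson goodness of fit chi-square against expected proportions."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, expected_props))

# Hypothetical data: two samples tested against the same prior distribution.
expected = (0.5, 0.5)
sample_a = (30, 20)
sample_b = (20, 30)

# Total chi-square: sum of the individual sample scores (1 df each).
total = gof_chi_square(sample_a, expected) + gof_chi_square(sample_b, expected)

# Pooled chi-square: score for the summed counts (1 df).
pooled_counts = tuple(a + b for a, b in zip(sample_a, sample_b))
pooled = gof_chi_square(pooled_counts, expected)

heterogeneity = total - pooled   # df = 2 - 1 = 1
CRITICAL_1DF_005 = 3.841         # chi-square critical value, 1 df, alpha = 0.05
print(heterogeneity, heterogeneity > CRITICAL_1DF_005)
```

In this example neither sample is individually significant at the 0.05 level (each χ² is 2.0), yet the heterogeneity χ² of 4.0 exceeds 3.841, because the samples deviate from the shared expected distribution in opposite directions.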
Finally, note that in this paper we discuss contingency tests. There is a comparable procedure for comparing multiple runs of t tests (or ANOVAs) but it is rarely recognised as such. This is the test for interaction in a factorial analysis of variance (Sheskin 1997: 489) where one of the factors represents the repeated run.
- Tests for differences between 2 × 2 test outcomes
- Tests for differences between 2 × 1 goodness of fit test outcomes
- Generalisation to r × c and r × 1 χ² tests
- Heterogeneity χ² tests
- Full Paper (PDF)
- Excel spreadsheet
- Detecting direction in interaction evidence
- Binomial confidence intervals and contingency tests
Wallis, S.A. 2011. Comparing χ² tests for separability. London: Survey of English Usage, UCL. http://www.ucl.ac.uk/english-usage/statspapers/comparing-x2-tests.pdf
Aarts, B., Close, J. and Wallis, S.A. 2013. Choices over time: methodological issues in investigating current change. Chapter 2 in Aarts, B., Close, J., Leech, G. and Wallis, S.A. (eds.) The Verb Phrase in English. Cambridge: CUP.
Goldacre, B. 2011. The statistical error that just keeps on coming. Guardian, 9 September 2011.
Sheskin, D.J. 1997. Handbook of Parametric and Nonparametric Statistical Procedures. 1st Edition. Boca Raton, FL: CRC Press.
Wallis, S.A. 2013. z-squared: the origin and application of χ². Journal of Quantitative Linguistics 20:4, 350-378.
Zar, J. H. 1999. Biostatistical analysis. 4th Edition. Upper Saddle River, NJ: Prentice Hall.