Abstract
A set of statistical tests termed contingency tests, of which χ² is the best-known example, is commonly employed in linguistics research. Contingency tests compare discrete distributions, that is, data divided into two or more alternative categories, such as alternative linguistic choices of a speaker or different experimental conditions. These tests are ubiquitous in the field, and form part of every linguistics researcher’s arsenal.
However, the mathematical underpinnings of these tests are rarely discussed in the literature in an approachable way, with the result that many researchers may apply tests inappropriately, fail to see the possibility of testing particular questions, or draw unsound conclusions. Contingency tests are also closely related to the construction of confidence intervals, which are highly useful and revealing methods for plotting the certainty of experimental observations.
This paper is organised in the following way. The foundations of the simplest type of χ² test, the 2 × 1 goodness of fit test, are introduced and related to the z test for a single observed proportion p and the Wilson score confidence interval about p. We then show how the 2 × 2 test for independence (homogeneity) is derived from two observations p1 and p2 and explain when each test should be used. We also briefly introduce the Newcombe-Wilson test, which ideally should be used in preference to the χ² test for observations drawn from two independent populations (such as two subcorpora). We then turn to tests for larger tables, generally termed “r × c” tests, which have multiple degrees of freedom and therefore may encompass multiple trends, and discuss strategies for their analysis. Finally, we turn briefly to the question of differentiating test results. We introduce the concept of effect size (also termed ‘measures of association’) and finally explain how we may perform statistical separability tests to distinguish between two sets of results.
Introduction
Karl Pearson’s famous chi-square test is derived from another statistic, called the z statistic, based on the Normal distribution.
The simplest versions of χ² can be shown to be mathematically identical to equivalent z tests. The tests produce the same result in all circumstances. For all intents and purposes “chi-squared” could be called “z-squared”. The critical values of χ² for one degree of freedom are the square of the corresponding critical values of z.
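This relationship between critical values is easy to check numerically. The following sketch (plain Python, using only the standard library’s NormalDist) squares the two-tailed z critical value at α = 0.05 and recovers the familiar χ² critical value of 3.841 for one degree of freedom.

```python
from statistics import NormalDist

alpha = 0.05
# Two-tailed critical value of z: the point leaving alpha/2 in each tail.
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.960
# Squaring it recovers the chi-square critical value for 1 degree of freedom.
chi2_crit = z_crit ** 2                       # ~3.841
print(z_crit, chi2_crit)
```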
- The standard 2 × 2 χ² test is another way of calculating the z test for two independent proportions taken from the same population (Sheskin 1997: 226).
- This test is based on an even simpler test. The 2 × 1 (or 1 × 2) “goodness of fit” (g.o.f.) χ² test is an implementation of one of the simplest tests in statistics, called the Binomial test, or population z test (Sheskin 1997: 118). This test compares a sample observation against a predicted value which is assumed to be Binomially distributed.
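Both identities can be verified directly on data. The sketch below (plain Python; the cell counts are invented for illustration) computes each χ² statistic from observed and expected cell frequencies, computes the corresponding z statistic from the proportions, and confirms that χ² = z² in each case.

```python
# 2 x 1 goodness of fit: observed 30 successes out of n = 100,
# against an expected population proportion P = 0.4 (invented figures).
n, o, P = 100, 30, 0.4
p = o / n
observed = [o, n - o]                      # [30, 70]
expected = [n * P, n * (1 - P)]            # [40, 60]
chi2_gof = sum((ob - ex) ** 2 / ex for ob, ex in zip(observed, expected))
z_gof = (p - P) / (P * (1 - P) / n) ** 0.5
assert abs(chi2_gof - z_gof ** 2) < 1e-9   # both ~4.167

# 2 x 2 test of independence: 30/100 vs. 50/100 successes in two samples
# drawn from the same population (invented figures).
o1, n1, o2, n2 = 30, 100, 50, 100
table = [[o1, n1 - o1], [o2, n2 - o2]]
N = n1 + n2
col_totals = [o1 + o2, (n1 - o1) + (n2 - o2)]
chi2_2x2 = sum(
    (table[r][c] - n_r * col_totals[c] / N) ** 2 / (n_r * col_totals[c] / N)
    for r, n_r in zip(range(2), (n1, n2)) for c in range(2)
)
p1, p2 = o1 / n1, o2 / n2
p_pool = (o1 + o2) / N                     # pooled proportion for the z test
z_2x2 = (p1 - p2) / (p_pool * (1 - p_pool) * (1 / n1 + 1 / n2)) ** 0.5
assert abs(chi2_2x2 - z_2x2 ** 2) < 1e-9   # both ~8.333
```

The equalities hold algebraically, not just for these figures: any 2 × 2 or 2 × 1 table will satisfy the same assertions.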
If this is the case, why might we need chi-square? Pearson’s innovation in developing chi-square was to permit a test of a larger array with more than two values in either dimension, i.e. to extend the 2 × 2 test to a more general test with r rows and c columns. Similarly, the z test can be extended to an r × 1 χ² test in order to evaluate an arbitrary number of rows. Such a procedure permits us to detect significant variation across multiple values, rather than relying on 2-way comparisons. However, further analysis is then needed, in the form of 2 × 2 or 2 × 1 g.o.f. χ² tests, to identify which values are undergoing significant variation (see Section 3).
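As a sketch of this procedure (plain Python; the counts are invented), a 3 × 1 goodness of fit χ² with two degrees of freedom detects overall variation, and one common follow-up strategy, a 2 × 1 test of each category against the remainder, then locates it.

```python
# 3 x 1 goodness of fit against a uniform expected distribution.
observed = [50, 30, 20]                  # invented counts
N = sum(observed)
expected = [N / 3] * 3
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 2))  # 14.0, vs. a critical value of 5.99 (2 d.f., alpha = 0.05)

# Follow-up 2 x 1 g.o.f. tests: each category vs. the rest, 1 d.f. each.
for i, o in enumerate(observed):
    obs2 = [o, N - o]
    exp2 = [expected[i], N - expected[i]]
    chi2_i = sum((ob - ex) ** 2 / ex for ob, ex in zip(obs2, exp2))
    print(i, round(chi2_i, 2))           # 12.5, 0.5, 8.0
```

Here categories 0 and 2 exceed the 1 d.f. critical value of 3.84 while category 1 does not, identifying where the significant variation lies.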
The fundamental assumption of these tests can be stated in simple terms as follows:
An observed sample represents a limited selection from a much larger population. Were we to obtain multiple samples we might get slightly different results. In reporting results, therefore, we need a measure of their reliability. Stating that a result is significant at a certain level of error (α=0.01, for example) is another way of stating that, were we to repeat the experiment many times, the likelihood of obtaining a result other than that reported will be below this error level.
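This interpretation can be illustrated by simulation (a sketch with invented parameters, not part of the paper): if we repeatedly draw samples from a population in which the null hypothesis is true and test each at α = 0.05, the proportion of spurious “significant” results settles near 5%.

```python
import random
from statistics import NormalDist

random.seed(1)  # reproducible illustration
n, P, alpha, trials = 100, 0.5, 0.05, 10_000
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

rejections = 0
for _ in range(trials):
    successes = sum(random.random() < P for _ in range(n))  # one Binomial sample
    z = (successes / n - P) / (P * (1 - P) / n) ** 0.5      # population z test
    if abs(z) > z_crit:
        rejections += 1

# The observed rejection rate is close to alpha = 0.05; for discrete data
# at this sample size the plain test runs slightly above the nominal level.
print(rejections / trials)
```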
Contents
- Introduction
- The origin of χ²
2.1 Sampling assumptions
2.2 The ‘Wald’ confidence interval
2.3 Single-sample, population z tests and goodness of fit χ²
2.4 The Wilson score interval
2.5 The 2 × 2 χ² and z test for two independent proportions
2.6 The z test for two independent proportions from independent populations
2.7 Yates’ correction, log likelihood and other methods
- The appropriate use of χ²
3.1 Selecting tests
3.2 The problem of linguistic choice
3.3 Case interaction in corpora
3.4 Analysing larger tables
- Comparing the results of experiments
4.1 Measuring swing on a single dependent value
4.2 Measuring effect size over all dependent values
4.3 Using ϕ to measure effect size on a single dependent value
4.4 Testing swings for statistical separability
- Conclusions
Citation
Wallis, S.A. 2013. z-squared: the origin and application of χ². Journal of Quantitative Linguistics 20:4, 350-378. DOI:10.1080/09296174.2013.830554
- ePublication (Taylor & Francis online)
Citation (updated and extended)
Wallis, S.A. 2021. From Intervals to Tests. Chapter 8 in Wallis, S.A. Statistics in Corpus Linguistics Research. New York: Routledge. 134-165.
References
Sheskin, D.J. 1997. Handbook of Parametric and Nonparametric Statistical Procedures. Boca Raton, FL: CRC Press.