### Abstract

Paper (PDF)

A set of statistical tests termed *contingency tests*, of which χ² is the best-known example, is commonly employed in linguistics research. Contingency tests compare discrete distributions, that is, data divided into two or more alternative categories, such as alternative linguistic choices of a speaker or different experimental conditions. These tests are ubiquitous, and are part of every linguistics researcher’s arsenal.

However, the mathematical underpinnings of these tests are rarely discussed in the literature in an approachable way, with the result that many researchers may apply tests inappropriately, fail to see the possibility of testing particular questions, or draw unsound conclusions. Contingency tests are also closely related to the construction of *confidence intervals*, which are highly useful and revealing methods for plotting the certainty of experimental observations.

This paper is organised in the following way. The foundations of the simplest type of χ² test, the 2 × 1 goodness of fit test, are introduced and related to the *z* test for a single observed proportion *p* and the Wilson score confidence interval about *p*. We then show how the 2 × 2 test for independence (homogeneity) is derived from two observations *p*₁ and *p*₂ and explain when each test should be used. We also briefly introduce the Newcombe-Wilson test, which ideally should be used in preference to the χ² test for observations drawn from two independent populations (such as two subcorpora). We then turn to tests for larger tables, generally termed “*r* × *c*” tests, which have multiple degrees of freedom and therefore may encompass multiple trends, and discuss strategies for their analysis. Finally, we turn briefly to the question of differentiating test results. We introduce the concept of *effect size* (also termed ‘measures of association’) and explain how we may perform statistical *separability tests* to distinguish between two sets of results.

### Introduction

Karl Pearson’s famous chi-square test is derived from another statistic, called the *z* statistic, based on the Normal distribution.

The simplest versions of χ² can be shown to be **mathematically identical** to equivalent *z* tests. The tests produce the same result in all circumstances. For all intents and purposes “chi-squared” could be called “z-squared”. The critical values of χ² for one degree of freedom are the square of the corresponding critical values of *z*.
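The relationship between the critical values can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper name `chi2_1df_tail` is ours, not the paper's. It relies on the fact that a χ² variable with one degree of freedom is the square of a standard Normal variable, so its upper-tail probability can be written with the error function.

```python
import math
from statistics import NormalDist


def chi2_1df_tail(x):
    """Upper-tail probability of chi-square with 1 d.f.

    If Z is standard Normal, P(Z**2 > x) = P(|Z| > sqrt(x))
    = 1 - erf(sqrt(x / 2)).
    """
    return 1.0 - math.erf(math.sqrt(x / 2.0))


for alpha in (0.05, 0.01):
    # Two-tailed critical value of z at error level alpha.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Squaring it gives the chi-square critical value for 1 d.f.
    chi2_crit = z_crit ** 2
    # The tail area beyond chi2_crit recovers alpha.
    print(alpha, z_crit, chi2_crit, chi2_1df_tail(chi2_crit))
```

For α = 0.05 this is the familiar pairing of *z* ≈ 1.96 with χ² ≈ 3.84.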

- The standard 2 × 2 χ² test is another way of calculating the *z* test for two independent proportions taken from the same population (Sheskin 1997: 226).
- This test is based on an even simpler test. The 2 × 1 (or 1 × 2) “goodness of fit” (g.o.f.) χ² test is an implementation of one of the simplest tests in statistics, called the Binomial test, or **population** *z* test (Sheskin 1997: 118). This test compares a sample observation against a predicted value which is assumed to be Binomially distributed.
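Both equivalences can be illustrated with a few lines of arithmetic. The sketch below is not taken from the paper: the function names and the frequencies are hypothetical, and the two-proportion *z* test is assumed to use the pooled variance estimate (that is, both samples are treated as drawn from the same population). No continuity correction is applied.

```python
import math


def z_two_proportions(a, n1, b, n2):
    """z for p1 = a/n1 against p2 = b/n2, using the pooled proportion."""
    p1, p2 = a / n1, b / n2
    p = (a + b) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


def chi2_2x2(a, n1, b, n2):
    """Pearson chi-square for the table [[a, n1 - a], [b, n2 - b]]."""
    observed = [[a, n1 - a], [b, n2 - b]]
    n = n1 + n2
    row_totals = [n1, n2]
    col_totals = [a + b, n - (a + b)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


def z_single_proportion(k, n, P):
    """Population z test: observed p = k/n against an expected proportion P."""
    return (k / n - P) / math.sqrt(P * (1 - P) / n)


def chi2_gof_2x1(k, n, P):
    """2 x 1 goodness of fit chi-square against expected [nP, n(1 - P)]."""
    observed = [k, n - k]
    expected = [n * P, n * (1 - P)]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical frequencies, for illustration only.
print(z_two_proportions(20, 100, 35, 100) ** 2, chi2_2x2(20, 100, 35, 100))
print(z_single_proportion(30, 100, 0.4) ** 2, chi2_gof_2x1(30, 100, 0.4))
```

In each printed pair the two numbers agree up to floating-point rounding, which is the sense in which χ² is “z-squared” for these tables.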

If this is the case, **why might we need chi-square?** Pearson’s innovation in developing chi-square was to permit a test of a larger array with *multiple values greater than 2*, i.e., to extend the 2 × 2 test to a more general test with *r* rows and *c* columns. Similarly the *z* test can be extended to an *r* × 1 χ² test in order to evaluate an arbitrary number of rows. Such a procedure permits us to detect significant variation across multiple values, rather than rely on 2-way comparisons. However, further analysis is then needed, in the form of 2 × 2 or 2 × 1 g.o.f. χ² tests, to identify **which** values are undergoing significant variation (see Section 3).
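As a rough illustration of the general case, the sketch below computes Pearson's χ² for an arbitrary *r* × *c* table of frequencies, with expected values derived from the row and column totals and (*r* − 1)(*c* − 1) degrees of freedom. The function name `chi2_r_by_c` and the example table are hypothetical. The code assumes every row and column total is non-zero.

```python
def chi2_r_by_c(table):
    """Return (chi-square, degrees of freedom) for an r x c frequency table.

    `table` is a list of r rows, each a list of c observed frequencies.
    Expected cell values assume independence of rows and columns.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


# Hypothetical 2 x 3 table: two subcorpora by three alternative forms.
chi2, df = chi2_r_by_c([[20, 30, 50], [35, 30, 35]])
print(chi2, df)
```

A single score over the whole table says only that some variation is present; it does not identify which cells contribute, which is why the follow-up 2 × 2 or 2 × 1 tests are needed.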

The fundamental assumption of these tests can be stated in simple terms as follows:

An observed sample represents a limited selection from a much larger population. Were we to obtain multiple samples we might get slightly different results. In reporting results, therefore, we need a measure of their **reliability**. Stating that a result is significant at a certain level of error (α=0.01, for example) is another way of stating that, were we to repeat the experiment many times, the likelihood of obtaining a result other than that reported will be below this error level.

### Contents

1. Introduction
2. The origin of χ²
   - 2.1 Sampling assumptions
   - 2.2 The ‘Wald’ confidence interval
   - 2.3 Single-sample, population *z* tests and goodness of fit χ²
   - 2.4 The Wilson score interval
   - 2.5 The 2 × 2 χ² and *z* test for two independent proportions
   - 2.6 The *z* test for two independent proportions from independent populations
   - 2.7 Yates’ correction, log likelihood and other methods
3. The appropriate use of χ²
   - 3.1 Selecting tests
   - 3.2 The problem of linguistic choice
   - 3.3 Case interaction in corpora
   - 3.4 Analysing larger tables
4. Comparing the results of experiments
   - 4.1 Measuring swing on a single dependent value
   - 4.2 Measuring effect size over all dependent values
   - 4.3 Using φ to measure effect size on a single dependent value
   - 4.4 Testing swings for statistical separability
5. Conclusions

### Citation

Wallis, S.A. 2013. *z*-squared: the origin and application of χ². *Journal of Quantitative Linguistics* **20**:4, 350-378. DOI:10.1080/09296174.2013.830554

- ePublication (Taylor & Francis online)

### See also

- Binomial confidence intervals and contingency tests
- Comparing χ² tests for separability
- Preprint (PDF)
- PowerPoint slides
- Excel spreadsheets
- Binomial → Normal → Wilson

### References

Sheskin, D.J. 1997. *Handbook of Parametric and Nonparametric Statistical Procedures*. Boca Raton, FL: CRC Press.