### Introduction

One of the most common questions a new researcher has to deal with is the following:

*what is the right statistical test for my purpose?*

To answer this question we must distinguish between

- different **experimental designs**, and
- optimum **methods** for testing significance.

In corpus linguistics, many research questions involve choice. The speaker can say *shall* or *will*, choose to add a postmodifying clause to an NP or not, etc. If we want to know what factors influence this choice then these factors are termed **independent variables** (IVs) and the choice is the **dependent variable** (DV). These choices are mutually exclusive alternatives. Framing the research question like this immediately helps us focus in on the appropriate class of tests.

### Tests for categorical data

The most common scenario in corpus linguistics is when both independent and dependent variables are **categorical**, which is why in recent years I’ve focused on this area in particular. The most well-known test is “the χ² test” (more correctly, the contingency test), which comes in two versions (Wallis 2013).

In contingency tests, data is expressed in the form of a contingency table of frequencies. The independent variable *B* has a discrete set of categories and therefore produces a discrete frequency distribution for each column of the dependent variable *A*. This means that for every value of *A* and *B* (let’s call these *i* and *j* respectively) there is a frequency count, *f*(*i*, *j*), representing the number of times in the dataset that *A* = *i* and *B* = *j*.

- **Goodness of fit tests** are used to compare a distribution over a **selected value**, *a*, of a variable, *A*, with the **overall distribution**. “Goodness of fit” means that the distribution at *a* fits the distribution at *A*. It is also referred to as an *r* × 1 test because it evaluates a distribution of *r* cells for a single value of *A*. A significant result means that we can reject the null hypothesis that the distribution at *a* matches the overall distribution at *A*.
- **Independence tests** are used to evaluate whether the value of one variable is independent from the value of the other. We typically use it to test the extent to which, were we to know the value of the independent variable (IV, *B*), we could predict the value of the dependent variable (DV, *A*). Note that the test is reversible: were we to swap *A* and *B* we would obtain the same test result. It is also referred to as an **homogeneity test**, and may also be referred to as an *r* × *c* test because it compares distributions of *r* cells across all *c* subvalues of *A*. A significant result means that we can reject the null hypothesis that the two variables are independent.

These tests essentially operate by performing two steps: calculate the size of an effect, and then compare this effect size with a limit: a confidence interval or critical value.
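The two steps can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names, the small table of 0.05 critical values and the frequencies are all assumptions made for the example, and no continuity correction is applied.

```python
# Critical values of chi-square at an error level of 0.05 (standard tables).
CHI2_CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def goodness_of_fit(observed, overall):
    """r x 1 test: compare the distribution at one value with the overall one."""
    n, total = sum(observed), sum(overall)
    expected = [n * f / total for f in overall]  # overall shape, scaled to n
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, len(observed) - 1


def independence(table):
    """r x c test on a table given as a list of rows of frequencies."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected under independence
            chi2 += (o - e) ** 2 / e
    return chi2, (len(table) - 1) * (len(col_totals) - 1)


def is_significant(chi2, df):
    """Step 2: compare the score with the critical value for df."""
    return chi2 > CHI2_CRITICAL_05[df]


# Invented frequencies, for illustration only.
table = [[20, 30], [40, 10]]
chi2, df = independence(table)                      # chi2 is about 16.67, df = 1
fit, fit_df = goodness_of_fit([20, 30], [60, 40])   # fit is about 8.33, df = 1
print(is_significant(chi2, df), is_significant(fit, fit_df))
```

Here the goodness of fit call compares the first row of the table with the column totals, which is one way of reading “the distribution at *a*” against “the overall distribution at *A*”.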

Simple 2 × 1 and 2 × 2 tests are **more powerful** in practice than larger tables (*r* × 1, *r* × *c*). They have one degree of freedom, and make few assumptions about the data. They therefore test only one “thing” at a time. Over the years I’ve become a fan of these simple tests – hence this spreadsheet.

A specialised goodness of fit test, most easily calculated using the single sample *z* test, compares two probabilities drawn from the same sample for significant difference (i.e. difference from **E** = {0.5*n*, 0.5*n*}). See below.

### Tests for comparing results

It is also possible to perform a further type of “meta-test” which compares results obtained from the first two tests. It is common to see individual χ² scores or error levels cited in papers, but this is poor practice. To be candid, this information is almost completely useless: the fact that one test obtains a higher χ² score or smaller α than another does not mean that the effect witnessed is greater, ‘stronger’ etc.

It is permissible to cite **sizes of effect** *descriptively* (that is, to describe the sample). However, the optimum approach to comparing outcomes is to employ a separability test.

**Separability tests** (Wallis 2011) evaluate whether the results of two comparable experiments are significantly different from each other. Whereas the goodness of fit and homogeneity tests look for a significant *non-zero* difference between *a* and *A* or between *a*₀, *a*₁, *a*₂, etc., a separability test operates at a higher level. It attempts to decide whether two sets of results from earlier subtests are significantly different. A significant result allows us to reject the null hypothesis that the two results say the same thing about the population.

There are different separability tests for comparing goodness of fit tables (what we might term “separability of fit”) and homogeneity tables (“separability of independence”), illustrated by the figure above. Note that it only makes sense to perform this type of meta-analysis when pairs of tables have the **same structure**: if they are structurally different then they are different anyway! This test can be used to compare the results of the same experiment performed on different samples (e.g. from different corpora) or when different definitions of variables are used. Aarts, Close and Wallis (2013) employed this test in a step-wise fashion, changing one parameter at a time, to compare their results with those of previous researchers.

### Optimum methods of calculation

It is possible to use different formulae or **methods** to carry out these tests. The standard χ² calculation has known weaknesses, and to address these a number of alternatives have been proposed, including performing ‘exact’ tests (Binomial, Fisher), employing Log-likelihood, and applying Yates’ or Williams’ corrections. So the question then is which method should we choose?

Fortunately, a number of authors (in particular, Robert Newcombe (1998a and b), but see also my own modest effort) have put the time in to evaluate confidence intervals and tests, and we can therefore offer some straightforward advice on this topic.

- With one degree of freedom (2 × 1, 2 × 2), use **Yates’ continuity-corrected χ²** in preference to standard χ² tests.
- If the independent variable subdivides the corpus by speaker or text, then strictly speaking you should use an independent-population test. A good test is the 2 × 2 **Newcombe-Wilson test with continuity-correction** (Newcombe 1998b, Wallis 2009). This approach is also recommended for separability tests.
- With multiple degrees of freedom, use an *r* × *c* χ² test, collapsing cells as necessary (see Wallis 2013). Examine tables for areas of greatest change (χ² partials) and subdivide as required.
- Log-likelihood is **not** an improvement on χ² – it employs different assumptions and has some interesting properties, which are exploited in log-linear models – but it is not a better “χ² test”.
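For the one-degree-of-freedom case, the first two recommendations might be sketched as follows. The code uses the textbook form of Yates’ correction for a 2 × 2 table, the continuity-corrected Wilson score interval, and the difference interval built from two such intervals as described by Newcombe (1998a, 1998b). The function names and the example frequencies are assumptions for illustration only.

```python
import math

Z_05 = 1.959964  # two-tailed critical value of z at an error level of 0.05


def yates_chi2_2x2(a, b, c, d):
    """Yates' continuity-corrected chi-square for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * max(0.0, abs(a * d - b * c) - n / 2) ** 2
    return numerator / ((a + b) * (c + d) * (a + c) * (b + d))


def wilson_cc(f, n, z=Z_05):
    """Wilson score interval with continuity correction for p = f / n."""
    p = f / n
    denom = 2 * (n + z * z)
    centre = 2 * n * p + z * z
    base = z * z - 1 / n + 4 * n * p * (1 - p)
    lower = 0.0 if f == 0 else max(
        0.0, (centre - (z * math.sqrt(base + (4 * p - 2)) + 1)) / denom)
    upper = 1.0 if f == n else min(
        1.0, (centre + (z * math.sqrt(base - (4 * p - 2)) + 1)) / denom)
    return lower, upper


def newcombe_wilson_cc(f1, n1, f2, n2, z=Z_05):
    """Interval for d = p1 - p2 from two continuity-corrected Wilson intervals."""
    p1, p2 = f1 / n1, f2 / n2
    l1, u1 = wilson_cc(f1, n1, z)
    l2, u2 = wilson_cc(f2, n2, z)
    d = p1 - p2
    lower = d - math.sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = d + math.sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return d, lower, upper


# Invented data: 20 out of 50 cases in one subcorpus, 40 out of 50 in another.
chi2 = yates_chi2_2x2(20, 30, 40, 10)      # compare with 3.841 (one df)
d, lower, upper = newcombe_wilson_cc(20, 50, 40, 50)
significant = not (lower <= 0.0 <= upper)  # significant if zero is excluded
```

In this sketch the Newcombe-Wilson result is read as a test by checking whether the interval for the difference excludes zero.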

Finally, don’t pick and choose alternative formulae just to see if you can obtain a significant result. Select a method and error level and stick to it.

### Testing frequencies from the same sample

A special case of the goodness of fit test may be used to compare probabilities drawn from the same sample. Consider a discrete frequency distribution $F = \{f_1, f_2, \ldots\}$ summing to $n$. We can plot Wilson score intervals on probabilities $p_i = f_i / n$. If two intervals do not overlap, the difference must be significant.

The null hypothesis is that a pair of frequencies, $f_a$, $f_b$, are approximately the same, and bisect their data neatly:

$$O = \{f_a, f_b\} \approx E = \{0.5(f_a + f_b),\ 0.5(f_a + f_b)\}.$$

Note that for the purpose of this calculation we ignore all other frequencies apart from this pair. See Comparing frequencies within a discrete distribution. An alternative calculation employs the *z* test for a population probability *P*, where *P* = 0.5.
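A minimal sketch of both calculations follows, assuming the standard Wilson score interval and the single-sample *z* test against *P* = 0.5, without continuity correction for brevity. The function names and the frequencies are invented for the example.

```python
import math

Z_05 = 1.959964  # two-tailed critical value of z at an error level of 0.05


def wilson(f, n, z=Z_05):
    """Wilson score interval for the probability p = f / n."""
    p = f / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def pair_z_test(fa, fb, z=Z_05):
    """z test of fa / (fa + fb) against P = 0.5, ignoring all other cells."""
    n = fa + fb
    z_score = (fa / n - 0.5) / math.sqrt(0.25 / n)
    return z_score, abs(z_score) > z


# Invented distribution F = {30, 50, 20}, so n = 100.
freqs = [30, 50, 20]
n = sum(freqs)
intervals = [wilson(f, n) for f in freqs]   # one interval per p_i = f_i / n
z_score, significant = pair_z_test(30, 50)  # compare the first two cells only
```

For the pair (30, 50), the squared *z* score equals the 2 × 1 χ² computed against the expected distribution {40, 40}, which is why the two calculations are interchangeable.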

### Tests for other types of data

The situation starts to become more complicated when one or more of the variables are not categorical. There is a range of tests designed for **ranked** and **interval/ratio** data, usefully divided according to whether one or other variable is categorical or not.

- If the **independent variable** is categorical you should employ **tests for two or more independent samples** (χ², Mann-Whitney *U*, Student’s *t* test etc).
- If the **dependent variable** is categorical you *can* employ the same tests but their interpretation may be less clear. A significant result from a reversed-order test is evidence of interaction between the two variables.\*
- Otherwise, employ graph plotting and **regression** (Spearman’s *R²*, Pearson’s *r²*).

[\*For example, the *t* test for two independent samples is commonly stated such that the independent variable (subsample) is Boolean (e.g. speech vs. writing) and the dependent variable is at least on an interval scale (e.g. clause length). A significant result tells us that the mean length of clauses varies according to whether it is found in speech or writing. But *the test can also be applied in reverse*: given a clause length we may infer (stylistically) whether the text it is found in comes from speech or writing. Correlations can be interpreted in both directions, just like the χ² independence test. Arguably the distinction between independent and dependent variables is philosophically less important in *ex post facto* data analysis than in lab experiments where the independent variable may be controlled or manipulated by the researcher.]

On this blog I’m not attempting to reproduce every single test under the sun. Surveys of standard statistical tests can be found in numerous experimental design and statistics textbooks. For example, Chapter 1 in Oakes (1998) provides a useful (if rather rapid) summary of tests with practical corpus-based examples. If you can persevere with the algebra, Sheskin (1997) is recommended for a rather more comprehensive review (a useful decision table is on pp. 28-30).

However when deciding between tests, bear in mind that *analysis often benefits from simplicity*. The following steps are all perfectly legitimate.

- **Use a weaker test.** It is always possible to sacrifice information and employ a test that makes fewer assumptions about the data, if no other option is available.
- **Merge cells.** Just as we may merge cells in contingency tables, numeric variables may be “quantised” (e.g. “time” could be annual data, split into decades or just “early” vs. “late”, see below).

This means that ranked or interval data can be grouped into categories and a contingency test applied, even though this process throws away information and is less theoretically powerful than, say, a test exploiting the fact that data is grouped in a ranked order.
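Quantising a numeric variable is mechanically simple. The sketch below, with hypothetical function names and invented observations, groups annual data into bands of a chosen width and counts the result into the cells of a contingency table.

```python
from collections import Counter


def quantise_year(year, width=10):
    """Map a year to the start of its band, e.g. 1967 -> 1960 for decades."""
    return year - year % width


def contingency(observations, width=10):
    """Count (band, choice) pairs from an iterable of (year, choice) pairs."""
    return Counter((quantise_year(year, width), choice)
                   for year, choice in observations)


# Invented observations: the year of a text and the form chosen.
data = [(1961, "shall"), (1967, "will"), (1968, "shall"), (1992, "will")]
by_decade = contingency(data)            # cells keyed by (decade, form)
by_five = contingency(data, width=5)     # finer bands keep more information
```

Each choice of `width` gives a different table from the same data, so the band width is part of the experimental design and should be fixed before testing.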

On the other hand, sophisticated regression techniques and parametric tests are powerful but employ more assumptions. As with all analytical methods, test results must be carefully interpreted and explained. The main pitfalls with regression techniques concern the fact that their apparent power can be misleading *because your assumptions may be wrong*! Even an intuitive concept like ‘simplicity’ (parsimony) relies on the variables chosen and how they are expressed. My advice would therefore be to *use these methods last*, and always be explicit about the assumptions they rely on.

The first step of any analysis is to **plot the data**, with confidence intervals if at all possible, so that you can get a proper idea of what might be going on. Then, depending on the volume of data and the scales of evidence, you can consider posing more specific questions and carrying out more precise analyses.

### Working with time series data

For example, Aarts *et al*. investigated the alternation of *shall / will* over time in late 20th Century spoken British English. The graph above shows:

- **pink Xs**: centres of “early” vs. “late” (1960s vs. 1990s) data: with two values, “time” is effectively **categorical** (Boolean).
- **blue dots** (with error bars representing confidence intervals): data grouped into five-year categories (1960-64, 65-69, etc.): “time” may now be considered **interval** data.
- **dashed line**: an estimated best-fit logistic curve within these intervals.

Note that two-valued interval data is treated as categorical. For a meaningful line we need at least three values. The logistic (‘S’) model is considered an extremely simple default pattern (to understand why see Wallis 2010). We can’t really say that the first datapoint on the left (1960-65) is falling below this idealisation: we don’t have enough data to make this claim.

We can also **compare pairs of confidence intervals** serially.

- **If intervals do not overlap:** the difference is significant.
- **If one interval includes the other observed point:** the difference *cannot be* significant.
- **Otherwise:** test the difference between points with a 2 × 2 χ² or Newcombe-Wilson test.
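This three-way rule is easy to express in code. The sketch below assumes each observation is given as an observed probability with the lower and upper bounds of its confidence interval. The function name and the return labels are illustrative choices.

```python
def compare_intervals(p1, lo1, hi1, p2, lo2, hi2):
    """Apply the interval-overlap rule to two observed probabilities."""
    # Intervals do not overlap: the difference is significant.
    if hi1 < lo2 or hi2 < lo1:
        return "significant"
    # One interval includes the other observed point: cannot be significant.
    if lo1 <= p2 <= hi1 or lo2 <= p1 <= hi2:
        return "not significant"
    # Intervals overlap, but neither contains the other point: run a test.
    return "test required"


# Invented values for illustration.
print(compare_intervals(0.30, 0.20, 0.40, 0.60, 0.50, 0.70))
print(compare_intervals(0.30, 0.20, 0.40, 0.35, 0.25, 0.45))
print(compare_intervals(0.30, 0.20, 0.40, 0.45, 0.38, 0.52))
```

Only the third outcome calls for a further calculation, so in a long time series most pairwise comparisons can be settled by inspection.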

**Note:** The same approach can be used in comparing frequencies drawn from the same distribution (e.g. if we compare different *p* lines for the same time). Where probabilities are drawn from the same population we use the stricter goodness of fit test (see above).

We can immediately see that, in the figure above, the probability of *shall* **does not significantly change** over the period 1960-1980, because all intervals include the observed point (blue dot) of the next. However the interval for 1990-95 slightly overlaps the interval for 1975-80 without including the observed *p* (and vice versa), and we should therefore perform a test. This **does** obtain a significant result.

In sum, understanding your data and getting the experimental design right are more important than picking the optimum test. Experimental research is cautious: to form robust conclusions we would rather make fewer assumptions and risk rejecting significant results which might be picked up with stronger tests.

### See also

- *z*-squared: the origin and application of χ²
- Binomial confidence intervals and contingency tests
- Comparing χ² tests for separability
- Comparing frequencies within a discrete distribution
- Excel spreadsheets

### References

Aarts, B., Close, J., and Wallis, S.A. 2013. Choices over time: methodological issues in investigating current change. Chapter 2 in Aarts, B., Close, J., Leech, G. and Wallis, S.A. (eds.) *The Verb Phrase in English*. Cambridge: CUP.

Newcombe, R.G. 1998a. Two-sided confidence intervals for the single proportion: comparison of seven methods. *Statistics in Medicine* **17**: 857-872.

Newcombe, R.G. 1998b. Interval estimation for the difference between independent proportions: comparison of eleven methods. *Statistics in Medicine* **17**: 873-890.

Oakes, M.P. 1998. *Statistics for Corpus Linguistics*. Edinburgh: EUP.

Sheskin, D.J. 1997. *Handbook of Parametric and Nonparametric Statistical Procedures*. Boca Raton, Fl: CRC Press.

Wallis, S.A. 2013. *z*-squared: the origin and application of χ². *Journal of Quantitative Linguistics* **20**:4, 350-378.