Choosing the right test


One of the most common questions a new researcher has to deal with is the following:

what is the right statistical test for my purpose?

To answer this question we must distinguish between

  1. different experimental designs, and
  2. optimum methods for testing significance.

In corpus linguistics, many research questions involve choice. The speaker can say shall or will, choose to add a postmodifying clause to an NP or not, etc. If we want to know what factors influence this choice, then these factors are termed independent variables (IVs) and the choice is the dependent variable (DV). These choices are mutually exclusive alternatives. Framing the research question like this immediately helps us focus in on the appropriate class of tests.

Tests for categorical data

The most common scenario in corpus linguistics is when both independent and dependent variables are categorical, which is why in recent years I’ve focused on this area in particular. The most well-known test is “the χ² test” (more correctly, the contingency test), which comes in two versions (Wallis 2013).

In contingency tests, data is expressed in the form of a contingency table of frequencies. The independent variable B has a discrete set of categories and therefore produces a discrete frequency distribution for each column of the dependent variable A. This means that for every value of A and B (let’s call these i and j respectively) there is a frequency count, f(i, j), representing the number of times in the dataset that A=i and B=j.
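To make the idea of a frequency count for each pair of values concrete, here is a minimal sketch in Python. The data and the function name `contingency_table` are invented for illustration; the only assumption is that each case in the dataset has been reduced to a pair of values, one for A and one for B.

```python
from collections import Counter

def contingency_table(observations):
    """Count how often each (A value, B value) pair occurs.

    `observations` is an iterable of (a_value, b_value) pairs.
    """
    return Counter(observations)

# Hypothetical data: A is the choice (shall/will), B is the period.
data = [("shall", "early"), ("will", "early"), ("will", "late"),
        ("shall", "early"), ("will", "late"), ("will", "late")]
f = contingency_table(data)

print(f[("shall", "early")])  # 2
print(f[("shall", "late")])   # 0 (a Counter returns 0 for unseen pairs)
```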

Goodness of fit and homogeneity tests.

  1. Goodness of fit tests are used to compare a distribution over a selected value, a, of a variable, A, with the overall distribution. “Goodness of fit” means that the distribution at a fits the distribution at A. It is also referred to as an r × 1 test because it evaluates a distribution of r cells for a single value of A. A significant result means that we can reject the null hypothesis that the distribution at a matches the overall distribution at A.
  2. Independence tests are used to evaluate whether the value of one variable is independent from the value of the other. We typically use it to test the extent to which, were we to know the value of the independent variable (IV, B), we could predict the value of the dependent variable (DV, A). Note that the test is reversible: were we to swap A and B we would obtain the same test result. It is also referred to as a homogeneity test, and may also be referred to as an r × c test because it compares distributions of r cells across all c subvalues of A. A significant result means that we can reject the null hypothesis that the two variables are independent.

These tests essentially operate by performing two steps: calculate the size of an effect, and then compare this effect size with a limit: a confidence interval or critical value.
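The two steps can be sketched with the standard Pearson χ² formula: sum (O − E)² / E over the cells, then compare the total with a critical value. This is only a sketch. The function names and the example table are hypothetical, and the critical value shown applies to one degree of freedom at an error level of 0.05.

```python
def chi_square_gof(observed, expected_probs):
    """r x 1 goodness of fit: observed counts against expected proportions."""
    n = sum(observed)
    expected = [n * p for p in expected_probs]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_square_independence(table):
    """r x c independence (homogeneity) statistic for a list-of-rows table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected cell frequency
            chi2 += (o - e) ** 2 / e
    return chi2

CRITICAL_DF1 = 3.841  # chi-square critical value, 1 degree of freedom, alpha = 0.05

table = [[20, 30], [30, 20]]  # hypothetical 2 x 2 frequencies
chi2 = chi_square_independence(table)
print(chi2, chi2 > CRITICAL_DF1)  # 4.0 True
```

For tables with more than one degree of freedom the critical value is larger, so it needs to be looked up for the appropriate number of degrees of freedom.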

Simple 2 × 1 and 2 × 2 tests are more powerful in practice than larger tables (r × 1, r × c). They have one degree of freedom, and make few assumptions about the data. They therefore test only one “thing” at a time. Over the years I’ve become a fan of these simple tests – hence this spreadsheet.

A specialised goodness of fit test, most easily calculated using the single sample z test, compares two probabilities drawn from the same sample for significant difference (i.e. difference from E = {0.5n, 0.5n}). See below.

Tests for comparing results

It is also possible to perform a further type of “meta-test” which compares results obtained from the first two tests. It is common, but poor practice, to cite individual χ² scores or error levels in papers. To be candid, this information is almost completely useless: the fact that one test obtains a higher χ² score or smaller α than another does not mean that the effect witnessed is greater, ‘stronger’ etc.

It is permissible to cite sizes of effect descriptively (that is, to describe the sample). However, the optimum approach to comparing outcomes is to employ a separability test.

  1. Separability tests (Wallis 2011) evaluate whether the results of two comparable experiments are significantly different from each other. Whereas the goodness of fit and homogeneity tests look for a significant non-zero difference between a and A or between a₀, a₁, a₂, etc., a separability test operates at a higher level. It attempts to decide whether two sets of results from earlier subtests are significantly different. A significant result allows us to reject the null hypothesis that the two results say the same thing about the population.

Separability tests: upper, comparing goodness of fit A=a and A’=a’ and lower, comparing independence of a/¬a and a’/¬a’.

There are different separability tests for comparing goodness of fit tables (what we might term “separability of fit”) and homogeneity tables (“separability of independence”), illustrated by the figure above. Note that it only makes sense to perform this type of meta-analysis when pairs of tables have the same structure: if they are structurally different then they are different anyway! This test can be used to compare the results of the same experiment performed on different samples (e.g. from different corpora) or when different definitions of variables are used. Aarts, Close and Wallis (2013) employed this test in a step-wise fashion, changing one parameter at a time, to compare their results with those of previous researchers.
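As a rough illustration of what it means to ask whether two 2 × 2 results differ, the sketch below compares the difference in proportions found in one table with the difference found in another, using a simple Wald-style z score. This is a deliberately simplified stand-in, not the procedure described in the papers cited here. The function names and figures are hypothetical.

```python
import math

def diff_and_variance(f1, n1, f2, n2):
    """Difference between two proportions and its Wald variance estimate."""
    p1, p2 = f1 / n1, f2 / n2
    return p1 - p2, p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2

def separability_z(table_a, table_b):
    """z score for the difference between two 2 x 2 effect sizes.

    Each table is ((f1, n1), (f2, n2)): hits and totals for the two
    values of the independent variable.
    """
    d_a, var_a = diff_and_variance(*table_a[0], *table_a[1])
    d_b, var_b = diff_and_variance(*table_b[0], *table_b[1])
    return (d_a - d_b) / math.sqrt(var_a + var_b)

Z_CRIT = 1.959964  # two-tailed Normal critical value, alpha = 0.05

# Hypothetical results from two corpora with the same table structure.
corpus_a = ((40, 100), (20, 100))
corpus_b = ((30, 100), (25, 100))
z = separability_z(corpus_a, corpus_b)
print(abs(z) > Z_CRIT)  # True would mean the two results are separable
```

Note that both tables are read in the same order, which only makes sense if they have the same structure.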

A recent innovation (Wallis 2017) employs a separability test to assess whether an association between two Binomial variables (detected by a 2 × 2 test for homogeneity) can be said to be directional, i.e. such that one variable has a greater predictive impact on the other than vice-versa.

Optimum methods of calculation

It is possible to use different formulae or methods to carry out these tests. The standard χ² calculation has known weaknesses, and to address these a number of alternatives have been proposed, including performing ‘exact’ tests (Binomial, Fisher), employing Log-likelihood, and applying Yates’ or Williams’ corrections. So the question then is which method should we choose?

Fortunately, a number of authors (in particular, Robert Newcombe (1998a and b), but also see my own modest effort) have put in the time to evaluate confidence intervals and tests, and we can therefore offer some straightforward advice on this topic.

  • In theory, ‘exact’ tests are preferable to those based on the Normal approximation to the Binomial (z, χ², etc.), but are computationally costly and difficult to generalise. The oft-cited advantage of precision may be valuable in border-line cases.
  • With one degree of freedom (2 × 1, 2 × 2), use Yates’ continuity-corrected χ² in preference to standard χ² tests.
  • If the independent variable subdivides the corpus by speaker or text, then strictly speaking you should use an independent-population test. A good test is the 2 × 2 Newcombe-Wilson test with continuity-correction (Newcombe 1998b, Wallis 2009). This approach is also recommended for separability tests.
  • With multiple degrees of freedom, use an r × c χ² test, collapsing cells as necessary (see Wallis 2013). Examine tables for areas of greatest change (χ² partials) and subdivide as required. Don’t be afraid to plot graphs of probability variation with confidence intervals!
  • Log-likelihood is not an improvement on χ² – it employs different assumptions and has some interesting properties, which are exploited in log-linear models – but it is not a better “χ² test”.
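For the one-degree-of-freedom case, Yates’ correction can be sketched for a 2 × 2 table using the usual shortcut formula, in which |ad − bc| is reduced by n/2 before squaring. The function name and the example frequencies are hypothetical.

```python
def yates_chi_square_2x2(a, b, c, d):
    """Yates' continuity-corrected chi-square for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    # Reduce |ad - bc| by n/2, but never below zero.
    numerator = n * max(0.0, abs(a * d - b * c) - n / 2) ** 2
    return numerator / (row1 * row2 * col1 * col2)

CRITICAL_DF1 = 3.841  # chi-square critical value, 1 degree of freedom, alpha = 0.05

corrected = yates_chi_square_2x2(20, 30, 30, 20)  # hypothetical frequencies
print(corrected > CRITICAL_DF1)
```

With these figures the uncorrected χ² is 4.0, which exceeds the critical value, whereas the corrected score is smaller and does not. The correction makes the test more conservative.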

Finally, don’t pick and choose alternative formulae just to see if you can obtain a significant result. Select a method and error level and stick to it.

Comparing competing frequencies drawn from the same sample

A special case of the goodness of fit test may be used to compare probabilities drawn from the same sample. Consider a discrete frequency distribution F = {f₁, f₂,…} summing to n. We can plot Wilson score intervals on probabilities pᵢ = fᵢ / n. If two intervals do not overlap, the difference must be significant.
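A minimal sketch of the Wilson score interval calculation follows, without continuity correction and at a 95% level. The distribution F is invented for illustration.

```python
import math

Z = 1.959964  # two-tailed Normal critical value for a 95% interval

def wilson_interval(f, n, z=Z):
    """Wilson score interval (lower, upper) for the proportion p = f / n."""
    p = f / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom

# Hypothetical frequency distribution F summing to n = 100.
F = [60, 30, 10]
n = sum(F)
intervals = [wilson_interval(f, n) for f in F]
for f, (lower, upper) in zip(F, intervals):
    print(f"p = {f / n:.2f}, interval = ({lower:.3f}, {upper:.3f})")
```

Here the interval for the first probability lies wholly above the interval for the second, so that difference would be judged significant without further testing.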

The null hypothesis is that a pair of frequencies, f_a and f_b, are approximately the same, in which case they would bisect the data neatly:

O = {f_a, f_b}  ≈  E = {0.5(f_a + f_b), 0.5(f_a + f_b)}.

Note that for the purpose of this calculation we ignore all other frequencies apart from this pair. See Comparing frequencies within a discrete distribution. An alternative calculation employs the z test for a population probability P, where P = 0.5.
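Both calculations can be sketched in a few lines. This version applies no continuity correction, and the function names and frequencies are hypothetical. Squaring the z score should give the same value as the 2 × 1 χ² score.

```python
import math

def pair_z_score(fa, fb):
    """z score for the null hypothesis that fa and fb split fa + fb evenly (P = 0.5)."""
    n = fa + fb
    p = fa / n
    return (p - 0.5) / math.sqrt(0.25 / n)  # P(1 - P) / n with P = 0.5

def pair_chi_square(fa, fb):
    """Equivalent 2 x 1 goodness of fit chi-square against E = {0.5n, 0.5n}."""
    e = 0.5 * (fa + fb)
    return (fa - e) ** 2 / e + (fb - e) ** 2 / e

Z_CRIT = 1.959964  # two-tailed Normal critical value, alpha = 0.05

z = pair_z_score(60, 30)  # hypothetical pair of frequencies
print(abs(z) > Z_CRIT)
```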

Tests for other types of data

The situation starts to become more complicated when one or more of the variables are not categorical. There is a range of tests designed for ranked and interval/ratio data, usefully divided according to whether one or other variable is categorical or not.

  • If the independent variable is categorical you should employ tests for two or more independent samples (χ², Mann-Whitney U, Student’s t test etc).
  • If the dependent variable is categorical you can employ the same tests but their interpretation may be less clear. A significant result from a reversed-order test is evidence of interaction between the two variables.*
  • Otherwise, employ graph plotting and regression (Spearman’s ρ, Pearson’s r).

[*For example, the t test for two independent samples is commonly stated such that the independent variable (subsample) is Boolean (e.g. speech vs. writing) and the dependent variable is at least on an interval scale (e.g. clause length). A significant result tells us that the mean length of clauses varies according to whether it is found in speech or writing. But the test can also be applied in reverse: given a clause length we may infer (stylistically) whether the text it is found in comes from speech or writing. Correlations can be interpreted in both directions, just like the χ² independence test. Arguably the distinction between independent and dependent variables is philosophically less important in ex post facto data analysis than in lab experiments where the independent variable may be controlled or manipulated by the researcher.]

On this blog I’m not attempting to reproduce every single test under the sun. Surveys of standard statistical tests can be found in numerous experimental design and statistics textbooks. For example, Chapter 1 in Oakes (1998) provides a useful (if rather rapid) summary of tests with practical corpus-based examples. If you can persevere with the algebra, Sheskin (1997) is recommended for a rather more comprehensive review (a useful decision table is on pp. 28-30).

However when deciding between tests, bear in mind that analysis often benefits from simplicity. The following steps are all perfectly legitimate.

  • Use a weaker test. It is always possible to sacrifice information and employ a test that makes fewer assumptions about the data, if no other option is available.
  • Merge cells. Just as we may merge cells in contingency tables, numeric variables may be “quantised” (e.g. “time” could be annual data, split into decades or just “early” vs. “late”, see below).

This means that ranked or interval data can be grouped into categories and a contingency test applied, even though this process throws away information and is less theoretically powerful than, say, a test exploiting the fact that data is grouped in a ranked order.
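Quantising a numeric variable before building a contingency table might look like the following sketch. The boundary year, the observations and the function name `quantise_year` are all hypothetical.

```python
from collections import Counter

def quantise_year(year, boundary=1980):
    """Collapse a numeric year into a two-valued category."""
    return "early" if year < boundary else "late"

# Hypothetical observations: (year, choice).
observations = [(1961, "shall"), (1967, "will"), (1992, "will"),
                (1991, "will"), (1963, "shall"), (1990, "shall")]

# Contingency table of (period, choice) frequencies.
table = Counter((quantise_year(year), choice) for year, choice in observations)
print(table[("early", "shall")], table[("late", "shall")])  # 2 1
```

The resulting table can be passed to any contingency test. The year itself has been discarded, which is the information the paragraph above says we throw away.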

On the other hand, sophisticated regression techniques and parametric tests are powerful but employ more assumptions. As with all analytical methods, test results must be carefully interpreted and explained. The main pitfalls with regression techniques concern the fact that their apparent power can be misleading because your assumptions may be wrong! Even an intuitive concept like ‘simplicity’ (parsimony) relies on the variables chosen and how they are expressed. My advice would therefore be to use these methods last, and always be explicit about the assumptions they rely on.

The first step of any analysis is to plot the data, with confidence intervals if at all possible, so that you can get a proper idea of what might be going on. Then, depending on the volume of data and the scales of evidence, you can consider posing more specific questions and carrying out more precise analyses.

Working with time series data

An example time series plot. Plotting the probability of selecting shall out of either shall or will over time in the DCPSE corpus. After Aarts et al. 2013.

For example, Aarts et al. investigated the alternation of shall / will over time in late 20th Century spoken British English. The graph above shows:

  1. pink Xs: centres of “early” vs. “late” (1960s vs. 1990s) data: with two values, “time” is effectively categorical (Boolean).
  2. blue dots (with error bars representing confidence intervals): data grouped into five-year categories (1960-64, 65-69, etc.): “time” may now be considered interval data.
  3. dashed line: an estimated best-fit logistic curve within these intervals.

Note that two-valued interval data is treated as categorical. For a meaningful line we need at least three values. The logistic (‘S’) model is considered an extremely simple default pattern (to understand why see Wallis 2010). We can’t really say that the first datapoint on the left (1960-65) is falling below this idealisation: we don’t have enough data to make this claim.
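One simple way to estimate a logistic curve through a handful of points is to take the logit of each observed probability and fit a straight line by ordinary least squares. The sketch below does this. It is unweighted, so it ignores sample sizes, and it is not necessarily the method used to draw the dashed line in the plot. The data points are invented, and every probability must lie strictly between 0 and 1.

```python
import math

def logit(p):
    """Log odds of p, defined for 0 < p < 1."""
    return math.log(p / (1 - p))

def logistic(t, m, k):
    """Logistic curve p(t) = 1 / (1 + exp(-(m*t + k)))."""
    return 1 / (1 + math.exp(-(m * t + k)))

def fit_logistic(times, probs):
    """Estimate gradient m and intercept k by least squares on logit(p)."""
    ys = [logit(p) for p in probs]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(ys) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, ys))
    m = sxy / sxx
    k = mean_y - m * mean_t
    return m, k

# Hypothetical data: years since 1960 and observed p(shall).
times = [2, 7, 12, 17, 22, 27, 32]
probs = [0.60, 0.55, 0.48, 0.42, 0.35, 0.30, 0.25]
m, k = fit_logistic(times, probs)
print(m < 0)  # a negative gradient means p(shall) falls over time
```

With only two time points the fitted line would pass exactly through both, which is why at least three values are needed for the fit to say anything.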

We can also compare pairs of confidence intervals serially.

  • If intervals do not overlap: the difference is significant,
  • If one interval includes the other observed point: the difference cannot be significant,
  • Otherwise: test the difference between points with a 2 × 2 χ² or Newcombe-Wilson test.
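The three-way decision above can be sketched as a single function. This version uses Wilson score intervals and, for the third case, a Newcombe-Wilson interval on the difference, both without continuity correction. The function names and example frequencies are hypothetical.

```python
import math

Z = 1.959964  # two-tailed Normal critical value for a 95% interval

def wilson_interval(f, n, z=Z):
    """Wilson score interval (lower, upper) for the proportion p = f / n."""
    p = f / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom

def compare_points(f1, n1, f2, n2, z=Z):
    """Classify the difference between two independent proportions."""
    p1, p2 = f1 / n1, f2 / n2
    lo1, hi1 = wilson_interval(f1, n1, z)
    lo2, hi2 = wilson_interval(f2, n2, z)
    if hi1 < lo2 or hi2 < lo1:
        return "significant"      # intervals do not overlap
    if lo1 <= p2 <= hi1 or lo2 <= p1 <= hi2:
        return "not significant"  # one interval includes the other point
    # Otherwise: Newcombe-Wilson interval for the difference d = p1 - p2.
    d = p1 - p2
    lower = d - math.sqrt((p1 - lo1) ** 2 + (hi2 - p2) ** 2)
    upper = d + math.sqrt((hi1 - p1) ** 2 + (p2 - lo2) ** 2)
    return "not significant" if lower <= 0 <= upper else "significant"

print(compare_points(60, 100, 30, 100))  # intervals do not overlap
print(compare_points(50, 100, 52, 100))  # interval includes the other point
print(compare_points(50, 100, 65, 100))  # overlapping intervals: test needed
```

The first two checks are shortcuts: they give the same answer as the full test would, but can be read directly off a graph.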

Note: The same approach can be used in comparing frequencies drawn from the same distribution (e.g. if we compare different p lines for the same time). Where probabilities are drawn from the same population we use the stricter goodness of fit test (see above).

We can immediately see that, in the figure above, the probability of shall does not significantly change over the period 1960-1980, because all intervals include the observed point (blue dot) of the next. However the interval for 1990-95 slightly overlaps the interval for 1975-80 without including the observed p (and vice versa), and we should therefore perform a test. This does obtain a significant result.

In sum, understanding your data and getting the experimental design right are more important than picking the optimum test. Experimental research is cautious: to form robust conclusions we would rather make fewer assumptions and risk rejecting significant results which might be picked up with stronger tests.

See also


Aarts, B., Close, J. and Wallis, S.A. 2013. Choices over time: methodological issues in investigating current change. Chapter 2 in Aarts, B., Close, J., Leech, G. and Wallis, S.A. (eds.) The Verb Phrase in English. Cambridge: CUP.

Newcombe, R.G. 1998a. Two-sided confidence intervals for the single proportion: comparison of seven methods. Statistics in Medicine 17: 857-872.

Newcombe, R.G. 1998b. Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine 17: 873-890.

Oakes, M.P. 1998. Statistics for Corpus Linguistics. Edinburgh: EUP.

Sheskin, D.J. 1997. Handbook of Parametric and Nonparametric Statistical Procedures. Boca Raton, Fl: CRC Press.

Wallis, S.A. 2013. z-squared: the origin and application of χ². Journal of Quantitative Linguistics 20:4, 350-378.

Wallis, S.A. 2017. Detecting direction in interaction evidence.

