Choosing the right test

Introduction

One of the most common questions a new researcher has to deal with is the following:

What is the right statistical test for my purpose?

To answer this question we must distinguish between

  1. different experimental designs, and
  2. optimum methods for testing significance.

In corpus linguistics, many research questions involve choice. The speaker can say shall or will, choose to add a postmodifying clause to an NP or not, etc. If we want to know what factors influence this choice then these factors are termed independent variables (IVs) and the choice is the dependent variable (DV). These choices are mutually exclusive alternatives. Framing the research question like this immediately helps us focus in on the appropriate class of tests.

Tests for categorical data

The most common scenario in corpus linguistics is when both independent and dependent variables are categorical, which is why in recent years I’ve focused on this area in particular. The most well-known test is ‘the χ² test’ (more correctly, the contingency test), which has two well-known versions (Wallis 2013).

In contingency tests, data is expressed in the form of a contingency table of frequencies. The independent variable B has a discrete set of categories and therefore produces a discrete frequency distribution for each column of the dependent variable A. This means that for every value of A and B (let’s call these i and j respectively) there is a frequency count, f(i, j), representing the number of times in the dataset that A = i and B = j.
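To make this concrete, here is a minimal sketch in Python (the data and category labels are invented) of building such a table of counts:

```python
from collections import Counter

# Hypothetical observations: each pairs a value of the DV, A (the
# choice of 'shall' vs. 'will') with a value of the IV, B (here,
# whether the utterance comes from 'speech' or 'writing').
observations = [
    ('shall', 'speech'), ('will', 'speech'), ('will', 'speech'),
    ('shall', 'writing'), ('will', 'writing'), ('will', 'writing'),
]

# f(i, j): the number of times A = i and B = j in the dataset.
f = Counter(observations)

for (i, j), count in sorted(f.items()):
    print(f"f({i}, {j}) = {count}")
```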

Goodness of fit and homogeneity (independence) tests.
  1. Goodness of fit tests are used to compare a distribution over a selected value, a1, of a variable, A, with the overall distribution. ‘Goodness of fit’ means that the distribution at a1 fits the distribution at A. It is also referred to as an ‘r × 1’ test because it evaluates a distribution of r cells for a single value of A. A significant result means that we can reject the null hypothesis that the distribution at a1 matches the overall distribution at A.
  2. Independence tests are used to evaluate whether the value of one variable is independent of the value of the other. We typically use this test to assess the extent to which, were we to know the value of the independent variable (IV, B), we could predict the value of the dependent variable (DV, A). Note that the test is reversible: were we to swap A and B we would obtain the same test result. It is also referred to as a homogeneity test, and may also be referred to as an ‘r × c’ test because it compares distributions of r cells across all c values of A. A significant result means that we can reject the null hypothesis that the two variables are independent (that the distributions are homogeneous).
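Both tests are straightforward to run with standard tools. The following is a minimal sketch using scipy.stats with invented frequencies; the goodness of fit test scales the overall (row-total) distribution to obtain expected frequencies for a single column:

```python
import numpy as np
from scipy.stats import chisquare, chi2_contingency

# Hypothetical 2 x 2 contingency table of frequencies f(i, j).
table = np.array([[30, 45],
                  [70, 55]])

# Goodness of fit ('r x 1'): does the distribution in one column
# fit the overall (row-total) distribution?
column = table[:, 0]
overall = table.sum(axis=1)
expected = overall * column.sum() / overall.sum()
chi2, p = chisquare(column, f_exp=expected)
print(f"goodness of fit: chi2 = {chi2:.3f}, p = {p:.4f}")

# Independence ('r x c'): are the two variables independent?
# (Yates's continuity correction is applied by default for 2 x 2.)
chi2, p, dof, exp = chi2_contingency(table)
print(f"independence:    chi2 = {chi2:.3f}, p = {p:.4f}, dof = {dof}")
```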

These tests essentially operate by performing two steps: calculate the size of an effect, and then compare this effect size with a limit: a confidence interval or critical value.

Simple 2 × 1 and 2 × 2 tests are often more powerful in practice than tests over larger tables (r × 1, r × c). They have one degree of freedom and make few assumptions about the data, so they test only one ‘thing’ at a time. Over the years I’ve become a fan of these simple tests – hence this spreadsheet.

A specialised goodness of fit test, most easily calculated using the single sample z test, compares two probabilities drawn from the same sample for significant difference (i.e. difference from E = {0.5n, 0.5n}). See below.

Tests for comparing results

It is also possible to perform further ‘meta-tests’ which compare results obtained from the first two tests.

It is extremely common, but poor practice, to see citations of individual χ² scores or error levels (‘p-values’, labelled α in this blog to avoid confusion with proportions) in papers. However, to be blunt, this information is almost completely useless. The fact that one test obtains a higher χ² score or smaller α value than another does not mean that the effect witnessed is greater, ‘stronger’ etc.

It is permissible to cite sizes of effect descriptively (that is, to describe the sample). However the correct approach to comparing outcomes is to employ a meta-test for separability.

  1. Separability of gradient tests (Wallis 2019) evaluate whether the results of two comparable experiments are significantly different from each other. Whereas the goodness of fit and homogeneity tests look for a significant non-zero difference between a and A or between a0, a1, a2, etc., a separability test operates at a higher level. It attempts to determine whether two sets of results from earlier subtests are significantly different. A significant result allows us to reject the null hypothesis that the two results say the same thing about the population.
Separability of gradient tests: upper, comparing goodness of fit A = a1 and A′ = a1′, and lower, comparing independence of a1/a2 and a1′/a2′.

There are different separability tests for comparing goodness of fit tables (which we might term ‘separability of fit’) and homogeneity tables (‘separability of independence’), illustrated by the figure above. Note that it only makes sense to perform this type of meta-analysis when pairs of tables have the same structure: if they are structurally different then they are different anyway! This type of test can be used to compare the results of the same experiment performed on different samples (e.g. from different corpora, or in replication studies) or when different definitions of variables are used. Aarts, Close and Wallis (2013) employed a gradient test in a step-wise fashion, changing one parameter at a time, to compare their results with those of previous researchers.

A recent innovation (Wallis 2017) employs a separability test to assess whether an association between two Binomial variables (detected by a 2 × 2 test for homogeneity) can be said to be predictively directional, i.e. such that one variable has a greater predictive impact on the other than vice-versa. Predictive direction must be strictly distinguished from causality, however, which is another common error!

  2. Separability of observation (‘point’) tests (Wallis 2019) allow us to compare observations between two or more experimental datasets. To compare two proportions (a simple point test) the best method is to carry out Newcombe-Wilson tests. If we wish to compare two series of cell frequencies, such as an entire column or row in each table, we can use a ‘multi-point’ test using a generalisation of Pearson’s χ². See Point tests and multi-point tests for separability of homogeneity.
Separability of observation tests (point tests) compare proportions or distributions between tables.

Optimum methods of calculation

It is possible to use different formulae or methods to carry out these tests. The standard χ² calculation has known weaknesses, and to address these a number of alternatives have been proposed, including performing ‘exact’ tests (Binomial, Fisher), employing Log-likelihood, and applying Yates’s or Williams’s corrections. So the question then is which method should we choose?

Fortunately, a number of authors (in particular, Robert Newcombe (1998a and b), but also see my own modest effort) have put in the time to evaluate confidence intervals and tests, and we can therefore offer some straightforward advice on this topic.

  • In theory, ‘exact’ tests are preferable to those based on the Normal approximation to the Binomial (z, χ², etc.), but they are computationally costly and difficult to generalise. Their oft-cited advantage of precision may be valuable in borderline cases.
  • With one degree of freedom (2 × 1, 2 × 2), use Yates’s continuity-corrected χ² in preference to standard χ² tests.
  • If the independent variable subdivides the corpus by speaker or text, then strictly speaking you should use an independent-population test. A good test is the 2 × 2 Newcombe-Wilson test with continuity correction (Newcombe 1998b, Wallis 2009); a sketch follows this list. This approach is also recommended for separability tests.
  • With multiple degrees of freedom, use an r × c χ² test, collapsing cells as necessary (see Wallis 2013). Examine tables for areas of greatest change (χ² partials) and subdivide as required. Don’t be afraid to plot graphs of probability variation with confidence intervals!
  • Log-likelihood is not an improvement on χ² – it employs different assumptions and has some interesting properties, which are exploited in log-linear models – but it is not a better ‘χ² test’.
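By way of illustration, here is a minimal sketch of the Newcombe-Wilson difference test in Python. For brevity it uses the uncorrected Wilson score interval; Newcombe (1998b) and Wallis (2009) give continuity-corrected variants. All figures are invented:

```python
from math import sqrt

Z = 1.95996  # two-tailed critical value for an error level of 0.05

def wilson(p, n, z=Z):
    """Wilson score interval for an observed proportion p out of n."""
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    width = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - width, centre + width

def newcombe_wilson(p1, n1, p2, n2, z=Z):
    """Newcombe's interval for the difference p1 - p2 between two
    proportions drawn from independent populations (Newcombe 1998b)."""
    l1, u1 = wilson(p1, n1, z)
    l2, u2 = wilson(p2, n2, z)
    d = p1 - p2
    return (d - sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2),
            d + sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2))

# Hypothetical example: p1 = 30/100 in one subcorpus vs. p2 = 45/100 in
# another. If the interval excludes zero, the difference is significant.
lo, hi = newcombe_wilson(0.30, 100, 0.45, 100)
print(f"difference interval for p1 - p2: [{lo:.3f}, {hi:.3f}]")
```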

Finally, don’t pick and choose alternative formulae just to see if you can obtain a significant result. Select a method and error level and stick to it.

Comparing competing frequencies drawn from the same sample

A special case of the goodness of fit test may be used to compare probabilities drawn from the same sample. Consider a discrete frequency distribution F = {f1, f2,…} summing to n. We can plot Wilson score intervals on probabilities pi = fi / n. If two intervals do not overlap, the difference must be significant.

The null hypothesis is that a pair of frequencies, fa and fb, are drawn with equal probability, in which case they would bisect the data neatly:

O = {fa, fb} ≈ E = {0.5(fa+fb), 0.5(fa+fb)}.

Note that for the purpose of this calculation we ignore all other frequencies apart from this pair. See Comparing frequencies within a discrete distribution. An alternative calculation employs the z test for a population probability, P = 0.5.
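A minimal sketch of this comparison in Python, with invented frequencies, showing both the z test against P = 0.5 and its ‘exact’ Binomial equivalent:

```python
from math import sqrt
from scipy.stats import binomtest

# Hypothetical pair of competing frequencies drawn from the same
# distribution, e.g. f(shall) and f(will) in a single subcorpus.
fa, fb = 60, 40
n = fa + fb

# z test for a population probability P = 0.5:
z = (fa / n - 0.5) / sqrt(0.5 * 0.5 / n)
print(f"z = {z:.3f}  (significant at the 0.05 level if |z| > 1.95996)")

# 'Exact' equivalent: the Binomial test.
print(f"Binomial test p-value = {binomtest(fa, n, 0.5).pvalue:.4f}")
```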

Tests and intervals for functions of proportions

What if the variable we are interested in is not a proportion or frequency, but a different property, such as word, clause or sentence length, the ratio of two proportions (the ‘risk ratio’), percentage difference or entropy?

These properties have one thing in common. They can be calculated from one or more independent proportions. For example, clause length can be expressed as l = 1/p, where p is the chance that a random word is the first in a clause.

Using a method developed in the post An algebra of intervals, we can calculate intervals on these properties, and then compare them, either against an absolute score (e.g. risk ratio = 1, entropy > 0.2), or different observations of the same property.
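For instance, here is a minimal sketch (invented figures) of an interval on mean clause length l = 1/p. Since 1/p is monotonic decreasing, the Wilson score interval for p simply inverts, with its endpoints swapped:

```python
from math import sqrt

Z = 1.95996  # two-tailed critical value for an error level of 0.05

def wilson(p, n, z=Z):
    """Wilson score interval for an observed proportion p out of n."""
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    width = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - width, centre + width

# Hypothetical sample: 125 clause-initial words out of n = 1,000 words,
# so p = 0.125 and mean clause length l = 1/p = 8 words.
p, n = 0.125, 1000
lo_p, hi_p = wilson(p, n)

# Invert the interval for l = 1/p (the endpoints swap over).
print(f"l = {1 / p:.2f} words, interval ({1 / hi_p:.2f}, {1 / lo_p:.2f})")
```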

Tests for other types of data

The situation starts to become more complicated when one or more of the variables are not categorical. There are a range of tests designed for ranked and interval/ratio data, usefully divided according to whether one or other variable is categorical or not.

  • If the independent variable is categorical you should employ tests for two or more independent samples (χ², Mann-Whitney U, Student’s t test etc); see the sketch after the footnote below.
  • If the dependent variable is categorical you can employ the same tests but their interpretation may be less clear. A significant result from a reversed-order test is evidence of interaction between the two variables.*
  • Otherwise, employ graph plotting and regression (Spearman’s R², Pearson’s r²).

*For example, the t test for two independent samples is commonly stated such that the independent variable (subsample) is Boolean (e.g. speech vs. writing) and the dependent variable is at least on an interval scale (e.g. clause length). A significant result tells us that the mean length of clauses varies according to whether it is found in speech or writing. But the test can also be applied in reverse: given a clause length we may infer (stylistically) whether the text it is found in comes from speech or writing. Correlations can be interpreted in both directions, just like the χ² independence test. Arguably the distinction between independent and dependent variables is philosophically less important in ex post facto data analysis than in lab experiments where the independent variable may be controlled or manipulated by the researcher.
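A brief sketch of both tests in Python, with invented clause-length data. The Mann-Whitney U test makes fewer assumptions (it uses only the rank order of the data) than the t test:

```python
import numpy as np
from scipy.stats import mannwhitneyu, ttest_ind

rng = np.random.default_rng(1)

# Hypothetical clause lengths (in words) for two subsamples of a
# Boolean IV, speech vs. writing. The figures are invented.
speech = rng.poisson(6, size=200) + 1
writing = rng.poisson(8, size=200) + 1

# Interval-scale DV, categorical IV: two-sample t test.
t, p = ttest_ind(speech, writing)
print(f"t test:       t = {t:.3f}, p = {p:.4f}")

# Fewer assumptions: Mann-Whitney U on the same data.
u, p = mannwhitneyu(speech, writing)
print(f"Mann-Whitney: U = {u:.1f}, p = {p:.4f}")
```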

On this blog I do not attempt to reproduce every single test under the sun. Surveys of standard statistical tests can be found in numerous experimental design and statistics textbooks. For example, Chapter 1 in Oakes (1998) provides a useful (if rather rapid) summary of tests with practical corpus-based examples. If you can persevere with the algebra, Sheskin (1997) is recommended for a rather more comprehensive review (a useful decision table is on pp. 28-30).

However when deciding between tests, bear in mind that analysis often benefits from simplicity. The following steps are all perfectly legitimate.

  • Use a weaker test. It is always possible to sacrifice information and employ a test that makes fewer assumptions about the data, if no other option is available.
  • Merge cells. Just as we may merge cells in contingency tables, numeric variables may be ‘quantised’ (e.g. ‘time’ could be annual data, split into decades or just ‘early’ vs. ‘late’, see below).

This means that ranked or interval data can be grouped into categories and a contingency test applied, even though this process throws away information and is less theoretically powerful than, say, a test exploiting the fact that data is grouped in a ranked order.
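For example, a numeric ‘time’ variable can be quantised at several levels of granularity (a sketch with invented years):

```python
import numpy as np

# Hypothetical years of composition for a set of texts.
years = np.array([1961, 1963, 1968, 1972, 1977, 1984, 1991, 1992])

# Quantise 'time' three ways, from finer to coarser:
decades = (years // 10) * 10                          # decade bins
fiveyear = (years // 5) * 5                           # five-year bins
early_late = np.where(years < 1975, 'early', 'late')  # Boolean split

for label, grouping in [('decade', decades), ('5-year', fiveyear),
                        ('early/late', early_late)]:
    values, counts = np.unique(grouping, return_counts=True)
    print(label, dict(zip(values.tolist(), counts.tolist())))
```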

On the other hand, sophisticated regression techniques and parametric tests are powerful but employ more assumptions. As with all analytical methods, test results must be carefully interpreted and explained. The main pitfall of regression techniques is that their apparent power can be misleading if your assumptions are wrong! Even an intuitive concept like ‘simplicity’ (parsimony) relies on the variables chosen, how they are defined and how they are expressed (e.g. on a linear, log or logistic scale). My advice would therefore be to use these methods last, and always be explicit about the assumptions they rely on.

The first step of any analysis is to plot the data, with confidence intervals if at all possible, so that you can get a proper idea of what might be going on. Then, depending on the volume of data and the scales of evidence, you can consider posing more specific questions and carrying out more precise analyses.

Working with time series data

An example time series plot. Plotting the probability of selecting shall out of either shall or will over time in the DCPSE corpus. After Aarts et al. 2013.

For example, Aarts et al. investigated the alternation of shall / will over time in late 20th Century spoken British English. The graph above shows:

  1. pink Xs: centres of ‘early’ vs. ‘late’ (1960s vs. 1990s) data; with two values, ‘time’ is effectively categorical (Boolean).
  2. blue dots (with error bars representing confidence intervals): data grouped into five-year categories (1960-64, 65-69, etc.): ‘time’ may now be considered interval data.
  3. dashed line: an estimated best-fit logistic curve within these intervals.

Note that two-valued interval data is treated as categorical. For a meaningful line we need at least three values. The logistic (‘S’) model is considered an extremely simple default pattern (to understand why, see Wallis 2010). We can’t really say that the first datapoint on the left (1960-64) is falling below this idealisation: we don’t have enough data to make this claim.
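A minimal sketch of fitting such a curve with scipy.optimize.curve_fit; the proportions below are invented for illustration, not the DCPSE figures:

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, k, t0):
    """Two-parameter logistic ('S') curve."""
    return 1 / (1 + np.exp(-k * (t - t0)))

# Invented proportions p(shall | shall or will) at five-year midpoints.
t = np.array([1962, 1967, 1972, 1977, 1982, 1987, 1992])
p = np.array([0.55, 0.50, 0.48, 0.42, 0.38, 0.33, 0.30])

(k, t0), _ = curve_fit(logistic, t, p, p0=[-0.05, 1975])
print(f"fitted slope k = {k:.4f}, crossover year t0 = {t0:.1f}")
```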

We can also compare pairs of confidence intervals serially.

  • If intervals do not overlap: the difference is significant,
  • If one interval includes the other observed point: the difference cannot be significant,
  • Otherwise: test the difference between points with a 2 × 2 χ² or Newcombe-Wilson test.

Note: The same approach can be used in comparing frequencies drawn from the same distribution (e.g. if we compare different p lines for the same time). Where probabilities are drawn from the same population we use the stricter goodness of fit test (see above).
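This decision procedure is simple enough to sketch in code (the points and intervals below are invented):

```python
def compare_intervals(p1, ci1, p2, ci2):
    """Serial comparison of two observed points p1, p2 with
    confidence intervals ci = (lower, upper)."""
    if ci1[1] < ci2[0] or ci2[1] < ci1[0]:
        return 'significant: the intervals do not overlap'
    if ci1[0] <= p2 <= ci1[1] or ci2[0] <= p1 <= ci2[1]:
        return 'not significant: an interval includes the other point'
    return 'indeterminate: apply a 2 x 2 chi-squared or Newcombe-Wilson test'

# Hypothetical adjacent observations: the intervals overlap without
# including the other point, so a full test is required.
print(compare_intervals(0.48, (0.42, 0.54), 0.38, (0.33, 0.43)))
```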

We can immediately see that, in the figure above, the probability of shall does not significantly change over the period 1960-1980, because all intervals include the observed point (blue dot) of the next. However the interval for 1990-95 slightly overlaps the interval for 1975-80 without including the observed p (and vice versa), and we should therefore perform a test. This does obtain a significant result.

In sum, understanding your data and getting the experimental design right is more important than picking the optimum test. Experimental research is cautious: to form robust conclusions we would rather make fewer assumptions and risk overlooking results that might be detected by stronger tests.

References

Aarts, B., Close, J., and Wallis, S.A. 2013. Choices over time: methodological issues in investigating current change. Chapter 2 in Aarts, B., Close, J., Leech, G. and Wallis, S.A. (eds.) The Verb Phrase in English. Cambridge: CUP. » ePublished » Table of contents and ordering info

Newcombe, R.G. 1998a. Two-sided confidence intervals for the single proportion: comparison of seven methods. Statistics in Medicine 17: 857-872.

Newcombe, R.G. 1998b. Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine 17: 873-890.

Oakes, M.P. 1998. Statistics for Corpus Linguistics. Edinburgh: EUP.

Sheskin, D.J. 1997. Handbook of Parametric and Nonparametric Statistical Procedures. Boca Raton, Fl: CRC Press.

Wallis, S.A. 2013. z-squared: the origin and application of χ². Journal of Quantitative Linguistics 20:4, 350-378. » Post

Wallis, S.A. 2017. Detecting direction in interaction evidence. » Post

Wallis, S.A. 2019. Comparing χ² tables for separability of distribution and effect. Journal of Quantitative Linguistics 26:4, 330-355. » Post
