### Introduction

I have recently been reviewing and rewriting a paper for publication that I first wrote back in 2011. The paper (Wallis 2019) concerns the problem of how we test whether repeated runs of the same experiment obtain essentially the same results, i.e. whether the results are not significantly different from each other.

These meta-tests can be used to test an experiment for replication: if you repeat an experiment and obtain significantly different results on the first repetition, then, with a 1% error level, you can say there is a 99% chance that the experiment is not replicable.

These tests have other applications. You might be wishing to compare your results with those of others in the literature, compare results with a different operationalisation (definitions of variables), or just compare results obtained with different data – such as comparing a grammatical distribution observed in speech with that found within writing.

The design of tests for this purpose is addressed within the *t*-test and ANOVA literature, where tests are applied to continuously-valued variables. The solution concerns a particular version of an ANOVA, called “the test for interaction in a factorial analysis of variance” (Sheskin 1997: 489).

However, anyone using data expressed as discrete alternatives (A, B, C etc) has a problem: the classical literature does not explain what you should do.

### Gradient and point tests

The rewrite of the paper caused me to distinguish between two types of tests: ‘point tests’, which I describe below, and ‘gradient tests’.

These tests can be used to compare results drawn from 2 × 2 or *r* × *c* χ² tests for homogeneity (also known as tests for independence). This is the most common type of contingency test, which can be computed using Fisher’s exact method or as a Newcombe-Wilson difference interval.

- A **gradient test** (B) evaluates whether the *gradient* or difference between point 1 and point 2 differs between runs of an experiment, *d* = *p*₁ – *p*₂. This concerns whether claims about the rate of change, or size of effect, observed are replicable. Gradient tests can be extended, with increasing degrees of freedom, into tests comparing *patterns* of effect.
- A **point test** (A) simply asks whether data at either point, evaluated separately, differs between experimental runs. This concerns whether single observations, such as *p*₁, are replicable. Point tests can be extended into ‘multi-point’ tests, which we discuss below.

Point tests only apply to homogeneity data. If you wish to compare outcomes from goodness of fit tests, you need a version of the gradient test, to compare differences from an expected *P*, *d* = *p*₁ – *P*. Since different data sets may have different expected *P*, a distinct ‘point test for goodness of fit’ would be meaningless.

The earlier version of the paper, which has been published on this blog since its launch in 2012, focused on gradient tests. The possibility of carrying out a point test was mentioned in passing. In this blog post I want to focus on point tests.

The obvious problem with gradient tests is that two experimental runs might obtain the same gradient but in fact be very different in start and end points. Consider the following graph.

### Point tests

The data in Figure 1 is calculated from two 2 × 2 tables drawn from a paper by Aarts, Close and Wallis (2013).

**Note:** To obtain Figure 2, I simply replaced one frequency in the first table: 46 with 100. The data is also found on the 2×2 homogeneity tab in this Excel spreadsheet, which contains a wide range of separability tests.

To make our exposition clearer, Table 1 uses the same format as in the Excel spreadsheet (with the dependent variable distributed vertically) rather than the format in the paper.

| spoken | LLC (1960s) | ICE-GB (1990s) | Total |
|---|---|---|---|
| shall | 124 | 46 | 170 |
| will | 501 | 544 | 1,045 |
| Total | 625 | 590 | 1,215 |

| written | LOB (1960s) | FLOB (1990s) | Total |
|---|---|---|---|
| shall | 355 | 200 | 555 |
| will | 2,798 | 2,723 | 5,521 |
| Total | 3,153 | 2,923 | 6,076 |

Aarts *et al*. carried out 2 × 2 homogeneity tests for the two tables separately. These test whether modal *shall* declines as a proportion of the modal *shall/will* alternation between the two time points. In other words, we compare LLC with ICE-GB data, and LOB with FLOB data.

To carry out a point test we simply rotate the test 90 degrees, e.g. to compare data at the 1960s point we compare LLC with LOB.
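The rotation can be sketched in a few lines of Python. This is only an illustration: the dictionary layout and the `proportion` helper are my own assumptions, not part of the spreadsheet. The frequencies are those of Table 1.

```python
# Frequencies of (shall, will) from Table 1, keyed by corpus.
spoken = {"LLC": (124, 501), "ICE-GB": (46, 544)}
written = {"LOB": (355, 2798), "FLOB": (200, 2723)}

def proportion(shall, will):
    """Proportion of shall out of the shall/will alternation."""
    return shall / (shall + will)

# 'Rotating' the test: pair corpora by time point rather than by mode.
points = {
    "1960s": (spoken["LLC"], written["LOB"]),
    "1990s": (spoken["ICE-GB"], written["FLOB"]),
}

for label, (spoken_cell, written_cell) in points.items():
    p_spoken = proportion(*spoken_cell)
    p_written = proportion(*written_cell)
    print(f"{label}: p(spoken) = {p_spoken:.4f}, "
          f"p(written) = {p_written:.4f}, "
          f"difference = {p_spoken - p_written:.4f}")
```

Each entry in `points` is now a pair of observations taken at the same time point, which is the input a point test needs.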

As I have explained elsewhere (Wallis 2013), there are a number of different methods for carrying out this comparison.

These include:

- The *z* test for two independent proportions (Sheskin 1997: 226).
- The Newcombe-Wilson interval test (Newcombe 1998).
- The 2 × 2 χ² test for homogeneity (independence).

These are all standard tests and each is discussed in papers and elsewhere on this blog.

The advantage of the third approach is that it is extensible to *c*-way multinomial observations by using a 2 × *c* χ² test.
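As a rough sketch of the first and third methods, the following Python compares the 1960s point (LLC against LOB). The function names are hypothetical, and I assume the pooled-variance form of the *z* test and a Pearson χ² without continuity correction.

```python
import math

def z_two_proportions(f1, n1, f2, n2):
    """z score for two independent proportions f1/n1 and f2/n2,
    using the pooled proportion as the estimate under the null hypothesis."""
    p1, p2 = f1 / n1, f2 / n2
    pooled = (f1 + f2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

def chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# 1960s point: shall out of shall/will in LLC (spoken) versus LOB (written).
z = z_two_proportions(124, 625, 355, 3153)
chi2 = chi2_2x2(124, 355, 501, 2798)
print(f"z = {z:.4f}, z squared = {z * z:.4f}, chi-square = {chi2:.4f}")
```

With one degree of freedom the squared *z* score and the χ² score coincide, so the two methods give the same result for a 2 × 2 table.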

### The multi-point test

The tests listed above can be used to compare the 1960s and 1990s intervals in Figure 1 separately.

However, in many cases it would be helpful to have a method that evaluated both pairs of observations in a single test. This can be generalised to a series of *r* observations. To do this, in Wallis (2019) I propose what I call a multi-point test.

We generalise the χ² formula by summing over *i* = 1..*r*:

$$\chi_d^2 = \sum_{i=1}^{r} \chi^2(i)$$

where χ²(*i*) represents the χ² score for homogeneity for each set of data at position *i* in the distribution.

This test has *r* × df(*i*) degrees of freedom, where df(*i*) is the degrees of freedom for each χ² point test. So, in the worked example we have seen, the summed test has two degrees of freedom:

| spoken | LLC (1960s) | ICE-GB (1990s) | Total |
|---|---|---|---|
| shall | 124 | 46 | 170 |
| will | 501 | 544 | 1,045 |
| Total | 625 | 590 | 1,215 |

| written | LOB (1960s) | FLOB (1990s) | Total |
|---|---|---|---|
| shall | 355 | 200 | 555 |
| will | 2,798 | 2,723 | 5,521 |
| Total | 3,153 | 2,923 | 6,076 |

| | 1960s | 1990s | Total |
|---|---|---|---|
| χ² | 34.6906 | 0.6865 | 35.3772 |

Since the computation sums independently-calculated χ² scores, each score may be individually considered for significant difference (with df(*i*) degrees of freedom). Hence we can see above the large score for the 1960s data (individually significant) and the small score for 1990s (individually non-significant).
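The summation can be sketched as follows. This is a minimal illustration rather than the spreadsheet's implementation: the helper names are my own, and the tail probability function only covers the one and two degrees of freedom needed here.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def chi2_sf(x, df):
    """Upper tail probability of chi-square, closed forms for df = 1 or 2."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    raise ValueError("this sketch only covers df = 1 or df = 2")

# Point tables as (shall spoken, shall written, will spoken, will written).
point_tables = {
    "1960s": (124, 355, 501, 2798),
    "1990s": (46, 200, 544, 2723),
}

scores = {label: chi2_2x2(*cells) for label, cells in point_tables.items()}
chi2_d = sum(scores.values())      # summed multi-point score
df = len(point_tables) * 1         # r x df(i), with df(i) = 1 for a 2 x 2 test

for label, score in scores.items():
    print(f"{label}: chi-square = {score:.4f}, p = {chi2_sf(score, 1):.4g}")
print(f"multi-point: chi-square = {chi2_d:.4f}, df = {df}, "
      f"p = {chi2_sf(chi2_d, df):.4g}")
```

Each point score is tested against one degree of freedom and the sum against two, which mirrors the individual and overall readings of the table.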

**Note:** Whereas χ² is generally associative (non-directional), the summed equation ($\chi_d^2$) is not. Nor is this computation the same as a 3 dimensional test (*t* × *r* × *c*). Variables are treated differently.

- The multi-point test factors out variation between tests over the independent variable (in this instance: time). This means that if there is a lot more data in one table at a particular time period, this fact does not skew the results.
- On the other hand, it does not factor out variation over the dependent variable – after all, this is precisely what we wish to examine!

Naturally, like the point test, this test may be generalised to multinomial observations.

### A Newcombe-Wilson multi-point test

An alternative multi-point test for binomial (two-way) variables employs a sum of χ² values abstracted from Newcombe-Wilson tests.

- Carry out Newcombe-Wilson tests for each point test *i* at a given error level α, obtaining $D_i$, $W_i^-$ and $W_i^+$.
- Identify the inner interval width $W_i$ for each test:
  - if $D_i < 0$, $W_i = W_i^-$; $W_i = W_i^+$ otherwise.
- Use the difference $D_i$ and inner interval $W_i$ to compute χ² scores:
  - $\chi^2(i) = (D_i \cdot z_{\alpha/2} / W_i)^2$.

It is then possible to sum χ²(*i*) as before.
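The three steps above can be sketched in Python for a single point. Several details here are my assumptions rather than anything stated in the text: Wilson score intervals without continuity correction, α = 0.05, the difference taken as *p*₂ – *p*₁ with the written sample first, and all function names.

```python
import math

Z = 1.959964  # two-tailed critical value of z for alpha = 0.05 (assumed)

def wilson(f, n, z=Z):
    """Wilson score interval (lower, upper) for the proportion f/n."""
    p = f / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom

def newcombe_wilson_chi2(f1, n1, f2, n2, z=Z):
    """Return (D, W_minus, W_plus, chi2) for D = p2 - p1.

    W_minus and W_plus are the zero-centred Newcombe-Wilson bounds;
    the difference is significant if D falls outside them.
    """
    p1, p2 = f1 / n1, f2 / n2
    l1, u1 = wilson(f1, n1, z)
    l2, u2 = wilson(f2, n2, z)
    d = p2 - p1
    w_minus = -math.sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    w_plus = math.sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    w = w_minus if d < 0 else w_plus   # step 2: inner interval width
    chi2 = (d * z / w) ** 2            # step 3: abstracted chi-square score
    return d, w_minus, w_plus, chi2

# 1960s point: LOB (written) as sample 1, LLC (spoken) as sample 2.
d, w_minus, w_plus, chi2 = newcombe_wilson_chi2(355, 3153, 124, 625)
print(f"D = {d:.4f}, W- = {w_minus:.4f}, W+ = {w_plus:.4f}, "
      f"chi-square = {chi2:.4f}")
```

Summing the scores over all points then proceeds exactly as for the first method. Other interval conventions, such as a continuity-corrected Wilson interval, would give slightly different widths and therefore different scores.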

Using the data in the worked example we obtain:

**1960s:** $D_i$ = 0.0858, $W_i^-$ = -0.0347 and $W_i^+$ = 0.0316 (significant).

**1990s:** $D_i$ = 0.0095, $W_i^-$ = -0.0194 and $W_i^+$ = 0.0159 (ns).

Since $D_i$ is positive in both cases, we use the upper interval width each time. This gives us χ² scores of 28.4076 and 1.3769 respectively, which obtains a sum of 29.78. Compared to the first method above, this approach tends to downplay extreme differences.

### In conclusion

The point test and the additive generalisation of this test into a ‘multi-point test’ represent a method of contrasting multiple runs of the same experiment, comparing observed changes in different subcorpora or genres, or examining the empirical effect of changing definitions of variables.

These tests consider the null hypothesis that **individual observations** are not different; or, in the multi-point case, that **in general** the observations are not different.

- They do not evaluate the gradient between points or the size of effect. If we wish to compare **sizes of effect** we would need to use one of the methods for this purpose described in (Wallis forthcoming).
- The method only applies to comparing tests for homogeneity (independence). To compare **goodness of fit** data, a different approach is required (also described in Wallis forthcoming).

Nonetheless, these meta-tests build on classical Pearson χ² tests, and they are useful tools in our analytical armoury.

### References

Sheskin, D.J. 1997. *Handbook of Parametric and Nonparametric Statistical Procedures*. Boca Raton, Fl: CRC Press.

Newcombe, R.G. 1998. Interval estimation for the difference between independent proportions: comparison of eleven methods. *Statistics in Medicine* **17**: 873-890.

Wallis, S.A. 2013. *z*-squared: the origin and application of χ². *Journal of Quantitative Linguistics* **20**:4, 350-378.

Wallis, S.A. 2019. Comparing χ² tables for separability of distribution and effect. *Journal of Quantitative Linguistics* **26**:4, 330-355.