### Introduction

Over the last few months I have been looking at computationally evaluating confidence intervals and significance tests. This process has helped me sharpen up the recommendations I can give to researchers. I have updated some online papers and blog posts as a result.

**This analysis has exposed a difference, rarely commented upon, in the optimum contingency (“χ²-type”) test, depending on whether the samples for each value of the independent variable are drawn from the same population or from independent populations.**

**For 2 × 2 tests it is recommended to use a different test (Newcombe-Wilson) when the IV is sociolinguistic (e.g. genre, time, different subcorpora) or otherwise divides samples by participants, than when the same participant may be sampled in either value (e.g. when the IV is a lexical-grammatical variable).**

**Meta-comment:** In a way this is another benefit of a blog — unlike traditional publication, I can quickly correct any problems or improve papers as a result of my discoveries or those of colleagues. However it also means I need to draw the attention of my readership to any changes.

Confidence intervals and significance tests are closely related, for reasons discussed here. So if we can evaluate a formula for a confidence interval in some way, then we can also potentially evaluate the test.

This area of “testing tests” is quite complicated, and statisticians (like the rest of us) are capable of arguing at great length about the merits of one or another approach. However the first thing to do is pick the **ideal** baseline — and then decide how to evaluate methods against that.

A more thorough review than I had time to undertake is that of Robert Newcombe (1998a, b), who essentially compared methods against a bootstrapped procedure for computing intervals. He looked at a range of different properties for intervals, such as whether they could have **zero width** (a bad thing, as nothing is certain) or **overshoot** (exceed the allowable range [0, 1] or [-1, +1]) and so on. I was more interested in simply whether one test obtained a significant result in circumstances when another did not. I therefore depend on the baseline being “correct” (see Agresti and Coull 1998), a problem which Newcombe avoids. Nonetheless my conclusions are similar to Newcombe’s and hopefully easier to follow for the uninitiated. An alternative assessment of the single interval can be found in Brown *et al* (2001).

Detailed results are published online in the Binomial intervals paper. In this post I simply summarise the results to indicate what has changed. I have tweaked recommendations in the 2 × 2 spreadsheet to take account of these changes as well.

I should also comment that proposals for descriptive statistical measures described elsewhere in this blog, such as goodness of fit measures of association, have been investigated in a similarly computationally-intensive manner.

### Confidence intervals on a single proportion *p*

To evaluate confidence interval formulae I compared several different methods against an exact Binomial search method (called the Clopper-Pearson method).

This search procedure gets the computer to find the value *P* which is the centre of a Binomial interval such that its **upper bound** (with a ‘tail’ of α/2: α is just the error level, i.e. 0.05 or 0.01) is the observation *p*. This also means that *P* is the **lower bound** of the confidence interval around *p*.

This relationship is summarised below: we observe *p* (on the right of the figure) and need to find *P*.

The Binomial calculation is itself a bit of a monster, but a computer can calculate it quickly. It involves summing the following formula for integer values of *r* from *x*₁ = *x* to *x*₂ = *n*.

Cumulative Binomial probability:

$$B(x_1, x_2, n, P) = \sum_{r = x_1}^{x_2} {}^{n}C_{r} \, P^{r} (1 - P)^{n - r}.$$

A search algorithm applies this calculation many times with different values for *P*, ‘zooming in’ ever closer until it finds one where the result is α/2.
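As a rough illustration of this search, here is a minimal Python sketch. It assumes a simple bisection strategy (the post does not say which search algorithm is used), and the function names `cum_binomial` and `clopper_pearson_lower` are hypothetical labels chosen for this example.

```python
from math import comb


def cum_binomial(x1, x2, n, P):
    """Cumulative Binomial probability B(x1, x2, n, P): sum over r = x1..x2."""
    return sum(comb(n, r) * P**r * (1 - P)**(n - r) for r in range(x1, x2 + 1))


def clopper_pearson_lower(x, n, alpha=0.05, tol=1e-10):
    """Find P such that B(x, n, n, P) = alpha/2, i.e. the lower bound on p = x/n.

    Bisection is an assumption made for this sketch; any root-finder would do.
    """
    if x == 0:
        return 0.0  # the lower bound cannot fall below zero
    lo, hi = 0.0, x / n  # the solution lies between 0 and the observation p
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cum_binomial(x, n, n, mid) < alpha / 2:
            lo = mid  # upper tail too small, so P must be larger
        else:
            hi = mid  # upper tail too large, so P must be smaller
    return (lo + hi) / 2


if __name__ == "__main__":
    x, n = 3, 10
    print("lower bound on p =", x / n, "is", clopper_pearson_lower(x, n))
```

By symmetry, the matching upper bound could be obtained as `1 - clopper_pearson_lower(n - x, n)`.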

This gives us our baseline “correct” value for *P* which we can compare with values obtained from other methods.

When *n* is small the graph is particularly instructive. (These methods get very similar results when *n* is large anyway.)

With *x* = {0, 1, 2, 3,… *n*} I therefore tested the following intervals:

- Traditional: Gaussian on *p*, with and without continuity correction. Not shown here because it is wrong! (See Figure 6 in the paper.)
- Search procedure to find Gaussian on *P*: obtains the same result as the Wilson score interval on *p*. Shown below as χ²(*P*) = Wilson(*p*).
- Search procedure with Yates’ continuity correction applied, matching the Wilson score interval with continuity correction.
- Search procedure to find log-likelihood on *P*.
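For readers who want to experiment, the two Wilson intervals named in this list can be sketched in Python as follows. The continuity-corrected version uses the closed form given by Newcombe (1998a); the function names and the default `z` (the two-tailed critical value of the Normal distribution for α = 0.05) are choices made for this example, not taken from the paper.

```python
from math import sqrt


def wilson(p, n, z=1.959964):
    """Wilson score interval (lower, upper) for a proportion p observed in n cases."""
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom


def wilson_cc(p, n, z=1.959964):
    """Wilson score interval with continuity correction (Newcombe 1998a form)."""
    denom = 2 * (n + z * z)
    lower = (2 * n * p + z * z - 1
             - z * sqrt(z * z - 2 - 1 / n + 4 * p * (n * (1 - p) + 1))) / denom
    upper = (2 * n * p + z * z + 1
             + z * sqrt(z * z + 2 - 1 / n + 4 * p * (n * (1 - p) - 1))) / denom
    # Clamp to [0, 1]; the bounds are exactly 0 and 1 when p is 0 or 1
    return (0.0 if p == 0 else max(0.0, lower),
            1.0 if p == 1 else min(1.0, upper))


if __name__ == "__main__":
    print("Wilson:     ", wilson(0.5, 10))
    print("Wilson c.c.:", wilson_cc(0.5, 10))
```

The continuity-corrected interval is always at least as wide as the uncorrected one, which is what makes the corresponding test more cautious.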

Note how closely Yates’ χ²(*P*) = Wilson c.c.(*p*) fits the population Binomial (Bin(*P*)) curve (grey line).

It also turns out (see paper) that log-likelihood performs worse than standard χ², averaged over the probability of selection. It seems that those who have advocated log-likelihood have concentrated on performance in certain areas of the curve (the improbable lower interval on low-skewed values), not the overall curve. As I point out in the paper, if the observation *p* is below 0.5, it is more likely, all other things being equal, that *P* is greater than *p*!

This causes us to make a simple recommendation: **use Wilson’s score interval with continuity correction** for calculating confidence intervals on *p* and **use Yates’ test** for 2 × 1 χ² tests.

### 2 × 2 significance tests

When comparing two-proportion tests we find that two baselines present themselves. The table can be expressed as four frequencies (*a*, *b*, *c*, *d*) or two probabilities (*p*₁, *p*₂) and two totals (*n*₁, *n*₂).

- The summed **Fisher’s exact test** is an appropriate baseline test when samples are drawn from the **same population**. Ideally this means that the independent variable is free to vary. In practice we can probably allow variables based on *lexical-grammatical* queries.
- The **cumulative Binomial** search procedure above is the basis for a baseline test when samples are drawn from **different populations**. This is the optimum method when the independent variable is *sociolinguistic*, and different participants are writing or speaking in each value of the variable (what we might refer to as a “between subjects” design).

**Note:** These baseline tests are relatively computationally intensive and are therefore rarely employed in practice. For our purposes they represent a mathematically-justified ‘ideal’, not a useful test.
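To give a flavour of the first baseline, here is a minimal Python sketch of a summed Fisher test for a 2 × 2 table with rows (*a*, *b*) and (*c*, *d*). It assumes a one-tailed sum over the observed table and every more extreme table in the same direction; the post does not state which tail convention is used, so treat that as an assumption. The function names are hypothetical.

```python
from math import comb


def fisher_point(a, b, c, d):
    """Probability of one 2 x 2 table given fixed row and column totals."""
    return comb(a + b, a) * comb(c + d, c) / comb(a + b + c + d, a + c)


def fisher_summed(a, b, c, d):
    """Sum the observed table and all more extreme tables in the same direction.

    Assumption for this sketch: a one-tailed sum. Margins are held constant,
    so a gain in one cell means a matching loss in its neighbours.
    """
    step = 1 if a * d >= b * c else -1  # direction of increasing imbalance
    total = 0.0
    while min(a, b, c, d) >= 0:
        total += fisher_point(a, b, c, d)
        a, d = a + step, d + step
        b, c = b - step, c - step
    return total


if __name__ == "__main__":
    print("summed Fisher probability:", fisher_summed(15, 5, 5, 15))
```

The result would then be compared with the chosen error level to decide whether the table counts as significant.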

In this paper we are simply interested in whether a test obtains the same result as the baseline test. To spot differences we plot error matrices for every test. The graph below shows where different tests obtain different results from Fisher’s test.

Each mark represents an error: Type I errors are where the test is insufficiently cautious, and overestimates the significance of the result. Type II errors are the opposite — the test is over-cautious and rejects a result that is significant according to Fisher.

This graph considers all tables where *n*₁ = *n*₂ = 20, so *a* and *c* range over {0, 1, 2,… 20} and *b* = 20 – *a*, etc. Changing parameters (*n*₁, *n*₂) obtains a similar pattern. Note that **Yates’ test** has fewer errors overall, and these err on the side of caution. This means we can recommend this test. Again, note that log-likelihood produces the most errors.
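For reference, Yates’ continuity-corrected χ² for a 2 × 2 table can be sketched in a few lines of Python. This uses the standard textbook formula; the guard that stops the correction exceeding the observed difference, the function names, and the hard-coded critical value 3.841 (χ² with one degree of freedom at α = 0.05) are choices made for this example.

```python
CRITICAL_005 = 3.841  # chi-square critical value, 1 degree of freedom, alpha = 0.05


def yates_chi_square(a, b, c, d):
    """Yates' continuity-corrected chi-square for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0  # a degenerate table carries no evidence of a difference
    # Guard (an assumption of this sketch): never let the correction overshoot
    diff = max(0.0, abs(a * d - b * c) - n / 2)
    return n * diff**2 / (row1 * row2 * col1 * col2)


def yates_significant(a, b, c, d, critical=CRITICAL_005):
    """True if the table is significant by Yates' test at the given critical value."""
    return yates_chi_square(a, b, c, d) > critical


if __name__ == "__main__":
    # Walk every table with n1 = n2 = 20, as in the plot described above
    n1 = n2 = 20
    hits = sum(yates_significant(a, n1 - a, c, n2 - c)
               for a in range(n1 + 1) for c in range(n2 + 1))
    print(hits, "of", (n1 + 1) * (n2 + 1), "tables are significant by Yates' test")
```

An error matrix of the kind described here would mark each (*a*, *c*) cell where this decision disagrees with the baseline test.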

In the paper we go further. We repeat the plot with α = 0.01, and obtain a similar result. However, thus far we have merely evaluated performance for equally split permutations of 40 observations. We therefore plot the probability of Type I and II errors for all tests for *n*₁ = *n*₂ from 1 to 100 and also for *n*₁ = 5*n*₂. With equal row totals, Yates’ test obtains no Type I errors, but a small number are found if one row has rather more data than the other. Nonetheless, this test outperforms the others.

We next turn to independent-population tests. As previously noted, we use the Binomial search algorithm to find the inner values of ideal intervals (*P*₁, *P*₂), obtain the widths of these intervals (*P*₁ – *p*₁ etc.) and then employ the sum of variances rule to combine them. (This rule assumes that standard deviations are independent, but is potentially problematic for small *n*.)

For our performance plot, we compare the Newcombe-Wilson tests with and without continuity correction, plus log-likelihood.

We find that this time the preferred test is that performed by testing differences against the **Newcombe-Wilson interval with continuity correction**. As we saw when discussing the single proportion interval, Wilson’s continuity-corrected score interval method closely matches the exact Binomial: combining two exact Binomials and two continuity-corrected Wilson intervals using the same approach therefore obtains a similar result.
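The following Python sketch shows one way to carry out this test. It follows Newcombe’s (1998b) formulation, which builds an interval around the observed difference *p*₁ – *p*₂ and checks whether it excludes zero; this should be equivalent to testing the difference against an interval centred on zero, but the exact arrangement in the paper may differ. Function names are hypothetical, and `wilson_cc` is the continuity-corrected Wilson interval in the closed form given by Newcombe (1998a).

```python
from math import sqrt


def wilson_cc(p, n, z=1.959964):
    """Wilson score interval with continuity correction (Newcombe 1998a form)."""
    denom = 2 * (n + z * z)
    lower = (2 * n * p + z * z - 1
             - z * sqrt(z * z - 2 - 1 / n + 4 * p * (n * (1 - p) + 1))) / denom
    upper = (2 * n * p + z * z + 1
             + z * sqrt(z * z + 2 - 1 / n + 4 * p * (n * (1 - p) - 1))) / denom
    return (0.0 if p == 0 else max(0.0, lower),
            1.0 if p == 1 else min(1.0, upper))


def newcombe_wilson_cc(p1, n1, p2, n2, z=1.959964):
    """Interval for the difference p1 - p2, combining inner widths by sum of squares."""
    l1, u1 = wilson_cc(p1, n1, z)
    l2, u2 = wilson_cc(p2, n2, z)
    d = p1 - p2
    lower = d - sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = d + sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return lower, upper


def significantly_different(p1, n1, p2, n2, z=1.959964):
    """True if the interval on the difference excludes zero."""
    lower, upper = newcombe_wilson_cc(p1, n1, p2, n2, z)
    return lower > 0 or upper < 0


if __name__ == "__main__":
    print(newcombe_wilson_cc(0.75, 20, 0.25, 20))
    print(significantly_different(0.75, 20, 0.25, 20))
```

Swapping `wilson_cc` for the uncorrected Wilson interval would give the Newcombe-Wilson test without continuity correction.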

Again, in the paper we extend our evaluation over multiple sizes of table. This reveals, however, that when either row total *n*₁ or *n*₂ is small (<15), Yates’ χ² test obtains fewer errors.

### Some concluding remarks

What this evaluation turned up was a distinction missing from most discussions of χ².

**There is a difference between tests that assume that samples are drawn from the same or from different populations.**

Sheskin (1997: 229) comments on this in passing. Indeed, Fisher’s test explicitly calculates probabilities on the basis that row and column sums are constant (Sheskin 1997: 221), so that a gain in one cell must mean a corresponding loss in another cell. However when samples are drawn from different texts, independent variables consist of subcorpora, or we are comparing the results of experiments then it is more theoretically appealing to perform difference tests.

For the most part this does not matter a great deal. Fisher’s test and the paired Binomial obtain very similar results. However, once we start examining differences in performance between tests, small discrepancies are picked up. Furthermore, for *ex post facto* corpus linguistics research it could be argued that these distinctions are important: samples obtained from different texts are separate in the clear sense that a speaker could not choose to swap an utterance in one text for an utterance in another!

We can actually measure the size of this difference between so-called ‘exact’ tests. We use the same method of evaluation for comparing the two tests (independent-population, same-population) as we use for evaluating tests against a baseline. (See Appendix 2 in the Binomial intervals paper for a more detailed explanation.) Fisher’s test is slightly more conservative – the plot represents the probability across all combinations where the paired Binomial is significant and Fisher is not.

Note that *this* question of choice turns on the **independent** variable in an experimental design.

As a corollary, in comparing the outcome of 2 × 2 tests we should use the relevant meta-test, with or without continuity correction (see the paper for details).

**The main winner in these evaluations is Yates’ χ², with the (closely related) Newcombe-Wilson interval with continuity correction being more accurate in certain circumstances.**

### See also

- Wallis, S.A. 2013. Binomial confidence intervals and contingency tests. *Journal of Quantitative Linguistics* **20**:3, 178-208 **»** Post
- Spreadsheet: 2 × 2 or 2 × 1 χ² tests
- Binomial algorithm snippets

### References

Agresti, A. and Coull, B.A. 1998. Approximate is better than ‘exact’ for interval estimation of binomial proportions. *The American Statistician* **52**: 119–126.

Brown, L.D., Cai, T. and DasGupta, A. 2001. Interval estimation for a binomial proportion. *Statistical Science* **16**: 101-133.

Newcombe, R.G. 1998a. Two-sided confidence intervals for the single proportion: comparison of seven methods. *Statistics in Medicine* **17**: 857-872.

Newcombe, R.G. 1998b. Interval estimation for the difference between independent proportions: comparison of eleven methods. *Statistics in Medicine* **17**: 873-890.

Sheskin, D.J. 1997. *Handbook of Parametric and Nonparametric Statistical Procedures*. Boca Raton, Fl: CRC Press.