**Note:** This page explains how to compare observed frequencies *f*₁ and *f*₂ from the same distribution, **F** = {*f*₁, *f*₂, …}. To compare observed frequencies *f*₁ and *f*₂ from different distributions, i.e. where **F**₁ = {*f*₁, …} and **F**₂ = {*f*₂, …}, you need to use a chi-square or Newcombe-Wilson test.

### Introduction

In a recent study, my colleague Jill Bowie obtained a discrete frequency distribution by manually classifying cases in a small sample drawn from a large corpus.

Jill converted this distribution into a row of probabilities and calculated Wilson score intervals on each observation, to express the uncertainty associated with a small sample. She had one question, however:

**How do we know whether the proportion of one quantity is significantly greater than another?**

We might use a Newcombe-Wilson test (see Wallis 2013a), but this test assumes that we want to compare samples from independent sources. Jill’s data are drawn from the same sample, and all probabilities must sum to 1. Instead, the optimum test is a **dependent-sample** test.

### Example

A discrete distribution looks something like this: **F** = {108, 65, 6, 2}. This is the frequency data for the middle column (circled) in the following chart.

This may be converted into a probability distribution **P**, representing the proportion of examples in each category, by simply dividing by the total: **P** = {0.60, 0.36, 0.03, 0.01}, which sums to 1.
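As a quick sketch (not part of the original calculation), the conversion from **F** to **P** is a single division:

```python
# Converting the frequency distribution F into a probability distribution P
# by dividing each frequency by the total n.
F = [108, 65, 6, 2]
n = sum(F)              # 181 cases in the sample
P = [f / n for f in F]  # approximately [0.60, 0.36, 0.03, 0.01]
```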

We can plot these probabilities, with Wilson score intervals, as shown below.

**So how do we know if one proportion is significantly greater than another?**

- When comparing values diachronically (horizontally), data is drawn from **independent samples**. We may use the Newcombe-Wilson test, and employ the handy visual rule that if intervals do not overlap they must be significantly different.
- However, probabilities drawn from the **same sample** (vertically) sum to 1 — which is not the case for independent samples! There are *k* − 1 degrees of freedom, where *k* is the number of classes. It turns out that the relevant significance test we need to use is an extremely basic test, but it is rarely discussed in the literature.

### The single sample *z* test

The **single sample z test** (Sheskin 1997: 33) is the first test in that book. Sheskin expresses the test in terms of frequencies, but I tend to find it simpler to express it in terms of probabilities.

Given an observed probability *p*, and a known expected (population) probability *P*, the test checks to see if the two values are statistically significantly different.

If we know *P* we are permitted to calculate a Gaussian **population** interval about it (see also the interval equality principle in Wallis 2013a):

*Population standard deviation S* = √(*P*(1 − *P*)/*n*),

where *n* represents the number of cases in the **sample**. We can then employ the equation to find a *z*-score:

*z* = (*p − P*)/*S.*

This score is signed by the direction of change, so to turn it into a significance test, we check whether the absolute value |*z*| exceeds the critical value of *z* at a given error level α, written *z*_{α/2}.

We can reformulate this test as simply checking whether *p* is outside the range *P* ± *z*_{α/2}·*S*.
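This range check can be sketched in a few lines of Python (the default critical value 1.95996 assumes a two-tailed test at α = 0.05):

```python
import math

# Single-sample z test expressed as a range check:
# is p outside P +/- z_crit * S?
def single_sample_z_test(p, P, n, z_crit=1.95996):
    S = math.sqrt(P * (1 - P) / n)   # population standard deviation
    return abs(p - P) > z_crit * S   # True if significantly different
```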

This test can be used for testing an observed probability against **any** expected population probability *P* (see also Wallis 2013b), but we want to use this test in a particular way.

### Testing pairs of frequencies for significant difference

We can use this test to compare whether two frequencies, say, *f*₁ = 108 (‘cogitate’) and *f*₂ = 65 (‘intend’), are significantly different from one another.

- For these two frequencies, *n* = *f*₁ + *f*₂ = 173. Convert them into probabilities, *p*₁ = *f*₁/*n* = 0.6243 and *p*₂ = 1 − *p*₁ = 0.3757. Importantly, we **ignore the rest of the distribution** for the purposes of this calculation.
- **Test the probability that an observed result, *p*₁, is significantly different from *P* = 0.5** (since if *p*₁ = 0.5, then *p*₂ = *p*₁). Since *p*₁ and *p*₂ are linked (*p*₂ = 1 − *p*₁), we only need to test *p*₁.

We can plug this data into the single sample *z* test:

*z* = (*p*₁ − *P*)/*S*,

where

*S* = √(*P*(1 − *P*)/*n*) = √(0.25/173) = 0.0380,

thus

*z* = 0.1243/0.0380 = 3.2692.

Since |*z*| > 1.95996, we can say that the two columns are significantly different in frequency at an error level of 0.05. In the 1960s data there are significantly more ‘cogitate’ uses of *think* than ‘intend’ uses.
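The worked calculation can be reproduced with a short Python sketch (a check on the arithmetic above, not the original spreadsheet):

```python
import math

# Worked example: f1 = 108 ('cogitate'), f2 = 65 ('intend').
f1, f2 = 108, 65
n = f1 + f2                      # 173
p1 = f1 / n                      # 0.6243
P = 0.5                          # null hypothesis: the two frequencies are equal
S = math.sqrt(P * (1 - P) / n)   # 0.0380
z = (p1 - P) / S                 # 3.2692
significant = abs(z) > 1.95996   # True at an error level of 0.05
```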

To make this easier to check, I have provided an example calculation in the attached spreadsheet. This employs the device of converting *z* to χ² and then employing CHIDIST to obtain the error level. (This is provided for convenience, but you should not attribute “greater significance” to results with a lower error level.)

There are other methods, but they are more complicated and still achieve the same result. You can use a 2 × 1 goodness of fit χ² test using the expected distribution **E** = {0.5*n*, 0.5*n*}. Alternatively, you can calculate the revised Wilson interval on *p*₁ (i.e. just for the pair *p*₁ and *p*₂) and test to see if 0.5 is outside it. See plotting *p*₁ below.
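The Wilson-interval variant can be sketched as follows, assuming the standard Wilson score interval formula with the two-tailed critical value for α = 0.05:

```python
import math

# Wilson score interval on p1 for the pair (f1, f2); the difference is
# significant if P = 0.5 falls outside the interval.
def wilson_interval(p, n, z=1.95996):
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    halfwidth = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - halfwidth, centre + halfwidth

lo, hi = wilson_interval(108 / 173, 173)  # interval on p1 = 0.6243
significant = not (lo <= 0.5 <= hi)       # True: 0.5 lies below the interval
```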

### Relation with Wilson score intervals

Where does this leave us with our handy rule for examining charts with Wilson score intervals?

For two observations drawn from the same sample:

1. Do the intervals overlap? **If no**: the observations are significantly different.
2. Does either observed probability fall within the other interval? **If yes**: the observations are *not* significantly different.
3. Otherwise, test for significance using a **single-sample** *z* test.

We do not need to test for significance unless condition (3) applies, i.e. *W*, the new combined interval to be tested, is subject to the following limits:

max(*w*₁, *w*₂) ≤ *W* ≤ *w*₁+*w*₂,

where *w*₁, *w*₂ represent the inner Wilson interval widths.

#### Quick explanation

- *w*₁ + *w*₂ is the minimum distance for non-overlapping intervals, and
- max(*w*₁, *w*₂) is the minimum distance where neither probability falls within the range of the other.

The proof that *W* is within these limits may be of interest. It is a little more complicated than for *W* = √*w*₁²+*w*₂² (the Newcombe-Wilson interval), which is within these limits due to simple algebra.

#### Proof

In extremis, we have only two frequencies in the original distribution, i.e. two probabilities where *p*₂ = 1 − *p*₁. Including a non-zero third or fourth probability (as in the graph above) has the effect of loosening this coupling (increasing the number of degrees of freedom and increasing *n*). The more cases fall in other categories, the more the test converges on the independent sample (Newcombe-Wilson) test, which has the same property.

In the two-category case, **if and only if** the Gaussian (*z*, Normal) interval for *P*=0.5 includes *p*₁, the corresponding Wilson interval for *p*₁ must include *P*, by the interval equality principle. As *p*₂ mirrors *p*₁, the Wilson interval for *p*₂ also mirrors that of *p*₁. It will also include *P*, and therefore the intervals will overlap at *P*=0.5.

### Plotting *p*₁

We can plot *p*₁ for Magnus Levin’s example data with Wilson score intervals. In the following graph we have performed pairwise comparisons on nearest neighbours (ordered by frequency). We can see that quotative vs. interpretive uses of *think* are not statistically significantly different (at an error level α=0.05) for 1920s (no data) and 1960s (the interval crosses *P*=0.5). Note that *p*₁ here is the proportion of the first value out of the pair of values, not out of all values, as we are comparing each pair of frequencies independently.

This graph bears on the central point of this discussion.

- We are comparing the ratio of two observed frequencies, which converts to a simple probability value *p*₁. We simply employ a **goodness of fit** test against *P* = 0.5 to verify that they are different.
- Three different methods obtain precisely the same result (Wallis 2013b). These are the single-sample *z* test, 2 × 1 χ² test or Wilson interval test. The intervals in the graph above are Wilson score intervals on the probability of the first element of each pair, which we term *p*₁.
- We should *not* apply a 2 × 2 test (such as the Newcombe-Wilson or χ² homogeneity test) designed to test independent probabilities to compare dependent ones.

### Applying a continuity correction

A continuity-corrected *z*-score (for small samples) is obtained by subtracting ^{1}/_{2n} from the absolute difference:

*z* = (|*p*₁ − *P*| − ^{1}/_{2n})/*S*.

The procedure is as before. Although in this formula *z* is unsigned, the direction is obvious. I have included a second column to the spreadsheet to perform this calculation.
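As a sketch, the continuity-corrected calculation for the worked example:

```python
import math

# Continuity-corrected z for the example: subtract 1/(2n) from the
# absolute difference before dividing by S.
f1, f2 = 108, 65
n = f1 + f2
p1, P = f1 / n, 0.5
S = math.sqrt(P * (1 - P) / n)
z_cc = (abs(p1 - P) - 1 / (2 * n)) / S  # approximately 3.19
significant = z_cc > 1.95996            # still significant at alpha = 0.05
```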

Robert Newcombe also offers a continuity-corrected version of the Wilson score interval (Wallis 2013a).

### Exact Binomial method

The methods we have discussed thus far employ the Gaussian (Normal) approximation to the Binomial distribution. This approximation is not exact. Correcting for this by employing a continuity-correction is conservative, i.e. we fail to find properly significant results in some cases.

A viable alternative where *n* < 200 (or so) is the Binomial test (Wallis 2013a). For this test we use the Binomial formula with *P* = 0.5:

*Binomial probability B*(*r*; *n*, *P*) ≡ *nCr* · *P*^{r}(1 − *P*)^{(n−r)}.

Since *P* = 1 − *P* = 0.5, this formula can be simplified to *B*(*r*; *n*, 0.5) ≡ *nCr* / 2^{n}.

This computes the probability of selecting exactly *r* out of *n* cases. To test an observed probability, *p* = *f*/*n*, for significant difference from *P* at, say, α = 0.05, we need to sum values. We carry out a two-tailed test, dividing α by 2 at both tail ends.

The 0.5 Binomial distribution is symmetric, so we only need to consider the formula for *p* < 0.5, and test for 1 − *p* otherwise. The observed lower tail area is obtained by summing from 0 to *f*.

*Binomial test*: ∑_{r=0..f} *B*(*r*; *n*, 0.5) ≤ α/2.

In plain English this says, add up all values of the Binomial formula for *P* = 0.5 from *r* = 0 to *f*, and test if the result is less than α/2.

To return to our example:

- Data: *f*₁ = 108 (‘cogitate’) and *f*₂ = 65 (‘intend’), *n* = 173, and *f*₂ is lower.
- The Binomial tail sum from 0 to 65 is 0.000669 (to six decimal places).
- This is less than α/2 = 0.025. Hence the test is significant.
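The tail sum is straightforward to compute exactly with Python's `math.comb` (a sketch; the original calculation uses a spreadsheet):

```python
from math import comb

# Exact Binomial test: sum B(r; n, 0.5) for r = 0..f, compare with alpha/2.
n, f = 173, 65                  # n = f1 + f2; f is the lower frequency
tail = sum(comb(n, r) for r in range(f + 1)) / 2 ** n
significant = tail <= 0.025     # lower-tail sum is well below alpha/2
```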

### See also

- Excel spreadsheets
- Single sample *z* test
- Example data (c/o Magnus Levin)
- Plotting confidence intervals on graphs (single probabilities)
- Change and certainty: plotting confidence intervals (2)

### References

Aarts, B., G. Leech, J. Close and S.A. Wallis (eds.) 2013. *The Verb Phrase in English: Investigating recent language change with corpora.* Cambridge: CUP.

Levin, M. 2013. The progressive verb in modern American English. Chapter 8 in Aarts *et al.* (2013).

Sheskin, D.J. 1997. *Handbook of Parametric and Nonparametric Statistical Procedures*. Boca Raton, Fl: CRC Press.

Wallis, S.A. 2013a. Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods. *Journal of Quantitative Linguistics* **20**:3, 178-208.

Wallis, S.A. 2013b. *z*-squared: the origin and application of χ². *Journal of Quantitative Linguistics* **20**:4, 350-378.