Comparing frequencies within a discrete distribution

Note:
This page explains how to compare observed frequencies f1 and f2 from the same distribution, F = {f1, f2,…}. To compare observed frequencies f1 and f2 from different distributions, i.e. where F1 = {f1,…} and F2 = {f2,…}, you need to use a chi-square or Newcombe-Wilson test.

Introduction

In a recent study, my colleague Jill Bowie obtained a discrete frequency distribution by manually classifying cases in a small sample drawn from a large corpus.

Jill converted this distribution into a row of probabilities and calculated Wilson score intervals on each observation, to express the uncertainty associated with a small sample. She had one question, however:

How do we know whether the proportion of one quantity is significantly greater than another?

We might use a Newcombe-Wilson test (see Wallis 2013a), but this test assumes that we want to compare samples from independent sources. Jill’s data are drawn from the same sample, and all probabilities must sum to 1. Instead, the optimum test is a dependent-sample test.

Example

A discrete distribution looks something like this: F = {108, 65, 6, 2}. This is the frequency data for the middle column (circled) in the following chart.

This may be converted into a probability distribution P, representing the proportion of examples in each category, by simply dividing by the total: P = {0.60, 0.36, 0.03, 0.01}, which sums to 1.

We can plot these probabilities, with Wilson score intervals, as shown below.

Figure: An example graph plot showing the changing proportions of meanings of the verb think over time in the US TIME Magazine Corpus, with Wilson score intervals, after Levin (2013). In this post we discuss the 1960s data (circled). The sum of each column probability is 1. Many thanks to Magnus for the data!
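As a rough sketch of the conversion described above, the following Python divides each frequency by the total and then computes a Wilson score interval on each proportion. The function name `wilson_interval` and the variable names are my own for illustration, and the interval uses the standard (uncorrected) Wilson formula at α = 0.05.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(p, n, alpha=0.05):
    """Return (lower, upper) Wilson score bounds for a proportion p out of n cases."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed critical value of z
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom


freqs = [108, 65, 6, 2]               # F, the 1960s column
total = sum(freqs)                    # n for the whole distribution
probs = [f / total for f in freqs]    # P, which sums to 1

for f, p in zip(freqs, probs):
    lower, upper = wilson_interval(p, total)
    print(f"f = {f:3d}  p = {p:.4f}  interval = ({lower:.4f}, {upper:.4f})")
```

Note that these intervals are computed against the full sample (all four categories), which is what the chart plots. The pairwise test discussed later uses a different n.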

So how do we know if one proportion is significantly greater than another?

  • When comparing values diachronically (horizontally), data is drawn from independent samples. We may use the Newcombe-Wilson test, and employ the handy visual rule that if intervals do not overlap they must be significantly different.
  • However, probabilities drawn from the same sample (vertically) sum to 1 — which is not the case for independent samples! There are k−1 degrees of freedom, where k is the number of classes. It turns out that the relevant significance test we need to use is an extremely basic test, but it is rarely discussed in the literature.

The single sample z test

The single-sample z test (Sheskin 1997: 33) is the first test in Sheskin's handbook. He expresses the test in terms of frequencies, but I find it simpler to express it in terms of probabilities.

Given an observed probability p, and a known expected (population) probability P, the test checks to see if the two values are statistically significantly different.

If we know P we are permitted to calculate a Gaussian population interval about it (see also the interval equality principle in Wallis 2013a):

Population standard deviation $S = \sqrt{P(1 - P)/n}$,

where n represents the number of cases in the sample. We can then employ the equation to find a z-score:

z = (p – P)/S.

This score is signed by the direction of change, so to turn it into a significance test, we test whether the absolute value |z| is greater than the critical value of z at a given error level α, written $z_{\alpha/2}$.

We can reformulate this test as simply checking whether p is outside the range $P \pm z_{\alpha/2} \cdot S$.

Figure: The single-sample population z test assumes that random observations are distributed about P according to a Normal (Gaussian) distribution, and tests to see if an observed p is outside of the marked interval range, $P \pm z_{\alpha/2} \cdot S$.
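The test can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `single_sample_z` is hypothetical, and the example values (p = 0.30, P = 0.25, n = 100) are invented purely to exercise the function.

```python
from math import sqrt
from statistics import NormalDist


def single_sample_z(p, P, n, alpha=0.05):
    """Test an observed proportion p (from n cases) against a known population value P."""
    S = sqrt(P * (1 - P) / n)                      # population standard deviation
    z = (p - P) / S                                # signed z-score
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # critical value at error level alpha
    interval = (P - z_crit * S, P + z_crit * S)    # Gaussian population interval about P
    significant = abs(z) > z_crit                  # same as: p lies outside the interval
    return z, interval, significant


# Hypothetical example values, not taken from the data discussed in this post.
z, interval, significant = single_sample_z(p=0.30, P=0.25, n=100)
print(f"z = {z:.4f}, interval = ({interval[0]:.4f}, {interval[1]:.4f}), significant = {significant}")
```

The two formulations (comparing |z| with the critical value, or checking whether p falls outside the interval) are the same test, so the function reports both.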

This test can be used for testing an observed probability against any expected population probability P (see also Wallis 2013b), but we want to use this test in a particular way.

Testing pairs of frequencies for significant difference

We can use this test to compare whether two frequencies, say, f1 = 108 (‘cogitate’) and f2 = 65 (‘intend’), are significantly different from one another.

  • For these two frequencies, n = f1+f2 = 173. Convert them into probabilities, p1 = f1/n = 0.6243 and p2 = 1 – p1 = 0.3757. Importantly, we ignore the rest of the distribution for the purposes of this calculation.
  • Test whether the observed result, p1, is significantly different from P = 0.5 (since if p1 = 0.5, then p2 = p1). Since p1 and p2 are linked (p2 = 1 − p1) we only need to test p1.

We can plug this data into the single sample z test:

$z = (p_1 - P)/S$,

where

$S = \sqrt{P(1 - P)/n} = \sqrt{0.25/173} = 0.0380$,

thus

$z = 0.1243/0.0380 = 3.2692$.

Since |z| > 1.95996, we can say that the two columns are significantly different in frequency at an error level of 0.05. In the 1960s data there are significantly more ‘cogitate’ uses of think than ‘intend’ uses.
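The arithmetic can be checked with a short Python sketch. The variable names are mine. The sign of z depends only on which frequency of the pair is labelled f1, and the test uses the magnitude |z|. With P = 0.5 the formula also reduces to (f1 − f2)/√n, which the sketch confirms.

```python
from math import sqrt

f1, f2 = 108, 65            # 'cogitate' and 'intend' in the 1960s data
n = f1 + f2                 # 173: the rest of the distribution is ignored
p1 = f1 / n                 # proportion of the first item within the pair
P = 0.5                     # expected value if the two frequencies were equal

S = sqrt(P * (1 - P) / n)   # population standard deviation
z = (p1 - P) / S            # z-score for p1

# With P = 0.5 the same score can be written directly in frequencies.
z_shortcut = (f1 - f2) / sqrt(n)

significant = abs(z) > 1.95996   # critical value at alpha = 0.05
print(f"p1 = {p1:.4f}, S = {S:.4f}, z = {z:.4f}, significant = {significant}")
```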

To make this easier to check, I have provided an example calculation in the attached spreadsheet. This employs the device of converting z to χ² and then employing CHIDIST to obtain the error level. (This is provided for convenience, but you should not attribute “greater significance” to results with a lower error level.)

There are other methods, but they are more complicated and achieve the same result. You can use a 2 × 1 goodness of fit χ² test using the expected distribution E = {0.5n, 0.5n}. Alternatively, you can calculate the revised Wilson interval on p1 (i.e. just for the pair p1 and p2) and test to see if 0.5 is outside it. See the section “Plotting p1” below.
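To illustrate that these routes agree, here is a small Python sketch (my own, not the spreadsheet) that runs all three on the example pair. The error level is obtained from z with `math.erfc`, which for one degree of freedom gives the same upper-tail χ² probability that I take CHIDIST to return in the spreadsheet.

```python
from math import erfc, sqrt

f1, f2 = 108, 65
n = f1 + f2
p1 = f1 / n
z_crit = 1.95996                      # critical value at alpha = 0.05

# Method 1: single-sample z test against P = 0.5.
z = (p1 - 0.5) / sqrt(0.25 / n)
z_test = abs(z) > z_crit

# Method 2: 2 x 1 goodness of fit chi-square against E = {0.5n, 0.5n}.
expected = 0.5 * n
chi_sq = (f1 - expected) ** 2 / expected + (f2 - expected) ** 2 / expected
chi_test = chi_sq > z_crit ** 2       # chi-square equals z squared here

# Method 3: Wilson score interval on p1 (pairwise n), testing whether it excludes 0.5.
centre = p1 + z_crit ** 2 / (2 * n)
spread = z_crit * sqrt(p1 * (1 - p1) / n + z_crit ** 2 / (4 * n * n))
denom = 1 + z_crit ** 2 / n
lower, upper = (centre - spread) / denom, (centre + spread) / denom
wilson_test = not (lower <= 0.5 <= upper)

# Two-tailed error level for z (equivalently, the chi-square upper tail with 1 d.f.).
error_level = erfc(abs(z) / sqrt(2))

print(z_test, chi_test, wilson_test, round(error_level, 5))
```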

Relation with Wilson score intervals

Where does this leave us with our handy rule for examining charts with Wilson score intervals?

For two observations drawn from the same sample:

  1. Do the intervals overlap?
    • If no: the observations are significantly different.
  2. Does either observed probability fall within the other interval?
    • If yes: the observations are not significantly different.
  3. Otherwise test for significance using a single-sample z test.

We do not need to test for significance unless condition (3) applies, i.e. W, the new combined interval to be tested, is subject to the following limits:

$\max(w_1, w_2) \le W \le w_1 + w_2$,

where w1, w2 represent the inner Wilson interval widths.

Quick explanation

  1. w1+w2 is the minimum distance for non-overlapping intervals, and
  2. max(w1, w2) is the minimum distance where neither probability falls within the range of the other.
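The three-step rule above can be sketched as a Python function. This is an illustrative reading of the rule, not a definitive implementation: the names `wilson_interval` and `same_sample_difference` are hypothetical, the intervals are plain Wilson intervals on the whole sample, and step 3 falls back to the pairwise single-sample z test.

```python
from math import sqrt

Z_CRIT = 1.95996   # critical value of z at alpha = 0.05


def wilson_interval(p, n, z=Z_CRIT):
    """Return (lower, upper) Wilson score bounds for proportion p out of n cases."""
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom


def same_sample_difference(f1, f2, total, z=Z_CRIT):
    """Return (significant, step) for two frequencies drawn from one sample of size total."""
    if f1 == f2:
        return False, 2                    # identical values trivially fall within each other's interval
    hi_f, lo_f = max(f1, f2), min(f1, f2)
    p_hi, p_lo = hi_f / total, lo_f / total
    lower_of_hi, _ = wilson_interval(p_hi, total, z)   # inner bound of the larger proportion
    _, upper_of_lo = wilson_interval(p_lo, total, z)   # inner bound of the smaller proportion
    # Step 1: intervals do not overlap, so the difference is significant.
    if lower_of_hi > upper_of_lo:
        return True, 1
    # Step 2: one observation falls inside the other's interval, so it is not significant.
    if p_lo >= lower_of_hi or p_hi <= upper_of_lo:
        return False, 2
    # Step 3: otherwise run the pairwise single-sample z test against P = 0.5.
    n = f1 + f2
    z_score = (hi_f / n - 0.5) / sqrt(0.25 / n)
    return abs(z_score) > z, 3


print(same_sample_difference(108, 65, 181))   # 'cogitate' vs. 'intend'
print(same_sample_difference(6, 2, 181))      # the two smallest categories
```

In terms of the inner widths, step 1 corresponds to a difference greater than w1 + w2 and step 2 to a difference no greater than max(w1, w2), so only the cases in between reach the z test.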

The proof that W is within these limits may be of interest. It is a little more complicated than for $W = \sqrt{w_1^2 + w_2^2}$ (the Newcombe-Wilson interval), which is within these limits due to simple algebra.

Proof

In extremis, we have only two frequencies in the original distribution, i.e. two probabilities where p2 = 1 − p1. Including a non-zero third or fourth probability (as in the graph above) has the effect of loosening this coupling (increasing the number of degrees of freedom and increasing n). The more cases fall in other categories, the more the test converges on the independent sample (Newcombe-Wilson) test, which has the same property.

In the two-category case, if and only if the Gaussian (z, Normal) interval for P = 0.5 includes p1, the corresponding Wilson interval for p1 must include P, by the interval equality principle. As p2 mirrors p1, the Wilson interval for p2 also mirrors that of p1. It will also include P, and therefore the intervals will overlap at P = 0.5.

Plotting p1

We can plot p1 for Magnus Levin’s example data with Wilson score intervals. In the following graph we have performed pairwise comparisons on nearest neighbours (ordered by frequency). We can see that quotative vs. interpretive uses of think are not statistically significantly different (at an error level α = 0.05) for 1920s (no data) and 1960s (the interval crosses P = 0.5). Note that p1 here is the proportion of the first value out of the pair of values, not out of all values, as we are comparing each pair of frequencies independently.

Figure: Pairwise comparison (n = f1+f2) of nearest neighbours, plotting p1 = f1/n with Wilson score intervals (α = 0.05). If the interval excludes P = 0.5 (dotted line) the two frequencies are significantly different.

This graph bears on the central point of this discussion.

  • We are comparing the ratio of two observed frequencies, which converts to a simple probability value p1. We simply employ a goodness of fit test against P = 0.5 to test whether they are significantly different.
  • Three different methods obtain precisely the same result (Wallis 2013b). These are the single-sample z test, 2 × 1 χ² test or Wilson interval test. The intervals in the graph above are Wilson score intervals on the probability of the first element of each pair, which we term p1.
  • We should not apply a 2 × 2 test (such as the Newcombe-Wilson or χ² homogeneity test) designed to test independent probabilities to compare dependent ones.

Applying a continuity correction

A continuity-corrected z-score (for small samples) is obtained by subtracting $1/(2n)$ from the absolute difference:

$z = (|p_1 - P| - 1/(2n))/S$.

The procedure is as before. Although in this formula z is unsigned, the direction is obvious. I have included a second column to the spreadsheet to perform this calculation.
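For the example pair, the corrected score can be computed with the following Python sketch (variable names are mine; this is not the spreadsheet itself).

```python
from math import sqrt

f1, f2 = 108, 65
n = f1 + f2
p1 = f1 / n
P = 0.5

S = sqrt(P * (1 - P) / n)                      # population standard deviation
z_cc = (abs(p1 - P) - 1 / (2 * n)) / S         # continuity-corrected, unsigned z

significant = z_cc > 1.95996                   # critical value at alpha = 0.05
print(f"z (corrected) = {z_cc:.4f}, significant = {significant}")
```

The corrected score is slightly smaller than the uncorrected one, which is what makes the corrected test more conservative.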

Robert Newcombe also offers a continuity-corrected version of the Wilson score interval (Wallis 2013a). See also Correcting for continuity.

Exact Binomial method

The methods we have discussed thus far employ the Gaussian (Normal) approximation to the Binomial distribution. This approximation is not exact. Correcting for this by employing a continuity correction is conservative, i.e. we fail to find properly significant results in some cases.

A viable alternative where n < 200 (or so) is the Binomial test (Wallis 2013a). For this test we use the Binomial formula with P = 0.5:

Binomial probability $B(r; n, P) \equiv {}^{n}C_{r} \, P^{r} (1 - P)^{n-r}$.

Since P = 1 – P = 0.5, this formula can be simplified to $B(r; n, 0.5) \equiv {}^{n}C_{r} / 2^{n}$.

This computes the probability of selecting exactly r out of n cases. To test an observed probability, p = f/n, for significant difference from P at say, α = 0.05, we need to sum values. We carry out a two-tailed test, dividing α by 2 at both tail ends.

The 0.5 Binomial distribution is symmetric, so we only need to consider the formula for p < 0.5, and test for 1 – p otherwise. The observed lower tail area is obtained by summing from 0 to f.

Binomial test: $\sum_{r=0}^{f} B(r; n, 0.5) \le \alpha/2$.

In plain English this says, add up all values of the Binomial formula for P = 0.5 from r = 0 to f, and test if the result is less than α/2.

Figure: Applying the binomial distribution to a simple frequency comparison test for f1 = 108 and f2 = 65.

To return to our example:

  • Data: f1 = 108 (‘cogitate’) and f2 = 65 (‘intend’), n = 173, and f2 is lower.
  • The Binomial tail sum from 0 to 65 is 0.000669 (to six decimal places).
  • This is less than α/2 = 0.025. Hence the test is significant.
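The tail sum can be reproduced with exact integer arithmetic in Python. This is a minimal sketch: the function name `binomial_tail` is my own, and it relies on `math.comb` (available from Python 3.8) together with the simplification B(r; n, 0.5) = nCr / 2^n.

```python
from math import comb


def binomial_tail(f, n):
    """Sum B(r; n, 0.5) for r = 0..f, i.e. the lower tail area of the 0.5 Binomial."""
    # Sum the integer coefficients first, then divide once by 2^n.
    return sum(comb(n, r) for r in range(f + 1)) / 2 ** n


f1, f2 = 108, 65
n = f1 + f2
f = min(f1, f2)             # the distribution is symmetric, so test the lower frequency
alpha = 0.05

tail = binomial_tail(f, n)
significant = tail <= alpha / 2
print(f"tail sum = {tail:.6f}, significant = {significant}")
```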

Citation (abridged book version)

Wallis, S.A. 2021. Comparing Frequencies in the Same Distribution. Chapter 9 in Wallis, S.A. Statistics in Corpus Linguistics Research. New York: Routledge. 166-170.

References

Aarts, B., G. Leech, J. Close and S.A. Wallis (eds.) 2013. The Verb Phrase in English: Investigating recent language change with corpora. Cambridge: CUP.

Levin, M. 2013. The progressive verb in modern American English. Chapter 8 in Aarts et al. (2013).

Sheskin, D.J. 1997. Handbook of Parametric and Nonparametric Statistical Procedures. Boca Raton, FL: CRC Press.

Wallis, S.A. 2013a. Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods. Journal of Quantitative Linguistics 20:3, 178-208.

Wallis, S.A. 2013b. z-squared: the origin and application of χ². Journal of Quantitative Linguistics 20:4, 350-378.
