Comparing frequencies within a discrete distribution

This page explains how to compare observed frequencies f₁ and f₂ from the same distribution F = {f₁, f₂,…}.
To compare observed frequencies f₁ and f₂ from different distributions, i.e. where F₁ = {f₁,…} and F₂ = {f₂,…}, you need to use a chi-square or Newcombe-Wilson test.


In a recent study, my colleague Jill Bowie obtained a discrete frequency distribution by manually classifying cases in a small sample drawn from a large corpus.

Jill converted this distribution into a row of probabilities and calculated Wilson score intervals on each observation, to express the uncertainty associated with a small sample. She had one question, however:

How do we know whether the proportion of one quantity is significantly greater than another?

We might use a Newcombe-Wilson test (see Wallis 2013a), but this test assumes that we want to compare samples from independent sources. Jill’s data are drawn from the same sample, and all probabilities must sum to 1. Instead, the optimum test is a dependent-sample test.


A discrete distribution looks something like this: F = {108, 65, 6, 2}. This is the frequency data for the middle column (circled) in the following chart.

This may be converted into a probability distribution P, representing the proportion of examples in each category, by simply dividing by the total: P = {0.60, 0.36, 0.03, 0.01}, which sums to 1.
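As a minimal sketch of this conversion, the following Python computes P from F and adds a Wilson score interval to each proportion. The function name `wilson_interval` and the default α = 0.05 are my own choices for illustration, and the interval uses the standard Wilson formula without continuity correction.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(p, n, alpha=0.05):
    """Return the (lower, upper) Wilson score interval for proportion p out of n."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value, about 1.96
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - spread, centre + spread


F = [108, 65, 6, 2]           # 1960s frequencies from the text
total = sum(F)                # 181 cases in the sample
P = [f / total for f in F]    # probability distribution, sums to 1

for f, p in zip(F, P):
    lower, upper = wilson_interval(p, total)
    print(f"f = {f:3d}  p = {p:.2f}  Wilson interval = ({lower:.4f}, {upper:.4f})")
```

Note that these intervals are calculated against the whole sample (n = 181), which is what a chart of the full distribution would show.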

We can plot these probabilities, with Wilson score intervals, as shown below.


An example graph plot showing the changing proportions of meanings of the verb think over time in the US TIME Magazine Corpus, with Wilson score intervals, after Levin (2013). In this post we discuss the 1960s data (circled). The sum of each column probability is 1. Many thanks to Magnus for the data!

So how do we know if one proportion is significantly greater than another?

  • When comparing values diachronically (horizontally), data is drawn from independent samples. We may use the Newcombe-Wilson test, and employ the handy visual rule that if intervals do not overlap they must be significantly different.
  • However, probabilities drawn from the same sample (vertically) sum to 1 — which is not the case for independent samples! There are k−1 degrees of freedom, where k is the number of classes. It turns out that the relevant significance test we need to use is an extremely basic test, but it is rarely discussed in the literature.

The single sample z test

The single sample z test (Sheskin 1997:33) is the first test in Sheskin's handbook. He expresses the test in terms of frequencies, but I find it simpler to express it in terms of probabilities.

Given an observed probability p, and a known expected (population) probability P, the test checks to see if the two values are statistically significantly different.

If we know P we are permitted to calculate a Gaussian population interval about it (see also the interval equality principle in Wallis 2013a):

Population standard deviation S = √(P(1 − P)/n),

where n represents the number of cases in the sample. We can then employ the equation to find a z-score:

z = (p − P)/S.

This score is signed by the direction of change, so to turn it into a significance test, we test whether the absolute value |z| is greater than the critical value of z at a given error level α, written zα/2.

We can reformulate this test as simply checking whether p is outside the range P ± zα/2.S.
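A rough Python sketch of the test in both forms (z-score and interval) might look like this. The function name `single_sample_z` and the example figures are hypothetical, chosen only to illustrate the arithmetic; they are not data from the study.

```python
from math import sqrt
from statistics import NormalDist


def single_sample_z(p, P, n, alpha=0.05):
    """Single-sample z test of an observed p against a known population P."""
    S = sqrt(P * (1 - P) / n)                     # population standard deviation
    z = (p - P) / S                               # signed z-score
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # critical value z(alpha/2)
    interval = (P - z_crit * S, P + z_crit * S)   # Gaussian interval about P
    significant = abs(z) > z_crit                 # equivalent to p lying outside the interval
    return z, interval, significant


# Hypothetical illustration (not data from the study): p = 0.6, P = 0.5, n = 100.
z, (lower, upper), significant = single_sample_z(0.6, 0.5, 100)
print(f"z = {z:.4f}, interval = ({lower:.4f}, {upper:.4f}), significant = {significant}")
```

Here S = 0.05 and z is 2 (up to rounding), which exceeds the critical value of about 1.96, so p = 0.6 falls just outside the interval about P.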

The single-sample population z test assumes that random observations are distributed about P according to a Normal (Gaussian) distribution, and tests to see if an observed p is outside of the marked interval range, P ± zα/2.S.

This test can be used for testing an observed probability against any expected population probability P (see also Wallis 2013b), but we want to use this test in a particular way.

Testing pairs of frequencies for significant difference

We can use this test to compare whether two frequencies, say, f₁ = 108 (‘cogitate’) and f₂ = 65 (‘intend’), are significantly different from one another.

  • For these two frequencies, n = f₁+f₂ = 173. Convert them into probabilities, p₁ = f₁/n = 0.6243 and p₂ = 1 − p₁ = 0.3757. Importantly, we ignore the rest of the distribution for the purposes of this calculation.
  • Test whether the observed result, p₁, is significantly different from P = 0.5 (since if p₁ = 0.5, then p₂ = p₁). Since p₁ and p₂ are linked (p₂ = 1 − p₁) we only need to test p₁.

We can plug this data into the single sample z test:

z = (p₁ − P)/S,


S = √(P(1 − P)/n) = √(0.25/173) = 0.0380,


z = 0.1243/0.0380 = 3.2692.

Since |z| > 1.95996, we can say that the two columns are significantly different in frequency at an error level of 0.05. In the 1960s data there are significantly more ‘cogitate’ uses of think than ‘intend’ uses.

To make this easier to check, I have provided an example calculation in the attached spreadsheet. This employs the device of converting z to χ² and then employing CHIDIST to obtain the error level. (This is provided for convenience, but you should not attribute “greater significance” to results with a lower error level.)
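For readers without the spreadsheet, the same steps can be sketched in Python. This is my own approximation of the calculation, not a copy of the spreadsheet: `compare_frequencies` is a hypothetical name, and instead of CHIDIST it uses the fact that the upper tail of χ² with one degree of freedom equals the two-tailed Normal area beyond |z|.

```python
from math import erfc, sqrt


def compare_frequencies(f1, f2):
    """Test f1 against f2 (same sample) via p1 = f1/(f1 + f2) against P = 0.5."""
    n = f1 + f2
    p1 = f1 / n
    S = sqrt(0.25 / n)            # sqrt(P(1 - P)/n) with P = 0.5
    z = (p1 - 0.5) / S
    chi_sq = z * z                # chi-square with one degree of freedom
    # Upper tail of chi-square (1 d.f.) = two-tailed Normal area beyond |z|.
    error_level = erfc(abs(z) / sqrt(2))
    return z, chi_sq, error_level


z, chi_sq, error_level = compare_frequencies(108, 65)
print(f"z = {z:.4f}, chi-square = {chi_sq:.4f}, error level = {error_level:.6f}")
```

For the 1960s pair this gives z of about 3.27, and an error level well below 0.05.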

There are other methods, but they are more complicated and still achieve the same result. You can use a 2 × 1 goodness of fit χ² test using the expected distribution E = {0.5n, 0.5n}. Alternatively, you can calculate the revised Wilson interval on p₁ (i.e. just for the pair p₁ and p₂) and test to see if 0.5 is outside it. See plotting p below.
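To see that these alternatives agree for our example, here is a short Python check of all three. It is a sketch under my own assumptions: α = 0.05, no continuity correction, and the standard Wilson formula applied to the pair only (n = f₁ + f₂).

```python
from math import sqrt
from statistics import NormalDist

f1, f2 = 108, 65
n = f1 + f2
p1 = f1 / n
z_crit = NormalDist().inv_cdf(0.975)   # two-tailed critical value for alpha = 0.05

# 1. Single-sample z test against P = 0.5.
z = (p1 - 0.5) / sqrt(0.25 / n)
z_test = abs(z) > z_crit

# 2. 2 x 1 goodness of fit chi-square against E = {0.5n, 0.5n}.
E = 0.5 * n
chi_sq = (f1 - E) ** 2 / E + (f2 - E) ** 2 / E
chi_test = chi_sq > z_crit ** 2        # critical chi-square (1 d.f.) is z_crit squared

# 3. Wilson score interval on p1: significant if it excludes 0.5.
denom = 1 + z_crit ** 2 / n
centre = (p1 + z_crit ** 2 / (2 * n)) / denom
spread = z_crit * sqrt(p1 * (1 - p1) / n + z_crit ** 2 / (4 * n * n)) / denom
wilson_test = not (centre - spread <= 0.5 <= centre + spread)

print(z_test, chi_test, wilson_test)
```

All three flags come out the same, and χ² equals z² up to rounding error.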

Relation with Wilson score intervals

Where does this leave us with our handy rule for examining charts with Wilson score intervals?

For two observations drawn from the same sample:

  1. Do the intervals overlap?
    • If no: the observations are significantly different.
  2. Does either observed probability fall within the other interval?
    • If yes: the observations are not significantly different.
  3. Otherwise test for significance using a single-sample z test.

We do not need to test for significance unless condition (3) applies, i.e. W, the new combined interval to be tested, is subject to the following limits:

max(w₁, w₂) ≤ W ≤ w₁+w₂,

where w₁, w₂ represent the inner Wilson interval widths.

Quick explanation

  1. w₁+w₂ is the minimum distance for non-overlapping intervals, and
  2. max(w₁, w₂) is the minimum distance where neither probability falls within the range of the other.
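The three-step rule can be sketched as a small Python function. The names `wilson_interval` and `compare_on_chart` are mine, and the intervals are computed against the whole sample, as they would appear on a chart of the full distribution.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(p, n, alpha=0.05):
    """Return the (lower, upper) Wilson score interval for proportion p out of n."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - spread, centre + spread


def compare_on_chart(f1, f2, total, alpha=0.05):
    """Apply the visual rule to two frequencies drawn from one sample of size total."""
    p_hi, p_lo = max(f1, f2) / total, min(f1, f2) / total
    lower_of_hi, _ = wilson_interval(p_hi, total, alpha)
    _, upper_of_lo = wilson_interval(p_lo, total, alpha)
    w1 = p_hi - lower_of_hi       # inner width of the larger proportion
    w2 = upper_of_lo - p_lo       # inner width of the smaller proportion
    d = p_hi - p_lo               # distance between the two observations
    if d > w1 + w2:
        return "significant (intervals do not overlap)"
    if d < max(w1, w2):
        return "not significant (one point falls within the other interval)"
    return "undecided (run the single-sample z test on the pair)"


print(compare_on_chart(108, 65, 181))   # 'cogitate' vs. 'intend', 1960s
```

For ‘cogitate’ against ‘intend’ the intervals do not overlap, so step 1 settles the question without a further test.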

The proof that W is within these limits may be of interest. It is a little more complicated than for W = √(w₁² + w₂²) (the Newcombe-Wilson interval), which is within these limits due to simple algebra.


In extremis, we have only two frequencies in the original distribution, i.e. two probabilities where p₂ = 1 − p₁. Including a non-zero third or fourth probability (as in the graph above) has the effect of loosening this coupling (increasing the number of degrees of freedom and increasing n). The more cases fall in other categories, the more the test converges on the independent sample (Newcombe-Wilson) test, which has the same property.

In the two-category case, if and only if the Gaussian (z, Normal) interval for P=0.5 includes p₁, the corresponding Wilson interval for p₁ must include P, by the interval equality principle. As p₂ mirrors p₁, the Wilson interval for p₂ also mirrors that of p₁. It will also include P, and therefore the intervals will overlap at P=0.5.
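This relationship can be checked numerically. The sketch below is a brute-force check of my own, not a proof: it runs through every possible f₁ for n = 173 at α = 0.05 and counts cases where the Gaussian and Wilson tests disagree, or where the Wilson interval for p₂ fails to mirror that of p₁.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist().inv_cdf(0.975)   # critical value for alpha = 0.05


def wilson_interval(p, n, z=Z):
    """Return the (lower, upper) Wilson score interval for proportion p out of n."""
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - spread, centre + spread


n, P = 173, 0.5
S = sqrt(P * (1 - P) / n)
mismatches = 0        # Gaussian and Wilson tests disagree
mirror_failures = 0   # interval for p2 is not the mirror image of that for p1

for f1 in range(n + 1):
    p1 = f1 / n
    gaussian_includes_p1 = abs(p1 - P) <= Z * S
    lower1, upper1 = wilson_interval(p1, n)
    wilson_includes_P = lower1 <= P <= upper1
    if gaussian_includes_p1 != wilson_includes_P:
        mismatches += 1
    lower2, upper2 = wilson_interval(1 - p1, n)
    if abs(lower2 - (1 - upper1)) > 1e-12 or abs(upper2 - (1 - lower1)) > 1e-12:
        mirror_failures += 1

print(mismatches, mirror_failures)
```

Both counts should be zero if the interval equality principle and the mirroring described above hold.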

Plotting p

We can plot p₁ for Magnus Levin’s example data with Wilson score intervals. In the following graph we have performed pairwise comparisons on nearest neighbours (ordered by frequency). We can see that quotative vs. interpretive uses of think are not statistically significantly different (at an error level α=0.05) for 1920s (no data) and 1960s (the interval crosses P=0.5). Note that p₁ here is the proportion of the first value out of the pair of values, not out of all values, as we are comparing each pair of frequencies independently.

Pairwise comparison (n=f₁+f₂) of nearest neighbours, plotting p₁ = f₁/n with Wilson score intervals (α=0.05). If the interval excludes P=0.5 (dotted) the two frequencies are significantly different.

This graph bears on the central point of this discussion.

  • We are comparing the ratio of two observed frequencies, which converts to a simple probability value p₁. We simply employ a goodness of fit test against P=0.5 to verify that they are different.
  • Three different methods obtain precisely the same result (Wallis 2013b). These are the single-sample z test, 2 × 1 χ² test or Wilson interval test. The intervals in the graph above are Wilson score intervals on the probability of the first element of each pair, which we term p₁.
  • We should not apply a 2 × 2 test (such as the Newcombe-Wilson or χ² homogeneity test) designed to test independent probabilities to compare dependent ones.

Applying a continuity-correction

A continuity-corrected z-score (for small samples) is obtained by subtracting 1/(2n) from the absolute difference:

z = (|p₁ − P| − 1/(2n))/S.

The procedure is as before. Although in this formula z is unsigned, the direction is obvious. I have included a second column in the spreadsheet to perform this calculation.
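As a minimal sketch, the corrected score for our example can be computed like this in Python (the function name is mine, and this is not the spreadsheet's own code):

```python
from math import sqrt
from statistics import NormalDist


def z_continuity_corrected(f1, f2, P=0.5):
    """Continuity-corrected z for p1 = f1/(f1 + f2) against P (unsigned)."""
    n = f1 + f2
    p1 = f1 / n
    S = sqrt(P * (1 - P) / n)
    return (abs(p1 - P) - 1 / (2 * n)) / S


z_cc = z_continuity_corrected(108, 65)
z_crit = NormalDist().inv_cdf(0.975)   # critical value for alpha = 0.05
print(f"z = {z_cc:.4f}, significant = {z_cc > z_crit}")
```

The correction lowers z from about 3.27 to about 3.19, which is still well above the critical value.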

Robert Newcombe also offers a continuity-corrected version of the Wilson score interval (Wallis 2013a).

Exact Binomial method

The methods we have discussed thus far employ the Gaussian (Normal) approximation to the Binomial distribution. This approximation is not exact. Correcting for this by employing a continuity-correction is conservative, i.e. we fail to find properly significant results in some cases.

A viable alternative where n < 200 (or so) is the Binomial test (Wallis 2013a). For this test we use the Binomial formula with P = 0.5:

Binomial probability B(r; n, P) ≡ nCr · Pʳ(1 − P)ⁿ⁻ʳ.

Since P = 1 − P = 0.5, this formula can be simplified to B(r; n, 0.5) ≡ nCr / 2ⁿ.

This computes the probability of selecting exactly r out of n cases. To test an observed probability, p = f/n, for significant difference from P at say, α = 0.05, we need to sum values. We carry out a two-tailed test, dividing α by 2 at both tail ends.

The 0.5 Binomial distribution is symmetric, so we only need to consider the formula for p < 0.5, and test for 1 − p otherwise. The observed lower tail area is obtained by summing from 0 to f.

Binomial test: Σ B(r; n, 0.5) ≤ α/2, summing over r = 0..f.

In plain English this says: add up all values of the Binomial formula for P = 0.5 from r = 0 to f, and test whether the result is no greater than α/2.

Applying the binomial distribution to a simple frequency comparison test for f₁ = 108 and f₂ = 65.

To return to our example:

  • Data: f₁ = 108 (‘cogitate’) and f₂ = 65 (‘intend’), n = 173, and f₂ is lower.
  • The Binomial tail sum from 0 to 65 is 0.000669 (to six decimal places).
  • This is less than α/2 = 0.025. Hence the test is significant.
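The tail sum is easy to reproduce. The sketch below uses Python's exact integer arithmetic (`math.comb`), so no approximation is involved until the final division; the function names are my own.

```python
from math import comb


def binomial_tail(f, n):
    """Lower tail area: the sum of B(r; n, 0.5) = nCr / 2^n for r = 0..f."""
    return sum(comb(n, r) for r in range(f + 1)) / 2 ** n


def binomial_test(f1, f2, alpha=0.05):
    """Two-tailed exact test of f1 against f2 (same sample) with P = 0.5."""
    n = f1 + f2
    f = min(f1, f2)              # by symmetry, test the smaller frequency
    tail = binomial_tail(f, n)
    return tail, tail <= alpha / 2


tail, significant = binomial_test(108, 65)
print(f"tail = {tail:.6f}, significant = {significant}")
```

The printed tail area can be compared with the figure quoted above.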

References


Aarts, B., G. Leech, J. Close and S.A. Wallis (eds.) 2013. The Verb Phrase in English: Investigating recent language change with corpora. Cambridge: CUP.

Levin, M. 2013. The progressive verb in modern American English. Chapter 8 in Aarts et al. (2013).

Sheskin, D.J. 1997. Handbook of Parametric and Nonparametric Statistical Procedures. Boca Raton, FL: CRC Press.

Wallis, S.A. 2013a. Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods. Journal of Quantitative Linguistics 20:3, 178-208.

Wallis, S.A. 2013b. z-squared: the origin and application of χ². Journal of Quantitative Linguistics 20:4, 350-378.

