Confidence intervals on goodness of fit ϕ scores


In Wallis (2021), I offered two approaches to computing confidence intervals on the effect size Cramér’s ϕ. I also motivated and summarised approaches to a comparable goodness of fit metric (where a high ϕ score reflects a greater difference and thus a ‘poor fit’).

A goodness of fit evaluation is one where we compare an observed distribution of k cells, say, with an expected distribution of the same number of cells. The test, which is a type of χ2 test, has a number of applications. A goodness of fit ϕ score would be expected to range from 0 to 1, with 0 representing identity and 1 representing the opposite, a maximally distinct distribution. 

In an earlier paper published on this blog (Wallis 2012), I considered a range of possible measures that had this property. However, one of the questions I had left unresolved was how to compute a confidence interval on such a measure.

Why might we want to do this?

  • To cite or plot measures with confidence intervals, identifying the level of certainty we can ascribe to a particular observed measure.
  • To compare ϕ with an arbitrary level, e.g. to test if ϕ ≠ D where D ≠ 0. (As we shall see, where k > 2 and ϕ unsigned, comparing goodness of fit ϕ with 0 is more difficult due to loss of information, and it is preferable to employ a goodness of fit test instead.)
  • To compare two ϕ scores for their significant difference in a given direction, e.g. to establish that, say, ϕ1 > ϕ2.

Summing independent, dependent and constrained variances

The Bienaymé theorem serves for computing the total variance of the sum of k independent Normally distributed variables by simple summation of variance.

Bienaymé variance s² = s1² + s2² + … + sk² = ∑si². (1)

A total standard deviation s is obtained by taking the square root of Equation (1).

To estimate a confidence interval on a sum of k independent proportions, ∑pi, we follow Zou and Donner (2008). A confidence interval on a sum of proportions may be obtained by substituting the interval widths, u– = (p – w–) and u+ = (w+ – p), for each si term in the equation. The confidence interval is then found by taking the square root of the result. The constant zα/2 factors out. See An algebra of intervals.

independent sum (L, U) = (∑pi – √∑(pi – wi–)², ∑pi + √∑(wi+ – pi)²). (1′)
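Equations (1) and (1′) can be sketched numerically (Python, with illustrative function names; the Wilson score interval is assumed for each proportion):

```python
import math

Z = 1.959964  # two-tailed z for alpha = 0.05

def wilson(p, n, z=Z):
    """Wilson score interval (w-, w+) for a proportion p observed in n trials."""
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - spread) / denom, (centre + spread) / denom

def independent_sum_interval(ps, ns):
    """Zou-Donner interval (L, U) on a sum of independent proportions:
    substitute the interval widths for each s_i term, Equation (1')."""
    total, lo2, hi2 = sum(ps), 0.0, 0.0
    for p, n in zip(ps, ns):
        w_lo, w_hi = wilson(p, n)
        lo2 += (p - w_lo) ** 2   # lower widths combine by Bienaymé
        hi2 += (w_hi - p) ** 2   # upper widths likewise
    return total - math.sqrt(lo2), total + math.sqrt(hi2)
```

For p = 0.1, n = 50, `wilson` reproduces the interval (0.0435, 0.2136) used in the sketches below.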

This assumes that all of these proportions are independent. But what of chi-square-type scenarios, where there are k – 1 degrees of freedom for k proportions summing to 1?

Obviously, we are not interested in the confidence interval for ∑pi, as this must be 1 (or [1, 1] if you prefer). But we are interested in confidence intervals for the sum of functions of pi, ∑fn(pi). Zou and Donner argue that equations of this type should be sound provided that the original interval is sound.

Consider the simplest two-valued 2 × 1 goodness of fit χ2. As we know, the two proportions are completely dependent. If p1 increases, p2 = 1 – p1 must fall. The table has a single degree of freedom. Consequently, standard deviations and interval positions are simply summed.

total standard deviation s = s1 + s2. (2)

dependent sum (L, U) = (∑fn(wi–), ∑fn(wi+)), (2′)

for an increasing monotonic function, fn, over P = [0, 1]. We will discuss other function types below.

Another way of thinking about this is that independent variables are considered to vary at right angles (tangents) to each other, whereas strictly dependent variables vary along the same axis. In some circumstances this means variables subtract and even cancel each other out; in others (like χ2) they sum.

Figure 1. Left: standard deviation of sum of independent variables x, y, z; right, summing standard deviations of two dependent variables on the same axis.

How do we generalise this idea to closed k × 1 goodness of fit χ2 tables, where there are k – 1 degrees of freedom? Now there are fewer dimensions than variables.

Generalising χ2

We are used to thinking about chi-square (χ2) as a test procedure, either in a homogeneity or goodness of fit evaluation. But it can also be thought of as the sum of k squares of standardised random variables that have a Normal (Gaussian) distribution. Their variance can be computed with Equation (1).

Test evaluations (sometimes specified as Pearson’s χ2) are implementations of this principle, where k is the number of degrees of freedom in the test. As summing independent variances will clearly tend to lead to ever-higher totals, critical values of chi-square also increase with degrees of freedom.

If x1, x2,… xk are k independent Normally-distributed variables with different means and variances (which we can write as xi ~ N(μi, σi²) for i ∈ {1, 2,… k}), then the sum of the squares of the standardised variables, zi, is chi-square distributed with k degrees of freedom. We may write

w = ∑(xi – μi)²/σi² = ∑zi² ~ χ²(k),

where zi = (xi – μi)/σi is the standardised Normal variable, i.e. so that zi ~ N(0, 1).

For functions of the goodness of fit chi-square type, we simply rescale the variance and interval widths by the factor κ = k/df = k/(k – 1).

k-constrained variance s² = κ∑si². (3)

Similarly, we have

k-constrained sum (L, U) = (∑pi – √κ∑(pi – wi–)², ∑pi + √κ∑(wi+ – pi)²). (3′)

In the following example these functions converge on (2) and (2′) respectively.

Goodness of fit ϕp

Goodness of fit measure ϕp is a ‘root mean square’ fit score that cannot exceed 1. It is a ‘mutual’ goodness of fit because the two distributions could be treated as equivalent, whereas ϕe, which we discuss below, is a proper subset goodness of fit. (For a discussion, see Wallis (2012)).

Wallis (2021: 229) offers three formulae: Equation (4) is the simplest on which to base a confidence interval.

ϕp = √(½∑(pi – Pi)²), (4)

where each observed proportion pi is free to vary on a k-constrained basis (∑pi = 1), whereas each Pi is constant (and also sums to 1). Let the independent Wilson score confidence interval for each observed pi be (wi–, wi+).

As with Cramér’s 2 × 2 ϕ, we might note that goodness of fit ϕ also has a simpler two-value signed solution. Wallis (2021: 229) observes that ϕp is equal to unsigned simple difference | d | where d = p1 – P1. So we can simply define

signed ϕp = d = p1 – P1, (5)

which has the displaced Wilson confidence interval

signed ϕp ∈ (w1– – P1, w1+ – P1). (6)

I have updated my popular 2 × 2 chi-square spreadsheet to calculate intervals for signed ϕp.

However, the signed score does not generalise to more than 2 cells. Unsigned ϕp cannot be less than zero, so testing if ϕp ≠ 0 is unlikely to be fruitful (see below). We cannot use it as a substitute test for significant difference from zero.

For k > 2, let us define a squared difference function, ‘sqd(pi)’, for each term. The distribution {Pi} is given for our purposes, so we omit the parameter for brevity.

sqd(pi) = (pi – Pi)²/2. (7)

We can now compute intervals for ϕp² = ∑sqd(pi), obtaining the interval for ϕp as a final step.
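A minimal sketch of Equations (4) and (7) (function names illustrative):

```python
import math

def sqd(p, P):
    """Squared difference term, Equation (7): sqd(p_i) = (p_i - P_i)^2 / 2."""
    return (p - P) ** 2 / 2

def phi_p(ps, Ps):
    """Goodness of fit phi_p = square root of the sum of sqd terms, Equation (4)."""
    return math.sqrt(sum(sqd(p, P) for p, P in zip(ps, Ps)))
```

In the two-cell case this reduces to the unsigned difference | d | = | p1 – P1 |, e.g. `phi_p([0.4799, 0.5201], [0.4913, 0.5087])` returns 0.0114.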

Equation (7) is a U-shaped non-monotonic function with a local minimum at pi = Pi. The best way to see this is with a sketch (Figure 2).

Figure 2. Applying a non-monotonic transformation sqd(pi) to intervals with Pi = 0.3, n = 50. Monotonic ranges fully exclude Pi.

To address non-monotonic functions, we must examine the interval (wi–, wi+). The function ‘sqd’ has a single turning point at Pi, so any interval must fall into one of three possible conditions: increasing, decreasing and non-monotonic.

Where the interval does not contain a turning point, it may be treated as if the function were monotonic. The outcome is at most slightly conservative, because only a small part of one tail can fold back into the interval.

  • Suppose Pi = 0.3. The Wilson interval for pi = 0.1, (0.0435, 0.2136), models a 2.5% chance that the true value is greater than 0.2136 (Figure 2, left). The chance that the true value is greater than 0.3 is ~0.001. However, once transformed to (0.0022, 0.0341), only the part of the lower tail that folds back and falls within the interval could behave conservatively. This outcome has a negligible probability of about 8×10⁻⁶, i.e. 1 in 125,000. In Figure 3 below, the reflected tail still appears to fall below the lower bound (brown dashed line), beyond the interval.
  • On the other hand, consider pi = 0.4, whose interval contains a turning point. The function is non-monotonic within the interval, so the minimum is the local minimum (0) and the maximum is the simple maximum of the transformed bounds (0.0008, 0.0331). The resulting interval is more conservative, because the interval on the shorter side (0.0008) folds back on itself. Indeed, with pi = 0.4, nearly 100% of the lower tail falls within the interval, yielding an effective overall performance of α/2 rather than α. In Figure 3, the blue reflected tail falls far short of the upper bound.

As well as being conservative, this transformation also causes the metric to lose information, a point we will return to in the conclusions. Information loss (loss of direction) is an unavoidable consequence of collapsing degrees of freedom in a metric, but we may be able to address conservatism. 

Figure 3 plots the transformed sqd(pi) distributions, computed by delta approximation, seen on the y-axis in Figure 2. This allows us to see the behaviour of the tail areas close to zero (inset). 

Figure 3. Plotting sqd(pi) distributions. Note how the distributions, and lower tails, ‘fold over’ at 0. For pi = 0.1, part of the tail folds over, but it makes no difference to the overall calculation. For pi = 0.4, however, the tail is ~100% within the interval. 

We are now ready to formalise this model.

Let di– and di+ be the lower and upper interval bounds for each term. By inspection we have

(di–, di+) = { (sqd(wi–), sqd(wi+))  if wi– > Pi,
               (sqd(wi+), sqd(wi–))  if wi+ < Pi,
               (0, max(sqd(wi–), sqd(wi+)))  otherwise. (8)
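Equation (8) reduces to a three-way branch in code. A sketch, assuming the Wilson score interval (helper names illustrative):

```python
import math

Z = 1.959964  # two-tailed z for alpha = 0.05

def wilson(p, n, z=Z):
    """Wilson score interval (w-, w+) for proportion p in n trials."""
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - spread) / denom, (centre + spread) / denom

def sqd(p, P):
    """Squared difference term, Equation (7)."""
    return (p - P) ** 2 / 2

def term_bounds(p, P, n):
    """Equation (8): bounds (d-, d+) for one sqd(p_i) term."""
    w_lo, w_hi = wilson(p, n)
    if w_lo > P:                 # interval wholly above P_i: sqd increasing
        return sqd(w_lo, P), sqd(w_hi, P)
    if w_hi < P:                 # interval wholly below P_i: sqd decreasing
        return sqd(w_hi, P), sqd(w_lo, P)
    # interval straddles the turning point at P_i
    return 0.0, max(sqd(w_lo, P), sqd(w_hi, P))
```

With the LLC data in Table 1 below (p = 2,488/(2,488 + 2,696), P = 0.4913, taking n as the sum of the two frequencies), the interval contains P, so `term_bounds` returns (0, 0.000311).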

We construct an interval (ϕp–, ϕp+) about ϕp as follows. 

For the dependent two-valued case (df = 1), ϕp obtains the deterministic sum. By Equation (2′) the interval for ϕp² is simply

 dependent interval (L², U²) = (∑di–, ∑di+), (9)

which we can quote for ϕp by taking the square root of each bound.

Example 1: correlating the present perfect in DCPSE, k = 2 source corpora

Wallis (2021: 231, building on Wallis 2012) summarises an attempt to explore whether present perfect verb forms correlate more closely with present or past verb forms in multiple different subdivisions in DCPSE. Let us begin with the simplest example.

This experiment had a relatively unusual design. The present perfect distribution is treated as an observation, O, and the present and past verb forms are treated as two different expected distributions to be ‘fit’ to, E1 and E2. This means we only need calculate wi and wi+ once, even if we supply alternate values for Pi. However the method we describe below is not reliant on this, and the two scores are still considered independent (thanks to the independence of the two prior distributions {Pi}).

Consider the data in Table 1, drawn from Wallis (2012). We will use the Wilson score interval at an error level α of 0.05.

  o p w– w+ P sqd(w–) sqd(w+)
LLC 2,488 0.4799 0.4664 0.4935 0.4913 0.000311 0.000003
ICE-GB 2,696 0.5201 0.5065 0.5336 0.5087 0.000003 0.000311

Table 1. Distribution of present perfect forms by source corpus, DCPSE: frequency o ∈ O, proportion p and 95% Wilson score intervals. Right: P = expected probability for present tense verb forms E1, n = 5,784, and sqd(w–), sqd(w+).

The signed difference d = p1 – P1 = -0.0114 has the interval obtained by Equation (6). This gives us (w1– – P1, w1+ – P1) = (-0.0249, 0.0022), which includes zero.

To translate this to an absolute interval we negate it, –d = 0.0114 ∈ (-0.0022, 0.0249), and crop it at zero [0, 0.0249).
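The signed calculation can be checked numerically; a sketch (names illustrative) taking n as the sum of the two frequencies in Table 1 and assuming the Wilson score interval:

```python
import math

Z = 1.959964  # two-tailed z for alpha = 0.05

def wilson(p, n, z=Z):
    """Wilson score interval (w-, w+) for proportion p in n trials."""
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - spread) / denom, (centre + spread) / denom

n = 2488 + 2696          # total frequency (Table 1)
p1, P1 = 2488 / n, 0.4913

d = p1 - P1                       # signed phi_p, Equation (5): -0.0114
w_lo, w_hi = wilson(p1, n)
lo, hi = w_lo - P1, w_hi - P1     # displaced Wilson interval, Equation (6)
```

This reproduces the interval (-0.0249, 0.0022) quoted above.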

Equation (8) obtains the same result as the signed interval (Equation (6)) if the entire interval range is positive, and has the same absolute value for negative intervals.

However if an interval includes zero (as here), the lower bound will be cropped at zero.

In this case, the lower bound di– = 0 as the interval includes the minimum (and is reflected for both values of i), and the maximum di+ = 0.000311. This produces a positive interval of [0, 0.0249). The square bracket means that 0 is included in the range.

Finally, let us apply Equation (3′) with the ‘sqd’ transform to obtain the interval for ϕp².

We obtain the following 

k-constrained sum (L², U²) = (ϕp² – √κ∑(sqd(pi) – di–)², ϕp² + √κ∑(di+ – sqd(pi))²), (10)

where κ = 2/1 = 2 and ϕp² = 0.000129. This also obtains the interval [0, 0.0249).

If we adjust Table 1 slightly by shifting data in the frequency column so that the interval no longer includes zero, all three methods obtain the same result (after accounting for the sign).

Example 2: correlating the present perfect, k = 10 text categories

Again, we compute the confidence interval for the observed present perfect forms. The continuity-corrected score interval may be employed instead if desired, or an alternative (reliable) confidence interval for the Binomial proportion may be substituted.

  p w– w+
formal face-to-face 0.1174 0.1094 0.1259
informal face-to-face 0.4326 0.4199 0.4454
telephone conversations 0.0603 0.0545 0.0668
broadcast discussions 0.1086 0.1008 0.1169
broadcast interviews 0.0462 0.0410 0.0519
spontaneous commentary 0.1056 0.0980 0.1138
parliamentary language 0.0299 0.0258 0.0346
legal cross-examination 0.0035 0.0022 0.0053
assorted spontaneous 0.0161 0.0131 0.0197
prepared speech 0.0799 0.0732 0.0871

Table 2. Computing 95% Wilson score interval widths for the observed present perfect distribution for text categories O = {npi} in DCPSE, n = 5,784 (data from Wallis 2012).

Figure 4 reproduces a frequency distribution from Wallis (2012), but includes the score intervals, scaled to the same size as the observed distribution to convey their relative size. Intervals are multiplied by n to plot them on the frequency scale.

Figure 4. Expected distributions (E1, E2) for present and past forms, scaled to the observed distribution O for present perfect forms (after Wallis 2012), with 95% Wilson score intervals.

In three out of ten categories, the present forms (prior proportions Pi) fall within the 95% interval for the present perfect proportions. These categories are ‘telephone conversations’, ‘broadcast discussions’ and ‘spontaneous commentary’. In the other seven categories (including ‘legal cross-examination’), Pi is found outside of the interval. On the other hand, only one of the past tense form proportions is found inside the interval (‘prepared speech’).

We can compute the ϕp(present) score and intervals.

  P sqd(p) sqd(w–) sqd(w+)
formal face-to-face 0.1047 0.000080 0.000011 0.000225
informal face-to-face 0.4878 0.001526 0.002310 0.000901
telephone conversations 0.0580 0.000003 0.000006 0.000039 0 ›
broadcast discussions 0.1098 0.000001 0.000040 0.000025 0 ‹
broadcast interviews 0.0377 0.000036 0.000006 0.000101
spontaneous commentary 0.1026 0.000005 0.000011 0.000063 0 ›
parliamentary language 0.0215 0.000036 0.000009 0.000086
legal cross-examination 0.0062 0.000004 0.000008 0.000000
assorted spontaneous 0.0207 0.000011 0.000029 0.000001
prepared speech 0.0511 0.000415 0.000244 0.000651
goodness of fit score ϕp 0.046000  

Table 3. Computing goodness of fit score ϕp for O = {npi} against E1 = {nPi} (present verb forms) and 95% confidence intervals for each summed term. Right: ‘0’ includes minimum (di– = 0, overruled ‘greened out’ bound), ‘›’ increasing, ‘‹’ decreasing. Bold: selected upper interval.

We determine bounds di– and di+ with Equation (8). To read Table 3, consider the first line (‘formal face-to-face conversations’). The interval excludes P1 and is increasing, so d1– = 0.000011 and d1+ = 0.000225. For ‘informal face-to-face conversations’ (second line), the interval is decreasing and exclusive of P2, so d2– = 0.000901 and d2+ = 0.002310. But for ‘telephone conversations’ the interval includes P3, so d3– = 0 and d3+ = 0.000039. We can now identify the bounds of each transformed interval.

  d– d+
formal face-to-face 0.000011 0.000225
informal face-to-face 0.000901 0.002310
telephone conversations 0.000000 0.000039
broadcast discussions 0.000000 0.000040
broadcast interviews 0.000006 0.000101
spontaneous commentary 0.000000 0.000063
parliamentary language 0.000009 0.000086
legal cross-examination 0.000000 0.000008
assorted spontaneous 0.000001 0.000029
prepared speech 0.000244 0.000651

Table 4. The 95% confidence interval bounds for each sqd(pi) score calculated independently.

Each of these intervals (di–, di+) is the projected 95% interval of sqd(pi), defined by Equation (8).

To compute the interval, we use Equation (10). First we calculate the squared difference between sqd(pi) and di– or di+ (depending on the bound), sum these squares, multiply by κ = 10/9 = 1.1111, and take the square root. In the case of the lower bound the result is 0.000689.

This obtains an interval width on the squared (ϕp²) scale, which we subtract from (lower bound) or add to (upper bound) ϕp² = 0.002116 (see Equation (10)). Taking the square root of each bound then obtains

ϕp ∈ (0.037776, 0.054776).

This method obtains two intervals as follows.

ϕp(present) ∈ (ϕp–, ϕp+) = (0.037776, 0.054776), and
ϕp(past) ∈ (ϕp–, ϕp+) = (0.057165, 0.072034).
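The full present-tense calculation can be reproduced from the proportions quoted in Tables 2 and 3 (a sketch with illustrative function names; small discrepancies in the final decimal places arise from rounding in the quoted proportions):

```python
import math

Z = 1.959964  # two-tailed z for alpha = 0.05

def wilson(p, n, z=Z):
    """Wilson score interval (w-, w+) for proportion p in n trials."""
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - spread) / denom, (centre + spread) / denom

def sqd(p, P):
    """Squared difference term, Equation (7)."""
    return (p - P) ** 2 / 2

def term_bounds(p, P, n):
    """Equation (8): bounds (d-, d+) for one sqd(p_i) term."""
    w_lo, w_hi = wilson(p, n)
    if w_lo > P:
        return sqd(w_lo, P), sqd(w_hi, P)
    if w_hi < P:
        return sqd(w_hi, P), sqd(w_lo, P)
    return 0.0, max(sqd(w_lo, P), sqd(w_hi, P))

def phi_p_interval(ps, Ps, n):
    """phi_p with its k-constrained interval, Equations (4), (8) and (10)."""
    k = len(ps)
    kappa = k / (k - 1)
    phi2 = sum(sqd(p, P) for p, P in zip(ps, Ps))
    lo2 = hi2 = 0.0
    for p, P in zip(ps, Ps):
        d_lo, d_hi = term_bounds(p, P, n)
        lo2 += (sqd(p, P) - d_lo) ** 2
        hi2 += (d_hi - sqd(p, P)) ** 2
    L2 = max(phi2 - math.sqrt(kappa * lo2), 0.0)  # crop at zero
    U2 = phi2 + math.sqrt(kappa * hi2)
    return math.sqrt(phi2), math.sqrt(L2), math.sqrt(U2)

# observed present perfect distribution O (Table 2) and
# expected present tense distribution E1 (Table 3)
ps = [0.1174, 0.4326, 0.0603, 0.1086, 0.0462,
      0.1056, 0.0299, 0.0035, 0.0161, 0.0799]
Ps = [0.1047, 0.4878, 0.0580, 0.1098, 0.0377,
      0.1026, 0.0215, 0.0062, 0.0207, 0.0511]
phi, lo, hi = phi_p_interval(ps, Ps, 5784)
```

This reproduces ϕp(present) ≈ 0.046 with interval ≈ (0.0378, 0.0548).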

Testing for significant difference between scores

Both of these intervals are much greater than zero, but as we have noted, thanks to the unsigned score and information loss, testing against zero is pointless with these intervals. For this evaluation, a lower score means a closer correlation, i.e. a better ‘fit’, so all we can really say is that neither is a close ‘fit’.

Are the scores significantly different? Well, we can already see that the intervals are distinct so by the Wilson interval heuristic (Wallis 2021) they are significantly different!

To compare two independent observations, we employ the Bienaymé rule for the inner pair of intervals (cf. Newcombe-Wilson test). (Another way of thinking about this is we apply Equation (1′) but with –p1 instead of p1. See An algebra of intervals.)

We test the difference ϕd = ϕp(2) – ϕp(1), where the index j ∈ {1, 2} stands for present and past respectively, against a zero-based interval (ϕd–, ϕd+) defined by the opposite bounds. Again, we employ the Pythagorean width approximation (Figure 1, left), with u1– = p1 – w1–, u2+ = w2+ – p2, etc.

ϕd– = –√((u1–)² + (u2+)²) = -0.0114, and
ϕd+ = √((u1+)² + (u2–)²) = 0.0112.

In this case, the difference ϕp(past) – ϕp(present) = 0.0182, which is, as expected, considerably beyond the interval, and therefore the difference is significant.
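The comparison can be sketched numerically. Note that ϕp(past) itself is not quoted above; the value 0.0642 below is inferred from the quoted difference of 0.0182 and is therefore an assumption:

```python
import math

# phi_p scores and 95% intervals for present (1) and past (2) forms
p1, w1_lo, w1_hi = 0.046, 0.037776, 0.054776
p2, w2_lo, w2_hi = 0.0642, 0.057165, 0.072034  # 0.0642 = 0.046 + 0.0182 (assumed)

# interval widths (Pythagorean approximation, Figure 1, left)
u1_lo, u1_hi = p1 - w1_lo, w1_hi - p1
u2_lo, u2_hi = p2 - w2_lo, w2_hi - p2

# zero-based interval for phi_d = phi_p(2) - phi_p(1), from the opposite bounds
phid_lo = -math.sqrt(u1_lo ** 2 + u2_hi ** 2)
phid_hi = math.sqrt(u1_hi ** 2 + u2_lo ** 2)

diff = p2 - p1
significant = diff < phid_lo or diff > phid_hi
```

This reproduces the interval (-0.0114, 0.0112) and confirms that the difference of 0.0182 falls outside it.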

Figure 5 illustrates significant increases whether measured across text categories (of uneven size), the 280 texts in DCPSE (of different lengths) or the two source corpora. In reading this graph we might note that the more data we have, the more reliable the estimate, as we have k – 1 independent observations to draw from. Also note that a smaller ϕp score represents a closer correlation or ‘fit’.

The higher scores for text categories are possibly due to text categories tending to sort texts into present- or past-referring groups. When Wallis (2012) randomly allocated texts to ‘pseudo-categories’ in the same proportions, the resulting average ϕp score fell.

Figure 5. Testing the difference between scores. ϕp(present) and ϕp(past), with 95% confidence intervals obtained from Equation (10).

Other goodness of fit metrics

Wallis (2012) identifies a number of other metrics based on goodness of fit χ2, such as variance-weighted ϕe. Building on what we have learned, we can compute an interval on χ2 by rewriting it:

χ² = ∑(oi – ei)²/ei = n∑(pi – Pi)²/Pi. (11)

We can redefine sqd(pi) = n(pi – Pi)²/Pi and employ Equations (8) and (10) (with χ² in place of ϕp²) to obtain an interval on the goodness of fit χ² score. 

Second, we scale the result according to the relevant formula (Wallis 2021: 230).

variance-weighted ϕe = √(χ² / (n²∑1/ei)), (12)

where ei = nPi. (By inspection, we could dispense with n from Equation (11) and n2 from (12), summing 1/Pi, but this is a less general form.)
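A sketch of this χ²-based variant (function names illustrative, with ei = nPi as above):

```python
import math

def sqd_chisq(p, P, n):
    """Chi-square form of the squared difference term, from Equation (11)."""
    return n * (p - P) ** 2 / P

def phi_e(ps, Ps, n):
    """Variance-weighted phi_e, Equation (12), with e_i = n P_i."""
    chisq = sum(sqd_chisq(p, P, n) for p, P in zip(ps, Ps))
    es = [n * P for P in Ps]
    return math.sqrt(chisq / (n * n * sum(1 / e for e in es)))
```

As the remark above notes, n cancels: `phi_e` returns the same score for any n, and the same interval machinery (Equations (8) and (10)) applies with `sqd_chisq` in place of `sqd`, working on the χ² scale.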


These intervals can be used for estimating the certainty of metrics and for comparing ϕp scores. However, for k > 2 they are unsuitable for comparison with zero (testing for significant difference from zero).

First, the fact that the squared difference function (‘sqd’) is non-monotonic means that the interval is liable to exhibit information loss, most clearly at zero. We saw this with the two-value case: not only did the unsigned function lose information, but if just one of the intervals included Pi, the interval was not the same as that for the simple difference score, d = p1 – P1. Even if we included zero in the range, where k > 2, it could not properly be used for testing for significant difference from zero.

Second, the interval is conservative, also due to this non-monotonicity property. For empirical purposes erring on the side of caution is recommended. But in some cases, a term might have the overwhelming majority of one tail falling inside the interval range, effectively converting an interval that excludes α chance events to one that excludes α/2.

An improved method could adjust for this problem by lowering the upper bound (otherwise max(di, di+)) until the sum of both tail areas is α. In the ‘texts’ example, 0.75 of the intervals (LLC) and 0.56 (ICE-GB) have this property, so this may be worth exploring. Note that if we are concerned with whether the difference is significant we can employ the unadjusted interval first.

Third, ϕp is unsigned. A ϕp of zero is obtained only by the identity {pi} = {Pi}. Aside from this case, neither sqd(pi) nor the lower bound di– can ever be less than zero (see Figure 2). See also Table 4.

Therefore to test for a significant difference between {pi} and {Pi}, you should use a goodness of fit χ2 test. This does not suffer from information loss or conservatism in the same way, even as it collapses degrees of freedom into a single test value. To compare two goodness of fit tests, you can use the gradient goodness of fit meta-test (Wallis 2021: 250). This compares two different assessments, {pi} vs. {Pi} and {p’i} vs. {P’i}, on a paired cell-by-cell basis.

Our test for comparing ϕp scores is comparable to this test, but compares the absolute aggregate scores. In brief, it permits us to determine whether two independently derived goodness of fit ϕp scores are sufficiently different for the outcome to be unlikely due to chance.

Our test will thus be more conservative than the gradient goodness of fit test: it loses information by generalising multiple degrees of freedom into a single metric and it loses the sign of differences.

As with all Wilson-based methods on this blog, it may be corrected for continuity and adjusted for small population or random-text sampling.


Wallis, S.A. (2012). Goodness of fit measures for discrete categorical data. London: Survey of English Usage, UCL. » Post

Wallis, S.A. (2021). Statistics in Corpus Linguistics Research. New York: Routledge.

Zou, G.Y. & A. Donner (2008). Construction of confidence limits about effect measures: A general approach. Statistics in Medicine 27:10, 1693-1702.

