Introduction
In Wallis (2021), I offered two approaches to computing confidence intervals on the effect size Cramér’s ϕ. I also motivated and summarised approaches to a comparable goodness of fit metric (where a high ϕ score reflects a greater difference and thus a ‘poor fit’).
A goodness of fit evaluation is one where we compare an observed distribution of k cells, say, with an expected distribution of the same number of cells. The test, which is a type of χ^{2} test, has a number of applications. A goodness of fit ϕ score would be expected to range from 0 to 1, with 0 representing identity and 1 representing the opposite, a maximally distinct distribution.
In an earlier paper published on this blog (Wallis 2012), I considered a range of possible measures that had this property. However, one of the questions I had left unresolved was how to compute a confidence interval on such a measure.
Why might we want to do this?
- To cite or plot measures with confidence intervals, identifying the level of certainty we can ascribe to a particular observed measure.
- To compare ϕ with an arbitrary level, e.g. to test if ϕ ≠ D where D ≠ 0. (As we shall see, where k > 2 and ϕ unsigned, comparing goodness of fit ϕ with 0 is more difficult due to loss of information, and it is preferable to employ a goodness of fit test instead.)
- To compare two ϕ scores for their significant difference in a given direction, e.g. to establish that, say, ϕ_{1} > ϕ_{2}.
Summing independent, dependent and constrained variances
The Bienaymé theorem serves for computing the total variance of the sum of k independent Normally distributed variables by simple summation of variance.
Bienaymé variance s^{2} = s_{1}^{2} + s_{2}^{2} + … + s_{k}^{2} = ∑s_{i}^{2}. (1)
A total standard deviation s is obtained by taking the square root of Equation (1).
To estimate a confidence interval on a sum of k independent proportions, ∑p_{i}, we follow Zou and Donner (2008). A confidence interval on a sum of proportions may be obtained by substituting interval widths, u^{–} = (p – w^{–}) and u^{+} = (w^{+} – p), for each s_{i} term in the equation. The confidence interval is then found with the square root of the result. The constant z_{α/2} factors out. See An algebra of intervals.
independent sum (L, U) = (∑p_{i} – √(∑(p_{i} – w_{i}^{–})²), ∑p_{i} + √(∑(w_{i}^{+} – p_{i})²)). (1′)
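Equation (1′) is straightforward to implement. The sketch below (in Python; the function names are mine, not from the source) computes a Wilson score interval for each proportion and combines them by the Zou–Donner rule:

```python
import math

def wilson(p, n, z=1.959964):
    # Wilson score interval (w-, w+) for a proportion p with sample size n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half

def independent_sum(ps, ns, z=1.959964):
    # Zou & Donner (2008): interval for a sum of independent proportions,
    # Equation (1'): combine the interval widths by Bienayme summation
    bounds = [wilson(p, n, z) for p, n in zip(ps, ns)]
    total = sum(ps)
    lower = total - math.sqrt(sum((p - lo) ** 2 for p, (lo, _) in zip(ps, bounds)))
    upper = total + math.sqrt(sum((hi - p) ** 2 for p, (_, hi) in zip(ps, bounds)))
    return lower, upper
```

For a single proportion the result collapses to the Wilson interval itself, which is a useful sanity check.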
This assumes that all of these proportions are independent. But what of chi-square-type scenarios, where there are k – 1 degrees of freedom for k proportions summing to 1?
Obviously, we are not interested in the confidence interval for ∑p_{i}, as this must be 1 (or [1, 1] if you prefer). But we are interested in confidence intervals for the sum of functions of p_{i}, ∑fn(p_{i}). Zou and Donner argue that equations of this type should be sound provided that the original interval is sound.
Consider the simplest two-valued 2 × 1 goodness of fit χ^{2}. As we know, the two proportions are completely dependent. If p_{1} increases, p_{2} = 1 – p_{1} must fall. The table has a single degree of freedom. Consequently, standard deviations and interval positions are simply summed.
total standard deviation s = s_{1} + s_{2}. (2)
dependent sum (L, U) = (∑fn(w_{i}^{–}), ∑fn(w_{i}^{+})), (2′)
for an increasing monotonic function, fn, over P = [0, 1]. We will discuss other function types below.
Another way of thinking about this is that independent variables are considered to vary at right angles (orthogonally) to each other, whereas strictly dependent variables vary along the same axis. In some circumstances this means variables subtract and even cancel each other out; in others (like χ^{2}) they sum.
How do we generalise this idea to closed k × 1 goodness of fit χ^{2} tables, where there are k – 1 degrees of freedom? Now there are fewer dimensions than variables.
Generalising χ^{2}
We are used to thinking about chi-square (χ^{2}) as a test procedure, either in a homogeneity or goodness of fit evaluation. But it can also be thought of as the sum of k squares of standardised random variables that have a Normal (Gaussian) distribution. Their variance can be computed with Equation (1).
Test evaluations (sometimes specified as Pearson’s χ^{2}) are implementations of this principle, where k is the number of degrees of freedom in the test. As summing independent variances will clearly tend to lead to ever-higher totals, critical values of chi-square also increase with degrees of freedom.
If x_{1}, x_{2},… x_{k} are k independent Normally-distributed variables with different means and variances (which we can write as x_{i} ~ N(μ_{i}, σ_{i}^{2}) for i ∈ {1, 2,… k}), then the sum of the square of the standardised variables, z_{i}, is chi-square distributed with k degrees of freedom. We may write
w = ∑(x_{i} – μ_{i})^{2}/σ_{i}^{2} = ∑z_{i}^{2} ~ χ^{2}(k),
where z_{i} = (x_{i} – μ_{i})/σ_{i} is the standardised Normal variable, i.e. so that z_{i} ~ N(0, 1).
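This characterisation is easy to check numerically. The following simulation sketch (standard-library Python only; parameters are mine) draws sums of k squared standardised Normal deviates and confirms that their sample mean and variance approximate k and 2k, the moments of the χ²(k) distribution:

```python
import random

# Empirical check: the sum of k squared standardised Normal deviates
# behaves like chi-square with k degrees of freedom (mean k, variance 2k)
random.seed(1)
k, trials = 3, 100_000
samples = [sum(random.gauss(0.0, 1.0) ** 2 for _ in range(k))
           for _ in range(trials)]
mean = sum(samples) / trials
var = sum((x - mean) ** 2 for x in samples) / trials
# mean should be close to k = 3, var close to 2k = 6
```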
For functions of the goodness of fit chi-square type we simply rescale the variance and interval by the factor kappa, κ = k/df = k/(k – 1).
k-constrained variance s^{2} = κ∑s_{i}^{2}. (3)
Similarly, we have
k-constrained sum (L, U) = (∑p_{i} – √(κ∑(p_{i} – w_{i}^{–})²), ∑p_{i} + √(κ∑(w_{i}^{+} – p_{i})²)). (3′)
In the following example these functions converge on (2) and (2′) respectively.
Goodness of fit ϕ_{p}
Goodness of fit measure ϕ_{p} is a ‘root mean square’ fit score that cannot exceed 1. It is a ‘mutual’ goodness of fit because the two distributions could be treated as equivalent, whereas ϕ_{e}, which we discuss below, is a proper subset goodness of fit (for a discussion, see Wallis 2012).
Wallis (2021: 229) offers three formulae: Equation (4) is the simplest on which to base a confidence interval.
ϕ_{p} = √(½∑(p_{i} – P_{i})²), (4)
where each observed proportion p_{i} is free to vary on a k-constrained basis (∑p_{i} = 1), whereas each P_{i} is constant (and also sums to 1). Let the independent Wilson score confidence interval for each observed p_{i} be (w_{i}^{–}, w_{i}^{+}).
As with Cramér’s 2 × 2 ϕ, we might note that goodness of fit ϕ also has a simpler two-value signed solution. Wallis (2021: 229) observes that ϕ_{p} is equal to unsigned simple difference | d | where d = p_{1} – P_{1}. So we can simply define
signed ϕ_{p} = d = p_{1} – P_{1}, (5)
which has the displaced Wilson confidence interval
signed ϕ_{p} ∈ (w_{1}^{–} – P_{1}, w_{1}^{+} – P_{1}). (6)
I have updated my popular 2 × 2 chi-square spreadsheet to calculate intervals for signed ϕ_{p}.
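Equations (5) and (6) amount to a few lines of code. A minimal Python sketch (helper names and the illustrative numbers are mine, not from the source):

```python
import math

def wilson(p, n, z=1.959964):
    # Wilson score interval (w-, w+) for a proportion p with sample size n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half

def signed_phi_p(o1, n, P1, z=1.959964):
    # Equations (5) and (6): signed phi_p = p1 - P1, with the displaced
    # Wilson interval (w1- - P1, w1+ - P1)
    p1 = o1 / n
    w_lo, w_hi = wilson(p1, n, z)
    return p1 - P1, (w_lo - P1, w_hi - P1)
```

For example, with a hypothetical o₁ = 45 out of n = 100 against P₁ = 0.5, the signed score is –0.05 and its interval straddles zero.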
However, the signed score does not generalise to more than 2 cells. Unsigned ϕ_{p} cannot be less than zero, so testing if ϕ_{p} ≠ 0 is unlikely to be fruitful (see below). We cannot use it as a substitute test for significant difference from zero.
For k > 2, let us define a squared difference function, ‘sqd(p_{i})’, for each term. The distribution {P_{i}} is given for our purposes, so we omit the parameter for brevity.
sqd(p_{i}) = (p_{i} – P_{i})^{2}/2. (7)
We can now compute intervals for ϕ_{p}^{2} = ∑sqd(p_{i}), obtaining the interval for ϕ_{p} as a final step.
Equation (7) is a U-shaped non-monotonic function with a local minimum at P_{i}. The best way to see this is with a sketch (Figure 2).
To address non-monotonic functions, we must examine the interval (w_{i}^{–}, w_{i}^{+}). The function ‘sqd’ has a single turning point at P_{i}, so any interval must fall into one of three possible conditions: increasing, decreasing and non-monotonic.
Where the interval does not contain a turning point, the interval may be treated as if it were monotonic. This obtains a slightly conservative outcome because the interval has a tail.
- Suppose P_{i} = 0.3. The Wilson interval for p_{i} = 0.1, (0.0435, 0.2136), models that there is a 2.5% chance that the true value is greater than 0.2136 (Figure 2, left). The chance that the true value is greater than 0.3 is ~0.001. However, once transformed (0.0022, 0.0341), only that part of the lower tail that could fold back and fall within the interval could behave conservatively. This outcome has a negligible probability of about 8×10^{-6}, i.e. 1 in 125,000. In Figure 3 below, the reflected tail still appears to fall below the lower bound (brown dashed line), beyond the interval.
- On the other hand, consider p_{i} = 0.4, whose interval contains a turning point. The function is non-monotonic within the interval, the minimum is the local minimum (0) and the maximum is the simple maximum of the transformed bounds (0.0008, 0.0331). This transformation is a more conservative interval, because the interval on the shorter side (0.0008) folds back on itself. Indeed, with p_{i} = 0.4, nearly 100% of the lower tail falls within the interval, yielding an effective overall performance of α/2 rather than α. In Figure 3, the blue reflected tail falls far short of the upper bound.
As well as being conservative, this transformation also causes the metric to lose information, a point we will return to in the conclusions. Information loss (loss of direction) is an unavoidable consequence of collapsing degrees of freedom in a metric, but we may be able to address conservatism.
Figure 3 plots the transformed sqd(p_{i}) distributions, computed by delta approximation, seen on the y-axis in Figure 2. This allows us to see the behaviour of the tail areas close to zero (inset).
We are now ready to formalise this model.
Let d_{i}^{–} and d_{i}^{+} be the lower and upper interval bounds for each term. By inspection we have
(d_{i}^{–}, d_{i}^{+}) = (sqd(w_{i}^{–}), sqd(w_{i}^{+})) if w_{i}^{–} > P_{i},
(sqd(w_{i}^{+}), sqd(w_{i}^{–})) if w_{i}^{+} < P_{i},
(0, max(sqd(w_{i}^{–}), sqd(w_{i}^{+}))) otherwise. (8)
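In code, the three conditions of Equation (8) amount to a simple case analysis. A Python sketch (function names are mine):

```python
def sqd(p, P):
    # Equation (7): squared difference term
    return (p - P) ** 2 / 2

def sqd_bounds(w_lo, w_hi, P):
    # Equation (8): project the Wilson interval (w-, w+) through sqd,
    # allowing for the turning point at P
    if w_lo > P:                 # interval wholly above P: sqd is increasing
        return sqd(w_lo, P), sqd(w_hi, P)
    if w_hi < P:                 # interval wholly below P: sqd is decreasing
        return sqd(w_hi, P), sqd(w_lo, P)
    # interval straddles P: lower bound 0, upper bound the larger transform
    return 0.0, max(sqd(w_lo, P), sqd(w_hi, P))
```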
We construct an interval (ϕ_{p}^{–}, ϕ_{p}^{+}) about ϕ_{p} as follows.
For the dependent two-valued case (df = 1), ϕ_{p} obtains the deterministic sum. By Equation (2′) the interval for ϕ_{p}^{2} is simply
dependent interval (L^{2}, U^{2}) = (∑d_{i}^{–}, ∑d_{i}^{+}), (9)
which we can quote for ϕ_{p} by taking the square root of each bound.
Example 1: correlating the present perfect in DCPSE, k = 2 source corpora
Wallis (2021: 231, building on Wallis 2012) summarises an attempt to explore whether present perfect verb forms correlate more closely with present or past verb forms in multiple different subdivisions in DCPSE. Let us begin with the simplest example.
This experiment had a relatively unusual design. The present perfect distribution is treated as an observation, O, and the present and past verb forms are treated as two different expected distributions to be ‘fit’ to, E_{1} and E_{2}. This means we need only calculate w_{i}^{–} and w_{i}^{+} once, even if we supply alternative values for P_{i}. However, the method we describe below does not rely on this, and the two scores are still considered independent (thanks to the independence of the two prior distributions {P_{i}}).
Consider the data in Table 1, drawn from Wallis (2012). We will use the Wilson score interval at an error level α of 0.05.
|        | o     | p      | w^{–}  | w^{+}  | P      | sqd(w^{–}) | sqd(w^{+}) |
|--------|-------|--------|--------|--------|--------|------------|------------|
| LLC    | 2,488 | 0.4799 | 0.4664 | 0.4935 | 0.4913 | 0.000311   | 0.000003   |
| ICE-GB | 2,696 | 0.5201 | 0.5065 | 0.5336 | 0.5087 | 0.000003   | 0.000311   |
The signed difference d = p_{1} – P_{1} = -0.0114 has the interval obtained by Equation (6). This gives us (w_{1}^{–} – P_{1}, w_{1}^{+} – P_{1}) = (-0.0249, 0.0022), which includes zero.
To translate this to an absolute interval we negate it, –d = 0.0114 ∈ (-0.0022, 0.0249), and crop it at zero to give [0, 0.0249).
Equation (8) obtains the same result as the signed interval (Equation (6)) if the entire interval range is positive, and has the same absolute value for negative intervals.
However if an interval includes zero (as here), the lower bound will be cropped at zero.
In this case, the lower bound d_{i}^{–} = 0 as the interval includes the minimum (and is reflected for both values of i), and the maximum d_{i}^{+} = 0.000311. This produces a positive interval of [0, 0.0249). The square bracket means that 0 is included in the range.
Finally, let us apply Equation (3′) with the ‘sqd’ transform to obtain the interval for ϕ_{p}^{2}.
We obtain the following
k-constrained sum (L^{2}, U^{2}) = (ϕ_{p}^{2} – √(κ∑(sqd(p_{i}) – d_{i}^{–})²), ϕ_{p}^{2} + √(κ∑(d_{i}^{+} – sqd(p_{i}))²)), (10)
where κ = 2/1 = 2, ϕ_{p}^{2} = 0.000129. This also obtains the interval [0, 0.0249).
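The arithmetic of Example 1 can be checked with a short script implementing Equations (4), (7), (8) and (10). This is a Python sketch under the same assumptions as the text (helper names are mine):

```python
import math

def wilson(p, n, z=1.959964):
    # Wilson score interval (w-, w+) for a proportion p with sample size n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half

def sqd(p, P):
    # Equation (7): squared difference term
    return (p - P) ** 2 / 2

def sqd_bounds(w_lo, w_hi, P):
    # Equation (8): project (w-, w+) through sqd, minding the turning point at P
    if w_lo > P:
        return sqd(w_lo, P), sqd(w_hi, P)
    if w_hi < P:
        return sqd(w_hi, P), sqd(w_lo, P)
    return 0.0, max(sqd(w_lo, P), sqd(w_hi, P))

# Table 1 data: LLC and ICE-GB frequencies, prior distribution {P_i}
o = [2488, 2696]
P = [0.4913, 0.5087]
n, k = sum(o), len(o)
kappa = k / (k - 1)                                  # = 2 here

p = [x / n for x in o]
w = [wilson(pi, n) for pi in p]
d = [sqd_bounds(lo, hi, Pi) for (lo, hi), Pi in zip(w, P)]

phi2 = sum(sqd(pi, Pi) for pi, Pi in zip(p, P))      # phi_p squared
L2 = phi2 - math.sqrt(kappa * sum((sqd(pi, Pi) - lo) ** 2
                                  for pi, Pi, (lo, _) in zip(p, P, d)))
U2 = phi2 + math.sqrt(kappa * sum((hi - sqd(pi, Pi)) ** 2
                                  for pi, Pi, (_, hi) in zip(p, P, d)))
interval = (math.sqrt(max(L2, 0.0)), math.sqrt(U2))  # crop at zero
```

The computed interval reproduces [0, 0.0249) to rounding error.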
If we adjust Table 1 slightly by shifting data in the frequency column so that the interval no longer includes zero, all three methods obtain the same result (after accounting for the sign).
Example 2: correlating the present perfect, k = 10 text categories
Again, we compute the confidence interval for the observed, present perfect forms. The continuity-corrected score interval may be employed if desired instead, or an alternative (reliable) confidence interval for the Binomial proportion may be substituted.
|                          | p      | w^{–}  | w^{+}  |
|--------------------------|--------|--------|--------|
| formal face-to-face      | 0.1174 | 0.1094 | 0.1259 |
| informal face-to-face    | 0.4326 | 0.4199 | 0.4454 |
| telephone conversations  | 0.0603 | 0.0545 | 0.0668 |
| broadcast discussions    | 0.1086 | 0.1008 | 0.1169 |
| broadcast interviews     | 0.0462 | 0.0410 | 0.0519 |
| spontaneous commentary   | 0.1056 | 0.0980 | 0.1138 |
| parliamentary language   | 0.0299 | 0.0258 | 0.0346 |
| legal cross-examination  | 0.0035 | 0.0022 | 0.0053 |
| assorted spontaneous     | 0.0161 | 0.0131 | 0.0197 |
| prepared speech          | 0.0799 | 0.0732 | 0.0871 |
Figure 4 reproduces a frequency distribution from Wallis (2012), but includes the score intervals, scaled to the same size as the observed distribution to convey their relative size. Intervals are multiplied by n to plot them on the frequency scale.
In three out of ten categories, the present forms (prior proportions P_{i}) fall within the 95% interval for the present perfect proportions. These categories are ‘telephone conversations’, ‘broadcast discussions’ and ‘spontaneous commentary’. In the other seven categories (including ‘legal cross-examination’), P_{i} is found outside of the interval. On the other hand, only one of the past tense form proportions is found inside the interval (‘prepared speech’).
We can compute the ϕ_{p}(present) score and intervals.
|                          | P      | sqd(p)   | sqd(w^{–}) | sqd(w^{+}) |     |
|--------------------------|--------|----------|------------|------------|-----|
| formal face-to-face      | 0.1047 | 0.000080 | 0.000011   | 0.000225   | ›   |
| informal face-to-face    | 0.4878 | 0.001526 | 0.002310   | 0.000901   | ‹   |
| telephone conversations  | 0.0580 | 0.000003 | 0.000006   | 0.000039   | 0 › |
| broadcast discussions    | 0.1098 | 0.000001 | 0.000040   | 0.000025   | 0 ‹ |
| broadcast interviews     | 0.0377 | 0.000036 | 0.000006   | 0.000101   | ›   |
| spontaneous commentary   | 0.1026 | 0.000005 | 0.000011   | 0.000063   | 0 › |
| parliamentary language   | 0.0215 | 0.000036 | 0.000009   | 0.000086   | ›   |
| legal cross-examination  | 0.0062 | 0.000004 | 0.000008   | 0.000000   | ‹   |
| assorted spontaneous     | 0.0207 | 0.000011 | 0.000029   | 0.000001   | ‹   |
| prepared speech          | 0.0511 | 0.000415 | 0.000244   | 0.000651   | ›   |

goodness of fit score ϕ_{p} = 0.046000
We determine bounds d_{i}^{–} and d_{i}^{+} with Equation (8). To read Table 3, consider the first line (‘formal face-to-face conversations’). The interval excludes P_{1} and is increasing, so d_{1}^{–} = 0.000011 and d_{1}^{+} = 0.000225. For ‘informal face-to-face conversations’ (second line), the interval is decreasing and exclusive of P_{2}, so d_{2}^{–} = 0.000901 and d_{2}^{+} = 0.002310. But for ‘telephone conversations’ the interval includes P_{3}, so d_{3}^{–} = 0 and d_{3}^{+} = 0.000039. We can now identify the bounds of each transformed interval.
|                          | d^{–}    | d^{+}    |
|--------------------------|----------|----------|
| formal face-to-face      | 0.000011 | 0.000225 |
| informal face-to-face    | 0.000901 | 0.002310 |
| telephone conversations  | 0.000000 | 0.000039 |
| broadcast discussions    | 0.000000 | 0.000040 |
| broadcast interviews     | 0.000006 | 0.000101 |
| spontaneous commentary   | 0.000000 | 0.000063 |
| parliamentary language   | 0.000009 | 0.000086 |
| legal cross-examination  | 0.000000 | 0.000008 |
| assorted spontaneous     | 0.000001 | 0.000029 |
| prepared speech          | 0.000244 | 0.000651 |
Each of these intervals (d_{i}^{–}, d_{i}^{+}) is the projected 95% interval of sqd(p_{i}), defined by Equation (8).
To compute the interval, we use Equation (10). First we calculate the squared difference between sqd(p_{i}) and d_{i}^{–} or d_{i}^{+} (depending on the bound), multiply by κ = 10/9 = 1.1111, and take the root of the sum of these squares. In the case of the lower bound the result is 0.000689.
This obtains an interval width on the ϕ_{p}^{2} scale, which we subtract from (lower bound) or add to (upper bound) ϕ_{p}^{2} = 0.002116 (see Equation (10)). Taking the square root of each bound obtains
ϕ_{p} ∈ (0.037776, 0.054776).
This method obtains two intervals as follows.
ϕ_{p}(present) ∈ (ϕ_{p}^{–}, ϕ_{p}^{+}) = (0.037776, 0.054776), and
ϕ_{p}(past) ∈ (ϕ_{p}^{–}, ϕ_{p}^{+}) = (0.057165, 0.072034).
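The whole procedure of this example — Wilson intervals, the sqd transform of Equation (8), and the k-constrained sum of Equation (10) — can be wrapped into one reusable function. A Python sketch under the same assumptions (names are mine):

```python
import math

def wilson(p, n, z=1.959964):
    # Wilson score interval (w-, w+) for a proportion p with sample size n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half

def sqd(p, P):
    # Equation (7)
    return (p - P) ** 2 / 2

def sqd_bounds(w_lo, w_hi, P):
    # Equation (8): project (w-, w+) through sqd, minding the turning point at P
    if w_lo > P:
        return sqd(w_lo, P), sqd(w_hi, P)
    if w_hi < P:
        return sqd(w_hi, P), sqd(w_lo, P)
    return 0.0, max(sqd(w_lo, P), sqd(w_hi, P))

def phi_p_interval(o, P, z=1.959964):
    # Equations (4), (8) and (10): phi_p with its k-constrained interval
    n, k = sum(o), len(o)
    kappa = k / (k - 1)
    p = [x / n for x in o]
    d = [sqd_bounds(*wilson(pi, n, z), Pi) for pi, Pi in zip(p, P)]
    phi2 = sum(sqd(pi, Pi) for pi, Pi in zip(p, P))
    width_lo = math.sqrt(kappa * sum((sqd(pi, Pi) - lo) ** 2
                                     for pi, Pi, (lo, _) in zip(p, P, d)))
    width_hi = math.sqrt(kappa * sum((hi - sqd(pi, Pi)) ** 2
                                     for pi, Pi, (_, hi) in zip(p, P, d)))
    lower = math.sqrt(max(phi2 - width_lo, 0.0))     # crop at zero
    return math.sqrt(phi2), (lower, math.sqrt(phi2 + width_hi))
```

As a sanity check, an observed distribution identical to {P_i} yields ϕ_p = 0 with a lower bound of 0, as expected.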
Testing for significant difference between scores
Both of these intervals are much greater than zero, but as we have noted, thanks to the unsigned score and information loss, testing against zero is pointless with these intervals. For this evaluation, a lower score means a closer correlation, i.e. a better ‘fit’, so all we can really say is that neither is a close ‘fit’.
Are the scores significantly different? Well, we can already see that the intervals are distinct so by the Wilson interval heuristic (Wallis 2021) they are significantly different!
To compare two independent observations, we employ the Bienaymé rule for the inner pair of intervals (cf. Newcombe-Wilson test). (Another way of thinking about this is we apply Equation (1′) but with –p_{1} instead of p_{1}. See An algebra of intervals.)
We test the difference ϕ_{d} = ϕ_{p}(2) – ϕ_{p}(1), where the index j ∈ {1, 2} stands for present and past respectively, against a zero-based interval (ϕ_{d}^{–}, ϕ_{d}^{+}) defined by the opposite bounds. Again, we employ the Pythagorean width approximation (Figure 1, left), with widths u_{1}^{–} = ϕ_{p}(1) – ϕ_{p}^{–}(1), u_{2}^{+} = ϕ_{p}^{+}(2) – ϕ_{p}(2), etc.
ϕ_{d}^{–} = –√((u_{1}^{–})² + (u_{2}^{+})²) = –0.0114, and
ϕ_{d}^{+} = √((u_{1}^{+})² + (u_{2}^{–})²) = 0.0112.
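Given two independent ϕ_p scores with their interval bounds, this zero-based difference interval can be sketched in a few lines of Python. Note that the point estimate ϕ_p(past) ≈ 0.0642 used in the test below is my own back-calculation from the quoted difference of 0.0182, not a figure stated in the text:

```python
import math

def diff_interval(phi1, lo1, hi1, phi2, lo2, hi2):
    # Newcombe-style zero-based interval for the difference phi2 - phi1
    # between two independent scores, combining the inner interval widths
    # by the Bienayme (Pythagorean) rule
    d_lo = -math.sqrt((phi1 - lo1) ** 2 + (hi2 - phi2) ** 2)
    d_hi = math.sqrt((hi1 - phi1) ** 2 + (phi2 - lo2) ** 2)
    return d_lo, d_hi
```

If the observed difference falls outside (ϕ_d^–, ϕ_d^+), the two scores are significantly different.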
In this case, the difference ϕ_{p}(past) – ϕ_{p}(present) = 0.0182, which is, as expected, considerably beyond the interval, and therefore the difference is significant.
Figure 5 illustrates significant increases whether measured across text categories (of uneven size), the 280 texts in DCPSE (of different lengths) or the two source corpora. In reading this graph we might note that the more data we have, the more reliable the estimate, as we have k – 1 independent observations to draw from. Also note that a smaller ϕ_{p} score represents a closer correlation or ‘fit’.
The higher scores for text categories are possibly due to text categories tending to sort texts into present- or past-referring groups. When Wallis (2012) randomly allocated texts to ‘pseudo-categories’ in the same proportions, the resulting average ϕ_{p} score fell.
Other goodness of fit metrics
Wallis (2012) identifies a number of other metrics based on goodness of fit χ^{2}, such as variance-weighted ϕ_{e}. Building on what we have learned, we can compute an interval on χ^{2} by rewriting it:
χ^{2} = ∑(o_{i} – e_{i})^{2}/e_{i} = n∑(p_{i} – P_{i})^{2}/P_{i}. (11)
First, we redefine sqd(p_{i}) = n(p_{i} – P_{i})^{2}/P_{i} and employ Equations (8) and (10) (with χ^{2} in place of ϕ_{p}^{2}) to obtain an interval on the goodness of fit χ^{2} score.
Second, we scale the result according to the relevant formula (Wallis 2021: 230).
variance-weighted ϕ_{e} = √(χ²/(n²∑1/e_{i})), (12)
where e_{i} = nP_{i}. (By inspection, we could dispense with n from Equation (11) and n^{2} from (12), summing 1/P_{i}, but this is a less general form.)
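Equations (11) and (12) can be sketched as follows (Python; function names are mine, not from the source):

```python
def gof_chi2(o, P):
    # Equation (11): goodness of fit chi-square from counts o and priors P,
    # with expected cell frequencies e_i = n * P_i
    n = sum(o)
    return sum((oi - n * Pi) ** 2 / (n * Pi) for oi, Pi in zip(o, P))

def phi_e(o, P):
    # Equation (12): variance-weighted phi_e = sqrt(chi2 / (n^2 * sum(1/e_i)))
    n = sum(o)
    return (gof_chi2(o, P) / (n ** 2 * sum(1 / (n * Pi) for Pi in P))) ** 0.5
```

For instance, with o = (30, 70) and P = (0.5, 0.5), χ² = 16 and ϕ_e = 0.2.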
Conclusions
These intervals can be used for estimating the certainty of metrics and comparing ϕ_{p} scores. However, for k > 2 they are inoperable for comparison with zero (testing for significant difference from zero).
First, the fact that the squared difference function (‘sqd’) is non-monotonic means that the interval is liable to exhibit information loss, most clearly at zero. We could see this with the two-value case: not only did the unsigned function lose information, but if just one of the intervals included P_{i}, the interval was not the same as that for the simple difference score, d = p_{1} – P_{1}. Even if we included zero in the range, where k > 2, it could not be properly used for testing for significant difference from zero.
Second, the interval is conservative, also due to this non-monotonicity property. For empirical purposes erring on the side of caution is recommended. But in some cases, a term might have the overwhelming majority of one tail falling inside the interval range, effectively converting an interval that excludes α chance events to one that excludes α/2.
An improved method could adjust for this problem by lowering the upper bound (otherwise max(d_{i}^{–}, d_{i}^{+})) until the sum of both tail areas is α. In the ‘texts’ example, 0.75 of the intervals (LLC) and 0.56 (ICE-GB) have this property, so this may be worth exploring. Note that if we are concerned with whether the difference is significant we can employ the unadjusted interval first.
Third, ϕ_{p} is unsigned. A ϕ_{p} of zero is only obtained by the identity {p_{i}} = {P_{i}}. Aside from this case, neither sqd(p_{i}) nor the lower bound d_{i}^{–} can ever be less than zero (see Figure 2). See also Table 4.
Therefore to test for a significant difference between {p_{i}} and {P_{i}}, you should use a goodness of fit χ^{2} test. This does not suffer from information loss or conservatism in the same way, even as it collapses degrees of freedom into a single test value. To compare two goodness of fit tests, you can use the gradient goodness of fit meta-test (Wallis 2021: 250). This compares two different assessments, {p_{i}} vs. {P_{i}} and {p’_{i}} vs. {P’_{i}}, on a paired cell-by-cell basis.
Our test for comparing ϕ_{p} scores is comparable to this test, but compares the absolute aggregate scores. In brief, it permits us to determine whether two independently derived goodness of fit ϕ_{p} scores are sufficiently different for the outcome to be unlikely due to chance.
Our test will thus be more conservative than the gradient goodness of fit test: it loses information by generalising multiple degrees of freedom into a single metric and it loses the sign of differences.
As with all Wilson-based methods on this blog, it may be corrected for continuity and adjusted for small population or random-text sampling.
References
Wallis, S.A. (2012). Goodness of fit measures for discrete categorical data. London: Survey of English Usage, UCL. » Post
Wallis, S.A. (2021). Statistics in Corpus Linguistics Research. New York: Routledge.
Zou, G.Y. & A. Donner (2008). Construction of confidence limits about effect measures: A general approach. Statistics in Medicine, 27:10, 1693-1702.