In Plotting the Wilson distribution (Wallis 2018), I showed how it is possible to plot the distribution of the Wilson interval for all values of α. This exercise is revealing in a number of ways.
First, it shows the relationship between the observed proportion p and the distribution of probable values of the population proportion P.
Over the last few years I have become convinced that approaching statistical understanding from the perspective of the tangible observation p is more instructive and easier to conceptualise than approaching it (as is traditional) from the imaginary ‘true value’ in the population, P. In particular, whenever you conduct an experiment you want to know how reliable your results are (or, to put it another way, what range of values you might reasonably expect were you to repeat your experiment) — not just whether they are statistically significantly different from some arbitrary number, P!
Second, and as a result, just as it is possible to see the closeness of fit between the Binomial and the Normal distribution, through this exercise we can visualise the inverse relationship between Normal and Wilson distributions. We can see immediately that it is a fallacy to assume that the distribution of probable values about p is Normal, although numerous statistics books still quote ‘Wald’-type intervals and many methods operate on this assumption. (I am intermittently amused by plots of otherwise sophisticated modelling algorithms with impossibly symmetric intervals in probability space.)
Third, I showed in the paper that ‘the Wilson distribution’ is properly understood as two distributions: the distribution of probable values of P below and above p. If we employ a continuity-correction, the two distributions become clearly distinct.
This issue sometimes throws people. Compare:

(1) the distribution of probable values of P about p;
(2) the distribution of probable values of P below p; and
(3) the distribution of probable values of P above p.

Wilson distributions correspond to (2) and (3) above, obtained by finding the roots of the Normal approximation. See Wallis (2013). The sum, or mean, of these is not (1), as becomes clearer when we plot other related distributions.
There are a number of other interesting and important conclusions from this work, including that the logit Wilson interval is in fact almost Normal, except for p = 0 or 1.
In this post I want to briefly comment on some recent computational work I conducted in preparation for my forthcoming book (Wallis, in press). This involves plotting the Clopper-Pearson distribution.
This is the interval for p obtained by inverting the Binomial interval about P.
To arrive at the Wilson interval we undertake two steps: first, assume that the Binomial distribution may be adequately approximated with the Normal, and then, on that basis, extrapolate an inverted interval. But what if we cut out the first stage, and simply invert the Binomial interval about P?
In other words, we are concerned with the relationship between the Binomial distribution about P and the Clopper-Pearson interval about p obtained by inverting it.
Whereas the Normal and Wilson distributions are continuous, the Binomial distribution B is discrete, only being capable of generating values for true fractions p = r / n. This introduces an interesting issue below.
The probability distribution function (pdf) is very well known,
Binomial distribution B(r; n, P) ≡ nCr P^{r} (1 – P)^{(n – r)}, (1)
where nCr represents the combinatorial function for r out of n, e.g. the number of unique ways of obtaining r heads out of n coin tosses. On this basis, the corresponding cumulative distribution function (cdf), summing values from r₁ to r₂ inclusive can be simply written as
Cumulative Binomial distribution B(r₁, r₂; n, P) ≡ Σ_{r = r₁..r₂} nCr P^{r} (1 – P)^{(n – r)}. (2)
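Equations (1) and (2) can be sketched directly in Python (the function names are mine, not the author’s):

```python
from math import comb

def binomial_pdf(r, n, P):
    # B(r; n, P) = nCr * P^r * (1 - P)^(n - r), Equation (1)
    return comb(n, r) * P**r * (1 - P)**(n - r)

def binomial_cdf(r1, r2, n, P):
    # cumulative B(r1, r2; n, P): Equation (1) summed from r1 to r2 inclusive, Equation (2)
    return sum(binomial_pdf(r, n, P) for r in range(r1, r2 + 1))
```

Summing over the full range r = 0..n returns 1, as any probability distribution must.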
To obtain the Clopper-Pearson interval we must employ the interval equality principle. In Wallis (2013) we cited this as follows.
Interval equality principle:
lower bound w⁻ = P₁ where E₁⁺ = p and P₁ < p, (3)
upper bound w⁺ = P₂ where E₂⁻ = p and P₂ > p.
The interval equality principle says that when a value of P, which we might call P₁, is significantly less than p, p is significantly more than P₁. So the lower interval limit for p, w⁻, will be P₁ when the upper interval bound for P₁ (E₁⁺) is p.
Got that? A picture might help.
In our case, we don’t invert the Normal interval but the Binomial.
To obtain inverted Binomial bounds of p = r / n we find the values P₁ and P₂ where the tail areas are equal to α/2:
lower bound b⁻ = P₁ where B(r, n; n, P₁) = α/2, (4)
upper bound b⁺ = P₂ where B(0, r; n, P₂) = α/2.
We can use a computer search procedure to find these values. Note that whereas the Binomial distribution about P is discrete, having n + 1 values (simple fractions of the form p = r / n), P can be any fraction of a (near-)infinite population of size N, i.e. effectively any value P ∈ [0, 1].
Consequently, although the Binomial interval about P is discrete, the Clopper-Pearson interval about p is continuous. See Wallis (in press). We will see this when it comes to plotting Clopper-Pearson distributions.
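A search procedure of this kind might be sketched as follows. This is my own minimal bisection sketch of Equation (4), not the author’s implementation; the cumulative tail areas are monotonic in P, so a simple binary search suffices:

```python
from math import comb

def binomial_cdf(r1, r2, n, P):
    # cumulative Binomial B(r1, r2; n, P), Equation (2)
    return sum(comb(n, r) * P**r * (1 - P)**(n - r) for r in range(r1, r2 + 1))

def clopper_pearson(r, n, alpha=0.05, tol=1e-10):
    # Solve Equation (4) for P1 (lower) and P2 (upper) by bisection.
    def bisect(f, lo, hi):
        # binary search for the root of a monotonic function f
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if f(lo) * f(mid) <= 0:
                hi = mid
            else:
                lo = mid
        return (lo + hi) / 2

    p = r / n
    # lower bound b-: P1 < p with upper tail area B(r, n; n, P1) = alpha/2
    lower = 0.0 if r == 0 else bisect(lambda P: binomial_cdf(r, n, n, P) - alpha / 2, 0.0, p)
    # upper bound b+: P2 > p with lower tail area B(0, r; n, P2) = alpha/2
    upper = 1.0 if r == n else bisect(lambda P: binomial_cdf(0, r, n, P) - alpha / 2, p, 1.0)
    return lower, upper
```

For r = 0 the upper bound solves (1 – P)ⁿ = α/2 in closed form, i.e. P₂ = 1 – (α/2)^(1/n), which provides a useful check on the search.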
First, we obtain the distributions of b⁻ and b⁺ by finding the relevant solutions of Equation (4) for 0 < α ≤ 1. This obtains positions on the horizontal p axis for different values of α.
However, α represents the tail area, for the lower bound, equivalent to the cumulative distribution function (cdf). We need to plot the pdf (see above). We differentiate the area under the curve to obtain the height (y-position). To do this we use the delta approximation method described in Wallis (2018).
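As I understand the delta approximation, the idea is a finite-difference derivative: the tail area at the bound equals α/2, so the pdf height at a bound position is the rate of change of α/2 with respect to that position. A generic sketch (`pdf_height` and `bound_at` are my own names):

```python
def pdf_height(bound_at, alpha, delta=1e-4):
    # bound_at(alpha) returns the bound's position on the p axis for a given alpha.
    # The cdf (tail area) at that position is alpha/2, so the pdf height is
    # d(alpha/2)/dx, approximated by a central finite difference.
    x1 = bound_at(alpha - delta)
    x2 = bound_at(alpha + delta)
    return abs(delta / (x2 - x1))
```

For example, a distribution with lower-tail cdf F(x) = x on [0, 1] has bound position x = α/2 and constant pdf height 1, which the approximation recovers.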
Performing this step allows us to compute curves like the brown lines in Figure 2. We have plotted the equivalent Wilson and continuity-corrected Wilson intervals for comparison.
We can see that the interval tends to fall somewhere between the two Wilson intervals (with and without continuity-correction) and has a similar shape.
Perhaps the best way to see how the three curves compare is to plot them over multiple values of r.
For significance testing purposes, the key area is the tail area between 0 < α < 0.05, the ‘error level’, which we have indicated with dotted lines. Thus a dotted line that is further from the grey line (p) represents a more conservative interval. The dashed lines show the start of the distribution in each case (α = 1).
Wilson and Clopper-Pearson intervals can be computed for very small samples — even samples as small as n = 1. As you might expect, small samples expose the greatest discrepancies between interval calculations. Figure 3 depicts all three distributions for p = 0, n = 1. Note that the only possibilities with n = 1 are to observe r = 0 (false) or 1 (true), i.e. p = 0.0 or 1.0.
These distributions differ rather dramatically! In Wallis (in press), I observe that the continuity-corrected Wilson distribution (and interval) may be computed by simply relocating p by 1/(2n). See Correcting for continuity for more on this. This means that the continuity-corrected Wilson, with its subsequent peak, is actually one half of a bimodal (twin peak) distribution centred at 0.5. This behaviour is discussed at some length in Plotting the Wilson distribution.
On the other hand, the Clopper-Pearson distribution becomes constant (horizontal). Recall that the Binomial distribution is in fact discrete. The smoothly peaked shape of the Wilson distributions is a consequence of approximating the Binomial distribution with the Normal, a step that of course we omitted when computing the Clopper-Pearson distribution.
For experimentalists, the key question concerns the position of the interval. Here we can see that in fact the Clopper-Pearson interval is more conservative than the continuity-corrected Wilson.
The Clopper-Pearson interval is sometimes described as an ‘exact’ interval, just as the Binomial test is an ‘exact’ test. For small values of n it may be preferred over the Wilson interval, even with a Yates’ continuity-correction applied.
However, observing these distributions, the close relationship between Clopper-Pearson and Wilson intervals cannot be denied. Nor can anyone looking at the animation above be under any illusion that the probable values of P are Normally distributed about p!
As various in-depth computational reviews demonstrate (Wallis 2013, Newcombe 1998), the continuity-corrected Wilson closely matches the Clopper-Pearson interval. Of course there are discrepancies. Thus close observation of these distributions would show that the Clopper-Pearson is more conservative than the continuity-corrected Wilson interval at α = 0.05 in most cases; but, where n = 10, less conservative for the inner interval where p < 0.3 or p > 0.7. For small n < 4 and skewed values, the Clopper-Pearson interval is the most conservative, allowing false-positive (Type I) errors at the margins.
Finally, although the uncorrected Wilson fares less well in comparison, it shares the overall shape while retaining an additional useful asset. Translated onto the logit scale it is near-Normally distributed, permitting the Wilson score interval width to serve as a proxy for a meaningful standard deviation measure for carrying out variance-weighted logistic regression.
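The logit translation can be sketched as follows. This is my own illustration, assuming the standard Wilson score formula (Wilson 1927); the half-width of the logit-transformed interval, divided by z, serves as the standard deviation proxy:

```python
from math import sqrt, log

def wilson(p, n, z=1.959964):
    # Wilson score interval (Wilson 1927)
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    spread = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - spread, centre + spread

def logit(p):
    # undefined at p = 0 or 1, the exception noted in the text
    return log(p / (1 - p))

def logit_sd_proxy(p, n, z=1.959964):
    # half the logit-scale interval width, rescaled by z: a proxy standard
    # deviation under the near-Normality assumption discussed above
    lo, hi = wilson(p, n, z)
    return (logit(hi) - logit(lo)) / (2 * z)
```

Since the logit is undefined at p = 0 and 1, those cases must be handled separately, as the text notes.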
Newcombe, R.G. 1998. Two-sided confidence intervals for the single proportion: comparison of seven methods. Statistics in Medicine 17: 857-872.
Wallis, S.A. 2013. Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods. Journal of Quantitative Linguistics 20:3, 178-208 » Post
Wallis, S.A. 2018. Plotting the Wilson distribution. corp.ling.stats. London: Survey of English Usage, UCL. » ePublished
Wallis, S.A. 2020, in press. Statistics in Corpus Linguistics Research. Oxford: Routledge.
Wilson, E.B. 1927. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association 22: 209-212.
Wallis (2013) provides an account of an empirical evaluation of Binomial confidence intervals and contingency test formulae. The main take-home message of that article was that it is possible to evaluate statistical methods objectively and provide advice to researchers that is based on an objective computational assessment.
In this article we develop the evaluation of that article further by re-weighting estimates of error using Binomial and Fisher weighting, which is equivalent to an ‘exhaustive Monte-Carlo simulation’. We also develop an argument concerning key attributes of difference intervals: that we are not merely concerned with when differences are zero (conventionally equivalent to a significance test) but also accurate estimation when difference may be non-zero (necessary for plotting data and comparing differences).
All statistical procedures may be evaluated in terms of the rate of two distinct types of error.
It is customary to treat these errors separately because the consequences of rejecting and retaining a null hypothesis are qualitatively distinct.
In classical experiments, whether in the lab or with corpora, researchers should err on the side of caution and risk Type II errors but not Type I errors. The premise is that it is safer to avoid investing research effort in a dead end – by yourself or others – rather than to find out later that you have wasted time and resources.
Note, however, that this is not a universal rule. If you were offering a potentially life-saving experimental drug to someone who is expected to otherwise die, you might risk Type I errors (rejecting a true null hypothesis that the drug had no significant effect, i.e. that it did not work). This issue has arisen recently in clinical trials of the Ebola vaccine (Calain 2018). We must still attempt to weigh up the risk of side-effects.
Secondly, we need to decide on a ‘gold standard’ criterion. We need an independent measure of ‘correctness’. A test evaluation can have one of four possible outcomes (Table 1).
Test evaluation | ‘Gold standard’ test: True (significant) | ‘Gold standard’ test: False (non-significant)
True (‘significant’) | – | Type I
False (‘non-significant’) | Type II | –
Where the test we are evaluating is ‘significant’ and the gold standard test is significant, the tests are consistent, and using the evaluated test does not generate an error. Likewise, where the test evaluation and gold standard test both obtain a non-significant result, the methods perform equally. But in other cases we have either Type I or Type II errors. The idea is we can add up these two types of error separately and thereby compare test performances.
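The tallying procedure can be sketched in a few lines (the function name and the paired-decision input format are my own assumptions):

```python
def tally_errors(results):
    # results: list of (evaluated_significant, gold_significant) pairs
    type_i = sum(1 for ev, gold in results if ev and not gold)    # false positives
    type_ii = sum(1 for ev, gold in results if not ev and gold)   # false negatives
    return type_i, type_ii
```

Summing each error type separately over many evaluations lets two candidate tests be compared against the same gold standard.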
This method is an effective one for evaluating tests. But it is not sufficient to evaluate intervals. This is because with an interval we also wish to know how far results diverge. We want to know, were we to plot a confidence interval over many values, whether it would be accurate for all values of p, and not misleading for certain ones (say, when p is close to 0 or 1). We turn to this question next.
Calain, P. (2018). The Ebola clinical trials: a precedent for research ethics in disasters. Journal of Medical Ethics 44:3-8.
Wallis, S.A. (2013). Binomial confidence intervals and contingency tests. Journal of Quantitative Linguistics 20(3), 178-208. » Post
Experimenting with deriving accurate 2 × 2 φ intervals, I also considered using Liebetrau’s population standard deviation estimate.
To recap: Cramér’s φ (Cramér 1946) is a probabilistic intercorrelation for contingency tables based on the χ² statistic. An unsigned φ score is defined by
Cramér’s φ ≡ √(χ² / N(k – 1)), (1)

where χ² is the r × c test for homogeneity (independence), N is the total frequency in the table, and k the minimum number of values of variables X and Y, i.e. k = min(r, c). For 2 × 2 tables, k – 1 = 1, so φ = √(χ²/N) is often quoted.
An alternative formula for 2 × 2 tables obtains a signed result, where a negative sign implies that the table tends towards the opposite diagonal.
signed 2 × 2 φ ≡ (ad – bc) / √((a + b)(c + d)(a + c)(b + d)), (2)
where a, b, c and d are cell frequencies. However, Equation (2) cannot be applied to larger tables.
The method I discuss here is potentially extensible to other effect sizes and other published estimates of standard deviations.
We employ Liebetrau’s best estimate of the population standard deviation of φ for r × c tables:
s(φ) ≈ 1/(2φ√N) × √{ 4Σ_{i,j} p_{i,j}³ / (p_{i+}² p_{+j}²)
 – 3Σ_i (1/p_{i+}) (Σ_j p_{i,j}² / (p_{i+} p_{+j}))²
 – 3Σ_j (1/p_{+j}) (Σ_i p_{i,j}² / (p_{i+} p_{+j}))²
 + 2Σ_{i,j} [ (p_{i,j} / (p_{i+} p_{+j})) (Σ_k p_{k,j}² / (p_{k+} p_{+j})) (Σ_l p_{i,l}² / (p_{i+} p_{+l})) ] }, for φ ≠ 0, (3)
where p_{i,j} = f_{i,j} / N and p_{i+}, p_{+j}, etc. represent row and column (prior) probabilities (Bishop, Fienberg and Holland 1975: 386). If φ = 0 we adjust the table by a small delta.
The wrong way to approach this question is to assume that probable values of φ are Normally distributed about the observed φ, with standard deviation s(φ) from Equation (3):
lower bound φ⁻ = φ – z_{α/2}.s(φ),
upper bound φ⁺ = φ + z_{α/2}.s(φ). (4)
This is a ‘Wald’ standard error interval on φ. It fails for the same reasons as the ‘Wald’ interval on the Binomial proportion p.
It overshoots the range of φ as it approaches extreme values (it contains impossible values), and it has a zero width at the extremes (it is ‘certain’).
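The same two pathologies are easy to demonstrate for the simple Wald interval on a proportion p. A minimal sketch (`wald` is my own name, with z fixed at the α = 0.05 critical value):

```python
from math import sqrt

def wald(p, n, z=1.959964):
    # 'Wald' interval: p +/- z * sqrt(p(1 - p)/n)
    e = z * sqrt(p * (1 - p) / n)
    return p - e, p + e

# overshoot: near the extremes the bound exceeds the possible range [0, 1]
lo, hi = wald(0.95, 10)

# collapse: at the extremes the interval has zero width, i.e. false 'certainty'
lo0, hi0 = wald(0.0, 10)
```

Both behaviours carry over directly to the Wald-type interval on φ, where the range is [–1, 1].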
The mistake, as with the Wald interval, is to confuse the expected population and observed sample values of φ. See Wallis (2013).
Let us look at 2 × 2 φ, as we can calculate a derived Newcombe-Wilson φ interval and plot the performance of both intervals side-by-side. We can also plot over the full range [–1, 1] using Equation (2) above.
We populate a contingency table as follows.
 | y₁ | y₂
x₁ | a = n₁(φ + 1)/2 | b = n₁(1 – φ)/2
x₂ | c = n₂(1 – φ)/2 | d = n₂(φ + 1)/2
We will simply hold n₁ = n₂ = 10.
We can clearly see that the Liebetrau ‘Wald’ interval predicts values that exceed the possible range of φ ∈ [–1, 1] and collapses to zero width, i.e. certainty, at the extremes. As a general rule, both behaviours are indicators of an incorrect confidence interval calculation.
Some statisticians might ‘fix’ the overshoot problem by simply inserting ‘min’ and ‘max’ functions to constrain the width. But Figure 1 shows that converging to zero width at the extremes badly distorts the interval over the entire range, causing it to deviate from the other two intervals, labelled ‘Liebetrau inverted’ and ‘NW derived’.
We discuss the inverted Liebetrau interval in the next section, but even were you not to accept that model as accurate, the interval derived from the Newcombe-Wilson is also very similar. Two very different methods obtain closely comparable, and plausible, results.
We can see that the erroneous ‘Wald’ interval underestimates variation on the inner side close to 0 while tending to overestimate it on the outer side. It is this general property of overall inaccuracy that ultimately should seal Wald-type intervals’ fate – this type of interval derivation is quite simply incorrect!
The correct approach employs the interval equality principle (Wallis 2013). We obtain the roots of the Gaussian standard deviation equation by finding the values of (φ⁻, φ⁺) where
φ⁻ + z_{α/2}.s(φ⁻) = φ,
φ⁺ – z_{α/2}.s(φ⁺) = φ. (5)
Note that now φ⁻ and φ have swapped positions. We are saying ‘the lower bound of the interval, φ⁻, has a Gaussian upper bound which equals the observed value, φ’. This is the interval equality principle.
Since Liebetrau’s function is expressed in terms of cell value probabilities (p_{i,j}, etc.), we must convert φ to a canonical table guaranteed to have a score of φ. We have achieved this with our table of formulae above.
Finally we employ a search procedure to find the relevant interval. Our algorithm is a simple binary search on a monotonic function, so it is highly generalisable. We provide the algorithm at the end of this post.
The same search function is employed for upper and lower bounds, and is robust for the entire range of φ.
The method works quite well. Figure 1 shows it slightly deviates from the derived Newcombe-Wilson method, an effect which seems to be consistent rather than due to rounding. The Newcombe-Wilson method employs the Bienaymé sum-of-variances rule, which may explain the discrepancy. Nonetheless the advantage of a direct calculation rather than a search procedure is obvious. The Newcombe-Wilson method has other advantages aside from computational efficiency: in particular it is easily adapted for continuity-correction.
However for intervals on larger r × c φ tables, Liebetrau’s method with search offers a viable alternative.
Bishop, Y.M.M., S.E. Fienberg & P.W. Holland (1975). Discrete Multivariate Analysis: Theory and Practice. Cambridge, MA: MIT Press.
Cramér, H. (1946). Mathematical methods of statistics. Princeton, NJ: Princeton University Press.
Wallis, S.A. (2013a). Binomial confidence intervals and contingency tests. Journal of Quantitative Linguistics 20(3), 178-208. » Post
function phi_score(a, b, c, d)
{
    return ((a*d) - (b*c)) / sqrt((a+b)*(a+c)*(b+d)*(c+d))
}
function Liebetrau(a, b, c, d)
{
    phi = phi_score(a, b, c, d)
    if (phi == 0)
    {
        a += 0.0001    // add delta if phi == 0
        b -= 0.0001
        phi = phi_score(a, b, c, d)
    }
    n = a + b + c + d
    pa = a / n, pb = b / n, pc = c / n, pd = d / n
    pab = pa + pb, pcd = pc + pd, pac = pa + pc, pbd = pb + pd

    // first term of Equation (3): 4 x sum of p^3 / (row^2 x column^2)
    p3a = pow(pa, 3) / (pow(pac * pab, 2))
    p3b = pow(pb, 3) / (pow(pab * pbd, 2))
    p3c = pow(pc, 3) / (pow(pac * pcd, 2))
    p3d = pow(pd, 3) / (pow(pbd * pcd, 2))
    sq = (p3a + p3b + p3c + p3d) * 4

    // second term: 3 x row and column sums of squared inner sums
    pa3 = ((pa * pa) / (pac * pab))
    pb3 = ((pb * pb) / (pab * pbd))
    pc3 = ((pc * pc) / (pac * pcd))
    pd3 = ((pd * pd) / (pbd * pcd))
    pa3c3 = pa3 + pc3, pac3 = pa3c3 * pa3c3 / pac
    pb3d3 = pb3 + pd3, pbd3 = pb3d3 * pb3d3 / pbd
    pa3b3 = pa3 + pb3, pab3 = pa3b3 * pa3b3 / pab
    pc3d3 = pc3 + pd3, pcd3 = pc3d3 * pc3d3 / pcd
    tri = (pac3 + pbd3 + pab3 + pcd3) * 3

    // third term: 2 x sum of cross-products
    pa4 = (pa / (pac * pab)) * pa3c3 * pa3b3
    pb4 = (pb / (pbd * pab)) * pb3d3 * pa3b3
    pc4 = (pc / (pac * pcd)) * pa3c3 * pc3d3
    pd4 = (pd / (pbd * pcd)) * pb3d3 * pc3d3
    dia = (pa4 + pb4 + pc4 + pd4) * 2

    s2 = sq - tri + dia
    return s2 > 0 ? sqrt(s2 / n) / (2 * abs(phi)) : 0
}
function LiebBound(n, Phi, upper)
{
    // build the canonical table for Phi (see the table of formulae above)
    a = ((Phi + 1) * n) / 4
    b = n / 2 - a
    e = Liebetrau(a, b, b, a) * zcrit    // zcrit = z(alpha/2)
    return upper ? Phi + e : Phi - e
}
Find population score of φ where φ ± z_{α/2}.s(φ) = phi (observed).
function findL(n, phi, upper)    // lower bound if upper is false
{
    if (upper && (abs(phi - 1) < 0.00001)) return 1      // trap phi = 1
    if (!upper && (abs(phi + 1) < 0.00001)) return -1    // trap phi = -1
    p = upper ? ((2 - (phi + 1)) / 2 + phi) : ((phi + 1) / 2 - 1)
    p2 = abs((p - phi) / 2)
    for (i = 0; i < 1000; i++)
    {
        a = LiebBound(n, p, !upper)
        d = abs(a - phi)
        if (d > 0.00000001)    // accurate to 8 decimal places
        {
            if (p + p2 >= 1) d2 = 2
            else
            {
                a2 = LiebBound(n, p + p2, !upper)
                d2 = abs(a2 - phi)
            }
            if (p - p2 <= -1) d3 = 2
            else
            {
                a3 = LiebBound(n, p - p2, !upper)
                d3 = abs(a3 - phi)
            }
            pdiff = p2
            if (d3 < d2)
            {
                pdiff = -p2
                d2 = d3
            }
            if (d2 < d) p += pdiff
            else
            {
                p2 = p2 / 2
                if (p2 == 0) return p
            }
        }
        else return p
    }
    return p
}
Elsewhere in this blog we introduce the concept of statistical significance by considering the reliability of a single sampled observation of a Binomial proportion: an estimate of the probability of selecting an item in the future. This allows us to develop an understanding of the likely distribution of what the true value of that probability in the population might be. In short, were we to make future observations of that item, we could expect that each sampled probability would be found within a particular range – a confidence interval – a fixed proportion of times, such as 1 in 20 or 1 in 100. This ‘fixed proportion’ is termed the ‘error level’ because we predict that the true value will be outside the range 1 in 20 or 1 in 100 times.
This process of inferring about future observations is termed ‘inferential statistics’. Our approach is to build our understanding in a series of stages based on confidence intervals about the single proportion. Here we will approach the same question by deconstructing the chi-square test.
A core idea of statistical inference is this: randomness is a fact of life. If you sample the same phenomenon multiple times, drawing on different data each time, it is unlikely that the observation will be identical, or – to put it in terms of an observed sample – it is unlikely that the mean value of the observation will be the same. But you are more likely than not to find the new mean near the original mean, and the larger the size of your sample, the more reliable your estimate will be. This, in essence, is the Central Limit Theorem.
This principle applies to the central tendency of data, usually the arithmetic mean, but occasionally a median. It does not concern outliers: extreme but rare events (which, by the way, you should include, and not delete, from your data).
We are mainly concerned with Binomial or Multinomial proportions, i.e. the fraction of cases sampled which have a particular property. A Binomial proportion is a statement about the sample, a simple fraction p = f / n. But it is also the sample mean probability of selecting a value. Suppose we selected a random case from the sample. In the absence of any other knowledge about that case, the average chance that X = x₁ is also p.
The same principle applies to the mean of Real or Integer values, for which one might use Welch’s or Student’s t test, and the median rank of Ordinal data, for which a Mann-Whitney U test may be appropriate.
With this in mind, we can form an understanding of significance, or to be precise, significant difference. The ‘difference’ referred to here is the difference between an uncertain observed value and a predicted or known population value, d = p – P, or the difference between two uncertain observed values, d = p₂ – p₁. The first of these differences is found in a single-sample z test, the second in a two-sample z test. See Wallis (2013b).
A significance test is created by comparing an observed difference with a second element, a critical threshold extrapolated from the underlying statistical model of variation.
Suppose we wish to evaluate a claim of the type ‘p is other than expected’. If this is true, there is a significant difference between p and some given value P, or (in algebra) d = p – P ≠ 0. We employ a single-sample test to predict how reliable this claim is likely to be.
We compare d with a threshold distance drawn from a single-sample z test. This is the ‘two-tailed Normal confidence interval width on the population mean P’ that we are introduced to in Wallis (2013a). This interval width is calculated using the population standard deviation, S:
population standard deviation S = √(P(1 − P)/n),
population confidence interval (E⁻, E⁺) = (P – z_{α/2}.S, P + z_{α/2}.S).
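These two formulae can be sketched directly (a minimal Python sketch; the function name and the hard-coded z value for α = 0.05 are my own assumptions):

```python
from math import sqrt

def population_interval(P, n, z=1.959964):
    # S = sqrt(P(1 - P)/n); interval (E-, E+) = P +/- z * S
    S = sqrt(P * (1 - P) / n)
    return P - z * S, P + z * S
```

For example, with P = 0.5 and n = 100 the interval spans roughly (0.402, 0.598).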
Each test incorporates a model that predicts how we expect observations to vary – called the ‘statistical model of variation’. We use the Normal distribution instead of the Binomial (see Wallis (2013a)). Calculating the threshold is easy if we know (or have been given) a value of P.
When we perform the test, we compare this observed distance d with the threshold obtained from the model. The same approach can be formulated in four ways.
First, we can test p. In Figure 1, p is inside the interval.
Single-sample z test (testing p):
If P – z_{α/2}.S < p < P + z_{α/2}.S there is no significant difference between p and P.
By simple algebra this can be reformulated as a test of the difference, d:
Single-sample z test (testing d):
If –z_{α/2}.S < d < z_{α/2}.S there is no significant difference between p and P.
In this formulation, the ‘threshold of significance’ is an estimate of how large a difference d must be before a change in the observed direction – above or below d – would be unlikely to occur by chance more than a certain number of times (the error level, α).
So far the threshold was expressed as a confidence interval. But it can also be expressed as a critical value. Suppose we divide this equation by the standard deviation, S. We can reformulate the test in terms of z = d / S. This test is based on the standard Normal distribution, i.e. a Normal distribution with a mean μ = 0 and standard deviation s = 1.
Single-sample z test (testing z):
If –z_{α/2} < z < z_{α/2} there is no significant difference between p and P.
Although they are expressed in slightly different terms, these formulae are mathematically identical and achieve the same result. They have the same components: an observation and a threshold (or an upper and lower threshold). The components are simply expressed on different scales, or with different origins. The threshold is calculated by the statistical model of variation – hence ‘z_{α/2}’ refers to the two-tailed critical value of the standard Normal distribution for an error level α.
These two-tailed tests evaluate the hypothesis that d ≠ 0. A one-tailed test can be generalised by setting a different threshold and comparing only one of the boundaries, for example:
One-tailed single-sample z test (testing z):
If –z_{α} < z then p is not significantly less than P.
Wallis (2013a) shows the Wilson score interval (w⁻, w⁺) inverts the Gaussian model and testing process, but yields the same result. The same mathematical model applies, but now we test P instead of p. The ‘critical threshold’ is expressed as the interval bounds for p, w⁻ and w⁺:
Single-sample z test (testing P) = Wilson score interval test:
If w⁻ < P < w⁺ there is no significant difference between p and P.
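The fourth formulation can be sketched alongside the z test to check that they agree (my own sketch, assuming the standard Wilson score formula; the chosen test points sit safely away from the decision boundary):

```python
from math import sqrt

Z = 1.959964  # z_{alpha/2} at alpha = 0.05

def wilson(p, n, z=Z):
    # Wilson score interval (w-, w+)
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    spread = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - spread, centre + spread

def wilson_test(p, P, n):
    # 'no significant difference' if P falls inside (w-, w+)
    lo, hi = wilson(p, n)
    return lo < P < hi

def z_test(p, P, n):
    # single-sample z test on the same data
    S = sqrt(P * (1 - P) / n)
    return abs(p - P) < Z * S
```

The Wilson bounds are the roots of the same Gaussian equation, so the two tests return the same verdict.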
The same principles apply to other tests. For example, they apply to the z test for two samples drawn from the same population. See Wallis (2013b).
However, now d = p₂ – p₁, i.e. the difference between the proportions for each sample. The critical distance is calculated on the basis of a two-tailed Normal interval width centred on an intermediate point, the pooled probability estimate, p̂. This is the best estimate of the average probability in the population, P. The presumption that data is drawn from the same population is the null hypothesis of the test. If the test is significant we reject the null hypothesis that the samples are drawn from the same population.
Alternatively, where samples are already known to be drawn from different populations, we can use the Wilson score interval on observations p₁ and p₂ to estimate a combined ‘Newcombe-Wilson’ interval. This interval then defines the critical threshold for the equivalent test.
Note that whereas the critical value of the standard Normal distribution z_{α/2} used in the equations above is constant, in all other cases the critical distance threshold is not a constant. It may alter with the location of observed and population values, p, p₁, P, etc. It is also scaled by the weight of evidence supporting each observation, n, n₁, etc. The greater the volume of supporting evidence for an observation, the smaller a distance needs to be before it is considered too large to be likely due to random variation.
These two elements – a numerical difference (whether between observations or between observations and expected values) and the weight of evidence – are found again and again in significance tests.
Wallis (2013b) describes how these z tests can be reformulated as χ² contingency tests and generalised to tables of many cells. We showed that certain simple z and χ² tests obtained the same result. A χ² test compares the observed value of each cell o_{i,j} with the expected value e_{i,j}. We can find the same core concept of ‘significant difference’ in the χ² formula.
You may recall the formula for chi-square given below. The test calculates, for each cell in the table in turn, the square of the difference d_{i,j} = o_{i,j} – e_{i,j} and scales the result by dividing it by the expected value e_{i,j} before adding up each difference. Thus, for example, the 2 × 2 test sums four scaled difference terms. See Figure 2.
chi-square χ² ≡ Σ(o_{i,j} – e_{i,j})² / e_{i,j} = ΣR_{i,j}².
Dividing by e_{i,j} converts differences to the same scale. The formula can be rewritten as the sum of squared ‘standardised residuals’, R_{i,j} (Sheskin 2011: 671), where R_{i,j} is defined as
standardised residual R_{i,j} = (o_{i,j} – e_{i,j}) / √e_{i,j} = d_{i,j} / √e_{i,j}.
Each of these are difference terms on the standard Normal distribution scale – the same scale as the third test formulation above. They are then squared and summed to obtain a χ² score. This score is on the squared standard Normal distribution scale, or expressed as ‘standard variance’. As we put it in Wallis (2013b):
[T]o all intents and purposes, ‘chi-squared’ with a single degree of freedom could be called ‘z-squared’.
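The identity between the χ² formula and the sum of squared standardised residuals can be verified in a short sketch (function names are mine; the expected values assume a test of homogeneity/independence):

```python
from math import sqrt

def expected_counts(observed):
    # e_ij = row total x column total / N
    rows = [sum(r) for r in observed]
    cols = [sum(c) for c in zip(*observed)]
    N = sum(rows)
    return [[rt * ct / N for ct in cols] for rt in rows]

def chi_square_direct(observed, expected):
    # chi-square as sum of (o - e)^2 / e
    return sum((o - e) ** 2 / e
               for o_row, e_row in zip(observed, expected)
               for o, e in zip(o_row, e_row))

def chi_square_residuals(observed, expected):
    # chi-square as the sum of squared standardised residuals R_ij
    total = 0.0
    for o_row, e_row in zip(observed, expected):
        for o, e in zip(o_row, e_row):
            R = (o - e) / sqrt(e)  # standardised residual
            total += R * R
    return total
```

Both routes give the same score, since R² = (o – e)²/e cell by cell.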
What happens with more than one degree of freedom?
The chi-square test also compares the χ² score with a critical threshold distance. In this case it is the critical value of χ² for the number of degrees of freedom of the test. The ‘number of degrees of freedom’ for a homogeneity test is the total number of independent differences found in the table.
In a χ² test for independence with r rows and c columns, the number of degrees of freedom is df = (r – 1)(c – 1). We subtract 1 from the number of rows and the number of columns because, in every row and column in the table, the cells sum to the total for that row or column. The observed frequency of the last cell is known: it contains the remainder once all the others have been taken into account. So in a 2 × 2 table, although there are four cells, if we know row and column totals, once we also know the value of any one of the four cells we can work out the other cell frequencies by subtraction.
If a table has a single degree of freedom, its variation can be expressed as a difference and confidence interval along a single dimension with no loss of information. Indeed, the z method has one benefit over simple χ². It tells us about the sign of the difference (which is significantly greater, p or P; or p₁ or p₂). Nonetheless 2 × 1 and 2 × 2 χ² tests can be reformulated as z tests and confidence intervals, and if the chi-square test is significant, the same will be true for the mathematically-equivalent z test.
For goodness of fit χ² tests arranged vertically, the table consists of a single column, summing to sample size n. The number of degrees of freedom is c – 1. Again, if we know the total frequency n and the cell frequencies of c – 1 cells, we can work out the last cell frequency by subtraction.
But if we have more than one degree of freedom, as in a 3 × 1 goodness of fit test, or a 2 × 3 test of homogeneity, then the test is not reducible to a single dimension of variation. A single numeric score loses information, just as quoting an area or volume does not tell us about the precise dimensions of an object. In general, we suggest that plotting data with confidence intervals is a much more revealing way of analysing data than traditional approaches involving standardised residuals or sub-tests of parts of the table.
This reformulation of the chi-square test as a sum of squared standardised residuals, i.e. scaled differences, also helps us understand how Yates’ (1934) correction for continuity works. Yates’ method either adds a continuity correction term to the expected mean P in the direction of the change we want to test, or widens the confidence interval about P on either side. Either way it expands the critical threshold by the correction term.
Yates’ interval (E⁻, E⁺) ≡ P ± (z_{α/2}√(P(1 – P)/n) + 1/2n).
For a difference z test we add correction terms for each subsample size: n₁, n₂.
On a frequency scale, the term is simply the constant 0.5. Thus the conventional formulation for Yates’ chi-square subtracts 0.5 from the absolute value of the difference term d_{i,j} = o_{i,j} – e_{i,j}. Note that we can either subtract from the observed difference or add to the critical threshold (in this case, a confidence interval about P) to achieve the same result.
Yates’ χ² ≡ Σ(|o_{i,j} – e_{i,j}| – 0.5)² / e_{i,j}.
This is the most common expression of Yates’ test. In Correcting for continuity we discuss how to apply a continuity-correction to the Wilson interval by simply adding the correction term to either side of the observed proportion, p.
Finally, the log-likelihood test formula also computes sums of scaled differences. This fact is not initially apparent in the formula, and is revealed by a little logarithmic algebra. Log-likelihood is usually defined as the following.
log-likelihood G² ≡ 2Σo_{i,j}.log(o_{i,j}/e_{i,j}),
where ‘log’ refers to the natural logarithm. This formula can be rewritten as
log-likelihood G²= 2Σo_{i,j}.[log(o_{i,j}) – log(e_{i,j})].
Now we can see a summed series of differences between observed and expected cell values, scaled by the weight of evidence in each case, this time the number of observations o_{i,j}. G² is compared with the critical value of χ². It is another type of contingency test, and the same number of degrees of freedom apply.
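To make the computation concrete, here is a minimal Python sketch of G² for an r × c table (the function name and layout are mine, not from any cited source; expected frequencies are derived from the marginal totals in the usual way):

```python
import math

def g_squared(observed):
    """Log-likelihood G2 = 2 * sum of o.log(o/e) over an r x c contingency table."""
    n = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    g2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected cell frequency
            if o > 0:  # the limit of o.log(o/e) as o -> 0 is 0
                g2 += o * math.log(o / e)
    return 2 * g2
```

The resulting G² is then compared with the critical value of χ² with (r – 1)(c – 1) degrees of freedom, exactly as for the standard contingency test.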
Significance tests may be considered as an assessment combining three elements: a statistical model of variation, an observed difference (the effect size), and the weight of evidence (the sample size or sizes).
The two different versions of the two-sample z test differ only in the first aspect. They have the same difference, d = p₂ – p₁. And they have the same sample sizes, n₁ and n₂.
One version of the z test is based on the samples being drawn from the same population. It says, let us assume that there is a single population mean probability, P, which we estimate using the pooled probability estimate p̂ — the weighted average of p₁ and p₂, or ‘prior probability’. The test then uses this pooled probability to estimate a Gaussian standard deviation, S. This z test obtains exactly the same result as the equivalent 2 × 2 χ² test.
The two-population test does not make that assumption. That test simply says we have two observations, p₁ and p₂, drawn from different populations, each with their own means, P₁ and P₂. Wallis (2013a) argues that the Newcombe-Wilson test gives us the optimum method for carrying out this test. It uses the Wilson score interval for both observations and then calculates a critical value based on a ‘Pythagorean’ approximation. It assumes that the variation for observation p₁ is independent of the variation for p₂ (because they are drawn independently) and therefore this variation is tangential. The total standard deviation is the hypotenuse of a right-angled triangle.
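As an illustration, the two versions of the test can be sketched in Python as follows (function names are mine; the Newcombe-Wilson version combines the ‘inner’ Wilson interval widths at right angles):

```python
import math

def z_test_pooled(p1, n1, p2, n2, z=1.96):
    """Single-population z test: the pooled estimate supplies the standard deviation."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    s = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return abs(p2 - p1) > z * s  # True if significant

def wilson(p, n, z=1.96):
    """Wilson score interval about an observed proportion p from n trials."""
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - spread) / denom, (centre + spread) / denom

def newcombe_wilson_test(p1, n1, p2, n2, z=1.96):
    """Two-population test: 'Pythagorean' combination of inner Wilson widths."""
    l1, u1 = wilson(p1, n1, z)
    l2, u2 = wilson(p2, n2, z)
    d = p2 - p1
    if d >= 0:
        return d > math.sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return -d > math.sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
```

With data well away from the extremes the two tests usually agree, but they embody different assumptions and can differ in marginal cases.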
Other test formulae may hide one or other of these aspects, but if you look closely enough, you should be able to establish that this deconstruction into three elements applies. Where things become complicated is if there is more than one degree of freedom.
In χ² and log-likelihood tests, the sample size n does not appear explicitly in formulae. But it is found in observed and expected cell frequencies, o and e.
The statistical model and the weight of evidence together give us a critical threshold for evaluation. The remaining component we turn to next. This is the concept of an effect size.
Often when we carry out research we wish to measure the degree to which one variable affects the value of another. In the process we may set aside the question as to whether this effect is sufficiently large as to be considered significant. That is, we focus only on the effect size, and we put the statistical model or the weight of evidence to one side. In doing so, we are engaging in descriptive statistics rather than inferential statistics.
There are two reasons why we might do this: to visualise results and to compare results. Visualising differences between observations can be very revealing. But simple difference d is not the only possible measure of effect size, and it is limited to tables with a single degree of freedom or comparing pairs of probabilities.
Elsewhere in this blog we review approaches to calculating effect sizes for larger ‘r × c’ contingency tables and for goodness of fit (‘r × 1’ or ‘1 × c’) tables. In Wallis (2019) we discussed the task of comparing the results of experiments, a task that has traditionally been carried out by comparing effect sizes (or worse, simply comparing χ² scores or ‘p values’).
These tables have more than one degree of freedom. They have multiple dimensions or axes along which they can independently vary. On the other hand, an effect size is a single number, which means that it can have only one dimension or degree of freedom. Were we to create a test that simply compared an effect size with a threshold, our new test would have one degree of freedom, irrespective of whether the effect size itself was calculated from a table with many degrees. Consequently tests on effect sizes for larger tables will tend to be less sensitive than tests on the original data. This does not render such tests invalid, provided that the effect size is considered a meaningful dependent variable for our research question.
Finally, it is worth noting that this process of deconstruction we have applied to χ² is not limited to Multinomial and Binomial models. It applies to other tests that we note in passing in this blog.
For example, Student’s t test (Sheskin 2011: 163) is analogous to the z test. It evaluates the mean values of samples and populations, rather than the mean probability of making a decision. It compares differences with critical values and confidence intervals obtained using the Student’s t distribution rather than the Normal z (although for large n, t approximates to the Normal). But despite these differences the same deconstruction can be applied: the t distribution is the statistical model of variation, the effect size is the observed difference and the weight of evidence is the number of observations, n.
Sheskin, D.J. 2011. Handbook of Parametric and Nonparametric Statistical Procedures. (5th Ed.) Boca Raton, Fl: CRC Press.
Wallis, S.A. 2013a. Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods. Journal of Quantitative Linguistics 20:3, 178-208. » Post
Wallis, S.A. 2013b. z-squared: the origin and application of χ². Journal of Quantitative Linguistics 20:4, 350-378. » Post
Wallis, S.A. 2019. Comparing χ² tables for separability of distribution and effect. Journal of Quantitative Linguistics 26:4, 330-355. » Post
Yates, F. 1934. Contingency tables involving small numbers and the chi-square test. Journal of the Royal Statistical Society 1:2, 217-235.
A common bias I witness among researchers in discussing statistics is the intuition (presumption) that distributions are Gaussian (Normal) and symmetric. But many naturally-occurring distributions are not Normal, and a key reason is the influence of boundary conditions.
Even for ostensibly Real variables, unbounded behaviour is unusual. Nature is full of boundaries.
Consequently, mathematical models that incorporate boundaries can sometimes offer a fresh perspective on old problems. Gould (1996) discusses a prediction in evolutionary biology regarding the expected distribution of biomass for organisms of a range of complexity (or scale), from those composed of a single cell to those made up of trillions of cells, like humans. His argument captures an idea about evolution that places the emphasis not on the most complex or ‘highest stages’ of evolution (as conventionally taught), but rather on the plurality of blindly random evolutionary pathways. Life becomes more complex due to random variation and stable niches (‘local maxima’) rather than some external global tendency, such as a teleological advantage of complexity for survival.
Gould’s argument may be summarised in the following way. Through blind random Darwinian evolution, simple organisms may evolve into more complex ones (‘complexity’ measured as numbers of cells or organism size), but at the same time others may evolve into simpler, but perhaps equally successful ones. ‘Success’ here means reproductive survival – producing new organisms of the same scale or greater that survive to reproduce themselves.
His second premise is also non-controversial. Every organism must have at least one cell and all the first lifeforms were unicellular.
Now, run time’s arrow forwards. Assuming a constant and equal rate of evolution, by simulation we can obtain a range of distributions like those in the Figure below.
The result of this ‘evolution’ is a Poisson distribution of biomass over complexity (Gould 1996: 171). Provided some species evolve into more complex forms, over time the concentration of biomass at the bottom end (complexity c = 1) reduces and the length of the upper ‘tail’ of the distribution increases.
Anthropocentrism is the ideological predisposition to view ourselves in the centre of history. From this perspective, evolution is often explained to children as a story leading to human ‘perfection’. We see complex organisms like us at c = 10 in the Figure above, and attempt to trace back evolutionary paths. Yet we know that present-day animals are equally ‘evolved’, and one can see evolution as a story of increasing diversity rather than perfection. Indeed, from the perspective of the total distribution of biological matter, ‘the Earth is currently populated by unicellular organisms with a long tail’.
Gould’s model predicts that biomass is distributed such that the vast majority of living cells are to be found in organisms at the lowest end of complexity. If evolution could only increase complexity, we would see an exponential distribution with a decreasing peak and increasing spread as time t increased. But in our model, like Gould’s, we allowed evolution to decrease and increase the proportion of biomass at any level of complexity, c, at the same rate. Eventually c = 2 overtakes c = 1 because for c = 1 all evolution must be in the direction of increasing complexity.
When Gould first made his prediction, something appeared to be missing. According to his model, there should be many more unicellular organisms on Earth than had previously been estimated. It was eventually supported by the discovery of unicellular organisms in Darwinian niches deep in the Earth’s crust. The missing biomass turned out to be underground, in soil and rocks.
Gould’s model emphasises that systems can generate decidedly non-Normal outcomes where boundaries are involved. In fact, physics places lower and upper limits on other tangible variables, such as height. Yet textbooks on statistics for school students give the distribution of heights of children in a class as an example of data expected to have a Normal distribution.
If you think about it, the height of schoolchildren has a lower limit of (more than) zero! This does not appear to matter to the textbook example because we do not expect a typical class sample to be close to the physical limit, and therefore an approximation to the Normal appears reasonable. There is an error introduced by the impact of the boundary, but, as data points are usually clustered far from the boundary, that error is small.
In other words, boundaries matter most if you are close to them.
Physics also places upper limits on physical size. For example, bone and muscle strength are in proportion to the cross-section of a leg, whereas mass is in proportion to volume. Alligators grow in proportion, so an alligator that doubles in length will increase in volume by the cube (2³ = 8), but its legs will only increase in cross-section by the square (2² = 4). If there were a glut of food such that these animals grew extremely large, they would hit the limit where their legs could not support their mass any further.
The largest animals on Earth are waterborne — and Godzilla’s legs would snap if she stood up.
This code simulates the effect of random variation on complexity over evolutionary cycles, commencing with unicellular organisms. Reconstructed based on a description by Stephen J. Gould (1996). Bold font (e.g. ‘distribution’) refers to sets.
function Gould(cycles, rate)
{	integer t, i
	float e, distribution(cycles), changes(cycles)
	set distribution = {1, 0, 0, …}	// 100% are unicellular, in cell 0
	set changes = {0, 0, 0, …}
	for t = 0 to cycles – 1	// cycles over time t
	{	for i = 0 to t	// first calculate changes on the basis of generation t
		{	e = rate × distribution(i)
			if (i > 0)	// if not at boundary
			{	changes(i – 1) += e/2	// increase on either side of i
				changes(i + 1) += e/2	// i.e. half evolve up, half down
			}
			else
				changes(i + 1) += e	// otherwise, all evolve up
			changes(i) –= e	// reduce at position i accordingly
		}
		for i = 0 to t + 1	// once calculated, apply and clear changes
		{	distribution(i) += changes(i)
			changes(i) = 0
		}
	}
	plot distribution	// plot curve over cycles
}
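For readers who prefer to run the simulation, here is a direct Python translation of the pseudocode (names are mine; index 0 represents unicellular life):

```python
def gould(cycles, rate):
    """Gould's 'drunkard's walk': each cycle, a fraction `rate` of biomass at each
    complexity level splits half a level up and half down, except at the wall
    (index 0 = unicellular), where all movement is upward."""
    distribution = [0.0] * (cycles + 2)
    distribution[0] = 1.0  # 100% are unicellular at the start
    for t in range(cycles):
        changes = [0.0] * (cycles + 2)
        for i in range(t + 1):
            e = rate * distribution[i]
            if i > 0:  # not at the boundary
                changes[i - 1] += e / 2
                changes[i + 1] += e / 2
            else:
                changes[i + 1] += e  # all evolve up
            changes[i] -= e  # reduce at position i accordingly
        for i in range(cycles + 2):
            distribution[i] += changes[i]
    return distribution
```

Total biomass is conserved, and in time the peak moves off the boundary: the proportion at c = 2 overtakes the proportion at c = 1, exactly as described above.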
Gould, S.J. (1996). Life’s Grandeur. London: Random House.
Cramér’s φ is an effect size measure used for evaluating correlations in contingency tables. In simple terms, a large φ score means that the two variables have a large effect on each other, and a small φ score means they have a small effect.
φ is closely related to χ², but it factors out the ‘weight of evidence’ and concentrates only on the slope. The simplest definition of φ is the unsigned formula
φ ≡ √(χ² / N(k – 1)), (1)
where k = min(r, c), the minimum of the number of rows and columns. In a 2 × 2 table, unsigned φ is simply φ = √(χ² / N).
In Wallis (2012), I made a number of observations about φ.
Whereas in a larger table, there are multiple degrees of freedom and therefore many ways one might obtain the same φ score, 2 × 2 φ may usefully be signed, in which case φ ∈ [-1, 1]. A signed φ obtains a different score for an increase and a decrease in proportion.
φ ≡ (ad – bc) / √((a + b)(c + d)(a + c)(b + d)), (2)
where a, b, c and d are cell scores in sequence, i.e. [[a b][c d]]:
   | x₁ | x₂
y₁ | a  | b
y₂ | c  | d
In Wallis (2019: 343-344) I applied Liebetrau’s estimate of standard deviation s(φ) (cited by Bishop et al. 1975) to create an interval and compare two 2 × 2 tables for significant difference. The formula looks like this:
s(φ) ≈ 1/(2φN) × {4Σ_{i,j} p_{i,j}³/(p_{i+}² p_{+j}²) – 3Σ_i (1/p_{i+})(Σ_j p_{i,j}²/(p_{i+} p_{+j}))² – 3Σ_j (1/p_{+j})(Σ_i p_{i,j}²/(p_{i+} p_{+j}))² + 2Σ_{i,j} [(p_{i,j}/(p_{i+} p_{+j}))(Σ_k p_{k,j}²/(p_{k+} p_{+j}))(Σ_l p_{i,l}²/(p_{i+} p_{+l}))]}, for φ ≠ 0, (3)
where p_{i,j} = f_{i,j} / N and p_{i+}, p_{+j}, etc. represent row and column (prior) probabilities (Bishop, Fienberg and Holland 1975: 386).
We can get around φ = 0 by adjusting the table slightly.
A very common mistake (and one that I made in that paper, and I really should know better) is to simply employ the interval φ ± z_{α/2}.s(φ). This is a ‘Wald’-type interval, which overshoots and collapses to zero-width. However, it is possible to employ a search procedure to invert the function in exactly the same way as the Clopper-Pearson interval.
In Appendix 2 of Wallis (2012) I proved that for 2 × 2 φ, the following equality holds:
φ² = dp_{R}(X, Y) × dp_{R}(Y, X),(4)
where dp_{R}(X, Y) is the relative dependent probability, a directional measure.
This score can be shown to be equivalent to the (negated) difference d between proportions across the y axis, and similarly for dp_{R}(Y, X):
dp_{R}(X, Y) = –d(y₁) = p(y₁ | x₁) – p(y₁ | x₂) = a/(a + c) – b/(b + d),
dp_{R}(Y, X) = –d(x₁) = p(x₁ | y₁) – p(x₁ | y₂) = a/(a + b) – c/(c + d). (5)
Note that we will often cite proportions in the form p₁ = p(x₁ | y₁) = a/(a + b), etc. This formulation is simply the difference in proportions calculated across the rows or columns of the 2 × 2 table. Crucially, we can calculate very accurate confidence intervals on differences d of this type, with no failure when d = 0. See below.
Returning to Equation (4), we can write
φ² = –d(y₁) × –d(x₁) = d(y₁) × d(x₁).(6)
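This identity is easy to verify numerically. A small Python check (function names are mine; the cell values are arbitrary):

```python
import math

def phi_signed(a, b, c, d):
    """Signed 2x2 phi, Equation (2): (ad - bc) over the root product of marginals."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

def dp_r_xy(a, b, c, d):
    """dp_R(X, Y) = p(y1|x1) - p(y1|x2)."""
    return a / (a + c) - b / (b + d)

def dp_r_yx(a, b, c, d):
    """dp_R(Y, X) = p(x1|y1) - p(x1|y2)."""
    return a / (a + b) - c / (c + d)
```

For any non-degenerate table, phi_signed(...)² equals the product of the two directional differences, as Equations (4) and (6) state.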
The first step in constructing a new confidence interval on any measure is to identify whether terms are independent or dependent. As we know, 2 × 2 tables have a single degree of freedom, and therefore these differences are dependent.
Differences d(y₁) and d(x₁) have a strictly increasing monotonic relationship (if one increases, so does the other), and if d(y₁) = 0, d(x₁) = 0. See Figure 1.
With the Gaussian method, testing in either direction obtains the same result. This is a corollary of the fact that, irrespective of whether we use d(x₁) or d(y₁), the z test for two independent proportions drawn from the same population obtains the same result as the 2 × 2 χ² test (Wallis 2013). The same equivalence is true (accepting rounding errors) for the Newcombe-Wilson test.
We can now derive an interval as follows.
The first thing to note is that all potential values of φ must obey Equation (6), including the interval bounds. Therefore we know this statement must also be true:
w⁺(φ)² = w⁺(d(y₁)) × w⁺(d(x₁)),(7)
where w⁺ represents the absolute upper bound of each term, i.e. the position of φ, d(y₁), etc where it would be just significantly greater than the observed φ, d(y₁), etc. The same logic applies to the lower bound. See Figure 2.
The next step is to deal with the square. We might think that we should just take the square root and be finished.
However, φ and d scores are in a declining monotonic relationship: d is negative when φ is positive, and vice-versa. Note the negative sign in Figure 1. If we plotted d instead of –d, the vertical axis would need to be from –1 to +1 and the resulting curves would mirror those shown, falling below the horizontal axis.
The rule for a declining monotonic relationship (such as q = 1 – p) is that we swap the bounds (see Reciprocating the Wilson interval). This gives us an interval for unsigned φ.
w⁻(|φ|) = √(w⁺(d(y₁)) × w⁺(d(x₁))),
w⁺(|φ|) = √(w⁻(d(y₁)) × w⁻(d(x₁))).
The last step is to recover the sign. When should one of these interval bounds be negative? Answer: when the relevant interval bound for the difference for y₁, w(d(y₁)), is positive. For the Gaussian, the signs of w(d(y₁)) and w(d(x₁)) are always the same, so it does not matter which we choose.
For the Newcombe-Wilson interval it is possible to have different signs, in which case we approximate the interval bound to simply w(d(y₁)) + w(d(x₁)).
We arrive at a confidence interval about φ defined by Equation (8).
w⁻(φ) = –sign(w⁺(d(y₁))) √(w⁺(d(y₁)) × w⁺(d(x₁))) if signs are equal,
w⁻(φ) = w⁺(d(y₁)) + w⁺(d(x₁)) otherwise;
w⁺(φ) = –sign(w⁻(d(y₁))) √(w⁻(d(y₁)) × w⁻(d(x₁))) if signs are equal,
w⁺(φ) = w⁻(d(y₁)) + w⁻(d(x₁)) otherwise, (8)
where sign(x) obtains +1 for x > 0, and –1 otherwise.
To express an interval about φ as a zero-based interval (to extract widths), we can cite the interval (φ – w⁺(φ), φ – w⁻(φ)).
We now have a formula for computing intervals on φ, based on intervals for differences d. We may insert any legitimate difference interval into this formula, as we shall see.
Let us use the sample data in Table 1.
      | x₁     | x₂     | total | p(x₁)
y₁    | 124    | 27     | 151   | 0.8212
y₂    | 501    | 66     | 567   | 0.8836
total | 625    | 93     | 718   |
p(y₁) | 0.1984 | 0.2903 | φ = -0.0757
For the purposes of illustration we will use Table 1 to compute intervals for φ based on Newcombe-Wilson and Gaussian intervals for d. See Wallis (2013) for how to compute these intervals.
First, we compute Gaussian bounds about d with α = 0.05,
d ± z_{α/2}√(p(1 – p)(1/n₁ + 1/n₂)),
where p is the pooled probability estimate, e.g. (a + b) / N. We do this for both variables X and Y:
d(x₁) ∈ 0.0624 – (-0.0603, 0.0603) = (0.0021, 0.1227),
d(y₁) ∈ 0.0919 – (-0.0888, 0.0888) = (0.0031, 0.1807).
This gives us the following interval for φ (based around φ, i.e. w⁻(φ) < φ < w⁺(φ)).
w⁺(φ) = –√(0.0031 × 0.0021) = -0.0026,
w⁻(φ) = –√(0.1807 × 0.1227) = -0.1489.
We may cite φ ∈ -0.0757 – (-0.1489, -0.0026), which excludes zero and therefore represents a statistically significant decline. The Gaussian interval is symmetric, so we can also extract its standard deviation by dividing the interval width by the critical value of z.
s(φ) = (w⁺(φ) – φ) / z_{α/2} = 0.0373.
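The entire Gaussian-based calculation can be reproduced in a few lines of Python (function names are mine; the data are from Table 1):

```python
import math

def gaussian_d_interval(p1, n1, p2, n2, z=1.96):
    """Gaussian interval about d = p2 - p1, using the pooled probability estimate."""
    d = p2 - p1
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    s = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return d - z * s, d + z * s

# Table 1: d(y1) compares columns x1 vs x2; d(x1) compares rows y1 vs y2
dy_lo, dy_hi = gaussian_d_interval(124 / 625, 625, 27 / 93, 93)
dx_lo, dx_hi = gaussian_d_interval(124 / 151, 151, 501 / 567, 567)

# Equation (8), 'signs are equal' branch: swap bounds and restore the sign
w_hi_phi = -math.sqrt(dy_lo * dx_lo)  # approx. -0.0026
w_lo_phi = -math.sqrt(dy_hi * dx_hi)  # approx. -0.1489
```

The two computed bounds match the worked example above to four decimal places.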
A continuity-corrected interval may be computed by first calculating intervals for d(y₁) and d(x₁) with continuity corrections. Similarly, a finite population correction or a correction for random text sampling may be applied.
As we discussed at the outset, φ has a close relationship with the 2 × 2 χ² test for homogeneity. We know that φ² = χ² / N. So you might reason that Gaussian intervals on d should be used for computing intervals for φ: this approach is precisely consistent with χ² when comparing φ with 0. The same argument applies to comparing 2 × 2 χ² test results with φ (see Wallis 2019).
However, it is also possible to employ Newcombe-Wilson intervals for estimating the intervals for d. In this case Newcombe-Wilson intervals for d(y₁) with α = 0.05 are (-0.1034, 0.0889) for d = 0.0919.
To employ the interval, we must first reposition the Newcombe-Wilson interval around d by subtracting the standard zero-origin Newcombe-Wilson interval from d. See Figure 2.
To reposition the upper bound around zero (that d, being positive, must exceed to be significant) relative to d, we subtract it from d to become a lower absolute bound for d (that must exclude zero to be significant).
This gives us the following:
d(x₁) ∈ 0.0624 – (-0.0729, 0.0603) = (0.0021, 0.1353),
d(y₁) ∈ 0.0919 – (-0.1034, 0.0889) = (0.0031, 0.1953),
w⁺(φ) = –√(0.0031 × 0.0021) = -0.0025 (signs are equal), and
w⁻(φ) = –√(0.1953 × 0.1353) = -0.1625.
Thus we can cite φ ∈ -0.0757 – (-0.1625, -0.0025) at this error level. The interval does not include zero, so the equivalent test is significant. If it is desired to employ a continuity correction or any other adjustments, these may be applied to difference intervals and then combined.
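The same calculation can be sketched with Newcombe-Wilson intervals for d (again, function names are mine; the function computes the interval about d directly, which is equivalent to repositioning the zero-based interval as described above):

```python
import math

def wilson(p, n, z=1.96):
    """Wilson score interval for a proportion p observed in n trials."""
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - spread) / denom, (centre + spread) / denom

def newcombe_wilson(p1, n1, p2, n2, z=1.96):
    """Newcombe-Wilson interval about d = p2 - p1 (absolute bounds, not zero-based)."""
    l1, u1 = wilson(p1, n1, z)
    l2, u2 = wilson(p2, n2, z)
    d = p2 - p1
    lo = d - math.sqrt((p2 - l2) ** 2 + (u1 - p1) ** 2)
    hi = d + math.sqrt((u2 - p2) ** 2 + (p1 - l1) ** 2)
    return lo, hi

# Table 1 again: intervals for d(y1) and d(x1)
dy_lo, dy_hi = newcombe_wilson(124 / 625, 625, 27 / 93, 93)
dx_lo, dx_hi = newcombe_wilson(124 / 151, 151, 501 / 567, 567)

# Equation (8): swap bounds and negate (signs are equal here)
w_hi_phi = -math.sqrt(dy_lo * dx_lo)  # approx. -0.0025
w_lo_phi = -math.sqrt(dy_hi * dx_hi)  # approx. -0.1625
```

Any further adjustment (continuity correction, finite population correction) would be applied to the difference intervals before they are combined.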
We can initially compare how these intervals perform by plotting them with the same contingency table interpolation method that we used for Figure 1.
I have published a spreadsheet so you can experiment yourself with different prior probabilities and different sample size N. This does not include the inverted Liebetrau method, which must be computed separately.
[Note: Due to the complexity of Liebetrau’s interval calculation, this spreadsheet contains macros to perform standard functions to calculate the Wilson interval and Liebetrau’s standard deviation, Equation (3). Despite the warning message, these macros are harmless, and indeed, useful.]
To read the graph in Figure 3, note that signed φ ∈ [-1, 1]. As N = 20, intervals are quite wide, which helpfully exposes differences between formulae performance.
This graph reinforces observations we noted earlier. The Gaussian method suffers from ‘overshoot’, i.e. projecting possible values for d or φ that are mathematically impossible.
The method employing Wilson inversion about observed p (see Wallis 2013), i.e. the Newcombe-Wilson-based interval, tends towards the centre of the distribution as we should expect.
The unadjusted Newcombe-Wilson method closely matches Liebetrau’s method after inversion. But it has one further advantage — it can be corrected for continuity.
There is a growing interest in plotting and citing sizes of effect with confidence intervals as a preference to null hypothesis significance testing. Figure 3 shows that we can use an interval for comparing effect size φ with some arbitrary constant, say φ > 0.25 or φ < 0.9.
Liebetrau’s method for an estimated standard deviation of population φ scores, s(φ), may be deployed in a search procedure. This uses the interval equality principle. For example, the lower bound of φ may be calculated by finding a solution to:
φ⁻ + z_{α/2}.s(φ⁻) = φ.(9)
The main benefit of this method, like the Clopper-Pearson interval, is to act as a ‘gold standard’ for other methods. By contrast, the Newcombe-Wilson method can be computed directly from formulae and corrected for continuity (etc).
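The search itself is straightforward bisection. In the sketch below, since Liebetrau's s(φ) is lengthy, I substitute the Gaussian standard deviation of a proportion, √(x(1 – x)/n), purely to demonstrate the principle; with this stand-in the procedure recovers the familiar Wilson score lower bound. All names are mine.

```python
import math

def invert_lower(s_func, obs, z=1.96, lo=0.0, hi=None, iters=100):
    """Solve x + z.s_func(x) = obs for x by bisection (interval equality principle)."""
    if hi is None:
        hi = obs
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mid + z * s_func(mid) < obs:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Stand-in for s(phi): Gaussian s.d. of a proportion, with n = 30 observations
n = 30
s = lambda x: math.sqrt(x * (1 - x) / n)
lower = invert_lower(s, 0.3)  # the Wilson lower bound for p = 0.3
```

Substituting Liebetrau's Equation (3) for the stand-in s function gives the inverted Liebetrau bound used as the ‘gold standard’ comparison here.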
Note that when comparing with φ = 0, the 2 × 2 φ Gaussian test obtains the same result as the homogeneity χ² test. But, like the ‘Wald’ interval, the Gaussian interval is inaccurate, and should be discarded.
To compare two φ scores you should employ the Newcombe-Wilson interval. In Wallis (2019) we compared differences in φ as a type of unbiased gradient meta-test for comparing tables (unbiased because it does not prioritise either dependent or independent variable).
Out of the two remaining approaches, the Newcombe-Wilson-based interval in a pairwise test is the best-behaved method for comparing 2 × 2 χ² tables, and consistent with the gradient method outlined in Wallis (2019).
The method of derivation described may be of interest for anyone wishing to compute intervals for other effect size measures.
Bishop, Y.M.M., S.E. Fienberg & P.W. Holland (1975). Discrete Multivariate Analysis: Theory and Practice. Cambridge, MA: MIT Press.
Wallis, S.A. 2012. Measures of association for contingency tables. London: Survey of English Usage, UCL. » Post
Wallis, S.A. 2013. z-squared: the origin and application of χ². Journal of Quantitative Linguistics 20:4, 350-378. » Post
Wallis, S.A. 2019. Comparing χ² Tables for Separability of Distribution and Effect: Meta-Tests for Comparing Homogeneity and Goodness of Fit Contingency Test Outcomes. Journal of Quantitative Linguistics. 26:4, 330-355. » Post
Many conventional statistical methods employ the Normal approximation to the Binomial distribution (see Binomial → Normal → Wilson), either explicitly or buried in formulae.
The well-known Gaussian population interval (1) is
Gaussian interval (E⁻, E⁺) ≡ P ± z√(P(1 – P)/n), (1)
where n represents the size of the sample, and z the two-tailed critical value for the Normal distribution at an error level α, more properly written z_{α/2}. The standard deviation of the population proportion P is S = √(P(1 – P)/n), so we could abbreviate the above to (E⁻, E⁺) ≡ P ± z.S.
When these methods require us to calculate a confidence interval about an observed proportion, p, we must invert the Normal formula using the Wilson score interval formula (Equation (2)).
Wilson score interval (w⁻, w⁺) ≡ (p + z²/2n ± z√(p(1 – p)/n + z²/4n²)) / (1 + z²/n). (2)
In a 2013 paper for JQL (Wallis 2013a), I referred to this inversion process as the ‘interval equality principle’. This means that if (1) is calculated for p = E⁻ (the Gaussian lower bound of P), then the upper bound that results, w⁺, will equal P. Similarly, for p = E⁺, the lower bound that results, w⁻, will equal P.
We might write this relationship as
p ≡ GaussianLower(WilsonUpper(p, n, α), n, α), or, alternatively
P ≡ WilsonLower(GaussianUpper(P, n, α), n, α), etc. (3)
where E⁻ = GaussianLower(P, n, α), w⁺ = WilsonUpper(p, n, α), etc.
Note. The parameters n and α become useful later on. At this stage the inversion concerns only the first parameter, p or P.
Nonetheless the general principle is that if you want to calculate an interval about an observed proportion p, you can derive it by inverting the function for the interval about the expected population proportion P, and swapping the bounds (so ‘Lower’ becomes ‘Upper’ and vice versa).
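This round trip can be demonstrated in a few lines of Python (function names follow the notation above; they are mine):

```python
import math

def gaussian_lower(P, n, z=1.96):
    """Lower bound of the Gaussian interval (1) about a population proportion P."""
    return P - z * math.sqrt(P * (1 - P) / n)

def wilson_upper(p, n, z=1.96):
    """Upper bound of the Wilson score interval (2) about an observed proportion p."""
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre + spread) / denom

# Interval equality principle, Equation (3): the composition recovers p
p_back = gaussian_lower(wilson_upper(0.3, 30), 30)  # recovers 0.3
```

The inversion is exact (up to floating-point error), because the Wilson bound is defined as the root of the Normal approximation.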
In the paper, using this approach I performed a series of computational evaluations of the performance of different interval calculations, following in the footsteps of more notable predecessors. Comparison with the analogous interval calculated directly from the Binomial distribution showed that a continuity-corrected version of the Wilson score interval performed accurately.
Continuity corrections are used because the original source Binomial distribution (that we are approximating to) is ‘chunky’. See the figure below.
All observed proportions must be whole fractions of n, p ∈ {0/n, 1/n, 2/n,… n/n}, and yet the interval calculation we use is based on the Normal interval (1), which is continuous. So, using a method due to Frank Yates, we add an extra ‘half 1/n’ to intervals on either side of P.
The most famous example of a continuity-correction is employed with a standard chi-square formula
Yates’ χ² = Σ(|o_{i,j} – e_{i,j}| – 0.5)² / e_{i,j} (4)
for all cells at index positions i, j in a contingency table. This formula is expressed in units of n rather than 1, so the correction is simply 0.5.
Strictly speaking, Yates’ formula has a flaw. It should guarantee that if the difference between observed and expected cells, d = o_{i,j} – e_{i,j}, is within ±0.5, the entire term should go to zero. This makes little difference for 2 × 2 tables, but for tables with more than one degree of freedom the following is recommended.
Yates’ χ² = Σ(DiffCorrect(o_{i,j} – e_{i,j}, 0.5))² / e_{i,j},(4′)
where DiffCorrect(d, c) = d – c if d > c, d + c if d < –c, and 0 otherwise.
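In code, DiffCorrect and the corrected test look like this (a Python sketch with my own function names; expected frequencies are computed from the marginals):

```python
def diff_correct(d, c=0.5):
    """Shrink the difference d towards zero by c, flooring at zero (Equation 4')."""
    if d > c:
        return d - c
    if d < -c:
        return d + c
    return 0.0

def yates_chi_squared(observed):
    """Yates' continuity-corrected chi-square over an r x c contingency table."""
    n = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected cell frequency
            total += diff_correct(o - e) ** 2 / e
    return total
```

Note that diff_correct sends any difference within ±0.5 to exactly zero, which is the refinement Equation (4′) adds over the conventional |d| – 0.5 formulation.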
χ² is based on the Normal distribution z (Wallis 2013b). The Gaussian population interval about a known or predicted population value P (Equation (1)) may be corrected for continuity, giving Yates’ population interval.
Yates’ interval (E⁻, E⁺) ≡ P ± (z√(P(1 – P)/n) + 1/2n). (5)
It is easy to see the relationship between Equations (5) and (1). Moreover it is straightforward to apply other adjustments to the standard deviation or variance (the variance is simply the square of the standard deviation, so this amounts to the same thing).
The continuity-corrected Wilson score interval formula is not often presented, and when it does appear, it appears in slightly different forms in the literature. However, on the basis of Robert Newcombe’s (1998) paper, I have tended to present it as Equation (6).
In fact this is simplified, as it is also necessary to employ ‘min’ and ‘max’ constraints to ensure that w_{cc}⁻ ∈ [0, p] and w_{cc}⁺ ∈ [p, 1].
w_{cc}⁻ ≡ (2np + z² – (z√(z² – 1/n + 4np(1 – p) + (4p – 2)) + 1)) / 2(n + z²), and
w_{cc}⁺ ≡ (2np + z² + (z√(z² – 1/n + 4np(1 – p) – (4p – 2)) + 1)) / 2(n + z²). (6)
Indeed, for the last ten years or so I have been working with this formula. It exists in spreadsheets I give our students. But it has two obvious problems: it is difficult to see how it relates to the uncorrected interval, Equation (2), and it does not separate the continuity-correction term from the variance.
As we shall see, there are circumstances when we might wish to modify the variance and thus the width of the interval, but to not adjust the correction for continuity.
Consider the finite population correction or ‘f.p.c.’. This is typically presented as an adjustment to standard deviation. See this post.
Finite population correction ν = √((N – n)/(N – 1)). (7)
As the name implies, the finite population correction is applied to an interval or test when a sample is not drawn from an infinite population as the standard model assumes, but when it is drawn from one of a fixed size, N. In particular, it is relevant if the sample is a sizeable proportion of the population, say, 5%. Clearly if N >> n, then the finite population correction factor ν tends to 1, and has no effect.
To apply this adjustment to Equations (1) and (5), we can multiply the standard deviation term by ν.
Gaussian interval (E⁻, E⁺) ≡ P ± zν√P(1 – P)/n. (1′)
and
Yates’ interval (E⁻, E⁺) ≡ P ± (zν√(P(1 – P)/n) + 1/2n). (5′)
By inspecting (1′) we can see that rather than multiply the standard deviation by ν, we could instead adjust the sample size, n′ = n/ν², and substitute n′ for n in each equation. We can now apply the same adjustment to the uncorrected Wilson score interval, Equation (2).
But we cannot use the same method with Equation (6), the continuity-corrected Wilson interval. To see why, first consider Equation (5). We need to adjust the standard deviation S, but not the continuity-correction term, c = 1/2n.
Why do we not rescale c? Answer: because the entire point of a continuity correction is to overcome the ‘chunkiness’ of the original source Binomial distribution. See above. So we should not modify n in the formula for c. The original distribution is no less chunky! The interval is narrower because a finite population means we can be more certain.
To apply this correction to a χ² test, we can calculate the test in the normal way and divide the result by ν². This method works for the standard test or Yates’ version (Equation (4)).
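A small sketch of this rule, assuming a simple 2 × 1 goodness-of-fit computation and an invented population size N:

```python
def chi_squared_fpc(observed, expected, N=None):
    """Goodness-of-fit chi-squared; if a population size N is supplied,
    apply the finite population correction by dividing by nu^2."""
    chi2 = sum((o - e)**2 / e for o, e in zip(observed, expected))
    if N is not None:
        n = sum(observed)
        nu_sq = (N - n) / (N - 1)   # nu^2 = (N - n)/(N - 1)
        chi2 /= nu_sq
    return chi2

# 2 x 1 goodness of fit: observed 65 vs 108 against an even split
print(chi_squared_fpc([65, 108], [86.5, 86.5]))          # approx 10.69
print(chi_squared_fpc([65, 108], [86.5, 86.5], N=1000))  # larger, fpc-adjusted
```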
Our task is therefore to find a formula for (6) that separates out the scale of the standard deviation from the continuity-corrected term.
It turns out that the solution is extremely simple and intuitive. Indeed, it is so simple and intuitive once you see it that it is rather surprising that papers do not simply give it in this form! (I suspect this says more about a tendency towards mathematical brevity on the one hand, and a tendency for researchers to copy formulae rather than analyse and explain them from first principles on the other.)
Aside: The route to a Eureka moment is not always very edifying. In my case, I could have kicked myself! After three days of struggling with algebraic reductions of Equation (6), I read back through Newcombe (1998) and his sources. Blyth and Still (1983) were also not very clear, but at least their paper reformulates Equations (2) and (6) differently. Then I remembered something. I had plotted Equation (6) when plotting the Wilson distribution. The corrected intervals began at p ± 1/2n. See the figure below.
Here it is (drum roll please):
Let us use functions to define the interval bounds for the uncorrected interval (Equation (2)),
w⁻ = WilsonLower(p, n′, α),
w⁺ = WilsonUpper(p, n′, α).
Then
w_{cc}⁻ = WilsonLower(p – 1/2n, n′, α),
w_{cc}⁺ = WilsonUpper(p + 1/2n, n′, α). (8)
That was not hard, was it?
This equation solves our problem. The continuity correction is added to the origin of the interval, p, first. Just as with Yates’ formula (4), we modify the variance in Equation (2) by rescaling n (hence we use n′ = n/ν²). But we retain c = 1/2n without rescaling it.
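Equation (8) might be sketched as follows. The function names are modelled on WilsonLower and WilsonUpper above; the optional n′ argument is where an adjusted sample size (such as n/ν²) would be supplied, while c is always based on the real n:

```python
from statistics import NormalDist
import math

def wilson_lower(p, n, alpha=0.05):
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return (p + z*z/(2*n) - z*math.sqrt(p*(1-p)/n + z*z/(4*n*n))) / (1 + z*z/n)

def wilson_upper(p, n, alpha=0.05):
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return (p + z*z/(2*n) + z*math.sqrt(p*(1-p)/n + z*z/(4*n*n))) / (1 + z*z/n)

def wilson_cc_bounds(p, n, alpha=0.05, n_prime=None):
    """Equation (8): shift p by the continuity correction c = 1/2n first,
    then compute the uncorrected interval, optionally with adjusted n'."""
    c = 1 / (2 * n)            # the correction keeps the real n
    n_prime = n_prime or n     # e.g. pass n_prime = n / nu**2 for an fpc
    lower = max(0.0, wilson_lower(p - c, n_prime, alpha))
    upper = min(1.0, wilson_upper(p + c, n_prime, alpha))
    return lower, upper

print(wilson_cc_bounds(0.25, 20))  # approximately (0.0959, 0.4941)
```

With n′ = n, the result agrees exactly with the direct formula, Equation (6).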
Note that when we apply a continuity correction to the population proportion P, we calculate the interval on the basis of P first and then add 1/2n second. But when we apply a continuity correction to the observed proportion p, we add it to p first, and then calculate the interval. This is logical, because the interval equality principle also applies to the continuity-corrected interval.
Sometimes we statisticians make life unnecessarily difficult for ourselves. The solution above is hinted at by Blyth, Still, and Newcombe, but it is certainly not presented in the simple way I have done above.
Secondly, it is rare to see a statistical discussion on correcting for continuity and finite populations at the same time. Corrections for continuity tend to be forgotten as soon as formulae become more complex or tables gain more dimensions. However the reasons for correcting for continuity have not suddenly disappeared! The source distribution is still ‘chunky’! As a general point, continuity corrections may be omitted from effect size estimates, but should be taken into account in significance testing or interval calculations.
Yet with care and consideration – and some first-principles mathematics – it is possible to apply corrections for continuity and finite population to the same formulae. Other corrections, such as cluster sampling corrections (in corpora, this is usually random text sampling), can also now be applied just as easily.
Given the proven improvements in reducing Type I errors that this adjustment involves, especially for small samples, I am increasingly of the view that we should apply continuity corrections whenever we carry out a significance test. Equation (2) may still be used for plotting purposes, but for comparing proportions we should employ Yates’ 2 × 2 test or the Newcombe-Wilson test with continuity correction (see Wallis 2013a, b).
Blyth, C.R. & H.A. Still. 1983. Binomial Confidence Intervals. Journal of the American Statistical Association 78, 108-116.
Newcombe, R.G. 1998a. Two-sided confidence intervals for the single proportion: comparison of seven methods. Statistics in Medicine 17, 857-872.
Wallis, S.A. 2013a. Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods. Journal of Quantitative Linguistics 20:3, 178-208. » Post
Wallis, S.A. 2013b. z-squared: the origin and application of χ². Journal of Quantitative Linguistics 20:4, 350-378. » Post
The Summer School is a short three-day intensive course aimed at PhD-level students and researchers who wish to get to grips with Corpus Linguistics.
Please note that this course is very popular, and numbers are deliberately limited; places are allocated on a first-come, first-served basis! You will be taught in a small group by a teaching team.
Each day begins with a theory lecture, followed by a guided hands-on workshop with corpora, and a more self-directed and supported practical session in the afternoon.
Over the three days, participants will learn about the following:
At the end of the course, participants will have:
For more information, including costs, booking information and the timetable, see the website.
The standard approach to teaching (and thus thinking about) statistics is based on projecting distributions of ranges of expected values. The distribution of an expected value is a set of probabilities that predict what the value will be, according to a mathematical model of what you predict should happen.
For the experimentalist, this distribution is the imaginary distribution of very many repetitions of the same experiment that you may have just undertaken. It is the output of a mathematical model.
Thinking about this projected distribution represents a colossal feat of imagination: it is a projection of what you think would happen if only you had world enough and time to repeat your experiment, again and again. But often you can’t get more data. Perhaps the effort to collect your data was huge, or the data comes from a finite set of available sources (historical documents, patients with a rare condition, etc.). Actual replication may be impossible for material reasons.
In general, distributions of this kind are extremely hard to imagine, because they are not part of our directly-observed experience. See Why is statistics difficult? for more on this. So we already have an uphill task in getting to grips with this kind of reasoning.
Significant difference (often shortened to ‘significance’) refers to the difference between your observations (the ‘observed distribution’) and what you expect to see (the expected distribution). But to evaluate whether a numerical difference is significant, we have to take into account both the shape and spread of this projected distribution of expected values.
When you select a statistical test you do two things:
The problem is that in many cases it is very difficult to imagine this projected distribution, or — which amounts to the same thing — the implications of the statistical model.
When tests are selected, the main criterion you have to consider concerns the type of data being analysed (an ‘ordinal scale’, a ‘categorical scale’, a ‘ratio scale’, and so on). But the scale of measurement is only one of several parameters that allow us to predict how random selection might affect the resampling of data.
A mathematical model contains what are usually called assumptions, although it might be more accurate to call them ‘preconditions’ or parameters. If these assumptions about your data are incorrect, the test is likely to give an inaccurate result. This principle is not either/or, but can be thought of as a scale of ‘degradation’. The less the data conforms to these assumptions, the more likely your test is to give the wrong answer.
This is particularly problematic in some computational applications. The programmer could not imagine the projected distribution, so they tweaked various parameters until the program ‘worked’. In a ‘black-box’ algorithm this might not matter. If it appears to work, who cares if the algorithm is not very principled? Performance might be less than optimal, but it may still produce valuable and interesting results.
But in science there really should be no such excuse.
The question I have been asking myself for the last ten years or so is simply: can we do better? Is there a better way to teach (and think about) statistics than from the perspective of distributions projected by counter-intuitive mathematical models (taken on trust) and significance tests?
One of the simplest statistical models concerns Binomial distributions. I find myself writing again and again about this class of distributions (and the mathematical model underpinning them) because they are central to corpus linguistics research, where variables mostly concern categorical decisions.
But even if you are principally concerned with other types of statistical model, bear with me. The argument below may be applied to the Student’s t distribution, for example. The differences lie in the formulae for computing intervals. The reasoning process is directly comparable.
The conventional way to think about a Binomial evaluation is as follows.
This particular test is called the Binomial test.
Below is an example, taken from an earlier blog post, Comparing frequencies within a discrete distribution. This particular evaluation models the Binomial distribution for P = 0.5 and n = 173 (the amount of data in our sample, termed the sample size).
The Binomial distribution (purple hump) is the distribution we would expect to see if we repeatedly tried to sample P, i.e. we repeated our experiment ad infinitum.
That is what we mean by a ‘projected distribution’. We can’t see it, and we can’t construct it by repeated observation because we have insufficient time!
The height of each column in this distribution is the chance that we might observe any particular frequency, r ∈ {0, 1, …, n}, whenever we perform our experiment. For the maths to work, we assume that every single one of the n cases in our sample is randomly and independently sampled from a population of cases whose mean probability is P.
The values we would most likely observe are 86 and 87 (173 × 0.5 = 86.5). Note that with an odd n = 173, the observed p can never be exactly 0.5, even though this is the ‘expected value’ P!
However, the chance of either of these values being obtained is pretty small — 0.06. There is a range of values to either side of P where we would expect to see p fall. What the pattern shows us is that, say, a value of r = 60 or less is very unlikely to have occurred by chance.
The formula for the Binomial function looks like this:
Binomial distribution B(r) = nCr P^{r} (1 – P)^{(n – r)}. (1)
This function generates the probability that any given value of r will be obtained, given P and n. For more information on what these terms mean, see Wallis (2013).
Next, we consider our particular observation, which might be expressed as a frequency, f = 65 or proportion p = f / n = 0.3757.
Now, the conventional approach to this test is to add up all the columns in the area less than or equal to f, i.e. from 0 to 65 (see the box in the figure above). This ‘Binomial tail sum’ area turns out to be 0.000669 to six decimal places. So we can report that there is only a 0.000669 chance that an observation at least this far below P would be obtained by mere random chance. Since this is less than α/2 = 0.025, we can say that the difference p – P is significant at an error level of α = 0.05.
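As a sketch, the Binomial tail sum can be computed directly from Equation (1), using Python’s math.comb for nCr:

```python
from math import comb

def binomial(r, n, P):
    # Equation (1): probability of observing exactly r out of n
    return comb(n, r) * P**r * (1 - P)**(n - r)

n, P, f = 173, 0.5, 65
tail = sum(binomial(r, n, P) for r in range(f + 1))  # columns r = 0..65
print(tail)  # the Binomial tail sum: compare the 0.000669 quoted above
```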
Since this calculation is a little time-consuming and computationally arduous to carry out with large values of n, for over 200 years researchers have used an approximation credited to Carl Friedrich Gauss, namely to approximate the chunky Binomial distribution to another, smooth distribution, called the Gaussian or ‘Normal’ distribution.
In the graph below, the Gaussian distribution is plotted as a dashed line. As you can see, in this case the difference between the two shapes is almost imperceptible.
But now we can dispense with all that complicated ‘adding up of combinations’ that the Binomial test requires. The Gaussian approximation calculates the standard deviation of the Normal distribution, S, using a very simple function. On a probabilistic scale this calculation looks like this.
S = √(P(1 − P)/n), (2)
S = √(0.25 / 173) = 0.0380.
The Normal distribution is a regular shape that can be specified by two parameters: the mean and the standard deviation. We have mean P = 0.5 and standard deviation S = 0.0380.
Now we can apply a further trick. To perform the test, we don’t actually need to add up the area to the left of p. That’s a lot of work. All we need do is work out what p would need to be in order for the difference p – P to be just at the edge between significance and non-significance. At this point, the area under the curve will equal a given threshold probability, α/2 of the total area under the curve, where α represents the acceptable ‘error level’ (e.g. 1 in 20 = 0.05, 1 in 100 = 0.01 and so on). This area is half of α because, as the graph indicates, there will be another similar ‘tail area’ at the other side of the curve.
In simple terms, the area shaded in pink in the graph above is half of 5% of the total area under the curve, or — to put it another way — if the true rate in the population P was 0.5, the chance of a random sample obtaining a value of p less than the line to the right of that area is 0.05/2 = 0.025. (In our graph we have scaled all values on the horizontal axis by the total frequency, n, but this just means we multiply everything on a probability scale by n!)
How do we work this out? Well, we use the critical value of the Normal distribution, which we can write as z_{α/2} or, less commonly, Φ⁻¹(1 – α/2), where Φ(x) is the Normal cumulative probability distribution function. This allows us to compute an interval in which (1 – α) = 95% of the area under the curve lies within z_{α/2} standard deviations of P.
For α = 0.05, this ‘two-tailed’ value is 1.95996. The Normal confidence interval about P is then simply the range centred on P:
(P – z_{α/2}·S … P … P + z_{α/2}·S) = (0.4255, 0.5745).
Since p = 0.3757 is outside this range, we can report that p is significantly different from P (or p – P is a significant difference, which amounts to the same thing). This is more informative than saying ‘the result is significant’. But crucially, it relies on us pre-identifying a value of P, which we cannot obtain from data!
We have marked this out in the graph above, again, multiplying by n.
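The whole calculation above can be sketched in a few lines of Python, following the numbers of the worked example:

```python
from statistics import NormalDist
import math

P, n, p = 0.5, 173, 65/173
alpha = 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, approx 1.95996
S = math.sqrt(P * (1 - P) / n)           # Equation (2): approx 0.0380
lower, upper = P - z * S, P + z * S      # Normal interval (0.4255, 0.5745)
print('significant' if not (lower <= p <= upper) else 'not significant')
```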
The conventional approach to statistics focuses on the mathematical model, and the projected distribution. Is there another way?
An alternative way of thinking about statistics is to start from the observer’s perspective.
Most of the time we simply do not have a population value P, but we always have an observation p. In our example we assumed P was 0.5 for the purposes of the test — to compare p with 0.5. But this is a very limited application of statistics. What if we don’t know what P is? We only have observations to go on.
Conclusion: Instead of focusing on the projected distribution of a known population value, we should focus instead on projecting the behaviour of observed values.
The following graph plots the Wilson score distribution about p, using a method I developed in an earlier blog post. That distribution (blue line) may be given a confidence interval (the Wilson score interval) with the pink dot in the centre. We have plotted the equivalent 95% interval as before, so, again, 2.5% of the area under the curve can be found in the tail area ‘triangle’ above the upper bound (vertical line), and 2.5% of the area under the curve is found in the tail area below the lower bound.
The confidence interval for p (indicated by the line with the pink dot) is:
95% Wilson score interval (w⁻, w⁺) = (0.3070, 0.4498),
using the Wilson score interval formula (Wilson 1927):
Wilson score interval (w⁻, w⁺) ≡ (p + z²/2n ± z√(p(1 – p)/n + z²/4n²)) / (1 + z²/n), (3)
where z represents the error level z_{α/2}, shortened for reasons of space.
This particular distribution looks very similar to the Normal distribution. However, it is a little squeezed on the left hand side. It is asymmetric, with the interval widths being unequal:
y⁻ = p – w⁻ = 0.3757 – 0.3070 = 0.0687, and
y⁺ = w⁺ – p = 0.4498 – 0.3757 = 0.0741.
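A sketch of this calculation (the function name wilson is illustrative), reproducing the interval and the unequal widths y⁻ and y⁺:

```python
from statistics import NormalDist
import math

def wilson(p, n, alpha=0.05):
    # Equation (3): the Wilson score interval (w-, w+)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    centre = p + z*z / (2*n)
    spread = z * math.sqrt(p*(1 - p)/n + z*z/(4*n*n))
    denom = 1 + z*z/n
    return (centre - spread) / denom, (centre + spread) / denom

p, n = 65/173, 173
w_minus, w_plus = wilson(p, n)                      # approx (0.3070, 0.4498)
print(round(p - w_minus, 4), round(w_plus - p, 4))  # 0.0687 0.0741
```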
For more information, see Plotting the Wilson distribution.
What does this interval tell us?
In our sample, we observed p = 0.3757 as the proportion p(A | {A, B}) = 65/173.
On the basis of this information alone, we can predict that the range of the most likely values for P in the population from which the sample is drawn is between 0.3070 and 0.4498, if we make this prediction with a 95% level of confidence.
The value w⁻ represents the lowest possible value of P consistent with the observation p: if P < p but P > w⁻, we would report that the difference is not significant.
Similarly, the value w⁺ is the largest possible value of P consistent with p.
Note that we have dispensed with any need to consider the actual population proportion, P. We don’t need to know what it is. Instead we view it, through our ‘Wilson telescope’, from the perspective of our observation p. The picture is a bit blurry, which is why we have a confidence interval that stretches over some 10% of the probability scale. But we have a reasonable estimate of where P is likely to be.
If we want to test p against a particular value, say, P = 0.5, we can now do so trivially easily. It is outside this range, so we can report our observation p is significantly different from P = 0.5. If we plot data with score intervals, we can even compare observations by eye.
Consider the following thought experiment.
As an adult, you meet up with a bunch of random friends you haven’t seen for several years. Twenty in all, with nothing particular to connect them together.
For the sake of our thought experiment, let us assume this group of friends are twenty random individuals drawn from the population, but if they all went to the same school we might be concerned about whether they only represented a more limited population!
It turns out, as you chat, that 5 out of 20 had chicken pox (varicella) as a child. (Chicken pox is a childhood disease; immunisation is widespread and few adults slip through the net, so we can assume that anyone who has not had it by the age of 20 is unlikely to contract it.)
On the basis of this observation alone, what is the most likely rate of chicken pox in the population? Can we be 95% confident it is less than half?
To work out the answer, we know two facts: p = 5 / 20 = 0.25, and n = 20.
Using Equation (3), this gives us
95% Wilson score interval (w⁻, w⁺) = (0.1119, 0.4687),
which excludes 0.5, so the 95% interval is indeed less than half. (With a correction for continuity, the interval becomes (0.0959, 0.4941) — still below 0.5).
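We can check this arithmetic with a short sketch. The wilson function implements Equation (3); the continuity-corrected bounds are obtained by shifting p by the correction 1/2n before computing the interval:

```python
from statistics import NormalDist
import math

def wilson(p, n, alpha=0.05):
    # Equation (3): the Wilson score interval
    z = NormalDist().inv_cdf(1 - alpha / 2)
    centre = p + z*z / (2*n)
    spread = z * math.sqrt(p*(1 - p)/n + z*z/(4*n*n))
    denom = 1 + z*z/n
    return (centre - spread) / denom, (centre + spread) / denom

p, n = 5/20, 20
w_minus, w_plus = wilson(p, n)                  # approx (0.1119, 0.4687)
cc_minus = max(0.0, wilson(p - 1/(2*n), n)[0])  # continuity-corrected bounds,
cc_plus = min(1.0, wilson(p + 1/(2*n), n)[1])   # approx (0.0959, 0.4941)
print(w_plus < 0.5 and cc_plus < 0.5)           # True: both exclude one half
```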
If you think about it, this conclusion is at one and the same time, remarkably powerful — and counter-intuitive.
How can it be that, with only 20 people to go on, we can be so definite in our conclusions?
In many ways the idea of an interval about an observation p is just as curious as the idea of an interval about P. Both are based on the counterintuitive idea that simple randomness leads to a predictable degree of variation when data is resampled.
The Wilson interval on p has many more applications than either traditional tests or confidence intervals on P. This is simply because, as we noted earlier, most of the time we simply do not know what P is.
For example, we can compare Wilson intervals using what I have elsewhere referred to as the Wilson score interval comparison heuristic:
For any pair of proportions, p₁ and p₂, check the following:
What this means is that in many cases we don’t need to perform a statistical test to compare them. We can simply ‘eyeball’ the data. We can also use confidence intervals to perform tests, like the Newcombe-Wilson test.
Armed with our new-found mathematical understanding of statistics, we can also ask other, related questions.
For example, we might ask how much data we would need for an observation of p = 0.25 to allow us to conclude that P < 0.5.
Figure: 95% and 99% bounds of the Wilson score interval for p = 0.25, n = {4, 8, …, 40}.
To get the answer, I have plotted the upper and lower bound of the Wilson score interval for n as multiples of 4 (our observation concerns whole numbers, remember). For good measure I have included the error level α = 0.01 alongside 0.05. We can clearly see the asymmetry of the interval.
We can see that for α = 0.05, we only need n = 16 guests at our get-together to justify a claim that the population value P is below 50%, but at α = 0.01, we need 28 guests. (This is proof positive that anyone who demands a smaller error level needs more friends!)
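To reproduce these figures, we can search for the smallest n, in multiples of 4, whose upper Wilson bound falls below 0.5. A sketch (the function name guests_needed is an invention for this example):

```python
from statistics import NormalDist
import math

def wilson_upper(p, n, alpha):
    # Upper bound of the Wilson score interval, Equation (3)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return (p + z*z/(2*n) + z*math.sqrt(p*(1-p)/n + z*z/(4*n*n))) / (1 + z*z/n)

def guests_needed(alpha, p=0.25, step=4, limit=400):
    # Smallest n in multiples of 4 (so that f = n/4 is a whole number)
    # for which the upper Wilson bound falls below 0.5.
    for n in range(step, limit + 1, step):
        if wilson_upper(p, n, alpha) < 0.5:
            return n

print(guests_needed(0.05))  # 16
print(guests_needed(0.01))  # 28
```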
Does this all mean we should dispense with significance tests altogether and replace them with confidence interval analysis? This is something that many in the ‘New Statistics’ movement claim. I argue against this because not all tests can be replaced by confidence interval comparisons. For example, the z test summarised above can also be carried out using a 2 × 1 χ² test computation. But for r > 2, an r × 1 χ² test is not the same as a series of 2 × 1 tests.
Dispensing with tests altogether is premature, but a focus on confidence intervals on observed data is a much better way to engage statistically with data than ‘black-box’ tests.
Wallis, S.A. 2013. Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods. Journal of Quantitative Linguistics 20:3, 178-208. » Post
Wilson, E.B. 1927. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association 22: 209-212.
We have discussed the Wilson score interval at length elsewhere (Wallis 2013a, b). Given an observed Binomial proportion p = f / n, where f is the frequency observed in a sample of n observations, and a confidence level 1 – α, the interval represents the two-tailed range of values where P, the true proportion in the population, is likely to be found. Note that f and n are integers, so whereas P is a probability, p is a proper fraction (a rational number).
The interval provides a robust method (Newcombe 1998, Wallis 2013a) for directly estimating confidence intervals on these simple observations. It can incorporate a correction for continuity in circumstances where a more conservative test is desired, erring on the side of caution. We have also shown how it can be employed in logistic regression (Wallis 2015).
The point of this paper is to explore methods for computing Wilson distributions, i.e. the analogue of the Normal distribution for this interval. There are at least two good reasons why we might wish to do this.
The first is to shed light on the performance of the generating function (formula), interval and distribution itself. Plotting an interval means selecting a single error level α, whereas visualising the distribution allows us to see how the function performs over the range of possible values for α, for different values of p and n.
A second good reason is to counteract the tendency, common in too many presentations of statistics, to present the Gaussian (‘Normal’) distribution as if it were some kind of ‘universal law of data’, a mistaken corollary of the Central Limit Theorem. This is particularly unwise in the case of observations of Binomial proportions, which are strictly bounded at 0 and 1.
As we shall see, the Wilson distribution diverges from the Gaussian most dramatically as it tends towards the boundaries of the probabilistic range, i.e. where the interval approaches 0 or 1. By contrast, the Normal distribution is unbounded, and continues to plus or minus infinity.
The Wilson score interval (Wilson 1927) may be computed with the following formula.
Wilson score interval (w⁻, w⁺) ≡ (p + z²/2n ± z√(p(1 – p)/n + z²/4n²)) / (1 + z²/n). (1)
Let us first consider cases where P is less than p. At the lower bound of this interval (P = w⁻) the upper bound for the Gaussian interval for P, E⁺, must be equal to p (Wallis 2013a).
We can carry out a test for significant difference between p and P either by checking whether p is greater than the upper bound of the Gaussian interval for P, E⁺, or by checking whether P is less than the lower bound of the Wilson interval for p, w⁻.
To consider cases where P is greater than p, we simply reverse this logic. We test if p is smaller than the lower bound of a Gaussian interval for P, or P is greater than the upper bound of the Wilson interval for p. The Gaussian version of the test is called the single proportion z test. It can also be calculated as a goodness of fit χ² test (Wallis 2013a, b).
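This equivalence can be checked numerically. A sketch, with illustrative function names, testing several candidate values of P below p:

```python
from statistics import NormalDist
import math

def wilson_lower(p, n, alpha=0.05):
    # Lower bound of the Wilson score interval, Equation (1)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return (p + z*z/(2*n) - z*math.sqrt(p*(1-p)/n + z*z/(4*n*n))) / (1 + z*z/n)

def gaussian_upper(P, n, alpha=0.05):
    # Upper bound of the Gaussian interval about P
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return P + z * math.sqrt(P * (1 - P) / n)

# p exceeds the Gaussian upper bound for P exactly when
# P falls below the Wilson lower bound for p.
p, n, alpha = 0.3757, 173, 0.05
for P in [0.20, 0.25, 0.30, 0.35]:
    assert (p > gaussian_upper(P, n, alpha)) == (P < wilson_lower(p, n, alpha))
print('equivalent')
```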
As p tends to 0, we obtain increasingly skewed distributions (Figure 3). The interval cannot be easily approximated by a Normal interval, and the sum of the two distributions is decidedly not Gaussian (‘Normal’).
In Figure 3, note how the mean p is no longer the most likely value (mode).
In plotting this distribution pair, the area on either side of p is projected to be of equal size, i.e. it treats as a given that the true value P is equally likely to be above and below p. This is not necessarily true! Indeed we might multiply both distributions by the probability of the prior. But this fact should not cause us to change the plot.
Note how, thanks to the proximity to the boundary at zero, the interval for w⁻ becomes increasingly compressed between 0 and p, reflected by the increased height of the curve.
The tendency of the distribution to resemble an exponential decline on the less bounded side reaches its limit when p = 0 or 1. The ‘squeezed’ interval is uncomputable and simply disappears.
Newcombe, R.G. 1998. Two-sided confidence intervals for the single proportion: comparison of seven methods. Statistics in Medicine 17: 857-872.
Wallis, S.A. 2013a. Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods. Journal of Quantitative Linguistics 20:3, 178-208 » Post
Wallis, S.A. 2013b. z-squared: the origin and application of χ². Journal of Quantitative Linguistics 20:4, 350-378. » Post
Wilson, E.B. 1927. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association 22: 209-212.