Accurate confidence intervals

1. Introduction

There is a growing interest among practising researchers in plotting data with confidence intervals (sometimes termed ‘credible intervals’ or ‘compatibility intervals’) due to their explanatory power. However, many statistics compendia offer extremely limited coverage of confidence interval methods, and cited formulae often involve a common mathematical error.

This issue is of particular concern for linguists. Numerous research problems that linguists address concern discrete alternatives, and engage Binomial or Multinomial statistics, i.e., the statistics of simple choice proportions. However, the treatment of Binomial intervals and other derived properties of tables (see e.g. Zar (2010: 85), Sheskin (2011: 286; 661)) tends to be weak.

In linguistics, a number of additional properties, scores and effect sizes are commonly cited. In Section 3 we show how one can give Gries’s ΔP score (Gries 2013) an interval due to Newcombe (1998b). However, we need a general algebraic method for calculating confidence intervals for any linguistic property. This is the subject of this paper. We show how Zou and Donner’s (2008) method can be used to create intervals for a wide range of properties, and evaluate intervals computed over differing numerical scales against a Fisher ‘exact’ test.

1.1 The standard error

Common presentations of intervals employ a method termed (asymptotic) standard error. The method is extremely pervasive and frequently-cited, finding its way into tests, algorithms and specialist treatises (e.g. Bishop, Fienberg and Holland 1975).

The model can be expressed simply as follows: given an observation of a variable x, assume that variation, scaled by a standard deviation S(x), is Normally distributed about x.

standard error interval for x, (e–, e+) = x ± zα/2 · S(x), (1)

where zα/2 is the two-tailed standard Normal deviate for an error level α.1 For the purposes of computing an interval, α is constant (say, 0.01 or 0.05), so zα/2 is treated as a constant.
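As a minimal sketch (not from the paper), Equation (1) can be computed as follows; the function name and the hard-coded critical value are illustrative assumptions:

```python
def standard_error_interval(x, s_x, z=1.959964):
    """Symmetric 'standard error' interval (Equation 1): x ± z · S(x).

    z defaults to the two-tailed standard Normal deviate for alpha = 0.05.
    """
    return (x - z * s_x, x + z * s_x)

# Example: an observation x = 0.5 with standard deviation S(x) = 0.1
lower, upper = standard_error_interval(0.5, 0.1)
print(round(lower, 4), round(upper, 4))  # 0.304 0.696
```

Note that the interval is symmetric about x by construction, which is the root of the problems discussed below.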

This standard deviation term may measure the variation of observed values within a sample, within-sample standard deviation, s(x). Here the interval models the scatter (‘reference range’) of the values observed within a sample.

However, inferential statistics concerns the sampling of an observation x from a population with true value X. Here we are interested in the standard deviation of sample means. In plain English, we identify a standardised estimate of the variability of observed averages (means) when they are sampled. Such a mean might be Real (e.g. the mean pitch of n utterances), Interval (the mean length of n phrases), or Binomial (the proportion of n clauses, phrases or words with a particular feature). In this paper we will focus on Binomial intervals predicting a population proportion P. These intervals have the greatest utility to linguists. Binomial proportions may represent linguistic alternation rates, or observed rates derived from multiple choices, such as semasiological shares (Wallis 2021a: 77) or standardised type-token ratios.

Engaging a mathematical model relies on making assumptions (requirements) that the data conforms to certain parameters. In the case of the Binomial model, these include (i) sampled instances should be drawn independently and randomly from the population, and (ii) instances must be free to vary, so that proportions (rates) can range from 0 to 100%. We also assume (iii) that the population is infinite, or much larger than the sample.2

Equation (1) above assumes that the probability distribution function of the error is Normal (Gaussian) and symmetric. For small Real or Interval samples, zα/2 is replaced by the equivalent critical value of the (symmetric) t-distribution.

However, a symmetric interval is incompatible with bounded variables. Suppose x is an observed Binomial proportion, p. This property is bounded by the probabilistic scale P = [0, 1]. According to the standard error model, the desired confidence interval is p ∈ (p–, p+) = p ± zα/2 · S(p).

However, phenomena that linguists study are often rare. For example, the word somebody in the written component of the British Component of the International Corpus of English (ICE-GB, Nelson, Wallis and Aarts 2002) is infrequent (4 cases in 423,581 words), but its alternate, someone, is less so (82 cases). Written data is subdivided into print and non-print sources. In the non-printed subcorpus (Table 1), we find zero examples of somebody, i.e., p = 0.

written        somebody f    someone    total n    proportion p
non-printed             0         24         24          0.0000
printed                 4         58         62          0.0645

Table 1. Alternation of somebody/someone, printed and non-printed ICE-GB subcorpora.

One of the following statements must be true.

  1. S(p) = 0. The interval has zero width, p– = p = p+, and is thus symmetric. But this means the observation has no error. It is falsely certain.
  2. S(p) > 0. The error interval has a non-zero width. The lower bound p– < 0. It ‘overshoots’. The model says there is a 50% chance that the true proportion, P < 0, which is also impossible.

Wallis (2021a: 297) terms this presumption of Gaussian uncertainty the ‘Normal fallacy’. A symmetric interval on a bounded variable cannot be correct.

1.2 Population sampling intervals and confidence intervals

The conventional model used in χ2 and z tests employs the Normal approximation to the Binomial population proportion P. This gives us a legitimate, albeit approximate, Normal interval about P.

population standard deviation S(P) ≡ √(P(1 – P)/n), and (2)
Gaussian interval (E–, E+) = P ± zα/2 · S(P).

This population interval identifies the range of values within which a sampled proportion p is expected to be found, at a given error level α, for a known population proportion P.

We may engage in ‘what if’ reasoning. Suppose we thought the true rate of the somebody/someone alternation in printed texts of the kind found in ICE-GB was 15 in 100, i.e. P = 0.15. In a sample of 62 cases, the observed proportion should fall within (E–, E+) = 0.15 ± 0.0889 = (0.0611, 0.2389) at the α = 0.05 error level, i.e. on 19 out of 20 sampling attempts.

We can now compare our observed rate, p = 0.0645, with this interval. It is within the range, so the observed p is not significantly different from the hypothesised rate P.
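This ‘what if’ calculation can be sketched in a few lines (an illustrative sketch, not code from the paper):

```python
from math import sqrt

def gaussian_interval(P, n, z=1.959964):
    """Normal approximation interval about a population proportion P (Equation 2)."""
    S = sqrt(P * (1 - P) / n)          # population standard deviation
    return (P - z * S, P + z * S)

# 'What if' P = 0.15, n = 62 (somebody/someone in printed ICE-GB texts)
E_lo, E_hi = gaussian_interval(0.15, 62)
print(round(E_lo, 4), round(E_hi, 4))  # 0.0611 0.2389

# The observed p = 0.0645 falls inside (E-, E+): not significantly different
print(E_lo <= 0.0645 <= E_hi)  # True
```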

This model is not perfect. If P = 0 or 1 the interval width becomes zero. It overshoots near the boundary (e.g., for P = 0.01 and n = 62, E– = –0.0148). Fortunately, applying Yates’s correction for continuity (Yates 1934) conservatively compensates for both problems, plus the ‘smoothing error’ created by the approximation of the discrete Binomial distribution by a continuous Normal curve. The interval is moved away from P by half an instance on either side.

Yates’s Gaussian interval (Ecc–, Ecc+) = P ± (zα/2 · S(P) + 1/(2n)). (3)

Equations (2) and (3) are employed in the z test for the single proportion (Wallis 2013) to compare an observed proportion, p, with an expected one, P. They perform identically to their equivalent 2 × 1 χ2 goodness of fit test (see Equation (12), below).
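Continuing the sketch above, the continuity-corrected interval of Equation (3), P ± (z · S(P) + 1/(2n)), widens the interval by half an instance on each side (again an illustrative sketch, not the paper’s code):

```python
from math import sqrt

def yates_interval(P, n, z=1.959964):
    """Yates-corrected Gaussian interval (Equation 3): P ± (z · S(P) + 1/(2n))."""
    S = sqrt(P * (1 - P) / n)
    cc = 1 / (2 * n)                   # continuity correction: half an instance
    return (P - (z * S + cc), P + (z * S + cc))

# Same example: P = 0.15, n = 62; the corrected interval is slightly wider
print(tuple(round(b, 3) for b in yates_interval(0.15, 62)))  # (0.053, 0.247)
```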

However, population intervals have limited utility. Usually we do not know P. Instead, we wish to predict the most likely range of values of P based on the observed rate, p, and the error level, α. We need a confidence interval for p.3

For decades, students wishing to create confidence intervals were directed to employ Equations (2) or (3), but substitute p for P. Thus the following would be used in place of (2).

observed standard deviation S(p) ≡ √(p(1 – p)/n), and (4)
Wald interval (e–, e+) = p ± zα/2 · S(p).

In one statistics reference after another, we see standard error or ‘Wald’ intervals quoted. But they obtain results inconsistent with their equivalent z or χ2 test.

Consider our earlier example where P = 0.15 and n = 62. Substituting p = 0.0645 into Equation (4), we obtain the interval (e–, e+) = (0.0034, 0.1256), which excludes P = 0.15. The Wald interval has given us a different result!
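The inconsistency can be demonstrated directly (an illustrative sketch; p is computed exactly as 4/62, so the printed bounds differ from the rounded figures in the text by less than 0.001):

```python
from math import sqrt

def wald_interval(p, n, z=1.959964):
    """'Wald' interval (Equation 4): Equation (2) with observed p substituted for P."""
    s = sqrt(p * (1 - p) / n)          # observed standard deviation
    return (p - z * s, p + z * s)

p, n, P = 4 / 62, 62, 0.15
e_lo, e_hi = wald_interval(p, n)
print(round(e_lo, 3), round(e_hi, 3))  # 0.003 0.126

# Inconsistency: the Wald interval about p excludes P = 0.15, even though
# the Gaussian interval about P = 0.15 (Equation 2) includes p.
print(e_lo <= P <= e_hi)  # False
```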

This interval has zero-width behaviour for p = 0 or 1, and overshoots near the boundary. These problems are conventionally addressed by the ‘3-sigma rule’, which rules out Equation (4) if p ± 3S(p) exceeds [0, 1] (i.e., for small samples and proportions close to 0 or 1). Yet the principal utility of inferential statistics concerns small samples, and many fields, linguistics included, contend with low-frequency terms.

But arguably the worst problem is that the Wald interval obtains results inconsistent with the Gaussian model (Equation (2)), i.e., it rules results ‘significant’ when the equivalent z test does not, and vice-versa. This, we believe, is the source of the historic low status of confidence intervals. Without a method for computing a confidence interval consistent with the equivalent significance test, confidence intervals cannot be ‘proper’ statistics.

However, if we can address this problem, then confidence intervals become very powerful. Plotted intervals are visually intuitive and permit us to contrast observations on the same scale by eye without performing significance tests. See Figures 10 and 11.

This paper is set out as follows. In the next section we discuss the interval equality principle, which defines an interval by inverting an equivalent test procedure, guaranteeing consistency with the test. In Section 3 we introduce difference intervals and tests, and in Section 4 we generalise both approaches with mathematical functions and operators. We derive intervals for effect sizes in Section 5, with linguistic examples. Section 6 is the conclusion.

Excerpt

4. Confidence intervals for other properties

Although Binomial proportions are ubiquitous in linguistic research problems, we often wish to compute intervals for other properties.

4.1 Functions of the Binomial proportion

We can simply obtain confidence intervals for monotonic functions of p (Wallis 2021a: 175). Monotonic functions always either increase or decrease over the parameter’s range and have a unique solution when inverted.

For any function fn(p) of a Binomial proportion p monotonic over P = [0, 1], we may define a transformed Wilson score interval as

transformed Wilson (wt–, wt+) = { (fn(w–), fn(w+)) if fn increases with p, or
                                  (fn(w+), fn(w–)) otherwise. (20)

For example, the logit (log odds) function, logit(p) ≡ ln(p / (1 – p)), is monotonic and increasing, so the logit Wilson interval is simply (logit(w–), logit(w+)). The reciprocal function, 1/p, monotonically decreases, so its interval, (1/w+, 1/w–), has interval bounds reversed. To compute an interval for mean clause length, l̄ = words/clauses, consider p = clauses/words (Wallis 2021a: 171).
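Equation (20) can be sketched as follows. The Wilson score interval is defined in Section 2.1, not reproduced in this excerpt; the sketch below uses the standard Wilson formula, and the function names are illustrative assumptions:

```python
from math import sqrt, log

def wilson_interval(p, n, z=1.959964):
    """Standard Wilson score interval (w-, w+) for an observed proportion p."""
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (centre - spread, centre + spread)

def logit(p):
    """Log odds: ln(p / (1 - p))."""
    return log(p / (1 - p))

def transformed_wilson(fn, p, n, increasing=True):
    """Equation (20): map the Wilson bounds through a monotonic function fn,
    swapping the bounds if fn decreases with p."""
    w_lo, w_hi = wilson_interval(p, n)
    return (fn(w_lo), fn(w_hi)) if increasing else (fn(w_hi), fn(w_lo))

p, n = 0.3, 10
print(transformed_wilson(logit, p, n))                          # logit Wilson
print(transformed_wilson(lambda x: 1 / x, p, n, increasing=False))  # reciprocal, bounds reversed
```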

Probability density distributions for selected functions of p ∈ {0.1, 0.3, 0.5} with n = 10 are shown in Figure 8. Exceptionally, the logit Wilson is symmetric and approximately ‘Normal’ (Wallis 2021a: 307), but note in passing how the others have very different distributions!

Intervals subject to non-monotonic transforms require us to identify turning points (local minima or maxima). Suppose that a is a turning point within an interval (w–, w+). The lower bound is simply min(fn(w–), fn(w+), fn(a)), and the upper bound is the maximum of the same sequence. See Section 5.1.
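A sketch of this turning-point rule, using fn(p) = (p – 0.5)², a hypothetical non-monotonic function with a minimum at a = 0.5:

```python
def nonmonotonic_interval(fn, w_lo, w_hi, turning_points):
    """Interval for fn over (w-, w+) when fn has turning points inside it:
    take the min and max of fn evaluated at both bounds and at every
    turning point that falls within the interval."""
    candidates = [fn(w_lo), fn(w_hi)]
    candidates += [fn(a) for a in turning_points if w_lo < a < w_hi]
    return (min(candidates), max(candidates))

# If the interval straddles the turning point at 0.5, the lower bound of
# fn(p) = (p - 0.5)^2 is fn(0.5) = 0, not the value at either endpoint.
lo, hi = nonmonotonic_interval(lambda p: (p - 0.5) ** 2, 0.4, 0.7, [0.5])
print(round(lo, 4), round(hi, 4))  # 0.0 0.04
```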

Figure 8. Unit probability density distributions of selected functions of Binomial proportions, p ∈ {0.1, 0.3, 0.5}, n = 10, tail areas at α = 0.05 are labelled for p = 0.3. From upper left, clockwise: natural logarithm ln(p), reciprocal 1/p, square p2 and logit(p), which is approximately Normal.

Contents

  1. Introduction
    1.1 The standard error
    1.2 Population sampling intervals and confidence intervals
  2. The interval equality principle
    2.1 The Wilson score interval
    2.2 Adjusting the formula using functional notation
    2.3 Obtaining intervals by search
    2.4 Performance
  3. Difference intervals and 2 × 2 tests
    3.1 A chi-square based interval
    3.2 The Newcombe-Wilson interval
    3.3 Performance
  4. Confidence intervals for other properties
    4.1 Functions of the Binomial proportion
    4.2 Functions of two or more independent proportions
    4.3 Performance
    4.4 Analytic reduction
  5. Effect sizes and meta-tests
    5.1 Unweighted goodness of fit ϕp
    5.2 Cramér’s 2 × 2 ϕ
    5.3 Meta-tests for differences between scores
  6. Conclusions
    References

References

Bishop, Y.M.M., Fienberg, S.E. & Holland, P.W. 1975. Discrete Multivariate Analysis: Theory and Practice. Cambridge, MA: MIT Press.

Gries, S.Th. 2013. 50-something years of work on collocations: What is or should be next… International Journal of Corpus Linguistics 18(1), 137-165.

Nelson, G., Aarts, B. & Wallis, S.A. 2002. Exploring Natural Language: Working with the British Component of the International Corpus of English. Varieties of English Around the World series. Amsterdam: John Benjamins.

Newcombe, R.G. 1998b. Two-sided confidence intervals for the single proportion: comparison of seven methods. Statistics in Medicine 17, 857-872.

Sheskin, D.J. 2011. Handbook of Parametric and Nonparametric Statistical Procedures (5th ed.). Boca Raton, FL: CRC Press.

Wallis, S.A. 2013a. Binomial confidence intervals and contingency tests. Journal of Quantitative Linguistics 20(3), 178-208.

Wallis, S.A. 2013b. z-squared: The origin and application of χ2. Journal of Quantitative Linguistics 20(4), 350-378.

Wallis, S.A. 2021a. Statistics in Corpus Linguistics Research: a new approach. New York and Abingdon: Routledge.

Zar, J.H. 2010. Biostatistical analysis (5th ed.). Upper Saddle River, NJ: Prentice Hall.


Notes

1. Sometimes, unhelpfully, the term ‘standard error’ is used as a substitute term for ‘standard deviation’.

2. Assumption (ii) means that ‘per million word’ rates are unlikely to be Binomial proportions for most sampled linguistic phenomena. Alternative baselines should be considered (Wallis 2021a: 47). For text corpus samples, assumption (i) can be addressed by a ‘random-text sampling’ correction (Wallis 2021a: 277).

3. Confidence intervals should not be confused with ‘replication’ or ‘resampling’ intervals (Wallis 2020). A confidence interval predicts the range of values of the true mean P, whereas a resampling interval predicts the range of the mean of a second sample.

