corp.ling.stats

Summer School in English Corpus Linguistics 2024 (online)

Sean — Tue, 30 Apr 2024 12:06:50 +0000

Have you begun research with corpora, but have been unsure what to do next?
Have you heard about parsed corpora (treebanks) but wondered how you might use them in your research?
Do you want a practical primer in statistics?

I’m pleased to announce the eleventh UCL Summer School in English Corpus Linguistics, our masterclass in research with parsed corpora, which is taking place online from 1-3 July. It is timed to run from 9:00 to 13:30 British Summer Time (GMT+1), to make it accessible for students across Europe, Africa, Asia and Australasia.

The Summer School is a short three-day intensive course aimed at PhD-level students and researchers who wish to get to grips with Corpus Linguistics from the perspective of the ‘Survey Methodology’. It is offered at £165 for early bookings made before 14 May, rising to £195 after.

This year we are innovating our programme, including a new session on World Englishes with Guyanne Wilson.

This is a picture of our face-to-face teaching on the course, which we would love to return to! But with Covid-19 a continuing threat worldwide, and for reasons of accessibility and cost, we have decided to run the course online for another year.

Aims and objectives of the course

Over the course of the three days, participants learn about the following:

the scope of Corpus Linguistics, and how we can use it to study the English Language;
key issues in Corpus Linguistics methodology;
how to use corpora to analyse issues in syntax and semantics;
basic elements of statistics;
how to navigate large and small corpora, particularly ICE-GB and DCPSE.

Learning outcomes

At the end of the course, participants should have:

acquired a basic but solid knowledge of the terminology, concepts and methodologies used in English Corpus Linguistics;
had practical experience working with two state-of-the-art corpora and a corpus exploration tool (ICECUP);
gained an understanding of the breadth of Corpus Linguistics and the potential application for projects;
learned about the fundamental concepts of inferential statistics and their practical application to Corpus Linguistics.

What it costs

Attendance fee: £165 until May 14; £195 afterwards.
Corpus and software:
- temporarily accessible for all participants (no separate purchase necessary)
- choice of free corpus to all students, £25 for both corpora
- Special Offer of 25% standard price for all attendees

Continuity correction for risk ratio and other intervals

Sean — Sat, 02 Mar 2024 12:03:32 +0000

Introduction

In An algebra of intervals, we showed that we can calculate confidence intervals for formulae composed of common mathematical operators, including powers and logarithms. We employed a method proposed by Zou and Donner (2008), itself an extension of Newcombe (1998). Wallis (forthcoming) describes the method more formally.

However, Newcombe’s method is arguably better-founded mathematically than that of Zou and Donner, who make an additional assumption. They assume that the number scale on which two properties are distinguished is not material to the quality of the resulting interval.

Why might this assumption be problematic? Well, when we compute a difference interval with Newcombe’s method, we do so by summing squared inner interval widths. These are equal to independent variance terms (multiplied by a constant, the critical value of the Normal distribution z_α/2), which are Normal at inner bounds. So far, so good. However, if such an interval is transformed onto a different number scale, but the same summation-of-variance (Bienaymé) method is then employed — Zou and Donner’s method — we are now summing terms which are by definition no longer Normal!

I was suspicious of this assumption, which seemed to me to be optimistic at best, and I was concerned to evaluate it computationally. The method I used was as follows.

Perform the same inner interval calculation for every potential value of two proportions, p₁ and p₂, over a range of sample sizes (1 to 200). This interval can be treated as a significance test equivalent to the exact Fisher test (evaluating if p₁ is significantly different from p₂). Thus, for a difference d = p₂ – p₁, if the resulting interval for d includes 0, the result is not significant. For a ratio, e.g. r = p₁/p₂, if the interval includes 1, the result is not significant.
We then compared the result of the two tests: our new test and Fisher.
If there is a discrepancy in the outcome, it will be of one of two types:
1. Type I errors (our test was improperly deemed significant) and
2. Type II errors (our test was improperly deemed non-significant, i.e. it failed to detect a significant result according to Fisher).
To properly account for the chance of observing a particular pair of proportions, each error is weighted by Fisher scores before being summed.
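As a minimal illustration of the interval-as-test idea in the steps above, the following Python sketch computes a Newcombe-Wilson interval for d = p₂ – p₁ by summing squared inner interval widths, and treats the result as significant if the interval excludes 0. The function names are hypothetical, and the sketch deliberately omits the comparison with Fisher and the Fisher weighting, which are not specified in enough detail here to reproduce.

```python
from math import sqrt
from statistics import NormalDist


def wilson(p, n, alpha=0.05):
    """Return (lower, upper) Wilson score bounds for proportion p with sample size n."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value of the Normal
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def newcombe_wilson_diff(p1, n1, p2, n2, alpha=0.05):
    """Interval for d = p2 - p1, summing the squared inner interval widths."""
    l1, u1 = wilson(p1, n1, alpha)
    l2, u2 = wilson(p2, n2, alpha)
    d = p2 - p1
    lower = d - sqrt((p2 - l2) ** 2 + (u1 - p1) ** 2)
    upper = d + sqrt((u2 - p2) ** 2 + (p1 - l1) ** 2)
    return lower, upper


def is_significant(p1, n1, p2, n2, alpha=0.05):
    """True if the interval for d excludes zero."""
    lower, upper = newcombe_wilson_diff(p1, n1, p2, n2, alpha)
    return not (lower <= 0 <= upper)
```

An exhaustive evaluation would call is_significant for every pair of observed frequencies at each sample size and compare the outcome with a Fisher test on the same table.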

This method evaluates the inner (mesial) interval close to the middle of the range. It does not evaluate the same interval for non-zero points, or for the outer interval. But unlike Monte Carlo methods, it is exhaustive.

What I found partly supported my suspicions. There was indeed an additional error cost introduced by these approximations, and this error differed by number scale (or, by the formula, which amounts to the same thing). The graph below demonstrates the scale of the issue. If we aim for α = 0.05 but then compute an interval with an additional Type I error ε of 0.03, this additional error is not negligible!

All of these interpolated intervals, including Newcombe’s for d, obtain detectable errors, but there is some good news. We observed that employing a continuity correction reduces the scale of those errors.

Figure 1 shows an example plot obtained by this method (taken from a recent blog post). This includes computations for simple difference d, Cohen’s h, risk and odds ratios, and logarithm, each of which performs Zou and Donner’s difference calculation on a different number scale.


Figure 1. Difference d, Cohen’s h, odds, risk and log ratios evaluated by Fisher-weighted error rates for Type I errors, against the Fisher ‘exact’ test, computed for values of n₁ = n₂ ∈ {1, 2,… 200}, α = 0.05, with equal-sized samples.

One can make a number of observations about this graph: the saw-tooth behaviour, the ordering of intervals by performance, and so on. But if we want to minimise Type I errors (where we wrongly assess a non-significant difference as ‘significant’), this graph reveals that employing a continuity correction suppresses them.

Our previous evaluations showed that for unequal-sized samples, where n₁ = 5n₂, we tended to see a lower overall error rate (this is not quite correct for χ²). See also Table 1 below. The increased sample size for p₁ (amounting to 3 times the data in the table overall) means that the discrete Fisher is smoother, and therefore the ‘smoothing correction’ aspect of the continuity correction is less necessary. But there remains an error.

Continuity corrections, reprised

Statistical advice has tended to be rather ambivalent about continuity corrections (Yates, 1934).

In part, this is due to suspicion about so-called ‘exact’ tests. It is argued that they are not always beneficial, as they will err on the side of caution given a particular error level. The term ‘exact’ refers to the fact that an exact error rate for an observation given an expected distribution may be obtained. It does not mean that a test is exact: on the contrary, it means that it is conservative.

In my book (Wallis 2021: 106), I give a couple of worked examples where a target tail area of α/2 = 0.025 yields an exact Binomial population interval about P with an upper tail area of 0.0106 and 0.0115 respectively, i.e. around half the target error level.

But if we want to rely on intervals for the purpose of empirical evaluation, we should err on the side of caution when employing confidence intervals too.

Zou and Donner’s method is not dependent on which underlying ‘good coverage interval’ is employed. So we can substitute a continuity-corrected Wilson score interval for the standard one. For example, to compute a continuity-corrected risk ratio interval for r = p₁/p₂ we calculate the continuity-corrected Wilson intervals for p₁ and p₂ respectively, and apply the risk ratio formula as usual.
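The substitution can be sketched in Python as follows. This is only an illustration under stated assumptions: it uses the half-unit shift 1/(2n) for the continuity correction, and it assumes the log-scale form of Zou and Donner's calculation for a ratio, which may differ in detail from the risk ratio formula used elsewhere on this blog. The names wilson_cc and risk_ratio_cc, and the small floor delta that avoids taking log(0), are hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def wilson_cc(p, n, alpha=0.05):
    """Continuity-corrected Wilson bounds: shift p outwards by 1/(2n) first."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    c = 1 / (2 * n)

    def bound(q, sign):
        denom = 1 + z * z / n
        centre = (q + z * z / (2 * n)) / denom
        half = z * sqrt(q * (1 - q) / n + z * z / (4 * n * n)) / denom
        return centre + sign * half

    # Assumption: clamp to [0, 1] when the shifted proportion leaves the range.
    lower = 0.0 if p - c <= 0 else bound(p - c, -1)
    upper = 1.0 if p + c >= 1 else bound(p + c, +1)
    return lower, upper


def risk_ratio_cc(p1, n1, p2, n2, alpha=0.05, delta=1e-9):
    """Interval for r = p1/p2, combining inner widths on the log scale (assumed form)."""
    l1, u1 = wilson_cc(p1, n1, alpha)
    l2, u2 = wilson_cc(p2, n2, alpha)
    # Floor zero values at a small delta so that logarithms are defined.
    p1, p2, l1, l2 = (max(x, delta) for x in (p1, p2, l1, l2))
    log_r = log(p1) - log(p2)
    lower = exp(log_r - sqrt((log(p1) - log(l1)) ** 2 + (log(u2) - log(p2)) ** 2))
    upper = exp(log_r + sqrt((log(u1) - log(p1)) ** 2 + (log(p2) - log(l2)) ** 2))
    return lower, upper
```

If the resulting interval includes 1, the ratio would be treated as not significantly different from 1.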

Conventionally, a continuity correction is an adjustment to the Normal approximation to the Binomial distribution for a proportion p that moves the best estimate for p out by half a unit, i.e. c = 1/(2n). We apply this adjustment to each interval calculation separately.

This has two well-known benefits:

For small n, this rounding adjustment pushes the Normal ‘envelope’ out conservatively to incorporate the discrete Binomial interval.
For skewed p, the adjustment absorbs errors due to the skewed distribution of the Binomial.

In Correcting for continuity, I derived an efficient and decomposable formula for adding a continuity correction to the Wilson score interval for p. The method allows us to treat the continuity correction as a distinct element in the formula.

Essentially, we define Wilson functions for the lower and upper bounds of the Wilson score interval, like this:

w⁻ = WilsonLower(p, n, α/2),
w⁺ = WilsonUpper(p, n, α/2).

The continuity-corrected versions of these intervals are then obtained by moving p half a unit outwards,

w⁻_cc = WilsonLower(p – 1/(2n), n, α/2),
w⁺_cc = WilsonUpper(p + 1/(2n), n, α/2).
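A minimal Python sketch of these four functions is given below, assuming the standard Wilson score formula and a tail area argument of α/2 as in the definitions above. The snake_case names are hypothetical stand-ins for WilsonLower and WilsonUpper, and the clamping of the corrected bounds to 0 and 1 when the shifted proportion leaves [0, 1] is an assumption of this sketch.

```python
from math import sqrt
from statistics import NormalDist


def _wilson_bound(p, n, tail, sign):
    """Shared Wilson score calculation; sign = -1 for lower, +1 for upper."""
    z = NormalDist().inv_cdf(1 - tail)  # critical value for a tail area of alpha/2
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre + sign * half


def wilson_lower(p, n, tail):
    return _wilson_bound(p, n, tail, -1)


def wilson_upper(p, n, tail):
    return _wilson_bound(p, n, tail, +1)


def wilson_lower_cc(p, n, tail):
    """Lower bound after moving p half a unit downwards."""
    shifted = p - 1 / (2 * n)
    return 0.0 if shifted <= 0 else wilson_lower(shifted, n, tail)


def wilson_upper_cc(p, n, tail):
    """Upper bound after moving p half a unit upwards."""
    shifted = p + 1 / (2 * n)
    return 1.0 if shifted >= 1 else wilson_upper(shifted, n, tail)
```

For a 95% interval one would call, for example, wilson_lower_cc(p, n, 0.025) and wilson_upper_cc(p, n, 0.025).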

This reformulation is elegant. In my book, and in other posts on this blog, I show how a correction for continuity may be applied in conditions where the sample size might need to be independently re-weighted, either to account for a small finite population, or to address the common situation in corpus linguistics where data is drawn from a random sample of texts, but where multiple instances are drawn from the same text or subtext, produced by the same person, etc., rather than being randomly obtained from a population of utterances.

In this blog post, I will show a further advantage of this reformulation. It makes it very easy to scale the correction factor itself.

Is greater correction required?

Given that we are seeing additional errors due to the difference theorem being applied on different number scales, the obvious question is should we employ a more conservative approach when computing intervals in these cases?

We will consider the impact of increasing the correction factor c = 1/(2n). This reduces the number of Type I errors at the expense of more Type II errors, i.e. it will make the resulting interval more conservative.

Note that even for large n, where corrections for continuity are generally considered to be unnecessary, Figure 1 is evidence of a benefit in terms of reduced Type I errors. An excess error rate of over 0.01 for an assessment ostensibly against α = 0.05 is not negligible.

A simple initial assessment summing the errors across all 200 cases shows that this method is likely to be fruitful. If we double c, excess Type I errors are all-but eliminated.

	χ²	NW	Cohen’s h	risk ratio	logarithm	odds ratio
n₁ = n₂
no c.c.	3.8476	3.3051	3.8610	4.0109	4.2974	4.3852
c.c.	0.0000	0.5780	0.7589	0.8584	1.3264	1.4061
2 × c.c.	0.0000	0.0000	0.0237	0.0950	0.0950	0.0950
n₁ = 5n₂
no c.c.	3.4533	3.1207	3.2112	3.3081	3.4745	3.4975
c.c.	0.4016	0.7377	0.7744	0.7944	0.7880	0.8090
2 × c.c.	0.0000	0.0000	0.0000	0.0118	0.0300	0.0300

Table 1. Total Type I error rates, summed for n₂ ∈ {1, 2,… 200}, α = 0.05.

What is the optimum adjustment? We can visualise the trade-off between Type I and II errors for different continuity correction scale factors by simply multiplying c = 1/(2n) by a scale factor, γ, and repeating the evaluation. We plot mean excess mesial errors, ε̄, that is, mean additional errors for the mesial (inner) interval.

In Figure 2, γ = 0 refers to no correction. Yates’s conventional continuity correction is where γ = 1 (centre).
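To make the role of γ concrete, here is a short Python sketch of a Wilson interval whose continuity correction is scaled by γ, so that γ = 0 gives the uncorrected interval and γ = 1 the conventional half-unit correction. The function name wilson_scaled_cc and the clamping to [0, 1] are assumptions for illustration, and the example values p = 0.3, n = 20 are arbitrary.

```python
from math import sqrt
from statistics import NormalDist


def wilson_scaled_cc(p, n, alpha=0.05, gamma=1.0):
    """Wilson interval with the continuity correction 1/(2n) multiplied by gamma."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    c = gamma / (2 * n)  # scaled correction term

    def bound(q, sign):
        denom = 1 + z * z / n
        centre = (q + z * z / (2 * n)) / denom
        half = z * sqrt(q * (1 - q) / n + z * z / (4 * n * n)) / denom
        return centre + sign * half

    # Assumption: clamp when the shifted proportion falls outside [0, 1].
    lower = 0.0 if p - c <= 0 else bound(p - c, -1)
    upper = 1.0 if p + c >= 1 else bound(p + c, +1)
    return lower, upper


# The interval widens as gamma increases.
for gamma in (0.0, 1.0, 1.5):
    lo, hi = wilson_scaled_cc(0.3, 20, gamma=gamma)
    print(f"gamma = {gamma}: ({lo:.4f}, {hi:.4f})")
```

Because the correction is a distinct term, changing γ requires no other change to the interval calculation.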

Figure 2. Evaluating the trade-off between mean Type I and Type II errors, with a continuity correction factor γ. Errors are evaluated against a Fisher test, with Fisher-weighted errors on equal-sized samples, n₁ = n₂ ∈ {1, 2,… 200}, α = 0.05.

This graph offers us a ready-reckoner for different number scales.

We might aim for a similar level of performance as Yates’s χ² test. This is conservative, minimising Type I errors but permitting a small number of Type II errors.

To achieve comparable performance, we must increase γ to 1.5 for every evaluation except the odds ratio, where we might select γ = 1.75. (This ‘bump’ is due to an anomalous high-cost error at n₂ = 1, so we might still consider γ = 1.5 to be acceptable.) Although the logarithm function, log_p₂(p₁), appears to be more resistant to Type I error elimination than all others apart from the odds ratio, the mean error reaches 0.0008 for γ = 1.5.

Revealingly, this evaluation shows a very similar benefit for the Newcombe-Wilson difference interval, for which we had previously accepted an additional Type I mean error rate of around 0.0025 with the continuity-corrected interval (γ = 1). A recommendation to increase γ for this interval also makes sense.

Lest we be overly critical, applying Newcombe’s method to the difference between proportions with exact Clopper-Pearson intervals is not optimal either. The mean Type I error of 0.0047 for equal-sized data (a total error rate of 0.9334) is twice the mean error found with the conventionally continuity-corrected Wilson interval. See Further evaluation of Binomial confidence intervals. The Clopper-Pearson interval is exact — it computes a continuous interval where the error rate is precisely α/2 — but the act of combining intervals for each proportion generates errors. By contrast, the conservatism of the ordinary continuity correction with the Wilson interval absorbs almost half of these errors!

A correction factor of γ = 1.5 effectively eliminates mesial Type I errors for Newcombe’s (1998) difference interval in the equal-sample case. The interval is a little more conservative, but still well-behaved, as we can see in Figure 8 below. Errors identified for small samples disappear.

Figure 3 performs the same evaluation for unequal-sized samples where n₁ = 5n₂, i.e. where n₁ ranges from 5 to 1,000. We can see that the overall mean rate tends to be lower, but Type I errors are more difficult to eliminate altogether, becoming negligible (at the expense of increased Type II errors) where γ = 2. Nonetheless, at γ = 1.5, the risk of additional Type I errors has fallen to around 0.001 (outperforming Yates’s test at the expense of more Type II errors), and at γ = 1.75, the rate is less than 0.0005.

The continuity-corrected χ² test outperforms the other, independent interval difference formulae, although Yates’s test (γ = 1) is not error-free. On the other hand, the performance of Zou and Donner’s method over different interval scales seems to be remarkably consistent.

Figure 3. Evaluating the trade-off between mean Type I and Type II errors, with a continuity-correction factor γ. Errors are evaluated against a Fisher test, with Fisher-weighted errors on unequal-sized samples, n₁ = 5n₂, n₂ ∈ {1, 2,… 200}, α = 0.05.

Substituting α = 0.01 has a comparable performance benefit, and the proposal to increase γ to 1.5 is robust. See below.

These mean excess mesial error rates are, of course, means. We can also examine the particular error rate for any one of these evaluations over sample size, as in Figure 1. For example, Figure 4 shows the effect of increasing γ on the likelihood of the Newcombe-Wilson difference interval obtaining a significant result when the Fisher test does not, for samples of equal size.

Figure 4. The effect of varying the degree of correction γ on the Newcombe-Wilson difference test (d ≠ 0) when compared to the Fisher ‘exact’ test, evaluated by Fisher-weighted error rates for Type I errors, for values of n₁ = n₂ ∈ {1, 2,… 200}, α = 0.05, with equal-sized samples.

Note: Some high-penalty errors for small n may be eliminated by checking whether the continuity-corrected Wilson centres overlap. For example, if p₁ > p₂, we could check if p₁ – γ.c₁ < p₂ – γ.c₂ (where c₁ and c₂ are the respective continuity corrections). But this is beside the point. For this exercise we are not optimising the test, but evaluating the performance of the interval.

Varying α

Thus far we have focused on an error rate of α = 0.05 and considered the performance of 95% intervals. This is entirely proper: the correct way to approach statistical evaluation is to identify an error rate and stick to it. Some researchers consistently opt for 99% intervals, i.e. where α = 0.01.

The practice of opting for a smaller α in the belief that doing so yields ‘better’ results is based on a common but mistaken assumption about ‘p-values’. In truth, in selecting an error level we are simply exercising a precautionary trade-off. However, there are circumstances where one may reduce the error level α by dividing it by the number of evaluations one wishes to undertake.

So we should confirm that our assessment stands up with smaller error levels α < 0.05.

99% intervals

Figure 5 plots mean excess mesial errors (weighted by a Fisher prior), ε̄, against the continuity correction factor γ, for equal-sized samples and difference-from-zero tests for the same series of difference and ratio formulae, this time with α = 0.01.

Figure 5. Evaluating the trade-off between mean Type I and Type II errors, with a continuity-correction factor γ. Errors are evaluated against a Fisher test, with Fisher-weighted errors on equal-sized samples, n₁ = n₂ ∈ {1, 2,… 200}, α = 0.01.

Again, we see a trade-off at γ = 1.5 and an odds ratio ‘bump’ owing to an error at n₂ = 1 (compare with Figure 2). For unequal samples where n₁ = 5n₂, our analysis obtains Figure 6.


Figure 6. Evaluating the trade-off between mean Type I and Type II errors, with a continuity-correction factor γ. Errors are evaluated against a Fisher test, with Fisher-weighted errors on unequal-sized samples, n₁ = 5n₂, n₂ ∈ {1, 2,… 200}, α = 0.01.

To obtain comparable performance to Yates’s test (grey line at γ = 1), setting γ = 1.5 would suffice (cf. mean errors, Yates: 0.000836 and odds ratio, γ = 1.5: 0.000839), but further improvement is possible.

These graphs with a smaller error level reveal a greater relative difference between functions. The plot lines appear separated out. Indeed, in Figure 3, the Type I error rate with α = 0.05 appears to be little affected by formula, contrary to my initial concern about number scales. But this optimism may have been premature. With a smaller target error level, the different number scales obtain more distinct mean excess mesial error rate performances.

Smaller α

So what happens with even smaller target error levels, α, which might be used in experiments where multiple comparisons are performed? We obtained a mean Type I error rate for a range of error levels from 0.001 to 0.1 and plotted them for different correction factors.


Figure 7. Mean excess Type I mesial errors ε as a fraction of the target error level, α, for Fisher-weighted error levels from 0.001 to 0.1 (log scale).

Figure 7, computed with equal-sized samples, shows that irrespective of α, increasing γ reduces the number of errors, as we would expect. Indeed, expressed as a proportion of the target α, excess error rates are quite stable once we discount small samples n₂ < 5, where a single error can disproportionately impact the mean.

But we can also see what we might describe as a ‘fanning out effect’ for small α. The smaller the error rate α (and the wider the interval), the greater the impact the formula and number scale has on the accuracy of the interval, especially for smaller samples.

These observations do not affect the rationale for increasing γ, however, which remains robust and well-justified. Irrespective of the α level, increasing the correction factor γ to 1.5 suppresses Type I errors (bottom of figure).

Conclusions

Continuity corrections are traditionally applied to significance tests to account for the fact that a Binomial discrete and asymmetric distribution is approximated by a smooth and symmetric Normal one. They increase the coverage of an interval by repositioning the best estimate of each proportion outwards by a factor in proportion to the reciprocal of the sample size.

Frequently, statistics books advise their use in a somewhat ad hoc manner, usually only recommending their use with respect to small samples and simple tests. But this always seemed odd to me. Thus in Wallis (2021: 237-239, 259), I observed that original proportions in more complex meta-tests were no less discrete in this condition than they were in conventional ones.

In this situation we might use a meta-test to compare two runs of the same evaluation. Suppose an original evaluation was performed with Yates’s χ² test (or Fisher’s test). It would make sense to compare the two evaluations for significant difference by also employing a correction for continuity. Why employ a less conservative method for comparing the difference between two differences?

The same reasoning applies to algebraic functions of proportions. Indeed, as we have seen, we should be even more careful.

We have demonstrated that in fact, the benefits of employing a continuity correction extend beyond the small-sample and simple test situation to the task of optimising the performance of derived intervals.

We found that to obtain comparable performance to Yates’s χ² test, we must employ a more conservative correction, multiplying the conventional half-unit constant proposed by Frank Yates, c = 1/(2n), by 1.5.
In the case of the odds ratio, a multiple of 1.75 may be preferred, especially for very small samples.

We developed a ready-reckoner by plotting mean error rates over an increased scale factor γ for equal-sized samples based on the mean Type I error rate for the inner interval. Continuity corrections, traditionally seen as smoothing adjustments, are also capable of being effective adjustments for building in a small amount of additional conservatism when combining intervals.

These intervals are well-behaved, i.e. they perform plausibly across all values on the range, provided we address extreme proportions in ratios with a small delta. See An algebra of intervals.

How does this increased scale factor affect intervals in practice? In Figure 8 we consider the impact of applying such a correction on ratio, product, sum and difference intervals for equal-sized samples where n₁ = n₂ = 10.

As before, we plot operators of p₁ and p₂ with intervals, varying a table across a diagonal interpolation ϕ (we vary p₁ from 0 to 1 and hold p₂ = 1 – p₁).


Figure 8. Plotting intervals with γ = 0 (no correction), 1 (Yates) or 1.5 (Wallis). Although the difference between γ = 1 and 1.5 appears small (except, perhaps, for the ratio), this additional conservatism absorbs Type I errors generated by the Newcombe / Zou and Donner theorem.

References

Newcombe, R.G. (1998). Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine, 17, 873-890.

Wallis, S.A. (2021). Statistics in Corpus Linguistics Research. New York: Routledge.

Wallis, S.A. (forthcoming). Accurate confidence intervals on Binomial proportions, functions of proportions and other related scores. » Post

Yates, F. (1934). Contingency tables involving small numbers and the chi-square test. Journal of the Royal Statistical Society, 1(2), 217-235.

Zou, G.Y. & A. Donner (2008). Construction of confidence limits about effect measures: A general approach. Statistics in Medicine, 27:10, 1693-1702.

Confidence intervals for Cohen’s h

Sean — Wed, 28 Feb 2024 13:35:49 +0000

1. Introduction

Cohen’s h (Cohen, 2013) is an effect size for the difference of two independent proportions that is sometimes cited in the literature. h ranges between minus and plus pi, i.e. h ∈ [–π, π].

Jacob Cohen suggests that if |h| > 0.2, this is a ‘small effect size’, if |h| > 0.5, it is ‘medium’, and if |h| > 0.8 it is ‘large’. This conventional application of effect sizes – as a descriptive method for distinguishing sizes – is widespread.

The score is defined as the difference between the arcsine transforms of the square roots of Binomial proportions p_i for i ∈ {1, 2}, hence the expanded range, ±π.

That is,

h = ψ(p₁) – ψ(p₂), (1)

where the transform function ψ(p) is defined as

ψ(p) = 2 arcsin(√p). (2)
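Equations (1) and (2) can be computed directly. The following is a minimal Python sketch; the function names psi and cohens_h are our own labels for illustration, not part of any library.

```python
import math

def psi(p):
    """Arcsine transform of Equation (2): psi(p) = 2 * arcsin(sqrt(p))."""
    return 2.0 * math.asin(math.sqrt(p))

def cohens_h(p1, p2):
    """Cohen's h of Equation (1): the difference of the transformed proportions."""
    return psi(p1) - psi(p2)

print(round(cohens_h(0.6, 0.4), 4))   # a moderate difference
print(cohens_h(1.0, 0.0) == math.pi)  # the extreme case reaches the limit of the range
```

Swapping the arguments changes the sign of the score, which is why h ranges over [–π, π] rather than [0, π].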

In this blog post I will explain how to derive an accurate confidence interval for this property h. The benefits of doing so are multiple.

We can plot h scores with intervals, so we can visualise the reliability of their estimate, pay attention to the smallest bound, etc.
We can compare two scores, h₁ and h₂, for significant difference. In other words, we can conclude that h₂ > h₁, or vice versa.
We can reinterpret ‘large’ and ‘small’ effects for statistical power.
We can consider whether an inner bound is greater than Cohen’s thresholds. Thus, if h is positive and h^– > 0.5, we can report that the likely population score is at least a ‘medium’ effect.

An absolute (unsigned and non-directional) version of h, written |h|, is sometimes cited. We can also compute intervals for unsigned |h|. We will return to this question in Section 4.

2. Deriving an interval

2.1 Preliminaries: the Wilson score interval

We will use the Wilson score interval on Binomial proportions p at an error level α for our purposes (Wilson 1927). This is written p ∈ (w^–, w⁺), is directly calculable by formula, and has good performance. It may be corrected for continuity and adjusted for finite populations or random-text sampling.

Once corrected for continuity, Wilson’s interval has similar performance to the ‘exact’ Clopper-Pearson interval (Wallis 2013, 2021: 311), which could be substituted into what follows. Other intervals are available, although few outperform the continuity-corrected Wilson interval (Newcombe 1998a, Wallis 2013).

The Wilson score interval is asymmetric. Unless p = 0.5, the interval width on one side of p will not be the same as the interval width on the other.
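To make the asymmetry concrete, here is a minimal Python sketch of the uncorrected Wilson score interval. The function name wilson is hypothetical, and we assume the standard library's statistics.NormalDist for the two-tailed critical value of the Normal distribution. No continuity correction or finite population adjustment is applied.

```python
import math
from statistics import NormalDist

def wilson(p, n, alpha=0.05):
    """Uncorrected Wilson score interval (w_minus, w_plus) for proportion p, sample size n."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed critical value
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - spread, centre + spread

w_minus, w_plus = wilson(0.3, 10)
print(w_minus, w_plus)              # interval bounds
print(0.3 - w_minus, w_plus - 0.3)  # lower and upper widths differ unless p = 0.5
```

For p = 0.3 and n = 10 this gives roughly (0.108, 0.603), so the upper width is noticeably larger than the lower one.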

Figure 1. Wilson score interval about p = 0.3, with n = 10, obtained by inverting the Normal approximation to the Binomial about P, at 95% interval bounds (w^–,w⁺).

2.2 Stage 1. An interval for the transform

In An Algebra of Intervals (and in my book), we noted that we can obtain an interval for any monotonic transformation of a Binomial proportion p by simply applying the same transform function to the interval bounds for p.

The method is as follows.

For a function ψ(p), we first determine if it is monotonic.

If it is monotonic within the range of P = [0, 1], the bounds will be ψ(w^–) and ψ(w⁺).
- If the function increases with p then ψ(w^–) < ψ(w⁺).
- If it is falling, we swap the bounds to place the lower number first.
If the function is not monotonic, it will contain at least one turning point (a local maximum or minimum) where the function changes direction. This complicates matters, but it does not mean that an interval is not computable. For an example of such a function, see The confidence of entropy.

The first step is therefore to examine the behaviour of Equation (2).

It turns out that ψ(p) is a strictly monotonic function over p ∈ P = [0, 1], which means that each value of ψ(p) corresponds to exactly one value of p.

How do we know? We simply compute Equation (2) over P, and observe that the function always increases with increasing p, and has no local maximum or minimum along the range. See Figure 2.

Figure 2. Plotting ψ(p) against p, with confidence intervals ψ(w^–), ψ(w⁺) for n = 10, α = 0.05.

We may now calculate confidence intervals for ψ. Since it is a rising monotonic function, we have, simply:

ψ(p) ∈ (ψ(w^–), ψ(w⁺)). (3)
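Stage 1 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a definitive implementation: psi, wilson and psi_interval are hypothetical names, the Wilson interval is uncorrected, and monotonicity is checked numerically on a grid rather than proved.

```python
import math
from statistics import NormalDist

def psi(p):
    """Equation (2), clamping p to [0, 1] to guard against rounding error."""
    return 2.0 * math.asin(math.sqrt(min(1.0, max(0.0, p))))

def wilson(p, n, alpha=0.05):
    """Uncorrected Wilson score interval (w_minus, w_plus)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - spread, centre + spread

def psi_interval(p, n, alpha=0.05):
    """Equation (3): transform the Wilson bounds with the same function."""
    w_minus, w_plus = wilson(p, n, alpha)
    return psi(w_minus), psi(w_plus)

# Numerical check that psi rises over the whole range [0, 1]
grid = [i / 1000 for i in range(1001)]
is_increasing = all(psi(a) < psi(b) for a, b in zip(grid, grid[1:]))
print(is_increasing)

lower, upper = psi_interval(0.6, 10)
print(lower, psi(0.6), upper)
```

Because ψ is increasing, the transformed lower bound stays below the transformed upper bound and no swap is needed.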

2.3 Stage 2. An interval for the difference

Next, we need to compute an interval for Cohen’s h (Equation (1)).

Newcombe (1998b) pointed out that when we compare two intervals on two proportions p₁ and p₂, we are concerned with the inner intervals, i.e. the interval on p₁ close to p₂, and vice versa.

For a difference d = p₂ – p₁, we may derive a zero-based interval as

0 ∈ (w_d^–, w_d⁺) = (–√((u₁^–)² + (u₂⁺)²), √((u₁⁺)² + (u₂^–)²)), (4)

where interval widths u_i^– = p_i – w_i^– and u_i⁺ = w_i⁺ – p_i, and w₁^–, etc. are the interval bounds for p₁, etc. Selecting the relevant interval width is important. As we saw, the Wilson interval for p is asymmetric. Figure 2 shows that the interval for ψ(p) is also asymmetric.

Consider the upper bound of a zero-based interval for d. If d is positive, then p₂ > p₁. The upper bound of the zero-based interval, w_d⁺, is the smallest (positive) value that d must exceed for us to report ‘a significant difference’. It is calculated from the square root of the sum of the following pair of squared interval widths: the upper width for p₁ (u₁⁺) and the lower width for p₂ (u₂^–). These are the inner intervals where p₂ > p₁. On the other hand, if p₂ < p₁, d is negative, and we focus on w_d^– and the opposite intervals.

This is a long-winded way of saying: check the geometry!

We can then rewrite Equation (4) as an interval for d by simple subtraction:

d ∈ (d^–, d⁺) = d – (w_d^–, w_d⁺) = (d – w_d⁺, d – w_d^–). (5)
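Equations (4) and (5) translate into code as follows. This is a sketch only: nw_difference and wilson are hypothetical names, the Wilson intervals are uncorrected, and math.hypot is used for the root of the sum of two squares.

```python
import math
from statistics import NormalDist

def wilson(p, n, alpha=0.05):
    """Uncorrected Wilson score interval (w_minus, w_plus)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - spread, centre + spread

def nw_difference(p1, n1, p2, n2, alpha=0.05):
    """Newcombe-Wilson interval (d_minus, d_plus) for d = p2 - p1."""
    w1_lo, w1_hi = wilson(p1, n1, alpha)
    w2_lo, w2_hi = wilson(p2, n2, alpha)
    u1_lo, u1_hi = p1 - w1_lo, w1_hi - p1   # lower and upper widths for p1
    u2_lo, u2_hi = p2 - w2_lo, w2_hi - p2   # lower and upper widths for p2
    wd_lo = -math.hypot(u1_lo, u2_hi)       # Equation (4), lower zero-based bound
    wd_hi = math.hypot(u1_hi, u2_lo)        # Equation (4), upper zero-based bound
    d = p2 - p1
    return d - wd_hi, d - wd_lo             # Equation (5)

d_lo, d_hi = nw_difference(0.4, 10, 0.6, 10)
print(d_lo, d_hi)   # the interval spans zero, so d = 0.2 is not significant at n = 10
```

Note how the lower bound of d uses the upper width of p₁ and the lower width of p₂, i.e. the inner widths when p₂ > p₁.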

Zou and Donner (2008) generalise this formula by arguing that it can be applied to any pair of good-coverage intervals. That is, provided that the intervals are reasonably accurate, we can compute a difference interval between them by the same process of summing squared interval widths, paying attention to the inner interval. In Wallis (forthcoming), I evaluate this claim more critically.

Nonetheless, this means that to create an interval according to Zou and Donner, all we need to do is substitute p₁ with ψ(p₂) and p₂ with ψ(p₁) in Newcombe’s formula (Equation (4)). We swap indices because h is expressed as ψ(p₁) – ψ(p₂), rather than the other way around.

We could simply substitute u₁^–, etc., but for clarity we will spell this out.

0 ∈ (w_h^–, w_h⁺) = (–√((ψ(w₁⁺) – ψ(p₁))² + (ψ(p₂) – ψ(w₂^–))²), √((ψ(p₁) – ψ(w₁^–))² + (ψ(w₂⁺) – ψ(p₂))²)), (6)

and the following interval for h:

h ∈ (h^–, h⁺) = h – (w_h^–, w_h⁺) = (h – w_h⁺, h – w_h^–). (7)
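Putting Equations (1), (6) and (7) together gives the following Python sketch. The function names are hypothetical, the Wilson intervals are uncorrected (no continuity correction), and the clamp inside psi is our own guard against floating-point rounding at p = 0 or 1.

```python
import math
from statistics import NormalDist

def psi(p):
    """Equation (2), clamping p to [0, 1] to guard against rounding error."""
    return 2.0 * math.asin(math.sqrt(min(1.0, max(0.0, p))))

def wilson(p, n, alpha=0.05):
    """Uncorrected Wilson score interval (w_minus, w_plus)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - spread, centre + spread

def cohens_h_interval(p1, n1, p2, n2, alpha=0.05):
    """Return (h, h_minus, h_plus) following Equations (1), (6) and (7)."""
    w1_lo, w1_hi = wilson(p1, n1, alpha)
    w2_lo, w2_hi = wilson(p2, n2, alpha)
    h = psi(p1) - psi(p2)
    # Equation (6): zero-based bounds from interval widths on the psi scale
    wh_lo = -math.hypot(psi(w1_hi) - psi(p1), psi(p2) - psi(w2_lo))
    wh_hi = math.hypot(psi(p1) - psi(w1_lo), psi(w2_hi) - psi(p2))
    # Equation (7): reposition the interval about h
    return h, h - wh_hi, h - wh_lo

h, h_lo, h_hi = cohens_h_interval(0.6, 10, 0.4, 10)
print(f"h = {h:.4f} in ({h_lo:.4f}, {h_hi:.4f})")
```

With p₁ = 0.6, p₂ = 0.4 and n₁ = n₂ = 10, the result should agree with the interval quoted for the table [[6, 4], [4, 6]] in Section 4.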

We can plot d and h, and their intervals computed by Equations (5) and (7) respectively. To express the overall range we will set p₂ = 1 – p₁ and vary p₁ from 0 to 1, which is equivalent to varying Cramér’s ϕ along the diagonal from ϕ = -1 to 1.

Figure 3. Intervals for Cohen’s h plotted against Cramér’s ϕ for a diagonal interpolation (p₂ = 1 – p₁) from p₁ = 0 (left) to p₁ = 1 (right), n = 10, α = 0.05.

3. Evaluating the interval

Figure 3 shows that the point at which h^– and h⁺ cross the zero axis is almost the same as the equivalent point for d^– and d⁺. But we know that Zou and Donner’s method involves an approximation.

The justification for selecting the Wilson inner intervals for p₁ and p₂ was that these interval widths are in proportion to a Normal interval at the closest bounds of p₁ and p₂. See Wallis (2021: 125).

But we transformed all values of p by applying Equation (2) to p₁, w₁^–, w₁⁺, etc. If the inner intervals were Normal before they were transformed, they will not be Normal afterwards.

The question is then how much additional error is created by applying this transformation beforehand?

One way to evaluate this is discussed in Wallis (forthcoming) and Evaluating the performance of risk ratio and odds ratio tests. Testing if h is significantly different from zero should be equivalent to a χ² or Fisher test. This evaluation is a subset of all possible ones. However we can perform it exhaustively. Errors are weighted by a Fisher prior probability to account for the combinatorial chance of a particular outcome.

This method permits us to see two things:

the scale of Type I errors introduced by the transformation – this is the risk that our new interval might exclude zero, and yet an exact Fisher test would rule it to be ‘non-significant’ – and
how these errors rank Cohen’s transformation against others, such as the risk ratio or odds ratio.

We rerun our ratio evaluation, including our new interval. In Figures 4 and 5, we can see the performance of the Cohen’s h inner interval at zero against a Fisher test. Figure 4 computes errors for tables where n₁ = n₂ ∈ {1, 2,… 200}. Figure 5 performs the same where n₁ = 5n₂.

Figure 5. Cohen’s h evaluated by Fisher-weighted error rates for Type I errors, performance for h ≠ 0 against the Fisher ‘exact’ test, computed for values of n₁ ∈ {1, 2,… 200}, α = 0.05, with unequal-sized samples, n₁= 5n₂.

We can observe that there is a small additional error cost involved in employing Zou and Donner’s theorem with Cohen’s h compared to the simple difference d (Newcombe-Wilson). This error is smaller still for the unequal-sized sample, where more data supports p₂.

To rank them we can sum all two hundred Type I error scores (Table 1). The rate 3.3051/200 represents an additional mean error of 0.0165 where the uncorrected Newcombe-Wilson interval finds a significant difference and the Fisher test does not. If a continuity correction is applied, this error cost falls to 0.0029.

Sample sizes	Correction	χ²	NW	Cohen’s h	risk ratio	logarithm	odds ratio
n₁ = n₂	no c.c.	3.8476	3.3051	3.8610	4.0109	4.2974	4.3852
n₁ = n₂	c.c.	0.0000	0.5780	0.7589	0.8584	1.3264	1.4061
n₁ = 5n₂	no c.c.	3.4533	3.1207	3.2112	3.3081	3.4745	3.4975
n₁ = 5n₂	c.c.	0.4016	0.7377	0.7744	0.7944	0.7880	0.8090

Table 1. Total Type I error rates, summed for n₂ ∈ {1, 2,… 200}, α = 0.05.

The error is slightly smaller than that for the risk ratio (p₁/p₂), where the transformation function is a logarithm, and smaller than the ratio of logs or odds. These errors can be reduced substantially by employing a correction for continuity.

An even more conservative interval which eliminates, or near-eliminates, Type I errors may be obtained by employing a larger correction.

How does it perform against an equivalent Newcombe-Wilson test? The following animation plots Type I errors generated by estimating difference on these additional scales, against the equivalent NW test, with or without continuity correction. This is a method of visualising discrepancies between the application of the inner interval Normal approximation when it is applied to different number scales.

Again, we see that Cohen’s h (here, shown as a red line) obtains fewer errors than the risk ratio (ordinary ratio). The odds ratio and logarithm (dashed) are more difficult to distinguish.

Animation 1. Evaluating Cohen’s h, risk, odds and log ratios against the equivalent Newcombe-Wilson test.

4. Unsigned Cohen’s |h|

It is quite common to see unsigned scores, |h|, cited in the literature.

Consider Figure 3. Note that an interval may include zero – indeed, this corresponds to the state of a ‘non-significant difference’ (i.e. not significantly different from zero). The points where the interval bounds cross the zero axis are indicated, and the intervals for all points between the two circled areas include zero.

We can derive a confidence interval for an unsigned score from a signed one. The following method is preferred. We transform the interval for signed h by paying attention to the global minimum of |h|, i.e. zero.

If the interval excludes zero, we have, simply:

|h| ∈ (|h|^–, |h|⁺) = (min(|h^–|, |h⁺|), max(|h^–|, |h⁺|)).

If the interval includes zero, |h|^– = 0. The interval is closed at zero, hence the square bracket:

|h| ∈ (|h|^–, |h|⁺) = [0, max(|h^–|, |h⁺|)). (8)
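The two cases can be expressed as a small Python helper. This is a sketch: unsigned_interval is a hypothetical name, and it takes the signed bounds (h^–, h⁺) as already computed.

```python
def unsigned_interval(h_lo, h_hi):
    """Collapse a signed interval (h_lo, h_hi) onto the positive number scale."""
    outer = max(abs(h_lo), abs(h_hi))
    if h_lo <= 0.0 <= h_hi:
        # The signed interval includes zero: Equation (8), closed at zero
        return 0.0, outer
    # The signed interval excludes zero: order the absolute bounds
    return min(abs(h_lo), abs(h_hi)), outer

print(unsigned_interval(-0.4251, 1.1442))  # includes zero
print(unsigned_interval(-1.2, -0.3))       # excludes zero, negative side
print(unsigned_interval(0.3, 1.2))         # excludes zero, positive side
```

The second and third calls return the same interval, which illustrates the information lost by discarding the sign.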

This transformation loses information, but this may be what we want. Consider the task of comparing two signed scores, h₂ > h₁. This will detect outcomes where the scores have different signs. The proposition |h₂| > |h₁| is a subset of these results.

The method is also conservative because the absolute function is a non-monotonic transform. The interval ‘folds back’ on itself at 0. See Confidence intervals on goodness of fit ϕ scores.

Why is this conservative? Consider the interval for the contingency table [[6, 4], [4, 6]] at α = 0.05. We have n₁ = n₂ = 10, p₁ = 0.6, p₂ = 0.4. We obtain h = 0.4027 ∈ (-0.4251, 1.1442). This predicts that if the true value is less than h, there is a 0.05 chance that it is less than -0.4251.

The transformed interval for |h| = 0.4027 ∈ [0, 1.1442).

But now we have lost that 0.05 chance of the true population value being a lower score. Instead, for this table, the threshold is equivalent to a lower bound of h^– = -1.1442, an event that has an infinitesimal chance (α < 0.000001) of occurring!

5. Conclusions

We have demonstrated how to derive an interval for Cohen’s h, an effect size for pairs of proportions or 2 × 2 contingency tables, like Cramér’s ϕ.

Armed with this interval we can perform any of the additional procedures outlined in the introduction, including plotting intervals on observed h-scores and comparing their significant difference. We can also identify when what Cohen calls a ‘large’ or ‘medium’ effect is supportable by inferential statistics, i.e. when the lower bound of the unsigned interval excludes the threshold value.

Cohen’s h is the difference between two arcsine-transformed proportions, a fact which necessarily transforms the number scale on which probability density functions are computed. Whereas the Newcombe-Wilson method for the difference interval relies on an observation that uncertainty is Normally distributed at Wilson score interval bounds, the arcsine transform (Equation (2)) is non-linear (Figure 2), and therefore any Normal distribution plotted on the p-axis will become non-Normal on a transformed axis.

Consequently, Zou and Donner’s (2008) method, which generalises the Newcombe-Wilson formula from differences in p to differences in any property with a good coverage interval, will obtain slightly different outcomes due to this additional approximation. It introduces small discrepancies, classed as either additional Type I or Type II errors. The question is: how substantial are these errors, and to what extent are they addressed by standard methods, e.g. continuity corrections?

We found that these errors were not negligible, but, when compared to the ‘exact’ Fisher test, were of approximately the same order overall as those obtained for the Newcombe-Wilson difference interval. The error rate was slightly greater than for the simple difference, but the method performs better than the equivalent for the risk ratio, logarithm or odds ratio.

Note: Since writing this blog post, I have discovered that these errors can be controlled by a simple method. If we include a continuity correction factor that is 1.5 times larger than normal into both Wilson intervals for p₁ and p₂, the resulting interval for h has very few Type I errors.

Finally, we considered intervals for unsigned |h|. Unsigned effect sizes are quite common, because they can be obtained for tables with more than one degree of freedom. However, they are conservative and lossy. When dealing with an effect size with a single degree of freedom, the optimum method is to first compute an interval for the signed score, and then collapse the interval onto the positive number scale, as we discussed in this article.

References

Cohen, J. (2013). Statistical power analysis for the behavioral sciences (2nd ed.). New York: Routledge.

Newcombe, R.G. (1998a). Two-sided confidence intervals for the single proportion: comparison of seven methods. Statistics in Medicine, 17, 857-872.

Newcombe, R.G. (1998b). Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine, 17, 873-890.

Wallis, S.A. (2013). Binomial confidence intervals and contingency tests. Journal of Quantitative Linguistics, 20:3, 178-208. » Post

Wallis, S.A. (2021). Statistics in Corpus Linguistics Research. New York: Routledge.

Wallis, S.A. (forthcoming). Accurate confidence intervals on Binomial proportions, functions of proportions and other related scores. » Post

Zou, G.Y. & A. Donner (2008). Construction of confidence limits about effect measures: A general approach. Statistics in Medicine, 27:10, 1693-1702.

Confidence intervals

Sean — Thu, 13 Apr 2023 15:26:01 +0000

In this blog we identify efficient methods for computing confidence intervals for many properties.

When we observe any measure from sampled data, we do so in order to estimate the most likely value in the population of data – ‘the real world’, as it were – from which our data was sampled. This is subject to a small number of assumptions (the sample is randomly drawn without bias, for example). But this observed value is merely the best estimate we have, on the information available. Were we to repeat our experiment, sample new data and remeasure the property, we would probably obtain a different result.

A confidence interval is the range of values in which we predict that the true value in the population will likely be, based on our observed best estimate and other properties of the sample, subject to a certain acceptable level of error, say, 5% or 1%.

A confidence interval is like a blur in a photograph. We know where a feature of an object is, but it may be blurry. With more data, better lenses, a greater focus and longer exposure times, the blur reduces.

In order to make the reader’s task a little easier, I have summarised the main methods for calculating confidence intervals here. If the property you are interested in is not explicitly listed here, it may be found in other linked posts.

1. Binomial proportion p

The following methods for obtaining the confidence interval for a Binomial proportion have high performance.

The Clopper-Pearson interval
The Wilson score interval
The Wilson score interval with continuity correction

A Binomial proportion, p ∈ [0, 1], represents the proportion of instances of a particular type of linguistic event, which we might call A, in a random sample of interchangeable events of either A or B. In corpus linguistics this means we need to be confident (as far as it is possible) that all instances of an event in our sample can genuinely alternate (all cases of A may be B and vice-versa).

These confidence intervals express the range of values where a possible population value, P, is not significantly different from the observed value p at a given error level α. This means that they are a visual manifestation of a simple significance test, where all points beyond the interval are considered significantly different from the observed value p. The difference between the intervals is due to the significance test they are derived from (respectively: Binomial test, Normal z test, z test with continuity correction).

As well as my book, Wallis (2021), a good place to start reading is Wallis (2013), Binomial confidence intervals and contingency tests.

The ‘exact’ Clopper-Pearson interval is obtained by a search procedure from the Binomial distribution. As a result, it is not easily generalised to larger sample sizes. Usually a better option is to employ the Wilson score interval (Wilson 1927), which inverts the Normal approximation to the Binomial and can be calculated by a formula. This interval may also accept a continuity correction and other adjustments for properties of the sample.

Figure 1: Wilson score interval about p = 0.3, with n = 10, obtained by inverting the Normal approximation to the Binomial about P, at 95% interval bounds (w^–, w⁺).

2. Functions of p

Let us consider next a measure that can be expressed as a function of a Binomial proportion.

To compute an interval for a function of p, our first step is to analyse the behaviour of the function over the probabilistic range p ∈ [0, 1]. The simplest way to do this is to plot the function and identify (i) whether it is monotonic over the range, and if it is not, (ii) identify turning points (local maxima or minima).

2.1 Monotonic functions

A monotonic function is one that is guaranteed either to increase or to decrease over the range of its parameter (its possible values). See Reciprocating the Wilson interval.

For example:

fn(p) = p² is monotonic and increasing over the probabilistic range p ∈ [0, 1].
fn(p) = (p – 0.5)² is non-monotonic over the same range.

A confidence interval for a monotonic function of p can be obtained by simply applying the same function, fn, to its lower and upper bounds. If the function is increasing the transformed bounds will be in the same sequence, but if it is decreasing, the transformed lower bound will be at the upper end:

increasing: fn(p) ∈ (fn(w^–), fn(w⁺))
decreasing: fn(p) ∈ (fn(w⁺), fn(w^–))
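Both cases can be handled by one small routine that transforms the bounds and then orders them. The sketch below is illustrative: transform_interval is a hypothetical name, and the bounds 0.1078 and 0.6032 are the approximate 95% Wilson bounds for p = 0.3, n = 10 from Figure 1.

```python
def transform_interval(fn, w_minus, w_plus):
    """Interval for a monotonic fn(p), given interval bounds (w_minus, w_plus) for p."""
    a, b = fn(w_minus), fn(w_plus)
    # An increasing function keeps the order; a decreasing one needs the bounds swapped
    return (a, b) if a <= b else (b, a)

w_minus, w_plus = 0.1078, 0.6032   # approximate Wilson bounds for p = 0.3, n = 10

odds = transform_interval(lambda p: p / (1 - p), w_minus, w_plus)    # increasing
recip = transform_interval(lambda p: 1 / p, w_minus, w_plus)         # decreasing
print(odds)
print(recip)
```

This shortcut is only valid when fn is monotonic over the whole interval. It gives wrong answers if a turning point lies between the bounds.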

Monotonic functions have a 1:1 mapping and are invertible, so confidence intervals on these functions inherit these properties. Importantly, if p is significantly different from P it follows that their transformed values are also significantly different:

p ≠ P ⇒ fn(p) ≠ fn(P).

If the interval for fn(p) excludes the transformed expected value, fn(P), then the interval for p must exclude P (i.e. p is significantly different from P). A significance test between values of p and P should obtain the same result as a test between fn(p) and fn(P).

Figure 2: Some non-monotonic functions. The lower function has two solutions p for one value of f(p). The stepped function has a plateau where there are many values of p for one value of f(p).

By contrast, a non-monotonic function is inevitably ‘lossy’, that is, if more than one value of p can obtain the score fn(p), it follows that an interval for fn(p) may include scores fn(p′), where p′ is significantly different from p. We must bear this in mind when considering how we test for significant difference and what results may mean.

Some example increasing monotonic functions:

odds p / (1 – p)
logit log(p) – log(1 – p)
weighting, e.g., frequency f = np (sample size n is ‘given’, and thus a constant)
addition, e.g., an intercept as in kp + c
logarithm log_k(p)
logistic (inverse logit)
power p^k where k > 0

Some example decreasing monotonic functions:

reciprocal 1/p, e.g. clause length l = 1/p
power p^k where k < 0

2.2 Non-monotonic functions

This is all very well, but what if we need to compute a confidence interval for a non-monotonic function?

For example, Binomial entropy can be expressed as the negated sum of a function of a proportion p and its alternate q = 1 – p:

η(p) = –(p.log₂(p) + (1 – p).log₂(1 – p)).

This function is not monotonic, but rises and falls over p ∈ [0, 1]. It has a single turning point, a maximum at p = 0.5, where η(p) = 1.

If the confidence interval for p excludes this turning point (p = 0.5), then the function can be said to be monotonic within the interval. A conservative interval for η(p) is obtained from these bounds.
Alternatively, if the interval includes the turning point, the upper bound of the new interval is simply the maximum value of η(p), m̂ = 1, and the lower bound is the smaller of the two transformed bounds, min(η(w^–), η(w⁺)).

Where a transformed interval contains a turning point we include it, either as a maximum or minimum value.

maximum: fn(p) ∈ (min(fn(w^–), fn(w⁺)), m̂)
minimum: fn(p) ∈ (m̌, max(fn(w^–), fn(w⁺)))
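For Binomial entropy, the rule for a maximum turning point might be sketched in Python as follows. The names entropy and entropy_interval are hypothetical, and we assume the usual convention that 0.log₂(0) = 0 at the extremes.

```python
import math

def entropy(p):
    """Binomial entropy in bits, taking 0 * log2(0) = 0 at p = 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def entropy_interval(w_minus, w_plus):
    """Interval for entropy, given interval bounds (w_minus, w_plus) for p."""
    a, b = entropy(w_minus), entropy(w_plus)
    if w_minus <= 0.5 <= w_plus:
        # The interval for p contains the turning point: the maximum is the upper bound
        return min(a, b), 1.0
    # Otherwise the function is monotonic within the interval: order the bounds
    return min(a, b), max(a, b)

print(entropy_interval(0.1078, 0.6032))  # spans p = 0.5
print(entropy_interval(0.6, 0.9))        # excludes p = 0.5
```

A minimum turning point, as in the squared error term, is handled in the mirror-image way, with the minimum value replacing the lower bound.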

Another common example is found in squared error terms of the form (p_i – P_i)², where p_i and P_i are observed and expected proportions respectively. We have a Binomial confidence interval for p_i, but P_i is treated as ‘given’ or constant, so it has no interval. The term has a minimum turning point, m̌ = 0, where p_i = P_i.

3. Functions of multiple proportions

So far we have simply transformed the confidence interval for a single observed Binomial proportion. But many formulae contain more than one independent observed proportion or property, each of which has its own confidence interval. We may wish to obtain intervals on the following:

difference p₂ – p₁
sum p₁ + p₂
ratio p₁ / p₂
product p₁ × p₂
power p₁^p₂
logarithm log_p₂(p₁)

3.1 Differences

Newcombe (1998) offers an efficient confidence interval for the difference between two observed proportions, d = p₂ – p₁. Since the intervals for each proportion are independent, he employs a Pythagorean reasoning analogous to the Bienaymé sum of variances rule. The interval widths of the new combined interval are obtained from the hypotenuse of a triangle whose other two sides are the relevant interval widths of each term.

Figure 3: Calculating the lower bound of the Newcombe-Wilson interval, using the Pythagorean Bienaymé formula.

For the difference formula d = p₂ – p₁, we may write:

(w_d^–, w_d⁺) = (–√((u₁^–)² + (u₂⁺)²), √((u₁⁺)² + (u₂^–)²)),

where u_i^– = p_i – w_i^– and u_i⁺ = w_i⁺ – p_i, and (w_i^–, w_i⁺) are the Wilson score interval bounds for p_i, i ∈ {1, 2}.

This is a zero-based interval, and in this form it can be used simply to test whether a difference is significantly different from zero (it is if d falls outside the interval).

We may reposition the interval about the difference by subtracting it from d. Now it is an interval on our property, d.

d ∈ (d^–, d⁺) = d – (w_d^–, w_d⁺) = (d – w_d⁺, d – w_d^–).

Note that the resulting upper bound of d, d⁺ = d – w_d^–, is based on the lower bound of p₁ (u₁^–), because p₁ is subtracted in the expression p₂ – p₁, and the upper bound of p₂ (which is positive).

If the resulting interval excludes zero, p₁ and p₂ are significantly different.
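
The calculation above can be sketched in Python as follows. This is a minimal illustration without continuity correction; the function names are ours, and the critical value z is obtained from the standard library.

```python
from math import sqrt
from statistics import NormalDist


def wilson(p, n, alpha=0.05):
    """Wilson score interval (w-, w+) for proportion p and sample size n."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom


def newcombe_wilson(p1, n1, p2, n2, alpha=0.05):
    """Return the zero-based interval and the interval about d = p2 - p1."""
    l1, h1 = wilson(p1, n1, alpha)
    l2, h2 = wilson(p2, n2, alpha)
    d = p2 - p1
    # Zero-based interval (w_d-, w_d+): hypotenuse of the relevant widths.
    wd_lo = -sqrt((p1 - l1) ** 2 + (h2 - p2) ** 2)
    wd_hi = sqrt((h1 - p1) ** 2 + (p2 - l2) ** 2)
    # Repositioned about d: (d - w_d+, d - w_d-).
    return (wd_lo, wd_hi), (d - wd_hi, d - wd_lo)


zero_based, about_d = newcombe_wilson(0.1, 10, 0.9, 10)
print(zero_based, about_d)
```

Both forms lead to the same decision: d falls outside the zero-based interval exactly when the repositioned interval excludes zero.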

3.2 Other mathematical operators

Zou and Donner (2008) generalise this principle to any sound confidence interval of any property on the same scale. Substituting –p₁ for p₁ gives us a sum (see also section 3.4 below), and substituting log(p_i) for p_i gives us a ratio interval. Indeed, armed with the ability to compute confidence intervals on logarithmic functions of p, plus this generalised formula, we can create intervals for all of the above. See An algebra of intervals and Confidence intervals on powers and logs.

Once we can obtain an interval for an effect size, we can compare effect sizes by simply constructing a difference interval and checking if it includes zero.

3.3 Analytical reduction

Before we create a confidence interval for a formula, we need to rewrite the formula in as simple a form as possible.

The key principle is each variable citation = one degree of freedom:

Every independent observed proportion, which would attract an independent confidence interval, has a single degree of freedom, and should be cited once only in the formula.

For example, consider percentage difference, which is typically written

d^% = (p₂ – p₁) / p₁ = d / p₁.

On the basis of this formula we might cite the Newcombe-Wilson difference interval for d (see section 3.1 above) and then use this interval to calculate the ratio formula for d / p₁.

Unfortunately the result is a wide and excessively conservative interval, because we assumed that d and p₁ were independent, and they are not.

However, we can simplify the formula further so that p₁ appears only once:

d^% = p₂/p₁ – 1.

Now we can compute the confidence interval for the ratio of two independent proportions, p₂/p₁, and subtract 1.
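
One way to sketch this in Python is to apply the difference rule to log-transformed Wilson bounds, as suggested in section 3.2, and then subtract 1. The function names are ours, no continuity correction is applied, and the sketch assumes both observed proportions are greater than zero so that the logarithms are defined.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def wilson(p, n, alpha=0.05):
    """Wilson score interval (w-, w+) for proportion p and sample size n."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom


def ratio_interval(p1, n1, p2, n2, alpha=0.05):
    """Interval for p2 / p1, computed as a difference on the log scale.

    Assumes p1 > 0 and p2 > 0 (so the Wilson lower bounds are also > 0).
    """
    l1, h1 = wilson(p1, n1, alpha)
    l2, h2 = wilson(p2, n2, alpha)
    r = log(p2) - log(p1)
    # log is monotonic, so interval widths are differences of logged bounds.
    lo = r - sqrt((log(p2) - log(l2)) ** 2 + (log(h1) - log(p1)) ** 2)
    hi = r + sqrt((log(h2) - log(p2)) ** 2 + (log(p1) - log(l1)) ** 2)
    return exp(lo), exp(hi)


def percent_difference_interval(p1, n1, p2, n2, alpha=0.05):
    """Interval for d% = p2/p1 - 1, citing p1 and p2 once each."""
    lo, hi = ratio_interval(p1, n1, p2, n2, alpha)
    return lo - 1, hi - 1


print(percent_difference_interval(0.2, 100, 0.3, 100))
```

Because p₁ and p₂ each contribute one interval, the result does not double-count the uncertainty in p₁.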

This process of simplification is a necessary first step. The best advice is simply to think about the number of different ways the same formula can be expressed, and whether any terms can be ‘cancelled out’. Remember that authors may cite a version of a formula that is easy to explain to the reader: that version may not be optimum for deriving a confidence interval.

3.4 k-constrained summation p₁ + p₂ + … + p_k

If we apply Zou and Donner’s (2008) theorem to the unconstrained sum of independent proportions, we obtain the following interval:

independent sum s ∈ (s^–, s⁺) = (∑p_i – √(∑(u_i^–)²), ∑p_i + √(∑(u_i⁺)²)).

In other words, the lower bound is the Pythagorean diagonal of tangential lower bounds; the upper bound is calculated from the upper bounds. We can also substitute any function, fn(p_i), for p_i, provided that we can compute sound confidence intervals on it.
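
A minimal Python sketch of the unconstrained sum follows, assuming each proportion is drawn from its own independent sample. The function names and the example data are illustrative only.

```python
from math import sqrt
from statistics import NormalDist


def wilson(p, n, alpha=0.05):
    """Wilson score interval (w-, w+) for proportion p and sample size n."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom


def independent_sum_interval(samples, alpha=0.05):
    """Interval for the sum of independent proportions.

    samples is a list of (p, n) pairs, one per independent proportion.
    """
    s = sum(p for p, _ in samples)
    lower_sq = upper_sq = 0.0
    for p, n in samples:
        w_lo, w_hi = wilson(p, n, alpha)
        lower_sq += (p - w_lo) ** 2  # (u_i-) squared
        upper_sq += (w_hi - p) ** 2  # (u_i+) squared
    return s - sqrt(lower_sq), s + sqrt(upper_sq)


# Hypothetical data: two proportions from separate samples.
print(independent_sum_interval([(0.1, 50), (0.3, 80)]))
```

With a single term the result reduces to the Wilson interval itself, which is a convenient sanity check.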

A number of effect size measures, including entropy and goodness of fit ϕ, are computed across discrete Multinomial variables with a closed set of k types or outcomes. In this case, it is necessary to sum a series of error terms, fn(p_i), in the knowledge that the sum of the proportions, ∑p_i, is actually 1. There are k – 1 degrees of freedom. We let κ = k/(k – 1), and scale the sum of variances accordingly.

k-constrained sum s ∈ (s^–, s⁺) = (∑fn(p_i) – √(κ∑(u_i^–)²), ∑fn(p_i) + √(κ∑(u_i⁺)²)),

where u_i^– = fn(p_i) – fn(w_i^–) and u_i⁺ = fn(w_i⁺) – fn(p_i).

Tip: When developing an interval, compare its performance for k = 2 with a Binomial derivation for p and q = (1 – p). Although the k-constrained interval will tend to be conservative, it should perform comparably.

4. Performance

Intervals calculated by these methods are far superior to conventional methods that erroneously assume that the probability density function of an interval is Normal (and symmetric) on some scale. With the exception of the logit (log odds) function, intervals about p (Figure 1) or functions of p are not Normal; when two intervals are combined, a derived interval is also unlikely to be Normal. We need not perform a complex evaluation to reveal this: all we need to do is plot the performance of the interval over the range of p, or permutations of p₁, p₂, etc.

However, Zou and Donner’s theorem does introduce a performance cost in the form of increased Type I errors. This is the error of assuming a difference to be significant when it is not. If we aim for an error level of 0.05 (say), even if we employ the continuity-corrected Wilson score interval for p, a small additional error appears. The performance of proportions in combination may deteriorate, allowing more errors to creep in, especially for small n.

Fortunately, we have found that multiplying the continuity correction term by 1.5 vastly reduces these errors, to a level comparable with Yates’ famous test.

5. Conclusions

This brief post is not intended as a complete account of all possible derivations of confidence intervals. I have not addressed the derivation of intervals on Cramér’s ϕ, for example. With the exception of an initial mention of Clopper-Pearson, I have avoided mention of computational search-based methods. Similarly, there are other adaptations apart from continuity correction that may be applicable for small populations or where instances are not properly random but drawn from contiguous text (random text samples). You will find a discussion of these elsewhere on this blog.

Rather, my intention in writing this post was to give the reader a route in to a set of methods which offer the promise of efficient calculation, good accuracy and a high level of generalisation.

A reader new to the world of confidence intervals will note how these algebraic methods allow us to create a large number of new intervals and thus new tests. A property with an associated confidence interval may be appreciated as observable subject to a predictable level of uncertainty, this uncertainty being expressed by the interval. In contrast, traditional null hypothesis significance testing (NHST) tends to separate descriptive measures and testing procedures, and this kind of evaluation is generally obscured from the research user. This makes interpreting results and carrying out further tests (such as meta-tests comparing repeat runs of the same experiment) much more difficult.

Our methods are also generalisable to confidence intervals on measures other than the Binomial proportion, such as the ratio of two t-distributed natural numbers or positive Reals.

This blog also contains a number of posts exploring the probability density function (pdf) ‘shape’ of these confidence intervals. These plots show that, for small sample sizes at least, these methods create distributions that are only occasionally approximately Normal.

References

Newcombe, R.G. (1998). Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine, 17, 873-890.

Wallis, S.A. (2013). Binomial confidence intervals and contingency tests. Journal of Quantitative Linguistics, 20:3, 178-208.

Wallis, S.A. (2021). Statistics in Corpus Linguistics Research. New York: Routledge.

Wallis, S.A. (forthcoming). Accurate confidence intervals on Binomial proportions, functions of proportions and other related scores.

Wilson, E.B. (1927). Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association, 22, 209-212.

Zou, G.Y. & A. Donner (2008). Construction of confidence limits about effect measures: A general approach. Statistics in Medicine, 27:10, 1693-1702.

Summer School in English Corpus Linguistics 2023 (online)

Sean — Mon, 27 Mar 2023 08:47:34 +0000

I am pleased to announce the tenth annual UCL Summer School in English Corpus Linguistics, to be held online from 19-21 June.

The Summer School is a short three-day intensive course aimed at PhD-level students and researchers who wish to get to grips with Corpus Linguistics from the perspective of the ‘Survey Methodology’. It is offered at £125 for early bookings made before 14 May, rising to £150 after.

Each day begins with a theory lecture, followed by a guided hands-on workshop with corpora. It is timed to run from 9:00 to 13:00 British Summer Time (GMT+1), to make it accessible for students across Europe, Africa and Asia.

This year we are joined by Beth Malory, who will be running new sessions on discourse analysis and sociolinguistics research. More information and the provisional timetable are available on the Survey website.

Aims and objectives of the course

Over the three days, participants will learn about the following:

the scope of Corpus Linguistics, and how we can use it to study the English Language;
key issues in Corpus Linguistics methodology;
how to use corpora to analyse issues in syntax and semantics;
basic elements of statistics;
how to navigate large and small corpora, particularly ICE-GB and DCPSE.

Learning outcomes

At the end of the course, participants will have:

acquired a basic but solid knowledge of the terminology, concepts and methodologies used in English Corpus Linguistics;
had practical experience working with two state-of-the-art corpora and a corpus exploration tool (ICECUP);
gained an understanding of the breadth of Corpus Linguistics and the potential application for projects;
learned about the fundamental concepts of inferential statistics and their practical application to Corpus Linguistics.

For more information, including costs, booking information, timetable, see the website.

Plotting entropy confidence interval distributions

Sean — Mon, 29 Aug 2022 16:46:45 +0000

Introduction

In this blog post, I will discuss the distribution of confidence intervals for the information-theoretic measure, entropy.

One of the problems we face when reasoning with statistical uncertainty concerns our ability to mentally picture its shape. As students we were shown the Normal distribution and led to believe that it is reasonable to assume that uncertainty about an observation is Normally distributed.

Even when students are introduced to other distributions, such as the Poisson, the tendency to assume that uncertainty is expressed as a Normal distribution (‘the Normal fallacy’) is extremely common. The assumption is not merely an issue of weak mathematics and poor conceptualisation: since Gauss’s famous method of least squares relies on Normality, this issue affects fitting algorithms and error estimation applied to non-Real variables, such as the one discussed here.

As a general rule, whenever I have developed methods for computing confidence intervals I have done my best to plot, not just the interval bounds (the upper or lower critical threshold at a given error level), but the probability density function (pdf) distribution of the interval bounds. The results are often surprising, and give us fresh insight into the intervals we are using.

Entropy is an interesting case study for two reasons. First, there are two methods for computing the two-valued measure, one more precise but less generalisable than the other. Second, like many effect sizes, the function involves a non-monotonic transformation, which has important implications for how we conceptualise uncertainty and intervals. (Indeed, so far I have not published the equivalent distributions for goodness of fit ϕ or diversity, both of which engage the same type of transformations.)

First we will do some necessary recapitulation, so bear with me.

Preliminaries: entropy, and intervals for the single proportion

In The confidence of entropy – and information we introduced the standardised or ‘normalised’ entropy measure η ∈ [0, 1] obtained from Equation (1).

entropy η ≡ –Σ_i p_i.log_k(p_i) = –(1/ln(k)) Σ_i p_i.ln(p_i), (1)

for k values for competing proportions p_i, i ∈ {1, 2,… k}, Σp_i = 1, with k – 1 degrees of freedom. If p_i = 0 or 1, the term p_i.ln(p_i) is zero.
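
For readers who prefer code, Equation (1) might be written in Python as below. The function name is ours, and the sketch assumes the proportions sum to 1 and that k is at least 2.

```python
from math import log


def normalised_entropy(proportions):
    """Normalised entropy (Equation 1) for k >= 2 proportions summing to 1.

    Terms where p = 0 or p = 1 contribute zero and are skipped.
    """
    k = len(proportions)
    total = sum(p * log(p) for p in proportions if 0.0 < p < 1.0)
    return -total / log(k)


print(round(normalised_entropy([0.1, 0.9]), 4))
print(normalised_entropy([1.0, 0.0]))
```

Dividing by ln(k) is what scales the score to the range [0, 1] regardless of the number of outcomes.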

On the face of it, this function seems quite simple. One might imagine that the distribution of uncertainty would be approximately bell-shaped (Normal) about the score. Indeed, for large n, the distribution pair does approach this shape, as we shall see.

To compute confidence intervals for a formula like this, we first compute Binomial confidence intervals for each of the probability terms p_i. Whereas one might employ an ‘exact’ Clopper-Pearson computation, the Wilson score interval p ∈ (w^–, w⁺), which may be computed with or without a continuity correction for each p_i, is an excellent alternative that is more efficient to compute.

Wilson interval bounds (Wilson 1927) can be considered as a three-parameter function, with parameters proportion p, sample size n and error level α. We may write

w^– = WilsonLower(p, n, α/2),
w⁺ = WilsonUpper(p, n, α/2), (2)

where

Wilson score interval (w^–, w⁺) ≡ (p + z²/2n ± z√(p(1 – p)/n + z²/4n²)) / (1 + z²/n), (3)

and z is the two-tailed critical value of the Normal distribution at error level α (written z_α/2 in full). Continuity-corrected intervals are easily obtained by substituting

w^–_cc = WilsonLower(max(0, p – 1/2n), n, α/2),
w⁺_cc = WilsonUpper(min(1, p + 1/2n), n, α/2). (2′)
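
Equations (2), (3) and (2′) can be sketched in Python as follows. The function names mirror the notation above but are otherwise our own, and the critical value z is taken from the standard library rather than a table.

```python
from math import sqrt
from statistics import NormalDist


def wilson_bound(p, n, z, upper):
    """One bound of the Wilson score interval (Equation 3)."""
    sign = 1.0 if upper else -1.0
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre + sign * spread) / (1 + z * z / n)


def wilson_interval(p, n, alpha=0.05, cc=False):
    """Wilson interval, optionally continuity-corrected as in Equation (2')."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    shift = 1 / (2 * n) if cc else 0.0
    lower = wilson_bound(max(0.0, p - shift), n, z, upper=False)
    upper = wilson_bound(min(1.0, p + shift), n, z, upper=True)
    return lower, upper


print(wilson_interval(0.1, 10))
print(wilson_interval(0.1, 10, cc=True))
```

The corrected interval is always at least as wide as the uncorrected one, because the correction moves p outwards before each bound is computed.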

In the previous blog post, we derived two methods for obtaining confidence intervals for entropy. The first was a special Binomial case for k = 2; the second was a more general Multinomial case for a variable with any number of outcomes k.

Method 1. Binomial entropy interval, k = 2

In the Binomial case, we let p = p₁. Then, since p₂ = 1 – p₁, Equation (1) becomes

η(p) = –(p.log₂(p) + (1 – p).log₂(1 – p)), (4)

and η(p) = 0 if p = 0 or 1.

This is a symmetric non-monotonic function of the single proportion, p, with a maximum at p = 0.5, and two minima at p = 0 and 1.

As the function can be determined by a single parameter, all we need to do is suitably transform the interval bounds. Were η(p) to be monotonic, we could simply transform bounds η(w^–), η(w⁺) and order the two scores appropriately.

However, since the function is not monotonic, we have to detect and manage the maximum. We may define an interval (η_b^–,η_b⁺) where

η_b^– = min(η(w^–), η(w⁺)),

η_b⁺ = η(w⁺) if w⁺ < 0.5,
η_b⁺ = η(w^–) if w^– > 0.5,
η_b⁺ = 1 otherwise. (5)

See The confidence of entropy for more information. To compute continuity-corrected intervals we simply substitute w^–_cc and w⁺_cc into the formula.
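
Method 1 can be sketched in Python as below, without continuity correction. The function names are ours; the logic follows Equations (4) and (5).

```python
from math import log2, sqrt
from statistics import NormalDist


def wilson(p, n, alpha=0.05):
    """Wilson score interval (w-, w+) for proportion p and sample size n."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom


def entropy2(p):
    """Binary entropy (Equation 4), zero at p = 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))


def binomial_entropy_interval(p, n, alpha=0.05):
    """Interval for binary entropy by transforming Wilson bounds (Equation 5)."""
    w_lo, w_hi = wilson(p, n, alpha)
    lower = min(entropy2(w_lo), entropy2(w_hi))
    if w_hi < 0.5:
        upper = entropy2(w_hi)   # rising side: monotonic within the interval
    elif w_lo > 0.5:
        upper = entropy2(w_lo)   # falling side: bounds change places
    else:
        upper = 1.0              # interval for p contains the maximum
    return lower, upper


print(binomial_entropy_interval(0.1, 10))
```

By symmetry, the same interval is returned for p and for 1 – p.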

Note. We will employ the subscript ‘b’ in this blog post (for Binomial method) to distinguish intervals from the Multinomial case (labelled ‘m’).

Figure 1. Plotting the upper and lower confidence interval bounds for binary entropy obtained by Equation (5).

The resulting interval bounds may be plotted alongside entropy η for varying p as in Figure 1. Intervals are read vertically, as illustrated. Note how the two generating functions change places and the maximum does not exceed 1. This type of plot reveals how the interval bounds behave, but it does not tell us about how uncertainty is distributed, or, to put it another way, where the population value is more likely to be. In order to do this we need to plot the interval function for particular values of p and thus η(p).

Method 2. Multinomial entropy interval, k > 2

Method 1 is limited to base 2 entropy scores, so for a more general Multinomial solution we apply Zou and Donner’s (2008) theorem to the sum of k terms with k – 1 degrees of freedom (see Wallis forthcoming). Although we have an accurate deterministic method for k = 2, we can also apply this formula to the Binomial condition, which we do here for the purposes of comparison.

First, we compute intervals for the inner term in Equation (1), which we will call ‘inf(p_i)’.

inf(p_i) = –p_i.log_k(p_i). (6)

Like entropy, the ‘inf’ function is non-monotonic, rising to a maximum at p_i = mˆ = 1/e ≈ 0.367879. We can compute intervals for Equation (6), inf(p_i) ∈ (h_i^–, h_i⁺), using a similar approach to before:

h_i^– = min(inf(w_i^–), inf(w_i⁺)),

h_i⁺ = inf(w_i⁺) if w_i⁺ < mˆ,
h_i⁺ = inf(w_i^–) if w_i^– > mˆ,
h_i⁺ = inf(mˆ) otherwise. (7)

Next, we compute an interval for entropy by computing interval widths for the k-constrained sum, summing the squares of the interval widths and multiplying by a factor, κ = k/(k – 1):

u^– = √(κ Σ(inf(p_i) – h_i^–)²), and u⁺ = √(κ Σ(inf(p_i) – h_i⁺)²). (8)

This results in the following interval (which we will label with the subscript ‘m’ for the Multinomial method):

(η_m^–, η_m⁺) = (max(η – u^–, 0), min(η + u⁺, 1)). (9)

Again, for more discussion of this method and its implications, see The confidence of entropy. Previously, we noted that the Binomial interval was less conservative than the Multinomial, although the intervals tended to have more and more comparable performance as n increased.
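
Equations (6) to (9) might be sketched in Python as follows, without continuity correction. The function names are ours, and the sketch assumes all k proportions are observed in a single sample of size n.

```python
from math import e, log, sqrt
from statistics import NormalDist

M_HAT = 1 / e  # the value of p at which inf(p) reaches its maximum


def wilson(p, n, alpha=0.05):
    """Wilson score interval (w-, w+) for proportion p and sample size n."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom


def inf(p, k):
    """Inner entropy term -p.log_k(p) (Equation 6), zero at p = 0 or 1."""
    return 0.0 if p <= 0.0 or p >= 1.0 else -p * log(p) / log(k)


def inf_interval(p, n, k, alpha):
    """Interval (h-, h+) for inf(p), managing the turning point (Equation 7)."""
    w_lo, w_hi = wilson(p, n, alpha)
    lower = min(inf(w_lo, k), inf(w_hi, k))
    if w_hi < M_HAT:
        upper = inf(w_hi, k)
    elif w_lo > M_HAT:
        upper = inf(w_lo, k)
    else:
        upper = inf(M_HAT, k)
    return lower, upper


def multinomial_entropy_interval(proportions, n, alpha=0.05):
    """k-constrained entropy interval (Equations 8 and 9)."""
    k = len(proportions)
    kappa = k / (k - 1)
    eta = sum(inf(p, k) for p in proportions)
    sq_lo = sq_hi = 0.0
    for p in proportions:
        h_lo, h_hi = inf_interval(p, n, k, alpha)
        sq_lo += (inf(p, k) - h_lo) ** 2
        sq_hi += (inf(p, k) - h_hi) ** 2
    u_lo, u_hi = sqrt(kappa * sq_lo), sqrt(kappa * sq_hi)
    # Crop to the possible range of entropy, as in Equation (9).
    return max(eta - u_lo, 0.0), min(eta + u_hi, 1.0)


print(multinomial_entropy_interval([0.1, 0.9], 10))
```

Applying the function with k = 2, as here, allows a direct comparison with the Binomial method.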

We include ‘max’ and ‘min’ functions to avoid the phenomenon of Multinomial overshoot, where the interval becomes wider than is possible. We can see examples of this in Figure 2.

Figure 2. Multinomial intervals (green) and Binomial intervals (blue), for varying p and sample size n. Note the pronounced overshoot above η = 1. There is also a very small lower bound overshoot near η = 0.

Plotting the distribution of entropy intervals

A confidence interval bound for a property such as entropy, η, can be thought of as the point on its scale beyond which the tail area of uncertainty (the area beyond the interval) equals the error level α. Except where η = 0 or 1, the interval will have an upper and lower bound, but these are unlikely to be symmetric. We perform the computation for each interval independently.

To plot the interval distribution we employ the delta approximation method described in Wallis (2021: 297). For more information see Plotting the Wilson distribution.

Suffice it to say, this is a flexible mathematical technique that converts an interval formula, like those above, to a distribution curve, revealing where the majority of expected values of the true value in the population are predicted to be. The proportion p is usually a natural fraction of n, i.e. p = f / n where f ∈ {0, 1, 2,… n}.

We obtain a height h(η′, α) for each interval bound η′ and for a sequence of error levels α. We compute the distributions for each bound separately. The area under each curve should sum to 1.
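
The following Python sketch gives a flavour of the idea using simple finite differences. It is an approximation in the spirit of the method rather than a reproduction of the spreadsheet: the function names, the step size and the normalisation (height = change in α divided by change in bound position, so that the columns sum to roughly 1) are our assumptions. It plots only the lower Binomial bound for p < 0.5, where the lower entropy bound is η(w^–).

```python
from math import log2, sqrt
from statistics import NormalDist


def wilson_lower(p, n, alpha):
    """Lower bound of the Wilson score interval at error level alpha."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - spread) / (1 + z * z / n)


def entropy2(p):
    """Binary entropy, zero at p = 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))


def lower_bound_distribution(p, n, steps=1000):
    """Columns (position, height, width) for the lower entropy bound, p < 0.5."""
    alphas = [i / steps for i in range(1, steps)]
    # Bound position for each error level in the sequence.
    xs = [entropy2(wilson_lower(p, n, a)) for a in alphas]
    columns = []
    for a0, a1, x0, x1 in zip(alphas, alphas[1:], xs, xs[1:]):
        width = abs(x1 - x0)
        if width > 0:
            # Assumed normalisation: column area equals the step in alpha.
            columns.append(((x0 + x1) / 2, (a1 - a0) / width, width))
    return columns


columns = lower_bound_distribution(0.1, 40)
print(sum(height * width for _, height, width in columns))
```

Plotting height against position for these columns traces the curve for one bound; the upper bound is handled in the same way with its own sequence.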

Note. As we shall see, where a distribution is ‘cropped’ at 1, it will generate a sharp peak at 1, although arguably this is an artifact of the plotting algorithm. We might argue that this peak represented a legitimate overshoot (see Plotting the Wilson replication distribution). Where η⁺ > 1, the cropped interval includes 1, i.e. η ∈ (η^–, 1]. Alternatively we might better conceive of it as representing cases where the uncertainty was folded back.

All computations are performed in Excel. The latest version of the spreadsheet avoids macros, which would make it less portable than would be ideal.

Figure 3 shows interval distributions for the case where p = 0.1 and n = 10. This is a very small sample size to estimate entropy, and so generates a wide interval. If you think about it, the probability range is effectively halved, since η(p) reaches a maximum of 1 at p = 0.5, and then declines as p increases further (hence the graph for p = 0.1 is also the graph for p = 0.9).

We compute normalised entropy η(p) = 0.4690. The two-tailed Binomial interval with α = 0.05 is (η_b^–, η_b⁺) = (0.1293, 0.9733) whereas the Multinomial interval (η_m^–, η_m⁺) is (0.1097, 0.9876). With a continuity correction applied, these intervals become (0.0473, 0.9951) and (0.0168, 1) respectively.

To make the figure a little easier to read, we have highlighted the interval bounds with coloured triangles and labelled them appropriately (e.g. ‘η_b^–_cc(0.05)’ means the Binomial lower bound with continuity correction where α = 0.05, i.e. for a 95% confidence interval).

Figure 3. Pdf distributions of entropy intervals (η^–, η⁺) by Binomial and Multinomial methods, with and without corrections for continuity. Sample size n = 10, p = 0.1 (or 0.9), η(p) = 0.4690, with two-tailed intervals for α = 0.05 indicated.

A number of features are visible in this graph.

Continuity-corrected intervals begin at η(p ± 1/2n) for the Binomial (0.2864, 0.6098), whereas uncorrected intervals appear to be a continuous curve. A small rounding error introduced by Equation (9) sees the Multinomial interval begin slightly beyond this point (the point where α = 1, see Plotting the Wilson distribution).

The corrected intervals are ‘squeezed’ into the remaining space, so the curve tends to be higher.

The lower bounds for all four intervals are greater than zero in all cases, unless p = 0.

In this case, the uncorrected Binomial upper bound is just below 1 (0.9733), but all intervals are cropped by the global maximum, causing the method to place the excess at 1, which is visualised as a spike. As we discussed in The confidence of entropy, both methods are conservative. We can plot the distribution of the uncropped interval function in order to witness this overshoot phenomenon.

Binomial folding. Since it is directly computed, we can plot the distribution of ‘folded’ Binomial scores. In other words, Equation (4) computes both bounds, but does not order the lower or higher, or crop an upper bound at 1, which we do with Equation (5). If we plot these uncropped intervals, the remaining curves turn back on themselves, below 1 (see also Figure 1). But this entire area should be added to the curve at 1.
Multinomial overshooting. The Multinomial method averages the upper and lower bounds by a sum-of-squares calculation on the interval widths. Part of the curve can exceed, or ‘overshoot’, the range of η ∈ [0, 1]. Now, these are impossible scores, because no true value of entropy in the population, Η, could fall outside this range. Again, this area is added to the excess.

Note: Both methods generate a spike at 1, and Multinomial overshoot also exhibits a second spike beyond 1. What do these represent? They are artifacts of the plotting method. Non-monotonic functions tend to return a large number of cases at the turning point, in this case the maximum, 1. As a result, when we compute an area under the curve, we may generate an infinite column of zero width! In addition, range constraints (‘max’ and ‘min’ functions) substitute limits for the residual overshoot area. The second overshoot spike is in fact due to the turning point of the ‘inf’ function (Equation (7)).

If we permute values of p from 0 to 1, entropy increases to 1 as p reaches 0.5 and then falls back to 0 as a mirror image (see Figure 1). Consequently each distribution for p > 0.5 is the exact same pattern as for the alternate proportion q = 1 – p < 0.5.

Animation 1. Distributions of intervals computed by permuting observed entropy from p = 0.0 to 0.5 and back (noting η(0.4) = η(0.6)), with a small sample size, n = 10. Includes Multinomial overshoot (blue) and folded Binomial (orange).

Calculating residual areas

In Plotting the Wilson replication distribution we discussed the concept of interval residuals, the area under the curve where interval functions generated scores outside the possible range of true values Η ∈ [0, 1]. Whereas some functions might generate overshoot erroneously (the Wald interval being a good example), in other cases this excess may be unavoidable.

We can compute the area of the Binomial fold, ρ_b⁺, by searching for the value of α where the Wilson upper bound is 0.5, i.e. at the turning point of Equation (5). The simplest way to do this is to employ Equation (2). If p < 0.5, we apply Equation (10). If p > 0.5, we solve for q = 1 – p.

Find α where WilsonUpper(p, n, α/2) = 0.5. (10)
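
Equation (10) can be solved by a simple bisection search, sketched below in Python. The function names, the search limits and the tolerance are our own choices; any root-finding method would do.

```python
from math import sqrt
from statistics import NormalDist


def wilson_upper(p, n, alpha):
    """Upper bound of the Wilson score interval at error level alpha."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre + spread) / (1 + z * z / n)


def fold_area(p, n, target=0.5, tol=1e-10):
    """Find alpha where WilsonUpper(p, n, alpha/2) = target (Equation 10)."""
    if p > 0.5:
        p = 1 - p  # solve for q = 1 - p by symmetry
    lo, hi = 1e-12, 1.0  # alpha = 1 gives z = 0, so the bound equals p
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if wilson_upper(p, n, mid) > target:
            lo = mid  # bound is beyond the target: a larger alpha is needed
        else:
            hi = mid
    return (lo + hi) / 2


print(round(fold_area(0.1, 10), 4))
```

The search relies on the upper bound shrinking towards p as α increases, so the crossing point is unique. For a continuity-corrected area, the same search can be run on p + 1/2n.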

We can also find each area of lower and upper Multinomial overshoot by solving for α with a search procedure. We remove ‘min’ and ‘max’ limits from Equation (9) and identify the cross-over point where the bound equals zero or 1. For example, to find the lower bound we search for α where η – u^– = 0.

These methods yield the computed areas in Table 1. Each figure is the proportion of the area of the respective distribution, i.e. the upper interval in the case of ρ⁺, etc.

	ρ^–	ρ⁺	ρ^–_cc	ρ⁺_cc
Binomial		0.0114		0.0268
Multinomial	0.0000	0.0406	0.0113	0.0841

Table 1. Residual areas ρ (separate tails) for cases of Binomial folding and Multinomial overshoot visible in Figure 3, p = 0.1, n = 10. See also Animation 1.

As we discussed, Binomial folding only occurs at the upper end of the range, when η = 1 is within the distribution. Multinomial overshoot can be found at both upper and lower ends of the range [0, 1], and will be more conservative (and approximate). Multinomial overshoot is invariably treated as simply returning the limit, hence the ‘spike’ at 1 in Figure 3. The overshoot spike is due to the ‘inf’ function (see above).

With the exception of the continuity-corrected Multinomial upper bound, ρ_m⁺_cc, all are smaller than α = 0.05, and so 95% interval bounds are within the range of Η.

The impact of sample size

The pdf distributions in Figure 3 (and their permutation in Animation 1 above) are spread across most of the entire entropy range, due to the small sample size for estimating entropy.

Indeed, for the same data, interval widths are a factor of between √2 and 2 larger than the equivalent for the single proportion, p. This is equivalent to a sample size for the simple proportion of no larger than n = 10/2 = 5. This seems intuitively plausible: the variable reaches its maximum at p = 0.5 rather than 1, and loses information due to this ‘doubling up’.

So, what happens if we increase n?

Consider Figure 4, calculated with sample size n = 40. Now, with p = 0.1 (or 0.9) the distribution has no detectable overshoot at either end, and no spike.

We see a close fit between the two methods for the upper interval. For the lower interval, the two methods also converge, but there is a greater discrepancy here, with the Multinomial method being slightly more conservative.

Finally, we can see that the impact of the continuity correction reduces, which we would expect with increasing n.

Figure 4. Distributions of upper and lower bounds of entropy η for n = 40, p = 0.1 or 0.9, η(p) = 0.4690, showing convergence between Binomial and Multinomial methods for larger samples. The upper bound intervals at α = 0.05 are almost identical, whichever method is employed.

This does not mean that the distributions converge equally over the entire range, however. A greater discrepancy between the upper bound distributions can be observed as p tends towards 0.5.

Thus Figure 5, which plots the same graph for p = 0.25, shows that the two methods of calculation of the upper bound differ, with the two-tailed 95% Multinomial interval upper bound reaching 1 with or without a correction for continuity. Being calculated directly, rather than smoothed by Zou and Donner’s theorem, the Binomial method is less conservative, even with the continuity correction.

Figure 5. Distributions for upper and lower entropy bounds, p = 0.25 or 0.75 and n = 40. Note how the lower bounds converge while the upper bounds now differ. Binomial folding is negligible (ρ_b⁺ = 0.0016, ρ_b⁺_cc = 0.0027), but Multinomial overshoot is in excess of α = 0.05 (ρ_m⁺ = 0.0550, ρ_m⁺_cc = 0.0787).

Binomial and Multinomial lower intervals appear to converge, although if we increase p towards 0.5 and η = 1, the methods diverge again (see Animation 3 below). The two methods have comparable performance, but the Multinomial method involves a smoothing assumption which errs on the side of caution.

We will end this section with two animations. Animation 2 plots distributions with a constant entropy (η(p) = 0.4690), with increasing sample size n, doubling each time from n = 10. This illustrates ever-more ‘Normal-like’ performance as sample sizes increase, tending towards greater symmetry and convergence between distributions.

Animation 2. Converging on a bell curve? Constant observed entropy, but increasing sample size, with n doubling each time from 10 to 640. For simplicity we have not included Multinomial overshoot.

However, this does not mean that larger sample sizes guarantee a quasi-Normal bell curve. It would be more accurate to say that an observed entropy score supported by a large data sample will tend to be Normal the more the entropy curve within the interval approximates a straight line (‘exhibits linearity’) and is far from boundaries. Overall, a better fit is to the Wilson distribution, which also converges on the Normal for large n and non-extreme p.

Animation 3 shows the effect of permuting p from 0 to 0.5 with n = 40. We can see the extent to which the different methods converge, the changing distributions near both extremes 0 and 1, and the impact of cropping and folding the upper bound where η → 1. This animation also shows the effect of Multinomial overshoot, which spikes at 1. The upper Multinomial overshoot spikes near 1.1 as before.

With a larger sample it is also easier to see that as η approaches 1, the Binomial ‘folded’ interval distributions converge on the dominant unfolded ones, becoming the same distribution for η = 1. Indeed, were p to increase past 0.5, this source interval would become the dominant lower bound of the entropy interval.

Animation 3. Interval distributions obtained by permuting p from 0 to 0.5 for a larger sample, n = 40. The upper bound Multinomial overshoot can be clearly seen beyond 1: the second spike is due to the ‘inf’ function (Equation (6)).

Conclusions

We can derive high quality confidence intervals for metrics and effect sizes such as entropy, which involve two types of transformations that lose information: non-monotonic functions and summations. With such ‘lossy’ transformations, the resulting intervals may not be less accurate, but they are liable to be conservative. A non-monotonic function implies that the interval distribution will be effectively ‘folded’ back on itself. Summation overlays variation due to each independent summed term. The Multinomial summation of squared interval widths is more conservative than the direct Binomial method.

To plot the distributions of these intervals we employed the same method of delta approximation (Wallis 2021) that we previously used for a number of properties and mathematical relationships. This allowed us to view the distribution of variation in the ultimate functions, and, by applying the function to intermediate properties, allowed us to make sense of the large peak at 1.

In the case of a folded interval, one might legitimately argue that the two curves should be summed (stacked, one on top of the other), rather than superimposed. For our purposes it made sense to superimpose them, allowing us to see their convergence at 1, but were we concerned to predict the likely position of a true value, stacking seems preferable. On the other hand, the Multinomial approximation offers no such opportunity, with all excess variation assumed to fall at the extreme. We also showed how we could compute the relative size of these excess folded and overshot areas, which we referred to as interval residuals.

Whether one uses the direct Binomial method or the Multinomial one, performance is closely comparable. Plotting their distributions, especially when one explores component properties, emphasises their differences, but the overall locations of the intervals are similar. The Binomial method is preferable for base 2 entropy intervals, but the Multinomial method approximates well, especially for larger n. The principal benefit of the latter is that it is extensible to multiple degrees of freedom.

The interval distributions we have plotted are rarely Normal. Conventional fitting algorithms assume tangential estimates of Normal error (usually by summing over variance, or squared interval widths). Whereas the inverse-logistic transformation of p obtains near-Normal ‘logit-Wilson’ intervals (Wallis 2021: 307), permitting a sound regression method, the same is not true on the entropy scale. A great deal of caution should therefore be exercised if off-the-shelf regression methods are applied to observed entropy scores.

References

Wallis, S.A. (2021). Statistics in Corpus Linguistics Research. New York: Routledge.

Wallis, S.A. (forthcoming). Accurate confidence intervals on Binomial proportions, functions of proportions and other related scores.

Wilson, E.B. (1927). Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association, 22, 209-212.

Zou, G.Y. & A. Donner (2008). Construction of confidence limits about effect measures: A general approach. Statistics in Medicine, 27:10, 1693-1702.

The confidence of entropy – and information

Sean — Mon, 08 Aug 2022 14:53:44 +0000

Introduction

Two measures that are sometimes found in linguistic studies are information, defined as the negative log of the probability, and entropy. These are information-theoretic measures first defined by Claude Shannon (see e.g. Shannon and Weaver 1949). Entropy is also found in mutual information scores.

This blog post is not intended to introduce information theory, for which there are numerous sources! Rather, I am concerned with demonstrating how to approach the problem of computing confidence intervals on observed measures estimated from samples.

As well as plotting information and entropy scores, we may be interested in testing whether two or more observed scores are significantly different. Both tasks are easily achieved once we have a method for computing their intervals, which is discussed in detail in Wallis (forthcoming).

Information

Estimates of observed Shannon information may be expressed in the following form:

information ι(p) ≡ –log₂(p). (1)

Notes: Some researchers might use Euler’s natural logarithm (ln) rather than log to the base 2, but this simply means the results are scaled differently (they have different units, ‘nats’ instead of ‘bits’).

For consistency of notation across this blog, I have used Greek lower case iota (ι) for ‘information’. This is an observed value, so should be lower case, and we will use the Greek to avoid confusion with the lower case Latin i used for indices. Most sources cite Equation (1) with Greek capital iota, which is indistinguishable from Latin capital ‘I’. This might seem a trivial point but it is essential not to confuse modeled or expected estimates on the one hand, and observed ones on the other.

Equation (1) tells us that we can transform a probability or observed proportion, p, to an information score by applying the negative log function. We simply define a confidence interval for Equation (1) by applying the same function to the interval bounds for p.

This equation can be thought of as a way of projecting the same data expressed in terms of observed proportions onto a different numerical scale. See Wallis (forthcoming), and Reciprocating the Wilson interval.

We must pay attention to the shape of the curve function within the range p ∈ [0, 1]. Equation (1) is monotonically decreasing, that is, ι(p) falls with increasing p. Once transformed, the lower and upper bounds of the interval switch places. In Figure 1, the curve for ι(p) is a continuous dark blue line, and intervals are represented by dashed lines.

Figure 1: Plot of observed information score ι(p) against p, with transformed Wilson intervals (assuming n = 10 and α = 0.05). Continuity-corrected intervals are estimated by moving p outwards by 1/(2n) on either side.

We obtain the interval

ι(p) ∈ (–log₂(w⁺), –log₂(w^–)), (2)

where p ∈ (w^–, w⁺) is the confidence interval for p. Where p = 0, ι(p), and hence its upper bound, tends to infinity.

To obtain the interval (w^–, w⁺) we will use the Wilson score interval (Wilson 1927). This interval can be defined by Wilson functions (Wallis 2021: 111).

w^– = WilsonLower(p, n, α/2),
w⁺ = WilsonUpper(p, n, α/2), (3)

where each is computed by this formula (Wallis 2013):

Wilson score interval (w^–, w⁺) ≡ (p + z²/2n ± z√(p(1 – p)/n + z²/4n²)) / (1 + z²/n), (4)

where z is the two-tailed critical value of the Normal distribution at error level α (written z_α/2 in full).

We may also apply corrections for continuity, etc. in the usual way. Thus in Figure 1 we have also plotted the continuity-corrected interval,

ι(p) ∈ (–log₂(w⁺_cc), –log₂(w^–_cc)),

where

w^–_cc = WilsonLower(max(0, p – 1/(2n)), n, α/2),
w⁺_cc = WilsonUpper(min(1, p + 1/(2n)), n, α/2). (3′)

Whereas the continuity-corrected Wilson score interval is optimum for most purposes, it is also possible to substitute the Clopper-Pearson interval (Wallis 2021: 147) for small samples.

The method is very simple. First calculate an interval for the single proportion and then substitute these bounds into Equation (2). Where p = 0, the upper bound is infinite, but the lower bound is still computable.
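The two steps can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names wilson, wilson_cc and information_interval are my own labels, and alpha is assumed to be the two-tailed error level. The code computes Wilson bounds from Equation (4), optionally moves p outwards by 1/(2n) as in Equation (3′), and then applies –log₂, swapping the bounds as in Equation (2).

```python
import math
from statistics import NormalDist


def wilson(p, n, alpha=0.05):
    """Wilson score interval (w_lower, w_upper), Equation (4).

    alpha is the two-tailed error level (an assumption of this sketch).
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    # The bounds are exactly 0 at p = 0 and 1 at p = 1; set them directly
    # so floating-point rounding cannot leave a tiny residue.
    lower = 0.0 if p <= 0 else max(0.0, (centre - spread) / denom)
    upper = 1.0 if p >= 1 else min(1.0, (centre + spread) / denom)
    return lower, upper


def wilson_cc(p, n, alpha=0.05):
    """Continuity-corrected bounds: move p outwards by 1/(2n), Equation (3')."""
    lower, _ = wilson(max(0.0, p - 1 / (2 * n)), n, alpha)
    _, upper = wilson(min(1.0, p + 1 / (2 * n)), n, alpha)
    return lower, upper


def information_interval(p, n, alpha=0.05, interval=wilson):
    """Interval for -log2(p). The bounds for p swap places, Equation (2)."""
    w_lower, w_upper = interval(p, n, alpha)
    upper = math.inf if w_lower <= 0 else -math.log2(w_lower)
    return -math.log2(w_upper), upper


print(information_interval(0.4, 10))
print(information_interval(0.4, 10, interval=wilson_cc))
print(information_interval(0.0, 10))  # upper bound is infinite
```

Passing the interval function as a parameter makes it straightforward to substitute another interval for p, such as Clopper-Pearson, without changing the transformation step.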

Using the method outlined in Plotting the Wilson distribution (see also Wallis 2021: 297), we can compute the probability density distribution of this information function.

Note. I plotted the distribution for the natural logarithm for Wallis (forthcoming), and the resulting curves in Figure 2 below have the same shape. The x-axis has a different scale because we are using log₂ rather than ln, and the minus sign means the curves are mirror images. But this figure is essentially the same.

Figure 2. Selected distributions of the information interval for ι(p) = –log₂(p). These visualise the predicted distribution of error (uncertainty), according to our model.

Where p = 0.0 (purple line), ι(p) is infinite. The upper bound of the distribution is uncomputable.

Entropy

Entropy is classically defined by an expression in the following form

entropy η ≡ –Σ_i p_i.log_k(p_i) = –(1/ln(k)) Σ_i p_i.ln(p_i), (5)

where we have k values for competing proportions p_i, Σp_i = 1, and the score has k – 1 degrees of freedom. Where a term p_i = 0, the product p_i.log(p_i) = (0 × –∞) is taken to be zero; where p_i = 1, it is zero because log(1) = 0.

Note: I am using Greek lower case eta, η, to emphasise that this is an observed entropy value. Many sources use upper case eta, which looks like a Latin capital ‘H’.

As with information, some sources cite different log scales, usually log₂ for any k-valued application. Others, e.g. Kumar et al. (1986), refer to Equation (5) as ‘normalised’ entropy, η ∈ [0, 1], to avoid confusion.
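To make the definition concrete, here is a short Python sketch of Equation (5). The function name entropy and its optional base argument are illustrative choices of mine. By default the log is taken to base k, so the score is ‘normalised’ to [0, 1], and terms where p_i is 0 or 1 are skipped because they contribute zero.

```python
import math


def entropy(proportions, base=None):
    """Observed entropy of k competing proportions, Equation (5).

    With base=None the log is taken to base k, so the score lies in [0, 1].
    Terms where p_i is 0 or 1 contribute zero and are skipped.
    """
    k = len(proportions)
    if k < 2:
        raise ValueError("need at least two competing proportions")
    if abs(sum(proportions) - 1) > 1e-9:
        raise ValueError("proportions must sum to 1")
    scale = math.log(k if base is None else base)  # the ln(k) scale constant
    return -sum(p * math.log(p) for p in proportions if 0 < p < 1) / scale


print(round(entropy([0.1, 0.9]), 4))               # k = 2, log base 2
print(round(entropy([0.5, 0.3, 0.2]), 4))          # k = 3, log base 3
print(round(entropy([0.5, 0.3, 0.2], base=2), 4))  # same data, in bits
```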

Binomial: Confidence intervals where k = 2

In the special case where k = 2, the second proportion, q, is guaranteed to be 1 – p. This obtains the following non-monotonic function, which has a maximum of 1 at p = 0.5. (We will quote the entropy formula with the single parameter p, since it is uniquely defined by this parameter.)

η(p) = –(p.log₂(p) + (1 – p).log₂(1 – p)), (6)

and η(p) = 0 if p = 0 or 1. We will start by plotting the curve.

Figure 3. Classical two-value entropy curve with example transformed Wilson score intervals (k = 2, n = 10, α = 0.05) at p = 0.1 and 0.4.

By examination of the maximum, we can define η(p) ∈ (η^–, η⁺) where

η^– = min(η(w^–), η(w⁺)),

η⁺ = η(w⁺)  if w⁺ < 0.5,
     η(w^–)  if w^– > 0.5,
     1  otherwise. (7)

If the interval for p includes the maximum, 0.5, then the upper bound will be the maximum entropy (i.e. 1), and the lower bound will be the smaller of the two transformed bounds. On the other hand, if the interval maps onto a monotonic section of the curve then we simply apply the transformation function to each term.
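The case analysis in Equation (7) translates directly into code. The following Python sketch is illustrative only: the names wilson, eta and binary_entropy_interval are mine, and alpha is assumed to be the two-tailed error level.

```python
import math
from statistics import NormalDist


def wilson(p, n, alpha=0.05):
    """Wilson score interval (w_lower, w_upper) for p, Equation (4)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    lower = 0.0 if p <= 0 else max(0.0, (centre - spread) / denom)
    upper = 1.0 if p >= 1 else min(1.0, (centre + spread) / denom)
    return lower, upper


def eta(p):
    """Binary entropy, Equation (6); zero at p = 0 or 1."""
    if p <= 0 or p >= 1:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def binary_entropy_interval(p, n, alpha=0.05):
    """Entropy interval by direct transformation of the bounds, Equation (7)."""
    w_lower, w_upper = wilson(p, n, alpha)
    lower = min(eta(w_lower), eta(w_upper))
    if w_upper < 0.5:        # interval lies on the rising side of the curve
        upper = eta(w_upper)
    elif w_lower > 0.5:      # interval lies on the falling side
        upper = eta(w_lower)
    else:                    # interval contains the maximum at p = 0.5
        upper = 1.0
    return lower, upper


for p in (0.1, 0.4):
    lower, upper = binary_entropy_interval(p, 10)
    print(p, round(lower, 4), round(upper, 4))
```

With n = 10, the interval for p = 0.1 stays on the rising side of the curve, whereas the interval for p = 0.4 contains 0.5, so its upper bound is 1.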

We plot these intervals over p in Figure 4. The overall shape may be a little unexpected, with wider intervals, representing a modeled expectation of greater sampling uncertainty, about non-central entropy scores.

The dotted lines plot η(w^–) and η(w⁺) respectively: the lower bound is the minimum of the scores, whereas the upper bound reaches 1 for all points where the interval contains 0.5. Thanks to the small sample size in the model n, this is quite a large range.

Figure 4. Plotting the upper and lower confidence interval bounds for binary entropy obtained by Equation (7). We have a ‘folded’ interval at p = 0.4.

Where the interval includes 1, we might say the range is ‘folded’. A subset of entropy scores within the interval may be obtained twice, from two different values of p.

For example, where p = 0.4, (w^–, w⁺) = (0.1682, 0.6873). See Figure 3, middle. The entropy scores for these bounds are (η(w^–), η(w⁺)) = (0.6535, 0.8962), but the range includes 1, so the interval is (0.6535, 1].

Since η(w^–) is the lower score, and the range includes 1, any score for p > 0.5 (i.e. from 0.5 to w⁺) obtains a score already accounted for in the range (w^–, 0.5).

This is an example of loss of information resulting from non-monotonic transformation functions. We can’t do very much about this, as it is a direct result of the function. It is similar to the loss of information resulting from representing multi-dimensional differences as a single effect size score.

The interval is also conservative, in that we have taken the minimum of the two transformed bounds. This is a different point.

With this interval, a second proportion p₂ may exceed w⁺ and still obtain entropy values that fall within the interval. Consider the point p₂ = 0.8, whose entropy, η(p₂) = 0.7219, is within the range (0.6535, 1].
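This is easy to check numerically. The Python snippet below is a small illustration, not part of the method: it takes the bounds quoted above as given and confirms that p₂ = 0.8 lies outside the interval for p while its entropy lies inside the folded entropy interval. The function name eta is my own label for Equation (6).

```python
import math


def eta(p):
    """Binary entropy, Equation (6); zero at p = 0 or 1."""
    if p <= 0 or p >= 1:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


lower, upper = 0.6535, 1.0  # folded entropy interval for p = 0.4, n = 10 (from the text)
w_upper = 0.6873            # Wilson upper bound for p (from the text)
p2 = 0.8

print(p2 > w_upper)               # True: p2 is outside the interval for p
print(round(eta(p2), 4))          # 0.7219
print(lower < eta(p2) <= upper)   # True: its entropy is inside the folded interval
```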

In cases like this, there will be a greater than (1 – α) chance that the population score will fall within the folded interval. See also Confidence intervals on goodness of fit ϕ scores. One might use a search procedure to find the optimum point at which the error is corrected, by adjusting the α error parameter in the WilsonLower or WilsonUpper function.

Such a step is legitimate, but it would mean that an entropy difference test will necessarily deviate in performance from the standard Binomial model.

We plot the pdf distribution of the unconstrained interval bounds (the ‘I’-shaped error bars in Figure 4) in Animation 1 below. Since entropy scores reverse for p > 0.5 we have not included these. It should be obvious that the distribution of uncertainty is not Normal! This plot also nicely visualises the ‘folding’ phenomenon at η = 1.

Animation 1. Probability density function distributions for p = 0.0 to 0.5. The x-axis is entropy η. The dotted lines are the distributions for the continuity-corrected interval.

Multinomial: Generalised approximations for k > 2

For k > 2 we use the formula for intervals for k-constrained sums (Wallis forthcoming).

In Equation (5) the ‘ln(k)’ term can be treated as a scale constant. Alternatively, one may simply employ log to the base k in the sum.

Let us define the inner term in the sum as inf(p_i) for each term i = 1… k.

inf(p_i) = –p_i.log_k(p_i). (8)

We set inf(p_i) to zero if p_i is 0 or 1. This function is also non-monotonic over the range p_i ∈ [0, 1]. Both p_i and –log_k(p_i) are monotonic, but one is increasing and the other decreasing, and here their product rises to a peak and then falls.

Next, we need to determine the maximum of the function inf(p_i). This is found at the same value of p_i irrespective of k (recall that ln(k) was a scale factor in Equation (5)). It turns out that the maximum of Equation (8) is where p_i = mˆ = 1/e ≈ 0.367879 (e is Euler’s number). For log to the base 2, this has a maximum score, inf(mˆ) ≈ 0.530738.
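For readers who want to see where 1/e comes from, the following short derivation (my own working, writing the log to base k as ln(p)/ln(k)) differentiates the term and sets the slope to zero:

```latex
\frac{d}{dp}\left(-\frac{p \ln p}{\ln k}\right)
  = -\frac{\ln p + 1}{\ln k} = 0
  \quad\Rightarrow\quad
  \ln p = -1, \qquad \hat{m} = e^{-1} \approx 0.367879.

\mathrm{inf}(\hat{m}) = -\frac{e^{-1}\ln(e^{-1})}{\ln k} = \frac{1}{e \ln k},
  \qquad \text{e.g. } \frac{1}{e \ln 2} \approx 0.530738 \text{ for } k = 2.
```

The position of the maximum does not depend on k, but its height does, since ln(k) scales the whole curve.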

For each term in the sum, we compute an interval by testing if mˆ is within the interval for p_i.

Let us define an interval inf(p_i) ∈ (h_i^–, h_i⁺), where

h_i^– = min(inf(w_i^–), inf(w_i⁺)),

h_i⁺ = inf(w_i⁺)  if w_i⁺ < mˆ,
       inf(w_i^–)  if w_i^– > mˆ,
       inf(mˆ)  otherwise. (9)

We can plot inf(p_i) and the resulting interval, which we do in Figure 5. The plot is similar to Figure 4, although with a different maximum score, and an eccentric cross-over point close to mˆ.

Figure 5. Plotting the function term inf(p) and transformed intervals, peaking at inf(mˆ). The cross-over point (peak of h^–) is not quite at mˆ.

We compute interval widths for the k-constrained sum by

u^– = √(κ Σ(inf(p_i) – h_i^–)²), and u⁺ = √(κ Σ(inf(p_i) – h_i⁺)²), (10)

where kappa κ = k/(k – 1). The resulting interval is simply

η ∈ (η – u^–, η + u⁺). (11)
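Equations (8) to (11) can be assembled into a short Python sketch. This is an illustration under stated assumptions, not the spreadsheet implementation: the names wilson, inf_term, inf_interval and multinomial_entropy_interval are mine, alpha is the two-tailed error level, every proportion is assumed to share the same sample size n, and the optional clip argument (my addition) constrains the result to [0, 1].

```python
import math
from statistics import NormalDist

M_HAT = 1 / math.e  # position of the maximum of inf(p)


def wilson(p, n, alpha=0.05):
    """Wilson score interval (w_lower, w_upper) for p, Equation (4)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    lower = 0.0 if p <= 0 else max(0.0, (centre - spread) / denom)
    upper = 1.0 if p >= 1 else min(1.0, (centre + spread) / denom)
    return lower, upper


def inf_term(p, k):
    """Inner term -p.log_k(p), Equation (8); zero at p = 0 or 1."""
    if p <= 0 or p >= 1:
        return 0.0
    return -p * math.log(p) / math.log(k)


def inf_interval(p, n, k, alpha=0.05):
    """Interval (h_lower, h_upper) for one term, Equation (9)."""
    w_lower, w_upper = wilson(p, n, alpha)
    h_lower = min(inf_term(w_lower, k), inf_term(w_upper, k))
    if w_upper < M_HAT:
        h_upper = inf_term(w_upper, k)
    elif w_lower > M_HAT:
        h_upper = inf_term(w_lower, k)
    else:
        h_upper = inf_term(M_HAT, k)
    return h_lower, h_upper


def multinomial_entropy_interval(proportions, n, alpha=0.05, clip=True):
    """k-constrained sum interval for entropy, Equations (10) and (11)."""
    k = len(proportions)
    kappa = k / (k - 1)
    eta = sum(inf_term(p, k) for p in proportions)
    sq_lower = sq_upper = 0.0
    for p in proportions:
        h_lower, h_upper = inf_interval(p, n, k, alpha)
        sq_lower += (inf_term(p, k) - h_lower) ** 2
        sq_upper += (inf_term(p, k) - h_upper) ** 2
    lower = eta - math.sqrt(kappa * sq_lower)
    upper = eta + math.sqrt(kappa * sq_upper)
    if clip:  # optionally constrain the interval to the range [0, 1]
        lower, upper = max(0.0, lower), min(1.0, upper)
    return lower, upper


print(multinomial_entropy_interval([0.1, 0.9], 40))
print(multinomial_entropy_interval([0.5, 0.3, 0.2], 100))
```

Setting clip=False exposes any overshoot beyond 1, which may be useful when inspecting how the approximation behaves.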

Using this approximation we can obtain intervals for k = 2 and compare the results directly. See Figure 6.

Figure 6. Comparing the performance of the k-constrained method (Equations (10) and (11)) with the Binomial method of direct transformation (Equation (7)) where k = 2. Note that the k-constrained method is slightly conservative at extremes, but the main penalty is in the middle region. The equivalent Wilson interval is included for comparison. See below.

We saw that the maximum interval for inf(p) exceeded 0.5 for a substantial part of the range, so it is unsurprising to see that the upper interval may overshoot. We can correct for this simply by constraining the interval to η ∈ [0, 1], and it is unlikely to be a problem in practice.

It is rather more important to pay attention to the areas where the interval is within the allowable range but more conservative than that obtained by the direct method, notably in the region near 0.5 (from mˆ to 1 – mˆ). This is clearly a result of averaging the slopes between asymmetric peaks of inf(p) and inf(1 – p).

In a research context, we can accept a more conservative but flexible interval method. Even with rounding errors, the interval will tend to be wider than that obtained by direct calculation.

Comparison with Wilson intervals

How does this interval perform compared to a standard error or Wilson approach? Well, a standard error model of variance about observed values is incorrect. But a Wilson-based model is legitimate in p-space. In Figure 6 we also plotted an interval labelled ‘naïve Wilson interval’, which substitutes η for p into the Wilson score interval (Equation (3)).

We know this approach is naïve and incorrect, but what is the scale of the errors produced?

Our derived entropy intervals are much more conservative (aside from the lower bound ‘peak’). However, the main problem is that the scale of conservatism is different on either side of η. In simple terms, the upper bound is approximately 1/2 the correct width, but the lower bound is too small by 1/√2. The following curves are close to our computed scores, and converge with increased n.

w^– = WilsonLower(η, n/2, α/2),
w⁺ = WilsonUpper(η, n/4, α/2).
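As a rough illustration, the two rescaled curves can be computed with an ordinary Wilson score function that accepts a fractional sample size. This is a minimal sketch: the function names `wilson_bounds`, `wilson_lower` and `wilson_upper` are illustrative, the third argument is assumed to be the one-tailed error level (α/2), and the values of η and n are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wilson_bounds(p, n, tail):
    """Wilson score bounds for proportion p and (possibly fractional) n.

    `tail` is assumed to be the one-tailed error level, e.g. alpha / 2.
    """
    z = NormalDist().inv_cdf(1 - tail)
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - spread, centre + spread


def wilson_lower(p, n, tail):
    return wilson_bounds(p, n, tail)[0]


def wilson_upper(p, n, tail):
    return wilson_bounds(p, n, tail)[1]


eta, n, alpha = 0.5, 10, 0.05  # hypothetical entropy score and sample size

# Naive interval: substitute eta for p with the full sample size n.
naive = wilson_bounds(eta, n, alpha / 2)

# Rescaled curves from the text: n/2 for the lower bound, n/4 for the upper.
approx = (wilson_lower(eta, n / 2, alpha / 2),
          wilson_upper(eta, n / 4, alpha / 2))

print("naive Wilson:", naive)
print("rescaled    :", approx)
```

Halving n widens a bound by roughly √2 and quartering n widens it by roughly 2, which is the asymmetry described above. The rescaled curves only approximate the computed entropy bounds; they are not a substitute for them.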

An unequal error is a problem for anyone using a naïve variance-based line or model fitting of entropy scores, for the obvious reason that whenever we draw a line through a set of points, we cannot know which side of the ideal line a datapoint will fall on!

Larger n

Returning to our two competing calculation methods, if you experiment with higher values of n in this spreadsheet, you will see that they obtain a smaller area of discrepancy, but this still represents an identifiable loss of sensitivity. However the difference is rather less dramatic than with the pictured n = 10 (which is a very small sample size for estimating entropy, especially with larger k).

Figure 7. The discrepancy between the two calculation methods reduces with larger samples but it does not completely disappear. In all cases the k-constrained method is more conservative.

For small k, combinatorial maths indicates that this middle region is far more likely to occur in practice, so it is worth considering whether this loss is better avoided by replacing the comparison of Multinomial entropy scores with a series of Binomial evaluations, which would also be more straightforward to interpret. On the other hand, if no single observed proportion dominates and exceeds mˆ = 1/e, the method will not lose much power (it will also perform acceptably if one proportion dominates and exceeds 1 – mˆ).

Larger k

Visualising the performance of intervals with higher levels of dimensionality is quite difficult!

To picture the performance of this interval formula with k = 3 terms, the following animation may be helpful. The upper bound becomes convex when p₃ > mˆ. Note that the horizontal axis represents the remaining space, 1 – p₃.

Animation 2. Entropy intervals for k = 3, n = 10 and α = 0.05, permuting values of p₃ from 0.0 to 0.9. The horizontal axis is p₁, and p₂ = 1 – p₃ – p₁.

This performance is driven by the shape of the ‘inf’ function (Equation (7)) and its interval (8). See Figure 4.

In brief, when p₃ = 0, the full range of the horizontal axis in Figure 4 (from 0 to 1) is available to p₁ (and p₂), and inf(p₃) = 0. This obtains a set of curves similar to Figure 5, but adjusted by the 95% interval for inf(p₃) ∈ (h₃^–, h₃⁺) = (0, 0.3238).

As p₃ increases, the available range reduces, so that when p₃ = 0.3, say, p₁ ranges from 0 to 0.7. inf(p₃) = 0.3288 with interval (0.2186, 0.3349), and the curve is generated by ‘inf’ functions for p₁ and p₂ over the remaining range.

Where p₃ = 1, p₁ = p₂ = 0, so η = 0, with a 95% confidence interval (0, 0.6190).

As before, note we are using a rather unrealistically small sample size, n = 10, with three outcomes, hence the wide interval. This is useful to expose any unusual behaviour of functions and to confirm to us the method is robust, but few studies will rely on such small samples. Indeed, one would generally assume that n >> k.

Conclusions

We have demonstrated how to obtain confidence intervals for information and entropy estimates obtained from samples. The interval for an observed information score is a monotonic transformation of a Binomial interval for the simple proportion.

How do we test if two information scores are significantly different? Since ι(p) is a monotonic function of p, we can employ a contingency table and test (Fisher, 2 × 2 χ² or Newcombe-Wilson). If the test is significant, the scores must differ.

We derived two methods of computation for entropy confidence intervals, one for simple Binomial alternatives, where k = 2, and a more general approximation for Multinomial outcomes. Provided that proportions {p_i} are properly Multinomial (and therefore free to vary), the Binomial approximation and k-constrained sum are appropriate. The latter is more conservative than the method of direct transformation for k = 2, so in these cases the Binomial method is preferable.

Armed with interval methods, we can also employ Zou and Donner’s (2008) difference theorem for comparing the significant difference between any two independently observed properties. See Wallis (forthcoming) and An algebra of intervals.

These methods are robust and, since they are based on the Wilson score interval, they are also capable of accepting a continuity correction and other types of sampling adjustment. We plot the continuity-corrected interval for information scores by way of example: we could do likewise for entropy, however in this blog post it is more important to focus on the difference between alternative methods of computation. Nonetheless whenever we refer to the Wilson score interval we are really referring to a class of configurable methods.

References

Kumar, U., V. Kumar and J.N. Kapur (1986). Normalised measures of entropy. International Journal of General Systems 12:1, 55-69.

Shannon, C. E. and W. Weaver (1949). The Mathematical Theory of Communication. Urbana, Illinois: University of Illinois Press.

Wallis, S.A. (2013). Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods. Journal of Quantitative Linguistics 20:3, 178-208. » Post

Wallis, S.A. (2021). Statistics in Corpus Linguistics Research. New York: Routledge. » Announcement

Wallis, S.A. (forthcoming). Accurate confidence intervals on Binomial proportions, functions of proportions and other related scores. » Post

Wilson, E.B. (1927). Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association 22, 209-212.

Zou, G.Y. & A. Donner (2008). Construction of confidence limits about effect measures: A general approach. Statistics in Medicine, 27:10, 1693-1702.

Confidence intervals for the ratio of competing dependent proportions

Sean — Mon, 27 Jun 2022 12:03:16 +0000

Introduction

How do we compute the confidence interval for the ratio of competing dependent proportions, where p₁ and p₂ are drawn from the same set of outcomes?

We have discussed elsewhere on this blog how we might employ the Zou and Donner risk ratio method for independent proportions (Zou and Donner 2008).

But what should we do if proportions are in competition, such that should one proportion increase the other must fall? In a recent paper on end weight (Wallis 2022a), I wished to obtain an interval for the ratio between two competing patterns, i.e. f₁ / f₂ where f₁ and f₂ are the total number of each outcome.

For example, there are n = 42 conjoin patterns with a single postmodifier (initial: 8, final: 34) for a particular pattern (PP postmodification of conjoined PPs) across all of ICE-GB. I wanted to know the following:

What is the mean ‘end weight ratio’, i.e. the ratio of cases in the final relative to the initial position?

And what is the confidence interval for this variable?

In an early draft, I employed the Zou and Donner (2008) method for observed proportions p₁ / p₂ = f₁ / f₂. See An algebra of intervals. However, these outcomes are not independent: they are alternate patterns in direct competition. Indeed, the premise is that the conjoins could be reversed. So this method is not correct.

Recall that we already know how to calculate their difference. We need to calculate their ratio.

For a pair of dependent proportions, p₂ = 1 – p₁, the difference interval for d = p₂ – p₁ = 1 – 2p₁ is simply

d ∈ (1 – 2w₁⁺, 1 – 2w₁^–), (1)

where w₁^– and w₁⁺ are the lower and upper bounds of the Wilson score interval for p₁. See also Comparing frequencies within a discrete distribution, Wallis (2022b), and Figure 1 below.

So we already have a difference interval for dependent proportions of this type. The equivalent difference interval for independent proportions is the Newcombe-Wilson interval about d, which we discuss below.

Example data

We’ll use some invented data to illustrate our working.

Consider two frequencies, f₁ = 15, f₂ = 30. We obtain p₁ = 0.3333 ∈ (0.2136, 0.4793) for α = 0.05. Likewise, p₂ = 1 – p₁ = 0.6667 ∈ (0.5207, 0.7864).

Since the interval for p₁ does not include 0.5, the difference is significant. The interval for p₂ is directly dependent on the interval for p₁, mirroring it, and also does not include 0.5. Let’s call this the mirror principle for simplicity.

The difference interval for d = 0.3333 ∈ (0.0414, 0.5728) by Equation (1) excludes zero.
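The figures above can be reproduced in a few lines. The following is a minimal sketch, assuming the standard Wilson score formula without continuity correction; the helper name `wilson` is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def wilson(p, n, alpha=0.05):
    """Wilson score interval (w-, w+) for a proportion p observed in n cases."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - spread, centre + spread


f1, f2 = 15, 30          # example frequencies from the text
n = f1 + f2
p1 = f1 / n

w_lo, w_hi = wilson(p1, n)

# Mirror principle: the interval for p2 = 1 - p1 is the reflection of p1's.
p2_interval = (1 - w_hi, 1 - w_lo)

# Equation (1): difference interval for d = p2 - p1 = 1 - 2*p1.
d = 1 - 2 * p1
d_interval = (1 - 2 * w_hi, 1 - 2 * w_lo)

print(f"p1 = {p1:.4f} in ({w_lo:.4f}, {w_hi:.4f})")
print(f"p2 = {1 - p1:.4f} in ({p2_interval[0]:.4f}, {p2_interval[1]:.4f})")
print(f"d  = {d:.4f} in ({d_interval[0]:.4f}, {d_interval[1]:.4f})")
```

Because the interval for d excludes zero exactly when the interval for p₁ excludes 0.5, the two readings of the test always agree.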

Figure 1: Left, mirrored 95% Wilson score intervals for p₂ = 1 – p₁, right, unadjusted and adjusted Newcombe-Wilson intervals on the difference d.

To read Figure 1 as a test, note that if the intervals on the left ‘touch’ (cross the 0.5 line), the proportions are not significantly different.

Likewise if the difference interval on the right crosses the 0 line, the difference between p₁ and p₂ is not large enough to be judged significant.

We will discuss two methods for computing the interval, one more complex than the other. Let us start with the more complex before discussing a short cut.

Method 1: Employing the Newcombe-Wilson difference interval

Suppose we convert this interval to a Newcombe-Wilson difference interval (Newcombe 1998). This interval assumes that proportions are independent and free to vary from 0 to 1. What would we require individual width terms to be?

The zero-based interval may be written as

(w_d^–, w_d⁺) = (–√[(u₁^–)² + (u₂⁺)²], √[(u₁⁺)² + (u₂^–)²]), (2)

where u_i^– = (p_i – w_i^–), u_i⁺ = (w_i⁺ – p_i). We can also express this as an interval about d:

d ∈ (d^–, d⁺) = d – (w_d^–, w_d⁺) = (d – w_d⁺, d – w_d^–). (3)

In our example, the interval widths for p₁ are

u₁^– = (p₁ – w₁^–) = 0.1197 and
u₁⁺ = (w₁⁺ – p₁) = 0.1460.

Due to the mirror principle with dependent proportions, u₂^– = u₁⁺ and u₂⁺ = u₁^–. Equation (2) may now be simplified to

(w_d^–, w_d⁺) = (–√[2(u₁^–)²], √[2(u₁⁺)²]).

However, the resulting interval is too narrow. See Figure 1, right, first interval. Why is this?

Figure 2. Sketch of Newcombe-Wilson zero-based lower bound width, with equal input widths u₂⁺ = u₁^–, obtained by the Bienaymé method of summing variances.

Consider Figure 2, which depicts how a (negated) Newcombe-Wilson interval width, w_d^–, is obtained by Equation (2) where u₂⁺ = u₁^–. The correct critical distance is 2u₁^–: if inner intervals overlap at all the result is not significant. But if we employ the Pythagorean (Bienaymé) sum to obtain w_d^– as shown, we obtain √2.u₁^–.

In this figure the area represents variance, lengths represent standard deviation and interval widths. The area is a triangle of a half unit square. We must double the variance.

Or, to put it another way, we have to multiply the standard deviation/interval width by a further √2 ≅ 1.4142.

The lower bound term, w_d^–, then becomes –√[2κ(u₁^–)²] = –2u₁^–, where κ = 2 is the adjustment factor discussed in the next section, and w_d⁺ = 2u₁⁺:

(w_d^–, w_d⁺) = (–2u₁^–, 2u₁⁺). (2′)

d ∈ (d^–, d⁺) = (d – 2u₁⁺, d + 2u₁^–). (3′)

Where d = 1 – 2p₁, u₁^– = (p₁ – w₁^–), u₁⁺ = (w₁⁺ – p₁), Equations (3′) and (1) become equivalent. Allowing for rounding errors, this reformulation has identical performance to the single-sample z test against 0.5 (Wallis 2021: 166).
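The equivalence can be checked numerically with the example data. This is a minimal sketch, assuming an uncorrected Wilson interval; `wilson` is an illustrative helper name. It computes the unadjusted Newcombe-Wilson widths of Equation (2), scales them by √κ (equivalent to multiplying the variance by κ = 2), and compares the result with Equation (1).

```python
from math import sqrt
from statistics import NormalDist


def wilson(p, n, alpha=0.05):
    """Wilson score interval (w-, w+) for a proportion p observed in n cases."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - spread, centre + spread


f1, f2 = 15, 30
n = f1 + f2
p1 = f1 / n
d = 1 - 2 * p1

w_lo, w_hi = wilson(p1, n)
u1_minus, u1_plus = p1 - w_lo, w_hi - p1
u2_minus, u2_plus = u1_plus, u1_minus      # mirror principle

# Equation (2): unadjusted zero-based Newcombe-Wilson widths.
wd_minus = -sqrt(u1_minus ** 2 + u2_plus ** 2)
wd_plus = sqrt(u1_plus ** 2 + u2_minus ** 2)
unadjusted = (d - wd_plus, d - wd_minus)   # Equation (3)

# Equations (2') and (3'): scale each width by sqrt(kappa), kappa = 2.
kappa = 2
adjusted = (d - sqrt(kappa) * wd_plus, d - sqrt(kappa) * wd_minus)

# Equation (1), computed directly from the Wilson bounds.
equation_1 = (1 - 2 * w_hi, 1 - 2 * w_lo)

print("unadjusted:", unadjusted)
print("adjusted  :", adjusted)
print("Eq. (1)   :", equation_1)
```

The unadjusted interval sits strictly inside the adjusted one, which is the ‘too narrow’ behaviour shown in Figure 1.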

A ratio of competing dependent proportions

In order to generalise this result for risk ratios and other formulae (Wallis 2022b) we use ‘k-adjusted’ interval widths. These are premised on the fact that by switching from independent to dependent proportions we have lost a degree of freedom.

We multiply the variance by κ = k / (k – 1), where k is the number of outcomes, i.e. 2. We therefore multiply the interval widths by the square root of 2:

u′₁^– = √2.(p₁ – w₁^–),
u′₁⁺ = √2.(w₁⁺ – p₁), (4)

which gives us adjusted absolute intervals as

p₁ ∈ (w′₁^–, w′₁⁺) = (p₁ – u′₁^–, p₁ + u′₁⁺),
p₂ ∈ (w′₂^–, w′₂⁺) = (p₂ – u′₁⁺, p₂ + u′₁^–). (5)

Note that these intervals may ‘overshoot’ (exceed the probabilistic range P = [0, 1]), although when Equation (4) is located on a difference interval (by Equation (3′)) the resulting interval remains within bounds.

Nonetheless, we can now introduce these terms into Zou and Donner’s theorem (Zou and Donner 2008). This may be written as

(L, U) ≡ (θˆ₁ – θˆ₂ – √[(θˆ₁ – l₁)² + (u₂ – θˆ₂)²], θˆ₁ – θˆ₂ + √[(u₁ – θˆ₁)² + (θˆ₂ – l₂)²]), (6)

where (l_i, u_i) are the lower and upper interval bounds for parameter θˆ_i.

To compute a risk ratio for p₁ / p₂ we must substitute the following:

θˆ₁ = ln(p₁), with bounds (l₁, u₁) = (ln(w₁^–), ln(w₁⁺)), and
θˆ₂ = ln(p₂), with bounds (l₂, u₂) = (ln(w₂^–), ln(w₂⁺)), (7)

where ‘ln’ is the natural logarithm, and parameters p₁ and p₂ are independent. This obtains an interval for the difference in logs (i.e. the ratio) on a logarithmic scale, which we convert to the Real scale by employing the inverse log (exp) function, i.e. (exp(L), exp(U)).

Next, to employ dependent proportions, we substitute our adjusted interval bounds, w′₁^–, etc. from Equation (5) into Equation (7).

Our example data, p₁/p₂ = 0.5 and n = 45, yields the following

θˆ₁ = ln(p₁) = -1.0986, l₁ = -1.8080, u₁ = -0.6166, and
θˆ₂ = ln(p₂) = -0.4055, l₂ = -0.7760, u₂ = -0.1791.

This obtains (L, U) = (-1.4377, -0.0852) on the log scale, and the following interval for the dependent risk ratio:

ratio r = p₁/p₂ = 0.5 ∈ (0.2375, 0.9183).
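The whole of Method 1 can be sketched as a single function, chaining Equations (4) to (7) into Equation (6). This is an illustrative sketch rather than a definitive implementation: the names `wilson` and `dependent_ratio_method1` are hypothetical, no continuity correction is applied, and there is no guard against adjusted bounds that fall to zero or below (in which case `log` raises a `ValueError`).

```python
from math import sqrt, log, exp
from statistics import NormalDist


def wilson(p, n, alpha=0.05):
    """Wilson score interval (w-, w+) for a proportion p observed in n cases."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - spread, centre + spread


def dependent_ratio_method1(f1, f2, alpha=0.05):
    """Interval for f1/f2 where the two outcomes compete (k = 2)."""
    n = f1 + f2
    p1, p2 = f1 / n, f2 / n
    w_lo, w_hi = wilson(p1, n, alpha)

    # Equation (4): k-adjusted widths, kappa = k / (k - 1) = 2.
    k = 2
    kappa = k / (k - 1)
    u_minus = sqrt(kappa) * (p1 - w_lo)
    u_plus = sqrt(kappa) * (w_hi - p1)

    # Equation (5): adjusted bounds for p1 and, by mirroring, for p2.
    w1 = (p1 - u_minus, p1 + u_plus)
    w2 = (p2 - u_plus, p2 + u_minus)

    # Equation (7): move to the log scale.
    t1, l1, u1 = log(p1), log(w1[0]), log(w1[1])
    t2, l2, u2 = log(p2), log(w2[0]), log(w2[1])

    # Equation (6): Zou and Donner difference interval on the log scale.
    L = t1 - t2 - sqrt((t1 - l1) ** 2 + (u2 - t2) ** 2)
    U = t1 - t2 + sqrt((u1 - t1) ** 2 + (t2 - l2) ** 2)
    return exp(L), exp(U)


print(dependent_ratio_method1(15, 30))   # example data, ratio 0.5
```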

Method 2: Functional reformulation: the odds

The second method is refreshingly simple. The formula p₁/p₂ = p₁/(1 – p₁) is an increasing monotonic function of p₁, which we can write as fn(p₁) = p₁/(1 – p₁).

In Reciprocating the Wilson interval we learned that any monotonic function of a proportion can be given a confidence interval by simply applying the same function to the bounds and arranging the interval bounds in increasing order. We might write this simply as:

(L, U) = (fn(w_i^–), fn(w_i⁺)) if fn is increasing,
(L, U) = (fn(w_i⁺), fn(w_i^–)) if fn is decreasing. (8)

For a non-monotonic function, we study the turning points (local minima or maxima), and determine whether or not the interval includes them. If the interval does not include a turning point, we can use Equation (8). Intervals containing turning points should be evaluated carefully. (For an example of this reasoning, see Confidence intervals on goodness of fit ϕ scores.)
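Equation (8) translates almost directly into code. The sketch below assumes the caller knows whether the function is increasing or decreasing over the interval; the name `transform_interval` is illustrative, and the bounds are the rounded Wilson bounds for p₁ quoted earlier.

```python
def transform_interval(fn, bounds, increasing=True):
    """Apply a monotonic function to interval bounds, as in Equation (8).

    For a decreasing function the transformed bounds swap places, so the
    result is always returned in increasing order.
    """
    lower, upper = bounds
    if increasing:
        return fn(lower), fn(upper)
    return fn(upper), fn(lower)


p1_interval = (0.2136, 0.4793)   # Wilson bounds for p1 from the example data

# 1 - p is decreasing: this recovers the mirrored interval for p2.
p2_interval = transform_interval(lambda p: 1 - p, p1_interval, increasing=False)

# 1 / p is also decreasing on (0, 1].
reciprocal = transform_interval(lambda p: 1 / p, p1_interval, increasing=False)

print("p2  :", p2_interval)
print("1/p1:", reciprocal)
```

The function does not check monotonicity itself. For a non-monotonic function, the caller would first need to confirm that no turning point lies inside the interval.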

In fact the resulting ratio has a common name. It is the odds, the ratio between two frequencies f(A) and f(B), or fn(p) = odds(p) = p/(1 – p).

In our case, fn is increasing and monotonic, so the interval for odds(p₁) is simply

o = odds(p₁) ∈ (w₁^–/(1 – w₁^–), w₁⁺/(1 – w₁⁺)),

so, for our example data

odds o = 0.5 ∈ (0.2716, 0.9205).
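For comparison with Method 1, here is a minimal sketch of Method 2 applied to raw frequencies. It assumes an uncorrected Wilson interval; the names `wilson` and `odds_interval` are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def wilson(p, n, alpha=0.05):
    """Wilson score interval (w-, w+) for a proportion p observed in n cases."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - spread, centre + spread


def odds(p):
    """odds(p) = p / (1 - p), increasing in p; undefined at p = 1."""
    return p / (1 - p)


def odds_interval(f1, f2, alpha=0.05):
    """Odds f1/f2 with an interval obtained by transforming Wilson bounds."""
    n = f1 + f2
    p1 = f1 / n
    w_lo, w_hi = wilson(p1, n, alpha)
    # odds() is increasing, so the bounds keep their order.
    return odds(p1), (odds(w_lo), odds(w_hi))


o, (o_lo, o_hi) = odds_interval(15, 30)
print(f"odds = {o:.4f} in ({o_lo:.4f}, {o_hi:.4f})")
```

Only one Wilson interval and one function evaluation per bound are needed, with no logarithms. The sketch does not handle f₂ = 0, where the odds are undefined.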

This interval is narrower, with a slightly increased upper bound in this case. The approximations introduced by the log transform and the Zou and Donner theorem have had a small cost.

Performance

We can visualise how these intervals for the ratio of dependent proportions perform when compared to the unadjusted (independent proportion) risk ratio interval. With sample size n = 10 and α = 0.05, we obtain the following plot (Figure 3).

Note that Method 1 does not compute for the entirety of the probabilistic range. For small p₁ < 0.2125, the adjusted lower bound for p₁, w′₁^–, falls below zero (‘overshoots’) and its log transform is uncomputable. In this case we treat the resulting lower bound as L → 0. Similarly, by the mirror principle, where p₁ > 0.7875 (so that p₂ < 0.2125), the adjusted lower bound for p₂, w′₂^–, also falls below zero and the upper bound U → ∞.
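The overshoot can be seen by tabulating the k-adjusted lower bounds for n = 10. This is a minimal sketch, assuming an uncorrected Wilson interval; `wilson` and `k_adjusted_bounds` are illustrative names, and the grid of p₁ values is arbitrary. Because the Wilson interval is symmetric under p → 1 – p, the adjusted lower bound for p₂ can be obtained by calling the same function with 1 – p₁.

```python
from math import sqrt
from statistics import NormalDist


def wilson(p, n, alpha=0.05):
    """Wilson score interval (w-, w+) for a proportion p observed in n cases."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - spread, centre + spread


def k_adjusted_bounds(p, n, alpha=0.05, k=2):
    """k-adjusted bounds (w'-, w'+): Wilson widths scaled by sqrt(k/(k-1))."""
    w_lo, w_hi = wilson(p, n, alpha)
    scale = sqrt(k / (k - 1))
    return p - scale * (p - w_lo), p + scale * (w_hi - p)


n = 10
for p1 in (0.1, 0.2, 0.25, 0.5, 0.75, 0.8, 0.9):
    lower1 = k_adjusted_bounds(p1, n)[0]       # w'1-
    lower2 = k_adjusted_bounds(1 - p1, n)[0]   # w'2-, by mirroring
    if lower1 <= 0:
        note = "ln(w'1-) undefined, so L -> 0"
    elif lower2 <= 0:
        note = "ln(w'2-) undefined, so U -> infinity"
    else:
        note = "both bounds computable"
    print(f"p1 = {p1:.2f}: w'1- = {lower1:+.4f}, w'2- = {lower2:+.4f}  {note}")
```

A practical implementation of Method 1 would apply the same checks before taking logarithms and substitute 0 or ∞ for the affected bound.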

Figure 3. Plot of two methods for computing (r^–, r⁺), α = 0.05, n = 10, across diagonal values of p₁ and p₂, where p₁ = 1 – p₂, with difference intervals for comparison. Inner dashed line: independent proportion interval, outer: dependent proportion interval.

As we would expect, the dependent proportion interval is more conservative than its independent proportion counterpart. This is because any variation away from the mean by one proportion is mirrored by a movement away from the mean by the other.

In Figure 3, the dependent and independent difference intervals for d ∈ (d^–, d⁺) computed with Equations (1) and (3) respectively, are shown for comparison. These intervals also have the same property.

Method 1 is generally more conservative than the direct method, Method 2. We should always try functional reformulation first. With Method 2, as p₁ increases, the upper bound of p₁/(1 – p₁) tends to converge on the interval for the independent proportion. Likewise, as p₁ tends to zero, the lower bound tends to converge on the same interval.

Wallis (2022b) points out that a ‘significant difference’ may be reported should a ratio interval exclude 1, or a difference interval exclude 0. This means we can compare the performance visually by examining the cross-over points indicated in Figure 3.

Observe how both Method 1 and 2 closely align at upper and lower bounds at this cross-over point, and match the thresholds for the difference interval. There is a small error introduced by computing the interval on a logarithmic scale (see Wallis 2022b), however, using exhaustive computation and Fisher weighting with independent proportion intervals, we previously found that these additional errors had a marginal effect on their application when evaluated as a significance test.

Note that for a different test outcome, the interval bounds must cross the test condition (1 or 0) on either side of a natural fraction of n (in Figure 3, on either side of a horizontal axis marker). Even if there is a slight difference in the cross-over point for 1 and 0, what actually matters for the researcher is if one test obtains a significant result and the other does not.

However, when we consider the overall performance of the ratio interval, the improvements gained by using Method 2 along the entirety of the range are obvious.

Conclusions

We have demonstrated two methods for obtaining a confidence interval for the ratio of two frequencies or proportions in competition, i.e. the odds.

Although parameter interval bounds may overshoot, and logarithms become uncomputable, Method 1 is still well-defined, i.e. the lower bound L → 0, or upper bound U → ∞. However, a better approach is to simply reformulate the expression as a function of a single proportion, permitting a direct translation of Wilson score interval bounds.

If we apply both methods to our original end weight data (initial: 8, final: 34), we obtain intervals of 34/8 = 4.25 ∈ (1.95, 13.13) with Method 1, and 4.25 ∈ (2.00, 9.01) with Method 2. We might report the latter as saying that with a best estimate of 4.25, there are between 2 and 9 times more conjoin-final cases than conjoin-initial (at a 95% confidence level).

This does not mean that we should reject the approach of Method 1 (indeed, it may be unavoidable), but it does mean we should recognise that it will tend to be conservative.

In Wallis (2022b), we showed how this k-adjustment method might be employed in closed sums of functions of proportions, such as when we compute intervals for ‘goodness of fit’ effect sizes. Here we know that proportions sum to 1, and we have one fewer degrees of freedom than the number of summed terms.

However, where possible we should employ direct reformulation.

As well as the ratio and difference, product, power and logarithm formulae are easy enough to derive. The product p(1 – p) is non-monotonic over the range of P, whereas the sum p + (1 – p) = 1 ∈ [1, 1]! Other properties, such as percentage difference, may be obtained by direct reformulation, for example:

percentage difference d^% = (p₂ – p₁)/p₂ = (2p₂ – 1)/p₂ = 2 – 1/p₂.
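As a worked illustration using the earlier example data (f₁ = 15, f₂ = 30), the percentage difference can be given an interval in the same way. The function 2 – 1/p₂ is increasing in p₂ for p₂ > 0, so the Wilson bounds for p₂ map directly onto bounds for d^%. This is a sketch under those assumptions, with illustrative names `wilson` and `pct_diff`.

```python
from math import sqrt
from statistics import NormalDist


def wilson(p, n, alpha=0.05):
    """Wilson score interval (w-, w+) for a proportion p observed in n cases."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - spread, centre + spread


def pct_diff(p2):
    """Percentage difference (p2 - p1) / p2 = 2 - 1/p2, increasing for p2 > 0."""
    return 2 - 1 / p2


f1, f2 = 15, 30
n = f1 + f2
p2 = f2 / n

w2_lo, w2_hi = wilson(p2, n)

d_pct = pct_diff(p2)
d_pct_interval = (pct_diff(w2_lo), pct_diff(w2_hi))

print(f"d% = {d_pct:.4f} in ({d_pct_interval[0]:.4f}, {d_pct_interval[1]:.4f})")
```

The lower bound is above zero exactly when the Wilson lower bound for p₂ is above 0.5, so this interval agrees with the difference and odds intervals as a significance test.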

Direct reformulation is necessarily consistent with the original Binomial interval, is highly generalisable, and may also be employed with corrections for continuity, finite population and random text sampling.

References

Newcombe, R.G. 1998. Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine, 17, 873-890.

Wallis, S.A. 2021. Statistics in Corpus Linguistics Research: A new approach. Routledge: New York. » Announcement

Wallis, S.A. 2022a. Directional evidence revisited: End weight bias and templating in conjoined phrase postmodification. London: Survey of English Usage. » Post

Wallis, S.A. 2022b. Accurate confidence intervals on Binomial proportions, functions of proportions, algebraic formulae and effect sizes. London: Survey of English Usage. » Post

Zou, G.Y. & Donner, A. 2008. Construction of confidence limits about effect measures: A general approach. Statistics in Medicine, 27(10), 1693-1702.

Directional evidence revisited

Sean — Thu, 16 Jun 2022 07:45:43 +0000

End weight bias and templating in conjoined phrase postmodification

Abstract Full Paper (PDF)

The tendency of speakers and writers to place larger constructions at the end of sentences, whether consciously or unconsciously, is well established. This question of ‘end weight’ is usually discussed in relation to grammatical transformations. In this short paper we demonstrate a simple method for investigating a similar phenomenon in coordination patterns where conjoins are either noun phrases, e.g. the X of Y or Z, or prepositional phrases, e.g. the X of Y or of Z. We then investigate whether the coordinated noun phrases (Y, Z) are themselves postmodified, either by another prepositional phrase or by a clause. As postmodifying phrases and clauses are potentially expansive, they are grammatically complex and we operationalise them as signifiers of ‘weight’. We find that both sets of coordination patterns are end-sequence biased by weight.

We also find that patterns in which both the first and last conjoins in the sequence are postmodified occur more frequently than would be expected were they independently selected. Setting aside potential explanations of directional influence, which cannot be decided inductively, we focus instead on the content of these doubly-postmodified constructions and examine them for evidence of templating, i.e. lexical-syntactic repetition.

We also show that these results are not explicable by semantic ordering in coordination, and contrast evidence from prepositional and clausal postmodification with that from premodifying adjective phrases, where scope ambiguity may also be a factor.

1. Introduction

Are phrases at the end of a coordination sequence of conjoined phrases larger, more complex or ‘heavier’ than those at the start?

The principle of ‘end weight’ is often discussed in the context of empirical evidence of information structuring (see e.g. Kaltenböck 2020): moreover, students of English are taught to position larger constructions at the end of utterances (Cowan 2008). Similarly, studies of the dative alternation with the double object construction – Aden gave the prize to Beth (dative) vs. Aden gave Beth the prize (double object) – have observed that the size of the movable object (the prize) appears to be a factor in its position (Bresnan, Cueni, Nikitina and Baayen 2005).

However a freer structure for study – one that requires no additional transformative device such as extraposition or double-object constructions – is the coordination of like phrases.

If there is a general cognitive or communicative principle engaged in extraposition and other broadly semantically neutral transformations such as the dative alternation, it seems likely that coordination is also end-weighted, i.e. the hypothesis is that the final conjoin would tend to be ‘heavier’ than earlier ones. Cognitively, such a method would minimise interruptions to the producer’s attention, and allow them to concentrate on the coordinated phrase sequence itself. Communicatively, end-weight strategies package information to the recipient without large potentially distracting diversions, a principle also termed ‘end focus’. Whereas explicit teaching tends to prioritise conscious communicative purposes, as linguists we are usually more interested in evidence of spontaneous biases.

Since planning is more difficult to employ in spontaneous speech than edited writing, observing differences between speech and writing may help us distinguish explanations.

An important method that adds ‘weight’ to phrases is noun phrase postmodification, typically by clauses and preposition(al) phrases (PPs). This is not the only method for adding weight: alternatives include introduction of premodifying adjective and determinative phrases, adjuncts, ‘floating’ postmodifiers, or the use of compound nouns. However, since a clause or PP may itself be expanded, their introduction opens the door to potentially unlimited constructions.

In a sequence of like conjoins, the same structures could be added to any conjoin, but on the principle of end weight, we hypothesize they tend to be found at the end of a sequence rather than at the start.

Such a pattern could arise in at least two ways. A speaker may plan ahead to place weightier conjoins at the end of a sequence. Alternatively, it is also possible that, having introduced a particularly lengthy construction, a speaker might then decide to stop the coordination sequence.

One potential reason for postmodification end-weighting in conjoins concerns ambiguity of scope. Adjective premodification of nouns is well known to exhibit this phenomenon, cf. the old men and women.

Let us consider a simple example. Example (1) consists of a noun phrase with conjoined postmodifying (NPPO) prepositional phrases (PPs) identified by brackets:

(1) …a systematic adoption [of the ideals [of Bildung]] and [of the German middle class way [of life]] [S2B-042 #47]

It would be entirely possible to rewrite this noun phrase as Example (1′).

(1′) …a systematic adoption [of the German middle class way [of life]] and [of the ideals [of Bildung]]

However (1′) is slightly ambiguous. Are the ‘ideals’ systematically adopted, or are they part of ‘the German middle class way’? Arguably, the original example (1) is ambiguous for the same reason! In speech, intonation may help. In summary, the positioning of constructions can aid in resolving ambiguity, provided that the speaker plans ahead.

However a more substantive issue concerns ordering. Some coordination patterns are semantically sequenced by the conjunctions. Consider (2) and (3) below.

(2) …having a degree in say English Literature or <,> uh Greek and Latin whatever …only says something about your ability [in that area] and not [in the wider areas [of life]]… [S1B-029 #153]

(3) …the consequences of these proposals for the movement of traffic [outside the areas immediately affected], and particularly [in the direction [of the A3]].

Example (2) is exclusionary, (3) is specificatory. Reversing the conjoins is quite difficult.

(2′) …having a degree in say English Literature or <,> uh Greek and Latin whatever …only says something about your ability not [in the wider areas [of life]] but [in that area]…

(3′) …the consequences of these proposals for the movement of traffic particularly [in the direction [of the A3]], and also [outside the areas immediately affected].

Rewritten examples seem quite strained, especially the specificatory ones. It seems more straightforward in English to start with a broader concept and then narrow it, than to present a narrow concept and widen it.

This might affect a result otherwise attributed to ‘end weight’. In these ordered examples there may be logical-semantic reasons why the second conjoin, because it represents a subset of the first (whether excluded or specified), might tend to be more complex and grammatically ‘heavier’.

This type of reasoning does not apply to (4), which is ordered logically. There is no particular reason why the consequent (the second conjoin) is ‘heavier’ than the antecedent (the first).

(4)In the fixed dunes, [with their much higher organic content,] and therefore [with a greater proportion [of fine particles]]…_{[W2A-022 #75]}

For the purposes of the present study we will first pool ordered and unordered examples alike. In Section 3.3 we review our data by repeating our experiments, requiring ‘and’ or ‘or’ to immediately precede the last conjoin, and thereby obtain a dataset of unordered cases.

2. Experiments

2.1 Conjoined prepositional phrases containing noun phrases postmodified by PPs

We obtain data from the fully-parsed British Component of the International Corpus of English (ICE-GB, Nelson, Wallis and Aarts 2002).

All of the experiments obtain data by the following approach. We construct four Fuzzy Tree Fragments (FTFs) according to a single schema, and extract data using ICECUP. The yellow nodes are optional, so we have four versions of this FTF (neither, initial, final, both).

In our first experiment we will use the schema in Figure 1. We relax the constraint that the PP must immediately follow the noun phrase head (indicated by a white ‘After’ arrow, rather than a black ‘Immediately after’ arrow). Should any other element fall between the head and PP, the FTF will still find it. However, this relaxation has a drawback. The FTF matches cases with multiple postmodifiers more than once, creating duplicate matches, so we should review all our results and subtract any duplicates manually.

Figure 1. FTF schema: optional NP postmodification in conjoined prepositional phrases (ordered or unordered sequences). Four FTFs are constructed, with the right-most NPPO, PP nodes present or removed.


Using this schema, for all ICE-GB data, we obtain values for the highlighted cells and construct a contingency table by subtraction (Table 1). The FTF with both ‘NPPO, PP’ nodes yields 13 cases, the FTF with a postmodifying PP node in the first position matches all of these plus another 8.
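The subtraction step can be sketched in a few lines of Python. This is a minimal illustration rather than part of the original workflow: the function and variable names are hypothetical, and it assumes that the FTF with neither optional node present matches every case (the grand total in Table 1).

```python
# Frequencies returned by the four FTF queries (totals reported in the text
# and Table 1). Variable names are illustrative, not ICECUP output fields.
n_any = 186    # FTF with neither optional node: assumed to match every case
n_first = 21   # FTF with a postmodifying PP node in the first conjoin
n_last = 47    # FTF with a postmodifying PP node in the last conjoin
n_both = 13    # FTF with both postmodifying PP nodes


def cells_by_subtraction(n_any, n_first, n_last, n_both):
    """Return the four mutually exclusive cells of the 2 x 2 table."""
    first_only = n_first - n_both
    last_only = n_last - n_both
    # Inclusion-exclusion: remove both margins, add back the overlap.
    neither = n_any - n_first - n_last + n_both
    return {
        "neither": neither,
        "first_only": first_only,
        "last_only": last_only,
        "both": n_both,
    }


cells = cells_by_subtraction(n_any, n_first, n_last, n_both)
print(cells)
```

Because each less specific FTF also matches the cases found by the more specific ones, the raw query totals overlap, and only the subtracted cells sum to the grand total.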

Note that this search method pools data from coordination sequences of any number and does not pay attention to intervening conjoins. A pattern that only postmodifies a medial conjoin would therefore register as ‘neither’. However longer conjoin sequences are relatively low in frequency.

We extract the following proportions with 95% Wilson score intervals (see Table 1):

p(first) = 21/186 = 0.1129 ∈ (0.0750, 0.1664),
p(last) = 47/186 = 0.2527 ∈ (0.1957, 0.3197).

CJ, PP +PP	– last	+ last	total	p(last)
– first	131	34	165
+ first	8	13	21	0.6190
total	139	47	186	0.2527
p(first)		0.2766	0.1129

Table 1. Contingency table for independent decisions to have a postmodifying PP in first or last place for conjoined PPs (‘+ first’ means the first conjoin is postmodified by a PP), all ICE-GB data. χ² = 16.83 (Yates’s χ² = 14.71).

If we compare the confidence intervals on p(first) and p(last), we find that they do not overlap, so we can say that p(last) is significantly greater than p(first), i.e. it is more likely that a later conjoin is postmodified than an earlier one. In other words, we find a potential end-weight bias.
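For readers who wish to check these figures, the following sketch computes the Wilson score interval without continuity correction, using only the Python standard library. The function name `wilson` is our own, and the code is an illustration of the standard formula rather than the spreadsheet used for the paper.

```python
from math import sqrt
from statistics import NormalDist


def wilson(f, n, alpha=0.05):
    """Wilson score interval (no continuity correction) for the proportion f/n."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed critical value
    p = f / n
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom


first_lo, first_hi = wilson(21, 186)   # p(first)
last_lo, last_hi = wilson(47, 186)     # p(last)

print(f"p(first) = {21 / 186:.4f} in ({first_lo:.4f}, {first_hi:.4f})")
print(f"p(last)  = {47 / 186:.4f} in ({last_lo:.4f}, {last_hi:.4f})")
print("intervals overlap:", first_hi >= last_lo)
```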

2.2 Interaction and patterning

We could stop at this point. However comparing p(first) and p(last) evaluates their independent rates. It does not address their interaction.

Note the probability of choosing the cell (+first, +last) in Table 1, which we might write as p(both) = 13/186 = 0.0699. This is nearly two and a half times the independent intersection probability, p(first) × p(last) = 0.0285. The ratio p/P has a scaled 95% Wilson score interval, where P is treated simply as a constant.

p/P = 0.0699/0.0285 = 2.45 ∈ (1.45, 4.06),

where p is the observed proportion, p(both), and P = p(first) × p(last). There are between 1.5 and 4 times (with a best estimate of 2.45) more ‘double postmodification’ cases than would be expected were the two postmodification acts independent.
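A minimal sketch of this scaling follows, assuming (as the text does) that P is treated as a constant, so the Wilson bounds on p(both) are simply divided by P. The helper name `wilson` is ours.

```python
from math import sqrt
from statistics import NormalDist


def wilson(f, n, alpha=0.05):
    """Wilson score interval (no continuity correction) for the proportion f/n."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p = f / n
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom


n = 186
p_both = 13 / n
P = (21 / n) * (47 / n)   # expected rate if the two decisions were independent

lo, hi = wilson(13, n)    # interval on the observed proportion p(both)
ratio = p_both / P
ratio_lo, ratio_hi = lo / P, hi / P   # rescale both bounds by the constant P

print(f"p/P = {ratio:.2f} in ({ratio_lo:.2f}, {ratio_hi:.2f})")
```

Treating P as a constant ignores the sampling uncertainty in p(first) and p(last), so the interval expresses uncertainty in p(both) alone.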

We can compute Cramér’s 2 × 2 ϕ = 0.3008 ∈ (0.1385, 0.4646).² This indicates a sizeable effect, and we can be 95% confident that the population value lies within this range.
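The point estimates for χ² and ϕ can be reproduced from the marginal totals with a short sketch. The function name is hypothetical. The confidence interval on ϕ requires the method cited in the note and is not attempted here.

```python
from math import sqrt


def chi_square_2x2(a, b, c, d, yates=False):
    """Pearson chi-square for the table [[a, b], [c, d]].

    If yates is True, apply Yates's continuity correction.
    """
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            diff = abs(observed[i][j] - expected)
            if yates:
                diff = max(diff - 0.5, 0.0)
            chi2 += diff * diff / expected
    return chi2


# Cells derived from the totals in the text by subtraction.
both = 13
first_only = 21 - both
last_only = 47 - both
neither = 186 - 21 - 47 + both

chi2 = chi_square_2x2(neither, last_only, first_only, both)
chi2_yates = chi_square_2x2(neither, last_only, first_only, both, yates=True)
phi = sqrt(chi2 / 186)   # for a 2 x 2 table, phi = sqrt(chi-square / N)

print(f"chi2 = {chi2:.2f}, Yates chi2 = {chi2_yates:.2f}, phi = {phi:.4f}")
```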

This effect size can be used to compare the degree of association between decisions. However, since ϕ is associative, it is bidirectional, and does not distinguish between axes (directions).

Using these proportions, we could examine how the rate of postmodification on one conjoin changes if we know the other is postmodified. But as we shall discuss in Section 3.1, making a claim of directionality of influence is doubly misguided.

In the meantime, consider Figure 2, which plots the changing rate of each decision point as separate trends.³ We compute these second, conditional proportions like this:

p(first | last) = 13/47 = 0.2766 ∈ (0.1694, 0.4176),
p(last | first) = 13/21 = 0.6190 ∈ (0.4080, 0.7925).

We plot spoken and written rates, alongside the pooled ‘all ICE-GB’ rate, both in order to identify whether mode of delivery makes a difference to the outcome, and as a kind of weak replication check (see Wallis 2021: 201). Note that although we might perceive differences between speech and writing in Figure 2, they are not significantly different (note how the intervals overlap points).

Figure 2. Changing rate of postmodifying a noun phrase head with a PP in the last position of a series of conjoined PPs, p(last), vs. the changing rate of p(first), if the other conjoin is postmodified.


The graph draws attention to Church’s gradients, i.e. p(last | first) – p(last), etc. This gradient represents the tendency for the rate of postmodification at a particular conjoin to increase if we know that the first is postmodified. Examining the difference between conditional and absolute probabilities is an idea due to Ken Church (2000). We might also compare this gradient with the equivalent gradient for the opposite direction, i.e. p(first | last) – p(first). If there was an influence in a particular direction, one could expect a steeper gradient on the influenced term.

However, such an interpretation is incorrect. We should be careful not to over-interpret the increased gradient for p(last) over p(first). The two gradients are not independent observations, but difference measures extracted from a contingency table with a single degree of freedom.

We already know that p(last) > p(first) (‘absolute’ values, left). And we know that there are additional cases of double-postmodification. The steeper gradient is entirely due to these two facts. In other words, it is a mathematical artifact of Table 1!⁴
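The dependence between the two gradients can be made concrete with a short sketch (variable names are ours). Each gradient equals the same quantity, the excess of observed ‘both’ cases over the number expected under independence, divided by a different marginal total.

```python
n, n_first, n_last, n_both = 186, 21, 47, 13

p_first = n_first / n
p_last = n_last / n
p_last_given_first = n_both / n_first
p_first_given_last = n_both / n_last

# Church's gradients: conditional minus absolute probability.
gradient_last = p_last_given_first - p_last
gradient_first = p_first_given_last - p_first

# Both gradients share one numerator: observed minus expected 'both' cases.
excess = n_both - n_first * n_last / n

print(f"gradient for p(last):  {gradient_last:.4f} = {excess:.3f} / {n_first}")
print(f"gradient for p(first): {gradient_first:.4f} = {excess:.3f} / {n_last}")
```

Since n_first is smaller than n_last, dividing the same excess by it necessarily gives the steeper gradient.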

Indeed, in each set of data we studied in this paper, p(last | first) exceeds 0.5 numerically, or, to put it another way, more than half the cases that are postmodified in the first position have a postmodified final conjoin.

However this does not permit us to assume a directional influence, a claim we might codify as ‘+postmodify(first) → +postmodify(last)’, i.e. choosing to postmodify the first conjoin encourages, or primes, postmodification of the final conjoin. We will return to questions of directional influence and templating in Sections 3.1 and 3.2.

With the above in mind, a simpler way to present this data is shown in Figure 3. This representation places the emphasis on particular patterns (‘initial’ = ‘first only’; ‘final’ = ‘last only’) rather than on the probability of an item being found. Thus p(first) is the probability that x exists in the initial position, which could be in either ‘initial’ or ‘both’ patterns.

For ICE-GB and spoken data, the intervals for p(initial) and p(final) do not overlap. All three are significant at α = 0.05, confirmed by a paired-frequency z test (Wallis 2021: 166).

A meaningful statistic is the end weight odds, for which we can also estimate 95% confidence intervals (Wallis 2022b). An odds score is simply the ratio of two competing proportions, in this case p(final)/p(initial). This statistic ignores ‘both’ or ‘neither’, only considering ‘first only’ and ‘last only’. For all ICE-GB data, we observe 4.25 times as many conjoin-final cases as initial ones (34:8 or 4.25:1), with a 95% interval of 2.00 to 9.01 times. In other words, we are 95% confident that in the population from which our data is sampled there are between twice and 9 times as many conjoin-final as conjoin-initial cases, and our best estimate of the ratio is 4.25.
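One plausible way to approximate such an interval is sketched below. It is an assumption on our part and may differ in detail from the method in Wallis (2022b): we compute a Wilson interval on the proportion of ‘last only’ cases among the 42 single-postmodified cases, then transform each bound into odds using p/(1 – p). Function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wilson(f, n, alpha=0.05):
    """Wilson score interval (no continuity correction) for the proportion f/n."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p = f / n
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom


def odds_with_interval(f_final, f_initial, alpha=0.05):
    """Odds f_final : f_initial, with bounds transformed from a Wilson interval.

    Assumption: the interval is taken on p = f_final / (f_final + f_initial)
    and each bound is mapped through the monotonic function p / (1 - p).
    """
    n = f_final + f_initial
    lo, hi = wilson(f_final, n, alpha)
    return f_final / f_initial, lo / (1 - lo), hi / (1 - hi)


odds, odds_lo, odds_hi = odds_with_interval(34, 8)
print(f"end weight odds = {odds:.2f} in ({odds_lo:.2f}, {odds_hi:.2f})")
```

Under these assumptions the result is close to the interval reported above.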

Figure 3. Probability distribution for the position of ‘heavy’ (postmodified) conjoined prepositional phrases. The final column identifies ‘excess’ double-postmodified patterns.


. . .

Excerpt

3. Discussion

3.1 The (directional) causality trap

Haunting this article is the spectre of directional explanations. Observing a high number of conjoin-final ‘heavy’ phrases, we are tempted to infer directionality of decision making and thus influence. If we plot graphs like Figure 2, this temptation becomes even greater. The gradient for p(last) tends to be steeper than that for the opposite inference. In the case of postmodified noun phrases we can obtain a significantly steeper result.

But what does this mean? We could claim that where a gradient in one direction is found to be significantly greater than another this gradient is likely to be seen in future data. This means that we might say the prediction is reproducible, but it does not mean that the reason this pattern is observed is due to a particular underlying process.

But as we have seen, this result can also be explained as a mathematical artifact of two other facts: that conjoins are end-weighted (p(last) > p(first), and p(final) > p(initial)) and that the intersection (the doubly-postmodified ‘both’ pattern) is greater than expected.

In fact, any observed pattern like this is the aggregate result of multiple patterns and tendencies, idiomatic expressions and schema, as well as genuinely independent decisions which influence one another.

As with any correlation, great care should be taken not to interpret a directional correlation as evidence of causality. We cannot know for certain that a decision regarding adding a postmodifier to the first conjoin is made prior to a decision to add one to the last, however intuitive or seductive this reasoning might be. Human mental processing is highly parallelised, and conjoins might be constructed internally in parallel, and only articulated in a single order.

Finally, although we see an elevated rate for cases where both the first and last conjoins are postmodified, this might be due to a specific set of cases, such as idiomatic patterns or templating.

There are some circumstances in corpus linguistics where direction might be deduced, for example where one speaker primes another. But greater care must be applied when dealing with linguistic interaction research within an utterance by the same speaker. For a start, direction does not automatically accord with word order. We have previously discovered interactions between decisions that are only credibly explained by planning ahead, such as attributive adjective phrases conditioned by the semantics of the head that follows. When I say the large grey cat, I have a mental picture of the cat I am describing to you, and I am constrained by the noun I might eventually utter – cat, feline, animal, creature, etc.

Similarly, objective who/whom alternation is shown to interact with a following subject (Wallis 2021: 39). The choice of subject, like the choice of noun phrase head, necessarily concerns the overarching intended meaning of the clause or phrase.

In some processes, we might advance an argument that some decisions are likely to be made in a particular order because the option to add a second term only arises should the first be made, such as in embedded constructions (Wallis 2019). However, even embedding may involve some degree of look-ahead. Wallis (2022b) finds evidence that proper nouns postmodified by PPs found in titles appear to defy the expectation of a declining additive probability. Although analysed grammatically as multi-level embedding, the rise in probability observed appears to be only explicable by ‘chunking’ (the construction is introduced as a single unit), or the application of a title ‘formula’, such as the X of Y, e.g. the Duke of York.

In our case, a plausible cognitive model could hypothesise that the memory and attention demands of introducing an additional PP militate against it being added in the initial position, but this pattern in the data might be explained as a result of some other (possibly as yet unknown) phenomenon.

. . .

1. Introduction
2. Experiments
2.1 Conjoined prepositional phrases containing noun phrases postmodified by PPs
2.2 Interaction and patterning
2.3 Conjoined noun phrases, postmodified by PPs
2.4 Clausal postmodification
3. Discussion
3.1 The (directional) causality trap
3.2 Templating evidence
3.3 The effect of order
3.4 Adjective phrase, PP and clausal distributions
4. Conclusions

References

Bresnan, J., A. Cueni, T. Nikitina & R.H. Baayen 2007. Predicting the Dative Alternation. In G. Bouma, I. Kraemer, & J. Zwarts (eds.), Cognitive Foundations of Interpretation. Amsterdam: KNAW. 69-94.

Church, K. 2000. Empirical Estimates of Adaptation: The chance of Two Noriegas is closer to p/2 than p², Coling, 173-179.

Cowan, R. 2008. The Teacher’s Grammar of English with Answers. Cambridge: CUP.

Kaltenböck, G. 2020. Chapter 22 in Aarts, B., Popova, G. and Bowie, J. (eds.) The Oxford Handbook of English Grammar. Oxford University Press.

Nelson, G., S.A. Wallis & B. Aarts 2002. Exploring Natural Language: Working with the British Component of the International Corpus of English. Varieties of English Around the World series. Amsterdam: John Benjamins.

Wallis, S.A. 2019. Investigating the additive probability of repeated language production decisions. International Journal of Corpus Linguistics 24:4, 490-521.

Wallis, S.A. 2021. Statistics in Corpus Linguistics Research: A new approach. New York: Routledge.

Wallis, S.A. 2022a. Accurate confidence intervals on Binomial proportions, functions of proportions, algebraic formulae and effect sizes. London: Survey of English Usage.

Wallis, S.A. 2022b. Are embedding decisions independent? Evidence from preposition(al) phrases. London: Survey of English Usage.

Notes

[1] The tree is drawn from left to right for reasons of space rather than top-down. Word order is from the top, down, on the right hand side. Gloss: NPHD = noun phrase head, NPPO = noun phrase postmodifier, PP = prepositional phrase, CJ = conjoin, P = prepositional (function), PREP = preposition, PC = prepositional complement, NP = noun phrase. Black arrows = immediately after, white arrows = (eventually) after.

[2] These are 95% intervals computed using the method outlined in (Wallis 2021: 225).

[3] This is not an additive probability chart. The equivalent additive probability chart would link p(last) with p(first | last) as a chain of additive decisions.

[4] The elevated double-postmodification rate is why χ² was significant. To demonstrate this, the expected value is 21×47/186 = 5.31. Set out Table 1 with cells in the ‘known totals’ tab on the 2 × 2 χ² spreadsheet, www.ucl.ac.uk/english-usage/statspapers/2x2chisq.xls. Then substitute 5.31 for 13. χ², ϕ and ϕ_p tend to zero.

Are embedding decisions independent?

Sean — Tue, 17 May 2022 20:17:18 +0000

Evidence from preposition(al) phrases

Abstract Full Paper (PDF)

One of the more difficult challenges in linguistics research concerns detecting how constraints might apply to the process of constructing phrases and clauses in natural language production. In previous work (Wallis 2019) we considered a number of operations modifying noun phrases, including sequential and embedded modification with postmodifying clauses. Notably, we found a pattern of a declining additive probability for each decision to embed postmodifying clauses, albeit a pattern that differed in speech and writing.

In this paper we use the same research paradigm to investigate the embedding of an altogether simpler structure: postmodifying nouns with prepositional phrases. These are approximately twice as frequent and structures exhibit as many as five levels of embedding in ICE-GB (two more than are found for clauses). Finally the embedding model is simplified because only one noun phrase can be found within each prepositional phrase. We discover different initial rates and patterns for common and proper nouns, and certain subsets of pronouns and numerals. Common nouns (80% of nouns in the corpus) do appear to generate a secular decline in the additive probability of embedded prepositional phrases, whereas the equivalent rate for proper nouns rises from a low initial probability, a fact that appears to be strongly affected by the presence of titles.

It may be generally assumed that, like clauses, prepositional phrases are essentially independent units. However, we find evidence from a number of sources indicating that some double-layered constructions may be added as single units. In addition to titles, these constructions include schematic or idiomatic expressions whose head is an ‘indefinite’ pronoun or numeral.

1. Introduction

In (Wallis 2019), we described a research design which considered the additive probability for repeatedly performing the same construction step. To take a simple example used in the paper, we might consider the additive probability of repeatedly adding an attributive adjective phrase to a noun head, according to a canonical scheme that looks like this.

base → + term₁ → + term₂ → ⋯ → + termₙ

We examine the probability of adding the x-th term (in this case, an attributive adjective phrase), which we label p(x), to an existing string (a noun head). Thus we obtain results like

the cat
the black cat
the large black cat
etc.

In such a construction, decisions are not necessarily ordered by the lexical order in which they appear: the speaker could have assembled a mental model of the ‘cat’ that they wished to communicate, and then selected adjective phrases before assembling them serially as a noun phrase. Or they could have simply avoided attributive adjectives altogether and said the cat that was black and large.

Nonetheless, we can calculate the probability that a speaker or writer adds an attributive adjective phrase, at point x, before the noun, which we will simply label p(x). So p(1) represents the chance of adding the first adjective phrase, p(2) the chance of adding the second given the first, and so on.

We first obtain a frequency distribution of at least x adjective phrases, F(x). We can then simply divide p(x) = F(x)/F(x – 1) to obtain the additive probability at each stage.

Each set is a subset of the previous one in the sequence. The set of noun phrases with at least one attributive adjective phrase is a subset of all noun phrases, etc. We test for a significant fall or rise by examining whether one additive probability point p(x – 1) is within the Wilson score interval (Wilson 1927) for the next, p(x). This is a ‘goodness of fit’ test condition.

If the additive probability does not significantly change as repetition x increases, then it means that we have no evidence of an interaction between one decision and the next. With confidence intervals we can also consider the size of effect (maximum and minimum slope) at any point.
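The procedure described above can be sketched as follows. The frequencies are hypothetical values chosen purely for illustration (they are not ICE-GB data), and the function names are our own.

```python
from math import sqrt
from statistics import NormalDist


def wilson(f, n, alpha=0.05):
    """Wilson score interval (no continuity correction) for the proportion f/n."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p = f / n
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom


def additive_probabilities(F):
    """F[x] = number of cases with at least x terms.

    Returns a list of (x, p(x), lower, upper) with p(x) = F(x) / F(x - 1).
    """
    rows = []
    for x in range(1, len(F)):
        lo, hi = wilson(F[x], F[x - 1])
        rows.append((x, F[x] / F[x - 1], lo, hi))
    return rows


def significant_changes(rows):
    """Return each x where p(x - 1) lies outside the Wilson interval for p(x)."""
    flagged = []
    for (_, p_prev, _, _), (x, _, lo, hi) in zip(rows, rows[1:]):
        if not lo <= p_prev <= hi:
            flagged.append(x)
    return flagged


# Hypothetical frequencies F(0)..F(3), for illustration only.
F = [10000, 2000, 200, 16]
rows = additive_probabilities(F)
for x, p, lo, hi in rows:
    print(f"p({x}) = {p:.4f} in ({lo:.4f}, {hi:.4f})")
print("significant change at x =", significant_changes(rows))
```

With these made-up counts, the fall from p(1) to p(2) is flagged as significant, whereas the smaller fall from p(2) to p(3) is not, because the interval widens as the data thin out.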

In the paper we found a serial and sequential impact with attributive adjective phrases, with a declining probability observed with each repetition. We suggested three possible types of explanation, which are not necessarily exclusive.

1. Logical-semantic constraints, such as semantic ordering of adjectives and semantic coherence (so one tends to say large black cat rather than black large cat or large small cat).
2. Communicative economy. The communicative environment imposes constraints. For example, a random sample of nouns will include second, third references to previously introduced (and described) concepts. These subsequent references will likely be adjectiveless ([the] cat), or a pronoun (he/she/it).
3. Cognitive memory/processing constraints, which were originally primarily conceived of as having a negative impact (hence ‘constraint’). However, mental processing may also make certain expressions easier than others to produce.

Note that we are not primarily concerned about the influence of a particular noun on a particular modifier, such as avoiding colour adjectives with abstract nouns (cf. a black mood). Rather, our focus is on the hypothesis that the cumulative impact of previous operations causes an additive probability to fall – or, in some cases, rise.

Indeed, in the case of repeated postmodifying clauses following the noun head, there was an initial decline and then a rise. This subsequent increase seemed most likely due to templating, a tendency to re-use structures. Consider this example:

(1)…the dream becomes a text [to renarrate], [to revise], [to listen to], [to read], [to analyse]. _{[W2A-002 #33]}

In the case of adjective phrases, the pool of plausible compatible adjectives tended to be used up, suggesting Explanation (1) above. But the same does not appear to apply to clauses following the noun head. Indeed, for communicative purposes, one might imagine someone attempting to convey a particular location or event by repeatedly adding postmodifying clauses if their interlocutor seemed puzzled.

Example (1) is a type of ‘asyndetic coordination’ (coordination without a coordinator ‘and’, ‘or’ or ‘but’). In the paper, we took additional steps to count coordinated examples. We would not wish to treat Example (1) differently were the writer to conclude with and to analyse. Pooling serial postmodification and coordinated cases, it became clearer that the pattern did adopt a ‘fall and rise’ pattern, suggesting that there was a second phenomenon at work in the case of longer strings.

1.1 Embedding

In Wallis (2019) we compared sequences comprising serial postmodification of the same head with embedded postmodification, i.e. where a postmodifying clause includes a noun head, and that head is then postmodified. This appeared to demonstrate a decline – from p(1) to p(2) for spoken data, and from p(2) to p(3) for writing. See Figure 1.

Since each additional term modifies this new head, we may make a default assumption that decisions are made as the construction is assembled in sequence, i.e. in order of increasing depth. The reasoning is that the second addition can only be made after the first has been added.

Unfortunately for our study, postmodifying clauses are not very often embedded, and we rapidly ran out of data, despite a starting point of some 190,000 noun phrases!

Figure 1. Studying the impact of cumulative cost on embedding, noun phrase postmodifying clauses with head nouns, after Wallis (2019). We have insufficient data to determine whether this trend would continue at further levels of embedding, although we note the difference between speech and writing.


The task of extracting and counting embedded structures relies on the combination of a fully parsed corpus, the British Component of the International Corpus of English (ICE-GB, Nelson et al. 2002), and an effective search tool, ICECUP. Whereas one can obtain adjective sequences from an unparsed corpus, and possibly even attempt to recover serial postmodification from such a source, retrieving and counting embedded terms requires a parse analysis.

1.2 Why are preposition(al) phrases interesting?

Figure 1 identifies what appears to be a genuine ‘cost’ of embedding, but we cannot determine whether the initial trend is a general one, or is limited to a difference between first and second order embedding. The observed difference in speech and writing might be due to processing cost or differing communicative strategy. It might also be due to topic differences in speech and writing subcorpora, although this appears less likely.

In this paper we want to suggest that a different structure, prepositional phrases (PPs, also called ‘preposition phrases’), may be more fruitful for evaluating the processing costs of embedding.

The first motivation for adopting PPs is that it is possible to find longer chains of embedding in the million-word ICE-GB. Consider Examples (2) and (3), which are viable (and readily interpreted) embedded strings. These are 5-deep structures, i.e. structures with two further levels of embedding than we find for clauses.

(2)So consultation and cooperation with the public <,> as well as speed and a sympathetic understanding [in response [to calls [for help [from the victims [of crime]]]]] and a physical presence on the streets are what the public now seek of the police <,> _{[S2B-037 #16]}

(3)The introduction [of the Independent Police Complaints Authority <,> [with its wide powers [of intervention and supervision [of the investigation [of complaints]]]]] can but reassure the doubting public <,> _{[S2B-031 #69]}

In Examples (2) and (3), the noun phrase heads are all nouns, and not pronouns, numerals, nominal adjectives or proforms. In a later section we examine what happens if we relax the requirement for heads to be nouns.¹

A second motivation is that, since postmodifying PPs are of higher frequency, we may be able to examine more grammatical subcategories or text categories.

Finally, unlike clauses, which may have a noun phrase acting as a subject, object or complement, only one element, the prepositional complement, can be a noun phrase. The overall model is streamlined and simpler.

2 Experiments excluding embedded conjoins

In this paper we will conduct experiments in two phases. First we will consider a subset of the data, excluding cases where the embedded sequence contains one or more conjoined elements, as in Example (3). These can be identified and counted with ICECUP 3.

We include conjoins in our data in Section 3.

2.1 Obtaining data

Consider the Fuzzy Tree Fragment (FTF) in Figure 2. This is an example of one-level embedding under a node that is not itself a prepositional complement noun phrase (‘(¬PC, NP)’). In other words, it will be at the start of an embedding sequence. We also exclude conjoined prepositional complement noun phrases by creating a second FTF and subtracting these cases.

We can permute the wordclass categories in the first NPHD ‘slot’ (circled) to obtain a series of slightly different queries, and we apply each query across all data in ICE-GB to obtain the frequency totals in Table 1. We will also distinguish common and proper nouns.

We first extract the ‘base’ in our scheme. Figure 3(a) depicts a second FTF consisting of the two nodes in the first row in Figure 2 (‘(¬PC, NP)’ plus the designated head), and Figure 3(b) illustrates a parallel structure for conjoined prepositional phrases (‘PC, NP (CJ, NP)’). To find conjoined cases for Figure 2 we replace the topmost node with these two nodes.

We can now obtain frequency data for f = F(1) and n = F(0), and compute the additive probability or modification rate, p(1) = f/n. See Table 2.

Figure 2. Level 1 Fuzzy Tree Fragment (FTF) for a head noun that is not in a prepositional complement noun phrase, which is followed by a prepositional phrase acting as a noun phrase postmodifier (NPPO, PP), which in turn consists of a preposition (P, PREP) and its complement NP (PC, NP) with a head (NPHD). The FTF is drawn left-to-right rather than top-down for ease of visualisation, and the matching ‘sentence’ would be read from the top, down on the right.

Figure 3. FTFs for obtaining our ‘base’ term, ‘level 0’, restricted by nouns (N). The first query (a) finds all nouns that are noun phrase heads which are not found in prepositional complement NPs. This FTF will be at the start of an embedding sequence, if we also exclude cases matching the second FTF (b), which finds cases of conjoined prepositional complement NPs.

	N(com)	N(prop)	PRON	NUM	NADJ	PROFM
postmodifier frequency f = F(1)	13,441	565	1,489	944	54	1
head frequency n = F(0)	83,079	21,224	89,244	7,132	628	591
additive probability p = f / n	0.1618	0.0266	0.0167	0.1324	0.0860	0.0017

Table 1. Frequency distributions of PP structures in Figures 2 and 3 across both speech and writing in ICE‑GB, subdivided by the wordclass of the initial noun phrase head (cf. Figure 3). Nouns, common and proper, comprise more than 90% of the data. We also calculate the additive probability p.
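The additive probability row in Table 1 can be recomputed directly from the two frequency rows. The following is a minimal sketch in Python, not the author's own tooling: the dictionary `TABLE_1` and the function name `additive_probability` are illustrative names, and the figures are copied from the table above.

```python
# Frequencies copied from Table 1 (ICE-GB, speech and writing combined).
# Each entry is (postmodifier frequency f = F(1), head frequency n = F(0)).
TABLE_1 = {
    "N(com)": (13441, 83079),
    "N(prop)": (565, 21224),
    "PRON": (1489, 89244),
    "NUM": (944, 7132),
    "NADJ": (54, 628),
    "PROFM": (1, 591),
}


def additive_probability(f, n):
    """Return p = f / n, the proportion of heads that are postmodified."""
    if n <= 0:
        raise ValueError("head frequency n must be positive")
    return f / n


# Rate for each wordclass of the initial noun phrase head.
rates = {wc: additive_probability(f, n) for wc, (f, n) in TABLE_1.items()}

# Pooled rate across all head types (sum the frequencies, then divide).
total_f = sum(f for f, _ in TABLE_1.values())
total_n = sum(n for _, n in TABLE_1.values())
pooled_rate = additive_probability(total_f, total_n)

for wc, p in rates.items():
    print(f"{wc:8s} p = {p:.4f}")
print(f"{'all':8s} p = {pooled_rate:.4f}")
```

Note that the pooled rate is obtained by summing frequencies before dividing, rather than by averaging the six per-wordclass rates, so that each head contributes equally.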

Figure 4. Variable rates of initial postmodification, with 95% Wilson score intervals. Common nouns have the highest rate of PP postmodification, followed by numerals.

We plot this initial rate of postmodification in Figure 4. The mean rate (i.e. the rate for all data unspecified by head) is 0.0817, with a 95% Wilson interval of (0.0805, 0.0829).²
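For readers who wish to check the interval quoted above, the Wilson score interval (Wilson 1927) can be sketched in a few lines of Python. This is a minimal illustration without continuity correction. The function name `wilson_interval` is an assumption of this sketch, and the totals f = 16,494 and n = 201,898 are obtained by summing the rows of Table 1.

```python
import math


def wilson_interval(f, n, z=1.959964):
    """Wilson score interval for the proportion p = f / n.

    z is the two-tailed critical value of the standard Normal
    distribution (about 1.96 for a 95% interval).
    No continuity correction is applied.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = f / n
    z2 = z * z
    denominator = 1 + z2 / n
    centre = (p + z2 / (2 * n)) / denominator
    spread = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denominator
    return centre - spread, centre + spread


# Totals over all head types in Table 1.
f_total, n_total = 16494, 201898
lower, upper = wilson_interval(f_total, n_total)
print(f"p = {f_total / n_total:.4f}, 95% interval = ({lower:.4f}, {upper:.4f})")
```

With these totals the bounds round to the interval given in the text, (0.0805, 0.0829). Because n is very large the interval is almost symmetric about p, but for small samples or rates close to 0 or 1, as with PROFM in Table 1, the Wilson interval is noticeably asymmetric.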

Excerpt

2.3 Common and proper nouns

Subdividing the data into structures postmodifying common and proper nouns yields the graph in Figure 7. The general trend for nouns now appears to be due to the sum of an initial rise in additive probability for proper nouns (p(2) > p(1)) combined with a secular decline in the rate for common nouns from p(1) to p(3), in writing at least.

Figure 7. Additive probability trend analysis for embedded PPs following common and proper nouns (see Figures 2 and 5).

The wordclass of the initial postmodified head makes a big difference to the subsequent pattern.

We can read this graph as saying that proper nouns are postmodified overall at a much lower rate (which we might expect), but the postmodification rate for this subsequent embedded postmodified head converges with the equivalent rate for common nouns.

This should not be surprising, as the principal association of a second order embedding will be with its immediate head, not a previous one. However, there appears to be a residual effect: written data still exhibits a significantly lower rate for proper nouns than common nouns at x = 2. Note also that proper noun heads in the spoken data have a slightly higher initial probability of being postmodified than in the written.
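One way to test whether two independent rates differ significantly is the Newcombe-Wilson difference interval (Newcombe 1998), which combines the Wilson intervals of the two proportions. The sketch below is illustrative only. The frequencies for x = 2 are not reproduced in this excerpt, so it uses the pooled level 1 figures for common and proper nouns from Table 1 as stand-in data, and the function names are assumptions of the sketch.

```python
import math


def wilson_interval(f, n, z=1.959964):
    """Wilson score interval for p = f / n, without continuity correction."""
    p = f / n
    z2 = z * z
    denominator = 1 + z2 / n
    centre = (p + z2 / (2 * n)) / denominator
    spread = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denominator
    return centre - spread, centre + spread


def newcombe_wilson_difference(f1, n1, f2, n2, z=1.959964):
    """Return (d, lower, upper) for the difference d = p1 - p2.

    The interval widths on each side are the Pythagorean sum of the
    relevant Wilson interval widths (Newcombe 1998, without
    continuity correction).
    """
    p1, p2 = f1 / n1, f2 / n2
    l1, u1 = wilson_interval(f1, n1, z)
    l2, u2 = wilson_interval(f2, n2, z)
    d = p1 - p2
    lower = d - math.sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = d + math.sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return d, lower, upper


# Stand-in data: common vs. proper nouns at level 1 (Table 1, all ICE-GB).
d, lower, upper = newcombe_wilson_difference(13441, 83079, 565, 21224)

# The difference is significant at this error level if the interval excludes zero.
significant = lower > 0 or upper < 0
print(f"d = {d:.4f}, interval = ({lower:.4f}, {upper:.4f}), significant = {significant}")
```

The same function could be applied to the written common and proper noun frequencies at x = 2 once those figures are to hand. This test assumes the two samples are independent, which holds when comparing different head types, but not when comparing successive points on the same curve.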

Introduction
1.1 Embedding

1.2 Why are preposition(al) phrases interesting?
Experiments excluding embedded conjoins
2.1 Obtaining data

2.2 Nouns postmodified by prepositional phrases

2.3 Common and proper nouns

2.4 Heads of any type

2.5 Nouns, pronouns and other wordclass types

2.6 Pronoun subtypes

Restricting embedded heads
Experiments allowing embedded conjoins
3.1 Obtaining data

3.2 Nouns postmodified by prepositional phrases

3.3 Common and proper nouns
Conclusions

References

Newcombe, R.G. 1998. Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine 17, 873-890.

Wallis, S.A. 2019. Investigating the additive probability of repeated language production decisions. International Journal of Corpus Linguistics 24:4, 490-521.

Wilson, E.B. 1927. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association 22, 209-212.

Notes

[1] The ICECUP software matches FTFs against tree structures by assigning a single grammatical node in a corpus tree analysis to a single node in the FTF. But Example (3) illustrates two important exceptions to this rule. It includes a compound proper noun, the Independent Police Complaints Authority, which matches a single node. It also includes a coordinated pair of noun phrases, intervention and supervision, which matches the ‘PC,NP’ node. This is found by an additional method discussed in Section 3.

[2] For nouns containing nouns, the rate is 0.1385 ∈ (0.1327, 0.1445), which is more than double the equivalent rate for postmodifying clauses (0.0556 ∈ (0.0546, 0.0577)).
