Sean – corp.ling.stats

Summer School in English Corpus Linguistics 2024 (online)

April 30, 2024 Sean

Have you begun research with corpora, but have been unsure what to do next?
Have you heard about parsed corpora (treebanks) but wondered how you might use them in your research?
Do you want a practical primer in statistics?

I’m pleased to announce the eleventh UCL Summer School in English Corpus Linguistics, our masterclass in research with parsed corpora, which is taking place online from 1-3 July. It is timed to run from 9:00 to 13:30 British Summer Time (GMT+1), to make it accessible for students across Europe, Africa, Asia and Australasia.

The Summer School is a short three-day intensive course aimed at PhD-level students and researchers who wish to get to grips with Corpus Linguistics from the perspective of the ‘Survey Methodology’. It is offered at £165 for early bookings made before 14 May, rising to £195 after.

This year we are updating our programme, including a new session on World Englishes with Guyanne Wilson.

This is a picture of our face-to-face teaching on the course, which we would love to return to! But with Covid-19 a continuing threat worldwide, and for reasons of accessibility and cost, we have decided to run the course online for another year.

Aims and objectives of the course

Over the course of the three days, participants learn about the following:

- the scope of Corpus Linguistics, and how we can use it to study the English Language;
- key issues in Corpus Linguistics methodology;
- how to use corpora to analyse issues in syntax and semantics;
- basic elements of statistics;
- how to navigate large and small corpora, particularly ICE-GB and DCPSE.

Learning outcomes

At the end of the course, participants should have:

- acquired a basic but solid knowledge of the terminology, concepts and methodologies used in English Corpus Linguistics;
- had practical experience working with two state-of-the-art corpora and a corpus exploration tool (ICECUP);
- gained an understanding of the breadth of Corpus Linguistics and the potential application for projects;
- learned about the fundamental concepts of inferential statistics and their practical application to Corpus Linguistics.

What it costs

Attendance fee: £165 until May 14; £195 afterwards.
Corpus and software:
- temporarily accessible for all participants (no separate purchase necessary)
- choice of free corpus to all students, £25 for both corpora
- Special Offer of 25% standard price for all attendees

Continuity correction for risk ratio and other intervals

March 2, 2024 (updated May 3, 2024) Sean

Introduction

In “An algebra of intervals”, we showed that we can calculate confidence intervals for formulae composed of common mathematical operators, including powers and logarithms. We employed a method proposed by Zou and Donner (2008), itself an extension of Newcombe (1998). Wallis (forthcoming) describes the method more formally.

However, Newcombe’s method is arguably better-founded mathematically than that of Zou and Donner, who make an additional assumption. They assume that the number scale on which two properties are distinguished is not material to the quality of the resulting interval.

Why might this assumption be problematic? Well, when we compute a difference interval with Newcombe’s method, we do so by summing squared inner interval widths. These are equal to independent variance terms (multiplied by a constant, the critical value of the Normal distribution z_α/2), which are Normal at inner bounds. So far, so good. However, if such an interval is transformed onto a different number scale, but the same summation-of-variance (Bienaymé) method is then employed — Zou and Donner’s method — we are now summing terms which are by definition no longer Normal!
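To make the summation step concrete, here is a minimal sketch of Newcombe's difference interval for d = p₂ – p₁. It assumes the Wilson score interval is used for each proportion, as in Newcombe (1998); the function names `wilson` and `newcombe_difference` are illustrative only and are not taken from the posts.

```python
import math
from statistics import NormalDist


def wilson(p, n, alpha=0.05):
    """Wilson score interval (lower, upper) for a proportion p observed in n trials."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value z_alpha/2
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom


def newcombe_difference(p1, n1, p2, n2, alpha=0.05):
    """Interval for d = p2 - p1, combining Wilson interval widths in quadrature."""
    l1, u1 = wilson(p1, n1, alpha)
    l2, u2 = wilson(p2, n2, alpha)
    d = p2 - p1
    # Lower bound of d: lower side of p2 and upper side of p1.
    lower = d - math.sqrt((p2 - l2) ** 2 + (u1 - p1) ** 2)
    # Upper bound of d: upper side of p2 and lower side of p1.
    upper = d + math.sqrt((u2 - p2) ** 2 + (p1 - l1) ** 2)
    return lower, upper


if __name__ == "__main__":
    print(newcombe_difference(0.2, 50, 0.5, 50))
```

If the resulting interval excludes zero, the difference would be treated as significant at the chosen α. Zou and Donner's extension applies the same quadrature step after the bounds have been mapped onto another scale, which is the step questioned above.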

I was suspicious of this assumption, which seemed to me to be optimistic at best, and I was concerned to evaluate it computationally. The method I used was as follows.

- For every potential value of two proportions, p₁ and p₂, over a range of sample sizes (1 to 200), perform the same inner interval calculation. This interval can be treated as a significance test equivalent to the exact Fisher test (evaluating if p₁ is significantly different from p₂). Thus, for a difference d = p₂ – p₁, if the resulting interval for d includes 0, the result is not significant. For a ratio, e.g. r = p₁/p₂, if the interval includes 1, the result is not significant.
- Compare the results of the two tests: our new test and Fisher.
- If there is a discrepancy in the outcome, it will be of one of two types:
  1. Type I errors (our test deemed a result significant when Fisher did not) and
  2. Type II errors (our test deemed a result non-significant, i.e. it failed to detect a significant result according to Fisher).
- To properly account for the chance of observing a particular pair of proportions, weight each error by its Fisher score before summing.
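The procedure above can be sketched in code for a single sample size n₁ = n₂ = n. This is an illustration under several assumptions that the post does not confirm: the Fisher test is taken as two-sided (summing the probabilities of all tables no more probable than the one observed), the "Fisher score" used as a weight is taken to be the hypergeometric probability of the table given its margins, and the rates are normalised by the total weight. All function names are hypothetical.

```python
import math
from statistics import NormalDist

Z = NormalDist().inv_cdf(0.975)  # critical value for alpha = 0.05, two-tailed


def wilson(f, n, z=Z):
    """Wilson score interval for f successes out of n."""
    p = f / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom


def newcombe_significant(f1, f2, n):
    """True if Newcombe's interval for d = p2 - p1 excludes zero."""
    p1, p2 = f1 / n, f2 / n
    l1, u1 = wilson(f1, n)
    l2, u2 = wilson(f2, n)
    d = p2 - p1
    lower = d - math.hypot(p2 - l2, u1 - p1)
    upper = d + math.hypot(u2 - p2, p1 - l1)
    return lower > 0 or upper < 0


def table_prob(f1, f2, n):
    """Hypergeometric probability of the 2x2 table given its margins (assumed weight)."""
    return math.comb(n, f1) * math.comb(n, f2) / math.comb(2 * n, f1 + f2)


def fisher_significant(f1, f2, n, alpha=0.05):
    """Two-sided Fisher test: sum tables no more probable than the observed one."""
    total = f1 + f2
    observed = table_prob(f1, f2, n)
    p_value = sum(
        table_prob(k, total - k, n)
        for k in range(max(0, total - n), min(n, total) + 1)
        if table_prob(k, total - k, n) <= observed * (1 + 1e-9)
    )
    return p_value < alpha


def weighted_error_rates(n, alpha=0.05):
    """Weighted Type I and Type II rates of the interval test relative to Fisher."""
    type1 = type2 = weight_sum = 0.0
    for f1 in range(n + 1):
        for f2 in range(n + 1):
            w = table_prob(f1, f2, n)
            weight_sum += w
            new = newcombe_significant(f1, f2, n)
            ref = fisher_significant(f1, f2, n, alpha)
            if new and not ref:
                type1 += w  # interval test significant, Fisher not
            elif ref and not new:
                type2 += w  # Fisher significant, interval test not
    return type1 / weight_sum, type2 / weight_sum


if __name__ == "__main__":
    for n in (5, 10, 20):
        print(n, weighted_error_rates(n))
```

Repeating the calculation for each n in the range and plotting the Type I rate against n would give a curve of the general kind discussed in this post, although the exact figures depend on the assumptions listed above.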

This method evaluates the inner (mesial) interval close to the middle of the range. It does not evaluate the same interval for non-zero points, or for the outer interval. But unlike Monte Carlo methods, it is exhaustive.

What I found partly supported my suspicions. There was indeed an additional error cost introduced by these approximations, and this error differed by number scale (or, by the formula, which amounts to the same thing). The graph below demonstrates the scale of the issue. If we aim for α = 0.05 but then compute an interval with an additional Type I error ε of 0.03, this additional error is not negligible!

All of these interpolated intervals, including Newcombe’s for d, incur detectable errors, but there is some good news. We observed that employing a continuity correction reduces the scale of those errors.
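As an illustration of what a continuity correction does to a single-proportion interval, the sketch below assumes the correction meant is the standard continuity-corrected Wilson interval. It is computed by shifting p by 1/(2n) away from each bound before applying the Wilson formula, which is algebraically the same as the continuity-corrected form given by Newcombe (1998). The function names are illustrative.

```python
import math
from statistics import NormalDist


def wilson_bound(p, n, z, upper):
    """One bound of the (uncorrected) Wilson score interval."""
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    sign = 1 if upper else -1
    return (centre + sign * spread) / (1 + z * z / n)


def wilson_cc(p, n, alpha=0.05):
    """Continuity-corrected Wilson interval: shift p outwards by 1/(2n) first."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    lower = wilson_bound(max(0.0, p - 1 / (2 * n)), n, z, upper=False)
    upper = wilson_bound(min(1.0, p + 1 / (2 * n)), n, z, upper=True)
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, lower), min(1.0, upper)


if __name__ == "__main__":
    print(wilson_cc(0.3, 20))
```

The corrected interval is always at least as wide as the uncorrected one, which makes any test built on it more conservative. That is consistent with the reduction in Type I errors reported here.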

Figure 1 shows an example plot obtained by this method (taken from a recent blog post). This includes computations for simple difference d, Cohen’s h, risk and odds ratios, and log ratios, each of which performs Zou and Donner’s difference calculations on a different number scale.

Figure 1. Difference d, Cohen’s h, odds, risk and log ratios evaluated by Fisher-weighted error rates for Type I errors, against the Fisher ‘exact’ test, computed for values of n₁ = n₂ ∈ {1, 2, …, 200}, α = 0.05, with equal-sized samples.

One can make a number of observations about this graph: the saw-tooth behaviour, the ordering of intervals by performance, and so on. But if we want to minimise Type I errors (where we wrongly assess a non-significant difference as ‘significant’), this graph reveals that employing a continuity correction suppresses them.

Our previous evaluations showed that for unequal-sized samples, where n₁ = 5n₂, we tended to see a lower overall error rate (this is not quite correct for χ²). See also Table 1 below. The increased sample size for p₁ (amounting to 3 times the data in the table overall) means that the discrete Fisher is smoother, and therefore the ‘smoothing correction’ aspect of the continuity correction is less necessary. But there remains an error.

Confidence intervals for Cohen’s h

February 28, 2024 (updated April 22, 2024) Sean

1. Introduction

Cohen’s h (Cohen, 2013) is an effect size for the difference of two independent proportions that is sometimes cited in the literature. h ranges between minus and plus pi, i.e. h ∈ [–π, π].

Jacob Cohen suggests that if |h| > 0.2, this is a ‘small effect size’, if |h| > 0.5, it is ‘medium’, and if |h| > 0.8 it is ‘large’. This conventional application of effect sizes – as a descriptive method for distinguishing sizes – is widespread.

The score is defined as the difference between the arcsine transforms of the square roots of two Binomial proportions p_i for i ∈ {1, 2}, hence the expanded range, ±π.

That is,

h = ψ(p₁) – ψ(p₂),    (1)

where the transform function ψ(p) is defined as

ψ(p) = 2 arcsin(√p).    (2)
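Equations (1) and (2), together with the conventional thresholds quoted above, translate directly into code. This is a minimal sketch; the function names and the label returned for |h| ≤ 0.2 are illustrative choices, not part of Cohen's scheme.

```python
import math


def psi(p):
    """Arcsine transform of a proportion, equation (2)."""
    return 2 * math.asin(math.sqrt(p))


def cohens_h(p1, p2):
    """Cohen's h, equation (1): difference of transformed proportions."""
    return psi(p1) - psi(p2)


def effect_label(h):
    """Conventional descriptive label for |h|, using the thresholds cited above."""
    size = abs(h)
    if size > 0.8:
        return "large"
    if size > 0.5:
        return "medium"
    if size > 0.2:
        return "small"
    return "below small"


if __name__ == "__main__":
    h = cohens_h(0.5, 0.2)
    print(round(h, 4), effect_label(h))
```

For p₁ = 0.5 and p₂ = 0.2 this gives h of approximately 0.64, a ‘medium’ effect by the conventional labels. The extremes p₁ = 1 and p₂ = 0 give h = π, the top of the range.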

In this blog post I will explain how to derive an accurate confidence interval for this property h. The benefits of doing so are multiple.

- We can plot h scores with intervals, so we can visualise the reliability of their estimate, pay attention to the smallest bound, etc.
- We can compare two scores, h₁ and h₂, for significant difference. In other words, we can conclude that h₂ > h₁, or vice versa.
- We can reinterpret ‘large’ and ‘small’ effects for statistical power.
- We can consider whether an inner bound is greater than Cohen’s thresholds. Thus, if h is positive and its lower bound h⁻ > 0.5, we can report that the likely population score is at least a ‘medium’ effect.
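The threshold check in the last point can be sketched as follows. The derivation itself is in the full post, so this example rests on an assumption: the interval for h is obtained by the Zou and Donner approach described elsewhere on this blog, i.e. Wilson bounds for each proportion are mapped through ψ and the resulting widths are combined in quadrature. The function names are hypothetical, and the post's own method may differ in detail.

```python
import math
from statistics import NormalDist


def psi(p):
    """Arcsine transform, psi(p) = 2 arcsin(sqrt(p))."""
    return 2 * math.asin(math.sqrt(p))


def wilson(p, n, z):
    """Wilson score interval, clamped to [0, 1] so that psi() stays defined."""
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return max(0.0, (centre - spread) / denom), min(1.0, (centre + spread) / denom)


def h_interval(p1, n1, p2, n2, alpha=0.05):
    """Assumed interval for h = psi(p1) - psi(p2) on the transformed scale."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    l1, u1 = wilson(p1, n1, z)
    l2, u2 = wilson(p2, n2, z)
    h = psi(p1) - psi(p2)
    lower = h - math.hypot(psi(p1) - psi(l1), psi(u2) - psi(p2))
    upper = h + math.hypot(psi(u1) - psi(p1), psi(p2) - psi(l2))
    return lower, h, upper


if __name__ == "__main__":
    lower, h, upper = h_interval(0.5, 50, 0.2, 50)
    print(lower, h, upper)
    print("at least 'medium':", h > 0 and lower > 0.5)
```

With p₁ = 0.5, p₂ = 0.2 and 50 cases in each sample, the observed h is about 0.64 (‘medium’), but under these assumptions the lower bound is only roughly 0.24. The data therefore support no more than an ‘at least small’ population effect.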

An absolute (unsigned and non-directional) version, |h|, is sometimes cited. We can also compute intervals for unsigned |h|. We will return to this question later.

