Confidence intervals

In this blog we identify efficient methods for computing confidence intervals for many properties.

When we observe any measure from sampled data, we do so in order to estimate the most likely value in the population of data – ‘the real world’, as it were – from which our data was sampled. This is subject to a small number of assumptions (the sample is randomly drawn without bias, for example). But this observed value is merely the best estimate we have, on the information available. Were we to repeat our experiment, sample new data and remeasure the property, we would probably obtain a different result.

A confidence interval is the range of values within which the true value in the population is likely to fall, based on our observed best estimate and other properties of the sample, subject to a certain acceptable level of error, say, 5% or 1%.

A confidence interval is like a blur in a photograph. We know where a feature of an object is, but it may be blurry. With more data, better lenses, sharper focus and longer exposure times, the blur reduces.

In order to make the reader’s task a little easier, I have summarised the main methods for calculating confidence intervals here. If the property you are interested in is not explicitly listed here, it may be found in other linked posts.

1. Binomial proportion p

The following methods for obtaining the confidence interval for a Binomial proportion perform well.

  • The Clopper-Pearson interval
  • The Wilson score interval
  • The Wilson score interval with continuity correction

A Binomial proportion, p ∈ [0, 1], represents the proportion of instances of a particular type of linguistic event, which we might call A, in a random sample of interchangeable events of either A or B. In corpus linguistics this means we need to be confident (as far as it is possible) that all instances of an event in our sample can genuinely alternate (all cases of A may be B and vice versa).

These confidence intervals express the range of values where a possible population value, P, is not significantly different from the observed value p at a given error level α. This means that they are a visual manifestation of a simple significance test, where all points beyond the interval are considered significantly different from the observed value p. The difference between the intervals is due to the significance test they are derived from (respectively: Binomial test, Normal z test, z test with continuity correction).

As well as my book, Wallis (2021), a good place to start reading is Wallis (2013), Binomial confidence intervals and contingency tests.

The ‘exact’ Clopper-Pearson interval is obtained by a search procedure from the Binomial distribution. As a result, it is not easily generalised to larger sample sizes. Usually a better option is to employ the Wilson score interval (Wilson 1927), which inverts the Normal approximation to the Binomial and can be calculated by a formula. This interval may also accept a continuity correction and other adjustments for properties of the sample.
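For concreteness, here is a minimal sketch of the Wilson score calculation in Python. This is my own illustration, not code from the linked posts: the function name wilson() and the example figures (20 cases of A out of 80 events) are invented, and the continuity-corrected and ‘exact’ Clopper-Pearson variants are left to the references.

```python
# Minimal sketch: Wilson score interval (Wilson 1927) for p = f/n.
# Illustrative only -- see Wallis (2013, 2021) for the full treatment.
from statistics import NormalDist


def wilson(f, n, alpha=0.05):
    """Return the Wilson score interval for f cases out of n events."""
    p = f / n
    z = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed critical value, ~1.96 at alpha = 0.05
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    spread = z * (p * (1 - p) / n + z * z / (4 * n * n)) ** 0.5
    return (centre - spread) / denom, (centre + spread) / denom


# Invented example: 20 cases of type A out of 80 alternating events.
lo, hi = wilson(20, 80)
print(f"p = 0.250, 95% Wilson interval = ({lo:.3f}, {hi:.3f})")

# Read as an inverted significance test: any population value P outside
# (lo, hi) is significantly different from the observed p at the 5% level.
```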


Further evaluation of Binomial confidence intervals

Abstract

Wallis (2013) provides an account of an empirical evaluation of Binomial confidence intervals and contingency test formulae. The main take-home message of that article was that it is possible to evaluate statistical methods objectively and provide advice to researchers that is based on an objective computational assessment.

In this article we develop that evaluation further by re-weighting estimates of error using Binomial and Fisher weighting, which is equivalent to an ‘exhaustive Monte-Carlo simulation’. We also develop an argument concerning key attributes of difference intervals: we are not merely concerned with identifying when differences are zero (conventionally equivalent to a significance test), but also with accurate estimation when differences may be non-zero (necessary for plotting data and comparing differences).
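To give a flavour of what Binomial weighting means in practice, the sketch below estimates the Type I error rate of the Wilson interval for a given true proportion P and sample size n by visiting every possible observation f = 0..n and weighting each outcome by its Binomial probability, i.e. an exhaustive rather than sampled simulation. It illustrates the principle only; it is not the evaluation code behind the paper, and the parameter values are invented.

```python
# Sketch of Binomial ('exhaustive Monte-Carlo') weighting: for a true
# population proportion P and sample size n, test every possible observation
# f = 0..n against P and weight the outcome by its Binomial probability.
from math import comb
from statistics import NormalDist


def wilson(f, n, alpha=0.05):
    p = f / n
    z = NormalDist().inv_cdf(1 - alpha / 2)
    denom, centre = 1 + z * z / n, p + z * z / (2 * n)
    spread = z * (p * (1 - p) / n + z * z / (4 * n * n)) ** 0.5
    return (centre - spread) / denom, (centre + spread) / denom


def type_i_rate(P, n, alpha=0.05):
    """Binomial-weighted Type I error rate of the Wilson interval at (P, n)."""
    rate = 0.0
    for f in range(n + 1):
        weight = comb(n, f) * P ** f * (1 - P) ** (n - f)   # B(f; n, P)
        lo, hi = wilson(f, n, alpha)
        if P < lo or P > hi:           # interval wrongly excludes the true P
            rate += weight
    return rate


print(type_i_rate(P=0.3, n=20))        # ideally close to alpha = 0.05
```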

1. Introduction

All statistical procedures may be evaluated in terms of the rates of two distinct types of error.

  • Type I errors (false positives): this is evidence of so-called ‘radical’ or ‘anti-conservative’ behaviour, i.e. rejecting null hypotheses which should not have been rejected, and
  • Type II errors (false negatives): this is evidence of ‘conservative’ behaviour, i.e. retaining or failing to reject null hypotheses unnecessarily.

It is customary to treat these errors separately because the consequences of rejecting and retaining a null hypothesis are qualitatively distinct.

Detecting direction in interaction evidence

Introduction

I have previously argued (Wallis 2014) that interaction evidence is the most fruitful type of corpus linguistics evidence for grammatical research (and doubtless for many other areas of linguistics).

Frequency evidence, which we can write as p(x), the probability of x occurring, concerns itself simply with the overall distribution of a linguistic phenomenon x – such as whether informal written English has a higher proportion of interrogative clauses than formal written English. In order to calculate frequency evidence we must define x, i.e. decide how to identify interrogative clauses. We must also pick an appropriate baseline n for this evaluation, i.e. we need to decide whether to use words, clauses, or any other structure to identify locations where an interrogative clause may occur.
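As a toy illustration of why the baseline matters (all figures invented), the same raw frequency of interrogative clauses yields very different values of p(x) depending on whether words or clauses are counted as potential locations:

```python
# Invented counts: how the choice of baseline n changes frequency evidence p(x).
f_interrogative = 120                  # hypothetical count of interrogative clauses
n_words, n_clauses = 40_000, 5_200     # hypothetical word and clause totals

print(f_interrogative / n_words)       # p(x) per word   = 0.003
print(f_interrogative / n_clauses)     # p(x) per clause ~ 0.023
```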

Interaction evidence is different. It is a statistical correlation between a decision that a writer or speaker makes at one part of a text, which we will label point A, and a decision at another part, point B. The idea is shown schematically in Figure 1. A and B are separate ‘decision points’ in a given relationship (e.g. lexical adjacency), which can also be considered as ‘variables’.

Figure 1: Associative inference from lexico-grammatical choice variable A to variable B (sketch).
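As a concrete (and entirely invented) illustration of this kind of evidence, the sketch below cross-tabulates the choices made at A and B and computes a plain 2 × 2 chi-square statistic and phi score. This is a generic measure of association for illustration only, not the specific method developed in the paper.

```python
# Minimal sketch of interaction evidence between two choice points A and B,
# using an invented 2x2 contingency table and a plain 2x2 chi-square statistic.

# rows: choice at A (a1, a2); columns: choice at B (b1, b2)
table = [[30, 10],
         [20, 40]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

chi_sq = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * col_totals[j] / n
        chi_sq += (table[i][j] - expected) ** 2 / expected

phi = (chi_sq / n) ** 0.5      # effect size: strength of the A-B association
print(f"chi-square = {chi_sq:.2f}, phi = {phi:.3f}")
# A chi-square above the 1 d.f. critical value (3.84 at alpha = 0.05) indicates
# that the choice at B is not independent of the choice at A.
```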

This class of evidence is used in a wide range of computational algorithms. These include collocation methods, part-of-speech taggers, and probabilistic parsers. Despite the promise of interaction evidence, the majority of corpus studies tend to consist of discussions of frequency differences and distributions.

In this paper I want to look at interaction evidence between decisions made more-or-less at the same time by the same speaker/writer. In such circumstances we cannot be sure that just because B follows A in the text, the decision relating to B was made after the decision at A.