Deconstructing the chi-square


Elsewhere in this blog we introduce the concept of statistical significance by considering the reliability of a single sampled observation of a Binomial proportion: an estimate of the probability of selecting an item in the future. This allows us to develop an understanding of the likely distribution of the true value of that probability in the population. In short, were we to make future observations of that item, we could expect each sampled probability to fall within a particular range – a confidence interval – a fixed proportion of the time, such as 19 times in 20 or 99 times in 100. The remaining proportion is termed the error level, because we predict that the true value will fall outside the range 1 time in 20, or 1 time in 100.

This process of making inferences about future observations is termed ‘inferential statistics’. Our approach is to build our understanding in a series of stages based on confidence intervals about the single proportion. Here we will approach the same question by deconstructing the chi-square test.

A core idea of statistical inference is this: randomness is a fact of life. If you sample the same phenomenon multiple times, drawing on different data each time, it is unlikely that the observation will be identical, or – to put it in terms of an observed sample – it is unlikely that the mean value of the observation will be the same. But you are more likely than not to find the new mean near the original mean, and the larger the size of your sample, the more reliable your estimate will be. This, in essence, is the Central Limit Theorem.

This principle applies to the central tendency of data, usually the arithmetic mean, but occasionally a median. It does not concern outliers: extreme but rare events (which, by the way, you should retain in your data, not delete).

We are mainly concerned with Binomial or Multinomial proportions, i.e. the fraction of cases sampled which have a particular property. A Binomial proportion is a statement about the sample, a simple fraction p = f / n. But it is also the sample mean probability of selecting a value. Suppose we selected a random case from the sample. In the absence of any other knowledge about that case, the average chance that X = x₁ is also p.

The same principle applies to the mean of Real or Integer values, for which one might use Welch’s or Student’s t test, and the median rank of Ordinal data, for which a Mann-Whitney U test may be appropriate.

With this in mind, we can form an understanding of significance, or to be precise, significant difference. The ‘difference’ referred to here is the difference between an uncertain observed value and a predicted or known population value, d = p – P, or the difference between two uncertain observed values, d = p₂ – p₁. The first of these differences is found in a single-sample z test, the second in a two-sample z test. See Wallis (2013b).


Figure 1. The single-sample population z test. The statistical model assumes that future unobserved samples are Normally distributed, centred on the population mean P. Distance d is compared with a critical threshold, zα/2 · S, to carry out the test.

A significance test is created by comparing an observed difference with a second element, a critical threshold extrapolated from the underlying statistical model of variation.
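The test in Figure 1 is simple enough to sketch in a few lines of Python. The function name, the default critical value (α = 0.05, two-tailed) and the example figures below are my own illustrative choices:

```python
import math

def single_sample_z_test(p, P, n, z=1.959964):
    """Single-sample population z test (Figure 1): compare the
    distance d = p - P with the critical threshold z * S."""
    S = math.sqrt(P * (1 - P) / n)  # standard deviation of the model about P
    d = p - P                       # observed difference
    return abs(d) > z * S           # True if the difference is significant

# 55 hits out of n = 100 observations, against an expected P = 0.4:
print(single_sample_z_test(55 / 100, 0.4, 100))
```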

Boundaries in nature

Although we are primarily concerned with Binomial probabilities in this blog, it is occasionally worth a detour to make a point.

A common bias I witness among researchers in discussing statistics is the intuition (presumption) that distributions are Gaussian (Normal) and symmetric. But many naturally-occurring distributions are not Normal, and a key reason is the influence of boundary conditions, as in this simple example.

Even for ostensibly Real variables, unbounded behaviour is unusual. Nature is full of boundaries.

Consequently, mathematical models that incorporate boundaries can sometimes offer a fresh perspective on old problems. Gould (1996) discusses a prediction in evolutionary biology regarding the expected distribution of biomass for organisms of a range of complexity (or scale), from those composed of a single cell to those made up of trillions of cells, like humans. His argument captures an idea about evolution that places the emphasis not on the most complex or ‘highest stages’ of evolution (as conventionally taught), but rather on the plurality of blindly random evolutionary pathways. Life becomes more complex due to random variation and stable niches (‘local maxima’) rather than some external global tendency, such as a teleological advantage of complexity for survival.

Gould’s argument may be summarised in the following way. Through blind random Darwinian evolution, simple organisms may evolve into more complex ones (‘complexity’ measured as numbers of cells or organism size), but at the same time others may evolve into simpler, but perhaps equally successful ones. ‘Success’ here means reproductive survival – producing new organisms of the same scale or greater that survive to reproduce themselves.

His second premise is also non-controversial. Every organism must have at least one cell and all the first lifeforms were unicellular.

Now run time’s arrow forwards. Assuming a constant and equal rate of evolution, by simulation we can obtain a range of distributions like those in the Figure below.


Gould’s Poisson projection of evolution, simulated by a very simple algorithm. It ‘evolves’ organisms at a constant rate r (here 0.001) from time t = 0 when all lifeforms are unicellular (c = 1). Where c = 1, all evolution is upwards (the boundary condition), otherwise it is equally subdivided. By t = 1,157, approximately the same number have complexity c = 2. Note the long tail that extends over time, and the consequent potential for a sentient organism at c = 10 to view evolution as a story leading inevitably to their perfection!
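The simulation can be sketched as follows. The update rule is my reading of the caption: at each step a fraction r of the population at every complexity level evolves; at the c = 1 boundary all of it moves upward, while elsewhere it is equally subdivided between simpler and more complex forms (the top of the simulated range is truncated):

```python
def evolve(steps, r=0.001, levels=12):
    """Gould's boundary-condition model: mass[i] holds the proportion
    of lifeforms at complexity level c = i + 1."""
    mass = [0.0] * levels
    mass[0] = 1.0                      # t = 0: all lifeforms are unicellular
    for _ in range(steps):
        moved = [m * r for m in mass]  # fraction evolving this step
        nxt = [m - d for m, d in zip(mass, moved)]
        nxt[1] += moved[0]             # c = 1: all evolution is upwards
        for c in range(1, levels):
            up = moved[c] / 2          # elsewhere it is equally subdivided
            if c + 1 < levels:
                nxt[c + 1] += up
            else:
                nxt[c] += up           # reflect at the top of the range
            nxt[c - 1] += moved[c] - up
        mass = nxt
    return mass

mass = evolve(1157)  # a mode remains at low complexity, with a long tail
                     # of rarer, more complex organisms
```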


Confidence intervals on pairwise φ statistics


Cramér’s φ is an effect size measure used for evaluating correlations in contingency tables. In simple terms, a large φ score means that the two variables have a large effect on each other, and a small φ score means they have a small effect.

φ is closely related to χ², but it factors out the ‘weight of evidence’ and concentrates only on the slope. The simplest definition of φ is the unsigned formula

φ ≡ √(χ² / (N(k – 1))),  (1)

where k = min(r, c), the minimum of the number of rows and columns. In a 2 × 2 table, unsigned φ is simply φ = √(χ² / N).
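Equation (1) is easy to compute from raw cell frequencies. The sketch below (my own code, not from the paper) derives χ² from an r × c contingency table and then Cramér’s φ:

```python
import math

def cramers_phi(table):
    """Unsigned Cramér's phi, equation (1), for a contingency table
    given as a list of rows of cell frequencies."""
    N = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / N  # expected cell frequency
            chi2 += (o - e) ** 2 / e
    k = min(len(table), len(table[0]))             # k = min(r, c)
    return math.sqrt(chi2 / (N * (k - 1)))

# An identity matrix scores 1; a flat table scores 0:
print(cramers_phi([[50, 0], [0, 50]]), cramers_phi([[25, 25], [25, 25]]))
```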

In Wallis (2012), I made a number of observations about φ.

  • It is probabilistic, φ ∈ [0, 1].
  • φ is the best estimate of the population interdependent probability, p(XY). It measures the linear interpolation from flat to identity matrix.
  • It is non-directional, so φ(X, Y) ≡ φ(Y, X).

Whereas in a larger table there are multiple degrees of freedom, and therefore many ways one might obtain the same φ score, in a 2 × 2 table φ may usefully be signed, in which case φ ∈ [-1, 1]. A signed φ obtains a different score for an increase and a decrease in proportion.

φ ≡ (ad – bc) / √((a + b)(c + d)(a + c)(b + d)),  (2)

where a, b, c and d are cell scores in sequence, i.e. [[a b][c d]]:

      x₁   x₂
y₁    a    b
y₂    c    d
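Equation (2) translates directly into code; a minimal sketch (the function name is mine):

```python
import math

def signed_phi(a, b, c, d):
    """Signed 2x2 phi, equation (2), with cells in sequence [[a b][c d]]."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

# An increase and a decrease of the same magnitude differ only in sign:
print(signed_phi(40, 10, 10, 40))
print(signed_phi(10, 40, 40, 10))
```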


Correcting for continuity


Many conventional statistical methods employ the Normal approximation to the Binomial distribution (see Binomial → Normal → Wilson), either explicitly or buried in formulae.

The well-known Gaussian population interval (1) is

Gaussian interval (E⁻, E⁺) ≡ P ± z√(P(1 – P)/n),  (1)

where n represents the size of the sample, and z the two-tailed critical value for the Normal distribution at an error level α, more properly written zα/2. The standard deviation of the population proportion P is S = √(P(1 – P)/n), so we could abbreviate the above to (E⁻, E⁺) ≡ P ± zS.

When these methods require us to calculate a confidence interval about an observed proportion, p, we must invert the Normal formula using the Wilson score interval formula (Equation (2)).

Wilson score interval (w⁻, w⁺) ≡ [p + z²/2n ± z√(p(1 – p)/n + z²/4n²)] / [1 + z²/n].  (2)

In a 2013 paper for JQL (Wallis 2013a), I referred to this inversion process as the ‘interval equality principle’. This means that if we take p = E⁻ (the Gaussian lower bound of P from (1)), the Wilson upper bound that results, w⁺, will equal P. Similarly, for p = E⁺, the lower bound of p, w⁻, will equal P.
We might write this relationship as

p ≡ GaussianLower(WilsonUpper(p, n, α), n, α), or, alternatively
P ≡ WilsonLower(GaussianUpper(P, n, α), n, α), etc. (3)

where E⁻ = GaussianLower(P, n, α), w⁺ = WilsonUpper(p, n, α), etc.

Note. The parameters n and α become useful later on. At this stage the inversion concerns only the first parameter, p or P.

Nonetheless the general principle is that if you want to calculate an interval about an observed proportion p, you can derive it by inverting the function for the interval about the expected population proportion P, and swapping the bounds (so ‘Lower’ becomes ‘Upper’ and vice versa).
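The principle is easy to verify numerically. Below is a sketch of both interval functions (my code; z is fixed at the two-tailed 0.05 critical value), checking relationship (3):

```python
import math

Z = 1.959964  # two-tailed critical value of the Normal distribution, alpha = 0.05

def gaussian_interval(P, n, z=Z):
    """Equation (1): Gaussian interval about the population proportion P."""
    S = math.sqrt(P * (1 - P) / n)
    return P - z * S, P + z * S

def wilson_interval(p, n, z=Z):
    """Equation (2): Wilson score interval about an observed proportion p."""
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom

# Interval equality: WilsonLower(GaussianUpper(P, n), n) recovers P.
P, n = 0.3, 50
E_minus, E_plus = gaussian_interval(P, n)
w_minus, w_plus = wilson_interval(E_plus, n)
print(w_minus)  # equals P, up to floating-point error
```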

In the paper, using this approach I performed a series of computational evaluations of the performance of different interval calculations, following in the footsteps of more notable predecessors. Comparison with the analogous interval calculated directly from the Binomial distribution showed that a continuity-corrected version of the Wilson score interval performed accurately.

UCL Summer School in English Corpus Linguistics 2019

I am pleased to announce the seventh annual Summer School in English Corpus Linguistics, to be held at University College London from 1 to 3 July.

The Summer School is a short three-day intensive course aimed at PhD-level students and researchers who wish to get to grips with Corpus Linguistics.

Please note that this course is very popular and numbers are deliberately limited, so places are allocated on a first-come, first-served basis! You will be taught in a small group by a teaching team.

Each day begins with a theory lecture, followed by a guided hands-on workshop with corpora, and a more self-directed and supported practical session in the afternoon.


The other end of the telescope


The standard approach to teaching (and thus thinking about) statistics is based on projecting distributions of expected values. The distribution of an expected value is a set of probabilities that predict what the value will be, according to a mathematical model of what you predict should happen.

For the experimentalist, this distribution is the imaginary distribution of very many repetitions of the same experiment that you may have just undertaken. It is the output of a mathematical model.

  • Note that this idea of a projected distribution is not the same as the term ‘expected distribution’. An expected distribution is a series of values you predict your data should match.
  • Thus in what follows we simply compare a single expected value P with an observed value p. This can be thought of as comparing the expected distribution E = {P, 1 – P} with the observed distribution O = {p, 1 – p}.
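This comparison of O with E is exactly what a simple goodness-of-fit χ² test carries out, and deconstructing it shows that with one degree of freedom, χ² is just the square of the single-sample z score. A sketch (my own code and example figures):

```python
import math

def chi_square_gof(f, n, P):
    """Goodness-of-fit chi-square for observed counts O = {f, n - f}
    against expected counts E = {nP, n(1 - P)}."""
    observed = [f, n - f]
    expected = [n * P, n * (1 - P)]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def z_score(p, P, n):
    """Single-sample z score: difference over standard deviation."""
    return (p - P) / math.sqrt(P * (1 - P) / n)

# chi-square equals z squared for the 2 x 1 case:
print(chi_square_gof(55, 100, 0.4))      # 9.375
print(z_score(55 / 100, 0.4, 100) ** 2)  # same value, up to floating-point error
```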

Thinking about this projected distribution represents a colossal feat of imagination: it is a projection of what you think would happen if only you had world enough and time to repeat your experiment, again and again. But often you can’t get more data. Perhaps the effort to collect your data was huge, or the data is from a finite set of available data (historical documents, patients with a rare condition, etc.). Actual replication may be impossible for material reasons.

In general, distributions of this kind are extremely hard to imagine, because they are not part of our directly-observed experience. See Why is statistics difficult? for more on this. So we already have an uphill task in getting to grips with this kind of reasoning.

Significant difference (often shortened to ‘significance’) refers to the difference between your observations (the ‘observed distribution’) and what you expect to see (the expected distribution). But to evaluate whether a numerical difference is significant, we have to take into account both the shape and spread of this projected distribution of expected values.

When you select a statistical test you do two things:

  • you choose a mathematical model which projects a distribution of possible values, and
  • you choose a way of calculating significant difference.

The problem is that in many cases it is very difficult to imagine this projected distribution, or — which amounts to the same thing — the implications of the statistical model.

When tests are selected, the main criterion you have to consider concerns the type of data being analysed (an ‘ordinal scale’, a ‘categorical scale’, a ‘ratio scale’, and so on). But the scale of measurement is only one of several parameters that allow us to predict how random selection might affect the resampling of data.

A mathematical model contains what are usually called assumptions, although it might be more accurate to call them ‘preconditions’ or parameters. If these assumptions about your data are incorrect, the test is likely to give an inaccurate result. This principle is not either/or, but can be thought of as a scale of ‘degradation’. The less the data conforms to these assumptions, the more likely your test is to give the wrong answer.

This is particularly problematic in some computational applications. The programmer could not imagine the projected distribution, so they tweaked various parameters until the program ‘worked’. In a ‘black-box’ algorithm this might not matter. If it appears to work, who cares if the algorithm is not very principled? Performance might be less than optimal, but it may still produce valuable and interesting results.

But in science there really should be no such excuse.

The question I have been asking myself for the last ten years or so is simply: can we do better? Is there a better way to teach (and think about) statistics than from the perspective of distributions projected by counter-intuitive mathematical models (taken on trust) and significance tests?

Plotting the Wilson distribution

Introduction Full Paper (PDF)

We have discussed the Wilson score interval at length elsewhere (Wallis 2013a, b). Given an observed Binomial proportion p = f / n, where f is the observed frequency out of n observations, and a confidence level 1 – α, the interval represents the two-tailed range of values where P, the true proportion in the population, is likely to be found. Note that f and n are integers, so whereas P is a probability, p is a proper fraction (a rational number).

The interval provides a robust method (Newcombe 1998, Wallis 2013a) for directly estimating confidence intervals on these simple observations. It can take a correction for continuity in circumstances where it is desired to perform a more conservative test and err on the side of caution. We have also shown how it can be employed in logistic regression (Wallis 2015).

The point of this paper is to explore methods for computing Wilson distributions, i.e. the analogue of the Normal distribution for this interval. There are at least two good reasons why we might wish to do this.

The first is to shed light on the performance of the generating function (formula), interval and distribution itself. Plotting an interval means selecting a single error level α, whereas visualising the distribution allows us to see how the function performs over the range of possible values for α, for different values of p and n.
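One simple way to see this, before attempting a full plot, is to tabulate Wilson intervals while sweeping α: each (w⁻, w⁺) pair is a pair of quantiles of the Wilson distribution. A sketch using the standard library’s inverse Normal function (the values of p and n are arbitrary illustrations):

```python
import math
from statistics import NormalDist

def wilson(p, n, alpha):
    """Wilson score interval bounds for error level alpha."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom

# Quantile pairs for p = 0.2, n = 20, sweeping the error level:
for alpha in (0.5, 0.2, 0.1, 0.05, 0.01):
    lo, hi = wilson(0.2, 20, alpha)
    print("alpha = %.2f: (%.3f, %.3f)" % (alpha, lo, hi))
# Note the asymmetry: the interval is bounded away from 0, skewed toward 0.5.
```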

A second good reason is to counteract the tendency, common in too many presentations of statistics, to present the Gaussian (‘Normal’) distribution as if it were some kind of ‘universal law of data’, a mistaken corollary of the Central Limit Theorem. This is particularly unwise in the case of observations of Binomial proportions, which are strictly bounded at 0 and 1.