## Boundaries in nature

Although we are primarily concerned with Binomial probabilities in this blog, it is occasionally worth a detour to make a point.

A common bias I witness among researchers in discussing statistics is the intuition (presumption) that distributions are Gaussian (Normal) and symmetric.  But many naturally-occurring distributions are not Normal, and a key reason is the influence of boundary conditions.

Even for ostensibly Real variables, unbounded behaviour is unusual. Nature is full of boundaries.

Consequently, mathematical models that incorporate boundaries can sometimes offer a fresh perspective on old problems. Gould (1996) discusses a prediction in evolutionary biology regarding the expected distribution of biomass for organisms of a range of complexity (or scale), from those composed of a single cell to those made up of trillions of cells, like humans. His argument captures an idea about evolution that places the emphasis not on the most complex or ‘highest stages’ of evolution (as conventionally taught), but rather on the plurality of blindly random evolutionary pathways. Life becomes more complex due to random variation and stable niches (‘local maxima’) rather than some external global tendency, such as a teleological advantage of complexity for survival.

Gould’s argument may be summarised in the following way. Through blind random Darwinian evolution, simple organisms may evolve into more complex ones (‘complexity’ measured as numbers of cells or organism size), but at the same time others may evolve into simpler, but perhaps equally successful ones. ‘Success’ here means reproductive survival – producing new organisms of the same scale or greater that survive to reproduce themselves.

His second premise is also non-controversial. Every organism must have at least one cell and all the first lifeforms were unicellular.

Now, run time’s arrow forwards. Assuming a constant and an equal rate of evolution, by simulation we can obtain a range of distributions like those in the Figure below.

## The other end of the telescope

### Introduction

The standard approach to teaching (and thus thinking about) statistics is based on projecting distributions of ranges of expected values. The distribution of an expected value is a set of probabilities that predict what the value will be, according to a mathematical model of what you predict should happen.

For the experimentalist, this distribution is the imaginary distribution of very many repetitions of the same experiment that you may have just undertaken. It is the output of a mathematical model.

• Note that this idea of a projected distribution is not the same as the term ‘expected distribution’. An expected distribution is a series of values you predict your data should match.
• Thus in what follows we simply compare a single expected value P with an observed value p. This can be thought of as comparing the expected distribution E = {P, 1 – P} with the observed distribution O = {p, 1 – p}.

Thinking about this projected distribution represents a colossal feat of imagination: it is a projection of what you think would happen if only you had world enough and time to repeat your experiment, again and again. But often you can’t get more data. Perhaps the effort to collect your data was huge, or the data is from a finite set of available data (historical documents, patients with a rare condition, etc.). Actual replication may be impossible for material reasons.

In general, distributions of this kind are extremely hard to imagine, because they are not part of our directly-observed experience. See Why is statistics difficult? for more on this. So we already have an uphill task in getting to grips with this kind of reasoning.

Significant difference (often shortened to ‘significance’) refers to the difference between your observations (the ‘observed distribution’) and what you expect to see (the expected distribution). But to evaluate whether a numerical difference is significant, we have to take into account both the shape and spread of this projected distribution of expected values.

When you select a statistical test you do two things:

• you choose a mathematical model which projects a distribution of possible values, and
• you choose a way of calculating significant difference.

The problem is that in many cases it is very difficult to imagine this projected distribution, or — which amounts to the same thing — the implications of the statistical model.

When tests are selected, the main criterion you have to consider concerns the type of data being analysed (an ‘ordinal scale’, a ‘categorical scale’, a ‘ratio scale’, and so on). But the scale of measurement is only one of several parameters that allows us to predict how random selection might affect the resampling of data.

A mathematical model contains what are usually called assumptions, although it might be more accurate to call them ‘preconditions’ or parameters. If these assumptions about your data are incorrect, the test is likely to give an inaccurate result. This principle is not either/or, but can be thought of as a scale of ‘degradation’. The less the data conforms to these assumptions, the more likely your test is to give the wrong answer.

This is particularly problematic in some computational applications. The programmer could not imagine the projected distribution, so they tweaked various parameters until the program ‘worked’. In a ‘black-box’ algorithm this might not matter. If it appears to work, who cares if the algorithm is not very principled? Performance might be less than optimal, but it may still produce valuable and interesting results.

But in science there really should be no such excuse.

The question I have been asking myself for the last ten years or so is simply can we do better? Is there a better way to teach (and think about) statistics than from the perspective of distributions projected by counter-intuitive mathematical models (taken on trust) and significance tests? Continue reading “The other end of the telescope”

## Verb Phrase book published

### Why this book?

The grammar of English is often thought to be stable over time. However a new book, edited by Bas Aarts, Joanne Close, Geoffrey Leech and Sean Wallis, The Verb Phrase in English: investigating recent language change with corpora (Cambridge University Press, 2013) presents a body of research from linguists that shows that using natural language corpora one can find changes within a core element of grammar, the Verb Phrase, over a span of decades rather than centuries.

The book draws from papers first presented at a symposium on the verb phrase organised for the Survey of English Usage’s 50th anniversary and on research from the Changing English Verb Phrase project.