## Boundaries in nature

Although we are primarily concerned with Binomial probabilities in this blog, it is occasionally worth a detour to make a point.

A common bias I witness among researchers in discussing statistics is the intuition (presumption) that distributions are Gaussian (Normal) and symmetric.  But many naturally-occurring distributions are not Normal, and a key reason is the influence of boundary conditions.

Even for ostensibly Real variables, unbounded behaviour is unusual. Nature is full of boundaries.

Consequently, mathematical models that incorporate boundaries can sometimes offer a fresh perspective on old problems. Gould (1996) discusses a prediction in evolutionary biology regarding the expected distribution of biomass for organisms of a range of complexity (or scale), from those composed of a single cell to those made up of trillions of cells, like humans. His argument captures an idea about evolution that places the emphasis not on the most complex or ‘highest stages’ of evolution (as conventionally taught), but rather on the plurality of blindly random evolutionary pathways. Life becomes more complex due to random variation and stable niches (‘local maxima’) rather than some external global tendency, such as a teleological advantage of complexity for survival.

Gould’s argument may be summarised in the following way. Through blind random Darwinian evolution, simple organisms may evolve into more complex ones (‘complexity’ measured as numbers of cells or organism size), but at the same time others may evolve into simpler, but perhaps equally successful ones. ‘Success’ here means reproductive survival – producing new organisms of the same scale or greater that survive to reproduce themselves.

His second premise is also non-controversial. Every organism must have at least one cell and all the first lifeforms were unicellular.

Now, run time’s arrow forwards. Assuming a constant and equal rate of evolution, we can obtain by simulation a range of distributions like those in the Figure below.
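To give a feel for the kind of simulation meant here, the following is a minimal sketch, not the model behind the Figure. It assumes that every lineage starts at a complexity of 1 (a single cell), that at each time step it moves up or down by one unit with equal probability, and that any move below 1 is simply blocked. The names `simulate_complexity` and `summarise`, the step size, and the number of lineages are all illustrative choices of mine, and the sketch counts lineages rather than biomass.

```python
import random
import statistics


def simulate_complexity(n_lineages=5000, n_steps=100, seed=1):
    """Random walk in 'complexity' with a lower boundary at 1.

    Assumption: each lineage is equally likely to become more or less
    complex at every step, but no organism can have fewer than one cell.
    """
    rng = random.Random(seed)
    levels = [1] * n_lineages  # all first lifeforms are unicellular
    for _ in range(n_steps):
        for i in range(n_lineages):
            step = rng.choice((-1, 1))
            # The boundary: a move below complexity 1 is blocked.
            levels[i] = max(1, levels[i] + step)
    return levels


def summarise(levels):
    """Return (minimum, median, mean, maximum) of the simulated levels."""
    return (
        min(levels),
        statistics.median(levels),
        statistics.mean(levels),
        max(levels),
    )


if __name__ == "__main__":
    for n_steps in (10, 100, 400):
        lo, med, mean, hi = summarise(simulate_complexity(n_steps=n_steps))
        print(f"steps={n_steps}: min={lo} median={med} mean={mean:.2f} max={hi}")
```

Under these assumptions there is no built-in drift towards complexity, yet the upper tail of the distribution extends further as time runs on, while most lineages remain near the boundary. The asymmetry comes from the boundary alone.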

## Random sampling, corpora and case interaction

### Introduction

One of the main unsolved statistical problems in corpus linguistics is the following.

Statistical methods assume that samples under study are taken from the population at random.

Text corpora are only partially random. Corpora consist of passages of running text, where words, phrases, clauses and speech acts are structured together to describe the passage.

The selection of text passages for inclusion in a corpus is potentially random. However, cases within each text may not be independent.

This randomness requirement is foundationally important. It governs our ability to generalise from the sample to the population.

The corollary of random sampling is that cases are independent of each other.

I see this problem as being fundamental to corpus linguistics as a credible experimental practice (to the point that I forced myself to relearn statistics from first principles after some twenty years in order to address it). In this blog entry I’m going to try to outline the problem and what it means in practice.

The saving grace is that statistical generalisation is premised on a mathematical model. The problem is not all-or-nothing. This means that we can, with care, attempt to address it proportionately.
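To illustrate why the problem is a matter of degree, here is a toy simulation. It is a sketch under my own simplifying assumptions, not the model of case interaction discussed in this post: each text is given its own propensity to use a feature, drawn uniformly within `spread` of an overall probability `p`, and cases within a text are then drawn independently at that text-level rate. The function names, the corpus size (20 texts of 50 cases) and the value of `p` are all hypothetical.

```python
import random
import statistics


def sample_proportion(n_texts, cases_per_text, p, spread, rng):
    """Observed proportion in one simulated corpus sample.

    spread = 0 means every case is an independent draw with probability p;
    larger values make cases from the same text more alike (clustering).
    """
    hits = 0
    for _ in range(n_texts):
        # Assumed text-level propensity, clipped to the [0, 1] range.
        p_text = min(1.0, max(0.0, p + rng.uniform(-spread, spread)))
        hits += sum(rng.random() < p_text for _ in range(cases_per_text))
    return hits / (n_texts * cases_per_text)


def spread_of_estimates(spread, n_samples=400, seed=7):
    """Standard deviation of the observed proportion over repeated samples."""
    rng = random.Random(seed)
    props = [
        sample_proportion(20, 50, 0.3, spread, rng) for _ in range(n_samples)
    ]
    return statistics.stdev(props)


if __name__ == "__main__":
    for spread in (0.0, 0.1, 0.2):
        print(f"spread={spread}: sd of estimate = {spread_of_estimates(spread):.4f}")
```

In this toy model the observed proportion varies more from sample to sample as within-text clustering increases, so an interval computed as if all cases were independent would be too narrow, and by an amount that depends on how strongly cases interact.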

[Note: To actually solve the problem would require the integration of multiple sources of evidence into an a posteriori model of case interaction that computed marginal ‘independence probabilities’ for each case abstracted from the corpus. This is far beyond what any individual linguist could reasonably be expected to do unless an out-of-the-box solution is developed (I’m working on it, albeit slowly, so if you have ideas, please do contact me…).]

There are numerous sources of case interaction and clustering in texts. These include conscious repetition of topic words and themes, unconscious tendencies to reuse particular grammatical choices, interaction along axes of, for example, embedding and co-ordination (Wallis 2012a), and structurally overlapping cases (Nelson et al 2002: 272).

In this blog post I first outline the problem and then discuss feasible good practice based on our current technology.