Recently I’ve been working on a problem that besets researchers in corpus linguistics who work with samples which are not drawn randomly from the population but rather are taken from a series of sub-samples. These sub-samples (in our case, texts) may be randomly drawn, but we cannot say the same for any two cases drawn from the same sub-sample. It stands to reason that two cases taken from the same sub-sample are more likely to share a characteristic under study than two cases drawn entirely at random. I introduce the paper elsewhere on my blog.
In this post I want to focus on an interesting and non-trivial result I needed to address along the way. This concerns the concept of variance as it applies to a Binomial distribution.
Most students are familiar with the concept of variance as it applies to a Gaussian (Normal) distribution. A Normal distribution is a continuous symmetric ‘bell-curve’ distribution defined by two variables, the mean and the standard deviation (the square root of the variance). The mean specifies the position of the centre of the distribution and the standard deviation specifies the width of the distribution.
Common statistical methods on Binomial variables, from χ² tests to line fitting, employ a further step. They approximate the Binomial distribution to the Normal distribution. They say, although we know this variable is Binomially distributed, let us assume the distribution is approximately Normal. The variance of the Binomial distribution becomes the variance of the equivalent Normal distribution.
In this methodological tradition, the variance of the Binomial distribution loses its meaning with respect to the Binomial distribution itself. It seems to be only valuable insofar as it allows us to parameterise the equivalent Normal distribution.
What I want to argue is that in fact, the concept of the variance of a Binomial distribution is important in its own right, and we need to understand it with respect to the Binomial distribution, not the Normal distribution. Sometimes it is not necessary to approximate the Binomial to the Normal, and if we can avoid this approximation our results are likely to be stronger as a result.
Every fundamental primer in statistics approaches the problem in the following way.
A Binomial variable is a two-valued variable (hence ‘bi-nomial’). The values can be anything, but let us simply call them, in coin-tossing tradition, ‘heads’ and ‘tails’. The proportion of cases that are heads in any randomly-drawn sample of size n taken from a population, which we might term p, is free to vary from 0 to 1. That is, all n cases in the sample may be heads (p = 1) or all may be tails (p = 0).
Now, suppose we know, Zeus-like, the actual proportion in the population, P. We don’t have to be a deity – we might assume that our coin is unbiased, so P = 0.5 (heads and tails are equally probable) – but a common error is to muddle up big P (the true value in the population) and little p (the observed value in a sample). Let’s leave observed p aside for a minute.
We can calculate the distribution for P and n using the following Binomial formula:
Binomial distribution B(r) = nCr P^r (1 – P)^(n – r), (1)
where r ranges from 0 to n. This means that the probability of obtaining exactly r heads out of n coin tosses is calculated by multiplying
- the combinatorial function nCr (the number of unique ways we can obtain exactly r cases out of n cases);
- the probability that r cases are heads, P^r; and
- the probability that the remainder are tails, (1 – P)^(n – r).
This formula obtains the ideal Binomial distribution.
The graph below shows what this looks like for ten tosses of an unbiased coin, where P = 0.5 and n = 10. The mean of this distribution is nP, i.e. 0.5 × 10 = 5.
Note. Equation (1) also works for a ‘trick’ coin, e.g. where P = 0.9 (9 times out of 10 we obtain heads). Although most primers first show a graph of P = 0.5, few real-world Binomial variables are equiprobable. (Don’t be misled by the symmetry of this graph.)
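Equation (1) can be sketched in a few lines of Python (the function name `binomial_pmf` is my own; the post itself contains no code). The sketch computes the ideal distribution for ten tosses of an unbiased coin and its mean nP:

```python
from math import comb

def binomial_pmf(r, n, P):
    # Equation (1): nCr * P^r * (1 - P)^(n - r)
    return comb(n, r) * P ** r * (1 - P) ** (n - r)

# Ideal Binomial distribution for ten tosses of an unbiased coin
dist = [binomial_pmf(r, 10, 0.5) for r in range(11)]
mean = sum(r * b for r, b in enumerate(dist))  # should equal nP = 5
```

The same function handles the ‘trick’ coin: replacing 0.5 with 0.9 skews the distribution towards r = 9.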
This distribution has a number of important characteristics.
- The most obvious characteristic is that it is discrete – the only possible values of r are integer values from 0 to n. Therefore if we sample 10 coin tosses, an observed probability p could be 0, 0.1, 0.2, right up to 1. If the true value of P was 0.45, we could not observe p = 0.45 if we only had ten coin tosses.
- A less obvious, but important, characteristic is that this distribution is probabilistic – the sum of all columns ∑B(r) = 1.
- Finally, for all values of P other than 0.5, the distribution is asymmetric. See below.
You can also see how unlikely it is that all coins are heads or all tails. The chance of this happening is not zero, but it is small. There is only one possible combination of heads and tails where all ten coins are heads (HHHHHHHHHH) out of 1,024 (2^n) possible patterns. The probability of observing p = 0 is 1 in 1,024.
There are ten ways that one coin will be a tail and nine heads (THHHHHHHHH, HTHHHHHHHH,… HHHHHHHHHT), and so on.
The combinatorial function nCr tells us exactly how many different ways we can obtain r cases out of n potential cases. The full formula is given in Equation (2) below, where x! means the factorial of x, or x(x – 1)(x – 2)…(1).
combinatorial function nCr = n! / ((n – r)! r!). (2)
You should be able to see that in cases where r = 0 or r = n, nCr = 1; where r = 1 or r = n – 1, nCr = n.
If P = 0.5 then the Binomial function (1) above becomes simply
B(r) = nCr P^r (1 – P)^(n – r) = nCr × ½^n.
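A quick numerical check of this simplification, and of the edge cases of Equation (2) noted above, might look like the following (a sketch, not code from the post):

```python
from math import comb

n = 10
# For an unbiased coin, Equation (1) reduces to nCr * (1/2)^n,
# so every term shares the constant factor 1/1024
dist = [comb(n, r) * 0.5 ** n for r in range(n + 1)]
```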
Note that these distributions are clearly asymmetric, being centred at P < 0.5 and bounded by 0 and n. As P approaches zero this asymmetry becomes more acute.
Another aspect we can immediately see from the graphs above is that, as well as becoming less symmetric, as P approaches zero the distribution becomes more concentrated. We say that the variance of the distribution decreases.
The variance of a Binomial distribution on the integer scale r = 0…n can be obtained from the function
(integer) variance S² = nP(1 – P).
To compare different-sized samples, we obviously need to use the same scale. The simplest standardisation is to adopt a probabilistic scale, i.e. where p = 0…1. To do this we divide this formula by n². The variance of a Binomial distribution on a probabilistic scale is obtained from the function
(probabilistic) variance S² = P(1 – P)/n.(3)
Thus if P = 0.5 and n = 10, S² = 0.025. If P = 0.1 and n = 10, S² = 0.009. (You shouldn’t need a calculator to work this out!) This formula has the following properties.
- For the same n > 1, as P tends to zero, P(1 – P) will also tend to 0. (Consider: if a coin had zero chance of being a head, it will always be a tail!)
- For the same P > 0, as n increases, P(1 – P)/n decreases. (Obviously if P = 0 then S² cannot decrease!)
Variance is simply the square of the standard deviation of the same distribution:
standard deviation S ≡ √(P(1 – P)/n).
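The worked values above can be reproduced directly from Equation (3); here is a minimal sketch (the function names are my own):

```python
def binomial_variance(P, n):
    # Equation (3): probabilistic-scale variance P(1 - P)/n
    return P * (1 - P) / n

def binomial_sd(P, n):
    # Standard deviation: the square root of Equation (3)
    return binomial_variance(P, n) ** 0.5
```

For P = 0.5 and n = 10 this returns 0.025, and for P = 0.1 and n = 10, 0.009, as stated above.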
Approximating to the Normal distribution
The concept of variance and standard deviation are usually applied to the Normal distribution. Here they have immediate meaning because, as we noted in the introduction, a Normal distribution can be described by two parameters: the mean, in this case P, and the standard deviation, S.
Indeed, in the same statistics primers, at around this point we are encouraged to set aside what we have learned about the Binomial distribution and simply assume that it is ‘close to’ the Normal distribution N(P, S). We might see comments that this is an acceptable step for large n or where both nP and n(1 – P) > 5.
It is worth emphasising: this step (due to an observation by de Moivre in the 18th century) is an approximation. The Binomial and Normal distributions are different. Here is the distribution for P = 0.3 again, but this time with a Normal distribution approximated to it. The Binomial distribution B is strictly just a series of points (middle lines). A difference between the two distributions that places lines above the tail (indicated) could correspond to a ‘Type I error’.
- Most obviously, the Normal distribution is continuous rather than discrete. This means we can obtain an estimate for the expected probability that p = 0.45.
- Like the Binomial distribution, the standardised Normal distribution is also probabilistic, i.e. the area under the curve sums to 1.
- Finally, the Normal distribution is symmetric. Moreover, it assumes that the observed variable is unbounded. An unbounded variable is free to vary from minus infinity (-∞) to plus infinity (+∞). (This is a corollary: if the variable were bounded, it could not be symmetric.)
It is worth considering this last point. Many statistics textbooks use example variables from the natural and physical sciences.
- For example, the height of children in a class, which we might call H, is usually considered to be an unbounded variable, suitable for the Normal distribution.
- But in fact, the height of children is a bounded variable.
- It has a lower limit. At the risk of stating the obvious, children cannot be less than zero height(!), and indeed, to be permitted to go to school, must be of a certain age and be physically safe to do so. H must have a lower limit rather greater than zero.
- It has an upper limit. A number of factors, from growth rates to the physical strength of bone, limit the possible height of children.
- Far from being unbounded, H is bounded by biology!
What everyone does is assume that the observed mean height is so far from the bounds that although the bounds exist, they have negligible effect on the distribution. (This is not always a healthy assumption, but it is the source of these injunctions to only approximate to the Normal distribution in cases where nP > 5.)
On the other hand, Binomial variables (and the Binomial distribution), are strictly bounded. We may write, e.g. P ∈ [0, 1], which simply means “P ranges from 0 to 1 inclusive”. The probability P may also be expressed as a proportion or percentage, so we might say that a rate can be any value from 0% to 100%.
Observing Binomial distributions
So far we have discussed the ideal Binomial distribution. Equation (1) is the mathematical extrapolation of the likelihood, B(r), of observing exactly r heads in a future sample of n cases drawn randomly from a population if the true rate in the population were P.
In some circumstances we may observe a Binomial distribution. I do this in class with students – each student tosses a coin a fixed number of times and we note down the number of students who had 0 heads, 1 head and so on.
In the paper I am working on, I realised that this principle can also be employed to identify the extent to which a corpus sample might deviate from an ideal random sample for a given variable. This is an important question for corpus linguistics.
The first step is to partition the corpus sample into subsamples according to the text that they are drawn from. To all intents and purposes, these texts can be assumed to be random even if they were not subject to controlled sampling.
Note that two cases drawn from different texts are therefore likely to be independent and equivalent to a pair of cases in a true random sample. However two cases from the same text may share characteristics. There are all sorts of reasons why this is likely to be the case, from a shared topic to personal preferences, priming and other psycholinguistic effects. The reason does not actually matter – we just need to recognise this is likely to be the case.
- Question: How may we measure the deviation of the corpus sample from an ideal random sample?
- Answer: By studying the distribution of these subsamples.
Suppose the subsamples are equivalent to random samples. Even though cases are drawn from the same text, suppose it turns out that the particular variable is not sensitive to context, previous utterances, etc. In this case, we would expect these sub-samples to be Binomially distributed.
To plot the following graph we first ‘quantise’ (round up or down to the nearest value in a fixed series) the observed probability p. The vertical axis, Ψ, is simply the number of texts in the direct conversations category of ICE-GB where the probability that a clause is interrogative, p(inter), is 0, 0.01, 0.02, etc. There are 90 texts in this category. We can see that this distribution is approximately Binomial.
We may calculate the variance of this observed distribution with the following pair of formulae, derived from Sheskin (1997).
The first estimate (4) does not take into account the fact that samples are drawn from a population, whereas the second measure, termed the unbiased estimate of the population variance, does. For that reason, we here use capital P to refer to each probability in the first case and lower case p to refer to observations.
variance of a set of scores s′ss² = ∑(Pi – P)² / t′, (4)
observed between-subsample variance sss² = ∑(pi – p)² / (t′ – 1), (5)
where pi is the observed probability for subsample i out of t′ non-empty text-based subsamples, and mean p = ∑pi / t′. (Note that the mean of samples, p, is not necessarily the same as the mean p summed over all cases.)
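Equation (5) can be sketched in Python as follows (the function name is my own; the post itself is code-free):

```python
def between_subsample_variance(p_obs):
    # Equation (5): unbiased estimate of the population variance from
    # t' equal-sized subsamples, where p_obs holds the proportions p_i
    t = len(p_obs)
    p_mean = sum(p_obs) / t  # mean p = sum(p_i) / t'
    return sum((p - p_mean) ** 2 for p in p_obs) / (t - 1)
```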
Equations (4) and (5) have one deficiency: they assume that each subsample is of the same size. This is fine for classroom coin-tossing, but it is unlikely to be the case in a corpus sample.
The estimate of variance for a set of different-sized subsamples can be obtained from
variance of a set of scores (different sizes) s′ss² = ∑ pri (Pi – P)², (6)
observed between-subsample variance sss² = t′ / (t′ – 1) × ∑ pri (pi – p)², (7)
where pri = ni /n, ni is the size of subsample i, t′ is the number of non-empty samples, count(ni > 0), and n the total sample size, i.e. ∑ni.
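Equation (7) might be sketched as below. The function name is my own, and one detail is an assumption: following the definition given for Equation (5), the mean p is taken as ∑pi / t′ rather than a size-weighted mean.

```python
def weighted_between_subsample_variance(counts, sizes):
    # Equation (7): counts[i] is the number of 'hits' in non-empty
    # subsample i, sizes[i] is its size n_i; weights pr_i = n_i / n
    t = len(sizes)                     # t' non-empty subsamples
    n = sum(sizes)                     # total sample size
    props = [k / m for k, m in zip(counts, sizes)]
    p_mean = sum(props) / t            # mean p = sum(p_i) / t' (assumed)
    s2 = sum((m / n) * (p - p_mean) ** 2 for p, m in zip(props, sizes))
    return t / (t - 1) * s2
```

With equal-sized subsamples this reduces to Equation (5), which serves as a useful sanity check.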
It is possible to prove that if pri is equal to the Binomial probability B(i) in Equation (1), i.e. so that the distribution matches the ideal Binomial, Equation (6) becomes Equation (3).
∑ nCr P^r (1 – P)^(n – r) (r/n – P)² ≡ P(1 – P)/n.
This means that Equation (6) defines the correct mathematical relationship between a Binomial distribution on a probabilistic scale and its expected variance. Another way of putting this is that it is legitimate to apply Equations (6) and (7) to a Binomial variable.
Example: To illustrate this equivalence, consider the following computation for P = 0.3 and n = 2. Equation (3) obtains simply: S² = (0.3 × 0.7)/2 = 0.105.
| r/n | r | nCr | B(r) | B(r) × (r/n – P)² |
|-----|---|-----|------|-------------------|
| 0   | 0 | 1   | 0.49 | 0.0441 |
| 0.5 | 1 | 2   | 0.42 | 0.0168 |
| 1   | 2 | 1   | 0.09 | 0.0441 |
| Total | | | 1.00 | 0.1050 |
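The same computation can be checked mechanically (a sketch, not code from the post): summing B(r) × (r/n – P)² over r reproduces Equation (3).

```python
from math import comb

P, n = 0.3, 2
# Equation (6) with ideal Binomial weights B(r) in place of pr_i
lhs = sum(comb(n, r) * P ** r * (1 - P) ** (n - r) * (r / n - P) ** 2
          for r in range(n + 1))
rhs = P * (1 - P) / n  # Equation (3): 0.105
```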
We can therefore contrast the observed subsample variance with the variance that would be predicted assuming each subsample were a random sample, i.e. the expected Binomial variance, which in this notation would be
predicted between-subsample variance Sss² = p(1 – p) / t′.
If the two variance scores are the same, then to all intents and purposes, our subsamples are random samples, and the entire corpus sample can be considered a random collection of random samples, i.e. a random sample.
However, if the observed subsample variance differs from that predicted, we are entitled to take this into account when considering the variance of the corpus sample. We employ the ratio of variances, Fss, to adjust the sample size accordingly.
cluster-adjustment ratio Fss = Sss² / sss², and (8)
corrected sample size n′ = (n – t′)Fss + t′.
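The adjustment in Equation (8) is a one-liner; here is a minimal sketch (the function name and argument names are mine):

```python
def corrected_sample_size(n, t, S2_pred, s2_obs):
    # Equation (8): F_ss = S_ss^2 / s_ss^2, then n' = (n - t') F_ss + t'
    F = S2_pred / s2_obs
    return (n - t) * F + t
```

If the observed variance is double the predicted variance, Fss = 0.5, and a sample of 100 cases across 10 texts shrinks to an effective n′ of 55; if the two variances agree, n′ = n.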
If the observed sample has a greater variance than the predicted variance, Fss < 1, and we can say that there are fewer truly independent random cases in our overall corpus sample. We increase the uncertainty of our cross-corpus observation: significance tests become stricter, confidence intervals wider, and so on. The corrected sample size has a minimum of t′ independent cases drawn from independent samples.
In the paper, we observe that sometimes Fss > 1 and discuss reasons for this. Suffice it to say it is certainly possible, although this may at first sight appear counter-intuitive.
To illustrate the method, consider the following graph. This is the same data as the figure above. You can download this spreadsheet to inspect the calculation for yourself.
Note that in this case we see a close correspondence between the two predicted distributions – Binomial and Normal. The observed distribution is also approximately Normal (accepting the randomness we would anticipate in any observed distribution of course).
The method of comparing variances we employed makes no assumptions about the Binomial approximating to the Normal distribution.
However, this method usually comes under the umbrella of analysis of variance (ANOVA), which is premised on data being Normally distributed. Instead of assuming that ANOVA might be legitimately employed for Binomial (bounded, asymmetric, discrete) distributions, we were concerned to prove that our definitions of variance were applicable to the Binomial.
Why might this matter? There are two reasons.
- The approximation to the Normal distribution is an approximation, and introduces a number of ‘smoothing’ errors as a result.
- We must ensure that the method is robust for highly skewed values of p.
In the figure above the Normal and Binomial distributions are similar. However, this is not always the case.
Consider the following graph (Figure 4 in the paper). Here data is drawn, not from a single genre, but across the diverse genres contained within the ICE-GB corpus, from the most highly interactive speech contexts to the most didactic of written instructional texts.
The two upper dotted lines are the predicted Normal and Binomial distributions for this observed value of p (0.0399) and t = 500 texts. You can see how the Normal distribution is narrower than the predicted Binomial.
Equation (5) captures the total variance between subsamples in this figure. It is approximately 4% of the predicted variance according to Equation (3).
The lower line is the Normal distribution premised on the observed subsample variance. Again, you can see a large deviation between the observed frequency distribution (bars) and this Normal distribution, which is also clearly clipped by the lower bound at p = 0.
If our method were dependent on the Normal distribution, we simply could not sustain it in highly-skewed contexts such as this.
Sheskin, D.J. 1997. Handbook of Parametric and Nonparametric Statistical Procedures. Boca Raton, Fl: CRC Press.