Not everything that looks like a probability is one.
A variable or function that ranges from 0 to 1 does not necessarily behave like a unitary probability over that range.
Natural probabilities
What we might term a natural probability is a proper fraction of two frequencies, which we might write as p = f / n.
- Provided that f can be any value from 0 to n, p can range from 0 to 1.
- In this formula, f and n must also be natural frequencies, that is, n stands for the size of the set of all cases, and f the size of a true subset of these cases. The term ‘natural’ here refers to the mathematical sense of the set of positive integers.
Aside: In certain models, these frequencies could be obtained from the sum of a set of probability estimates, each representing the probability that the observation was genuinely independent from others in the sample. This might permit a ‘frequency’ to be observed that was not a natural number. But the principle is the same.
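To make the aside concrete, here is a minimal sketch of how non-integer 'frequencies' might arise. It assumes each observation carries a weight between 0 and 1, standing for the estimated probability that it is genuinely independent of the others. The function name `weighted_frequencies`, the weights and the data are all hypothetical, chosen only for illustration. This is one possible reading of such a model, not a prescribed method.

```python
def weighted_frequencies(is_type_a, weights):
    """Return (f, n) as sums of per-observation weights.

    is_type_a: list of bools, True where the observation is of Type A.
    weights:   list of floats in [0, 1] (hypothetical independence estimates).
    """
    if len(is_type_a) != len(weights):
        raise ValueError("is_type_a and weights must have the same length")
    if any(not 0.0 <= w <= 1.0 for w in weights):
        raise ValueError("each weight must lie between 0 and 1")

    # n is the weighted count of all cases; f is the weighted count of Type A cases.
    n = sum(weights)
    f = sum(w for a, w in zip(is_type_a, weights) if a)
    return f, n


# Illustrative data only: four observations, three of them Type A.
is_type_a = [True, True, False, True]
weights = [1.0, 0.5, 1.0, 0.75]

f, n = weighted_frequencies(is_type_a, weights)
p = f / n  # still a proportion of a subset to its superset, so 0 <= p <= 1
print(f, n, round(p, 4))
```

Neither f nor n need be a whole number here, but f can never exceed n, so p = f / n still ranges from 0 to 1.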
This natural probability is expected to be a Binomial variable, and the formulae for z tests, χ² tests, Wilson intervals, etc., as well as logistic regression and similar methods, may be legitimately applied to such variables. The Binomial distribution is the expected distribution of such a variable if each observation is drawn independently at random from the population (an assumption that is not strictly true with corpus data).
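As an illustration of one of these methods, the following sketch computes a Wilson score interval for p = f / n using the standard formula without continuity correction. The function name `wilson_interval` is our own, and the default z of roughly 1.96 (a two-tailed 95% level) is an assumption. The figures of 8 cases out of 10 are used purely as example input.

```python
import math


def wilson_interval(f, n, z=1.959964):
    """Wilson score interval (no continuity correction) for p = f / n.

    f: number of cases in the subset, 0 <= f <= n.
    n: total number of cases, n > 0.
    z: critical value of the Normal distribution (assumed ~1.96 for 95%).
    """
    if n <= 0 or not 0 <= f <= n:
        raise ValueError("require n > 0 and 0 <= f <= n")

    p = f / n
    z2 = z * z
    centre = p + z2 / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n))
    denom = 1 + z2 / n
    return (centre - spread) / denom, (centre + spread) / denom


lower, upper = wilson_interval(8, 10)
print(f"p = {8 / 10}, 95% Wilson interval = ({lower:.4f}, {upper:.4f})")
```

With these inputs the interval comes out at approximately (0.49, 0.94). It is asymmetric about p = 0.8 and stays within the range 0 to 1, as we would expect for a variable of this kind.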
Another way of putting this is that a Binomial variable expresses the number of individual events of Type A in a situation where an outcome of either A or B is possible. If we observe, say, that 8 out of 10 cases are of Type A, then we can say we have an observed probability of A being chosen, p(A | {A, B}), of 0.8. In this case, f is the frequency of A (8), and n the frequency of both A and B (10). See Wallis (2013a).
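The same calculation can be sketched in a few lines of code. The list of observations and the function name `observed_probability` are hypothetical, made up to match the 8-out-of-10 example.

```python
from collections import Counter


def observed_probability(observations, target="A"):
    """Return (f, n, p) where p = f / n is the observed probability of `target`."""
    n = len(observations)
    if n == 0:
        raise ValueError("need at least one observation")
    f = Counter(observations)[target]  # frequency of the target type
    return f, n, f / n


# Illustrative data: 8 cases of Type A and 2 of Type B.
observations = ["A"] * 8 + ["B"] * 2

f, n, p = observed_probability(observations)
print(f, n, p)  # 8 10 0.8
```

Because f counts a subset of the n cases that were counted, every unit change in f corresponds to one case switching between A and B. That is what makes p a natural probability in the sense used here.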