Not everything that looks like a probability is.

The fact that a variable or function ranges from 0 to 1 does not mean that it behaves like a unitary probability over that range.

### Natural probabilities

What we might term a **natural** probability is a proper fraction of two frequencies, which we might write as *p* = *f*/*n*.

- Provided that *f* can be any value from 0 to *n*, *p* can range from 0 to 1.
- In this formula, *f* and *n* must also be natural frequencies, that is, *n* stands for the size of the set of all cases, and *f* the size of a true subset of these cases.

This natural probability is expected to be a Binomial variable, and the formulae for *z* tests, χ² tests, Wilson intervals, etc., as well as logistic regression and similar methods, may be legitimately applied to such variables. The Binomial distribution is the expected distribution of such a variable if each observation is drawn independently at random from the population (an assumption that is not strictly true with corpus data).

Another way of putting this is that a Binomial variable expresses the number of individual events of Type A in a situation where an outcome of either A or B is possible. If we observe, say, 8 out of 10 cases of Type A, then we can say we have an observed probability of A being chosen, *p*(A | {A, B}), of 0.8. In this case, *f* is the frequency of A (8), and *n* the frequency of both A and B (10). See Wallis (2013a).
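As a brief sketch, the Wilson score interval for this observed *p* = 8/10 can be computed directly (the *z* value below corresponds to a 95% interval):

```python
import math

def wilson_interval(f, n, z=1.959964):
    """Wilson score interval for a natural probability p = f/n."""
    p = f / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    spread = (z / denom) * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2))
    return centre - spread, centre + spread

w_lo, w_hi = wilson_interval(8, 10)
print(round(w_lo, 4), round(w_hi, 4))  # → 0.4902 0.9433
```

Note how asymmetric the interval is about *p* = 0.8: a Wald (Normal) interval centred on *p* would overshoot 1 here, which is one reason the Wilson formula is preferred for natural probabilities.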

### Unnatural probabilities

However, sometimes researchers obtain variables or ratios that look like probabilities, but in fact are not.

- **Any power of a natural probability**, e.g. *p*² or √*p*, will range from 0 to 1, but *will not behave linearly* (proportionately) with *p*. To compute confidence intervals on *p*², we must first reverse the square function and compute *w*⁻, *w*⁺, and then square these: (*w*⁻)² and (*w*⁺)². Any monotonic function of *p* (such as a power function) may be inverted and intervals computed in this way. See Reciprocating the Wilson interval.
- **Baselines incorporating invariant terms** (such as word-based baselines) can be expressed as probabilities (in the case of words, usually a very small *p*), but these are not natural probabilities. It is quite unrealistic to believe that *p* could ever approach 1; it will “max out” at a much lower probability. See That vexed problem of choice and Freedom to vary and significance tests.
- **Effect size measures** such as Cramér’s φ and adjusted *C* (Sheskin 1997) also range from 0 to 1 but can be thought of as being based on *multiple* natural probabilities, *p*₁, *p*₂, etc. Methods for computing confidence intervals on φ do exist in the literature (see Comparing χ² tests for separability), although they are based on Wald estimates and are non-optimal.
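The inversion trick for monotonic functions can be sketched in a few lines: compute the Wilson bounds on *p*, then apply the function to the bounds. The 8-out-of-10 figures reuse the earlier illustrative example:

```python
import math

def wilson_interval(f, n, z=1.959964):
    """Wilson score interval for a natural probability p = f/n."""
    p = f / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    spread = (z / denom) * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2))
    return centre - spread, centre + spread

# Interval on p²: take the Wilson bounds on p and square them.
# Squaring is monotonic on [0, 1], so the bound order is preserved.
w_lo, w_hi = wilson_interval(8, 10)
sq_lo, sq_hi = w_lo ** 2, w_hi ** 2
print(round(sq_lo, 4), round(sq_hi, 4))
```

The same pattern works for √*p* or any other monotonic transform; only the final function application changes.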

On the other hand both *onomasiological* (choice) and *semasiological* (use) variables are Binomial. The chance of being exposed to one particular *use* of a word out of many can be considered as a Binomial variable, even though it is a by-product of multiple onomasiological choices between linguistic alternates. See Choice vs. use.

### Combined probabilities

What if we have an observation that is the result of pooling **two** different sets of results, each with their own independent true rate?

For example, suppose the true rate of *shall* vs. *will* in first person declarative cases (*I shall/will go…*) is 1:1, but in interrogative cases (*Shall/Will I go?*) the ratio is 1:9. For the sake of argument, let us also suppose that in our data there is the same number of declarative as interrogative cases.

We can use *P*_{d} to represent the expected declarative probability of *shall*, and *P*_{i} for the interrogative probability.

Next, suppose we do not distinguish the two categories but simply put them together. What will the overall distribution of *P* look like?

- overall probability *P* = (*P*_{d} + *P*_{i}) / 2.

This variable will range from 0 to 1, but the expected distribution of *P* will not be Binomial. The peak will not even be at *P* = 0.3. Instead it will have two peaks, one at 0.5 and the other at 0.1.

The figure is computed by applying the simple formula above to the Binomial distributions for *P*_{d} and *P*_{i}, and the Normal approximation to these distributions.

As the graph clearly shows, the result is not a Binomial distribution about *P* = 0.3! It is worth thinking about why this is a visible problem in this case.

- we have assumed that the two rates are **some distance apart** (0.5 and 0.1),
- we assume there are **approximately the same number** of declarative and interrogative cases of first person *shall/will* modal alternation, so the summed distribution looks like neither individual distribution, and
- there are **only two distinct categories** (the more categories that exist, the more the data will tend to behave Binomially, but with more noise).

If the combined rates were close together (e.g. 0.4 and 0.5), or 90% of the data was declarative (a more realistic assumption in many contexts), then *P* would be closer to the peaks, and the combined pattern would be a closer approximation to the Binomial distribution about *P*. It also follows that confidence intervals for observations would be more reliable.
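A quick Monte Carlo sketch reproduces the bimodal pooled pattern (the sample size, number of repetitions and seed are arbitrary choices for illustration):

```python
import random

random.seed(1)
n, trials = 20, 10000  # hypothetical sample size and number of repetitions

def sample_p(true_p):
    """Observed proportion from n Bernoulli trials at a given true rate."""
    return sum(random.random() < true_p for _ in range(n)) / n

# Pool declarative (P_d = 0.5) and interrogative (P_i = 0.1) samples 50:50.
pooled = [sample_p(0.5 if random.random() < 0.5 else 0.1)
          for _ in range(trials)]

near_tenth = sum(0.05 <= p <= 0.15 for p in pooled)  # around P_i = 0.1
near_mean = sum(0.25 <= p <= 0.35 for p in pooled)   # around P   = 0.3
near_half = sum(0.45 <= p <= 0.55 for p in pooled)   # around P_d = 0.5
print(near_tenth, near_mean, near_half)  # trough at the pooled mean 0.3
```

Observations pile up near 0.1 and 0.5, with far fewer near the pooled mean of 0.3, which is exactly the two-peaked shape described above.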

Since we frequently deal with datasets containing cases that are necessarily subject to different local pressures, does this mean it is not viable simply to assume that the variable is truly Binomial? See also Is language really “a set of alternations”?

The answer is no, the Binomial distribution should be **our default assumption**. That is, unless we have *a priori* linguistic reasons for separating out data into distinct categories, we can reasonably assume that the data will be Binomially distributed about a true rate, *P*. Bear in mind that we only have observations to go on — we don’t know what the true rate is!

If we suspect that our data may contain different subsets each with very different rates, we can test this hypothesis:

- Split data into relevant linguistic sub-categories and determine the observed probability in each case.
- Carry out a 2 × 2 or *r* × 2 χ² test for homogeneity (Wallis 2013b) where the **independent variable** is the grammatical condition. For example, if our data were divided into interrogative vs. declarative sets, a significant difference would mean we could conclude that *P*_{d} ≠ *P*_{i}.
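As a sketch of the second step, here is a minimal hand-rolled 2 × 2 χ² test for homogeneity (the counts are invented for illustration, roughly matching the 1:1 vs. 1:9 scenario above):

```python
def chi2_homogeneity(table):
    """Pearson chi-square statistic for an r x 2 contingency table.
    table[i] = [freq of 'shall', freq of 'will'] under condition i."""
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    total = sum(row)
    return sum((table[i][j] - row[i] * col[j] / total) ** 2
               / (row[i] * col[j] / total)
               for i in range(len(table))
               for j in range(len(table[0])))

# Hypothetical counts: declarative 50 shall / 50 will,
# interrogative 10 shall / 90 will.
chi2 = chi2_homogeneity([[50, 50], [10, 90]])
print(round(chi2, 2), chi2 > 3.841)  # 3.841 = 5% critical value, 1 d.f.
```

With these figures the statistic far exceeds the critical value, so we would conclude *P*_{d} ≠ *P*_{i} and analyse the two conditions separately.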

Note that this distinction need not be grammatical. A similar result might be found from combining speech and writing data (in this case a Newcombe-Wilson test is preferred, see Wallis 2013a).

The same principle also applies to semasiological variables, i.e., variables where the overall observed *p* value is the result of several different choices.

### Dispersion rates

A recent paper I was asked to review looked at **dispersion rates**.

A dispersion rate for a word represents the number of texts in which a word appears *at least once*. Implicit in the paper was the assumption that a dispersion rate could be treated like a true probability. After all, it is theoretically possible that all texts contain a modal verb, and it is possible that all texts contain none. So we may write:

- dispersion rate(modal): *dr* ∈ [0, 1], where *dr* = *d*/*t*.

The maximum value of the dispersion frequency *d* is the number of texts, *t*.

One might express *dr* as a probability (the probability of selecting a text that contains a modal, *p*(modal | text)). **But is dr a natural probability?**

The answer has to be Yes, but… Yes, the dispersion rate can be approximated by a Binomial variable. But the measure suffers from a number of defects.

Consider the relationship between dispersion counts and frequency counts. An item that appears repeatedly in the same text contributes multiple hits towards the frequency *f* but only adds 1 to the dispersion count *d*. On the other hand, if the item never appears more than once in the same text, then *f* = *d*. So *f* ≥ *d*.
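The distinction between *f* and *d* can be illustrated with a toy example (the three single-sentence “texts” below are invented):

```python
# Hypothetical mini-corpus: each string stands for one text.
texts = [
    "we shall see what we shall see",
    "they will go",
    "no modal here at all",
]

item = "shall"
f = sum(text.split().count(item) for text in texts)  # frequency count
d = sum(item in text.split() for text in texts)      # dispersion count
dr = d / len(texts)                                  # dispersion rate

print(f, d, round(dr, 3))  # → 2 1 0.333
```

The repeated *shall* in the first text adds 2 to *f* but only 1 to *d*, so *f* ≥ *d* always holds.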

- The dispersion rate contains **less information** regarding the distribution of data than *p*. Information about second and third occurrences is simply ignored. As a result, evenly distributed low frequency items can score the same as clustered high frequency ones.
- For **low frequency** items, *dr* is approximately linear with *p*, although on a different scale (*t*, the number of texts, rather than *n*, the number of potential cases).
- For **high frequency** items, *dr* is likely to saturate (tend to 1) more quickly than *p*.
- Single-author text samples should be of the **same size**, because the chance of observing an item, all other things being equal, is at least proportional to the number of words. This is not easy to guarantee, particularly if corpora contain very short content such as letters or telephone calls.

So it is possible to employ Wilson intervals, log-likelihood or χ² tests to compare probabilities in the form of *p*(item | text). However, since *dr* contains less information than *p*, greater accuracy will be achieved by recasting the analysis in terms of simple probabilities of occurrence.

We noted that *dr* contained less information regarding the distribution of data than *p*. This means that a significance test comparing two dispersion rates (*dr*₁, *dr*₂) will have lower statistical power than a comparable statistical test comparing two Binomial probabilities (*p*₁, *p*₂).

From an experimental design point of view, one further point needs to be noted. **Dispersion rates do not sum hierarchically.** In other words, the per-word probability of a modal, *p*(M), is equal to the sum of the probabilities of all modal forms, *p*(m₁) + *p*(m₂) + … + *p*(m_{n}). The same is not true of dispersion rates: a text containing two different modal forms contributes once to the dispersion count of each form, but only once to the dispersion count for modals overall. For this reason, a dispersion rate cannot be used in an alternation study.
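This non-additivity is easy to verify with a small sketch (the three-text corpus and all counts are invented for illustration):

```python
# Occurrences of two modal forms in a hypothetical 3-text corpus.
# Per-text counts:   t1  t2  t3
shall_counts   =   [  2,  0,  1 ]
will_counts    =   [  0,  3,  1 ]
words_per_text =   [ 10, 10, 10 ]

t, n = len(words_per_text), sum(words_per_text)

# Per-word probabilities sum hierarchically:
p_shall = sum(shall_counts) / n
p_will = sum(will_counts) / n
p_modal = sum(s + w for s, w in zip(shall_counts, will_counts)) / n
print(abs(p_modal - (p_shall + p_will)) < 1e-9)  # → True

# Dispersion rates do not: text t3 contains both forms, so it is
# counted twice across the per-form rates but once for 'any modal'.
dr_shall = sum(c > 0 for c in shall_counts) / t   # 2/3
dr_will = sum(c > 0 for c in will_counts) / t     # 2/3
dr_modal = sum(s + w > 0
               for s, w in zip(shall_counts, will_counts)) / t  # 1.0
print(dr_modal == dr_shall + dr_will)  # → False
```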

One of the reasons why dispersion rates have been proposed is as an alternative to per-million-word frequencies (per word probabilities), but neither of these is a substitute for an alternation study.

Dispersion rates count a single ‘hit’ per text equal to multiple ‘hits’ per text, and are at the extreme end of a methodological continuum suppressing case interaction. However, the prior probability weighting methods we describe in this post sum hierarchically and permit alternation studies.

### References

Sheskin, D.J. 1997. *Handbook of Parametric and Nonparametric Statistical Procedures*. Boca Raton, FL: CRC Press.

Wallis, S.A. 2013a. Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods. *Journal of Quantitative Linguistics* **20**:3, 178-208. **»** Post

Wallis, S.A. 2013b. *z*-squared: the origin and application of χ². *Journal of Quantitative Linguistics* **20**:4, 350-378. **»** Post

### See also

- Freedom to vary and significance tests
- Reciprocating the Wilson interval
- That vexed problem of choice
- Choice vs. use
- Is language really “a set of alternations”?
- Binomial → Normal → Wilson