# An unnatural probability?

Not everything that looks like a probability is.

Just because a variable or function ranges from 0 to 1 does not mean that it behaves like a probability over that range.

### Natural probabilities

What we might term a natural probability is a proper fraction of two frequencies, which we might write as p = f/n.

• Provided that f can be any value from 0 to n, p can range from 0 to 1.
• In this formula, f and n must also be natural frequencies, that is, n stands for the size of the set of all cases, and f for the size of a subset of these cases.

This natural probability is a Binomial variable, and the formulae for z tests, χ² tests, Wilson intervals, etc. may be legitimately applied to such variables.

Another way of putting this is that a Binomial variable expresses the number of individual events of Type A in a situation where an outcome of either A or B is possible. If we observe, say, 8 out of 10 cases are of Type A, then we can say we have an observed probability of A being chosen, p(A | {A, B}), of 0.8. In this case, f is the frequency of A (8), and n the frequency of both A and B (10). See Wallis (2013).
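The Wilson score interval referred to above is easy to compute directly. The following is a minimal Python sketch (the function name is mine; the figures are the 8-out-of-10 example):

```python
import math

def wilson_interval(p, n, z=1.96):
    """Wilson score interval for an observed proportion p = f/n.
    z is the critical value of the Normal distribution (1.96 for 95%)."""
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    spread = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - spread, centre + spread

# 8 out of 10 observed cases of Type A: p(A | {A, B}) = 0.8
lo, hi = wilson_interval(0.8, 10)
print(f"({lo:.4f}, {hi:.4f})")
```

Note that, unlike the Wald interval, the bounds are asymmetric about p and never stray outside [0, 1].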

### Unnatural probabilities

However, sometimes researchers obtain variables or ratios that look like probabilities, but in fact are not.

• Any power of a natural probability, e.g. p² or √p, will range from 0 to 1, but will not behave linearly (proportionately) with p. To compute confidence intervals on p², we must first reverse the square function to recover p, compute the Wilson bounds w⁻, w⁺ on p, and then square these: (w⁻)² and (w⁺)². See Reciprocating the Wilson interval.
• Baselines incorporating invariant terms (such as word-based baselines) can be expressed as probabilities (in the case of words, usually a very small p) but these are not natural probabilities. It is quite unrealistic to believe that p could ever approach 1. See That vexed problem of choice and Freedom to vary and significance tests.
• Effect size measures such as Cramér’s φ and adjusted C (Sheskin 1997) also range from 0 to 1 but can be thought of as being based on multiple natural probabilities, p₁, p₂, etc. Methods for computing confidence intervals on φ do exist in the literature (see Comparing χ² tests for separability) although they are based on Wald estimates and are non-optimal.
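The first transformation above can be sketched in code: compute the Wilson bounds on p, then apply the (monotonic) square function to each bound. A minimal Python illustration with invented figures (p = 0.3, n = 20):

```python
import math

def wilson_interval(p, n, z=1.96):
    # Wilson score interval on a Binomial proportion p = f/n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    spread = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - spread, centre + spread

# Interval on p² for p = 0.3, n = 20: take the Wilson bounds on p,
# then square each bound (the square is monotonic on [0, 1])
p, n = 0.3, 20
w_lo, w_hi = wilson_interval(p, n)
sq_lo, sq_hi = w_lo**2, w_hi**2
print(f"p² = {p**2:.3f}, interval ({sq_lo:.3f}, {sq_hi:.3f})")
```

A Wald interval computed directly on p² would not respect the non-linearity of the transformation; this approach does.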

On the other hand, both onomasiological (choice) and semasiological (use) variables are Binomial. The chance of being exposed to one particular use of a word out of many is a Binomial variable, even though it is probably a by-product of multiple onomasiological choices between linguistic alternates. See Choice vs. use.

### Dispersion rates

A recent paper I was asked to review looked at dispersion rates.

A dispersion rate for a word represents the proportion of texts in which the word appears at least once. Implicit in the paper was the assumption that a dispersion rate could be treated like a true probability. After all, it is theoretically possible that all texts contain a modal verb, and it is possible that all texts contain none. So we may write:

• dispersion rate dr(modal) ∈ [0, 1], where dr = d/t.

The maximum value of the dispersion frequency d is the number of texts, t.

One might express dr as a probability (the probability of selecting a text that contains a modal, p(modal | text)). But is dr a natural probability?

The answer has to be Yes, but… Yes, the dispersion rate is a Binomial variable. But the measure suffers from a number of defects.

Consider the relationship between dispersion counts and frequency counts. An item that appears repeatedly in the same text contributes multiple hits towards the frequency f but only adds 1 to the dispersion count d. On the other hand, if the item never appears more than once in the same text, then f = d. So f ≥ d.
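This relationship is easy to demonstrate with a toy example (the three-text corpus below is invented for illustration):

```python
# Toy corpus: each inner list is one 'text', as a sequence of tokens
texts = [
    ["will", "the", "will", "of"],  # the item occurs twice in this text
    ["may", "will", "be"],
    ["the", "cat", "sat"],
]
item = "will"

f = sum(text.count(item) for text in texts)   # frequency count
d = sum(1 for text in texts if item in text)  # dispersion count
print(f, d)  # f = 3, d = 2
assert f >= d  # repeats within a text inflate f but not d
```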

• The dispersion rate contains less information regarding the distribution of data than p. Evenly distributed low frequency items can score the same as clustered high frequency ones.
• For low frequency items, dr is approximately linear with p, although on a different scale (t, the number of texts, rather than n, the number of potential cases).
• For high frequency items, dr is likely to saturate (tend to 1) more quickly than p.
• Single-author text samples should be of the same size. This is not easy to guarantee, particularly if corpora contain very short content such as letters or telephone calls.

So it is possible to employ Wilson intervals, log-likelihood or χ² tests to compare probabilities in the form of p(item | text). However, your best bet is likely to be to recast the analysis in terms of simple probabilities of occurrence.

We noted that dr contained less information regarding the distribution of data than p. This means that a significance test comparing two dispersion rates (dr₁, dr₂) will have lower statistical power than a comparable statistical test comparing two Binomial probabilities (p₁, p₂).

From an experimental design point of view, one further point needs to be noted. Dispersion rates do not sum hierarchically. In other words, the probability of finding a modal in a text, p(M), is equal to the sum of the probabilities of all modal forms, p(m₁) + p(m₂) + … + p(mₙ). The same is not true of dispersion rates. A dispersion rate cannot be used in an alternation study because of this.
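The failure to sum hierarchically is easy to verify with invented data. In the sketch below, each 'text' is reduced to the set of modal forms it contains:

```python
# Four texts, each represented by the set of modal forms it contains
texts = [{"will", "would"}, {"will"}, {"would", "may"}, set()]
t = len(texts)

# Per-form dispersion rates
dr = {m: sum(1 for text in texts if m in text) / t
      for m in ("will", "would", "may")}

# Dispersion rate of the superordinate category 'modal'
dr_modal = sum(1 for text in texts if text) / t

print(sum(dr.values()), dr_modal)  # 1.25 vs 0.75: the rates do not sum
```

A text containing two different modal forms is counted once towards dr(modal) but once for each form, so the per-form rates over-count relative to the superordinate rate.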

One of the reasons why dispersion rates have been proposed is as an alternative to per-million-word frequencies (per word probabilities), but neither of these is a substitute for an alternation study.

Dispersion rates count a single ‘hit’ per text equal to multiple ‘hits’ per text, and are at the extreme end of a methodological continuum suppressing case interaction. However, the weighting methods described in this post still sum hierarchically and permit alternation studies.

### References

Sheskin, D.J. 1997. Handbook of Parametric and Nonparametric Statistical Procedures. Boca Raton, FL: CRC Press.

Wallis, S.A. 2013. Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods. Journal of Quantitative Linguistics 20:3, 178-208. » Post

# Summer school in English Corpus Linguistics 2013

Thanks to all who attended the Survey of English Usage’s summer school in English Corpus Linguistics at UCL in August!

The three-day event ran from Tuesday 27 August to Thursday 29 August 2013, and there were lectures, seminars and hands-on sessions.

As a service to those who were able to attend (and a few who could not), I have published the slides from my talk on ‘Simple statistics for corpus linguistics’ and a spreadsheet for demonstrating the binomial distribution below.

If you want to try to replicate the class experience in your own time, please note that at around the half-way point, each member of the class was asked to toss a coin ten times and report the results. We then input the number of students who threw 0 heads, 1 head, 2 heads, etc. into the spreadsheet.
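If you cannot gather a class, the exercise can be simulated. A small Python sketch (the class size of 40 is invented) that tallies heads per ten tosses and compares the tally with the Binomial expectation:

```python
import random
from math import comb

random.seed(1)  # fixed seed so the run is repeatable
students, tosses = 40, 10

# Each 'student' tosses a fair coin ten times; count their heads
heads = [sum(random.random() < 0.5 for _ in range(tosses))
         for _ in range(students)]
observed = [heads.count(k) for k in range(tosses + 1)]

# Expected class counts under the Binomial distribution B(10, 0.5)
expected = [students * comb(tosses, k) * 0.5 ** tosses
            for k in range(tosses + 1)]

for k, (o, e) in enumerate(zip(observed, expected)):
    print(f"{k:2d} heads: observed {o:2d}, expected {e:5.2f}")
```

With only 40 'students' the observed tally is lumpy; increasing the class size makes the Binomial shape emerge, which was the point of the exercise.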

# ICAME talk on linguistic interaction

I spoke on Capturing patterns of linguistic interaction in a parsed corpus at ICAME 34, Santiago de Compostela, Spain, on 25 May.

The talk presents my latest research in the linguistic interaction research thread (see Wallis 2012). My slides and handout are published below.

### References

Wallis, S.A. 2012. Capturing patterns of linguistic interaction in a parsed corpus: an insight into the empirical evaluation of grammar? London: Survey of English Usage » Post

# Comparing frequencies within a discrete distribution

### Introduction

In a recent study, my colleague Jill Bowie obtained a discrete frequency distribution by manually classifying cases in a small sample drawn from a large corpus.

Jill converted this distribution into a row of probabilities and calculated Wilson score intervals on each observation, to express the uncertainty associated with a small sample. She had one question, however:

How do we know whether the proportion of one quantity is significantly greater than another?

We might use a Newcombe-Wilson test (see Wallis 2013a), but this assumes that samples are drawn from independent sources. In Jill’s example, data is drawn from the same sample, and all probabilities must sum to 1. We need to employ a stricter dependent-sample test.

### Example

A discrete distribution looks something like this: F = {108, 65, 6, 2}. This is the frequency data for the middle column (circled) in the following chart.

This may be converted into a probability distribution P, representing the proportion of examples in each category, by simply dividing by the total: P = {0.60, 0.36, 0.03, 0.01}, which sums to 1. We can plot these probabilities, with Wilson score intervals, as shown below.

An example graph plot showing the changing proportions of meanings of the verb think over time in the US TIME Magazine Corpus, with Wilson score intervals, after Levin (2013). In this post we discuss the 1960s data (circled). The sum of each column probability is 1. Many thanks to Magnus for the data!
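The conversion from F to P, with a Wilson score interval on each proportion, can be sketched in Python (the function name is mine; the data are the 1960s column above):

```python
import math

def wilson_interval(p, n, z=1.96):
    # Wilson score interval on a Binomial proportion p = f/n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    spread = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - spread, centre + spread

F = [108, 65, 6, 2]
n = sum(F)                 # 181 cases in the column
P = [f / n for f in F]     # ≈ {0.60, 0.36, 0.03, 0.01}

for f, p in zip(F, P):
    lo, hi = wilson_interval(p, n)
    print(f"f = {f:3d}  p = {p:.2f}  ({lo:.3f}, {hi:.3f})")
```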

So how do we know if one proportion is significantly greater than another?

• When comparing values diachronically (horizontally), data is drawn from independent samples. We can use the Newcombe-Wilson test, and employ the handy visual rule that if intervals do not overlap they must be significantly different.
• However, probabilities drawn from the same sample (vertically) sum to 1 — which is not the case for independent samples! There are k−1 degrees of freedom, where k is the number of classes. It turns out that if we need to perform a significance test, the test we need to use is even more primitive than the 2 × 1 goodness of fit χ² test.
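For the independent (diachronic) comparison, the Newcombe-Wilson test can be sketched as follows: combine the two Wilson intervals into an interval on the difference, and check whether that interval excludes zero. A Python illustration with invented sample figures:

```python
import math

def wilson_interval(p, n, z=1.96):
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    spread = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - spread, centre + spread

def newcombe_wilson(p1, n1, p2, n2, z=1.96):
    """Newcombe's interval on the difference d = p1 - p2
    for two independent proportions."""
    l1, u1 = wilson_interval(p1, n1, z)
    l2, u2 = wilson_interval(p2, n2, z)
    d = p1 - p2
    lower = d - math.sqrt((p1 - l1)**2 + (u2 - p2)**2)
    upper = d + math.sqrt((u1 - p1)**2 + (p2 - l2)**2)
    return lower, upper

# Invented example: p1 = 0.60 (n = 181) vs. p2 = 0.45 (n = 150)
lo, hi = newcombe_wilson(0.60, 181, 0.45, 150)
print("significant" if lo > 0 or hi < 0 else "not significant")
```

If the difference interval excludes zero, the two proportions differ significantly at the chosen error level.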

# A methodological progression

### Introduction

One of the most controversial arguments in corpus linguistics concerns the relationship between a ‘variationist’ paradigm comparable with lab experiments, and a traditional corpus linguistics paradigm focusing on normalised word frequencies.

Rather than see these two approaches as diametrically opposed, we propose that it is more helpful to view them as representing different points on a methodological progression, and to recognise that we are often forced to compromise our ideal experimental practice according to the data and tools at our disposal.

Viewing these approaches as being represented along a progression allows us to step back from any single perspective and ask ourselves how different results can be reconciled and research may be improved upon. It allows us to consider the potential value in performing more computer-aided manual annotation — always an arduous task — and where such annotation effort would be usefully focused.

The idea is sketched in the figure below.

A methodological progression: from normalised word frequencies to verified alternation.

# Choice vs. use

### Introduction

Many linguistic researchers are interested in semasiological variation, that is, how the meaning of words and expressions may be observed to vary over time or space. One word might have one dominant meaning or use at one point in time, and other meanings may supplant it. This is of obvious interest to etymology. How do new meanings come about? Why do others decline? Do old meanings die away or retain a specialist use?

Most of the research we have discussed on this blog is, by contrast, concerned with onomasiological variation, or variation in the choice of words or expressions to express the same meaning. In a linguistic choice experiment, the field of meaning is held to be constant, or approximately so, and we are concerned primarily with language production:

• Given that a speaker (or writer, but we take speech as primary) wishes to express some thought, T, what is the probability that they will use expression E₁ out of the alternate forms {E₁, E₂,…} to express it?

This probability is meaningful in the language production process: it measures the actual use out of the options available to the speaker, at the point of utterance.

Conversely, semasiological researchers are concerned with a different type of probability:

• Given that a speaker used an expression E, what is the probability that their meaning was T₁ out of the set of {T₁, T₂,…}?

For the hearer, this measure can also be thought of as the exposure rate: what proportion of times should a hearer (reader) interpret E as expressing T₁? This probability is meaningful to a language receiver, but it is not a meaningful statistic at the point of language production.

From the speaker’s point of view we can think of onomasiological variation as variation in choice, and semasiological variation as variation in relative proportion of use.
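The two probabilities can be computed from the same cross-classified data by choosing a different baseline. A toy Python illustration (the counts and category labels are invented):

```python
from collections import Counter

# Invented counts of (expression, meaning) pairs
counts = Counter({
    ("will", "prediction"): 45,
    ("shall", "prediction"): 5,
    ("will", "volition"): 10,
    ("shall", "volition"): 5,
})

def p_choice(expr, meaning):
    # Onomasiological: p(expr | meaning), the speaker's choice rate
    n = sum(f for (e, m), f in counts.items() if m == meaning)
    return counts[(expr, meaning)] / n

def p_use(meaning, expr):
    # Semasiological: p(meaning | expr), the hearer's exposure rate
    n = sum(f for (e, m), f in counts.items() if e == expr)
    return counts[(expr, meaning)] / n

print(p_choice("will", "prediction"))  # 45/50 = 0.9
print(p_use("prediction", "will"))     # 45/55 ≈ 0.82
```

The same cell (45 cases of will expressing prediction) yields two different, equally valid probabilities, depending on whether the baseline is the meaning or the expression.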

# Verb Phrase book published

### Why this book?

The grammar of English is often thought to be stable over time. However, a new book, edited by Bas Aarts, Joanne Close, Geoffrey Leech and Sean Wallis, The Verb Phrase in English: investigating recent language change with corpora (Cambridge University Press, 2013), presents a body of research from linguists showing that, using natural language corpora, one can find changes within a core element of grammar, the Verb Phrase, over a span of decades rather than centuries.

The book draws from papers first presented at a symposium on the verb phrase organised for the Survey of English Usage’s 50th anniversary and on research from the Changing English Verb Phrase project.