When we carry out experiments and perform statistical tests we have two distinct aims.

- To form **statistically robust conclusions** about empirical data.
- To make **logically sound arguments** about experimental conclusions.

Robustness is essentially an *inductive* mathematical or **statistical** issue.

Soundness is a *deductive* question of **experimental design** and reporting.

Robust conclusions are those that are likely to be repeated if another researcher were to come along and perform the same experiment with different data sampled in much the same way. Sound arguments distinguish between what we can legitimately infer from our data, and the hypothesis we may wish to test.

[**Note:** Logicians describe two aims of deductive reasoning: soundness and **completeness**. Completeness is the property that all implications of a given set of propositions are identified. However, as a rule, we err on the side of caution when reporting experimental results: it is far more important to summarise results soundly than to attempt to exhaustively identify all possible conclusions – and make unwarranted claims as a result.]

### Robust?

**Confidence intervals** are all about robustness.

Consider the top ‘cogitate’ line in the following graph. The error bars represent 95% Wilson confidence intervals.

Let’s try to put these results into words.

First, we’ll assume we don’t know how to calculate confidence intervals. We might simply **quote probabilities**:

- In the 1920s data ‘cogitate’ uses account for 77.27% of the uses of the word *think*, falling to 59.67% and 56.67% in the 1960s and 2000s respectively.

But this is unsatisfactory. We don’t know how reliable these percentages are. Are we *really* saying that if we were to repeat this experiment we would get exactly the same results (to two decimal places at least)? All we can really say is that these were the percentages (or proportions or probabilities, it amounts to the same thing) that we obtained from our data. This is why quoting (or plotting) proportions without some indication of the confidence in the citation is frowned upon. *The citation is not robust.*

One thing we can do is **dispense with the decimal places**. For computational purposes we typically want higher accuracy and we might cite more figures in a table. But this detail makes little sense in citation and discussion.

- In the 1920s data ‘cogitate’ uses account for 77% of the uses of the word *think*, falling to 60% and 57% in the 1960s and 2000s respectively.

However just getting rid of the decimal point is arbitrary. Let’s look at this in a different way.

We have three **observations** (referred to throughout corp.ling.stats as lower-case *p*) from which we want to make an educated guess (an *inference*, hence **inferential statistics**) about the most likely value of the **population** probability (which we label *P*) – the ‘real’ share of *think* uses at the time in question that ‘cogitate’ meaning accounts for in comparable English language utterances.

**Confidence intervals** tell us (at a certain level of confidence, e.g. 95%) the range of values that we can expect the population probability *P* to fall within. Since we want to report robust repeatable results, we cite confidence intervals. So we could write the following:

- In the 1920s data ‘cogitate’ uses account for between 65.83% and 85.71% of the uses of the word *think*, falling to 52.39-66.54% and 49.90-63.19% in the 1960s and 2000s respectively.

Again, we probably don’t need to quote to two decimal places of a percentage (this is *four* decimal places of probability, after all: 52.39% = 0.5239). The more digits, the harder the figure is to read. Cite four decimal places in a table if you wish, *but they are unlikely to be necessary when discussing results*. (The only exception would be when these tiny distinctions were statistically significantly distinct.)

So the following is probably better:

- In the 1920s data ‘cogitate’ uses account for between 66% and 86% of the uses of the word *think* (Wilson score interval at a 95% confidence level), falling to 52-67% and 50-63% in the 1960s and 2000s respectively.

Note that we have indicated the confidence level and the method used for calculating it. We are telling our reader how to reproduce our experiment. We don’t need to say this every time and this explanation can be relegated to a footnote on the first use.
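Reproducing an interval like this is straightforward. The sketch below computes the Wilson score interval directly; the raw frequency of 51 ‘cogitate’ uses out of 66 for the 1920s is back-computed from the percentages quoted above (it reproduces them exactly), so treat it as illustrative rather than as the published data.

```python
import math

def wilson_interval(p, n, z=1.96):
    """Wilson score interval for an observed proportion p out of n cases,
    at the confidence level implied by the critical value z (1.96 ~ 95%)."""
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    spread = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - spread, centre + spread

# 1920s data: 51/66 = 77.27% 'cogitate' uses (inferred frequency)
lo, hi = wilson_interval(51 / 66, 66)
print(f"{lo:.0%} to {hi:.0%}")  # -> 66% to 86%
```

Note that the interval is asymmetric about 77%: the Wilson interval is pulled towards the centre of the probability range, which is what makes it well-behaved for skewed proportions and small samples.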

Provided that the interval is reasonably symmetric it is possible to combine quoting the observed probability and the error range in what we might call **‘±’ notation**:

- In the 1920s data ‘cogitate’ uses account for 77% (66-86%) of the uses of the word *think* (Wilson score interval at a 95% confidence level), falling to 60% ±7% and 57% ±7% in the 1960s and 2000s respectively.

Which of the last two you use is up to you; either is acceptable. Note how we got around the problem of the asymmetric 1920s interval.

#### Robust significance

With **significance tests** a similar logic applies. Let’s consider chi-square, although we could apply exactly the same argument to other tests.

In our data the fall between 1920s and 1960s is significant (2 × 2 χ² at a 0.05 error level), whereas the overlapping confidence intervals indicate that the change between the 1960s and 2000s is *not* significant.
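That test can be sketched in a few lines. The raw frequencies here (51/66 for the 1920s, 108/181 for the 1960s) are inferred from the percentages and intervals quoted above, so they are illustrative; the `chi2_2x2` helper is simply the standard Pearson formula for a 2 × 2 table.

```python
def chi2_2x2(a, b, c, d):
    """Pearson chi-square for the 2x2 contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# 1920s: 51 'cogitate' uses out of 66; 1960s: 108 out of 181
# (frequencies inferred from the quoted figures, so illustrative only)
score = chi2_2x2(51, 66 - 51, 108, 181 - 108)
print(round(score, 2), "significant" if score > 3.841 else "not significant")
# -> 6.54 significant (3.841 is the 1 df critical value at the 0.05 level)
```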

χ² tests with one degree of freedom essentially do two things (see Wallis 2013 for more about this):

- they measure the difference between two probabilities (a measure of the **size of an effect**), and
- they check whether this difference is outside expected limits (a **confidence interval**), in which case we can conclude that there is a significant difference between the observations.

Consequently, when you cite a significance test *bear in mind that the reader is not interested in the actual χ² score*. They want to know that your results are robust.

A χ² score depends on two things:

- the **amount of data** that you had available, and
- the **size of effect** you observed.

A higher χ² score simply means you had more data or observed a greater effect size.

#### *** is for expletives, not results!

More controversially perhaps, I have to point out that *the same argument applies to citing error levels*.

It is possible to transform χ² values into error levels (*e* = 0.05, 0.01, 0.0001, etc – Excel even has a function, CHIDIST, to do this for you) and it is common to see citation of error levels, as if a smaller error estimate meant that the result was ‘stronger’ than another. It is also common to say that something is ‘highly significant’ or mark results with multiple asterisks. But these practices all reflect the same logical mistake.
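For the one-degree-of-freedom case the transformation itself is mechanical, which is rather the point. A χ² deviate with 1 df is the square of a standard normal *z* score, so the error level follows directly (a minimal sketch; `chisq_to_error_level` is a name invented here):

```python
import math

def chisq_to_error_level(chisq):
    """Two-tailed error level for a chi-square score with one degree of
    freedom (equivalent to Excel's CHIDIST(chisq, 1))."""
    # chi-square with 1 df is the square of a standard normal deviate z,
    # so the tail probability is erfc(z / sqrt(2)) with z = sqrt(chisq)
    return math.erfc(math.sqrt(chisq / 2))

print(chisq_to_error_level(3.8415))  # ~0.05, the conventional threshold
```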

Like all tests of significance, χ² derives ‘mechanically’ from this combination of effect size and quantity of data. So does the error level. This means that *two results with exactly the same size of effect may have different χ² scores*, error estimates, etc. merely due to different volumes of data being available.

Consider the following thought experiment.

Suppose we analyse data from two corpora. One has 1 million words and the other 100 million. The same pattern is observed, such as a fall of approximately 18% in ‘cogitate’ uses of *think*. However the results from the larger corpus will be supported by more data, and have a lower error estimate than the results from the smaller corpus. This is, after all, how the maths works!

Reporting that one is ‘highly significant’ and the other merely ‘significant’ is not meaningful. Worse, it implies that *the results are different*. But in this case they are exactly the same!
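The thought experiment is easy to verify numerically. In the sketch below the cell counts are invented, chosen only so that both tables contain exactly the same proportions (77% vs 59%, a fall of 18 percentage points):

```python
def chi2_2x2(a, b, c, d):
    """Pearson chi-square for the 2x2 contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# identical effect size (77% vs 59%), but 100 times the data in table two
small = chi2_2x2(77, 23, 59, 41)
large = chi2_2x2(7700, 2300, 5900, 4100)
print(round(small, 2), round(large, 2))  # -> 7.44 744.49
```

Both results are significant at the 0.05 level, yet the second χ² score is exactly 100 times the first, purely because of the extra data.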

**Both χ² scores and error levels are artefacts of the statistical testing process.**

To compare two results we need to **compare effect sizes**.

It *is* useful to be able to say, for example:

- ‘Cogitate’ uses of *think* fell by around 18% (a decline of between 4% and 29%, using a 95% Newcombe-Wilson interval) between the 1920s and 1960s.

If you want to say that one effect size is greater than another, **the correct approach is to employ a statistical test**. It is possible to compare results of two experiments by performing a separability test to determine that they are *significantly* different. This test compares the two observed declines against a confidence interval.
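The Newcombe-Wilson difference interval cited above combines the two single-sample Wilson intervals. A minimal sketch follows; as before, the raw frequencies (51/66 and 108/181) are back-computed from the quoted figures and should be treated as illustrative.

```python
import math

def wilson(p, n, z=1.96):
    """Wilson score interval for proportion p observed in n cases."""
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    spread = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - spread, centre + spread

def newcombe_wilson(p1, n1, p2, n2, z=1.96):
    """Newcombe-Wilson interval for the difference p1 - p2."""
    l1, u1 = wilson(p1, n1, z)
    l2, u2 = wilson(p2, n2, z)
    d = p1 - p2
    lower = d - math.sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = d + math.sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return lower, upper

# decline in 'cogitate' uses, 1920s (51/66) vs 1960s (108/181);
# raw frequencies inferred from the quoted figures, so illustrative only
lo, hi = newcombe_wilson(51 / 66, 66, 108 / 181, 181)
print(f"fell by {51/66 - 108/181:.0%}, between {lo:.0%} and {hi:.0%}")
# -> fell by 18%, between 4% and 29%
```

Because zero lies outside this interval, the fall is significant at the 95% level, which is the same verdict the 2 × 2 χ² test delivers.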

**Test effect sizes for significant difference if you can.** Unfortunately, because of the widespread practice of citing error levels, statistical tests for comparing effect sizes are rarely seen as important, and are little-known (this is one of my personal areas of research).

As a general rule, **only cite a change if it is significant**. Try to avoid saying that a non-significant change is “indicative”. In my book, an indicative result is when you don’t know how to test your data for significance, but the results *look* sizeable.

So we should amend our first summary to read something like the following. (We take out figures for the 2000s because they do not significantly differ.)

- In the 1920s data ‘cogitate’ uses account for 77% (66-86%) of the uses of the word *think* (Wilson score interval at a 95% confidence level), falling to around 60% ±7% by the 1960s.

### Sound?

A sound argument is about **what we can reasonably infer** from our results.

Might there be an alternative explanation that accounts for the observations other than the one we are attempting to find evidence for?

A good researcher needs to play ‘devil’s advocate’. We need to be self-critical, and careful in what we claim our results demonstrate.

Most of the work here is in **designing the experiment** optimally in the first place. Corpus linguists engage in what is termed *ex-post facto* research, meaning that we carry out retrospective data analysis. We cannot easily introduce new experimental conditions. Consider the cost of re-sampling sentences, annotating them and building new corpora! However, we can readily alter our experimental design. We can refine queries and variable definitions, choose alternative baselines, and manually eliminate cases where alternation may not plausibly occur. This question is dealt with at length elsewhere on this blog.

Second, **do not misinterpret use proportions** as if they represented speaker choices. This data examines four different meanings of the same word, the verb *think*, where the baseline is simply all cases of *think*. The probability of *think* being used to express one particular meaning, say quotative *think*, is meaningful to the hearer (because it represents their exposure to that meaning). However, this probability represents something quite different from the choice of *think* out of all potential quotative expressions. It is extremely important not to confuse choice and use, and the type of inference one can draw from the results.

The growth of quotative *think* is unlikely to represent a novel meaning appearing in the data; rather, in the past the same meaning was expressed differently. The most plausible hypothesis would be that in the ’20s, speakers employed other verbs in place of *think* in statements such as *I think we should go*. The way to evaluate this hypothesis is to pose the question in terms of choice: i.e. compare quotative *think* against its alternates.

These two issues concern the **dependent variable**.

A third question concerns whether the **independent variable** is really measuring what we think it is.

For example, in Magnus Levin’s data above, the independent variable is **time**, and the dependent variable is the meaning of *think*. But how do we know whether ‘time’ really represents chronological time *per se*, rather than merely the samples taken from each time period?

This is generally termed a “sampling problem”.

- Were the samples collected in exactly the same way, and how might sociological and technological change interact with time? For example, were we to extend the DCPSE corpus (1960s-1990s) into the future, we would probably wish to include popular modes of communication that **simply didn’t exist** in the earlier time period (such as text messaging), or which had socially **limited use** (such as email). Similarly, does ‘time’ in the TIME Magazine Corpus reflect the impact of editorial style changes?
- The International Corpus of English protocol required the collection of 10,000 words of **telephone conversations** from the 1990s in each of its international 1M-word samples. This was a problem in a number of countries: in some, few used the telephone, whereas in others, recording phone conversations was specifically banned by the state (even when prior consent was given). Legal cross-examination is differentially limited by social class, and so forth.

Independent variables are looking less and less independent!

The solution to this problem is usually considered in terms of injunctions to researchers to obtain a balanced sample (an instruction equivalent to “first, catch your rabbit!”).

The problem is that it is rarely possible to balance a sample for everything! It is therefore extremely important to **recognise the limitations of your data**, work around these problems if you can, and state them plainly if you can’t.

To conclude, consider the following graph, which shows the distribution of tensed VPs per million words by text category and two time periods (LLC = 1960s, ICE-GB = 1990s) in DCPSE.

This graph is discussed in some detail in *That vexed problem of choice*. It shows that there is significant variation over time and genre in ‘tensed VP density’ (the number of tense-marked verb phrases as a proportion of the number of words) in DCPSE, although superficially, when we compare all the texts together (‘Total’ column pair on the right), the VP density appears to be constant over time.

In that paper the authors employ this graph to demonstrate that this variation is substantial, and therefore by eliminating it (by refining a modal baseline from words to tensed VPs), they are able to increase the soundness of their conclusions about changes in core modal use over time.

**However, another point should also be made.** Whenever we limit a sample to particular types of text we tend to introduce or increase the effect of particular sources of variation (such as the dominance of a particular editorial style) and reduce the number of speakers or writers. We may also undo some of the attempts to balance the sample (e.g. by gender or social class). Bowie *et al*. comment that some categories (such as legal cross-examination) have a small number of participants and results from these need to be treated with caution.

### Correlation / Cause

Finally, note that in this discussion I have avoided following the well-trodden path of soberly pronouncing on the difference between **correlation and cause**. True, statistical results are couched in terms of correlational evidence, and experiments do not prove what the cause of that correlation might be.

However the problem with the conventional discussion is that we all tend to *think* in terms of causes, and we can slip into this language relatively easily. Stern warnings are not enough.

It seems to me much more helpful to encourage every researcher to ask themselves the question –

*What do my results really demonstrate?*

– and take it from there.

### References

Bowie, J., S.A. Wallis and B. Aarts forthcoming. Contemporary change in modal usage in spoken British English: mapping the impact of ‘genre’. In Marín Arrese, J.I. and J. van der Auwera (eds.). *Current issues on Evidentiality and Modality in English*. Berlin: Mouton de Gruyter.

Levin, M. 2013. The progressive in modern American English. In Aarts, B., J. Close, G. Leech and S.A. Wallis (eds). *The Verb Phrase in English: Investigating recent language change with corpora*. Cambridge: CUP.

Wallis, S.A. 2013. *z*-squared: the origin and application of χ². *Journal of Quantitative Linguistics* **20**:4, 350-378.