# Robust and sound?

When we carry out experiments and perform statistical tests we have two distinct aims.

1. To form statistically robust conclusions about empirical data.
2. To make logically sound arguments about experimental conclusions.

Robustness is essentially an inductive mathematical or statistical issue.

Soundness is a deductive question of experimental design and reporting.

Robust conclusions are those that are likely to be repeated if another researcher were to come along and perform the same experiment with different data sampled in much the same way. Sound arguments distinguish between what we can legitimately infer from our data, and the hypothesis we may wish to test.

[Note: Logicians describe two aims of deductive reasoning: soundness and completeness. Completeness is the property that all implications of a given set of propositions are identified. However, as a rule, we err on the side of caution when reporting experimental results: it is far more important to summarise results soundly than to attempt to exhaustively identify all possible conclusions – and make unwarranted claims as a result.]

### Robust?

Confidence intervals are all about robustness.

Consider the top ‘cogitate’ line in the following graph. The error bars represent 95% Wilson confidence intervals.

An example graph plot showing changing proportions of meanings of the verb think over time in the TIME Magazine Corpus, with Wilson score intervals, after Levin (2013). Many thanks to Magnus for the data!

Let’s try to put these results into words.

First, we’ll assume we don’t know how to calculate confidence intervals. We might simply quote probabilities:

• In the 1920s data ‘cogitate’ uses account for 77.27% of the uses of the word think, falling to 59.67% and 56.67% in the 1960s and 2000s respectively.

But this is unsatisfactory. We don’t know how reliable these percentages are. Are we really saying that if we were to repeat this experiment we would get exactly the same results (to two decimal places at least)? All we can really say is that these were the percentages (or proportions or probabilities, it amounts to the same thing) that we obtained from our data. This is why quoting (or plotting) proportions without some indication of the confidence in the citation is frowned upon. The citation is not robust.

One thing we can do is dispense with the decimal places. For computational purposes we typically want higher accuracy and we might cite more figures in a table. But this detail makes little sense in citation and discussion.

• In the 1920s data ‘cogitate’ uses account for 77% of the uses of the word think, falling to 60% and 57% in the 1960s and 2000s respectively.

However, simply dropping the decimal places is arbitrary. Let’s look at this in a different way.

We have three observations (referred to throughout corp.ling.stats as lower-case p) from which we want to make an educated guess (an inference, hence inferential statistics) about the most likely value of the population probability (which we label P) – the ‘real’ share of think uses at the time in question that ‘cogitate’ meaning accounts for in comparable English language utterances.

Confidence intervals tell us (at a certain level of confidence, e.g. 95%) the range of values that we can expect the population probability P to fall within. Since we want to report robust repeatable results, we cite confidence intervals. So we could write the following:

• In the 1920s data ‘cogitate’ uses account for between 65.83% and 85.71% of the uses of the word think, falling to 52.39-66.54% and 49.90-63.19% in the 1960s and 2000s respectively.

Again, we probably don’t need to quote percentages to two decimal places (this is four decimal places of probability, after all: 52.39% = 0.5239). The more digits, the harder it is to read. Cite four decimal places in a table if you wish, but they are unlikely to be necessary when discussing results. (The only exception would be where such tiny differences were themselves statistically significant.)

So the following is probably better:

• In the 1920s data ‘cogitate’ uses account for between 66% and 86% of the uses of the word think (Wilson score interval at a 95% confidence level), falling to 52-67% and 50-63% in the 1960s and 2000s respectively.

Note that we have indicated the confidence level and the method used for calculating it. We are telling our reader how to reproduce our experiment. We don’t need to say this every time and this explanation can be relegated to a footnote on the first use.
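To make the point about reproducibility concrete, here is a minimal Python sketch of the Wilson score interval (without continuity correction). The raw frequencies are not given in this post, so the counts below (51 ‘cogitate’ uses out of 66 cases of think) are an assumption, chosen because they are consistent with the 77.27% quoted above. The function name `wilson_interval` is likewise just illustrative.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval (no continuity correction) for a proportion."""
    # Two-tailed critical value of the Normal distribution (about 1.96 for 95%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# ASSUMPTION: 51 out of 66 is a count consistent with 77.27%, not a figure from the post.
lower, upper = wilson_interval(51, 66)
print(f"p = {51 / 66:.2%}, 95% Wilson interval = ({lower:.2%}, {upper:.2%})")
```

With these assumed counts the result should agree with the 1920s interval cited above, to the precision quoted. Note that the interval is not centred on the observed proportion, which is why the ‘±’ shorthand only works when the interval is close to symmetric.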

Provided that the interval is reasonably symmetric it is possible to combine quoting the observed probability and the error range in what we might call ‘±’ notation:

• In the 1920s data ‘cogitate’ uses account for 77% (66-86%) of the uses of the word think (Wilson score interval at a 95% confidence level), falling to 60% ±7% and 57% ±7% in the 1960s and 2000s respectively.

Which of the last two you use is up to you. Either is acceptable. Note how we got around the problem of the asymmetric 1920s interval.

#### Robust significance

With significance tests a similar logic applies. Let’s consider chi-square, although we could apply exactly the same argument to other tests.

In our data the fall between the 1920s and 1960s is significant (2 × 2 χ² at a 0.05 error level). The change between the 1960s and 2000s is not: the two confidence intervals overlap to the extent that each observed proportion lies inside the other’s interval.

χ² tests with one degree of freedom essentially do two things (see Wallis 2013 for more about this):

1. they measure the difference between two probabilities (a measure of the size of an effect), and
2. they check whether this difference is outside expected limits (a confidence interval), in which case we can conclude that there is a significant difference between the observations.

Consequently, when you cite a significance test bear in mind that the reader is not interested in the actual χ² score. They want to know that your results are robust.

A χ² score depends on two things:

1. the amount of data that you had available and
2. the size of effect you observed.

A higher χ² score simply means you had more data or observed a greater effect size.
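The following sketch shows the 2 × 2 calculation for the two comparisons discussed above. As before, the cell frequencies are assumptions (counts consistent with the percentages quoted in this post, not figures taken from the original table), and `chi_square_2x2` is simply an illustrative helper name.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 d.f., no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


CRITICAL_05 = 3.841  # chi-square critical value for 1 d.f. at a 0.05 error level

# ASSUMED counts: ('cogitate' uses, all other uses of think) per decade.
counts = {"1920s": (51, 15), "1960s": (108, 73), "2000s": (119, 91)}

chi_20s_60s = chi_square_2x2(*counts["1920s"], *counts["1960s"])
chi_60s_00s = chi_square_2x2(*counts["1960s"], *counts["2000s"])

for label, score in (("1920s vs 1960s", chi_20s_60s), ("1960s vs 2000s", chi_60s_00s)):
    verdict = "significant" if score > CRITICAL_05 else "not significant"
    print(f"{label}: chi-square = {score:.2f} ({verdict})")
```

Under these assumptions the first score falls above the critical value and the second below it. The score itself tells the reader nothing further: all that matters is which side of the critical value it falls on.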

#### *** is for expletives, not results!

More controversially perhaps, I have to point out that the same argument applies to citing error levels.

It is possible to transform χ² values into error levels (e = 0.05, 0.01, 0.0001, etc – Excel even has a function, CHIDIST, to do this for you) and it is common to see citation of error levels, as if a smaller error estimate meant that the result was ‘stronger’ than another. It is also common to say that something is ‘highly significant’ or mark results with multiple asterisks. But these practices all reflect the same logical mistake.

Like all tests of significance, χ² derives ‘mechanically’ from this combination of effect size and quantity of data. So does the error level. This means that two results with exactly the same size of effect may have different χ² scores, error estimates, etc. merely due to different volumes of data being available.

Consider the following thought experiment.

Suppose we analyse data from two corpora. One has 1 million words and the other 100 million. The same pattern is observed in both, such as a fall of approximately 18 percentage points in ‘cogitate’ uses of think. However, the results from the larger corpus will be supported by more data, and will have a lower error estimate than the results from the smaller corpus. This is, after all, how the maths works!

Reporting that one is ‘highly significant’ and the other merely ‘significant’ is not meaningful. Worse, it implies that the results are different. But in this case they are exactly the same!
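The thought experiment can be run directly. The sketch below takes one assumed 2 × 2 table (counts consistent with the 1920s and 1960s percentages above) and a hypothetical second table with exactly the same proportions but 100 times the data. For one degree of freedom, the error level is the upper tail of the χ² distribution, which can be written with the complementary error function; this is the figure Excel’s CHIDIST reports when given one degree of freedom.

```python
from math import erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 d.f., no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def error_level(chi_sq):
    """Upper-tail probability of a chi-square score with 1 degree of freedom."""
    return erfc(sqrt(chi_sq / 2))


def summarise(a, b, c, d):
    """Return (effect size, chi-square score, error level) for one table."""
    effect = a / (a + b) - c / (c + d)  # simple difference in proportions
    chi_sq = chi_square_2x2(a, b, c, d)
    return effect, chi_sq, error_level(chi_sq)


small = (51, 15, 108, 73)              # ASSUMED counts, smaller sample
large = tuple(100 * x for x in small)  # HYPOTHETICAL: same proportions, 100x the data

effect_s, chi_s, err_s = summarise(*small)
effect_l, chi_l, err_l = summarise(*large)

print(f"small: effect = {effect_s:.3f}, chi-square = {chi_s:.2f}, error level = {err_s:.3g}")
print(f"large: effect = {effect_l:.3f}, chi-square = {chi_l:.2f}, error level = {err_l:.3g}")
```

The effect size is identical in both rows. The χ² score is simply multiplied by 100 and the error level shrinks accordingly, which reflects the quantity of data and nothing else.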

Both χ² scores and error levels are artefacts of the statistical testing process.

To compare two results we need to compare effect sizes.

It is useful to be able to say, for example:

• ‘Cogitate’ uses of think fell by around 18 percentage points (a decline of between 4 and 29 points, using a 95% Newcombe-Wilson interval) between the 1920s and 1960s.
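An interval of this kind can be sketched as follows. It combines the two Wilson intervals using Newcombe’s square-and-add rule for a difference between independent proportions. The counts are again assumptions (consistent with the percentages quoted in this post rather than taken from the original data), and the function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval (no continuity correction) for a proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


def newcombe_wilson(x1, n1, x2, n2, confidence=0.95):
    """Difference p1 - p2 with a Newcombe-Wilson interval (no continuity correction)."""
    p1, p2 = x1 / n1, x2 / n2
    l1, u1 = wilson_interval(x1, n1, confidence)
    l2, u2 = wilson_interval(x2, n2, confidence)
    d = p1 - p2
    # Inner interval widths are combined by squaring and adding.
    lower = d - sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = d + sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return d, lower, upper


# ASSUMED counts: 51/66 'cogitate' uses (1920s) and 108/181 (1960s).
d, d_lower, d_upper = newcombe_wilson(51, 66, 108, 181)
print(f"decline = {d:.1%} ({d_lower:.1%} to {d_upper:.1%})")
```

Because the lower bound of the decline is above zero, the fall is significant at this confidence level, and the interval tells the reader how large the fall might plausibly be.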

If you want to say that one effect size is greater than another, the correct approach is to employ a statistical test. It is possible to compare the results of two experiments by performing a separability test to determine whether they are significantly different. This test compares the difference between the two observed declines against a confidence interval.

Test effect sizes for significant difference if you can. Unfortunately, because of the widespread practice of citing error levels, statistical tests for comparing effect sizes are rarely seen as important, and are little-known (this is one of my personal areas of research).
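As a rough illustration of the idea only, the sketch below compares two declines using a simple Gaussian (Wald) approximation. This is a swapped-in simplification and should not be read as the separability test referred to above, which is based on Wilson-type intervals. All counts are assumed or hypothetical, and the function names are invented for the example.

```python
from math import sqrt
from statistics import NormalDist


def decline(x1, n1, x2, n2):
    """Observed decline p1 - p2 and its Gaussian (Wald) variance estimate."""
    p1, p2 = x1 / n1, x2 / n2
    return p1 - p2, p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2


def declines_differ(exp_a, exp_b, confidence=0.95):
    """Wald z test: do two independent experiments show different declines?"""
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    d_a, var_a = decline(*exp_a)
    d_b, var_b = decline(*exp_b)
    z = (d_a - d_b) / sqrt(var_a + var_b)
    return abs(z) > z_crit, z


exp_small = (51, 66, 108, 181)                # ASSUMED counts (1920s vs 1960s)
exp_large = tuple(100 * x for x in exp_small)  # HYPOTHETICAL: same proportions, 100x data
exp_flat = (6000, 10000, 6000, 10000)          # HYPOTHETICAL: no decline at all

same_effect, z_same = declines_differ(exp_small, exp_large)
other_effect, z_other = declines_differ(exp_small, exp_flat)
print(f"small vs large corpus: differ = {same_effect} (z = {z_same:.2f})")
print(f"small vs flat series:  differ = {other_effect} (z = {z_other:.2f})")
```

The first comparison reports no difference, however ‘highly significant’ the larger corpus result may look on its own. The second shows what a genuinely different effect size looks like under the same test.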

As a general rule, only cite a change if it is significant. Try to avoid saying that a non-significant change is “indicative”. In my book, an indicative result is when you don’t know how to test your data for significance, but the results look sizeable.

So we should amend our first summary to read something like the following. (We take out figures for the 2000s because they do not significantly differ.)

• In the 1920s data ‘cogitate’ uses account for 77% (66-86%) of the uses of the word think (Wilson score interval at a 95% confidence level), falling to around 60% ±7% by the 1960s.

### Sound?

A sound argument is about what we can reasonably infer from our results.

Might there be an alternative explanation that accounts for the observations other than the one we are attempting to find evidence for?

A good researcher needs to play ‘devil’s advocate’. We need to be self-critical, and careful in what we claim our results demonstrate.

First, most of the work here lies in designing the experiment optimally in the first place. Corpus linguists engage in what is termed ex post facto research, meaning that we carry out retrospective data analysis. We cannot easily introduce new experimental conditions. Consider the cost of re-sampling sentences, annotating them and building new corpora! However, we can readily alter our experimental design. We can refine queries and variable definitions, choose alternative baselines, and manually eliminate cases where alternation may not plausibly occur. This question is dealt with at length elsewhere on this blog.

Second, do not misinterpret use proportions as if they represented speaker choices. This data examines four different meanings of the same utterance, the verb think, where the baseline is simply all cases of think. The probability of think being used to express one particular meaning, say quotative think, is meaningful to the hearer (because it represents their exposure to that meaning). However, this probability represents something quite different from the choice of think out of all potential quotative expressions. It is extremely important not to confuse choice and use and the type of inference one can draw from the results.

The growth of quotative think is unlikely to represent a novel meaning in the data. It is more likely that in the past the same meaning was expressed differently. The most plausible hypothesis would be that in the ’20s, speakers employed other verbs in place of think in statements such as I think we should go. The way to evaluate this hypothesis is to pose the question in terms of choice: i.e. compare quotative think against its alternates.

These two issues concern the dependent variable.

A third question concerns whether the independent variable is really measuring what we think it is.

For example, in Magnus Levin’s data above, the independent variable is time, and the dependent variable is the meaning of think. But how do we know that ‘time’ is really representing chronological time per se, or just samples taken from that time period?

This is generally termed a “sampling problem”.

• Were the samples collected in exactly the same way, and how might sociological and technological change interact with time? For example, were we to extend the DCPSE corpus (1960s-1990s) into the future, we would probably wish to include popular modes of communication that simply didn’t exist in the earlier time period (such as text messaging), or which had socially limited use (such as email). Similarly, does ‘time’ in the TIME Magazine Corpus reflect the impact of editorial style changes?
• The International Corpus of English protocol required the collection of 10,000 words of telephone conversations from the 1990s in each of its international 1M word samples. This was a problem in a number of countries: in some, few used the telephone, whereas in others, recording phone conversations was specifically banned by the state (even when prior consent was given). Legal cross-examination is differentially limited by social class, and so forth.

Independent variables are looking less and less independent!

The solution to this problem is usually considered in terms of injunctions to researchers to obtain a balanced sample (an instruction equivalent to “first, catch your rabbit!”).

The problem is that it is rarely possible to balance a sample for everything! Consequently it is extremely important to recognise the limitations of your data, work around these problems if you can and state the problems plainly if you can’t.

To conclude, consider the following graph, which shows the distribution of tensed VPs per million words by text category and time period (LLC=1960s, ICE-GB=1990s) in DCPSE.

Tensed VPs per million words, by text category, compared across the two ‘time’ subcorpora of DCPSE (after Bowie et al. forthcoming).

This graph is discussed in some detail in That vexed problem of choice. It shows that there is significant variation over time and genre in ‘tensed VP density’ (the number of tense-marked verb phrases as a proportion of the number of words) in DCPSE, although superficially, when we compare all the texts together (‘Total’ column pair on the right), the VP density appears to be constant over time.

In that paper the authors employ this graph to demonstrate that this variation is substantial, and therefore by eliminating it (by refining a modal baseline from words to tensed VPs), they are able to increase the soundness of their conclusions about changes in core modal use over time.

However, another point should also be made. Whenever we limit a sample to particular types of text we tend to introduce or increase the effect of particular sources of variation (such as the dominance of a particular editorial style) and reduce the number of speakers or writers. We may also undo some of the attempts to balance the sample (e.g. by gender or social class). Bowie et al. comment that some categories (such as legal cross-examination) have a small number of participants and results from these need to be treated with caution.

### Correlation / Cause

Finally, note that in this discussion I have avoided following the well-trodden path of soberly pronouncing on the difference between correlation and cause. True, statistical results are couched in terms of correlational evidence, and experiments do not prove what the cause of that correlation might be.

However the problem with the conventional discussion is that we all tend to think in terms of causes, and we can slip into this language relatively easily. Stern warnings are not enough.

It seems to me much more helpful to encourage every researcher to ask themselves the question –

What do my results really demonstrate?

– and take it from there.

### References

Bowie, J., S.A. Wallis and B. Aarts forthcoming. Contemporary change in modal usage in spoken British English: mapping the impact of ‘genre’. In Marín Arrese, J.I. and J. Van der Auwer (eds.). Current issues on Evidentiality and Modality in English. Berlin: Mouton de Gruyter.

Levin, M. 2013. The progressive in modern American English. In Aarts, B., J. Close, G. Leech and S.A. Wallis (eds.). The Verb Phrase in English: Investigating recent language change with corpora. Cambridge: CUP.

Wallis, S.A. 2013. z-squared: the origin and application of χ². Journal of Quantitative Linguistics 20:4, 350-378.