Over the last year, the field of psychology has been rocked by a major public dispute about statistics. This concerns the failure of claims in papers, published in top psychological journals, to replicate.
Replication is a big deal: if you publish a correlation between variable X and variable Y (say, that the use of the progressive increases over time, and that this increase is statistically significant), you expect that the finding would be replicated were the experiment repeated.
I would strongly recommend Andrew Gelman’s brief history of the developing crisis in psychology. It is not necessary to agree with everything he says (personally, I find little to disagree with, although his argument is challenging) to recognise that he describes a serious problem here.
There may be more than one reason why published studies have failed to obtain compatible results on repetition, and so it is worth sifting these out.
In this blog post, what I want to do is try to explore what this replication crisis is – is it one problem, or several? – and then turn to what solutions might be available and what the implications are for corpus linguistics.
A corpus linguistics example
The debate between Neil Millar and Geoff Leech regarding the alleged increase (Millar 2009) and decline (Leech 2011) of the modal auxiliary verbs is an example of this problem.
Millar based his conclusions on the TIME corpus, discovering that the rate of modal verbs per million words tended to increase over time. Leech, using the Brown series of US English corpora, discovered the opposite. Both applied statistical methods to their data but obtained very different conclusions.
Inferential statistics operates by predicting the result of repeated runs of the same experiment, i.e. on samples of data drawn from the same population.
Stating that something “significantly increases over time” can be reformulated as:
- subject to caveats of random sampling (the sample is, or approximates to, a random sample of utterances drawn from the same population), and Binomial variables (observations are free to vary from 0 to 1),
- we can calculate a confidence interval at a given error rate (say 1 in 20 times for a 5% error rate / 95% interval) on the difference in two observations of variable X taken at two time points 1 and 2, x₂ – x₁,
- all points within this interval (including the lower bound) are greater than 0,
- on repeated runs of the same experiment we can expect to see an observation fall outside of the confidence interval of the difference at the predicted rate (here, 1 time in 20).
Note: For the purposes of this blog post, I am focusing on the last bullet point – when we say that something “fails to replicate”, we mean that on a repetition the result falls outside the confidence interval of the difference on the very next occasion!
Leech obtained a different result from Millar on the first attempted repetition of this experiment. This could be a fluke, but it seems to be a failure to replicate. There should only be a 1 in 20 chance of this happening.
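To make the reformulation above concrete, here is a minimal sketch of an interval on the difference x₂ – x₁. It assumes Wilson score intervals combined by Newcombe's method, which is one common choice for Binomial proportions and not necessarily what either study used. The function names and the example proportions are invented for illustration.

```python
import math

def wilson_interval(p, n, z=1.959964):
    """Wilson score interval for an observed proportion p out of n cases."""
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

def newcombe_wilson_difference(p1, n1, p2, n2, z=1.959964):
    """Interval on d = p2 - p1, built from the two Wilson intervals."""
    l1, u1 = wilson_interval(p1, n1, z)
    l2, u2 = wilson_interval(p2, n2, z)
    d = p2 - p1
    lower = d - math.sqrt((p2 - l2) ** 2 + (u1 - p1) ** 2)
    upper = d + math.sqrt((u2 - p2) ** 2 + (p1 - l1) ** 2)
    return lower, upper

# Hypothetical observations at time 1 and time 2 (not data from either study).
lower, upper = newcombe_wilson_difference(p1=0.10, n1=1000, p2=0.15, n2=1000)
print(lower, upper)
print("significant increase" if lower > 0 else "not significant")
```

If the whole interval lies above zero, the increase is significant at the chosen error level. A replication whose observed difference falls outside this interval is what this post calls a failure to replicate.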
Observing such a replication failure should lead us to ask some searching questions about these two studies, many of which are discussed elsewhere in this blog.
Much of the controversy can be summed up by the bottom row in this table, drawn from Millar (2009). This appears to show a 23% increase in modal use between the 1920s and 2000s. With a lot of data and a sizeable effect, this increase seems bound to be significant.
| 1920s | 1930s | 1940s | 1950s | 1960s | 1970s | 1980s | 1990s | 2000s | % diff 1920s–2000s |
In attempting to identify why Leech and Millar obtain different results, the following questions should be considered.
- Are the two samples drawn from the same population, or are they drawn from two distinct populations? To put it another way, are there characteristics of the TIME data that make it distinct from the general written data in the Brown corpora? For example, does TIME have a ‘house style’, with subeditors enforcing it, which has led to a greater frequency of modal use? Has TIME tended to curate more stories with more modal hedges than the overall trend? Jill Bowie (Bowie et al. 2013) reported that genre subdivisions within the spoken DCPSE corpus often exposed different modal trends.
- Does Millar’s data support a general observation of increased modal use? Bowie observes that Millar’s aggregate data fluctuates over the entire time period (see Table, bottom row), and some changes in sub-periods appear to be consistent with the trend reported by Leech in an earlier study in 2003. According to this observation, simply expressing the trend as an increase in modal verb use seems misleading.
- Is it legitimate to aggregate all modals together? In one sense, modals are a well-defined category of verb: a closed category, especially if one excludes the semi-modals. So “modal use” is a legitimate variable. But we can also see that different modal verbs are undergoing different patterns of change over time (see Table). Millar reports that shall and must are in decline in his data while will and can are increasing. Whereas shall and will may be alternates in some contexts, this does not mean that bundling all modal trends together is particularly meaningful. Moreover, since the synchronic distribution of modals (like most linguistic variables) is sensitive to genre, this issue also interacts with my first bullet point, i.e. the fact that there are known differences between corpora.
- How reliable is a per-million-word measure? What does the data look like if we use a different baseline, for example, modal use per tensed verb phrase (or tensed main verb)? Doing this allows us to factor out variation in ‘tensed VP density’ (i.e. the variation in potential sites for modals to be deployed) between texts. Failure to do this (and neither Leech nor Millar does) means that we are not measuring when writers choose to use modal verbs, but the rate at which we, the readers, are exposed to them. See That vexed problem of choice.
If VP density in text samples changes over time in either corpus, this may explain these different results – not as a result of increasing or declining modal use but as a result of increasing or declining tensed VP density (or declining / increasing density of other constituents). More generally, word-based baselines almost always conflate opportunity and use because the option to insert the element is not available following every other word (exceptions might include pauses or expletives, but these exceptions prove the rule). This conflation undermines the Binomial model and increases the risk that results will not replicate. The solution is to focus on identifying each choice-point as much as possible.
- Does per word (per-million-word) data conform to the Binomial statistical model? Since the entire corpus cannot consist of modal verbs, observations of X can never approach 1, so the answer has to be no. However, the effect of this inappropriate model is that it tends to lead to the underreporting of otherwise significant results. See Freedom to vary and statistical tests. This may be a problem, but logically, it cannot be an explanation for obtaining two different ‘significant’ results in opposite directions!
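The baseline question can be illustrated with a small sketch. The counts below are made up purely for illustration, and the function name is hypothetical. They show how a per-million-word rate can rise while the rate per tensed VP stays flat, simply because tensed VP density has changed.

```python
def modal_rates(modals, tensed_vps, words):
    """Return (modals per million words, modals per tensed VP)."""
    per_million_words = modals * 1_000_000 / words
    per_tensed_vp = modals / tensed_vps
    return per_million_words, per_tensed_vp

# Invented figures: same sample size in words, but more tensed VPs later on.
early_pmw, early_vp = modal_rates(modals=1500, tensed_vps=10_000, words=100_000)
late_pmw, late_vp = modal_rates(modals=1800, tensed_vps=12_000, words=100_000)

print("per million words:", early_pmw, "->", late_pmw)
print("per tensed VP:    ", early_vp, "->", late_vp)
```

In this invented case the word-based rate increases by a fifth, yet writers choose a modal at the same proportion of the available sites. A word baseline alone cannot tell these two situations apart.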
All of the above are reasons to be unsurprised at the fact that Millar’s summary finding was not replicated in Leech’s data. But to be fair, many of Millar’s individual trends did appear to be consistent with results found in the Brown corpus.
The replication crisis has been most discussed in psychology and the social sciences. In psychology, some published findings have been controversial to say the least. Claims that ‘Engineers have more sons; nurses have more daughters’ have tended to attract the interest of other psychologists relatively quickly. But this is shooting fish in a barrel.
In psychology, it is common to perform studies with small numbers of participants – 10 per experimental condition is usually cited as a minimum, which means that between 20 and 40 participants becomes the norm. Many kinds of failure to replicate are due to what statisticians tend to call ‘basic errors’, such as using an inappropriate statistical test. I discuss this elsewhere in this blog.
In this blog I have tended to argue for applying the simplest possible experimental designs (2 × 2 contingency tests, for example) over multivariate regression algorithms which may work, but are treated as ‘black boxes’ by almost all who use them. Such algorithms may ‘over fit’ data, i.e. they match the data more closely than is mathematically justified.
I argue that if you don’t understand how your results were derived, you are taking them on faith.
This does not mean that I think multi-variable methods are theoretically inferior, or less powerful than simple tests: some may well be superior. Rather, my objection is that before we use them we need to be sure that we understand what they are doing with our data. We have to ask ourselves constantly, what do our results mean?
However, the replication problem does not go away entirely once we have dealt with these so-called basic errors.
The road not travelled
Andrew Gelman and Eric Loken (2013) raise a more fundamental problem that, if valid, is particularly problematic for corpus linguists. This concerns a question that goes to the heart of the post-hoc analysis of data, and the fundamental philosophy of statistical claims and the scientific method.
Essentially their argument goes like this.
- All data contains random noise, and thus every variable in a dataset (extracted from a corpus) will contain random noise. Researchers tend to assume that by employing a significance test we ‘control’ for this noise. But this is a mischaracterisation. In fact, even faced with a dataset consisting of pure noise, we would detect a ‘significant’ result 1 in 20 times (at a 0.05 threshold).
- Any data set may contain multiple variables, there are multiple potential definitions of these variables, and there are multiple analyses we could perform on the data. In a corpus we could modify definitions of variables, perform new queries, change baselines, etc., to perform new analyses.
- It follows that there is a very large number of potential hypotheses we could test against the data. (Note: this is not an argument against choosing a better baseline on theoretical grounds!)
This is not very controversial. However, Gelman and Loken’s more provocative claim is as follows.
- Few researchers would admit to running very many tests against data and reporting results, which the authors term ‘fishing’ for significant results, or ‘p-hacking’. There are some algorithms that do this (multivariate logistic regression anyone?), but most research is not like this.
- Unfortunately, the authors argue, standard post-hoc analysis methods – exploring data, graphing results and reporting significant results – do much the same thing. We dispense with blind alleys (‘forking paths’), because we can see that they are not likely to produce significant results. Although we don’t actually run these dead-end tests, for mathematical purposes our educated eyeballing of data to focus on interesting phenomena has done the same thing.
- As a result, we overestimate the robustness of our results, and they often fail to replicate.
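The arithmetic behind this argument can be sketched with a small simulation. It is a deliberate simplification: the ‘paths’ here are explicit, independent tests on pure noise, whereas Gelman and Loken’s point is that the selection is usually implicit. The function names, sample sizes and the choice of a pooled two-proportion z test (equivalent to a 2 × 2 chi-square test) are assumptions made for illustration.

```python
import math
import random

def z_two_proportions(k1, n1, k2, n2):
    """Pooled two-proportion z statistic for k1/n1 versus k2/n2."""
    p = (k1 + k2) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return 0.0 if se == 0 else (k2 / n2 - k1 / n1) / se

def draw(rng, n, p):
    """Number of successes in n Bernoulli trials with probability p."""
    return sum(rng.random() < p for _ in range(n))

def significant_rate(paths, trials=1000, n=200, p=0.3, z_crit=1.959964, seed=1):
    """Share of pure-noise 'studies' that find at least one significant
    contrast when `paths` unrelated variables are available to report."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        for _ in range(paths):
            # Both samples come from the same population: any 'effect' is noise.
            z = z_two_proportions(draw(rng, n, p), n, draw(rng, n, p), n)
            if abs(z) > z_crit:
                hits += 1
                break
    return hits / trials

print("one planned test:", significant_rate(paths=1))
print("best of five:    ", significant_rate(paths=5))
```

With a single planned test the rate should sit close to 1 in 20. With five paths to choose from it should be several times higher, even though each individual test is still run at the 0.05 threshold.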
Gelman and Loken are not alone in making this criticism. Cumming (2014) objects to ‘NHST’ (null hypothesis significance testing) interpreted as an imperative that
“explains selective publication, motivates data selection and tweaking until the p value is sufficiently small, and deludes us into thinking that any finding that meets the criterion of statistical significance is true and does not require replication.”
Since it would be unfair to criticise others for a problem that my own work may be prone to, let us consider the following graph which we used while writing Bowie and Wallis (2016). The graph does not appear in the final version of the paper – not because we didn’t like it, but because we decided to adopt a different baseline in breaking down an overall pattern of change into sub-components.
There are two critical questions that follow from Gelman and Loken’s critique.
- In plotting this kind of graph and reporting confidence intervals, are we misrepresenting the level of certainty found in the graph?
- Are we engaging in, or encouraging, retrospective cherry-picking of contrasts between observations and confidence intervals?
In the following graph there are 19 decades and 5 trend lines, i.e. 95 confidence intervals. There are 171 × 5 potential pairwise comparisons between decades along each line, and 10 × 19 vertical pairwise comparisons between lines within each decade. So there are 1,045 potential statistical pairwise tests which would be reasonable to carry out. With a 1 in 20 error rate, we would expect around 52 ‘significant’ pairwise comparisons to arise by chance alone, and these would be incapable of replication.
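The counts in this paragraph can be reproduced in a few lines. The variable names are mine, and the expected number of chance results assumes the worst case in which no comparison reflects a real difference.

```python
from math import comb

decades = 19
trend_lines = 5

# Comparisons between any two decades along the same trend line.
within_line = comb(decades, 2) * trend_lines
# Comparisons between any two trend lines within the same decade.
within_decade = comb(trend_lines, 2) * decades

total = within_line + within_decade
expected_by_chance = total / 20  # 1 in 20 error rate

print(within_line, within_decade, total, expected_by_chance)
```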
Gelman, Loken, Cumming et al. would argue that by selecting a few statistically significant claims from this graph, we have committed precisely the error they object to.
However, I have to defend this graph, and others like it, by arguing that this is not our method. We don’t sift through 1,045 possible comparisons and then report significant results selectively! In the paper, and in our work more generally, we really don’t encourage this kind of cherry-picking (the human equivalent of over-fitting). We are more concerned with the overall patterns that we see, general trends, etc., which are more likely to be replicable in broad terms.
Thus, for example, in that paper we don’t pull out specific significant pairwise comparisons to make strong claims. In this particular graph we can see an apparently statistically significant sharp decline between 1900 and 1930 in the tendency of writers to use the verb SAY (as in he is said to have stayed behind) before a to-infinitive perfect, compared to the other verbs in the group. This observation may be replicable, but the conclusions of the paper do not depend on this observation. This claim, and similar claims, do not appear in the paper.
Similarly, if we turn back to Neil Millar’s modals-per-million-word data for a moment, Bowie’s observation that the data does not show a consistent increase over time is interesting. Millar did not select the time period in order to report that modals were on the increase – on the contrary, he non-arbitrarily took the start and end point of the timeframe sampled. But the conclusion that ‘modals increased over the entire period’ was only one statement that described the data. In shorter periods there was a significant fall, and different modal verbs behaved differently. Indeed, the complexity of his results is best summed up by the detailed graphs in his paper!
In conclusion: it is better to present and discuss the pattern, not just the end point – or the slogan.
Nonetheless we may still have the sneaking suspicion that what we are doing is a kind of researcher bias. We tend to report statistically significant results and ignore those inconvenient non-significant ones. The fear is that results assumed to be due to chance 1 in 20 times are in fact due to chance more like 1 in 5 times (say), simply because we have – inadvertently and unconsciously – already preselected our data and methods to obtain significant results. Some highly experienced researchers have suggested that we fix this problem by adopting tougher error levels: adopt a 1 in 100 level and, with the same fourfold inflation, we might arrive at 1 in 25. The problem is that this assumes we know the appropriate multiplier to apply.
Recommendation 1: include a replication step
Gelman and Loken suggest instead that published studies should always involve a replication process. They argue it is preferable that researchers publish half as many experiments and include a replication step than publish non-replicable results.
Suggested method: Before you start, create two random subcorpora A and B by randomly drawing texts from the corpus and assigning them to A and B in turn. You may wish to control for balance, e.g. to ensure subsampling is drawn equitably from each genre category. Perform the study on A, and summarise the results. Without changing a single query, variable or analysis step, apply exactly the same analysis to B.
Do we get compatible results, i.e. results that fall within the confidence intervals of the first experiment? More precisely, are the results statistically separable?
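The splitting step of the suggested method might be sketched as follows. The text identifiers, genre labels and function name are hypothetical, and this is only one way to balance genres: shuffle within each genre, then deal texts to A and B in turn.

```python
import random
from collections import defaultdict

def split_corpus(texts, seed=0):
    """Split (text_id, genre) pairs into two subcorpora A and B.

    Texts are shuffled within each genre and dealt out alternately,
    so A and B differ by at most one text per genre and overall.
    """
    rng = random.Random(seed)
    by_genre = defaultdict(list)
    for text_id, genre in texts:
        by_genre[genre].append(text_id)

    a, b = [], []
    turn = 0  # carried across genres so neither half is systematically larger
    for genre in sorted(by_genre):
        ids = by_genre[genre]
        rng.shuffle(ids)
        for text_id in ids:
            (a if turn % 2 == 0 else b).append(text_id)
            turn += 1
    return a, b

# Hypothetical corpus: ten news texts and seven fiction texts.
corpus = [(f"news{i:02d}", "news") for i in range(10)]
corpus += [(f"fic{i:02d}", "fiction") for i in range(7)]
sub_a, sub_b = split_corpus(corpus, seed=42)
print(len(sub_a), len(sub_b))
```

Fixing the seed and recording it alongside the results means the same partition can be regenerated by anyone repeating the study.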
An alternative to formal replication is to repeat the experiment with well-defined, as distinct from randomly generated, subcorpora.
Sampling subcorpora: Suppose you apply an analysis to spoken data in ICE-GB, and then repeat it with written data. Do we get broadly similar results? If we obtain comparable results for two subcorpora with a known difference in sampling, it is probable they would pass a replication test where two subsamples were not sampled differently. On the other hand, if results are different, this would justify further investigation.
Even where replication is not carried out (for reasons of insufficient data, perhaps), an uncontroversial corollary of this argument is that your research method should be sufficiently transparent so that it can be replicated by others.
As a general principle, authors should make raw data available to permit a reanalysis by other analysis methods. I find it frustrating when papers publish per million word frequencies in tables, when what is needed for a reanalysis is raw frequency data!
Recommendation 2: focus on large effects – and clear visualisations
Another of Gelman and Loken’s recommendations is that researchers need to spend more time focusing on sizes of effect, rather than just reporting statistical significance. With lots of data and large effect sizes, the problem is reduced. Certainly we should be wary of citing just-significant results with a small effect size.
Where does this leave the arguments I have made elsewhere in favour of visualising data with confidence intervals? One of the implications of the ‘forking paths’ argument is that we tend not to report dead-end, non-significant results. But well considered graphs can visualise all data in a given frame, rather than selected data (of course we have to ‘frame’ this data, select variables, etc.).
One advantage of graphing data with confidence intervals is that we apply the same criteria to all data points and allow the reader to interpret the graph. Significant and non-significant contrasts are available to be viewed. We also visualise effect sizes and the weight of evidence (confidence intervals), even if it is arguable that our model is insufficiently conservative.
Thus a strength of Millar’s paper is the reporting of trends and graphs. In the graph above, the confidence intervals improve our understanding of the overall trends we see.
We just should not assume that every significant difference will be replicable.
Recommendation 3: play devil’s advocate
This is really one of mine, but I suggest it is implicit in the argument above.
It seems to me to be an absolutely essential requirement for any empirical scientist to play devil’s advocate to their own hypothesis.
That is, it is not sufficient to ‘find something interesting in data’, and publish. What we are really trying to do is detect meaningful phenomena in data, or to put it another way, we are trying to find robust evidence of phenomena that have implications for linguistic theory. We are trying to move from observed correlation to a hypothesised underlying cause.
Statistics is a tool to help us do this. But logic also plays an essential part.
Without wishing to create a checklist for empirical linguistics (such that a researcher is convinced of the validity of their results simply because they can tick off the list), we might argue that the following steps are necessary in all empirical research.
- Identify the underlying research question, framed in general theoretical terms.
- Operationalise the research question as a series of testable hypotheses or predictions, and evaluate them. Plot graphs! Visualising data with confidence intervals allows us to visualise expected variation and make more robust claims.
- Focus reporting on global patterns across the entire dataset. If your research ends up prioritising an apparently unusual local pattern in a selected part of the data, consider whether this may be an artefact of sampling.
- Critique the results of this evaluation in terms of the original research question, and play devil’s advocate: what other possible underlying explanations might there be for the observed results?
- Consider alternative hypotheses and test them. Try to design new experiments to separate out different possible explanations for the observed phenomenon.
- Plan to include a replication step prior to publication. This means being prepared to partition the data in the way described above, dividing the corpus into different pools of source texts.
Whether or not Gelman and Loken’s argument applies to your corpus linguistics study – and we have to eliminate basic errors first – the principal conclusion is that it is difficult to overstate the importance of reporting accuracy and transparency. If the study does not appear to replicate in the future, possible reasons must be capable of exploration by future researchers. It would not have been possible to explore the differences between Leech and Millar’s data had Neil Millar simply summarised a few trends and reported some statistically significant findings.
It is incumbent on all of us to properly describe the limitations of data and sampling; definitions of variables and abstraction (query) methods for populating them; as well as graphing data to reveal both significant and non-significant patterns at the same time.
A typical mistake is to refer to ‘British English’ (say) as shorthand for ‘data drawn from British English texts sampled according to the sampling frame defined in Section 3’. Many failures to replicate in psychology can be attributed to precisely this type of logical error – that the experimental dataset is not a reliable model for the population claimed.
Finally, Cumming (2014) makes an important distinction between exploratory research and prespecified research. Corpus linguistics is almost inevitably exploratory, as it is impossible to prespecify data collection in post-hoc analysis. In a natural experiment we cannot control for confounding variables, and we must frame our conclusions accordingly.
Bowie, J., Wallis, S.A. and Aarts, B. 2013. Contemporary change in modal usage in spoken British English: mapping the impact of “genre”. In Marín-Arrese, J.I., Carretero, M., Arús H.J. and van der Auwera, J. (eds.) English Modality. Berlin: De Gruyter, 57–94.
Bowie, J. and Wallis, S.A. 2016. The to-infinitival perfect: A study of decline. In Werner, V., Seoane, E., and Suárez-Gómez, C. (eds.) Re-assessing the Present Perfect, Topics in English Linguistics (TiEL) 91. Berlin: De Gruyter, 43–94.
Cumming, G. 2014. The New Statistics: Why and How. Psychological Science 25(1), 7–29.
Gelman, A. and Loken, E. 2013. The garden of forking paths. Columbia University.
Leech, G. 2011. The modals ARE declining: reply to Neil Millar’s ‘Modal verbs in TIME: frequency changes 1923–2006’. International Journal of Corpus Linguistics 16(4).
Millar, N. 2009. Modal verbs in TIME: frequency changes 1923–2006. International Journal of Corpus Linguistics 14(2), 191–220.