About corp.ling.stats

This is a personal blog discussing experimental design and statistics for corpus linguistics.

Corpus linguistics is an approach to linguistics research which focuses on analysing volumes of annotated text data, or corpora.

Over the last decade and more, the size and complexity of these corpora have increased.

  • Increased size – with multi-million-word corpora becoming commonplace – means that we now have the weight of evidence that might allow statistically meaningful statements to be made on hitherto unanswerable questions.
  • Increased complexity – such as the series of recursive grammatical tree structures found in million-word parsed corpora (Penn Treebank, ICE-GB, Prague Dependency Treebank etc.) – raises new possibilities for research, and consequent issues of how to pose research questions correctly to obtain sound conclusions.

I believe that the statistical methods we have at our disposal as researchers are frequently weak and often misunderstood.

In his excellent book on corpus linguistics with the statistical programming platform R, Stefan Gries comments that his book is not “a fully fledged and practical introduction to corpus-linguistic statistics – in fact I am not aware of such a textbook” (Gries 2009a: 173). I hope this blog will help fill some gaps for the time being. This blog is not a textbook, but a live work in progress, to which I invite linguists to contribute.

UPDATE November 2020: My new book Statistics in Corpus Linguistics Research, built upon the research discussed on this blog to that date, is published. Whether it is “fully fledged” or “practical” I will leave others to decide.

Secondly, the growth in annotation complexity has not been matched by analytical methods. We should have enough data in our parsed corpora, for example, to potentially address some novel and non-trivial linguistic questions, such as how we might empirically evaluate the grammatical framework employed. But we do not yet have the methods to allow us to do so.

There are problems on both sides of the linguistics/statistics divide.

  • Many linguists are not trained in mathematics and statistics, and rely on off-the-shelf advice which may not be appropriate or optimal for their research question.
  • Conversely, most statisticians work with data which is very different in structure from annotated corpora.

This blog is an attempt to find a way to bring statistics to linguists and allow us to imagine new types of experiments we might conduct.

This blog is not intended as a complete survey of general methods of inferential statistics, which others, including Butler (1985), Oakes (1998), Sheskin (2000) and Gries (2009b) have provided. However, many of these methods are difficult to apply or inappropriate for typical corpus data, and it is easy for the lay reader to become confused about which test to apply in what circumstances. I therefore emphasise simplicity and core methods which can lead to defensible results, and I discuss the types of conclusions that may be drawn from results.

After all, if you are not sure what your results mean, or why you chose one method over another, then you will have great difficulty explaining your results to a wider audience.

Rather than repeat common ground found in statistics primers, therefore, in the first phase of my work up to 2020 I focused on practical issues such as how to

Note that significance tests (a topic that typically dominates experimental design and statistics textbooks) represent only one area among several in this list.

I believe these are some of the most productive areas for improving the effectiveness of corpus research, but if I haven’t covered something important do let me know! The fact that this blog contains original research papers should indicate just how underexplored some of these issues are.

Since the publication of my book, I have been primarily working on a new topic: a general method for accurately estimating confidence intervals for properties derived from binomial and multinomial statistics. The idea is that if a metric, let’s call it h, can be defined by a formula consisting of independently-observed frequencies and sample sizes (and thus binomial proportions), and the like, it should be possible to obtain a good quality confidence interval for it. The benefits of doing so are multiple.

  1. We can plot h scores with intervals, so we can visualise the reliability of their estimate, pay attention to the bounds, etc.
  2. We can compare two scores, h1 and h2, for significant difference. In other words, we can conclude that h2 > h1, or vice versa. This can be useful in a ‘meta test’ comparing runs of an experiment, or comparing and plotting h scores over a series of datasets.
  3. We can reinterpret ‘large’ and ‘small’ effects for statistical power. We can consider whether a lower bound of the absolute value, | h |, is greater than commonly-cited thresholds for effect size.

Each of the above are avenues to reintroduce logical reasoning into the interpretation of statistical claims.

This blog is aimed at the informed linguistic researcher. I don’t assume a background in statistics, and there are papers and blog posts which try to explain inferential statistics from first principles. This type of explanation is frequently skipped over in textbooks (Wackerly et al. 2008 being an exception) and, in my experience, the result is often a major source of confusion and anxiety among researchers.

This concerns me, because all statistical tests (and related methods) are based on an expected model of behaviour: if the model is wrong, an evaluation against it will necessarily be meaningless! Experimental design is therefore critical – and inevitably a linguistic question. Hence the slogan:

you can’t fix a weak experiment with a good statistic.

The “trick” is to get the experimental design right in the first place.

Along the way we’ll need to discuss some fairly tricky concepts. I’ll try my best to express arguments in simple and straightforward language, but I will inevitably have to use some mathematical definitions, terms and formulae. Unfortunately, maths comes with the territory!

But if you can’t follow my explanations – and I assume that this will happen not infrequently – then please let me know. Do feel free to email me with queries and errata.

– Sean Wallis

Further reading

These books come very highly recommended.

For an alternative, linguistics-oriented, introduction to some of these topics see Gries (2009b). If my explanations don’t convince you, perhaps Stefan’s will.

A thorough compendium of methods, but a more mathematically forbidding read, is Sheskin (2000). Again, I cannot recommend this book highly enough.

A standard reference work found on numerous mathematics student reading lists is Wackerly et al. (2008). Like this blog, this book differs from the first two by introducing inferential statistics through probability theory.

My focus parallels that of the ‘New Statistics’ school in psychology (Cumming and Calin-Jageman 2017). I suspect that this is not entirely coincidental, although my work was carried out independently to theirs.

This school objects to simplistic significance testing and prefer visualisations with confidence intervals. I agree with many of their conclusions, although perhaps not all. However, the methods I focus on and the intervals (and tests) I propose are highly compatible with this perspective.

Finally, there is my new book — Statistics in Corpus Linguistics Research: a new approach (Wallis 2021) — which I would encourage students and colleagues to read. Some chapters had their genesis in early versions of articles published in this blog, some chapters are new material which I have not published here. All have been heavily rewritten and integrated. For more information, see this announcement.

Citation

corp.ling.stats includes a number of academic discussion papers, spreadsheets and PowerPoint presentations which I am opting to make freely available prior to publication. There are several years of private work underpinning much of this content (hence initial e-publication dates are given for citation purposes until conventional publication is formalised). All material is subject to copyright and may be quoted on the condition it is cited appropriately.

Blog posts may be cited as

Wallis S.A. year. Title, corp.ling.stats. London: Survey of English Usage, UCL. URL (accessed: date of reading).

where the year is given in the URL. For example

Wallis S.A. 2012. Robust and Sound?, corp.ling.stats. London: Survey of English Usage, UCL. https://corplingstats.wordpress.com/2012/04/04/robust-and-sound (accessed 10 April 2012).

Blog posts that contain papers include a recommended method of citation at the end (this method will change if its academic publication status changes). If you are in doubt about the appropriate citation of any material please email me or post a question.

References

Butler, C.S. 1985. Statistics in Linguistics. Oxford: Blackwell. » ePublished.

Cumming, G. and Calin-Jageman, R. 2017. Introduction to the New Statistics. London/New York: Routledge.

Gries, S. Th. 2009a. Quantitative Corpus Linguistics with R. New York/London: Routledge.

Gries, S. Th. 2009b. Statistics for Linguistics with R. Berlin/New York: Mouton de Gruyter.

Oakes, M.P. 1998. Statistics for Corpus Linguistics. Edinburgh: EUP.

Sheskin, D.J. 2000. Handbook of Parametric and Nonparametric Statistical Procedures. 2nd Edition. Boca Raton, Fl: Chapman Hall/CRC Press.

Wackerly, D.D., Mendenhall, W., and Scheaffer, R.L. 2008. Mathematical statistics with applications. 7th Edition. Belmont, Ca.: Brooks/Cole.

Wallis, S.A. 2021. Statistics in Corpus Linguistics Research. New York: Routledge. » Announcement

Postscript

If you found this blog because you are interested in experimental design and statistics but do not work with linguistic data, I hope you find some of the posts useful.

Corpus linguistics has three particular characteristics which shape this blog:

  • Data is highly structured (sentences are marked up for part of speech, and in some cases parsed). Sampling is derived from contiguous text rather than a genuine random sample of items.
  • Collecting new data from scratch (and recording, transcribing, structurally annotating, and grammatically annotating it) is often extremely expensive as a result, and corpus linguistics is inevitably ex post facto research: analysing and reanalysing a richly annotated database.
  • Variables are mostly categorical discrete alternatives (a, b, c) rather than being numerical. Some variables may be given a number (e.g. time, clause length), but they tend to be the exception. Hence there is a concentration on Binomial confidence intervals and contingency tests.

Nonetheless, much of what is written here concerns general principles of inferential statistics, and is equally applicable to other fields of scientific inquiry.

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.