This is a personal blog discussing **experimental design and statistics** for **corpus linguistics**.

**Corpus linguistics** is an approach to linguistics research which focuses on analysing volumes of annotated text data, or **corpora**.

Over the last decade and more, the **size** and **complexity** of these corpora have increased.

- Increased **size** – with multi-million-word corpora becoming commonplace – means that we now have the weight of evidence that might allow statistically meaningful statements to be made on hitherto unanswerable questions.
- Increased **complexity** – such as the series of recursive grammatical tree structures found in million-word **parsed corpora** (Penn Treebank, ICE-GB, Prague Dependency Treebank etc.) – raises new possibilities for research, and consequent issues of how to pose research questions correctly to obtain sound conclusions.

I believe that the statistical methods we have at our disposal as researchers are frequently weak and often misunderstood.

In his excellent book on corpus linguistics with the statistical programming platform R, Stefan Gries comments that his book is not “a fully fledged and practical introduction to corpus-linguistic statistics – in fact I am not aware of such a textbook” (Gries 2009a: 173). I hope this blog will help fill some gaps for the time being. This is not a textbook, but a live work in progress, to which I invite linguists to contribute.

In addition, the growth in annotation complexity has not been matched by advances in analytical methods. We should have enough data in our parsed corpora, for example, to potentially address some novel and non-trivial linguistic questions, such as how we might empirically evaluate the grammatical framework employed. *But we do not yet have the methods to allow us to do so.*

There are problems on both sides of the linguistics/statistics divide.

- Many **linguists** are not trained in mathematics and statistics, and rely on off-the-shelf advice which may not be appropriate or optimal for their research question.
- Conversely, most **statisticians** work with data which is very different in structure from annotated corpora.

This blog is an attempt to find a way to bring statistics to linguists and allow us to imagine new types of experiments we might conduct.

This blog is *not* intended as a complete survey of general methods of inferential statistics, which others, including Butler (1985), Oakes (1998), Sheskin (2000) and Gries (2009b), have provided. However, many of these methods are difficult to apply or inappropriate for typical corpus data, and it is easy for the lay reader to become confused about which test to apply in what circumstances. I therefore emphasise simplicity and core methods which can lead to defensible results, and I discuss the types of conclusions that may be drawn from results.

After all, if you are not sure what your results mean, or why you chose one method over another, then you will have great difficulty explaining your results to a wider audience.

Rather than repeat common ground found in statistics primers, therefore, I have focused on practical issues such as how to

- pose research questions in terms of choice and constraint,
- employ confidence intervals correctly (including in graph plots),
- select optimal significance tests,
- measure the size of the effect of one variable on another,
- estimate the similarity of distribution patterns, and
- evaluate whether the results of two experiments significantly differ.

Note that significance tests (a topic that typically dominates experimental design and statistics textbooks) represent only one area among several in this list.

I believe these are some of the most productive areas for improving the effectiveness of corpus research, but if I haven’t covered something important do let me know! The fact that this blog contains original research papers should indicate just how underexplored some of these issues are.
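As a rough illustration of one item in the list above (employing confidence intervals), here is a minimal Python sketch of an interval on an observed proportion. It assumes the Wilson score interval, a standard choice for binomial proportions; this page does not itself name a method. The function name `wilson_interval` and the counts are hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for an observed proportion successes / n.

    z is the two-tailed critical value of the standard Normal distribution
    (about 1.96 for a 95% interval).
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n                      # observed proportion
    z2 = z * z
    denom = 1 + z2 / n
    centre = (p + z2 / (2 * n)) / denom    # interval midpoint, pulled towards 0.5
    half = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical counts: option a is chosen in 30 of 100 opportunities.
low, high = wilson_interval(30, 100)
print(f"p = 0.30, 95% interval = ({low:.3f}, {high:.3f})")
```

The interval is asymmetric about the observed proportion and cannot stray outside the range from 0 to 1, which matters when a proportion is close to either extreme.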

This blog is aimed at the **informed linguistic researcher**. I don’t assume a background in statistics, and there are papers and blog posts which try to explain inferential statistics from first principles. This type of explanation is frequently skipped over in textbooks (Wackerly *et al*. 2008 being an exception) and, in my experience, the result is often a major source of confusion and anxiety among researchers.

This concerns me, because all statistical tests (and related methods) are based on an *expected model of behaviour*: if the model is wrong, an evaluation against it will necessarily be meaningless! Experimental design is therefore critical – and inevitably a linguistic question. Hence the slogan:

*you can’t fix a weak experiment with a good statistic.*

The “trick” is to get the experimental design right in the first place.

Along the way we’ll need to discuss some fairly tricky concepts. I’ll try my best to express arguments in simple and straightforward language, but I will inevitably have to use some mathematical definitions, terms and formulae. Unfortunately, maths comes with the territory!

But if you can’t follow my explanations – and I assume that this will happen not infrequently – then please let me know. Do feel free to email me with queries and errata.

### Further reading

Three books come very highly recommended.

For an alternative, linguistics-oriented, introduction to some of these topics see Gries (2009b). If my explanations don’t convince you, perhaps Stefan’s will.

A thorough compendium of methods, but a more mathematically forbidding read, is Sheskin (2000). Again, I cannot recommend this book highly enough.

Finally, a standard reference work found on numerous mathematics student reading lists is Wackerly *et al*. (2008). Like this blog, this book differs from the first two by introducing inferential statistics through probability theory.

### Citation

corp.ling.stats includes a number of academic discussion papers, spreadsheets and PowerPoint presentations which I am opting to make freely available prior to publication. There is some five+ years of private work underpinning much of this content (hence initial e-publication dates are given for citation purposes until conventional publication is formalised). **All material is subject to copyright** and may be quoted on the condition it is cited appropriately.

Blog posts may be cited as

Wallis S.A. year. *Title*, URL (accessed date of reading).

where the year is given in the URL. For example

Wallis S.A. 2012. *Robust and Sound?*, https://corplingstats.wordpress.com/2012/04/04/robust-and-sound (accessed 10 April 2012).

Blog posts that contain **papers** include a recommended method of citation at the end (this method will change if its academic publication status changes). If you are in doubt about the appropriate citation of any material please email me or post a question.

### References

Butler, C.S. 1985. *Statistics in Linguistics*. Oxford: Blackwell. **»** ePublished.

Gries, S. Th. 2009a. *Quantitative Corpus Linguistics with R*. New York/London: Routledge.

Gries, S. Th. 2009b. *Statistics for Linguistics with R*. Berlin/New York: Mouton de Gruyter.

Oakes, M.P. 1998. *Statistics for Corpus Linguistics*. Edinburgh: EUP.

Sheskin, D.J. 2000. *Handbook of Parametric and Nonparametric Statistical Procedures*. 2nd Edition. Boca Raton, FL: Chapman & Hall/CRC Press.

Wackerly, D.D., Mendenhall, W. and Scheaffer, R.L. 2008. *Mathematical Statistics with Applications*. 7th Edition. Belmont, CA: Brooks/Cole.

### Postscript

If you found this blog because you are interested in experimental design and statistics but do not work with linguistic data, I hope you find some of the posts useful.

Corpus linguistics has three particular characteristics which shape this blog:

- Data is highly structured (sentences are marked up for part of speech, and in some cases parsed). Sampling is derived from contiguous text rather than a genuine random sample of items.
- Collecting new data from scratch (and recording, transcribing, structurally annotating, and grammatically annotating it) is often *extremely* expensive as a result, and corpus linguistics is inevitably *ex post facto* research: analysing and reanalysing a richly annotated database.
- Variables are mostly discrete alternatives (*a*, *b*, *c*) rather than being numerical. Some variables may be given a number (e.g. time, clause length), but they tend to be the exception. Hence there is a concentration on confidence intervals and contingency tests.

Nonetheless, much of what is written here concerns general principles of inferential statistics, and is equally applicable to other fields of scientific inquiry.
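To make the third characteristic above a little more concrete, the sketch below runs a contingency test on a 2 × 2 table of discrete choices. It assumes Pearson's chi-square without a continuity correction, which is only one of several possible tests and is not prescribed by this page. The function name `chi_square_2x2` and the counts are hypothetical.

```python
import math
from typing import Tuple


def chi_square_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square test (no continuity correction) for the table

        [[a, b],
         [c, d]]

    Returns (chi-square statistic, p-value with 1 degree of freedom).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be non-zero")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: option a versus option b in two subcorpora.
chi2, p_value = chi_square_2x2(30, 70, 50, 50)
print(f"chi-square = {chi2:.3f}, p = {p_value:.4f}")
```

The test only asks whether the two rows differ in their distribution; whether the rows and columns form a sensible set of alternatives is a matter of experimental design.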