Welcome!

corp.ling.stats is a research blog by Sean Wallis focusing on the intersection between corpus linguistics research and mathematical statistics and probability theory.

News

Out now… Statistics in Corpus Linguistics Research (Routledge)

I am very pleased to announce that my new book, Statistics in Corpus Linguistics Research, is now available from Routledge. Drawing on more than ten years of research, and containing a large quantity of material never published before, the book is written for corpus linguistics researchers of all kinds, from students of corpus linguistics wishing to apply statistical analysis for the first time,…

Designing experiments

Directional evidence revisited

End weight bias and templating in conjoined phrase postmodification Abstract Full Paper (PDF) The tendency of speakers and writers to place larger constructions at the end of sentences, whether consciously or unconsciously, is well established. This question of ‘end weight’ is usually discussed in relation to grammatical transformations. In this short paper we demonstrate…

Are embedding decisions independent?

Evidence from preposition(al) phrases Abstract Full Paper (PDF) One of the more difficult challenges in linguistics research concerns detecting how constraints might apply to the process of constructing phrases and clauses in natural language production. In previous work (Wallis 2019) we considered a number of operations modifying noun phrases, including sequential and embedded modification with…

The replication crisis: what does it mean for corpus linguistics?

Introduction Over the last year, the field of psychology has been rocked by a major public dispute about statistics. This concerns the failure of claims in papers, published in top psychological journals, to replicate. Replication is a big deal: if you publish a correlation between variable X and variable Y — that there is an…

What might a corpus of parsed spoken data tell us about language?

Abstract Paper (PDF) This paper summarises a methodological perspective towards corpus linguistics that is both unifying and critical. It emphasises that the processes involved in annotating corpora and carrying out research with corpora are fundamentally cyclic, i.e. involving both bottom-up and top-down processes. Knowledge is necessarily partial and refutable. This perspective unifies ‘corpus-driven’ and ‘theory-driven’…

Confidence intervals

Plotting entropy confidence interval distributions

Introduction One of the problems researchers face when reasoning with statistical uncertainty concerns our ability to mentally picture its distribution. As students we were shown the Normal distribution and led to believe that it is reasonable to assume that uncertainty about an observation is Normally distributed. Even when students are introduced to other distributions, such…

The confidence of entropy – and information

Introduction Two measures that are sometimes found in linguistic studies are information, defined as the negative log of the probability, and entropy. These are information-theoretic measures first defined by Claude Shannon (see e.g. Shannon and Weaver 1949). Entropy is also found in mutual information scores. This blog post is not intended to introduce information theory,…

Confidence intervals for the ratio of competing dependent proportions

Introduction How do we compute the confidence interval for the ratio of competing dependent proportions, where p1 and p2 are drawn from the same set of outcomes? We have discussed elsewhere on this blog how we might employ the Zou and Donner risk ratio method for independent proportions (Zou and Donner 2008). But what should…

Confidence intervals for type-token ratios

1. Introduction Type-token ratios (TTRs) are commonly used for assessing child language development. They are also occasionally used in other studies, for example to compare subcorpora or varieties of English more generally. A related concept is a hapax-token ratio (HTR), which we also discuss below. TTRs can be expressed as a simple proportion, p = f…

Contingency tests… latest posts…