The confidence of diversity

Introduction

Occasionally it is useful to cite measures in papers other than simple probabilities or differences in probability. When we do, we should estimate confidence intervals on these measures. There are a number of ways of estimating intervals, including bootstrapping and simulation, but these are computationally heavy.

For many measures it is possible to derive intervals from the Wilson score interval by employing a little mathematics. Elsewhere in this blog I discuss how to manipulate the Wilson score interval for simple transformations of p, such as 1/p, 1 – p, etc.

Below I am going to explain how to derive an interval for grammatical diversity, d, which we can define as the probability that two randomly-selected instances have different outcome classes.

Diversity is an effect size measure of a frequency distribution, i.e. a vector of k frequencies. If all frequencies are the same, the data is evenly spread, and the score will tend to a maximum. If all frequencies except one are zero, the chance of picking two different instances will of course be zero. Diversity is well-behaved except where categories have frequencies of 1. Continue reading

Advertisements

Why is statistics difficult?

Imagine you are somewhere on a road that you have never been on before. Picture it. It’s peaceful and calm. A car comes down the road. As it gets to a corner, the driver appears to lose control, and the car crashes into a wall. Fortunately the lone driver is OK but they can’t recall exactly what happened.

Let’s think about what you experienced. The car crash might involve a number of variables an investigator would be interested in.

How fast was the car going? Where were the brakes applied?

Look on the road. Get out a tape measure. How long was the skid before the car finally stopped?

How big and heavy was the car? How loud was the bang when the car crashed?

These are all physical variables. We are used to thinking about the world in terms of these kinds of variables: velocity, position, length, volume and mass. They are tangible: we can see and touch them, and we have physical equipment that helps us measure them. Continue reading

Point tests and multi-point tests for separability of homogeneity

Introduction

I have been recently reviewing and rewriting a paper for publication that I first wrote back in 2011. The paper (Wallis forthcoming) concerns the problem of how we test whether repeated runs of the same experiment obtain essentially the same results, i.e. results are not significantly different from each other.

These meta-tests can be used to test an experiment for replication: if you repeat an experiment and obtain significantly different results on the first repetition, then, with a 1% error level, you can say there is a 99% chance that the experiment is not replicable.

These tests have other applications. You might be wishing to compare your results with those of others in the literature, compare results with different operationalisation (definitions of variables), or just compare results obtained with different data – such as comparing a grammatical distribution observed in speech with that found within writing.

The design of tests for this purpose is addressed within the t-testing ANOVA community, where tests are applied to continuously-valued variables. The solution concerns a particular version of an ANOVA, called “the test for interaction in a factorial analysis of variance” (Sheskin 1997: 489).

However, anyone using data expressed as discrete alternatives (A, B, C etc) has a problem: the classical literature does not explain what you should do.

Gradient and point tests

gradient-point-test

Figure 1: Point tests (A) and gradient tests (B), from Wallis (forthcoming).

The rewrite of the paper caused me to distinguish between two types of tests: ‘point tests’, which I describe below, and ‘gradient tests’. Continue reading

Detecting direction in interaction evidence

IntroductionPaper (PDF)

I have previously argued (Wallis 2014) that interaction evidence is the most fruitful type of corpus linguistics evidence for grammatical research (and doubtless for many other areas of linguistics).

Frequency evidence, which we can write as p(x), the probability of x occurring, concerns itself simply with the overall distribution of a linguistic phenomenon x – such as whether informal written English has a higher proportion of interrogative clauses than formal written English. In order to calculate frequency evidence we must define x, i.e. decide how to identify interrogative clauses. We must also pick an appropriate baseline n for this evaluation, i.e. we need to decide whether to use words, clauses, or any other structure to identify locations where an interrogative clause may occur.

Interaction evidence is different. It is a statistical correlation between a decision that a writer or speaker makes at one part of a text, which we will label point A, and a decision at another part, point B. The idea is shown schematically in Figure 1. A and B are separate ‘decision points’ in a given relationship (e.g. lexical adjacency), which can be also considered as ‘variables’.

Figure 1: Associative inference from lexico-grammatical choice variable A to variable B (sketch).

Figure 1: Associative inference from lexico-grammatical choice variable A to variable B (sketch).

This class of evidence is used in a wide range of computational algorithms. These include collocation methods, part-of-speech taggers, and probabilistic parsers. Despite the promise of interaction evidence, the majority of corpus studies tend to consist of discussions of frequency differences and distributions.

In this paper I want to look at applications of interaction evidence which are made more-or-less at the same time by the same speaker/writer. In such circumstances we cannot be sure that just because B follows A in the text, the decision relating to B was made after the decision at A. Continue reading

UCL Summer School in English Corpus Linguistics 2017

I am pleased to announce the fifth annual Summer School in English Corpus Linguistics to be held at University College London from 5-7 July.

The Summer School is a short three-day intensive course aimed at PhD-level students and researchers who wish to get to grips with Corpus Linguistics. Numbers are deliberately limited on a first-come, first-served basis. You will be taught in a small group by a teaching team.

Each day begins with a theory lecture, followed by a guided hands-on workshop with corpora, and a more self-directed and supported practical session in the afternoon.

W9A8081
Continue reading

The replication crisis: what does it mean for corpus linguistics?

Introduction

Over the last year, the field of psychology has been rocked by a major public dispute about statistics. This concerns the failure of claims in papers, published in top psychological journals, to replicate.

Replication is a big deal: if you publish a correlation between variable X and variable Y – that there is an increase in the use of the progressive over time, say, and that increase is statistically significant, you expect that this finding would be replicated were the experiment repeated.

I would strongly recommend Andrew Gelman’s brief history of the developing crisis in psychology. It is not necessary to agree with everything he says (personally, I find little to disagree with, although his argument is challenging) to recognise that he describes a serious problem here.

There may be more than one reason why published studies have failed to obtain compatible results on repetition, and so it is worth sifting these out.

In this blog post, what I want to do is try to explore what this replication crisis is – is it one problem, or several? – and then turn to what solutions might be available and what the implications are for corpus linguistics. Continue reading

The variance of Binomial distributions

Introduction

Recently I’ve been working on a problem that besets researchers in corpus linguistics who work with samples which are not drawn randomly from the population but rather are taken from a series of sub-samples. These sub-samples (in our case, texts) may be randomly drawn, but we cannot say the same for any two cases drawn from the same sub-sample. It stands to reason that two cases taken from the same sub-sample are more likely to share a characteristic under study than two cases drawn entirely at random. I introduce the paper elsewhere on my blog.

In this post I want to focus on an interesting and non-trivial result I needed to address along the way. This concerns the concept of variance as it applies to a Binomial distribution.

Most students are familiar with the concept of variance as it applies to a Gaussian (Normal) distribution. A Normal distribution is a continuous symmetric ‘bell-curve’ distribution defined by two variables, the mean and the standard deviation (the square root of the variance). The mean specifies the position of the centre of the distribution and the standard deviation specifies the width of the distribution.

Common statistical methods on Binomial variables, from χ² tests to line fitting, employ a further step. They approximate the Binomial distribution to the Normal distribution. They say, although we know this variable is Binomially distributed, let us assume the distribution is approximately Normal. The variance of the Binomial distribution becomes the variance of the equivalent Normal distribution.

In this methodological tradition, the variance of the Binomial distribution loses its meaning with respect to the Binomial distribution itself. It seems to be only valuable insofar as it allows us to parameterise the equivalent Normal distribution.

What I want to argue is that in fact, the concept of the variance of a Binomial distribution is important in its own right, and we need to understand it with respect to the Binomial distribution, not the Normal distribution. Sometimes it is not necessary to approximate the Binomial to the Normal, and if we can avoid this approximation our results are likely to be stronger as a result.

Continue reading