# Change and certainty: plotting confidence intervals (2)

### Introduction

In a previous post I discussed how to plot confidence intervals on observed probabilities. Using this method we can create graphs like the following. (Data is in the Excel spreadsheet we used previously: for this post I have added a second worksheet.)

The graph depicts both the observed probability of a particular form and the certainty that this observation is accurate. The ‘I’-shaped error bars depict the estimated range of the true value of the observation at a 95% confidence level (see Wallis 2013 for more details).

A note of caution: these probabilities are semasiological proportions (different uses of the same word) rather than onomasiological choices (see Choice vs. use).

An example graph plot showing the changing proportions of meanings of the verb think over time in the US TIME Magazine Corpus, with Wilson score intervals, after Levin (2013). Many thanks to Magnus for the data!

In this post I discuss ways in which we can plot intervals on changes (differences) rather than single probabilities.

The clearer our visualisations, the better we can understand our own data, focus our explanations on significant results and communicate our results to others.

# Measures of association for contingency tables

### Introduction

Often when we carry out research we wish to measure the degree to which one variable affects the value of another, setting aside the question as to whether this impact is sufficiently large as to be considered significant (i.e., significantly different from zero).

The most general term for this type of measure is size of effect. Effect sizes allow us to make descriptive statements about samples. Traditionally, experimentalists have referred to ‘large’, ‘medium’ and ‘small’ effects, which is rather imprecise. Nonetheless, it is possible to employ statistically sound methods for comparing different sizes of effect by estimating a Gaussian confidence interval (Bishop, Fienberg and Holland 1975) or by comparing pairs of contingency tables employing a “difference of differences” calculation (Wallis 2011).

In this paper we consider effect size measures for contingency tables of any size, generally referred to as “r × c tables”. This effect size is the “measure of association” or “measure of correlation” between the two variables. There are more measures applying to 2 × 2 tables than to larger tables.

# Robust and sound?

When we carry out experiments and perform statistical tests we have two distinct aims.

1. To form statistically robust conclusions about empirical data.
2. To make logically sound arguments about experimental conclusions.

Robustness is essentially an inductive mathematical or statistical issue.

Soundness is a deductive question of experimental design and reporting.

Robust conclusions are those that are likely to be repeated if another researcher were to come along and perform the same experiment with different data sampled in much the same way. Sound arguments distinguish between what we can legitimately infer from our data, and the hypothesis we may wish to test.

# A statistics crib sheet

### Confidence intervals

Confidence intervals on an observed rate p should be computed using the Wilson score interval method. A confidence interval on an observation p represents the range that the true population value, P (which we cannot observe directly) may take, at a given level of confidence (e.g. 95%).

Note: Confidence intervals can be applied to onomasiological change (variation in choice) and semasiological change (variation in meaning), provided that P is free to vary from 0 to 1 (see Wallis 2012). Naturally, the interpretation of significant change in either case is different.

Methods for calculating intervals employ the Gaussian approximation to the Binomial distribution.

#### Confidence intervals on Expected (Population) values (P)

The Gaussian interval about P uses the mean and standard deviation as follows:

mean x̄ ≡ P = F/N,
standard deviation S ≡ √(P(1 – P)/N).

The Gaussian interval about P can be written as P ± E, where E = z.S, and z is the critical value of the standard Normal distribution at a given error level (e.g., 0.05). Although this is a bit of a mouthful, critical values of z are constant, so for any given level you can just substitute the constant for z. [z(0.05) = 1.95996 to six decimal places.]

In summary:

Gaussian interval ≡ P ± z√(P(1 – P)/N).
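As a quick sketch, this interval can be computed directly. The following is a minimal Python illustration (the function name is mine; the default z is the six-decimal value of z(0.05) quoted below):

```python
import math

def gaussian_interval(P, N, z=1.95996):
    """Gaussian interval about a population probability P:
    P ± z.S, where S = sqrt(P(1 - P)/N)."""
    S = math.sqrt(P * (1 - P) / N)   # standard deviation
    E = z * S                        # interval half-width
    return (P - E, P + E)

# e.g. P = 0.5, N = 100 gives an interval of roughly (0.402, 0.598)
```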

#### Confidence intervals on Observed (Sample) values (p)

We cannot use the same formula for confidence intervals about observations. Many people try to do this!

Most obviously, if p gets close to zero, the error e can exceed p, so the lower bound of the interval can fall below zero, which is clearly impossible! The problem is most apparent on smaller samples (larger intervals) and skewed values of p (close to 0 or 1).

Whereas the Gaussian is a reasonable approximation for an interval about an as-yet-unknown population probability P, it is incorrect for an interval around an observation p (Wallis 2013a). Yet the latter case is precisely where the Gaussian interval is most often used!

What is the correct method?
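As noted at the start of this crib sheet, the correct method is the Wilson score interval. A minimal Python sketch of the standard Wilson formula (the function name is mine):

```python
import math

def wilson_interval(p, n, z=1.95996):
    """Wilson score interval for an observed proportion p = f/n.
    Unlike the Gaussian interval about p, its bounds can never fall
    outside [0, 1], even for skewed p and small n."""
    denom = 1 + z**2 / n
    centre = p + z**2 / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return ((centre - spread) / denom, (centre + spread) / denom)
```

For example, with p = 0.05 and n = 20 the Gaussian lower bound about p falls below zero, whereas the Wilson lower bound stays (just) above it.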

# Goodness of fit measures for discrete categorical data

### Introduction

A goodness of fit χ² test evaluates the degree to which an observed discrete distribution over one dimension differs from another. A typical application of this test is to consider whether a specialisation of a set, i.e. a subset, differs in its distribution from a starting point (Wallis 2013). Like the chi-square test for homogeneity (2 × 2 or generalised row r × column c test), the null hypothesis is that the observed distribution matches the expected distribution. The expected distribution is proportional to a given prior distribution we will term D, and the observed O distribution is typically a subset of D.

A measure of association, or correlation, between two distributions is a score which measures the degree of difference between the two distributions. Significance tests might compare this size of effect with a confidence interval to determine that the result was unlikely to occur by chance.

Common measures of the size of effect for two-celled goodness of fit χ² tests include simple difference (swing) and proportional difference (‘percentage swing’). Simple swing can be defined as the difference in proportions:

d = O₁/D₁ – O₀/D₀.

For 2 × 1 tests, simple swings can be compared to test for significant change between test results. Provided that O is a subset of D then these are real fractions and d is constrained d ∈ [-1, 1]. However, for r × 1 tests, where r > 2, we need to obtain an aggregate score to estimate the size of effect. Moreover, simple swing cannot be used meaningfully where O is not a subset of D.
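For concreteness, simple and percentage swing can be computed as follows (an illustrative Python sketch; the function names are mine, and the example figures are the spoken shall data from Table 1 further below):

```python
def simple_swing(O0, D0, O1, D1):
    """Simple swing d = O1/D1 - O0/D0 (difference in proportions)."""
    return O1 / D1 - O0 / D0

def percentage_swing(O0, D0, O1, D1):
    """Percentage swing d%: the change relative to the starting proportion."""
    p0 = O0 / D0
    return (O1 / D1 - p0) / p0

# spoken shall data: 124 out of 625 in the 1960s, 46 out of 590 in the 1990s
# simple_swing(124, 625, 46, 590)      -> approx. -0.1204
# percentage_swing(124, 625, 46, 590)  -> approx. -0.6070, i.e. d% = -60.70%
```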

In this paper we consider a wide range of different potential methods to address this problem.

Correlation scores are a sample statistic. The fact that one is numerically larger than the other does not mean that the result is significantly greater. To determine this we need to either

1. estimate confidence intervals around each measure and employ a z test for two proportions from independent populations to compare these intervals, or
2. perform an r × 1 separability test for two independent populations (Wallis 2011) to compare the distributions of differences of differences.

In cases where both tests have one degree of freedom, these procedures obtain the same result. With r > 2 however, there will be more than one way to obtain the same score. The distributions can have a significantly different pattern even when scores are identical.
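Option 1 can be sketched numerically: given two scores sampled from independent populations, each with a Gaussian standard error, the difference between them is evaluated against the combined standard error. This is an illustrative sketch, not code from the paper:

```python
import math

def z_difference(m1, se1, m2, se2, z_crit=1.95996):
    """z test for the difference between two independently sampled
    scores m1 and m2 with Gaussian standard errors se1 and se2.
    Returns (z score, significant at the given critical value?)."""
    z = abs(m1 - m2) / math.sqrt(se1**2 + se2**2)
    return z, z > z_crit
```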

We apply these methods to a practical research problem: how to decide whether present perfect verb phrases correlate more closely with present-marked or past-marked verb phrases. We consider whether present perfect VPs are more likely to be found in present-oriented texts or past-oriented ones.

# Comparing χ² tests for separability

### Abstract

This paper describes a series of statistical meta-tests for comparing independent contingency tables for different types of significant difference. Recognising when an experiment obtains a significantly different result and when it does not is an issue frequently overlooked in research publication. Papers are frequently published citing ‘p values’ or test scores that suggest a ‘stronger effect’, as a substitute for sound statistical reasoning. This paper sets out a series of tests which together illustrate the correct approach to this question.

These meta-tests permit us to evaluate whether experiments have failed to replicate on new data; whether a particular data source or subcorpus obtains a significantly different result than another; or whether changing experimental parameters obtains a stronger effect.

The meta-tests are derived mathematically from the χ² test and the Wilson score interval, and consist of pairwise ‘point’ tests, ‘homogeneity’ tests and ‘goodness of fit’ tests. Meta-tests for comparing tests with one degree of freedom (e.g. ‘2 × 1’ and ‘2 × 2’ tests) are generalised to those of arbitrary size. Finally, we compare our approach with a competing approach offered by Zar (1999), which, while straightforward to calculate, turns out to be both less powerful and less robust.

### Introduction

Researchers often wish to compare the results of their experiments with those of others.

Alternatively they may wish to compare permutations of an experiment to see if a modification in the experimental design obtains a significantly different result. By doing so they would be able to investigate the empirical question of the effect of modifying an experimental design on reported results, as distinct from a deductive argument concerning the optimum design.

One of the reasons for carrying out such a test concerns the question of replication. Significance tests and confidence intervals rely on an a priori Binomial model predicting the likely distribution of future runs of the same experiment. However, there is a growing concern that allegedly significant results published in eminent psychology journals have failed to replicate (see, e.g. Gelman and Loken 2013). The reasons may be due to variation of the sample, or problems with the experimental design (such as unstated assumptions or baseline conditions that vary over experimental runs). The methods described here permit us to define a ‘failure to replicate’ as occurring when subsequent repetitions of the same experiment obtain statistically separable results on more occasions than predicted by the error level, ‘α’, used for the test.

Consider Table 1, taken from Aarts, Close and Wallis (2013). The two tables summarise a pair of 2 × 2 contingency tests for two different sets of British English corpus data for the modal alternation shall vs. will. The spoken data is drawn from the Diachronic Corpus of Present-day Spoken English, which contains matching data from the London-Lund Corpus and the British Component of the International Corpus of English (ICE-GB). The written data is drawn from the Lancaster-Oslo-Bergen (LOB) corpus and the matching Freiburg-Lancaster-Oslo-Bergen (FLOB) corpus.

Both 2 × 2 subtests are individually significant (χ² = 36.58 and 35.65 respectively), but the results (see the effect size measures φ and percentage difference d%) appear to be different.

How might we test if the tables are significantly different from each other?

| (spoken) | shall | will | Total | χ²(shall) | χ²(will) | summary |
|---|---:|---:|---:|---:|---:|---|
| LLC (1960s) | 124 | 501 | 625 | 15.28 | 2.49 | d% = -60.70% ±19.67% |
| ICE-GB (1990s) | 46 | 544 | 590 | 16.18 | 2.63 | φ = 0.17 |
| TOTAL | 170 | 1,045 | 1,215 | 31.46 | 5.12 | χ² = 36.58 s |

| (written) | shall+ | will+’ll | Total | χ²(shall+) | χ²(will+’ll) | summary |
|---|---:|---:|---:|---:|---:|---|
| LOB (1960s) | 355 | 2,798 | 3,153 | 15.58 | 1.57 | d% = -39.23% ±12.88% |
| FLOB (1990s) | 200 | 2,723 | 2,923 | 16.81 | 1.69 | φ = 0.08 |
| TOTAL | 555 | 5,521 | 6,076 | 32.40 | 3.26 | χ² = 35.65 s |

Table 1: A pair of 2 × 2 tables for shall/will alternation, after Aarts et al. (2013): upper: spoken, lower: written, with other differences in the experimental design. Note that the χ² values are almost identical but Cramér’s φ and percentage swing d% are different.

We can plot Table 1 as two independent pairs of probability observations, as in Figure 1. We calculate the proportion p = f/n in each case, and – in order to estimate the likely range of error introduced by the sampling procedure – compute Wilson score intervals at a 95% confidence level.

Figure 1: Example data in Table 1, plotted with 95% Wilson score intervals (Wallis 2013a).

The intervals in Figure 1 are shown by ‘I’-shaped error bars: were the experiment to be re-run multiple times, in 95% of predicted repeated runs, observations at each point would fall within the interval. Where Wilson intervals do not overlap at all (e.g. LLC vs. LOB, marked ‘A’) we can identify that the difference is significant without further testing; where they overlap such that one point falls within the other point’s interval, the difference is non-significant; otherwise a test must be applied.
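This three-way decision rule can be sketched as a small helper (illustrative Python; the function name and labels are mine):

```python
def interval_overlap_category(p1, iv1, p2, iv2):
    """Classify two observations p1, p2 with confidence intervals iv1, iv2:
    'significant'     - the intervals do not overlap at all;
    'non-significant' - one point falls inside the other's interval;
    'test needed'     - the intervals overlap, but neither point lies
                        inside the other's interval."""
    (l1, u1), (l2, u2) = iv1, iv2
    if u1 < l2 or u2 < l1:
        return 'significant'
    if l2 <= p1 <= u2 or l1 <= p2 <= u1:
        return 'non-significant'
    return 'test needed'
```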

In this paper we discuss two different analytical comparisons.

1. ‘Point tests’ compare pairs of observations (‘points’) across the dependent variable (e.g. shall/will) and tables t = {1, 2}. To do this we compare the two points and their confidence intervals. We carry out a 2 × 2 χ² test for homogeneity or a Newcombe-Wilson test (Wallis 2013a) to compare each point. We can compare the initial 1960s data (LLC vs. LOB, indicated) in the same way as we might compare spoken 1960s and 1990s data (e.g. LLC vs. ICE-GB).
2. ‘Gradient tests’ compare differences in ‘sizes of effect’ (e.g. a change in the ratio shall/will over time) between tables t. We might ask, is the gradient significantly steeper for the spoken data than for the written data?

Note that these tests evaluate different things and have different outcomes. If plot-lines are parallel, the gradient test will be non-significant, but the point test could still be significant at every pair of points. The two tests are complementary analytical tools.
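To illustrate the point test, here is a minimal Pearson χ² homogeneity computation (the implementation is mine, not from the paper), applied to the 1960s point comparison in Table 1 (LLC: 124 shall, 501 will; LOB: 355 shall+, 2,798 will+’ll):

```python
def chi2_homogeneity(table):
    """Pearson chi-square for an r x c contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# point test: LLC vs. LOB at the 1960s point
# chi2_homogeneity([[124, 501], [355, 2798]]) is approx. 34.69 > 3.841,
# so the spoken and written 1960s rates are significantly different
```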

#### 1.1 How not to compare test results

A common, but mistaken, approach to comparing experimental results involves simply citing the output of significance tests (Goldacre 2011). Researchers frequently make claims citing t, F or χ² scores, ‘p values’ (error levels), etc., as evidence for the strength of results. However, this fundamentally misinterprets the meaning of these measures, and comparisons between them are not legitimate.