## Further evaluation of Binomial confidence intervals

### Abstract

Wallis (2013) provides an account of an empirical evaluation of Binomial confidence intervals and contingency test formulae. The main take-home message of that article was that it is possible to evaluate statistical methods objectively and provide advice to researchers that is based on an objective computational assessment.

In this article we develop that evaluation further by re-weighting estimates of error using Binomial and Fisher weighting, which is equivalent to an ‘exhaustive Monte-Carlo simulation’. We also develop an argument concerning key attributes of difference intervals: we are not merely concerned with the case where differences are zero (conventionally equivalent to a significance test), but also with accurate estimation when differences may be non-zero (necessary for plotting data and comparing differences).

### 1. Introduction

All statistical procedures may be evaluated in terms of the rates of two distinct types of error.

• Type I errors (false positives): this is evidence of so-called ‘radical’ or ‘anti-conservative’ behaviour, i.e. rejecting null hypotheses which should not have been rejected, and
• Type II errors (false negatives): this is evidence of ‘conservative’ behaviour, i.e. retaining or failing to reject null hypotheses unnecessarily.

It is customary to treat these errors separately because the consequences of rejecting and retaining a null hypothesis are qualitatively distinct.
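As a concrete illustration of how such error rates can be assessed computationally, the sketch below uses a simple Monte Carlo simulation to estimate the Type I error rate of a test based on the Wilson score interval. This is an illustration of the general idea only, with hypothetical parameters — not the exhaustive Binomial and Fisher re-weighting described in the article:

```python
import math
import random

def wilson_interval(p_hat, n, z=1.959964):
    """Wilson score interval for an observed Binomial proportion p_hat = x / n."""
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    spread = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return centre - spread, centre + spread

def type_i_error_rate(p, n, trials=20000, seed=1):
    """Draw `trials` samples of size n under the null hypothesis P = p, and
    count how often the Wilson interval about the observed proportion
    excludes p, i.e. how often the test produces a false positive."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        x = sum(rng.random() < p for _ in range(n))
        lower, upper = wilson_interval(x / n, n)
        if p < lower or p > upper:
            rejections += 1
    return rejections / trials

# The estimated rate should sit near the nominal error level of 0.05.
print(type_i_error_rate(0.3, 50))
```

A radical (anti-conservative) test would return a rate well above the nominal 0.05; a conservative one, well below it.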

## Point tests and multi-point tests for separability of homogeneity

### Introduction

I have recently been reviewing and rewriting a paper for publication that I first wrote back in 2011. The paper (Wallis 2019) concerns the problem of how we test whether repeated runs of the same experiment obtain essentially the same results, i.e. that the results are not significantly different from each other.

These meta-tests can be used to test an experiment for replication: if you repeat an experiment and obtain significantly different results on the first repetition, then, at a 1% error level, you can say there is a 99% chance that the experiment is not replicable.

These tests have other applications. You might wish to compare your results with those of others in the literature, compare results obtained with a different operationalisation (different definitions of variables), or simply compare results obtained with different data – such as comparing a grammatical distribution observed in speech with one found in writing.

The design of tests for this purpose is addressed within the t-testing ANOVA community, where tests are applied to continuously-valued variables. The solution concerns a particular version of an ANOVA, called “the test for interaction in a factorial analysis of variance” (Sheskin 1997: 489).

However, anyone using data expressed as discrete alternatives (A, B, C etc) has a problem: the classical literature does not explain what you should do.

The rewrite of the paper caused me to distinguish between two types of tests: ‘point tests’, which I describe below, and ‘gradient tests’.
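For readers working with data expressed as discrete alternatives, the kind of ‘point test’ at issue can be sketched as a 2×2 chi-square test of homogeneity: take the same observation from two runs of an experiment and ask whether the two observed proportions could plausibly come from the same population. The figures below are hypothetical, and this is an illustration of the general idea rather than the specific procedure of Wallis (2019); it also assumes non-degenerate counts (pooled proportion strictly between 0 and 1):

```python
def chi_square_homogeneity(x1, n1, x2, n2):
    """2x2 chi-square test of homogeneity: given x1 successes out of n1
    in run 1 and x2 out of n2 in run 2, could both observed proportions
    plausibly be drawn from the same population proportion?"""
    pooled = (x1 + x2) / (n1 + n2)        # best estimate under the null
    chi2 = 0.0
    for x, n in ((x1, n1), (x2, n2)):
        for observed, expected in ((x, n * pooled), (n - x, n * (1 - pooled))):
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Compare with the critical value 3.841 (alpha = 0.05, 1 degree of freedom).
print(chi_square_homogeneity(30, 100, 50, 100))  # 8.33... -> significantly different
```

A result above the critical value indicates that the two runs obtained significantly different proportions for this observation.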

## Genre differences and experimental observations

### Spoken categories, modal verbs and change over time

In a recently-published paper, Bowie, Wallis and Aarts (2013) demonstrate that observations regarding changes in the frequency of modal verbs over time are highly sensitive to differences in genre (‘register’ or ‘text category’). Our paper, although based on spoken British English, may shed some light on a recent dispute between Leech (2011) and Millar (2009) regarding how linguists should interpret corpus observations regarding changes in the modal verb system in written US English.

The following table summarises statistically significant percentage decreases and increases of individual modal verbs as a proportion of the number of tensed verb phrases (VPs that could conceivably take a modal verb), within different spoken genre subcategories of the Diachronic Corpus of Present-day Spoken English (DCPSE). The statistical test used examines differences in observed probabilities between samples, i.e. a Newcombe-Wilson test.

For our purposes the cited percentages do not matter, but the direction of travel (the sign of each significant change) does.

|                | can  | may  | could | might | shall | will | should | would | must | All  |
|----------------|------|------|-------|-------|-------|------|--------|-------|------|------|
| formal f2f     | ns   | ns   | ns    | ns    | ns    | ns   | -60%   | ns    | -75% |      |
| informal f2f   | 27%  | -42% | ns    | 47%   | -32%  | ns   | ns     | ns    | -53% | ns   |
| telephone      | -37% | ns   | -44%  | ns    | -56%  | -30% | ns     | -44%  | ns   | -35% |
| b. discussions | -41% | -59% | ns    | ns    | -83%  | ns   | ns     | ns    | -54% | -20% |
| b. interviews  | ns   | -61% | ns    | -59%  | ns    | -41% | -55%   | -32%  | -57% | -35% |
| commentary     | ns   | ns   | ns    | ns    | -93%  | 58%  | ns     | ns    | -64% | ns   |
| parliament     | ns   | ns   | ns    | ns    | ns    | -39% | ns     | -30%  | ns   | -20% |
| legal x-exam   | 304% | ns   | ns    | ns    | ns    | ns   | 1,265% | 254%  | ns   | 157% |
| spontaneous    | ns   | ns   | ns    | ns    | ns    | ns   | ns     | ns    | ns   | ns   |
| prepared sp.   | ns   | -63% | ns    | ns    | ns    | 327% | ns     | -32%  | -48% | ns   |
| All genres     | ns   | -40% | -11%  | ns    | -48%  | 13%  | -14%   | -7%   | -54% | -6%  |

Significant changes (α<0.05) in the proportion of individual core modals out of tensed verb phrases from the 1960s (LLC) to 1990s (ICE-GB) components in DCPSE, adapted from Bowie et al. 2013.

This study concerns modal verbs within text categories. Against any general baseline (words, verb phrases or tensed verb phrases), overall modal use declines over the course of the period covered by the data (at least, noting the caveat, for spoken English data sampled comparably). Above, we employ tensed verb phrases as the most meaningful of the three baselines. See That vexed problem of choice.

• Note that if we take all genres together (bottom row in the table), every significant change except that for will is a decline in use, but in the (large) category of informal face-to-face conversation (second row from the top), can and might are both significantly increasing.
• Legal cross-examination is a predictable outlier, but broadcast interviews and discussions appear to generate very different results.
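The Newcombe-Wilson test used to obtain the table above combines two Wilson score intervals into an interval for the difference between two observed proportions. A minimal sketch follows; the figures at the end are hypothetical, not drawn from DCPSE:

```python
import math

def wilson(p, n, z=1.959964):
    """Wilson score interval for an observed proportion p = x / n."""
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    spread = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - spread, centre + spread

def newcombe_wilson(p1, n1, p2, n2, z=1.959964):
    """Interval for the difference d = p2 - p1, built from the inner and
    outer bounds of the two Wilson intervals (after Newcombe 1998).
    The difference is significant if the interval excludes zero."""
    l1, u1 = wilson(p1, n1, z)
    l2, u2 = wilson(p2, n2, z)
    d = p2 - p1
    lower = d - math.sqrt((p2 - l2) ** 2 + (u1 - p1) ** 2)
    upper = d + math.sqrt((u2 - p2) ** 2 + (p1 - l1) ** 2)
    return lower, upper

# Hypothetical figures: a modal appears in 20% of 200 tensed VPs in the
# earlier sample and 12% of 200 in the later one.
lower, upper = newcombe_wilson(0.2, 200, 0.12, 200)
print(lower, upper)  # interval wholly below zero -> a significant decline
```

An interval that lies wholly below (or above) zero corresponds to a significant decrease (or increase) of the kind reported in the table; an interval straddling zero corresponds to ‘ns’.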