Confidence intervals on pairwise φ statistics


Cramér’s φ is an effect size measure used for evaluating associations in contingency tables. In simple terms, a large φ score means that the two variables have a large effect on each other, and a small φ score means they have a small effect.

φ is closely related to χ², but it factors out the ‘weight of evidence’ and concentrates only on the slope. The simplest definition of φ is the unsigned formula

φ ≡ √(χ² / N(k – 1)),    (1)

where k = min(r, c), the minimum of the number of rows and columns. In a 2 × 2 table, unsigned φ is simply φ = √(χ² / N).
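As a quick illustration (not part of the original post), equation (1) can be computed directly from an observed table. The function name and interface below are my own; the expected frequencies are obtained from the row and column totals in the usual way:

```python
import numpy as np

def cramers_phi(table):
    """Cramér's φ for an r × c contingency table (equation 1):
    φ = √(χ² / N(k – 1)), where k = min(r, c)."""
    obs = np.asarray(table, dtype=float)
    n = obs.sum()
    # Expected cell frequencies under the independence hypothesis
    expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / n
    chi2 = ((obs - expected) ** 2 / expected).sum()
    k = min(obs.shape)
    return np.sqrt(chi2 / (n * (k - 1)))
```

For a perfectly diagonal table such as [[30, 0], [0, 30]] this returns 1, and for a flat (fully independent) table it returns 0, consistent with φ ∈ [0, 1].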

In Wallis (2012), I made a number of observations about φ.

  • It is probabilistic, φ ∈ [0, 1].
  • φ is the best estimate of the population interdependent probability, p(XY). It measures the linear interpolation from flat to identity matrix.
  • It is non-directional, so φ(X, Y) ≡ φ(Y, X).

Whereas in a larger table there are multiple degrees of freedom, and therefore many ways one might obtain the same φ score, in a 2 × 2 table φ may usefully be signed, in which case φ ∈ [-1, 1]. A signed φ obtains a different score for an increase and a decrease in proportion.

φ ≡ (ad – bc) / √((a + b)(c + d)(a + c)(b + d)),    (2)

where a, b, c and d are cell scores in sequence, i.e. [[a b][c d]]:

       x₁  x₂
  y₁   a   b
  y₂   c   d
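A minimal sketch of equation (2), with a function name of my own choosing; the cells a, b, c, d are taken row by row as in the table above:

```python
import math

def signed_phi(a, b, c, d):
    """Signed φ for a 2 × 2 table [[a, b], [c, d]] (equation 2):
    φ = (ad – bc) / √((a + b)(c + d)(a + c)(b + d))."""
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom
```

Swapping the rows (or the columns) of the table flips the sign but leaves the magnitude unchanged, which is the sense in which signed φ distinguishes an increase from a decrease in proportion.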


Point tests and multi-point tests for separability of homogeneity


I have recently been reviewing and rewriting a paper for publication that I first wrote back in 2011. The paper (Wallis 2018) concerns the problem of how we test whether repeated runs of the same experiment obtain essentially the same results, i.e. whether the results are not significantly different from each other.

These meta-tests can be used to test an experiment for replication: if you repeat an experiment and obtain significantly different results on the first repetition, then, with a 1% error level, you can say there is a 99% chance that the experiment is not replicable.

These tests have other applications. You might wish to compare your results with those of others in the literature, compare results obtained under different operationalisations (definitions of variables), or simply compare results obtained with different data – such as comparing a grammatical distribution observed in speech with that found in writing.

The design of tests for this purpose is addressed within the t-testing ANOVA community, where tests are applied to continuously-valued variables. The solution concerns a particular version of an ANOVA, called “the test for interaction in a factorial analysis of variance” (Sheskin 1997: 489).

However, anyone using data expressed as discrete alternatives (A, B, C etc) has a problem: the classical literature does not explain what you should do.

Gradient and point tests


Figure 1: Point tests (A) and gradient tests (B), from Wallis (forthcoming).

The rewrite of the paper caused me to distinguish between two types of tests: ‘point tests’, which I describe below, and ‘gradient tests’.