Goodness of fit measures for discrete categorical data

Introduction

A goodness of fit χ² test evaluates the degree to which an observed discrete distribution over one dimension differs from another. A typical application of this test is to consider whether a specialisation of a set, i.e. a subset, differs in its distribution from a starting point (Wallis 2013). Like the chi-square test for homogeneity (2 × 2 or generalised row r × column c test), the null hypothesis is that the observed distribution matches the expected distribution. The expected distribution is proportional to a given prior distribution we will term D, and the observed distribution O is typically a subset of D.
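As a worked statement of the test just described, the expected distribution and the test statistic can be written in the usual Pearson form. The symbol E and the indices i and j are notation assumed here for illustration; O and D are as defined above.

```latex
E_i = D_i \times \frac{\sum_j O_j}{\sum_j D_j},
\qquad
\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}
```

Scaling D by the ratio of totals makes E sum to the same total as O, so the statistic compares the shapes of the two distributions and not their sizes.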

A measure of association, or correlation, between two distributions is a score which measures the degree of difference between the two distributions. Significance tests might compare this size of effect with a confidence interval to determine that the result was unlikely to occur by chance.

Common measures of the size of effect for two-celled goodness of fit χ² tests include simple difference (swing) and proportional difference (‘percentage swing’). Simple swing can be defined as the difference in proportions:

d = O₁/D₁ − O₀/D₀.

For 2 × 1 tests, simple swings can be compared to test for significant change between test results. Provided that O is a subset of D, these are real fractions and d is constrained to the range [-1, 1]. However, for r × 1 tests, where r > 2, we need to obtain an aggregate score to estimate the size of effect. Moreover, simple swing cannot be used meaningfully where O is not a subset of D.
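A minimal Python sketch of the two-celled measures follows. The function names and the counts are hypothetical. Percentage swing is taken here, as an assumption, to be simple swing divided by the starting proportion O₀/D₀. The subset condition is enforced explicitly, since it is what keeps d within [-1, 1].

```python
def simple_swing(o0, d0, o1, d1):
    """Simple swing: the difference in proportions O1/D1 - O0/D0."""
    for o, d in ((o0, d0), (o1, d1)):
        # O must be a subset of D, so each proportion lies in [0, 1].
        if d <= 0 or not 0 <= o <= d:
            raise ValueError("need D > 0 and 0 <= O <= D in every cell")
    return o1 / d1 - o0 / d0


def percentage_swing(o0, d0, o1, d1):
    """Simple swing relative to the starting proportion O0/D0 (assumed definition)."""
    d = simple_swing(o0, d0, o1, d1)  # validates the counts first
    if o0 == 0:
        raise ValueError("percentage swing is undefined when O0 = 0")
    return d / (o0 / d0)


# Hypothetical counts: 30 of 200 cases in cell 0, 45 of 180 cases in cell 1.
print(f"d  = {simple_swing(30, 200, 45, 180):+.3f}")
print(f"d% = {percentage_swing(30, 200, 45, 180):+.1%}")
```

Raising an error when O is not a subset of D mirrors the restriction stated above, where the measure has no meaningful interpretation.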

In this paper we consider a wide range of different potential methods to address this problem.

Correlation scores are sample statistics. The fact that one score is numerically larger than another does not mean that it is significantly greater. To determine this we need to either:

  1. estimate confidence intervals around each measure and employ a z test for two proportions from independent populations to compare these intervals, or
  2. perform an r × 1 separability test for two independent populations (Wallis 2019) to compare the distributions of differences of differences.

In cases where both tests have one degree of freedom, these procedures obtain the same result. With r > 2 however, there will be more than one way to obtain the same score. The distributions can have a significantly different pattern even when scores are identical.
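The last point can be illustrated with a small sketch. The counts below are hypothetical and the helper name is an assumption. Two observed distributions are tested against the same prior: they obtain the same χ² score, yet their patterns are mirror images of each other.

```python
def goodness_of_fit_chi2(observed, prior):
    """Pearson chi-square of `observed` against expected values proportional to `prior`."""
    n_obs, n_prior = sum(observed), sum(prior)
    expected = [d * n_obs / n_prior for d in prior]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


prior = [100, 100, 100]     # hypothetical prior distribution D, r = 3
observed_a = [40, 30, 20]   # falls across the three categories
observed_b = [20, 30, 40]   # rises across the three categories

chi2_a = goodness_of_fit_chi2(observed_a, prior)
chi2_b = goodness_of_fit_chi2(observed_b, prior)
print(f"chi2_a = {chi2_a:.3f}, chi2_b = {chi2_b:.3f}")
```

A comparison of the two scores alone would report no difference, which is why a separability test on the distributions themselves is needed when r > 2.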

  • Update: In Confidence intervals on goodness of fit ϕ scores (2021), I discuss how to construct confidence intervals on selected methods. These can be used in the plotting and citation of sample scores, comparing a sample (observed) score with a putative value D > 0, and comparing two sample scores.

We apply these methods to a practical research problem: how to decide whether present perfect verb phrases correlate more closely with present- or past-marked verb phrases. We consider whether present perfect VPs are more likely to be found in present-oriented texts or past-oriented ones.

Excerpt

1.1 A simple example: correlating the present perfect

Bowie et al. (2013) discuss the present perfect construction. The present perfect expresses a particular relationship between present and past events and it is not a priori determined as to whether we would expect its use more commonly in texts which are more present- or past-referring. We may estimate the degree to which a text refers to the present by counting the frequency of present tensed verb phrases in it (and normalising as appropriate), ditto for the past.

present             | LLC    | ICE-GB | Total  | present perfect goodness of fit
present non-perfect | 33,131 | 32,114 | 65,245 | d% = -4.45 ± 5.13%
present perfect     | 2,696  | 2,488  | 5,184  | ϕ′ = 0.0227
TOTAL               | 35,827 | 34,602 | 70,429 | χ² = 2.68 ns
past                |        |        |        |
other TPM VPs       | 18,201 | 14,293 | 32,494 | d% = +14.92 ± 5.47%
present perfect     | 2,696  | 2,488  | 5,184  | ϕ′ = 0.0694
TOTAL               | 20,897 | 16,781 | 37,678 | χ² = 25.06 s

Comparing present perfect cases against tensed, present-marked VPs (upper) and tensed, past-marked VPs (lower) (after Bowie et al. 2013).

Bowie and Wallis limit their discussion to two 400,000-word text categories in the DCPSE corpus, divided by time, namely LLC (1960s) and ICE-GB (1990s) texts. The table above shows their analysis, employing percentage swing d% and Wallis ϕ′ (see section 3 in the paper). They found that the present perfect was more closely associated with present-tensed VPs. Note that in employing measures for this purpose, a higher value of χ², ϕ or d% implies a weaker correlation between the present perfect and the particular baseline being tested against it.
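The χ² and d% figures in the table can be checked with a short Python sketch. The counts are taken from the table; the function and variable names are hypothetical. The d% formula (swing divided by the LLC proportion) is an assumption, adopted because it agrees with the reported figures. The ± intervals and ϕ′ are not recomputed, as their definitions are given in the paper and not in this excerpt.

```python
def goodness_of_fit_chi2(observed, prior):
    """Pearson chi-square of `observed` against expected values proportional to `prior`."""
    n_obs, n_prior = sum(observed), sum(prior)
    expected = [d * n_obs / n_prior for d in prior]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def percentage_swing(observed, prior):
    """Change in the proportion O/D from cell 0 to cell 1, relative to cell 0 (assumed)."""
    p0 = observed[0] / prior[0]
    p1 = observed[1] / prior[1]
    return (p1 - p0) / p0


# Counts from the table, ordered (LLC, ICE-GB).
present_perfect = [2696, 2488]   # observed distribution O
present_total = [35827, 34602]   # prior D: all present-marked VPs (TOTAL row, upper)
past_total = [20897, 16781]      # prior D: all past-marked VPs (TOTAL row, lower)

chi2_present = goodness_of_fit_chi2(present_perfect, present_total)
chi2_past = goodness_of_fit_chi2(present_perfect, past_total)
swing_present = percentage_swing(present_perfect, present_total)
swing_past = percentage_swing(present_perfect, past_total)

print(f"present: chi2 = {chi2_present:.2f}, d% = {swing_present:+.2%}")
print(f"past:    chi2 = {chi2_past:.2f}, d% = {swing_past:+.2%}")
```

Under these assumptions the results agree with the table to two decimal places: χ² of 2.68 and 25.06, and d% of -4.45% and +14.92%. With one degree of freedom, only the second χ² exceeds the 3.84 critical value at the 0.05 level, matching the "ns" and "s" labels.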

However, with only two categories of text, this can only be a coarse-grained assessment. To test the hypothesis that the present perfect is more likely in texts with a greater preponderance of present-referring VPs than past-referring ones, we need to find a way to extend our evaluation to units smaller than 0.4M-word subcorpora, ideally to the level of individual texts.

Before we do this, it seems sensible to consider a middle position. DCPSE is subdivided sociolinguistically into different text genres of different sizes. The figure below plots the observed distribution O and the distributions for the present-referring and past-referring VPs scaled by O, across these 10 text categories.

The distribution of the present perfect O, scaled distributions E for present and past, across text categories of DCPSE.

‘Eyeballing’ this data seems to suggest a close congruence between the distribution of the present perfect and the present in some categories (e.g. broadcast discussions, spontaneous commentary) and a closer relationship with the past in others (prepared speech). It appears intuitively that there is a closer relationship between present perfect and the present, but how might this be measured?

Any measure of correlation between pairs of distributions needs to scale appropriately to permit populous categories, such as informal face-to-face conversation, and less populous ones, such as legal cross-examination, to add evidence to the metric appropriately.

Contents

  1. Introduction
  2. Reduced χ²
  3. Cramér’s ϕ
  4. Normalised ϕ′
  5. Probabilistically-weighted ϕp
  6. Variance-weighted ϕ measures
  7. Bayesian mean dependent probability
  8. Generalising R²
  9. Numerical evaluation of extrema
  10. Correlating the present perfect

Citation

Wallis, S.A. 2012. Goodness of fit measures for discrete categorical data. London: Survey of English Usage, UCL. https://www.ucl.ac.uk/english-usage/statspapers/gofmeasures.pdf

References

Bowie, J., S.A. Wallis and B. Aarts 2013. The perfect in spoken British English. In Aarts, B., J. Close, G. Leech and S.A. Wallis (eds.). The Verb Phrase in English: Investigating recent language change with corpora. Cambridge: CUP.

Wallis, S.A. 2013. z-squared: the origin and application of χ². Journal of Quantitative Linguistics 20:4, 350-378.

Wallis, S.A. 2019. Comparing χ² tables for separability of distribution and effect. Journal of Quantitative Linguistics 26:4, 330-355.
