Capturing patterns of linguistic interaction

Abstract

Numerous competing grammatical frameworks exist: on paper, as algorithms, and embodied in parsed corpora. However, not only is there little agreement about grammars among linguists, but there is no agreed methodology for demonstrating the benefits of one grammar over another. Consequently, the status of parsed corpora or ‘treebanks’ is suspect.

The most common approach to empirically comparing frameworks is based on the reliable retrieval of individual linguistic events from an annotated corpus. However, this method risks circularity, permits redundant terms to be added as a ‘solution’, and fails to reflect the broader structural decisions embodied in the grammar. In this paper we introduce a new methodology based on the ability of a grammar to reliably capture patterns of linguistic interaction along grammatical axes. Retrieving such patterns of interaction does not rely on atomic retrieval alone, does not risk redundancy, and is no more circular than a conventional scientific reliance on auxiliary assumptions. It is also a valid experimental perspective in its own right.

We demonstrate our approach with a series of natural experiments. We find an interaction, captured by a phrase structure analysis, between attributive adjective phrases under a noun phrase with a noun head, such that the probability of adding successive adjective phrases falls. We note that a similar interaction (between adjectives preceding a noun) can also be found with a simple part-of-speech analysis alone. On the other hand, preverbal adverb phrases do not exhibit this interaction, a result anticipated in the literature, confirming our method.

Turning to cases of embedded postmodifying clauses, we find a similar fall in the additive probability of both successive clauses modifying the same NP and embedding clauses where the NP head is the most recent one. Sequential postmodification of the same head reveals a fall and then a rise in this additive probability. Reviewing cases, we argue that this result can only be explained as a natural phenomenon acting on language production which is expressed by the distribution of cases on an embedding axis, and that this is in fact empirical evidence for a grammatical structure embodying a series of speaker choices.

We conclude with a discussion of the implications of this methodology for a series of applications, including optimising and evaluating grammars, modelling case interaction, contrasting the grammar of multiple languages and language periods, and investigating the impact of psycholinguistic constraints on language production.

Choosing the right test

Introduction

One of the most common questions a new researcher has to deal with is the following:

what is the right statistical test for my purpose?

To answer this question we must distinguish between

  1. different experimental designs, and
  2. optimum methods for testing significance.

In corpus linguistics, many research questions involve choice. The speaker can say shall or will, choose to add a postmodifying clause to an NP or not, and so on. If we want to know what factors influence this choice, then those factors are termed independent variables (IVs) and the choice itself is the dependent variable (DV). These choices are mutually exclusive alternatives. Framing the research question in this way immediately helps us focus on the appropriate class of tests.
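
For instance, framing the shall/will alternation as a Binomial dependent variable, with (say) text period as the independent variable, leads naturally to a contingency test. Here is a minimal sketch in Python; the counts are hypothetical and the use of scipy is an assumption of convenience, not part of the original post.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: the independent variable (hypothetical earlier vs later subcorpus).
# Columns: the dependent variable, the speaker's choice of shall vs will.
table = np.array([[ 58, 295],   # earlier subcorpus
                  [ 21, 370]])  # later subcorpus

# A 2 x 2 contingency (chi-square) test of whether the choice
# varies with the independent variable.
chi2, p_value, dof, expected = chi2_contingency(table, correction=False)
print(f"chi-square = {chi2:.3f}, p = {p_value:.4f}, dof = {dof}")
```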

Robust and sound?

When we carry out experiments and perform statistical tests we have two distinct aims.

  1. To form statistically robust conclusions about empirical data.
  2. To make logically sound arguments about experimental conclusions.

Robustness is essentially an inductive mathematical or statistical issue.

Soundness is a deductive question of experimental design and reporting.

Robust conclusions are those that are likely to be repeated if another researcher were to come along and perform the same experiment with different data sampled in much the same way. Sound arguments distinguish between what we can legitimately infer from our data, and the hypothesis we may wish to test.

Some bêtes noires

There are a number of common issues in corpus linguistics papers:

  1. authors tend primarily to cite frequencies normalised per million or per thousand words (i.e. a per-word baseline or a multiple thereof),
  2. data are usually plotted without confidence intervals, so it is not possible to tell visually whether a perceived change might be statistically significant, and
  3. significance tests are often employed without a clear statement of what the test is evaluating.

Experimental design

The first issue may be unique to corpus linguistics, deriving from the field’s particular historical origins.

It concerns how the experimenter identifies counterfactual alternates or selects baselines. This is a question of experimental design.

In the beginning was the Word.

Linguists examining volumes of plain text data (later supported by computing and part-of-speech tagging) invariably concentrated on the idea of the word as the unit of language. Collocation and concordancing sat alongside lexicography as the principal tools of the trade. “Statistics” here primarily concerned probabilistic measures of association between neighbouring words in order to find common patterns. This activity is of course perfectly fine, and allowed researchers to make huge gains in our understanding of language.

But…

Without labouring the point (which I do elsewhere on this blog), the corollary of the statement that language is grammatical is that if, instead of describing the distribution of words, n-grams, etc., we wish to investigate how language is produced, then the word cannot be our primary focus.

A statistics crib sheet

Confidence intervals

Confidence intervals on an observed rate p should be computed using the Wilson score interval method. A confidence interval on an observation p represents the range that the true population value, P (which we cannot observe directly) may take, at a given level of confidence (e.g. 95%).

Note: Confidence intervals can be applied to onomasiological change (variation in choice) and semasiological change (variation in meaning), provided that P is free to vary from 0 to 1 (see Wallis 2012). Naturally, the interpretation of significant change in either case is different.

Methods for calculating intervals employ the Gaussian approximation to the Binomial distribution.

Confidence intervals on Expected (Population) values (P)

The Gaussian interval about P uses the mean and standard deviation as follows:

mean x̄ ≡ P = F/N,
standard deviation S ≡ √(P(1 – P)/N).

The Gaussian interval about P can be written as P ± E, where E = z·S, and z is the critical value of the standard Normal distribution at a given error level (e.g. 0.05). Although this is a bit of a mouthful, critical values of z are constant, so for any given error level you can simply substitute the constant for z. [z(0.05) = 1.95996 to five decimal places.]

In summary:

Gaussian interval ≡ P ± z√(P(1 – P)/N).
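
As a worked example, here is a minimal Python sketch of this interval; F and N are hypothetical, and scipy is used only to obtain the critical value z.

```python
from math import sqrt
from scipy.stats import norm

F, N = 40, 100                 # hypothetical frequency and sample size
P = F / N                      # expected (population) proportion
z = norm.ppf(1 - 0.05 / 2)     # critical value z(0.05) ≈ 1.95996

E = z * sqrt(P * (1 - P) / N)  # half-width of the Gaussian interval
print(f"Gaussian interval: ({P - E:.4f}, {P + E:.4f})")
```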

Confidence intervals on Observed (Sample) values (p)

We cannot use the same formula for confidence intervals about observations. Many people try to do this!

Most obviously, if p gets close to zero, the error term E can exceed p, so the lower bound of the interval can fall below zero, which is clearly impossible! The problem is most apparent with smaller samples (larger intervals) and skewed values of p (close to 0 or 1).

Although the Gaussian is a reasonable approximation for an as-yet-unknown population probability P, it is incorrect for an interval around an observation p (Wallis 2013a). Yet the latter case is precisely where the Gaussian interval is most often used!

What is the correct method?
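
The answer, as noted at the top of this sheet, is the Wilson score interval,

w⁻, w⁺ ≡ (p + z²/2N ± z√(p(1 – p)/N + z²/4N²)) / (1 + z²/N).

Here is a minimal Python sketch of this formula; the values of p and N are hypothetical.

```python
from math import sqrt
from scipy.stats import norm

def wilson_interval(p, N, alpha=0.05):
    """Wilson score interval for an observed proportion p from N trials."""
    z = norm.ppf(1 - alpha / 2)
    centre = p + z * z / (2 * N)
    spread = z * sqrt(p * (1 - p) / N + z * z / (4 * N * N))
    return (centre - spread) / (1 + z * z / N), (centre + spread) / (1 + z * z / N)

# Unlike the Gaussian interval, the lower bound cannot fall below zero,
# even for skewed p and small N.
print(wilson_interval(p=0.05, N=20))
```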

Comparing χ² tests for separability

Introduction

Researchers often wish to compare the results of their experiments with those of others. Alternatively they may wish to compare permutations of an experiment to see if a modification in the experimental design obtains a significantly different result.

This question concerns an empirical analysis of the effect of modifying an experimental design on reported results, rather than a deductive argument concerning the optimum design.

Many researchers attempt this type of evaluation by employing statements about their results (citing t, F or χ² values, error levels or “p values”, etc.) as benchmarks for the strength of their results, implying a comparison that is frequently misunderstood (Goldacre 2011).

Alternatively, descriptive statistics of effect size such as percentage difference, log odds ratios, or Cramér’s φ may be used for comparison. These measures adjust for the volume of data and measure the pattern of change observed. However, effect size comparisons are discussed in the literature in surprisingly crude terms, e.g. ‘strong’, ‘medium’ and ‘weak’ effects (cf. Sheskin 1997: 244). In this paper we explain how to evaluate differences in effect size statistically.
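
By way of illustration, for a 2 × 2 table Cramér’s φ can be obtained directly from the chi-square statistic as φ = √(χ²/N). A minimal Python sketch with hypothetical counts:

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[30, 70],
                  [55, 45]])    # hypothetical 2 x 2 contingency table

chi2, _, _, _ = chi2_contingency(table, correction=False)
N = table.sum()

phi = np.sqrt(chi2 / N)        # effect size, adjusted for the volume of data
print(f"Cramér's phi = {phi:.3f}")
```

Note how φ factors out N: two tables with the same pattern of proportions but different totals obtain the same φ, which is what makes such measures candidates for comparing experimental runs.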

In summary:

  • The fact that one chi-square value or error level exceeds another merely means that reported indicators differ. It does not mean that the results are statistically separable, i.e. that the results are significantly different from each other at a given likelihood of error.
  • However if we wish to claim a difference in experimental outcomes between experimental ‘runs’, this is precisely what we must establish.

In this paper we attempt to address how this question of separability may be evaluated.

z-squared: the origin and application of χ²

Abstract

A family of statistical tests termed contingency tests, of which χ² is the best-known example, is commonly employed in linguistics research. Contingency tests compare discrete distributions, that is, data divided into two or more alternative categories, such as alternative linguistic choices of a speaker or different experimental conditions. These tests are ubiquitous, and are part of every linguistics researcher’s arsenal.

However the mathematical underpinnings of these tests are rarely discussed in the literature in an approachable way, with the result that many researchers may apply tests inappropriately, fail to see the possibility of testing particular questions, or draw unsound conclusions. Contingency tests are also closely related to the construction of confidence intervals, which are highly useful and revealing methods for plotting the certainty of experimental observations.

This paper is organised in the following way. The foundations of the simplest type of χ² test, the 2 × 1 goodness of fit test, are introduced and related to the z test for a single observed proportion p and the Wilson score confidence interval about p. We then show how the 2 × 2 test for independence (homogeneity) is derived from two observations p₁ and p₂ and explain when each test should be used. We also briefly introduce the Newcombe-Wilson test, which ideally should be used in preference to the χ² test for observations drawn from two independent populations (such as two subcorpora). We then turn to tests for larger tables, generally termed “r × c” tests, which have multiple degrees of freedom and therefore may encompass multiple trends, and discuss strategies for their analysis. Finally, we turn briefly to the question of differentiating test results. We introduce the concept of effect size (also termed ‘measures of association’) and finally explain how we may perform statistical separability tests to distinguish between two sets of results.
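
As a taster, here is a minimal Python sketch of the Newcombe-Wilson interval for the difference between two observed proportions p₁ – p₂, constructed from the two Wilson score intervals. The counts are hypothetical and this is an illustrative reconstruction, not the paper’s own code.

```python
from math import sqrt
from scipy.stats import norm

def wilson(p, n, alpha=0.05):
    """Wilson score interval for an observed proportion p from n trials."""
    z = norm.ppf(1 - alpha / 2)
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - spread) / (1 + z * z / n), (centre + spread) / (1 + z * z / n)

def newcombe_wilson(p1, n1, p2, n2, alpha=0.05):
    """Interval for p1 - p2: if it excludes zero, the difference is significant."""
    l1, u1 = wilson(p1, n1, alpha)
    l2, u2 = wilson(p2, n2, alpha)
    d = p1 - p2
    return (d - sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2),
            d + sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2))

# Hypothetical observations drawn from two independent subcorpora.
print(newcombe_wilson(p1=0.45, n1=120, p2=0.30, n2=150))
```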

Introduction

Karl Pearson’s famous chi-square test is derived from another statistic, called the z statistic, based on the Normal distribution.

The simplest versions of χ² can be shown to be mathematically identical to equivalent z tests. The tests produce the same result in all circumstances. For all intents and purposes “chi-squared” could be called “z-squared”. The critical values of χ² for one degree of freedom are the square of the corresponding critical values of z.

  • The standard 2 × 2 χ² test is another way of calculating the z test for two independent proportions taken from the same population (Sheskin 1997: 226).
  • This test is based on an even simpler test. The 2 × 1 (or 1 × 2) “goodness of fit” (g.o.f.) χ² test is an implementation of one of the simplest tests in statistics, called the Binomial test, or population z test (Sheskin 1997: 118). This test compares a sample observation against a predicted value which is assumed to be Binomially distributed.
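
The goodness of fit identity, and the relationship between critical values noted above, are easy to verify numerically. A minimal Python sketch with hypothetical counts:

```python
from math import sqrt
from scipy.stats import chisquare, chi2, norm

n, o = 100, 62        # hypothetical sample size and observed successes
P = 0.5               # expected population proportion
p = o / n

# Single-proportion (population) z test.
z = (p - P) / sqrt(P * (1 - P) / n)

# 2 x 1 goodness of fit chi-square against the expected cell frequencies.
gof, _ = chisquare([o, n - o], f_exp=[n * P, n * (1 - P)])

print(f"z^2 = {z * z:.5f}, chi-square = {gof:.5f}")              # identical
print(f"z(0.05)^2 = {norm.ppf(0.975) ** 2:.4f}, "
      f"critical chi-square (1 d.f.) = {chi2.ppf(0.95, 1):.4f}")  # identical
```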

If this is the case, why might we need chi-square?