Point tests and multi-point tests for separability of homogeneity

Introduction

I have been recently reviewing and rewriting a paper for publication that I first wrote back in 2011. The paper (Wallis forthcoming) concerns the problem of how we test whether repeated runs of the same experiment obtain essentially the same results, i.e. results are not significantly different from each other.

These meta-tests can be used to test an experiment for replication: if you repeat an experiment and obtain significantly different results on the first repetition, then, with a 1% error level, you can say there is a 99% chance that the experiment is not replicable.

These tests have other applications. You might be wishing to compare your results with those of others in the literature, compare results with different operationalisation (definitions of variables), or just compare results obtained with different data – such as comparing a grammatical distribution observed in speech with that found within writing.

The design of tests for this purpose is addressed within the t-testing ANOVA community, where tests are applied to continuously-valued variables. The solution concerns a particular version of an ANOVA, called “the test for interaction in a factorial analysis of variance” (Sheskin 1997: 489).

However, anyone using data expressed as discrete alternatives (A, B, C etc) has a problem: the classical literature does not explain what you should do.

Gradient and point tests

gradient-point-test

Figure 1: Point tests (A) and gradient tests (B), from Wallis (forthcoming).

The rewrite of the paper caused me to distinguish between two types of tests: ‘point tests’, which I describe below, and ‘gradient tests’. Continue reading

Adapting variance for random-text sampling

Introduction Paper (PDF)

Conventional stochastic methods based on the Binomial distribution rely on a standard model of random sampling whereby freely-varying instances of a phenomenon under study can be said to be drawn randomly and independently from an infinite population of instances.

These methods include confidence intervals and contingency tests (including multinomial tests), whether computed by Fisher’s exact method or variants of log-likelihood, χ², or the Wilson score interval (Wallis 2013). These methods are also at the core of others. The Normal approximation to the Binomial allows us to compute a notion of the variance of the distribution, and is to be found in line fitting and other generalisations.

In many empirical disciplines, samples are rarely drawn “randomly” from the population in a literal sense. Medical research tends to sample available volunteers rather than names compulsorily called up from electoral or medical records. However, provided that researchers are aware that their random sample is limited by the sampling method, and draw conclusions accordingly, such limitations are generally considered acceptable. Obtaining consent is occasionally a problematic experimental bias; actually recruiting relevant individuals is a more common problem.

However, in a number of disciplines, including corpus linguistics, samples are not drawn randomly from a population of independent instances, but instead consist of randomly-obtained contiguous subsamples. In corpus linguistics, these subsamples are drawn from coherent passages or transcribed recordings, generically termed ‘texts’. In this sampling regime, whereas any pair of instances in independent subsamples satisfy the independent-sampling requirement, pairs of instances in the same subsample are likely to be co-dependent to some degree.

To take a corpus linguistics example, a pair of grammatical clauses in the same text passage are more likely to share characteristics than a pair of clauses in two entirely independent passages. Similarly, epidemiological research often involves “cluster-based sampling”, whereby each subsample cluster is drawn from a particular location, family nexus, etc. Again, it is more likely that neighbours or family members share a characteristic under study than random individuals.

If the random-sampling assumption is undermined, a number of questions arise.

  • Are statistical methods employing this random-sample assumption simply invalid on data of this type, or do they gracefully degrade?
  • Do we have to employ very different tests, as some researchers have suggested, or can existing tests be modified in some way?
  • Can we measure the degree to which instances drawn from the same subsample are interdependent? This would help us determine both the scale of the problem and arrive at a potential solution to take this interdependence into account.
  • Would revised methods only affect the degree of certainty of an observed score (variance, confidence intervals, etc.), or might they also affect the best estimate of the observation itself (proportions or probability scores)?

Continue reading

Comparing frequencies within a discrete distribution

Note:
This page explains how to compare observed frequencies f₁ and f₂ from the same distributionF = {f₁, f₂,…}. To compare observed frequencies f₁ and f₂ from different distributions, i.e. where F₁ = {f₁,…} and F₂ = {f₂,…}, you need to use a chi-square or Newcombe-Wilson test.

Introduction

In a recent study, my colleague Jill Bowie obtained a discrete frequency distribution by manually classifying cases in a small sample drawn from a large corpus.

Jill converted this distribution into a row of probabilities and calculated Wilson score intervals on each observation, to express the uncertainty associated with a small sample. She had one question, however:

How do we know whether the proportion of one quantity is significantly greater than another?

We might use a Newcombe-Wilson test (see Wallis 2013a), but this test assumes that we want to compare samples from independent sources. Jill’s data are drawn from the same sample, and all probabilities must sum to 1. Instead, the optimum test is a dependent-sample test.

Example

A discrete distribution looks something like this: F = {108, 65, 6, 2}. This is the frequency data for the middle column (circled) in the following chart.

This may be converted into a probability distribution P, representing the proportion of examples in each category, by simply dividing by the total: P = {0.60, 0.36, 0.03, 0.01}, which sums to 1.

We can plot these probabilities, with Wilson score intervals, as shown below.

tag1cmp

An example graph plot showing the changing proportions of meanings of the verb think over time in the US TIME Magazine Corpus, with Wilson score intervals, after Levin (2013). In this post we discuss the 1960s data (circled). The sum of each column probability is 1. Many thanks to Magnus for the data!

So how do we know if one proportion is significantly greater than another?

  • When comparing values diachronically (horizontally), data is drawn from independent samples. We may use the Newcombe-Wilson test, and employ the handy visual rule that if intervals do not overlap they must be significantly different.
  • However, probabilities drawn from the same sample (vertically) sum to 1 — which is not the case for independent samples! There are k−1 degrees of freedom, where k is the number of classes. It turns out that the relevant significance test we need to use is an extremely basic test, but it is rarely discussed in the literature.

Continue reading

Freedom to vary and significance tests

Introduction

Statistical tests based on the Binomial distribution (z, χ², log-likelihood and Newcombe-Wilson tests) assume that the item in question is free to vary at each point. This simply means that

  • If we find f items under investigation (what we elsewhere refer to as ‘Type A’ cases) out of N potential instances, the statistical model of inference assumes that it must be possible for f to be any number from 0 to N.
  • Probabilities, p = f / N, are expected to fall in the range [0, 1].

Note: this constraint is a mathematical one. All we are claiming is that the true proportion in the population could conceivably range from 0 to 1. This property is not limited to strict alternation with constant meaning (onomasiological, “envelope of variation” studies). In semasiological studies, where we evaluate alternative meanings of the same word, these tests can also be legitimate.

However, it is common in corpus linguistics to see evaluations carried out against a baseline containing terms that simply cannot plausibly be exchanged with the item under investigation. The most obvious example is statements of the following type: “linguistic Item x increases per million words between category 1 and 2”, with reference to a log-likelihood or χ² significance test to justify this claim. Rarely is this appropriate.

Some terminology: If Type A represents say, the use of modal shall, most words will not alternate with shall. For convenience, we will refer to cases that will alternate with Type A cases as Type B cases (e.g. modal will in certain contexts).

The remainder of cases (other words) are, for the purposes of our study, not evaluated. We will term these invariant cases Type C, because they cannot replace Type A or Type B.

In this post I will explain that not only does introducing such ‘Type C’ cases into an experimental design conflate opportunity and choice, but it also makes the statistical evaluation of variation more conservative. Not only may we mistake a change in opportunity as a change in the preference for the item, but we also weaken the power of statistical tests and tend to reject significant changes (in stats jargon, “Type II errors”).

This problem of experimental design far outweighs differences between methods for computing statistical tests. Continue reading

Testing tests

Introduction

Over the last few months I have been looking at computationally evaluating confidence intervals and significance tests. This process has helped me sharpen up the recommendations I can give to researchers. I have updated some online papers and blog posts as a result.

This analysis has exposed a difference, rarely commented upon, between the optimum test for contingency (“χ²-type”) tests when independent variable samples are drawn from the same population or independent populations.

For 2 × 2 tests it is recommended to use a different test (Newcombe-Wilson) when the IV is sociolinguistic (e.g. genre, time, different subcorpora) or otherwise divides samples by participants, than when the same participant may be sampled in either value (e.g. when the IV is a lexical-grammatical variable).

Meta-comment: In a way this is another benefit of a blog — unlike traditional publication, I can quickly correct any problems or improve papers as a result of my discoveries or those of colleagues. However it also means I need to draw the attention of my readership to any changes.

Confidence intervals and significance tests are closely related, for reasons discussed here. So if we can evaluate a formula for a confidence interval in some way, then we can also potentially evaluate the test. Continue reading

Choosing the right test

Introduction

One of the most common questions a new researcher has to deal with is the following:

what is the right statistical test for my purpose?

To answer this question we must distinguish between

  1. different experimental designs, and
  2. optimum methods for testing significance.

In corpus linguistics, many research questions involve choice. The speaker can say shall or will, choose to add a postmodifying clause to an NP or not, etc. If we want to know what factors influence this choice then these factors are termed independent variables (IVs) and the choice is  the dependent variable (DV). These choices are mutually exclusive alternatives. Framing the research question like this immediately helps us focus in on the appropriate class of tests.  Continue reading

Some bêtes noires

There are a number of common issues in corpus linguistics papers.

  1. an extremely common tendency for authors to primarily cite frequencies normalised per million or thousand words (i.e. a per word baseline or multiple thereof),
  2. data is usually plotted without confidence intervals, so it is not possible to spot visually whether a perceived change might be statistically significant, and
  3. significance tests are often employed without a clear statement of what the test is evaluating.

Experimental design

The first issue may be unique to corpus linguistics, deriving from its particular historical origins.

It concerns the experimenter attempting to identify counterfactual alternates or select baselines. This is an experimental design question.

In the beginning was the Word.

Linguists examining volumes of plain text data (later supported by computing and part-of-speech tagging) invariably concentrated on the idea of the word as the unit of language. Collocation and concordancing sat alongside lexicography as the principal tools of the trade. “Statistics” here primarily concerned probabilistic measures of association between neighbouring words in order to find common patterns. This activity is of course perfectly fine, and allowed researchers to make huge gains in our understanding of language.

But…

Without labouring the point (which I do elsewhere on this blog), the corollary of the statement that language is grammatical is that if, instead of describing the distribution of words, n-grams, etc, we wish to investigate how language is produced, the word cannot be our primary focus. Continue reading