Testing tests

Introduction

Over the last few months I have been looking at computationally evaluating confidence intervals and significance tests. This process has helped me sharpen up the recommendations I can give to researchers. I have updated some online papers and blog posts as a result.

This analysis has exposed a difference, rarely commented upon, between the optimum test for contingency (“χ²-type”) tests when independent variable samples are drawn from the same population or independent populations.

For 2 × 2 tests it is recommended to use a different test (Newcombe-Wilson) when the IV is sociolinguistic (e.g. genre, time, different subcorpora) or otherwise divides samples by participants, than when the same participant may be sampled in either value (e.g. when the IV is a lexical-grammatical variable).

Meta-comment: In a way this is another benefit of a blog — unlike traditional publication, I can quickly correct any problems or improve papers as a result of my discoveries or those of colleagues. However it also means I need to draw the attention of my readership to any changes.

Confidence intervals and significance tests are closely related, for reasons discussed here. So if we can evaluate a formula for a confidence interval in some way, then we can also potentially evaluate the test. Continue reading

Advertisements

Random sampling, corpora and case interaction

Introduction

One of the main unsolved statistical problems in corpus linguistics is the following.

Statistical methods assume that samples under study are taken from the population at random.

Text corpora are only partially random. Corpora consist of passages of running text, where words, phrases, clauses and speech acts are structured together to describe the passage.

The selection of text passages for inclusion in a corpus is potentially random. However cases within each text may not be independent.

This randomness requirement is foundationally important. It governs our ability to generalise from the sample to the population.

The corollary of random sampling is that cases are independent from each other.

I see this problem as being fundamental to corpus linguistics as a credible experimental practice (to the point that I forced myself to relearn statistics from first principles after some twenty years in order to address it). In this blog entry I’m going to try to outline the problem and what it means in practice.

The saving grace is that statistical generalisation is premised on a mathematical model. The problem is not all-or-nothing. This means that we can, with care, attempt to address it proportionately.

[Note: To actually solve the problem would require the integration of multiple sources of evidence into an a posteriori model of case interaction that computed marginal ‘independence probabilities’ for each case abstracted from the corpus. This is way beyond what any reasonable individual linguist could ever reasonably be expected to do unless an out-of-the-box solution is developed (I’m working on it, albeit slowly, so if you have ideas, don’t fail to contact me…).]

There are numerous sources of case interaction and clustering in texts, ranging from conscious repetition of topic words and themes, unconscious tendencies to reuse particular grammatical choices, and interaction along axes of, for example, embedding and co-ordination (Wallis 2012a), and structurally overlapping cases (Nelson et al 2002: 272).

In this blog post I first outline the problem and then discuss feasible good practice based on our current technology.  Continue reading

Choosing the right test

Introduction

One of the most common questions a new researcher has to deal with is the following:

what is the right statistical test for my purpose?

To answer this question we must distinguish between

  1. different experimental designs, and
  2. optimum methods for testing significance.

In corpus linguistics, many research questions involve choice. The speaker can say shall or will, choose to add a postmodifying clause to an NP or not, etc. If we want to know what factors influence this choice then these factors are termed independent variables (IVs) and the choice is  the dependent variable (DV). These choices are mutually exclusive alternatives. Framing the research question like this immediately helps us focus in on the appropriate class of tests.  Continue reading

Robust and sound?

When we carry out experiments and perform statistical tests we have two distinct aims.

  1. To form statistically robust conclusions about empirical data.
  2. To make logically sound arguments about experimental conclusions.

Robustness is essentially an inductive mathematical or statistical issue.

Soundness is a deductive question of experimental design and reporting.

Robust conclusions are those that are likely to be repeated if another researcher were to come along and perform the same experiment with different data sampled in much the same way. Sound arguments distinguish between what we can legitimately infer from our data, and the hypothesis we may wish to test.

Continue reading