The perspective that the study of linguistic data should be driven by studies of individual speaker choices has come under attack from a number of linguists.
The first set of objections has come from researchers who have traditionally focused on linguistic variation expressed in terms of rates per word, or per million words.
No such thing as free variation?
As Smith and Leech (2013) put it: “it is commonplace in linguistics that there is no such thing as free variation”; indeed, multiple differing constraints apply to each term. On the basis of this observation they propose an ‘ecological’ approach, although their paper does not clearly define it.
I comment on this argument (Wallis forthcoming) by distinguishing two possible interpretations of this statement:
- Multiplicity of lexical meaning: Words often have multiple meanings and associations, so exact semantic alternates represent an idealisation. This statement is clearly correct, so an alternation study needs to ensure that (as far as possible) terms of Type A can be replaced with Type B, and vice versa, without altering the meaning of the surrounding context. Evaluating cases for mutual substitution is therefore a key component of alternation studies.
- Multiplicity of pressure on choices: Every choice is subject to many different determining factors. As Glynn writes (in Arppe et al. 2010), “When a speaker chooses a concept, he or she chooses between a wide inventory of lexemes, each of which profiles different elements of that concept. Moreover, the speaker profiles that lexeme in combination with a wide range of grammatical forms, each also contributing to how the speaker wishes to depict the concept.” One can take from this a valid criticism of studies that focus on individual lexemes without considering that observed variation may be due to deeper conceptual choices.
However, there is another, mistaken, interpretation of this second objection, to which I respond in the same article.
In analysing natural language we expect that speakers have personal preferences, may adopt particular uses according to genre and register, may be affected by context and audience, and so on. We do not require that at every single choice point the exact same influences, biases and constraints apply in the mind of the speaker. Nor can we completely eliminate these constraints in each and every case.
However, we are not attempting to explain why, precisely, a speaker chose to perform a particular utterance at a given point. Rather, we are attempting to generalise across the entire set of such choices to identify statistically sound patterns, correlations and trends. The critical question for a researcher may be formulated differently:
Does one or more of these multiple constraints represent a systematic bias on the rate?
If the answer to this question is yes, then it implies that these constraints should be detectable by experiment – provided we have sufficient data and pose the question correctly.
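To make this concrete, here is a minimal sketch of such an experiment in Python. The counts, and the ‘spoken vs. written’ factor, are invented for illustration: we ask whether the rate of choosing alternant A over B differs systematically between two subcorpora, using a hand-computed Pearson chi-square on a 2 × 2 contingency table.

```python
# Sketch: does a constraint (here, an invented spoken/written factor)
# exert a systematic bias on the rate of choosing alternant A over B?

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for a 2x2 contingency table:
               A    B
    spoken     a    b
    written    c    d
    """
    n = a + b + c + d
    # expected cell counts under the null hypothesis of independence
    expected = [
        (a + b) * (a + c) / n, (a + b) * (b + d) / n,
        (c + d) * (a + c) / n, (c + d) * (b + d) / n,
    ]
    observed = [a, b, c, d]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# invented counts: A vs. B in a spoken and a written subcorpus
chi2 = chi_square_2x2(80, 20, 55, 45)
print(chi2 > 3.84)  # 3.84 = critical chi-square at p = 0.05, 1 d.f.
```

With these invented counts the statistic comfortably exceeds the 5% critical value, so we would conclude that the factor biases the rate, provided, as above, that the data are sufficient and the question is correctly posed.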
Should we study alternations at all?
Dylan Glynn claims that the study of binary alternations in linguistics is the result of “two methodological errors”:
- a theoretical inheritance from generative grammar and
- the methodological convenience of employing simple statistics (such as chi-square).
He argues that “the study of alternations has its place”, but that more complex statistical methods, such as multi-factorial analysis, are more appropriate. Although he is not explicit about his reasons in the article, he appears to be arguing for multi-causal explanations of speaker choices, as the earlier quote indicates.
I am extremely sympathetic to this argument, but I would point out that it is not an objection to an alternationist methodology per se. Indeed, as Gilquin points out in the same article, alternation studies, including binary alternation studies (NB: in corp.ling.stats we do not ignore multinomial analyses; see especially Wallis 2013), remain a valuable starting point for linguistic research.
Evaluating one alternation at a time is necessarily a reductionist method (a fact of the experimental scientific method), but this does not mean that we need to adopt reductionist explanations premised on a series of independent causes.
Here’s the rub. Fundamentally, multi-factorial analyses rely mathematically on the ability of instances in datasets to alternate: to put it another way, observations must be free to vary. Such analyses build on alternation studies. Moreover, they only analyse correlations, not causes.
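The dependence can be seen in even the smallest multi-factorial model. The sketch below fits a two-factor logistic regression by stochastic gradient descent to simulated choice data; the factors, effect sizes and data are all invented. The point is that every row the model sees is treated as a binary choice that was free to vary, so rows incapable of alternation must be excluded before fitting, and the model itself cannot do this for us.

```python
import math
import random

random.seed(1)

def simulate(n=400):
    """Invented choice data: each instance is (features, outcome), where
    features = (intercept, factor1?, factor2?) and outcome = chose A."""
    rows = []
    for _ in range(n):
        f1 = random.random() < 0.5
        f2 = random.random() < 0.5
        logit = -0.5 + 1.2 * f1 + 0.8 * f2   # invented 'true' effects
        p = 1 / (1 + math.exp(-logit))
        rows.append(((1.0, float(f1), float(f2)), float(random.random() < p)))
    return rows

def fit(rows, lr=0.1, epochs=200):
    """Minimal logistic regression via per-instance gradient steps."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for x, y in rows:
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1 / (1 + math.exp(-z))       # predicted P(chose A)
            for i in range(3):
                w[i] += lr * (y - p) * x[i]  # gradient of the log-likelihood
    return w

w = fit(simulate())
print([round(wi, 1) for wi in w])  # both factor coefficients come out positive
```

Note that the fitted coefficients only recover the biases we simulated; the model reports correlations between factors and outcomes, and says nothing about why any individual choice was made.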
Therefore, an ‘ecological’ (or holistic) argument is not an argument against reviewing instances and “cleaning” datasets (provided that this is done in a well-justified, documented manner) to ensure that alternations can take place at each point.
A multi-factorial analysis does not address the problem that different examples of the same term (e.g. the same lexeme) may in fact be free to vary with different alternates, or indeed, be incapable of alternation because no alternates exist. (Another way of putting this is that in fact these different categories are not really the same term at all.)
For example, the lexical item may can be partitioned into different types:
- nouns – the month May, etc.
- modal verbs
The modal alternation may then be partitioned by the verb context, e.g. interrogative vs. declarative position, first vs. second vs. third person contexts, etc. (See also A methodological progression.) To be clear, there is nothing wrong with exploring these multiple factors together using a multi-factorial approach, but this analysis is still premised on the possibility of alternation.
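A minimal sketch of this partitioning step, with invented instances and an assumed tagging scheme: before any counting, non-alternating cases (here, the noun May) are set aside, and the remaining modal cases are subdivided by grammatical context.

```python
from collections import Counter

# invented instances of the form 'may'; the tag labels are assumptions
instances = [
    {"form": "May", "pos": "NOUN"},   # the month: no modal alternation possible
    {"form": "may", "pos": "MODAL", "clause": "declarative", "person": 3},
    {"form": "may", "pos": "MODAL", "clause": "interrogative", "person": 2},
]

# keep only cases capable of modal alternation (e.g. with 'might' or 'can')
modals = [i for i in instances if i["pos"] == "MODAL"]

# subdivide the alternating set by grammatical context
contexts = Counter((i["clause"], i["person"]) for i in modals)
print(len(modals), dict(contexts))
```

Only the two modal instances enter the alternation study; the noun is excluded by linguistic analysis, not by any statistical procedure.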
My argument is that this issue cannot be addressed by statistical inference, but only by refining the experimental design. Discretely categorising samples into alternating sets requires a linguistic analysis that a statistical model cannot infer or ‘fix’.
Descriptive statistical methods, such as collocational models, might help us identify these different valences as part of this experimental redesign process. But it takes a linguistic argument to explain these collocates and justify the redesign.
The idea that methodological effort is best expended on more sophisticated algorithms rather than on refining the experimental design is not new. It is familiar to those of us with a background in Artificial Intelligence (AI) research. AI has been dominated for fifty years by a dichotomy between processing and knowledge: should we place our research emphasis on building more sophisticated models representing the world, or on developing better algorithms that can infer the necessary models from data?
Within Machine Learning (ML), the AI discipline that borrows most from inferential statistics, the tendency has been to rely on algorithmic improvement rather than representational sophistication. However, this perspective is by no means universal. Knowledge Discovery departs from ML by investing effort in the meaningful representation of data for analysis, just as scientific research invests effort in experimental design for meaningful statistics.
See also

- An unnatural probability?
- A methodological progression
- Choice vs. use
- Freedom to vary and significance tests
- Inferential statistics – and other animals
References

Arppe, A., G. Gilquin, D. Glynn, M. Hilpert and A. Zeschel. 2010. Cognitive Corpus Linguistics: five points of debate on current theory and methodology. Corpora 5:1, 1-27.
Smith, N. and G. Leech, 2013. Verb structures in twentieth century British English. Chapter 4 in Aarts, B., J. Close, G. Leech and S.A. Wallis (eds.) 2013. The Verb Phrase in English: Investigating recent language change with corpora. Cambridge: CUP.
Wallis, S.A. 2013. z-squared: the origin and application of χ². Journal of Quantitative Linguistics 20:4, 350-378. » Post
Wallis, S.A. forthcoming. That vexed problem of choice. London: Survey of English Usage, UCL. » Post