Over the last year, the field of psychology has been rocked by a major public dispute about statistics. This concerns the failure of claims in papers, published in top psychological journals, to replicate.

Replication is a big deal: if you publish a correlation between variable *X* and variable *Y* (say, that the use of the progressive increases over time, and that this increase is statistically significant), you expect that this finding would be replicated were the experiment repeated.

I would strongly recommend Andrew Gelman’s brief history of the developing crisis in psychology. It is not necessary to agree with everything he says (personally, I find little to disagree with, although his argument is challenging) to recognise that he describes a serious problem here.

There may be more than one reason why published studies have failed to obtain compatible results on repetition, and so it is worth sifting these out.

In this blog post, what I want to do is try to explore what this replication crisis is – is it one problem, or several? – and then turn to what solutions might be available and what the implications are for corpus linguistics.

The debate between Neil Millar and Geoff Leech regarding the alleged increase (Millar 2009) and decline (Leech 2011) of the modal auxiliary verbs is an example of this problem.

Millar based his conclusions on the TIME corpus, discovering that the rate of modal verbs per million words tended to increase over time. Leech, using the Brown series of US English corpora, discovered the opposite. Both applied statistical methods to their data but obtained very different conclusions.

Inferential statistics operates by predicting the result of repeated runs of the same experiment, i.e. on samples of data drawn from the same population.

Stating that something “significantly increases over time” can be reformulated as:

- subject to caveats of **random sampling** (the sample is, or approximates to, a random sample of utterances drawn from the same population), and **Binomial variables** (observations are free to vary from 0 to 1),
- we can calculate a **confidence interval** at a given error rate (say 1 in 20 times for a 5% error rate / 95% interval) on the difference in two observations of variable *X* taken at two time points 1 and 2, *x*₂ – *x*₁,
- **all points** within this interval (including the lower bound) are greater than 0, and
- **on repeated runs of the same experiment we can expect to see an observation fall outside of the confidence interval of the difference at the predicted rate** (here, 1 time in 20).

**Note:** For the purposes of this blog post, I am focusing on the last bullet point – when we say that something “fails to replicate”, we mean that on a repetition the result falls outside the confidence interval of the difference *on the very next occasion!*
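As a rough illustration of the reformulation above, the following sketch computes an interval on the difference between two observed proportions and checks whether every point in it lies above zero. It assumes a Newcombe-Wilson interval built from Wilson score intervals; the text above does not name a particular interval, and the function names and figures are purely illustrative.

```python
from math import sqrt
from statistics import NormalDist


def wilson(p, n, alpha=0.05):
    """Wilson score interval (lower, upper) for a proportion p observed in n cases."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def difference_interval(p1, n1, p2, n2, alpha=0.05):
    """Newcombe-Wilson interval on the difference d = p2 - p1 (an assumed choice)."""
    l1, u1 = wilson(p1, n1, alpha)
    l2, u2 = wilson(p2, n2, alpha)
    d = p2 - p1
    lower = d - sqrt((p2 - l2) ** 2 + (u1 - p1) ** 2)
    upper = d + sqrt((u2 - p2) ** 2 + (p1 - l1) ** 2)
    return lower, upper


def significantly_increases(p1, n1, p2, n2, alpha=0.05):
    """True if all points in the interval, including the lower bound, exceed 0."""
    lower, _ = difference_interval(p1, n1, p2, n2, alpha)
    return lower > 0
```

With made-up observations of 0.10 and 0.15, each out of 1,000 cases, the lower bound stays above zero; with 0.10 and 0.11 it does not, so the second contrast would not be reported as a significant increase.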

Leech obtained a different result from Millar on the first attempted repetition of this experiment. This could be a fluke, but it seems to be a failure to replicate. There should only be a 1 in 20 chance of this happening.

Observing such a replication failure should lead us to ask some searching questions about these two studies, many of which are discussed elsewhere in this blog.

Much of the controversy can be summed up by the bottom row in this table, drawn from Millar (2009). This appears to show a 23% increase in modal use between the 1920s and 2000s. With a lot of data and a sizeable effect, this increase seems bound to be significant.

| | 1920s | 1930s | 1940s | 1950s | 1960s | 1970s | 1980s | 1990s | 2000s | % diff 1920s-2000s |
|---|---|---|---|---|---|---|---|---|---|---|
| will | 2,194.63 | 1,681.76 | 1,856.40 | 1,988.37 | 1,965.76 | 2,135.73 | 2,057.43 | 2,273.23 | 2,362.52 | +7.7% |
| would | 1,690.70 | 1,665.01 | 2,095.76 | 1,669.18 | 1,513.30 | 1,828.92 | 1,758.44 | 1,797.03 | 1,693.19 | +0.1% |
| can | 832.91 | 742.30 | 955.73 | 1,093.39 | 1,233.13 | 1,305.82 | 1,231.99 | 1,475.95 | 1,777.07 | +113.4% |
| could | 661.33 | 822.72 | 1,188.24 | 998.83 | 950.73 | 1,106.25 | 1,156.61 | 1,378.39 | 1,342.56 | +103.0% |
| may | 583.59 | 515.12 | 496.93 | 502.74 | 628.13 | 743.66 | 775.92 | 937.08 | 931.91 | +59.7% |
| should | 577.46 | 450.07 | 454.87 | 495.26 | 441.96 | 475.50 | 453.33 | 521.46 | 593.27 | +2.7% |
| must | 485.31 | 418.03 | 456.57 | 417.62 | 401.36 | 390.47 | 347.02 | 306.69 | 250.59 | -48.4% |
| might | 374.52 | 375.40 | 500.33 | 408.90 | 399.80 | 458.99 | 416.81 | 474.23 | 433.34 | +15.7% |
| shall | 212.19 | 120.79 | 96.42 | 70.52 | 50.48 | 35.65 | 25.93 | 16.09 | 9.26 | -95.6% |
| ought | 50.22 | 37.94 | 39.31 | 40.34 | 36.91 | 34.29 | 28.27 | 34.90 | 27.65 | -44.9% |
| Total | 7,662.86 | 6,829.14 | 8,140.56 | 7,685.15 | 7,621.56 | 8,515.28 | 8,251.75 | 9,215.05 | 9,421.36 | +22.9% |

In attempting to identify why Leech and Millar obtain different results, the following questions should be considered.

- **Are the two samples drawn from the same population, or are they drawn from two distinct populations?** To put it another way, are there characteristics of the TIME data that make it distinct from the general written data in the Brown corpora? For example, does TIME have a ‘house style’, with subeditors enforcing it, which has led to a greater frequency of modal use? Has TIME tended to curate more stories with more modal hedges than the overall trend? Jill Bowie (Bowie *et al*. 2013) reported that genre subdivisions within the spoken DCPSE corpus often exposed different modal trends.
- **Does Millar’s data support a general observation of increased modal use?** Bowie observes that Millar’s aggregate data fluctuates over the entire time period (see Table, bottom row), and some changes in sub-periods appear to be consistent with the trend reported by Leech in an earlier study in 2003. According to this observation, simply expressing the trend as an increase in modal verb use seems misleading.
- **Is it legitimate to aggregate all modals together?** In one sense, modals are a well-defined category of verb: a closed category, especially if one excludes the semi-modals. So “modal use” is a legitimate variable. But we can also see that different modal verbs are undergoing different patterns of change over time (see Table). Millar reports that *shall* and *must* are in decline in his data while *will* and *can* are increasing. Whereas *shall* and *will* may be alternates in some contexts, this does not mean that bundling all modal trends together is particularly meaningful. Moreover, since the synchronic distribution of modals (like most linguistic variables) is sensitive to genre, this issue also interacts with my first bullet point, i.e. the fact that there are known differences between corpora.
- **How reliable is a per-million-word measure?** What does the data look like if we use a different baseline, for example, modal use per tensed verb phrase (or tensed main verb)? Doing this allows us to factor out variation in ‘tensed VP density’ (i.e. the variation in potential sites for modals to be deployed) between texts. Failure to do this (and neither Leech nor Millar does) means that we are not measuring when writers **choose** to use modal verbs, but the rate at which we, the reader, are **exposed** to them. See *That vexed problem of choice*.

  If VP density in text samples changes over time in either corpus, this may explain these different results – not as a result of increasing or declining modal use but as a result of increasing or declining tensed VP density (or declining / increasing density of other constituents). More generally, word-based baselines almost always conflate opportunity and use because the option to insert the element is not available following every other word (exceptions might include pauses or expletives, but these exceptions prove the rule). This conflation undermines the Binomial model and increases the risk that results will not replicate. The solution is to focus on identifying each choice-point as much as possible.
- **Does per word (per-million-word) data conform to the Binomial statistical model?** Since the entire corpus cannot consist of modal verbs, observations of *X* can never approach 1, so the answer has to be no. However, the effect of this inappropriate model is that the test becomes less sensitive, i.e. it tends to underreport otherwise significant results (see *Freedom to vary and statistical tests*). This may be a problem, but logically, it cannot be an explanation for obtaining two different ‘significant’ results in opposite directions.
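The baseline question can be made concrete with a small worked example. The counts below are entirely hypothetical (they are not taken from TIME or the Brown corpora) and the function names are illustrative; the point is only that a per-million-word rate can rise while the rate per opportunity stays flat.

```python
def per_million_words(hits, words):
    """Exposure rate: how often a reader meets the item per million words."""
    return hits / words * 1_000_000


def per_opportunity(hits, opportunities):
    """Choice rate: how often the item is used where it could have been used."""
    return hits / opportunities


# Hypothetical samples of equal size; only tensed VP density differs.
early = {"modals": 750, "tensed_vps": 5_000, "words": 100_000}
late = {"modals": 900, "tensed_vps": 6_000, "words": 100_000}

pmw_early = per_million_words(early["modals"], early["words"])
pmw_late = per_million_words(late["modals"], late["words"])
choice_early = per_opportunity(early["modals"], early["tensed_vps"])
choice_late = per_opportunity(late["modals"], late["tensed_vps"])

print(f"per million words: {pmw_early:.0f} -> {pmw_late:.0f}")
print(f"per tensed VP:     {choice_early:.3f} -> {choice_late:.3f}")
```

In this invented case the per-million-word rate grows by a fifth purely because the later sample contains more tensed verb phrases, while the proportion of tensed VPs containing a modal is unchanged.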

All of the above are reasons to be unsurprised at the fact that Millar’s summary finding was not replicated in Leech’s data. But to be fair, many of Millar’s individual trends did appear to be consistent with results found in the Brown corpus.

The replication crisis has been most discussed in psychology and the social sciences. In psychology, some published findings have been controversial to say the least. Claims that ‘Engineers have more sons; nurses have more daughters’ have tended to attract the interest of other psychologists relatively quickly. But this is shooting fish in a barrel.

In psychology, it is common to perform studies with small numbers of participants – 10 per experimental condition is usually cited as a minimum, which means that between 20 and 40 participants becomes the norm. Many kinds of failure to replicate are due to what statisticians tend to call ‘basic errors’, such as using an inappropriate statistical test. I discuss this elsewhere in this blog.

In this blog I have tended to argue for applying the simplest possible experimental designs (2 × 2 contingency tests, for example) over multivariate regression algorithms which may work, but are treated as ‘black boxes’ by almost all who use them. Such algorithms may ‘over fit’ data, i.e. they match the data more closely than is mathematically justified.

I argue that if you don’t understand how your results were derived, you are taking them on faith.

This does not mean that I think multi-variable methods cannot be theoretically superior to, or potentially more powerful than, simple tests. Rather, my objection is that before we use them we need to be sure that we understand what they are doing with our data. We have to ask ourselves constantly, *what do our results mean?*

However, the replication problem does not go away entirely once we have dealt with these so-called basic errors.

Andrew Gelman and Eric Loken (2013) raise a more fundamental problem that, if valid, is particularly problematic for corpus linguists. This concerns a question that goes to the heart of the post-hoc analysis of data, and the fundamental philosophy of statistical claims and the scientific method.

Essentially their argument goes like this.

- All data contains random noise, and thus every variable in a dataset (extracted from a corpus) will contain random noise. Researchers tend to assume that by employing a significance test we ‘control’ for this noise. But this is a mischaracterisation. In fact, even faced with a dataset consisting of pure noise, we would detect a ‘significant’ result 1 in 20 times (at a 0.05 threshold).
- Any data set may contain multiple variables, there are multiple potential definitions of these variables, and there are multiple analyses we could perform on the data. In a corpus we could modify definitions of variables, perform new queries, change baselines, etc., to perform new analyses.
- It follows that there is a very large number of potential hypotheses we could test against the data. (Note: this is not an argument against choosing a better baseline on theoretical grounds!)
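The first point in the list above can be checked by simulation. The sketch below repeatedly draws two samples from the same Binomial population, so there is no real difference to find, and counts how often a two-proportion *z* test (an assumed choice of test, as are the sample size and rate) reports a ‘significant’ difference at the 0.05 level.

```python
import random
from math import sqrt
from statistics import NormalDist


def two_proportion_z(k1, n1, k2, n2):
    """z statistic for the difference between two proportions (pooled variance)."""
    p = (k1 + k2) / (n1 + n2)
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return 0.0 if se == 0 else (k2 / n2 - k1 / n1) / se


def false_positive_rate(trials=2000, n=500, p=0.3, alpha=0.05, seed=1):
    """Share of pure-noise comparisons that come out 'significant'."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(trials):
        # Both samples come from the same population with true rate p.
        k1 = sum(rng.random() < p for _ in range(n))
        k2 = sum(rng.random() < p for _ in range(n))
        if abs(two_proportion_z(k1, n, k2, n)) > z_crit:
            hits += 1
    return hits / trials


print(false_positive_rate())
```

The printed proportion should sit close to 0.05, i.e. roughly 1 in 20 comparisons of noise with noise is flagged.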

This is not very controversial. However, Gelman and Loken’s more provocative claim is as follows.

- Few researchers would admit to running very many tests against data and reporting results, which the authors term ‘fishing’ for significant results, or ‘p-hacking’. There are some algorithms that do this (multivariate logistic regression anyone?), but most research is not like this.
- Unfortunately, the authors argue, **standard post-hoc analysis methods – exploring data, graphing results and reporting significant results – do much the same thing.** We dispense with blind alleys (‘forking paths’), because we can see that they are not likely to produce significant results. Although we don’t actually run these dead-end tests, for mathematical purposes *our educated eyeballing of data to focus on interesting phenomena has done the same thing*.
- As a result, we overestimate the robustness of our results, and they often fail to replicate.
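A toy simulation may help to show why eyeballing matters even when only one test is formally run. It assumes *k* independent contrasts with no real effect, each summarised as a standard Normal *z* score, and a researcher who tests only the most striking one. Independence and the function name are assumptions made for illustration; real contrasts drawn from one corpus will usually be correlated.

```python
import random
from statistics import NormalDist


def selected_false_positive_rate(k, trials=20000, alpha=0.05, seed=1):
    """Share of pure-noise datasets where the largest of k contrasts is 'significant'."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(trials):
        # Eyeballing: look at k noise contrasts, then test only the biggest.
        strongest = max(abs(rng.gauss(0, 1)) for _ in range(k))
        if strongest > z_crit:
            hits += 1
    return hits / trials


for k in (1, 5, 10):
    print(k, selected_false_positive_rate(k))
```

Under these assumptions the rate follows 1 − 0.95^k, so with only five candidate contrasts the nominal 1 in 20 error rate becomes more than 1 in 5, even though a single test was performed.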

Since it would be unfair to criticise others for a problem that my own work may be prone to, let us consider the following graph which we used while writing Bowie and Wallis (2016). The graph does not appear in the final version of the paper – not because we didn’t like it, but because we decided to adopt a different baseline in breaking down an overall pattern of change into sub-components.

There are two critical questions that follow from Gelman and Loken’s critique.

- *In plotting this kind of graph and reporting confidence intervals, are we misrepresenting the level of certainty found in the graph?*
- *Are we engaging in, or encouraging, retrospective cherry-picking of contrasts between observations and confidence intervals?*

In the following graph there are 19 decades and 5 trend lines, i.e. 95 confidence intervals. There are 171 × 5 potential pairwise comparisons along the trend lines, and 10 × 19 vertical pairwise comparisons between them. So there are 1,045 potential statistical pairwise tests which it would be reasonable to carry out. With a 1 in 20 error rate, we would expect around 52 ‘significant’ pairwise comparisons to be incapable of replication.
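The counting in the paragraph above can be checked directly. This is simple arithmetic; the variable names are illustrative.

```python
from math import comb

decades, trend_lines = 19, 5

intervals = decades * trend_lines                  # one interval per point plotted
along_lines = comb(decades, 2) * trend_lines       # pairs of decades on each line
vertical = comb(trend_lines, 2) * decades          # pairs of lines at each decade
total_tests = along_lines + vertical

# Expected number of chance 'significant' results at a 1 in 20 error rate.
expected_by_chance = total_tests / 20

print(intervals, along_lines, vertical, total_tests, expected_by_chance)
```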

Gelman and Loken argue that by selecting a few statistically significant claims from this graph, we have committed precisely the error they object to.

However, I have to defend this graph, and others like it, by arguing that **this is not our method**. We don’t sift through 1,045 possible comparisons and then report significant results selectively! In the paper, and in our work more generally, we really don’t encourage this kind of cherry-picking. We are more concerned with the overall patterns that we see, general trends, etc., which are more likely to be replicable in broad terms.

Thus, for example, in that paper we don’t pull out specific significant pairwise comparisons to make strong claims. In this particular graph we can see an apparently statistically significant sharp decline between 1900 and 1930 in the tendency of writers to use the verb SAY (as in *he is said to have stayed behind*) before a *to-*infinitive perfect, compared to the other verbs in the group. This observation may be replicable, but **the conclusions of the paper do not depend on this observation**. This claim, and similar claims, do not appear in the paper.

Similarly, if we turn back to Neil Millar’s modals-per-million-word data for a moment, Bowie’s observation that the data does not show a consistent increase over time is interesting. Millar did not select the time period in order to report that modals were on the increase – on the contrary, he non-arbitrarily took the start and end point of the trend. But the conclusion that ‘modals increased over the entire period’ was only one statement that described the data. In shorter periods there was a significant fall, and different modal verbs behaved differently. Indeed, the complexity of his results is best summed up by the detailed graphs in his paper!

**In conclusion:** it is better to present and discuss the pattern, not just the end point – or the slogan.

Nonetheless we may still have the sneaking suspicion that what we are doing is a kind of researcher bias. We tend to report statistically significant results and ignore those inconvenient non-significant ones. The fear is that results assumed to be due to chance 1 in 20 times are more likely due to chance 1 in 5 times (say), simply because we have – inadvertently and unconsciously – already preselected our data and methods to obtain significant results. Some highly experienced researchers have suggested that we fix this problem by adopting tougher error levels – adopt a 1 in 100 level and we might arrive at 1 in 25. The problem is that this assumes we know the appropriate multiplier to apply.

Gelman and Loken suggest instead that published studies should always involve a replication process. They argue it is preferable that researchers publish half as many experiments and include a replication step than publish non-replicable results.

**Suggested method:** Before you start, create two random subcorpora A and B by randomly drawing texts from the corpus and assigning them to A and B in turn. You may wish to control for balance, e.g. to ensure subsampling is drawn equitably from each genre category. Perform the study on A, and summarise the results. Without changing a single query, variable or analysis step, apply exactly the same analysis to B. Do we get compatible results, i.e. *results that fall within the confidence intervals of the first experiment*?
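The partitioning step might be sketched as follows. The representation of the corpus as `(text_id, genre)` pairs and the function name are assumptions for illustration; texts are shuffled within each genre and then dealt to A and B in turn, so that the two subcorpora are balanced by genre as far as the numbers allow.

```python
import random
from collections import defaultdict


def split_corpus(texts, seed=0):
    """Split (text_id, genre) pairs into two genre-balanced random subcorpora."""
    rng = random.Random(seed)
    by_genre = defaultdict(list)
    for text_id, genre in texts:
        by_genre[genre].append(text_id)

    a, b = [], []
    turn = 0  # carried across genres so neither side collects all the odd texts
    for genre in sorted(by_genre):
        ids = by_genre[genre]
        rng.shuffle(ids)
        for text_id in ids:
            (a if turn == 0 else b).append(text_id)
            turn = 1 - turn
    return a, b
```

Fixing the seed before any analysis is run, and recording it, means the split itself can be reproduced by others. The queries developed on A are then applied unchanged to B.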

Even where replication is not carried out (for reasons of insufficient data, perhaps), an uncontroversial corollary of this argument is that the method should be sufficiently transparent so that it can be replicated by others.

Moreover raw data should be reported to permit a reanalysis by other analysis methods. I find it frustrating when papers publish per million word frequencies when what is needed for a reanalysis is raw frequency data.

Another of Gelman and Loken’s recommendations is that researchers need to spend more time focusing on sizes of effect, rather than just reporting statistical significance. With lots of data and large effect sizes, the problem is reduced. Certainly we should be wary of citing just-significant results with a small effect size.

Where does this leave the arguments I have made elsewhere in favour of visualising data with confidence intervals? One of the implications of the ‘forking paths’ argument is that we tend not to report dead-end, non-significant results. But well considered graphs can visualise all data in a given frame, rather than selected data (of course we have to ‘frame’ this data, select variables, etc.).

One advantage of graphing data with confidence intervals is that we apply the same criteria to all data points and allow the reader to interpret the graph. Significant and non-significant contrasts are available to be viewed. We also visualise effect sizes and the weight of evidence (confidence intervals), even if it is arguable that our model is insufficiently conservative.

Thus a strength of Millar’s paper is the reporting of trends and graphs. In the graph above, the confidence intervals improve our understanding of the overall trends we see.

We just should not assume that every significant difference will be replicable.

This is really one of mine, but I suggest it is implicit in the argument above.

It seems to me to be an absolutely essential requirement for any empirical scientist to play devil’s advocate to their own hypothesis.

That is, it is not sufficient to ‘find something interesting in data’, and publish. What we are really trying to do is detect meaningful phenomena in data, or to put it another way, we are trying to find robust evidence of phenomena that have implications for linguistic theory. We are trying to move from observed correlation to a hypothesised underlying cause.

Statistics is a tool to help us do this. But logic also plays an essential part.

Without wishing to create a checklist for empirical linguistics (such that a researcher is convinced of the validity of their results simply because they can tick off the list), we might argue that the following steps are necessary in all empirical research.

1. **Identify the underlying research question**, framed in general theoretical terms.
2. **Operationalise the research question** as a series of testable hypotheses or predictions, and evaluate them. Plot graphs! Visualising data with confidence intervals allows us to visualise expected variation and make more robust claims.
3. **Critique the results of this evaluation** in terms of the original research question, and play devil’s advocate: what other possible underlying explanations might there be for the observed results?
4. **Consider alternative hypotheses** and test them. Try to design new experiments to separate out different possible explanations for the observed phenomenon.
5. **Plan to include a replication step** prior to publication. This means being prepared to partition the data in the way described above, dividing the corpus into different pools of source texts.

Whether or not Gelman and Loken’s argument applies to your corpus linguistics study – and we have to eliminate basic errors first – the principal conclusion is that it is difficult to overstate the importance of **reporting accuracy and transparency**. If the study does not appear to replicate in the future, possible reasons must be capable of exploration by future researchers. It would not have been possible to explore the differences between Leech and Millar’s data had Neil Millar simply summarised a few trends and reported some statistically significant findings.

It is incumbent on all of us to properly describe the limitations of data and sampling; definitions of variables and abstraction (query) methods for populating them; as well as graphing data to reveal both significant and non-significant patterns at the same time.

A typical mistake is to refer to ‘British English’ (say) as a shorthand for ‘data drawn from British English texts sampled according to the sampling frame defined in Section 3’. Many failures to replicate in psychology can be attributed to precisely this type of logical error – that the experimental dataset is not a reliable model for the population claimed.

Bowie, J., Wallis, S.A. and Aarts, B. 2013. Contemporary change in modal usage in spoken British English: mapping the impact of “genre”. In Marín-Arrese, J.I., Carretero, M., Arús H.J. and van der Auwera, J. (eds.) *English Modality*, Berlin: De Gruyter, 57-94.

Bowie, J. and Wallis, S.A. 2016. The *to*-infinitival perfect: A study of decline. In Werner, V., Seoane, E., and Suárez-Gómez, C. (eds.) *Re-assessing the Present Perfect*, Topics in English Linguistics (TiEL) 91. Berlin: De Gruyter, 43-94.

Gelman, A. and Loken, E. 2013. The garden of forking paths. Columbia University. ePublished.

Leech, G. 2011. The modals ARE declining: reply to Neil Millar’s ‘Modal verbs in TIME: frequency changes 1923–2006’. *International Journal of Corpus Linguistics* 16(4).

Millar, N. 2009. Modal verbs in TIME: frequency changes 1923–2006. *International Journal of Corpus Linguistics* 14(2), 191–220.


One of the longest-running, and in many respects the least helpful, methodological debates in corpus linguistics concerns the spat between so-called **corpus-driven** and **corpus-based** linguists.

I say that this has been largely unhelpful because it has encouraged a dichotomy which is almost certainly false, and the focus on whether it is ‘right’ to work from corpus data upwards towards theory, or from theory downwards towards text, distracts from some serious methodological challenges we need to consider (see other posts on this blog).

Usually this discussion reviews the achievements of the most well-known corpus-driven linguist, John Sinclair, in building the *Collins Cobuild Corpus*, and deriving the *Collins Cobuild Dictionary* (Sinclair *et al*. 1987) and *Grammar* (Sinclair *et al*. 1990) from it.

**In this post I propose an alternative examination.**

I want to suggest that *the greatest success story for corpus-driven research is the development of part-of-speech taggers* (usually called a ‘POS-tagger’ or simply ‘tagger’) trained on corpus data.

These are industrial-strength, reliable algorithms that obtain good results with minimal assumptions about language.

So, *who needs theory?*

Taggers consist of two parts:

- **a ‘learning’ algorithm** that collects rules from training data, and
- **a ‘tagging’ algorithm** which applies rules to new texts to classify words by their part of speech (word class).

The corpus-driven aspect is the ‘learning’ algorithm.

A typical rule might be that if the word *old* (which can be a noun/nominal adjective, as in *the old*, or adjective, *the old man*) is followed by a noun, then *old* is more likely to be an adjective than otherwise.

The tagging algorithm takes a sentence and applies these rules like a crossword solver. It classifies the words that it is most certain of before considering those it is less confident about. Thus, in *the old man*, *the* is unambiguously a determiner, whereas both *old* and *man* can belong to more than one word class.
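The ‘crossword solver’ strategy can be sketched in a few lines. This is a toy greedy tagger, not the Brill tagger or any published algorithm: the lexicon, the transition scores, the `UNSEEN` fallback value and the tag names are all invented for illustration. At each pass it fixes the tag of the word it is most confident about, and uses tags already fixed to score the remaining neighbours.

```python
# Hypothetical lexical probabilities P(tag | word) and transition scores.
LEXICON = {
    "the": {"DET": 1.0},
    "old": {"ADJ": 0.8, "N": 0.2},
    "man": {"N": 0.9, "V": 0.1},
}
TRANSITION = {
    ("DET", "ADJ"): 0.4, ("DET", "N"): 0.6,
    ("ADJ", "N"): 0.9, ("N", "N"): 0.1, ("N", "V"): 0.5,
}
UNSEEN = 0.01  # score for tag pairs absent from the table


def score(words, tags, i, tag):
    """Lexical score for tag at position i, weighted by any fixed neighbours."""
    s = LEXICON[words[i]][tag]
    if i > 0 and tags[i - 1] is not None:
        s *= TRANSITION.get((tags[i - 1], tag), UNSEEN)
    if i + 1 < len(words) and tags[i + 1] is not None:
        s *= TRANSITION.get((tag, tags[i + 1]), UNSEEN)
    return s


def tag_sentence(words):
    """Greedily fix the most certain word first, then revisit the rest."""
    tags = [None] * len(words)
    while None in tags:
        best = None  # (confidence, position, tag)
        for i, word in enumerate(words):
            if tags[i] is not None:
                continue
            scores = {t: score(words, tags, i, t) for t in LEXICON[word]}
            top = max(scores, key=scores.get)
            confidence = scores[top] / sum(scores.values())
            if best is None or confidence > best[0]:
                best = (confidence, i, top)
        tags[best[1]] = best[2]
    return tags


print(tag_sentence(["the", "old", "man"]))
```

With these invented figures the tagger fixes *the* first, then *man*, and only then resolves *old*, by which point both of its neighbours point towards an adjective reading.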

The learning algorithm generates summary statistics bottom-up from training data it is given, which are lots of sentences/texts which have already been tagged with the same part of speech scheme (i.e., a corpus).

It is not necessary to make many assumptions about the grammar of the language we are working with to obtain results comparable to the best reported in the literature. The computer does not need to ‘know’ what a noun or a verb is. It can simply obtain statistics about these different categories from the corpus.

But these algorithms *do* embody some assumptions about their language input. These assumptions can be enumerated as follows, although different classification schemes might vary in some details:

1. language consists of **sentences** divided into lexical **words**;
2. each **sentence** is capable of being analysed separately;
3. **words** include part-words such as genitive markers and cliticised words, and compounds, where multiple words can be given the same tag;
4. there is a fixed set of **word class tags** that each particular instance of a word can be categorised by – these commonly consist of word class category (noun, verb, etc.), plus secondary information (plural proper noun, copular verb, etc.);
5. these tags were correctly applied to the **training data**.

Databases extracted by the learning algorithm typically consist of **frequency distributions** for every word-tag pattern, i.e. the number of cases in the training corpus where a given lexical word has a particular tag; and **transition probabilities** for each word-tag pattern if words have more than one tag.
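A minimal sketch of such a database, built from a three-sentence toy corpus, might look like this. The corpus, tag names and function name are hypothetical, and the sketch simplifies by storing tag-to-tag transition probabilities rather than one set per word-tag pattern.

```python
from collections import Counter, defaultdict


def learn(tagged_sentences):
    """Collect word-tag frequencies and tag transition probabilities."""
    word_tag = defaultdict(Counter)
    transitions = defaultdict(Counter)
    for sentence in tagged_sentences:
        previous = "<s>"  # marker for the start of a sentence
        for word, tag in sentence:
            word_tag[word.lower()][tag] += 1
            transitions[previous][tag] += 1
            previous = tag
    # Turn raw transition counts into probabilities per preceding tag.
    probs = {
        prev: {tag: count / sum(counts.values()) for tag, count in counts.items()}
        for prev, counts in transitions.items()
    }
    return word_tag, probs


toy_corpus = [
    [("the", "DET"), ("old", "ADJ"), ("man", "N")],
    [("the", "DET"), ("old", "N")],
    [("the", "DET"), ("man", "N")],
]
word_tag, probs = learn(toy_corpus)
print(dict(word_tag["old"]))
print(probs["DET"])
```

Nothing here requires the program to ‘know’ what a determiner or a noun is. The labels are opaque symbols whose statistics are simply counted.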

The performance of these linguistically unsophisticated algorithms is striking. **A typical tagger trained on a million words of English using a standard set of tags will make the correct decision for new sentences of a similar type some 95% of the time.**

Different algorithms may vary in storage efficiency. My crude simulated annealing stochastic tagger (Wallis 2012), which stores transition probabilities exhaustively, is less space-efficient than Eric Brill’s patch tagger (Brill 1992). *However, they obtain similar results.*

The remaining 5% of residual incorrect examples tend to be cases that are idiomatic, or are part of a multi-word string of ambiguous words, or are a result of weaknesses in the training data.

To address these weaknesses we can make a number of improvements.

- **Store a finite set of idioms, strings or compounds.** This is a bit clumsy and *ad hoc*, and doesn’t scale well, but can actually improve performance.
- **Add modules to the database and algorithm.** The Brill tagger employs some simple *ad hoc* regular morphology detection at an initial stage. A more thorough approach might consist of a morphological model of ‘lemmatisation’ (identifying word stems and affixes, e.g. *re-educated* → *re–* + *educate* + –*ed*). The advantage of this step is that even if we don’t have the word *re-educated* in our training set we can recognise *educate* as a verb and the entire word as an adjective or verb. Generalisation allows us to pool statistics, so we can have more reliable rules, and compress information, so we don’t have to store separate statistics for every single word.
- **Create a more general type of rule.** The rules we have described were tied to particular words, such as *old*. It would be more efficient if we had a rule that said something like ‘for any word capable of being either an adjective or a noun, if it is followed by an adjective or noun, then it is likely to be an adjective.’ *Note that to create such a rule we have to look for it* (this is precisely what the Brill tagger does).
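To make the morphological step concrete, here is a deliberately crude affix-stripping sketch. The prefix and suffix lists, the ‘restore a final *e*’ heuristic and the function name are assumptions for illustration only; a real lemmatiser would need a far richer model.

```python
PREFIXES = ("re-", "un")
SUFFIXES = ("ed", "ing", "s")


def analyse(word, stems):
    """Return (prefix, stem, suffix) if word reduces to a known stem, else None."""
    for prefix in ("",) + PREFIXES:
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix in ("",) + SUFFIXES:
            if suffix and not rest.endswith(suffix):
                continue
            base = rest[:len(rest) - len(suffix)]
            # After stripping a suffix, also try restoring a dropped final 'e'.
            candidates = (base, base + "e") if suffix else (base,)
            for candidate in candidates:
                if candidate in stems:
                    return prefix, candidate, suffix
    return None


known_stems = {"educate": "V"}  # hypothetical entry learned from training data
print(analyse("re-educated", known_stems))
```

Even this small step is top-down knowledge: someone has to tell the algorithm that *re-* and *-ed* are affixes worth looking for.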

But now let us consider where this path has taken us. Every step we have proposed to improve the performance of this corpus-driven algorithm requires the insertion of knowledge about idioms, morphology and grammar, top-down, into the algorithm.

A methodological corpus-driven purism that stated that we must work exclusively bottom-up was a little disingenuous, because we had to employ auxiliary assumptions (1) to (5) above from the start.

But now every improvement we wish to make requires further theoretical assumptions. It turns out that it is not possible to perform part-of-speech tagging without assumptions, and to improve the algorithm we need more theory.

Finally, whereas the learning algorithm might work bottom-up, the tagging algorithm itself works top-down, in that it applies its knowledge base of word-tag probabilities to new corpus data.

I have the utmost respect for corpus-driven linguists. The discipline of examining data with minimal assumptions is absolutely crucial! All scientists have to examine the data *as it is*, not compartmentalise it according to pre-given assumptions.

Over the years I have written extensively on not taking queries for granted, and directed corpus researchers to continually review the underlying sentences from which their statistics are derived.

However, it is simply not possible to work without *any* assumptions, even when building a bottom-up computer algorithm like a part-of-speech tagger.

So I would conclude that corpus-driven research is properly located as part of a larger research cycle, in which it is valid and reasonable to work bottom-up and top-down at different times. Corpus-driven research methods are part of a family of exploratory methods from which all corpus linguists should draw. Insights from computationally-obtained summary statistics (whether from collocations, *n*-grams, phrase frames, indexes, or databases of part of speech taggers) are important resources for further research.

But insisting that the only legitimate corpus methods are bottom-up prevents us carrying out research with a corpus which asks questions that are inevitably framed by a particular theory.

Brill, E. 1992. A simple rule-based part of speech tagger. In *Proceedings of the third conference on applied natural language processing* (ANLC ’92). Association for Computational Linguistics, Stroudsburg, PA, USA, 152-155.

Sinclair, J., Hanks, P., Fox, G., Moon, R., Stock, P. and others (eds.) 1987. *Collins Cobuild English Language Dictionary*. London: Collins.

Sinclair, J., Fox, G., Bullon, S., Krishnamurthy, R., Manning, E., Todd, J. and others (eds.) 1990. *Collins Cobuild English Grammar*. London: Collins.

Wallis, S.A. 2012. *Tagging ICE Philippines and other corpora*. London: Survey of English Usage. ePublished.


When the entire premise of your methodology is publicly challenged by one of the most pre-eminent figures in an overarching discipline, it seems wise to have a defence. Noam Chomsky’s famous objection to corpus linguistics therefore needs a serious response.

“One of the big insights of the scientific revolution, of modern science, at least since the seventeenth century… is that arrangement of data isn’t going to get you anywhere. You have to ask probing questions of nature. That’s what is called experimentation, and then you may get some answers that mean something. Otherwise you just get junk.” (Noam Chomsky, quoted in Aarts 2001).

Chomsky has consistently argued that the systematic *ex post facto* analysis of natural language sentence data is incapable of taking theoretical linguistics forward. In other words, corpus linguistics is a waste of time, because it is capable of focusing only on external phenomena of language – what Chomsky has at various times described as ‘e-language’.

Instead we should concentrate our efforts on developing new theoretical explanations for the internal language within the mind (‘i-language’). Over the years the terminology varied, but the argument has remained the same: real linguistics is the study of i-language, not e-language. Corpus linguistics studies e-language. Ergo, it is a waste of time.

Chomsky refers to what he calls ‘the Galilean Style’ to make his case. This is the argument that it is necessary to engage in theoretical abstractions in order to analyse complex data. “[P]hysicists ‘give a higher degree of reality’ to the mathematical models of the universe that they construct than to ‘the ordinary world of sensation’” (Chomsky, 2002: 98). We need a theory in order to make sense of data, as so-called ‘unfiltered’ data is open to an infinite number of possible interpretations.

In the Aristotelian model of the universe the sun orbited the earth. The same data, reframed by the Copernican model, was explained by the rotation of the earth. However, the Copernican model of the universe was not arrived at by theoretical generalisation alone, but by a combination of theory and observation.

Chomsky’s first argument contains a kernel of truth. The following statement is taken for granted across all scientific disciplines: **you need theory to analyse data**. To put it another way, there is no such thing as an ‘assumption free’ science. But the second part of this argument, that the necessity of theory permits scientists to dispense with engagement with data (or even allows them to dismiss data wholesale), is not a characterisation of the scientific method that modern scientists would recognise. Indeed, Beheme (2016) argues that this method is also a mischaracterisation of Galileo’s method. Galileo’s particular fame, and his persecution, came from one source: the observations he made through his telescope.

In astronomy it is necessary to build physical theories of the universe to make sense of observed data. Astronomical science must proceed by a process of theory building, attempting to account for observations within the theoretical framework. Moreover, rather than relying on naive Popperian refutation (abandoning a theory if one observation appears to contradict the theory), science tends to rely on **triangulation** (approaching the same theoretical generalisation from multiple sources and directions), and **pluralism**, i.e. the existence of competing theories such that if one fails another may replace it (Putnam 1974). Triangulation may also mean designing new experiments to test theoretical predictions as technology advances – such as viewing the earth from space, or placing atomic clocks on airliners to test special relativity.

Arguing for the necessity of theory is not an argument against corpus linguistics *per se*, but it is an argument against a particular type of corpus linguistics practice. The ‘Birmingham School’ of corpus linguistics, most associated with John Sinclair, has prided itself on making minimal theoretical assumptions and working bottom-up from words themselves. Some of the results of this approach are impressive. However,

- this type of corpus linguistics is not theory neutral or assumption free (e.g. we assume that *w*₁, *w*₂ are words, and a word is a linguistically meaningful unit);
- the process of validating theoretical generalisations entails a linguistic decision based on an external theory (e.g. there exists a distinct wordclass termed ‘adjective’);
- once theoretical generalisations are derived bottom-up (e.g. cases of *w*₁, *w*₂, etc. are members of the set of adjectives), we arrive at a methodological paradox.

Sinclair’s methodological paradox is simply this: if it is true that statements of the kind ‘*w*₁ is an adjective’ are linguistically valuable, then it follows that when analysing new data, we should exploit this new knowledge. However, Sinclair’s method is to work inductively from new data without making such *a priori* assumptions. Either he has to dispense with his previous conclusions, and start from scratch, or he has to change his method.

In conclusion, the argument that you need theory to interpret data, because data has multiple possible interpretations, is correct. However, this does not extend to permitting scientists to select data to fit their theory. Awkward and challenging results may not be ignored.

Moreover, if Chomsky’s argument were correct, no scientific field would ever arrive at a dominant scientific model. Every scientist could adopt different theoretical frameworks and premises because there was no agreed process for either refuting a theory or determining the outcome of competition between theories. Science has a pattern of both pluralistic competitive research *and* consensus-forming around ‘strong theories’. Chomsky’s characterisation of science may be a description of the fractious state of linguistics, but it departs from the scientific method.

I would suggest that it would be preferable to make linguistics more like science, rather than to make science more like linguistics.

Chomsky’s second argument is that the process of translation from internal to external language is subject to error. Consequently, studying e-language is not a productive way to study i-language. We need to study i-language, therefore we should reject corpus data.

This argument has been more influential than the first.

It also appears to be a reasonable criticism of a certain kind of corpus linguistics. Corpus linguistics has tended to focus on word frequencies, which, in the absence of a theoretical interpretation as to *why* certain forms might be more frequent than others, simply becomes descriptive. Chomsky can reasonably summarise this as studying the epiphenomena of linguistics.

By contrast, theoretical linguists have tended to use an introspective method (backed up occasionally with second-party elicitation) on the grammatical acceptability of test sentences. This is a scholastic approach drawn from traditional prescriptive grammars. The method contains a significant subjective element, even when data is drawn from elicitation experiments with large numbers of test subjects. Direct introspection simply tells us that we *believe* a sentence to be ‘grammatical’.

Could this type of research question be posed with corpus data? No, but corpus linguists do not have to dispense with introspective insight. Corpus linguists are linguists too!

Moving from million-word to billion-word POS-tagged corpora has not generated greater insight, merely more robust results. However, this observation is properly a criticism of the research foci of much corpus linguistics as practised. (I would argue that this is a limitation of POS-tagged corpus research.) It is not an argument against corpus *data*.

However, there are two reasons why Chomsky’s second argument cannot hold. The first is what we might call **the ‘linguists are not God’ reason**.

Linguists do not have special access to i-language data. Their data is from introspection, elicitation or even corpora. But *this* data is also external language! If there were no systematic mapping between i-language and e-language within an individual, ‘i-linguistics’ would not be possible.

Chomsky and his followers could theorise about any number of internal models. But they could never choose between them except by appealing to some general abstract principle, such as Occam’s razor (simplicity). Linguistic data cannot penetrate the question because *all* linguistic data is in fact e-language data.

The best, most robust, carefully-obtained data from uncued experimental settings is still e-language. It may be collected in a more focused (and artificial) way than corpus data, but it is also no more ‘internal’ than corpus data. Introspection data elicited from experiments may elicit subjective grammatical expectations, but results are no more scientific than those from any other scientists’ introspection. Physicists do not despair of their equipment and resort to interviewing their peers! Perhaps linguists should follow their lead.

The second counter-argument is that the process of articulating i-language as e-language is a *cognitive* one, that is, it takes place through cognitive processes in the mind. According to Chomsky, this process exposes the pure i-language to the distorting prism of articulation, and thereby makes e-language unreliable data.

However, if this were true, the same objection would necessarily be true for the generation of i-language in the first place. **If articulation of e-language is subject to error, the generation of i-language itself must also be error-prone.**

Random variation, cultural bias, personal preference, processing interference, etc, can take place at either stage, because these phenomena are artefacts of actual neurological pathways. Different types of error may arise at different locations, but there is no special error-free part of the brain. Speakers under the influence of alcohol have confused thoughts *and* slur their words. Alcohol, like error, is not selective.

A number of corpus linguists, including Geoffrey Leech, have commented on the regular ‘grammaticality’ of even the most informal spontaneous speech data. This observation should not be surprising – if speech data did not follow grammatical rules, speakers would not understand each other, and, given the historical and ontological primacy of speech over writing, language could never develop!

There may be noise in the signal, but the signal is not exclusively noise. We should not give up on corpora just yet.

Corpus data is simply uncued natural language data (sometimes termed ‘ecological’ data) as distinct from data obtained in an experimental setting. The key advantage of experimental data is that a researcher can manipulate variables under investigation and avoid variation in potentially confounding variables while obtaining data. A secondary advantage may be that one can construct a setting that provides a high frequency of sought-after phenomena that might otherwise be rare in a corpus. The disadvantages are the risk that the experimental conditions obtained are artificial (and possibly artificially *cued*), and the cost of obtaining and annotating data.

A corpus could contain experimental data, or data obtained by experiment could be annotated to the same level as a parsed corpus such as ICE-GB. These methods are not in competition but are complementary. A corpus can provide test data for experiments, identify potentially worthwhile experiments, and provide a control for experimental outcomes.

Corpus linguistics offers three kinds of evidence to a theoretical linguist – factual evidence that phenomena exist, evidence of frequency and distribution, and ‘interaction evidence’ pertaining to the co-occurrence of phenomena (Wallis 2014).

There is no need to discount corpora as a lesser source, or one more likely to be tainted by error than other sources. It is a *different* source of evidence, one that requires due methodological care, but one that has the potential for both the evaluation of theory against real-world natural language and robust statistical evaluation.

If data can only be studied by first relating it to a theory, then theoretical linguists first need to pay attention to how corpora are annotated. Do corpora contain useful representations for linguistic research? Are phenomena of interest to linguists capable of being captured within the corpus?

‘Annotation’ is the process of systematically applying a theoretical description to all the texts in a corpus. A decision to annotate instances of a particular phenomenon entails significant effort. All such instances in the corpus must be identified, and each decision must be properly motivated. Like classification schemes in science (e.g. the periodic table), linguistic phenomena are not simply identified, but related within a coherent annotation scheme. It follows that the entire scheme must be linguistically defended and systematically applied.

Syntacticians should pay particular attention to parsed corpora. It follows that if linguists are studying grammar then grammatically analysed corpora (‘parsed corpora’ or ‘treebanks’) are likely to be much more valuable than corpora with part-of-speech wordclass tags applied to each word. However, there is wide disagreement between theoretical linguists as to which grammatical scheme is optimal.

Inevitably the effort of annotation means that one has to choose a particular scheme at a particular point in time and systematically apply it. This poses a problem for researchers using the corpus. If they are stuck in a ‘hermeneutic trap’, only able to pose research questions within the annotation framework, and engage in circular reasoning, then corpus linguistics has a serious problem. After the huge effort of annotation you can only please a small number of linguists!

The solution to this problem offered by Wallis and Nelson (2001) is ‘abstraction’ – a process of reinterpretation of the annotated sentences from the representation in the corpus to the preferred representation of the linguist researcher, which takes place during the research process itself. Linguists do not have to accept the theoretical framework applied to a corpus in order to use it. Instead, the corpus representation is considered simply as a ‘handle on the data’, a method for systematically obtaining data across a corpus. It is not necessary to accept the framework uncritically.

In practice this means that researchers might find themselves constructing logical combinations of structural queries to retrieve a dataset aligned to their research theory and goals. But this is a small price to pay for having a grammatical framework already applied and evaluated against corpus data.

Finally, abstraction is not an end goal but a means to obtaining an abstracted dataset expressed in terms commensurate with the theoretical demands of the researcher. It is this dataset that may then be subject to a third process, one we refer to as ‘analysis’, hence the ‘3A’ model of corpus linguistics, distinguishing the stages of annotation, abstraction and analysis.

Aarts, B. 2001. Corpus linguistics, Chomsky and Fuzzy Tree Fragments. In: C. Mair and M. Hundt (eds.) *Corpus linguistics and linguistic theory*. Amsterdam: Rodopi. 5-13.

Beheme, C. 2016. How Galilean is the ‘Galilean Method’? *History and Philosophy of the Language Sciences*, http://hiphilangsci.net/2016/04/02/how-galilean

Chomsky, N. 2002. *On Nature and Language*. Cambridge: Cambridge University Press.

Putnam, H. 1974. The ‘Corroboration’ of Scientific Theories, republished in Hacking, I. (ed.) (1981), *Scientific Revolutions*, Oxford Readings in Philosophy, Oxford: OUP. 60-79.

Wallis, S.A. 2014. What might a corpus of parsed spoken data tell us about language? In L. Veselovská and M. Janebová (eds.) *Complex Visibles Out There*. Olomouc: Palacký University, 2014. 641-662. **»** Post

Wallis, S.A. and Nelson G. 2001. Knowledge discovery in grammatically analysed corpora. *Data Mining and Knowledge Discovery*, **5**: 307–340.


The Summer School is a short three-day intensive course aimed at PhD-level students and researchers who wish to get to grips with Corpus Linguistics. Numbers are deliberately limited on a first-come, first-served basis. You will be taught in a small group by a teaching team.

Each day begins with a theory lecture, followed by a guided hands-on workshop with corpora, and a more self-directed and supported practical session in the afternoon.

Over the three days, participants will learn about the following:

- the scope of Corpus Linguistics, and how we can use it to study the English Language;
- key issues in Corpus Linguistics methodology;
- how to use corpora to analyse issues in syntax and semantics;
- basic elements of statistics;
- how to navigate large and small corpora, particularly ICE-GB and DCPSE.

At the end of the course, participants will have:

- acquired a basic but solid knowledge of the terminology, concepts and methodologies used in English Corpus Linguistics;
- had practical experience working with two state-of-the-art corpora and a corpus exploration tool (ICECUP);
- gained an understanding of the breadth of Corpus Linguistics and the potential application for projects;
- learned about the fundamental concepts of inferential statistics and their practical application to Corpus Linguistics.

For more information, including costs, booking information, timetable, see the website.


Recently I’ve been working on a problem that besets researchers in corpus linguistics who work with samples which are not drawn randomly from the population but rather are taken from a series of sub-samples. These sub-samples (in our case, texts) may be randomly drawn, but we cannot say the same for any two cases drawn from the same sub-sample. It stands to reason that two cases taken from the same sub-sample are more likely to share a characteristic under study than two cases drawn entirely at random. I introduce the paper elsewhere on my blog.

In this post I want to focus on an interesting and non-trivial result I needed to address along the way. This concerns the concept of **variance** as it applies to a Binomial distribution.

Most students are familiar with the concept of variance as it applies to a Gaussian (Normal) distribution. A Normal distribution is a continuous symmetric ‘bell-curve’ distribution defined by two parameters, the **mean** and the **standard deviation** (the square root of the variance). The mean specifies the position of the centre of the distribution and the standard deviation specifies the width of the distribution.

Common statistical methods on Binomial variables, from χ² tests to line fitting, employ a further step. They approximate the Binomial distribution to the Normal distribution. They say, *although we know this variable is Binomially distributed, let us assume the distribution is approximately Normal*. The variance of the Binomial distribution becomes the variance of the equivalent Normal distribution.

In this methodological tradition, the variance of the Binomial distribution loses its meaning with respect to the Binomial distribution itself. It seems to be only valuable insofar as it allows us to parameterise the equivalent Normal distribution.

What I want to argue is that in fact, the concept of the variance of a Binomial distribution is important in its own right, and we need to understand it with respect to the Binomial distribution, not the Normal distribution. Sometimes it is not necessary to approximate the Binomial to the Normal, and if we can avoid this approximation our results are likely to be stronger as a result.

Every fundamental primer in statistics approaches the problem in the following way.

A Binomial variable is a two-valued variable (hence ‘bi-nomial’). The values can be anything, but let us simply call them, according to coin-tossing tradition, ‘heads’ and ‘tails’. The proportion of cases that are heads in any randomly-drawn sample of size *n* taken from a population, which we might term *p*, is free to vary from 0 to 1. That is, all *n* cases in the sample may be heads (*p* = 1) or all may be tails (*p* = 0).

Now, suppose we know, Zeus-like, the actual proportion in the population, *P.* We don’t *have* to be a deity – we might assume that our coin is unbiased so *P* = 0.5 (heads and tails are equally probable) – but a common error is when people get big *P* (true value in the population) and little *p* (observed value in a sample) muddled up. Let’s leave observed *p* aside for a minute.

We can calculate the distribution for *P* and *n* using the following Binomial formula:

*Binomial distribution* $B(r) = nCr\, P^{r} (1 - P)^{(n - r)}$, (1)

where *r* ranges from 0 to *n*. This means that the probability of obtaining exactly *r* heads out of *n* coin tosses is calculated by multiplying

- the **combinatorial function** *nCr* (the number of unique ways we can obtain exactly *r* cases out of *n* cases);
- the **probability that *r* cases are heads**, *P*^{r}; and
- the **probability that the remainder are tails**, (1 – *P*)^{(n – r)}.

This formula obtains the ideal Binomial distribution.
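As a minimal sketch, the Binomial formula above can be computed directly in Python using only the standard library. The function name `binomial_distribution` is illustrative and not taken from the paper. The example values are ten tosses of an unbiased coin.

```python
from math import comb


def binomial_distribution(n, P):
    """Return [B(0), ..., B(n)]: the chance of exactly r heads in n tosses."""
    return [comb(n, r) * P**r * (1 - P) ** (n - r) for r in range(n + 1)]


# Ten tosses of an unbiased coin.
dist = binomial_distribution(10, 0.5)
for r, b in enumerate(dist):
    print(f"r = {r:2d}  B(r) = {b:.4f}")

# The columns of a probability distribution should sum to 1.
print("sum of B(r):", sum(dist))
```

Calling the same function with a different *P*, e.g. `binomial_distribution(10, 0.9)`, gives the distribution for a ‘trick’ coin.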

The graph below shows what this looks like for ten tosses of an unbiased coin, where *P* = 0.5 and *n* = 10. The mean of this distribution is *nP*, i.e. 0.5 × 10 = 5.

**Note.** Equation (1) also works for a ‘trick’ coin, e.g. where *P* = 0.9 (9 times out of 10 we obtain heads). Although most primers first show a graph of *P* = 0.5, few real-world Binomial variables are equiprobable. (Don’t be misled by the symmetry of this graph.)

This distribution has a number of important characteristics.

- The most obvious characteristic is that it is **discrete** – the only possible values of *r* are integer values from 0 to *n*. Therefore if we sample 10 coin tosses, an observed probability *p* could be 0, 0.1, 0.2, right up to 1. If the true value of *P* was 0.45, we could not observe *p* = 0.45 if we only had ten coin tosses.
- A less obvious, but important, characteristic is that this distribution is **probabilistic** – the sum of all columns ∑*B*(*r*) = 1.
- Finally, for all values of *P* other than 0.5, the distribution is **asymmetric**. See below.

You can also see how unlikely it is that all coins are heads or all tails. The chance of this happening is not zero, but it is small. There is only one possible combination of heads and tails where all ten coins are heads (HHHHHHHHHH) out of 1,024 (2^{n}) possible patterns. The probability of observing *p* = 1 is therefore 1 in 1,024, as is the probability of observing *p* = 0 (all tails).

There are ten ways that one coin will be a tail and nine heads (THHHHHHHHH, HTHHHHHHHH,… HHHHHHHHHT), and so on.

The combinatorial function *nCr* tells us exactly how many different ways we can obtain *r* cases out of *n* potential cases. The full formula is given in equation (2) below, where *x*! means the factorial of *x*, or *x*(*x*-1)(*x*-2)…(1).

*combinatorial function nCr* = *n*! / ((*n* – *r*)! *r*!). (2)

You should be able to see that in cases where *r* = 0 or *r* = *n*, *nCr* = 1; where *r* = 1 or *r* = *n*-1, *nCr* = *n*.
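One way to convince yourself that the combinatorial function counts patterns correctly is to list every pattern and count. The sketch below assumes ten coins. It enumerates all head/tail sequences by brute force and compares the counts with equation (2). The helper name `n_choose_r` is my own.

```python
from itertools import product
from math import comb, factorial


def n_choose_r(n, r):
    """Equation (2): n! / ((n - r)! r!), using exact integer arithmetic."""
    return factorial(n) // (factorial(n - r) * factorial(r))


n = 10

# Enumerate every possible sequence of n heads/tails and
# count how many sequences contain exactly r heads.
counts = [0] * (n + 1)
for pattern in product("HT", repeat=n):
    counts[pattern.count("H")] += 1

print("total patterns:", sum(counts))
for r in range(n + 1):
    print(f"r = {r:2d}  enumerated = {counts[r]:3d}  formula = {n_choose_r(n, r):3d}")
```

Python's built-in `math.comb(n, r)` computes the same quantity and is the more practical choice for large *n*.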

If *P* = 0.5 then the Binomial function (1) above becomes simply

$B(r) = nCr\, P^{r} (1 - P)^{(n - r)} = nCr \times 0.5^{n}$.

However, the general function is much more flexible. It allows us to consider distributions for different values of *P*. (Again, these are plotted on an integer scale.)

Note that these distributions are clearly asymmetric, being centred at *P* < 0.5 and bounded by 0 and *n*. As *P* approaches zero this asymmetry becomes more acute.

Another aspect we can immediately see from the graphs above is that, as well as becoming less symmetric, the distribution becomes more tightly concentrated as *P* approaches zero. We say that the variance of the distribution decreases.

The variance of a Binomial distribution on the integer scale *r* = 0…*n* can be obtained from the function

*(integer) variance S*² = *nP*(1 – *P*).

To compare different-sized samples, we obviously need to use the same scale. The simplest standardisation is to adopt a probabilistic scale, i.e. where *p* = 0…1. To do this we divide this formula by *n*². The variance of a Binomial distribution on a **probabilistic scale** is obtained from the function

*(probabilistic) variance S*² = *P*(1 – *P*)/*n*. (3)

Thus if *P* = 0.5 and *n* = 10, *S*² = 0.025. If *P* = 0.1 and *n* = 10, *S*² = 0.009. (You shouldn’t need a calculator to work this out!) This formula has the following properties.

- For the same *n* > 1, as *P* tends to zero, *P*(1 – *P*) will also tend to 0. (Consider: if a coin had zero chance of being a head, it will always be a tail!)
- For the same *P* > 0, as *n* increases, *P*(1 – *P*)/*n* decreases. (Obviously if *P* = 0 then *S*² cannot decrease!)
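The two variance formulae, and the properties just listed, can be checked with a few lines of Python. This is only an illustrative sketch. The function names are my own, and the values of *P* and *n* in the loops are arbitrary choices.

```python
def integer_variance(n, P):
    """Variance on the integer scale r = 0..n: nP(1 - P)."""
    return n * P * (1 - P)


def probabilistic_variance(n, P):
    """Variance on the probabilistic scale p = 0..1: P(1 - P)/n."""
    return P * (1 - P) / n


# The two worked values from the text.
print(probabilistic_variance(10, 0.5))
print(probabilistic_variance(10, 0.1))

# Property 1: for fixed n, the variance shrinks as P tends to zero.
for P in (0.5, 0.3, 0.1, 0.01):
    print(f"n = 10, P = {P:<4}  S^2 = {probabilistic_variance(10, P):.5f}")

# Property 2: for fixed P > 0, the variance shrinks as n increases.
for n in (10, 100, 1000):
    print(f"n = {n:<4}, P = 0.3  S^2 = {probabilistic_variance(n, 0.3):.5f}")
```

Dividing `integer_variance(n, P)` by *n*² gives the same result as `probabilistic_variance(n, P)`, which is the rescaling described above.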

Variance is simply the square of the standard deviation of the same distribution:

*standard deviation S* ≡ √(*P*(1 – *P*)/*n*).

The concepts of variance and standard deviation are usually applied to the **Normal distribution**. Here they have immediate meaning because, as we noted in the introduction, a Normal distribution can be described by two parameters: the **mean**, in this case *P*, and the **standard deviation**, *S*.

Indeed, in the same statistics primers, at around this point we are encouraged to set aside what we have learned about the Binomial distribution and simply assume that it is ‘close to’ the Normal distribution *N*(*P*, *S*). We might see comments that this is an acceptable step for large *n* or where both *nP* and *n*(1 – *P*) > 5.

It is worth emphasising: this step (due to an observation by de Moivre in the 18th Century) is an **approximation**. The Binomial and Normal distributions are different. Here is the distribution for *P* = 0.3 again, but this time with a Normal distribution approximated to it. There is a small difference between the two mid-points, which we have labelled as ‘error’.
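A rough numerical comparison makes the size of the approximation error visible. The sketch below assumes *n* = 10 and *P* = 0.3. It compares each Binomial probability *B*(*r*) with the Normal density at the same point on the integer scale, using mean *nP* and standard deviation √(*nP*(1 – *P*)). Comparing a probability with a density at a point is itself a simplification (it treats each column as having width 1), and the function names are illustrative rather than taken from the paper.

```python
from math import comb, exp, pi, sqrt


def binomial(n, P, r):
    """Probability of exactly r heads out of n, given population rate P."""
    return comb(n, r) * P**r * (1 - P) ** (n - r)


def normal_density(x, mean, sd):
    """Density of the Normal distribution N(mean, sd) at x."""
    return exp(-((x - mean) ** 2) / (2 * sd**2)) / (sd * sqrt(2 * pi))


n, P = 10, 0.3
mean = n * P                   # centre of the Normal curve on the integer scale
sd = sqrt(n * P * (1 - P))     # square root of the integer-scale variance

for r in range(n + 1):
    b = binomial(n, P, r)
    g = normal_density(r, mean, sd)
    print(f"r = {r:2d}  Binomial = {b:.4f}  Normal = {g:.4f}  difference = {g - b:+.4f}")
```

The Normal curve takes the same value at equal distances either side of the mean, whereas the Binomial columns do not. It also assigns a small positive density to impossible values such as *r* = –1.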

- Most obviously, the Normal distribution is **continuous** rather than discrete. This means we can obtain an estimate for the expected probability that *p* = 0.45.
- Like the Binomial distribution, the standardised Normal distribution is also **probabilistic**, i.e. the area under the curve sums to 1.
- Finally, the Normal distribution is **symmetric**. Moreover, it assumes that the observed variable is unbounded. An unbounded variable is free to vary from minus infinity (-∞) to plus infinity (+∞). (This is a corollary: if the variable was bounded, it could not be symmetric.)

It is worth considering this last point. Many statistics text books use example variables from the natural and physical sciences.

- For example, the height of children in a class, which we might call *H*, is usually considered to be an unbounded variable, suitable for the Normal distribution.
- But in fact, the height of children is a bounded variable.
  - **It has a lower limit.** At the risk of stating the obvious, children cannot be less than zero height(!), and indeed, to be permitted to go to school, must be of a certain age and be physically safe to do so. *H* must have a lower limit rather greater than zero.
  - **It has an upper limit.** A number of factors, from growth rates to the physical strength of bone, limit the possible height of children.
- Far from being unbounded, *H* is bounded by biology!

What everyone does is assume that the observed mean height is **so far** from the bounds that although the bounds exist, they have negligible effect on the distribution. (This is not always a healthy assumption, but it is the source of these injunctions to only approximate to the Normal distribution in cases where *nP* > 5.)

On the other hand, Binomial variables (and the Binomial distribution), are **strictly** bounded. We may write, e.g. *P* ∈ [0, 1], which simply means “*P* ranges from 0 to 1 inclusive”. The probability *P* may also be expressed as a proportion or percentage, so we might say that a rate can be any value from 0% to 100%.

So far we have discussed the *ideal* Binomial distribution. Equation (1) is the mathematical extrapolation of the likelihood, *B*(*r*), of observing *r* future results for a sample of *n* cases drawn randomly from a population if the true rate in the population was *P*.

In some circumstances we may *observe* a Binomial distribution. I do this in class with students – each student tosses a coin a fixed number of times and we note down the number of students who had 0 heads, 1 head and so on.
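The classroom exercise can be imitated with a small simulation. This is a sketch under stated assumptions: 30 students each tossing a fair coin ten times are invented figures, and `classroom_experiment` is a hypothetical helper name. A fixed seed is used only so that the run is repeatable.

```python
import random
from collections import Counter


def classroom_experiment(students, tosses, P=0.5, seed=None):
    """Simulate each student tossing a coin `tosses` times.

    Returns a list f where f[r] is the number of students who
    obtained exactly r heads.
    """
    rng = random.Random(seed)
    heads_per_student = [
        sum(rng.random() < P for _ in range(tosses)) for _ in range(students)
    ]
    counts = Counter(heads_per_student)
    return [counts.get(r, 0) for r in range(tosses + 1)]


observed = classroom_experiment(students=30, tosses=10, seed=1)
for r, f in enumerate(observed):
    print(f"{r:2d} heads: {'#' * f}")
```

With only 30 students the observed columns will be ragged. Increasing `students` should bring the shape closer to the ideal distribution for the same *n* and *P*.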

In the paper I am working on, I realised that this principle can also be employed to identify the extent to which a corpus sample might deviate from an ideal random sample for a given variable. This is an important question for corpus linguistics.

The first step is to partition the corpus sample into subsamples according to the text that they are drawn from. To all intents and purposes, these texts can be assumed to be random even if they were not subject to controlled sampling.

Note that two cases drawn from different texts are therefore likely to be independent and equivalent to a pair of cases in a true random sample. However two cases from the same text may share characteristics. There are all sorts of reasons why this is likely to be the case, from a shared topic to personal preferences, priming and other psycholinguistic effects. The reason does not actually matter – we just need to recognise this is likely to be the case.

**Question:** How may we measure the deviation of the corpus sample from an ideal random sample?

**Answer:** By studying the distribution of these subsamples.

Suppose the subsamples are equivalent to random samples. Even though cases are drawn from the same text, suppose it turns out that the particular variable is not sensitive to context, previous utterances, etc. In this case, we would expect these sub-samples to be Binomially distributed.

To plot the following graph we first ‘quantise’ (round up or down to a particular number) the observed probability *p*. The vertical axis, *f*, is simply the number of texts in the direct conversations category of ICE-GB where the probability that a clause is interrogative, *p*(inter), is 0, 0.01, 0.02, etc. There are 90 texts in this category. We can see that this distribution is approximately Binomial.

We may calculate the variance of this observed distribution with the following pair of formulae, derived from Sheskin (1997).

The first estimate (4) does not take into account the fact that samples are drawn from a population, whereas the second measure, termed the *unbiased estimate of the population variance*, does. For that reason, we here use capital *P* to refer to each probability in the first case and lower case *p* to refer to observations.

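*variance of a set of scores* ${s'}_{ss}^{2} = \sum (P_i - \bar{P})^2 / t$, (4)

*observed between-subsample variance* $s_{ss}^{2} = \sum (p_i - \bar{p})^2 / (t - 1)$, (5)

where *p*_{i} is the observed probability for subsample *i*, $\bar{p}$ is the mean of these probabilities and *t* is the number of subsamples.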

Equations (4) and (5) have one deficiency. They assume that each subsample is the same size. This is fine for classroom coin-tossing. It is unlikely to be the case in a corpus sample.

The estimate of variance for a set of different-sized subsamples can be obtained from

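*variance of a set of scores (different sizes)* ${s'}_{ss}^{2} = \sum pr_i (P_i - \bar{P})^2$, (6)

*observed between-subsample variance* $s_{ss}^{2} = \frac{t}{t - 1} \times \sum pr_i (p_i - \bar{p})^2$, (7)

where $pr_i = n_i / n$ is the proportion of all *n* cases that fall in subsample *i*, which contains *n*_{i} cases.

It is possible to prove that if *pr*_{i} is equal to the Binomial probability *B*(*r*), equation (6) reduces to equation (3):

$\sum nCr\, P^{r} (1 - P)^{(n - r)} (r/n - P)^2 = P(1 - P)/n$.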

This means that equation (6) *defines the correct mathematical relationship between a Binomial distribution on a probabilistic scale and its expected variance*. Another way of putting this is that it is legitimate to apply equations (6) and (7) to a Binomial variable.
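The weighted calculation in equations (6) and (7) might be sketched as follows. The sketch assumes that each subsample is weighted by its share of the cases (*n*_{i}/*n*) and that the mean is the overall proportion across all cases. The function name and the `(hits, size)` input format are my own, and the figures are invented for illustration.

```python
def between_subsample_variance(subsamples):
    """Weighted variance between subsamples of different sizes.

    `subsamples` is a list of (hits, size) pairs, one per text.
    Returns (uncorrected, corrected), where the corrected estimate
    multiplies by t/(t - 1) for t subsamples.
    """
    t = len(subsamples)
    n = sum(size for _, size in subsamples)
    p_bar = sum(hits for hits, _ in subsamples) / n  # overall proportion
    uncorrected = sum(
        (size / n) * (hits / size - p_bar) ** 2 for hits, size in subsamples
    )
    return uncorrected, uncorrected * t / (t - 1)


# Hypothetical texts: (interrogative clauses, total clauses).
texts = [(4, 120), (9, 200), (1, 60), (12, 310)]
uncorrected, corrected = between_subsample_variance(texts)
print(f"uncorrected = {uncorrected:.6f}")
print(f"corrected   = {corrected:.6f}")
```

When every subsample is the same size, the weights are all 1/*t* and the corrected estimate matches the simpler unweighted formula with *t* – 1 in the denominator.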

**Example:** To illustrate this equivalence, consider the following computation for *P* = 0.3 and *n* = 2. Equation (3) simply obtains *S*² = (0.3 × 0.7)/2 = 0.105.

| *r*/*n* | *r* | *nCr* | *B*(*r*) | *B*(*r*) × (*r*/*n* – *P*)² |
|---|---|---|---|---|
| 0 | 0 | 1 | 0.49 | 0.0441 |
| 0.5 | 1 | 2 | 0.42 | 0.0168 |
| 1 | 2 | 1 | 0.09 | 0.0441 |
| Totals | | 4 | 1 | 0.1050 |
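The same computation can be repeated for other values of *P* and *n*. The sketch below sums *B*(*r*) × (*r*/*n* – *P*)² over every *r* and compares the total with *P*(1 – *P*)/*n*. It is a numerical check, not a proof, and the function names and the test values are my own choices.

```python
from math import comb


def variance_from_distribution(n, P):
    """Sum of B(r) * (r/n - P)^2 over r = 0..n."""
    total = 0.0
    for r in range(n + 1):
        b = comb(n, r) * P**r * (1 - P) ** (n - r)  # Binomial probability B(r)
        total += b * (r / n - P) ** 2
    return total


def variance_from_formula(n, P):
    """Probabilistic-scale Binomial variance, P(1 - P)/n."""
    return P * (1 - P) / n


for n in (2, 10, 50):
    for P in (0.05, 0.3, 0.5, 0.9):
        summed = variance_from_distribution(n, P)
        direct = variance_from_formula(n, P)
        print(f"n = {n:2d}  P = {P:<4}  summed = {summed:.6f}  formula = {direct:.6f}")
```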

We can therefore contrast the observed subsample variance with the variance that would be predicted assuming each subsample were a random sample, i.e. the expected Binomial variance, which in this notation would be

*predicted between-subsample variance S*_{ss}² = *p*(1 – *p*)/*t*.

If the two variance scores are the same, then to all intents and purposes, our subsamples are random samples, and the entire corpus sample can be considered a random collection of random samples, i.e. a random sample.

However, if the observed subsample variance differs from that predicted, we are entitled to take this into account when considering the variance of the corpus sample. We employ the ratio of variances, *F*_{ss}, to adjust the sample size accordingly.

*cluster-adjustment ratio F*_{ss} = *S*_{ss}² / *s*_{ss}², and (6)

*corrected sample size n’* = *nF*_{ss}.

If the observed variance is greater than the predicted variance, *F*_{ss} < 1. We can then say that there are fewer truly independent random cases in our overall corpus sample than the raw count suggests. Our uncertainty about any cross-corpus observation increases: significance tests become stricter, confidence intervals wider, and so on.
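The adjustment itself is a one-line calculation. The sketch below follows the formulae as given in the text, with predicted variance *p*(1 – *p*)/*t*, ratio *F*_{ss} and corrected sample size *n’* = *nF*_{ss}. All the figures are invented, including the pretence that the observed variance is twice the predicted one, and the function names are hypothetical.

```python
def predicted_variance(p, t):
    """Predicted between-subsample variance, p(1 - p)/t, as given in the text."""
    return p * (1 - p) / t


def cluster_adjustment(predicted, observed, n):
    """Return (F_ss, n'), where F_ss = predicted / observed and n' = n * F_ss."""
    f_ss = predicted / observed
    return f_ss, n * f_ss


# Hypothetical figures, for illustration only.
p, t, n = 0.04, 90, 5000
predicted = predicted_variance(p, t)
observed = 2 * predicted  # pretend the texts vary twice as much as predicted

f_ss, n_adjusted = cluster_adjustment(predicted, observed, n)
print(f"F_ss = {f_ss:.3f}")
print(f"corrected sample size n' = {n_adjusted:.0f} (from n = {n})")
```

In this invented case half of the cases are treated as effectively independent, so any test or interval computed with *n’* in place of *n* will be correspondingly more cautious.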

In the paper, we observe that sometimes *F*_{ss} > 1 and discuss reasons for this. Suffice it to say it is certainly possible, although this may at first sight appear counter-intuitive.

To illustrate the method, consider the following graph. This is the same data as the figure above. You can download this spreadsheet to inspect the calculation for yourself.

Note that in this case we see a close correspondence between the two predicted distributions – Binomial and Normal. The observed distribution is also approximately Normal (accepting the randomness we would anticipate in any observed distribution of course).

The method of comparing variances we employed makes no assumptions about the Binomial approximating to the Normal distribution.

However, this method usually comes under the umbrella of analysis of variance (ANOVA), which is premised on data being Normally distributed. Instead of assuming that ANOVA *might* be legitimately employed for Binomial (bounded, asymmetric, discrete) distributions, we were concerned to *prove* that our definitions of variance were applicable to the Binomial.

Why might this matter? There are two reasons.

- The approximation to the Normal distribution is an approximation, and introduces a number of ‘smoothing’ errors as a result.
- We must ensure that the method is robust for highly skewed values of *p*.

In the figure above the Normal and Binomial distributions are similar. However, this is not always the case.

Consider the following graph (Figure 4 in the paper). Here data is drawn, not from a single genre, but across the diverse genres contained within the ICE-GB corpus, from the most highly interactive speech contexts to the most didactic of written instructional texts.

The two upper dotted lines are the predicted Normal and Binomial distributions for this observed value of *p* (0.0399) and *t* = 500 texts. You can see how the Normal distribution is narrower than the predicted Binomial.

Equation (5) captures the total variance between subsamples in this figure. It is approximately 4% of the predicted variance according to equation (3).

The lower line is the Normal distribution premised on the observed subsample variance. Again, you can see a large deviation between the observed frequency distribution (bars) and this Normal distribution, which is also clearly clipped by the lower bound at *p* = 0.

If our method were dependent on the Normal distribution, we simply could not sustain it in highly-skewed contexts such as this.

Sheskin, D.J. 1997. *Handbook of Parametric and Nonparametric Statistical Procedures*. Boca Raton, Fl: CRC Press.


Conventional stochastic methods based on the Binomial distribution rely on a standard model of random sampling whereby freely-varying instances of a phenomenon under study can be said to be drawn randomly and independently from an infinite population of instances.

These methods include confidence intervals and contingency tests (including multinomial tests), whether computed by Fisher’s exact method or variants of log-likelihood, χ², or the Wilson score interval (Wallis 2013). These methods are also at the core of others. The Normal approximation to the Binomial allows us to compute a notion of the variance of the distribution, and is to be found in line fitting and other generalisations.

In many empirical disciplines, samples are rarely drawn “randomly” from the population in a literal sense. Medical research tends to sample available volunteers rather than names compulsorily called up from electoral or medical records. However, provided that researchers are aware that their random sample is limited by the sampling method, and draw conclusions accordingly, such limitations are generally considered acceptable. Obtaining consent is occasionally a problematic experimental bias; actually recruiting relevant individuals is a more common problem.

However, in a number of disciplines, including **corpus linguistics**, samples are not drawn randomly from a population of independent instances, but instead consist of randomly-obtained contiguous subsamples. In corpus linguistics, these subsamples are drawn from coherent passages or transcribed recordings, generically termed ‘texts’. In this sampling regime, whereas any pair of instances in independent subsamples satisfies the independent-sampling requirement, pairs of instances in the same subsample are likely to be co-dependent to some degree.

To take a corpus linguistics example, a pair of grammatical clauses in the same text passage are more likely to share characteristics than a pair of clauses in two entirely independent passages. Similarly, epidemiological research often involves “cluster-based sampling”, whereby each subsample cluster is drawn from a particular location, family nexus, etc. Again, it is more likely that neighbours or family members share a characteristic under study than random individuals.

If the random-sampling assumption is undermined, a number of questions arise.

- Are statistical methods employing this random-sample assumption simply **invalid** on data of this type, or do they gracefully degrade?
- Do we have to employ very **different tests**, as some researchers have suggested, or can existing tests be modified in some way?
- Can we measure the **degree** to which instances drawn from the same subsample are interdependent? This would help us determine both the scale of the problem and arrive at a potential solution to take this interdependence into account.
- Would revised methods only affect the **degree of certainty** of an observed score (variance, confidence intervals, etc.), or might they also affect the **best estimate of the observation** itself (proportions or probability scores)?

We will employ a method related to ANOVA and F-tests, applying this method to a probabilistic rather than linear scale. This step is not taken lightly but as we shall see in section 6, it can be justified.

Consider an observation *p* drawn from a number of texts, *t*, based on *n* total instances. Conventionally we would assume that these *n* instances are randomly drawn from an infinite population, and then employ the Normal approximation to the Binomial distribution:

*standard deviation s* ≡ √(*p*(1 – *p*)/*n*),

*variance s*² = *p*(1 – *p*)/*n*, and (1)

*Wilson’s score interval* (*w*⁻, *w*⁺)

≡ [*p* + *z*_{α/2}²/2*n* ± *z*_{α/2}√(*p*(1 – *p*)/*n* + *z*_{α/2}²/4*n*²)] / [1 + *z*_{α/2}²/*n*]. (2)

where *z*_{α/2} is the critical value of the Normal distribution for a given error level α (see Wallis 2013 for a detailed discussion). Other derivations from (1) include χ² and log-likelihood tests, least-square line-fitting, and so on. The model assumes that all *n* instances are randomly drawn from an infinite (or very large) population. However, we suspect that our subsamples are not equivalent to random samples, and that this sampling method will affect the result.
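The Wilson score interval of equation (2) is straightforward to compute. Here is a minimal Python sketch; the function name and the default critical value (1.959964, i.e. a 95% interval) are choices made for this illustration.

```python
from math import sqrt


def wilson_interval(p, n, z=1.959964):
    """Wilson score interval (w-, w+) for an observed proportion p out of n cases."""
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denominator = 1 + z * z / n
    return (centre - spread) / denominator, (centre + spread) / denominator


lower, upper = wilson_interval(0.5, 10)   # e.g. 5 cases out of 10
print(lower, upper)
```

Note that `n` need not be an integer, which matters once the sample size is rescaled later on.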

To investigate this question, our approach involves two stages.

First, we measure the variance of scores between text subsamples according to two different models, one that presumes that each subsample is a random sample, and one calculated from the actual distribution of subsample scores. Consider the frequency distribution of probability scores, *p*_{i}, across all *t* subsamples, with

*subsample mean p* = ∑*p*_{i} / *t*.

If subsamples were randomly drawn from the population, it would follow from (1) that the variance could be **predicted** by

*between-subsample variance S*_{ss}² = *p*(1 – *p*)/*t*.(3)

To measure the **actual** variance of the distribution we employ a method derived from Sheskin (1997: 7). First, note that the variance of a series of *N* observed scores *X _{i}*, can be obtained by

*s*² = ∑(*X*_{i} – *X̄*)² / *N*, where *X̄* is the mean of the series,

which can be rewritten as

*observed between-subsample variance s*_{ss}² = ∑(*p*_{i} – *p*)² / *t*.

This formula measures the internal variance of the series, but it fails to take into account the fact that the series is a subsample from which we wish to predict the true population value. The formula for the *unbiased estimate of the population variance* may be obtained by

*observed between-subsample variance s*_{ss}² = ∑(*p*_{i} – *p*)² / (*t* – 1), (4)

where *t* – 1 is the number of degrees of freedom of our set of observations. This is the formula we will use for our computations.
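The two variance estimates can be placed side by side in a few lines of Python. This is a sketch with invented per-text scores; the function names are hypothetical, and the subsamples are treated as equally weighted, as in the formulae above.

```python
def predicted_variance(p, t):
    """Between-subsample variance predicted if each subsample were a random sample."""
    return p * (1 - p) / t


def observed_variance(scores):
    """Unbiased estimate of the variance of subsample scores (t - 1 degrees of freedom)."""
    t = len(scores)
    mean = sum(scores) / t
    return sum((p_i - mean) ** 2 for p_i in scores) / (t - 1)


scores = [0.1, 0.2, 0.3, 0.4]        # hypothetical p_i for four texts
p = sum(scores) / len(scores)        # subsample mean
print(predicted_variance(p, len(scores)), observed_variance(scores))
```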

Second, we adjust the weight of evidence according to the degree to which these two variances (equations (3) and (4)) disagree. If the observed and predicted variance estimates coincide, then the total set of subsamples is, to all intents and purposes, a random sample from the population, and no adjustment is needed to sample variances, standard deviations, confidence intervals, tests, etc.

We can expect, however, that in most cases the actual distribution has greater spread than that predicted by the randomness assumption. In such cases, we employ the **ratio of variances**, *F*_{ss}, as a scale factor for the number of random independent cases, *n*.

Gaussian variances with the same probability *p* are inversely proportional to the number of cases supporting them, *n*, i.e. *s*² ≡ *p*(1 – *p*)/*n* (cf. equation (1)). Assuming the Normal approximation to the Binomial holds for the distribution of *p*, we can estimate a corrected total independent sample size *n’*, by multiplying *n* by the ratio of variances for the same *p*.

*cluster-adjustment ratio F*_{ss} = *S*_{ss}² / *s*_{ss}², and (5)

*corrected sample size n’* = *nF*_{ss}.

To put it another way, the ratio *n’*:*n* is the same as *S*_{ss}²:*s*_{ss}². This ratio should be less than 1, and thus *n* is decreased. If we decrease *n* in equations (1) and (2), we obtain larger estimates of sample variance and wider confidence intervals. An adjusted *n* is easily generalised to contingency tests and other methods.
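To see the effect on a confidence interval, the following Python sketch rescales *n* by the ratio of variances and recomputes the Wilson interval. All figures are invented for illustration, and the helper names are hypothetical.

```python
from math import sqrt


def wilson_interval(p, n, z=1.959964):
    """Wilson score interval for proportion p and (possibly non-integer) sample size n."""
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denominator = 1 + z * z / n
    return (centre - spread) / denominator, (centre + spread) / denominator


def cluster_adjusted_interval(p, n, predicted_var, observed_var, z=1.959964):
    """Scale n by F_ss = predicted / observed variance, then recompute the interval."""
    f_ss = predicted_var / observed_var
    n_corrected = n * f_ss
    return f_ss, n_corrected, wilson_interval(p, n_corrected, z)


# Hypothetical figures: the observed variance is four times the predicted variance.
p, n = 0.1, 1000
plain = wilson_interval(p, n)
f_ss, n_corrected, adjusted = cluster_adjusted_interval(p, n, 0.0009, 0.0036)
print(f_ss, n_corrected)
print(plain, adjusted)
```

Because only *n* changes, the same corrected value can be passed to any other test that takes a sample size as input.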

…

Figure 6 plots the distribution of *p* with Wilson intervals across ICE-GB genre categories. The thin ‘I’-shaped error bars represent the conventional Wilson score interval for *p*, assuming random sampling. The thicker error bars represent the adjusted Wilson interval obtained using the probabilistically-weighted method of equation (7). These results are tabulated in Table 2 in the paper.

The figure reinforces observations we made earlier. Within a single text type, such as *broadcast interviews*, *p* has a compressed range and cannot plausibly approach 1. (Note that mean *p* does not exceed 0.03 in any genre.) The observed between-text distribution is smaller than that predicted by equation (3), and, armed with this information, we are able to reduce the 95% Wilson score interval for *p*. This degree of compression (or, to put it another way, the plausible value of max(*p*)) may also differ by text genre.

However, the reduction due to range-compression is offset by a countervailing tendency: pooling genres increases the variance of *p*. The distribution of texts across the entire corpus consists of the sum of the spoken and written distributions (means 0.0091 and 0.0137 respectively), and so on.

The Wilson interval for the mean *p* averaged over all of ICE-GB approximately doubles in width (*F*_{ss} = 0.2509), and the intervals for *spoken*, *dialogue*, *private*, *written* and *printed* (marked in bold in Figure 6) also expand, albeit to lesser extents. The other intervals contract (*F*_{ss} > 1), tending to generate a more consistent set of intervals over all text categories.
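As a rough check on ‘approximately doubles’: under the Normal approximation, interval width is proportional to 1/√*n*, so replacing *n* with *nF*_{ss} scales the width by about 1/√*F*_{ss}. The Wilson interval only approximately follows this rule, so the Python sketch below is a back-of-envelope calculation rather than the paper's computation.

```python
from math import sqrt


def width_factor(f_ss):
    """Approximate factor by which an interval widens when n is replaced by n * f_ss."""
    return 1 / sqrt(f_ss)


all_ice_gb = width_factor(0.2509)   # F_ss for all of ICE-GB, as quoted in the text
print(all_ice_gb)                   # close to 2, i.e. roughly double the width
```

A ratio above 1 gives a factor below 1, which corresponds to the intervals that contract.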

1. Introduction
2. Previous research
   2.1 Employing rank tests
   2.2 Case interaction models
3. Adjusting the Binomial model
4. Example 1: interrogative clause probability, direct conversations
   4.1 Alternative method: fitting
5. Example 2: Clauses per word, direct conversations
6. Uneven-size subsamples
7. Example 3: Interrogative clause probability, all ICE-GB data
8. Example 4: Rate of transitive complement addition
9. Conclusions

Wallis, S.A. 2015. *Adapting random-instance sampling variance estimates and Binomial models for random-text sampling*. London: Survey of English Usage, UCL. http://www.ucl.ac.uk/english-usage/statspapers/recalibrating-intervals.pdf

- Spreadsheet example (Excel)
- The variance of Binomial distributions
- Random sampling, corpora and case interaction
- Reciprocating the Wilson interval
- Freedom to vary and significance tests

Sheskin, D.J. 1997. *Handbook of Parametric and Nonparametric Statistical Procedures*. Boca Raton, Fl: CRC Press.

Wallis, S.A. 2013. Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods. *Journal of Quantitative Linguistics* **20**:3, 178-208.


Recently, a number of linguists have begun to question the wisdom of assuming that linguistic change tends to follow an ‘S-curve’ or more properly, **logistic**, pattern. For example, Nevalainen (2015) offers a series of empirical observations that show that whereas data sometimes follows a continuous ‘S’, frequently this does not happen. In this short article I try to explain why this result should not be surprising.

The fundamental assumption of logistic regression is that a probability representing a true fraction, or share, of a quantity undergoing a continuous process of change by default follows a logistic pattern. This is a reasonable assumption in certain limited circumstances because **an ‘S-curve’ is mathematically analogous to a straight line** (cf. Newton’s first law of motion).

Regression is a set of computational methods that attempts to find the closest match between an observed set of data and a function, such as a straight line, a polynomial, a power curve or, in this case, an S-curve. We say that the logistic curve is the underlying model we expect data to be matched against (regressed to). In another post, I comment on the feasibility of employing Wilson score intervals in an efficient logistic regression algorithm.

We have already noted that change is assumed to be continuous, which implies that the input variable (*x*) is **real and linear**, such as time (and not e.g. probabilistic). In this post we discuss different outcome variable types. What are the ‘limited circumstances’ in which logistic regression is mathematically coherent?

- We assume probabilities are free to vary from 0 to 1.
- The envelope of variation must be constant, i.e. it must always be possible for an observed probability to reach 1.

Taken together this also means that probabilities are Binomial, not multinomial. Let us discuss what this implies.

The logistic curve can be expressed as the function

*P* = logistic(*x*, *m*, *k*) ≡ 1 / (1 + *e*^{–m(x – k)}).

In a simple Binomial alternation, we have two probabilities, *P* and *Q*, where *Q* = 1 – *P* (as this is an expected model, rather than observed data, I am following the convention of capitalisation used elsewhere in this blog).

*Q* = 1 – logistic(*x*, *m*, *k*) = 1 – {1 / (1 + *e*^{–m(x – k)})},

*Q* = 1 / (1 + *e*^{+m(x – k)}) = logistic(*x*, –*m*, *k*).

Another way of saying this is that

- logistic(*x*, *m*, *k*) + logistic(*x*, –*m*, *k*) ≡ 1.
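This identity is easy to confirm numerically. A minimal Python sketch follows, with gradient and midpoint values picked arbitrarily for illustration.

```python
from math import exp


def logistic(x, m, k):
    """Logistic curve with gradient m and 0.5-intercept k."""
    return 1 / (1 + exp(-m * (x - k)))


M, K = 0.5, 10.0   # arbitrary illustrative values
for x in range(0, 21, 5):
    P = logistic(x, M, K)    # rising curve
    Q = logistic(x, -M, K)   # falling curve, the mirror image
    print(x, round(P, 4), round(Q, 4), round(P + Q, 4))
```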

One curve goes up, the other goes down.

The logistic model assumes one degree of freedom.

However, if the number of alternating types increases above 2, not all forms can follow the logistic curve. The only solution to the following equivalence is where the number of types *t* = 2 (identified above).

- ∑_{i=1..t} logistic(*x*, *m*_{i}, *k*_{i}) = 1.

What this means is that if you have an outcome variable with three or more types, plotted over time (say), you cannot expect these outcome probabilities to follow an S-curve, because this would be mathematically impossible!

The following graph is from a paper on the *to*-infinitive perfect, i.e. verb patterns ‘V *to have* V(ed)’, e.g. *claims to have achieved*. Subdividing the preceding verb into semantic classes where alternation would be possible (replacing CLAIM with SAY, for instance), we obtained graphs like the following.

We might be able to argue that SAY or KNOW approximates to a logistic curve, but what can we say about REPORT, which rises and falls over the time period? (This rise and fall is statistically significant, as the confidence intervals indicate.)

In fact, as Nevalainen also points out, results like this should not be surprising, and we should not feel obliged to ‘explain’ them as a defect in the data.

**Note:** In the paper, to allow us to contrast and subdivide patterns of change, we consider probabilities against a global baseline of all potential *to*-infinitive perfect forms, below. (‘Total’ below is equivalent to *p*=1 above.) This presentation tends to downplay the variation within the set of forms but the point remains: REPORT is not behaving logistically!

Given that we cannot expect three-way (and above) alternation patterns to adopt a logistic curve, this raises a question. What kinds of patterns might we expect?

One possibility is that we witness a **hierarchical alternation**. Consider an alternation {*a*, {*b*, *c*}}, where *a* alternates with the pair *b+c*, and *b* independently alternates with *c*. Type *a* may be the modal MUST, whereas *b* and *c* might correspond to HAVE *to* and HAVE *got to* respectively.

- Since *a* alternates with *b+c*, we can assume *p*(*a* | {*a*, *b*, *c*}) follows a logistic curve over time (say).
- Since *b* alternates with *c*, *p*(*b* | {*b*, *c*}) also adopts a logistic curve over the same axis. But note the different baseline: *b* is alternating within the envelope of variation defined by the remainder 1 – *p*(*a* | {*a*, *b*, *c*}).

The following graph is plotted for *x* = 0 to 20, with constant *k* = 10 in both cases, so the point at which the probability is 0.5 coincides for both *a* and *b*. Selecting *m _{a}* = -0.5 and

If you want to understand more closely how this works and wish to experiment with settings, download this spreadsheet. For simplicity, the turning point, or 0.5-intercept, *k*, is the same in both cases, although this is not necessary.

This ‘hill’ is obviously not a logistic curve, although it has a well-defined relationship with the logistic function. It is a logistic curve within the envelope of variation defined by the remainder — the area above the blue dotted line. (The red dotted line is not plotted against the same baseline, so in practice we would not observe this directly unless we reformulated the experiment.)

We only see that the hill-shaped curve is logistic when we change the baseline to the pair {*b*, *c*}. Altering the experimental design allows us to witness the alternation as a simple substitution of one form for another uninfluenced by other factors.
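The hierarchical model can be reproduced in a few lines of Python. This is a sketch, not the spreadsheet's calculation: *k* = 10 and the gradient for *a* (-0.5) come from the text, whereas the gradient for *b* within {*b*, *c*} (`M_B`) is a hypothetical value chosen here so that *b* forms a ‘hill’.

```python
from math import exp, log


def logistic(x, m, k):
    return 1 / (1 + exp(-m * (x - k)))


def logit(p):
    return log(p / (1 - p))


K = 10.0     # shared 0.5-intercept, as in the text
M_A = -0.5   # gradient for a against {a, b, c}, as in the text
M_B = -0.5   # HYPOTHETICAL gradient for b against {b, c}


def shares(x):
    """Return p(a), p(b), p(c) against the common baseline {a, b, c}."""
    p_a = logistic(x, M_A, K)
    p_b_given_bc = logistic(x, M_B, K)       # logistic within the remainder
    p_b = (1 - p_a) * p_b_given_bc           # rescaled to the full baseline
    p_c = (1 - p_a) * (1 - p_b_given_bc)
    return p_a, p_b, p_c


for x in range(0, 21, 5):
    p_a, p_b, p_c = shares(x)
    # The last column is linear in x: b is logistic only against the baseline {b, c}.
    print(x, round(p_a, 3), round(p_b, 3), round(p_c, 3), round(logit(p_b / (p_b + p_c)), 3))
```

Against the full baseline, `p_b` rises and then falls, yet dividing it by `p_b + p_c` recovers a straight line on the logit scale.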

**Hint:** This observation means that when performing logistic regression, it is worth experimenting with hierarchically-structuring variables into type pairs that may plausibly alternate independently from other forms. But it is also possible that alternation may be more than two-way and data does not fit a hierarchical ‘binary tree’ model.

On a logit scale, the same curves look like this. Straight lines on this scale match logistic curves. The green curve does not!

As it is not possible for all three types to simultaneously adopt a logistic curve, an observer examining **at least one** of the patterns will see a non-logistic curve. But this only works if the three types might alternate hierarchically. However, with three or more types in competition, there is no reason why **any** particular type will follow a logistic curve.

Note how we snuck in the word ‘independent’ earlier — we assumed that the transition in preferred use from *b* to *c* (HAVE *to* and HAVE *got to*) would occur completely in isolation from the other substitution (for *a*, MUST).

In practice this pattern of neat independent alternation is unlikely. For exactly the same reason that the result of zooming in on a part of a curve might appear to be a straight line, just because a function matches a logistic curve over a small range does not mean that the overall pattern of change is truly logistic.

The so-called ‘three-body problem’ is a well-known fundamental problem in studying physical dynamic systems, and frequently results in bounded **chaotic** outcomes (Gleick 1977). This blog is not the place to discuss chaos theory, except to note that non-linear behaviour (‘chaos’) arises when three bodies are continuously influencing each other, such as two moons orbiting a planet, or one moon orbiting two planets.

In our case, the equivalent situation is where three or more alternating types exist.

We should not therefore be surprised if results do not converge to a neat logistic curve but some other pattern. Moreover, this is also true for a study that only examines the alternation of two simple types. In other words, even if the possibility of using MUST is ignored in a study of HAVE [*got*] *to*, we cannot guarantee that if the proportion of cases of MUST changes substantially over a given period, this will not influence the HAVE [*got*] *to* alternation so as to alter the shape of the curve.

This final conclusion is a reasonable ‘ecological’ objection to an over-reliance on logistic regression. It also means we should be careful about arguments for or against a particular baseline of study simply on the basis of *r*² scores (measures of fit to a logistic or other line).

This issue is a problem of fitting to a *particular* line. It does not undermine the proper use of confidence intervals or testing for significant differences between observations. In addition to a caution against input axes that are non-linear, we might therefore extend the ‘limited circumstances’ identified above:

- We assume probabilities are free to vary from 0 to 1.
- The envelope of variation must be constant, i.e. it must always be possible for an observed probability to reach 1.
- The alternation is not influenced by other alternating forms or otherwise subject to systematic ecological pressures.

Bowie, J. and Wallis, S.A. 2016. The *to*-infinitival perfect: A study of decline. In Werner, V., Seoane, E., and Suárez-Gómez, C. (eds.) *Re-assessing the Present Perfect*, Topics in English Linguistics (TiEL) 91. Berlin: De Gruyter, 43-94.

Gleick, J. 1977. *Chaos: Making a New Science*, London: Heinemann.

Nevalainen, T. 2015. Descriptive adequacy of the S-curve model in diachronic studies of language change. In Sanchez-Stockhammer, C. (ed.) *Can we predict language change?* Helsinki: Varieng, UoH.

- Excel spreadsheet
- Logistic regression with Wilson intervals
- Freedom to vary and significance tests
- That vexed problem of choice


The Summer School is a short three-day intensive course aimed at PhD-level students and researchers who wish to get to grips with Corpus Linguistics. Numbers are deliberately limited on a first-come, first-served basis. You will be taught in a small group by a teaching team.

Each day begins with a theory lecture, followed by a guided hands-on workshop with corpora, and a more self-directed and supported practical session in the afternoon.

- The Summer School is a primer in Corpus Linguistics for students of the English language. It is designed to be both accessible and inspiring!
- Attendees are taught by world-class researchers at the Survey of English Usage, UCL.
- Students are expected to have a basic knowledge of English linguistics and grammar.
- It will take place in the English Department of University College London, in the heart of Central London.

For more information, including costs, booking information, timetable, see the website.


Back in 2010 I wrote a short article on the **logistic** (‘S’) curve in which I described its theoretical justification, mathematical properties and relationship to the Wilson score interval. That article made two key points.

- We can map any set of independent probabilities *p* ∈ [0, 1] to a flat Cartesian space using the inverse logistic (‘logit’) function, defined as
  - logit(*p*) ≡ log(*p* / (1 – *p*)) = log(*p*) – log(1 – *p*),
  - where ‘log’ is the natural logarithm and logit(*p*) ∈ [-∞, ∞].
- By performing this transformation
  - the logistic curve in probability space becomes a **straight line** in logit space, and
  - Wilson score intervals for *p* ∈ (0, 1) are **symmetrical** in logit space, i.e. logit(*p*) – logit(*w*⁻) = logit(*w*⁺) – logit(*p*).

The logit function is entirely reversible, so we can translate any linear function, *y* = *mx* + *c*, on the logit scale back to the probability scale using the logistic function:

- logistic(*x*) = 1 / (1 + *e*^{–(mx + c)}).
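The reversibility of the two functions can be illustrated with a short Python sketch. The values of *m* and *c* are arbitrary, and the function names are chosen for this example.

```python
from math import exp, log


def logit(p):
    """Inverse logistic: maps a probability in (0, 1) onto the real line."""
    return log(p / (1 - p))


def logistic_line(x, m, c):
    """Maps the straight line y = m*x + c on the logit scale back to a probability."""
    return 1 / (1 + exp(-(m * x + c)))


M, C = 0.4, -3.0   # arbitrary gradient and fitting constant
for x in [0, 5, 10, 15]:
    p = logistic_line(x, M, C)
    # logit(p) reproduces the straight line m*x + c
    print(x, round(p, 4), round(logit(p), 4), round(M * x + C, 4))
```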

Since *m* is the gradient of the line in logit space, and *c* is a fitting constant, it is possible to conceive of *m* as equivalent to a **size of effect** measure: the steeper the gradient, the more dramatic the effect. The coefficient of fit, *r²*, is the inverse of a significance test for the fit, so a high *r²* means a close fit.

This is the rationale used by the method most conventionally referred to as ‘logistic regression’, more accurately **logistic linear regression for multiple independent variables**: the conventional ‘black box’ approach. It relies on a number of mathematical assumptions, including:

- IVs are on a Real or Integer scale, like time (or can be meaningfully approximated by an Integer, such as ‘word length’ or ‘complexity’); and
- logit relationships are linear, i.e. the dependent variable can be fit to a model, or function, consisting of a series of weighted factors and exponents of some or all of the supplied independent variables (a straight line in *n*-dimensional logit space). Consider extending the second graph above in a third dimension so the line declines into the page to see what I mean.

However, properly conceived, logistic regression can do more than that. As we shall see, there are good reasons for believing that relationships may not be linear on a logit scale.

One of the ideas I had at that time is that it would be possible to perform regression in logit space by minimising least square errors over variance, where ‘variance’ is approximated by the square of the Wilson interval width. Any function in logit space can then be converted back to a function in *p* space by the logistic function.

- As anyone who paid attention in maths class knows, linear regression is a method for finding the best line through a set of data points and operates by minimising the total error *e* between each point and the proposed line. So if all your data neatly lines up, then a line might pass through them all and we have a regression score, *r²*, of 1. Usually, however, we have a bit of a scatter. A cloud, in which any line would be as good as any other, will score 0.
- The error is computed by the square of the difference between each point and the line, and then totalled up, so, for a series of probabilities, *p*_{x}, and a function, *P*(*x*), representing the ideal population value:
  - error *e* = ∑(*P*(*x*) – *p*_{x})².
- A variation of this method takes account of a situation where different data points are considered to be of greater certainty than others. In this case we divide the squared errors by the square of the standard deviation for each point, *S*_{x}² (also called the variance):
  - *standard deviation S*_{x} ≡ √(*P*(*x*)(1 – *P*(*x*))/*N*(*x*)),
  - *error e* = ∑(*P*(*x*) – *p*_{x})² / *S*_{x}².
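The two error measures above can be written out directly. The Python sketch below uses invented observations and a deliberately trivial model (a constant probability) purely to show the arithmetic; the function names are hypothetical.

```python
def squared_error(model, observations):
    """Total squared error between the model P(x) and observed p_x."""
    return sum((model(x) - p_x) ** 2 for x, p_x, _ in observations)


def variance_weighted_error(model, observations):
    """Squared errors divided by the model variance P(x)(1 - P(x)) / N(x)."""
    total = 0.0
    for x, p_x, n_x in observations:
        P = model(x)
        variance = P * (1 - P) / n_x     # S_x squared
        total += (P - p_x) ** 2 / variance
    return total


# Hypothetical observations: (x, observed p_x, sample size N(x)).
observations = [(0, 0.4, 100), (1, 0.6, 100)]
flat_model = lambda x: 0.5               # trivial model: constant probability

print(squared_error(flat_model, observations))
print(variance_weighted_error(flat_model, observations))
```

Note that the weighted version has to recompute the variance whenever the model changes, which is the ‘moving target’ issue for a fitting algorithm.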

My proposal was to swap the standard deviation on *P*(*x*) for the Wilson width on *p _{x}* (in logit space). The Wilson interval width is an adjusted error for sample probability values. It contains a constant factor,

- The method is very efficient.
  - The Wilson width only needs to be calculated once for all *p*_{x} values, whereas the standard deviation must be calculated on *P*(*x*), which changes all the time during fitting. (Computing standard deviations using the Gaussian approximation to the Binomial on observations *p*_{x} obtains the mathematically incorrect ‘Wald’ interval.)
  - Regression is guaranteed to converge, as we don’t have a moving target.
- It is also generalisable to other functions. Here we fit a straight line, *P*(*x*) = *m*(*x* – *k*), but a polynomial or other function could readily be employed.
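The proposal might be sketched in Python as follows. This is one possible reading, not the algorithm used in the paper: it fits a straight line in logit space by weighted least squares, with each point weighted by the inverse square of its Wilson interval width measured on the logit scale. Function names and test data are invented, and points with *p* = 0 or 1 are simply skipped because their logit is infinite.

```python
from math import exp, log, sqrt


def logit(p):
    return log(p / (1 - p))


def wilson_interval(p, n, z=1.959964):
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denominator = 1 + z * z / n
    return (centre - spread) / denominator, (centre + spread) / denominator


def fit_logit_line(data):
    """Fit logit(P(x)) = m(x - k) to (x, p, n) triples; return (m, k)."""
    points = []
    for x, p, n in data:
        if p <= 0 or p >= 1:
            continue                          # infinite logit: the point carries no weight
        lower, upper = wilson_interval(p, n)
        width = logit(upper) - logit(lower)   # computed once per point, never updated
        points.append((x, logit(p), 1 / width**2))
    total_w = sum(w for _, _, w in points)
    mean_x = sum(w * x for x, _, w in points) / total_w
    mean_y = sum(w * y for _, y, w in points) / total_w
    s_xx = sum(w * (x - mean_x) ** 2 for x, _, w in points)
    s_xy = sum(w * (x - mean_x) * (y - mean_y) for x, y, w in points)
    m = s_xy / s_xx
    k = mean_x - mean_y / m                   # x at which the fitted line crosses logit = 0
    return m, k


# Synthetic data lying exactly on a logistic curve (m = 0.3, k = 10), plus one p = 1 point.
data = [(x, 1 / (1 + exp(-0.3 * (x - 10))), 200) for x in range(0, 21, 2)]
data.append((25, 1.0, 50))
m, k = fit_logit_line(data)
print(m, k)
```

Because the weights depend only on the observations, the fit has a closed-form solution for a straight line; other functions would need an iterative minimiser, but the weights would still be fixed in advance.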

The following examples draw on data from Bowie and Wallis (forthcoming), in which we used this approach to regress time series data.

Here is some test data from that paper which illustrates the method. (If you are interested, it is the probability of selecting OUGHT *to*-infinitive perfect, ‘*to* HAVE Ved’, out of the ‘modality’ verbs OUGHT, HAVE and BE over 20 decades from COHA — not *exactly* an alternation pattern!)

The dashed straight line represents the line selected by the algorithm. Upper and lower lines track the upper and lower bounds of the 95% Wilson interval.

One problem is that if *p* = 0 or 1, logit(*p*) is infinite. However, if you think about it, the interval width is also infinite, so the point may safely be ignored. This is the data for ‘BE *to* HAVE Ved’.

Finally, here is the same data and regression lines plotted on a probability scale and presented with Wilson score intervals.

This example bears out the earlier observation that we could conceivably apply functions other than straight lines to this data. The second logit graph (lower line in the graph above) may better approximate to a non-linear function.

From my perspective, I prefer not to over-fit data to regression lines. I tend to think that it is better that researchers visualise their data appropriately.

Data may converge to a non-linear function on a logit scale (i.e. not logistic on a probability scale) for at least two reasons.

- There are **multiple ecological pressures** on the linguistic item (in this case, proportions of lemmas out of a set of *to*-infinitive perfects in COHA) or it may be because the item is not free to vary from 0 to 1, which is sometimes the case. We have already noted that these forms are not true alternates.
- The forms may be subject to **multinomial competition**. In this case, we have three competing forms, a 3-body situation which would not be expected to obtain a logit-linear (logistic) outcome. For simplicity, I plotted two curves, but it is wise to bear in mind that the logistic is a model of a simple change along one degree of freedom, and a three-way variable has two degrees.

It is sometimes worth comparing regression scores, *r*², numerically (all fitting algorithms try to maximise *r*²). If we simply wish to establish a direction of change, the threshold for a fit can be low, with scores above 0.75 being citable. On the other hand if we have an *r*² of 0.95 or higher we can meaningfully cite the gradient *m*.

Generally, however, conventional thresholds for fitting tend to be more relaxed than those applied to confidence intervals and significance tests because the underlying logic of the task is quite distinct. Several functions may have a close fit to the given data, and we cannot say that one is *significantly* more likely than another merely by comparing *r*² scores.

Bowie, J. and Wallis, S.A. (forthcoming). The *to*-infinitival perfect: a study of decline.

- Competition between choices over time
- Impossible logistic multinomials
- Binomial confidence intervals and contingency tests


In a recent paper focusing on distributions of simple NPs (Aarts and Wallis, 2014), we found an interesting correlation across text genres in a corpus between two independent variables. For the purposes of this study, a “simple NP” was an NP consisting of a single-word head. What we found was a strong correlation between

- the probability that an NP consists of a single-word head, *p*(single head), and
- the probability that single-word heads were a personal pronoun, *p*(personal pronoun | single head).

Note that these two variables are independent because they do not compete, unlike, say, the probability that a single-word NP consists of a noun, vs. the probability that it is a pronoun. The scattergraph below illustrates the distribution and correlation clearly.

Note that we have not plotted confidence intervals on this graph, although it would be possible to do so.

**Aside:** Scatter (distribution) and confidence intervals are very different concepts. A 95% confidence interval for the mean observed probability *p* averaged across a dataset does **not** imply that 95% of the *data* is within that interval. It means that were we to repeat the experiment 100 times, we would expect this observed mean probability *p* to fall outside the range only about 5 times out of 100. A distribution frequently expresses a much greater spread than the interval on the mean.
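To make the distinction in the aside concrete, here is a minimal sketch. It assumes a Wilson score interval as one common choice of 95% interval on a pooled proportion, and it uses invented per-text counts (not data from the paper). The helper name `wilson_interval` is hypothetical. Pooling counts across texts treats every instance as an independent random draw, which is a simplification made only for illustration.

```python
from math import sqrt
from typing import Tuple


def wilson_interval(p: float, n: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for an observed proportion p out of n cases."""
    if n <= 0:
        raise ValueError("n must be positive")
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical (hits, total) counts per text, for illustration only.
texts = [(30, 100), (55, 100), (42, 100), (70, 100), (48, 100), (25, 100)]

hits = sum(h for h, _ in texts)
total = sum(t for _, t in texts)
p_mean = hits / total                      # pooled (mean) proportion
low, high = wilson_interval(p_mean, total)

# Per-text proportions that happen to lie inside the interval on the mean.
inside = [h / t for h, t in texts if low <= h / t <= high]

print(f"pooled p = {p_mean:.3f}, 95% interval = ({low:.3f}, {high:.3f})")
print(f"{len(inside)} of {len(texts)} per-text proportions lie inside the interval")
```

With these invented counts, most per-text proportions fall outside the interval on the pooled mean. The interval describes uncertainty about the mean, not the spread of the individual texts.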

The paper points out that there is no clean partition between speech and writing for either of these characteristics, or a combination of them. On the other hand, spoken transcriptions have a higher proportion of single-word NPs than written texts, and a higher proportion of those single-word NPs are personal pronouns.

A simple linear correlation of these data points has a fit of *r*² = 0.8213, which is a credible correlation. In the paper we initially wrote:

In plain English, genres appearing to the left of the graph contain a lower proportion of NPs with a single-pronoun head (i.e. the NPs tend to be more complex). Similarly, the text categories appearing towards the bottom of the graph tend to have fewer NPs consisting of personal pronouns as a proportion of the total of nouns, numerals and other single-word NPs (the most likely explanation being that the head words are grammatically more diverse). Despite the fact that these two probabilities are independent, they appear to closely correlate (linear *r*² = 0.82). Moreover, we can see that spoken and written categories, whilst distributing along a continuum, also overlap.

The second sentence above is worth considering.

- If single-head NPs consist wholly of personal pronouns then the other categories that might be single-head NPs (nouns, numerals, other pronouns, etc.) will fall.
- However, the reverse may not be true. Single-head NPs in texts which rarely consist of personal pronouns could be **dominated** by one category: nouns, numerals, etc.

What we need to do is arrive at a plausible measure of **grammatical diversity** that would distinguish between these two alternative explanations. What follows is an exercise in exploratory data analysis.

We could define ‘diversity’ as the probability that two single-word NPs taken at random from each genre have different grammatical categories, out of the available categories: **C** = {noun, personal pronoun, other pronoun, nominal adjective, numeral or proform}. Note that this conceptualisation of ‘grammatical diversity’ is relative to a *particular* set.

**Note:** Diversity is not useful for binary categories, because mutual substitution must apply. For example, if **C** = {personal pronoun, other pronoun} then any decline in the proportion of personal pronouns out of **C** must be explained by a rise in other pronouns.

If we change the set (e.g. subdivide proper and common nouns), the results are likely to be different. We sum across the set, *c* ∈ **C**:

*diversity d*(*c* ∈ **C**) = ∑ *p*(*c*) · (1 – *p*′(*c*)) if *n* > 1; 1 otherwise

where **C** is the set of categories, *p*(*c*) is the probability that item 1 is category *c* and *p*′(*c*) the probability that item 2 is also category *c*. Writing *F*(*c*) for the frequency of category *c* and *n* for the total number of items,

*p*(*c*) = *F*(*c*) / *n*

*p*′(*c*) = (*F*(*c*) – 1) / (*n* – 1)

Using *p*′ for item 2 includes an adjustment for the fact that we already know that the first item is *c*. (Consider: if *n* = 4, the probability of item 2 = item 1 = *c* is calculated out of the remaining three cases. This makes no real difference for large *n*.)

- If *F*(*c*) is zero for any category, *p*(*c*) is zero, and that category is discounted. This means that the measure is robust.
- If *F*(*c*) tends to *n* for any category, then 1 – *p*′(*c*) tends to zero, and that term disappears. The other categories will tend to zero, so *d* will be zero.

The maximum *d* is achieved where each category is equally probable.
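The definition above can be sketched in a few lines of Python. This is an illustrative reading of the formula rather than the code used for the paper: the function name `diversity` and all the frequency counts are hypothetical, chosen only to show the limiting cases just described.

```python
from typing import Mapping


def diversity(freq: Mapping[str, int]) -> float:
    """Probability that two items drawn without replacement differ in category.

    freq maps each category c to its frequency F(c); n is the total count.
    """
    n = sum(freq.values())
    if n <= 1:
        # Following the definition in the text: d = 1 unless n > 1.
        return 1.0
    d = 0.0
    for f in freq.values():
        p = f / n                     # p(c): item 1 is category c
        p_prime = (f - 1) / (n - 1)   # p'(c): item 2 is also c, given item 1 was c
        d += p * (1 - p_prime)        # contributes nothing when f == 0
    return d


# Hypothetical counts, for illustration only (not data from the paper).
dominated = {"noun": 97, "personal pronoun": 1, "numeral": 1, "proform": 1}
even = {"noun": 25, "personal pronoun": 25, "numeral": 25, "proform": 25}
single = {"noun": 100, "personal pronoun": 0}

print("dominated by one category:", round(diversity(dominated), 4))
print("equally probable categories:", round(diversity(even), 4))
print("only one category present:", diversity(single))
```

In this sketch the evenly split set scores highest, the set dominated by one category scores close to zero, and the set containing a single category scores exactly zero.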

If we now return to our data, the following scattergraph plots the probability of the single-word NP being a personal pronoun against *d*. This has a medium correlation *r*² = 0.7156. In effect, this means that over 70% of variation in personal pronoun use could be simply explained by variation in diversity. A high correlation does not logically imply a cause, but a failure to correlate would be evidence against diversity as a plausible explanation. (This is another way of stating refutation of null hypotheses.)

The vertical axis is identical to that in the earlier graph. At first sight this correlation seems to support the claim cited above, that “the most likely explanation being that the head words are grammatically more diverse.” Note that most of the written text categories appear to have a higher level of diversity (and a smaller proportion of personal pronoun use) than spoken transcription categories.

However, we should express caution here. Performing the same correlation analysis with the corresponding proportion of noun heads finds a higher value of *r*² = 0.9077. That is, as the proportion of personal pronouns decreases, the proportion of single nouns increases. So on reflection, alternation with nouns (the next most numerous set) seems to be a better explanation. So we decided to alter the conclusions to the paper (highlighted above) to reflect this.
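As a rough sketch of this kind of comparison, *r*² for a simple linear correlation can be computed as the square of Pearson's *r*. The helper name `r_squared` is hypothetical, and the per-genre values below are invented for illustration. They are not the figures from the paper and will not reproduce the scores quoted above.

```python
from typing import Sequence


def r_squared(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Square of Pearson's r, i.e. r² for a simple linear fit between xs and ys."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r² is undefined when either variable is constant")
    return (sxy * sxy) / (sxx * syy)


# Hypothetical per-genre values, for illustration only.
p_pronoun = [0.62, 0.55, 0.48, 0.35, 0.28, 0.20]   # p(personal pronoun | single head)
d_scores = [0.58, 0.66, 0.64, 0.72, 0.70, 0.78]    # diversity d per genre
p_noun = [0.30, 0.36, 0.44, 0.55, 0.63, 0.70]      # p(noun | single head)

# Compare two candidate explanations for variation in pronoun use.
print("pronoun vs diversity: r² =", round(r_squared(p_pronoun, d_scores), 4))
print("pronoun vs noun:      r² =", round(r_squared(p_pronoun, p_noun), 4))
```

A higher *r*² for one candidate variable indicates a closer linear fit, but it does not by itself show that the difference between the two scores is significant.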

More generally, not all categories semantically alternate, i.e. it is frequently not possible to simply replace any personal pronoun with another pronoun, proform or numeral without having to substantively rewrite a sentence and alter the meaning. This underlines that, whereas this type of approach may be useful for surveying competing trends, really determining what might be going on requires a proper alternation study.

Frequently we will obtain results that could be explained by multiple underlying causes. In this case, variation between text categories in personal pronoun use as a proportion of simple, single-word NPs might be explained by direct competition with a single alternative category (e.g. growth in nouns or numerals) or simply by a tendency to express NPs over a broader range of categories. We found that both were plausible explanations, and indeed, they may both be true simultaneously. But we also found that the hypothesis that pronouns were primarily alternating with nouns obtained a stronger correlation.

Of course, we have had to define diversity to perform this analysis, and in other circumstances diversity may correlate more strongly. This definition is relative to a particular set of categories. Although this measure may be mathematically principled, it should be obvious how important it is to be clear about how diversity is measured when drawing *linguistic* conclusions.

Finally, the difficulty in pinning down a specific explanation in survey results should cause us to consider all claims of this nature to be somewhat conditional. This returns us to one of the core arguments of this blog, i.e. that only by identifying alternation between forms in circumstances when speakers and writers have a choice are we ultimately able to compare different potential explanations with any certainty.

- Is language really “a set of alternations”?
- That vexed problem of choice
- A methodological progression

Aarts, B. and Wallis, S.A. (2014). Noun phrase simplicity in spoken English. In L. Veselovská and M. Janebová (eds.) *Complex Visibles Out There. Proceedings of the Olomouc Linguistics Colloquium 2014: Language Use and Linguistic Structure.* Olomouc: Palacký University. pp. 501-511.
