### Introduction

In a recent paper focusing on distributions of simple NPs (Aarts and Wallis, 2014), we found an interesting correlation across text genres in a corpus between two independent variables. For the purposes of this study, a “simple NP” was an NP consisting of a single-word head. What we found was a strong correlation between

- the probability that an NP consists of a single-word head, *p*(single head), and
- the probability that a single-word head is a personal pronoun, *p*(personal pronoun | single head).

Note that these two variables are independent because they do not compete, unlike, say, the probability that a single-word NP consists of a noun, vs. the probability that it is a pronoun. The scattergraph below illustrates the distribution and correlation clearly.

Note that we have not plotted confidence intervals on this graph, although it would be possible to do so.

**Aside:** Scatter (distribution) and confidence intervals are very different concepts. A 95% confidence interval for the mean observed probability *p* averaged across a dataset does **not** imply that 95% of the *data* is within that interval. It means that, were we to repeat the experiment 100 times, we would expect this observed mean probability *p* to fall outside the range only about 5 times. A distribution frequently expresses a much greater spread than the interval on the mean.
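The difference between scatter and an interval on the mean can be seen in a small simulation. The sketch below is illustrative only: the data are randomly generated stand-ins for per-text probabilities, the function name `share_within_mean_interval` is hypothetical, and the interval is a simple Normal approximation on the mean (an assumption made for brevity, not the method used in the paper).

```python
import math
import random
import statistics


def share_within_mean_interval(data, z=1.96):
    """Return (lower, upper, share) for an approximate 95% interval on the mean.

    'share' is the fraction of individual data points that fall inside
    the interval. The interval uses a Normal approximation (assumption).
    """
    mean = statistics.mean(data)
    std_err = statistics.stdev(data) / math.sqrt(len(data))
    lower, upper = mean - z * std_err, mean + z * std_err
    share = sum(lower <= x <= upper for x in data) / len(data)
    return lower, upper, share


# Simulated per-text probabilities (made-up data, fixed seed for repeatability).
random.seed(1)
simulated = [random.random() for _ in range(200)]

lower, upper, share = share_within_mean_interval(simulated)
print(f"interval on the mean: ({lower:.3f}, {upper:.3f})")
print(f"share of data points inside it: {share:.1%}")
```

With data this widely scattered, only a small minority of points should lie inside the interval, even though the interval describes the mean with 95% confidence.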

The paper points out that there is no clean partition between speech and writing for either of these characteristics, or a combination of them. On the other hand, spoken transcriptions tend to have both a higher proportion of single-word NPs and a higher proportion of personal pronouns among those single-word NPs than written texts.

### Exploring possible explanations

A simple linear correlation of these data points has a fit of *r*² = 0.8213, which is a credible correlation. In the paper we initially wrote:

> In plain English, genres appearing to the left of the graph contain a lower proportion of NPs with a single-pronoun head (i.e. the NPs tend to be more complex). Similarly, the text categories appearing towards the bottom of the graph tend to have fewer NPs consisting of personal pronouns as a proportion of the total of nouns, numerals and other single-word NPs (the most likely explanation being that the head words are grammatically more diverse). Despite the fact that these two probabilities are independent, they appear to closely correlate (linear *r*² = 0.82). Moreover, we can see that spoken and written categories, whilst distributing along a continuum, also overlap.

The second sentence above is worth considering.

- If single-head NPs consist wholly of personal pronouns then the other categories that might be single-head NPs (nouns, numerals, other pronouns, etc.) will fall.
- However, the reverse may not be true. Single-head NPs in texts which rarely consist of personal pronouns could be **dominated** by one category: nouns, numerals, etc.

What we need to do is arrive at a plausible measure of **grammatical diversity** that would distinguish between these two alternative explanations. What follows is an exercise in exploratory data analysis.

### Defining diversity

We could define ‘diversity’ as the probability that two single-word NPs taken at random from each genre have different grammatical categories, out of the available categories: **C** = {noun, personal pronoun, other pronoun, nominal adjective, numeral or proform}. Note that this conceptualisation of ‘grammatical diversity’ is relative to a *particular* set.

**Note:** Diversity is not particularly useful for binary categories, because mutual substitution must apply. For example, if **C** = {personal pronoun, other pronoun} then any decline in the proportion of personal pronouns out of **C** must be explained by a rise in other pronouns.

If we change the set (e.g. subdivide proper and common nouns), the results are likely to be different. We sum across the set, *c* ∈ **C**:

$$
\textit{diversity}\ d(c \in \mathbf{C}) =
\begin{cases}
\sum_{c \in \mathbf{C}} p(c)\,\bigl(1 - p'(c)\bigr) & \text{if } n > 1 \\
1 & \text{otherwise}
\end{cases}
$$

where **C** is the set of categories, *p*(*c*) is the probability that item 1 is category *c* and *p*‘(*c*) the probability that item 2 is category *c*.

$$
p(c) = \frac{F(c)}{n}, \qquad p'(c) = \frac{F(c) - 1}{n - 1}
$$

Using *p*‘ for item 2 includes an adjustment for the fact that we already know that the first item is *c*. (Consider: if *n* = 4, the probability of item 2 = item 1 = *c* is calculated out of the remaining three cases. This makes no real difference for large *n*.)

- If *F*(*c*) is zero for any category, *p*(*c*) is zero, and discounted. This means that the measure is robust.
- If *F*(*c*) tends to *n* for any category, then 1 – *p*‘(*c*) tends to zero, and disappears. The other categories will tend to zero, so *d* will be zero.

The maximum *d* is achieved where each category is equally probable.
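The definition above can be sketched in a few lines of Python. This is a minimal illustration, assuming the input is simply a list of category labels, one per single-word NP. The function name `diversity` and the toy samples are hypothetical, not taken from the paper or its data.

```python
from collections import Counter
from typing import Iterable


def diversity(categories: Iterable[str]) -> float:
    """Probability that two items drawn without replacement differ in category.

    Implements d = sum over c of p(c) * (1 - p'(c)), where
    p(c) = F(c) / n and p'(c) = (F(c) - 1) / (n - 1).
    Returns 1 when n <= 1, following the 'otherwise' case of the definition.
    """
    freq = Counter(categories)  # F(c); categories with F(c) = 0 never appear
    n = sum(freq.values())
    if n <= 1:
        return 1.0
    d = 0.0
    for f in freq.values():
        p = f / n                    # probability that item 1 is category c
        p_prime = (f - 1) / (n - 1)  # probability that item 2 is also c
        d += p * (1 - p_prime)
    return d


# Toy samples (made-up labels and counts, for illustration only).
labels = ["noun", "personal pronoun", "other pronoun", "numeral"]
uniform = labels * 10                               # every category equally frequent
skewed = ["personal pronoun"] * 55 + ["noun"] * 5   # one category dominates
single = ["noun"] * 60                              # only one category present

print(diversity(uniform), diversity(skewed), diversity(single))
```

In these toy samples the evenly spread one scores highest, the sample dominated by one category scores much lower, and the sample with a single category scores zero, matching the limiting cases described above.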

- Elsewhere in this blog we discuss computing confidence intervals for diversity.

### Plotting diversity

If we now return to our data, the following scattergraph plots the probability of the single-word NP being a personal pronoun against *d*. This has a medium correlation *r*² = 0.7156. In effect, this means that over 70% of variation in personal pronoun use could be simply explained by variation in diversity. A high correlation does not logically imply a cause, but a failure to correlate would be evidence against diversity as a plausible explanation. (This is another way of stating refutation of null hypotheses.)

The vertical axis is identical to that in the earlier graph. At first sight this correlation seems to support the claim cited above, that “the most likely explanation being that the head words are grammatically more diverse.” Note that most of the written text categories appear to have a higher level of diversity (and a smaller proportion of personal pronoun use) than spoken transcription categories.

However, we should express caution here. Performing the same correlation analysis with the corresponding proportion of noun heads finds a higher value of *r*² = 0.9077. That is, as the proportion of personal pronouns decreases, the proportion of single nouns increases. So on reflection, alternation with nouns (the next most numerous set) seems to be a better explanation. So we decided to alter the conclusions to the paper (highlighted above) to reflect this.
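Comparing two candidate explanations in this way amounts to computing *r*² for each candidate against the same dependent variable. The sketch below shows the calculation for a simple linear (Pearson) correlation. The function name `r_squared` and all the numbers are invented for illustration; they are not the values from the paper.

```python
from typing import Sequence


def r_squared(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Squared Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r-squared is undefined when a variable is constant")
    return (sxy * sxy) / (sxx * syy)


# Hypothetical per-genre values (made up; one entry per text category).
p_pronoun = [0.60, 0.50, 0.35, 0.20, 0.10]  # p(personal pronoun | single head)
p_noun = [0.30, 0.38, 0.52, 0.66, 0.75]     # p(noun | single head)
d_values = [0.55, 0.62, 0.64, 0.66, 0.60]   # diversity d per genre

print("r^2 with noun proportion:", round(r_squared(p_pronoun, p_noun), 4))
print("r^2 with diversity:      ", round(r_squared(p_pronoun, d_values), 4))
```

Whichever candidate yields the higher *r*² accounts for more of the variation in a linear fit, although, as noted above, neither figure establishes a cause.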

More generally, not all categories semantically alternate, i.e. it is frequently not possible to simply replace any personal pronoun with another pronoun, proform or numeral without having to substantively rewrite a sentence and alter the meaning. This underlines that, whereas this type of approach may be useful for surveying competing trends, really determining what might be going on requires a proper alternation study.

### In conclusion

Frequently we will obtain results that could be explained by multiple underlying causes. Here, variation between text categories in personal pronoun use as a proportion of simple, single-word NPs might be explained by direct competition with a single alternative category (e.g. growth in nouns or numerals) or simply by a tendency to express NPs over a broader range of categories. We found that both were plausible explanations, and indeed, they may both be true simultaneously. But we also found that the hypothesis that pronouns were primarily alternating with nouns obtained a stronger correlation.

Of course, we have had to define diversity to perform this analysis, and in other circumstances diversity may correlate more strongly. This definition is relative to a particular set of categories. Although this measure may be mathematically principled, it should be obvious how important it is to be clear about how diversity is measured when drawing *linguistic* conclusions.

Finally, the difficulty in pinning down a specific explanation in survey results should cause us to consider all claims of this nature to be somewhat conditional. This returns us to one of the core arguments of this blog, i.e. that only by identifying alternation between forms in circumstances when speakers and writers have a choice are we ultimately able to compare different potential explanations with any certainty.

### See also

- The confidence of diversity
- Is language really “a set of alternations”?
- That vexed problem of choice
- A methodological progression

### References

Aarts, B. and S.A. Wallis 2014. Noun phrase simplicity in spoken English. In L. Veselovská and M. Janebová (eds.) *Complex Visibles Out There. Proceedings of the Olomouc Linguistics Colloquium 2014: Language Use and Linguistic Structure.* Olomouc: Palacký University. pp. 501–511.