In a previous post I discussed how to plot confidence intervals on observed probabilities. Using this method we can create graphs like the following. (Data is in the Excel spreadsheet we used previously: for this post I have added a second worksheet.)
The graph depicts both the observed probability of a particular form and the certainty that this observation is accurate. The ‘I’-shaped error bars depict the estimated range of the true value of the observation at a 95% confidence level (see Wallis 2013 for more details).
A note of caution: these probabilities are semasiological proportions (different uses of the same word) rather than onomasiological choices (see Choice vs. use).
In this post I discuss ways in which we can plot intervals on changes (differences) rather than single probabilities.
The clearer our visualisations, the better we can understand our own data, focus our explanations on significant results and communicate our results to others.
The benefit of plotting these intervals should be immediately obvious. They tell us visually whether observations are significantly different over time (or any other contrast).
- We can compare intervals and probabilities horizontally. The simplest way to do this is to compare points in a pairwise fashion. Where intervals do not overlap, observations must be significantly different from each other.
- For example, quotative uses are significantly more frequent in the 2000s than the 1920s, ‘cogitate’ uses have fallen, and so on.
The logic of visual comparison is as follows.
- Do the intervals overlap?
- If no: the observations are significantly different.
- Does either observed probability fall within the other interval?
- If yes: the observations are not significantly different.
- Otherwise test for significance using a 2 × 2 test.
- This works because the error for the difference between two probabilities, the minimum significant difference W, must be greater than the larger of the two inner interval widths, w₁ and w₂, but smaller than their sum. In algebra: max(w₁, w₂) < W < w₁+w₂.
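The decision logic above can be sketched as a short function. This is a hypothetical helper (not part of the example spreadsheet): it takes two observed probabilities, each with its Wilson score bounds, and applies the pairwise heuristic.

```python
def compare_points(p1, lo1, hi1, p2, lo2, hi2):
    """Pairwise heuristic for two observed probabilities, each with a
    Wilson score interval (lo, hi). Returns a verdict string."""
    # 1. The intervals do not overlap at all: certainly significant.
    if hi1 < lo2 or hi2 < lo1:
        return "significant"
    # 2. Either point falls inside the other's interval: not significant.
    if lo2 <= p1 <= hi2 or lo1 <= p2 <= hi1:
        return "not significant"
    # 3. Intervals overlap but neither point is enclosed:
    #    fall back on a full 2 x 2 test (e.g. the Newcombe-Wilson interval).
    return "test needed"
```

Only the third outcome requires any further computation; the first two can be read straight off the graph.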
[Aside: If you compare points vertically between trend lines, you need to apply a different test (a single sample z test for comparing frequencies within a distribution).]
When we plot confidence intervals on single probabilities we use the Wilson score interval. This is an asymmetric interval (see figure above) which cannot exceed the probability range [0, 1].
Wilson intervals can also be calculated using a ‘continuity correction’ (correcting for the fact that frequency data is discrete rather than continuous). In the following example we will use the uncorrected interval, but the method outlined here can also be used with the continuity-corrected Wilson interval (see the previous post).
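For completeness, the uncorrected Wilson score interval can be computed as follows. This is a minimal sketch of the standard formula (z = 1.96 for a 95% confidence level); the function name is mine.

```python
from math import sqrt

def wilson(p, n, z=1.96):
    """Uncorrected Wilson score interval for an observed proportion p
    from a sample of n cases; z is the critical value (1.96 = 95%)."""
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    spread = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - spread, centre + spread
```

Note that the interval is asymmetric about p (except at p = 0.5) and never strays outside [0, 1], even when p itself is 0 or 1.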
Robert Newcombe (1998) proposed a new interval based on the Wilson interval (which we refer to as the Newcombe-Wilson interval) calculated for the difference between two observations d = p₂ – p₁. We will use the notation w₁⁻ and w₁⁺ to refer to the lower and upper bound of the Wilson score interval for p₁. This method can also be used to perform an optimal two-sample independent-population z test.
The simplest formula for computing Newcombe’s interval for d is to employ the sum of independent variances rule (also known as the “Bienaymé formula”). This obtains a difference interval (W⁻, W⁺), defined as follows:
Lower bound, W⁻ = −√((p₁ − w₁⁻)² + (w₂⁺ − p₂)²),
Upper bound, W⁺ = √((w₁⁺ − p₁)² + (p₂ − w₂⁻)²).
We will use capital letters for the difference interval to avoid confusion with the two single intervals. The sketch below illustrates the idea.
The interval for the difference between two probabilities is computed by summing the squares of the inner interval widths, and then taking the square root of the result.
Mathematical note: strictly speaking, the probability space is logistic (curved), rather than Cartesian (flat), so this step involves a conservative approximation. (Consider measuring the hypotenuse of a triangle drawn on the side of a football: when inflated, the hypotenuse is shorter than if the ball were deflated and flattened.) Zou and Donner (2008) comment that unless the interval widths are large, the error thereby introduced is small.
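The two bounds can be computed directly from the Wilson bounds of the two observations. A minimal sketch (the function name is mine):

```python
from math import sqrt

def newcombe_wilson(p1, lo1, hi1, p2, lo2, hi2):
    """Newcombe-Wilson interval (W-, W+) for the difference d = p2 - p1,
    given the Wilson score bounds (lo, hi) for each observation.
    Inner widths are paired diagonally, squared, summed and
    square-rooted (the Bienayme rule)."""
    w_lower = -sqrt((p1 - lo1) ** 2 + (hi2 - p2) ** 2)
    w_upper = sqrt((hi1 - p1) ** 2 + (p2 - lo2) ** 2)
    return w_lower, w_upper
```

The observed difference d is then significant at the chosen level if it falls outside (W⁻, W⁺).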
Comparing intervals: an illustration
Let’s perform the test by constructing an interval for the two ‘cogitate’ points for the 1920s and 1960s. These intervals overlap slightly, so we should test for significance. The raw data is in the table below, taken from the second worksheet in the example spreadsheet. The first row represents p₁, w₁⁻, w₁⁺, the second p₂, etc. The difference d = 0.5967 − 0.7727 = -0.1760.
We use the Wilson score intervals for both points to compute the lower and upper bound of the interval for d. The rule is to take diagonal pairs of inner intervals together (highlighted above).
Lower bound, W⁻ = −√(0.0131 + 0.0047) = −0.1335,
Upper bound, W⁺ = √(0.0071 + 0.0053) = 0.1114.
Since d is less than the lower bound (−0.1760 < −0.1335), the difference between the points p₁ and p₂ is greater than is likely to occur by chance at a 95% confidence level, and therefore we can say that this difference is significant.
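In code, the arithmetic of this example can be reproduced from the squared inner widths quoted above (small rounding differences from the quoted bounds are expected, since the squared widths are themselves rounded):

```python
from math import sqrt

# Squared inner interval widths from the worked 'cogitate' example:
# (p1 - w1_lower)^2 = 0.0131, (w2_upper - p2)^2 = 0.0047,
# (w1_upper - p1)^2 = 0.0071, (p2 - w2_lower)^2 = 0.0053.
d = 0.5967 - 0.7727                  # observed difference
W_lower = -sqrt(0.0131 + 0.0047)     # lower bound of the interval
W_upper = sqrt(0.0071 + 0.0053)      # upper bound of the interval

# d falls below W_lower, so the change is significant.
significant = d < W_lower or d > W_upper
```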
How does this work? The idea is illustrated graphically below. Note that the arrowed ranges (p₁−w₁⁻, w₂⁺−p₂) are employed in the formula for W⁻. The inner interval (the interval in the direction of the change) is created by combining the widths of the two intervals on the inner side of each point.
We can perform the same calculation for every sequential pair of points in each series. This finds that the only other significant difference by time is to be found in the 2000s for quotative uses of think. Although the ‘intend’ use appears to be fluctuating in probability, there is insufficient data for us to conclude that this change over time is significant.
[Tables: for each pair of time periods (’20s to ’60s, ’60s to 2000s, ’20s to 2000s), the spreadsheet lists the difference d and the interval bounds W⁻ and W⁺ for each use of think.]
Producing tables such as these is fine, but it can be difficult for a reader to follow an argument unless you express results visually and plot a graph. A picture can be worth a thousand numbers, but unclear graphs can be misleading.
One of the most common ways change is cited in papers is in terms of percentage difference. We see statements of this kind all the time in the press: “X has grown by 50%” or “Y has fallen by 10%”.
We have already defined simple difference d = p₂ – p₁, so we can define percentage (or proportional) difference very simply.
percentage difference d% = d / p₁.
Percentage difference is the simple difference scaled by the starting point, p₁, so a confidence interval can be obtained by scaling, i.e. by also dividing W⁻ and W⁺ by p₁. Using this formula we can plot graphs like the following.
[Note: For plotting purposes Excel will put confidence intervals at the extremity of the bar, rather than on the x axis (i.e. at zero change), whereas we expressed the NW interval as a range about zero. To plot the same interval at the end of the bar we need to invert the interval: the upper error bar width is |W⁻/p₁|, with |W⁺/p₁| being the lower bar width.]
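The scaling step can be sketched as a short helper (the function name is mine; it requires p₁ > 0):

```python
def percentage_difference(p1, p2, w_lower, w_upper):
    """Percentage difference d% = (p2 - p1) / p1, with its confidence
    interval obtained by scaling the Newcombe-Wilson bounds (W-, W+)
    by the same starting point p1."""
    if p1 == 0:
        raise ValueError("percentage difference from zero is undefined")
    d = p2 - p1
    return d / p1, w_lower / p1, w_upper / p1
```

The zero check anticipates the first of the problems discussed below: when the starting probability is zero, percentage difference simply has no defined value.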
- We can immediately see that ‘cogitate’ uses of think fall significantly over the period, since the decline is greater than the inner interval.
- However, we cannot see any change in quotative and interpretative uses. Unfortunately, in these cases the frequency (and hence the probability, p₁) is zero: since you can’t divide by zero, we get no bar (or an infinite one!).
Unfortunately, percentage difference presents us with a number of problems.
- We have already seen that if p₁ = 0, the results are meaningless. Percentage difference from zero cannot be visualised because it is infinite!
- We cannot easily compare results in different columns because each column is scaled differently. So we cannot employ the logic we used for the significant difference of single points.
- Measuring change relative to a starting point is meaningful in limited circumstances. Exponential growth curves (or growth in S-curves) exhibit doubling over a given period, so when comparing probabilities over time it can make sense to divide by the starting point.
- It may also be feasible to compare growth rates of independent terms.
- An additional problem is that the starting point is also uncertain (see the first graph above).
- A further conceptual problem with percentage difference is that positive and negative percentage differences are not symmetrical: +100% means doubling, whereas the inverse change (halving) is −50% (Aarts et al. 2013).
Simple difference revisited
Is there any way we can visualise change in terms of simple difference (sometimes called ‘swing’) and yet allow viewers to see the relative difference appropriately?
Jill Bowie and I came up with the idea of floating bar charts (Bowie et al. forthcoming). The idea is to plot a range, p₁ to p₂, as a floating column between 0 and 1. We plot Newcombe-Wilson intervals on the end-point (p₂), and shade the bar to reveal the direction of change.
The following chart shows how this works with our data. Note how the direction of change is expressed by shading.
Unlike the percentage swing bar chart above, we can plot all simple differences and identify which of these are statistically significant. This means that we can distinguish between the quotative and interpretative changes (something that could be seen in the bottom right-hand corner of the line graph but not in the bar chart).
[Tip: To plot this graph in Excel we create a stacked chart with three series: (1) a hidden bar with no shading: min(p₁, p₂), (2) an ascending bar: max(d,0), and (3) a descending bar: max(-d, 0). The confidence interval is plotted at the top of the stack. See the second worksheet in this Excel spreadsheet.]
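Outside Excel, the same three stacked series can be computed directly. A minimal sketch of the decomposition (names are mine):

```python
def floating_bar_series(p1, p2):
    """Decompose a change from p1 to p2 into the three stacked series
    described in the tip: a hidden base bar, an ascending (shaded) bar
    and a descending (shaded) bar. Only one of the last two is non-zero."""
    d = p2 - p1
    hidden = min(p1, p2)   # transparent spacer up to the lower point
    up = max(d, 0)         # bar height if the probability rose
    down = max(-d, 0)      # bar height if the probability fell
    return hidden, up, down
```

As in the tip, the Newcombe-Wilson interval is then drawn as an error bar at the top of the stack.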
The idea of floating bar charts is somewhere between the first graph, where we plotted points over time (also on an absolute scale from 0 to 1) and percentage difference graphs.
We think this visualisation is relatively easy to ‘read’. What do you think? Comments, as always, are very welcome!
Q. Look at the intervals below, taken from the first and last figures in this article. Look closely at the interval for the ‘interpretative’ data for 2010 (left) and the difference interval (right).
- The first interval (left) is significantly different from zero: its lower bound does not cross the zero axis.
- In the second figure (reproduced right) the difference from the 1920s probability (which is zero) is not statistically significant: the interval exceeds the observed change.
How can this be possible?
A. These intervals and tests are doing different things.
- The single interval tells us that the sample is sufficiently large for the observed probability to be significantly different from a population value of zero (or close to it). (Indeed, since the sample is drawn from the population, the true value cannot be exactly zero!) This is equivalent to a 2 × 1 goodness of fit χ² test with an extremely skewed expected distribution.
- The difference interval compares two samples drawn from independent populations, one in the 1920s and one in the 2000s. Both samples have independent confidence intervals (see the first figure). The 1920s data is not “zero” but between 0.00 and 0.05. So in fact it is not surprising that the 2000s data, p = 0.04 (from 0.02 to 0.07), is not significantly different from it. This is equivalent to a 2 × 2 χ² test.
The fact that the interval crosses the starting point also means that there is a greater than α/2 = 2.5% chance that the true difference between these values (in the population) could be in the opposite direction to that observed. This is another way of thinking about significant difference: if we say a difference is “significant” we mean it is significantly different from zero, so the difference is either positive or negative, but not both.
Single intervals and difference intervals are performing different functions, in exactly the same way that χ² tests can be used for different types of question.
When you are plotting graphs and intervals, you need to remind yourself what they mean, and make sure you explain this to your readers.
- Plotting confidence intervals on graphs (single probabilities)
- Comparing frequencies within a discrete distribution (from the same sample)
- Choice vs. use
- Excel spreadsheet
Aarts, B., G. Leech, J. Close and S.A. Wallis (eds.) 2013. The Verb Phrase in English: Investigating recent language change with corpora. Cambridge: CUP. » Table of contents and ordering info
Aarts, B., J. Close and S.A. Wallis 2013. Choices over time: methodological issues in current change. Chapter 2 in Aarts et al. (2013). » ePublished.
Bowie, J., S.A. Wallis and B. Aarts forthcoming. Contemporary change in modal usage in spoken British English: mapping the impact of ‘genre’. In: J. van der Auwera and J.I. Marín Arrese (eds.), Current issues on evidentiality and modality in English: theoretical, descriptive and contrastive studies. Berlin: Mouton de Gruyter.
Levin, M. 2013. The progressive verb in modern American English. Chapter 8 in Aarts et al. (2013).
Newcombe, R.G. 1998. Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine 17: 873-890.
Wallis, S.A. 2013. z-squared: the origin and application of χ². Journal of Quantitative Linguistics 20:4, 350-378. » Post
Zou, G.Y. and A. Donner 2008. Construction of confidence limits about effect measures: A general approach. Statistics in Medicine 27: 1693-1702.