Change and certainty: plotting confidence intervals (2)

Introduction

In a previous post I discussed how to plot confidence intervals on observed probabilities. Using this method we can create graphs like the following. (Data is in the Excel spreadsheet we used previously: for this post I have added a second worksheet.)

The graph depicts both the observed probability of a particular form and the certainty that this observation is accurate. The ‘I’-shaped error bars depict the estimated range of the true value of the observation at a 95% confidence level (see Wallis 2013 for more details).

A note of caution: these probabilities are semasiological proportions (different uses of the same word) rather than onomasiological choices (see Choice vs. use).

An example graph plot showing the changing proportions of meanings of the verb think over time in the US TIME Magazine Corpus, with Wilson score intervals, after Levin (2013). Many thanks to Magnus for the data!

In this post I discuss ways in which we can plot intervals on changes (differences) rather than single probabilities.

The clearer our visualisations, the better we can understand our own data, focus our explanations on significant results and communicate our results to others.

The benefit of plotting these intervals should be immediately obvious. They tell us visually whether observations are significantly different over time (or any other contrast).

  • We can compare intervals and probabilities horizontally. The simplest way to do this is to compare points in a pairwise fashion. Where intervals do not overlap, observations must be significantly different from each other.
    • For example, quotative uses are significantly more frequent in the 2000s than the 1920s, ‘cogitate’ uses have fallen, and so on.

The logic of visual comparison is as follows.

  1. Do the intervals overlap?
    • If no: the observations are significantly different.
  2. Does either observed probability fall within the other interval?
    • If yes: the observations are not significantly different.
  3. Otherwise test for significance using a 2 × 2 test.
  • This works because the error for the difference between two probabilities, the minimum significant difference W, must be greater than the larger of the two inner interval widths, w₁ and w₂, but smaller than their sum. In algebra: max(w₁, w₂) < W < w₁+w₂.
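The three-step visual check above can be sketched in a few lines of Python. This is only an illustration of the logic: the function name `visual_triage` and its return strings are invented for this example, and the figures are the 'cogitate' values for the 1920s and 1960s quoted later in this post.

```python
def visual_triage(p1, interval1, p2, interval2):
    """Apply the three-step visual comparison to two observed probabilities.

    interval1 and interval2 are (lower, upper) bounds for p1 and p2.
    """
    lo1, hi1 = interval1
    lo2, hi2 = interval2

    # Step 1: intervals do not overlap, so the difference must be significant.
    if hi1 < lo2 or hi2 < lo1:
        return "significant"

    # Step 2: one point lies inside the other interval, so it cannot be significant.
    if lo2 <= p1 <= hi2 or lo1 <= p2 <= hi1:
        return "not significant"

    # Step 3: the visual check is inconclusive, so a proper test is required.
    return "test needed"


# 'cogitate' in the 1920s and 1960s: the intervals overlap slightly.
result = visual_triage(0.7727, (0.6583, 0.8571), 0.5967, (0.5239, 0.6654))
print(result)  # test needed
```

The third outcome is the case where max(w₁, w₂) < W < w₁+w₂ leaves the question open, which is where the difference interval discussed below comes in.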

[Aside: If you compare points vertically between trend lines, you need to apply a different test (a single sample z test for comparing frequencies within a distribution).]

When we plot confidence intervals on single probabilities we use the Wilson score interval. This is an asymmetric interval (see figure above) which cannot exceed the probability range [0, 1].

Wilson intervals can also be calculated using a ‘continuity correction’ (correcting for the fact that frequency data is discrete rather than continuous). In the following example we will use the uncorrected interval, but the method outlined here can also be used with the continuity-corrected Wilson interval (see the previous post).
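For readers who prefer code to spreadsheets, here is a minimal sketch of the uncorrected Wilson score interval using the standard textbook formula. The function name `wilson_interval` is my own. The counts in the example (51 out of 66) are an assumption: the sample size is not stated in this post, and these counts were chosen only because they reproduce the 1920s 'cogitate' figures given in the table below.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(f, n, confidence=0.95):
    """Uncorrected Wilson score interval for an observed proportion f / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Two-tailed critical value of the Normal distribution (about 1.96 for 95%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = f / n
    denominator = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denominator
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denominator
    return centre - half_width, centre + half_width


# Assumed counts (not given in the post): 51 'cogitate' uses out of 66 cases.
lo, hi = wilson_interval(51, 66)
print(round(lo, 4), round(hi, 4))  # approximately 0.6583 0.8571
```

Note that the interval is not centred on p = 0.7727: the lower part is wider than the upper part, which is the asymmetry referred to above.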

Robert Newcombe (1998) proposed a new interval based on the Wilson interval (which we refer to as the Newcombe-Wilson interval) calculated for the difference between two observations d = p₂ – p₁. We will use the notation w₁⁻ and w₁⁺ to refer to the lower and upper bound of the Wilson score interval for p₁. This method can also be used to perform an optimal two-sample independent-population z test.

The simplest formula for computing Newcombe’s interval for d is to employ the sum of independent variances rule (also known as the “Bienaymé formula”). This obtains a difference interval (W⁻, W⁺), defined as follows:

Lower bound, W⁻ = −√[(p₁−w₁⁻)² + (w₂⁺−p₂)²],
Upper bound, W⁺ = √[(w₁⁺−p₁)² + (p₂−w₂⁻)²].

We will use capital letters for the difference interval to avoid confusion with the two single intervals. The sketch below illustrates the idea.

Calculating the lower bound of the Newcombe-Wilson interval, using the Pythagorean Bienaymé formula.


The interval for the difference between two probabilities is computed by summing the squares of the inner interval widths, and then taking the square root of the result.

Mathematical note: strictly speaking, the probability space is logistic (curved), rather than Cartesian (flat), so this step involves a conservative approximation. (Consider measuring the hypotenuse of a triangle drawn on the side of a football: when inflated, the hypotenuse is shorter than if the ball were deflated and flattened.) Zou and Donner (2008) comment that unless the interval widths are large, the error thereby introduced is small.

Comparing intervals: an illustration

Let’s perform the test by constructing an interval for the two ‘cogitate’ points for the 1920s and 1960s. These intervals overlap slightly, so we should test for significance. The raw data is in the table below, taken from the second worksheet in the example spreadsheet. The first row represents p₁, w₁⁻, w₁⁺, the second p₂, etc. The difference d = 0.5967 − 0.7727 = -0.1760.

              p        w⁻       w⁺       (p−w⁻)²   (w⁺−p)²
1. (1920s)    0.7727   0.6583   0.8571   0.0131    0.0071
2. (1960s)    0.5967   0.5239   0.6654   0.0053    0.0047

Single intervals for ‘cogitate’ in the 1920s and 1960s. The probability of think having a ‘cogitate’ use in the 1920s was between 65.83% and 85.71%, declining to 52.39-66.54% by the 1960s. As ranges overlap we should obtain a more accurate interval to test for their significant difference. The method uses the squared difference terms on the right.

We use the Wilson score intervals for both points to compute the lower and upper bound of the interval for d. The rule is to take diagonal pairs of inner intervals together: (p₁−w₁⁻)² with (w₂⁺−p₂)² for the lower bound, and (w₁⁺−p₁)² with (p₂−w₂⁻)² for the upper bound.

Lower bound, W⁻ = −√(0.0131 + 0.0047) = -0.1335,
Upper bound, W⁺ = √(0.0071 + 0.0053) = 0.1114.

Since d is less than the lower bound (-0.1760 < -0.1335), the difference between the points p₁ and p₂ is greater than is likely to occur by chance at a 95% confidence level, and we can therefore say that the difference is significant.
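The same calculation can be sketched in Python. The function name `newcombe_wilson` is my own, and the inputs are the four-decimal figures from the table above, so the results may differ from the spreadsheet in the fourth decimal place.

```python
from math import sqrt


def newcombe_wilson(p1, w1, p2, w2):
    """Newcombe-Wilson difference interval (W-, W+) about zero for d = p2 - p1.

    w1 and w2 are the (lower, upper) Wilson score bounds for p1 and p2.
    """
    w1_lo, w1_hi = w1
    w2_lo, w2_hi = w2
    # Diagonal pairs: lower bound of p1 with upper bound of p2, and vice versa.
    lower = -sqrt((p1 - w1_lo) ** 2 + (w2_hi - p2) ** 2)
    upper = sqrt((w1_hi - p1) ** 2 + (p2 - w2_lo) ** 2)
    return lower, upper


# 'cogitate' uses of think, 1920s (p1) and 1960s (p2).
p1, w1 = 0.7727, (0.6583, 0.8571)
p2, w2 = 0.5967, (0.5239, 0.6654)

d = p2 - p1
W_lo, W_hi = newcombe_wilson(p1, w1, p2, w2)
significant = d < W_lo or d > W_hi

print(round(d, 4), round(W_lo, 3), round(W_hi, 3), significant)
# -0.176 -0.133 0.111 True
```

The difference is significant if d falls outside the range (W⁻, W⁺), in either direction.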

How does this work? The idea is illustrated graphically below. Note that the arrowed ranges (p₁−w₁⁻, w₂⁺−p₂) are employed in the formula for W⁻. The inner interval (the interval in the direction of the change) is created by combining the widths of the two intervals on the inner side of each point.

Calculating the Newcombe-Wilson difference interval. The inner interval is computed from the lower bound of p₁ and the upper bound of p₂.

We can perform the same calculation for every sequential pair of points in each series. This finds that the only other significant difference by time is to be found in the 2000s for quotative uses of think. Although the ‘intend’ use appears to be fluctuating in probability, there is insufficient data for us to conclude that this change over time is significant.

’20s to ’60s     d         W⁻        W⁺        result
‘cogitate’       -0.1760   -0.1335   0.1114    − sig
‘intend’         0.1318    -0.1111   0.1322    ns
quotative        0.0331    -0.0373   0.0578    ns
interpretive     0.0110    -0.0283   0.0556    ns

’60s to 2000s    d         W⁻        W⁺        result
‘cogitate’       -0.0300   -0.0977   0.0964    ns
‘intend’         -0.0877   -0.0920   0.0911    ns
quotative        0.0907    -0.0544   0.0532    + sig
interpretive     0.0270    -0.0362   0.0339    ns

’20s to 2000s    d         W⁻        W⁺        result
‘cogitate’       -0.2061   -0.1317   0.1082    − sig
‘intend’         0.0442    -0.1059   0.1272    ns
quotative        0.1238    -0.0514   0.0668    + sig
interpretive     0.0381    -0.0353   0.0581    ns

Pairwise comparison tables for time points, see also the Excel spreadsheet.

Percentage difference

Producing tables such as these is fine, but it can be difficult for a reader to follow an argument unless you express results visually and plot a graph. A picture can be worth a thousand numbers, but unclear graphs can be misleading.

One of the most common ways change is cited in papers is in terms of percentage difference. We see statements of this kind all the time in the press: “X has grown by 50%” or “Y has fallen by 10%”.

We have already defined simple difference d = p₂ – p₁, so we can define percentage (or proportional) difference very simply.

percentage difference d% = d / p₁.

Percentage difference is the simple difference scaled by the starting point, p₁, so a confidence interval can be obtained by scaling, i.e. by also dividing W⁻ and W⁺ by p₁. Using this formula we can plot graphs like the following.

Percentage difference, d% = (p₂ – p₁) / p₁, for 1920s to 2000s change. Both quotative and interpretative uses of think have zero probability in the 1920s.

[Note: For plotting purposes Excel will put confidence intervals at the extremity of the bar, rather than on the x axis (i.e. at zero change), whereas we expressed the NW interval as a range about zero. To plot the same interval at the end of the bar we need to invert the interval: the upper error bar width is |W⁻/p₁|, with |W⁺/p₁| being the lower bar width.]
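As a rough sketch of the scaling and the error-bar inversion described above, the following Python function divides d, W⁻ and W⁺ by p₁. The function name and the dictionary keys are invented for this illustration; the figures are the ‘cogitate’ values for the 1920s to 2000s comparison from the table above. Returning `None` when p₁ is zero is my own choice for handling the undefined case.

```python
def percentage_difference(p1, d, w_lower, w_upper):
    """Scale a simple difference d and its interval (W-, W+) by the start point p1."""
    if p1 == 0:
        # Percentage difference from zero is undefined, so there is nothing to plot.
        return None
    return {
        "d_pct": d / p1,
        "interval": (w_lower / p1, w_upper / p1),
        # Error bars drawn at the end of the bar are inverted relative to
        # the interval about zero: the upper bar uses W-, the lower uses W+.
        "bar_upper_width": abs(w_lower / p1),
        "bar_lower_width": abs(w_upper / p1),
    }


# 'cogitate', 1920s to 2000s: p1 = 0.7727, d = -0.2061, (W-, W+) = (-0.1317, 0.1082).
cogitate = percentage_difference(0.7727, -0.2061, -0.1317, 0.1082)
print(round(cogitate["d_pct"], 3))  # -0.267, i.e. a fall of about 27%

# A series that starts at p1 = 0 cannot be plotted this way.
from_zero = percentage_difference(0.0, 0.1238, -0.0514, 0.0668)
print(from_zero)  # None
```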

  • We can immediately see that ‘cogitate’ uses of think significantly fall over the period, since the decline is greater than the inner interval.
  • However, we cannot see any change in quotative and interpretative uses. Unfortunately, in these cases the frequency (and hence the probability, p₁) is zero: since you can’t divide by zero, we get no bar (or an infinite one!).

Unfortunately, percentage difference presents us with a number of problems.

  • We have already seen that if p₁ = 0, the results are meaningless. Percentage difference from zero cannot be visualised because it is infinite!
  • We cannot easily compare results in different columns because each column is scaled differently. So we cannot employ the logic we used for the significant difference of single points.
    • Measuring change relative to a starting point is meaningful in limited circumstances. Exponential growth curves (or growth in S-curves) exhibit doubling over a given period, so when comparing probabilities over time it can make sense to divide by the starting point.
    • It may also be feasible to compare growth rates of independent terms.
  • An additional problem is that the starting point is also uncertain (see the first graph above).
  • A further conceptual problem with percentage difference is that a positive and negative percentage difference do not represent the same thing: +100% means doubling, whereas the inverse (halving) is -50% (Aarts et al. 2013).

Simple difference revisited

Is there any way we can visualise change in terms of simple difference (sometimes called ‘swing’) and yet allow viewers to see the relative difference appropriately?

Jill Bowie and I came up with the idea of floating bar charts (Bowie et al. forthcoming). The idea is to plot a range, p₁ to p₂, as a floating column between 0 and 1. We plot Newcombe-Wilson intervals on the end-point (p₂), and shade the bar to reveal the direction of change.

Plotting absolute difference

From line plots to floating bars. Left, conventional two-column plot of p₁, p₂, etc. Right, the same plot as a floating bar. Where the near-side (inner) interval is within the bar, the change is significant.

The following chart shows how this works with our data. Note how the direction of change is expressed by shading.

Unlike the percentage swing bar chart above, we can plot all simple differences and identify which of these are statistically significant. This means that we can distinguish between quotative and interpretive changes (something that could be seen in the bottom right hand corner of the line graph but not the bar chart).

Floating bar chart of changing usage of think, 1920s-2000s. Shading indicates direction (from light to dark) and intervals are placed at the end-point (2000s). Where the inner interval is within the bar, the change is significant.

[Tip: To plot this graph in Excel we create a stacked chart with three series: (1) a hidden bar with no shading: min(p₁, p₂), (2) an ascending bar: max(d,0), and (3) a descending bar: max(-d, 0). The confidence interval is plotted at the top of the stack. See the second worksheet in this Excel spreadsheet.]
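If you are not using Excel, the three stacked series in the tip can be computed as follows. This is a sketch only: the function name `floating_bar_series` is my own, and the 2000s probability for ‘cogitate’ is derived here as p₁ + d from the tables above rather than read from the spreadsheet.

```python
def floating_bar_series(p1, p2):
    """Return the three stacked series for one floating bar.

    (1) hidden base, (2) ascending bar, (3) descending bar.
    Exactly one of the last two is non-zero unless p1 == p2.
    """
    d = p2 - p1
    hidden = min(p1, p2)
    ascending = max(d, 0.0)
    descending = max(-d, 0.0)
    return hidden, ascending, descending


# 'cogitate' falls from 0.7727 (1920s) to about 0.5666 (2000s, derived as p1 + d).
print(floating_bar_series(0.7727, 0.5666))

# Quotative uses rise from 0 to 0.1238, so the hidden base is zero.
print(floating_bar_series(0.0, 0.1238))
```

In each case the three series sum to max(p₁, p₂), so the visible bar spans exactly the range from p₁ to p₂.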

The idea of floating bar charts is somewhere between the first graph, where we plotted points over time (also on an absolute scale from 0 to 1) and percentage difference graphs.

We think this visualisation is relatively easy to ‘read’. What do you think? Comments, as always, are very welcome!

A question

Q. Look at the intervals below, taken from the first and last figures in this article. Look closely at the interval for the ‘interpretative’ data for the 2000s (left) and the difference interval (right).

Single vs. difference interval. Left, single interval on 2000s data, right, difference interval between 1920s and 2000s data.

  • The first interval (left) is significantly different from zero: the lower interval does not cross the zero axis.
  • In the second figure (reproduced right) the 2000s probability is not significantly different from the 1920s probability (which is zero): the interval exceeds the observed change.

How can this be possible?

A. These intervals and tests are doing different things.

  • The single interval says that if a true value in the population is zero (or close to it), then the sample is sufficiently large so that the observed probability is different from it. (Since the sample is drawn from the population, the true value cannot be zero!) This is equivalent to a 2 × 1 goodness of fit χ² test where the expected value is extremely skewed.
  • The difference interval compares two samples drawn from independent populations, one in the 1920s and one in the 2000s. Both samples have independent confidence intervals (see the first figure). The 1920s data is not “zero” but between 0.00 and 0.05. So in fact it is not surprising that the 2000s data, p = 0.04 (from 0.02 to 0.07), is not significantly different from it. This is equivalent to a 2 × 2 χ² test.
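The two questions can be put side by side in a short Python sketch. It uses only the rounded figures quoted in the answer above (0.00 to 0.05 for the 1920s, and p = 0.04 from 0.02 to 0.07 for the 2000s), so the bounds are approximate and will not match the table exactly.

```python
from math import sqrt

# Rounded figures quoted in the text for the 1920s and 2000s data.
p1, w1_lo, w1_hi = 0.00, 0.00, 0.05  # 1920s
p2, w2_lo, w2_hi = 0.04, 0.02, 0.07  # 2000s

# Question 1 (single interval): does the 2000s interval exclude zero?
single_excludes_zero = w2_lo > 0

# Question 2 (difference interval): does d fall outside (W-, W+)?
d = p2 - p1
W_lo = -sqrt((p1 - w1_lo) ** 2 + (w2_hi - p2) ** 2)
W_hi = sqrt((w1_hi - p1) ** 2 + (p2 - w2_lo) ** 2)
difference_significant = d < W_lo or d > W_hi

print(single_excludes_zero)    # True
print(round(W_hi, 3))          # 0.054, which is larger than d = 0.04
print(difference_significant)  # False
```

The second test has to allow for the uncertainty in the 1920s observation as well as the 2000s one, which is why it is harder to pass.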

The fact that the interval crosses the starting point also means that there is a greater than 5% chance that the true difference between these values (in the population) could be in the opposite direction to that observed. This is another way of thinking about significant difference: if we say a difference is “significant” we mean it is significantly different from zero, hence the difference is either positive, or negative, but not both.

Single intervals and difference intervals are performing different functions, in exactly the same way that χ² tests can be used for different types of question.

When you are plotting graphs and intervals, you need to remind yourself what they mean, and make sure you explain this to your readers.


References

Aarts, B., G. Leech, J. Close and S.A. Wallis (eds.) 2013. The Verb Phrase in English: Investigating recent language change with corpora. Cambridge: CUP.

Aarts, B., J. Close and S.A. Wallis 2013. Choices over time: methodological issues in current change. Chapter 2 in Aarts et al. (2013).

Bowie, J., S.A. Wallis and B. Aarts forthcoming. Contemporary change in modal usage in spoken British English: mapping the impact of ‘genre’. In: J. van der Auwera and J.I. Marín Arrese (eds.), Current issues on evidentiality and modality in English: theoretical, descriptive and contrastive studies. Berlin: Mouton de Gruyter.

Levin, M. 2013. The progressive verb in modern American English. Chapter 8 in Aarts et al. (2013).

Newcombe, R.G. 1998. Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine 17: 873-890.

Wallis, S.A. 2013. z-squared: the origin and application of χ². Journal of Quantitative Linguistics 20:4, 350-378.

Zou, G.Y. and A. Donner 2008. Construction of confidence limits about effect measures: A general approach. Statistics in Medicine 27: 1693-1702.
