goodness of fit – corp.ling.stats

Directional evidence revisited

June 16, 2022 · March 12, 2024 · Sean

End weight bias and templating in conjoined phrase postmodification

Abstract Full Paper (PDF)

The tendency of speakers and writers to place larger constructions at the end of sentences, whether consciously or unconsciously, is well established. This question of ‘end weight’ is usually discussed in relation to grammatical transformations. In this short paper we demonstrate a simple method for investigating a similar phenomenon in coordination patterns where conjoins are either noun phrases, e.g. the X of Y or Z, or prepositional phrases, e.g. the X of Y or of Z. We then investigate whether the coordinated noun phrases (Y, Z) are themselves postmodified, either by another prepositional phrase or by a clause. Because postmodifying phrases and clauses are potentially expansive and grammatically complex, we operationalise them as signifiers of ‘weight’. We find that both sets of coordination patterns are biased by weight towards the end of the sequence.

We also find that patterns where both first and last conjoins in the sequence are postmodified are more frequent than would be expected were the conjoins independently selected. Setting aside potential explanations of directional influence, which cannot be decided inductively, we focus instead on the content of these doubly-postmodified constructions and examine them for evidence of templating, i.e. lexical-syntactic repetition.

We also show that these results are not explicable by semantic ordering in coordination, and contrast evidence from prepositional and clausal postmodification with that from premodifying adjective phrases, where scope ambiguity may also be a factor.


Are embedding decisions independent?

May 17, 2022 · June 19, 2022 · Sean

Evidence from preposition(al) phrases

Abstract Full Paper (PDF)

One of the more difficult challenges in linguistics research concerns detecting how constraints might apply to the process of constructing phrases and clauses in natural language production. In previous work (Wallis 2019) we considered a number of operations modifying noun phrases, including sequential and embedded modification with postmodifying clauses. Notably, we found a pattern of a declining additive probability for each decision to embed postmodifying clauses, albeit a pattern that differed in speech and writing.

In this paper we use the same research paradigm to investigate the embedding of an altogether simpler structure: postmodifying nouns with prepositional phrases. These are approximately twice as frequent, and the structures exhibit as many as five levels of embedding in ICE-GB (two more than are found for clauses). Finally, the embedding model is simplified because only one noun phrase can be found within each prepositional phrase. We discover different initial rates and patterns for common and proper nouns, and certain subsets of pronouns and numerals. Common nouns (80% of nouns in the corpus) do appear to generate a secular decline in the additive probability of embedded prepositional phrases, whereas the equivalent rate for proper nouns rises from a low initial probability, a fact that appears to be strongly affected by the presence of titles.

It may be generally assumed that, like clauses, prepositional phrases are essentially independent units. However, we find evidence from a number of sources indicating that some double-layered constructions may be added as single units. In addition to titles, these constructions include schematic or idiomatic expressions whose head is an ‘indefinite’ pronoun or numeral.

Confidence intervals on goodness of fit ϕ scores

September 8, 2021 · March 10, 2024 · Sean

Introduction

In Wallis (2021), I offered two approaches to computing confidence intervals on the effect size Cramér’s ϕ. I also motivated and summarised approaches to a comparable goodness of fit metric (where a high ϕ score reflects a greater difference and thus a ‘poor fit’).

A goodness of fit evaluation is one where we compare an observed distribution of k cells, say, with an expected distribution of the same number of cells. The test, which is a type of χ² test, has a number of applications. A goodness of fit ϕ score would be expected to range from 0 to 1, with 0 representing identity and 1 representing the opposite, a maximally distinct distribution.
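As a minimal sketch of the comparison described above, the following Python computes the goodness of fit χ² statistic for k observed cells against an expected distribution. The function name `chi_square_gof` and the counts are hypothetical illustrations, not taken from the post or from any corpus, and the sketch stops at χ²: it does not attempt any of the candidate ϕ measures, which are not defined in this excerpt.

```python
def chi_square_gof(observed, expected_props):
    """Goodness of fit chi-square for k observed cell counts.

    observed       -- list of k cell frequencies
    expected_props -- list of k expected proportions summing to 1
    """
    if len(observed) != len(expected_props):
        raise ValueError("observed and expected must have the same number of cells")
    n = sum(observed)
    # Scale the expected proportions up to the observed total.
    expected = [n * p for p in expected_props]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical illustration data (k = 3 cells, n = 100).
observed = [30, 50, 20]
expected_props = [0.25, 0.5, 0.25]
chi2 = chi_square_gof(observed, expected_props)

# An observed distribution identical to the expected one fits perfectly.
chi2_identical = chi_square_gof([25, 50, 25], expected_props)
```

Here the first and third cells each contribute (5²)/25 = 1 and the middle cell contributes nothing, so the statistic is 2 with k – 1 = 2 degrees of freedom, while the identical distribution scores 0.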

In an earlier paper published on this blog (Wallis 2012), I considered a range of possible measures that had this property. However, one of the questions I had left unresolved was how to compute a confidence interval on such a measure.

Why might we want to do this?

To cite or plot measures with confidence intervals, identifying the level of certainty we can ascribe to a particular observed measure.
To compare ϕ with an arbitrary level, e.g. to test if ϕ ≠ D where D ≠ 0. (As we shall see, where k > 2 and ϕ is unsigned, comparing goodness of fit ϕ with 0 is more difficult due to loss of information, and you should employ a goodness of fit test instead.)
To compare two ϕ scores for their significant difference in a given direction, e.g. to establish that, say, ϕ₁ > ϕ₂.

Summing independent, dependent and constrained variances

The Bienaymé theorem serves for computing the total variance of the sum of k independent Normally distributed variables by simple summation of variance.

Bienaymé variance s² = s₁² + s₂² + … + s_k² = ∑s_i². (1)

A total standard deviation s is obtained by taking the square root of Equation (1).
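Equation (1) can be sketched in a few lines of Python. The function name `total_sd_independent` and the three standard deviations are hypothetical values chosen so that the arithmetic is easy to check by hand.

```python
import math


def total_sd_independent(sds):
    """Equation (1): sum the variances of independent variables,
    then take the square root to obtain the total standard deviation."""
    total_variance = sum(s ** 2 for s in sds)
    return math.sqrt(total_variance)


# Hypothetical standard deviations for k = 3 independent variables.
total = total_sd_independent([3.0, 4.0, 12.0])
```

With these values the variances are 9, 16 and 144, which sum to 169, so the total standard deviation is 13, noticeably less than the simple sum 3 + 4 + 12 = 19.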

To estimate a confidence interval on a sum of k independent proportions, ∑p_i, we follow Zou and Donner (2008). A confidence interval on a sum of proportions may be obtained by substituting interval widths, u⁻ = (p – w⁻) and u⁺ = (w⁺ – p), for each s_i term in Equation (1). The width of the interval on each side is then the square root of the result. The constant z_α/2 factors out. See An algebra of intervals.

independent sum ∈ (L, U) = (∑p_i – √∑(p_i – w_i⁻)², ∑p_i + √∑(w_i⁺ – p_i)²). (1′)
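The following Python is a sketch of Equation (1′) under one stated assumption: the bounds w⁻ and w⁺ are taken to be Wilson score interval bounds, which this excerpt does not specify. The function names `wilson` and `independent_sum_interval`, the proportions and the sample sizes are all hypothetical.

```python
import math


def wilson(p, n, z=1.959964):
    """Wilson score interval (w_minus, w_plus) for an observed
    proportion p out of n cases. ASSUMPTION: the excerpt does not
    say which interval supplies w- and w+; Wilson is used here."""
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - spread) / denom, (centre + spread) / denom


def independent_sum_interval(props, sizes, z=1.959964):
    """Equation (1'): interval on a sum of independent proportions,
    combining the lower and upper interval widths by root sum of squares."""
    bounds = [wilson(p, n, z) for p, n in zip(props, sizes)]
    total = sum(props)
    lower_width = math.sqrt(sum((p - lo) ** 2 for p, (lo, hi) in zip(props, bounds)))
    upper_width = math.sqrt(sum((hi - p) ** 2 for p, (lo, hi) in zip(props, bounds)))
    return total - lower_width, total + upper_width


# Hypothetical independent proportions, each from its own sample.
props = [0.1, 0.3, 0.2]
sizes = [50, 80, 120]
L, U = independent_sum_interval(props, sizes)
```

Because the lower and upper widths are combined separately, the resulting interval need not be symmetric about ∑p_i. With a single proportion the function reduces to the Wilson interval itself.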

This assumes that all of these proportions are independent. But what of chi-square-type scenarios, where there are k – 1 degrees of freedom for k proportions summing to 1?

Obviously, we are not interested in the confidence interval for ∑p_i, as this must be 1 (or [1, 1] if you prefer). But we are interested in confidence intervals for the sum of functions of p_i, ∑fn(p_i). Zou and Donner argue that equations of this type should obtain a sound interval provided that the original intervals are sound.

Consider the simplest two-valued 2 × 1 goodness of fit χ². As we know, the two proportions are completely dependent. If p₁ increases, p₂ = 1 – p₁ must fall. The table has a single degree of freedom. Consequently, standard deviations and interval positions are simply summed.

total standard deviation s = s₁ + s₂. (2)

dependent sum ∈ (L, U) = (∑fn(w_i⁻), ∑fn(w_i⁺)), (2′)

for an increasing monotonic function, fn, over P = [0, 1]. We will discuss other function types below.
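Equation (2′) as stated can be sketched as follows. The function names, the choice of fn(p) = p² and the interval bounds are hypothetical; the bounds are chosen to be complementary (w₁⁻ = 1 – w₂⁺), as they would be for two proportions that sum to 1.

```python
def dependent_sum_interval(bounds, fn):
    """Equation (2'): apply fn to every lower bound and sum, then do
    the same for every upper bound. Assumes fn is monotonically
    increasing over [0, 1]; other function types are not handled here."""
    lower = sum(fn(lo) for lo, hi in bounds)
    upper = sum(fn(hi) for lo, hi in bounds)
    return lower, upper


def square(p):
    """An example of a function that increases monotonically on [0, 1]."""
    return p * p


# Hypothetical (w-, w+) pairs for two completely dependent proportions.
bounds = [(0.2, 0.4), (0.6, 0.8)]
L, U = dependent_sum_interval(bounds, square)
```

With these bounds the lower sum is 0.04 + 0.36 = 0.40 and the upper sum is 0.16 + 0.64 = 0.80. The transformed bounds are added directly, with no root sum of squares, which is the contrast with the independent case.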

Another way of thinking about this is that independent variables are considered to vary at right angles (orthogonally) to each other, whereas strictly dependent variables vary along the same axis. In some circumstances this means variables subtract and even cancel each other out; in others (like χ²) they sum.

Figure 1. Left: standard deviation of the sum of independent variables x, y, z. Right: summing standard deviations of two dependent variables on the same axis.

How do we generalise this idea to closed k × 1 goodness of fit χ² tables, where there are k – 1 degrees of freedom? Now there are fewer dimensions than variables.
