Impossible logistic multinomials

Introduction

The fundamental assumption of logistic regression is that a probability representing a true fraction, or share, of a quantity undergoing a continuous process of change would by default follow a logistic or ‘S-curve’ pattern. This is a reasonable assumption in certain limited circumstances because it is analogous to a straight line (cf. Newton’s first law of motion).

Regression is a set of computational methods that attempts to find the closest match between an observed set of data and a function, such as a straight line, a polynomial, a power curve or, in this case, an S-curve. We say that the logistic curve is the underlying model we expect data to be matched against (regressed to). In another post, I comment on the feasibility of employing Wilson score intervals in an efficient logistic regression algorithm.
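As a rough illustration of regressing data to an S-curve, the sketch below fits a logistic curve by ordinary least squares on logit-transformed proportions. This is a simple unweighted method chosen for brevity, not the Wilson-based algorithm mentioned above. The function names, the parameters k and c, and the sample data are all hypothetical, and the method assumes every observed proportion lies strictly between 0 and 1.

```python
import math

def logit(p):
    """Inverse logistic function; p must lie strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def logistic(x, k, c):
    """S-curve with slope k and intercept c in logit space."""
    return 1.0 / (1.0 + math.exp(-(k * x + c)))

def fit_logistic(xs, ps):
    """Least-squares straight line through the points (x, logit(p)).

    Returns the slope k and intercept c of that line.
    """
    ys = [logit(p) for p in ps]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx
    return k, mean_y - k * mean_x

# Hypothetical, noise-free data generated from a known curve (k = 0.8, c = -0.4).
xs = [-3, -2, -1, 0, 1, 2, 3]
ps = [logistic(x, 0.8, -0.4) for x in xs]
k, c = fit_logistic(xs, ps)
```

With noise-free input the fit should recover the generating parameters. With real observations, points near 0 or 1 would dominate an unweighted fit, which is one reason a weighted approach may be preferable.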

We have already noted that change is assumed to be continuous, which implies that the input variable (x) is real and linear, such as time (and not e.g. probabilistic). In this post we discuss different outcome variable types. What are the ‘limited circumstances’ in which logistic regression is mathematically coherent?

  • We assume probabilities are free to vary from 0 to 1.
  • The envelope of variation must be constant, i.e. it must always be possible for an observed probability to reach 1.

Taken together, this also means that probabilities are Binomial, not multinomial. Let us discuss what this means.
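A minimal sketch of why the second condition sits uneasily with multinomial outcomes. It assumes a hypothetical three-way choice in which the first share, p1, itself follows a logistic curve with k = 1. The largest value any competing share can then take is 1 – p1, so its ceiling moves with x and does not stay at 1. The function names are illustrative only.

```python
import math

def logistic(x, k=1.0, x0=0.0):
    """Logistic curve centred on x0 with slope parameter k."""
    return 1.0 / (1.0 + math.exp(-k * (x - x0)))

def ceiling_for_other_shares(x):
    """Largest value a competing share can take once p1 = logistic(x) is fixed."""
    return 1.0 - logistic(x)

# The ceiling shrinks as p1 grows, so the envelope of variation is not constant.
for x in (-4, 0, 4):
    print(x, round(logistic(x), 3), round(ceiling_for_other_shares(x), 3))
```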

UCL Summer School in English Corpus Linguistics 2015

The third annual Summer School in English Corpus Linguistics will be held at University College London from 6 to 8 July.

The Summer School is a short three-day intensive course aimed at PhD-level students and researchers who wish to get to grips with Corpus Linguistics. Numbers are deliberately limited on a first-come, first-served basis. You will be taught in a small group by a teaching team.

Each day begins with a theory lecture, followed by a guided hands-on workshop with corpora, and a more self-directed and supported practical session in the afternoon.

Aims and objectives of the course

  • The Summer School is a primer in Corpus Linguistics for students of the English language. It is designed to be both accessible and inspiring!
  • Attendees are taught by world-class researchers at the Survey of English Usage, UCL.
  • Students are expected to have a basic knowledge of English linguistics and grammar.
  • It will take place in the English Department of University College London, in the heart of Central London.

For more information, including costs, booking details and the timetable, see the website.

Logistic regression with Wilson intervals

Introduction

Back in 2010 I wrote a short article on the logistic (‘S’) curve in which I described its theoretical justification, mathematical properties and relationship to the Wilson score interval. That article made two key observations.

  • We can map any set of independent probabilities p ∈ [0, 1] to a flat Cartesian space using the inverse logistic (‘logit’) function, defined as
    • logit(p) ≡ log(p / (1 – p)) = log(p) – log(1 – p),
    • where ‘log’ is the natural logarithm and logit(p) ∈ [-∞, ∞].
  • By performing this transformation
    • the logistic curve in probability space becomes a straight line in logit space, and
    • Wilson score intervals for p ∈ (0, 1) are symmetrical in logit space, i.e. logit(p) – logit(w⁻) = logit(w⁺) – logit(p).
Logistic curve (k = 1) with Wilson score intervals for n = 10, 100.
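The two observations can be checked numerically. The sketch below assumes the standard Wilson score interval formula with a two-tailed 95% critical value (z ≈ 1.96). The function names are illustrative and are not taken from the original article.

```python
import math

def logit(p):
    """Inverse logistic function; p must lie strictly between 0 and 1."""
    return math.log(p) - math.log(1.0 - p)

def logistic(x, k=1.0, x0=0.0):
    """Logistic curve centred on x0 with slope parameter k."""
    return 1.0 / (1.0 + math.exp(-k * (x - x0)))

def wilson(p, n, z=1.959964):
    """Wilson score interval (lower, upper) for proportion p and sample size n."""
    denom = 1.0 + z * z / n
    centre = p + z * z / (2.0 * n)
    spread = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n))
    return (centre - spread) / denom, (centre + spread) / denom

# 1. The logistic curve (k = 1) becomes the straight line y = x in logit space.
line_errors = [abs(logit(logistic(x)) - x) for x in (-3, -1, 0, 1, 3)]

# 2. Wilson bounds are equidistant from p in logit space.
p, n = 0.3, 10
lower, upper = wilson(p, n)
left_gap = logit(p) - logit(lower)
right_gap = logit(upper) - logit(p)
```

Here `left_gap` and `right_gap` should agree to within floating-point error, even though the interval is visibly asymmetric about p in probability space.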


Is “grammatical diversity” a useful concept?

Introduction

In a recent paper focusing on distributions of simple NPs (Aarts and Wallis, 2014), we found an interesting correlation across text genres in a corpus between two independent variables. For the purposes of this study, a “simple NP” was an NP consisting of a single-word head. What we found was a strong correlation between

  1. the probability that an NP consists of a single-word head, p(single head), and
  2. the probability that a single-word head is a personal pronoun, p(personal pronoun | single head).

Note that these two variables are independent because they do not compete, unlike, say, the probability that a single-word NP consists of a noun, vs. the probability that it is a pronoun. The scattergraph below illustrates the distribution and correlation clearly.

Scattergraph of text genres in ICE-GB; distributed (horizontally) by the proportion of all noun phrases consisting of a single word and (vertically) by the proportion of those single-word NPs that are personal pronouns; spoken and written, with selected outliers identified.


What might a corpus of parsed spoken data tell us about language?

Abstract

This paper summarises a methodological perspective towards corpus linguistics that is both unifying and critical. It emphasises that the processes involved in annotating corpora and carrying out research with corpora are fundamentally cyclic, i.e. involving both bottom-up and top-down processes. Knowledge is necessarily partial and refutable.

This perspective unifies ‘corpus-driven’ and ‘theory-driven’ research as two aspects of a research cycle. We identify three distinct but linked cyclical processes: annotation, abstraction and analysis. These cycles exist at different levels and perform distinct tasks, but are linked together such that the output of one feeds the input of the next.

This subdivision of research activity into integrated cycles is particularly important in the case of working with spoken data. The act of transcription is itself an annotation, and decisions to structurally identify distinct sentences are best understood as integral with parsing. Spoken data should be preferred in linguistic research, but current corpora are dominated by large amounts of written text. We point out that this is not a necessary aspect of corpus linguistics and introduce two parsed corpora containing spoken transcriptions.

We identify three types of evidence that can be obtained from a corpus: factual, frequency and interaction evidence, representing distinct logical statements about data. Each may exist at any level of the 3A hierarchy. Moreover, enriching the annotation of a corpus allows evidence to be drawn based on those richer annotations. We demonstrate this by discussing the parsing of a corpus of spoken language data and two recent pieces of research that illustrate this perspective.

ICAME talk on rebalancing corpora

I will be speaking on problems of corpus sampling and the evaluation of independent variable interaction at the 35th ICAME conference in Nottingham this week.

My slides are available here.

Coping with imperfect data

Introduction

One of the challenges for corpus linguists is that many of the distinctions we wish to make are either not annotated in a corpus at all or, if they are annotated, the quality of the annotation may be imperfect. This frequently arises in corpora that have been annotated by an algorithm whose results have not been checked by linguists. However, the problem is a general one: we would always recommend that cases be reviewed for accuracy of annotation.

A version of this issue also arises when checking for the possibility of alternation, that is, ensuring that items of Type A can be replaced by Type B items, and vice versa. An example might be epistemic modal shall vs. will. Most corpora, including richly-annotated corpora such as ICE-GB and DCPSE, do not include modal semantics in their annotation scheme. In such cases the issue is not that the annotation is “imperfect”, but rather that our experiment relies on a presumption that the speaker has the choice of either type at any observed point (see Aarts et al. 2013).
