ICAME talk on rebalancing corpora

I will be speaking on problems of corpus sampling and the evaluation of independent variable interaction at the 35th ICAME conference in Nottingham this week.

My slides are available here.

3 responses to “ICAME talk on rebalancing corpora”

  1. Nice presentation. However, you should consider the fact that random sampling requires a well-defined research population. Corpora aiming to serve as representative samples of the “language” by definition cannot be compiled using random sampling methods, since the population (“language”) cannot be defined in detail. You can’t ensure that each text has the same probability of getting into the corpus as every other text produced in the language. You don’t have a master “register” of all the texts (written and spoken) produced, so you cannot select texts at random.
    I think the so-called “representative” general language corpora are in fact non-random judgment samples.

    • Thanks George. Without being unfair to others’ efforts (and corpus compilation is a big effort!), I tend to agree with you.

      The degree to which sampling is undertaken genuinely randomly is a very good question. With large corpora, the assumption has tended to be that it doesn’t matter so much (a misconstrual of the Central Limit Theorem) – but in fact compilers have very little control over representativeness. And they tend to depend on ease of availability, so there is often a strong bias towards particular types of easily-available material, e.g. student writing, and in web corpora, emails, blogs, etc.
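      The point about the Central Limit Theorem can be sketched with a small simulation (hypothetical numbers): a larger sample shrinks the variance of an estimate, but if the sampling itself is biased, the estimate simply converges more tightly on the wrong value.

```python
import random

random.seed(1)

# Hypothetical illustration: suppose a feature occurs at rate 0.20 in the
# population of interest, but easily-available material (the material we
# actually sample) exhibits it at rate 0.30. Increasing the sample size n
# reduces sampling error, yet the estimate homes in on the biased 0.30,
# not the true 0.20. Sample size cannot cure selection bias.

def estimate(n, biased_rate=0.30):
    """Estimate the feature rate from n texts drawn from biased material."""
    hits = sum(random.random() < biased_rate for _ in range(n))
    return hits / n

for n in (100, 10_000, 1_000_000):
    print(n, round(estimate(n), 3))
# The estimates settle near the biased 0.30 however large n becomes.
```

      The rates and the framing are illustrative assumptions, not measurements from any real corpus.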

      In smaller ‘balanced’ corpora, such as the first Survey corpus, Brown, etc., I think that sampling tended to be done in a somewhat naive way, and there are things that we should do differently. In ICE-GB, for example, the first text contains the introduction of the recording.

      Although this is less than ideal, whether this presents a problem for the research question being considered is a different question. Noting that a corpus is not a true random sample does not mean that we should junk the data, but it is a weakness in the experimental design compared to the ideal.

      In presenting this paper I wanted to draw attention to this issue as a genuine problem requiring some careful thought, both for compilers and analysts.

      This is also why I have always argued that researchers should frame their explanations in terms of a population of language data sampled in the same way as the corpus in question, not “language in general” (whatever that may be).


      • You are absolutely right! BTW it would be interesting to investigate how various deviations from pure random sampling affect quantitative information extracted from subcorpora. It would be nice to have an idea of the bias introduced by specific genre/topic over- or under-representations…
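        A toy simulation of the kind of investigation suggested above might look like this (all figures are invented for illustration): two genres in which a feature occurs at different rates, with the corpus over-representing one of them.

```python
import random

random.seed(42)

# Hypothetical setup: in Genre A the feature occurs at rate 0.30, in
# Genre B at rate 0.10, and the population is an even 50/50 split, so the
# true overall rate is 0.5*0.30 + 0.5*0.10 = 0.20. Varying the genre mix
# in the sample shows how over-representation shifts the estimate.

def sample_rate(p_genre_a, n=100_000):
    """Estimate the overall feature rate when Genre A makes up
    proportion p_genre_a of the sampled texts."""
    hits = 0
    for _ in range(n):
        rate = 0.30 if random.random() < p_genre_a else 0.10
        hits += random.random() < rate
    return hits / n

print(sample_rate(0.5))  # balanced sample: close to the true 0.20
print(sample_rate(0.8))  # Genre A over-represented: close to 0.26
```

        The gap between the two estimates (0.20 vs. roughly 0.26) is exactly the genre-imbalance bias in question, and it does not shrink as the sample grows.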
