This is a very broad question, ultimately answered empirically by the performance of a particular parser.
However, to predict performance, we might consider the types of structure that a parser is likely to find difficult and then examine a parsed corpus of speech and writing for key statistics.
Variables such as mean sentence length or main clause complexity are often cited as proxies for parsing difficulty. However, sentence length and complexity are likely to be poor guides in this case. Spoken data is not split into sentences by the speaker; rather, utterance segmentation is a matter of transcriber/annotator choice. An annotator could artificially improve measured performance simply by increasing the number of sentence subdivisions. Complexity ‘per sentence’ is similarly potentially misleading.
In the original London-Lund Corpus (LLC), spoken data was split by speaker turns, and phonetic tone units were marked. In the case of speeches, a speaker turn could amount to a very long compound ‘run-on’ sentence. In practice, when texts were parsed, speaker turns might be split at coordinators or following a sentence adverbial.
In this discussion paper we will use the British Component of the International Corpus of English (ICE-GB, Nelson et al. 2002) as a test corpus of parsed speech and writing. It is worth noting that both the spoken and written components were parsed by the same tools and research team.
A very clear difference between speech and writing in ICE-GB is found in the degree of self-correction. The mean rate of self-correction in ICE-GB spoken data is 3.5% of words; the rate for writing is 0.4%. The spoken genre with the lowest level of self-correction is broadcast news (0.7%). Among written genres, student examination scripts have by far the highest rate, with around 5% of words crossed out by writers, followed by social letters and student essays at around 0.8% of words marked for removal.
However, self-correction can be addressed at the annotation stage: the self-corrected material is removed from the input to the parser, the simplified sentence is parsed, and the output is reintegrated with the original corpus string. To identify issues of parsing complexity, therefore, we need to consider the sentence minus any self-correction. Are there other factors that may make the spoken input stream more difficult to parse than writing?
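The removal-and-reintegration step can be sketched in code. This is a minimal illustration, not the ICE-GB toolchain: the `<->` … `</->` markers and the helper below are hypothetical stand-ins for whatever self-correction markup a given corpus actually uses.

```python
import re

def strip_self_correction(utterance: str):
    """Remove material marked as self-correction, recording each removed
    span and its character offset so it can be reintegrated later.

    Assumes (hypothetically) that self-corrected material is enclosed
    in <-> ... </-> markers.
    """
    removed = []

    def _cut(match):
        removed.append((match.start(), match.group(1)))
        return ""

    cleaned = re.sub(r"<->\s*(.*?)\s*</->\s*", _cut, utterance)
    return cleaned, removed

# The cleaned string is what would be passed to the parser; the removed
# spans can later be reattached to the parse output at their offsets.
cleaned, removed = strip_self_correction("<-> we go </-> we went there")
print(cleaned)   # we went there
print(removed)   # [(0, 'we go')]
```

The key point is that the parser only ever sees the simplified string, so estimates of parsing difficulty should likewise be based on the sentence minus self-correction.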
Perhaps a more revealing estimate of top-level complexity concerns the extent to which, following parsing, these segments, termed ‘parse units’, are not considered grammatically to be clauses. The scattergraph below plots the mean proportion of parse units that are ‘non clauses’ rather than clauses on the horizontal axis. The category of ‘non clause’ does not include subjectless or verbless clauses (see below), but may include standalone phrases and pragmatically meaningful utterances (sometimes called ‘clause fragments’). By contrast, the vertical axis shows the mean proportion of incomplete clauses: clauses that have been rendered incomplete, for example because the speaker was interrupted. (We have not included confidence intervals because we are interested in the overall scatter.)
- Overall in ICE-GB, the proportion of ‘non clause’ parse units in the spoken data (on average, 29% of parse units) is twice that in the written component (14%). Business letters are an outlier, apparently due to the inclusion of full addresses and other formal ephemera. At the upper left of the written distribution, press editorials have the highest proportion of incomplete clauses, while fewer than one in twenty of their parse units are considered non clauses.
- Comparing means, there are over four times the proportion of incomplete clauses in spoken transcripts compared to written text (2.15% to 0.51%). Means are shown with ‘X’ symbols in the scattergraph.
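The two ratios quoted above follow directly from the reported means; as a quick arithmetic check (figures taken from the text):

```python
# Mean proportions reported for ICE-GB (from the scattergraph discussion)
non_clause_spoken, non_clause_written = 0.29, 0.14
incomplete_spoken, incomplete_written = 0.0215, 0.0051

non_clause_ratio = non_clause_spoken / non_clause_written    # roughly twice
incomplete_ratio = incomplete_spoken / incomplete_written    # over four times

print(round(non_clause_ratio, 2), round(incomplete_ratio, 2))
```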
This scattergraph distinguishes written and spoken data to a much greater extent than, e.g., an analysis of noun phrases (Aarts and Wallis 2014). This indicates that the challenges in the parsing of speech data lie principally in high-level structure. Getting the top-level analysis correct is the most difficult challenge in any parsing enterprise. The sheer proportion of non clauses in speech, and the relatively high proportion of incomplete clauses, should make us cautious about accepting performance estimates based on the parsing of written data when we are concerned with the parsing of speech.
Other types of clause reduction
Spoken data is not necessarily more complex in other respects. For example, speech data is generally less likely than writing to include subjectless or verbless clauses. The following scattergraph plots the mean probabilities of clauses being subjectless (vertical axis) and verbless (horizontal axis) for ICE-GB text categories within speech and writing. The highest proportion of verbless clauses in any genre is found in spontaneous commentaries, a spoken genre which encourages concise phrasing, for example:
England have won four [the Soviet Union three] with three drawn [S2A-001 #167]
Subordination and coordination
Compared to writing, a lower proportion of clauses in speech are analysed as compound clauses, but this seems to be an artefact of the sentence segmentation decisions we discussed earlier. In the case of ICE-GB speech data, large coordinated spoken clauses were frequently split at the coordinator, with the coordinator (and, but, etc.) then treated as a connective introducing a new clause. This decision is semantic and stylistic (in writing, it is termed ‘avoiding run-on sentences’), although it could be argued that in the parsing of ICE-GB, annotators over-compensated.
In objective lexical terms, the spoken data has a somewhat greater tendency to employ coordinating words. There are 15% more connectives or coordinators per word in ICE-GB spoken data than in writing, and 4% more subordinating conjunctions.
If ICE-GB spoken utterances were over-zealously subdivided, this subdivision had a greater impact on coordinated clauses than on subordinate ones, but it affected subordination nonetheless. Thus in the spoken data, the proportion of ‘dependent’ (subordinate) clauses out of those explicitly marked as either main or dependent is actually 85% of the equivalent rate in the written data, despite the greater rate of subordinating conjunctions.
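The 85% figure is a ratio of two proportions. A small sketch, with hypothetical clause counts chosen only to reproduce the reported relationship (these are not ICE-GB figures):

```python
def dependent_rate(dependent: int, main: int) -> float:
    """Proportion of dependent clauses among clauses explicitly marked
    as either main or dependent."""
    return dependent / (dependent + main)

# Hypothetical counts, chosen for illustration only:
spoken_rate = dependent_rate(34_000, 66_000)    # 0.34
written_rate = dependent_rate(40_000, 60_000)   # 0.40

# The spoken rate is 85% of the written rate:
print(spoken_rate / written_rate)
```

Comparing rates in this way, rather than raw clause counts, keeps the comparison independent of the differing sizes of the spoken and written components.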
In summary, the main factor that might make speech harder to parse than writing is that spoken data tends to be more grammatically incomplete than written data. The high proportion of ‘non clauses’ and the greater proportion of clauses marked as incomplete both indicate that this is where the principal difficulty lies. This incompleteness is in addition to self-correction, that is, where speakers correct their own utterances, which (as noted above) can be removed before parsing.
Aarts, B. and Wallis, S.A. 2014. Noun phrase simplicity in spoken English. In L. Veselovská and M. Janebová (eds.) Complex Visibles Out There. Proceedings of the Olomouc Linguistics Colloquium 2014: Language Use and Linguistic Structure. Olomouc: Palacký University. pp. 501-511.
Nelson, G., Wallis, S.A. and Aarts, B. 2002. Exploring Natural Language: Working with the British Component of the International Corpus of English. Amsterdam: John Benjamins.