II WHAT FORMS CAN LINGUISTICALLY RELEVANT INFORMATION TAKE?
Most normal language users believe that understanding is an immediate and effortless consequence of listening to speech. Against this background, describing formally what is involved in successful speech understanding is surprisingly difficult. To describe spoken language demands a complex representation which can take many forms. A useful view of the speech communication process is as a set of sub-processes inside the brains of talkers and listeners. The first set is in the talker starting with the intention to communicate, and involves a series of normally hierarchical stages where implicit knowledge about word meanings, syntax, word-sound correspondence etc. is used to encode a message into an acoustic signal. The listener is supposed to decode the signal using an approximately matched set of hierarchical but inverse perceptual processing stages, beginning with an auditory representation and terminating in recovery of the talker’s message and hence “understanding”. Each processing stage is assumed to transform the message from one internal representation to another, preserving linguistically relevant information. A full account of linguistic communication would thus require a specification of each representation and a detailed description of the mechanism of each processing stage. This view is not an explanatory model of the process but a starting framework within which detailed models could be proposed. The psychological reality of a particular model has then to be established by experimental investigation.
Although normal and abnormal linguistic and phonetic structures can be described in a fashion that is logically rigorous (see Cowie and Douglas-Cowie, 1982; this volume) the only readily accessible data which can be measured in a physical sense are the optical correlates of speech and the acoustic speech signal; if one is interested in production, various physiological measures of articulatory behaviour may be added. However, if used in isolation, conventional techniques for acoustical analysis of speech do not illuminate directly its linguistically significant properties. This issue - the nature of acoustic correlates of linguistic units - is a central one for this chapter and will be considered in detail. We must begin, however, with a brief discussion of some ways of conceptualising the elements of a linguistic message.
I shall refer to the structures that generate speech - vocal cords, pharynx, soft and hard palate, tongue, teeth, jaw, lips, nasal passages etc. - as forming the vocal tract, and to the larger moving parts - lips, tongue and jaw - as the major articulators in the vocal tract. Measurements of articulator movement reveal intricate motor patterns. A simple demonstration is to attend to all the detailed antics in your own vocal tract while speaking this sentence aloud in slow motion; this will confirm that speaking is a complex act which demands precise control and coordination of a large number of muscles. Despite the complexity of speech when it is expressed in terms of the spatio-temporal coordinates of the major articulators, a number of general principles of vocal tract action can be described which form the basis for a more manageable taxonomy of speech involving a set of intersecting articulatory classes. Articulatory classifications of speech elements are economical, and have historical respectability - they were employed by Sanskrit grammarians roughly 2600 years ago.
A relatively small number of articulatory dimensions is sufficient to carry linguistically significant contrasts. Vowels (for example, /i/ and /a/ in “deep, dark”), semi-vowels (/w/ as in “wailed”), continuant consonants (/s/ as in “monster”) and interrupted consonants (/d/ as in “dark”, /g/ in “grotto”) form a natural ranking of articulations with increasingly narrow constriction of the vocal tract. Another important dimension is the position in the vocal tract where the maximum constriction occurs; the initial consonants in “gay”, “day” and “bay” involve constriction at increasingly more forward vocal tract locations, towards the front of the mouth. These two dimensions correspond roughly to those known to phoneticians as manner and place of articulation. The voicing contrast, referring to the initial presence or absence of vibration of the vocal cords, as between the initial consonants /b/ and /p/ in “bay” and “pay”, allows further subdivisions of some of the above categories. This taxonomy allows the phonemes of a language to be represented as an intersecting set of features and hence allows utterances to be represented as articulatorily-defined segments arrayed serially in time. Thus the initial segment in “bay” is an interrupted, voiced consonant with bilabial place of articulation, that is, with vocal tract constriction at the lips. The adequacy of such a description of the content of an utterance in terms of a series of phonetic segments or phonemes (consonant, vowel, consonant etc.) having in turn distinctive features (interrupted, voiced etc.) depends on the purpose for which the description is used. It shares much with schemes one might use to classify the orthography of written language; segments correspond roughly to alphabetical characters and features to properties like presence or absence of a vertical stroke in a character.
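The idea of a segment as a bundle of intersecting features can be made concrete with a small sketch. The following is an illustration only, not a claim about how such descriptions are formalised in phonetic theory. The names `Segment`, `INVENTORY`, `differing_features` and `frontness` are hypothetical, and the place labels “velar” and “alveolar” are standard phonetic terms supplied here as an assumption, since the text above names only the bilabial place explicitly.

```python
from dataclasses import dataclass

# Place labels ordered from the back of the vocal tract to the front.
# "velar" and "alveolar" are assumed labels for the constrictions in
# "gay" and "day"; only "bilabial" is named in the text.
PLACE_BACK_TO_FRONT = ["velar", "alveolar", "bilabial"]


@dataclass(frozen=True)
class Segment:
    """A phonetic segment described by three articulatory features."""
    symbol: str
    manner: str   # e.g. "interrupted"
    place: str    # location of maximum constriction
    voiced: bool  # vocal cord vibration present or absent


# A tiny illustrative inventory: the initial consonants of
# "bay", "pay", "day" and "gay".
INVENTORY = {
    "b": Segment("b", "interrupted", "bilabial", True),
    "p": Segment("p", "interrupted", "bilabial", False),
    "d": Segment("d", "interrupted", "alveolar", True),
    "g": Segment("g", "interrupted", "velar", True),
}


def differing_features(a: Segment, b: Segment) -> list[str]:
    """Return the names of the features on which two segments differ."""
    return [f for f in ("manner", "place", "voiced")
            if getattr(a, f) != getattr(b, f)]


def frontness(segment: Segment) -> int:
    """Rank a segment by how far forward its constriction lies."""
    return PLACE_BACK_TO_FRONT.index(segment.place)
```

On this representation `differing_features(INVENTORY["b"], INVENTORY["p"])` yields only the voicing feature, mirroring the “bay”/“pay” contrast, while sorting the inventory by `frontness` places /g/ before /d/ before /b/, as in “gay”, “day”, “bay”.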
For speech, descriptions at this level are natural candidates for expressing economically some of the knowledge that language users have which makes them creative. For example, we can state simple prescriptive rules for the formation of the plural of English nouns never previously encountered. Although generally written with an “s”, the plural is realised phonetically in different ways, chiefly as /Iz/, /z/ or /s/ depending on the preceding segment. The ease with which this and similar rules can be stated in segmental terms contrasts sharply with their difficul...