What are the units?
The generativity and hierarchical structure of language appear to strongly imply that there must be units in speech; these units are combined in different ways to create an infinite number of messages. Furthermore, speech is not heard as the continuous signal that it physically is; instead, listeners perceive speech sounds in distinct categories, along one or more linguistic dimensions or levels of analysis (such as articulatory gestures or features, or phonemes, syllables, morphemes, or words). Experience shapes perception to permit such analysis by highlighting and accentuating meaningful variability while minimizing meaningless variability (see Davis & Johnsrude, 2007; Diehl, Lotto, & Holt, 2004, for reviews). Moreover, we can repeat and imitate what someone else has said; such imitation requires that we parse another’s behaviour into components and then generate the motor commands to reproduce those behaviours (Studdert-Kennedy, 1981). Finally, we expect ‘core’ representations of language to be abstract since they must be modality independent: the spoken word [kæt] and the written form CAT must contact the same representations. What are the dimensions to which listeners are sensitive and which permit classification, imitation, and abstraction? What level or levels of analysis are ‘elemental’ in speech perception? What are the representational categories to which speech is mapped and that are used to retrieve the meaning of an utterance?
It is often assumed that the phoneme is the primary unit of perceptual analysis of speech (Nearey, 2001). The search for invariants in speech perception began with the observation that acoustically highly variable instances (variability caused in part by coarticulation and allophonic variation) were all classified by listeners as the same phoneme (Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967; Liberman, Harris, Hoffman, & Griffith, 1957). Such perceptual constancy for phonemic identity can be viewed either as a natural outcome of perceptual systems that are maximizing sensitivity to change (see Kluender & Kiefte, 2006, pp. 171–177, for discussion) or as evidence that speech perception is a modular, specialized function and that phonemes have some cognitive reality within an efficient and restricted inventory of speech events represented in the brain.
Although patterns of speech errors during sentence planning and execution are compatible with the psychological reality of phonemes as a unit of representation in the brain (Fromkin, 1971; Klatt, 1981), awareness of the phonemes in speech is generally restricted to users of alphabetic written languages, and phonemic awareness may in fact be a result of recognizing individual words rather than a prerequisite for it (Charles-Luce & Luce, 1990; Marslen-Wilson & Warren, 1994). Another objection to the phoneme as the primary unit in speech perception is that subphonemic acoustic information – fine phonetic detail – has important and systematic effects on speech perception (Hawkins, 2003; McMurray, Tanenhaus, & Aslin, 2009; see also Port, 2007). Listeners may use abstract prelexical, subphonemic representations, but it is still not clear what the ‘grain size’ of these units is (Mitterer, Scharenborg, & McQueen, 2013). Alternatively, several researchers have argued that listeners map relatively low-level information about the speech signal (phonetic features) directly onto words (or their meanings) without the need for a separate “phoneme recognition” stage (Gaskell & Marslen-Wilson, 1997; Kluender & Kiefte, 2006; Marslen-Wilson & Warren, 1994).
Another possible category of representation is the morpheme. Theories of spoken language production and recognition generally posit that words like brightness are assembled out of smaller morphemic units (in this case, bright and ness; Dell, 1986; Levelt, Roelofs, & Meyer, 1999; Marslen-Wilson, Tyler, Waksler, & Older, 1994). Morphological representations may be somewhat independent of phonological representations, and they appear to be recruited at a relatively early stage of processing, before phonological representations are computed in detail (Cohen-Goldberg, Cholin, Miozzo, & Rapp, 2013). Furthermore, the fact that morphologically related words prime one another, in the absence of priming for word form or meaning, across different psycholinguistic paradigms suggests that morphology plays an independent role in the organization and processing of words (Bozic, Marslen-Wilson, Stamatakis, Davis, & Tyler, 2007).
Intriguingly, languages differ in the evidence they provide for the prominence of particular speech units. For example, the syllable appears to play a prominent role in speech perception in French, Spanish, Italian, Dutch, and Portuguese but not necessarily in English (Bien, Bölte, & Zwitserlood, 2015; Floccia, Goslin, Morais, & Kolinsky, 2012; Goldinger & Azuma, 2003).
The cognitive literature on speech perception is now shifting away from a preoccupation with the question of which particular linguistic unit is most important and towards a more domain-general account in which the statistics of the input are used to discover the structure of natural sounds (Kluender & Kiefte, 2006; Port, 2007). Perceptual inferences can then be made in a Bayesian fashion, using probability distributions defined over structured representations. This brings theorizing about the mechanisms of auditory and speech perception in line with what is known about visual perception (Kersten, Mamassian, & Yuille, 2004; Yuille & Kersten, 2006).
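To make the Bayesian framing concrete, the sketch below shows how a listener might combine prior expectations with the likelihood of an observed acoustic cue to categorize a speech sound. It is an illustration of the general idea only, not a model drawn from the work cited above; the choice of cue (voice onset time), the Gaussian cue distributions, and the prior probabilities are all hypothetical placeholders.

```python
# Illustrative sketch: Bayesian categorization of a stop consonant (/b/ vs /p/)
# from a single acoustic cue, voice onset time (VOT, in ms).
# All numbers below are hypothetical placeholders, not estimates from data.
import math

def gaussian_likelihood(x, mean, sd):
    """Probability density of cue value x under a Gaussian cue distribution."""
    return math.exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

# Hypothetical cue distributions, standing in for statistics learned from the input.
categories = {
    "/b/": {"mean": 5.0,  "sd": 10.0, "prior": 0.5},
    "/p/": {"mean": 50.0, "sd": 15.0, "prior": 0.5},
}

def posterior(vot):
    """Posterior probability of each category given an observed VOT (Bayes' rule)."""
    unnormalized = {
        label: params["prior"] * gaussian_likelihood(vot, params["mean"], params["sd"])
        for label, params in categories.items()
    }
    total = sum(unnormalized.values())
    return {label: value / total for label, value in unnormalized.items()}

print(posterior(25.0))  # roughly {'/b/': 0.45, '/p/': 0.55}, an ambiguous token
```

On this view, category knowledge amounts to learned probability distributions over acoustic cues, and perception is inference over those distributions rather than the detection of a single invariant unit.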
Are representations auditory or gestural?
In their seminal 1959 paper, “What the Frog’s Eye Tells the Frog’s Brain,” Jerry Lettvin and colleagues (Lettvin, Maturana, McCulloch, & Pitts, 1959) identified optic nerve fibers from the retina of the frog that were sensitive to small, dark convex objects that enter the receptive field, stop, and then move about in the field intermittently. They were tempted to call these bug detectors, since it is hard to imagine a system better equipped “for detecting an accessible bug” (Lettvin et al., 1959). Before these studies, retinal cells were viewed as light sensors, which relayed a copy of the local distribution of light to the brain in an array of impulses. This study demonstrated that, in fact, information is already highly organized and interpreted by the time it leaves the retina, providing the frog with precisely the information that is most relevant and useful to it. This is highly consistent with the direct-perception or direct-realist account of perception, as put forward by James Gibson (Gibson, 1966) and others; this account emphasized that the objects of perception are not patterns of light or sound but environmental events that provide opportunities for interaction and behaviour.
Carol Fowler at Haskins Laboratories has put forward a direct-realist account of speech perception (Fowler, 1986), arguing that listeners directly perceive articulatory gestures, which are reflected in the sounds of speech. This position is similar to that held by proponents of the motor theory of speech perception, also developed at Haskins (Galantucci, Fowler, & Turvey, 2006; A.M. Liberman et al., 1967; A.M. Liberman & Mattingly, 1985), who suggested that speech is primarily a motoric phenomenon. In a series of investigations aimed at understanding the acoustic signatures of phonemes, Liberman’s group demonstrated that the spectrotemporal sound pattern of a given consonant is not invariant but that coarticulation gives every consonant (and vowel) multiple acoustic realizations. For example, when the identical consonant /d/ is spoken in different vowel contexts (e.g., dih, dee, and dar), the formant transition patterns during the articulation of the stop consonant change in each case. Despite the variation in the acoustic properties of the consonant, however, the observer hears the same /d/ sound. The way in which /d/ is articulated is the same in each case, with the tip of the tongue pressing against the alveolar ridge; this articulatory invariance led Liberman and colleagues to suggest that the goal of the speech perception system is not to perceive sounds but rather to recover the invariant articulatory gestures produced by the speaker.
More recent behavioural work makes it clear that articulation itself is not as invariant as previously believed. Although the goal of articulation can be relatively constant (e.g., the upper lip contacting the lower lip), the actual movements required to achieve such a goal vary substantially (Gracco & Abbs, 1986). Abstract movement goals may be invariantly represented, but the actual motor commands used to achieve those movements probably are not.
As in other domains of motor control, speech may rely on forward internal models (Webb, 2004; Wolpert & Ghahramani, 2000), which allow a talker to predict the sensory (acoustic) consequences of their (articulatory) actions, b...