Speech Perception and Spoken Word Recognition

About this book

Speech Perception and Spoken Word Recognition features contributions from the field's leading scientists, and covers recent developments and current issues in the study of cognitive and neural mechanisms that take patterns of air vibrations and turn them 'magically' into meaning. The volume makes a unique theoretical contribution in linking behavioural and cognitive neuroscience research, and cutting across traditional strands of study, such as adult and developmental processing.

The book:

  • Focusses on the state of the art in the study of speech perception and spoken word recognition
  • Discusses the interplay between behavioural and cognitive neuroscience evidence, and between adult and developmental research
  • Evaluates key theories in the field and relates them to recent empirical advances, including the relationship between speech perception and speech production, meaning representation and real-time activation, and bilingual and monolingual spoken word recognition
  • Examines emerging areas of study, such as word learning and the time course of memory consolidation, and how the science of human speech perception can help computer speech recognition

Overall, this book presents a renewed focus on theoretical and developmental issues in speech perception and spoken word recognition, as well as a multifaceted and broad review of the state of research. It will be of particular interest to researchers in psycholinguistics and adjoining fields, as well as to advanced undergraduate and postgraduate students.

1
Representation of Speech

Ingrid S. Johnsrude and Bradley R. Buchsbaum

Introduction

To comprehend a spoken utterance, listeners must map a dynamic, variable, spectrotemporally complex continuous acoustic signal onto discrete linguistic representations in the brain, assemble these so as to recognize individual words, access the meanings of these words, and combine them to compute the overall meaning (Davis & Johnsrude, 2007). Words or their elements do not correspond to any invariant acoustic units in the speech signal: the speech stream does not usually contain silent gaps to demarcate word boundaries, and dramatic changes to the pronunciation of words in different contexts arise due to variation both between and within talkers (e.g., coarticulation). Despite the continuous nature and variability of speech, native speakers of a language perceive a sequence of discrete, meaningful units. How does this happen? What are the linguistic representations in the brain, and how is the mapping between a continuous auditory signal and such representations achieved? Given that speaking is a sensorimotor skill, is speech perceived in terms of its motor or auditory features? Does processing occur on multiple linguistic levels simultaneously (e.g., phonemes, syllables, words), or is there a single canonical level of representation, with larger units (like words) being assembled from these elemental units? How is acoustic variability – among talkers, and within talkers across utterances – dealt with, such that acoustically different signals all contact the same representation? (In other words, how do you perceive that Brad and Ingrid both said “I’d love lunch!” despite marked variability in the acoustics of their productions?)
These questions are fundamental to an understanding of the human use of language and have intrigued psychologists, linguists, and others for at least 50 years. Recent advances in methods for stimulating and recording activity in the human brain permit these perennial questions to be addressed in new ways. Over the last 20 years, cognitive-neuroscience methods have yielded a wealth of data related to the organization of speech and language in the brain. The most important of these include functional magnetic resonance imaging (fMRI), a non-invasive method for studying brain activity in local regions and functional interactions among regions. Pattern-information analytic approaches to fMRI data, such as multi-voxel pattern analysis (Mur, Bandettini, & Kriegeskorte, 2009), permit researchers to examine the information that is represented in different brain regions. Another method is transcranial magnetic stimulation (TMS), which is used to stimulate small regions on the surface of the brain, thereby reducing neural firing thresholds or interrupting function.
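To make the logic of pattern-information analysis concrete, the following is a minimal sketch of the kind of multi-voxel pattern analysis described above: a linear classifier is trained to decode which of two speech categories was heard from the pattern of responses across voxels in a region of interest. All data, conditions, and effect sizes here are hypothetical and synthetic; real analyses operate on preprocessed fMRI response estimates and use cross-validation schemes matched to the experimental design.

```python
# Minimal illustrative sketch of multi-voxel pattern analysis (MVPA):
# decode which of two speech categories a listener heard from the
# pattern of fMRI responses across voxels in a region of interest.
# All data here are synthetic and the effect size is invented.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_trials, n_voxels = 80, 200

# Hypothetical voxel patterns: two conditions (e.g., /ba/ vs /da/)
# that differ in their fine-grained spatial pattern of activity.
labels = np.repeat([0, 1], n_trials // 2)
signal = rng.normal(size=n_voxels)          # condition-specific pattern
X = rng.normal(size=(n_trials, n_voxels))   # trial-by-trial noise
X[labels == 1] += 0.3 * signal              # add the pattern to one condition

# Train and test a linear classifier on held-out trials.
clf = LinearSVC(dual=False)
scores = cross_val_score(clf, X, labels, cv=5)
print("Decoding accuracy: %.2f (chance = 0.50)" % scores.mean())
```

Above-chance decoding accuracy on held-out trials is taken as evidence that the region's activity pattern carries information about the stimulus category, even when mean activity does not differ between conditions.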
Recently, intracranial electrocorticography (ECoG) has re-emerged as a valuable tool for the study of speech and language in the human brain. Intracranial electrodes are implanted in some individuals with epilepsy who are refractory to drug treatment and so are being considered for surgical resection. ECoG electrodes, placed on the surface of the brain or deep into the brain, record neural activity with unparalleled temporal and spatial resolution. The hope is that the person with epilepsy will have a seizure while implanted: electrodes in which seizure activity is first evident are a valuable clue to the location of abnormal tissue giving rise to the seizures (resection of this tissue is potentially curative). Patients can be implanted for weeks at a time and often agree to participate in basic-science research (i.e., on speech and language) during their seizure-free periods.
In this chapter, we will first review what the cognitive psychological literature reveals about the nature of the linguistic representations for speech and language (What are the units? Are representations auditory or vocal gestural?) and about how speech variability is handled. We then turn to the cognitive-neuroscience literature, and review recent papers using fMRI, TMS, and ECoG methods that speak to these important questions.

The nature of the linguistic representations for speech and language: Cognitive considerations

What are the units?

The generativity and hierarchical structure of language strongly imply that there must be units in speech; these units are combined in different ways to create an infinite number of messages. Furthermore, speech is not heard as the continuous signal that it physically is; instead, listeners perceive speech sounds in distinct categories, along one or more linguistic dimensions or levels of analysis (such as articulatory gestures or features, or phonemes, syllables, morphemes, or words). Experience shapes perception to permit such analysis by highlighting and accentuating meaningful variability while minimizing meaningless variability (see Davis & Johnsrude, 2007; Diehl, Lotto, & Holt, 2004, for reviews). Moreover, we can repeat and imitate what someone else has said; such imitation requires that we parse another’s behaviour into components and then generate the motor commands to reproduce those behaviours (Studdert-Kennedy, 1981). Finally, we expect ‘core’ representations of language to be abstract since they must be modality independent: the spoken word [kæt] and the written form CAT must contact the same representations. What are the dimensions to which listeners are sensitive and which permit classification, imitation, and abstraction? What level or levels of analysis are ‘elemental’ in speech perception? What are the representational categories to which speech is mapped and that are used to retrieve the meaning of an utterance?
It is often assumed that the phoneme is the primary unit of perceptual analysis of speech (Nearey, 2001). The search for invariants in speech perception began with the observation that acoustically highly variable instances (variability caused in part by coarticulation and allophonic variation) were all classified by listeners as the same phoneme (Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967; Liberman, Harris, Hoffman, & Griffith, 1957). Such perceptual constancy for phonemic identity can be viewed either as a natural outcome of perceptual systems that are maximizing sensitivity to change (see Kluender & Kiefte, 2006, pp. 171–177, for discussion) or as evidence that speech perception is a modular, specialized function and that phonemes have some cognitive reality within an efficient and restricted inventory of speech events represented in the brain.
Although patterns of speech errors during sentence planning and execution are compatible with the psychological reality of phonemes as a unit of representation in the brain (Fromkin, 1971; Klatt, 1981), awareness of the phonemes in speech is generally restricted to users of alphabetic written languages, and phonemic awareness may in fact be a result of recognizing individual words rather than a prerequisite (Charles-Luce & Luce, 1990; Marslen-Wilson & Warren, 1994). Another objection to the phoneme as the primary unit in speech perception is that subphonemic acoustic information – fine phonetic detail – has important and systematic effects on speech perception (Hawkins, 2003; McMurray, Tanenhaus, & Aslin, 2009; see also Port, 2007). Listeners may use abstract prelexical, subphonemic representations, but it is still not clear what the ‘grain size’ of these units is (Mitterer, Scharenborg, & McQueen, 2013). Alternatively, several researchers have argued that listeners map relatively low-level information about the speech signal (phonetic features) directly onto words (or their meanings) without the need for a separate “phoneme recognition” stage (Gaskell & Marslen-Wilson, 1997; Kluender & Kiefte, 2006; Marslen-Wilson & Warren, 1994).
Another possible category of representation is the morpheme: theories of spoken language production and recognition generally posit that words like brightness are assembled out of smaller morphemic units (in this case, bright and ness; Dell, 1986; Levelt, Roelofs, & Meyer, 1999; Marslen-Wilson, Tyler, Waksler, & Older, 1994). Morphological representations may be somewhat independent of phonological representations, and appear to be recruited at a relatively early stage of processing, before phonological representations are computed in detail (Cohen-Goldberg, Cholin, Miozzo, & Rapp, 2013). Furthermore, the fact that morphologically related words prime one another across different psycholinguistic paradigms, in the absence of priming for word form or meaning, suggests that morphology plays an independent role in the organization and processing of words (Bozic, Marslen-Wilson, Stamatakis, Davis, & Tyler, 2007).
Intriguingly, languages differ in terms of the evidence for prominence of a particular kind of speech unit. For example, the syllable appears to play a prominent role in speech perception in French, Spanish, Italian, Dutch, and Portuguese but not necessarily in English (Bien, Bölte, & Zwitserlood, 2015; Floccia, Goslin, Morais, & Kolinsky, 2012; Goldinger & Azuma, 2003).
The cognitive literature on speech perception is now shifting away from a preoccupation with the question of which particular linguistic unit is most important and towards a more domain-general account in which the statistics of the input are used to discover the structure of natural sounds (Kluender & Kiefte, 2006; Port, 2007). Perceptual inferences can then be made in a Bayesian fashion, using probability distributions defined on structured representations. This brings theorizing about the mechanisms of auditory and speech perception in line with what is known about visual perception (Kersten, Mamassian, & Yuille, 2004; Yuille & Kersten, 2006).
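As a toy illustration of this Bayesian view (with invented rather than empirically estimated numbers), the sketch below combines prior probabilities for two phoneme categories with Gaussian likelihoods over a single acoustic cue, voice onset time, to compute a posterior over categories for an observed token.

```python
# Toy sketch of Bayesian categorization of a speech cue (hypothetical numbers):
# the listener combines prior probabilities of two phoneme categories with
# Gaussian likelihoods over voice onset time (VOT) to obtain a posterior
# over categories for an observed token.
from scipy.stats import norm

# Assumed category statistics learned from experience (illustrative values):
# /b/ tends to have short VOTs, /p/ long VOTs, both roughly Gaussian.
categories = {"/b/": dict(prior=0.5, mean=10.0, sd=8.0),
              "/p/": dict(prior=0.5, mean=50.0, sd=12.0)}

def posterior(vot_ms):
    """P(category | VOT) via Bayes' rule with Gaussian likelihoods."""
    weighted = {c: p["prior"] * norm.pdf(vot_ms, p["mean"], p["sd"])
                for c, p in categories.items()}
    total = sum(weighted.values())
    return {c: v / total for c, v in weighted.items()}

# A token near the assumed category boundary yields a graded posterior.
print(posterior(25.0))
```

Tokens far from the boundary are assigned to one category with near certainty, whereas ambiguous tokens yield graded posteriors; on this view, seemingly categorical perception can emerge from probabilistic inference over learned cue distributions.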

Are representations auditory or gestural?

In their seminal 1959 paper, “What the Frog’s Eye Tells the Frog’s Brain,” Jerry Lettvin and colleagues (Lettvin, Maturana, McCulloch, & Pitts, 1959) identified optic nerve fibers from the retina of the frog that were sensitive to small, dark convex objects that enter the receptive field, stop, and then move about in the field intermittently. They were tempted to call these bug detectors, since it is hard to imagine a system better equipped “for detecting an accessible bug” (Lettvin et al., 1959). Before these studies, retinal cells were viewed as light sensors, which relayed a copy of the local distribution of light to the brain in an array of impulses. This study demonstrated that, in fact, information is already highly organized and interpreted by the time it leaves the retina, providing the frog with precisely the information that is most relevant and useful to it. This is highly consistent with the direct-perception or direct-realist account of perception, as put forward by James Gibson (Gibson, 1966) and others; this account emphasized that the objects of perception are not patterns of light or sound but environmental events that provide opportunities for interaction and behaviour.
Carol Fowler at Haskins Laboratories has put forward a direct-realist account of speech perception (Fowler, 1986), arguing that listeners directly perceive articulatory gestures, which are reflected in the sounds of speech. This position is similar to that held by proponents of the motor theory of speech perception, also developed at Haskins (Galantucci, Fowler, & Turvey, 2006; A.M. Liberman et al., 1967; A.M. Liberman & Mattingly, 1985), who suggested that speech is primarily a motoric phenomenon. In a series of investigations aimed at understanding the acoustic signatures of phonemes, Liberman’s group demonstrated that the spectrotemporal sound pattern of a given consonant is not invariant but that coarticulation gives every consonant (and vowel) multiple acoustic realizations. For example, when the identical consonant /d/ is spoken in different vowel contexts (e.g., dih, dee, and dar), the formant transition patterns during the articulation of the stop consonant change in each case. Despite the variation in the acoustic properties of the consonant, however, the observer hears the same /d/ sound. The way in which /d/ is articulated is the same in each case, with the tip of the tongue pressing against the alveolar ridge; this articulatory invariance led Liberman and colleagues to suggest that the goal of the speech perception system is not to perceive sounds but rather to recover the invariant articulatory gestures produced by the speaker.
More recent behavioural work makes it clear that articulation itself is not as invariant as previously believed. Although the goal of articulation can be relatively constant (e.g., upper lip contacting lower lip), the actual movements required to achieve such a goal vary substantially (Gracco & Abbs, 1986). Abstract movement goals may be represented invariantly, but the actual motor commands used to achieve those movements probably are not.
As in other domains of motor control, speech may rely on forward internal models (Webb, 2004; Wolpert & Ghahramani, 2000), which allow a talker to predict the sensory (acoustic) consequences of their (articulatory) actions, b...

Table of contents

  1. Cover
  2. Title
  3. Copyright
  4. CONTENTS
  5. List of contributors
  6. Introduction
  7. 1 Representation of speech
  8. 2 Perception and production of speech: Connected, but how?
  9. 3 Consonant bias in the use of phonological information during lexical processing: A lifespan and cross-linguistic perspective
  10. 4 Speech segmentation
  11. 5 Mapping spoken words to meaning
  12. 6 Zones of proximal development for models of spoken word recognition
  13. 7 Learning and integration of new word-forms: Consolidation, pruning, and the emergence of automaticity
  14. 8 Bilingual spoken word recognition
  15. 9 The effect of speech sound disorders on the developing language system: Implications for treatment and future directions in research
  16. 10 Speech perception by humans and machines
  17. Index