PART 1
Models
1
About Speech Motor Control Complexity
PASCAL PERRIER
ABSTRACT
A key issue in speech motor control research is the level of complexity required of the internal models: must these models accurately account for all physical properties of the speech motor system, including the complex tongue-jaw biomechanics? Or would simpler internal representations be sufficient, representations that model only the static characteristics of the peripheral speech apparatus or that give only a rough account of articulatory dynamics? On the basis of experimental and modeling studies of speech movements and human limb movements published in the literature, the adequacy of simplified internal representations for speech motor control is analyzed.
INTRODUCTION
In the past two decades, the analysis and modeling of speech motor control have largely been inspired by investigations and models of other skilled human movements, such as reaching, pointing, or grasping. This research approach has proved fruitful and has allowed the emergence of a number of speech motor control models that have served as theoretical backgrounds for numerous studies (see among others Guenther, 1995; Laboissière, Ostry, & Feldman, 1996; Ostry, Gribble, & Gracco, 1996; Perkell et al., 1997, 2000; Perrier, Lœvenbruck, & Payan, 1996; Saltzman, 1986; Saltzman & Munhall, 1989). Consequently, the issues raised today in the domain of motor control research are very important for future studies of speech motor control, and addressing them is a prerequisite for the improvement of the current models. Among these issues, two questions are of particular interest: (1) How complex are the acquired internal representations of the motor system, built up during the acquisition of the motor task and used to plan and achieve the movement? and (2) What role does low-level, short-latency feedback play in achieving accurate and stable movement control? In this paper, contributions to these issues are proposed, while taking into account some important peculiarities of the speech production task.
SPEECH PRODUCTION: A COMPLEX MOTOR TASK
Compared to many other human motor tasks classically studied in motor control research, speech production has a number of peculiarities that make it particularly complex. Some findings suggesting such complexity have already been frequently discussed in the literature:
1. Because of its semiotic nature, the goal of speech production is actually defined in an abstract domain. Hence, its physical characterization is not straightforward, and this has two consequences. First, there is no unique physical correlate for a given elementary speech sound, and a large variability of patterns can be observed in the neurophysiological, articulatory, and acoustic domains (see Perkell & Klatt, 1986). Second, the issue of the physical space in which the motor task is planned becomes particularly complex, since the distal space can be defined either by articulatory positions, or by spectral properties of the speech signal, or by perceptual characteristics of this signal (see Browman & Goldstein, 1990; Guenther, Hampson, & Johnson, 1998; Savariaux, Perrier, & Orliaguet, 1995; Stevens, 1989; Tremblay, Shiller, & Ostry, 2003), or by a multimodal space associating orosensory, auditory, and even visual characterizations.
2. Speech production has a large number of degrees of freedom that confer a many-to-one characteristic on the relationships between motor commands, articulatory positions, and acoustic or auditory properties. This characteristic, together with the above-mentioned intrinsic variability of the physical correlates of the production of a given sound, has the consequence that a large set of motor equivalence strategies can be used to implement a range of coarticulation strategies or to deal with artificial perturbations (such as a pipe or food in the mouth), pathological perturbations (such as tongue or mandible surgery), or peripheral perturbations (Guenther, Espy-Wilson, Boyce, Matthies, Zandipour, & Perkell, 1999; McFarland, Baum, & Chabot, 1996; Perkell, Matthies, Svirsky, & Jordan, 1993; Savariaux, Perrier, Orliaguet, & Schwartz, 1999). These multiple strategies obviously contribute to the complexity of speech motor control (a toy numerical sketch of this many-to-one property is given after this list).
3. Compared to other skilled human motor activities, speech movements in normal conditions can be very short, since vowels have a mean duration of approximately 80 ms and consonants have mean durations around 40 ms (O'Shaughnessy, 1981). These characteristics seem to exclude any potential online contribution of long-latency orosensory feedback that would be processed by the cortex, and to limit the role of auditory feedback to the suprasegmental level and to an a posteriori monitoring used to correct segmental aspects of speech after it has been produced (Perkell et al., 1997). The absence of online use of auditory feedback to control speech at a segmental level is well supported by experimental work showing that speakers can produce intelligible speech even after hearing loss (Lane & Wozniak, 1991; Manzella, Wozniak, Matthies, Lane, Guiod, & Perkell, 1994). Consistent with this, work on stutterers and normal speakers shows that delaying auditory feedback in the range of 50–200 ms affects prosodic features (speaking rate, fluency, rhythm, intonation, and stress) rather than segmental ones (Hargrave, Kalinowski, Stuart, Armson, & Jones, 1994; Stager & Ludlow, 1993). At the same time, speech gestures have to be accurate enough to ensure that the associated acoustic signal can be correctly perceived by a listener. How such accuracy can be achieved without the use of long-latency feedback processed by the cortex is a key issue for speech production research, but not for other human motor tasks, with the exception of eye saccades.
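To make the many-to-one property mentioned in point 2 concrete, here is a minimal Python sketch. It is a toy constructed purely for illustration: the scalar function acoustic_output and the two "articulators" stand in for a real articulatory-to-acoustic mapping. It shows how several articulatory configurations can reach the same acoustic goal, and how a perturbation of one articulator can be compensated by the other, which is the essence of motor equivalence.

```python
# Toy illustration (not a real articulatory model) of a many-to-one mapping:
# two hypothetical articulatory parameters map to a single "acoustic" value,
# so many configurations yield the same output, and a perturbation of one
# parameter can be compensated by the other (motor equivalence).
import numpy as np

def acoustic_output(jaw, tongue):
    """Toy forward map: a single scalar that depends only on jaw + tongue."""
    return np.tanh(jaw + tongue)

target = acoustic_output(0.3, 0.5)   # acoustic goal of a reference configuration

# Any (jaw, tongue) pair with jaw + tongue = 0.8 reaches the same target:
for jaw, tongue in [(0.0, 0.8), (0.4, 0.4), (0.8, 0.0)]:
    print(jaw, tongue, np.isclose(acoustic_output(jaw, tongue), target))

# Compensation: if the jaw is blocked at 0.1, re-aiming the tongue
# still meets the acoustic goal.
blocked_jaw = 0.1
print(np.isclose(acoustic_output(blocked_jaw, 0.8 - blocked_jaw), target))
```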
To deal with the complexity and the multimodality of speech task representations, with the numerous motor or auditory equivalence strategies, and with the accuracy requirements in the absence of long-latency feedback going through the cortex, the large majority of speech motor control models published in the literature assume the existence of internal representations of the speech apparatus, called internal models (Guenther, 1995; Hirayama, Vatikiotis-Bateson, Kawato, & Jordan, 1992; Jordan, 1990; Jordan & Rumelhart, 1992; Kawato, Maeda, Uno, & Suzuki, 1990; Laboissière, Schwartz, & Bailly, 1991; Perkell et al., 1997, 2000; Perrier, Payan, & Marret, 2004).
INTERNAL MODELS AND SPEECH PRODUCTION CONTROL
A Useful Concept to Deal with the Control of Complex Motor Tasks
The internal model concept was proposed in motor control research to deal with the lack of a one-to-one correspondence between motor commands and the position of the final effector, and with the delays associated with the processing of long-latency feedback. The basic hypothesis is that copies of the motor system, or of subsets of it, could be learned in the brain during the acquisition of the motor skill (in our case, during the speech learning phase). Once they are learned, these models could be used to predict the consequences of motor command changes on the trajectory of the final effector. According to the different models of human motor control published in the literature, the role of internal models in the execution of the motor task could take different forms.
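As a rough numerical illustration of this hypothesis, the sketch below (a toy setup invented here, not a model from the cited literature) mimics a learning phase in which command/consequence pairs observed from a hypothetical plant function are used to fit an internal forward copy; once fitted, that copy predicts the consequence of a new command without executing the movement and without waiting for delayed feedback.

```python
# Minimal sketch of the internal model idea, under toy assumptions:
# a "babbling" phase provides command/consequence pairs from a plant,
# an internal copy is fitted to them, and the copy is then used to
# predict the consequence of an unseen command without executing it.
import numpy as np

rng = np.random.default_rng(0)

def plant(u):
    """Hypothetical peripheral apparatus: a nonlinear command-to-position map."""
    return np.sin(u) + 0.1 * u

# Learning phase: explore commands and record the observed consequences.
commands = rng.uniform(-2.0, 2.0, size=200)
positions = plant(commands) + rng.normal(0.0, 0.01, size=commands.size)

# Internal forward model: here simply a polynomial fit to the observed pairs.
forward_model = np.poly1d(np.polyfit(commands, positions, deg=5))

# Prediction vs. reality for an unseen command.
u_new = 1.3
print("predicted:", forward_model(u_new), "actual:", plant(u_new))
```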
A first hypothesis suggests that internal models' predictions could be the basis of task-planning strategies aiming at ensuring that the final effector moves along a specific desired trajectory in the task space. This is the so-called desired trajectory hypothesis (Kawato, 1999). Since human motor systems usually have an excess of degrees of freedom, many different motor command sequences are likely to allow the achievement of the specified trajectory. The basic idea of the desired trajectory hypothesis is that the central nervous system would use internal models prior to the execution of the movement in order to select, from all possible motor command sequences, an optimal one that both generates the required trajectory and minimizes a motor criterion, classically related to the concept of effort. This kind of internal model is classically called an inverse internal model, since it permits going from the desired output back to the motor commands.
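A minimal numerical sketch of this selection principle is given below, assuming a deliberately simplified linear task mapping A (invented for illustration, with one task dimension and two motor commands). Among the infinitely many command vectors that reach the desired task value, the pseudo-inverse returns the minimum-effort (minimum-norm) one, and any null-space contribution leaves the output unchanged while increasing effort.

```python
# Hedged sketch of the "inverse model + effort criterion" idea on a toy
# redundant linear system: two commands, one task variable, so infinitely
# many command vectors reach the desired output; the pseudo-inverse picks
# the minimum-effort (minimum-norm) one.
import numpy as np

A = np.array([[0.7, 0.4]])      # toy task map: y = A @ u (1 task dim, 2 commands)
y_desired = np.array([0.5])     # desired value of the task variable

u_min_effort = np.linalg.pinv(A) @ y_desired    # minimizes ||u||^2 s.t. A u = y*
print("commands:", u_min_effort, "achieved:", A @ u_min_effort)

# Any vector in the null space of A can be added without changing the output,
# but only at the cost of extra "effort" (a larger command norm).
null_dir = np.array([A[0, 1], -A[0, 0]])
u_alternative = u_min_effort + 0.3 * null_dir
print("same output:", A @ u_alternative, "larger effort:",
      np.linalg.norm(u_alternative) > np.linalg.norm(u_min_effort))
```

In a realistic inverse model the mapping is of course nonlinear and dynamic, but the same logic applies: achieve the specified trajectory while minimizing a motor cost.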
For target-oriented movements, an alternative hypothesis suggests that the central nervous system would use internal models as direct (forward) models during planning, in order to optimize the positioning accuracy of the final effector at the target in the presence of neural noise (Harris & Wolpert, 1998). From this perspective, internal models would also be used prior to the execution of movement, but the trajectory of the final effector toward the target would be a consequence of the planning strategy rather than part of the specification of the task itself. Another, more recent use of direct internal models was proposed by Todorov and Jordan (2002) in the framework of their optimal feedback control model. According to this model, the motor control strategy would not consist of specifying a desired trajectory in the task space and selecting the appropriate commands for the motor system to follow this trajectory. It would instead make use of feedback information during the execution of movement to selectively modify motor commands in an optimal way, when deviations occur in the task space that would endanger the achievement of the task goal. Such a use of feedback would normally require long-delay loops, which are known to generate instabilities in closed-loop servomechanisms and would consequently induce inadequate corrections from the controller. To overcome these problems, Todorov and Jordan (2002) proposed using the outputs of internal models as afferent signals.
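The signal-dependent-noise argument of Harris and Wolpert (1998) can be illustrated with a toy Monte Carlo simulation; the noise-scaling constant k and the command pulses below are invented for the example and are not taken from their paper. Two command sequences move a toy effector by the same amount on average, but the sequence that spreads the drive over time produces the smaller endpoint scatter, which is what a planner optimizing terminal accuracy would select.

```python
# Toy illustration of signal-dependent noise: the noise on each command pulse
# grows with its size, so among sequences with the same mean displacement,
# spreading the drive over time minimizes endpoint variance.
import numpy as np

rng = np.random.default_rng(1)
k = 0.2            # invented noise scaling: std of a pulse's noise is k * |pulse|
n_trials = 20000

def endpoint(commands):
    """Final position of a toy effector: the sum of noisy command pulses."""
    commands = np.asarray(commands, dtype=float)
    noise = rng.normal(0.0, k * np.abs(commands), size=(n_trials, commands.size))
    return np.sum(commands + noise, axis=1)

abrupt = endpoint([1.0, 0.0])    # all the drive packed into one pulse
smooth = endpoint([0.5, 0.5])    # same average displacement, spread over time

print("mean endpoints:", abrupt.mean(), smooth.mean())   # both close to 1.0
print("endpoint std  :", abrupt.std(), smooth.std())     # smooth < abrupt
```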
Thus, from the perspective of motor control models such as the desired trajectory hypothesis, Harris and Wolpert's (1998) proposal, or the optimal feedback control model, internal models are very powerful tools for dealing with the many-to-one nature of the relations between motor commands and vocal tract configurations or spectral characteristics of the acoustic signal, with the long latency of feedback processed by the cortex, and with the possible multimodality of speech task representations. A...