Part I
History, scope, and techniques
1
History of speech synthesis
Brad H. Story
Introduction
For the past two centuries or more, a variety of devices capable of generating artificial or synthetic speech have been developed and used to investigate phonetic phenomena. The aim of this chapter is to provide a brief history of synthetic speech systems, including mechanical, electrical, and digital types. The primary goal, however, is not to reiterate the details of constructing specific synthesizers but rather to focus on the motivations for developing various synthesis paradigms and illustrate how they have facilitated research in phonetics.
The mechanical and electro-mechanical era
On the morning of December 20, 1845, a prominent American scientist attended a private exhibition of what he would later refer to as a “wonderful invention.” The scientist was Joseph Henry, an expert on electromagnetic induction and the first Secretary of the Smithsonian Institution. The “wonderful invention” was a machine that could talk, meticulously crafted by a disheveled 60-year-old tinkerer from Freiburg, Germany named Joseph Faber. Their unlikely meeting in Philadelphia, Pennsylvania, arranged by an acquaintance of Henry from the American Philosophical Society, might have occurred more than a year earlier had Faber not destroyed a previous version of his talking machine in a bout of depression and intoxication. Although he had spent some 20 years perfecting the first device, Faber was able to reconstruct a second version of equal quality in a year’s time (Patterson, 1845).
The layout of the talking machine, described in a letter from Henry to his colleague H.M. Alexander, was like that of a small chamber organ whose keyboard was connected via strings and levers to mechanical constructions of the speech organs. A carved wooden face was fitted with a hinged jaw, and behind it was an ivory tongue that was moveable enough to modulate the shape of the cavity in which it was housed. A foot-operated bellows supplied air to a rubber glottis whose vibration provided the raw sound that could be shaped into speech by pressing various sequences or combinations of 16 keys available on a keyboard. Each key was marked with a symbol representing an “elementary” sound that, through its linkage to the artificial organs, imposed time-varying changes to the air cavity appropriate for generating apparently convincing renditions of connected speech. Several years earlier Henry had been shown a talking machine built by the English scientist Charles Wheatstone, but he noted that Faber’s machine was far superior because instead of uttering just a few words, it was “capable of speaking whole sentences composed of any words what ever” (Rothenberg et al., 1992, p. 362).
In the same letter, Henry mused about the possibility of placing two or more of Faber’s talking machines at various locations and connecting them via telegraph lines. He thought that with “little contrivance” a spoken message could be coded as keystrokes in one location which, through electromagnetic means, would set into action another of the machines to “speak” the message to an audience at a distant location. Another 30 years would pass before Alexander Graham Bell demonstrated his invention of the telephone, yet Henry had already conceived of the notion while witnessing Faber’s machine talk. Further, unlike Bell’s telephone, which transmitted an electrical analog of the speech pressure wave, Henry’s description alluded to representing speech in compressed form, based on the slowly varying movements of the operator’s hands, fingers, and feet as they formed the keystroke sequences required to produce an utterance – a signal processing technique that would not be implemented in telephone transmission systems for nearly another century.
It is remarkable that, at this moment in history, a talking machine had been constructed that was capable of transforming a type of phonetic representation into a simulation of speech production, resulting in an acoustic output heard clearly as intelligible speech – and this same talking machine had inspired the idea of electrical transmission of low-bandwidth speech. The moment is also ironic, however, considering that no one seized either development as an opportunity for scientific or technological advancement. Henry understandably continued on with his own scientific pursuits, leaving his idea to one short paragraph in an obscure letter to a colleague. In need of funds, Faber signed on with the entertainment entrepreneur P.T. Barnum in 1846 to exhibit his talking machine for a run of several months at the Egyptian Hall in London. In his autobiography, Barnum (1886) noted that a repeat visitor to the exhibition was the Duke of Wellington, whom Faber eventually taught to “speak” both English and German phrases with the machine (Barnum, 1886, p. 134). In the exhibitor’s autograph book, the Duke wrote that Faber’s “Automaton Speaker” was an “extraordinary production of mechanical genius.” Other observers also noted the ingenuity in the design of the talking machine (e.g., “The Speaking Automaton,” 1846; Athenaeum, 1846), but to Barnum’s puzzlement it was not successful in drawing public interest or revenue. Faber and his machine were eventually relegated to a traveling exhibit that toured the villages and towns of the English countryside; it was supposedly here that Faber ended his life by suicide, although there is no definitive account of the circumstances of his death (Altick, 1978).
In any case, Faber disappeared from the public record, although his talking machine continued to make sideshow-like appearances in Europe and North America over the next 30 years; it seems a relative (perhaps a niece or nephew) may have inherited the machine and performed with it to generate income (“Talking Machine,” 1880; Altick, 1978).
Although the talking machine caught the serious attention of those who understood the significance of such a device, the overall muted interest may have been related to Faber’s lack of showmanship, the German accent that was present in the machine’s speech regardless of the language spoken, and perhaps the fact that Faber never published any written account of how the machine was designed or built – or maybe a mechanical talking machine, however ingenious its construction, was, by 1846, simply considered passé. Decades earlier, others had already developed talking machines that had impressed both scientists and the public. Most notable were Christian Gottlieb Kratzenstein and Wolfgang von Kempelen, both of whom had independently developed mechanical speaking devices in the late 18th century.
Inspired by a competition sponsored by the Imperial Academy of Sciences at St. Petersburg in 1780, Kratzenstein submitted a report that detailed the design of five organ pipe-like resonators that, when excited with the vibration of a reed, produced the vowels /a, e, i, o, u/ (Kratzenstein, 1781). Although their shape bore little resemblance to human vocal tract configurations, and they could produce only sustained sounds, the construction of these resonators won the prize and marked a shift toward scientific investigation of human sound production. Kratzenstein, who at the time was a Professor of Physics at the University of Copenhagen, had shared a long-term interest in studying the physical nature of speaking with a former colleague at St. Petersburg, Leonhard Euler, who likely proposed the competition. Well known for his contributions to mathematics, physics, and engineering, Euler wrote in 1761 that “all the skill of man has not hitherto been capable of producing a piece of mechanism that could imitate [speech]” (p. 78) and further noted that “The construction of a machine capable of expressing sounds, with all the articulations, would no doubt be a very important discovery” (Euler, 1761, p. 79). He envisioned such a device to be used in assistance of those “whose voice is either too weak or disagreeable” (Euler, 1761, p. 79).
During the same time period, von Kempelen, a Hungarian engineer, industrialist, and government official, used his spare time and mechanical skills to build a talking machine far more advanced than the five vowel resonators demonstrated by Kratzenstein. The final version of his machine was to some degree a mechanical simulation of human speech production. It included a bellows as a “respiratory” source of air pressure and air flow, a wooden “wind” box that emulated the trachea, a reed system to generate the voice source, and a rubber funnel that served as the vocal tract. There was an additional chamber used for nasal sounds, and other control levers that were needed for particular consonants. Although it was housed in a large box, the machine itself was small enough that it could have been easily held in the hands. Speech was produced by depressing the bellows, which caused the “voice” reed to vibrate. The operator then manipulated the rubber vocal tract into time-varying configurations that, along with controlling other ports and levers, produced speech at the word level, but could not generate full sentences due to the limitations of air supply and perhaps the complexity of controlling the various parts of the machine with only two hands. The sound quality was child-like, presumably due to the high fundamental frequency of the reed and the relatively short rubber funnel serving as the vocal tract. In an historical analysis of von Kempelen’s talking machine, Dudley and Tarnoczy (1950) note that this quality was probably deliberate because a child’s voice was less likely to be criticized when demonstrating the function of the machine. Kempelen may have been particularly sensitive to criticism considering that he had earlier constructed and publicly demonstrated a chess-playing automaton that was in fact a hoax (cf., Carroll, 1975). Many observers initially assumed that his talking machine was merely a fake as well.
Kempelen’s lasting contribution to phonetics is his prodigious written account of not only the design of his talking machine, but also the nature of speech and language in general (von Kempelen, 1791). In “On the Mechanism of Human Speech” [English translation], he describes the experiments that consumed more than 20 years and clearly showed the significance of using models of speech production and sound generation to study and analyze human speech. This work motivated much subsequent research on speech production, and to this day still guides the construction of replicas of his talking machine for pedagogical purposes (cf., Trouvain and Brackhane, 2011).
One person particularly inspired by von Kempelen’s work was, in fact, Joseph Faber. According to a biographical sketch (Wurzbach, 1856), while recovering from a serious illness in about 1815, Faber happened onto a copy of “On the Mechanism of Human Speech” and became consumed with the idea of building a talking machine. Of course, he built not a replica of von Kempelen’s machine, but one with a significantly advanced system of controlling the mechanical simulation of speech production. As remarkable as Faber’s machine seems to have been regarded by some observers, Faber was indeed late to the party, so to speak, for the science of voice and speech had by the early 1800s already shifted into the realm of physical acoustics. Robert Willis, a professor of mechanics at Cambridge University, was dismayed by both Kratzenstein’s and von Kempelen’s reliance on trial-and-error methods in building their talking machines, rather than acoustic theory. He took them to task, along with most others working in phonetics at the time, in his 1829 essay titled “On the Vowel Sounds, and on Reed Organ-Pipes.” The essay begins:
The generality of writers who have treated on the vowel sounds appear never to have looked beyond the vocal organs for their origin. Apparently assuming the actual forms of these organs to be essential to their production, they have contented themselves with describing with minute precision the relative positions of the tongue, palate and teeth peculiar to each vowel, or with giving accurate measurements of the corresponding separation of the lips, and of the tongue and uvula, considering vowels in fact more in the light of physiological functions of the human body than as a branch of acoustics.
(Willis, 1829, p. 231)
Willis laid out a set of experiments in which he would investigate vowel production by deliberately neglecting the organs of speech. He built reed-driven organ pipes whose lengths could be increased or decreased with a telescopic mechanism, and then determined that an entire series of vowels could be generated with changes in tube length and reeds with different vibrational frequencies. Wheatstone (1837) later pointed out that Willis had essentially devised an acoustic system that, by altering tube length, and hence the frequencies of the tube resonances, allowed for selective enhancement of harmonic components of the vibrating reed. Wheatstone further noted that multiple resonances are exactly what is produced by the “cavity of the mouth,” and so the same effect occurs during speech production but with a nonuniformly shaped tube.
Understanding speech as a pattern of spectral components became a major focus of acousticians studying speech communication for much of the 19th century and the very early part of the 20th century. As a result, developments of machines to produce speech sounds were also largely based on some form of spectral addition, with little or no reference to the human speech organs. For example, in 1859 the German scientist Hermann Helmholtz devised an electromagnetic system for maintaining the vibration of a set of eight or more tuning forks, each variably coupled to a resonating chamber to control amplitude (Helmholtz, 1859, 1875). With careful choice of frequencies and amplitude settings he demonstrated the artificial generation of five different vowels. Rudolph Koenig, a well-known acoustical instrument maker in the 1800s, improved on Helmholtz’s design and produced commercial versions that were sold to interested clients (Pantalony, 2004). Koenig was also a key figure in emerging technology that allowed for recording and visualization of sound waves. His invention of the phonautograph with Édouard-Léon Scott in 1859 transformed sound via a receiving cone, diaphragm, and stylus into a pressure waveform etched on smoked paper rotating about a cylinder. A few years later he introduced an alternative instrument in which a flame would flicker in response to a sound, and the movements of the flame were captured on a rotating mirror, again producing a visualization of the sound as a waveform (Koenig, 1873).
These approaches were precursors to a device called the “phonodeik,” which would later be developed at the Case School of Applied Science by Dayton Miller (1909), who eventually used it to study waveforms of sounds produced by musical instruments and human vowels. In a publication documenting several lectures given at the Lowell Institute in 1914, Miller (1916) describes both the analysis of sound based on photographic representations of waveforms produced by the phonodeik, as well as intricate machines that could generate complex waveforms by adding together sinusoidal components and display the final product graphically so that it might be compared to those waveforms captured with the phonodeik. Miller referred to this latter process as harmonic synthesis, a term commonly used to refer to building complex waveforms from basic sinusoidal elements. This is, however, the first instance of the word “synthesis” in the present chapter; the delay has been deliberate, to remain true to the original references. Nowhere in the literature on Kratzenstein, von Kempelen, Wheatstone, Faber, Willis, or Helmholtz does “synthesis” or “speech synthesis” appear. Their devices were variously referred to as talking machines, automatons, or simply systems that generated artificial speech. Miller’s use of synthesis in relation to human vowels seems to have had the effect of labeling any future system that produces artificial speech, regardless of the theory on which it is based, a speech synthesizer.
Interestingly, the waveform synthesis described by Miller was not actually synthesis of sound, but rather synthes...