Languages & Linguistics

Speech Recognition

Speech recognition is the process of automatically converting spoken words into text or commands. It works by analysing audio signals to identify the words being spoken. This technology is used in applications such as virtual assistants, dictation software, and voice-controlled devices to enable hands-free interaction with computers and other devices.

Written by Perlego with AI-assistance

10 Key excerpts on "Speech Recognition"

  • Speech and Audio Processing

    A MATLAB®-based Approach

    9 Speech Recognition

    Having considered big data in the previous chapter, we now turn our attention to speech recognition – probably the one area of speech research that has gained the most from machine learning techniques. In fact, as discussed in the introduction to Chapter 8, it was only through the application of well-trained machine learning methods that automatic speech recognition (ASR) technology was able to advance beyond a decades-long plateau that limited performance, and hence the spread of further applications.

    9.1 What is Speech Recognition?

    Entire texts have been written on the subject of speech recognition, and this topic alone probably accounts for more than half of the recent research literature and computational development effort in the fields of speech and audio processing. There are good reasons for this interest, primarily driven by the wish to be able to communicate more naturally with a computer (i.e. without the use of a keyboard and mouse). This is a wish which has been around for almost as long as electronic computers have been with us. From a historical perspective we might identify a hierarchy of mainstream human–computer interaction steps as follows:

    • Hardwired: The computer designer (i.e. engineer) 'reprograms' a computer, and provides input by reconnecting wires and circuits.
    • Card: Punched cards are used as input, printed tape as output.
    • Paper: Teletype input is used directly, and printed paper as output.
    • Alphanumeric: Electronic keyboards and monitors (visual display units), alphanumeric data.
    • Graphical: Mice and graphical displays enable the rise of graphical user interfaces (GUIs).
    • WIMP: Standardised methods of windows, icons, mouse and pointer (WIMP) interaction become predominant.
    • Touch: Touch-sensitive displays, particularly on smaller devices.
  • Speech Processing

    A Dynamic and Optimization-Oriented Approach

    • Li Deng, Douglas O'Shaughnessy (Authors)
    • 2018 (Publication Date)
    • CRC Press (Publisher)
    The recognized words can also serve as the input to further linguistic processing in order to achieve speech understanding, the process by which a computer maps an acoustic speech signal to some form of abstract meaning of the speech. ASR has emerged as a promising area for applications such as dictation, telephone voice response systems, database access, human-computer interactions, hands-free applications (such as car phones or voice-enabled PDAs), and web enabling via voice. Speech is the most direct and intuitive form of human communication; successful ASR thus can enhance the ease, speed, and effectiveness with which humans can direct machines to accomplish desired tasks. ASR has become an established research area, and has already created many successful products in the marketplace. ASR can also be viewed as defining applications for artificial intelligence in computer science. Many important issues in ASR, such as feature extraction from auditory data, preservation of auditory perceptual invariance, integration of information across time and frequency, and exploration of distributed representations, are all in common with the key problems in artificial intelligence. It is a great scientific and technological challenge to understand how humans recognize speech and to simulate this ability in computers, a common goal in both ASR and artificial intelligence.

    12.1.1 The speech recognition problem

    The problem of speech recognition can be viewed in terms of the decoding problem in a source-channel representation. In this representation, the speaker consists of

    • the information source specifying the intended word sequence W* = w_1, w_2, ..., w_n; and
    • the speech production system as the noisy communication channel, whose output is the speech signal.

    The recognizer aims to determine (decode) the intended word sequence W* produced by the speaker with as few errors as possible.
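    The source-channel view leads directly to the Bayes decoding rule: choose the word sequence W that maximizes P(W) · P(X|W), where a language model supplies the prior P(W) and an acoustic model supplies the channel likelihood P(X|W). A minimal Python sketch of this rule, using invented stand-in scores rather than real models:

    # Toy illustration of source-channel decoding:
    #   W_hat = argmax_W  P(W) * P(X | W),
    # computed in log space. Both scoring functions below are invented
    # stand-ins: a real recognizer uses a trained language model for P(W)
    # and HMM or neural-network likelihoods for P(X | W).

    def language_model_logprob(words):
        """log P(W): toy unigram scores (hypothetical values)."""
        unigram = {"call": -1.2, "home": -1.5, "hall": -3.0, "foam": -3.5}
        return sum(unigram.get(w, -6.0) for w in words)

    def acoustic_model_logprob(signal, words):
        """log P(X | W): stand-in score rewarding acoustic matches."""
        return sum(0.0 if w == x else -2.0 for w, x in zip(words, signal))

    def decode(signal, hypotheses):
        """Return the hypothesis maximizing log P(W) + log P(X | W)."""
        return max(hypotheses,
                   key=lambda W: (language_model_logprob(W)
                                  + acoustic_model_logprob(signal, W)))

    observed = ["call", "hall"]  # toy stand-in for the acoustic evidence
    candidates = [["call", "home"], ["hall", "foam"], ["call", "hall"]]
    print(decode(observed, candidates))  # balances acoustic fit and prior

    Real decoders search the hypothesis space implicitly rather than enumerating it, but the argmax over log P(W) + log P(X|W) is the same.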
  • Video, Speech, and Audio Signal Processing and Associated Standards
    • Vijay Madisetti (Author)
    • 2018 (Publication Date)
    • CRC Press (Publisher)
    9.1 Introduction

    Over the past several decades, the need has arisen to enable humans to communicate with machines in order to control their actions or obtain information. Initial attempts at providing human–machine communications led to the development of the keyboard, the mouse, the trackball, the touch screen, and the joystick. However, none of these communication devices provides the richness or the ease of use of speech, which has been the most natural form of communication between humans for tens of centuries. This need for a natural voice interface between humans and machines has been met, to a limited extent, by speech processing systems which enable a machine to speak (speech synthesis systems) and which enable a machine to understand (speech recognition systems) human speech. We concentrate on speech recognition systems in this section. Speech recognition by machine refers to the capability of a machine to convert human speech to a textual form, providing a transcription or interpretation of everything that the human speaks while the machine is listening. This capability is required for tasks in which the human is controlling the actions of the machine using only limited speaking capability, e.g., while speaking simple commands or sequences of words from a limited vocabulary (e.g., digit sequences for a telephone number). In the more general case, usually referred to as speech understanding, the machine needs to only recognize a limited subset of the user input speech, namely, the speech that specifies enough about the action requested, so that the machine can either respond appropriately, or initiate some action in response to what was understood.
  • Modern Speech Recognition

    Approaches with Case Studies

    • S. Ramakrishnan (Author)
    • 2012 (Publication Date)
    • IntechOpen (Publisher)
    Speech Recognition for Agglutinative Languages, R. Thangarajan (http://dx.doi.org/10.5772/50140)

    1. Introduction

    Speech technology is a broad area comprising many applications, such as speech recognition, text-to-speech (TTS) synthesis, speaker identification and verification, and language identification. Different applications of speech technology impose different constraints on the problem, and these are tackled by different algorithms. In this chapter, the focus is on automatically transcribing speech utterances to text. This process is called Automatic Speech Recognition (ASR). ASR deals with transcribing speech utterances into text of a given language. Even after years of extensive research and development, ASR still remains a challenging field of research. But in recent years, ASR technology has matured to a level where the success rate is higher in certain domains. A well-known example is human-computer interaction, where speech is used as an interface along with or without other pointing devices. ASR is fundamentally a statistical problem. Its objective is to find the most likely sequence of words, called the hypothesis, for a given sequence of observations. The sequence of observations involves acoustic feature vectors representing the speech utterance. The performance of an ASR system can be measured by aligning the hypothesis with the reference text and by counting errors like deletion, insertion and substitution of words in the hypothesis. ASR is a subject involving signal processing and feature extraction, acoustics, information theory, linguistics and computer science. Speech signal processing helps in extracting relevant and discriminative information, called features, from the speech signal in a robust manner. Robustness involves spectral analysis used to characterize time-varying properties of the speech signal and speech enhancement techniques for making features resilient to noise.
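    The alignment-and-error-counting evaluation described above is standard Levenshtein alignment at the word level; the resulting figure is the word error rate, WER = (substitutions + insertions + deletions) / number of reference words. A short Python sketch (not from the chapter):

    # Word error rate by dynamic-programming alignment of the hypothesis
    # against the reference, counting substitutions, insertions, deletions.

    def word_error_rate(reference, hypothesis):
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = minimum edit distance between ref[:i] and hyp[:j]
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i                        # i deletions
        for j in range(len(hyp) + 1):
            d[0][j] = j                        # j insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                d[i][j] = min(
                    d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),  # sub/match
                    d[i - 1][j] + 1,                               # deletion
                    d[i][j - 1] + 1)                               # insertion
        return d[len(ref)][len(hyp)] / len(ref)

    print(word_error_rate("the cat sat on the mat",
                          "the cat sat in the the mat"))  # 2 errors / 6 words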
  • Human Computer Interaction Handbook

    Fundamentals, Evolving Technologies, and Emerging Applications, Third Edition

    • Julie A. Jacko (Author)
    • 2012 (Publication Date)
    • CRC Press (Publisher)
    16 Speech and Language Interfaces, Applications, and Technologies, Clare-Marie Karat, Jennifer Lai, Osamuyimen Stewart, and Nicole Yankelovich

    16.1 INTRODUCTION

    A spoken interface for a computer often emulates human–human interaction by calling on our inherent ability as humans to speak and listen. While human speech is a skill we acquire early and practice frequently, getting computers to map sounds to actions and to respond appropriately with either synthesized or recorded speech is a massive programming undertaking. Because we all speak a little differently from each other, and because the accuracy of the recognition is dependent on an audio signal that can be distorted by many factors, speech technology, like the other recognition technologies, lacks 100% accuracy. When designing a spoken interface, one must design to the strengths and weaknesses of the technology to optimize the overall user experience.
  • Computer Access for People with Disabilities
    • Richard C. Simpson (Author)
    • 2013 (Publication Date)
    • CRC Press (Publisher)
    A great deal of the ease we take for granted in verbal communication goes away when the listener doesn't understand the meaning of what we say [6]. One implication of this lack of understanding is that an ASR system needs to be told when to listen and when not to, because it cannot distinguish between dictation directed toward itself or someone else. A second implication is that ASR may insert wildly inappropriate text into a document. As the processing power and speed of modern computers have increased, ASR software has been able to spend more of its recognition time (the time by which it has to send the results to the page) analyzing context: what other words precede and follow the word in question. While this has improved recognition rates, it has also made it more problematic for those who can only speak one word at a time. ASR is a multistep process (Figure 7.1). Once the user speaks into the microphone, the speech signal is "preprocessed" to convert it into a digital form suitable for computational analysis. The processed speech input is then compared to a model of the user's voice to determine what was said. Finally, the identified speech is transmitted to the computer's operating system or active application [7].

    [Figure 7.1: How ASR works. The analog waveform is low-pass filtered, digitized, and Fourier-transformed into a vector of parameters, which is matched against lexicon, phonetic, word, and grammar models (hidden Markov models) to produce recognized text. Based on Fellbaum, K., and G. Koroupetroglou, Technology and Disability, 20(1), 55–85, 2008 [4]; Rosen, K., and S. Yampolsky, Augmentative and Alternative Communication, 16(1), 48–60, 2000 [7].]

    7.2.1 Signal Preprocessing

    Preprocessing consists of several phases [4, 7]:

    1. The user's raw speech, in the form of an analog wave signal, is captured by a microphone.
    2. A low-pass filter is used to remove high-frequency components from the speech signal.
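    A compressed sketch of the preprocessing pipeline shown in Figure 7.1 and listed above: low-pass filtering, slicing into short frames, and converting each frame to a vector of spectral parameters with a Fourier transform. The cutoff frequency, frame length, and hop size below are illustrative choices, not values from the chapter:

    import numpy as np
    from scipy.signal import butter, lfilter

    def preprocess(waveform, fs=16000, cutoff=7000, frame_ms=25, hop_ms=10):
        # 1. Low-pass filter removes high-frequency components.
        b, a = butter(4, cutoff / (fs / 2), btype="low")
        filtered = lfilter(b, a, waveform)

        # 2. Slice into overlapping short-time frames.
        frame = int(fs * frame_ms / 1000)
        hop = int(fs * hop_ms / 1000)
        frames = [filtered[i:i + frame]
                  for i in range(0, len(filtered) - frame + 1, hop)]

        # 3. One parameter vector per frame: windowed log magnitude spectrum.
        window = np.hamming(frame)
        return np.array([np.log(np.abs(np.fft.rfft(f * window)) + 1e-10)
                         for f in frames])

    # One second of toy "speech": a 440 Hz tone plus a little noise.
    fs = 16000
    t = np.arange(fs) / fs
    signal = np.sin(2 * np.pi * 440 * t) + 0.05 * np.random.randn(fs)
    print(preprocess(signal, fs).shape)  # (number of frames, spectral bins)

    These per-frame parameter vectors are what the later stages compare against the lexicon, word, and grammar models.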
  • The Routledge Handbook of Translation and Technology
    • Minako O'Hagan (Author)
    • 2019 (Publication Date)
    • Routledge (Publisher)
    6 Speech recognition and synthesis technologies in the translation workflow, Dragoș Ciobanu and Alina Secară

    Introduction

    Translation is a 'multi-activity task which can easily cause cognitive overload and stress' (Ehrensberger-Dow and Massey 2014: 60). Translators are constantly making choices that generate decision fatigue, also known as 'ego depletion' (Baumeister et al. 1998). The more attached the translator is to their own work, the more likely this tendency intensifies. For translators, therefore, making the most of all available technologies – including speech tools – to produce a higher quality product faster makes intuitive sense. As we will discuss in this chapter, speech technologies present both opportunities and challenges which need to be considered carefully.
    Translation is mainly grounded in written forms of communication, yet to speak, or not to speak, has become a genuine choice for translators' modus operandi. Automatic Speech Recognition (ASR), also known as speech-to-text (STT) or voice recognition (VR), has both improved significantly and been extended to cover an ever-growing number of languages. For example, at the time of writing, Google Voice supports 119 languages and language varieties.1 In parallel, high-quality computer-generated speech, also known as speech synthesis or text-to-speech (TTS), has become in certain cases almost indistinguishable from human speech (Shen et al. 2017). Speech technologies therefore provide significant scope for inclusion in professional language service workflows.
    This change is also driven by the rapid diversification of translatable content and consumer preferences. An ever-increasing amount of multimedia content accessible through voice commands is often aimed at the new and upcoming generation of Cybrids – those who have been 'exposed to technology since they were born' (MultiLingual 2018: 8). Creating videos is becoming a popular alternative for businesses, replacing 'instructions they're used to delivering through on-screen text with video' (Ray 2017
  • Frontiers in Robotics, Automation and Control
    • Alexander Zemliak (Author)
    • 2008 (Publication Date)
    • IntechOpen (Publisher)
    3 Automatic Speaker Recognition by Speech Signal, Milan Sigmund, Brno University of Technology, Czech Republic

    1. Introduction

    Acoustical communication is one of the fundamental prerequisites for the existence of human society. Textual language has become extremely important in modern life, but speech has dimensions of richness that text cannot approximate. From speech alone, fairly accurate guesses can be made as to whether the speaker is male or female, adult or child. In addition, experts can extract from speech information regarding, e.g., the speaker's state of mind. As computer power increased and knowledge about speech signals improved, research on speech processing became aimed at automated systems for many purposes. Speaker recognition is the complement of speech recognition. Both techniques use similar methods of speech signal processing. In automatic speech recognition, the speech processing approach tries to extract linguistic information from the speech signal to the exclusion of personal information. Conversely, speaker recognition is focused on the characteristics unique to the individual, disregarding the current word spoken. The uniqueness of an individual's voice is a consequence of both the physical features of the person's vocal tract and the person's mental ability to control the muscles in the vocal tract. An ideal speaker recognition system would use only physical features to characterize speakers, since these features cannot be easily changed. However, it is obvious that physical features such as vocal tract dimensions of an unknown speaker cannot be simply measured. Thus, numerical values for physical features or parameters would have to be derived from digital signal processing parameters extracted from the speech signal. Suppose that vocal tracts could be effectively represented by 10 independent physical features, with each feature taking on one of 10 discrete values.
  • Speech Technologies
    • Ivo Ipsic (Author)
    • 2011 (Publication Date)
    • IntechOpen (Publisher)
    12 Wake-Up-Word Speech Recognition, Veton Këpuska, Florida Institute of Technology, ECE Department, Melbourne, Florida 32931, USA

    1. Introduction

    Speech is considered one of the most natural forms of communication between people (Juang & Rabiner, 2005). Spoken language has the unique property that it is naturally learned as part of human development. However, this learning process presents challenges when applied to digital computing systems. The goal of Automatic Speech Recognition (ASR) is to address the problem of building a system that maps an acoustic signal into a string of words. The idea of being able to perform speech recognition from any speaker in any environment is still a problem that is far from being solved. However, recent advancements in the field have resulted in ASR systems that are applicable to some Human Machine Interaction (HMI) tasks. ASR is already being successfully applied in application domains such as telephony (automated caller menus) and monologue transcription for a single speaker. Several motivations for building ASR systems, presented in order of difficulty, are: to improve human-computer interaction through spoken language interfaces, to solve difficult problems such as speech-to-speech translation, and to build intelligent systems that can process spoken language as proficiently as humans. Speech as a computer interface has numerous benefits over traditional interfaces using mouse and keyboard: speech is natural for humans, requires no special training, improves multitasking by leaving the hands and eyes free, and is often faster and more efficient than conventional input methods. In the presented work, the concept of the Wake-Up-Word (WUW) is introduced in Section 2. In Section 3, the definition of the WUW task is presented. The implementation details and experimental evaluations of the WUW-SR system are described in Section 4.
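    At its core, the WUW task is a continuously running detector: score the incoming audio against a single keyword model and fire only when the score clears a threshold. The sketch below is a toy stand-in (template matching by cosine similarity, with an invented threshold); the WUW-SR system described in the chapter is a full recognizer engineered to reject out-of-vocabulary speech:

    import numpy as np

    def keyword_score(window, template):
        """Similarity of an audio window to the stored keyword template."""
        w, t = window.ravel(), template.ravel()
        return float(np.dot(w, t) /
                     (np.linalg.norm(w) * np.linalg.norm(t) + 1e-10))

    def wuw_monitor(features, template, threshold=0.9):
        """Yield frame indices where the wake-up-word is detected."""
        span = template.shape[0]
        for i in range(len(features) - span + 1):
            if keyword_score(features[i:i + span], template) >= threshold:
                yield i

    # Toy data: random feature frames with the "keyword" planted at frame 50.
    rng = np.random.default_rng(0)
    template = rng.standard_normal((20, 13))    # 20 frames x 13 features
    stream = rng.standard_normal((200, 13))
    stream[50:70] = template
    print(list(wuw_monitor(stream, template)))  # -> [50]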
  • Handbook of Natural Language Processing
    • Nitin Indurkhya, Fred J. Damerau (Authors)
    • 2010 (Publication Date)
    For our entire lives, we are exposed to all kinds of speech data from uncontrolled environments, speakers, and topics (i.e., "everyday" speech). Despite this variation in our own personal training data, we are all able to create internal models of speech and language that are remarkably adept at dealing with variation in the speech chain. This ability to generalize is a key aspect of human speech processing that has not yet found its way into modern speech recognizers. Research activities on this topic should produce technology that will operate more effectively in novel circumstances, and that can generalize better from smaller amounts of data. Examples include moving from one acoustic environment to another, different tasks, languages, etc. Another research area could explore how well information gleaned from large-resource languages and/or domains generalizes to smaller-resource languages and domains.

    15.5.4 Developing Speech Recognizers beyond the Language Barrier

    State-of-the-art speech recognition systems today deliver top performance by building complex acoustic and language models using a large collection of domain- and language-specific speech and text examples. This set of language resources is often not readily available for many languages. The challenge here is to create spoken language technologies that are rapidly portable. To prepare for rapid development of such spoken language systems, a new paradigm is needed to study speech and acoustic units that are more language-universal than language-specific phones. Three specific research issues need to be addressed: (1) cross-language acoustic modeling of speech and acoustic units for a new target language, (2) cross-lingual lexical modeling of word pronunciations for a new language, and (3) cross-lingual language modeling.
Index pages curate the most relevant extracts from our library of academic textbooks. They’ve been created using an in-house natural language model (NLM), each adding context and meaning to key research topics.