Speech Synthesis
- LAST REVIEWED: 25 February 2016
- LAST MODIFIED: 25 February 2016
- DOI: 10.1093/obo/9780199772810-0024
Introduction
Speech synthesis has a long history, going back to early attempts to generate speech- or singing-like sounds from musical instruments. But in the modern age, the field has been driven by one key application: Text-to-Speech (TTS), which means generating speech from text input. Almost universally, this complex problem is divided into two parts. The first problem is the linguistic processing of the text, and this happens in the front end of the system. The problem is hard because text clearly does not contain all the information necessary for reading out loud. So, just as human talkers use their knowledge and experience when reading out loud, machines must also bring additional information to bear on the problem; examples include rules regarding how to expand abbreviations into standard words, or a pronunciation dictionary that converts spelled forms into spoken forms. Many of the techniques currently used for this part of the problem were developed in the 1990s and have only advanced very slowly since then. In general, techniques used in the front end are designed to be applicable to almost any language, although the exact rules or model parameters will depend on the language in question. The output of the front end is a linguistic specification that contains information such as the phoneme sequence and the positions of prosodic phrase breaks. In contrast, the second part of the problem, which is to take the linguistic specification and generate a corresponding synthetic speech waveform, has received a great deal of attention and is where almost all of the exciting work has happened since around 2000. There is far more recent material available on the waveform generation part of the text-to-speech problem than there is on the text processing part. There are two main paradigms currently in use for waveform generation, both of which apply to any language. 
In concatenative synthesis, small snippets of prerecorded speech are carefully chosen from an inventory and rearranged to construct novel utterances. In statistical parametric synthesis, the waveform is converted into two sets of speech parameters: one set captures the vocal tract frequency response (or spectral envelope) and the other set represents the sound source, such as the fundamental frequency and the amount of aperiodic energy. Statistical models are learned from annotated training data and can then be used to generate the speech parameters for novel utterances, given the linguistic specification from the front end. A vocoder is used to convert those speech parameters back to an audible speech waveform.
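To make the front end's role concrete, the following is a minimal sketch, not a description of any real system. It assumes a tiny hand-written abbreviation table and pronunciation dictionary (ABBREVIATIONS and LEXICON are hypothetical, as are the phoneme symbols) and treats punctuation as the only cue for prosodic phrase breaks. The output is a toy linguistic specification: one phoneme sequence per prosodic phrase.

```python
# Hypothetical toy resources; a real front end would use far larger ones
# and would fall back on letter-to-sound rules for unknown words.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street"}
LEXICON = {
    "doctor": ["d", "aa", "k", "t", "er"],
    "smith": ["s", "m", "ih", "th"],
    "on": ["aa", "n"],
    "main": ["m", "ey", "n"],
    "street": ["s", "t", "r", "iy", "t"],
}
PUNCTUATION = ".,;:!?"


def front_end(text):
    """Return a list of prosodic phrases, each a list of phoneme symbols."""
    phrases, current = [], []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            # Expand the abbreviation; its period is not a phrase break.
            word, is_break = ABBREVIATIONS[token], False
        else:
            # Trailing punctuation is taken as a phrase break (an assumption).
            word = token.rstrip(PUNCTUATION)
            is_break = word != token
        if word not in LEXICON:
            raise KeyError(f"no pronunciation for {word!r}")
        current.extend(LEXICON[word])
        if is_break:
            phrases.append(current)
            current = []
    if current:
        phrases.append(current)
    return phrases


spec = front_end("Dr. Smith, on Main St.")
for phrase in spec:
    print(" ".join(phrase))
```

Even this toy version shows why text alone is insufficient: the period in "Dr." must be told apart from a sentence-final period, and the spelled form "Main" gives no direct indication of its pronunciation.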
Textbooks, Edited Collections, Surveys, and Introductions
Steady progress in synthesis since around 1990, and the especially rapid progress in the early 21st century, is a challenge for textbooks. Taylor 2009 provides the most up-to-date entry point to this field and is an excellent starting point for students at all levels. For a wider-ranging textbook that also provides coverage of Natural Language Processing and Automatic Speech Recognition, Jurafsky and Martin 2009 is also excellent. For those without an electrical engineering background, the chapter by Ellis giving “An Introduction to Signal Processing for Speech” in Hardcastle, et al. 2010 is essential background reading, since most other texts are aimed at readers with some previous knowledge of signal processing. Most of the advances in the field since around 2000 have been in the statistical parametric paradigm. No current textbook covers this subject in sufficient depth. King 2011 gives a short and simple introduction to some of the main concepts, and Taylor 2009 contains one relatively brief chapter. For more technical depth, it is necessary to venture beyond textbooks, and the tutorial article Tokuda, et al. 2013 is the best place to start, followed by the more technical article Zen, et al. 2009. Some older books, such as Dutoit 1997, still contain relevant material, especially in their treatment of the text processing part of the problem. Sproat’s comment that “text-analysis has not received anything like half the attention of the synthesis community” (p. 73) in his introduction to text processing in van Santen, et al. 1997 is still true, and Yarowsky’s chapter on homograph disambiguation in the same volume still represents a standard solution to that particular problem. Similarly, the modular system architecture described by Sproat and Olive in that volume is still the standard way of configuring a text-to-speech system.
Dutoit, Thierry. 1997. An introduction to text-to-speech synthesis. Norwell, MA: Kluwer Academic.
DOI: 10.1007/978-94-011-5730-8
Starting to get dated, but still contains useful material.
Hardcastle, W. J., J. Laver, and F. E. Gibbon. 2010. The handbook of phonetic sciences. Blackwell Handbooks in Linguistics. Oxford: Wiley-Blackwell.
A wealth of information, one highlight being the excellent chapter by Ellis introducing speech signal processing to readers with minimal technical background. The chapter on speech synthesis is too dated. Other titles in this series are worth consulting, such as the one on speech perception.
Jurafsky, D., and J. H. Martin. 2009. Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition. 2d ed. Upper Saddle River, NJ: Prentice Hall.
A complete course in speech and language processing, very widely used for teaching at advanced undergraduate and graduate levels. The authors have a free online video lecture course covering the Natural Language Processing parts. A third edition of the book is expected.
King, S. 2011. An introduction to statistical parametric speech synthesis. Sadhana 36.5: 837–852.
DOI: 10.1007/s12046-011-0048-y
A gentle and nontechnical introduction to this topic, designed to be accessible to readers from any background. Should be read before attempting the more advanced material.
Taylor, P. 2009. Text-to-speech synthesis. Cambridge, UK: Cambridge Univ. Press.
The most comprehensive and authoritative textbook ever written on the subject. The content is still up-to-date and highly relevant. Of course, developments since 2009—such as advanced techniques for HMM-based synthesis and the resurgence of Neural Networks—are not covered.
Tokuda, K., Y. Nankaku, T. Toda, H. Zen, J. Yamagishi, and K. Oura. 2013. Speech synthesis based on Hidden Markov Models. Proceedings of the IEEE 101.5: 1234–1252.
DOI: 10.1109/JPROC.2013.2251852
A tutorial article covering the main concepts of statistical parametric speech synthesis using Hidden Markov Models. Also touches on singing synthesis and controllable models.
van Santen, J. P. H., R. W. Sproat, J. P. Olive, and J. Hirschberg, eds. 1997. Progress in speech synthesis. New York: Springer.
Covering most aspects of text-to-speech, but now dated. Material that remains relevant: Yarowsky on homograph disambiguation; Sproat’s introduction to the Linguistic Analysis section; Campbell and Black’s inclusion of prosody in the unit selection target cost, to minimize the need for subsequent signal processing (implementation details no longer relevant).
Zen, H., K. Tokuda, and A. W. Black. 2009. Statistical parametric speech synthesis. Speech Communication 51.11: 1039–1064.
DOI: 10.1016/j.specom.2009.04.004
Written before the resurgence of neural networks, this is an authoritative and technical introduction to HMM-based statistical parametric speech synthesis.