The Theory behind Controllable Expressive Speech Synthesis: a Cross-disciplinary Approach

Abstract

As part of the Human-Computer Interaction field, expressive speech synthesis is a very rich domain, as it requires knowledge in areas such as machine learning, signal processing, sociology, and psychology. In this Chapter, we focus mostly on the technical side. From the recording of expressive speech to its modeling, the reader will get an overview of the main paradigms used in this field, through some of the most prominent systems and methods. We explain how speech can be represented and encoded with audio features. We present a history of the main methods of Text-to-Speech synthesis: concatenative, parametric, and statistical parametric speech synthesis. Finally, we focus on the last of these, and in particular on the most recent techniques, which model Text-to-Speech synthesis as a sequence-to-sequence problem. This formulation enables the use of Deep Learning blocks such as Convolutional and Recurrent Neural Networks, as well as Attention Mechanisms. The last part of the Chapter assembles the different aspects of the theory and summarizes the concepts.
