Speech Synthesis with Mixed Emotions

Abstract

Emotional speech synthesis aims to synthesize human voices with various emotional effects. Current studies mostly focus on imitating an averaged style belonging to a specific emotion type. In this paper, we seek to generate speech with a mixture of emotions at run-time. We propose a novel formulation that measures the relative difference between the speech samples of different emotions. We then incorporate our formulation into a sequence-to-sequence emotional text-to-speech framework. During training, the framework not only explicitly characterizes emotion styles, but also explores the ordinal nature of emotions by quantifying the differences with other emotions. At run-time, we control the model to produce the desired emotion mixture by manually defining an emotion attribute vector. Objective and subjective evaluations validate the effectiveness of the proposed framework. To the best of our knowledge, this research is the first study on modelling, synthesizing and evaluating mixed emotions in speech.
