Cross-speaker Style Transfer with Prosody Bottleneck in Neural Speech Synthesis

Shifeng Pan, Lei He
Microsoft, China
Abstract: Cross-speaker style transfer is crucial for deploying multi-style and expressive speech synthesis at scale. It removes the need for target speakers to be expert at expressing every style and for corresponding recordings to be collected for model training. However, the performance of existing style transfer methods still falls far short of real application needs. There are two main root causes. First, a style embedding extracted from a single reference utterance can hardly provide fine-grained, appropriate prosody information for arbitrary text to be synthesized. Second, in these models the content (text), prosody, and speaker timbre are usually highly entangled, so it is unrealistic to expect a satisfactory result when freely recombining these components, for example when transferring speaking style between speakers. In this paper, we propose a cross-speaker style transfer text-to-speech (TTS) model with an explicit prosody bottleneck. The prosody bottleneck robustly builds up the kernels that account for speaking style and disentangles prosody from content and speaker timbre, thereby guaranteeing high-quality cross-speaker style transfer. Evaluation results show that the proposed method achieves on-par performance with the source speaker’s speaker-dependent (SD) model in objective measurements of prosody, and significantly outperforms the cycle consistency and GMVAE-based baselines in both objective and subjective evaluations.

Spk-A_Rec: Speaker A’s recording, held out for test.
Spk-A_SD: Speaker A’s SD model, treated as the upper bound for style evaluation.
Spk-B_SD: Speaker B’s SD model, treated as the lower bound for style evaluation.
Spk-B_Trans_CC: A Transformer TTS version of the cycle consistency loss enhanced method in [11].
Spk-B_Trans_GMVAE: GMVAE-based style transfer model. Similar to [9], variational inference is introduced into Transformer TTS, but a Gaussian mixture model is used instead of a single Gaussian as the prior distribution. Each mixture component is explicitly tied to one style. At inference, the mean of the component corresponding to the target style is used.
Spk-B_Trans_Pros: The proposed model.
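
The inference rule described for Spk-B_Trans_GMVAE (one prior component per style, with the target style's component mean used as the latent) can be sketched roughly as follows. This is only an illustrative sketch, not the baseline's actual implementation: the style labels, the latent size, the random stand-in values for the learned prior, and the helper name `style_latent` are all assumptions made for the example.

```python
import random

# Hypothetical style labels and latent size; this page does not specify them.
STYLES = ["happy", "angry", "sad"]
LATENT_DIM = 4

_rng = random.Random(0)
# One Gaussian component per style. Random values stand in for the
# prior means and standard deviations a trained model would have learned.
PRIOR_MEAN = {s: [_rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)] for s in STYLES}
PRIOR_STD = {s: [1.0] * LATENT_DIM for s in STYLES}


def style_latent(style, sample=False):
    """Return the latent vector that would condition the decoder for `style`."""
    mean = PRIOR_MEAN[style]
    if not sample:
        # Inference as described above: use the component mean directly.
        return list(mean)
    # Optional variant (assumption): draw from the component instead.
    return [m + s * _rng.gauss(0.0, 1.0) for m, s in zip(mean, PRIOR_STD[style])]


z_sad = style_latent("sad")
```

Using the mean rather than a sample makes the style latent deterministic for a given target style, so repeated synthesis of the same text yields the same conditioning vector.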

Exp-1 Emotion transfer with models trained from scratch

Happy

(1) No , no hahaha . I'm asking if you did unfollow me , cause I was notified you just followed me.
Spk-A_Rec Spk-A_SD
Spk-B_SD Spk-B_Trans_CC Spk-B_Trans_GMVAE Spk-B_Trans_Pros

(2) Aw , I love it ! ! you should put it as your header .
Spk-A_Rec Spk-A_SD
Spk-B_SD Spk-B_Trans_CC Spk-B_Trans_GMVAE Spk-B_Trans_Pros

(3) Okay ... mum says to let her know how it goes , when she does . we are always including you and her in our nightly prayers .
Spk-A_Rec Spk-A_SD
Spk-B_SD Spk-B_Trans_CC Spk-B_Trans_GMVAE Spk-B_Trans_Pros

(4) True , been there done that . but at least they kind of got it .
Spk-A_Rec Spk-A_SD
Spk-B_SD Spk-B_Trans_CC Spk-B_Trans_GMVAE Spk-B_Trans_Pros

(5) Not complaining . I'm actually quite proud .
Spk-A_Rec Spk-A_SD
Spk-B_SD Spk-B_Trans_CC Spk-B_Trans_GMVAE Spk-B_Trans_Pros

Angry

(1) Get off Whisper and take your damn shower !
Spk-A_Rec Spk-A_SD
Spk-B_SD Spk-B_Trans_CC Spk-B_Trans_GMVAE Spk-B_Trans_Pros

(2) Stupid accident held me up with traffic and now I'm running to be on time for work .
Spk-A_Rec Spk-A_SD
Spk-B_SD Spk-B_Trans_CC Spk-B_Trans_GMVAE Spk-B_Trans_Pros

(3) Those evil people just causing whale in the country .
Spk-A_Rec Spk-A_SD
Spk-B_SD Spk-B_Trans_CC Spk-B_Trans_GMVAE Spk-B_Trans_Pros

(4) Damn gardening ghosts . Get away from my walls !
Spk-A_Rec Spk-A_SD
Spk-B_SD Spk-B_Trans_CC Spk-B_Trans_GMVAE Spk-B_Trans_Pros

(5) Don't be taking our money !
Spk-A_Rec Spk-A_SD
Spk-B_SD Spk-B_Trans_CC Spk-B_Trans_GMVAE Spk-B_Trans_Pros

Sad

(1) I can't see far away and I lost it in tall grass .
Spk-A_Rec Spk-A_SD
Spk-B_SD Spk-B_Trans_CC Spk-B_Trans_GMVAE Spk-B_Trans_Pros

(2) I'm actually not making them anymore .
Spk-A_Rec Spk-A_SD
Spk-B_SD Spk-B_Trans_CC Spk-B_Trans_GMVAE Spk-B_Trans_Pros

(3) We'll see if she will go back to him . I have an ugly feeling that she probably will .
Spk-A_Rec Spk-A_SD
Spk-B_SD Spk-B_Trans_CC Spk-B_Trans_GMVAE Spk-B_Trans_Pros

(4) No , I know . Absolutely horrible stuff .
Spk-A_Rec Spk-A_SD
Spk-B_SD Spk-B_Trans_CC Spk-B_Trans_GMVAE Spk-B_Trans_Pros

(5) No , I'm so sick and tired of it .
Spk-A_Rec Spk-A_SD
Spk-B_SD Spk-B_Trans_CC Spk-B_Trans_GMVAE Spk-B_Trans_Pros
 

Exp-2 Style transfer by onboarding new source and target speakers

Spk-A chat style to Spk-B

(1) Even if we can't , we'll always find a way . Oh my God , I have the worst cold .
Spk-A_Rec Spk-A_SD Spk-B_SD Spk-B_Trans_GMVAE Spk-B_Trans_Pros

(2) Very cute . She is cute . Actually I look like her .
Spk-A_Rec Spk-A_SD Spk-B_SD Spk-B_Trans_GMVAE Spk-B_Trans_Pros

(3) Well I am a beach person . We'll get used to it .
Spk-A_Rec Spk-A_SD Spk-B_SD Spk-B_Trans_GMVAE Spk-B_Trans_Pros

(4) You will always have my heart . In fact you are my heart , my soul , my everything .
Spk-A_Rec Spk-A_SD Spk-B_SD Spk-B_Trans_GMVAE Spk-B_Trans_Pros

(5) Today the lucky color for Aquarius is gray and the lucky number is zero .
Spk-A_Rec Spk-A_SD Spk-B_SD Spk-B_Trans_GMVAE Spk-B_Trans_Pros
 

Exp-3 Prosody control on the proposed model

(1) Even if we can't , we'll always find a way . Oh my God , I have the worst cold .
No control
Pitch-up-30Hz Pitch-down-30Hz Rate-up-30-cent Rate-down-30-cent Energy-up-50-cent Energy-down-50-cent

(2) Well I am a beach person . We'll get used to it .
No control
Pitch-up-30Hz Pitch-down-30Hz Rate-up-30-cent Rate-down-30-cent Energy-up-50-cent Energy-down-50-cent

(3) Today the lucky color for Aquarius is gray and the lucky number is zero .
No control
Pitch-up-30Hz Pitch-down-30Hz Rate-up-30-cent Rate-down-30-cent Energy-up-50-cent Energy-down-50-cent

(4) I hate this . Just kidding , how can I hate this ?
No control
Pitch-up-30Hz Pitch-down-30Hz Rate-up-30-cent Rate-down-30-cent Energy-up-50-cent Energy-down-50-cent
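
As a rough illustration of what a label such as Pitch-up-30Hz could mean at the signal level, the sketch below adds a constant offset in Hz to the voiced frames of a frame-level F0 contour. It is a minimal sketch under stated assumptions, not the proposed model's control mechanism: the page does not say where or how the offset is applied, and the function name `shift_pitch`, the convention that 0.0 marks an unvoiced frame, and the `floor_hz` parameter are all hypothetical. Only the pitch labels are sketched because they state their unit explicitly.

```python
def shift_pitch(f0_hz, offset_hz, floor_hz=1.0):
    """Shift voiced frames of an F0 contour by a constant offset in Hz.

    Assumption: a value of 0.0 marks an unvoiced frame and is left unchanged.
    Shifted voiced values are kept at or above floor_hz so that a large
    downward shift cannot turn a voiced frame into an unvoiced one.
    """
    shifted = []
    for f0 in f0_hz:
        if f0 > 0.0:
            shifted.append(max(f0 + offset_hz, floor_hz))
        else:
            shifted.append(0.0)
    return shifted


contour = [0.0, 200.0, 220.0, 0.0, 180.0]   # made-up example contour
pitch_up = shift_pitch(contour, 30.0)        # analogous to Pitch-up-30Hz
pitch_down = shift_pitch(contour, -30.0)     # analogous to Pitch-down-30Hz
```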