Cross-speaker Style Transfer with Prosody Bottleneck in Neural Speech Synthesis
Shifeng Pan, Lei He
Microsoft, China
Abstract:
Cross-speaker style transfer is crucial to applying multi-style and expressive speech synthesis
at scale, because it does not require the target speakers to be skilled in expressing every style or to record
corresponding data for model training. However, the performance of existing style transfer methods still falls far short of real
application needs. There are two main root causes. First, the style embedding extracted from a single reference
utterance can hardly provide fine-grained and appropriate prosody information for arbitrary text to be synthesized. Second,
in these models the content/text, prosody, and speaker timbre are usually highly entangled, so it is unrealistic
to expect a satisfactory result when freely combining these components, for example when transferring speaking style between speakers.
In this paper, we propose a cross-speaker style transfer text-to-speech (TTS) model with an explicit prosody bottleneck.
The prosody bottleneck robustly builds up the kernels that account for speaking style and disentangles the prosody from
content and speaker timbre, thereby guaranteeing high-quality cross-speaker style transfer. Evaluation results show that the
proposed method achieves on-par performance with the source speaker’s speaker-dependent (SD) model in objective measurement
of prosody, and significantly outperforms the cycle consistency and GMVAE-based baselines in objective and subjective evaluations.
Spk-A_Rec: Speaker A’s recording, held out for testing.
Spk-A_SD: Speaker A’s SD model, viewed as the upper bound for style evaluation.
Spk-B_SD: Speaker B’s SD model, viewed as the lower bound for style evaluation.
Spk-B_Trans_CC: A Transformer TTS version of the cycle consistency loss enhanced method in [11].
Spk-B_Trans_GMVAE: GMVAE-based style transfer model. Similar to [9], variational inference is
introduced into Transformer TTS, but a Gaussian mixture model is used instead of a single Gaussian as the prior
distribution. Each mixture component is explicitly tied to one style. At inference, the mean of the component
corresponding to the target style is used.
Spk-B_Trans_Pros: The proposed model.
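The inference rule described for the Spk-B_Trans_GMVAE baseline (one prior component per style, with the component mean used at synthesis time) might be sketched roughly as follows. This is only an illustration under stated assumptions: the style names, the latent dimension, the random placeholder parameters, and the helper names `make_prior` and `inference_latent` are all hypothetical and are not taken from the paper or from [9].

```python
import random

# Hypothetical style inventory and latent size; the source specifies neither.
STYLES = ["happy", "angry", "sad"]
LATENT_DIM = 4


def make_prior(styles, dim, seed=0):
    """Build one Gaussian component (mean, std) per style.

    In a trained model these parameters would be learned; here they are
    random placeholders so the sketch runs on its own.
    """
    rng = random.Random(seed)
    return {
        style: {
            "mean": [rng.uniform(-1.0, 1.0) for _ in range(dim)],
            "std": [1.0] * dim,
        }
        for style in styles
    }


def inference_latent(prior, target_style):
    """Return the latent used at inference: the component mean, no sampling."""
    return list(prior[target_style]["mean"])


prior = make_prior(STYLES, LATENT_DIM)
z_happy = inference_latent(prior, "happy")
```

Using the mean rather than a sample makes the style latent deterministic for a given target style, so it carries no utterance-specific variation.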
Exp-1 Emotion transfer with models trained from scratch
Happy
(1) No , no hahaha . I'm asking if you did unfollow me , cause I was notified you just followed me.
Spk-A_Rec
Spk-A_SD
Spk-B_SD
Spk-B_Trans_CC
Spk-B_Trans_GMVAE
Spk-B_Trans_Pros
(2) Aw , I love it ! ! you should put it as your header .
Spk-A_Rec
Spk-A_SD
Spk-B_SD
Spk-B_Trans_CC
Spk-B_Trans_GMVAE
Spk-B_Trans_Pros
(3) Okay ... mum says to let her know how it goes , when she does . we are always including you and her in our nightly prayers .
Spk-A_Rec
Spk-A_SD
Spk-B_SD
Spk-B_Trans_CC
Spk-B_Trans_GMVAE
Spk-B_Trans_Pros
(4) True , been there done that . but at least they kind of got it .
Spk-A_Rec
Spk-A_SD
Spk-B_SD
Spk-B_Trans_CC
Spk-B_Trans_GMVAE
Spk-B_Trans_Pros
(5) Not complaining . I'm actually quite proud .
Spk-A_Rec
Spk-A_SD
Spk-B_SD
Spk-B_Trans_CC
Spk-B_Trans_GMVAE
Spk-B_Trans_Pros
Angry
(1) Get off Whisper and take your damn shower !
Spk-A_Rec
Spk-A_SD
Spk-B_SD
Spk-B_Trans_CC
Spk-B_Trans_GMVAE
Spk-B_Trans_Pros
(2) Stupid accident held me up with traffic and now I'm running to be on time for work .
Spk-A_Rec
Spk-A_SD
Spk-B_SD
Spk-B_Trans_CC
Spk-B_Trans_GMVAE
Spk-B_Trans_Pros
(3) Those evil people just causing whale in the country .
Spk-A_Rec
Spk-A_SD
Spk-B_SD
Spk-B_Trans_CC
Spk-B_Trans_GMVAE
Spk-B_Trans_Pros
(4) Damn gardening ghosts . Get away from my walls !
Spk-A_Rec
Spk-A_SD
Spk-B_SD
Spk-B_Trans_CC
Spk-B_Trans_GMVAE
Spk-B_Trans_Pros
(5) Don't be taking our money !
Spk-A_Rec
Spk-A_SD
Spk-B_SD
Spk-B_Trans_CC
Spk-B_Trans_GMVAE
Spk-B_Trans_Pros
Sad
(1) I can't see far away and I lost it in tall grass .
Spk-A_Rec
Spk-A_SD
Spk-B_SD
Spk-B_Trans_CC
Spk-B_Trans_GMVAE
Spk-B_Trans_Pros
(2) I'm actually not making them anymore .
Spk-A_Rec
Spk-A_SD
Spk-B_SD
Spk-B_Trans_CC
Spk-B_Trans_GMVAE
Spk-B_Trans_Pros
(3) We'll see if she will go back to him . I have an ugly feeling that she probably will .
Spk-A_Rec
Spk-A_SD
Spk-B_SD
Spk-B_Trans_CC
Spk-B_Trans_GMVAE
Spk-B_Trans_Pros
(4) No , I know . Absolutely horrible stuff .
Spk-A_Rec
Spk-A_SD
Spk-B_SD
Spk-B_Trans_CC
Spk-B_Trans_GMVAE
Spk-B_Trans_Pros
(5) No , I'm so sick and tired of it .
Spk-A_Rec
Spk-A_SD
Spk-B_SD
Spk-B_Trans_CC
Spk-B_Trans_GMVAE
Spk-B_Trans_Pros
 
Exp-2 Style transfer by onboarding new source and target speakers
Spk-A chat style to Spk-B
(1) Even if we can't , we'll always find a way . Oh my God , I have the worst cold .
Spk-A_Rec
Spk-A_SD
Spk-B_SD
Spk-B_Trans_GMVAE
Spk-B_Trans_Pros
(2) Very cute . She is cute . Actually I look like her .
Spk-A_Rec
Spk-A_SD
Spk-B_SD
Spk-B_Trans_GMVAE
Spk-B_Trans_Pros
(3) Well I am a beach person . We'll get used to it .
Spk-A_Rec
Spk-A_SD
Spk-B_SD
Spk-B_Trans_GMVAE
Spk-B_Trans_Pros
(4) You will always have my heart . In fact you are my heart , my soul , my everything .
Spk-A_Rec
Spk-A_SD
Spk-B_SD
Spk-B_Trans_GMVAE
Spk-B_Trans_Pros
(5) Today the lucky color for Aquarius is gray and the lucky number is zero .
Spk-A_Rec
Spk-A_SD
Spk-B_SD
Spk-B_Trans_GMVAE
Spk-B_Trans_Pros
 
Exp-3 Prosody control on the proposed model
(1) Even if we can't , we'll always find a way . Oh my God , I have the worst cold .
No control
Pitch-up-30Hz
Pitch-down-30Hz
Rate-up-30-cent
Rate-down-30-cent
Energy-up-50-cent
Energy-down-50-cent
(2) Well I am a beach person . We'll get used to it .
No control
Pitch-up-30Hz
Pitch-down-30Hz
Rate-up-30-cent
Rate-down-30-cent
Energy-up-50-cent
Energy-down-50-cent
(3) Today the lucky color for Aquarius is gray and the lucky number is zero .
No control
Pitch-up-30Hz
Pitch-down-30Hz
Rate-up-30-cent
Rate-down-30-cent
Energy-up-50-cent
Energy-down-50-cent
(4) I hate this . Just kidding , how can I hate this ?