Robust TTS duration modelling using DNNs

Title Robust TTS duration modelling using DNNs
Publication Type Conference Paper
Authors Henter, G. E., Ronanki, S., Watts, O., Wester, M., Wu, Z., King, S.
Conference Name Proc. ICASSP
Date Published March
Publisher IEEE
Conference Location Shanghai, China
Keywords duration modelling, robust statistics, speech synthesis

Abstract Accurate modelling and prediction of speech-sound durations is an important component in generating more natural synthetic speech. Deep neural networks (DNNs) offer a powerful modelling paradigm, and large, found corpora of natural and expressive speech are easy to acquire for training them. Unfortunately, found datasets are seldom subject to the quality-control that traditional synthesis methods expect. Common issues likely to affect duration modelling include transcription errors, reductions, filled pauses, and forced-alignment inaccuracies. To combat this, we propose to improve modelling and prediction of speech durations using methods from robust statistics, which are able to disregard ill-fitting points in the training material. We describe a robust fitting criterion based on the density power divergence (the beta-divergence) and a robust generation heuristic using mixture density networks (MDNs). Perceptual tests indicate that subjects prefer synthetic speech generated using robust models of duration over the baselines.
Refereed Designation Refereed
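
The robust fitting criterion mentioned in the abstract can be illustrated in miniature. The sketch below is an illustrative assumption, not the paper's implementation (which trains mixture density networks): it minimises the density-power-divergence (beta) loss for a single univariate Gaussian with known scale, using a closed-form Gaussian integral and a simple finite-difference optimiser chosen here for brevity. With beta > 0, ill-fitting points contribute almost nothing to the loss, so the location estimate is far less affected by outliers than the maximum-likelihood mean.

```python
import numpy as np

def beta_loss(mu, sigma, x, beta):
    """Empirical density-power-divergence (beta-divergence) loss for a
    univariate Gaussian N(mu, sigma^2). As beta -> 0 this recovers the
    negative log-likelihood up to constants; beta > 0 down-weights
    low-density (ill-fitting) points."""
    p = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    # Closed-form integral of p^(1+beta) dx for a Gaussian density.
    integral = (2 * np.pi * sigma ** 2) ** (-beta / 2) / np.sqrt(1 + beta)
    return -np.mean(p ** beta) / beta + integral / (1 + beta)

def fit_mu(x, sigma, beta, steps=2000, lr=0.5):
    """Minimise the beta loss over mu by finite-difference gradient descent,
    starting from the median (itself a robust initialiser)."""
    mu, eps = float(np.median(x)), 1e-4
    for _ in range(steps):
        grad = (beta_loss(mu + eps, sigma, x, beta)
                - beta_loss(mu - eps, sigma, x, beta)) / (2 * eps)
        mu -= lr * grad
    return mu

rng = np.random.default_rng(0)
clean = rng.normal(5.0, 1.0, 200)      # well-aligned durations (hypothetical)
outliers = rng.normal(20.0, 1.0, 20)   # e.g. misalignments or filled pauses
x = np.concatenate([clean, outliers])

mle_mean = x.mean()                          # ML estimate, dragged by outliers
robust_mu = fit_mu(x, sigma=1.0, beta=0.5)   # robust estimate stays near 5
```

The two estimates can then be compared directly: the sample mean is pulled towards the outlier cluster, while the beta-divergence fit remains close to the dominant mode of the data, which is the behaviour the abstract's "disregard ill-fitting points" claim refers to.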