|Title||Robust TTS duration modelling using DNNs|
|Publication Type||Conference Paper|
|Authors||Henter, G. Eje, Ronanki, S, Watts, O, Wester, M, Wu, Z, King, S|
|Conference Name||Proc. ICASSP|
|Conference Location||Shanghai, China|
|Keywords||duration modelling, robust statistics, speech synthesis|
Accurate modelling and prediction of speech-sound durations are important for generating more natural synthetic speech. Deep neural networks (DNNs) offer a powerful modelling paradigm, and large, found corpora of natural and expressive speech are easy to acquire for training them. Unfortunately, found datasets are seldom subject to the quality control that traditional synthesis methods expect. Common issues likely to affect duration modelling include transcription errors, reductions, filled pauses, and forced-alignment inaccuracies. To combat this, we propose to improve the modelling and prediction of speech durations using methods from robust statistics, which are able to disregard ill-fitting points in the training material. We describe a robust fitting criterion based on the density power divergence (the beta-divergence) and a robust generation heuristic using mixture density networks (MDNs). Perceptual tests indicate that listeners prefer synthetic speech generated using robust duration models over the baselines.
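To illustrate the kind of robust fitting criterion the abstract refers to, here is a minimal sketch (not the paper's code) of density power divergence (beta-divergence) estimation for a single univariate Gaussian. The data, the choice of beta = 0.3, and the optimiser are illustrative assumptions; the point is only that the criterion downweights ill-fitting points (outliers) relative to maximum likelihood.

```python
import numpy as np
from scipy.optimize import minimize

def dpd_loss(params, x, beta):
    """Empirical density power divergence loss for a Gaussian N(mu, sigma^2).

    L_beta(theta) = int f_theta^{1+beta} dx - (1 + 1/beta) * mean(f_theta(x_i)^beta)
    For a Gaussian the integral has the closed form
    (1 + beta)^{-1/2} * (2*pi*sigma^2)^{-beta/2}.
    """
    mu, log_sigma = params
    sigma = np.exp(log_sigma)  # parameterise via log to keep sigma positive
    integral = (1.0 + beta) ** -0.5 * (2.0 * np.pi * sigma**2) ** (-beta / 2.0)
    dens = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)
    return integral - (1.0 + 1.0 / beta) * np.mean(dens**beta)

# Synthetic data: mostly clean points, plus a cluster of gross outliers
# (standing in for, e.g., forced-alignment errors in found speech data).
rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=950)
outliers = rng.normal(8.0, 0.5, size=50)
x = np.concatenate([clean, outliers])

res = minimize(dpd_loss, x0=[np.median(x), 0.0], args=(x, 0.3))
mu_robust = res.x[0]

# The maximum-likelihood estimate of the mean is the sample mean, which the
# outliers drag upwards; the DPD estimate stays close to the clean centre.
print(f"sample mean: {np.mean(x):.2f}, robust mean: {mu_robust:.2f}")
```

Because the density of the model at an outlier is tiny, its contribution `f(x_i)^beta` to the loss is negligible, so gross outliers barely influence the fit; as beta tends to 0 the criterion recovers ordinary maximum likelihood.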