Title | Robust TTS duration modelling using DNNs |
Publication Type | Conference Paper |
Year of Publication | 2016 |
Authors | Henter, GEje, Ronanki, S, Watts, O, Wester, M, Wu, Z, King, S |
Conference Name | Proc. ICASSP |
Date Published | March |
Publisher | IEEE |
Conference Location | Shanghai, China |
Keywords | duration modelling, robust statistics, speech synthesis |
Abstract | Accurate modelling and prediction of speech-sound durations is an important component in generating more natural synthetic speech. Deep neural networks (DNNs) offer a powerful modelling paradigm, and large, found corpora of natural and expressive speech are easy to acquire for training them. Unfortunately, found datasets are seldom subject to the quality control that traditional synthesis methods expect. Common issues likely to affect duration modelling include transcription errors, reductions, filled pauses, and forced-alignment inaccuracies. To combat this, we propose to improve modelling and prediction of speech durations using methods from robust statistics, which are able to disregard ill-fitting points in the training material. We describe a robust fitting criterion based on the density power divergence (the beta-divergence) and a robust generation heuristic using mixture density networks (MDNs). Perceptual tests indicate that subjects prefer synthetic speech generated using robust models of duration over the baselines. |
URL | http://homepages.inf.ed.ac.uk/ghenter/pubs/henter2016robust.pdf |
Refereed Designation | Refereed |
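
The abstract names a fitting criterion based on the density power divergence. As a rough illustration of why such a criterion can disregard ill-fitting points, the sketch below evaluates the standard empirical density power divergence objective for a single univariate Gaussian and fits only its mean by grid search. This is not the paper's DNN or MDN setup: the fixed standard deviation, the value of beta, the grid, the function names, and the toy duration values are all assumptions made up for illustration.

```python
import numpy as np

def gaussian_dpd_loss(x, mu, sigma, beta):
    """Empirical density power divergence objective for N(mu, sigma^2), beta > 0."""
    x = np.asarray(x, dtype=float)
    var = sigma ** 2
    # Closed-form integral of f^(1 + beta) for a Gaussian density f.
    integral = (2.0 * np.pi * var) ** (-beta / 2.0) / np.sqrt(1.0 + beta)
    # Model density evaluated at each observation.
    dens = np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)
    # Points with very low density contribute almost nothing to this mean,
    # which is what limits the influence of outliers.
    return integral - (1.0 + 1.0 / beta) * np.mean(dens ** beta)

def fit_location(x, sigma, beta, grid):
    """Pick the candidate mean on the grid that minimises the objective (sigma held fixed)."""
    losses = [gaussian_dpd_loss(x, m, sigma, beta) for m in grid]
    return float(grid[int(np.argmin(losses))])

# Hypothetical toy durations: eight plausible values and two gross outliers,
# standing in for something like alignment errors.
durations = np.array([9.0, 10.0, 10.0, 11.0, 10.0, 9.0, 11.0, 10.0, 60.0, 75.0])
grid = np.linspace(0.0, 80.0, 801)

robust_mu = fit_location(durations, sigma=1.0, beta=0.5, grid=grid)
mean_mu = float(durations.mean())  # maximum-likelihood estimate of the mean

print("robust estimate:", robust_mu)
print("sample mean:", mean_mu)
```

In this toy setting the sample mean is pulled towards the two outliers, while the robust estimate stays with the main cluster. Smaller values of beta move the criterion closer to ordinary maximum likelihood, so beta trades statistical efficiency against robustness.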