Robust TTS duration modelling using DNNs

Fri, 02/12/2016 - 12:26 — ghenter

Title	Robust TTS duration modelling using DNNs
Publication Type	Conference Paper
Year of Publication	2016
Authors	Henter, GE, Ronanki, S, Watts, O, Wester, M, Wu, Z, King, S
Conference Name	Proc. ICASSP
Date Published	March
Publisher	IEEE
Conference Location	Shanghai, China
Keywords	duration modelling, robust statistics, speech synthesis
Abstract	Accurate modelling and prediction of speech-sound durations is an important component in generating more natural synthetic speech. Deep neural networks (DNNs) offer a powerful modelling paradigm, and large, found corpora of natural and expressive speech are easy to acquire for training them. Unfortunately, found datasets are seldom subject to the quality control that traditional synthesis methods expect. Common issues likely to affect duration modelling include transcription errors, reductions, filled pauses, and forced-alignment inaccuracies. To combat these issues, we propose to improve modelling and prediction of speech durations using methods from robust statistics, which are able to disregard ill-fitting points in the training material. We describe a robust fitting criterion based on the density power divergence (the beta-divergence) and a robust generation heuristic using mixture density networks (MDNs). Perceptual tests indicate that subjects prefer synthetic speech generated using robust models of duration over the baselines.
URL	http://homepages.inf.ed.ac.uk/ghenter/pubs/henter2016robust.pdf
Refereed Designation	Refereed
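
The abstract mentions a fitting criterion based on the density power divergence. As a rough illustration of why such a criterion can disregard ill-fitting points, the sketch below fits the mean of a one-dimensional Gaussian to contaminated data. It is a toy example, not the paper's implementation: the fixed standard deviation, the value beta = 0.5, the synthetic data, and the function names `dpd_loss` and `robust_mean` are all assumptions made for illustration, and no neural network is involved.

```python
import numpy as np

def dpd_loss(x, mu, sigma, beta):
    """Empirical density power divergence loss for N(mu, sigma^2), beta > 0."""
    dens = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    # Closed form of the integral of the Gaussian density raised to (1 + beta).
    integral = (2 * np.pi * sigma ** 2) ** (-beta / 2) / np.sqrt(1 + beta)
    return integral - (1 + 1 / beta) * np.mean(dens ** beta)

def robust_mean(x, sigma=1.0, beta=0.5, n_iter=50):
    """Fixed-point iteration for the mean that minimises dpd_loss (sigma fixed).

    Each point is weighted by its model density raised to the power beta,
    so points far from the current estimate receive almost no weight.
    """
    mu = np.median(x)  # assumed starting point
    for _ in range(n_iter):
        w = np.exp(-0.5 * beta * ((x - mu) / sigma) ** 2)
        mu = np.sum(w * x) / np.sum(w)
    return mu

# Synthetic data (assumed): 95 well-behaved points plus 5 gross outliers,
# standing in for durations corrupted by, e.g., alignment errors.
rng = np.random.default_rng(0)
durations = np.concatenate([rng.normal(0.0, 1.0, 95), np.full(5, 50.0)])

plain = durations.mean()           # pulled towards the outliers
robust = robust_mean(durations)    # stays close to the bulk of the data
print(f"plain mean: {plain:.2f}, robust mean: {robust:.2f}")
```

In this sketch the weights shrink exponentially with squared distance from the current estimate, so the five outliers contribute essentially nothing to the robust estimate. As beta approaches zero the weights become uniform and the estimate approaches the ordinary sample mean.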
