Title: A Deep Generative Architecture for Postfiltering in Statistical Parametric Speech Synthesis
Publication Type: Journal Article
Year of Publication: 2015
Authors: Chen, L.-H., Raitio, T., Valentini-Botinhao, C., Ling, Z., Yamagishi, J.
Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Keywords: deep generative architecture, HMM, modulation spectrum, postfilter, segmental quality, speech synthesis
The speech generated by hidden Markov model (HMM)-based statistical parametric speech synthesis still sounds muffled. One cause of this degradation in quality may be the loss of fine spectral structure. In this paper, we propose using a deep generative architecture, a generatively trained deep neural network (DNN), as a postfilter. The network models the conditional probability of the spectrum of natural speech given that of synthetic speech, compensating for the gap between synthetic and natural speech. The proposed probabilistic postfilter is trained generatively by cascading two restricted Boltzmann machines (RBMs) or deep belief networks (DBNs) with one bidirectional associative memory (BAM). We devised two types of DNN postfilter: one operating in the mel-cepstral domain and the other in the higher-dimensional spectral domain. We compare these two data-driven postfilters with other postfilters currently used in speech synthesis: a fixed mel-cepstral postfilter, global variance-based parameter generation, and modulation spectrum-based enhancement. Subjective evaluations using the synthetic voices of a male and a female speaker confirmed that the proposed DNN-based postfilter in the spectral domain significantly improves the segmental quality of synthetic speech compared with the conventional methods.
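At inference time, a postfilter of this kind is simply a frame-wise feedforward mapping from synthetic spectral features to enhanced, "natural-like" features. The sketch below illustrates that idea only: the layer sizes (40-512-512-40), sigmoid activations, and random weights are placeholder assumptions for illustration, not the RBM/DBN/BAM-pretrained parameters or feature dimensions used in the paper.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class DNNPostfilter:
    """Frame-wise feedforward mapping from synthetic to enhanced spectra.

    Illustrative sketch: weights are random placeholders, not the
    generatively pretrained parameters described in the paper.
    """

    def __init__(self, dims, seed=0):
        rng = np.random.default_rng(seed)
        # One weight matrix and bias vector per layer transition.
        self.weights = [rng.normal(0.0, 0.01, (a, b))
                        for a, b in zip(dims[:-1], dims[1:])]
        self.biases = [np.zeros(b) for b in dims[1:]]

    def enhance(self, frames):
        """Map a (num_frames, dim) array of synthetic features."""
        h = frames
        for W, b in zip(self.weights[:-1], self.biases[:-1]):
            h = sigmoid(h @ W + b)                      # hidden layers
        return h @ self.weights[-1] + self.biases[-1]   # linear output layer


# Usage: push 100 frames of 40-dimensional mel-cepstra through the network.
pf = DNNPostfilter([40, 512, 512, 40])
synthetic = np.zeros((100, 40))   # stand-in for HMM-generated mel-cepstra
enhanced = pf.enhance(synthetic)
print(enhanced.shape)             # (100, 40): one enhanced vector per frame
```

In the paper's setup the network's parameters come from stacking generatively trained RBMs/DBNs with a BAM on paired synthetic/natural data; the sketch only shows the shape of the resulting forward pass.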