Mark Gales gave an invited lecture, Acoustic Factorisation for Speech Recognition and Speech Synthesis, at the International Workshop on Statistical Machine Learning for Speech Processing in Kyoto on 31 March 2012.
Abstract

Current state-of-the-art acoustic models for speech recognition are trained on hundreds, or even thousands, of hours of data. This acoustic data usually contains many speakers, acoustic environments and even tasks. An important stage of many speech recognition systems is to adapt this large acoustic model to a particular speaker, task, acoustic environment, or more often some combination of all three of these acoustic 'factors'. Current approaches to adaptation include linear transformations and interpolation of the model parameters, maximum a-posteriori estimation, and predictive approaches. In predictive schemes, a function representing the mismatch between the general model and a particular speaker or environment is used; examples of this class are vocal tract length normalisation and vector Taylor series environment compensation. Increasingly these adaptation approaches are also used in training, as this yields an acoustic model that is more amenable to adaptation; this is referred to as adaptive training. Furthermore, parametric statistical speech synthesis approaches are making use of similar schemes, for example average voice models.

As well as reviewing these adaptation approaches, this talk will discuss the concept of acoustic factorisation. Here, rather than treating the combination of all the acoustic factors as a single target condition, each of the factors is modelled separately. This factorisation yields a highly flexible adaptation scheme: it allows, for example, the knowledge that the same speaker is talking in a different acoustic environment to be exploited. Two specific examples of acoustic factorisation will be discussed: speaker and noise factorisation for speech recognition; and speaker and language factorisation for polyglot speech synthesis.
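The core idea of acoustic factorisation can be illustrated with a small sketch: instead of estimating one joint transform for each (speaker, environment) pair, a separate transform is estimated per factor and the two are composed, so the speaker transform can be reused when only the environment changes. This is a minimal, hypothetical NumPy illustration; the affine-transform form, names, and dimensions are assumptions for exposition, not details from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 3
mu = rng.standard_normal(dim)  # canonical-model Gaussian mean (illustrative)

# Hypothetical factor-specific affine transforms (linear-transform-style
# adaptation), one per acoustic factor.
A_spk, b_spk = np.eye(dim) * 1.1, np.full(dim, 0.2)   # speaker factor
A_env, b_env = np.eye(dim) * 0.9, np.full(dim, -0.1)  # environment factor

def adapt(mean, transforms):
    """Compose factor transforms: apply each affine transform in turn."""
    for A, b in transforms:
        mean = A @ mean + b
    return mean

# Same speaker in two conditions: the speaker transform is reused, and only
# the environment transform is swapped -- the flexibility factorisation buys.
mu_noisy = adapt(mu, [(A_spk, b_spk), (A_env, b_env)])
mu_clean = adapt(mu, [(A_spk, b_spk), (np.eye(dim), np.zeros(dim))])
```

In a joint scheme, moving the same speaker to a new environment would require estimating a whole new transform; here only the environment factor needs re-estimating.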