I am interested in statistical modeling techniques for both automatic speech recognition (ASR) and text-to-speech synthesis (TTS).
In the NST project I'm currently working on two topics:
Structuring diverse data:
This topic is central to the generation of the canonical models within this challenge, and hence to the underlying forms of the models used by the synthesis and transcription technologies. One of the aspects we have worked on is how to deal with imperfect data in order to produce transcriptions that can be used for the training of canonical models. Raw data can be imperfect in different ways: in the case of BBC material, for instance, speech characteristics (expressivity, speaking style, regional accent...) and recording conditions (background noise, bandwidth...) vary with the genre of the show (lecture, documentary, drama...). Additionally, the associated transcriptions can be more or less complete, accurate and reliable. The aim is therefore to produce complete, accurate and reliable transcriptions that can be used to train canonical models for ASR or TTS. We are currently studying new approaches, including lightly supervised training and decoding. Another part of this topic is the development of automatic approaches for the generation of metadata.
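As an illustration of the kind of data selection that lightly supervised training relies on, the sketch below keeps only the segments where an ASR hypothesis (obtained by decoding with a language model biased towards the imperfect original transcript) agrees closely with that transcript; the retained hypotheses then serve as training labels. The segment format, function names and the 20% disagreement threshold are assumptions made for this example, not a description of our actual pipeline, and the biased decoding itself is not shown.

```python
# Illustrative sketch of the data-selection step in lightly supervised training:
# segments where the biased-LM ASR hypothesis agrees closely with the imperfect
# original transcript are kept for acoustic model training, the rest discarded.
# All names and the 0.2 WER threshold are illustrative assumptions.

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)]


def select_segments(segments, max_wer=0.2):
    """Keep segments whose hypothesis/transcript disagreement is below max_wer.

    `segments` is a list of (segment_id, approximate_transcript, asr_hypothesis)
    tuples, with transcripts given as plain strings.
    """
    selected = []
    for seg_id, transcript, hypothesis in segments:
        ref, hyp = transcript.lower().split(), hypothesis.lower().split()
        wer = edit_distance(ref, hyp) / max(len(ref), 1)
        if wer <= max_wer:
            selected.append((seg_id, hypothesis))  # hypothesis becomes the label
    return selected


if __name__ == "__main__":
    example = [
        ("show1_0001", "the cabinet met this morning", "the cabinet met this morning"),
        ("show1_0002", "heavy rain is expected tonight", "heavy rain expected tonight across"),
    ]
    print(select_segments(example))  # only the first, well-matched segment is kept
```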
Canonical acoustic models for text-to-speech synthesis:
In this topic, we examine approaches for both transcription and synthesis that yield orthogonal factors. An important aspect of this work is that it can make use of the vast amounts of data, with associated transcriptions and metadata, from the data structuring project. This topic is closely linked with research on model adaptation and factorisation, and suitable forms of adaptation will be central to obtaining general factorisation. Currently, we are working on a novel approach to speaker adaptation based on the interpolation of several average voice models (AVMs). Recent results have shown that the quality and naturalness of adapted voices depend directly on the distance between the target speaker and the average voice model from which adaptation starts. This suggests training several AVMs on carefully chosen speaker clusters, from which a more suitable AVM can be selected or interpolated during adaptation. In the proposed approach, a set of AVMs is therefore trained on speaker clusters that are initialised according to metadata and iteratively re-assigned during the estimation process. While the interpolation stage is similar to the cluster adaptive training (CAT) framework as extended to speech synthesis, the training stage is computationally less expensive as the amount of training data and the number of clusters grow. Additionally, during adaptation all the AVMs are first adapted to the target speaker, so that the space in which the interpolation takes place is better tuned to that individual speaker.
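To make the interpolation idea concrete, here is a minimal numpy sketch in which each AVM is reduced to a single mean supervector and the speaker-dependent interpolation weights are obtained by a non-negative least-squares fit to speaker-level statistics. This is a deliberately simplified stand-in, not the actual method: in the CAT-style framework the weights would be estimated under a maximum-likelihood criterion over HMM states, and the AVMs would first be adapted to the speaker as described above. All names, dimensions and data in the example are illustrative assumptions.

```python
# Minimal sketch of interpolating several average voice models (AVMs) for a
# target speaker. Each AVM is reduced to one mean supervector; interpolation
# weights are fitted to a speaker-level mean vector by least squares and then
# renormalised. Purely illustrative; not the ML/CAT estimation used in practice.

import numpy as np


def estimate_weights(avm_means, speaker_mean):
    """Least-squares interpolation weights, clipped to be non-negative and
    renormalised to sum to one.

    avm_means:    (n_avm, dim) matrix, one mean supervector per AVM
    speaker_mean: (dim,) vector of speaker adaptation statistics
    """
    w, *_ = np.linalg.lstsq(avm_means.T, speaker_mean, rcond=None)
    w = np.clip(w, 0.0, None)
    return w / w.sum() if w.sum() > 0 else np.full(len(w), 1.0 / len(w))


def interpolate_avm(avm_means, weights):
    """Interpolated model mean: weighted combination of the AVM means."""
    return weights @ avm_means


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    avms = rng.normal(size=(4, 16))          # 4 cluster AVMs in a 16-dim toy space
    target = 0.7 * avms[1] + 0.3 * avms[3]   # synthetic target-speaker statistics
    w = estimate_weights(avms, target)
    adapted = interpolate_avm(avms, w)
    print("weights:", np.round(w, 2))        # dominated by clusters 1 and 3
```

The toy example is built so that the target speaker lies between two of the cluster AVMs; the recovered weights concentrate on those clusters, which is the behaviour the cluster-based interpolation is meant to exploit.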