The Natural Speech Technology programme aims to significantly advance the state of the art in speech synthesis and recognition, by recognising and generating natural, expressive speech across different domains and environments. Our aim is to develop speech technology that approaches human levels of reliability, adaptability, and fluency. The current state of the art works well only in specific domains: there are many point solutions. Moreover, all the different factors of variability are usually combined in a single model, which makes it expensive to adapt to new situations.
Our research in Natural Speech Technology is based on a common statistical modelling framework for synthesis and recognition; key research themes include fluency, capturing richer context, expression and prosody, and personalisation. Our research is organised into four tracks:
- Learning and Adaptation: Models and algorithms for synthesis and recognition that can learn from continuous streams of data, compactly represent new scenarios and speaking styles, and adapt to new situations and contexts almost instantaneously.
- Natural Transcription: Speech recognisers that can detect "who spoke what, when, and how" in any acoustic environment and for any task domain.
- Natural Synthesis: Controllable speech synthesisers that automatically learn from data, and are capable of generating the full expressive diversity of natural speech.
- Exemplar Applications: Deployment of these advances in novel applications, with an emphasis on the health/social domain, media archives, and personal listeners.
We also have a User Group comprising about 15 members, with whom we work to validate our systems on data and tasks that they provide.
The following talk, presented at the NST Annual Meeting in April 2013, provides an overview of the NST programme:
- "Overview of NST" (Steve Renals) [pdf]
More details of specific research topics were presented at the 2012 and 2013 review meetings in these talks:
- The NST 'homeService' application: recent system and experimental developments (Heidi Christensen) [pdf]
- BBC Sample Data Processing (Yanhua Long) [pdf]
- Combining neural network acoustic and language models for lecture transcription (Peter Bell) [pdf]
- Asynchronous factorisation of speaker and background in speech recognition (Oscar Saz) [pdf]
- Acoustic data driven pronunciation lexicon for speech recognition (Liang Lu) [pdf]
- Sequence-discriminative training of deep neural networks (Arnab Ghoshal) [pdf]
- Deep Neural Networks for Cross-lingual Speech Recognition (Pawel Swietojanski) [pdf]
- Cross-domain paraphrasing for improving language modelling using out-of-domain data (Andrew Liu) [pdf]
- Natural Speech Recognition Output (Marcus Tomalin) [pdf]
- Multiple-average-voice-based speech synthesis (Pierre Lanchantin) [pdf]
- Deep neural network for speech synthesis (Heng Lu) [pdf]
- An update on voice banking and voice reconstruction (Christophe Veaux) [pdf]