Speech Technology

Conversational AI to enhance/augment machine and human capabilities

The Speech Technology Group (STG) is at the heart of modern artificial intelligence, designing novel algorithms for automatic speech recognition and data-driven dialogue systems that enable advanced, natural, speech-enabled human-machine interfaces.

STG was established in 2002 and has since worked on a wide range of speech technologies, including text-to-speech synthesis, speech intelligibility, automatic speech recognition (ASR) and dialogue modelling. Our focus is to develop advanced, natural spoken human-machine interfaces, along with products and services that make information easier to access, thereby improving productivity and quality of life. STG has made significant contributions to the next generation of Toshiba’s speech recognition, HMM-based speech synthesis and statistical dialogue modelling.

We work in collaboration with the speech R&D groups at the Knowledge Media Lab, Toshiba RDC, Kawasaki, Japan and Toshiba China R&D Centre, Beijing, China, and with business divisions of Toshiba Group, Japan. Working with these groups within Toshiba keeps our R&D efforts tightly coupled to current and future product development. STG also has a long history of collaborating with academia and constantly strives to forge new relationships. We fund research and maintain academic collaborations with groups at various UK and European Union universities and research centres. Combining the strengths of our group with these collaborations, we address a range of research topics in speech technology for the future.

Automatic Speech Recognition

Automatic transcription of speech to text plays a critical role in human-machine interaction. Background noise, reverberation, competing speakers and natural variability in speech across speakers make the task challenging. Toshiba aims to improve the state of the art in automatic speech recognition by combining signal processing and machine learning approaches. Our research covers both the front end (signal enhancement) and the back end (acoustic modelling for end-to-end streaming ASR, and adaptation of end-to-end models).

Dialogue Modelling

The Dialogue group works on fundamental research related to the modelling of human-machine communication. Our aim is to improve the naturalness of human-machine interaction by enabling people to converse more easily with machines using speech. Our team has expertise in natural language interpretation and generation, statistical dialogue management, emotion detection, and machine learning.

Latest Publications

Head-Synchronous Decoding for Transformer-based Streaming ASR

M. Li, T.C. Zorila and R. Doddipatla, Accepted for presentation at IEEE ICASSP 2021, Toronto, Canada / arXiv

Transformer-based Online Speech Recognition with Decoder-End Adaptive Computation Steps

M. Li, T.C. Zorila and R. Doddipatla, Proc. IEEE SLT 2021, Shenzhen, China / arXiv

An Investigation into the Multi-Channel Time Domain Speaker Extraction Network

T.C. Zorila, M. Li and R. Doddipatla, Proc. IEEE SLT 2021, Shenzhen, China
