Speech Processing

SP: Speech Processing

IDIAP, Speech Processing Group (IP Leader): Herve Bourlard & Hynek Hermansky
ETHZ/TIK, Speech Processing Group: Beat Pfister
University of Geneva, Translation and Interpretation School: Susan Armstrong

Context and Goals
The goal of IM2.SP is to provide IM2 with advanced and flexible speech processing modules which can be used as an input mode (voice input), as an audio indexing tool (requiring large vocabulary, continuous speech recognition systems) turning audio files into text, and as an output mode (requiring speech coding and text-to-speech systems).
With respect to existing state-of-the-art, the generic goals of IM2.SP are:

  • Further research and development in robust speech recognition systems, to provide graceful degradation when the system loses information due to limited bandwidth, background noise, and channel distortion. Given the targeted applications, particular attention should also be paid to robustness to speaking styles and accents, including spontaneous speech processing.
  • Research and development of new microphone array processing techniques (as a White Paper project), to improve the robustness of speech recognition, especially in the case of the IM2.AP project targeting recognition and management of meetings.
  • Further research and development in (real-time) large vocabulary, continuous speech recognition systems. Ideally, the developed systems should be flexible, accommodating task-independent training and easy adaptation to new tasks, simply by (automatically?) adapting the lexicon and grammatical constraints.
  • Automatic training and adaptation, to make the system easy and cheap to adapt or train on new domains. This includes: better use of training data, handling out-of-vocabulary words, discovery procedures for syntactic and semantic classes, evaluation and portability.
  • Better use of morphological, syntactic and semantic modeling in state-of-the-art speech recognition and speech synthesis systems.
  • Speech synthesis and speech generation, to produce comprehensible speech output to the user and to enhance our basic understanding of the speech production process. This includes: improvements in basic synthesis systems technology, computational models of variability, integration of synthesis and language generation, adaptation, and evaluation metrics.

Research Issues
With respect to the existing state of the art, the emphasis of IM2.SP during the first two years of the project (2002-2003) has been defined as follows:

  • Speech recognition software adaptation [IDIAP] to make it more flexible to the task, easier to share and to integrate with other components of IM2. Ideally, the resulting software should be fully compatible/integrated with the TORCH software recently developed at IDIAP, which includes most of the pattern recognition tools and should facilitate multi-modal integration.
  • Improvement of speech recognition acoustic models (better word and subword unit models) and statistical pattern classification techniques particularly well suited to multi-channel processing. More specifically, this will include, among others:
    • Improvement of basic speech recognition technology through new modeling approaches, based on multi-band and multi-stream models [IDIAP].
    • Reduction of inter-speaker variability based on prosody modeling [ETHZ/TIK].
  • Speech/non-speech segmentation and speaker turn detection [IDIAP]: in all audio indexing systems, as well as in the framework of the IM2.AP application, it will be important to be able to decompose any audio signal into speech/non-speech segments, as well as to perform speaker clustering, speaker labeling and speaker turn detection.
  • Improvement of text-to-speech technology [ETHZ], extending multilingual TTS to truly mixed-lingual TTS systems.
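
The speech/non-speech segmentation item above can be illustrated with a minimal sketch. The project text does not say which method is used, so this example assumes a plain frame-energy threshold, a common baseline that is much simpler than the statistical approaches targeted by the project. The function names, the frame length and the threshold value are hypothetical choices made for illustration only.

```python
import math


def frame_energies(samples, frame_len):
    """Return the log energy (in dB) of consecutive non-overlapping frames."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        # A small floor avoids log10(0) on digital silence.
        energies.append(10.0 * math.log10(power + 1e-12))
    return energies


def segment_speech(samples, frame_len=160, threshold_db=-30.0):
    """Threshold frame energies and merge runs of equal labels.

    Returns a list of (start_frame, end_frame, label) tuples, where
    end_frame is exclusive. The threshold is an assumed value, not one
    taken from the project.
    """
    labels = [
        "speech" if energy > threshold_db else "non-speech"
        for energy in frame_energies(samples, frame_len)
    ]
    segments = []
    for index, label in enumerate(labels):
        if segments and segments[-1][2] == label:
            # Extend the current segment by one frame.
            segments[-1] = (segments[-1][0], index + 1, label)
        else:
            segments.append((index, index + 1, label))
    return segments
```

Applied to a signal made of silence, a tone and silence again, `segment_speech` returns three segments whose boundaries fall on frame edges. A fixed threshold of this kind breaks down in background noise, which is one reason the project calls for evaluation in real conditions.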

In the context of the IM2.SP project, a related (White Paper) project on microphone array processing has also been initiated, aiming to improve meeting room recording and processing (as addressed in IM2.AP), and more specifically:

  • Acquiring clean speech, free of cross-talk, with minimal constraint upon the user
  • Detecting the periods of voice activity for each user
  • Dynamically determining the location of each user
  • Developing a real-time, "portable" and "scalable" microphone array system.
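
As an illustration of the localization goal in the list above, the following sketch estimates the time delay of arrival between two microphones by cross-correlation and converts it to an angle. The source does not describe the method used in the project; this is a textbook approach under far-field assumptions, and all function names, parameters and default values (such as the speed of sound) are hypothetical.

```python
import math


def cross_correlation(x, y, lag):
    """Sum of x[n] * y[n + lag] over the samples where both signals exist."""
    total = 0.0
    for n in range(len(x)):
        m = n + lag
        if 0 <= m < len(y):
            total += x[n] * y[m]
    return total


def estimate_delay(x, y, max_lag):
    """Return the lag in [-max_lag, max_lag] that maximizes the correlation.

    A positive result means y is a delayed copy of x.
    """
    return max(
        range(-max_lag, max_lag + 1),
        key=lambda lag: cross_correlation(x, y, lag),
    )


def delay_to_angle(delay_samples, sample_rate, mic_distance,
                   speed_of_sound=343.0):
    """Far-field angle of arrival in degrees, relative to broadside."""
    sin_theta = delay_samples / sample_rate * speed_of_sound / mic_distance
    # Clip to the valid range in case noise yields an impossible delay.
    sin_theta = max(-1.0, min(1.0, sin_theta))
    return math.degrees(math.asin(sin_theta))
```

With a two-microphone pair, a single delay only gives a direction, not a position. Locating each user in a room requires combining several pairs, which is where larger arrays come in.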

Year 1
Flexible and modular baseline speech recognition system that can easily be adapted to different tasks and to the application project requirements.
This will involve, among others:

  • Data and software specification: In collaboration with the relevant IPs, definition of a standard representation of the acoustic data, lexical and grammatical constraints, as well as possible access points to the recognition software.
  • Software development of training and recognition systems: Adaptation of existing speech training and recognition software to the above specifications, making it more flexible and portable. Different instances of the recognizer will probably have to be developed depending on the application, typically one recognizer for continuous speech with a medium-size lexicon (e.g., in the case of vocal commands), and another one for a very large lexicon (e.g., in the case of audio indexing).
  • Management and processing of speech corpora: Collecting and maintaining speech corpora, including corpora collected in the framework of IM2.
  • Adapting and testing different speech recognition applications: Developing and testing different speech recognition applications based on the above corpora.
  • Speaker normalization: Research into speaker normalization, based on prosody features.

New approaches towards automatic audio (speech/non-speech) segmentation and speaker segmentation/labeling. This will involve, among others:

  • Speech/non-speech discrimination: develop new approaches towards speech/non-speech discrimination, with extensive evaluation in real conditions (as close as possible to IM2 applications).
  • Speaker segmentation and speaker turn detection: develop new approaches towards speaker turn detection and speaker labeling. This will be particularly useful in the framework of the IM2 application scenario.

Preliminary mixed-lingual text-to-speech system, including:

  • Speech production providing the possibility of synthesizing a single voice speaking the four most important languages in Switzerland (French, German, Italian, and English), and
  • A prototype mixed-lingual text-analysis module for mixed-lingual German and English sentences.

Year 2
Starting from the above baseline recognizer, perform a first evaluation on the preliminary IM2 speech corpus and start working on improving the performance of the recognizer. This will involve, among others:

  • Preliminary processing of the preliminary IM2 speech corpus: Although this is not a scientific task, it is an important and time-consuming one, involving: labeling, processing the data (extracting acoustic vectors, etc.), defining training and test corpora, extracting the lexicon and grammatical constraints, etc.
  • Testing the recognizer on the preliminary speech corpus: preliminary training and testing on the IM2 database.
  • Further work on audio segmentation: extending and testing Tasks 1.5 and 1.6 on IM2 databases.

Regarding real-time microphone array processing, the deliverables for the first two years are expected to be the following:

  • Setup of the (meeting room) microphone array acquisition system, typically using 12 microphones.
  • Development and demonstration of functioning stand-alone small microphone arrays (e.g., with 2 microphones), together with preliminary evaluation results.
  • Development of real-time hardware. This could also be used later to allow real-time control of video cameras (e.g., based on speech activity detection).
  • Development and testing of new approaches towards scalable systems, allowing the combination of small microphone arrays.

Quarterly status reports
Available on the local site (password protected).

Last modified 2011-03-18 17:12
