
IM2.MPR (Multimodal processing and recognition) - lay summary


Multimodal processing and recognition


IP Head: Aude Billard (EPFL)

The IM2 IP on Multimodal Processing and Recognition (MPR) is concerned with extracting relevant information through the concurrent analysis of signals from several modalities, such as speech and vision. The rationale behind the work in MPR is that more information is gained by analyzing signals from multiple modalities in combination than by analyzing each modality separately.

Take a simple example. When you are on the phone with someone, you usually understand everything that is said. But if you could both see and hear the person you are talking to, you would get much more information out of the conversation, such as whether the person is joking or worried, by relating their facial expressions to the words they utter. Moreover, seeing the person's face can be truly important when the phone signal is poor, because you can guess some of the words you fail to hear by following the motion of the lips.
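The intuition behind the phone example can be illustrated with a toy calculation. The sketch below is only an illustration under strong assumptions that are not taken from the MPR projects: the "word" is a single fair bit, and the audio and video channels each flip it independently with a fixed error probability (the values 0.2 and 0.3 and all function names are hypothetical). It computes, in bits, how much each observation tells us about the word, alone and in combination.

```python
from itertools import product
from math import log2


def joint_distribution(p_audio_err, p_video_err):
    """Joint p(w, a, v) for a fair binary 'word' w observed through two
    independent noisy channels (hypothetical toy model)."""
    joint = {}
    for w, a, v in product((0, 1), repeat=3):
        p_a = p_audio_err if a != w else 1 - p_audio_err
        p_v = p_video_err if v != w else 1 - p_video_err
        joint[(w, a, v)] = 0.5 * p_a * p_v
    return joint


def mutual_information(joint, obs_idx):
    """Mutual information I(W; O) in bits, where O groups the observed
    variables found at positions obs_idx of each (w, a, v) key."""
    p_wo, p_w, p_o = {}, {}, {}
    for key, p in joint.items():
        w = key[0]
        o = tuple(key[i] for i in obs_idx)
        p_wo[(w, o)] = p_wo.get((w, o), 0.0) + p
        p_w[w] = p_w.get(w, 0.0) + p
        p_o[o] = p_o.get(o, 0.0) + p
    return sum(
        p * log2(p / (p_w[w] * p_o[o]))
        for (w, o), p in p_wo.items()
        if p > 0
    )


# Assumed error rates: a poor phone line and imperfect lip reading.
joint = joint_distribution(p_audio_err=0.2, p_video_err=0.3)
i_audio = mutual_information(joint, (1,))    # audio alone
i_video = mutual_information(joint, (2,))    # video alone
i_both = mutual_information(joint, (1, 2))   # audio and video together

print(f"audio only : {i_audio:.3f} bits")
print(f"video only : {i_video:.3f} bits")
print(f"combined   : {i_both:.3f} bits")
```

In this toy model the combined observation is never less informative than the better single channel, but it is also no more informative than the two channels added together, because both carry partly the same information about the word. Real speech and video signals are far richer, so this is a sketch of the principle rather than a description of any method used in MPR.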

There are two key challenges in multimodal signal and image analysis. The first is to build robust tools for analyzing the individual (uni-modal) signals of a multimodal scene. The second is to exploit optimally the complementarity and redundancy between the multimodal signals. In MPR, we address these two challenges through seven research projects at four institutions: IDIAP, EPFL, the University of Fribourg and the University of Geneva. Some of the projects are theoretical and aim to estimate the information gain that results from analyzing several signals. Other projects are more application-oriented. The applications considered within MPR include algorithms for human-machine interaction, in the form of tools for biometric recognition based either on audio-visual data or on combined speech and handwriting signatures, and object detectors that combine analysis of the image with the audio labeling of the objects. Finally, this work also applies to the study of human-human interaction, in the analysis of how people redirect their visual attention across speakers and objects in a room as a result of social factors such as dominance or joint attention.

This research exploits and develops tools from Machine Learning, a field closely tied to statistics, which offers powerful algorithms for the analysis of non-linear and time-dependent signals. The analysis covers both the temporal and the spatial structure of the signals. Time series analysis is particularly useful in the study of combined speech and video, whereas spatial analysis reveals how much of the information carried by each signal separately is redundant.

Keywords: Multimodal Signal Processing, Machine Learning, Audio-Visual Analysis, Biometric Recognition

Last modified 2008-05-19 09:16
