Post-doctoral Research Visit F/M Deep Generative - Strasbourg, France - Inria


Description
The job description below is in English.


Contract type:
Fixed-term contract (CDD)

Required degree level:
PhD or equivalent

Role:
Postdoctoral Researcher


Context and assets of the position:


This postdoctoral research is part of the REAVISE project, "Robust and Efficient Deep Learning based Audiovisual Speech Enhancement", funded by the French National Research Agency (ANR).

The general objective of REAVISE is to develop a unified audio-visual speech enhancement (AVSE) framework that leverages recent breakthroughs in statistical signal processing, machine learning, and deep neural networks to create a robust and efficient AVSE system.


  • The postdoctoral researcher will be supervised by Mostafa Sadeghi (researcher, Inria) and Romain Serizel (associate professor, University of Lorraine), both members of the Multispeech team, and by Xavier Alameda-Pineda (Inria Grenoble), member of the RobotLearn team. The team has access to powerful computational resources, including the high-performance GPUs and CPUs required for the experiments planned in this project.

Work environment:

Multispeech team, Inria Nancy, France


Starting date & duration:

October 2024 (flexible), for two years.


Assigned mission:


Background. Audio-visual speech enhancement (AVSE) aims to improve the intelligibility and quality of noisy speech signals by utilizing complementary visual information, such as the lip movements of the speaker [1]. This technique is especially useful in highly noisy environments. The advent of deep neural network (DNN) architectures has led to significant advancements in AVSE, prompting extensive research into the area [1]. Existing DNN-based AVSE methods are divided into supervised and unsupervised approaches. In supervised approaches, a DNN is trained on a large audiovisual corpus, like AVSpeech [2], which includes a wide range of noise conditions. This training enables the DNN to transform noisy speech signals and corresponding video frames into a clean speech estimate. These models are typically complex, containing millions of parameters.
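
As a purely illustrative example (not the project's actual architecture), the supervised setting described above can be pictured as a network that fuses per-frame audio and visual features and predicts a time-frequency mask. All layer sizes, the mask-based output, and the assumption of precomputed lip embeddings in the sketch below are hypothetical choices.

```python
# Minimal sketch (PyTorch) of a supervised AVSE model: noisy spectrogram frames
# and lip-embedding frames are encoded, fused, and mapped to a mask in [0, 1].
import torch
import torch.nn as nn

class SupervisedAVSE(nn.Module):
    def __init__(self, n_freq=257, video_feat_dim=512, hidden=256):
        super().__init__()
        # Audio branch: per-frame encoding of the noisy magnitude spectrogram.
        self.audio_enc = nn.Linear(n_freq, hidden)
        # Visual branch: assumes lip frames were already embedded (e.g. by a
        # pretrained lip-reading front end) into video_feat_dim vectors.
        self.video_enc = nn.Linear(video_feat_dim, hidden)
        # Temporal modelling over the concatenated audio-visual features.
        self.fusion = nn.LSTM(2 * hidden, hidden, batch_first=True)
        # Output head: a mask per time-frequency bin.
        self.mask_head = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, noisy_spec, lip_feats):
        # noisy_spec: (batch, time, n_freq); lip_feats: (batch, time, video_feat_dim)
        a = torch.relu(self.audio_enc(noisy_spec))
        v = torch.relu(self.video_enc(lip_feats))
        h, _ = self.fusion(torch.cat([a, v], dim=-1))
        mask = self.mask_head(h)
        return mask * noisy_spec  # estimated clean magnitude spectrogram
```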

On the other hand, unsupervised methods [3-5] employ statistical modeling combined with DNNs.

These methods use deep generative models, such as variational autoencoders (VAEs) [6] and diffusion models [7], trained on clean datasets like TCD-TIMIT [8], to probabilistically estimate clean speech signals.

Because these models are not trained on noisy data, they are generally lighter than supervised models and, owing to their probabilistic nature, may offer better generalization and robustness to visual noise [3-5].
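
For concreteness, a minimal sketch of the kind of generative model these methods build on is given below: a VAE trained only on clean speech spectrogram frames, whose decoder defines a probabilistic clean-speech model that an enhancement algorithm can later combine with a noise model. The names, layer sizes, and Itakura-Saito-style loss are illustrative assumptions, not the specific models of [3-5].

```python
# Minimal sketch (PyTorch) of a VAE prior over clean speech frames.
import torch
import torch.nn as nn

class SpeechVAE(nn.Module):
    def __init__(self, n_freq=257, z_dim=32, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_freq, hidden), nn.Tanh())
        self.to_mu = nn.Linear(hidden, z_dim)
        self.to_logvar = nn.Linear(hidden, z_dim)
        # Decoder outputs the log-variance of a zero-mean complex Gaussian per
        # frequency bin, a common modelling choice in VAE-based speech models.
        self.decoder = nn.Sequential(nn.Linear(z_dim, hidden), nn.Tanh(),
                                     nn.Linear(hidden, n_freq))

    def forward(self, clean_power_spec):
        h = self.encoder(clean_power_spec)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.decoder(z), mu, logvar

def elbo_loss(clean_power_spec, recon_logvar, mu, logvar):
    # Itakura-Saito-style reconstruction term plus KL divergence to N(0, I).
    rec = (clean_power_spec / torch.exp(recon_logvar) + recon_logvar).sum(dim=-1)
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=-1)
    return (rec + kl).mean()
```

At enhancement time, methods in this family keep the trained decoder fixed and infer the latent variables (and noise parameters) from the noisy observation, which is why no noisy training data is needed.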

Despite these advantages, unsupervised methods remain less explored compared to their supervised counterparts.


Main activities:


Objectives. In this project, we aim to develop a robust and efficient AVSE framework by thoroughly exploring the integration of recent deep-learning architectures designed for speech enhancement, encompassing both supervised and unsupervised approaches. Our goal is to leverage the strengths of both strategies, alongside cutting-edge generative modeling techniques, to bridge the gap between them. This includes implementing computationally efficient multimodal (latent) diffusion models, dynamical VAEs [9], temporal convolutional networks (TCNs) [10], and attention-based methods [11].

The main objectives of the project are outlined as follows:


  • Develop a neural architecture that assesses the reliability of lip images (whether they are frontal, non-frontal, occluded, in extreme poses, or missing) by providing a normalized reliability score at the output [12], as sketched after this list;
  • Design deep generative models that efficiently exploit the sequential nature of data and effectively fuse audiovisual features;
  • Integrate the visual reliability analysis network within the deep generative model to selectively use visual data. This will enable a flexible and robust framework for audiovisual fusion and enhancement.
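
A rough, hypothetical sketch of the first and third objectives is given below: a small network that outputs a normalized reliability score for a lip image, which can then gate the visual stream before audio-visual fusion. The architecture and sizes are illustrative assumptions only, not the project's design.

```python
# Minimal sketch (PyTorch) of a lip-reliability scorer with a normalized output.
import torch
import torch.nn as nn

class LipReliabilityScorer(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Sigmoid keeps the reliability score in [0, 1].
        self.score = nn.Sequential(nn.Flatten(), nn.Linear(32, 1), nn.Sigmoid())

    def forward(self, lip_image):
        # lip_image: (batch, 1, H, W) grayscale lip crop -> (batch, 1) score.
        return self.score(self.features(lip_image))

# Applying the scorer to each video frame yields per-frame scores that can gate
# the visual features before fusion, e.g.:
#   lip_feats: (batch, time, feat_dim), scores: (batch, time, 1)
#   gated_feats = scores * lip_feats
```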

References:

- [1] D. Michelsanti, Z. H. Tan, S. X. Zhang, Y. Xu, M. Yu, D. Yu, and J. Jensen, "An overview of deep learning-based audio-visual speech enhancement and separation," in _IEEE/ACM Transactions on Audio, Speech, and Language Processing_, vol. 29, 2021.
- [2] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, "Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation," in _SIGGRAPH_, 2018.
- [3] M. Sadeghi, S. Leglaive, X. Alameda-Pineda, L. Girin, and R. Horaud, "Audio-visual speech enhancement using conditional variational auto-encoders," in _IEEE/ACM Transactions on Audio, Speech and Language Processing_, vol. 28, 2020.
- [4] A. Golmakani, M. Sadeghi, and R. Serizel, "Audio-visual Speech Enhancement with a Deep Kalman Filter Generative Model," in _IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)_, Rhodes Island, June 2023.
- [5] B. Nortier, M. Sadeghi, and
