Internship Research - Sophia Antipolis, France - Inria

Description
The job description below is in English.


Contract type:
Internship agreement


Required level of education:
Bac + 4 or equivalent


Function:
Research intern


About the centre or functional department:
The Inria centre at Université Côte d'Azur includes 37 research teams and 8 support services. The centre's staff (about 500 people) is made up of scientists of different nationalities, engineers, technicians and administrative staff.

The teams are mainly located on the university campuses of Sophia Antipolis and Nice, as well as Montpellier, in close collaboration with research and higher education laboratories and institutions (Université Côte d'Azur, CNRS, INRAE, INSERM...), but also with regional economic players.


With a presence in the fields of computational neuroscience and biology, data science and modeling, software engineering and certification, as well as collaborative robotics, the Inria Centre at Université Côte d'Azur is a major player in terms of scientific excellence through its results and collaborations at both European and international levels.


Context and assets of the position:


Team
The STARS research team combines advanced theory with cutting-edge practice, focusing on cognitive vision systems.


Scientific context


Feature extraction is a challenging computer vision problem that aims to extract relevant information from raw data in order to reduce dimensionality and capture meaningful patterns.

When this needs to be done in a dataset- and task-invariant way, it is referred to as general feature extraction.

This is a crucial step in machine learning pipelines, and popular methods such as VideoSwin and VideoMAE work well for action recognition and video understanding.

However, these works, as well as the datasets they are tested on, such as Something-Something and Kinetics, fail to capture information about interactions in daily life.
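As a purely illustrative sketch of such general feature extraction, the snippet below uses a pretrained 3D-CNN video backbone from torchvision (r3d_18, standing in for VideoSwin or VideoMAE) to turn a clip into a fixed-size embedding; the backbone choice, clip shape, and weights are assumptions made for the example, not part of this offer.

import torch
from torchvision.models.video import R3D_18_Weights, r3d_18

# Assumption: a recent torchvision is available; any pretrained video encoder
# (VideoSwin, VideoMAE, ...) could be substituted for this Kinetics-400 backbone.
backbone = r3d_18(weights=R3D_18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()          # drop the classification head, keep features
backbone.eval()

clip = torch.randn(1, 3, 16, 112, 112)     # dummy clip: (batch, channels, frames, H, W)
with torch.no_grad():
    features = backbone(clip)              # (1, 512) task-agnostic clip embedding
print(features.shape)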


In this research direction, several methods [] have been proposed to model these complex fine-grained interactions using datasets such as UDIVA, MPII Group Interactions, and Epic-Kitchen.

These datasets, which encompass real-world challenges, share the following characteristics. Firstly, rich multimodal information is available, where each modality provides important information relevant to the labels.

Secondly, there is a lot of irrelevant information that has to be ignored, as deep learning models easily pick up on patterns that are coincidental (local minima).

For example, the colour of a t-shirt could be used to assign a certain personality score to someone if, by coincidence, the majority of extroverted people are wearing warm colours.

Lastly, the videos in these datasets are generally very long.


So, the main question is how to extract general features from multimodal data with a lot of noise in the form of irrelevant information.
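As a purely illustrative picture of this multimodal setting (the modalities, dimensions, and layers below are assumptions, not the team's design), per-modality features can be projected into a shared space and fused by a transformer into a single general feature vector.

import torch
import torch.nn as nn

class MultimodalEncoder(nn.Module):
    # Hypothetical fusion module: video and audio features are projected into a
    # shared space and mixed by a transformer; all dimensions are illustrative only.
    def __init__(self, video_dim=512, audio_dim=128, dim=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, dim)
        self.audio_proj = nn.Linear(audio_dim, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, video_feats, audio_feats):   # (B, Tv, video_dim), (B, Ta, audio_dim)
        tokens = torch.cat([self.video_proj(video_feats),
                            self.audio_proj(audio_feats)], dim=1)
        fused = self.fusion(tokens)                # attention runs across both modalities
        return fused.mean(dim=1)                   # (B, dim) general multimodal feature

feat = MultimodalEncoder()(torch.randn(2, 32, 512), torch.randn(2, 50, 128))
print(feat.shape)                                  # torch.Size([2, 256])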


Typical situations that we would like to monitor are daily interactions, responses, and reactions, so as to analyze cause and effect in behavior (this could be human-human or human-object interaction).

The system we want to develop will be beneficial for all tasks requiring a focus on interactions.

In particular, for healthcare applications targeting psychological disorders, general feature extraction will allow deep learning models to assist in the various subtasks involved in the diagnosis process.


Assigned mission:


In this work, we would like to go beyond pure Deep Learning by incorporating semantic modelling within the Deep Learning pipeline, which consists of a combination of CNNs and transformers [5], in order to model the complex action patterns in untrimmed videos.

These complex action patterns could include composite actions and concurrent actions existing in long untrimmed videos.
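A minimal sketch of such a CNN-plus-transformer pipeline is given below; it is not the team's actual architecture, and all layer sizes and names are assumptions chosen only to make the idea concrete: a small 3D CNN encodes each snippet, and a transformer models temporal structure across the untrimmed video before per-snippet action scores are predicted.

import torch
import torch.nn as nn

class SnippetCNN(nn.Module):
    # Encodes one short snippet (3, T, H, W) into a d-dimensional token (illustrative sizes).
    def __init__(self, dim=256):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv3d(3, 32, kernel_size=3, padding=1),
                                  nn.ReLU(),
                                  nn.AdaptiveAvgPool3d(1))
        self.proj = nn.Linear(32, dim)

    def forward(self, x):                        # x: (B, N snippets, 3, T, H, W)
        b, n = x.shape[:2]
        x = self.conv(x.flatten(0, 1)).flatten(1)
        return self.proj(x).view(b, n, -1)       # (B, N, dim)

class ActionDetector(nn.Module):
    # Transformer over snippet tokens -> per-snippet multi-label action scores.
    def __init__(self, num_classes=10, dim=256):
        super().__init__()
        self.cnn = SnippetCNN(dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, snippets):
        tokens = self.temporal(self.cnn(snippets))
        return self.head(tokens)                 # (B, N, num_classes) logits

video = torch.randn(1, 16, 3, 8, 32, 32)         # dummy video: 16 snippets of 8 frames each
print(ActionDetector()(video).shape)             # torch.Size([1, 16, 10])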


Existing methods [3, 5, 6] have mostly focused on modelling the variation of visual cues across time, either locally or globally within a video.

However, these methods consider the temporal information without any further semantics.

Real-world videos contain many complex actions with inherent relationships between action classes at the same time steps or across distant time steps.

Modelling such class-temporal relationships can be extremely useful for locating actions in those videos.

In this work, we focus on semantic modelling for improving action detection performance. Videos may contain rich semantic information such as objects, actions, and scenes. Relationships among different semantics are high-level knowledge which is critical for understanding the video content.

Therefore, semantic relational reasoning can help determine action instance occurrences and locate actions, especially complex ones, in the video.

To handle these challenges, the Class-Temporal Relational Network (CTRN) [4] has been proposed to explore both the class and temporal relations of detected actions.
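The toy sketch below illustrates the general idea of class-temporal relational reasoning; it is not the published CTRN architecture, and its dimensions and layers are assumptions. Starting from per-snippet class scores, one attention pass relates classes within a time step and another relates time steps within a class, refining the detections.

import torch
import torch.nn as nn

class ClassTemporalReasoner(nn.Module):
    # Toy class-temporal reasoning: NOT the published CTRN [4]; sizes are assumptions.
    def __init__(self, dim=64):
        super().__init__()
        self.embed = nn.Linear(1, dim)                   # lift scalar scores to features
        self.class_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.time_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.out = nn.Linear(dim, 1)

    def forward(self, scores):                           # scores: (B, T, C)
        b, t, c = scores.shape
        x = self.embed(scores.unsqueeze(-1))             # (B, T, C, dim)

        # Relations between action classes at the same time step.
        xc = x.reshape(b * t, c, -1)
        xc, _ = self.class_attn(xc, xc, xc)
        x = x + xc.reshape(b, t, c, -1)

        # Relations between time steps for the same action class.
        xt = x.permute(0, 2, 1, 3).reshape(b * c, t, -1)
        xt, _ = self.time_attn(xt, xt, xt)
        x = x + xt.reshape(b, c, t, -1).permute(0, 2, 1, 3)

        return scores + self.out(x).squeeze(-1)          # refined (B, T, C) scores

refined = ClassTemporalReasoner()(torch.rand(1, 32, 10))
print(refined.shape)                                     # torch.Size([1, 32, 10])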

To go beyond the above, a first attempt may consist of:

  • (1) Effectively extracting acti
