ICPR2020 Paper Browser

Paper download is intended for registered attendees only and is subject to the IEEE Copyright Policy. Any other use is strictly forbidden.

Audio-Video Detection of the Active Speaker in Meetings

Francisco Madrigal, Frederic Lerasle, Lionel Pibre, Isabelle Ferrané

Auto-TLDR; Active Speaker Detection with Visual and Contextual Information from Meeting Context

Meetings are a common activity, and building systems that assist them poses particular challenges. One such challenge is active speaker detection, which can provide useful information for modeling human interaction or for human-robot interaction. Active speaker detection is mostly done using speech; however, visual and contextual information can provide additional insight. In this paper we propose an active speaker detection framework that integrates audiovisual features with social information from the meeting context. The visual cue is processed with a Convolutional Neural Network (CNN) that captures spatio-temporal relationships. We analyze several CNN architectures with both cues: raw pixels (RGB images) and motion (estimated with optical flow). Contextual reasoning is done with an original methodology based on the gaze of all participants. We evaluate our proposal on a public state-of-the-art benchmark, the AMI corpus. We show that adding visual and context information improves the performance of active speaker detection.
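The abstract does not say how gaze is combined with the audiovisual scores, so the following is only a minimal sketch under stated assumptions: each participant gets a per-frame audiovisual score in [0, 1], a gaze target (another participant or none) is known for every participant, and the two are merged by a weighted sum. The function names, the `context_weight` parameter, and the fusion rule are hypothetical illustrations, not the authors' method.

```python
from typing import Dict, Optional


def gaze_votes(gaze_targets: Dict[str, Optional[str]]) -> Dict[str, float]:
    """Fraction of the *other* participants looking at each person.

    Assumption: gaze_targets maps each participant to the participant
    they are looking at, or None if they look elsewhere.
    """
    people = list(gaze_targets)
    others = max(len(people) - 1, 1)  # avoid division by zero for one person
    votes = {p: 0 for p in people}
    for looker, target in gaze_targets.items():
        if target is not None and target != looker and target in votes:
            votes[target] += 1
    return {p: votes[p] / others for p in people}


def fuse_scores(
    av_scores: Dict[str, float],
    gaze_targets: Dict[str, Optional[str]],
    context_weight: float = 0.3,  # hypothetical value, not from the paper
) -> Dict[str, float]:
    """Weighted sum of audiovisual score and gaze-based context score."""
    context = gaze_votes(gaze_targets)
    return {
        p: (1.0 - context_weight) * av_scores[p] + context_weight * context[p]
        for p in av_scores
    }


def active_speaker(
    av_scores: Dict[str, float],
    gaze_targets: Dict[str, Optional[str]],
    context_weight: float = 0.3,
) -> str:
    """Return the participant with the highest fused score."""
    fused = fuse_scores(av_scores, gaze_targets, context_weight)
    return max(fused, key=fused.get)


if __name__ == "__main__":
    av = {"A": 0.5, "B": 0.55, "C": 0.1}
    gaze = {"A": None, "B": "A", "C": "A"}
    print(active_speaker(av, gaze))
```

In this toy example the audiovisual scores alone slightly favor B, but both other participants are looking at A, so the gaze term tips the decision toward A. Setting `context_weight` to 0 recovers the audiovisual-only decision.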

Similar papers

Spatial Bias in Vision-Based Voice Activity Detection

Kalin Stefanov, Mohammad Adiban, Giampiero Salvi

Auto-TLDR; Spatial Bias in Vision-based Voice Activity Detection in Multiparty Human-Human Interactions

We present models for automatic vision-based voice activity detection (VAD) in multiparty human-human interactions that are aimed at complementing acoustic VAD methods. We provide evidence that this type of vision-based VAD model is susceptible to spatial bias in the datasets. The physical setting of the interaction, usually constant throughout data acquisition, determines the distribution of head poses of the participants. Our results show that when the head pose distributions differ significantly between the training and test sets, the performance of the models drops significantly. This suggests that previously reported results on datasets with a fixed physical configuration may overestimate the generalization capabilities of this type of model. We also propose a number of possible remedies to the spatial bias, including data augmentation, input masking and dynamic features, and provide an in-depth analysis of the visual cues used by our models.
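The three remedies named in the abstract can be illustrated generically. The sketch below assumes a frame is a 2D grid of pixel intensities stored as nested lists. It shows horizontal flipping as one possible data augmentation, column masking as one possible form of input masking, and frame differencing as one possible dynamic feature. These specific choices and all function names are assumptions for illustration, since the abstract does not specify the operations used.

```python
from typing import List

Frame = List[List[float]]  # rows of pixel intensities


def horizontal_flip(frame: Frame) -> Frame:
    """Mirror the frame left-to-right (a common augmentation)."""
    return [row[::-1] for row in frame]


def mask_columns(frame: Frame, start: int, stop: int, fill: float = 0.0) -> Frame:
    """Replace columns in [start, stop) with a constant fill value."""
    return [
        [fill if start <= x < stop else value for x, value in enumerate(row)]
        for row in frame
    ]


def frame_differences(frames: List[Frame]) -> List[Frame]:
    """Per-pixel difference between consecutive frames (a simple motion cue)."""
    return [
        [[b - a for a, b in zip(row_prev, row_curr)]
         for row_prev, row_curr in zip(prev, curr)]
        for prev, curr in zip(frames, frames[1:])
    ]
```

Flipping changes the apparent head orientation, masking hides part of the input, and differencing discards static appearance, so each one plausibly reduces a model's reliance on a fixed seating layout.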

Learning Visual Voice Activity Detection with an Automatically Annotated Dataset

Stéphane Lathuiliere, Pablo Mesejo, Radu Horaud

Auto-TLDR; Deep Visual Voice Activity Detection with Optical Flow

Visual voice activity detection (V-VAD) uses visual features to predict whether a person is speaking or not. V-VAD is useful whenever audio VAD (A-VAD) is inefficient either because the acoustic signal is difficult to analyze or is simply missing. We propose two deep architectures for V-VAD, one based on facial landmarks and one based on optical flow. Moreover, available datasets, used for learning and for testing V-VAD, lack content variability. We introduce a novel methodology to automatically create and annotate very large datasets in-the-wild, based on combining A-VAD and face detection. A thorough empirical evaluation shows the advantage of training the proposed deep V-VAD models with such a dataset.
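The abstract states only that the annotation methodology combines A-VAD and face detection. A minimal sketch of one way such a combination could work is given below. It assumes per-frame A-VAD decisions and per-frame face counts are already available, and it labels a frame only when exactly one face is visible, so that the audio activity can be attributed to that face. This labeling rule and the function name `auto_label` are hypothetical and not taken from the paper.

```python
from typing import List, Optional


def auto_label(
    audio_active: List[bool], face_counts: List[int]
) -> List[Optional[int]]:
    """Assign a visual VAD label to each frame.

    Returns 1 (speaking) or 0 (silent) when exactly one face is
    detected, and None when the frame is ambiguous and is discarded
    (no face, or several faces). This rule is an assumption.
    """
    if len(audio_active) != len(face_counts):
        raise ValueError("audio and video streams must have the same length")
    labels: List[Optional[int]] = []
    for active, n_faces in zip(audio_active, face_counts):
        if n_faces != 1:
            labels.append(None)  # cannot attribute the audio to one face
        else:
            labels.append(1 if active else 0)
    return labels


if __name__ == "__main__":
    print(auto_label([True, False, True, False], [1, 1, 2, 0]))
```

Discarding ambiguous frames trades dataset size for label reliability. A real pipeline would also need to handle off-screen speakers, which this sketch ignores.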

Three-Dimensional Lip Motion Network for Text-Independent Speaker Recognition

Jianrong Wang, Tong Wu, Shanyu Wang, Mei Yu, Qiang Fang, Ju Zhang, Li Liu

Auto-TLDR; Lip Motion Network for Text-Independent and Text-Dependent Speaker Recognition

Late Fusion of Bayesian and Convolutional Models for Action Recognition

RWF-2000: An Open Large Scale Video Database for Violence Detection

DenseRecognition of Spoken Languages

Audio-Visual Speech Recognition Using a Two-Step Feature Fusion Strategy

AttendAffectNet: Self-Attention Based Networks for Predicting Affective Responses from Movies

Depth Videos for the Classification of Micro-Expressions

Pose-Based Body Language Recognition for Emotion and Psychiatric Symptom Interpretation

Automatic Annotation of Corpora for Emotion Recognition through Facial Expressions Analysis

Self-Supervised Joint Encoding of Motion and Appearance for First Person Action Recognition

Hybrid Network for End-To-End Text-Independent Speaker Identification

A Grid-Based Representation for Human Action Recognition

Modeling Long-Term Interactions to Enhance Action Recognition

Learning Group Activities from Skeletons without Individual Action Labels

Sequential Non-Rigid Factorisation for Head Pose Estimation

End-To-End Triplet Loss Based Emotion Embedding System for Speech Emotion Recognition

Knowledge Distillation for Action Anticipation Via Label Smoothing

Robust Audio-Visual Speech Recognition Based on Hybrid Fusion

IPN Hand: A Video Dataset and Benchmark for Real-Time Continuous Hand Gesture Recognition

A Neural Lip-Sync Framework for Synthesizing Photorealistic Virtual News Anchors

Exposing Deepfake Videos by Tracking Eye Movements

Image Sequence Based Cyclist Action Recognition Using Multi-Stream 3D Convolution

Video Face Manipulation Detection through Ensemble of CNNs

Anticipating Activity from Multimodal Signals

Adaptive Feature Fusion Network for Gaze Tracking in Mobile Tablets

Two-Stream Temporal Convolutional Network for Dynamic Facial Attractiveness Prediction

3D Attention Mechanism for Fine-Grained Classification of Table Tennis Strokes Using a Twin Spatio-Temporal Convolutional Neural Networks

Attention-Driven Body Pose Encoding for Human Activity Recognition

Digit Recognition Applied to Reconstructed Audio Signals Using Deep Learning

Mutual Alignment between Audiovisual Features for End-To-End Audiovisual Speech Recognition

Motion U-Net: Multi-Cue Encoder-Decoder Network for Motion Segmentation

Which are the factors affecting the performance of audio surveillance systems?

Self-Supervised Learning of Dynamic Representations for Static Images

Unsupervised Co-Segmentation for Athlete Movements and Live Commentaries Using Crossmodal Temporal Proximity

Attribute-Based Quality Assessment for Demographic Estimation in Face Videos

What and How? Jointly Forecasting Human Action and Pose

Vision-Based Multi-Modal Framework for Action Recognition

User-Independent Gaze Estimation by Extracting Pupil Parameter and Its Mapping to the Gaze Angle

Space-Time Domain Tensor Neural Networks: An Application on Human Pose Classification

Classifying Eye-Tracking Data Using Saliency Maps

3D Audio-Visual Speaker Tracking with a Novel Particle Filter

Recognizing American Sign Language Nonmanual Signal Grammar Errors in Continuous Videos

LFIR2Pose: Pose Estimation from an Extremely Low-Resolution FIR Image Sequence

Person Recognition with HGR Maximal Correlation on Multimodal Data

ESResNet: Environmental Sound Classification Based on Visual Domain Models

A Multi-Task Neural Network for Action Recognition with 3D Key-Points