ICPR2020 Paper Browser

Paper download is intended for registered attendees only and is subject to the IEEE Copyright Policy. Any other use is strictly forbidden.

3D Audio-Visual Speaker Tracking with a Novel Particle Filter

Hong Liu, Yongheng Sun, Yidi Li, Bing Yang

Auto-TLDR; 3D audio-visual speaker tracking using a particle filter based method

3D speaker tracking using co-located audio-visual sensors has received much attention recently. Although various methods have been applied in this field, obtaining a reliable 3D tracking result remains challenging because the co-located sensors are confined to a small area. In this paper, a novel particle filter (PF) based method is proposed for 3D audio-visual speaker tracking. Compared with traditional PF-based audio-visual speaker tracking methods, our 3D audio-visual tracker has two main characteristics. In the prediction stage, we use audio-visual information from the current frame to further adjust the direction of the particles after the particle state transition process, which concentrates the particles around the speaker direction. In the update stage, the particle likelihood is calculated by fusing the visual distance and audio-visual direction information. Specifically, the distance likelihood is obtained from the camera projection model and the adaptively estimated size of the speaker's face or head, and the direction likelihood is determined by audio-visual particle fitness. In this way, the particle likelihood better represents the speaker presence probability in 3D space. Experimental results show that the proposed tracker outperforms other methods and provides favorable speaker tracking performance both in 3D space and on the image plane.
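
The two stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no formulas, so the random-walk transition, the blend rule used to pull particles toward the observed direction, the Gaussian likelihoods, and every name and parameter value below (pinhole_depth, pf_step, pull, depth_sigma, dir_sigma, the 0.16 m face width) are assumptions chosen for illustration. The observations in the demo are noise-free, and the sensor is assumed to sit at the origin of the camera frame with the z-axis as the optical axis.

```python
import math
import random


def pinhole_depth(focal_px, face_width_m, face_width_px):
    """Depth along the optical axis from the pinhole model: Z = f * W / w."""
    return focal_px * face_width_m / face_width_px


def unit(v):
    """Scale a non-zero 3D vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def angle_between(u, v):
    """Angle in radians between two non-zero 3D vectors."""
    u, v = unit(u), unit(v)
    dot = sum(a * b for a, b in zip(u, v))
    return math.acos(max(-1.0, min(1.0, dot)))


def gaussian(err, sigma):
    """Unnormalised Gaussian score of an error (assumed likelihood shape)."""
    return math.exp(-0.5 * (err / sigma) ** 2)


def pf_step(particles, obs_dir, obs_depth, rng,
            motion_sigma=0.05, pull=0.5, depth_sigma=0.2, dir_sigma=0.1):
    """One predict/update/resample cycle; returns (particles, estimate)."""
    obs_dir = unit(obs_dir)
    moved, weights = [], []
    for p in particles:
        # Prediction: random-walk state transition (assumed motion model).
        q = tuple(c + rng.gauss(0.0, motion_sigma) for c in p)
        # Direction adjustment: keep the particle's range, but rotate its
        # direction part-way toward the observed audio-visual direction.
        r = math.sqrt(sum(c * c for c in q))
        u = unit(q)
        d = unit(tuple((1.0 - pull) * a + pull * b for a, b in zip(u, obs_dir)))
        q = tuple(r * c for c in d)
        moved.append(q)
        # Update: fuse a distance likelihood (particle depth vs. pinhole
        # depth) with a direction likelihood (angle to observed direction).
        w_dist = gaussian(q[2] - obs_depth, depth_sigma)
        w_dir = gaussian(angle_between(q, obs_dir), dir_sigma)
        weights.append(w_dist * w_dir)
    total = sum(weights)
    if total == 0.0:
        weights = [1.0 / len(moved)] * len(moved)  # degenerate case: uniform
    else:
        weights = [w / total for w in weights]
    # Point estimate: weighted mean of the particle positions.
    estimate = tuple(sum(w * q[i] for w, q in zip(weights, moved))
                     for i in range(3))
    # Multinomial resampling, so weights are uniform again at the next step.
    resampled = rng.choices(moved, weights=weights, k=len(moved))
    return resampled, estimate


def run_demo(seed=0, n_particles=500, n_steps=10):
    """Track a static speaker with noise-free observations."""
    rng = random.Random(seed)
    true_pos = (0.5, 0.2, 2.0)  # hypothetical speaker position, metres
    particles = [(rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(1, 3))
                 for _ in range(n_particles)]
    # Hypothetical camera: 800 px focal length, 0.16 m face seen as 64 px.
    obs_depth = pinhole_depth(800.0, 0.16, 64.0)
    estimate = None
    for _ in range(n_steps):
        particles, estimate = pf_step(particles, true_pos, obs_depth, rng)
    return estimate, true_pos
```

Calling run_demo() returns the final position estimate together with the ground-truth position used to generate the observations. Multiplying the two likelihood terms treats distance and direction as independent cues, which is a simplifying choice in this sketch and not a claim about how the paper fuses them.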

Similar papers

Audio-Visual Speech Recognition Using a Two-Step Feature Fusion Strategy

Hong Liu, Wanlu Xu, Bing Yang

Auto-TLDR; A Two-Step Feature Fusion Network for Speech Recognition

Robust Audio-Visual Speech Recognition Based on Hybrid Fusion

Audio-Video Detection of the Active Speaker in Meetings

Three-Dimensional Lip Motion Network for Text-Independent Speaker Recognition

Mutual Alignment between Audiovisual Features for End-To-End Audiovisual Speech Recognition

Spatial Bias in Vision-Based Voice Activity Detection

Visual Saliency Oriented Vehicle Scale Estimation

An Adaptive Fusion Model Based on Kalman Filtering and LSTM for Fast Tracking of Road Signs

DAL: A Deep Depth-Aware Long-Term Tracker

Utilising Visual Attention Cues for Vehicle Detection and Tracking

Improving Mix-And-Separate Training in Audio-Visual Sound Source Separation with an Object Prior

TSDM: Tracking by SiamRPN++ with a Depth-Refiner and a Mask-Generator

Learning Visual Voice Activity Detection with an Automatically Annotated Dataset

Efficient Correlation Filter Tracking with Adaptive Training Sample Update Scheme

Detection of Calls from Smart Speaker Devices

Single-Modal Incremental Terrain Clustering from Self-Supervised Audio-Visual Feature Learning

Mobile Augmented Reality: Fast, Precise, and Smooth Planar Object Tracking

Adaptive Context-Aware Discriminative Correlation Filters for Robust Visual Object Tracking

Unsupervised Sound Source Localization From Audio-Image Pairs Using Input Gradient Map

Person Recognition with HGR Maximal Correlation on Multimodal Data

Are Multiple Cross-Correlation Identities Better Than Just Two? Improving the Estimate of Time Differences-Of-Arrivals from Blind Audio Signals

A Neural Lip-Sync Framework for Synthesizing Photorealistic Virtual News Anchors

AerialMPTNet: Multi-Pedestrian Tracking in Aerial Imagery Using Temporal and Graphical Features

Unsupervised Co-Segmentation for Athlete Movements and Live Commentaries Using Crossmodal Temporal Proximity

Siamese Fully Convolutional Tracker with Motion Correction

Let's Play Music: Audio-Driven Performance Video Generation

RSINet: Rotation-Scale Invariant Network for Online Visual Tracking

Hybrid Network for End-To-End Text-Independent Speaker Identification

Real-Time Drone Detection and Tracking with Visible, Thermal and Acoustic Sensors

NetCalib: A Novel Approach for LiDAR-Camera Auto-Calibration Based on Deep Learning

Audio-Visual Predictive Coding for Self-Supervised Visual Representation Learning

Digit Recognition Applied to Reconstructed Audio Signals Using Deep Learning

Rotational Adjoint Methods for Learning-Free 3D Human Pose Estimation from IMU Data

Total Estimation from RGB Video: On-Line Camera Self-Calibration, Non-Rigid Shape and Motion

Visual Object Tracking in Drone Images with Deep Reinforcement Learning

Unsupervised Moving Object Detection through Background Models for PTZ Camera

AV-SLAM: Autonomous Vehicle SLAM with Gravity Direction Initialization

Which are the factors affecting the performance of audio surveillance systems?

Generic Merging of Structure from Motion Maps with a Low Memory Footprint

DenseRecognition of Spoken Languages

Model Decay in Long-Term Tracking

Siamese Dynamic Mask Estimation Network for Fast Video Object Segmentation

Visual Localization for Autonomous Driving: Mapping the Accurate Location in the City Maze

Toward Text-Independent Cross-Lingual Speaker Recognition Using English-Mandarin-Taiwanese Dataset

Dynamic Resource-Aware Corner Detection for Bio-Inspired Vision Sensors

Talking Face Generation Via Learning Semantic and Temporal Synchronous Landmarks

Reducing False Positives in Object Tracking with Siamese Network

Accurate Background Subtraction Using Dynamic Object Presence Probability in Sports Scenes