ICPR2020 Paper Browser

Paper download is intended for registered attendees only and is subject to the IEEE Copyright Policy. Any other use is strictly forbidden.

Improving Mix-And-Separate Training in Audio-Visual Sound Source Separation with an Object Prior

Quan Nguyen, Simone Frintrop, Timo Gerkmann, Mikko Lauri, Julius Richter

Auto-TLDR; Object-Prior: Learning the 1-to-1 correspondence between visual and audio signals in audio-visual sound source separation

The performance of an audio-visual sound source separation system is determined by its ability to separate audio sources given images of the sources and the audio mixture. The goal of this study is to investigate the ability to learn the mapping between the sounds and the images of instruments by audio-visual sound source separation methods based on the state-of-the-art PixelPlayer [1]. Theoretical and empirical analyses illustrate that the PixelPlayer is not properly trained to learn the 1-to-1 correspondence between visual and audio signals during its mix-and-separate training process. Based on the insights from this analysis, a weakly-supervised method called Object-Prior is proposed and evaluated on two audio-visual datasets. The experimental results show that the proposed Object-Prior method outperforms the PixelPlayer and other baselines in the audio-visual sound source separation task. It is also more robust against asynchronized data, where the frame and the audio do not come from the same video, and recognizes musical instruments based on their sound with higher accuracy than the PixelPlayer. This indicates that learning the 1-to-1 correspondence between visual and audio features of an instrument improves the effectiveness of audio-visual sound source separation.
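
For readers unfamiliar with the term, the general idea of mix-and-separate training can be sketched as follows: audio from two solo videos is mixed synthetically, and the known individual sources then serve as free supervision for the separation target. The snippet below is a minimal illustration only, not the PixelPlayer or Object-Prior implementation. The function names, the use of magnitude spectrograms, the additive-magnitude mixing, and the ratio-mask target are all assumptions made for this sketch; the visual conditioning that the paper studies is omitted entirely.

```python
import numpy as np


def mix_and_separate_targets(mag_a, mag_b, eps=1e-8):
    """Build a synthetic mixture and per-source ratio-mask targets.

    mag_a, mag_b: non-negative magnitude spectrograms (frequency x time)
    of equal shape, each assumed to come from a different solo video.
    """
    if mag_a.shape != mag_b.shape:
        raise ValueError("spectrograms must have the same shape")
    # Assumption: magnitudes add approximately. Real systems usually mix
    # the waveforms and recompute the spectrogram of the mixture.
    mixture = mag_a + mag_b
    # Ratio masks: the fraction of each time-frequency bin owned by a source.
    mask_a = mag_a / (mixture + eps)
    mask_b = mag_b / (mixture + eps)
    return mixture, mask_a, mask_b


def separate(mixture, mask):
    """Apply a (predicted or target) mask to the mixture spectrogram."""
    return mixture * mask


# Toy usage with random data standing in for real spectrograms.
rng = np.random.default_rng(0)
spec_a = rng.random((4, 5)) + 0.1
spec_b = rng.random((4, 5)) + 0.1
mixture, target_a, target_b = mix_and_separate_targets(spec_a, spec_b)
recovered_a = separate(mixture, target_a)
```

In an actual training loop, a network would predict each mask from the mixture together with visual features of the corresponding video frame, and the loss would compare the predicted masks with these targets.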

Similar papers

Unsupervised Co-Segmentation for Athlete Movements and Live Commentaries Using Crossmodal Temporal Proximity

Yasunori Ohishi, Yuki Tanaka, Kunio Kashino

Auto-TLDR; A guided attention scheme for audio-visual co-segmentation

Audio-Visual Speech Recognition Using a Two-Step Feature Fusion Strategy

Single-Modal Incremental Terrain Clustering from Self-Supervised Audio-Visual Feature Learning

ESResNet: Environmental Sound Classification Based on Visual Domain Models

Audio-Visual Predictive Coding for Self-Supervised Visual Representation Learning

Ballroom Dance Recognition from Audio Recordings

S2I-Bird: Sound-To-Image Generation of Bird Species Using Generative Adversarial Networks

Hybrid Network for End-To-End Text-Independent Speaker Identification

Robust Audio-Visual Speech Recognition Based on Hybrid Fusion

Mutual Alignment between Audiovisual Features for End-To-End Audiovisual Speech Recognition

Unsupervised Sound Source Localization From Audio-Image Pairs Using Input Gradient Map

Which are the factors affecting the performance of audio surveillance systems?

Audio-Based Near-Duplicate Video Retrieval with Audio Similarity Learning

Are Multiple Cross-Correlation Identities Better Than Just Two? Improving the Estimate of Time Differences-Of-Arrivals from Blind Audio Signals

The Effect of Spectrogram Reconstruction on Automatic Music Transcription: An Alternative Approach to Improve Transcription Accuracy

End-To-End Triplet Loss Based Emotion Embedding System for Speech Emotion Recognition

Self-Supervised Joint Encoding of Motion and Appearance for First Person Action Recognition

AttendAffectNet: Self-Attention Based Networks for Predicting Affective Responses from Movies

Audio-Video Detection of the Active Speaker in Meetings

Spatial Bias in Vision-Based Voice Activity Detection

DenseRecognition of Spoken Languages

Temporally Coherent Embeddings for Self-Supervised Video Representation Learning

Adversarially Training for Audio Classifiers

Anticipating Activity from Multimodal Signals

Visual Oriented Encoder: Integrating Multimodal and Multi-Scale Contexts for Video Captioning

Person Recognition with HGR Maximal Correlation on Multimodal Data

Feature Engineering and Stacked Echo State Networks for Musical Onset Detection

One-Shot Learning for Acoustic Identification of Bird Species in Non-Stationary Environments

3D Audio-Visual Speaker Tracking with a Novel Particle Filter

Digit Recognition Applied to Reconstructed Audio Signals Using Deep Learning

Motion-Supervised Co-Part Segmentation

Contextual Classification Using Self-Supervised Auxiliary Models for Deep Neural Networks

Detection of Calls from Smart Speaker Devices

The Application of Capsule Neural Network Based CNN for Speech Emotion Recognition

Learning Visual Voice Activity Detection with an Automatically Annotated Dataset

Toward Text-Independent Cross-Lingual Speaker Recognition Using English-Mandarin-Taiwanese Dataset

Let's Play Music: Audio-Driven Performance Video Generation

3D Attention Mechanism for Fine-Grained Classification of Table Tennis Strokes Using a Twin Spatio-Temporal Convolutional Neural Networks

Multi-Modal Deep Clustering: Unsupervised Partitioning of Images

Space-Time Domain Tensor Neural Networks: An Application on Human Pose Classification

Mood Detection Analyzing Lyrics and Audio Signal Based on Deep Learning Architectures

RMS-Net: Regression and Masking for Soccer Event Spotting

Neuron-Based Network Pruning Based on Majority Voting

Self-Supervised Learning of Dynamic Representations for Static Images

Attentive Visual Semantic Specialized Network for Video Captioning

Developing Motion Code Embedding for Action Recognition in Videos

Text Synopsis Generation for Egocentric Videos

Automatic Annotation of Corpora for Emotion Recognition through Facial Expressions Analysis