Unsupervised Sound Source Localization From Audio-Image Pairs Using Input Gradient Map

Tomohiro Tanaka, Takahiro Shinozaki

Auto-TLDR; Unsupervised Sound Localization Using Gradient Method

Humans easily and routinely identify the image region that corresponds to an observed sound in their daily lives. We formulate this task as unsupervised sound source localization without tagged data. Recently, several methods have been proposed that utilize the activations of hidden or output layers of neural networks, such as an attention layer or the feature maps of a convolutional neural network (CNN). We propose an alternative strategy that obtains a localization map on the input side by applying the widely used input gradient method. It is computationally efficient and can easily be applied on top of any existing technique because it is independent of the network structure. Taking advantage of this property, we also propose combining it with existing methods for higher sound localization performance. Experiments are performed on the Flickr-SoundNet dataset. When a pre-trained image front-end is used, the proposed method gives better results than the attention-based method. Under a completely unsupervised condition, the gradient method provides performance comparable to that of conventional methods, and the best results are obtained by the combination method.
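
The core idea of the input gradient method is easy to prototype: backpropagate the audio-image correspondence score to the input image and read the gradient magnitude as a localization map. The sketch below is a minimal illustration, assuming a hypothetical `model(image, audio)` that returns a scalar correspondence score; it is not the authors' implementation.

```python
import torch

def input_gradient_map(model, image, audio):
    """Localization map from the gradient of the audio-image
    correspondence score with respect to the input image."""
    image = image.clone().requires_grad_(True)      # (3, H, W)
    score = model(image.unsqueeze(0), audio.unsqueeze(0))
    score.squeeze().backward()                      # d(score)/d(image)
    heat = image.grad.abs().sum(dim=0)              # aggregate over color channels
    return heat / (heat.max() + 1e-8)               # normalize to [0, 1]
```

Because the map is computed purely from input gradients, the same function works with any differentiable audio-visual correspondence network, which is what makes combining it with activation-based methods straightforward.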

Similar papers

Unsupervised Co-Segmentation for Athlete Movements and Live Commentaries Using Crossmodal Temporal Proximity

Yasunori Ohishi, Yuki Tanaka, Kunio Kashino

Auto-TLDR; A guided attention scheme for audio-visual co-segmentation

Audio-visual co-segmentation is the task of extracting segments and regions corresponding to specific events from unlabelled audio and video signals. It is particularly important to accomplish this in an unsupervised way, since it is generally very difficult to manually label all the objects and events appearing in audio-visual signals for supervised learning. Here, we propose to take advantage of the temporal proximity of corresponding audio and video entities included in the signals. For this purpose, we newly apply a guided attention scheme to this task to efficiently detect and utilize temporal co-occurrences of audio and video information. Experiments using a real TV broadcast of Sumo wrestling, a sporting event, with live commentaries show that our model can automatically extract specific athlete movements and their spoken descriptions in an unsupervised manner.

Attention-Based Selection Strategy for Weakly Supervised Object Localization

Zhenfei Zhang

Auto-TLDR; An Attention-based Selection Strategy for Weakly Supervised Object Localization

The Weakly Supervised Object Localization (WSOL) task aims to recognize object positions using only image-level labels. Some previous techniques remove the most discriminative parts of all input images, or of randomly chosen ones, to capture the entire object extent. However, these methods cannot adapt this operation to individual images and may hide data or feature maps that should not be hidden, in which case both classification and localization accuracy suffer. Meanwhile, simply erasing the most important regions tends to make the model learn less discriminative parts that lie outside the objects. To address these limitations, we propose an Attention-based Selection Strategy (ASS) that chooses the images that actually need to be erased. Moreover, we use self-attention maps with different thresholds to reduce the impact of unhelpful information in one of the branches of our selection strategy. Our experiments show that the proposed method is simple but effective in improving WSOL performance. In particular, ASS achieves new state-of-the-art accuracy on the CUB-200-2011 dataset and works very well on the ILSVRC 2016 dataset.

End-To-End Triplet Loss Based Emotion Embedding System for Speech Emotion Recognition

Puneet Kumar, Sidharth Jain, Balasubramanian Raman, Partha Pratim Roy, Masakazu Iwamura

Auto-TLDR; End-to-End Neural Embedding System for Speech Emotion Recognition

In this paper, an end-to-end neural embedding system based on triplet loss and residual learning is proposed for speech emotion recognition. The proposed system learns embeddings from the emotional information of speech utterances. The learned embeddings are used to recognize the emotions portrayed by given speech samples of various lengths. The system implements a Residual Neural Network architecture and is trained using softmax pre-training and the triplet loss function. The weights between the fully connected and embedding layers of the trained network are used to calculate the embedding values. The embedding representations of the various emotions are mapped onto a hyperplane, and the angles among them are computed using cosine similarity. These angles are utilized to classify a new speech sample into its appropriate emotion class. The proposed system demonstrates 91.67% and 64.44% accuracy when recognizing emotions on the RAVDESS and IEMOCAP datasets, respectively.
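
The angle-based decision rule at the end of the pipeline is compact enough to show directly. The sketch below is a minimal, hypothetical illustration of classifying an utterance embedding by cosine similarity against per-class reference embeddings; the reference vectors and dimensionality are placeholders, not values from the paper.

```python
import torch
import torch.nn.functional as F

def classify_emotion(embedding, references):
    """Pick the emotion whose reference embedding subtends the smallest
    angle (i.e., largest cosine similarity) with the input embedding."""
    sims = {label: F.cosine_similarity(embedding, ref, dim=0).item()
            for label, ref in references.items()}
    return max(sims, key=sims.get)

# toy usage with 2-D placeholder embeddings
refs = {"happy": torch.tensor([1.0, 0.0]), "sad": torch.tensor([0.0, 1.0])}
print(classify_emotion(torch.tensor([0.9, 0.1]), refs))  # -> "happy"
```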

Improving Mix-And-Separate Training in Audio-Visual Sound Source Separation with an Object Prior

Quan Nguyen, Simone Frintrop, Timo Gerkmann, Mikko Lauri, Julius Richter

Auto-TLDR; Object-Prior: Learning the 1-to-1 correspondence between visual and audio signals by audio-visual sound source separation methods

The performance of an audio-visual sound source separation system is determined by its ability to separate audio sources given images of the sources and the audio mixture. The goal of this study is to investigate the ability of audio-visual sound source separation methods based on the state-of-the-art PixelPlayer [1] to learn the mapping between the sounds and the images of instruments. Theoretical and empirical analyses illustrate that the PixelPlayer is not properly trained to learn the 1-to-1 correspondence between visual and audio signals during its mix-and-separate training process. Based on the insights from this analysis, a weakly-supervised method called Object-Prior is proposed and evaluated on two audio-visual datasets. The experimental results show that the proposed Object-Prior method outperforms the PixelPlayer and other baselines in the audio-visual sound source separation task. It is also more robust against asynchronous data, where the frame and the audio do not come from the same video, and recognizes musical instruments from their sound with higher accuracy than the PixelPlayer. This indicates that learning the 1-to-1 correspondence between the visual and audio features of an instrument improves the effectiveness of audio-visual sound source separation.

Dual-Attention Guided Dropblock Module for Weakly Supervised Object Localization

Junhui Yin, Siqing Zhang, Dongliang Chang, Zhanyu Ma, Jun Guo

Auto-TLDR; Dual-Attention Guided Dropblock for Weakly Supervised Object Localization

Attention mechanisms are frequently used to learn discriminative features for better feature representations. In this paper, we extend the attention mechanism to the task of weakly supervised object localization (WSOL) and propose the dual-attention guided dropblock module (DGDM), which aims to learn informative and complementary visual patterns for WSOL. This module contains two key components: the channel attention guided dropout (CAGD) and the spatial attention guided dropblock (SAGD). To model channel interdependencies, the CAGD ranks the channel attentions and treats the top-k attentions with the largest magnitudes as the important ones. It also retains some low-valued elements so that their values can increase if they become important during training. The SAGD can efficiently remove the most discriminative information by erasing contiguous regions of feature maps rather than individual pixels. This guides the model to capture the less discriminative parts for classification. Furthermore, it can also distinguish foreground objects from background regions to alleviate attention misdirection. Experimental results demonstrate that the proposed method achieves new state-of-the-art localization performance.
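
To make the CAGD step concrete, the sketch below keeps the top-k channels by attention magnitude and randomly retains a fraction of the low-attention channels so they can still grow during training. It is a loose illustration of the description above, with global average pooling standing in for the (unspecified) channel attention; the parameter names are assumptions.

```python
import torch

def channel_attention_dropout(feat, k, p_keep_low=0.1):
    """Keep the top-k channels by attention magnitude; randomly retain
    a small fraction of the remaining channels. feat: (B, C, H, W)."""
    B, C, _, _ = feat.shape
    attn = feat.mean(dim=(2, 3)).abs()                  # (B, C) channel attention
    mask = torch.zeros(B, C, device=feat.device)
    mask.scatter_(1, attn.topk(k, dim=1).indices, 1.0)  # mark important channels
    # occasionally keep low-valued channels so they can become important
    keep_low = (torch.rand(B, C, device=feat.device) < p_keep_low).float()
    mask = torch.maximum(mask, keep_low)
    return feat * mask[:, :, None, None]
```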

Attentive Part-Aware Networks for Partial Person Re-Identification

Lijuan Huo, Chunfeng Song, Zhengyi Liu, Zhaoxiang Zhang

Auto-TLDR; Part-Aware Learning for Partial Person Re-identification

Partial person re-identification (re-ID) refers to re-identifying a person from occluded images. It suffers from two major challenges, i.e., insufficient training data and incomplete probe images. In this paper, we introduce an automatic data augmentation module and a part-aware learning method for partial re-identification. On the one hand, we adopt data augmentation to enhance the training data and help the model learn more stable partial features. On the other hand, we observe that partial person images usually cover fixed percentages of the body; therefore, in the partial re-ID task, probe images can be cropped from the pictures and divided into several partial types following fixed ratios. Based on the cropped images, we propose the Cropping Type Consistency (CTC) loss to classify the cropping types of partial images. Moreover, to help the network better fit the generated and cropped data, we incorporate the Block Attention Mechanism (BAM) into the framework for attentive learning. To enhance retrieval performance in the inference stage, we crop gallery images according to the predicted types of the probe partial images. By calculating feature distances between the partial image and the cropped holistic gallery images, we can recognize the right person in the gallery. To validate the effectiveness of our approach, we conduct extensive experiments on the partial re-ID benchmarks and achieve state-of-the-art performance.

S2I-Bird: Sound-To-Image Generation of Bird Species Using Generative Adversarial Networks

Joo Yong Shim, Joongheon Kim, Jong-Kook Kim

Auto-TLDR; Generating bird images from sound using conditional generative adversarial networks

Generating images from sound is a challenging task. This paper proposes a novel deep learning model that generates bird images from their corresponding sound information. Our proposed model includes a sound encoder that extracts suitable feature representations from audio recordings, and it then generates bird images corresponding to the calls using conditional generative adversarial networks (GANs) with auxiliary classifiers. We demonstrate that our model produces better image generation results, outperforming other state-of-the-art methods in a similar context.

Aggregating Object Features Based on Attention Weights for Fine-Grained Image Retrieval

Hongli Lin, Yongqi Song, Zixuan Zeng, Weisheng Wang

Auto-TLDR; DSAW: Unsupervised Dual-selection for Fine-Grained Image Retrieval

Object localization and local feature representation are key issues in fine-grained image retrieval. However, existing unsupervised methods still need improvement in both aspects. To tackle these issues in a unified framework, a novel unsupervised scheme, named DSAW for short, is presented in this paper. First, we propose a dual-selection (DS) method that achieves more accurate object localization by using an adaptive threshold to perform feature selection on the local and global activation maps in turn. Second, a novel and faster self-attention weights (AW) method is developed to weight local features by measuring their importance in the global context. Finally, we evaluate the performance of the proposed method on five fine-grained image datasets; the results show that DSAW outperforms the existing best method.

Zoom-CAM: Generating Fine-Grained Pixel Annotations from Image Labels

Xiangwei Shi, Seyran Khademi, Yunqiang Li, Jan Van Gemert

Auto-TLDR; Zoom-CAM for Weakly Supervised Object Localization and Segmentation

Current weakly supervised object localization and segmentation methods rely on class-discriminative visualization techniques for convolutional neural networks (CNNs) to generate pseudo-labels for pixel-level training. However, visualization methods, including CAM and Grad-CAM, focus on the most discriminative object parts summarized in the last convolutional layer, missing the complete pixel mapping available in intermediate layers. We propose Zoom-CAM, which goes beyond the last, lowest-resolution layer by integrating the importance maps over all activations in intermediate layers. Zoom-CAM captures fine-grained, small-scale objects for various discriminative class instances, which are commonly missed by the baseline visualization methods. We focus on generating pixel-level pseudo-labels from class labels. The quality of our pseudo-labels, evaluated on the ImageNet localization task, exhibits an improvement of more than 2.8% in top-1 error. For weakly supervised semantic segmentation, our generated pseudo-labels improve a state-of-the-art model by 1.1%.
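
The aggregation step can be sketched in a few lines: upsample each intermediate importance map to the input resolution, normalize it, and sum. This is a simplified illustration in the spirit of Zoom-CAM, not the paper's exact weighting scheme.

```python
import torch
import torch.nn.functional as F

def aggregate_layer_maps(layer_maps, out_size):
    """Fuse importance maps from several intermediate layers.
    layer_maps: list of (1, 1, h, w) tensors; out_size: (H, W)."""
    acc = torch.zeros(1, 1, *out_size)
    for m in layer_maps:
        m = F.relu(m)                                    # keep positive evidence
        m = F.interpolate(m, size=out_size, mode="bilinear",
                          align_corners=False)
        acc += m / (m.max() + 1e-8)                      # per-layer normalization
    return acc / len(layer_maps)
```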

Audio-Based Near-Duplicate Video Retrieval with Audio Similarity Learning

Pavlos Avgoustinakis, Giorgos Kordopatis-Zilos, Symeon Papadopoulos, Andreas L. Symeonidis, Ioannis Kompatsiaris

Auto-TLDR; AuSiL: Audio Similarity Learning for Near-duplicate Video Retrieval

In this work, we address the problem of audio-based near-duplicate video retrieval. We propose the Audio Similarity Learning (AuSiL) approach, which effectively captures temporal patterns of audio similarity between video pairs. For a robust similarity calculation between two videos, we first extract representative audio-based video descriptors by leveraging transfer learning from a Convolutional Neural Network (CNN) trained on a large-scale dataset of audio events, and then we calculate the similarity matrix derived from the pairwise similarity of these descriptors. The similarity matrix is subsequently fed to a CNN that captures the temporal structures within its content. We train our network following a triplet generation process and optimizing the triplet loss function. To evaluate the effectiveness of the proposed approach, we have manually annotated two publicly available video datasets based on the audio duplicity between their videos. The proposed approach achieves very competitive results compared to three state-of-the-art methods. Also, unlike the competing methods, it is very robust for the retrieval of audio duplicates generated with speed transformations.
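
The pairwise similarity matrix that feeds the temporal CNN is straightforward to compute from the two videos' descriptor sequences. The sketch below uses cosine similarity, as an assumption; the descriptor extractor itself is not shown.

```python
import torch
import torch.nn.functional as F

def similarity_matrix(desc_a, desc_b):
    """Pairwise cosine similarities between the frame-level descriptors
    of two videos. desc_a: (Ta, D), desc_b: (Tb, D) -> (Ta, Tb)."""
    a = F.normalize(desc_a, dim=1)
    b = F.normalize(desc_b, dim=1)
    return a @ b.t()    # entry (i, j): similarity of frame i to frame j
```

An audio duplicate then appears as a high-valued (possibly slanted, under speed transformations) diagonal in this matrix, which is the temporal structure the downstream CNN learns to detect.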

Spatial Bias in Vision-Based Voice Activity Detection

Kalin Stefanov, Mohammad Adiban, Giampiero Salvi

Auto-TLDR; Spatial Bias in Vision-based Voice Activity Detection in Multiparty Human-Human Interactions

We present models for automatic vision-based voice activity detection (VAD) in multiparty human-human interactions, aimed at complementing acoustic VAD methods. We provide evidence that this type of vision-based VAD model is susceptible to spatial bias in the datasets. The physical setting of the interaction, usually constant throughout data acquisition, determines the distribution of the participants' head poses. Our results show that when the head pose distributions differ significantly between the training and test sets, the performance of the models drops significantly. This suggests that previously reported results on datasets with a fixed physical configuration may overestimate the generalization capabilities of such models. We also propose a number of possible remedies for the spatial bias, including data augmentation, input masking and dynamic features, and provide an in-depth analysis of the visual cues used by our models.

Revisiting Sequence-To-Sequence Video Object Segmentation with Multi-Task Loss and Skip-Memory

Fatemeh Azimi, Benjamin Bischke, Sebastian Palacio, Federico Raue, Jörn Hees, Andreas Dengel

Auto-TLDR; Sequence-to-Sequence Learning for Video Object Segmentation

Video Object Segmentation (VOS) is an active research area in the visual domain. One of its fundamental sub-tasks is semi-supervised / one-shot learning: given only the segmentation mask for the first frame, the task is to provide pixel-accurate masks for the object over the rest of the sequence. Despite much progress in recent years, we noticed that many of the existing approaches lose objects in longer sequences, especially when the object is small or briefly occluded. In this work, we build upon a sequence-to-sequence approach that employs an encoder-decoder architecture together with a memory module for exploiting the sequential data. We further improve this approach by proposing a model that manipulates multi-scale spatio-temporal information using memory-equipped skip connections. Furthermore, we incorporate an auxiliary task based on distance classification which greatly enhances the quality of edges in the segmentation masks. We compare our approach to the state of the art and show considerable improvement in the contour accuracy metric and the overall segmentation accuracy.

Single-Modal Incremental Terrain Clustering from Self-Supervised Audio-Visual Feature Learning

Reina Ishikawa, Ryo Hachiuma, Akiyoshi Kurobe, Hideo Saito

Auto-TLDR; Multi-modal Variational Autoencoder for Terrain Type Clustering

The key to an accurate understanding of terrain is extracting informative features from the multi-modal data obtained from different devices. Sensors such as RGB cameras, depth sensors, vibration sensors, and microphones supply this multi-modal data. Many studies, especially in the robotics field, have explored ways to use them, and some have successfully introduced single-modal or multi-modal methods. In practice, however, robots can face extreme conditions: microphones do not work well in crowded scenes, and an RGB camera cannot capture terrain well in the dark. In this paper, we present a novel framework that applies a multi-modal variational autoencoder and the Gaussian mixture model clustering algorithm to image and audio data for terrain type clustering. Our method enables terrain type clustering even if one of the modalities (either image or audio) is missing at test time. We evaluated the clustering accuracy against a conventional multi-modal terrain type clustering method and conducted ablation studies to show the effectiveness of our approach.

Object Detection Model Based on Scene-Level Region Proposal Self-Attention

Yu Quan, Zhixin Li, Canlong Zhang, Huifang Ma

Auto-TLDR; Exploiting Semantic Information for Object Detection

Improvements to object detection performance have mostly focused on extracting local information near the region of interest in the image, and detection performance in this respect has been unable to achieve the desired effect. First, a depth-wise separable convolution network (D_SCNet-127 R-CNN) is built on the backbone network. Considering the importance of scene and semantic information for visual recognition, the feature map is sent to the semantic segmentation module, the region proposal network module, and the region proposal self-attention module to build a network combining scene-level information and region proposal self-attention. Second, deep reinforcement learning is utilized to achieve accurate positioning in border regression, and the calculation speed of the whole model is improved through a light-weight head network. This model can effectively overcome the limitations of feature extraction in traditional object detection and obtain more comprehensive, detailed features. Experimental verification on the MSCOCO17, VOC12, and Cityscapes datasets shows that the proposed method has good validity and scalability.

Audio-Visual Speech Recognition Using a Two-Step Feature Fusion Strategy

Hong Liu, Wanlu Xu, Bing Yang

Auto-TLDR; A Two-Step Feature Fusion Network for Speech Recognition

Lip-reading methods and the fusion strategy are crucial for audio-visual speech recognition. In recent years, most approaches have involved two separate audio and visual streams with early or late fusion strategies. Such single-stage fusion methods may fail to guarantee the integrity and representativeness of the fused information simultaneously. This paper extends a traditional single-stage fusion network to a two-step feature fusion network by adding an audio-visual early feature fusion (AV-EFF) stream to the baseline model. This method can learn the fusion information of different stages, preserving the original features as much as possible and ensuring the independence of the different features. Besides, to capture long-range dependencies in the video information, a non-local block is added to the feature extraction part of the visual stream (NL-Visual) to obtain long-term spatio-temporal features. Experimental results on the two largest public datasets in English (LRW) and Mandarin (LRW-1000) demonstrate that our method is superior to other state-of-the-art methods.

Temporally Coherent Embeddings for Self-Supervised Video Representation Learning

Joshua Knights, Ben Harwood, Daniel Ward, Anthony Vanderkop, Olivia Mackenzie-Ross, Peyman Moghadam

Auto-TLDR; Temporally Coherent Embeddings for Self-supervised Video Representation Learning

This paper presents TCE: Temporally Coherent Embeddings for self-supervised video representation learning. The proposed method exploits the inherent structure of unlabeled video data to explicitly enforce temporal coherency in the embedding space, rather than learning it indirectly through ranking or predictive proxy tasks. In the same way that high-level visual information in the world changes smoothly, we believe that nearby frames in learned representations will benefit from demonstrating similar properties. Using this assumption, we train our TCE model to encode videos such that adjacent frames lie close to each other and videos are separated from one another. Using TCE we learn robust representations from large quantities of unlabeled video data. We thoroughly analyse and evaluate our self-supervised TCE models on the downstream task of video action recognition using multiple challenging benchmarks (Kinetics400, UCF101, HMDB51). With a simple but effective 2D-CNN backbone and only RGB stream inputs, TCE pre-trained representations outperform all previous self-supervised 2D-CNN and 3D-CNN models trained on UCF101. The code and pre-trained models for this paper can be downloaded at: https://github.com/csiro-robotics/TCE
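
The coherency objective ("adjacent frames close, different videos apart") can be expressed as a hinge loss over frame embeddings. The sketch below is a generic illustration of that idea, not the exact TCE loss; the margin value is a placeholder.

```python
import torch
import torch.nn.functional as F

def temporal_coherence_loss(anchor, neighbor, other, margin=1.0):
    """Pull embeddings of adjacent frames together, push frames from a
    different video at least `margin` further away.
    anchor, neighbor: (B, D) adjacent-frame embeddings; other: (B, D)."""
    d_pos = F.pairwise_distance(anchor, neighbor)   # same video, adjacent
    d_neg = F.pairwise_distance(anchor, other)      # different video
    return F.relu(d_pos - d_neg + margin).mean()
```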

FC-DCNN: A Densely Connected Neural Network for Stereo Estimation

Dominik Hirner, Friedrich Fraundorfer

Auto-TLDR; FC-DCNN: A Lightweight Network for Stereo Estimation

We propose a novel lightweight network for stereo estimation. Our network consists of a fully-convolutional densely connected neural network (FC-DCNN) that computes matching costs between rectified image pairs. Our FC-DCNN method learns expressive features and performs some simple but effective post-processing steps. The densely connected layer structure connects the output of each layer to the input of each subsequent layer. This network structure, in addition to eliminating any fully-connected layers, leads to a very lightweight network. The output of this network is used to calculate matching costs and create a cost volume. Instead of using time- and memory-inefficient cost-aggregation methods such as semi-global matching or conditional random fields to improve the result, we rely on filtering techniques, namely the median filter and the guided filter. By computing a left-right consistency check we remove inconsistent values. Afterwards we apply a watershed foreground-background segmentation to the disparity image with the inconsistencies removed. This mask is then used to refine the final prediction. We show that our method works well for both challenging indoor and outdoor scenes by evaluating it on the Middlebury, KITTI and ETH3D benchmarks, respectively.
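
The left-right consistency check mentioned above is a standard post-processing step and can be sketched directly: a pixel's left-view disparity is compared with the right-view disparity at the position it maps to, and disagreeing pixels are invalidated. This is a generic NumPy illustration, not the authors' code; the tolerance is an assumption.

```python
import numpy as np

def left_right_consistency(disp_left, disp_right, max_diff=1.0):
    """Invalidate disparities that disagree between the two views.
    disp_left, disp_right: (H, W) disparity maps; returns a copy of
    disp_left with inconsistent pixels set to NaN."""
    H, W = disp_left.shape
    ys = np.arange(H)[:, None].repeat(W, axis=1)
    xs = np.arange(W)[None, :].repeat(H, axis=0)
    # column each left pixel projects to in the right image
    xr = np.clip(np.round(xs - disp_left).astype(int), 0, W - 1)
    diff = np.abs(disp_left - disp_right[ys, xr])
    out = disp_left.astype(float).copy()
    out[diff > max_diff] = np.nan
    return out
```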

Which are the factors affecting the performance of audio surveillance systems?

Antonio Greco, Antonio Roberto, Alessia Saggese, Mario Vento

Auto-TLDR; Sound Event Recognition Using Convolutional Neural Networks and Visual Representations on MIVIA Audio Events

Sound event recognition systems are rapidly becoming part of our lives, since they can be profitably used in several vertical markets, ranging from audio security applications to scene classification and multi-modal analysis in social robotics. In recent years, a non-negligible part of the scientific community has started to apply Convolutional Neural Networks (CNNs) to image-based representations of the audio stream, owing to their successful adoption in almost all computer vision tasks. In this paper, we carry out a detailed benchmark of various widely used CNN architectures and visual representations on a popular dataset, namely the MIVIA Audio Events database. Our analysis is aimed at understanding how these factors affect sound event recognition performance, with a particular focus on the false positive rate, which is very relevant in audio surveillance solutions. In fact, although most of the proposed solutions achieve a high recognition rate, their capability of distinguishing the events of interest from the background is often not yet sufficient for real systems, preventing their usage in real applications. Our comprehensive experimental analysis investigates this aspect and allows us to identify useful design guidelines for increasing the specificity of sound event recognition systems.

Local Attention and Global Representation Collaborating for Fine-Grained Classification

He Zhang, Yunming Bai, Hui Zhang, Jing Liu, Xingguang Li, Zhaofeng He

Auto-TLDR; Weighted Region Network for Cosmetic Contact Lenses Detection

Cosmetic contact lenses over an iris may change its original textural pattern, which is the foundation of iris recognition, making cosmetic lenses a feasible and easy-to-use means of iris presentation attack. Aiming at cosmetic contact lens detection for practical application systems, some approaches have been proposed, but they still face unsolved problems such as low-quality iris images and inaccurately localized iris boundaries. In this paper, we propose a novel framework called the Weighted Region Network (WRN) for cosmetic contact lens detection. The WRN includes both a local attention Weight Network and a global classification Region Network. With its inherent attention mechanism, the proposed network is able to find the most discriminative regions, which reduces the requirement for target detection and improves the ability to classify based on specific areas and patterns. The Weight Network can be trained using a rank loss and an MSE loss without manual annotations of discriminative regions. Experiments are conducted on several databases and a newly collected low-quality iris image database. The proposed method outperforms state-of-the-art fake iris detection algorithms, and it is also effective for the fine-grained image classification task.

Using Scene Graphs for Detecting Visual Relationships

Anurag Tripathi, Siddharth Srivastava, Brejesh Lall, Santanu Chaudhury

Auto-TLDR; Relationship Detection using Context Aligned Scene Graph Embeddings

In this paper we solve the problem of detecting relationships between pairs of objects in an image. We develop spatially aware word embeddings using scene graphs, and we use joint feature representations containing visual, spatial and semantic embeddings from the input images to train a deep network on the task of relationship detection. Further, we propose to utilize context-aligned scene graph embeddings from the training set, without requiring the explicit availability of scene graphs at test time. We show that the proposed method outperforms the state-of-the-art methods for predicate detection and provides competitive results on relationship detection. We also show the generalization ability of the proposed method by performing predictions under zero-shot settings. Further, we provide an exhaustive empirical evaluation of each component of the proposed network.

Rethinking ReID: Multi-Feature Fusion Person Re-Identification Based on Orientation Constraints

Mingjing Ai, Guozhi Shan, Bo Liu, Tianyang Liu

Auto-TLDR; Person Re-identification with Orientation Constrained Network

Person re-identification (ReID) aims to identify a specific pedestrian in a series of images or videos. Recently, ReID has been receiving more and more attention in computer vision research and in applications such as intelligent security. One major issue that degrades ReID model performance is that different subjects in the same body orientation look too similar for the model to distinguish, while the same subject viewed in different orientations looks rather different. However, most current studies do not particularly differentiate pedestrians by orientation when designing the network, so we rethink this problem from the perspective of person orientation and propose a new network structure comprising two branches: one handling samples with the same body orientation and the other handling samples with different body orientations. Correspondingly, we also propose an orientation classifier that can accurately distinguish the orientation of each person. At the same time, a three-part loss function is introduced for the orientation constraint, and its terms are combined to optimize the network simultaneously. We also use global and local features in the training stage in order to exploit multi-level information. Therefore, our network derives its efficacy from the orientation constraints and multiple features. Experiments show that our method not only achieves competitive performance on multiple datasets, but also ranks retrieval results that are aligned with the orientation of the query sample higher, which may have great potential in practical applications.

Learning with Delayed Feedback

Pranavan Theivendiram, Terence Sim

Auto-TLDR; Supervised Machine Learning with Delayed Feedback

We propose a novel supervised machine learning strategy, inspired by human learning, that enables an Agent to learn continually over its lifetime. A natural consequence is that the Agent must be able to handle an input whose label is delayed until a later time, or may not arrive at all. Our Agent learns in two steps: a short Seeding phase, in which the Agent's model is initialized with labelled inputs, and an indefinitely long Growing phase, in which the Agent refines and assesses its model if the label is given for an input, but stores the input in a finite-length queue if the label is missing. Queued items are matched against future input-label pairs that arrive, and the model is then updated. Our strategy also allows for the delayed feedback to take a different form. For example, in an image captioning task, the feedback could be a semantic segmentation rather than a textual caption. We show with many experiments that our strategy enables an Agent to learn flexibly and efficiently.

A Few-Shot Learning Approach for Historical Ciphered Manuscript Recognition

Mohamed Ali Souibgui, Alicia Fornés, Yousri Kessentini, Crina Tudor

Auto-TLDR; Handwritten Ciphers Recognition Using Few-Shot Object Detection

Encoded (or ciphered) manuscripts are a special type of historical document containing encrypted text. The automatic recognition of such documents is challenging because: 1) the cipher alphabet changes from one document to another, 2) there is a lack of annotated corpora for training, and 3) touching symbols make symbol segmentation difficult and complex. To overcome these difficulties, we propose a novel method for handwritten cipher recognition based on few-shot object detection. Our method first detects all symbols of a given alphabet in a line image, and then a decoding step maps the symbol similarity scores to the final sequence of transcribed symbols. By training on synthetic data, we show that the proposed architecture is able to recognize handwritten ciphers with unseen alphabets. In addition, if a few labeled pages with the same alphabet are used for fine-tuning, our method surpasses existing unsupervised and supervised HTR methods for cipher recognition.

An Unsupervised Approach towards Varying Human Skin Tone Using Generative Adversarial Networks

Debapriya Roy, Diganta Mukherjee, Bhabatosh Chanda

Auto-TLDR; Unsupervised Skin Tone Change Using Augmented Reality Based Models

With the increasing popularity of augmented and virtual reality, retailers are now focusing more on customer satisfaction to increase sales. Although augmented reality is not a new concept, it has gained much-needed attention over the past few years. Our present work is targeted in this direction and may be used to enhance user experience in various virtual and augmented reality based applications. We propose a model to change the skin tone of a person. Given any input image of a person or a group of persons, along with a value indicating the desired change of skin color towards fairness or darkness, this method can change the skin tone of the persons in the image. This is an unsupervised method, and it is also unconstrained in terms of pose, illumination, number of persons in the image, etc. The goal of this work is to reduce the complexity, in terms of time and effort, that is generally needed for changing skin tone with existing applications, whether by professionals or novices. Rigorous experiments show the efficacy of this method in terms of synthesizing perceptually convincing outputs.

Suppressing Features That Contain Disparity Edge for Stereo Matching

Xindong Ai, Zuliu Yang, Weida Yang, Yong Zhao, Zhengzhong Yu, Fuchi Li

Auto-TLDR; SDE-Attention: A Novel Attention Mechanism for Stereo Matching

Existing networks for stereo matching usually use a 2-D CNN as the feature extractor. However, objects are usually spatially continuous; if an extracted feature contains a disparity edge (i.e., the representation of this feature in the original image contains a disparity edge), then this feature usually does not occur inside the region of an object. We propose a novel attention mechanism to suppress features containing disparity edges, named SDE-Attention (SDEA). We notice that features containing disparity edges are usually continuous in one image and discontinuous in the other, which means that they usually show a greater difference between two feature maps of the same layer than features that do not contain disparity edges. SDEA calculates a weight matrix for the intermediate feature map according to this trait, and the weight matrix is then multiplied with the intermediate feature map. We test SDEA on PSMNet, and the experimental results show that our method brings a significant improvement in accuracy; our network achieves state-of-the-art performance among the published networks.
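
The trait described above (features containing disparity edges differ strongly across the two views' feature maps) suggests a simple down-weighting rule. The sketch below is a loose interpretation, assuming the two same-layer feature maps are directly comparable; the exact weighting function in SDEA may differ.

```python
import torch

def sde_attention_weights(feat_left, feat_right):
    """Down-weight locations where the two views' feature maps disagree,
    i.e., where disparity edges are likely.
    feat_left, feat_right: (B, C, H, W) maps from the same layer."""
    diff = (feat_left - feat_right).abs().mean(dim=1, keepdim=True)  # (B,1,H,W)
    weight = 1.0 / (1.0 + diff)      # large cross-view difference -> small weight
    return feat_left * weight        # suppressed feature map
```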

Point In: Counting Trees with Weakly Supervised Segmentation Network

Pinmo Tong, Shuhui Bu, Pengcheng Han

Auto-TLDR; Weakly Supervised Tree Counting Using a Deep Segmentation Network with Localization and Mask Prediction

For tree counting tasks, traditional image processing methods require expensive feature engineering and are not end-to-end frameworks; this introduces additional noise and prevents overall optimization, so such methods have not been widely used in recent tree counting applications. Recently, many deep learning based approaches have been designed for this task because of their powerful feature extraction ability. The representative approach is the bounding-box based supervised method, but time-consuming annotations are indispensable for it, and it has difficulty overcoming occlusion and overlap. To solve these problems, we propose a weakly supervised tree counting network (WTCNet) based on a deep segmentation network with only point supervision. It can simultaneously complete tree counting with localization and output a mask for each tree at the same time. We first adopt a novel feature extractor network (FENet) to obtain features of the input images, and then an effective strategy is introduced to handle the different mask predictions. In the end, we propose a basic localization guidance accompanied by a rectification guidance to train the network. We create two different datasets and select an existing challenging plant dataset to evaluate our method on three different tasks. Experimental results show the good performance improvement of our method compared with other existing methods. Further study shows that our method has great potential to reduce human labor and provide effective ground-truth masks, and the results show the superiority of our method over the advanced methods.

Hybrid Network for End-To-End Text-Independent Speaker Identification

Wajdi Ghezaiel, Luc Brun, Olivier Lezoray

Auto-TLDR; Text-Independent Speaker Identification with Scattering Wavelet Network and Convolutional Neural Networks

Deep learning has recently improved the performance of speaker identification (SI) systems, and promising results have been obtained with Convolutional Neural Networks (CNNs). This success is mostly driven by the advent of large datasets. However, in the context of commercial applications, the collection of large amounts of training data is not always possible. In addition, the robustness of an SI system is adversely affected by short utterances. SI with only a few short utterances is therefore a challenging problem. In this paper, we propose a novel text-independent speaker identification system that can identify speakers by learning from only a few short training utterances. To achieve this, we combine a CNN with a scattering wavelet network: a two-stage feature extraction framework using a two-layer wavelet scattering network coupled with a CNN. The proposed architecture takes variable-length speech segments. To evaluate the effectiveness of the proposed approach, the TIMIT and LibriSpeech datasets are used in the experiments. The conducted experiments show that our hybrid architecture performs successfully for SI, even with a small number and short duration of training samples, and the obtained results show that it achieves better performance than related methods.

Multi-Order Feature Statistical Model for Fine-Grained Visual Categorization

Qingtao Wang, Ke Zhang, Shaoli Huang, Lianbo Zhang, Jin Fan

Auto-TLDR; Multi-Order Feature Statistical Method for Fine-Grained Visual Categorization

Fine-grained visual categorization aims to learn a robust image representation that models the subtle differences between similar categories. Existing methods in this field tackle the problem by designing complex frameworks that produce high-level features via first-order or second-order pooling. Despite the impressive performance achieved by these strategies, such single-order networks only carry linear or non-linear information from the last convolutional layer, neglecting the fact that features from different orders are mutually complementary. In this paper, we propose a Multi-Order Feature Statistical Method (MOFS), which learns fine-grained features characterizing multiple orders. Specifically, the MOFS consists of two sub-modules: (i) a first-order module modeling both mid-level and high-level features, and (ii) a covariance feature statistical module capturing high-order features. By deploying these two sub-modules on top of existing backbone networks, MOFS simultaneously captures multiple levels of discriminative patterns, including local, global and correlated patterns. We evaluate the proposed method on three challenging benchmarks, namely CUB-200-2011, Stanford Cars, and FGVC-Aircraft. Compared with state-of-the-art methods, the experimental results exhibit superior performance in recognizing fine-grained objects.
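
Second-order (covariance) pooling, the core of the covariance feature statistical module, has a compact standard form: flatten the spatial dimensions and compute the channel covariance. The sketch below shows this generic operation, not the paper's full module.

```python
import torch

def covariance_pooling(feat):
    """Channel covariance of a convolutional feature map.
    feat: (B, C, H, W) -> (B, C, C) second-order statistics."""
    B, C, H, W = feat.shape
    x = feat.flatten(2)                        # (B, C, N), N = H*W
    x = x - x.mean(dim=2, keepdim=True)        # center each channel
    return x @ x.transpose(1, 2) / (H * W - 1)
```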

Visual Object Tracking in Drone Images with Deep Reinforcement Learning

Derya Gözen, Sedat Ozer

Auto-TLDR; A Deep Reinforcement Learning based Single Object Tracker for Drone Applications

There is an increasing demand for camera-equipped drones and their applications in many domains, varying from agriculture to entertainment and from sports events to surveillance. In such drone applications, an essential and common task is visually tracking an object of interest. Drone (or UAV) images have different properties compared to ground-taken (natural) images, and those differences introduce additional complexities when existing object trackers are applied directly to drone applications. Important differences include (i) smaller sizes of the objects to be tracked and (ii) different orientations and viewing angles, yielding different textures and features to be observed. Therefore, new algorithms trained on drone images are needed for drone-based applications. In this paper, we introduce a deep reinforcement learning (RL) based single object tracker that tracks an object of interest in drone images by estimating a series of actions to find the location of the object in the next frame. This is the first work introducing a single object tracker using a deep RL-based technique for drone images. Our proposed solution introduces a novel reward function that aims to reduce the total number of actions taken to estimate the object's location in the next frame, and it also introduces a different backbone network to be used on low-resolution images. Additionally, we introduce a set of new actions into the action library to better deal with the above-mentioned complexities. We compare our proposed solutions to a state-of-the-art tracking algorithm from the recent literature and demonstrate up to 3.87% improvement in precision and 3.6% improvement in IoU values on the VisDrone2019 dataset. We also provide additional results on the OTB-100 dataset and show up to 3.15% improvement in precision over the same previous state-of-the-art algorithm. Lastly, we analyze the ability of our proposed solutions to handle some of the challenges faced during tracking, including but not limited to occlusion, deformation, and scale variation.

Mutual-Supervised Feature Modulation Network for Occluded Pedestrian Detection

Ye He, Chao Zhu, Xu-Cheng Yin

Auto-TLDR; A Mutual-Supervised Feature Modulation Network for Occluded Pedestrian Detection

State-of-the-art pedestrian detectors have achieved significant progress on non-occluded pedestrians, yet they still struggle under heavy occlusions. The recent occlusion handling strategy of popular two-stage approaches is to build a two-branch architecture with the help of additional visible-body annotations. Nonetheless, these methods still have weaknesses: either the two branches are trained independently with only score-level fusion, which cannot guarantee that the detectors learn sufficiently robust pedestrian features, or attention mechanisms are exploited to emphasize only the visible-body features. However, the visible-body features of heavily occluded pedestrians are concentrated in a relatively small area, which easily causes missed detections. To address these issues, we propose in this paper a novel Mutual-Supervised Feature Modulation (MSFM) network to better handle occluded pedestrian detection. The key MSFM module in our network calculates the similarity loss between full-body boxes and visible-body boxes corresponding to the same pedestrian, so that the full-body detector can learn more complete and robust pedestrian features with the assistance of contextual features from the occluding parts. To facilitate the MSFM module, we also propose a novel two-branch architecture consisting of a standard full-body detection branch and an extra visible-body classification branch. These two branches are trained in a mutual-supervised way with full-body annotations and visible-body annotations, respectively. To verify the effectiveness of our proposed method, extensive experiments are conducted on two challenging pedestrian datasets: Caltech and CityPersons. Our approach achieves superior performance compared to other state-of-the-art methods on both datasets, especially in heavy-occlusion cases.

Multi-Branch Attention Networks for Classifying Galaxy Clusters

Yu Zhang, Gongbo Liang, Yuanyuan Su, Nathan Jacobs

Auto-TLDR; Multi-branch Attention Networks for Classification of Galaxy Clusters

This paper addresses the task of classifying galaxy clusters, the largest known objects in the Universe. Galaxy clusters can be categorized into cool-core (CC), weak-cool-core (WCC), and non-cool-core (NCC) clusters, defined by their central cooling times. Conventional approaches in astrophysics conduct this classification by measuring surface brightness concentrations or central gas densities, which are inaccurate. Off-the-shelf deep learning approaches to this problem would take entire images as inputs and predict cluster types directly. However, this strategy is limited in that central cooling times are usually related to only small informative regions near the center, and feeding unrelated outer regions into the network may hurt performance. In this work, we propose multi-branch attention networks that utilize attention and a bivariate Gaussian distribution to identify the galaxy cluster type. Our loss function is designed to encompass the domain knowledge that the central cooling time of the three types of galaxy clusters (CC, WCC, NCC) varies continuously. To train our model, we take advantage of simulated X-ray emissivity images, comprising 954 projections of 318 clusters. Experimental results demonstrate that our method outperforms the baseline method and can be applied to classify large galaxy images accurately.

Two-Level Attention-Based Fusion Learning for RGB-D Face Recognition

Hardik Uppal, Alireza Sepas-Moghaddam, Michael Greenspan, Ali Etemad

Auto-TLDR; Fused RGB-D Facial Recognition using Attention-Aware Feature Fusion

With recent advances in RGB-D sensing technologies, as well as improvements in machine learning and fusion techniques, RGB-D facial recognition has become an active area of research. A novel attention-aware method is proposed to fuse the two image modalities, RGB and depth, for enhanced RGB-D facial recognition. The proposed method first extracts features from both modalities using a convolutional feature extractor. These features are then fused using a two-layer attention mechanism. The first layer focuses on the fused feature maps generated by the feature extractor, exploiting the relationship between feature maps using LSTM recurrent learning. The second layer focuses on the spatial features of those maps using convolution. The training database is preprocessed and augmented through a set of geometric transformations, and the learning process is further aided by transfer learning from a pure 2D RGB image training process. Comparative evaluations demonstrate that the proposed method outperforms other state-of-the-art approaches, including both traditional and deep neural network-based methods, on the challenging CurtinFaces and IIIT-D RGB-D benchmark databases, achieving classification accuracies over 98.2% and 99.3%, respectively. The proposed attention mechanism is also compared with other attention mechanisms, demonstrating more accurate results.

ESResNet: Environmental Sound Classification Based on Visual Domain Models

Andrey Guzhov, Federico Raue, Jörn Hees, Andreas Dengel

Auto-TLDR; Environmental Sound Classification with Short-Time Fourier Transform Spectrograms

Environmental Sound Classification (ESC) is an active research area in the audio domain and has seen much progress in the past years. However, many of the existing approaches achieve high accuracy by relying on domain-specific features and architectures, making it harder to benefit from advances in other fields (e.g., the image domain). Additionally, some of the past successes have been attributed to a discrepancy in how results are evaluated (i.e., on unofficial splits of the UrbanSound8K (US8K) dataset), distorting the overall progression of the field. The contribution of this paper is twofold. First, we present a model that is inherently compatible with mono and stereo sound inputs. Our model is based on simple log-power Short-Time Fourier Transform (STFT) spectrograms and combines them with several well-known approaches from the image domain (i.e., ResNet, Siamese-like networks and attention). We investigate the influence of cross-domain pre-training and architectural changes, and we evaluate our model on standard datasets. We find that our model outperforms all previously known approaches in a fair comparison by achieving accuracies of 97.0% (ESC-10), 91.5% (ESC-50) and 84.2% / 85.4% (US8K mono / stereo). Second, we provide a comprehensive overview of the actual state of the field by differentiating several previously reported results on the US8K dataset between official and unofficial splits. For better reproducibility, our code (including any re-implementations) is made available.
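
The input representation is simple enough to reproduce in a few lines. The sketch below computes a log-power STFT spectrogram for a mono waveform; the FFT and hop sizes are placeholders, not the paper's settings.

```python
import torch

def log_power_spectrogram(wave, n_fft=1024, hop=512, eps=1e-10):
    """Log-power STFT spectrogram of a mono waveform.
    wave: (T,) float tensor -> (n_fft // 2 + 1, frames)."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(wave, n_fft=n_fft, hop_length=hop,
                      window=window, return_complex=True)
    power = spec.abs() ** 2
    return 10.0 * torch.log10(power + eps)   # decibel-scaled power
```

For stereo input, the same transform applied per channel yields a two-channel "image" that image-domain backbones such as ResNet can consume directly.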

Multi-Modal Deep Clustering: Unsupervised Partitioning of Images

Guy Shiran, Daphna Weinshall

Auto-TLDR; Multi-Modal Deep Clustering for Unlabeled Images

The clustering of unlabeled raw images is a daunting task, which has recently been approached with some success by deep learning methods. Here we propose an unsupervised clustering framework that learns a deep neural network in an end-to-end fashion, providing direct cluster assignments of images without additional processing. Multi-Modal Deep Clustering (MMDC) trains a deep network to align its image embeddings with target points sampled from a Gaussian Mixture Model distribution. The cluster assignments are then determined by the mixture-component association of the image embeddings. Simultaneously, the same deep network is trained to solve an additional self-supervised task. This pushes the network to learn more meaningful image representations and stabilizes the training. Experimental results show that MMDC achieves or exceeds state-of-the-art performance on four challenging benchmarks. On natural image datasets we improve on previous results by significant margins of up to 11 absolute accuracy points, yielding an accuracy of 70% on CIFAR-10 and 61% on STL-10.
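
The target-sampling and component-association steps can be sketched independently of the network. Below is a self-contained NumPy illustration with made-up dimensions: targets are drawn around fixed component means, and cluster labels follow from the nearest component.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, N = 10, 64, 1024                              # components, dims, samples
means = rng.normal(size=(K, D))                     # fixed GMM component means
# target points the image embeddings are trained to align with
targets = means[rng.integers(0, K, size=N)] + 0.1 * rng.normal(size=(N, D))

def assign_clusters(embeddings, means):
    """Cluster id = index of the nearest mixture-component mean."""
    d = np.linalg.norm(embeddings[:, None, :] - means[None, :, :], axis=2)
    return d.argmin(axis=1)

labels = assign_clusters(targets, means)            # (N,) cluster assignments
```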

DenseRecognition of Spoken Languages

Jaybrata Chakraborty, Bappaditya Chakraborty, Ujjwal Bhattacharya

Auto-TLDR; DenseNet: A Dense Convolutional Network Architecture for Speech Recognition in Indian Languages

In the present study, we have, for the first time, considered a large number of Indian languages for recognition from their audio signals of different sources. A dense convolutional network architecture (DenseNet) has been proposed for this classification problem. Dynamic elimination of low-energy frames from the input speech signal has been considered as a preprocessing operation. The Mel-spectrogram of the pre-processed speech signal is fed to a DenseNet architecture for recognition of its language. Recognition performance of the proposed architecture has been compared with that of several state-of-the-art deep architectures, which include a traditional convolutional neural network (CNN), multiple ResNet architectures, and CNN-BLSTM and DenseNet-BLSTM hybrid architectures. Additionally, we obtained recognition performances of a stacked BLSTM architecture fed with different sets of hand-crafted features for comparison purposes. Simulations have been performed on two different standard datasets: (i) the IITKGP-MLILSC dataset of news clips in 27 different Indian languages and (ii) the Linguistic Data Consortium (LDC) dataset of telephonic conversations in 5 different Indian languages. Recognition performance of the proposed framework has been found to be consistently and significantly better than that of all other frameworks implemented in this study.

Mutual Alignment between Audiovisual Features for End-To-End Audiovisual Speech Recognition

Hong Liu, Yawei Wang, Bing Yang

Auto-TLDR; Mutual Iterative Attention for Audio Visual Speech Recognition

The asynchronization issue caused by the different types of modalities is one of the major problems in audio-visual speech recognition (AVSR) research. However, most AVSR systems merely rely on upsampling the video or downsampling the audio to align audio and visual features, assuming that the feature sequences are aligned frame-by-frame. These pre-processing steps oversimplify the asynchrony relation between the acoustic signal and lip motion, lacking flexibility and impairing system performance. Although there are systems that model the asynchrony between the modalities, they sometimes fail to align speech and video precisely in some, or even all, noise conditions. In this paper, we propose a mutual feature alignment method for AVSR that can make full use of cross-modality information to address the asynchronization issue by introducing a Mutual Iterative Attention (MIA) mechanism. Our method can automatically learn an alignment in a mutual way by performing mutual attention iteratively between the audio and visual features, relying on the modified encoder structure of the Transformer. Experimental results show that our proposed method obtains absolute improvements of up to 20.42% over the audio modality alone, depending on the signal-to-noise-ratio (SNR) level. Better recognition performance is also achieved compared with the traditional feature concatenation method under both clean and noisy conditions. We expect that our proposed mutual feature alignment method can easily be generalized to other multimodal tasks with semantically correlated information.
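
Mutual iterative attention is naturally expressed with standard Transformer components: each modality cross-attends to the other, and the exchange is repeated for a few iterations. The sketch below is a simplified stand-in using PyTorch's built-in multi-head attention; the dimensions, residual wiring and iteration count are assumptions, and the paper's modified encoder adds further structure.

```python
import torch
import torch.nn as nn

class MutualIterativeAttention(nn.Module):
    """Audio attends to video and video attends to audio, iteratively."""
    def __init__(self, dim=256, heads=4, iters=3):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.iters = iters

    def forward(self, audio, video):        # (B, Ta, D), (B, Tv, D)
        for _ in range(self.iters):
            audio = audio + self.a2v(audio, video, video)[0]  # audio queries video
            video = video + self.v2a(video, audio, audio)[0]  # video queries audio
        return audio, video
```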

Learning Embeddings for Image Clustering: An Empirical Study of Triplet Loss Approaches

Kalun Ho, Janis Keuper, Franz-Josef Pfreundt, Margret Keuper

Auto-TLDR; Clustering Objectives for K-means and Correlation Clustering Using Triplet Loss

In this work, we evaluate two different image clustering objectives, k-means clustering and correlation clustering, in the context of Triplet Loss induced feature space embeddings. Specifically, we train a convolutional neural network to learn discriminative features by optimizing two popular versions of the Triplet Loss in order to study their clustering properties under the assumption of noisy labels. Additionally, we propose a new, simple Triplet Loss formulation, which shows desirable properties with respect to formal clustering objectives and outperforms the existing methods. We evaluate all three Triplet Loss formulations for k-means and correlation clustering on the CIFAR-10 image classification dataset.
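
For reference, two widely used Triplet Loss formulations are shown below: the classic hinge (margin) version and a smooth soft-margin variant. These are generic textbook forms, given here for context; the paper's new third formulation is not reproduced.

```python
import torch
import torch.nn.functional as F

def triplet_hinge(anchor, pos, neg, margin=0.2):
    """Classic margin-based triplet loss."""
    d_pos = F.pairwise_distance(anchor, pos)
    d_neg = F.pairwise_distance(anchor, neg)
    return F.relu(d_pos - d_neg + margin).mean()

def triplet_soft(anchor, pos, neg):
    """Soft-margin variant: log(1 + exp(d_pos - d_neg)), margin-free."""
    d_pos = F.pairwise_distance(anchor, pos)
    d_neg = F.pairwise_distance(anchor, neg)
    return F.softplus(d_pos - d_neg).mean()
```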

Filter Pruning Using Hierarchical Group Sparse Regularization for Deep Convolutional Neural Networks

Kakeru Mitsuno, Takio Kurita

Auto-TLDR; Hierarchical Group Sparse Regularization for Sparse Convolutional Neural Networks

Since convolutional neural networks are often trained with redundant parameters, it is possible to remove redundant kernels or filters to obtain a compact network without dropping the classification accuracy. In this paper, we propose a filter pruning method using hierarchical group sparse regularization. It was shown in our previous work that hierarchical group sparse regularization is effective in obtaining sparse networks in which the filters connected to unnecessary channels are automatically driven close to zero. After training the convolutional neural network with the hierarchical group sparse regularization, the unnecessary filters are selected based on the increase in the classification loss on randomly selected training samples, yielding a compact network. We show that the proposed method can remove more than 50% of the parameters of ResNet for CIFAR-10 with only a 0.3% decrease in test accuracy. Also, 34% of the parameters of ResNet are removed for TinyImageNet-200 with higher accuracy than the baseline network.
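
A group-sparse penalty of this kind can be added to the training loss directly. The sketch below shows a two-level group-lasso term over a convolution weight (per output filter and per input channel); it is a generic illustration of hierarchical group sparsity, not the paper's exact regularizer, and the coefficient is a placeholder.

```python
import torch

def hierarchical_group_sparse_penalty(conv_weight, lam=1e-4):
    """L2,1 penalties over two levels of the filter hierarchy.
    conv_weight: (C_out, C_in, kH, kW)."""
    filter_norms = conv_weight.flatten(1).norm(dim=1)    # one norm per output filter
    channel_norms = conv_weight.transpose(0, 1).flatten(1).norm(dim=1)  # per input channel
    return lam * (filter_norms.sum() + channel_norms.sum())

# usage: loss = task_loss + sum(hierarchical_group_sparse_penalty(m.weight)
#                               for m in model.modules()
#                               if isinstance(m, torch.nn.Conv2d))
```

Filters whose group norms are driven toward zero by this penalty are the natural candidates for pruning.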

Utilising Visual Attention Cues for Vehicle Detection and Tracking

Feiyan Hu, Venkatesh Gurram Munirathnam, Noel E O'Connor, Alan Smeaton, Suzanne Little

Auto-TLDR; Visual Attention for Object Detection and Tracking in Driver-Assistance Systems

Advanced Driver-Assistance Systems (ADAS) have been attracting attention from many researchers. Vision-based sensors are the closest way to emulate the visual behavior of a human driver. In this paper, we explore possible ways to use visual attention (saliency) for object detection and tracking. We investigate: 1) how a visual attention map, such as a subjectness/saliency map or an objectness attention map, can facilitate region proposal generation in a two-stage object detector; 2) how a visual attention map can be used for tracking multiple objects. We propose a neural network that can simultaneously detect objects and generate objectness and subjectness maps to save computational power. We further exploit the visual attention map during tracking using a sequential Monte Carlo probability hypothesis density (PHD) filter. The experiments are conducted on the KITTI and DETRAC datasets. The use of visual attention and hierarchical features yields a considerable improvement of ≈8% in object detection, which effectively increases tracking performance by ≈4% on the KITTI dataset.
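
One plausible way to let an attention map facilitate proposal generation, sketched below, is to re-weight each proposal's score by the mean saliency inside its box; the paper's actual integration into the two-stage detector is more involved.

```python
import torch

def attention_weighted_scores(scores, boxes, saliency):
    # scores: (N,); boxes: (N, 4) as (x1, y1, x2, y2); saliency: (H, W) in [0, 1].
    weights = torch.stack([
        saliency[y1:y2, x1:x2].mean() if (y2 > y1 and x2 > x1)
        else saliency.new_tensor(0.0)
        for x1, y1, x2, y2 in boxes.long().tolist()
    ])
    return scores * weights  # proposals over salient regions keep high scores
```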

One-Stage Multi-Task Detector for 3D Cardiac MR Imaging

Weizeng Lu, Xi Jia, Wei Chen, Nicolò Savioli, Antonio De Marvao, Linlin Shen, Declan O'Regan, Jinming Duan

Auto-TLDR; Multi-task Learning for Real-Time, simultaneous landmark location and bounding box detection in 3D space

Fast and accurate landmark location and bounding box detection are important steps in 3D medical imaging. In this paper, we propose a novel multi-task learning framework for real-time, simultaneous landmark location and bounding box detection in 3D space. Our method extends the well-known single-shot multibox detector (SSD) from single-task to multi-task learning and from 2D to 3D. Furthermore, we propose a post-processing approach that refines the network's landmark output by averaging the candidate landmarks. Owing to these settings, the proposed framework is fast and accurate. For 3D cardiac magnetic resonance (MR) images of size 224 × 224 × 64, our framework runs at about 128 volumes per second (VPS) on a GPU and achieves a 6.75 mm average point-to-point distance error for landmark location, which outperforms both state-of-the-art and baseline methods. We also show that segmenting the 3D image cropped with the bounding box improves both performance and efficiency.
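
The refinement step described above amounts to averaging candidate landmark predictions; a minimal sketch, where top-k selection by confidence is an assumption rather than the paper's stated rule:

```python
import torch

def refine_landmark(candidates, confidences, k=5):
    # candidates: (N, 3) predicted (x, y, z) points; confidences: (N,).
    topk = confidences.topk(min(k, confidences.numel())).indices
    return candidates[topk].mean(dim=0)  # averaged, refined landmark
```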

Self-Supervised Learning for Astronomical Image Classification

Ana Martinazzo, Mateus Espadoto, Nina S. T. Hirata

Auto-TLDR; Unlabeled Astronomical Images for Deep Neural Network Pre-training

In astronomy, a huge amount of image data is generated daily by photometric surveys, which scan the sky to collect data from stars, galaxies, and other celestial objects. In this paper, we propose a technique that leverages unlabeled astronomical images to pre-train deep convolutional neural networks in order to learn a domain-specific feature extractor, which improves the results of machine learning techniques in setups with small amounts of labeled data. We show that our technique produces results that are in many cases better than using ImageNet pre-training.
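
The general recipe is to pre-train a CNN on unlabeled survey images with a proxy task, then reuse it as a feature extractor for the small labeled set. A hedged sketch, with the backbone choice and freezing policy as assumptions:

```python
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 10  # placeholder for the downstream labeled task

backbone = models.resnet18(weights=None)  # weights would come from proxy pre-training
backbone.fc = nn.Linear(backbone.fc.in_features, NUM_CLASSES)
# Optionally freeze early layers so only later layers and the new head
# adapt to the small labeled set.
for p in backbone.layer1.parameters():
    p.requires_grad = False
```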

Adaptive Image Compression Using GAN Based Semantic-Perceptual Residual Compensation

Ruojing Wang, Zitang Sun, Sei-Ichiro Kamata, Weili Chen

Auto-TLDR; Adaptive Image Compression using GAN based Semantic-Perceptual Residual Compensation

Image compression is a basic task in image processing. In this paper, we present an adaptive image compression algorithm that relies on GAN-based semantic-perceptual residual compensation and offers visually pleasing reconstruction at a low bitrate. Our method adopts a U-shaped encoding and decoding structure, accompanied by a well-designed dense residual connection with a strip pooling module, to improve the original auto-encoder. Besides, we introduce adversarial learning through a discriminator, thus constructing a complete GAN. To improve coding efficiency, we design an adaptive semantic-perceptual residual compensation block based on the Grad-CAM algorithm. In the quantizer, we embed a soft-quantization method to mitigate the problem that gradients cannot propagate through the hard quantization step. Simultaneously, we use the FLIF lossless compression algorithm and the BPG compression algorithm to compress the image further. Experimental results, including PSNR and MS-SSIM, demonstrate that the proposed approach outperforms current state-of-the-art image compression methods.
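
The soft-quantization trick mentioned above is commonly implemented as a softmax-weighted assignment to learned centers, so gradients can flow through the quantizer; the paper's exact variant is not shown here.

```python
import torch

def soft_quantize(z, centers, temperature=1.0):
    # z: latent tensor of any shape; centers: (L,) learned quantization levels.
    dist = (z.unsqueeze(-1) - centers) ** 2            # squared distance to each level
    weights = torch.softmax(-dist / temperature, dim=-1)
    return (weights * centers).sum(dim=-1)             # differentiable surrogate for round()
```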

AttendAffectNet: Self-Attention Based Networks for Predicting Affective Responses from Movies

Thi Phuong Thao Ha, Bt Balamurali, Herremans Dorien, Roig Gemma

Auto-TLDR; AttendAffectNet: A Self-Attention Based Network for Emotion Prediction from Movies

In this work, we propose different variants of a self-attention based network for emotion prediction from movies, which we call AttendAffectNet. We take both audio and video into account and incorporate the relations among multiple modalities by applying the self-attention mechanism in a novel manner to the extracted features for emotion prediction. We compare this to the typical temporal integration of self-attention based models, which in our case captures the relations among temporal representations of the movie while considering the sequential dependencies of emotion responses. We demonstrate the effectiveness of our proposed architectures on the extended COGNIMUSE dataset [1], [2] and the MediaEval 2016 Emotional Impact of Movies Task [3], which consist of movies with emotion annotations. Our results show that applying the self-attention mechanism over the different audio-visual features, rather than in the time domain, is more effective for emotion prediction. Our approach also outperforms state-of-the-art models for emotion prediction.
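
The core idea, self-attention across per-modality feature vectors rather than across time, can be sketched as one token per feature type; the dimensions and the pooling head below are assumptions.

```python
import torch
import torch.nn as nn

dim, n_features = 256, 4                  # e.g. two audio + two visual feature types
attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
head = nn.Linear(dim, 1)                  # pooled emotion score

tokens = torch.randn(8, n_features, dim)  # (batch, feature tokens, dim)
fused, _ = attn(tokens, tokens, tokens)   # relate modalities to one another
prediction = head(fused.mean(dim=1))      # (batch, 1)
```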

Convolutional STN for Weakly Supervised Object Localization

Akhil Meethal, Marco Pedersoli, Soufiane Belharbi, Eric Granger

Auto-TLDR; Spatial Localization for Weakly Supervised Object Localization

Weakly-supervised object localization is a challenging task in which the object of interest must be localized while its appearance is being learned. State-of-the-art methods recycle the architecture of a standard CNN by using the activation maps of the last layer to localize the object. While this approach is simple and works relatively well, object localization relies on different features than classification; thus, a specialized localization mechanism is required during training to improve performance. In this paper, we propose a convolutional, multi-scale spatial localization network that provides accurate localization for the object of interest. Experimental results on the CUB-200-2011 and ImageNet datasets show competitive performance of our proposed approach on weakly-supervised localization.
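
A spatial transformer's core warping step is standard; the sketch below shows it with PyTorch's built-in grid utilities, while the paper's multi-scale convolutional STN adds the localization networks around it.

```python
import torch.nn.functional as F

def apply_affine(feature_map, theta):
    # feature_map: (B, C, H, W); theta: (B, 2, 3) predicted affine parameters.
    grid = F.affine_grid(theta, feature_map.size(), align_corners=False)
    return F.grid_sample(feature_map, grid, align_corners=False)
```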

Audio-Visual Predictive Coding for Self-Supervised Visual Representation Learning

Mani Kumar Tellamekala, Michel Valstar, Michael Pound, Timo Giesbrecht

Auto-TLDR; AV-PPC: A Multi-task Learning Framework for Learning Semantic Visual Features from Unlabeled Video Data

Self-supervised learning has emerged as a candidate approach to learn semantic visual features from unlabeled video data. In self-supervised learning, intrinsic correspondences between data points are used to define a proxy task that forces the model to learn semantic representations. Most existing proxy tasks applied to video data exploit only either intra-modal (e.g. temporal) or cross-modal (e.g. audio-visual) correspondences separately. In theory, jointly learning both these correspondences may result in richer visual features; but, as we show in this work, doing so is non-trivial in practice. To address this problem, we introduce `Audio-Visual Permutative Predictive Coding' (AV-PPC), a multi-task learning framework designed to fully leverage the temporal and cross-modal correspondences as natural supervision signals. In AV-PPC, the model is trained to simultaneously learn multiple intra- and cross-modal predictive coding sub-tasks. Using visual speech recognition (lip-reading) as the downstream evaluation task, we show that our proposed proxy task learns higher-quality visual features than existing proxy tasks. We also show that AV-PPC visual features are highly data-efficient. Without further fine-tuning, the AV-PPC visual encoder achieves an 80.30% spoken-word classification rate on the LRW dataset, performing on par with directly or fully supervised visual encoders learned from large amounts of labeled data.
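
One cross-modal predictive-coding sub-task, in the spirit of the abstract, could predict the next visual embedding from audio context and score it contrastively; the loss form and dimensions below are assumptions, and the actual AV-PPC task set is richer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

predictor = nn.Linear(256, 256)  # maps audio context to a predicted visual feature

def predictive_loss(audio_ctx, visual_next, temperature=0.07):
    # audio_ctx, visual_next: (B, 256); InfoNCE over the batch.
    pred = F.normalize(predictor(audio_ctx), dim=-1)
    target = F.normalize(visual_next, dim=-1)
    logits = pred @ target.t() / temperature   # (B, B) similarities
    labels = torch.arange(logits.size(0))      # matching pairs on the diagonal
    return F.cross_entropy(logits, labels)
```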

Audio-Video Detection of the Active Speaker in Meetings

Francisco Madrigal, Frederic Lerasle, Lionel Pibre, Isabelle Ferrané

Auto-TLDR; Active Speaker Detection with Visual and Contextual Information from Meeting Context

Meetings are a common activity that presents certain challenges when creating systems to assist them. Such is the case for active speaker detection, which can provide useful information for human interaction modeling or human-robot interaction. Active speaker detection is mostly done using speech; however, certain visual and contextual information can provide additional insights. In this paper we propose an active speaker detection framework that integrates audiovisual features with social information from the meeting context. The visual cue is processed using a Convolutional Neural Network (CNN) that captures spatio-temporal relationships. We analyze several CNN architectures with both cues: raw pixels (RGB images) and motion (estimated with optical flow). Contextual reasoning is done with an original methodology based on the gaze of all participants. We evaluate our proposal on a public state-of-the-art benchmark: the AMI corpus. We show how the addition of visual and context information improves the performance of active speaker detection.
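
As a toy illustration of fusing a per-person visual speaking score with gaze-based context (the fraction of participants looking at that person), one might use a simple convex combination; this fusion rule is an assumption, not the paper's method.

```python
def fuse_scores(visual_score, gaze_targets, person_id, alpha=0.7):
    # gaze_targets: list of person ids each participant is currently looking at.
    context = gaze_targets.count(person_id) / max(len(gaze_targets), 1)
    return alpha * visual_score + (1 - alpha) * context
```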

Learning Knowledge-Rich Sequential Model for Planar Homography Estimation in Aerial Video

Pu Li, Xiaobai Liu

Auto-TLDR; Sequential Estimation of Planar Homographic Transformations over Aerial Videos

This paper presents an unsupervised approach that leverages raw aerial videos to learn to estimate the planar homographic transformation between consecutive video frames. Previous learning-based estimators work on pairs of images to estimate their planar homographic transformations but suffer from severe over-fitting, especially when applied to aerial videos. To address this concern, we develop a sequential estimator that directly processes a sequence of video frames and estimates their pairwise planar homographic transformations in batches. We also incorporate a set of spatial-temporal knowledge to regularize the learning of such a sequence-to-sequence model. We collect a set of challenging aerial videos and compare the proposed method to alternative algorithms. Empirical studies suggest that our sequential model achieves a significant improvement over alternative image-based methods and that the knowledge-rich regularization further boosts system performance. Our code and dataset can be found at https://github.com/Paul-LiPu/DeepVideoHomography
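
Once pairwise homographies are estimated, composing them lets any frame be warped into the first frame's plane; the sketch below assumes the estimated matrices are given and uses OpenCV's well-known warpPerspective.

```python
import numpy as np
import cv2

def warp_to_first(frames, pairwise_H):
    # frames: list of (H, W, 3) arrays; pairwise_H[i] maps frame i+1 -> frame i.
    h, w = frames[0].shape[:2]
    H_acc = np.eye(3)
    warped = [frames[0]]
    for frame, H in zip(frames[1:], pairwise_H):
        H_acc = H_acc @ H  # accumulate the mapping into frame 0's plane
        warped.append(cv2.warpPerspective(frame, H_acc, (w, h)))
    return warped
```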

Context Matters: Self-Attention for Sign Language Recognition

Fares Ben Slimane, Mohamed Bouguessa

Auto-TLDR; Attentional Network for Continuous Sign Language Recognition

This paper proposes an attentional network for the task of continuous sign language recognition. The proposed approach exploits co-independent streams of data to model the sign language modalities. These different channels of information can share a complex temporal structure with each other. For that reason, we apply attention to synchronize and help capture entangled dependencies between the different sign language components. Even though sign language is multi-channel, handshapes represent the central entities in sign interpretation, and seeing handshapes in their correct context defines the meaning of a sign. Taking that into account, we utilize the attention mechanism to efficiently aggregate the hand features with their appropriate spatio-temporal context for better sign recognition. We found that by doing so, the model is able to identify the essential sign language components that revolve around the dominant hand and the face areas. We test our model on the benchmark RWTH-PHOENIX-Weather 2014 dataset, yielding competitive results.
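
The hand-in-context aggregation can be read as cross-attention with hand features as queries over the full spatio-temporal context; the shapes and feature names below are assumptions.

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
hands = torch.randn(2, 50, 256)    # (batch, time, dim) dominant-hand features
context = torch.randn(2, 50, 256)  # fused face/body context features
hands_in_context, _ = attn(hands, context, context)  # hands query their context
```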