ICPR2020 Paper Browser

Paper download is intended for registered attendees only, and is subject to the IEEE Copyright Policy. Any other use is strictly forbidden.

Are Multiple Cross-Correlation Identities Better Than Just Two? Improving the Estimate of Time Differences-Of-Arrivals from Blind Audio Signals

Danilo Greco, Jacopo Cavazza, Alessio Del Bue

Auto-TLDR; Improving Blind Channel Identification Using Cross-Correlation Identity for Time Differences-of-Arrivals Estimation

Given an unknown audio source, the estimation of time differences-of-arrivals (TDOAs) can be efficiently and robustly solved using blind channel identification and exploiting the cross-correlation identity (CCI). Prior "blind" works have improved the estimate of TDOAs by means of different algorithmic solutions and optimization strategies, while always sticking to the case of N = 2 microphones. But what if we could obtain a direct improvement in performance by just increasing N? In this paper we investigate this direction, showing that, despite its arguable simplicity, it is capable of (sharply) improving upon state-of-the-art blind channel identification methods based on CCI, without modifying the computational pipeline. Inspired by our results, we seek to encourage the community and practitioners by paving the way (with two concrete, yet preliminary, examples) towards joint approaches in which advances in optimization are combined with an increased number of microphones, in order to achieve further improvements.
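
As a rough illustration of the identity the abstract refers to, the sketch below checks numerically that, when each microphone signal is the source convolved with its own channel (x_i = s * h_i), every pair of microphones satisfies x_i * h_j = x_j * h_i. It is not the authors' pipeline: the channels are known here rather than blindly estimated, and the single-tap channels, the delays, the gains and all variable names are assumptions chosen only for illustration. With N = 3 microphones there are three such pairwise identities instead of one.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
source = rng.standard_normal(1000)  # stands in for the unknown audio source

# Hypothetical direct-path delays (in samples) and gains for N = 3 microphones.
delays = [5, 12, 20]
gains = [1.0, 0.8, 0.6]
filter_len = 32

# Toy channels: one direct-path tap each (an assumption, real rooms add echoes).
channels = []
for d, g in zip(delays, gains):
    h = np.zeros(filter_len)
    h[d] = g
    channels.append(h)

# Each microphone records the source convolved with its own channel.
signals = [np.convolve(source, h) for h in channels]

cci_residuals = {}
tdoas = {}
for i, j in itertools.combinations(range(len(channels)), 2):
    # Cross-correlation identity: x_i * h_j - x_j * h_i should vanish.
    residual = np.convolve(signals[i], channels[j]) - np.convolve(signals[j], channels[i])
    cci_residuals[(i, j)] = float(np.max(np.abs(residual)))
    # TDOA in samples, read off as the difference of the direct-path peak positions.
    tdoas[(i, j)] = int(np.argmax(np.abs(channels[j])) - np.argmax(np.abs(channels[i])))

print(cci_residuals)
print(tdoas)
```

In a blind setting the channels are unknown, and the residual above is what an estimator would try to drive towards zero; adding microphones adds more pairwise residuals to constrain the same unknowns.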

Similar papers

Which are the factors affecting the performance of audio surveillance systems?

Antonio Greco, Antonio Roberto, Alessia Saggese, Mario Vento

Auto-TLDR; Sound Event Recognition Using Convolutional Neural Networks and Visual Representations on MIVIA Audio Events

Sound event recognition systems are rapidly becoming part of our lives, since they can be profitably used in several vertical markets, ranging from audio security applications to scene classification and multi-modal analysis in social robotics. In recent years, a non-negligible part of the scientific community has started to apply Convolutional Neural Networks (CNNs) to image-based representations of the audio stream, due to their successful adoption in almost all computer vision tasks. In this paper, we carry out a detailed benchmark of various widely used CNN architectures and visual representations on a popular dataset, namely the MIVIA Audio Events database. Our analysis is aimed at understanding how these factors affect sound event recognition performance, with a particular focus on the false positive rate, which is highly relevant in audio surveillance solutions. In fact, although most of the proposed solutions achieve a high recognition rate, their capability of distinguishing the events of interest from the background is often not yet sufficient, which prevents their use in real applications. Our comprehensive experimental analysis investigates this aspect and allows us to identify useful design guidelines for increasing the specificity of sound event recognition systems.

Improving Mix-And-Separate Training in Audio-Visual Sound Source Separation with an Object Prior

Quan Nguyen, Simone Frintrop, Timo Gerkmann, Mikko Lauri, Julius Richter

Auto-TLDR; Object-Prior: Learning the 1-to-1 correspondence between visual and audio signals by audio-visual sound source methods

The performance of an audio-visual sound source separation system is determined by its ability to separate audio sources given images of the sources and the audio mixture. The goal of this study is to investigate the ability to learn the mapping between the sounds and the images of instruments by audio-visual sound source separation methods based on the state-of-the-art PixelPlayer [1]. Theoretical and empirical analyses illustrate that the PixelPlayer is not properly trained to learn the 1-to-1 correspondence between visual and audio signals during its mix-and-separate training process. Based on the insights from this analysis, a weakly-supervised method called Object-Prior is proposed and evaluated on two audio-visual datasets. The experimental results show that the proposed Object-Prior method outperforms the PixelPlayer and other baselines in the audio-visual sound source separation task. It is also more robust against asynchronized data, where the frame and the audio do not come from the same video, and recognizes musical instruments based on their sound with higher accuracy than the PixelPlayer. This indicates that learning the 1-to-1 correspondence between visual and audio features of an instrument improves the effectiveness of audio-visual sound source separation.

ESResNet: Environmental Sound Classification Based on Visual Domain Models

Andrey Guzhov, Federico Raue, Jörn Hees, Andreas Dengel

Auto-TLDR; Environmental Sound Classification with Short-Time Fourier Transform Spectrograms

Computational Data Analysis for First Quantization Estimation on JPEG Double Compressed Images

DenseRecognition of Spoken Languages

Deep Universal Blind Image Denoising

Digit Recognition Applied to Reconstructed Audio Signals Using Deep Learning

Classification and Feature Selection Using a Primal-Dual Method and Projections on Structured Constraints

Wireless Localisation in WiFi Using Novel Deep Architectures

Adversarially Training for Audio Classifiers

Sketch-Based Community Detection Via Representative Node Sampling

Generic Merging of Structure from Motion Maps with a Low Memory Footprint

Learning Sparse Deep Neural Networks Using Efficient Structured Projections on Convex Constraints for Green AI

Hybrid Network for End-To-End Text-Independent Speaker Identification

Learning Sign-Constrained Support Vector Machines

The Effect of Spectrogram Reconstruction on Automatic Music Transcription: An Alternative Approach to Improve Transcription Accuracy

RNN Training along Locally Optimal Trajectories via Frank-Wolfe Algorithm

Temporal Pattern Detection in Time-Varying Graphical Models

Motion Segmentation with Pairwise Matches and Unknown Number of Motions

3D Pots Configuration System by Optimizing Over Geometric Constraints

Subspace Clustering for Action Recognition with Covariance Representations and Temporal Pruning

Computing Stable Resultant-Based Minimal Solvers by Hiding a Variable

Minimal Solvers for Indoor UAV Positioning

Spatial Bias in Vision-Based Voice Activity Detection

On the Use of Benford's Law to Detect GAN-Generated Images

A Globally Optimal Method for the PnP Problem with MRP Rotation Parameterization

Audio-Visual Speech Recognition Using a Two-Step Feature Fusion Strategy

3D Audio-Visual Speaker Tracking with a Novel Particle Filter

Total Estimation from RGB Video: On-Line Camera Self-Calibration, Non-Rigid Shape and Motion

Trainable Spectrally Initializable Matrix Transformations in Convolutional Neural Networks

D3Net: Joint Demosaicking, Deblurring and Deringing

Expectation-Maximization for Scheduling Problems in Satellite Communication

Improving Gravitational Wave Detection with 2D Convolutional Neural Networks

Electroencephalography Signal Processing Based on Textural Features for Monitoring the Driver’s State by a Brain-Computer Interface

Feature Engineering and Stacked Echo State Networks for Musical Onset Detection

DSPNet: Deep Learning-Enabled Blind Reduction of Speckle Noise

Detection of Calls from Smart Speaker Devices

Embedding Shared Low-Rank and Feature Correlation for Multi-View Data Analysis

Phase Retrieval Using Conditional Generative Adversarial Networks

Wasserstein k-Means with Sparse Simplex Projection

P2D: A Self-Supervised Method for Depth Estimation from Polarimetry

Mutual Alignment between Audiovisual Features for End-To-End Audiovisual Speech Recognition

2D Discrete Mirror Transform for Image Non-Linear Approximation

Directionally Paired Principal Component Analysis for Bivariate Estimation Problems

DR2S: Deep Regression with Region Selection for Camera Quality Evaluation

Webly Supervised Image-Text Embedding with Noisy Tag Refinement

Exploiting Non-Linear Redundancy for Neural Model Compression

Unsupervised Feature Learning for Event Data: Direct vs Inverse Problem Formulation