ICPR2020 Paper Browser

Paper download is intended for registered attendees only, and is subject to the IEEE Copyright Policy. Any other use is strictly forbidden.

Which are the factors affecting the performance of audio surveillance systems?

Antonio Greco, Antonio Roberto, Alessia Saggese, Mario Vento

Auto-TLDR; Sound Event Recognition Using Convolutional Neural Networks and Visual Representations on MIVIA Audio Events

Sound event recognition systems are rapidly becoming part of our lives, since they can be profitably used in several vertical markets, ranging from audio security applications to scene classification and multi-modal analysis in social robotics. In recent years, a considerable part of the scientific community has started to apply Convolutional Neural Networks (CNNs) to image-based representations of the audio stream, owing to their successful adoption in almost all computer vision tasks. In this paper, we carry out a detailed benchmark of various widely used CNN architectures and visual representations on a popular dataset, namely the MIVIA Audio Events database. Our analysis aims to understand how these factors affect sound event recognition performance, with a particular focus on the false positive rate, which is highly relevant in audio surveillance solutions. In fact, although most of the proposed solutions achieve a high recognition rate, their ability to distinguish events of interest from the background is often not yet sufficient, which prevents their use in real applications. Our comprehensive experimental analysis investigates this aspect and allows us to identify useful design guidelines for increasing the specificity of sound event recognition systems.
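
The abstract contrasts recognition rate with false positive rate and specificity. As a rough illustration of how these quantities relate, here is a minimal Python sketch. It assumes per-segment labels in which one class, named "background" here, stands for everything that is not an event of interest. The function name, the label names, and these metric definitions are assumptions made for illustration and may differ from the evaluation protocol used in the paper.

```python
def surveillance_metrics(y_true, y_pred, background="background"):
    """Compute recognition rate, false positive rate and specificity.

    Assumed definitions (illustrative, not taken from the paper):
    - recognition rate: fraction of true event segments given the correct class
    - false positive rate: fraction of background segments reported as any event
    - specificity: 1 - false positive rate
    """
    pairs = list(zip(y_true, y_pred))
    events = [(t, p) for t, p in pairs if t != background]
    backgrounds = [(t, p) for t, p in pairs if t == background]

    # Guard against empty groups so the function never divides by zero.
    recognition_rate = (
        sum(t == p for t, p in events) / len(events) if events else 0.0
    )
    false_positive_rate = (
        sum(p != background for _, p in backgrounds) / len(backgrounds)
        if backgrounds
        else 0.0
    )
    return {
        "recognition_rate": recognition_rate,
        "false_positive_rate": false_positive_rate,
        "specificity": 1.0 - false_positive_rate,
    }


# Hypothetical labels, made up for this example only.
y_true = ["scream", "glass", "background", "background", "gunshot", "background"]
y_pred = ["scream", "background", "background", "glass", "gunshot", "background"]
metrics = surveillance_metrics(y_true, y_pred)
```

In this toy input, two of the three events are recognized and one of the three background segments triggers a false alarm, so a system can look accurate on events while still raising alarms on background audio.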

Similar papers

ESResNet: Environmental Sound Classification Based on Visual Domain Models

Andrey Guzhov, Federico Raue, Jörn Hees, Andreas Dengel

Auto-TLDR; Environmental Sound Classification with Short-Time Fourier Transform Spectrograms

Ballroom Dance Recognition from Audio Recordings

DenseRecognition of Spoken Languages

Influence of Event Duration on Automatic Wheeze Classification

Hybrid Network for End-To-End Text-Independent Speaker Identification

Audio-Based Near-Duplicate Video Retrieval with Audio Similarity Learning

Improving Gravitational Wave Detection with 2D Convolutional Neural Networks

Adversarially Training for Audio Classifiers

The Effect of Spectrogram Reconstruction on Automatic Music Transcription: An Alternative Approach to Improve Transcription Accuracy

Feature Engineering and Stacked Echo State Networks for Musical Onset Detection

A Systematic Investigation on Deep Architectures for Automatic Skin Lesions Classification

One-Shot Learning for Acoustic Identification of Bird Species in Non-Stationary Environments

Fine-Tuning Convolutional Neural Networks: A Comprehensive Guide and Benchmark Analysis for Glaucoma Screening

Trainable Spectrally Initializable Matrix Transformations in Convolutional Neural Networks

Detection of Calls from Smart Speaker Devices

Modulation Pattern Detection Using Complex Convolutions in Deep Learning

Electroencephalography Signal Processing Based on Textural Features for Monitoring the Driver’s State by a Brain-Computer Interface

AttendAffectNet: Self-Attention Based Networks for Predicting Affective Responses from Movies

The Application of Capsule Neural Network Based CNN for Speech Emotion Recognition

Digit Recognition Applied to Reconstructed Audio Signals Using Deep Learning

Detecting Marine Species in Echograms Via Traditional, Hybrid, and Deep Learning Frameworks

Weight Estimation from an RGB-D Camera in Top-View Configuration

RMS-Net: Regression and Masking for Soccer Event Spotting

S2I-Bird: Sound-To-Image Generation of Bird Species Using Generative Adversarial Networks

Audio-Video Detection of the Active Speaker in Meetings

A Comparison of Neural Network Approaches for Melanoma Classification

End-To-End Triplet Loss Based Emotion Embedding System for Speech Emotion Recognition

Deep Transfer Learning for Alzheimer’s Disease Detection

A Systematic Investigation on End-To-End Deep Recognition of Grocery Products in the Wild

Improving Mix-And-Separate Training in Audio-Visual Sound Source Separation with an Object Prior

Are Multiple Cross-Correlation Identities Better Than Just Two? Improving the Estimate of Time Differences-Of-Arrivals from Blind Audio Signals

The Color Out of Space: Learning Self-Supervised Representations for Earth Observation Imagery

Multimodal Side-Tuning for Document Classification

On the Use of Benford's Law to Detect GAN-Generated Images

Deep Learning on Active Sonar Data Using Bayesian Optimization for Hyperparameter Tuning

From Early Biological Models to CNNs: Do They Look Where Humans Look?

Single-Modal Incremental Terrain Clustering from Self-Supervised Audio-Visual Feature Learning

Audio-Visual Speech Recognition Using a Two-Step Feature Fusion Strategy

Bridging the Gap between Natural and Medical Images through Deep Colorization

Investigating and Exploiting Image Resolution for Transfer Learning-Based Skin Lesion Classification

Lightweight Low-Resolution Face Recognition for Surveillance Applications

Temporal Binary Representation for Event-Based Action Recognition

Enhancing Deep Semantic Segmentation of RGB-D Data with Entangled Forests

Learning Visual Voice Activity Detection with an Automatically Annotated Dataset

Fourier Domain Pruning of MobileNet-V2 with Application to Video Based Wildfire Detection

Video Face Manipulation Detection through Ensemble of CNNs

Spatial Bias in Vision-Based Voice Activity Detection

Mood Detection Analyzing Lyrics and Audio Signal Based on Deep Learning Architectures