ICPR2020 Paper Browser

Paper download is intended for registered attendees only and is subject to the IEEE Copyright Policy. Any other use is strictly forbidden.

ESResNet: Environmental Sound Classification Based on Visual Domain Models

Andrey Guzhov, Federico Raue, Jörn Hees, Andreas Dengel

Auto-TLDR; Environmental Sound Classification with Short-Time Fourier Transform Spectrograms

Environmental Sound Classification (ESC) is an active research area in the audio domain and has seen a lot of progress in recent years. However, many of the existing approaches achieve high accuracy by relying on domain-specific features and architectures, making it harder to benefit from advances in other fields (e.g., the image domain). Additionally, some of the past successes have been attributed to discrepancies in how results are evaluated (i.e., on unofficial splits of the UrbanSound8K (US8K) dataset), distorting the overall progression of the field. The contribution of this paper is twofold. First, we present a model that is inherently compatible with mono and stereo sound inputs. Our model is based on simple log-power Short-Time Fourier Transform (STFT) spectrograms and combines them with several well-known approaches from the image domain (i.e., ResNet, Siamese-like networks and attention). We investigate the influence of cross-domain pre-training and architectural changes, and evaluate our model on standard datasets. We find that our model outperforms all previously known approaches in a fair comparison by achieving accuracies of 97.0 % (ESC-10), 91.5 % (ESC-50) and 84.2 % / 85.4 % (US8K mono / stereo). Second, we provide a comprehensive overview of the current state of the field by distinguishing whether previously reported results on the US8K dataset were obtained on official or unofficial splits. For better reproducibility, our code (including any re-implementations) is made available.
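To make the input representation concrete, the following is a minimal sketch of a log-power STFT spectrogram computed per channel, so that mono and stereo inputs go through the same path. It is not the authors' implementation: the function names, the Hann window, and the values of `n_fft`, `hop` and `eps` are assumptions chosen for illustration, since the abstract does not state them.

```python
import numpy as np


def log_power_stft(signal, n_fft=1024, hop=256, eps=1e-10):
    """Return a log-power STFT spectrogram of shape (n_fft // 2 + 1, n_frames).

    n_fft, hop and eps are illustrative defaults, not values from the paper.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if signal.ndim != 1 or signal.size < n_fft:
        raise ValueError("expected a 1-D signal with at least n_fft samples")
    window = np.hanning(n_fft)  # assumed window type
    n_frames = 1 + (signal.size - n_fft) // hop
    # Slice the signal into overlapping frames and apply the window.
    frames = np.stack([signal[i * hop:i * hop + n_fft] for i in range(n_frames)])
    spectrum = np.fft.rfft(frames * window, axis=1)
    power = np.abs(spectrum) ** 2
    # Convert power to decibels; eps avoids log(0) on silent frames.
    return 10.0 * np.log10(power + eps).T


def spectrogram_per_channel(audio, **kwargs):
    """Compute one spectrogram per channel for mono (n,) or stereo (2, n) audio."""
    audio = np.atleast_2d(np.asarray(audio, dtype=np.float64))
    return np.stack([log_power_stft(channel, **kwargs) for channel in audio])
```

Under these assumptions, a mono signal yields an array of shape `(1, freq_bins, frames)` and a stereo signal yields `(2, freq_bins, frames)`, which is one simple way to feed both into an image-style network.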

Similar papers

Which are the factors affecting the performance of audio surveillance systems?

Antonio Greco, Antonio Roberto, Alessia Saggese, Mario Vento

Auto-TLDR; Sound Event Recognition Using Convolutional Neural Networks and Visual Representations on MIVIA Audio Events

Sound event recognition systems are rapidly becoming part of our lives, since they can be profitably used in several vertical markets, ranging from audio security applications to scene classification and multi-modal analysis in social robotics. In recent years, a non-negligible part of the scientific community has started to apply Convolutional Neural Networks (CNNs) to image-based representations of the audio stream, due to their successful adoption in almost all computer vision tasks. In this paper, we carry out a detailed benchmark of various widely used CNN architectures and visual representations on a popular dataset, namely the MIVIA Audio Events database. Our analysis is aimed at understanding how these factors affect the sound event recognition performance, with a particular focus on the false positive rate, which is highly relevant in audio surveillance solutions. In fact, although most of the proposed solutions achieve a high recognition rate, their capability of distinguishing the events of interest from the background is often not yet sufficient, which prevents their use in real applications. Our comprehensive experimental analysis investigates this aspect and allows us to identify useful design guidelines for increasing the specificity of sound event recognition systems.
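The relationship between false positive rate and specificity mentioned above can be illustrated with a short sketch. It assumes a simple per-sample setting in which one label marks background audio and every other label is an event of interest; the function name and the `"background"` label are hypothetical, and the evaluation protocol actually used with the MIVIA Audio Events database may differ.

```python
def background_rejection(y_true, y_pred, background="background"):
    """Return (false_positive_rate, specificity) measured on background samples."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Predictions made on samples whose true label is background.
    on_background = [p for t, p in zip(y_true, y_pred) if t == background]
    if not on_background:
        raise ValueError("no background samples to evaluate")
    # A false positive is a background sample reported as any event.
    false_positives = sum(p != background for p in on_background)
    fpr = false_positives / len(on_background)
    return fpr, 1.0 - fpr
```

With four background samples of which one is reported as an event, this gives a false positive rate of 0.25 and a specificity of 0.75, regardless of how well the true events themselves are recognized.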

Ballroom Dance Recognition from Audio Recordings

Tomas Pavlin, Jan Cech, Jiri Matas

Auto-TLDR; A CNN-based approach to classify ballroom dances given audio recordings

Mood Detection Analyzing Lyrics and Audio Signal Based on Deep Learning Architectures

Feature Engineering and Stacked Echo State Networks for Musical Onset Detection

Hybrid Network for End-To-End Text-Independent Speaker Identification

Trainable Spectrally Initializable Matrix Transformations in Convolutional Neural Networks

Audio-Based Near-Duplicate Video Retrieval with Audio Similarity Learning

One-Shot Learning for Acoustic Identification of Bird Species in Non-Stationary Environments

DenseRecognition of Spoken Languages

The Effect of Spectrogram Reconstruction on Automatic Music Transcription: An Alternative Approach to Improve Transcription Accuracy

Adversarially Training for Audio Classifiers

AttendAffectNet: Self-Attention Based Networks for Predicting Affective Responses from Movies

Influence of Event Duration on Automatic Wheeze Classification

FatNet: A Feature-Attentive Network for 3D Point Cloud Processing

3D Attention Mechanism for Fine-Grained Classification of Table Tennis Strokes Using a Twin Spatio-Temporal Convolutional Neural Networks

The Application of Capsule Neural Network Based CNN for Speech Emotion Recognition

Improving Gravitational Wave Detection with 2D Convolutional Neural Networks

Improving Mix-And-Separate Training in Audio-Visual Sound Source Separation with an Object Prior

The Color Out of Space: Learning Self-Supervised Representations for Earth Observation Imagery

Video Face Manipulation Detection through Ensemble of CNNs

Single-Modal Incremental Terrain Clustering from Self-Supervised Audio-Visual Feature Learning

Dynamically Mitigating Data Discrepancy with Balanced Focal Loss for Replay Attack Detection

Translation Resilient Opportunistic WiFi Sensing

Detection of Calls from Smart Speaker Devices

Audio-Visual Speech Recognition Using a Two-Step Feature Fusion Strategy

Spatial Bias in Vision-Based Voice Activity Detection

Planar 3D Transfer Learning for End to End Unimodal MRI Unbalanced Data Segmentation

End-To-End Triplet Loss Based Emotion Embedding System for Speech Emotion Recognition

A Close Look at Deep Learning with Small Data

Enhancing Semantic Segmentation of Aerial Images with Inhibitory Neurons

ResNet-Like Architecture with Low Hardware Requirements

Directional Graph Networks with Hard Weight Assignments

Fast and Accurate Real-Time Semantic Segmentation with Dilated Asymmetric Convolutions

S2I-Bird: Sound-To-Image Generation of Bird Species Using Generative Adversarial Networks

A Systematic Investigation on Deep Architectures for Automatic Skin Lesions Classification

Attention Pyramid Module for Scene Recognition

SECI-GAN: Semantic and Edge Completion for Dynamic Objects Removal

Building Computationally Efficient and Well-Generalizing Person Re-Identification Models with Metric Learning

Multimodal Side-Tuning for Document Classification

Domain Siamese CNNs for Sparse Multispectral Disparity Estimation

Personalized Models in Human Activity Recognition Using Deep Learning

Anticipating Activity from Multimodal Signals

Exploiting the Logits: Joint Sign Language Recognition and Spell-Correction

GazeMAE: General Representations of Eye Movements Using a Micro-Macro Autoencoder

Automatic Semantic Segmentation of Structural Elements related to the Spinal Cord in the Lumbar Region by Using Convolutional Neural Networks

The DeepHealth Toolkit: A Unified Framework to Boost Biomedical Applications

Digit Recognition Applied to Reconstructed Audio Signals Using Deep Learning

MEG: Multi-Evidence GNN for Multimodal Semantic Forensics