ICPR2020 Paper Browser

Paper download is intended for registered attendees only and is subject to the IEEE Copyright Policy. Any other use is strictly forbidden.

Ballroom Dance Recognition from Audio Recordings

Tomas Pavlin, Jan Cech, Jiri Matas

Auto-TLDR; A CNN-based approach to classify ballroom dances given audio recordings

We propose a CNN-based approach to classify audio recordings into ten genres of ballroom dance, five Latin and five Standard: Cha Cha Cha, Jive, Paso Doble, Rumba, Samba, Quickstep, Slow Foxtrot, Slow Waltz, Tango and Viennese Waltz. We compute a spectrogram of the audio signal and treat it as an image that is input to the CNN. Classification is performed independently on 5-second spectrogram segments in a sliding-window fashion, and the results are then aggregated. The method was tested on three datasets: the publicly available Extended Ballroom dataset collected by Marchand and Peeters (2016), and two YouTube datasets collected by us, one in studio quality and the other, more challenging, recorded on mobile phones. The method achieved accuracies of 93.9%, 96.7% and 89.8%, respectively. The method runs in real time. We implemented a web application to demonstrate the proposed method.
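The segment-then-aggregate scheme described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the per-segment results are aggregated, so averaging the class probabilities is an assumption here, as are the function names, the frame-based window and hop sizes, and the `classify_segment` callable, which stands in for the trained CNN.

```python
from typing import Callable, List, Sequence

# The ten dance classes named in the abstract.
DANCES = [
    "Cha Cha Cha", "Jive", "Paso Doble", "Rumba", "Samba",
    "Quickstep", "Slow Foxtrot", "Slow Waltz", "Tango", "Viennese Waltz",
]

Frame = Sequence[float]  # one spectrogram column (frequency bins)


def sliding_windows(n_frames: int, win: int, hop: int) -> List[range]:
    """Return frame ranges of length `win`, advanced by `hop` frames."""
    if win <= 0 or hop <= 0:
        raise ValueError("win and hop must be positive")
    return [range(s, s + win) for s in range(0, n_frames - win + 1, hop)]


def aggregate_mean(probs: Sequence[Sequence[float]]) -> List[float]:
    """Average per-segment class probabilities (assumed aggregation rule)."""
    if not probs:
        raise ValueError("no segments to aggregate")
    n = len(probs)
    return [sum(col) / n for col in zip(*probs)]


def classify_recording(
    spectrogram: Sequence[Frame],
    classify_segment: Callable[[List[Frame]], Sequence[float]],
    win: int,
    hop: int,
) -> str:
    """Classify each window independently, then aggregate the results."""
    windows = sliding_windows(len(spectrogram), win, hop)
    # `classify_segment` is a hypothetical stand-in for the CNN; it must
    # return one probability per entry of DANCES.
    probs = [classify_segment([spectrogram[i] for i in w]) for w in windows]
    mean = aggregate_mean(probs)
    best = max(range(len(mean)), key=mean.__getitem__)
    return DANCES[best]
```

Here `win` would be the number of spectrogram frames covering 5 seconds, which depends on the sampling rate and hop length used to compute the spectrogram. Other aggregation rules, such as majority voting over per-segment predictions, would fit the same structure.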

Similar papers

Which are the factors affecting the performance of audio surveillance systems?

Antonio Greco, Antonio Roberto, Alessia Saggese, Mario Vento

Auto-TLDR; Sound Event Recognition Using Convolutional Neural Networks and Visual Representations on MIVIA Audio Events

Sound event recognition systems are rapidly becoming part of our lives, since they can be profitably used in several vertical markets, ranging from audio security applications to scene classification and multi-modal analysis in social robotics. In recent years, a non-negligible part of the scientific community has started to apply Convolutional Neural Networks (CNNs) to image-based representations of the audio stream, owing to their successful adoption in almost all computer vision tasks. In this paper, we carry out a detailed benchmark of various widely used CNN architectures and visual representations on a popular dataset, namely the MIVIA Audio Events database. Our analysis is aimed at understanding how these factors affect sound event recognition performance, with a particular focus on the false positive rate, which is very relevant in audio surveillance solutions. In fact, although most of the proposed solutions achieve a high recognition rate, their capability of distinguishing the events of interest from the background is often not yet sufficient for real systems, which prevents their use in real applications. Our comprehensive experimental analysis investigates this aspect and identifies useful design guidelines for increasing the specificity of sound event recognition systems.

ESResNet: Environmental Sound Classification Based on Visual Domain Models

Andrey Guzhov, Federico Raue, Jörn Hees, Andreas Dengel

Auto-TLDR; Environmental Sound Classification with Short-Time Fourier Transform Spectrograms

The Effect of Spectrogram Reconstruction on Automatic Music Transcription: An Alternative Approach to Improve Transcription Accuracy

DenseRecognition of Spoken Languages

Audio-Based Near-Duplicate Video Retrieval with Audio Similarity Learning

Hybrid Network for End-To-End Text-Independent Speaker Identification

Digit Recognition Applied to Reconstructed Audio Signals Using Deep Learning

Influence of Event Duration on Automatic Wheeze Classification

Improving Mix-And-Separate Training in Audio-Visual Sound Source Separation with an Object Prior

One-Shot Learning for Acoustic Identification of Bird Species in Non-Stationary Environments

Feature Engineering and Stacked Echo State Networks for Musical Onset Detection

Mood Detection Analyzing Lyrics and Audio Signal Based on Deep Learning Architectures

Improving Gravitational Wave Detection with 2D Convolutional Neural Networks

The Application of Capsule Neural Network Based CNN for Speech Emotion Recognition

Adversarially Training for Audio Classifiers

S2I-Bird: Sound-To-Image Generation of Bird Species Using Generative Adversarial Networks

RMS-Net: Regression and Masking for Soccer Event Spotting

AttendAffectNet: Self-Attention Based Networks for Predicting Affective Responses from Movies

Video Face Manipulation Detection through Ensemble of CNNs

Detection of Calls from Smart Speaker Devices

Single-Modal Incremental Terrain Clustering from Self-Supervised Audio-Visual Feature Learning

Modulation Pattern Detection Using Complex Convolutions in Deep Learning

Exploring Spatial-Temporal Representations for fNIRS-based Intimacy Detection via an Attention-enhanced Cascade Convolutional Recurrent Neural Network

Epileptic Seizure Prediction: A Semi-Dilated Convolutional Neural Network Architecture

Audio-Visual Speech Recognition Using a Two-Step Feature Fusion Strategy

Deep Composer: A Hash-Based Duplicative Neural Network for Generating Multi-Instrument Songs

End-To-End Triplet Loss Based Emotion Embedding System for Speech Emotion Recognition

RWF-2000: An Open Large Scale Video Database for Violence Detection

Learning Visual Voice Activity Detection with an Automatically Annotated Dataset

Planar 3D Transfer Learning for End to End Unimodal MRI Unbalanced Data Segmentation

Spatial Bias in Vision-Based Voice Activity Detection

Person Recognition with HGR Maximal Correlation on Multimodal Data

A Systematic Investigation on Deep Architectures for Automatic Skin Lesions Classification

Audio-Video Detection of the Active Speaker in Meetings

Anticipating Activity from Multimodal Signals

Electroencephalography Signal Processing Based on Textural Features for Monitoring the Driver’s State by a Brain-Computer Interface

Toward Text-Independent Cross-Lingual Speaker Recognition Using English-Mandarin-Taiwanese Dataset

EasiECG: A Novel Inter-Patient Arrhythmia Classification Method Using ECG Waves

Enriching Video Captions with Contextual Text

Real Time Fencing Move Classification and Detection at Touch Time During a Fencing Match

A Systematic Investigation on End-To-End Deep Recognition of Grocery Products in the Wild

Exploiting the Logits: Joint Sign Language Recognition and Spell-Correction

A Grid-Based Representation for Human Action Recognition

Wireless Localisation in WiFi Using Novel Deep Architectures

Trainable Spectrally Initializable Matrix Transformations in Convolutional Neural Networks

Cross-People Mobile-Phone Based Airwriting Character Recognition

Confidence Calibration for Deep Renal Biopsy Immunofluorescence Image Classification

Automatic Annotation of Corpora for Emotion Recognition through Facial Expressions Analysis