ICPR2020 Paper Browser

Paper download is intended for registered attendees only, and is subject to the IEEE Copyright Policy. Any other use is strictly forbidden.

The Effect of Spectrogram Reconstruction on Automatic Music Transcription: An Alternative Approach to Improve Transcription Accuracy

Kin Wai Cheuk, Yin-Jyun Luo, Emmanouil Benetos, Dorien Herremans

Auto-TLDR; Exploring the effect of spectrogram reconstruction loss on automatic music transcription

Most state-of-the-art automatic music transcription (AMT) models break down the main transcription task into sub-tasks such as onset prediction and offset prediction, and train them with onset and offset labels. These predictions are then concatenated and used as the input to train another model with the pitch labels to obtain the final transcription. We attempt to use only the pitch labels (together with a spectrogram reconstruction loss) and explore how far this model can go without introducing supervised sub-tasks. In this paper, we do not aim to achieve state-of-the-art transcription accuracy; instead, we explore the effect that spectrogram reconstruction has on our AMT model. Our proposed model consists of two U-nets: the first U-net transcribes the spectrogram into a posteriorgram, and the second U-net transforms the posteriorgram back into a spectrogram. A reconstruction loss is applied between the original spectrogram and the reconstructed spectrogram to constrain the second U-net to focus only on reconstruction. We train our model on different datasets including MAPS, MAESTRO, and MusicNet. Our experiments show that adding the reconstruction loss can generally improve the note-level transcription accuracy when compared to the same model without the reconstruction part. Moreover, it can also boost the frame-level precision to be higher than that of the state-of-the-art models. The feature maps learned by our U-net contain grid-like structures (not present in the baseline model), which implies that in the presence of the reconstruction loss, the model is probably trying to count along both the time and frequency axes, resulting in a higher note-level transcription accuracy.
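
As a rough illustration of the training objective described in the abstract, the sketch below combines a transcription loss on the posteriorgram with a reconstruction loss on the spectrogram. It is not the authors' implementation: the abstract does not name the loss functions or their weighting, so the binary cross-entropy, the mean squared error, the `weight` parameter, and the names `transcriber`, `reconstructor`, and `combined_loss` are all assumptions made for this example. The two U-nets are replaced by arbitrary callables.

```python
import math
from typing import Callable, List

Matrix = List[List[float]]  # rows are time frames, columns are bins


def mse(a: Matrix, b: Matrix) -> float:
    """Mean squared error between two equally shaped matrices."""
    n = sum(len(row) for row in a)
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n


def bce(pred: Matrix, target: Matrix, eps: float = 1e-7) -> float:
    """Mean binary cross-entropy between probabilities and 0/1 labels."""
    n = sum(len(row) for row in pred)
    total = 0.0
    for rp, rt in zip(pred, target):
        for p, t in zip(rp, rt):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / n


def combined_loss(
    spec: Matrix,
    labels: Matrix,
    transcriber: Callable[[Matrix], Matrix],    # stand-in for the first U-net
    reconstructor: Callable[[Matrix], Matrix],  # stand-in for the second U-net
    weight: float = 1.0,                        # assumed weighting, not from the paper
) -> float:
    """Transcription loss plus a weighted spectrogram reconstruction loss."""
    posteriorgram = transcriber(spec)              # spectrogram -> posteriorgram
    reconstruction = reconstructor(posteriorgram)  # posteriorgram -> spectrogram
    return bce(posteriorgram, labels) + weight * mse(reconstruction, spec)
```

With toy stand-ins such as `transcriber = lambda s: s` and `reconstructor = lambda p: p`, the reconstruction term is zero and only the transcription term remains; a reconstructor that discards information raises the total loss. Setting `weight = 0.0` corresponds to a model trained without the reconstruction part.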

Similar papers

Ballroom Dance Recognition from Audio Recordings

Tomas Pavlin, Jan Cech, Jiri Matas

Auto-TLDR; A CNN-based approach to classify ballroom dances given audio recordings

ESResNet: Environmental Sound Classification Based on Visual Domain Models

Which are the factors affecting the performance of audio surveillance systems?

DenseRecognition of Spoken Languages

AttendAffectNet: Self-Attention Based Networks for Predicting Affective Responses from Movies

Hybrid Network for End-To-End Text-Independent Speaker Identification

Audio-Based Near-Duplicate Video Retrieval with Audio Similarity Learning

Feature Engineering and Stacked Echo State Networks for Musical Onset Detection

Digit Recognition Applied to Reconstructed Audio Signals Using Deep Learning

Mood Detection Analyzing Lyrics and Audio Signal Based on Deep Learning Architectures

PIN: A Novel Parallel Interactive Network for Spoken Language Understanding

Heuristics for Evaluation of AI Generated Music

Influence of Event Duration on Automatic Wheeze Classification

Improving Mix-And-Separate Training in Audio-Visual Sound Source Separation with an Object Prior

Let's Play Music: Audio-Driven Performance Video Generation

End-To-End Triplet Loss Based Emotion Embedding System for Speech Emotion Recognition

Adversarially Training for Audio Classifiers

One-Shot Learning for Acoustic Identification of Bird Species in Non-Stationary Environments

Relevance Detection in Cataract Surgery Videos by Spatio-Temporal Action Localization

Talking Face Generation Via Learning Semantic and Temporal Synchronous Landmarks

Toward Text-Independent Cross-Lingual Speaker Recognition Using English-Mandarin-Taiwanese Dataset

Segmenting Messy Text: Detecting Boundaries in Text Derived from Historical Newspaper Images

Deep Composer: A Hash-Based Duplicative Neural Network for Generating Multi-Instrument Songs

Single-Modal Incremental Terrain Clustering from Self-Supervised Audio-Visual Feature Learning

Spatial Bias in Vision-Based Voice Activity Detection

The Application of Capsule Neural Network Based CNN for Speech Emotion Recognition

S2I-Bird: Sound-To-Image Generation of Bird Species Using Generative Adversarial Networks

Mutual Alignment between Audiovisual Features for End-To-End Audiovisual Speech Recognition

Unsupervised Co-Segmentation for Athlete Movements and Live Commentaries Using Crossmodal Temporal Proximity

RMS-Net: Regression and Masking for Soccer Event Spotting

What and How? Jointly Forecasting Human Action and Pose

Revisiting Sequence-To-Sequence Video Object Segmentation with Multi-Task Loss and Skip-Memory

Video Representation Fusion Network For Multi-Label Movie Genre Classification

Video Face Manipulation Detection through Ensemble of CNNs

Audio-Video Detection of the Active Speaker in Meetings

Text Synopsis Generation for Egocentric Videos

Gabriella: An Online System for Real-Time Activity Detection in Untrimmed Security Videos

Trainable Spectrally Initializable Matrix Transformations in Convolutional Neural Networks

Context Matters: Self-Attention for Sign Language Recognition

Deep Convolutional Embedding for Digitized Painting Clustering

Visual Oriented Encoder: Integrating Multimodal and Multi-Scale Contexts for Video Captioning

Self-Supervised Learning of Dynamic Representations for Static Images

Confidence Calibration for Deep Renal Biopsy Immunofluorescence Image Classification

Machine-Learned Regularization and Polygonization of Building Segmentation Masks

Aerial Road Segmentation in the Presence of Topological Label Noise

Motion-Supervised Co-Part Segmentation

CAggNet: Crossing Aggregation Network for Medical Image Segmentation

Are Multiple Cross-Correlation Identities Better Than Just Two? Improving the Estimate of Time Differences-Of-Arrivals from Blind Audio Signals