ICPR2020 Paper Browser

Paper download is intended for registered attendees only, and is subjected to the IEEE Copyright Policy. Any other use is strongly forbidden.

Audio-Based Near-Duplicate Video Retrieval with Audio Similarity Learning

Pavlos Avgoustinakis, Giorgos Kordopatis-Zilos, Symeon Papadopoulos, Andreas L. Symeonidis, Ioannis Kompatsiaris

Auto-TLDR; AuSiL: Audio Similarity Learning for Near-duplicate Video Retrieval

Abstract Slides Poster

In this work, we address the problem of audio-based near-duplicate video retrieval. We propose the Audio Similarity Learning (AuSiL) approach that effectively captures temporal patterns of audio similarity between video pairs. For the robust similarity calculation between two videos, we first extract representative audio-based video descriptors by leveraging transfer learning based on a Convolutional Neural Network (CNN) trained on a large scale dataset of audio events, and then we calculate the similarity matrix derived from the pairwise similarity of these descriptors. The similarity matrix is subsequently fed to a CNN network that captures the temporal structures existing within its content. We train our network following a triplet generation process and optimizing the triplet loss function. To evaluate the effectiveness of the proposed approach, we have manually annotated two publicly available video datasets based on the audio duplicity between their videos. The proposed approach achieves very competitive results compared to three state-of-the-art methods. Also, unlike the competing methods, it is very robust for the retrieval of audio duplicates generated with speed transformations.

Similar papers

Attention-Based Deep Metric Learning for Near-Duplicate Video Retrieval

Kuan-Hsun Wang, Chia Chun Cheng, Yi-Ling Chen, Yale Song, Shang-Hong Lai

Auto-TLDR; Attention-based Deep Metric Learning for Near-duplicate Video Retrieval

Abstract Slides Similar

Near-duplicate video retrieval (NDVR) is an important and challenging problem due to the increasing amount of videos uploaded to the Internet. In this paper, we propose an attention-based deep metric learning method for NDVR. Our method is based on well-established principles: We leverage two-stream networks to combine RGB and optical flow features, and incorporate an attention module to effectively deal with distractor frames commonly observed in near duplicate videos. We further aggregate the features corresponding to multiple video segments to enhance the discriminative power. The whole system is trained using a deep metric learning objective with a Siamese architecture. Our experiments show that the attention module helps eliminate redundant and noisy frames, while focusing on visually relevant frames for solving NVDR. We evaluate our approach on recent large-scale NDVR datasets, CC_WEB_VIDEO, VCDB, FIVR and SVD. To demonstrate the generalization ability of our approach, we report results in both within- and cross-dataset settings, and show that the proposed method significantly outperforms state-of-the-art approaches.

ESResNet: Environmental Sound Classification Based on Visual Domain Models

Andrey Guzhov, Federico Raue, Jörn Hees, Andreas Dengel

Auto-TLDR; Environmental Sound Classification with Short-Time Fourier Transform Spectrograms

Audio-Based Near-Duplicate Video Retrieval with Audio Similarity Learning

Similar papers

Attention-Based Deep Metric Learning for Near-Duplicate Video Retrieval

ESResNet: Environmental Sound Classification Based on Visual Domain Models

Ballroom Dance Recognition from Audio Recordings

Exploiting Local Indexing and Deep Feature Confidence Scores for Fast Image-To-Video Search

Which are the factors affecting the performance of audio surveillance systems?

Unsupervised Co-Segmentation for Athlete Movements and Live Commentaries Using Crossmodal Temporal Proximity

DenseRecognition of Spoken Languages

AttendAffectNet: Self-Attention Based Networks for Predicting Affective Responses from Movies

Video Face Manipulation Detection through Ensemble of CNNs

The Effect of Spectrogram Reconstruction on Automatic Music Transcription: An Alternative Approach to Improve Transcription Accuracy

Hybrid Network for End-To-End Text-Independent Speaker Identification

Rotation Invariant Aerial Image Retrieval with Group Convolutional Metric Learning

Text Synopsis Generation for Egocentric Videos

Generalized Local Attention Pooling for Deep Metric Learning

RMS-Net: Regression and Masking for Soccer Event Spotting

One-Shot Learning for Acoustic Identification of Bird Species in Non-Stationary Environments

End-To-End Triplet Loss Based Emotion Embedding System for Speech Emotion Recognition

Multi-Level Deep Learning Vehicle Re-Identification Using Ranked-Based Loss Functions

Supporting Skin Lesion Diagnosis with Content-Based Image Retrieval

Feature Engineering and Stacked Echo State Networks for Musical Onset Detection

Not 3D Re-ID: Simple Single Stream 2D Convolution for Robust Video Re-Identification

Hierarchical Multimodal Attention for Deep Video Summarization

Adaptive L2 Regularization in Person Re-Identification

Learning Visual Voice Activity Detection with an Automatically Annotated Dataset

Mood Detection Analyzing Lyrics and Audio Signal Based on Deep Learning Architectures

Learning Neural Textual Representations for Citation Recommendation

Hierarchical Deep Hashing for Fast Large Scale Image Retrieval

Deep Composer: A Hash-Based Duplicative Neural Network for Generating Multi-Instrument Songs

On Identification and Retrieval of Near-Duplicate Biological Images: A New Dataset and Protocol

Audio-Visual Speech Recognition Using a Two-Step Feature Fusion Strategy

Automated Whiteboard Lecture Video Summarization by Content Region Detection and Representation

Influence of Event Duration on Automatic Wheeze Classification

Multi-Scale Keypoint Matching

Progressive Learning Algorithm for Efficient Person Re-Identification

Leveraging Quadratic Spherical Mutual Information Hashing for Fast Image Retrieval

Multi-Scale 2D Representation Learning for Weakly-Supervised Moment Retrieval

Adversarially Training for Audio Classifiers

Enriching Video Captions with Contextual Text

Improving Mix-And-Separate Training in Audio-Visual Sound Source Separation with an Object Prior

Comparison of Deep Learning and Hand Crafted Features for Mining Simulation Data

Relevance Detection in Cataract Surgery Videos by Spatio-Temporal Action Localization

Aggregating Object Features Based on Attention Weights for Fine-Grained Image Retrieval

Equation Attention Relationship Network (EARN) : A Geometric Deep Metric Framework for Learning Similar Math Expression Embedding

Top-DB-Net: Top DropBlock for Activation Enhancement in Person Re-Identification

RWF-2000: An Open Large Scale Video Database for Violence Detection

Unsupervised Sound Source Localization From Audio-Image Pairs Using Input Gradient Map

Context Matters: Self-Attention for Sign Language Recognition

More Correlations Better Performance: Fully Associative Networks for Multi-Label Image Classification