ICPR2020 Paper Browser

Paper download is intended for registered attendees only, and is subject to the IEEE Copyright Policy. Any other use is strictly forbidden.

Video Representation Fusion Network For Multi-Label Movie Genre Classification

Tianyu Bi, Dmitri Jarnikov, Johan Lukkien

Auto-TLDR; A Video Representation Fusion Network for Movie Genre Classification

In this paper, we introduce a Video Representation Fusion Network (VRFN) for movie genre classification. Unlike previous work, which uses frame-level features for movie genre classification, our approach uses a video classification architecture to create video-level features from groups of frames and fuses these features temporally to learn long-term spatiotemporal information for the movie genre classification task. We use a pre-trained I3D model to generate intermediate video representations and connect it to a C3D-LSTM model for feature fusion and movie genre classification. The LMTD-9 dataset, which contains 4007 trailers multi-labeled with 9 movie genres, is used to train and evaluate the model. The experimental results demonstrate that learning long-term temporal dependencies by fusing video representations improves performance in movie genre classification. Our best model outperforms state-of-the-art methods by 3.4% in macro-averaged AUPRC.
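
For readers unfamiliar with the reported metric, the sketch below shows one common way to compute a macro-averaged AUPRC for multi-label predictions. It is an illustration only, not the authors' evaluation code: it assumes the area under the precision-recall curve is estimated by average precision, and the function names `average_precision` and `macro_auprc` are hypothetical.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise estimate of the area under the precision-recall curve
    for a single genre (assumption: AUPRC is estimated by average precision)."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("genre has no positive examples")
    # Rank trailers from most to least confident; ties keep input order.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision at the rank of each positive
    return total / n_pos


def macro_auprc(y_true: Sequence[Sequence[int]],
                y_score: Sequence[Sequence[float]]) -> float:
    """Unweighted mean of the per-genre average precision.

    y_true[i][g] is 1 if trailer i carries genre g, else 0;
    y_score[i][g] is the predicted score for that pair.
    """
    n_genres = len(y_true[0])
    per_genre = [
        average_precision([row[g] for row in y_true],
                          [row[g] for row in y_score])
        for g in range(n_genres)
    ]
    return sum(per_genre) / n_genres
```

Because the macro average weights every genre equally, rare genres influence the score as much as frequent ones, which matters for an imbalanced multi-label dataset. For a toy input of four trailers and two genres with `y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]` and `y_score = [[0.9, 0.2], [0.8, 0.7], [0.4, 0.6], [0.1, 0.3]]`, the per-genre values are 5/6 and 1, so `macro_auprc` returns 11/12.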

Similar papers

RWF-2000: An Open Large Scale Video Database for Violence Detection

Ming Cheng, Kunjing Cai, Ming Li

Auto-TLDR; Flow Gated Network for Violence Detection in Surveillance Cameras

Pose-Based Body Language Recognition for Emotion and Psychiatric Symptom Interpretation

TinyVIRAT: Low-Resolution Video Action Recognition

Late Fusion of Bayesian and Convolutional Models for Action Recognition

Not 3D Re-ID: Simple Single Stream 2D Convolution for Robust Video Re-Identification

AttendAffectNet: Self-Attention Based Networks for Predicting Affective Responses from Movies

Gabriella: An Online System for Real-Time Activity Detection in Untrimmed Security Videos

Learnable Higher-Order Representation for Action Recognition

Feature Pyramid Hierarchies for Multi-Scale Temporal Action Detection

Vision-Based Multi-Modal Framework for Action Recognition

You Ought to Look Around: Precise, Large Span Action Detection

More Correlations Better Performance: Fully Associative Networks for Multi-Label Image Classification

MFI: Multi-Range Feature Interchange for Video Action Recognition

Enriching Video Captions with Contextual Text

RMS-Net: Regression and Masking for Soccer Event Spotting

MixTConv: Mixed Temporal Convolutional Kernels for Efficient Action Recognition

What and How? Jointly Forecasting Human Action and Pose

Video Summarization with a Dual Attention Capsule Network

Single View Learning in Action Recognition

Extracting Action Hierarchies from Action Labels and their Use in Deep Action Recognition

Temporally Coherent Embeddings for Self-Supervised Video Representation Learning

ActionSpotter: Deep Reinforcement Learning Framework for Temporal Action Spotting in Videos

A Grid-Based Representation for Human Action Recognition

Image Sequence Based Cyclist Action Recognition Using Multi-Stream 3D Convolution

Precise Temporal Action Localization with Quantified Temporal Structure of Actions

Developing Motion Code Embedding for Action Recognition in Videos

Space-Time Domain Tensor Neural Networks: An Application on Human Pose Classification

Modeling Long-Term Interactions to Enhance Action Recognition

SCA Net: Sparse Channel Attention Module for Action Recognition

Audio-Video Detection of the Active Speaker in Meetings

Attention-Based Deep Metric Learning for Near-Duplicate Video Retrieval

Self-Supervised Joint Encoding of Motion and Appearance for First Person Action Recognition

Motion Complementary Network for Efficient Action Recognition

Context Visual Information-Based Deliberation Network for Video Captioning

3D Attention Mechanism for Fine-Grained Classification of Table Tennis Strokes Using a Twin Spatio-Temporal Convolutional Neural Networks

Applying (3+2+1)D Residual Neural Network with Frame Selection for Hong Kong Sign Language Recognition

SAT-Net: Self-Attention and Temporal Fusion for Facial Action Unit Detection

Attention-Driven Body Pose Encoding for Human Activity Recognition

Audio-Based Near-Duplicate Video Retrieval with Audio Similarity Learning

Multi-Attribute Learning with Highly Imbalanced Data

Depth Videos for the Classification of Micro-Expressions

Hierarchical Multimodal Attention for Deep Video Summarization

Attentive Visual Semantic Specialized Network for Video Captioning

Feature-Supervised Action Modality Transfer

Continuous Sign Language Recognition with Iterative Spatiotemporal Fine-Tuning

Relevance Detection in Cataract Surgery Videos by Spatio-Temporal Action Localization

Global Feature Aggregation for Accident Anticipation

Towards Tackling Multi-Label Imbalances in Remote Sensing Imagery