ICPR2020 Paper Browser

Paper download is intended for registered attendees only, and is subject to the IEEE Copyright Policy. Any other use is strongly forbidden.

Video Summarization with a Dual Attention Capsule Network

Hao Fu, Hongxing Wang, Jianyu Yang

Auto-TLDR; Dual Self-Attention Capsule Network for Video Summarization

In this paper, we address the problem of video summarization, which aims at selecting a subset of video frames as a summary that represents the original video contents compactly and completely. To this end, we propose a simple but effective supervised approach with a dual attention capsule network. Unlike existing LSTM-based methods, it attends to both short- and long-term dependencies among video frames through an elaborate dual self-attention architecture, which can handle longer-term dependencies and admits parallel computing. To reconcile the outputs of the dual self-attention, we rely on a two-stream capsule network to learn the underlying frame selection criteria. Experiments on real-world datasets show the advantages of the proposed approach compared with state-of-the-art methods.

Similar papers

ILS-SUMM: Iterated Local Search for Unsupervised Video Summarization

Yair Shemer, Daniel Rotman, Nahum Shimkin

Auto-TLDR; ILS-SUMM: Iterated Local Search for Video Summarization

In recent years, there has been an increasing interest in building video summarization tools, where the goal is to automatically create a short summary of an input video that properly represents the original content. We consider shot-based video summarization where the summary consists of a subset of the video shots which can be of various lengths. A straightforward approach to maximize the representativeness of a subset of shots is by minimizing the total distance between shots and their nearest selected shots. We formulate the task of video summarization as an optimization problem with a knapsack-like constraint on the total summary duration. Previous studies have proposed greedy algorithms to solve this problem approximately, but no experiments were presented to measure the ability of these methods to obtain solutions with low total distance. Indeed, our experiments on video summarization datasets show that the success of current methods in obtaining results with low total distance still has much room for improvement. In this paper, we develop ILS-SUMM, a novel video summarization algorithm to solve the subset selection problem under the knapsack constraint. Our algorithm is based on the well-known metaheuristic optimization framework -- Iterated Local Search (ILS), known for its ability to avoid weak local minima and obtain a good near-global minimum. Extensive experiments show that our method finds solutions with significantly better total distance than previous methods. Moreover, to indicate the high scalability of ILS-SUMM, we introduce a new dataset consisting of videos of various lengths.
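
The objective described in this abstract (total distance from every shot to its nearest selected shot, under a budget on total summary duration) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the Euclidean distance on feature vectors, the add/swap neighborhood, and the drop-one perturbation are all assumptions chosen to show a generic Iterated Local Search loop.

```python
import math
import random


def total_distance(features, selected):
    """Sum over all shots of the distance to the nearest selected shot."""
    if not selected:
        return math.inf
    return sum(min(math.dist(f, features[j]) for j in selected) for f in features)


def is_feasible(durations, selected, budget):
    """Knapsack-like constraint: total selected duration must fit the budget."""
    return sum(durations[j] for j in selected) <= budget


def local_search(features, durations, budget, selected):
    """Apply improving add/swap moves until no feasible move lowers the cost."""
    selected = set(selected)
    improved = True
    while improved:
        improved = False
        best = total_distance(features, selected)
        candidates = []
        for i in range(len(features)):
            if i in selected:
                continue
            candidates.append(selected | {i})                # add shot i
            for j in selected:
                candidates.append((selected - {j}) | {i})    # swap j for i
        for cand in candidates:
            if is_feasible(durations, cand, budget):
                cost = total_distance(features, cand)
                if cost < best - 1e-12:
                    selected, improved = cand, True
                    break  # restart the scan from the improved solution
    return selected


def iterated_local_search(features, durations, budget, iterations=50, seed=0):
    """Generic ILS: perturb the best solution, re-run local search, keep if better."""
    rng = random.Random(seed)
    best = local_search(features, durations, budget, set())
    best_cost = total_distance(features, best)
    for _ in range(iterations):
        current = set(best)
        if current:
            # Assumed perturbation: drop one randomly chosen selected shot.
            current.discard(rng.choice(sorted(current)))
        current = local_search(features, durations, budget, current)
        cost = total_distance(features, current)
        if cost < best_cost:
            best, best_cost = current, cost
    return best, best_cost


# Toy data (hypothetical): four shots forming two clusters, unit durations.
features = [(0.0,), (1.0,), (10.0,), (11.0,)]
durations = [1, 1, 1, 1]
summary, cost = iterated_local_search(features, durations, budget=2)
```

On this toy input the budget admits two shots, so a good summary picks one shot from each cluster.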

2D Deep Video Capsule Network with Temporal Shift for Action Recognition

Théo Voillemin, Hazem Wannous, Jean-Philippe Vandeborre

Auto-TLDR; Temporal Shift Module over Capsule Network for Action Recognition in Continuous Videos

Action recognition in continuous video streams has been a growing field over the past few years. Deep learning techniques, and in particular Convolutional Neural Networks (CNNs), have achieved good results on this topic. However, intrinsic limitations of CNNs are beginning to cap the results, since 2D CNNs cannot capture temporal information and 3D CNNs are too resource-demanding for real-time applications. The Capsule Network, an evolution of the CNN, has already shown interesting benefits on small, low-information datasets like MNIST, but its true potential has yet to emerge. In this paper, we tackle the action recognition problem by proposing a new architecture that combines a Temporal Shift module with a deep Capsule Network. The Temporal Shift module lets us insert temporal information into a 2D Capsule Network at zero computational cost, preserving the lightness of 2D capsules and their ability to connect spatial features. Our proposed approach outperforms or comes close to state-of-the-art results on color and depth information on public datasets like First Person Hand Action and DHG 14/28, with 10 to 40 times fewer parameters than existing approaches.
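
The general idea of a temporal shift can be sketched in a few lines. This is an illustration of the generic operation only, under stated assumptions: the abstract does not say how the shift is wired into the capsule layers, and the function name `temporal_shift`, the `fold_div` parameter, and the zero padding at sequence boundaries are hypothetical choices made here. One slice of the channels takes its values from the previous frame, another slice from the next frame, and the rest stay in place, so temporal information is mixed by moving data rather than by extra multiplications or parameters.

```python
def temporal_shift(frames, fold_div=4):
    """Shift part of the channels along time.

    frames: list of T per-frame feature vectors, each a list of C numbers.
    The first C // fold_div channels come from the previous frame, the next
    C // fold_div channels come from the next frame, the rest are unchanged.
    Positions with no neighbor are filled with zeros.
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    fold = num_channels // fold_div
    shifted = [list(f) for f in frames]  # copy so the input is not modified
    for t in range(num_frames):
        for c in range(fold):
            # Channel slice 1: value carried forward from frame t - 1.
            shifted[t][c] = frames[t - 1][c] if t > 0 else 0.0
        for c in range(fold, 2 * fold):
            # Channel slice 2: value carried backward from frame t + 1.
            shifted[t][c] = frames[t + 1][c] if t < num_frames - 1 else 0.0
    return shifted


# Toy input (hypothetical): 3 frames with 4 channels each.
frames = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
shifted = temporal_shift(frames, fold_div=4)
```

With `fold_div=4` and four channels, exactly one channel moves in each direction and the remaining two are left untouched.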

Text Synopsis Generation for Egocentric Videos

Aidean Sharghi, Niels Lobo, Mubarak Shah

Auto-TLDR; Egocentric Video Summarization Using Multi-task Learning for End-to-End Learning

Mass utilization of body-worn cameras has led to a huge corpus of available egocentric video. Existing video summarization algorithms can accelerate browsing such videos by selecting (visually) interesting shots from them. Nonetheless, since the system user still has to watch the summary videos, browsing large video databases remains a challenge. Hence, in this work, we propose to generate a textual synopsis, consisting of a few sentences describing the most important events in a long egocentric video. Users can read the short text to gain insight about the video and, more importantly, efficiently search through the content of a large video database using text queries. Since egocentric videos are long and contain many activities and events, using video-to-text algorithms results in thousands of descriptions, many of which are incorrect. Therefore, we propose a multi-task learning scheme to simultaneously generate descriptions for video segments and summarize the resulting descriptions in an end-to-end fashion. We input a set of video shots, and the network generates a text description for each shot. Next, a visual-language content matching unit, trained with a weakly supervised objective, identifies the correct descriptions. Finally, the last component of our network, called the purport network, evaluates the descriptions all together to select the ones containing crucial information. Out of thousands of descriptions generated for the video, a few informative sentences are returned to the user. We validate our framework on the challenging UT Egocentric video dataset, where each video is between 3 and 5 hours long and is associated with over 3000 textual descriptions on average. The generated textual summaries, which include only 5 percent (or less) of the generated descriptions, are compared to ground-truth summaries in the text domain using well-established metrics in natural language processing.

Hierarchical Multimodal Attention for Deep Video Summarization

Melissa Sanabria, Frederic Precioso, Thomas Menguy

Auto-TLDR; Automatic Summarization of Professional Soccer Matches Using Event-Stream Data and Multi-Instance Learning

Similar papers

ILS-SUMM: Iterated Local Search for Unsupervised Video Summarization

2D Deep Video Capsule Network with Temporal Shift for Action Recognition

Text Synopsis Generation for Egocentric Videos

Hierarchical Multimodal Attention for Deep Video Summarization

Learning Object Deformation and Motion Adaption for Semi-Supervised Video Object Segmentation

Multi-Scale 2D Representation Learning for Weakly-Supervised Moment Retrieval

Context Visual Information-Based Deliberation Network for Video Captioning

MFI: Multi-Range Feature Interchange for Video Action Recognition

VTT: Long-Term Visual Tracking with Transformers

Video Representation Fusion Network For Multi-Label Movie Genre Classification

ActionSpotter: Deep Reinforcement Learning Framework for Temporal Action Spotting in Videos

PIN: A Novel Parallel Interactive Network for Spoken Language Understanding

RWF-2000: An Open Large Scale Video Database for Violence Detection

Visual Oriented Encoder: Integrating Multimodal and Multi-Scale Contexts for Video Captioning

Attention-Driven Body Pose Encoding for Human Activity Recognition

What and How? Jointly Forecasting Human Action and Pose

Two-Stream Temporal Convolutional Network for Dynamic Facial Attractiveness Prediction

AttendAffectNet: Self-Attention Based Networks for Predicting Affective Responses from Movies

Modeling Long-Term Interactions to Enhance Action Recognition

Context Matters: Self-Attention for Sign Language Recognition

Attention-Based Deep Metric Learning for Near-Duplicate Video Retrieval

A Grid-Based Representation for Human Action Recognition

Enriching Video Captions with Contextual Text

Towards Practical Compressed Video Action Recognition: A Temporal Enhanced Multi-Stream Network

Feature Pyramid Hierarchies for Multi-Scale Temporal Action Detection

You Ought to Look Around: Precise, Large Span Action Detection

RMS-Net: Regression and Masking for Soccer Event Spotting

Gait Recognition Using Multi-Scale Partial Representation Transformation with Capsules

Attentive Visual Semantic Specialized Network for Video Captioning

MixTConv: Mixed Temporal Convolutional Kernels for Efficient Action Recognition

SAT-Net: Self-Attention and Temporal Fusion for Facial Action Unit Detection

Not 3D Re-ID: Simple Single Stream 2D Convolution for Robust Video Re-Identification

TSMSAN: A Three-Stream Multi-Scale Attentive Network for Video Saliency Detection

Self-Supervised Joint Encoding of Motion and Appearance for First Person Action Recognition

A Duplex Spatiotemporal Filtering Network for Video-Based Person Re-Identification

Text Recognition in Real Scenarios with a Few Labeled Samples

Gabriella: An Online System for Real-Time Activity Detection in Untrimmed Security Videos

Let's Play Music: Audio-Driven Performance Video Generation

ACCLVOS: Atrous Convolution with Spatial-Temporal ConvLSTM for Video Object Segmentation

Audio-Visual Speech Recognition Using a Two-Step Feature Fusion Strategy

Global Feature Aggregation for Accident Anticipation

Motion U-Net: Multi-Cue Encoder-Decoder Network for Motion Segmentation

Progressive Scene Segmentation Based on Self-Attention Mechanism

TinyVIRAT: Low-Resolution Video Action Recognition

Precise Temporal Action Localization with Quantified Temporal Structure of Actions

Variational Capsule Encoder

Region-Based Non-Local Operation for Video Classification

Pose-Robust Face Recognition by Deep Meta Capsule Network-Based Equivariant Embedding