ICPR2020 Paper Browser

Paper download is intended for registered attendees only, and is subject to the IEEE Copyright Policy. Any other use is strongly forbidden.

Text Synopsis Generation for Egocentric Videos

Aidean Sharghi, Niels Lobo, Mubarak Shah

Auto-TLDR; Egocentric Video Summarization Using Multi-task Learning for End-to-End Learning

Mass utilization of body-worn cameras has led to a huge corpus of available egocentric video. Existing video summarization algorithms can accelerate browsing such videos by selecting (visually) interesting shots from them. Nonetheless, since the system user still has to watch the summary videos, browsing large video databases remains a challenge. Hence, in this work, we propose to generate a textual synopsis, consisting of a few sentences describing the most important events in a long egocentric video. Users can read the short text to gain insight about the video and, more importantly, efficiently search through the content of a large video database using text queries. Since egocentric videos are long and contain many activities and events, using video-to-text algorithms results in thousands of descriptions, many of which are incorrect. Therefore, we propose a multi-task learning scheme to simultaneously generate descriptions for video segments and summarize the resulting descriptions in an end-to-end fashion. We input a set of video shots, and the network generates a text description for each shot. Next, a visual-language content matching unit, trained with a weakly supervised objective, identifies the correct descriptions. Finally, the last component of our network, called the purport network, evaluates the descriptions all together to select the ones containing crucial information. Out of thousands of descriptions generated for the video, a few informative sentences are returned to the user. We validate our framework on the challenging UT Egocentric video dataset, where each video is between 3 and 5 hours long and is associated with over 3000 textual descriptions on average. The generated textual summaries, including only 5 percent (or less) of the generated descriptions, are compared to ground-truth summaries in the text domain using well-established metrics in natural language processing.
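
The selection stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' network, which is trained end to end; it only mimics the data flow with plain filtering and ranking. The names `ShotDescription`, `match_score`, `importance`, `match_threshold`, and `select_synopsis` are hypothetical, and the scores stand in for the outputs of the content matching unit and the purport network.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ShotDescription:
    shot_index: int      # position of the shot in the video
    text: str            # generated description for the shot
    match_score: float   # hypothetical stand-in for the content matching unit output
    importance: float    # hypothetical stand-in for the purport network output


def select_synopsis(
    descriptions: List[ShotDescription],
    match_threshold: float = 0.5,   # assumed cut-off, not from the paper
    keep_fraction: float = 0.05,    # "5 percent (or less)" of generated descriptions
) -> List[ShotDescription]:
    """Filter out implausible descriptions, then keep the most important ones."""
    # Stage 1: drop descriptions that do not agree with the visual content.
    plausible = [d for d in descriptions if d.match_score >= match_threshold]
    if not plausible:
        return []
    # Stage 2: keep at most keep_fraction of all generated descriptions.
    budget = max(1, int(keep_fraction * len(descriptions)))
    top = sorted(plausible, key=lambda d: d.importance, reverse=True)[:budget]
    # Return the synopsis in chronological order.
    return sorted(top, key=lambda d: d.shot_index)
```

In the paper the matching and selection components are learned jointly with the captioner, so the fixed threshold and independent scores used here are simplifications made only for illustration.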

Similar papers

Enriching Video Captions with Contextual Text

Philipp Rimle, Pelin Dogan, Markus Gross

Auto-TLDR; Contextualized Video Captioning Using Contextual Text

Context Visual Information-Based Deliberation Network for Video Captioning

Attentive Visual Semantic Specialized Network for Video Captioning

Visual Oriented Encoder: Integrating Multimodal and Multi-Scale Contexts for Video Captioning

Multi-Scale 2D Representation Learning for Weakly-Supervised Moment Retrieval

Transformer Reasoning Network for Image-Text Matching and Retrieval

Scientific Document Summarization using Citation Context and Multi-objective Optimization

A Novel Actor Dual-Critic Model for Remote Sensing Image Captioning

Information Graphic Summarization Using a Collection of Multimodal Deep Neural Networks

Assessing the Severity of Health States Based on Social Media Posts

PIN: A Novel Parallel Interactive Network for Spoken Language Understanding

MAGNet: Multi-Region Attention-Assisted Grounding of Natural Language Queries at Phrase Level

Learning Neural Textual Representations for Citation Recommendation

Video Summarization with a Dual Attention Capsule Network

PICK: Processing Key Information Extraction from Documents Using Improved Graph Learning-Convolutional Networks

Context Matters: Self-Attention for Sign Language Recognition

A Novel Attention-Based Aggregation Function to Combine Vision and Language

Adversarial Training for Aspect-Based Sentiment Analysis with BERT

Evaluation of BERT and ALBERT Sentence Embedding Performance on Downstream NLP Tasks

Self-Supervised Joint Encoding of Motion and Appearance for First Person Action Recognition

Audio-Based Near-Duplicate Video Retrieval with Audio Similarity Learning

KoreALBERT: Pretraining a Lite BERT Model for Korean Language Understanding

Cross-Lingual Text Image Recognition Via Multi-Task Sequence to Sequence Learning

What and How? Jointly Forecasting Human Action and Pose

Automated Whiteboard Lecture Video Summarization by Content Region Detection and Representation

ILS-SUMM: Iterated Local Search for Unsupervised Video Summarization

Hierarchical Multimodal Attention for Deep Video Summarization

Tackling Contradiction Detection in German Using Machine Translation and End-To-End Recurrent Neural Networks

Segmenting Messy Text: Detecting Boundaries in Text Derived from Historical Newspaper Images

Explore and Explain: Self-Supervised Navigation and Recounting

RWF-2000: An Open Large Scale Video Database for Violence Detection

GCNs-Based Context-Aware Short Text Similarity Model

Efficient Sentence Embedding Via Semantic Subspace Analysis

Multimodal Side-Tuning for Document Classification

ActionSpotter: Deep Reinforcement Learning Framework for Temporal Action Spotting in Videos

Integrating Historical States and Co-Attention Mechanism for Visual Dialog

Dual Path Multi-Modal High-Order Features for Textual Content Based Visual Question Answering

Cross-Supervised Joint-Event-Extraction with Heterogeneous Information Networks

TinyVIRAT: Low-Resolution Video Action Recognition

Learning with Delayed Feedback

RMS-Net: Regression and Masking for Soccer Event Spotting

Activity Recognition Using First-Person-View Cameras Based on Sparse Optical Flows

Modeling Long-Term Interactions to Enhance Action Recognition

Extracting Action Hierarchies from Action Labels and their Use in Deep Action Recognition

Gabriella: An Online System for Real-Time Activity Detection in Untrimmed Security Videos

Transformer Networks for Trajectory Forecasting

Continuous Sign Language Recognition with Iterative Spatiotemporal Fine-Tuning

Unsupervised Co-Segmentation for Athlete Movements and Live Commentaries Using Crossmodal Temporal Proximity