ICPR2020 Paper Browser

Paper download is intended for registered attendees only and is subject to the IEEE Copyright Policy. Any other use is strictly forbidden.

Attentive Visual Semantic Specialized Network for Video Captioning

Jesus Perez-Martin, Benjamin Bustos, Jorge Pérez

Auto-TLDR; Adaptive Visual Semantic Specialized Network for Video Captioning

As an essential high-level task in video understanding, automatically describing a video in natural language has recently gained attention as a fundamental challenge in computer vision. Previous models for video captioning have several limitations, such as gaps in current semantic representations and the limited expressiveness of the generated captions. To address these limitations, we present a new architecture that we call Attentive Visual Semantic Specialized Network (AVSSN), an encoder-decoder model based on our Adaptive Attention Gate and Specialized LSTM layers. This architecture can selectively decide when to use visual or semantic information in the text generation process. The adaptive gate lets the decoder automatically select the relevant information, providing a better temporal state representation than existing decoders. In addition, the model can learn to improve the expressiveness of generated captions by attending to their length, using a sentence-length-related loss function. We evaluate the effectiveness of the proposed approach on the Microsoft Video Description (MSVD) and the Microsoft Research Video-to-Text (MSR-VTT) datasets, achieving state-of-the-art performance on several popular evaluation metrics: BLEU-4, METEOR, CIDEr, and ROUGE_L.
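
To make the gating idea concrete, here is a minimal Python sketch of one way a decoder could choose between visual and semantic information at a single time step. It is an illustration under stated assumptions, not the AVSSN implementation: the abstract does not give the gate's equations, so the scalar sigmoid gate, the function name `adaptive_gate`, and the parameters `w_gate` and `bias` are all hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def adaptive_gate(visual, semantic, hidden, w_gate, bias=0.0):
    """Blend a visual and a semantic context vector with a scalar gate.

    Assumption (not from the paper): the gate is a sigmoid of a linear
    function of the decoder hidden state. A gate near 1 favors the visual
    context; a gate near 0 favors the semantic context.
    """
    if len(visual) != len(semantic):
        raise ValueError("visual and semantic vectors must have equal length")
    if len(hidden) != len(w_gate):
        raise ValueError("hidden state and gate weights must have equal length")

    # Score the current decoder state, then squash it into (0, 1).
    score = sum(w * h for w, h in zip(w_gate, hidden)) + bias
    beta = sigmoid(score)

    # Convex combination of the two information sources.
    mixed = [beta * v + (1.0 - beta) * s for v, s in zip(visual, semantic)]
    return beta, mixed


if __name__ == "__main__":
    # Hypothetical toy vectors, chosen only to show the call shape.
    beta, mixed = adaptive_gate(
        visual=[1.0, 0.0],
        semantic=[0.0, 1.0],
        hidden=[0.3, -0.2],
        w_gate=[0.0, 0.0],
    )
    print(beta, mixed)
```

With all-zero gate weights the gate is neutral and the output is the plain average of the two vectors. In a trained model the weights would be learned, so the balance could shift from one decoding step to the next.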

Similar papers

Context Visual Information-Based Deliberation Network for Video Captioning

Min Lu, Xueyong Li, Caihua Liu

Auto-TLDR; Context visual information-based deliberation network for video captioning

Visual Oriented Encoder: Integrating Multimodal and Multi-Scale Contexts for Video Captioning

Enriching Video Captions with Contextual Text

Text Synopsis Generation for Egocentric Videos

PIN: A Novel Parallel Interactive Network for Spoken Language Understanding

A Novel Actor Dual-Critic Model for Remote Sensing Image Captioning

Multi-Scale 2D Representation Learning for Weakly-Supervised Moment Retrieval

MAGNet: Multi-Region Attention-Assisted Grounding of Natural Language Queries at Phrase Level

Dual Path Multi-Modal High-Order Features for Textual Content Based Visual Question Answering

Tackling Contradiction Detection in German Using Machine Translation and End-To-End Recurrent Neural Networks

Transformer Reasoning Network for Image-Text Matching and Retrieval

Context Matters: Self-Attention for Sign Language Recognition

Integrating Historical States and Co-Attention Mechanism for Visual Dialog

A Novel Attention-Based Aggregation Function to Combine Vision and Language

Cross-Lingual Text Image Recognition Via Multi-Task Sequence to Sequence Learning

GCNs-Based Context-Aware Short Text Similarity Model

MA-LSTM: A Multi-Attention Based LSTM for Complex Pattern Extraction

Explore and Explain: Self-Supervised Navigation and Recounting

Multi-Modal Contextual Graph Neural Network for Text Visual Question Answering

Extracting Action Hierarchies from Action Labels and their Use in Deep Action Recognition

Cross-Supervised Joint-Event-Extraction with Heterogeneous Information Networks

Trajectory-User Link with Attention Recurrent Networks

Flow-Guided Spatial Attention Tracking for Egocentric Activity Recognition

Efficient Sentence Embedding Via Semantic Subspace Analysis

PICK: Processing Key Information Extraction from Documents Using Improved Graph Learning-Convolutional Networks

Continuous Sign Language Recognition with Iterative Spatiotemporal Fine-Tuning

Zero-Shot Text Classification with Semantically Extended Graph Convolutional Network

Global Feature Aggregation for Accident Anticipation

SAT-Net: Self-Attention and Temporal Fusion for Facial Action Unit Detection

A CNN-RNN Framework for Image Annotation from Visual Cues and Social Network Metadata

Reinforcement Learning with Dual Attention Guided Graph Convolution for Relation Extraction

ConvMath: A Convolutional Sequence Network for Mathematical Expression Recognition

Global Context-Based Network with Transformer for Image2latex

Detective: An Attentive Recurrent Model for Sparse Object Detection

Developing Motion Code Embedding for Action Recognition in Videos

Revisiting Sequence-To-Sequence Video Object Segmentation with Multi-Task Loss and Skip-Memory

KoreALBERT: Pretraining a Lite BERT Model for Korean Language Understanding

Evaluation of BERT and ALBERT Sentence Embedding Performance on Downstream NLP Tasks

PrivAttNet: Predicting Privacy Risks in Images Using Visual Attention

VSR++: Improving Visual Semantic Reasoning for Fine-Grained Image-Text Matching

Multi-Stage Attention Based Visual Question Answering

AG-GAN: An Attentive Group-Aware GAN for Pedestrian Trajectory Prediction

A Grid-Based Representation for Human Action Recognition

ActionSpotter: Deep Reinforcement Learning Framework for Temporal Action Spotting in Videos

Moto: Enhancing Embedding with Multiple Joint Factors for Chinese Text Classification

Feature Pyramid Hierarchies for Multi-Scale Temporal Action Detection

Learning Neural Textual Representations for Citation Recommendation

Automatic Student Network Search for Knowledge Distillation