ICPR2020 Paper Browser

Paper download is intended for registered attendees only, and is subject to the IEEE Copyright Policy. Any other use is strictly forbidden.

A Novel Actor Dual-Critic Model for Remote Sensing Image Captioning

Ruchika Chavhan, Biplab Banerjee, Xiao Xiang Zhu, Subhasis Chaudhuri

Auto-TLDR; Actor Dual-Critic Training for Remote Sensing Image Captioning Using Deep Reinforcement Learning

We address the problem of generating textual captions for optical remote sensing (RS) images using deep reinforcement learning. Due to the high inter-class similarity in reference sentences describing remote sensing data, jointly encoding the sentences and images encourages the prediction of captions that are, in many cases, semantically more precise than the ground truth. To this end, we introduce an Actor Dual-Critic training strategy in which a second critic model, in the form of an encoder-decoder RNN, encodes the latent information corresponding to the original and generated captions. While all actor-critic methods use an actor to predict sentences for an image and a critic to provide rewards, our proposed encoder-decoder RNN guarantees high-level comprehension of images by sentence-to-image translation. We observe that the proposed model generates sentences on the test data that are highly similar to the ground truth and, in many critical cases, produces even better captions. Extensive experiments on the benchmark Remote Sensing Image Captioning Dataset (RSICD) and the UCM-captions dataset confirm the superiority of the proposed approach over the previous state of the art, with sharp gains in both the ROUGE-L and CIDEr measures.

Similar papers

Explore and Explain: Self-Supervised Navigation and Recounting

Roberto Bigazzi, Federico Landi, Marcella Cornia, Silvia Cascianelli, Lorenzo Baraldi, Rita Cucchiara

Auto-TLDR; Exploring a Photorealistic Environment for Explanation and Navigation

Embodied AI has recently been gaining attention as it aims to foster the development of autonomous and intelligent agents. In this paper, we devise a novel embodied setting in which an agent needs to explore a previously unknown environment while recounting what it sees along the path. In this context, the agent needs to navigate the environment driven by an exploration goal, select proper moments for description, and output natural language descriptions of relevant objects and scenes. Our model integrates a novel self-supervised exploration module with penalty, and a fully-attentive captioning model for explanation. We also investigate different policies for selecting proper moments for explanation, driven by information coming from both the environment and the navigation. Experiments are conducted on photorealistic environments from the Matterport3D dataset and investigate the navigation and explanation capabilities of the agent as well as the role of their interactions.

Attentive Visual Semantic Specialized Network for Video Captioning

Jesus Perez-Martin, Benjamin Bustos, Jorge Pérez

Auto-TLDR; Adaptive Visual Semantic Specialized Network for Video Captioning

Context Visual Information-Based Deliberation Network for Video Captioning

Enriching Video Captions with Contextual Text

Visual Oriented Encoder: Integrating Multimodal and Multi-Scale Contexts for Video Captioning

Text Synopsis Generation for Egocentric Videos

Transformer Reasoning Network for Image-Text Matching and Retrieval

AVD-Net: Attention Value Decomposition Network for Deep Multi-Agent Reinforcement Learning

A Bayesian Approach to Reinforcement Learning of Vision-Based Vehicular Control

Tackling Contradiction Detection in German Using Machine Translation and End-To-End Recurrent Neural Networks

A Novel Attention-Based Aggregation Function to Combine Vision and Language

Multi-Scale 2D Representation Learning for Weakly-Supervised Moment Retrieval

Learning with Delayed Feedback

Object-Oriented Map Exploration and Construction Based on Auxiliary Task Aided DRL

ActionSpotter: Deep Reinforcement Learning Framework for Temporal Action Spotting in Videos

Dual Path Multi-Modal High-Order Features for Textual Content Based Visual Question Answering

The Effect of Multi-Step Methods on Overestimation in Deep Reinforcement Learning

Integrating Historical States and Co-Attention Mechanism for Visual Dialog

Low Dimensional State Representation Learning with Reward-Shaped Priors

Context Matters: Self-Attention for Sign Language Recognition

MAGNet: Multi-Region Attention-Assisted Grounding of Natural Language Queries at Phrase Level

Learning from Learners: Adapting Reinforcement Learning Agents to Be Competitive in a Card Game

Deep Reinforcement Learning on a Budget: 3D Control and Reasoning without a Supercomputer

Improving Visual Question Answering Using Active Perception on Static Images

Meta Learning Via Learned Loss

Detecting and Adapting to Crisis Pattern with Context Based Deep Reinforcement Learning

Global Context-Based Network with Transformer for Image2latex

Trajectory Representation Learning for Multi-Task NMRDP Planning

Trajectory-User Link with Attention Recurrent Networks

RLST: A Reinforcement Learning Approach to Scene Text Detection Refinement

Cross-Lingual Text Image Recognition Via Multi-Task Sequence to Sequence Learning

Vacant Parking Space Detection Based on Task Consistency and Reinforcement Learning

Multi-Stage Attention Based Visual Question Answering

Beyond the Deep Metric Learning: Enhance the Cross-Modal Matching with Adversarial Discriminative Domain Regularization

DAG-Net: Double Attentive Graph Neural Network for Trajectory Forecasting

Efficient Sentence Embedding Via Semantic Subspace Analysis

The Color Out of Space: Learning Self-Supervised Representations for Earth Observation Imagery

MEG: Multi-Evidence GNN for Multimodal Semantic Forensics

A CNN-RNN Framework for Image Annotation from Visual Cues and Social Network Metadata

What and How? Jointly Forecasting Human Action and Pose

KoreALBERT: Pretraining a Lite BERT Model for Korean Language Understanding

Recurrent Deep Attention Network for Person Re-Identification

Information Graphic Summarization Using a Collection of Multimodal Deep Neural Networks

Detective: An Attentive Recurrent Model for Sparse Object Detection

The Role of Cycle Consistency for Generating Better Human Action Videos from a Single Frame

Visual Object Tracking in Drone Images with Deep Reinforcement Learning

AOAM: Automatic Optimization of Adjacency Matrix for Graph Convolutional Network

ConvMath: A Convolutional Sequence Network for Mathematical Expression Recognition