ICPR2020 Paper Browser

Paper download is intended for registered attendees only and is subject to the IEEE Copyright Policy. Any other use is strictly forbidden.

Integrating Historical States and Co-Attention Mechanism for Visual Dialog

Tianling Jiang, Yi Ji, Chunping Liu

Auto-TLDR; Integrating Historical States and Co-attention for Visual Dialog

Visual dialog is a typical multi-modal task that involves both vision and language. It currently faces two major difficulties: a temporal sequence issue in the dialog history and a visual reference problem. In this paper, we propose Integrating Historical States and Co-attention (HSCA) for visual dialog to address both. HSCA consists of two main modules, Co-ATT and MATCH. The Co-ATT module uses questions and answers to guide attention over the image at an early stage, so that more specific objects are attended to. It tackles the temporal sequence issue in historical information, which can reduce answer precision for multi-round questions. The MATCH module takes a question containing pronouns and retrieves the best-matching block of historical information. It addresses the visual reference problem, which requires resolving pronouns that refer to unknowns in the text and then locating the corresponding objects in the given image. We evaluate our model quantitatively and qualitatively on VisDial v1.0 and carry out ablation studies. The experimental results show that HSCA outperforms state-of-the-art methods in many respects.
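The retrieval step attributed to the MATCH module can be illustrated with a minimal sketch. The abstract does not say how a question is scored against history blocks, so cosine similarity over fixed-length embeddings is an assumption here, and the function names and toy vectors are hypothetical rather than taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_matching_history(question_vec, history_vecs):
    """Return (index, score) of the history block most similar to the question.

    Assumption: similarity is cosine; the paper may use a learned matching score.
    """
    if not history_vecs:
        raise ValueError("history_vecs must not be empty")
    scores = [cosine(question_vec, h) for h in history_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy embeddings, made up for illustration only.
question = [1.0, 0.0, 1.0]
history = [[0.0, 1.0, 0.0], [1.0, 0.1, 0.9], [0.5, 0.5, 0.0]]
index, score = best_matching_history(question, history)
print(index, round(score, 3))
```

In this toy case the second history block (index 1) is selected because it points in nearly the same direction as the question vector.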

Similar papers

Multi-Stage Attention Based Visual Question Answering

Aakansha Mishra, Ashish Anand, Prithwijit Guha

Auto-TLDR; Alternative Bi-directional Attention for Visual Question Answering

Recent developments in Visual Question Answering (VQA) have brought promising improvements in performance through attention-based networks. Most such approaches use unidirectional attention, in which the textual domain (the question) attends over the visual space, and therefore focus mostly on learning high-quality attention in the visual space. In contrast, this work proposes an alternating bi-directional attention framework. First, question-to-image attention helps to learn a robust visual embedding; second, image-to-question attention helps to improve the question embedding. This attention mechanism is realized in an alternating fashion, i.e., question-to-image followed by image-to-question, and is repeated to maximize performance. We believe that this process of alternating attention generation helps both modalities and leads to better representations for the VQA task. The proposal is benchmarked on the TDIUC dataset against state-of-the-art approaches. Our ablation analysis shows that alternating attention is the key to achieving high performance in VQA.
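The alternating scheme described above can be sketched as follows. This is only an illustration under stated assumptions: dot-product scoring, a shared feature dimension for words and image regions, a mean-pooled initial question vector, and a fixed number of rounds are all choices made here, not details given in the abstract. All names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, items):
    """Dot-product attention: return the weighted sum of items and the weights."""
    scores = [sum(q * x for q, x in zip(query, item)) for item in items]
    weights = softmax(scores)
    dim = len(items[0])
    summary = [sum(w * item[d] for w, item in zip(weights, items)) for d in range(dim)]
    return summary, weights


def alternating_attention(word_feats, region_feats, rounds=2):
    """Alternate question-to-image and image-to-question attention.

    Assumes word and region features share one dimension and rounds >= 1.
    """
    dim = len(word_feats[0])
    # Initial question vector: mean of the word features (an assumption).
    q = [sum(w[d] for w in word_feats) / len(word_feats) for d in range(dim)]
    for _ in range(rounds):
        v, region_w = attend(q, region_feats)  # question -> image
        q, word_w = attend(v, word_feats)      # image -> question
    return q, v, region_w, word_w


# Toy features, made up for illustration only.
words = [[1.0, 0.0], [0.0, 1.0]]
regions = [[2.0, 0.0], [0.0, 0.5]]
q, v, region_w, word_w = alternating_attention(words, regions)
print([round(w, 3) for w in region_w], [round(w, 3) for w in word_w])
```

Each round refines the visual summary using the current question vector, then refines the question vector using that visual summary, so both attention maps sharpen toward the mutually consistent word and region.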

Dual Path Multi-Modal High-Order Features for Textual Content Based Visual Question Answering

Yanan Li, Yuetan Lin, Hongrui Zhao, Donghui Wang

Auto-TLDR; TextVQA: An End-to-End Visual Question Answering Model for Text-Based VQA

As a typical cross-modal problem, visual question answering (VQA) has received increasing attention from the computer vision and natural language processing communities. Reading and reasoning about text and visual content in images is a burgeoning and important research topic in VQA, especially for applications that assist the visually impaired. Given an image, the task is to predict an answer to a natural language question that is closely related to the text appearing in the image. In this paper, we propose a novel end-to-end textual-content-based VQA model, which grounds question answering in both visual and textual information. After encoding the image, the question, and the recognized text words, it uses multi-modal factorized high-order modules and an attention mechanism to fuse question-image and question-text features, respectively. The complex correlations among different features can thus be captured efficiently. To ensure the model's extensibility, it embeds candidate answers and recognized texts in a semantic embedding space and adopts a semantic-embedding-based classifier to perform answer prediction. Extensive experiments on the recently proposed TextVQA benchmark demonstrate that the proposed model achieves promising results.
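To give a rough idea of what a factorized fusion of two modalities can look like, the sketch below implements one common form of factorized bilinear pooling: both inputs are projected to a shared space, multiplied element-wise, and sum-pooled in groups of k. Whether the paper's "factorized high-order modules" take exactly this form is an assumption; the projection matrices and inputs are toy values, and the function names are hypothetical.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def factorized_bilinear_pool(x, y, proj_x, proj_y, k):
    """Fuse x and y with a rank-k factorized bilinear interaction.

    proj_x has shape (o*k, len(x)) and proj_y has shape (o*k, len(y));
    the result has length o. Normalization steps often used in practice
    are omitted in this sketch.
    """
    px = matvec(proj_x, x)
    py = matvec(proj_y, y)
    if len(px) != len(py) or len(px) % k != 0:
        raise ValueError("projections must have equal length divisible by k")
    joint = [a * b for a, b in zip(px, py)]  # element-wise product
    # Sum-pool each consecutive group of k entries into one output unit.
    return [sum(joint[i:i + k]) for i in range(0, len(joint), k)]


# Toy question and image features with hand-picked projections.
question_feat = [1, 2]
image_feat = [3, 1]
proj_q = [[1, 0], [0, 1], [1, 1], [1, -1]]
proj_i = [[1, 0], [0, 1], [1, 1], [0, 2]]
fused = factorized_bilinear_pool(question_feat, image_feat, proj_q, proj_i, k=2)
print(fused)  # [5, 10]
```

The factorization keeps the parameter count linear in the input sizes, whereas a full bilinear map would need one weight per pair of input dimensions for every output unit.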

Multi-Modal Contextual Graph Neural Network for Text Visual Question Answering

Yaoyuan Liang, Xin Wang, Xuguang Duan, Wenwu Zhu

Auto-TLDR; Multi-modal Contextual Graph Neural Network for Text Visual Question Answering

A Novel Attention-Based Aggregation Function to Combine Vision and Language

Multi-Scale Relational Reasoning with Regional Attention for Visual Question Answering

MAGNet: Multi-Region Attention-Assisted Grounding of Natural Language Queries at Phrase Level

Question-Agnostic Attention for Visual Question Answering

Answer-Checking in Context: A Multi-Modal Fully Attention Network for Visual Question Answering

Transformer Reasoning Network for Image-Text Matching and Retrieval

Context Visual Information-Based Deliberation Network for Video Captioning

P ≈ NP, at Least in Visual Question Answering

Enhanced User Interest and Expertise Modeling for Expert Recommendation

Visual Oriented Encoder: Integrating Multimodal and Multi-Scale Contexts for Video Captioning

Attentive Visual Semantic Specialized Network for Video Captioning

VSR++: Improving Visual Semantic Reasoning for Fine-Grained Image-Text Matching

Improving Visual Question Answering Using Active Perception on Static Images

Visual Style Extraction from Chart Images for Chart Restyling

Graph Discovery for Visual Test Generation

Global Context-Based Network with Transformer for Image2latex

Enriching Video Captions with Contextual Text

Beyond the Deep Metric Learning: Enhance the Cross-Modal Matching with Adversarial Discriminative Domain Regularization

Explore and Explain: Self-Supervised Navigation and Recounting

Multi-Scale 2D Representation Learning for Weakly-Supervised Moment Retrieval

Global Feature Aggregation for Accident Anticipation

Information Graphic Summarization Using a Collection of Multimodal Deep Neural Networks

Text Synopsis Generation for Egocentric Videos

Two-Level Attention-Based Fusion Learning for RGB-D Face Recognition

Exploring and Exploiting the Hierarchical Structure of a Scene for Scene Graph Generation

PrivAttNet: Predicting Privacy Risks in Images Using Visual Attention

A Novel Actor Dual-Critic Model for Remote Sensing Image Captioning

Cross-Media Hash Retrieval Using Multi-head Attention Network

Using Scene Graphs for Detecting Visual Relationships

PICK: Processing Key Information Extraction from Documents Using Improved Graph Learning-Convolutional Networks

6D Pose Estimation with Correlation Fusion

ConvMath : A Convolutional Sequence Network for Mathematical Expression Recognition

PIN: A Novel Parallel Interactive Network for Spoken Language Understanding

MANet: Multimodal Attention Network Based Point-View Fusion for 3D Shape Recognition

Object Detection Using Dual Graph Network

Zero-Shot Text Classification with Semantically Extended Graph Convolutional Network

Reinforcement Learning with Dual Attention Guided Graph Convolution for Relation Extraction

Forground-Guided Vehicle Perception Framework

Adaptive Word Embedding Module for Semantic Reasoning in Large-Scale Detection

Semantics to Space(S2S): Embedding Semantics into Spatial Space for Zero-Shot Verb-Object Query Inferencing

Dual-Attention Guided Dropblock Module for Weakly Supervised Object Localization

Flow-Guided Spatial Attention Tracking for Egocentric Activity Recognition

Tackling Contradiction Detection in German Using Machine Translation and End-To-End Recurrent Neural Networks

A Transformer-Based Radical Analysis Network for Chinese Character Recognition

More Correlations Better Performance: Fully Associative Networks for Multi-Label Image Classification