ICPR2020 Paper Browser

Paper download is intended for registered attendees only, and is subject to the IEEE Copyright Policy. Any other use is strictly forbidden.

Question-Agnostic Attention for Visual Question Answering

Moshiur R Farazi, Salman Hameed Khan, Nick Barnes

Auto-TLDR; Question-Agnostic Attention for Visual Question Answering

Visual Question Answering (VQA) models employ attention mechanisms to discover the image locations that are most relevant for answering a specific question. For this purpose, several multimodal fusion strategies have been proposed, ranging from relatively simple operations (e.g., linear sum) to more complex ones (e.g., Block). The resulting multimodal representations define an intermediate feature space for capturing the interplay between visual and semantic features, which helps in selectively focusing on image content. In this paper, we propose a question-agnostic attention mechanism that is complementary to the existing question-dependent attention mechanisms. Our proposed model parses object instances to obtain an 'object map' and applies this map to the visual features to generate Question-Agnostic Attention (QAA) features. In contrast to question-dependent attention approaches that are learned end-to-end, the proposed QAA does not involve question-specific training, and can be easily included in almost any existing VQA model as a generic light-weight pre-processing step, thereby adding minimal computational overhead for training. Further, when used to complement question-dependent attention, the QAA allows the model to focus on regions containing objects that might have been overlooked by the learned attention representation. Through extensive evaluation on the VQAv1, VQAv2 and TDIUC datasets, we show that incorporating complementary QAA allows state-of-the-art VQA models to perform better, and provides a significant boost to simplistic VQA models, enabling them to perform on par with highly sophisticated fusion strategies.
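The object-map step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that instance masks have already been produced by some instance-segmentation model and resized to the feature grid, that the object map is the binary union of those masks, and that "applying" the map means element-wise multiplication. The function name `question_agnostic_attention` and its parameters are hypothetical.

```python
import numpy as np

def question_agnostic_attention(features, instance_masks):
    """Mask a grid of visual features with a binary object map.

    features:       (H, W, C) array of visual features (assumed layout).
    instance_masks: (N, H, W) binary masks, one per detected object instance,
                    assumed to be already resized to the H x W feature grid.
    """
    if len(instance_masks) == 0:
        # No detected objects: the object map is empty everywhere.
        object_map = np.zeros(features.shape[:2], dtype=features.dtype)
    else:
        # Union of all instance masks (assumption: any overlap counts as "object").
        object_map = (np.asarray(instance_masks).sum(axis=0) > 0).astype(features.dtype)
    # Broadcast the (H, W) map over the channel axis; no question input is involved.
    return features * object_map[..., None]

# Toy example: a 2x2 grid with 3 channels and two single-cell objects.
features = np.ones((2, 2, 3))
masks = np.array([[[1, 0], [0, 0]],
                  [[0, 0], [0, 1]]])
qaa_features = question_agnostic_attention(features, masks)
```

Because the map depends only on the image, features of this kind could be computed once per image and reused across all questions, which is consistent with the abstract's description of QAA as a pre-processing step.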

Similar papers

Multi-Stage Attention Based Visual Question Answering

Aakansha Mishra, Ashish Anand, Prithwijit Guha

Auto-TLDR; Alternative Bi-directional Attention for Visual Question Answering

Recent developments in the field of Visual Question Answering (VQA) have brought promising improvements in performance through contributions in attention-based networks. Most such approaches have focused on unidirectional attention, which applies attention from the textual domain (the question) to the visual space. These approaches have mostly focused on learning high-quality attention in the visual space. In contrast, this work proposes an alternating bi-directional attention framework. First, question-to-image attention helps to learn a robust visual-space embedding, and second, image-to-question attention helps to improve the question embedding. This attention mechanism is realized in an alternating fashion, i.e., question-to-image followed by image-to-question, and is repeated to maximize performance. We believe that this process of alternating attention generation helps both modalities and leads to better representations for the VQA task. The proposal is benchmarked on the TDIUC dataset against state-of-the-art approaches. Our ablation analysis shows that alternating attention is key to achieving high performance in VQA.
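The alternating scheme (question-to-image, then image-to-question, repeated) can be sketched as follows. This is only an illustration under stated assumptions: region and word features share one dimension, scoring is scaled dot-product attention with no learned parameters, and the question summary is initialized by mean pooling. None of these details come from the abstract, and the names `attend` and `alternating_attention` are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, keys):
    """Pool `keys` (n, d) using scaled dot-product weights from `query` (d,)."""
    weights = softmax(keys @ query / np.sqrt(keys.shape[1]))
    return weights @ keys, weights

def alternating_attention(regions, words, rounds=2):
    """regions: (R, d) image-region features; words: (T, d) question-word features."""
    # Assumed initialization: mean-pooled summaries and uniform weights.
    q = words.mean(axis=0)
    v = regions.mean(axis=0)
    v_w = np.full(len(regions), 1.0 / len(regions))
    q_w = np.full(len(words), 1.0 / len(words))
    for _ in range(rounds):
        v, v_w = attend(q, regions)  # question-to-image attention
        q, q_w = attend(v, words)    # image-to-question attention
    return v, q, v_w, q_w

rng = np.random.default_rng(0)
v, q, v_w, q_w = alternating_attention(rng.normal(size=(5, 4)), rng.normal(size=(3, 4)))
```

Each round refines the visual summary using the current question summary and then refines the question summary using the updated visual one, which is the alternation the abstract describes.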

Dual Path Multi-Modal High-Order Features for Textual Content Based Visual Question Answering

Yanan Li, Yuetan Lin, Hongrui Zhao, Donghui Wang

Auto-TLDR; TextVQA: An End-to-End Visual Question Answering Model for Text-Based VQA

As a typical cross-modal problem, visual question answering (VQA) has received increasing attention from the computer vision and natural language processing communities. Reading and reasoning about text and visual content in images is a burgeoning and important research topic in VQA, especially for applications that assist the visually impaired. Given an image, the task is to predict an answer to a natural language question closely related to the image's textual content. In this paper, we propose a novel end-to-end textual-content-based VQA model, which grounds question answering on both the visual and textual information. After encoding the image, question and recognized text words, the model uses multi-modal factorized high-order modules and the attention mechanism to fuse question-image and question-text features respectively. This allows the complex correlations among different features to be captured efficiently. To ensure the model's extensibility, it embeds candidate answers and recognized texts in a semantic embedding space and adopts a semantic-embedding-based classifier to perform answer prediction. Extensive experiments on the newly proposed benchmark TextVQA demonstrate that the proposed model can achieve promising results.
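The abstract does not spell out its fusion module, so the following is only a sketch of one common form of factorized fusion, low-rank factorized bilinear pooling: both inputs are projected into a shared space, multiplied element-wise, and sum-pooled in groups of k factors. The projection matrices here are random placeholders rather than learned weights, a single block is shown rather than a high-order cascade, and the function name `factorized_bilinear_pool` is hypothetical.

```python
import numpy as np

def factorized_bilinear_pool(x, y, U, V, k):
    """Fuse two feature vectors with low-rank factorized bilinear pooling.

    x: (dx,), y: (dy,), U: (dx, o*k), V: (dy, o*k) -> fused vector of shape (o,).
    """
    joint = (x @ U) * (y @ V)                # element-wise product in the shared (o*k)-dim space
    z = joint.reshape(-1, k).sum(axis=1)     # sum-pool each group of k factors
    z = np.sign(z) * np.sqrt(np.abs(z))      # signed square-root normalization
    return z / (np.linalg.norm(z) + 1e-12)   # L2 normalization

rng = np.random.default_rng(0)
question = rng.normal(size=8)                # e.g. an encoded question (assumed size)
image = rng.normal(size=5)                   # e.g. an encoded image region (assumed size)
U = rng.normal(size=(8, 4 * 3))              # placeholder weights: o = 4 outputs, k = 3 factors
V = rng.normal(size=(5, 4 * 3))
fused = factorized_bilinear_pool(question, image, U, V, k=3)
```

The same routine could in principle be applied to question-image and question-text pairs separately, mirroring the two fusion paths mentioned in the abstract.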

Multi-Scale Relational Reasoning with Regional Attention for Visual Question Answering

Yuntao Ma, Yirui Wu, Tong Lu

Auto-TLDR; Question-Guided Relational Reasoning for Visual Question Answering

Multi-Modal Contextual Graph Neural Network for Text Visual Question Answering

A Novel Attention-Based Aggregation Function to Combine Vision and Language

Answer-Checking in Context: A Multi-Modal Fully Attention Network for Visual Question Answering

Integrating Historical States and Co-Attention Mechanism for Visual Dialog

Improving Visual Question Answering Using Active Perception on Static Images

P ≈ NP, at Least in Visual Question Answering

MAGNet: Multi-Region Attention-Assisted Grounding of Natural Language Queries at Phrase Level

Visual Style Extraction from Chart Images for Chart Restyling

SIMCO: SIMilarity-Based Object COunting

Transformer Reasoning Network for Image-Text Matching and Retrieval

A Novel Region of Interest Extraction Layer for Instance Segmentation

Two-Level Attention-Based Fusion Learning for RGB-D Face Recognition

PrivAttNet: Predicting Privacy Risks in Images Using Visual Attention

Efficient-Receptive Field Block with Group Spatial Attention Mechanism for Object Detection

More Correlations Better Performance: Fully Associative Networks for Multi-Label Image Classification

Context for Object Detection Via Lightweight Global and Mid-Level Representations

Detective: An Attentive Recurrent Model for Sparse Object Detection

Using Scene Graphs for Detecting Visual Relationships

Dual-Attention Guided Dropblock Module for Weakly Supervised Object Localization

Collaborative Human Machine Attention Module for Character Recognition

Graph Discovery for Visual Test Generation

VSR++: Improving Visual Semantic Reasoning for Fine-Grained Image-Text Matching

Utilising Visual Attention Cues for Vehicle Detection and Tracking

Self-Selective Context for Interaction Recognition

Improving Visual Relation Detection Using Depth Maps

From Early Biological Models to CNNs: Do They Look Where Humans Look?

Superpixel-Based Refinement for Object Proposal Generation

Semantics to Space(S2S): Embedding Semantics into Spatial Space for Zero-Shot Verb-Object Query Inferencing

SFPN: Semantic Feature Pyramid Network for Object Detection

Adaptive Word Embedding Module for Semantic Reasoning in Large-Scale Detection

Attentive Visual Semantic Specialized Network for Video Captioning

Dynamic Guided Network for Monocular Depth Estimation

Object Detection Using Dual Graph Network

Information Graphic Summarization Using a Collection of Multimodal Deep Neural Networks

StrongPose: Bottom-up and Strong Keypoint Heat Map Based Pose Estimation

Detecting Objects with High Object Region Percentage

Global-Local Attention Network for Semantic Segmentation in Aerial Images

Flow-Guided Spatial Attention Tracking for Egocentric Activity Recognition

Human-Centric Parsing Network for Human-Object Interaction Detection

A Generalizable Saliency Map-Based Interpretation of Model Outcome

FastSal: A Computationally Efficient Network for Visual Saliency Prediction

Exploring and Exploiting the Hierarchical Structure of a Scene for Scene Graph Generation

Nighttime Pedestrian Detection Based on Feature Attention and Transformation

Simple Multi-Resolution Representation Learning for Human Pose Estimation

An Improved Bilinear Pooling Method for Image-Based Action Recognition