ICPR2020 Paper Browser

Paper download is intended for registered attendees only, and is subject to the IEEE Copyright Policy. Any other use is strictly forbidden.

Multi-Stage Attention Based Visual Question Answering

Aakansha Mishra, Ashish Anand, Prithwijit Guha

Auto-TLDR; Alternative Bi-directional Attention for Visual Question Answering

Recent developments in Visual Question Answering (VQA) have brought promising performance gains, largely through attention-based networks. Most such approaches use unidirectional attention, in which the textual domain (the question) guides attention over the visual space, and they concentrate mostly on learning high-quality attention in the visual space. In contrast, this work proposes an alternating bi-directional attention framework. First, question-to-image attention helps to learn a robust visual embedding; second, image-to-question attention helps to improve the question embedding. The mechanism is applied in an alternating fashion, i.e., question-to-image followed by image-to-question, and is repeated to maximize performance. We believe that this alternating attention process benefits both modalities and leads to better representations for the VQA task. The proposal is benchmarked on the TDIUC dataset against state-of-the-art approaches. Our ablation analysis shows that alternating attention is the key to achieving high performance in VQA.
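
The alternating scheme described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the dot-product scoring, the mean-pooled initial question vector, the function names (attend, alternating_attention) and the num_stages parameter are all assumptions made for illustration, and the toy feature vectors are invented.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(query, items):
    # Assumed scoring: plain dot product between the query and each item.
    weights = softmax([dot(query, item) for item in items])
    dim = len(items[0])
    pooled = [sum(w * item[d] for w, item in zip(weights, items))
              for d in range(dim)]
    return weights, pooled

def mean(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def alternating_attention(region_feats, word_feats, num_stages=2):
    # Hypothetical helper: start from an average of the word features.
    q_vec = mean(word_feats)
    v_vec, img_w, word_w = None, None, None
    for _ in range(num_stages):
        # Question-to-image: the question summary attends over image regions.
        img_w, v_vec = attend(q_vec, region_feats)
        # Image-to-question: the attended image vector attends over words.
        word_w, q_vec = attend(v_vec, word_feats)
    return v_vec, q_vec, img_w, word_w

# Toy inputs (invented): three image regions and two question words, 2-D each.
regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
words = [[1.0, 0.0], [0.8, 0.2]]
v, q, img_weights, word_weights = alternating_attention(regions, words)
```

Each stage refines one modality using the current summary of the other, so the image embedding and the question embedding are updated in turn rather than only the image side.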

Similar papers

Dual Path Multi-Modal High-Order Features for Textual Content Based Visual Question Answering

Yanan Li, Yuetan Lin, Hongrui Zhao, Donghui Wang

Auto-TLDR; TextVQA: An End-to-End Visual Question Answering Model for Text-Based VQA

As a typical cross-modal problem, visual question answering (VQA) has received increasing attention from the computer vision and natural language processing communities. Reading and reasoning about text and visual content in images is a burgeoning and important research topic in VQA, especially for applications that assist visually impaired users. Given an image, the task is to predict an answer to a natural language question closely related to the image's textual content. In this paper, we propose a novel end-to-end textual content based VQA model, which grounds question answering on both visual and textual information. After encoding the image, question and recognized text words, it uses multi-modal factorized high-order modules and the attention mechanism to fuse question-image and question-text features respectively. The complex correlations among different features can thus be captured efficiently. To ensure the model's extensibility, it embeds candidate answers and recognized texts in a semantic embedding space and adopts a semantic-embedding-based classifier to perform answer prediction. Extensive experiments on the recently proposed benchmark TextVQA demonstrate that the proposed model achieves promising results.
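
The idea of predicting an answer in a semantic embedding space can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not specify the similarity measure, so cosine similarity is an assumption, and the function name predict_answer, the candidate words and all vectors are hypothetical.

```python
import math

def cosine(u, v):
    # Cosine similarity; returns 0.0 if either vector has zero norm.
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def predict_answer(fused_feature, candidate_embeddings):
    # Score every candidate (fixed-vocabulary answers and recognized text
    # words alike) against the fused feature, then pick the closest one.
    scores = {answer: cosine(fused_feature, emb)
              for answer, emb in candidate_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Invented 3-D embeddings for three candidate answers.
candidates = {
    "stop": [0.9, 0.1, 0.0],
    "exit": [0.1, 0.9, 0.0],
    "coca-cola": [0.0, 0.2, 0.9],
}
fused = [0.8, 0.2, 0.1]  # stands in for the fused question-image-text feature
best_answer, answer_scores = predict_answer(fused, candidates)
```

Because candidates are compared by embedding rather than by a fixed output index, new words recognized in an image can be added to the candidate set without changing the classifier, which is one way to read the extensibility claim.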

Question-Agnostic Attention for Visual Question Answering

Moshiur R Farazi, Salman Hameed Khan, Nick Barnes

Auto-TLDR; Question-Agnostic Attention for Visual Question Answering

A Novel Attention-Based Aggregation Function to Combine Vision and Language

Integrating Historical States and Co-Attention Mechanism for Visual Dialog

Multi-Modal Contextual Graph Neural Network for Text Visual Question Answering

Multi-Scale Relational Reasoning with Regional Attention for Visual Question Answering

Answer-Checking in Context: A Multi-Modal Fully Attention Network for Visual Question Answering

Transformer Reasoning Network for Image-Text Matching and Retrieval

Improving Visual Question Answering Using Active Perception on Static Images

MAGNet: Multi-Region Attention-Assisted Grounding of Natural Language Queries at Phrase Level

P ≈ NP, at Least in Visual Question Answering

Two-Level Attention-Based Fusion Learning for RGB-D Face Recognition

Visual Style Extraction from Chart Images for Chart Restyling

Enhanced User Interest and Expertise Modeling for Expert Recommendation

VSR++: Improving Visual Semantic Reasoning for Fine-Grained Image-Text Matching

PrivAttNet: Predicting Privacy Risks in Images Using Visual Attention

6D Pose Estimation with Correlation Fusion

More Correlations Better Performance: Fully Associative Networks for Multi-Label Image Classification

Visual Oriented Encoder: Integrating Multimodal and Multi-Scale Contexts for Video Captioning

Using Scene Graphs for Detecting Visual Relationships

Beyond the Deep Metric Learning: Enhance the Cross-Modal Matching with Adversarial Discriminative Domain Regularization

Context Visual Information-Based Deliberation Network for Video Captioning

Attentive Visual Semantic Specialized Network for Video Captioning

Improving Visual Relation Detection Using Depth Maps

Flow-Guided Spatial Attention Tracking for Egocentric Activity Recognition

Information Graphic Summarization Using a Collection of Multimodal Deep Neural Networks

Enriching Video Captions with Contextual Text

Graph Discovery for Visual Test Generation

Zero-Shot Text Classification with Semantically Extended Graph Convolutional Network

Multi-Scale 2D Representation Learning for Weakly-Supervised Moment Retrieval

Exploring and Exploiting the Hierarchical Structure of a Scene for Scene Graph Generation

PICK: Processing Key Information Extraction from Documents Using Improved Graph Learning-Convolutional Networks

Improving Word Recognition Using Multiple Hypotheses and Deep Embeddings

Region and Relations Based Multi Attention Network for Graph Classification

GCNs-Based Context-Aware Short Text Similarity Model

PIN: A Novel Parallel Interactive Network for Spoken Language Understanding

Adaptive Word Embedding Module for Semantic Reasoning in Large-Scale Detection

Transformer-Encoder Detector Module: Using Context to Improve Robustness to Adversarial Attacks on Object Detection

Semantics to Space(S2S): Embedding Semantics into Spatial Space for Zero-Shot Verb-Object Query Inferencing

ConvMath: A Convolutional Sequence Network for Mathematical Expression Recognition

Detective: An Attentive Recurrent Model for Sparse Object Detection

Tackling Contradiction Detection in German Using Machine Translation and End-To-End Recurrent Neural Networks

A CNN-RNN Framework for Image Annotation from Visual Cues and Social Network Metadata

SIMCO: SIMilarity-Based Object COunting

Object Detection Using Dual Graph Network

MEG: Multi-Evidence GNN for Multimodal Semantic Forensics

Text Synopsis Generation for Egocentric Videos

Dual-Attention Guided Dropblock Module for Weakly Supervised Object Localization