ICPR2020 Paper Browser

Paper download is intended for registered attendees only and is subject to the IEEE Copyright Policy. Any other use is strictly forbidden.

Answer-Checking in Context: A Multi-Modal Fully Attention Network for Visual Question Answering

Hantao Huang, Tao Han, Wei Han, Deep Yap, Cheng-Ming Chiang

Auto-TLDR; Fully Attention Based Visual Question Answering


Visual Question Answering (VQA) is challenging because of its complex cross-modality relations, and it has received extensive attention from the research community. From the human perspective, to answer a visual question, one needs to read the question and then refer to the image to generate an answer. Such an answer is then checked against the question and image again for final confirmation. In this paper, we mimic this process and propose a fully attention-based VQA architecture. Moreover, an answer-checking module is proposed that performs unified attention on the joint answer, question, and image representation to update the answer. This mimics the human answer-checking process, which considers the answer in context. With answer-checking modules and transferred BERT layers, our model achieves a state-of-the-art accuracy of 71.57% with fewer parameters on the VQA-v2.0 test-standard split.
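The answer-checking idea described in the abstract above can be illustrated with a minimal sketch. It assumes a single attention head with no learned projections, and it treats the answer, question words, and image regions as plain vectors in one shared space. The names `unified_attention` and `check_answer` and all values are hypothetical, chosen for illustration; this is not the authors' architecture.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unified_attention(tokens):
    """Scaled dot-product self-attention over one joint token sequence.

    Assumption: queries, keys, and values are the tokens themselves
    (no learned projection matrices), which keeps the sketch short.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in tokens])
        out.append([sum(w * v[i] for w, v in zip(weights, tokens))
                    for i in range(d)])
    return out

def check_answer(answer, question, image):
    """Update the answer vector by attending over answer, question, and image together."""
    joint = [answer] + question + image   # one sequence for all three inputs
    return unified_attention(joint)[0]    # position 0 holds the updated answer

# Toy inputs: one answer vector, two question tokens, one image region.
answer = [1.0, 0.0]
question = [[0.5, 0.5], [0.0, 1.0]]
image = [[1.0, 1.0]]
updated = check_answer(answer, question, image)
```

Because every token attends over the whole joint sequence, the updated answer is a weighted mix of the answer, question, and image vectors, which is the sense in which the answer is considered "in context" here.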

Similar papers

Dual Path Multi-Modal High-Order Features for Textual Content Based Visual Question Answering

Yanan Li, Yuetan Lin, Hongrui Zhao, Donghui Wang

Auto-TLDR; TextVQA: An End-to-End Visual Question Answering Model for Text-Based VQA


As a typical cross-modal problem, visual question answering (VQA) has received increasing attention from the computer vision and natural language processing communities. Reading and reasoning about text and visual content in images is a burgeoning and important research topic in VQA, especially for applications that assist the visually impaired. Given an image, the task is to predict an answer to a natural language question closely related to the image's textual content. In this paper, we propose a novel end-to-end textual-content-based VQA model, which grounds question answering on both visual and textual information. After encoding the image, question, and recognized text words, it uses multi-modal factorized high-order modules and the attention mechanism to fuse question-image and question-text features, respectively. The complex correlations among different features can be captured efficiently. To ensure the model's extensibility, it embeds candidate answers and recognized texts in a semantic embedding space and adopts a semantic-embedding-based classifier to perform answer prediction. Extensive experiments on the newly proposed TextVQA benchmark demonstrate that the proposed model can achieve promising results.
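One way to read the semantic-embedding-based classifier described above is as a similarity search: the fused feature is scored against the embedding of every candidate, where candidates include both a fixed answer vocabulary and the words recognized in the image. The sketch below assumes dot-product scoring and toy three-dimensional embeddings; the function `predict` and all values are hypothetical and not taken from the paper.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def predict(fused, candidates):
    """Return the candidate whose embedding scores highest against the fused feature."""
    return max(candidates, key=lambda name: dot(fused, candidates[name]))

# Fixed answer vocabulary (toy embeddings, assumed for illustration).
vocab = {"yes": [1.0, 0.0, 0.0], "no": [0.0, 1.0, 0.0]}
# Words recognized in the image, embedded in the same space.
ocr_tokens = {"STOP": [0.0, 0.0, 1.0]}

candidates = {**vocab, **ocr_tokens}
fused = [0.1, 0.2, 0.9]   # stand-in for the fused question-image-text feature
prediction = predict(fused, candidates)   # "STOP" for these toy values
```

Scoring in an embedding space means a word read from a new image can be selected as an answer without retraining an output layer for it, which is one plausible reading of the extensibility claim.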

A Novel Attention-Based Aggregation Function to Combine Vision and Language

Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

Auto-TLDR; Fully-Attentive Reduction for Vision and Language


Similar papers

Dual Path Multi-Modal High-Order Features for Textual Content Based Visual Question Answering

A Novel Attention-Based Aggregation Function to Combine Vision and Language

Multi-Stage Attention Based Visual Question Answering

Multi-Modal Contextual Graph Neural Network for Text Visual Question Answering

Question-Agnostic Attention for Visual Question Answering

Integrating Historical States and Co-Attention Mechanism for Visual Dialog

Multi-Scale Relational Reasoning with Regional Attention for Visual Question Answering

Transformer Reasoning Network for Image-Text Matching and Retrieval

P ≈ NP, at Least in Visual Question Answering

Visual Style Extraction from Chart Images for Chart Restyling

MAGNet: Multi-Region Attention-Assisted Grounding of Natural Language Queries at Phrase Level

VSR++: Improving Visual Semantic Reasoning for Fine-Grained Image-Text Matching

Improving Visual Question Answering Using Active Perception on Static Images

Two-Level Attention-Based Fusion Learning for RGB-D Face Recognition

Evaluation of BERT and ALBERT Sentence Embedding Performance on Downstream NLP Tasks

KoreALBERT: Pretraining a Lite BERT Model for Korean Language Understanding

CKG: Dynamic Representation Based on Context and Knowledge Graph

Automatic Student Network Search for Knowledge Distillation

GCNs-Based Context-Aware Short Text Similarity Model

Attentive Visual Semantic Specialized Network for Video Captioning

Transformer Networks for Trajectory Forecasting

More Correlations Better Performance: Fully Associative Networks for Multi-Label Image Classification

6D Pose Estimation with Correlation Fusion

PIN: A Novel Parallel Interactive Network for Spoken Language Understanding

Tackling Contradiction Detection in German Using Machine Translation and End-To-End Recurrent Neural Networks

Attention Pyramid Module for Scene Recognition

Object Detection Using Dual Graph Network

Context Matters: Self-Attention for Sign Language Recognition

Visual Oriented Encoder: Integrating Multimodal and Multi-Scale Contexts for Video Captioning

VTT: Long-Term Visual Tracking with Transformers

Zero-Shot Text Classification with Semantically Extended Graph Convolutional Network

Global Context-Based Network with Transformer for Image2latex

End-To-End Hierarchical Relation Extraction for Generic Form Understanding

Adversarial Training for Aspect-Based Sentiment Analysis with BERT

Context Aware Group Activity Recognition

A Multi-Head Self-Relation Network for Scene Text Recognition

MA-LSTM: A Multi-Attention Based LSTM for Complex Pattern Extraction

Predicting Chemical Properties Using Self-Attention Multi-Task Learning Based on SMILES Representation

ConvMath: A Convolutional Sequence Network for Mathematical Expression Recognition

Explore and Explain: Self-Supervised Navigation and Recounting

Information Graphic Summarization Using a Collection of Multimodal Deep Neural Networks

Enriching Video Captions with Contextual Text

Exploring and Exploiting the Hierarchical Structure of a Scene for Scene Graph Generation

Graph Discovery for Visual Test Generation

Multi-Scale 2D Representation Learning for Weakly-Supervised Moment Retrieval

Gaussian Constrained Attention Network for Scene Text Recognition

Reinforcement Learning with Dual Attention Guided Graph Convolution for Relation Extraction

ACRM: Attention Cascade R-CNN with Mix-NMS for Metallic Surface Defect Detection