P ≈ NP, at Least in Visual Question Answering

Shailza Jolly, Sebastian Palacio, Joachim Folz, Federico Raue, Jörn Hees, Andreas Dengel

Auto-TLDR; Polar vs Non-Polar VQA: A Cross-over Analysis of Feature Spaces for Joint Training

In recent years, progress in the Visual Question Answering (VQA) field has largely been driven by public challenges and large datasets. One of the most widely used of these is the VQA 2.0 dataset, consisting of polar ("yes/no") and non-polar questions. Looking at the question distribution over all answers, we find that the answers "yes" and "no" account for 38% of the questions, while the remaining 62% are spread over more than 3000 other answers. While several sources of bias have already been investigated in the field, the effects of such an over-representation of polar vs. non-polar questions remain unclear. In this paper, we measure the potential confounding factors when polar and non-polar samples are used jointly to train a baseline VQA classifier, and compare this setting to an upper bound where the over-representation of polar questions is excluded from the training. Further, we perform cross-over experiments to analyze how well the feature spaces align. Contrary to expectations, we find no evidence of counterproductive effects in the joint training of unbalanced classes. In fact, by exploring the intermediate feature space of visual-text embeddings, we find that the feature space of polar questions already encodes sufficient structure to answer many non-polar questions. Our results indicate that the polar (P) and the non-polar (NP) feature spaces are strongly aligned, hence the expression P ≈ NP.
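
As a concrete reference point, the polar/non-polar split reported above can be recomputed directly from the VQA 2.0 annotation files. The sketch below assumes the standard VQA v2 annotation JSON layout (an "annotations" list whose entries carry a "multiple_choice_answer" field); the file path is only an example.

```python
import json
from collections import Counter

def polar_share(annotation_file):
    """Estimate the share of polar ("yes"/"no") answers in a VQA v2 annotation file.
    Assumes the standard layout: {"annotations": [{"multiple_choice_answer": ...}, ...]}."""
    with open(annotation_file) as f:
        annotations = json.load(f)["annotations"]

    counts = Counter(a["multiple_choice_answer"].strip().lower() for a in annotations)
    total = sum(counts.values())
    polar = counts["yes"] + counts["no"]
    return polar / total, len(counts)

# Example (path is illustrative):
# share, n_answers = polar_share("v2_mscoco_train2014_annotations.json")
# print(f"polar share: {share:.1%} over {n_answers} distinct answers")
```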

Similar papers

Dual Path Multi-Modal High-Order Features for Textual Content Based Visual Question Answering

Yanan Li, Yuetan Lin, Hongrui Zhao, Donghui Wang

Auto-TLDR; TextVQA: An End-to-End Visual Question Answering Model for Text-Based VQA

As a typical cross-modal problem, visual question answering (VQA) has received increasing attention from the communities of computer vision and natural language processing. Reading and reasoning about texts and visual contents in the images is a burgeoning and important research topic in VQA, especially for visually impaired assistance applications. Given an image, it aims to predict an answer to a provided natural language question closely related to its textual contents. In this paper, we propose a novel end-to-end textual content based VQA model, which grounds question answering both on the visual and textual information. After encoding the image, question and recognized text words, it uses multi-modal factorized high-order modules and the attention mechanism to fuse question-image and question-text features respectively. The complex correlations among different features can be captured efficiently. To ensure the model's extensibility, it embeds candidate answers and recognized texts in a semantic embedding space and adopts a semantic embedding based classifier to perform answer prediction. Extensive experiments on the newly proposed benchmark TextVQA demonstrate that the proposed model can achieve promising results.

Question-Agnostic Attention for Visual Question Answering

Moshiur R Farazi, Salman Hameed Khan, Nick Barnes

Auto-TLDR; Question-Agnostic Attention for Visual Question Answering

Visual Question Answering (VQA) models employ attention mechanisms to discover image locations that are most relevant for answering a specific question. For this purpose, several multimodal fusion strategies have been proposed, ranging from relatively simple operations (e.g., linear sum) to more complex ones (e.g., Block). The resulting multimodal representations define an intermediate feature space for capturing the interplay between visual and semantic features, which is helpful in selectively focusing on image content. In this paper, we propose a question-agnostic attention mechanism that is complementary to the existing question-dependent attention mechanisms. Our proposed model parses object instances to obtain an "object map" and applies this map on the visual features to generate Question-Agnostic Attention (QAA) features. In contrast to question-dependent attention approaches that are learned end-to-end, the proposed QAA does not involve question-specific training, and can be easily included in almost any existing VQA model as a generic light-weight pre-processing step, thereby adding minimal computation overhead for training. Further, when used in complement with the question-dependent attention, the QAA allows the model to focus on the regions containing objects that might have been overlooked by the learned attention representation. Through extensive evaluation on the VQAv1, VQAv2 and TDIUC datasets, we show that incorporating complementary QAA allows state-of-the-art VQA models to perform better, and provides a significant boost to simplistic VQA models, enabling them to perform on par with highly sophisticated fusion strategies.

A Novel Attention-Based Aggregation Function to Combine Vision and Language

Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

Auto-TLDR; Fully-Attentive Reduction for Vision and Language

The joint understanding of vision and language has been recently gaining a lot of attention in both the Computer Vision and Natural Language Processing communities, with the emergence of tasks such as image captioning, image-text matching, and visual question answering. As both images and text can be encoded as sets or sequences of elements - like regions and words - proper reduction functions are needed to transform a set of encoded elements into a single response, like a classification or similarity score. In this paper, we propose a novel fully-attentive reduction method for vision and language. Specifically, our approach computes a set of scores for each element of each modality employing a novel variant of cross-attention, and performs a learnable and cross-modal reduction, which can be used for both classification and ranking. We test our approach on image-text matching and visual question answering, building fair comparisons with other reduction choices, on both COCO and VQA 2.0 datasets. Experimentally, we demonstrate that our approach leads to a performance increase on both tasks. Further, we conduct ablation studies to validate the role of each component of the approach.
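
To make the notion of a learnable reduction concrete, the sketch below shows the simplest member of this family: each element of a set of features is scored by a small network, the scores are normalized with a softmax, and the elements are summed with these weights. It is a simplified stand-in for the paper's cross-attentive variant, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Minimal learnable reduction: score every element of a set with a small MLP,
    normalize the scores with a softmax, and return the weighted sum."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, x, mask=None):            # x: (batch, n_elements, dim)
        s = self.score(x).squeeze(-1)            # (batch, n_elements)
        if mask is not None:                     # mask out padded elements
            s = s.masked_fill(~mask, float("-inf"))
        w = torch.softmax(s, dim=-1)
        return (w.unsqueeze(-1) * x).sum(dim=1)  # (batch, dim)

# regions = torch.randn(8, 36, 512)   # e.g. 36 region features per image
# pooled = AttentivePooling(512)(regions)
```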

Multi-Stage Attention Based Visual Question Answering

Aakansha Mishra, Ashish Anand, Prithwijit Guha

Auto-TLDR; Alternative Bi-directional Attention for Visual Question Answering

Recent developments in the field of Visual Question Answering (VQA) have witnessed promising improvements in performance through contributions in attention-based networks. Most such approaches have focused on unidirectional attention that leverages attention from the textual domain (question) over the visual space, concentrating on learning high-quality attention in the visual space. In contrast, this work proposes an alternating bi-directional attention framework. First, a question-to-image attention helps to learn a robust visual space embedding, and second, an image-to-question attention helps to improve the question embedding. This attention mechanism is realized in an alternating fashion, i.e. question-to-image followed by image-to-question, and is repeated to maximize performance. We believe that this process of alternating attention generation helps both modalities and leads to better representations for the VQA task. The proposal is benchmarked on the TDIUC dataset against state-of-the-art approaches. Our ablation analysis shows that alternating attention is the key to achieving high performance in VQA.

Multi-Modal Contextual Graph Neural Network for Text Visual Question Answering

Yaoyuan Liang, Xin Wang, Xuguang Duan, Wenwu Zhu

Auto-TLDR; Multi-modal Contextual Graph Neural Network for Text Visual Question Answering

Text visual question answering (TextVQA) targets answering questions related to texts appearing in the given images, posing more challenges than VQA by requiring a deeper recognition and understanding of various shapes of human-readable scene texts as well as their meanings in different contexts. Existing works on TextVQA suffer from two weaknesses: i) scene texts and non-textual objects are processed separately and independently without considering their mutual interactions during the question understanding and answering process, ii) scene texts are encoded only through word embeddings without taking the corresponding visual appearance features as well as their potential relationships with other non-textual objects in the images into account. To overcome the weaknesses of existing works, we propose a novel multi-modal contextual graph neural network (MCG) model for TextVQA. The proposed MCG model can capture the relationships between visual features of scene texts and non-textual objects in the given images as well as utilize richer sources of multi-modal features to improve the model performance. In particular, we encode the scene texts into richer features containing textual, visual and positional features, then model the visual relations between scene texts and non-textual objects through a contextual graph neural network. Our extensive experiments on a real-world dataset demonstrate the advantages of the proposed MCG model over baseline approaches.

Multi-Scale Relational Reasoning with Regional Attention for Visual Question Answering

Yuntao Ma, Yirui Wu, Tong Lu

Auto-TLDR; Question-Guided Relational Reasoning for Visual Question Answering

The main challenges of visual question answering (VQA) lie in modeling an alignment between image and question to find informative regions in images that are related to the question, and in reasoning about relations among visual objects according to the question. In this paper, we propose question-guided relational reasoning in multiple scales for visual question answering, in which each region is enhanced by regional attention. Specifically, we present regional attention, which consists of a soft attention and a hard attention, to pick up informative regions of the image according to informativeness evaluations implemented by question-guided soft attention. Combinations of different informative regions are then concatenated with the question embedding at different scales to capture relational information. Relational reasoning can extract question-based relational information between regions, and the multi-scale mechanism gives it the ability to analyze diverse kinds of relationships and a sensitivity to numbers by modeling relationships at several scales. We conduct experiments to show that our proposed architecture is effective and achieves a new state-of-the-art on VQA v2.

Improving Visual Question Answering Using Active Perception on Static Images

Theodoros Bozinis, Nikolaos Passalis, Anastasios Tefas

Auto-TLDR; Fine-Grained Visual Question Answering with Reinforcement Learning-based Active Perception

Visual Question Answering (VQA) is one of the most challenging emerging applications of deep learning. Providing powerful attention mechanisms is crucial for VQA, since the model must correctly identify the region of an image that is relevant to the question at hand. However, existing models analyze the input images at a fixed and typically small resolution, often leading to discarding valuable fine-grained details. To overcome this limitation, in this work we propose a reinforcement learning-based active perception approach that works by applying a series of transformation operations on the images (translation, zoom) in order to facilitate answering the question at hand. This allows for performing fine-grained analysis, effectively increasing the resolution at which the models process information. The proposed method is orthogonal to existing attention mechanisms and it can be combined with most existing VQA methods. The effectiveness of the proposed method is experimentally demonstrated on a challenging VQA dataset.

Graph Discovery for Visual Test Generation

Neil Hallonquist, Laurent Younes, Donald Geman

Auto-TLDR; Visual Question Answering over Graphs: A Probabilistic Framework for VQA

We consider the problem of uncovering an unknown attributed graph, where both its edges and vertices are hidden from view, through a sequence of binary questions about it. In order to select questions efficiently, we define a probability distribution over graphs, with randomness not just over edges, but over vertices as well. We then sequentially select questions so as to: (1) minimize the expected entropy of the random graph, given the answers to the previous questions in the sequence; and (2) to instantiate the vertices that compose the graph. We propose some basic question spaces, from which to select questions, that vary in their capacity. We apply this framework to the problem of test generation in Visual Question Answering (VQA), where semantic questions are used to evaluate vision systems over rich image representations. To do this, we use a restricted question vocabulary, resulting in image representations that take the form of scene graphs; by defining a distribution over them, a consistent set of probabilities is associated with the questions, and used in their selection.
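
Criterion (1) can be illustrated on a toy discrete case in which the unknown graph is replaced by a finite set of hypotheses with a prior distribution: each binary question splits the hypotheses into a "yes" and a "no" part, and the next question is the one with the lowest expected posterior entropy. A minimal sketch of this greedy selection step (not the authors' implementation over scene graphs):

```python
import numpy as np

def expected_posterior_entropy(prior, answers):
    """prior: (n_hypotheses,) probabilities; answers: boolean vector giving each
    hypothesis' answer ("yes"=True) to one binary question. Returns the expected
    Shannon entropy of the posterior after observing the answer."""
    def H(p):
        p = p[p > 0]
        return -(p * np.log2(p)).sum()
    p_yes = prior[answers].sum()
    expected = 0.0
    for val, p_a in ((True, p_yes), (False, 1.0 - p_yes)):
        if p_a > 0:
            posterior = np.where(answers == val, prior, 0.0) / p_a
            expected += p_a * H(posterior)
    return expected

def select_question(prior, question_bank):
    """Greedy step: pick the question with the lowest expected posterior entropy.
    question_bank: (n_questions, n_hypotheses) boolean answer table."""
    scores = [expected_posterior_entropy(prior, q) for q in question_bank]
    return int(np.argmin(scores))

# prior = np.full(8, 1 / 8)              # 8 equally likely hypotheses
# bank = np.random.rand(20, 8) > 0.5     # 20 candidate binary questions
# best = select_question(prior, bank)
```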

Integrating Historical States and Co-Attention Mechanism for Visual Dialog

Tianling Jiang, Yi Ji, Chunping Liu

Auto-TLDR; Integrating Historical States and Co-attention for Visual Dialog

Visual dialog is a typical multi-modal task which involves both vision and language. Nowadays, it faces two major difficulties. In this paper, we propose Integrating Historical States and Co-attention (HSCA) for visual dialog to solve them. It includes two main modules, Co-ATT and MATCH. Specifically, the main purpose of the Co-ATT module is to guide the image with questions and answers in the early stage to get more specific objects. It tackles the temporal sequence issue in historical information which may influence the precise answer for multi-round questions. The MATCH module retrieves the best-matching historical information block for a question containing pronouns. It overcomes the visual reference problem, which requires resolving pronouns referring to unknowns in the text message and then locating the objects in the given image. We quantitatively and qualitatively evaluate our model on VisDial v1.0 and, at the same time, carry out ablation studies. The experimental results demonstrate that HSCA outperforms the state-of-the-art methods in many aspects.

Answer-Checking in Context: A Multi-Modal Fully Attention Network for Visual Question Answering

Hantao Huang, Tao Han, Wei Han, Deep Yap, Cheng-Ming Chiang

Auto-TLDR; Fully Attention Based Visual Question Answering

Visual Question Answering (VQA) is challenging due to the complex cross-modality relations. It has received extensive attention from the research community. From the human perspective, to answer a visual question, one needs to read the question and then refer to the image to generate an answer. Such an answer will then be checked against the question and image again for final confirmation. In this paper, we mimic this process and propose a fully attention based VQA architecture. Moreover, an answer-checking module is proposed to perform a unified attention on the joint answer, question and image representation to update the answer. This mimics the human answer-checking process of considering the answer in context. With answer-checking modules and transferred BERT layers, our model achieves a state-of-the-art accuracy of 71.57% using fewer parameters on the VQA-v2.0 test-standard split.

Transformer Reasoning Network for Image-Text Matching and Retrieval

Nicola Messina, Fabrizio Falchi, Andrea Esuli, Giuseppe Amato

Auto-TLDR; A Transformer Encoder Reasoning Network for Image-Text Matching in Large-Scale Information Retrieval

Image-text matching is an interesting and fascinating task in modern AI research. Despite the evolution of deep-learning-based image and text processing systems, multi-modal matching remains a challenging problem. In this work, we consider the problem of accurate image-text matching for the task of multi-modal large-scale information retrieval. State-of-the-art results in image-text matching are achieved by inter-playing image and text features from the two different processing pipelines, usually using mutual attention mechanisms. However, this invalidates any chance to extract separate visual and textual features needed for later indexing steps in large-scale retrieval systems. In this regard, we introduce the Transformer Encoder Reasoning Network (TERN), an architecture built upon one of the modern relationship-aware self-attentive architectures, the Transformer Encoder (TE). This architecture is able to separately reason on the two different modalities and to enforce a final common abstract concept space by sharing the weights of the deeper transformer layers. Thanks to this design, the implemented network is able to produce compact and very rich visual and textual features available for the successive indexing step. Experiments are conducted on the MS-COCO dataset, and we evaluate the results using a discounted cumulative gain metric with relevance computed exploiting caption similarities, in order to assess possibly non-exact but relevant search results. We demonstrate that on this metric we are able to achieve state-of-the-art results in the image retrieval task. Our code is freely available at https://github.com/mesnico/TERN.
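
For readers unfamiliar with the evaluation protocol, the sketch below shows a normalized discounted cumulative gain (NDCG) computation over graded relevance scores; in the paper the relevance of a retrieved item is derived from caption similarities, which the example simply assumes as given.

```python
import numpy as np

def dcg(relevances, k):
    """Discounted cumulative gain over the top-k retrieved items (retrieval order)."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    return float((rel * discounts).sum())

def ndcg(relevances, k):
    """Normalized DCG: the ideal ordering sorts the relevances in decreasing order."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

# Graded relevance of retrieved images for a text query, e.g. caption-similarity
# scores in [0, 1] listed in the order the system returned them:
# print(ndcg([0.9, 0.2, 0.7, 0.0, 0.4], k=5))
```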

Visual Style Extraction from Chart Images for Chart Restyling

Danqing Huang, Jinpeng Wang, Guoxin Wang, Chin-Yew Lin

Auto-TLDR; Exploiting Visual Properties from Reference Chart Images for Chart Restyling

Creating a good-looking chart for better visualization is time-consuming. There are plenty of well-designed charts on the Web, which are ideal references for imitating chart style. However, since reference charts are stored as bitmap images, their style settings are hard for machines to interpret and thus difficult to apply directly. In this paper, we extract visual properties from reference chart images as style templates to restyle charts. We first construct a large-scale dataset of 187,059 chart images from real-world data, labeled with predefined visual property values. Then we introduce an end-to-end learning network to extract the properties based on two image-encoding approaches. Furthermore, in order to capture spatial relationships of chart objects, which are crucial in solving the task, we propose a novel positional encoding method to integrate clues of relative positions between objects. Experimental results show that our model significantly outperforms baseline models. By adding positional features, our model achieves better performance. Finally, we present the application for chart restyling based on our model.

Beyond the Deep Metric Learning: Enhance the Cross-Modal Matching with Adversarial Discriminative Domain Regularization

Li Ren, Kai Li, Liqiang Wang, Kien Hua

Auto-TLDR; Adversarial Discriminative Domain Regularization for Efficient Cross-Modal Matching

Matching information across image and text modalities is a fundamental challenge for many applications that involve both vision and natural language processing. The objective is to find efficient similarity metrics to compare the similarity between visual and textual information. Existing approaches mainly match the local visual objects and the sentence words in a shared space with attention mechanisms. The matching performance is still limited because the similarity computation is based on simple comparisons of the matching features, ignoring the characteristics of their distribution in the data. In this paper, we address this limitation with an efficient learning objective that considers the discriminative feature distributions between the visual objects and sentence words. Specifically, we propose a novel Adversarial Discriminative Domain Regularization (ADDR) learning framework, beyond the conventional metric learning objective, to construct a set of discriminative data domains within each image-text pair. Our approach can generally improve the learning efficiency and the performance of existing metric learning frameworks by regulating the distribution of the hidden space between the matching pairs. The experimental results show that this new approach significantly improves the overall performance of several popular cross-modal matching techniques (SCAN, VSRN, BFAN) on the MS-COCO and Flickr30K benchmarks.

Information Graphic Summarization Using a Collection of Multimodal Deep Neural Networks

Edward Kim, Connor Onweller, Kathleen F. Mccoy

Auto-TLDR; A multimodal deep learning framework that can generate summarization text supporting the main idea of an information graphic for presentation to blind or visually impaired

We present a multimodal deep learning framework that can generate summarization text supporting the main idea of an information graphic for presentation to a person who is blind or visually impaired. The framework utilizes the visual, textual, positional, and size characteristics extracted from the image to create the summary. Different and complementary neural architectures are optimized for each task using crowdsourced training data. From our quantitative experiments and results, we explain the reasoning behind our framework and show the effectiveness of our models. Our qualitative results showcase text generated from our framework and show that Mechanical Turk participants favor them over other automatic and human-generated summarizations. We describe the design of an experiment to evaluate the utility of our system for people who have visual impairments in the context of understanding Twitter Tweets containing line graphs.

Contextual Classification Using Self-Supervised Auxiliary Models for Deep Neural Networks

Sebastian Palacio, Philipp Engler, Jörn Hees, Andreas Dengel

Auto-TLDR; Self-Supervised Autogenous Learning for Deep Neural Networks

Classification problems solved with deep neural networks (DNNs) typically rely on a closed world paradigm and optimize over a single objective (e.g., minimization of the cross-entropy loss). This setup dismisses all kinds of supporting signals that can be used to reinforce the existence or absence of particular patterns. The increasing need for models that are interpretable by design makes the inclusion of said contextual signals a crucial necessity. To this end, we introduce the notion of Self-Supervised Autogenous Learning (SSAL). An SSAL objective is realized through one or more additional targets that are derived from the original supervised classification task, following architectural principles found in multi-task learning. SSAL branches impose low-level priors into the optimization process (e.g., grouping). The ability to use SSAL branches during inference allows models to converge faster, focusing on a richer set of class-relevant features. We equip state-of-the-art DNNs with SSAL objectives and report consistent improvements for all of them on CIFAR100 and ImageNet. We show that SSAL models outperform similar state-of-the-art methods focused on contextual loss functions, auxiliary branches and hierarchical priors.
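
One simple way to realize a target derived from the original supervised classification task is a coarse grouping of the labels (e.g. superclasses), trained as an auxiliary branch in multi-task fashion. The sketch below illustrates only this general idea; the grouping, branch placement and loss weighting are assumptions, not the authors' exact SSAL design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxiliaryGroupingHead(nn.Module):
    """Toy auxiliary branch in the spirit of SSAL: a second classifier is trained
    on a coarse grouping of the original labels, a prior derived from the
    supervised task itself (illustrative sketch, not the paper's branch design)."""
    def __init__(self, feat_dim, n_fine, n_coarse, fine_to_coarse):
        super().__init__()
        self.fine = nn.Linear(feat_dim, n_fine)
        self.coarse = nn.Linear(feat_dim, n_coarse)
        # mapping from fine label id -> coarse group id (e.g. CIFAR-100 superclasses)
        self.register_buffer("f2c", torch.as_tensor(fine_to_coarse))

    def forward(self, features, fine_targets):
        loss_fine = F.cross_entropy(self.fine(features), fine_targets)
        loss_coarse = F.cross_entropy(self.coarse(features), self.f2c[fine_targets])
        return loss_fine + 0.5 * loss_coarse   # the weighting is a free hyper-parameter
```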

SIMCO: SIMilarity-Based Object COunting

Marco Godi, Christian Joppi, Andrea Giachetti, Marco Cristani

Auto-TLDR; SIMCO: An Unsupervised Multi-class Object Counting Approach on InShape

We present SIMCO, a completely agnostic multi-class object counting approach. SIMCO starts by detecting foreground objects through a novel Mask RCNN-based architecture trained beforehand (just once) on a brand-new synthetic 2D shape dataset, InShape; the idea is to highlight every object resembling a primitive 2D shape (circle, square, rectangle, etc.). Each object detected is described by a low-dimensional embedding, obtained from a novel similarity-based head branch; this latter implements a triplet loss, encouraging similar objects (same 2D shape + color and scale) to map close. Subsequently, SIMCO uses this embedding for clustering, so that different 'classes' of similar objects can emerge and be counted, making SIMCO the very first multi-class unsupervised counter. The only required assumption is that repeated objects are present in the image. Experiments show that SIMCO provides state-of-the-art scores on counting benchmarks and that it can also help in many challenging image understanding tasks.

Making Every Label Count: Handling Semantic Imprecision by Integrating Domain Knowledge

Clemens-Alexander Brust, Björn Barz, Joachim Denzler

Auto-TLDR; Class Hierarchies for Imprecise Label Learning and Annotation eXtrapolation

Noisy data, crawled from the web or supplied by volunteers such as Mechanical Turkers or citizen scientists, is considered an alternative to professionally labeled data. There has been research focused on mitigating the effects of label noise. It is typically modeled as inaccuracy, where the correct label is replaced by an incorrect label from the same set. We consider an additional dimension of label noise: imprecision. For example, a non-breeding snow bunting is labeled as a bird. This label is correct, but not as precise as the task requires. Standard softmax classifiers cannot learn from such a weak label because they consider all classes mutually exclusive, which non-breeding snow bunting and bird are not. We propose CHILLAX (Class Hierarchies for Imprecise Label Learning and Annotation eXtrapolation), a method based on hierarchical classification, to fully utilize labels of any precision. Experiments on noisy variants of NABirds and ILSVRC2012 show that our method outperforms strong baselines by as much as 16.4 percentage points, and the current state of the art by up to 3.9 percentage points.
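
The core idea of learning from an imprecise label can be illustrated with a simple loss: if the annotation is an inner node of the hierarchy (e.g. "bird"), the model is asked to put probability mass on the set of leaf classes below that node rather than on a single class. The sketch below conveys this general principle only and is not the exact CHILLAX formulation.

```python
import torch
import torch.nn.functional as F

def imprecise_label_nll(logits, leaf_sets):
    """Illustrative loss for imprecise labels: each label is a *set* of admissible
    leaf classes (all leaves under the annotated node, e.g. every bird species for
    the label "bird"); we maximize the total probability mass on that set."""
    log_p = F.log_softmax(logits, dim=-1)                 # (batch, n_leaves)
    losses = []
    for lp, leaves in zip(log_p, leaf_sets):              # leaves: LongTensor of class ids
        losses.append(-torch.logsumexp(lp[leaves], dim=0))  # -log sum_{c in set} p(c)
    return torch.stack(losses).mean()

# logits = model(images)   # hypothetical classifier over leaf classes
# loss = imprecise_label_nll(logits, [torch.tensor([3, 7, 12]), torch.tensor([5])])
```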

SAILenv: Learning in Virtual Visual Environments Made Simple

Enrico Meloni, Luca Pasqualini, Matteo Tiezzi, Marco Gori, Stefano Melacci

Auto-TLDR; SAILenv: A Simple and Customized Platform for Visual Recognition in Virtual 3D Environment

Recently, researchers in Machine Learning algorithms, Computer Vision scientists, engineers and others have shown a growing interest in 3D simulators as a means to artificially create experimental settings that are very close to those in the real world. However, most of the existing platforms to interface algorithms with 3D environments are often designed to set up navigation-related experiments, to study physical interactions, or to handle ad-hoc cases that are not thought to be customized, sometimes lacking a strong photorealistic appearance and an easy-to-use software interface. In this paper, we present a novel platform, SAILenv, that is specifically designed to be simple and customizable, and that allows researchers to experiment with visual recognition in virtual 3D scenes. A few lines of code are needed to interface every algorithm with the virtual world, and non-3D-graphics experts can easily customize the 3D environment itself, exploiting a collection of photorealistic objects. Our framework yields pixel-level semantic and instance labeling and depth, and, to the best of our knowledge, it is the only one that provides motion-related information directly inherited from the 3D engine. The client-server communication operates at a low level, avoiding the overhead of HTTP-based data exchanges. We perform experiments using a state-of-the-art object detector trained on real-world images, showing that it is able to recognize the photorealistic 3D objects of our environment. The computational burden of the optical flow compares favourably with the estimation performed using modern GPU-based convolutional networks or more classic implementations. We believe that the scientific community will benefit from the ease of use and high quality of our framework to evaluate newly proposed algorithms in their own customized realistic conditions.

Confidence Calibration for Deep Renal Biopsy Immunofluorescence Image Classification

Federico Pollastri, Juan Maroñas, Federico Bolelli, Giulia Ligabue, Roberto Paredes, Riccardo Magistroni, Costantino Grana

Auto-TLDR; A Probabilistic Convolutional Neural Network for Immunofluorescence Classification in Renal Biopsy

With this work we tackle immunofluorescence classification in renal biopsy, employing state-of-the-art Convolutional Neural Networks. In this setting, the aim of the probabilistic model is to assist an expert practitioner towards identifying the location pattern of antibody deposits within a glomerulus. Since modern neural networks often provide overconfident outputs, we stress the importance of having a reliable prediction, demonstrating that Temperature Scaling, a recently introduced re-calibration technique, can be successfully applied to immunofluorescence classification in renal biopsy. Experimental results demonstrate that the designed model yields good accuracy on the specific task, and that Temperature Scaling is able to provide reliable probabilities, which are highly valuable for such a task given the low inter-rater agreement.
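
Temperature Scaling itself is a one-parameter post-hoc recalibration: a single scalar T is fitted on held-out validation logits by minimizing the negative log-likelihood, and test probabilities are computed as softmax(logits / T). A minimal PyTorch sketch (optimizer settings are placeholders):

```python
import torch
import torch.nn as nn

def fit_temperature(logits, labels, max_iter=200):
    """Fit a single temperature T on validation logits/labels by minimizing the NLL
    of softmax(logits / T). Optimizes log T so that T stays positive."""
    log_t = torch.zeros(1, requires_grad=True)
    opt = torch.optim.LBFGS([log_t], lr=0.1, max_iter=max_iter)
    nll = nn.CrossEntropyLoss()

    def closure():
        opt.zero_grad()
        loss = nll(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    opt.step(closure)
    return log_t.exp().item()

# T = fit_temperature(val_logits, val_labels)                 # held-out validation set
# calibrated_probs = torch.softmax(test_logits / T, dim=-1)   # calibrated predictions
```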

Spatial Bias in Vision-Based Voice Activity Detection

Kalin Stefanov, Mohammad Adiban, Giampiero Salvi

Auto-TLDR; Spatial Bias in Vision-based Voice Activity Detection in Multiparty Human-Human Interactions

We present models for automatic vision-based voice activity detection (VAD) in multiparty human-human interactions that are aimed at complementing acoustic VAD methods. We provide evidence that this type of vision-based VAD model is susceptible to spatial bias in the datasets. The physical setting of the interaction, usually constant throughout data acquisition, determines the distribution of head poses of the participants. Our results show that when the head pose distributions are significantly different in the training and test sets, the performance of the models drops significantly. This suggests that previously reported results on datasets with a fixed physical configuration may overestimate the generalization capabilities of this type of model. We also propose a number of possible remedies to the spatial bias, including data augmentation, input masking and dynamic features, and provide an in-depth analysis of the visual cues used by our models.

Towards Tackling Multi-Label Imbalances in Remote Sensing Imagery

Dominik Koßmann, Thorsten Wilhelm, Gernot Fink

Auto-TLDR; Class imbalance in land cover datasets using attribute encoding schemes

Recent advances in automated image analysis have led to an increased number of proposed datasets in remote sensing applications. This permits the successful employment of data-hungry state-of-the-art deep neural networks. However, the Earth is not covered equally by semantically meaningful classes. Thus, many land cover datasets suffer from a severe class imbalance. We show that by taking appropriate measures, the performance in the minority classes can be improved by up to 30 percent without strongly affecting the performance in the majority classes. Additionally, we investigate the use of an attribute encoding scheme to represent the inherent class hierarchies commonly observed in land cover analysis.

Improving Model Accuracy for Imbalanced Image Classification Tasks by Adding a Final Batch Normalization Layer: An Empirical Study

Veysel Kocaman, Ofer M. Shir, Thomas Baeck

Auto-TLDR; Exploiting Batch Normalization before the Output Layer in Deep Learning for Minority Class Detection in Imbalanced Data Sets

Some real-world domains, such as Agriculture and Healthcare, comprise early-stage disease indications whose recording constitutes a rare event, and yet, whose precise detection at that stage is critical. In this type of highly imbalanced classification problems, which encompass complex features, deep learning (DL) is much needed because of its strong detection capabilities. At the same time, DL is observed in practice to favor majority over minority classes and consequently suffer from inaccurate detection of the targeted early-stage indications. To simulate such scenarios, we artificially generate skewness (99% vs. 1%) for certain plant types out of the PlantVillage dataset as a basis for classification of scarce visual cues through transfer learning. By randomly and unevenly picking healthy and unhealthy samples from certain plant types to form a training set, we consider a base experiment as fine-tuning ResNet34 and VGG19 architectures and then testing the model performance on a balanced dataset of healthy and unhealthy images. We empirically observe that the initial F1 test score jumps from 0.29 to 0.95 for the minority class upon adding a final Batch Normalization (BN) layer just before the output layer in VGG19. We demonstrate that utilizing an additional BN layer before the output layer in modern CNN architectures has a considerable impact in terms of minimizing the training time and testing error for minority classes in highly imbalanced data sets. Moreover, when the final BN is employed, trying to minimize validation and training losses may not be an optimal way for getting a high F1 test score for minority classes in anomaly detection problems. That is, the network might perform better even if it is not ‘confident’ enough while making a prediction; leading to another discussion about why softmax output is not a good uncertainty measure for DL models.
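
The architectural change itself is small. The sketch below shows one way to insert a BatchNorm layer immediately before the output linear layer of a torchvision VGG19; the weight choice and the two-class head are illustrative assumptions, and the training protocol (skewed sampling, fine-tuning schedule) follows the paper.

```python
import torch.nn as nn
from torchvision import models

def vgg19_with_final_bn(num_classes):
    """Insert a BatchNorm layer right before the final (output) linear layer of a
    torchvision VGG19 classifier head. Sketch only; the string weight argument
    requires a recent torchvision (>= 0.13)."""
    model = models.vgg19(weights="IMAGENET1K_V1")
    in_features = model.classifier[-1].in_features      # 4096 in VGG19
    model.classifier[-1] = nn.Sequential(
        nn.BatchNorm1d(in_features),                     # the added "final BN"
        nn.Linear(in_features, num_classes),
    )
    return model

# model = vgg19_with_final_bn(num_classes=2)   # e.g. healthy vs. unhealthy leaves
```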

An Evaluation of DNN Architectures for Page Segmentation of Historical Newspapers

Manuel Burghardt, Bernhard Liebl

Auto-TLDR; Evaluation of Backbone Architectures for Optical Character Segmentation of Historical Documents

One important and particularly challenging step in the optical character recognition of historical documents with complex layouts, such as newspapers, is the separation of text from non-text content (e.g. page borders or illustrations). This step is commonly referred to as page segmentation. While various rule-based algorithms have been proposed, the applicability of Deep Neural Networks for this task recently has gained a lot of attention. In this paper, we perform a systematic evaluation of 11 different published backbone architectures and 9 different tiling and scaling configurations for separating text, tables or table column lines. We also show the influence of the number of labels and the number of training pages on the segmentation quality, which we measure using the Matthews Correlation Coefficient. Our results show that (depending on the task) Inception-ResNet-v2 and EfficientNet backbones work best, vertical tiling is generally preferable to other tiling approaches, and training data that comprises 30 to 40 pages will be sufficient most of the time.
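
Since the Matthews Correlation Coefficient is less common than pixel accuracy or IoU in segmentation work, here is a minimal binary implementation over flattened masks; it follows the usual confusion-matrix definition (and matches sklearn's matthews_corrcoef for two classes).

```python
import numpy as np

def matthews_corrcoef_binary(y_true, y_pred):
    """Binary Matthews Correlation Coefficient from confusion-matrix counts.
    Pixel-wise segmentation masks can simply be flattened before calling this."""
    y_true = np.asarray(y_true).astype(bool).ravel()
    y_pred = np.asarray(y_pred).astype(bool).ravel()
    tp = np.sum(y_true & y_pred)
    tn = np.sum(~y_true & ~y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return ((tp * tn) - (fp * fn)) / denom if denom > 0 else 0.0
```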

Multi-Attribute Learning with Highly Imbalanced Data

Lady Viviana Beltran Beltran, Mickaël Coustaty, Nicholas Journet, Juan C. Caicedo, Antoine Doucet

Auto-TLDR; Data Imbalance in Multi-Attribute Deep Learning Models: Adaptation to face each one of the problems derived from imbalance

Data is one of the most important keys for success when studying a simple or a complex phenomenon. With the use of deep learning exploding and its democratization, non-computer-science experts may struggle to use highly complex deep learning architectures, even when straightforward models offer them suitable performance. In this article, we study the specific and common problem of data imbalance in real databases, as most bad-performance problems are due to the data itself. We review two points: first, when the data contains different levels of imbalance. Classical imbalanced learning strategies cannot be directly applied when using multi-attribute deep learning models, i.e., multi-task and multi-label architectures. Therefore, one of our contributions is our proposed adaptations to face each of the problems derived from imbalance. Second, we demonstrate that with little to no imbalance, straightforward deep learning models work well. However, for non-experts, these models can be seen as black boxes, where all the effort is put into pre-processing the data. To simplify the problem, we performed the classification task ignoring information that is costly to extract, such as part localization, which is widely used in the state of the art of attribute classification. We make use of a widely known attribute database, CUB-200-2011 (CUB), as our main use case due to its deeply imbalanced nature, along with two better-structured databases: CelebA and AwA2. All of them contain multi-attribute annotations. The results of highly fine-grained attribute learning over CUB demonstrate that in the presence of imbalance, by using our proposed strategies it is possible to obtain competitive results against the state of the art, while taking advantage of multi-attribute deep learning models. We also report results for the two better-structured databases, on which our models outperform the state of the art.

Label or Message: A Large-Scale Experimental Survey of Texts and Objects Co-Occurrence

Koki Takeshita, Juntaro Shioyama, Seiichi Uchida

Auto-TLDR; Large-scale Survey of Co-occurrence between Objects and Scene Text with a State-of-the-art Scene Text detector and Recognizer

Our daily life is surrounded by textual information. Nowadays, the automatic collection of textual information becomes possible owing to the drastic improvement of scene text detectors and recognizers. The purpose of this paper is to conduct a large-scale survey of co-occurrence between visual objects (such as book and car) and scene texts with a large image dataset and a state-of-the-art scene text detector and recognizer. Especially, we focus on the function of "label" texts, which are attached to objects for detailing the objects. By analyzing co-occurrence between objects and scene texts, it is possible to observe statistics about the label texts and understand how the scene texts will be useful for recognizing the objects and vice versa.
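
The underlying measurement is a plain co-occurrence count between detected object classes and recognized words. A minimal sketch, assuming the object detector and scene text recognizer outputs are already available per image (the example inputs are hypothetical):

```python
from collections import Counter
from itertools import product

def cooccurrence_counts(detections):
    """Count how often an object class and a recognized scene-text word appear in the
    same image. `detections` is an iterable of (object_classes, words) pairs, one per
    image, produced by some upstream detector and recognizer."""
    counts = Counter()
    for objects, words in detections:
        for obj, word in product(set(objects), set(w.lower() for w in words)):
            counts[(obj, word)] += 1
    return counts

# counts = cooccurrence_counts([({"book", "person"}, ["SALE", "Python"]),
#                               ({"car"}, ["TAXI"])])
# print(counts.most_common(5))
```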

KoreALBERT: Pretraining a Lite BERT Model for Korean Language Understanding

Hyunjae Lee, Jaewoong Yun, Bongkyu Hwang, Seongho Joe, Seungjai Min, Youngjune Gwon

Auto-TLDR; KoreALBERT: A monolingual ALBERT model for Korean language understanding

A Lite BERT (ALBERT) has been introduced to scale up deep bidirectional representation learning for natural languages. Due to the lack of pretrained ALBERT models for the Korean language, the best available practice is the multilingual model or resorting to another BERT-based model. In this paper, we develop and pretrain KoreALBERT, a monolingual ALBERT model specifically for Korean language understanding. We introduce a new training objective, namely Word Order Prediction (WOP), and use it alongside the existing MLM and SOP criteria with the same architecture and model parameters. Despite having significantly fewer model parameters (and thus being quicker to train), our pretrained KoreALBERT outperforms its BERT counterpart on the KorQuAD 1.0 benchmark for machine reading comprehension. Consistent with the empirical results in English by Lan et al., KoreALBERT seems to improve downstream task performance involving multi-sentence encoding for the Korean language. The pretrained KoreALBERT is publicly available to encourage research and application development for Korean NLP.

InsideBias: Measuring Bias in Deep Networks and Application to Face Gender Biometrics

Ignacio Serna, Alejandro Peña Almansa, Aythami Morales, Julian Fierrez

Auto-TLDR; InsideBias: Detecting Bias in Deep Neural Networks from Face Images

This work explores the biases in learning processes based on deep neural network architectures. We analyze how bias affects deep learning processes through a toy example using the MNIST database and a case study in gender detection from face images. We employ two gender detection models based on popular deep neural networks. We present a comprehensive analysis of bias effects when using an unbalanced training dataset on the features learned by the models. We show how bias impacts the activations of gender detection models based on face images. We finally propose InsideBias, a novel method to detect biased models. InsideBias is based on how the models represent the information instead of how they perform, which is the normal practice in other existing methods for bias detection. Our strategy with InsideBias allows us to detect biased models with very few samples (only 15 images in our case study). Our experiments include 72K face images from 24K identities and 3 ethnic groups.

Trainable Spectrally Initializable Matrix Transformations in Convolutional Neural Networks

Michele Alberti, Angela Botros, Schuetz Narayan, Rolf Ingold, Marcus Liwicki, Mathias Seuret

Auto-TLDR; Trainable and Spectrally Initializable Matrix Transformations for Neural Networks

In this work, we introduce a new architectural component to Neural Networks (NN), i.e., trainable and spectrally initializable matrix transformations on feature maps. While previous literature has already demonstrated the possibility of adding static spectral transformations as feature processors, our focus is on more general trainable transforms. We study the transforms in various architectural configurations on four datasets of different nature: from medical (ColorectalHist, HAM10000) and natural (Flowers) images to historical documents (CB55). With rigorous experiments that control for the number of parameters and randomness, we show that networks utilizing the introduced matrix transformations outperform vanilla neural networks. The observed accuracy increases appreciably across all datasets. In addition, we show that the benefit of spectral initialization leads to significantly faster convergence, as opposed to randomly initialized matrix transformations. The transformations are implemented as auto-differentiable PyTorch modules that can be incorporated into any neural network architecture. The entire code base is open-source.
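
A minimal version of such a component is a pair of trainable matrices applied along the spatial dimensions of a feature map, initialized with a spectral basis. The sketch below uses an orthonormal DCT-II basis as the initialization; the specific transform, its placement in the network and the training details in the paper may differ.

```python
import math
import torch
import torch.nn as nn

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix (rows = frequencies, columns = positions)."""
    pos = torch.arange(n, dtype=torch.float32)
    freq = pos.unsqueeze(1)
    basis = torch.cos(math.pi / n * (pos.unsqueeze(0) + 0.5) * freq)
    basis *= math.sqrt(2.0 / n)
    basis[0] /= math.sqrt(2.0)
    return basis

class SpectralTransform(nn.Module):
    """Trainable matrix transformation on feature maps, spectrally initialized:
    the DCT matrices are free parameters and are updated during training."""
    def __init__(self, height, width):
        super().__init__()
        self.rows = nn.Parameter(dct_matrix(height))
        self.cols = nn.Parameter(dct_matrix(width))

    def forward(self, x):                      # x: (batch, channels, height, width)
        return self.rows @ x @ self.cols.t()   # separable 2D transform per channel

# feats = torch.randn(4, 64, 28, 28)
# out = SpectralTransform(28, 28)(feats)
```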

Enhanced User Interest and Expertise Modeling for Expert Recommendation

Tongze He, Caili Guo, Yunfei Chu

Auto-TLDR; A Unified Framework for Expert Recommendation in Community Question Answering

The rapid development of Community Question Answering (CQA) satisfies users' request for professional and personal knowledge. In CQA, one key issue is to recommend users with high expertise and willingness to answer the given questions, namely expert recommendation. However, most existing methods for expert recommendation ignore some key information, such as time information and historical feedback information, degrading the performance. On the one hand, users' interests change over time; it is biased if we do not consider these dynamics. On the other hand, feedback information is critical to estimate users' expertise. To solve these problems, we propose a unified framework for expert recommendation to exploit user interest and expertise more precisely. Considering the inconsistency between them, we propose to learn their embeddings separately. We leverage Long Short-Term Memory (LSTM) to model a user's short-term interest and combine it with long-term interest. The user expertise is learned by the designed user expertise network, which explicitly models feedback on users' historical behavior. Extensive experiments on a large-scale dataset from a real-world CQA site demonstrate the superior performance of our method over state-of-the-art solutions to the problem.

A Close Look at Deep Learning with Small Data

Lorenzo Brigato, Luca Iocchi

Auto-TLDR; Low-Complex Neural Networks for Small Data Conditions

In this work, we perform a wide variety of experiments with different deep learning architectures in small data conditions. We show that model complexity is a critical factor when only a few samples per class are available. Differently from the literature, we improve the state of the art using low-complexity models. We show that standard convolutional neural networks with relatively few parameters are effective in this scenario. In many of our experiments, low-complexity models outperform state-of-the-art architectures. Moreover, we propose a novel network that uses an unsupervised loss to regularize its training. Such an architecture either improves the results or performs comparably well to low-capacity networks. Surprisingly, experiments show that a dynamic data augmentation pipeline is not beneficial in this particular domain. Statically augmenting the dataset might be a promising research direction, while dropout maintains its role as a good regularizer.

Adaptive Word Embedding Module for Semantic Reasoning in Large-Scale Detection

Yu Zhang, Xiaoyu Wu, Ruolin Zhu

Auto-TLDR; Adaptive Word Embedding Module for Object Detection

In recent years, convolutional neural networks have achieved rapid development in the field of object detection. However, due to the imbalance of data, high labor costs and uneven levels of data labeling, the overall performance of previous detection networks drops sharply when datasets are extended to a large scale with hundreds or thousands of categories. We present the Adaptive Word Embedding Module, which extracts an adaptive semantic knowledge graph to reach semantic consistency within one image. Our method endows detection networks with the ability to infer global semantics without additional attribute or relationship annotations. Compared with Faster RCNN, the algorithm improves results on the MSCOCO dataset significantly by 4.1%, reaching an mAP of 32.8%. On the VG1000 dataset, it improves over Faster RCNN by 0.9% to 6.7%. The Adaptive Word Embedding Module is lightweight, general-purpose and can be plugged into diverse detection networks. Code will be made available.

More Correlations Better Performance: Fully Associative Networks for Multi-Label Image Classification

Yaning Li, Liu Yang

Auto-TLDR; Fully Associative Network for Fully Exploiting Correlation Information in Multi-Label Classification

Recent research demonstrates that correlation modeling plays a key role in high-performance multi-label classification methods. However, existing methods do not take full advantage of correlation information, especially correlations in the feature and label spaces of each image, which limits the performance of correlation-based multi-label classification methods. With more correlations considered, in this study, a Fully Associative Network (FAN) is proposed for fully exploiting correlation information, which involves both visual feature and label correlations. Specifically, FAN introduces a robust covariance pooling to summarize convolution features as a global image representation for capturing feature correlation in the multi-label task. Moreover, it constructs an effective label correlation matrix based on a re-weighting scheme, which is fed into a graph convolution network for capturing label correlation. Then, the correlation between the covariance representations (i.e., feature correlation) and the outputs of the GCN (i.e., label correlation) is modeled for the final prediction. Experimental results on two datasets illustrate the effectiveness and efficiency of our proposed FAN compared with state-of-the-art methods.
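
The covariance pooling step can be sketched in a few lines: the C×H×W feature map is treated as H·W spatial descriptors of dimension C, and their C×C covariance (flattened) becomes the global image representation. The paper's "robust" variant additionally normalizes this matrix; the plain version below is for illustration only.

```python
import torch

def covariance_pooling(feature_map, eps=1e-5):
    """Summarize a convolutional feature map (batch, C, H, W) by the C x C covariance
    of its spatial descriptors, a simple form of second-order (covariance) pooling."""
    b, c, h, w = feature_map.shape
    x = feature_map.reshape(b, c, h * w)                  # N = H*W descriptors of dim C
    x = x - x.mean(dim=-1, keepdim=True)                  # center each channel
    cov = x @ x.transpose(1, 2) / (h * w - 1)             # (batch, C, C)
    cov = cov + eps * torch.eye(c, device=cov.device)     # numerical stability
    return cov.flatten(1)                                 # (batch, C*C) representation
```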

MAGNet: Multi-Region Attention-Assisted Grounding of Natural Language Queries at Phrase Level

Amar Shrestha, Krittaphat Pugdeethosapol, Haowen Fang, Qinru Qiu

Auto-TLDR; MAGNet: A Multi-Region Attention-Aware Grounding Network for Free-form Textual Queries

Grounding free-form textual queries necessitates an understanding of these textual phrases and their relation to the visual cues in order to reliably reason about the described locations. Spatial attention networks are known to learn this relationship and focus their gaze on salient objects in the image. Thus, we propose to utilize spatial attention networks for image-level visual-textual fusion, preserving local (word) and global (phrase) information to refine region proposals with an in-network Region Proposal Network (RPN) and detect single or multiple regions for a phrase query. We focus only on the phrase query - ground truth pair (referring expression) for a model independent of the constraints of the datasets, i.e. additional attributes, context, etc. On the ReferIt Game referring expression dataset, our Multi-region Attention-assisted Grounding network (MAGNet) achieves over 12% improvement over the state-of-the-art. Without the context from image captions and attribute information in Flickr30k Entities, we still achieve competitive results compared to the state-of-the-art.

Explore and Explain: Self-Supervised Navigation and Recounting

Roberto Bigazzi, Federico Landi, Marcella Cornia, Silvia Cascianelli, Lorenzo Baraldi, Rita Cucchiara

Auto-TLDR; Exploring a Photorealistic Environment for Explanation and Navigation

Embodied AI has been recently gaining attention as it aims to foster the development of autonomous and intelligent agents. In this paper, we devise a novel embodied setting in which an agent needs to explore a previously unknown environment while recounting what it sees during the path. In this context, the agent needs to navigate the environment driven by an exploration goal, select proper moments for description, and output natural language descriptions of relevant objects and scenes. Our model integrates a novel self-supervised exploration module with penalty, and a fully-attentive captioning model for explanation. Also, we investigate different policies for selecting proper moments for explanation, driven by information coming from both the environment and the navigation. Experiments are conducted on photorealistic environments from the Matterport3D dataset and investigate the navigation and explanation capabilities of the agent as well as the role of their interactions.

ESResNet: Environmental Sound Classification Based on Visual Domain Models

Andrey Guzhov, Federico Raue, Jörn Hees, Andreas Dengel

Auto-TLDR; Environmental Sound Classification with Short-Time Fourier Transform Spectrograms

Environmental Sound Classification (ESC) is an active research area in the audio domain and has seen a lot of progress in the past years. However, many of the existing approaches achieve high accuracy by relying on domain-specific features and architectures, making it harder to benefit from advances in other fields (e.g., the image domain). Additionally, some of the past successes have been attributed to a discrepancy of how results are evaluated (i.e., on unofficial splits of the UrbanSound8K (US8K) dataset), distorting the overall progression of the field. The contribution of this paper is twofold. First, we present a model that is inherently compatible with mono and stereo sound inputs. Our model is based on simple log-power Short-Time Fourier Transform (STFT) spectrograms and combines them with several well-known approaches from the image domain (i.e., ResNet, Siamese-like networks and attention). We investigate the influence of cross-domain pre-training, architectural changes, and evaluate our model on standard datasets. We find that our model out-performs all previously known approaches in a fair comparison by achieving accuracies of 97.0 % (ESC-10), 91.5 % (ESC-50) and 84.2 % / 85.4 % (US8K mono / stereo). Second, we provide a comprehensive overview of the actual state of the field, by differentiating several previously reported results on the US8K dataset between official or unofficial splits. For better reproducibility, our code (including any re-implementations) is made available.
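
The model's input representation is deliberately simple. The sketch below computes a log-power STFT spectrogram from a mono waveform with PyTorch; the FFT size, hop length and dB scaling are placeholder choices, not necessarily the paper's exact settings.

```python
import torch

def log_power_stft(waveform, n_fft=1024, hop_length=512, eps=1e-10):
    """Log-power STFT spectrogram of a mono waveform (1D tensor), the image-like
    input a visual-domain CNN can consume."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop_length,
                      window=window, return_complex=True)
    power = spec.abs() ** 2                     # (freq_bins, frames)
    return 10.0 * torch.log10(power + eps)      # dB-scaled

# waveform, sr = torchaudio.load("example.wav")        # hypothetical audio file
# spectrogram = log_power_stft(waveform.mean(dim=0))   # mixdown to mono first
```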

Improving Visual Relation Detection Using Depth Maps

Sahand Sharifzadeh, Sina Moayed Baharlou, Max Berrendorf, Rajat Koner, Volker Tresp

Auto-TLDR; Exploiting Depth Maps for Visual Relation Detection

State-of-the-art visual relation detection methods mostly rely on object information extracted from RGB images such as 2D bounding boxes, feature maps, and predicted class probabilities. Depth maps can additionally provide valuable information on object relations, e.g. helping to detect not only spatial relations, such as standing behind, but also non-spatial relations, such as holding. In this work, we study the effect of using different object information with a focus on depth maps. To enable this study, we release a new synthetic dataset of depth maps, VG-Depth, as an extension to Visual Genome (VG). We also note that given the highly imbalanced distribution of relations in VG, typical evaluation metrics for visual relation detection cannot reveal improvements of under-represented relations. To address this problem, we propose using an additional metric, calling it Macro Recall@K, and demonstrate its remarkable performance on VG. Finally, our experiments confirm that by effective utilization of depth maps within a simple, yet competitive framework, the performance of visual relation detection can be improved by a margin of up to 8%.
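
The proposed metric averages recall over relation (predicate) classes rather than over instances, so rare relations are weighted equally with frequent ones. The sketch below is an illustrative implementation of that idea; the exact matching protocol (graph constraints, top-K definition) follows the paper.

```python
from collections import defaultdict

def macro_recall_at_k(predictions, ground_truth, k):
    """Macro-averaged Recall@K sketch: recall@K is computed per predicate class and
    then averaged with equal weight. `predictions` maps an image id to a list of
    predicted (subject, predicate, object) triplets sorted by confidence;
    `ground_truth` maps an image id to its set of true triplets."""
    hits, totals = defaultdict(int), defaultdict(int)
    for image_id, gt_triplets in ground_truth.items():
        top_k = set(predictions.get(image_id, [])[:k])
        for triplet in gt_triplets:
            predicate = triplet[1]
            totals[predicate] += 1
            hits[predicate] += int(triplet in top_k)
    recalls = [hits[p] / totals[p] for p in totals]
    return sum(recalls) / len(recalls) if recalls else 0.0
```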

Verifying the Causes of Adversarial Examples

Honglin Li, Yifei Fan, Frieder Ganz, Tony Yezzi, Payam Barnaghi

Auto-TLDR; Exploring the Causes of Adversarial Examples in Neural Networks

The robustness of neural networks is challenged by adversarial examples that contain almost imperceptible perturbations to inputs which mislead a classifier to incorrect outputs in high confidence. Limited by the extreme difficulty in examining a high-dimensional image space thoroughly, research on explaining and justifying the causes of adversarial examples falls behind studies on attacks and defenses. In this paper, we present a collection of potential causes of adversarial examples and verify (or partially verify) them through carefully-designed controlled experiments. The major causes of adversarial examples include model linearity, one-sum constraint, and geometry of the categories. To control the effect of those causes, multiple techniques are applied such as $L_2$ normalization, replacement of loss functions, construction of reference datasets, and novel models using multi-layer perceptron probabilistic neural networks (MLP-PNN) and density estimation (DE). Our experiment results show that geometric factors tend to be more direct causes and statistical factors magnify the phenomenon, especially for assigning high prediction confidence. We hope this paper will inspire more studies to rigorously investigate the root causes of adversarial examples, which in turn provide useful guidance on designing more robust models.

Semantics to Space(S2S): Embedding Semantics into Spatial Space for Zero-Shot Verb-Object Query Inferencing

Sungmin Eum, Heesung Kwon

Auto-TLDR; Semantics-to-Space: Deep Zero-Shot Learning for Verb-Object Interaction with Vectors

We present a novel deep zero-shot learning (ZSL) model for inferencing human-object-interaction with verb-object (VO) query. While the previous two-stream ZSL approaches only use the semantic/textual information to be fed into the query stream, we seek to incorporate and embed the semantics into the visual representation stream as well. Our approach is powered by Semantics-to-Space (S2S) architecture where semantics derived from the residing objects are embedded into a spatial space of the visual stream. This architecture allows the co-capturing of the semantic attributes of the human and the objects along with their location/size/silhouette information. To validate, we have constructed a new dataset, Verb-Transferability 60 (VT60). VT60 provides 60 different VO pairs with overlapping verbs tailored for testing two-stream ZSL approaches with VO query. Experimental evaluations show that our approach not only outperforms the state-of-the-art, but also shows the capability of consistently improving performance regardless of which ZSL baseline architecture is used.

FourierNet: Compact Mask Representation for Instance Segmentation Using Differentiable Shape Decoders

Hamd Ul Moqeet Riaz, Nuri Benbarka, Andreas Zell

Auto-TLDR; FourierNet: A Single shot, anchor-free, fully convolutional instance segmentation method that predicts a shape vector

We present FourierNet, a single-shot, anchor-free, fully convolutional instance segmentation method that predicts a shape vector. This shape vector is then converted into the mask's contour points using a fast numerical transform. Compared to previous methods, we introduce a new training technique that utilizes a differentiable shape decoder, which automatically balances the weights of the shape vector's coefficients. We use the Fourier series as the shape encoder because of its interpretable coefficients and fast implementation. FourierNet shows promising results compared to polygon-representation methods, achieving 30.6 mAP on the MS COCO 2017 benchmark. At lower image resolutions, it runs at 26.6 FPS with 24.3 mAP. It reaches 23.3 mAP using just eight parameters to represent the mask (note that at least four parameters are needed for bounding box prediction alone). Qualitative analysis shows that suppressing a reasonable proportion of the higher frequencies of the Fourier series still generates meaningful masks. These results support our understanding that the lower-frequency components carry most of the information relevant to the segmentation task, and therefore a compressed representation can be achieved. Code is available at: github.com/cogsys-tuebingen/FourierNet.
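
To illustrate the decoding step, the sketch below converts a truncated set of complex Fourier coefficients into contour points with an inverse FFT, the kind of fast numerical transform the abstract refers to; the coefficient packing is a simplification and not necessarily FourierNet's exact layout.

    import numpy as np

    def coefficients_to_contour(coeffs, num_points=360):
        """Decode a compact Fourier shape vector into contour points.

        coeffs: complex Fourier coefficients of the contour, with the x/y
                coordinates packed as real/imaginary parts (low frequencies
                first). Higher frequencies are implicitly zero, which is what
                makes the representation compact.
        """
        spectrum = np.zeros(num_points, dtype=complex)
        spectrum[:len(coeffs)] = coeffs                  # keep only low frequencies
        contour = np.fft.ifft(spectrum) * num_points     # inverse transform to points
        return np.stack([contour.real, contour.imag], axis=1)   # (num_points, 2)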

Self-Supervised Learning for Astronomical Image Classification

Ana Martinazzo, Mateus Espadoto, Nina S. T. Hirata

Auto-TLDR; Unlabeled Astronomical Images for Deep Neural Network Pre-training

In Astronomy, a huge amount of image data is generated daily by photometric surveys, which scan the sky to collect data from stars, galaxies and other celestial objects. In this paper, we propose a technique to leverage unlabeled astronomical images to pre-train deep convolutional neural networks, in order to learn a domain-specific feature extractor which improves the results of machine learning techniques in setups with small amounts of labeled data available. We show that our technique produces results which are in many cases better than using ImageNet pre-training.

Learning with Delayed Feedback

Pranavan Theivendiram, Terence Sim

Auto-TLDR; Supervised Machine Learning with Delayed Feedback

We propose a novel supervised machine learning strategy, inspired by human learning, that enables an Agent to learn continually over its lifetime. A natural consequence is that the Agent must be able to handle an input whose label is delayed until a later time, or may never arrive at all. Our Agent learns in two steps: a short Seeding phase, in which the Agent's model is initialized with labelled inputs, and an indefinitely long Growing phase, in which the Agent refines and assesses its model if the label is given for an input, but stores the input in a finite-length queue if the label is missing. Queued items are matched against future input-label pairs as they arrive, and the model is then updated. Our strategy also allows the delayed feedback to take a different form: for example, in an image captioning task, the feedback could be a semantic segmentation rather than a textual caption. We show through extensive experiments that our strategy enables an Agent to learn flexibly and efficiently.
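
A minimal sketch of the two-phase strategy, assuming a scikit-learn-style model with fit/partial_fit/predict; the queue length, matching key, and method names are illustrative, not the authors' implementation.

    from collections import deque

    class DelayedFeedbackAgent:
        """Seeding phase on labelled data, then a Growing phase in which
        unlabelled inputs wait in a bounded queue until their label
        (feedback) arrives, if ever."""

        def __init__(self, model, queue_size=1000):
            self.model = model
            self.pending = deque(maxlen=queue_size)   # oldest items are dropped

        def seed(self, inputs, labels):
            self.model.fit(inputs, labels)            # initialise the model

        def observe(self, x, key):
            self.pending.append((key, x))             # label not yet available
            return self.model.predict([x])[0]

        def feedback(self, key, label):
            for i, (k, x) in enumerate(self.pending):
                if k == key:                          # delayed label matches a queued input
                    self.model.partial_fit([x], [label])
                    del self.pending[i]
                    break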

A Novel Region of Interest Extraction Layer for Instance Segmentation

Leonardo Rossi, Akbar Karimi, Andrea Prati

Auto-TLDR; Generic RoI Extractor for Two-Stage Neural Network for Instance Segmentation

Given the wide diffusion of deep neural network architectures for computer vision tasks, several new applications are now feasible. Among them, particular attention has recently been given to instance segmentation, exploiting the results achievable by two-stage networks derived from R-CNN (such as Mask R-CNN or Faster R-CNN). In these complex architectures, a crucial role is played by the Region of Interest (RoI) extraction layer, devoted to extracting a coherent subset of features from a single Feature Pyramid Network (FPN) layer attached on top of a backbone. This paper is motivated by the need to overcome the limitations of existing RoI extractors, which select only one (the best) layer from the FPN. Our intuition is that all the layers of the FPN retain useful information. Therefore, the proposed layer (called Generic RoI Extractor - GRoIE) introduces non-local building blocks and attention mechanisms to boost performance. A comprehensive component-level ablation study is conducted to find the best set of algorithms and parameters for the GRoIE layer. Moreover, GRoIE can be integrated seamlessly with every two-stage architecture for both object detection and instance segmentation tasks. The improvements brought by the use of GRoIE in different state-of-the-art architectures are therefore also evaluated. The proposed layer yields gains of up to 1.1% AP on bounding box detection and 1.7% AP on instance segmentation. The code is publicly available in the GitHub repository at https://github.com/IMPLabUniPr/mmdetection-groie
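
The core intuition, pooling RoI features from every FPN level rather than a single "best" one, can be sketched with torchvision's roi_align as below; the summation stands in for GRoIE's attention and non-local aggregation blocks, and the stride values are assumptions.

    import torch
    from torchvision.ops import roi_align

    def multi_level_roi_features(fpn_feats, boxes, out_size=7):
        """Pool RoI features from every FPN level and aggregate them, instead
        of picking a single 'best' level per RoI.

        fpn_feats: list of (B, C, H_l, W_l) tensors, one per FPN level
        boxes:     (K, 5) tensor of RoIs as (batch_idx, x1, y1, x2, y2)
                   in input-image coordinates
        """
        pooled = []
        for level, feat in enumerate(fpn_feats):
            scale = 1.0 / (4 * 2 ** level)            # typical FPN strides 4, 8, 16, 32
            pooled.append(roi_align(feat, boxes, out_size, spatial_scale=scale))
        # simple aggregation by summation; GRoIE additionally applies
        # convolutional/attention blocks before and after this step
        return torch.stack(pooled, dim=0).sum(dim=0)  # (K, C, out_size, out_size)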

Developing Motion Code Embedding for Action Recognition in Videos

Maxat Alibayev, David Andrea Paulius, Yu Sun

Auto-TLDR; Motion Embedding via Motion Codes for Action Recognition

We propose a motion embedding strategy based on motion codes, a vectorized representation of motions built from their salient mechanical attributes. We show that motion codes provide a robust motion representation. We train a deep neural network model that learns to embed demonstration videos into motion codes, and we integrate the features extracted by this motion embedding model into a current state-of-the-art action recognition model. The resulting model achieves higher accuracy than the baseline on a verb classification task over egocentric videos from the EPIC-KITCHENS dataset.

A Generalizable Saliency Map-Based Interpretation of Model Outcome

Shailja Thakur, Sebastian Fischmeister

Auto-TLDR; Interpretability of Deep Neural Networks Using Salient Input and Output

One of the significant challenges of deep neural networks is that their complex nature prevents human comprehension of the network's outcome. Consequently, the applicability of complex machine learning models is limited in safety-critical domains, where errors incur risk to life and property. To fully exploit the capabilities of complex neural networks, we propose a non-intrusive interpretability technique that uses the input and output of the model to generate a saliency map. The method works by empirically optimizing a randomly initialized input mask, localizing and weighting individual pixels according to their sensitivity towards the target class. Our experiments show that the proposed interpretability approach performs better than existing saliency map-based approaches at localizing the relevant input pixels. Furthermore, to obtain a global perspective on the target-specific explanation, we propose a saliency map reconstruction approach that generates acceptable variations of the salient inputs from the input data distribution for which the model outcome remains unaltered. Experiments show that our interpretability method can reconstruct the salient part of the input with a classification accuracy of 89%.
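
In the spirit of the description above, a saliency mask can be obtained by optimizing a randomly initialized mask so that the masked input keeps the target-class score high while staying sparse; the objective below is a hedged PyTorch sketch, not the authors' exact formulation, and the regularization weight is an assumption.

    import torch

    def optimize_saliency_mask(model, image, target, steps=200, lr=0.1, l1=1e-2):
        """Empirically optimise a randomly initialised mask so that the masked
        input (image: (1, C, H, W)) keeps the target-class score high while
        using as few pixels as possible."""
        mask = torch.rand(1, 1, *image.shape[-2:], requires_grad=True)
        opt = torch.optim.Adam([mask], lr=lr)
        for _ in range(steps):
            m = torch.sigmoid(mask)                   # keep mask values in [0, 1]
            score = model(image * m)[0, target]       # target-class confidence
            loss = -score + l1 * m.mean()             # salient yet sparse
            opt.zero_grad()
            loss.backward()
            opt.step()
        return torch.sigmoid(mask).detach()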

Text Synopsis Generation for Egocentric Videos

Aidean Sharghi, Niels Lobo, Mubarak Shah

Auto-TLDR; Egocentric Video Summarization Using Multi-task Learning for End-to-End Learning

Mass utilization of body-worn cameras has led to a huge corpus of available egocentric video. Existing video summarization algorithms can accelerate browsing such videos by selecting (visually) interesting shots from them. Nonetheless, since the system user still has to watch the summary videos, browsing large video databases remains a challenge. Hence, in this work, we propose to generate a textual synopsis consisting of a few sentences describing the most important events in a long egocentric video. Users can read the short text to gain insight about the video and, more importantly, efficiently search through the content of a large video database using text queries. Since egocentric videos are long and contain many activities and events, applying video-to-text algorithms results in thousands of descriptions, many of which are incorrect. Therefore, we propose a multi-task learning scheme to simultaneously generate descriptions for video segments and summarize the resulting descriptions in an end-to-end fashion. We input a set of video shots, and the network generates a text description for each shot. Next, a visual-language content matching unit, trained with a weakly supervised objective, identifies the correct descriptions. Finally, the last component of our network, called the purport network, evaluates all the descriptions together to select the ones containing crucial information. Out of the thousands of descriptions generated for a video, a few informative sentences are returned to the user. We validate our framework on the challenging UT Egocentric video dataset, where each video is between 3 and 5 hours long and is associated with over 3000 textual descriptions on average. The generated textual summaries, comprising only 5 percent (or less) of the generated descriptions, are compared to ground-truth summaries in the text domain using well-established metrics from natural language processing.

Force Banner for the Recognition of Spatial Relations

Robin Deléarde, Camille Kurtz, Laurent Wendling, Philippe Dejean

Auto-TLDR; Spatial Relation Recognition using Force Banners

Studying the spatial organization of objects in images is fundamental to increasing both the understanding of the sensed scene and the accuracy of the perceived similarity between images. This often leads to the problem of spatial relation recognition: given two objects depicted in an image, what is their spatial relation? In this article, we treat this as a classification problem. Instead of using the original image space (or image features) directly to predict the spatial relation, we propose a novel intermediate representation (called a Force Banner) modeling rich spatial information between pairs of objects composing a scene. Such a representation captures the relative position between objects using a panel of forces (attraction and repulsion) that takes into account the structural shapes of the objects and their distance in a directional fashion. Force Banners are used to feed a classical 2D Convolutional Neural Network (CNN) for the recognition of spatial relations, benefiting from pre-trained models and fine-tuning. Experimental results obtained on a dataset of images with various shapes highlight the potential of this approach, and in particular its ability to describe spatial information.
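
As a rough illustration of a force-based descriptor, the sketch below accumulates pairwise attraction forces between two point sets into a directional histogram; it is a heavily simplified stand-in for the actual force histogram / Force Banner computation, with the force exponent and binning chosen arbitrarily.

    import numpy as np

    def directional_forces(pts_a, pts_b, num_dirs=180, power=2.0):
        """Simplified directional force descriptor between two objects: for
        every pair of points (one from each object), accumulate an attraction
        force 1/d**power in the histogram bin of their direction.

        pts_a, pts_b: (N, 2) and (M, 2) arrays of pixel coordinates (x, y)
        returns:      (num_dirs,) histogram of forces over directions
        """
        diff = pts_b[None, :, :] - pts_a[:, None, :]            # (N, M, 2)
        dist = np.linalg.norm(diff, axis=-1)
        angle = np.arctan2(diff[..., 1], diff[..., 0])          # in (-pi, pi]
        bins = ((angle + np.pi) / (2 * np.pi) * num_dirs).astype(int) % num_dirs
        forces = 1.0 / np.maximum(dist, 1.0) ** power           # avoid division by zero
        hist = np.zeros(num_dirs)
        np.add.at(hist, bins.ravel(), forces.ravel())
        return hist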

Learning to Rank for Active Learning: A Listwise Approach

Minghan Li, Xialei Liu, Joost Van De Weijer, Bogdan Raducanu

Auto-TLDR; Learning Loss for Active Learning

Active learning emerged as an alternative to alleviate the effort of labeling huge amounts of data for data-hungry applications (such as image/video indexing and retrieval, autonomous driving, etc.). The goal of active learning is to automatically select a number of unlabeled samples for annotation (according to a budget), based on an acquisition function that indicates how valuable a sample is for training the model. The learning loss method is a task-agnostic approach that attaches a module which learns to predict the target loss of unlabeled data and selects the data with the highest loss for labeling. In this work, we follow this strategy, but we define the acquisition function as a learning-to-rank problem and rethink the structure of the loss prediction module, using a simple yet effective listwise approach. Experimental results on four datasets demonstrate that our method outperforms recent state-of-the-art active learning approaches for both image classification and regression tasks.
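
A listwise loss for the loss-prediction module can be sketched in the style of ListNet, matching the predicted score distribution over a mini-batch to the distribution induced by the true target losses; the paper's exact formulation may differ, and the temperature is an assumed hyper-parameter.

    import torch
    import torch.nn.functional as F

    def listwise_ranking_loss(pred_scores, true_losses, temperature=1.0):
        """ListNet-style objective: treat a mini-batch as one ranked list and
        match the predicted score distribution to the distribution induced by
        the true target losses (both inputs are 1-D tensors of batch size)."""
        target = F.softmax(true_losses.detach() / temperature, dim=0)
        log_pred = F.log_softmax(pred_scores / temperature, dim=0)
        return -(target * log_pred).sum()            # cross-entropy over the list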

Few-Shot Few-Shot Learning and the Role of Spatial Attention

Yann Lifchitz, Yannis Avrithis, Sylvaine Picard

Auto-TLDR; Few-shot Learning with Pre-trained Classifier on Large-Scale Datasets

Few-shot learning is often motivated by the ability of humans to learn new tasks from few examples. However, standard few-shot classification benchmarks assume that the representation is learned on a limited amount of base class data, ignoring the amount of prior knowledge a human may have accumulated before learning new tasks. At the same time, even if a powerful representation is available, in some domains base class data may be limited or non-existent. This motivates us to study a problem where the representation is obtained from a classifier pre-trained on a large-scale dataset of a different domain, assuming no access to its training process, while the base class data are limited to a few examples per class, and their role is to adapt the representation to the domain at hand rather than to learn it from scratch. We adapt the representation in two stages: on the few base class data, if available, and on the even fewer data of new tasks. In doing so, we obtain from the pre-trained classifier a spatial attention map that allows focusing on objects and suppressing background clutter. This is important in the new problem, because when base class data are few, the network cannot learn where to focus implicitly. We also show that a pre-trained network may be easily adapted to novel classes without meta-learning.
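
One common way to obtain a class-agnostic spatial attention map from a pre-trained backbone is to use the channel-wise activation energy at each location, as sketched below; whether this matches the authors' construction is an assumption.

    import torch

    def spatial_attention(feature_map, eps=1e-6):
        """Class-agnostic spatial attention from a pre-trained backbone: the
        channel-wise activation energy at each location, normalised to sum to
        one over the spatial grid, highlights objects over background clutter.

        feature_map: (B, C, H, W) convolutional features
        returns:     (B, 1, H, W) attention map
        """
        energy = feature_map.pow(2).sum(dim=1, keepdim=True)    # (B, 1, H, W)
        flat = energy.flatten(1)
        attn = flat / (flat.sum(dim=1, keepdim=True) + eps)
        return attn.view_as(energy)

    # attended pooling: weight spatial locations by the attention map
    # pooled = (feature_map * spatial_attention(feature_map)).sum(dim=(2, 3))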