ICPR2020 Paper Browser

Paper download is intended for registered attendees only, and is subject to the IEEE Copyright Policy. Any other use is strictly prohibited.

A Novel Attention-Based Aggregation Function to Combine Vision and Language

Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

Auto-TLDR; Fully-Attentive Reduction for Vision and Language

The joint understanding of vision and language has recently been gaining a lot of attention in both the Computer Vision and Natural Language Processing communities, with the emergence of tasks such as image captioning, image-text matching, and visual question answering. As both images and text can be encoded as sets or sequences of elements (such as regions and words), proper reduction functions are needed to transform a set of encoded elements into a single response, such as a classification or similarity score. In this paper, we propose a novel fully-attentive reduction method for vision and language. Specifically, our approach computes a set of scores for each element of each modality by employing a novel variant of cross-attention, and performs a learnable and cross-modal reduction, which can be used for both classification and ranking. We test our approach on image-text matching and visual question answering, building fair comparisons with other reduction choices, on the COCO and VQA 2.0 datasets. Experimentally, we demonstrate that our approach leads to a performance increase on both tasks. Further, we conduct ablation studies to validate the role of each component of the approach.
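To make the idea of a score-based, cross-modal reduction concrete, here is a minimal sketch in plain Python. It is not the authors' method: the abstract does not give the formulation, so this uses generic scaled dot-product attention pooling, with the mean of the other modality as a hypothetical query and no learned projections. The names `attentive_reduce`, `mean_vector`, and the toy feature vectors are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def attentive_reduce(elements, context):
    """Reduce a set of vectors to a single vector.

    Each element is scored against a summary of the other modality
    (assumption: the mean of `context` serves as the query), the scores
    are normalized with a softmax, and the elements are averaged with
    those weights. A trained model would add learned projections here.
    """
    query = mean_vector(context)
    scale = math.sqrt(len(query))
    weights = softmax([dot(e, query) / scale for e in elements])
    pooled = [sum(w * e[i] for w, e in zip(weights, elements))
              for i in range(len(elements[0]))]
    return pooled, weights

def cosine(u, v):
    """Cosine similarity, usable as an image-text matching score."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Toy inputs (hypothetical): three image-region features, two word features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
words = [[1.0, 0.0], [1.0, 0.0]]

img_vec, img_w = attentive_reduce(regions, words)  # regions scored against text
txt_vec, txt_w = attentive_reduce(words, regions)  # words scored against image
similarity = cosine(img_vec, txt_vec)
```

In this sketch the reduced vectors feed a cosine similarity, which is the kind of single response a ranking task needs; for classification, the same reduced vectors could instead be passed to a classifier.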

Similar papers

Transformer Reasoning Network for Image-Text Matching and Retrieval

Nicola Messina, Fabrizio Falchi, Andrea Esuli, Giuseppe Amato

Auto-TLDR; A Transformer Encoder Reasoning Network for Image-Text Matching in Large-Scale Information Retrieval

Dual Path Multi-Modal High-Order Features for Textual Content Based Visual Question Answering

Multi-Stage Attention Based Visual Question Answering

Multi-Modal Contextual Graph Neural Network for Text Visual Question Answering

Answer-Checking in Context: A Multi-Modal Fully Attention Network for Visual Question Answering

VSR++: Improving Visual Semantic Reasoning for Fine-Grained Image-Text Matching

Integrating Historical States and Co-Attention Mechanism for Visual Dialog

Explore and Explain: Self-Supervised Navigation and Recounting

Question-Agnostic Attention for Visual Question Answering

Multi-Scale Relational Reasoning with Regional Attention for Visual Question Answering

Beyond the Deep Metric Learning: Enhance the Cross-Modal Matching with Adversarial Discriminative Domain Regularization

P ≈ NP, at Least in Visual Question Answering

Enriching Video Captions with Contextual Text

MAGNet: Multi-Region Attention-Assisted Grounding of Natural Language Queries at Phrase Level

Improving Visual Question Answering Using Active Perception on Static Images

Evaluation of BERT and ALBERT Sentence Embedding Performance on Downstream NLP Tasks

GCNs-Based Context-Aware Short Text Similarity Model

Visual Oriented Encoder: Integrating Multimodal and Multi-Scale Contexts for Video Captioning

Context Matters: Self-Attention for Sign Language Recognition

Context Visual Information-Based Deliberation Network for Video Captioning

Attentive Visual Semantic Specialized Network for Video Captioning

Tackling Contradiction Detection in German Using Machine Translation and End-To-End Recurrent Neural Networks

Multi-Scale 2D Representation Learning for Weakly-Supervised Moment Retrieval

Visual Style Extraction from Chart Images for Chart Restyling

Learning Neural Textual Representations for Citation Recommendation

Global Context-Based Network with Transformer for Image2latex

Cross-Media Hash Retrieval Using Multi-head Attention Network

Information Graphic Summarization Using a Collection of Multimodal Deep Neural Networks

Zero-Shot Text Classification with Semantically Extended Graph Convolutional Network

More Correlations Better Performance: Fully Associative Networks for Multi-Label Image Classification

A Novel Actor Dual-Critic Model for Remote Sensing Image Captioning

Semantics to Space(S2S): Embedding Semantics into Spatial Space for Zero-Shot Verb-Object Query Inferencing

Webly Supervised Image-Text Embedding with Noisy Tag Refinement

Unsupervised Co-Segmentation for Athlete Movements and Live Commentaries Using Crossmodal Temporal Proximity

CKG: Dynamic Representation Based on Context and Knowledge Graph

Text Synopsis Generation for Egocentric Videos

PIN: A Novel Parallel Interactive Network for Spoken Language Understanding

A CNN-RNN Framework for Image Annotation from Visual Cues and Social Network Metadata

Multimodal Side-Tuning for Document Classification

Attention-Based Deep Metric Learning for Near-Duplicate Video Retrieval

Adversarial Training for Aspect-Based Sentiment Analysis with BERT

Transformer Networks for Trajectory Forecasting

PICK: Processing Key Information Extraction from Documents Using Improved Graph Learning-Convolutional Networks

Attention Pyramid Module for Scene Recognition

Equation Attention Relationship Network (EARN): A Geometric Deep Metric Framework for Learning Similar Math Expression Embedding

Mutual Alignment between Audiovisual Features for End-To-End Audiovisual Speech Recognition

Two-Level Attention-Based Fusion Learning for RGB-D Face Recognition

AttendAffectNet: Self-Attention Based Networks for Predicting Affective Responses from Movies