ICPR2020 Paper Browser

Paper download is intended for registered attendees only, and is subject to the IEEE Copyright Policy. Any other use is strictly forbidden.

Transformer Reasoning Network for Image-Text Matching and Retrieval

Nicola Messina, Fabrizio Falchi, Andrea Esuli, Giuseppe Amato

Auto-TLDR; A Transformer Encoder Reasoning Network for Image-Text Matching in Large-Scale Information Retrieval

Image-text matching is an active research task in modern AI. Despite advances in deep-learning-based image and text processing systems, multi-modal matching remains challenging. In this work, we consider the problem of accurate image-text matching for multi-modal large-scale information retrieval. State-of-the-art results in image-text matching are achieved by letting image and text features from the two processing pipelines interact, usually through mutual attention mechanisms. However, this prevents the extraction of the separate visual and textual features needed for later indexing steps in large-scale retrieval systems. To address this, we introduce the Transformer Encoder Reasoning Network (TERN), an architecture built upon one of the modern relationship-aware self-attentive architectures, the Transformer Encoder (TE). This architecture reasons separately on the two modalities and enforces a final common abstract concept space by sharing the weights of the deeper transformer layers. Thanks to this design, the network produces compact and rich visual and textual features for the subsequent indexing step. Experiments are conducted on the MS-COCO dataset, and we evaluate the results using a discounted cumulative gain metric, with relevance computed from caption similarities, in order to assess search results that may not be exact matches but are still relevant. We demonstrate that on this metric we achieve state-of-the-art results in the image retrieval task. Our code is freely available at https://github.com/mesnico/TERN.
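
The evaluation idea in the abstract can be illustrated with a minimal sketch. It is not taken from the TERN repository. It assumes that image features are computed independently of the query (so they can be indexed ahead of time), that ranking uses cosine similarity, and that graded relevance comes from a caption similarity. The token-overlap (Jaccard) similarity, the toy vectors, and all function names below are hypothetical stand-ins chosen for illustration; the abstract does not say which caption similarity the authors use.

```python
import math


def dcg(relevances, k=None):
    """Discounted cumulative gain of a ranked list of graded relevances."""
    ranked = relevances[:k] if k is not None else relevances
    # The item at 0-based position i is discounted by log2(i + 2),
    # so the top-ranked item is not discounted at all.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked))


def ndcg(relevances, k=None):
    """DCG divided by the DCG of the ideal (descending-relevance) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


def jaccard(caption_a, caption_b):
    """Token-overlap similarity; an assumed stand-in for a caption similarity."""
    tokens_a = set(caption_a.lower().split())
    tokens_b = set(caption_b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_by_cosine(query_vec, index):
    """Return item ids sorted by decreasing cosine similarity to the query."""
    return sorted(index, key=lambda item: cosine(query_vec, index[item]),
                  reverse=True)


# Toy index: image features precomputed without seeing any query (assumed values).
image_index = {
    "img_a": [0.9, 0.1],
    "img_b": [0.2, 0.8],
    "img_c": [0.7, 0.3],
}
# One reference caption per image, used only to grade relevance.
image_captions = {
    "img_a": "a dog runs on the beach",
    "img_b": "a plate of pasta",
    "img_c": "a dog plays on the sand",
}

query_text = "a dog runs on the sand"
query_vec = [1.0, 0.0]  # assumed output of the text pipeline

ranking = rank_by_cosine(query_vec, image_index)
relevances = [jaccard(query_text, image_captions[i]) for i in ranking]
score = ndcg(relevances)
```

Because relevance is graded rather than binary, an image whose caption is close to the query still contributes to the score even when it is not the exact ground-truth match, which is the property the abstract highlights.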

Similar papers

A Novel Attention-Based Aggregation Function to Combine Vision and Language

Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

Auto-TLDR; Fully-Attentive Reduction for Vision and Language

VSR++: Improving Visual Semantic Reasoning for Fine-Grained Image-Text Matching

Dual Path Multi-Modal High-Order Features for Textual Content Based Visual Question Answering

Beyond the Deep Metric Learning: Enhance the Cross-Modal Matching with Adversarial Discriminative Domain Regularization

Multi-Modal Contextual Graph Neural Network for Text Visual Question Answering

Multi-Stage Attention Based Visual Question Answering

MAGNet: Multi-Region Attention-Assisted Grounding of Natural Language Queries at Phrase Level

Integrating Historical States and Co-Attention Mechanism for Visual Dialog

Evaluation of BERT and ALBERT Sentence Embedding Performance on Downstream NLP Tasks

Explore and Explain: Self-Supervised Navigation and Recounting

Multi-Scale Relational Reasoning with Regional Attention for Visual Question Answering

Learning Neural Textual Representations for Citation Recommendation

Text Synopsis Generation for Egocentric Videos

Attentive Visual Semantic Specialized Network for Video Captioning

Equation Attention Relationship Network (EARN) : A Geometric Deep Metric Framework for Learning Similar Math Expression Embedding

Enriching Video Captions with Contextual Text

Question-Agnostic Attention for Visual Question Answering

Context Visual Information-Based Deliberation Network for Video Captioning

KoreALBERT: Pretraining a Lite BERT Model for Korean Language Understanding

Answer-Checking in Context: A Multi-Modal Fully Attention Network for Visual Question Answering

A Novel Actor Dual-Critic Model for Remote Sensing Image Captioning

GCNs-Based Context-Aware Short Text Similarity Model

PICK: Processing Key Information Extraction from Documents Using Improved Graph Learning-Convolutional Networks

Webly Supervised Image-Text Embedding with Noisy Tag Refinement

Visual Oriented Encoder: Integrating Multimodal and Multi-Scale Contexts for Video Captioning

A CNN-RNN Framework for Image Annotation from Visual Cues and Social Network Metadata

Named Entity Recognition and Relation Extraction with Graph Neural Networks in Semi Structured Documents

Efficient Sentence Embedding Via Semantic Subspace Analysis

Semantics to Space(S2S): Embedding Semantics into Spatial Space for Zero-Shot Verb-Object Query Inferencing

Cross-Media Hash Retrieval Using Multi-head Attention Network

PIN: A Novel Parallel Interactive Network for Spoken Language Understanding

Multi-Scale 2D Representation Learning for Weakly-Supervised Moment Retrieval

Adversarial Training for Aspect-Based Sentiment Analysis with BERT

Learning with Delayed Feedback

P ≈ NP, at Least in Visual Question Answering

Tackling Contradiction Detection in German Using Machine Translation and End-To-End Recurrent Neural Networks

Explain2Attack: Text Adversarial Attacks via Cross-Domain Interpretability

Visual Style Extraction from Chart Images for Chart Restyling

Zero-Shot Text Classification with Semantically Extended Graph Convolutional Network

Information Graphic Summarization Using a Collection of Multimodal Deep Neural Networks

CKG: Dynamic Representation Based on Context and Knowledge Graph

Transformer-Encoder Detector Module: Using Context to Improve Robustness to Adversarial Attacks on Object Detection

Transformer Networks for Trajectory Forecasting

MEG: Multi-Evidence GNN for Multimodal Semantic Forensics

Improving Visual Relation Detection Using Depth Maps

Context Matters: Self-Attention for Sign Language Recognition

Global Context-Based Network with Transformer for Image2latex

Adaptive Word Embedding Module for Semantic Reasoning in Large-Scale Detection