Beyond the Deep Metric Learning: Enhance the Cross-Modal Matching with Adversarial Discriminative Domain Regularization

Li Ren, Kai Li, Liqiang Wang, Kien Hua

Auto-TLDR; Adversarial Discriminative Domain Regularization for Efficient Cross-Modal Matching

Matching information across the image and text modalities is a fundamental challenge for many applications that involve both vision and natural language processing. The objective is to find efficient metrics for measuring the similarity between visual and textual information. Existing approaches mainly match local visual objects and sentence words in a shared space with attention mechanisms. Matching performance remains limited because the similarity computation relies on simple comparisons of the matching features, ignoring the characteristics of their distribution in the data. In this paper, we address this limitation with an efficient learning objective that considers the discriminative feature distributions between visual objects and sentence words. Specifically, we propose a novel Adversarial Discriminative Domain Regularization (ADDR) learning framework that goes beyond the standard metric learning objective by constructing a set of discriminative data domains within each image-text pair. Our approach generally improves the learning efficiency and performance of existing metric learning frameworks by regularizing the distribution of the hidden space between matching pairs. Experimental results show that this new approach significantly improves the overall performance of several popular cross-modal matching techniques (SCAN, VSRN, BFAN) on the MS-COCO and Flickr30K benchmarks.
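To make the general recipe concrete, below is a minimal PyTorch sketch of a standard triplet ranking objective augmented with an adversarial domain regularizer over pooled visual and textual features. It is an illustration only, not the authors' released implementation; the module names, the flipped-label adversarial term, and the weight lambda_adv are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DomainDiscriminator(nn.Module):
    """Classifies whether a feature comes from the visual (label 0) or textual (label 1) domain."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)  # raw logits


def triplet_ranking_loss(img, txt, margin=0.2):
    """Hinge-based ranking loss over all in-batch negatives (features assumed L2-normalized)."""
    scores = img @ txt.t()                                # (B, B) cosine similarities
    pos = scores.diag().view(-1, 1)
    cost_cap = (margin + scores - pos).clamp(min=0)       # rank captions for each image
    cost_img = (margin + scores - pos.t()).clamp(min=0)   # rank images for each caption
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return cost_cap.masked_fill(mask, 0).sum() + cost_img.masked_fill(mask, 0).sum()


def matching_loss_with_domain_regularizer(img, txt, disc, lambda_adv=0.1):
    """Metric-learning loss plus an adversarial term that pushes the visual and textual
    feature distributions toward each other (the encoder is trained to fool `disc`)."""
    match = triplet_ranking_loss(img, txt)
    feats = torch.cat([img, txt], dim=0)
    domain = torch.cat([torch.zeros(img.size(0)), torch.ones(txt.size(0))]).to(img.device)
    adv = F.binary_cross_entropy_with_logits(disc(feats), 1.0 - domain)  # flipped labels
    return match + lambda_adv * adv
```

In a full training loop the discriminator itself would be updated with the true domain labels (or a gradient reversal layer would be used), alternating with the matching model.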

Similar papers

VSR++: Improving Visual Semantic Reasoning for Fine-Grained Image-Text Matching

Hui Yuan, Yan Huang, Dongbo Zhang, Zerui Chen, Wenlong Cheng, Liang Wang

Auto-TLDR; Improving Visual Semantic Reasoning for Fine-Grained Image-Text Matching

Image-text matching has made great progress recently, but challenges remain in fine-grained matching. To deal with this problem, we propose an Improved Visual Semantic Reasoning model (VSR++), which jointly models 1) global alignment between images and texts and 2) local correspondence between regions and words in a unified framework. To exploit their complementary advantages, we also develop a suitable learning strategy to balance their relative importance. As a result, our model can distinguish image regions and text words at a fine-grained level, and thus achieves the current state-of-the-art performance on two benchmark datasets.

Transformer Reasoning Network for Image-Text Matching and Retrieval

Nicola Messina, Fabrizio Falchi, Andrea Esuli, Giuseppe Amato

Auto-TLDR; A Transformer Encoder Reasoning Network for Image-Text Matching in Large-Scale Information Retrieval

Image-text matching is an interesting and fascinating task in modern AI research. Despite the evolution of deep-learning-based image and text processing systems, multi-modal matching remains a challenging problem. In this work, we consider the problem of accurate image-text matching for the task of multi-modal large-scale information retrieval. State-of-the-art results in image-text matching are achieved by inter-playing image and text features from the two different processing pipelines, usually using mutual attention mechanisms. However, this invalidates any chance to extract separate visual and textual features needed for later indexing steps in large-scale retrieval systems. In this regard, we introduce the Transformer Encoder Reasoning Network (TERN), an architecture built upon one of the modern relationship-aware self-attentive architectures, the Transformer Encoder (TE). This architecture is able to separately reason on the two different modalities and to enforce a final common abstract concept space by sharing the weights of the deeper transformer layers. Thanks to this design, the implemented network is able to produce compact and very rich visual and textual features available for the successive indexing step. Experiments are conducted on the MS-COCO dataset, and we evaluate the results using a discounted cumulative gain metric with relevance computed exploiting caption similarities, in order to assess possibly non-exact but relevant search results. We demonstrate that on this metric we are able to achieve state-of-the-art results in the image retrieval task. Our code is freely available at https://github.com/mesnico/TERN.

A Novel Attention-Based Aggregation Function to Combine Vision and Language

Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

Auto-TLDR; Fully-Attentive Reduction for Vision and Language

The joint understanding of vision and language has been recently gaining a lot of attention in both the Computer Vision and Natural Language Processing communities, with the emergence of tasks such as image captioning, image-text matching, and visual question answering. As both images and text can be encoded as sets or sequences of elements - like regions and words - proper reduction functions are needed to transform a set of encoded elements into a single response, like a classification or similarity score. In this paper, we propose a novel fully-attentive reduction method for vision and language. Specifically, our approach computes a set of scores for each element of each modality employing a novel variant of cross-attention, and performs a learnable and cross-modal reduction, which can be used for both classification and ranking. We test our approach on image-text matching and visual question answering, building fair comparisons with other reduction choices, on both COCO and VQA 2.0 datasets. Experimentally, we demonstrate that our approach leads to a performance increase on both tasks. Further, we conduct ablation studies to validate the role of each component of the approach.
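As an illustration of attention-based reduction (not the paper's exact formulation), the sketch below scores every region and word by cross-attention to the other modality and collapses each set into a single vector before computing a similarity; the projection layers and the mean-pooling of the attention maps are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentivePooling(nn.Module):
    """Reduces a set of region features and a set of word features to one vector each,
    scoring every element by cross-attention to the other modality."""
    def __init__(self, dim):
        super().__init__()
        self.query_v = nn.Linear(dim, dim)
        self.query_t = nn.Linear(dim, dim)

    def forward(self, regions, words):
        # regions: (B, R, D), words: (B, W, D)
        att_v = torch.bmm(self.query_v(regions), words.transpose(1, 2))   # (B, R, W)
        att_t = torch.bmm(self.query_t(words), regions.transpose(1, 2))   # (B, W, R)
        w_regions = F.softmax(att_v.mean(dim=2), dim=1)  # per-region score from the text side
        w_words = F.softmax(att_t.mean(dim=2), dim=1)    # per-word score from the image side
        v = (w_regions.unsqueeze(-1) * regions).sum(dim=1)   # (B, D) reduced visual vector
        t = (w_words.unsqueeze(-1) * words).sum(dim=1)       # (B, D) reduced textual vector
        return F.cosine_similarity(v, t, dim=-1)             # one score per image-text pair
```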

Webly Supervised Image-Text Embedding with Noisy Tag Refinement

Niluthpol Mithun, Ravdeep Pasricha, Evangelos Papalexakis, Amit Roy-Chowdhury

Auto-TLDR; Robust Joint Embedding for Image-Text Retrieval Using Web Images

In this paper, we address the problem of utilizing web images in training robust joint embedding models for the image-text retrieval task. Prior webly supervised approaches directly leverage weakly annotated web images in the joint embedding learning framework. The objective of these approaches suffers significantly when the ratio of noisy and missing tags associated with the web images is very high. In this regard, we propose a CP decomposition based tensor completion framework to refine the tags of web images by modeling the observed ternary inter-relations between the sets of labeled images, tags, and web images as a tensor. To effectively deal with the high ratio of missing entries likely in our case, we incorporate intra-modal correlation as side information in the proposed framework. Our tag refinement approach, combined with existing webly supervised image-text embedding approaches, provides a more principled way of learning joint embedding models in the presence of significant noise from web data and limited clean labeled data. Experiments on benchmark datasets demonstrate that the proposed approach helps to achieve a significant performance gain in image-text retrieval.

Cross-Media Hash Retrieval Using Multi-head Attention Network

Zhixin Li, Feng Ling, Chuansheng Xu, Canlong Zhang, Huifang Ma

Auto-TLDR; Unsupervised Cross-Media Hash Retrieval Using Multi-Head Attention Network

Cross-media hash retrieval encodes multimedia data into a common binary hash space in which the correlation between samples from different modalities can be measured effectively. To further improve retrieval accuracy, this paper proposes an unsupervised cross-media hash retrieval method based on a multi-head attention network. First, we use a multi-head attention network to better match images and texts, which contain rich semantic information. At the same time, an auxiliary similarity matrix is constructed to integrate the original neighborhood information from the different modalities. The method can therefore capture the potential correlations both across modalities and within the same modality, compensating for the differences between them. Second, the method is unsupervised and does not require additional semantic labels, so it has the potential to scale to large cross-media retrieval. In addition, batch normalization and a replacement hash code generation function are adopted to optimize the model, and two loss functions are designed, which allow the method to exceed many supervised deep cross-media hashing methods. Experiments on three datasets show that the average performance of this method is about 5 to 6 percentage points higher than the state-of-the-art unsupervised method, demonstrating its effectiveness and superiority.

MAGNet: Multi-Region Attention-Assisted Grounding of Natural Language Queries at Phrase Level

Amar Shrestha, Krittaphat Pugdeethosapol, Haowen Fang, Qinru Qiu

Auto-TLDR; MAGNet: A Multi-Region Attention-Aware Grounding Network for Free-form Textual Queries

Grounding free-form textual queries necessitates an understanding of these textual phrases and their relation to the visual cues in order to reliably reason about the described locations. Spatial attention networks are known to learn this relationship and focus their gaze on salient objects in the image. Thus, we propose to utilize spatial attention networks for image-level visual-textual fusion, preserving local (word) and global (phrase) information, to refine region proposals with an in-network Region Proposal Network (RPN) and detect single or multiple regions for a phrase query. We focus only on the phrase query - ground truth pair (referring expression) for a model independent of dataset-specific constraints such as additional attributes or context. On the referring expression dataset ReferIt Game, our Multi-Region Attention-assisted Grounding network (MAGNet) achieves over a 12% improvement over the state-of-the-art. Without the context from image captions and attribute information in Flickr30k Entities, we still achieve competitive results compared to the state-of-the-art.

Integrating Historical States and Co-Attention Mechanism for Visual Dialog

Tianling Jiang, Yi Ji, Chunping Liu

Auto-TLDR; Integrating Historical States and Co-attention for Visual Dialog

Visual dialog is a typical multi-modal task involving both vision and language. It currently faces two major difficulties: exploiting the temporal order of the dialog history and resolving visual references in the questions. In this paper, we propose Integrating Historical States and Co-attention (HSCA) for visual dialog to address them. It includes two main modules, Co-ATT and MATCH. The main purpose of the Co-ATT module is to guide the image with questions and answers at an early stage so as to attend to more specific objects; it tackles the temporal sequence issue in the historical information, which can influence the precise answer for multi-round questions. The MATCH module retrieves the historical information block that best matches a question containing pronouns; it overcomes the visual reference problem, which requires resolving pronouns referring to unknowns in the text and then locating the corresponding objects in the given image. We quantitatively and qualitatively evaluate our model on VisDial v1.0 and carry out ablation studies. The experimental results demonstrate that HSCA outperforms state-of-the-art methods in many aspects.

Dual Path Multi-Modal High-Order Features for Textual Content Based Visual Question Answering

Yanan Li, Yuetan Lin, Hongrui Zhao, Donghui Wang

Auto-TLDR; TextVQA: An End-to-End Visual Question Answering Model for Text-Based VQA

As a typical cross-modal problem, visual question answering (VQA) has received increasing attention from the computer vision and natural language processing communities. Reading and reasoning about texts and visual contents in images is a burgeoning and important research topic in VQA, especially for visually impaired assistance applications. Given an image, the task is to predict an answer to a provided natural language question closely related to the image's textual contents. In this paper, we propose a novel end-to-end textual content based VQA model, which grounds question answering on both the visual and textual information. After encoding the image, the question and the recognized text words, it uses multi-modal factorized high-order modules and an attention mechanism to fuse question-image and question-text features respectively, so that the complex correlations among different features can be captured efficiently. To ensure the model's extensibility, it embeds candidate answers and recognized texts in a semantic embedding space and adopts a semantic embedding based classifier to perform answer prediction. Extensive experiments on the newly proposed benchmark TextVQA demonstrate that the proposed model can achieve promising results.

RGB-Infrared Person Re-Identification Via Image Modality Conversion

Huangpeng Dai, Qing Xie, Yanchun Ma, Yongjian Liu, Shengwu Xiong

Auto-TLDR; CE2L: A Novel Network for Cross-Modality Re-identification with Feature Alignment

As a cross-modality retrieval task, RGB-infrared person re-identification (Re-ID) is important and challenging because of its role in video surveillance applications and the large cross-modality variations between visible and infrared images. Most previous works addressed the cross-modality gap through feature alignment applied directly to the original feature representations. In this paper, different from existing works, we propose a novel network (CE2L) to tackle the cross-modality gap with feature alignment. CE2L focuses on adding discriminative information and learning robust features by converting between the visible and infrared modalities. Its merits are highlighted in two aspects: 1) using CycleGAN to convert infrared images into color images not only increases the recognition characteristics of the images, but also allows our network to better learn the features of the two modalities; 2) our method can serve as data augmentation: it increases data diversity and the total amount of data, counteracting over-fitting, by converting labeled training images into the other modality. Extensive experimental results on two datasets demonstrate superior performance compared to the baseline and state-of-the-art methods.

VSB^2-Net: Visual-Semantic Bi-Branch Network for Zero-Shot Hashing

Xin Li, Xiangfeng Wang, Bo Jin, Wenjie Zhang, Jun Wang, Hongyuan Zha

Auto-TLDR; VSB^2-Net: inductive zero-shot hashing for image retrieval

Zero-shot hashing aims at learning a hashing model from seen classes such that the obtained model generalizes to unseen classes for image retrieval. Inspired by zero-shot learning, existing zero-shot hashing methods usually transfer supervised knowledge from seen to unseen classes by embedding the Hamming space into a shared semantic space. However, this makes instances difficult to distinguish due to the limited number of hashing bits, especially for semantically similar unseen classes. We propose a novel inductive zero-shot hashing framework, VSB^2-Net, in which both the semantic space and the visual feature space are instead embedded into the same Hamming space. Reconstructive semantic relationships are established in the Hamming space, preserving local similarity relationships and explicitly enlarging the discrepancy between semantic Hamming vectors. A two-task architecture, comprising a classification module and a visual feature reconstruction module, is employed to enhance generalization and transfer abilities. Extensive evaluation results on several benchmark datasets demonstrate the superiority of our proposed method compared to several state-of-the-art baselines.

Equation Attention Relationship Network (EARN) : A Geometric Deep Metric Framework for Learning Similar Math Expression Embedding

Saleem Ahmed, Kenny Davila, Srirangaraj Setlur, Venu Govindaraju

Auto-TLDR; Representational Learning for Similarity Based Retrieval of Mathematical Expressions

Representation learning in the form of high-dimensional embeddings has been used for multiple pattern recognition applications, and there has been significant interest in building embedding-based systems for learning representations in the mathematical domain. At the same time, retrieval of structured information such as mathematical expressions is an important need for modern IR systems. In this work, our motivation is to introduce a robust framework for learning representations for similarity-based retrieval of mathematical expressions. Given a query by example, the embedding can find the closest matching expression as a function of the Euclidean distance between them. We leverage recent advancements in image-based and graph-based deep learning algorithms to learn our similarity embeddings, first using uni-modal encoders in graph space and image space, and then a multi-modal combination of the two. To overcome the lack of training data, we force the networks to learn a deep metric using triplets generated with a heuristic scoring function, and we adopt a custom strategy for mining hard samples to train our neural networks. Our system produces rankings similar to those generated by the original scoring function, but using only a fraction of the time. Our results establish the viability of using such a multi-modal embedding for this task.
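The query-by-example step described above amounts to a nearest-neighbour search in the embedding space; a minimal sketch, assuming precomputed embeddings, is:

```python
import torch

def retrieve_by_example(query_emb, index_embs, topk=5):
    """Query-by-example retrieval: rank indexed expression embeddings by Euclidean
    distance to the query embedding and return the indices of the closest matches."""
    dists = torch.cdist(query_emb.unsqueeze(0), index_embs).squeeze(0)  # (N,)
    return torch.topk(dists, k=topk, largest=False).indices
```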

JECL: Joint Embedding and Cluster Learning for Image-Text Pairs

Sean Yang, Kuan-Hao Huang, Bill Howe

Auto-TLDR; JECL: Clustering Image-Caption Pairs with Parallel Encoders and Regularized Clusters

We propose JECL, a method for clustering image-caption pairs by training parallel encoders with regularized clustering and alignment objectives, simultaneously learning both representations and cluster assignments. These image-caption pairs arise frequently in high-value applications where structured training data is expensive to produce but free-text descriptions are common. JECL trains by minimizing the Kullback-Leibler divergence from the image and text distributions to a combined joint target distribution, and by optimizing the Jensen-Shannon divergence between the soft cluster assignments of the images and text. Regularizers are also applied to JECL to prevent trivial solutions. Experiments show that JECL outperforms both single-view and multi-view methods on large benchmark image-caption datasets, and is remarkably robust to missing captions and varying data sizes.
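A minimal sketch of this style of objective is given below, using DEC-style Student's t soft assignments; the exact target distribution, regularizers, and weighting used in JECL may differ.

```python
import torch
import torch.nn.functional as F

def soft_assignments(z, centroids, alpha=1.0):
    """Student's t soft cluster assignments (DEC-style)."""
    d2 = torch.cdist(z, centroids).pow(2)
    q = (1.0 + d2 / alpha).pow(-(alpha + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    """Sharpened target distribution derived from the soft assignments."""
    p = q.pow(2) / q.sum(dim=0, keepdim=True)
    return p / p.sum(dim=1, keepdim=True)

def jecl_style_loss(q_img, q_txt):
    """KL of each view to a shared joint target plus a Jensen-Shannon term across views."""
    p = target_distribution(0.5 * (q_img + q_txt))          # combined joint target
    kl = F.kl_div(q_img.log(), p, reduction='batchmean') + \
         F.kl_div(q_txt.log(), p, reduction='batchmean')
    m = 0.5 * (q_img + q_txt)
    js = 0.5 * F.kl_div(m.log(), q_img, reduction='batchmean') + \
         0.5 * F.kl_div(m.log(), q_txt, reduction='batchmean')
    return kl + js
```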

Multi-Stage Attention Based Visual Question Answering

Aakansha Mishra, Ashish Anand, Prithwijit Guha

Auto-TLDR; Alternative Bi-directional Attention for Visual Question Answering

Recent developments in the field of Visual Question Answering (VQA) have witnessed promising improvements in performance through contributions in attention-based networks. Most such approaches have focused on unidirectional attention that leverages attention from the textual domain (question) over the visual space, aiming to learn high-quality attention in the visual space. In contrast, this work proposes an alternating bi-directional attention framework. First, question-to-image attention helps to learn a robust visual space embedding, and second, image-to-question attention helps to improve the question embedding. This attention mechanism is realized in an alternating fashion, i.e. question-to-image followed by image-to-question, and is repeated to maximize performance. We believe that this process of alternating attention generation helps both modalities and leads to better representations for the VQA task. The proposal is benchmarked on the TDIUC dataset against state-of-the-art approaches. Our ablation analysis shows that alternating attention is the key to achieving high performance in VQA.

CANU-ReID: A Conditional Adversarial Network for Unsupervised Person Re-IDentification

Guillaume Delorme, Yihong Xu, Stéphane Lathuiliere, Radu Horaud, Xavier Alameda-Pineda

Auto-TLDR; Unsupervised Person Re-Identification with Clustering and Adversarial Learning

Unsupervised person re-ID is the task of identifying people in a target data set for which ID labels are unavailable during training. In this paper, we propose to unify two trends in unsupervised person re-ID: clustering & fine-tuning and adversarial learning. On one side, clustering groups training images into pseudo-ID labels and uses them to fine-tune the feature extractor. On the other side, adversarial learning is used, inspired by domain adaptation, to match distributions from different domains. Since target data is distributed across different camera viewpoints, we propose to model each camera as an independent domain and aim to learn domain-independent features. Straightforward adversarial learning yields negative transfer; we thus introduce a conditioning vector to mitigate this undesirable effect. In our framework, the centroid of the cluster to which the visual sample belongs is used as the conditioning vector of our conditional adversarial network, where the vector is permutation invariant (cluster ordering does not matter) and its size is independent of the number of clusters. To our knowledge, we are the first to propose the use of conditional adversarial networks for unsupervised person re-ID. We evaluate the proposed architecture on top of two state-of-the-art clustering-based unsupervised person re-identification (re-ID) methods in four different experimental settings with three different data sets, and set the new state-of-the-art performance on all four of them. Our code and model will be made publicly available at https://team.inria.fr/perception/canu-reid/.

Learning Low-Shot Generative Networks for Cross-Domain Data

Hsuan-Kai Kao, Cheng-Che Lee, Wei-Chen Chiu

Auto-TLDR; Learning Generators for Cross-Domain Data under Low-Shot Learning

We tackle a novel problem of learning generators for cross-domain data under a specific scenario of low-shot learning. Basically, given a source domain with a sufficient amount of training data, we aim to transfer the knowledge of its generative process to another target domain, which not only has few data samples but also exhibits a domain shift with respect to the source domain. This problem has great potential in practical use and is different from the well-known image translation task, as the target-domain data can be generated without requiring any source-domain data and the large data consumption for learning a target-domain generator can be alleviated. Built upon a cross-domain dataset where (1) each of the low shots in the target domain has its correspondence in the source and (2) the two domains share similar content information but different appearance, two approaches are proposed: a Latent-Disentanglement-Orientated model (LaDo) and a Generative-Hierarchy-Oriented (GenHo) model. Our LaDo and GenHo approaches address the problem from different perspectives: the former relies on learning a disentangled representation composed of domain-invariant content features and domain-specific appearance ones, while the latter decomposes the generative process of a generator into two parts that synthesize the content and appearance sequentially. We perform extensive experiments under various settings of cross-domain data and show the efficacy of our models for generating target-domain data with content variance as abundant as in the source domain, which leads to favourable performance in comparison to several baselines.

Shape Consistent 2D Keypoint Estimation under Domain Shift

Levi Vasconcelos, Massimiliano Mancini, Davide Boscaini, Barbara Caputo, Elisa Ricci

Auto-TLDR; Deep Adaptation for Keypoint Prediction under Domain Shift

Recent unsupervised domain adaptation methods based on deep architectures have shown remarkable performance not only in traditional classification tasks but also in more complex problems involving structured predictions (e.g. semantic segmentation, depth estimation). Following this trend, in this paper we present a novel deep adaptation framework for estimating keypoints under domain shift, i.e. when the training (source) and the test (target) images differ significantly in visual appearance. Our method seamlessly combines three different components: feature alignment, adversarial training and self-supervision. Specifically, our deep architecture leverages domain-specific distribution alignment layers to perform target adaptation at the feature level. Furthermore, a novel loss is proposed which combines an adversarial term for ensuring aligned predictions in the output space and a geometric consistency term which guarantees coherent predictions between a target sample and its perturbed version. Our extensive experimental evaluation conducted on three publicly available benchmarks shows that our approach outperforms state-of-the-art domain adaptation methods in the 2D keypoint prediction task.

Picture-To-Amount (PITA): Predicting Relative Ingredient Amounts from Food Images

Jiatong Li, Fangda Han, Ricardo Guerrero, Vladimir Pavlovic

Auto-TLDR; PITA: A Deep Learning Architecture for Predicting the Relative Amount of Ingredients from Food Images

Increased awareness of the impact of food consumption on health and lifestyle today has given rise to novel data-driven food analysis systems. Although these systems may recognize the ingredients, a detailed analysis of their amounts in the meal, which is paramount for estimating the correct nutrition, is usually ignored. In this paper, we study the novel and challenging problem of predicting the relative amount of each ingredient from a food image. We propose PITA, the Picture-to-Amount deep learning architecture to solve the problem. More specifically, we predict the ingredient amounts using a domain-driven Wasserstein loss from image-to-recipe cross-modal embeddings learned to align the two views of food data. Experiments on a dataset of recipes collected from the Internet show the model generates promising results and improves the baselines on this challenging task.
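As a rough illustration of a Wasserstein-type loss on relative amounts (not the paper's domain-driven formulation), the sketch below computes a 1-D Wasserstein-1 distance between predicted and target amount distributions via cumulative sums; treating the ingredient slots as an ordered support is a simplifying assumption.

```python
import torch
import torch.nn.functional as F

def relative_amount_wasserstein(pred_logits, target_amounts):
    """Wasserstein-1 distance between predicted and target relative-amount
    distributions, computed via cumulative sums over the ingredient slots."""
    pred = F.softmax(pred_logits, dim=-1)                              # predicted relative amounts
    target = target_amounts / target_amounts.sum(dim=-1, keepdim=True) # normalize ground truth
    cdf_diff = torch.cumsum(pred, dim=-1) - torch.cumsum(target, dim=-1)
    return cdf_diff.abs().sum(dim=-1).mean()
```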

Class Conditional Alignment for Partial Domain Adaptation

Mohsen Kheirandishfard, Fariba Zohrizadeh, Farhad Kamangar

Auto-TLDR; Multi-class Adversarial Adaptation for Partial Domain Adaptation

Adversarial adaptation models have demonstrated significant progress towards transferring knowledge from a labeled source dataset to an unlabeled target dataset. Partial domain adaptation (PDA) investigates the scenarios in which the source domain is large and diverse, and the target label space is a subset of the source label space. The main purpose of PDA is to identify the shared classes between the domains and promote learning transferable knowledge from these classes. In this paper, we propose a multi-class adversarial architecture for PDA. The proposed approach jointly aligns the marginal and class-conditional distributions in the shared label space by minimaxing a novel multi-class adversarial loss function. Furthermore, we incorporate effective regularization terms to encourage selecting the most relevant subset of source domain classes. In the absence of target labels, the proposed approach is able to effectively learn domain-invariant feature representations, which in turn can enhance the classification performance in the target domain. Comprehensive experiments on three benchmark datasets, Office-31, Office-Home, and Caltech-Office, corroborate the effectiveness of the proposed approach in addressing different partial transfer learning tasks.

Cascade Attention Guided Residue Learning GAN for Cross-Modal Translation

Bin Duan, Wei Wang, Hao Tang, Hugo Latapie, Yan Yan

Auto-TLDR; Cascade Attention-Guided Residue GAN for Cross-modal Audio-Visual Learning

From infancy, we intuitively develop the ability to correlate the input from different cognitive sensors such as vision, audio, and text. However, in machine learning, this cross-modal learning is a nontrivial task because different modalities have no homogeneous properties. Previous works discovered that there should be bridges among different modalities. From a neurological and psychological perspective, humans have the capacity to link one modality with another, e.g., associating a picture of a bird with the mere sound of its singing, and vice versa. Is it possible for machine learning algorithms to recover the scene given only the audio signal? In this paper, we propose a novel Cascade Attention-Guided Residue GAN (CAR-GAN), aiming at reconstructing scenes given the corresponding audio signals. In particular, we present a residue module to progressively mitigate the gap between the two modalities. Moreover, a cascade attention guided network with a novel classification loss function is designed to tackle the cross-modal learning task. Our model keeps consistency in the high-level semantic label domain and is able to balance two different modalities. The experimental results demonstrate that our model achieves state-of-the-art cross-modal audio-visual generation on the challenging Sub-URMP dataset.

Unsupervised Multi-Task Domain Adaptation

Shih-Min Yang, Mei-Chen Yeh

Auto-TLDR; Unsupervised Domain Adaptation with Multi-task Learning for Image Recognition

With abundant labeled data, deep convolutional neural networks have shown great success in various image recognition tasks. However, these models are often less powerful when applied to novel datasets due to a phenomenon known as domain shift. Unsupervised domain adaptation methods aim to address this problem, allowing deep models trained on the labeled source domain to be used on a different target domain (without labels). In this paper, we investigate whether the generalization ability of an unsupervised domain adaptation method can be improved through multi-task learning, with learned features required to be both domain invariant and discriminative for multiple different but relevant tasks. Experiments evaluating two fundamental recognition tasks (image recognition and segmentation) show that the generalization ability empowered by multi-task learning may not benefit recognition when the model is directly applied on the target domain, but the multi-task setting can boost the performance of state-of-the-art unsupervised domain adaptation methods by a non-negligible margin.

Unsupervised Domain Adaptation with Multiple Domain Discriminators and Adaptive Self-Training

Teo Spadotto, Marco Toldo, Umberto Michieli, Pietro Zanuttigh

Auto-TLDR; Unsupervised Domain Adaptation for Semantic Segmentation of Urban Scenes

Unsupervised Domain Adaptation (UDA) aims at improving the generalization capability of a model trained on a source domain to perform well on a target domain for which no labeled data is available. In this paper, we consider the semantic segmentation of urban scenes and we propose an approach to adapt a deep neural network trained on synthetic data to real scenes addressing the domain shift between the two different data distributions. We introduce a novel UDA framework where a standard supervised loss on labeled synthetic data is supported by an adversarial module and a self-training strategy aiming at aligning the two domain distributions. The adversarial module is driven by a couple of fully convolutional discriminators dealing with different domains: the first discriminates between ground truth and generated maps, while the second between segmentation maps coming from synthetic or real world data. The self-training module exploits the confidence estimated by the discriminators on unlabeled data to select the regions used to reinforce the learning process. Furthermore, the confidence is thresholded with an adaptive mechanism based on the per-class overall confidence. Experimental results prove the effectiveness of the proposed strategy in adapting a segmentation network trained on synthetic datasets like GTA5 and SYNTHIA, to real world datasets like Cityscapes and Mapillary.

Randomized Transferable Machine

Pengfei Wei, Tze Yun Leong

Auto-TLDR; Randomized Transferable Machine for Suboptimal Feature-based Transfer Learning

Feature-based transfer is one of the most effective methodologies for transfer learning. Existing works usually claim that the learned new feature representation is truly domain-invariant, and thus directly train a transfer model M on the source domain. In this paper, we work on a more realistic scenario where the new feature representation is suboptimal and a small divergence still exists across domains. We propose a new learning strategy and name the transfer model following this strategy the Randomized Transferable Machine (RTM). More specifically, we work on source data with the new feature representation learned from existing feature-based transfer methods. Our key idea is to enlarge the source training data population by randomly corrupting the source data with noise, and then train a transfer model M̃ that performs well on all these corrupted source data populations. In principle, the more corruptions are made, the higher the probability that the target data are covered by the constructed source populations, and thus the better the transfer performance achieved by M̃. The ideal case uses infinite corruptions, which is infeasible in practice. We instead develop a marginalized solution: with a marginalization trick, we can train an RTM that is equivalent to one trained on infinitely many noisy source populations without actually performing any corruption. More importantly, such an RTM has a closed-form solution, which enables very fast and efficient training. Extensive experiments on various real-world transfer tasks show that RTM is a very promising transfer model.
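The marginalization trick can be illustrated with a linear model under dropout corruption: the corrupted second moments are replaced by their expectations, giving a closed-form solution without ever materializing corrupted copies. This sketch follows the standard marginalized-denoising recipe and is only in the spirit of RTM; the paper's exact formulation may differ.

```python
import numpy as np

def marginalized_linear_model(X, Y, keep_prob=0.8, ridge=1e-3):
    """Linear model trained on infinitely many dropout-corrupted copies of X,
    obtained in closed form by taking expectations of the corrupted moments.

    X: (n, d) source features, Y: (n, c) targets; predict with X_target @ W.
    """
    S = X.T @ X                                     # raw scatter matrix
    EQ = keep_prob ** 2 * S                         # E[x~ x~^T], off-diagonal entries
    np.fill_diagonal(EQ, keep_prob * np.diag(S))    # diagonal keeps a single factor of keep_prob
    EP = keep_prob * (X.T @ Y)                      # E[x~]^T Y
    W = np.linalg.solve(EQ + ridge * np.eye(X.shape[1]), EP)
    return W
```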

Adaptive L2 Regularization in Person Re-Identification

Xingyang Ni, Liang Fang, Heikki Juhani Huttunen

Auto-TLDR; AdaptiveReID: Adaptive L2 Regularization for Person Re-identification

We introduce an adaptive L2 regularization mechanism termed AdaptiveReID, in the setting of person re-identification. In the literature, it is common practice to utilize hand-picked regularization factors which remain constant throughout the training procedure. Unlike existing approaches, the regularization factors in our proposed method are updated adaptively through backpropagation. This is achieved by incorporating trainable scalar variables as the regularization factors, which are further fed into a scaled hard sigmoid function. Extensive experiments on the Market-1501, DukeMTMC-reID and MSMT17 datasets validate the effectiveness of our framework. Most notably, we obtain state-of-the-art performance on MSMT17, which is the largest dataset for person re-identification. Source code will be published at https://github.com/nixingyang/AdaptiveReID.
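A minimal sketch of the mechanism described (trainable scalars squashed by a scaled hard sigmoid to produce L2 factors) might look as follows; the grouping of parameters, the hard-sigmoid constants, and the scale bound are assumptions, and the paper's full scheme for updating the factors is not reproduced here.

```python
import torch
import torch.nn as nn

class AdaptiveL2(nn.Module):
    """L2 penalty whose per-group weight is itself learned: a trainable scalar is
    squashed by a scaled hard sigmoid to give the regularization factor."""
    def __init__(self, n_groups, scale=1e-3):
        super().__init__()
        self.raw = nn.Parameter(torch.zeros(n_groups))  # trainable regularization logits
        self.scale = scale                              # upper bound of each factor (assumed)

    def factors(self):
        # Hard sigmoid: clamp a linear ramp to [0, 1], then rescale.
        return self.scale * torch.clamp(self.raw / 6.0 + 0.5, 0.0, 1.0)

    def forward(self, param_groups):
        # param_groups: list of lists of parameter tensors, one list per factor.
        f = self.factors()
        return sum(f[i] * sum(p.pow(2).sum() for p in group)
                   for i, group in enumerate(param_groups))
```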

Foreground-Focused Domain Adaption for Object Detection

Yuchen Yang, Nilanjan Ray

Auto-TLDR; Unsupervised Domain Adaptation for Unsupervised Object Detection

Object detectors suffer from accuracy loss caused by domain shift from a source to a target domain. Unsupervised domain adaptation (UDA) approaches mitigate this loss by training with unlabeled target domain images. A popular processing pipeline applies adversarial training that aligns the distributions of the features from the two domains. We argue that aligning full image-level features is not ideal for UDA object detection due to the presence of varied background areas during inference. Thus, we propose a novel foreground-focused domain adaptation (FFDA) framework which mines the loss of the domain discriminators to concentrate backpropagation on the foreground loss. We obtain mining masks by collecting target predictions and source labels to outline foreground regions, and apply the masks to the image- and instance-level domain discriminators to allow backpropagation only on the mined regions. By reinforcing this foreground-focused adaptation throughout multiple layers in the detector model, we gain a significant accuracy boost on the target domain prediction. Compared to previous works, our method reaches new state-of-the-art accuracy on adapting Cityscapes to the Foggy Cityscapes dataset and demonstrates competitive accuracy on other datasets that include various scenarios for autonomous driving applications.

Visual Oriented Encoder: Integrating Multimodal and Multi-Scale Contexts for Video Captioning

Bang Yang, Yuexian Zou

Auto-TLDR; Visual Oriented Encoder for Video Captioning

Video captioning is a challenging task which aims at automatically generating a natural language description of a given video. Recent research has shown that exploiting the intrinsic multi-modality of videos significantly promotes captioning performance. However, how to integrate multiple modalities to generate effective semantic representations for video captioning is still an open issue. Some researchers proposed to learn multimodal features in parallel during the encoding stage, but the downside of these methods lies in neglecting the interaction among modalities and their rich contextual information. In this study, inspired by the fact that visual contents are generally more important for comprehending videos, we propose a novel Visual Oriented Encoder (VOE) to integrate multimodal features in an interactive manner. Specifically, VOE is designed as a hierarchical structure, where the bottom layers are utilized to extract multi-scale contexts from auxiliary modalities while the top layer is exploited to generate joint representations by considering both visual and contextual information. Following the encoder-decoder framework, we systematically develop a VOE-LSTM model and evaluate it on two mainstream benchmarks: MSVD and MSR-VTT. Experimental results show that the proposed VOE surpasses conventional encoders and our VOE-LSTM model achieves competitive results compared with state-of-the-art approaches.

Progressive Learning Algorithm for Efficient Person Re-Identification

Zhen Li, Hanyang Shao, Liang Niu, Nian Xue

Auto-TLDR; Progressive Learning Algorithm for Large-Scale Person Re-Identification

This paper studies the problem of Person Re-Identification (ReID) for large-scale applications. Recent research efforts have been devoted to building complicated part models, which introduce considerable computational cost and memory consumption, inhibiting their practicability in large-scale applications. This paper aims to develop a novel learning strategy that finds efficient feature embeddings while balancing accuracy and model complexity. More specifically, we find that by enhancing the classical triplet loss together with a cross-entropy loss, our method can explore hard examples and build a discriminative feature embedding that is still compact enough for large-scale applications. Our method is carried out progressively using Bayesian optimization, and we call it the Progressive Learning Algorithm (PLA). Extensive experiments on three large-scale datasets show that our PLA is comparable to or better than the state of the art. Notably, on the challenging Market-1501 dataset, we achieve Rank-1 = 94.7% / mAP = 89.4% while saving at least 30% of the parameters compared with strong part models.
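The combined objective described above (cross-entropy plus an enhanced triplet loss) can be sketched as follows, here with a common batch-hard triplet formulation; the hard-example mining and loss weights used by PLA itself may differ.

```python
import torch
import torch.nn.functional as F

def reid_loss(embeddings, logits, labels, margin=0.3, w_triplet=1.0, w_ce=1.0):
    """Cross-entropy on ID logits plus a batch-hard triplet loss on the embeddings."""
    ce = F.cross_entropy(logits, labels)

    dist = torch.cdist(embeddings, embeddings)          # pairwise Euclidean distances
    same = labels.unsqueeze(0).eq(labels.unsqueeze(1))  # (B, B) same-identity mask
    hardest_pos = dist.masked_fill(~same, 0).max(dim=1).values            # furthest same-ID sample
    hardest_neg = dist.masked_fill(same, float('inf')).min(dim=1).values  # closest other-ID sample
    triplet = F.relu(hardest_pos - hardest_neg + margin).mean()

    return w_ce * ce + w_triplet * triplet
```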

Multi-Scale 2D Representation Learning for Weakly-Supervised Moment Retrieval

Ding Li, Rui Wu, Zhizhong Zhang, Yongqiang Tang, Wensheng Zhang

Auto-TLDR; Multi-scale 2D Representation Learning for Weakly Supervised Video Moment Retrieval

Video moment retrieval aims to find the moment most relevant to a given language query. However, most existing methods require temporal boundary annotations, which are expensive and time-consuming to label. Weakly supervised methods that use only coarse video-level labels have therefore been put forward recently. Despite their effectiveness, these methods usually process moment candidates independently, ignoring the natural temporal dependencies between candidates at different temporal scales. To cope with this issue, we propose a Multi-Scale 2D Representation Learning method for weakly supervised video moment retrieval. Specifically, we first construct a two-dimensional map for each temporal scale to capture the temporal dependencies between candidates, where the two dimensions indicate the start and end time points of the candidates. Then, we select the top-K candidates from each scale-varied map with a learnable convolutional neural network. With a newly designed Moments Evaluation Module, we obtain the alignment scores of the selected candidates. Finally, the similarity between captions and the language query serves as supervision for further training the candidate selector. Experiments on two benchmark datasets, Charades-STA and ActivityNet Captions, demonstrate that our approach achieves superior performance to state-of-the-art results.

Learning Neural Textual Representations for Citation Recommendation

Thanh Binh Kieu, Inigo Jauregi Unanue, Son Bao Pham, Xuan-Hieu Phan, M. Piccardi

Auto-TLDR; Sentence-BERT cascaded with Siamese and triplet networks for citation recommendation

With the rapid growth of the scientific literature, manually selecting appropriate citations for a paper is becoming increasingly challenging and time-consuming. While several approaches for automated citation recommendation have been proposed in recent years, effective document representations for citation recommendation remain largely elusive. For this reason, in this paper we propose a novel approach to citation recommendation which leverages a deep sequential representation of the documents (Sentence-BERT) cascaded with Siamese and triplet networks in a submodular scoring function. To the best of our knowledge, this is the first approach to combine deep representations and submodular selection for citation recommendation. Experiments have been carried out using a popular benchmark dataset, the ACL Anthology Network corpus, and evaluated against baselines and a state-of-the-art approach using metrics such as MRR and F1@k. The results show that the proposed approach outperforms all compared approaches on every measured metric.

Deep Top-Rank Counter Metric for Person Re-Identification

Chen Chen, Hao Dou, Xiyuan Hu, Silong Peng

Auto-TLDR; Deep Top-Rank Counter Metric for Person Re-identification

In the research field of person re-identification, deep metric learning that guides efficient and effective embedding learning serves as one of the most fundamental tasks. Recent loss-function-based deep metric learning methods mainly focus on optimizing top-rank accuracy by minimizing the distance difference between correctly matched sample pairs and wrongly matched sample pairs. However, it is more straightforward to count the occurrences of correct top-rank candidates and maximize that count for better top-rank accuracy. In this paper, we propose a generalized logistic function based metric that is practical and effective in deep learning, namely the "deep top-rank counter metric", to approximately optimize the counted occurrences of correct top-rank matches. The properties that qualify the proposed metric as a well-suited deep re-identification metric are discussed, and a progressive hard sample mining strategy is also introduced for effective training and performance boosting. Extensive experiments show that the proposed top-rank counter metric outperforms other loss-function-based deep metrics and achieves state-of-the-art accuracies.
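The idea of counting correct top-rank matches can be approximated with a smooth step function; below is a rough sketch using a (generalized) logistic sigmoid on score gaps. It illustrates the counting intuition only and is not the paper's exact metric.

```python
import torch

def soft_top_rank_count(pos_scores, neg_scores, k=10.0):
    """Smoothly counts, per query, whether the positive outranks every negative.
    sigmoid(k * gap) approximates the 0/1 step on the score gap.

    pos_scores: (Q,)  similarity of each query to its true match
    neg_scores: (Q, N) similarities of each query to N wrong candidates
    """
    gaps = pos_scores.unsqueeze(1) - neg_scores   # positive gap == correctly ordered pair
    per_pair = torch.sigmoid(k * gaps)            # soft indicator per (query, negative)
    per_query = per_pair.prod(dim=1)              # ~1 only if the positive beats all negatives
    return per_query.sum()                        # soft count of correct top-1 queries

def counter_metric_loss(pos_scores, neg_scores):
    """Maximizing the soft count == minimizing its negative."""
    return -soft_top_rank_count(pos_scores, neg_scores)
```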

Text Recognition in Real Scenarios with a Few Labeled Samples

Jinghuang Lin, Cheng Zhanzhan, Fan Bai, Yi Niu, Shiliang Pu, Shuigeng Zhou

Auto-TLDR; Few-shot Adversarial Sequence Domain Adaptation for Scene Text Recognition

Scene text recognition (STR) is still a hot research topic in the computer vision field due to its various applications. Existing works mainly focus on learning a general model with a huge number of synthetic text images to recognize unconstrained scene texts, and have achieved substantial progress. However, these methods are not quite applicable in many real-world scenarios where 1) high recognition accuracy is required, while 2) labeled samples are lacking. To tackle this challenging problem, this paper proposes a few-shot adversarial sequence domain adaptation (FASDA) approach to build sequence adaptation between the synthetic source domain (with many synthetic labeled samples) and a specific target domain (with only some or a few real labeled samples). This is done by simultaneously learning each character's feature representation with an attention mechanism and establishing the corresponding character-level latent subspace with adversarial learning. Our approach can maximize the character-level confusion between the source domain and the target domain, thus achieving sequence-level adaptation with even a small number of labeled samples in the target domain. Extensive experiments on various datasets show that our method significantly outperforms the fine-tuning scheme and obtains comparable performance to state-of-the-art STR methods.

Enlarging Discriminative Power by Adding an Extra Class in Unsupervised Domain Adaptation

Hai Tran, Sumyeong Ahn, Taeyoung Lee, Yung Yi

Auto-TLDR; Unsupervised Domain Adaptation using Artificial Classes

We study the problem of unsupervised domain adaptation, which aims at obtaining a prediction model for the target domain using labeled data from the source domain and unlabeled data from the target domain. There is an array of recent research based on the idea of extracting features that are not only invariant across both domains but also provide high discriminative power for the target domain. In this paper, we propose an idea for improving this discriminativeness: adding an extra artificial class and training the model on the given data together with GAN-generated samples of the new class. The model trained with the new class samples is capable of extracting features that are more discriminative, by repositioning data of the current classes in the target domain and thereby increasing the distances among the target clusters in the feature space. Our idea is highly generic, so it is compatible with many existing methods such as DANN, VADA, and DIRT-T. We conduct various experiments on the standard datasets commonly used for the evaluation of unsupervised domain adaptation and demonstrate that our algorithm achieves state-of-the-art performance in many scenarios.

Text Synopsis Generation for Egocentric Videos

Aidean Sharghi, Niels Lobo, Mubarak Shah

Auto-TLDR; Egocentric Video Summarization Using Multi-task Learning for End-to-End Learning

Mass utilization of body-worn cameras has led to a huge corpus of available egocentric video. Existing video summarization algorithms can accelerate browsing such videos by selecting (visually) interesting shots from them. Nonetheless, since the system user still has to watch the summary videos, browsing large video databases remains a challenge. Hence, in this work, we propose to generate a textual synopsis consisting of a few sentences describing the most important events in a long egocentric video. Users can read the short text to gain insight about the video and, more importantly, efficiently search through the content of a large video database using text queries. Since egocentric videos are long and contain many activities and events, applying video-to-text algorithms results in thousands of descriptions, many of which are incorrect. Therefore, we propose a multi-task learning scheme to simultaneously generate descriptions for video segments and summarize the resulting descriptions in an end-to-end fashion. We input a set of video shots, and the network generates a text description for each shot. Next, a visual-language content matching unit, trained with a weakly supervised objective, identifies the correct descriptions. Finally, the last component of our network, called the purport network, evaluates the descriptions all together to select the ones containing crucial information. Out of thousands of descriptions generated for the video, a few informative sentences are returned to the user. We validate our framework on the challenging UT Egocentric video dataset, where each video is between 3 and 5 hours long and is associated with over 3000 textual descriptions on average. The generated textual summaries, containing only 5 percent (or less) of the generated descriptions, are compared to ground-truth summaries in the text domain using well-established metrics from natural language processing.

Semantics to Space(S2S): Embedding Semantics into Spatial Space for Zero-Shot Verb-Object Query Inferencing

Sungmin Eum, Heesung Kwon

Auto-TLDR; Semantics-to-Space: Deep Zero-Shot Learning for Verb-Object Interaction with Vectors

We present a novel deep zero-shot learning (ZSL) model for inferencing human-object-interaction with verb-object (VO) query. While the previous two-stream ZSL approaches only use the semantic/textual information to be fed into the query stream, we seek to incorporate and embed the semantics into the visual representation stream as well. Our approach is powered by Semantics-to-Space (S2S) architecture where semantics derived from the residing objects are embedded into a spatial space of the visual stream. This architecture allows the co-capturing of the semantic attributes of the human and the objects along with their location/size/silhouette information. To validate, we have constructed a new dataset, Verb-Transferability 60 (VT60). VT60 provides 60 different VO pairs with overlapping verbs tailored for testing two-stream ZSL approaches with VO query. Experimental evaluations show that our approach not only outperforms the state-of-the-art, but also shows the capability of consistently improving performance regardless of which ZSL baseline architecture is used.

P ≈ NP, at Least in Visual Question Answering

Shailza Jolly, Sebastian Palacio, Joachim Folz, Federico Raue, Jörn Hees, Andreas Dengel

Auto-TLDR; Polar vs Non-Polar VQA: A Cross-over Analysis of Feature Spaces for Joint Training

In recent years, progress in the Visual Question Answering (VQA) field has largely been driven by public challenges and large datasets. One of the most widely-used of these is the VQA 2.0 dataset, consisting of polar ("yes/no") and non-polar questions. Looking at the question distribution over all answers, we find that the answers "yes" and "no" account for 38% of the questions, while the remaining 62% are spread over the more than 3000 remaining answers. While several sources of biases have already been investigated in the field, the effects of such an over-representation of polar vs. non-polar questions remain unclear. In this paper, we measure the potential confounding factors when polar and non-polar samples are used jointly to train a baseline VQA classifier, and compare it to an upper bound where the over-representation of polar questions is excluded from the training. Further, we perform cross-over experiments to analyze how well the feature spaces align. Contrary to expectations, we find no evidence of counterproductive effects in the joint training of unbalanced classes. In fact, by exploring the intermediate feature space of visual-text embeddings, we find that the feature space of polar questions already encodes sufficient structure to answer many non-polar questions. Our results indicate that the polar (P) and the non-polar (NP) feature spaces are strongly aligned, hence the expression P ≈ NP.

Attention-Based Deep Metric Learning for Near-Duplicate Video Retrieval

Kuan-Hsun Wang, Chia Chun Cheng, Yi-Ling Chen, Yale Song, Shang-Hong Lai

Auto-TLDR; Attention-based Deep Metric Learning for Near-duplicate Video Retrieval

Near-duplicate video retrieval (NDVR) is an important and challenging problem due to the increasing amount of videos uploaded to the Internet. In this paper, we propose an attention-based deep metric learning method for NDVR. Our method is based on well-established principles: we leverage two-stream networks to combine RGB and optical flow features, and incorporate an attention module to effectively deal with distractor frames commonly observed in near-duplicate videos. We further aggregate the features corresponding to multiple video segments to enhance the discriminative power. The whole system is trained using a deep metric learning objective with a Siamese architecture. Our experiments show that the attention module helps eliminate redundant and noisy frames, while focusing on visually relevant frames for solving NDVR. We evaluate our approach on recent large-scale NDVR datasets, CC_WEB_VIDEO, VCDB, FIVR and SVD. To demonstrate the generalization ability of our approach, we report results in both within- and cross-dataset settings, and show that the proposed method significantly outperforms state-of-the-art approaches.
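A minimal single-stream sketch of attention-weighted frame aggregation followed by a triplet objective is shown below; it assumes pre-extracted frame features and omits the paper's two-stream RGB/optical-flow design and segment-level aggregation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveVideoEmbedding(nn.Module):
    """Aggregates per-frame features into one video embedding, down-weighting
    distractor frames with a learned attention score."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, frames):                                      # frames: (B, T, D)
        w = torch.softmax(self.score(frames).squeeze(-1), dim=1)   # (B, T) frame weights
        video = (w.unsqueeze(-1) * frames).sum(dim=1)               # (B, D) video embedding
        return F.normalize(video, dim=-1)

def ndvr_triplet_loss(anchor, positive, negative, margin=0.2):
    """Siamese/triplet objective on aggregated video embeddings."""
    d_pos = 1 - F.cosine_similarity(anchor, positive, dim=-1)
    d_neg = 1 - F.cosine_similarity(anchor, negative, dim=-1)
    return F.relu(d_pos - d_neg + margin).mean()
```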

Unsupervised Co-Segmentation for Athlete Movements and Live Commentaries Using Crossmodal Temporal Proximity

Yasunori Ohishi, Yuki Tanaka, Kunio Kashino

Auto-TLDR; A guided attention scheme for audio-visual co-segmentation

Audio-visual co-segmentation is a task to extract segments and regions corresponding to specific events from unlabelled audio and video signals. It is particularly important to accomplish this in an unsupervised way, since it is generally very difficult to manually label all the objects and events appearing in audio-visual signals for supervised learning. Here, we propose to take advantage of the temporal proximity of corresponding audio and video entities included in the signals. For this purpose, we newly employ a guided attention scheme for this task to efficiently detect and utilize temporal co-occurrences of audio and video information. Experiments using a real TV broadcast of Sumo wrestling, a sporting event, with live commentary show that our model can automatically extract specific athlete movements and their spoken descriptions in an unsupervised manner.

Nonlinear Ranking Loss on Riemannian Potato Embedding

Byung Hyung Kim, Yoonje Suh, Honggu Lee, Sungho Jo

Auto-TLDR; Riemannian Potato for Rank-based Metric Learning

We propose a rank-based metric learning method by leveraging a concept of the Riemannian Potato for better separating non-linear data. By exploring the geometric properties of Riemannian manifolds, the proposed loss function optimizes the measure of dispersion using the distribution of Riemannian distances between a reference sample and neighbors and builds a ranked list according to the similarities. We show the proposed function can learn a hypersphere for each class, preserving the similarity structure inside it on Riemannian manifold. As a result, compared with Euclidean distance-based metric, our method can further jointly reduce the intra-class distances and enlarge the inter-class distances for learned features, consistently outperforming state-of-the-art methods on three widely used non-linear datasets.
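For reference, a common choice of Riemannian distance on SPD matrices (the affine-invariant metric) and a ranked-neighbour list can be sketched as follows; the paper's actual manifold, loss, and hypersphere construction are not reproduced here.

```python
import torch

def _sym_matrix_power(mat, p):
    """Matrix power of a symmetric positive-definite matrix via eigendecomposition."""
    vals, vecs = torch.linalg.eigh(mat)
    return vecs @ torch.diag(vals.clamp(min=1e-10).pow(p)) @ vecs.t()

def riemannian_distance(A, B):
    """Affine-invariant distance between SPD matrices: ||log(A^{-1/2} B A^{-1/2})||_F."""
    inv_sqrt = _sym_matrix_power(A, -0.5)
    M = inv_sqrt @ B @ inv_sqrt
    vals, _ = torch.linalg.eigh(M)
    return vals.clamp(min=1e-10).log().pow(2).sum().sqrt()

def ranked_neighbors(ref, candidates):
    """Ranked list of candidates by Riemannian distance to the reference sample."""
    d = torch.stack([riemannian_distance(ref, c) for c in candidates])
    return torch.argsort(d)
```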

Few-Shot Font Generation with Deep Metric Learning

Haruka Aoki, Koki Tsubota, Hikaru Ikuta, Kiyoharu Aizawa

Auto-TLDR; Deep Metric Learning for Japanese Typographic Font Synthesis

Designing fonts for languages with a large number of characters, such as Japanese and Chinese, is an extremely labor-intensive and time-consuming task. In this study, we address the problem of automatically generating Japanese typographic fonts from only a few font samples, where the synthesized glyphs are expected to have coherent characteristics, such as skeletons, contours, and serifs. Existing methods often fail to generate fine glyph images when the number of style reference glyphs is extremely limited. We propose a simple but powerful framework for extracting better style features, which introduces deep metric learning to the style encoder. We perform experiments using black-and-white and shape-distinctive font datasets and demonstrate the effectiveness of the proposed framework.

Multi-Level Deep Learning Vehicle Re-Identification Using Ranked-Based Loss Functions

Eleni Kamenou, Jesus Martinez-Del-Rincon, Paul Miller, Patricia Devlin-Hill

Auto-TLDR; Multi-Level Re-identification Network for Vehicle Re-Identification

Identifying vehicles across a network of cameras with non-overlapping fields of view remains a challenging research problem due to scene occlusions, significant inter-class similarity and intra-class variability. In this paper, we propose an end-to-end multi-level re-identification network that projects vehicles of the same identity closer to one another in the embedding space than vehicles of different identities. Robust feature representations are obtained by combining features at multiple levels of the network. For the learning process, we employ a recent state-of-the-art structured metric learning loss function previously applied to other retrieval problems and adapt it to the vehicle re-identification task. Furthermore, we explore image-to-image, image-to-video and video-to-video similarity metrics. Finally, we evaluate our system and achieve strong performance on two large-scale publicly available datasets, CityFlow-ReID and VeRi-776. Compared to most existing state-of-the-art approaches, our approach is simpler and more straightforward, using only identity-level annotations and avoiding post-processing of the ranking results (re-ranking) at test time.

Domain Generalized Person Re-Identification Via Cross-Domain Episodic Learning

Ci-Siang Lin, Yuan Chia Cheng, Yu-Chiang Frank Wang

Auto-TLDR; Domain-Invariant Person Re-identification with Episodic Learning

Person re-identification (re-ID), which aims at recognizing images of the same person across distinct camera views, has been an active research topic in computer vision. Most existing re-ID works require the collection of a large amount of labeled image data from the scenes of interest. When the data to be recognized differ from the source-domain training data, a number of domain adaptation approaches have been proposed. Nevertheless, one still needs to collect labeled or unlabeled target-domain data during training. In this paper, we tackle an even more challenging and practical setting, domain generalized (DG) person re-ID: while a number of labeled source-domain datasets are available, we do not have access to any target-domain training data. In order to learn domain-invariant features without knowing the target domain of interest, we present an episodic learning scheme that advances meta-learning strategies to exploit the observed source-domain labeled data. The learned features exhibit sufficient domain-invariant properties while not overfitting the source-domain data or ID labels. Our experiments on four benchmark datasets confirm the superiority of our method over state-of-the-art approaches.
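
A minimal sketch of what an episodic split over labeled source domains might look like (names and data structures here are illustrative, not the authors' implementation):

```python
import random

def sample_episode(source_domains):
    """Hold out one source domain as a pseudo-unseen domain for this episode.

    `source_domains` maps a domain name to its labeled dataset (assumed structure).
    """
    held_out = random.choice(list(source_domains))
    meta_train = {k: v for k, v in source_domains.items() if k != held_out}
    meta_test = {held_out: source_domains[held_out]}
    return meta_train, meta_test

# Each episode updates the model on meta_train and additionally penalizes it on
# meta_test, encouraging features that generalize to a domain unseen in that episode.
```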

Multi-Modal Contextual Graph Neural Network for Text Visual Question Answering

Yaoyuan Liang, Xin Wang, Xuguang Duan, Wenwu Zhu

Auto-TLDR; Multi-modal Contextual Graph Neural Network for Text Visual Question Answering

Text visual question answering (TextVQA) targets answering questions related to text appearing in the given images, posing more challenges than VQA by requiring a deeper recognition and understanding of human-readable scene text in its various shapes, as well as its meaning in different contexts. Existing works on TextVQA suffer from two weaknesses: i) scene text and non-textual objects are processed separately and independently, without considering their mutual interactions during the question understanding and answering process; ii) scene text is encoded only through word embeddings, without taking into account the corresponding visual appearance features or its potential relationships with other non-textual objects in the images. To overcome the weaknesses of existing works, we propose a novel multi-modal contextual graph neural network (MCG) model for TextVQA. The proposed MCG model captures the relationships between visual features of scene text and non-textual objects in the given images and utilizes richer sources of multi-modal features to improve model performance. In particular, we encode the scene text into richer features containing textual, visual and positional information, and then model the visual relations between scene text and non-textual objects through a contextual graph neural network. Our extensive experiments on a real-world dataset demonstrate the advantages of the proposed MCG model over baseline approaches.
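
The contextual reasoning step can be pictured as ordinary message passing over a graph whose nodes carry the fused scene-text and object features; the layer below is a generic GCN-style sketch, not the paper's exact MCG block.

```python
import torch
import torch.nn as nn

class ContextualGraphLayer(nn.Module):
    """One round of mean-aggregation message passing over text/object nodes."""

    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, node_feats, adj):
        # node_feats: (N, D) fused textual+visual+positional features (assumed precomputed)
        # adj: (N, N) adjacency matrix with self-loops
        deg = adj.sum(dim=-1, keepdim=True).clamp_min(1.0)
        messages = (adj @ node_feats) / deg        # average neighbor features
        return torch.relu(self.linear(messages))
```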

Open Set Domain Recognition Via Attention-Based GCN and Semantic Matching Optimization

Xinxing He, Yuan Yuan, Zhiyu Jiang

Auto-TLDR; Attention-based GCN and Semantic Matching Optimization for Open Set Domain Recognition

Open set domain recognition has drawn increasing attention in recent years. The task is to classify each sample in a practical, unlabeled target domain that consists of all the known classes from a manually labeled source domain as well as target-specific unknown categories. The absence of annotated training data or auxiliary attribute information for the unknown categories makes this task especially difficult. Moreover, the existing domain discrepancy in label space and data distribution further hinders the transfer of knowledge from known classes to unknown classes. To address these issues, this work presents an end-to-end model based on an attention-based GCN and semantic matching optimization, which first employs the attention mechanism to enable the central node to learn more discriminative representations from its neighbors in the knowledge graph. Moreover, a coarse-to-fine semantic matching optimization approach is proposed to progressively bridge the domain gap. Experimental results validate that the proposed model is not only superior at recognizing images of both known and unknown classes, but also adapts to various degrees of openness of the target domain.

Price Suggestion for Online Second-Hand Items

Liang Han, Zhaozheng Yin, Zhurong Xia, Li Guo, Mingqian Tang, Rong Jin

Auto-TLDR; An Intelligent Price Suggestion System for Online Second-hand Items

This paper describes an intelligent price suggestion system for online second-hand listings. In contrast to conventional pricing strategies, which are applied to a large number of identical products, or to non-identical but similar products such as homes on Airbnb, the proposed system provides price suggestions for online second-hand items that are non-identical and fall into numerous different categories. Moreover, simplifying the item listing process for users is taken into consideration when designing the price suggestion system. Specifically, we design a truncate loss to train a vision-based price suggestion module, which takes mainly vision-based features as input to first classify whether an uploaded item image is qualified for price suggestion, and then offers price suggestions for items with qualified images. For items with unqualified images, we encourage users to input text descriptions of the items, with which a multimodal item retrieval module offers price suggestions. Extensive experiments demonstrate the effectiveness of the proposed system.

MEG: Multi-Evidence GNN for Multimodal Semantic Forensics

Ekraam Sabir, Ayush Jaiswal, Wael Abdalmageed, Prem Natarajan

Auto-TLDR; Scalable Image Repurposing Detection with Graph Neural Network Based Model

Image repurposing is a category of fake news where a digitally unmanipulated image is misrepresented by means of its accompanying metadata such as captions, location, etc., where the image and accompanying metadata together comprise a multimedia package. The problem setup is to authenticate a query multimedia package using a reference dataset of potentially related packages as evidences. Existing methods are limited to using a single evidence (retrieved package), which ignores potential performance improvement from the use of multiple evidences. In this work, we introduce a novel graph neural network based model for image repurposing detection, which effectively utilizes multiple retrieved packages as evidences and is scalable with the number of evidences. We compare the scalability and performance of our model against existing methods. Experimental results show that the proposed model outperforms existing state-of-the-art for image repurposing detection with an error reduction of up to 25%.
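
To make the multi-evidence idea concrete, the sketch below attention-pools a variable number of retrieved-evidence embeddings conditioned on the query package; it is an illustrative stand-in for the paper's GNN-based aggregation, with invented module and dimension names.

```python
import torch
import torch.nn as nn

class EvidenceAggregator(nn.Module):
    """Attention-pool K evidence embeddings into one vector, conditioned on the query."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, query, evidences):
        # query: (D,) embedding of the package to authenticate; evidences: (K, D), K varies
        expanded = query.unsqueeze(0).expand_as(evidences)
        weights = torch.softmax(self.score(torch.cat([expanded, evidences], dim=-1)), dim=0)
        return (weights * evidences).sum(dim=0)    # (D,) aggregated evidence representation
```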

Supervised Domain Adaptation Using Graph Embedding

Lukas Hedegaard, Omar Ali Sheikh-Omar, Alexandros Iosifidis

Auto-TLDR; Domain Adaptation from the Perspective of Multi-view Graph Embedding and Dimensionality Reduction

Getting deep convolutional neural networks to perform well requires a large amount of training data. When the amount of available labelled data is small, it is often beneficial to use transfer learning to leverage a related larger dataset (source) in order to improve performance on the small dataset (target). Among transfer learning approaches, domain adaptation methods assume that the distributions of the two domains are shifted and attempt to realign them. In this paper, we consider the domain adaptation problem from the perspective of multi-view graph embedding and dimensionality reduction. Instead of solving the generalised eigenvalue problem to perform the embedding, we formulate the graph-preserving criterion as a loss in the neural network and learn a domain-invariant feature transformation in an end-to-end fashion. We show that the proposed approach leads to a powerful domain adaptation framework which generalises the prior methods CCSA and d-SNE and enables simple and effective loss designs; an LDA-inspired instantiation of the framework achieves performance on par with the state-of-the-art on the most widely used domain adaptation benchmarks, Office31 and MNIST-to-USPS.
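
The graph-preserving criterion can be thought of as a pairwise pull/push loss between source and target embeddings (the spirit of CCSA and d-SNE); the function below is a simplified sketch under that reading, not the paper's general framework.

```python
import torch
import torch.nn.functional as F

def graph_preserving_loss(src_feats, src_labels, tgt_feats, tgt_labels, margin=1.0):
    """Attract same-class cross-domain pairs, repel different-class pairs up to a margin."""
    d = torch.cdist(src_feats, tgt_feats)                         # (Ns, Nt) pairwise distances
    same = (src_labels.unsqueeze(1) == tgt_labels.unsqueeze(0)).float()
    pull = (same * d.pow(2)).sum() / same.sum().clamp_min(1.0)
    push = ((1 - same) * F.relu(margin - d).pow(2)).sum() / (1 - same).sum().clamp_min(1.0)
    return pull + push
```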

Adversarially Constrained Interpolation for Unsupervised Domain Adaptation

Mohamed Azzam, Aurele Tohokantche Gnanha, Hau-San Wong, Si Wu

Auto-TLDR; Unsupervised Domain Adaptation with Domain Mixup Strategy

We address the problem of unsupervised domain adaptation (UDA), which aims at adapting models trained on a labeled domain to a completely unlabeled domain. One way to achieve this goal is to learn a domain-invariant representation. However, this approach faces two challenges: samples from the two domains are insufficient to guarantee domain invariance over most of the latent space, and neighboring samples from the target domain may not belong to the same class on the low-dimensional manifold. To mitigate these shortcomings, we propose two strategies. First, we incorporate a domain mixup strategy into a domain-adversarial learning model by linearly interpolating between source- and target-domain samples. This keeps the latent space continuous and improves domain matching. Second, the domain discriminator is regularized by judging the relative difference between the two domains for the mixed-up input features, which speeds up domain matching. Experimental results show that our proposed model achieves superior performance on different tasks under various domain shifts and data complexities.
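
A minimal sketch of the domain-mixup step, assuming paired source/target batches of equal shape and a Beta-distributed mixing coefficient (standard mixup choices, not necessarily the authors'):

```python
import torch

def domain_mixup(src_batch, tgt_batch, alpha=0.2):
    """Linearly interpolate source and target samples and return a soft domain label."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    mixed = lam * src_batch + (1 - lam) * tgt_batch  # interpolated inputs
    return mixed, lam                                # lam in [0, 1] acts as a soft domain label

# The domain discriminator can then be trained to predict this soft label for the
# mixed features, regularizing it and smoothing the latent space between domains.
```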

Attentive Visual Semantic Specialized Network for Video Captioning

Jesus Perez-Martin, Benjamin Bustos, Jorge Pérez

Auto-TLDR; Adaptive Visual Semantic Specialized Network for Video Captioning

As an essential high-level video understanding task, automatically describing a video in natural language has recently gained attention as a fundamental challenge in computer vision. Previous models for video captioning have several limitations, such as gaps in current semantic representations and the limited expressiveness of the generated captions. To address these limitations, in this paper we present a new architecture that we call the Attentive Visual Semantic Specialized Network (AVSSN), an encoder-decoder model based on our Adaptive Attention Gate and Specialized LSTM layers. This architecture can selectively decide when to use visual or semantic information in the text generation process. The adaptive gate enables the decoder to automatically select the relevant information, providing a better temporal state representation than existing decoders. In addition, the model is capable of learning to improve the expressiveness of the generated captions with respect to their length, using a sentence-length-related loss function. We evaluate the effectiveness of the proposed approach on the Microsoft Video Description (MSVD) and Microsoft Research Video-to-Text (MSR-VTT) datasets, achieving state-of-the-art performance on several popular evaluation metrics: BLEU-4, METEOR, CIDEr, and ROUGE_L.
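
The adaptive gating idea can be sketched as a learned scalar that blends visual and semantic context vectors given the decoder state; the dimensions and fusion form below are assumptions, not the AVSSN specification.

```python
import torch
import torch.nn as nn

class AdaptiveGate(nn.Module):
    """Blend visual and semantic context for the next decoding step."""

    def __init__(self, hidden_dim, feat_dim):
        super().__init__()
        self.gate = nn.Linear(hidden_dim + 2 * feat_dim, 1)

    def forward(self, decoder_state, visual_ctx, semantic_ctx):
        g = torch.sigmoid(self.gate(torch.cat([decoder_state, visual_ctx, semantic_ctx], dim=-1)))
        return g * visual_ctx + (1 - g) * semantic_ctx  # gated context fed to the decoder
```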

Sequential Domain Adaptation through Elastic Weight Consolidation for Sentiment Analysis

Avinash Madasu, Anvesh Rao Vijjini

Auto-TLDR; Sequential Domain Adaptation using Elastic Weight Consolidation for Sentiment Analysis

Elastic Weight Consolidation (EWC) is a technique for overcoming catastrophic forgetting between successive tasks trained on a neural network. We use this phenomenon of information sharing between tasks for domain adaptation. Training data for tasks such as sentiment analysis (SA) may not be fairly represented across multiple domains. Domain adaptation (DA) aims to build algorithms that leverage information from source domains to facilitate performance on an unseen target domain. We propose a model-independent framework, Sequential Domain Adaptation (SDA). SDA draws on EWC for training on successive source domains to move towards a general domain solution, thereby addressing the problem of domain adaptation. We test SDA on convolutional, recurrent and attention-based architectures. Our experiments show that the proposed framework enables simple architectures such as CNNs to outperform complex state-of-the-art models in domain adaptation for SA. We further observe that a harder-first, anti-curriculum ordering of source domains leads to the best performance.
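
For reference, the EWC regularizer itself is straightforward: it penalizes parameter drift from the values learned on previous source domains, weighted by an estimate of their (diagonal) Fisher information. The sketch below shows that penalty only; how SDA schedules domains around it is described in the paper, not here.

```python
import torch

def ewc_penalty(model, fisher, old_params, lam=1.0):
    """EWC regularizer; `fisher` and `old_params` are dicts keyed by parameter name."""
    penalty = torch.zeros(())
    for name, param in model.named_parameters():
        if name in fisher:
            penalty = penalty + (fisher[name] * (param - old_params[name]).pow(2)).sum()
    return 0.5 * lam * penalty

# Training on the next source domain then minimizes:
#   task_loss + ewc_penalty(model, fisher, old_params)
```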