Webly Supervised Image-Text Embedding with Noisy Tag Refinement

Niluthpol Mithun, Ravdeep Pasricha, Evangelos Papalexakis, Amit Roy-Chowdhury

Auto-TLDR; Robust Joint Embedding for Image-Text Retrieval Using Web Images

In this paper, we address the problem of utilizing web images to train robust joint embedding models for the image-text retrieval task. Prior webly supervised approaches directly leverage weakly annotated web images in the joint embedding learning framework. The objective of these approaches suffers significantly when the ratio of noisy and missing tags associated with the web images is very high. In this regard, we propose a CP-decomposition-based tensor completion framework to refine the tags of web images by modeling the observed ternary inter-relations between the sets of labeled images, tags, and web images as a tensor. To effectively deal with the high ratio of missing entries likely in our case, we incorporate intra-modal correlation as side information in the proposed framework. Our tag refinement approach, combined with existing webly supervised image-text embedding approaches, provides a more principled way to learn joint embedding models in the presence of significant noise from web data and limited clean labeled data. Experiments on benchmark datasets demonstrate that the proposed approach achieves a significant performance gain in image-text retrieval.
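
To make the completion step concrete, here is a minimal sketch (not the paper's exact formulation, which additionally incorporates intra-modal correlation as side information) that fits a masked rank-R CP model to the observed entries of the images x tags x web-images tensor by gradient descent; the function name and hyperparameters are illustrative only.

```python
import numpy as np

def cp_complete(T, mask, rank=10, lr=0.01, iters=500, lam=0.1, seed=0):
    """Sketch: recover missing entries of a 3-way tensor T (images x tags x
    web-images) by fitting rank-R CP factors A, B, C to observed entries only."""
    rng = np.random.default_rng(seed)
    I, J, K = T.shape
    A = rng.standard_normal((I, rank)) * 0.1
    B = rng.standard_normal((J, rank)) * 0.1
    C = rng.standard_normal((K, rank)) * 0.1
    for _ in range(iters):
        # CP reconstruction: T_hat[i,j,k] = sum_r A[i,r] * B[j,r] * C[k,r]
        T_hat = np.einsum('ir,jr,kr->ijk', A, B, C)
        E = mask * (T_hat - T)            # error on observed entries only
        # Gradients of 0.5*||mask*(T_hat - T)||^2 + 0.5*lam*||.||^2 per factor
        gA = np.einsum('ijk,jr,kr->ir', E, B, C) + lam * A
        gB = np.einsum('ijk,ir,kr->jr', E, A, C) + lam * B
        gC = np.einsum('ijk,ir,jr->kr', E, A, B) + lam * C
        A -= lr * gA
        B -= lr * gB
        C -= lr * gC
    return np.einsum('ir,jr,kr->ijk', A, B, C)   # refined (completed) tensor
```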

Similar papers

JECL: Joint Embedding and Cluster Learning for Image-Text Pairs

Sean Yang, Kuan-Hao Huang, Bill Howe

Auto-TLDR; JECL: Clustering Image-Caption Pairs with Parallel Encoders and Regularized Clusters

We propose JECL, a method for clustering image-caption pairs by training parallel encoders with regularized clustering and alignment objectives, simultaneously learning both representations and cluster assignments. Such image-caption pairs arise frequently in high-value applications where structured training data is expensive to produce but free-text descriptions are common. JECL trains by minimizing the Kullback-Leibler divergence between the distributions of the images and text and a combined joint target distribution, and by optimizing the Jensen-Shannon divergence between the soft cluster assignments of the images and text. Regularizers are also applied to prevent trivial solutions. Experiments show that JECL outperforms both single-view and multi-view methods on large benchmark image-caption datasets, and is remarkably robust to missing captions and varying data sizes.
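
A rough sketch of the two objectives on a batch of soft assignments might look as follows; the direction of each KL term and the construction of the joint target are assumptions, since the abstract does not pin them down.

```python
import torch
import torch.nn.functional as F

def jecl_losses(p_img, p_txt, p_target, eps=1e-8):
    """Sketch of the two JECL objectives on soft cluster assignments.
    p_img, p_txt: (batch, K) soft assignments from each encoder;
    p_target: (batch, K) combined joint target distribution (its exact
    construction follows the paper and is not reproduced here)."""
    # KL(target || view) for each view, DEC-style self-training
    kl = F.kl_div((p_img + eps).log(), p_target, reduction='batchmean') \
       + F.kl_div((p_txt + eps).log(), p_target, reduction='batchmean')
    # Jensen-Shannon divergence pulls the two views' assignments together
    m = 0.5 * (p_img + p_txt)
    js = 0.5 * F.kl_div((m + eps).log(), p_img, reduction='batchmean') \
       + 0.5 * F.kl_div((m + eps).log(), p_txt, reduction='batchmean')
    return kl, js
```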

A CNN-RNN Framework for Image Annotation from Visual Cues and Social Network Metadata

Tobia Tesan, Pasquale Coscia, Lamberto Ballan

Auto-TLDR; Context-Based Image Annotation with Multiple Semantic Embeddings and Recurrent Neural Networks

Images represent a commonly used form of visual communication among people. Nevertheless, image classification may be a challenging task when dealing with unclear or uncommon images that need more context to be correctly annotated. Metadata accompanying images on social media represent an ideal source of additional information for retrieving proper neighborhoods that ease the image annotation task. To this end, we blend visual features extracted from neighbors and their metadata to jointly leverage context and visual cues. Our models use multiple semantic embeddings to achieve the dual objective of being robust to vocabulary changes between train and test sets and of decoupling the architecture from the low-level metadata representation. Convolutional and recurrent neural networks (CNNs-RNNs) are jointly adopted to infer similarity among neighbors and query images. We perform comprehensive experiments on the NUS-WIDE dataset showing that our models outperform state-of-the-art architectures based on images and metadata, and decrease both sensory and semantic gaps to better annotate images.

Beyond the Deep Metric Learning: Enhance the Cross-Modal Matching with Adversarial Discriminative Domain Regularization

Li Ren, Kai Li, Liqiang Wang, Kien Hua

Auto-TLDR; Adversarial Discriminative Domain Regularization for Efficient Cross-Modal Matching

Matching information across image and text modalities is a fundamental challenge for many applications that involve both vision and natural language processing. The objective is to find efficient similarity metrics for comparing visual and textual information. Existing approaches mainly match local visual objects and sentence words in a shared space with attention mechanisms. The matching performance is still limited because the similarity computation is based on simple comparisons of the matching features, ignoring the characteristics of their distribution in the data. In this paper, we address this limitation with an efficient learning objective that considers the discriminative feature distributions between the visual objects and sentence words. Specifically, we propose a novel Adversarial Discriminative Domain Regularization (ADDR) learning framework that goes beyond the standard metric learning objective by constructing a set of discriminative data domains within each image-text pair. Our approach can generally improve the learning efficiency and performance of existing metric learning frameworks by regulating the distribution of the hidden space between the matching pairs. The experimental results show that this new approach significantly improves the overall performance of several popular cross-modal matching techniques (SCAN, VSRN, BFAN) on the MS-COCO and Flickr30K benchmarks.
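
The abstract does not spell out the ADDR construction, but a generic adversarial regularizer of this flavor, a discriminator that tries to separate the two modalities in the shared space while the embedding networks try to fool it, can be sketched as follows; all module names and sizes are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityDiscriminator(nn.Module):
    """Tries to tell image features from text features in the shared space."""
    def __init__(self, dim=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)   # one logit per feature vector

def adversarial_regularizer(disc, img_emb, txt_emb):
    """Loss term for the embedding networks: fool the discriminator so the two
    feature distributions become indistinguishable (labels are flipped here;
    the discriminator itself is trained separately with the true labels)."""
    d_img, d_txt = disc(img_emb), disc(txt_emb)
    return F.binary_cross_entropy_with_logits(d_img, torch.zeros_like(d_img)) \
         + F.binary_cross_entropy_with_logits(d_txt, torch.ones_like(d_txt))
```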

Embedding Shared Low-Rank and Feature Correlation for Multi-View Data Analysis

Zhan Wang, Lizhi Wang, Hua Huang

Auto-TLDR; embedding shared low-rank and feature correlation for multi-view data analysis

The diversity of multimedia data in the real world usually takes the form of multi-view features. How to explore the structure information and correlations among multi-view features is still an open problem. In this paper, we propose a novel multi-view subspace learning method, named embedding shared low-rank and feature correlation (ESLRFC), for multi-view data analysis. First, in the embedding subspace, we propose a robust low-rank model on each feature set and enforce a shared low-rank constraint to characterize the common structure information of multiple feature data. Second, we develop an enhanced correlation analysis in the embedding subspace for simultaneously removing the redundancy of each feature set and exploring the correlations of multiple feature data. Finally, we incorporate the low-rank model and the correlation analysis into a unified framework. The shared low-rank constraint not only depicts the data distribution consistency among multiple feature data, but also assists robust subspace learning. Experimental results on recognition tasks demonstrate the superior performance and noise robustness of the proposed method.

Transformer Reasoning Network for Image-Text Matching and Retrieval

Nicola Messina, Fabrizio Falchi, Andrea Esuli, Giuseppe Amato

Auto-TLDR; A Transformer Encoder Reasoning Network for Image-Text Matching in Large-Scale Information Retrieval

Image-text matching is a fascinating task in modern AI research. Despite the evolution of deep-learning-based image and text processing systems, multi-modal matching remains a challenging problem. In this work, we consider the problem of accurate image-text matching for the task of multi-modal large-scale information retrieval. State-of-the-art results in image-text matching are achieved by intertwining image and text features from the two different processing pipelines, usually through mutual attention mechanisms. However, this forfeits any chance to extract the separate visual and textual features needed for later indexing steps in large-scale retrieval systems. In this regard, we introduce the Transformer Encoder Reasoning Network (TERN), an architecture built upon one of the modern relationship-aware self-attentive architectures, the Transformer Encoder (TE). This architecture is able to reason separately about the two different modalities and to enforce a final common abstract concept space by sharing the weights of the deeper transformer layers. Thanks to this design, the network produces compact and very rich visual and textual features available for the successive indexing step. Experiments are conducted on the MS-COCO dataset, and we evaluate the results using a discounted cumulative gain metric with relevance computed from caption similarities, in order to assess possibly non-exact but relevant search results. We demonstrate that on this metric we achieve state-of-the-art results in the image retrieval task. Our code is freely available at https://github.com/mesnico/TERN.
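
The weight-sharing idea can be sketched with standard transformer encoder layers: modality-specific lower stacks followed by a single shared upper stack applied to both pipelines. Layer counts, dimensions, and the use of the first token as a global descriptor are assumptions, not TERN's exact configuration.

```python
import torch
import torch.nn as nn

class TERNSketch(nn.Module):
    """Minimal sketch of the TERN idea: modality-specific transformer encoders
    for regions and words, with the deeper layers' weights shared so that both
    pipelines map into a common abstract space."""
    def __init__(self, dim=512, heads=8, sep_layers=2, shared_layers=2):
        super().__init__()
        mk = lambda n: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads), num_layers=n)
        self.visual = mk(sep_layers)     # reasons over region features
        self.textual = mk(sep_layers)    # reasons over word features
        self.shared = mk(shared_layers)  # same weights applied to both modalities

    def forward(self, regions, words):
        # Inputs: (seq_len, batch, dim); the first token acts as a global summary
        v = self.shared(self.visual(regions))[0]   # compact visual feature
        t = self.shared(self.textual(words))[0]    # compact textual feature
        return v, t
```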

VSR++: Improving Visual Semantic Reasoning for Fine-Grained Image-Text Matching

Hui Yuan, Yan Huang, Dongbo Zhang, Zerui Chen, Wenlong Cheng, Liang Wang

Auto-TLDR; Improving Visual Semantic Reasoning for Fine-Grained Image-Text Matching

Image-text matching has made great progress recently, but challenges still remain in fine-grained matching. To deal with this problem, we propose an Improved Visual Semantic Reasoning model (VSR++), which jointly models 1) global alignment between images and texts and 2) local correspondence between regions and words in a unified framework. To exploit their complementary advantages, we also develop a suitable learning strategy to balance their relative importance. As a result, our model can distinguish image regions and text words at a fine-grained level, and thus achieves the current state-of-the-art performance on two benchmark datasets.

A Novel Attention-Based Aggregation Function to Combine Vision and Language

Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

Auto-TLDR; Fully-Attentive Reduction for Vision and Language

The joint understanding of vision and language has been recently gaining a lot of attention in both the Computer Vision and Natural Language Processing communities, with the emergence of tasks such as image captioning, image-text matching, and visual question answering. As both images and text can be encoded as sets or sequences of elements - like regions and words - proper reduction functions are needed to transform a set of encoded elements into a single response, like a classification or similarity score. In this paper, we propose a novel fully-attentive reduction method for vision and language. Specifically, our approach computes a set of scores for each element of each modality employing a novel variant of cross-attention, and performs a learnable and cross-modal reduction, which can be used for both classification and ranking. We test our approach on image-text matching and visual question answering, building fair comparisons with other reduction choices, on both COCO and VQA 2.0 datasets. Experimentally, we demonstrate that our approach leads to a performance increase on both tasks. Further, we conduct ablation studies to validate the role of each component of the approach.
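
A minimal sketch of such a learnable score-based reduction, conditioning each element's score on a summary of the other modality, is given below; the paper's cross-attention variant is richer than this, and all names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class AttentiveReduction(nn.Module):
    """Sketch of a learnable reduction: score each element of a set (regions
    or words), softmax the scores, and take the weighted sum. Here the scores
    are conditioned on a context vector summarizing the other modality."""
    def __init__(self, dim=512):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, elems, ctx):
        # elems: (batch, n, dim); ctx: (batch, dim) summary of the other modality
        ctx = ctx.unsqueeze(1).expand(-1, elems.size(1), -1)
        w = torch.softmax(self.score(torch.cat([elems, ctx], -1)), dim=1)
        return (w * elems).sum(dim=1)    # one reduced vector per sample
```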

Fast Discrete Cross-Modal Hashing Based on Label Relaxation and Matrix Factorization

Donglin Zhang, Xiaojun Wu, Zhen Liu, Jun Yu, Josef Kittler

Auto-TLDR; LRMF: Label Relaxation and Discrete Matrix Factorization for Cross-Modal Retrieval

In recent years, cross-media retrieval has drawn considerable attention due to the exponential growth of multimedia data. Many hashing approaches have been proposed for the cross-media search task. However, there are still open problems that warrant investigation. For example, most existing supervised hashing approaches employ a binary label matrix, which achieves small margins between wrong labels (0) and true labels (1). This may affect the retrieval performance by generating many false negatives and false positives. In addition, some methods adopt a relaxation scheme to solve the binary constraints, which may cause large quantization errors. Some discrete hashing methods have also been presented, but most of them are time-consuming. To overcome these problems, we present a label relaxation and discrete matrix factorization method (LRMF) for cross-modal retrieval. It offers a number of innovations. First of all, the proposed approach employs a novel label relaxation scheme to control the margins adaptively, which has the benefit of reducing the quantization error. Second, by virtue of the proposed discrete matrix factorization method designed to learn the binary codes, large quantization errors caused by relaxation can be avoided. The experimental results obtained on two widely used databases demonstrate that LRMF outperforms state-of-the-art cross-media methods.

Cross-Media Hash Retrieval Using Multi-head Attention Network

Zhixin Li, Feng Ling, Chuansheng Xu, Canlong Zhang, Huifang Ma

Auto-TLDR; Unsupervised Cross-Media Hash Retrieval Using Multi-Head Attention Network

Cross-media hash retrieval methods encode multimedia data into a common binary hash space, which can effectively measure the correlation between samples from different modalities. In order to further improve the retrieval accuracy, this paper proposes an unsupervised cross-media hash retrieval method based on a multi-head attention network. First of all, we use a multi-head attention network to better match images and texts, which contain rich semantic information. At the same time, an auxiliary similarity matrix is constructed to integrate the original neighborhood information from different modalities. This method can therefore capture the potential correlations both between different modalities and within the same modality, making up for the differences between them. Secondly, the method is unsupervised and does not require additional semantic labels, so it has the potential to scale to large cross-media retrieval. In addition, batch normalization and replacement hash code generation functions are adopted to optimize the model, and two loss functions are designed, which make the performance of this method exceed many supervised deep cross-media hash methods. Experiments on three datasets show that the average performance of this method is about 5 to 6 percentage points higher than the state-of-the-art unsupervised method, which demonstrates its effectiveness and superiority.

Discrete Semantic Matrix Factorization Hashing for Cross-Modal Retrieval

Jianyang Qin, Lunke Fei, Shaohua Teng, Wei Zhang, Genping Zhao, Haoliang Yuan

Auto-TLDR; Discrete Semantic Matrix Factorization Hashing for Cross-Modal Retrieval

Hashing has been widely studied for cross-modal retrieval due to its promising efficiency and effectiveness in massive data analysis. However, most existing supervised hashing methods suffer from inefficiency in very large-scale search and from the intractable discrete constraint on hash code learning. In this paper, we propose a new supervised hashing method, namely Discrete Semantic Matrix Factorization Hashing (DSMFH), for cross-modal retrieval. First, we conduct matrix factorization by directly utilizing the available label information to obtain a latent representation, so that both the inter-modality and intra-modality similarities are well preserved. Then, we simultaneously learn the discriminative hash codes and corresponding hash functions by deriving the matrix factorization into a discrete optimization. Finally, we adopt an alternatively iterative procedure to efficiently optimize the matrix factorization and discrete learning. Extensive experimental results on three widely used image-tag databases demonstrate the superiority of DSMFH over state-of-the-art cross-modal hashing methods.

Tensorized Feature Spaces for Feature Explosion

Ravdeep Pasricha, Pravallika Devineni, Evangelos Papalexakis, Ramakrishnan Kannan

Auto-TLDR; Tensor Rank Decomposition for Hyperspectral Image Classification

In this paper, we present a novel framework that uses tensor factorization to generate richer feature spaces for pixel classification in hyperspectral images. In particular, we assess the performance of different tensor rank decomposition methods compared to traditional kernel-based approaches for the hyperspectral image classification problem. We propose ORION, which takes as input a hyperspectral image tensor and a rank, and outputs an enhanced feature space from the factor matrices of the decomposed tensor. Our method is a feature explosion technique that inherently maps a low-dimensional input space in R^K to a high-dimensional space in R^R, where R >> K (say, on the order of 1000x), like a kernel. We show how the proposed method exploits the multi-linear structure of the three-dimensional hyperspectral tensor. We demonstrate the effectiveness of our method with experiments on three publicly available hyperspectral datasets with labeled pixels, comparing its classification performance against traditional linear and non-linear supervised learning methods such as SVM with linear, polynomial, and RBF kernels, and the multi-layer perceptron model. Finally, we explore the relationship between the rank of the tensor decomposition and the classification accuracy using several hyperspectral datasets with ground truth.
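
In the spirit of this pipeline, the sketch below CP-decomposes a hyperspectral cube with tensorly and builds a per-pixel feature from the spatial factor matrices; combining the factors by elementwise product is an assumption, not necessarily ORION's exact construction.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

def cp_pixel_features(hsi, rank=50):
    """Sketch: CP-decompose an H x W x bands hyperspectral cube and build a
    rank-dimensional feature per pixel from the spatial factor matrices."""
    weights, (A, B, C) = parafac(tl.tensor(hsi, dtype=tl.float32), rank=rank)
    A, B = tl.to_numpy(A), tl.to_numpy(B)          # (H, rank), (W, rank)
    H, W = hsi.shape[0], hsi.shape[1]
    feats = A[:, None, :] * B[None, :, :]          # (H, W, rank) pixel features
    feats = feats * tl.to_numpy(weights)           # fold in the CP weights
    return feats.reshape(H * W, rank)              # one R-dim vector per pixel
```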

T-SVD Based Non-Convex Tensor Completion and Robust Principal Component Analysis

Tao Li, Jinwen Ma

Auto-TLDR; Non-Convex tensor rank surrogate function and non-convex sparsity measure for tensor recovery

In this paper, we propose a novel non-convex tensor rank surrogate function and a novel non-convex sparsity measure. The basic idea is to sidestep the bias of the $\ell_1$-norm by introducing concavity. Furthermore, we employ this non-convex penalty in tensor recovery problems such as tensor completion and tensor robust principal component analysis. Due to the concavity, the parameters of these models are difficult to solve. To tackle this problem, we devise a majorization-minimization algorithm that optimizes an upper bound of the original function in each iteration, where every sub-problem is solved by the alternating direction method of multipliers. We also analyze the theoretical properties of the proposed algorithm. Finally, experimental results on natural and hyperspectral images demonstrate the efficacy and efficiency of the proposed method.

Making Every Label Count: Handling Semantic Imprecision by Integrating Domain Knowledge

Clemens-Alexander Brust, Björn Barz, Joachim Denzler

Auto-TLDR; Class Hierarchies for Imprecise Label Learning and Annotation eXtrapolation

Noisy data, crawled from the web or supplied by volunteers such as Mechanical Turkers or citizen scientists, is considered an alternative to professionally labeled data. There has been research focused on mitigating the effects of label noise. It is typically modeled as inaccuracy, where the correct label is replaced by an incorrect label from the same set. We consider an additional dimension of label noise: imprecision. For example, a non-breeding snow bunting is labeled as a bird. This label is correct, but not as precise as the task requires. Standard softmax classifiers cannot learn from such a weak label because they consider all classes mutually exclusive, which non-breeding snow bunting and bird are not. We propose CHILLAX (Class Hierarchies for Imprecise Label Learning and Annotation eXtrapolation), a method based on hierarchical classification, to fully utilize labels of any precision. Experiments on noisy variants of NABirds and ILSVRC2012 show that our method outperforms strong baselines by as much as 16.4 percentage points, and the current state of the art by up to 3.9 percentage points.

Enriching Video Captions with Contextual Text

Philipp Rimle, Pelin Dogan, Markus Gross

Auto-TLDR; Contextualized Video Captioning Using Contextual Text

Understanding video content and generating captions with context is an important and challenging task. Unlike prior methods that typically attempt to generate generic video captions without context, our architecture contextualizes captioning by infusing extracted information from relevant text data. We propose an end-to-end sequence-to-sequence model which generates video captions based on visual input and mines relevant knowledge such as names and locations from contextual text. In contrast to previous approaches, we do not preprocess the text further, and let the model directly learn to attend over it. Guided by the visual input, the model is able to copy words from the contextual text via a pointer-generator network, allowing it to produce more specific video captions. We show competitive performance on the News Video Dataset and, through ablation studies, validate the efficacy of contextual video captioning as well as individual design choices in our model architecture.

Cross-Supervised Joint-Event-Extraction with Heterogeneous Information Networks

Yue Wang, Zhuo Xu, Yao Wan, Lu Bai, Lixin Cui, Qian Zhao, Edwin Hancock, Philip Yu

Auto-TLDR; Joint-Event-extraction from Unstructured corpora using Structural Information Network

Joint-event-extraction, which extracts structural information (i.e., entities or triggers of events) from unstructured real-world corpora, has attracted more and more research attention in natural language processing. Most existing works do not fully address the sparse co-occurrence relationships between entities and triggers, which exacerbates the error-propagation problem and may degrade extraction performance. To mitigate this issue, we first define joint-event-extraction as a sequence-to-sequence labeling task with a tag set composed of the tags of triggers and entities. Then, to incorporate the missing information in the aforementioned co-occurrence relationships, we propose a Cross-Supervised Mechanism (CSM) to alternately supervise the extraction of either triggers or entities based on the type distribution of each other. Moreover, since the connected entities and triggers naturally form a heterogeneous information network (HIN), we leverage the latent pattern along meta-paths for a given corpus to further improve the performance of our proposed method. To verify the effectiveness of our proposed method, we conduct extensive experiments on real-world datasets and compare our method with state-of-the-art methods. Empirical results and analysis show that our approach outperforms the state-of-the-art methods in both entity and trigger extraction.

Text Synopsis Generation for Egocentric Videos

Aidean Sharghi, Niels Lobo, Mubarak Shah

Auto-TLDR; Egocentric Video Summarization Using Multi-task Learning for End-to-End Learning

Mass utilization of body-worn cameras has led to a huge corpus of available egocentric video. Existing video summarization algorithms can accelerate browsing such videos by selecting (visually) interesting shots from them. Nonetheless, since the system user still has to watch the summary videos, browsing large video databases remains a challenge. Hence, in this work, we propose to generate a textual synopsis, consisting of a few sentences describing the most important events in a long egocentric video. Users can read the short text to gain insight about the video and, more importantly, efficiently search through the content of a large video database using text queries. Since egocentric videos are long and contain many activities and events, using video-to-text algorithms results in thousands of descriptions, many of which are incorrect. Therefore, we propose a multi-task learning scheme to simultaneously generate descriptions for video segments and summarize the resulting descriptions in an end-to-end fashion. We input a set of video shots, and the network generates a text description for each shot. Next, a visual-language content matching unit, trained with a weakly supervised objective, identifies the correct descriptions. Finally, the last component of our network, called the purport network, evaluates the descriptions all together to select the ones containing crucial information. Out of thousands of descriptions generated for the video, a few informative sentences are returned to the user. We validate our framework on the challenging UT Egocentric video dataset, where each video is between 3 and 5 hours long and is associated with over 3000 textual descriptions on average. The generated textual summaries, including only 5 percent (or less) of the generated descriptions, are compared to ground-truth summaries in the text domain using well-established metrics in natural language processing.

RWMF: A Real-World Multimodal Foodlog Database

Pengfei Zhou, Cong Bai, Kaining Ying, Jie Xia, Lixin Huang

Auto-TLDR; Real-World Multimodal Foodlog: A Real-World Foodlog Database for Diet Assistants

With increasing health concerns about diet, it is worthwhile to develop an intelligent assistant that can help users eat healthier. Such assistants can automatically give personal advice on the users' diet and generate health reports about eating on a regular basis. To boost research on such diet assistants, we establish a real-world foodlog database using various methods such as filtering, clustering, and graph convolutional networks. This database is built from real-world lifelog and medical data and is named Real-World Multimodal Foodlog (RWMF). It contains 7500 multimodal pairs, where each pair consists of a food image paired with a line of personal biometrics data (such as blood glucose) and a textual description of food composition paired with a line of food nutrition data. In this paper, we present the detailed procedures for setting up the database. We evaluate the performance of RWMF using different food classification and cross-modal retrieval approaches. We also test the performance of multimodal fusion on RWMF through ablation experiments. The experimental results show that the RWMF database is quite challenging and can be widely used to evaluate the performance of food analysis methods based on multimodal data.

Multi-Scale 2D Representation Learning for Weakly-Supervised Moment Retrieval

Ding Li, Rui Wu, Zhizhong Zhang, Yongqiang Tang, Wensheng Zhang

Auto-TLDR; Multi-scale 2D Representation Learning for Weakly Supervised Video Moment Retrieval

Video moment retrieval aims to search for the moment most relevant to a given language query. However, most existing methods in this community require temporal boundary annotations, which are expensive and time-consuming to label. Hence, weakly supervised methods that use only coarse video-level labels have been put forward recently. Despite their effectiveness, these methods usually process moment candidates independently, ignoring the critical fact that there are natural temporal dependencies between candidates at different temporal scales. To cope with this issue, we propose a Multi-scale 2D Representation Learning method for weakly supervised video moment retrieval. Specifically, we first construct a two-dimensional map for each temporal scale to capture the temporal dependencies between candidates; the two dimensions of this map indicate the start and end time points of the candidates. Then, we select the top-K candidates from each scale-varied map with a learnable convolutional neural network. With a newly designed Moments Evaluation Module, we obtain the alignment scores of the selected candidates. At last, the similarity between captions and the language query serves as supervision for further training the candidate selector. Experiments on two benchmark datasets, Charades-STA and ActivityNet Captions, demonstrate that our approach achieves performance superior to state-of-the-art results.
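
The two-dimensional map for a single temporal scale can be sketched directly: cell (i, j) holds a pooled representation of the candidate moment spanning clips i through j, with the lower triangle left invalid. Mean pooling here is an assumption; the paper may pool differently.

```python
import numpy as np

def build_2d_map(clip_feats):
    """Sketch: build the two-dimensional moment map for one temporal scale.
    clip_feats: (T, d) features of T consecutive clips; cell (i, j) holds a
    candidate moment starting at clip i and ending at clip j."""
    T, d = clip_feats.shape
    map2d = np.zeros((T, T, d), dtype=clip_feats.dtype)
    for i in range(T):
        for j in range(i, T):
            map2d[i, j] = clip_feats[i:j + 1].mean(axis=0)  # valid upper triangle
    return map2d  # lower triangle (j < i) stays zero / invalid
```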

Enhanced User Interest and Expertise Modeling for Expert Recommendation

Tongze He, Caili Guo, Yunfei Chu

Auto-TLDR; A Unified Framework for Expert Recommendation in Community Question Answering

The rapid development of Community Question Answering (CQA) satisfies users' requests for professional and personal knowledge. In CQA, one key issue is to recommend users with high expertise and willingness to answer the given questions, namely expert recommendation. However, most existing methods for expert recommendation ignore key information, such as time information and historical feedback information, degrading their performance. On the one hand, users' interests change over time, and it is biased not to consider these dynamics. On the other hand, feedback information is critical to estimating users' expertise. To solve these problems, we propose a unified framework for expert recommendation that exploits user interest and expertise more precisely. Considering the inconsistency between them, we propose to learn their embeddings separately. We leverage Long Short-Term Memory (LSTM) to model users' short-term interest and combine it with long-term interest. The user expertise is learned by the designed user expertise network, which explicitly models feedback on users' historical behavior. Extensive experiments on a large-scale dataset from a real-world CQA site demonstrate the superior performance of our method over state-of-the-art solutions to the problem.

Dual Path Multi-Modal High-Order Features for Textual Content Based Visual Question Answering

Yanan Li, Yuetan Lin, Hongrui Zhao, Donghui Wang

Auto-TLDR; TextVQA: An End-to-End Visual Question Answering Model for Text-Based VQA

As a typical cross-modal problem, visual question answering (VQA) has received increasing attention from the computer vision and natural language processing communities. Reading and reasoning about texts and visual contents in images is a burgeoning and important research topic in VQA, especially for visually impaired assistance applications. Given an image, the task aims to predict an answer to a provided natural language question closely related to the image's textual contents. In this paper, we propose a novel end-to-end textual content based VQA model, which grounds question answering on both the visual and textual information. After encoding the image, question, and recognized text words, it uses multi-modal factorized high-order modules and the attention mechanism to fuse question-image and question-text features, respectively. The complex correlations among different features can be captured efficiently. To ensure the model's extendibility, it embeds candidate answers and recognized texts in a semantic embedding space and adopts a semantic embedding based classifier to perform answer prediction. Extensive experiments on the newly proposed benchmark TextVQA demonstrate that the proposed model achieves promising results.

Probabilistic Latent Factor Model for Collaborative Filtering with Bayesian Inference

Jiansheng Fang, Xiaoqing Zhang, Yan Hu, Yanwu Xu, Ming Yang, Jiang Liu

Auto-TLDR; Bayesian Latent Factor Model for Collaborative Filtering

The Latent Factor Model (LFM) is one of the most successful methods for collaborative filtering (CF) in recommendation systems, in which both users and items are projected into a joint latent factor space. Based on matrix factorization, as usually applied in pattern recognition, LFM models user-item interactions as inner products of the user and item factor vectors in that space, and can be efficiently solved by least squares methods with optimal estimation. However, such optimal estimation methods are prone to overfitting due to the extreme sparsity of user-item interactions. In this paper, we propose a Bayesian treatment for LFM, named Bayesian Latent Factor Model (BLFM). Based on observed user-item interactions, we build a probabilistic factor model in which regularization is introduced by placing prior constraints on the latent factors, and the likelihood function is established over observations and parameters. We then draw samples of latent factors from the posterior distribution with Variational Inference (VI) to predict expected values. We further make an extension to BLFM, called BLFMBias, incorporating user-dependent and item-dependent biases into the model to enhance performance. Extensive experiments on a movie rating dataset show the effectiveness of our proposed models in comparison with several strong baselines.
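
As a simplified illustration of the generative model, the sketch below does MAP estimation with Gaussian priors on the factors, which reduces to L2-regularized matrix factorization; the paper's full variational treatment goes further and infers a posterior.

```python
import numpy as np

def map_lfm(R, mask, rank=16, lam=0.1, lr=0.01, iters=300, seed=0):
    """Sketch: Gaussian priors on latent factors turn maximum a posteriori
    estimation into L2-regularized matrix factorization on observed ratings."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.standard_normal((n_users, rank)) * 0.1   # user latent factors
    V = rng.standard_normal((n_items, rank)) * 0.1   # item latent factors
    for _ in range(iters):
        E = mask * (U @ V.T - R)          # error on observed ratings only
        gU = E @ V + lam * U              # the prior acts as weight decay
        gV = E.T @ U + lam * V
        U -= lr * gU
        V -= lr * gV
    return U @ V.T                        # expected rating matrix
```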

MEG: Multi-Evidence GNN for Multimodal Semantic Forensics

Ekraam Sabir, Ayush Jaiswal, Wael Abdalmageed, Prem Natarajan

Auto-TLDR; Scalable Image Repurposing Detection with Graph Neural Network Based Model

Image repurposing is a category of fake news in which a digitally unmanipulated image is misrepresented by means of its accompanying metadata such as captions, location, etc., where the image and accompanying metadata together comprise a multimedia package. The problem setup is to authenticate a query multimedia package using a reference dataset of potentially related packages as evidence. Existing methods are limited to using a single piece of evidence (retrieved package), which ignores the potential performance improvement from using multiple evidences. In this work, we introduce a novel graph neural network based model for image repurposing detection, which effectively utilizes multiple retrieved packages as evidence and is scalable with the number of evidences. We compare the scalability and performance of our model against existing methods. Experimental results show that the proposed model outperforms the existing state-of-the-art for image repurposing detection with an error reduction of up to 25%.

Context Visual Information-Based Deliberation Network for Video Captioning

Min Lu, Xueyong Li, Caihua Liu

Auto-TLDR; Context visual information-based deliberation network for video captioning

Video captioning is the task of automatically and accurately generating a textual description for a video. Typical methods following the encoder-decoder architecture directly utilize hidden states to predict words. Nevertheless, these methods do not amend inaccurate hidden states before feeding them into word prediction, which leads to a cascade of errors as words are generated one by one. In this paper, the context visual information-based deliberation network is proposed, abbreviated as CVI-DelNet. Its key idea is to introduce a deliberator into the encoder-decoder framework. The encoder-decoder first generates a raw hidden state sequence. Unlike in existing methods, the raw hidden states are no longer used directly for word prediction but are fed into the deliberator to generate refined hidden states. Words are then predicted according to the refined hidden states and the contextual visual features. Results on two datasets show that the proposed method significantly outperforms the baselines.

Picture-To-Amount (PITA): Predicting Relative Ingredient Amounts from Food Images

Jiatong Li, Fangda Han, Ricardo Guerrero, Vladimir Pavlovic

Auto-TLDR; PITA: A Deep Learning Architecture for Predicting the Relative Amount of Ingredients from Food Images

Increased awareness of the impact of food consumption on health and lifestyle today has given rise to novel data-driven food analysis systems. Although these systems may recognize the ingredients, a detailed analysis of their amounts in the meal, which is paramount for estimating the correct nutrition, is usually ignored. In this paper, we study the novel and challenging problem of predicting the relative amount of each ingredient from a food image. We propose PITA, the Picture-to-Amount deep learning architecture, to solve the problem. More specifically, we predict the ingredient amounts using a domain-driven Wasserstein loss from image-to-recipe cross-modal embeddings learned to align the two views of food data. Experiments on a dataset of recipes collected from the Internet show that the model generates promising results and improves over the baselines on this challenging task.

MAGNet: Multi-Region Attention-Assisted Grounding of Natural Language Queries at Phrase Level

Amar Shrestha, Krittaphat Pugdeethosapol, Haowen Fang, Qinru Qiu

Auto-TLDR; MAGNet: A Multi-Region Attention-Aware Grounding Network for Free-form Textual Queries

Grounding free-form textual queries necessitates an understanding of these textual phrases and their relation to the visual cues in order to reliably reason about the described locations. Spatial attention networks are known to learn this relationship and focus their gaze on salient objects in the image. Thus, we propose to utilize spatial attention networks for image-level visual-textual fusion, preserving local (word) and global (phrase) information, to refine region proposals with an in-network Region Proposal Network (RPN) and detect single or multiple regions for a phrase query. We focus only on the phrase query - ground truth pair (referring expression), yielding a model independent of dataset-specific constraints such as additional attributes, context, etc. On the ReferIt Game referring expression dataset, our Multi-Region Attention-assisted Grounding network (MAGNet) achieves over 12% improvement over the state-of-the-art. Without the context from image captions and attribute information in Flickr30k Entities, we still achieve competitive results compared to the state-of-the-art.

Double Manifolds Regularized Non-Negative Matrix Factorization for Data Representation

Jipeng Guo, Shuai Yin, Yanfeng Sun, Yongli Hu

Auto-TLDR; Double Manifolds Regularized Non-negative Matrix Factorization for Clustering

Non-negative matrix factorization (NMF) is an important method for learning latent data representations. Exploiting the local geometrical structure can make the learned representation more effective and significantly improve the performance of NMF. However, most existing graph-based learning methods are determined by a predefined similarity graph which may not be optimal for specific tasks. To solve this problem, we propose the Double Manifolds Regularized NMF (DMR-NMF) model, which jointly learns an adaptive affinity matrix together with the non-negative matrix factorization. The learned affinity matrix can guide the NMF to fit the clustering task. Moreover, we develop iterative updating optimization schemes for DMR-NMF and provide a strict convergence proof of our optimization strategy. Empirical experiments on four different real-world data sets demonstrate the state-of-the-art performance of DMR-NMF in comparison with other related algorithms.
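
For intuition, here is a classic graph-regularized NMF with a fixed affinity matrix, using standard multiplicative updates; DMR-NMF's distinguishing step, jointly learning the affinity matrix, is omitted from this sketch.

```python
import numpy as np

def graph_reg_nmf(X, A, k=10, lam=1.0, iters=200, eps=1e-9, seed=0):
    """Sketch of graph-regularized NMF with a fixed affinity matrix A
    (GNMF-style multiplicative updates). X: (features, samples) non-negative
    data; A: (samples, samples) affinity. Minimizes, approximately,
    ||X - W H||^2 + lam * Tr(H L H^T) with L = D - A."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    D = np.diag(A.sum(axis=1))                    # degree matrix
    for _ in range(iters):
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        H *= (W.T @ X + lam * H @ A) / (W.T @ W @ H + lam * H @ D + eps)
    return W, H
```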

Information Graphic Summarization Using a Collection of Multimodal Deep Neural Networks

Edward Kim, Connor Onweller, Kathleen F. Mccoy

Auto-TLDR; A multimodal deep learning framework that can generate summarization text supporting the main idea of an information graphic for presentation to blind or visually impaired users

We present a multimodal deep learning framework that can generate summarization text supporting the main idea of an information graphic for presentation to a person who is blind or visually impaired. The framework utilizes the visual, textual, positional, and size characteristics extracted from the image to create the summary. Different and complementary neural architectures are optimized for each task using crowdsourced training data. From our quantitative experiments and results, we explain the reasoning behind our framework and show the effectiveness of our models. Our qualitative results showcase text generated from our framework and show that Mechanical Turk participants favor it over other automatic and human generated summarizations. We describe the design of an experiment to evaluate the utility of our system for people who have visual impairments in the context of understanding Twitter tweets containing line graphs.

RGB-Infrared Person Re-Identification Via Image Modality Conversion

Huangpeng Dai, Qing Xie, Yanchun Ma, Yongjian Liu, Shengwu Xiong

Auto-TLDR; CE2L: A Novel Network for Cross-Modality Re-identification with Feature Alignment

As a cross-modality retrieval task, RGB-infrared person re-identification (Re-ID) is an important and challenging task because of its important role in video surveillance applications and the large cross-modality variations between visible and infrared images. Most previous works addressed the problem of the cross-modality gap with feature alignment learned directly from the original feature representations. In this paper, different from existing works, we propose a novel network (CE2L) to tackle the cross-modality gap with feature alignment. CE2L mainly focuses on adding discriminative information and learning robust features by converting modality between visible and infrared images. Its merits are highlighted in two aspects: 1) using CycleGAN to convert infrared images into color images can not only increase the recognition characteristics of images but also allow our network to better learn the features of the two modalities; 2) our method can serve as data augmentation, since converting labeled training images into the other modality increases data diversity and the total amount of data, guarding against over-fitting. Extensive experimental results on two datasets demonstrate superior performance compared to the baseline and state-of-the-art methods.

More Correlations Better Performance: Fully Associative Networks for Multi-Label Image Classification

Yaning Li, Liu Yang

Auto-TLDR; Fully Associative Network for Fully Exploiting Correlation Information in Multi-Label Classification

Recent research demonstrates that correlation modeling plays a key role in high-performance multi-label classification methods. However, existing methods do not take full advantage of correlation information, especially the correlations in the feature and label spaces of each image, which limits the performance of correlation-based multi-label classification methods. With more correlations considered, in this study a Fully Associative Network (FAN) is proposed for fully exploiting correlation information, involving both visual feature and label correlations. Specifically, FAN introduces a robust covariance pooling to summarize convolution features as a global image representation for capturing feature correlation in the multi-label task. Moreover, it constructs an effective label correlation matrix based on a re-weighted scheme, which is fed into a graph convolution network for capturing label correlation. Then, the correlation between the covariance representations (i.e., feature correlation) and the outputs of the GCN (i.e., label correlation) is modeled for final prediction. Experimental results on two datasets illustrate the effectiveness and efficiency of our proposed FAN compared with state-of-the-art methods.

Attentive Visual Semantic Specialized Network for Video Captioning

Jesus Perez-Martin, Benjamin Bustos, Jorge Pérez

Auto-TLDR; Adaptive Visual Semantic Specialized Network for Video Captioning

As an essential high-level task in video understanding, automatically describing a video with natural language has recently gained attention as a fundamental challenge in computer vision. Previous models for video captioning have several limitations, such as gaps in current semantic representations and the inexpressibility of the generated captions. To deal with these limitations, in this paper we present a new architecture that we call the Attentive Visual Semantic Specialized Network (AVSSN), an encoder-decoder model based on our Adaptive Attention Gate and Specialized LSTM layers. This architecture can selectively decide when to use visual or semantic information in the text generation process. The adaptive gate enables the decoder to automatically select the relevant information, providing a better temporal state representation than existing decoders. Besides, the model is capable of learning to improve the expressiveness of generated captions by attending to their length, using a sentence-length-related loss function. We evaluate the effectiveness of the proposed approach on the Microsoft Video Description (MSVD) and Microsoft Research Video-to-Text (MSR-VTT) datasets, achieving state-of-the-art performance with several popular evaluation metrics: BLEU-4, METEOR, CIDEr, and ROUGE-L.
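
The adaptive-gate idea reduces to a learned convex blend of the two context vectors at each decoding step; a minimal sketch, with the gating network and its inputs as assumptions rather than the paper's exact layer:

```python
import torch
import torch.nn as nn

class AdaptiveGate(nn.Module):
    """Sketch: at each decoding step, learn a scalar gate from the decoder
    state that blends the visual and semantic context vectors."""
    def __init__(self, hid=512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(hid, 1), nn.Sigmoid())

    def forward(self, h, visual_ctx, semantic_ctx):
        # h: (batch, hid) decoder state; contexts: (batch, dim)
        g = self.gate(h)                                 # (batch, 1) in [0, 1]
        return g * visual_ctx + (1 - g) * semantic_ctx   # blended context
```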

Unsupervised Co-Segmentation for Athlete Movements and Live Commentaries Using Crossmodal Temporal Proximity

Yasunori Ohishi, Yuki Tanaka, Kunio Kashino

Auto-TLDR; A guided attention scheme for audio-visual co-segmentation

Audio-visual co-segmentation is a task to extract segments and regions corresponding to specific events from unlabelled audio and video signals. It is particularly important to accomplish it in an unsupervised way, since it is generally very difficult to manually label all the objects and events appearing in audio-visual signals for supervised learning. Here, we propose to take advantage of the temporal proximity of corresponding audio and video entities included in the signals. For this purpose, we newly employ a guided attention scheme for this task to efficiently detect and utilize temporal co-occurrences of audio and video information. Experiments using a real TV broadcast of Sumo wrestling, a sport event, with live commentaries show that our model can automatically extract specific athlete movements and their spoken descriptions in an unsupervised manner.

Soft Label and Discriminant Embedding Estimation for Semi-Supervised Classification

Fadi Dornaika, Abdullah Baradaaji, Youssof El Traboulsi

Auto-TLDR; Graph-Based Semi-Supervised Learning for Linear Feature Extraction and Label Propagation

In recent times, graph-based semi-supervised learning has proved to be a powerful paradigm for processing and mining large datasets. Its main advantage relies on the fact that these methods can be useful in propagating a small set of known labels to a large set of unlabeled data. The scarcity of labeled data may affect the performance of semi-supervised learning. This paper introduces a new semi-supervised framework for simultaneous linear feature extraction and label propagation. The proposed method simultaneously estimates a discriminant transformation and the unknown labels by exploiting both labeled and unlabeled data. In addition, the unknowns of the learning model are estimated by integrating two types of graph-based smoothness constraints. The resulting semi-supervised model is expected to learn more discriminative information. Experiments are conducted on six public image datasets. The experimental results show that the performance of the proposed method can be better than that of many state-of-the-art graph-based semi-supervised algorithms.

Exploiting Elasticity in Tensor Ranks for Compressing Neural Networks

Jie Ran, Rui Lin, Hayden Kwok-Hay So, Graziano Chesi, Ngai Wong

Auto-TLDR; Nuclear-Norm Rank Minimization Factorization for Deep Neural Networks

Elasticities in depth, width, kernel size and resolution have been explored in compressing deep neural networks (DNNs). Recognizing that the kernels in a convolutional neural network (CNN) are 4-way tensors, we further exploit a new elasticity dimension along the input-output channels. Specifically, a novel nuclear-norm rank minimization factorization (NRMF) approach is proposed to dynamically and globally search for the reduced tensor ranks during training. Correlation between tensor ranks across multiple layers is revealed, and a graceful tradeoff between model size and accuracy is obtained. Experiments then show the superiority of NRMF over the previous non-elastic variational Bayesian matrix factorization (VBMF) scheme.

Multi-Task Learning for Calorie Prediction on a Novel Large-Scale Recipe Dataset Enriched with Nutritional Information

Robin Ruede, Verena Heusser, Lukas Frank, Monica Haurilet, Alina Roitberg, Rainer Stiefelhagen

Auto-TLDR; Pic2kcal: Learning Food Recipes from Images for Calorie Estimation

A rapidly growing amount of content posted online, such as food recipes, opens doors to new exciting applications at the intersection of vision and language. In this work, we aim to estimate the calorie amount of a meal directly from an image by learning from recipes people have published on the Internet, thus skipping time-consuming manual data annotation. Since there are few large-scale publicly available datasets captured in unconstrained environments, we propose the pic2kcal benchmark comprising 308,000 images from over 70,000 recipes including photographs, ingredients and instructions. To obtain nutritional information of the ingredients and automatically determine the ground-truth calorie value, we match the items in the recipes with structured information from a food item database. We evaluate various neural networks for regression of the calorie quantity and extend them with the multi-task paradigm. Our learning procedure combines the calorie estimation with prediction of proteins, carbohydrates, and fat amounts as well as a multi-label ingredient classification. Our experiments demonstrate clear benefits of multi-task learning for calorie estimation, surpassing the single-task calorie regression by 9.9%. To encourage further research on this task, we make the code for generating the dataset and the models publicly available.
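
A plausible shape for the multi-task objective, with uniform task weights and dictionary keys as illustrative assumptions (the paper tunes and combines the tasks in its own way), is:

```python
import torch.nn as nn

def pic2kcal_loss(pred, target, w_ing=1.0):
    """Sketch of a multi-task objective: regress calories, protein, carbs and
    fat, and classify ingredients with a multi-label loss. Field names and the
    uniform task weighting are assumptions."""
    mse = nn.functional.mse_loss
    bce = nn.functional.binary_cross_entropy_with_logits
    loss = mse(pred['kcal'], target['kcal'])
    for k in ('protein', 'carbs', 'fat'):            # auxiliary regression heads
        loss = loss + mse(pred[k], target[k])
    # multi-label ingredient classification head (one logit per ingredient)
    loss = loss + w_ing * bce(pred['ingredients'], target['ingredients'])
    return loss
```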

Temporal Collaborative Filtering with Graph Convolutional Neural Networks

Esther Rodrigo-Bonet, Minh Duc Nguyen, Nikos Deligiannis

Auto-TLDR; Temporal Collaborative Filtering with Graph Neural Networks and Recurrent Neural Networks

Temporal collaborative filtering (TCF) methods aim at modelling non-static aspects behind recommender systems, such as the dynamics in users' preferences and social trends around items. State-of-the-art TCF methods employ recurrent neural networks (RNNs) to model such aspects. These methods deploy matrix-factorization-based (MF-based) approaches to learn the user and item representations. Recently, graph-neural-network-based (GNN-based) approaches have shown improved performance in providing accurate recommendations over traditional MF-based approaches in non-temporal CF settings. Motivated by this, we propose a novel TCF method that leverages GNNs to learn user and item representations, and RNNs to model their temporal dynamics. A challenge with this method lies in the increased data sparsity, which negatively impacts obtaining meaningful quality representations with GNNs. To overcome this challenge, we train a GNN model at each time step using a set of observed interactions accumulated time-wise. Comprehensive experiments on real-world data show the improved performance obtained by our method over several state-of-the-art temporal and non-temporal CF models.

Low Rank Representation on Product Grassmann Manifolds for Multi-View Subspace Clustering

Jipeng Guo, Yanfeng Sun, Junbin Gao, Yongli Hu, Baocai Yin

Auto-TLDR; Low Rank Representation on Product Grassmann Manifold for Multi-View Data Clustering

Clustering high-dimensional multi-view data with complex intrinsic properties and nonlinear manifold structure is a challenging task, since these data are always embedded in low-dimensional manifolds. Inspired by Low Rank Representation (LRR), some researchers have extended classic LRR to the Grassmann manifold or the Product Grassmann manifold to represent data with non-linear metrics. However, most of these methods utilize the convex nuclear norm to leverage a low-rank structure, which over-relaxes the true rank and can lead to results that deviate from the true underlying ones. Moreover, the computational complexity of the singular value decomposition required for nuclear norm minimization is high. In this paper, we propose a new low-rank model for high-dimensional multi-view data clustering on the Product Grassmann Manifold, with a matrix tri-factorization used to control the upper bound of the true rank of the representation matrix. The original problem can then be transformed into a nuclear norm minimization with smaller-scale matrices. An effective solution and theoretical analysis are also provided. The experimental results show that the proposed method clearly outperforms other state-of-the-art methods on several multi-source human/crowd action video datasets.

VSB^2-Net: Visual-Semantic Bi-Branch Network for Zero-Shot Hashing

Xin Li, Xiangfeng Wang, Bo Jin, Wenjie Zhang, Jun Wang, Hongyuan Zha

Auto-TLDR; VSB^2-Net: inductive zero-shot hashing for image retrieval

Zero-shot hashing aims at learning a hashing model from seen classes such that the obtained model is capable of generalizing to unseen classes for image retrieval. Inspired by zero-shot learning, existing zero-shot hashing methods usually transfer supervised knowledge from seen to unseen classes by embedding the Hamming space into a shared semantic space. However, this makes instances difficult to distinguish due to the limited number of hashing bits, especially for semantically similar unseen classes. We propose a novel inductive zero-shot hashing framework, i.e., VSB^2-Net, where both the semantic space and the visual feature space are instead embedded into the same Hamming space. Reconstructive semantic relationships are established in the Hamming space, preserving local similarity relationships and explicitly enlarging the discrepancy between semantic Hamming vectors. A two-task architecture, comprising a classification module and a visual feature reconstruction module, is employed to enhance generalization and transfer abilities. Extensive evaluation results on several benchmark datasets demonstrate the superiority of our proposed method compared to several state-of-the-art baselines.

Equation Attention Relationship Network (EARN) : A Geometric Deep Metric Framework for Learning Similar Math Expression Embedding

Saleem Ahmed, Kenny Davila, Srirangaraj Setlur, Venu Govindaraju

Auto-TLDR; Representational Learning for Similarity Based Retrieval of Mathematical Expressions

Representation learning in the form of high-dimensional embeddings has been used for multiple pattern recognition applications. There has been significant interest in building embedding-based systems for learning representations in the mathematical domain. At the same time, retrieval of structured information such as mathematical expressions is an important need for modern IR systems. In this work, our motivation is to introduce a robust framework for learning representations for similarity-based retrieval of mathematical expressions. Given a query by example, the embedding can find the closest matching expression as a function of the Euclidean distance between them. We leverage recent advancements in image-based and graph-based deep learning algorithms to learn our similarity embeddings. We do this first by using uni-modal encoders in graph space and image space, and then by a multi-modal combination of the same. To overcome the lack of training data, we force the networks to learn a deep metric using triplets generated with a heuristic scoring function. We also adopt a custom strategy for mining hard samples to train our neural networks. Our system produces rankings similar to those generated by the original scoring function, but using only a fraction of the time. Our results establish the viability of using such a multi-modal embedding for this task.
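
The deep-metric objective with hard-sample mining can be sketched as a triplet hinge where each anchor is paired with its hardest negative from a candidate pool; this mining rule is a simplification of the paper's custom strategy.

```python
import torch

def hard_triplet_loss(anchor, pos, neg_pool, margin=0.2):
    """Sketch: triplet hinge on squared Euclidean distances, where each anchor
    expression embedding is matched with the closest negative from a pool."""
    d_pos = (anchor - pos).pow(2).sum(dim=1)        # (batch,) positive distances
    # distances from each anchor to every candidate negative: (batch, pool)
    d_neg = torch.cdist(anchor, neg_pool).pow(2)
    hard = d_neg.min(dim=1).values                  # hardest negative per anchor
    return torch.clamp(d_pos - hard + margin, min=0).mean()
```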

Label Incorporated Graph Neural Networks for Text Classification

Yuan Xin, Linli Xu, Junliang Guo, Jiquan Li, Xin Sheng, Yuanyuan Zhou

Responsive image

Auto-TLDR; Graph Neural Networks for Semi-supervised Text Classification

Slides Poster Similar

Graph Neural Networks (GNNs) have achieved great success on graph-structured data, and their applications to traditional tasks such as natural language processing and semi-supervised text classification have been extensively explored in recent years. However, previous works consider only the text information when building the graph, ignoring heterogeneous information such as labels. In this paper, we incorporate the label information when building the graph by adding text-label-text paths, through which supervision propagates more directly across the graph. Specifically, we treat labels as nodes in a graph that also contains text and word nodes, and connect each label with the texts belonging to it. Through graph convolutions, label embeddings are jointly learned with text embeddings in the same latent semantic space. The newly incorporated label nodes facilitate learning more accurate text embeddings by introducing label information, and thus benefit downstream text classification tasks. Extensive results on several benchmark datasets show that the proposed framework outperforms baseline methods by a significant margin.
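
A toy NumPy sketch of the graph construction follows: label nodes are appended to a text-word graph and wired to the texts of their class, so a two-hop walk realizes the text-label-text paths along which supervision propagates. Shapes and labels are illustrative assumptions.

```python
# Minimal sketch: append label nodes to a text-word adjacency matrix and
# connect each label to the texts of its class. Sizes are illustrative.
import numpy as np

n_text, n_word, n_label = 4, 6, 2
labels = np.array([0, 0, 1, 1])          # class of each text node
n = n_text + n_word + n_label
A = np.zeros((n, n))

# (text-word and word-word edges, e.g. TF-IDF / PMI weights, go here)

# connect each label node to every text node of that class
for t, y in enumerate(labels):
    l = n_text + n_word + y              # index of the label node
    A[t, l] = A[l, t] = 1.0

# a two-hop walk text -> label -> text now links same-class texts,
# which is exactly the path supervision propagates along
two_hop = A @ A
print(two_hop[0, 1] > 0)                 # True: texts 0 and 1 share label 0
```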

Supervised Domain Adaptation Using Graph Embedding

Lukas Hedegaard, Omar Ali Sheikh-Omar, Alexandros Iosifidis

Responsive image

Auto-TLDR; Domain Adaptation from the Perspective of Multi-view Graph Embedding and Dimensionality Reduction

Slides Poster Similar

Getting deep convolutional neural networks to perform well requires a large amount of training data. When the available labelled data is limited, it is often beneficial to use transfer learning to leverage a related, larger dataset (source) in order to improve performance on the small dataset (target). Among transfer learning approaches, domain adaptation methods assume that the distributions of the two domains are shifted and attempt to realign them. In this paper, we consider the domain adaptation problem from the perspective of multi-view graph embedding and dimensionality reduction. Instead of solving the generalised eigenvalue problem to perform the embedding, we formulate the graph-preserving criterion as a loss in the neural network and learn a domain-invariant feature transformation in an end-to-end fashion. We show that the proposed approach leads to a powerful domain adaptation framework which generalises the prior methods CCSA and d-SNE and enables simple and effective loss designs; an LDA-inspired instantiation of the framework achieves performance on par with the state-of-the-art on the most widely used domain adaptation benchmarks, Office31 and MNIST-to-USPS.
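
A graph-preserving criterion of this kind can be written directly as a loss over embedding distances, as in the hedged PyTorch sketch below: an intrinsic graph W weights pairs to pull together and a penalty graph W_p weights pairs to push apart. The toy weight matrices and the raw difference form are assumptions; actual instantiations (e.g., the LDA-inspired one) constrain or transform these terms.

```python
# Minimal sketch of a graph-preserving loss over embeddings z: weighted
# attraction via W, weighted repulsion via W_p. Matrices are toy choices.
import torch

def graph_embedding_loss(z, W, W_p):
    # squared pairwise distances ||z_i - z_j||^2
    d2 = torch.cdist(z, z, p=2) ** 2
    return (W * d2).sum() - (W_p * d2).sum()

z = torch.randn(6, 32, requires_grad=True)     # embeddings from the net
same = torch.tensor([0, 0, 1, 1, 0, 1])        # class labels (toy)
W = (same[:, None] == same[None, :]).float()   # attract same-class pairs
W_p = 1.0 - W                                  # repel different-class pairs
loss = graph_embedding_loss(z, W, W_p)
loss.backward()
```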

Multi-Modal Deep Clustering: Unsupervised Partitioning of Images

Guy Shiran, Daphna Weinshall

Responsive image

Auto-TLDR; Multi-Modal Deep Clustering for Unlabeled Images

Slides Poster Similar

The clustering of unlabeled raw images is a daunting task, which has recently been approached with some success by deep learning methods. Here we propose an unsupervised clustering framework that learns a deep neural network in an end-to-end fashion, providing direct cluster assignments of images without additional processing. Multi-Modal Deep Clustering (MMDC) trains a deep network to align its image embeddings with target points sampled from a Gaussian Mixture Model distribution. The cluster assignments are then determined by the mixture component associated with each image embedding. Simultaneously, the same deep network is trained to solve an additional self-supervised task. This pushes the network to learn more meaningful image representations and stabilizes training. Experimental results show that MMDC achieves or exceeds state-of-the-art performance on four challenging benchmarks. On natural image datasets we improve on previous results by significant margins of up to 11% absolute accuracy points, yielding an accuracy of 70% on CIFAR-10 and 61% on STL-10.
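
The alignment step can be sketched as follows in PyTorch: targets are sampled from a mixture of Gaussians, each embedding is matched to a target, and the cluster label is the matched component. The nearest-target matching used here is a simplification of the paper's assignment, and all sizes are illustrative.

```python
# Minimal sketch: align image embeddings with GMM-sampled target points;
# the cluster label is the mixture component of the matched target.
import torch

k, d, n = 10, 64, 32                         # clusters, embed dim, batch
means = torch.randn(k, d) * 5                # fixed component centers
comp = torch.randint(k, (n,))                # sample a component per target
targets = means[comp] + torch.randn(n, d)    # GMM-sampled target points

emb = torch.randn(n, d, requires_grad=True)  # network image embeddings
assign = torch.cdist(emb, targets).argmin(1) # match each image to a target
loss = ((emb - targets[assign]) ** 2).sum(1).mean()
loss.backward()
cluster_of_image = comp[assign]              # component index = cluster id
```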

Transformer Networks for Trajectory Forecasting

Francesco Giuliari, Hasan Irtiza, Marco Cristani, Fabio Galasso

Responsive image

Auto-TLDR; TransformerNetworks for Trajectory Prediction of People Interactions

Slides Poster Similar

Most recent successes in forecasting people's motion are based on LSTM models, and most recent progress has been achieved by modelling the social interaction among people and the interaction of people with the scene. We question the use of LSTM models and propose the novel use of Transformer Networks for trajectory forecasting. This is a fundamental switch from the sequential step-by-step processing of LSTMs to the only-attention-based memory mechanisms of Transformers. In particular, we consider both the original Transformer Network (TF) and the larger Bidirectional Transformer (BERT), state-of-the-art on all natural language processing tasks. Our proposed Transformers predict the trajectories of the individual people in the scene. These are “simple” models because each person is modelled separately, without any complex human-human or scene interaction terms. In particular, the TF model without bells and whistles yields the best score on the largest and most challenging trajectory forecasting benchmark, TrajNet [1]. Additionally, its extension which predicts multiple plausible future trajectories performs on par with more engineered techniques on the 5 datasets of ETH [2] + UCY [3]. Finally, we show that Transformers may deal with missing observations, as may be the case with real sensor data. Code is available at github.com/FGiuliari/Trajectory-Transformer
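
A minimal per-person setup of this kind might look like the PyTorch sketch below: observed (x, y) positions are linearly embedded and fed to a vanilla nn.Transformer that regresses future positions. Sizes are illustrative, and no social or scene terms are modelled, in line with the “simple” models described.

```python
# Minimal sketch: a vanilla Transformer regressing one person's future
# positions from observed ones. Sizes and training details are toy choices.
import torch
import torch.nn as nn

d_model, obs_len, pred_len = 64, 8, 12
embed = nn.Linear(2, d_model)                # (x, y) -> model dimension
tf = nn.Transformer(d_model=d_model, nhead=4,
                    num_encoder_layers=2, num_decoder_layers=2,
                    batch_first=True)
head = nn.Linear(d_model, 2)                 # model dimension -> (x, y)

obs = torch.randn(1, obs_len, 2)             # one person's observed track
tgt = torch.zeros(1, pred_len, 2)            # decoder inputs (teacher-forced
                                             # with ground truth in training)
out = head(tf(embed(obs), embed(tgt)))       # predicted future positions
print(out.shape)                             # torch.Size([1, 12, 2])
```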

Joint Supervised and Self-Supervised Learning for 3D Real World Challenges

Antonio Alliegro, Davide Boscaini, Tatiana Tommasi

Responsive image

Auto-TLDR; Self-supervision for 3D Shape Classification and Segmentation in Point Clouds

Slides Similar

Point cloud processing and 3D shape understanding are very challenging tasks for which deep learning techniques have demonstrated great potential. Still, further progress is essential to allow artificial intelligent agents to interact with the real world. In many practical conditions the amount of annotated data may be limited, and integrating new sources of knowledge becomes crucial to support autonomous learning. Here we consider several scenarios involving synthetic and real-world point clouds where supervised learning fails due to data scarcity and large domain gaps. We propose to enrich standard feature representations by leveraging self-supervision through a multi-task model that solves a 3D puzzle while learning the main task of shape classification or part segmentation. An extensive analysis investigating few-shot, transfer learning and cross-domain settings shows the effectiveness of our approach, with state-of-the-art results for 3D shape classification and part segmentation.
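
The multi-task coupling can be sketched as a shared encoder with two heads: one for shape classification and one predicting which spatial bin each shuffled puzzle piece came from. The toy encoder, bin count, and flattened input below are assumptions, not the paper's network.

```python
# Minimal sketch: one shared encoder, a supervised classification head,
# and a self-supervised 3D-puzzle head trained with a joint loss.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(3 * 1024, 256), nn.ReLU())  # toy encoder
cls_head = nn.Linear(256, 40)        # e.g. 40 shape classes
puzzle_head = nn.Linear(256, 27)     # 3x3x3 spatial bins for puzzle pieces

points = torch.randn(8, 3 * 1024)    # flattened point clouds (toy input)
feat = encoder(points)
ce = nn.CrossEntropyLoss()
loss = ce(cls_head(feat), torch.randint(40, (8,))) \
     + ce(puzzle_head(feat), torch.randint(27, (8,)))  # joint objective
loss.backward()
```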

PICK: Processing Key Information Extraction from Documents Using Improved Graph Learning-Convolutional Networks

Wenwen Yu, Ning Lu, Xianbiao Qi, Ping Gong, Rong Xiao

Responsive image

Auto-TLDR; PICK: A Graph Learning Framework for Key Information Extraction from Documents

Slides Poster Similar

Computer vision with state-of-the-art deep learning models has recently achieved huge success in the field of Optical Character Recognition (OCR), including text detection and recognition tasks. However, Key Information Extraction (KIE) from documents, the downstream task of OCR with a large number of real-world use scenarios, remains a challenge, because documents not only have textual features extracted by OCR systems but also semantic visual features that are not fully exploited yet play a critical role in KIE. Too little work has been devoted to efficiently making full use of both the textual and visual features of documents. In this paper, we introduce PICK, a framework that is effective and robust in handling complex document layouts for KIE by combining graph learning with graph convolution operations, yielding a richer semantic representation containing the textual and visual features and the global layout without ambiguity. Extensive experiments on real-world datasets show that our method outperforms baseline methods by significant margins.
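
A rough sketch of the fusion-plus-graph-learning idea: each document segment node carries concatenated textual and visual features, a soft adjacency is learned from node pairs, and a graph convolution mixes information across nodes. All sizes and the pairwise scoring function are illustrative assumptions.

```python
# Minimal sketch: fuse per-node textual and visual features, learn a soft
# adjacency from node pairs, and run one graph-convolution step.
import torch
import torch.nn as nn

n, dt, dv, dh = 12, 128, 128, 64         # nodes, text/visual/hidden dims
text_f, vis_f = torch.randn(n, dt), torch.randn(n, dv)
x = torch.cat([text_f, vis_f], dim=1)    # fused node features

score = nn.Linear(2 * (dt + dv), 1)      # graph learning: score node pairs
pairs = torch.cat([x.unsqueeze(1).expand(n, n, -1),
                   x.unsqueeze(0).expand(n, n, -1)], dim=-1)
A = torch.softmax(score(pairs).squeeze(-1), dim=1)  # soft learned adjacency

gcn = nn.Linear(dt + dv, dh)
h = torch.relu(gcn(A @ x))               # one graph-convolution step
print(h.shape)                           # torch.Size([12, 64])
```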

Adaptive L2 Regularization in Person Re-Identification

Xingyang Ni, Liang Fang, Heikki Juhani Huttunen

Responsive image

Auto-TLDR; AdaptiveReID: Adaptive L2 Regularization for Person Re-identification

Slides Poster Similar

We introduce an adaptive L2 regularization mechanism termed AdaptiveReID, in the setting of person re-identification. In the literature, it is common practice to utilize hand-picked regularization factors which remain constant throughout the training procedure. Unlike existing approaches, the regularization factors in our proposed method are updated adaptively through backpropagation. This is achieved by incorporating trainable scalar variables as the regularization factors, which are further fed into a scaled hard sigmoid function. Extensive experiments on the Market-1501, DukeMTMC-reID and MSMT17 datasets validate the effectiveness of our framework. Most notably, we obtain state-of-the-art performance on MSMT17, which is the largest dataset for person re-identification. Source code will be published at https://github.com/nixingyang/AdaptiveReID.
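
The mechanism lends itself to a compact sketch: the L2 penalty weight is itself a trainable scalar passed through a (scaled) hard sigmoid, so backpropagation tunes the regularization strength alongside the network weights. The scaling constant below is an illustrative assumption.

```python
# Minimal sketch: a trainable regularization factor squashed by a hard
# sigmoid and learned jointly with the weights it regularizes.
import torch
import torch.nn.functional as F

weight = torch.randn(64, 128, requires_grad=True)   # some layer's weights
theta = torch.zeros(1, requires_grad=True)          # trainable reg factor

def l2_penalty(w, theta, scale=1e-3):
    # hardsigmoid keeps the factor in [0, 1]; scale sets its magnitude
    return scale * F.hardsigmoid(theta) * (w ** 2).sum()

task_loss = (weight @ torch.randn(128)).pow(2).mean()  # stand-in loss
loss = task_loss + l2_penalty(weight, theta)
loss.backward()           # gradients flow into both weight and theta
```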

Deep Convolutional Embedding for Digitized Painting Clustering

Giovanna Castellano, Gennaro Vessio

Responsive image

Auto-TLDR; A Deep Convolutional Embedding Model for Clustering Artworks

Slides Poster Similar

Clustering artworks is difficult for several reasons. On the one hand, recognizing meaningful patterns in accordance with domain knowledge and visual perception is extremely hard. On the other hand, applying traditional clustering and feature reduction techniques to the highly dimensional pixel space can be ineffective. To address these issues, we propose a deep convolutional embedding model for digitized painting clustering, in which the task of mapping the input raw data to an abstract, latent space is jointly optimized with the task of finding a set of cluster centroids in this latent feature space. Quantitative and qualitative experimental results show the effectiveness of the proposed method, which also outperforms other state-of-the-art deep clustering approaches on the same problem. The proposed method may be beneficial to several art-related tasks, particularly visual link retrieval and historical knowledge discovery in painting datasets.
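
The joint objective can be illustrated in the DEC style, which the sketch below follows under the assumption that the model uses a Student's t soft-assignment around learnable centroids and a sharpened target distribution; the reconstruction term that keeps the latent space faithful is omitted for brevity, and all sizes are toy choices.

```python
# Minimal DEC-style sketch: soft cluster assignments in the latent space
# are sharpened via a KL term that moves codes and centroids jointly.
import torch

z = torch.randn(32, 10, requires_grad=True)       # latent codes (encoder out)
mu = torch.randn(5, 10, requires_grad=True)       # learnable centroids

q = 1.0 / (1.0 + torch.cdist(z, mu) ** 2)         # Student's t similarity
q = q / q.sum(dim=1, keepdim=True)                # soft assignments

p = (q ** 2) / q.sum(dim=0)                       # sharpened target dist.
p = (p / p.sum(dim=1, keepdim=True)).detach()

loss = (p * (p / q).log()).sum(dim=1).mean()      # KL(p || q)
loss.backward()                                   # moves z and mu jointly
```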

Person Recognition with HGR Maximal Correlation on Multimodal Data

Yihua Liang, Fei Ma, Yang Li, Shao-Lun Huang

Responsive image

Auto-TLDR; A correlation-based multimodal person recognition framework that learns discriminative embeddings of persons by joint learning visual features and audio features

Slides Poster Similar

Multimodal person recognition is a common task in video analysis and public surveillance, where information from multiple modalities, such as images and audio extracted from videos, is used to jointly determine the identity of a person. Previous person recognition techniques either use only uni-modal data or only consider shared representations between different input modalities, leaving the extraction of their relationship with identity information to downstream tasks. Furthermore, real-world data often contain noise, which makes recognition more challenging in practical situations. In our work, we propose a novel correlation-based multimodal person recognition framework that is relatively simple but can efficaciously learn supervised information in multimodal data fusion and resist noise. Specifically, our framework learns discriminative embeddings of persons by jointly learning visual and audio features while maximizing the HGR maximal correlation among the multimodal inputs and persons' identities. Experiments are conducted on a subset of VoxCeleb2. Compared with state-of-the-art methods, the proposed method demonstrates improved accuracy and robustness to noise.
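
One common neural surrogate for the HGR maximal correlation is the "soft-HGR" objective sketched below; treat the exact form as an assumption about the paper's loss. It maximizes the correlation of paired visual and audio embeddings while penalizing the product of their covariances.

```python
# Minimal soft-HGR sketch: maximize E[f^T g] - 0.5 * tr(cov(f) cov(g))
# for zero-mean paired features f (visual) and g (audio).
import torch

def soft_hgr(f, g):
    f = f - f.mean(0)                   # center both feature sets
    g = g - g.mean(0)
    n = f.shape[0]
    corr = (f * g).sum() / n            # E[f^T g] for paired samples
    cov_f = f.t() @ f / (n - 1)
    cov_g = g.t() @ g / (n - 1)
    return -(corr - 0.5 * (cov_f * cov_g).sum())   # negate to minimize

f = torch.randn(64, 16, requires_grad=True)   # visual embeddings
g = torch.randn(64, 16, requires_grad=True)   # audio embeddings
loss = soft_hgr(f, g)
loss.backward()
```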

Text Recognition - Real World Data and Where to Find Them

Klára Janoušková, Lluis Gomez, Dimosthenis Karatzas, Jiri Matas

Responsive image

Auto-TLDR; Exploiting Weakly Annotated Images for Text Extraction

Slides Poster Similar

We present a method for exploiting weakly annotated images to improve text extraction pipelines. The approach uses an arbitrary end-to-end text recognition system to obtain text region proposals and their, possibly erroneous, transcriptions. The proposed method includes matching of imprecise transcriptions to weak annotations and an edit-distance-guided neighbourhood search. It produces nearly error-free, localised instances of scene text, which we treat as "pseudo ground truth" (PGT). We apply the method to two weakly annotated datasets. Training with the extracted PGT consistently improves the accuracy of a state-of-the-art recognition model, by 3.7% on average across different benchmark datasets (image domains), and by 24.5% on one of the weakly annotated datasets.
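
The matching step can be illustrated with plain Python: each possibly erroneous transcription is assigned to the closest weak annotation by Levenshtein distance and kept as pseudo ground truth only when the match is close enough. The threshold rule below is an illustrative assumption, not the paper's exact criterion.

```python
# Minimal sketch: match an OCR transcription to weak annotations by edit
# distance and accept near matches as pseudo ground truth (PGT).

def edit_distance(a, b):
    # Levenshtein distance via single-row dynamic programming
    m, n = len(a), len(b)
    d = list(range(n + 1))
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                   prev + (a[i - 1] != b[j - 1]))
    return d[n]

weak_tags = ["STARBUCKS", "COFFEE", "OPEN"]
ocr_out = "C0FFEE"                      # possibly erroneous transcription
best = min(weak_tags, key=lambda t: edit_distance(ocr_out, t))
if edit_distance(ocr_out, best) <= 2:   # accept near matches as PGT
    print("pseudo ground truth:", best)
```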