ICPR2020 Paper Browser

Paper download is intended for registered attendees only and is subject to the IEEE Copyright Policy. Any other use is strictly forbidden.

A CNN-RNN Framework for Image Annotation from Visual Cues and Social Network Metadata

Tobia Tesan, Pasquale Coscia, Lamberto Ballan

Auto-TLDR; Context-Based Image Annotation with Multiple Semantic Embeddings and Recurrent Neural Networks

Images are a common form of visual communication. Nevertheless, image classification can be challenging for unclear or uncommon images that need more context to be annotated correctly. Metadata accompanying images on social media is an ideal source of additional information for retrieving suitable neighborhoods, which eases the image annotation task. To this end, we blend visual features extracted from neighbors with their metadata to jointly leverage context and visual cues. Our models use multiple semantic embeddings to achieve the dual objective of being robust to vocabulary changes between train and test sets and decoupling the architecture from the low-level metadata representation. Convolutional and recurrent neural networks (CNNs-RNNs) are jointly adopted to infer similarity among neighbors and query images. We perform comprehensive experiments on the NUS-WIDE dataset, showing that our models outperform state-of-the-art architectures based on images and metadata and reduce both the sensory and semantic gaps to better annotate images.

Similar papers

Webly Supervised Image-Text Embedding with Noisy Tag Refinement

Niluthpol Mithun, Ravdeep Pasricha, Evangelos Papalexakis, Amit Roy-Chowdhury

Auto-TLDR; Robust Joint Embedding for Image-Text Retrieval Using Web Images

In this paper, we address the problem of utilizing web images in training robust joint embedding models for the image-text retrieval task. Prior webly supervised approaches directly leverage weakly annotated web images in the joint embedding learning framework. The objective of these approaches suffers significantly when the ratio of noisy and missing tags associated with the web images is very high. In this regard, we propose a CP decomposition based tensor completion framework to refine the tags of web images by modeling observed ternary inter-relations between the sets of labeled images, tags, and web images as a tensor. To deal effectively with the high ratio of missing entries likely in our case, we incorporate intra-modal correlation as side information in the proposed framework. Our tag refinement approach, combined with existing webly supervised image-text embedding approaches, provides a more principled way of learning joint embedding models in the presence of significant noise from web data and limited clean labeled data. Experiments on benchmark datasets demonstrate that the proposed approach helps to achieve a significant performance gain in image-text retrieval.

Using Scene Graphs for Detecting Visual Relationships

Anurag Tripathi, Siddharth Srivastava, Brejesh Lall, Santanu Chaudhury

Auto-TLDR; Relationship Detection using Context Aligned Scene Graph Embeddings

Knowledge Distillation for Action Anticipation Via Label Smoothing

Transformer Reasoning Network for Image-Text Matching and Retrieval

Enriching Video Captions with Contextual Text

Making Every Label Count: Handling Semantic Imprecision by Integrating Domain Knowledge

Attentive Visual Semantic Specialized Network for Video Captioning

A Novel Attention-Based Aggregation Function to Combine Vision and Language

MEG: Multi-Evidence GNN for Multimodal Semantic Forensics

More Correlations Better Performance: Fully Associative Networks for Multi-Label Image Classification

Picture-To-Amount (PITA): Predicting Relative Ingredient Amounts from Food Images

Multimodal Side-Tuning for Document Classification

PrivAttNet: Predicting Privacy Risks in Images Using Visual Attention

Dual Path Multi-Modal High-Order Features for Textual Content Based Visual Question Answering

Contextual Classification Using Self-Supervised Auxiliary Models for Deep Neural Networks

Deep Convolutional Embedding for Digitized Painting Clustering

MAGNet: Multi-Region Attention-Assisted Grounding of Natural Language Queries at Phrase Level

Force Banner for the Recognition of Spatial Relations

Visual Oriented Encoder: Integrating Multimodal and Multi-Scale Contexts for Video Captioning

VSB^2-Net: Visual-Semantic Bi-Branch Network for Zero-Shot Hashing

Context Visual Information-Based Deliberation Network for Video Captioning

Generalized Local Attention Pooling for Deep Metric Learning

Leveraging Quadratic Spherical Mutual Information Hashing for Fast Image Retrieval

Hierarchical Multimodal Attention for Deep Video Summarization

GCNs-Based Context-Aware Short Text Similarity Model

Self-Supervised Joint Encoding of Motion and Appearance for First Person Action Recognition

Learning Natural Thresholds for Image Ranking

Text Synopsis Generation for Egocentric Videos

Multi-Level Deep Learning Vehicle Re-Identification Using Ranked-Based Loss Functions

Semantics to Space(S2S): Embedding Semantics into Spatial Space for Zero-Shot Verb-Object Query Inferencing

Transformer Networks for Trajectory Forecasting

Image Representation Learning by Transformation Regression

The Color Out of Space: Learning Self-Supervised Representations for Earth Observation Imagery

Attention-Based Deep Metric Learning for Near-Duplicate Video Retrieval

RMS-Net: Regression and Masking for Soccer Event Spotting

Exploiting Knowledge Embedded Soft Labels for Image Recognition

Multi-Modal Deep Clustering: Unsupervised Partitioning of Images

Hybrid Decomposition Convolution Neural Network and Vocabulary Forest for Image Retrieval

Label Incorporated Graph Neural Networks for Text Classification

GuCNet: A Guided Clustering-Based Network for Improved Classification

Detective: An Attentive Recurrent Model for Sparse Object Detection

Textual-Content Based Classification of Bundles of Untranscribed Manuscript Images

Region and Relations Based Multi Attention Network for Graph Classification

Equation Attention Relationship Network (EARN): A Geometric Deep Metric Framework for Learning Similar Math Expression Embedding

Self-Supervised Learning with Graph Neural Networks for Region of Interest Retrieval in Histopathology

Supporting Skin Lesion Diagnosis with Content-Based Image Retrieval

Generative Latent Implicit Conditional Optimization When Learning from Small Sample

PICK: Processing Key Information Extraction from Documents Using Improved Graph Learning-Convolutional Networks