ICPR2020 Paper Browser

Paper download is intended for registered attendees only, and is subject to the IEEE Copyright Policy. Any other use is strictly forbidden.

Semantics to Space (S2S): Embedding Semantics into Spatial Space for Zero-Shot Verb-Object Query Inferencing

Sungmin Eum, Heesung Kwon

Auto-TLDR; Semantics-to-Space: Deep Zero-Shot Learning for Verb-Object Interaction with Vectors

We present a novel deep zero-shot learning (ZSL) model for inferring human-object interactions from verb-object (VO) queries. Whereas previous two-stream ZSL approaches feed semantic/textual information only into the query stream, we also embed the semantics into the visual representation stream. Our approach is built on the Semantics-to-Space (S2S) architecture, in which semantics derived from the objects present in the image are embedded into a spatial space of the visual stream. This architecture jointly captures the semantic attributes of the human and the objects along with their location, size, and silhouette information. To validate the approach, we have constructed a new dataset, Verb-Transferability 60 (VT60). VT60 provides 60 different VO pairs with overlapping verbs, tailored for testing two-stream ZSL approaches with VO queries. Experimental evaluations show that our approach not only outperforms the state of the art, but also consistently improves performance regardless of which ZSL baseline architecture is used.
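
As a rough illustration of what embedding object semantics into a spatial space could look like, the sketch below writes each detected object's word vector into the grid cells that its region covers. It is a minimal sketch under stated assumptions, not the authors' S2S implementation: the function name `semantics_to_space`, the use of rectangular boxes rather than silhouettes, the summing of overlapping regions, and the toy 2-D vectors are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple

# Assumed box format: (x0, y0, x1, y1), half-open, in grid cells.
Box = Tuple[int, int, int, int]


def semantics_to_space(
    height: int,
    width: int,
    detections: Sequence[Tuple[Box, Sequence[float]]],
    dim: int,
) -> List[List[List[float]]]:
    """Build a height x width x dim map in which every cell covered by a
    detection holds that detection's word vector (overlaps are summed)."""
    grid = [[[0.0] * dim for _ in range(width)] for _ in range(height)]
    for (x0, y0, x1, y1), vec in detections:
        if len(vec) != dim:
            raise ValueError("embedding length must equal dim")
        # Clip the box to the grid so out-of-range detections are safe.
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                cell = grid[y][x]
                for k in range(dim):
                    cell[k] += vec[k]
    return grid


# Toy 2-D "word vectors" (hypothetical values, not real embeddings).
person = [1.0, 0.0]
horse = [0.0, 1.0]
semantic_map = semantics_to_space(
    height=4,
    width=4,
    detections=[((0, 0, 2, 3), person), ((1, 1, 4, 4), horse)],
    dim=2,
)
```

In this toy map, a cell's vector indicates which objects occupy it, while the shape of each filled region carries the object's location and size. A map of this kind could then be fed to the visual stream alongside the image features.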

Similar papers

Developing Motion Code Embedding for Action Recognition in Videos

Maxat Alibayev, David Andrea Paulius, Yu Sun

Auto-TLDR; Motion Embedding via Motion Codes for Action Recognition

Prior Knowledge about Attributes: Learning a More Effective Potential Space for Zero-Shot Recognition

Using Scene Graphs for Detecting Visual Relationships

Context for Object Detection Via Lightweight Global and Mid-Level Representations

MAGNet: Multi-Region Attention-Assisted Grounding of Natural Language Queries at Phrase Level

Incrementally Zero-Shot Detection by an Extreme Value Analyzer

A Novel Attention-Based Aggregation Function to Combine Vision and Language

Self-Selective Context for Interaction Recognition

Recognizing Bengali Word Images - A Zero-Shot Learning Perspective

Adaptive Word Embedding Module for Semantic Reasoning in Large-Scale Detection

Incorporating Depth Information into Few-Shot Semantic Segmentation

Transformer Reasoning Network for Image-Text Matching and Retrieval

SL-DML: Signal Level Deep Metric Learning for Multimodal One-Shot Action Recognition

Improving Visual Relation Detection Using Depth Maps

Multiscale Attention-Based Prototypical Network for Few-Shot Semantic Segmentation

Heterogeneous Graph-Based Knowledge Transfer for Generalized Zero-Shot Learning

Self-Supervised Joint Encoding of Motion and Appearance for First Person Action Recognition

Dual Path Multi-Modal High-Order Features for Textual Content Based Visual Question Answering

Augmented Bi-Path Network for Few-Shot Learning

VSB^2-Net: Visual-Semantic Bi-Branch Network for Zero-Shot Hashing

Zero-Shot Text Classification with Semantically Extended Graph Convolutional Network

Modeling Long-Term Interactions to Enhance Action Recognition

Question-Agnostic Attention for Visual Question Answering

Attentive Visual Semantic Specialized Network for Video Captioning

Extracting Action Hierarchies from Action Labels and their Use in Deep Action Recognition

A Prototype-Based Generalized Zero-Shot Learning Framework for Hand Gesture Recognition

Enriching Video Captions with Contextual Text

Directed Variational Cross-encoder Network for Few-Shot Multi-image Co-segmentation

Making Every Label Count: Handling Semantic Imprecision by Integrating Domain Knowledge

Exploiting Knowledge Embedded Soft Labels for Image Recognition

Few-Shot Few-Shot Learning and the Role of Spatial Attention

Visual Oriented Encoder: Integrating Multimodal and Multi-Scale Contexts for Video Captioning

Human-Centric Parsing Network for Human-Object Interaction Detection

A CNN-RNN Framework for Image Annotation from Visual Cues and Social Network Metadata

Beyond the Deep Metric Learning: Enhance the Cross-Modal Matching with Adversarial Discriminative Domain Regularization

Explore and Explain: Self-Supervised Navigation and Recounting

Multi-Stage Attention Based Visual Question Answering

Integrating Historical States and Co-Attention Mechanism for Visual Dialog

Text Synopsis Generation for Egocentric Videos

Attention Pyramid Module for Scene Recognition

FatNet: A Feature-Attentive Network for 3D Point Cloud Processing

Sketch-SNet: Deeper Subdivision of Temporal Cues for Sketch Recognition

Concept Embedding through Canonical Forms: A Case Study on Zero-Shot ASL Recognition

Detective: An Attentive Recurrent Model for Sparse Object Detection

Complementing Representation Deficiency in Few-Shot Image Classification: A Meta-Learning Approach

Multimodal Side-Tuning for Document Classification

ScarfNet: Multi-Scale Features with Deeply Fused and Redistributed Semantics for Enhanced Object Detection

Enhanced Vote Network for 3D Object Detection in Point Clouds