Heterogeneous Graph-Based Knowledge Transfer for Generalized Zero-Shot Learning

Junjie Wang, Xiangfeng Wang, Bo Jin, Junchi Yan, Wenjie Zhang, Hongyuan Zha

Responsive image

Auto-TLDR; Heterogeneous Graph-based Knowledge Transfer for Generalized Zero-Shot Learning

Slides Poster

Generalized zero-shot learning (GZSL) tackles the problem of learning to classify instances involving both seen classes and unseen ones. The key issue is how to effectively transfer the model learned from seen classes to unseen classes. Existing works in GZSL usually assume that some prior information about unseen classes are available. However, such an assumption is unrealistic when new unseen classes appear dynamically. To this end, we propose a novel heterogeneous graph-based knowledge transfer method (HGKT) for GZSL, agnostic to unseen classes and instances, by leveraging graph neural network. Specifically, a structured heterogeneous graph is constructed with high-level representative nodes for seen classes, which are chosen through Wasserstein barycenter in order to simultaneously capture inter-class and intra-class relationship. The aggregation and embedding functions can be learned throughgraph neural network, which can be used to compute the embeddings of unseen classes by transferring the knowledge from their neighbors. Extensive experiments on public benchmark datasets show that our method achieves state-of-the-art results.

Similar papers

Prior Knowledge about Attributes: Learning a More Effective Potential Space for Zero-Shot Recognition

Chunlai Chai, Yukuan Lou

Responsive image

Auto-TLDR; Attribute Correlation Potential Space Generation for Zero-Shot Learning

Slides Poster Similar

Zero-shot learning (ZSL) aims to recognize unseen classes accurately by learning seen classes and known attributes, but correlations in attributes were ignored by previous study which lead to classification results confused. To solve this problem, we build an Attribute Correlation Potential Space Generation (ACPSG) model which uses a graph convolution network and attribute correlation to generate a more discriminating potential space. Combining potential discrimination space and user-defined attribute space, we can better classify unseen classes. Our approach outperforms some existing state-of-the-art methods on several benchmark datasets, whether it is conventional ZSL or generalized ZSL.

VSB^2-Net: Visual-Semantic Bi-Branch Network for Zero-Shot Hashing

Xin Li, Xiangfeng Wang, Bo Jin, Wenjie Zhang, Jun Wang, Hongyuan Zha

Responsive image

Auto-TLDR; VSB^2-Net: inductive zero-shot hashing for image retrieval

Slides Poster Similar

Zero-shot hashing aims at learning hashing model from seen classes and the obtained model is capable of generalizing to unseen classes for image retrieval. Inspired by zero-shot learning, existing zero-shot hashing methods usually transfer the supervised knowledge from seen to unseen classes, by embedding the hamming space to a shared semantic space. However, this makes instances difficult to distinguish due to limited hashing bit numbers, especially for semantically similar unseen classes. We propose a novel inductive zero-shot hashing framework, i.e., VSB^2-Net, where both semantic space and visual feature space are embedded to the same hamming space instead. The reconstructive semantic relationships are established in the hamming space, preserving local similarity relationships and explicitly enlarging the discrepancy between semantic hamming vectors. A two-task architecture, comprising of classification module and visual feature reconstruction module, is employed to enhance the generalization and transfer abilities. Extensive evaluation results on several benchmark datasets demonstratethe superiority of our proposed method compared to several state-of-the-art baselines.

A Prototype-Based Generalized Zero-Shot Learning Framework for Hand Gesture Recognition

Jinting Wu, Yujia Zhang, Xiao-Guang Zhao

Responsive image

Auto-TLDR; Generalized Zero-Shot Learning for Hand Gesture Recognition

Slides Poster Similar

Hand gesture recognition plays a significant role in human-computer interaction for understanding various human gestures and their intent. However, most prior works can only recognize gestures of limited labeled classes and fail to adapt to new categories. The task of Generalized Zero-Shot Learning (GZSL) for hand gesture recognition aims to address the above issue by leveraging semantic representations and detecting both seen and unseen class samples. In this paper, we propose an end-to-end prototype-based GZSL framework for hand gesture recognition which consists of two branches. The first branch is a prototype-based detector that learns gesture representations and determines whether an input sample belongs to a seen or unseen category. The second branch is a zero-shot label predictor which takes the features of unseen classes as input and outputs predictions through a learned mapping mechanism between the feature and the semantic space. We further establish a hand gesture dataset that specifically targets this GZSL task, and comprehensive experiments on this dataset demonstrate the effectiveness of our proposed approach on recognizingQuestionnaire both seen and unseen gestures.

Zero-Shot Text Classification with Semantically Extended Graph Convolutional Network

Tengfei Liu, Yongli Hu, Junbin Gao, Yanfeng Sun, Baocai Yin

Responsive image

Auto-TLDR; Semantically Extended Graph Convolutional Network for Zero-shot Text Classification

Slides Poster Similar

As a challenging task of Natural Language Processing(NLP), zero-shot text classification has attracted more and more attention recently. It aims to detect classes that the model has never seen in the training set. For this purpose, a feasible way is to construct connection between the seen and unseen classes by semantic extension and classify the unseen classes by information propagation over the connection. Although many related zero-shot text classification methods have been exploited, how to realize semantic extension properly and propagate information effectively is far from solved. In this paper, we propose a novel zero-shot text classification method called Semantically Extended Graph Convolutional Network (SEGCN). In the proposed method, the semantic category knowledge from ConceptNet is utilized to semantic extension for linking seen classes to unseen classes and constructing a graph of all classes. Then, we build upon Graph Convolutional Network (GCN) for predicting the textual classifier for each category, which transfers the category knowledge by the convolution operators on the constructed graph and is trained in a semi-supervised manner using the samples of the seen classes. The experimental results on Dbpedia and 20newsgroup datasets show that our method outperforms the state of the art zero-shot text classification methods.

Local Propagation for Few-Shot Learning

Yann Lifchitz, Yannis Avrithis, Sylvaine Picard

Responsive image

Auto-TLDR; Local Propagation for Few-Shot Inference

Slides Poster Similar

The challenge in few-shot learning is that available data is not enough to capture the underlying distribution. To mitigate this, two emerging directions are (a) using local image representations, essentially multiplying the amount of data by a constant factor, and (b) using more unlabeled data, for instance by transductive inference, jointly on a number of queries. In this work, we bring these two ideas together, introducing local propagation. We treat local image features as independent examples, we build a graph on them and we use it to propagate both the features themselves and the labels, known and unknown. Interestingly, since there is a number of features per image, even a single query gives rise to transductive inference. As a result, we provide a universally safe choice for few-shot inference under both non-transductive and transductive settings, improving accuracy over corresponding methods. This is in contrast to existing solutions, where one needs to choose the method depending on the quantity of available data.

Incrementally Zero-Shot Detection by an Extreme Value Analyzer

Sixiao Zheng, Yanwei Fu, Yanxi Hou

Responsive image

Auto-TLDR; IZSD-EVer: Incremental Zero-Shot Detection for Incremental Learning

Slides Similar

Human beings not only have the ability of recogniz-ing novel unseen classes, but also can incrementally incorporatethe new classes to existing knowledge preserved. However, thezero-shot learning models assume that all seen classes should beknown beforehand, while incremental learning models cannotrecognize unseen classes. This paper introduces a novel andchallenging task of Incrementally Zero-Shot Detection (IZSD),a practical strategy for both zero-shot learning and class-incremental learning in real-world object detection. An innovativeend-to-end model – IZSD-EVer was proposed to tackle this taskthat requires incrementally detecting new classes and detectingthe classes that have never been seen. Specifically, we proposea novel extreme value analyzer to simultaneously detect objectsfrom old seen, new seen, and unseen classes. Additionally andtechnically, we propose two innovative losses, i.e., background-foreground mean squared error loss alleviating the extremeimbalance of the background and foreground of images, andprojection distance loss aligning the visual space and semanticspaces of old seen classes. Experiments demonstrate the efficacyof our model in detecting objects from both the seen and unseenclasses, outperforming the alternative models on Pascal VOC andMSCOCO datasets.

Recognizing Bengali Word Images - A Zero-Shot Learning Perspective

Sukalpa Chanda, Daniël Arjen Willem Haitink, Prashant Kumar Prasad, Jochem Baas, Umapada Pal, Lambert Schomaker

Responsive image

Auto-TLDR; Zero-Shot Learning for Word Recognition in Bengali Script

Slides Poster Similar

Zero-Shot Learning(ZSL) techniques could classify a completely unseen class, which it has never seen before during training. Thus, making it more apt for any real-life classification problem, where it is not possible to train a system with annotated data for all possible class types. This work investigates recognition of word images written in Bengali Script in a ZSL framework. The proposed approach performs Zero-Shot word recognition by coupling deep learned features procured from VGG16 architecture along with 13 basic shapes/stroke primitives commonly observed in Bengali script characters. As per the notion of ZSL framework those 13 basic shapes are termed as “Signature Attributes”. The obtained results are promising while evaluation was carried out in a Five-Fold cross-validation setup dealing with samples from 250 word classes.

Open Set Domain Recognition Via Attention-Based GCN and Semantic Matching Optimization

Xinxing He, Yuan Yuan, Zhiyu Jiang

Responsive image

Auto-TLDR; Attention-based GCN and Semantic Matching Optimization for Open Set Domain Recognition

Slides Poster Similar

Open set domain recognition has got the attention in recent years. The task aims to specifically classify each sample in the practical unlabeled target domain, which consists of all known classes in the manually labeled source domain and target-specific unknown categories. The absence of annotated training data or auxiliary attribute information for unknown categories makes this task especially difficult. Moreover, exiting domain discrepancy in label space and data distribution further distracts the knowledge transferred from known classes to unknown classes. To address these issues, this work presents an end-to-end model based on attention-based GCN and semantic matching optimization, which first employs the attention mechanism to enable the central node to learn more discriminating representations from its neighbors in the knowledge graph. Moreover, a coarse-to-fine semantic matching optimization approach is proposed to progressively bridge the domain gap. Experimental results validate that the proposed model not only has superiority on recognizing the images of known and unknown classes, but also can adapt to various openness of the target domain.

Graph-Based Interpolation of Feature Vectors for Accurate Few-Shot Classification

Yuqing Hu, Vincent Gripon, Stéphane Pateux

Responsive image

Auto-TLDR; Transductive Learning for Few-Shot Classification using Graph Neural Networks

Slides Poster Similar

In few-shot classification, the aim is to learn models able to discriminate classes using only a small number of labeled examples. In this context, works have proposed to introduce Graph Neural Networks (GNNs) aiming at exploiting the information contained in other samples treated concurrently, what is commonly referred to as the transductive setting in the literature. These GNNs are trained all together with a backbone feature extractor. In this paper, we propose a new method that relies on graphs only to interpolate feature vectors instead, resulting in a transductive learning setting with no additional parameters to train. Our proposed method thus exploits two levels of information: a) transfer features obtained on generic datasets, b) transductive information obtained from other samples to be classified. Using standard few-shot vision classification datasets, we demonstrate its ability to bring significant gains compared to other works.

Edge-Aware Graph Attention Network for Ratio of Edge-User Estimation in Mobile Networks

Jiehui Deng, Sheng Wan, Xiang Wang, Enmei Tu, Xiaolin Huang, Jie Yang, Chen Gong

Responsive image

Auto-TLDR; EAGAT: Edge-Aware Graph Attention Network for Automatic REU Estimation in Mobile Networks

Slides Poster Similar

Estimating the Ratio of Edge-Users (REU) is an important issue in mobile networks, as it helps the subsequent adjustment of loads in different cells. However, existing approaches usually determine the REU manually, which are experience-dependent and labor-intensive, and thus the estimated REU might be imprecise. Considering the inherited graph structure of mobile networks, in this paper, we utilize a graph-based deep learning method for automatic REU estimation, where the practical cells are deemed as nodes and the load switchings among them constitute edges. Concretely, Graph Attention Network (GAT) is employed as the backbone of our method due to its impressive generalizability in dealing with networked data. Nevertheless, conventional GAT cannot make full use of the information in mobile networks, since it only incorporates node features to infer the pairwise importance and conduct graph convolutions, while the edge features that are actually critical in our problem are disregarded. To accommodate this issue, we propose an Edge-Aware Graph Attention Network (EAGAT), which is able to fuse the node features and edge features for REU estimation. Extensive experimental results on two real-world mobile network datasets demonstrate the superiority of our EAGAT approach to several state-of-the-art methods.

Parallel Network to Learn Novelty from the Known

Shuaiyuan Du, Chaoyi Hong, Zhiyu Pan, Chen Feng, Zhiguo Cao

Responsive image

Auto-TLDR; Trainable Parallel Network for Pseudo-Novel Detection

Slides Poster Similar

Towards multi-class novelty detection, we propose an end-to-end trainable Parallel Network (PN) using no additional data but only the training set itself. Our key idea is to first divide the training set into successive subtasks of pseudo-novelty detection to simulate real scenarios. We then design a multi-branch PN to well address the fine-grained division, which yields a compressed and more discriminative classification space and forms a natural ensemble. In practice, we divide the training data into subsets consisting of known and pseudo-novel classes. Each subset forms a sub-task fed to one branch in PN. During training, both known and pseudo-novel classes are uniformly distributed over the branches for better data balance and model diversity. By distinguishing between the known and the diverse pseudo-novel, PN extracts the concept of novelty in a compressed classification space. This provides PN with generalization ability to real novel classes which are absent during training. During online inference, this ability is further strengthened with the ensemble of PN's multiple branches. Experiments on three public datasets show our method's superiority to the mainstream methods.

Complementing Representation Deficiency in Few-Shot Image Classification: A Meta-Learning Approach

Xian Zhong, Cheng Gu, Wenxin Huang, Lin Li, Shuqin Chen, Chia-Wen Lin

Responsive image

Auto-TLDR; Meta-learning with Complementary Representations Network for Few-Shot Learning

Slides Poster Similar

Few-shot learning is a challenging problem that has attracted more and more attention recently since abundant training samples are difficult to obtain in practical applications. Meta-learning has been proposed to address this issue, which focuses on quickly adapting a predictor as a base-learner to new tasks, given limited labeled samples. However, a critical challenge for meta-learning is the representation deficiency since it is hard to discover common information from a small number of training samples or even one, as is the representation of key features from such little information. As a result, a meta-learner cannot be trained well in a high-dimensional parameter space to generalize to new tasks. Existing methods mostly resort to extracting less expressive features so as to avoid the representation deficiency. Aiming at learning better representations, we propose a meta-learning approach with complemented representations network (MCRNet) for few-shot image classification. In particular, we embed a latent space, where latent codes are reconstructed with extra representation information to complement the representation deficiency. Furthermore, the latent space is established with variational inference, collaborating well with different base-learners, and can be extended to other models. Finally, our end-to-end framework achieves the state-of-the-art performance in image classification on three standard few-shot learning datasets.

Meta Generalized Network for Few-Shot Classification

Wei Wu, Shanmin Pang, Zhiqiang Tian, Yaochen Li

Responsive image

Auto-TLDR; Meta Generalized Network for Few-Shot Classification

Similar

Few-shot classification aims to learn a well performance model with very limited labeled examples. There are mainly two directions for this aim, namely, meta- and metric-learning. Meta learning trains models in a particular way to fast adapt to new tasks, but it neglects variational features of images. Metric learning considers relationships among same or different classes, however on the downside, it usually fails to achieve competitive performance on unseen boundary examples. In this paper, we propose a Meta Generalized Network (MGNet) that aims to combine advantages of both meta- and metric-learning. There are two novel components in MGNet. Specifically, we first develop a meta backbone training method that both learns a flexible feature extractor and a classifier initializer efficiently, delightedly leading to fast adaption to unseen few-shot tasks without overfitting. Second, we design a trainable adaptive interval model to improve the cosine classifier, which increases the recognition accuracy of hard examples. We train the meta backbone in the training stage by all classes, and fine-tune the meta-backbone as well as train the adaptive classifier in the testing stage.

Semantics to Space(S2S): Embedding Semantics into Spatial Space for Zero-Shot Verb-Object Query Inferencing

Sungmin Eum, Heesung Kwon

Responsive image

Auto-TLDR; Semantics-to-Space: Deep Zero-Shot Learning for Verb-Object Interaction with Vectors

Slides Poster Similar

We present a novel deep zero-shot learning (ZSL) model for inferencing human-object-interaction with verb-object (VO) query. While the previous two-stream ZSL approaches only use the semantic/textual information to be fed into the query stream, we seek to incorporate and embed the semantics into the visual representation stream as well. Our approach is powered by Semantics-to-Space (S2S) architecture where semantics derived from the residing objects are embedded into a spatial space of the visual stream. This architecture allows the co-capturing of the semantic attributes of the human and the objects along with their location/size/silhouette information. To validate, we have constructed a new dataset, Verb-Transferability 60 (VT60). VT60 provides 60 different VO pairs with overlapping verbs tailored for testing two-stream ZSL approaches with VO query. Experimental evaluations show that our approach not only outperforms the state-of-the-art, but also shows the capability of consistently improving performance regardless of which ZSL baseline architecture is used.

Explanation-Guided Training for Cross-Domain Few-Shot Classification

Jiamei Sun, Sebastian Lapuschkin, Wojciech Samek, Yunqing Zhao, Ngai-Man Cheung, Alexander Binder

Responsive image

Auto-TLDR; Explaination-Guided Training for Cross-Domain Few-Shot Classification

Slides Poster Similar

Cross-domain few-shot classification task (CD-FSC) combines few-shot classification with the requirement to generalize across domains represented by datasets. This setup faces challenges originating from the limited labeled data in each class and, additionally, from the domain shift between training and test sets. In this paper, we introduce a novel training approach for existing FSC models. It leverages on the explanation scores, obtained from existing explanation methods when applied to the predictions of FSC models, computed for intermediate feature maps of the models. Firstly, we tailor the layer-wise relevance propagation (LRP) method to explain the prediction outcomes of FSC models. Secondly, we develop a model-agnostic explanation-guided training strategy that dynamically finds and emphasizes the features which are important for the predictions. Our contribution does not target a novel explanation method but lies in a novel application of explanations for the training phase. We show that explanation-guided training effectively improves the model generalization. We observe improved accuracy for three different FSC models: RelationNet, cross attention network, and a graph neural network-based formulation, on five few-shot learning datasets: miniImagenet, CUB, Cars, Places, and Plantae.

Multi-Attribute Learning with Highly Imbalanced Data

Lady Viviana Beltran Beltran, Mickaël Coustaty, Nicholas Journet, Juan C. Caicedo, Antoine Doucet

Responsive image

Auto-TLDR; Data Imbalance in Multi-Attribute Deep Learning Models: Adaptation to face each one of the problems derived from imbalance

Slides Poster Similar

Data is one of the most important keys for success when studying a simple or a complex phenomenon. With the use of deep-learning exploding and its democratization, non-computer science experts may struggle to use highly complex deep learning architectures, even when straightforward models offer them suitable performances. In this article, we study the specific and common problem of data imbalance in real databases as most of the bad performance problems are due to the data itself. We review two points: first, when the data contains different levels of imbalance. Classical imbalanced learning strategies cannot be directly applied when using multi-attribute deep learning models, i.e., multi-task and multi-label architectures. Therefore, one of our contributions is our proposed adaptations to face each one of the problems derived from imbalance. Second, we demonstrate that with little to no imbalance, straightforward deep learning models work well. However, for non-experts, these models can be seen as black boxes, where all the effort is put in pre-processing the data. To simplify the problem, we performed the classification task ignoring information that is costly to extract, such as part localization which is widely used in the state of the art of attribute classification. We make use of a widely known attribute database, CUB-200-2011 - CUB as our main use case due to its deeply imbalanced nature, along with two better structured databases: celebA and Awa2. All of them contain multi-attribute annotations. The results of highly fine-grained attribute learning over CUB demonstrate that in the presence of imbalance, by using our proposed strategies is possible to have competitive results against the state of the art, while taking advantage of multi-attribute deep learning models. We also report results for two better-structured databases over which our models over-perform the state of the art.

Soft Label and Discriminant Embedding Estimation for Semi-Supervised Classification

Fadi Dornaika, Abdullah Baradaaji, Youssof El Traboulsi

Responsive image

Auto-TLDR; Semi-supervised Semi-Supervised Learning for Linear Feature Extraction and Label Propagation

Slides Poster Similar

In recent times, graph-based semi-supervised learning proved to be a powerful paradigm for processing and mining large datasets. The main advantage relies on the fact that these methods can be useful in propagating a small set of known labels to a large set of unlabeled data. The scarcity of labeled data may affect the performance of the semi-learning. This paper introduces a new semi-supervised framework for simultaneous linear feature extraction and label propagation. The proposed method simultaneously estimates a discriminant transformation and the unknown label by exploiting both labeled and unlabeled data. In addition, the unknowns of the learning model are estimated by integrating two types of graph-based smoothness constraints. The resulting semi-supervised model is expected to learn more discriminative information. Experiments are conducted on six public image datasets. These experimental results show that the performance of the proposed method can be better than that of many state-of-the-art graph-based semi-supervised algorithms.

Few-Shot Few-Shot Learning and the Role of Spatial Attention

Yann Lifchitz, Yannis Avrithis, Sylvaine Picard

Responsive image

Auto-TLDR; Few-shot Learning with Pre-trained Classifier on Large-Scale Datasets

Slides Poster Similar

Few-shot learning is often motivated by the ability of humans to learn new tasks from few examples. However, standard few-shot classification benchmarks assume that the representation is learned on a limited amount of base class data, ignoring the amount of prior knowledge that a human may have accumulated before learning new tasks. At the same time, even if a powerful representation is available, it may happen in some domain that base class data are limited or non-existent. This motivates us to study a problem where the representation is obtained from a classifier pre-trained on a large-scale dataset of a different domain, assuming no access to its training process, while the base class data are limited to few examples per class and their role is to adapt the representation to the domain at hand rather than learn from scratch. We adapt the representation in two stages, namely on the few base class data if available and on the even fewer data of new tasks. In doing so, we obtain from the pre-trained classifier a spatial attention map that allows focusing on objects and suppressing background clutter. This is important in the new problem, because when base class data are few, the network cannot learn where to focus implicitly. We also show that a pre-trained network may be easily adapted to novel classes, without meta-learning.

Augmented Bi-Path Network for Few-Shot Learning

Baoming Yan, Chen Zhou, Bo Zhao, Kan Guo, Yang Jiang, Xiaobo Li, Zhang Ming, Yizhou Wang

Responsive image

Auto-TLDR; Augmented Bi-path Network for Few-shot Learning

Slides Poster Similar

Few-shot Learning (FSL) which aims to learn from few labeled training data is becoming a popular research topic, due to the expensive labeling cost in many real-world applications. One kind of successful FSL method learns to compare the testing (query) image and training (support) image by simply concatenating the features of two images and feeding it into the neural network. However, with few labeled data in each class, the neural network has difficulty in learning or comparing the local features of two images. Such simple image-level comparison may cause serious mis-classification. To solve this problem, we propose Augmented Bi-path Network (ABNet) for learning to compare both global and local features on multi-scales. Specifically, the salient patches are extracted and embedded as the local features for every image. Then, the model learns to augment the features for better robustness. Finally, the model learns to compare global and local features separately, \emph{i.e.}, in two paths, before merging the similarities. Extensive experiments show that the proposed ABNet outperforms the state-of-the-art methods. Both quantitative and visual ablation studies are provided to verify that the proposed modules lead to more precise comparison results.

Privacy Attributes-Aware Message Passing Neural Network for Visual Privacy Attributes Classification

Hanbin Hong, Wentao Bao, Yuan Hong, Yu Kong

Responsive image

Auto-TLDR; Privacy Attributes-Aware Message Passing Neural Network for Visual Privacy Attribute Classification

Slides Poster Similar

Visual Privacy Attribute Classification (VPAC) identifies privacy information leakage via social media images. These images containing privacy attributes such as skin color, face or gender are classified into multiple privacy attribute categories in VPAC. With limited works in this task, current methods often extract features from images and simply classify the extracted feature into multiple privacy attribute classes. The dependencies between privacy attributes, e.g., skin color and face typically co-exist in the same image, are usually ignored in classification, which causes performance degradation in VPAC. In this paper, we propose a novel end-to-end Privacy Attributes-aware Message Passing Neural Network (PA-MPNN) to address VPAC. Privacy attributes are considered as nodes on a graph and an MPNN is introduced to model the privacy attribute dependencies. To generate representative features for privacy attribute nodes, a class-wise encoder-decoder is proposed to learn a latent space for each attribute. An attention mechanism with multiple correlation matrices is also introduced in MPNN to learn the privacy attributes graph automatically. Experimental results on the Privacy Attribute Dataset demonstrate that our framework achieves better performance than state-of-the-art methods on visual privacy attributes classification.

Using Scene Graphs for Detecting Visual Relationships

Anurag Tripathi, Siddharth Srivastava, Brejesh Lall, Santanu Chaudhury

Responsive image

Auto-TLDR; Relationship Detection using Context Aligned Scene Graph Embeddings

Slides Poster Similar

In this paper we solve the problem of detecting relationships between pairs of objects in an image. We develop spatially aware word embeddings using scene graphs and use joint feature representations containing visual, spatial and semantic embeddings from the input images to train a deep network on the task of relationship detection. Further, we propose to utilize context aligned scene graph embeddings from the train set, without requiring explicit availability of scene graphs at test time. We show that the proposed method outperforms the state-of-the-art methods for predicate detection and provides competing results on relationship detection. We also show the generalization ability of the proposed method by performing predictions under zero shot settings. Further, we also provide an exhaustive empirical evaluation on each component of the proposed network.

More Correlations Better Performance: Fully Associative Networks for Multi-Label Image Classification

Yaning Li, Liu Yang

Responsive image

Auto-TLDR; Fully Associative Network for Fully Exploiting Correlation Information in Multi-Label Classification

Slides Poster Similar

Recent researches demonstrate that correlation modeling plays a key role in high-performance multi-label classification methods. However, existing methods do not take full advantage of correlation information, especially correlations in feature and label spaces of each image, which limits the performance of correlation-based multi-label classification methods. With more correlations considered, in this study, a Fully Associative Network (FAN) is proposed for fully exploiting correlation information, which involves both visual feature and label correlations. Specifically, FAN introduces a robust covariance pooling to summarize convolution features as global image representation for capturing feature correlation in the multi-label task. Moreover, it constructs an effective label correlation matrix based on a re-weighted scheme, which is fed into a graph convolution network for capturing label correlation. Then, correlation between covariance representations (i.e., feature correlation ) and the outputs of GCN (i.e., label correlation) are modeled for final prediction. Experimental results on two datasets illustrate the effectiveness and efficiency of our proposed FAN compared with state-of-the-art methods.

Generative Latent Implicit Conditional Optimization When Learning from Small Sample

Idan Azuri, Daphna Weinshall

Responsive image

Auto-TLDR; GLICO: Generative Latent Implicit Conditional Optimization for Small Sample Learning

Slides Poster Similar

We revisit the long-standing problem of learning from small sample. The generation of new samples from a small training set of labeled points has attracted increased attention in recent years. In this paper, we propose a novel such method called GLICO (Generative Latent Implicit Conditional Optimization). GLICO learns a mapping from the training examples to a latent space and a generator that generates images from vectors in the latent space. Unlike most recent work, which rely on access to large amounts of unlabeled data, GLICO does not require access to any additional data other than the small set of labeled points. In fact, GLICO learns to synthesize completely new samples for every class using as little as 5 or 10 examples per class, with as few as 10 such classes and no data from unknown classes. GLICO is then used to augment the small training set while training a classifier on the small sample. To this end, our proposed method samples the learned latent space using spherical interpolation (slerp) and generates new examples using the trained generator. Empirical results show that the new sampled set is diverse enough, leading to improvement in image classification in comparison with the state of the art when trained on small samples obtained from CIFAR-10, CIFAR-100, and CUB-200.

Exploiting Knowledge Embedded Soft Labels for Image Recognition

Lixian Yuan, Riquan Chen, Hefeng Wu, Tianshui Chen, Wentao Wang, Pei Chen

Responsive image

Auto-TLDR; A Soft Label Vector for Image Recognition

Slides Poster Similar

Objects from correlated classes usually share highly similar appearances while objects from uncorrelated classes are very different. Most of current image recognition works treat each class independently, which ignores these class correlations and inevitably leads to sub-optimal performance in many cases. Fortunately, object classes inherently form a hierarchy with different levels of abstraction and this hierarchy encodes rich correlations among different classes. In this work, we utilize a soft label vector that encodes the prior knowledge of class correlations as extra regularization to train the image classifiers. Specifically, for each class, instead of simply using a one-hot vector, we assign a high value to its correlated classes and assign small values to those uncorrelated ones, thus generating knowledge embedded soft labels. We conduct experiments on both general and fine-grained image recognition benchmarks and demonstrate its superiority compared with existing methods.

TAAN: Task-Aware Attention Network for Few-Shot Classification

Zhe Wang, Li Liu, Fanzhang Li

Responsive image

Auto-TLDR; TAAN: Task-Aware Attention Network for Few-Shot Classification

Slides Poster Similar

Few-shot classification aims to recognize unlabeled samples from unseen classes given only a few labeled samples.Current approaches of few-shot learning usually employ a metriclearning framework to learn a feature similarity comparison between a query (test) example and the few support (training) examples. However, these approaches all extract features from samples independently without looking at the entire task as a whole, and so fail to provide an enough discrimination to features. Moreover, the existing approaches lack the ability to select the most relevant features for the task at hand. In this work, we propose a novel algorithm called Task-Aware Attention Network (TAAN) to address the above problems in few-shot classification. By inserting a Task-Relevant Channel Attention Module into metric-based few-shot learners, TAAN generates channel attentions for each sample by aggregating the context of the entire support set and identifies the most relevant features for similarity comparison. The experiment demonstrates that TAAN is competitive in overall performance comparing to the recent state-of-the-art systems and improves the performance considerably over baseline systems on both mini-ImageNet and tiered-ImageNet benchmarks.

GuCNet: A Guided Clustering-Based Network for Improved Classification

Ushasi Chaudhuri, Syomantak Chaudhuri, Subhasis Chaudhuri

Responsive image

Auto-TLDR; Semantic Classification of Challenging Dataset Using Guide Datasets

Slides Poster Similar

We deal with the problem of semantic classification of challenging and highly-cluttered dataset. We present a novel, and yet a very simple classification technique by leveraging the ease of classifiability of any existing well separable dataset for guidance. Since the guide dataset which may or may not have any semantic relationship with the experimental dataset, forms well separable clusters in the feature set, the proposed network tries to embed class-wise features of the challenging dataset to those distinct clusters of the guide set, making them more separable. Depending on the availability, we propose two types of guide sets: one using texture (image) guides and another using prototype vectors representing cluster centers. Experimental results obtained on the challenging benchmark RSSCN, LSUN, and TU-Berlin datasets establish the efficacy of the proposed method as we outperform the existing state-of-the-art techniques by a considerable margin.

GCNs-Based Context-Aware Short Text Similarity Model

Xiaoqi Sun

Responsive image

Auto-TLDR; Context-Aware Graph Convolutional Network for Text Similarity

Slides Poster Similar

Semantic textual similarity is a fundamental task in text mining and natural language processing (NLP), which has profound research value. The essential step for text similarity is text representation learning. Recently, researches have explored the graph convolutional network (GCN) techniques on text representation, since GCN does well in handling complex structures and preserving syntactic information. However, current GCN models are usually limited to very shallow layers due to the vanishing gradient problem, which cannot capture non-local dependency information of sentences. In this paper, we propose a GCNs-based context-aware (GCSTS) model that applies iterated GCN blocks to train deeper GCNs. Recurrently employing the same GCN block prevents over-fitting and provides broad effective input width. Combined with dense connections, GCSTS can be trained more deeply. Besides, we use dynamic graph structures in the block, which further extend the receptive field of each vertex in graphs, learning better sentence representations. Experiments show that our model outperforms existing models on several text similarity datasets, while also verify that GCNs-based text representation models can be trained in a deeper manner, rather than being trained in two or three layers.

Supervised Domain Adaptation Using Graph Embedding

Lukas Hedegaard, Omar Ali Sheikh-Omar, Alexandros Iosifidis

Responsive image

Auto-TLDR; Domain Adaptation from the Perspective of Multi-view Graph Embedding and Dimensionality Reduction

Slides Poster Similar

Getting deep convolutional neural networks to perform well requires a large amount of training data. When the available labelled data is small, it is often beneficial to use transfer learning to leverage a related larger dataset (source) in order to improve the performance on the small dataset (target). Among the transfer learning approaches, domain adaptation methods assume that distributions between the two domains are shifted and attempt to realign them. In this paper, we consider the domain adaptation problem from the perspective of multi-view graph embedding and dimensionality reduction. Instead of solving the generalised eigenvalue problem to perform the embedding, we formulate the graph-preserving criterion as loss in the neural network and learn a domain-invariant feature transformation in an end-to-end fashion. We show that the proposed approach leads to a powerful Domain Adaptation framework which generalises the prior methods CCSA and d-SNE, and enables simple and effective loss designs; an LDA-inspired instantiation of the framework leads to performance on par with the state-of-the-art on the most widely used Domain Adaptation benchmarks, Office31 and MNIST to USPS datasets.

Label Incorporated Graph Neural Networks for Text Classification

Yuan Xin, Linli Xu, Junliang Guo, Jiquan Li, Xin Sheng, Yuanyuan Zhou

Responsive image

Auto-TLDR; Graph Neural Networks for Semi-supervised Text Classification

Slides Poster Similar

Graph Neural Networks (GNNs) have achieved great success on graph-structured data, and their applications on traditional data structures such as natural language processing and semi-supervised text classification have been extensively explored in recent years. While previous works only consider the text information while building the graph, heterogeneous information such as labels is ignored. In this paper, we consider to incorporate the label information while building the graph by adding text-label-text paths, through which the supervision information will propagate among the graph more directly. Specifically, we treat labels as nodes in the graph which also contains text and word nodes, and then connect labels with texts belonging to that label. Through graph convolutions, label embeddings are jointly learned with text embeddings in the same latent semantic space. The newly incorporated label nodes will facilitate learning more accurate text embeddings by introducing the label information, and thus benefit the downstream text classification tasks. Extensive results on several benchmark datasets show that the proposed framework outperforms baseline methods by a significant margin.

Revisiting Graph Neural Networks: Graph Filtering Perspective

Hoang Nguyen-Thai, Takanori Maehara, Tsuyoshi Murata

Responsive image

Auto-TLDR; Two-Layers Graph Convolutional Network with Graph Filters Neural Network

Slides Poster Similar

In this work, we develop quantitative results to the learnability of a two-layers Graph Convolutional Network (GCN). Instead of analyzing GCN under some classes of functions, our approach provides a quantitative gap between a two-layers GCN and a two-layers MLP model. From the graph signal processing perspective, we provide useful insights to some flaws of graph neural networks for vertex classification. We empirically demonstrate a few cases when GCN and other state-of-the-art models cannot learn even when true vertex features are extremely low-dimensional. To demonstrate our theoretical findings and propose a solution to the aforementioned adversarial cases, we build a proof of concept graph neural network model with different filters named Graph Filters Neural Network (gfNN).

Adversarial Encoder-Multi-Task-Decoder for Multi-Stage Processes

Andre Mendes, Julian Togelius, Leandro Dos Santos Coelho

Responsive image

Auto-TLDR; Multi-Task Learning and Semi-Supervised Learning for Multi-Stage Processes

Similar

In multi-stage processes, decisions occur in an ordered sequence of stages. Early stages usually have more observations with general information (easier/cheaper to collect), while later stages have fewer observations but more specific data. This situation can be represented by a dual funnel structure, in which the sample size decreases from one stage to the other while the information increases. Training classifiers in this scenario is challenging since information in the early stages may not contain distinct patterns to learn (underfitting). In contrast, the small sample size in later stages can cause overfitting. We address both cases by introducing a framework that combines adversarial autoencoders (AAE), multi-task learning (MTL), and multi-label semi-supervised learning (MLSSL). We improve the decoder of the AAE with an MTL component so it can jointly reconstruct the original input and use feature nets to predict the features for the next stages. We also introduce a sequence constraint in the output of an MLSSL classifier to guarantee the sequential pattern in the predictions. Using real-world data from different domains (selection process, medical diagnosis), we show that our approach outperforms other state-of-the-art methods.

Feature Extraction by Joint Robust Discriminant Analysis and Inter-Class Sparsity

Fadi Dornaika, Ahmad Khoder

Responsive image

Auto-TLDR; Robust Discriminant Analysis with Feature Selection and Inter-class Sparsity (RDA_FSIS)

Slides Similar

Feature extraction methods have been successfully applied to many real-world applications. The classical Linear Discriminant Analysis (LDA) and its variants are widely used as feature extraction methods. Although they have been used for different classification tasks, these methods have some shortcomings. The main one is that the projection axes obtained are not informative about the relevance of original features. In this paper, we propose a linear embedding method that merges two interesting properties: Robust LDA and inter-class sparsity. Furthermore, the targeted projection transformation focuses on the most discriminant original features. The proposed method is called Robust Discriminant Analysis with Feature Selection and Inter-class Sparsity (RDA_FSIS). Two kinds of sparsity are explicitly included in the proposed model. The first kind is obtained by imposing the $\ell_{2,1}$ constraint on the projection matrix in order to perform feature ranking. The second kind is obtained by imposing the inter-class sparsity constraint used for getting a common sparsity structure in each class. Comprehensive experiments on five real-world image datasets demonstrate the effectiveness and advantages of our framework over existing linear methods.

AOAM: Automatic Optimization of Adjacency Matrix for Graph Convolutional Network

Yuhang Zhang, Hongshuai Ren, Jiexia Ye, Xitong Gao, Yang Wang, Kejiang Ye, Cheng-Zhong Xu

Responsive image

Auto-TLDR; Adjacency Matrix for Graph Convolutional Network in Non-Euclidean Space

Slides Poster Similar

Graph Convolutional Network (GCN) is adopted to tackle the problem of the convolution operation in non-Euclidean space. Although previous works on GCN have made some progress, one of their limitations is that their input Adjacency Matrix (AM) is designed manually and requires domain knowledge, which is cumbersome, tedious and error-prone. In addition, entries of this fixed Adjacency Matrix are generally designed as binary values (i.e., ones and zeros) which can not reflect more complex relationship between nodes. However, many applications require a weighted and dynamic Adjacency Matrix instead of an unweighted and fixed Adjacency Matrix. To this end, there are few works focusing on designing a more flexible Adjacency Matrix. In this paper, we propose an end-to-end algorithm to improve the GCN performance by focusing on the Adjacency Matrix. We first provide a calculation method that called node information entropy to update the matrix. Then, we analyze the search strategy in a continuous space and introduce the Deep Deterministic Policy Gradient (DDPG) method to overcome the demerit of the discrete space search. Finally, we integrate the GCN and reinforcement learning into an end-to-end framework. Our method can automatically define the adjacency matrix without artificial knowledge. At the same time, the proposed approach can deal with any size of the matrix and provide a better value for the network. Four popular datasets are selected to evaluate the capability of our algorithm. The method in this paper achieves the state-of-the-art performance on Cora and Pubmed datasets, respectively, with the accuracy of 84.6% and 81.6%.

Adaptive Word Embedding Module for Semantic Reasoning in Large-Scale Detection

Yu Zhang, Xiaoyu Wu, Ruolin Zhu

Responsive image

Auto-TLDR; Adaptive Word Embedding Module for Object Detection

Slides Poster Similar

In recent years, convolutional neural networks have achieved rapid development in the field of object detection. However, due to the imbalance of data, high costs in labor and uneven level of data labeling, the overall performance of the previous detection network has dropped sharply when dataset extended to the large-scale with hundreds and thousands categories. We present the Adaptive Word Embedding Module, extracting the adaptive semantic knowledge graph to reach semantic consistency within one image. Our method endows the ability to infer global semantic of detection networks without other attribute or relationship annotations. Compared with Faster RCNN, the algorithm on the MSCOCO dataset was significantly improved by 4.1%, and the mAP value has reached 32.8%. On the VG1000 dataset, it increased by 0.9% to 6.7% compared with Faster RCNN. Adaptive Word Embedding Module is lightweight, general-purpose and can be plugged into diverse detection networks. Code will be made available.

A General Model for Learning Node and Graph Representations Jointly

Chaofan Chen

Responsive image

Auto-TLDR; Joint Community Detection/Dynamic Routing for Graph Classification

Slides Poster Similar

This paper focuses on two fundamental graph recognition tasks: node classification and graph classification. Existing methods usually learn the node and graph representations for these two tasks separately, and ignore modeling the relations between the local and global structures. In this paper, we propose a general approach to learn the local and global features collaboratively: (1) in order to characterize the correlation among nodes and communities (a set of nodes), we employ the joint community detection/dynamic routing modules to generate the clustering assignment matrices at first and then utilize these matrices to cluster nodes to capture the global information of graphs (locally relevant graph representations). Inspired by the success of spectral clustering, we minimize the ratiocut loss to help optimize the learned assignment matrices. (2) We maximize the mutual information between local and global representations to help learn the globally relevant node representations. Experimental results on a variety of node and graph classification benchmarks show that our model can achieve superior performance over the state-of-the-art approaches.

SL-DML: Signal Level Deep Metric Learning for Multimodal One-Shot Action Recognition

Raphael Memmesheimer, Nick Theisen, Dietrich Paulus

Responsive image

Auto-TLDR; One-Shot Action Recognition using Metric Learning

Slides Similar

Recognizing an activity with a single reference sample using metric learning approaches is a promising research field. The majority of few-shot methods focus on object recognition or face-identification. We propose a metric learning approach to reduce the action recognition problem to a nearest neighbor search in embedding space. We encode signals into images and extract features using a deep residual CNN. Using triplet loss, we learn a feature embedding. The resulting encoder transforms features into an embedding space in which closer distances encode similar actions while higher distances encode different actions. Our approach is based on a signal level formulation and remains flexible across a variety of modalities. It further outperforms the baseline on the large scale NTU RGB+D 120 dataset for the One-Shot action recognition protocol by \ntuoneshotimpro%. With just 60% of the training data, our approach still outperforms the baseline approach by \ntuoneshotimproreduced%. With 40% of the training data, our approach performs comparably well as the second follow up. Further, we show that our approach generalizes well in experiments on the UTD-MHAD dataset for inertial, skeleton and fused data and the Simitate dataset for motion capturing data. Furthermore, our inter-joint and inter-sensor experiments suggest good capabilities on previously unseen setups.

Multi-Label Contrastive Focal Loss for Pedestrian Attribute Recognition

Xiaoqiang Zheng, Zhenxia Yu, Lin Chen, Fan Zhu, Shilong Wang

Responsive image

Auto-TLDR; Multi-label Contrastive Focal Loss for Pedestrian Attribute Recognition

Slides Poster Similar

Pedestrian Attribute Recognition (PAR) has received extensive attention during the past few years. With the advances of deep constitutional neural networks (CNNs), the performance of PAR has been significantly improved. Existing methods tend to acquire attribute-specific features by designing various complex network structures with additional modules. Such additional modules, however, dramatically increase the number of parameters. Meanwhile, the problems of class imbalance and hard attribute retrieving remain underestimated in PAR. In this paper, we explore the optimization mechanism of the training processing to account for these problems and propose a new loss function called Multi-label Contrastive Focal Loss (MCFL). This proposed MCFL emphasizes the hard and minority attributes by using a separated re-weighting mechanism for different positive and negative classes to alleviate the impact of the imbalance. MCFL is also able to enlarge the gaps between the intra-class of multi-label attributes, to force CNNs to extract more subtle discriminative features. We evaluate the proposed MCFL on three large public pedestrian datasets, including RAP, PA-100K, and PETA. The experimental results indicate that the proposed MCFL with the ResNet-50 backbone is able to outperform other state-of-the-art approaches in comparison.

Siamese Graph Convolution Network for Face Sketch Recognition

Liang Fan, Xianfang Sun, Paul Rosin

Responsive image

Auto-TLDR; A novel Siamese graph convolution network for face sketch recognition

Slides Poster Similar

In this paper, we present a novel Siamese graph convolution network (GCN) for face sketch recognition. To build a graph from an image, we utilize a deep learning method to detect the image edges, and then use a superpixel method to segment the edge image. Each segmented superpixel region is taken as a node, and each pair of adjacent regions forms an edge of the graph. Graphs from both a face sketch and a face photo are input into the Siamese GCN for recognition. A deep graph matching method is used to share messages between cross-modal graphs in this model. Experiments show that the GCN can obtain high performance on several face photo-sketch datasets, including seen and unseen face photo-sketch datasets. It is also shown that the model performance based on the graph structure representation of the data using the Siamese GCN is more stable than a Siamese CNN model.

A Unified Framework for Distance-Aware Domain Adaptation

Fei Wang, Youdong Ding, Huan Liang, Yuzhen Gao, Wenqi Che

Responsive image

Auto-TLDR; distance-aware domain adaptation

Slides Poster Similar

Unsupervised domain adaptation has achieved significant results by leveraging knowledge from a source domain to learn a related but unlabeled target domain. Previous methods are insufficient to model domain discrepancy and class discrepancy, which may lead to misalignment and poor adaptation performance. To address this problem, in this paper, we propose a unified framework, called distance-aware domain adaptation, which is fully aware of both cross-domain distance and class-discriminative distance. In addition, second-order statistics distance and manifold alignment are also exploited to extract more information from data. In this manner, the generalization error of the target domain in classification problems can be reduced substantially. To validate the proposed method, we conducted experiments on five public datasets and an ablation study. The results demonstrate the good performance of our proposed method.

TreeRNN: Topology-Preserving Deep Graph Embedding and Learning

Yecheng Lyu, Ming Li, Xinming Huang, Ulkuhan Guler, Patrick Schaumont, Ziming Zhang

Responsive image

Auto-TLDR; TreeRNN: Recurrent Neural Network for General Graph Classification

Slides Poster Similar

General graphs are difficult for learning due to their irregular structures. Existing works employ message passing along graph edges to extract local patterns using customized graph kernels, but few of them are effective for the integration of such local patterns into global features. In contrast, in this paper we study the methods to transfer the graphs into trees so that explicit orders are learned to direct the feature integration from local to global. To this end, we apply the breadth first search (BFS) to construct trees from the graphs, which adds direction to the graph edges from the center node to the peripheral nodes. In addition, we proposed a novel projection scheme that transfer the trees to image representations, which is suitable for conventional convolution neural networks (CNNs) and recurrent neural networks (RNNs). To best learn the patterns from the graph-tree-images, we propose TreeRNN, a 2D RNN architecture that recurrently integrates the image pixels by rows and columns to help classify the graph categories. We evaluate the proposed method on several graph classification datasets, and manage to demonstrate comparable accuracy with the state-of-the-art on MUTAG, PTC-MR and NCI1 datasets.

Context for Object Detection Via Lightweight Global and Mid-Level Representations

Mesut Erhan Unal, Adriana Kovashka

Responsive image

Auto-TLDR; Context-Based Object Detection with Semantic Similarity

Slides Poster Similar

We propose an approach for explicitly capturing context in object detection. We model visual and geometric relationships between object regions, but also model the global scene as a first-class participant. In contrast to prior approaches, both the context we rely on, as well as our proposed mechanism for belief propagation over regions, is lightweight. We also experiment with capturing similarities between regions at a semantic level, by modeling class co-occurrence and linguistic similarity between class names. We show that our approach significantly outperforms Faster R-CNN, and performs competitively with a much more costly approach that also models context.

Improved Deep Classwise Hashing with Centers Similarity Learning for Image Retrieval

Ming Zhang, Hong Yan

Responsive image

Auto-TLDR; Deep Classwise Hashing for Image Retrieval Using Center Similarity Learning

Slides Poster Similar

Deep supervised hashing for image retrieval has attracted researchers' attention due to its high efficiency and superior retrieval performance. Most existing deep supervised hashing works, which are based on pairwise/triplet labels, suffer from the expensive computational cost and insufficient utilization of the semantics information. Recently, deep classwise hashing introduced a classwise loss supervised by class labels information alternatively; however, we find it still has its drawback. In this paper, we propose an improved deep classwise hashing, which enables hashing learning and class centers learning simultaneously. Specifically, we design a two-step strategy on center similarity learning. It interacts with the classwise loss to attract the class center to concentrate on the intra-class samples while pushing other class centers as far as possible. The centers similarity learning contributes to generating more compact and discriminative hashing codes. We conduct experiments on three benchmark datasets. It shows that the proposed method effectively surpasses the original method and outperforms state-of-the-art baselines under various commonly-used evaluation metrics for image retrieval.

Progressive Cluster Purification for Unsupervised Feature Learning

Yifei Zhang, Chang Liu, Yu Zhou, Wei Wang, Weiping Wang, Qixiang Ye

Responsive image

Auto-TLDR; Progressive Cluster Purification for Unsupervised Feature Learning

Slides Poster Similar

In unsupervised feature learning, sample specificity based methods ignore the inter-class information, which deteriorates the discriminative capability of representation models. Clustering based methods are error-prone to explore the complete class boundary information due to the inevitable class inconsistent samples in each cluster. In this work, we propose a novel clustering based method, which, by iteratively excluding class inconsistent samples during progressive cluster formation, alleviates the impact of noise samples in a simple-yet-effective manner. Our approach, referred to as Progressive Cluster Purification (PCP), implements progressive clustering by gradually reducing the number of clusters during training, while the sizes of clusters continuously expand consistently with the growth of model representation capability. With a well-designed cluster purification mechanism, it further purifies clusters by filtering noise samples which facilitate the subsequent feature learning by utilizing the refined clusters as pseudo-labels. Experiments on commonly used benchmarks demonstrate that the proposed PCP improves baseline method with significant margins. Our code will be available at https://github.com/zhangyifei0115/PCP.

Supervised Feature Embedding for Classification by Learning Rank-Based Neighborhoods

Ghazaal Sheikhi, Hakan Altincay

Responsive image

Auto-TLDR; Supervised Feature Embedding with Representation Learning of Rank-based Neighborhoods

Slides Similar

In feature embedding, the recovery of associated discriminative information in the reduced subspace is critical for downstream classifiers. In this study, a supervised feature embedding method is proposed inspired by the well-known word embedding technique, word2vec. Proposed embedding method is implemented as representative learning of rank-based neighborhoods. The notion of context words in word2vec is extended into neighboring instances within a given window. Neighborship is defined using ranks of instances rather than their values so that regions with different densities are captured properly. Each sample is represented by a unique one-hot vector whereas its neighbors are encoded by several two-hot vectors. The two-hot vectors are identical for neighboring samples of the same class. A feed-forward neural network with a continuous projection layer, then learns the mapping from one-hot vectors to multiple two-hot vectors. The hidden layer determines the reduced subspace for the train samples. The obtained transformation is then applied on test data to find a lower-dimensional representation. Proposed method is tested in classification problems on 10 UCI data sets. Experimental results confirm that the proposed method is effective in finding a discriminative representation of the features and outperforms several supervised embedding approaches in terms of classification performance.

Multi-Modal Contextual Graph Neural Network for Text Visual Question Answering

Yaoyuan Liang, Xin Wang, Xuguang Duan, Wenwu Zhu

Responsive image

Auto-TLDR; Multi-modal Contextual Graph Neural Network for Text Visual Question Answering

Slides Poster Similar

Text visual question answering (TextVQA) targets at answering the question related to texts appearing in the given images, posing more challenges than VQA by requiring a deeper recognition and understanding of various shapes of human-readable scene texts as well as their meanings in different contexts. Existing works on TextVQA suffer from two weaknesses: i) scene texts and non-textual objects are processed separately and independently without considering their mutual interactions during the question understanding and answering process, ii) scene texts are encoded only through word embeddings without taking the corresponding visual appearance features as well as their potential relationships with other non-textual objects in the images into account. To overcome the weakness of exiting works, we propose a novel multi-modal contextual graph neural network (MCG) model for TextVQA. The proposed MCG model can capture the relationships between visual features of scene texts and non-textual objects in the given images as well as utilize richer sources of multi-modal features to improve the model performance. In particular, we encode the scene texts into richer features containing textual, visual and positional features, then model the visual relations between scene texts and non-textual objects through a contextual graph neural network. Our extensive experiments on real-world dataset demonstrate the advantages of the proposed MCG model over baseline approaches.

Class-Incremental Learning with Topological Schemas of Memory Spaces

Xinyuan Chang, Xiaoyu Tao, Xiaopeng Hong, Xing Wei, Wei Ke, Yihong Gong

Responsive image

Auto-TLDR; Class-incremental Learning with Topological Schematic Model

Slides Poster Similar

Class-incremental learning (CIL) aims to incrementally learn a unified classifier for new classes emerging, which suffers from the catastrophic forgetting problem. To alleviate forgetting and improve the recognition performance, we propose a novel CIL framework, named the topological schemas model (TSM). TSM consists of a Gaussian mixture model arranged on 2D grids (2D-GMM) as the memory of the learned knowledge. To train the 2D-GMM model, we develop a novel competitive expectation-maximization (CEM) method, which contains a global topology embedding step and a local expectation-maximization finetuning step. Meanwhile, we choose the image samples of old classes that have the maximum posterior probability with respect to each Gaussian distribution as the episodic points. When finetuning for new classes, we propose the memory preservation loss (MPL) term to ensure episodic points still have maximum probabilities with respect to the corresponding Gaussian distribution. MPL preserves the distribution of 2D-GMM for old knowledge during incremental learning and alleviates catastrophic forgetting. Comprehensive experimental evaluations on two popular CIL benchmarks CIFAR100 and subImageNet demonstrate the superiority of our TSM.

MetaMix: Improved Meta-Learning with Interpolation-based Consistency Regularization

Yangbin Chen, Yun Ma, Tom Ko, Jianping Wang, Qing Li

Responsive image

Auto-TLDR; MetaMix: A Meta-Agnostic Meta-Learning Algorithm for Few-Shot Classification

Slides Poster Similar

Model-Agnostic Meta-Learning (MAML) and its variants are popular few-shot classification methods. They train an initializer across a variety of sampled learning tasks (also known as episodes) such that the initialized model can adapt quickly to new tasks. However, within each episode, current MAML-based algorithms have limitations in forming generalizable decision boundaries using only a few training examples. In this paper, we propose an approach called MetaMix. It generates virtual examples within each episode to regularize the backbone models. MetaMix can be applied in any of the MAML-based algorithms and learn the decision boundaries which are more generalizable to new tasks. Experiments on the mini-ImageNet, CUB, and FC100 datasets show that MetaMix improves the performance of MAML-based algorithms and achieves the state-of-the-art result when applied in Meta-Transfer Learning.

Incorporating Depth Information into Few-Shot Semantic Segmentation

Yifei Zhang, Desire Sidibe, Olivier Morel, Fabrice Meriaudeau

Responsive image

Auto-TLDR; RDNet: A Deep Neural Network for Few-shot Segmentation Using Depth Information

Slides Poster Similar

Few-shot segmentation presents a significant challenge for semantic scene understanding under limited supervision. Namely, this task targets at generalizing the segmentation ability of the model to new categories given a few samples. In order to obtain complete scene information, we extend the RGB-centric methods to take advantage of complementary depth information. In this paper, we propose a two-stream deep neural network based on metric learning. Our method, known as RDNet, learns class-specific prototype representations within RGB and depth embedding spaces, respectively. The learned prototypes provide effective semantic guidance on the corresponding RGB and depth query image, leading to more accurate performance. Moreover, we build a novel outdoor scene dataset, known as Cityscapes-3i, using labeled RGB images and depth images from the Cityscapes dataset. We also perform ablation studies to explore the effective use of depth information in few-shot segmentation tasks. Experiments on Cityscapes-3i show that our method achieves promising results with visual and complementary geometric cues from only a few labeled examples.