Prior Knowledge about Attributes: Learning a More Effective Potential Space for Zero-Shot Recognition

Chunlai Chai, Yukuan Lou

Responsive image

Auto-TLDR; Attribute Correlation Potential Space Generation for Zero-Shot Learning

Slides Poster

Zero-shot learning (ZSL) aims to recognize unseen classes accurately by learning seen classes and known attributes, but correlations in attributes were ignored by previous study which lead to classification results confused. To solve this problem, we build an Attribute Correlation Potential Space Generation (ACPSG) model which uses a graph convolution network and attribute correlation to generate a more discriminating potential space. Combining potential discrimination space and user-defined attribute space, we can better classify unseen classes. Our approach outperforms some existing state-of-the-art methods on several benchmark datasets, whether it is conventional ZSL or generalized ZSL.

Similar papers

Heterogeneous Graph-Based Knowledge Transfer for Generalized Zero-Shot Learning

Junjie Wang, Xiangfeng Wang, Bo Jin, Junchi Yan, Wenjie Zhang, Hongyuan Zha

Responsive image

Auto-TLDR; Heterogeneous Graph-based Knowledge Transfer for Generalized Zero-Shot Learning

Slides Poster Similar

Generalized zero-shot learning (GZSL) tackles the problem of learning to classify instances involving both seen classes and unseen ones. The key issue is how to effectively transfer the model learned from seen classes to unseen classes. Existing works in GZSL usually assume that some prior information about unseen classes are available. However, such an assumption is unrealistic when new unseen classes appear dynamically. To this end, we propose a novel heterogeneous graph-based knowledge transfer method (HGKT) for GZSL, agnostic to unseen classes and instances, by leveraging graph neural network. Specifically, a structured heterogeneous graph is constructed with high-level representative nodes for seen classes, which are chosen through Wasserstein barycenter in order to simultaneously capture inter-class and intra-class relationship. The aggregation and embedding functions can be learned throughgraph neural network, which can be used to compute the embeddings of unseen classes by transferring the knowledge from their neighbors. Extensive experiments on public benchmark datasets show that our method achieves state-of-the-art results.

Zero-Shot Text Classification with Semantically Extended Graph Convolutional Network

Tengfei Liu, Yongli Hu, Junbin Gao, Yanfeng Sun, Baocai Yin

Responsive image

Auto-TLDR; Semantically Extended Graph Convolutional Network for Zero-shot Text Classification

Slides Poster Similar

As a challenging task of Natural Language Processing(NLP), zero-shot text classification has attracted more and more attention recently. It aims to detect classes that the model has never seen in the training set. For this purpose, a feasible way is to construct connection between the seen and unseen classes by semantic extension and classify the unseen classes by information propagation over the connection. Although many related zero-shot text classification methods have been exploited, how to realize semantic extension properly and propagate information effectively is far from solved. In this paper, we propose a novel zero-shot text classification method called Semantically Extended Graph Convolutional Network (SEGCN). In the proposed method, the semantic category knowledge from ConceptNet is utilized to semantic extension for linking seen classes to unseen classes and constructing a graph of all classes. Then, we build upon Graph Convolutional Network (GCN) for predicting the textual classifier for each category, which transfers the category knowledge by the convolution operators on the constructed graph and is trained in a semi-supervised manner using the samples of the seen classes. The experimental results on Dbpedia and 20newsgroup datasets show that our method outperforms the state of the art zero-shot text classification methods.

VSB^2-Net: Visual-Semantic Bi-Branch Network for Zero-Shot Hashing

Xin Li, Xiangfeng Wang, Bo Jin, Wenjie Zhang, Jun Wang, Hongyuan Zha

Responsive image

Auto-TLDR; VSB^2-Net: inductive zero-shot hashing for image retrieval

Slides Poster Similar

Zero-shot hashing aims at learning hashing model from seen classes and the obtained model is capable of generalizing to unseen classes for image retrieval. Inspired by zero-shot learning, existing zero-shot hashing methods usually transfer the supervised knowledge from seen to unseen classes, by embedding the hamming space to a shared semantic space. However, this makes instances difficult to distinguish due to limited hashing bit numbers, especially for semantically similar unseen classes. We propose a novel inductive zero-shot hashing framework, i.e., VSB^2-Net, where both semantic space and visual feature space are embedded to the same hamming space instead. The reconstructive semantic relationships are established in the hamming space, preserving local similarity relationships and explicitly enlarging the discrepancy between semantic hamming vectors. A two-task architecture, comprising of classification module and visual feature reconstruction module, is employed to enhance the generalization and transfer abilities. Extensive evaluation results on several benchmark datasets demonstratethe superiority of our proposed method compared to several state-of-the-art baselines.

A Prototype-Based Generalized Zero-Shot Learning Framework for Hand Gesture Recognition

Jinting Wu, Yujia Zhang, Xiao-Guang Zhao

Responsive image

Auto-TLDR; Generalized Zero-Shot Learning for Hand Gesture Recognition

Slides Poster Similar

Hand gesture recognition plays a significant role in human-computer interaction for understanding various human gestures and their intent. However, most prior works can only recognize gestures of limited labeled classes and fail to adapt to new categories. The task of Generalized Zero-Shot Learning (GZSL) for hand gesture recognition aims to address the above issue by leveraging semantic representations and detecting both seen and unseen class samples. In this paper, we propose an end-to-end prototype-based GZSL framework for hand gesture recognition which consists of two branches. The first branch is a prototype-based detector that learns gesture representations and determines whether an input sample belongs to a seen or unseen category. The second branch is a zero-shot label predictor which takes the features of unseen classes as input and outputs predictions through a learned mapping mechanism between the feature and the semantic space. We further establish a hand gesture dataset that specifically targets this GZSL task, and comprehensive experiments on this dataset demonstrate the effectiveness of our proposed approach on recognizingQuestionnaire both seen and unseen gestures.

Recognizing Bengali Word Images - A Zero-Shot Learning Perspective

Sukalpa Chanda, Daniël Arjen Willem Haitink, Prashant Kumar Prasad, Jochem Baas, Umapada Pal, Lambert Schomaker

Responsive image

Auto-TLDR; Zero-Shot Learning for Word Recognition in Bengali Script

Slides Poster Similar

Zero-Shot Learning(ZSL) techniques could classify a completely unseen class, which it has never seen before during training. Thus, making it more apt for any real-life classification problem, where it is not possible to train a system with annotated data for all possible class types. This work investigates recognition of word images written in Bengali Script in a ZSL framework. The proposed approach performs Zero-Shot word recognition by coupling deep learned features procured from VGG16 architecture along with 13 basic shapes/stroke primitives commonly observed in Bengali script characters. As per the notion of ZSL framework those 13 basic shapes are termed as “Signature Attributes”. The obtained results are promising while evaluation was carried out in a Five-Fold cross-validation setup dealing with samples from 250 word classes.

Open Set Domain Recognition Via Attention-Based GCN and Semantic Matching Optimization

Xinxing He, Yuan Yuan, Zhiyu Jiang

Responsive image

Auto-TLDR; Attention-based GCN and Semantic Matching Optimization for Open Set Domain Recognition

Slides Poster Similar

Open set domain recognition has got the attention in recent years. The task aims to specifically classify each sample in the practical unlabeled target domain, which consists of all known classes in the manually labeled source domain and target-specific unknown categories. The absence of annotated training data or auxiliary attribute information for unknown categories makes this task especially difficult. Moreover, exiting domain discrepancy in label space and data distribution further distracts the knowledge transferred from known classes to unknown classes. To address these issues, this work presents an end-to-end model based on attention-based GCN and semantic matching optimization, which first employs the attention mechanism to enable the central node to learn more discriminating representations from its neighbors in the knowledge graph. Moreover, a coarse-to-fine semantic matching optimization approach is proposed to progressively bridge the domain gap. Experimental results validate that the proposed model not only has superiority on recognizing the images of known and unknown classes, but also can adapt to various openness of the target domain.

Exploiting Knowledge Embedded Soft Labels for Image Recognition

Lixian Yuan, Riquan Chen, Hefeng Wu, Tianshui Chen, Wentao Wang, Pei Chen

Responsive image

Auto-TLDR; A Soft Label Vector for Image Recognition

Slides Poster Similar

Objects from correlated classes usually share highly similar appearances while objects from uncorrelated classes are very different. Most of current image recognition works treat each class independently, which ignores these class correlations and inevitably leads to sub-optimal performance in many cases. Fortunately, object classes inherently form a hierarchy with different levels of abstraction and this hierarchy encodes rich correlations among different classes. In this work, we utilize a soft label vector that encodes the prior knowledge of class correlations as extra regularization to train the image classifiers. Specifically, for each class, instead of simply using a one-hot vector, we assign a high value to its correlated classes and assign small values to those uncorrelated ones, thus generating knowledge embedded soft labels. We conduct experiments on both general and fine-grained image recognition benchmarks and demonstrate its superiority compared with existing methods.

Incrementally Zero-Shot Detection by an Extreme Value Analyzer

Sixiao Zheng, Yanwei Fu, Yanxi Hou

Responsive image

Auto-TLDR; IZSD-EVer: Incremental Zero-Shot Detection for Incremental Learning

Slides Similar

Human beings not only have the ability of recogniz-ing novel unseen classes, but also can incrementally incorporatethe new classes to existing knowledge preserved. However, thezero-shot learning models assume that all seen classes should beknown beforehand, while incremental learning models cannotrecognize unseen classes. This paper introduces a novel andchallenging task of Incrementally Zero-Shot Detection (IZSD),a practical strategy for both zero-shot learning and class-incremental learning in real-world object detection. An innovativeend-to-end model – IZSD-EVer was proposed to tackle this taskthat requires incrementally detecting new classes and detectingthe classes that have never been seen. Specifically, we proposea novel extreme value analyzer to simultaneously detect objectsfrom old seen, new seen, and unseen classes. Additionally andtechnically, we propose two innovative losses, i.e., background-foreground mean squared error loss alleviating the extremeimbalance of the background and foreground of images, andprojection distance loss aligning the visual space and semanticspaces of old seen classes. Experiments demonstrate the efficacyof our model in detecting objects from both the seen and unseenclasses, outperforming the alternative models on Pascal VOC andMSCOCO datasets.

Semantics to Space(S2S): Embedding Semantics into Spatial Space for Zero-Shot Verb-Object Query Inferencing

Sungmin Eum, Heesung Kwon

Responsive image

Auto-TLDR; Semantics-to-Space: Deep Zero-Shot Learning for Verb-Object Interaction with Vectors

Slides Poster Similar

We present a novel deep zero-shot learning (ZSL) model for inferencing human-object-interaction with verb-object (VO) query. While the previous two-stream ZSL approaches only use the semantic/textual information to be fed into the query stream, we seek to incorporate and embed the semantics into the visual representation stream as well. Our approach is powered by Semantics-to-Space (S2S) architecture where semantics derived from the residing objects are embedded into a spatial space of the visual stream. This architecture allows the co-capturing of the semantic attributes of the human and the objects along with their location/size/silhouette information. To validate, we have constructed a new dataset, Verb-Transferability 60 (VT60). VT60 provides 60 different VO pairs with overlapping verbs tailored for testing two-stream ZSL approaches with VO query. Experimental evaluations show that our approach not only outperforms the state-of-the-art, but also shows the capability of consistently improving performance regardless of which ZSL baseline architecture is used.

Local Propagation for Few-Shot Learning

Yann Lifchitz, Yannis Avrithis, Sylvaine Picard

Responsive image

Auto-TLDR; Local Propagation for Few-Shot Inference

Slides Poster Similar

The challenge in few-shot learning is that available data is not enough to capture the underlying distribution. To mitigate this, two emerging directions are (a) using local image representations, essentially multiplying the amount of data by a constant factor, and (b) using more unlabeled data, for instance by transductive inference, jointly on a number of queries. In this work, we bring these two ideas together, introducing local propagation. We treat local image features as independent examples, we build a graph on them and we use it to propagate both the features themselves and the labels, known and unknown. Interestingly, since there is a number of features per image, even a single query gives rise to transductive inference. As a result, we provide a universally safe choice for few-shot inference under both non-transductive and transductive settings, improving accuracy over corresponding methods. This is in contrast to existing solutions, where one needs to choose the method depending on the quantity of available data.

Edge-Aware Graph Attention Network for Ratio of Edge-User Estimation in Mobile Networks

Jiehui Deng, Sheng Wan, Xiang Wang, Enmei Tu, Xiaolin Huang, Jie Yang, Chen Gong

Responsive image

Auto-TLDR; EAGAT: Edge-Aware Graph Attention Network for Automatic REU Estimation in Mobile Networks

Slides Poster Similar

Estimating the Ratio of Edge-Users (REU) is an important issue in mobile networks, as it helps the subsequent adjustment of loads in different cells. However, existing approaches usually determine the REU manually, which are experience-dependent and labor-intensive, and thus the estimated REU might be imprecise. Considering the inherited graph structure of mobile networks, in this paper, we utilize a graph-based deep learning method for automatic REU estimation, where the practical cells are deemed as nodes and the load switchings among them constitute edges. Concretely, Graph Attention Network (GAT) is employed as the backbone of our method due to its impressive generalizability in dealing with networked data. Nevertheless, conventional GAT cannot make full use of the information in mobile networks, since it only incorporates node features to infer the pairwise importance and conduct graph convolutions, while the edge features that are actually critical in our problem are disregarded. To accommodate this issue, we propose an Edge-Aware Graph Attention Network (EAGAT), which is able to fuse the node features and edge features for REU estimation. Extensive experimental results on two real-world mobile network datasets demonstrate the superiority of our EAGAT approach to several state-of-the-art methods.

Graph-Based Interpolation of Feature Vectors for Accurate Few-Shot Classification

Yuqing Hu, Vincent Gripon, Stéphane Pateux

Responsive image

Auto-TLDR; Transductive Learning for Few-Shot Classification using Graph Neural Networks

Slides Poster Similar

In few-shot classification, the aim is to learn models able to discriminate classes using only a small number of labeled examples. In this context, works have proposed to introduce Graph Neural Networks (GNNs) aiming at exploiting the information contained in other samples treated concurrently, what is commonly referred to as the transductive setting in the literature. These GNNs are trained all together with a backbone feature extractor. In this paper, we propose a new method that relies on graphs only to interpolate feature vectors instead, resulting in a transductive learning setting with no additional parameters to train. Our proposed method thus exploits two levels of information: a) transfer features obtained on generic datasets, b) transductive information obtained from other samples to be classified. Using standard few-shot vision classification datasets, we demonstrate its ability to bring significant gains compared to other works.

More Correlations Better Performance: Fully Associative Networks for Multi-Label Image Classification

Yaning Li, Liu Yang

Responsive image

Auto-TLDR; Fully Associative Network for Fully Exploiting Correlation Information in Multi-Label Classification

Slides Poster Similar

Recent researches demonstrate that correlation modeling plays a key role in high-performance multi-label classification methods. However, existing methods do not take full advantage of correlation information, especially correlations in feature and label spaces of each image, which limits the performance of correlation-based multi-label classification methods. With more correlations considered, in this study, a Fully Associative Network (FAN) is proposed for fully exploiting correlation information, which involves both visual feature and label correlations. Specifically, FAN introduces a robust covariance pooling to summarize convolution features as global image representation for capturing feature correlation in the multi-label task. Moreover, it constructs an effective label correlation matrix based on a re-weighted scheme, which is fed into a graph convolution network for capturing label correlation. Then, correlation between covariance representations (i.e., feature correlation ) and the outputs of GCN (i.e., label correlation) are modeled for final prediction. Experimental results on two datasets illustrate the effectiveness and efficiency of our proposed FAN compared with state-of-the-art methods.

Complementing Representation Deficiency in Few-Shot Image Classification: A Meta-Learning Approach

Xian Zhong, Cheng Gu, Wenxin Huang, Lin Li, Shuqin Chen, Chia-Wen Lin

Responsive image

Auto-TLDR; Meta-learning with Complementary Representations Network for Few-Shot Learning

Slides Poster Similar

Few-shot learning is a challenging problem that has attracted more and more attention recently since abundant training samples are difficult to obtain in practical applications. Meta-learning has been proposed to address this issue, which focuses on quickly adapting a predictor as a base-learner to new tasks, given limited labeled samples. However, a critical challenge for meta-learning is the representation deficiency since it is hard to discover common information from a small number of training samples or even one, as is the representation of key features from such little information. As a result, a meta-learner cannot be trained well in a high-dimensional parameter space to generalize to new tasks. Existing methods mostly resort to extracting less expressive features so as to avoid the representation deficiency. Aiming at learning better representations, we propose a meta-learning approach with complemented representations network (MCRNet) for few-shot image classification. In particular, we embed a latent space, where latent codes are reconstructed with extra representation information to complement the representation deficiency. Furthermore, the latent space is established with variational inference, collaborating well with different base-learners, and can be extended to other models. Finally, our end-to-end framework achieves the state-of-the-art performance in image classification on three standard few-shot learning datasets.

AOAM: Automatic Optimization of Adjacency Matrix for Graph Convolutional Network

Yuhang Zhang, Hongshuai Ren, Jiexia Ye, Xitong Gao, Yang Wang, Kejiang Ye, Cheng-Zhong Xu

Responsive image

Auto-TLDR; Adjacency Matrix for Graph Convolutional Network in Non-Euclidean Space

Slides Poster Similar

Graph Convolutional Network (GCN) is adopted to tackle the problem of the convolution operation in non-Euclidean space. Although previous works on GCN have made some progress, one of their limitations is that their input Adjacency Matrix (AM) is designed manually and requires domain knowledge, which is cumbersome, tedious and error-prone. In addition, entries of this fixed Adjacency Matrix are generally designed as binary values (i.e., ones and zeros) which can not reflect more complex relationship between nodes. However, many applications require a weighted and dynamic Adjacency Matrix instead of an unweighted and fixed Adjacency Matrix. To this end, there are few works focusing on designing a more flexible Adjacency Matrix. In this paper, we propose an end-to-end algorithm to improve the GCN performance by focusing on the Adjacency Matrix. We first provide a calculation method that called node information entropy to update the matrix. Then, we analyze the search strategy in a continuous space and introduce the Deep Deterministic Policy Gradient (DDPG) method to overcome the demerit of the discrete space search. Finally, we integrate the GCN and reinforcement learning into an end-to-end framework. Our method can automatically define the adjacency matrix without artificial knowledge. At the same time, the proposed approach can deal with any size of the matrix and provide a better value for the network. Four popular datasets are selected to evaluate the capability of our algorithm. The method in this paper achieves the state-of-the-art performance on Cora and Pubmed datasets, respectively, with the accuracy of 84.6% and 81.6%.

Context for Object Detection Via Lightweight Global and Mid-Level Representations

Mesut Erhan Unal, Adriana Kovashka

Responsive image

Auto-TLDR; Context-Based Object Detection with Semantic Similarity

Slides Poster Similar

We propose an approach for explicitly capturing context in object detection. We model visual and geometric relationships between object regions, but also model the global scene as a first-class participant. In contrast to prior approaches, both the context we rely on, as well as our proposed mechanism for belief propagation over regions, is lightweight. We also experiment with capturing similarities between regions at a semantic level, by modeling class co-occurrence and linguistic similarity between class names. We show that our approach significantly outperforms Faster R-CNN, and performs competitively with a much more costly approach that also models context.

Privacy Attributes-Aware Message Passing Neural Network for Visual Privacy Attributes Classification

Hanbin Hong, Wentao Bao, Yuan Hong, Yu Kong

Responsive image

Auto-TLDR; Privacy Attributes-Aware Message Passing Neural Network for Visual Privacy Attribute Classification

Slides Poster Similar

Visual Privacy Attribute Classification (VPAC) identifies privacy information leakage via social media images. These images containing privacy attributes such as skin color, face or gender are classified into multiple privacy attribute categories in VPAC. With limited works in this task, current methods often extract features from images and simply classify the extracted feature into multiple privacy attribute classes. The dependencies between privacy attributes, e.g., skin color and face typically co-exist in the same image, are usually ignored in classification, which causes performance degradation in VPAC. In this paper, we propose a novel end-to-end Privacy Attributes-aware Message Passing Neural Network (PA-MPNN) to address VPAC. Privacy attributes are considered as nodes on a graph and an MPNN is introduced to model the privacy attribute dependencies. To generate representative features for privacy attribute nodes, a class-wise encoder-decoder is proposed to learn a latent space for each attribute. An attention mechanism with multiple correlation matrices is also introduced in MPNN to learn the privacy attributes graph automatically. Experimental results on the Privacy Attribute Dataset demonstrate that our framework achieves better performance than state-of-the-art methods on visual privacy attributes classification.

Parallel Network to Learn Novelty from the Known

Shuaiyuan Du, Chaoyi Hong, Zhiyu Pan, Chen Feng, Zhiguo Cao

Responsive image

Auto-TLDR; Trainable Parallel Network for Pseudo-Novel Detection

Slides Poster Similar

Towards multi-class novelty detection, we propose an end-to-end trainable Parallel Network (PN) using no additional data but only the training set itself. Our key idea is to first divide the training set into successive subtasks of pseudo-novelty detection to simulate real scenarios. We then design a multi-branch PN to well address the fine-grained division, which yields a compressed and more discriminative classification space and forms a natural ensemble. In practice, we divide the training data into subsets consisting of known and pseudo-novel classes. Each subset forms a sub-task fed to one branch in PN. During training, both known and pseudo-novel classes are uniformly distributed over the branches for better data balance and model diversity. By distinguishing between the known and the diverse pseudo-novel, PN extracts the concept of novelty in a compressed classification space. This provides PN with generalization ability to real novel classes which are absent during training. During online inference, this ability is further strengthened with the ensemble of PN's multiple branches. Experiments on three public datasets show our method's superiority to the mainstream methods.

Generative Latent Implicit Conditional Optimization When Learning from Small Sample

Idan Azuri, Daphna Weinshall

Responsive image

Auto-TLDR; GLICO: Generative Latent Implicit Conditional Optimization for Small Sample Learning

Slides Poster Similar

We revisit the long-standing problem of learning from small sample. The generation of new samples from a small training set of labeled points has attracted increased attention in recent years. In this paper, we propose a novel such method called GLICO (Generative Latent Implicit Conditional Optimization). GLICO learns a mapping from the training examples to a latent space and a generator that generates images from vectors in the latent space. Unlike most recent work, which rely on access to large amounts of unlabeled data, GLICO does not require access to any additional data other than the small set of labeled points. In fact, GLICO learns to synthesize completely new samples for every class using as little as 5 or 10 examples per class, with as few as 10 such classes and no data from unknown classes. GLICO is then used to augment the small training set while training a classifier on the small sample. To this end, our proposed method samples the learned latent space using spherical interpolation (slerp) and generates new examples using the trained generator. Empirical results show that the new sampled set is diverse enough, leading to improvement in image classification in comparison with the state of the art when trained on small samples obtained from CIFAR-10, CIFAR-100, and CUB-200.

MetaMix: Improved Meta-Learning with Interpolation-based Consistency Regularization

Yangbin Chen, Yun Ma, Tom Ko, Jianping Wang, Qing Li

Responsive image

Auto-TLDR; MetaMix: A Meta-Agnostic Meta-Learning Algorithm for Few-Shot Classification

Slides Poster Similar

Model-Agnostic Meta-Learning (MAML) and its variants are popular few-shot classification methods. They train an initializer across a variety of sampled learning tasks (also known as episodes) such that the initialized model can adapt quickly to new tasks. However, within each episode, current MAML-based algorithms have limitations in forming generalizable decision boundaries using only a few training examples. In this paper, we propose an approach called MetaMix. It generates virtual examples within each episode to regularize the backbone models. MetaMix can be applied in any of the MAML-based algorithms and learn the decision boundaries which are more generalizable to new tasks. Experiments on the mini-ImageNet, CUB, and FC100 datasets show that MetaMix improves the performance of MAML-based algorithms and achieves the state-of-the-art result when applied in Meta-Transfer Learning.

A General Model for Learning Node and Graph Representations Jointly

Chaofan Chen

Responsive image

Auto-TLDR; Joint Community Detection/Dynamic Routing for Graph Classification

Slides Poster Similar

This paper focuses on two fundamental graph recognition tasks: node classification and graph classification. Existing methods usually learn the node and graph representations for these two tasks separately, and ignore modeling the relations between the local and global structures. In this paper, we propose a general approach to learn the local and global features collaboratively: (1) in order to characterize the correlation among nodes and communities (a set of nodes), we employ the joint community detection/dynamic routing modules to generate the clustering assignment matrices at first and then utilize these matrices to cluster nodes to capture the global information of graphs (locally relevant graph representations). Inspired by the success of spectral clustering, we minimize the ratiocut loss to help optimize the learned assignment matrices. (2) We maximize the mutual information between local and global representations to help learn the globally relevant node representations. Experimental results on a variety of node and graph classification benchmarks show that our model can achieve superior performance over the state-of-the-art approaches.

Supervised Domain Adaptation Using Graph Embedding

Lukas Hedegaard, Omar Ali Sheikh-Omar, Alexandros Iosifidis

Responsive image

Auto-TLDR; Domain Adaptation from the Perspective of Multi-view Graph Embedding and Dimensionality Reduction

Slides Poster Similar

Getting deep convolutional neural networks to perform well requires a large amount of training data. When the available labelled data is small, it is often beneficial to use transfer learning to leverage a related larger dataset (source) in order to improve the performance on the small dataset (target). Among the transfer learning approaches, domain adaptation methods assume that distributions between the two domains are shifted and attempt to realign them. In this paper, we consider the domain adaptation problem from the perspective of multi-view graph embedding and dimensionality reduction. Instead of solving the generalised eigenvalue problem to perform the embedding, we formulate the graph-preserving criterion as loss in the neural network and learn a domain-invariant feature transformation in an end-to-end fashion. We show that the proposed approach leads to a powerful Domain Adaptation framework which generalises the prior methods CCSA and d-SNE, and enables simple and effective loss designs; an LDA-inspired instantiation of the framework leads to performance on par with the state-of-the-art on the most widely used Domain Adaptation benchmarks, Office31 and MNIST to USPS datasets.

Augmented Bi-Path Network for Few-Shot Learning

Baoming Yan, Chen Zhou, Bo Zhao, Kan Guo, Yang Jiang, Xiaobo Li, Zhang Ming, Yizhou Wang

Responsive image

Auto-TLDR; Augmented Bi-path Network for Few-shot Learning

Slides Poster Similar

Few-shot Learning (FSL) which aims to learn from few labeled training data is becoming a popular research topic, due to the expensive labeling cost in many real-world applications. One kind of successful FSL method learns to compare the testing (query) image and training (support) image by simply concatenating the features of two images and feeding it into the neural network. However, with few labeled data in each class, the neural network has difficulty in learning or comparing the local features of two images. Such simple image-level comparison may cause serious mis-classification. To solve this problem, we propose Augmented Bi-path Network (ABNet) for learning to compare both global and local features on multi-scales. Specifically, the salient patches are extracted and embedded as the local features for every image. Then, the model learns to augment the features for better robustness. Finally, the model learns to compare global and local features separately, \emph{i.e.}, in two paths, before merging the similarities. Extensive experiments show that the proposed ABNet outperforms the state-of-the-art methods. Both quantitative and visual ablation studies are provided to verify that the proposed modules lead to more precise comparison results.

PICK: Processing Key Information Extraction from Documents Using Improved Graph Learning-Convolutional Networks

Wenwen Yu, Ning Lu, Xianbiao Qi, Ping Gong, Rong Xiao

Responsive image

Auto-TLDR; PICK: A Graph Learning Framework for Key Information Extraction from Documents

Slides Poster Similar

Computer vision with state-of-the-art deep learning models have achieved huge success in the field of Optical Character Recognition (OCR) including text detection and recognition tasks recently. However, Key Information Extraction (KIE) from documents as the downstream task of OCR, having a large number of use scenarios in real-world, remains a challenge because documents not only have textual features extracting from OCR systems but also have semantic visual features that are not fully exploited and play a critical role in KIE. Too little work has been devoted to efficiently make full use of both textual and visual features of the documents. In this paper, we introduce PICK, a framework that is effective and robust in handling complex documents layout for KIE by combining graph learning with graph convolution operation, yielding a richer semantic representation containing the textual and visual features and global layout without ambiguity. Extensive experiments on real-world datasets have been conducted to show that our method outperforms baselines methods by significant margins.

Multi-Attribute Learning with Highly Imbalanced Data

Lady Viviana Beltran Beltran, Mickaël Coustaty, Nicholas Journet, Juan C. Caicedo, Antoine Doucet

Responsive image

Auto-TLDR; Data Imbalance in Multi-Attribute Deep Learning Models: Adaptation to face each one of the problems derived from imbalance

Slides Poster Similar

Data is one of the most important keys for success when studying a simple or a complex phenomenon. With the use of deep-learning exploding and its democratization, non-computer science experts may struggle to use highly complex deep learning architectures, even when straightforward models offer them suitable performances. In this article, we study the specific and common problem of data imbalance in real databases as most of the bad performance problems are due to the data itself. We review two points: first, when the data contains different levels of imbalance. Classical imbalanced learning strategies cannot be directly applied when using multi-attribute deep learning models, i.e., multi-task and multi-label architectures. Therefore, one of our contributions is our proposed adaptations to face each one of the problems derived from imbalance. Second, we demonstrate that with little to no imbalance, straightforward deep learning models work well. However, for non-experts, these models can be seen as black boxes, where all the effort is put in pre-processing the data. To simplify the problem, we performed the classification task ignoring information that is costly to extract, such as part localization which is widely used in the state of the art of attribute classification. We make use of a widely known attribute database, CUB-200-2011 - CUB as our main use case due to its deeply imbalanced nature, along with two better structured databases: celebA and Awa2. All of them contain multi-attribute annotations. The results of highly fine-grained attribute learning over CUB demonstrate that in the presence of imbalance, by using our proposed strategies is possible to have competitive results against the state of the art, while taking advantage of multi-attribute deep learning models. We also report results for two better-structured databases over which our models over-perform the state of the art.

Object Detection Using Dual Graph Network

Shengjia Chen, Zhixin Li, Feicheng Huang, Canlong Zhang, Huifang Ma

Responsive image

Auto-TLDR; A Graph Convolutional Network for Object Detection with Key Relation Information

Slides Similar

Most object detection methods focus only on the local information near the region proposal and ignore the object's global semantic relation and local spatial relation information, resulting in limited performance. To capture and explore these important relations, we propose a detection method based on a graph convolutional network (GCN). Two independent relation graph networks are used to obtain the global semantic information of the object in labels and the local spatial information in images. Semantic relation networks can implicitly acquire global knowledge, and by constructing a directed graph on the dataset, each node is represented by the word embedding of labels and then sent to the GCN to obtain high-level semantic representation. The spatial relation network encodes the relation by the positional relation module and the visual connection module, and enriches the object features through local key information from objects. The feature representation is further improved by aggregating the outputs of the two networks. Instead of directly disseminating visual features in the network, the dual-graph network explores more advanced feature information, giving the detector the ability to obtain key relations in labels and region proposals. Experiments on the PASCAL VOC and MS COCO datasets demonstrate that key relation information significantly improve the performance of detection with better ability to detect small objects and reasonable boduning box. The results on COCO dataset demonstrate our method obtains around 32.3% improvement on AP in terms of small objects.

Explanation-Guided Training for Cross-Domain Few-Shot Classification

Jiamei Sun, Sebastian Lapuschkin, Wojciech Samek, Yunqing Zhao, Ngai-Man Cheung, Alexander Binder

Responsive image

Auto-TLDR; Explaination-Guided Training for Cross-Domain Few-Shot Classification

Slides Poster Similar

Cross-domain few-shot classification task (CD-FSC) combines few-shot classification with the requirement to generalize across domains represented by datasets. This setup faces challenges originating from the limited labeled data in each class and, additionally, from the domain shift between training and test sets. In this paper, we introduce a novel training approach for existing FSC models. It leverages on the explanation scores, obtained from existing explanation methods when applied to the predictions of FSC models, computed for intermediate feature maps of the models. Firstly, we tailor the layer-wise relevance propagation (LRP) method to explain the prediction outcomes of FSC models. Secondly, we develop a model-agnostic explanation-guided training strategy that dynamically finds and emphasizes the features which are important for the predictions. Our contribution does not target a novel explanation method but lies in a novel application of explanations for the training phase. We show that explanation-guided training effectively improves the model generalization. We observe improved accuracy for three different FSC models: RelationNet, cross attention network, and a graph neural network-based formulation, on five few-shot learning datasets: miniImagenet, CUB, Cars, Places, and Plantae.

TAAN: Task-Aware Attention Network for Few-Shot Classification

Zhe Wang, Li Liu, Fanzhang Li

Responsive image

Auto-TLDR; TAAN: Task-Aware Attention Network for Few-Shot Classification

Slides Poster Similar

Few-shot classification aims to recognize unlabeled samples from unseen classes given only a few labeled samples.Current approaches of few-shot learning usually employ a metriclearning framework to learn a feature similarity comparison between a query (test) example and the few support (training) examples. However, these approaches all extract features from samples independently without looking at the entire task as a whole, and so fail to provide an enough discrimination to features. Moreover, the existing approaches lack the ability to select the most relevant features for the task at hand. In this work, we propose a novel algorithm called Task-Aware Attention Network (TAAN) to address the above problems in few-shot classification. By inserting a Task-Relevant Channel Attention Module into metric-based few-shot learners, TAAN generates channel attentions for each sample by aggregating the context of the entire support set and identifies the most relevant features for similarity comparison. The experiment demonstrates that TAAN is competitive in overall performance comparing to the recent state-of-the-art systems and improves the performance considerably over baseline systems on both mini-ImageNet and tiered-ImageNet benchmarks.

Multi-Order Feature Statistical Model for Fine-Grained Visual Categorization

Qingtao Wang, Ke Zhang, Shaoli Huang, Lianbo Zhang, Jin Fan

Responsive image

Auto-TLDR; Multi-Order Feature Statistical Method for Fine-Grained Visual Categorization

Slides Poster Similar

Fine-grained visual categorization aims to learn a robust image representation modeling subtle differences from similar categories. Existing methods in this field tackle the problem by designing complex frameworks, which produce high-level features by performing first-order or second-order pooling. Despite the impressive performance achieved by these strategies, the single-order networks only carry linear or non-linear information of the last convolutional layer, neglecting the fact that feature from different orders are mutually complementary. In this paper, we propose a Multi-Order Feature Statistical Method (MOFS), which learns fine-grained features characterizing multiple orders. Specifically, the MOFS consists of two sub-modules: (i) a first-order module modeling both mid-level and high-level features. (ii) a covariance feature statistical module capturing high-order features. By deploying these two sub-modules on the top of existing backbone networks, MOFS simultaneously captures multi-level of discrimative patters including local, global and co-related patters. We evaluate the proposed method on three challenging benchmarks, namely CUB-200-2011, Stanford Cars, and FGVC-Aircraft. Compared with state-of-the-art methods, experiments results exhibit superior performance in recognizing fine-grained objects

Few-Shot Few-Shot Learning and the Role of Spatial Attention

Yann Lifchitz, Yannis Avrithis, Sylvaine Picard

Responsive image

Auto-TLDR; Few-shot Learning with Pre-trained Classifier on Large-Scale Datasets

Slides Poster Similar

Few-shot learning is often motivated by the ability of humans to learn new tasks from few examples. However, standard few-shot classification benchmarks assume that the representation is learned on a limited amount of base class data, ignoring the amount of prior knowledge that a human may have accumulated before learning new tasks. At the same time, even if a powerful representation is available, it may happen in some domain that base class data are limited or non-existent. This motivates us to study a problem where the representation is obtained from a classifier pre-trained on a large-scale dataset of a different domain, assuming no access to its training process, while the base class data are limited to few examples per class and their role is to adapt the representation to the domain at hand rather than learn from scratch. We adapt the representation in two stages, namely on the few base class data if available and on the even fewer data of new tasks. In doing so, we obtain from the pre-trained classifier a spatial attention map that allows focusing on objects and suppressing background clutter. This is important in the new problem, because when base class data are few, the network cannot learn where to focus implicitly. We also show that a pre-trained network may be easily adapted to novel classes, without meta-learning.

A Self-Supervised GAN for Unsupervised Few-Shot Object Recognition

Khoi Nguyen, Sinisa Todorovic

Responsive image

Auto-TLDR; Self-supervised Few-Shot Object Recognition with a Triplet GAN

Slides Poster Similar

This paper addresses unsupervised few-shot object recognition, where all training images are unlabeled, and test images are divided into queries and a few labeled support images per object class of interest. The training and test images do not share object classes. We extend the vanilla GAN with two loss functions, both aimed at self-supervised learning. The first is a reconstruction loss that enforces the discriminator to reconstruct the probabilistically sampled latent code which has been used for generating the "fake" image. The second is a triplet loss that enforces the discriminator to output image encodings that are closer for more similar images. Evaluation, comparisons, and detailed ablation studies are done in the context of few-shot classification. Our approach significantly outperforms the state of the art on the Mini-Imagenet and Tiered-Imagenet datasets.

Adversarial Encoder-Multi-Task-Decoder for Multi-Stage Processes

Andre Mendes, Julian Togelius, Leandro Dos Santos Coelho

Responsive image

Auto-TLDR; Multi-Task Learning and Semi-Supervised Learning for Multi-Stage Processes

Similar

In multi-stage processes, decisions occur in an ordered sequence of stages. Early stages usually have more observations with general information (easier/cheaper to collect), while later stages have fewer observations but more specific data. This situation can be represented by a dual funnel structure, in which the sample size decreases from one stage to the other while the information increases. Training classifiers in this scenario is challenging since information in the early stages may not contain distinct patterns to learn (underfitting). In contrast, the small sample size in later stages can cause overfitting. We address both cases by introducing a framework that combines adversarial autoencoders (AAE), multi-task learning (MTL), and multi-label semi-supervised learning (MLSSL). We improve the decoder of the AAE with an MTL component so it can jointly reconstruct the original input and use feature nets to predict the features for the next stages. We also introduce a sequence constraint in the output of an MLSSL classifier to guarantee the sequential pattern in the predictions. Using real-world data from different domains (selection process, medical diagnosis), we show that our approach outperforms other state-of-the-art methods.

What Nodes Vote To? Graph Classification without Readout Phase

Yuxing Tian, Zheng Liu, Weiding Liu, Zeyu Zhang, Yanwen Qu

Responsive image

Auto-TLDR; node voting based graph classification with convolutional operator

Slides Poster Similar

In recent years, many researchers have started to construct Graph Neural Networks (GNNs) to deal with graph classification task. Those GNNs can fit into a framework named Message Passing Neural Networks (MPNNs), which consists of two phases: a Message Passing phase used for updating node embeddings and a Readout phase. In Readout phase, node embeddings are aggregated to extract graph feature used for classification. However, the above operation may obscure the affect of the node embedding of each node on graph classification. Therefore, a node voting based graph classification model is proposed in this paper, called Node Voting net (NVnet). Similar to the MPNNs, NVnet also contains the Message Passing phase. The main differences between NVnet and MPNNs are: 1, a decoder for graph reconstruction is added to NVnet to make node embeddings contain as much graph structure information as possible; 2, NVnet replaces the Readout phase with a new phase called Node Voting phase. In the Node Voting phase, an attention layer based on the gate mechanism is constructed to help each node observe the node embeddings of other nodes in the graph, and each node predicts the graph class from its own perspective. The above process is called node voting. After voting, the results of all nodes are aggregated to get the final graph classification result. In addition, considering that aggregation operation may also obscure the difference between node voting results, our solution is to add a regularization term to drive node voting results to reach group consensus. We evaluate the performance of the NVnet on 4 benchmark datasets. The experimental results show that compared with other 10 baselines, NVnet can achieve higher graph classification accuracy on datasets by using appropriate convolutional operator.

Soft Label and Discriminant Embedding Estimation for Semi-Supervised Classification

Fadi Dornaika, Abdullah Baradaaji, Youssof El Traboulsi

Responsive image

Auto-TLDR; Semi-supervised Semi-Supervised Learning for Linear Feature Extraction and Label Propagation

Slides Poster Similar

In recent times, graph-based semi-supervised learning proved to be a powerful paradigm for processing and mining large datasets. The main advantage relies on the fact that these methods can be useful in propagating a small set of known labels to a large set of unlabeled data. The scarcity of labeled data may affect the performance of the semi-learning. This paper introduces a new semi-supervised framework for simultaneous linear feature extraction and label propagation. The proposed method simultaneously estimates a discriminant transformation and the unknown label by exploiting both labeled and unlabeled data. In addition, the unknowns of the learning model are estimated by integrating two types of graph-based smoothness constraints. The resulting semi-supervised model is expected to learn more discriminative information. Experiments are conducted on six public image datasets. These experimental results show that the performance of the proposed method can be better than that of many state-of-the-art graph-based semi-supervised algorithms.

Label Incorporated Graph Neural Networks for Text Classification

Yuan Xin, Linli Xu, Junliang Guo, Jiquan Li, Xin Sheng, Yuanyuan Zhou

Responsive image

Auto-TLDR; Graph Neural Networks for Semi-supervised Text Classification

Slides Poster Similar

Graph Neural Networks (GNNs) have achieved great success on graph-structured data, and their applications on traditional data structures such as natural language processing and semi-supervised text classification have been extensively explored in recent years. While previous works only consider the text information while building the graph, heterogeneous information such as labels is ignored. In this paper, we consider to incorporate the label information while building the graph by adding text-label-text paths, through which the supervision information will propagate among the graph more directly. Specifically, we treat labels as nodes in the graph which also contains text and word nodes, and then connect labels with texts belonging to that label. Through graph convolutions, label embeddings are jointly learned with text embeddings in the same latent semantic space. The newly incorporated label nodes will facilitate learning more accurate text embeddings by introducing the label information, and thus benefit the downstream text classification tasks. Extensive results on several benchmark datasets show that the proposed framework outperforms baseline methods by a significant margin.

Boundary-Aware Graph Convolution for Semantic Segmentation

Hanzhe Hu, Jinshi Cui, Jinshi Hongbin Zha

Responsive image

Auto-TLDR; Boundary-Aware Graph Convolution for Semantic Segmentation

Slides Poster Similar

Recent works have made great progress in semantic segmentation by exploiting contextual information in a local or global manner with dilated convolutions, pyramid pooling or self-attention mechanism. However, few works have focused on harvesting boundary information to improve the segmentation performance. In order to enhance the feature similarity within the object and keep discrimination from other objects, we propose a boundary-aware graph convolution (BGC) module to propagate features within the object. The graph reasoning is performed among pixels of the same object apart from the boundary pixels. Based on the proposed BGC module, we further introduce the Boundary-aware Graph Convolution Network(BGCNet), which consists of two main components including a basic segmentation network and the BGC module, forming a coarse-to-fine paradigm. Specifically, the BGC module takes the coarse segmentation feature map as node features and boundary prediction to guide graph construction. After graph convolution, the reasoned feature and the input feature are fused together to get the refined feature, producing the refined segmentation result. We conduct extensive experiments on three popular semantic segmentation benchmarks including Cityscapes, PASCAL VOC 2012 and COCO Stuff, and achieve state-of-the-art performance on all three benchmarks.

Adaptive Word Embedding Module for Semantic Reasoning in Large-Scale Detection

Yu Zhang, Xiaoyu Wu, Ruolin Zhu

Responsive image

Auto-TLDR; Adaptive Word Embedding Module for Object Detection

Slides Poster Similar

In recent years, convolutional neural networks have achieved rapid development in the field of object detection. However, due to the imbalance of data, high costs in labor and uneven level of data labeling, the overall performance of the previous detection network has dropped sharply when dataset extended to the large-scale with hundreds and thousands categories. We present the Adaptive Word Embedding Module, extracting the adaptive semantic knowledge graph to reach semantic consistency within one image. Our method endows the ability to infer global semantic of detection networks without other attribute or relationship annotations. Compared with Faster RCNN, the algorithm on the MSCOCO dataset was significantly improved by 4.1%, and the mAP value has reached 32.8%. On the VG1000 dataset, it increased by 0.9% to 6.7% compared with Faster RCNN. Adaptive Word Embedding Module is lightweight, general-purpose and can be plugged into diverse detection networks. Code will be made available.

Feature-Aware Unsupervised Learning with Joint Variational Attention and Automatic Clustering

Wang Ru, Lin Li, Peipei Wang, Liu Peiyu

Responsive image

Auto-TLDR; Deep Variational Attention Encoder-Decoder for Clustering

Slides Poster Similar

Deep clustering aims to cluster unlabeled real-world samples by mining deep feature representation. Most of existing methods remain challenging when handling high-dimensional data and simultaneously exploring the complementarity of deep feature representation and clustering. In this paper, we propose a novel Deep Variational Attention Encoder-decoder for Clustering (DVAEC). Our DVAEC improves the representation learning ability by fusing variational attention. Specifically, we design a feature-aware automatic clustering module to mitigate the unreliability of similarity calculation and guide network learning. Besides, to further boost the performance of deep clustering from a global perspective, we define a joint optimization objective to promote feature representation learning and automatic clustering synergistically. Extensive experimental results show the promising performance achieved by our DVAEC on six datasets comparing with several popular baseline clustering methods.

Semantic Bilinear Pooling for Fine-Grained Recognition

Xinjie Li, Chun Yang, Song-Lu Chen, Chao Zhu, Xu-Cheng Yin

Responsive image

Auto-TLDR; Semantic bilinear pooling for fine-grained recognition with hierarchical label tree

Slides Poster Similar

Naturally, fine-grained recognition, e.g., vehicle identification or bird classification, has specific hierarchical labels, where fine categories are always harder to be classified than coarse categories. However, most of the recent deep learning based methods neglect the semantic structure of fine-grained objects and do not take advantage of the traditional fine-grained recognition techniques (e.g. coarse-to-fine classification). In this paper, we propose a novel framework with a two-branch network (coarse branch and fine branch), i.e., semantic bilinear pooling, for fine-grained recognition with a hierarchical label tree. This framework can adaptively learn the semantic information from the hierarchical levels. Specifically, we design a generalized cross-entropy loss for the training of the proposed framework to fully exploit the semantic priors via considering the relevance between adjacent levels and enlarge the distance between samples of different coarse classes. Furthermore, our method leverages only the fine branch when testing so that it adds no overhead to the testing time. Experimental results show that our proposed method achieves state-of-the-art performance on four public datasets.

Local Clustering with Mean Teacher for Semi-Supervised Learning

Zexi Chen, Benjamin Dutton, Bharathkumar Ramachandra, Tianfu Wu, Ranga Raju Vatsavai

Responsive image

Auto-TLDR; Local Clustering for Semi-supervised Learning

Slides Similar

The Mean Teacher (MT) model of Tarvainen and Valpola has shown favorable performance on several semi-supervised benchmark datasets. MT maintains a teacher model's weights as the exponential moving average of a student model's weights and minimizes the divergence between their probability predictions under diverse perturbations of the inputs. However, MT is known to suffer from confirmation bias, that is, reinforcing incorrect teacher model predictions. In this work, we propose a simple yet effective method called Local Clustering (LC) to mitigate the effect of confirmation bias. In MT, each data point is considered independent of other points during training; however, data points are likely to be close to each other in feature space if they share similar features. Motivated by this, we cluster data points locally by minimizing the pairwise distance between neighboring data points in feature space. Combined with a standard classification cross-entropy objective on labeled data points, the misclassified unlabeled data points are pulled towards high-density regions of their correct class with the help of their neighbors, thus improving model performance. We demonstrate on semi-supervised benchmark datasets SVHN and CIFAR-10 that adding our LC loss to MT yields significant improvements compared to MT and performance comparable to the state of the art in semi-supervised learning.

Epitomic Variational Graph Autoencoder

Rayyan Ahmad Khan, Muhammad Umer Anwaar, Martin Kleinsteuber

Responsive image

Auto-TLDR; EVGAE: A Generative Variational Autoencoder for Graph Data

Slides Poster Similar

Variational autoencoder (VAE) is a widely used generative model for learning latent representations. Burda et al. in their seminal paper showed that learning capacity of VAE is limited by over-pruning. It is a phenomenon where a significant number of latent variables fail to capture any information about the input data and the corresponding hidden units become inactive. This adversely affects learning diverse and interpretable latent representations. As variational graph autoencoder (VGAE) extends VAE for graph-structured data, it inherits the over-pruning problem. In this paper, we adopt a model based approach and propose epitomic VGAE (EVGAE),a generative variational framework for graph datasets which successfully mitigates the over-pruning problem and also boosts the generative ability of VGAE. We consider EVGAE to consist of multiple sparse VGAE models, called epitomes, that are groups of latent variables sharing the latent space. This approach aids in increasing active units as epitomes compete to learn better representation of the graph data. We verify our claims via experiments on three benchmark datasets. Our experiments show that EVGAE has a better generative ability than VGAE. Moreover, EVGAE outperforms VGAE on link prediction task in citation networks

Revisiting Graph Neural Networks: Graph Filtering Perspective

Hoang Nguyen-Thai, Takanori Maehara, Tsuyoshi Murata

Responsive image

Auto-TLDR; Two-Layers Graph Convolutional Network with Graph Filters Neural Network

Slides Poster Similar

In this work, we develop quantitative results to the learnability of a two-layers Graph Convolutional Network (GCN). Instead of analyzing GCN under some classes of functions, our approach provides a quantitative gap between a two-layers GCN and a two-layers MLP model. From the graph signal processing perspective, we provide useful insights to some flaws of graph neural networks for vertex classification. We empirically demonstrate a few cases when GCN and other state-of-the-art models cannot learn even when true vertex features are extremely low-dimensional. To demonstrate our theoretical findings and propose a solution to the aforementioned adversarial cases, we build a proof of concept graph neural network model with different filters named Graph Filters Neural Network (gfNN).

Meta Generalized Network for Few-Shot Classification

Wei Wu, Shanmin Pang, Zhiqiang Tian, Yaochen Li

Responsive image

Auto-TLDR; Meta Generalized Network for Few-Shot Classification

Similar

Few-shot classification aims to learn a well performance model with very limited labeled examples. There are mainly two directions for this aim, namely, meta- and metric-learning. Meta learning trains models in a particular way to fast adapt to new tasks, but it neglects variational features of images. Metric learning considers relationships among same or different classes, however on the downside, it usually fails to achieve competitive performance on unseen boundary examples. In this paper, we propose a Meta Generalized Network (MGNet) that aims to combine advantages of both meta- and metric-learning. There are two novel components in MGNet. Specifically, we first develop a meta backbone training method that both learns a flexible feature extractor and a classifier initializer efficiently, delightedly leading to fast adaption to unseen few-shot tasks without overfitting. Second, we design a trainable adaptive interval model to improve the cosine classifier, which increases the recognition accuracy of hard examples. We train the meta backbone in the training stage by all classes, and fine-tune the meta-backbone as well as train the adaptive classifier in the testing stage.

Pretraining Image Encoders without Reconstruction Via Feature Prediction Loss

Gustav Grund Pihlgren, Fredrik Sandin, Marcus Liwicki

Responsive image

Auto-TLDR; Feature Prediction Loss for Autoencoder-based Pretraining of Image Encoders

Similar

This work investigates three methods for calculating loss for autoencoder-based pretraining of image encoders: The commonly used reconstruction loss, the more recently introduced deep perceptual similarity loss, and a feature prediction loss proposed here; the latter turning out to be the most efficient choice. Standard auto-encoder pretraining for deep learning tasks is done by comparing the input image and the reconstructed image. Recent work shows that predictions based on embeddings generated by image autoencoders can be improved by training with perceptual loss, i.e., by adding a loss network after the decoding step. So far the autoencoders trained with loss networks implemented an explicit comparison of the original and reconstructed images using the loss network. However, given such a loss network we show that there is no need for the time-consuming task of decoding the entire image. Instead, we propose to decode the features of the loss network, hence the name ``feature prediction loss''. To evaluate this method we perform experiments on three standard publicly available datasets (LunarLander-v2, STL-10, and SVHN) and compare six different procedures for training image encoders (pixel-wise, perceptual similarity, and feature prediction losses; combined with two variations of image and feature encoding/decoding). The embedding-based prediction results show that encoders trained with feature prediction loss is as good or better than those trained with the other two losses. Additionally, the encoder is significantly faster to train using feature prediction loss in comparison to the other losses. The method implementation used in this work is available online: https://github.com/guspih/Perceptual-Autoencoders

Image Representation Learning by Transformation Regression

Xifeng Guo, Jiyuan Liu, Sihang Zhou, En Zhu, Shihao Dong

Responsive image

Auto-TLDR; Self-supervised Image Representation Learning using Continuous Parameter Prediction

Slides Poster Similar

Self-supervised learning is a thriving research direction since it can relieve the burden of human labeling for machine learning by seeking for supervision from data instead of human annotation. Although demonstrating promising performance in various applications, we observe that the existing methods usually model the auxiliary learning tasks as classification tasks with finite discrete labels, leading to insufficient supervisory signals, which in turn restricts the representation quality. In this paper, to solve the above problem and make full use of the supervision from data, we design a regression model to predict the continuous parameters of a group of transformations, i.e., image rotation, translation, and scaling. Surprisingly, this naive modification stimulates tremendous potential from data and the resulting supervisory signal has largely improved the performance of image representation learning. Extensive experiments on four image datasets, including CIFAR10, CIFAR100, STL10, and SVHN, indicate that our proposed algorithm outperforms the state-of-the-art unsupervised learning methods by a large margin in terms of classification accuracy. Crucially, we find that with our proposed training mechanism as an initialization, the performance of the existing state-of-the-art classification deep architectures can be preferably improved.

Siamese Graph Convolution Network for Face Sketch Recognition

Liang Fan, Xianfang Sun, Paul Rosin

Responsive image

Auto-TLDR; A novel Siamese graph convolution network for face sketch recognition

Slides Poster Similar

In this paper, we present a novel Siamese graph convolution network (GCN) for face sketch recognition. To build a graph from an image, we utilize a deep learning method to detect the image edges, and then use a superpixel method to segment the edge image. Each segmented superpixel region is taken as a node, and each pair of adjacent regions forms an edge of the graph. Graphs from both a face sketch and a face photo are input into the Siamese GCN for recognition. A deep graph matching method is used to share messages between cross-modal graphs in this model. Experiments show that the GCN can obtain high performance on several face photo-sketch datasets, including seen and unseen face photo-sketch datasets. It is also shown that the model performance based on the graph structure representation of the data using the Siamese GCN is more stable than a Siamese CNN model.

Kernel-based Graph Convolutional Networks

Hichem Sahbi

Responsive image

Auto-TLDR; Spatial Graph Convolutional Networks in Recurrent Kernel Hilbert Space

Slides Poster Similar

Learning graph convolutional networks (GCNs) is an emerging field which aims at generalizing deep learning to arbitrary non-regular domains. Most of the existing GCNs follow a neighborhood aggregation scheme, where the representation of a node is recursively obtained by aggregating its neighboring node representations using averaging or sorting operations. However, these operations are either ill-posed or weak to be discriminant or increase the number of training parameters and thereby the computational complexity and the risk of overfitting. In this paper, we introduce a novel GCN framework that achieves spatial graph convolution in a reproducing kernel Hilbert space. The latter makes it possible to design, via implicit kernel representations, convolutional graph filters in a high dimensional and more discriminating space without increasing the number of training parameters. The particularity of our GCN model also resides in its ability to achieve convolutions without explicitly realigning nodes in the receptive fields of the learned graph filters with those of the input graphs, thereby making convolutions permutation agnostic and well defined. Experiments conducted on the challenging task of skeleton-based action recognition show the superiority of the proposed method against different baselines as well as the related work.

Image Inpainting with Contrastive Relation Network

Xiaoqiang Zhou, Junjie Li, Zilei Wang, Ran He, Tieniu Tan

Responsive image

Auto-TLDR; Two-Stage Inpainting with Graph-based Relation Network

Slides Similar

Image inpainting faces the challenging issue of the requirements on structure reasonableness and texture coherence. In this paper, we propose a two-stage inpainting framework to address this issue. The basic idea is to address the two requirements in two separate stages. Completed segmentation of the corrupted image is firstly predicted through segmentation reconstruction network, while fine-grained image details are restored in the second stage through an image generator. The two stages are connected in series as the image details are generated under the guidance of completed segmentation map that predicted in the first stage. Specifically, in the second stage, we propose a novel graph-based relation network to model the relationship existed in corrupted image. In relation network, both intra-relationship for pixels in the same semantic region and inter-relationship between different semantic parts are considered, improving the consistency and compatibility of image textures. Besides, contrastive loss is designed to facilitate the relation network training. Such a framework not only simplifies the inpainting problem directly, but also exploits the relationship in corrupted image explicitly. Extensive experiments on various public datasets quantitatively and qualitatively demonstrate the superiority of our approach compared with the state-of-the-art.