Learning Low-Shot Generative Networks for Cross-Domain Data

Hsuan-Kai Kao, Cheng-Che Lee, Wei-Chen Chiu

Responsive image

Auto-TLDR; Learning Generators for Cross-Domain Data under Low-Shot Learning

Slides Poster

We tackle a novel problem of learning generators for cross-domain data under a specific scenario of low-shot learning. Basically, given a source domain with sufficient amount of training data, we aim to transfer the knowledge of its generative process to another target domain, which not only has few data samples but also contains the domain shift with respect to the source domain. This problem has great potential in practical use and is different from the well-known image translation task, as the target-domain data can be generated without requiring any source-domain ones and the large data consumption for learning target-domain generator can be alleviated. Built upon a cross-domain dataset where (1) each of the low shots in the target domain has its correspondence in the source and (2) these two domains share the similar content information but different appearance, two approaches are proposed: a Latent-Disentanglement-Orientated model (LaDo) and a Generative-Hierarchy-Oriented (GenHo) model. Our LaDo and GenHo approaches address the problem from different perspectives, where the former relies on learning the disentangled representation composed of domain-invariant content features and domain-specific appearance ones; while the later decomposes the generative process of a generator into two parts for synthesizing the content and appearance sequentially. We perform extensive experiments under various settings of cross-domain data and show the efficacy of our models for generating target-domain data with the abundant content variance as in the source domain, which lead to the favourable performance in comparison to several baselines.

Similar papers

Unsupervised Multi-Task Domain Adaptation

Shih-Min Yang, Mei-Chen Yeh

Responsive image

Auto-TLDR; Unsupervised Domain Adaptation with Multi-task Learning for Image Recognition

Slides Poster Similar

With abundant labeled data, deep convolutional neural networks have shown great success in various image recognition tasks. However, these models are often less powerful when applied to novel datasets due to a phenomenon known as domain shift. Unsupervised domain adaptation methods aim to address this problem, allowing deep models trained on the labeled source domain to be used on a different target domain (without labels). In this paper, we investigate whether the generalization ability of an unsupervised domain adaptation method can be improved through multi-task learning, with learned features required to be both domain invariant and discriminative for multiple different but relevant tasks. Experiments evaluating two fundamental recognition tasks---including image recognition and segmentation--- show that the generalization ability empowered by multi-task learning may not benefit recognition when the model is directly applied on the target domain, but the multi-task setting can boost the performance of state-of-the-art unsupervised domain adaptation methods by a non-negligible margin.

Multi-Domain Image-To-Image Translation with Adaptive Inference Graph

The Phuc Nguyen, Stéphane Lathuiliere, Elisa Ricci

Responsive image

Auto-TLDR; Adaptive Graph Structure for Multi-Domain Image-to-Image Translation

Slides Poster Similar

In this work, we address the problem of multi-domain image-to-image translation with particular attention paid to computational cost. In particular, current state of the art models require a large and deep model in order to handle the visual diversity of multiple domains. In a context of limited computational resources, increasing the network size may not be possible. Therefore, we propose to increase the network capacity by using an adaptive graph structure. At inference time, the network estimates its own graph by selecting specific sub-networks. Sub-network selection is implemented using Gumble-Softmax in order to allow end-to-end training. This approach leads to an adjustable increase in number of parameters while preserving an almost constant computational cost. Our evaluation on two publicly available datasets of facial and painting images shows that our adaptive strategy generates better images with fewer artifacts than literature methods.

Local Facial Attribute Transfer through Inpainting

Ricard Durall, Franz-Josef Pfreundt, Janis Keuper

Responsive image

Auto-TLDR; Attribute Transfer Inpainting Generative Adversarial Network

Slides Poster Similar

The term attribute transfer refers to the tasks of altering images in such a way, that the semantic interpretation of a given input image is shifted towards an intended direction, which is quantified by semantic attributes. Prominent example applications are photo realistic changes of facial features and expressions, like changing the hair color, adding a smile, enlarging the nose or altering the entire context of a scene, like transforming a summer landscape into a winter panorama. Recent advances in attribute transfer are mostly based on generative deep neural networks, using various techniques to manipulate images in the latent space of the generator. In this paper, we present a novel method for the common sub-task of local attribute transfers, where only parts of a face have to be altered in order to achieve semantic changes (e.g. removing a mustache). In contrast to previous methods, where such local changes have been implemented by generating new (global) images, we propose to formulate local attribute transfers as an inpainting problem. Removing and regenerating only parts of images, our Attribute Transfer Inpainting Generative Adversarial Network (ATI-GAN) is able to utilize local context information to focus on the attributes while keeping the background unmodified resulting in visually sound results.

Shape Consistent 2D Keypoint Estimation under Domain Shift

Levi Vasconcelos, Massimiliano Mancini, Davide Boscaini, Barbara Caputo, Elisa Ricci

Responsive image

Auto-TLDR; Deep Adaptation for Keypoint Prediction under Domain Shift

Slides Poster Similar

Recent unsupervised domain adaptation methods based on deep architectures have shown remarkable performance not only in traditional classification tasks but also in more complex problems involving structured predictions (e.g. semantic segmentation, depth estimation). Following this trend, in this paper we present a novel deep adaptation framework for estimating keypoints under \textit{domain shift}, i.e. when the training (\textit{source}) and the test (\textit{target}) images significantly differ in terms of visual appearance. Our method seamlessly combines three different components: feature alignment, adversarial training and self-supervision. Specifically, our deep architecture leverages from domain-specific distribution alignment layers to perform target adaptation at the feature level. Furthermore, a novel loss is proposed which combines an adversarial term for ensuring aligned predictions in the output space and a geometric consistency term which guarantees coherent predictions between a target sample and its perturbed version. Our extensive experimental evaluation conducted on three publicly available benchmarks shows that our approach outperforms state-of-the-art domain adaptation methods in the 2D keypoint prediction task.

AVAE: Adversarial Variational Auto Encoder

Antoine Plumerault, Hervé Le Borgne, Celine Hudelot

Responsive image

Auto-TLDR; Combining VAE and GAN for Realistic Image Generation

Slides Poster Similar

Among the wide variety of image generative models, two models stand out: Variational Auto Encoders (VAE) and Generative Adversarial Networks (GAN). GANs can produce realistic images, but they suffer from mode collapse and do not provide simple ways to get the latent representation of an image. On the other hand, VAEs do not have these problems, but they often generate images less realistic than GANs. In this article, we explain that this lack of realism is partially due to a common underestimation of the natural image manifold dimensionality. To solve this issue we introduce a new framework that combines VAE and GAN in a novel and complementary way to produce an auto-encoding model that keeps VAEs properties while generating images of GAN-quality. We evaluate our approach both qualitatively and quantitatively on five image datasets.

Cross-Domain Semantic Segmentation of Urban Scenes Via Multi-Level Feature Alignment

Bin Zhang, Shengjie Zhao, Rongqing Zhang

Responsive image

Auto-TLDR; Cross-Domain Semantic Segmentation Using Generative Adversarial Networks

Slides Poster Similar

Semantic segmentation is an essential task in plenty of real-life applications such as virtual reality, video analysis, autonomous driving, etc. Recent advancements in fundamental vision-based tasks ranging from image classification to semantic segmentation have demonstrated deep learning-based models' high capability in learning complicated representation on large datasets. Nevertheless, manually labeling semantic segmentation dataset with pixel-level annotation is extremely labor-intensive. To address this problem, we propose a novel multi-level feature alignment framework for cross-domain semantic segmentation of urban scenes by exploiting generative adversarial networks. In the proposed multi-level feature alignment method, we first translate images from one domain to another one. Then the discriminative feature representations extracted by the deep neural network are concatenated, followed by domain adversarial learning to make the intermediate feature distribution of the target domain images close to those in the source domain. With these domain adaptation techniques, models trained with images in the source domain where the labels are easy to acquire can be deployed to the target domain where the labels are scarce. Experimental evaluations on various mainstream benchmarks confirm the effectiveness as well as robustness of our approach.

Dual-MTGAN: Stochastic and Deterministic Motion Transfer for Image-To-Video Synthesis

Fu-En Yang, Jing-Cheng Chang, Yuan-Hao Lee, Yu-Chiang Frank Wang

Responsive image

Auto-TLDR; Dual Motion Transfer GAN for Convolutional Neural Networks

Slides Poster Similar

Generating videos with content and motion variations is a challenging task in computer vision. While the recent development of GAN allows video generation from latent representations, it is not easy to produce videos with particular content of motion patterns of interest. In this paper, we propose Dual Motion Transfer GAN (Dual-MTGAN), which takes image and video data as inputs while learning disentangled content and motion representations. Our Dual-MTGAN is able to perform deterministic motion transfer and stochastic motion generation. Based on a given image, the former preserves the input content and transfers motion patterns observed from another video sequence, and the latter directly produces videos with plausible yet diverse motion patterns based on the input image. The proposed model is trained in an end-to-end manner, without the need to utilize pre-defined motion features like pose or facial landmarks. Our quantitative and qualitative results would confirm the effectiveness and robustness of our model in addressing such conditioned image-to-video tasks.

A Simple Domain Shifting Network for Generating Low Quality Images

Guruprasad Hegde, Avinash Nittur Ramesh, Kanchana Vaishnavi Gandikota, Michael Möller, Roman Obermaisser

Responsive image

Auto-TLDR; Robotic Image Classification Using Quality degrading networks

Slides Poster Similar

Deep Learning systems have proven to be extremely successful for image recognition tasks for which significant amounts of training data is available, e.g., on the famous ImageNet dataset. We demonstrate that for robotics applications with cheap camera equipment, the low image quality, however, influences the classification accuracy, and freely available data bases cannot be exploited in a straight forward way to train classifiers to be used on a robot. As a solution we propose to train a network on degrading the quality images in order to mimic specific low quality imaging systems. Numerical experiments demonstrate that classification networks trained by using images produced by our quality degrading network along with the high quality images outperform classification networks trained only on high quality data when used on a real robot system, while being significantly easier to use than competing zero-shot domain adaptation techniques.

High Resolution Face Age Editing

Xu Yao, Gilles Puy, Alasdair Newson, Yann Gousseau, Pierre Hellier

Responsive image

Auto-TLDR; An Encoder-Decoder Architecture for Face Age editing on High Resolution Images

Slides Poster Similar

Face age editing has become a crucial task in film post-production, and is also becoming popular for general purpose photography. Recently, adversarial training has produced some of the most visually impressive results for image manipulation, including the face aging/de-aging task. In spite of considerable progress, current methods often present visual artifacts and can only deal with low-resolution images. In order to achieve aging/de-aging with the high quality and robustness necessary for wider use, these problems need to be addressed. This is the goal of the present work. We present an encoder-decoder architecture for face age editing. The core idea of our network is to encode a face image to age-invariant features, and learn a modulation vector corresponding to a target age. We then combine these two elements to produce a realistic image of the person with the desired target age. Our architecture is greatly simplified with respect to other approaches, and allows for fine-grained age editing on high resolution images in a single unified model. Source codes are available at https://github.com/InterDigitalInc/HRFAE.

Unsupervised Domain Adaptation with Multiple Domain Discriminators and Adaptive Self-Training

Teo Spadotto, Marco Toldo, Umberto Michieli, Pietro Zanuttigh

Responsive image

Auto-TLDR; Unsupervised Domain Adaptation for Semantic Segmentation of Urban Scenes

Slides Poster Similar

Unsupervised Domain Adaptation (UDA) aims at improving the generalization capability of a model trained on a source domain to perform well on a target domain for which no labeled data is available. In this paper, we consider the semantic segmentation of urban scenes and we propose an approach to adapt a deep neural network trained on synthetic data to real scenes addressing the domain shift between the two different data distributions. We introduce a novel UDA framework where a standard supervised loss on labeled synthetic data is supported by an adversarial module and a self-training strategy aiming at aligning the two domain distributions. The adversarial module is driven by a couple of fully convolutional discriminators dealing with different domains: the first discriminates between ground truth and generated maps, while the second between segmentation maps coming from synthetic or real world data. The self-training module exploits the confidence estimated by the discriminators on unlabeled data to select the regions used to reinforce the learning process. Furthermore, the confidence is thresholded with an adaptive mechanism based on the per-class overall confidence. Experimental results prove the effectiveness of the proposed strategy in adapting a segmentation network trained on synthetic datasets like GTA5 and SYNTHIA, to real world datasets like Cityscapes and Mapillary.

Learning Disentangled Representations for Identity Preserving Surveillance Face Camouflage

Jingzhi Li, Lutong Han, Hua Zhang, Xiaoguang Han, Jingguo Ge, Xiaochu Cao

Responsive image

Auto-TLDR; Individual Face Privacy under Surveillance Scenario with Multi-task Loss Function

Poster Similar

In this paper, we focus on protecting the person face privacy under the surveillance scenarios, whose goal is to change the visual appearances of faces while keep them to be recognizable by current face recognition systems. This is a challenging problem as that we should retain the most important structures of captured facial images, while alter the salient facial regions to protect personal privacy. To address this problem, we introduce a novel individual face protection model, which can camouflage the face appearance from the perspective of human visual perception and preserve the identity features of faces used for face authentication. To that end, we develop an encoder-decoder network architecture that can separately disentangle the person feature representation into an appearance code and an identity code. Specifically, we first randomly divide the face image into two groups, the source set and the target set, where the source set is used to extract the identity code and the target set provides the appearance code. Then, we recombine the identity and appearance codes to synthesize a new face, which has the same identity with the source subject. Finally, the synthesized faces are used to replace the original face to protect the privacy of individual. Furthermore, our model is trained end-to-end with a multi-task loss function, which can better preserve the identity and stabilize the training loss. Experiments conducted on Cross-Age Celebrity dataset demonstrate the effectiveness of our model and validate our superiority in terms of visual quality and scalability.

Rethinking Domain Generalization Baselines

Francesco Cappio Borlino, Antonio D'Innocente, Tatiana Tommasi

Responsive image

Auto-TLDR; Style Transfer Data Augmentation for Domain Generalization

Slides Poster Similar

Despite being very powerful in standard learning settings, deep learning models can be extremely brittle when deployed in scenarios different from those on which they were trained. Domain generalization methods investigate this problem and data augmentation strategies have shown to be helpful tools to increase data variability, supporting model robustness across domains. In our work we focus on style transfer data augmentation and we present how it can be implemented with a simple and inexpensive strategy to improve generalization. Moreover, we analyze the behavior of current state of the art domain generalization methods when integrated with this augmentation solution: our thorough experimental evaluation shows that their original effect almost always disappears with respect to the augmented baseline. This issue open new scenarios for domain generalization research, highlighting the need of novel methods properly able to take advantage of the introduced data variability.

Unsupervised Contrastive Photo-To-Caricature Translation Based on Auto-Distortion

Yuhe Ding, Xin Ma, Mandi Luo, Aihua Zheng, Ran He

Responsive image

Auto-TLDR; Unsupervised contrastive photo-to-caricature translation with style loss

Slides Poster Similar

Photo-to-caricature aims to synthesize the caricature as a rendered image exaggerating the features through sketching, pencil strokes, or other artistic drawings. Style rendering and geometry deformation are the most important aspects in photo-to-caricature translation task. To take both into consideration, we propose an unsupervised contrastive photo-to-caricature translation architecture. Considering the intuitive artifacts in the existing methods, we propose a contrastive style loss for style rendering to enforce the similarity between the style of rendered photo and the caricature, and simultaneously enhance its discrepancy to the photos. To obtain an exaggerating deformation in an unpaired/unsupervised fashion, we propose a Distortion Prediction Module (DPM) to predict a set of displacements vectors for each input image while fixing some controlling points, followed by the thin plate spline interpolation for warping. The model is trained on unpaired photo and caricature while can offer bidirectional synthesizing via inputting either a photo or a caricature. Extensive experiments demonstrate that the proposed model is effective to generate hand-drawn like caricatures compared with existing competitors.

GAP: Quantifying the Generative Adversarial Set and Class Feature Applicability of Deep Neural Networks

Edward Collier, Supratik Mukhopadhyay

Responsive image

Auto-TLDR; Approximating Adversarial Learning in Deep Neural Networks Using Set and Class Adversaries

Slides Poster Similar

Recent work in deep neural networks has sought to characterize the nature in which a network learns features and how applicable learnt features are to various problem sets. Deep neural network applicability can be split into three sub-problems; set applicability, class applicability, and instance applicability. In this work we seek to quantify the applicability of features learned during adversarial training, focusing specifically on set and class applicability. We apply techniques for measuring applicability to both generators and discriminators trained on various data sets to quantify applicability and better observe how both a generator and a discriminator, and generative models as a whole, learn features during adversarial training.

Unsupervised Face Manipulation Via Hallucination

Keerthy Kusumam, Enrique Sanchez, Georgios Tzimiropoulos

Responsive image

Auto-TLDR; Unpaired Face Image Manipulation using Autoencoders

Slides Poster Similar

This paper addresses the problem of manipulatinga face image in terms of changing its pose. To achieve this, wepropose a new method that can be trained under the very general“unpaired” setting. To this end, we firstly propose to modelthe general appearance, layout and background of the inputimage using a low-resolution version of it which is progressivelypassed through a hallucination network to generate featuresat higher resolutions. We show that such a formulation issignificantly simpler than previous approaches for appearancemodelling based on autoencoders. Secondly, we propose a fullylearnable and spatially-aware appearance transfer module whichcan cope with misalignment between the input source image andthe target pose and can effectively combine the features fromthe hallucination network with the features produced by ourgenerator. Thirdly, we introduce an identity preserving methodthat is trained in an unsupervised way, by using an auxiliaryfeature extractor and a contrastive loss between the real andgenerated images. We compare our method against the state-of-the-art reporting significant improvements both quantitatively, interms of FID and IS, and qualitatively.

Randomized Transferable Machine

Pengfei Wei, Tze Yun Leong

Responsive image

Auto-TLDR; Randomized Transferable Machine for Suboptimal Feature-based Transfer Learning

Slides Poster Similar

Feature-based transfer method is one of the most effective methodologies for transfer learning. Existing works usually claim the learned new feature representation is truly \emph{domain-invariant}, and thus directly train a transfer model $\mathcal{M}$ on source domain. In this paper, we work on a more realistic scenario where the new feature representation is suboptimal where small divergence still exists across domains. We propose a new learning strategy and name the transfer model following the learning strategy as Randomized Transferable Machine (RTM). More specifically, we work on source data with the new feature representation learned from existing feature-based transfer methods. Our key idea is to enlarge source training data populations by randomly corrupting source data using some noises, and then train a transfer model $\widetilde{\mathcal{M}}$ performing well on all these corrupted source data populations. In principle, the more corruptions are made, the higher probability of the target data can be covered by the constructed source populations and thus a better transfer performance can be achieved by $\widetilde{\mathcal{M}}$. An ideal case is with infinite corruptions, which however is infeasible in reality. We instead develop a marginalized solution. With a marginalization trick, we can train an RTM that is equivalently trained using infinite source noisy populations without truly conducting any corruption. More importantly, such an RTM has a closed-form solution, which enables a super fast and efficient training. Extensive experiments on various real-world transfer tasks show that RTM is a very promising transfer model.

Text Recognition in Real Scenarios with a Few Labeled Samples

Jinghuang Lin, Cheng Zhanzhan, Fan Bai, Yi Niu, Shiliang Pu, Shuigeng Zhou

Responsive image

Auto-TLDR; Few-shot Adversarial Sequence Domain Adaptation for Scene Text Recognition

Slides Poster Similar

Scene text recognition (STR) is still a hot research topic in computer vision field due to its various applications. Existing works mainly focus on learning a general model with a huge number of synthetic text images to recognize unconstrained scene texts, and have achieved substantial progress. However, these methods are not quite applicable in many real-world scenarios where 1) high recognition accuracy is required, while 2) labeled samples are lacked. To tackle this challenging problem, this paper proposes a few-shot adversarial sequence domain adaptation (FASDA) approach to build sequence adaptation between the synthetic source domain (with many synthetic labeled samples) and a specific target domain (with only some or a few real labeled samples). This is done by simultaneously learning each character’s feature representation with an attention mech- anism and establishing the corresponding character-level latent subspace with adversarial learning. Our approach can maximize the character-level confusion between the source domain and the target domain, thus achieves the sequence-level adaptation with even a small number of labeled samples in the target domain. Extensive experiments on various datasets show that our method significantly outperforms the finetuning scheme, and obtains comparable performance to the state-of-the-art STR methods.

Disentangled Representation Learning for Controllable Image Synthesis: An Information-Theoretic Perspective

Shichang Tang, Xu Zhou, Xuming He, Yi Ma

Responsive image

Auto-TLDR; Controllable Image Synthesis in Deep Generative Models using Variational Auto-Encoder

Slides Poster Similar

In this paper, we look into the problem of disentangled representation learning and controllable image synthesis in a deep generative model. We develop an encoder-decoder architecture for a variant of the Variational Auto-Encoder (VAE) with two latent codes $z_1$ and $z_2$. Our framework uses $z_2$ to capture specified factors of variation while $z_1$ captures the complementary factors of variation. To this end, we analyze the learning problem from the perspective of multivariate mutual information, derive optimizable lower bounds of the conditional mutual information in the image synthesis processes and incorporate them into the training objective. We validate our method empirically on the Color MNIST dataset and the CelebA dataset by showing controllable image syntheses. Our proposed paradigm is simple yet effective and is applicable to many situations, including those where there is not an explicit factorization of features available, or where the features are non-categorical.

Domain Generalized Person Re-Identification Via Cross-Domain Episodic Learning

Ci-Siang Lin, Yuan Chia Cheng, Yu-Chiang Frank Wang

Responsive image

Auto-TLDR; Domain-Invariant Person Re-identification with Episodic Learning

Slides Poster Similar

Aiming at recognizing images of the same person across distinct camera views, person re-identification (re-ID) has been among active research topics in computer vision. Most existing re-ID works require collection of a large amount of labeled image data from the scenes of interest. When the data to be recognized are different from the source-domain training ones, a number of domain adaptation approaches have been proposed. Nevertheless, one still needs to collect labeled or unlabelled target-domain data during training. In this paper, we tackle an even more challenging and practical setting, domain generalized (DG) person re-ID. That is, while a number of labeled source-domain datasets are available, we do not have access to any target-domain training data. In order to learn domain-invariant features without knowing the target domain of interest, we present an episodic learning scheme which advances meta learning strategies to exploit the observed source-domain labeled data. The learned features would exhibit sufficient domain-invariant properties while not overfitting the source-domain data or ID labels. Our experiments on four benchmark datasets confirm the superiority of our method over the state-of-the-arts.

Galaxy Image Translation with Semi-Supervised Noise-Reconstructed Generative Adversarial Networks

Qiufan Lin, Dominique Fouchez, Jérôme Pasquet

Responsive image

Auto-TLDR; Semi-supervised Image Translation with Generative Adversarial Networks Using Paired and Unpaired Images

Slides Poster Similar

Image-to-image translation with Deep Learning neural networks, particularly with Generative Adversarial Networks (GANs), is one of the most powerful methods for simulating astronomical images. However, current work is limited to utilizing paired images with supervised translation, and there has been rare discussion on reconstructing noise background that encodes instrumental and observational effects. These limitations might be harmful for subsequent scientific applications in astrophysics. Therefore, we aim to develop methods for using unpaired images and preserving noise characteristics in image translation. In this work, we propose a two-way image translation model using GANs that exploits both paired and unpaired images in a semi-supervised manner, and introduce a noise emulating module that is able to learn and reconstruct noise characterized by high-frequency features. By experimenting on multi-band galaxy images from the Sloan Digital Sky Survey (SDSS) and the Canada France Hawaii Telescope Legacy Survey (CFHT), we show that our method recovers global and local properties effectively and outperforms benchmark image translation models. To our best knowledge, this work is the first attempt to apply semi-supervised methods and noise reconstruction techniques in astrophysical studies.

Efficient Shadow Detection and Removal Using Synthetic Data with Domain Adaptation

Rui Guo, Babajide Ayinde, Hao Sun

Responsive image

Auto-TLDR; Shadow Detection and Removal with Domain Adaptation and Synthetic Image Database

Poster Similar

In recent years, learning based shadow detection and removal approaches have shown prospects and, in most cases, yielded state-of-the-art results. The performance of these approaches, however, relies heavily on the construction of training database of shadow images, shadow-free versions, and shadow maps as ground truth. This conventional data gathering method is time-consuming, expensive, or even practically intractable to realize especially for outdoor scenes with complicated shadow patterns, thus limiting the size of the data available for training. In this paper, we leverage on large high quality synthetic image database and domain adaptation to eliminate the bottlenecks resulting from insufficient training samples and domain bias. Specifically, our approach utilizes adversarial training to predict near-pixel-perfect shadow map from synthetic shadow image for downstream shadow removal steps. At inference time, we capitalize on domain adaptation via image style transfer to map the style of real- world scene to that of synthetic scene for the purpose of detecting and subsequently removing shadow. Comprehensive experiments indicate that our approach outperforms state-of-the-art methods on select benchmark datasets.

Class Conditional Alignment for Partial Domain Adaptation

Mohsen Kheirandishfard, Fariba Zohrizadeh, Farhad Kamangar

Responsive image

Auto-TLDR; Multi-class Adversarial Adaptation for Partial Domain Adaptation

Slides Poster Similar

Adversarial adaptation models have demonstrated significant progress towards transferring knowledge from a labeled source dataset to an unlabeled target dataset. Partial domain adaptation (PDA) investigates the scenarios in which the source domain is large and diverse, and the target label space is a subset of the source label space. The main purpose of PDA is to identify the shared classes between the domains and promote learning transferable knowledge from these classes. In this paper, we propose a multi-class adversarial architecture for PDA. The proposed approach jointly aligns the marginal and class-conditional distributions in the shared label space by minimaxing a novel multi-class adversarial loss function. Furthermore, we incorporate effective regularization terms to encourage selecting the most relevant subset of source domain classes. In the absence of target labels, the proposed approach is able to effectively learn domain-invariant feature representations, which in turn can enhance the classification performance in the target domain. Comprehensive experiments on three benchmark datasets Office-$31$, Office-Home, and Caltech-Office corroborate the effectiveness of the proposed approach in addressing different partial transfer learning tasks.

Enlarging Discriminative Power by Adding an Extra Class in Unsupervised Domain Adaptation

Hai Tran, Sumyeong Ahn, Taeyoung Lee, Yung Yi

Responsive image

Auto-TLDR; Unsupervised Domain Adaptation using Artificial Classes

Slides Poster Similar

We study the problem of unsupervised domain adaptation that aims at obtaining a prediction model for the target domain using labeled data from the source domain and unlabeled data from the target domain. There exists an array of recent research based on the idea of extracting features that are not only invariant for both domains but also provide high discriminative power for the target domain. In this paper, we propose an idea of improving the discriminativeness: Adding an extra artificial class and training the model on the given data together with the GAN-generated samples of the new class. The trained model based on the new class samples is capable of extracting the features that are more discriminative by repositioning data of current classes in the target domain and therefore increasing the distances among the target clusters in the feature space. Our idea is highly generic so that it is compatible with many existing methods such as DANN, VADA, and DIRT-T. We conduct various experiments for the standard data commonly used for the evaluation of unsupervised domain adaptations and demonstrate that our algorithm achieves the SOTA performance for many scenarios.

Respecting Domain Relations: Hypothesis Invariance for Domain Generalization

Ziqi Wang, Marco Loog, Jan Van Gemert

Responsive image

Auto-TLDR; Learning Hypothesis Invariant Representations for Domain Generalization

Slides Poster Similar

In domain generalization, multiple labeled non-independent and non-identically distributed source domains are available during training while neither the data nor the labels of target domains are. Currently, learning so-called domain invariant representations (DIRs) is the prevalent approach to domain generalization. In this work, we define DIRs employed by existing works in probabilistic terms and show that by learning DIRs, overly strict requirements are imposed concerning the invariance. Particularly, DIRs aim to perfectly align representations of different domains, i.e. their input distributions. This is, however, not necessary for good generalization to a target domain and may even dispose of valuable classification information. We propose to learn so-called hypothesis invariant representations (HIRs), which relax the invariance assumptions. We report experimental results on public domain generalization datasets to show that learning HIRs is more effective than learning DIRs. In fact, our approach can even compete with approaches using prior knowledge about domains.

Cascade Attention Guided Residue Learning GAN for Cross-Modal Translation

Bin Duan, Wei Wang, Hao Tang, Hugo Latapie, Yan Yan

Responsive image

Auto-TLDR; Cascade Attention-Guided Residue GAN for Cross-modal Audio-Visual Learning

Slides Poster Similar

Since we were babies, we intuitively develop the ability to correlate the input from different cognitive sensors such as vision, audio, and text. However, in machine learning, this cross-modal learning is a nontrivial task because different modalities have no homogeneous properties. Previous works discover that there should be bridges among different modalities. From neurology and psychology perspective, humans have the capacity to link one modality with another one, e.g., associating a picture of a bird with the only hearing of its singing and vice versa. Is it possible for machine learning algorithms to recover the scene given the audio signal? In this paper, we propose a novel Cascade Attention-Guided Residue GAN (CAR-GAN), aiming at reconstructing the scenes given the corresponding audio signals. Particularly, we present a residue module to mitigate the gap between different modalities progressively. Moreover, a cascade attention guided network with a novel classification loss function is designed to tackle the cross-modal learning task. Our model keeps consistency in the high-level semantic label domain and is able to balance two different modalities. The experimental results demonstrate that our model achieves the state-of-the-art cross-modal audio-visual generation on the challenging Sub-URMP dataset.

Detail Fusion GAN: High-Quality Translation for Unpaired Images with GAN-Based Data Augmentation

Ling Li, Yaochen Li, Chuan Wu, Hang Dong, Peilin Jiang, Fei Wang

Responsive image

Auto-TLDR; Data Augmentation with GAN-based Generative Adversarial Network

Slides Poster Similar

Image-to-image translation, a task to learn the mapping relation between two different domains, is a rapid-growing research field in deep learning. Although existing Generative Adversarial Network(GAN)-based methods have achieved decent results in this field, there are still some limitations in generating high-quality images for practical applications (e.g., data augmentation and image inpainting). In this work, we aim to propose a GAN-based network for data augmentation which can generate translated images with more details and less artifacts. The proposed Detail Fusion Generative Adversarial Network(DFGAN) consists of a detail branch, a transfer branch, a filter module, and a reconstruction module. The detail branch is trained by a super-resolution loss and its intermediate features can be used to introduce more details to the transfer branch by the filter module. Extensive evaluations demonstrate that our model generates more satisfactory images against the state-of-the-art approaches for data augmentation.

Disentangle, Assemble, and Synthesize: Unsupervised Learning to Disentangle Appearance and Location

Hiroaki Aizawa, Hirokatsu Kataoka, Yutaka Satoh, Kunihito Kato

Responsive image

Auto-TLDR; Generative Adversarial Networks with Structural Constraint for controllability of latent space

Slides Poster Similar

The next step for the generative adversarial networks~(GAN) is to learn representations that allow us to control only a certain factor in the image explicitly. Since such a representation of the factor is independent of other factors, the controllability obtained from these representations leads to interpretability by identifying the variation of the synthesized image and the transferability for downstream tasks by inference. However, since it is difficult to identify and strictly define latent factors, the annotation is laborious. Moreover, learning such representations by a GAN is challenging due to the complex generation process. Therefore, we resolve this limitation using a novel generative model that can disentangle latent space into the appearance, the x-axis, and the y-axis of the object, and reassemble these components in an unsupervised manner. Specifically, based on the concept of packing the appearance and location in each position of the feature map, we introduce a novel structural constraint technique that prevents these representations from interacting with each other. The proposed structural constraint promotes the disentanglement of these factors. In experiments, we found that the proposed method is simple but effective for controllability and allows us to control the appearance and location via latent space without supervision, as compared with the conditional GAN.

Mask-Based Style-Controlled Image Synthesis Using a Mask Style Encoder

Jaehyeong Cho, Wataru Shimoda, Keiji Yanai

Responsive image

Auto-TLDR; Style-controlled Image Synthesis from Semantic Segmentation masks using GANs

Slides Poster Similar

In recent years, the advances in Generative Adversarial Networks (GANs) have shown impressive results for image generation and translation tasks. In particular, the image-to-image translation is a method of learning mapping from a source domain to a target domain and synthesizing an image. Image-to-image translation can be applied to a variety of tasks, making it possible to quickly and easily synthesize realistic images from semantic segmentation masks. However, in the existing image-to-image translation method, there is a limitation on controlling the style of the translated image, and it is not easy to synthesize an image by controlling the style of each mask element in detail. Therefore, we propose an image synthesis method that controls the style of each element by improving the existing image-to-image translation method. In the proposed method, we implement a style encoder that extracts style features for each mask element. The extracted style features are concatenated to the semantic mask in the normalization layer, and used the style-controlled image synthesis of each mask element. In experiments, we train style-controlled images synthesis using the datasets consisting of semantic segmentation masks and real images. The results show that the proposed method has excellent performance for style-controlled images synthesis for each element.

Unsupervised Domain Adaptation for Object Detection in Cultural Sites

Giovanni Pasqualino, Antonino Furnari, Giovanni Maria Farinella

Responsive image

Auto-TLDR; Unsupervised Domain Adaptation for Object Detection in Cultural Sites

Slides Similar

The ability to detect objects in cultural sites from the egocentric point of view of the user can enable interesting applications for both the visitors and the manager of the site. Unfortunately, current object detection algorithms have to be trained on large amounts of labeled data, the collection of which is costly and time-consuming. While synthetic data generated from the 3D model of the cultural site can be used to train object detection algorithms, a significant drop in performance is generally observed when such algorithms are deployed to work with real images. In this paper, we consider the problem of unsupervised domain adaptation for object detection in cultural sites. Specifically, we assume the availability of synthetic labeled images and real unlabeled images for training. To study the problem, we propose a dataset containing 75244 synthetic and 2190 real images with annotations for 16 different artworks. We hence investigate different domain adaptation techniques based on image-to-image translation and feature alignment. Our analysis points out that such techniques can be useful to address the domain adaptation issue, while there is still plenty of space for improvement on the proposed dataset. We release the dataset at our web page to encourage research on this challenging topic: https://iplab.dmi.unict.it/EGO-CH-OBJ-ADAPT/.

Exemplar Guided Cross-Spectral Face Hallucination Via Mutual Information Disentanglement

Haoxue Wu, Huaibo Huang, Aijing Yu, Jie Cao, Zhen Lei, Ran He

Responsive image

Auto-TLDR; Exemplar Guided Cross-Spectral Face Hallucination with Structural Representation Learning

Slides Poster Similar

Recently, many Near infrared-visible (NIR-VIS) heterogeneous face recognition (HFR) methods have been proposed in the community. But it remains a challenging problem because of the sensing gap along with large pose variations. In this paper, we propose an Exemplar Guided Cross-Spectral Face Hallucination (EGCH) to reduce the domain discrepancy through disentangled representation learning. For each modality, EGCH contains a spectral encoder as well as a structure encoder to disentangle spectral and structure representation, respectively. It also contains a traditional generator that reconstructs the input from the above two representations, and a structure generator that predicts the facial parsing map from the structure representation. Besides, mutual information minimization and maximization are conducted to boost disentanglement and make representations adequately expressed. Then the translation is built on structure representations between two modalities. Provided with the transformed NIR structure representation and original VIS spectral representation, EGCH is capable to produce high-fidelity VIS images that preserve the topology structure of the input NIR while transfer the spectral information of an arbitrary VIS exemplar. Extensive experiments demonstrate that the proposed method achieves more promising results both qualitatively and quantitatively than the state-of-the-art NIR-VIS methods.

Spatial-Aware GAN for Unsupervised Person Re-Identification

Fangneng Zhan, Changgong Zhang

Responsive image

Auto-TLDR; Unsupervised Unsupervised Domain Adaptation for Person Re-Identification

Similar

The recent person re-identification research has achieved great success by learning from a large number of labeled person images. On the other hand, the learned models often experience significant performance drops when applied to images collected in a different environment. Unsupervised domain adaptation (UDA) has been investigated to mitigate this constraint, but most existing systems adapt images at pixel level only and ignore obvious discrepancies at spatial level. This paper presents an innovative UDA-based person re-identification network that is capable of adapting images at both spatial and pixel levels simultaneously. A novel disentangled cycle-consistency loss is designed which guides the learning of spatial-level and pixel-level adaptation in a collaborative manner. In addition, a novel multi-modal mechanism is incorporated which is capable of generating images of different geometry views and augmenting training images effectively. Extensive experiments over a number of public datasets show that the proposed UDA network achieves superior person re-identification performance as compared with the state-of-the-art.

Controllable Face Aging

Haien Zeng, Hanjiang Lai

Responsive image

Auto-TLDR; A controllable face aging method via attribute disentanglement generative adversarial network

Slides Poster Similar

Motivated by the following two observations: 1) people are aging differently under different conditions for changeable facial attributes, e.g., skin color may become darker when working outside, and 2) it needs to keep some unchanged facial attributes during the aging process, e.g., race and gender, we propose a controllable face aging method via attribute disentanglement generative adversarial network. To offer fine control over the synthesized face images, first, an individual embedding of the face is directly learned from an image that contains the desired facial attribute. Second, since the image may contain other unwanted attributes, an attribute disentanglement network is used to separate the individual embedding and learn the common embedding that contains information about the face attribute (e.g., race). With the common embedding, we can manipulate the generated face image with the desired attribute in an explicit manner. Experimental results on two common benchmarks demonstrate that our proposed generator achieves comparable performance on the aging effect with state-of-the-art baselines while gaining more flexibility for attribute control. Code is available at supplementary material.

DAPC: Domain Adaptation People Counting Via Style-Level Transfer Learning and Scene-Aware Estimation

Na Jiang, Xingsen Wen, Zhiping Shi

Responsive image

Auto-TLDR; Domain Adaptation People counting via Style-Level Transfer Learning and Scene-Aware Estimation

Slides Poster Similar

People counting concentrates on predicting the number of people in surveillance images. It remains challenging due to the rich variations in scene type and crowd density. Besides, the limited closed-set with ground truth from reality significantly increase the difficulty of people counting in actual open-set. Targeting to solve these problems, this paper proposes a domain adaptation people counting via style-level transfer learning (STL) and scene-aware estimation (SAE). The style-level transfer learning explicitly leverages the style constraint and content similarity between images to learn effective knowledge transfer, which narrows the gap between closed-set and open-set by generating domain adaptation images. The scene-aware estimation introduces scene classifier to provide scene-aware weights for adaptively fusing density maps, which alleviates interference of variations in scene type and crowd density on domain adaptation people counting. Extensive experimental results demonstrate that images generated by STL are more suitable for domain adaptation learning and our proposed approach significantly outperforms the state-of-the-art methods on multiple cross-domain pairs.

On the Evaluation of Generative Adversarial Networks by Discriminative Models

Amirsina Torfi, Mohammadreza Beyki, Edward Alan Fox

Responsive image

Auto-TLDR; Domain-agnostic GAN Evaluation with Siamese Neural Networks

Slides Poster Similar

Generative Adversarial Networks (GANs) can accurately model complex multi-dimensional data and generate realistic samples. However, due to their implicit estimation of data distributions, their evaluation is a challenging task. The majority of research efforts associated with tackling this issue were validated by qualitative visual evaluation. Such approaches do not generalize well beyond the image domain. Since many of those evaluation metrics are proposed and bound to the vision domain, they are difficult to apply to other domains. Quantitative measures are necessary to better guide the training and comparison of different GANs models. In this work, we leverage Siamese neural networks to propose a domain-agnostic evaluation metric: (1) with a qualitative evaluation that is consistent with human evaluation, (2) that is robust relative to common GAN issues such as mode dropping and invention, and (3) does not require any pretrained classifier. The empirical results in this paper demonstrate the superiority of this method compared to the popular Inception Score and are competitive with the FID score.

Semantics-Guided Representation Learning with Applications to Visual Synthesis

Jia-Wei Yan, Ci-Siang Lin, Fu-En Yang, Yu-Jhe Li, Yu-Chiang Frank Wang

Responsive image

Auto-TLDR; Learning Interpretable and Interpolatable Latent Representations for Visual Synthesis

Slides Poster Similar

Learning interpretable and interpolatable latent representations has been an emerging research direction, allowing researchers to understand and utilize the derived latent space for further applications such as visual synthesis or recognition. While most existing approaches derive an interpolatable latent space and induces smooth transition in image appearance, it is still not clear how to observe desirable representations which would contain semantic information of interest. In this paper, we aim to learn meaningful representations and simultaneously perform semantic-oriented and visually-smooth interpolation. To this end, we propose an angular triplet-neighbor loss (ATNL) that enables learning a latent representation whose distribution matches the semantic information of interest. With the latent space guided by ATNL, we further utilize spherical semantic interpolation for generating semantic warping of images, allowing synthesis of desirable visual data. Experiments on MNIST and CMU Multi-PIE datasets qualitatively and quantitatively verify the effectiveness of our method.

Attributes Aware Face Generation with Generative Adversarial Networks

Zheng Yuan, Jie Zhang, Shiguang Shan, Xilin Chen

Responsive image

Auto-TLDR; AFGAN: A Generative Adversarial Network for Attributes Aware Face Image Generation

Slides Poster Similar

Recent studies have shown remarkable success in face image generations. However, most of the existing methods only generate face images from random noise, and cannot generate face images according to the specific attributes. In this paper, we focus on the problem of face synthesis from attributes, which aims at generating faces with specific characteristics corresponding to the given attributes. To this end, we propose a novel attributes aware face image generator method with generative adversarial networks called AFGAN. Specifically, we firstly propose a two-path embedding layer and self-attention mechanism to convert binary attribute vector to rich attribute features. Then three stacked generators generate 64 * 64, 128 * 128 and 256 * 256 resolution face images respectively by taking the attribute features as input. In addition, an image-attribute matching loss is proposed to enhance the correlation between the generated images and input attributes. Extensive experiments on CelebA demonstrate the superiority of our AFGAN in terms of both qualitative and quantitative evaluations.

Adversarially Constrained Interpolation for Unsupervised Domain Adaptation

Mohamed Azzam, Aurele Tohokantche Gnanha, Hau-San Wong, Si Wu

Responsive image

Auto-TLDR; Unsupervised Domain Adaptation with Domain Mixup Strategy

Slides Poster Similar

We address the problem of unsupervised domain adaptation (UDA) which aims at adapting models trained on a labeled domain to a completely unlabeled domain. One way to achieve this goal is to learn a domain-invariant representation. However, this approach is subject to two challenges: samples from two domains are insufficient to guarantee domain-invariance at most part of the latent space, and neighboring samples from the target domain may not belong to the same class on the low-dimensional manifold. To mitigate these shortcomings, we propose two strategies. First, we incorporate a domain mixup strategy in domain adversarial learning model by linearly interpolating between the source and target domain samples. This allows the latent space to be continuous and yields an improvement of the domain matching. Second, the domain discriminator is regularized via judging the relative difference between both domains for the input mixup features, which speeds up the domain matching. Experiment results show that our proposed model achieves a superior performance on different tasks under various domain shifts and data complexity.

Semantic-Guided Inpainting Network for Complex Urban Scenes Manipulation

Pierfrancesco Ardino, Yahui Liu, Elisa Ricci, Bruno Lepri, Marco De Nadai

Responsive image

Auto-TLDR; Semantic-Guided Inpainting of Complex Urban Scene Using Semantic Segmentation and Generation

Slides Poster Similar

Manipulating images of complex scenes to reconstruct, insert and/or remove specific object instances is a challenging task. Complex scenes contain multiple semantics and objects, which are frequently cluttered or ambiguous, thus hampering the performance of inpainting models. Conventional techniques often rely on structural information such as object contours in multi-stage approaches that generate unreliable results and boundaries. In this work, we propose a novel deep learning model to alter a complex urban scene by removing a user-specified portion of the image and coherently inserting a new object (e.g. car or pedestrian) in that scene. Inspired by recent works on image inpainting, our proposed method leverages the semantic segmentation to model the content and structure of the image, and learn the best shape and location of the object to insert. To generate reliable results, we design a new decoder block that combines the semantic segmentation and generation task to guide better the generation of new objects and scenes, which have to be semantically consistent with the image. Our experiments, conducted on two large-scale datasets of urban scenes (Cityscapes and Indian Driving), show that our proposed approach successfully address the problem of semantically-guided inpainting of complex urban scene.

Open Set Domain Recognition Via Attention-Based GCN and Semantic Matching Optimization

Xinxing He, Yuan Yuan, Zhiyu Jiang

Responsive image

Auto-TLDR; Attention-based GCN and Semantic Matching Optimization for Open Set Domain Recognition

Slides Poster Similar

Open set domain recognition has got the attention in recent years. The task aims to specifically classify each sample in the practical unlabeled target domain, which consists of all known classes in the manually labeled source domain and target-specific unknown categories. The absence of annotated training data or auxiliary attribute information for unknown categories makes this task especially difficult. Moreover, exiting domain discrepancy in label space and data distribution further distracts the knowledge transferred from known classes to unknown classes. To address these issues, this work presents an end-to-end model based on attention-based GCN and semantic matching optimization, which first employs the attention mechanism to enable the central node to learn more discriminating representations from its neighbors in the knowledge graph. Moreover, a coarse-to-fine semantic matching optimization approach is proposed to progressively bridge the domain gap. Experimental results validate that the proposed model not only has superiority on recognizing the images of known and unknown classes, but also can adapt to various openness of the target domain.

Augmented Cyclic Consistency Regularization for Unpaired Image-To-Image Translation

Takehiko Ohkawa, Naoto Inoue, Hirokatsu Kataoka, Nakamasa Inoue

Responsive image

Auto-TLDR; Augmented Cyclic Consistency Regularization for Unpaired Image-to-Image Translation

Slides Poster Similar

Unpaired image-to-image (I2I) translation has received considerable attention in pattern recognition and computer vision because of recent advancements in generative adversarial networks (GANs). However, due to the lack of explicit supervision, unpaired I2I models often fail to generate realistic images, especially in challenging datasets with different backgrounds and poses. Hence, stabilization is indispensable for real-world applications and GANs. Herein, we propose Augmented Cyclic Consistency Regularization (ACCR), a novel regularization method for unpaired I2I translation. Our main idea is to enforce consistency regularization originating from semi-supervised learning on the discriminators leveraging real, fake, reconstructed, and augmented samples. We regularize the discriminators to output similar predictions when fed pairs of original and perturbed images. We qualitatively clarify the generation property between unpaired I2I models and standard GANs, and explain why consistency regularization on fake and reconstructed samples works well. Quantitatively, our method outperforms the consistency regularized GAN (CR-GAN) in real-world translations and demonstrates efficacy against several data augmentation variants and cycle-consistent constraints.

Pose Variation Adaptation for Person Re-Identification

Lei Zhang, Na Jiang, Qishuai Diao, Yue Xu, Zhong Zhou, Wei Wu

Responsive image

Auto-TLDR; Pose Transfer Generative Adversarial Network for Person Re-identification

Slides Poster Similar

Person re-identification (reid) plays an important role in surveillance video analysis, especially for criminal investigation and intelligent security. Although a large number of effective feature or distance metric learning approaches have been proposed, it still suffers from pedestrians appearance variations caused by pose changing. Most of the previous methods address this problem by learning a pose-invariant descriptor subspace. In this paper, we propose a pose variation adaptation method for person reid in the view of data augmentation. It can reduce the probability of deep learning network over-fitting. Specifically, we introduce a pose transfer generative adversarial network with a similarity measurement constraint. With the learned pose transfer model, training images can be pose-transferred to any given poses, and along with the original images, form a augmented training dataset. It increases data diversity against over-fitting. In contrast to previous GAN-based methods, we consider the influence of pose variations on similarity measure to generate more realistic and shaper samples for person reid. Besides, we optimize hard example mining to introduce a novel manner of samples (pose-transferred images) used with the learned pose transfer model. It focuses on the inferior samples which are caused by pose variations to increase the number of effective hard examples for learning discriminative features and improve the generalization ability. We extensively conduct comparative evaluations to demonstrate the advantages and superiority of our proposed method over the state-of-the-art approaches on Market-1501 and DukeMTMC-reID, the rank-1 accuracy is 96.1% for Market-1501 and 92.0% for DukeMTMC-reID.

Coherence and Identity Learning for Arbitrary-Length Face Video Generation

Shuquan Ye, Chu Han, Jiaying Lin, Guoqiang Han, Shengfeng He

Responsive image

Auto-TLDR; Face Video Synthesis Using Identity-Aware GAN and Face Coherence Network

Slides Poster Similar

Face synthesis is an interesting yet challenging task in computer vision. It is even much harder to generate a portrait video than a single image. In this paper, we propose a novel video generation framework for synthesizing arbitrary-length face videos without any face exemplar or landmark. To overcome the synthesis ambiguity of face video, we propose a divide-and-conquer strategy to separately address the video face synthesis problem from two aspects, face identity synthesis and rearrangement. To this end, we design a cascaded network which contains three components, Identity-aware GAN (IA-GAN), Face Coherence Network, and Interpolation Network. IA-GAN is proposed to synthesize photorealistic faces with the same identity from a set of noises. Face Coherence Network is designed to re-arrange the faces generated by IA-GAN while keeping the inter-frame coherence. Interpolation Network is introduced to eliminate the discontinuity between two adjacent frames and improve the smoothness of the face video. Experimental results demonstrate that our proposed network is able to generate face video with high visual quality while preserving the identity. Statistics show that our method outperforms state-of-the-art unconditional face video generative models in multiple challenging datasets.

Contrastive Data Learning for Facial Pose and Illumination Normalization

Gee-Sern Hsu, Chia-Hao Tang

Responsive image

Auto-TLDR; Pose and Illumination Normalization with Contrast Data Learning for Face Recognition

Slides Poster Similar

Face normalization can be a crucial step when handling generic face recognition. We propose the Pose and Illumination Normalization (PIN) framework with contrast data learning for face normalization. The PIN framework is designed to learn the transformation from a source set to a target set. The source set and the target set compose a contrastive data set for learning. The source set contains faces collected in the wild and thus covers a wide range of variation across illumination, pose, expression and other variables. The target set contains face images taken under controlled conditions and all faces are in frontal pose and balanced in illumination. The PIN framework is composed of an encoder, a decoder and two discriminators. The encoder is made of a state-of-the-art face recognition network and acts as a facial feature extractor, which is not updated during training. The decoder is trained on both the source and target sets, and aims to learn the transformation from the source set to the target set; and therefore, it can transform an arbitrary face into a illumination and pose normalized face. The discriminators are trained to ensure the photo-realistic quality of the normalized face images generated by the decoder. The loss functions employed in the decoder and discriminators are appropriately designed and weighted for yielding better normalization outcomes and recognition performance. We verify the performance of the propose framework on several benchmark databases, and compare with state-of-the-art approaches.

Learning Interpretable Representation for 3D Point Clouds

Feng-Guang Su, Ci-Siang Lin, Yu-Chiang Frank Wang

Responsive image

Auto-TLDR; Disentangling Body-type and Pose Information from 3D Point Clouds Using Adversarial Learning

Slides Poster Similar

Point clouds have emerged as a popular representation of 3D visual data. With a set of unordered 3D points, one typically needs to transform them into latent representation before further classification and segmentation tasks. However, one cannot easily interpret such encoded latent representation. To address this issue, we propose a unique deep learning framework for disentangling body-type and pose information from 3D point clouds. Extending from autoenoder, we advance adversarial learning a selected feature type, while classification and data recovery can be additionally observed. Our experiments confirm that our model can be successfully applied to perform a wide range of 3D applications like shape synthesis, action translation, shape/action interpolation, and synchronization.

Generative Latent Implicit Conditional Optimization When Learning from Small Sample

Idan Azuri, Daphna Weinshall

Responsive image

Auto-TLDR; GLICO: Generative Latent Implicit Conditional Optimization for Small Sample Learning

Slides Poster Similar

We revisit the long-standing problem of learning from small sample. The generation of new samples from a small training set of labeled points has attracted increased attention in recent years. In this paper, we propose a novel such method called GLICO (Generative Latent Implicit Conditional Optimization). GLICO learns a mapping from the training examples to a latent space and a generator that generates images from vectors in the latent space. Unlike most recent work, which rely on access to large amounts of unlabeled data, GLICO does not require access to any additional data other than the small set of labeled points. In fact, GLICO learns to synthesize completely new samples for every class using as little as 5 or 10 examples per class, with as few as 10 such classes and no data from unknown classes. GLICO is then used to augment the small training set while training a classifier on the small sample. To this end, our proposed method samples the learned latent space using spherical interpolation (slerp) and generates new examples using the trained generator. Empirical results show that the new sampled set is diverse enough, leading to improvement in image classification in comparison with the state of the art when trained on small samples obtained from CIFAR-10, CIFAR-100, and CUB-200.

Beyond the Deep Metric Learning: Enhance the Cross-Modal Matching with Adversarial Discriminative Domain Regularization

Li Ren, Kai Li, Liqiang Wang, Kien Hua

Responsive image

Auto-TLDR; Adversarial Discriminative Domain Regularization for Efficient Cross-Modal Matching

Slides Poster Similar

Matching information across image and text modalities is a fundamental challenge for many applications that involve both vision and natural language processing. The objective is to find efficient similarity metrics to compare the similarity between visual and textual information. Existing approaches mainly match the local visual objects and the sentence words in a shared space with attention mechanisms. The matching performance is still limited because the similarity computation is based on simple comparisons of the matching features, ignoring the characteristics of their distribution in the data. In this paper, we address this limitation with an efficient learning objective that considers the discriminative feature distributions between the visual objects and sentence words. Specifically, we propose a novel Adversarial Discriminative Domain Regularization (ADDR) learning framework, beyond the paradigm metric learning objective, to construct a set of discriminative data domains within each image-text pairs. Our approach can generally improve the learning efficiency and the performance of existing metrics learning frameworks by regulating the distribution of the hidden space between the matching pairs. The experimental results show that this new approach significantly improves the overall performance of several popular cross-modal matching techniques (SCAN, VSRN, BFAN) on the MS-COCO and Flickr30K benchmarks.

Unsupervised Learning of Landmarks Based on Inter-Intra Subject Consistencies

Weijian Li, Haofu Liao, Shun Miao, Le Lu, Jiebo Luo

Responsive image

Auto-TLDR; Unsupervised Learning for Facial Landmark Discovery using Inter-subject Landmark consistencies

Slides Similar

We present a novel unsupervised learning approach to image landmark discovery by incorporating the inter-subject landmark consistencies on facial images. This is achieved via an inter-subject mapping module that transforms original subject landmarks based on an auxiliary subject-related structure. To recover from the transformed images back to the original subject, the landmark detector is forced to learn spatial locations that contain the consistent semantic meanings both for the paired intra-subject images and between the paired inter-subject images. Our proposed method is extensively evaluated on two public facial image datasets (MAFL, AFLW) with various settings. Experimental results indicate that our method can extract the consistent landmarks for both datasets and achieve better performances compared to the previous state-of-the-art methods quantitatively and qualitatively.

SATGAN: Augmenting Age Biased Dataset for Cross-Age Face Recognition

Wenshuang Liu, Wenting Chen, Yuanlue Zhu, Linlin Shen

Responsive image

Auto-TLDR; SATGAN: Stable Age Translation GAN for Cross-Age Face Recognition

Slides Poster Similar

In this paper, we propose a Stable Age Translation GAN (SATGAN) to generate fake face images at different ages to augment age biased face datasets for Cross-Age Face Recognition (CAFR) . The proposed SATGAN consists of both generator and discriminator. As a part of the generator, a novel Mask Attention Module (MAM) is introduced to make the generator focus on the face area. In addition, the generator employs a Uniform Distribution Discriminator (UDD) to supervise the learning of latent feature map and enforce the uniform distribution. Besides, the discriminator employs a Feature Separation Module (FSM) to disentangle identity information from the age information. The quantitative and qualitative evaluations on Morph dataset prove that SATGAN achieves much better performance than existing methods. The face recognition model trained using dataset (VGGFace2 and MS-Celeb-1M) augmented using our SATGAN achieves better accuracy on cross age dataset like Cross-Age LFW and AgeDB-30.