Semantics-Guided Representation Learning with Applications to Visual Synthesis

Jia-Wei Yan, Ci-Siang Lin, Fu-En Yang, Yu-Jhe Li, Yu-Chiang Frank Wang

Auto-TLDR; Learning Interpretable and Interpolatable Latent Representations for Visual Synthesis

Learning interpretable and interpolatable latent representations has been an emerging research direction, allowing researchers to understand and utilize the derived latent space for further applications such as visual synthesis or recognition. While most existing approaches derive an interpolatable latent space and induce smooth transitions in image appearance, it remains unclear how to obtain representations that contain the semantic information of interest. In this paper, we aim to learn meaningful representations and simultaneously perform semantic-oriented and visually-smooth interpolation. To this end, we propose an angular triplet-neighbor loss (ATNL) that enables learning a latent representation whose distribution matches the semantic information of interest. With the latent space guided by ATNL, we further utilize spherical semantic interpolation for generating semantic warping of images, allowing synthesis of desirable visual data. Experiments on the MNIST and CMU Multi-PIE datasets qualitatively and quantitatively verify the effectiveness of our method.
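The spherical semantic interpolation mentioned in the abstract is, in spirit, the classic slerp operation on latent codes. Below is a minimal NumPy sketch of such an interpolation; the function name, latent dimension, and fallback behaviour are illustrative assumptions, not taken from the paper's implementation.

```python
import numpy as np

def slerp(z0, z1, t):
    """Spherical interpolation between latent vectors z0 and z1 at ratio t in [0, 1]."""
    z0_unit = z0 / np.linalg.norm(z0)
    z1_unit = z1 / np.linalg.norm(z1)
    omega = np.arccos(np.clip(np.dot(z0_unit, z1_unit), -1.0, 1.0))  # angle between the codes
    if np.isclose(omega, 0.0):
        return (1.0 - t) * z0 + t * z1  # nearly parallel codes: fall back to linear interpolation
    return (np.sin((1.0 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)

# Example: a short semantic path between two encoded images in a 128-d latent space
z_a, z_b = np.random.randn(128), np.random.randn(128)
path = [slerp(z_a, z_b, t) for t in np.linspace(0.0, 1.0, num=10)]
```

Decoding each point on such a path with the trained generator yields the kind of semantically warped image sequence the abstract describes.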

Similar papers

Learning Interpretable Representation for 3D Point Clouds

Feng-Guang Su, Ci-Siang Lin, Yu-Chiang Frank Wang

Auto-TLDR; Disentangling Body-type and Pose Information from 3D Point Clouds Using Adversarial Learning

Point clouds have emerged as a popular representation of 3D visual data. With a set of unordered 3D points, one typically needs to transform them into a latent representation before further classification and segmentation tasks. However, one cannot easily interpret such an encoded latent representation. To address this issue, we propose a unique deep learning framework for disentangling body-type and pose information from 3D point clouds. Extending from the autoencoder, we advance adversarial learning to disentangle a selected feature type, while classification and data recovery can be additionally observed. Our experiments confirm that our model can be successfully applied to perform a wide range of 3D applications such as shape synthesis, action translation, shape/action interpolation, and synchronization.

Disentangled Representation Learning for Controllable Image Synthesis: An Information-Theoretic Perspective

Shichang Tang, Xu Zhou, Xuming He, Yi Ma

Auto-TLDR; Controllable Image Synthesis in Deep Generative Models using Variational Auto-Encoder

In this paper, we look into the problem of disentangled representation learning and controllable image synthesis in a deep generative model. We develop an encoder-decoder architecture for a variant of the Variational Auto-Encoder (VAE) with two latent codes $z_1$ and $z_2$. Our framework uses $z_2$ to capture specified factors of variation while $z_1$ captures the complementary factors of variation. To this end, we analyze the learning problem from the perspective of multivariate mutual information, derive optimizable lower bounds of the conditional mutual information in the image synthesis processes and incorporate them into the training objective. We validate our method empirically on the Color MNIST dataset and the CelebA dataset by showing controllable image syntheses. Our proposed paradigm is simple yet effective and is applicable to many situations, including those where there is not an explicit factorization of features available, or where the features are non-categorical.
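As a rough illustration of the two-code design described above, a minimal PyTorch-style encoder emitting two Gaussian latent codes could look like the sketch below; the layer sizes and names are assumptions for illustration only, not the authors' architecture.

```python
import torch
import torch.nn as nn

class TwoCodeEncoder(nn.Module):
    """Toy VAE-style encoder with two latent codes: z2 for the specified factors of
    variation, z1 for the complementary ones. Dimensions are illustrative only."""
    def __init__(self, x_dim=784, z1_dim=16, z2_dim=4):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU())
        self.head1 = nn.Linear(256, 2 * z1_dim)  # mean and log-variance of z1
        self.head2 = nn.Linear(256, 2 * z2_dim)  # mean and log-variance of z2

    def forward(self, x):
        h = self.backbone(x)
        mu1, logvar1 = self.head1(h).chunk(2, dim=-1)
        mu2, logvar2 = self.head2(h).chunk(2, dim=-1)
        z1 = mu1 + torch.randn_like(mu1) * (0.5 * logvar1).exp()  # reparameterization trick
        z2 = mu2 + torch.randn_like(mu2) * (0.5 * logvar2).exp()
        return (z1, mu1, logvar1), (z2, mu2, logvar2)
```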

Disentangle, Assemble, and Synthesize: Unsupervised Learning to Disentangle Appearance and Location

Hiroaki Aizawa, Hirokatsu Kataoka, Yutaka Satoh, Kunihito Kato

Auto-TLDR; Generative Adversarial Networks with Structural Constraint for controllability of latent space

The next step for generative adversarial networks (GANs) is to learn representations that allow us to explicitly control only a certain factor in the image. Since such a representation of the factor is independent of other factors, the controllability obtained from these representations leads to interpretability, by identifying the variation of the synthesized image, and to transferability for downstream tasks by inference. However, since it is difficult to identify and strictly define latent factors, annotation is laborious. Moreover, learning such representations with a GAN is challenging due to the complex generation process. Therefore, we resolve this limitation using a novel generative model that can disentangle the latent space into the appearance, the x-axis, and the y-axis of the object, and reassemble these components in an unsupervised manner. Specifically, based on the concept of packing the appearance and location in each position of the feature map, we introduce a novel structural constraint technique that prevents these representations from interacting with each other. The proposed structural constraint promotes the disentanglement of these factors. In experiments, we found that the proposed method is simple but effective for controllability and allows us to control the appearance and location via the latent space without supervision, as compared with the conditional GAN.

Dual-MTGAN: Stochastic and Deterministic Motion Transfer for Image-To-Video Synthesis

Fu-En Yang, Jing-Cheng Chang, Yuan-Hao Lee, Yu-Chiang Frank Wang

Auto-TLDR; Dual Motion Transfer GAN for Convolutional Neural Networks

Generating videos with content and motion variations is a challenging task in computer vision. While the recent development of GANs allows video generation from latent representations, it is not easy to produce videos with particular content or motion patterns of interest. In this paper, we propose Dual Motion Transfer GAN (Dual-MTGAN), which takes image and video data as inputs while learning disentangled content and motion representations. Our Dual-MTGAN is able to perform deterministic motion transfer and stochastic motion generation. Based on a given image, the former preserves the input content and transfers motion patterns observed from another video sequence, while the latter directly produces videos with plausible yet diverse motion patterns based on the input image. The proposed model is trained in an end-to-end manner, without the need to utilize pre-defined motion features like pose or facial landmarks. Our quantitative and qualitative results confirm the effectiveness and robustness of our model in addressing such conditioned image-to-video tasks.

Variational Deep Embedding Clustering by Augmented Mutual Information Maximization

Qiang Ji, Yanfeng Sun, Yongli Hu, Baocai Yin

Auto-TLDR; Clustering by Augmented Mutual Information maximization for Deep Embedding

Clustering is a crucial but challenging task in pattern analysis and machine learning. Recently, many deep clustering methods combining representation learning with clustering techniques have emerged. These deep clustering methods mainly focus on the correlation among samples and ignore the relationship between samples and their representations. In this paper, we propose a novel end-to-end clustering framework, namely variational deep embedding clustering by augmented mutual information maximization (VCAMI). From the perspective of the VAE, we prove that minimizing the reconstruction loss is equivalent to maximizing the mutual information between the input and its latent representation. This provides a theoretical guarantee for us to directly maximize the mutual information instead of minimizing the reconstruction loss. Therefore we propose the augmented mutual information, which highlights the uniqueness of the representations while discovering invariant information among similar samples. Extensive experiments on several challenging image datasets show that VCAMI achieves good performance. We achieve state-of-the-art results for clustering on MNIST (99.5%) and CIFAR-10 (65.4%) to the best of our knowledge.
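The stated equivalence between reconstruction and mutual information can be written, in one standard (Barber-Agakov style) form that may differ in detail from the paper's derivation, as

$$
I_q(x;z) \;=\; H(x) \;-\; H_q(x\mid z) \;\ge\; H(x) \;+\; \mathbb{E}_{p(x)}\,\mathbb{E}_{q(z\mid x)}\big[\log p_\theta(x\mid z)\big],
$$

so, with the data entropy $H(x)$ fixed, minimizing the expected reconstruction loss $-\mathbb{E}[\log p_\theta(x\mid z)]$ maximizes a lower bound on the mutual information between the input and its latent representation.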

Image Representation Learning by Transformation Regression

Xifeng Guo, Jiyuan Liu, Sihang Zhou, En Zhu, Shihao Dong

Auto-TLDR; Self-supervised Image Representation Learning using Continuous Parameter Prediction

Self-supervised learning is a thriving research direction since it can relieve the burden of human labeling for machine learning by seeking supervision from data instead of human annotation. Although existing methods demonstrate promising performance in various applications, we observe that they usually model the auxiliary learning tasks as classification tasks with finite discrete labels, leading to insufficient supervisory signals, which in turn restricts the representation quality. In this paper, to solve the above problem and make full use of the supervision from data, we design a regression model to predict the continuous parameters of a group of transformations, i.e., image rotation, translation, and scaling. Surprisingly, this naive modification stimulates tremendous potential from data, and the resulting supervisory signal has largely improved the performance of image representation learning. Extensive experiments on four image datasets, including CIFAR10, CIFAR100, STL10, and SVHN, indicate that our proposed algorithm outperforms state-of-the-art unsupervised learning methods by a large margin in terms of classification accuracy. Crucially, we find that with our proposed training mechanism as an initialization, the performance of existing state-of-the-art classification architectures can be further improved.
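To make the pretext task concrete, a hedged PyTorch sketch of the continuous-parameter regression setup could look as follows; the parameter ranges, target normalization, and helper name are assumptions, not the authors' exact settings.

```python
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF

def make_regression_batch(images):
    """Apply a random rotation/translation/scaling to each image; the continuous
    parameters themselves are the regression targets. Ranges are illustrative."""
    angles = torch.empty(images.size(0)).uniform_(-30.0, 30.0)   # rotation in degrees
    shifts = torch.empty(images.size(0), 2).uniform_(-0.1, 0.1)  # translation as a fraction of width/height
    scales = torch.empty(images.size(0)).uniform_(0.8, 1.2)      # isotropic scale factor
    transformed = torch.stack([
        TF.affine(img, angle=float(a), scale=float(s), shear=0.0,
                  translate=[int(tx * img.shape[-1]), int(ty * img.shape[-2])])
        for img, a, (tx, ty), s in zip(images, angles, shifts, scales)
    ])
    targets = torch.cat([angles.unsqueeze(1) / 30.0, shifts / 0.1, scales.unsqueeze(1) - 1.0], dim=1)
    return transformed, targets

# One training step with any backbone ending in a 4-unit regression head:
# x_t, y = make_regression_batch(x); loss = F.mse_loss(model(x_t), y)
```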

A Joint Representation Learning and Feature Modeling Approach for One-Class Recognition

Pramuditha Perera, Vishal Patel

Auto-TLDR; Combining Generative Features and One-Class Classification for Effective One-class Recognition

One-class recognition is traditionally approached either as a representation learning problem or a feature modelling problem. In this work, we argue that both of these approaches have their own limitations; and a more effective solution can be obtained by combining the two. The proposed approach is based on the combination of a generative framework and a one-class classification method. First, we learn generative features using the one-class data with a generative framework. We augment the learned features with the corresponding reconstruction errors to obtain augmented features. Then, we qualitatively identify a suitable feature distribution that reduces the redundancy in the chosen classifier space. Finally, we force the augmented features to take the form of this distribution using an adversarial framework. We test the effectiveness of the proposed method on three one-class classification tasks and obtain state-of-the-art results.

AVAE: Adversarial Variational Auto Encoder

Antoine Plumerault, Hervé Le Borgne, Celine Hudelot

Auto-TLDR; Combining VAE and GAN for Realistic Image Generation

Among the wide variety of image generative models, two models stand out: Variational Auto Encoders (VAE) and Generative Adversarial Networks (GAN). GANs can produce realistic images, but they suffer from mode collapse and do not provide simple ways to get the latent representation of an image. On the other hand, VAEs do not have these problems, but they often generate images less realistic than GANs. In this article, we explain that this lack of realism is partially due to a common underestimation of the natural image manifold dimensionality. To solve this issue we introduce a new framework that combines VAE and GAN in a novel and complementary way to produce an auto-encoding model that keeps VAEs properties while generating images of GAN-quality. We evaluate our approach both qualitatively and quantitatively on five image datasets.

Learning Low-Shot Generative Networks for Cross-Domain Data

Hsuan-Kai Kao, Cheng-Che Lee, Wei-Chen Chiu

Auto-TLDR; Learning Generators for Cross-Domain Data under Low-Shot Learning

We tackle a novel problem of learning generators for cross-domain data under a specific scenario of low-shot learning. Basically, given a source domain with a sufficient amount of training data, we aim to transfer the knowledge of its generative process to another target domain, which not only has few data samples but also exhibits a domain shift with respect to the source domain. This problem has great potential in practical use and is different from the well-known image translation task, as the target-domain data can be generated without requiring any source-domain ones and the large data consumption for learning a target-domain generator can be alleviated. Built upon a cross-domain dataset where (1) each of the low shots in the target domain has its correspondence in the source and (2) these two domains share similar content information but different appearance, two approaches are proposed: a Latent-Disentanglement-Orientated model (LaDo) and a Generative-Hierarchy-Oriented (GenHo) model. Our LaDo and GenHo approaches address the problem from different perspectives, where the former relies on learning a disentangled representation composed of domain-invariant content features and domain-specific appearance ones, while the latter decomposes the generative process of a generator into two parts for synthesizing the content and appearance sequentially. We perform extensive experiments under various settings of cross-domain data and show the efficacy of our models for generating target-domain data with abundant content variance as in the source domain, which leads to favourable performance in comparison to several baselines.

Learning Disentangled Representations for Identity Preserving Surveillance Face Camouflage

Jingzhi Li, Lutong Han, Hua Zhang, Xiaoguang Han, Jingguo Ge, Xiaochu Cao

Auto-TLDR; Individual Face Privacy under Surveillance Scenario with Multi-task Loss Function

In this paper, we focus on protecting personal face privacy under surveillance scenarios, where the goal is to change the visual appearance of faces while keeping them recognizable by current face recognition systems. This is a challenging problem, as we should retain the most important structures of captured facial images while altering the salient facial regions to protect personal privacy. To address this problem, we introduce a novel individual face protection model, which can camouflage the face appearance from the perspective of human visual perception while preserving the identity features of faces used for face authentication. To that end, we develop an encoder-decoder network architecture that can separately disentangle the person feature representation into an appearance code and an identity code. Specifically, we first randomly divide the face images into two groups, the source set and the target set, where the source set is used to extract the identity code and the target set provides the appearance code. Then, we recombine the identity and appearance codes to synthesize a new face, which has the same identity as the source subject. Finally, the synthesized faces are used to replace the original faces to protect the privacy of individuals. Furthermore, our model is trained end-to-end with a multi-task loss function, which can better preserve the identity and stabilize the training loss. Experiments conducted on the Cross-Age Celebrity dataset demonstrate the effectiveness of our model and validate our superiority in terms of visual quality and scalability.

Variational Capsule Encoder

Harish Raviprakash, Syed Anwar, Ulas Bagci

Auto-TLDR; Bayesian Capsule Networks for Representation Learning in latent space

We propose a novel capsule network based variational encoder architecture, called Bayesian capsules (B-Caps), to modulate the mean and standard deviation of the sampling distribution in the latent space. We hypothesize that this approach can learn a better representation of features in the latent space than traditional approaches. Our hypothesis was tested by using the learned latent variables for an image reconstruction task, where for the MNIST and Fashion-MNIST datasets, different classes were separated successfully in the latent space using our proposed model. Our experimental results have shown improved reconstruction and classification performance for both datasets, adding credence to our hypothesis. We also showed that by increasing the latent space dimension, the proposed B-Caps was able to learn a better representation when compared to traditional variational auto-encoders (VAE). Hence our results indicate the strength of capsule networks in representation learning, which has never been examined under the VAE setting before.

High Resolution Face Age Editing

Xu Yao, Gilles Puy, Alasdair Newson, Yann Gousseau, Pierre Hellier

Auto-TLDR; An Encoder-Decoder Architecture for Face Age editing on High Resolution Images

Face age editing has become a crucial task in film post-production, and is also becoming popular for general purpose photography. Recently, adversarial training has produced some of the most visually impressive results for image manipulation, including the face aging/de-aging task. In spite of considerable progress, current methods often present visual artifacts and can only deal with low-resolution images. In order to achieve aging/de-aging with the high quality and robustness necessary for wider use, these problems need to be addressed. This is the goal of the present work. We present an encoder-decoder architecture for face age editing. The core idea of our network is to encode a face image to age-invariant features, and learn a modulation vector corresponding to a target age. We then combine these two elements to produce a realistic image of the person with the desired target age. Our architecture is greatly simplified with respect to other approaches, and allows for fine-grained age editing on high resolution images in a single unified model. Source codes are available at https://github.com/InterDigitalInc/HRFAE.

Interpolation in Auto Encoders with Bridge Processes

Carl Ringqvist, Henrik Hult, Judith Butepage, Hedvig Kjellstrom

Auto-TLDR; Stochastic interpolations from auto encoders trained on flattened sequences

Auto encoding models have been extensively studied in recent years. They provide an efficient framework for sample generation, as well as for analysing feature learning. Furthermore, they are efficient in performing interpolations between data points in semantically meaningful ways. In this paper, we introduce a method for generating sequence samples from auto encoders trained on flattened sequences (e.g., video samples from auto encoders trained to generate a single video frame), as well as a canonical, dimension-independent method for generating stochastic interpolations. The distribution of interpolation paths is represented as the distribution of a bridge process constructed from an artificial random data-generating process in the latent space, having the prior distribution as its invariant distribution.

Interpreting the Latent Space of GANs Via Correlation Analysis for Controllable Concept Manipulation

Ziqiang Li, Rentuo Tao, Hongjing Niu, Bin Li

Auto-TLDR; Exploring latent space of GANs by analyzing correlation between latent variables and semantic contents in generated images

Generative adversarial nets (GANs) have been successfully applied in many fields such as image generation, inpainting, super-resolution and drug discovery; however, the inner workings of GANs are still far from being understood. To gain deeper insight into the intrinsic mechanism of GANs, in this paper we propose a method for interpreting the latent space of GANs by analyzing the correlation between latent variables and the corresponding semantic contents in generated images. Unlike previous methods that focus on dissecting models via feature visualization, the emphasis of this work is on the variables in the latent space, i.e., how the latent variables quantitatively affect the generated results. Given a pretrained GAN model with fixed weights, the latent variables are intervened on to analyze their effect on the semantic content in generated images. A set of controlling latent variables can be derived for specific content generation, and controllable semantic content manipulation can be achieved. The proposed method is tested on the Fashion-MNIST and UT Zappos50K datasets, and experimental results show its effectiveness.

Domain Generalized Person Re-Identification Via Cross-Domain Episodic Learning

Ci-Siang Lin, Yuan Chia Cheng, Yu-Chiang Frank Wang

Auto-TLDR; Domain-Invariant Person Re-identification with Episodic Learning

Aiming at recognizing images of the same person across distinct camera views, person re-identification (re-ID) has been among the active research topics in computer vision. Most existing re-ID works require the collection of a large amount of labeled image data from the scenes of interest. When the data to be recognized are different from the source-domain training ones, a number of domain adaptation approaches have been proposed. Nevertheless, one still needs to collect labeled or unlabelled target-domain data during training. In this paper, we tackle an even more challenging and practical setting, domain generalized (DG) person re-ID. That is, while a number of labeled source-domain datasets are available, we do not have access to any target-domain training data. In order to learn domain-invariant features without knowing the target domain of interest, we present an episodic learning scheme which advances meta learning strategies to exploit the observed source-domain labeled data. The learned features would exhibit sufficient domain-invariant properties while not overfitting the source-domain data or ID labels. Our experiments on four benchmark datasets confirm the superiority of our method over state-of-the-art approaches.

Reducing the Variance of Variational Estimates of Mutual Information by Limiting the Critic's Hypothesis Space to RKHS

Aditya Sreekar P, Ujjwal Tiwari, Anoop Namboodiri

Auto-TLDR; Mutual Information Estimation from Variational Lower Bounds Using a Critic's Hypothesis Space

Mutual information (MI) is an information-theoretic measure of the dependency between two random variables. Several methods to estimate MI from samples of two random variables with unknown underlying probability distributions have been proposed in the literature. Recent methods realize parametric probability distributions or a critic as a neural network to approximate unknown density ratios. The approximated density ratios are used to estimate different variational lower bounds of MI. While these methods provide reliable estimation when the true MI is low, they produce high-variance estimates in cases of high MI. We argue that this high-variance characteristic is due to the uncontrolled complexity of the critic's hypothesis space. In support of this argument, we use the data-driven Rademacher complexity of the hypothesis space associated with the critic's architecture to analyse the generalization error bound of variational lower-bound estimates of MI. In the proposed work, we show that it is possible to negate the high-variance characteristics of these estimators by constraining the critic's hypothesis space to a Reproducing Kernel Hilbert Space (RKHS), which corresponds to a kernel learned using Automated Spectral Kernel Learning (ASKL). By analysing the aforementioned generalization error bounds, we augment the overall optimisation objective with an effective regularisation term. We empirically demonstrate the efficacy of this regularization in enforcing a proper bias-variance tradeoff on four variational lower bounds, namely NWJ, MINE, JS and SMILE.

Mutual Information Based Method for Unsupervised Disentanglement of Video Representation

Aditya Sreekar P, Ujjwal Tiwari, Anoop Namboodiri

Auto-TLDR; MIPAE: Mutual Information Predictive Auto-Encoder for Video Prediction

Video prediction is an interesting and challenging task of predicting future frames from a given set of context frames that belong to a video sequence. Video prediction models have found prospective applications in maneuver planning, health care, autonomous navigation and simulation. One of the major challenges in future frame generation is the high-dimensional nature of visual data. In this work, we propose the Mutual Information Predictive Auto-Encoder (MIPAE) framework, which reduces the task of predicting high-dimensional video frames by factorising video representations into content and low-dimensional pose latent variables that are easy to predict. A standard LSTM network is used to predict these low-dimensional pose representations. Content and the predicted pose representations are decoded to generate future frames. Our approach leverages the temporal structure of the latent generative factors of a video and a novel mutual information loss to learn disentangled video representations. We also propose a metric based on the mutual information gap (MIG) to quantitatively assess the effectiveness of disentanglement on the DSprites and MPI3D-real datasets. MIG scores corroborate the visual superiority of frames predicted by MIPAE. We also compare our method quantitatively on the evaluation metrics LPIPS, SSIM and PSNR.

Unsupervised Disentangling of Viewpoint and Residues Variations by Substituting Representations for Robust Face Recognition

Minsu Kim, Joanna Hong, Junho Kim, Hong Joo Lee, Yong Man Ro

Auto-TLDR; Unsupervised Disentangling of Identity, viewpoint, and Residue Representations for Robust Face Recognition

It is well-known that identity-unrelated variations (e.g., viewpoint or illumination) degrade the performance of face recognition methods. In order to handle this challenge, robust methods for disentangling the identity and view representations have drawn attention in the machine learning area. However, existing methods learn discriminative features which require manual supervision of such factors of variation. In this paper, we propose a novel disentangling framework through modeling three representations of identity, viewpoint, and residues (i.e., identity- and pose-unrelated factors) which does not require supervision of the variations. By jointly modeling the three representations, we enhance the disentanglement of each representation and achieve robust face recognition performance. Further, the learned viewpoint representation can be utilized for pose estimation or editing of a posed facial image. Extensive quantitative and qualitative evaluations verify the effectiveness of our proposed method, which disentangles identity, viewpoint, and residues of facial images.

A Self-Supervised GAN for Unsupervised Few-Shot Object Recognition

Khoi Nguyen, Sinisa Todorovic

Auto-TLDR; Self-supervised Few-Shot Object Recognition with a Triplet GAN

This paper addresses unsupervised few-shot object recognition, where all training images are unlabeled, and test images are divided into queries and a few labeled support images per object class of interest. The training and test images do not share object classes. We extend the vanilla GAN with two loss functions, both aimed at self-supervised learning. The first is a reconstruction loss that enforces the discriminator to reconstruct the probabilistically sampled latent code which has been used for generating the "fake" image. The second is a triplet loss that enforces the discriminator to output image encodings that are closer for more similar images. Evaluation, comparisons, and detailed ablation studies are done in the context of few-shot classification. Our approach significantly outperforms the state of the art on the Mini-Imagenet and Tiered-Imagenet datasets.
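For reference, a minimal sketch of the two self-supervised losses described above is given below; the margin value, squared-distance choice, and function names are simplified assumptions, not the paper's exact formulation.

```python
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Margin-based triplet loss on discriminator encodings; margin is illustrative."""
    d_pos = (anchor - positive).pow(2).sum(dim=1)
    d_neg = (anchor - negative).pow(2).sum(dim=1)
    return F.relu(d_pos - d_neg + margin).mean()

def latent_reconstruction_loss(z_sampled, z_recovered):
    """The discriminator is asked to recover the latent code used to generate the fake image."""
    return F.mse_loss(z_recovered, z_sampled)
```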

Discriminative Multi-Level Reconstruction under Compact Latent Space for One-Class Novelty Detection

Jaewoo Park, Yoon Gyo Jung, Andrew Teoh

Auto-TLDR; Discriminative Compact AE for One-Class novelty detection and Adversarial Example Detection

In one-class novelty detection, a model learns solely on the in-class data to single out out-class instances. Autoencoder (AE) variants aim to compactly model the in-class data to reconstruct it exclusively, thus differentiating the in-class from out-class by the reconstruction error. However, compact modeling in an improper way might collapse the latent representations of the in-class data and thus their reconstruction, which would lead to performance deterioration. Moreover, to properly measure the reconstruction error of high-dimensional data, a metric is required that captures high-level semantics of the data. To this end, we propose Discriminative Compact AE (DCAE) that learns both compact and collapse-free latent representations of the in-class data, thereby reconstructing them both finely and exclusively. In DCAE, (a) we force a compact latent space to bijectively represent the in-class data by reconstructing them through internal discriminative layers of generative adversarial nets. (b) Based on the deep encoder's vulnerability to open set risk, out-class instances are encoded into the same compact latent space and reconstructed poorly without sacrificing the quality of in-class data reconstruction. (c) In inference, the reconstruction error is measured by a novel metric that computes the dissimilarity between a query and its reconstruction based on the class semantics captured by the internal discriminator. Extensive experiments on public image datasets validate the effectiveness of our proposed model on both novelty and adversarial example detection, delivering state-of-the-art performance.

Generative Latent Implicit Conditional Optimization When Learning from Small Sample

Idan Azuri, Daphna Weinshall

Auto-TLDR; GLICO: Generative Latent Implicit Conditional Optimization for Small Sample Learning

We revisit the long-standing problem of learning from small sample. The generation of new samples from a small training set of labeled points has attracted increased attention in recent years. In this paper, we propose a novel such method called GLICO (Generative Latent Implicit Conditional Optimization). GLICO learns a mapping from the training examples to a latent space and a generator that generates images from vectors in the latent space. Unlike most recent work, which relies on access to large amounts of unlabeled data, GLICO does not require access to any additional data other than the small set of labeled points. In fact, GLICO learns to synthesize completely new samples for every class using as little as 5 or 10 examples per class, with as few as 10 such classes and no data from unknown classes. GLICO is then used to augment the small training set while training a classifier on the small sample. To this end, our proposed method samples the learned latent space using spherical interpolation (slerp) and generates new examples using the trained generator. Empirical results show that the new sampled set is diverse enough, leading to improvement in image classification in comparison with the state of the art when trained on small samples obtained from CIFAR-10, CIFAR-100, and CUB-200.

Combining GANs and AutoEncoders for Efficient Anomaly Detection

Fabio Carrara, Giuseppe Amato, Luca Brombin, Fabrizio Falchi, Claudio Gennaro

Auto-TLDR; CBIGAN: Anomaly Detection in Images with Consistency Constrained BiGAN

In this work, we propose CBiGAN, a novel method for anomaly detection in images, where a consistency constraint is introduced as a regularization term in both the encoder and decoder of a BiGAN. Our model exhibits fairly good modeling power and reconstruction consistency capability. We evaluate the proposed method on MVTec AD, a real-world benchmark for unsupervised anomaly detection on high-resolution images, and compare against standard baselines and state-of-the-art approaches. Experiments show that the proposed method improves the performance of BiGAN formulations by a large margin and performs comparably to expensive state-of-the-art iterative methods while reducing the computational cost. We also observe that our model is particularly effective in texture-type anomaly detection, as it sets a new state of the art in this category. The code will be publicly released.

IDA-GAN: A Novel Imbalanced Data Augmentation GAN

Hao Yang, Yun Zhou

Auto-TLDR; IDA-GAN: Generative Adversarial Networks for Imbalanced Data Augmentation

Class imbalance is a widespread and challenging problem in real-world applications such as disease diagnosis, fraud detection, and network intrusion detection. Due to the scarcity of data, it can significantly deteriorate classification accuracy. To address this challenge, we propose a novel Imbalanced Data Augmentation Generative Adversarial Network (GAN) named IDA-GAN as an augmentation tool for dealing with imbalanced datasets. This is a great challenge because it is hard to train a GAN model in this situation. We overcome this issue by coupling a variational autoencoder with GAN training. Specifically, we introduce the variational autoencoder to learn the majority- and minority-class distributions in the latent space, and use the generative model to utilize each class distribution for the subsequent GAN training. The generative model learns useful features to generate target minority-class samples. By comparing with state-of-the-art GAN models, the experimental results demonstrate that our proposed IDA-GAN can generate more diverse minority samples with better quality, and it consistently benefits the imbalanced classification task in terms of several widely-used evaluation metrics on five benchmark datasets: MNIST, Fashion-MNIST, SVHN, CIFAR-10 and GTSRB.

Unsupervised Face Manipulation Via Hallucination

Keerthy Kusumam, Enrique Sanchez, Georgios Tzimiropoulos

Auto-TLDR; Unpaired Face Image Manipulation using Autoencoders

This paper addresses the problem of manipulating a face image in terms of changing its pose. To achieve this, we propose a new method that can be trained under the very general "unpaired" setting. To this end, we firstly propose to model the general appearance, layout and background of the input image using a low-resolution version of it, which is progressively passed through a hallucination network to generate features at higher resolutions. We show that such a formulation is significantly simpler than previous approaches for appearance modelling based on autoencoders. Secondly, we propose a fully learnable and spatially-aware appearance transfer module which can cope with misalignment between the input source image and the target pose and can effectively combine the features from the hallucination network with the features produced by our generator. Thirdly, we introduce an identity preserving method that is trained in an unsupervised way, by using an auxiliary feature extractor and a contrastive loss between the real and generated images. We compare our method against the state-of-the-art, reporting significant improvements both quantitatively, in terms of FID and IS, and qualitatively.

SATGAN: Augmenting Age Biased Dataset for Cross-Age Face Recognition

Wenshuang Liu, Wenting Chen, Yuanlue Zhu, Linlin Shen

Auto-TLDR; SATGAN: Stable Age Translation GAN for Cross-Age Face Recognition

In this paper, we propose a Stable Age Translation GAN (SATGAN) to generate fake face images at different ages to augment age-biased face datasets for Cross-Age Face Recognition (CAFR). The proposed SATGAN consists of both a generator and a discriminator. As a part of the generator, a novel Mask Attention Module (MAM) is introduced to make the generator focus on the face area. In addition, the generator employs a Uniform Distribution Discriminator (UDD) to supervise the learning of the latent feature map and enforce a uniform distribution. Besides, the discriminator employs a Feature Separation Module (FSM) to disentangle identity information from the age information. Quantitative and qualitative evaluations on the Morph dataset prove that SATGAN achieves much better performance than existing methods. A face recognition model trained on datasets (VGGFace2 and MS-Celeb-1M) augmented using our SATGAN achieves better accuracy on cross-age datasets like Cross-Age LFW and AgeDB-30.

Learning Embeddings for Image Clustering: An Empirical Study of Triplet Loss Approaches

Kalun Ho, Janis Keuper, Franz-Josef Pfreundt, Margret Keuper

Auto-TLDR; Clustering Objectives for K-means and Correlation Clustering Using Triplet Loss

In this work, we evaluate two different image clustering objectives, k-means clustering and correlation clustering, in the context of Triplet Loss induced feature space embeddings. Specifically, we train a convolutional neural network to learn discriminative features by optimizing two popular versions of the Triplet Loss in order to study their clustering properties under the assumption of noisy labels. Additionally, we propose a new, simple Triplet Loss formulation, which shows desirable properties with respect to formal clustering objectives and outperforms the existing methods. We evaluate all three Triplet loss formulations for K-means and correlation clustering on the CIFAR-10 image classification dataset.

Unsupervised Contrastive Photo-To-Caricature Translation Based on Auto-Distortion

Yuhe Ding, Xin Ma, Mandi Luo, Aihua Zheng, Ran He

Auto-TLDR; Unsupervised contrastive photo-to-caricature translation with style loss

Photo-to-caricature translation aims to synthesize a caricature as a rendered image that exaggerates the features through sketching, pencil strokes, or other artistic drawings. Style rendering and geometry deformation are the most important aspects in the photo-to-caricature translation task. To take both into consideration, we propose an unsupervised contrastive photo-to-caricature translation architecture. Considering the intuitive artifacts in existing methods, we propose a contrastive style loss for style rendering to enforce the similarity between the style of the rendered photo and the caricature, and simultaneously enhance its discrepancy to the photos. To obtain an exaggerated deformation in an unpaired/unsupervised fashion, we propose a Distortion Prediction Module (DPM) to predict a set of displacement vectors for each input image while fixing some controlling points, followed by thin-plate-spline interpolation for warping. The model is trained on unpaired photos and caricatures, yet it can offer bidirectional synthesis by taking either a photo or a caricature as input. Extensive experiments demonstrate that the proposed model is effective in generating hand-drawn-like caricatures compared with existing competitors.

Local Facial Attribute Transfer through Inpainting

Ricard Durall, Franz-Josef Pfreundt, Janis Keuper

Auto-TLDR; Attribute Transfer Inpainting Generative Adversarial Network

The term attribute transfer refers to the tasks of altering images in such a way, that the semantic interpretation of a given input image is shifted towards an intended direction, which is quantified by semantic attributes. Prominent example applications are photo realistic changes of facial features and expressions, like changing the hair color, adding a smile, enlarging the nose or altering the entire context of a scene, like transforming a summer landscape into a winter panorama. Recent advances in attribute transfer are mostly based on generative deep neural networks, using various techniques to manipulate images in the latent space of the generator. In this paper, we present a novel method for the common sub-task of local attribute transfers, where only parts of a face have to be altered in order to achieve semantic changes (e.g. removing a mustache). In contrast to previous methods, where such local changes have been implemented by generating new (global) images, we propose to formulate local attribute transfers as an inpainting problem. Removing and regenerating only parts of images, our Attribute Transfer Inpainting Generative Adversarial Network (ATI-GAN) is able to utilize local context information to focus on the attributes while keeping the background unmodified resulting in visually sound results.

Pose Variation Adaptation for Person Re-Identification

Lei Zhang, Na Jiang, Qishuai Diao, Yue Xu, Zhong Zhou, Wei Wu

Auto-TLDR; Pose Transfer Generative Adversarial Network for Person Re-identification

Person re-identification (reid) plays an important role in surveillance video analysis, especially for criminal investigation and intelligent security. Although a large number of effective feature or distance metric learning approaches have been proposed, it still suffers from pedestrian appearance variations caused by pose changes. Most of the previous methods address this problem by learning a pose-invariant descriptor subspace. In this paper, we propose a pose variation adaptation method for person reid from the viewpoint of data augmentation. It can reduce the probability of deep learning networks over-fitting. Specifically, we introduce a pose transfer generative adversarial network with a similarity measurement constraint. With the learned pose transfer model, training images can be pose-transferred to any given pose and, along with the original images, form an augmented training dataset. This increases data diversity against over-fitting. In contrast to previous GAN-based methods, we consider the influence of pose variations on similarity measurement to generate more realistic and sharper samples for person reid. Besides, we optimize hard example mining to introduce a novel way of using the samples (pose-transferred images) produced by the learned pose transfer model. It focuses on the inferior samples which are caused by pose variations, in order to increase the number of effective hard examples for learning discriminative features and improve the generalization ability. We extensively conduct comparative evaluations to demonstrate the advantages and superiority of our proposed method over the state-of-the-art approaches on Market-1501 and DukeMTMC-reID; the rank-1 accuracy is 96.1% for Market-1501 and 92.0% for DukeMTMC-reID.

On the Evaluation of Generative Adversarial Networks by Discriminative Models

Amirsina Torfi, Mohammadreza Beyki, Edward Alan Fox

Auto-TLDR; Domain-agnostic GAN Evaluation with Siamese Neural Networks

Generative Adversarial Networks (GANs) can accurately model complex multi-dimensional data and generate realistic samples. However, due to their implicit estimation of data distributions, their evaluation is a challenging task. The majority of research efforts associated with tackling this issue were validated by qualitative visual evaluation. Such approaches do not generalize well beyond the image domain. Since many of those evaluation metrics are proposed and bound to the vision domain, they are difficult to apply to other domains. Quantitative measures are necessary to better guide the training and comparison of different GANs models. In this work, we leverage Siamese neural networks to propose a domain-agnostic evaluation metric: (1) with a qualitative evaluation that is consistent with human evaluation, (2) that is robust relative to common GAN issues such as mode dropping and invention, and (3) does not require any pretrained classifier. The empirical results in this paper demonstrate the superiority of this method compared to the popular Inception Score and are competitive with the FID score.

GAN-Based Gaussian Mixture Model Responsibility Learning

Wanming Huang, Yi Da Xu, Shuai Jiang, Xuan Liang, Ian Oppermann

Auto-TLDR; Posterior Consistency Module for Gaussian Mixture Model

A Mixture Model (MM) is a probabilistic framework that allows us to describe a dataset containing $K$ different modes. When each of the modes is associated with a Gaussian distribution, we refer to it as a Gaussian MM or GMM. Given a data point $x$, a GMM may assume the existence of a random index $k \in \{1, \dots , K \}$ identifying which Gaussian the particular data point is associated with. In the traditional GMM paradigm, it is straightforward to compute in closed form the conditional likelihood $p(x |k, \theta)$ as well as the responsibility probability $p(k|x, \theta)$ describing the distribution weights for each data point. Computing the responsibility allows us to retrieve many important statistics of the overall dataset, including the weights of each of the modes/clusters. Modern large datasets often contain multiple unlabelled modes; for example, a paintings dataset may contain several styles, and fashion images may contain several unlabelled categories. In their raw representation, the Euclidean distances between the data (e.g., images) do not allow them to form mixtures naturally, nor is it feasible to compute the responsibility distribution analytically, making GMM inapplicable. In this paper, we utilize the Generative Adversarial Network (GAN) framework to achieve a plausible alternative method to compute these probabilities. The key insight is that we compute them in the data's latent space $z$ instead of $x$. However, the process $z \rightarrow x$ is irreversible under a GAN, which renders the computation of the responsibility $p(k|x, \theta)$ infeasible. Our paper proposes a novel method to solve this by using a so-called Posterior Consistency Module (PCM). PCM acts like a GAN, except that its generator $C_{\text{PCM}}$ does not output the data; instead it outputs a distribution to approximate $p(k|x, \theta)$. The entire network is trained in an "end-to-end" fashion. Through these techniques, we can model datasets of very complex structure using a GMM and subsequently discover interesting properties of an unsupervised dataset, including its segments, as well as generate new "out-distribution" data by smooth linear interpolation across any combination of the modes in a completely unsupervised manner.
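For reference, the closed-form responsibility mentioned above is the standard GMM posterior over the component index,

$$
p(k \mid x, \theta) \;=\; \frac{\pi_k\,\mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\,\mathcal{N}(x \mid \mu_j, \Sigma_j)},
$$

which is the quantity the Posterior Consistency Module is trained to approximate when it cannot be computed analytically for raw images.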

Exemplar Guided Cross-Spectral Face Hallucination Via Mutual Information Disentanglement

Haoxue Wu, Huaibo Huang, Aijing Yu, Jie Cao, Zhen Lei, Ran He

Auto-TLDR; Exemplar Guided Cross-Spectral Face Hallucination with Structural Representation Learning

Recently, many Near infrared-visible (NIR-VIS) heterogeneous face recognition (HFR) methods have been proposed in the community, but it remains a challenging problem because of the sensing gap along with large pose variations. In this paper, we propose an Exemplar Guided Cross-Spectral Face Hallucination (EGCH) framework to reduce the domain discrepancy through disentangled representation learning. For each modality, EGCH contains a spectral encoder as well as a structure encoder to disentangle the spectral and structure representations, respectively. It also contains a traditional generator that reconstructs the input from the above two representations, and a structure generator that predicts the facial parsing map from the structure representation. Besides, mutual information minimization and maximization are conducted to boost disentanglement and make the representations adequately expressive. The translation is then built on structure representations between the two modalities. Provided with the transformed NIR structure representation and the original VIS spectral representation, EGCH is capable of producing high-fidelity VIS images that preserve the topological structure of the input NIR image while transferring the spectral information of an arbitrary VIS exemplar. Extensive experiments demonstrate that the proposed method achieves more promising results both qualitatively and quantitatively than state-of-the-art NIR-VIS methods.

JUMPS: Joints Upsampling Method for Pose Sequences

Lucas Mourot, Francois Le Clerc, Cédric Thébault, Pierre Hellier

Auto-TLDR; JUMPS: Increasing the Number of Joints in 2D Pose Estimation and Recovering Occluded or Missing Joints

Human Pose Estimation is a low-level task useful for surveillance, human action recognition, and scene understanding at large. It also offers promising perspectives for the animation of synthetic characters. For all these applications, and especially the latter, estimating the positions of many joints is desirable for improved performance and realism. To this purpose, we propose a novel method called JUMPS for increasing the number of joints in 2D pose estimates and recovering occluded or missing joints. We believe this is the first attempt to address the issue. We build on a deep generative model that combines a GAN and an encoder. The GAN learns the distribution of high-resolution human pose sequences, the encoder maps the input low-resolution sequences to its latent space. Inpainting is obtained by computing the latent representation whose decoding by the GAN generator optimally matches the joints locations at the input. Post-processing a 2D pose sequence using our method provides a richer representation of the character motion. We show experimentally that the localization accuracy of the additional joints is on average on par with the original pose estimates.

Adversarially Constrained Interpolation for Unsupervised Domain Adaptation

Mohamed Azzam, Aurele Tohokantche Gnanha, Hau-San Wong, Si Wu

Auto-TLDR; Unsupervised Domain Adaptation with Domain Mixup Strategy

We address the problem of unsupervised domain adaptation (UDA), which aims at adapting models trained on a labeled domain to a completely unlabeled domain. One way to achieve this goal is to learn a domain-invariant representation. However, this approach is subject to two challenges: samples from the two domains are insufficient to guarantee domain-invariance over most of the latent space, and neighboring samples from the target domain may not belong to the same class on the low-dimensional manifold. To mitigate these shortcomings, we propose two strategies. First, we incorporate a domain mixup strategy into the domain adversarial learning model by linearly interpolating between source and target domain samples. This allows the latent space to be continuous and yields an improvement in domain matching. Second, the domain discriminator is regularized by judging the relative difference between both domains for the mixed-up input features, which speeds up domain matching. Experimental results show that our proposed model achieves superior performance on different tasks under various domain shifts and data complexity.
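A hedged sketch of the domain mixup step described here is shown below; the Beta sampling and the use of the mixing coefficient as a soft domain label are common choices assumed for illustration, not necessarily the authors' exact recipe.

```python
import torch

def domain_mixup(x_src, x_tgt, alpha=2.0):
    """Linearly interpolate source and target samples; the mixing coefficient can also
    serve as a soft domain label for the discriminator. Beta(alpha, alpha) is assumed."""
    lam = torch.distributions.Beta(alpha, alpha).sample((x_src.size(0),)).to(x_src.device)
    lam = lam.view(-1, *([1] * (x_src.dim() - 1)))  # broadcast over channel/spatial dims
    x_mix = lam * x_src + (1.0 - lam) * x_tgt
    return x_mix, lam.view(-1)  # lam indicates how "source-like" each mixed sample is
```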

Multi-Modal Deep Clustering: Unsupervised Partitioning of Images

Guy Shiran, Daphna Weinshall

Auto-TLDR; Multi-Modal Deep Clustering for Unlabeled Images

The clustering of unlabeled raw images is a daunting task, which has recently been approached with some success by deep learning methods. Here we propose an unsupervised clustering framework, which learns a deep neural network in an end-to-end fashion, providing direct cluster assignments of images without additional processing. Multi-Modal Deep Clustering (MMDC), trains a deep network to align its image embeddings with target points sampled from a Gaussian Mixture Model distribution. The cluster assignments are then determined by mixture component association of image embeddings. Simultaneously, the same deep network is trained to solve an additional self-supervised task. This pushes the network to learn more meaningful image representations and stabilizes the training. Experimental results show that MMDC achieves or exceeds state-of-the-art performance on four challenging benchmarks. On natural image datasets we improve on previous results with significant margins of up to 11% absolute accuracy points, yielding an accuracy of 70% on CIFAR-10 and 61% on STL-10.

Free-Form Image Inpainting Via Contrastive Attention Network

Xin Ma, Xiaoqiang Zhou, Huaibo Huang, Zhenhua Chai, Xiaolin Wei, Ran He

Auto-TLDR; Self-supervised Siamese inference for image inpainting

Most deep learning based image inpainting approaches adopt an autoencoder or its variants to fill missing regions in images. Encoders are usually utilized to learn powerful representational spaces, which are important for dealing with sophisticated learning tasks. Specifically, in the image inpainting task, masks of any shape can appear anywhere in images (i.e., free-form masks), forming complex patterns. It is difficult for encoders to capture such powerful representations under this complex situation. To tackle this problem, we propose a self-supervised Siamese inference network to improve robustness and generalization. Moreover, the restored image usually cannot be harmoniously integrated into the existing content, especially in the boundary area. To address this problem, we propose a novel Dual Attention Fusion module (DAF), which can combine both the restored and known regions in a smoother way and be inserted into decoder layers in a plug-and-play way. DAF is developed not only to adaptively rescale channel-wise features by taking interdependencies between channels into account but also to force deep convolutional neural networks (CNNs) to focus more on unknown regions. In this way, the unknown region will be naturally filled from the outside to the inside. Qualitative and quantitative experiments on multiple datasets, including facial and natural datasets (i.e., Celeb-HQ, Paris Street View, Places2 and ImageNet), demonstrate that our proposed method outperforms state-of-the-art approaches in generating high-quality inpainting results.

DAIL: Dataset-Aware and Invariant Learning for Face Recognition

Gaoang Wang, Chen Lin, Tianqiang Liu, Mingwei He, Jiebo Luo

Auto-TLDR; DAIL: Dataset-Aware and Invariant Learning for Face Recognition

To achieve good performance in face recognition, a large-scale training dataset is usually required. A simple yet effective way to improve recognition performance is to use a dataset as large as possible by combining multiple datasets in training. However, it is problematic and troublesome to naively combine different datasets due to two major issues. Firstly, the same person can possibly appear in different datasets, leading to an identity overlapping issue between different datasets. Naively treating the same person as different classes in different datasets during training will affect back-propagation and generate non-representative embeddings. On the other hand, manually cleaning labels takes a lot of human effort, especially when there are millions of images and thousands of identities. Secondly, different datasets are collected in different situations and thus have different domain distributions. Naively combining datasets will lead to domain distribution differences and make it difficult to learn domain-invariant embeddings across different datasets. In this paper, we propose DAIL: Dataset-Aware and Invariant Learning to resolve the above-mentioned issues. To solve the first issue of identity overlapping, we propose a dataset-aware loss for multi-dataset training by reducing the penalty when the same person appears in multiple datasets. This can be readily achieved with a modified softmax loss with a dataset-aware term. To solve the second issue, domain adaptation with gradient reversal layers is employed for dataset-invariant learning. The proposed approach not only achieves state-of-the-art results on several commonly used face recognition validation sets, like LFW, CFP-FP, and AgeDB-30, but also shows great benefit for practical usage.
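One plausible reading of the dataset-aware softmax term is to let each sample's softmax compete only against classes from its own dataset; the sketch below follows that reading and should be taken as an assumption rather than the paper's exact loss.

```python
import torch.nn.functional as F

def dataset_aware_cross_entropy(logits, labels, class_dataset_ids, sample_dataset_ids):
    """Cross-entropy where classes from other datasets are masked out of the softmax,
    so overlapping identities across datasets are not penalized against each other.
    logits: (N, C); labels: (N,); class_dataset_ids: (C,); sample_dataset_ids: (N,)."""
    other_dataset = class_dataset_ids.unsqueeze(0) != sample_dataset_ids.unsqueeze(1)  # (N, C) mask
    masked_logits = logits.masked_fill(other_dataset, float('-inf'))
    return F.cross_entropy(masked_logits, labels)
```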

Few-Shot Font Generation with Deep Metric Learning

Haruka Aoki, Koki Tsubota, Hikaru Ikuta, Kiyoharu Aizawa

Auto-TLDR; Deep Metric Learning for Japanese Typographic Font Synthesis

Designing fonts for languages with a large number of characters, such as Japanese and Chinese, is an extremely labor-intensive and time-consuming task. In this study, we addressed the problem of automatically generating Japanese typographic fonts from only a few font samples, where the synthesized glyphs are expected to have coherent characteristics, such as skeletons, contours, and serifs. Existing methods often fail to generate fine glyph images when the number of style reference glyphs is extremely limited. Herein, we proposed a simple but powerful framework for extracting better style features. This framework introduces deep metric learning to style encoders. We performed experiments using black-and-white and shape-distinctive font datasets and demonstrated the effectiveness of the proposed framework.
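
As a rough illustration of applying deep metric learning to the style encoder, a triplet loss over style embeddings could look like the sketch below, where glyphs from the same font are positives and glyphs from a different font are negatives; the margin value and interfaces are assumptions, not the paper's exact setup.

```python
import torch.nn as nn

# Hypothetical sketch: metric learning on the style encoder, pulling glyphs of
# the same font together and pushing glyphs of different fonts apart.
triplet = nn.TripletMarginLoss(margin=0.2)

def style_metric_loss(style_encoder, anchor_glyph, positive_glyph, negative_glyph):
    # anchor and positive are glyphs from the same font; negative is from another font
    a = style_encoder(anchor_glyph)
    p = style_encoder(positive_glyph)
    n = style_encoder(negative_glyph)
    return triplet(a, p, n)
```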

Future Urban Scenes Generation through Vehicles Synthesis

Alessandro Simoni, Luca Bergamini, Andrea Palazzi, Simone Calderara, Rita Cucchiara

Auto-TLDR; Predicting the Future of an Urban Scene with a Novel View Synthesis Paradigm

In this work we propose a deep learning pipeline to predict the future visual appearance of an urban scene. Despite recent advances, generating the entire scene in an end-to-end fashion is still far from being achieved. Instead, we follow a two-stage approach in which interpretable information is included in the loop and each actor is modelled independently. We leverage a per-object novel view synthesis paradigm, i.e., generating a synthetic representation of an object undergoing a geometrical roto-translation in 3D space. Our model can easily be conditioned on constraints (e.g., input trajectories) provided by state-of-the-art tracking methods or by the user. This allows us to generate a set of diverse, realistic futures from the same input in a multi-modal fashion. We visually and quantitatively show the superiority of this approach over traditional end-to-end scene-generation methods on CityFlow, a challenging real-world dataset.

Local Clustering with Mean Teacher for Semi-Supervised Learning

Zexi Chen, Benjamin Dutton, Bharathkumar Ramachandra, Tianfu Wu, Ranga Raju Vatsavai

Auto-TLDR; Local Clustering for Semi-supervised Learning

The Mean Teacher (MT) model of Tarvainen and Valpola has shown favorable performance on several semi-supervised benchmark datasets. MT maintains a teacher model's weights as the exponential moving average of a student model's weights and minimizes the divergence between their probability predictions under diverse perturbations of the inputs. However, MT is known to suffer from confirmation bias, that is, reinforcing incorrect teacher model predictions. In this work, we propose a simple yet effective method called Local Clustering (LC) to mitigate the effect of confirmation bias. In MT, each data point is considered independent of other points during training; however, data points are likely to be close to each other in feature space if they share similar features. Motivated by this, we cluster data points locally by minimizing the pairwise distance between neighboring data points in feature space. Combined with a standard classification cross-entropy objective on labeled data points, the misclassified unlabeled data points are pulled towards high-density regions of their correct class with the help of their neighbors, thus improving model performance. We demonstrate on semi-supervised benchmark datasets SVHN and CIFAR-10 that adding our LC loss to MT yields significant improvements compared to MT and performance comparable to the state of the art in semi-supervised learning.
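
A minimal sketch of a local-clustering term of the kind described, pulling each sample towards its k nearest neighbours in feature space; the neighbourhood size and the use of squared Euclidean distances are assumptions.

```python
import torch

def local_clustering_loss(features, k=10):
    """Hypothetical sketch of a local-clustering term: minimize the (squared)
    distance between each sample and its k nearest neighbours in feature space.

    features: (N, D) embeddings of the unlabeled samples in the batch.
    """
    sq_dist = (features.unsqueeze(1) - features.unsqueeze(0)).pow(2).sum(-1)  # (N, N)
    n = sq_dist.size(0)
    sq_dist = sq_dist + torch.eye(n, device=features.device) * 1e9            # exclude self-pairs
    knn_dist, _ = sq_dist.topk(k, dim=1, largest=False)                       # k nearest neighbours
    return knn_dist.mean()
```

This term would be added, with its own weight, to the usual Mean Teacher objective (supervised cross-entropy on labeled data plus the student-teacher consistency loss).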

Generative Deep-Neural-Network Mixture Modeling with Semi-Supervised MinMax+EM Learning

Nilay Pande, Suyash Awate

Auto-TLDR; Semi-supervised Deep Neural Networks for Generative Mixture Modeling and Clustering

Deep neural networks (DNNs) for generative mixture modeling typically rely on unsupervised learning that employs hard clustering schemes, variational learning with loose/approximate bounds, or under-regularized modeling. We propose a novel statistical framework for a DNN mixture model using a single generative adversarial network. Our learning formulation introduces a novel data-likelihood term relying on a well-regularized/constrained Gaussian mixture model in the latent space, along with a prior term on the DNN weights. Our min-max learning increases the data likelihood via a tight variational lower bound optimized with expectation maximization (EM). We leverage this min-max EM learning scheme for semi-supervised learning. Results on three real-world datasets demonstrate the benefits of our compact modeling and learning formulation over the state of the art for mixture modeling and clustering.
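
The latent-space data-likelihood term could, in its simplest diagonal-covariance form, look like the sketch below; the parameterization (per-component means, log-variances, and logits for the mixture weights) is an assumption for illustration only and does not reproduce the paper's constrained formulation.

```python
import math
import torch

def gmm_log_likelihood(z, means, log_vars, logit_weights):
    """Hypothetical sketch: average log-likelihood of latent codes z under a
    diagonal Gaussian mixture in the latent space.

    z: (N, D); means, log_vars: (K, D); logit_weights: (K,)
    """
    diff = z.unsqueeze(1) - means.unsqueeze(0)                                                   # (N, K, D)
    log_comp = -0.5 * (diff.pow(2) / log_vars.exp() + log_vars + math.log(2 * math.pi)).sum(-1)  # (N, K)
    log_comp = log_comp + torch.log_softmax(logit_weights, dim=0)                                # mixture weights
    return torch.logsumexp(log_comp, dim=1).mean()
```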

A Quantitative Evaluation Framework of Video De-Identification Methods

Sathya Bursic, Alessandro D'Amelio, Marco Granato, Giuliano Grossi, Raffaella Lanzarotti

Auto-TLDR; Face de-identification using photo-reality and facial expressions

We live in an era of privacy concerns, motivating a large research effort in face de-identification. As in other fields, we are observing a general movement from hand-crafted to deep learning methods, mainly involving generative models. Although these methods produce more natural de-identified images or videos, we argue that merely evaluating the degree of de-identification is not sufficient, especially when the images or videos are to be processed further. In this work, we consider the problem of simultaneously preserving privacy, facial expressions, and photo-reality, and propose a general testing framework. The method is applied to four open-source tools, producing a baseline for future de-identification methods.

Cam-Softmax for Discriminative Deep Feature Learning

Tamas Suveges, Stephen James Mckenna

Auto-TLDR; Cam-Softmax: A Generalisation of Activations and Softmax for Deep Feature Spaces

Deep convolutional neural networks are widely used to learn feature spaces for image classification tasks. We propose cam-softmax, a generalisation of the final-layer activations and softmax function that encourages deep feature spaces to exhibit high intra-class compactness and high inter-class separability. We provide an algorithm that automatically adapts the method's main hyperparameter so that it gradually diverges from the standard activations and softmax during training. We report experiments using the CASIA-WebFace, LFW, and YTF face datasets demonstrating that cam-softmax leads to representations well suited to open-set face recognition and face pair matching. Furthermore, we provide empirical evidence that cam-softmax offers some robustness to class labelling errors in training data, making it of potential use for deep learning from large datasets with poorly verified labels.

Phase Retrieval Using Conditional Generative Adversarial Networks

Tobias Uelwer, Alexander Oberstraß, Stefan Harmeling

Auto-TLDR; Conditional Generative Adversarial Networks for Phase Retrieval

In this paper, we propose the application of conditional generative adversarial networks to solve various phase retrieval problems. We show that including knowledge of the measurement process at training time leads to an optimization at test time that is more robust to initialization than existing approaches involving generative models. In addition, conditioning the generator network on the measurements enables us to achieve much more detailed results. We empirically demonstrate that these advantages provide meaningful solutions to the Fourier and the compressive phase retrieval problem and that our method outperforms well-established projection-based methods as well as existing methods that are based on neural networks. Like other deep learning methods, our approach is very robust to noise and can therefore be very useful for real-world applications.
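
For the Fourier case, "including knowledge of the measurement process at training time" can be illustrated by a measurement-consistency term on a generator that is conditioned on the observed magnitudes. The sketch below is only an illustration under that assumption; it omits the adversarial part of the training objective and is not the paper's exact loss.

```python
import torch

def fourier_magnitude(x):
    # measurement operator for Fourier phase retrieval: magnitude of the 2D FFT
    return torch.fft.fft2(x).abs()

def measurement_consistency_loss(generator, y):
    """Hypothetical sketch: the generator is conditioned on the observed
    magnitudes y, and its reconstruction is pushed to reproduce them.

    y: (B, 1, H, W) observed Fourier magnitudes.
    """
    x_hat = generator(y)                                   # reconstruction conditioned on y
    return torch.mean((fourier_magnitude(x_hat) - y).pow(2))
```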

Coherence and Identity Learning for Arbitrary-Length Face Video Generation

Shuquan Ye, Chu Han, Jiaying Lin, Guoqiang Han, Shengfeng He

Auto-TLDR; Face Video Synthesis Using Identity-Aware GAN and Face Coherence Network

Face synthesis is an interesting yet challenging task in computer vision, and generating a portrait video is even harder than generating a single image. In this paper, we propose a novel video generation framework for synthesizing arbitrary-length face videos without any face exemplar or landmarks. To overcome the synthesis ambiguity of face video, we adopt a divide-and-conquer strategy that separately addresses the video face synthesis problem from two aspects: face identity synthesis and rearrangement. To this end, we design a cascaded network with three components: an Identity-aware GAN (IA-GAN), a Face Coherence Network, and an Interpolation Network. IA-GAN synthesizes photorealistic faces with the same identity from a set of noise vectors. The Face Coherence Network re-arranges the faces generated by IA-GAN while maintaining inter-frame coherence. The Interpolation Network eliminates discontinuities between adjacent frames and improves the smoothness of the face video. Experimental results demonstrate that our proposed network generates face videos with high visual quality while preserving identity, and that our method outperforms state-of-the-art unconditional face video generative models on multiple challenging datasets.

Galaxy Image Translation with Semi-Supervised Noise-Reconstructed Generative Adversarial Networks

Qiufan Lin, Dominique Fouchez, Jérôme Pasquet

Auto-TLDR; Semi-supervised Image Translation with Generative Adversarial Networks Using Paired and Unpaired Images

Image-to-image translation with deep neural networks, particularly Generative Adversarial Networks (GANs), is one of the most powerful methods for simulating astronomical images. However, current work is limited to supervised translation with paired images, and there has been little discussion of reconstructing the noise background that encodes instrumental and observational effects. These limitations can be harmful for subsequent scientific applications in astrophysics. We therefore aim to develop methods that use unpaired images and preserve noise characteristics in image translation. In this work, we propose a two-way image translation model using GANs that exploits both paired and unpaired images in a semi-supervised manner, and we introduce a noise-emulating module that learns and reconstructs noise characterized by high-frequency features. Experimenting on multi-band galaxy images from the Sloan Digital Sky Survey (SDSS) and the Canada-France-Hawaii Telescope Legacy Survey (CFHT), we show that our method recovers global and local properties effectively and outperforms benchmark image translation models. To the best of our knowledge, this work is the first attempt to apply semi-supervised methods and noise reconstruction techniques in astrophysical studies.

The Role of Cycle Consistency for Generating Better Human Action Videos from a Single Frame

Runze Li, Bir Bhanu

Auto-TLDR; Generating Videos with Human Action Semantics using Cycle Constraints

This paper addresses the challenging problem of generating videos with human action semantics. Unlike previous work, which predicts future frames in a single forward pass, this paper introduces cycle constraints in both the forward and backward passes of human action generation. This is achieved by enforcing appearance and motion consistency across the sequence of frames generated for the future. The approach consists of two stages. In the first stage, the pose of the human body is generated. In the second stage, an image generator produces future frames using (a) the human poses generated in the first stage, (b) the single observed human pose, and (c) the single corresponding future frame. Experiments are performed on three datasets: the Weizmann dataset, involving simple human actions, and the Penn Action and UCF-101 datasets, containing more complicated human actions, especially in sports. The results demonstrate the effectiveness of the proposed approach.
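
A schematic view of the forward/backward cycle constraint: predict future frames forward in time, then run a backward generator over the prediction and require that the observed starting frame is recovered. The generator interfaces below are hypothetical; they only illustrate where the cycle loss attaches, not the paper's exact architecture.

```python
import torch.nn.functional as F

def cycle_consistency_loss(forward_gen, backward_gen, frame_0, pose_0, future_poses):
    """Hypothetical sketch of the forward/backward cycle constraint: generate the
    future from the observed frame, then generate backwards from the prediction
    and require recovery of the starting frame (appearance consistency)."""
    future_frames = forward_gen(frame_0, pose_0, future_poses)    # forward pass in time
    recovered_frame = backward_gen(future_frames, future_poses)   # backward pass in time
    return F.l1_loss(recovered_frame, frame_0)
```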

Pose-Robust Face Recognition by Deep Meta Capsule Network-Based Equivariant Embedding

Fangyu Wu, Jeremy Simon Smith, Wenjin Lu, Bailing Zhang

Auto-TLDR; Deep Meta Capsule Network-based Equivariant Embedding Model for Pose-Robust Face Recognition

Despite the exceptional success of face recognition technologies, handling large pose variations remains a key challenge. Current techniques for pose-robust face recognition either directly extract pose-invariant features or first synthesize a face that matches the target pose before feature extraction. It is more desirable to learn face representations that are equivariant to pose variations. To this end, this paper proposes a deep meta Capsule network-based Equivariant Embedding Model (DM-CEEM) with three distinct novelties. First, we introduce RB-CapsNet, a new version of the Capsule network that extends CapsNet to perform profile-to-frontal face transformation in deep feature space. Second, RB-CapsNet allows DM-CEEM to learn an embedding that is equivariant to pose variations and to achieve the desired transformation for input face images. Third, we train DM-CEEM in a meta fashion by treating a single overall classification target as multiple sub-tasks that satisfy certain unknown probabilities; in each sub-task, we sample the support and query sets randomly. The experimental results on both controlled and in-the-wild databases demonstrate the superiority of DM-CEEM over the state of the art.