Unsupervised Face Manipulation Via Hallucination

Keerthy Kusumam, Enrique Sanchez, Georgios Tzimiropoulos

Auto-TLDR; Unpaired Face Image Manipulation using Autoencoders

This paper addresses the problem of manipulating a face image in terms of changing its pose. To achieve this, we propose a new method that can be trained under the very general “unpaired” setting. To this end, we firstly propose to model the general appearance, layout and background of the input image using a low-resolution version of it which is progressively passed through a hallucination network to generate features at higher resolutions. We show that such a formulation is significantly simpler than previous approaches for appearance modelling based on autoencoders. Secondly, we propose a fully learnable and spatially-aware appearance transfer module which can cope with misalignment between the input source image and the target pose and can effectively combine the features from the hallucination network with the features produced by our generator. Thirdly, we introduce an identity preserving method that is trained in an unsupervised way, by using an auxiliary feature extractor and a contrastive loss between the real and generated images. We compare our method against the state-of-the-art, reporting significant improvements both quantitatively, in terms of FID and IS, and qualitatively.
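The abstract does not spell out the form of the contrastive loss, so the following is only a minimal sketch of one common choice (an InfoNCE-style loss over cosine similarities). The function names, the temperature value and the assumption that features arrive as plain lists from some auxiliary extractor are all hypothetical and not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_identity_loss(real_feats, fake_feats, temperature=0.1):
    """InfoNCE-style loss (an assumed form, not the paper's definition).

    fake_feats[i] is pulled towards real_feats[i] (same identity) and
    pushed away from every other real feature in the batch.
    """
    total = 0.0
    for i, fake in enumerate(fake_feats):
        logits = [cosine(fake, real) / temperature for real in real_feats]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(fake_feats)


if __name__ == "__main__":
    real = [[1.0, 0.0], [0.0, 1.0]]
    matched = [[0.9, 0.1], [0.1, 0.9]]   # identities preserved
    swapped = [[0.1, 0.9], [0.9, 0.1]]   # identities confused
    print(contrastive_identity_loss(real, matched))
    print(contrastive_identity_loss(real, swapped))
```

In this toy batch the loss is small when each generated feature is closest to its own source identity and large when identities are swapped, which is the behaviour an identity-preserving term needs.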

Similar papers

Multi-Domain Image-To-Image Translation with Adaptive Inference Graph

The Phuc Nguyen, Stéphane Lathuiliere, Elisa Ricci

Auto-TLDR; Adaptive Graph Structure for Multi-Domain Image-to-Image Translation

In this work, we address the problem of multi-domain image-to-image translation with particular attention paid to computational cost. Current state-of-the-art models require a large and deep model in order to handle the visual diversity of multiple domains. In a context of limited computational resources, increasing the network size may not be possible. Therefore, we propose to increase the network capacity by using an adaptive graph structure. At inference time, the network estimates its own graph by selecting specific sub-networks. Sub-network selection is implemented using Gumbel-Softmax in order to allow end-to-end training. This approach leads to an adjustable increase in the number of parameters while preserving an almost constant computational cost. Our evaluation on two publicly available datasets of facial and painting images shows that our adaptive strategy generates better images with fewer artifacts than methods from the literature.
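As an illustration of how Gumbel-Softmax can turn a discrete sub-network choice into a soft, differentiable weighting, here is a minimal standard-library sketch. It shows only the sampling step; the function names, the three-way choice and the temperature are assumptions for illustration, not details of the paper's architecture.

```python
import math
import random


def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Return a relaxed one-hot sample over len(logits) options."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise: -log(-log(U)) with U ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    # Softmax over the perturbed, temperature-scaled logits.
    m = max(noisy)
    exps = [math.exp(x - m) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def select_subnetwork(logits, temperature=1.0, rng=random):
    """Pick the index with the largest relaxed weight (hard selection)."""
    weights = gumbel_softmax(logits, temperature, rng)
    index = max(range(len(weights)), key=weights.__getitem__)
    return index, weights


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical scores for three candidate sub-networks.
    index, weights = select_subnetwork([2.0, 0.5, -1.0], 0.5, rng)
    print(index, weights)
```

Lower temperatures push the weights towards a one-hot vector. In a deep learning framework the same weights would multiply the sub-network outputs so that gradients reach the selection logits.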

Learning Disentangled Representations for Identity Preserving Surveillance Face Camouflage

Jingzhi Li, Lutong Han, Hua Zhang, Xiaoguang Han, Jingguo Ge, Xiaochu Cao

Auto-TLDR; Individual Face Privacy under Surveillance Scenario with Multi-task Loss Function

In this paper, we focus on protecting face privacy in surveillance scenarios, where the goal is to change the visual appearance of faces while keeping them recognizable by current face recognition systems. This is a challenging problem because we must retain the most important structures of the captured facial images while altering the salient facial regions to protect personal privacy. To address this problem, we introduce a novel individual face protection model, which can camouflage the face appearance from the perspective of human visual perception and preserve the identity features of faces used for face authentication. To that end, we develop an encoder-decoder network architecture that disentangles the person feature representation into an appearance code and an identity code. Specifically, we first randomly divide the face images into two groups, the source set and the target set, where the source set is used to extract the identity code and the target set provides the appearance code. Then, we recombine the identity and appearance codes to synthesize a new face, which has the same identity as the source subject. Finally, the synthesized faces are used to replace the original faces to protect the privacy of individuals. Furthermore, our model is trained end-to-end with a multi-task loss function, which can better preserve the identity and stabilize the training loss. Experiments conducted on the Cross-Age Celebrity dataset demonstrate the effectiveness of our model and validate its superiority in terms of visual quality and scalability.

GarmentGAN: Photo-Realistic Adversarial Fashion Transfer

Amir Hossein Raffiee, Michael Sollami

Auto-TLDR; GarmentGAN: A Generative Adversarial Network for Image-Based Garment Transfer

The garment transfer problem comprises two tasks: learning to separate a person's body (pose, shape, color) from their clothing (garment type, shape, style) and then generating new images of the wearer dressed in arbitrary garments. We present GarmentGAN, a new algorithm that performs image-based garment transfer through generative adversarial methods. The GarmentGAN framework allows users to virtually try on items before purchase and generalizes to various apparel types. GarmentGAN requires as input only two images, namely, a picture of the target fashion item and an image containing the customer. The output is a synthetic image wherein the customer is wearing the target apparel. In order to make the generated image look photo-realistic, we employ novel generative adversarial techniques. GarmentGAN improves on existing methods in the realism of generated imagery and solves various problems related to self-occlusions. Our proposed model incorporates additional information during training, utilizing both segmentation maps and body key-point information. We show qualitative and quantitative comparisons to several other networks to demonstrate the effectiveness of this technique.

Attributes Aware Face Generation with Generative Adversarial Networks

Zheng Yuan, Jie Zhang, Shiguang Shan, Xilin Chen

Auto-TLDR; AFGAN: A Generative Adversarial Network for Attributes Aware Face Image Generation

Recent studies have shown remarkable success in face image generation. However, most of the existing methods only generate face images from random noise, and cannot generate face images according to specific attributes. In this paper, we focus on the problem of face synthesis from attributes, which aims at generating faces with specific characteristics corresponding to the given attributes. To this end, we propose a novel attributes aware face image generation method with generative adversarial networks called AFGAN. Specifically, we firstly propose a two-path embedding layer and a self-attention mechanism to convert a binary attribute vector into rich attribute features. Then three stacked generators generate 64 × 64, 128 × 128 and 256 × 256 resolution face images respectively by taking the attribute features as input. In addition, an image-attribute matching loss is proposed to enhance the correlation between the generated images and input attributes. Extensive experiments on CelebA demonstrate the superiority of our AFGAN in terms of both qualitative and quantitative evaluations.

High Resolution Face Age Editing

Xu Yao, Gilles Puy, Alasdair Newson, Yann Gousseau, Pierre Hellier

Auto-TLDR; An Encoder-Decoder Architecture for Face Age editing on High Resolution Images

Face age editing has become a crucial task in film post-production, and is also becoming popular for general purpose photography. Recently, adversarial training has produced some of the most visually impressive results for image manipulation, including the face aging/de-aging task. In spite of considerable progress, current methods often present visual artifacts and can only deal with low-resolution images. In order to achieve aging/de-aging with the high quality and robustness necessary for wider use, these problems need to be addressed. This is the goal of the present work. We present an encoder-decoder architecture for face age editing. The core idea of our network is to encode a face image to age-invariant features, and learn a modulation vector corresponding to a target age. We then combine these two elements to produce a realistic image of the person with the desired target age. Our architecture is greatly simplified with respect to other approaches, and allows for fine-grained age editing on high resolution images in a single unified model. Source code is available at https://github.com/InterDigitalInc/HRFAE.

Contrastive Data Learning for Facial Pose and Illumination Normalization

Gee-Sern Hsu, Chia-Hao Tang

Auto-TLDR; Pose and Illumination Normalization with Contrast Data Learning for Face Recognition

Face normalization can be a crucial step when handling generic face recognition. We propose the Pose and Illumination Normalization (PIN) framework with contrast data learning for face normalization. The PIN framework is designed to learn the transformation from a source set to a target set. The source set and the target set compose a contrastive data set for learning. The source set contains faces collected in the wild and thus covers a wide range of variation across illumination, pose, expression and other variables. The target set contains face images taken under controlled conditions, and all faces are in frontal pose and balanced in illumination. The PIN framework is composed of an encoder, a decoder and two discriminators. The encoder is made of a state-of-the-art face recognition network and acts as a facial feature extractor, which is not updated during training. The decoder is trained on both the source and target sets, and aims to learn the transformation from the source set to the target set; it can therefore transform an arbitrary face into an illumination- and pose-normalized face. The discriminators are trained to ensure the photo-realistic quality of the normalized face images generated by the decoder. The loss functions employed in the decoder and discriminators are designed and weighted to yield better normalization outcomes and recognition performance. We verify the performance of the proposed framework on several benchmark databases, and compare it with state-of-the-art approaches.

Unsupervised Learning of Landmarks Based on Inter-Intra Subject Consistencies

Weijian Li, Haofu Liao, Shun Miao, Le Lu, Jiebo Luo

Auto-TLDR; Unsupervised Learning for Facial Landmark Discovery using Inter-subject Landmark consistencies

We present a novel unsupervised learning approach to image landmark discovery by incorporating the inter-subject landmark consistencies on facial images. This is achieved via an inter-subject mapping module that transforms original subject landmarks based on an auxiliary subject-related structure. To recover from the transformed images back to the original subject, the landmark detector is forced to learn spatial locations that contain the consistent semantic meanings both for the paired intra-subject images and between the paired inter-subject images. Our proposed method is extensively evaluated on two public facial image datasets (MAFL, AFLW) with various settings. Experimental results indicate that our method can extract the consistent landmarks for both datasets and achieve better performances compared to the previous state-of-the-art methods quantitatively and qualitatively.

Coherence and Identity Learning for Arbitrary-Length Face Video Generation

Shuquan Ye, Chu Han, Jiaying Lin, Guoqiang Han, Shengfeng He

Auto-TLDR; Face Video Synthesis Using Identity-Aware GAN and Face Coherence Network

Face synthesis is an interesting yet challenging task in computer vision. It is much harder to generate a portrait video than a single image. In this paper, we propose a novel video generation framework for synthesizing arbitrary-length face videos without any face exemplar or landmark. To overcome the synthesis ambiguity of face video, we propose a divide-and-conquer strategy that addresses the video face synthesis problem from two aspects, face identity synthesis and rearrangement. To this end, we design a cascaded network which contains three components: Identity-aware GAN (IA-GAN), Face Coherence Network, and Interpolation Network. IA-GAN is proposed to synthesize photorealistic faces with the same identity from a set of noise inputs. Face Coherence Network is designed to re-arrange the faces generated by IA-GAN while keeping the inter-frame coherence. Interpolation Network is introduced to eliminate the discontinuity between two adjacent frames and improve the smoothness of the face video. Experimental results demonstrate that our proposed network is able to generate face video with high visual quality while preserving the identity. Statistics show that our method outperforms state-of-the-art unconditional face video generative models on multiple challenging datasets.

SATGAN: Augmenting Age Biased Dataset for Cross-Age Face Recognition

Wenshuang Liu, Wenting Chen, Yuanlue Zhu, Linlin Shen

Auto-TLDR; SATGAN: Stable Age Translation GAN for Cross-Age Face Recognition

In this paper, we propose a Stable Age Translation GAN (SATGAN) to generate fake face images at different ages to augment age biased face datasets for Cross-Age Face Recognition (CAFR). The proposed SATGAN consists of a generator and a discriminator. As a part of the generator, a novel Mask Attention Module (MAM) is introduced to make the generator focus on the face area. In addition, the generator employs a Uniform Distribution Discriminator (UDD) to supervise the learning of the latent feature map and enforce the uniform distribution. Besides, the discriminator employs a Feature Separation Module (FSM) to disentangle identity information from the age information. The quantitative and qualitative evaluations on the Morph dataset show that SATGAN achieves much better performance than existing methods. The face recognition model trained on datasets (VGGFace2 and MS-Celeb-1M) augmented with our SATGAN achieves better accuracy on cross-age datasets such as Cross-Age LFW and AgeDB-30.

Makeup Style Transfer on Low-Quality Images with Weighted Multi-Scale Attention

Daniel Organisciak, Edmond S. L. Ho, Shum Hubert P. H.

Auto-TLDR; Facial Makeup Style Transfer for Low-Resolution Images Using Multi-Scale Spatial Attention

Facial makeup style transfer is an extremely challenging sub-field of image-to-image translation. Due to this difficulty, state-of-the-art results are mostly reliant on the Face Parsing Algorithm, which segments a face into parts in order to easily extract makeup features. However, we find that this algorithm can only work well on high-definition images where facial features can be accurately extracted. Faces in many real-world photos, such as those including a large background or multiple people, are typically of low resolution, which considerably hinders state-of-the-art algorithms. In this paper, we propose an end-to-end holistic approach to effectively transfer makeup styles between two low-resolution images. The idea is built upon a novel weighted multi-scale spatial attention module, which identifies salient pixel regions on low-resolution images in multiple scales, and uses channel attention to determine the most effective attention map. This design provides two benefits: low-resolution images are usually blurry to different extents, so a multi-scale architecture can select the most effective convolution kernel size to implement spatial attention; makeup is applied on both a macro-level (foundation, fake tan) and a micro-level (eyeliner, lipstick), so different scales can excel in extracting different makeup features. We develop an Augmented CycleGAN network that embeds our attention modules at selected layers to most effectively transfer makeup. We test our system with the FBD data set, which consists of many low-resolution facial images, and demonstrate that it outperforms state-of-the-art methods, particularly in transferring makeup for blurry images and partially occluded images.

VITON-GT: An Image-Based Virtual Try-On Model with Geometric Transformations

Matteo Fincato, Federico Landi, Marcella Cornia, Fabio Cesari, Rita Cucchiara

Auto-TLDR; VITON-GT: An Image-based Virtual Try-on Architecture for Fashion Catalogs

The large spread of online shopping has led computer vision researchers to develop different solutions for the fashion domain to potentially increase the online user experience and improve the efficiency of preparing fashion catalogs. Among them, image-based virtual try-on has recently attracted a lot of attention, resulting in several architectures that can generate a new image of a person wearing an input try-on garment in a plausible and realistic way. In this paper, we present VITON-GT, a new model for virtual try-on that generates high-quality and photo-realistic images thanks to multiple geometric transformations. In particular, our model is composed of a two-stage geometric transformation module that performs two different projections on the input garment, and a transformation-guided try-on module that synthesizes the new image. We experimentally validate the proposed solution on the most common dataset for this task, containing mainly t-shirts, and we demonstrate its effectiveness compared to different baselines and previous methods. Additionally, we assess the generalization capabilities of our model on a new set of fashion items composed of upper-body clothes from different categories. To the best of our knowledge, we are the first to test virtual try-on architectures in this challenging experimental setting.

Local Facial Attribute Transfer through Inpainting

Ricard Durall, Franz-Josef Pfreundt, Janis Keuper

Auto-TLDR; Attribute Transfer Inpainting Generative Adversarial Network

The term attribute transfer refers to the task of altering images in such a way that the semantic interpretation of a given input image is shifted towards an intended direction, which is quantified by semantic attributes. Prominent example applications are photo-realistic changes of facial features and expressions, like changing the hair color, adding a smile or enlarging the nose, or altering the entire context of a scene, like transforming a summer landscape into a winter panorama. Recent advances in attribute transfer are mostly based on generative deep neural networks, using various techniques to manipulate images in the latent space of the generator. In this paper, we present a novel method for the common sub-task of local attribute transfers, where only parts of a face have to be altered in order to achieve semantic changes (e.g. removing a mustache). In contrast to previous methods, where such local changes have been implemented by generating new (global) images, we propose to formulate local attribute transfers as an inpainting problem. By removing and regenerating only parts of images, our Attribute Transfer Inpainting Generative Adversarial Network (ATI-GAN) is able to utilize local context information to focus on the attributes while keeping the background unmodified, resulting in visually sound results.

Mask-Based Style-Controlled Image Synthesis Using a Mask Style Encoder

Jaehyeong Cho, Wataru Shimoda, Keiji Yanai

Auto-TLDR; Style-controlled Image Synthesis from Semantic Segmentation masks using GANs

In recent years, advances in Generative Adversarial Networks (GANs) have shown impressive results for image generation and translation tasks. In particular, image-to-image translation is a method of learning a mapping from a source domain to a target domain and synthesizing an image. Image-to-image translation can be applied to a variety of tasks, making it possible to quickly and easily synthesize realistic images from semantic segmentation masks. However, existing image-to-image translation methods offer limited control over the style of the translated image, and it is not easy to synthesize an image by controlling the style of each mask element in detail. Therefore, we propose an image synthesis method that controls the style of each element by improving the existing image-to-image translation method. In the proposed method, we implement a style encoder that extracts style features for each mask element. The extracted style features are concatenated to the semantic mask in the normalization layer, and used for the style-controlled image synthesis of each mask element. In experiments, we train style-controlled image synthesis using datasets consisting of semantic segmentation masks and real images. The results show that the proposed method has excellent performance for style-controlled image synthesis for each element.

Continuous Learning of Face Attribute Synthesis

Ning Xin, Shaohui Xu, Fangzhe Nan, Xiaoli Dong, Weijun Li, Yuanzhou Yao

Auto-TLDR; Continuous Learning for Face Attribute Synthesis

The generative adversarial network (GAN) exhibits great superiority in the face attribute synthesis task. However, existing methods have very limited effects on the expansion of new attributes. To overcome the limitations of a single network in new attribute synthesis, a continuous learning method for face attribute synthesis is proposed in this work. First, the feature vector of the input image is extracted and attribute direction regression is performed in the feature space to obtain the axes of different attributes. The feature vector is then linearly guided along the axis so that images with target attributes can be synthesized by the decoder. Finally, to make the network capable of continuous learning, the orthogonal direction modification module is used to extend the newly-added attributes. Experimental results show that the proposed method can endow a single network with the ability to learn attributes continuously, and, as compared to those produced by the current state-of-the-art methods, the synthetic attributes have higher accuracy.
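The two geometric ideas in this abstract, moving a feature vector along an attribute axis and keeping a newly added axis orthogonal to existing ones, can be sketched in a few lines. The paper's regression step and its orthogonal direction modification module are not described here, so plain Gram-Schmidt is used as a stand-in. The function names and example vectors are hypothetical, and existing axes are assumed to be orthonormal.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def orthogonalize(new_dir, existing_axes):
    """Gram-Schmidt step: strip components along existing unit axes."""
    residual = list(new_dir)
    for axis in existing_axes:
        proj = dot(residual, axis)
        residual = [r - proj * a for r, a in zip(residual, axis)]
    return normalize(residual)


def edit_feature(feature, axis, strength):
    """Move a feature vector linearly along one attribute axis."""
    return [f + strength * a for f, a in zip(feature, axis)]


if __name__ == "__main__":
    smile_axis = normalize([1.0, 0.0, 0.0])            # already learned
    glasses_axis = orthogonalize([1.0, 1.0, 0.0], [smile_axis])
    feature = [0.2, 0.3, 0.5]
    edited = edit_feature(feature, glasses_axis, 2.0)
    # The edit leaves the component along the smile axis untouched.
    print(dot(feature, smile_axis), dot(edited, smile_axis))
```

Because the new axis is orthogonal to the earlier ones, editing along it does not change the feature's projection onto previously learned attributes, which is one plausible reading of why orthogonality helps when attributes are added one at a time.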

Identity-Preserved Face Beauty Transformation with Conditional Generative Adversarial Networks

Zhitong Huang, Ching Y Suen

Auto-TLDR; Identity-preserved face beauty transformation using conditional GANs

Identity-preserved face beauty transformation aims to change the beauty scale of a face image while preserving the identity of the original face. In our framework of conditional Generative Adversarial Networks (cGANs), the synthesized face produced by the generator has the beauty scale indicated by the input condition. Unlike the discrete class labels used in most cGANs, the condition of target beauty scale in our framework is given by a continuous real-valued beauty score in the range [1, 5], which makes the work challenging. To tackle the problem, we have implemented a triple structure, in which the conditional discriminator is divided into a normal discriminator and a separate face beauty predictor. We have also developed another new structure called Conditioned Instance Normalization to replace the original concatenation used in cGANs, which makes the combination of the input image and condition more effective. Furthermore, Self-Consistency Loss is introduced as a new parameter to improve the stability of training and the quality of the generated image. In the end, the objectives of beauty transformation and identity preservation are evaluated by the pretrained face beauty predictor and a state-of-the-art face recognition network. The result is encouraging, and it also shows that certain facial features could be synthesized by the generator according to the target beauty scale, while preserving the original identity.

Semantic-Guided Inpainting Network for Complex Urban Scenes Manipulation

Pierfrancesco Ardino, Yahui Liu, Elisa Ricci, Bruno Lepri, Marco De Nadai

Auto-TLDR; Semantic-Guided Inpainting of Complex Urban Scene Using Semantic Segmentation and Generation

Manipulating images of complex scenes to reconstruct, insert and/or remove specific object instances is a challenging task. Complex scenes contain multiple semantics and objects, which are frequently cluttered or ambiguous, thus hampering the performance of inpainting models. Conventional techniques often rely on structural information such as object contours in multi-stage approaches that generate unreliable results and boundaries. In this work, we propose a novel deep learning model to alter a complex urban scene by removing a user-specified portion of the image and coherently inserting a new object (e.g. car or pedestrian) in that scene. Inspired by recent works on image inpainting, our proposed method leverages semantic segmentation to model the content and structure of the image, and to learn the best shape and location of the object to insert. To generate reliable results, we design a new decoder block that combines the semantic segmentation and generation tasks to better guide the generation of new objects and scenes, which have to be semantically consistent with the image. Our experiments, conducted on two large-scale datasets of urban scenes (Cityscapes and Indian Driving), show that our proposed approach successfully addresses the problem of semantically-guided inpainting of complex urban scenes.

Controllable Face Aging

Haien Zeng, Hanjiang Lai

Auto-TLDR; A controllable face aging method via attribute disentanglement generative adversarial network

Motivated by the following two observations: 1) people age differently under different conditions for changeable facial attributes, e.g., skin color may become darker when working outside, and 2) some facial attributes, e.g., race and gender, must remain unchanged during the aging process, we propose a controllable face aging method via an attribute disentanglement generative adversarial network. To offer fine control over the synthesized face images, first, an individual embedding of the face is directly learned from an image that contains the desired facial attribute. Second, since the image may contain other unwanted attributes, an attribute disentanglement network is used to separate the individual embedding and learn the common embedding that contains information about the face attribute (e.g., race). With the common embedding, we can manipulate the generated face image with the desired attribute in an explicit manner. Experimental results on two common benchmarks demonstrate that our proposed generator achieves comparable performance on the aging effect with state-of-the-art baselines while gaining more flexibility for attribute control. Code is available in the supplementary material.

Multi-Laplacian GAN with Edge Enhancement for Face Super Resolution

Shanlei Ko, Bi-Ru Dai

Auto-TLDR; Face Image Super-Resolution with Enhanced Edge Information

Face image super-resolution has become a research hotspot in the field of image processing. Nowadays, more and more studies add additional information, such as landmarks and identity, to reconstruct high-resolution images from low-resolution ones, and perform well in quantitative terms and perceptual quality. However, this additional information is hard to obtain in many cases. In this work, we focus on reconstructing face images by extracting useful information from face images directly rather than using additional information. By observing edge information in each scale of face images, we propose a method to reconstruct high-resolution face images with enhanced edge information. In addition, with the proposed training procedure, our method reconstructs photo-realistic images at an upscaling factor of 8x and outperforms state-of-the-art methods both in quantitative terms and perceptual quality.

Cascade Attention Guided Residue Learning GAN for Cross-Modal Translation

Bin Duan, Wei Wang, Hao Tang, Hugo Latapie, Yan Yan

Auto-TLDR; Cascade Attention-Guided Residue GAN for Cross-modal Audio-Visual Learning

Since infancy, we intuitively develop the ability to correlate the input from different cognitive sensors such as vision, audio, and text. However, in machine learning, this cross-modal learning is a nontrivial task because different modalities have no homogeneous properties. Previous works have found that there should be bridges among different modalities. From the perspective of neurology and psychology, humans have the capacity to link one modality with another, e.g., associating a picture of a bird with merely hearing its song and vice versa. Is it possible for machine learning algorithms to recover the scene given the audio signal? In this paper, we propose a novel Cascade Attention-Guided Residue GAN (CAR-GAN), aiming at reconstructing the scenes given the corresponding audio signals. In particular, we present a residue module to mitigate the gap between different modalities progressively. Moreover, a cascade attention guided network with a novel classification loss function is designed to tackle the cross-modal learning task. Our model keeps consistency in the high-level semantic label domain and is able to balance two different modalities. The experimental results demonstrate that our model achieves state-of-the-art cross-modal audio-visual generation on the challenging Sub-URMP dataset.

Pixel-based Facial Expression Synthesis

Arbish Akram, Nazar Khan

Auto-TLDR; pixel-based facial expression synthesis using GANs

Recently, facial expression synthesis has shown remarkable advances with the advent of Generative Adversarial Networks (GANs). However, these GAN-based approaches mostly generate photo-realistic results only as long as the target data distribution is close to the training data distribution. The quality of GAN results degrades significantly when test images are from a slightly different distribution. In this work, we propose a pixel-based facial expression synthesis method. Recent work has shown that facial expression synthesis changes only local regions of faces. In the proposed method, each output pixel observes only one input pixel. The proposed method achieves generalization capability by leveraging only a few hundred images. Experimental results demonstrate that the proposed method performs comparably with recent GANs on in-dataset images and significantly outperforms them on in-the-wild images. In addition, the proposed method is faster and achieves significantly better performance with two orders of magnitude lower computational and storage cost compared to state-of-the-art GAN-based methods.
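To make the phrase "each output pixel observes only one input pixel" concrete, the sketch below fits an independent affine map per pixel by ordinary least squares. This particular model and the names `fit_pixelwise` and `apply_pixelwise` are assumptions chosen for illustration; the abstract does not state the form of the mapping the authors use.

```python
def fit_pixelwise(inputs, targets):
    """Fit out[r][c] = w[r][c] * in[r][c] + b[r][c] per pixel.

    inputs and targets are lists of equally sized grayscale images,
    each image being a list of rows of floats.
    """
    n = len(inputs)
    height, width = len(inputs[0]), len(inputs[0][0])
    weights = [[0.0] * width for _ in range(height)]
    biases = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            xs = [img[r][c] for img in inputs]
            ys = [img[r][c] for img in targets]
            mean_x = sum(xs) / n
            mean_y = sum(ys) / n
            var = sum((x - mean_x) ** 2 for x in xs)
            cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
            # Constant input pixel: fall back to predicting the mean.
            w = cov / var if var > 1e-12 else 0.0
            weights[r][c] = w
            biases[r][c] = mean_y - w * mean_x
    return weights, biases


def apply_pixelwise(image, weights, biases):
    """Each output pixel depends only on the same-position input pixel."""
    return [
        [weights[r][c] * value + biases[r][c] for c, value in enumerate(row)]
        for r, row in enumerate(image)
    ]


if __name__ == "__main__":
    neutral = [[[0.0, 0.2]], [[0.5, 0.4]], [[1.0, 0.9]]]
    smiling = [[[0.1, 0.2]], [[1.1, 0.4]], [[2.1, 0.9]]]
    w, b = fit_pixelwise(neutral, smiling)
    print(apply_pixelwise([[0.25, 0.6]], w, b))
```

A model of this shape has only two parameters per pixel, which is consistent with the abstract's claims about training on a few hundred images and low storage cost, although the authors' actual parameterization may differ.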

AVAE: Adversarial Variational Auto Encoder

Antoine Plumerault, Hervé Le Borgne, Celine Hudelot

Auto-TLDR; Combining VAE and GAN for Realistic Image Generation

Among the wide variety of image generative models, two models stand out: Variational Auto Encoders (VAE) and Generative Adversarial Networks (GAN). GANs can produce realistic images, but they suffer from mode collapse and do not provide simple ways to get the latent representation of an image. On the other hand, VAEs do not have these problems, but they often generate images less realistic than those of GANs. In this article, we explain that this lack of realism is partially due to a common underestimation of the natural image manifold dimensionality. To solve this issue, we introduce a new framework that combines VAE and GAN in a novel and complementary way to produce an auto-encoding model that keeps the properties of VAEs while generating images of GAN quality. We evaluate our approach both qualitatively and quantitatively on five image datasets.

Unsupervised Disentangling of Viewpoint and Residues Variations by Substituting Representations for Robust Face Recognition

Minsu Kim, Joanna Hong, Junho Kim, Hong Joo Lee, Yong Man Ro

Auto-TLDR; Unsupervised Disentangling of Identity, viewpoint, and Residue Representations for Robust Face Recognition

It is well known that identity-unrelated variations (e.g., viewpoint or illumination) degrade the performance of face recognition methods. To handle this challenge, robust methods for disentangling the identity and view representations have drawn attention in the machine learning area. However, existing methods learn discriminative features that require manual supervision of such factors of variation. In this paper, we propose a novel disentangling framework that models three representations, identity, viewpoint, and residues (i.e., identity- and pose-unrelated factors), and does not require supervision of the variations. By jointly modeling the three representations, we enhance the disentanglement of each representation and achieve robust face recognition performance. Further, the learned viewpoint representation can be utilized for pose estimation or editing of a posed facial image. Extensive quantitative and qualitative evaluations verify the effectiveness of our proposed method, which disentangles identity, viewpoint, and residues of facial images.

Stylized-Colorization for Line Arts

Tzu-Ting Fang, Minh Duc Vo, Akihiro Sugimoto, Shang-Hong Lai

Auto-TLDR; Stylized-colorization using GAN-based End-to-End Model for Anime

We address a novel problem of stylized-colorization which colorizes a given line art using a given coloring style in text. This problem can be stated as multi-domain image translation and is more challenging than the current colorization problem because it requires not only capturing the illustration distribution but also satisfying the required coloring styles specific to anime such as lightness, shading, or saturation. We propose a GAN-based end-to-end model for stylized-colorization where the model has one generator and two discriminators. Our generator is based on the U-Net architecture and receives a pair of a line art and a coloring style in text as its input to produce a stylized-colorization image of the line art. Two discriminators, on the other hand, share weights at early layers to judge the stylized-colorization image in two different aspects: one for color and one for style. One generator and two discriminators are jointly trained in an adversarial and end-to-end manner. Extensive experiments demonstrate the effectiveness of our proposed model.

Unsupervised Contrastive Photo-To-Caricature Translation Based on Auto-Distortion

Yuhe Ding, Xin Ma, Mandi Luo, Aihua Zheng, Ran He

Auto-TLDR; Unsupervised contrastive photo-to-caricature translation with style loss

Photo-to-caricature translation aims to synthesize a caricature as a rendered image that exaggerates facial features through sketching, pencil strokes, or other artistic drawing. Style rendering and geometry deformation are the most important aspects of the photo-to-caricature translation task. To take both into consideration, we propose an unsupervised contrastive photo-to-caricature translation architecture. Considering the intuitive artifacts in existing methods, we propose a contrastive style loss for style rendering that enforces similarity between the style of the rendered photo and the caricature while increasing its discrepancy from the photos. To obtain an exaggerated deformation in an unpaired/unsupervised fashion, we propose a Distortion Prediction Module (DPM) that predicts a set of displacement vectors for each input image while fixing some control points, followed by thin plate spline interpolation for warping. The model is trained on unpaired photos and caricatures and supports bidirectional synthesis, taking either a photo or a caricature as input. Extensive experiments demonstrate that the proposed model generates hand-drawn-like caricatures more effectively than existing competitors.

Boundary Guided Image Translation for Pose Estimation from Ultra-Low Resolution Thermal Sensor

Kohei Kurihara, Tianren Wang, Teng Zhang, Brian Carrington Lovell

Auto-TLDR; Pose Estimation on Low-Resolution Thermal Images Using Image-to-Image Translation Architecture

This work addresses the pose estimation task on low-resolution images captured using thermal sensors which can operate in a no-light environment. Low-resolution thermal sensors have been widely adopted in various applications for cost control and privacy protection purposes. In this paper, targeting the challenging scenario of ultra-low resolution thermal imaging (32×32 pixels), we aim to estimate human poses for the purpose of monitoring health conditions and indoor events. To overcome the challenges in ultra-low resolution thermal imaging such as blurred boundaries and data scarcity, we propose a new Image-to-Image (I2I) translation architecture which can translate the original blurred thermal image into a visible light image with sharper boundaries. Then the generated visible light image can be fed into the off-the-shelf pose estimator which was well-trained in the visible domain. Experimental results suggest that the proposed framework outperforms other state-of-the-art methods in the I2I based pose estimation task for our thermal image dataset. Furthermore, we also demonstrated the merits of the proposed method on the publicly available FLIR dataset by measuring the quality of translated images.

Facial Expression Recognition by Using a Disentangled Identity-Invariant Expression Representation

Kamran Ali, Charles Hughes

Auto-TLDR; Transfer-based Expression Recognition Generative Adversarial Network (TER-GAN)

Facial Expression Recognition (FER) is a challenging task because many factors of variation such as pose, illumination, and identity-specific attributes are entangled with the expression information in an expressive face image. Recent works show that the performance of a FER algorithm can be improved by disentangling the expression information from identity features. In this paper, we present Transfer-based Expression Recognition Generative Adversarial Network (TER-GAN) that combines the effectiveness of a novel feature disentanglement technique with the concept of identity-invariant expression representation learning for facial expression recognition. More specifically, TER-GAN learns a disentangled expression representation by extracting expression features from one image and transferring the expression information to the identity of another image. To improve the feature disentanglement process, and to learn an identity-invariant expression representation, we introduce a novel expression consistency loss and an identity consistency loss that exploit expression and identity information from both real and synthetic images. We evaluated the performance of our proposed facial expression recognition technique by employing five public facial expression databases, CK+, Oulu-CASIA, MMI, BU-3DFE, and BU-4DFE, the latter being used for pre-training. The experimental results show the effectiveness of the proposed technique.

Exemplar Guided Cross-Spectral Face Hallucination Via Mutual Information Disentanglement

Haoxue Wu, Huaibo Huang, Aijing Yu, Jie Cao, Zhen Lei, Ran He

Auto-TLDR; Exemplar Guided Cross-Spectral Face Hallucination with Structural Representation Learning

Recently, many near infrared-visible (NIR-VIS) heterogeneous face recognition (HFR) methods have been proposed in the community. However, the problem remains challenging because of the sensing gap along with large pose variations. In this paper, we propose an Exemplar Guided Cross-Spectral Face Hallucination (EGCH) to reduce the domain discrepancy through disentangled representation learning. For each modality, EGCH contains a spectral encoder as well as a structure encoder to disentangle spectral and structure representation, respectively. It also contains a traditional generator that reconstructs the input from the above two representations, and a structure generator that predicts the facial parsing map from the structure representation. Besides, mutual information minimization and maximization are conducted to boost disentanglement and make the representations adequately expressive. The translation is then built on structure representations between the two modalities. Provided with the transformed NIR structure representation and the original VIS spectral representation, EGCH is capable of producing high-fidelity VIS images that preserve the topological structure of the input NIR image while transferring the spectral information of an arbitrary VIS exemplar. Extensive experiments demonstrate that the proposed method achieves more promising results both qualitatively and quantitatively than the state-of-the-art NIR-VIS methods.

Learning Low-Shot Generative Networks for Cross-Domain Data

Hsuan-Kai Kao, Cheng-Che Lee, Wei-Chen Chiu

Auto-TLDR; Learning Generators for Cross-Domain Data under Low-Shot Learning

We tackle a novel problem of learning generators for cross-domain data under a specific scenario of low-shot learning. Given a source domain with a sufficient amount of training data, we aim to transfer the knowledge of its generative process to another target domain, which not only has few data samples but also exhibits a domain shift with respect to the source domain. This problem has great potential in practical use and is different from the well-known image translation task, as the target-domain data can be generated without requiring any source-domain data and the large data consumption for learning a target-domain generator can be alleviated. Built upon a cross-domain dataset where (1) each of the low shots in the target domain has its correspondence in the source and (2) the two domains share similar content information but different appearance, two approaches are proposed: a Latent-Disentanglement-Orientated model (LaDo) and a Generative-Hierarchy-Oriented (GenHo) model. Our LaDo and GenHo approaches address the problem from different perspectives: the former relies on learning a disentangled representation composed of domain-invariant content features and domain-specific appearance features, while the latter decomposes the generative process of a generator into two parts for synthesizing the content and appearance sequentially. We perform extensive experiments under various settings of cross-domain data and show the efficacy of our models for generating target-domain data with content variance as abundant as in the source domain, which leads to favourable performance in comparison to several baselines.

Age Gap Reducer-GAN for Recognizing Age-Separated Faces

Daksha Yadav, Naman Kohli, Mayank Vatsa, Richa Singh, Afzel Noore

Auto-TLDR; Generative Adversarial Network for Age-separated Face Recognition

In this paper, we propose a novel algorithm for matching faces with temporal variations caused by age progression. The proposed generative adversarial network algorithm is a unified framework which combines facial age estimation and age-separated face verification. The key idea of this approach is to learn the age variations across time by conditioning the input image on the subject's gender and the target age group to which the face needs to be progressed. The loss function accounts for reducing the age gap between the original image and the generated face image as well as preserving the identity. Both visual fidelity and quantitative evaluations demonstrate the efficacy of the proposed architecture on different facial age databases for age-separated face recognition.

Disentangled Representation Learning for Controllable Image Synthesis: An Information-Theoretic Perspective

Shichang Tang, Xu Zhou, Xuming He, Yi Ma

Auto-TLDR; Controllable Image Synthesis in Deep Generative Models using Variational Auto-Encoder

In this paper, we look into the problem of disentangled representation learning and controllable image synthesis in a deep generative model. We develop an encoder-decoder architecture for a variant of the Variational Auto-Encoder (VAE) with two latent codes $z_1$ and $z_2$. Our framework uses $z_2$ to capture specified factors of variation while $z_1$ captures the complementary factors of variation. To this end, we analyze the learning problem from the perspective of multivariate mutual information, derive optimizable lower bounds of the conditional mutual information in the image synthesis processes and incorporate them into the training objective. We validate our method empirically on the Color MNIST dataset and the CelebA dataset by showing controllable image syntheses. Our proposed paradigm is simple yet effective and is applicable to many situations, including those where there is not an explicit factorization of features available, or where the features are non-categorical.

Group-Wise Feature Orthogonalization and Suppression for GAN Based Facial Attribute Translation

Zhiwei Wen, Haoqian Wu, Weicheng Xie, Linlin Shen

Auto-TLDR; Semantic Disentanglement of Generative Adversarial Network

Generative Adversarial Networks (GANs) have been widely used for object attribute editing. However, the semantic correlation resulting from feature map interaction in the generative network of a GAN may impair the generalization ability of the generative network. In this work, semantic disentanglement is introduced in GAN to reduce the attribute correlation. The feature maps of the generative network are first grouped with an efficient clustering algorithm based on hash encoding; the groups are used to excavate hidden semantic attributes and to calculate a group-wise orthogonality loss for the reduction of attribute entanglement. Meanwhile, the feature maps falling in the intersection regions of different groups are further suppressed to reduce the attribute-wise interaction. Extensive experiments reveal that the proposed GAN generates more genuine objects than the state of the art. Quantitative results of classification accuracy, inception and FID scores further justify the effectiveness of the proposed GAN.
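
As a rough illustration of what a group-wise orthogonality penalty might look like, the sketch below averages the squared cosine similarity between the mean feature vectors of each pair of groups. This exact form is an assumption; the abstract does not give the loss definition, and `group_orthogonality_loss` is a hypothetical name.

```python
import math

# Hypothetical sketch of a group-wise orthogonality penalty.
# The precise loss used in the paper is not stated in the abstract.

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def group_orthogonality_loss(groups):
    """Mean squared cosine similarity between group mean vectors.

    groups: list of groups, each a list of flattened feature maps.
    Returns 0.0 when all group means are mutually orthogonal.
    """
    means = [[sum(col) / len(group) for col in zip(*group)] for group in groups]
    total, pairs = 0.0, 0
    for i in range(len(means)):
        for j in range(i + 1, len(means)):
            total += cosine(means[i], means[j]) ** 2
            pairs += 1
    return total / pairs if pairs else 0.0
```

Squaring the cosine penalizes both positive and negative correlation, so the minimum is reached only when group means are orthogonal.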

Face Super-Resolution Network with Incremental Enhancement of Facial Parsing Information

Shuang Liu, Chengyi Xiong, Zhirong Gao

Auto-TLDR; Learning-based Face Super-Resolution with Incremental Boosting Facial Parsing Information

Recently, face super-resolution (SR) methods based on facial priors have obtained significant performance gains in dealing with extremely degraded facial images, and facial priors have also been proved useful in facilitating the inference of face images. Based on this, how to fully fuse facial priors into deep features to improve face SR performance has attracted major attention. In this paper, we propose a learning-based face SR approach with incremental boosting facial parsing information (IFPSR) for high magnification of low-resolution faces. The proposed IFPSR method consists of three main parts: i) a three-stage upsampling network with embedded parsing map features, in which image recovery and prior estimation processes are performed simultaneously and progressively to improve the image resolution; ii) a progressive training method and a joint facial attention and heatmap loss to obtain better facial attributes; iii) a channel attention strategy in residual dense blocks to adaptively learn facial features. Extensive experimental results show that, compared with state-of-the-art methods in terms of quantitative and qualitative metrics, our approach achieves an outstanding balance between SR image quality and low network complexity.

Motion-Supervised Co-Part Segmentation

Aliaksandr Siarohin, Subhankar Roy, Stéphane Lathuiliere, Sergey Tulyakov, Elisa Ricci, Nicu Sebe

Auto-TLDR; Self-supervised Co-Part Segmentation Using Motion Information from Videos

Recent co-part segmentation methods mostly operate in a supervised learning setting, which requires a large amount of annotated data for training. To overcome this limitation, we propose a self-supervised deep learning method for co-part segmentation. Differently from previous works, our approach develops the idea that motion information inferred from videos can be leveraged to discover meaningful object parts. To this end, our method relies on pairs of frames sampled from the same video. The network learns to predict part segments together with a representation of the motion between two frames, which permits reconstruction of the target image. Through extensive experimental evaluation on publicly available video sequences we demonstrate that our approach can produce improved segmentation maps with respect to previous self-supervised co-part segmentation approaches.

Hybrid Approach for 3D Head Reconstruction: Using Neural Networks and Visual Geometry

Oussema Bouafif, Bogdan Khomutenko, Mohammed Daoudi

Auto-TLDR; Recovering 3D Head Geometry from a Single Image using Deep Learning and Geometric Techniques

Recovering the 3D geometric structure of a face from a single input image is a challenging active research area in computer vision. In this paper, we present a novel method for reconstructing 3D heads from a single or multiple image(s) using a hybrid approach based on deep learning and geometric techniques. We propose an encoder-decoder network based on the U-net architecture and trained on synthetic data only. It predicts both pixel-wise normal vectors and landmarks maps from a single input photo. Landmarks are used for the pose computation and the initialization of the optimization problem, which, in turn, reconstructs the 3D head geometry by using a parametric morphable model and normal vector fields. State-of-the-art results are achieved through qualitative and quantitative evaluation tests on both single and multi-view settings. Despite the fact that the model was trained only on synthetic data, it successfully recovers 3D geometry and precise poses for real-world images.

Dual-MTGAN: Stochastic and Deterministic Motion Transfer for Image-To-Video Synthesis

Fu-En Yang, Jing-Cheng Chang, Yuan-Hao Lee, Yu-Chiang Frank Wang

Auto-TLDR; Dual Motion Transfer GAN for Convolutional Neural Networks

Generating videos with content and motion variations is a challenging task in computer vision. While the recent development of GANs allows video generation from latent representations, it is not easy to produce videos with particular content of motion patterns of interest. In this paper, we propose Dual Motion Transfer GAN (Dual-MTGAN), which takes image and video data as inputs while learning disentangled content and motion representations. Our Dual-MTGAN is able to perform deterministic motion transfer and stochastic motion generation. Based on a given image, the former preserves the input content and transfers motion patterns observed from another video sequence, and the latter directly produces videos with plausible yet diverse motion patterns based on the input image. The proposed model is trained in an end-to-end manner, without the need to utilize pre-defined motion features like pose or facial landmarks. Our quantitative and qualitative results confirm the effectiveness and robustness of our model in addressing such conditioned image-to-video tasks.

An Unsupervised Approach towards Varying Human Skin Tone Using Generative Adversarial Networks

Debapriya Roy, Diganta Mukherjee, Bhabatosh Chanda

Auto-TLDR; Unsupervised Skin Tone Change Using Augmented Reality Based Models

With the increasing popularity of augmented and virtual reality, retailers are focusing more on customer satisfaction to increase sales. Although augmented reality is not a new concept, it has gained much-needed attention over the past few years. Our present work is targeted in this direction and may be used to enhance user experience in various virtual and augmented reality based applications. We propose a model to change the skin tone of a person. Given any input image of a person or a group of persons with some value indicating the desired change of skin color towards fairness or darkness, this method can change the skin tone of the persons in the image. This is an unsupervised method and is also unconstrained in terms of pose, illumination, number of persons in the image, etc. The goal of this work is to reduce the time and effort generally needed by professionals or novices to change skin tone using existing applications. Rigorous experiments show the efficacy of this method in terms of synthesizing perceptually convincing outputs.

A Quantitative Evaluation Framework of Video De-Identification Methods

Sathya Bursic, Alessandro D'Amelio, Marco Granato, Giuliano Grossi, Raffaella Lanzarotti

Auto-TLDR; Face de-identification using photo-reality and facial expressions

We live in an era of privacy concerns, motivating a large research effort in face de-identification. As in other fields, we are observing a general movement from hand-crafted methods to deep learning methods, mainly involving generative models. Although these methods produce more natural de-identified images or videos, we claim that the mere evaluation of the de-identification is not sufficient, especially when it comes to processing the images/videos further. In this note, we take into account the issue of preserving privacy, facial expressions, and photo-reality simultaneously, proposing a general testing framework. The method is applied to four open-source tools, producing a baseline for future de-identification methods.

Deep Multi-Task Learning for Facial Expression Recognition and Synthesis Based on Selective Feature Sharing

Rui Zhao, Tianshan Liu, Jun Xiao, P. K. Daniel Lun, Kin-Man Lam

Auto-TLDR; Multi-task Learning for Facial Expression Recognition and Synthesis

Multi-task learning is an effective learning strategy for deep-learning-based facial expression recognition tasks. However, most existing methods give limited consideration to feature selection when transferring information between different tasks, which may lead to task interference when training multi-task networks. To address this problem, we propose a novel selective feature-sharing method, and establish a multi-task network for facial expression recognition and facial expression synthesis. The proposed method can effectively transfer beneficial features between different tasks, while filtering out useless and harmful information. Moreover, we employ the facial expression synthesis task to enlarge and balance the training dataset to further enhance the generalization ability of the proposed method. Experimental results show that the proposed method achieves state-of-the-art performance on commonly used facial expression recognition benchmarks, which makes it a potential solution to real-world facial expression recognition problems.

Free-Form Image Inpainting Via Contrastive Attention Network

Xin Ma, Xiaoqiang Zhou, Huaibo Huang, Zhenhua Chai, Xiaolin Wei, Ran He

Auto-TLDR; Self-supervised Siamese inference for image inpainting

Most deep learning based image inpainting approaches adopt an autoencoder or its variants to fill missing regions in images. Encoders are usually utilized to learn powerful representational spaces, which are important for dealing with sophisticated learning tasks. Specifically, in the image inpainting task, masks of any shape can appear anywhere in images (i.e., free-form masks), forming complex patterns. It is difficult for encoders to capture such powerful representations in this complex situation. To tackle this problem, we propose a self-supervised Siamese inference network to improve robustness and generalization. Moreover, the restored image usually cannot be harmoniously integrated into the existing content, especially in the boundary area. To address this problem, we propose a novel Dual Attention Fusion module (DAF), which can combine the restored and known regions in a smoother way and be inserted into decoder layers in a plug-and-play way. DAF is developed not only to adaptively rescale channel-wise features by taking interdependencies between channels into account but also to force deep convolutional neural networks (CNNs) to focus more on unknown regions. In this way, the unknown region will be naturally filled from the outside to the inside. Qualitative and quantitative experiments on multiple datasets, including facial and natural datasets (i.e., Celeb-HQ, Paris Street View, Places2 and ImageNet), demonstrate that our proposed method outperforms state-of-the-art methods in generating high-quality inpainting results.

Local-Global Interactive Network for Face Age Transformation

Jie Song, Ping Wei, Huan Li, Yongchi Zhang, Nanning Zheng

Auto-TLDR; A Novel Local-Global Interaction Framework for Long-span Face Age Transformation

Face age transformation, which aims to generate a face image in the past or future, has received increasing attention due to its significant application value in fields such as looking for lost children, tracking criminals, and entertainment. Currently, most existing methods mainly focus on unidirectional short-span face aging. In this paper, we propose a novel local-global interaction framework for long-span face age transformation. Firstly, we divide a face image into five independent parts and design a local generative network for each of them to learn the local structure changes of a face image, while we utilize a global generative network to learn the global structure changes. Then we introduce an interactive network and an age classification network, which are respectively used to integrate the local and global features and maintain the corresponding age features in different age groups. Given any face image at a certain age, our network can produce a clear and realistic image of face aging or rejuvenation. We test and evaluate the model on complex datasets, and extensive qualitative comparison experiments have demonstrated the effectiveness and immense potential of our proposed method.

Shape Consistent 2D Keypoint Estimation under Domain Shift

Levi Vasconcelos, Massimiliano Mancini, Davide Boscaini, Barbara Caputo, Elisa Ricci

Auto-TLDR; Deep Adaptation for Keypoint Prediction under Domain Shift

Recent unsupervised domain adaptation methods based on deep architectures have shown remarkable performance not only in traditional classification tasks but also in more complex problems involving structured predictions (e.g. semantic segmentation, depth estimation). Following this trend, in this paper we present a novel deep adaptation framework for estimating keypoints under domain shift, i.e. when the training (source) and the test (target) images significantly differ in terms of visual appearance. Our method seamlessly combines three different components: feature alignment, adversarial training and self-supervision. Specifically, our deep architecture leverages domain-specific distribution alignment layers to perform target adaptation at the feature level. Furthermore, a novel loss is proposed which combines an adversarial term for ensuring aligned predictions in the output space and a geometric consistency term which guarantees coherent predictions between a target sample and its perturbed version. Our extensive experimental evaluation conducted on three publicly available benchmarks shows that our approach outperforms state-of-the-art domain adaptation methods in the 2D keypoint prediction task.
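
The geometric consistency idea can be sketched as follows, assuming the perturbation is a horizontal flip and the term is a mean squared distance between keypoints; both choices are illustrative assumptions, since the abstract does not specify them. `brightest_pixel` is a toy stand-in for a keypoint network.

```python
# Hypothetical sketch of a geometric consistency term for keypoint prediction.
# Perturbation (horizontal flip) and distance (squared L2) are assumptions.

def hflip_image(image):
    """Flip a 2D image (list of rows) left to right."""
    return [row[::-1] for row in image]


def hflip_keypoints(keypoints, width):
    """Apply the same flip to (x, y) keypoint coordinates."""
    return [(width - 1 - x, y) for x, y in keypoints]


def consistency_loss(predict, image):
    """Compare flip(predict(image)) with predict(flip(image))."""
    width = len(image[0])
    expected = hflip_keypoints(predict(image), width)
    perturbed = predict(hflip_image(image))
    total = sum((ex - px) ** 2 + (ey - py) ** 2
                for (ex, ey), (px, py) in zip(expected, perturbed))
    return total / len(expected)


def brightest_pixel(image):
    """Toy predictor: a single keypoint at the brightest pixel."""
    _, x, y = max((value, x, y)
                  for y, row in enumerate(image)
                  for x, value in enumerate(row))
    return [(x, y)]
```

A predictor that follows the image content yields zero loss, while one that ignores the perturbation is penalized. No target labels are needed, which is why such a term suits the unlabeled target domain.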

Future Urban Scenes Generation through Vehicles Synthesis

Alessandro Simoni, Luca Bergamini, Andrea Palazzi, Simone Calderara, Rita Cucchiara

Auto-TLDR; Predicting the Future of an Urban Scene with a Novel View Synthesis Paradigm

In this work we propose a deep learning pipeline to predict the visual future appearance of an urban scene. Despite recent advances, generating the entire scene in an end-to-end fashion is still far from being achieved. Instead, here we follow a two-stage approach, where interpretable information is included in the loop and each actor is modelled independently. We leverage a per-object novel view synthesis paradigm, i.e. generating a synthetic representation of an object undergoing a geometrical roto-translation in 3D space. Our model can be easily conditioned with constraints (e.g. input trajectories) provided by state-of-the-art tracking methods or by the user. This allows us to generate a set of diverse realistic futures starting from the same input in a multi-modal fashion. We visually and quantitatively show the superiority of this approach over traditional end-to-end scene-generation methods on CityFlow, a challenging real-world dataset.

Let's Play Music: Audio-Driven Performance Video Generation

Hao Zhu, Yi Li, Feixia Zhu, Aihua Zheng, Ran He

Auto-TLDR; APVG: Audio-driven Performance Video Generation Using Structured Temporal UNet

We propose a new task named Audio-driven Performance Video Generation (APVG), which aims to synthesize the video of a person playing a certain instrument guided by a given music audio clip. It is challenging to generate high-dimensional, temporally consistent videos from the low-dimensional audio modality. In this paper, we propose a multi-stage framework for this new task that generates realistic and synchronized performance video from given music. Firstly, we provide both global appearance and local spatial information by generating the coarse videos and the keypoints of body and hands from the given music, respectively. Then, we propose to transform the generated keypoints to a heatmap via a differentiable space transformer, since the heatmap offers more spatial information but is harder to generate directly from audio. Finally, we propose a Structured Temporal UNet (STU) to extract both intra-frame structured information and inter-frame temporal consistency. These are obtained via a graph-based structure module and a CNN-GRU based high-level temporal module, respectively, for final video generation. Comprehensive experiments validate the effectiveness of our proposed framework.
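
One common way to turn a keypoint into a heatmap is to render a Gaussian centred on it, which is smooth in the keypoint coordinates. The sketch below assumes that rendering; it is not confirmed to be the paper's differentiable space transformer, and `keypoint_to_heatmap` and `sigma` are hypothetical names.

```python
import math

# Hypothetical sketch: Gaussian rendering of one keypoint as a heatmap.
# The value at each pixel is a smooth function of (x, y), so gradients could
# flow back to the keypoint in an autodiff framework.

def keypoint_to_heatmap(x, y, height, width, sigma=1.0):
    """Return a height-by-width grid that peaks at column x, row y."""
    return [
        [math.exp(-((col - x) ** 2 + (row - y) ** 2) / (2.0 * sigma ** 2))
         for col in range(width)]
        for row in range(height)
    ]
```

Body and hand keypoints would each get their own channel, giving the later video generation stage a dense spatial input rather than a short coordinate list.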

Talking Face Generation Via Learning Semantic and Temporal Synchronous Landmarks

Aihua Zheng, Feixia Zhu, Hao Zhu, Mandi Luo, Ran He

Auto-TLDR; A semantic and temporal synchronous landmark learning method for talking face generation

Given a speech clip and a facial image, the goal of talking face generation is to synthesize a talking face video with accurate mouth synchronization and natural face motion. Recent progress has proven the effectiveness of landmarks as intermediate information during talking face generation. However, the large gap between audio and visual modalities makes the prediction of landmarks challenging and limits generation ability. This paper proposes a semantic and temporal synchronous landmark learning method for talking face generation. First, we propose to introduce a word detector to enforce richer semantic information. Then, we propose to preserve the temporal synchronization and consistency between landmarks and audio via the proposed temporal residual loss. Lastly, we employ a U-Net generation network with adaptive reconstruction loss to generate facial images for the predicted landmarks. Experimental results on two benchmark datasets, LRW and GRID, demonstrate the effectiveness of our model compared to state-of-the-art methods of talking face generation.

DFH-GAN: A Deep Face Hashing with Generative Adversarial Network

Bo Xiao, Lanxiang Zhou, Yifei Wang, Qiangfang Xu

Auto-TLDR; Deep Face Hashing with GAN for Face Image Retrieval

Face image retrieval is one of the key research directions in the field of computer vision. Thanks to the rapid development of deep neural networks in recent years, deep hashing has achieved good performance in image retrieval. For large-scale face image retrieval, however, the performance needs further improvement. In this paper, we propose Deep Face Hashing with GAN (DFH-GAN), a novel deep hashing method for face image retrieval, which consists of three main components: a generator network for generating synthesized images, a discriminator network with a shared CNN to learn multi-domain face features, and a hash encoding network to generate compact binary hash codes. The generator network is used to perform data augmentation so that the model can learn from both real images and diverse synthesized images. We adopt a two-stage training strategy. In the first stage, the GAN is trained to generate fake images. In the second stage, to make the network converge faster, the model inherits the trained shared CNN of the discriminator and trains the DFH model using several supervised loss functions, applied not only at the last layer but also at the middle layers of the network. Extensive experiments on two widely used datasets demonstrate that DFH-GAN can generate high-quality binary hash codes and greatly exceeds the performance of the state-of-the-art model.
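
To illustrate why compact binary hash codes suit large-scale retrieval, here is a minimal sketch of ranking a gallery by Hamming distance. It assumes the codes are obtained by thresholding real-valued network outputs at zero, which is a common convention but is not stated in the abstract; `binarize`, `hamming` and `rank_gallery` are hypothetical helper names, and the feature values are made up for illustration.

```python
def binarize(features):
    """Threshold real-valued outputs at zero and pack the bits into an int.

    Assumption: sign thresholding; the actual hash layer may differ.
    """
    code = 0
    for value in features:
        code = (code << 1) | (1 if value > 0 else 0)
    return code


def hamming(a, b):
    """Number of bit positions in which two packed codes differ."""
    return bin(a ^ b).count("1")


def rank_gallery(query_code, gallery_codes):
    """Return gallery indices sorted from nearest to farthest code."""
    return sorted(
        range(len(gallery_codes)),
        key=lambda i: hamming(query_code, gallery_codes[i]),
    )


# Usage with toy 4-bit codes (illustrative values only).
query = binarize([0.9, -0.2, 0.4, -0.7])
gallery = [
    binarize([0.8, -0.1, 0.3, -0.5]),
    binarize([-0.6, 0.7, -0.2, 0.9]),
    binarize([0.5, 0.4, 0.6, -0.3]),
]
ranking = rank_gallery(query, gallery)
```

Comparing two codes costs one XOR and a bit count, which is what keeps retrieval cheap as the gallery grows.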

Learning Semantic Representations Via Joint 3D Face Reconstruction and Facial Attribute Estimation

Zichun Weng, Youjun Xiang, Xianfeng Li, Juntao Liang, Wanliang Huo, Yuli Fu

Auto-TLDR; Joint Framework for 3D Face Reconstruction with Facial Attribute Estimation

We propose a novel joint framework for 3D face reconstruction (3DFR) that integrates facial attribute estimation (FAE) as an auxiliary task. One of the essential problems of 3DFR is to extract semantic facial features (e.g., Big Nose, High Cheekbones, and Asian) from in-the-wild 2D images, a problem that is inherently tied to FAE. These two tasks, though heterogeneous, are highly relevant to each other. To exploit this relationship, we leverage a Convolutional Neural Network to extract shared facial representations for both the shape decoder and the attribute classifier. We further develop an in-batch hybrid-task training scheme that enables our model to learn from heterogeneous facial datasets jointly within a mini-batch. Thanks to the joint loss that provides supervision from both the 3DFR and FAE domains, our model learns the correlations between 3D shapes and facial attributes, which benefit both feature extraction and shape inference. Quantitative evaluation and qualitative visualization results confirm the effectiveness and robustness of our joint framework.

Image Inpainting with Contrastive Relation Network

Xiaoqiang Zhou, Junjie Li, Zilei Wang, Ran He, Tieniu Tan

Auto-TLDR; Two-Stage Inpainting with Graph-based Relation Network

Image inpainting must satisfy two challenging requirements: structural plausibility and texture coherence. In this paper, we propose a two-stage inpainting framework that addresses these two requirements in separate stages. In the first stage, a completed segmentation of the corrupted image is predicted by a segmentation reconstruction network; in the second stage, fine-grained image details are restored by an image generator. The two stages are connected in series, so that the image details are generated under the guidance of the completed segmentation map predicted in the first stage. Specifically, in the second stage, we propose a novel graph-based relation network to model the relationships that exist in the corrupted image. The relation network considers both the intra-relationship between pixels in the same semantic region and the inter-relationship between different semantic parts, improving the consistency and compatibility of image textures. In addition, a contrastive loss is designed to facilitate the training of the relation network. Such a framework not only simplifies the inpainting problem, but also explicitly exploits the relationships in the corrupted image. Extensive experiments on various public datasets demonstrate, quantitatively and qualitatively, the superiority of our approach over the state of the art.

UCCTGAN: Unsupervised Clothing Color Transformation Generative Adversarial Network

Shuming Sun, Xiaoqiang Li, Jide Li

Auto-TLDR; An Unsupervised Clothing Color Transformation Generative Adversarial Network

Clothing color transformation refers to changing the clothing color in an original image to the clothing color in a target image. In this paper, we propose an Unsupervised Clothing Color Transformation Generative Adversarial Network (UCCTGAN) for this task. UCCTGAN adopts the color histogram of the target clothing as color guidance, and an improved U-Net architecture called AntennaNet is put forward to fuse the extracted color information with the original image. To enable unsupervised learning, the loss function is designed around color moments, which measure the chromatic difference between the target clothing and the generated clothing. Experimental results show that our network can generate convincing color transformation results.
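
As a rough illustration of comparing two clothing regions by color moments, the sketch below computes the first three moments per channel (mean, standard deviation, and the signed cube root of the third central moment) and sums their absolute differences. This is the classical color-moment descriptor, used here as an assumption; the abstract does not give the exact form of the UCCTGAN loss, and `color_moments` and `color_moment_distance` are hypothetical names.

```python
import math


def color_moments(pixels):
    """Return 9 values: (mean, std, skewness) for each of the R, G, B channels.

    `pixels` is a list of (r, g, b) tuples taken from one clothing region.
    """
    n = len(pixels)
    moments = []
    for channel in range(3):
        values = [p[channel] for p in pixels]
        mean = sum(values) / n
        second = sum((v - mean) ** 2 for v in values) / n
        third = sum((v - mean) ** 3 for v in values) / n
        std = second ** 0.5
        # Signed cube root keeps the direction of the asymmetry.
        skew = math.copysign(abs(third) ** (1.0 / 3.0), third)
        moments.extend([mean, std, skew])
    return moments


def color_moment_distance(pixels_a, pixels_b):
    """L1 distance between the color-moment vectors of two regions."""
    a = color_moments(pixels_a)
    b = color_moments(pixels_b)
    return sum(abs(x - y) for x, y in zip(a, b))


# Usage: uniform red versus uniform blue regions (illustrative values).
red = [(1.0, 0.0, 0.0)] * 4
blue = [(0.0, 0.0, 1.0)] * 4
same = color_moment_distance(red, red)
different = color_moment_distance(red, blue)
```

Because the moments summarize a region's color distribution rather than its pixel layout, such a measure needs no pixel-aligned ground truth, which fits the unsupervised setting described above.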