Learning Semantic Representations Via Joint 3D Face Reconstruction and Facial Attribute Estimation

Zichun Weng, Youjun Xiang, Xianfeng Li, Juntao Liang, Wanliang Huo, Yuli Fu

Auto-TLDR; Joint Framework for 3D Face Reconstruction with Facial Attribute Estimation

We propose a novel joint framework for 3D face reconstruction (3DFR) that integrates facial attribute estimation (FAE) as an auxiliary task. One of the essential problems of 3DFR is to extract semantic facial features (e.g., Big Nose, High Cheekbones, and Asian) from in-the-wild 2D images, a problem inherently entangled with FAE. These two tasks, though heterogeneous, are highly relevant to each other. To exploit this, we leverage a Convolutional Neural Network to extract shared facial representations for both the shape decoder and the attribute classifier. We further develop an in-batch hybrid-task training scheme that enables our model to learn from heterogeneous facial datasets jointly within a mini-batch. Thanks to the joint loss that provides supervision from both the 3DFR and FAE domains, our model learns the correlations between 3D shapes and facial attributes, which benefit both feature extraction and shape inference. Quantitative evaluation and qualitative visualization results confirm the effectiveness and robustness of our joint framework.
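
As a rough sketch of the shared-representation idea described above (module names, dimensions, and the exact loss terms are illustrative assumptions, not the authors' released code), the joint model can be read as one encoder feeding two heads, with a joint loss masked per sample so that a mini-batch can mix 3DFR-labeled and FAE-labeled data:

```python
import torch
import torch.nn as nn

class Joint3DFRFAE(nn.Module):
    """One shared CNN encoder feeding a 3D shape decoder and an attribute classifier."""
    def __init__(self, feat_dim=512, n_shape_params=199, n_attrs=40):
        super().__init__()
        self.encoder = nn.Sequential(  # stand-in for the shared CNN backbone
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.shape_decoder = nn.Linear(feat_dim, n_shape_params)   # 3DFR head
        self.attr_classifier = nn.Linear(feat_dim, n_attrs)       # FAE head

    def forward(self, x):
        f = self.encoder(x)
        return self.shape_decoder(f), self.attr_classifier(f)

# In-batch hybrid-task training: a mini-batch mixes samples from a 3DFR dataset
# and an FAE dataset; each loss is computed only on the samples that carry the
# corresponding label (float masks has_shape/has_attr), then summed jointly.
def joint_loss(shape_pred, shape_gt, attr_logits, attr_gt, has_shape, has_attr):
    l_shape = ((shape_pred - shape_gt) ** 2).mean(dim=1)          # per-sample MSE
    l_attr = nn.functional.binary_cross_entropy_with_logits(
        attr_logits, attr_gt, reduction="none").mean(dim=1)
    return (l_shape * has_shape).sum() / has_shape.sum().clamp(min=1) \
         + (l_attr * has_attr).sum() / has_attr.sum().clamp(min=1)
```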

Similar papers

Multi-Attribute Regression Network for Face Reconstruction

Xiangzheng Li, Suping Wu

Auto-TLDR; A Multi-Attribute Regression Network for Face Reconstruction

In this paper, we propose a multi-attribute regression network (MARN) to investigate the problem of face reconstruction, especially in challenging cases where faces undergo large variations, including severe poses, extreme expressions, and partial occlusions in unconstrained environments. Traditional 3DMM parametric regression methods do not explicitly learn identity, expression, and pose attributes, so the reconstructed faces lack geometric detail. Our MARN enables the network to better extract the feature information of face identity, expression, and pose attributes. We introduce identity, expression, and pose attribute loss functions to enhance the learning of details for each attribute. At the same time, we carefully design a geometric contour constraint loss function and use the constraints of sparse 2D face landmarks to improve the reconstructed geometric contour information. Experimental results show that our face reconstruction method achieves significant improvements over state-of-the-art methods on the AFLW2000-3D and AFLW datasets. In addition, it also greatly improves dense face alignment.

Hybrid Approach for 3D Head Reconstruction: Using Neural Networks and Visual Geometry

Oussema Bouafif, Bogdan Khomutenko, Mohammed Daoudi

Auto-TLDR; Recovering 3D Head Geometry from a Single Image using Deep Learning and Geometric Techniques

Recovering the 3D geometric structure of a face from a single input image is a challenging, active research area in computer vision. In this paper, we present a novel method for reconstructing 3D heads from a single image or multiple images using a hybrid approach based on deep learning and geometric techniques. We propose an encoder-decoder network based on the U-net architecture and trained on synthetic data only. It predicts both pixel-wise normal vectors and landmark maps from a single input photo. Landmarks are used for the pose computation and the initialization of the optimization problem, which, in turn, reconstructs the 3D head geometry by using a parametric morphable model and normal vector fields. State-of-the-art results are achieved through qualitative and quantitative evaluation tests in both single- and multi-view settings. Despite the fact that the model was trained only on synthetic data, it successfully recovers 3D geometry and precise poses for real-world images.

Joint Face Alignment and 3D Face Reconstruction with Efficient Convolution Neural Networks

Keqiang Li, Huaiyu Wu, Xiuqin Shang, Zhen Shen, Gang Xiong, Xisong Dong, Bin Hu, Fei-Yue Wang

Auto-TLDR; Mobile-FRNet: Efficient 3D Morphable Model Alignment and 3D Face Reconstruction from a Single 2D Facial Image

3D face reconstruction from a single 2D facial image is a challenging problem of wide concern. Recent CNN-based methods typically learn the parameters of a 3D Morphable Model (3DMM) from 2D images to perform face alignment and 3D face reconstruction. Most algorithms are designed for faces with small or medium yaw angles, and aligning faces in large poses remains extremely challenging. These methods are also usually inefficient, the main challenge being that accurately determining the parameters takes time. To address these challenges while improving performance, this paper proposes a novel and efficient end-to-end framework. We design an efficient and lightweight network model, named Mobile-FRNet, which combines depthwise separable convolution, multi-scale representation, and a lightweight attention mechanism. Meanwhile, different loss functions are used to constrain and optimize the 3DMM parameters and 3D vertices during training to improve the performance of the network. Extensive experiments on challenging datasets show that our method significantly improves the accuracy of face alignment and 3D face reconstruction. The parameter count and complexity of our model are also greatly reduced.
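
Depthwise separable convolution, one of the building blocks named for Mobile-FRNet, factorizes a standard KxK convolution into a per-channel spatial filter followed by a 1x1 pointwise mix, which is where the parameter and FLOP savings come from. A generic PyTorch sketch (not the authors' exact block):

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Factorizes a KxK conv into a per-channel (depthwise) conv
    plus a 1x1 (pointwise) conv, cutting parameters and FLOPs."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, stride,
                                   padding=kernel_size // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))
```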

Learning Disentangled Representations for Identity Preserving Surveillance Face Camouflage

Jingzhi Li, Lutong Han, Hua Zhang, Xiaoguang Han, Jingguo Ge, Xiaochu Cao

Auto-TLDR; Individual Face Privacy under Surveillance Scenario with Multi-task Loss Function

In this paper, we focus on protecting individual face privacy in surveillance scenarios, where the goal is to change the visual appearance of faces while keeping them recognizable by current face recognition systems. This is a challenging problem, as we must retain the most important structures of captured facial images while altering the salient facial regions to protect personal privacy. To address this problem, we introduce a novel individual face protection model, which can camouflage the face appearance from the perspective of human visual perception and preserve the identity features used for face authentication. To that end, we develop an encoder-decoder network architecture that disentangles the person feature representation into an appearance code and an identity code. Specifically, we first randomly divide the face images into two groups, a source set and a target set, where the source set is used to extract the identity code and the target set provides the appearance code. Then, we recombine the identity and appearance codes to synthesize a new face, which has the same identity as the source subject. Finally, the synthesized faces are used to replace the original faces to protect the privacy of the individual. Furthermore, our model is trained end-to-end with a multi-task loss function, which better preserves identity and stabilizes the training loss. Experiments conducted on the Cross-Age Celebrity Dataset demonstrate the effectiveness of our model and validate its superiority in terms of visual quality and scalability.
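
A minimal sketch of the disentangle-and-recombine step (hypothetical encoder and decoder handles; all losses omitted): identity codes from source faces are paired with appearance codes from target faces before decoding:

```python
import torch

def camouflage(id_encoder, app_encoder, decoder, source_imgs, target_imgs):
    """Synthesize faces that keep the source identity but take the
    target appearance, as in the disentangle-and-recombine scheme."""
    id_code = id_encoder(source_imgs)    # identity features kept for authentication
    app_code = app_encoder(target_imgs)  # appearance features used to camouflage
    return decoder(torch.cat([id_code, app_code], dim=1))
```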

Quality-Based Representation for Unconstrained Face Recognition

Nelson Méndez-Llanes, Katy Castillo-Rosado, Heydi Mendez-Vazquez, Massimo Tistarelli

Auto-TLDR; activation map for face recognition in unconstrained environments

Significant advances have been achieved in face recognition in the last decade thanks to the development of deep learning methods. However, recognizing faces captured in uncontrolled environments is still a challenging problem for the scientific community. In these scenarios, the performance of most existing deep-learning-based methods falls abruptly due to the poor quality of the face images. In this work, we propose to use an activation map to represent the quality information in a face image. Different face regions are analyzed to determine their quality, and then only those regions with good quality are used to perform the recognition using a given deep face model. For experimental evaluation, in order to simulate unconstrained environments, three challenging databases with different variations in appearance were selected: the Labeled Faces in the Wild database, the Celebrities in Frontal-Profile in the Wild database, and the AR database. Three deep face models were used to evaluate the proposal on these databases; in all cases, the proposed activation map improves the recognition rates of the original models by between 0.3% and 31%. The results experimentally demonstrate that the proposal is able to select the face areas with higher discriminative power and enough identifying information, while ignoring those with spurious information.

Unsupervised Learning of Landmarks Based on Inter-Intra Subject Consistencies

Weijian Li, Haofu Liao, Shun Miao, Le Lu, Jiebo Luo

Auto-TLDR; Unsupervised Learning for Facial Landmark Discovery using Inter-subject Landmark consistencies

We present a novel unsupervised learning approach to image landmark discovery by incorporating inter-subject landmark consistencies on facial images. This is achieved via an inter-subject mapping module that transforms original subject landmarks based on an auxiliary subject-related structure. To recover the original subject from the transformed images, the landmark detector is forced to learn spatial locations that carry consistent semantic meanings both for paired intra-subject images and between paired inter-subject images. Our proposed method is extensively evaluated on two public facial image datasets (MAFL, AFLW) with various settings. Experimental results indicate that our method extracts consistent landmarks for both datasets and achieves better performance than previous state-of-the-art methods, both quantitatively and qualitatively.

Contrastive Data Learning for Facial Pose and Illumination Normalization

Gee-Sern Hsu, Chia-Hao Tang

Auto-TLDR; Pose and Illumination Normalization with Contrast Data Learning for Face Recognition

Face normalization can be a crucial step in generic face recognition. We propose the Pose and Illumination Normalization (PIN) framework with contrastive data learning for face normalization. The PIN framework is designed to learn the transformation from a source set to a target set, which together compose a contrastive dataset for learning. The source set contains faces collected in the wild and thus covers a wide range of variation across illumination, pose, expression and other variables. The target set contains face images taken under controlled conditions, with all faces in frontal pose and balanced in illumination. The PIN framework is composed of an encoder, a decoder and two discriminators. The encoder is built from a state-of-the-art face recognition network and acts as a facial feature extractor; it is not updated during training. The decoder is trained on both the source and target sets and aims to learn the transformation from the source set to the target set; it can therefore transform an arbitrary face into an illumination- and pose-normalized face. The discriminators are trained to ensure the photo-realistic quality of the normalized face images generated by the decoder. The loss functions employed in the decoder and discriminators are appropriately designed and weighted to yield better normalization outcomes and recognition performance. We verify the performance of the proposed framework on several benchmark databases and compare it with state-of-the-art approaches.

Inner Eye Canthus Localization for Human Body Temperature Screening

Claudio Ferrari, Lorenzo Berlincioni, Marco Bertini, Alberto Del Bimbo

Auto-TLDR; Automatic Localization of the Inner Eye Canthus in Thermal Face Images using 3D Morphable Face Model

In this paper, we propose an automatic approach for localizing the inner eye canthus in thermal face images. We first coarsely detect 5 facial keypoints corresponding to the center of the eyes, the nose tip and the ears. Then we compute a sparse 2D-3D point correspondence using a 3D Morphable Face Model (3DMM). This correspondence is used to project the entire 3D face onto the image and subsequently locate the inner eye canthus. Detecting this location makes it possible to obtain the most precise body temperature measurement for a person using a thermal camera. We evaluated the approach on a thermal face dataset provided with manually annotated landmarks. However, such manual annotations are normally conceived to identify facial parts such as eyes, nose and mouth, and are not specifically tailored to localizing the eye canthus region. As an additional contribution, we enrich the original dataset by using the annotated landmarks to deform and project the 3DMM onto the images. Then, by manually selecting a small region corresponding to the eye canthus, we enrich the dataset with additional annotations. By using the manual landmarks, we ensure the correctness of the 3DMM projection, which can serve as ground truth for future evaluations. Moreover, we supply the dataset with 3D head poses and per-point visibility masks for detecting self-occlusions. The data will be publicly released.

Attentive Hybrid Feature Based a Two-Step Fusion for Facial Expression Recognition

Jun Weng, Yang Yang, Zichang Tan, Zhen Lei

Auto-TLDR; Attentive Hybrid Architecture for Facial Expression Recognition

Facial expression recognition is inherently a challenging task, especially for in-the-wild images with various occlusions and large pose variations, which may lead to the loss of crucial information. To address this, we propose an attentive hybrid architecture (AHA) which learns global, local and integrated features based on different face regions. Compared with a single type of feature, our extracted features carry complementary information and reduce the loss of crucial information. Specifically, AHA contains three branches, where all sub-networks employ an attention mechanism to further localize the pixels/regions of interest. Moreover, we propose a two-step fusion strategy based on LSTM to deeply explore the hidden correlations among different face regions. Extensive experiments on four popular expression databases (i.e., CK+, FER-2013, SFEW 2.0, RAF-DB) show the effectiveness of the proposed method.

Unsupervised Face Manipulation Via Hallucination

Keerthy Kusumam, Enrique Sanchez, Georgios Tzimiropoulos

Auto-TLDR; Unpaired Face Image Manipulation using Autoencoders

This paper addresses the problem of manipulating a face image in terms of changing its pose. To achieve this, we propose a new method that can be trained under the very general "unpaired" setting. To this end, we firstly propose to model the general appearance, layout and background of the input image using a low-resolution version of it, which is progressively passed through a hallucination network to generate features at higher resolutions. We show that such a formulation is significantly simpler than previous approaches for appearance modelling based on autoencoders. Secondly, we propose a fully learnable and spatially-aware appearance transfer module which can cope with misalignment between the input source image and the target pose, and can effectively combine the features from the hallucination network with the features produced by our generator. Thirdly, we introduce an identity preserving method that is trained in an unsupervised way, by using an auxiliary feature extractor and a contrastive loss between the real and generated images. We compare our method against the state-of-the-art, reporting significant improvements both quantitatively, in terms of FID and IS, and qualitatively.

Learning from Web Data: Improving Crowd Counting Via Semi-Supervised Learning

Tao Peng, Pengfei Zhu

Auto-TLDR; Semi-supervised Crowd Counting Baseline for Deep Neural Networks

Deep neural networks need large-scale datasets for better training and evaluation. However, collecting and annotating a large-scale crowd counting dataset is expensive and challenging. In this work, we exploit unlabeled web crowd images and propose a multi-task framework for boosting a crowd counting baseline method through semi-supervision. Based on the observation that rotation and splitting operations do not change the crowd count of an image, we design three auxiliary tasks to improve the quality of the feature extractor; our framework can be easily extended to other crowd counting baselines. Experiments show that our semi-supervised learning framework outperforms previous baselines on the UCF-QNRF and ShanghaiTech datasets.
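
The count-invariance observation translates directly into a label-free consistency loss; one plausible instantiation for the rotation case (an assumed form for illustration, since the paper's three auxiliary tasks are only summarized above):

```python
import torch

def rotation_consistency_loss(model, unlabeled_imgs):
    """Self-supervised loss: the predicted crowd count of an image and
    of its 90-degree rotation should match (rotation keeps the count)."""
    count = model(unlabeled_imgs).sum(dim=(1, 2, 3))          # density map -> count
    rotated = torch.rot90(unlabeled_imgs, k=1, dims=(2, 3))   # rotate the H/W axes
    count_rot = model(rotated).sum(dim=(1, 2, 3))
    return ((count - count_rot) ** 2).mean()
```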

Deep Multi-Task Learning for Facial Expression Recognition and Synthesis Based on Selective Feature Sharing

Rui Zhao, Tianshan Liu, Jun Xiao, P. K. Daniel Lun, Kin-Man Lam

Auto-TLDR; Multi-task Learning for Facial Expression Recognition and Synthesis

Multi-task learning is an effective learning strategy for deep-learning-based facial expression recognition tasks. However, most existing methods give limited consideration to feature selection when transferring information between tasks, which may lead to task interference when training multi-task networks. To address this problem, we propose a novel selective feature-sharing method and establish a multi-task network for facial expression recognition and facial expression synthesis. The proposed method can effectively transfer beneficial features between tasks while filtering out useless and harmful information. Moreover, we employ the facial expression synthesis task to enlarge and balance the training dataset to further enhance the generalization ability of the proposed method. Experimental results show that the proposed method achieves state-of-the-art performance on commonly used facial expression recognition benchmarks, which makes it a potential solution to real-world facial expression recognition problems.

High Resolution Face Age Editing

Xu Yao, Gilles Puy, Alasdair Newson, Yann Gousseau, Pierre Hellier

Auto-TLDR; An Encoder-Decoder Architecture for Face Age editing on High Resolution Images

Face age editing has become a crucial task in film post-production and is also becoming popular for general-purpose photography. Recently, adversarial training has produced some of the most visually impressive results for image manipulation, including the face aging/de-aging task. In spite of considerable progress, current methods often present visual artifacts and can only deal with low-resolution images. In order to achieve aging/de-aging with the high quality and robustness necessary for wider use, these problems need to be addressed. This is the goal of the present work. We present an encoder-decoder architecture for face age editing. The core idea of our network is to encode a face image into age-invariant features and learn a modulation vector corresponding to a target age. We then combine these two elements to produce a realistic image of the person at the desired target age. Our architecture is greatly simplified with respect to other approaches and allows for fine-grained age editing on high-resolution images in a single unified model. Source code is available at https://github.com/InterDigitalInc/HRFAE.
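
The combination step admits a compact reading: modulate the age-invariant features channel-wise with a vector derived from the target age, then decode. A sketch with hypothetical module names (the actual modulation mechanism may differ):

```python
def edit_age(encoder, decoder, age_mlp, image, target_age):
    """Combine age-invariant features with a modulation vector for the
    target age, then decode the re-aged face."""
    feats = encoder(image)            # (B, C, H, W) age-invariant features
    mod = age_mlp(target_age)         # (B, C) per-channel modulation vector
    return decoder(feats * mod.unsqueeze(-1).unsqueeze(-1))
```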

Local Facial Attribute Transfer through Inpainting

Ricard Durall, Franz-Josef Pfreundt, Janis Keuper

Auto-TLDR; Attribute Transfer Inpainting Generative Adversarial Network

The term attribute transfer refers to the task of altering images in such a way that the semantic interpretation of a given input image is shifted towards an intended direction, quantified by semantic attributes. Prominent example applications are photo-realistic changes of facial features and expressions, like changing the hair color, adding a smile, or enlarging the nose, or altering the entire context of a scene, like transforming a summer landscape into a winter panorama. Recent advances in attribute transfer are mostly based on generative deep neural networks, using various techniques to manipulate images in the latent space of the generator. In this paper, we present a novel method for the common sub-task of local attribute transfer, where only parts of a face have to be altered in order to achieve semantic changes (e.g., removing a mustache). In contrast to previous methods, where such local changes have been implemented by generating new (global) images, we propose to formulate local attribute transfer as an inpainting problem. By removing and regenerating only parts of images, our Attribute Transfer Inpainting Generative Adversarial Network (ATI-GAN) is able to utilize local context information to focus on the attributes while keeping the background unmodified, yielding visually sound results.

Continuous Learning of Face Attribute Synthesis

Ning Xin, Shaohui Xu, Fangzhe Nan, Xiaoli Dong, Weijun Li, Yuanzhou Yao

Auto-TLDR; Continuous Learning for Face Attribute Synthesis

The generative adversarial network (GAN) exhibits great superiority in the face attribute synthesis task. However, existing methods handle the expansion to new attributes poorly. To overcome the limitations of a single network in synthesizing new attributes, a continuous learning method for face attribute synthesis is proposed in this work. First, the feature vector of the input image is extracted, and attribute direction regression is performed in the feature space to obtain the axes of different attributes. The feature vector is then linearly guided along an axis so that images with the target attribute can be synthesized by the decoder. Finally, to make the network capable of continuous learning, an orthogonal direction modification module is used to extend to newly added attributes. Experimental results show that the proposed method endows a single network with the ability to learn attributes continuously, and the synthesized attributes are more accurate than those produced by current state-of-the-art methods.
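
The linear guidance step amounts to moving a feature vector along a learned attribute axis before decoding; a sketch under that reading (hypothetical encoder/decoder names):

```python
def edit_attribute(encoder, decoder, image, attr_axis, strength=1.0):
    """Shift the latent code along a (unit-norm) attribute axis obtained
    from direction regression, then decode the edited image."""
    z = encoder(image)
    axis = attr_axis / attr_axis.norm()   # attribute axis in feature space
    return decoder(z + strength * axis)
```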

Object Features and Face Detection Performance: Analyses with 3D-Rendered Synthetic Data

Jian Han, Sezer Karaoglu, Hoang-An Le, Theo Gevers

Auto-TLDR; Synthetic Data for Face Detection Using 3DU Face Dataset

This paper provides an overview of how object features in images influence face detection performance, and how to select synthetic faces to address specific features. To this end, we investigate the effects of occlusion, scale, viewpoint, background, and noise using a novel synthetic image generator based on the 3DU Face Dataset. To examine the effects of different features, we selected three detectors (Faster RCNN, HR, SSH) as representatives of various face detection methodologies. Comparing different configurations of synthetic data on face detection systems shows that our synthetic dataset can complement face detectors and make them more robust against these features in the real world. Our analysis also demonstrates that a variety of data augmentation is necessary to address nuanced differences in performance.

Talking Face Generation Via Learning Semantic and Temporal Synchronous Landmarks

Aihua Zheng, Feixia Zhu, Hao Zhu, Mandi Luo, Ran He

Auto-TLDR; A semantic and temporal synchronous landmark learning method for talking face generation

Given a speech clip and a facial image, the goal of talking face generation is to synthesize a talking face video with accurate mouth synchronization and natural face motion. Recent progress has proven the effectiveness of landmarks as intermediate information during talking face generation. However, the large gap between the audio and visual modalities makes the prediction of landmarks challenging and limits generation ability. This paper proposes a semantic and temporal synchronous landmark learning method for talking face generation. First, we introduce a word detector to enforce richer semantic information. Then, we preserve the temporal synchronization and consistency between landmarks and audio via a proposed temporal residual loss. Lastly, we employ a U-Net generation network with an adaptive reconstruction loss to generate facial images from the predicted landmarks. Experimental results on two benchmark datasets, LRW and GRID, demonstrate the effectiveness of our model compared to state-of-the-art talking face generation methods.

HP2IFS: Head Pose Estimation Exploiting Partitioned Iterated Function Systems

Carmen Bisogni, Michele Nappi, Chiara Pero, Stefano Ricciardi

Auto-TLDR; PIFS based head pose estimation using fractal coding theory and Partitioned Iterated Function Systems

Estimating the actual head orientation from 2D images, with regard to its three degrees of freedom, is a well-known problem that is highly significant for a large number of applications involving head pose knowledge. Consequently, this topic has been tackled by a plethora of methods and algorithms, most of which exploit neural networks. Machine learning methods, indeed, achieve accurate head rotation values, yet require an adequate training stage and, to that aim, a large number of positive and negative examples. In this paper we take a different approach to this topic by using fractal coding theory, and in particular Partitioned Iterated Function Systems (PIFS), to extract the fractal code from the input head image and to compare this representation to the fractal code of a reference model through the Hamming distance. According to experiments conducted on both the BIWI and AFLW2000 databases, the proposed PIFS-based head pose estimation method provides accurate yaw/pitch/roll angular values, with a performance approaching that of state-of-the-art machine-learning-based algorithms and exceeding that of most non-training-based approaches.
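
Once the PIFS code is expressed as a binary vector, the matching step is a plain Hamming comparison against reference codes; a sketch (the fractal encoding itself is omitted, and the nearest-reference lookup is an assumed simplification):

```python
import numpy as np

def hamming_distance(code_a: np.ndarray, code_b: np.ndarray) -> int:
    """Number of differing bits between two binary fractal codes."""
    return int(np.count_nonzero(code_a != code_b))

def estimate_pose(query_code, reference_codes, reference_angles):
    """Return the yaw/pitch/roll of the reference model whose fractal
    code is closest to the query in Hamming distance."""
    dists = [hamming_distance(query_code, r) for r in reference_codes]
    return reference_angles[int(np.argmin(dists))]
```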

SATGAN: Augmenting Age Biased Dataset for Cross-Age Face Recognition

Wenshuang Liu, Wenting Chen, Yuanlue Zhu, Linlin Shen

Auto-TLDR; SATGAN: Stable Age Translation GAN for Cross-Age Face Recognition

In this paper, we propose a Stable Age Translation GAN (SATGAN) to generate fake face images at different ages to augment age-biased face datasets for Cross-Age Face Recognition (CAFR). The proposed SATGAN consists of a generator and a discriminator. As part of the generator, a novel Mask Attention Module (MAM) is introduced to make the generator focus on the face area. In addition, the generator employs a Uniform Distribution Discriminator (UDD) to supervise the learning of the latent feature map and enforce a uniform distribution. Besides, the discriminator employs a Feature Separation Module (FSM) to disentangle identity information from age information. Quantitative and qualitative evaluations on the Morph dataset show that SATGAN achieves much better performance than existing methods. A face recognition model trained on datasets (VGGFace2 and MS-Celeb-1M) augmented with our SATGAN achieves better accuracy on cross-age datasets such as Cross-Age LFW and AgeDB-30.

Unsupervised Disentangling of Viewpoint and Residues Variations by Substituting Representations for Robust Face Recognition

Minsu Kim, Joanna Hong, Junho Kim, Hong Joo Lee, Yong Man Ro

Auto-TLDR; Unsupervised Disentangling of Identity, viewpoint, and Residue Representations for Robust Face Recognition

It is well known that identity-unrelated variations (e.g., viewpoint or illumination) degrade the performance of face recognition methods. To handle this challenge, robust methods for disentangling identity and view representations have drawn attention in the machine learning area. However, existing methods learn discriminative features that require manual supervision of such factors of variation. In this paper, we propose a novel disentangling framework that models three representations, namely identity, viewpoint, and residues (i.e., identity- and pose-unrelated features), and does not require supervision of the variations. By jointly modeling the three representations, we enhance the disentanglement of each representation and achieve robust face recognition performance. Further, the learned viewpoint representation can be utilized for pose estimation or for editing a posed facial image. Extensive quantitative and qualitative evaluations verify the effectiveness of our proposed method, which disentangles the identity, viewpoint, and residues of facial images.

SAT-Net: Self-Attention and Temporal Fusion for Facial Action Unit Detection

Zhihua Li, Zheng Zhang, Lijun Yin

Auto-TLDR; Temporal Fusion and Self-Attention Network for Facial Action Unit Detection

Research on facial action unit detection has shown remarkable performance using deep spatial learning models in recent years; however, it is far from reaching its full capacity due to the lack of use of temporal information of AUs across time. Since the AU occurrence in one frame is highly likely to be related to previous frames in a temporal sequence, exploring the temporal correlation of AUs across frames is a key motivation of this work. In this paper, we propose a novel temporal fusion and AU-supervised self-attention network (SAT-Net) to address the AU detection problem. First, we feed the deep features of a sequence into a convolutional LSTM network, fuse the previous temporal information into the feature map of the last frame, and continue to learn the AU occurrence. Second, considering that AU detection is a multi-label classification problem in which each label depends only on certain facial areas, we propose a new self-learned attention mask that focuses the detection of each AU on parts of the facial areas by learning an individual attention mask for each AU, thus increasing AU independence without the loss of any spatial relations. Our extensive experiments show that the proposed framework achieves better AU detection results than the state of the art on two benchmark databases (BP4D and DISFA).

ClusterFace: Joint Clustering and Classification for Set-Based Face Recognition

Samadhi Poornima Kumarasinghe Wickrama Arachchilage, Ebroul Izquierdo

Auto-TLDR; Joint Clustering and Classification for Face Recognition in the Wild

Deep learning technology has enabled successful modeling of complex facial features when high-quality images are available. Nonetheless, accurate modeling and recognition of human faces in real-world scenarios, 'in the wild' or under adverse conditions, remains an open problem. When unconstrained faces are mapped into deep features, variations such as illumination, pose, and occlusion can create inconsistencies in the resultant feature space. Hence, deriving conclusions based on direct associations could lead to degraded performance. This raises the need for a basic feature space analysis prior to face recognition. This paper devises a joint clustering and classification scheme that learns deep face associations in an easy-to-hard way. Our method is based on hierarchical clustering, where the early iterations tend to preserve high reliability. The rationale of our method is that a reliable clustering result can provide insights into the distribution of the feature space, which can guide the classification that follows. Experimental evaluations on three tasks, face verification, face identification and rank-order search, demonstrate better or competitive performance compared to the state-of-the-art on all three experiments.

Learning Emotional Blinded Face Representations

Alejandro Peña Almansa, Julian Fierrez, Agata Lapedriza, Aythami Morales

Auto-TLDR; Blind Face Representations for Emotion Recognition

This work proposes two new face representations that are blind to the expressions associated with emotional responses. This work is in part motivated by new international regulations for personal data protection, which force data controllers to protect any kind of sensitive information involved in automatic processes. Advances in affective computing have contributed to improved human-machine interfaces, but at the same time, the capacity to monitor emotional responses triggers potential risks for humans, both in terms of fairness and privacy. We propose two different methods to learn these facial-expression-blinded features. We show that it is possible to eliminate information related to emotion recognition tasks, while the performance of subject verification, gender recognition, and ethnicity classification is only slightly affected. We also present an application to train fairer classifiers over a protected facial expression attribute. The results demonstrate that it is possible to reduce emotional information in the face representation while retaining competitive performance in other face-based artificial intelligence tasks.

Learning to Implicitly Represent 3D Human Body from Multi-Scale Features and Multi-View Images

Zhongguo Li, Magnus Oskarsson, Anders Heyden

Auto-TLDR; Reconstruction of 3D human bodies from multi-view images using multi-stage end-to-end neural networks

Reconstruction of 3D human bodies from images faces many challenges, as it is generally an ill-posed problem. In this paper we present a method to reconstruct 3D human bodies from multi-view images by learning an implicit function to represent 3D shape, based on multi-scale features extracted by multi-stage end-to-end neural networks. Our model consists of several end-to-end hourglass networks for extracting multi-scale features from multi-view images, and a fully connected network for implicit function classification from these features. Given a 3D point, it is projected onto the multi-view images, and these images are fed into our model to extract multi-scale features. The scales of the features extracted by the hourglass networks decrease with the depth of our model, representing information from local to global scale. Then, the multi-scale features as well as the depth of the 3D point are combined into a new feature vector, and the fully connected network classifies the feature vector in order to predict whether the point lies inside or outside of the 3D mesh. The advantage of our method is that we use both local and global features in the fully connected network and represent the 3D mesh by an implicit function, which is more memory-efficient. Experiments on public datasets demonstrate that our method surpasses previous approaches in terms of the accuracy of 3D reconstruction of human bodies from images.
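
A compact sketch of the implicit-function head (feature dimensions are illustrative; the hourglass extraction and multi-view pooling are abstracted away): a fully connected network maps concatenated multi-scale point features plus the point's depth to an inside/outside probability:

```python
import torch
import torch.nn as nn

class ImplicitHead(nn.Module):
    """MLP that classifies whether a 3D point lies inside the body mesh,
    from concatenated multi-scale image features and the point's depth."""
    def __init__(self, feat_dims=(64, 128, 256), hidden=256):
        super().__init__()
        in_dim = sum(feat_dims) + 1  # multi-scale features + depth of the 3D point
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, multi_scale_feats, depth):
        # multi_scale_feats: tuple of (B, Ci) tensors; depth: (B, 1)
        x = torch.cat(list(multi_scale_feats) + [depth], dim=-1)
        return torch.sigmoid(self.mlp(x))  # P(point is inside the mesh)
```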

Face Image Quality Assessment for Model and Human Perception

Ken Chen, Yichao Wu, Zhenmao Li, Yudong Wu, Ding Liang

Auto-TLDR; A labour-saving method for FIQA training with contradictory data from multiple sources

Practical face image quality assessment (FIQA) models are trained under the supervision of labeled data, which requires more or less human labor. Human-labeled quality scores are consistent with perceptual intuition but laborious to obtain. On the other hand, models can be trained with data generated automatically by recognition models with artificially selected references. However, the recognition scores are sometimes inaccurate, which may produce wrong quality scores during FIQA training. In this paper, we propose a labour-saving method for quality score generation. For the first time, we conduct systematic investigations to show that there exist severe contradictions between different types of target quality, namely a distribution gap (DG). To bridge the gap, we propose a novel framework for training FIQA models that combines the merits of data from different sources. In order to make the target scores from multiple sources compatible, we design a method called quality distribution alignment (QDA). Meanwhile, to correct wrong targets from recognition models, contradictory samples selection (CSS) is adopted to select samples from the human-labeled dataset adaptively. Extensive experiments and analysis on public benchmarks including MegaFace demonstrate the superiority of our method in terms of effectiveness and efficiency.

Shape Consistent 2D Keypoint Estimation under Domain Shift

Levi Vasconcelos, Massimiliano Mancini, Davide Boscaini, Barbara Caputo, Elisa Ricci

Auto-TLDR; Deep Adaptation for Keypoint Prediction under Domain Shift

Recent unsupervised domain adaptation methods based on deep architectures have shown remarkable performance not only in traditional classification tasks but also in more complex problems involving structured predictions (e.g., semantic segmentation, depth estimation). Following this trend, in this paper we present a novel deep adaptation framework for estimating keypoints under domain shift, i.e., when the training (source) and the test (target) images significantly differ in terms of visual appearance. Our method seamlessly combines three different components: feature alignment, adversarial training and self-supervision. Specifically, our deep architecture leverages domain-specific distribution alignment layers to perform target adaptation at the feature level. Furthermore, a novel loss is proposed which combines an adversarial term for ensuring aligned predictions in the output space and a geometric consistency term which guarantees coherent predictions between a target sample and its perturbed version. Our extensive experimental evaluation conducted on three publicly available benchmarks shows that our approach outperforms state-of-the-art domain adaptation methods in the 2D keypoint prediction task.
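
The geometric consistency term can be read as an equivariance constraint: keypoints predicted on a perturbed image, mapped back through the inverse transform, should match those predicted on the original. A sketch under that reading (the transform interface is hypothetical):

```python
def geometric_consistency_loss(model, images, transform, inverse_transform):
    """Equivariance constraint: keypoints predicted on a perturbed image,
    warped back by the inverse transform, should match the originals."""
    kpts = model(images)                                   # (B, K, 2) keypoints
    kpts_perturbed = model(transform.apply_to_images(images))
    restored = inverse_transform.apply_to_points(kpts_perturbed)
    return ((kpts - restored) ** 2).mean()
```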

A Flatter Loss for Bias Mitigation in Cross-Dataset Facial Age Estimation

Ali Akbari, Muhammad Awais, Zhenhua Feng, Ammarah Farooq, Josef Kittler

Auto-TLDR; Cross-dataset Age Estimation for Neural Network Training

Existing studies in facial age estimation have mostly focused on intra-dataset protocols that assume training and test images are captured under similar conditions. However, this is rarely valid in practical applications, where training and test sets usually have different characteristics. In this paper, we advocate a cross-dataset protocol for age estimation benchmarking. In order to improve cross-dataset age estimation performance, we mitigate the inherent bias caused by the learning algorithm. To this end, we propose a novel loss function that is more effective for neural network training. The relative smoothness of the proposed loss function is its advantage with regard to the optimisation performed by stochastic gradient descent. Its lower gradient, compared with existing loss functions, facilitates the discovery of and convergence to a better optimum, and consequently better generalisation. The cross-dataset experimental results demonstrate the superiority of the proposed method over state-of-the-art algorithms in terms of accuracy and generalisation capability.

Local-Global Interactive Network for Face Age Transformation

Jie Song, Ping Wei, Huan Li, Yongchi Zhang, Nanning Zheng

Auto-TLDR; A Novel Local-Global Interaction Framework for Long-span Face Age Transformation

Face age transformation, which aims to generate a face image in the past or future, has received increasing attention due to its significant application value in special fields such as searching for a lost child, tracking criminals, and entertainment. Currently, most existing methods mainly focus on unidirectional, short-span face aging. In this paper, we propose a novel local-global interaction framework for long-span face age transformation. First, we divide a face image into five independent parts and design a local generative network for each of them to learn the local structure changes of a face image, while we utilize a global generative network to learn the global structure changes. Then we introduce an interactive network and an age classification network, which are respectively used to integrate the local and global features and to maintain the corresponding age features in different age groups. Given any face image at a certain age, our network can produce a clear and realistic image of face aging or rejuvenation. We test and evaluate the model on complex datasets, and extensive qualitative comparison experiments have proved the effectiveness and immense potential of our proposed method.

SSDL: Self-Supervised Domain Learning for Improved Face Recognition

Samadhi Poornima Kumarasinghe Wickrama Arachchilage, Ebroul Izquierdo

Auto-TLDR; Self-supervised Domain Learning for Face Recognition in unconstrained environments

Face recognition in unconstrained environments is challenging due to variations in illumination, quality of sensing, motion blur, and so on. An individual's face appearance can vary drastically under different conditions, creating a gap between the training (source) and varying test (target) data. This domain gap can decrease performance when knowledge is transferred directly from source to target. Although fine-tuning with domain-specific data could be an effective solution, collecting and annotating data for all domains is extremely expensive. To this end, we propose a self-supervised domain learning (SSDL) scheme that trains on triplets mined from unlabelled data. A key factor in effective discriminative learning is selecting informative triplets. Building on the most confident predictions, we follow an "easy-to-hard" scheme of alternating triplet mining and self-learning. Comprehensive experiments on four different benchmarks show that SSDL generalizes well to different domains.
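
A sketch of one self-training round under the easy-to-hard reading (the pseudo-labeling and threshold schedule are assumed details): triplets are mined only from the most confident predictions on unlabelled target data, then fed to a standard triplet loss:

```python
import torch
import torch.nn.functional as F

def mine_confident_triplets(embeddings, pseudo_labels, confidences, threshold):
    """Keep only samples whose pseudo-label confidence exceeds the current
    threshold; pair same-label anchors/positives with different-label negatives."""
    keep = confidences > threshold
    emb, lab = embeddings[keep], pseudo_labels[keep]
    triplets = []
    for i in range(len(lab)):
        pos = (lab == lab[i]).nonzero().flatten()
        neg = (lab != lab[i]).nonzero().flatten()
        if len(pos) > 1 and len(neg) > 0:
            p = pos[pos != i][0]
            triplets.append((emb[i], emb[p], emb[neg[0]]))
    return triplets

def triplet_loss(triplets, margin=0.2):
    """Standard margin-based triplet loss over the mined triplets."""
    a, p, n = (torch.stack(t) for t in zip(*triplets))
    return F.relu((a - p).norm(dim=1) - (a - n).norm(dim=1) + margin).mean()
```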

Joint Supervised and Self-Supervised Learning for 3D Real World Challenges

Antonio Alliegro, Davide Boscaini, Tatiana Tommasi

Auto-TLDR; Self-supervision for 3D Shape Classification and Segmentation in Point Clouds

Point cloud processing and 3D shape understanding are very challenging tasks for which deep learning techniques have demonstrated great potential. Still, further progress is essential to allow artificial intelligent agents to interact with the real world. In many practical conditions the amount of annotated data may be limited, and integrating new sources of knowledge becomes crucial to support autonomous learning. Here we consider several scenarios involving synthetic and real-world point clouds where supervised learning fails due to data scarcity and large domain gaps. We propose to enrich standard feature representations by leveraging self-supervision through a multi-task model that can solve a 3D puzzle while learning the main task of shape classification or part segmentation. An extensive analysis investigating few-shot, transfer learning and cross-domain settings shows the effectiveness of our approach with state-of-the-art results for 3D shape classification and part segmentation.

Controllable Face Aging

Haien Zeng, Hanjiang Lai

Auto-TLDR; A controllable face aging method via attribute disentanglement generative adversarial network

Motivated by two observations, namely that 1) people age differently under different conditions for changeable facial attributes (e.g., skin color may become darker when working outside), and 2) some unchanged facial attributes (e.g., race and gender) need to be kept during the aging process, we propose a controllable face aging method via an attribute disentanglement generative adversarial network. To offer fine control over the synthesized face images, first, an individual embedding of the face is learned directly from an image that contains the desired facial attribute. Second, since the image may contain other unwanted attributes, an attribute disentanglement network is used to separate out the individual embedding and learn a common embedding that contains information about the face attribute (e.g., race). With the common embedding, we can manipulate the generated face image with the desired attribute in an explicit manner. Experimental results on two common benchmarks demonstrate that our proposed generator achieves comparable aging-effect performance to state-of-the-art baselines while gaining more flexibility for attribute control. Code is available in the supplementary material.

3D Facial Matching by Spiral Convolutional Metric Learning and a Biometric Fusion-Net of Demographic Properties

Soha Sadat Mahdi, Nele Nauwelaers, Philip Joris, Giorgos Bouritsas, Sergiy Bokhnyak, Susan Walsh, Mark Shriver, Michael Bronstein, Peter Claes

Auto-TLDR; Multi-biometric Fusion for Biometric Verification using 3D Facial Meshes

Face recognition is a widely accepted biometric verification tool, as the face contains a lot of information about the identity of a person. In this study, a two-step neural-based pipeline is presented for matching 3D facial shape to multiple DNA-related properties (sex, age, BMI and genomic background). The first step consists of a triplet-loss-based metric learner that compresses facial shape into a lower-dimensional embedding while preserving information about the property of interest. Most studies in the field of metric learning have focused only on Euclidean data. In this work, geometric deep learning is employed to learn directly from 3D facial meshes. To this end, spiral convolutions are used along with a novel mesh-sampling scheme that retains uniformly sampled 3D points at different levels of resolution. The second step is multi-biometric fusion by a fully connected neural network. The network takes an ensemble of embeddings and property labels as input and returns genuine and imposter scores. Since embeddings are accepted as input, there is no need to train classifiers for the different properties, and the available data can be used more efficiently. Results obtained by 10-fold cross-validation for biometric verification show that combining multiple properties leads to stronger biometric systems. Furthermore, the proposed neural-based pipeline outperforms a linear baseline, which consists of principal component analysis followed by classification with linear support vector machines and a Naïve Bayes-based score fuser.

Pose-Robust Face Recognition by Deep Meta Capsule Network-Based Equivariant Embedding

Fangyu Wu, Jeremy Simon Smith, Wenjin Lu, Bailing Zhang

Auto-TLDR; Deep Meta Capsule Network-based Equivariant Embedding Model for Pose-Robust Face Recognition

Despite the exceptional success of face recognition technologies, handling large pose variations still remains a key challenge. Current techniques for pose-robust face recognition either directly extract pose-invariant features or first synthesize a face that matches the target pose before feature extraction. It is more desirable to learn face representations equivariant to pose variations. To this end, this paper proposes a deep meta Capsule network-based Equivariant Embedding Model (DM-CEEM) with three distinct novelties. First, we introduce a new version of a Capsule network called RB-CapsNet, which extends CapsNet to perform a profile-to-frontal face transformation in deep feature space. Second, RB-CapsNet allows DM-CEEM to learn an equivariant embedding for pose variations and achieve the desired transformation for input face images. Third, we train DM-CEEM in a meta way by treating a single overall classification target as multiple sub-tasks that satisfy certain unknown probabilities, sampling the support and query sets randomly in each sub-task. The experimental results on both controlled and in-the-wild databases demonstrate the superiority of DM-CEEM over the state-of-the-art.

Face Super-Resolution Network with Incremental Enhancement of Facial Parsing Information

Shuang Liu, Chengyi Xiong, Zhirong Gao

Auto-TLDR; Learning-based Face Super-Resolution with Incremental Boosting Facial Parsing Information

Recently, facial-prior-based face super-resolution (SR) methods have obtained significant performance gains in dealing with extremely degraded facial images, and facial priors have also been proven useful in facilitating the inference of face images. Based on this, how to fully fuse facial priors into deep features to improve face SR performance has attracted major attention. In this paper, we propose a learning-based face SR approach with incremental boosting of facial parsing information (IFPSR) for high-magnification super-resolution of low-resolution faces. The proposed IFPSR method consists of three main parts: i) a three-stage parsing-map-embedded feature upsampling network, in which image recovery and prior estimation are performed simultaneously and progressively to improve the image resolution; ii) a progressive training method and a joint facial attention and heatmap loss to obtain better facial attributes; iii) a channel attention strategy in residual dense blocks to adaptively learn facial features. Extensive experimental results show that, compared with state-of-the-art methods in terms of quantitative and qualitative metrics, our approach achieves an outstanding balance between SR image quality and low network complexity.

Orthographic Projection Linear Regression for Single Image 3D Human Pose Estimation

Yahui Zhang, Shaodi You, Theo Gevers

Auto-TLDR; A Deep Neural Network for 3D Human Pose Estimation from a Single 2D Image in the Wild

3D human pose estimation from a single 2D image in the wild is an important yet extremely challenging computer vision task. Unlike images taken from indoor and well-constrained environments, 2D outdoor images in the wild are extremely complex because of varying imaging conditions. Furthermore, 2D images usually do not have corresponding 3D pose ground truth, making a supervised approach ill-constrained. Therefore, in this paper, we propose to associate the 3D human pose, the 2D human pose projection and the 2D image appearance through a new orthographic-projection-based linear regression module. Unlike existing reprojection-based approaches, our orthographic projection and regression do not suffer from small-angle problems, which usually lead to overfitting in the depth dimension. Hence, we propose a deep neural network which adopts the 2D pose, 3D pose regression and orthographic projection linear regression modules. The proposed method shows state-of-the-art performance on the Human3.6M dataset and generalizes well to in-the-wild images.
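
The orthographic (weak-perspective) projection at the core of the regression module has a closed form, and fitting its scale and translation to observed 2D joints is linear least squares; a sketch under common conventions (camera along z, a single shared scalar scale), not the authors' exact module:

```python
import numpy as np

def orthographic_project(points_3d, scale, trans_2d):
    """Weak-perspective (orthographic) projection: drop depth, then
    scale and translate in the image plane. points_3d: (N, 3)."""
    return scale * points_3d[:, :2] + trans_2d  # (N, 2)

def fit_scale_translation(points_3d, points_2d):
    """Least-squares scale/translation aligning a projected 3D pose to
    observed 2D joints: the linear-regression view of reprojection."""
    x = points_3d[:, :2]
    xc = x - x.mean(axis=0)
    yc = points_2d - points_2d.mean(axis=0)
    s = (xc * yc).sum() / (xc ** 2).sum()       # closed-form OLS scale
    t = points_2d.mean(axis=0) - s * x.mean(axis=0)
    return s, t
```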

Adaptive Feature Fusion Network for Gaze Tracking in Mobile Tablets

Yiwei Bao, Yihua Cheng, Yunfei Liu, Feng Lu

Auto-TLDR; Adaptive Feature Fusion Network for Multi-stream Gaze Estimation in Mobile Tablets

Recently, many multi-stream gaze estimation methods have been proposed. They estimate gaze from eye and face appearance and achieve reasonable accuracy. However, most of these methods simply concatenate the features extracted from eye and face appearance; the feature fusion process has been ignored. In this paper, we propose a novel Adaptive Feature Fusion Network (AFF-Net), which performs the gaze tracking task on mobile tablets. We stack two-eye feature maps and utilize Squeeze-and-Excitation layers to adaptively fuse the two-eye features based on their appearance. Meanwhile, we also propose Adaptive Group Normalization to recalibrate eye features with the guidance of face appearance characteristics. Extensive experiments on both the GazeCapture and MPIIFaceGaze datasets demonstrate the consistently superior performance of the proposed method.
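
The Squeeze-and-Excitation recalibration used for two-eye fusion follows the standard SE pattern; a generic sketch, assuming the two eye feature maps are stacked on the channel axis (not the authors' exact layer):

```python
import torch
import torch.nn as nn

class SEFusion(nn.Module):
    """Stack left/right-eye feature maps on the channel axis and let a
    Squeeze-and-Excitation layer reweight channels adaptively."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(2 * channels, 2 * channels // reduction), nn.ReLU(),
            nn.Linear(2 * channels // reduction, 2 * channels), nn.Sigmoid(),
        )

    def forward(self, left_feat, right_feat):
        x = torch.cat([left_feat, right_feat], dim=1)   # (B, 2C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))                 # squeeze: global average pool
        return x * w.unsqueeze(-1).unsqueeze(-1)        # excite: channel reweighting
```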

Coarse to Fine: Progressive and Multi-Task Learning for Salient Object Detection

Dong-Goo Kang, Sangwoo Park, Joonki Paik

Auto-TLDR; Progressive and multi-task learning scheme for salient object detection

Most deep learning-based salient object detection (SOD) methods try to manipulate the convolution blocks to effectively capture the context of objects. In this paper, we propose a novel method, called a progressive and multi-task learning scheme, to extract object context by manipulating only the learning scheme, without changing the network architecture. The progressive learning scheme grows the decoder progressively during the training phase: starting from easier low-resolution layers, it gradually adds high-resolution layers. Although progressive learning successfully captures the context of objects, its output boundaries tend to be rough. To solve this problem, we also propose a multi-task learning (MTL) scheme that jointly processes the object saliency map and contour in a single network. The proposed MTL scheme trains the network in an edge-preserving direction through an auxiliary branch that learns contours. The proposed learning scheme can be combined with other convolution block manipulation methods. Extensive experiments on five datasets show that the proposed method performs best compared with state-of-the-art methods in most cases.

Two-Stream Temporal Convolutional Network for Dynamic Facial Attractiveness Prediction

Nina Weng, Jiahao Wang, Annan Li, Yunhong Wang

Auto-TLDR; 2S-TCN: A Two-Stream Temporal Convolutional Network for Dynamic Facial Attractiveness Prediction

In the field of facial attractiveness prediction, deep models using static pictures have shown promising results, but little attention has been paid to dynamic facial information, which psychological studies have shown to be influential. Meanwhile, the increasing popularity of short video apps creates an enormous demand for facial attractiveness prediction from short video clips. In this paper, we target the dynamic facial attractiveness prediction problem. To begin with, a large-scale video-based facial attractiveness prediction dataset (VFAP) with more than one thousand clips from TikTok is collected. A two-stream temporal convolutional network (2S-TCN) is then proposed to capture dynamic attractiveness features from both facial appearance and landmarks. We employ attentive feature enhancement along with specially designed modality and temporal fusion strategies to better explore the temporal dynamics. Extensive experiments on the proposed VFAP dataset demonstrate that 2S-TCN has a distinct advantage over state-of-the-art static prediction methods.

Weakly Supervised Body Part Segmentation with Pose Based Part Priors

Zhengyuan Yang, Yuncheng Li, Linjie Yang, Ning Zhang, Jiebo Luo

Auto-TLDR; Weakly Supervised Body Part Segmentation Using Weak Labels

Human body part segmentation refers to the task of predicting the semantic segmentation mask for each body part. Fully supervised body part segmentation methods achieve good performance but require an enormous amount of effort to annotate part masks for training. In contrast to the high annotation costs of part masks, a large number of weak labels such as poses and full-body masks already exist and contain relevant information. Motivated by the possibility of using existing weak labels, we propose the first weakly supervised body part segmentation framework. The core idea is to first convert sparse weak labels such as keypoints into an initial estimate of the body part masks, and then iteratively refine the part mask predictions. We name the initial part masks estimated from poses the "part priors". With sufficient extra weak labels, our weakly supervised framework achieves a comparable performance (62.0% mIoU) to the fully supervised method (63.6% mIoU) on the Pascal-Person-Part dataset. Furthermore, in the extended semi-supervised setting, the proposed framework outperforms state-of-the-art methods. Moreover, we extend our framework to other keypoint-supervised part segmentation tasks such as face parsing.

Coherence and Identity Learning for Arbitrary-Length Face Video Generation

Shuquan Ye, Chu Han, Jiaying Lin, Guoqiang Han, Shengfeng He

Auto-TLDR; Face Video Synthesis Using Identity-Aware GAN and Face Coherence Network

Slides Poster Similar

Face synthesis is an interesting yet challenging task in computer vision. It is even harder to generate a portrait video than a single image. In this paper, we propose a novel video generation framework for synthesizing arbitrary-length face videos without any face exemplar or landmark. To overcome the synthesis ambiguity of face video, we propose a divide-and-conquer strategy that separately addresses the video face synthesis problem from two aspects: face identity synthesis and rearrangement. To this end, we design a cascaded network with three components: an Identity-aware GAN (IA-GAN), a Face Coherence Network, and an Interpolation Network. IA-GAN is proposed to synthesize photorealistic faces with the same identity from a set of noises. The Face Coherence Network is designed to re-arrange the faces generated by IA-GAN while keeping the inter-frame coherence. The Interpolation Network is introduced to eliminate the discontinuity between two adjacent frames and improve the smoothness of the face video. Experimental results demonstrate that our proposed network is able to generate face videos with high visual quality while preserving identity. Statistics show that our method outperforms state-of-the-art unconditional face video generative models on multiple challenging datasets.
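
A hedged sketch of the three-stage cascade as plain function composition follows; every module interface here (`ia_gan`, `coherence_net`, `interp_net`), the noise dimension, and the frame counts are hypothetical placeholders for the paper's networks, shown only to make the data flow explicit.

```python
import torch

def generate_face_video(ia_gan, coherence_net, interp_net, identity_code,
                        num_keyframes=8, frames_between=3):
    # Stage 1: synthesize faces sharing one identity from a set of noises.
    noises = torch.randn(num_keyframes, 512)
    frames = ia_gan(noises, identity_code)
    # Stage 2: re-arrange the frames so motion is temporally coherent.
    frames = coherence_net(frames)
    # Stage 3: insert interpolated frames to smooth adjacent keyframes.
    video = []
    for a, b in zip(frames[:-1], frames[1:]):
        video.append(a)
        video.extend(interp_net(a, b, steps=frames_between))
    video.append(frames[-1])
    return torch.stack(video)
```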

Deep Gait Relative Attribute Using a Signed Quadratic Contrastive Loss

Yuta Hayashi, Shehata Allam, Yasushi Makihara, Daigo Muramatsu, Yasushi Yagi

Auto-TLDR; Signal-Contrastive Loss for Gait Attributes Estimation

Similar

This paper presents a deep learning-based method to estimate gait attributes (e.g., stately, cool, relaxed). As in existing studies on relative attributes, human perception-based annotations on the gait attributes are given to pairs of gait videos (i.e., the first one is better, tie, or the second one is better), and these relative annotations are utilized to train a ranking model of the gait attribute. More specifically, we design a Siamese (i.e., two-stream) network which takes a pair of gait inputs and outputs a gait attribute score for each. We then introduce a suitable loss function, called a signed quadratic contrastive loss, to train the network parameters with the relative annotations. Unlike existing loss functions for learning to rank, which do not inherit the nice properties of a quadratic contrastive loss, the proposed signed quadratic contrastive loss function does. The quantitative evaluation results reveal that the proposed method achieves better or comparable accuracy of relative attribute prediction against the baseline methods.
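
The abstract does not state the loss formula, but one plausible reading is sketched below: ties pull the two scores together quadratically, while ordered pairs push the signed score difference past a margin, squared-hinge style. Treat the exact form and the margin value as assumptions.

```python
import torch

def signed_quadratic_contrastive_loss(s1, s2, y, margin=1.0):
    # y: tensor in {+1, 0, -1} (first better / tie / second better);
    # s1, s2: attribute scores from the two Siamese streams.
    d = s1 - s2
    tie_loss = d.pow(2)                                       # pull ties together
    ordered_loss = torch.clamp(margin - y * d, min=0).pow(2)  # squared hinge
    return torch.where(y == 0, tie_loss, ordered_loss).mean()
```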

MRP-Net: A Light Multiple Region Perception Neural Network for Multi-Label AU Detection

Yang Tang, Shuang Chen, Honggang Zhang, Gang Wang, Rui Yang

Auto-TLDR; MRP-Net: A Fast and Light Neural Network for Facial Action Unit Detection

Slides Poster Similar

Facial Action Units (AUs) are of great significance in communication. Automatic AU detection can improve the understanding of psychological condition and emotional status. Recently, a number of deep learning methods have been proposed to tackle problems in automatic AU detection. Several challenges, such as unbalanced labels and the neglect of local information, remain to be addressed. In this paper, we propose a fast and light neural network called MRP-Net, an end-to-end trainable method for facial AU detection that solves these problems. First, we design a Multiple Region Perception (MRP) module aimed at capturing features of different locations and sizes in the deeper levels of the network without facial landmark points. Then, to balance the positive and negative samples in the large dataset, a batch-balancing method that adjusts the weight of every sample in a batch in our loss function is proposed. Experimental results on two popular AU datasets, BP4D and DISFA, show that MRP-Net outperforms state-of-the-art methods. Compared with the best prior method, MRP-Net not only improves the average F1 score by 2.95% on BP4D and 5.43% on DISFA, but also decreases the number of network parameters by 54.62% and the number of network FLOPs by 19.6%.
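
One common way to realize such batch-level balancing is to re-weight each positive and negative entry inversely to its within-batch frequency, as in the PyTorch sketch below; this is a generic formulation, not necessarily MRP-Net's exact loss.

```python
import torch.nn.functional as F

def batch_balanced_bce(logits, targets, eps=1e-6):
    # targets: (batch, num_aus) float tensor in {0, 1}.
    pos_frac = targets.mean(dim=0, keepdim=True)      # per-AU positive rate
    w_pos = 1.0 / (pos_frac + eps)                    # rare positives weigh more
    w_neg = 1.0 / (1.0 - pos_frac + eps)
    weights = targets * w_pos + (1 - targets) * w_neg
    return F.binary_cross_entropy_with_logits(logits, targets, weight=weights)
```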

Self-Supervised Joint Encoding of Motion and Appearance for First Person Action Recognition

Mirco Planamente, Andrea Bottino, Barbara Caputo

Auto-TLDR; A Single Stream Architecture for Egocentric Action Recognition from the First-Person Point of View

Slides Poster Similar

Wearable cameras are becoming more and more popular in several applications, increasing the interest of the research community in developing approaches for recognizing actions from the first-person point of view. An open challenge in egocentric action recognition is that videos lack detailed information about the main actor's pose and thus tend to record only parts of the movement when focusing on manipulation tasks. As a result, the amount of information about the action itself is limited, making the understanding of the manipulated objects and their context crucial. Many previous works addressed this issue with two-stream architectures, where one stream is dedicated to modeling the appearance of objects involved in the action, and another to extracting motion features from optical flow. In this paper, we argue that learning features jointly from these two information channels is beneficial for better capturing their spatio-temporal correlations. To this end, we propose a single-stream architecture able to do so, thanks to the addition of a self-supervised block that uses a pretext motion prediction task to intertwine motion and appearance knowledge. Experiments on several publicly available databases show the power of our approach.
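
The sketch below illustrates the single-stream idea: one shared backbone feeds both the action classifier and an auxiliary head trained on a motion-prediction pretext task, with the two losses summed. All module names, feature shapes, and the loss weight are illustrative assumptions.

```python
import torch.nn as nn

class JointModel(nn.Module):
    def __init__(self, backbone, num_actions):
        super().__init__()
        self.backbone = backbone                       # shared RGB encoder
        self.action_head = nn.Linear(512, num_actions)
        self.motion_head = nn.Conv2d(512, 2, kernel_size=1)  # flow-like (dx, dy)

    def forward(self, rgb):
        fmap = self.backbone(rgb)                      # assumed (B, 512, H, W)
        pooled = fmap.mean(dim=(2, 3))
        return self.action_head(pooled), self.motion_head(fmap)

def joint_loss(action_logits, flow_pred, labels, flow_gt, lam=0.5):
    ce = nn.functional.cross_entropy(action_logits, labels)
    pretext = nn.functional.mse_loss(flow_pred, flow_gt)  # self-supervised term
    return ce + lam * pretext
```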

Attribute-Based Quality Assessment for Demographic Estimation in Face Videos

Fabiola Becerra-Riera, Annette Morales-González, Heydi Mendez-Vazquez, Jean-Luc Dugelay

Auto-TLDR; Facial Demographic Estimation in Video Scenarios Using Quality Assessment

Slides Similar

Most existing works on facial demographic estimation focus on still image datasets, although the need to analyze video content in real applications is increasing. We propose to tackle gender, age and ethnicity estimation in the context of video scenarios. Our main contribution is to use an attribute-specific quality assessment procedure to select the best-quality frames from a video sequence for each of the three demographic modalities. The best-quality frames are classified with fine-tuned MobileNet models, and a final video prediction is obtained with a majority voting strategy among the selected frames. Our validation on three different datasets and our comparison with state-of-the-art models show the effectiveness of the proposed demographic classifiers and of the quality pipeline, which reduces both the number of frames to be classified and the processing time in practical applications, and improves the soft biometrics prediction accuracy.
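
The per-attribute pipeline reduces to a few lines of plain Python: score every frame with an attribute-specific quality function, classify only the top-ranked frames, and majority-vote the result. The `quality_fn` and `classifier` callables and the value of `k` below are placeholders for the paper's models, not its published settings.

```python
from collections import Counter

def predict_video(frames, quality_fn, classifier, k=5):
    # Keep the k frames this attribute's quality model rates highest.
    best = sorted(frames, key=quality_fn, reverse=True)[:k]
    # Classify only the selected frames, then take the majority label.
    votes = [classifier(f) for f in best]
    return Counter(votes).most_common(1)[0][0]
```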

Real-Time Driver Drowsiness Detection Using Facial Action Units

Malaika Vijay, Nandagopal Netrakanti Vinayak, Maanvi Nunna, Subramanyam Natarajan

Auto-TLDR; Real-Time Detection of Driver Drowsiness using Facial Action Units using Extreme Gradient Boosting

Slides Poster Similar

This paper presents a two-stage, vision-based pipeline for the real-time detection of driver drowsiness using Facial Action Units (FAUs). FAUs capture movements of groups of muscles in the face, such as widening of the eyes or dropping of the jaw. The first stage of the pipeline employs a Convolutional Neural Network (CNN) trained to detect FAUs. The output of the penultimate layer of this network serves as an image embedding that captures features relevant to FAU detection. These embeddings are then used to predict drowsiness with an Extreme Gradient Boosting (XGBoost) classifier. A separate XGBoost model is trained for each user of the system so that behavior specific to each user can be modeled into the drowsiness classifier. We show that user-specific classifiers require very little data and low training time to yield high prediction accuracy in real time.
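
The second stage is straightforward to sketch: stack penultimate-layer CNN embeddings into a feature matrix and fit one XGBoost classifier per driver. The `embed` function and the hyperparameters below are illustrative placeholders, not the paper's reported configuration.

```python
import numpy as np
from xgboost import XGBClassifier

def train_user_model(face_crops, drowsy_labels, embed):
    # embed(): penultimate-layer output of the FAU CNN for one face crop.
    X = np.stack([embed(f) for f in face_crops])   # (n_samples, embed_dim)
    y = np.asarray(drowsy_labels)                  # 0 = alert, 1 = drowsy
    model = XGBClassifier(n_estimators=100, max_depth=4)
    model.fit(X, y)
    return model   # one such model is kept per driver
```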

Multi-Label Contrastive Focal Loss for Pedestrian Attribute Recognition

Xiaoqiang Zheng, Zhenxia Yu, Lin Chen, Fan Zhu, Shilong Wang

Auto-TLDR; Multi-label Contrastive Focal Loss for Pedestrian Attribute Recognition

Slides Poster Similar

Pedestrian Attribute Recognition (PAR) has received extensive attention during the past few years. With the advances of deep convolutional neural networks (CNNs), the performance of PAR has been significantly improved. Existing methods tend to acquire attribute-specific features by designing various complex network structures with additional modules. Such additional modules, however, dramatically increase the number of parameters. Meanwhile, the problems of class imbalance and hard attribute retrieval remain underestimated in PAR. In this paper, we explore the optimization mechanism of the training process to account for these problems and propose a new loss function called Multi-label Contrastive Focal Loss (MCFL). The proposed MCFL emphasizes hard and minority attributes by using a separate re-weighting mechanism for positive and negative classes to alleviate the impact of the imbalance. MCFL also enlarges the gaps between classes of multi-label attributes, forcing CNNs to extract more subtle discriminative features. We evaluate the proposed MCFL on three large public pedestrian datasets: RAP, PA-100K, and PETA. The experimental results indicate that the proposed MCFL with a ResNet-50 backbone outperforms other state-of-the-art approaches.
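
The focal, imbalance-aware part of such a loss can be sketched as below: per-entry probabilities are modulated by a focusing factor and re-weighted separately for positives and negatives. The contrastive term of MCFL is omitted, and the `gamma` and `alpha_pos` values are assumptions, not the paper's settings.

```python
import torch

def multilabel_focal_loss(logits, targets, gamma=2.0, alpha_pos=0.75):
    # targets: (batch, num_attributes) float tensor in {0, 1}.
    p = torch.sigmoid(logits)
    # p_t: predicted probability of the true label for each entry.
    pt = p * targets + (1 - p) * (1 - targets)
    # Separate weights for the positive and negative entries of each attribute.
    alpha = alpha_pos * targets + (1 - alpha_pos) * (1 - targets)
    # Focal modulation (1 - p_t)^gamma down-weights easy examples.
    return (-alpha * (1 - pt).pow(gamma) * torch.log(pt.clamp(min=1e-6))).mean()
```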

Exemplar Guided Cross-Spectral Face Hallucination Via Mutual Information Disentanglement

Haoxue Wu, Huaibo Huang, Aijing Yu, Jie Cao, Zhen Lei, Ran He

Auto-TLDR; Exemplar Guided Cross-Spectral Face Hallucination with Structural Representation Learning

Slides Poster Similar

Recently, many near infrared-visible (NIR-VIS) heterogeneous face recognition (HFR) methods have been proposed in the community, but the task remains challenging because of the sensing gap along with large pose variations. In this paper, we propose an Exemplar Guided Cross-Spectral Face Hallucination (EGCH) method that reduces the domain discrepancy through disentangled representation learning. For each modality, EGCH contains a spectral encoder and a structure encoder to disentangle the spectral and structure representations, respectively. It also contains a traditional generator that reconstructs the input from these two representations, and a structure generator that predicts the facial parsing map from the structure representation. Besides, mutual information minimization and maximization are conducted to boost disentanglement and make the representations adequately expressive. The translation between the two modalities is then built on the structure representations. Provided with the transformed NIR structure representation and the original VIS spectral representation, EGCH is capable of producing high-fidelity VIS images that preserve the topological structure of the input NIR image while transferring the spectral information of an arbitrary VIS exemplar. Extensive experiments demonstrate that the proposed method achieves more promising results, both qualitatively and quantitatively, than state-of-the-art NIR-VIS methods.
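
At inference time, the recombination step amounts to encoding structure from the NIR input, translating it into the VIS domain, encoding spectral style from a VIS exemplar, and decoding the pair, as in the hedged sketch below. All module names and interfaces are assumptions used only to show the data flow.

```python
def hallucinate_vis(nir_image, vis_exemplar, nets):
    # Hypothetical module dictionary standing in for EGCH's sub-networks.
    structure = nets["nir_structure_enc"](nir_image)    # layout / topology
    structure = nets["nir2vis_translator"](structure)   # cross-modal mapping
    spectrum = nets["vis_spectral_enc"](vis_exemplar)   # color / spectral style
    return nets["vis_generator"](structure, spectrum)   # fused VIS output
```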

DmifNet:3D Shape Reconstruction Based on Dynamic Multi-Branch Information Fusion

Lei Li, Suping Wu

Auto-TLDR; DmifNet: Dynamic Multi-branch Information Fusion Network for 3D Shape Reconstruction from a Single-View Image

Slides Similar

3D object reconstruction from a single-view image is a long-standing challenging problem. Previous works struggle to accurately reconstruct 3D shapes with complex topology and rich details at the edges and corners. Moreover, previous works use synthetic data to train their networks, so domain adaptation problems occur when testing on real data. In this paper, we propose a Dynamic Multi-branch Information Fusion Network (DmifNet) which can recover a high-fidelity 3D shape of arbitrary topology from a 2D image. Specifically, we design several side branches from the intermediate layers to make the network produce more diverse representations and improve its generalization ability. In addition, we utilize the Difference of Gaussians (DoG) to extract edge geometry and corner information from input images. Then, we use a separate side branch network to process the extracted data to better capture edge geometry and corner features. Finally, we dynamically fuse the information of all branches to obtain the final predicted probability. Extensive qualitative and quantitative experiments on a large-scale publicly available dataset demonstrate the validity and efficiency of our method. Code and models are publicly available at https://github.com/leilimaster/DmifNet.
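
The DoG input channel is simple to reproduce: blur the image at two scales and subtract, which yields a band-pass response that is strong near edges and corners. The sigma values below are illustrative, not taken from the paper.

```python
import cv2
import numpy as np

def dog(image_gray, sigma1=1.0, sigma2=2.0):
    img = image_gray.astype(np.float32)        # avoid uint8 wrap-around
    fine = cv2.GaussianBlur(img, (0, 0), sigmaX=sigma1)
    coarse = cv2.GaussianBlur(img, (0, 0), sigmaX=sigma2)
    return fine - coarse   # band-pass response, strong near edges and corners
```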