#### Attentive Hybrid Feature Based a Two-Step Fusion for Facial Expression Recognition

Jun Weng, Yang Yang, Zichang Tan, Zhen Lei
Track 2: Biometrics, Human Analysis and Behavior Understanding
Thu 14 Jan 2021 at 12:00 in session PS T2.4

Auto-TLDR; Attentive Hybrid Architecture for Facial Expression Recognition

Facial expression recognition is inherently a challenging task, especially for in-the-wild images with various occlusions and large pose variations, which may lead to the loss of some crucial information. To address this, we propose an attentive hybrid architecture (AHA) which learns global, local and integrated features based on different face regions. Compared with a single type of feature, our extracted features carry complementary information and can reduce the loss of crucial information. Specifically, AHA contains three branches, where all sub-networks in those branches employ the attention mechanism to further localize the pixels/regions of interest. Moreover, we propose a two-step fusion strategy based on LSTM to deeply explore the hidden correlations among different face regions. Extensive experiments on four popular expression databases (i.e., CK+, FER-2013, SFEW 2.0, RAF-DB) show the effectiveness of the proposed method.

## Similar papers

#### Video-Based Facial Expression Recognition Using Graph Convolutional Networks

Daizong Liu, Hongting Zhang, Pan Zhou
Track 2: Biometrics, Human Analysis and Behavior Understanding
Wed 13 Jan 2021 at 12:00 in session PS T2.2

Auto-TLDR; Graph Convolutional Network for Video-based Facial Expression Recognition

Facial expression recognition (FER), which aims to classify the expression present in a facial image or video, has attracted considerable research interest in the fields of artificial intelligence and multimedia. For the video-based FER task, it is sensible to capture the dynamic expression variation among the frames to recognize the facial expression. However, existing methods directly utilize CNN-RNN or 3D CNN to extract spatial-temporal features from different facial units, instead of concentrating on a certain region while capturing expression variation, which leads to limited performance in FER. In our paper, we introduce a Graph Convolutional Network (GCN) layer into a common CNN-RNN based model for video-based FER. First, the GCN layer is utilized to learn more contributing facial expression features, which concentrate on certain regions after sharing information between nodes that represent CNN-extracted features. Then, an LSTM layer is applied to learn long-term dependencies among the GCN-learned features to model the variation. In addition, a weight assignment mechanism is designed to weight the output of different nodes for final classification by characterizing the expression intensities in each frame. To the best of our knowledge, this is the first use of GCN in the FER task. We evaluate our method on three widely used datasets, CK+, Oulu-CASIA and MMI, and also on one challenging wild dataset, AFEW8.0, and the experimental results demonstrate that our method has superior performance to existing methods.

#### Facial Expression Recognition Using Residual Masking Network

Luan Pham, Vu Huynh, Tuan Anh Tran
Track 5: Image and Signal Processing
Thu 14 Jan 2021 at 16:00 in session PS T5.6

Auto-TLDR; Deep Residual Masking for Automatic Facial Expression Recognition

Automatic facial expression recognition (FER) has gained much attention due to its applications in human-computer interaction. Among the approaches to improve FER tasks, this paper focuses on deep architectures with the attention mechanism. We propose a novel Masking idea to boost the performance of CNNs in the facial expression task. It uses a segmentation network to refine feature maps, enabling the network to focus on relevant information to make correct decisions. In experiments, we combine the ubiquitous Deep Residual Network and a Unet-like architecture to produce a Residual Masking Network. The proposed method holds state-of-the-art (SOTA) accuracy on the well-known FER2013 and private VEMO datasets. Our work is available on GitHub.

#### SAT-Net: Self-Attention and Temporal Fusion for Facial Action Unit Detection

Zhihua Li, Zheng Zhang, Lijun Yin
Track 2: Biometrics, Human Analysis and Behavior Understanding
Tue 12 Jan 2021 at 17:00 in session PS T2.1

Auto-TLDR; Temporal Fusion and Self-Attention Network for Facial Action Unit Detection

Research on facial action unit (AU) detection has shown remarkable performance by using deep spatial learning models in recent years. However, it is far from reaching its full learning capacity because it makes little use of the temporal information of AUs. Since the AU occurrence in one frame is highly likely to be related to previous frames in a temporal sequence, exploring the temporal correlation of AUs across frames becomes a key motivation of this work. In this paper, we propose a novel temporal fusion and AU-supervised self-attention network (a so-called SAT-Net) to address the AU detection problem. First, we input the deep features of a sequence into a convolutional LSTM network, fuse the previous temporal information into the feature map of the last frame, and continue to learn the AU occurrence. Second, considering that AU detection is a multi-label classification problem in which each individual label depends only on certain facial areas, we propose a new self-learned attention mask that focuses the detection of each AU on parts of the face by learning an individual attention mask for each AU, thus increasing AU independence without the loss of any spatial relations. Our extensive experiments show that the proposed framework achieves better AU detection results than the state of the art on two benchmark databases (BP4D and DISFA).

#### Identity-Aware Facial Expression Recognition in Compressed Video

Xiaofeng Liu, Linghao Jin, Xu Han, Jun Lu, Jonghye Woo, Jane You
Track 5: Image and Signal Processing
Wed 13 Jan 2021 at 14:00 in session OS T5.3

Auto-TLDR; Exploring Facial Expression Representation in Compressed Video with Mutual Information Minimization

This paper explores a facial expression representation, free of inter-subject variations, in the compressed video domain. Most previous methods process the RGB images of a sequence, while the valuable, off-the-shelf expression-related muscle movement is already embedded in the compression format. In the domain compressed by up to two orders of magnitude, we can explicitly infer the expression from the residual frames, and it is possible to extract identity factors from the I frame with a pre-trained face recognition network. By enforcing the marginal independence of the two, the expression feature is expected to be purer for the expression and robust to identity shifts. Specifically, we propose a novel collaborative min-min game for mutual information (MI) minimization in latent space. We do not need the identity label or multiple expression samples from the same person for identity elimination. Moreover, when the apex frame is annotated in the dataset, the complementary constraint can be further added to regularize the feature-level game. In testing, only the compressed residual frames are required to achieve expression prediction. Our solution can achieve comparable or better performance than recent decoded-image-based methods on the typical FER benchmarks, with about 3$\times$ faster inference on compressed data.

#### Deep Multi-Task Learning for Facial Expression Recognition and Synthesis Based on Selective Feature Sharing

Rui Zhao, Tianshan Liu, Jun Xiao, P. K. Daniel Lun, Kin-Man Lam
Track 2: Biometrics, Human Analysis and Behavior Understanding
Thu 14 Jan 2021 at 12:00 in session PS T2.4

Auto-TLDR; Multi-task Learning for Facial Expression Recognition and Synthesis

Multi-task learning is an effective learning strategy for deep-learning-based facial expression recognition tasks. However, most existing methods give limited consideration to feature selection when transferring information between different tasks, which may lead to task interference when training the multi-task networks. To address this problem, we propose a novel selective feature-sharing method, and establish a multi-task network for facial expression recognition and facial expression synthesis. The proposed method can effectively transfer beneficial features between different tasks, while filtering out useless and harmful information. Moreover, we employ the facial expression synthesis task to enlarge and balance the training dataset to further enhance the generalization ability of the proposed method. Experimental results show that the proposed method achieves state-of-the-art performance on commonly used facial expression recognition benchmarks, which makes it a potential solution to real-world facial expression recognition problems.

#### Unconstrained Facial Expression Recognition Based on Cascade Decision and Gabor Filters

Yanhong Wu, Lijie Zhang, Guannan Chen, Pablo Navarrete Michelini
Track 2: Biometrics, Human Analysis and Behavior Understanding
Thu 14 Jan 2021 at 14:00 in session OS T2.2

Auto-TLDR; Convolutional Neural Network for Facial Expression Recognition under unconstrained natural conditions

Facial Expression Recognition (FER) research with Convolutional Neural Networks (CNN) has been active, especially under unconstrained natural conditions. From our observation, prior art treats expressions equally in classification, and the recognition accuracy of some expressions is always higher than that of others. In this paper, we make the assumption that an expression with a higher accuracy is easier to recognize, and that the expressions that are easier to recognize hinder the recognition of the harder ones. We then propose a novel algorithm for unconstrained FER based on cascade decision and Gabor filters. Easier expressions are recognized before the difficult expressions. This simple method trains up to five models that recognize the expression of a given facial image in a cascade. The first binary classifier model is for the classification of Happy, which has the highest accuracy. The second binary classifier model is for the classification of Surprise, which has the second-highest accuracy. The third binary classifier model is for the classification of Neutral, which has the third-highest accuracy. The fourth model is for the classification of Sad, which has the fourth-highest accuracy. The final model is a 3-class classifier for Angry, Disgust and Fear. Gabor filters are included in every model to enhance robustness to illumination variations and face poses. Extensive experimental results on several datasets validate the effectiveness of the proposed method. We obtain an accuracy of 77.6% on FER2013 with the final models, outperforming the latest state of the art.
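
The cascade order described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `cascade_predict`, the callable-based model interface, and the omission of the Gabor filters and CNNs are all assumptions made for illustration. Only the decision order (Happy, Surprise, Neutral, Sad, then a 3-class model) comes from the abstract.

```python
# Decision order taken from the abstract: easier expressions are decided first.
CASCADE_ORDER = ["Happy", "Surprise", "Neutral", "Sad"]
FINAL_CLASSES = ["Angry", "Disgust", "Fear"]


def cascade_predict(image, binary_models, final_model):
    """Run the cascade on one image.

    binary_models: dict mapping each label in CASCADE_ORDER to a callable
        that returns True when the image shows that expression (hypothetical
        interface; the real models are CNNs with Gabor filters).
    final_model: callable returning one label from FINAL_CLASSES.
    """
    for label in CASCADE_ORDER:
        # Stop at the first binary classifier that accepts the image.
        if binary_models[label](image):
            return label
    # No binary stage fired, so fall through to the 3-class classifier.
    return final_model(image)
```

A design consequence of this ordering is that an error at an early stage cannot be corrected later, which is presumably why the most reliably recognized expressions are placed first.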

#### Facial Expression Recognition by Using a Disentangled Identity-Invariant Expression Representation

Kamran Ali, Charles Hughes
Track 2: Biometrics, Human Analysis and Behavior Understanding
Tue 12 Jan 2021 at 17:00 in session PS T2.1

Auto-TLDR; Transfer-based Expression Recognition Generative Adversarial Network (TER-GAN)

Facial Expression Recognition (FER) is a challenging task because many factors of variation such as pose, illumination, and identity-specific attributes are entangled with the expression information in an expressive face image. Recent works show that the performance of a FER algorithm can be improved by disentangling the expression information from identity features. In this paper, we present Transfer-based Expression Recognition Generative Adversarial Network (TER-GAN) that combines the effectiveness of a novel feature disentanglement technique with the concept of identity-invariant expression representation learning for facial expression recognition. More specifically, TER-GAN learns a disentangled expression representation by extracting expression features from one image and transferring the expression information to the identity of another image. To improve the feature disentanglement process, and to learn an identity-invariant expression representation, we introduce a novel expression consistency loss and an identity consistency loss that exploit expression and identity information from both real and synthetic images. We evaluated the performance of our proposed facial expression recognition technique by employing five public facial expression databases, CK+, Oulu-CASIA, MMI, BU-3DFE, and BU-4DFE, the latter being used for pre-training. The experimental results show the effectiveness of the proposed technique.

#### Two-Stream Temporal Convolutional Network for Dynamic Facial Attractiveness Prediction

Nina Weng, Jiahao Wang, Annan Li, Yunhong Wang
Track 5: Image and Signal Processing
Fri 15 Jan 2021 at 15:00 in session PS T5.7

Auto-TLDR; 2S-TCN: A Two-Stream Temporal Convolutional Network for Dynamic Facial Attractiveness Prediction

In the field of facial attractiveness prediction, while deep models using static pictures have shown promising results, little attention is paid to dynamic facial information, which psychological studies have shown to be influential. Meanwhile, the increasing popularity of short video apps creates an enormous demand for facial attractiveness prediction from short video clips. In this paper, we target the dynamic facial attractiveness prediction problem. To begin with, a large-scale video-based facial attractiveness prediction dataset (VFAP) with more than one thousand clips from TikTok is collected. A two-stream temporal convolutional network (2S-TCN) is then proposed to capture dynamic attractiveness features from both facial appearance and landmarks. We employ attentive feature enhancement along with specially designed modality and temporal fusion strategies to better explore the temporal dynamics. Extensive experiments on the proposed VFAP dataset demonstrate that 2S-TCN has a distinct advantage over the state-of-the-art static prediction methods.

#### Teacher-Student Training and Triplet Loss for Facial Expression Recognition under Occlusion

Mariana-Iuliana Georgescu, Radu Ionescu
Track 2: Biometrics, Human Analysis and Behavior Understanding
Thu 14 Jan 2021 at 14:00 in session OS T2.2

Auto-TLDR; Knowledge Distillation for Facial Expression Recognition under Occlusion

In this paper, we study the task of facial expression recognition under strong occlusion. We are particularly interested in cases where 50% of the face is occluded, e.g. when the subject wears a Virtual Reality (VR) headset. While previous studies show that pre-training convolutional neural networks (CNNs) on fully-visible (non-occluded) faces improves the accuracy, we propose to employ knowledge distillation to achieve further improvements. First of all, we employ the classic teacher-student training strategy, in which the teacher is a CNN trained on fully-visible faces and the student is a CNN trained on occluded faces. Second of all, we propose a new approach for knowledge distillation based on triplet loss. During training, the goal is to reduce the distance between an anchor embedding, produced by a student CNN that takes occluded faces as input, and a positive embedding (from the same class as the anchor), produced by a teacher CNN trained on fully-visible faces, so that it becomes smaller than the distance between the anchor and a negative embedding (from a different class than the anchor), produced by the student CNN. Third of all, we propose to combine the distilled embeddings obtained through the classic teacher-student strategy and our novel teacher-student strategy based on triplet loss into a single embedding vector. We conduct experiments on two benchmarks, FER+ and AffectNet, with two CNN architectures, VGG-f and VGG-face, showing that knowledge distillation can bring significant improvements over the state-of-the-art methods designed for occluded faces in the VR setting. Furthermore, we obtain accuracy rates that are quite close to the state-of-the-art models that take as input fully-visible faces. 
For example, on the FER+ data set, our VGG-face based on concatenated distilled embeddings attains an accuracy rate of 82.75% on lower-half-visible faces, which is only 2.24% below the accuracy rate of a state-of-the-art VGG-13 that is evaluated on fully-visible faces. Given that our model sees only the lower-half of the face, we consider this to be a remarkable achievement. In conclusion, we consider that our distilled CNN models can provide useful feedback for the task of recognizing the facial expressions of a person wearing a VR headset.
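
The triplet-based distillation objective described above can be sketched as a standard hinge-style triplet loss. This is an illustrative assumption, not the paper's exact formulation: the Euclidean distance, the `margin` value, and the function names are hypothetical. Only the roles of the three embeddings (student anchor, teacher positive, student negative) come from the abstract.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_distillation_loss(anchor_student, positive_teacher,
                              negative_student, margin=1.0):
    """Hinge triplet loss with a teacher-produced positive.

    anchor_student:   student embedding of an occluded face
    positive_teacher: teacher embedding of a same-class face
    negative_student: student embedding of a different-class face
    margin:           assumed hyperparameter, not given in the abstract
    """
    d_pos = euclidean(anchor_student, positive_teacher)
    d_neg = euclidean(anchor_student, negative_student)
    # Zero loss once the positive is closer than the negative by the margin.
    return max(0.0, d_pos - d_neg + margin)
```

Minimizing this quantity pulls the student's anchor toward the teacher's embedding space while pushing it away from other-class student embeddings.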

#### Joint Face Alignment and 3D Face Reconstruction with Efficient Convolution Neural Networks

Keqiang Li, Huaiyu Wu, Xiuqin Shang, Zhen Shen, Gang Xiong, Xisong Dong, Bin Hu, Fei-Yue Wang
Track 3: Computer Vision Robotics and Intelligent Systems
Thu 14 Jan 2021 at 12:00 in session PS T3.7

Auto-TLDR; Mobile-FRNet: Efficient 3D Morphable Model Alignment and 3D Face Reconstruction from a Single 2D Facial Image

3D face reconstruction from a single 2D facial image is a challenging problem of wide interest. Recent CNN-based methods typically aim to learn the parameters of a 3D Morphable Model (3DMM) from 2D images to perform face alignment and 3D face reconstruction. Most algorithms are designed for faces with small or medium yaw angles, which makes aligning faces in large poses extremely challenging. At the same time, they are usually not efficient. The main challenge is that it takes time to determine the parameters accurately. To address this challenge with the goal of improving performance, this paper proposes a novel and efficient end-to-end framework. We design an efficient and lightweight network model, named Mobile-FRNet, that combines Depthwise Separable Convolution, Multi-scale Representation and a Lightweight Attention Mechanism. Different loss functions are used to constrain and optimize the 3DMM parameters and 3D vertices during training to improve the performance of the network. Extensive experiments on challenging datasets show that our method significantly improves the accuracy of face alignment and 3D face reconstruction. The model parameters and complexity of our method are also improved greatly.

#### Two-Level Attention-Based Fusion Learning for RGB-D Face Recognition

Hardik Uppal, Alireza Sepas-Moghaddam, Michael Greenspan, Ali Etemad
Track 2: Biometrics, Human Analysis and Behavior Understanding
Tue 12 Jan 2021 at 17:00 in session PS T2.1

Auto-TLDR; Fused RGB-D Facial Recognition using Attention-Aware Feature Fusion

With recent advances in RGB-D sensing technologies as well as improvements in machine learning and fusion techniques, RGB-D facial recognition has become an active area of research. A novel attention aware method is proposed to fuse two image modalities, RGB and depth, for enhanced RGB-D facial recognition. The proposed method first extracts features from both modalities using a convolutional feature extractor. These features are then fused using a two layer attention mechanism. The first layer focuses on the fused feature maps generated by the feature extractor, exploiting the relationship between feature maps using LSTM recurrent learning. The second layer focuses on the spatial features of those maps using convolution. The training database is preprocessed and augmented through a set of geometric transformations, and the learning process is further aided using transfer learning from a pure 2D RGB image training process. Comparative evaluations demonstrate that the proposed method outperforms other state-of-the-art approaches, including both traditional and deep neural network-based methods, on the challenging CurtinFaces and IIIT-D RGB-D benchmark databases, achieving classification accuracies over 98.2% and 99.3% respectively. The proposed attention mechanism is also compared with other attention mechanisms, demonstrating more accurate results.

#### Quality-Based Representation for Unconstrained Face Recognition

Nelson Méndez-Llanes, Katy Castillo-Rosado, Heydi Mendez-Vazquez, Massimo Tistarelli
Track 2: Biometrics, Human Analysis and Behavior Understanding
Thu 14 Jan 2021 at 14:00 in session OS T2.2

Auto-TLDR; activation map for face recognition in unconstrained environments

Significant advances have been achieved in face recognition in the last decade thanks to the development of deep learning methods. However, recognizing faces captured in uncontrolled environments is still a challenging problem for the scientific community. In these scenarios, the performance of most existing deep-learning-based methods falls abruptly, due to the bad quality of the face images. In this work, we propose to use an activation map to represent the quality information in a face image. Different face regions are analyzed to determine their quality, and then only those regions with good quality are used to perform the recognition using a given deep face model. For experimental evaluation, in order to simulate unconstrained environments, three challenging databases, with different variations in appearance, were selected: the Labeled Faces in the Wild Database, the Celebrities in Frontal-Profile in the Wild Database, and the AR Database. Three deep face models were used to evaluate the proposal on these databases and, in all cases, the use of the proposed activation map improves the recognition rates obtained by the original models in a range from 0.3 up to 31%. The obtained results experimentally demonstrate that the proposal is able to select those face areas with higher discriminative power and enough identifying information, while ignoring the ones with spurious information.

#### Interpretable Emotion Classification Using Temporal Convolutional Models

Manasi Bharat Gund, Abhiram Ravi Bharadwaj, Ifeoma Nwogu
Track 2: Biometrics, Human Analysis and Behavior Understanding
Tue 12 Jan 2021 at 17:00 in session PS T2.1

Auto-TLDR; Understanding the Dynamics of Facial Emotion Expression with Spatiotemporal Representations

As with many problems solved by deep neural networks, existing solutions rarely explain, precisely, the important factors responsible for the predictions made by the model. This work looks to investigate how different spatial regions and landmark points change in position over time, to better explain the underlying factors responsible for various facial emotion expressions. By pinpointing the specific regions or points responsible for the classification of a particular facial expression, we gain better insight into the dynamics of the face when displaying that emotion. To accomplish this, we examine two spatiotemporal representations of moving faces, while expressing different emotions. The representations are then presented to a convolutional neural network for emotion classification. Class activation maps are used in highlighting the regions of interest and the results are qualitatively compared with the well known facial action units, using the facial action coding system. The model was originally trained and tested on the CK+ dataset for emotion classification, and then generalized to the SAMM dataset. In so doing, we successfully present an interpretable technique for understanding the dynamics that occur during convolutional-based prediction tasks on sequences of face data.

#### Face Anti-Spoofing Using Spatial Pyramid Pooling

Lei Shi, Zhuo Zhou, Zhenhua Guo
Track 2: Biometrics, Human Analysis and Behavior Understanding
Wed 13 Jan 2021 at 12:00 in session PS T2.2

Auto-TLDR; Spatial Pyramid Pooling for Face Anti-Spoofing

Face recognition systems are vulnerable to many kinds of presentation attacks, so effectively detecting whether an image comes from a real face is particularly important. At present, many deep-learning-based anti-spoofing methods have been proposed. But these approaches have some limitations: for example, global average pooling (GAP) easily loses local information of faces, single-scale features easily ignore information differences across scales, and a complex network is prone to overfitting. In this paper, we propose a face anti-spoofing approach using spatial pyramid pooling (SPP). Firstly, we use ResNet-18, which has a small number of parameters, as the basic model to avoid overfitting. Further, we use a spatial pyramid pooling module in the single model to enhance local features while fusing multi-scale information. The effectiveness of the proposed method is evaluated on three databases, CASIA-FASD, Replay-Attack and CASIA-SURF. The experimental results show that the proposed approach can achieve state-of-the-art performance.
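
To make the contrast with global average pooling concrete, here is a minimal Python sketch of spatial pyramid pooling on a single-channel feature map. It is a generic illustration, not the paper's module: the pyramid levels `(1, 2)`, the use of max pooling, and the function name are assumptions, and the map is assumed to be at least as large as the finest grid.

```python
def spatial_pyramid_pool(feature_map, levels=(1, 2)):
    """Max-pool a 2D feature map over n x n grids for each n in levels.

    Returns a flat list whose length (sum of n*n over levels) does not
    depend on the input size, while coarse and fine bins keep both global
    and local information.
    """
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            # Bin boundaries: floor for the start, ceil for the end.
            r0, r1 = (i * height) // n, ((i + 1) * height + n - 1) // n
            for j in range(n):
                c0, c1 = (j * width) // n, ((j + 1) * width + n - 1) // n
                pooled.append(max(feature_map[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled
```

With levels `(1, 2)`, the single 1x1 bin plays the role of a global pooling output, and the four 2x2 bins retain the local responses that a purely global pooling step would discard.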

#### An Improved Bilinear Pooling Method for Image-Based Action Recognition

Wei Wu, Jiale Yu
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Wed 13 Jan 2021 at 12:00 in session PS T1.4

Auto-TLDR; An improved bilinear pooling method for image-based action recognition

Action recognition in still images is a challenging task because of the complexity of human motions and the variance of the background within the same action category. Moreover, some actions typically occur in fine-grained categories, with little visual difference between these categories. Extracting discriminative features or modeling various semantic parts is therefore essential for image-based action recognition. Many methods rely on expensive manual annotations to learn discriminative part information for action recognition, which may severely discourage potential applications in real life. In recent years, the bilinear pooling method has shown its effectiveness for image classification because it learns distinctive features automatically. Inspired by this model, in this paper, an improved bilinear pooling method is proposed to avoid the shortcomings of traditional bilinear pooling methods. Previous bilinear pooling approaches retain a lot of noisy background or harmful feature information, which limits their application to action recognition. In our method, the attention mechanism is introduced into a hierarchical bilinear pooling framework with mask aggregation for action recognition. The proposed model can generate distinctive and ROI-aware feature information by combining multiple attention mask maps from the channel-wise and spatial-wise attention features. More specifically, our method makes the network pay closer attention to the discriminative regions of the vital objects in an image. We verify our model on two challenging datasets: 1) the Stanford 40 action dataset and 2) our action dataset, which includes 60 categories. Experimental results demonstrate the effectiveness of our approach, which is superior to the traditional and state-of-the-art methods.

#### Adaptive Feature Fusion Network for Gaze Tracking in Mobile Tablets

Yiwei Bao, Yihua Cheng, Yunfei Liu, Feng Lu
Track 2: Biometrics, Human Analysis and Behavior Understanding
Thu 14 Jan 2021 at 12:00 in session PS T2.4

Auto-TLDR; Adaptive Feature Fusion Network for Multi-stream Gaze Estimation in Mobile Tablets

Recently, many multi-stream gaze estimation methods have been proposed. They estimate gaze from eye and face appearances and achieve reasonable accuracy. However, most of the methods simply concatenate the features extracted from eye and face appearance. The feature fusion process has been ignored. In this paper, we propose a novel Adaptive Feature Fusion Network (AFF-Net), which performs gaze tracking task in mobile tablets. We stack two-eye feature maps and utilize Squeeze-and-Excitation layers to adaptively fuse two-eye features based on different eye features. Meanwhile, we also propose Adaptive Group Normalization to recalibrate eye features with the guidance of face appearance characteristics. Extensive experiments on both GazeCapture and MPIIFaceGaze datasets demonstrate consistently superior performance of the proposed method.
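
As a rough illustration of how a Squeeze-and-Excitation layer can reweight stacked feature maps, here is a minimal Python sketch. It shows the generic SE operation only, not AFF-Net itself: the function name, the weight matrices `w1` and `w2`, and the bias-free two-layer gating are assumptions for illustration.

```python
import math


def squeeze_excite(channels, w1, w2):
    """Apply a generic Squeeze-and-Excitation step.

    channels: list of 2D feature maps, e.g. stacked left-eye and right-eye maps
    w1: hidden x C weight matrix of the reduction layer (hypothetical values)
    w2: C x hidden weight matrix of the expansion layer (hypothetical values)
    """
    # Squeeze: global average of each channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in channels]
    # Excitation: linear -> ReLU -> linear -> sigmoid, giving one gate per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: multiply every value in a channel by that channel's gate.
    return [[[v * g for v in row] for row in ch]
            for ch, g in zip(channels, gates)]
```

Because the gates depend on the content of all stacked channels, the weighting between the channels adapts to each input rather than being fixed, which is the contrast with plain concatenation drawn in the abstract.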

#### SATGAN: Augmenting Age Biased Dataset for Cross-Age Face Recognition

Wenshuang Liu, Wenting Chen, Yuanlue Zhu, Linlin Shen
Track 2: Biometrics, Human Analysis and Behavior Understanding
Wed 13 Jan 2021 at 12:00 in session PS T2.2

Auto-TLDR; SATGAN: Stable Age Translation GAN for Cross-Age Face Recognition

In this paper, we propose a Stable Age Translation GAN (SATGAN) to generate fake face images at different ages to augment age-biased face datasets for Cross-Age Face Recognition (CAFR). The proposed SATGAN consists of a generator and a discriminator. As a part of the generator, a novel Mask Attention Module (MAM) is introduced to make the generator focus on the face area. In addition, the generator employs a Uniform Distribution Discriminator (UDD) to supervise the learning of the latent feature map and enforce the uniform distribution. Besides, the discriminator employs a Feature Separation Module (FSM) to disentangle identity information from the age information. The quantitative and qualitative evaluations on the Morph dataset show that SATGAN achieves much better performance than existing methods. The face recognition model trained on datasets (VGGFace2 and MS-Celeb-1M) augmented using our SATGAN achieves better accuracy on cross-age datasets such as Cross-Age LFW and AgeDB-30.

#### UHRSNet: A Semantic Segmentation Network Specifically for Ultra-High-Resolution Images

Lianlei Shan, Weiqiang Wang
Track 3: Computer Vision Robotics and Intelligent Systems
Wed 13 Jan 2021 at 12:00 in session PS T3.4

Auto-TLDR; Ultra-High-Resolution Segmentation with Local and Global Feature Fusion

Semantic segmentation is a basic task in computer vision, but only limited attention has been devoted to ultra-high-resolution (UHR) image segmentation. Since UHR images occupy too much memory, they cannot be loaded directly onto the GPU for training. Previous methods crop images into small patches or downsample the whole image. Cropping and downsampling cause the loss of contexts and details, which are essential for segmentation accuracy. To solve this problem, we improve and simplify the local and global feature fusion method in previous works. Local features are extracted from patches and global features are extracted from downsampled images. Meanwhile, we propose a new fusion, called local feature fusion, which lets patches obtain information from surrounding patches. We call the network with these two fusions the ultra-high-resolution segmentation network (UHRSNet). These two fusions can effectively and efficiently solve the problem caused by cropping and downsampling. Experiments show a remarkable improvement on the Deepglobe dataset.

#### Multi-Scale Cascading Network with Compact Feature Learning for RGB-Infrared Person Re-Identification

Can Zhang, Hong Liu, Wei Guo, Mang Ye
Track 5: Image and Signal Processing
Wed 13 Jan 2021 at 12:00 in session PS T5.3

Auto-TLDR; Multi-Scale Part-Aware Cascading for RGB-Infrared Person Re-identification

RGB-Infrared person re-identification (RGB-IR Re-ID) aims to match persons across heterogeneous images captured by visible and thermal cameras, which is of great significance for surveillance systems under poor lighting conditions. Facing great challenges from complex variances, including conventional single-modality and additional inter-modality discrepancies, most existing RGB-IR Re-ID methods work directly on global features for simultaneous elimination, whereas modality-specific noise and modality-shared features are not well considered. To address these issues, a novel Multi-Scale Part-Aware Cascading framework (MSPAC) is formulated by aggregating multi-scale fine-grained features from part to global in a cascading manner, which results in a unified representation that is robust to noise. Moreover, a marginal exponential center (MeCen) loss is introduced to jointly eliminate mixed variances, which makes it possible to model cross-modality correlations on sharable salient features. Extensive experiments demonstrate that the proposed method outperforms all state-of-the-art methods by a large margin.

#### Depth Videos for the Classification of Micro-Expressions

Ankith Jain Rakesh Kumar, Bir Bhanu, Christopher Casey, Sierra Cheung, Aaron Seitz
Track 2: Biometrics, Human Analysis and Behavior Understanding
Tue 12 Jan 2021 at 17:00 in session PS T2.1

Auto-TLDR; RGB-D Dataset for the Classification of Facial Micro-expressions

Facial micro-expressions are spontaneous, subtle, involuntary muscle movements occurring briefly on the face. The spotting and recognition of these expressions are difficult due to the subtle behavior, and the time duration of these expressions is about half a second, which makes it difficult for humans to identify them. These micro-expressions have many applications in our daily life, such as in the field of online learning, game playing, lie detection, and therapy sessions. Traditionally, researchers use RGB images/videos to spot and classify these micro-expressions, which pose challenging problems, such as illumination, privacy concerns and pose variation. The use of depth videos solves these issues to some extent, as the depth videos are not susceptible to the variation in illumination. This paper describes the collection of a first RGB-D dataset for the classification of facial micro-expressions into 6 universal expressions: Anger, Happy, Sad, Fear, Disgust, and Surprise. This paper shows the comparison between the RGB and Depth videos for the classification of facial micro-expressions. Further, a comparison of results shows that depth videos alone can be used to classify facial micro-expressions correctly in a decision tree structure by using the traditional and deep learning approaches with good classification accuracy. The dataset will be released to the public in the near future.

#### MANet: Multimodal Attention Network Based Point-View Fusion for 3D Shape Recognition

Yaxin Zhao, Jichao Jiao, Ning Li
Track 3: Computer Vision Robotics and Intelligent Systems
Thu 14 Jan 2021 at 14:00 in session PS T3.8

Auto-TLDR; Fusion Network for 3D Shape Recognition based on Multimodal Attention Mechanism

3D shape recognition has attracted more and more attention as a task of 3D vision research. The proliferation of 3D data encourages various deep learning methods based on 3D data. Now there have been many deep learning models based on point-cloud data or multi-view data alone. However, in the era of big data, integrating data of two different modals to obtain a unified 3D shape descriptor is bound to improve the recognition accuracy. Therefore, this paper proposes a fusion network based on multimodal attention mechanism for 3D shape recognition. Considering the limitations of multi-view data, we introduce a soft attention scheme, which can use the global point-cloud features to filter the multi-view features, and then realize the effective fusion of the two features. More specifically, we obtain the enhanced multi-view features by mining the contribution of each multi-view image to the overall shape recognition, and then fuse the point-cloud features and the enhanced multi-view features to obtain a more discriminative 3D shape descriptor. We have performed relevant experiments on the ModelNet40 dataset, and experimental results verify the effectiveness of our method.

#### Global-Local Attention Network for Semantic Segmentation in Aerial Images

Minglong Li, Lianlei Shan, Weiqiang Wang
Track 3: Computer Vision Robotics and Intelligent Systems
Fri 15 Jan 2021 at 15:00 in session PS T3.10

Auto-TLDR; GLANet: Global-Local Attention Network for Semantic Segmentation

Errors in semantic segmentation can be classified into two types: large-area misclassification and locally inaccurate boundaries. Previous attention-based methods capture rich global contextual information, which helps reduce the first type of error, but local imprecision still exists. In this paper we propose the Global-Local Attention Network (GLANet), which considers global context and local details simultaneously. Specifically, GLANet is composed of two branches, a global attention branch and a local attention branch, and three different modules are embedded in the two branches to model semantic interdependencies in the spatial, channel and boundary dimensions respectively. We sum the outputs of the two branches to further improve the feature representation, leading to more precise segmentation results. The proposed method achieves very competitive segmentation accuracy on two public aerial image datasets, bringing significant improvements over the baseline.

#### Efficient-Receptive Field Block with Group Spatial Attention Mechanism for Object Detection

Jiacheng Zhang, Zhicheng Zhao, Fei Su
Track 3: Computer Vision Robotics and Intelligent Systems
Thu 14 Jan 2021 at 12:00 in session PS T3.7

Auto-TLDR; E-RFB: Efficient-Receptive Field Block for Deep Neural Network for Object Detection

Object detection has been paid rising attention in computer vision field. Convolutional Neural Networks (CNNs) extract high-level semantic features of images, which directly determine the performance of object detection. As a common solution, embedding integration modules into CNNs can enrich extracted features and thereby improve the performance. However, the instability and inconsistency of internal multiple branches exist in these modules. To address this problem, we propose a novel multibranch module called Efficient-Receptive Field Block (E-RFB), in which multiple levels of features are combined for network optimization. Specifically, by downsampling and increasing depth, the E-RFB provides sufficient RF. Second, in order to eliminate the inconsistency across different branches, a novel spatial attention mechanism, namely, Group Spatial Attention Module (GSAM) is proposed. The GSAM gradually narrows a feature map by channel grouping; thus it encodes the information between spatial and channel dimensions into the final attention heat map. Third, the proposed module can be easily joined in various CNNs to enhance feature representation as a plug-and-play component. With SSD-style detectors, our method halves the parameters of the original detection head and achieves high accuracy on the PASCAL VOC and MS COCO datasets. Moreover, the proposed method achieves superior performance compared with state-of-the-art methods based on similar framework.

#### Makeup Style Transfer on Low-Quality Images with Weighted Multi-Scale Attention

Daniel Organisciak, Edmond S. L. Ho, Shum Hubert P. H.
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Thu 14 Jan 2021 at 14:00 in session PS T1.11

Auto-TLDR; Facial Makeup Style Transfer for Low-Resolution Images Using Multi-Scale Spatial Attention

Facial makeup style transfer is an extremely challenging sub-field of image-to-image translation. Due to this difficulty, state-of-the-art results are mostly reliant on the Face Parsing Algorithm, which segments a face into parts in order to easily extract makeup features. However, we find that this algorithm can only work well on high-definition images where facial features can be accurately extracted. Faces in many real-world photos, such as those including a large background or multiple people, are typically of low resolution, which considerably hinders state-of-the-art algorithms. In this paper, we propose an end-to-end holistic approach to effectively transfer makeup styles between two low-resolution images. The idea is built upon a novel weighted multi-scale spatial attention module, which identifies salient pixel regions on low-resolution images at multiple scales, and uses channel attention to determine the most effective attention map. This design provides two benefits: low-resolution images are usually blurry to different extents, so a multi-scale architecture can select the most effective convolution kernel size to implement spatial attention; makeup is applied on both a macro-level (foundation, fake tan) and a micro-level (eyeliner, lipstick), so different scales can excel in extracting different makeup features. We develop an Augmented CycleGAN network that embeds our attention modules at selected layers to most effectively transfer makeup. We test our system with the FBD data set, which consists of many low-resolution facial images, and demonstrate that it outperforms state-of-the-art methods, particularly in transferring makeup for blurry images and partially occluded images.

#### Learning Semantic Representations Via Joint 3D Face Reconstruction and Facial Attribute Estimation

Zichun Weng, Youjun Xiang, Xianfeng Li, Juntao Liang, Wanliang Huo, Yuli Fu
Track 3: Computer Vision Robotics and Intelligent Systems
Thu 14 Jan 2021 at 16:00 in session PS T3.9

Auto-TLDR; Joint Framework for 3D Face Reconstruction with Facial Attribute Estimation

We propose a novel joint framework for 3D face reconstruction (3DFR) that integrates facial attribute estimation (FAE) as an auxiliary task. One of the essential problems of 3DFR is to extract semantic facial features (e.g., Big Nose, High Cheekbones, and Asian) from in-the-wild 2D images, which is inherently involved with FAE. These two tasks, though heterogeneous, are highly relevant to each other. To achieve this, we leverage a Convolutional Neural Network to extract shared facial representations for both shape decoder and attribute classifier. We further develop an in-batch hybrid-task training scheme that enables our model to learn from heterogeneous facial datasets jointly within a mini-batch. Thanks to the joint loss that provides supervision from both 3DFR and FAE domains, our model learns the correlations between 3D shapes and facial attributes, which benefit both feature extraction and shape inference. Quantitative evaluation and qualitative visualization results confirm the effectiveness and robustness of our joint framework.

#### MRP-Net: A Light Multiple Region Perception Neural Network for Multi-Label AU Detection

Yang Tang, Shuang Chen, Honggang Zhang, Gang Wang, Rui Yang
Track 2: Biometrics, Human Analysis and Behavior Understanding
Wed 13 Jan 2021 at 12:00 in session PS T2.2

Auto-TLDR; MRP-Net: A Fast and Light Neural Network for Facial Action Unit Detection

Facial Action Units (AUs) are of great significance in communication. Automatic AU detection can improve the understanding of psychological condition and emotional status. Recently, a number of deep learning methods have been proposed to tackle problems in automatic AU detection. Several challenges, such as unbalanced labels and neglect of local information, remain to be addressed. In this paper, we propose a fast and light neural network called MRP-Net, an end-to-end trainable method for facial AU detection, to solve these problems. First, we design a Multiple Region Perception (MRP) module aimed at capturing features of different locations and sizes in the deeper levels of the network without facial landmark points. Then, in order to balance the positive and negative samples in a large dataset, we suggest a batch-balanced method that adjusts the weight of every sample in a batch within our loss function. Experimental results on two popular AU datasets, BP4D and DISFA, show that MRP-Net outperforms state-of-the-art methods. Compared with the best method, MRP-Net not only improves the average F1 score by 2.95% on BP4D and 5.43% on DISFA, but also decreases the number of network parameters by 54.62% and the number of network FLOPs by 19.6%.

#### Aggregating Object Features Based on Attention Weights for Fine-Grained Image Retrieval

Hongli Lin, Yongqi Song, Zixuan Zeng, Weisheng Wang
Track 5: Image and Signal Processing
Thu 14 Jan 2021 at 12:00 in session PS T5.5

Auto-TLDR; DSAW: Unsupervised Dual-selection for Fine-Grained Image Retrieval

Object localization and local feature representation are key issues in fine-grained image retrieval. However, existing unsupervised methods still need improvement in both aspects. To address these issues in a unified framework, a novel unsupervised scheme, named DSAW for short, is presented in this paper. First, we propose a dual-selection (DS) method, which achieves more accurate object localization by using an adaptive threshold method to perform feature selection on local and global activation maps in turn. Second, a novel and faster self-attention weights (AW) method is developed to weight local features by measuring their importance in the global context. Finally, we evaluate the proposed method on five fine-grained image datasets, and the results show that DSAW outperforms the best existing method.

#### Dual-Attention Guided Dropblock Module for Weakly Supervised Object Localization

Junhui Yin, Siqing Zhang, Dongliang Chang, Zhanyu Ma, Jun Guo
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Wed 13 Jan 2021 at 14:00 in session PS T1.6

Auto-TLDR; Dual-Attention Guided Dropblock for Weakly Supervised Object Localization

Attention mechanisms are frequently used to learn discriminative features for better feature representations. In this paper, we extend the attention mechanism to the task of weakly supervised object localization (WSOL) and propose the dual-attention guided dropblock module (DGDM), which aims at learning informative and complementary visual patterns for WSOL. This module contains two key components, the channel attention guided dropout (CAGD) and the spatial attention guided dropblock (SAGD). To model channel interdependencies, the CAGD ranks the channel attentions and treats the top-k attentions with the largest magnitudes as the important ones. It also keeps some low-valued elements so that their value can increase if they become important during training. The SAGD can efficiently remove the most discriminative information by erasing contiguous regions of feature maps rather than individual pixels. This guides the model to capture the less discriminative parts for classification. Furthermore, it can also distinguish foreground objects from background regions to alleviate attention misdirection. Experimental results demonstrate that the proposed method achieves new state-of-the-art localization performance.
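The two operations named in this abstract, ranking channel attentions to keep the top-k and erasing a contiguous block of a feature map, can be sketched in a few lines. This is an illustrative simplification under stated assumptions, not the authors' DGDM: `top_k_channel_mask` and `drop_block` are hypothetical names, the block position is passed in by hand rather than derived from a spatial attention map, and the abstract's retention of some low-valued channels is omitted.

```python
def top_k_channel_mask(attentions, k):
    """Return a 0/1 mask that keeps the k channels with largest magnitude.

    Hypothetical helper: ranks channels by absolute attention value.
    """
    order = sorted(range(len(attentions)),
                   key=lambda i: abs(attentions[i]),
                   reverse=True)
    keep = set(order[:k])
    return [1 if i in keep else 0 for i in range(len(attentions))]


def drop_block(feature_map, top, left, size):
    """Zero a contiguous size x size region of a 2D feature map.

    The region is chosen by the caller here; an attention-guided variant
    would pick it from the most discriminative locations (assumption).
    """
    return [
        [0.0 if top <= r < top + size and left <= c < left + size else value
         for c, value in enumerate(row)]
        for r, row in enumerate(feature_map)
    ]


channel_mask = top_k_channel_mask([0.1, 0.9, 0.4, 0.7], k=2)
erased = drop_block([[1.0] * 3 for _ in range(3)], top=0, left=1, size=2)
```

Erasing a whole block rather than isolated pixels matters because neighbouring activations are correlated, so dropping single pixels leaves most of the discriminative evidence in place.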

#### DARN: Deep Attentive Refinement Network for Liver Tumor Segmentation from 3D CT Volume

Yao Zhang, Jiang Tian, Cheng Zhong, Yang Zhang, Zhongchao Shi, Zhiqiang He
Track 5: Image and Signal Processing
Tue 12 Jan 2021 at 17:00 in session PS T5.2

Auto-TLDR; Deep Attentive Refinement Network for Liver Tumor Segmentation from 3D Computed Tomography Using Multi-Level Features

Automatic liver tumor segmentation from 3D Computed Tomography (CT) is a necessary prerequisite for interventions on hepatic abnormalities and for surgery planning. However, accurate liver tumor segmentation remains challenging due to the large variability of tumor sizes and inhomogeneous texture. Recent advances in liver tumor segmentation based on the Fully Convolutional Network (FCN) draw on the success of learning discriminative multi-level features. In this paper, we propose a Deep Attentive Refinement Network (DARN) for improved liver tumor segmentation from CT volumes by fully exploiting both low-level and high-level features embedded in different layers of the FCN. Unlike existing works, we exploit an attention mechanism to leverage the relation between different levels of features encoded in different layers of the FCN. Specifically, we introduce a Semantic Attention Refinement (SemRef) module to selectively emphasize global semantic information in low-level features with the guidance of high-level ones, and a Spatial Attention Refinement (SpaRef) module to adaptively enhance spatial details in high-level features with the guidance of low-level ones. We evaluate our network on the public MICCAI 2017 Liver Tumor Segmentation Challenge dataset (LiTS dataset), where it achieves state-of-the-art performance. The proposed refinement modules are an effective strategy to exploit multi-level features and have great potential to generalize to other medical image segmentation tasks.

#### Audio-Visual Speech Recognition Using a Two-Step Feature Fusion Strategy

Hong Liu, Wanlu Xu, Bing Yang
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Thu 14 Jan 2021 at 12:00 in session PS T1.10

Auto-TLDR; A Two-Step Feature Fusion Network for Speech Recognition

Lip-reading methods and fusion strategy are crucial for audio-visual speech recognition. In recent years, most approaches involve two separate audio and visual streams with early or late fusion strategies. Such a single-stage fusion method may fail to guarantee the integrity and representativeness of fusion information simultaneously. This paper extends a traditional single-stage fusion network to a two-step feature fusion network by adding an audio-visual early feature fusion (AV-EFF) stream to the baseline model. This method can learn the fusion information of different stages, preserving the original features as much as possible and ensuring the independence of different features. Besides, to capture long-range dependencies of video information, a non-local block is added to the feature extraction part of the visual stream (NL-Visual) to obtain the long-term spatio-temporal features. Experimental results on the two largest public datasets in English (LRW) and Mandarin (LRW-1000) demonstrate our method is superior to other state-of-the-art methods.

#### Learning Disentangled Representations for Identity Preserving Surveillance Face Camouflage

Jingzhi Li, Lutong Han, Hua Zhang, Xiaoguang Han, Jingguo Ge, Xiaochu Cao
Track 2: Biometrics, Human Analysis and Behavior Understanding
Thu 14 Jan 2021 at 12:00 in session PS T2.4

Auto-TLDR; Individual Face Privacy under Surveillance Scenario with Multi-task Loss Function

In this paper, we focus on protecting personal face privacy in surveillance scenarios, where the goal is to change the visual appearance of faces while keeping them recognizable by current face recognition systems. This is a challenging problem, as we should retain the most important structures of captured facial images while altering the salient facial regions to protect personal privacy. To address this problem, we introduce a novel individual face protection model, which can camouflage the face appearance from the perspective of human visual perception and preserve the identity features of faces used for face authentication. To that end, we develop an encoder-decoder network architecture that can disentangle the person feature representation into an appearance code and an identity code. Specifically, we first randomly divide the face images into two groups, the source set and the target set, where the source set is used to extract the identity code and the target set provides the appearance code. Then, we recombine the identity and appearance codes to synthesize a new face, which has the same identity as the source subject. Finally, the synthesized faces are used to replace the original faces to protect the privacy of individuals. Furthermore, our model is trained end-to-end with a multi-task loss function, which can better preserve the identity and stabilize the training loss. Experiments conducted on the Cross-Age Celebrity dataset demonstrate the effectiveness of our model and validate its superiority in terms of visual quality and scalability.

#### Real-Time Semantic Segmentation Via Region and Pixel Context Network

Yajun Li, Yazhou Liu, Quansen Sun
Track 3: Computer Vision Robotics and Intelligent Systems
Thu 14 Jan 2021 at 12:00 in session PS T3.7

Auto-TLDR; A Dual Context Network for Real-Time Semantic Segmentation

Real-time semantic segmentation is a challenging task, as both segmentation accuracy and inference speed need to be considered at the same time. In this paper, we present a Dual Context Network (DCNet) to address this challenge. It contains two independent sub-networks: Region Context Network and Pixel Context Network. Region Context Network is the main network, with a low-resolution input and a feature re-weighting module to achieve a sufficient receptive field. Meanwhile, Pixel Context Network uses a location attention module to capture the location dependencies of each pixel, assisting the main network in recovering spatial detail. A contextual feature fusion is introduced to combine the output features of these two sub-networks. The experiments show that DCNet can achieve high-quality segmentation while keeping a high speed. Specifically, on the Cityscapes test dataset, we achieve 76.1% Mean IOU at 82 FPS on a single GTX 2080Ti GPU when using ResNet50 as the backbone, and 71.2% Mean IOU at 142 FPS when using ResNet18 as the backbone.

#### Attention Pyramid Module for Scene Recognition

Zhinan Qiao, Xiaohui Yuan, Chengyuan Zhuang, Abolfazl Meyarian
Track 3: Computer Vision Robotics and Intelligent Systems
Tue 12 Jan 2021 at 17:00 in session PS T3.3

Auto-TLDR; Attention Pyramid Module for Multi-Scale Scene Recognition

The unrestricted open vocabulary and diverse substances of scenery images bring significant challenges to scene recognition. However, most deep learning architectures and attention methods are developed on general-purpose datasets and omit the characteristics of scene data. In this paper, we exploit the attention pyramid module (APM) to tackle the predicament of scene recognition. Our method streamlines the multi-scale scene recognition pipeline, learns comprehensive scene features at various scales and locations, addresses the interdependency among scales, and further assists the feature re-calibration and aggregation process. APM is extremely lightweight and can be easily plugged into existing network architectures in a parameter-efficient manner. By simply integrating APM into ResNet-50, we obtain a 3.54% boost in top-1 accuracy on the benchmark scene dataset. Comprehensive experiments show that APM achieves better performance than state-of-the-art attention methods while using a significantly smaller computation budget. Code and pre-trained models will be made publicly available.

#### Attentive Part-Aware Networks for Partial Person Re-Identification

Lijuan Huo, Chunfeng Song, Zhengyi Liu, Zhaoxiang Zhang
Track 2: Biometrics, Human Analysis and Behavior Understanding
Wed 13 Jan 2021 at 12:00 in session PS T2.2

Auto-TLDR; Part-Aware Learning for Partial Person Re-identification

Partial person re-identification (re-ID) refers to re-identifying a person from occluded images. It suffers from two major challenges, i.e., insufficient training data and incomplete probe images. In this paper, we introduce an automatic data augmentation module and a part-aware learning method for partial re-identification. On the one hand, we adopt data augmentation to enhance the training data and help learn more stable partial features. On the other hand, we intuitively find that partial person images usually contain fixed percentages of parts; therefore, in the partial person re-ID task, the probe image could be cropped from the pictures and divided into several different partial types following fixed ratios. Based on the cropped images, we propose the Cropping Type Consistency (CTC) loss to classify the cropping types of partial images. Moreover, in order to help the network better fit the generated and cropped data, we incorporate the Block Attention Mechanism (BAM) into the framework for attentive learning. To enhance the retrieval performance in the inference stage, we crop gallery images according to the predicted types of the probe partial images. By calculating feature distances between the partial image and the cropped holistic gallery images, we can recognize the right person from the gallery. To validate the effectiveness of our approach, we conduct extensive experiments on the partial re-ID benchmarks and achieve state-of-the-art performance.

#### DAIL: Dataset-Aware and Invariant Learning for Face Recognition

Gaoang Wang, Chen Lin, Tianqiang Liu, Mingwei He, Jiebo Luo
Track 2: Biometrics, Human Analysis and Behavior Understanding
Fri 15 Jan 2021 at 15:00 in session PS T2.5

Auto-TLDR; DAIL: Dataset-Aware and Invariant Learning for Face Recognition

To achieve good performance in face recognition, a large-scale training dataset is usually required. A simple yet effective way to improve recognition performance is to use a dataset as large as possible by combining multiple datasets in training. However, naively combining different datasets is problematic due to two major issues. Firstly, the same person can appear in different datasets, leading to an identity overlapping issue between datasets. Naively treating the same person as different classes in different datasets during training will affect back-propagation and generate non-representative embeddings. On the other hand, manually cleaning labels takes a lot of human effort, especially when there are millions of images and thousands of identities. Secondly, different datasets are collected in different situations and thus have different domain distributions. Naively combining datasets introduces domain distribution differences and makes it difficult to learn domain-invariant embeddings across datasets. In this paper, we propose DAIL: Dataset-Aware and Invariant Learning to resolve the above-mentioned issues. To solve the first issue of identity overlapping, we propose a dataset-aware loss for multi-dataset training that reduces the penalty when the same person appears in multiple datasets. This can be readily achieved with a modified softmax loss with a dataset-aware term. To solve the second issue, domain adaptation with gradient reversal layers is employed for dataset-invariant learning. The proposed approach not only achieves state-of-the-art results on several commonly used face recognition validation sets, such as LFW, CFP-FP, and AgeDB-30, but also shows great benefit for practical usage.

#### Flow-Guided Spatial Attention Tracking for Egocentric Activity Recognition

Tianshan Liu, Kin-Man Lam
Track 2: Biometrics, Human Analysis and Behavior Understanding
Thu 14 Jan 2021 at 12:00 in session PS T2.4

Auto-TLDR; Flow-Guided Spatial Attention Tracking for Egocentric Activity Recognition

The popularity of wearable cameras has opened up a new dimension for egocentric activity recognition. While some methods introduce attention mechanisms into deep learning networks to capture fine-grained hand-object interactions, they often neglect exploring the spatio-temporal relationships. Generating spatial attention, without adequately exploiting temporal consistency, will result in potentially sub-optimal performance in the video-based task. In this paper, we propose a flow-guided spatial attention tracking (F-SAT) module, which is based on enhancing motion patterns and inter-frame information, to highlight the discriminative features from regions of interest across a video sequence. A new form of input, namely the optical-flow volume, is presented to provide informative cues from moving parts for spatial attention tracking. The proposed F-SAT module is deployed to a two-branch-based deep architecture, which fuses complementary information for egocentric activity recognition. Experimental results on three egocentric activity benchmarks show that the proposed method achieves state-of-the-art performance.

#### Self and Channel Attention Network for Person Re-Identification

Asad Munir, Niki Martinel, Christian Micheloni
Track 3: Computer Vision Robotics and Intelligent Systems
Wed 13 Jan 2021 at 16:30 in session PS T3.6

Auto-TLDR; SCAN: Self and Channel Attention Network for Person Re-identification

Recent research has shown promising results for person re-identification by focusing on several trends. One is designing efficient metric learning loss functions such as triplet loss family to learn the most discriminative representations. The other is learning local features by designing part based architectures to form an informative descriptor from semantically coherent parts. Some efforts adjust distant outliers to their most similar positions by using soft attention and learn the relationship between distant similar features. However, only a few prior efforts focus on channel-wise dependencies and learn non-local sharp similar part features directly for the degraded data in the person re-identification task. In this paper, we propose a novel Self and Channel Attention Network (SCAN) to model long-range dependencies between channels and feature maps. We add multiple classifiers to learn discriminative global features by using classification loss. Self Attention (SA) module and Channel Attention (CA) module are introduced to model non-local and channel-wise dependencies in the learned features. Spectral normalization is applied to the whole network to stabilize the training process. Experimental results on the person re-identification benchmarks show the proposed components achieve significant improvement with respect to the baseline.

#### Face Super-Resolution Network with Incremental Enhancement of Facial Parsing Information

Shuang Liu, Chengyi Xiong, Zhirong Gao
Track 5: Image and Signal Processing
Tue 12 Jan 2021 at 17:00 in session PS T5.2

Auto-TLDR; Learning-based Face Super-Resolution with Incremental Boosting Facial Parsing Information

Recently, face super-resolution (SR) methods based on facial priors have obtained significant performance gains in dealing with extremely degraded facial images, and facial priors have also proved useful in facilitating the inference of face images. Consequently, how to fully fuse facial priors into deep features to improve face SR performance has attracted major attention. In this paper, we propose a learning-based face SR approach with incremental boosting of facial parsing information (IFPSR) for high magnification of low-resolution faces. The proposed IFPSR method consists of three main parts: i) a three-stage upsampling network with embedded parsing-map features, in which image recovery and prior estimation are performed simultaneously and progressively to improve the image resolution; ii) a progressive training method and a joint facial attention and heatmap loss to obtain better facial attributes; iii) a channel attention strategy in residual dense blocks to adaptively learn facial features. Extensive experimental results show that, compared with state-of-the-art methods in terms of quantitative and qualitative metrics, our approach achieves an outstanding balance between SR image quality and low network complexity.

#### Responsive Social Smile: A Machine-Learning Based Multimodal Behavior Assessment Framework towards Early Stage Autism Screening

Yueran Pan, Kunjing Cai, Ming Cheng, Xiaobing Zou, Ming Li
Track 2: Biometrics, Human Analysis and Behavior Understanding
Wed 13 Jan 2021 at 12:00 in session PS T2.2

Auto-TLDR; Responsive Social Smile: A Machine Learning-Based Assessment Framework for Early ASD Screening

Autism spectrum disorder (ASD) is a neurodevelopmental disorder that causes deficits in social life. Early ASD screening for children is an important method to reduce the impact of ASD on people's whole lives. Traditional screening methods rely on protocol experiments and subjective evaluations from clinicians and domain experts and are therefore costly. To standardize the process of ASD screening, we collaborate with a group of ASD experts and design a "Responsive Social Smile" protocol and an experiment environment. Also, we propose a machine learning-based assessment framework for early ASD screening. By integrating speech recognition and computer vision technologies, the framework can quantitatively analyze the behaviors of children under well-designed protocols. Using 196 test samples collected from 41 children in clinical treatment, our proposed method obtains 85.20% accuracy for the score prediction of individual protocols, and 80.49% unweighted accuracy for the final ASD prediction. This result indicates that our model reaches the average level of domain experts in ASD diagnosis.

#### Enhanced Feature Pyramid Network for Semantic Segmentation

Mucong Ye, Ouyang Jinpeng, Ge Chen, Jing Zhang, Xiaogang Yu
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Fri 15 Jan 2021 at 15:00 in session PS T1.14

Auto-TLDR; EFPN: Enhanced Feature Pyramid Network for Semantic Segmentation

Multi-scale feature fusion has been an effective way to improve the performance of semantic segmentation. However, current methods generally fail to consider the semantic gaps between the shallow (low-level) and deep (high-level) features, and thus the fusion methods may not be optimal. In this paper, to address the semantic gap between features from different layers, we propose a unified framework based on the U-shape encoder-decoder architecture, named Enhanced Feature Pyramid Network (EFPN). Specifically, the semantic enhancement module (SEM), boundary extraction module (BEM), and context aggregation model (CAM) are incorporated into the decoder network to improve the robustness of multi-level feature aggregation. In addition, a global fusion model (GFM) in the encoder branch is proposed to capture more semantic information in the deep layers and effectively transmit the high-level semantic features to each layer. Extensive experiments are conducted, and the results show that the proposed framework achieves state-of-the-art results on three public datasets, namely PASCAL VOC 2012, Cityscapes, and PASCAL Context. Furthermore, we also demonstrate that the proposed method is effective for other visual tasks that require frequent feature fusion and upsampling.

#### Multi-Label Contrastive Focal Loss for Pedestrian Attribute Recognition

Xiaoqiang Zheng, Zhenxia Yu, Lin Chen, Fan Zhu, Shilong Wang
Track 2: Biometrics, Human Analysis and Behavior Understanding
Fri 15 Jan 2021 at 15:00 in session PS T2.5

Auto-TLDR; Multi-label Contrastive Focal Loss for Pedestrian Attribute Recognition

Pedestrian Attribute Recognition (PAR) has received extensive attention during the past few years. With the advances of deep convolutional neural networks (CNNs), the performance of PAR has been significantly improved. Existing methods tend to acquire attribute-specific features by designing various complex network structures with additional modules. Such additional modules, however, dramatically increase the number of parameters. Meanwhile, the problems of class imbalance and hard attribute retrieving remain underestimated in PAR. In this paper, we explore the optimization mechanism of the training process to account for these problems and propose a new loss function called Multi-label Contrastive Focal Loss (MCFL). The proposed MCFL emphasizes the hard and minority attributes by using a separated re-weighting mechanism for different positive and negative classes to alleviate the impact of the imbalance. MCFL is also able to enlarge the gaps between the intra-class of multi-label attributes, to force CNNs to extract more subtle discriminative features. We evaluate the proposed MCFL on three large public pedestrian datasets, including RAP, PA-100K, and PETA. The experimental results indicate that the proposed MCFL with the ResNet-50 backbone is able to outperform other state-of-the-art approaches in comparison.
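
To illustrate the general idea of re-weighting hard and minority attributes, here is a minimal sketch of the standard binary focal loss applied independently to each attribute, with separate weights for positive and negative labels. This is not the MCFL formulation from the paper, whose exact form the abstract does not give; the function names and the parameters `gamma`, `alpha_pos` and `alpha_neg` are illustrative assumptions.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha_pos=0.5, alpha_neg=0.5):
    """Standard focal loss for one attribute (not the paper's MCFL).

    p: predicted probability that the attribute is present.
    y: ground-truth label, 1 (present) or 0 (absent).
    """
    p = min(max(p, 1e-7), 1.0 - 1e-7)  # clamp to keep log() finite
    if y == 1:
        # (1 - p)^gamma down-weights positives that are already easy.
        return -alpha_pos * (1.0 - p) ** gamma * math.log(p)
    # p^gamma down-weights negatives that are already easy.
    return -alpha_neg * p ** gamma * math.log(1.0 - p)


def multilabel_focal_loss(probs, labels, **kwargs):
    """Average the per-attribute focal loss over all attributes."""
    losses = [binary_focal_loss(p, y, **kwargs) for p, y in zip(probs, labels)]
    return sum(losses) / len(losses)


# Made-up probabilities: a hard positive (0.1) contributes far more loss
# than an easy positive (0.9).
hard = binary_focal_loss(0.1, 1)
easy = binary_focal_loss(0.9, 1)
```

With `gamma=0` and both weights set to 1, the expression reduces to ordinary binary cross-entropy, so the focusing term can be read as a multiplier that shrinks the loss of well-classified attributes.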

#### Hybrid Approach for 3D Head Reconstruction: Using Neural Networks and Visual Geometry

Oussema Bouafif, Bogdan Khomutenko, Mohammed Daoudi
Track 3: Computer Vision Robotics and Intelligent Systems
Fri 15 Jan 2021 at 16:00 in session PS T3.11

Auto-TLDR; Recovering 3D Head Geometry from a Single Image using Deep Learning and Geometric Techniques

Recovering the 3D geometric structure of a face from a single input image is a challenging and active research area in computer vision. In this paper, we present a novel method for reconstructing 3D heads from a single or multiple image(s) using a hybrid approach based on deep learning and geometric techniques. We propose an encoder-decoder network based on the U-net architecture and trained on synthetic data only. It predicts both pixel-wise normal vectors and landmark maps from a single input photo. Landmarks are used for the pose computation and the initialization of the optimization problem, which, in turn, reconstructs the 3D head geometry by using a parametric morphable model and normal vector fields. State-of-the-art results are achieved through qualitative and quantitative evaluation tests on both single and multi-view settings. Despite the fact that the model was trained only on synthetic data, it successfully recovers 3D geometry and precise poses for real-world images.

#### Spatial-Related and Scale-Aware Network for Crowd Counting

Lei Li, Yuan Dong, Hongliang Bai
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Wed 13 Jan 2021 at 14:00 in session PS T1.5

Auto-TLDR; Spatial Attention for Crowd Counting

Crowd counting aims to estimate the number of people in images. Although promising progress has been made with the prevalence of deep Convolutional Neural Networks, it remains a challenging task due to cluttered backgrounds and varying scales of people within an image. In this paper, we propose a learnable spatial attention module which captures spatial relations to diminish the negative impact of backgrounds. In addition, a dense hybrid dilated convolution module is introduced to preserve information derived from varied scales. With these two modules, our network can deal with the problems caused by scale variance and background interference. To demonstrate the effectiveness of our method, we compare it with state-of-the-art algorithms on three representative crowd counting benchmarks (ShanghaiTech, UCF-QNRF, UCF_CC_50). Experimental results show that our proposed network achieves significant improvements on all three datasets.

#### A Grid-Based Representation for Human Action Recognition

Soufiane Lamghari, Guillaume-Alexandre Bilodeau, Nicolas Saunier
Track 2: Biometrics, Human Analysis and Behavior Understanding
Tue 12 Jan 2021 at 17:00 in session PS T2.1

Auto-TLDR; GRAR: Grid-based Representation for Action Recognition in Videos

Human action recognition (HAR) in videos is a fundamental research topic in computer vision. It consists mainly in understanding actions performed by humans based on a sequence of visual observations. In recent years, HAR has witnessed significant progress, especially with the emergence of deep learning models. However, most existing approaches for action recognition rely on information that is not always relevant for the task, and are limited in the way they fuse temporal information. In this paper, we propose a novel method for human action recognition that efficiently encodes the most discriminative appearance information of an action, with explicit attention on representative pose features, into a new compact grid representation. Our GRAR (Grid-based Representation for Action Recognition) method is tested on several benchmark datasets, which demonstrate that our model can accurately recognize human actions despite intra-class appearance variations and occlusion challenges.

#### Multi-Scale Residual Pyramid Attention Network for Monocular Depth Estimation

Jing Liu, Xiaona Zhang, Zhaoxin Li, Tianlu Mao
Track 3: Computer Vision Robotics and Intelligent Systems
Fri 15 Jan 2021 at 15:00 in session PS T3.10

Auto-TLDR; Multi-scale Residual Pyramid Attention Network for Monocular Depth Estimation

Monocular depth estimation is a challenging problem in computer vision and is crucial for understanding 3D scene geometry. Recently, methods based on deep convolutional neural networks (DCNNs) have improved the estimation accuracy significantly. However, existing methods fail to consider complex textures and geometries in scenes, thereby resulting in loss of local details, distorted object boundaries, and blurry reconstruction. In this paper, we propose an end-to-end Multi-scale Residual Pyramid Attention Network (MRPAN) to mitigate these problems. First, we propose a Multi-scale Attention Context Aggregation (MACA) module, which consists of a Spatial Attention Module (SAM) and a Global Attention Module (GAM). By considering the position and scale correlation of pixels from spatial and global perspectives, the proposed module can adaptively learn the similarity between pixels so as to obtain more global context information of the image and recover the complex structure in the scene. Then we propose an improved Residual Refinement Module (RRM) to further refine the scene structure, yielding deeper semantic information and retaining more local details. Experimental results show that our method achieves more promising performance in object boundaries and local details compared with other state-of-the-art methods.

#### PSDNet: A Balanced Architecture of Accuracy and Parameters for Semantic Segmentation

Yue Liu, Zhichao Lian
Track 3: Computer Vision Robotics and Intelligent Systems
Thu 14 Jan 2021 at 14:00 in session PS T3.8

Auto-TLDR; Pyramid Pooling Module with SE1Cblock and D2SUpsample Network (PSDNet)

In this paper, we present our Pyramid Pooling Module (PPM) with SE1Cblock and D2SUpsample Network (PSDNet), a novel architecture for accurate semantic segmentation. Starting from the known work called Pyramid Scene Parsing Network (PSPNet), PSDNet takes advantage of the pyramid pooling structure with a channel attention module and a feature transform module in the Pyramid Pooling Module (PPM). The enhanced PPM with these two components can strengthen context information flowing in the network instead of damaging it. The channel attention module is an improved "Squeeze and Excitation with 1D Convolution" (SE1C) block, which can explicitly model the interrelationship between channels with fewer parameters. We propose a feature transform module named "Depth to Space Upsampling" (D2SUpsample) in the PPM, which keeps the integrity of features by transforming features while interpolating them, at the same time reducing parameters. In addition, we introduce a joint strategy in SE1Cblock which combines two variants of global pooling without increasing parameters. Compared with PSPNet, our work achieves higher accuracy on public datasets, with 73.97% mIoU and 82.89% mAcc on the Cityscapes dataset based on a ResNet50 backbone.
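
The "Depth to Space Upsampling" name suggests the generic depth-to-space (pixel shuffle) rearrangement, which trades channels for spatial resolution without any learned parameters. A minimal pure-Python sketch of that generic operation follows, assuming a channel-first layout; it is not the paper's D2SUpsample module, whose internals the abstract does not specify, and the function name is hypothetical.

```python
def depth_to_space(x, r):
    """Rearrange a [C*r*r][H][W] nested list into [C][H*r][W*r].

    Generic pixel-shuffle layout: input channel c*r*r + di*r + dj at pixel
    (i, j) moves to output channel c at pixel (i*r + di, j*r + dj).
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = c_in // (r * r)
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(h):
            for j in range(w):
                for di in range(r):
                    for dj in range(r):
                        src = x[c * r * r + di * r + dj][i][j]
                        out[c][i * r + di][j * r + dj] = src
    return out


# Four 1x1 channels become one 2x2 channel when r = 2.
features = [[[1]], [[2]], [[3]], [[4]]]
upsampled = depth_to_space(features, 2)  # [[[1, 2], [3, 4]]]
```

Because the operation only moves values around, every input value survives in the output, in contrast to interpolation, which blends neighbouring values.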

#### Progressive Scene Segmentation Based on Self-Attention Mechanism

Yunyi Pan, Yuan Gan, Kun Liu, Yan Zhang
Track 3: Computer Vision Robotics and Intelligent Systems
Fri 15 Jan 2021 at 15:00 in session PS T3.10

Auto-TLDR; Two-Stage Semantic Scene Segmentation with Self-Attention

Semantic scene segmentation is vital for a large variety of applications as it enables understanding of 3D data. Nowadays, various approaches based upon point clouds ignore the mathematical distribution of points and treat the points equally. The methods following this direction neglect the imbalance problem of samples that naturally exists in scenes. To avoid these issues, we propose a two-stage semantic scene segmentation framework based on a self-attention mechanism and achieve state-of-the-art performance on 3D scene understanding tasks. We split the whole task into two small ones, which efficiently relieves the sample imbalance issue. In addition, we have designed a new self-attention block which can be inserted into submanifold convolution networks to model the long-range dependencies that exist among points. The proposed network consists of an encoder and a decoder, with the spatial-wise and channel-wise attention modules inserted. The two-stage network shares a U-Net architecture and is an end-to-end trainable framework which can predict the semantic label for the scene point clouds fed into it. Experiments on standard benchmarks of 3D scenes imply that our network performs on par with or better than the existing state-of-the-art methods.

#### Rethinking ReID: Multi-Feature Fusion Person Re-Identification Based on Orientation Constraints

Mingjing Ai, Guozhi Shan, Bo Liu, Tianyang Liu
Track 2: Biometrics, Human Analysis and Behavior Understanding
Wed 13 Jan 2021 at 12:00 in session PS T2.2

Auto-TLDR; Person Re-identification with Orientation Constrained Network

Person re-identification (ReID) aims to identify a specific pedestrian in a series of images or videos. Recently, ReID has been receiving more and more attention in computer vision research and in applications such as intelligent security. One major issue degrading ReID model performance is that different subjects in the same body orientation look too similar for the model to distinguish, while the same subject viewed in different orientations looks rather different. However, most current studies do not explicitly differentiate pedestrians by orientation when designing the network, so we rethink this problem from the perspective of person orientation and propose a new network structure with two branches: one handling samples with the same body orientation and the other handling samples with different body orientations. Correspondingly, we also propose an orientation classifier that can accurately distinguish the orientation of each person. At the same time, three-part loss functions are introduced for the orientation constraint and combined to optimize the network simultaneously. Also, we use global and local features in the training stage in order to make use of multi-level information. Therefore, our network derives its efficacy from orientation constraints and multiple features. Experiments show that our method not only has competitive performance on multiple datasets, but also ranks retrieval results aligned with the orientation of the query sample higher, which may have great potential in practical applications.

#### DA-RefineNet: Dual-Inputs Attention RefineNet for Whole Slide Image Segmentation

Ziqiang Li, Rentuo Tao, Qianrun Wu, Bin Li
Track 5: Image and Signal Processing
Thu 14 Jan 2021 at 12:00 in session PS T5.5

Auto-TLDR; DA-RefineNet: A dual-inputs attention network for whole slide image segmentation

Automatic medical image segmentation techniques have wide applications in disease diagnosis; however, the task is much more challenging than natural optical image segmentation due to the high resolution of medical images and the correspondingly huge computation cost. Sliding windows are a commonly used technique for whole slide image (WSI) segmentation; however, the main drawback of sliding-window-based methods is the lack of global contextual information for supervision. In this paper, we propose a dual-inputs attention network (denoted as DA-RefineNet) for WSI segmentation, where both local fine-grained information and global coarse information can be efficiently utilized. Comparative experiments were conducted to evaluate the effectiveness of the proposed method; the results show that it achieves better performance on WSI segmentation tasks than methods that rely on a single input.