Person Recognition with HGR Maximal Correlation on Multimodal Data

Yihua Liang, Fei Ma, Yang Li, Shao-Lun Huang

Auto-TLDR; A correlation-based multimodal person recognition framework that learns discriminative embeddings of persons by jointly learning visual and audio features

Multimodal person recognition is a common task in video analysis and public surveillance, where information from multiple modalities, such as images and audio extracted from videos, is used to jointly determine the identity of a person. Previous person recognition techniques either use only uni-modal data or only consider shared representations between different input modalities, while leaving the extraction of their relationship with identity information to downstream tasks. Furthermore, real-world data often contain noise, which makes recognition more challenging in practical situations. In our work, we propose a novel correlation-based multimodal person recognition framework that is relatively simple but can effectively learn supervised information in multimodal data fusion and resist noise. Specifically, our framework learns discriminative embeddings of persons by jointly learning visual and audio features while maximizing the HGR maximal correlation among the multimodal inputs and persons' identities. Experiments are conducted on a subset of VoxCeleb2. Compared with state-of-the-art methods, the proposed method demonstrates improved accuracy and robustness to noise.
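
The abstract does not say how the HGR maximal correlation is estimated. As a minimal sketch, assuming the commonly used Soft-HGR surrogate (the inner product of zero-mean features minus half the trace of the product of their covariances), a pairwise score between two feature batches could look like the following. The function name `soft_hgr`, the feature dimension, and the synthetic data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def soft_hgr(f, g):
    """Soft-HGR surrogate for two feature batches of shape (n, k).

    Assumption: this is a stand-in objective, not the paper's code.
    """
    f = f - f.mean(axis=0)                      # enforce zero-mean features
    g = g - g.mean(axis=0)
    n = f.shape[0]
    inner = np.sum(f * g) / (n - 1)             # estimate of E[f(X)^T g(Y)]
    cov_f = np.cov(f, rowvar=False)             # (k, k) covariance of f
    cov_g = np.cov(g, rowvar=False)             # (k, k) covariance of g
    return inner - 0.5 * np.trace(cov_f @ cov_g)

# Synthetic stand-ins for visual and audio embeddings (hypothetical data).
rng = np.random.default_rng(0)
visual = rng.normal(size=(500, 4))
audio_related = visual + 0.1 * rng.normal(size=(500, 4))
audio_unrelated = rng.normal(size=(500, 4))

score_related = soft_hgr(visual, audio_related)
score_unrelated = soft_hgr(visual, audio_unrelated)
```

Correlated features score higher than unrelated ones. One plausible way to involve identities, again an assumption, is to sum such pairwise scores over (visual, audio), (visual, identity), and (audio, identity) embeddings and maximize the total during training.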

Similar papers

Robust Audio-Visual Speech Recognition Based on Hybrid Fusion

Hong Liu, Wenhao Li, Bing Yang

Auto-TLDR; Hybrid Fusion Based AVSR with Residual Networks and Bidirectional Gated Recurrent Unit for Robust Speech Recognition in Noise Conditions

The fusion of audio and visual modalities is an important stage of audio-visual speech recognition (AVSR), which is generally approached through feature fusion or decision fusion. Feature fusion can effectively exploit the covariations between features from different modalities, whereas decision fusion is robust in capturing an optimal combination of the modalities. In this work, to take full advantage of the complementarity of the two fusion strategies and address the challenge of inherent ambiguity in noisy environments, we propose a novel hybrid fusion based AVSR method with residual networks and a Bidirectional Gated Recurrent Unit (BGRU), which is able to distinguish homophones in both clean and noisy conditions. Specifically, a simple yet effective audio-visual encoder is used to map audio and visual features into a shared latent space to capture more discriminative multi-modal features and find the internal correlation between the spatial-temporal information of different modalities. Furthermore, a decision fusion module is designed to produce the final predictions so as to robustly utilize the reliability measures of audio-visual information. Finally, we introduce a combined loss, which shows its noise-robustness in learning the joint representation across various modalities. Experimental results on the largest publicly available dataset (LRW) demonstrate the robustness of the proposed method under various noisy conditions.

End-To-End Triplet Loss Based Emotion Embedding System for Speech Emotion Recognition

Puneet Kumar, Sidharth Jain, Balasubramanian Raman, Partha Pratim Roy, Masakazu Iwamura

Auto-TLDR; End-to-End Neural Embedding System for Speech Emotion Recognition

In this paper, an end-to-end neural embedding system based on triplet loss and residual learning is proposed for speech emotion recognition. The proposed system learns the embeddings from the emotional information of the speech utterances. The learned embeddings are used to recognize the emotions portrayed by given speech samples of various lengths. The proposed system implements a Residual Neural Network architecture. It is trained using softmax pre-training and a triplet loss function. The weights between the fully connected and embedding layers of the trained network are used to calculate the embedding values. The embedding representations of various emotions are mapped onto a hyperplane, and the angles among them are computed using the cosine similarity. These angles are utilized to classify a new speech sample into its appropriate emotion class. The proposed system has demonstrated 91.67% and 64.44% accuracy while recognizing emotions for the RAVDESS and IEMOCAP datasets, respectively.
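
The two ingredients named above, a triplet loss and cosine-similarity classification, can be sketched as follows. This is a minimal illustration under assumptions: the squared Euclidean distance, the margin value, the two-dimensional toy embeddings, and the function names are hypothetical and not taken from the paper.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on squared Euclidean distances."""
    d_pos = np.sum((anchor - positive) ** 2)    # anchor to same-emotion sample
    d_neg = np.sum((anchor - negative) ** 2)    # anchor to other-emotion sample
    return max(0.0, d_pos - d_neg + margin)

def classify_by_cosine(embedding, class_embeddings):
    """Return the index of the class embedding with the highest cosine similarity."""
    e = embedding / np.linalg.norm(embedding)
    c = class_embeddings / np.linalg.norm(class_embeddings, axis=1, keepdims=True)
    return int(np.argmax(c @ e))

# Toy class embeddings (hypothetical): index 0 = "happy", index 1 = "sad".
happy = np.array([1.0, 0.0])
sad = np.array([0.0, 1.0])

easy = triplet_loss(happy, happy, sad)          # negative already far away
hard = triplet_loss(happy, sad, happy)          # positive far, negative close
label = classify_by_cosine(np.array([0.9, 0.1]), np.stack([happy, sad]))
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, and the classifier assigns a new sample to the class whose embedding forms the smallest angle with it.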

DAIL: Dataset-Aware and Invariant Learning for Face Recognition

Gaoang Wang, Chen Lin, Tianqiang Liu, Mingwei He, Jiebo Luo

Auto-TLDR; DAIL: Dataset-Aware and Invariant Learning for Face Recognition

To achieve good performance in face recognition, a large-scale training dataset is usually required. A simple yet effective way of improving recognition performance is to use as large a dataset as possible by combining multiple datasets in training. However, naively combining different datasets is problematic due to two major issues. Firstly, the same person can appear in different datasets, leading to an identity overlap between datasets. Naively treating the same person as different classes in different datasets during training will affect back-propagation and generate non-representative embeddings. On the other hand, manually cleaning labels takes a lot of human effort, especially when there are millions of images and thousands of identities. Secondly, different datasets are collected in different situations and thus have different domain distributions. Naively combining datasets leads to domain distribution differences and makes it difficult to learn domain-invariant embeddings across datasets. In this paper, we propose DAIL: Dataset-Aware and Invariant Learning to resolve the above-mentioned issues. To solve the first issue of identity overlap, we propose a dataset-aware loss for multi-dataset training that reduces the penalty when the same person appears in multiple datasets. This can be readily achieved with a modified softmax loss with a dataset-aware term. To solve the second issue, domain adaptation with gradient reversal layers is employed for dataset-invariant learning. The proposed approach not only achieves state-of-the-art results on several commonly used face recognition validation sets, such as LFW, CFP-FP, and AgeDB-30, but also shows great benefit for practical usage.
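
The gradient reversal layer mentioned above is the identity in the forward pass and flips the sign of the gradient in the backward pass, so the feature extractor is pushed to confuse a dataset classifier. A framework-free sketch, assuming a hand-written forward/backward interface and a scaling factor `lam` (both hypothetical here), might look like this:

```python
import numpy as np

class GradientReversal:
    """Identity on the forward pass, negated and scaled gradient on the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam                           # strength of the reversal (assumed value)

    def forward(self, x):
        return x                                 # features pass through unchanged

    def backward(self, grad_output):
        return -self.lam * grad_output           # reversed gradient for the extractor

layer = GradientReversal(lam=0.5)
features = np.array([1.0, -2.0, 3.0])
out = layer.forward(features)

# Gradient arriving from a (hypothetical) dataset classifier head.
grad_from_classifier = np.array([0.2, 0.4, -0.6])
grad_to_extractor = layer.backward(grad_from_classifier)
```

In an autodiff framework the same behavior is normally implemented as a custom differentiable operation; the class above only illustrates the sign flip.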

Audio-Visual Speech Recognition Using a Two-Step Feature Fusion Strategy

Hong Liu, Wanlu Xu, Bing Yang

Auto-TLDR; A Two-Step Feature Fusion Network for Speech Recognition

Lip-reading methods and fusion strategies are crucial for audio-visual speech recognition. In recent years, most approaches have involved two separate audio and visual streams with early or late fusion strategies. Such a single-stage fusion method may fail to guarantee the integrity and representativeness of the fused information simultaneously. This paper extends a traditional single-stage fusion network to a two-step feature fusion network by adding an audio-visual early feature fusion (AV-EFF) stream to the baseline model. This method can learn the fusion information of different stages, preserving the original features as much as possible and ensuring the independence of different features. In addition, to capture long-range dependencies in video information, a non-local block is added to the feature extraction part of the visual stream (NL-Visual) to obtain long-term spatio-temporal features. Experimental results on the two largest public datasets in English (LRW) and Mandarin (LRW-1000) demonstrate that our method is superior to other state-of-the-art methods.

G-FAN: Graph-Based Feature Aggregation Network for Video Face Recognition

He Zhao, Yongjie Shi, Xin Tong, Jingsi Wen, Xianghua Ying, Jinshi Hongbin Zha

Auto-TLDR; Graph-based Feature Aggregation Network for Video Face Recognition

In this paper, we propose a graph-based feature aggregation network (G-FAN) for video face recognition. Compared with still-image face recognition, video face recognition poses great challenges due to huge intra-class variability and high inter-class ambiguity. To address this problem, our G-FAN first uses a Convolutional Neural Network to extract deep features for every input face of a subject. Then, we build an affinity graph based on the relation between facial features and apply a Graph Convolutional Network to generate fine-grained quality vectors for each frame. Finally, the features among multiple frames are adaptively aggregated into a discriminative vector to represent a video face. Different from previous works that take a single image as input, our G-FAN can utilize the correlation information between image pairs and aggregate a template of faces simultaneously. The experiments on video face recognition benchmarks, including YTF, IJB-A, and IJB-C, show that (i) G-FAN automatically learns to favor high-quality frames while suppressing low-quality ones, and (ii) G-FAN significantly boosts recognition accuracy and outperforms other state-of-the-art aggregation methods.

Mutual Alignment between Audiovisual Features for End-To-End Audiovisual Speech Recognition

Hong Liu, Yawei Wang, Bing Yang

Auto-TLDR; Mutual Iterative Attention for Audio Visual Speech Recognition

The asynchronization issue caused by different types of modalities is one of the major problems in audio-visual speech recognition (AVSR) research. However, most AVSR systems merely rely on upsampling of video or downsampling of audio to align audio and visual features, assuming that the feature sequences are aligned frame by frame. These pre-processing steps oversimplify the asynchrony relation between the acoustic signal and lip motion, lacking flexibility and impairing the performance of the system. Although there are systems that model the asynchrony between the modalities, they sometimes fail to align speech and video precisely in some or even all noise conditions. In this paper, we propose a mutual feature alignment method for AVSR which makes full use of cross-modality information to address the asynchronization issue by introducing a Mutual Iterative Attention (MIA) mechanism. Our method can automatically learn an alignment in a mutual way by performing mutual attention iteratively between the audio and visual features, relying on a modified Transformer encoder structure. Experimental results show that our proposed method obtains absolute improvements of up to 20.42% over the audio modality alone, depending on the signal-to-noise ratio (SNR) level. Better recognition performance can also be achieved compared with the traditional feature concatenation method under both clean and noisy conditions. We expect that our proposed mutual feature alignment method can be easily generalized to other multimodal tasks with semantically correlated information.

Multi-Scale Cascading Network with Compact Feature Learning for RGB-Infrared Person Re-Identification

Can Zhang, Hong Liu, Wei Guo, Mang Ye

Auto-TLDR; Multi-Scale Part-Aware Cascading for RGB-Infrared Person Re-identification

RGB-Infrared person re-identification (RGB-IR Re-ID) aims to match persons across heterogeneous images captured by visible and thermal cameras, which is of great significance for surveillance systems under poor lighting conditions. Facing great challenges from complex variances, including conventional single-modality and additional inter-modality discrepancies, most existing RGB-IR Re-ID methods work directly on global features for simultaneous elimination, whereas modality-specific noise and modality-shared features are not well considered. To address these issues, a novel Multi-Scale Part-Aware Cascading framework (MSPAC) is formulated by aggregating multi-scale fine-grained features from part to global in a cascading manner, which results in a unified representation that is robust to noise. Moreover, a marginal exponential center (MeCen) loss is introduced to jointly eliminate mixed variances, which enables modeling cross-modality correlations on sharable salient features. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art methods by a large margin.

RGB-Infrared Person Re-Identification Via Image Modality Conversion

Huangpeng Dai, Qing Xie, Yanchun Ma, Yongjian Liu, Shengwu Xiong

Auto-TLDR; CE2L: A Novel Network for Cross-Modality Re-identification with Feature Alignment

As a cross-modality retrieval task, RGB-infrared person re-identification (Re-ID) is an important and challenging task because of its role in video surveillance applications and the large cross-modality variations between visible and infrared images. Most previous works addressed the cross-modality gap with feature alignment by directly learning from the original feature representations. In this paper, different from existing works, we propose a novel network (CE2L) to tackle the cross-modality gap with feature alignment. CE2L mainly focuses on adding discriminative information and learning robust features by converting between the visible and infrared image modalities. Its merits are highlighted in two aspects: 1) using CycleGAN to convert infrared images into color images not only increases the recognition characteristics of the images, but also allows our network to better learn the image features of the two modalities; 2) our novel method can serve as data augmentation. Specifically, it can increase data diversity and the total amount of data, countering over-fitting, by converting labeled training images into images of the other modality. Extensive experimental results on two datasets demonstrate superior performance compared to the baseline and the state-of-the-art methods.

Angular Sparsemax for Face Recognition

Chi Ho Chan, Josef Kittler

Auto-TLDR; Angular Sparsemax for Face Recognition

We formulate a novel loss function, called Angular Sparsemax, for face recognition. The proposed loss function promotes sparseness of the hypotheses prediction function, similar to Sparsemax with Fenchel-Young regularisation. By introducing an additive angular margin on the score vector, the discriminatory power of the face embedding is further improved. The proposed loss function is experimentally validated on several databases in terms of recognition accuracy. Its performance compares well with the state-of-the-art ArcFace loss.
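
The two building blocks named above can be illustrated separately: the sparsemax projection, which maps scores to a sparse probability vector, and an additive angular margin applied to the target-class cosine score. This is a sketch under assumptions; how the paper combines the two into its loss is not stated in the abstract, and the function names and margin value are hypothetical.

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of a score vector onto the probability simplex."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]                  # scores in descending order
    cumsum = np.cumsum(z_sorted)
    ks = np.arange(1, z.size + 1)
    support = 1.0 + ks * z_sorted > cumsum       # entries that stay non-zero
    k = ks[support][-1]                          # size of the support
    tau = (cumsum[k - 1] - 1.0) / k              # threshold subtracted from scores
    return np.maximum(z - tau, 0.0)

def add_angular_margin(cosines, target, margin=0.5):
    """Replace cos(theta) of the target class by cos(theta + margin).

    Simplification: the case theta + margin > pi is not handled here.
    """
    cosines = np.clip(np.asarray(cosines, dtype=float), -1.0, 1.0)
    out = cosines.copy()
    out[target] = np.cos(np.arccos(cosines[target]) + margin)
    return out

sparse_probs = sparsemax([2.0, 0.0])             # one class takes all the mass
dense_probs = sparsemax([0.5, 0.3, 0.1])         # all classes stay in the support
penalized = add_angular_margin([0.8, 0.1], target=0)
```

Unlike softmax, sparsemax can assign exactly zero probability to low-scoring classes, while the margin lowers the target score so that the embedding must be closer in angle to its class center to reach the same score.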

Building Computationally Efficient and Well-Generalizing Person Re-Identification Models with Metric Learning

Vladislav Sovrasov, Dmitry Sidnev

Auto-TLDR; Cross-Domain Generalization in Person Re-identification using Omni-Scale Network

This work considers the problem of domain shift in person re-identification. Having been trained on one dataset, a re-identification model usually performs much worse on unseen data. This gap is partially caused by the relatively small scale of person re-identification datasets (compared to face recognition ones, for instance), but it is also related to training objectives. We propose to use a metric learning objective, namely the AM-Softmax loss, and some additional training practices to build well-generalizing yet computationally efficient models. We use the recently proposed Omni-Scale Network (OSNet) architecture combined with several training tricks and architecture adjustments to obtain state-of-the-art results in the cross-domain generalization problem on the large-scale MSMT17 dataset in three setups: MSMT17-all->DukeMTMC, MSMT17-train->Market1501 and MSMT17-all->Market1501.
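
For reference, the AM-Softmax loss subtracts a margin from the target-class cosine score before a scaled softmax cross-entropy. A minimal single-sample sketch follows; the scale and margin values are illustrative assumptions and are not the settings reported by the authors.

```python
import numpy as np

def am_softmax_loss(cosines, target, scale=30.0, margin=0.35):
    """AM-Softmax loss for one sample, given cosine scores to all class centers."""
    logits = scale * np.asarray(cosines, dtype=float)
    logits[target] -= scale * margin             # additive margin on the target class
    logits -= logits.max()                       # shift for numerical stability
    log_prob = logits[target] - np.log(np.sum(np.exp(logits)))
    return -log_prob

cosines = [0.7, 0.2, -0.1]                       # hypothetical cosine scores
with_margin = am_softmax_loss(cosines, target=0)
without_margin = am_softmax_loss(cosines, target=0, margin=0.0)
plain = am_softmax_loss([0.0, 0.0], target=0, scale=1.0, margin=0.0)
```

With the margin set to zero and the scale set to one, the function reduces to ordinary softmax cross-entropy on the cosine scores; a positive margin makes the same sample harder, which encourages larger angular separation between identities.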

Unsupervised Disentangling of Viewpoint and Residues Variations by Substituting Representations for Robust Face Recognition

Minsu Kim, Joanna Hong, Junho Kim, Hong Joo Lee, Yong Man Ro

Auto-TLDR; Unsupervised Disentangling of Identity, viewpoint, and Residue Representations for Robust Face Recognition

It is well known that identity-unrelated variations (e.g., viewpoint or illumination) degrade the performance of face recognition methods. To handle this challenge, robust methods for disentangling the identity and view representations have drawn attention in the machine learning area. However, existing methods learn discriminative features that require manual supervision of such factors of variation. In this paper, we propose a novel disentangling framework that models three representations, identity, viewpoint, and residues (i.e., factors unrelated to identity and pose), and does not require supervision of the variations. By jointly modeling the three representations, we enhance the disentanglement of each representation and achieve robust face recognition performance. Further, the learned viewpoint representation can be utilized for pose estimation or editing of a posed facial image. Extensive quantitative and qualitative evaluations verify the effectiveness of our proposed method, which disentangles identity, viewpoint, and residues of facial images.

Three-Dimensional Lip Motion Network for Text-Independent Speaker Recognition

Jianrong Wang, Tong Wu, Shanyu Wang, Mei Yu, Qiang Fang, Ju Zhang, Li Liu

Auto-TLDR; Lip Motion Network for Text-Independent and Text-Dependent Speaker Recognition

Lip motion reflects behavioral characteristics of speakers, and thus can be used as a new kind of biometric in speaker recognition. In the literature, many works have used two-dimensional (2D) lip images to recognize speakers in a text-dependent context. However, 2D lip images are easily affected by face orientation. To this end, in this work, we present a novel end-to-end 3D lip motion Network (3LMNet) that utilizes sentence-level 3D lip motion (S3DLM) to recognize speakers in both text-independent and text-dependent contexts. A novel regional feedback module (RFM) is proposed to explore attention in different lip regions. Besides, prior knowledge of lip motion is investigated to complement the RFM, where landmark-level and frame-level features are merged to form a better feature representation. Moreover, we present two methods, i.e., coordinate transformation and face posture correction, to pre-process the LSD-AV dataset, which contains 68 speakers and 146 sentences per speaker. The evaluation results on this dataset demonstrate that our proposed 3LMNet is superior to the baseline models, i.e., LSTM, VGG-16 and ResNet-34, and outperforms the state of the art using 2D lip images as well as the 3D face. The code of this work is released at https://github.com/wutong18/Three-Dimensional-Lip-Motion-Network-for-Text-Independent-Speaker-Recognition.

Attentive Part-Aware Networks for Partial Person Re-Identification

Lijuan Huo, Chunfeng Song, Zhengyi Liu, Zhaoxiang Zhang

Auto-TLDR; Part-Aware Learning for Partial Person Re-identification

Partial person re-identification (re-ID) refers to re-identifying a person from occluded images. It suffers from two major challenges, i.e., insufficient training data and incomplete probe images. In this paper, we introduce an automatic data augmentation module and a part-aware learning method for partial re-identification. On the one hand, we adopt data augmentation to enhance the training data and help learn more stable partial features. On the other hand, we observe that partial person images usually contain fixed percentages of body parts. Therefore, in the partial person re-ID task, the probe image can be treated as cropped from a full picture and assigned to one of several partial types following fixed ratios. Based on the cropped images, we propose the Cropping Type Consistency (CTC) loss to classify the cropping types of partial images. Moreover, to help the network better fit the generated and cropped data, we incorporate the Block Attention Mechanism (BAM) into the framework for attentive learning. To enhance the retrieval performance in the inference stage, we crop the gallery images according to the predicted types of the partial probe images. By calculating feature distances between the partial image and the cropped holistic gallery images, we can recognize the right person from the gallery. To validate the effectiveness of our approach, we conduct extensive experiments on the partial re-ID benchmarks and achieve state-of-the-art performance.

Learning Disentangled Representations for Identity Preserving Surveillance Face Camouflage

Jingzhi Li, Lutong Han, Hua Zhang, Xiaoguang Han, Jingguo Ge, Xiaochu Cao

Auto-TLDR; Individual Face Privacy under Surveillance Scenario with Multi-task Loss Function

In this paper, we focus on protecting personal face privacy in surveillance scenarios, where the goal is to change the visual appearance of faces while keeping them recognizable by current face recognition systems. This is a challenging problem because we should retain the most important structures of captured facial images while altering the salient facial regions to protect personal privacy. To address this problem, we introduce a novel individual face protection model, which can camouflage the face appearance from the perspective of human visual perception and preserve the identity features of faces used for face authentication. To that end, we develop an encoder-decoder network architecture that disentangles the person feature representation into an appearance code and an identity code. Specifically, we first randomly divide the face images into two groups, the source set and the target set, where the source set is used to extract the identity code and the target set provides the appearance code. Then, we recombine the identity and appearance codes to synthesize a new face, which has the same identity as the source subject. Finally, the synthesized faces are used to replace the original faces to protect the privacy of individuals. Furthermore, our model is trained end-to-end with a multi-task loss function, which can better preserve the identity and stabilize the training loss. Experiments conducted on the Cross-Age Celebrity dataset demonstrate the effectiveness of our model and validate its superiority in terms of visual quality and scalability.

Cc-Loss: Channel Correlation Loss for Image Classification

Zeyu Song, Dongliang Chang, Zhanyu Ma, Li Xiaoxu, Zheng-Hua Tan

Auto-TLDR; Channel correlation loss for addressing image classification

The loss function is a key component in deep learning models. A commonly used loss function for classification is the cross-entropy loss, which is a simple yet effective application of information theory to classification problems. Based on this loss, many other loss functions have been proposed, e.g., by adding intra-class and inter-class constraints to enhance the discriminative ability of the learned features. However, these loss functions fail to consider the connections between the feature distribution and the model structure. To address this problem, we propose a channel correlation loss (CC-Loss) that is able to constrain the specific relations between classes and channels as well as maintain the intra-class and inter-class separability. CC-Loss uses a channel attention module to generate channel attention of features for each sample in the training stage. Next, a Euclidean distance matrix is calculated to make the channel attention vectors associated with the same class become identical and to increase the difference between different classes. Finally, we obtain a feature embedding with good intra-class compactness and inter-class separability. Experimental results show that two different backbone models trained with the proposed CC-Loss outperform the state-of-the-art loss functions on three image classification datasets.

Identity-Aware Facial Expression Recognition in Compressed Video

Xiaofeng Liu, Linghao Jin, Xu Han, Jun Lu, Jonghye Woo, Jane You

Auto-TLDR; Exploring Facial Expression Representation in Compressed Video with Mutual Information Minimization

This paper explores facial expression representations in the compressed video domain from which inter-subject variations are eliminated. Most previous methods process the RGB images of a sequence, although valuable expression-related muscle movement is already embedded, off the shelf, in the compression format. In the compressed domain, where the data is compressed by up to two orders of magnitude, we can explicitly infer the expression from the residual frames, and it is possible to extract identity factors from the I frame with a pre-trained face recognition network. By enforcing marginal independence between them, the expression feature is expected to be purer for the expression and robust to identity shifts. Specifically, we propose a novel collaborative min-min game for mutual information (MI) minimization in latent space. We do not need the identity label or multiple expression samples from the same person for identity elimination. Moreover, when the apex frame is annotated in the dataset, a complementary constraint can be added to regularize the feature-level game. In testing, only the compressed residual frames are required to achieve expression prediction. Our solution achieves comparable or better performance than recent decoded-image-based methods on typical FER benchmarks, with about 3× faster inference on compressed data.

SATGAN: Augmenting Age Biased Dataset for Cross-Age Face Recognition

Wenshuang Liu, Wenting Chen, Yuanlue Zhu, Linlin Shen

Auto-TLDR; SATGAN: Stable Age Translation GAN for Cross-Age Face Recognition

In this paper, we propose a Stable Age Translation GAN (SATGAN) to generate fake face images at different ages to augment age-biased face datasets for Cross-Age Face Recognition (CAFR). The proposed SATGAN consists of a generator and a discriminator. As a part of the generator, a novel Mask Attention Module (MAM) is introduced to make the generator focus on the face area. In addition, the generator employs a Uniform Distribution Discriminator (UDD) to supervise the learning of the latent feature map and enforce a uniform distribution. Besides, the discriminator employs a Feature Separation Module (FSM) to disentangle identity information from age information. The quantitative and qualitative evaluations on the Morph dataset show that SATGAN achieves much better performance than existing methods. Face recognition models trained using datasets (VGGFace2 and MS-Celeb-1M) augmented with our SATGAN achieve better accuracy on cross-age datasets such as Cross-Age LFW and AgeDB-30.

Single-Modal Incremental Terrain Clustering from Self-Supervised Audio-Visual Feature Learning

Reina Ishikawa, Ryo Hachiuma, Akiyoshi Kurobe, Hideo Saito

Auto-TLDR; Multi-modal Variational Autoencoder for Terrain Type Clustering

The key to an accurate understanding of terrain is to extract informative features from the multi-modal data obtained from different devices. Sensors such as RGB cameras, depth sensors, vibration sensors, and microphones are used to collect the multi-modal data. Many studies have explored ways to use them, especially in the robotics field. Some papers have successfully introduced single-modal or multi-modal methods. However, in practice, robots can be faced with extreme conditions: microphones do not work well in crowded scenes, and an RGB camera cannot capture terrain well in the dark. In this paper, we present a novel framework that applies a multi-modal variational autoencoder and the Gaussian mixture model clustering algorithm to image data and audio data for terrain type clustering. Our method enables terrain type clustering even if one of the modalities (either image or audio) is missing at test time. We compared the clustering accuracy against a conventional multi-modal terrain type clustering method and conducted ablation studies to show the effectiveness of our approach.

Spatial Bias in Vision-Based Voice Activity Detection

Kalin Stefanov, Mohammad Adiban, Giampiero Salvi

Auto-TLDR; Spatial Bias in Vision-based Voice Activity Detection in Multiparty Human-Human Interactions

We present models for automatic vision-based voice activity detection (VAD) in multiparty human-human interactions that are aimed at complementing acoustic VAD methods. We provide evidence that this type of vision-based VAD model is susceptible to spatial bias in the datasets. The physical settings of the interaction, usually constant throughout data acquisition, determine the distribution of head poses of the participants. Our results show that when the head pose distributions are significantly different in the training and test sets, the performance of the models drops significantly. This suggests that previously reported results on datasets with a fixed physical configuration may overestimate the generalization capabilities of this type of model. We also propose a number of possible remedies to the spatial bias, including data augmentation, input masking and dynamic features, and provide an in-depth analysis of the visual cues used by our models.

Face Image Quality Assessment for Model and Human Perception

Ken Chen, Yichao Wu, Zhenmao Li, Yudong Wu, Ding Liang

Auto-TLDR; A labour-saving method for FIQA training with contradictory data from multiple sources

Practical face image quality assessment (FIQA) models are trained under the supervision of labeled data, which requires a varying amount of human labor. Human-labeled quality scores are consistent with perceptual intuition but laborious to obtain. On the other hand, models can be trained with data generated automatically by recognition models with artificially selected references. However, the recognition scores are sometimes inaccurate, which may yield wrong quality scores during FIQA training. In this paper, we propose a labour-saving method for quality score generation. For the first time, we conduct systematic investigations to show that there exist severe contradictions between different types of target quality, which we call the distribution gap (DG). To bridge the gap, we propose a novel framework for training FIQA models by combining the merits of data from different sources. To make the target scores from multiple sources compatible, we design a method called quality distribution alignment (QDA). Meanwhile, to correct the wrong targets produced by recognition models, contradictory samples selection (CSS) is adopted to adaptively select samples from the human-labeled dataset. Extensive experiments and analysis on public benchmarks including MegaFace demonstrate the superiority of our method in terms of effectiveness and efficiency.

Attentive Hybrid Feature Based a Two-Step Fusion for Facial Expression Recognition

Jun Weng, Yang Yang, Zichang Tan, Zhen Lei

Auto-TLDR; Attentive Hybrid Architecture for Facial Expression Recognition

Facial expression recognition is an inherently challenging task, especially for in-the-wild images with various occlusions and large pose variations, which may lead to the loss of some crucial information. To address this, we propose an attentive hybrid architecture (AHA) which learns global, local and integrated features based on different face regions. Compared with a single type of feature, our extracted features carry complementary information and can reduce the loss of crucial information. Specifically, AHA contains three branches, where all sub-networks in those branches employ the attention mechanism to further localize the pixels/regions of interest. Moreover, we propose a two-step fusion strategy based on LSTM to deeply explore the hidden correlations among different face regions. Extensive experiments on four popular expression databases (i.e., CK+, FER-2013, SFEW 2.0, RAF-DB) show the effectiveness of the proposed method.

Learning Emotional Blinded Face Representations

Alejandro Peña Almansa, Julian Fierrez, Agata Lapedriza, Aythami Morales

Auto-TLDR; Blind Face Representations for Emotion Recognition

This work proposes two new face representations that are blind to the expressions associated with emotional responses. This work is in part motivated by new international regulations for personal data protection, which force data controllers to protect any kind of sensitive information involved in automatic processes. The advances in affective computing have contributed to improving human-machine interfaces, but at the same time, the capacity to monitor emotional responses triggers potential risks for humans, both in terms of fairness and privacy. We propose two different methods to learn these facial expression blinded features. We show that it is possible to eliminate information related to emotion recognition tasks, while the performance of subject verification, gender recognition, and ethnicity classification is only slightly affected. We also present an application to train fairer classifiers over a protected facial expression attribute. The results demonstrate that it is possible to reduce emotional information in the face representation while retaining competitive performance in other face-based artificial intelligence tasks.

Audio-Video Detection of the Active Speaker in Meetings

Francisco Madrigal, Frederic Lerasle, Lionel Pibre, Isabelle Ferrané

Auto-TLDR; Active Speaker Detection with Visual and Contextual Information from Meeting Context

Meetings are a common activity that poses certain challenges when creating systems that assist them. Such is the case of active speaker detection, which can provide useful information for human interaction modeling or human-robot interaction. Active speaker detection is mostly done using speech; however, certain visual and contextual information can provide additional insights. In this paper we propose an active speaker detection framework that integrates audiovisual features with social information from the meeting context. The visual cue is processed using a Convolutional Neural Network (CNN) that captures the spatio-temporal relationships. We analyze several CNN architectures with both cues: raw pixels (RGB images) and motion (estimated with optical flow). Contextual reasoning is done with an original methodology, based on the gaze of all participants. We evaluate our proposal on a public state-of-the-art benchmark: the AMI corpus. We show how the addition of visual and context information improves the performance of active speaker detection.

3D Facial Matching by Spiral Convolutional Metric Learning and a Biometric Fusion-Net of Demographic Properties

Soha Sadat Mahdi, Nele Nauwelaers, Philip Joris, Giorgos Bouritsas, Imperial London, Sergiy Bokhnyak, Susan Walsh, Mark Shriver, Michael Bronstein, Peter Claes

Auto-TLDR; Multi-biometric Fusion for Biometric Verification using 3D Facial Measures

Face recognition is a widely accepted biometric verification tool, as the face contains a lot of information about the identity of a person. In this study, a 2-step neural-based pipeline is presented for matching 3D facial shape to multiple DNA-related properties (sex, age, BMI and genomic background). The first step consists of a triplet loss-based metric learner that compresses facial shape into a lower dimensional embedding while preserving information about the property of interest. Most studies in the field of metric learning have only focused on Euclidean data. In this work, geometric deep learning is employed to learn directly from 3D facial meshes. To this end, spiral convolutions are used along with a novel mesh-sampling scheme that retains uniformly sampled 3D points at different levels of resolution. The second step is a multi-biometric fusion by a fully connected neural network. The network takes an ensemble of embeddings and property labels as input and returns genuine and imposter scores. Since embeddings are accepted as an input, there is no need to train classifiers for the different properties and available data can be used more efficiently. Results obtained by a 10-fold cross-validation for biometric verification show that combining multiple properties leads to stronger biometric systems. Furthermore, the proposed neural-based pipeline outperforms a linear baseline, which consists of principal component analysis, followed by classification with linear support vector machines and a Naïve Bayes-based score-fuser.

Automatic Annotation of Corpora for Emotion Recognition through Facial Expressions Analysis

Alex Mircoli, Claudia Diamantini, Domenico Potena, Emanuele Storti

Auto-TLDR; Automatic annotation of video subtitles on the basis of facial expressions using machine learning algorithms

The recent diffusion of social networks has made available an unprecedented amount of user-generated content, which may be analyzed in order to determine people's opinions and emotions about a large variety of topics. Research has made many efforts in defining accurate algorithms for analyzing emotions expressed by users in texts; however, their performance often relies on the existence of large annotated datasets, whose current scarcity represents a major issue. The manual creation of such datasets is a costly and time-consuming activity, and hence there is an increasing demand for techniques for the automatic annotation of corpora. In this work we present a methodology for the automatic annotation of video subtitles on the basis of the analysis of facial expressions of people in videos, with the goal of creating annotated corpora that may be used to train emotion recognition algorithms. Facial expressions are analyzed through machine learning algorithms, on the basis of a set of manually engineered facial features that are extracted from video frames. The soundness of the proposed methodology has been evaluated through extensive experimentation aimed at determining the performance of each methodological step on real datasets.

AttendAffectNet: Self-Attention Based Networks for Predicting Affective Responses from Movies

Thi Phuong Thao Ha, Bt Balamurali, Herremans Dorien, Roig Gemma

Auto-TLDR; AttendAffectNet: A Self-Attention Based Network for Emotion Prediction from Movies

In this work, we propose different variants of the self-attention based network for emotion prediction from movies, which we call AttendAffectNet. We take both audio and video into account and incorporate the relation among multiple modalities by applying the self-attention mechanism in a novel manner to the extracted features for emotion prediction. We compare it to the typical temporal integration of the self-attention based model, which, in our case, captures the relation of temporal representations of the movie while considering the sequential dependencies of emotion responses. We demonstrate the effectiveness of our proposed architectures on the extended COGNIMUSE dataset [1], [2] and the MediaEval 2016 Emotional Impact of Movies Task [3], which consist of movies with emotion annotations. Our results show that applying the self-attention mechanism on the different audio-visual features, rather than in the time domain, is more effective for emotion prediction. Our approach is also shown to outperform state-of-the-art models for emotion prediction.

Audio-Visual Predictive Coding for Self-Supervised Visual Representation Learning

Mani Kumar Tellamekala, Michel Valstar, Michael Pound, Timo Giesbrecht

Auto-TLDR; AV-PPC: A Multi-task Learning Framework for Learning Semantic Visual Features from Unlabeled Video Data

Self-supervised learning has emerged as a candidate approach to learn semantic visual features from unlabeled video data. In self-supervised learning, intrinsic correspondences between data points are used to define a proxy task that forces the model to learn semantic representations. Most existing proxy tasks applied to video data exploit only either intra-modal (e.g. temporal) or cross-modal (e.g. audio-visual) correspondences separately. In theory, jointly learning both these correspondences may result in richer visual features; but, as we show in this work, doing so is non-trivial in practice. To address this problem, we introduce 'Audio-Visual Permutative Predictive Coding' (AV-PPC), a multi-task learning framework designed to fully leverage the temporal and cross-modal correspondences as natural supervision signals. In AV-PPC, the model is trained to simultaneously learn multiple intra- and cross-modal predictive coding sub-tasks. By using visual speech recognition (lip-reading) as the downstream evaluation task, we show that our proposed proxy task can learn higher quality visual features than existing proxy tasks. We also show that AV-PPC visual features are highly data-efficient. Without further finetuning, the AV-PPC visual encoder achieves 80.30% spoken word classification rate on the LRW dataset, performing on par with directly or fully supervised visual encoders learned from large amounts of labeled data.

A Base-Derivative Framework for Cross-Modality RGB-Infrared Person Re-Identification

Hong Liu, Ziling Miao, Bing Yang, Runwei Ding

Auto-TLDR; Cross-modality RGB-Infrared Person Re-identification with Auxiliary Modalities

Cross-modality RGB-infrared (RGB-IR) person re-identification (Re-ID) is a challenging research topic due to the heterogeneity of RGB and infrared images. In this paper, we aim to find some auxiliary modalities, which are homologous with the visible or infrared modalities, to help reduce the modality discrepancy caused by heterogeneous images. Accordingly, a new base-derivative framework is proposed, where base refers to the original visible and infrared modalities, and derivative refers to the two auxiliary modalities that are derived from base. In the proposed framework, the double-modality cross-modal learning problem is reformulated as a four-modality one. After that, the images of all the base and derivative modalities are fed into the feature learning network. With the doubled input images, the learned person features become more discriminative. Furthermore, the proposed framework is optimized by the enhanced intra- and cross-modality constraints with the assistance of two derivative modalities. Experimental results on two publicly available datasets SYSU-MM01 and RegDB show that the proposed method outperforms the other state-of-the-art methods. For instance, we achieve a gain of over 13% in terms of both Rank-1 and mAP on RegDB dataset.

Progressive Learning Algorithm for Efficient Person Re-Identification

Zhen Li, Hanyang Shao, Liang Niu, Nian Xue

Auto-TLDR; Progressive Learning Algorithm for Large-Scale Person Re-Identification

This paper studies the problem of Person Re-Identification (ReID) for large-scale applications. Recent research efforts have been devoted to building complicated part models, which introduce considerably high computational cost and memory consumption, inhibiting their practicability in large-scale applications. This paper aims to develop a novel learning strategy to find efficient feature embeddings while maintaining the balance of accuracy and model complexity. More specifically, we find that by enhancing the classical triplet loss together with cross-entropy loss, our method can explore the hard examples and build a discriminant feature embedding that is compact enough for large-scale applications. Our method is carried out progressively using Bayesian optimization, and we call it the Progressive Learning Algorithm (PLA). Extensive experiments on three large-scale datasets show that our PLA is comparable to or better than the state of the art. In particular, on the challenging Market-1501 dataset, we achieve Rank-1=94.7%/mAP=89.4% while using at least 30% fewer parameters than strong part models.

Multi-Label Contrastive Focal Loss for Pedestrian Attribute Recognition

Xiaoqiang Zheng, Zhenxia Yu, Lin Chen, Fan Zhu, Shilong Wang

Auto-TLDR; Multi-label Contrastive Focal Loss for Pedestrian Attribute Recognition

Pedestrian Attribute Recognition (PAR) has received extensive attention during the past few years. With the advances of deep convolutional neural networks (CNNs), the performance of PAR has been significantly improved. Existing methods tend to acquire attribute-specific features by designing various complex network structures with additional modules. Such additional modules, however, dramatically increase the number of parameters. Meanwhile, the problems of class imbalance and hard attribute retrieval remain underestimated in PAR. In this paper, we explore the optimization mechanism of the training process to account for these problems and propose a new loss function called Multi-label Contrastive Focal Loss (MCFL). The proposed MCFL emphasizes the hard and minority attributes by using a separated re-weighting mechanism for different positive and negative classes to alleviate the impact of the imbalance. MCFL is also able to enlarge the gaps between the intra-class of multi-label attributes, to force CNNs to extract more subtle discriminative features. We evaluate the proposed MCFL on three large public pedestrian datasets, including RAP, PA-100K, and PETA. The experimental results indicate that the proposed MCFL with the ResNet-50 backbone is able to outperform the other state-of-the-art approaches in the comparison.

Multi-Level Deep Learning Vehicle Re-Identification Using Ranked-Based Loss Functions

Eleni Kamenou, Jesus Martinez-Del-Rincon, Paul Miller, Patricia Devlin-Hill

Auto-TLDR; Multi-Level Re-identification Network for Vehicle Re-Identification

Identifying vehicles across a network of cameras with non-overlapping fields of view remains a challenging research problem due to scene occlusions, significant inter-class similarity and intra-class variability. In this paper, we propose an end-to-end multi-level re-identification network that is capable of successfully projecting same-identity vehicles closer to one another in the embedding space, compared to vehicles of different identities. Robust feature representations are obtained by combining features at multiple levels of the network. As for the learning process, we employ a recent state-of-the-art structured metric learning loss function previously applied to other retrieval problems and adjust it to the vehicle re-identification task. Furthermore, we explore the cases of image-to-image, image-to-video and video-to-video similarity metrics. Finally, we evaluate our system and achieve strong performance on two large-scale publicly available datasets, CityFlow-ReID and VeRi-776. Compared to most existing state-of-the-art approaches, our approach is simpler and more straightforward, utilizing only identity-level annotations, while avoiding post-processing of the ranking results (re-ranking) at the testing phase.

Rethinking ReID: Multi-Feature Fusion Person Re-Identification Based on Orientation Constraints

Mingjing Ai, Guozhi Shan, Bo Liu, Tianyang Liu

Auto-TLDR; Person Re-identification with Orientation Constrained Network

Person re-identification (ReID) aims to identify a specific pedestrian in a series of images or videos. Recently, ReID has been receiving more and more attention in computer vision research and in applications such as intelligent security. One major issue degrading ReID model performance is that different subjects in the same body orientation look too similar for the model to distinguish, while the same subject viewed in different orientations looks rather different. However, most current studies do not explicitly differentiate pedestrians by orientation when designing the network, so we rethink this problem from the perspective of person orientation and propose a new network structure with two branches: one handling samples with the same body orientations and the other handling samples with different body orientations. Correspondingly, we also propose an orientation classifier that can accurately distinguish the orientation of each person. At the same time, three-part loss functions are introduced for the orientation constraint and combined to optimize the network simultaneously. Also, we use global and local features in the training stage in order to make use of multi-level information. Therefore, our network can derive its efficacy from orientation constraints and multiple features. Experiments show that our method not only has competitive performance on multiple datasets, but also ranks retrieval results aligned with the orientation of the query sample higher, which may have great potential in practical applications.

Two-Level Attention-Based Fusion Learning for RGB-D Face Recognition

Hardik Uppal, Alireza Sepas-Moghaddam, Michael Greenspan, Ali Etemad

Auto-TLDR; Fused RGB-D Facial Recognition using Attention-Aware Feature Fusion

With recent advances in RGB-D sensing technologies as well as improvements in machine learning and fusion techniques, RGB-D facial recognition has become an active area of research. A novel attention aware method is proposed to fuse two image modalities, RGB and depth, for enhanced RGB-D facial recognition. The proposed method first extracts features from both modalities using a convolutional feature extractor. These features are then fused using a two layer attention mechanism. The first layer focuses on the fused feature maps generated by the feature extractor, exploiting the relationship between feature maps using LSTM recurrent learning. The second layer focuses on the spatial features of those maps using convolution. The training database is preprocessed and augmented through a set of geometric transformations, and the learning process is further aided using transfer learning from a pure 2D RGB image training process. Comparative evaluations demonstrate that the proposed method outperforms other state-of-the-art approaches, including both traditional and deep neural network-based methods, on the challenging CurtinFaces and IIIT-D RGB-D benchmark databases, achieving classification accuracies over 98.2% and 99.3% respectively. The proposed attention mechanism is also compared with other attention mechanisms, demonstrating more accurate results.

Hybrid Network for End-To-End Text-Independent Speaker Identification

Wajdi Ghezaiel, Luc Brun, Olivier Lezoray

Auto-TLDR; Text-Independent Speaker Identification with Scattering Wavelet Network and Convolutional Neural Networks

Deep learning has recently improved the performance of Speaker Identification (SI) systems. Promising results have been obtained with Convolutional Neural Networks (CNNs). This success is mostly driven by the advent of large datasets. However, in the context of commercial applications, collecting a large amount of training data is not always possible. In addition, the robustness of an SI system is adversely affected by short utterances. SI with only a few short utterances is a challenging problem. Therefore, in this paper, we propose a novel text-independent speaker identification system. The proposed system can identify speakers by learning from only a few short training utterances. To achieve this, we combine a CNN with a Scattering Wavelet Network. We propose a two-stage feature extraction framework using a two-layer wavelet scattering network coupled with a CNN for the SI system. The proposed architecture takes variable-length speech segments. To evaluate the effectiveness of the proposed approach, the Timit and Librispeech datasets are used in the experiments. These experiments show that our hybrid architecture performs successfully for SI, even with a small number and short duration of training samples. In comparison with related methods, the obtained results show that the hybrid architecture achieves better performance.

Deep Gait Relative Attribute Using a Signed Quadratic Contrastive Loss

Yuta Hayashi, Shehata Allam, Yasushi Makihara, Daigo Muramatsu, Yasushi Yagi

Auto-TLDR; Signal-Contrastive Loss for Gait Attributes Estimation

This paper presents a deep learning-based method to estimate gait attributes (e.g., stately, cool, relax, etc.). Similarly to the existing studies on relative attributes, human perception-based annotations on the gait attributes are given to pairs of gait videos (i.e., the first one is better, tie, or the second one is better), and the relative annotations are utilized to train a ranking model of the gait attribute. More specifically, we design a Siamese (i.e., two-stream) network which takes a pair of gait inputs and outputs a gait attribute score for each. We then introduce a suitable loss function, called a signed contrastive loss, to train the network parameters with the relative annotations. Unlike existing learning-to-rank loss functions, which do not inherit a nice property of a quadratic contrastive loss, the proposed signed quadratic contrastive loss function does inherit this property. The quantitative evaluation results reveal that the proposed method achieves better or comparable accuracy in relative attribute prediction compared with the baseline methods.

SL-DML: Signal Level Deep Metric Learning for Multimodal One-Shot Action Recognition

Raphael Memmesheimer, Nick Theisen, Dietrich Paulus

Auto-TLDR; One-Shot Action Recognition using Metric Learning

Recognizing an activity with a single reference sample using metric learning approaches is a promising research field. The majority of few-shot methods focus on object recognition or face-identification. We propose a metric learning approach to reduce the action recognition problem to a nearest neighbor search in embedding space. We encode signals into images and extract features using a deep residual CNN. Using triplet loss, we learn a feature embedding. The resulting encoder transforms features into an embedding space in which closer distances encode similar actions while higher distances encode different actions. Our approach is based on a signal level formulation and remains flexible across a variety of modalities. It further outperforms the baseline on the large scale NTU RGB+D 120 dataset for the One-Shot action recognition protocol by \ntuoneshotimpro%. With just 60% of the training data, our approach still outperforms the baseline approach by \ntuoneshotimproreduced%. With 40% of the training data, our approach performs comparably well as the second follow up. Further, we show that our approach generalizes well in experiments on the UTD-MHAD dataset for inertial, skeleton and fused data and the Simitate dataset for motion capturing data. Furthermore, our inter-joint and inter-sensor experiments suggest good capabilities on previously unseen setups.
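The two ingredients named in this abstract, a triplet loss and nearest-neighbor search in embedding space, can be sketched in a few lines. This is a minimal illustration only, not the authors' implementation: the function names, the margin value, and the toy 2-D embeddings are all hypothetical, and the encoder that would produce the embeddings is omitted.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The margin of 0.2 is an assumed default, not a value from the paper.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def one_shot_classify(query, references):
    """Return the label whose single reference embedding is closest to the query.

    `references` maps each action label to exactly one reference embedding.
    """
    return min(references, key=lambda label: euclidean(query, references[label]))


# Toy 2-D embeddings (hypothetical values, one reference sample per class).
references = {"wave": [0.0, 1.0], "jump": [1.0, 0.0]}
query = [0.1, 0.8]
predicted = one_shot_classify(query, references)
```

With one reference per class, classification reduces to picking the nearest reference, which is why the quality of the learned embedding carries the whole method.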

Dual Loss for Manga Character Recognition with Imbalanced Training Data

Yonggang Li, Yafeng Zhou, Yongtao Wang, Xiaoran Qin, Zhi Tang

Auto-TLDR; Dual Adaptive Re-weighting Loss for Manga Character Recognition

Manga character recognition is a key technology for manga character retrieval and verification. This task is very challenging since manga character images have a long-tailed distribution and large quality variations. Training models with cross-entropy softmax loss on such imbalanced data would introduce biases in the feature and class weight norms. To handle this problem, we propose a novel dual loss which is the sum of two losses: dual ring loss and dual adaptive re-weighting loss. Dual ring loss combines weight and feature soft normalization and serves as a regularization term to softmax loss. Dual adaptive re-weighting loss re-weights softmax loss according to the norm of both feature and class weight. With the proposed losses, we have achieved encouraging results on the Manga109 benchmark. Specifically, compared with the baseline softmax loss, our method improves the character retrieval mAP from 35.72% to 38.88% and the character verification accuracy from 87.00% to 88.50%.

Self and Channel Attention Network for Person Re-Identification

Asad Munir, Niki Martinel, Christian Micheloni

Auto-TLDR; SCAN: Self and Channel Attention Network for Person Re-identification

Recent research has shown promising results for person re-identification by focusing on several trends. One is designing efficient metric learning loss functions such as triplet loss family to learn the most discriminative representations. The other is learning local features by designing part based architectures to form an informative descriptor from semantically coherent parts. Some efforts adjust distant outliers to their most similar positions by using soft attention and learn the relationship between distant similar features. However, only a few prior efforts focus on channel-wise dependencies and learn non-local sharp similar part features directly for the degraded data in the person re-identification task. In this paper, we propose a novel Self and Channel Attention Network (SCAN) to model long-range dependencies between channels and feature maps. We add multiple classifiers to learn discriminative global features by using classification loss. Self Attention (SA) module and Channel Attention (CA) module are introduced to model non-local and channel-wise dependencies in the learned features. Spectral normalization is applied to the whole network to stabilize the training process. Experimental results on the person re-identification benchmarks show the proposed components achieve significant improvement with respect to the baseline.

Pose-Based Body Language Recognition for Emotion and Psychiatric Symptom Interpretation

Zhengyuan Yang, Amanda Kay, Yuncheng Li, Wendi Cross, Jiebo Luo

Auto-TLDR; Body Language Based Emotion Recognition for Psychiatric Symptoms Prediction

Inspired by the human ability to infer emotions from body language, we propose an automated framework for body language based emotion recognition starting from regular RGB videos. In collaboration with psychologists, we further extend the framework for psychiatric symptom prediction. Because a specific application domain of the proposed framework may only supply a limited amount of data, the framework is designed to work on a small training set and possess a good transferability. The proposed system in the first stage generates sequences of body language predictions based on human poses estimated from input videos. In the second stage, the predicted sequences are fed into a temporal network for emotion interpretation and psychiatric symptom prediction. We first validate the accuracy and transferability of the proposed body language recognition method on several public action recognition datasets. We then evaluate the framework on a proposed URMC dataset, which consists of conversations between a standardized patient and a behavioral health professional, along with expert annotations of body language, emotions, and potential psychiatric symptoms. The proposed framework outperforms other methods on the URMC dataset.

Unsupervised Co-Segmentation for Athlete Movements and Live Commentaries Using Crossmodal Temporal Proximity

Yasunori Ohishi, Yuki Tanaka, Kunio Kashino

Auto-TLDR; A guided attention scheme for audio-visual co-segmentation

Audio-visual co-segmentation is the task of extracting segments and regions corresponding to specific events from unlabelled audio and video signals. It is particularly important to accomplish it in an unsupervised way, since it is generally very difficult to manually label all the objects and events appearing in audio-visual signals for supervised learning. Here, we propose to take advantage of the temporal proximity of corresponding audio and video entities included in the signals. For this purpose, we apply a guided attention scheme to this task to efficiently detect and utilize temporal co-occurrences of audio and video information. Experiments using a real TV broadcast of Sumo wrestling, a sport event, with live commentaries show that our model can automatically extract specific athlete movements and their spoken descriptions in an unsupervised manner.

Progressive Unsupervised Domain Adaptation for Image-Based Person Re-Identification

Mingliang Yang, Da Huang, Jing Zhao

Auto-TLDR; Progressive Unsupervised Domain Adaptation for Person Re-Identification

Unsupervised domain adaptation (UDA) has emerged as an effective paradigm for reducing the huge manual annotation cost for Person Re-Identification (Re-ID). Many of the recent UDA methods for Re-ID are clustering-based and select all the pseudo-label samples in each iteration for model training. However, many wrongly labeled samples will mislead the model optimization under this circumstance. To solve this problem, we propose a Progressive Unsupervised Domain Adaptation (PUDA) framework for image-based Person Re-ID to reduce the negative effect of wrong pseudo-label samples on the model training process. Specifically, we first pretrain a CNN model on a labeled source dataset, then finetune the model on the unlabeled target dataset with the following three steps iteratively: 1) estimating pseudo-labels for all the images in the target dataset with the model trained in the last iteration; 2) extending the training set by adding pseudo-label samples with higher label confidence; 3) updating the CNN model with the expanded training set in a supervised manner. During the iteration process, the number of pseudo-label samples added increases progressively. In particular, a Moderate Initial Selections (MIS) strategy for pseudo-label sampling is also proposed to reduce the negative impacts on the model of random noise features in the early iterations and mislabeled samples in the late iterations. The proposed framework with the MIS strategy is validated on the Duke-to-Market and Market-to-Duke unsupervised domain adaptation tasks and achieves improvements of 4.2 points (absolute, i.e., 80.0% vs. 75.8%) and 1.7 points (absolute, i.e., 70.7% vs. 69.0%) in mAP, respectively.
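Step 2 of the loop described above (keeping only the more confident pseudo-labels, and keeping more of them as iterations proceed) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the linear schedule, the starting fraction, the helper names, and the toy confidence values are hypothetical and do not reproduce the paper's MIS strategy.

```python
def select_confident(pseudo_labeled, fraction):
    """Keep the top `fraction` of samples ranked by pseudo-label confidence.

    `pseudo_labeled` is a list of (sample_id, label, confidence) tuples.
    At least one sample is always kept.
    """
    ranked = sorted(pseudo_labeled, key=lambda item: item[2], reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


def progressive_schedule(num_iterations, start=0.2, end=1.0):
    """Linearly growing selection fraction (an assumed schedule, not the paper's)."""
    if num_iterations == 1:
        return [end]
    step = (end - start) / (num_iterations - 1)
    return [start + i * step for i in range(num_iterations)]


# Toy pseudo-labels (hypothetical ids, labels, and confidences).
samples = [("a", 0, 0.9), ("b", 1, 0.4), ("c", 0, 0.7), ("d", 1, 0.95), ("e", 0, 0.1)]

# Number of samples admitted to the training set at each of three iterations.
sizes = [len(select_confident(samples, f)) for f in progressive_schedule(3)]
```

In a full pipeline, each iteration would re-estimate the pseudo-labels with the updated model before selecting again; the sketch holds them fixed to keep the selection logic visible.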

Ballroom Dance Recognition from Audio Recordings

Tomas Pavlin, Jan Cech, Jiri Matas

Auto-TLDR; A CNN-based approach to classify ballroom dances given audio recordings

We propose a CNN-based approach to classify ten genres of ballroom dances given audio recordings, five latin and five standard, namely Cha Cha Cha, Jive, Paso Doble, Rumba, Samba, Quickstep, Slow Foxtrot, Slow Waltz, Tango and Viennese Waltz. We compute a spectrogram of the audio signal and treat it as an image that is the input of the CNN. The classification is performed independently on 5-second spectrogram segments in a sliding-window fashion, and the results are then aggregated. The method was tested on the following datasets: the publicly available Extended Ballroom dataset collected by Marchand and Peeters, 2016, and two YouTube datasets collected by us, one in studio quality and the other, more challenging, recorded on mobile phones. The method achieved accuracies of 93.9%, 96.7% and 89.8%, respectively. The method runs in real time. We implemented a web application to demonstrate the proposed method.
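The segment-then-aggregate scheme can be sketched as follows. The abstract does not say how the per-segment results are aggregated, so averaging class probabilities is an assumption here, as are the function names, the hop size, and the toy probabilities; the CNN that would score each segment is left out.

```python
def sliding_windows(num_frames, window, hop):
    """Return (start, end) frame indices of fixed-length windows over a signal."""
    return [(s, s + window) for s in range(0, num_frames - window + 1, hop)]


def aggregate(segment_probs):
    """Average per-segment class probabilities and return the best class index.

    Mean pooling is an assumed aggregation rule; majority voting is another option.
    """
    num_classes = len(segment_probs[0])
    mean = [
        sum(probs[c] for probs in segment_probs) / len(segment_probs)
        for c in range(num_classes)
    ]
    return max(range(num_classes), key=lambda c: mean[c])


# Ten spectrogram frames, windows of five frames, hop of two frames (toy sizes).
windows = sliding_windows(num_frames=10, window=5, hop=2)

# Hypothetical per-segment outputs of a two-class classifier, one row per window.
segment_probs = [[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]]
predicted_class = aggregate(segment_probs)
```

Averaging lets a few uncertain segments be outvoted by confident ones, which is one plausible reason to classify short windows independently before deciding on the whole recording.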

SoftmaxOut Transformation-Permutation Network for Facial Template Protection

Hakyoung Lee, Cheng Yaw Low, Andrew Teoh

Auto-TLDR; SoftmaxOut Transformation-Permutation Network for Cancellable Biometrics

In this paper, we propose a data-driven cancellable biometrics scheme, referred to as the SoftmaxOut Transformation-Permutation Network (SOTPN). The SOTPN is a neural version of the Random Permutation Maxout (RPM) transform, which was introduced for facial template protection. We present a specialized SoftmaxOut layer that integrates permutable MaxOut units with a parameterized softmax function to approximate the non-differentiable permutation and winner-takes-all operations in the RPM transform. On top of that, a novel pairwise ArcFace loss and a code balancing loss are formulated to ensure that the SOTPN-transformed facial template is cancellable, discriminative, high-entropy and free from quantization errors when coupled with the SoftmaxOut layer. The proposed SOTPN is evaluated on three face datasets, namely LFW, YouTube Face and Facescrub, and our experimental results show that the SOTPN significantly outperforms the RPM transform.
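To see why a parameterized softmax can stand in for a winner-takes-all step, consider the generic temperature-scaled relaxation of argmax below. This is a textbook construction, not the SoftmaxOut layer as defined in the paper; the scale parameter `beta` and the function names are assumptions for illustration.

```python
import math


def softmax(values, beta=1.0):
    """Softmax with scale `beta`; larger `beta` concentrates mass on the maximum."""
    m = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(beta * (v - m)) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def soft_argmax(values, beta=1.0):
    """Differentiable surrogate for argmax: the softmax-weighted average index."""
    return sum(i * w for i, w in enumerate(softmax(values, beta)))


scores = [0.2, 1.5, 0.9, 0.1]
hard_index = max(range(len(scores)), key=lambda i: scores[i])  # winner-takes-all
soft_low = soft_argmax(scores, beta=1.0)    # smooth, far from the hard index
soft_high = soft_argmax(scores, beta=50.0)  # nearly identical to the hard index
```

As `beta` grows, the soft index approaches the hard winner while remaining differentiable with respect to the scores, which is the property that makes such a relaxation trainable by gradient descent.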

Talking Face Generation Via Learning Semantic and Temporal Synchronous Landmarks

Aihua Zheng, Feixia Zhu, Hao Zhu, Mandi Luo, Ran He

Auto-TLDR; A semantic and temporal synchronous landmark learning method for talking face generation

Given a speech clip and a facial image, the goal of talking face generation is to synthesize a talking face video with accurate mouth synchronization and natural face motion. Recent progress has proven the effectiveness of landmarks as intermediate information during talking face generation. However, the large gap between audio and visual modalities makes the prediction of landmarks challenging and limits generation ability. This paper proposes a semantic and temporal synchronous landmark learning method for talking face generation. First, we introduce a word detector to enforce richer semantic information. Then, we preserve the temporal synchronization and consistency between landmarks and audio via the proposed temporal residual loss. Lastly, we employ a U-Net generation network with adaptive reconstruction loss to generate facial images for the predicted landmarks. Experimental results on two benchmark datasets, LRW and GRID, demonstrate the effectiveness of our model compared to state-of-the-art methods for talking face generation.

Lightweight Low-Resolution Face Recognition for Surveillance Applications

Yoanna Martínez-Díaz, Heydi Mendez-Vazquez, Luis S. Luevano, Leonardo Chang, Miguel Gonzalez-Mendoza

Auto-TLDR; Efficiency of Lightweight Deep Face Networks on Low-Resolution Surveillance Imagery

Deploying face recognition models in unconstrained surveillance scenarios typically requires identifying low-resolution faces at extremely low computational cost. In recent years, several methods based on complex deep learning models have been proposed with promising recognition results but at a high computational cost. Inspired by the compactness and computational efficiency of lightweight deep face networks and their high accuracy on general face recognition tasks, in this work we benchmark two recently introduced lightweight face models on low-resolution surveillance imagery to enable efficient system deployment. We conduct a comprehensive evaluation on the two typical settings: LR-to-HR and LR-to-LR matching. In addition, we investigate the effect of using models trained with synthetic data down-sampled from high-resolution images, as well as the combination of different models, for face recognition on real low-resolution images. Experimental results show that the lightweight face models used achieve state-of-the-art results on low-resolution benchmarks with low memory footprint and computational complexity. Moreover, we observed that combining models trained with different degradations improves the recognition accuracy on low-resolution surveillance imagery, which is feasible due to their low computational cost.

Deep Top-Rank Counter Metric for Person Re-Identification

Chen Chen, Hao Dou, Xiyuan Hu, Silong Peng

Auto-TLDR; Deep Top-Rank Counter Metric for Person Re-identification

In the research field of person re-identification, deep metric learning, which guides efficient and effective embedding learning, serves as one of the most fundamental tasks. Recent loss-function-based deep metric learning methods mainly focus on optimizing top-rank accuracy by minimizing the distance difference between the correctly matched sample pair and the wrongly matched sample pair. However, it is more straightforward to count the occurrences of correct top-rank candidates and maximize the count for better top-rank accuracy. In this paper, we propose a generalized logistic function based metric that is practical for deep learning, namely the "deep top-rank counter metric", to approximately optimize the counted occurrences of correct top-rank matches. The properties that qualify the proposed metric as a well-suited deep re-identification metric are discussed, and a progressive hard sample mining strategy is also introduced for effective training and performance boosting. Extensive experiments show that the proposed top-rank counter metric outperforms other loss-function-based deep metrics and achieves state-of-the-art accuracy.
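The idea of replacing a hard count of correct top-rank matches with a smooth surrogate can be sketched with a standard logistic function. The paper uses a generalized logistic function whose exact form is not given in the abstract, so the plain sigmoid, the `steepness` parameter and the per-query distance pairs below are assumptions for illustration.

```python
import math


def hard_count(pos_dists, neg_dists):
    """Count queries whose positive match is closer than the competing negative."""
    return sum(1 for dp, dn in zip(pos_dists, neg_dists) if dp < dn)


def soft_count(pos_dists, neg_dists, steepness=10.0):
    """Smooth surrogate: each query contributes sigmoid(steepness * (d_neg - d_pos))."""
    return sum(
        1.0 / (1.0 + math.exp(-steepness * (dn - dp)))
        for dp, dn in zip(pos_dists, neg_dists)
    )


# Hypothetical distances for three queries: (query, positive) and (query, nearest negative).
pos = [0.2, 0.9, 0.4]
neg = [0.8, 0.5, 1.0]

exact = hard_count(pos, neg)
smooth = soft_count(pos, neg, steepness=10.0)
sharp = soft_count(pos, neg, steepness=100.0)
```

The hard count has zero gradient almost everywhere, whereas the smooth count can be maximized by gradient ascent; raising `steepness` trades gradient signal for a closer match to the exact count.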

Toward Text-Independent Cross-Lingual Speaker Recognition Using English-Mandarin-Taiwanese Dataset

Yi-Chieh Wu, Wen-Hung Liao

Auto-TLDR; Cross-lingual Speech for Biometric Recognition

Over 40% of the world's population is bilingual. Existing speaker identification/verification systems, however, assume the same language for both the enrollment and recognition stages. In this work, we investigate the feasibility of employing multilingual speech for biometric applications. We establish a dataset containing audio recorded in English, Mandarin and Taiwanese. Three acoustic features, namely i-vector, d-vector and x-vector, are evaluated for both speaker verification (SV) and speaker identification (SI) tasks. Preliminary experimental results indicate that x-vector achieves the best overall performance. Additionally, the model trained with hybrid data demonstrates the highest accuracy, at the cost of additional data collection effort. In SI tasks, we obtained over 91% cross-lingual accuracy across all models using 3-second audio. In SV tasks, the EER in cross-lingual tests is at most 6.52%, which is observed on the model trained on the English corpus. The outcome suggests the feasibility of adopting cross-lingual speech in building text-independent speaker recognition systems.
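The equal error rate (EER) reported for the SV tasks is the operating point at which the false acceptance rate equals the false rejection rate. A minimal threshold-sweep approximation is sketched below; it is not the evaluation code used in the paper, and the scores and the function name `equal_error_rate` are hypothetical.

```python
def equal_error_rate(genuine, impostor):
    """Approximate the EER by sweeping thresholds over all observed scores.

    A trial is accepted when its score is >= the threshold.
    """
    best_gap, eer = None, None
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)  # false acceptance rate
        frr = sum(s < t for s in genuine) / len(genuine)     # false rejection rate
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Hypothetical similarity scores for same-speaker and different-speaker trials.
genuine_scores = [0.9, 0.8, 0.7, 0.3]
impostor_scores = [0.6, 0.4, 0.2, 0.1]
eer = equal_error_rate(genuine_scores, impostor_scores)
```

With few trials the two error curves may not cross exactly at an observed score, so the function returns the average of the two rates at the threshold where they are closest.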

More Correlations Better Performance: Fully Associative Networks for Multi-Label Image Classification

Yaning Li, Liu Yang

Auto-TLDR; Fully Associative Network for Fully Exploiting Correlation Information in Multi-Label Classification

Recent research demonstrates that correlation modeling plays a key role in high-performance multi-label classification methods. However, existing methods do not take full advantage of correlation information, especially correlations in the feature and label spaces of each image, which limits the performance of correlation-based multi-label classification methods. In this study, a Fully Associative Network (FAN) is proposed to fully exploit correlation information, covering both visual feature and label correlations. Specifically, FAN introduces a robust covariance pooling to summarize convolution features as a global image representation that captures feature correlation in the multi-label task. Moreover, it constructs an effective label correlation matrix based on a re-weighted scheme, which is fed into a graph convolution network (GCN) to capture label correlation. Then, the correlation between the covariance representations (i.e., feature correlation) and the outputs of the GCN (i.e., label correlation) is modeled for final prediction. Experimental results on two datasets illustrate the effectiveness and efficiency of our proposed FAN compared with state-of-the-art methods.
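Covariance pooling summarizes a convolutional feature map by the covariance between its channels, computed over spatial positions. The sketch below shows only the plain sample covariance; the "robust" variant used by FAN is not described in the abstract, and the input layout and the function name `covariance_pooling` are assumptions for illustration.

```python
def covariance_pooling(features):
    """Pool N spatial feature vectors (each with C channels) into a C x C covariance matrix."""
    n = len(features)
    c = len(features[0])
    means = [sum(f[j] for f in features) / n for j in range(c)]
    centered = [[f[j] - means[j] for j in range(c)] for f in features]
    # Entry (a, b) is the covariance between channels a and b across positions.
    return [
        [sum(row[a] * row[b] for row in centered) / n for b in range(c)]
        for a in range(c)
    ]


# Hypothetical feature map with three spatial positions and two channels.
feature_map = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
cov = covariance_pooling(feature_map)
```

Unlike average pooling, which keeps only the per-channel means, the resulting matrix retains second-order statistics, so the off-diagonal entries encode how channels vary together.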