Progressive Learning Algorithm for Efficient Person Re-Identification

Zhen Li, Hanyang Shao, Liang Niu, Nian Xue

Responsive image

Auto-TLDR; Progressive Learning Algorithm for Large-Scale Person Re-Identification

Slides Poster

This paper studies the problem of Person Re-Identification (ReID) for large-scale applications. Recent research efforts have been devoted to building complicated part models, which introduce considerably high computational cost and memory consumption, inhibiting its practicability in large-scale applications. This paper aims to develop a novel learning strategy to find efficient feature embeddings while maintaining the balance of accuracy and model complexity. More specifically, we find by enhancing the classical triplet loss together with cross-entropy loss, our method can explore the hard examples and build a discriminant feature embedding yet compact enough for large-scale applications. Our method is carried out progressively using Bayesian optimization, and we call it the Progressive Learning Algorithm (PLA). Extensive experiments on three large-scale datasets show that our PLA is comparable or better than the state-of-the-arts. Especially, on the challenging Market-1501 dataset, we achieve Rank-1=94.7\%/mAP=89.4\% while saving at least 30\% parameters than strong part models.

Similar papers

Deep Top-Rank Counter Metric for Person Re-Identification

Chen Chen, Hao Dou, Xiyuan Hu, Silong Peng

Responsive image

Auto-TLDR; Deep Top-Rank Counter Metric for Person Re-identification

Slides Poster Similar

In the research field of person re-identification, deep metric learning that guides the efficient and effective embedding learning serves as one of the most fundamental tasks. Recent efforts of the loss function based deep metric learning methods mainly focus on the top rank accuracy optimization by minimiz- ing the distance difference between the correctly matching sample pair and wrongly matched sample pair. However, it is more straightforward to count the occurrences of correct top-rank candidates and maximize the counting results for better top rank accuracy. In this paper, we propose a generalized logistic function based metric with effective practicalness in deep learning, namely the“deep top-rank counter metric”, to approximately optimize the counted occurrences of the correct top-rank matches. The properties that qualify the proposed metric as a well-suited deep re-identification metric have been discussed and a progressive hard sample mining strategy is also introduced for effective training and performance boosting. The extensive experiments show that the proposed top-rank counter metric outperforms other loss function based deep metrics and achieves the state-of- the-art accuracies.

Adaptive L2 Regularization in Person Re-Identification

Xingyang Ni, Liang Fang, Heikki Juhani Huttunen

Responsive image

Auto-TLDR; AdaptiveReID: Adaptive L2 Regularization for Person Re-identification

Slides Poster Similar

We introduce an adaptive L2 regularization mechanism termed AdaptiveReID, in the setting of person re-identification. In the literature, it is common practice to utilize hand-picked regularization factors which remain constant throughout the training procedure. Unlike existing approaches, the regularization factors in our proposed method are updated adaptively through backpropagation. This is achieved by incorporating trainable scalar variables as the regularization factors, which are further fed into a scaled hard sigmoid function. Extensive experiments on the Market-1501, DukeMTMC-reID and MSMT17 datasets validate the effectiveness of our framework. Most notably, we obtain state-of-the-art performance on MSMT17, which is the largest dataset for person re-identification. Source code will be published at https://github.com/nixingyang/AdaptiveReID.

Rethinking ReID:Multi-Feature Fusion Person Re-Identification Based on Orientation Constraints

Mingjing Ai, Guozhi Shan, Bo Liu, Tianyang Liu

Responsive image

Auto-TLDR; Person Re-identification with Orientation Constrained Network

Slides Poster Similar

Person re-identification (ReID) aims to identify the specific pedestrian in a series of images or videos. Recently, ReID is receiving more and more attention in the fields of computer vision research and application like intelligent security. One major issue downgrading the ReID model performance lies in that various subjects in the same body orientations look too similar to distinguish by the model, while the same subject viewed in different orientations looks rather different. However, most of the current studies do not particularly differentiate pedestrians in orientation when designing the network, so we rethink this problem particularly from the perspective of person orientation and propose a new network structure by including two branches: one handling samples with the same body orientations and the other handling samples with different body orientations. Correspondingly, we also propose an orientation classifier that can accurately distinguish the orientation of each person. At the same time, the three-part loss functions are introduced for orientation constraint and combined to optimize the network simultaneously. Also, we use global and local features int the training stage in order to make use of multi-level information. Therefore, our network can derive its efficacy from orientation constraints and multiple features. Experiments show that our method not only has competitive performance on multiple datasets, but also can let retrieval results aligned with the orientation of the query sample rank higher, which may have great potential in the practical applications.

Self and Channel Attention Network for Person Re-Identification

Asad Munir, Niki Martinel, Christian Micheloni

Responsive image

Auto-TLDR; SCAN: Self and Channel Attention Network for Person Re-identification

Slides Poster Similar

Recent research has shown promising results for person re-identification by focusing on several trends. One is designing efficient metric learning loss functions such as triplet loss family to learn the most discriminative representations. The other is learning local features by designing part based architectures to form an informative descriptor from semantically coherent parts. Some efforts adjust distant outliers to their most similar positions by using soft attention and learn the relationship between distant similar features. However, only a few prior efforts focus on channel-wise dependencies and learn non-local sharp similar part features directly for the degraded data in the person re-identification task. In this paper, we propose a novel Self and Channel Attention Network (SCAN) to model long-range dependencies between channels and feature maps. We add multiple classifiers to learn discriminative global features by using classification loss. Self Attention (SA) module and Channel Attention (CA) module are introduced to model non-local and channel-wise dependencies in the learned features. Spectral normalization is applied to the whole network to stabilize the training process. Experimental results on the person re-identification benchmarks show the proposed components achieve significant improvement with respect to the baseline.

Top-DB-Net: Top DropBlock for Activation Enhancement in Person Re-Identification

Rodolfo Quispe, Helio Pedrini

Responsive image

Auto-TLDR; Top-DB-Net for Person Re-Identification using Top DropBlock

Slides Poster Similar

Person Re-Identification is a challenging task that aims to retrieve all instances of a query image across a system of non-overlapping cameras. Due to the various extreme changes of view, it is common that local regions that could be used to match people are suppressed, which leads to a scenario where approaches have to evaluate the similarity of images based on less informative regions. In this work, we introduce the Top-DB-Net, a method based on Top DropBlock that pushes the network to learn to focus on the scene foreground, with special emphasis on the most task-relevant regions and, at the same time, encodes low informative regions to provide high discriminability. The Top-DB-Net is composed of three streams: (i) a global stream encodes rich image information from a backbone, (ii) the Top DropBlock stream encourages the backbone to encode low informative regions with high discriminative features, and (iii) a regularization stream helps to deal with the noise created by the dropping process of the second stream, when testing the first two streams are used. Vast experiments on three challenging datasets show the capabilities of our approach against state-of-the-art methods. Qualitative results demonstrate that our method exhibits better activation maps focusing on reliable parts of the input images.

Building Computationally Efficient and Well-Generalizing Person Re-Identification Models with Metric Learning

Vladislav Sovrasov, Dmitry Sidnev

Responsive image

Auto-TLDR; Cross-Domain Generalization in Person Re-identification using Omni-Scale Network

Slides Similar

This work considers the problem of domain shift in person re-identification.Being trained on one dataset, a re-identification model usually performs much worse on unseen data. Partially this gap is caused by the relatively small scale of person re-identification datasets (compared to face recognition ones, for instance), but it is also related to training objectives. We propose to use the metric learning objective, namely AM-Softmax loss, and some additional training practices to build well-generalizing, yet, computationally efficient models. We use recently proposed Omni-Scale Network (OSNet) architecture combined with several training tricks and architecture adjustments to obtain state-of-the art results in cross-domain generalization problem on a large-scale MSMT17 dataset in three setups: MSMT17-all->DukeMTMC, MSMT17-train->Market1501 and MSMT17-all->Market1501.

Progressive Unsupervised Domain Adaptation for Image-Based Person Re-Identification

Mingliang Yang, Da Huang, Jing Zhao

Responsive image

Auto-TLDR; Progressive Unsupervised Domain Adaptation for Person Re-Identification

Slides Poster Similar

Unsupervised domain adaptation (UDA) has emerged as an effective paradigm for reducing the huge manual annotation cost for Person Re-Identification (Re-ID). Many of the recent UDA methods for Re-ID are clustering-based and select all the pseudo-label samples in each iteration for the model training. However, there are many wrong labeled samples that will mislead the model optimization under this circumstance. To solve this problem, we propose a Progressive Unsupervised Domain Adaptation (PUDA) framework for image-based Person Re-ID to reduce the negative effect of wrong pseudo-label samples on the model training process. Specifically, we first pretrain a CNN model on a labeled source dataset, then finetune the model on unlabeled target dataset with the following three steps iteratively: 1) estimating pseudo-labels for all the images in the target dataset with the model trained in the last iteration; 2) extending the training set by adding pseudo-label samples with higher label confidence; 3) updating the CNN model with the expanded training set in a supervised manner. During the iteration process, the number of pseudo-label samples added increased progressively. In particular, a Moderate Initial Selections (MIS) strategy for pseudo-label sampling is also proposed to reduce the negative impacts of random noise features in the early iterations and mislabeled samples in the late iterations on the model. The proposed framework with MIS strategy is validated on the Duke-to-Market, Market-to-Duke unsupervised domain adaptation tasks and achieves improvements of 4.2 points (absolute, i.e., 80.0% vs. 75.8%) and 1.7 points (absolute, i.e., 70.7% vs. 69.0%) in mAP correspondingly.

Attentive Part-Aware Networks for Partial Person Re-Identification

Lijuan Huo, Chunfeng Song, Zhengyi Liu, Zhaoxiang Zhang

Responsive image

Auto-TLDR; Part-Aware Learning for Partial Person Re-identification

Slides Poster Similar

Partial person re-identification (re-ID) refers to re-identify a person through occluded images. It suffers from two major challenges, i.e., insufficient training data and incomplete probe image. In this paper, we introduce an automatic data augmentation module and a part-aware learning method for partial re-identification. On the one hand, we adopt the data augmentation to enhance the training data and help learns more stabler partial features. On the other hand, we intuitively find that the partial person images usually have fixed percentages of parts, therefore, in partial person re-id task, the probe image could be cropped from the pictures and divided into several different partial types following fixed ratios. Based on the cropped images, we propose the Cropping Type Consistency (CTC) loss to classify the cropping types of partial images. Moreover, in order to help the network better fit the generated and cropped data, we incorporate the Block Attention Mechanism (BAM) into the framework for attentive learning. To enhance the retrieval performance in the inference stage, we implement cropping on gallery images according to the predicted types of probe partial images. Through calculating feature distances between the partial image and the cropped holistic gallery images, we can recognize the right person from the gallery. To validate the effectiveness of our approach, we conduct extensive experiments on the partial re-ID benchmarks and achieve state-of-the-art performance.

Pose Variation Adaptation for Person Re-Identification

Lei Zhang, Na Jiang, Qishuai Diao, Yue Xu, Zhong Zhou, Wei Wu

Responsive image

Auto-TLDR; Pose Transfer Generative Adversarial Network for Person Re-identification

Slides Poster Similar

Person re-identification (reid) plays an important role in surveillance video analysis, especially for criminal investigation and intelligent security. Although a large number of effective feature or distance metric learning approaches have been proposed, it still suffers from pedestrians appearance variations caused by pose changing. Most of the previous methods address this problem by learning a pose-invariant descriptor subspace. In this paper, we propose a pose variation adaptation method for person reid in the view of data augmentation. It can reduce the probability of deep learning network over-fitting. Specifically, we introduce a pose transfer generative adversarial network with a similarity measurement constraint. With the learned pose transfer model, training images can be pose-transferred to any given poses, and along with the original images, form a augmented training dataset. It increases data diversity against over-fitting. In contrast to previous GAN-based methods, we consider the influence of pose variations on similarity measure to generate more realistic and shaper samples for person reid. Besides, we optimize hard example mining to introduce a novel manner of samples (pose-transferred images) used with the learned pose transfer model. It focuses on the inferior samples which are caused by pose variations to increase the number of effective hard examples for learning discriminative features and improve the generalization ability. We extensively conduct comparative evaluations to demonstrate the advantages and superiority of our proposed method over the state-of-the-art approaches on Market-1501 and DukeMTMC-reID, the rank-1 accuracy is 96.1% for Market-1501 and 92.0% for DukeMTMC-reID.

Multi-Level Deep Learning Vehicle Re-Identification Using Ranked-Based Loss Functions

Eleni Kamenou, Jesus Martinez-Del-Rincon, Paul Miller, Patricia Devlin - Hill

Responsive image

Auto-TLDR; Multi-Level Re-identification Network for Vehicle Re-Identification

Slides Poster Similar

Identifying vehicles across a network of cameras with non-overlapping fields of view remains a challenging research problem due to scene occlusions, significant inter-class similarity and intra-class variability. In this paper, we propose an end-to-end multi-level re-identification network that is capable of successfully projecting same identity vehicles closer to one another in the embedding space, compared to vehicles of different identities. Robust feature representations are obtained by combining features at multiple levels of the network. As for the learning process, we employ a recent state-of-the-art structured metric learning loss function previously applied to other retrieval problems and adjust it to the vehicle re-identification task. Furthermore, we explore the cases of image-to-image, image-to-video and video-to-video similarity metric. Finally, we evaluate our system and achieve great performance on two large-scale publicly available datasets, CityFlow-ReID and VeRi-776. Compared to most existing state-of-art approaches, our approach is simpler and more straightforward, utilizing only identity-level annotations, while avoiding post-processing the ranking results (re-ranking) at the testing phase.

Not 3D Re-ID: Simple Single Stream 2D Convolution for Robust Video Re-Identification

Toby Breckon, Aishah Alsehaim

Responsive image

Auto-TLDR; ResNet50-IBN for Video-based Person Re-Identification using Single Stream 2D Convolution Network

Slides Poster Similar

Video-based person re-identification has received increasing attention recently, as it plays an important role within the surveillance video analysis. Video-based Re-ID is an expansion of earlier image-based re-identification methods by learning features from a video via multiple image frames for each person. Most contemporary video Re-ID methods utilise complex CNN-based network architectures using 3D convolution or multi-branch networks to extract spatial-temporal features from the video. By contrast, in this paper, we will illustrate superior performance from a simple single stream 2D convolution network leveraging the ResNet50-IBN architecture to extract frame-level features followed by temporal attention for clip level features. These clip level features can be generalised to extract video level features by averaging clip level features without any additional cost. Our model, uses best video Re-ID practice and transfer learning between datasets, outperforms existing state-of-the-art approaches on MARS, PRID2011 and iLIDSVID datasets with 89:62%, 97:75%, 97:33% rank-1 accuracy respectively and with 84:61% mAP for MARS, without reliance on complex and memory intensive 3D convolutions or multistream networks architectures as found in other contemporary work. Conversely, this work shows that global features extracted by the 2D convolution network are a sufficient representation for robust state of the art video Re-ID.

Recurrent Deep Attention Network for Person Re-Identification

Changhao Wang, Jun Zhou, Xianfei Duan, Guanwen Zhang, Wei Zhou

Responsive image

Auto-TLDR; Recurrent Deep Attention Network for Person Re-identification

Slides Poster Similar

Person re-identification (re-id) is an important task in video surveillance. It is challenging due to the appearance of person varying a wide range acrossnon-overlapping camera views. Recent years, attention-based models are introduced to learn discriminative representation. In this paper, we consider the attention selection in a natural way as like human moving attention on different parts of the visual field for person re-id. In concrete, we propose a Recurrent Deep Attention Network (RDAN) with an attention selection mechanism based on reinforcement learning. The RDAN aims to adaptively observe the identity-sensitive regions to build up the representation of individuals step by step. Extensive experiments on three person re-id benchmarks Market-1501, DukeMTMC-reID and CUHK03-NP demonstrate the proposed method can achieve competitive performance.

Semi-Supervised Person Re-Identification by Attribute Similarity Guidance

Peixian Hong, Ancong Wu, Wei-Shi Zheng

Responsive image

Auto-TLDR; Attribute Similarity Guidance Guidance Loss for Semi-supervised Person Re-identification

Slides Poster Similar

Although supervised person re-identification (RE-ID) has achieved great progress with deep learning, it requires time-consuming annotation of a large number of pedestrian identities. To reduce labeling cost, we attempt to reduce cross-camera identity annotations and exploit pedestrian attribute annotations as auxiliary information instead. The pedestrian attributes, such as outfit styles, contain coarse semantic knowledge. Although pedestrian attributes are annotated without exhaustive searching in a camera network, which is much easier than cross-camera identity annotation, ambiguity exists in attributes when different persons have similar outfits. To solve this problem, we propose an Attribute Similarity Guidance loss (ASG) to guide appearance feature learning for RE-ID by selective attribute similarity preservation to avoid the impact of such ambiguity. Finally, we develop an attribute-guided self training framework to jointly utilize attribute annotations, unlabeled data and limited labeled data for semi-supervised learning. Extensive experiments on Market-1501 and DukeMTMC-ReID show the superiority of our method for semi-supervised RE-ID.

Self-Paced Bottom-Up Clustering Network with Side Information for Person Re-Identification

Mingkun Li, Chun-Guang Li, Ruo-Pei Guo, Jun Guo

Responsive image

Auto-TLDR; Self-Paced Bottom-up Clustering Network with Side Information for Unsupervised Person Re-identification

Slides Poster Similar

Person re-identification (Re-ID) has attracted a lot of research attention in recent years. However, supervised methods demand an enormous amount of manually annotated data. In this paper, we propose a Self-Paced bottom-up Clustering Network with Side Information (SPCNet-SI) for unsupervised person Re-ID, where the side information comes from the serial number of the camera associated with each image. Specifically, our proposed SPCNet-SI exploits the camera side information to guide the feature learning and uses soft label in bottom-up clustering process, in which the camera association information is used in the repelled loss and the soft label based cluster information is used to select the candidate cluster pairs to merge. Moreover, a self-paced dynamic mechanism is developed to regularize the merging process such that the clustering is implemented in an easy-to-hard way with a slow-to-fast merging process. Experiments on two benchmark datasets Market-1501 and DukeMTMC-ReID demonstrate promising performance.

Multi-Scale Cascading Network with Compact Feature Learning for RGB-Infrared Person Re-Identification

Can Zhang, Hong Liu, Wei Guo, Mang Ye

Responsive image

Auto-TLDR; Multi-Scale Part-Aware Cascading for RGB-Infrared Person Re-identification

Slides Poster Similar

RGB-Infrared person re-identification (RGB-IR Re-ID) aims to matching persons from heterogeneous images captured by visible and thermal cameras, which is of great significance in surveillance system under poor light conditions. Facing great challenges in complex variances including conventional single-modality and additional inter-modality discrepancies, most of existing RGB-IR Re-ID methods directly work on global features for simultaneous elimination, whereas modality-specific noises and modality-shared features are not well considered. To address these issues, a novel Multi-Scale Part-Aware Cascading framework (MSPAC) is formulated by aggregating multi-scale fine-grained features from part to global in a cascading manner, which results in an unified representation robust to noises. Moreover, a marginal exponential center (MeCen) loss is introduced to jointly eliminate mixed variances, which enables to model cross-modality correlations on sharable salient features. Extensive experiments are conducted for demonstration that the proposed method outperforms all the state-of-the-arts by a large margin.

Domain Generalized Person Re-Identification Via Cross-Domain Episodic Learning

Ci-Siang Lin, Yuan Chia Cheng, Yu-Chiang Frank Wang

Responsive image

Auto-TLDR; Domain-Invariant Person Re-identification with Episodic Learning

Slides Poster Similar

Aiming at recognizing images of the same person across distinct camera views, person re-identification (re-ID) has been among active research topics in computer vision. Most existing re-ID works require collection of a large amount of labeled image data from the scenes of interest. When the data to be recognized are different from the source-domain training ones, a number of domain adaptation approaches have been proposed. Nevertheless, one still needs to collect labeled or unlabelled target-domain data during training. In this paper, we tackle an even more challenging and practical setting, domain generalized (DG) person re-ID. That is, while a number of labeled source-domain datasets are available, we do not have access to any target-domain training data. In order to learn domain-invariant features without knowing the target domain of interest, we present an episodic learning scheme which advances meta learning strategies to exploit the observed source-domain labeled data. The learned features would exhibit sufficient domain-invariant properties while not overfitting the source-domain data or ID labels. Our experiments on four benchmark datasets confirm the superiority of our method over the state-of-the-arts.

Attention-Based Model with Attribute Classification for Cross-Domain Person Re-Identification

Simin Xu, Lingkun Luo, Shiqiang Hu

Responsive image

Auto-TLDR; An attention-based model with attribute classification for cross-domain person re-identification

Poster Similar

Person re-identification (re-ID) which aims to recognize a pedestrian observed by non-overlapping cameras is a challenging task due to high variance between images from different viewpoints. Although remarkable progresses on research of re-ID had been obtained via leveraging the merits of deep learning framework through sufficient quantity training on a large amount of well labeled data, whereas, in real scenarios, re-ID generally suffers from lacking of well labeled training data. In this paper, we propose an attention-based model with attribute classification (AMAC) to facilitate a well trained model transferring across different data domains, which further enables an efficient cross-domain video-based person re-ID. Specifically, an attention-based sub-network is proposed for deep insight into the quality variations of local parts, hence, different local parts are cooperated with different weights to avoid the heavy occlusions or the cluttered background in datasets. Moreover, we introduce a transferred attribute classification sub-network to extract attribute-semantic features of any new target datasets without the requirement for new training attribute labels which are costly to annotate. Attribute-semantic features can be considered as valuable complementary information for person re-identification since they are robust to illumination varieties and different viewpoints across cameras. Due to the large gap between different datasets, we finetune each sub-network with pseudo labels on the target datasets respectively to strengthen the original model trained on other labeled datasets. Extensive comparable evaluations demonstrate the superiority of our AMAC in solving cross-domain person re-ID task on two benchmarks including PRID-2011 and iLIDS-VID.

Convolutional Feature Transfer via Camera-Specific Discriminative Pooling for Person Re-Identification

Tetsu Matsukawa, Einoshin Suzuki

Responsive image

Auto-TLDR; A small-scale CNN feature transfer method for person re-identification

Slides Poster Similar

Modern Convolutional Neural Networks~(CNNs) have been improving the accuracy of person re-identification (re-id) using a large number of training samples. Such a re-id system suffers from a lack of training samples for deployment to practical security applications. To address this problem, we focus on the approach that transfers CNN features pre-trained on a large-scale person re-id dataset to a small-scale dataset. Most of the ordinal CNN feature transfer methods use the features of fully connected layers that entangle locally pooled features of different spatial locations on an image. Unfortunately, due to the difference of view angles and the bias of walking directions of the persons, each camera view in a dataset has a unique spatial property in the person image, which reduces the generality of the local pooling for different cameras/datasets. To account for the camera- and dataset-specific spatial bias, we propose a method to learn camera and dataset-specific position weight maps for discriminative local pooling of convolutional features. Our experiments on four public datasets confirm the effectiveness of the proposed feature transfer with a small number of training samples in the target datasets.

Polynomial Universal Adversarial Perturbations for Person Re-Identification

Wenjie Ding, Xing Wei, Rongrong Ji, Xiaopeng Hong, Yihong Gong

Responsive image

Auto-TLDR; Polynomial Universal Adversarial Perturbation for Re-identification Methods

Slides Poster Similar

In this paper, we focus on Universal Adversarial Perturbations (UAP) attack on state-of-the-art person re-identification (Re-ID) methods. Existing UAP methods usually compute a perturbation image and add it to the images of interest. Such a simple constant form greatly limits the attack power. To address this problem, we extend the formulation of UAP to a polynomial form and propose the Polynomial Universal Adversarial Perturbation (PUAP). Unlike traditional UAP methods which only rely on the additive perturbation signal, the proposed PUAP consists of both an additive perturbation and a multiplicative modulation factor. The additive perturbation produces the fundamental component of the signal, while the multiplicative factor modulates the perturbation signal in line with the unit impulse pattern of the input image. Moreover, we design a Pearson correlation coefficient loss to generate universal perturbations, for disrupting the outputs of person Re-ID methods. Extensive experiments on DukeMTMC-ReID, Market-1501, and MARS show that the proposed method can efficiently improve the attack performance, especially when the magnitude of UAP is constrained to a small value.

Multi-Label Contrastive Focal Loss for Pedestrian Attribute Recognition

Xiaoqiang Zheng, Zhenxia Yu, Lin Chen, Fan Zhu, Shilong Wang

Responsive image

Auto-TLDR; Multi-label Contrastive Focal Loss for Pedestrian Attribute Recognition

Slides Poster Similar

Pedestrian Attribute Recognition (PAR) has received extensive attention during the past few years. With the advances of deep constitutional neural networks (CNNs), the performance of PAR has been significantly improved. Existing methods tend to acquire attribute-specific features by designing various complex network structures with additional modules. Such additional modules, however, dramatically increase the number of parameters. Meanwhile, the problems of class imbalance and hard attribute retrieving remain underestimated in PAR. In this paper, we explore the optimization mechanism of the training processing to account for these problems and propose a new loss function called Multi-label Contrastive Focal Loss (MCFL). This proposed MCFL emphasizes the hard and minority attributes by using a separated re-weighting mechanism for different positive and negative classes to alleviate the impact of the imbalance. MCFL is also able to enlarge the gaps between the intra-class of multi-label attributes, to force CNNs to extract more subtle discriminative features. We evaluate the proposed MCFL on three large public pedestrian datasets, including RAP, PA-100K, and PETA. The experimental results indicate that the proposed MCFL with the ResNet-50 backbone is able to outperform other state-of-the-art approaches in comparison.

CANU-ReID: A Conditional Adversarial Network for Unsupervised Person Re-IDentification

Guillaume Delorme, Yihong Xu, Stéphane Lathuiliere, Radu Horaud, Xavier Alameda-Pineda

Responsive image

Auto-TLDR; Unsupervised Person Re-Identification with Clustering and Adversarial Learning

Slides Similar

Unsupervised person re-ID is the task of identifying people on a target data set for which the ID labels are unavailable during training. In this paper, we propose to unify two trends in unsupervised person re-ID: clustering & fine-tuning and adversarial learning. On one side, clustering groups training images into pseudo-ID labels, and uses them to fine-tune the feature extractor. On the other side, adversarial learning is used, inspired by domain adaptation, to match distributions from different domains. Since target data is distributed across different camera viewpoints, we propose to model each camera as an independent domain, and aim to learn domain-independent features. Straightforward adversarial learning yields negative transfer, we thus introduce a conditioning vector to mitigate this undesirable effect. In our framework, the centroid of the cluster to which the visual sample belongs is used as conditioning vector of our conditional adversarial network, where the vector is permutation invariant (clusters ordering does not matter) and its size is independent of the number of clusters. To our knowledge, we are the first to propose the use of conditional adversarial networks for unsupervised person re-ID. We evaluate the proposed architecture on top of two state-of-the-art clustering-based unsupervised person re-identification (re-ID) methods on four different experimental settings with three different data sets and set the new state-of-the-art performance on all four of them. Our code and model will be made publicly available at https://team.inria.fr/perception/canu-reid/.

Unsupervised Domain Adaptation for Person Re-Identification through Source-Guided Pseudo-Labeling

Fabian Dubourvieux, Romaric Audigier, Angélique Loesch, Ainouz-Zemouche Samia, Stéphane Canu

Responsive image

Auto-TLDR; Pseudo-labeling for Unsupervised Domain Adaptation for Person Re-Identification

Slides Poster Similar

Person Re-Identification (re-ID) aims at retrieving images of the same person taken by different cameras. A challenge for re-ID is the performance preservation when a model is used on data of interest (target data) which belong to a different domain from the training data domain (source data). Unsupervised Domain Adaptation (UDA) is an interesting research direction for this challenge as it avoids a costly annotation of the target data. Pseudo-labeling methods achieve the best results in UDA-based re-ID. They incrementally learn with identity pseudo-labels which are initialized by clustering features in the source re-ID encoder space. Surprisingly, labeled source data are discarded after this initialization step. However, we believe that pseudo-labeling could further leverage the labeled source data in order to improve the post-initialization training steps. In order to improve robustness against erroneous pseudo-labels, we advocate the exploitation of both labeled source data and pseudo-labeled target data during all training iterations. To support our guideline, we introduce a framework which relies on a two-branch architecture optimizing classification in source and target domains, respectively, in order to allow adaptability to the target domain while ensuring robustness to noisy pseudo-labels. Indeed, shared low and mid-level parameters benefit from the source classification signal while high-level parameters of the target branch learn domain-specific features. Our method is simple enough to be easily combined with existing pseudo-labeling UDA approaches. We show experimentally that it is efficient and improves performance when the base method has no mechanism to deal with pseudo-label noise. And it maintains performance when combined with base method that already manages pseudo-label noise. Our approach reaches state-of-the-art performance when evaluated on commonly used datasets, Market-1501 and DukeMTMC-reID, and outperforms the state of the art when targeting the bigger and more challenging dataset MSMT.

Loop-closure detection by LiDAR scan re-identification

Jukka Peltomäki, Xingyang Ni, Jussi Puura, Joni-Kristian Kamarainen, Heikki Juhani Huttunen

Responsive image

Auto-TLDR; Loop-Closing Detection from LiDAR Scans Using Convolutional Neural Networks

Slides Poster Similar

In this work, loop-closure detection from LiDAR scans is defined as an image re-identification problem. Re-identification is performed by computing Euclidean distances of a query scan to a gallery set of previous scans. The distances are computed in a feature embedding space where the scans are mapped by a convolutional neural network (CNN). The network is trained using the triplet loss training strategy. In our experiments we compare different backbone networks, variants of the triplet loss and generic and LiDAR specific data augmentation techniques. With a realistic indoor dataset the best architecture obtains the mean average precision (mAP) above 90%.

Sample-Dependent Distance for 1 : N Identification Via Discriminative Feature Selection

Naoki Kawamura, Susumu Kubota

Responsive image

Auto-TLDR; Feature Selection Mask for 1:N Identification Problems with Binary Features

Slides Poster Similar

We focus on 1:N identification problems with binary features. Most multiclass classification methods, including identification and verification methods, use a shared metric space in which distances between samples are measured regardless of their identities. This is because dedicated metric spaces learned for each identity in the training set are of little use for the test set. In 1:N identification problems, however, gallery samples contain rich information about the test domain. Given a sample and its neighbors in the gallery set, we propose a method for calculating a discriminative feature selection mask that is used as a sample-dependent distance metric. Experiments on several re-identification datasets show that the proposed method enhances the performance of state-of-the-art feature extractors.

Generalized Local Attention Pooling for Deep Metric Learning

Carlos Roig Mari, David Varas, Issey Masuda, Juan Carlos Riveiro, Elisenda Bou-Balust

Responsive image

Auto-TLDR; Generalized Local Attention Pooling for Deep Metric Learning

Slides Poster Similar

Deep metric learning has been key to recent advances in face verification and image retrieval amongst others. These systems consist on a feature extraction block (extracts feature maps from images) followed by a spatial dimensionality reduction block (generates compact image representations from the feature maps) and an embedding generation module (projects the image representation to the embedding space). While research on deep metric learning has focused on improving the losses for the embedding generation module, the dimensionality reduction block has been overlooked. In this work, we propose a novel method to generate compact image representations which uses local spatial information through an attention mechanism, named Generalized Local Attention Pooling (GLAP). This method, instead of being placed at the end layer of the backbone, is connected at an intermediate level, resulting in lower memory requirements. We assess the performance of the aforementioned method by comparing it with multiple dimensionality reduction techniques, demonstrating the importance of using attention weights to generate robust compact image representations. Moreover, we compare the performance of multiple state-of-the-art losses using the standard deep metric learning system against the same experiment with our GLAP. Experiments showcase that the proposed Generalized Local Attention Pooling mechanism outperforms other pooling methods when compared with current state-of-the-art losses for deep metric learning.

How Important Are Faces for Person Re-Identification?

Julia Dietlmeier, Joseph Antony, Kevin Mcguinness, Noel E O'Connor

Responsive image

Auto-TLDR; Anonymization of Person Re-identification Datasets with Face Detection and Blurring

Slides Poster Similar

This paper investigates the dependence of existing state-of-the-art person re-identification models on the presence and visibility of human faces. We apply a face detection and blurring algorithm to create anonymized versions of several popular person re-identification datasets including Market1501, DukeMTMC-reID, CUHK03, Viper, and Airport. Using a cross-section of existing state-of-the-art models that range in accuracy and computational efficiency, we evaluate the effect of this anonymization on re-identification performance using standard metrics. Perhaps surprisingly, the effect on mAP is very small, and accuracy is recovered by simply training on the anonymized versions of the data rather than the original data. These findings are consistent across multiple models and datasets. These results indicate that datasets can be safely anonymized by blurring faces without significantly impacting the performance of person re-identification systems, and may allow for the release of new richer re-identification datasets where previously there were privacy or data protection concerns.

RGB-Infrared Person Re-Identification Via Image Modality Conversion

Huangpeng Dai, Qing Xie, Yanchun Ma, Yongjian Liu, Shengwu Xiong

Responsive image

Auto-TLDR; CE2L: A Novel Network for Cross-Modality Re-identification with Feature Alignment

Slides Poster Similar

As a cross modality retrieval task, RGB-infrared person re-identification(Re-ID) is an important and challenging tasking, because of its important role in video surveillance applications and large cross-modality variations between visible and infrared images. Most previous works addressed the problem of cross-modality gap with feature alignment by original feature representation learning straightly. In this paper, different from existing works, we propose a novel network(CE2L) to tackle the cross-modality gap with feature alignment. CE2L mainly focuses on adding discriminative information and learning robust features by converting modality between visible and infrared images. Its merits are highlighted in two aspects: 1)Using CycleGAN to convert infrared images into color images can not only increase the recognition characteristics of images, but also allow the our network to better learn the two modal image features; 2)Our novel method can serve as data augmentation. Specifically, it can increase data diversity and total data against over-fitting by converting labeled training images to another modal images. Extensive experimental results on two datasets demonstrate superior performance compared to the baseline and the state-of-the-art methods.

Online Domain Adaptation for Person Re-Identification with a Human in the Loop

Rita Delussu, Lorenzo Putzu, Giorgio Fumera, Fabio Roli

Responsive image

Auto-TLDR; Human-in-the-loop for Person Re-Identification in Infeasible Applications

Slides Poster Similar

Supervised deep learning methods have recently achieved remarkable performance in person re-identification. Unsupervised domain adaptation (UDA) approaches have also been proposed for application scenarios where only unlabelled data are available from target camera views. We consider a more challenging scenario when even collecting a suitable amount of representative, unlabelled target data for offline training or fine-tuning is infeasible. In this context we revisit the human-in-the-loop (HITL) approach, which exploits online the operator's feedback on a small amount of target data. We argue that HITL is a kind of online domain adaptation specifically suited to person re-identification. We then reconsider relevance feedback methods for content-based image retrieval that are computationally much cheaper than state-of-the-art HITL methods for person re-identification, and devise a specific feedback protocol for them. Experimental results show that HITL can achieve comparable or better performance than UDA, and is therefore a valid alternative when the lack of unlabelled target data makes UDA infeasible.

Decoupled Self-Attention Module for Person Re-Identification

Chao Zhao, Zhenyu Zhang, Jian Yang, Yan Yan

Responsive image

Auto-TLDR; Decoupled Self-attention Module for Person Re-identification

Slides Poster Similar

Person re-identification aims to identifying the same person from different cameras, which needs to integrate whole-body information and capture global correlation. However, convolutional neural network is able to only capture short-distance information because of the size of filters. Self-attention is introduced to capture long-distance correlation, but inner-product similarity calculation in self-attention mingles semantic response and semantic difference together. Semantic difference is more important for person re-identification, because it is robust to illumination without the effect of semantic response. However, we find the scale of norms measuring semantic response is much larger than angle measuring semantic difference by decoupling inner-product similarity into norms and angle. To balance the importance of semantic response and semantic difference in self-attention, we propose the decoupled self-attention module for person re-identification to make the most of self-attention. Extensive experiments show that the decoupled self-attention module obtains significant performance with easier convergence and stronger robustness.

DAIL: Dataset-Aware and Invariant Learning for Face Recognition

Gaoang Wang, Chen Lin, Tianqiang Liu, Mingwei He, Jiebo Luo

Responsive image

Auto-TLDR; DAIL: Dataset-Aware and Invariant Learning for Face Recognition

Slides Poster Similar

To achieve good performance in face recognition, a large scale training dataset is usually required. A simple yet effective way for improving the recognition performance is to use a dataset as large as possible by combining multiple datasets in the training. However, it is problematic and troublesome to naively combine different datasets due to two major issues. Firstly, the same person can possibly appear in different datasets, leading to the identity overlapping issue between different datasets. Natively treating the same person as different classes in different datasets during training will affect back-propagation and generate non-representative embeddings. On the other hand, manually cleaning labels will take a lot of human efforts, especially when there are millions of images and thousands of identities. Secondly, different datasets are collected in different situations and thus will lead to different domain distributions. Natively combining datasets will lead to domain distribution differences and make it difficult to learn domain invariant embeddings across different datasets. In this paper, we propose DAIL: Dataset-Aware and Invariant Learning to resolve the above-mentioned issues. To solve the first issue of identity overlapping, we propose a dataset-aware loss for multi-dataset training by reducing the penalty when the same person appears in multiple datasets. This can be readily achieved with a modified softmax loss with a dataset-aware term. To solve the second issue, the domain adaptation with gradient reversal layers is employed for dataset invariant learning. The proposed approach not only achieves state-of-the-art results on several commonly used face recognition validation sets, like LFW, CFP-FP, AgeDB-30, but also shows great benefit for practical usage.

Learning Embeddings for Image Clustering: An Empirical Study of Triplet Loss Approaches

Kalun Ho, Janis Keuper, Franz-Josef Pfreundt, Margret Keuper

Responsive image

Auto-TLDR; Clustering Objectives for K-means and Correlation Clustering Using Triplet Loss

Slides Poster Similar

In this work, we evaluate two different image clustering objectives, k-means clustering and correlation clustering, in the context of Triplet Loss induced feature space embeddings. Specifically, we train a convolutional neural network to learn discriminative features by optimizing two popular versions of the Triplet Loss in order to study their clustering properties under the assumption of noisy labels. Additionally, we propose a new, simple Triplet Loss formulation, which shows desirable properties with respect to formal clustering objectives and outperforms the existing methods. We evaluate all three Triplet loss formulations for K-means and correlation clustering on the CIFAR-10 image classification dataset.

Nonlinear Ranking Loss on Riemannian Potato Embedding

Byung Hyung Kim, Yoonje Suh, Honggu Lee, Sungho Jo

Responsive image

Auto-TLDR; Riemannian Potato for Rank-based Metric Learning

Slides Poster Similar

We propose a rank-based metric learning method by leveraging a concept of the Riemannian Potato for better separating non-linear data. By exploring the geometric properties of Riemannian manifolds, the proposed loss function optimizes the measure of dispersion using the distribution of Riemannian distances between a reference sample and neighbors and builds a ranked list according to the similarities. We show the proposed function can learn a hypersphere for each class, preserving the similarity structure inside it on Riemannian manifold. As a result, compared with Euclidean distance-based metric, our method can further jointly reduce the intra-class distances and enlarge the inter-class distances for learned features, consistently outperforming state-of-the-art methods on three widely used non-linear datasets.

Open-World Group Retrieval with Ambiguity Removal: A Benchmark

Ling Mei, Jian-Huang Lai, Zhanxiang Feng, Xiaohua Xie

Responsive image

Auto-TLDR; P2GSM-AR: Re-identifying changing groups of people under the open-world and group-ambiguity scenarios

Slides Poster Similar

Group retrieval has attracted plenty of attention in artificial intelligence, traditional group retrieval researches assume that members in a group are unique and do not change under different cameras. However, the assumption may not be met for practical situations such as open-world and group-ambiguity scenarios. This paper tackles an important yet non-studied problem: re-identifying changing groups of people under the open-world and group-ambiguity scenarios in different camera fields. The open-world scenario considers that there are probably non-target people for the probe set appear in the searching gallery, while the group-ambiguity scenario means the group members may change. The open-world and group-ambiguity issue is very challenging for the existing methods because the changing of group members results in dramatic visual variations. Nevertheless, as far as we know, the existing literature lacks benchmarks which target on coping with this issue. In this paper, we propose a new group retrieval dataset named OWGA-Campus to consider these challenges. Moreover, we propose a person-to-group similarity matching based ambiguity removal (P2GSM-AR) method to solve these problems and realize the intention of group retrieval. Experimental results on OWGA-Campus dataset demonstrate the effectiveness and robustness of the proposed P2GSM-AR approach in improving the performance of the state-of-the-art feature extraction methods of person re-id towards the open-world and ambiguous group retrieval task.

G-FAN: Graph-Based Feature Aggregation Network for Video Face Recognition

He Zhao, Yongjie Shi, Xin Tong, Jingsi Wen, Xianghua Ying, Jinshi Hongbin Zha

Responsive image

Auto-TLDR; Graph-based Feature Aggregation Network for Video Face Recognition

Slides Poster Similar

In this paper, we propose a graph-based feature aggregation network (G-FAN) for video face recognition. Compared with the still image, video face recognition exhibits great challenges due to huge intra-class variability and high inter-class ambiguity. To address this problem, our G-FAN first uses a Convolutional Neural Network to extract deep features for every input face of a subject. Then, we build an affinity graph based on the relation between facial features and apply Graph Convolutional Network to generate fine-grained quality vectors for each frame. Finally, the features among multiple frames are adaptively aggregated into a discriminative vector to represent a video face. Different from previous works that take a single image as input, our G-FAN could utilize the correlation information between image pairs and aggregate a template of faces simultaneously. The experiments on video face recognition benchmarks, including YTF, IJB-A, and IJB-C show that: (i) G-FAN automatically learns to advocate high-quality frames while repelling low-quality ones. (ii) G-FAN significantly boosts recognition accuracy and outperforms other state-of-the-art aggregation methods.

A Base-Derivative Framework for Cross-Modality RGB-Infrared Person Re-Identification

Hong Liu, Ziling Miao, Bing Yang, Runwei Ding

Responsive image

Auto-TLDR; Cross-modality RGB-Infrared Person Re-identification with Auxiliary Modalities

Slides Poster Similar

Cross-modality RGB-infrared (RGB-IR) person re-identification (Re-ID) is a challenging research topic due to the heterogeneity of RGB and infrared images. In this paper, we aim to find some auxiliary modalities, which are homologous with the visible or infrared modalities, to help reduce the modality discrepancy caused by heterogeneous images. Accordingly, a new base-derivative framework is proposed, where base refers to the original visible and infrared modalities, and derivative refers to the two auxiliary modalities that are derived from base. In the proposed framework, the double-modality cross-modal learning problem is reformulated as a four-modality one. After that, the images of all the base and derivative modalities are fed into the feature learning network. With the doubled input images, the learned person features become more discriminative. Furthermore, the proposed framework is optimized by the enhanced intra- and cross-modality constraints with the assistance of two derivative modalities. Experimental results on two publicly available datasets SYSU-MM01 and RegDB show that the proposed method outperforms the other state-of-the-art methods. For instance, we achieve a gain of over 13\% in terms of both Rank-1 and mAP on RegDB dataset.

A Duplex Spatiotemporal Filtering Network for Video-Based Person Re-Identification

Chong Zheng, Ping Wei, Nanning Zheng

Responsive image

Auto-TLDR; Duplex Spatiotemporal Filtering Network for Person Re-identification in Videos

Slides Poster Similar

Video-based person re-identification plays important roles in surveillance video analysis. This paper proposes a novel Duplex Spatiotemporal Filtering Network (DSFN) to re-identify persons in videos. A video sequence is represented as a duplex spatiotemporal matrix. DSFN model containing a group of filters performs filtering at feature level in both temporal and spatial dimensions, by which the model focuses on feature-level semantic information rather than image-level information as in the traditional filters. We propose sparse-orthogonal constraints to enforce the model to extract more discriminative features. DSFN characterizes not only the appearance features but also dynamic information such as gaits embedded in video sequences and obtains a better performance as a result. Experiments show that the proposed method outperforms state-of-the-art approaches.

TAAN: Task-Aware Attention Network for Few-Shot Classification

Zhe Wang, Li Liu, Fanzhang Li

Responsive image

Auto-TLDR; TAAN: Task-Aware Attention Network for Few-Shot Classification

Slides Poster Similar

Few-shot classification aims to recognize unlabeled samples from unseen classes given only a few labeled samples.Current approaches of few-shot learning usually employ a metriclearning framework to learn a feature similarity comparison between a query (test) example and the few support (training) examples. However, these approaches all extract features from samples independently without looking at the entire task as a whole, and so fail to provide an enough discrimination to features. Moreover, the existing approaches lack the ability to select the most relevant features for the task at hand. In this work, we propose a novel algorithm called Task-Aware Attention Network (TAAN) to address the above problems in few-shot classification. By inserting a Task-Relevant Channel Attention Module into metric-based few-shot learners, TAAN generates channel attentions for each sample by aggregating the context of the entire support set and identifies the most relevant features for similarity comparison. The experiment demonstrates that TAAN is competitive in overall performance comparing to the recent state-of-the-art systems and improves the performance considerably over baseline systems on both mini-ImageNet and tiered-ImageNet benchmarks.

Batch-Incremental Triplet Sampling for Training Triplet Networks Using Bayesian Updating Theorem

Milad Sikaroudi, Benyamin Ghojogh, Fakhri Karray, Mark Crowley, Hamid Reza Tizhoosh

Responsive image

Auto-TLDR; Bayesian Updating Triplet Mining with Bayesian updating

Slides Poster Similar

Variants of Triplet networks are robust entities for learning a discriminative embedding subspace. There exist different triplet mining approaches for selecting the most suitable training triplets. Some of these mining methods rely on the extreme distances between instances, and some others make use of sampling. However, sampling from stochastic distributions of data rather than sampling merely from the existing embedding instances can provide more discriminative information. In this work, we sample triplets from distributions of data rather than from existing instances. We consider a multivariate normal distribution for the embedding of each class. Using Bayesian updating and conjugate priors, we update the distributions of classes dynamically by receiving the new mini-batches of training data. The proposed triplet mining with Bayesian updating can be used with any triplet-based loss function, e.g., \textit{triplet-loss} or Neighborhood Component Analysis (NCA) loss. Accordingly, Our triplet mining approaches are called Bayesian Updating Triplet (BUT) and Bayesian Updating NCA (BUNCA), depending on which loss function is being used. Experimental results on two public datasets, namely MNIST and histopathology colorectal cancer (CRC), substantiate the effectiveness of the proposed triplet mining method.

SL-DML: Signal Level Deep Metric Learning for Multimodal One-Shot Action Recognition

Raphael Memmesheimer, Nick Theisen, Dietrich Paulus

Responsive image

Auto-TLDR; One-Shot Action Recognition using Metric Learning

Slides Similar

Recognizing an activity with a single reference sample using metric learning approaches is a promising research field. The majority of few-shot methods focus on object recognition or face-identification. We propose a metric learning approach to reduce the action recognition problem to a nearest neighbor search in embedding space. We encode signals into images and extract features using a deep residual CNN. Using triplet loss, we learn a feature embedding. The resulting encoder transforms features into an embedding space in which closer distances encode similar actions while higher distances encode different actions. Our approach is based on a signal level formulation and remains flexible across a variety of modalities. It further outperforms the baseline on the large scale NTU RGB+D 120 dataset for the One-Shot action recognition protocol by \ntuoneshotimpro%. With just 60% of the training data, our approach still outperforms the baseline approach by \ntuoneshotimproreduced%. With 40% of the training data, our approach performs comparably well as the second follow up. Further, we show that our approach generalizes well in experiments on the UTD-MHAD dataset for inertial, skeleton and fused data and the Simitate dataset for motion capturing data. Furthermore, our inter-joint and inter-sensor experiments suggest good capabilities on previously unseen setups.

Rotation Invariant Aerial Image Retrieval with Group Convolutional Metric Learning

Hyunseung Chung, Woo-Jeoung Nam, Seong-Whan Lee

Responsive image

Auto-TLDR; Robust Remote Sensing Image Retrieval Using Group Convolution with Attention Mechanism and Metric Learning

Slides Poster Similar

Remote sensing image retrieval (RSIR) is the process of ranking database images depending on the degree of similarity compared to the query image. As the complexity of RSIR increases due to the diversity in shooting range, angle, and location of remote sensors, there is an increasing demand for methods to address these issues and improve retrieval performance. In this work, we introduce a novel method for retrieving aerial images by merging group convolution with attention mechanism and metric learning, resulting in robustness to rotational variations. For refinement and emphasis on important features, we applied channel attention in each group convolution stage. By utilizing the characteristics of group convolution and channel-wise attention, it is possible to acknowledge the equality among rotated but identically located images. The training procedure has two main steps: (i) training the network with Aerial Image Dataset (AID) for classification, (ii) fine-tuning the network with triplet-loss for retrieval with Google Earth South Korea and NWPU-RESISC45 datasets. Results show that the proposed method performance exceeds other state-of-the-art retrieval methods in both rotated and original environments. Furthermore, we utilize class activation maps (CAM) to visualize the distinct difference of main features between our method and baseline, resulting in better adaptability in rotated environments.

P2 Net: Augmented Parallel-Pyramid Net for Attention Guided Pose Estimation

Luanxuan Hou, Jie Cao, Yuan Zhao, Haifeng Shen, Jian Tang, Ran He

Responsive image

Auto-TLDR; Parallel-Pyramid Net with Partial Attention for Human Pose Estimation

Slides Poster Similar

The target of human pose estimation is to determine the body parts and joint locations of persons in the image. Angular changes, motion blur and occlusion etc. in the natural scenes make this task challenging, while some joints are more difficult to be detected than others. In this paper, we propose an augmented Parallel-Pyramid Net (P^2Net) with an partial attention module. During data preprocessing, we proposed a differentiable auto data augmentation (DA^2) method in which sequences of data augmentations are formulated as a trainable and operational Convolution Neural Network (CNN) component. DA^2 improves the training efficiency and effectiveness. A parallel pyramid structure is followed to compensate the information loss introduced by the network. For the information loss problem in the backbone network, we optimize the backbone network by adopting a new parallel structure without increasing the overall computational complexity. To further refine the predictions after completion of global predictions, an Partial Attention Module (PAM) is defined to extract weighted features from different scale feature maps generated by the parallel pyramid structure. Compared with the traditional up-sampling refining, PAM can better capture the relationship between channels. Experiments corroborate the effectiveness of our proposed method. Notably, our method achieves the best performance on the challenging MSCOCO and MPII datasets.

Augmented Bi-Path Network for Few-Shot Learning

Baoming Yan, Chen Zhou, Bo Zhao, Kan Guo, Yang Jiang, Xiaobo Li, Zhang Ming, Yizhou Wang

Responsive image

Auto-TLDR; Augmented Bi-path Network for Few-shot Learning

Slides Poster Similar

Few-shot Learning (FSL) which aims to learn from few labeled training data is becoming a popular research topic, due to the expensive labeling cost in many real-world applications. One kind of successful FSL method learns to compare the testing (query) image and training (support) image by simply concatenating the features of two images and feeding it into the neural network. However, with few labeled data in each class, the neural network has difficulty in learning or comparing the local features of two images. Such simple image-level comparison may cause serious mis-classification. To solve this problem, we propose Augmented Bi-path Network (ABNet) for learning to compare both global and local features on multi-scales. Specifically, the salient patches are extracted and embedded as the local features for every image. Then, the model learns to augment the features for better robustness. Finally, the model learns to compare global and local features separately, \emph{i.e.}, in two paths, before merging the similarities. Extensive experiments show that the proposed ABNet outperforms the state-of-the-art methods. Both quantitative and visual ablation studies are provided to verify that the proposed modules lead to more precise comparison results.

Improved Deep Classwise Hashing with Centers Similarity Learning for Image Retrieval

Ming Zhang, Hong Yan

Responsive image

Auto-TLDR; Deep Classwise Hashing for Image Retrieval Using Center Similarity Learning

Slides Poster Similar

Deep supervised hashing for image retrieval has attracted researchers' attention due to its high efficiency and superior retrieval performance. Most existing deep supervised hashing works, which are based on pairwise/triplet labels, suffer from the expensive computational cost and insufficient utilization of the semantics information. Recently, deep classwise hashing introduced a classwise loss supervised by class labels information alternatively; however, we find it still has its drawback. In this paper, we propose an improved deep classwise hashing, which enables hashing learning and class centers learning simultaneously. Specifically, we design a two-step strategy on center similarity learning. It interacts with the classwise loss to attract the class center to concentrate on the intra-class samples while pushing other class centers as far as possible. The centers similarity learning contributes to generating more compact and discriminative hashing codes. We conduct experiments on three benchmark datasets. It shows that the proposed method effectively surpasses the original method and outperforms state-of-the-art baselines under various commonly-used evaluation metrics for image retrieval.

Learning a Dynamic High-Resolution Network for Multi-Scale Pedestrian Detection

Mengyuan Ding, Shanshan Zhang, Jian Yang

Responsive image

Auto-TLDR; Learningable Dynamic HRNet for Pedestrian Detection

Slides Poster Similar

Pedestrian detection is a canonical instance of object detection in computer vision. In practice, scale variation is one of the key challenges, resulting in unbalanced performance across different scales. Recently, the High-Resolution Network (HRNet) has become popular because high-resolution feature representations are more friendly to small objects. However, when we apply HRNet for pedestrian detection, we observe that it improves for small pedestrians on one hand, but hurts the performance for larger ones on the other hand. To overcome this problem, we propose a learnable Dynamic HRNet (DHRNet) aiming to generate different network paths adaptive to different scales. Specifically, we construct a parallel multi-branch architecture and add a soft conditional gate module allowing for dynamic feature fusion. Both branches share all the same parameters except the soft gate module. Experimental results on CityPersons and Caltech benchmarks indicate that our proposed dynamic HRNet is more capable of dealing with pedestrians of various scales, and thus improves the performance across different scales consistently.

Dynamic Multi-Path Neural Network

Yingcheng Su, Yichao Wu, Ken Chen, Ding Liang, Xiaolin Hu

Responsive image

Auto-TLDR; Dynamic Multi-path Neural Network

Slides Similar

Although deeper and larger neural networks have achieved better performance, due to overwhelming burden on computation, they cannot meet the demands of deployment on resource-limited devices. An effective strategy to address this problem is to make use of dynamic inference mechanism, which changes the inference path for different samples at runtime. Existing methods only reduce the depth by skipping an entire specific layer, which may lose important information in this layer. In this paper, we propose a novel method called Dynamic Multi-path Neural Network (DMNN), which provides more topology choices in terms of both width and depth on the fly. For better modelling the inference path selection, we further introduce previous state and object category information to guide the training process. Compared to previous dynamic inference techniques, the proposed method is more flexible and easier to incorporate into most modern network architectures. Experimental results on ImageNet and CIFAR-100 demonstrate the superiority of our method on both efficiency and classification accuracy.

3D Facial Matching by Spiral Convolutional Metric Learning and a Biometric Fusion-Net of Demographic Properties

Soha Sadat Mahdi, Nele Nauwelaers, Philip Joris, Giorgos Bouritsas, Imperial London, Sergiy Bokhnyak, Susan Walsh, Mark Shriver, Michael Bronstein, Peter Claes

Responsive image

Auto-TLDR; Multi-biometric Fusion for Biometric Verification using 3D Facial Mesures

Slides Similar

Face recognition is a widely accepted biometric verification tool, as the face contains a lot of information about the identity of a person. In this study, a 2-step neural-based pipeline is presented for matching 3D facial shape to multiple DNA-related properties (sex, age, BMI and genomic background). The first step consists of a triplet loss-based metric learner that compresses facial shape into a lower dimensional embedding while preserving information about the property of interest. Most studies in the field of metric learning have only focused on Euclidean data. In this work, geometric deep learning is employed to learn directly from 3D facial meshes. To this end, spiral convolutions are used along with a novel mesh-sampling scheme that retains uniformly sampled 3D points at different levels of resolution. The second step is a multi-biometric fusion by a fully connected neural network. The network takes an ensemble of embeddings and property labels as input and returns genuine and imposter scores. Since embeddings are accepted as an input, there is no need to train classifiers for the different properties and available data can be used more efficiently. Results obtained by a 10-fold cross-validation for biometric verification show that combining multiple properties leads to stronger biometric systems. Furthermore, the proposed neural-based pipeline outperforms a linear baseline, which consists of principal component analysis, followed by classification with linear support vector machines and a Naïve Bayes-based score-fuser.

SSDL: Self-Supervised Domain Learning for Improved Face Recognition

Samadhi Poornima Kumarasinghe Wickrama Arachchilage, Ebroul Izquierdo

Responsive image

Auto-TLDR; Self-supervised Domain Learning for Face Recognition in unconstrained environments

Slides Poster Similar

Face recognition in unconstrained environments is challenging due to variations in illumination, quality of sensing, motion blur and etc. An individual’s face appearance can vary drastically under different conditions creating a gap between train (source) and varying test (target) data. The domain gap could cause decreased performance levels in direct knowledge transfer from source to target. Despite fine-tuning with domain specific data could be an effective solution, collecting and annotating data for all domains is extremely expensive. To this end, we propose a self-supervised domain learning (SSDL) scheme that trains on triplets mined from unlabelled data. A key factor in effective discriminative learning, is selecting informative triplets. Building on most confident predictions, we follow an “easy-to-hard” scheme of alternate triplet mining and self-learning. Comprehensive experiments on four different benchmarks show that SSDL generalizes well on different domains.

Attention-Based Deep Metric Learning for Near-Duplicate Video Retrieval

Kuan-Hsun Wang, Chia Chun Cheng, Yi-Ling Chen, Yale Song, Shang-Hong Lai

Responsive image

Auto-TLDR; Attention-based Deep Metric Learning for Near-duplicate Video Retrieval

Slides Similar

Near-duplicate video retrieval (NDVR) is an important and challenging problem due to the increasing amount of videos uploaded to the Internet. In this paper, we propose an attention-based deep metric learning method for NDVR. Our method is based on well-established principles: We leverage two-stream networks to combine RGB and optical flow features, and incorporate an attention module to effectively deal with distractor frames commonly observed in near duplicate videos. We further aggregate the features corresponding to multiple video segments to enhance the discriminative power. The whole system is trained using a deep metric learning objective with a Siamese architecture. Our experiments show that the attention module helps eliminate redundant and noisy frames, while focusing on visually relevant frames for solving NVDR. We evaluate our approach on recent large-scale NDVR datasets, CC_WEB_VIDEO, VCDB, FIVR and SVD. To demonstrate the generalization ability of our approach, we report results in both within- and cross-dataset settings, and show that the proposed method significantly outperforms state-of-the-art approaches.