Multi-Level Deep Learning Vehicle Re-Identification Using Ranked-Based Loss Functions

Eleni Kamenou, Jesus Martinez-Del-Rincon, Paul Miller, Patricia Devlin-Hill

Auto-TLDR; Multi-Level Re-identification Network for Vehicle Re-Identification

Identifying vehicles across a network of cameras with non-overlapping fields of view remains a challenging research problem due to scene occlusions, significant inter-class similarity and intra-class variability. In this paper, we propose an end-to-end multi-level re-identification network that projects vehicles of the same identity closer to one another in the embedding space than vehicles of different identities. Robust feature representations are obtained by combining features at multiple levels of the network. For the learning process, we employ a recent state-of-the-art structured metric learning loss function previously applied to other retrieval problems and adapt it to the vehicle re-identification task. Furthermore, we explore image-to-image, image-to-video and video-to-video similarity metrics. Finally, we evaluate our system and achieve strong performance on two large-scale publicly available datasets, CityFlow-ReID and VeRi-776. Compared to most existing state-of-the-art approaches, our approach is simpler and more straightforward, utilizing only identity-level annotations while avoiding post-processing of the ranking results (re-ranking) at the testing phase.

Similar papers

Not 3D Re-ID: Simple Single Stream 2D Convolution for Robust Video Re-Identification

Toby Breckon, Aishah Alsehaim

Auto-TLDR; ResNet50-IBN for Video-based Person Re-Identification using Single Stream 2D Convolution Network

Video-based person re-identification has received increasing attention recently, as it plays an important role within surveillance video analysis. Video-based Re-ID extends earlier image-based re-identification methods by learning features from a video via multiple image frames for each person. Most contemporary video Re-ID methods utilise complex CNN-based network architectures using 3D convolution or multi-branch networks to extract spatial-temporal features from the video. By contrast, in this paper, we illustrate superior performance from a simple single stream 2D convolution network leveraging the ResNet50-IBN architecture to extract frame-level features, followed by temporal attention for clip-level features. These clip-level features can be generalised to video-level features by averaging them, without any additional cost. Our model, which uses best video Re-ID practice and transfer learning between datasets, outperforms existing state-of-the-art approaches on the MARS, PRID2011 and iLIDS-VID datasets with 89.62%, 97.75% and 97.33% rank-1 accuracy respectively, and with 84.61% mAP for MARS, without reliance on complex and memory-intensive 3D convolutions or multi-stream network architectures as found in other contemporary work. This work thus shows that global features extracted by the 2D convolution network are a sufficient representation for robust state-of-the-art video Re-ID.
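The aggregation pipeline described in this abstract (frame-level features, then temporal attention for a clip-level feature, then averaging for a video-level feature) can be illustrated with a minimal sketch. The abstract does not say how the attention scores are produced or normalised, so the softmax normalisation and the function names `clip_feature` and `video_feature` below are assumptions made for illustration only; in a real model the scores would come from a learned layer.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def clip_feature(frame_features, attention_scores):
    """Attention-weighted sum of frame-level features (one clip-level feature).

    attention_scores holds one raw score per frame; softmax normalisation
    is an assumption, not a detail taken from the abstract.
    """
    weights = softmax(attention_scores)
    dim = len(frame_features[0])
    return [sum(w * f[d] for w, f in zip(weights, frame_features))
            for d in range(dim)]

def video_feature(clip_features):
    """Plain average of clip-level features (one video-level feature)."""
    n = len(clip_features)
    dim = len(clip_features[0])
    return [sum(c[d] for c in clip_features) / n for d in range(dim)]

# Two frames with equal attention scores contribute equally to the clip.
clip = clip_feature([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
video = video_feature([clip, [1.5, 2.5]])
```

Averaging clip features needs no extra parameters, which is consistent with the abstract's remark that video-level features come at no additional cost.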

Deep Top-Rank Counter Metric for Person Re-Identification

Chen Chen, Hao Dou, Xiyuan Hu, Silong Peng

Auto-TLDR; Deep Top-Rank Counter Metric for Person Re-identification

In the research field of person re-identification, deep metric learning, which guides efficient and effective embedding learning, serves as one of the most fundamental tasks. Recent loss-function-based deep metric learning methods mainly focus on optimizing top-rank accuracy by minimizing the distance difference between the correctly matched sample pair and the wrongly matched sample pair. However, it is more straightforward to count the occurrences of correct top-rank candidates and maximize the counting results for better top-rank accuracy. In this paper, we propose a generalized logistic function based metric that is practical for deep learning, namely the “deep top-rank counter metric”, to approximately optimize the counted occurrences of the correct top-rank matches. The properties that qualify the proposed metric as a well-suited deep re-identification metric are discussed, and a progressive hard sample mining strategy is also introduced for effective training and performance boosting. Extensive experiments show that the proposed top-rank counter metric outperforms other loss-function-based deep metrics and achieves state-of-the-art accuracy.

Progressive Learning Algorithm for Efficient Person Re-Identification

Zhen Li, Hanyang Shao, Liang Niu, Nian Xue

Auto-TLDR; Progressive Learning Algorithm for Large-Scale Person Re-Identification

This paper studies the problem of Person Re-Identification (ReID) for large-scale applications. Recent research efforts have been devoted to building complicated part models, which introduce considerably high computational cost and memory consumption, inhibiting their practicability in large-scale applications. This paper aims to develop a novel learning strategy to find efficient feature embeddings while maintaining the balance of accuracy and model complexity. More specifically, we find that by enhancing the classical triplet loss together with the cross-entropy loss, our method can explore the hard examples and build a discriminative feature embedding that is still compact enough for large-scale applications. Our method is carried out progressively using Bayesian optimization, and we call it the Progressive Learning Algorithm (PLA). Extensive experiments on three large-scale datasets show that our PLA is comparable to or better than the state of the art. In particular, on the challenging Market-1501 dataset, we achieve Rank-1=94.7%/mAP=89.4% while using at least 30% fewer parameters than strong part models.
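For readers unfamiliar with the two losses this abstract builds on, the following sketch shows the classical triplet loss combined with a cross-entropy loss for a single sample. It is not the enhanced loss or the Bayesian-optimization schedule of PLA, which the abstract does not specify; the margin value, the `triplet_weight` parameter and all function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Classical triplet loss: pull the positive closer than the negative by `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, using a stable log-sum-exp."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def combined_loss(anchor, positive, negative, logits, label,
                  margin=0.3, triplet_weight=1.0):
    """Weighted sum of the two losses; the weighting scheme is an assumption."""
    return (cross_entropy(logits, label)
            + triplet_weight * triplet_loss(anchor, positive, negative, margin))

# An "easy" triplet (negative already far away) contributes zero triplet loss,
# so only the classification term remains.
loss = combined_loss([0.0, 0.0], [0.0, 0.0], [3.0, 4.0], [2.0, 0.0], 0)
```

Triplets with a non-zero loss are the hard examples that the abstract refers to; how PLA mines or weights them is not described there.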

Generalized Local Attention Pooling for Deep Metric Learning

Carlos Roig Mari, David Varas, Issey Masuda, Juan Carlos Riveiro, Elisenda Bou-Balust

Auto-TLDR; Generalized Local Attention Pooling for Deep Metric Learning

Deep metric learning has been key to recent advances in face verification and image retrieval, amongst others. These systems consist of a feature extraction block (extracts feature maps from images) followed by a spatial dimensionality reduction block (generates compact image representations from the feature maps) and an embedding generation module (projects the image representation to the embedding space). While research on deep metric learning has focused on improving the losses for the embedding generation module, the dimensionality reduction block has been overlooked. In this work, we propose a novel method to generate compact image representations which uses local spatial information through an attention mechanism, named Generalized Local Attention Pooling (GLAP). This method, instead of being placed at the end layer of the backbone, is connected at an intermediate level, resulting in lower memory requirements. We assess the performance of the method by comparing it with multiple dimensionality reduction techniques, demonstrating the importance of using attention weights to generate robust compact image representations. Moreover, we compare the performance of multiple state-of-the-art losses using the standard deep metric learning system against the same experiment with our GLAP. Experiments show that the proposed Generalized Local Attention Pooling mechanism outperforms other pooling methods when compared with current state-of-the-art losses for deep metric learning.

Top-DB-Net: Top DropBlock for Activation Enhancement in Person Re-Identification

Rodolfo Quispe, Helio Pedrini

Auto-TLDR; Top-DB-Net for Person Re-Identification using Top DropBlock

Person Re-Identification is a challenging task that aims to retrieve all instances of a query image across a system of non-overlapping cameras. Due to extreme changes of view, local regions that could be used to match people are often suppressed, which leads to a scenario where approaches have to evaluate the similarity of images based on less informative regions. In this work, we introduce the Top-DB-Net, a method based on Top DropBlock that pushes the network to learn to focus on the scene foreground, with special emphasis on the most task-relevant regions and, at the same time, encodes low informative regions to provide high discriminability. The Top-DB-Net is composed of three streams: (i) a global stream encodes rich image information from a backbone, (ii) the Top DropBlock stream encourages the backbone to encode low informative regions with high discriminative features, and (iii) a regularization stream helps to deal with the noise created by the dropping process of the second stream. At test time, only the first two streams are used. Extensive experiments on three challenging datasets show the capabilities of our approach against state-of-the-art methods. Qualitative results demonstrate that our method exhibits better activation maps focusing on reliable parts of the input images.

Rethinking ReID: Multi-Feature Fusion Person Re-Identification Based on Orientation Constraints

Mingjing Ai, Guozhi Shan, Bo Liu, Tianyang Liu

Auto-TLDR; Person Re-identification with Orientation Constrained Network

Person re-identification (ReID) aims to identify a specific pedestrian in a series of images or videos. Recently, ReID has been receiving more and more attention in computer vision research and in applications such as intelligent security. One major issue degrading ReID model performance is that different subjects in the same body orientation look too similar for the model to distinguish, while the same subject viewed in different orientations looks rather different. However, most current studies do not explicitly differentiate pedestrians by orientation when designing the network, so we rethink this problem from the perspective of person orientation and propose a new network structure with two branches: one handling samples with the same body orientation and the other handling samples with different body orientations. Correspondingly, we also propose an orientation classifier that can accurately distinguish the orientation of each person. At the same time, three-part loss functions are introduced for the orientation constraint and combined to optimize the network simultaneously. We also use global and local features in the training stage in order to make use of multi-level information. Our network therefore derives its efficacy from orientation constraints and multiple features. Experiments show that our method not only has competitive performance on multiple datasets, but also ranks retrieval results aligned with the orientation of the query sample higher, which may have great potential in practical applications.

Adaptive L2 Regularization in Person Re-Identification

Xingyang Ni, Liang Fang, Heikki Juhani Huttunen

Auto-TLDR; AdaptiveReID: Adaptive L2 Regularization for Person Re-identification

We introduce an adaptive L2 regularization mechanism termed AdaptiveReID, in the setting of person re-identification. In the literature, it is common practice to utilize hand-picked regularization factors which remain constant throughout the training procedure. Unlike existing approaches, the regularization factors in our proposed method are updated adaptively through backpropagation. This is achieved by incorporating trainable scalar variables as the regularization factors, which are further fed into a scaled hard sigmoid function. Extensive experiments on the Market-1501, DukeMTMC-reID and MSMT17 datasets validate the effectiveness of our framework. Most notably, we obtain state-of-the-art performance on MSMT17, which is the largest dataset for person re-identification. Source code will be published at https://github.com/nixingyang/AdaptiveReID.
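The mechanism in this abstract (a trainable scalar passed through a scaled hard sigmoid to produce the regularization factor) can be sketched as a forward computation. The abstract does not give the exact form of the hard sigmoid or the scale, so the piecewise-linear definition `clip(0.2 * x + 0.5, 0, 1)`, the default `scale=0.01` and the function names are assumptions; the linked repository would be the place to check the actual formulation.

```python
def hard_sigmoid(x):
    """One common piecewise-linear sigmoid approximation (assumed form)."""
    return min(1.0, max(0.0, 0.2 * x + 0.5))

def regularization_factor(theta, scale=0.01):
    """Map a trainable scalar `theta` to a factor in [0, scale].

    `scale` is a hypothetical upper bound on the regularization strength.
    """
    return scale * hard_sigmoid(theta)

def l2_penalty(weights, theta, scale=0.01):
    """L2 penalty on `weights` with an adaptive, theta-dependent factor."""
    return regularization_factor(theta, scale) * sum(w * w for w in weights)

# With theta = 0 the factor sits at the midpoint of its range.
penalty = l2_penalty([1.0, 2.0], theta=0.0)
```

During training, `theta` would receive gradients through this penalty just like the network weights do, which is what makes the factor adaptive; bounding it with the hard sigmoid keeps it non-negative and below the chosen scale.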

Building Computationally Efficient and Well-Generalizing Person Re-Identification Models with Metric Learning

Vladislav Sovrasov, Dmitry Sidnev

Auto-TLDR; Cross-Domain Generalization in Person Re-identification using Omni-Scale Network

This work considers the problem of domain shift in person re-identification. Being trained on one dataset, a re-identification model usually performs much worse on unseen data. This gap is partially caused by the relatively small scale of person re-identification datasets (compared to face recognition ones, for instance), but it is also related to training objectives. We propose to use a metric learning objective, namely AM-Softmax loss, and some additional training practices to build well-generalizing yet computationally efficient models. We use the recently proposed Omni-Scale Network (OSNet) architecture combined with several training tricks and architecture adjustments to obtain state-of-the-art results in the cross-domain generalization problem on the large-scale MSMT17 dataset in three setups: MSMT17-all->DukeMTMC, MSMT17-train->Market1501 and MSMT17-all->Market1501.
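As background on the AM-Softmax objective named in this abstract, here is a minimal per-sample sketch. It takes the cosine similarities between a normalized embedding and each normalized class weight vector, subtracts an additive margin from the target-class cosine, and applies a scaled softmax cross-entropy. The defaults `scale=30.0` and `margin=0.35` are commonly used values, not settings reported in the abstract, and the function name is hypothetical.

```python
import math

def am_softmax_loss(cosines, label, scale=30.0, margin=0.35):
    """AM-Softmax loss for one sample.

    cosines: cosine similarity to each class (one value per class).
    label:   index of the ground-truth class.
    """
    # The additive margin is applied to the target-class cosine only.
    logits = [scale * (c - margin) if j == label else scale * c
              for j, c in enumerate(cosines)]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

# The margin makes the same prediction count as "less correct",
# so the loss with a margin is larger than without one.
with_margin = am_softmax_loss([0.8, 0.1, -0.2], 0)
without_margin = am_softmax_loss([0.8, 0.1, -0.2], 0, margin=0.0)
```

Because the margin is enforced on cosine similarity, the loss pushes embeddings of the same identity into tighter angular clusters, which is the metric-learning property the abstract relies on for generalization.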

Self and Channel Attention Network for Person Re-Identification

Asad Munir, Niki Martinel, Christian Micheloni

Auto-TLDR; SCAN: Self and Channel Attention Network for Person Re-identification

Recent research has shown promising results for person re-identification by focusing on several trends. One is designing efficient metric learning loss functions, such as the triplet loss family, to learn the most discriminative representations. The other is learning local features by designing part-based architectures to form an informative descriptor from semantically coherent parts. Some efforts adjust distant outliers to their most similar positions by using soft attention and learn the relationship between distant similar features. However, only a few prior efforts focus on channel-wise dependencies and learn non-local sharp similar part features directly for the degraded data in the person re-identification task. In this paper, we propose a novel Self and Channel Attention Network (SCAN) to model long-range dependencies between channels and feature maps. We add multiple classifiers to learn discriminative global features by using classification loss. A Self Attention (SA) module and a Channel Attention (CA) module are introduced to model non-local and channel-wise dependencies in the learned features. Spectral normalization is applied to the whole network to stabilize the training process. Experimental results on the person re-identification benchmarks show that the proposed components achieve significant improvement with respect to the baseline.

Loop-closure detection by LiDAR scan re-identification

Jukka Peltomäki, Xingyang Ni, Jussi Puura, Joni-Kristian Kamarainen, Heikki Juhani Huttunen

Auto-TLDR; Loop-Closing Detection from LiDAR Scans Using Convolutional Neural Networks

In this work, loop-closure detection from LiDAR scans is defined as an image re-identification problem. Re-identification is performed by computing Euclidean distances of a query scan to a gallery set of previous scans. The distances are computed in a feature embedding space into which the scans are mapped by a convolutional neural network (CNN). The network is trained using the triplet loss training strategy. In our experiments, we compare different backbone networks, variants of the triplet loss, and generic and LiDAR-specific data augmentation techniques. With a realistic indoor dataset, the best architecture obtains a mean average precision (mAP) above 90%.
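The retrieval step described in this abstract (rank gallery scans by Euclidean distance to the query in embedding space, then score the ranking) can be sketched as follows. The embeddings here are small hand-written vectors standing in for CNN outputs, and the function names and place labels are hypothetical; mAP is the mean of the per-query average precision shown.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_gallery(query, gallery):
    """Return gallery indices sorted from nearest to farthest."""
    return sorted(range(len(gallery)), key=lambda i: euclidean(query, gallery[i]))

def average_precision(ranked_ids, query_id):
    """Average precision of one ranked list of gallery identities."""
    hits = 0
    precisions = []
    for rank, gallery_id in enumerate(ranked_ids, start=1):
        if gallery_id == query_id:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0

# Toy gallery: two scans of place "a" and one scan of place "b".
gallery = [[0.0, 0.0], [5.0, 5.0], [1.0, 0.0]]
gallery_ids = ["a", "b", "a"]
order = rank_gallery([0.1, 0.0], gallery)
ap = average_precision([gallery_ids[i] for i in order], "a")
```

In this toy case both scans of the query's place are ranked ahead of the other place, so the average precision reaches its maximum value of 1.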

Online Domain Adaptation for Person Re-Identification with a Human in the Loop

Rita Delussu, Lorenzo Putzu, Giorgio Fumera, Fabio Roli

Auto-TLDR; Human-in-the-loop for Person Re-Identification in Infeasible Applications

Supervised deep learning methods have recently achieved remarkable performance in person re-identification. Unsupervised domain adaptation (UDA) approaches have also been proposed for application scenarios where only unlabelled data are available from target camera views. We consider a more challenging scenario when even collecting a suitable amount of representative, unlabelled target data for offline training or fine-tuning is infeasible. In this context we revisit the human-in-the-loop (HITL) approach, which exploits online the operator's feedback on a small amount of target data. We argue that HITL is a kind of online domain adaptation specifically suited to person re-identification. We then reconsider relevance feedback methods for content-based image retrieval that are computationally much cheaper than state-of-the-art HITL methods for person re-identification, and devise a specific feedback protocol for them. Experimental results show that HITL can achieve comparable or better performance than UDA, and is therefore a valid alternative when the lack of unlabelled target data makes UDA infeasible.

Progressive Unsupervised Domain Adaptation for Image-Based Person Re-Identification

Mingliang Yang, Da Huang, Jing Zhao

Auto-TLDR; Progressive Unsupervised Domain Adaptation for Person Re-Identification

Unsupervised domain adaptation (UDA) has emerged as an effective paradigm for reducing the huge manual annotation cost for Person Re-Identification (Re-ID). Many of the recent UDA methods for Re-ID are clustering-based and select all the pseudo-label samples in each iteration for model training. However, under this scheme many wrongly labeled samples mislead the model optimization. To solve this problem, we propose a Progressive Unsupervised Domain Adaptation (PUDA) framework for image-based Person Re-ID to reduce the negative effect of wrong pseudo-label samples on the model training process. Specifically, we first pretrain a CNN model on a labeled source dataset, then fine-tune the model on the unlabeled target dataset by iterating the following three steps: 1) estimating pseudo-labels for all the images in the target dataset with the model trained in the last iteration; 2) extending the training set by adding pseudo-label samples with higher label confidence; 3) updating the CNN model with the expanded training set in a supervised manner. During the iteration process, the number of pseudo-label samples added increases progressively. In particular, a Moderate Initial Selections (MIS) strategy for pseudo-label sampling is also proposed to reduce the negative impacts of random noise features in the early iterations and mislabeled samples in the late iterations on the model. The proposed framework with the MIS strategy is validated on the Duke-to-Market and Market-to-Duke unsupervised domain adaptation tasks and achieves improvements of 4.2 points (absolute, i.e., 80.0% vs. 75.8%) and 1.7 points (absolute, i.e., 70.7% vs. 69.0%) in mAP, respectively.
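Step 2 of the loop above (adding only the more confident pseudo-label samples, with the number growing over iterations) can be illustrated with a small sketch. The abstract does not define the confidence measure, the growth schedule or the MIS strategy, so the linear schedule, the `initial` fraction and the tuple layout `(sample_id, pseudo_label, confidence)` are all hypothetical choices for illustration.

```python
def selection_fraction(iteration, num_iterations, initial=0.3):
    """Fraction of pseudo-labeled samples to keep at a given iteration.

    Grows linearly from `initial` to 1.0; this schedule is an assumption.
    """
    if num_iterations <= 1:
        return 1.0
    step = (1.0 - initial) / (num_iterations - 1)
    return min(1.0, initial + step * iteration)

def select_confident(pseudo_labeled, fraction):
    """Keep the most confident `fraction` of samples (at least one).

    pseudo_labeled: list of (sample_id, pseudo_label, confidence) tuples.
    """
    keep = max(1, int(fraction * len(pseudo_labeled)))
    ranked = sorted(pseudo_labeled, key=lambda item: item[2], reverse=True)
    return ranked[:keep]

# Early iterations train on few, high-confidence samples; later ones on more.
samples = [("img1", 0, 0.9), ("img2", 1, 0.2), ("img3", 0, 0.7), ("img4", 1, 0.4)]
early = select_confident(samples, selection_fraction(0, 3, initial=0.5))
late = select_confident(samples, selection_fraction(2, 3, initial=0.5))
```

Starting from a moderate rather than a tiny or full selection is in the spirit of the MIS idea mentioned in the abstract, though the actual selection rule is not given there.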

RGB-Infrared Person Re-Identification Via Image Modality Conversion

Huangpeng Dai, Qing Xie, Yanchun Ma, Yongjian Liu, Shengwu Xiong

Auto-TLDR; CE2L: A Novel Network for Cross-Modality Re-identification with Feature Alignment

As a cross-modality retrieval task, RGB-infrared person re-identification (Re-ID) is an important and challenging task because of its important role in video surveillance applications and the large cross-modality variations between visible and infrared images. Most previous works addressed the cross-modality gap with feature alignment by learning directly from the original feature representations. In this paper, unlike existing works, we propose a novel network (CE2L) to tackle the cross-modality gap with feature alignment. CE2L mainly focuses on adding discriminative information and learning robust features by converting modality between visible and infrared images. Its merits are highlighted in two aspects: 1) using CycleGAN to convert infrared images into color images not only increases the recognition characteristics of the images, but also allows our network to better learn the image features of the two modalities; 2) our method can serve as data augmentation. Specifically, it can increase data diversity and the total amount of data to counter over-fitting by converting labeled training images into images of the other modality. Extensive experimental results on two datasets demonstrate superior performance compared to the baseline and the state-of-the-art methods.

Attentive Part-Aware Networks for Partial Person Re-Identification

Lijuan Huo, Chunfeng Song, Zhengyi Liu, Zhaoxiang Zhang

Auto-TLDR; Part-Aware Learning for Partial Person Re-identification

Partial person re-identification (re-ID) refers to re-identifying a person from occluded images. It suffers from two major challenges, i.e., insufficient training data and incomplete probe images. In this paper, we introduce an automatic data augmentation module and a part-aware learning method for partial re-identification. On the one hand, we adopt data augmentation to enhance the training data and help learn more stable partial features. On the other hand, we observe that partial person images usually contain fixed percentages of parts; therefore, in the partial person re-ID task, the probe image can be cropped from the pictures and divided into several different partial types following fixed ratios. Based on the cropped images, we propose the Cropping Type Consistency (CTC) loss to classify the cropping types of partial images. Moreover, in order to help the network better fit the generated and cropped data, we incorporate the Block Attention Mechanism (BAM) into the framework for attentive learning. To enhance the retrieval performance in the inference stage, we crop gallery images according to the predicted types of the partial probe images. By calculating feature distances between the partial image and the cropped holistic gallery images, we can recognize the right person from the gallery. To validate the effectiveness of our approach, we conduct extensive experiments on the partial re-ID benchmarks and achieve state-of-the-art performance.

How Important Are Faces for Person Re-Identification?

Julia Dietlmeier, Joseph Antony, Kevin Mcguinness, Noel E O'Connor

Auto-TLDR; Anonymization of Person Re-identification Datasets with Face Detection and Blurring

This paper investigates the dependence of existing state-of-the-art person re-identification models on the presence and visibility of human faces. We apply a face detection and blurring algorithm to create anonymized versions of several popular person re-identification datasets including Market1501, DukeMTMC-reID, CUHK03, Viper, and Airport. Using a cross-section of existing state-of-the-art models that range in accuracy and computational efficiency, we evaluate the effect of this anonymization on re-identification performance using standard metrics. Perhaps surprisingly, the effect on mAP is very small, and accuracy is recovered by simply training on the anonymized versions of the data rather than the original data. These findings are consistent across multiple models and datasets. These results indicate that datasets can be safely anonymized by blurring faces without significantly impacting the performance of person re-identification systems, and may allow for the release of new richer re-identification datasets where previously there were privacy or data protection concerns.

Pose Variation Adaptation for Person Re-Identification

Lei Zhang, Na Jiang, Qishuai Diao, Yue Xu, Zhong Zhou, Wei Wu

Auto-TLDR; Pose Transfer Generative Adversarial Network for Person Re-identification

Person re-identification (reid) plays an important role in surveillance video analysis, especially for criminal investigation and intelligent security. Although a large number of effective feature or distance metric learning approaches have been proposed, it still suffers from pedestrian appearance variations caused by pose changes. Most previous methods address this problem by learning a pose-invariant descriptor subspace. In this paper, we propose a pose variation adaptation method for person reid from the viewpoint of data augmentation, which reduces the probability of the deep learning network over-fitting. Specifically, we introduce a pose transfer generative adversarial network with a similarity measurement constraint. With the learned pose transfer model, training images can be pose-transferred to any given pose and, along with the original images, form an augmented training dataset. This increases data diversity against over-fitting. In contrast to previous GAN-based methods, we consider the influence of pose variations on the similarity measure to generate more realistic and sharper samples for person reid. Besides, we optimize hard example mining to introduce a novel way of using samples (pose-transferred images) generated with the learned pose transfer model. It focuses on the inferior samples caused by pose variations to increase the number of effective hard examples for learning discriminative features and to improve the generalization ability. We conduct extensive comparative evaluations to demonstrate the advantages of our proposed method over state-of-the-art approaches on Market-1501 and DukeMTMC-reID; the rank-1 accuracy is 96.1% for Market-1501 and 92.0% for DukeMTMC-reID.

Attention-Based Model with Attribute Classification for Cross-Domain Person Re-Identification

Simin Xu, Lingkun Luo, Shiqiang Hu

Auto-TLDR; An attention-based model with attribute classification for cross-domain person re-identification

Person re-identification (re-ID), which aims to recognize a pedestrian observed by non-overlapping cameras, is a challenging task due to the high variance between images from different viewpoints. Although remarkable progress in re-ID research has been achieved by leveraging deep learning frameworks trained on large amounts of well-labeled data, in real scenarios re-ID generally suffers from a lack of well-labeled training data. In this paper, we propose an attention-based model with attribute classification (AMAC) to facilitate transferring a well-trained model across different data domains, which further enables efficient cross-domain video-based person re-ID. Specifically, an attention-based sub-network is proposed to gain insight into the quality variations of local parts; different local parts are thus combined with different weights to avoid heavy occlusions or cluttered backgrounds in the datasets. Moreover, we introduce a transferred attribute classification sub-network to extract attribute-semantic features of any new target dataset without requiring new training attribute labels, which are costly to annotate. Attribute-semantic features can be considered valuable complementary information for person re-identification since they are robust to illumination variations and different viewpoints across cameras. Due to the large gap between different datasets, we fine-tune each sub-network with pseudo labels on the target datasets to strengthen the original model trained on other labeled datasets. Extensive comparative evaluations demonstrate the superiority of our AMAC in solving the cross-domain person re-ID task on two benchmarks, PRID-2011 and iLIDS-VID.

Attention-Based Deep Metric Learning for Near-Duplicate Video Retrieval

Kuan-Hsun Wang, Chia Chun Cheng, Yi-Ling Chen, Yale Song, Shang-Hong Lai

Auto-TLDR; Attention-based Deep Metric Learning for Near-duplicate Video Retrieval

Near-duplicate video retrieval (NDVR) is an important and challenging problem due to the increasing amount of videos uploaded to the Internet. In this paper, we propose an attention-based deep metric learning method for NDVR. Our method is based on well-established principles: we leverage two-stream networks to combine RGB and optical flow features, and incorporate an attention module to effectively deal with distractor frames commonly observed in near-duplicate videos. We further aggregate the features corresponding to multiple video segments to enhance the discriminative power. The whole system is trained using a deep metric learning objective with a Siamese architecture. Our experiments show that the attention module helps eliminate redundant and noisy frames, while focusing on visually relevant frames for solving NDVR. We evaluate our approach on recent large-scale NDVR datasets, CC_WEB_VIDEO, VCDB, FIVR and SVD. To demonstrate the generalization ability of our approach, we report results in both within- and cross-dataset settings, and show that the proposed method significantly outperforms state-of-the-art approaches.

CANU-ReID: A Conditional Adversarial Network for Unsupervised Person Re-IDentification

Guillaume Delorme, Yihong Xu, Stéphane Lathuiliere, Radu Horaud, Xavier Alameda-Pineda

Auto-TLDR; Unsupervised Person Re-Identification with Clustering and Adversarial Learning

Unsupervised person re-ID is the task of identifying people on a target data set for which the ID labels are unavailable during training. In this paper, we propose to unify two trends in unsupervised person re-ID: clustering & fine-tuning and adversarial learning. On one side, clustering groups training images into pseudo-ID labels, and uses them to fine-tune the feature extractor. On the other side, adversarial learning is used, inspired by domain adaptation, to match distributions from different domains. Since target data is distributed across different camera viewpoints, we propose to model each camera as an independent domain, and aim to learn domain-independent features. Straightforward adversarial learning yields negative transfer; we thus introduce a conditioning vector to mitigate this undesirable effect. In our framework, the centroid of the cluster to which the visual sample belongs is used as the conditioning vector of our conditional adversarial network, where the vector is permutation invariant (cluster ordering does not matter) and its size is independent of the number of clusters. To our knowledge, we are the first to propose the use of conditional adversarial networks for unsupervised person re-ID. We evaluate the proposed architecture on top of two state-of-the-art clustering-based unsupervised person re-identification (re-ID) methods on four different experimental settings with three different data sets and set the new state-of-the-art performance on all four of them. Our code and model will be made publicly available at https://team.inria.fr/perception/canu-reid/.

Open-World Group Retrieval with Ambiguity Removal: A Benchmark

Ling Mei, Jian-Huang Lai, Zhanxiang Feng, Xiaohua Xie

Auto-TLDR; P2GSM-AR: Re-identifying changing groups of people under the open-world and group-ambiguity scenarios

Group retrieval has attracted plenty of attention in artificial intelligence. Traditional group retrieval research assumes that members in a group are unique and do not change under different cameras. However, the assumption may not be met in practical situations such as open-world and group-ambiguity scenarios. This paper tackles an important yet unstudied problem: re-identifying changing groups of people under the open-world and group-ambiguity scenarios in different camera fields. The open-world scenario considers that people who are not targets of the probe set may appear in the search gallery, while the group-ambiguity scenario means that the group members may change. The open-world and group-ambiguity issue is very challenging for existing methods because changes in group membership result in dramatic visual variations. Nevertheless, as far as we know, the existing literature lacks benchmarks that target this issue. In this paper, we propose a new group retrieval dataset named OWGA-Campus to consider these challenges. Moreover, we propose a person-to-group similarity matching based ambiguity removal (P2GSM-AR) method to solve these problems and realize the intention of group retrieval. Experimental results on the OWGA-Campus dataset demonstrate the effectiveness and robustness of the proposed P2GSM-AR approach in improving the performance of state-of-the-art person re-id feature extraction methods on the open-world and ambiguous group retrieval task.

Nonlinear Ranking Loss on Riemannian Potato Embedding

Byung Hyung Kim, Yoonje Suh, Honggu Lee, Sungho Jo

Auto-TLDR; Riemannian Potato for Rank-based Metric Learning

We propose a rank-based metric learning method that leverages the concept of the Riemannian Potato to better separate non-linear data. By exploring the geometric properties of Riemannian manifolds, the proposed loss function optimizes the measure of dispersion using the distribution of Riemannian distances between a reference sample and its neighbors, and builds a ranked list according to the similarities. We show that the proposed function can learn a hypersphere for each class, preserving the similarity structure inside it on the Riemannian manifold. As a result, compared with Euclidean distance-based metrics, our method can further jointly reduce the intra-class distances and enlarge the inter-class distances for learned features, consistently outperforming state-of-the-art methods on three widely used non-linear datasets.

Recurrent Deep Attention Network for Person Re-Identification

Changhao Wang, Jun Zhou, Xianfei Duan, Guanwen Zhang, Wei Zhou

Auto-TLDR; Recurrent Deep Attention Network for Person Re-identification

Person re-identification (re-id) is an important task in video surveillance. It is challenging because a person's appearance varies widely across non-overlapping camera views. In recent years, attention-based models have been introduced to learn discriminative representations. In this paper, we consider attention selection in a natural way, similar to how humans move their attention across different parts of the visual field, for person re-id. Concretely, we propose a Recurrent Deep Attention Network (RDAN) with an attention selection mechanism based on reinforcement learning. The RDAN aims to adaptively observe the identity-sensitive regions to build up the representation of individuals step by step. Extensive experiments on three person re-id benchmarks, Market-1501, DukeMTMC-reID and CUHK03-NP, demonstrate that the proposed method can achieve competitive performance.

Learning Embeddings for Image Clustering: An Empirical Study of Triplet Loss Approaches

Kalun Ho, Janis Keuper, Franz-Josef Pfreundt, Margret Keuper

Auto-TLDR; Clustering Objectives for K-means and Correlation Clustering Using Triplet Loss

In this work, we evaluate two different image clustering objectives, k-means clustering and correlation clustering, in the context of Triplet Loss induced feature space embeddings. Specifically, we train a convolutional neural network to learn discriminative features by optimizing two popular versions of the Triplet Loss in order to study their clustering properties under the assumption of noisy labels. Additionally, we propose a new, simple Triplet Loss formulation, which shows desirable properties with respect to formal clustering objectives and outperforms the existing methods. We evaluate all three Triplet loss formulations for K-means and correlation clustering on the CIFAR-10 image classification dataset.
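
For readers unfamiliar with the objective, the following is a minimal sketch of the standard margin-based triplet loss. It is not one of the specific formulations compared in the paper (the abstract does not spell them out); the function names, the margin value and the toy embeddings are assumptions chosen for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard hinge-style triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Toy 2-D embeddings (made-up values for illustration only).
anchor = [0.0, 0.0]
positive = [0.0, 1.0]   # same class as the anchor, distance 1
negative = [3.0, 4.0]   # different class, distance 5

# The negative is already farther than the positive by more than the margin,
# so this triplet contributes no loss.
easy = triplet_loss(anchor, positive, negative)

# Swapping the roles makes the "positive" far and the "negative" near,
# which violates the margin and yields a positive loss.
hard = triplet_loss(anchor, negative, positive)
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, which is why triplet selection matters in practice.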

Polynomial Universal Adversarial Perturbations for Person Re-Identification

Wenjie Ding, Xing Wei, Rongrong Ji, Xiaopeng Hong, Yihong Gong

Auto-TLDR; Polynomial Universal Adversarial Perturbation for Re-identification Methods

In this paper, we focus on Universal Adversarial Perturbations (UAP) attack on state-of-the-art person re-identification (Re-ID) methods. Existing UAP methods usually compute a perturbation image and add it to the images of interest. Such a simple constant form greatly limits the attack power. To address this problem, we extend the formulation of UAP to a polynomial form and propose the Polynomial Universal Adversarial Perturbation (PUAP). Unlike traditional UAP methods which only rely on the additive perturbation signal, the proposed PUAP consists of both an additive perturbation and a multiplicative modulation factor. The additive perturbation produces the fundamental component of the signal, while the multiplicative factor modulates the perturbation signal in line with the unit impulse pattern of the input image. Moreover, we design a Pearson correlation coefficient loss to generate universal perturbations, for disrupting the outputs of person Re-ID methods. Extensive experiments on DukeMTMC-ReID, Market-1501, and MARS show that the proposed method can efficiently improve the attack performance, especially when the magnitude of UAP is constrained to a small value.
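
The Pearson correlation coefficient underlying the loss mentioned above can be sketched as follows. The abstract does not state which quantities are correlated; comparing a clean feature vector with a perturbed one is an assumption made here for illustration, and the function name and toy values are hypothetical.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    std_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    std_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (std_u * std_v)

# Made-up feature vectors: an attack that disrupts a model's output would
# aim to push the correlation between clean and perturbed features down.
clean = [1.0, 2.0, 3.0]
barely_changed = [2.0, 4.0, 6.0]   # perfectly correlated with `clean`
disrupted = [3.0, 2.0, 1.0]        # perfectly anti-correlated with `clean`

corr_same = pearson(clean, barely_changed)
corr_flipped = pearson(clean, disrupted)
```

The coefficient ranges from -1 to 1 and is invariant to positive scaling and shifting of either vector, so it measures agreement in pattern rather than in magnitude.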

Self-Paced Bottom-Up Clustering Network with Side Information for Person Re-Identification

Mingkun Li, Chun-Guang Li, Ruo-Pei Guo, Jun Guo

Auto-TLDR; Self-Paced Bottom-up Clustering Network with Side Information for Unsupervised Person Re-identification

Person re-identification (Re-ID) has attracted a lot of research attention in recent years. However, supervised methods demand an enormous amount of manually annotated data. In this paper, we propose a Self-Paced bottom-up Clustering Network with Side Information (SPCNet-SI) for unsupervised person Re-ID, where the side information comes from the serial number of the camera associated with each image. Specifically, our proposed SPCNet-SI exploits the camera side information to guide the feature learning and uses soft labels in the bottom-up clustering process, in which the camera association information is used in the repelled loss and the soft-label-based cluster information is used to select the candidate cluster pairs to merge. Moreover, a self-paced dynamic mechanism is developed to regularize the merging process such that the clustering is implemented in an easy-to-hard way with a slow-to-fast merging process. Experiments on two benchmark datasets, Market-1501 and DukeMTMC-ReID, demonstrate promising performance.

Multi-Scale Cascading Network with Compact Feature Learning for RGB-Infrared Person Re-Identification

Can Zhang, Hong Liu, Wei Guo, Mang Ye

Auto-TLDR; Multi-Scale Part-Aware Cascading for RGB-Infrared Person Re-identification

RGB-Infrared person re-identification (RGB-IR Re-ID) aims to match persons from heterogeneous images captured by visible and thermal cameras, which is of great significance in surveillance systems under poor light conditions. Facing great challenges in complex variances, including conventional single-modality and additional inter-modality discrepancies, most existing RGB-IR Re-ID methods directly work on global features for simultaneous elimination, whereas modality-specific noise and modality-shared features are not well considered. To address these issues, a novel Multi-Scale Part-Aware Cascading framework (MSPAC) is formulated by aggregating multi-scale fine-grained features from part to global in a cascading manner, which results in a unified representation robust to noise. Moreover, a marginal exponential center (MeCen) loss is introduced to jointly eliminate mixed variances, which enables modeling of cross-modality correlations on sharable salient features. Extensive experiments demonstrate that the proposed method outperforms all state-of-the-art methods by a large margin.

Rotation Invariant Aerial Image Retrieval with Group Convolutional Metric Learning

Hyunseung Chung, Woo-Jeoung Nam, Seong-Whan Lee

Auto-TLDR; Robust Remote Sensing Image Retrieval Using Group Convolution with Attention Mechanism and Metric Learning

Remote sensing image retrieval (RSIR) is the process of ranking database images depending on the degree of similarity compared to the query image. As the complexity of RSIR increases due to the diversity in shooting range, angle, and location of remote sensors, there is an increasing demand for methods to address these issues and improve retrieval performance. In this work, we introduce a novel method for retrieving aerial images by merging group convolution with attention mechanism and metric learning, resulting in robustness to rotational variations. For refinement and emphasis on important features, we applied channel attention in each group convolution stage. By utilizing the characteristics of group convolution and channel-wise attention, it is possible to acknowledge the equality among rotated but identically located images. The training procedure has two main steps: (i) training the network with Aerial Image Dataset (AID) for classification, (ii) fine-tuning the network with triplet-loss for retrieval with Google Earth South Korea and NWPU-RESISC45 datasets. Results show that the proposed method performance exceeds other state-of-the-art retrieval methods in both rotated and original environments. Furthermore, we utilize class activation maps (CAM) to visualize the distinct difference of main features between our method and baseline, resulting in better adaptability in rotated environments.

DAIL: Dataset-Aware and Invariant Learning for Face Recognition

Gaoang Wang, Chen Lin, Tianqiang Liu, Mingwei He, Jiebo Luo

Auto-TLDR; DAIL: Dataset-Aware and Invariant Learning for Face Recognition

To achieve good performance in face recognition, a large-scale training dataset is usually required. A simple yet effective way of improving the recognition performance is to use a dataset as large as possible by combining multiple datasets in the training. However, it is problematic and troublesome to naively combine different datasets due to two major issues. Firstly, the same person can possibly appear in different datasets, leading to an identity overlapping issue between different datasets. Naively treating the same person as different classes in different datasets during training will affect back-propagation and generate non-representative embeddings. On the other hand, manually cleaning labels will take a lot of human effort, especially when there are millions of images and thousands of identities. Secondly, different datasets are collected in different situations and thus will lead to different domain distributions. Naively combining datasets will lead to domain distribution differences and make it difficult to learn domain-invariant embeddings across different datasets. In this paper, we propose DAIL: Dataset-Aware and Invariant Learning to resolve the above-mentioned issues. To solve the first issue of identity overlapping, we propose a dataset-aware loss for multi-dataset training that reduces the penalty when the same person appears in multiple datasets. This can be readily achieved with a modified softmax loss with a dataset-aware term. To solve the second issue, domain adaptation with gradient reversal layers is employed for dataset-invariant learning. The proposed approach not only achieves state-of-the-art results on several commonly used face recognition validation sets, such as LFW, CFP-FP and AgeDB-30, but also shows great benefit for practical usage.

SSDL: Self-Supervised Domain Learning for Improved Face Recognition

Samadhi Poornima Kumarasinghe Wickrama Arachchilage, Ebroul Izquierdo

Auto-TLDR; Self-supervised Domain Learning for Face Recognition in unconstrained environments

Face recognition in unconstrained environments is challenging due to variations in illumination, quality of sensing, motion blur, etc. An individual’s face appearance can vary drastically under different conditions, creating a gap between train (source) and varying test (target) data. The domain gap could cause decreased performance levels in direct knowledge transfer from source to target. Although fine-tuning with domain-specific data could be an effective solution, collecting and annotating data for all domains is extremely expensive. To this end, we propose a self-supervised domain learning (SSDL) scheme that trains on triplets mined from unlabelled data. A key factor in effective discriminative learning is selecting informative triplets. Building on the most confident predictions, we follow an “easy-to-hard” scheme of alternate triplet mining and self-learning. Comprehensive experiments on four different benchmarks show that SSDL generalizes well on different domains.

A Duplex Spatiotemporal Filtering Network for Video-Based Person Re-Identification

Chong Zheng, Ping Wei, Nanning Zheng

Auto-TLDR; Duplex Spatiotemporal Filtering Network for Person Re-identification in Videos

Video-based person re-identification plays an important role in surveillance video analysis. This paper proposes a novel Duplex Spatiotemporal Filtering Network (DSFN) to re-identify persons in videos. A video sequence is represented as a duplex spatiotemporal matrix. The DSFN model, which contains a group of filters, performs filtering at the feature level in both temporal and spatial dimensions, so that the model focuses on feature-level semantic information rather than image-level information as in traditional filters. We propose sparse-orthogonal constraints to force the model to extract more discriminative features. DSFN characterizes not only the appearance features but also dynamic information such as gaits embedded in video sequences, and obtains better performance as a result. Experiments show that the proposed method outperforms state-of-the-art approaches.

3D Facial Matching by Spiral Convolutional Metric Learning and a Biometric Fusion-Net of Demographic Properties

Soha Sadat Mahdi, Nele Nauwelaers, Philip Joris, Giorgos Bouritsas, Imperial London, Sergiy Bokhnyak, Susan Walsh, Mark Shriver, Michael Bronstein, Peter Claes

Auto-TLDR; Multi-biometric Fusion for Biometric Verification using 3D Facial Measures

Face recognition is a widely accepted biometric verification tool, as the face contains a lot of information about the identity of a person. In this study, a 2-step neural-based pipeline is presented for matching 3D facial shape to multiple DNA-related properties (sex, age, BMI and genomic background). The first step consists of a triplet loss-based metric learner that compresses facial shape into a lower dimensional embedding while preserving information about the property of interest. Most studies in the field of metric learning have only focused on Euclidean data. In this work, geometric deep learning is employed to learn directly from 3D facial meshes. To this end, spiral convolutions are used along with a novel mesh-sampling scheme that retains uniformly sampled 3D points at different levels of resolution. The second step is a multi-biometric fusion by a fully connected neural network. The network takes an ensemble of embeddings and property labels as input and returns genuine and imposter scores. Since embeddings are accepted as an input, there is no need to train classifiers for the different properties and available data can be used more efficiently. Results obtained by a 10-fold cross-validation for biometric verification show that combining multiple properties leads to stronger biometric systems. Furthermore, the proposed neural-based pipeline outperforms a linear baseline, which consists of principal component analysis, followed by classification with linear support vector machines and a Naïve Bayes-based score-fuser.

G-FAN: Graph-Based Feature Aggregation Network for Video Face Recognition

He Zhao, Yongjie Shi, Xin Tong, Jingsi Wen, Xianghua Ying, Jinshi Hongbin Zha

Auto-TLDR; Graph-based Feature Aggregation Network for Video Face Recognition

In this paper, we propose a graph-based feature aggregation network (G-FAN) for video face recognition. Compared with the still image, video face recognition exhibits great challenges due to huge intra-class variability and high inter-class ambiguity. To address this problem, our G-FAN first uses a Convolutional Neural Network to extract deep features for every input face of a subject. Then, we build an affinity graph based on the relation between facial features and apply Graph Convolutional Network to generate fine-grained quality vectors for each frame. Finally, the features among multiple frames are adaptively aggregated into a discriminative vector to represent a video face. Different from previous works that take a single image as input, our G-FAN could utilize the correlation information between image pairs and aggregate a template of faces simultaneously. The experiments on video face recognition benchmarks, including YTF, IJB-A, and IJB-C show that: (i) G-FAN automatically learns to advocate high-quality frames while repelling low-quality ones. (ii) G-FAN significantly boosts recognition accuracy and outperforms other state-of-the-art aggregation methods.
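
The aggregation step can be illustrated with a simplified sketch: per-frame features are combined into one video-level vector using softmax-normalized quality weights. This assumes one scalar quality score per frame, whereas the abstract describes fine-grained quality vectors produced by a graph convolutional network; the scores, names and values below are hypothetical placeholders, not the G-FAN implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(frame_features, quality_scores):
    """Weighted average of frame features, with weights from softmax(quality)."""
    weights = softmax(quality_scores)
    dim = len(frame_features[0])
    return [
        sum(w * feat[d] for w, feat in zip(weights, frame_features))
        for d in range(dim)
    ]

# Three toy 2-D frame features (made-up values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal scores reduce to plain averaging.
uniform = aggregate(frames, [0.0, 0.0, 0.0])

# A much higher score for the first frame makes it dominate the result.
sharp = aggregate(frames, [10.0, 0.0, 0.0])
```

With equal scores the result is the mean of the frames; as one score grows, the aggregate approaches that frame's feature, which mirrors the idea of favoring high-quality frames over low-quality ones.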

Lookalike Disambiguation: Improving Face Identification Performance at Top Ranks

Thomas Swearingen, Arun Ross

Auto-TLDR; Lookalike Face Identification Using a Disambiguator for Lookalike Images

A face identification system compares an unknown input probe image to a gallery of face images labeled with identities in order to determine the identity of the probe image. The result of identification is a ranked match list with the most similar gallery face image at the top (rank 1) and the least similar gallery face image at the bottom. In many systems, the top-ranked gallery images may look very similar to the probe image as well as to each other, which can sometimes result in the misidentification of the probe image. Such similar-looking faces pertaining to different identities are referred to as lookalike faces. We hypothesize that a matcher specifically trained to disambiguate lookalike face images and combined with a regular face matcher may improve overall identification performance. This work proposes re-ranking the initial ranked match list using a disambiguator designed especially for lookalike face pairs. This work also evaluates schemes to select gallery images in the initial ranked match list that should be re-ranked. Experiments on the challenging TinyFace dataset show that the proposed approach improves the closed-set identification accuracy of a state-of-the-art face matcher.
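
The re-ranking idea can be sketched as follows, assuming the simplest possible selection scheme: only the top-k entries of the initial match list are re-ordered by a second scorer, and the tail is left untouched. The paper evaluates its own selection schemes and combines the two matchers, neither of which is specified in the abstract; the fixed k, the score dictionaries and the function name are hypothetical.

```python
def rerank_top_k(ranked_ids, disambiguator_score, k=3):
    """Re-order only the first k gallery IDs by a second scorer; keep the rest."""
    head, tail = ranked_ids[:k], ranked_ids[k:]
    head = sorted(head, key=disambiguator_score, reverse=True)
    return head + tail

# Made-up similarity scores from a regular face matcher.
matcher_scores = {"g1": 0.91, "g2": 0.90, "g3": 0.89, "g4": 0.40}
initial = sorted(matcher_scores, key=matcher_scores.get, reverse=True)

# Made-up scores from a matcher specialised in lookalike pairs.
lookalike_scores = {"g1": 0.20, "g2": 0.80, "g3": 0.50, "g4": 0.99}

reranked = rerank_top_k(initial, lookalike_scores.get, k=3)
```

Here `g4` keeps its position despite its high second-stage score because it falls outside the re-ranked window, which is the design choice that restricts the disambiguator to near-ties at the top of the list.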

Temporally Coherent Embeddings for Self-Supervised Video Representation Learning

Joshua Knights, Ben Harwood, Daniel Ward, Anthony Vanderkop, Olivia Mackenzie-Ross, Peyman Moghadam

Auto-TLDR; Temporally Coherent Embeddings for Self-supervised Video Representation Learning

This paper presents TCE: Temporally Coherent Embeddings for self-supervised video representation learning. The proposed method exploits inherent structure of unlabeled video data to explicitly enforce temporal coherency in the embedding space, rather than indirectly learning it through ranking or predictive proxy tasks. In the same way that high-level visual information in the world changes smoothly, we believe that nearby frames in learned representations will benefit from demonstrating similar properties. Using this assumption, we train our TCE model to encode videos such that adjacent frames exist close to each other and videos are separated from one another. Using TCE we learn robust representations from large quantities of unlabeled video data. We thoroughly analyse and evaluate our self-supervised learned TCE models on a downstream task of video action recognition using multiple challenging benchmarks (Kinetics400, UCF101, HMDB51). With a simple but effective 2D-CNN backbone and only RGB stream inputs, TCE pre-trained representations outperform all previous self-supervised 2D-CNN and 3D-CNN trained on UCF101. The code and pre-trained models for this paper can be downloaded at: https://github.com/csiro-robotics/TCE

Domain Generalized Person Re-Identification Via Cross-Domain Episodic Learning

Ci-Siang Lin, Yuan Chia Cheng, Yu-Chiang Frank Wang

Auto-TLDR; Domain-Invariant Person Re-identification with Episodic Learning

Aiming at recognizing images of the same person across distinct camera views, person re-identification (re-ID) has been an active research topic in computer vision. Most existing re-ID works require the collection of a large amount of labeled image data from the scenes of interest. For cases where the data to be recognized differ from the source-domain training data, a number of domain adaptation approaches have been proposed. Nevertheless, one still needs to collect labeled or unlabelled target-domain data during training. In this paper, we tackle an even more challenging and practical setting, domain generalized (DG) person re-ID. That is, while a number of labeled source-domain datasets are available, we do not have access to any target-domain training data. In order to learn domain-invariant features without knowing the target domain of interest, we present an episodic learning scheme which advances meta-learning strategies to exploit the observed source-domain labeled data. The learned features would exhibit sufficient domain-invariant properties while not overfitting the source-domain data or ID labels. Our experiments on four benchmark datasets confirm the superiority of our method over the state of the art.

Convolutional Feature Transfer via Camera-Specific Discriminative Pooling for Person Re-Identification

Tetsu Matsukawa, Einoshin Suzuki

Auto-TLDR; A small-scale CNN feature transfer method for person re-identification

Modern Convolutional Neural Networks (CNNs) have been improving the accuracy of person re-identification (re-id) using a large number of training samples. Such a re-id system suffers from a lack of training samples when deployed to practical security applications. To address this problem, we focus on the approach that transfers CNN features pre-trained on a large-scale person re-id dataset to a small-scale dataset. Most ordinary CNN feature transfer methods use the features of fully connected layers that entangle locally pooled features of different spatial locations on an image. Unfortunately, due to the difference in view angles and the bias in walking directions of the persons, each camera view in a dataset has a unique spatial property in the person image, which reduces the generality of the local pooling for different cameras/datasets. To account for the camera- and dataset-specific spatial bias, we propose a method to learn camera- and dataset-specific position weight maps for discriminative local pooling of convolutional features. Our experiments on four public datasets confirm the effectiveness of the proposed feature transfer with a small number of training samples in the target datasets.

Unsupervised Domain Adaptation for Person Re-Identification through Source-Guided Pseudo-Labeling

Fabian Dubourvieux, Romaric Audigier, Angélique Loesch, Ainouz-Zemouche Samia, Stéphane Canu

Auto-TLDR; Pseudo-labeling for Unsupervised Domain Adaptation for Person Re-Identification

Person Re-Identification (re-ID) aims at retrieving images of the same person taken by different cameras. A challenge for re-ID is preserving performance when a model is used on data of interest (target data) which belong to a different domain from the training data domain (source data). Unsupervised Domain Adaptation (UDA) is an interesting research direction for this challenge as it avoids a costly annotation of the target data. Pseudo-labeling methods achieve the best results in UDA-based re-ID. They incrementally learn with identity pseudo-labels which are initialized by clustering features in the source re-ID encoder space. Surprisingly, labeled source data are discarded after this initialization step. However, we believe that pseudo-labeling could further leverage the labeled source data in order to improve the post-initialization training steps. In order to improve robustness against erroneous pseudo-labels, we advocate the exploitation of both labeled source data and pseudo-labeled target data during all training iterations. To support our guideline, we introduce a framework which relies on a two-branch architecture optimizing classification in source and target domains, respectively, in order to allow adaptability to the target domain while ensuring robustness to noisy pseudo-labels. Indeed, shared low- and mid-level parameters benefit from the source classification signal while high-level parameters of the target branch learn domain-specific features. Our method is simple enough to be easily combined with existing pseudo-labeling UDA approaches. We show experimentally that it is efficient and improves performance when the base method has no mechanism to deal with pseudo-label noise, and that it maintains performance when combined with a base method that already manages pseudo-label noise. Our approach reaches state-of-the-art performance when evaluated on commonly used datasets, Market-1501 and DukeMTMC-reID, and outperforms the state of the art when targeting the bigger and more challenging dataset MSMT.

Sample-Dependent Distance for 1 : N Identification Via Discriminative Feature Selection

Naoki Kawamura, Susumu Kubota

Auto-TLDR; Feature Selection Mask for 1:N Identification Problems with Binary Features

We focus on 1:N identification problems with binary features. Most multiclass classification methods, including identification and verification methods, use a shared metric space in which distances between samples are measured regardless of their identities. This is because dedicated metric spaces learned for each identity in the training set are of little use for the test set. In 1:N identification problems, however, gallery samples contain rich information about the test domain. Given a sample and its neighbors in the gallery set, we propose a method for calculating a discriminative feature selection mask that is used as a sample-dependent distance metric. Experiments on several re-identification datasets show that the proposed method enhances the performance of state-of-the-art feature extractors.
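
To make the notion of a mask-based, sample-dependent distance concrete, here is a minimal sketch of a masked Hamming distance over binary features. The abstract does not describe how the paper computes its mask; the `disagreement_mask` heuristic below (keep only the bits on which the gallery neighbors differ) is a placeholder assumption, not the proposed method, and all names and values are hypothetical.

```python
def masked_hamming(a, b, mask):
    """Hamming distance counted only over bit positions where mask == 1."""
    return sum(m & (x ^ y) for x, y, m in zip(a, b, mask))

def disagreement_mask(neighbors):
    """Placeholder heuristic: select bits whose value differs among neighbors."""
    n_bits = len(neighbors[0])
    return [1 if len({n[i] for n in neighbors}) > 1 else 0 for i in range(n_bits)]

# Gallery neighbors of a probe (made-up 4-bit binary features).
neighbors = [
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 1],
]
probe = [0, 0, 1, 0]

# Bits 0 and 3 are identical across neighbors, so they cannot tell them apart.
mask = disagreement_mask(neighbors)

full = [masked_hamming(probe, n, [1, 1, 1, 1]) for n in neighbors]
selected = [masked_hamming(probe, n, mask) for n in neighbors]
```

Because the mask is computed from the neighbors of each probe, the resulting distance differs from sample to sample, unlike a single metric space shared by all identities.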

Multi-Label Contrastive Focal Loss for Pedestrian Attribute Recognition

Xiaoqiang Zheng, Zhenxia Yu, Lin Chen, Fan Zhu, Shilong Wang

Auto-TLDR; Multi-label Contrastive Focal Loss for Pedestrian Attribute Recognition

Pedestrian Attribute Recognition (PAR) has received extensive attention during the past few years. With the advances of deep convolutional neural networks (CNNs), the performance of PAR has been significantly improved. Existing methods tend to acquire attribute-specific features by designing various complex network structures with additional modules. Such additional modules, however, dramatically increase the number of parameters. Meanwhile, the problems of class imbalance and hard attribute retrieving remain underestimated in PAR. In this paper, we explore the optimization mechanism of the training process to account for these problems and propose a new loss function called Multi-label Contrastive Focal Loss (MCFL). The proposed MCFL emphasizes the hard and minority attributes by using a separated re-weighting mechanism for different positive and negative classes to alleviate the impact of the imbalance. MCFL is also able to enlarge the gaps between the intra-class of multi-label attributes, to force CNNs to extract more subtle discriminative features. We evaluate the proposed MCFL on three large public pedestrian datasets, including RAP, PA-100K, and PETA. The experimental results indicate that the proposed MCFL with the ResNet-50 backbone is able to outperform other state-of-the-art approaches in comparison.
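
As background for the re-weighting idea, the sketch below implements the standard binary focal loss applied independently to each attribute, with separate weights for positive and negative labels. This is the well-known focal loss, not MCFL itself, whose contrastive term and exact weighting are not given in the abstract; the parameter names and the example probabilities are assumptions.

```python
import math

def focal_term(p, y, gamma=2.0, alpha_pos=1.0, alpha_neg=1.0, eps=1e-12):
    """Standard binary focal loss for one attribute.

    p: predicted probability that the attribute is present.
    y: ground-truth label (1 = present, 0 = absent).
    """
    p_t = p if y == 1 else 1.0 - p          # probability of the true label
    p_t = min(max(p_t, eps), 1.0)           # guard against log(0)
    alpha = alpha_pos if y == 1 else alpha_neg
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

def multilabel_focal_loss(probs, labels, **kwargs):
    """Sum of per-attribute focal terms for one sample."""
    return sum(focal_term(p, y, **kwargs) for p, y in zip(probs, labels))

# A confident correct prediction is down-weighted far more than a wrong one.
easy = focal_term(0.9, 1)
hard = focal_term(0.1, 1)

# With gamma = 0 the term reduces to ordinary binary cross-entropy.
cross_entropy = focal_term(0.5, 1, gamma=0.0)

# Made-up three-attribute example with a heavier weight on positives.
total = multilabel_focal_loss([0.9, 0.2, 0.6], [1, 0, 1], alpha_pos=2.0)
```

The factor `(1 - p_t) ** gamma` shrinks the contribution of well-classified attributes, so training focuses on hard ones, while the two alpha weights give a simple handle on positive/negative imbalance.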

One-Shot Representational Learning for Joint Biometric and Device Authentication

Sudipta Banerjee, Arun Ross

Auto-TLDR; Joint Biometric and Device Recognition from a Single Biometric Image

In this work, we propose a method to simultaneously perform (i) biometric recognition (i.e., identify the individual), and (ii) device recognition (i.e., identify the device) from a single biometric image, say, a face image, using a one-shot schema. Such a joint recognition scheme can be useful in devices such as smartphones for enhancing security as well as privacy. We propose to automatically learn a joint representation that encapsulates both biometric-specific and sensor-specific features. We evaluate the proposed approach using iris, face and periocular images acquired using near-infrared iris sensors and smartphone cameras. Experiments conducted using 14,451 images from 13 sensors resulted in a rank-1 identification accuracy of up to 99.81% and a verification accuracy of up to 100% at a false match rate of 1%.

SL-DML: Signal Level Deep Metric Learning for Multimodal One-Shot Action Recognition

Raphael Memmesheimer, Nick Theisen, Dietrich Paulus

Auto-TLDR; One-Shot Action Recognition using Metric Learning

Recognizing an activity with a single reference sample using metric learning approaches is a promising research field. The majority of few-shot methods focus on object recognition or face-identification. We propose a metric learning approach to reduce the action recognition problem to a nearest neighbor search in embedding space. We encode signals into images and extract features using a deep residual CNN. Using triplet loss, we learn a feature embedding. The resulting encoder transforms features into an embedding space in which closer distances encode similar actions while higher distances encode different actions. Our approach is based on a signal level formulation and remains flexible across a variety of modalities. It further outperforms the baseline on the large scale NTU RGB+D 120 dataset for the One-Shot action recognition protocol by \ntuoneshotimpro%. With just 60% of the training data, our approach still outperforms the baseline approach by \ntuoneshotimproreduced%. With 40% of the training data, our approach performs comparably well as the second follow up. Further, we show that our approach generalizes well in experiments on the UTD-MHAD dataset for inertial, skeleton and fused data and the Simitate dataset for motion capturing data. Furthermore, our inter-joint and inter-sensor experiments suggest good capabilities on previously unseen setups.
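
The reduction of one-shot recognition to a nearest neighbor search can be sketched as below, assuming an encoder has already mapped each signal to an embedding. Only the final lookup is shown; the class labels, embeddings and function name are made up for illustration and do not come from the paper.

```python
import math

def nearest_class(query, references):
    """Return the label whose single reference embedding is closest to the query.

    references: dict mapping each class label to one embedding (one shot per class).
    """
    return min(references, key=lambda label: math.dist(query, references[label]))

# One reference embedding per previously unseen action class (toy 2-D values).
references = {
    "wave": [1.0, 0.0],
    "jump": [0.0, 1.0],
}

prediction = nearest_class([0.9, 0.2], references)
```

Since classification is a lookup against stored references, adding a new action class only requires embedding one example of it, with no retraining of the encoder.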

Semi-Supervised Person Re-Identification by Attribute Similarity Guidance

Peixian Hong, Ancong Wu, Wei-Shi Zheng

Auto-TLDR; Attribute Similarity Guidance Loss for Semi-supervised Person Re-identification

Although supervised person re-identification (RE-ID) has achieved great progress with deep learning, it requires time-consuming annotation of a large number of pedestrian identities. To reduce labeling cost, we attempt to reduce cross-camera identity annotations and exploit pedestrian attribute annotations as auxiliary information instead. The pedestrian attributes, such as outfit styles, contain coarse semantic knowledge. Although pedestrian attributes are annotated without exhaustive searching in a camera network, which is much easier than cross-camera identity annotation, ambiguity exists in attributes when different persons have similar outfits. To solve this problem, we propose an Attribute Similarity Guidance loss (ASG) to guide appearance feature learning for RE-ID by selective attribute similarity preservation to avoid the impact of such ambiguity. Finally, we develop an attribute-guided self training framework to jointly utilize attribute annotations, unlabeled data and limited labeled data for semi-supervised learning. Extensive experiments on Market-1501 and DukeMTMC-ReID show the superiority of our method for semi-supervised RE-ID.

Lightweight Low-Resolution Face Recognition for Surveillance Applications

Yoanna Martínez-Díaz, Heydi Mendez-Vazquez, Luis S. Luevano, Leonardo Chang, Miguel Gonzalez-Mendoza

Auto-TLDR; Efficiency of Lightweight Deep Face Networks on Low-Resolution Surveillance Imagery

Real-world deployment of face recognition models in unconstrained surveillance scenarios typically requires identifying low-resolution faces at extremely low computational cost. In recent years, several methods based on complex deep learning models have been proposed with promising recognition results but at a high computational cost. Inspired by the compactness and computational efficiency of lightweight deep face networks and their high accuracy on general face recognition tasks, in this work we propose to benchmark two recently introduced lightweight face models on low-resolution surveillance imagery to enable efficient system deployment. To this end, we conduct a comprehensive evaluation on the two typical settings: LR-to-HR and LR-to-LR matching. In addition, we investigate the effect of using models trained with synthetic data down-sampled from high-resolution images, as well as the combination of different models, for face recognition on real low-resolution images. Experimental results show that the evaluated lightweight face models achieve state-of-the-art results on low-resolution benchmarks with low memory footprint and computational complexity. Moreover, we observed that combining models trained with different degradations improves the recognition accuracy on low-resolution surveillance imagery, which is feasible due to their low computational cost.

Compact and Discriminative Multi-Object Tracking with Siamese CNNs

Claire Labit-Bonis, Jérôme Thomas, Frederic Lerasle

Auto-TLDR; Fast, Light-Weight and All-in-One Single Object Tracking for Multi-Target Management

Following the tracking-by-detection paradigm, multiple object tracking deals with challenging scenarios, occlusions or even missing detections; priority is often given to quality measures instead of speed, and a good trade-off between the two is hard to achieve. Based on recent work, we propose a fast, light-weight tracker able to predict targets' positions and re-identify them at once, whereas this is usually done in two sequential steps. To do so, we combine a bounding box regressor with a target-oriented appearance learner in a newly designed and unified architecture. This way, our tracker can infer the targets' image pose but also provide us with a confidence level about target identity. It is also common to filter out the detector outputs with a preprocessing step, throwing away precious information about what has been seen in the image. We propose a track management strategy able to balance efficiently between detection and tracking outputs and their associated likelihoods. Simply put, we spotlight a fully siamese-based single object tracker able to predict both position and appearance features at once with a light-weight and all-in-one architecture, within a balanced overall multi-target management strategy. We demonstrate the efficiency and speed of our system with respect to the literature on the well-known MOT17 challenge benchmark, and present qualitative evaluations as well as state-of-the-art quantitative results.

A Base-Derivative Framework for Cross-Modality RGB-Infrared Person Re-Identification

Hong Liu, Ziling Miao, Bing Yang, Runwei Ding

Auto-TLDR; Cross-modality RGB-Infrared Person Re-identification with Auxiliary Modalities

Cross-modality RGB-infrared (RGB-IR) person re-identification (Re-ID) is a challenging research topic due to the heterogeneity of RGB and infrared images. In this paper, we aim to find some auxiliary modalities, which are homologous with the visible or infrared modalities, to help reduce the modality discrepancy caused by heterogeneous images. Accordingly, a new base-derivative framework is proposed, where base refers to the original visible and infrared modalities, and derivative refers to the two auxiliary modalities that are derived from base. In the proposed framework, the double-modality cross-modal learning problem is reformulated as a four-modality one. After that, the images of all the base and derivative modalities are fed into the feature learning network. With the doubled input images, the learned person features become more discriminative. Furthermore, the proposed framework is optimized by the enhanced intra- and cross-modality constraints with the assistance of two derivative modalities. Experimental results on two publicly available datasets, SYSU-MM01 and RegDB, show that the proposed method outperforms the other state-of-the-art methods. For instance, we achieve a gain of over 13% in terms of both Rank-1 and mAP on the RegDB dataset.

ClusterFace: Joint Clustering and Classification for Set-Based Face Recognition

Samadhi Poornima Kumarasinghe Wickrama Arachchilage, Ebroul Izquierdo

Auto-TLDR; Joint Clustering and Classification for Face Recognition in the Wild

Deep learning technology has enabled successful modeling of complex facial features when high quality images are available. Nonetheless, accurate modeling and recognition of human faces in real world scenarios 'in the wild' or under adverse conditions remains an open problem. When unconstrained faces are mapped into deep features, variations such as illumination, pose and occlusion can create inconsistencies in the resultant feature space. Hence, deriving conclusions based on direct associations could lead to degraded performance. This raises the need for a basic feature space analysis prior to face recognition. This paper devises a joint clustering and classification scheme which learns deep face associations in an easy-to-hard way. Our method is based on hierarchical clustering where the early iterations tend to preserve high reliability. The rationale of our method is that a reliable clustering result can provide insights into the distribution of the feature space, which can guide the classification that follows. Experimental evaluations on three tasks, face verification, face identification and rank-order search, demonstrate better or competitive performance compared to the state-of-the-art on all three.

An Empirical Analysis of Visual Features for Multiple Object Tracking in Urban Scenes

Mehdi Miah, Justine Pepin, Nicolas Saunier, Guillaume-Alexandre Bilodeau

Auto-TLDR; Evaluating Appearance Features for Multiple Object Tracking in Urban Scenes

This paper addresses the problem of selecting appearance features for multiple object tracking (MOT) in urban scenes. Over the years, a large number of features has been used for MOT. However, it is not clear whether some of them are better than others. Commonly used features are color histograms, histograms of oriented gradients, deep features from convolutional neural networks and re-identification (ReID) features. In this study, we assess how good these features are at discriminating objects enclosed by a bounding box in urban scene tracking scenarios. Several affinity measures, namely the L1, L2 and the Bhattacharyya distances, Rank-1 counts and the cosine similarity, are also assessed for their impact on the discriminative power of the features. Results on several datasets show that features from ReID networks are the best for discriminating instances from one another regardless of the quality of the detector. If a ReID model is not available, color histograms may be selected if the detector has a good recall and there are few occlusions; otherwise, deep features are more robust to detectors with lower recall.
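
As a rough illustration of the affinity measures named in this abstract, the sketch below computes the L1 and L2 distances, the cosine similarity and a Bhattacharyya distance between two feature vectors. It assumes the Bhattacharyya distance is taken as the negative logarithm of the Bhattacharyya coefficient over normalized histograms, which is one common definition and may differ from the variant used in the paper; all function names and toy histograms are hypothetical.

```python
import math


def l1_distance(p, q):
    """Sum of absolute differences (Manhattan distance)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def l2_distance(p, q):
    """Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cosine_similarity(p, q):
    """Cosine of the angle between p and q; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def bhattacharyya_distance(p, q):
    """Assumed definition: -ln(sum_i sqrt(p_i * q_i)) for histograms that
    each sum to 1; returns infinity when the histograms do not overlap."""
    coefficient = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return -math.log(coefficient) if coefficient > 0 else math.inf


# Toy normalized color histograms of two detections (made up for illustration).
hist_a = [0.5, 0.5, 0.0]
hist_b = [0.25, 0.25, 0.5]
scores = {
    "l1": l1_distance(hist_a, hist_b),
    "l2": l2_distance(hist_a, hist_b),
    "cosine": cosine_similarity(hist_a, hist_b),
    "bhattacharyya": bhattacharyya_distance(hist_a, hist_b),
}
```

Note that the three distances decrease as two detections look more alike, while the cosine similarity increases, so a tracker that mixes them has to convert them to a common direction before comparing.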

Audio-Based Near-Duplicate Video Retrieval with Audio Similarity Learning

Pavlos Avgoustinakis, Giorgos Kordopatis-Zilos, Symeon Papadopoulos, Andreas L. Symeonidis, Ioannis Kompatsiaris

Auto-TLDR; AuSiL: Audio Similarity Learning for Near-duplicate Video Retrieval

In this work, we address the problem of audio-based near-duplicate video retrieval. We propose the Audio Similarity Learning (AuSiL) approach that effectively captures temporal patterns of audio similarity between video pairs. For the robust similarity calculation between two videos, we first extract representative audio-based video descriptors by leveraging transfer learning based on a Convolutional Neural Network (CNN) trained on a large-scale dataset of audio events, and then we calculate the similarity matrix derived from the pairwise similarity of these descriptors. The similarity matrix is subsequently fed to a CNN that captures the temporal structures existing within its content. We train our network following a triplet generation process and optimizing the triplet loss function. To evaluate the effectiveness of the proposed approach, we have manually annotated two publicly available video datasets based on the audio duplicity between their videos. The proposed approach achieves very competitive results compared to three state-of-the-art methods. Also, unlike the competing methods, it is very robust for the retrieval of audio duplicates generated with speed transformations.
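
The pairwise similarity matrix described in this abstract can be sketched as follows. The abstract does not state which similarity function is used, so cosine similarity is an assumption here, as are the function names and the toy descriptors; the CNN that later consumes the matrix is not modeled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two descriptors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(descriptors_a, descriptors_b):
    """Entry [i][j] is the similarity between the i-th audio segment
    descriptor of video A and the j-th audio segment descriptor of video B."""
    return [[cosine(u, v) for v in descriptors_b] for u in descriptors_a]


# Toy descriptors: one 2-D vector per audio segment (made up for illustration).
video_a = [[1.0, 0.0], [0.0, 1.0]]
video_b = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
matrix = similarity_matrix(video_a, video_b)
```

With time running along both axes, two videos that share the same audio produce a band of high values along a diagonal of this matrix, which is the kind of temporal pattern a downstream network could learn to detect.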