#### Loop-closure detection by LiDAR scan re-identification

Jukka Peltomäki, Xingyang Ni, Jussi Puura, Joni-Kristian Kamarainen, Heikki Juhani Huttunen
Track 3: Computer Vision Robotics and Intelligent Systems
Wed 13 Jan 2021 at 16:30 in session PS T3.6

Auto-TLDR; Loop-Closing Detection from LiDAR Scans Using Convolutional Neural Networks

In this work, loop-closure detection from LiDAR scans is defined as an image re-identification problem. Re-identification is performed by computing Euclidean distances of a query scan to a gallery set of previous scans. The distances are computed in a feature embedding space where the scans are mapped by a convolutional neural network (CNN). The network is trained using the triplet loss training strategy. In our experiments we compare different backbone networks, variants of the triplet loss and generic and LiDAR specific data augmentation techniques. With a realistic indoor dataset the best architecture obtains the mean average precision (mAP) above 90%.

## Similar papers

#### Rotation Invariant Aerial Image Retrieval with Group Convolutional Metric Learning

Hyunseung Chung, Woo-Jeoung Nam, Seong-Whan Lee
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Wed 13 Jan 2021 at 12:00 in session PS T1.3

Auto-TLDR; Robust Remote Sensing Image Retrieval Using Group Convolution with Attention Mechanism and Metric Learning

Remote sensing image retrieval (RSIR) is the process of ranking database images depending on the degree of similarity compared to the query image. As the complexity of RSIR increases due to the diversity in shooting range, angle, and location of remote sensors, there is an increasing demand for methods to address these issues and improve retrieval performance. In this work, we introduce a novel method for retrieving aerial images by merging group convolution with attention mechanism and metric learning, resulting in robustness to rotational variations. For refinement and emphasis on important features, we applied channel attention in each group convolution stage. By utilizing the characteristics of group convolution and channel-wise attention, it is possible to acknowledge the equality among rotated but identically located images. The training procedure has two main steps: (i) training the network with Aerial Image Dataset (AID) for classification, (ii) fine-tuning the network with triplet-loss for retrieval with Google Earth South Korea and NWPU-RESISC45 datasets. Results show that the proposed method performance exceeds other state-of-the-art retrieval methods in both rotated and original environments. Furthermore, we utilize class activation maps (CAM) to visualize the distinct difference of main features between our method and baseline, resulting in better adaptability in rotated environments.

#### Multi-Level Deep Learning Vehicle Re-Identification Using Ranked-Based Loss Functions

Eleni Kamenou, Jesus Martinez-Del-Rincon, Paul Miller, Patricia Devlin - Hill
Track 3: Computer Vision Robotics and Intelligent Systems
Wed 13 Jan 2021 at 16:30 in session PS T3.6

Auto-TLDR; Multi-Level Re-identification Network for Vehicle Re-Identification

Identifying vehicles across a network of cameras with non-overlapping fields of view remains a challenging research problem due to scene occlusions, significant inter-class similarity and intra-class variability. In this paper, we propose an end-to-end multi-level re-identification network that is capable of successfully projecting same identity vehicles closer to one another in the embedding space, compared to vehicles of different identities. Robust feature representations are obtained by combining features at multiple levels of the network. As for the learning process, we employ a recent state-of-the-art structured metric learning loss function previously applied to other retrieval problems and adjust it to the vehicle re-identification task. Furthermore, we explore the cases of image-to-image, image-to-video and video-to-video similarity metric. Finally, we evaluate our system and achieve great performance on two large-scale publicly available datasets, CityFlow-ReID and VeRi-776. Compared to most existing state-of-art approaches, our approach is simpler and more straightforward, utilizing only identity-level annotations, while avoiding post-processing the ranking results (re-ranking) at the testing phase.

#### Do We Really Need Scene-Specific Pose Encoders?

Yoli Shavit, Ron Ferens
Track 3: Computer Vision Robotics and Intelligent Systems
Thu 14 Jan 2021 at 16:00 in session OS T3.4

Auto-TLDR; Pose Regression Using Deep Convolutional Networks for Visual Similarity

Visual pose regression models estimate the camera pose from a query image with a single forward pass. Current models learn pose encoding from an image using deep convolutional networks which are trained per scene. The resulting encoding is typically passed to a multi-layer perceptron in order to regress the pose. In this work, we propose that scene-specific pose encoders are not required for pose regression and that encodings trained for visual similarity can be used instead. In order to test our hypothesis, we take a shallow architecture of several fully connected layers and train it with pre-computed encodings from a generic image retrieval model. We find that these encodings are not only sufficient to regress the camera pose, but that, when provided to a branching fully connected architecture, a trained model can achieve competitive results and even surpass current state-of-the-art pose regressors in some cases. Moreover, we show that for outdoor localization, the proposed architecture is the only pose regressor, to date, consistently localizing in under 2 meters and 5 degrees.

#### Progressive Learning Algorithm for Efficient Person Re-Identification

Zhen Li, Hanyang Shao, Liang Niu, Nian Xue
Track 5: Image and Signal Processing
Fri 15 Jan 2021 at 15:00 in session PS T5.7

Auto-TLDR; Progressive Learning Algorithm for Large-Scale Person Re-Identification

This paper studies the problem of Person Re-Identification (ReID) for large-scale applications. Recent research efforts have been devoted to building complicated part models, which introduce considerably high computational cost and memory consumption, inhibiting its practicability in large-scale applications. This paper aims to develop a novel learning strategy to find efficient feature embeddings while maintaining the balance of accuracy and model complexity. More specifically, we find by enhancing the classical triplet loss together with cross-entropy loss, our method can explore the hard examples and build a discriminant feature embedding yet compact enough for large-scale applications. Our method is carried out progressively using Bayesian optimization, and we call it the Progressive Learning Algorithm (PLA). Extensive experiments on three large-scale datasets show that our PLA is comparable or better than the state-of-the-arts. Especially, on the challenging Market-1501 dataset, we achieve Rank-1=94.7\%/mAP=89.4\% while saving at least 30\% parameters than strong part models.

#### Attentive Part-Aware Networks for Partial Person Re-Identification

Lijuan Huo, Chunfeng Song, Zhengyi Liu, Zhaoxiang Zhang
Track 2: Biometrics, Human Analysis and Behavior Understanding
Wed 13 Jan 2021 at 12:00 in session PS T2.2

Auto-TLDR; Part-Aware Learning for Partial Person Re-identification

Partial person re-identification (re-ID) refers to re-identify a person through occluded images. It suffers from two major challenges, i.e., insufficient training data and incomplete probe image. In this paper, we introduce an automatic data augmentation module and a part-aware learning method for partial re-identification. On the one hand, we adopt the data augmentation to enhance the training data and help learns more stabler partial features. On the other hand, we intuitively find that the partial person images usually have fixed percentages of parts, therefore, in partial person re-id task, the probe image could be cropped from the pictures and divided into several different partial types following fixed ratios. Based on the cropped images, we propose the Cropping Type Consistency (CTC) loss to classify the cropping types of partial images. Moreover, in order to help the network better fit the generated and cropped data, we incorporate the Block Attention Mechanism (BAM) into the framework for attentive learning. To enhance the retrieval performance in the inference stage, we implement cropping on gallery images according to the predicted types of probe partial images. Through calculating feature distances between the partial image and the cropped holistic gallery images, we can recognize the right person from the gallery. To validate the effectiveness of our approach, we conduct extensive experiments on the partial re-ID benchmarks and achieve state-of-the-art performance.

#### Generalized Local Attention Pooling for Deep Metric Learning

Carlos Roig Mari, David Varas, Issey Masuda, Juan Carlos Riveiro, Elisenda Bou-Balust
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Thu 14 Jan 2021 at 12:00 in session PS T1.9

Auto-TLDR; Generalized Local Attention Pooling for Deep Metric Learning

Deep metric learning has been key to recent advances in face verification and image retrieval amongst others. These systems consist on a feature extraction block (extracts feature maps from images) followed by a spatial dimensionality reduction block (generates compact image representations from the feature maps) and an embedding generation module (projects the image representation to the embedding space). While research on deep metric learning has focused on improving the losses for the embedding generation module, the dimensionality reduction block has been overlooked. In this work, we propose a novel method to generate compact image representations which uses local spatial information through an attention mechanism, named Generalized Local Attention Pooling (GLAP). This method, instead of being placed at the end layer of the backbone, is connected at an intermediate level, resulting in lower memory requirements. We assess the performance of the aforementioned method by comparing it with multiple dimensionality reduction techniques, demonstrating the importance of using attention weights to generate robust compact image representations. Moreover, we compare the performance of multiple state-of-the-art losses using the standard deep metric learning system against the same experiment with our GLAP. Experiments showcase that the proposed Generalized Local Attention Pooling mechanism outperforms other pooling methods when compared with current state-of-the-art losses for deep metric learning.

#### Building Computationally Efficient and Well-Generalizing Person Re-Identification Models with Metric Learning

Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Fri 15 Jan 2021 at 15:00 in session OS T1.8

Auto-TLDR; Cross-Domain Generalization in Person Re-identification using Omni-Scale Network

This work considers the problem of domain shift in person re-identification.Being trained on one dataset, a re-identification model usually performs much worse on unseen data. Partially this gap is caused by the relatively small scale of person re-identification datasets (compared to face recognition ones, for instance), but it is also related to training objectives. We propose to use the metric learning objective, namely AM-Softmax loss, and some additional training practices to build well-generalizing, yet, computationally efficient models. We use recently proposed Omni-Scale Network (OSNet) architecture combined with several training tricks and architecture adjustments to obtain state-of-the art results in cross-domain generalization problem on a large-scale MSMT17 dataset in three setups: MSMT17-all->DukeMTMC, MSMT17-train->Market1501 and MSMT17-all->Market1501.

#### Adaptive L2 Regularization in Person Re-Identification

Xingyang Ni, Liang Fang, Heikki Juhani Huttunen
Track 5: Image and Signal Processing
Fri 15 Jan 2021 at 16:00 in session PS T5.8

We introduce an adaptive L2 regularization mechanism termed AdaptiveReID, in the setting of person re-identification. In the literature, it is common practice to utilize hand-picked regularization factors which remain constant throughout the training procedure. Unlike existing approaches, the regularization factors in our proposed method are updated adaptively through backpropagation. This is achieved by incorporating trainable scalar variables as the regularization factors, which are further fed into a scaled hard sigmoid function. Extensive experiments on the Market-1501, DukeMTMC-reID and MSMT17 datasets validate the effectiveness of our framework. Most notably, we obtain state-of-the-art performance on MSMT17, which is the largest dataset for person re-identification. Source code will be published at https://github.com/nixingyang/AdaptiveReID.

#### Rethinking ReID：Multi-Feature Fusion Person Re-Identification Based on Orientation Constraints

Mingjing Ai, Guozhi Shan, Bo Liu, Tianyang Liu
Track 2: Biometrics, Human Analysis and Behavior Understanding
Wed 13 Jan 2021 at 12:00 in session PS T2.2

Auto-TLDR; Person Re-identification with Orientation Constrained Network

Person re-identification (ReID) aims to identify the specific pedestrian in a series of images or videos. Recently, ReID is receiving more and more attention in the fields of computer vision research and application like intelligent security. One major issue downgrading the ReID model performance lies in that various subjects in the same body orientations look too similar to distinguish by the model, while the same subject viewed in different orientations looks rather different. However, most of the current studies do not particularly differentiate pedestrians in orientation when designing the network, so we rethink this problem particularly from the perspective of person orientation and propose a new network structure by including two branches: one handling samples with the same body orientations and the other handling samples with different body orientations. Correspondingly, we also propose an orientation classifier that can accurately distinguish the orientation of each person. At the same time, the three-part loss functions are introduced for orientation constraint and combined to optimize the network simultaneously. Also, we use global and local features int the training stage in order to make use of multi-level information. Therefore, our network can derive its efficacy from orientation constraints and multiple features. Experiments show that our method not only has competitive performance on multiple datasets, but also can let retrieval results aligned with the orientation of the query sample rank higher, which may have great potential in the practical applications.

#### RISEdb: A Novel Indoor Localization Dataset

Carlos Sanchez Belenguer, Erik Wolfart, Álvaro Casado Coscollá, Vitor Sequeira
Track 3: Computer Vision Robotics and Intelligent Systems
Wed 13 Jan 2021 at 16:30 in session PS T3.6

Auto-TLDR; Indoor Localization Using LiDAR SLAM and Smartphones: A Benchmarking Dataset

In this paper we introduce a novel public dataset for developing and benchmarking indoor localization systems. We have selected and 3D mapped a set of representative indoor environments including a large office building, a conference room, a workshop, an exhibition area and a restaurant. Our acquisition pipeline is based on a portable LiDAR SLAM backpack to map the buildings and to accurately track the pose of the user as it moves freely inside them. We introduce the calibration procedures that enable us to acquire and geo-reference live data coming from different independent sensors rigidly attached to the backpack. This has allowed us to collect long sequences of spherical and stereo images, together with all the sensor readings coming from a consumer smartphone and locate them inside the map with centimetre accuracy. The dataset addresses many of the limitations of existing indoor localization datasets regarding the scale and diversity of the mapped buildings; the number of acquired sequences under varying conditions; the accuracy of the ground-truth trajectory; the availability of a detailed 3D model and the availability of different sensor types. It enables the benchmarking of existing and the development of new indoor localization approaches, in particular for deep learning based systems that require large amounts of labeled training data.

#### Deep Top-Rank Counter Metric for Person Re-Identification

Chen Chen, Hao Dou, Xiyuan Hu, Silong Peng
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Fri 15 Jan 2021 at 15:00 in session PS T1.14

Auto-TLDR; Deep Top-Rank Counter Metric for Person Re-identification

In the research field of person re-identification, deep metric learning that guides the efficient and effective embedding learning serves as one of the most fundamental tasks. Recent efforts of the loss function based deep metric learning methods mainly focus on the top rank accuracy optimization by minimiz- ing the distance difference between the correctly matching sample pair and wrongly matched sample pair. However, it is more straightforward to count the occurrences of correct top-rank candidates and maximize the counting results for better top rank accuracy. In this paper, we propose a generalized logistic function based metric with effective practicalness in deep learning, namely the“deep top-rank counter metric”, to approximately optimize the counted occurrences of the correct top-rank matches. The properties that qualify the proposed metric as a well-suited deep re-identification metric have been discussed and a progressive hard sample mining strategy is also introduced for effective training and performance boosting. The extensive experiments show that the proposed top-rank counter metric outperforms other loss function based deep metrics and achieves the state-of- the-art accuracies.

#### Top-DB-Net: Top DropBlock for Activation Enhancement in Person Re-Identification

Rodolfo Quispe, Helio Pedrini
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Fri 15 Jan 2021 at 15:00 in session PS T1.14

Auto-TLDR; Top-DB-Net for Person Re-Identification using Top DropBlock

Person Re-Identification is a challenging task that aims to retrieve all instances of a query image across a system of non-overlapping cameras. Due to the various extreme changes of view, it is common that local regions that could be used to match people are suppressed, which leads to a scenario where approaches have to evaluate the similarity of images based on less informative regions. In this work, we introduce the Top-DB-Net, a method based on Top DropBlock that pushes the network to learn to focus on the scene foreground, with special emphasis on the most task-relevant regions and, at the same time, encodes low informative regions to provide high discriminability. The Top-DB-Net is composed of three streams: (i) a global stream encodes rich image information from a backbone, (ii) the Top DropBlock stream encourages the backbone to encode low informative regions with high discriminative features, and (iii) a regularization stream helps to deal with the noise created by the dropping process of the second stream, when testing the first two streams are used. Vast experiments on three challenging datasets show the capabilities of our approach against state-of-the-art methods. Qualitative results demonstrate that our method exhibits better activation maps focusing on reliable parts of the input images.

#### Learning Embeddings for Image Clustering: An Empirical Study of Triplet Loss Approaches

Kalun Ho, Janis Keuper, Franz-Josef Pfreundt, Margret Keuper
Track 3: Computer Vision Robotics and Intelligent Systems
Tue 12 Jan 2021 at 15:00 in session PS T3.1

Auto-TLDR; Clustering Objectives for K-means and Correlation Clustering Using Triplet Loss

In this work, we evaluate two different image clustering objectives, k-means clustering and correlation clustering, in the context of Triplet Loss induced feature space embeddings. Specifically, we train a convolutional neural network to learn discriminative features by optimizing two popular versions of the Triplet Loss in order to study their clustering properties under the assumption of noisy labels. Additionally, we propose a new, simple Triplet Loss formulation, which shows desirable properties with respect to formal clustering objectives and outperforms the existing methods. We evaluate all three Triplet loss formulations for K-means and correlation clustering on the CIFAR-10 image classification dataset.

#### Not 3D Re-ID: Simple Single Stream 2D Convolution for Robust Video Re-Identification

Toby Breckon, Aishah Alsehaim
Track 2: Biometrics, Human Analysis and Behavior Understanding
Thu 14 Jan 2021 at 12:00 in session PS T2.4

Auto-TLDR; ResNet50-IBN for Video-based Person Re-Identification using Single Stream 2D Convolution Network

Video-based person re-identification has received increasing attention recently, as it plays an important role within the surveillance video analysis. Video-based Re-ID is an expansion of earlier image-based re-identification methods by learning features from a video via multiple image frames for each person. Most contemporary video Re-ID methods utilise complex CNN-based network architectures using 3D convolution or multi-branch networks to extract spatial-temporal features from the video. By contrast, in this paper, we will illustrate superior performance from a simple single stream 2D convolution network leveraging the ResNet50-IBN architecture to extract frame-level features followed by temporal attention for clip level features. These clip level features can be generalised to extract video level features by averaging clip level features without any additional cost. Our model, uses best video Re-ID practice and transfer learning between datasets, outperforms existing state-of-the-art approaches on MARS, PRID2011 and iLIDSVID datasets with 89:62%, 97:75%, 97:33% rank-1 accuracy respectively and with 84:61% mAP for MARS, without reliance on complex and memory intensive 3D convolutions or multistream networks architectures as found in other contemporary work. Conversely, this work shows that global features extracted by the 2D convolution network are a sufficient representation for robust state of the art video Re-ID.

#### Extending Single Beam Lidar to Full Resolution by Fusing with Single Image Depth Estimation

Yawen Lu, Yuxing Wang, Devarth Parikh, Guoyu Lu
Track 5: Image and Signal Processing
Thu 14 Jan 2021 at 14:00 in session OS T5.4

Auto-TLDR; Self-supervised LIDAR for Low-Cost Depth Estimation

Depth estimation is playing an important role in indoor and outdoor scene understanding, autonomous driving, augmented reality and many other tasks. Vehicles and robotics are able to use active illumination sensors such as LIDAR to receive high precision depth estimation. However, high-resolution Lidars are usually too expensive, which limits its massive production on various applications. Though single beam LIDAR enjoys the benefits of low cost, one beam depth sensing is not usually sufficient to perceive the surrounding environment in many scenarios. In this paper, we propose a learning-based framework to explore to replicate similar or even higher performance as costly LIDARs with our designed self-supervised network and a low-cost single-beam LIDAR. After the accurate calibration with a visible camera, the single beam LIDAR can adjust the scale uncertainty of the depth map estimated by the visible camera. The adjusted depth map enjoys the benefits of high resolution and sensing accuracy as high beam LIDAR and maintains low-cost as single beam LIDAR. Thus we can achieve similar sensing effect of high beam LIDAR with more than a 50-100 times cheaper price (e.g., \$80000 Velodyne HDL-64E LIDAR v.s. \$1000 SICK TIM-781 2D LIDAR and normal camera). The proposed approach is verified on our collected dataset and public dataset with superior depth-sensing performance.

#### Self and Channel Attention Network for Person Re-Identification

Asad Munir, Niki Martinel, Christian Micheloni
Track 3: Computer Vision Robotics and Intelligent Systems
Wed 13 Jan 2021 at 16:30 in session PS T3.6

Auto-TLDR; SCAN: Self and Channel Attention Network for Person Re-identification

Recent research has shown promising results for person re-identification by focusing on several trends. One is designing efficient metric learning loss functions such as triplet loss family to learn the most discriminative representations. The other is learning local features by designing part based architectures to form an informative descriptor from semantically coherent parts. Some efforts adjust distant outliers to their most similar positions by using soft attention and learn the relationship between distant similar features. However, only a few prior efforts focus on channel-wise dependencies and learn non-local sharp similar part features directly for the degraded data in the person re-identification task. In this paper, we propose a novel Self and Channel Attention Network (SCAN) to model long-range dependencies between channels and feature maps. We add multiple classifiers to learn discriminative global features by using classification loss. Self Attention (SA) module and Channel Attention (CA) module are introduced to model non-local and channel-wise dependencies in the learned features. Spectral normalization is applied to the whole network to stabilize the training process. Experimental results on the person re-identification benchmarks show the proposed components achieve significant improvement with respect to the baseline.

#### 3D Facial Matching by Spiral Convolutional Metric Learning and a Biometric Fusion-Net of Demographic Properties

Soha Sadat Mahdi, Nele Nauwelaers, Philip Joris, Giorgos Bouritsas, Imperial London, Sergiy Bokhnyak, Susan Walsh, Mark Shriver, Michael Bronstein, Peter Claes
Track 2: Biometrics, Human Analysis and Behavior Understanding
Thu 14 Jan 2021 at 14:00 in session OS T2.2

Auto-TLDR; Multi-biometric Fusion for Biometric Verification using 3D Facial Mesures

Face recognition is a widely accepted biometric verification tool, as the face contains a lot of information about the identity of a person. In this study, a 2-step neural-based pipeline is presented for matching 3D facial shape to multiple DNA-related properties (sex, age, BMI and genomic background). The first step consists of a triplet loss-based metric learner that compresses facial shape into a lower dimensional embedding while preserving information about the property of interest. Most studies in the field of metric learning have only focused on Euclidean data. In this work, geometric deep learning is employed to learn directly from 3D facial meshes. To this end, spiral convolutions are used along with a novel mesh-sampling scheme that retains uniformly sampled 3D points at different levels of resolution. The second step is a multi-biometric fusion by a fully connected neural network. The network takes an ensemble of embeddings and property labels as input and returns genuine and imposter scores. Since embeddings are accepted as an input, there is no need to train classifiers for the different properties and available data can be used more efficiently. Results obtained by a 10-fold cross-validation for biometric verification show that combining multiple properties leads to stronger biometric systems. Furthermore, the proposed neural-based pipeline outperforms a linear baseline, which consists of principal component analysis, followed by classification with linear support vector machines and a Naïve Bayes-based score-fuser.

#### SL-DML: Signal Level Deep Metric Learning for Multimodal One-Shot Action Recognition

Raphael Memmesheimer, Nick Theisen, Dietrich Paulus
Track 3: Computer Vision Robotics and Intelligent Systems
Thu 14 Jan 2021 at 16:00 in session OS T3.4

Auto-TLDR; One-Shot Action Recognition using Metric Learning

Recognizing an activity with a single reference sample using metric learning approaches is a promising research field. The majority of few-shot methods focus on object recognition or face-identification. We propose a metric learning approach to reduce the action recognition problem to a nearest neighbor search in embedding space. We encode signals into images and extract features using a deep residual CNN. Using triplet loss, we learn a feature embedding. The resulting encoder transforms features into an embedding space in which closer distances encode similar actions while higher distances encode different actions. Our approach is based on a signal level formulation and remains flexible across a variety of modalities. It further outperforms the baseline on the large scale NTU RGB+D 120 dataset for the One-Shot action recognition protocol by \ntuoneshotimpro%. With just 60% of the training data, our approach still outperforms the baseline approach by \ntuoneshotimproreduced%. With 40% of the training data, our approach performs comparably well as the second follow up. Further, we show that our approach generalizes well in experiments on the UTD-MHAD dataset for inertial, skeleton and fused data and the Simitate dataset for motion capturing data. Furthermore, our inter-joint and inter-sensor experiments suggest good capabilities on previously unseen setups.

#### One-Shot Representational Learning for Joint Biometric and Device Authentication

Sudipta Banerjee, Arun Ross
Track 2: Biometrics, Human Analysis and Behavior Understanding
Tue 12 Jan 2021 at 17:00 in session PS T2.1

Auto-TLDR; Joint Biometric and Device Recognition from a Single Biometric Image

In this work, we propose a method to simultaneously perform (i) biometric recognition (\textit{i.e.}, identify the individual), and (ii) device recognition, (\textit{i.e.}, identify the device) from a single biometric image, say, a face image, using a one-shot schema. Such a joint recognition scheme can be useful in devices such as smartphones for enhancing security as well as privacy. We propose to automatically learn a joint representation that encapsulates both biometric-specific and sensor-specific features. We evaluate the proposed approach using iris, face and periocular images acquired using near-infrared iris sensors and smartphone cameras. Experiments conducted using 14,451 images from 13 sensors resulted in a rank-1 identification accuracy of upto 99.81\% and a verification accuracy of upto 100\% at a false match rate of 1\%.

#### Visual Localization for Autonomous Driving: Mapping the Accurate Location in the City Maze

Dongfang Liu, Yiming Cui, Xiaolei Guo, Wei Ding, Baijian Yang, Yingjie Chen
Track 3: Computer Vision Robotics and Intelligent Systems
Tue 12 Jan 2021 at 17:00 in session PS T3.2

Auto-TLDR; Feature Voting for Robust Visual Localization in Urban Settings

Accurate localization is a foundational capacity, required for autonomous vehicles to accomplish other tasks such as navigation or path planning. It is a common practice for vehicles to use GPS to acquire location information. However, the application of GPS can result in severe challenges when vehicles run within the inner city where different kinds of structures may shadow the GPS signal and lead to inaccurate location results. To address the localization challenges of urban settings, we propose a novel feature voting technique for visual localization. Different from the conventional front-view-based method, our approach employs views from three directions (front, left, and right) and thus significantly improves the robustness of location prediction. In our work, we craft the proposed feature voting method into three state-of-the-art visual localization networks and modify their architectures properly so that they can be applied for vehicular operation. Extensive field test results indicate that our approach can predict location robustly even in challenging inner-city settings. Our research sheds light on using the visual localization approach to help autonomous vehicles to find accurate location information in a city maze, within a desirable time constraint.

#### Attention-Based Deep Metric Learning for Near-Duplicate Video Retrieval

Kuan-Hsun Wang, Chia Chun Cheng, Yi-Ling Chen, Yale Song, Shang-Hong Lai
Track 3: Computer Vision Robotics and Intelligent Systems
Tue 12 Jan 2021 at 14:00 in session OS T3.1

Auto-TLDR; Attention-based Deep Metric Learning for Near-duplicate Video Retrieval

Near-duplicate video retrieval (NDVR) is an important and challenging problem due to the increasing amount of videos uploaded to the Internet. In this paper, we propose an attention-based deep metric learning method for NDVR. Our method is based on well-established principles: We leverage two-stream networks to combine RGB and optical flow features, and incorporate an attention module to effectively deal with distractor frames commonly observed in near duplicate videos. We further aggregate the features corresponding to multiple video segments to enhance the discriminative power. The whole system is trained using a deep metric learning objective with a Siamese architecture. Our experiments show that the attention module helps eliminate redundant and noisy frames, while focusing on visually relevant frames for solving NVDR. We evaluate our approach on recent large-scale NDVR datasets, CC_WEB_VIDEO, VCDB, FIVR and SVD. To demonstrate the generalization ability of our approach, we report results in both within- and cross-dataset settings, and show that the proposed method significantly outperforms state-of-the-art approaches.

#### Benchmarking Cameras for OpenVSLAM Indoors

Kevin Chappellet, Guillaume Caron, Fumio Kanehiro, Ken Sakurada, Abderrahmane Kheddar
Track 3: Computer Vision Robotics and Intelligent Systems
Fri 15 Jan 2021 at 15:00 in session PS T3.10

Auto-TLDR; OpenVSLAM: Benchmarking Camera Types for Visual Simultaneous Localization and Mapping

In this paper we benchmark different types of cameras and evaluate their performance in terms of reliable localization reliability and precision in Visual Simultaneous Localization and Mapping (vSLAM). Such benchmarking is merely found for visual odometry, but never for vSLAM. Existing studies usually compare several algorithms for a given camera. %This work is the first to handle the dual of the latter, i.e. comparing several cameras for a given SLAM algorithm. The evaluation methodology we propose is applied to the recent OpenVSLAM framework. The latter is versatile enough to natively deal with perspective, fisheye, 360 cameras in a monocular or stereoscopic setup, an in RGB or RGB-D modalities. Results in various sequences containing light variation and scenery modifications in the scene assess quantitatively the maximum localization rate for 360 vision. In the contrary, RGB-D vision shows the lowest localization rate, but highest precision when localization is possible. Stereo-fisheye trades-off with localization rates and precision between 360 vision and RGB-D vision. The dataset with ground truth will be made available in open access to allow evaluating other/future vSLAM algorithms with respect to these camera types.

#### NetCalib: A Novel Approach for LiDAR-Camera Auto-Calibration Based on Deep Learning

Track 3: Computer Vision Robotics and Intelligent Systems
Tue 12 Jan 2021 at 17:00 in session PS T3.2

Auto-TLDR; Automatic Calibration of LiDAR and Cameras using Deep Neural Network

A fusion of LiDAR and cameras have been widely used in many robotics applications such as classification, segmentation, object detection, and autonomous driving. It is essential that the LiDAR sensor can measure distances accurately, which is a good complement to the cameras. Hence, calibrating sensors before deployment is a mandatory step. The conventional methods include checkerboards, specific patterns, or human labeling, which is trivial and human-labor extensive if we do the same calibration process every time. The main propose of this research work is to build a deep neural network that is capable of automatically finding the geometric transformation between LiDAR and cameras. The results show that our model manages to find the transformations from randomly sampled artificial errors. Besides, our work is open-sourced for the community to fully utilize the advances of the methodology for developing more the approach, initiating collaboration, and innovation in the topic.

#### Multi-Label Contrastive Focal Loss for Pedestrian Attribute Recognition

Xiaoqiang Zheng, Zhenxia Yu, Lin Chen, Fan Zhu, Shilong Wang
Track 2: Biometrics, Human Analysis and Behavior Understanding
Fri 15 Jan 2021 at 15:00 in session PS T2.5

Auto-TLDR; Multi-label Contrastive Focal Loss for Pedestrian Attribute Recognition

Pedestrian Attribute Recognition (PAR) has received extensive attention during the past few years. With the advances of deep constitutional neural networks (CNNs), the performance of PAR has been significantly improved. Existing methods tend to acquire attribute-specific features by designing various complex network structures with additional modules. Such additional modules, however, dramatically increase the number of parameters. Meanwhile, the problems of class imbalance and hard attribute retrieving remain underestimated in PAR. In this paper, we explore the optimization mechanism of the training processing to account for these problems and propose a new loss function called Multi-label Contrastive Focal Loss (MCFL). This proposed MCFL emphasizes the hard and minority attributes by using a separated re-weighting mechanism for different positive and negative classes to alleviate the impact of the imbalance. MCFL is also able to enlarge the gaps between the intra-class of multi-label attributes, to force CNNs to extract more subtle discriminative features. We evaluate the proposed MCFL on three large public pedestrian datasets, including RAP, PA-100K, and PETA. The experimental results indicate that the proposed MCFL with the ResNet-50 backbone is able to outperform other state-of-the-art approaches in comparison.

#### Map-Based Temporally Consistent Geolocalization through Learning Motion Trajectories

Bing Zha, Alper Yilmaz
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Wed 13 Jan 2021 at 16:30 in session PS T1.8

Auto-TLDR; Exploiting Motion Trajectories for Geolocalization of Object on Topological Map using Recurrent Neural Network

In this paper, we propose a novel trajectory learning method that exploits motion trajectories on topological map using recurrent neural network for temporally consistent geolocalization of object. Inspired by human's ability to both be aware of distance and direction of self-motion in navigation, our trajectory learning method learns a pattern representation of trajectories encoded as a sequence of distances and turning angles to assist self-localization. We pose the learning process as a conditional sequence prediction problem in which each output locates the object on a traversable edge in a map. Considering the prediction sequence ought to be topologically connected in the graph-structured map, we adopt two different hypotheses generation and elimination strategies to eliminate disconnected sequence prediction. We demonstrate our approach on the KITTI stereo visual odometry dataset which is a city-scale environment. The key benefits of our approach to geolocalization are that 1) we take advantage of powerful sequence modeling ability of recurrent neural network and its robustness to noisy input, 2) only require a map in the form of a graph and 3) simply use an affordable sensor that generates motion trajectory. The experiments show that the motion trajectories can be learned by training an recurrent neural network, and temporally consistent geolocation can be predicted with both of the proposed strategies.

#### Deep Next-Best-View Planner for Cross-Season Visual Route Classification

Kurauchi Kanya, Kanji Tanaka
Track 3: Computer Vision Robotics and Intelligent Systems
Thu 14 Jan 2021 at 14:00 in session PS T3.8

Auto-TLDR; Active Visual Place Recognition using Deep Convolutional Neural Network

This paper addresses the problem of active visual place recognition (VPR) from a novel perspective of long-term autonomy. In our approach, a next-best-view (NBV) planner plans an optimal action-observation-sequence to maximize the expected cost-performance for a visual route classification task. A difficulty arises from the fact that the NBV planner is trained and tested in different domains (times of day, weather conditions, and seasons). Existing NBV methods may be confused and deteriorated by the domain-shifts, and require significant efforts for adapting them to a new domain. We address this issue by a novel deep convolutional neural network (DNN) -based NBV planner that does not require the adaptation. Our main contributions in this paper are summarized as follows: (1) We present a novel domain-invariant NBV planner that is specifically tailored for DNN-based VPR. (2) We formulate the active VPR as a POMDP problem and present a feasible solution to address the inherent intractability. Specifically, the probability distribution vector (PDV) output by the available DNN is used as a domain-invariant observation model without the need to retrain it. (3) We verify efficacy of the proposed approach through challenging cross-season VPR experiments, where it is confirmed that the proposed approach clearly outperforms the previous single-view-based or multi-view-based VPR in terms of VPR accuracy and/or action-observation-cost.

#### A Fine-Grained Dataset and Its Efficient Semantic Segmentation for Unstructured Driving Scenarios

Kai Andreas Metzger, Peter Mortimer, Hans J "Joe" Wuensche
Track 3: Computer Vision Robotics and Intelligent Systems
Wed 13 Jan 2021 at 16:30 in session PS T3.5

Auto-TLDR; TAS500: A Semantic Segmentation Dataset for Autonomous Driving in Unstructured Environments

Research in autonomous driving for unstructured environments suffers from a lack of semantically labeled datasets compared to its urban counterpart. Urban and unstructured outdoor environments are challenging due to the varying lighting and weather conditions during a day and across seasons. In this paper, we introduce TAS500, a novel semantic segmentation dataset for autonomous driving in unstructured environments. TAS500 offers fine-grained vegetation and terrain classes to learn drivable surfaces and natural obstacles in outdoor scenes effectively. We evaluate the performance of modern semantic segmentation models with an additional focus on their efficiency. Our experiments demonstrate the advantages of fine-grained semantic classes to improve the overall prediction accuracy, especially along the class boundaries. The dataset, code, and pretrained model are available online.

#### Equation Attention Relationship Network (EARN) : A Geometric Deep Metric Framework for Learning Similar Math Expression Embedding

Saleem Ahmed, Kenny Davila, Srirangaraj Setlur, Venu Govindaraju
Track 4: Document and Media Analysis
Tue 12 Jan 2021 at 17:00 in session PS T4.1

Auto-TLDR; Representational Learning for Similarity Based Retrieval of Mathematical Expressions

Representational Learning in the form of high dimensional embeddings have been used for multiple pattern recognition applications. There has been a significant interest in building embedding based systems for learning representationsin the mathematical domain. At the same time, retrieval of structured information such as mathematical expressions is an important need for modern IR systems. In this work, our motivation is to introduce a robust framework for learning representations for similarity based retrieval of mathematical expressions. Given a query by example, the embedding can find the closest matching expression as a function of euclidean distance between them. We leverage recent advancements in image-based and graph-based deep learning algorithms to learn our similarity embeddings. We do this first, by using uni-modal encoders in graph space and image space and then, a multi-modal combination of the same. To overcome the lack of training data, we force the networks to learn a deep metric using triplets generated with a heuristic scoring function. We also adopt a custom strategy for mining hard samples to train our neural networks. Our system produces rankings similar to those generated by the original scoring function, but using only a fraction of the time. Our results establish the viability of using such a multi-modal embedding for this task.

#### Localization of Unmanned Aerial Vehicles in Corridor Environments Using Deep Learning

Track 3: Computer Vision Robotics and Intelligent Systems
Wed 13 Jan 2021 at 16:30 in session PS T3.6

Auto-TLDR; A monocular vision assisted localization algorithm for indoor corridor environments

We propose a monocular vision assisted localization algorithm, that will help a UAV navigate safely in indoor corridor environments. Always, the aim is to navigate the UAV through a corridor in the forward direction by keeping it at the center with no orientation either to the left or right side. The algorithm makes use of the RGB image, captured from the UAV front camera, and passes it through a trained Deep Neural Network (DNN) to predict the position of the UAV as either on the left or center or right side of the corridor. Depending upon the divergence of the UAV with respect to an imaginary central line, known as the central bisector line (CBL) of the corridor, a suitable command is generated to bring the UAV to the center. When the UAV is at the center of the corridor, a new image is passed through another trained DNN to predict the orientation of the UAV with respect to the CBL of the corridor. If the UAV is either left or right tilted, an appropriate command is generated to rectify the orientation. We also propose a new corridor dataset, named UAVCorV1, which contains images as captured by the UAV front camera when the UAV is at all possible locations of a variety of corridors. An exhaustive set of experiments in different corridors reveal the efficacy of the proposed algorithm.

#### Weight Estimation from an RGB-D Camera in Top-View Configuration

Marco Mameli, Marina Paolanti, Nicola Conci, Filippo Tessaro, Emanuele Frontoni, Primo Zingaretti
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Tue 12 Jan 2021 at 15:00 in session PS T1.2

Auto-TLDR; Top-View Weight Estimation using Deep Neural Networks

The development of so-called soft-biometrics aims at providing information related to the physical and behavioural characteristics of a person. This paper focuses on bodyweight estimation based on the observation from a top-view RGB-D camera. In fact, the capability to estimate the weight of a person can be of help in many different applications, from health-related scenarios to business intelligence and retail analytics. To deal with this issue, a TVWE (Top-View Weight Estimation) framework is proposed with the aim of predicting the weight. The approach relies on the adoption of Deep Neural Networks (DNNs) that have been trained on depth data. Each network has also been modified in its top section to replace classification with prediction inference. The performance of five state-of-art DNNs has been compared, namely VGG16, ResNet, Inception, DenseNet and Efficient-Net. In addition, a convolutional auto-encoder has also been included for completeness. Considering the limited literature in this domain, the TVWE framework has been evaluated on a new publicly available dataset: “VRAI Weight estimation Dataset”, which also collects, for each subject, labels related to weight, gender, and height. The experimental results have demonstrated that the proposed methods are suitable for this task, bringing different and significant insights for the application of the solution in different domains.

#### Nonlinear Ranking Loss on Riemannian Potato Embedding

Byung Hyung Kim, Yoonje Suh, Honggu Lee, Sungho Jo
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Wed 13 Jan 2021 at 14:00 in session PS T1.6

Auto-TLDR; Riemannian Potato for Rank-based Metric Learning

We propose a rank-based metric learning method by leveraging a concept of the Riemannian Potato for better separating non-linear data. By exploring the geometric properties of Riemannian manifolds, the proposed loss function optimizes the measure of dispersion using the distribution of Riemannian distances between a reference sample and neighbors and builds a ranked list according to the similarities. We show the proposed function can learn a hypersphere for each class, preserving the similarity structure inside it on Riemannian manifold. As a result, compared with Euclidean distance-based metric, our method can further jointly reduce the intra-class distances and enlarge the inter-class distances for learned features, consistently outperforming state-of-the-art methods on three widely used non-linear datasets.

#### Generic Merging of Structure from Motion Maps with a Low Memory Footprint

Gabrielle Flood, David Gillsjö, Patrik Persson, Anders Heyden, Kalle Åström
Track 3: Computer Vision Robotics and Intelligent Systems
Wed 13 Jan 2021 at 16:30 in session PS T3.6

Auto-TLDR; A Low-Memory Footprint Representation for Robust Map Merge

With the development of cheap image sensors, the amount of available image data have increased enormously, and the possibility of using crowdsourced collection methods has emerged. This calls for development of ways to handle all these data. In this paper, we present new tools that will enable efficient, flexible and robust map merging. Assuming that separate optimisations have been performed for the individual maps, we show how only relevant data can be stored in a low memory footprint representation. We use these representations to perform map merging so that the algorithm is invariant to the merging order and independent of the choice of coordinate system. The result is a robust algorithm that can be applied to several maps simultaneously. The result of a merge can also be represented with the same type of low-memory footprint format, which enables further merging and updating of the map in a hierarchical way. Furthermore, the method can perform loop closing and also detect changes in the scene between the capture of the different image sequences. Using both simulated and real data — from both a hand held mobile phone and from a drone — we verify the performance of the proposed method.

#### Improving Robotic Grasping on Monocular Images Via Multi-Task Learning and Positional Loss

William Prew, Toby Breckon, Magnus Bordewich, Ulrik Beierholm
Track 3: Computer Vision Robotics and Intelligent Systems
Fri 15 Jan 2021 at 16:00 in session PS T3.11

Auto-TLDR; Improving grasping performance from monocularcolour images in an end-to-end CNN architecture with multi-task learning

In this paper we introduce two methods of improv-ing real-time objecting grasping performance from monocularcolour images in an end-to-end CNN architecture. The first isthe addition of an auxiliary task during model training (multi-task learning). Our multi-task CNN model improves graspingperformance from a baseline average of 72.04% to 78.14% onthe large Jacquard grasping dataset when performing a supple-mentary depth reconstruction task. The second is introducinga positional loss function that emphasises loss per pixel forsecondary parameters (gripper angle and width) only on points ofan object where a successful grasp can take place. This increasesperformance from a baseline average of 72.04% to 78.92% aswell as reducing the number of training epochs required. Thesemethods can be also performed in tandem resulting in a furtherperformance increase to 79.12%, while maintaining sufficientinference speed to enable processing at 50FPS

#### Yolo+FPN: 2D and 3D Fused Object Detection with an RGB-D Camera

Ya Wang
Track 3: Computer Vision Robotics and Intelligent Systems
Tue 12 Jan 2021 at 17:00 in session PS T3.3

Auto-TLDR; Yolo+FPN: Combining 2D and 3D Object Detection for Real-Time Object Detection

In this paper we propose a new deep neural network system, called Yolo+FPN, which fuses both 2D and 3D object detection algorithms to achieve better real-time object detection results and faster inference speed, to be used on real robots. Finding an optimized fusion strategy to efficiently combine 3D object detection with 2D detection information is useful and challenging for both indoor and outdoor robots. In order to satisfy real-time requirements, a trade-off between accuracy and efficiency is needed. We not only have improved training and test accuracies and lower mean losses on the KITTI object detection benchmark, but also achieve better average precision on 3D detection of all classes in three levels of difficulty. Also, we implemented Yolo+FPN system using an RGB-D camera, and compared the speed of 2D and 3D object detection using different GPUs. For the real implementation of both indoor and outdoor scenes, we focus on person detection, which is the most challenging and important among the three classes.

#### Supporting Skin Lesion Diagnosis with Content-Based Image Retrieval

Stefano Allegretti, Federico Bolelli, Federico Pollastri, Sabrina Longhitano, Giovanni Pellacani, Costantino Grana
Track 3: Computer Vision Robotics and Intelligent Systems
Wed 13 Jan 2021 at 16:30 in session PS T3.5

Auto-TLDR; Skin Images Retrieval Using Convolutional Neural Networks for Skin Lesion Classification and Segmentation

Given the relevance of skin cancer, many attempts have been dedicated to the creation of automated devices that could assist both expert and beginner dermatologists towards fast and early diagnosis of skin lesions. In recent years, tasks such as skin lesion classification and segmentation have been extensively addressed with deep learning algorithms, which in some cases reach a diagnostic accuracy comparable to that of expert physicians. However, the general lack of interpretability and reliability severely hinders the ability of those approaches to actually support dermatologists in the diagnosis process. In this paper a novel skin images retrieval system is presented, which exploits features extracted by Convolutional Neural Networks to gather similar images from a publicly available dataset, in order to assist the diagnosis process of both expert and novice practitioners. In the proposed framework, Resnet-50 is initially trained for the classification of dermoscopic images; then, the feature extraction part is isolated, and an embedding network is build on top of it. The embedding learns an alternative representation, which allows to check image similarity by means of a distance measure. Experimental results reveal that the proposed method is able to select meaningful images, which can effectively boost the classification accuracy of human dermatologists.

#### Vehicle Lane Merge Visual Benchmark

Kai Cordes, Hellward Broszio
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Thu 14 Jan 2021 at 12:00 in session PS T1.9

Auto-TLDR; A Benchmark for Automated Cooperative Maneuvering Using Multi-view Video Streams and Ground Truth Vehicle Description

Automated driving is regarded as the most promising technology for improving road safety in the future. In this context, connected vehicles have an important role regarding their ability to perform cooperative maneuvers for challenging traffic situations. We propose a benchmark for automated cooperative maneuvers. The targeted cooperative maneuver is the vehicle lane merge where a vehicle on the acceleration lane merges into the traffic of a motorway. The benchmark enables the evaluation of vehicle localization approaches as well as the study of cooperative maneuvers. It consists of temporally synchronized multi-view video streams, highly accurate camera calibration, and ground truth vehicle descriptions, including position, heading, speed, and shape. For benchmark generation, the lane merge maneuver is performed by human drivers on a test track, resulting in 120 lane merge data sets with various traffic situations and video recording conditions.

#### SSDL: Self-Supervised Domain Learning for Improved Face Recognition

Samadhi Poornima Kumarasinghe Wickrama Arachchilage, Ebroul Izquierdo
Track 2: Biometrics, Human Analysis and Behavior Understanding
Fri 15 Jan 2021 at 15:00 in session PS T2.5

Auto-TLDR; Self-supervised Domain Learning for Face Recognition in unconstrained environments

Face recognition in unconstrained environments is challenging due to variations in illumination, quality of sensing, motion blur and etc. An individual’s face appearance can vary drastically under different conditions creating a gap between train (source) and varying test (target) data. The domain gap could cause decreased performance levels in direct knowledge transfer from source to target. Despite fine-tuning with domain specific data could be an effective solution, collecting and annotating data for all domains is extremely expensive. To this end, we propose a self-supervised domain learning (SSDL) scheme that trains on triplets mined from unlabelled data. A key factor in effective discriminative learning, is selecting informative triplets. Building on most confident predictions, we follow an “easy-to-hard” scheme of alternate triplet mining and self-learning. Comprehensive experiments on four different benchmarks show that SSDL generalizes well on different domains.

#### DAIL: Dataset-Aware and Invariant Learning for Face Recognition

Gaoang Wang, Chen Lin, Tianqiang Liu, Mingwei He, Jiebo Luo
Track 2: Biometrics, Human Analysis and Behavior Understanding
Fri 15 Jan 2021 at 15:00 in session PS T2.5

Auto-TLDR; DAIL: Dataset-Aware and Invariant Learning for Face Recognition

To achieve good performance in face recognition, a large scale training dataset is usually required. A simple yet effective way for improving the recognition performance is to use a dataset as large as possible by combining multiple datasets in the training. However, it is problematic and troublesome to naively combine different datasets due to two major issues. Firstly, the same person can possibly appear in different datasets, leading to the identity overlapping issue between different datasets. Natively treating the same person as different classes in different datasets during training will affect back-propagation and generate non-representative embeddings. On the other hand, manually cleaning labels will take a lot of human efforts, especially when there are millions of images and thousands of identities. Secondly, different datasets are collected in different situations and thus will lead to different domain distributions. Natively combining datasets will lead to domain distribution differences and make it difficult to learn domain invariant embeddings across different datasets. In this paper, we propose DAIL: Dataset-Aware and Invariant Learning to resolve the above-mentioned issues. To solve the first issue of identity overlapping, we propose a dataset-aware loss for multi-dataset training by reducing the penalty when the same person appears in multiple datasets. This can be readily achieved with a modified softmax loss with a dataset-aware term. To solve the second issue, the domain adaptation with gradient reversal layers is employed for dataset invariant learning. The proposed approach not only achieves state-of-the-art results on several commonly used face recognition validation sets, like LFW, CFP-FP, AgeDB-30, but also shows great benefit for practical usage.

#### RGB-Infrared Person Re-Identification Via Image Modality Conversion

Huangpeng Dai, Qing Xie, Yanchun Ma, Yongjian Liu, Shengwu Xiong
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Fri 15 Jan 2021 at 15:00 in session PS T1.13

Auto-TLDR; CE2L: A Novel Network for Cross-Modality Re-identification with Feature Alignment

As a cross modality retrieval task, RGB-infrared person re-identification(Re-ID) is an important and challenging tasking, because of its important role in video surveillance applications and large cross-modality variations between visible and infrared images. Most previous works addressed the problem of cross-modality gap with feature alignment by original feature representation learning straightly. In this paper, different from existing works, we propose a novel network(CE2L) to tackle the cross-modality gap with feature alignment. CE2L mainly focuses on adding discriminative information and learning robust features by converting modality between visible and infrared images. Its merits are highlighted in two aspects: 1)Using CycleGAN to convert infrared images into color images can not only increase the recognition characteristics of images, but also allow the our network to better learn the two modal image features; 2)Our novel method can serve as data augmentation. Specifically, it can increase data diversity and total data against over-fitting by converting labeled training images to another modal images. Extensive experimental results on two datasets demonstrate superior performance compared to the baseline and the state-of-the-art methods.

#### On Identification and Retrieval of Near-Duplicate Biological Images: A New Dataset and Protocol

Thomas E. Koker, Sai Spandana Chintapalli, San Wang, Blake A. Talbot, Daniel Wainstock, Marcelo Cicconet, Mary C. Walsh
Track 4: Document and Media Analysis
Tue 12 Jan 2021 at 17:00 in session PS T4.1

Auto-TLDR; BINDER: Bio-Image Near-Duplicate Examples Repository for Image Identification and Retrieval

Manipulation and re-use of images in scientific publications is a growing issue, not only for biomedical publishers, but also for the research community in general. In this work we introduce BINDER -- Bio-Image Near-Duplicate Examples Repository, a novel dataset to help researchers develop, train, and test models to detect same-source biomedical images. BINDER contains 7,490 unique image patches for model training, 1,821 same-size patch duplicates for validation and testing, and 868 different-size image/patch pairs for image retrieval validation and testing. Except for the training set, patches already contain manipulations including rotation, translation, scale, perspective transform, contrast adjustment and/or compression artifacts. We further use the dataset to demonstrate how novel adaptations of existing image retrieval and metric learning models can be applied to achieve high-accuracy inference results, creating a baseline for future work. In aggregate, we thus present a supervised protocol for near-duplicate image identification and retrieval without any "real-world" training example. Our dataset and source code are available at hms-idac.github.io/BINDER.

#### Pose Variation Adaptation for Person Re-Identification

Lei Zhang, Na Jiang, Qishuai Diao, Yue Xu, Zhong Zhou, Wei Wu
Track 3: Computer Vision Robotics and Intelligent Systems
Thu 14 Jan 2021 at 12:00 in session PS T3.7

Auto-TLDR; Pose Transfer Generative Adversarial Network for Person Re-identification

Person re-identification (reid) plays an important role in surveillance video analysis, especially for criminal investigation and intelligent security. Although a large number of effective feature or distance metric learning approaches have been proposed, it still suffers from pedestrians appearance variations caused by pose changing. Most of the previous methods address this problem by learning a pose-invariant descriptor subspace. In this paper, we propose a pose variation adaptation method for person reid in the view of data augmentation. It can reduce the probability of deep learning network over-fitting. Specifically, we introduce a pose transfer generative adversarial network with a similarity measurement constraint. With the learned pose transfer model, training images can be pose-transferred to any given poses, and along with the original images, form a augmented training dataset. It increases data diversity against over-fitting. In contrast to previous GAN-based methods, we consider the influence of pose variations on similarity measure to generate more realistic and shaper samples for person reid. Besides, we optimize hard example mining to introduce a novel manner of samples (pose-transferred images) used with the learned pose transfer model. It focuses on the inferior samples which are caused by pose variations to increase the number of effective hard examples for learning discriminative features and improve the generalization ability. We extensively conduct comparative evaluations to demonstrate the advantages and superiority of our proposed method over the state-of-the-art approaches on Market-1501 and DukeMTMC-reID, the rank-1 accuracy is 96.1% for Market-1501 and 92.0% for DukeMTMC-reID.

#### A Systematic Investigation on Deep Architectures for Automatic Skin Lesions Classification

Pierluigi Carcagni, Marco Leo, Andrea Cuna, Giuseppe Celeste, Cosimo Distante
Track 5: Image and Signal Processing
Wed 13 Jan 2021 at 16:30 in session PS T5.4

Auto-TLDR; RegNet: Deep Investigation of Convolutional Neural Networks for Automatic Classification of Skin Lesions

Computer vision-based techniques are more and more employed in healthcare and medical fields nowadays in order, principally, to be as a support to the experienced medical staff to help them to make a quick and correct diagnosis. One of the hot topics in this arena concerns the automatic classification of skin lesions. Several promising works exist about it, mainly leveraging Convolutional Neural Networks (CNN), but proposed pipeline mainly rely on complex data preprocessing and there is no systematic investigation about how available deep models can actually reach the accuracy needed for real applications. In order to overcome these drawbacks, in this work, an end-to-end pipeline is introduced and some of the most recent Convolutional Neural Networks (CNNs) architectures are included in it and compared on the largest common benchmark dataset recently introduced. To this aim, for the first time in this application context, a new network design paradigm, namely RegNet, has been exploited to get the best models among a population of configurations. The paper introduces a threefold level of contribution and novelty with respect the previous literature: the deep investigation of several CNN architectures driving to a consistent improvement of the lesions recognition accuracy, the exploitation of a new network design paradigm able to study the behavior of populations of models and a deep discussion about pro and cons of each analyzed method paving the path towards new research lines.

#### How Important Are Faces for Person Re-Identification?

Julia Dietlmeier, Joseph Antony, Kevin Mcguinness, Noel E O'Connor
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Fri 15 Jan 2021 at 16:00 in session PS T1.15

Auto-TLDR; Anonymization of Person Re-identification Datasets with Face Detection and Blurring

This paper investigates the dependence of existing state-of-the-art person re-identification models on the presence and visibility of human faces. We apply a face detection and blurring algorithm to create anonymized versions of several popular person re-identification datasets including Market1501, DukeMTMC-reID, CUHK03, Viper, and Airport. Using a cross-section of existing state-of-the-art models that range in accuracy and computational efficiency, we evaluate the effect of this anonymization on re-identification performance using standard metrics. Perhaps surprisingly, the effect on mAP is very small, and accuracy is recovered by simply training on the anonymized versions of the data rather than the original data. These findings are consistent across multiple models and datasets. These results indicate that datasets can be safely anonymized by blurring faces without significantly impacting the performance of person re-identification systems, and may allow for the release of new richer re-identification datasets where previously there were privacy or data protection concerns.

#### Enhancing Deep Semantic Segmentation of RGB-D Data with Entangled Forests

Matteo Terreran, Elia Bonetto, Stefano Ghidoni
Track 3: Computer Vision Robotics and Intelligent Systems
Tue 12 Jan 2021 at 17:00 in session PS T3.3

Auto-TLDR; FuseNet: A Lighter Deep Learning Model for Semantic Segmentation

Semantic segmentation is a problem which is getting more and more attention in the computer vision community. Nowadays, deep learning methods represent the state of the art to solve this problem, and the trend is to use deeper networks to get higher performance. The drawback with such models is a higher computational cost, which makes it difficult to integrate them on mobile robot platforms. In this work we want to explore how to obtain lighter deep learning models without compromising performance. To do so we will consider the features used in the Entangled Random Forest algorithm and we will study the best strategies to integrate these within FuseNet deep network. Such new features allow us to shrink the network size without loosing performance, obtaining hence a lighter model which achieves state-of-the-art performance on the semantic segmentation task and represents an interesting alternative for mobile robotics applications, where computational power and energy are limited.

#### Multi-Scale Cascading Network with Compact Feature Learning for RGB-Infrared Person Re-Identification

Can Zhang, Hong Liu, Wei Guo, Mang Ye
Track 5: Image and Signal Processing
Wed 13 Jan 2021 at 12:00 in session PS T5.3

Auto-TLDR; Multi-Scale Part-Aware Cascading for RGB-Infrared Person Re-identification

RGB-Infrared person re-identification (RGB-IR Re-ID) aims to matching persons from heterogeneous images captured by visible and thermal cameras, which is of great significance in surveillance system under poor light conditions. Facing great challenges in complex variances including conventional single-modality and additional inter-modality discrepancies, most of existing RGB-IR Re-ID methods directly work on global features for simultaneous elimination, whereas modality-specific noises and modality-shared features are not well considered. To address these issues, a novel Multi-Scale Part-Aware Cascading framework (MSPAC) is formulated by aggregating multi-scale fine-grained features from part to global in a cascading manner, which results in an unified representation robust to noises. Moreover, a marginal exponential center (MeCen) loss is introduced to jointly eliminate mixed variances, which enables to model cross-modality correlations on sharable salient features. Extensive experiments are conducted for demonstration that the proposed method outperforms all the state-of-the-arts by a large margin.

#### HPERL: 3D Human Pose Estimastion from RGB and LiDAR

Michael Fürst, Shriya T.P. Gupta, René Schuster, Oliver Wasenmüler, Didier Stricker
Track 3: Computer Vision Robotics and Intelligent Systems
Tue 12 Jan 2021 at 17:00 in session PS T3.3

Auto-TLDR; 3D Human Pose Estimation Using RGB and LiDAR Using Weakly-Supervised Approach

In-the-wild human pose estimation has a huge potential for various fields, ranging from animation and action recognition to intention recognition and prediction for autonomous driving. The current state-of-the-art is focused only on RGB and RGB-D approaches for predicting the 3D human pose. However, not using precise LiDAR depth information limits the performance and leads to very inaccurate absolute pose estimation. With LiDAR sensors becoming more affordable and common on robots and autonomous vehicle setups, we propose an end-to-end architecture using RGB and LiDAR to predict the absolute 3D human pose with unprecedented precision. Additionally, we introduce a weakly-supervised approach to generate 3D predictions using 2D pose annotations from PedX. This allows for many new opportunities in the field of 3D human pose estimation.

#### Object-Oriented Map Exploration and Construction Based on Auxiliary Task Aided DRL

Junzhe Xu, Jianhua Zhang, Shengyong Chen, Honghai Liu
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Wed 13 Jan 2021 at 12:00 in session PS T1.4

Auto-TLDR; Auxiliary Task Aided Deep Reinforcement Learning for Environment Exploration by Autonomous Robots

Environment exploration by autonomous robots through deep reinforcement learning (DRL) based methods has attracted more and more attention. However, existing methods usually focus on robot navigation to single or multiple fixed goals, while ignoring the perception and construction of external environments. In this paper, we propose a novel environment exploration task based on DRL, which requires a robot fast and completely perceives all objects of interest, and reconstructs their poses in a global environment map, as much as the robot can do. To this end, we design an auxiliary task aided DRL model, which is integrated with the auxiliary object detection and 6-DoF pose estimation components. The outcome of auxiliary tasks can improve the learning speed and robustness of DRL, as well as the accuracy of object pose estimation. Comprehensive experimental results on the indoor simulation platform AI2-THOR have shown the effectiveness and robustness of our method.

#### Multiple Future Prediction Leveraging Synthetic Trajectories

Lorenzo Berlincioni, Federico Becattini, Lorenzo Seidenari, Alberto Del Bimbo
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Thu 14 Jan 2021 at 14:00 in session PS T1.11

Auto-TLDR; Synthetic Trajectory Prediction using Markov Chains

Trajectory prediction is an important task, especially in autonomous driving. The ability to forecast the position of other moving agents can yield to an effective planning, ensuring safety for the autonomous vehicle as well for the observed entities. In this work we propose a data driven approach based on Markov Chains to generate synthetic trajectories, which are useful for training a multiple future trajectory predictor. The advantages are twofold: on the one hand synthetic samples can be used to augment existing datasets and train more effective predictors; on the other hand, it allows to generate samples with multiple ground truths, corresponding to diverse equally likely outcomes of the observed trajectory. We define a trajectory prediction model and a loss that explicitly address the multimodality of the problem and we show that combining synthetic and real data leads to prediction improvements, obtaining state of the art results.

#### Attention Based Coupled Framework for Road and Pothole Segmentation

Shaik Masihullah, Ritu Garg, Prerana Mukherjee, Anupama Ray
Track 3: Computer Vision Robotics and Intelligent Systems
Tue 12 Jan 2021 at 17:00 in session PS T3.2

Auto-TLDR; Few Shot Learning for Road and Pothole Segmentation on KITTI and IDD

In this paper, we propose a novel attention based coupled framework for road and pothole segmentation. In many developing countries as well as in rural areas, the drivable areas are neither well-defined, nor well-maintained. Under such circumstances, an Advance Driver Assistant System (ADAS) is needed to assess the drivable area and alert about the potholes ahead to ensure vehicle safety. Moreover, this information can also be used in structured environments for assessment and maintenance of road health. We demonstrate few shot learning approach for pothole detection to leverage accuracy even with fewer training samples. We report the exhaustive experimental results for road segmentation on KITTI and IDD datasets. We also present pothole segmentation on IDD.

#### Polarimetric Image Augmentation

Marc Blanchon, Fabrice Meriaudeau, Olivier Morel, Ralph Seulin, Desire Sidibe
Track 3: Computer Vision Robotics and Intelligent Systems
Tue 12 Jan 2021 at 17:00 in session PS T3.3

Auto-TLDR; Polarimetric Augmentation for Deep Learning in Robotics Applications

This paper deals with new augmentation methods for an unconventional imaging modality sensitive to the physics of the observed scene called polarimetry. In nature, polarized light is obtained by reﬂection or scattering. Robotics applications in urban environments are subject to many obstacles that can be specular and therefore provide polarized light. These areas are prone to segmentation errors using standard modalities but could be solved using information carried by the polarized light. Deep Convolutional Neural Networks (DCNNs) have shown excellent segmentation results, but require a signiﬁcant amount of data to achieve best performances. The lack of data is usually overcomed by using augmentation methods. However, unlike RGB images, polarization images are not only scalar (intensity) images and standard augmentation techniques cannot be applied straightforwardly. We propose enhancing deep learning models through a regularized augmentation procedure applied to polarimetric data in order to characterize scenes more effectively under challenging conditions. We subsequently observe an average of 18.1% improvement in IoU between not augmented and regularized training procedures on real world data.