#### A Distinct Discriminant Canonical Correlation Analysis Network Based Deep Information Quality Representation for Image Classification

Lei Gao, Zheng Guo, Ling Guan Ling Guan
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Tue 12 Jan 2021 at 15:00 in session PS T1.2

Auto-TLDR; DDCCANet: Deep Information Quality Representation for Image Classification

In this paper, we present a distinct discriminant canonical correlation analysis network (DDCCANet) based deep information quality representation with application to image classification. Specifically, to explore the sufficient discriminant information between different data sets, the within-class and between-class correlation matrices are employed and optimized jointly. Moreover, different from the existing canonical correlation analysis network (CCANet) and related algorithms, an information theoretic descriptor, information quality (IQ), is adopted to generate the deep-level feature representation for image classification. Benefiting from the explored discriminant information and IQ descriptor, it is potential to gain a more effective deep-level representation from multi-view data sets, leading to improved performance in classification tasks. To demonstrate the effectiveness of the proposed DDCCANet, we conduct experiments on the Olivetti Research Lab (ORL) face database, ETH80 database and CIFAR10 database. Experimental results show the superiority of the proposed solution on image classification.

## Similar papers

#### Embedding Shared Low-Rank and Feature Correlation for Multi-View Data Analysis

Zhan Wang, Lizhi Wang, Hua Huang
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Tue 12 Jan 2021 at 15:00 in session PS T1.1

Auto-TLDR; embedding shared low-rank and feature correlation for multi-view data analysis

The diversity of multimedia data in the real-world usually forms multi-view features. How to explore the structure information and correlations among multi-view features is still an open problem. In this paper, we propose a novel multi-view subspace learning method, named embedding shared low-rank and feature correlation (ESLRFC), for multi-view data analysis. First, in the embedding subspace, we propose a robust low-rank model on each feature set and enforce a shared low-rank constraint to characterize the common structure information of multiple feature data. Second, we develop an enhanced correlation analysis in the embedding subspace for simultaneously removing the redundancy of each feature set and exploring the correlations of multiple feature data. Finally, we incorporate the low-rank model and the correlation analysis into a unified framework. The shared low-rank constraint not only depicts the data distribution consistency among multiple feature data, but also assists robust subspace learning. Experimental results on recognition tasks demonstrate the superior performance and noise robustness of the proposed method.

#### Feature Extraction by Joint Robust Discriminant Analysis and Inter-Class Sparsity

Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Tue 12 Jan 2021 at 15:00 in session OS T1.3

Auto-TLDR; Robust Discriminant Analysis with Feature Selection and Inter-class Sparsity (RDA_FSIS)

Feature extraction methods have been successfully applied to many real-world applications. The classical Linear Discriminant Analysis (LDA) and its variants are widely used as feature extraction methods. Although they have been used for different classification tasks, these methods have some shortcomings. The main one is that the projection axes obtained are not informative about the relevance of original features. In this paper, we propose a linear embedding method that merges two interesting properties: Robust LDA and inter-class sparsity. Furthermore, the targeted projection transformation focuses on the most discriminant original features. The proposed method is called Robust Discriminant Analysis with Feature Selection and Inter-class Sparsity (RDA_FSIS). Two kinds of sparsity are explicitly included in the proposed model. The first kind is obtained by imposing the $\ell_{2,1}$ constraint on the projection matrix in order to perform feature ranking. The second kind is obtained by imposing the inter-class sparsity constraint used for getting a common sparsity structure in each class. Comprehensive experiments on five real-world image datasets demonstrate the effectiveness and advantages of our framework over existing linear methods.

#### Joint Learning Multiple Curvature Descriptor for 3D Palmprint Recognition

Lunke Fei, Bob Zhang, Jie Wen, Chunwei Tian, Peng Liu, Shuping Zhao
Track 2: Biometrics, Human Analysis and Behavior Understanding
Wed 13 Jan 2021 at 12:00 in session PS T2.2

Auto-TLDR; Joint Feature Learning for 3D palmprint recognition using curvature data vectors

3D palmprint-based biometric recognition has drawn growing research attention due to its several merits over 2D counterpart such as robust structural measurement of a palm surface and high anti-counterfeiting capability. However, most existing 3D palmprint descriptors are hand-crafted that usually extract stationary features from 3D palmprint images. In this paper, we propose a feature learning method to jointly learn compact curvature feature descriptor for 3D palmprint recognition. We first form multiple curvature data vectors to completely sample the intrinsic curvature information of 3D palmprint images. Then, we jointly learn a feature projection function that project curvature data vectors into binary feature codes, which have the maximum inter-class variances and minimum intra-class distance so that they are discriminative. Moreover, we learn the collaborative binary representation of the multiple curvature feature codes by minimizing the information loss between the final representation and the multiple curvature features, so that the proposed method is more compact in feature representation and efficient in matching. Experimental results on the baseline 3D palmprint database demonstrate the superiority of the proposed method in terms of recognition performance in comparison with state-of-the-art 3D palmprint descriptors.

#### Low Rank Representation on Product Grassmann Manifolds for Multi-viewSubspace Clustering

Jipeng Guo, Yanfeng Sun, Junbin Gao, Yongli Hu, Baocai Yin
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Fri 15 Jan 2021 at 15:00 in session PS T1.13

Auto-TLDR; Low Rank Representation on Product Grassmann Manifold for Multi-View Data Clustering

Clustering high dimension multi-view data with complex intrinsic properties and nonlinear manifold structure is a challenging task since these data are always embedded in low dimension manifolds. Inspired by Low Rank Representation (LRR), some researchers extended classic LRR on Grassmann manifold or Product Grassmann manifold to represent data with non-linear metrics. However, most of these methods utilized convex nuclear norm to leverage a low-rank structure, which was over-relaxation of true rank and would lead to the results deviated from the true underlying ones. And, the computational complexity of singular value decomposition of matrix is high for nuclear norm minimization. In this paper, we propose a new low rank model for high-dimension multi-view data clustering on Product Grassmann Manifold with the matrix tri-factorization which is used to control the upper bound of true rank of representation matrix. And, the original problem can be transformed into the nuclear norm minimization with smaller scale matrices. An effective solution and theoretical analysis are also provided. The experimental results show that the proposed method obviously outperforms other state-of-the-art methods on several multi-source human/crowd action video datasets.

#### Feature Extraction and Selection Via Robust Discriminant Analysis and Class Sparsity

Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Fri 15 Jan 2021 at 16:00 in session PS T1.15

Auto-TLDR; Hybrid Linear Discriminant Embedding for supervised multi-class classification

The main goal of discriminant embedding is to extract features that can be compact and informative representations of the original set of features. This paper introduces a hybrid scheme for linear feature extraction for supervised multi-class classification. We introduce a unifying criterion that is able to retain the advantages of robust sparse LDA and Inter-class sparsity. Thus, the estimated transformation includes two types of discrimination which are the inter-class sparsity and robust Linear Discriminant Analysis with feature selection. In order to optimize the proposed objective function, we deploy an iterative alternating minimization scheme for estimating the linear transformation and the orthogonal matrix. The introduced scheme is generic in the sense that it can be used for combining and tuning many other linear embedding methods. In the lights of the experiments conducted on six image datasets including faces, objects, and digits, the proposed scheme was able to outperform competing methods in most of the cases.

#### Local Grouped Invariant Order Pattern for Grayscale-Inversion and Rotation Invariant Texture Classification

Yankai Huang, Tiecheng Song, Shuang Li, Yuanjing Han
Track 5: Image and Signal Processing
Thu 14 Jan 2021 at 12:00 in session PS T5.5

Auto-TLDR; Local grouped invariant order pattern for grayscale-inversion and rotation invariant texture classification

Local binary pattern (LBP) based descriptors have shown effectiveness for texture classification. However, most of them encode the intensity relationships between neighboring pixels and a central pixel into binary forms, thereby failing to capture the complete ordering information among neighbors. Several methods have explored intensity order information for feature description, but they do not address the grayscale-inversion problem. In this paper, we propose an image descriptor called local grouped invariant order pattern (LGIOP) for grayscale-inversion and rotation invariant texture classification. Our LGIOP is a histogram representation which jointly encodes neighboring order information and central pixels. In particular, two new order encoding methods, i.e., intensity order encoding and distance order encoding, are proposed to describe the neighboring relationships. These two order encoding methods are not only complementary but also invariant to grayscale-inversion and rotation changes. Experiments for texture classification demonstrate that the proposed LGIOP descriptor is robust to (linear or nonlinear) grayscale inversion and image rotation.

#### Local Binary Quaternion Rotation Pattern for Colour Texture Retrieval

Hela Jebali, Noel Richard, Mohamed Naouai
Track 3: Computer Vision Robotics and Intelligent Systems
Tue 12 Jan 2021 at 17:00 in session PS T3.3

Auto-TLDR; Local Binary Quaternion Rotation Pattern for Color Texture Classification

Color is very important feature for image representation, it assumes very essential in the human visual recognition process. Most existing approaches usually extract features from the three color channels separately (Marginal way). Although, it exists few vector expressions of texture features. Aware of the high interaction that exists between different channels in the color image, this work introduces a compact texture descriptor, named Local Binary Quaternion Rotation Pattern (LBQRP). In this LBQRP purpose, the quaternion representation is used to represent color texture. The distance between two color can be expressed as the angle of rotation between two unit quaternions using the geodesic distance. After a LBQRP coding, local histograms are extracted and used as features. Experiments on three challenging color datasets: Vistex, Outex-TC13 and USPtex are carried out to evaluate the LBQRP performance in texture classification. Results show the high efficiency of the proposed approach facing to several stat-of-art methods.

#### Soft Label and Discriminant Embedding Estimation for Semi-Supervised Classification

Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Fri 15 Jan 2021 at 16:00 in session PS T1.15

Auto-TLDR; Semi-supervised Semi-Supervised Learning for Linear Feature Extraction and Label Propagation

In recent times, graph-based semi-supervised learning proved to be a powerful paradigm for processing and mining large datasets. The main advantage relies on the fact that these methods can be useful in propagating a small set of known labels to a large set of unlabeled data. The scarcity of labeled data may affect the performance of the semi-learning. This paper introduces a new semi-supervised framework for simultaneous linear feature extraction and label propagation. The proposed method simultaneously estimates a discriminant transformation and the unknown label by exploiting both labeled and unlabeled data. In addition, the unknowns of the learning model are estimated by integrating two types of graph-based smoothness constraints. The resulting semi-supervised model is expected to learn more discriminative information. Experiments are conducted on six public image datasets. These experimental results show that the performance of the proposed method can be better than that of many state-of-the-art graph-based semi-supervised algorithms.

#### Subspace Clustering Via Joint Unsupervised Feature Selection

Wenhua Dong, Xiaojun Wu, Hui Li, Zhenhua Feng, Josef Kittler
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Wed 13 Jan 2021 at 12:00 in session PS T1.3

Auto-TLDR; Unsupervised Feature Selection for Subspace Clustering

Any high-dimensional data arising from practical applications usually contains irrelevant features, which may impact on the performance of existing subspace clustering methods. This paper proposes a novel subspace clustering method, which reconstructs the feature matrix by the means of unsupervised feature selection (UFS) to achieve a better dictionary for subspace clustering (SC). Different from most existing clustering methods, the proposed approach uses a reconstructed feature matrix as the dictionary rather than the original data matrix. As the feature matrix reconstructed by representative features is more discriminative and closer to the ground-truth, it results in improved performance. The corresponding non-convex optimization problem is effectively solved using the half-quadratic and augmented Lagrange multiplier methods. Extensive experiments on four real datasets demonstrate the effectiveness of the proposed method.

#### Cross-spectrum Face Recognition Using Subspace Projection Hashing

Hanrui Wang, Xingbo Dong, Jin Zhe, Jean-Luc Dugelay, Massimo Tistarelli
Track 2: Biometrics, Human Analysis and Behavior Understanding
Wed 13 Jan 2021 at 12:00 in session PS T2.2

Auto-TLDR; Subspace Projection Hashing for Cross-Spectrum Face Recognition

Cross-spectrum face recognition, e.g. visible to thermal matching, remains a challenging task due to the large variation originated from different domains. This paper proposed a subspace projection hashing (SPH) to enable the cross-spectrum face recognition task. The intrinsic idea behind SPH is to project the features from different domains onto a common subspace, where matching the faces from different domains can be accomplished. Notably, we proposed a new loss function that can (i) preserve both inter-domain and intra-domain similarity; (ii) regularize a scaled-up pairwise distance between hashed codes, to optimize projection matrix. Three datasets, Wiki, EURECOM VIS-TH paired face and TDFace are adopted to evaluate the proposed SPH. The experimental results indicate that the proposed SPH outperforms the original linear subspace ranking hashing (LSRH) in the benchmark dataset (Wiki) and demonstrates a reasonably good performance for visible-thermal, visible-near-infrared face recognition, therefore suggests the feasibility and effectiveness of the proposed SPH.

#### Face Anti-Spoofing Using Spatial Pyramid Pooling

Lei Shi, Zhuo Zhou, Zhenhua Guo
Track 2: Biometrics, Human Analysis and Behavior Understanding
Wed 13 Jan 2021 at 12:00 in session PS T2.2

Auto-TLDR; Spatial Pyramid Pooling for Face Anti-Spoofing

Face recognition system is vulnerable to many kinds of presentation attacks, so how to effectively detect whether the image is from the real face is particularly important. At present, many deep learning-based anti-spoofing methods have been proposed. But these approaches have some limitations, for example, global average pooling (GAP) easily loses local information of faces, single-scale features easily ignore information differences in different scales, while a complex network is prune to be overfitting. In this paper, we propose a face anti-spoofing approach using spatial pyramid pooling (SPP). Firstly, we use ResNet-18 with a small amount of parameter as the basic model to avoid overfitting. Further, we use spatial pyramid pooling module in the single model to enhance local features while fusing multi-scale information. The effectiveness of the proposed method is evaluated on three databases, CASIA-FASD, Replay-Attack and CASIA-SURF. The experimental results show that the proposed approach can achieve state-of-the-art performance.

#### Appliance Identification Using a Histogram Post-Processing of 2D Local Binary Patterns for Smart Grid Applications

Yassine Himeur, Abdullah Alsalemi, Faycal Bensaali, Abbes Amira
Track 5: Image and Signal Processing
Fri 15 Jan 2021 at 16:00 in session PS T5.8

Auto-TLDR; LBP-BEVM based Local Binary Patterns for Appliances Identification in the Smart Grid

Identifying domestic appliances in the smart grid leads to a better power usage management and further helps in detecting appliance-level abnormalities. An efficient identification can be achieved only if a robust feature extraction scheme is developed with a high ability to discriminate between different appliances on the smart grid. Accordingly, we propose in this paper a novel method to extract electrical power signatures after transforming the power signal to 2D space, which has more encoding possibilities. Following, an improved local binary patterns (LBP) is proposed that relies on improving the discriminative ability of conventional LBP using a post-processing stage. A binarized eigenvalue map (BEVM) is extracted from the 2D power matrix and then used to post-process the generated LBP representation. Next, two histograms are constructed, namely up and down histograms, and are then concatenated to form the global histogram. A comprehensive performance evaluation is performed on two different datasets, namely the GREEND and WITHED, in which power data were collected at 1 Hz and 44000 Hz sampling rates, respectively. The obtained results revealed the superiority of the proposed LBP-BEVM based system in terms of the identification performance versus other 2D descriptors and existing identification frameworks.

#### A Local Descriptor with Physiological Characteristic for Finger Vein Recognition

Liping Zhang, Weijun Li, Ning Xin
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Wed 13 Jan 2021 at 12:00 in session PS T1.4

Auto-TLDR; Finger vein-specific local feature descriptors based physiological characteristic of finger vein patterns

Local feature descriptors exhibit great superiority in finger vein recognition due to their stability and robustness against local changes in images. However, most of these are methods use general-purpose descriptors that do not consider finger vein-specific features. In this work, we propose a finger vein-specific local feature descriptors based physiological characteristic of finger vein patterns, i.e., histogram of oriented physiological Gabor responses (HOPGR), for finger vein recognition. First, a prior of directional characteristic of finger vein patterns is obtained in an unsupervised manner. Then the physiological Gabor filter banks are set up based on the prior information to extract the physiological responses and orientation. Finally, to make the feature robust against local changes in images, a histogram is generated as output by dividing the image into non-overlapping cells and overlapping blocks. Extensive experimental results on several databases clearly demonstrate that the proposed method outperforms most current state-of-the-art finger vein recognition methods.

#### Face Anti-Spoofing Based on Dynamic Color Texture Analysis Using Local Directional Number Pattern

Junwei Zhou, Ke Shu, Peng Liu, Jianwen Xiang, Shengwu Xiong
Track 2: Biometrics, Human Analysis and Behavior Understanding
Thu 14 Jan 2021 at 12:00 in session PS T2.4

Auto-TLDR; LDN-TOP Representation followed by ProCRC Classification for Face Anti-Spoofing

Face anti-spoofing is becoming increasingly indispensable for face recognition systems, which are vulnerable to various spoofing attacks performed using fake photos and videos. In this paper, a novel "LDN-TOP representation followed by ProCRC classification" pipeline for face anti-spoofing is proposed. We use local directional number pattern (LDN) with the derivative-Gaussian mask to capture detailed appearance information resisting illumination variations and noises, which can influence the texture pattern distribution. To further capture motion information, we extend LDN to a spatial-temporal variant named local directional number pattern from three orthogonal planes (LDN-TOP). The multi-scale LDN-TOP capturing complete information is extracted from color images to generate the feature vector with powerful representation capacity. Finally, the feature vector is fed into the probabilistic collaborative representation based classifier (ProCRC) for face anti-spoofing. Our method is evaluated on three challenging public datasets, namely CASIA FASD, Replay-Attack database, and UVAD database using sequence-based evaluation protocol. The experimental results show that our method can achieve promising performance with 0.37% EER on CASIA and 5.73% HTER on UVAD. The performance on Replay-Attack database is also competitive.

#### Two-Level Attention-Based Fusion Learning for RGB-D Face Recognition

Track 2: Biometrics, Human Analysis and Behavior Understanding
Tue 12 Jan 2021 at 17:00 in session PS T2.1

Auto-TLDR; Fused RGB-D Facial Recognition using Attention-Aware Feature Fusion

With recent advances in RGB-D sensing technologies as well as improvements in machine learning and fusion techniques, RGB-D facial recognition has become an active area of research. A novel attention aware method is proposed to fuse two image modalities, RGB and depth, for enhanced RGB-D facial recognition. The proposed method first extracts features from both modalities using a convolutional feature extractor. These features are then fused using a two layer attention mechanism. The first layer focuses on the fused feature maps generated by the feature extractor, exploiting the relationship between feature maps using LSTM recurrent learning. The second layer focuses on the spatial features of those maps using convolution. The training database is preprocessed and augmented through a set of geometric transformations, and the learning process is further aided using transfer learning from a pure 2D RGB image training process. Comparative evaluations demonstrate that the proposed method outperforms other state-of-the-art approaches, including both traditional and deep neural network-based methods, on the challenging CurtinFaces and IIIT-D RGB-D benchmark databases, achieving classification accuracies over 98.2% and 99.3% respectively. The proposed attention mechanism is also compared with other attention mechanisms, demonstrating more accurate results.

#### Color Texture Description Based on Holistic and Hierarchical Order-Encoding Patterns

Tiecheng Song, Jie Feng, Yuanlin Wang, Chenqiang Gao
Track 5: Image and Signal Processing
Wed 13 Jan 2021 at 12:00 in session PS T5.3

Auto-TLDR; Holistic and Hierarchical Order-Encoding Patterns for Color Texture Classification

Local binary pattern (LBP), as one of the most representative texture operators, has attracted much attention in computer vision applications. Many LBP variants were developed in the literature. However, most of them were designed for gray images and their performance remains to be improved for color images. In this paper, we propose a novel color image descriptor named Holistic and Hierarchical Order-Encoding Patterns (H2OEP) for texture classification. In H2OEP, the holistic order-encoding pattern compactly encodes color order variation tendencies for each pixel in color space. The hierarchical order-encoding pattern leverages min ordering, median ordering and max ordering to encode local neighboring relationships across different color channels. Finally, the generated order-encoding patterns are aggregated via central pixel encoding to build 3D joint histograms for image representation. Experiments on four benchmark texture databases demonstrate the effectiveness of the proposed descriptor for color texture classification.

#### RGB-Infrared Person Re-Identification Via Image Modality Conversion

Huangpeng Dai, Qing Xie, Yanchun Ma, Yongjian Liu, Shengwu Xiong
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Fri 15 Jan 2021 at 15:00 in session PS T1.13

Auto-TLDR; CE2L: A Novel Network for Cross-Modality Re-identification with Feature Alignment

As a cross modality retrieval task, RGB-infrared person re-identification(Re-ID) is an important and challenging tasking, because of its important role in video surveillance applications and large cross-modality variations between visible and infrared images. Most previous works addressed the problem of cross-modality gap with feature alignment by original feature representation learning straightly. In this paper, different from existing works, we propose a novel network(CE2L) to tackle the cross-modality gap with feature alignment. CE2L mainly focuses on adding discriminative information and learning robust features by converting modality between visible and infrared images. Its merits are highlighted in two aspects: 1)Using CycleGAN to convert infrared images into color images can not only increase the recognition characteristics of images, but also allow the our network to better learn the two modal image features; 2)Our novel method can serve as data augmentation. Specifically, it can increase data diversity and total data against over-fitting by converting labeled training images to another modal images. Extensive experimental results on two datasets demonstrate superior performance compared to the baseline and the state-of-the-art methods.

#### Quality-Based Representation for Unconstrained Face Recognition

Nelson Méndez-Llanes, Katy Castillo-Rosado, Heydi Mendez-Vazquez, Massimo Tistarelli
Track 2: Biometrics, Human Analysis and Behavior Understanding
Thu 14 Jan 2021 at 14:00 in session OS T2.2

Auto-TLDR; activation map for face recognition in unconstrained environments

Significant advances have been achieved in face recognition in the last decade thanks to the development of deep learning methods. However, recognizing faces captured in uncontrolled environments is still a challenging problem for the scientific community. In these scenarios, the performance of most of existing deep learning based methods abruptly falls, due to the bad quality of the face images. In this work, we propose to use an activation map to represent the quality information in a face image. Different face regions are analyzed to determine their quality and then only those regions with good quality are used to perform the recognition using a given deep face model. For experimental evaluation, in order to simulate unconstrained environments, three challenging databases, with different variations in appearance, were selected: the Labeled Faces in the Wild Database, the Celebrities in Frontal-Profile in the Wild Database, and the AR Database. Three deep face models were used to evaluate the proposal on these databases and in all cases, the use of the proposed activation map allows the improvement of the recognition rates obtained by the original models in a range from 0.3 up to 31%. The obtained results experimentally demonstrated that the proposal is able to select those face areas with higher discriminative power and enough identifying information, while ignores the ones with spurious information.

#### Attention Pyramid Module for Scene Recognition

Zhinan Qiao, Xiaohui Yuan, Chengyuan Zhuang, Abolfazl Meyarian
Track 3: Computer Vision Robotics and Intelligent Systems
Tue 12 Jan 2021 at 17:00 in session PS T3.3

Auto-TLDR; Attention Pyramid Module for Multi-Scale Scene Recognition

The unrestricted open vocabulary and diverse substances of scenery images bring significant challenges to scene recognition. However, most deep learning architectures and attention methods are developed on general-purpose datasets and omit the characteristics of scene data. In this paper, we exploit the attention pyramid module (APM) to tackle the predicament of scene recognition. Our method streamlines the multi-scale scene recognition pipeline, learns comprehensive scene features at various scales and locations, addresses the interdependency among scales, and further assists feature re-calibration as well as aggregation process. APM is extremely light-weighted and can be easily plugged into existing network architectures in a parameter-efficient manner. By simply integrating APM into ResNet-50, we obtain a 3.54\% boost in terms of top-1 accuracy on the benchmark scene dataset. Comprehensive experiments show that APM achieves better performance comparing with state-of-the-art attention methods using significant less computation budget. Code and pre-trained models will be made publicly available.

#### Supervised Domain Adaptation Using Graph Embedding

Lukas Hedegaard, Omar Ali Sheikh-Omar, Alexandros Iosifidis
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Tue 12 Jan 2021 at 15:00 in session PS T1.2

Auto-TLDR; Domain Adaptation from the Perspective of Multi-view Graph Embedding and Dimensionality Reduction

Getting deep convolutional neural networks to perform well requires a large amount of training data. When the available labelled data is small, it is often beneficial to use transfer learning to leverage a related larger dataset (source) in order to improve the performance on the small dataset (target). Among the transfer learning approaches, domain adaptation methods assume that distributions between the two domains are shifted and attempt to realign them. In this paper, we consider the domain adaptation problem from the perspective of multi-view graph embedding and dimensionality reduction. Instead of solving the generalised eigenvalue problem to perform the embedding, we formulate the graph-preserving criterion as loss in the neural network and learn a domain-invariant feature transformation in an end-to-end fashion. We show that the proposed approach leads to a powerful Domain Adaptation framework which generalises the prior methods CCSA and d-SNE, and enables simple and effective loss designs; an LDA-inspired instantiation of the framework leads to performance on par with the state-of-the-art on the most widely used Domain Adaptation benchmarks, Office31 and MNIST to USPS datasets.

#### Norm Loss: An Efficient yet Effective Regularization Method for Deep Neural Networks

Theodoros Georgiou, Sebastian Schmitt, Thomas Baeck, Wei Chen, Michael Lew
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Thu 14 Jan 2021 at 14:00 in session PS T1.11

Auto-TLDR; Weight Soft-Regularization with Oblique Manifold for Convolutional Neural Network Training

Convolutional neural network training can suffer from diverse issues like exploding or vanishing gradients, scaling-based weight space symmetry and covariant-shift. In order to address these issues, researchers develop weight regularization methods and activation normalization methods. In this work we propose a weight soft-regularization method based on the Oblique manifold. The proposed method uses a loss function which pushes each weight vector to have a norm close to one, i.e. the weight matrix is smoothly steered toward the so-called Oblique manifold. We evaluate our method on the very popular CIFAR-10, CIFAR-100 and ImageNet 2012 datasets using two state-of-the-art architectures, namely the ResNet and wide-ResNet. Our method introduces negligible computational overhead and the results show that it is competitive to the state-of-the-art and in some cases superior to it. Additionally, the results are less sensitive to hyperparameter settings such as batch size and regularization factor.

#### Rotation Invariant Aerial Image Retrieval with Group Convolutional Metric Learning

Hyunseung Chung, Woo-Jeoung Nam, Seong-Whan Lee
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Wed 13 Jan 2021 at 12:00 in session PS T1.3

Auto-TLDR; Robust Remote Sensing Image Retrieval Using Group Convolution with Attention Mechanism and Metric Learning

Remote sensing image retrieval (RSIR) is the process of ranking database images depending on the degree of similarity compared to the query image. As the complexity of RSIR increases due to the diversity in shooting range, angle, and location of remote sensors, there is an increasing demand for methods to address these issues and improve retrieval performance. In this work, we introduce a novel method for retrieving aerial images by merging group convolution with attention mechanism and metric learning, resulting in robustness to rotational variations. For refinement and emphasis on important features, we applied channel attention in each group convolution stage. By utilizing the characteristics of group convolution and channel-wise attention, it is possible to acknowledge the equality among rotated but identically located images. The training procedure has two main steps: (i) training the network with Aerial Image Dataset (AID) for classification, (ii) fine-tuning the network with triplet-loss for retrieval with Google Earth South Korea and NWPU-RESISC45 datasets. Results show that the proposed method performance exceeds other state-of-the-art retrieval methods in both rotated and original environments. Furthermore, we utilize class activation maps (CAM) to visualize the distinct difference of main features between our method and baseline, resulting in better adaptability in rotated environments.

#### Pose-Robust Face Recognition by Deep Meta Capsule Network-Based Equivariant Embedding

Fangyu Wu, Jeremy Simon Smith, Wenjin Lu, Bailing Zhang
Track 2: Biometrics, Human Analysis and Behavior Understanding
Fri 15 Jan 2021 at 15:00 in session PS T2.5

Auto-TLDR; Deep Meta Capsule Network-based Equivariant Embedding Model for Pose-Robust Face Recognition

Despite the exceptional success in face recognition related technologies, handling large pose variations still remains a key challenge. Current techniques for pose-robust face recognition either, directly extract pose-invariant features, or first synthesize a face that matches the target pose before feature extraction. It is more desirable to learn face representations equivariant to pose variations. To this end, this paper proposes a deep meta Capsule network-based Equivariant Embedding Model (DM-CEEM) with three distinct novelties. First, the proposed RB-CapsNet allows DM-CEEM to learn an equivariant embedding for pose variations and achieve the desired transformation for input face images. Second, we introduce a new version of a Capsule network called RB-CapsNet to extend CapsNet to perform a profile-to-frontal face transformation in deep feature space. Third, we train the DM-CEEM in a meta way by treating a single overall classification target as multiple sub-tasks that satisfy certain unknown probabilities. In each sub-task, we sample the support and query sets randomly. The experimental results on both controlled and in-the-wild databases demonstrate the superiority of DM-CEEM over state-of-the-art.

#### Super-Resolution Guided Pore Detection for Fingerprint Recognition

Syeda Nyma Ferdous, Ali Dabouei, Jeremy Dawson, Nasser M. Nasarabadi
Track 2: Biometrics, Human Analysis and Behavior Understanding
Tue 12 Jan 2021 at 17:00 in session PS T2.1

Auto-TLDR; Super-Resolution Generative Adversarial Network for Fingerprint Recognition Using Pore Features

Performance of fingerprint recognition algorithms substantially rely on fine features extracted from fingerprints. Apart from minutiae and ridge patterns, pore features have proven to be usable for fingerprint recognition. Although features from minutiae and ridge patterns are quite attainable from low-resolution images, using pore features is practical only if the fingerprint image is of high resolution which necessitates a model that enhances the image quality of the conventional 500 ppi legacy fingerprints preserving the fine details. To find a solution for recovering pore information from low-resolution fingerprints, we adopt a joint learning-based approach that combines both super-resolution and pore detection networks. Our modified single image Super-Resolution Generative Adversarial Network (SRGAN) framework helps to reliably reconstruct high-resolution fingerprint samples from low-resolution ones assisting the pore detection network to identify pores with a high accuracy. The network jointly learns a distinctive feature representation from a real low-resolution fingerprint sample and successfully synthesizes a high-resolution sample from it. To add discriminative information and uniqueness for all the subjects, we have integrated features extracted from a deep fingerprint verifier with the SRGAN quality discriminator. We also add ridge reconstruction loss, utilizing ridge patterns to make the best use of extracted features. Our proposed method solves the recognition problem by improving the quality of fingerprint images. High recognition accuracy of the synthesized samples that is close to the accuracy achieved using the original high-resolution images validate the effectiveness of our proposed model.

#### Supervised Feature Embedding for Classification by Learning Rank-Based Neighborhoods

Ghazaal Sheikhi, Hakan Altincay
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Wed 13 Jan 2021 at 14:00 in session OS T1.4

Auto-TLDR; Supervised Feature Embedding with Representation Learning of Rank-based Neighborhoods

In feature embedding, the recovery of associated discriminative information in the reduced subspace is critical for downstream classifiers. In this study, a supervised feature embedding method is proposed inspired by the well-known word embedding technique, word2vec. Proposed embedding method is implemented as representative learning of rank-based neighborhoods. The notion of context words in word2vec is extended into neighboring instances within a given window. Neighborship is defined using ranks of instances rather than their values so that regions with different densities are captured properly. Each sample is represented by a unique one-hot vector whereas its neighbors are encoded by several two-hot vectors. The two-hot vectors are identical for neighboring samples of the same class. A feed-forward neural network with a continuous projection layer, then learns the mapping from one-hot vectors to multiple two-hot vectors. The hidden layer determines the reduced subspace for the train samples. The obtained transformation is then applied on test data to find a lower-dimensional representation. Proposed method is tested in classification problems on 10 UCI data sets. Experimental results confirm that the proposed method is effective in finding a discriminative representation of the features and outperforms several supervised embedding approaches in terms of classification performance.

#### More Correlations Better Performance: Fully Associative Networks for Multi-Label Image Classification

Yaning Li, Liu Yang
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Thu 14 Jan 2021 at 12:00 in session PS T1.10

Auto-TLDR; Fully Associative Network for Fully Exploiting Correlation Information in Multi-Label Classification

Recent researches demonstrate that correlation modeling plays a key role in high-performance multi-label classification methods. However, existing methods do not take full advantage of correlation information, especially correlations in feature and label spaces of each image, which limits the performance of correlation-based multi-label classification methods. With more correlations considered, in this study, a Fully Associative Network (FAN) is proposed for fully exploiting correlation information, which involves both visual feature and label correlations. Specifically, FAN introduces a robust covariance pooling to summarize convolution features as global image representation for capturing feature correlation in the multi-label task. Moreover, it constructs an effective label correlation matrix based on a re-weighted scheme, which is fed into a graph convolution network for capturing label correlation. Then, correlation between covariance representations (i.e., feature correlation ) and the outputs of GCN (i.e., label correlation) are modeled for final prediction. Experimental results on two datasets illustrate the effectiveness and efficiency of our proposed FAN compared with state-of-the-art methods.

#### Image Representation Learning by Transformation Regression

Xifeng Guo, Jiyuan Liu, Sihang Zhou, En Zhu, Shihao Dong
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Wed 13 Jan 2021 at 14:00 in session PS T1.5

Auto-TLDR; Self-supervised Image Representation Learning using Continuous Parameter Prediction

Self-supervised learning is a thriving research direction since it can relieve the burden of human labeling for machine learning by seeking for supervision from data instead of human annotation. Although demonstrating promising performance in various applications, we observe that the existing methods usually model the auxiliary learning tasks as classification tasks with finite discrete labels, leading to insufficient supervisory signals, which in turn restricts the representation quality. In this paper, to solve the above problem and make full use of the supervision from data, we design a regression model to predict the continuous parameters of a group of transformations, i.e., image rotation, translation, and scaling. Surprisingly, this naive modification stimulates tremendous potential from data and the resulting supervisory signal has largely improved the performance of image representation learning. Extensive experiments on four image datasets, including CIFAR10, CIFAR100, STL10, and SVHN, indicate that our proposed algorithm outperforms the state-of-the-art unsupervised learning methods by a large margin in terms of classification accuracy. Crucially, we find that with our proposed training mechanism as an initialization, the performance of the existing state-of-the-art classification deep architectures can be preferably improved.

#### ClusterFace: Joint Clustering and Classification for Set-Based Face Recognition

Samadhi Poornima Kumarasinghe Wickrama Arachchilage, Ebroul Izquierdo
Track 2: Biometrics, Human Analysis and Behavior Understanding
Wed 13 Jan 2021 at 14:00 in session PS T2.3

Auto-TLDR; Joint Clustering and Classification for Face Recognition in the Wild

Deep learning technology has enabled successful modeling of complex facial features when high quality images are available. Nonetheless, accurate modeling and recognition of human faces in real world scenarios 'on the wild' or under adverse conditions remains an open problem. When unconstrained faces are mapped into deep features, variations such as illumination, pose, occlusion, etc., can create inconsistencies in the resultant feature space. Hence, deriving conclusions based on direct associations could lead to degraded performance. This rises the requirement for a basic feature space analysis prior to face recognition. This paper devises a joint clustering and classification scheme which learns deep face associations in an easy-to-hard way. Our method is based on hierarchical clustering where the early iterations tend to preserve high reliability. The rationale of our method is that a reliable clustering result can provide insights on the distribution of the feature space, that can guide the classification that follows. Experimental evaluations on three tasks, face verification, face identification and rank-order search, demonstrates better or competitive performance compared to the state-of-the-art, on all three experiments.

#### Attentive Hybrid Feature Based a Two-Step Fusion for Facial Expression Recognition

Jun Weng, Yang Yang, Zichang Tan, Zhen Lei
Track 2: Biometrics, Human Analysis and Behavior Understanding
Thu 14 Jan 2021 at 12:00 in session PS T2.4

Auto-TLDR; Attentive Hybrid Architecture for Facial Expression Recognition

Facial expression recognition is inherently a challenging task, especially for the in-the-wild images with various occlusions and large pose variations, which may lead to the loss of some crucial information. To address it, in this paper, we propose an attentive hybrid architecture (AHA) which learns global, local and integrated features based on different face regions. Compared with one type of feature, our extracted features own complementary information and can reduce the loss of crucial information. Specifically, AHA contains three branches, where all sub-networks in those branches employ the attention mechanism to further localize the interested pixels/regions. Moreover, we propose a two-step fusion strategy based on LSTM to deeply explore the hidden correlations among different face regions. Extensive experiments on four popular expression databases (i.e., CK+, FER-2013, SFEW 2.0, RAF-DB) show the effectiveness of the proposed method.

#### Subspace Clustering for Action Recognition with Covariance Representations and Temporal Pruning

Giancarlo Paoletti, Jacopo Cavazza, Cigdem Beyan, Alessio Del Bue
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Tue 12 Jan 2021 at 15:00 in session OS T1.3

Auto-TLDR; Unsupervised Learning for Human Action Recognition from Skeletal Data

This paper tackles the problem of human action recognition, defined as classifying which action is displayed in a trimmed sequence, from skeletal data. Albeit state-of-the-art approaches designed for this application are all supervised, in this paper we pursue a more challenging direction: Solving the problem with unsupervised learning. To this end, we propose a novel subspace clustering method, which exploits covariance matrix to enhance the action’s discriminability and a timestamp pruning approach that allow us to better handle the temporal dimension of the data. Through a broad experimental validation, we show that our computational pipeline surpasses existing unsupervised approaches but also can result in favorable performances as compared to supervised methods.

#### Exploiting Knowledge Embedded Soft Labels for Image Recognition

Lixian Yuan, Riquan Chen, Hefeng Wu, Tianshui Chen, Wentao Wang, Pei Chen
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Wed 13 Jan 2021 at 12:00 in session PS T1.4

Auto-TLDR; A Soft Label Vector for Image Recognition

Objects from correlated classes usually share highly similar appearances while objects from uncorrelated classes are very different. Most of current image recognition works treat each class independently, which ignores these class correlations and inevitably leads to sub-optimal performance in many cases. Fortunately, object classes inherently form a hierarchy with different levels of abstraction and this hierarchy encodes rich correlations among different classes. In this work, we utilize a soft label vector that encodes the prior knowledge of class correlations as extra regularization to train the image classifiers. Specifically, for each class, instead of simply using a one-hot vector, we assign a high value to its correlated classes and assign small values to those uncorrelated ones, thus generating knowledge embedded soft labels. We conduct experiments on both general and fine-grained image recognition benchmarks and demonstrate its superiority compared with existing methods.

#### Self-Paced Bottom-Up Clustering Network with Side Information for Person Re-Identification

Mingkun Li, Chun-Guang Li, Ruo-Pei Guo, Jun Guo
Track 2: Biometrics, Human Analysis and Behavior Understanding
Wed 13 Jan 2021 at 12:00 in session PS T2.2

Auto-TLDR; Self-Paced Bottom-up Clustering Network with Side Information for Unsupervised Person Re-identification

Person re-identification (Re-ID) has attracted a lot of research attention in recent years. However, supervised methods demand an enormous amount of manually annotated data. In this paper, we propose a Self-Paced bottom-up Clustering Network with Side Information (SPCNet-SI) for unsupervised person Re-ID, where the side information comes from the serial number of the camera associated with each image. Specifically, our proposed SPCNet-SI exploits the camera side information to guide the feature learning and uses soft label in bottom-up clustering process, in which the camera association information is used in the repelled loss and the soft label based cluster information is used to select the candidate cluster pairs to merge. Moreover, a self-paced dynamic mechanism is developed to regularize the merging process such that the clustering is implemented in an easy-to-hard way with a slow-to-fast merging process. Experiments on two benchmark datasets Market-1501 and DukeMTMC-ReID demonstrate promising performance.

#### Improved Deep Classwise Hashing with Centers Similarity Learning for Image Retrieval

Ming Zhang, Hong Yan
Track 3: Computer Vision Robotics and Intelligent Systems
Thu 14 Jan 2021 at 16:00 in session PS T3.9

Auto-TLDR; Deep Classwise Hashing for Image Retrieval Using Center Similarity Learning

Deep supervised hashing for image retrieval has attracted researchers' attention due to its high efficiency and superior retrieval performance. Most existing deep supervised hashing works, which are based on pairwise/triplet labels, suffer from the expensive computational cost and insufficient utilization of the semantics information. Recently, deep classwise hashing introduced a classwise loss supervised by class labels information alternatively; however, we find it still has its drawback. In this paper, we propose an improved deep classwise hashing, which enables hashing learning and class centers learning simultaneously. Specifically, we design a two-step strategy on center similarity learning. It interacts with the classwise loss to attract the class center to concentrate on the intra-class samples while pushing other class centers as far as possible. The centers similarity learning contributes to generating more compact and discriminative hashing codes. We conduct experiments on three benchmark datasets. It shows that the proposed method effectively surpasses the original method and outperforms state-of-the-art baselines under various commonly-used evaluation metrics for image retrieval.

#### Space-Time Domain Tensor Neural Networks: An Application on Human Pose Classification

Konstantinos Makantasis, Athanasios Voulodimos, Anastasios Doulamis, Nikolaos Doulamis, Nikolaos Bakalos
Track 2: Biometrics, Human Analysis and Behavior Understanding
Thu 14 Jan 2021 at 12:00 in session PS T2.4

Auto-TLDR; Tensor-Based Neural Network for Spatiotemporal Pose Classifiaction using Three-Dimensional Skeleton Data

Recent advances in sensing technologies require the design and development of pattern recognition models capable of processing spatiotemporal data efficiently. In this study, we propose a spatially and temporally aware tensor-based neural network for human pose classifiaction using three-dimensional skeleton data. Our model employs three novel components. First, an input layer capable of constructing highly discriminative spatiotemporal features. Second, a tensor fusion operation that produces compact yet rich representations of the data, and third, a tensor-based neural network that processes data representations in their original tensor form. Our model is end-to-end trainable and characterized by a small number of trainable parameters making it suitable for problems where the annotated data is limited. Experimental evaluation of the proposed model indicates that it can achieve state-of-the-art performance.

#### Multi-Scale Cascading Network with Compact Feature Learning for RGB-Infrared Person Re-Identification

Can Zhang, Hong Liu, Wei Guo, Mang Ye
Track 5: Image and Signal Processing
Wed 13 Jan 2021 at 12:00 in session PS T5.3

Auto-TLDR; Multi-Scale Part-Aware Cascading for RGB-Infrared Person Re-identification

RGB-Infrared person re-identification (RGB-IR Re-ID) aims to matching persons from heterogeneous images captured by visible and thermal cameras, which is of great significance in surveillance system under poor light conditions. Facing great challenges in complex variances including conventional single-modality and additional inter-modality discrepancies, most of existing RGB-IR Re-ID methods directly work on global features for simultaneous elimination, whereas modality-specific noises and modality-shared features are not well considered. To address these issues, a novel Multi-Scale Part-Aware Cascading framework (MSPAC) is formulated by aggregating multi-scale fine-grained features from part to global in a cascading manner, which results in an unified representation robust to noises. Moreover, a marginal exponential center (MeCen) loss is introduced to jointly eliminate mixed variances, which enables to model cross-modality correlations on sharable salient features. Extensive experiments are conducted for demonstration that the proposed method outperforms all the state-of-the-arts by a large margin.

#### A CNN-RNN Framework for Image Annotation from Visual Cues and Social Network Metadata

Tobia Tesan, Pasquale Coscia, Lamberto Ballan
Track 3: Computer Vision Robotics and Intelligent Systems
Wed 13 Jan 2021 at 16:30 in session PS T3.6

Auto-TLDR; Context-Based Image Annotation with Multiple Semantic Embeddings and Recurrent Neural Networks

Images represent a commonly used form of visual communication among people. Nevertheless, image classification may be a challenging task when dealing with unclear or non-common images needing more context to be correctly annotated. Metadata accompanying images on social-media represent an ideal source of additional information for retrieving proper neighborhoods easing image annotation task. To this end, we blend visual features extracted from neighbors and their metadata to jointly leverage context and visual cues. Our models use multiple semantic embeddings to achieve the dual objective of being robust to vocabulary changes between train and test sets and decoupling the architecture from the low-level metadata representation. Convolutional and recurrent neural networks (CNNs-RNNs) are jointly adopted to infer similarity among neighbors and query images. We perform comprehensive experiments on the NUS-WIDE dataset showing that our models outperform state-of-the-art architectures based on images and metadata, and decrease both sensory and semantic gaps to better annotate images.

#### Trainable Spectrally Initializable Matrix Transformations in Convolutional Neural Networks

Michele Alberti, Angela Botros, Schuetz Narayan, Rolf Ingold, Marcus Liwicki, Mathias Seuret
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Tue 12 Jan 2021 at 15:00 in session PS T1.2

Auto-TLDR; Trainable and Spectrally Initializable Matrix Transformations for Neural Networks

In this work, we introduce a new architectural component to Neural Networks (NN), i.e., trainable and spectrally initializable matrix transformations on feature maps. While previous literature has already demonstrated the possibility of adding static spectral transformations as feature processors, our focus is on more general trainable transforms. We study the transforms in various architectural configurations on four datasets of different nature: from medical (ColorectalHist, HAM10000) and natural (Flowers) images to historical documents (CB55). With rigorous experiments that control for the number of parameters and randomness, we show that networks utilizing the introduced matrix transformations outperform vanilla neural networks. The observed accuracy increases appreciably across all datasets. In addition, we show that the benefit of spectral initialization leads to significantly faster convergence, as opposed to randomly initialized matrix transformations. The transformations are implemented as auto-differentiable PyTorch modules that can be incorporated into any neural network architecture. The entire code base is open-source.

#### Face Image Quality Assessment for Model and Human Perception

Ken Chen, Yichao Wu, Zhenmao Li, Yudong Wu, Ding Liang
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Fri 15 Jan 2021 at 15:00 in session PS T1.14

Auto-TLDR; A labour-saving method for FIQA training with contradictory data from multiple sources

Practical face image quality assessment (FIQA) models are trained under the supervision of labeled data, which requires more or less human labor. The human labeled quality scores are consistent with perceptual intuition but laborious. On the other hand, models can be trained with data generated automatically by the recognition models with artificially selected references. However, the recognition scores are sometimes inaccurate, which may give wrong quality scores during FIQA training. In this paper, we propose a labour-saving method for quality scores generation. For the first time, we conduct systematic investigations to show that there exist severe contradictions between different types of target quality, namely distribution gap (DG). To bridge the gap, we propose a novel framework for training FIQA models by combining the merits of data from different sources. In order to make the target score from multiple sources compatible, we design a method called quality distribution alignment (QDA). Meanwhile, to correct the wrong target by recognition models, contradictory samples selection (CSS) is adopted to select samples from the human labeled dataset adaptively. Extensive experiments and analysis on public benchmarks including MegaFace has demonstrated the superiority of our in terms of effectiveness and efficiency.

#### Free-Form Image Inpainting Via Contrastive Attention Network

Xin Ma, Xiaoqiang Zhou, Huaibo Huang, Zhenhua Chai, Xiaolin Wei, Ran He
Track 5: Image and Signal Processing
Tue 12 Jan 2021 at 14:00 in session OS T5.1

Auto-TLDR; Self-supervised Siamese inference for image inpainting

Most deep learning based image inpainting approaches adopt autoencoder or its variants to fill missing regions in images. Encoders are usually utilized to learn powerful representational spaces, which are important for dealing with sophisticated learning tasks. Specifically, in the image inpainting task, masks with any shapes can appear anywhere in images (i.e., free-form masks) forming complex patterns. It is difficult for encoders to capture such powerful representations under this complex situation. To tackle this problem, we propose a self-supervised Siamese inference network to improve the robustness and generalization. Moreover, the restored image usually can not be harmoniously integrated into the exiting content, especially in the boundary area. To address this problem, we propose a novel Dual Attention Fusion module (DAF), which can combine both the restored and known regions in a smoother way and be inserted into decoder layers in a plug-and-play way. DAF is developed to not only adaptively rescale channel-wise features by taking interdependencies between channels into account but also force deep convolutional neural networks (CNNs) focusing more on unknown regions. In this way, the unknown region will be naturally filled from the outside to the inside. Qualitative and quantitative experiments on multiple datasets, including facial and natural datasets (i.e., Celeb-HQ, Pairs Street View, Places2 and ImageNet), demonstrate that our proposed method outperforms against state-of-the-arts in generating high-quality inpainting results.

#### Label Self-Adaption Hashing for Image Retrieval

Jianglin Lu, Zhihui Lai, Hailing Wang, Jie Zhou
Track 5: Image and Signal Processing
Thu 14 Jan 2021 at 12:00 in session PS T5.5

Auto-TLDR; Label Self-Adaption Hashing for Large-Scale Image Retrieval

Hashing has attracted widespread attention in image retrieval because of its fast retrieval speed and low storage cost. Compared with supervised methods, unsupervised hashing methods are more reasonable and suitable for large-scale image retrieval since it is always difficult and expensive to collect true labels of the massive data. Without label information, however, unsupervised hashing methods can not guarantee the quality of learned binary codes. To resolve this dilemma, this paper proposes a novel unsupervised hashing method called Label Self-Adaption Hashing (LSAH), which contains effective hashing function learning part and self-adaption label generation part. In the first part, we utilize anchor graph to keep the local structure of the data and introduce joint sparsity into the model to extract effective features for high-quality binary code learning. In the second part, a self-adaptive cluster label matrix is learned from the data under the assumption that the nearest neighbor points should have a large probability to be in the same cluster. Therefore, the proposed LSAH can make full use of the potential discriminative information of the data to guide the learning of binary code. It is worth noting that LSAH can learn effective binary codes, hashing function and cluster labels simultaneously in a unified optimization framework. To solve the resulting optimization problem, an Augmented Lagrange Multiplier based iterative algorithm is elaborately designed. Extensive experiments on three large-scale data sets indicate the promising performance of the proposed LSAH.

#### A Joint Representation Learning and Feature Modeling Approach for One-Class Recognition

Pramuditha Perera, Vishal Patel
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Thu 14 Jan 2021 at 16:00 in session PS T1.12

Auto-TLDR; Combining Generative Features and One-Class Classification for Effective One-class Recognition

One-class recognition is traditionally approached either as a representation learning problem or a feature modelling problem. In this work, we argue that both of these approaches have their own limitations; and a more effective solution can be obtained by combining the two. The proposed approach is based on the combination of a generative framework and a one-class classification method. First, we learn generative features using the one-class data with a generative framework. We augment the learned features with the corresponding reconstruction errors to obtain augmented features. Then, we qualitatively identify a suitable feature distribution that reduces the redundancy in the chosen classifier space. Finally, we force the augmented features to take the form of this distribution using an adversarial framework. We test the effectiveness of the proposed method on three one-class classification tasks and obtain state-of-the-art results.

#### Exploring the Ability of CNNs to Generalise to Previously Unseen Scales Over Wide Scale Ranges

Ylva Jansson, Tony Lindeberg
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Fri 15 Jan 2021 at 13:00 in session OS T1.7

Auto-TLDR; A theoretical analysis of invariance and covariance properties of scale channel networks

The ability to handle large scale variations is crucial for many real world visual tasks. A straightforward approach to handling scale in a deep neural network is to process multiple rescaled image copies in a set of scale channels (subnetworks). Scale invariance can then, in principle, be achieved by using weight sharing between the scale channels together with max or average pooling over the outputs from the scale channels. The ability of such scale channel networks to generalise to scales not present in the training set over significant scale ranges has, however, not previously been explored. We, therefore, present a theoretical analysis of invariance and covariance properties of scale channel networks and perform an experimental evaluation of the ability of different types of scale channel networks to generalise to previously unseen scales. We identify limitations of previous approaches and propose a new type of foveated scale channel architecture, where the scale channels process increasingly larger parts of the image with decreasing resolution. Our proposed FovMax and FovAvg networks perform almost identically over a scale range of 8 also when training on single scale training data and give improvements in the small sample regime.

#### Object Detection on Monocular Images with Two-Dimensional Canonical Correlation Analysis

Zifan Yu, Suya You
Track 3: Computer Vision Robotics and Intelligent Systems
Tue 12 Jan 2021 at 17:00 in session PS T3.2

Auto-TLDR; Multi-Task Object Detection from Monocular Images Using Multimodal RGB and Depth Data

Accurate and robust detection objects from monocular images is a fundamental vision task. This paper describes a novel approach of holistic scene understanding that can simultaneously achieve multiple tasks of scene reconstruction and object detection from a single monocular camera. Rather than pursuing an independent solution for each individual task as most existing work does, we seek a globally optimal solution that holistically resolves the multiple perception and reasoning tasks in an effective manner. The approach explores the complementary properties of multimodal RGB imagery and depth data to improve scene perception tasks. It uniquely combines the techniques of canonical correlation analysis and deep learning to learn the most correlated features to maximize the modal cross-correlation for improving the performance and robustness of object detection in complex environments. Extensive experiments have been conducted to evaluate and demonstrate the performances of the proposed approach.

#### DAIL: Dataset-Aware and Invariant Learning for Face Recognition

Gaoang Wang, Chen Lin, Tianqiang Liu, Mingwei He, Jiebo Luo
Track 2: Biometrics, Human Analysis and Behavior Understanding
Fri 15 Jan 2021 at 15:00 in session PS T2.5

Auto-TLDR; DAIL: Dataset-Aware and Invariant Learning for Face Recognition

To achieve good performance in face recognition, a large scale training dataset is usually required. A simple yet effective way for improving the recognition performance is to use a dataset as large as possible by combining multiple datasets in the training. However, it is problematic and troublesome to naively combine different datasets due to two major issues. Firstly, the same person can possibly appear in different datasets, leading to the identity overlapping issue between different datasets. Natively treating the same person as different classes in different datasets during training will affect back-propagation and generate non-representative embeddings. On the other hand, manually cleaning labels will take a lot of human efforts, especially when there are millions of images and thousands of identities. Secondly, different datasets are collected in different situations and thus will lead to different domain distributions. Natively combining datasets will lead to domain distribution differences and make it difficult to learn domain invariant embeddings across different datasets. In this paper, we propose DAIL: Dataset-Aware and Invariant Learning to resolve the above-mentioned issues. To solve the first issue of identity overlapping, we propose a dataset-aware loss for multi-dataset training by reducing the penalty when the same person appears in multiple datasets. This can be readily achieved with a modified softmax loss with a dataset-aware term. To solve the second issue, the domain adaptation with gradient reversal layers is employed for dataset invariant learning. The proposed approach not only achieves state-of-the-art results on several commonly used face recognition validation sets, like LFW, CFP-FP, AgeDB-30, but also shows great benefit for practical usage.

#### Multi-Level Deep Learning Vehicle Re-Identification Using Ranked-Based Loss Functions

Eleni Kamenou, Jesus Martinez-Del-Rincon, Paul Miller, Patricia Devlin - Hill
Track 3: Computer Vision Robotics and Intelligent Systems
Wed 13 Jan 2021 at 16:30 in session PS T3.6

Auto-TLDR; Multi-Level Re-identification Network for Vehicle Re-Identification

Identifying vehicles across a network of cameras with non-overlapping fields of view remains a challenging research problem due to scene occlusions, significant inter-class similarity and intra-class variability. In this paper, we propose an end-to-end multi-level re-identification network that is capable of successfully projecting same identity vehicles closer to one another in the embedding space, compared to vehicles of different identities. Robust feature representations are obtained by combining features at multiple levels of the network. As for the learning process, we employ a recent state-of-the-art structured metric learning loss function previously applied to other retrieval problems and adjust it to the vehicle re-identification task. Furthermore, we explore the cases of image-to-image, image-to-video and video-to-video similarity metric. Finally, we evaluate our system and achieve great performance on two large-scale publicly available datasets, CityFlow-ReID and VeRi-776. Compared to most existing state-of-art approaches, our approach is simpler and more straightforward, utilizing only identity-level annotations, while avoiding post-processing the ranking results (re-ranking) at the testing phase.

#### Nonlinear Ranking Loss on Riemannian Potato Embedding

Byung Hyung Kim, Yoonje Suh, Honggu Lee, Sungho Jo
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Wed 13 Jan 2021 at 14:00 in session PS T1.6

Auto-TLDR; Riemannian Potato for Rank-based Metric Learning

We propose a rank-based metric learning method by leveraging a concept of the Riemannian Potato for better separating non-linear data. By exploring the geometric properties of Riemannian manifolds, the proposed loss function optimizes the measure of dispersion using the distribution of Riemannian distances between a reference sample and neighbors and builds a ranked list according to the similarities. We show the proposed function can learn a hypersphere for each class, preserving the similarity structure inside it on Riemannian manifold. As a result, compared with Euclidean distance-based metric, our method can further jointly reduce the intra-class distances and enlarge the inter-class distances for learned features, consistently outperforming state-of-the-art methods on three widely used non-linear datasets.

#### MANet: Multimodal Attention Network Based Point-View Fusion for 3D Shape Recognition

Yaxin Zhao, Jichao Jiao, Ning Li
Track 3: Computer Vision Robotics and Intelligent Systems
Thu 14 Jan 2021 at 14:00 in session PS T3.8

Auto-TLDR; Fusion Network for 3D Shape Recognition based on Multimodal Attention Mechanism

3D shape recognition has attracted more and more attention as a task of 3D vision research. The proliferation of 3D data encourages various deep learning methods based on 3D data. Now there have been many deep learning models based on point-cloud data or multi-view data alone. However, in the era of big data, integrating data of two different modals to obtain a unified 3D shape descriptor is bound to improve the recognition accuracy. Therefore, this paper proposes a fusion network based on multimodal attention mechanism for 3D shape recognition. Considering the limitations of multi-view data, we introduce a soft attention scheme, which can use the global point-cloud features to filter the multi-view features, and then realize the effective fusion of the two features. More specifically, we obtain the enhanced multi-view features by mining the contribution of each multi-view image to the overall shape recognition, and then fuse the point-cloud features and the enhanced multi-view features to obtain a more discriminative 3D shape descriptor. We have performed relevant experiments on the ModelNet40 dataset, and experimental results verify the effectiveness of our method.

#### Efficient Online Subclass Knowledge Distillation for Image Classification

Maria Tzelepi, Nikolaos Passalis, Anastasios Tefas
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Fri 15 Jan 2021 at 16:00 in session PS T1.16

Auto-TLDR; OSKD: Online Subclass Knowledge Distillation

Deploying state-of-the-art deep learning models on embedded systems dictates certain storage and computation limitations. During the recent few years Knowledge Distillation (KD) has been recognized as a prominent approach to address this issue. That is, KD has been effectively proposed for training fast and compact deep learning models by transferring knowledge from more complex and powerful models. However, knowledge distillation, in its conventional form, involves multiple stages of training, rendering it a computationally and memory demanding procedure. In this paper, a novel single-stage self knowledge distillation method is proposed, namely Online Subclass Knowledge Distillation (OSKD), that aims at revealing the similarities inside classes, improving the performance of any deep neural model in an online manner. Hence, as opposed to existing online distillation methods, we are able to acquire further knowledge from the model itself, without building multiple identical models or using multiple models to teach each other, rendering the OSKD approach more efficient. The experimental evaluation on two datasets validates that the proposed method improves the classification performance.

#### Prior Knowledge about Attributes: Learning a More Effective Potential Space for Zero-Shot Recognition

Chunlai Chai, Yukuan Lou
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Wed 13 Jan 2021 at 12:00 in session PS T1.4

Auto-TLDR; Attribute Correlation Potential Space Generation for Zero-Shot Learning

Zero-shot learning (ZSL) aims to recognize unseen classes accurately by learning seen classes and known attributes, but correlations in attributes were ignored by previous study which lead to classification results confused. To solve this problem, we build an Attribute Correlation Potential Space Generation (ACPSG) model which uses a graph convolution network and attribute correlation to generate a more discriminating potential space. Combining potential discrimination space and user-defined attribute space, we can better classify unseen classes. Our approach outperforms some existing state-of-the-art methods on several benchmark datasets, whether it is conventional ZSL or generalized ZSL.