Viability of Optical Coherence Tomography for Iris Presentation Attack Detection

Renu Sharma, Arun Ross

Auto-TLDR; Optical Coherence Tomography Imaging for Iris Presentation Attack Detection

In this paper, we first propose the use of Optical Coherence Tomography (OCT) imaging for the problem of iris presentation attack (PA) detection. Secondly, we assess its viability by comparing its performance with respect to traditional modalities, viz., near-infrared (NIR) and visible spectrum. OCT imaging provides a cross-sectional view of an eye, whereas NIR and visible spectrum imaging provide 2D iris textural information. Implementation is performed using three state-of-the-art deep architectures (VGG19, ResNet50 and DenseNet121) to differentiate between bonafide and PA samples for each of the three imaging modalities. Experiments are performed on a dataset of 2,169 bonafide, 177 Van Dyke eyes and 360 cosmetic contact images acquired using all three imaging modalities under intra-attack (known PAs) and cross-attack (unknown PAs) scenario. We observe promising results demonstrating OCT as a viable solution for iris PA detection.

Similar papers

Generalized Iris Presentation Attack Detection Algorithm under Cross-Database Settings

Mehak Gupta, Vishal Singh, Akshay Agarwal, Mayank Vatsa, Richa Singh

Auto-TLDR; MVNet: A Deep Learning-based PAD Network for Iris Recognition against Presentation Attacks

The deployment of biometrics features based person identification has increased significantly from border access to mobile unlock to electronic transactions. Iris recognition is considered as one of the most accurate biometric modality for person identification. However, the vulnerability of this recognition towards presentation attacks, especially towards the 3D contact lenses, can limit its potential deployments. The textured lenses are so effective in hiding the real texture of iris that it can fool not only the automatic recognition algorithms but also the human examiners. While in literature, several presentation attack detection (PAD) algorithms are presented; however, the significant limitation is the generalizability against an unseen database, unseen sensor, and different imaging environment. Inspired by the success of the hybrid algorithm or fusion of multiple detection networks, we have proposed a deep learning-based PAD network that utilizes multiple feature representation layers. The computational complexity is an essential factor in training the deep neural networks; therefore, to limit the computational complexity while learning multiple feature representation layers, a base model is kept the same. The network is trained end-to-end using a softmax classifier. We have evaluated the performance of the proposed network termed as MVNet using multiple databases such as IIITD-WVU MUIPA, IIITD-WVU UnMIPA database under cross-database training-testing settings. The experiments are performed extensively to assess the generalizability of the proposed algorithm.

MixNet for Generalized Face Presentation Attack Detection

Nilay Sanghvi, Sushant Singh, Akshay Agarwal, Mayank Vatsa, Richa Singh

Auto-TLDR; MixNet: A Deep Learning-based Network for Detection of Presentation Attacks in Cross-Database and Unseen Setting

The non-intrusive nature and high accuracy of face recognition algorithms have led to their successful deployment across multiple applications ranging from border access to mobile unlocking and digital payments. However, their vulnerability against sophisticated and cost-effective presentation attack mediums raises essential questions regarding its reliability. Several presentation attack detection algorithms are presented; however, they are still far behind from reality. The major problem with the existing work is the generalizability against multiple attacks both in the seen and unseen setting. The algorithms which are useful for one kind of attack (such as print) fail miserably for another type of attack (such as silicone masks). In this research, we have proposed a deep learning-based network called MixNet to detect presentation attacks in cross-database and unseen attack settings. The proposed algorithm utilizes state-of-the-art convolutional neural network architectures and learns the feature mapping for each attack category. Experiments are performed using multiple challenging face presentation attack databases such as Silicone Mask Attack Database (SMAD) and Spoof In the Wild with Multiple Attack (SiW-M). Extensive experiments and comparison with the existing state of the art algorithms show the effectiveness of the proposed algorithm.

Local Attention and Global Representation Collaborating for Fine-Grained Classification

He Zhang, Yunming Bai, Hui Zhang, Jing Liu, Xingguang Li, Zhaofeng He

Auto-TLDR; Weighted Region Network for Cosmetic Contact Lenses Detection

The cosmetic contact lenses over an iris may change its original textural pattern that is the foundation for iris recognition, making the cosmetic lenses a possible and easy-to-use iris presentation attack means. Aiming at cosmetic contact lenses detection of practical application system, some approaches have been proposed but still facing unsolved problems, such as low quality iris images and inaccurate localized iris boundaries. In this paper, we propose a novel framework called Weighted Region Network (WRN) for the cosmetic contact lenses detection. The WRN includes both the local attention Weight Network and the global classification Region Network. With the inherent attention mechanism, the proposed network is able to find the most discriminative regions, which reduces the requirement for target detection and improves the ability of classification based on some specific areas and patterns. The Weight Network can be trained by using Rank loss and MSE loss without manual discriminative region annotations. Experiments are conducted on several databases and a new collected low-quality iris image database. The proposed method outperforms state-of-the-art fake iris detection algorithms, and is also effective for the fine-grained image classification task.

One-Shot Representational Learning for Joint Biometric and Device Authentication

Sudipta Banerjee, Arun Ross

Auto-TLDR; Joint Biometric and Device Recognition from a Single Biometric Image

In this work, we propose a method to simultaneously perform (i) biometric recognition (\textit{i.e.}, identify the individual), and (ii) device recognition, (\textit{i.e.}, identify the device) from a single biometric image, say, a face image, using a one-shot schema. Such a joint recognition scheme can be useful in devices such as smartphones for enhancing security as well as privacy. We propose to automatically learn a joint representation that encapsulates both biometric-specific and sensor-specific features. We evaluate the proposed approach using iris, face and periocular images acquired using near-infrared iris sensors and smartphone cameras. Experiments conducted using 14,451 images from 13 sensors resulted in a rank-1 identification accuracy of upto 99.81\% and a verification accuracy of upto 100\% at a false match rate of 1\%.

Face Anti-Spoofing Using Spatial Pyramid Pooling

Lei Shi, Zhuo Zhou, Zhenhua Guo

Auto-TLDR; Spatial Pyramid Pooling for Face Anti-Spoofing

Face recognition system is vulnerable to many kinds of presentation attacks, so how to effectively detect whether the image is from the real face is particularly important. At present, many deep learning-based anti-spoofing methods have been proposed. But these approaches have some limitations, for example, global average pooling (GAP) easily loses local information of faces, single-scale features easily ignore information differences in different scales, while a complex network is prune to be overfitting. In this paper, we propose a face anti-spoofing approach using spatial pyramid pooling (SPP). Firstly, we use ResNet-18 with a small amount of parameter as the basic model to avoid overfitting. Further, we use spatial pyramid pooling module in the single model to enhance local features while fusing multi-scale information. The effectiveness of the proposed method is evaluated on three databases, CASIA-FASD, Replay-Attack and CASIA-SURF. The experimental results show that the proposed approach can achieve state-of-the-art performance.

A Cross Domain Multi-Modal Dataset for Robust Face Anti-Spooﬁng

Qiaobin Ji, Shugong Xu, Xudong Chen, Shan Cao, Shunqing Zhang

Auto-TLDR; Cross domain multi-modal FAS dataset GREAT-FASD and several evaluation protocols for academic community

Face Anti-spooﬁng (FAS) is a challenging problem due to the complex serving scenario and diverse face presentation attack patterns. Using single modal images which are usually captured with RGB cameras is not able to deal with the former because of serious overﬁtting problems. The existing multi-modal FAS datasets rarely pay attention to the cross domain problems, trainingFASsystemonthesedataleadstoinconsistenciesandlow generalization capabilities in deployment since imaging principles(structured light, TOF, etc.) and pre-processing methods vary between devices. We explore the subtle ﬁne-grained differences betweeen multi-modal cameras and proposed a cross domain multi-modal FAS dataset GREAT-FASD and several evaluation protocols for academic community. Furthermore, we incorporate the multiplicative attention and center loss to enhance the representative power of CNN via seeking out complementary information as a powerful baseline. In addition, extensive experiments have been conducted on the proposed dataset to analyze the robustness to distinguish spoof faces and bona-ﬁde faces. Experimental results show the effectiveness of proposed method and achieve the state-of-the-art competitive results. Finally, we visualize our future distribution in hidden space and observe that the proposed method is able to lead the network to generate a large margin for face anti-spooﬁng task

Detection of Makeup Presentation Attacks Based on Deep Face Representations

Christian Rathgeb, Pawel Drozdowski, Christoph Busch

Auto-TLDR; An Attack Detection Scheme for Face Recognition Using Makeup Presentation Attacks

Facial cosmetics have the ability to substantially alter the facial appearance, which can negatively affect the decisions of a face recognition. In addition, it was recently shown that the application of makeup can be abused to launch so-called makeup presentation attacks. In such attacks, the attacker might apply heavy makeup in order to achieve the facial appearance of a target subject for the purpose of impersonation. In this work, we assess the vulnerability of a COTS face recognition system to makeup presentation attacks employing the publicly available Makeup Induced Face Spoofing (MIFS) database. It is shown that makeup presentation attacks might seriously impact the security of the face recognition system. Further, we propose an attack detection scheme which distinguishes makeup presentation attacks from genuine authentication attempts by analysing differences in deep face representations obtained from potential makeup presentation attacks and corresponding target face images. The proposed detection system employs a machine learning-based classifier, which is trained with synthetically generated makeup presentation attacks utilizing a generative adversarial network for facial makeup transfer in conjunction with image warping. Experimental evaluations conducted using the MIFS database reveal a detection equal error rate of 0.7% for the task of separating genuine authentication attempts from makeup presentation attacks.

How Unique Is a Face: An Investigative Study

Michal Balazia, S L Happy, Francois Bremond, Antitza Dantcheva

Auto-TLDR; Uniqueness of Face Recognition: Exploring the Impact of Factors such as image resolution, feature representation, database size, age and gender

Face recognition has been widely accepted as a means of identification in applications ranging from border control to security in the banking sector. Surprisingly, while widely accepted, we still lack the understanding of the uniqueness or distinctiveness of face as a biometric characteristic. In this work, we study the impact of factors such as image resolution, feature representation, database size, age and gender on uniqueness denoted by the Kullback-Leibler divergence between genuine and impostor distributions. Towards understanding the impact, we present experimental results on the datasets AT&T, LFW, IMDb-Face, as well as ND-TWINS, with the feature extraction algorithms VGGFace, VGG16, ResNet50, InceptionV3, MobileNet and DenseNet121, that reveal the quantitative impact of the named factors. While these are early results, our findings indicate the need for a better understanding of the concept of biometric uniqueness and its implication on face recognition.

Are Spoofs from Latent Fingerprints a Real Threat for the Best State-Of-Art Liveness Detectors?

Roberto Casula, Giulia Orrù, Daniele Angioni, Xiaoyi Feng, Gian Luca Marcialis, Fabio Roli

Auto-TLDR; ScreenSpoof: Attacks using latent fingerprints against state-of-art fingerprint liveness detectors and verification systems

We investigated the threat level of realistic attacks using latent fingerprints against sensors equipped with state-of-art liveness detectors and fingerprint verification systems which integrate such liveness algorithms. To the best of our knowledge, only a previous investigation was done with spoofs from latent prints. In this paper, we focus on using snapshot pictures of latent fingerprints. These pictures provide molds, that allows, after some digital processing, to fabricate high-quality spoofs. Taking a snapshot picture is much simpler than developing fingerprints left on a surface by magnetic powders and lifting the trace by a tape. What we are interested here is to evaluate preliminary at which extent attacks of the kind can be considered a real threat for state-of-art fingerprint liveness detectors and verification systems. To this aim, we collected a novel data set of live and spoof images fabricated with snapshot pictures of latent fingerprints. This data set provide a set of attacks at the most favourable conditions. We refer to this method and the related data set as "ScreenSpoof". Then, we tested with it the performances of the best liveness detection algorithms, namely, the three winners of the LivDet competition. Reported results point out that the ScreenSpoof method is a threat of the same level, in terms of detection and verification errors, than that of attacks using spoofs fabricated with the full consensus of the victim. We think that this is a notable result, never reported in previous work.

Face Anti-Spoofing Based on Dynamic Color Texture Analysis Using Local Directional Number Pattern

Junwei Zhou, Ke Shu, Peng Liu, Jianwen Xiang, Shengwu Xiong

Auto-TLDR; LDN-TOP Representation followed by ProCRC Classification for Face Anti-Spoofing

Face anti-spoofing is becoming increasingly indispensable for face recognition systems, which are vulnerable to various spoofing attacks performed using fake photos and videos. In this paper, a novel "LDN-TOP representation followed by ProCRC classification" pipeline for face anti-spoofing is proposed. We use local directional number pattern (LDN) with the derivative-Gaussian mask to capture detailed appearance information resisting illumination variations and noises, which can influence the texture pattern distribution. To further capture motion information, we extend LDN to a spatial-temporal variant named local directional number pattern from three orthogonal planes (LDN-TOP). The multi-scale LDN-TOP capturing complete information is extracted from color images to generate the feature vector with powerful representation capacity. Finally, the feature vector is fed into the probabilistic collaborative representation based classifier (ProCRC) for face anti-spoofing. Our method is evaluated on three challenging public datasets, namely CASIA FASD, Replay-Attack database, and UVAD database using sequence-based evaluation protocol. The experimental results show that our method can achieve promising performance with 0.37% EER on CASIA and 5.73% HTER on UVAD. The performance on Replay-Attack database is also competitive.

Disentangled Representation Based Face Anti-Spoofing

Zhao Liu, Zunlei Feng, Yong Li, Zeyu Zou, Rong Zhang, Mingli Song, Jianping Shen

Auto-TLDR; Face Anti-Spoofing using Motion Information and Disentangled Frame Work

Face anti-spoofing is an important problem for both academic research and industrial face recognition systems. Most of the existing face anti-spoofing methods take it as a classification task on individual static images, where motion pattern differences in consecutive real or fake face sequences are ignored. In this work, we propose a novel method to identify spoofing patterns using motion information. Different from previous methods, the proposed method makes the real or fake decision on the disentangled feature level, based on the observation that motion and spoofing pattern features could be disentangled from original image frames. We design a representation disentangling frame- work for this task, which is able to reconstruct both real and fake face sequences from the input. Meanwhile, the disentangled representations could be used to classify whether the input faces are real or fake. We perform several experiments on Casia-FASD and ReplayAttack datasets. The proposed method achieves SOTA results compared with existing face anti-spoofing methods.

Fine-Tuning Convolutional Neural Networks: A Comprehensive Guide and Benchmark Analysis for Glaucoma Screening

Amed Mvoulana, Rostom Kachouri, Mohamed Akil

Auto-TLDR; Fine-tuning Convolutional Neural Networks for Glaucoma Screening

This work aimed at giving a comprehensive and in-detailed guide on the route to fine-tuning Convolutional Neural Networks (CNNs) for glaucoma screening. Transfer learning consists in a promising alternative to train CNNs from stratch, to avoid the huge data and resources requirements. After a thorough study of five state-of-the-art CNNs architectures, a complete and well-explained strategy for fine-tuning these networks is proposed, using hyperparameter grid-searching and two-phase training approach. Excellent performance is reached on model evaluation, with a 0.9772 AUROC validation rate, giving arise to reliable glaucoma diagosis-help systems. Also, a benchmark analysis is conducted across all fine-tuned models, studying them according to performance indices such as model complexity and size, AUROC density and inference time. This in-depth analysis allows a rigorous comparison between model characteristics, and is useful for giving practioners important trademarks for prospective applications and deployments.

Hyperspectral Imaging for Analysis and Classification of Plastic Waste

Jakub Kraśniewski, Łukasz Dąbała, Lewandowski Marcin

Auto-TLDR; A Hyperspectral Camera for Material Classification

Environmental protection is one of the main challenges facing society nowadays. Even with constantly growing awareness, not all of the sorting can be done by people themselves - the differences between materials are not visible to the human eye. For that reason, we present the use of a hyperspectral camera as a capture device, which allows us to obtain the full spectrum of the material. In this work we propose a method for efficient recognition of the substance of an item. We conducted several experiments and analysis of the spectra of different materials in different conditions on a special measuring stand. That enabled identification of the best features, which can later be used during classification, which was confirmed during the extensive testing procedure.

Finger Vein Recognition and Intra-Subject Similarity Evaluation of Finger Veins Using the CNN Triplet Loss

Georg Wimmer, Bernhard Prommegger, Andreas Uhl

Auto-TLDR; Finger vein recognition using CNNs and hard triplet online selection

Finger vein recognition deals with the identification of subjects based on their venous pattern within the fingers. There is a lot of prior work using hand crafted features, but only little work using CNN based recognition systems. This article proposes a new approach using CNNs that utilizes the triplet loss function together with hard triplet online selection for finger vein recognition. The CNNs are used for three different use cases: (1) the classical recognition use case, where every finger of a subject is considered as a separate class, (2) an evaluation of the similarity of left and right hand fingers from the same subject and (3) an evaluation of the similarity of different fingers of the same subject. The results show that the proposed approach achieves superior results compared to prior work on finger vein recognition using the triplet loss function. Furtherly, we show that different fingers of the same subject, especially same fingers from the left and right hand, show enough similarities to perform recognition. The last statement contradicts the current understanding in the literature for finger vein biometry, in which it is assumed that different fingers of the same subject are unique identities.

Video Face Manipulation Detection through Ensemble of CNNs

Nicolo Bonettini, Edoardo Daniele Cannas, Sara Mandelli, Luca Bondi, Paolo Bestagini, Stefano Tubaro

Auto-TLDR; Face Manipulation Detection in Video Sequences Using Convolutional Neural Networks

In the last few years, several techniques for facial manipulation in videos have been successfully developed and made available to the masses (i.e., FaceSwap, deepfake, etc.). These methods enable anyone to easily edit faces in video sequences with incredibly realistic results and a very little effort. Despite the usefulness of these tools in many fields, if used maliciously, they can have a significantly bad impact on society (e.g., fake news spreading, cyber bullying through fake revenge porn). The ability of objectively detecting whether a face has been manipulated in a video sequence is then a task of utmost importance. In this paper, we tackle the problem of face manipulation detection in video sequences targeting modern facial manipulation techniques. In particular, we study the ensembling of different trained Convolutional Neural Network (CNN) models. In the proposed solution, different models are obtained starting from a base network (i.e., EfficientNetB4) making use of two different concepts: (i) attention layers; (ii) siamese training. We show that combining these networks leads to promising face manipulation detection results on two publicly available datasets with more than 119000 videos.

ResMax: Detecting Voice Spoofing Attacks with Residual Network and Max Feature Map

Il-Youp Kwak, Sungsu Kwag, Junhee Lee, Jun Ho Huh, Choong-Hoon Lee, Youngbae Jeon, Jeonghwan Hwang, Ji Won Yoon

Auto-TLDR; ASVspoof 2019: A Lightweight Automatic Speaker Verification Spoofing and Countermeasures System

The 2019 Automatic Speaker Verification Spoofing And Countermeasures Challenge'' (ASVspoof) competition aimed to facilitate the design of highly accurate voice spoofing attack detection systems. the competition did not emphasize model complexity and latency requirements; such constraints are strict and integral in real-world deployment. Hence, most of the top performing solutions from the competition all used an ensemble approach, and combined multiple complex deep learning models to maximize detection accuracy -- this kind of approach would sit uneasily with real-world deployment constraints. To design a lightweight system, we combined the notions of skip connection (from ResNet) and max feature map (from Light CNN), and evaluated the accuracy of the system using the ASVspoof 2019 dataset. With an optimized constant Q transform (CQT) feature, our single model achieved a replay attack detection equal error rate (EER) of 0.37% on the evaluation set, outperforming the top ensemble system from the competition that achieved an EER of 0.39%.

Exemplar Guided Cross-Spectral Face Hallucination Via Mutual Information Disentanglement

Haoxue Wu, Huaibo Huang, Aijing Yu, Jie Cao, Zhen Lei, Ran He

Auto-TLDR; Exemplar Guided Cross-Spectral Face Hallucination with Structural Representation Learning

Recently, many Near infrared-visible (NIR-VIS) heterogeneous face recognition (HFR) methods have been proposed in the community. But it remains a challenging problem because of the sensing gap along with large pose variations. In this paper, we propose an Exemplar Guided Cross-Spectral Face Hallucination (EGCH) to reduce the domain discrepancy through disentangled representation learning. For each modality, EGCH contains a spectral encoder as well as a structure encoder to disentangle spectral and structure representation, respectively. It also contains a traditional generator that reconstructs the input from the above two representations, and a structure generator that predicts the facial parsing map from the structure representation. Besides, mutual information minimization and maximization are conducted to boost disentanglement and make representations adequately expressed. Then the translation is built on structure representations between two modalities. Provided with the transformed NIR structure representation and original VIS spectral representation, EGCH is capable to produce high-fidelity VIS images that preserve the topology structure of the input NIR while transfer the spectral information of an arbitrary VIS exemplar. Extensive experiments demonstrate that the proposed method achieves more promising results both qualitatively and quantitatively than the state-of-the-art NIR-VIS methods.

Exploring Seismocardiogram Biometrics with Wavelet Transform

Po-Ya Hsu, Po-Han Hsu, Hsin-Li Liu

Auto-TLDR; Seismocardiogram Biometric Matching Using Wavelet Transform and Deep Learning Models

Seismocardiogram (SCG) has become easily accessible in the past decade owing to the advance of sensor technology. However, SCG biometric has not been widely explored. In this paper, we propose combining wavelet transform together with deep learning models, machine learning classiﬁers, or structural similarity metric to perform SCG biometric matching tasks. We validate the proposed methods on the publicly available dataset from PhysioNet database. The dataset contains one hour long electrocardiogram, breathing, and SCG data of 20 subjects. We train the models on the ﬁrst ﬁve minute SCG and conduct identiﬁcation on the last ﬁve minute SCG. We evaluate the identiﬁcation and authentication performance with recognition rate and equal error rate, respectively. Based on the results, we show that wavelet transformed SCG biometric can achieve state-of-the-art performance when combined with deep learning models, machine learning classiﬁers, or structural similarity.

Detection and Correspondence Matching of Corneal Reflections for Eye Tracking Using Deep Learning

Soumil Chugh, Braiden Brousseau, Jonathan Rose, Moshe Eizenman

Auto-TLDR; A Fully Convolutional Neural Network for Corneal Reflection Detection and Matching in Extended Reality Eye Tracking Systems

Eye tracking systems that estimate the point-of-gaze are essential in extended reality (XR) systems as they enable new interaction paradigms and technological improvements. It is important for these systems to maintain accuracy when the headset moves relative to the head (known as device slippage) due to head movements or user adjustment. One of the most accurate eye tracking techniques, which is also insensitive to shifts of the system relative to the head, uses two or more infrared (IR) light emitting diodes to illuminate the eye and an IR camera to capture images of the eye. An essential step in estimating the point-of-gaze in these systems is the precise determination of the location of two or more corneal reflections (virtual images of the IR-LEDs that illuminate the eye) in images of the eye. Eye trackers tend to have multiple light sources to ensure at least one pair of reflections for each gaze position. The use of multiple light sources introduces a difficult problem: the need to match the corneal reflections with the corresponding light source over the range of expected eye movements. Corneal reflection detection and matching often fail in XR systems due to the proximity of camera and steep illumination angles of light sources with respect to the eye. The failures are caused by corneal reflections having varying shape and intensity levels or disappearance due to rotation of the eye, or the presence of spurious reflections. We have developed a fully convolutional neural network, based on the UNET architecture, that solves the detection and matching problem in the presence of spurious and missing reflections. Eye images of 25 people were collected in a virtual reality headset using a binocular eye tracking module consisting of five infrared light sources per eye. A set of 4,000 eye images were manually labelled for each of the corneal reflections, and data augmentation was used to generate a dataset of 40,000 images. The network is able to correctly identify and match 91% of corneal reflections present in the test set. This is comparable to a state-of-the-art deep learning system, but our approach requires 33 times less memory and executes 10 times faster. The proposed algorithm, when used in an eye tracker in a VR system, achieved an average mean absolute gaze error of 1°. This is a significant improvement over the state-of-the-art learning-based XR eye tracking systems that have reported gaze errors of 2-3°.

InsideBias: Measuring Bias in Deep Networks and Application to Face Gender Biometrics

Ignacio Serna, Alejandro Peña Almansa, Aythami Morales, Julian Fierrez

Auto-TLDR; InsideBias: Detecting Bias in Deep Neural Networks from Face Images

This work explores the biases in learning processes based on deep neural network architectures. We analyze how bias affects deep learning processes through a toy example using the MNIST database and a case study in gender detection from face images. We employ two gender detection models based on popular deep neural networks. We present a comprehensive analysis of bias effects when using an unbalanced training dataset on the features learned by the models. We show how bias impacts in the activations of gender detection models based on face images. We finally propose InsideBias, a novel method to detect biased models. InsideBias is based on how the models represent the information instead of how they perform, which is the normal practice in other existing methods for bias detection. Our strategy with InsideBias allows to detect biased models with very few samples (only 15 images in our case study). Our experiments include 72K face images from 24K identities and 3 ethnic groups.

A Versatile Crack Inspection Portable System Based on Classifier Ensemble and Controlled Illumination

Milind Gajanan Padalkar, Carlos Beltran-Gonzalez, Matteo Bustreo, Alessio Del Bue, Vittorio Murino

Auto-TLDR; Lighting Conditions for Crack Detection in Ceramic Tile

This paper presents a novel setup for automatic visual inspection of cracks in ceramic tile as well as studies the effect of various classifiers and height-varying illumination conditions for this task. The intuition behind this setup is that cracks can be better visualized under specific lighting conditions than others. Our setup, which is designed for field work with constraints in its maximum dimensions, can acquire images for crack detection with multiple lighting conditions using the illumination sources placed at multiple heights. Crack detection is then performed by classifying patches extracted from the acquired images in a sliding window fashion. We study the effect of lights placed at various heights by training classifiers both on customized as well as state-of-the-art architectures and evaluate their performance both at patch-level and image-level, demonstrating the effectiveness of our setup. More importantly, ours is the first study that demonstrates how height-varying illumination conditions can affect crack detection with the use of existing state-of-the-art classifiers. We provide an insight about the illumination conditions that can help in improving crack detection in a challenging real-world industrial environment.

Cross-spectrum Face Recognition Using Subspace Projection Hashing

Hanrui Wang, Xingbo Dong, Jin Zhe, Jean-Luc Dugelay, Massimo Tistarelli

Auto-TLDR; Subspace Projection Hashing for Cross-Spectrum Face Recognition

Cross-spectrum face recognition, e.g. visible to thermal matching, remains a challenging task due to the large variation originated from different domains. This paper proposed a subspace projection hashing (SPH) to enable the cross-spectrum face recognition task. The intrinsic idea behind SPH is to project the features from different domains onto a common subspace, where matching the faces from different domains can be accomplished. Notably, we proposed a new loss function that can (i) preserve both inter-domain and intra-domain similarity; (ii) regularize a scaled-up pairwise distance between hashed codes, to optimize projection matrix. Three datasets, Wiki, EURECOM VIS-TH paired face and TDFace are adopted to evaluate the proposed SPH. The experimental results indicate that the proposed SPH outperforms the original linear subspace ranking hashing (LSRH) in the benchmark dataset (Wiki) and demonstrates a reasonably good performance for visible-thermal, visible-near-infrared face recognition, therefore suggests the feasibility and effectiveness of the proposed SPH.

Deep Fusion of RGB and NIR Paired Images Using Convolutional Neural Networks

Auto-TLDR; Deep Fusion of RGB and NIR paired images in low light condition using convolutional neural networks

In low light condition, the captured color (RGB) images are highly degraded by noise with severe texture loss. In this paper, we propose deep fusion of RGB and NIR paired images in low light condition using convolutional neural networks (CNNs). The proposed deep fusion network consists of three independent sub-networks: denoising, enhancing, and fusion. We build a denoising sub-network to eliminate noise from noisy RGB images. After denoising, we perform an enhancing sub-network to increase the brightness of low light RGB images. Since NIR image contains fine details, we fuse it with the Y channel of RGB image through a fusion sub-network. Experimental results demonstrate that the proposed method successfully fuses RGB and NIR images, and generates high quality fusion results containing textures and colors.

Two-Level Attention-Based Fusion Learning for RGB-D Face Recognition

Auto-TLDR; Fused RGB-D Facial Recognition using Attention-Aware Feature Fusion

With recent advances in RGB-D sensing technologies as well as improvements in machine learning and fusion techniques, RGB-D facial recognition has become an active area of research. A novel attention aware method is proposed to fuse two image modalities, RGB and depth, for enhanced RGB-D facial recognition. The proposed method first extracts features from both modalities using a convolutional feature extractor. These features are then fused using a two layer attention mechanism. The first layer focuses on the fused feature maps generated by the feature extractor, exploiting the relationship between feature maps using LSTM recurrent learning. The second layer focuses on the spatial features of those maps using convolution. The training database is preprocessed and augmented through a set of geometric transformations, and the learning process is further aided using transfer learning from a pure 2D RGB image training process. Comparative evaluations demonstrate that the proposed method outperforms other state-of-the-art approaches, including both traditional and deep neural network-based methods, on the challenging CurtinFaces and IIIT-D RGB-D benchmark databases, achieving classification accuracies over 98.2% and 99.3% respectively. The proposed attention mechanism is also compared with other attention mechanisms, demonstrating more accurate results.

Near-Infrared Depth-Independent Image Dehazing using Haar Wavelets

Sumit Laha, Ankit Sharma, Shengnan Hu, Hassan Foroosh

Auto-TLDR; A fusion algorithm for haze removal using Haar wavelets

We propose a fusion algorithm for haze removal that combines color information from an RGB image and edge information extracted from its corresponding NIR image using Haar wavelets. The proposed algorithm is based on the key observation that NIR edge features are more prominent in the hazy regions of the image than the RGB edge features in those same regions. To combine the color and edge information, we introduce a haze-weight map which proportionately distributes the color and edge information during the fusion process. Because NIR images are, intrinsically, nearly haze-free, our work makes no assumptions like existing works that rely on a scattering model and essentially designing a depth-independent method. This helps in minimizing artifacts and gives a more realistic sense to the restored haze-free image. Extensive experiments show that the proposed algorithm is both qualitatively and quantitatively better on several key metrics when compared to existing state-of-the-art methods.

Documents Counterfeit Detection through a Deep Learning Approach

Darwin Danilo Saire Pilco, Salvatore Tabbone

Auto-TLDR; End-to-End Learning for Counterfeit Documents Detection using Deep Neural Network

The main topic of this work is on the detection of counterfeit documents and especially banknotes. We propose an end-to-end learning model using a deep learning approach based on Adapnet++ which manages feature extraction at multiple scale levels using several residual units. Unlike previous models based on regions of interest (ROI) and high-resolution documents, our network is feed with simple input images (i.e., a single patch) and we do not need high resolution images. Besides, discriminative regions can be visualized at different scales. Our network learns by itself which regions of interest predict the better results. Experimental results show that we are competitive compared with the state-of-the-art and our deep neural network has good ability to generalize and can be applied to other kind of documents like identity or administrative one.

Dynamically Mitigating Data Discrepancy with Balanced Focal Loss for Replay Attack Detection

Yongqiang Dou, Haocheng Yang, Maolin Yang, Yanyan Xu, Dengfeng Ke

Auto-TLDR; Anti-Spoofing with Balanced Focal Loss Function and Combination Features

It becomes urgent to design effective anti-spoofing algorithms for vulnerable automatic speaker verification systems due to the advancement of high-quality playback devices. Current studies mainly treat anti-spoofing as a binary classification problem between bonafide and spoofed utterances, while lack of indistinguishable samples makes it difficult to train a robust spoofing detector. In this paper, we argue that for anti-spoofing, it needs more attention for indistinguishable samples over easily-classified ones in the modeling process, to make correct discrimination a top priority. Therefore, to mitigate the data discrepancy between training and inference, we propose to leverage a balanced focal loss function as the training objective to dynamically scale the loss based on the traits of the sample itself. Besides, in the experiments, we select three kinds of features that contain both magnitude-based and phase-based information to form complementary and informative features. Experimental results on the ASVspoof2019 dataset demonstrate the superiority of the proposed methods by comparison between our systems and top-performing ones. Systems trained with the balanced focal loss perform significantly better than conventional cross-entropy loss. With complementary features, our fusion system with only three kinds of features outperforms other systems containing five or more complex single models by 22.5% for min-tDCF and 7% for EER, achieving a min-tDCF and an EER of 0.0124 and 0.55% respectively. Furthermore, we present and discuss the evaluation results on real replay data apart from the simulated ASVspoof2019 data, indicating that research for anti-spoofing still has a long way to go.

Classifying Eye-Tracking Data Using Saliency Maps

Shafin Rahman, Sejuti Rahman, Omar Shahid, Md. Tahmeed Abdullah, Jubair Ahmed Sourov

Auto-TLDR; Saliency-based Feature Extraction for Automatic Classification of Eye-tracking Data

A plethora of research in the literature shows how human eye fixation pattern varies depending on different factors, including genetics, age, social functioning, cognitive functioning, and so on. Analysis of these variations in visual attention has already elicited two potential research avenues: 1) determining the physiological or psychological state of the subject and 2) predicting the tasks associated with the act of viewing from the recorded eye-fixation data. To this end, this paper proposes a visual saliency based novel feature extraction method for automatic and quantitative classification of eye-tracking data, which is applicable to both of the research directions. Instead of directly extracting features from the fixation data, this method employs several well-known computational models of visual attention to predict eye fixation locations as saliency maps. Comparing the saliency amplitudes, similarity and dissimilarity of saliency maps with the corresponding eye fixations maps gives an extra dimension of information which is effectively utilized to generate discriminative features to classify the eye-tracking data. Extensive experimentation using Saliency4ASD [1], Age Prediction [2], and Visual Perceptual Task [3] dataset show that our saliency-based feature can achieve superior performance, outperforming the previous state-of-the-art methods [2],[4], [5] by a considerable margin. Moreover, unlike the existing application-specific solutions, our method demonstrates performance improvement across three distinct problems from the real-life domain: Autism Spectrum Disorder screening, toddler age prediction, and human visual perceptual task classification, providing a general paradigm that utilizes the extra-information inherent in saliency maps for a more accurate classification.

Can You Really Trust the Sensor's PRNU? How Image Content Might Impact the Finger Vein Sensor Identification Performance

Dominik Söllinger, Luca Debiasi, Andreas Uhl

Auto-TLDR; Finger vein imagery can cause the PRNU estimate to be biased by image content

We study the impact of highly correlated image content on the estimated sensor PRNU and its impact on the sensor identification performance. Based on eight publicly available finger vein datasets, we show formally and experimentally that the nature of finger vein imagery can cause the estimated PRNU to be biased by image content and lead to a fairly bad PRNU estimate. Such bias can cause a false increase in sensor identification performance depending on the dataset composition. Our results indicate that independent of the biometric modality, examining the quality of the estimated PRNU is essential before claiming the sensor identification performance to be good.

Real-Time Drone Detection and Tracking with Visible, Thermal and Acoustic Sensors

Fredrik Svanström, Cristofer Englund, Fernando Alonso-Fernandez

Auto-TLDR; Automatic multi-sensor drone detection using sensor fusion

This paper explores the process of designing an automatic multi-sensor drone detection system. Besides the common video and audio sensors, the system also includes a thermal infrared camera, which is shown to be a feasible solution to the drone detection task. Even with slightly lower resolution, the performance is just as good as a camera in visible range. The detector performance as a function of the sensor-to-target distance is also investigated. In addition, using sensor fusion, the system is made more robust than the individual sensors, helping to reduce false detections. To counteract the lack of public datasets, a novel video dataset containing 650 annotated infrared and visible videos of drones, birds, airplanes and helicopters is also presented. The database is complemented with an audio dataset of the classes drones, helicopters and background noise.

Weight Estimation from an RGB-D Camera in Top-View Configuration

Marco Mameli, Marina Paolanti, Nicola Conci, Filippo Tessaro, Emanuele Frontoni, Primo Zingaretti

Auto-TLDR; Top-View Weight Estimation using Deep Neural Networks

The development of so-called soft-biometrics aims at providing information related to the physical and behavioural characteristics of a person. This paper focuses on bodyweight estimation based on the observation from a top-view RGB-D camera. In fact, the capability to estimate the weight of a person can be of help in many different applications, from health-related scenarios to business intelligence and retail analytics. To deal with this issue, a TVWE (Top-View Weight Estimation) framework is proposed with the aim of predicting the weight. The approach relies on the adoption of Deep Neural Networks (DNNs) that have been trained on depth data. Each network has also been modified in its top section to replace classification with prediction inference. The performance of five state-of-art DNNs has been compared, namely VGG16, ResNet, Inception, DenseNet and Efficient-Net. In addition, a convolutional auto-encoder has also been included for completeness. Considering the limited literature in this domain, the TVWE framework has been evaluated on a new publicly available dataset: “VRAI Weight estimation Dataset”, which also collects, for each subject, labels related to weight, gender, and height. The experimental results have demonstrated that the proposed methods are suitable for this task, bringing different and significant insights for the application of the solution in different domains.

Attack-Agnostic Adversarial Detection on Medical Data Using Explainable Machine Learning

Matthew Watson, Noura Al Moubayed

Auto-TLDR; Explainability-based Detection of Adversarial Samples on EHR and Chest X-Ray Data

Explainable machine learning has become increasingly prevalent, especially in healthcare where explainable models are vital for ethical and trusted automated decision making. Work on the susceptibility of deep learning models to adversarial attacks has shown the ease of designing samples to mislead a model into making incorrect predictions. In this work, we propose an explainability-based method for the accurate detection of adversarial samples on two datasets with different complexity and properties: Electronic Health Record (EHR) and chest X-ray (CXR) data. On the MIMIC-III and Henan-Renmin EHR datasets, we report a detection accuracy of 77% against the Longitudinal Adversarial Attack. On the MIMIC-CXR dataset, we achieve an accuracy of 88%; significantly improving on the state of the art of adversarial detection in both datasets by over 10% in all settings. We propose an anomaly detection based method using explainability techniques to detect adversarial samples which is able to generalise to different attack methods without a need for retraining.

Real Time Fencing Move Classification and Detection at Touch Time During a Fencing Match

Cem Ekin Sunal, Chris G. Willcocks, Boguslaw Obara

Auto-TLDR; Fencing Body Move Classification and Detection Using Deep Learning

Fencing is a fast-paced sport played with swords which are Epee, Foil, and Saber. However, such fast-pace can cause referees to make wrong decisions. Review of slow-motion camera footage in tournaments helps referees’ decision making, but it interrupts the match and may not be available for every organization. Motivated by the need for better decision making, analysis, and availability, we introduce the first fully-automated deep learning classification and detection system for fencing body moves at the moment a touch is made. This is an important step towards creating a fencing analysis system, with player profiling and decision tools that will benefit the fencing community. The proposed architecture combines You Only Look Once version three (YOLOv3) with a ResNet-34 classifier, trained on ImageNet settings to obtain 83.0\% test accuracy on the fencing moves. These results are exciting development in the sport, providing immediate feedback and analysis along with accessibility, hence making it a valuable tool for trainers and fencing match referees.

A Flatter Loss for Bias Mitigation in Cross-Dataset Facial Age Estimation

Ali Akbari, Muhammad Awais, Zhenhua Feng, Ammarah Farooq, Josef Kittler

Auto-TLDR; Cross-dataset Age Estimation for Neural Network Training

Existing studies in facial age estimation have mostly focused on intra-dataset protocols that assume training and test images captured under similar conditions. However, this is rarely valid in practical applications, where training and test sets usually have different characteristics. In this paper, we advocate a cross-dataset protocol for age estimation benchmarking. In order to improve the cross-dataset age estimation performance, we mitigate the inherent bias caused by the learning algorithm. To this end, we propose a novel loss function that is more effective for neural network training. The relative smoothness of the proposed loss function is its advantage with regards to the optimisation process performed by stochastic gradient decent. Its lower gradient, compared with existing loss functions, facilitates the discovery of and convergence to a better optimum, and consequently a better generalisation. The cross-dataset experimental results demonstrate the superiority of the proposed method over the state-of-the-art algorithms in terms of accuracy and generalisation capability.

Rotation Detection in Finger Vein Biometrics Using CNNs

Bernhard Prommegger, Georg Wimmer, Andreas Uhl

Auto-TLDR; A CNN based rotation detector for finger vein recognition

Finger vein recognition deals with the identification of subjects based on their venous pattern within the fingers. The recognition accuracy of finger vein recognition systems suffers from different internal and external factors. One of the major problems are misplacements of the finger during acquisition. In particular longitudinal finger rotation poses a severe problem for such recognition systems. The detection and correction of such rotations is a difficult task as typically finger vein scanners acquire only a single image from the vein pattern. Therefore, important information such as the shape of the finger or the depth of the veins within the finger, which are needed for the rotation detection, are not available. This work presents a CNN based rotation detector that is capable of estimating the rotational difference between vein images of the same finger without providing any additional information. The experiments executed not only show that the method delivers highly accurate results, but it also generalizes so that the trained CNN can also be applied on data sets which have not been included during the training of the CNN. Correcting the rotation difference between images using the CNN's rotation prediction leads to EER improvements between 50-260% for a well-established vein-pattern based method (Maximum Curvature) on four public finger vein databases.

3D Facial Matching by Spiral Convolutional Metric Learning and a Biometric Fusion-Net of Demographic Properties

Soha Sadat Mahdi, Nele Nauwelaers, Philip Joris, Giorgos Bouritsas, Imperial London, Sergiy Bokhnyak, Susan Walsh, Mark Shriver, Michael Bronstein, Peter Claes

Auto-TLDR; Multi-biometric Fusion for Biometric Verification using 3D Facial Mesures

Face recognition is a widely accepted biometric verification tool, as the face contains a lot of information about the identity of a person. In this study, a 2-step neural-based pipeline is presented for matching 3D facial shape to multiple DNA-related properties (sex, age, BMI and genomic background). The first step consists of a triplet loss-based metric learner that compresses facial shape into a lower dimensional embedding while preserving information about the property of interest. Most studies in the field of metric learning have only focused on Euclidean data. In this work, geometric deep learning is employed to learn directly from 3D facial meshes. To this end, spiral convolutions are used along with a novel mesh-sampling scheme that retains uniformly sampled 3D points at different levels of resolution. The second step is a multi-biometric fusion by a fully connected neural network. The network takes an ensemble of embeddings and property labels as input and returns genuine and imposter scores. Since embeddings are accepted as an input, there is no need to train classifiers for the different properties and available data can be used more efficiently. Results obtained by a 10-fold cross-validation for biometric verification show that combining multiple properties leads to stronger biometric systems. Furthermore, the proposed neural-based pipeline outperforms a linear baseline, which consists of principal component analysis, followed by classification with linear support vector machines and a Naïve Bayes-based score-fuser.

An Experimental Evaluation of Recent Face Recognition Losses for Deepfake Detection

Yu-Cheng Liu, Chia-Ming Chang, I-Hsuan Chen, Yu Ju Ku, Jun-Cheng Chen

Auto-TLDR; Deepfake Classification and Detection using Loss Functions for Face Recognition

Due to the recent breakthroughs of deep generative models, the fake faces, also known as deepfake which has been abused to deceive the general public, can be easily produced at scale and in very high fidelity. Many works focus on exploring various network architectures or various artifacts produced by deep generative models. Instead, in this work, we focus on the loss functions which have been shown to play a significant role in the context of face recognition. We perform a thorough study of several recent state-of-the-art losses commonly used in face recognition task for deepfake classification and detection since the current deepfake is highly related to face generation. With extensive experiments on the challenging FaceForensic++ and Celeb-DF datasets, the evaluation results provide a clear overview of the performance comparisons of different loss functions and generalization capability across different deepfake data.

Attribute-Based Quality Assessment for Demographic Estimation in Face Videos

Fabiola Becerra-Riera, Annette Morales-González, Heydi Mendez-Vazquez, Jean-Luc Dugelay

Auto-TLDR; Facial Demographic Estimation in Video Scenarios Using Quality Assessment

Most existing works regarding facial demographic estimation are focused on still image datasets, although nowadays the need to analyze video content in real applications is increasing. We propose to tackle gender, age and ethnicity estimation in the context of video scenarios. Our main contribution is to use an attribute-specific quality assessment procedure to select best quality frames from a video sequence for each of the three demographic modalities. Best quality frames are classified with fine-tuned MobileNet models and a final video prediction is obtained with a majority voting strategy among the best selected frames. Our validation on three different datasets and our comparison with state-of-the-art models, show the effectiveness of the proposed demographic classifiers and the quality pipeline, which allows to reduce both: the number of frames to be classified and the processing time in practical applications; and improves the soft biometrics prediction accuracy.

Trainable Spectrally Initializable Matrix Transformations in Convolutional Neural Networks

Michele Alberti, Angela Botros, Schuetz Narayan, Rolf Ingold, Marcus Liwicki, Mathias Seuret

Auto-TLDR; Trainable and Spectrally Initializable Matrix Transformations for Neural Networks

In this work, we introduce a new architectural component to Neural Networks (NN), i.e., trainable and spectrally initializable matrix transformations on feature maps. While previous literature has already demonstrated the possibility of adding static spectral transformations as feature processors, our focus is on more general trainable transforms. We study the transforms in various architectural configurations on four datasets of different nature: from medical (ColorectalHist, HAM10000) and natural (Flowers) images to historical documents (CB55). With rigorous experiments that control for the number of parameters and randomness, we show that networks utilizing the introduced matrix transformations outperform vanilla neural networks. The observed accuracy increases appreciably across all datasets. In addition, we show that the benefit of spectral initialization leads to significantly faster convergence, as opposed to randomly initialized matrix transformations. The transformations are implemented as auto-differentiable PyTorch modules that can be incorporated into any neural network architecture. The entire code base is open-source.

Complex-Object Visual Inspection: Empirical Studies on a Multiple Lighting Solution

Maya Aghaei, Matteo Bustreo, Pietro Morerio, Nicolò Carissimi, Alessio Del Bue, Vittorio Murino

Auto-TLDR; A Novel Illumination Setup for Automatic Visual Inspection of Complex Objects

The design of an automatic visual inspection system is usually performed in two stages. While the first stage consists in selecting the most suitable hardware setup for highlighting most effectively the defects on the surface to be inspected, the second stage concerns the development of algorithmic solutions to exploit the potentials offered by the collected data. In this paper, first, we present a novel illumination setup embedding four illumination configurations to resemble diffused, dark-field, and front lighting techniques. Second, we analyze the contributions brought by deploying the proposed setup in the training phase only, mimicking the scenario in which an already developed visual inspection system cannot be modified on the customer site. Along with an exhaustive set of experiments, in this paper, we demonstrate the suitability of the proposed setup for effective illumination of complex-objects, defined as manufactured items with variable surface characteristics that cannot be determined a priori. Eventually, we provide insights into the importance of multiple light configurations availability during training and their natural boosting effect which, without the need to modify the system design in the evaluation phase, lead to improvements in the overall system performance.

Bridging the Gap between Natural and Medical Images through Deep Colorization

Lia Morra, Luca Piano, Fabrizio Lamberti, Tatiana Tommasi

Auto-TLDR; Transfer Learning for Diagnosis on X-ray Images Using Color Adaptation

Deep learning has thrived by training on large-scale datasets. However, in many applications, as for medical image diagnosis, getting massive amount of data is still prohibitive due to privacy, lack of acquisition homogeneity and annotation cost. In this scenario transfer learning from natural image collections is a standard practice that attempts to tackle shape, texture and color discrepancy all at once through pretrained model fine-tuning. In this work we propose to disentangle those challenges and design a dedicated network module that focuses on color adaptation. We combine learning from scratch of the color module with transfer learning of different classification backbones obtaining an end-to-end, easy-to-train architecture for diagnostic image recognition on X-ray images. Extensive experiments show how our approach is particularly efficient in case of data scarcity and provides a new path for further transferring the learned color information across multiple medical datasets.

Detecting Manipulated Facial Videos: A Time Series Solution

Zhang Zhewei, Ma Can, Gao Meilin, Ding Bowen

Auto-TLDR; Face-Alignment Based Bi-LSTM for Fake Video Detection

We propose a new method to expose fake videos based on a time series solution. The method is based on bidirectional long short-term memory (Bi-LSTM) backbone architecture with two different types of features: {Face-Alignment} and {Dense-Face-Alignment}, in which both of them are physiological signals that can be distinguished between fake and original videos. We choose 68 landmark points as the feature of {Face-Alignment} and Pose Adaptive Feature (PAF) for {Dense-Face-Alignment}. Based on these two facial features, we designed two deep networks. In addition, we optimize our network by adding an attention mechanism that improves detection precision. Our method is tested over benchmarks of Face Forensics/Face Forensics++ dataset and show a promising performance on inference speed while maintaining accuracy with state-of art solutions that deal against DeepFake.

Human or Machine? It Is Not What You Write, but How You Write It

Luis Leiva, Moises Diaz, M.A. Ferrer, Réjean Plamondon

Auto-TLDR; Behavioral Biometrics via Handwritten Symbols for Identification and Verification

Online fraud often involves identity theft. Since most security measures are weak or can be spoofed, we investigate a more nuanced and less explored avenue: behavioral biometrics via handwriting movements. This kind of data can be used to verify if a legitimate user is operating a device or a computer application, so it is important to distinguish between human and machine-generated movements reliably. For this purpose, we study handwritten symbols (isolated characters, digits, gestures, and signatures) produced by humans and machines, and compare and contrast several deep learning models. We find that if symbols are presented as static images, they can fool state-of-the-art classifiers (near 75% accuracy in the best case) but can be distinguished with remarkable accuracy if they are presented as temporal sequences (95% accuracy in the average case). We conclude that an accurate detection of fake movements has more to do with how users write, rather than what they write. Our work has implications for computerized systems that need to authenticate or verify legitimate human users, and provides an additional layer of security to keep attackers at bay.

A Systematic Investigation on Deep Architectures for Automatic Skin Lesions Classification

Pierluigi Carcagni, Marco Leo, Andrea Cuna, Giuseppe Celeste, Cosimo Distante

Auto-TLDR; RegNet: Deep Investigation of Convolutional Neural Networks for Automatic Classification of Skin Lesions

Computer vision-based techniques are more and more employed in healthcare and medical fields nowadays in order, principally, to be as a support to the experienced medical staff to help them to make a quick and correct diagnosis. One of the hot topics in this arena concerns the automatic classification of skin lesions. Several promising works exist about it, mainly leveraging Convolutional Neural Networks (CNN), but proposed pipeline mainly rely on complex data preprocessing and there is no systematic investigation about how available deep models can actually reach the accuracy needed for real applications. In order to overcome these drawbacks, in this work, an end-to-end pipeline is introduced and some of the most recent Convolutional Neural Networks (CNNs) architectures are included in it and compared on the largest common benchmark dataset recently introduced. To this aim, for the first time in this application context, a new network design paradigm, namely RegNet, has been exploited to get the best models among a population of configurations. The paper introduces a threefold level of contribution and novelty with respect the previous literature: the deep investigation of several CNN architectures driving to a consistent improvement of the lesions recognition accuracy, the exploitation of a new network design paradigm able to study the behavior of populations of models and a deep discussion about pro and cons of each analyzed method paving the path towards new research lines.

ISP4ML: The Role of Image Signal Processing in Efficient Deep Learning Vision Systems

Patrick Hansen, Alexey Vilkin, Yury Khrustalev, James Stuart Imber, Dumidu Sanjaya Talagala, David Hanwell, Matthew Mattina, Paul Whatmough

Auto-TLDR; Towards Efficient Convolutional Neural Networks with Image Signal Processing

Convolutional neural networks (CNNs) are now predominant components in a variety of computer vision (CV) systems. These systems typically include an image signal processor (ISP), even though the ISP is traditionally designed to produce images that look appealing to humans. In CV systems, it is not clear what the role of the ISP is, or if it is even required at all for accurate prediction. In this work, we investigate the efficacy of the ISP in CNN classification tasks and outline the system-level trade-offs between prediction accuracy and computational cost. To do so, we build software models of a configurable ISP and an imaging sensor to train CNNs on ImageNet with a range of different ISP settings and functionality. Results on ImageNet show that an ISP improves accuracy by 4.6\%-12.2\% on MobileNets. Results from ResNets demonstrate these trends also generalize to deeper networks. An ablation study of the various processing stages in a typical ISP reveals that the tone mapper is the most significant stage when operating on high dynamic range (HDR) images, by providing 5.8\% average accuracy improvement alone. Overall, the ISP benefits system efficiency because the memory and computational costs of the ISP is minimal compared to the cost of using a larger CNN to achieve the same accuracy.

Lightweight Low-Resolution Face Recognition for Surveillance Applications

Yoanna Martínez-Díaz, Heydi Mendez-Vazquez, Luis S. Luevano, Leonardo Chang, Miguel Gonzalez-Mendoza

Auto-TLDR; Efﬁciency of Lightweight Deep Face Networks on Low-Resolution Surveillance Imagery

Typically, real-world requirements to deploy face recognition models in unconstrained surveillance scenarios demand to identify low-resolution faces with extremely low computational cost. In the last years, several methods based on complex deep learning models have been proposed with promising recognition results but at a high computational cost. Inspired by the compactness and computation efﬁciency of lightweight deep face networks and their high accuracy on general face recognition tasks, in this work we propose to benchmark two recently introduced lightweight face models on low-resolution surveillance imagery to enable efﬁcient system deployment. In this way, we conduct a comprehensive evaluation on the two typical settings: LR-to-HR and LR-to-LR matching. In addition, we investigate the effect of using trained models with down-sampled synthetic data from high-resolution images, as well as the combination of different models, for face recognition on real low-resolution images. Experimental results show that the used lightweight face models achieve state-of-the-art results on low-resolution benchmarks with low memory footprint and computational complexity. Moreover, we observed that combining models trained with different degradations improves the recognition accuracy on low-resolution surveillance imagery, which is feasible due to their low computational cost.

Single View Learning in Action Recognition

Gaurvi Goyal, Nicoletta Noceti, Francesca Odone

Auto-TLDR; Cross-View Action Recognition Using Domain Adaptation for Knowledge Transfer

Viewpoint is an essential aspect of how an action is visually perceived, with the motion appearing substantially different for some viewpoint pairs. Data driven action recognition algorithms compensate for this by including a variety of viewpoints in their training data, adding to the cost of data acquisition as well as training. We propose a novel methodology that leverages deeply pretrained features to learn actions from a single viewpoint using domain adaptation for knowledge transfer. We demonstrate the effectiveness of this pipeline on 3 different datasets: IXMAS, MoCA and NTU RGBD+, and compare with both classical and deep learning methods. Our method requires low training data and demonstrates unparalleled cross-view action recognition accuracies for single view learning.

A Novel Disaster Image Data-Set and Characteristics Analysis Using Attention Model

Fahim Faisal Niloy, Arif ., Abu Bakar Siddik Nayem, Anis Sarker, Ovi Paul, M Ashraful Amin, Amin Ahsan Ali, Moinul Islam Zaber, Akmmahbubur Rahman

Auto-TLDR; Attentive Attention Model for Disaster Classification

The advancement of deep learning technology has enabled us to develop systems that outperform any other classification technique. However, success of any empirical system depends on the quality and diversity of the data available to train the proposed system. In this research, we have carefully accumulated a relatively challenging dataset that contains images collected from various sources for three different disasters: fire, water and land. Besides this, we have also collected images for various damaged infrastructure due to natural or man made calamities and damaged human due to war or accidents. We have also accumulated image data for a class named non-damage that contains images with no such disaster or sign of damage in them. There are 13,720 manually annotated images in this dataset, each image is annotated by three individuals. We are also providing discriminating image class information annotated manually with bounding box for a set of 200 test images. Images are collected from different news portals, social media, and standard datasets made available by other researchers. A three layer attention model (TLAM) is trained and average five fold validation accuracy of 95.88% is achieved. Moreover, on the 200 unseen test images this accuracy is 96.48%. We also generate and compare attention maps for these test images to determine the characteristics of the trained attention model.