#### CCA: Exploring the Possibility of Contextual Camouflage Attack on Object Detection

Shengnan Hu, Yang Zhang, Sumit Laha, Ankit Sharma, Hassan Foroosh
Track 3: Computer Vision Robotics and Intelligent Systems
Tue 12 Jan 2021 at 17:00 in session PS T3.3

Auto-TLDR; Contextual camouflage attack for object detection

Deep neural network based object detection has become the cornerstone of many real-world applications. Along with this success comes concerns about its vulnerability to malicious attacks. To gain more insight into this issue, we propose a contextual camouflage attack (CCA for short) algorithm to influence the performance of object detectors. In this paper, we use an evolutionary search strategy and adversarial machine learning in interactions with a photo-realistic simulated environment to find camouflage patterns that are effective over a huge variety of object locations, camera poses, and lighting conditions. The proposed camouflages are validated effective to most of the state-of-the-art object detectors.

## Similar papers

#### AdvHat: Real-World Adversarial Attack on ArcFace Face ID System

Stepan Komkov, Aleksandr Petiushko
Track 2: Biometrics, Human Analysis and Behavior Understanding
Wed 13 Jan 2021 at 12:00 in session PS T2.2

Auto-TLDR; Adversarial Sticker Attack on ArcFace in Shooting Conditions

In this paper we propose a novel easily reproducible technique to attack the best public Face ID system ArcFace in different shooting conditions. To create an attack, we print the rectangular paper sticker on a common color printer and put it on the hat. The adversarial sticker is prepared with a novel algorithm for off-plane transformations of the image which imitates sticker location on the hat. Such an approach confuses the state-of-the-art public Face ID model LResNet100E-IR, ArcFace@ms1m-refine-v2 and is transferable to other Face ID models.

#### Transferable Adversarial Attacks for Deep Scene Text Detection

Shudeng Wu, Tao Dai, Guanghao Meng, Bin Chen, Jian Lu, Shutao Xia
Track 4: Document and Media Analysis
Fri 15 Jan 2021 at 13:00 in session OS T 4.2

Auto-TLDR; Robustness of DNN-based STD methods against Adversarial Attacks

Scene text detection (STD) aims to locate text in images and plays an important role in many computer vision tasks including automatic driving and text recognition systems. Recently, deep neural networks (DNNs) have been widely and successfully used in scene text detection, leading to plenty of DNN-based STD methods including regression-based and segmentation-based STD methods. However, recent studies have also shown that DNN is vulnerable to adversarial attacks, which can significantly degrade the performance of DNN models. In this paper, we investigate the robustness of DNN-based STD methods against adversarial attacks. To this end, we propose a generic and efficient attack method to generate adversarial examples, which are produced by adding small but imperceptible adversarial perturbation to the input images. Experiments on attacking four various models and a real-world STD engine of Google optical character recognition (OCR) show that the state-of-the-art DNN-based STD methods including regression-based and segmentation-based methods are vulnerable to adversarial attacks.

#### SyNet: An Ensemble Network for Object Detection in UAV Images

Berat Mert Albaba, Sedat Ozer
Track 3: Computer Vision Robotics and Intelligent Systems
Fri 15 Jan 2021 at 16:00 in session PS T3.11

Auto-TLDR; SyNet: Combining Multi-Stage and Single-Stage Object Detection for Aerial Images

Recent advances in camera equipped drone applications and their widespread use increased the demand on vision based object detection algorithms for aerial images. Object detection process is inherently a challenging task as a generic computer vision problem, however, since the use of object detection algorithms on UAVs (or on drones) is relatively a new area, it remains as a more challenging problem to detect objects in aerial images. There are several reasons for that including: (i) the lack of large drone datasets including large object variance, (ii) the large orientation and scale variance in drone images when compared to the ground images, and (iii) the difference in texture and shape features between the ground and the aerial images. Deep learning based object detection algorithms can be classified under two main categories: (a) single-stage detectors and (b) multi-stage detectors. Both single-stage and multi-stage solutions have their advantages and disadvantages over each other. However, a technique to combine the good sides of each of those solutions could yield even a stronger solution than each of those solutions individually. In this paper, we propose an ensemble network, SyNet, that combines a multi-stage method with a single-stage one with the motivation of decreasing the high false negative rate of multi-stage detectors and increasing the quality of the single-stage detector proposals. As building blocks, CenterNet and Cascade R-CNN with pretrained feature extractors are utilized along with an ensembling strategy. We report the state of the art results obtained by our proposed solution on two different datasets: namely MS-COCO and visDrone with \%52.1 $mAP_{IoU = 0.75}$ is obtained on MS-COCO $val2017$ dataset and \%26.2 $mAP_{IoU = 0.75}$ is obtained on VisDrone $test-set$. Our code is available at: https://github.com/mertalbaba/SyNet}{https://github.com/mer talbaba/SyNet

#### Optimal Transport As a Defense against Adversarial Attacks

Quentin Bouniot, Romaric Audigier, Angélique Loesch
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Wed 13 Jan 2021 at 16:30 in session PS T1.7

Auto-TLDR; Sinkhorn Adversarial Training with Optimal Transport Theory

Deep learning classifiers are now known to have flaws in the representations of their class. Adversarial attacks can find a human-imperceptible perturbation for a given image that will mislead a trained model. The most effective methods to defend against such attacks trains on generated adversarial examples to learn their distribution. Previous work aimed to align original and adversarial image representations in the same way as domain adaptation to improve robustness. Yet, they partially align the representations using approaches that do not reflect the geometry of space and distribution. In addition, it is difficult to accurately compare robustness between defended models. Until now, they have been evaluated using a fixed perturbation size. However, defended models may react differently to variations of this perturbation size. In this paper, the analogy of domain adaptation is taken a step further by exploiting optimal transport theory. We propose to use a loss between distributions that faithfully reflect the ground distance. This leads to SAT (Sinkhorn Adversarial Training), a more robust defense against adversarial attacks. Then, we propose to quantify more precisely the robustness of a model to adversarial attacks over a wide range of perturbation sizes using a different metric, the Area Under the Accuracy Curve (AUAC). We perform extensive experiments on both CIFAR-10 and CIFAR-100 datasets and show that our defense is globally more robust than the state-of-the-art.

#### MFPP: Morphological Fragmental Perturbation Pyramid for Black-Box Model Explanations

Qing Yang, Xia Zhu, Jong-Kae Fwu, Yun Ye, Ganmei You, Yuan Zhu
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Tue 12 Jan 2021 at 15:00 in session PS T1.1

Auto-TLDR; Morphological Fragmental Perturbation Pyramid for Explainable Deep Neural Network

Deep neural networks (DNNs) have recently been applied and used in many advanced and diverse tasks, such as medical diagnosis, automatic driving, etc. Due to the lack of transparency of the deep models, DNNs are often criticized for their prediction that cannot be explainable by human. In this paper, we propose a novel Morphological Fragmental Perturbation Pyramid (MFPP) method to solve the Explainable AI problem. In particular, we focus on the black-box scheme, which can identify the input area responsible for the output of the DNN without having to understand the internal architecture of the DNN. In the MFPP method, we divide the input image into multi-scale fragments and randomly mask out fragments as perturbation to generate a saliency map, which indicates the significance of each pixel for the prediction result of the black box model. Compared with the existing input sampling perturbation method, the pyramid structure fragment has proved to be more effective. It can better explore the morphological information of the input image to match its semantic information, and does not need any value inside the DNN. We qualitatively and quantitatively prove that MFPP meets and exceeds the performance of state-of-the-art (SOTA) black-box interpretation method on multiple DNN models and datasets.

#### Forground-Guided Vehicle Perception Framework

Kun Tian, Tong Zhou, Shiming Xiang, Chunhong Pan
Track 3: Computer Vision Robotics and Intelligent Systems
Fri 15 Jan 2021 at 15:00 in session PS T3.10

Auto-TLDR; A foreground segmentation branch for vehicle detection

As the basis of advanced visual tasks such as vehicle tracking and traffic flow analysis, vehicle detection needs to accurately predict the position and category of vehicle objects. In the past decade, deep learning based methods have made great progress. However, we also notice that some existing cases are not studied thoroughly. First, false positive on the background regions is one of the critical problems. Second, most of the previous approaches only optimize a single vehicle detection model, ignoring the relationship between different visual perception tasks. In response to the above two findings, we introduce a foreground segmentation branch for the first time, which can predict the pixel level of vehicles in advance. Furthermore, two attention modules are designed to guide the work of the detection branch. The proposed method can be easily grafted into the one-stage and two-stage detection framework. We evaluate the effectiveness of our model on LSVH, a dataset with large variations in vehicle scales, and achieve the state-of-the-art detection accuracy.

#### Transformer-Encoder Detector Module: Using Context to Improve Robustness to Adversarial Attacks on Object Detection

Faisal Alamri, Sinan Kalkan, Nicolas Pugeault
Track 3: Computer Vision Robotics and Intelligent Systems
Wed 13 Jan 2021 at 16:30 in session PS T3.6

Auto-TLDR; Context Module for Robust Object Detection with Transformer-Encoder Detector Module

Deep neural network approaches have demonstrated high performance in object recognition (CNN) and detection (Faster-RCNN) tasks, but experiments have shown that such architectures are vulnerable to adversarial attacks (FFF, UAP): low amplitude perturbations, barely perceptible by the human eye, can lead to a drastic reduction in labelling performance. This article proposes a new context module, called Transformer-Encoder Detector Module, that can be applied to an object detector to (i) improve the labelling of object instances; and (ii) improve the detector's robustness to adversarial attacks. The proposed model achieves higher mAP, F1 scores and AUC average score of up to 13\% compared to the baseline Faster-RCNN detector, and an mAP score 8 points higher on images subjected to FFF or UAP attacks. The result demonstrates that a simple ad-hoc context module can improve the reliability of object detectors significantly

#### Cost-Effective Adversarial Attacks against Scene Text Recognition

Mingkun Yang, Haitian Zheng, Xiang Bai, Jiebo Luo
Track 4: Document and Media Analysis
Thu 14 Jan 2021 at 12:00 in session PS T4.3

Auto-TLDR; Adversarial Attacks on Scene Text Recognition

Scene text recognition is a challenging task due to the diversity in text appearance and complexity of natural scenes. Thanks to the development of deep learning and the large volume of training data, scene text recognition has made impressive progress in recent years. However, recent research on adversarial examples has shown that deep learning models are vulnerable to adversarial input with imperceptible changes. As one of the most practical tasks in computer vision, scene text recognition is also facing huge security risks. To our best knowledge, there has been no work on adversarial attacks against scene text recognition. To investigate its effects on scene text recognition, we make the first attempt to attack the state-of-the-art scene text recognizer, i.e., attention-based recognizer. To that end, we first adjust the objective function designed for non-sequential tasks, such as image classification, semantic segmentation and image retrieval, to the sequential form. We then propose a novel and effective objective function to further reduce the amount of perturbation while achieving a higher attack success rate. Comprehensive experiments on several standard benchmarks clearly demonstrate effective adversarial effects on scene text recognition by the proposed attacks.

#### Object Detection on Monocular Images with Two-Dimensional Canonical Correlation Analysis

Zifan Yu, Suya You
Track 3: Computer Vision Robotics and Intelligent Systems
Tue 12 Jan 2021 at 17:00 in session PS T3.2

Auto-TLDR; Multi-Task Object Detection from Monocular Images Using Multimodal RGB and Depth Data

Accurate and robust detection objects from monocular images is a fundamental vision task. This paper describes a novel approach of holistic scene understanding that can simultaneously achieve multiple tasks of scene reconstruction and object detection from a single monocular camera. Rather than pursuing an independent solution for each individual task as most existing work does, we seek a globally optimal solution that holistically resolves the multiple perception and reasoning tasks in an effective manner. The approach explores the complementary properties of multimodal RGB imagery and depth data to improve scene perception tasks. It uniquely combines the techniques of canonical correlation analysis and deep learning to learn the most correlated features to maximize the modal cross-correlation for improving the performance and robustness of object detection in complex environments. Extensive experiments have been conducted to evaluate and demonstrate the performances of the proposed approach.

#### F-Mixup: Attack CNNs from Fourier Perspective

Xiu-Chuan Li, Xu-Yao Zhang, Fei Yin, Cheng-Lin Liu
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Fri 15 Jan 2021 at 15:00 in session PS T1.13

Auto-TLDR; F-Mixup: A novel black-box attack in frequency domain for deep neural networks

Recent research has revealed that deep neural networks are highly vulnerable to adversarial examples. In this paper, different from most adversarial attacks which directly modify pixels in spatial domain, we propose a novel black-box attack in frequency domain, named as f-mixup, based on the property of natural images and perception disparity between human-visual system (HVS) and convolutional neural networks (CNNs): First, natural images tend to have the bulk of their Fourier spectrums concentrated on the low frequency domain; Second, HVS is much less sensitive to high frequencies while CNNs can utilize both low and high frequency information to make predictions. Extensive experiments are conducted and show that deeper CNNs tend to concentrate more on the high frequency domain, which may explain the contradiction between robustness and accuracy. In addition, we compared f-mixup with existing attack methods and observed that our approach possesses great advantages. Finally, we show that f-mixup can be also incorporated in training to make deep CNNs defensible against a kind of perturbations effectively.

#### Adversarially Training for Audio Classifiers

Raymel Alfonso Sallo, Mohammad Esmaeilpour, Patrick Cardinal
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Wed 13 Jan 2021 at 16:30 in session PS T1.8

Auto-TLDR; Adversarially Training for Robust Neural Networks against Adversarial Attacks

In this paper, we investigate the potential effect of the adversarially training on the robustness of six advanced deep neural networks against a variety of targeted and non-targeted adversarial attacks. We firstly show that, the ResNet-56 model trained on the 2D representation of the discrete wavelet transform appended with the tonnetz chromagram outperforms other models in terms of recognition accuracy. Then we demonstrate the positive impact of adversarially training on this model as well as other deep architectures against six types of attack algorithms (white and black-box) with the cost of the reduced recognition accuracy and limited adversarial perturbation. We run our experiments on two benchmarking environmental sound datasets and show that without any imposed limitations on the budget allocations for the adversary, the fooling rate of the adversarially trained models can exceed 90%. In other words, adversarial attacks exist in any scales, but they might require higher adversarial perturbations compared to non-adversarially trained models.

#### Task-based Focal Loss for Adversarially Robust Meta-Learning

Yufan Hou, Lixin Zou, Weidong Liu
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Fri 15 Jan 2021 at 15:00 in session PS T1.14

Auto-TLDR; Task-based Adversarial Focal Loss for Few-shot Meta-Learner

Adversarial robustness of machine learning has been widely studied in recent years, and a series of effective methods are proposed to resist adversarial attacks. However, less attention is paid to few-shot meta-learners which are much more vulnerable due to the lack of training samples. In this paper, we propose Task-based Adversarial Focal Loss (TAFL) to handle this tough challenge on a typical meta-learner called MAML. More concretely, we regard few-shot classification tasks as normal samples in learning models and apply focal loss mechanism on them. Our proposed method focuses more on adversarially fragile tasks, leading to improvement on overall model robustness. Results of extensive experiments on several benchmarks demonstrate that TAFL can effectively promote the performance of the meta-learner on adversarial examples with elaborately designed perturbations.

#### Zoom-CAM: Generating Fine-Grained Pixel Annotations from Image Labels

Xiangwei Shi, Seyran Khademi, Yunqiang Li, Jan Van Gemert
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Fri 15 Jan 2021 at 16:00 in session PS T1.16

Auto-TLDR; Zoom-CAM for Weakly Supervised Object Localization and Segmentation

Current weakly supervised object localization and segmentation rely on class-discriminative visualization techniques for convolutional neural networks (CNN) to generate pseudo-labels for pixel-level training. However, visualization methods, including CAM and Grad-CAM, focus on most discriminative object parts summarized in the last convolutional layer, missing the complete pixel mapping in intermediate layers. We propose Zoom-CAM: going beyond the last lowest resolution layer by integrating the importance maps over all activations in intermediate layers. Zoom-CAM captures fine-grained small-scale objects for various discriminative class instances, which are commonly missed by the baseline visualization methods. We focus on generating pixel-level pseudo-labels from class labels. The quality of our pseudo-labels evaluated on the ImageNet localization task exhibits more than 2.8% improvement on top-1 error. For weakly supervised semantic segmentation our generated pseudo-labels improve a state of the art model by 1.1%.

#### Adaptive Noise Injection for Training Stochastic Student Networks from Deterministic Teachers

Yi Xiang Marcus Tan, Yuval Elovici, Alexander Binder
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Tue 12 Jan 2021 at 14:00 in session OS T1.1

Auto-TLDR; Adaptive Stochastic Networks for Adversarial Attacks

Adversarial attacks have been a prevalent problem causing misclassification in machine learning models, with stochasticity being a promising direction towards greater robustness. However, stochastic networks frequently underperform compared to deterministic deep networks. In this work, we present a conceptually clear adaptive noise injection mechanism in combination with teacher-initialisation, which adjusts its degree of randomness dynamically through the computation of mini-batch statistics. This mechanism is embedded within a simple framework to obtain stochastic networks from existing deterministic networks. Our experiments show that our method is able to outperform prior baselines under white-box settings, exemplified through CIFAR-10 and CIFAR-100. Following which, we perform in-depth analysis on varying different components of training with our approach on the effects of robustness and accuracy, through the study of the evolution of decision boundary and trend curves of clean accuracy/attack success over differing degrees of stochasticity. We also shed light on the effects of adversarial training on a pre-trained network, through the lens of decision boundaries.

#### Tiny Object Detection in Aerial Images

Jinwang Wang, Wen Yang, Haowen Guo, Ruixiang Zhang, Gui-Song Xia
Track 3: Computer Vision Robotics and Intelligent Systems
Fri 15 Jan 2021 at 13:00 in session OS T3.5

Auto-TLDR; Tiny Object Detection in Aerial Images Using Multiple Center Points Based Learning Network

Object detection in Earth Vision has achieved great progress in recent years. However, tiny object detection in aerial images remains a very challenging problem since the tiny objects contain a small number of pixels and are easily confused with the background. To advance tiny object detection research in aerial images, we present a new dataset for Tiny Object Detection in Aerial Images (AI-TOD). Specifically, AI-TOD comes with 700,621 object instances for eight categories across 28,036 aerial images. Compared to existing object detection datasets in aerial images, the mean size of objects in AI-TOD is about 12.8 pixels, which is much smaller than others. To build a benchmark for tiny object detection in aerial images, we evaluate the state-of-the-art object detectors on our AI-TOD dataset. Experimental results show that direct application of these approaches on AI-TOD produces suboptimal object detection results, thus new specialized detectors for tiny object detection need to be designed. Therefore, we propose a multiple center points based learning network (M-CenterNet) to improve the localization performance of tiny object detection, and experimental results show the significant performance gain over the competitors.

#### Detecting Objects with High Object Region Percentage

Fen Fang, Qianli Xu, Liyuan Li, Ying Gu, Joo-Hwee Lim
Track 3: Computer Vision Robotics and Intelligent Systems
Thu 14 Jan 2021 at 14:00 in session PS T3.8

Auto-TLDR; Faster R-CNN for High-ORP Object Detection

Object shape is a subtle but important factor for object detection. It has been observed that the object-region-percentage (ORP) can be utilized to improve detection accuracy for elongated objects, which have much lower ORPs than other types of objects. In this paper, we propose an approach to improve the detection performance for objects whose ORPs are relatively higher.To address the problem of high-ORP object detection, we propose a method consisting of three steps. First, we adjust the ground truth bounding boxes of high-ORP objects to an optimal range. Second, we train an object detector, Faster R-CNN, based on adjusted bounding boxes to achieve high recall. Finally, we train a DCNN to learn the adjustment ratios towards four directions and adjust detected bounding boxes of objects to get better localization for higher precision. We evaluate the effectiveness of our method on 12 high-ORP objects in COCO and 8 objects in a proprietary gearbox dataset. The experimental results show that our method can achieve state-of-the-art performance on these objects while costing less resources in training and inference stages.

#### EAGLE: Large-Scale Vehicle Detection Dataset in Real-World Scenarios Using Aerial Imagery

Seyed Majid Azimi, Reza Bahmanyar, Corentin Henry, Kurz Franz
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Fri 15 Jan 2021 at 15:00 in session OS T1.8

Auto-TLDR; EAGLE: A Large-Scale Dataset for Multi-class Vehicle Detection with Object Orientation Information in Airborne Imagery

Multi-class vehicle detection from airborne imagery with orientation estimation is an important task in the near and remote vision domains with applications in traffic monitoring and disaster management. In the last decade, we have witnessed significant progress in object detection in ground imagery, but it is still in its infancy in airborne imagery, mostly due to the scarcity of diverse and large-scale datasets. Despite being a useful tool for different applications, current airborne datasets only partially reflect the challenges of real-world scenarios. To address this issue, we introduce EAGLE (oriEnted object detection using Aerial imaGery in real-worLd scEnarios), a large-scale dataset for multi-class vehicle detection with object orientation information in aerial imagery. It features high-resolution aerial images composed of different real-world situations with a wide variety of camera sensor, resolution, flight altitude, weather, illumination, haze, shadow, time, city, country, occlusion, and camera angle. The annotation was done by airborne imagery experts with small- and large-vehicle classes. EAGLE contains 215,986 instances annotated with oriented bounding boxes defined by four points and orientation, making it by far the largest dataset to date in this task. It also supports researches on the haze and shadow removal as well as super-resolution and in-painting applications. We define three tasks: detection by (1) horizontal bounding boxes, (2) rotated bounding boxes, and (3) oriented bounding boxes. We carried out several experiments to evaluate several state-of-the-art methods in object detection on our dataset to form a baseline. Experiments show that the EAGLE dataset accurately reflects real-world situations and correspondingly challenging applications. The dataset will be made publicly available.

#### Defense Mechanism against Adversarial Attacks Using Density-Based Representation of Images

Yen-Ting Huang, Wen-Hung Liao, Chen-Wei Huang
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Wed 13 Jan 2021 at 12:00 in session PS T1.3

Auto-TLDR; Adversarial Attacks Reduction Using Input Recharacterization

Adversarial examples are slightly modified inputs devised to cause erroneous inference of deep learning models. Protection against the intervention of adversarial examples is a fundamental issue that needs to be addressed before the wide adoption of deep-learning based intelligent systems. In this research, we utilize the method known as input recharacterization to effectively eliminate the perturbations found in the adversarial examples. By converting images from the intensity domain into density-based representation using halftoning operation, performance of the classifier can be properly maintained. With adversarial attacks generated using FGSM, I-FGSM, and PGD, the top-5 accuracy of the hybrid model can still achieve 80.97%, 78.77%, 81.56%, respectively. Although the accuracy has been slightly affected, the influence of adversarial examples is significantly discounted. The average improvement over existing input transform defense mechanisms is approximately 10%.

#### Variational Inference with Latent Space Quantization for Adversarial Resilience

Vinay Kyatham, Deepak Mishra, Prathosh A.P.
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Thu 14 Jan 2021 at 12:00 in session PS T1.9

Auto-TLDR; A Generalized Defense Mechanism for Adversarial Attacks on Data Manifolds

Despite their tremendous success in modelling highdimensional data manifolds, deep neural networks suffer from the threat of adversarial attacks - Existence of perceptually valid input-like samples obtained through careful perturbation that lead to degradation in the performance of the underlying model. Major concerns with existing defense mechanisms include non-generalizability across different attacks, models and large inference time. In this paper, we propose a generalized defense mechanism capitalizing on the expressive power of regularized latent space based generative models. We design an adversarial filter, devoid of access to classifier and adversaries, which makes it usable in tandem with any classifier. The basic idea is to learn a Lipschitz constrained mapping from the data manifold, incorporating adversarial perturbations, to a quantized latent space and re-map it to the true data manifold. Specifically, we simultaneously auto-encode the data manifold and its perturbations implicitly through the perturbations of the regularized and quantized generative latent space, realized using variational inference. We demonstrate the efficacy of the proposed formulation in providing resilience against multiple attack types (black and white box) and methods, while being almost real-time. Our experiments show that the proposed method surpasses the stateof-the-art techniques in several cases.

#### Small Object Detection Leveraging on Simultaneous Super-Resolution

Hong Ji, Zhi Gao, Xiaodong Liu, Tiancan Mei
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Fri 15 Jan 2021 at 15:00 in session PS T1.13

Auto-TLDR; Super-Resolution via Generative Adversarial Network for Small Object Detection

Despite the impressive advancement achieved in object detection, the detection performance of small object is still far from satisfactory due to the lack of sufficient detailed appearance to distinguish it from similar objects. Inspired by the positive effects of super-resolution for object detection, we propose a general framework that can be incorporated with most available detector networks to significantly improve the performance of small object detection, in which the low-resolution image is super-resolved via generative adversarial network (GAN) in an unsupervised manner. In our method, the super-resolution network and the detection network are trained jointly and alternately with each other fixed. In particular, the detection loss is back-propagated into the super-resolution network during training to facilitate detection. Compared with available simultaneous super-resolution and detection methods which heavily rely on low-/high-resolution image pairs, our work breaks through such restriction via applying the CycleGAN strategy, achieving increased generality and applicability, while remaining an elegant structure. Extensive experiments on datasets from both computer vision and remote sensing communities demonstrate that our method works effectively on a wide range of complex scenarios, resulting in best performance that significantly outperforms many state-of-the-art approaches.

#### Bidirectional Matrix Feature Pyramid Network for Object Detection

Wei Xu, Yi Gan, Jianbo Su
Track 3: Computer Vision Robotics and Intelligent Systems
Fri 15 Jan 2021 at 15:00 in session PS T3.10

Auto-TLDR; BMFPN: Bidirectional Matrix Feature Pyramid Network for Object Detection

Feature pyramids are widely used to improve scale invariance for object detection. Most methods just map the objects to feature maps with relevant square receptive fields, but rarely pay attention to the aspect ratio variation, which is also an important property of object instances. It will lead to a poor match between rectangular objects and assigned features with square receptive fields, thus preventing from accurate recognition and location. Besides, the information propagation among feature layers is sparse, namely, each feature in the pyramid may mainly or only contain single-level information, which is not representative enough for classification and localization sub-tasks. In this paper, Bidirectional Matrix Feature Pyramid Network (BMFPN) is proposed to address these issues. It consists of three modules: Diagonal Layer Generation Module (DLGM), Top-down Module (TDM) and Bottom-up Module (BUM). First, multi-level features extracted by backbone are fed into DLGM to produce the base features. Then these base features are utilized to construct the final feature pyramid through TDM and BUM in series. The receptive fields of the designed feature layers in BMFPN have various scales and aspect ratios. Objects can be correctly assigned to appropriate and representative feature maps with relevant receptive fields depending on its scale and aspect ratio properties. Moreover, TDM and BUM form bidirectional and reticular information flow, which effectively fuses multi level information in top-down and bottom-up manner respectively. To evaluate the effectiveness of our proposed architecture, an end-toend anchor-free detector is designed and trained by integrating BMFPN into FCOS. And the center ness branch in FCOS is modified with our Gaussian center-ness branch (GCB), which brings another slight improvement. Without bells and whistles, our method gains +3.3%, +2.4% and +2.6% AP on MS COCO dataset from baselines with ResNet-50, ResNet-101 and ResNeXt-101 backbones, respectively.

#### Utilising Visual Attention Cues for Vehicle Detection and Tracking

Feiyan Hu, Venkatesh Gurram Munirathnam, Noel E O'Connor, Alan Smeaton, Suzanne Little
Track 3: Computer Vision Robotics and Intelligent Systems
Fri 15 Jan 2021 at 16:00 in session PS T3.11

Auto-TLDR; Visual Attention for Object Detection and Tracking in Driver-Assistance Systems

Advanced Driver-Assistance Systems (ADAS) have been attracting attention from many researchers. Vision based sensors are the closest way to emulate human driver visual behavior while driving. In this paper, we explore possible ways to use visual attention (saliency) for object detection and tracking. We investigate: 1) How a visual attention map such as a subjectness attention or saliency map and an objectness attention map can facilitate region proposal generation in a 2-stage object detector; 2) How a visual attention map can be used for tracking multiple objects. We propose a neural network that can simultaneously detect objects as and generate objectness and subjectness maps to save computational power. We further exploit the visual attention map during tracking using a sequential Monte Carlo probability hypothesis density (PHD) filter. The experiments are conducted on KITTI and DETRAC datasets. The use of visual attention and hierarchical features has shown a considerable improvement of≈8% in object detection which effectively increased tracking performance by≈4% on KITTI dataset.

#### FeatureNMS: Non-Maximum Suppression by Learning Feature Embeddings

Niels Ole Salscheider
Track 3: Computer Vision Robotics and Intelligent Systems
Wed 13 Jan 2021 at 16:30 in session PS T3.5

Auto-TLDR; FeatureNMS: Non-Maximum Suppression for Multiple Object Detection

Most state of the art object detectors output multiple detections per object. The duplicates are removed in a post-processing step called Non-Maximum Suppression. Classical Non-Maximum Suppression has shortcomings in scenes that contain objects with high overlap: The idea of this heuristic is that a high bounding box overlap corresponds to a high probability of having a duplicate. We propose FeatureNMS to solve this problem. FeatureNMS recognizes duplicates not only based on the intersection over union between bounding boxes, but also based on the difference of feature vectors. These feature vectors can encode more information like visual appearance. Our approach outperforms classical NMS and derived approaches and achieves state of the art performance.

#### Small Object Detection by Generative and Discriminative Learning

Yi Gu, Jie Li, Chentao Wu, Weijia Jia, Jianping Chen
Track 3: Computer Vision Robotics and Intelligent Systems
Wed 13 Jan 2021 at 12:00 in session PS T3.4

Auto-TLDR; Generative and Discriminative Learning for Small Object Detection

With the development of deep convolutional neural networks (CNNs), the object detection accuracy has been greatly improved. But the performance of small object detection is still far from satisfactory, mainly because small objects are so tiny that the information contained in the feature map is limited. Existing methods focus on improving classification accuracy but still suffer from the limitation of bounding box prediction. To solve this issue, we propose a detection framework by generative and discriminative learning. First, a reconstruction generator network is designed to reconstruct the mapping from low frequency to high frequency for anchor box prediction. Then, a detector module extracts the regions of interest (ROIs) from generated results and implements a RoI-Head to predict object category and refine bounding box. In order to guide the reconstructed image related to the corresponding one, a discriminator module is adopted to tell from the generated result and the original image. Extensive evaluations on the challenging MS-COCO dataset demonstrate that our model outperforms most state-of-the-art models in detecting small objects, especially the reconstruction module improves the average precision for small object (APs) by 7.7%.

#### Future Urban Scenes Generation through Vehicles Synthesis

Alessandro Simoni, Luca Bergamini, Andrea Palazzi, Simone Calderara, Rita Cucchiara
Track 3: Computer Vision Robotics and Intelligent Systems
Fri 15 Jan 2021 at 16:00 in session PS T3.11

Auto-TLDR; Predicting the Future of an Urban Scene with a Novel View Synthesis Paradigm

In this work we propose a deep learning pipeline to predict the visual future appearance of an urban scene. Despite recent advances, generating the entire scene in an end-to-end fashion is still far from being achieved. Instead, here we follow a two stages approach, where interpretable information is included in the loop and each actor is modelled independently. We leverage a per-object novel view synthesis paradigm; i.e. generating a synthetic representation of an object undergoing a geometrical roto-translation in the 3D space. Our model can be easily conditioned with constraints (e.g. input trajectories) provided by state-of-the-art tracking methods or by the user itself. This allows us to generate a set of diverse realistic futures starting from the same input in a multi-modal fashion. We visually and quantitatively show the superiority of this approach over traditional end-to-end scene-generation methods on CityFlow, a challenging real world dataset.

#### Beyond Cross-Entropy: Learning Highly Separable Feature Distributions for Robust and Accurate Classification

Arslan Ali, Andrea Migliorati, Tiziano Bianchi, Enrico Magli
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Thu 14 Jan 2021 at 12:00 in session PS T1.9

Auto-TLDR; Gaussian class-conditional simplex loss for adversarial robust multiclass classifiers

Deep learning has shown outstanding performance in several applications including image classification. However, deep classifiers are known to be highly vulnerable to adversarial attacks, in that a minor perturbation of the input can easily lead to an error. Providing robustness to adversarial attacks is a very challenging task especially in problems involving a large number of classes, as it typically comes at the expense of an accuracy decrease. In this work, we propose the Gaussian class-conditional simplex (GCCS) loss: a novel approach for training deep robust multiclass classifiers that provides adversarial robustness while at the same time achieving or even surpassing the classification accuracy of state-of-the-art methods. Differently from other frameworks, the proposed method learns a mapping of the input classes onto target distributions in a latent space such that the classes are linearly separable. Instead of maximizing the likelihood of target labels for individual samples, our objective function pushes the network to produce feature distributions yielding high inter-class separation. The mean values of the distributions are centered on the vertices of a simplex such that each class is at the same distance from every other class. We show that the regularization of the latent space based on our approach yields excellent classification accuracy and inherently provides robustness to multiple adversarial attacks, both targeted and untargeted, outperforming state-of-the-art approaches over challenging datasets.

#### Foreground-Focused Domain Adaption for Object Detection

Yuchen Yang, Nilanjan Ray
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Fri 15 Jan 2021 at 15:00 in session OS T1.8

Auto-TLDR; Unsupervised Domain Adaptation for Unsupervised Object Detection

Object detectors suffer from accuracy loss caused by domain shift from a source to a target domain. Unsupervised domain adaptation (UDA) approaches mitigate this loss by training with unlabeled target domain images. A popular processing pipeline applies adversarial training that aligns the distributions of the features from the two domains. We advocate that aligning the full image level features is not ideal for UDA object detection due to the presence of varied background areas during inference. Thus, we propose a novel foreground-focused domain adaptation (FFDA) framework which mines the loss of the domain discriminators to concentrate on the backpropagation of foreground loss. We obtain mining masks by collecting target predictions and source labels to outline foreground regions, and apply the masks to image and instance level domain discriminators to allow backpropagation only on the mined regions. By reinforcing this foreground-focused adaptation throughout multiple layers in the detector model, we gain a significant accuracy boost on the target domain prediction. Compared to previous works, our method reaches the new state-of-the-art accuracy on adapting Cityscape to Foggy Cityscape dataset and demonstrates competitive accuracy on other datasets that include various scenarios for autonomous driving applications.

#### Explain2Attack: Text Adversarial Attacks via Cross-Domain Interpretability

Mahmoud Hossam, Le Trung, He Zhao, Dinh Phung
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Wed 13 Jan 2021 at 12:00 in session PS T1.4

Auto-TLDR; Transfer2Attack: A Black-box Adversarial Attack on Text Classification

Training robust deep learning models is a critical challenge for downstream tasks. Research has shown that common down-stream models can be easily fooled with adversarial inputs that look like the training data, but slightly perturbed, in a way imperceptible to humans. Understanding the behavior of natural language models under these attacks is crucial to better defend these models against such attacks. In the black-box attack setting, where no access to model parameters is available, the attacker can only query the output information from the targeted model to craft a successful attack. Current black-box state-of-the-art models are costly in both computational complexity and number of queries needed to craft successful adversarial examples. For real world scenarios, the number of queries is critical, where less queries are desired to avoid suspicion towards an attacking agent. In this paper, we propose Transfer2Attack, a black-box adversarial attack on text classification task, that employs cross-domain interpretability to reduce target model queries during attack. We show that our framework either achieves or out-performs attack rates of the state-of-the-art models, yet with lower queries cost and higher efficiency.

#### Multi-View Object Detection Using Epipolar Constraints within Cluttered X-Ray Security Imagery

Brian Kostadinov Shalon Isaac-Medina, Chris G. Willcocks, Toby Breckon
Track 3: Computer Vision Robotics and Intelligent Systems
Fri 15 Jan 2021 at 16:00 in session PS T3.11

Auto-TLDR; Exploiting Epipolar Constraints for Multi-View Object Detection in X-ray Security Images

Automatic detection for threat object items is an increasing emerging area of future application in X-ray security imagery. Although modern X-ray security scanners can provide two or more views, the integration of such object detectors across the views has not been widely explored with rigour. Therefore, we investigate the application of geometric constraints using the epipolar nature of multi-view imagery to improve object detection performance. Furthermore, we assume that images come from uncalibrated views, such that a method to estimate the fundamental matrix using ground truth bounding box centroids from multiple view object detection labels is proposed. In addition, detections are given a score based on its similarity with respect to the distribution of the error of the epipolar estimation. This score is used as confidence weights for merging duplicated predictions using non-maximum suppression. Using a standard object detector (YOLOv3), our technique increases the average precision of detection by 2.8% on a dataset composed of firearms, laptops, knives and cameras. These results indicate that the integration of images at different views significantly improves the detection performance of threat items of cluttered X-ray security images.

#### Effective Deployment of CNNs for 3DoF Pose Estimation and Grasping in Industrial Settings

Daniele De Gregorio, Riccardo Zanella, Gianluca Palli, Luigi Di Stefano
Track 3: Computer Vision Robotics and Intelligent Systems
Tue 12 Jan 2021 at 17:00 in session PS T3.3

Auto-TLDR; Automated Deep Learning for Robotic Grasping Applications

In this paper we investigate how to effectively deploy deep learning in practical industrial settings, such as robotic grasping applications. When a deep-learning based solution is proposed, usually lacks of any simple method to generate the training data. In the industrial field, where automation is the main goal, not bridging this gap is one of the main reasons why deep learning is not as widespread as it is in the academic world. For this reason, in this work we developed a system composed by a 3-DoF Pose Estimator based on Convolutional Neural Networks (CNNs) and an effective procedure to gather massive amounts of training images in the field with minimal human intervention. By automating the labeling stage, we also obtain very robust systems suitable for production-level usage. An open source implementation of our solution is provided, alongside with the dataset used for the experimental evaluation.

#### Attack Agnostic Adversarial Defense via Visual Imperceptible Bound

Saheb Chhabra, Akshay Agarwal, Richa Singh, Mayank Vatsa
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Wed 13 Jan 2021 at 12:00 in session PS T1.4

Auto-TLDR; Robust Adversarial Defense with Visual Imperceptible Bound

High susceptibility of deep learning algorithms against structured and unstructured perturbations has motivated the development of efficient adversarial defense algorithms. However, the lack of generalizability of existing defense algorithms and the high variability in the performance of the attack algorithms for different databases raises several questions on the effectiveness of the defense algorithms. In this research, we aim to design a defense model that is robust within the certain bound against both seen and unseen adversarial attacks. This bound is related to the visual appearance of an image and we termed it as \textit{Visual Imperceptible Bound (VIB)}. To compute this bound, we propose a novel method that uses the database characteristics. The VIB is further used to compute the effectiveness of attack algorithms. In order to design a defense model, we propose a defense algorithm which makes the model robust within the VIB against both seen and unseen attacks. The performance of the proposed defense algorithm and the method to compute VIB are evaluated on MNIST, CIFAR-10, and Tiny ImageNet databases on multiple attacks including C\&W ($l_2$) and DeepFool. The proposed defense algorithm is not only able to increase the robustness against several attacks but also retain or improve the classification accuracy on an original clean test set. Experimentally, it is demonstrated that the proposed defense is better than existing strong defense algorithms based on adversarial retraining. We have additionally performed the PGD attack in white box settings and compared the results with the existing algorithms. The proposed defense is independent of the target model and adversarial attacks, and therefore can be utilized against any attack.

#### A Delayed Elastic-Net Approach for Performing Adversarial Attacks

Brais Cancela, Veronica Bolon-Canedo, Amparo Alonso-Betanzos
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Wed 13 Jan 2021 at 14:00 in session PS T1.5

Auto-TLDR; Robustness of ImageNet Pretrained Models against Adversarial Attacks

With the rise of the so-called Adversarial Attacks, there is an increased concern on model security. In this paper we present two different contributions: novel measures of robustness (based on adversarial attacks) and a novel adversarial attack. The key idea behind these metrics is to obtain a measure that could compare different architectures, with independence of how the input is preprocessed (robustness against different input sizes and value ranges). To do so, a novel adversarial attack is presented, performing a delayed elastic-net adversarial attack (constraints are only used whenever a successful adversarial attack is obtained). Experimental results show that our approach obtains state-of-the-art adversarial samples, in terms of minimal perturbation distance. Finally, a benchmark of ImageNet pretrained models is used to conduct experiments aiming to shed some light about which model should be selected whenever security is a role factor.

#### Multiple-Step Sampling for Dense Object Detection and Counting

Zhaoli Deng, Yang Chenhui
Track 3: Computer Vision Robotics and Intelligent Systems
Wed 13 Jan 2021 at 12:00 in session PS T3.4

Auto-TLDR; Multiple-Step Sampling for Dense Objects Detection

A multitude of similar or even identical objects are positioned closely in dense scenes, which brings about difficulties in object-detecting and object-counting. Since the poor performance of Faster R-CNN, recent works prefer to detect dense objects with the utilization of multi-layer feature maps. Nevertheless, they require complex post-processing to minimize overlap between adjacent bounding boxes, which reduce their detection speed. However, we find that such a multilayer prediction is not necessary. It is observed that there exists a waste of ground-truth boxes during sampling, causing the lack of positive samples and the final failure of Faster R-CNN training. Motivated by this observation we propose a multiple-step sampling method for anchor sampling. Our method reduces the waste of ground-truth boxes in three steps according to different rules. Besides, we balance the positive and negative samples, and samples at different quality. Our method improves base detector (Faster R-CNN), the detection tests on SKU-110K and CARPK benchmarks indicate that our approach offers a good trade-off between accuracy and speed.

#### Construction Worker Hardhat-Wearing Detection Based on an Improved BiFPN

Chenyang Zhang, Zhiqiang Tian, Jingyi Song, Yaoyue Zheng, Bo Xu
Track 3: Computer Vision Robotics and Intelligent Systems
Tue 12 Jan 2021 at 15:00 in session PS T3.1

Auto-TLDR; A One-Stage Object Detection Method for Hardhat-Wearing in Construction Site

Work in the construction site is considered to be one of the occupations with the highest safety risk factor. Therefore, safety plays an important role in construction site. One of the most fundamental safety rules in construction site is to wear a hardhat. To strengthen the safety of the construction site, most of the current methods use multi-stage method for hardhat-wearing detection. These methods have limitations in terms of adaptability and generalizability. In this paper, we propose a one-stage object detection method based on convolutional neural network. We present a multi-scale strategy that selects the high-resolution feature maps of DarkNet-53 to effectively identify small-scale hardhats. In addition, we propose an improved weighted bi-directional feature pyramid network (BiFPN), which could fuse more semantic features from more scales. The proposed method can not only detect hardhat-wearing, but also identify the color of the hardhat. Experimental results show that the proposed method achieves a mAP of 87.04%, which outperforms several state-of-the-art methods on a public dataset.

#### Accuracy-Perturbation Curves for Evaluation of Adversarial Attack and Defence Methods

Jaka Šircelj, Danijel Skocaj
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Fri 15 Jan 2021 at 16:00 in session PS T1.15

Auto-TLDR; Accuracy-perturbation Curve for Robustness Evaluation of Adversarial Examples

With more research published on adversarial examples, we face a growing need for strong and insightful methods for evaluating the robustness of machine learning solutions against their adversarial threats. Previous work contains problematic and overly simplified evaluation methods, where different methods for generating adversarial examples are compared, even though they produce adversarial examples of differing perturbation magnitudes. This creates a biased evaluation environment, as higher perturbations yield naturally stronger adversarial examples. We propose a novel "accuracy-perturbation curve" that visualizes a classifiers classification accuracy response to adversarial examples of different perturbations. To demonstrate the utility of the curve we perform evaluation of responses of different image classifier architectures to four popular adversarial example methods. We also show how adversarial training improves the robustness of a classifier using the "accuracy-perturbation curve".

#### A Modified Single-Shot Multibox Detector for Beyond Real-Time Object Detection

Georgios Orfanidis, Konstantinos Ioannidis, Stefanos Vrochidis, Anastasios Tefas, Ioannis Kompatsiaris
Track 3: Computer Vision Robotics and Intelligent Systems
Tue 12 Jan 2021 at 17:00 in session PS T3.3

Auto-TLDR; Single Shot Detector in Resource-Restricted Systems with Lighter SSD Variations

This works focuses on examining the performance of the Single Shot Detector (SSD) model in resource restricted systems where maintaining the power of the full model comprises a significant prerequisite. The proposed SSD variations examine the behavior of lighter versions of SSD while propose measures to limit the unavoidable performance shortage. The outcomes of the conducted research demonstrate a remarkable trade-off between performance losses, speed improvement and the required resource reservation. Thus, the experimental results evidence the efficiency of the presented SSD alterations towards accomplishing higher frame rates and retaining the performance of the original model.

#### Towards Explaining Adversarial Examples Phenomenon in Artificial Neural Networks

Ramin Barati, Reza Safabakhsh, Mohammad Rahmati
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Fri 15 Jan 2021 at 16:00 in session PS T1.15

Auto-TLDR; Convolutional Neural Networks and Adversarial Training from the Perspective of convergence

In this paper, we study the adversarial examples existence and adversarial training from the standpoint of convergence and provide evidence that pointwise convergence in ANNs can explain these observations. The main contribution of our proposal is that it relates the objective of the evasion attacks and adversarial training with concepts already defined in learning theory. Also, we extend and unify some of the other proposals in the literature and provide alternative explanations on the observations made in those proposals. Through different experiments, we demonstrate that the framework is valuable in the study of the phenomenon and is applicable to real-world problems.

#### Detective: An Attentive Recurrent Model for Sparse Object Detection

Amine Kechaou, Manuel Martinez, Monica Haurilet, Rainer Stiefelhagen
Track 3: Computer Vision Robotics and Intelligent Systems
Fri 15 Jan 2021 at 16:00 in session PS T3.11

Auto-TLDR; Detective: An attentive object detector that identifies objects in images in a sequential manner

In this work, we present Detective – an attentive object detector that identifies objects in images in a sequential manner. Our network is based on an encoder-decoder architecture, where the encoder is a convolutional neural network, and the decoder is a convolutional recurrent neural network coupled with an attention mechanism. At each iteration, our decoder focuses on the relevant parts of the image using an attention mechanism, and then estimates the object’s class and the bounding box coordinates. Current object detection models generate dense predictions and rely on post-processing to remove duplicate predictions. Detective is a sparse object detector that generates a single bounding box per object instance. However, training a sparse object detector is challenging, as it requires the model to reason at the instance level and not just at the class and spatial levels. We propose a training mechanism based on the Hungarian Algorithm and a loss that balances the localization and classification tasks. This allows Detective to achieve promising results on the PASCAL VOC object detection dataset. Our experiments demonstrate that sparse object detection is possible and has a great potential for future developments in applications where the order of the objects to be predicted is of interest.

#### Attack-Agnostic Adversarial Detection on Medical Data Using Explainable Machine Learning

Matthew Watson, Noura Al Moubayed
Track 5: Image and Signal Processing
Tue 12 Jan 2021 at 17:00 in session PS T5.2

Auto-TLDR; Explainability-based Detection of Adversarial Samples on EHR and Chest X-Ray Data

Explainable machine learning has become increasingly prevalent, especially in healthcare where explainable models are vital for ethical and trusted automated decision making. Work on the susceptibility of deep learning models to adversarial attacks has shown the ease of designing samples to mislead a model into making incorrect predictions. In this work, we propose an explainability-based method for the accurate detection of adversarial samples on two datasets with different complexity and properties: Electronic Health Record (EHR) and chest X-ray (CXR) data. On the MIMIC-III and Henan-Renmin EHR datasets, we report a detection accuracy of 77% against the Longitudinal Adversarial Attack. On the MIMIC-CXR dataset, we achieve an accuracy of 88%; significantly improving on the state of the art of adversarial detection in both datasets by over 10% in all settings. We propose an anomaly detection based method using explainability techniques to detect adversarial samples which is able to generalise to different attack methods without a need for retraining.

#### SFPN: Semantic Feature Pyramid Network for Object Detection

Yi Gan, Wei Xu, Jianbo Su
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Fri 15 Jan 2021 at 15:00 in session PS T1.13

Auto-TLDR; SFPN: Semantic Feature Pyramid Network to Address Information Dilution Issue in FPN

Feature Pyramid Network(FPN) employs a top-down path to enhance low level feature by utilizing high level feature.However, further improvement of detector is greatly hindered by the inner defect of FPN. The dilution issue in FPN is analyzed in this paper, and a new architecture named Semantic Feature Pyramid Network(SFPN) is introduced to address the information imbalance problem caused by information dilution. The proposed method consists of two simple and effective components: Semantic Pyramid Module(SPM) and Semantic Feature Fusion Module(SFFM). To compensate for the weaknesses of FPN, the semantic segmentation result is utilized as an extra information source in our architecture.By constructing a semantic pyramid based on the segmentation result and fusing it with FPN, feature maps at each level can obtain the necessary information without suffering from the dilution issue. The proposed architecture could be applied on many detectors, and non-negligible improvement could be achieved. Although this method is designed for object detection, other tasks such as instance segmentation can also largely benefit from it. The proposed method brings Faster R-CNN and Mask R-CNN with ResNet-50 as backbone both 1.8 AP improvements respectively. Furthermore, SFPN improves Cascade R-CNN with backbone ResNet-101 from 42.4 AP to 43.5 AP.

#### Iterative Bounding Box Annotation for Object Detection

Bishwo Adhikari, Heikki Juhani Huttunen
Track 3: Computer Vision Robotics and Intelligent Systems
Wed 13 Jan 2021 at 16:30 in session PS T3.6

Auto-TLDR; Semi-Automatic Bounding Box Annotation for Object Detection in Digital Images

Manual annotation of bounding boxes for object detection in digital images is tedious, and time and resource consuming. In this paper, we propose a semi-automatic method for efficient bounding box annotation. The method trains the object detector iteratively on small batches of labeled images and learns to propose bounding boxes for the next batch, after which the human annotator only needs to correct possible errors. We propose an experimental setup for simulating the human actions and use it for comparing different iteration strategies, such as the order in which the data is presented to the annotator. We experiment on our method with three datasets and show that it can reduce the human annotation effort significantly, saving up to 75% of total manual annotation work.

#### Triplet-Path Dilated Network for Detection and Segmentation of General Pathological Images

Jiaqi Luo, Zhicheng Zhao, Fei Su, Limei Guo
Track 5: Image and Signal Processing
Fri 15 Jan 2021 at 13:00 in session OS T5.5

Auto-TLDR; Triplet-path Network for One-Stage Object Detection and Segmentation in Pathological Images

Deep learning has been widely applied in the field of medical image processing. However, compared with flourishing visual tasks in natural images, the progress achieved in pathological images is not remarkable, and detection and segmentation, which are among basic tasks of computer vision, are regarded as two independent tasks. In this paper, we make full use of existing datasets and construct a triplet-path network using dilated convolutions to cooperatively accomplish one-stage object detection and nuclei segmentation for general pathological images. First, in order to meet the requirement of detection and segmentation, a novel structure called triplet feature generation (TFG) is designed to extract high-resolution and multiscale features, where features from different layers can be properly integrated. Second, considering that pathological datasets are usually small, a location-aware and partially truncated loss function is proposed to improve the classification accuracy of datasets with few images and widely varying targets. We compare the performance of both object detection and instance segmentation with state-of-the-art methods. Experimental results demonstrate the effectiveness and efficiency of the proposed network on two datasets collected from multiple organs.

#### Hierarchical Head Design for Object Detectors

Shivang Agarwal, Frederic Jurie
Track 3: Computer Vision Robotics and Intelligent Systems
Wed 13 Jan 2021 at 16:30 in session PS T3.5

Auto-TLDR; Hierarchical Anchor for SSD Detector

The notion of anchor plays a major role in modern detection algorithms such as the Faster-RCNN or the SSD detector. Anchors relate the features of the last layers of the detector with bounding boxes containing objects in images. Despite their importance, the literature on object detection has not paid real attention to them. The motivation of this paper comes from the observations that (i) each anchor learns to classify and regress candidate objects independently (ii) insufficient examples are available for each anchor in case of small-scale datasets. This paper addresses these questions by proposing a novel hierarchical head for the SSD detector. The new design has the added advantage of no extra weights, as compared to the original design at inference time, while improving detectors performance for small size training sets. Improved performance on PASCAL-VOC and state-of-the-art performance on FlickrLogos-47 validate the method. We also show when the proposed design does not give additional performance gain over the original design.

#### Compression of YOLOv3 Via Block-Wise and Channel-Wise Pruning for Real-Time and Complicated Autonomous Driving Environment Sensing Applications

Jiaqi Li, Yanan Zhao, Li Gao, Feng Cui
Track 3: Computer Vision Robotics and Intelligent Systems
Fri 15 Jan 2021 at 15:00 in session PS T3.10

Auto-TLDR; Pruning YOLOv3 with Batch Normalization for Autonomous Driving

Nowadays, in the area of autonomous driving, the computational power of the object detectors is limited by the embedded devices and the public datasets for autonomous driving are over-idealistic. In this paper, we propose a pipeline combining both block-wise pruning and channel-wise pruning to compress the object detection model iteratively. We enforce the introduced factor of the residual blocks and the scale parameters in Batch Normalization (BN) layers to sparsity to select the less important residual blocks and channels. Moreover, a modified loss function has been proposed to remedy the class-imbalance problem. After removing the unimportant structures iteratively, we get the pruned YOLOv3 trained on our datasets which have more abundant and elaborate classes. Evaluated by our validation sets on the server, the pruned YOLOv3 saves 79.7% floating point operations (FLOPs), 93.8% parameter size, 93.8% model volume and 45.4% inference times with only 4.16% mean of average precision (mAP) loss. Evaluated on the embedded device, the pruned model operates about 13 frames per second with 4.53% mAP loss. These results show that the real-time property and accuracy of the pruned YOLOv3 can meet the needs of the embedded devices in complicated autonomous driving environments.

#### How Does DCNN Make Decisions?

Yi Lin, Namin Wang, Xiaoqing Ma, Ziwei Li, Gang Bai
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Wed 13 Jan 2021 at 12:00 in session PS T1.3

Auto-TLDR; Exploring Deep Convolutional Neural Network's Decision-Making Interpretability

Deep Convolutional Neural Networks (DCNN), despite imitating the human visual system, present no such decision credibility as human observers. This phenomenon, therefore, leads to the limitations of DCNN's applications in the security and trusted computing, such as self-driving cars and medical diagnosis. Focusing on this issue, our work aims to explore the way DCNN makes decisions. In this paper, the major contributions we made are: firstly, provide the hypothesis, “point-wise activation” of convolution function, according to the analysis of DCNN’s architectures and training process; secondly, point out the effect of “point-wise activation” on DCNN’s uninterpretable classification and pool robustness, and then suggest, in particular, the contradiction between the traditional and DCNN’s convolution kernel functions; finally, distinguish decision-making interpretability from semantic interpretability, and indicate that DCNN’s decision-making mechanism need to evolve towards the direction of semantics in the future. Besides, the “point-wise activation” hypothesis and conclusions proposed in our paper are supported by extensive experimental results.

#### Understanding Integrated Gradients with SmoothTaylor for Deep Neural Network Attribution

Gary Shing Wee Goh, Sebastian Lapuschkin, Leander Weber, Wojciech Samek, Alexander Binder
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Tue 12 Jan 2021 at 14:00 in session OS T1.1

Auto-TLDR; SmoothGrad: bridging Integrated Gradients and SmoothGrad from the Taylor's theorem perspective

Integrated Gradients as an attribution method for deep neural network models offers simple implementability. However, it suffers from noisiness of explanations which affects the ease of interpretability. The SmoothGrad technique is proposed to solve the noisiness issue and smoothen the attribution maps of any gradient-based attribution method. In this paper, we present SmoothTaylor as a novel theoretical concept bridging Integrated Gradients and SmoothGrad, from the Taylor's theorem perspective. We apply the methods to the image classification problem, using the ILSVRC2012 ImageNet object recognition dataset, and a couple of pretrained image models to generate attribution maps. These attribution maps are empirically evaluated using quantitative measures for sensitivity and noise level. We further propose adaptive noising to optimize for the noise scale hyperparameter value. From our experiments, we find that the SmoothTaylor approach together with adaptive noising is able to generate better quality saliency maps with lesser noise and higher sensitivity to the relevant points in the input space as compared to Integrated Gradients.

#### Killing Four Birds with One Gaussian Process: The Relation between Different Test-Time Attacks

Kathrin Grosse, Michael Thomas Smith, Michael Backes
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Wed 13 Jan 2021 at 14:00 in session PS T1.6

Auto-TLDR; Security of Gaussian Process Classifiers against Attack Algorithms

In machine learning (ML) security, attacks like evasion, model stealing or membership inference are generally studied in individually. Previous work has also shown a relationship between some attacks and decision function curvature of the targeted model. Consequently, we study an ML model allowing direct control over the decision surface curvature: Gaussian Process classifiers (GPCs). For evasion, we find that changing GPC's curvature to be robust against one attack algorithm boils down to enabling a different norm or attack algorithm to succeed. This is backed up by our formal analysis showing that static security guarantees are opposed to learning. Concerning intellectual property, we show formally that lazy learning does not necessarily leak all information when applied. In practice, often a seemingly secure curvature can be found. For example, we are able to secure GPC against empirical membership inference by proper configuration. In this configuration, however, the GPC's hyper-parameters are leaked, e.g. model reverse engineering succeeds. We conclude that attacks on classification should not be studied in isolation, but in relation to each other.

#### MagnifierNet: Learning Efficient Small-Scale Pedestrian Detector towards Multiple Dense Regions

Qi Cheng, Mingqin Chen, Yingjie Wu, Fei Chen, Shiping Lin
Track 3: Computer Vision Robotics and Intelligent Systems
Wed 13 Jan 2021 at 12:00 in session PS T3.4

Auto-TLDR; MagnifierNet: A Simple but Effective Small-Scale Pedestrian Detection Towards Multiple Dense Regions

Despite the success of pedestrian detection, there is still a significant gap in the performance of the detection of pedestrians at different scales. Detecting small-scale pedestrians is extremely challenging due to the low resolution of their convolution features which is essential for downstream classifiers. To address this issue, we observed pedestrian datasets and found that pedestrians often gather together in crowded public places. Then we propose MagnifierNet, a simple but effective small-scale pedestrian detector towards multiple dense regions. MagnifierNet uses our proposed sweep-line based grouping algorithm to find dense regions based on the number of pedestrians in the grouped region. And we adopt a new definition of small-scale pedestrians through grid search and KL-divergence. Besides, our grouping method can also be used as a new strategy for pedestrian data augmentation. The ablation study demonstrates that MagnifierNet improves the representation of small-scale pedestrians. We validate the effectiveness of MagnifierNet on CityPersons and KITTI datasets. Experimental results show that MagnifierNet achieves the best small-scale pedestrian detection performance on CityPersons benchmark without any external data, and also achieves competitive performance for detecting small-scale pedestrians on KITTI dataset without bells and whistles.

#### A Generalizable Saliency Map-Based Interpretation of Model Outcome

Shailja Thakur, Sebastian Fischmeister
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Wed 13 Jan 2021 at 14:00 in session PS T1.6

Auto-TLDR; Interpretability of Deep Neural Networks Using Salient Input and Output

One of the significant challenges of deep neural networks is that the complex nature of the network prevents human comprehension of the outcome of the network. Consequently, the applicability of complex machine learning models is limited in the safety-critical domains, which incurs risk to life and property. To fully exploit the capabilities of complex neural networks, we propose a non-intrusive interpretability technique that uses the input and output of the model to generate a saliency map. The method works by empirically optimizing a randomly initialized input mask by localizing and weighing individual pixels according to their sensitivity towards the target class. Our experiments show that the proposed model interpretability approach performs better than the existing saliency map-based approaches methods at localizing the relevant input pixels. Furthermore, to obtain a global perspective on the target-specific explanation, we propose a saliency map reconstruction approach to generate acceptable variations of the salient inputs from the space of input data distribution for which the model outcome remains unaltered. Experiments show that our interpretability method can reconstruct the salient part of the input with a classification accuracy of 89%.