NAS-EOD: An End-To-End Neural Architecture Search Method for Efficient Object Detection

Huigang Zhang, Liuan Wang, Jun Sun, Li Sun, Hiromichi Kobashi, Nobutaka Imamura

Responsive image

Auto-TLDR; NAS-EOD: Neural Architecture Search for Object Detection on Edge Devices

Slides

Model efficiency for object detection has become more and more important recently, especially when intelligent mobile devices are more and more convenient and developed today. Current small models for this task is either extended from the models for classification task, or pruned directly on the basis of large models. These pipelines are not task-specific or data-oriented so that their performance are not good enough for users. In this work, we propose a neural architecture search (NAS) method to build a detection model automatically that can perform well on edge devices. Specifically, the proposed method supports the search of not only multi-scale feature network, but also backbone network. This enables us to search out a global optimal model. To the best of our knowledge, it is a first attempt for searching an overall detection model via NAS. Additionally, we add latency information into the main objective during performance estimation, so that the search process can find a final model suitable for edge devices. Experiments on the PASCAL VOC benchmark indicate that the searched model (named NAS-EOD) can get good accuracy even without ImageNet pre-training. When using ImageNet pre-training, our model is superior to state-of-the-art small object detection models.

Similar papers

VPU Specific CNNs through Neural Architecture Search

Ciarán Donegan, Hamza Yous, Saksham Sinha, Jonathan Byrne

Responsive image

Auto-TLDR; Efficient Convolutional Neural Networks for Edge Devices using Neural Architecture Search

Slides Poster Similar

The success of deep learning at computer vision tasks has led to an ever-increasing number of applications on edge devices. Often with the use of edge AI hardware accelerators like the Intel Movidius Vision Processing Unit (VPU). Performing computer vision tasks on edge devices is challenging. Many Convolutional Neural Networks (CNNs) are too complex to run on edge devices with limited computing power. This has created large interest in designing efficient CNNs and one promising way of doing this is through Neural Architecture Search (NAS). NAS aims to automate the design of neural networks. NAS can also optimize multiple different objectives together, like accuracy and efficiency, which is difficult for humans. In this paper, we use a differentiable NAS method to find efficient CNNs for VPU that achieves state-of-the-art classification accuracy on ImageNet. Our NAS designed model outperforms MobileNetV2, having almost 1\% higher top-1 accuracy while being 13\% faster on MyriadX VPU. To the best of our knowledge, this is the first time a VPU specific CNN has been designed using a NAS algorithm. Our results also reiterate the fact that efficient networks must be designed for each specific hardware. We show that efficient networks targeted at different devices do not perform as well on the VPU.

E-DNAS: Differentiable Neural Architecture Search for Embedded Systems

Javier García López, Antonio Agudo, Francesc Moreno-Noguer

Responsive image

Auto-TLDR; E-DNAS: Differentiable Architecture Search for Light-Weight Networks for Image Classification

Slides Poster Similar

Designing optimal and light weight networks to fit in resource-limited platforms like mobiles, DSPs or GPUs is a challenging problem with a wide range of interesting applications, {\em e.g.} in embedded systems for autonomous driving. While most approaches are based on manual hyperparameter tuning, there exist a new line of research, the so-called NAS (Neural Architecture Search) methods, that aim to optimize several metrics during the design process, including memory requirements of the network, number of FLOPs, number of MACs (Multiply-ACcumulate operations) or inference latency. However, while NAS methods have shown very promising results, they are still significantly time and cost consuming. In this work we introduce E-DNAS, a differentiable architecture search method, which improves the efficiency of NAS methods in designing light-weight networks for the task of image classification. Concretely, E-DNAS computes, in a differentiable manner, the optimal size of a number of meta-kernels that capture patterns of the input data at different resolutions. We also leverage on the additive property of convolution operations to merge several kernels with different compatible sizes into a single one, reducing thus the number of operations and the time required to estimate the optimal configuration. We evaluate our approach on several datasets to perform classification. We report results in terms of the SoC (System on Chips) metric, typically used in the Texas Instruments TDA2x families for autonomous driving applications. The results show that our approach allows designing low latency architectures significantly faster than state-of-the-art.

EDD-Net: An Efficient Defect Detection Network

Tianyu Guo, Linlin Zhang, Runwei Ding, Ge Yang

Responsive image

Auto-TLDR; EfficientNet: Efficient Network for Mobile Phone Surface defect Detection

Slides Poster Similar

As the most commonly used communication tool, the mobile phone has become an indispensable part of our daily life. The surface of the mobile phone as the main window of human-phone interaction directly affects the user experience. It is necessary to detect surface defects on the production line in order to ensure the high quality of the mobile phone. However, the existing mobile phone surface defect detection is mainly done manually, and currently there are few automatic defect detection methods to replace human eyes. How to quickly and accurately detect the surface defects of mobile phone is an urgent problem to be solved. Hence, an efficient defect detection network (EDD-Net) is proposed. Firstly, EfficientNet is used as the backbone network. Then, according to the small-scale of mobile phone surface defects, a feature pyramid module named GCSA-BiFPN is proposed to obtain more discriminative features. Finally, the box/class prediction network is used to achieve effective defect detection. We also build a mobile phone surface oil stain defect (MPSOSD) dataset to alleviate the lack of dataset in this field. The performance on the relevant datasets shows that the network we proposed is effective and has practical significance for industrial production.

ScarfNet: Multi-Scale Features with Deeply Fused and Redistributed Semantics for Enhanced Object Detection

Jin Hyeok Yoo, Dongsuk Kum, Jun Won Choi

Responsive image

Auto-TLDR; Semantic Fusion of Multi-scale Feature Maps for Object Detection

Slides Poster Similar

Convolutional neural networks (CNNs) have led us to achieve significant progress in object detection research. To detect objects of various sizes, object detectors often exploit the hierarchy of the multiscale feature maps called {\it feature pyramids}, which are readily obtained by the CNN architecture. However, the performance of these object detectors is limited because the bottom-level feature maps, which experience fewer convolutional layers, lack the semantic information needed to capture the characteristics of the small objects. To address such problems, various methods have been proposed to increase the depth for the bottom-level features used for object detection. While most approaches are based on the generation of additional features through the top-down pathway with lateral connections, our approach directly fuses multi-scale feature maps using bidirectional long short-term memory (biLSTM) in an effort to leverage the gating functions and parameter-sharing in generating deeply fused semantics. The resulting semantic information is redistributed to the individual pyramidal feature at each scale through the channel-wise attention model. We integrate our semantic combining and attentive redistribution feature network (ScarfNet) with the baseline object detectors, i.e., Faster R-CNN, single-shot multibox detector (SSD), and RetinaNet. Experimental results show that our method offers a significant performance gain over the baseline detectors and outperforms the competing multiscale fusion methods in the PASCAL VOC and COCO detection benchmarks.

Operation and Topology Aware Fast Differentiable Architecture Search

Shahid Siddiqui, Christos Kyrkou, Theocharis Theocharides

Responsive image

Auto-TLDR; EDARTS: Efficient Differentiable Architecture Search with Efficient Optimization

Slides Poster Similar

Differentiable architecture search (DARTS) has gained significant attention amongst neural architecture search approaches due to its effectiveness in finding competitive network architectures with reasonable computational complexity. DARTS' search space however is designed such that even a randomly picked architecture is very competitive and due to the complexity of search architectural building block or cell, it is unclear whether these are certain operations or the cell topology that contributes most to achieving higher final accuracy. In this work, we dissect the DARTS's search space as to understand which components are most effective in producing better architectures. Our experiments show that: (1) Good architectures can be found regardless of the search network depth; (2) Seperable convolution is the most effective operation in the search space; and (3) The cell topology also has substantial effect on the accuracy. Based on these insights, we propose an efficient search approach based referred to as eDARTS, that searches on a pre-specified cell with a good topology with increased attention to important operations using a shallow supernet. Moreover, we propose some optimizations for eDARTS which significantly speed up the search as well as alleviate the well known skip connection aggregation problem of DARTS. eDARTS achieves an error rate of 2.53% on CIFAR-10 using a 3.1M parameters model; while the search cost is less than 30 minutes.

OCT Image Segmentation Using NeuralArchitecture Search and SRGAN

Saba Heidari, Omid Dehzangi, Nasser M. Nasarabadi, Ali Rezai

Responsive image

Auto-TLDR; Automatic Segmentation of Retinal Layers in Optical Coherence Tomography using Neural Architecture Search

Poster Similar

Alzheimer’s disease (AD) diagnosis is one of the major research areas in computational medicine. Optical coherence tomography (OCT) is a non-invasive, inexpensive, and timely efficient method that scans the human’s retina with depth. It has been hypothesized that the thickness of the retinal layers extracted from OCTs could be an efficient and effective biomarker for early diagnosis of AD. In this work, we aim to design a self-training model architecture for the task of segmenting the retinal layers in OCT scans. Neural architecture search (NAS) is a subfield of AutoML domain, which has a significant impact on improving the accuracy of machine vision tasks. We integrate the NAS algorithm with a Unet auto-encoder architecture as its backbone. Then, we employ our proposed model to segment the retinal nerve fiber layer in our preprocessed OCT images with the aim of AD diagnosis. In this work, we trained a super-resolution generative adversarial network on the raw OCT scans to improve the quality of the images before the modeling stage. In our architecture search strategy, different primitive operations suggested to find down- \& up-sampling Unet cell blocks and the binary gate method has been applied to make the search strategy more practical. Our architecture search method is empirically evaluated by training on the Unet and NAS-Unet from scratch. Specifically, the proposed NAS-Unet training significantly outperforms the baseline human-designed architecture by achieving 95.1\% in the mean Intersection over Union metric and 79.1\% in the Dice similarity coefficient.

P2 Net: Augmented Parallel-Pyramid Net for Attention Guided Pose Estimation

Luanxuan Hou, Jie Cao, Yuan Zhao, Haifeng Shen, Jian Tang, Ran He

Responsive image

Auto-TLDR; Parallel-Pyramid Net with Partial Attention for Human Pose Estimation

Slides Poster Similar

The target of human pose estimation is to determine the body parts and joint locations of persons in the image. Angular changes, motion blur and occlusion etc. in the natural scenes make this task challenging, while some joints are more difficult to be detected than others. In this paper, we propose an augmented Parallel-Pyramid Net (P^2Net) with an partial attention module. During data preprocessing, we proposed a differentiable auto data augmentation (DA^2) method in which sequences of data augmentations are formulated as a trainable and operational Convolution Neural Network (CNN) component. DA^2 improves the training efficiency and effectiveness. A parallel pyramid structure is followed to compensate the information loss introduced by the network. For the information loss problem in the backbone network, we optimize the backbone network by adopting a new parallel structure without increasing the overall computational complexity. To further refine the predictions after completion of global predictions, an Partial Attention Module (PAM) is defined to extract weighted features from different scale feature maps generated by the parallel pyramid structure. Compared with the traditional up-sampling refining, PAM can better capture the relationship between channels. Experiments corroborate the effectiveness of our proposed method. Notably, our method achieves the best performance on the challenging MSCOCO and MPII datasets.

Object Detection Model Based on Scene-Level Region Proposal Self-Attention

Yu Quan, Zhixin Li, Canlong Zhang, Huifang Ma

Responsive image

Auto-TLDR; Exploiting Semantic Informations for Object Detection

Slides Poster Similar

The improvement of object detection performance is mostly focused on the extraction of local information near the region of interest in the image, which results in detection performance in this area being unable to achieve the desired effect. First, a depth-wise separable convolution network(D_SCNet-127 R-CNN) is built on the backbone network. Considering the importance of scene and semantic informations for visual recognition, the feature map is sent into the branch of the semantic segmentation module, region proposal network module, and the region proposal self-attention module to build the network of scene-level and region proposal self-attention module. Second, a deep reinforcement learning was utilized to achieve accurate positioning of border regression, and the calculation speed of the whole model was improved through implementing a light-weight head network. This model can effectively solve the limitation of feature extraction in traditional object detection and obtain more comprehensive detailed features. The experimental verification on MSCOCO17, VOC12, and Cityscapes datasets shows that the proposed method has good validity and scalability.

Stage-Wise Neural Architecture Search

Artur Jordão, Fernando Akio Yamada, Maiko Lie, William Schwartz

Responsive image

Auto-TLDR; Efficient Neural Architecture Search for Deep Convolutional Networks

Slides Poster Similar

Modern convolutional networks such as ResNet and NASNet have achieved state-of-the-art results in many computer vision applications. These architectures consist of stages, which are sets of layers that operate on representations in the same resolution. It has been demonstrated that increasing the number of layers in each stage improves the prediction ability of the network. However, the resulting architecture becomes computationally expensive in terms of floating point operations, memory requirements and inference time. Thus, significant human effort is necessary to evaluate different trade-offs between depth and performance. To handle this problem, recent works have proposed to automatically design high-performance architectures, mainly by means of neural architecture search (NAS). Current NAS strategies analyze a large set of possible candidate architectures and, hence, require vast computational resources and take many GPUs days. Motivated by this, we propose a NAS approach to efficiently design accurate and low-cost convolutional architectures and demonstrate that an efficient strategy for designing these architectures is to learn the depth stage-by-stage. For this purpose, our approach increases depth incrementally in each stage taking into account its importance, such that stages with low importance are kept shallow while stages with high importance become deeper. We conduct experiments on the CIFAR and different versions of ImageNet datasets, where we show that architectures discovered by our approach achieve better accuracy and efficiency than human-designed architectures. Additionally, we show that architectures discovered on CIFAR-10 can be successfully transferred to large datasets. Compared to previous NAS approaches, our method is substantially more efficient, as it evaluates one order of magnitude fewer models and yields architectures on par with the state-of-the-art.

Resource-efficient DNNs for Keyword Spotting using Neural Architecture Search and Quantization

David Peter, Wolfgang Roth, Franz Pernkopf

Responsive image

Auto-TLDR; Neural Architecture Search for Keyword Spotting in Limited Resource Environments

Slides Poster Similar

This paper introduces neural architecture search (NAS) for the automatic discovery of small models for keyword spotting (KWS) in limited resource environments. We employ a differentiable NAS approach to optimize the structure of convolutional neural networks (CNNs) to meet certain memory constraints for storing weights as well as constraints on the number of operations per inference. Using NAS only, we were able to obtain a highly efficient model with 95.6% accuracy on the Google speech commands dataset with 494.8 kB of memory usage and 19.6 million operations. Additionally, weight quantization is used to reduce the memory consumption even further. We show that weight quantization to low bit-widths (e.g. 1 bit) can be used without substantial loss in accuracy. By increasing the number of input features from 10 MFCC to 20 MFCC we were able to increase the accuracy to 96.6% at 340.1 kB of memory usage and 27.1 million operations.

Fine-Tuning DARTS for Image Classification

Muhammad Suhaib Tanveer, Umar Karim Khan, Chong Min Kyung

Responsive image

Auto-TLDR; Fine-Tune Neural Architecture Search using Fixed Operations

Slides Poster Similar

Neural Architecture Search (NAS) has gained attraction due to superior classification performance. Differential Architecture Search (DARTS) is a computationally light method. To limit computational resources DARTS makes numerous approximations. These approximations result in inferior performance. We propose to fine-tune DARTS using fixed operations as these are independent of these approximations. Our method offers a good trade-off between the number of parameters and classification accuracy. Our approach improves the top-1 accuracy on Fashion-MNIST, CompCars and MIO-TCD datasets by 0.56%, 0.50%, and 0.39%, respectively compared to the state-of-the-art approaches. Our approach performs better than DARTS, improving the accuracy by 0.28%, 1.64%, 0.34%, 4.5%, and 3.27% compared to DARTS, on CIFAR-10, CIFAR-100, Fashion-MNIST, CompCars, and MIO-TCD datasets, respectively.

Neural Architecture Search for Image Super-Resolution Using Densely Connected Search Space: DeCoNAS

Joon Young Ahn, Nam Ik Cho

Responsive image

Auto-TLDR; DeCoNASNet: Automated Neural Architecture Search for Super-Resolution

Slides Poster Similar

Abstract—The recent progress of deep convolutional neural networks has enabled great success in single image superresolution (SISR) and many other vision tasks. Their performances are also being increased by deepening the networks and developing more sophisticated network structures. However, finding an optimal structure for the given problem is a difficult task, even for human experts. For this reason, neural architecture search (NAS) methods have been introduced, which automate the procedure of constructing the structures. In this paper, we expand the NAS to the super-resolution domain and find a lightweight densely connected network named DeCoNASNet. We use a hierarchical search strategy to find the best connection with local and global features. In this process, we define a complexitybased penalty for solving image super-resolution, which can be considered a multi-objective problem. Experiments show that our DeCoNASNet outperforms the state-of-the-art lightweight superresolution networks designed by handcraft methods and existing NAS-based design.

Fast, Accurate and Lightweight Super-Resolution with Neural Architecture Search

Chu Xiangxiang, Bo Zhang, Micheal Ma Hailong, Ruijun Xu, Jixiang Li, Qingyuan Li

Responsive image

Auto-TLDR; Multi-Objective Neural Architecture Search for Super-Resolution

Slides Poster Similar

Deep convolutional neural networks demonstrate impressive results in the super-resolution domain. A series of studies concentrate on improving peak signal noise ratio (PSNR) by using much deeper layers, which are not friendly to constrained resources. Pursuing a trade-off between the restoration capacity and the simplicity of models is still non-trivial. Recent contributions are struggling to manually maximize this balance, while our work achieves the same goal automatically with neural architecture search. Specifically, we handle super-resolution with a multi-objective approach. We also propose an elastic search tactic at both micro and macro level, based on a hybrid controller that profits from evolutionary computation and reinforcement learning. Quantitative experiments help us to draw a conclusion that our generated models dominate most of the state-of-the-art methods with respect to the individual FLOPS.

Construction Worker Hardhat-Wearing Detection Based on an Improved BiFPN

Chenyang Zhang, Zhiqiang Tian, Jingyi Song, Yaoyue Zheng, Bo Xu

Responsive image

Auto-TLDR; A One-Stage Object Detection Method for Hardhat-Wearing in Construction Site

Slides Poster Similar

Work in the construction site is considered to be one of the occupations with the highest safety risk factor. Therefore, safety plays an important role in construction site. One of the most fundamental safety rules in construction site is to wear a hardhat. To strengthen the safety of the construction site, most of the current methods use multi-stage method for hardhat-wearing detection. These methods have limitations in terms of adaptability and generalizability. In this paper, we propose a one-stage object detection method based on convolutional neural network. We present a multi-scale strategy that selects the high-resolution feature maps of DarkNet-53 to effectively identify small-scale hardhats. In addition, we propose an improved weighted bi-directional feature pyramid network (BiFPN), which could fuse more semantic features from more scales. The proposed method can not only detect hardhat-wearing, but also identify the color of the hardhat. Experimental results show that the proposed method achieves a mAP of 87.04%, which outperforms several state-of-the-art methods on a public dataset.

Object Detection in the DCT Domain: Is Luminance the Solution?

Benjamin Deguerre, Clement Chatelain, Gilles Gasso

Responsive image

Auto-TLDR; Jpeg Deep: Object Detection Using Compressed JPEG Images

Slides Poster Similar

Object detection in images has reached unprecedented performances. The state-of-the-art methods rely on deep architectures that extract salient features and predict bounding boxes enclosing the objects of interest. These methods essentially run on RGB images. However, the RGB images are often compressed by the acquisition devices for storage purpose and transfer efficiency. Hence, their decompression is required for object detectors. To gain in efficiency, this paper proposes to take advantage of the compressed representation of images to carry out object detection usable in constrained resources conditions. Specifically, we focus on JPEG images and propose a thorough analysis of detection architectures newly designed in regard of the peculiarities of the JPEG norm. This leads to a x1.7 speed up in comparison with a standard RGB-based architecture, while only reducing the detection performance by 5.5%. Additionally, our empirical findings demonstrate that only part of the compressed JPEG information, namely the luminance component, may be required to match detection accuracy of the full input methods. Code is made available at : https://github.com/D3lt4lph4/jpeg_deep.

Efficient-Receptive Field Block with Group Spatial Attention Mechanism for Object Detection

Jiacheng Zhang, Zhicheng Zhao, Fei Su

Responsive image

Auto-TLDR; E-RFB: Efficient-Receptive Field Block for Deep Neural Network for Object Detection

Slides Poster Similar

Object detection has been paid rising attention in computer vision field. Convolutional Neural Networks (CNNs) extract high-level semantic features of images, which directly determine the performance of object detection. As a common solution, embedding integration modules into CNNs can enrich extracted features and thereby improve the performance. However, the instability and inconsistency of internal multiple branches exist in these modules. To address this problem, we propose a novel multibranch module called Efficient-Receptive Field Block (E-RFB), in which multiple levels of features are combined for network optimization. Specifically, by downsampling and increasing depth, the E-RFB provides sufficient RF. Second, in order to eliminate the inconsistency across different branches, a novel spatial attention mechanism, namely, Group Spatial Attention Module (GSAM) is proposed. The GSAM gradually narrows a feature map by channel grouping; thus it encodes the information between spatial and channel dimensions into the final attention heat map. Third, the proposed module can be easily joined in various CNNs to enhance feature representation as a plug-and-play component. With SSD-style detectors, our method halves the parameters of the original detection head and achieves high accuracy on the PASCAL VOC and MS COCO datasets. Moreover, the proposed method achieves superior performance compared with state-of-the-art methods based on similar framework.

Bidirectional Matrix Feature Pyramid Network for Object Detection

Wei Xu, Yi Gan, Jianbo Su

Responsive image

Auto-TLDR; BMFPN: Bidirectional Matrix Feature Pyramid Network for Object Detection

Slides Poster Similar

Feature pyramids are widely used to improve scale invariance for object detection. Most methods just map the objects to feature maps with relevant square receptive fields, but rarely pay attention to the aspect ratio variation, which is also an important property of object instances. It will lead to a poor match between rectangular objects and assigned features with square receptive fields, thus preventing from accurate recognition and location. Besides, the information propagation among feature layers is sparse, namely, each feature in the pyramid may mainly or only contain single-level information, which is not representative enough for classification and localization sub-tasks. In this paper, Bidirectional Matrix Feature Pyramid Network (BMFPN) is proposed to address these issues. It consists of three modules: Diagonal Layer Generation Module (DLGM), Top-down Module (TDM) and Bottom-up Module (BUM). First, multi-level features extracted by backbone are fed into DLGM to produce the base features. Then these base features are utilized to construct the final feature pyramid through TDM and BUM in series. The receptive fields of the designed feature layers in BMFPN have various scales and aspect ratios. Objects can be correctly assigned to appropriate and representative feature maps with relevant receptive fields depending on its scale and aspect ratio properties. Moreover, TDM and BUM form bidirectional and reticular information flow, which effectively fuses multi level information in top-down and bottom-up manner respectively. To evaluate the effectiveness of our proposed architecture, an end-toend anchor-free detector is designed and trained by integrating BMFPN into FCOS. And the center ness branch in FCOS is modified with our Gaussian center-ness branch (GCB), which brings another slight improvement. Without bells and whistles, our method gains +3.3%, +2.4% and +2.6% AP on MS COCO dataset from baselines with ResNet-50, ResNet-101 and ResNeXt-101 backbones, respectively.

Fast and Accurate Real-Time Semantic Segmentation with Dilated Asymmetric Convolutions

Leonel Rosas-Arias, Gibran Benitez-Garcia, Jose Portillo-Portillo, Gabriel Sanchez-Perez, Keiji Yanai

Responsive image

Auto-TLDR; FASSD-Net: Dilated Asymmetric Pyramidal Fusion for Real-Time Semantic Segmentation

Slides Poster Similar

Recent works have shown promising results applied to real-time semantic segmentation tasks. To maintain fast inference speed, most of the existing networks make use of light decoders, or they simply do not use them at all. This strategy helps to maintain a fast inference speed; however, their accuracy performance is significantly lower in comparison to non-real-time semantic segmentation networks. In this paper, we introduce two key modules aimed to design a high-performance decoder for real-time semantic segmentation for reducing the accuracy gap between real-time and non-real-time segmentation networks. Our first module, Dilated Asymmetric Pyramidal Fusion (DAPF), is designed to substantially increase the receptive field on the top of the last stage of the encoder, obtaining richer contextual features. Our second module, Multi-resolution Dilated Asymmetric (MDA) module, fuses and refines detail and contextual information from multi-scale feature maps coming from early and deeper stages of the network. Both modules exploit contextual information without excessively increasing the computational complexity by using asymmetric convolutions. Our proposed network entitled “FASSD-Net” reaches 78.8% of mIoU accuracy on the Cityscapes validation dataset at 41.1 FPS on full resolution images (1024x2048). Besides, with a light version of our network, we reach 74.1% of mIoU at 133.1 FPS (full resolution) on a single NVIDIA GTX 1080Ti card with no additional acceleration techniques. The source code and pre-trained models are available at https://github.com/GibranBenitez/FASSD-Net.

A Modified Single-Shot Multibox Detector for Beyond Real-Time Object Detection

Georgios Orfanidis, Konstantinos Ioannidis, Stefanos Vrochidis, Anastasios Tefas, Ioannis Kompatsiaris

Responsive image

Auto-TLDR; Single Shot Detector in Resource-Restricted Systems with Lighter SSD Variations

Slides Poster Similar

This works focuses on examining the performance of the Single Shot Detector (SSD) model in resource restricted systems where maintaining the power of the full model comprises a significant prerequisite. The proposed SSD variations examine the behavior of lighter versions of SSD while propose measures to limit the unavoidable performance shortage. The outcomes of the conducted research demonstrate a remarkable trade-off between performance losses, speed improvement and the required resource reservation. Thus, the experimental results evidence the efficiency of the presented SSD alterations towards accomplishing higher frame rates and retaining the performance of the original model.

Dynamic Multi-Path Neural Network

Yingcheng Su, Yichao Wu, Ken Chen, Ding Liang, Xiaolin Hu

Responsive image

Auto-TLDR; Dynamic Multi-path Neural Network

Slides Similar

Although deeper and larger neural networks have achieved better performance, due to overwhelming burden on computation, they cannot meet the demands of deployment on resource-limited devices. An effective strategy to address this problem is to make use of dynamic inference mechanism, which changes the inference path for different samples at runtime. Existing methods only reduce the depth by skipping an entire specific layer, which may lose important information in this layer. In this paper, we propose a novel method called Dynamic Multi-path Neural Network (DMNN), which provides more topology choices in terms of both width and depth on the fly. For better modelling the inference path selection, we further introduce previous state and object category information to guide the training process. Compared to previous dynamic inference techniques, the proposed method is more flexible and easier to incorporate into most modern network architectures. Experimental results on ImageNet and CIFAR-100 demonstrate the superiority of our method on both efficiency and classification accuracy.

A Novel Region of Interest Extraction Layer for Instance Segmentation

Leonardo Rossi, Akbar Karimi, Andrea Prati

Responsive image

Auto-TLDR; Generic RoI Extractor for Two-Stage Neural Network for Instance Segmentation

Slides Poster Similar

Given the wide diffusion of deep neural network architectures for computer vision tasks, several new applications are nowadays more and more feasible. Among them, a particular attention has been recently given to instance segmentation, by exploiting the results achievable by two-stage networks (such as Mask R-CNN or Faster R-CNN), derived from R-CNN. In these complex architectures, a crucial role is played by the Region of Interest (RoI) extraction layer, devoted to extract a coherent subset of features from a single Feature Pyramid Network (FPN) layer attached on top of a backbone. This paper is motivated by the need to overcome to the limitations of existing RoI extractors which select only one (the best) layer from FPN. Our intuition is that all the layers of FPN retain useful information. Therefore, the proposed layer (called Generic RoI Extractor - GRoIE) introduces non-local building blocks and attention mechanisms to boost the performance. A comprehensive ablation study at component level is conducted to find the best set of algorithms and parameters for the GRoIE layer. Moreover, GRoIE can be integrated seamlessly with every two-stage architecture for both object detection and instance segmentation tasks. Therefore, the improvements brought by the use of GRoIE in different state-of-the-art architectures are also evaluated. The proposed layer leads up to gain a 1.1% AP on bounding box detection and 1.7% AP on instance segmentation. The code is publicly available on GitHub repository at https://github.com/IMPLabUniPr/mmdetection-groie

SFPN: Semantic Feature Pyramid Network for Object Detection

Yi Gan, Wei Xu, Jianbo Su

Responsive image

Auto-TLDR; SFPN: Semantic Feature Pyramid Network to Address Information Dilution Issue in FPN

Slides Poster Similar

Feature Pyramid Network(FPN) employs a top-down path to enhance low level feature by utilizing high level feature.However, further improvement of detector is greatly hindered by the inner defect of FPN. The dilution issue in FPN is analyzed in this paper, and a new architecture named Semantic Feature Pyramid Network(SFPN) is introduced to address the information imbalance problem caused by information dilution. The proposed method consists of two simple and effective components: Semantic Pyramid Module(SPM) and Semantic Feature Fusion Module(SFFM). To compensate for the weaknesses of FPN, the semantic segmentation result is utilized as an extra information source in our architecture.By constructing a semantic pyramid based on the segmentation result and fusing it with FPN, feature maps at each level can obtain the necessary information without suffering from the dilution issue. The proposed architecture could be applied on many detectors, and non-negligible improvement could be achieved. Although this method is designed for object detection, other tasks such as instance segmentation can also largely benefit from it. The proposed method brings Faster R-CNN and Mask R-CNN with ResNet-50 as backbone both 1.8 AP improvements respectively. Furthermore, SFPN improves Cascade R-CNN with backbone ResNet-101 from 42.4 AP to 43.5 AP.

A Fast and Accurate Object Detector for Handwritten Digit String Recognition

Jun Guo, Wenjing Wei, Yifeng Ma, Cong Peng

Responsive image

Auto-TLDR; ChipNet: An anchor-free object detector for handwritten digit string recognition

Slides Poster Similar

Focusing on handwritten digit string recognition (HDSR), we propose an anchor-free object detector called ChipNet, where a novel encoding method is designed. The input image is divided into columns, and then these columns are encoded by the ground truth. The adjacent columns are responsible for detecting the same target so that it can well address the class-imbalanced problem meanwhile reducing the network computation. ChipNet is composed of convolutional and bidirectional long short term memory networks. Different from the typical detectors, it doesn't use region proposals, anchors or regions of interest pooling. Hence, it can overcome the shortages of anchor-based and dense detectors in HDSR. The experiments are implemented on the synthetic digit strings, the CVL HDS database, and the ORAND-CAR-A & B databases. The high accuracies, which surpass the reported results by a large margin (up to 6.62%), are achieved. Furthermore, it gets 219 FPS speed on 160*32 px resolution images when using a Tesla P100 GPU. The results also show that ChipNet can handle touching, connecting and arbitrary length digit strings, and the obtained accuracies in HDSR are as high as the ones in single handwritten digit recognition.

Forground-Guided Vehicle Perception Framework

Kun Tian, Tong Zhou, Shiming Xiang, Chunhong Pan

Responsive image

Auto-TLDR; A foreground segmentation branch for vehicle detection

Slides Poster Similar

As the basis of advanced visual tasks such as vehicle tracking and traffic flow analysis, vehicle detection needs to accurately predict the position and category of vehicle objects. In the past decade, deep learning based methods have made great progress. However, we also notice that some existing cases are not studied thoroughly. First, false positive on the background regions is one of the critical problems. Second, most of the previous approaches only optimize a single vehicle detection model, ignoring the relationship between different visual perception tasks. In response to the above two findings, we introduce a foreground segmentation branch for the first time, which can predict the pixel level of vehicles in advance. Furthermore, two attention modules are designed to guide the work of the detection branch. The proposed method can be easily grafted into the one-stage and two-stage detection framework. We evaluate the effectiveness of our model on LSVH, a dataset with large variations in vehicle scales, and achieve the state-of-the-art detection accuracy.

Multiple-Step Sampling for Dense Object Detection and Counting

Zhaoli Deng, Yang Chenhui

Responsive image

Auto-TLDR; Multiple-Step Sampling for Dense Objects Detection

Slides Poster Similar

A multitude of similar or even identical objects are positioned closely in dense scenes, which brings about difficulties in object-detecting and object-counting. Since the poor performance of Faster R-CNN, recent works prefer to detect dense objects with the utilization of multi-layer feature maps. Nevertheless, they require complex post-processing to minimize overlap between adjacent bounding boxes, which reduce their detection speed. However, we find that such a multilayer prediction is not necessary. It is observed that there exists a waste of ground-truth boxes during sampling, causing the lack of positive samples and the final failure of Faster R-CNN training. Motivated by this observation we propose a multiple-step sampling method for anchor sampling. Our method reduces the waste of ground-truth boxes in three steps according to different rules. Besides, we balance the positive and negative samples, and samples at different quality. Our method improves base detector (Faster R-CNN), the detection tests on SKU-110K and CARPK benchmarks indicate that our approach offers a good trade-off between accuracy and speed.

Compression of YOLOv3 Via Block-Wise and Channel-Wise Pruning for Real-Time and Complicated Autonomous Driving Environment Sensing Applications

Jiaqi Li, Yanan Zhao, Li Gao, Feng Cui

Responsive image

Auto-TLDR; Pruning YOLOv3 with Batch Normalization for Autonomous Driving

Slides Poster Similar

Nowadays, in the area of autonomous driving, the computational power of the object detectors is limited by the embedded devices and the public datasets for autonomous driving are over-idealistic. In this paper, we propose a pipeline combining both block-wise pruning and channel-wise pruning to compress the object detection model iteratively. We enforce the introduced factor of the residual blocks and the scale parameters in Batch Normalization (BN) layers to sparsity to select the less important residual blocks and channels. Moreover, a modified loss function has been proposed to remedy the class-imbalance problem. After removing the unimportant structures iteratively, we get the pruned YOLOv3 trained on our datasets which have more abundant and elaborate classes. Evaluated by our validation sets on the server, the pruned YOLOv3 saves 79.7% floating point operations (FLOPs), 93.8% parameter size, 93.8% model volume and 45.4% inference times with only 4.16% mean of average precision (mAP) loss. Evaluated on the embedded device, the pruned model operates about 13 frames per second with 4.53% mAP loss. These results show that the real-time property and accuracy of the pruned YOLOv3 can meet the needs of the embedded devices in complicated autonomous driving environments.

Small Object Detection by Generative and Discriminative Learning

Yi Gu, Jie Li, Chentao Wu, Weijia Jia, Jianping Chen

Responsive image

Auto-TLDR; Generative and Discriminative Learning for Small Object Detection

Slides Poster Similar

With the development of deep convolutional neural networks (CNNs), the object detection accuracy has been greatly improved. But the performance of small object detection is still far from satisfactory, mainly because small objects are so tiny that the information contained in the feature map is limited. Existing methods focus on improving classification accuracy but still suffer from the limitation of bounding box prediction. To solve this issue, we propose a detection framework by generative and discriminative learning. First, a reconstruction generator network is designed to reconstruct the mapping from low frequency to high frequency for anchor box prediction. Then, a detector module extracts the regions of interest (ROIs) from generated results and implements a RoI-Head to predict object category and refine bounding box. In order to guide the reconstructed image related to the corresponding one, a discriminator module is adopted to tell from the generated result and the original image. Extensive evaluations on the challenging MS-COCO dataset demonstrate that our model outperforms most state-of-the-art models in detecting small objects, especially the reconstruction module improves the average precision for small object (APs) by 7.7%.

Hierarchical Head Design for Object Detectors

Shivang Agarwal, Frederic Jurie

Responsive image

Auto-TLDR; Hierarchical Anchor for SSD Detector

Slides Poster Similar

The notion of anchor plays a major role in modern detection algorithms such as the Faster-RCNN or the SSD detector. Anchors relate the features of the last layers of the detector with bounding boxes containing objects in images. Despite their importance, the literature on object detection has not paid real attention to them. The motivation of this paper comes from the observations that (i) each anchor learns to classify and regress candidate objects independently (ii) insufficient examples are available for each anchor in case of small-scale datasets. This paper addresses these questions by proposing a novel hierarchical head for the SSD detector. The new design has the added advantage of no extra weights, as compared to the original design at inference time, while improving detectors performance for small size training sets. Improved performance on PASCAL-VOC and state-of-the-art performance on FlickrLogos-47 validate the method. We also show when the proposed design does not give additional performance gain over the original design.

Learning a Dynamic High-Resolution Network for Multi-Scale Pedestrian Detection

Mengyuan Ding, Shanshan Zhang, Jian Yang

Responsive image

Auto-TLDR; Learningable Dynamic HRNet for Pedestrian Detection

Slides Poster Similar

Pedestrian detection is a canonical instance of object detection in computer vision. In practice, scale variation is one of the key challenges, resulting in unbalanced performance across different scales. Recently, the High-Resolution Network (HRNet) has become popular because high-resolution feature representations are more friendly to small objects. However, when we apply HRNet for pedestrian detection, we observe that it improves for small pedestrians on one hand, but hurts the performance for larger ones on the other hand. To overcome this problem, we propose a learnable Dynamic HRNet (DHRNet) aiming to generate different network paths adaptive to different scales. Specifically, we construct a parallel multi-branch architecture and add a soft conditional gate module allowing for dynamic feature fusion. Both branches share all the same parameters except the soft gate module. Experimental results on CityPersons and Caltech benchmarks indicate that our proposed dynamic HRNet is more capable of dealing with pedestrians of various scales, and thus improves the performance across different scales consistently.

Deep Real-Time Hand Detection Using CFPN on Embedded Systems

Pirdiansyah Hendri, Jun-Wei Hsieh, Ping Yang Chen

Responsive image

Auto-TLDR; Concatenated Feature Pyramid Network for Small Hand Detection on Embedded Devices

Slides Poster Similar

Real-time HI (Human Interface) systems need accurate and efficient hand detection models to meet the limited resources in budget, dimension, memory, computing, and electric power. In recent years, object detection became a less challenging task with the latest deep CNN-based state-of-the-art models, i.e., RCNN, SSD, and YOLO; however, these models cannot provide the desired efficiency and accuracy for HI systems on embedded devices due to their complex time-consuming architecture. In addition, the detection of small hands (<30x30 pixels) is still a challenging task for all the above existing methods. Thus, we propose a shallow model named Concatenated Feature Pyramid Network (CFPN) to provide above mentioned performance for small hand detection. The superiority of CFPN is confirmed on a HandFlow dataset with mAP:0.5 of 95.6 and FPS of 33 on Nvidia TX2. The COCO dataset is also used to compare with other state-of-the-art method and shows the highest efficiency and accuracy with the proposed CFPN model. Thus we conclude that the proposed model is useful for real-life small hand detection on embedded devices.

Channel Planting for Deep Neural Networks Using Knowledge Distillation

Kakeru Mitsuno, Yuichiro Nomura, Takio Kurita

Responsive image

Auto-TLDR; Incremental Training for Deep Neural Networks with Knowledge Distillation

Slides Poster Similar

In recent years, deeper and wider neural networks have shown excellent performance in computer vision tasks, while their enormous amount of parameters results in increased computational cost and overfitting. Several methods have been proposed to compress the size of the networks without reducing network performance. Network pruning can reduce redundant and unnecessary parameters from a network. Knowledge distillation can transfer the knowledge of deeper and wider networks to smaller networks. The performance of the smaller network obtained by these methods is bounded by the predefined network. Neural architecture search has been proposed, which can search automatically the architecture of the networks to break the structure limitation. Also, there is a dynamic configuration method to train networks incrementally as sub-networks. In this paper, we present a novel incremental training algorithm for deep neural networks called planting. Our planting can search the optimal network architecture with smaller number of parameters for improving the network performance by augmenting channels incrementally to layers of the initial networks while keeping the earlier trained parameters fixed. Also, we propose using the knowledge distillation method for training the channels planted. By transferring the knowledge of deeper and wider networks, we can grow the networks effectively and efficiently. We evaluate the effectiveness of the proposed method on different datasets such as CIFAR-10/100 and STL-10. For the STL-10 dataset, we show that we are able to achieve comparable performance with only 7% parameters compared to the larger network and reduce the overfitting caused by a small amount of the data.

Yolo+FPN: 2D and 3D Fused Object Detection with an RGB-D Camera

Ya Wang

Responsive image

Auto-TLDR; Yolo+FPN: Combining 2D and 3D Object Detection for Real-Time Object Detection

Slides Poster Similar

In this paper we propose a new deep neural network system, called Yolo+FPN, which fuses both 2D and 3D object detection algorithms to achieve better real-time object detection results and faster inference speed, to be used on real robots. Finding an optimized fusion strategy to efficiently combine 3D object detection with 2D detection information is useful and challenging for both indoor and outdoor robots. In order to satisfy real-time requirements, a trade-off between accuracy and efficiency is needed. We not only have improved training and test accuracies and lower mean losses on the KITTI object detection benchmark, but also achieve better average precision on 3D detection of all classes in three levels of difficulty. Also, we implemented Yolo+FPN system using an RGB-D camera, and compared the speed of 2D and 3D object detection using different GPUs. For the real implementation of both indoor and outdoor scenes, we focus on person detection, which is the most challenging and important among the three classes.

Automatic Student Network Search for Knowledge Distillation

Zhexi Zhang, Wei Zhu, Junchi Yan, Peng Gao, Guotong Xie

Responsive image

Auto-TLDR; NAS-KD: Knowledge Distillation for BERT

Slides Poster Similar

Pre-trained language models (PLMs), such as BERT, have achieved outstanding performance on multiple natural language processing (NLP) tasks. However, such pre-trained models usually contain a huge number of parameters and are computationally expensive. The high resource demand hinders their application on resource-restricted devices like mobile phones. Knowledge distillation (KD) is an effective compression approach, aiming at encouraging a light-weight student network to imitate the teacher network, and accordingly latent knowledge is transferred from the teacher to student. However, the great majority of student networks in previous KD methods are manually designed, normally a subnetwork of the teacher network. Transformer is generally utilized as the student for compressing BERT but still contains masses of parameters. Motivated by this, we propose a novel approach named NAS-KD, which automatically generates an optimal student network using neural architecture search (NAS) to enhance the distillation for BERT. Experiment on 7 classification tasks in NLP domain demonstrates that NAS-KD can substantially reduce the size of BERT without much performance sacrifice.

HFP: Hardware-Aware Filter Pruning for Deep Convolutional Neural Networks Acceleration

Fang Yu, Chuanqi Han, Pengcheng Wang, Ruoran Huang, Xi Huang, Li Cui

Responsive image

Auto-TLDR; Hardware-Aware Filter Pruning for Convolutional Neural Networks

Slides Poster Similar

Convolutional Neural Networks (CNNs) are powerful but computationally demanding and memory intensive, thus impeding their practical applications on resource-constrained hardware. Filter pruning is an efficient approach for deep CNN compression and acceleration, which aims to eliminate some filters with tolerable performance degradation. In the literature, the majority of approaches prune networks by defining the redundant filters or training the networks with a sparsity prior loss function. These approaches mainly use FLOPs as their speed metric. However, the inference latency of pruned networks cannot be directly controlled on the hardware platform, which is an important dimension of practicality. To address this issue, we propose a novel Hardware-aware Filter Pruning method (HFP) which can produce pruned networks that satisfy the actual latency budget on the hardwares of interest. In addition, we propose an iterative pruning framework called Opti-Cut to decrease the accuracy degradation of pruning process and accelerate the pruning procedure whilst meeting the hardware budget. More specifically, HFP first builds up a lookup table for fast estimating the latency of target network about filter configuration layer by layer. Then, HFP leverages information gain (IG) to globally evaluate the filters' contribution to network output distribution. HFP utilizes the Opti-Cut framework to globally prune filters with the minimum IG one by one until the latency budget is satisfied. We verify the effectiveness of the proposed method on CIFAR-10 and ImageNet. Compared with the state-of-the-art pruning methods, HFP demonstrates superior performances on VGGNet, ResNet and MobileNet V1/V2.

Neural Compression and Filtering for Edge-assisted Real-time Object Detection in Challenged Networks

Yoshitomo Matsubara, Marco Levorato

Responsive image

Auto-TLDR; Deep Neural Networks for Remote Object Detection Using Edge Computing

Slides Poster Similar

The edge computing paradigm places compute-capable devices - edge servers - at the network edge to assist mobile devices in executing data analysis tasks. Intuitively, offloading compute-intense tasks to edge servers can reduce their execution time. However, poor conditions of the wireless channel connecting the mobile devices to the edge servers may degrade the overall capture-to-output delay achieved by edge offloading. Herein, we focus on edge computing supporting remote object detection by means of Deep Neural Networks (DNN), and develop a framework to reduce the amount of data transmitted over the wireless link. The core idea we propose builds on recent approaches splitting DNNs into sections - namely head and tail models - executed by the mobile device and edge server, respectively. The wireless link, then, is used to transport the output of the last layer of the head model to the edge server, instead of the DNN input. Most prior work focuses on classification tasks and leaves the DNN structure unaltered. Herein, we focus on DNNs for three different object detection tasks, which present a much more convoluted structure, and modify the architecture of the network to: (i) achieve in-network compression by introducing a bottleneck layer in the early layers on the head model, and (ii) prefilter pictures that do not contain objects of interest using a convolutional neural network. Results show that the proposed technique represents an effective intermediate option between local and edge computing in a parameter region where these extreme point solutions fail to provide satisfactory performance.

Towards Low-Bit Quantization of Deep Neural Networks with Limited Data

Yong Yuan, Chen Chen, Xiyuan Hu, Silong Peng

Responsive image

Auto-TLDR; Low-Precision Quantization of Deep Neural Networks with Limited Data

Slides Poster Similar

Recent machine learning methods use increasingly large deep neural networks to achieve state-of-the-art results in various tasks. Network quantization can effectively reduce computation and memory costs without modifying network structures, facilitating the deployment of deep neural networks (DNNs) on cloud and edge devices. However, most of the existing methods usually need time-consuming training or fine-tuning and access to the original training dataset that may be unavailable due to privacy or security concerns. In this paper, we present a novel method to achieve low-precision quantization of deep neural networks with limited data. Firstly, to reduce the complexity of per-channel quantization and the degeneration of per-layer quantization, we introduce group-wise quantization which separates the output channels into groups that each group is quantized separately. Secondly, to better distill knowledge from the pre-trained FP32 model with limited data, we introduce a two-stage knowledge distillation method that divides the optimization process into independent optimization stage and joint optimization stage to address the limitation of layer-wise supervision and global supervision. Extensive experiments on ImageNet2012 (ResNet18/50, ShuffleNetV2, and MobileNetV2) demonstrate that the proposed approach can significantly improve the quantization model's accuracy when only a few training samples are available. We further show that the method also extends to other computer vision architectures and tasks such as object detection and semantic segmentation.

Detecting Objects with High Object Region Percentage

Fen Fang, Qianli Xu, Liyuan Li, Ying Gu, Joo-Hwee Lim

Responsive image

Auto-TLDR; Faster R-CNN for High-ORP Object Detection

Slides Poster Similar

Object shape is a subtle but important factor for object detection. It has been observed that the object-region-percentage (ORP) can be utilized to improve detection accuracy for elongated objects, which have much lower ORPs than other types of objects. In this paper, we propose an approach to improve the detection performance for objects whose ORPs are relatively higher.To address the problem of high-ORP object detection, we propose a method consisting of three steps. First, we adjust the ground truth bounding boxes of high-ORP objects to an optimal range. Second, we train an object detector, Faster R-CNN, based on adjusted bounding boxes to achieve high recall. Finally, we train a DCNN to learn the adjustment ratios towards four directions and adjust detected bounding boxes of objects to get better localization for higher precision. We evaluate the effectiveness of our method on 12 high-ORP objects in COCO and 8 objects in a proprietary gearbox dataset. The experimental results show that our method can achieve state-of-the-art performance on these objects while costing less resources in training and inference stages.

CenterRepp: Predict Central Representative Point Set's Distribution for Detection

Yulin He, Limeng Zhang, Wei Chen, Xin Luo, Chen Li, Xiaogang Jia

Responsive image

Auto-TLDR; CRPDet: CenterRepp Detector for Object Detection

Slides Poster Similar

Object detection has long been an important issue in the discipline of scene understanding. Existing researches mainly focus on the object itself, ignoring its surrounding environment. In fact, the surrounding environment provides abundant information to help detectors classify and locate objects. This paper proposes CRPDet, viz. CenterRepp Detector, a framework for object detection. The main function of CRPDet is accomplished by the CenterRepp module, which takes into account the surrounding environment by predicting the distribution of the central representative points. CenterRepp converts labeled object frames into the mean and standard variance of the sampling points’ distribution. This helps increase the receptive field of objects, breaking the limitation of object frames. CenterRepp defines a position-fixed center point with significant weights, avoiding to sample all points in the surroundings. Experiments on the COCO test-dev detection benchmark demonstrates that our proposed CRPDet has comparable performance with state-of-the-art detectors, achieving 39.4 mAP with 51 FPS tested under single size input.

Tiny Object Detection in Aerial Images

Jinwang Wang, Wen Yang, Haowen Guo, Ruixiang Zhang, Gui-Song Xia

Responsive image

Auto-TLDR; Tiny Object Detection in Aerial Images Using Multiple Center Points Based Learning Network

Slides Similar

Object detection in Earth Vision has achieved great progress in recent years. However, tiny object detection in aerial images remains a very challenging problem since the tiny objects contain a small number of pixels and are easily confused with the background. To advance tiny object detection research in aerial images, we present a new dataset for Tiny Object Detection in Aerial Images (AI-TOD). Specifically, AI-TOD comes with 700,621 object instances for eight categories across 28,036 aerial images. Compared to existing object detection datasets in aerial images, the mean size of objects in AI-TOD is about 12.8 pixels, which is much smaller than others. To build a benchmark for tiny object detection in aerial images, we evaluate the state-of-the-art object detectors on our AI-TOD dataset. Experimental results show that direct application of these approaches on AI-TOD produces suboptimal object detection results, thus new specialized detectors for tiny object detection need to be designed. Therefore, we propose a multiple center points based learning network (M-CenterNet) to improve the localization performance of tiny object detection, and experimental results show the significant performance gain over the competitors.

Adaptive Remote Sensing Image Attribute Learning for Active Object Detection

Nuo Xu, Chunlei Huo, Chunhong Pan

Responsive image

Auto-TLDR; Adaptive Image Attribute Learning for Active Object Detection

Slides Similar

In recent years, deep learning methods bring incredible progress to the field of object detection. However, in the field of remote sensing image processing, existing methods neglect the relationship between imaging configuration and detection performance, and do not take into account the importance of detection performance feedback for improving image quality. Therefore, detection performance is limited by the passive nature of the conventional object detection framework. In order to solve the above limitations, this paper takes adaptive brightness adjustment and scale adjustment as examples, and proposes an active object detection method based on deep reinforcement learning. The goal of adaptive image attribute learning is to maximize the detection performance. With the help of active object detection and image attribute adjustment strategies, low-quality images can be converted into high-quality images, and the overall performance is improved without retraining the detector.

CASNet: Common Attribute Support Network for Image Instance and Panoptic Segmentation

Xiaolong Liu, Yuqing Hou, Anbang Yao, Yurong Chen, Keqiang Li

Responsive image

Auto-TLDR; Common Attribute Support Network for instance segmentation and panoptic segmentation

Slides Poster Similar

Instance segmentation and panoptic segmentation is being paid more and more attention in recent years. In comparison with bounding box based object detection and semantic segmentation, instance segmentation can provide more analytical results at pixel level. Given the insight that pixels belonging to one instance have one or more common attributes of current instance, we bring up an one-stage instance segmentation network named Common Attribute Support Network (CASNet), which realizes instance segmentation by predicting and clustering common attributes. CASNet is designed in the manner of fully convolutional and can implement training and inference from end to end. And CASNet manages predicting the instance without overlaps and holes, which problem exists in most of current instance segmentation algorithms. Furthermore, it can be easily extended to panoptic segmentation through minor modifications with little computation overhead. CASNet builds a bridge between semantic and instance segmentation from finding pixel class ID to obtaining class and instance ID by operations on common attribute. Through experiment for instance and panoptic segmentation, CASNet gets mAP 32.8\% and PQ 59.0\% on Cityscapes validation dataset by joint training, and mAP 36.3\% and PQ 66.1\% by separated training mode. For panoptic segmentation, CASNet gets state-of-the-art performance on the Cityscapes validation dataset.

SyNet: An Ensemble Network for Object Detection in UAV Images

Berat Mert Albaba, Sedat Ozer

Responsive image

Auto-TLDR; SyNet: Combining Multi-Stage and Single-Stage Object Detection for Aerial Images

Poster Similar

Recent advances in camera equipped drone applications and their widespread use increased the demand on vision based object detection algorithms for aerial images. Object detection process is inherently a challenging task as a generic computer vision problem, however, since the use of object detection algorithms on UAVs (or on drones) is relatively a new area, it remains as a more challenging problem to detect objects in aerial images. There are several reasons for that including: (i) the lack of large drone datasets including large object variance, (ii) the large orientation and scale variance in drone images when compared to the ground images, and (iii) the difference in texture and shape features between the ground and the aerial images. Deep learning based object detection algorithms can be classified under two main categories: (a) single-stage detectors and (b) multi-stage detectors. Both single-stage and multi-stage solutions have their advantages and disadvantages over each other. However, a technique to combine the good sides of each of those solutions could yield even a stronger solution than each of those solutions individually. In this paper, we propose an ensemble network, SyNet, that combines a multi-stage method with a single-stage one with the motivation of decreasing the high false negative rate of multi-stage detectors and increasing the quality of the single-stage detector proposals. As building blocks, CenterNet and Cascade R-CNN with pretrained feature extractors are utilized along with an ensembling strategy. We report the state of the art results obtained by our proposed solution on two different datasets: namely MS-COCO and visDrone with \%52.1 $mAP_{IoU = 0.75}$ is obtained on MS-COCO $val2017$ dataset and \%26.2 $mAP_{IoU = 0.75}$ is obtained on VisDrone $test-set$. Our code is available at: https://github.com/mertalbaba/SyNet}{https://github.com/mer talbaba/SyNet

Object Detection on Monocular Images with Two-Dimensional Canonical Correlation Analysis

Zifan Yu, Suya You

Responsive image

Auto-TLDR; Multi-Task Object Detection from Monocular Images Using Multimodal RGB and Depth Data

Slides Poster Similar

Accurate and robust detection objects from monocular images is a fundamental vision task. This paper describes a novel approach of holistic scene understanding that can simultaneously achieve multiple tasks of scene reconstruction and object detection from a single monocular camera. Rather than pursuing an independent solution for each individual task as most existing work does, we seek a globally optimal solution that holistically resolves the multiple perception and reasoning tasks in an effective manner. The approach explores the complementary properties of multimodal RGB imagery and depth data to improve scene perception tasks. It uniquely combines the techniques of canonical correlation analysis and deep learning to learn the most correlated features to maximize the modal cross-correlation for improving the performance and robustness of object detection in complex environments. Extensive experiments have been conducted to evaluate and demonstrate the performances of the proposed approach.

Improved Residual Networks for Image and Video Recognition

Ionut Cosmin Duta, Li Liu, Fan Zhu, Ling Shao

Responsive image

Auto-TLDR; Residual Networks for Deep Learning

Slides Poster Similar

Residual networks (ResNets) represent a powerful type of convolutional neural network (CNN) architecture, widely adopted and used in various tasks. In this work we propose an improved version of ResNets. Our proposed improvements address all three main components of a ResNet: the flow of information through the network layers, the residual building block, and the projection shortcut. We are able to show consistent improvements in accuracy and learning convergence over the baseline. For instance, on ImageNet dataset, using the ResNet with 50 layers, for top-1 accuracy we can report a 1.19% improvement over the baseline in one setting and around 2% boost in another. Importantly, these improvements are obtained without increasing the model complexity. Our proposed approach allows us to train extremely deep networks, while the baseline shows severe optimization issues. We report results on three tasks over six datasets: image classification (ImageNet, CIFAR-10 and CIFAR-100), object detection (COCO) and video action recognition (Kinetics-400 and Something-Something-v2). In the deep learning era, we establish a new milestone for the depth of a CNN. We successfully train a 404-layer deep CNN on the ImageNet dataset and a 3002-layer network on CIFAR-10 and CIFAR-100, while the baseline is not able to converge at such extreme depths. Code is available at: https://github.com/iduta/iresnet

AOAM: Automatic Optimization of Adjacency Matrix for Graph Convolutional Network

Yuhang Zhang, Hongshuai Ren, Jiexia Ye, Xitong Gao, Yang Wang, Kejiang Ye, Cheng-Zhong Xu

Responsive image

Auto-TLDR; Adjacency Matrix for Graph Convolutional Network in Non-Euclidean Space

Slides Poster Similar

Graph Convolutional Network (GCN) is adopted to tackle the problem of the convolution operation in non-Euclidean space. Although previous works on GCN have made some progress, one of their limitations is that their input Adjacency Matrix (AM) is designed manually and requires domain knowledge, which is cumbersome, tedious and error-prone. In addition, entries of this fixed Adjacency Matrix are generally designed as binary values (i.e., ones and zeros) which can not reflect more complex relationship between nodes. However, many applications require a weighted and dynamic Adjacency Matrix instead of an unweighted and fixed Adjacency Matrix. To this end, there are few works focusing on designing a more flexible Adjacency Matrix. In this paper, we propose an end-to-end algorithm to improve the GCN performance by focusing on the Adjacency Matrix. We first provide a calculation method that called node information entropy to update the matrix. Then, we analyze the search strategy in a continuous space and introduce the Deep Deterministic Policy Gradient (DDPG) method to overcome the demerit of the discrete space search. Finally, we integrate the GCN and reinforcement learning into an end-to-end framework. Our method can automatically define the adjacency matrix without artificial knowledge. At the same time, the proposed approach can deal with any size of the matrix and provide a better value for the network. Four popular datasets are selected to evaluate the capability of our algorithm. The method in this paper achieves the state-of-the-art performance on Cora and Pubmed datasets, respectively, with the accuracy of 84.6% and 81.6%.

Hybrid Cascade Point Search Network for High Precision Bar Chart Component Detection

Junyu Luo, Jinpeng Wang, Chin-Yew Lin

Responsive image

Auto-TLDR; Object Detection of Chart Components in Chart Images Using Point-based and Region-Based Object Detection Framework

Slides Poster Similar

Charts are commonly used for data visualization. One common form of chart distribution is in its image form. To enable machine comprehension of chart images, precise detection of chart components in chart images is a critical step. Existing image object detection methods do not perform well in chart component detection which requires high boundary detection precision. And traditional rule-based approaches lack enough generalization ability. In order to address this problem, we design a novel two-stage object detection framework that combines point-based and region-based ideas, by simulating the process that human creating bounding boxes for objects. The experiment on our labeled ChartDet dataset shows our method greatly improves the performance of chart object detection. We further extend our method to a general object detection task and get comparable performance.

FastSal: A Computationally Efficient Network for Visual Saliency Prediction

Feiyan Hu, Kevin Mcguinness

Responsive image

Auto-TLDR; MobileNetV2: A Convolutional Neural Network for Saliency Prediction

Slides Poster Similar

This paper focuses on the problem of visual saliency prediction, predicting regions of an image that tend to attract human visual attention, under a constrained computational budget. We modify and test various recent efficient convolutional neural network architectures like EfficientNet and MobileNetV2 and compare them with existing state-of-the-art saliency models such as SalGAN and DeepGaze II both in terms of standard accuracy metrics like AUC and NSS, and in terms of the computational complexity and model size. We find that MobileNetV2 makes an excellent backbone for a visual saliency model and can be effective even without a complex decoder. We also show that knowledge transfer from a more computationally expensive model like DeepGaze II can be achieved via pseudo-labelling an unlabelled dataset, and that this approach gives result on-par with many state-of-the-art algorithms with a fraction of the computational cost and model size.

MagnifierNet: Learning Efficient Small-Scale Pedestrian Detector towards Multiple Dense Regions

Qi Cheng, Mingqin Chen, Yingjie Wu, Fei Chen, Shiping Lin

Responsive image

Auto-TLDR; MagnifierNet: A Simple but Effective Small-Scale Pedestrian Detection Towards Multiple Dense Regions

Slides Poster Similar

Despite the success of pedestrian detection, there is still a significant gap in the performance of the detection of pedestrians at different scales. Detecting small-scale pedestrians is extremely challenging due to the low resolution of their convolution features which is essential for downstream classifiers. To address this issue, we observed pedestrian datasets and found that pedestrians often gather together in crowded public places. Then we propose MagnifierNet, a simple but effective small-scale pedestrian detector towards multiple dense regions. MagnifierNet uses our proposed sweep-line based grouping algorithm to find dense regions based on the number of pedestrians in the grouped region. And we adopt a new definition of small-scale pedestrians through grid search and KL-divergence. Besides, our grouping method can also be used as a new strategy for pedestrian data augmentation. The ablation study demonstrates that MagnifierNet improves the representation of small-scale pedestrians. We validate the effectiveness of MagnifierNet on CityPersons and KITTI datasets. Experimental results show that MagnifierNet achieves the best small-scale pedestrian detection performance on CityPersons benchmark without any external data, and also achieves competitive performance for detecting small-scale pedestrians on KITTI dataset without bells and whistles.