S-VoteNet: Deep Hough Voting with Spherical Proposal for 3D Object Detection

Yanxian Chen, Huimin Ma, Xi Li, Xiong Luo

Responsive image

Auto-TLDR; S-VoteNet: 3D Object Detection with Spherical Bounded Box Prediction

Slides Poster

Current 3D object detection methods adopt an analogous box prediction structure with the 2D methods, which predict center and size of the object simultaneously in a box regression procedure, leading to the poor performance of 3D detector to a great extent. In this work, we propose S-VoteNet, which converts the prediction of 3D bounding box into two parts: center prediction and size prediction. By introducing a novel spherical proposal, S-VoteNet uses vote groups to predict the center and radius of object rather than all parameters of 3D bounding box. The prediction of radius is used to constrain the object size, and the radius-based spherical center loss is applied to measure the geometric distance between the proposal and ground-truth. To make better use of the geometric information provided by point cloud, S-VoteNet gathers seed points whose corresponding votes are within the vote groups for seed group generation. Seed groups are then consumed for box size regression and orientation estimation. By decoupling the localization and size estimation, our method effectively reduces the regression pressure of the 3D detector. Experimental results on SUN RGB-D 3D detection benchmark demonstrate that our S-VoteNet achieves state-of-the-art performance by using only point cloud as input.

Similar papers

Enhanced Vote Network for 3D Object Detection in Point Clouds

Min Zhong, Gang Zeng

Responsive image

Auto-TLDR; A Vote Feature Enhancement Network for 3D Bounding Box Prediction

Slides Poster Similar

In this work, we aim to estimate 3D bounding boxes by voting to object centers and then groups and aggregates the votes to generate 3D box proposals and semantic classes of objects. However, due to the sparse and unstructured nature of the point clouds, we face some challenges when directly predicting bounding box from the vote feature: the sparse vote feature may lack some necessary semantic and context information. To address the challenges, we propose a vote feature enhancement network that aims to encode semantic-aware information and aggravate global context for the vote feature. Specifically, we learn the point-wise semantic information and supplement it to the vote feature, and we also encode the pairwise relations to collect the global context. Experiments on two large datasets of real 3D scans, ScanNet and SUN RGB-D, demonstrate that our method can achieve excellent 3D detection results.

PointDrop: Improving Object Detection from Sparse Point Clouds Via Adversarial Data Augmentation

Wenxin Ma, Jian Chen, Qing Du, Wei Jia

Responsive image

Auto-TLDR; PointDrop: Improving Robust 3D Object Detection to Sparse Point Clouds

Slides Poster Similar

Current 3D object detection methods achieve accurate and efficient results on the standard point cloud dataset. However, in real-world applications, due to the expensive cost of obtaining the annotated 3D object detection data, we expect to directly apply the model trained on the standard dataset to real-world scenarios. This strategy may fail because the point cloud samples obtained in the real-world scenarios may be much sparser due to various reasons (occlusion, low reflectivity of objects and fewer laser beams) and existing methods do not consider the limitations of their models on sparse point clouds. To improve the robustness of an object detector to sparser point clouds, we propose PointDrop, which learns to drop the features of some key points in the point clouds to generate challenging sparse samples for data augmentation. Moreover, PointDrop is able to adjust the difficulty of the generated samples based on the capacity of the detector and thus progressively improve the performance of the detector. We create two sparse point clouds datasets from the KITTI dataset to evaluate our method, and the experimental results show that PointDrop significantly improves the robustness of the detector to sparse point clouds.

Yolo+FPN: 2D and 3D Fused Object Detection with an RGB-D Camera

Ya Wang

Responsive image

Auto-TLDR; Yolo+FPN: Combining 2D and 3D Object Detection for Real-Time Object Detection

Slides Poster Similar

In this paper we propose a new deep neural network system, called Yolo+FPN, which fuses both 2D and 3D object detection algorithms to achieve better real-time object detection results and faster inference speed, to be used on real robots. Finding an optimized fusion strategy to efficiently combine 3D object detection with 2D detection information is useful and challenging for both indoor and outdoor robots. In order to satisfy real-time requirements, a trade-off between accuracy and efficiency is needed. We not only have improved training and test accuracies and lower mean losses on the KITTI object detection benchmark, but also achieve better average precision on 3D detection of all classes in three levels of difficulty. Also, we implemented Yolo+FPN system using an RGB-D camera, and compared the speed of 2D and 3D object detection using different GPUs. For the real implementation of both indoor and outdoor scenes, we focus on person detection, which is the most challenging and important among the three classes.

PointSpherical: Deep Shape Context for Point Cloud Learning in Spherical Coordinates

Hua Lin, Bin Fan, Yongcheng Liu, Yirong Yang, Zheng Pan, Jianbo Shi, Chunhong Pan, Huiwen Xie

Responsive image

Auto-TLDR; Spherical Hierarchical Modeling of 3D Point Cloud

Slides Poster Similar

We propose Spherical Hierarchical modeling of 3D point cloud. Inspired by Shape Context, we design a receptive field on each 3D point by placing a spherical coordinate on it. We sample points using the furthest point method and creating overlapping balls of points. For each ball, we divide the space into radial, polar angular and azimuthal angular bins on which we form a Spherical Hierarchy. We apply 1x1 CNN convolution on points to start the initial feature extraction. Repeated 3D CNN and max pooling over the Spherical bins propagate contextual information until all the information is condensed in the center bin. Extensive experiments on five datasets strongly evidence that our method outperform current models on various Point Cloud Learning tasks, including 2D/3D shape classification, 3D part segmentation and 3D semantic segmentation.

Scene Text Detection with Selected Anchors

Anna Zhu, Hang Du, Shengwu Xiong

Responsive image

Auto-TLDR; AS-RPN: Anchor Selection-based Region Proposal Network for Scene Text Detection

Slides Poster Similar

Object proposal technique with dense anchoring scheme for scene text detection were applied frequently to achieve high recall. It results in the significant improvement in accuracy but waste of computational searching, regression and classification. In this paper, we propose an anchor selection-based region proposal network (AS-RPN) using effective selected anchors instead of dense anchors to extract text proposals. The center, scales, aspect ratios and orientations of anchors are learnable instead of fixing, which leads to high recall and greatly reduced numbers of anchors. By replacing the anchor-based RPN in Faster RCNN, the AS-RPN-based Faster RCNN can achieve comparable performance with previous state-of-the-art text detecting approaches on standard benchmarks, including COCO-Text, ICDAR2013, ICDAR2015 and MSRA-TD500 when using single-scale and single model (ResNet50) testing only.

MixedFusion: 6D Object Pose Estimation from Decoupled RGB-Depth Features

Hangtao Feng, Lu Zhang, Xu Yang, Zhiyong Liu

Responsive image

Auto-TLDR; MixedFusion: Combining Color and Point Clouds for 6D Pose Estimation

Slides Poster Similar

Estimating the 6D pose of objects is an important process for intelligent systems to achieve interaction with the real-world. As the RGB-D sensors become more accessible, the fusion-based methods have prevailed, since the point clouds provide complementary geometric information with RGB values. However, Due to the difference in feature space between color image and depth image, the network structures that directly perform point-to-point matching fusion do not effectively fuse the features of the two. In this paper, we propose a simple but effective approach, named MixedFusion. Different from the prior works, we argue that the spatial correspondence of color and point clouds could be decoupled and reconnected, thus enabling a more flexible fusion scheme. By performing the proposed method, more informative points can be mixed and fused with rich color features. Extensive experiments are conducted on the challenging LineMod and YCB-Video datasets, show that our method significantly boosts the performance without introducing extra overheads. Furthermore, when the minimum tolerance of metric narrows, the proposed approach performs better for the high-precision demands.

CASNet: Common Attribute Support Network for Image Instance and Panoptic Segmentation

Xiaolong Liu, Yuqing Hou, Anbang Yao, Yurong Chen, Keqiang Li

Responsive image

Auto-TLDR; Common Attribute Support Network for instance segmentation and panoptic segmentation

Slides Poster Similar

Instance segmentation and panoptic segmentation is being paid more and more attention in recent years. In comparison with bounding box based object detection and semantic segmentation, instance segmentation can provide more analytical results at pixel level. Given the insight that pixels belonging to one instance have one or more common attributes of current instance, we bring up an one-stage instance segmentation network named Common Attribute Support Network (CASNet), which realizes instance segmentation by predicting and clustering common attributes. CASNet is designed in the manner of fully convolutional and can implement training and inference from end to end. And CASNet manages predicting the instance without overlaps and holes, which problem exists in most of current instance segmentation algorithms. Furthermore, it can be easily extended to panoptic segmentation through minor modifications with little computation overhead. CASNet builds a bridge between semantic and instance segmentation from finding pixel class ID to obtaining class and instance ID by operations on common attribute. Through experiment for instance and panoptic segmentation, CASNet gets mAP 32.8\% and PQ 59.0\% on Cityscapes validation dataset by joint training, and mAP 36.3\% and PQ 66.1\% by separated training mode. For panoptic segmentation, CASNet gets state-of-the-art performance on the Cityscapes validation dataset.

CenterRepp: Predict Central Representative Point Set's Distribution for Detection

Yulin He, Limeng Zhang, Wei Chen, Xin Luo, Chen Li, Xiaogang Jia

Responsive image

Auto-TLDR; CRPDet: CenterRepp Detector for Object Detection

Slides Poster Similar

Object detection has long been an important issue in the discipline of scene understanding. Existing researches mainly focus on the object itself, ignoring its surrounding environment. In fact, the surrounding environment provides abundant information to help detectors classify and locate objects. This paper proposes CRPDet, viz. CenterRepp Detector, a framework for object detection. The main function of CRPDet is accomplished by the CenterRepp module, which takes into account the surrounding environment by predicting the distribution of the central representative points. CenterRepp converts labeled object frames into the mean and standard variance of the sampling points’ distribution. This helps increase the receptive field of objects, breaking the limitation of object frames. CenterRepp defines a position-fixed center point with significant weights, avoiding to sample all points in the surroundings. Experiments on the COCO test-dev detection benchmark demonstrates that our proposed CRPDet has comparable performance with state-of-the-art detectors, achieving 39.4 mAP with 51 FPS tested under single size input.

HPERL: 3D Human Pose Estimastion from RGB and LiDAR

Michael Fürst, Shriya T.P. Gupta, René Schuster, Oliver Wasenmüler, Didier Stricker

Responsive image

Auto-TLDR; 3D Human Pose Estimation Using RGB and LiDAR Using Weakly-Supervised Approach

Slides Poster Similar

In-the-wild human pose estimation has a huge potential for various fields, ranging from animation and action recognition to intention recognition and prediction for autonomous driving. The current state-of-the-art is focused only on RGB and RGB-D approaches for predicting the 3D human pose. However, not using precise LiDAR depth information limits the performance and leads to very inaccurate absolute pose estimation. With LiDAR sensors becoming more affordable and common on robots and autonomous vehicle setups, we propose an end-to-end architecture using RGB and LiDAR to predict the absolute 3D human pose with unprecedented precision. Additionally, we introduce a weakly-supervised approach to generate 3D predictions using 2D pose annotations from PedX. This allows for many new opportunities in the field of 3D human pose estimation.

Mutual-Supervised Feature Modulation Network for Occluded Pedestrian Detection

Ye He, Chao Zhu, Xu-Cheng Yin

Responsive image

Auto-TLDR; A Mutual-Supervised Feature Modulation Network for Occluded Pedestrian Detection

Similar

State-of-the-art pedestrian detectors have achieved significant progress on non-occluded pedestrians, yet they are still struggling under heavy occlusions. The recent occlusion handling strategy of popular two-stage approaches is to build a two-branch architecture with the help of additional visible body annotations. Nonetheless, these methods still have some weaknesses. Either the two branches are trained independently with only score-level fusion, which cannot guarantee the detectors to learn robust enough pedestrian features. Or the attention mechanisms are exploited to only emphasize on the visible body features. However, the visible body features of heavily occluded pedestrians are concentrated on a relatively small area, which will easily cause missing detections. To address the above issues, we propose in this paper a novel Mutual-Supervised Feature Modulation (MSFM) network, to better handle occluded pedestrian detection. The key MSFM module in our network calculates the similarity loss of full body boxes and visible body boxes corresponding to the same pedestrian, so that the full-body detector could learn more complete and robust pedestrian features with the assist of contextual features from the occluding parts. To facilitate the MSFM module, we also propose a novel two-branch architecture, consisting of a standard full body detection branch and an extra visible body classification branch. These two branches are trained in a mutual-supervised way with full body annotations and visible body annotations, respectively. To verify the effectiveness of our proposed method, extensive experiments are conducted on two challenging pedestrian datasets: Caltech and CityPersons, and our approach achieves superior performances compared to other state-of-the-art methods on both datasets, especially in heavy occlusion cases.

Deep Space Probing for Point Cloud Analysis

Yirong Yang, Bin Fan, Yongcheng Liu, Hua Lin, Jiyong Zhang, Xin Liu, 蔡鑫宇 蔡鑫宇, Shiming Xiang, Chunhong Pan

Responsive image

Auto-TLDR; SPCNN: Space Probing Convolutional Neural Network for Point Cloud Analysis

Slides Poster Similar

3D points distribute in a continuous 3D space irregularly, thus directly adapting 2D image convolution to 3D points is not an easy job. Previous works often artificially divide the space into regular grids, yet it could be suboptimal to learn geometry. In this paper, we propose SPCNN, namely, Space Probing Convolutional Neural Network, which naturally generalizes image CNN to deal with point clouds. The key idea of SPCNN is learning to probe the 3D space in an adaptive manner. Specifically, we define a pool of learnable convolutional weights, and let each point in the local region learn to choose a suitable convolutional weight from the pool. This is achieved by constructing a geometry guided index-mapping function that implicitly establishes a correspondence between convolutional weights and some local regions in the neighborhood (Fig. 1). In this way, the index-mapping function learns to adaptively partition nearby space for local geometry pattern recognition. With this convolution as a basic operator, SPCNN, a hierarchical architecture can be developed for effective point cloud analysis. Extensive experiments on challenging benchmarks across three tasks demonstrate that SPCNN achieves the state-of-the-art or has competitive performance.

Self-Supervised Detection and Pose Estimation of Logistical Objects in 3D Sensor Data

Nikolas Müller, Jonas Stenzel, Jian-Jia Chen

Responsive image

Auto-TLDR; A self-supervised and fully automated deep learning approach for object pose estimation using simulated 3D data

Slides Poster Similar

Localization of objects in cluttered scenes with machine learning methods is a fairly young research area. Despite the high potential of object localization for full process automation in Industry 4.0 and logistical environments, 3D data sets for such applications to train machine learning models are not openly available and less publications have been made on that topic. To the authors knowledge, this is the first publication that describes a self-supervised and fully automated deep learning approach for object pose estimation using simulated 3D data. The solution covers the simulated generation of training data, the detection of objects in point clouds using a fully convolutional feedforward network and the computation of the pose for each detected object instance.

Bidirectional Matrix Feature Pyramid Network for Object Detection

Wei Xu, Yi Gan, Jianbo Su

Responsive image

Auto-TLDR; BMFPN: Bidirectional Matrix Feature Pyramid Network for Object Detection

Slides Poster Similar

Feature pyramids are widely used to improve scale invariance for object detection. Most methods just map the objects to feature maps with relevant square receptive fields, but rarely pay attention to the aspect ratio variation, which is also an important property of object instances. It will lead to a poor match between rectangular objects and assigned features with square receptive fields, thus preventing from accurate recognition and location. Besides, the information propagation among feature layers is sparse, namely, each feature in the pyramid may mainly or only contain single-level information, which is not representative enough for classification and localization sub-tasks. In this paper, Bidirectional Matrix Feature Pyramid Network (BMFPN) is proposed to address these issues. It consists of three modules: Diagonal Layer Generation Module (DLGM), Top-down Module (TDM) and Bottom-up Module (BUM). First, multi-level features extracted by backbone are fed into DLGM to produce the base features. Then these base features are utilized to construct the final feature pyramid through TDM and BUM in series. The receptive fields of the designed feature layers in BMFPN have various scales and aspect ratios. Objects can be correctly assigned to appropriate and representative feature maps with relevant receptive fields depending on its scale and aspect ratio properties. Moreover, TDM and BUM form bidirectional and reticular information flow, which effectively fuses multi level information in top-down and bottom-up manner respectively. To evaluate the effectiveness of our proposed architecture, an end-toend anchor-free detector is designed and trained by integrating BMFPN into FCOS. And the center ness branch in FCOS is modified with our Gaussian center-ness branch (GCB), which brings another slight improvement. Without bells and whistles, our method gains +3.3%, +2.4% and +2.6% AP on MS COCO dataset from baselines with ResNet-50, ResNet-101 and ResNeXt-101 backbones, respectively.

MagnifierNet: Learning Efficient Small-Scale Pedestrian Detector towards Multiple Dense Regions

Qi Cheng, Mingqin Chen, Yingjie Wu, Fei Chen, Shiping Lin

Responsive image

Auto-TLDR; MagnifierNet: A Simple but Effective Small-Scale Pedestrian Detection Towards Multiple Dense Regions

Slides Poster Similar

Despite the success of pedestrian detection, there is still a significant gap in the performance of the detection of pedestrians at different scales. Detecting small-scale pedestrians is extremely challenging due to the low resolution of their convolution features which is essential for downstream classifiers. To address this issue, we observed pedestrian datasets and found that pedestrians often gather together in crowded public places. Then we propose MagnifierNet, a simple but effective small-scale pedestrian detector towards multiple dense regions. MagnifierNet uses our proposed sweep-line based grouping algorithm to find dense regions based on the number of pedestrians in the grouped region. And we adopt a new definition of small-scale pedestrians through grid search and KL-divergence. Besides, our grouping method can also be used as a new strategy for pedestrian data augmentation. The ablation study demonstrates that MagnifierNet improves the representation of small-scale pedestrians. We validate the effectiveness of MagnifierNet on CityPersons and KITTI datasets. Experimental results show that MagnifierNet achieves the best small-scale pedestrian detection performance on CityPersons benchmark without any external data, and also achieves competitive performance for detecting small-scale pedestrians on KITTI dataset without bells and whistles.

Detecting Objects with High Object Region Percentage

Fen Fang, Qianli Xu, Liyuan Li, Ying Gu, Joo-Hwee Lim

Responsive image

Auto-TLDR; Faster R-CNN for High-ORP Object Detection

Slides Poster Similar

Object shape is a subtle but important factor for object detection. It has been observed that the object-region-percentage (ORP) can be utilized to improve detection accuracy for elongated objects, which have much lower ORPs than other types of objects. In this paper, we propose an approach to improve the detection performance for objects whose ORPs are relatively higher.To address the problem of high-ORP object detection, we propose a method consisting of three steps. First, we adjust the ground truth bounding boxes of high-ORP objects to an optimal range. Second, we train an object detector, Faster R-CNN, based on adjusted bounding boxes to achieve high recall. Finally, we train a DCNN to learn the adjustment ratios towards four directions and adjust detected bounding boxes of objects to get better localization for higher precision. We evaluate the effectiveness of our method on 12 high-ORP objects in COCO and 8 objects in a proprietary gearbox dataset. The experimental results show that our method can achieve state-of-the-art performance on these objects while costing less resources in training and inference stages.

Joint Semantic-Instance Segmentation of 3D Point Clouds: Instance Separation and Semantic Fusion

Min Zhong, Gang Zeng

Responsive image

Auto-TLDR; Joint Semantic Segmentation and Instance Separation of 3D Point Clouds

Slides Poster Similar

This paper introduces an approach for jointly addressing semantic segmentation (SS) and instance segmentation (IS) of 3D point clouds. Two novel modules are designed to model the interplay between SS and IS. Specifically, we develop an Instance Separation Module that supplements the position-invariance semantic feature with the instance-specific centroid position to help separate different instances. To fuse the semantic information within a single instance, an attention-based Semantic Fusion Module is proposed to encode attention maps in the instance embedding space, which are applied to fuse semantic information in the semantic feature space. The proposed method is thoroughly evaluated on the S3DIS dataset. Compared with the excellent method ASIS, our approach achieves significant improvements across all evaluation metrics in both IS and SS.

PC-Net: A Deep Network for 3D Point Clouds Analysis

Zhuo Chen, Tao Guan, Yawei Luo, Yuesong Wang

Responsive image

Auto-TLDR; PC-Net: A Hierarchical Neural Network for 3D Point Clouds Analysis

Slides Poster Similar

Due to the irregularity and sparsity of 3D point clouds, applying convolutional neural networks directly on them can be nontrivial. In this work, we propose a simple but effective approach for 3D Point Clouds analysis, named PC-Net. PC-Net directly learns on point sets and is equipped with three new operations: first, we apply a novel scale-aware neighbor search for adaptive neighborhood extracting; second, for each neighboring point, we learn a local spatial feature as a complement to their associated features; finally, at the end we use a distance re-weighted pooling to aggregate all the features from local structure. With this module, we design hierarchical neural network for point cloud understanding. For both classification and segmentation tasks, our architecture proves effective in the experiments and our models demonstrate state-of-the-art performance over existing deep learning methods on popular point cloud benchmarks.

MANet: Multimodal Attention Network Based Point-View Fusion for 3D Shape Recognition

Yaxin Zhao, Jichao Jiao, Ning Li

Responsive image

Auto-TLDR; Fusion Network for 3D Shape Recognition based on Multimodal Attention Mechanism

Slides Poster Similar

3D shape recognition has attracted more and more attention as a task of 3D vision research. The proliferation of 3D data encourages various deep learning methods based on 3D data. Now there have been many deep learning models based on point-cloud data or multi-view data alone. However, in the era of big data, integrating data of two different modals to obtain a unified 3D shape descriptor is bound to improve the recognition accuracy. Therefore, this paper proposes a fusion network based on multimodal attention mechanism for 3D shape recognition. Considering the limitations of multi-view data, we introduce a soft attention scheme, which can use the global point-cloud features to filter the multi-view features, and then realize the effective fusion of the two features. More specifically, we obtain the enhanced multi-view features by mining the contribution of each multi-view image to the overall shape recognition, and then fuse the point-cloud features and the enhanced multi-view features to obtain a more discriminative 3D shape descriptor. We have performed relevant experiments on the ModelNet40 dataset, and experimental results verify the effectiveness of our method.

Cross-Regional Attention Network for Point Cloud Completion

Hang Wu, Yubin Miao

Responsive image

Auto-TLDR; Learning-based Point Cloud Repair with Graph Convolution

Slides Poster Similar

Point clouds obtained from real word scanning are always incomplete and ununiformly distributed, which would cause structural losses in 3D shape representations. Therefore, a learning-based method is introduced in this paper to repair partial point clouds and restore the complete shapes of target objects. First, we design an encoder that takes both local features and global features into consideration. Second, we establish a graph to connect the local features together, and then implement graph convolution with multi-head attention on it. The graph enables each local feature vector to search across the regions and selectively absorb other local features based on the its own features and global features. Third, we design a coarse decoder to collect cross-region features from the graph and generate coarse point clouds with low resolution, and a folding-based decoder to generate fine point clouds with high resolution. Our network is trained on six categories of objects in the ModelNet dataset, and its performance is compared with several existing methods, the results show that our network is able to generate dense complete point cloud with the highest accuracy.

Sensor-Independent Pedestrian Detection for Personal Mobility Vehicles in Walking Space Using Dataset Generated by Simulation

Takahiro Shimizu, Kenji Koide, Shuji Oishi, Masashi Yokozuka, Atsuhiko Banno, Motoki Shino

Responsive image

Auto-TLDR; CosPointPillars: A 3D Object Detection Method for Pedestrian Detection in Walking Spaces

Slides Poster Similar

Autonomous driving of a personal mobility vehicle such as a wheelchair in a walking space is necessary in the future as a means of transportation for the elderly and the physically handicapped. To realize this, accurate pedestrian detection is indispensable. As existing 3D object detection methods are trained with a roadway dataset, they are widely used for object detection in roadways. These methods have two major issues in the detection of objects in walking spaces. The first issue is that they are largely affected by the difference of the LIDAR models. To eliminate this issue, we propose a 3D object detection method, CosPointPillars. CosPointPillars does not take the reflection intensities of LIDAR point cloud, which cause a sensor model dependency, as input. Furthermore, CosPointPillars utilizes a cosine estimation network (CEN) to retain the detection accuracy. The second issue is that networks trained with a roadway dataset cannot sufficiently detect pedestrians (who are major traffic participants in walking spaces) located within a short distance; this is because the roadway dataset hardly includes nearby pedestrians. To solve this issue, we generated a new walking space dataset called SimDataset, which includes nearby pedestrians as a training dataset in the simulations. An experiment on the KITTI showed that the CEN helps in pedestrian detection in sparse point clouds. Furthermore, an experiment on a real walking space showed that SimDataset is suitable for pedestrian detection in such cases.

Cascade Saliency Attention Network for Object Detection in Remote Sensing Images

Dayang Yu, Rong Zhang, Shan Qin

Responsive image

Auto-TLDR; Cascade Saliency Attention Network for Object Detection in Remote Sensing Images

Slides Poster Similar

Object detection in remote sensing images is a challenging task due to objects in the bird-view perspective appearing with arbitrary orientations. Though considerable progress has been made, there still exist challenges with the interference from complex backgrounds, dense arrangement, and large-scale variations. In this paper, we propose an oriented detector named Cascade Saliency Attention Network (CSAN), designed for comprehensively suppressing interference in remote sensing images. Specifically, we first combine context and pixel attention on feature maps to enhance saliency of objects for suppressing interference from backgrounds. Then, in cascade network, we apply instance segmentation on ROI to increase saliency of the central object, thus preventing object features from mutual interference in dense arrangement. Additionally, to alleviate large-scale variations, we devise a multi-scale merge module during FPN merging process to learn richer scale representations. Experimental results on DOTA and HRSC2016 datasets outperform other state-of-the-art object detection methods and verify the effectiveness of our method.

Manual-Label Free 3D Detection Via an Open-Source Simulator

Zhen Yang, Chi Zhang, Zhaoxiang Zhang, Huiming Guo

Responsive image

Auto-TLDR; DA-VoxelNet: A Novel Domain Adaptive VoxelNet for LIDAR-based 3D Object Detection

Slides Poster Similar

LiDAR based 3D object detectors typically need a large amount of detailed-labeled point cloud data for training, but these detailed labels are commonly expensive to acquire. In this paper, we propose a manual-label free 3D detection algorithm that leverages the CARLA simulator to generate a large amount of self-labeled training samples and introduces a novel Domain Adaptive VoxelNet (DA-VoxelNet) that can cross the distribution gap from the synthetic data to the real scenario. The self-labeled training samples are generated by a set of high quality 3D models embedded in a CARLA simulator and a proposed LiDAR-guided sampling algorithm. Then a DA-VoxelNet that integrates both a sample-level DA module and an anchor-level DA module is proposed to enable the detector trained by the synthetic data to adapt to real scenario. Experimental results show that the proposed unsupervised DA 3D detector on KITTI evaluation set can achieve 76.66% and 56.64% mAP on BEV mode and 3D mode respectively. The results reveal a promising perspective of training a LIDAR-based 3D detector without any hand-tagged label.

Learning a Dynamic High-Resolution Network for Multi-Scale Pedestrian Detection

Mengyuan Ding, Shanshan Zhang, Jian Yang

Responsive image

Auto-TLDR; Learningable Dynamic HRNet for Pedestrian Detection

Slides Poster Similar

Pedestrian detection is a canonical instance of object detection in computer vision. In practice, scale variation is one of the key challenges, resulting in unbalanced performance across different scales. Recently, the High-Resolution Network (HRNet) has become popular because high-resolution feature representations are more friendly to small objects. However, when we apply HRNet for pedestrian detection, we observe that it improves for small pedestrians on one hand, but hurts the performance for larger ones on the other hand. To overcome this problem, we propose a learnable Dynamic HRNet (DHRNet) aiming to generate different network paths adaptive to different scales. Specifically, we construct a parallel multi-branch architecture and add a soft conditional gate module allowing for dynamic feature fusion. Both branches share all the same parameters except the soft gate module. Experimental results on CityPersons and Caltech benchmarks indicate that our proposed dynamic HRNet is more capable of dealing with pedestrians of various scales, and thus improves the performance across different scales consistently.

Towards Efficient 3D Point Cloud Scene Completion Via Novel Depth View Synthesis

Haiyan Wang, Liang Yang, Xuejian Rong, Ying-Li Tian

Responsive image

Auto-TLDR; 3D Point Cloud Completion with Depth View Synthesis and Depth View synthesis

Poster Similar

3D point cloud completion has been a long-standing challenge at scale, and corresponding per-point supervised training strategies suffered from the cumbersome annotations. 2D supervision has recently emerged as a promising alternative for 3D tasks, but specific approaches for 3D point cloud completion still remain to be explored. To overcome these limitations, we propose an end-to-end method that directly lifts a single depth map to a completed point cloud. With one depth map as input, a multi-way novel depth view synthesis network (NDVNet) is designed to infer coarsely completed depth maps under various viewpoints. Meanwhile, a geometric depth perspective rendering module is introduced to utilize the raw input depth map to generate a re-projected depth map for each view. Therefore, the two parallelly generated depth maps for each view are further concatenated and refined by a depth completion network (DCNet). The final completed point cloud is fused from all refined depth views. Experimental results demonstrate the effectiveness of our proposed approach composed of aforementioned components, to produce high-quality state-of-the-art results on the popular SUNCG benchmark.

DualBox: Generating BBox Pair with Strong Correspondence Via Occlusion Pattern Clustering and Proposal Refinement

Zheng Ge, Chuyu Hu, Xin Huang, Baiqiao Qiu, Osamu Yoshie

Responsive image

Auto-TLDR; R2NMS: Combining Full and Visible Body Bounding Box for Dense Pedestrian Detection

Slides Poster Similar

Despite the rapid development of pedestrian detection, the problem of dense pedestrian detection is still unsolved, especially the upper limit of Recall caused by Non-Maximum-Suppression (NMS). Out of this reason, R2NMS is proposed to simultaneously detect full and visible body bounding boxes, by replacing the full body BBoxes with less occluded visible body BBoxes in the NMS algorithm, achieving a higher recall. However, the P-RPN and P-RCNN modules proposed in R2NMS for simultaneous high quality full and visible body prediction require non-trivial positive/negative assigning strategies for anchor BBoxes. To simplify the prerequisites and improve the utility of R2NMS, we incorporate clustering analysis into the learning of visible body proposals from full body proposals. Furthermore, to reduce the computation complexity caused by the large number of potential visible body proposals, we introduce a novel occlusion pattern prediction branch on top of the R-CNN module (i.e. F-RCNN) to select the best matched visible proposals for each full body proposals and then feed them into another R-CNN module (i.e. V-RCNN). Incorporated with R2NMS, our DualBox model can achieve competitive performance while only requires few hyper-parameters. We validate the effectiveness of the proposed approach on the CrowdHuman and CityPersons datasets. Experimental results show that our approach achieves promising performance for detecting both non-occluded and occluded pedestrians, especially heavily occluded ones.

Hybrid Cascade Point Search Network for High Precision Bar Chart Component Detection

Junyu Luo, Jinpeng Wang, Chin-Yew Lin

Responsive image

Auto-TLDR; Object Detection of Chart Components in Chart Images Using Point-based and Region-Based Object Detection Framework

Slides Poster Similar

Charts are commonly used for data visualization. One common form of chart distribution is in its image form. To enable machine comprehension of chart images, precise detection of chart components in chart images is a critical step. Existing image object detection methods do not perform well in chart component detection which requires high boundary detection precision. And traditional rule-based approaches lack enough generalization ability. In order to address this problem, we design a novel two-stage object detection framework that combines point-based and region-based ideas, by simulating the process that human creating bounding boxes for objects. The experiment on our labeled ChartDet dataset shows our method greatly improves the performance of chart object detection. We further extend our method to a general object detection task and get comparable performance.

Human Segmentation with Dynamic LiDAR Data

Tao Zhong, Wonjik Kim, Masayuki Tanaka, Masatoshi Okutomi

Responsive image

Auto-TLDR; Spatiotemporal Neural Network for Human Segmentation with Dynamic Point Clouds

Slides Similar

Consecutive LiDAR scans and depth images compose dynamic 3D sequences, which contain more abundant spatiotemporal information than a single frame. Similar to the development history of image and video perception, dynamic 3D sequence perception starts to come into sight after inspiring research on static 3D data perception. This work proposes a spatiotemporal neural network for human segmentation with the dynamic LiDAR point clouds. It takes a sequence of depth images as input. It has a two-branch structure, i.e., the spatial segmentation branch and the temporal velocity estimation branch. The velocity estimation branch is designed to capture motion cues from the input sequence and then propagates them to the other branch. So that the segmentation branch segments humans according to both spatial and temporal features. These two branches are jointly learned on a generated dynamic point cloud data set for human recognition. Our works fill in the blank of dynamic point cloud perception with the spherical representation of point cloud and achieves high accuracy. The experiments indicate that the introduction of temporal feature benefits the segmentation of dynamic point cloud perception.

ACRM: Attention Cascade R-CNN with Mix-NMS for Metallic Surface Defect Detection

Junting Fang, Xiaoyang Tan, Yuhui Wang

Responsive image

Auto-TLDR; Attention Cascade R-CNN with Mix Non-Maximum Suppression for Robust Metal Defect Detection

Slides Poster Similar

Metallic surface defect detection is of great significance in quality control for production. However, this task is very challenging due to the noise disturbance, large appearance variation, and the ambiguous definition of the defect individual. Traditional image processing methods are unable to detect the damaged region effectively and efficiently. In this paper, we propose a new defect detection method, Attention Cascade R-CNN with Mix-NMS (ACRM), to classify and locate defects robustly. Three submodules are developed to achieve this goal: 1) a lightweight attention block is introduced, which can improve the ability in capture global and local feature both in the spatial and channel dimension; 2) we firstly apply the cascade R-CNN to our task, which exploits multiple detectors to sequentially refine the detection result robustly; 3) we introduce a new method named Mix Non-Maximum Suppression (Mix-NMS), which can significantly improve its ability in filtering the redundant detection result in our task. Extensive experiments on a real industrial dataset show that ACRM achieves state-of-the-art results compared to the existing methods, demonstrating the effectiveness and robustness of our detection method.

FatNet: A Feature-Attentive Network for 3D Point Cloud Processing

Chaitanya Kaul, Nick Pears, Suresh Manandhar

Responsive image

Auto-TLDR; Feature-Attentive Neural Networks for Point Cloud Classification and Segmentation

Slides Similar

The application of deep learning to 3D point clouds is challenging due to its lack of order. Inspired by the point embeddings of PointNet and the edge embeddings of DGCNNs, we propose three improvements to the task of point cloud analysis. First, we introduce a novel feature-attentive neural network layer, a FAT layer, that combines both global point-based features and local edge-based features in order to generate better embeddings. Second, we find that applying the same attention mechanism across two different forms of feature map aggregation, max pooling and average pooling, gives better performance than either alone. Third, we observe that residual feature reuse in this setting propagates information more effectively between the layers, and makes the network easier to train. Our architecture achieves state-of-the-art results on the task of point cloud classification, as demonstrated on the ModelNet40 dataset, and an extremely competitive performance on the ShapeNet part segmentation challenge.

SFPN: Semantic Feature Pyramid Network for Object Detection

Yi Gan, Wei Xu, Jianbo Su

Responsive image

Auto-TLDR; SFPN: Semantic Feature Pyramid Network to Address Information Dilution Issue in FPN

Slides Poster Similar

Feature Pyramid Network(FPN) employs a top-down path to enhance low level feature by utilizing high level feature.However, further improvement of detector is greatly hindered by the inner defect of FPN. The dilution issue in FPN is analyzed in this paper, and a new architecture named Semantic Feature Pyramid Network(SFPN) is introduced to address the information imbalance problem caused by information dilution. The proposed method consists of two simple and effective components: Semantic Pyramid Module(SPM) and Semantic Feature Fusion Module(SFFM). To compensate for the weaknesses of FPN, the semantic segmentation result is utilized as an extra information source in our architecture.By constructing a semantic pyramid based on the segmentation result and fusing it with FPN, feature maps at each level can obtain the necessary information without suffering from the dilution issue. The proposed architecture could be applied on many detectors, and non-negligible improvement could be achieved. Although this method is designed for object detection, other tasks such as instance segmentation can also largely benefit from it. The proposed method brings Faster R-CNN and Mask R-CNN with ResNet-50 as backbone both 1.8 AP improvements respectively. Furthermore, SFPN improves Cascade R-CNN with backbone ResNet-101 from 42.4 AP to 43.5 AP.

SyNet: An Ensemble Network for Object Detection in UAV Images

Berat Mert Albaba, Sedat Ozer

Responsive image

Auto-TLDR; SyNet: Combining Multi-Stage and Single-Stage Object Detection for Aerial Images

Poster Similar

Recent advances in camera equipped drone applications and their widespread use increased the demand on vision based object detection algorithms for aerial images. Object detection process is inherently a challenging task as a generic computer vision problem, however, since the use of object detection algorithms on UAVs (or on drones) is relatively a new area, it remains as a more challenging problem to detect objects in aerial images. There are several reasons for that including: (i) the lack of large drone datasets including large object variance, (ii) the large orientation and scale variance in drone images when compared to the ground images, and (iii) the difference in texture and shape features between the ground and the aerial images. Deep learning based object detection algorithms can be classified under two main categories: (a) single-stage detectors and (b) multi-stage detectors. Both single-stage and multi-stage solutions have their advantages and disadvantages over each other. However, a technique to combine the good sides of each of those solutions could yield even a stronger solution than each of those solutions individually. In this paper, we propose an ensemble network, SyNet, that combines a multi-stage method with a single-stage one with the motivation of decreasing the high false negative rate of multi-stage detectors and increasing the quality of the single-stage detector proposals. As building blocks, CenterNet and Cascade R-CNN with pretrained feature extractors are utilized along with an ensembling strategy. We report the state of the art results obtained by our proposed solution on two different datasets: namely MS-COCO and visDrone with \%52.1 $mAP_{IoU = 0.75}$ is obtained on MS-COCO $val2017$ dataset and \%26.2 $mAP_{IoU = 0.75}$ is obtained on VisDrone $test-set$. Our code is available at: https://github.com/mertalbaba/SyNet}{https://github.com/mer talbaba/SyNet

Object Detection Model Based on Scene-Level Region Proposal Self-Attention

Yu Quan, Zhixin Li, Canlong Zhang, Huifang Ma

Responsive image

Auto-TLDR; Exploiting Semantic Informations for Object Detection

Slides Poster Similar

The improvement of object detection performance is mostly focused on the extraction of local information near the region of interest in the image, which results in detection performance in this area being unable to achieve the desired effect. First, a depth-wise separable convolution network(D_SCNet-127 R-CNN) is built on the backbone network. Considering the importance of scene and semantic informations for visual recognition, the feature map is sent into the branch of the semantic segmentation module, region proposal network module, and the region proposal self-attention module to build the network of scene-level and region proposal self-attention module. Second, a deep reinforcement learning was utilized to achieve accurate positioning of border regression, and the calculation speed of the whole model was improved through implementing a light-weight head network. This model can effectively solve the limitation of feature extraction in traditional object detection and obtain more comprehensive detailed features. The experimental verification on MSCOCO17, VOC12, and Cityscapes datasets shows that the proposed method has good validity and scalability.

Forground-Guided Vehicle Perception Framework

Kun Tian, Tong Zhou, Shiming Xiang, Chunhong Pan

Responsive image

Auto-TLDR; A foreground segmentation branch for vehicle detection

Slides Poster Similar

As the basis of advanced visual tasks such as vehicle tracking and traffic flow analysis, vehicle detection needs to accurately predict the position and category of vehicle objects. In the past decade, deep learning based methods have made great progress. However, we also notice that some existing cases are not studied thoroughly. First, false positive on the background regions is one of the critical problems. Second, most of the previous approaches only optimize a single vehicle detection model, ignoring the relationship between different visual perception tasks. In response to the above two findings, we introduce a foreground segmentation branch for the first time, which can predict the pixel level of vehicles in advance. Furthermore, two attention modules are designed to guide the work of the detection branch. The proposed method can be easily grafted into the one-stage and two-stage detection framework. We evaluate the effectiveness of our model on LSVH, a dataset with large variations in vehicle scales, and achieve the state-of-the-art detection accuracy.

One-Stage Multi-Task Detector for 3D Cardiac MR Imaging

Weizeng Lu, Xi Jia, Wei Chen, Nicolò Savioli, Antonio De Marvao, Linlin Shen, Declan O'Regan, Jinming Duan

Responsive image

Auto-TLDR; Multi-task Learning for Real-Time, simultaneous landmark location and bounding box detection in 3D space

Slides Poster Similar

Fast and accurate landmark location and bounding box detection are important steps in 3D medical imaging. In this paper, we propose a novel multi-task learning framework, for real-time, simultaneous landmark location and bounding box detection in 3D space. Our method extends the famous single-shot multibox detector (SSD) from single-task learning to multi-task learning and from 2D to 3D. Furthermore, we propose a post-processing approach to refine the network landmark output, by averaging the candidate landmarks. Owing to these settings, the proposed framework is fast and accurate. For 3D cardiac magnetic resonance (MR) images with size 224 × 224 × 64, our framework runs about 128 volumes per second (VPS) on GPU and achieves 6.75mm average point-to-point distance error for landmark location, which outperforms both state-of-the-art and baseline methods. We also show that segmenting the 3D image cropped with the bounding box results in both improved performance and efficiency.

Small Object Detection by Generative and Discriminative Learning

Yi Gu, Jie Li, Chentao Wu, Weijia Jia, Jianping Chen

Responsive image

Auto-TLDR; Generative and Discriminative Learning for Small Object Detection

Slides Poster Similar

With the development of deep convolutional neural networks (CNNs), the object detection accuracy has been greatly improved. But the performance of small object detection is still far from satisfactory, mainly because small objects are so tiny that the information contained in the feature map is limited. Existing methods focus on improving classification accuracy but still suffer from the limitation of bounding box prediction. To solve this issue, we propose a detection framework by generative and discriminative learning. First, a reconstruction generator network is designed to reconstruct the mapping from low frequency to high frequency for anchor box prediction. Then, a detector module extracts the regions of interest (ROIs) from generated results and implements a RoI-Head to predict object category and refine bounding box. In order to guide the reconstructed image related to the corresponding one, a discriminator module is adopted to tell from the generated result and the original image. Extensive evaluations on the challenging MS-COCO dataset demonstrate that our model outperforms most state-of-the-art models in detecting small objects, especially the reconstruction module improves the average precision for small object (APs) by 7.7%.

Construction Worker Hardhat-Wearing Detection Based on an Improved BiFPN

Chenyang Zhang, Zhiqiang Tian, Jingyi Song, Yaoyue Zheng, Bo Xu

Responsive image

Auto-TLDR; A One-Stage Object Detection Method for Hardhat-Wearing in Construction Site

Slides Poster Similar

Work in the construction site is considered to be one of the occupations with the highest safety risk factor. Therefore, safety plays an important role in construction site. One of the most fundamental safety rules in construction site is to wear a hardhat. To strengthen the safety of the construction site, most of the current methods use multi-stage method for hardhat-wearing detection. These methods have limitations in terms of adaptability and generalizability. In this paper, we propose a one-stage object detection method based on convolutional neural network. We present a multi-scale strategy that selects the high-resolution feature maps of DarkNet-53 to effectively identify small-scale hardhats. In addition, we propose an improved weighted bi-directional feature pyramid network (BiFPN), which could fuse more semantic features from more scales. The proposed method can not only detect hardhat-wearing, but also identify the color of the hardhat. Experimental results show that the proposed method achieves a mAP of 87.04%, which outperforms several state-of-the-art methods on a public dataset.

Progressive Scene Segmentation Based on Self-Attention Mechanism

Yunyi Pan, Yuan Gan, Kun Liu, Yan Zhang

Responsive image

Auto-TLDR; Two-Stage Semantic Scene Segmentation with Self-Attention

Slides Poster Similar

Semantic scene segmentation is vital for a large variety of applications as it enables understanding of 3D data. Nowadays, various approaches based upon point clouds ignore the mathematical distribution of points and treat the points equally. The methods following this direction neglect the imbalance problem of samples that naturally exists in scenes. To avoid these issues, we propose a two-stage semantic scene segmentation framework based on self-attention mechanism and achieved state-of-the-art performance on 3D scene understanding tasks. We split the whole task into two small ones which efficiently relief the sample imbalance issue. In addition, we have designed a new self-attention block which could be inserted into submanifold convolution networks to model the long-range dependencies that exists among points. The proposed network consists of an encoder and a decoder, with the spatial-wise and channel-wise attention modules inserted. The two-stage network shares a U-Net architecture and is an end-to-end trainable framework which could predict the semantic label for the scene point clouds fed into it. Experiments on standard benchmarks of 3D scenes implies that our network could perform at par or better than the existing state-of-the-art methods.

An Accurate Threshold Insensitive Kernel Detector for Arbitrary Shaped Text

Xijun Qian, Yifan Liu, Yu-Bin Yang

Responsive image

Auto-TLDR; TIKD: threshold insensitive kernel detector for arbitrary shaped text

Slides Similar

Recently, segmentation-based methods are popular in scene text detection due to the segmentation results can easily represent scene text of arbitrary shapes. However, previous works segment text instances the same as normal objects. It is obvious that the edge of the text instance differs from normal objects. In this paper, we propose a threshold insensitive kernel detector for arbitrary shaped text called TIKD, which includes a simple but stable base model and a new loss weight called Decay Loss Weight (DLW). By suppressing outlier pixels in a gradual way, the DLW can lead the network to detect more accurate text instances. Our method shows great power in accuracy and stability. It is worth mentioning that we achieve the precision, recall, f-measure of 88.7%, 83.7%, 86.1% respectively on the Total-Text dataset, with a fast speed of 16.3 frames per second. What’s more, even if we set the threshold in an extreme situation range from 0.1 to 0.9, our method can always achieve a stable f-measure over 79.9% on the Total-Text dataset.

Ghost Target Detection in 3D Radar Data Using Point Cloud Based Deep Neural Network

Mahdi Chamseddine, Jason Rambach, Oliver Wasenmüler, Didier Stricker

Responsive image

Auto-TLDR; Point Based Deep Learning for Ghost Target Detection in 3D Radar Point Clouds

Slides Poster Similar

Ghost targets are targets that appear at wrong locations in radar data and are caused by the presence of multiple indirect reflections between the target and the sensor. In this work, we introduce the first point based deep learning approach for ghost target detection in 3D radar point clouds. This is done by extending the PointNet network architecture by modifying its input to include radar point features beyond location and introducing skip connetions. We compare different input modalities and analyze the effects of the changes we introduced. We also propose an approach for automatic labeling of ghost targets 3D radar data using lidar as reference. The algorithm is trained and tested on real data in various driving scenarios and the tests show promising results in classifying real and ghost radar targets.

Feature Pyramid Hierarchies for Multi-Scale Temporal Action Detection

Jiayu He, Guohui Li, Jun Lei

Responsive image

Auto-TLDR; Temporal Action Detection using Pyramid Hierarchies and Multi-scale Feature Maps

Slides Poster Similar

Temporal action detection is a challenging but promising task in video content analysis. It is in great demand in the field of public safety. The main difficulty of the task is precisely localizing activities in the video especially those short duration activities. And most of the existing methods can not achieve a satisfactory detection result. Our method addresses a key point to improve detection accuracy, which is to use multi-scale feature maps for regression and classification. In this paper, we introduce a novel network based on classification following proposal framework. In our network, a 3D feature pyramid hierarchies is built to enhance the ability of detecting short duration activities. The input RGB/Flow frames are first encoded by a 3D feature pyramid hierarchies, and this subnet produces multi-level feature maps. Then temporal proposal subnet uses these features to pick out proposals which might contain activity segments. Finally a pyramid region of interest (RoI) pooling pipeline and two fully connected layers reuse muti-level feature maps to refine the temporal boundaries of proposals and classify them. We use late feature fusion scheme to combine RGB and Flow information. The network is trained end-to-end and we evaluate it in THUMOS'14 dataset. Our network achieves a good result among typical methods. A further ablation test demonstrate that pyramid hierarchies is effective to improve detecting short duration activity segments.

Feature Embedding Based Text Instance Grouping for Largely Spaced and Occluded Text Detection

Pan Gao, Qi Wan, Renwu Gao, Linlin Shen

Responsive image

Auto-TLDR; Text Instance Embedding Based Feature Embeddings for Multiple Text Instance Grouping

Slides Poster Similar

A text instance can be easily detected as multiple ones due to the large space between texts/characters, curved shape and partial occlusion. In this paper, a feature embedding based text instance grouping algorithm is proposed to solve this problem. To learn the feature space, a TIEM (Text Instance Embedding Module) is trained to minimize the within instance scatter and maximize the between instance scatter. Similarity between different text instances are measured in the feature space and merged if they meet certain conditions. Experimental results show that our approach can effectively connect text regions that belong to the same text instance. Competitive performance of our approach has been achieved on CTW1500, Total-Text, IC15 and a subset consists of texts selected from the three datasets, with large spacing and occlusions.

PRF-Ped: Multi-Scale Pedestrian Detector with Prior-Based Receptive Field

Yuzhi Tan, Hongxun Yao, Haoran Li, Xiusheng Lu, Haozhe Xie

Responsive image

Auto-TLDR; Bidirectional Feature Enhancement Module for Multi-Scale Pedestrian Detection

Slides Poster Similar

Multi-scale feature representation is a common strategy to handle the scale variation in pedestrian detection. Existing methods simply utilize the convolutional pyramidal features for multi-scale representation. However, they rarely pay attention to the differences among different feature scales and extract multi-scale features from a single feature map, which may make the detectors sensitive to scale-variance in multi-scale pedestrian detection. In this paper, we introduce a bidirectional feature enhancement module (BFEM) to augment the semantic information of low-level features and the localization information of high-level features. In addition, we propose a prior-based receptive field block (PRFB) for multi-scale pedestrian feature extraction, where the receptive field is closer to the aspect ratio of the pedestrian target. Consequently, it is less affected by the surrounding background when extracting features. Experimental results indicate that the proposed method outperform the state-of-the-art methods on the CityPersons and Caltech datasets.

Detective: An Attentive Recurrent Model for Sparse Object Detection

Amine Kechaou, Manuel Martinez, Monica Haurilet, Rainer Stiefelhagen

Responsive image

Auto-TLDR; Detective: An attentive object detector that identifies objects in images in a sequential manner

Slides Poster Similar

In this work, we present Detective – an attentive object detector that identifies objects in images in a sequential manner. Our network is based on an encoder-decoder architecture, where the encoder is a convolutional neural network, and the decoder is a convolutional recurrent neural network coupled with an attention mechanism. At each iteration, our decoder focuses on the relevant parts of the image using an attention mechanism, and then estimates the object’s class and the bounding box coordinates. Current object detection models generate dense predictions and rely on post-processing to remove duplicate predictions. Detective is a sparse object detector that generates a single bounding box per object instance. However, training a sparse object detector is challenging, as it requires the model to reason at the instance level and not just at the class and spatial levels. We propose a training mechanism based on the Hungarian Algorithm and a loss that balances the localization and classification tasks. This allows Detective to achieve promising results on the PASCAL VOC object detection dataset. Our experiments demonstrate that sparse object detection is possible and has a great potential for future developments in applications where the order of the objects to be predicted is of interest.

Point In: Counting Trees with Weakly Supervised Segmentation Network

Pinmo Tong, Shuhui Bu, Pengcheng Han

Responsive image

Auto-TLDR; Weakly Tree counting using Deep Segmentation Network with Localization and Mask Prediction

Slides Poster Similar

For tree counting tasks, since traditional image processing methods require expensive feature engineering and are not end-to-end frameworks, this will cause additional noise and cannot be optimized overall, so this method has not been widely used in recent trends of tree counting application. Recently, many deep learning based approaches are designed for this task because of the powerful feature extracting ability. The representative way is bounding box based supervised method, but time-consuming annotations are indispensable for them. Moreover, these methods are difficult to overcome the occlusion or overlap. To solve this problem, we propose a weakly tree counting network (WTCNet) based on deep segmentation network with only point supervision. It can simultaneously complete tree counting with localization and output mask of each tree at the same time. We first adopt a novel feature extractor network (FENet) to get features of input images, and then an effective strategy is introduced to deal with different mask predictions. In the end, we propose a basic localization guidance accompany with rectification guidance to train the network. We create two different datasets and select an existing challenging plant dataset to evaluate our method on three different tasks. Experimental results show the good performance improvement of our method compared with other existing methods. Further study shows that our method has great potential to reduce human labor and provide effective ground-truth masks and the results show the superiority of our method over the advanced methods.

PS^2-Net: A Locally and Globally Aware Network for Point-Based Semantic Segmentation

Na Zhao, Tat Seng Chua, Gim Hee Lee

Responsive image

Auto-TLDR; PS2-Net: A Local and Globally Aware Deep Learning Framework for Semantic Segmentation on 3D Point Clouds

Slides Poster Similar

In this paper, we present the PS^2-Net - a locally and globally aware deep learning framework for semantic segmentation on 3D scene-level point clouds. In order to deeply incorporate local structures and global context to support 3D scene segmentation, our network is built on four repeatedly stacked encoders, where each encoder has two basic components: EdgeConv that captures local structures and NetVLAD that models global context. Different from existing start-of-the-art methods for point-based scene semantic segmentation that either violate or do not achieve permutation invariance, our PS2-Net is designed to be permutation invariant which is an essential property of any deep network used to process unordered point clouds. We further provide theoretical proof to guarantee the permutation invariance property of our network. We perform extensive experiments on two large-scale 3D indoor scene datasets and demonstrate that our PS2-Net is able to achieve state-of-the-art performances as compared to existing approaches.

A Fast and Accurate Object Detector for Handwritten Digit String Recognition

Jun Guo, Wenjing Wei, Yifeng Ma, Cong Peng

Responsive image

Auto-TLDR; ChipNet: An anchor-free object detector for handwritten digit string recognition

Slides Poster Similar

Focusing on handwritten digit string recognition (HDSR), we propose an anchor-free object detector called ChipNet, where a novel encoding method is designed. The input image is divided into columns, and then these columns are encoded by the ground truth. The adjacent columns are responsible for detecting the same target so that it can well address the class-imbalanced problem meanwhile reducing the network computation. ChipNet is composed of convolutional and bidirectional long short term memory networks. Different from the typical detectors, it doesn't use region proposals, anchors or regions of interest pooling. Hence, it can overcome the shortages of anchor-based and dense detectors in HDSR. The experiments are implemented on the synthetic digit strings, the CVL HDS database, and the ORAND-CAR-A & B databases. The high accuracies, which surpass the reported results by a large margin (up to 6.62%), are achieved. Furthermore, it gets 219 FPS speed on 160*32 px resolution images when using a Tesla P100 GPU. The results also show that ChipNet can handle touching, connecting and arbitrary length digit strings, and the obtained accuracies in HDSR are as high as the ones in single handwritten digit recognition.

Object Detection on Monocular Images with Two-Dimensional Canonical Correlation Analysis

Zifan Yu, Suya You

Responsive image

Auto-TLDR; Multi-Task Object Detection from Monocular Images Using Multimodal RGB and Depth Data

Slides Poster Similar

Accurate and robust detection objects from monocular images is a fundamental vision task. This paper describes a novel approach of holistic scene understanding that can simultaneously achieve multiple tasks of scene reconstruction and object detection from a single monocular camera. Rather than pursuing an independent solution for each individual task as most existing work does, we seek a globally optimal solution that holistically resolves the multiple perception and reasoning tasks in an effective manner. The approach explores the complementary properties of multimodal RGB imagery and depth data to improve scene perception tasks. It uniquely combines the techniques of canonical correlation analysis and deep learning to learn the most correlated features to maximize the modal cross-correlation for improving the performance and robustness of object detection in complex environments. Extensive experiments have been conducted to evaluate and demonstrate the performances of the proposed approach.

Convolutional STN for Weakly Supervised Object Localization

Akhil Meethal, Marco Pedersoli, Soufiane Belharbi, Eric Granger

Responsive image

Auto-TLDR; Spatial Localization for Weakly Supervised Object Localization

Slides Similar

Weakly-supervised object localization is a challenging task in which the object of interest should be localized while learning its appearance. State-of-the-art methods recycle the architecture of a standard CNN by using the activation maps of the last layer for localizing the object. While this approach is simple and works relatively well, object localization relies on different features than classification, thus, a specialized localization mechanism is required during training to improve performance. In this paper, we propose a convolutional, multi-scale spatial localization network that provides accurate localization for the object of interest. Experimental results on CUB-200-2011 and ImageNet datasets show competitive performance of our proposed approach on Weakly supervised localization.