CASNet: Common Attribute Support Network for Image Instance and Panoptic Segmentation

Xiaolong Liu, Yuqing Hou, Anbang Yao, Yurong Chen, Keqiang Li

Auto-TLDR; Common Attribute Support Network for instance segmentation and panoptic segmentation

Instance segmentation and panoptic segmentation have received increasing attention in recent years. Compared with bounding-box-based object detection and semantic segmentation, instance segmentation provides richer pixel-level results. Based on the insight that pixels belonging to an instance share one or more common attributes of that instance, we propose a one-stage instance segmentation network named Common Attribute Support Network (CASNet), which realizes instance segmentation by predicting and clustering common attributes. CASNet is fully convolutional and can be trained and run end to end. Moreover, it predicts instances without overlaps or holes, a problem that affects most current instance segmentation algorithms. Furthermore, it can be easily extended to panoptic segmentation through minor modifications with little computational overhead. CASNet builds a bridge between semantic and instance segmentation, going from finding pixel class IDs to obtaining class and instance IDs through operations on common attributes. In experiments on instance and panoptic segmentation, CASNet achieves 32.8% mAP and 59.0% PQ on the Cityscapes validation set with joint training, and 36.3% mAP and 66.1% PQ with separate training. For panoptic segmentation, CASNet achieves state-of-the-art performance on the Cityscapes validation set.
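
To make the predict-and-cluster idea concrete, here is a minimal toy sketch (not the authors' implementation) in which the shared attribute is taken to be each pixel's predicted instance-center coordinates; the function name, the merge threshold, and the greedy assignment are all illustrative assumptions. Because every foreground pixel receives exactly one label, the resulting map has no overlaps or holes by construction.

```python
import numpy as np

def cluster_by_center_votes(center_votes, fg_mask, merge_dist=2.0):
    """Toy grouping of foreground pixels into instances.

    center_votes: (H, W, 2) array; each pixel's predicted (y, x) of the
                  center of the instance it belongs to (the shared attribute).
    fg_mask:      (H, W) boolean foreground mask from semantic segmentation.
    Returns an (H, W) int label map (0 = background).
    """
    labels = np.zeros(fg_mask.shape, dtype=np.int32)
    centers = []  # running list of discovered instance centers
    for y, x in zip(*np.nonzero(fg_mask)):
        vote = center_votes[y, x]
        if centers:
            d = np.linalg.norm(np.asarray(centers) - vote, axis=1)
            k = int(np.argmin(d))
            if d[k] < merge_dist:       # close enough: join that instance
                labels[y, x] = k + 1
                continue
        centers.append(vote)            # otherwise open a new instance
        labels[y, x] = len(centers)
    return labels

# two "instances" whose pixels all vote for their own center
votes = np.zeros((4, 8, 2))
votes[:, :4] = (2.0, 2.0)
votes[:, 4:] = (2.0, 6.0)
mask = np.ones((4, 8), dtype=bool)
print(np.unique(cluster_by_center_votes(votes, mask)))  # [1 2]
```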

Similar papers

Learning a Dynamic High-Resolution Network for Multi-Scale Pedestrian Detection

Mengyuan Ding, Shanshan Zhang, Jian Yang

Auto-TLDR; Learnable Dynamic HRNet for Pedestrian Detection

Pedestrian detection is a canonical instance of object detection in computer vision. In practice, scale variation is one of the key challenges, resulting in unbalanced performance across scales. Recently, the High-Resolution Network (HRNet) has become popular because high-resolution feature representations are friendlier to small objects. However, when we apply HRNet to pedestrian detection, we observe that it improves results for small pedestrians on one hand, but hurts performance for larger ones on the other. To overcome this problem, we propose a learnable Dynamic HRNet (DHRNet) that generates different network paths adapted to different scales. Specifically, we construct a parallel multi-branch architecture and add a soft conditional gate module that allows dynamic feature fusion; the branches share all parameters except the soft gate module. Experimental results on the CityPersons and Caltech benchmarks indicate that our proposed dynamic HRNet copes better with pedestrians of various scales, and thus improves performance consistently across scales.
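
As a rough illustration of what a soft conditional gate for dynamic feature fusion might look like (the module name and the 1x1-convolution gate below are assumptions for the sketch, not the paper's exact design):

```python
import torch
import torch.nn as nn

class SoftGateFusion(nn.Module):
    """Hypothetical soft conditional gate: predicts a per-location weight
    from the concatenated branch features and blends the two branches."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),  # soft gate in [0, 1]
        )

    def forward(self, feat_a, feat_b):
        g = self.gate(torch.cat([feat_a, feat_b], dim=1))
        return g * feat_a + (1.0 - g) * feat_b

x1 = torch.randn(2, 64, 32, 32)
x2 = torch.randn(2, 64, 32, 32)
print(SoftGateFusion(64)(x1, x2).shape)  # torch.Size([2, 64, 32, 32])
```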

SFPN: Semantic Feature Pyramid Network for Object Detection

Yi Gan, Wei Xu, Jianbo Su

Auto-TLDR; SFPN: Semantic Feature Pyramid Network to Address Information Dilution Issue in FPN

Feature Pyramid Network (FPN) employs a top-down path to enhance low-level features by utilizing high-level features. However, further improvement of detectors is greatly hindered by an inherent defect of FPN. This paper analyzes the dilution issue in FPN and introduces a new architecture named Semantic Feature Pyramid Network (SFPN) to address the information imbalance caused by information dilution. The proposed method consists of two simple and effective components: the Semantic Pyramid Module (SPM) and the Semantic Feature Fusion Module (SFFM). To compensate for the weaknesses of FPN, the semantic segmentation result is utilized as an extra information source in our architecture. By constructing a semantic pyramid based on the segmentation result and fusing it with FPN, feature maps at each level obtain the necessary information without suffering from dilution. The proposed architecture can be applied to many detectors with non-negligible improvement. Although this method is designed for object detection, other tasks such as instance segmentation can also benefit from it. The proposed method improves both Faster R-CNN and Mask R-CNN with a ResNet-50 backbone by 1.8 AP. Furthermore, SFPN improves Cascade R-CNN with a ResNet-101 backbone from 42.4 AP to 43.5 AP.

A Novel Region of Interest Extraction Layer for Instance Segmentation

Leonardo Rossi, Akbar Karimi, Andrea Prati

Auto-TLDR; Generic RoI Extractor for Two-Stage Neural Network for Instance Segmentation

Given the wide diffusion of deep neural network architectures for computer vision tasks, many new applications have become feasible. Among them, particular attention has recently been given to instance segmentation, exploiting the results achievable by two-stage networks (such as Mask R-CNN or Faster R-CNN) derived from R-CNN. In these complex architectures, a crucial role is played by the Region of Interest (RoI) extraction layer, which extracts a coherent subset of features from a single Feature Pyramid Network (FPN) layer attached on top of a backbone. This paper is motivated by the need to overcome the limitations of existing RoI extractors, which select only one (the best) layer from the FPN. Our intuition is that all the layers of the FPN retain useful information. Therefore, the proposed layer (called Generic RoI Extractor, GRoIE) introduces non-local building blocks and attention mechanisms to boost performance. A comprehensive component-level ablation study is conducted to find the best set of algorithms and parameters for the GRoIE layer. Moreover, GRoIE can be integrated seamlessly with every two-stage architecture for both object detection and instance segmentation, so the improvements brought by GRoIE in different state-of-the-art architectures are also evaluated. The proposed layer yields gains of up to 1.1% AP on bounding box detection and 1.7% AP on instance segmentation. The code is publicly available at https://github.com/IMPLabUniPr/mmdetection-groie
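
The core contrast with single-level RoI extractors can be sketched as pooling the same RoIs from every FPN level and aggregating them. The sketch below uses torchvision's roi_align and a plain sum in place of GRoIE's attention-based fusion, so it illustrates the idea rather than the paper's architecture; the function name and strides are assumptions.

```python
import torch
from torchvision.ops import roi_align

def multi_level_roi_features(fpn_feats, strides, rois, out_size=7):
    """Pool the same RoIs from *every* FPN level and sum the results,
    instead of selecting a single 'best' level per RoI.

    fpn_feats: list of (N, C, Hi, Wi) tensors
    strides:   stride of each level w.r.t. the input image
    rois:      (K, 5) tensor of (batch_idx, x1, y1, x2, y2) in image coords
    """
    pooled = [
        roi_align(f, rois, output_size=out_size,
                  spatial_scale=1.0 / s, sampling_ratio=2)
        for f, s in zip(fpn_feats, strides)
    ]
    # GRoIE further processes each level (e.g. with attention) before
    # fusing; here we just sum as the simplest aggregation.
    return torch.stack(pooled).sum(dim=0)  # (K, C, out_size, out_size)

feats = [torch.randn(1, 256, 64 // s, 64 // s) for s in (4, 8, 16)]
rois = torch.tensor([[0, 8.0, 8.0, 40.0, 40.0]])
print(multi_level_roi_features(feats, (4, 8, 16), rois).shape)
```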

End-To-End Deep Learning Methods for Automated Damage Detection in Extreme Events at Various Scales

Yongsheng Bai, Alper Yilmaz, Halil Sezen

Auto-TLDR; Robust Mask R-CNN for Crack Detection in Extreme Events

Robust Mask R-CNN (Mask Regional Convolutional Neural Network) methods are proposed and tested for automatic detection of cracks on structures or their components that may be damaged during extreme events, such as earthquakes. We curated a new dataset with 2,021 labeled images for training and validation and aimed to find end-to-end deep neural networks for crack detection in the field. With data augmentation and parameter fine-tuning, a Path Aggregation Network (PANet) with spatial attention mechanisms and a High-Resolution Network (HRNet) are introduced into Mask R-CNN. Tests on three public datasets with low- and high-resolution images demonstrate that the proposed methods achieve a significant improvement over alternative networks, suggesting that the proposed method may be sufficient for crack detection at a variety of scales in real applications.

Foreground-Guided Vehicle Perception Framework

Kun Tian, Tong Zhou, Shiming Xiang, Chunhong Pan

Auto-TLDR; A foreground segmentation branch for vehicle detection

As the basis of advanced visual tasks such as vehicle tracking and traffic flow analysis, vehicle detection needs to accurately predict the position and category of vehicle objects. In the past decade, deep learning based methods have made great progress. However, we also notice that some issues have not been studied thoroughly. First, false positives on background regions are a critical problem. Second, most previous approaches only optimize a single vehicle detection model, ignoring the relationship between different visual perception tasks. In response to these two findings, we introduce a foreground segmentation branch for the first time, which predicts vehicle regions at the pixel level in advance. Furthermore, two attention modules are designed to guide the work of the detection branch. The proposed method can be easily grafted into one-stage and two-stage detection frameworks. We evaluate the effectiveness of our model on LSVH, a dataset with large variations in vehicle scales, and achieve state-of-the-art detection accuracy.

SyNet: An Ensemble Network for Object Detection in UAV Images

Berat Mert Albaba, Sedat Ozer

Auto-TLDR; SyNet: Combining Multi-Stage and Single-Stage Object Detection for Aerial Images

Recent advances in camera-equipped drone applications and their widespread use have increased the demand for vision-based object detection algorithms for aerial images. Object detection is inherently challenging as a generic computer vision problem; moreover, since the use of object detection algorithms on UAVs (or drones) is a relatively new area, detecting objects in aerial images remains even more challenging. There are several reasons for this, including: (i) the lack of large drone datasets with large object variance, (ii) the larger orientation and scale variance in drone images compared to ground images, and (iii) the difference in texture and shape features between ground and aerial images. Deep learning based object detection algorithms can be classified into two main categories: (a) single-stage detectors and (b) multi-stage detectors. Both have their advantages and disadvantages, but a technique that combines the strengths of each could yield a stronger solution than either individually. In this paper, we propose an ensemble network, SyNet, that combines a multi-stage method with a single-stage one, with the motivation of decreasing the high false negative rate of multi-stage detectors and increasing the quality of the single-stage detector's proposals. As building blocks, CenterNet and Cascade R-CNN with pretrained feature extractors are utilized along with an ensembling strategy. Our proposed solution obtains state-of-the-art results on two different datasets: 52.1% $mAP_{IoU = 0.75}$ on the MS-COCO val2017 dataset and 26.2% $mAP_{IoU = 0.75}$ on the VisDrone test set. Our code is available at: https://github.com/mertalbaba/SyNet
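
A naive version of detector ensembling, as a hedged sketch only: pool both detectors' outputs and resolve duplicates with class-wise NMS. SyNet's actual ensembling strategy may differ; the function name, input format, and threshold here are assumptions.

```python
import torch
from torchvision.ops import batched_nms

def ensemble_detections(dets_a, dets_b, iou_thr=0.5):
    """Pool the detections of a single-stage and a multi-stage detector,
    then resolve duplicates with class-wise NMS. Each input is a dict
    with 'boxes' (K, 4), 'scores' (K,), 'labels' (K,)."""
    boxes = torch.cat([dets_a["boxes"], dets_b["boxes"]])
    scores = torch.cat([dets_a["scores"], dets_b["scores"]])
    labels = torch.cat([dets_a["labels"], dets_b["labels"]])
    keep = batched_nms(boxes, scores, labels, iou_thr)
    return boxes[keep], scores[keep], labels[keep]

a = {"boxes": torch.tensor([[0., 0., 10., 10.]]),
     "scores": torch.tensor([0.9]), "labels": torch.tensor([1])}
b = {"boxes": torch.tensor([[1., 1., 11., 11.], [20., 20., 30., 30.]]),
     "scores": torch.tensor([0.8, 0.7]), "labels": torch.tensor([1, 2])}
print(ensemble_detections(a, b)[0])  # the overlapping duplicate is removed
```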

FourierNet: Compact Mask Representation for Instance Segmentation Using Differentiable Shape Decoders

Hamd Ul Moqeet Riaz, Nuri Benbarka, Andreas Zell

Auto-TLDR; FourierNet: A Single shot, anchor-free, fully convolutional instance segmentation method that predicts a shape vector

We present FourierNet, a single-shot, anchor-free, fully convolutional instance segmentation method that predicts a shape vector. This shape vector is then converted into the mask's contour points using a fast numerical transform. Compared to previous methods, we introduce a new training technique that utilizes a differentiable shape decoder, which handles automatic weight balancing of the shape vector's coefficients. We use the Fourier series as a shape encoder because of its coefficient interpretability and fast implementation. FourierNet shows promising results compared to polygon representation methods, achieving 30.6 mAP on the MS COCO 2017 benchmark. At lower image resolutions, it runs at 26.6 FPS with 24.3 mAP. It reaches 23.3 mAP using just eight parameters to represent the mask (note that at least four parameters are needed for bounding box prediction alone). Qualitative analysis shows that suppressing a reasonable proportion of the higher Fourier frequencies still generates meaningful masks. These results validate our understanding that lower-frequency components hold more information for the segmentation task, and that we can therefore achieve a compressed representation. Code is available at: github.com/cogsys-tuebingen/FourierNet.
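
The decoding step can be illustrated with a plain inverse FFT: given a few complex Fourier coefficients of a closed contour (x(t) + i*y(t) treated as a periodic signal), zero-padding the spectrum resamples the contour at any number of points, and truncating coefficients low-pass filters (smooths) it. This is a generic sketch of contour decoding from Fourier coefficients, not FourierNet's exact decoder.

```python
import numpy as np

def contour_from_fourier(coeffs, num_points=64):
    """Decode contour points from complex Fourier coefficients.

    coeffs: complex FFT coefficients of the contour signal x(t) + i*y(t).
    Zero-padding the spectrum to num_points before the inverse FFT
    resamples the contour densely.
    """
    spectrum = np.zeros(num_points, dtype=complex)
    k = len(coeffs) // 2 + 1
    spectrum[:k] = coeffs[:k]                    # non-negative frequencies
    spectrum[-(len(coeffs) - k):] = coeffs[k:]   # negative frequencies
    pts = np.fft.ifft(spectrum) * (num_points / len(coeffs))
    return np.stack([pts.real, pts.imag], axis=1)  # (num_points, 2)

# a toy closed contour encoded with 8 complex coefficients
t = np.arange(8)
contour = np.array([1, 1j, -1, -1j]).repeat(2) * (1 + 0.2 * (t % 2))
coeffs = np.fft.fft(contour)
print(contour_from_fourier(coeffs, 64).shape)  # (64, 2)
```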

Scene Text Detection with Selected Anchors

Anna Zhu, Hang Du, Shengwu Xiong

Auto-TLDR; AS-RPN: Anchor Selection-based Region Proposal Network for Scene Text Detection

Object proposal techniques with dense anchoring schemes are frequently applied to scene text detection to achieve high recall. This significantly improves accuracy but wastes computation on searching, regression, and classification. In this paper, we propose an anchor selection-based region proposal network (AS-RPN) that uses effective selected anchors instead of dense anchors to extract text proposals. The centers, scales, aspect ratios, and orientations of the anchors are learnable instead of fixed, which leads to high recall with a greatly reduced number of anchors. By replacing the anchor-based RPN in Faster R-CNN, the AS-RPN-based Faster R-CNN achieves performance comparable to previous state-of-the-art text detection approaches on standard benchmarks, including COCO-Text, ICDAR2013, ICDAR2015, and MSRA-TD500, using only single-scale, single-model (ResNet-50) testing.

Triplet-Path Dilated Network for Detection and Segmentation of General Pathological Images

Jiaqi Luo, Zhicheng Zhao, Fei Su, Limei Guo

Auto-TLDR; Triplet-path Network for One-Stage Object Detection and Segmentation in Pathological Images

Deep learning has been widely applied in the field of medical image processing. However, compared with the flourishing visual tasks in natural images, progress on pathological images has been less remarkable, and detection and segmentation, two basic computer vision tasks, are usually treated as independent problems. In this paper, we make full use of existing datasets and construct a triplet-path network using dilated convolutions to cooperatively accomplish one-stage object detection and nuclei segmentation for general pathological images. First, to meet the requirements of detection and segmentation, a novel structure called triplet feature generation (TFG) is designed to extract high-resolution and multiscale features, where features from different layers can be properly integrated. Second, considering that pathological datasets are usually small, a location-aware and partially truncated loss function is proposed to improve the classification accuracy on datasets with few images and widely varying targets. We compare the performance of both object detection and instance segmentation with state-of-the-art methods. Experimental results demonstrate the effectiveness and efficiency of the proposed network on two datasets collected from multiple organs.

Boundary-Aware Graph Convolution for Semantic Segmentation

Hanzhe Hu, Jinshi Cui, Hongbin Zha

Auto-TLDR; Boundary-Aware Graph Convolution for Semantic Segmentation

Recent works have made great progress in semantic segmentation by exploiting contextual information in a local or global manner with dilated convolutions, pyramid pooling, or self-attention mechanisms. However, few works have focused on harvesting boundary information to improve segmentation performance. To enhance feature similarity within an object while maintaining discrimination from other objects, we propose a boundary-aware graph convolution (BGC) module that propagates features within the object. Graph reasoning is performed among pixels of the same object, excluding the boundary pixels. Based on the proposed BGC module, we further introduce the Boundary-aware Graph Convolution Network (BGCNet), which consists of two main components, a basic segmentation network and the BGC module, forming a coarse-to-fine paradigm. Specifically, the BGC module takes the coarse segmentation feature map as node features and uses the boundary prediction to guide graph construction. After graph convolution, the reasoned feature and the input feature are fused to obtain the refined feature, producing the refined segmentation result. We conduct extensive experiments on three popular semantic segmentation benchmarks, including Cityscapes, PASCAL VOC 2012, and COCO Stuff, and achieve state-of-the-art performance on all three.

Object Detection Model Based on Scene-Level Region Proposal Self-Attention

Yu Quan, Zhixin Li, Canlong Zhang, Huifang Ma

Auto-TLDR; Exploiting Semantic Information for Object Detection

Improvements in object detection performance have mostly focused on extracting local information near the regions of interest in the image, which limits the achievable detection performance. First, a depth-wise separable convolution network (D_SCNet-127 R-CNN) is built on the backbone network. Considering the importance of scene and semantic information for visual recognition, the feature map is fed into a semantic segmentation module, a region proposal network module, and a region proposal self-attention module to build a scene-level, region-proposal self-attention network. Second, deep reinforcement learning is utilized to achieve accurate positioning for bounding-box regression, and the computation speed of the whole model is improved by implementing a lightweight head network. This model can effectively overcome the limitations of feature extraction in traditional object detection and obtain more comprehensive, detailed features. Experimental verification on the MSCOCO17, VOC12, and Cityscapes datasets shows that the proposed method has good validity and scalability.

One-Stage Multi-Task Detector for 3D Cardiac MR Imaging

Weizeng Lu, Xi Jia, Wei Chen, Nicolò Savioli, Antonio De Marvao, Linlin Shen, Declan O'Regan, Jinming Duan

Auto-TLDR; Multi-task Learning for Real-Time, simultaneous landmark location and bounding box detection in 3D space

Fast and accurate landmark location and bounding box detection are important steps in 3D medical imaging. In this paper, we propose a novel multi-task learning framework for real-time, simultaneous landmark location and bounding box detection in 3D space. Our method extends the well-known single-shot multibox detector (SSD) from single-task to multi-task learning and from 2D to 3D. Furthermore, we propose a post-processing approach that refines the network's landmark output by averaging the candidate landmarks. Owing to these settings, the proposed framework is fast and accurate. For 3D cardiac magnetic resonance (MR) images of size 224 × 224 × 64, our framework runs at about 128 volumes per second (VPS) on a GPU and achieves a 6.75 mm average point-to-point distance error for landmark location, outperforming both state-of-the-art and baseline methods. We also show that segmenting the 3D image cropped with the bounding box improves both performance and efficiency.

Hybrid Cascade Point Search Network for High Precision Bar Chart Component Detection

Junyu Luo, Jinpeng Wang, Chin-Yew Lin

Auto-TLDR; Object Detection of Chart Components in Chart Images Using Point-based and Region-Based Object Detection Framework

Charts are commonly used for data visualization, and one common form of chart distribution is as images. To enable machine comprehension of chart images, precise detection of chart components is a critical step. Existing image object detection methods do not perform well in chart component detection, which requires high boundary detection precision, and traditional rule-based approaches lack sufficient generalization ability. To address this problem, we design a novel two-stage object detection framework that combines point-based and region-based ideas by simulating the process by which humans create bounding boxes for objects. Experiments on our labeled ChartDet dataset show that our method greatly improves chart object detection. We further extend our method to a general object detection task and obtain comparable performance.

An Accurate Threshold Insensitive Kernel Detector for Arbitrary Shaped Text

Xijun Qian, Yifan Liu, Yu-Bin Yang

Auto-TLDR; TIKD: threshold insensitive kernel detector for arbitrary shaped text

Recently, segmentation-based methods have become popular in scene text detection because segmentation results can easily represent scene text of arbitrary shapes. However, previous works segment text instances in the same way as normal objects, although the edges of text instances clearly differ from those of normal objects. In this paper, we propose a threshold insensitive kernel detector for arbitrary shaped text called TIKD, which includes a simple but stable base model and a new loss weight called the Decay Loss Weight (DLW). By gradually suppressing outlier pixels, the DLW leads the network to detect more accurate text instances. Our method shows strong accuracy and stability: we achieve a precision, recall, and f-measure of 88.7%, 83.7%, and 86.1%, respectively, on the Total-Text dataset, at a fast speed of 16.3 frames per second. Moreover, even with the threshold set anywhere in the extreme range from 0.1 to 0.9, our method always achieves a stable f-measure above 79.9% on the Total-Text dataset.

Yolo+FPN: 2D and 3D Fused Object Detection with an RGB-D Camera

Ya Wang

Auto-TLDR; Yolo+FPN: Combining 2D and 3D Object Detection for Real-Time Object Detection

In this paper we propose a new deep neural network system, called Yolo+FPN, which fuses 2D and 3D object detection algorithms to achieve better real-time object detection results and faster inference speed for use on real robots. Finding an optimized fusion strategy to efficiently combine 3D object detection with 2D detection information is useful and challenging for both indoor and outdoor robots. To satisfy real-time requirements, a trade-off between accuracy and efficiency is needed. We not only achieve improved training and test accuracy and lower mean losses on the KITTI object detection benchmark, but also better average precision on 3D detection of all classes at three difficulty levels. We also implemented the Yolo+FPN system with an RGB-D camera and compared the speed of 2D and 3D object detection on different GPUs. For real deployments in both indoor and outdoor scenes, we focus on person detection, which is the most challenging and important of the three classes.

StrongPose: Bottom-up and Strong Keypoint Heat Map Based Pose Estimation

Niaz Ahmad, Jongwon Yoon

Auto-TLDR; StrongPose: A bottom-up box-free approach for human pose estimation and action recognition

Deep convolutional neural networks have made revolutionary progress in human pose estimation, and various applications have drawn considerable attention in recent years. However, prediction and localization of keypoints in single- and multi-person images remain a challenging problem. To this end, we present a bottom-up, box-free approach for pose estimation and action recognition. We propose the StrongPose system model, which uses part-based modeling to tackle object-part associations. The model utilizes a convolutional network that learns to detect Strong Keypoint Heat Maps (SKHM) and predict their relative displacements, enabling us to group keypoints into person pose instances. Further, we produce Body Heat Maps (BHM) from the keypoints, which allow us to localize the human body in the image. The StrongPose framework is fully convolutional and permits efficient inference, with runtime essentially independent of the number of people present in the scene. Trained and tested on COCO data alone, our framework achieves a COCO test-dev keypoint average precision of 0.708 using ResNet-101 and 0.725 using ResNet-152, considerably outperforming all prior bottom-up pose estimation frameworks.

FeatureNMS: Non-Maximum Suppression by Learning Feature Embeddings

Niels Ole Salscheider

Auto-TLDR; FeatureNMS: Non-Maximum Suppression for Multiple Object Detection

Most state-of-the-art object detectors output multiple detections per object, and the duplicates are removed in a post-processing step called Non-Maximum Suppression (NMS). Classical NMS has shortcomings in scenes that contain objects with high overlap: the idea of this heuristic is that a high bounding box overlap corresponds to a high probability of the detections being duplicates. We propose FeatureNMS to solve this problem. FeatureNMS recognizes duplicates based not only on the intersection over union between bounding boxes, but also on the difference between feature vectors, which can encode additional information such as visual appearance. Our approach outperforms classical NMS and derived approaches and achieves state-of-the-art performance.
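
The underlying rule can be sketched in a few lines: a high-IoU pair is suppressed only if the two embeddings are also close, so overlapping but visually distinct objects both survive. The thresholds and the plain L2 distance below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def feature_nms(boxes, scores, embeds, iou_thr=0.5, dist_thr=1.0):
    """NMS sketch in the spirit of FeatureNMS: a high-IoU pair counts as a
    duplicate only if the boxes' feature embeddings are also close.

    boxes: (N, 4); scores: (N,); embeds: (N, D)."""
    keep = []
    for i in np.argsort(scores)[::-1]:          # highest score first
        duplicate = any(
            iou(boxes[i], boxes[j]) > iou_thr and
            np.linalg.norm(embeds[i] - embeds[j]) < dist_thr
            for j in keep)
        if not duplicate:
            keep.append(int(i))
    return keep

# two overlapping people: high IoU, but very different appearance embeddings
boxes = np.array([[0, 0, 10, 20], [1, 0, 11, 20.]])
print(feature_nms(boxes, np.array([0.9, 0.8]),
                  np.array([[0., 0.], [5., 5.]])))  # both kept: [0, 1]
```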

S-VoteNet: Deep Hough Voting with Spherical Proposal for 3D Object Detection

Yanxian Chen, Huimin Ma, Xi Li, Xiong Luo

Auto-TLDR; S-VoteNet: 3D Object Detection with Spherical Bounding Box Prediction

Current 3D object detection methods adopt a box prediction structure analogous to 2D methods, predicting the center and size of an object simultaneously in a box regression procedure, which largely degrades 3D detector performance. In this work, we propose S-VoteNet, which splits the prediction of the 3D bounding box into two parts: center prediction and size prediction. By introducing a novel spherical proposal, S-VoteNet uses vote groups to predict the center and radius of each object rather than all parameters of the 3D bounding box. The predicted radius constrains the object size, and a radius-based spherical center loss measures the geometric distance between the proposal and the ground truth. To make better use of the geometric information provided by the point cloud, S-VoteNet gathers seed points whose corresponding votes fall within the vote groups to generate seed groups, which are then consumed for box size regression and orientation estimation. By decoupling localization and size estimation, our method effectively reduces the regression pressure on the 3D detector. Experimental results on the SUN RGB-D 3D detection benchmark demonstrate that S-VoteNet achieves state-of-the-art performance using only point clouds as input.

Joint Semantic-Instance Segmentation of 3D Point Clouds: Instance Separation and Semantic Fusion

Min Zhong, Gang Zeng

Auto-TLDR; Joint Semantic Segmentation and Instance Separation of 3D Point Clouds

This paper introduces an approach for jointly addressing semantic segmentation (SS) and instance segmentation (IS) of 3D point clouds. Two novel modules are designed to model the interplay between SS and IS. Specifically, we develop an Instance Separation Module that supplements position-invariant semantic features with instance-specific centroid positions to help separate different instances. To fuse the semantic information within a single instance, an attention-based Semantic Fusion Module is proposed to encode attention maps in the instance embedding space, which are applied to fuse semantic information in the semantic feature space. The proposed method is thoroughly evaluated on the S3DIS dataset. Compared with the strong ASIS method, our approach achieves significant improvements across all evaluation metrics for both IS and SS.

A Fine-Grained Dataset and Its Efficient Semantic Segmentation for Unstructured Driving Scenarios

Kai Andreas Metzger, Peter Mortimer, Hans J "Joe" Wuensche

Auto-TLDR; TAS500: A Semantic Segmentation Dataset for Autonomous Driving in Unstructured Environments

Research in autonomous driving for unstructured environments suffers from a lack of semantically labeled datasets compared to its urban counterpart. Urban and unstructured outdoor environments are challenging due to the varying lighting and weather conditions during a day and across seasons. In this paper, we introduce TAS500, a novel semantic segmentation dataset for autonomous driving in unstructured environments. TAS500 offers fine-grained vegetation and terrain classes to learn drivable surfaces and natural obstacles in outdoor scenes effectively. We evaluate the performance of modern semantic segmentation models with an additional focus on their efficiency. Our experiments demonstrate the advantages of fine-grained semantic classes to improve the overall prediction accuracy, especially along the class boundaries. The dataset, code, and pretrained model are available online.

Detective: An Attentive Recurrent Model for Sparse Object Detection

Amine Kechaou, Manuel Martinez, Monica Haurilet, Rainer Stiefelhagen

Auto-TLDR; Detective: An attentive object detector that identifies objects in images in a sequential manner

In this work, we present Detective – an attentive object detector that identifies objects in images in a sequential manner. Our network is based on an encoder-decoder architecture, where the encoder is a convolutional neural network and the decoder is a convolutional recurrent neural network coupled with an attention mechanism. At each iteration, our decoder focuses on the relevant parts of the image using the attention mechanism, and then estimates the object's class and bounding box coordinates. Current object detection models generate dense predictions and rely on post-processing to remove duplicates. Detective is a sparse object detector that generates a single bounding box per object instance. However, training a sparse object detector is challenging, as it requires the model to reason at the instance level and not just at the class and spatial levels. We propose a training mechanism based on the Hungarian algorithm and a loss that balances the localization and classification tasks. This allows Detective to achieve promising results on the PASCAL VOC object detection dataset. Our experiments demonstrate that sparse object detection is possible and has great potential for future developments in applications where the order of the objects to be predicted is of interest.
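
The matching step can be illustrated with SciPy's Hungarian solver: build a cost matrix that mixes a localization term and a classification term, then assign each prediction to at most one ground-truth object. The cost terms, weights, and function name below are assumptions for illustration, not the paper's exact loss.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_predictions(pred_boxes, gt_boxes, pred_cls_prob, gt_labels,
                      loc_weight=1.0, cls_weight=1.0):
    """One-to-one assignment of predictions to ground-truth objects, as
    used to train sparse detectors. Shapes: preds (P, 4)/(P, C), gts (G, 4)/(G,)."""
    # L1 box distance between every prediction and every ground truth
    loc_cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    # high probability for the GT class = low cost
    cls_cost = -pred_cls_prob[:, gt_labels]
    cost = loc_weight * loc_cost + cls_weight * cls_cost
    pred_idx, gt_idx = linear_sum_assignment(cost)  # Hungarian algorithm
    return list(zip(pred_idx.tolist(), gt_idx.tolist()))

preds = np.array([[0., 0., 1., 1.], [5., 5., 6., 6.]])
gts = np.array([[5., 5., 6., 6.], [0., 0., 1., 1.]])
probs = np.array([[0.9, 0.1], [0.1, 0.9]])
print(match_predictions(preds, gts, probs, np.array([1, 0])))  # [(0, 1), (1, 0)]
```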

Small Object Detection by Generative and Discriminative Learning

Yi Gu, Jie Li, Chentao Wu, Weijia Jia, Jianping Chen

Auto-TLDR; Generative and Discriminative Learning for Small Object Detection

With the development of deep convolutional neural networks (CNNs), object detection accuracy has been greatly improved. But the performance of small object detection is still far from satisfactory, mainly because small objects are so tiny that the information contained in the feature map is limited. Existing methods focus on improving classification accuracy but still suffer from limited bounding box prediction. To solve this issue, we propose a detection framework based on generative and discriminative learning. First, a reconstruction generator network is designed to reconstruct the mapping from low frequency to high frequency for anchor box prediction. Then, a detector module extracts the regions of interest (RoIs) from the generated results and implements an RoI head to predict the object category and refine the bounding box. To keep the reconstructed image related to the corresponding original, a discriminator module is adopted to distinguish the generated result from the original image. Extensive evaluations on the challenging MS-COCO dataset demonstrate that our model outperforms most state-of-the-art models in detecting small objects; in particular, the reconstruction module improves the average precision for small objects (APs) by 7.7%.

GSTO: Gated Scale-Transfer Operation for Multi-Scale Feature Learning in Semantic Segmentation

Zhuoying Wang, Yongtao Wang, Zhi Tang, Yangyan Li, Ying Chen, Haibin Ling, Weisi Lin

Auto-TLDR; Gated Scale-Transfer Operation for Semantic Segmentation

Existing CNN-based methods for semantic segmentation heavily depend on multi-scale features to meet the requirements of both semantic comprehension and detail preservation. State-of-the-art segmentation networks widely exploit conventional scale-transfer operations, i.e., up-sampling and down-sampling, to learn multi-scale features. In this work, we find that these operations lead to scale-confused features and suboptimal performance because they are spatially invariant and directly transfer all feature information across scales without spatial selection. To address this issue, we propose the Gated Scale-Transfer Operation (GSTO) to properly transfer spatially filtered features to another scale. GSTO can work either with or without extra supervision: the unsupervised form is learned from the feature itself, while the supervised one is guided by a supervised probability matrix. Both forms are lightweight and plug-and-play, and can be flexibly integrated into networks or modules for learning better multi-scale features. In particular, by plugging GSTO into HRNet, we obtain a more powerful backbone for pixel labeling (GSTO-HRNet), which achieves new state-of-the-art results on multiple semantic segmentation benchmarks, including Cityscapes, LIP, and PASCAL Context, with negligible extra computational cost. Moreover, experimental results demonstrate that GSTO can also significantly boost the performance of multi-scale feature aggregation modules such as PPM and ASPP.
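
In its unsupervised form, a gated scale-transfer can be sketched as a learned spatial gate applied before up-sampling, so that only selected feature information crosses scales. This is a minimal sketch; the module name and the single-channel gate design are assumptions, not the paper's exact operation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedUpsample(nn.Module):
    """Sketch of a gated scale-transfer: before up-sampling a low-resolution
    feature map, multiply it by a learned spatial gate so only selected
    feature information is transferred across scales (unsupervised form)."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Conv2d(channels, 1, kernel_size=1)  # per-pixel gate

    def forward(self, x, target_hw):
        g = torch.sigmoid(self.gate(x))          # (N, 1, h, w) in [0, 1]
        x = x * g                                # spatial selection
        return F.interpolate(x, size=target_hw, mode="bilinear",
                             align_corners=False)

x = torch.randn(2, 64, 16, 16)
print(GatedUpsample(64)(x, (32, 32)).shape)  # torch.Size([2, 64, 32, 32])
```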

Enhanced Vote Network for 3D Object Detection in Point Clouds

Min Zhong, Gang Zeng

Auto-TLDR; A Vote Feature Enhancement Network for 3D Bounding Box Prediction

In this work, we estimate 3D bounding boxes by voting for object centers and then grouping and aggregating the votes to generate 3D box proposals and semantic classes of objects. However, due to the sparse and unstructured nature of point clouds, directly predicting bounding boxes from the vote features is challenging: the sparse vote features may lack necessary semantic and context information. To address this, we propose a vote feature enhancement network that encodes semantic-aware information and aggregates global context for the vote features. Specifically, we learn point-wise semantic information and supplement it to the vote features, and we encode pairwise relations to collect global context. Experiments on two large datasets of real 3D scans, ScanNet and SUN RGB-D, demonstrate that our method achieves excellent 3D detection results.

DualBox: Generating BBox Pair with Strong Correspondence Via Occlusion Pattern Clustering and Proposal Refinement

Zheng Ge, Chuyu Hu, Xin Huang, Baiqiao Qiu, Osamu Yoshie

Auto-TLDR; R2NMS: Combining Full and Visible Body Bounding Box for Dense Pedestrian Detection

Despite the rapid development of pedestrian detection, the problem of dense pedestrian detection remains unsolved, especially the upper limit on recall caused by Non-Maximum Suppression (NMS). For this reason, R2NMS was proposed to simultaneously detect full-body and visible-body bounding boxes, replacing the full-body BBoxes with less-occluded visible-body BBoxes in the NMS algorithm and achieving higher recall. However, the P-RPN and P-RCNN modules proposed in R2NMS for simultaneous high-quality full- and visible-body prediction require non-trivial positive/negative assignment strategies for anchor BBoxes. To simplify the prerequisites and improve the utility of R2NMS, we incorporate clustering analysis into the learning of visible-body proposals from full-body proposals. Furthermore, to reduce the computational complexity caused by the large number of potential visible-body proposals, we introduce a novel occlusion pattern prediction branch on top of the R-CNN module (i.e. F-RCNN) to select the best-matched visible proposals for each full-body proposal, which are then fed into another R-CNN module (i.e. V-RCNN). Incorporating R2NMS, our DualBox model achieves competitive performance while requiring only a few hyper-parameters. We validate the effectiveness of the proposed approach on the CrowdHuman and CityPersons datasets. Experimental results show that our approach achieves promising performance for detecting both non-occluded and occluded pedestrians, especially heavily occluded ones.

Feature Embedding Based Text Instance Grouping for Largely Spaced and Occluded Text Detection

Pan Gao, Qi Wan, Renwu Gao, Linlin Shen

Auto-TLDR; Text Instance Embedding Based Feature Embeddings for Multiple Text Instance Grouping

A text instance can easily be detected as multiple instances due to large spaces between texts or characters, curved shapes, and partial occlusion. In this paper, a feature embedding based text instance grouping algorithm is proposed to solve this problem. To learn the feature space, a TIEM (Text Instance Embedding Module) is trained to minimize the within-instance scatter and maximize the between-instance scatter. Similarity between different text instances is measured in the feature space, and instances are merged if they meet certain conditions. Experimental results show that our approach can effectively connect text regions that belong to the same text instance. Competitive performance is achieved on CTW1500, Total-Text, IC15, and a subset consisting of texts with large spacing and occlusions selected from the three datasets.

Detecting Objects with High Object Region Percentage

Fen Fang, Qianli Xu, Liyuan Li, Ying Gu, Joo-Hwee Lim

Auto-TLDR; Faster R-CNN for High-ORP Object Detection

Object shape is a subtle but important factor in object detection. It has been observed that the object-region percentage (ORP) can be used to improve detection accuracy for elongated objects, which have much lower ORPs than other types of objects. In this paper, we propose an approach to improve detection performance for objects whose ORPs are relatively high. To address the problem of high-ORP object detection, we propose a three-step method. First, we adjust the ground-truth bounding boxes of high-ORP objects to an optimal range. Second, we train an object detector, Faster R-CNN, on the adjusted bounding boxes to achieve high recall. Finally, we train a DCNN to learn the adjustment ratios toward four directions and adjust the detected bounding boxes for better localization and higher precision. We evaluate the effectiveness of our method on 12 high-ORP objects in COCO and 8 objects in a proprietary gearbox dataset. The experimental results show that our method achieves state-of-the-art performance on these objects while using fewer resources in the training and inference stages.

Bidirectional Matrix Feature Pyramid Network for Object Detection

Wei Xu, Yi Gan, Jianbo Su

Auto-TLDR; BMFPN: Bidirectional Matrix Feature Pyramid Network for Object Detection

Feature pyramids are widely used to improve scale invariance in object detection. Most methods simply map objects to feature maps with relevant square receptive fields, but rarely pay attention to aspect ratio variation, which is also an important property of object instances. This leads to a poor match between rectangular objects and assigned features with square receptive fields, preventing accurate recognition and localization. Besides, information propagation among feature layers is sparse: each feature in the pyramid may mainly or only contain single-level information, which is not representative enough for the classification and localization sub-tasks. In this paper, the Bidirectional Matrix Feature Pyramid Network (BMFPN) is proposed to address these issues. It consists of three modules: the Diagonal Layer Generation Module (DLGM), the Top-down Module (TDM), and the Bottom-up Module (BUM). First, multi-level features extracted by the backbone are fed into the DLGM to produce the base features. These base features are then used to construct the final feature pyramid through the TDM and BUM in series. The receptive fields of the designed feature layers in BMFPN have various scales and aspect ratios, so objects can be correctly assigned to appropriate and representative feature maps depending on their scale and aspect ratio properties. Moreover, the TDM and BUM form bidirectional and reticular information flows, which effectively fuse multi-level information in top-down and bottom-up manners, respectively. To evaluate the effectiveness of the proposed architecture, an end-to-end anchor-free detector is designed and trained by integrating BMFPN into FCOS, and the center-ness branch in FCOS is modified with our Gaussian center-ness branch (GCB), which brings another slight improvement. Without bells and whistles, our method gains +3.3%, +2.4%, and +2.6% AP on the MS COCO dataset over baselines with ResNet-50, ResNet-101, and ResNeXt-101 backbones, respectively.

MagnifierNet: Learning Efficient Small-Scale Pedestrian Detector towards Multiple Dense Regions

Qi Cheng, Mingqin Chen, Yingjie Wu, Fei Chen, Shiping Lin

Auto-TLDR; MagnifierNet: A Simple but Effective Small-Scale Pedestrian Detection Towards Multiple Dense Regions

Despite the success of pedestrian detection, there is still a significant gap between the detection performance for pedestrians at different scales. Detecting small-scale pedestrians is extremely challenging due to the low resolution of their convolutional features, which are essential for downstream classifiers. To address this issue, we analyzed pedestrian datasets and found that pedestrians often gather together in crowded public places. We therefore propose MagnifierNet, a simple but effective small-scale pedestrian detector aimed at multiple dense regions. MagnifierNet uses our proposed sweep-line based grouping algorithm to find dense regions based on the number of pedestrians in the grouped region, and we adopt a new definition of small-scale pedestrians through grid search and KL divergence. Besides, our grouping method can also be used as a new strategy for pedestrian data augmentation. An ablation study demonstrates that MagnifierNet improves the representation of small-scale pedestrians. We validate the effectiveness of MagnifierNet on the CityPersons and KITTI datasets. Experimental results show that MagnifierNet achieves the best small-scale pedestrian detection performance on the CityPersons benchmark without any external data, and competitive performance for detecting small-scale pedestrians on the KITTI dataset without bells and whistles.
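
A sweep-line grouping of this kind can be sketched as follows: sweep along the x-axis, keep boxes whose horizontal intervals overlap in one group, and report groups containing enough pedestrians as dense regions. The exact grouping criterion in the paper may differ; the function name and the minimum-count threshold are illustrative assumptions.

```python
def group_dense_regions(boxes, min_count=3):
    """Sweep-line grouping sketch for finding dense pedestrian regions.
    boxes: list of (x1, y1, x2, y2)."""
    events = sorted(boxes, key=lambda b: b[0])  # sweep by left edge
    groups, current, right = [], [], None
    for b in events:
        if current and b[0] > right:   # gap on the x-axis: close the group
            groups.append(current)
            current, right = [], None
        current.append(b)
        right = b[2] if right is None else max(right, b[2])
    if current:
        groups.append(current)
    # only sufficiently crowded groups count as dense regions
    return [g for g in groups if len(g) >= min_count]

boxes = [(0, 0, 4, 10), (3, 0, 7, 10), (6, 0, 9, 10),   # a small crowd
         (50, 0, 54, 10)]                                # an isolated person
print(group_dense_regions(boxes))  # only the three overlapping boxes
```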

Siamese Dynamic Mask Estimation Network for Fast Video Object Segmentation

Dexiang Hong, Guorong Li, Kai Xu, Li Su, Qingming Huang

Auto-TLDR; Siamese Dynamic Mask Estimation for Video Object Segmentation

Video object segmentation (VOS) has been a fundamental topic in recent years, and many deep learning based methods have achieved state-of-the-art performance on multiple benchmarks. However, most of these methods rely on pixel-level matching between the template and the searched frames over the whole image, while the targets only occupy a small region. Computing over the entire image adds considerable computational cost, and the whole image may contain distracting information that produces many false-positive matching points. To address this issue, motivated by one-stage instance segmentation methods, we propose an efficient siamese dynamic mask estimation network for fast video object segmentation. VOS is decoupled into two tasks: mask feature learning and dynamic kernel prediction. The former learns high-quality features that preserve structural geometric information, while the latter learns a dynamic kernel that is convolved with the mask feature to generate the mask output. We use a Siamese neural network as the feature extractor and directly predict masks after correlation. In this way, we avoid pixel-level matching, making our framework simpler and more efficient. Experimental results on the DAVIS 2016/2017 datasets show that our proposed method runs at 35 frames per second on an NVIDIA RTX TITAN while preserving competitive accuracy.
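
The dynamic-kernel idea can be sketched as predicting one small kernel per target and convolving it with a shared mask feature, so no pixel-level template matching is needed. This is a minimal sketch with assumed shapes (1x1 kernels, sigmoid output), not the paper's network.

```python
import torch
import torch.nn.functional as F

def dynamic_mask_head(mask_feat, kernels):
    """Dynamic-kernel sketch: a per-target kernel predicted by the network
    is convolved with a shared mask feature to produce that target's mask.

    mask_feat: (1, C, H, W) mask feature of the search frame
    kernels:   (T, C) one predicted kernel per target
    """
    t, c = kernels.shape
    w = kernels.view(t, c, 1, 1)          # (T, C, 1, 1) conv weights
    logits = F.conv2d(mask_feat, w)       # (1, T, H, W), one map per target
    return torch.sigmoid(logits)

feat = torch.randn(1, 256, 60, 60)
kernels = torch.randn(3, 256)             # e.g. 3 tracked objects
print(dynamic_mask_head(feat, kernels).shape)  # torch.Size([1, 3, 60, 60])
```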

HPERL: 3D Human Pose Estimation from RGB and LiDAR

Michael Fürst, Shriya T.P. Gupta, René Schuster, Oliver Wasenmüller, Didier Stricker

Auto-TLDR; 3D Human Pose Estimation from RGB and LiDAR Using a Weakly-Supervised Approach

In-the-wild human pose estimation has huge potential for various fields, ranging from animation and action recognition to intention recognition and prediction for autonomous driving. The current state of the art focuses only on RGB and RGB-D approaches for predicting the 3D human pose. However, not using precise LiDAR depth information limits performance and leads to very inaccurate absolute pose estimation. With LiDAR sensors becoming more affordable and common on robots and autonomous vehicles, we propose an end-to-end architecture using RGB and LiDAR to predict the absolute 3D human pose with unprecedented precision. Additionally, we introduce a weakly supervised approach that generates 3D predictions using 2D pose annotations from PedX. This opens up many new opportunities in the field of 3D human pose estimation.

Machine-Learned Regularization and Polygonization of Building Segmentation Masks

Stefano Zorzi, Ksenia Bittner, Friedrich Fraundorfer

Auto-TLDR; Automatic Regularization and Polygonization of Building Segmentation masks using Generative Adversarial Network

We propose a machine learning based approach for automatic regularization and polygonization of building segmentation masks. Taking an image as input, we first predict building segmentation maps using a generic fully convolutional network (FCN). A generative adversarial network (GAN) then regularizes the building boundaries to make them more realistic, i.e., with more rectilinear outlines that form right angles where required. This is achieved through the interplay between the discriminator, which estimates the probability of the input image being real, and the generator, which learns from the discriminator's response to create more realistic images. Finally, we train a backbone convolutional neural network (CNN) adapted to predict sparse outcomes corresponding to building corners from the regularized building segmentation results. Experiments on three building segmentation datasets demonstrate that the proposed method not only obtains accurate results, but also produces visually pleasing building outlines parameterized as polygons.

Point In: Counting Trees with Weakly Supervised Segmentation Network

Pinmo Tong, Shuhui Bu, Pengcheng Han

Auto-TLDR; Weakly Supervised Tree Counting Using a Deep Segmentation Network with Localization and Mask Prediction

Traditional image processing methods for tree counting require expensive feature engineering and are not end-to-end frameworks, which introduces additional noise and prevents overall optimization, so they have not been widely used in recent tree counting applications. Recently, many deep learning based approaches have been designed for this task because of their powerful feature extraction ability. The representative approach is the bounding-box based supervised method, but it requires time-consuming annotations and has difficulty overcoming occlusion and overlap. To solve this problem, we propose a weakly supervised tree counting network (WTCNet) based on a deep segmentation network with only point supervision. It simultaneously performs tree counting with localization and outputs a mask for each tree. We first adopt a novel feature extractor network (FENet) to obtain features of the input images, and then introduce an effective strategy to deal with different mask predictions. Finally, we propose a basic localization guidance accompanied by a rectification guidance to train the network. We create two different datasets and select an existing challenging plant dataset to evaluate our method on three different tasks. Experimental results show clear performance improvements over existing methods. Further study shows that our method has great potential to reduce human labor and provide effective ground-truth masks, and the results show the superiority of our method over advanced methods.

Tiny Object Detection in Aerial Images

Jinwang Wang, Wen Yang, Haowen Guo, Ruixiang Zhang, Gui-Song Xia

Auto-TLDR; Tiny Object Detection in Aerial Images Using Multiple Center Points Based Learning Network

Object detection in Earth Vision has achieved great progress in recent years. However, tiny object detection in aerial images remains a very challenging problem since the tiny objects contain a small number of pixels and are easily confused with the background. To advance tiny object detection research in aerial images, we present a new dataset for Tiny Object Detection in Aerial Images (AI-TOD). Specifically, AI-TOD comes with 700,621 object instances for eight categories across 28,036 aerial images. Compared to existing object detection datasets in aerial images, the mean size of objects in AI-TOD is about 12.8 pixels, which is much smaller than others. To build a benchmark for tiny object detection in aerial images, we evaluate the state-of-the-art object detectors on our AI-TOD dataset. Experimental results show that direct application of these approaches on AI-TOD produces suboptimal object detection results, thus new specialized detectors for tiny object detection need to be designed. Therefore, we propose a multiple center points based learning network (M-CenterNet) to improve the localization performance of tiny object detection, and experimental results show the significant performance gain over the competitors.

P2 Net: Augmented Parallel-Pyramid Net for Attention Guided Pose Estimation

Luanxuan Hou, Jie Cao, Yuan Zhao, Haifeng Shen, Jian Tang, Ran He

Auto-TLDR; Parallel-Pyramid Net with Partial Attention for Human Pose Estimation

The target of human pose estimation is to determine the body parts and joint locations of persons in an image. Angular changes, motion blur, occlusion, and other factors in natural scenes make this task challenging, and some joints are harder to detect than others. In this paper, we propose an augmented Parallel-Pyramid Net (P^2Net) with a partial attention module. During data preprocessing, we propose a differentiable auto data augmentation (DA^2) method in which sequences of data augmentations are formulated as a trainable and operational Convolutional Neural Network (CNN) component. DA^2 improves training efficiency and effectiveness. A parallel pyramid structure then compensates for the information loss introduced by the network: we optimize the backbone by adopting a new parallel structure without increasing the overall computational complexity. To further refine the results after the global predictions are complete, a Partial Attention Module (PAM) is defined to extract weighted features from the different-scale feature maps generated by the parallel pyramid structure. Compared with traditional up-sampling refinement, PAM better captures the relationships between channels. Experiments corroborate the effectiveness of our proposed method. Notably, it achieves the best performance on the challenging MSCOCO and MPII datasets.

EAGLE: Large-Scale Vehicle Detection Dataset in Real-World Scenarios Using Aerial Imagery

Seyed Majid Azimi, Reza Bahmanyar, Corentin Henry, Franz Kurz

Auto-TLDR; EAGLE: A Large-Scale Dataset for Multi-class Vehicle Detection with Object Orientation Information in Airborne Imagery

Multi-class vehicle detection from airborne imagery with orientation estimation is an important task in the near and remote vision domains, with applications in traffic monitoring and disaster management. In the last decade, we have witnessed significant progress in object detection in ground imagery, but it is still in its infancy for airborne imagery, mostly due to the scarcity of diverse and large-scale datasets. Despite being useful tools for different applications, current airborne datasets only partially reflect the challenges of real-world scenarios. To address this issue, we introduce EAGLE (oriEnted object detection using Aerial imaGery in real-worLd scEnarios), a large-scale dataset for multi-class vehicle detection with object orientation information in aerial imagery. It features high-resolution aerial images covering a wide variety of real-world situations in terms of camera sensor, resolution, flight altitude, weather, illumination, haze, shadow, time, city, country, occlusion, and camera angle. The annotation was done by airborne imagery experts for small- and large-vehicle classes. EAGLE contains 215,986 instances annotated with oriented bounding boxes defined by four points and an orientation, making it by far the largest dataset to date for this task. It also supports research on haze and shadow removal as well as super-resolution and in-painting applications. We define three tasks: detection by (1) horizontal bounding boxes, (2) rotated bounding boxes, and (3) oriented bounding boxes. We carried out several experiments evaluating state-of-the-art object detection methods on our dataset to form a baseline. The experiments show that the EAGLE dataset accurately reflects real-world situations and correspondingly challenging applications. The dataset will be made publicly available.

Construction Worker Hardhat-Wearing Detection Based on an Improved BiFPN

Chenyang Zhang, Zhiqiang Tian, Jingyi Song, Yaoyue Zheng, Bo Xu

Auto-TLDR; A One-Stage Object Detection Method for Hardhat-Wearing in Construction Site

Construction work is considered one of the occupations with the highest safety risk, so safety plays an important role on construction sites. One of the most fundamental safety rules on a construction site is to wear a hardhat. To strengthen construction site safety, most current methods use multi-stage approaches for hardhat-wearing detection, which have limitations in adaptability and generalizability. In this paper, we propose a one-stage object detection method based on a convolutional neural network. We present a multi-scale strategy that selects the high-resolution feature maps of DarkNet-53 to effectively identify small-scale hardhats. In addition, we propose an improved weighted bi-directional feature pyramid network (BiFPN), which fuses more semantic features from more scales. The proposed method can not only detect hardhat-wearing, but also identify the color of the hardhat. Experimental results show that the proposed method achieves a mAP of 87.04%, outperforming several state-of-the-art methods on a public dataset.
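
A weighted BiFPN fusion node is commonly implemented as "fast normalized fusion": each input feature map gets a learnable non-negative weight, normalized so the weights sum to roughly one. The sketch below shows this generic fusion node; whether the paper's improved BiFPN uses exactly this form is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Weighted fusion node in the BiFPN style: learnable non-negative
    per-input weights, normalized before blending the feature maps."""
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, feats):  # feats: list of same-shape (N, C, H, W)
        w = F.relu(self.w)                     # keep weights non-negative
        w = w / (w.sum() + self.eps)           # normalize to sum ~1
        return sum(wi * f for wi, f in zip(w, feats))

f1, f2 = torch.randn(1, 64, 40, 40), torch.randn(1, 64, 40, 40)
print(WeightedFusion(2)([f1, f2]).shape)  # torch.Size([1, 64, 40, 40])
```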

Cascade Saliency Attention Network for Object Detection in Remote Sensing Images

Dayang Yu, Rong Zhang, Shan Qin

Auto-TLDR; Cascade Saliency Attention Network for Object Detection in Remote Sensing Images

Object detection in remote sensing images is a challenging task because objects seen from the bird's-eye perspective appear with arbitrary orientations. Though considerable progress has been made, challenges remain with interference from complex backgrounds, dense arrangements, and large scale variations. In this paper, we propose an oriented detector named Cascade Saliency Attention Network (CSAN), designed to comprehensively suppress interference in remote sensing images. Specifically, we first combine context and pixel attention on the feature maps to enhance object saliency and suppress background interference. Then, in the cascade network, we apply instance segmentation to the RoI to increase the saliency of the central object, preventing object features from interfering with one another in dense arrangements. Additionally, to alleviate large scale variations, we devise a multi-scale merge module in the FPN merging process to learn richer scale representations. Experimental results on the DOTA and HRSC2016 datasets show that our method outperforms other state-of-the-art object detection methods and verify its effectiveness.

Hierarchical Head Design for Object Detectors

Shivang Agarwal, Frederic Jurie

Auto-TLDR; Hierarchical Anchor for SSD Detector

The notion of anchor plays a major role in modern detection algorithms such as Faster R-CNN or the SSD detector. Anchors relate the features of a detector's last layers to bounding boxes containing objects in images. Despite their importance, the object detection literature has paid little attention to them. The motivation of this paper comes from two observations: (i) each anchor learns to classify and regress candidate objects independently, and (ii) insufficient examples are available per anchor on small-scale datasets. This paper addresses these issues by proposing a novel hierarchical head for the SSD detector. The new design adds no extra weights at inference time compared to the original design, while improving detector performance for small training sets. Improved performance on PASCAL VOC and state-of-the-art performance on FlickrLogos-47 validate the method. We also show when the proposed design does not give an additional performance gain over the original design.
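One possible reading of a hierarchical head, sketched below purely for illustration (this is our speculative interpretation, not the paper's exact design), is a branch shared by all anchors at a location that predicts a coarse objectness score, combined with per-anchor class logits, so that anchors no longer learn entirely independently:

import torch
import torch.nn as nn

class HierarchicalHead(nn.Module):
    def __init__(self, channels, n_anchors, n_classes):
        super().__init__()
        self.shared = nn.Conv2d(channels, 1, 3, padding=1)   # anchor-shared
        self.cls = nn.Conv2d(channels, n_anchors * n_classes, 3, padding=1)
        self.n_anchors, self.n_classes = n_anchors, n_classes

    def forward(self, x):                        # x: (B, C, H, W)
        b, _, h, w = x.shape
        obj = torch.sigmoid(self.shared(x))      # coarse objectness (B,1,H,W)
        cls = self.cls(x).view(b, self.n_anchors, self.n_classes, h, w)
        # combine coarse and fine scores in logit space
        return cls + torch.log(obj + 1e-6).unsqueeze(1)

head = HierarchicalHead(channels=256, n_anchors=6, n_classes=21)
print(head(torch.randn(1, 256, 38, 38)).shape)  # (1, 6, 21, 38, 38)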

Foreground-Focused Domain Adaption for Object Detection

Yuchen Yang, Nilanjan Ray

Auto-TLDR; Foreground-Focused Unsupervised Domain Adaptation for Object Detection

Object detectors suffer accuracy loss caused by domain shift from a source to a target domain. Unsupervised domain adaptation (UDA) approaches mitigate this loss by training with unlabeled target-domain images. A popular processing pipeline applies adversarial training to align the feature distributions of the two domains. We argue that aligning full image-level features is not ideal for UDA object detection because of the varied background areas encountered during inference. Thus, we propose a novel foreground-focused domain adaptation (FFDA) framework which masks the loss of the domain discriminators so that backpropagation concentrates on the foreground loss. We obtain mining masks by combining target predictions and source labels to outline foreground regions, and apply the masks to the image- and instance-level domain discriminators to allow backpropagation only on the mined regions. By reinforcing this foreground-focused adaptation throughout multiple layers of the detector, we gain a significant accuracy boost on target-domain prediction. Compared to previous works, our method reaches new state-of-the-art accuracy on adapting Cityscapes to Foggy Cityscapes and demonstrates competitive accuracy on other datasets covering various autonomous driving scenarios.
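The masking step can be illustrated with a short PyTorch sketch: the per-pixel discriminator loss is multiplied by a binary foreground mask so only mined regions backpropagate. The function name and shapes are our assumptions, and the mask-mining itself (from target predictions and source labels) is abstracted away:

import torch
import torch.nn.functional as F

def masked_domain_loss(disc_logits, domain_label, fg_mask):
    # disc_logits: (B,1,H,W); domain_label: 0 = source, 1 = target;
    # fg_mask: (B,1,H,W) binary float mask of mined foreground regions
    target = torch.full_like(disc_logits, float(domain_label))
    loss = F.binary_cross_entropy_with_logits(disc_logits, target,
                                              reduction='none')
    loss = loss * fg_mask                    # zero out background pixels
    return loss.sum() / fg_mask.sum().clamp(min=1.0)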

A Fast and Accurate Object Detector for Handwritten Digit String Recognition

Jun Guo, Wenjing Wei, Yifeng Ma, Cong Peng

Auto-TLDR; ChipNet: An anchor-free object detector for handwritten digit string recognition

Focusing on handwritten digit string recognition (HDSR), we propose an anchor-free object detector called ChipNet with a novel encoding method. The input image is divided into columns, and these columns are encoded using the ground truth. Adjacent columns are responsible for detecting the same target, which addresses the class-imbalance problem while reducing network computation. ChipNet is composed of convolutional and bidirectional long short-term memory networks. Unlike typical detectors, it uses no region proposals, anchors, or region-of-interest pooling, and thereby avoids the shortcomings of anchor-based and dense detectors in HDSR. Experiments are conducted on synthetic digit strings, the CVL HDS database, and the ORAND-CAR-A and B databases. ChipNet achieves high accuracies that surpass previously reported results by a large margin (up to 6.62%). Furthermore, it runs at 219 FPS on 160×32 px images on a Tesla P100 GPU. The results also show that ChipNet can handle touching, connected, and arbitrary-length digit strings, with HDSR accuracies as high as those of single handwritten digit recognition.
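A loose sketch of a column-wise detector in this spirit is given below: a small CNN produces per-column features, a BiLSTM relates adjacent columns, and a linear head emits per-column digit scores. All layer sizes are illustrative assumptions, not the paper's architecture:

import torch
import torch.nn as nn

class ColumnDetector(nn.Module):
    def __init__(self, n_classes=11):            # 10 digits + background
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                # pool height only, keep columns
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),     # collapse height to one row
        )
        self.lstm = nn.LSTM(64, 64, bidirectional=True, batch_first=True)
        self.head = nn.Linear(128, n_classes)

    def forward(self, x):                        # x: (B, 1, 32, W)
        f = self.cnn(x).squeeze(2)               # (B, 64, W)
        f, _ = self.lstm(f.transpose(1, 2))      # (B, W, 128)
        return self.head(f)                      # per-column class logits

logits = ColumnDetector()(torch.randn(2, 1, 32, 160))
print(logits.shape)                              # torch.Size([2, 160, 11])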

3D Semantic Labeling of Photogrammetry Meshes Based on Active Learning

Mengqi Rong, Shuhan Shen, Zhanyi Hu

Auto-TLDR; 3D Semantic Expression of Urban Scenes Based on Active Learning

Because different urban scenes are similar yet not completely consistent, and labeling directly in 3D is complex, high-level understanding of 3D scenes has always been a tricky problem. In this paper, we propose a procedural approach to 3D semantic expression of urban scenes based on active learning. We first start with a small labeled image set to fine-tune a semantic segmentation network, then project its probability maps onto a 3D mesh model for fusion, and finally output a 3D semantic mesh model in which each facet has a semantic label, together with a heat model showing each facet's confidence. The algorithm is iterative: in each iteration, we use the output semantic model as supervision to select several valuable images for annotation, which then co-participate in the fine-tuning for overall improvement. In this way, we reduce the labeling workload without reducing the quality of the 3D semantic model. Using urban areas from two different cities, we show the potential of our method and demonstrate its effectiveness.
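One plausible form of the selection step, sketched here under our own assumptions about the data structures, is to rank candidate images by the mean confidence of the mesh facets they observe and pick the least-confident views for annotation:

def select_images_for_labeling(facet_confidence, image_to_facets, k=5):
    # facet_confidence: dict facet_id -> confidence in [0, 1]
    # image_to_facets: dict image_id -> list of facet ids visible in it
    def mean_conf(img):
        facets = image_to_facets[img]
        return sum(facet_confidence[f] for f in facets) / max(len(facets), 1)
    ranked = sorted(image_to_facets, key=mean_conf)
    return ranked[:k]        # most "valuable" = least confident views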

Incorporating Depth Information into Few-Shot Semantic Segmentation

Yifei Zhang, Desire Sidibe, Olivier Morel, Fabrice Meriaudeau

Auto-TLDR; RDNet: A Deep Neural Network for Few-shot Segmentation Using Depth Information

Few-shot segmentation presents a significant challenge for semantic scene understanding under limited supervision: the task aims to generalize a model's segmentation ability to new categories given only a few samples. To obtain more complete scene information, we extend RGB-centric methods to take advantage of complementary depth information. In this paper, we propose a two-stream deep neural network based on metric learning. Our method, called RDNet, learns class-specific prototype representations within the RGB and depth embedding spaces, respectively. The learned prototypes provide effective semantic guidance for the corresponding RGB and depth query images, leading to more accurate segmentation. Moreover, we build a novel outdoor scene dataset, Cityscapes-3i, using labeled RGB and depth images from the Cityscapes dataset. We also perform ablation studies to explore the effective use of depth information in few-shot segmentation. Experiments on Cityscapes-3i show that our method achieves promising results using visual and complementary geometric cues from only a few labeled examples.
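A common realization of prototype-based few-shot matching in metric-learning segmentation is masked average pooling plus cosine similarity, sketched below for a single stream (RDNet runs parallel RGB and depth streams; the function names and shapes here are our assumptions):

import torch
import torch.nn.functional as F

def masked_average_prototype(feat, mask):
    # feat: (C, H, W) support features; mask: (H, W) binary class mask
    masked = feat * mask.unsqueeze(0)
    return masked.sum(dim=(1, 2)) / mask.sum().clamp(min=1.0)   # (C,)

def cosine_score_map(query_feat, prototype):
    # query_feat: (C, H, W); returns an (H, W) similarity map
    q = F.normalize(query_feat, dim=0)
    p = F.normalize(prototype, dim=0)
    return (q * p.view(-1, 1, 1)).sum(dim=0)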

RescueNet: Joint Building Segmentation and Damage Assessment from Satellite Imagery

Rohit Gupta, Mubarak Shah

Auto-TLDR; RescueNet: End-to-End Building Segmentation and Damage Classification for Humanitarian Aid and Disaster Response

Accurate and fine-grained information about the extent of damage to buildings is essential for directing Humanitarian Aid and Disaster Response (HADR) operations in the immediate aftermath of a natural calamity. In recent years, satellite and UAV (drone) imagery has been used for this purpose, sometimes aided by computer vision algorithms. Existing computer vision approaches to building damage assessment typically rely on a two-stage pipeline: building detection with an object detection model, followed by damage assessment through classification of the detected building tiles. These multi-stage methods are not end-to-end trainable and suffer from poor overall results. We propose RescueNet, a unified model that simultaneously segments buildings and assesses damage levels for individual buildings, and that can be trained end-to-end. To model the composite nature of this problem, we propose a novel localization-aware loss function consisting of a binary cross-entropy loss for building segmentation and a foreground-only selective categorical cross-entropy loss for damage classification, and show significant improvement over the widely used cross-entropy loss. RescueNet is tested on the large-scale and diverse xBD dataset, achieves significantly better building segmentation and damage classification performance than previous methods, and generalizes across varied geographical regions and disaster types.
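A hedged sketch of such a localization-aware loss follows: a binary cross-entropy for building/background segmentation plus a categorical cross-entropy for the damage level computed only on foreground (building) pixels. Exact weighting and shapes in the paper may differ:

import torch
import torch.nn.functional as F

def localization_aware_loss(seg_logits, dmg_logits, building_mask, dmg_labels):
    # seg_logits: (B,1,H,W); dmg_logits: (B,K,H,W)
    # building_mask: (B,1,H,W) binary float; dmg_labels: (B,H,W) int64
    seg_loss = F.binary_cross_entropy_with_logits(seg_logits, building_mask)
    ce = F.cross_entropy(dmg_logits, dmg_labels, reduction='none')  # (B,H,W)
    fg = building_mask.squeeze(1)
    dmg_loss = (ce * fg).sum() / fg.sum().clamp(min=1.0)  # foreground only
    return seg_loss + dmg_loss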

Context for Object Detection Via Lightweight Global and Mid-Level Representations

Mesut Erhan Unal, Adriana Kovashka

Auto-TLDR; Context-Based Object Detection with Semantic Similarity

We propose an approach for explicitly capturing context in object detection. We model visual and geometric relationships between object regions, and also model the global scene as a first-class participant. In contrast to prior approaches, both the context we rely on and our proposed mechanism for belief propagation over regions are lightweight. We also experiment with capturing similarities between regions at a semantic level by modeling class co-occurrence and linguistic similarity between class names. We show that our approach significantly outperforms Faster R-CNN and performs competitively with a much more costly approach that also models context.
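As an illustration of one ingredient mentioned above (our sketch, not the paper's model), a class co-occurrence prior can be estimated from training annotations and used to rescore detections in an image:

from collections import Counter
from itertools import combinations

def cooccurrence_prior(image_labels):
    # image_labels: list of per-image sets of class names
    pair_counts, class_counts = Counter(), Counter()
    for labels in image_labels:
        class_counts.update(labels)
        pair_counts.update(frozenset(p) for p in combinations(sorted(labels), 2))
    return pair_counts, class_counts

def rescore(detections, pair_counts, class_counts, alpha=0.1):
    # detections: list of (class_name, score); boost classes that
    # frequently co-occur with the image's highest-scoring class
    top = max(detections, key=lambda d: d[1])[0]
    out = []
    for cls, score in detections:
        co = pair_counts[frozenset((top, cls))] if cls != top else 0
        boost = alpha * co / max(class_counts[top], 1)
        out.append((cls, min(score + boost, 1.0)))
    return out

pairs, counts = cooccurrence_prior([{"car", "person"}, {"car", "bicycle"}])
print(rescore([("car", 0.9), ("bicycle", 0.4)], pairs, counts))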

Iterative Bounding Box Annotation for Object Detection

Bishwo Adhikari, Heikki Juhani Huttunen

Auto-TLDR; Semi-Automatic Bounding Box Annotation for Object Detection in Digital Images

Manual annotation of bounding boxes for object detection in digital images is tedious, time-consuming, and resource-intensive. In this paper, we propose a semi-automatic method for efficient bounding box annotation. The method trains the object detector iteratively on small batches of labeled images and learns to propose bounding boxes for the next batch, after which the human annotator only needs to correct any errors. We propose an experimental setup that simulates the human actions and use it to compare different iteration strategies, such as the order in which the data is presented to the annotator. We evaluate our method on three datasets and show that it can reduce human annotation effort significantly, saving up to 75% of the total manual annotation work.
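The iterative loop can be summarized in a few lines of Python; here `train`, `predict`, and `human_correct` are placeholders (our naming) for detector training, inference, and manual correction of proposals:

def iterative_annotation(batches, train, predict, human_correct):
    labeled = []
    model = None
    for batch in batches:
        # propose boxes with the current model (none for the first batch)
        proposals = predict(model, batch) if model else [[] for _ in batch]
        labels = [human_correct(img, props)       # annotator fixes errors only
                  for img, props in zip(batch, proposals)]
        labeled.extend(zip(batch, labels))
        model = train(labeled)                    # retrain on all data so far
    return model, labeled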

PSDNet: A Balanced Architecture of Accuracy and Parameters for Semantic Segmentation

Yue Liu, Zhichao Lian

Auto-TLDR; Pyramid Pooling Module with SE1Cblock and D2SUpsample Network (PSDNet)

In this paper, we present our Pyramid Pooling Module (PPM) with SE1C block and D2SUpsample Network (PSDNet), a novel architecture for accurate semantic segmentation. Starting from the well-known Pyramid Scene Parsing Network (PSPNet), PSDNet augments the pyramid pooling structure with a channel attention module and a feature transform module inside the PPM. The PPM enhanced with these two components strengthens the context information flowing through the network instead of damaging it. The channel attention module is an improved "Squeeze and Excitation with 1D Convolution" (SE1C) block, which explicitly models the interrelationships between channels with fewer parameters. The feature transform module, named "Depth to Space Upsampling" (D2SUpsample), preserves feature integrity by transforming features while interpolating them, at the same time reducing parameters. In addition, we introduce a joint strategy in the SE1C block that combines two variants of global pooling without increasing the parameter count. Compared with PSPNet, our work achieves higher accuracy on public datasets, reaching 73.97% mIoU and 82.89% mAcc on the Cityscapes dataset with a ResNet50 backbone.
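Both named components have simple, well-known building blocks, sketched below under our own assumptions (kernel size, channel counts, and average pooling only, whereas the paper combines two pooling variants): a channel attention whose excitation is a 1D convolution across channels, and depth-to-space upsampling via pixel shuffle:

import torch
import torch.nn as nn

class SE1C(nn.Module):
    # Channel attention whose excitation is a 1D conv over channels.
    def __init__(self, k=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, k, padding=k // 2, bias=False)
        self.gate = nn.Sigmoid()

    def forward(self, x):                          # x: (B, C, H, W)
        s = x.mean(dim=(2, 3))                     # global average pool: (B, C)
        w = self.conv(s.unsqueeze(1)).squeeze(1)   # 1D conv across channels
        return x * self.gate(w).view(x.size(0), -1, 1, 1)

class D2SUpsample(nn.Module):
    # Upsample by rearranging channels into space (pixel shuffle).
    def __init__(self, channels, scale=2):
        super().__init__()
        self.expand = nn.Conv2d(channels, channels * scale ** 2, 1)
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, x):
        return self.shuffle(self.expand(x))

x = torch.randn(1, 64, 32, 32)
print(D2SUpsample(64)(SE1C()(x)).shape)   # torch.Size([1, 64, 64, 64])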

Cross-Domain Semantic Segmentation of Urban Scenes Via Multi-Level Feature Alignment

Bin Zhang, Shengjie Zhao, Rongqing Zhang

Auto-TLDR; Cross-Domain Semantic Segmentation Using Generative Adversarial Networks

Semantic segmentation is an essential task in many real-life applications such as virtual reality, video analysis, and autonomous driving. Recent advances in fundamental vision tasks, from image classification to semantic segmentation, have demonstrated deep learning models' high capability to learn complicated representations from large datasets. Nevertheless, manually labeling a semantic segmentation dataset with pixel-level annotations is extremely labor-intensive. To address this problem, we propose a novel multi-level feature alignment framework for cross-domain semantic segmentation of urban scenes that exploits generative adversarial networks. In the proposed method, we first translate images from one domain to the other. Then the discriminative feature representations extracted by the deep neural network are concatenated, followed by domain adversarial learning to bring the intermediate feature distribution of target-domain images close to that of the source domain. With these domain adaptation techniques, models trained on source-domain images, where labels are easy to acquire, can be deployed to a target domain where labels are scarce. Experimental evaluations on mainstream benchmarks confirm the effectiveness and robustness of our approach.
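The adversarial feature-alignment step can be sketched as follows: a small per-pixel discriminator classifies whether features come from the source or the target domain, and the segmentation network is trained to fool it on target features. The discriminator architecture and the 256-channel feature size are illustrative assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

disc = nn.Sequential(                         # per-pixel domain classifier
    nn.Conv2d(256, 64, 1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 1, 1),
)

def discriminator_loss(src_feat, tgt_feat):
    src_logits, tgt_logits = disc(src_feat), disc(tgt_feat)
    loss_src = F.binary_cross_entropy_with_logits(
        src_logits, torch.zeros_like(src_logits))   # source -> label 0
    loss_tgt = F.binary_cross_entropy_with_logits(
        tgt_logits, torch.ones_like(tgt_logits))    # target -> label 1
    return loss_src + loss_tgt

def alignment_loss(tgt_feat):
    logits = disc(tgt_feat)                   # fool the discriminator:
    return F.binary_cross_entropy_with_logits(
        logits, torch.zeros_like(logits))     # make target look like source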