Mutual-Supervised Feature Modulation Network for Occluded Pedestrian Detection

Ye He, Chao Zhu, Xu-Cheng Yin

Responsive image

Auto-TLDR; A Mutual-Supervised Feature Modulation Network for Occluded Pedestrian Detection

State-of-the-art pedestrian detectors have achieved significant progress on non-occluded pedestrians, yet they are still struggling under heavy occlusions. The recent occlusion handling strategy of popular two-stage approaches is to build a two-branch architecture with the help of additional visible body annotations. Nonetheless, these methods still have some weaknesses. Either the two branches are trained independently with only score-level fusion, which cannot guarantee the detectors to learn robust enough pedestrian features. Or the attention mechanisms are exploited to only emphasize on the visible body features. However, the visible body features of heavily occluded pedestrians are concentrated on a relatively small area, which will easily cause missing detections. To address the above issues, we propose in this paper a novel Mutual-Supervised Feature Modulation (MSFM) network, to better handle occluded pedestrian detection. The key MSFM module in our network calculates the similarity loss of full body boxes and visible body boxes corresponding to the same pedestrian, so that the full-body detector could learn more complete and robust pedestrian features with the assist of contextual features from the occluding parts. To facilitate the MSFM module, we also propose a novel two-branch architecture, consisting of a standard full body detection branch and an extra visible body classification branch. These two branches are trained in a mutual-supervised way with full body annotations and visible body annotations, respectively. To verify the effectiveness of our proposed method, extensive experiments are conducted on two challenging pedestrian datasets: Caltech and CityPersons, and our approach achieves superior performances compared to other state-of-the-art methods on both datasets, especially in heavy occlusion cases.

Similar papers

DualBox: Generating BBox Pair with Strong Correspondence Via Occlusion Pattern Clustering and Proposal Refinement

Zheng Ge, Chuyu Hu, Xin Huang, Baiqiao Qiu, Osamu Yoshie

Responsive image

Auto-TLDR; R2NMS: Combining Full and Visible Body Bounding Box for Dense Pedestrian Detection

Slides Poster Similar

Despite the rapid development of pedestrian detection, the problem of dense pedestrian detection is still unsolved, especially the upper limit of Recall caused by Non-Maximum-Suppression (NMS). Out of this reason, R2NMS is proposed to simultaneously detect full and visible body bounding boxes, by replacing the full body BBoxes with less occluded visible body BBoxes in the NMS algorithm, achieving a higher recall. However, the P-RPN and P-RCNN modules proposed in R2NMS for simultaneous high quality full and visible body prediction require non-trivial positive/negative assigning strategies for anchor BBoxes. To simplify the prerequisites and improve the utility of R2NMS, we incorporate clustering analysis into the learning of visible body proposals from full body proposals. Furthermore, to reduce the computation complexity caused by the large number of potential visible body proposals, we introduce a novel occlusion pattern prediction branch on top of the R-CNN module (i.e. F-RCNN) to select the best matched visible proposals for each full body proposals and then feed them into another R-CNN module (i.e. V-RCNN). Incorporated with R2NMS, our DualBox model can achieve competitive performance while only requires few hyper-parameters. We validate the effectiveness of the proposed approach on the CrowdHuman and CityPersons datasets. Experimental results show that our approach achieves promising performance for detecting both non-occluded and occluded pedestrians, especially heavily occluded ones.

Nighttime Pedestrian Detection Based on Feature Attention and Transformation

Gang Li, Shanshan Zhang, Jian Yang

Responsive image

Auto-TLDR; FAM and FTM: Enhanced Feature Attention Module and Feature Transformation Module for nighttime pedestrian detection

Slides Poster Similar

Pedestrian detection at nighttime is an important yet challenging task, which is fundamental for many practical applications, e.g. autonomous driving, video surveillance. To address this problem, in this work we start with some analysis, from which we find that the nighttime features have much more noise than that of daytime, resulting in low discrimination ability. Besides, we also observe some pedestrian examples are under adverse illumination conditions, and they can hardly provide sufficient information for accurate detection. Based on these findings, we propose the Feature Attention Module (FAM) and Feature Transformation Module (FTM) to enhance nighttime features. In FAM, guided by progressive segmentation supervision, hierarchical feature attention is produced to enhance multi-level features. On the other hand, FTM is introduced to enforce features from adverse illumination to approach that from better illumination. Based on feature attention and transformation (FAT) mechanism, a two-stage detector called FATNet is constructed for nighttime pedestrian detection. We conduct extensive experiments on nighttime datasets of EuroCity Persons (Night) and NightOwls to demonstrate the effectiveness of our method. On both two datasets, our method achieves significant improvements to the baseline and also outperforms state-of-the-art detectors.

Learning a Dynamic High-Resolution Network for Multi-Scale Pedestrian Detection

Mengyuan Ding, Shanshan Zhang, Jian Yang

Responsive image

Auto-TLDR; Learningable Dynamic HRNet for Pedestrian Detection

Slides Poster Similar

Pedestrian detection is a canonical instance of object detection in computer vision. In practice, scale variation is one of the key challenges, resulting in unbalanced performance across different scales. Recently, the High-Resolution Network (HRNet) has become popular because high-resolution feature representations are more friendly to small objects. However, when we apply HRNet for pedestrian detection, we observe that it improves for small pedestrians on one hand, but hurts the performance for larger ones on the other hand. To overcome this problem, we propose a learnable Dynamic HRNet (DHRNet) aiming to generate different network paths adaptive to different scales. Specifically, we construct a parallel multi-branch architecture and add a soft conditional gate module allowing for dynamic feature fusion. Both branches share all the same parameters except the soft gate module. Experimental results on CityPersons and Caltech benchmarks indicate that our proposed dynamic HRNet is more capable of dealing with pedestrians of various scales, and thus improves the performance across different scales consistently.

MagnifierNet: Learning Efficient Small-Scale Pedestrian Detector towards Multiple Dense Regions

Qi Cheng, Mingqin Chen, Yingjie Wu, Fei Chen, Shiping Lin

Responsive image

Auto-TLDR; MagnifierNet: A Simple but Effective Small-Scale Pedestrian Detection Towards Multiple Dense Regions

Slides Poster Similar

Despite the success of pedestrian detection, there is still a significant gap in the performance of the detection of pedestrians at different scales. Detecting small-scale pedestrians is extremely challenging due to the low resolution of their convolution features which is essential for downstream classifiers. To address this issue, we observed pedestrian datasets and found that pedestrians often gather together in crowded public places. Then we propose MagnifierNet, a simple but effective small-scale pedestrian detector towards multiple dense regions. MagnifierNet uses our proposed sweep-line based grouping algorithm to find dense regions based on the number of pedestrians in the grouped region. And we adopt a new definition of small-scale pedestrians through grid search and KL-divergence. Besides, our grouping method can also be used as a new strategy for pedestrian data augmentation. The ablation study demonstrates that MagnifierNet improves the representation of small-scale pedestrians. We validate the effectiveness of MagnifierNet on CityPersons and KITTI datasets. Experimental results show that MagnifierNet achieves the best small-scale pedestrian detection performance on CityPersons benchmark without any external data, and also achieves competitive performance for detecting small-scale pedestrians on KITTI dataset without bells and whistles.

PRF-Ped: Multi-Scale Pedestrian Detector with Prior-Based Receptive Field

Yuzhi Tan, Hongxun Yao, Haoran Li, Xiusheng Lu, Haozhe Xie

Responsive image

Auto-TLDR; Bidirectional Feature Enhancement Module for Multi-Scale Pedestrian Detection

Slides Poster Similar

Multi-scale feature representation is a common strategy to handle the scale variation in pedestrian detection. Existing methods simply utilize the convolutional pyramidal features for multi-scale representation. However, they rarely pay attention to the differences among different feature scales and extract multi-scale features from a single feature map, which may make the detectors sensitive to scale-variance in multi-scale pedestrian detection. In this paper, we introduce a bidirectional feature enhancement module (BFEM) to augment the semantic information of low-level features and the localization information of high-level features. In addition, we propose a prior-based receptive field block (PRFB) for multi-scale pedestrian feature extraction, where the receptive field is closer to the aspect ratio of the pedestrian target. Consequently, it is less affected by the surrounding background when extracting features. Experimental results indicate that the proposed method outperform the state-of-the-art methods on the CityPersons and Caltech datasets.

HPERL: 3D Human Pose Estimastion from RGB and LiDAR

Michael Fürst, Shriya T.P. Gupta, René Schuster, Oliver Wasenmüler, Didier Stricker

Responsive image

Auto-TLDR; 3D Human Pose Estimation Using RGB and LiDAR Using Weakly-Supervised Approach

Slides Poster Similar

In-the-wild human pose estimation has a huge potential for various fields, ranging from animation and action recognition to intention recognition and prediction for autonomous driving. The current state-of-the-art is focused only on RGB and RGB-D approaches for predicting the 3D human pose. However, not using precise LiDAR depth information limits the performance and leads to very inaccurate absolute pose estimation. With LiDAR sensors becoming more affordable and common on robots and autonomous vehicle setups, we propose an end-to-end architecture using RGB and LiDAR to predict the absolute 3D human pose with unprecedented precision. Additionally, we introduce a weakly-supervised approach to generate 3D predictions using 2D pose annotations from PedX. This allows for many new opportunities in the field of 3D human pose estimation.

FeatureNMS: Non-Maximum Suppression by Learning Feature Embeddings

Niels Ole Salscheider

Responsive image

Auto-TLDR; FeatureNMS: Non-Maximum Suppression for Multiple Object Detection

Slides Poster Similar

Most state of the art object detectors output multiple detections per object. The duplicates are removed in a post-processing step called Non-Maximum Suppression. Classical Non-Maximum Suppression has shortcomings in scenes that contain objects with high overlap: The idea of this heuristic is that a high bounding box overlap corresponds to a high probability of having a duplicate. We propose FeatureNMS to solve this problem. FeatureNMS recognizes duplicates not only based on the intersection over union between bounding boxes, but also based on the difference of feature vectors. These feature vectors can encode more information like visual appearance. Our approach outperforms classical NMS and derived approaches and achieves state of the art performance.

Attentive Part-Aware Networks for Partial Person Re-Identification

Lijuan Huo, Chunfeng Song, Zhengyi Liu, Zhaoxiang Zhang

Responsive image

Auto-TLDR; Part-Aware Learning for Partial Person Re-identification

Slides Poster Similar

Partial person re-identification (re-ID) refers to re-identify a person through occluded images. It suffers from two major challenges, i.e., insufficient training data and incomplete probe image. In this paper, we introduce an automatic data augmentation module and a part-aware learning method for partial re-identification. On the one hand, we adopt the data augmentation to enhance the training data and help learns more stabler partial features. On the other hand, we intuitively find that the partial person images usually have fixed percentages of parts, therefore, in partial person re-id task, the probe image could be cropped from the pictures and divided into several different partial types following fixed ratios. Based on the cropped images, we propose the Cropping Type Consistency (CTC) loss to classify the cropping types of partial images. Moreover, in order to help the network better fit the generated and cropped data, we incorporate the Block Attention Mechanism (BAM) into the framework for attentive learning. To enhance the retrieval performance in the inference stage, we implement cropping on gallery images according to the predicted types of probe partial images. Through calculating feature distances between the partial image and the cropped holistic gallery images, we can recognize the right person from the gallery. To validate the effectiveness of our approach, we conduct extensive experiments on the partial re-ID benchmarks and achieve state-of-the-art performance.

S-VoteNet: Deep Hough Voting with Spherical Proposal for 3D Object Detection

Yanxian Chen, Huimin Ma, Xi Li, Xiong Luo

Responsive image

Auto-TLDR; S-VoteNet: 3D Object Detection with Spherical Bounded Box Prediction

Slides Poster Similar

Current 3D object detection methods adopt an analogous box prediction structure with the 2D methods, which predict center and size of the object simultaneously in a box regression procedure, leading to the poor performance of 3D detector to a great extent. In this work, we propose S-VoteNet, which converts the prediction of 3D bounding box into two parts: center prediction and size prediction. By introducing a novel spherical proposal, S-VoteNet uses vote groups to predict the center and radius of object rather than all parameters of 3D bounding box. The prediction of radius is used to constrain the object size, and the radius-based spherical center loss is applied to measure the geometric distance between the proposal and ground-truth. To make better use of the geometric information provided by point cloud, S-VoteNet gathers seed points whose corresponding votes are within the vote groups for seed group generation. Seed groups are then consumed for box size regression and orientation estimation. By decoupling the localization and size estimation, our method effectively reduces the regression pressure of the 3D detector. Experimental results on SUN RGB-D 3D detection benchmark demonstrate that our S-VoteNet achieves state-of-the-art performance by using only point cloud as input.

Scene Text Detection with Selected Anchors

Anna Zhu, Hang Du, Shengwu Xiong

Responsive image

Auto-TLDR; AS-RPN: Anchor Selection-based Region Proposal Network for Scene Text Detection

Slides Poster Similar

Object proposal technique with dense anchoring scheme for scene text detection were applied frequently to achieve high recall. It results in the significant improvement in accuracy but waste of computational searching, regression and classification. In this paper, we propose an anchor selection-based region proposal network (AS-RPN) using effective selected anchors instead of dense anchors to extract text proposals. The center, scales, aspect ratios and orientations of anchors are learnable instead of fixing, which leads to high recall and greatly reduced numbers of anchors. By replacing the anchor-based RPN in Faster RCNN, the AS-RPN-based Faster RCNN can achieve comparable performance with previous state-of-the-art text detecting approaches on standard benchmarks, including COCO-Text, ICDAR2013, ICDAR2015 and MSRA-TD500 when using single-scale and single model (ResNet50) testing only.

Detecting Objects with High Object Region Percentage

Fen Fang, Qianli Xu, Liyuan Li, Ying Gu, Joo-Hwee Lim

Responsive image

Auto-TLDR; Faster R-CNN for High-ORP Object Detection

Slides Poster Similar

Object shape is a subtle but important factor for object detection. It has been observed that the object-region-percentage (ORP) can be utilized to improve detection accuracy for elongated objects, which have much lower ORPs than other types of objects. In this paper, we propose an approach to improve the detection performance for objects whose ORPs are relatively higher.To address the problem of high-ORP object detection, we propose a method consisting of three steps. First, we adjust the ground truth bounding boxes of high-ORP objects to an optimal range. Second, we train an object detector, Faster R-CNN, based on adjusted bounding boxes to achieve high recall. Finally, we train a DCNN to learn the adjustment ratios towards four directions and adjust detected bounding boxes of objects to get better localization for higher precision. We evaluate the effectiveness of our method on 12 high-ORP objects in COCO and 8 objects in a proprietary gearbox dataset. The experimental results show that our method can achieve state-of-the-art performance on these objects while costing less resources in training and inference stages.

Adaptive Word Embedding Module for Semantic Reasoning in Large-Scale Detection

Yu Zhang, Xiaoyu Wu, Ruolin Zhu

Responsive image

Auto-TLDR; Adaptive Word Embedding Module for Object Detection

Slides Poster Similar

In recent years, convolutional neural networks have achieved rapid development in the field of object detection. However, due to the imbalance of data, high costs in labor and uneven level of data labeling, the overall performance of the previous detection network has dropped sharply when dataset extended to the large-scale with hundreds and thousands categories. We present the Adaptive Word Embedding Module, extracting the adaptive semantic knowledge graph to reach semantic consistency within one image. Our method endows the ability to infer global semantic of detection networks without other attribute or relationship annotations. Compared with Faster RCNN, the algorithm on the MSCOCO dataset was significantly improved by 4.1%, and the mAP value has reached 32.8%. On the VG1000 dataset, it increased by 0.9% to 6.7% compared with Faster RCNN. Adaptive Word Embedding Module is lightweight, general-purpose and can be plugged into diverse detection networks. Code will be made available.

SyNet: An Ensemble Network for Object Detection in UAV Images

Berat Mert Albaba, Sedat Ozer

Responsive image

Auto-TLDR; SyNet: Combining Multi-Stage and Single-Stage Object Detection for Aerial Images

Poster Similar

Recent advances in camera equipped drone applications and their widespread use increased the demand on vision based object detection algorithms for aerial images. Object detection process is inherently a challenging task as a generic computer vision problem, however, since the use of object detection algorithms on UAVs (or on drones) is relatively a new area, it remains as a more challenging problem to detect objects in aerial images. There are several reasons for that including: (i) the lack of large drone datasets including large object variance, (ii) the large orientation and scale variance in drone images when compared to the ground images, and (iii) the difference in texture and shape features between the ground and the aerial images. Deep learning based object detection algorithms can be classified under two main categories: (a) single-stage detectors and (b) multi-stage detectors. Both single-stage and multi-stage solutions have their advantages and disadvantages over each other. However, a technique to combine the good sides of each of those solutions could yield even a stronger solution than each of those solutions individually. In this paper, we propose an ensemble network, SyNet, that combines a multi-stage method with a single-stage one with the motivation of decreasing the high false negative rate of multi-stage detectors and increasing the quality of the single-stage detector proposals. As building blocks, CenterNet and Cascade R-CNN with pretrained feature extractors are utilized along with an ensembling strategy. We report the state of the art results obtained by our proposed solution on two different datasets: namely MS-COCO and visDrone with \%52.1 $mAP_{IoU = 0.75}$ is obtained on MS-COCO $val2017$ dataset and \%26.2 $mAP_{IoU = 0.75}$ is obtained on VisDrone $test-set$. Our code is available at: https://github.com/mertalbaba/SyNet}{https://github.com/mer talbaba/SyNet

ACRM: Attention Cascade R-CNN with Mix-NMS for Metallic Surface Defect Detection

Junting Fang, Xiaoyang Tan, Yuhui Wang

Responsive image

Auto-TLDR; Attention Cascade R-CNN with Mix Non-Maximum Suppression for Robust Metal Defect Detection

Slides Poster Similar

Metallic surface defect detection is of great significance in quality control for production. However, this task is very challenging due to the noise disturbance, large appearance variation, and the ambiguous definition of the defect individual. Traditional image processing methods are unable to detect the damaged region effectively and efficiently. In this paper, we propose a new defect detection method, Attention Cascade R-CNN with Mix-NMS (ACRM), to classify and locate defects robustly. Three submodules are developed to achieve this goal: 1) a lightweight attention block is introduced, which can improve the ability in capture global and local feature both in the spatial and channel dimension; 2) we firstly apply the cascade R-CNN to our task, which exploits multiple detectors to sequentially refine the detection result robustly; 3) we introduce a new method named Mix Non-Maximum Suppression (Mix-NMS), which can significantly improve its ability in filtering the redundant detection result in our task. Extensive experiments on a real industrial dataset show that ACRM achieves state-of-the-art results compared to the existing methods, demonstrating the effectiveness and robustness of our detection method.

Construction Worker Hardhat-Wearing Detection Based on an Improved BiFPN

Chenyang Zhang, Zhiqiang Tian, Jingyi Song, Yaoyue Zheng, Bo Xu

Responsive image

Auto-TLDR; A One-Stage Object Detection Method for Hardhat-Wearing in Construction Site

Slides Poster Similar

Work in the construction site is considered to be one of the occupations with the highest safety risk factor. Therefore, safety plays an important role in construction site. One of the most fundamental safety rules in construction site is to wear a hardhat. To strengthen the safety of the construction site, most of the current methods use multi-stage method for hardhat-wearing detection. These methods have limitations in terms of adaptability and generalizability. In this paper, we propose a one-stage object detection method based on convolutional neural network. We present a multi-scale strategy that selects the high-resolution feature maps of DarkNet-53 to effectively identify small-scale hardhats. In addition, we propose an improved weighted bi-directional feature pyramid network (BiFPN), which could fuse more semantic features from more scales. The proposed method can not only detect hardhat-wearing, but also identify the color of the hardhat. Experimental results show that the proposed method achieves a mAP of 87.04%, which outperforms several state-of-the-art methods on a public dataset.

SFPN: Semantic Feature Pyramid Network for Object Detection

Yi Gan, Wei Xu, Jianbo Su

Responsive image

Auto-TLDR; SFPN: Semantic Feature Pyramid Network to Address Information Dilution Issue in FPN

Slides Poster Similar

Feature Pyramid Network(FPN) employs a top-down path to enhance low level feature by utilizing high level feature.However, further improvement of detector is greatly hindered by the inner defect of FPN. The dilution issue in FPN is analyzed in this paper, and a new architecture named Semantic Feature Pyramid Network(SFPN) is introduced to address the information imbalance problem caused by information dilution. The proposed method consists of two simple and effective components: Semantic Pyramid Module(SPM) and Semantic Feature Fusion Module(SFFM). To compensate for the weaknesses of FPN, the semantic segmentation result is utilized as an extra information source in our architecture.By constructing a semantic pyramid based on the segmentation result and fusing it with FPN, feature maps at each level can obtain the necessary information without suffering from the dilution issue. The proposed architecture could be applied on many detectors, and non-negligible improvement could be achieved. Although this method is designed for object detection, other tasks such as instance segmentation can also largely benefit from it. The proposed method brings Faster R-CNN and Mask R-CNN with ResNet-50 as backbone both 1.8 AP improvements respectively. Furthermore, SFPN improves Cascade R-CNN with backbone ResNet-101 from 42.4 AP to 43.5 AP.

Forground-Guided Vehicle Perception Framework

Kun Tian, Tong Zhou, Shiming Xiang, Chunhong Pan

Responsive image

Auto-TLDR; A foreground segmentation branch for vehicle detection

Slides Poster Similar

As the basis of advanced visual tasks such as vehicle tracking and traffic flow analysis, vehicle detection needs to accurately predict the position and category of vehicle objects. In the past decade, deep learning based methods have made great progress. However, we also notice that some existing cases are not studied thoroughly. First, false positive on the background regions is one of the critical problems. Second, most of the previous approaches only optimize a single vehicle detection model, ignoring the relationship between different visual perception tasks. In response to the above two findings, we introduce a foreground segmentation branch for the first time, which can predict the pixel level of vehicles in advance. Furthermore, two attention modules are designed to guide the work of the detection branch. The proposed method can be easily grafted into the one-stage and two-stage detection framework. We evaluate the effectiveness of our model on LSVH, a dataset with large variations in vehicle scales, and achieve the state-of-the-art detection accuracy.

VGG-Embedded Adaptive Layer-Normalized Crowd Counting Net with Scale-Shuffling Modules

Dewen Guo, Jie Feng, Bingfeng Zhou

Responsive image

Auto-TLDR; VadaLN: VGG-embedded Adaptive Layer Normalization for Crowd Counting

Slides Poster Similar

Crowd counting is widely used in real-time congestion monitoring and public security. Due to the limited data, many methods have little ability to be generalized because the differences between feature domains are not taken into consideration. We propose VGG-embedded adaptive layer normalization (VadaLN) to filter the features that irrelevant to the counting tasks in order that the counting results should not be affected by the image quality, color or illumination. VadaLN is implemented on the pretrained VGG-16 backbone. There is no additional learning parameters required through our method. VadaLN incoporates the proposed scale-shuffling modules (SSM) to relax the distortions in upsampling operations. Besides, non-aligned training methdology for the estimation of density maps is leveraged by an adversarial contextual loss (ACL) to improve the counting performance. Based on the proposed method, we construct an end-to-end trainable baseline model without bells and whistles, namely VadaLNet, which outperforms several recent state-of-the-art methods on commonly used challenging standard benchmarks. The intermediate scale-shuffled results are combined to formulate a scale-complementary strategy as a more powerful network, namely as VadaLNeSt. We implement VadaLNeSt on standard benchmarks, e.g. ShanghaiTech (Part A & Part B), UCF_CC_50, and UCF_QNRF, to show the superiority of our method.

Cascade Saliency Attention Network for Object Detection in Remote Sensing Images

Dayang Yu, Rong Zhang, Shan Qin

Responsive image

Auto-TLDR; Cascade Saliency Attention Network for Object Detection in Remote Sensing Images

Slides Poster Similar

Object detection in remote sensing images is a challenging task due to objects in the bird-view perspective appearing with arbitrary orientations. Though considerable progress has been made, there still exist challenges with the interference from complex backgrounds, dense arrangement, and large-scale variations. In this paper, we propose an oriented detector named Cascade Saliency Attention Network (CSAN), designed for comprehensively suppressing interference in remote sensing images. Specifically, we first combine context and pixel attention on feature maps to enhance saliency of objects for suppressing interference from backgrounds. Then, in cascade network, we apply instance segmentation on ROI to increase saliency of the central object, thus preventing object features from mutual interference in dense arrangement. Additionally, to alleviate large-scale variations, we devise a multi-scale merge module during FPN merging process to learn richer scale representations. Experimental results on DOTA and HRSC2016 datasets outperform other state-of-the-art object detection methods and verify the effectiveness of our method.

Bidirectional Matrix Feature Pyramid Network for Object Detection

Wei Xu, Yi Gan, Jianbo Su

Responsive image

Auto-TLDR; BMFPN: Bidirectional Matrix Feature Pyramid Network for Object Detection

Slides Poster Similar

Feature pyramids are widely used to improve scale invariance for object detection. Most methods just map the objects to feature maps with relevant square receptive fields, but rarely pay attention to the aspect ratio variation, which is also an important property of object instances. It will lead to a poor match between rectangular objects and assigned features with square receptive fields, thus preventing from accurate recognition and location. Besides, the information propagation among feature layers is sparse, namely, each feature in the pyramid may mainly or only contain single-level information, which is not representative enough for classification and localization sub-tasks. In this paper, Bidirectional Matrix Feature Pyramid Network (BMFPN) is proposed to address these issues. It consists of three modules: Diagonal Layer Generation Module (DLGM), Top-down Module (TDM) and Bottom-up Module (BUM). First, multi-level features extracted by backbone are fed into DLGM to produce the base features. Then these base features are utilized to construct the final feature pyramid through TDM and BUM in series. The receptive fields of the designed feature layers in BMFPN have various scales and aspect ratios. Objects can be correctly assigned to appropriate and representative feature maps with relevant receptive fields depending on its scale and aspect ratio properties. Moreover, TDM and BUM form bidirectional and reticular information flow, which effectively fuses multi level information in top-down and bottom-up manner respectively. To evaluate the effectiveness of our proposed architecture, an end-toend anchor-free detector is designed and trained by integrating BMFPN into FCOS. And the center ness branch in FCOS is modified with our Gaussian center-ness branch (GCB), which brings another slight improvement. Without bells and whistles, our method gains +3.3%, +2.4% and +2.6% AP on MS COCO dataset from baselines with ResNet-50, ResNet-101 and ResNeXt-101 backbones, respectively.

Triplet-Path Dilated Network for Detection and Segmentation of General Pathological Images

Jiaqi Luo, Zhicheng Zhao, Fei Su, Limei Guo

Responsive image

Auto-TLDR; Triplet-path Network for One-Stage Object Detection and Segmentation in Pathological Images

Slides Similar

Deep learning has been widely applied in the field of medical image processing. However, compared with flourishing visual tasks in natural images, the progress achieved in pathological images is not remarkable, and detection and segmentation, which are among basic tasks of computer vision, are regarded as two independent tasks. In this paper, we make full use of existing datasets and construct a triplet-path network using dilated convolutions to cooperatively accomplish one-stage object detection and nuclei segmentation for general pathological images. First, in order to meet the requirement of detection and segmentation, a novel structure called triplet feature generation (TFG) is designed to extract high-resolution and multiscale features, where features from different layers can be properly integrated. Second, considering that pathological datasets are usually small, a location-aware and partially truncated loss function is proposed to improve the classification accuracy of datasets with few images and widely varying targets. We compare the performance of both object detection and instance segmentation with state-of-the-art methods. Experimental results demonstrate the effectiveness and efficiency of the proposed network on two datasets collected from multiple organs.

CenterRepp: Predict Central Representative Point Set's Distribution for Detection

Yulin He, Limeng Zhang, Wei Chen, Xin Luo, Chen Li, Xiaogang Jia

Responsive image

Auto-TLDR; CRPDet: CenterRepp Detector for Object Detection

Slides Poster Similar

Object detection has long been an important issue in the discipline of scene understanding. Existing researches mainly focus on the object itself, ignoring its surrounding environment. In fact, the surrounding environment provides abundant information to help detectors classify and locate objects. This paper proposes CRPDet, viz. CenterRepp Detector, a framework for object detection. The main function of CRPDet is accomplished by the CenterRepp module, which takes into account the surrounding environment by predicting the distribution of the central representative points. CenterRepp converts labeled object frames into the mean and standard variance of the sampling points’ distribution. This helps increase the receptive field of objects, breaking the limitation of object frames. CenterRepp defines a position-fixed center point with significant weights, avoiding to sample all points in the surroundings. Experiments on the COCO test-dev detection benchmark demonstrates that our proposed CRPDet has comparable performance with state-of-the-art detectors, achieving 39.4 mAP with 51 FPS tested under single size input.

Rethinking ReID:Multi-Feature Fusion Person Re-Identification Based on Orientation Constraints

Mingjing Ai, Guozhi Shan, Bo Liu, Tianyang Liu

Responsive image

Auto-TLDR; Person Re-identification with Orientation Constrained Network

Slides Poster Similar

Person re-identification (ReID) aims to identify the specific pedestrian in a series of images or videos. Recently, ReID is receiving more and more attention in the fields of computer vision research and application like intelligent security. One major issue downgrading the ReID model performance lies in that various subjects in the same body orientations look too similar to distinguish by the model, while the same subject viewed in different orientations looks rather different. However, most of the current studies do not particularly differentiate pedestrians in orientation when designing the network, so we rethink this problem particularly from the perspective of person orientation and propose a new network structure by including two branches: one handling samples with the same body orientations and the other handling samples with different body orientations. Correspondingly, we also propose an orientation classifier that can accurately distinguish the orientation of each person. At the same time, the three-part loss functions are introduced for orientation constraint and combined to optimize the network simultaneously. Also, we use global and local features int the training stage in order to make use of multi-level information. Therefore, our network can derive its efficacy from orientation constraints and multiple features. Experiments show that our method not only has competitive performance on multiple datasets, but also can let retrieval results aligned with the orientation of the query sample rank higher, which may have great potential in the practical applications.

Small Object Detection by Generative and Discriminative Learning

Yi Gu, Jie Li, Chentao Wu, Weijia Jia, Jianping Chen

Responsive image

Auto-TLDR; Generative and Discriminative Learning for Small Object Detection

Slides Poster Similar

With the development of deep convolutional neural networks (CNNs), the object detection accuracy has been greatly improved. But the performance of small object detection is still far from satisfactory, mainly because small objects are so tiny that the information contained in the feature map is limited. Existing methods focus on improving classification accuracy but still suffer from the limitation of bounding box prediction. To solve this issue, we propose a detection framework by generative and discriminative learning. First, a reconstruction generator network is designed to reconstruct the mapping from low frequency to high frequency for anchor box prediction. Then, a detector module extracts the regions of interest (ROIs) from generated results and implements a RoI-Head to predict object category and refine bounding box. In order to guide the reconstructed image related to the corresponding one, a discriminator module is adopted to tell from the generated result and the original image. Extensive evaluations on the challenging MS-COCO dataset demonstrate that our model outperforms most state-of-the-art models in detecting small objects, especially the reconstruction module improves the average precision for small object (APs) by 7.7%.

A Novel Region of Interest Extraction Layer for Instance Segmentation

Leonardo Rossi, Akbar Karimi, Andrea Prati

Responsive image

Auto-TLDR; Generic RoI Extractor for Two-Stage Neural Network for Instance Segmentation

Slides Poster Similar

Given the wide diffusion of deep neural network architectures for computer vision tasks, several new applications are nowadays more and more feasible. Among them, a particular attention has been recently given to instance segmentation, by exploiting the results achievable by two-stage networks (such as Mask R-CNN or Faster R-CNN), derived from R-CNN. In these complex architectures, a crucial role is played by the Region of Interest (RoI) extraction layer, devoted to extract a coherent subset of features from a single Feature Pyramid Network (FPN) layer attached on top of a backbone. This paper is motivated by the need to overcome to the limitations of existing RoI extractors which select only one (the best) layer from FPN. Our intuition is that all the layers of FPN retain useful information. Therefore, the proposed layer (called Generic RoI Extractor - GRoIE) introduces non-local building blocks and attention mechanisms to boost the performance. A comprehensive ablation study at component level is conducted to find the best set of algorithms and parameters for the GRoIE layer. Moreover, GRoIE can be integrated seamlessly with every two-stage architecture for both object detection and instance segmentation tasks. Therefore, the improvements brought by the use of GRoIE in different state-of-the-art architectures are also evaluated. The proposed layer leads up to gain a 1.1% AP on bounding box detection and 1.7% AP on instance segmentation. The code is publicly available on GitHub repository at https://github.com/IMPLabUniPr/mmdetection-groie

Dynamic Low-Light Image Enhancement for Object Detection Via End-To-End Training

Haifeng Guo, Yirui Wu, Tong Lu

Responsive image

Auto-TLDR; Object Detection using Low-Light Image Enhancement for End-to-End Training

Slides Poster Similar

Object detection based on convolutional neural networks is a hot research topic in computer vision. The illumination component in the image has a great impact on object detection, and it will cause a sharp decline in detection performance under low-light conditions. Using low-light image enhancement technique as a pre-processing mechanism can improve image quality and obtain better detection results.However, due to the complexity of low-light environments, the existing enhancement methods may have negative effects on some samples. Therefore, it is difficult to improve the overall detection performance in low-light conditions. In this paper, our goal is to use image enhancement to improve object detection performance rather than perceptual quality for humans. We propose a novel framework that combines low-light enhancement and object detection for end-to-end training. The framework can dynamically select different enhancement subnetworks for each sample to improve the performance of the detector. Our proposed method consists of two stage: the enhancement stage and the detection stage. The enhancement stage dynamically enhances the low-light images under the supervision of several enhancement methods and output corresponding weights. During the detection stage, the weights offers information on object classification to generate high-quality region proposals and in turn result in accurate detection. Our experiments present promising results, which show that the proposed method can significantly improve the detection performance in low-light environment.

Video Object Detection Using Object's Motion Context and Spatio-Temporal Feature Aggregation

Jaekyum Kim, Junho Koh, Byeongwon Lee, Seungji Yang, Jun Won Choi

Responsive image

Auto-TLDR; Video Object Detection Using Spatio-Temporal Aggregated Features and Gated Attention Network

Slides Poster Similar

The deep learning technique has recently led to significant improvement in object-detection accuracy. Numerous object detection schemes have been designed to process each frame independently. However, in many applications, object detection is performed using video data, which consists of a sequence of two-dimensional (2D) image frames. Thus, the object detection accuracy can be improved by exploiting the temporal context of the video sequence. In this paper, we propose a novel video object detection method that exploits both the motion context of the object and spatio-temporal aggregated features in the video sequence to enhance the object detection performance. First, the motion of the object is captured by the correlation between the spatial feature maps of two adjacent frames. Then, the embedding vector, representing the motion context, is obtained by feeding the N correlation maps to long short term memory (LSTM). In addition to generating the motion context vector, the spatial feature maps for N adjacent frames are aggregated to boost the quality of the feature map. The gated attention network is employed to selectively combine only highly correlated feature maps based on their relevance. While most video object detectors are applied to two-stage detectors, our proposed method is applicable to one-stage detectors, which tend to be preferred for practical applications owing to reduced computational complexity. Our numerical evaluation conducted on the ImageNet VID dataset shows that our network offers significant performance gain over baseline algorithms, and it outperforms the existing state-of-the-art one-stage video object detection methods.

Feature Embedding Based Text Instance Grouping for Largely Spaced and Occluded Text Detection

Pan Gao, Qi Wan, Renwu Gao, Linlin Shen

Responsive image

Auto-TLDR; Text Instance Embedding Based Feature Embeddings for Multiple Text Instance Grouping

Slides Poster Similar

A text instance can be easily detected as multiple ones due to the large space between texts/characters, curved shape and partial occlusion. In this paper, a feature embedding based text instance grouping algorithm is proposed to solve this problem. To learn the feature space, a TIEM (Text Instance Embedding Module) is trained to minimize the within instance scatter and maximize the between instance scatter. Similarity between different text instances are measured in the feature space and merged if they meet certain conditions. Experimental results show that our approach can effectively connect text regions that belong to the same text instance. Competitive performance of our approach has been achieved on CTW1500, Total-Text, IC15 and a subset consists of texts selected from the three datasets, with large spacing and occlusions.

StrongPose: Bottom-up and Strong Keypoint Heat Map Based Pose Estimation

Niaz Ahmad, Jongwon Yoon

Responsive image

Auto-TLDR; StrongPose: A bottom-up box-free approach for human pose estimation and action recognition

Slides Poster Similar

Adaptation of deep convolutional neural network has made revolutionary progress in human pose estimation, various applications in recent years have drawn considerable attention. However, prediction and localization of the keypoints in single and multi-person images are a challenging problem. Towards this purpose, we present a bottom-up box-free approach for the task of pose estimation and action recognition. We proposed a StrongPose system model that uses part-based modeling to tackle object-part associations. The model utilizes a convolution network that learns how to detect Strong Keypoints Heat Maps (SKHM) and predict their comparative displacements, enabling us to group keypoints into person pose instances. Further, we produce Body Heat Maps (BHM) with the help of keypoints which allows us to localize the human body in the picture. The StrongPose framework is based on fully-convolutional engineering and permits proficient inference, with runtime basically autonomous of the number of individuals display within the scene. Train and test on COCO data alone, our framework achieves COCO test-dev keypoint average precision of 0.708 using ResNet-101 and 0.725 using ResNet-152, which considerably outperforms all prior bottom-up pose estimation frameworks.

VTT: Long-Term Visual Tracking with Transformers

Tianling Bian, Yang Hua, Tao Song, Zhengui Xue, Ruhui Ma, Neil Robertson, Haibing Guan

Responsive image

Auto-TLDR; Visual Tracking Transformer with transformers for long-term visual tracking

Slides Similar

Long-term visual tracking is a challenging problem. State-of-the-art long-term trackers, e.g., GlobalTrack, utilize region proposal networks (RPNs) to generate target proposals. However, the performance of the trackers is affected by occlusions and large scale or ratio variations. To address these issues, in this paper, we are the first to propose a novel architecture with transformers for long-term visual tracking. Specifically, the proposed Visual Tracking Transformer (VTT) utilizes a transformer encoder-decoder architecture for aggregating global information to deal with occlusion and large scale or ratio variation. Furthermore, it also shows better discriminative power against instance-level distractors without the need for extra labeling and hard-sample mining. We conduct extensive experiments on three largest long-term tracking dataset and have achieved state-of-the-art performance.

One-Stage Multi-Task Detector for 3D Cardiac MR Imaging

Weizeng Lu, Xi Jia, Wei Chen, Nicolò Savioli, Antonio De Marvao, Linlin Shen, Declan O'Regan, Jinming Duan

Responsive image

Auto-TLDR; Multi-task Learning for Real-Time, simultaneous landmark location and bounding box detection in 3D space

Slides Poster Similar

Fast and accurate landmark location and bounding box detection are important steps in 3D medical imaging. In this paper, we propose a novel multi-task learning framework, for real-time, simultaneous landmark location and bounding box detection in 3D space. Our method extends the famous single-shot multibox detector (SSD) from single-task learning to multi-task learning and from 2D to 3D. Furthermore, we propose a post-processing approach to refine the network landmark output, by averaging the candidate landmarks. Owing to these settings, the proposed framework is fast and accurate. For 3D cardiac magnetic resonance (MR) images with size 224 × 224 × 64, our framework runs about 128 volumes per second (VPS) on GPU and achieves 6.75mm average point-to-point distance error for landmark location, which outperforms both state-of-the-art and baseline methods. We also show that segmenting the 3D image cropped with the bounding box results in both improved performance and efficiency.

Tiny Object Detection in Aerial Images

Jinwang Wang, Wen Yang, Haowen Guo, Ruixiang Zhang, Gui-Song Xia

Responsive image

Auto-TLDR; Tiny Object Detection in Aerial Images Using Multiple Center Points Based Learning Network

Slides Similar

Object detection in Earth Vision has achieved great progress in recent years. However, tiny object detection in aerial images remains a very challenging problem since the tiny objects contain a small number of pixels and are easily confused with the background. To advance tiny object detection research in aerial images, we present a new dataset for Tiny Object Detection in Aerial Images (AI-TOD). Specifically, AI-TOD comes with 700,621 object instances for eight categories across 28,036 aerial images. Compared to existing object detection datasets in aerial images, the mean size of objects in AI-TOD is about 12.8 pixels, which is much smaller than others. To build a benchmark for tiny object detection in aerial images, we evaluate the state-of-the-art object detectors on our AI-TOD dataset. Experimental results show that direct application of these approaches on AI-TOD produces suboptimal object detection results, thus new specialized detectors for tiny object detection need to be designed. Therefore, we propose a multiple center points based learning network (M-CenterNet) to improve the localization performance of tiny object detection, and experimental results show the significant performance gain over the competitors.

A Fast and Accurate Object Detector for Handwritten Digit String Recognition

Jun Guo, Wenjing Wei, Yifeng Ma, Cong Peng

Responsive image

Auto-TLDR; ChipNet: An anchor-free object detector for handwritten digit string recognition

Slides Poster Similar

Focusing on handwritten digit string recognition (HDSR), we propose an anchor-free object detector called ChipNet, where a novel encoding method is designed. The input image is divided into columns, and then these columns are encoded by the ground truth. The adjacent columns are responsible for detecting the same target so that it can well address the class-imbalanced problem meanwhile reducing the network computation. ChipNet is composed of convolutional and bidirectional long short term memory networks. Different from the typical detectors, it doesn't use region proposals, anchors or regions of interest pooling. Hence, it can overcome the shortages of anchor-based and dense detectors in HDSR. The experiments are implemented on the synthetic digit strings, the CVL HDS database, and the ORAND-CAR-A & B databases. The high accuracies, which surpass the reported results by a large margin (up to 6.62%), are achieved. Furthermore, it gets 219 FPS speed on 160*32 px resolution images when using a Tesla P100 GPU. The results also show that ChipNet can handle touching, connecting and arbitrary length digit strings, and the obtained accuracies in HDSR are as high as the ones in single handwritten digit recognition.

Feature Pyramid Hierarchies for Multi-Scale Temporal Action Detection

Jiayu He, Guohui Li, Jun Lei

Responsive image

Auto-TLDR; Temporal Action Detection using Pyramid Hierarchies and Multi-scale Feature Maps

Slides Poster Similar

Temporal action detection is a challenging but promising task in video content analysis. It is in great demand in the field of public safety. The main difficulty of the task is precisely localizing activities in the video especially those short duration activities. And most of the existing methods can not achieve a satisfactory detection result. Our method addresses a key point to improve detection accuracy, which is to use multi-scale feature maps for regression and classification. In this paper, we introduce a novel network based on classification following proposal framework. In our network, a 3D feature pyramid hierarchies is built to enhance the ability of detecting short duration activities. The input RGB/Flow frames are first encoded by a 3D feature pyramid hierarchies, and this subnet produces multi-level feature maps. Then temporal proposal subnet uses these features to pick out proposals which might contain activity segments. Finally a pyramid region of interest (RoI) pooling pipeline and two fully connected layers reuse muti-level feature maps to refine the temporal boundaries of proposals and classify them. We use late feature fusion scheme to combine RGB and Flow information. The network is trained end-to-end and we evaluate it in THUMOS'14 dataset. Our network achieves a good result among typical methods. A further ablation test demonstrate that pyramid hierarchies is effective to improve detecting short duration activity segments.

Efficient-Receptive Field Block with Group Spatial Attention Mechanism for Object Detection

Jiacheng Zhang, Zhicheng Zhao, Fei Su

Responsive image

Auto-TLDR; E-RFB: Efficient-Receptive Field Block for Deep Neural Network for Object Detection

Slides Poster Similar

Object detection has been paid rising attention in computer vision field. Convolutional Neural Networks (CNNs) extract high-level semantic features of images, which directly determine the performance of object detection. As a common solution, embedding integration modules into CNNs can enrich extracted features and thereby improve the performance. However, the instability and inconsistency of internal multiple branches exist in these modules. To address this problem, we propose a novel multibranch module called Efficient-Receptive Field Block (E-RFB), in which multiple levels of features are combined for network optimization. Specifically, by downsampling and increasing depth, the E-RFB provides sufficient RF. Second, in order to eliminate the inconsistency across different branches, a novel spatial attention mechanism, namely, Group Spatial Attention Module (GSAM) is proposed. The GSAM gradually narrows a feature map by channel grouping; thus it encodes the information between spatial and channel dimensions into the final attention heat map. Third, the proposed module can be easily joined in various CNNs to enhance feature representation as a plug-and-play component. With SSD-style detectors, our method halves the parameters of the original detection head and achieves high accuracy on the PASCAL VOC and MS COCO datasets. Moreover, the proposed method achieves superior performance compared with state-of-the-art methods based on similar framework.

PointDrop: Improving Object Detection from Sparse Point Clouds Via Adversarial Data Augmentation

Wenxin Ma, Jian Chen, Qing Du, Wei Jia

Responsive image

Auto-TLDR; PointDrop: Improving Robust 3D Object Detection to Sparse Point Clouds

Slides Poster Similar

Current 3D object detection methods achieve accurate and efficient results on the standard point cloud dataset. However, in real-world applications, due to the expensive cost of obtaining the annotated 3D object detection data, we expect to directly apply the model trained on the standard dataset to real-world scenarios. This strategy may fail because the point cloud samples obtained in the real-world scenarios may be much sparser due to various reasons (occlusion, low reflectivity of objects and fewer laser beams) and existing methods do not consider the limitations of their models on sparse point clouds. To improve the robustness of an object detector to sparser point clouds, we propose PointDrop, which learns to drop the features of some key points in the point clouds to generate challenging sparse samples for data augmentation. Moreover, PointDrop is able to adjust the difficulty of the generated samples based on the capacity of the detector and thus progressively improve the performance of the detector. We create two sparse point clouds datasets from the KITTI dataset to evaluate our method, and the experimental results show that PointDrop significantly improves the robustness of the detector to sparse point clouds.

Yolo+FPN: 2D and 3D Fused Object Detection with an RGB-D Camera

Ya Wang

Responsive image

Auto-TLDR; Yolo+FPN: Combining 2D and 3D Object Detection for Real-Time Object Detection

Slides Poster Similar

In this paper we propose a new deep neural network system, called Yolo+FPN, which fuses both 2D and 3D object detection algorithms to achieve better real-time object detection results and faster inference speed, to be used on real robots. Finding an optimized fusion strategy to efficiently combine 3D object detection with 2D detection information is useful and challenging for both indoor and outdoor robots. In order to satisfy real-time requirements, a trade-off between accuracy and efficiency is needed. We not only have improved training and test accuracies and lower mean losses on the KITTI object detection benchmark, but also achieve better average precision on 3D detection of all classes in three levels of difficulty. Also, we implemented Yolo+FPN system using an RGB-D camera, and compared the speed of 2D and 3D object detection using different GPUs. For the real implementation of both indoor and outdoor scenes, we focus on person detection, which is the most challenging and important among the three classes.

Robust Pedestrian Detection in Thermal Imagery Using Synthesized Images

My Kieu, Lorenzo Berlincioni, Leonardo Galteri, Marco Bertini, Andrew Bagdanov, Alberto Del Bimbo

Responsive image

Auto-TLDR; Improving Pedestrian Detection in the thermal domain using Generative Adversarial Network

Slides Poster Similar

In this paper we propose a method for improving pedestrian detection in the thermal domain using two stages: first, a generative data augmentation approach is used, then a domain adaptation method using generated data adapts an RGB pedestrian detector. Our model, based on the Least-Squares Generative Adversarial Network, is trained to synthesize realistic thermal versions of input RGB images which are then used to augment the limited amount of labeled thermal pedestrian images available for training. We apply our generative data augmentation strategy in order to adapt a pretrained YOLOv3 pedestrian detector to detection in the thermal-only domain. Experimental results demonstrate the effectiveness of our approach: using less than 50% of available real thermal training data, and relying on synthesized data generated by our model in the domain adaptation phase, our detector achieves state-of-the-art results on the KAIST Multispectral Pedestrian Detection Benchmark; even if more real thermal data is available adding GAN generated images to the training data results in improved performance, thus showing that these images act as an effective form of data augmentation. To the best of our knowledge, our detector achieves the best single-modality detection results on KAIST with respect to the state-of-the-art.

Which Airline Is This? Airline Logo Detection in Real-World Weather Conditions

Christian Wilms, Rafael Heid, Mohammad Araf Sadeghi, Andreas Ribbrock, Simone Frintrop

Responsive image

Auto-TLDR; Airlines logo detection on airplane tails using data augmentation

Slides Poster Similar

The detection of logos in images, for instance, logos of airlines on airplane tails, is a difficult task in real-world weather conditions. Most systems used for logo detection are very good at detecting logos in clean images. However, they exhibit problems when images are degraded by effects of adverse weather conditions as they frequently occur in real-world scenarios. For investigating this problem on airline logo detection as a sub-problem of logo detection, we first present a new dataset for airline logo detection on airplane tails containing a test split with images degraded by adverse weather effects. Second, to handle the detection of airline logos effectively, a new two-stage airline logo detection system based on a state-of-the-art object proposal generation system and a specifically tailored classifier is proposed. Finally, improving the results on images degraded by adverse weather effects, we introduce a learning-free application-agnostic data augmentation strategy simulating effects like rain and fog. The results show the superior performance of our airline logo detection system compared to state-of-the-art. Furthermore, applying our data augmentation approach to a variety of systems, reduces the significant drop in performance on degraded images.

Multi-Label Contrastive Focal Loss for Pedestrian Attribute Recognition

Xiaoqiang Zheng, Zhenxia Yu, Lin Chen, Fan Zhu, Shilong Wang

Responsive image

Auto-TLDR; Multi-label Contrastive Focal Loss for Pedestrian Attribute Recognition

Slides Poster Similar

Pedestrian Attribute Recognition (PAR) has received extensive attention during the past few years. With the advances of deep constitutional neural networks (CNNs), the performance of PAR has been significantly improved. Existing methods tend to acquire attribute-specific features by designing various complex network structures with additional modules. Such additional modules, however, dramatically increase the number of parameters. Meanwhile, the problems of class imbalance and hard attribute retrieving remain underestimated in PAR. In this paper, we explore the optimization mechanism of the training processing to account for these problems and propose a new loss function called Multi-label Contrastive Focal Loss (MCFL). This proposed MCFL emphasizes the hard and minority attributes by using a separated re-weighting mechanism for different positive and negative classes to alleviate the impact of the imbalance. MCFL is also able to enlarge the gaps between the intra-class of multi-label attributes, to force CNNs to extract more subtle discriminative features. We evaluate the proposed MCFL on three large public pedestrian datasets, including RAP, PA-100K, and PETA. The experimental results indicate that the proposed MCFL with the ResNet-50 backbone is able to outperform other state-of-the-art approaches in comparison.

Thermal Image Enhancement Using Generative Adversarial Network for Pedestrian Detection

Mohamed Amine Marnissi, Hajer Fradi, Anis Sahbani, Najoua Essoukri Ben Amara

Responsive image

Auto-TLDR; Improving Visual Quality of Infrared Images for Pedestrian Detection Using Generative Adversarial Network

Slides Poster Similar

Infrared imaging has recently played an important role in a wide range of applications including surveillance, robotics and night vision. However, infrared cameras often suffer from some limitations, essentially about low-contrast and blurred details. These problems contribute to the loss of observation of target objects in infrared images, which could limit the feasibility of different infrared imaging applications. In this paper, we mainly focus on the problem of pedestrian detection on thermal images. Particularly, we emphasis the need for enhancing the visual quality of images beforehand performing the detection step. % to ensure effective results. To address that, we propose a novel thermal enhancement architecture based on Generative Adversarial Network, and composed of two modules contrast enhancement and denoising modules with a post-processing step for edge restoration in order to improve the overall quality. The effectiveness of the proposed architecture is assessed by means of visual quality metrics and better results are obtained compared to the original thermal images and to the obtained results by other existing enhancement methods. These results have been conduced on a subset of KAIST dataset. Using the same dataset, the impact of the proposed enhancement architecture has been demonstrated on the detection results by obtaining better performance with a significant margin using YOLOv3 detector.

Utilising Visual Attention Cues for Vehicle Detection and Tracking

Feiyan Hu, Venkatesh Gurram Munirathnam, Noel E O'Connor, Alan Smeaton, Suzanne Little

Responsive image

Auto-TLDR; Visual Attention for Object Detection and Tracking in Driver-Assistance Systems

Slides Poster Similar

Advanced Driver-Assistance Systems (ADAS) have been attracting attention from many researchers. Vision based sensors are the closest way to emulate human driver visual behavior while driving. In this paper, we explore possible ways to use visual attention (saliency) for object detection and tracking. We investigate: 1) How a visual attention map such as a subjectness attention or saliency map and an objectness attention map can facilitate region proposal generation in a 2-stage object detector; 2) How a visual attention map can be used for tracking multiple objects. We propose a neural network that can simultaneously detect objects as and generate objectness and subjectness maps to save computational power. We further exploit the visual attention map during tracking using a sequential Monte Carlo probability hypothesis density (PHD) filter. The experiments are conducted on KITTI and DETRAC datasets. The use of visual attention and hierarchical features has shown a considerable improvement of≈8% in object detection which effectively increased tracking performance by≈4% on KITTI dataset.

Boundary-Aware Graph Convolution for Semantic Segmentation

Hanzhe Hu, Jinshi Cui, Jinshi Hongbin Zha

Responsive image

Auto-TLDR; Boundary-Aware Graph Convolution for Semantic Segmentation

Slides Poster Similar

Recent works have made great progress in semantic segmentation by exploiting contextual information in a local or global manner with dilated convolutions, pyramid pooling or self-attention mechanism. However, few works have focused on harvesting boundary information to improve the segmentation performance. In order to enhance the feature similarity within the object and keep discrimination from other objects, we propose a boundary-aware graph convolution (BGC) module to propagate features within the object. The graph reasoning is performed among pixels of the same object apart from the boundary pixels. Based on the proposed BGC module, we further introduce the Boundary-aware Graph Convolution Network(BGCNet), which consists of two main components including a basic segmentation network and the BGC module, forming a coarse-to-fine paradigm. Specifically, the BGC module takes the coarse segmentation feature map as node features and boundary prediction to guide graph construction. After graph convolution, the reasoned feature and the input feature are fused together to get the refined feature, producing the refined segmentation result. We conduct extensive experiments on three popular semantic segmentation benchmarks including Cityscapes, PASCAL VOC 2012 and COCO Stuff, and achieve state-of-the-art performance on all three benchmarks.

P2 Net: Augmented Parallel-Pyramid Net for Attention Guided Pose Estimation

Luanxuan Hou, Jie Cao, Yuan Zhao, Haifeng Shen, Jian Tang, Ran He

Responsive image

Auto-TLDR; Parallel-Pyramid Net with Partial Attention for Human Pose Estimation

Slides Poster Similar

The target of human pose estimation is to determine the body parts and joint locations of persons in the image. Angular changes, motion blur and occlusion etc. in the natural scenes make this task challenging, while some joints are more difficult to be detected than others. In this paper, we propose an augmented Parallel-Pyramid Net (P^2Net) with an partial attention module. During data preprocessing, we proposed a differentiable auto data augmentation (DA^2) method in which sequences of data augmentations are formulated as a trainable and operational Convolution Neural Network (CNN) component. DA^2 improves the training efficiency and effectiveness. A parallel pyramid structure is followed to compensate the information loss introduced by the network. For the information loss problem in the backbone network, we optimize the backbone network by adopting a new parallel structure without increasing the overall computational complexity. To further refine the predictions after completion of global predictions, an Partial Attention Module (PAM) is defined to extract weighted features from different scale feature maps generated by the parallel pyramid structure. Compared with the traditional up-sampling refining, PAM can better capture the relationship between channels. Experiments corroborate the effectiveness of our proposed method. Notably, our method achieves the best performance on the challenging MSCOCO and MPII datasets.

Enhanced Vote Network for 3D Object Detection in Point Clouds

Min Zhong, Gang Zeng

Responsive image

Auto-TLDR; A Vote Feature Enhancement Network for 3D Bounding Box Prediction

Slides Poster Similar

In this work, we aim to estimate 3D bounding boxes by voting to object centers and then groups and aggregates the votes to generate 3D box proposals and semantic classes of objects. However, due to the sparse and unstructured nature of the point clouds, we face some challenges when directly predicting bounding box from the vote feature: the sparse vote feature may lack some necessary semantic and context information. To address the challenges, we propose a vote feature enhancement network that aims to encode semantic-aware information and aggravate global context for the vote feature. Specifically, we learn the point-wise semantic information and supplement it to the vote feature, and we also encode the pairwise relations to collect the global context. Experiments on two large datasets of real 3D scans, ScanNet and SUN RGB-D, demonstrate that our method can achieve excellent 3D detection results.

SynDHN: Multi-Object Fish Tracker Trained on Synthetic Underwater Videos

Mygel Andrei Martija, Prospero Naval

Responsive image

Auto-TLDR; Underwater Multi-Object Tracking in the Wild with Deep Hungarian Network

Slides Poster Similar

In this paper, we seek to extend multi-object tracking research on a relatively less explored domain, that of, underwater multi-object tracking in the wild. Multi-object fish tracking is an important task because it can provide fish monitoring systems with richer information (e.g. multiple views of the same fish) as compared to detections and it can be an invaluable input to fish behavior analysis. However, there is a lack of an annotated benchmark dataset with enough samples for this task. To circumvent the need for manual ground truth tracking annotation, we craft a synthetic dataset. Using this synthetic dataset, we train an integrated detector and tracker called SynDHN. SynDHN uses the Deep Hungarian Network (DHN), which is a differentiable approximation of the Hungarian assignment algorithm. We repurpose DHN to become the tracking component of our algorithm by performing the task of affinity estimation between detector predictions. We consider both spatial and appearance features for affinity estimation. Our results show that despite being trained on a synthetic dataset, SynDHN generalizes well to real underwater video tracking and performs better against our baseline algorithms.

CASNet: Common Attribute Support Network for Image Instance and Panoptic Segmentation

Xiaolong Liu, Yuqing Hou, Anbang Yao, Yurong Chen, Keqiang Li

Responsive image

Auto-TLDR; Common Attribute Support Network for instance segmentation and panoptic segmentation

Slides Poster Similar

Instance segmentation and panoptic segmentation is being paid more and more attention in recent years. In comparison with bounding box based object detection and semantic segmentation, instance segmentation can provide more analytical results at pixel level. Given the insight that pixels belonging to one instance have one or more common attributes of current instance, we bring up an one-stage instance segmentation network named Common Attribute Support Network (CASNet), which realizes instance segmentation by predicting and clustering common attributes. CASNet is designed in the manner of fully convolutional and can implement training and inference from end to end. And CASNet manages predicting the instance without overlaps and holes, which problem exists in most of current instance segmentation algorithms. Furthermore, it can be easily extended to panoptic segmentation through minor modifications with little computation overhead. CASNet builds a bridge between semantic and instance segmentation from finding pixel class ID to obtaining class and instance ID by operations on common attribute. Through experiment for instance and panoptic segmentation, CASNet gets mAP 32.8\% and PQ 59.0\% on Cityscapes validation dataset by joint training, and mAP 36.3\% and PQ 66.1\% by separated training mode. For panoptic segmentation, CASNet gets state-of-the-art performance on the Cityscapes validation dataset.

Activity and Relationship Modeling Driven Weakly Supervised Object Detection

Yinlin Li, Yang Qian, Xu Yang, Yuren Zhang

Responsive image

Auto-TLDR; Weakly Supervised Object Detection Using Activity Label and Relationship Modeling

Slides Poster Similar

This paper presents a weakly supervised object detection method based on activity label and relationship modeling, which is motivated by the assumption that configuration of human and object are similar in same activity, and joint modeling of human, active object and activity could leverage the recognition of them. Compared to most weakly supervised method taking object as independent instance, firstly, active human and object proposals are learned and filtered based on class activation map of multi-label classification. Secondly, a spatial relationship prior including relative position, scale, overlaps etc are learned dependent on action. Finally, a multi-stream object detection framework integrating the spatial prior and pairwise ROI pooling are proposed to jointly learn the object and action class. Experiments are conducted on HICO-DET dataset, and our approach outperforms the state of the art weakly supervised object detection methods.