VTT: Long-Term Visual Tracking with Transformers

Tianling Bian, Yang Hua, Tao Song, Zhengui Xue, Ruhui Ma, Neil Robertson, Haibing Guan

Responsive image

Auto-TLDR; Visual Tracking Transformer with transformers for long-term visual tracking

Slides

Long-term visual tracking is a challenging problem. State-of-the-art long-term trackers, e.g., GlobalTrack, utilize region proposal networks (RPNs) to generate target proposals. However, the performance of the trackers is affected by occlusions and large scale or ratio variations. To address these issues, in this paper, we are the first to propose a novel architecture with transformers for long-term visual tracking. Specifically, the proposed Visual Tracking Transformer (VTT) utilizes a transformer encoder-decoder architecture for aggregating global information to deal with occlusion and large scale or ratio variation. Furthermore, it also shows better discriminative power against instance-level distractors without the need for extra labeling and hard-sample mining. We conduct extensive experiments on three largest long-term tracking dataset and have achieved state-of-the-art performance.

Similar papers

Tackling Occlusion in Siamese Tracking with Structured Dropouts

Deepak Gupta, Efstratios Gavves, Arnold Smeulders

Responsive image

Auto-TLDR; Structured Dropout for Occlusion in latent space

Slides Poster Similar

Occlusion is one of the most difficult challenges in object tracking to model. This is because unlike other challenges, where data augmentation can be of help, occlusion is hard to simulate as the occluding object can be anything in any shape. In this paper, we propose a simple solution to simulate the effects of occlusion in the latent space. Specifically, we present structured dropout to mimic the change in latent codes under occlusion. We present three forms of dropout (channel dropout, segment dropout and slice dropout) with the various forms of occlusion in mind. To demonstrate its effectiveness, the dropouts are incorporated into two modern Siamese trackers (SiamFC and SiamRPN++). The outputs from multiple dropouts are combined using an encoder network to obtain the final prediction. Experiments on several tracking benchmarks show the benefits of structured dropouts, while due to their simplicity requiring only small changes to the existing tracker models.

DAL: A Deep Depth-Aware Long-Term Tracker

Yanlin Qian, Song Yan, Alan Lukežič, Matej Kristan, Joni-Kristian Kamarainen, Jiri Matas

Responsive image

Auto-TLDR; Deep Depth-Aware Long-Term RGBD Tracking with Deep Discriminative Correlation Filter

Slides Poster Similar

The best RGBD trackers provide high accuracy but are slow to run. On the other hand, the best RGB trackers are fast but clearly inferior on the RGBD datasets. In this work, we propose a deep depth-aware long-term tracker that achieves state-of-the-art RGBD tracking performance and is fast to run. We reformulate deep discriminative correlation filter (DCF) to embed the depth information into deep features. Moreover, the same depth-aware correlation filter is used for target re- detection. Comprehensive evaluations show that the proposed tracker achieves state-of-the-art performance on the Princeton RGBD, STC, and the newly-released CDTB benchmarks and runs 20 fps.

RSINet: Rotation-Scale Invariant Network for Online Visual Tracking

Yang Fang, Geunsik Jo, Chang-Hee Lee

Responsive image

Auto-TLDR; RSINet: Rotation-Scale Invariant Network for Adaptive Tracking

Slides Poster Similar

Most Siamese network-based trackers perform the tracking process without model update, and cannot learn target-specific variation adaptively. Moreover, Siamese-based trackers infer the new state of tracked objects by generating axis-aligned bounding boxes, which contain extra background noise, and are unable to accurately estimate the rotation and scale transformation of moving objects, thus potentially reducing tracking performance. In this paper, we propose a novel Rotation-Scale Invariant Network (RSINet) to address the above problem. Our RSINet tracker consists of a target-distractor discrimination branch and a rotation-scale estimation branch, the rotation and scale knowledge can be explicitly learned by a multi-task learning method in an end-to-end manner. In addtion, the tracking model is adaptively optimized and updated under spatio-temporal energy control, which ensures model stability and reliability, as well as high tracking efficiency. Comprehensive experiments on OTB-100, VOT2018, and LaSOT benchmarks demonstrate that our proposed RSINet tracker yields new state-of-the-art performance compared with recent trackers, while running at real-time speed about 45 FPS.

Model Decay in Long-Term Tracking

Efstratios Gavves, Ran Tao, Deepak Gupta, Arnold Smeulders

Responsive image

Auto-TLDR; Model Bias in Long-Term Tracking

Slides Poster Similar

To account for appearance variations, tracking models need to be updated during the course of inference. However, updating the tracker model with adverse bounding box predictions adds an unavoidable bias term to the learning. This bias term, which we refer to as model decay, offsets the learning and causes tracking drift. While its adverse affect might not be visible in short-term tracking, accumulation of this bias over a long-term can eventually lead to a permanent loss of the target. In this paper, we look at the problem of model bias from a mathematical perspective. Further, we briefly examine the effect of various sources of tracking error on model decay, using a correlation filter (ECO) and a Siamese (SINT) tracker. Based on observations and insights, we propose simple additions that help to reduce model decay in long-term tracking. The proposed tracker is evaluated on four long-term and one short-term tracking benchmarks, demonstrating superior accuracy and robustness, even on 30 minute long videos.

TSDM: Tracking by SiamRPN++ with a Depth-Refiner and a Mask-Generator

Pengyao Zhao, Quanli Liu, Wei Wang, Qiang Guo

Responsive image

Auto-TLDR; TSDM: A Depth-D Tracker for 3D Object Tracking

Slides Poster Similar

In a generic object tracking, depth (D) information provides informative cues for foreground-background separation and target bounding box regression. However, so far, few trackers have used depth information to play the important role aforementioned due to the lack of a suitable model. In this paper, a RGB-D tracker named TSDM is proposed, which is composed of a Mask-generator (M-g), SiamRPN++ and a Depth-refiner (D-r). The M-g generates the background masks, and updates them as the target 3D position changes. The D-r optimizes the target bounding box estimated by SiamRPN++, based on the spatial depth distribution difference between the target and the surrounding background. Extensive evaluation on the Princeton Tracking Benchmark and the Visual Object Tracking challenge shows that our tracker outperforms the state-of-the-art by a large margin while achieving 23 FPS. In addition, a light-weight variant can run at 31 FPS and thus it is practical for real world applications. Code and models of TSDM are available at https://github.com/lql-team/TSDM.

Siamese Dynamic Mask Estimation Network for Fast Video Object Segmentation

Dexiang Hong, Guorong Li, Kai Xu, Li Su, Qingming Huang

Responsive image

Auto-TLDR; Siamese Dynamic Mask Estimation for Video Object Segmentation

Slides Poster Similar

Video object segmentation(VOS) has been a fundamental topic in recent years, and many deep learning-based methods have achieved state-of-the-art performance on multiple benchmarks. However, most of these methods rely on pixel-level matching between the template and the searched frames on the whole image while the targets only occupy a small region. Calculating on the entire image brings lots of additional computation cost. Besides, the whole image may contain some distracting information resulting in many false-positive matching points. To address this issue, motivated by one-stage instance object segmentation methods, we propose an efficient siamese dynamic mask estimation network for fast video object segmentation. The VOS is decoupled into two tasks, i.e. mask feature learning and dynamic kernel prediction. The former is responsible for learning high-quality features to preserve structural geometric information, and the latter learns a dynamic kernel which is used to convolve with the mask feature to generate a mask output. We use Siamese neural network as a feature extractor and directly predict masks after correlation. In this way, we can avoid using pixel-level matching, making our framework more simple and efficient. Experiment results on DAVIS 2016 /2017 datasets show that our proposed methods can run at 35 frames per second on NVIDIA RTX TITAN while preserving competitive accuracy.

Exploiting Distilled Learning for Deep Siamese Tracking

Chengxin Liu, Zhiguo Cao, Wei Li, Yang Xiao, Shuaiyuan Du, Angfan Zhu

Responsive image

Auto-TLDR; Distilled Learning Framework for Siamese Tracking

Slides Poster Similar

Existing deep siamese trackers are typically built on off-the-shelf CNN models for feature learning, with the demand for huge power consumption and memory storage. This limits current deep siamese trackers to be carried on resource-constrained devices like mobile phones, given factor that such a deployment normally requires cost-effective considerations. In this work, we address this issue by presenting a novel Distilled Learning Framework(DLF) for siamese tracking, which aims at learning tracking model with efficiency and high accuracy. Specifically, we propose two simple yet effective knowledge distillation strategies, denote as point-wise distillation and pair-wise distillation, which are designed for transferring knowledge from a more discriminative teacher tracker into a compact student tracker. In this way, cost-effective and high performance tracking could be achieved. Extensive experiments on several tracking benchmarks demonstrate the effectiveness of our proposed method.

Robust Visual Object Tracking with Two-Stream Residual Convolutional Networks

Ning Zhang, Jingen Liu, Ke Wang, Dan Zeng, Tao Mei

Responsive image

Auto-TLDR; Two-Stream Residual Convolutional Network for Visual Tracking

Slides Poster Similar

The current deep learning based visual tracking approaches have been very successful by learning the target classification and/or estimation model from a large amount of supervised training data in offline mode. However, most of them can still fail in tracking objects due to some more challenging issues such as dense distractor objects, confusing background, motion blurs, and so on. Inspired by the human ``visual tracking'' capability which leverages motion cues to distinguish the target from the background, we propose a Two-Stream Residual Convolutional Network (TS-RCN) for visual tracking, which successfully exploits both appearance and motion features for model update. Our TS-RCN can be integrated with existing deep learning based visual trackers. To further improve the tracking performance, we adopt a ``wider'' residual network ResNeXt as its feature extraction backbone. To the best of our knowledge, TS-RCN is the first end-to-end trainable two-stream visual tracking system, which makes full use of both appearance and motion features of the target. We have extensively evaluated the TS-RCN on most widely used benchmark datasets including VOT2018, VOT2019, and GOT-10K. The experiment results have successfully demonstrated that our two-stream model can greatly outperform the appearance based tracker, and it also achieves state-of-the-art performance. The tracking system can run at up to 38.1 FPS.

MFST: Multi-Features Siamese Tracker

Zhenxi Li, Guillaume-Alexandre Bilodeau, Wassim Bouachir

Responsive image

Auto-TLDR; Multi-Features Siamese Tracker for Robust Deep Similarity Tracking

Slides Similar

Siamese trackers have recently achieved interesting results due to their balanced accuracy-speed. This success is mainly due to the fact that deep similarity networks were specifically designed to address the image similarity problem. Therefore, they are inherently more appropriate than classical CNNs for the tracking task. However, Siamese trackers rely on the last convolutional layers for similarity analysis and target search, which restricts their performance. In this paper, we argue that using a single convolutional layer as feature representation is not the optimal choice within the deep similarity framework, as multiple convolutional layers provide several abstraction levels in characterizing an object. Starting from this motivation, we present the Multi-Features Siamese Tracker (MFST), a novel tracking algorithm exploiting several hierarchical feature maps for robust deep similarity tracking. MFST proceeds by fusing hierarchical features to ensure a richer and more efficient representation. Moreover, we handle appearance variation by calibrating deep features extracted from two different CNN models. Based on this advanced feature representation, our algorithm achieves high tracking accuracy, while outperforming several state-of-the-art trackers, including standard Siamese trackers.

Siamese Fully Convolutional Tracker with Motion Correction

Mathew Francis, Prithwijit Guha

Responsive image

Auto-TLDR; A Siamese Ensemble for Visual Tracking with Appearance and Motion Components

Slides Poster Similar

Visual tracking algorithms use cues like appearance, structure, motion etc. for locating an object in a video. We propose an ensemble tracker with appearance and motion components. A siamese tracker that learns object appearance from a static image and motion vectors computed between consecutive frames with a flow network forms the ensemble. Motion predicted object localization is used to correct the appearance component in the ensemble. Complementary nature of the components bring performance improvement as observed in experiments performed on VOT2018 and VOT2019 datasets.

SiamMT: Real-Time Arbitrary Multi-Object Tracking

Lorenzo Vaquero, Manuel Mucientes, Victor Brea

Responsive image

Auto-TLDR; SiamMT: A Deep-Learning-based Arbitrary Multi-Object Tracking System for Video

Slides Poster Similar

Visual object tracking is of great interest in many applications, as it preserves the identity of an object throughout a video. However, while real applications demand systems capable of real-time-tracking multiple objects, multi-object tracking solutions usually follow the tracking-by-detection paradigm, thus they depend on running a costly detector in each frame, and they do not allow the tracking of arbitrary objects, i.e., they require training for specific classes. In response to this need, this work presents the architecture of SiamMT, a system capable of efficiently applying individual visual tracking techniques to multiple objects in real-time. This makes it the first deep-learning-based arbitrary multi-object tracker. To achieve this, we propose the global frame features extraction by using a fully-convolutional neural network, followed by the cropping and resizing of the different object search areas. The final similarity operation between these search areas and the target exemplars is carried out with an optimized pairwise cross-correlation. These novelties allow the system to track multiple targets in a scalable manner, achieving 25 fps with 60 simultaneous objects for VGA videos and 40 objects for HD720 videos, all with a tracking quality similar to SiamFC.

Reducing False Positives in Object Tracking with Siamese Network

Takuya Ogawa, Takashi Shibata, Shoji Yachida, Toshinori Hosoi

Responsive image

Auto-TLDR; Robust Long-Term Object Tracking with Adaptive Search based on Motion Models

Slides Poster Similar

We have developed a robust long-term object tracking method that resolves the fundamental cause of the drift and loss of a target in visual object tracking. The proposed method consists of “sampling area extension”, which prevents a tracking result from drifting to other objects by learning false positive samples in advance (before they enter the search region of the target), and “adaptive search based on motion models”, which prevents a tracking result from drifting to other objects and avoids the loss of the target by using not only appearance features but also motion models to adaptively search for the target. Experiments conducted on long-term tracking dataset showed that our first technique improved robustness by 16.6% while the second technique improved robustness by 15.3%. By combining both, our method achieved 21.7% and 9.1% improvement for the robustness and precision, and the processing speed became 3.3 times faster. Additional experiments showed that our method achieved the top robustness among state-of-the-art methods on three long-term tracking datasets. These findings demonstrate that our method is effective for long-term object tracking and that its performance and speed are promising for use in practical applications of various technologies underlying object tracking.

Visual Object Tracking in Drone Images with Deep Reinforcement Learning

Derya Gözen, Sedat Ozer

Responsive image

Auto-TLDR; A Deep Reinforcement Learning based Single Object Tracker for Drone Applications

Slides Poster Similar

There is an increasing demand on utilizing camera equipped drones and their applications in many domains varying from agriculture to entertainment and from sports events to surveillance. In such drone applications, an essential and a common task is tracking an object of interest visually. Drone (or UAV) images have different properties when compared to the ground taken (natural) images and those differences introduce additional complexities to the existing object trackers to be directly applied on drone applications. Some important differences among those complexities include (i) smaller object sizes to be tracked and (ii) different orientations and viewing angles yielding different texture and features to be observed. Therefore, new algorithms trained on drone images are needed for the drone-based applications. In this paper, we introduce a deep reinforcement learning (RL) based single object tracker that tracks an object of interest in drone images by estimating a series of actions to find the location of the object in the next frame. This is the first work introducing a single object tracker using a deep RL-based technique for drone images. Our proposed solution introduces a novel reward function that aims to reduce the total number of actions taken to estimate the object's location in the next frame and also introduces a different backbone network to be used on low resolution images. Additionally, we introduce a set of new actions into the action library to better deal with the above-mentioned complexities. We compare our proposed solutions to a state of the art tracking algorithm from the recent literature and demonstrate up to 3.87\% improvement in precision and 3.6\% improvement in IoU values on the VisDrone2019 dataset. We also provide additional results on OTB-100 dataset and show up to 3.15\% improvement in precision on the OTB-100 dataset when compared to the same previous state of the art algorithm. Lastly, we analyze the ability to handle some of the challenges faced during tracking, including but not limited to occlusion, deformation, and scale variation for our proposed solutions.

Efficient Correlation Filter Tracking with Adaptive Training Sample Update Scheme

Shan Jiang, Shuxiao Li, Chengfei Zhu, Nan Yan

Responsive image

Auto-TLDR; Adaptive Training Sample Update Scheme of Correlation Filter Based Trackers for Visual Tracking

Slides Poster Similar

Visual tracking serves as a significant module in many applications. However, the heavy computation and low speed of many recent trackers restrict their applications in some computing power restricted scenarios. On the other hand, the simple update scheme of most correlation filter based trackers limits their robustness during target deformation and occlusion. In this paper, we explore the update scheme of correlation filter based trackers and propose an efficient and adaptive training sample update scheme. Training sample extracted in each frame is updated to the training set according to its distance between existing samples measured with difference hashing algorithm(DHA) or discarded according to tracking result reliability. Experiments on OTB-2015, Temple Color 128 and UAV123 demonstrate our tracker performs favourably against state-of-the-art trackers with light computation and runs over 100 fps on desktop computer with Intel i7-8700 CPU(3.2GHz).

SynDHN: Multi-Object Fish Tracker Trained on Synthetic Underwater Videos

Mygel Andrei Martija, Prospero Naval

Responsive image

Auto-TLDR; Underwater Multi-Object Tracking in the Wild with Deep Hungarian Network

Slides Poster Similar

In this paper, we seek to extend multi-object tracking research on a relatively less explored domain, that of, underwater multi-object tracking in the wild. Multi-object fish tracking is an important task because it can provide fish monitoring systems with richer information (e.g. multiple views of the same fish) as compared to detections and it can be an invaluable input to fish behavior analysis. However, there is a lack of an annotated benchmark dataset with enough samples for this task. To circumvent the need for manual ground truth tracking annotation, we craft a synthetic dataset. Using this synthetic dataset, we train an integrated detector and tracker called SynDHN. SynDHN uses the Deep Hungarian Network (DHN), which is a differentiable approximation of the Hungarian assignment algorithm. We repurpose DHN to become the tracking component of our algorithm by performing the task of affinity estimation between detector predictions. We consider both spatial and appearance features for affinity estimation. Our results show that despite being trained on a synthetic dataset, SynDHN generalizes well to real underwater video tracking and performs better against our baseline algorithms.

Bidirectional Matrix Feature Pyramid Network for Object Detection

Wei Xu, Yi Gan, Jianbo Su

Responsive image

Auto-TLDR; BMFPN: Bidirectional Matrix Feature Pyramid Network for Object Detection

Slides Poster Similar

Feature pyramids are widely used to improve scale invariance for object detection. Most methods just map the objects to feature maps with relevant square receptive fields, but rarely pay attention to the aspect ratio variation, which is also an important property of object instances. It will lead to a poor match between rectangular objects and assigned features with square receptive fields, thus preventing from accurate recognition and location. Besides, the information propagation among feature layers is sparse, namely, each feature in the pyramid may mainly or only contain single-level information, which is not representative enough for classification and localization sub-tasks. In this paper, Bidirectional Matrix Feature Pyramid Network (BMFPN) is proposed to address these issues. It consists of three modules: Diagonal Layer Generation Module (DLGM), Top-down Module (TDM) and Bottom-up Module (BUM). First, multi-level features extracted by backbone are fed into DLGM to produce the base features. Then these base features are utilized to construct the final feature pyramid through TDM and BUM in series. The receptive fields of the designed feature layers in BMFPN have various scales and aspect ratios. Objects can be correctly assigned to appropriate and representative feature maps with relevant receptive fields depending on its scale and aspect ratio properties. Moreover, TDM and BUM form bidirectional and reticular information flow, which effectively fuses multi level information in top-down and bottom-up manner respectively. To evaluate the effectiveness of our proposed architecture, an end-toend anchor-free detector is designed and trained by integrating BMFPN into FCOS. And the center ness branch in FCOS is modified with our Gaussian center-ness branch (GCB), which brings another slight improvement. Without bells and whistles, our method gains +3.3%, +2.4% and +2.6% AP on MS COCO dataset from baselines with ResNet-50, ResNet-101 and ResNeXt-101 backbones, respectively.

A Novel Region of Interest Extraction Layer for Instance Segmentation

Leonardo Rossi, Akbar Karimi, Andrea Prati

Responsive image

Auto-TLDR; Generic RoI Extractor for Two-Stage Neural Network for Instance Segmentation

Slides Poster Similar

Given the wide diffusion of deep neural network architectures for computer vision tasks, several new applications are nowadays more and more feasible. Among them, a particular attention has been recently given to instance segmentation, by exploiting the results achievable by two-stage networks (such as Mask R-CNN or Faster R-CNN), derived from R-CNN. In these complex architectures, a crucial role is played by the Region of Interest (RoI) extraction layer, devoted to extract a coherent subset of features from a single Feature Pyramid Network (FPN) layer attached on top of a backbone. This paper is motivated by the need to overcome to the limitations of existing RoI extractors which select only one (the best) layer from FPN. Our intuition is that all the layers of FPN retain useful information. Therefore, the proposed layer (called Generic RoI Extractor - GRoIE) introduces non-local building blocks and attention mechanisms to boost the performance. A comprehensive ablation study at component level is conducted to find the best set of algorithms and parameters for the GRoIE layer. Moreover, GRoIE can be integrated seamlessly with every two-stage architecture for both object detection and instance segmentation tasks. Therefore, the improvements brought by the use of GRoIE in different state-of-the-art architectures are also evaluated. The proposed layer leads up to gain a 1.1% AP on bounding box detection and 1.7% AP on instance segmentation. The code is publicly available on GitHub repository at https://github.com/IMPLabUniPr/mmdetection-groie

AerialMPTNet: Multi-Pedestrian Tracking in Aerial Imagery Using Temporal and Graphical Features

Maximilian Kraus, Seyed Majid Azimi, Emec Ercelik, Reza Bahmanyar, Peter Reinartz, Alois Knoll

Responsive image

Auto-TLDR; AerialMPTNet: A novel approach for multi-pedestrian tracking in geo-referenced aerial imagery by fusing appearance features

Slides Poster Similar

Multi-pedestrian tracking in aerial imagery has several applications such as large-scale event monitoring, disaster management, search-and-rescue missions, and as input into predictive crowd dynamic models. Due to the challenges such as the large number and the tiny size of the pedestrians (e.g., 4 x 4 pixels) with their similar appearances as well as different scales and atmospheric conditions of the images with their extremely low frame rates (e.g., 2 fps), current state-of-the-art algorithms including the deep learning-based ones are unable to perform well. In this paper, we propose AerialMPTNet, a novel approach for multi-pedestrian tracking in geo-referenced aerial imagery by fusing appearance features from a Siamese Neural Network, movement predictions from a Long Short-Term Memory, and pedestrian interconnections from a GraphCNN. In addition, to address the lack of diverse aerial multi-pedestrian tracking datasets, we introduce the Aerial Multi-Pedestrian Tracking (AerialMPT) dataset consisting of 307 frames and 44,740 pedestrians annotated. To the best of our knowledge, AerialMPT is the largest and most diverse dataset to this date and will be released publicly. We evaluate AerialMPTNet on AerialMPT and KIT AIS, and benchmark with several state-of-the-art tracking methods. Results indicate that AerialMPTNet significantly outperforms other methods on accuracy and time-efficiency.

Compact and Discriminative Multi-Object Tracking with Siamese CNNs

Claire Labit-Bonis, Jérôme Thomas, Frederic Lerasle

Responsive image

Auto-TLDR; Fast, Light-Weight and All-in-One Single Object Tracking for Multi-Target Management

Slides Poster Similar

Following the tracking-by-detection paradigm, multiple object tracking deals with challenging scenarios, occlusions or even missing detections; the priority is often given to quality measures instead of speed, and a good trade-off between the two is hard to achieve. Based on recent work, we propose a fast, light-weight tracker able to predict targets position and reidentify them at once, when it is usually done with two sequential steps. To do so, we combine a bounding box regressor with a target-oriented appearance learner in a newly designed and unified architecture. This way, our tracker can infer the targets' image pose but also provide us with a confidence level about target identity. Most of the time, it is also common to filter out the detector outputs with a preprocessing step, throwing away precious information about what has been seen in the image. We propose a tracks management strategy able to balance efficiently between detection and tracking outputs and their associated likelihoods. Simply put, we spotlight a full siamese based single object tracker able to predict both position and appearance features at once with a light-weight and all-in-one architecture, within a balanced overall multi-target management strategy. We demonstrate the efficiency and speed of our system w.r.t the literature on the well-known MOT17 challenge benchmark, and bring to the fore qualitative evaluations as well as state-of-the-art quantitative results.

DualBox: Generating BBox Pair with Strong Correspondence Via Occlusion Pattern Clustering and Proposal Refinement

Zheng Ge, Chuyu Hu, Xin Huang, Baiqiao Qiu, Osamu Yoshie

Responsive image

Auto-TLDR; R2NMS: Combining Full and Visible Body Bounding Box for Dense Pedestrian Detection

Slides Poster Similar

Despite the rapid development of pedestrian detection, the problem of dense pedestrian detection is still unsolved, especially the upper limit of Recall caused by Non-Maximum-Suppression (NMS). Out of this reason, R2NMS is proposed to simultaneously detect full and visible body bounding boxes, by replacing the full body BBoxes with less occluded visible body BBoxes in the NMS algorithm, achieving a higher recall. However, the P-RPN and P-RCNN modules proposed in R2NMS for simultaneous high quality full and visible body prediction require non-trivial positive/negative assigning strategies for anchor BBoxes. To simplify the prerequisites and improve the utility of R2NMS, we incorporate clustering analysis into the learning of visible body proposals from full body proposals. Furthermore, to reduce the computation complexity caused by the large number of potential visible body proposals, we introduce a novel occlusion pattern prediction branch on top of the R-CNN module (i.e. F-RCNN) to select the best matched visible proposals for each full body proposals and then feed them into another R-CNN module (i.e. V-RCNN). Incorporated with R2NMS, our DualBox model can achieve competitive performance while only requires few hyper-parameters. We validate the effectiveness of the proposed approach on the CrowdHuman and CityPersons datasets. Experimental results show that our approach achieves promising performance for detecting both non-occluded and occluded pedestrians, especially heavily occluded ones.

Construction Worker Hardhat-Wearing Detection Based on an Improved BiFPN

Chenyang Zhang, Zhiqiang Tian, Jingyi Song, Yaoyue Zheng, Bo Xu

Responsive image

Auto-TLDR; A One-Stage Object Detection Method for Hardhat-Wearing in Construction Site

Slides Poster Similar

Work in the construction site is considered to be one of the occupations with the highest safety risk factor. Therefore, safety plays an important role in construction site. One of the most fundamental safety rules in construction site is to wear a hardhat. To strengthen the safety of the construction site, most of the current methods use multi-stage method for hardhat-wearing detection. These methods have limitations in terms of adaptability and generalizability. In this paper, we propose a one-stage object detection method based on convolutional neural network. We present a multi-scale strategy that selects the high-resolution feature maps of DarkNet-53 to effectively identify small-scale hardhats. In addition, we propose an improved weighted bi-directional feature pyramid network (BiFPN), which could fuse more semantic features from more scales. The proposed method can not only detect hardhat-wearing, but also identify the color of the hardhat. Experimental results show that the proposed method achieves a mAP of 87.04%, which outperforms several state-of-the-art methods on a public dataset.

Forground-Guided Vehicle Perception Framework

Kun Tian, Tong Zhou, Shiming Xiang, Chunhong Pan

Responsive image

Auto-TLDR; A foreground segmentation branch for vehicle detection

Slides Poster Similar

As the basis of advanced visual tasks such as vehicle tracking and traffic flow analysis, vehicle detection needs to accurately predict the position and category of vehicle objects. In the past decade, deep learning based methods have made great progress. However, we also notice that some existing cases are not studied thoroughly. First, false positive on the background regions is one of the critical problems. Second, most of the previous approaches only optimize a single vehicle detection model, ignoring the relationship between different visual perception tasks. In response to the above two findings, we introduce a foreground segmentation branch for the first time, which can predict the pixel level of vehicles in advance. Furthermore, two attention modules are designed to guide the work of the detection branch. The proposed method can be easily grafted into the one-stage and two-stage detection framework. We evaluate the effectiveness of our model on LSVH, a dataset with large variations in vehicle scales, and achieve the state-of-the-art detection accuracy.

Adaptive Context-Aware Discriminative Correlation Filters for Robust Visual Object Tracking

Tianyang Xu, Zhenhua Feng, Xiaojun Wu, Josef Kittler

Responsive image

Auto-TLDR; ACA-DCF: Adaptive Context-Aware Discriminative Correlation Filter with complementary attention mechanisms

Slides Poster Similar

In recent years, Discriminative Correlation Filters (DCFs) have gained popularity due to their superior performance in visual object tracking. However, existing DCF trackers usually learn filters using fixed attention mechanisms that focus on the centre of an image and suppresses filter amplitudes in surroundings. In this paper, we propose an Adaptive Context-Aware Discriminative Correlation Filter (ACA-DCF) that is able to improve the existing DCF formulation with complementary attention mechanisms. Our ACA-DCF integrates foreground attention and background attention for complementary context-aware filter learning. More importantly, we ameliorate the design using an adaptive weighting strategy that takes complex appearance variations into account. The experimental results obtained on several well-known benchmarks demonstrate the effectiveness and superiority of the proposed method over the state-of-the-art approaches.

PRF-Ped: Multi-Scale Pedestrian Detector with Prior-Based Receptive Field

Yuzhi Tan, Hongxun Yao, Haoran Li, Xiusheng Lu, Haozhe Xie

Responsive image

Auto-TLDR; Bidirectional Feature Enhancement Module for Multi-Scale Pedestrian Detection

Slides Poster Similar

Multi-scale feature representation is a common strategy to handle the scale variation in pedestrian detection. Existing methods simply utilize the convolutional pyramidal features for multi-scale representation. However, they rarely pay attention to the differences among different feature scales and extract multi-scale features from a single feature map, which may make the detectors sensitive to scale-variance in multi-scale pedestrian detection. In this paper, we introduce a bidirectional feature enhancement module (BFEM) to augment the semantic information of low-level features and the localization information of high-level features. In addition, we propose a prior-based receptive field block (PRFB) for multi-scale pedestrian feature extraction, where the receptive field is closer to the aspect ratio of the pedestrian target. Consequently, it is less affected by the surrounding background when extracting features. Experimental results indicate that the proposed method outperform the state-of-the-art methods on the CityPersons and Caltech datasets.

Detective: An Attentive Recurrent Model for Sparse Object Detection

Amine Kechaou, Manuel Martinez, Monica Haurilet, Rainer Stiefelhagen

Responsive image

Auto-TLDR; Detective: An attentive object detector that identifies objects in images in a sequential manner

Slides Poster Similar

In this work, we present Detective – an attentive object detector that identifies objects in images in a sequential manner. Our network is based on an encoder-decoder architecture, where the encoder is a convolutional neural network, and the decoder is a convolutional recurrent neural network coupled with an attention mechanism. At each iteration, our decoder focuses on the relevant parts of the image using an attention mechanism, and then estimates the object’s class and the bounding box coordinates. Current object detection models generate dense predictions and rely on post-processing to remove duplicate predictions. Detective is a sparse object detector that generates a single bounding box per object instance. However, training a sparse object detector is challenging, as it requires the model to reason at the instance level and not just at the class and spatial levels. We propose a training mechanism based on the Hungarian Algorithm and a loss that balances the localization and classification tasks. This allows Detective to achieve promising results on the PASCAL VOC object detection dataset. Our experiments demonstrate that sparse object detection is possible and has a great potential for future developments in applications where the order of the objects to be predicted is of interest.

Nighttime Pedestrian Detection Based on Feature Attention and Transformation

Gang Li, Shanshan Zhang, Jian Yang

Responsive image

Auto-TLDR; FAM and FTM: Enhanced Feature Attention Module and Feature Transformation Module for nighttime pedestrian detection

Slides Poster Similar

Pedestrian detection at nighttime is an important yet challenging task, which is fundamental for many practical applications, e.g. autonomous driving, video surveillance. To address this problem, in this work we start with some analysis, from which we find that the nighttime features have much more noise than that of daytime, resulting in low discrimination ability. Besides, we also observe some pedestrian examples are under adverse illumination conditions, and they can hardly provide sufficient information for accurate detection. Based on these findings, we propose the Feature Attention Module (FAM) and Feature Transformation Module (FTM) to enhance nighttime features. In FAM, guided by progressive segmentation supervision, hierarchical feature attention is produced to enhance multi-level features. On the other hand, FTM is introduced to enforce features from adverse illumination to approach that from better illumination. Based on feature attention and transformation (FAT) mechanism, a two-stage detector called FATNet is constructed for nighttime pedestrian detection. We conduct extensive experiments on nighttime datasets of EuroCity Persons (Night) and NightOwls to demonstrate the effectiveness of our method. On both two datasets, our method achieves significant improvements to the baseline and also outperforms state-of-the-art detectors.

ACRM: Attention Cascade R-CNN with Mix-NMS for Metallic Surface Defect Detection

Junting Fang, Xiaoyang Tan, Yuhui Wang

Responsive image

Auto-TLDR; Attention Cascade R-CNN with Mix Non-Maximum Suppression for Robust Metal Defect Detection

Slides Poster Similar

Metallic surface defect detection is of great significance in quality control for production. However, this task is very challenging due to the noise disturbance, large appearance variation, and the ambiguous definition of the defect individual. Traditional image processing methods are unable to detect the damaged region effectively and efficiently. In this paper, we propose a new defect detection method, Attention Cascade R-CNN with Mix-NMS (ACRM), to classify and locate defects robustly. Three submodules are developed to achieve this goal: 1) a lightweight attention block is introduced, which can improve the ability in capture global and local feature both in the spatial and channel dimension; 2) we firstly apply the cascade R-CNN to our task, which exploits multiple detectors to sequentially refine the detection result robustly; 3) we introduce a new method named Mix Non-Maximum Suppression (Mix-NMS), which can significantly improve its ability in filtering the redundant detection result in our task. Extensive experiments on a real industrial dataset show that ACRM achieves state-of-the-art results compared to the existing methods, demonstrating the effectiveness and robustness of our detection method.

HPERL: 3D Human Pose Estimastion from RGB and LiDAR

Michael Fürst, Shriya T.P. Gupta, René Schuster, Oliver Wasenmüler, Didier Stricker

Responsive image

Auto-TLDR; 3D Human Pose Estimation Using RGB and LiDAR Using Weakly-Supervised Approach

Slides Poster Similar

In-the-wild human pose estimation has a huge potential for various fields, ranging from animation and action recognition to intention recognition and prediction for autonomous driving. The current state-of-the-art is focused only on RGB and RGB-D approaches for predicting the 3D human pose. However, not using precise LiDAR depth information limits the performance and leads to very inaccurate absolute pose estimation. With LiDAR sensors becoming more affordable and common on robots and autonomous vehicle setups, we propose an end-to-end architecture using RGB and LiDAR to predict the absolute 3D human pose with unprecedented precision. Additionally, we introduce a weakly-supervised approach to generate 3D predictions using 2D pose annotations from PedX. This allows for many new opportunities in the field of 3D human pose estimation.

Learning a Dynamic High-Resolution Network for Multi-Scale Pedestrian Detection

Mengyuan Ding, Shanshan Zhang, Jian Yang

Responsive image

Auto-TLDR; Learningable Dynamic HRNet for Pedestrian Detection

Slides Poster Similar

Pedestrian detection is a canonical instance of object detection in computer vision. In practice, scale variation is one of the key challenges, resulting in unbalanced performance across different scales. Recently, the High-Resolution Network (HRNet) has become popular because high-resolution feature representations are more friendly to small objects. However, when we apply HRNet for pedestrian detection, we observe that it improves for small pedestrians on one hand, but hurts the performance for larger ones on the other hand. To overcome this problem, we propose a learnable Dynamic HRNet (DHRNet) aiming to generate different network paths adaptive to different scales. Specifically, we construct a parallel multi-branch architecture and add a soft conditional gate module allowing for dynamic feature fusion. Both branches share all the same parameters except the soft gate module. Experimental results on CityPersons and Caltech benchmarks indicate that our proposed dynamic HRNet is more capable of dealing with pedestrians of various scales, and thus improves the performance across different scales consistently.

Scene Text Detection with Selected Anchors

Anna Zhu, Hang Du, Shengwu Xiong

Responsive image

Auto-TLDR; AS-RPN: Anchor Selection-based Region Proposal Network for Scene Text Detection

Slides Poster Similar

Object proposal technique with dense anchoring scheme for scene text detection were applied frequently to achieve high recall. It results in the significant improvement in accuracy but waste of computational searching, regression and classification. In this paper, we propose an anchor selection-based region proposal network (AS-RPN) using effective selected anchors instead of dense anchors to extract text proposals. The center, scales, aspect ratios and orientations of anchors are learnable instead of fixing, which leads to high recall and greatly reduced numbers of anchors. By replacing the anchor-based RPN in Faster RCNN, the AS-RPN-based Faster RCNN can achieve comparable performance with previous state-of-the-art text detecting approaches on standard benchmarks, including COCO-Text, ICDAR2013, ICDAR2015 and MSRA-TD500 when using single-scale and single model (ResNet50) testing only.

Object Detection Model Based on Scene-Level Region Proposal Self-Attention

Yu Quan, Zhixin Li, Canlong Zhang, Huifang Ma

Responsive image

Auto-TLDR; Exploiting Semantic Informations for Object Detection

Slides Poster Similar

The improvement of object detection performance is mostly focused on the extraction of local information near the region of interest in the image, which results in detection performance in this area being unable to achieve the desired effect. First, a depth-wise separable convolution network(D_SCNet-127 R-CNN) is built on the backbone network. Considering the importance of scene and semantic informations for visual recognition, the feature map is sent into the branch of the semantic segmentation module, region proposal network module, and the region proposal self-attention module to build the network of scene-level and region proposal self-attention module. Second, a deep reinforcement learning was utilized to achieve accurate positioning of border regression, and the calculation speed of the whole model was improved through implementing a light-weight head network. This model can effectively solve the limitation of feature extraction in traditional object detection and obtain more comprehensive detailed features. The experimental verification on MSCOCO17, VOC12, and Cityscapes datasets shows that the proposed method has good validity and scalability.

CenterRepp: Predict Central Representative Point Set's Distribution for Detection

Yulin He, Limeng Zhang, Wei Chen, Xin Luo, Chen Li, Xiaogang Jia

Responsive image

Auto-TLDR; CRPDet: CenterRepp Detector for Object Detection

Slides Poster Similar

Object detection has long been an important issue in the discipline of scene understanding. Existing researches mainly focus on the object itself, ignoring its surrounding environment. In fact, the surrounding environment provides abundant information to help detectors classify and locate objects. This paper proposes CRPDet, viz. CenterRepp Detector, a framework for object detection. The main function of CRPDet is accomplished by the CenterRepp module, which takes into account the surrounding environment by predicting the distribution of the central representative points. CenterRepp converts labeled object frames into the mean and standard variance of the sampling points’ distribution. This helps increase the receptive field of objects, breaking the limitation of object frames. CenterRepp defines a position-fixed center point with significant weights, avoiding to sample all points in the surroundings. Experiments on the COCO test-dev detection benchmark demonstrates that our proposed CRPDet has comparable performance with state-of-the-art detectors, achieving 39.4 mAP with 51 FPS tested under single size input.

Convolutional STN for Weakly Supervised Object Localization

Akhil Meethal, Marco Pedersoli, Soufiane Belharbi, Eric Granger

Responsive image

Auto-TLDR; Spatial Localization for Weakly Supervised Object Localization

Slides Similar

Weakly-supervised object localization is a challenging task in which the object of interest should be localized while learning its appearance. State-of-the-art methods recycle the architecture of a standard CNN by using the activation maps of the last layer for localizing the object. While this approach is simple and works relatively well, object localization relies on different features than classification, thus, a specialized localization mechanism is required during training to improve performance. In this paper, we propose a convolutional, multi-scale spatial localization network that provides accurate localization for the object of interest. Experimental results on CUB-200-2011 and ImageNet datasets show competitive performance of our proposed approach on Weakly supervised localization.

Object Detection Using Dual Graph Network

Shengjia Chen, Zhixin Li, Feicheng Huang, Canlong Zhang, Huifang Ma

Responsive image

Auto-TLDR; A Graph Convolutional Network for Object Detection with Key Relation Information

Slides Similar

Most object detection methods focus only on the local information near the region proposal and ignore the object's global semantic relation and local spatial relation information, resulting in limited performance. To capture and explore these important relations, we propose a detection method based on a graph convolutional network (GCN). Two independent relation graph networks are used to obtain the global semantic information of the object in labels and the local spatial information in images. Semantic relation networks can implicitly acquire global knowledge, and by constructing a directed graph on the dataset, each node is represented by the word embedding of labels and then sent to the GCN to obtain high-level semantic representation. The spatial relation network encodes the relation by the positional relation module and the visual connection module, and enriches the object features through local key information from objects. The feature representation is further improved by aggregating the outputs of the two networks. Instead of directly disseminating visual features in the network, the dual-graph network explores more advanced feature information, giving the detector the ability to obtain key relations in labels and region proposals. Experiments on the PASCAL VOC and MS COCO datasets demonstrate that key relation information significantly improve the performance of detection with better ability to detect small objects and reasonable boduning box. The results on COCO dataset demonstrate our method obtains around 32.3% improvement on AP in terms of small objects.

MAGNet: Multi-Region Attention-Assisted Grounding of Natural Language Queries at Phrase Level

Amar Shrestha, Krittaphat Pugdeethosapol, Haowen Fang, Qinru Qiu

Responsive image

Auto-TLDR; MAGNet: A Multi-Region Attention-Aware Grounding Network for Free-form Textual Queries

Slides Poster Similar

Grounding free-form textual queries necessitates an understanding of these textual phrases and its relation to the visual cues to reliably reason about the described locations. Spatial attention networks are known to learn this relationship and focus its gaze on salient objects in the image. Thus, we propose to utilize spatial attention networks for image-level visual-textual fusion preserving local (word) and global (phrase) information to refine region proposals with an in-network Region Proposal Network (RPN) and detect single or multiple regions for a phrase query. We focus only on the phrase query - ground truth pair (referring expression) for a model independent of the constraints of the datasets i.e. additional attributes, context etc. For such referring expression dataset ReferIt game, our Multi- region Attention-assisted Grounding network (MAGNet) achieves over 12% improvement over the state-of-the-art. Without the con- text from image captions and attribute information in Flickr30k Entities, we still achieve competitive results compared to the state- of-the-art.

TAAN: Task-Aware Attention Network for Few-Shot Classification

Zhe Wang, Li Liu, Fanzhang Li

Responsive image

Auto-TLDR; TAAN: Task-Aware Attention Network for Few-Shot Classification

Slides Poster Similar

Few-shot classification aims to recognize unlabeled samples from unseen classes given only a few labeled samples.Current approaches of few-shot learning usually employ a metriclearning framework to learn a feature similarity comparison between a query (test) example and the few support (training) examples. However, these approaches all extract features from samples independently without looking at the entire task as a whole, and so fail to provide an enough discrimination to features. Moreover, the existing approaches lack the ability to select the most relevant features for the task at hand. In this work, we propose a novel algorithm called Task-Aware Attention Network (TAAN) to address the above problems in few-shot classification. By inserting a Task-Relevant Channel Attention Module into metric-based few-shot learners, TAAN generates channel attentions for each sample by aggregating the context of the entire support set and identifies the most relevant features for similarity comparison. The experiment demonstrates that TAAN is competitive in overall performance comparing to the recent state-of-the-art systems and improves the performance considerably over baseline systems on both mini-ImageNet and tiered-ImageNet benchmarks.

SFPN: Semantic Feature Pyramid Network for Object Detection

Yi Gan, Wei Xu, Jianbo Su

Responsive image

Auto-TLDR; SFPN: Semantic Feature Pyramid Network to Address Information Dilution Issue in FPN

Slides Poster Similar

Feature Pyramid Network(FPN) employs a top-down path to enhance low level feature by utilizing high level feature.However, further improvement of detector is greatly hindered by the inner defect of FPN. The dilution issue in FPN is analyzed in this paper, and a new architecture named Semantic Feature Pyramid Network(SFPN) is introduced to address the information imbalance problem caused by information dilution. The proposed method consists of two simple and effective components: Semantic Pyramid Module(SPM) and Semantic Feature Fusion Module(SFFM). To compensate for the weaknesses of FPN, the semantic segmentation result is utilized as an extra information source in our architecture.By constructing a semantic pyramid based on the segmentation result and fusing it with FPN, feature maps at each level can obtain the necessary information without suffering from the dilution issue. The proposed architecture could be applied on many detectors, and non-negligible improvement could be achieved. Although this method is designed for object detection, other tasks such as instance segmentation can also largely benefit from it. The proposed method brings Faster R-CNN and Mask R-CNN with ResNet-50 as backbone both 1.8 AP improvements respectively. Furthermore, SFPN improves Cascade R-CNN with backbone ResNet-101 from 42.4 AP to 43.5 AP.

Global Context-Based Network with Transformer for Image2latex

Nuo Pang, Chun Yang, Xiaobin Zhu, Jixuan Li, Xu-Cheng Yin

Responsive image

Auto-TLDR; Image2latex with Global Context block and Transformer

Slides Poster Similar

Image2latex usually means converts mathematical formulas in images into latex markup. It is a very challenging job due to the complex two-dimensional structure, variant scales of input, and very long representation sequence. Many researchers use encoder-decoder based model to solve this task and achieved good results. However, these methods don't make full use of the structure and position information of the formula. %In this paper, we improve the encoder by employing Global Context block and Transformer. To solve this problem, we propose a global context-based network with transformer that can (1) learn a more powerful and robust intermediate representation via aggregating global features and (2) encode position information explicitly and (3) learn latent dependencies between symbols by using self-attention mechanism. The experimental results on the dataset IM2LATEX-100K demonstrate the effectiveness of our method.

Mutual-Supervised Feature Modulation Network for Occluded Pedestrian Detection

Ye He, Chao Zhu, Xu-Cheng Yin

Responsive image

Auto-TLDR; A Mutual-Supervised Feature Modulation Network for Occluded Pedestrian Detection

Similar

State-of-the-art pedestrian detectors have achieved significant progress on non-occluded pedestrians, yet they are still struggling under heavy occlusions. The recent occlusion handling strategy of popular two-stage approaches is to build a two-branch architecture with the help of additional visible body annotations. Nonetheless, these methods still have some weaknesses. Either the two branches are trained independently with only score-level fusion, which cannot guarantee the detectors to learn robust enough pedestrian features. Or the attention mechanisms are exploited to only emphasize on the visible body features. However, the visible body features of heavily occluded pedestrians are concentrated on a relatively small area, which will easily cause missing detections. To address the above issues, we propose in this paper a novel Mutual-Supervised Feature Modulation (MSFM) network, to better handle occluded pedestrian detection. The key MSFM module in our network calculates the similarity loss of full body boxes and visible body boxes corresponding to the same pedestrian, so that the full-body detector could learn more complete and robust pedestrian features with the assist of contextual features from the occluding parts. To facilitate the MSFM module, we also propose a novel two-branch architecture, consisting of a standard full body detection branch and an extra visible body classification branch. These two branches are trained in a mutual-supervised way with full body annotations and visible body annotations, respectively. To verify the effectiveness of our proposed method, extensive experiments are conducted on two challenging pedestrian datasets: Caltech and CityPersons, and our approach achieves superior performances compared to other state-of-the-art methods on both datasets, especially in heavy occlusion cases.

TCATD: Text Contour Attention for Scene Text Detection

Ziling Hu, Wu Xingjiao, Jing Yang

Responsive image

Auto-TLDR; Text Contour Attention Text Detector

Slides Poster Similar

Segmentation-based approaches have enabled state-of-the-art performance in long or curved text detection tasks. However, false detection still is a challenge when two text instances are close to each other. To address this problem, in this paper, we propose a Text Contour Attention Text Detector (TCATD), which can locate scene text with arbitrary orientation and shape accurately. Different from previous work, TCATD focus on text contour map (TC), text center intensity map (TCI) and text kernel maps (TK). The TC can introduce text contour information, the TCI can help to learn the accurate text segmentation and the TK can generate the complete shape of text instances. Besides, we propose a Text Contour Attention Module to deal with contour information. After the Text Contour Attention Module, TC, TCI and TK will be obtained. Extensive experiments on ICDAR2015, CTW1500 and Total-Text demonstrate that the proposed method achieves the state-of-the-art performance.

Cascade Saliency Attention Network for Object Detection in Remote Sensing Images

Dayang Yu, Rong Zhang, Shan Qin

Responsive image

Auto-TLDR; Cascade Saliency Attention Network for Object Detection in Remote Sensing Images

Slides Poster Similar

Object detection in remote sensing images is a challenging task due to objects in the bird-view perspective appearing with arbitrary orientations. Though considerable progress has been made, there still exist challenges with the interference from complex backgrounds, dense arrangement, and large-scale variations. In this paper, we propose an oriented detector named Cascade Saliency Attention Network (CSAN), designed for comprehensively suppressing interference in remote sensing images. Specifically, we first combine context and pixel attention on feature maps to enhance saliency of objects for suppressing interference from backgrounds. Then, in cascade network, we apply instance segmentation on ROI to increase saliency of the central object, thus preventing object features from mutual interference in dense arrangement. Additionally, to alleviate large-scale variations, we devise a multi-scale merge module during FPN merging process to learn richer scale representations. Experimental results on DOTA and HRSC2016 datasets outperform other state-of-the-art object detection methods and verify the effectiveness of our method.

Temporal Feature Enhancement Network with External Memory for Object Detection in Surveillance Video

Masato Fujitake, Akihiro Sugimoto

Responsive image

Auto-TLDR; Temporal Attention Based External Memory Network for Surveillance Object Detection

Poster Similar

Video object detection is challenging and essential in practical applications, such as surveillance cameras for traffic control and public security. Unlike the video in natural scenes, the surveillance video tends to contain dense, and small objects (typically vehicles) in their appearances. Therefore, existing methods for surveillance object detection utilize still-image object detection approaches with rich feature extractors at the expense of their run-time speeds. The run-time speed, however, becomes essential when the video is being streamed. In this paper, we exploit temporal information in videos to enrich the feature maps, proposing the first temporal attention based external memory network for the live stream of video. Extensive experiments on real-world traffic surveillance benchmarks demonstrate the real-time performance of the proposed model while keeping comparable accuracy with state-of-the-art.

Efficient-Receptive Field Block with Group Spatial Attention Mechanism for Object Detection

Jiacheng Zhang, Zhicheng Zhao, Fei Su

Responsive image

Auto-TLDR; E-RFB: Efficient-Receptive Field Block for Deep Neural Network for Object Detection

Slides Poster Similar

Object detection has been paid rising attention in computer vision field. Convolutional Neural Networks (CNNs) extract high-level semantic features of images, which directly determine the performance of object detection. As a common solution, embedding integration modules into CNNs can enrich extracted features and thereby improve the performance. However, the instability and inconsistency of internal multiple branches exist in these modules. To address this problem, we propose a novel multibranch module called Efficient-Receptive Field Block (E-RFB), in which multiple levels of features are combined for network optimization. Specifically, by downsampling and increasing depth, the E-RFB provides sufficient RF. Second, in order to eliminate the inconsistency across different branches, a novel spatial attention mechanism, namely, Group Spatial Attention Module (GSAM) is proposed. The GSAM gradually narrows a feature map by channel grouping; thus it encodes the information between spatial and channel dimensions into the final attention heat map. Third, the proposed module can be easily joined in various CNNs to enhance feature representation as a plug-and-play component. With SSD-style detectors, our method halves the parameters of the original detection head and achieves high accuracy on the PASCAL VOC and MS COCO datasets. Moreover, the proposed method achieves superior performance compared with state-of-the-art methods based on similar framework.

Small Object Detection by Generative and Discriminative Learning

Yi Gu, Jie Li, Chentao Wu, Weijia Jia, Jianping Chen

Responsive image

Auto-TLDR; Generative and Discriminative Learning for Small Object Detection

Slides Poster Similar

With the development of deep convolutional neural networks (CNNs), the object detection accuracy has been greatly improved. But the performance of small object detection is still far from satisfactory, mainly because small objects are so tiny that the information contained in the feature map is limited. Existing methods focus on improving classification accuracy but still suffer from the limitation of bounding box prediction. To solve this issue, we propose a detection framework by generative and discriminative learning. First, a reconstruction generator network is designed to reconstruct the mapping from low frequency to high frequency for anchor box prediction. Then, a detector module extracts the regions of interest (ROIs) from generated results and implements a RoI-Head to predict object category and refine bounding box. In order to guide the reconstructed image related to the corresponding one, a discriminator module is adopted to tell from the generated result and the original image. Extensive evaluations on the challenging MS-COCO dataset demonstrate that our model outperforms most state-of-the-art models in detecting small objects, especially the reconstruction module improves the average precision for small object (APs) by 7.7%.

Detecting Objects with High Object Region Percentage

Fen Fang, Qianli Xu, Liyuan Li, Ying Gu, Joo-Hwee Lim

Responsive image

Auto-TLDR; Faster R-CNN for High-ORP Object Detection

Slides Poster Similar

Object shape is a subtle but important factor for object detection. It has been observed that the object-region-percentage (ORP) can be utilized to improve detection accuracy for elongated objects, which have much lower ORPs than other types of objects. In this paper, we propose an approach to improve the detection performance for objects whose ORPs are relatively higher.To address the problem of high-ORP object detection, we propose a method consisting of three steps. First, we adjust the ground truth bounding boxes of high-ORP objects to an optimal range. Second, we train an object detector, Faster R-CNN, based on adjusted bounding boxes to achieve high recall. Finally, we train a DCNN to learn the adjustment ratios towards four directions and adjust detected bounding boxes of objects to get better localization for higher precision. We evaluate the effectiveness of our method on 12 high-ORP objects in COCO and 8 objects in a proprietary gearbox dataset. The experimental results show that our method can achieve state-of-the-art performance on these objects while costing less resources in training and inference stages.

Learning Object Deformation and Motion Adaption for Semi-Supervised Video Object Segmentation

Xiaoyang Zheng, Xin Tan, Jianming Guo, Lizhuang Ma

Responsive image

Auto-TLDR; Semi-supervised Video Object Segmentation with Mask-propagation-based Model

Slides Poster Similar

We propose a novel method to solve the task of semi-supervised video object segmentation in this paper, where the mask annotation is only given at the first frame of the video sequence. A mask-propagation-based model is applied to learn the past and current information for segmentation. Besides, due to the scarcity of training data, image/mask pairs that model object deformation and shape variance are generated for the training phase. In addition, we generate the key flips between two adjacent frames for motion adaptation. The method works in an end-to-end way, without any online fine-tuning on test videos. Extensive experiments demonstrate that our method achieves competitive performance against state-of-the-art algorithms on benchmark datasets, covering cases with single object or multiple objects. We also conduct extensive ablation experiments to analyze the effectiveness of our proposed method.

Transformer Networks for Trajectory Forecasting

Francesco Giuliari, Hasan Irtiza, Marco Cristani, Fabio Galasso

Responsive image

Auto-TLDR; TransformerNetworks for Trajectory Prediction of People Interactions

Slides Poster Similar

Most recent successes on forecasting the people mo-tion are based on LSTM models andallmost recent progress hasbeen achieved by modelling the social interaction among peopleand the people interaction with the scene. We question the useof the LSTM models and propose the novel use of TransformerNetworks for trajectory forecasting. This is a fundamental switchfrom the sequential step-by-step processing of LSTMs to theonly-attention-based memory mechanisms of Transformers. Inparticular, we consider both the original Transformer Network(TF) and the larger Bidirectional Transformer (BERT), state-of-the-art on all natural language processing tasks. Our proposedTransformers predict the trajectories of the individual peoplein the scene. These are “simple” models because each personis modelled separately without any complex human-human norscene interaction terms. In particular, the TF modelwithoutbells and whistlesyields the best score on the largest and mostchallenging trajectory forecasting benchmark of TrajNet [1]. Ad-ditionally, its extension which predicts multiple plausible futuretrajectories performs on par with more engineered techniqueson the 5 datasets of ETH [2]+UCY [3]. Finally, we showthat Transformers may deal with missing observations, as itmay be the case with real sensor data. Code is available atgithub.com/FGiuliari/Trajectory-Transformer

A Transformer-Based Radical Analysis Network for Chinese Character Recognition

Chen Yang, Qing Wang, Jun Du, Jianshu Zhang, Changjie Wu, Jiaming Wang

Responsive image

Auto-TLDR; Transformer-based Radical Analysis Network for Chinese Character Recognition

Slides Poster Similar

Recently, a novel radical analysis network (RAN) has the capability of effectively recognizing unseen Chinese character classes and largely reducing the requirement of training data by treating a Chinese character as a hierarchical composition of radicals rather than a single character class.} However, when dealing with more challenging issues, such as the recognition of complicated characters, low-frequency character categories, and characters in natural scenes, RAN still has a lot of room for improvement. In this paper, we explore options to further improve the structure generalization and robustness capability of RAN with the Transformer architecture, which has achieved start-of-the-art results for many sequence-to-sequence tasks. More specifically, we propose to replace the original attention module in RAN with the transformer decoder, which is named as a transformer-based radical analysis network (RTN). The experimental results show that the proposed approach can significantly outperform the RAN on both printed Chinese character database and natural scene Chinese character database. Meanwhile, further analysis proves that RTN can be better generalized to complex samples and low-frequency characters, and has better robustness in recognizing Chinese characters with different attributes.