AerialMPTNet: Multi-Pedestrian Tracking in Aerial Imagery Using Temporal and Graphical Features

Maximilian Kraus, Seyed Majid Azimi, Emec Ercelik, Reza Bahmanyar, Peter Reinartz, Alois Knoll

Auto-TLDR; AerialMPTNet: A novel approach for multi-pedestrian tracking in geo-referenced aerial imagery by fusing appearance features

Multi-pedestrian tracking in aerial imagery has several applications, such as large-scale event monitoring, disaster management, search-and-rescue missions, and input to predictive crowd dynamics models. Due to challenges such as the large number and tiny size of the pedestrians (e.g., 4 x 4 pixels), their similar appearances, the different scales and atmospheric conditions of the images, and the extremely low frame rates (e.g., 2 fps), current state-of-the-art algorithms, including deep learning-based ones, perform poorly. In this paper, we propose AerialMPTNet, a novel approach for multi-pedestrian tracking in geo-referenced aerial imagery that fuses appearance features from a Siamese Neural Network, movement predictions from a Long Short-Term Memory, and pedestrian interconnections from a GraphCNN. In addition, to address the lack of diverse aerial multi-pedestrian tracking datasets, we introduce the Aerial Multi-Pedestrian Tracking (AerialMPT) dataset, consisting of 307 frames and 44,740 annotated pedestrians. To the best of our knowledge, AerialMPT is the largest and most diverse dataset of its kind to date and will be released publicly. We evaluate AerialMPTNet on AerialMPT and KIT AIS, and benchmark it against several state-of-the-art tracking methods. Results indicate that AerialMPTNet significantly outperforms the other methods in accuracy and time-efficiency.
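
As a rough illustration of the fusion idea described above (not the authors' code), the sketch below concatenates a Siamese appearance embedding, an LSTM summary of past offsets, and a toy one-layer graph convolution over neighbour positions, then regresses the next-frame offset. The `FusionTracker` name, all layer sizes, and the simplified graph layer are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusionTracker(nn.Module):
    def __init__(self, app_dim=128, hid=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=2, hidden_size=hid, batch_first=True)
        self.gcn = nn.Linear(2, hid)                  # toy graph layer over neighbour positions
        self.head = nn.Linear(app_dim + 2 * hid, 2)   # predicts the (dx, dy) offset

    def forward(self, app_feat, past_offsets, neighbour_pos, adjacency):
        # app_feat: (B, app_dim) Siamese appearance embedding
        # past_offsets: (B, T, 2) past per-frame movements
        # neighbour_pos: (B, N, 2), adjacency: (B, N, N)
        _, (h, _) = self.lstm(past_offsets)                               # motion summary
        g = torch.relu(adjacency @ self.gcn(neighbour_pos)).mean(dim=1)   # graph context
        return self.head(torch.cat([app_feat, h[-1], g], dim=-1))

model = FusionTracker()
pred = model(torch.randn(4, 128), torch.randn(4, 5, 2),
             torch.randn(4, 8, 2), torch.rand(4, 8, 8))
print(pred.shape)  # torch.Size([4, 2])
```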

Similar papers

Compact and Discriminative Multi-Object Tracking with Siamese CNNs

Claire Labit-Bonis, Jérôme Thomas, Frederic Lerasle

Auto-TLDR; Fast, Light-Weight and All-in-One Single Object Tracking for Multi-Target Management

Following the tracking-by-detection paradigm, multiple object tracking deals with challenging scenarios such as occlusions or even missing detections; priority is often given to quality measures over speed, and a good trade-off between the two is hard to achieve. Based on recent work, we propose a fast, light-weight tracker able to predict target positions and re-identify targets at once, whereas this is usually done in two sequential steps. To do so, we combine a bounding box regressor with a target-oriented appearance learner in a newly designed, unified architecture. This way, our tracker can infer the targets' image pose and also provide a confidence level about target identity. It is also common to filter out the detector outputs with a preprocessing step, throwing away precious information about what has been seen in the image. We propose a track management strategy able to balance efficiently between detection and tracking outputs and their associated likelihoods. Simply put, we spotlight a fully Siamese-based single object tracker able to predict both position and appearance features at once with a light-weight, all-in-one architecture, within a balanced overall multi-target management strategy. We demonstrate the efficiency and speed of our system w.r.t. the literature on the well-known MOT17 challenge benchmark, and present qualitative evaluations as well as state-of-the-art quantitative results.

EAGLE: Large-Scale Vehicle Detection Dataset in Real-World Scenarios Using Aerial Imagery

Seyed Majid Azimi, Reza Bahmanyar, Corentin Henry, Franz Kurz

Auto-TLDR; EAGLE: A Large-Scale Dataset for Multi-class Vehicle Detection with Object Orientation Information in Airborne Imagery

Multi-class vehicle detection from airborne imagery with orientation estimation is an important task in the near and remote vision domains, with applications in traffic monitoring and disaster management. In the last decade, we have witnessed significant progress in object detection in ground imagery, but it is still in its infancy in airborne imagery, mostly due to the scarcity of diverse and large-scale datasets. Despite being useful for different applications, current airborne datasets only partially reflect the challenges of real-world scenarios. To address this issue, we introduce EAGLE (oriEnted object detection using Aerial imaGery in real-worLd scEnarios), a large-scale dataset for multi-class vehicle detection with object orientation information in aerial imagery. It features high-resolution aerial images covering different real-world situations with a wide variety of camera sensors, resolutions, flight altitudes, weather, illumination, haze, shadow, time, city, country, occlusion, and camera angle. The annotation was done by airborne imagery experts for small- and large-vehicle classes. EAGLE contains 215,986 instances annotated with oriented bounding boxes defined by four points and an orientation, making it by far the largest dataset to date for this task. It also supports research on haze and shadow removal as well as super-resolution and in-painting applications. We define three tasks: detection by (1) horizontal bounding boxes, (2) rotated bounding boxes, and (3) oriented bounding boxes. We carried out several experiments evaluating state-of-the-art object detection methods on our dataset to form a baseline. The experiments show that the EAGLE dataset accurately reflects real-world situations and correspondingly challenging applications. The dataset will be made publicly available.
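
Since EAGLE's boxes are oriented, a small utility worth understanding is the conversion between a (cx, cy, w, h, theta) parameterisation and the four corner points used in such annotations. The sketch below is a generic version of that conversion, not taken from the dataset toolkit; the function name and values are illustrative.

```python
import numpy as np

def obb_corners(cx, cy, w, h, theta):
    # rotate the four axis-aligned half-extents by theta, then translate
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    half = np.array([[-w, -h], [w, -h], [w, h], [-w, h]]) / 2.0
    return half @ R.T + np.array([cx, cy])

print(obb_corners(100.0, 50.0, 40.0, 20.0, np.deg2rad(30)).round(1))
```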

DAL: A Deep Depth-Aware Long-Term Tracker

Yanlin Qian, Song Yan, Alan Lukežič, Matej Kristan, Joni-Kristian Kamarainen, Jiri Matas

Auto-TLDR; Deep Depth-Aware Long-Term RGBD Tracking with Deep Discriminative Correlation Filter

The best RGBD trackers provide high accuracy but are slow to run. On the other hand, the best RGB trackers are fast but clearly inferior on RGBD datasets. In this work, we propose a deep depth-aware long-term tracker that achieves state-of-the-art RGBD tracking performance and is fast to run. We reformulate the deep discriminative correlation filter (DCF) to embed depth information into deep features. Moreover, the same depth-aware correlation filter is used for target re-detection. Comprehensive evaluations show that the proposed tracker achieves state-of-the-art performance on the Princeton RGBD, STC, and the newly released CDTB benchmarks and runs at 20 fps.

Siamese Fully Convolutional Tracker with Motion Correction

Mathew Francis, Prithwijit Guha

Auto-TLDR; A Siamese Ensemble for Visual Tracking with Appearance and Motion Components

Visual tracking algorithms use cues like appearance, structure, and motion for locating an object in a video. We propose an ensemble tracker with appearance and motion components. The ensemble consists of a Siamese tracker that learns object appearance from a static image, and motion vectors computed between consecutive frames with a flow network. Motion-predicted object localization is used to correct the appearance component in the ensemble. The complementary nature of the components brings performance improvements, as observed in experiments performed on the VOT2018 and VOT2019 datasets.

VTT: Long-Term Visual Tracking with Transformers

Tianling Bian, Yang Hua, Tao Song, Zhengui Xue, Ruhui Ma, Neil Robertson, Haibing Guan

Auto-TLDR; Visual Tracking Transformer with transformers for long-term visual tracking

Long-term visual tracking is a challenging problem. State-of-the-art long-term trackers, e.g., GlobalTrack, utilize region proposal networks (RPNs) to generate target proposals. However, the performance of these trackers is affected by occlusions and large scale or ratio variations. To address these issues, in this paper we are the first to propose a novel architecture with transformers for long-term visual tracking. Specifically, the proposed Visual Tracking Transformer (VTT) utilizes a transformer encoder-decoder architecture to aggregate global information and deal with occlusion and large scale or ratio variation. Furthermore, it also shows better discriminative power against instance-level distractors without the need for extra labeling and hard-sample mining. We conduct extensive experiments on the three largest long-term tracking datasets and achieve state-of-the-art performance.

SynDHN: Multi-Object Fish Tracker Trained on Synthetic Underwater Videos

Mygel Andrei Martija, Prospero Naval

Auto-TLDR; Underwater Multi-Object Tracking in the Wild with Deep Hungarian Network

In this paper, we seek to extend multi-object tracking research to a relatively less explored domain: underwater multi-object tracking in the wild. Multi-object fish tracking is an important task because it can provide fish monitoring systems with richer information (e.g., multiple views of the same fish) than detections alone, and it can be an invaluable input to fish behavior analysis. However, there is a lack of an annotated benchmark dataset with enough samples for this task. To circumvent the need for manual ground-truth tracking annotation, we craft a synthetic dataset. Using this synthetic dataset, we train an integrated detector and tracker called SynDHN. SynDHN uses the Deep Hungarian Network (DHN), a differentiable approximation of the Hungarian assignment algorithm. We repurpose DHN as the tracking component of our algorithm by performing affinity estimation between detector predictions. We consider both spatial and appearance features for affinity estimation. Our results show that despite being trained on a synthetic dataset, SynDHN generalizes well to real underwater video tracking and outperforms our baseline algorithms.
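
DHN is a differentiable approximation of the Hungarian algorithm; for intuition, here is the classical, non-differentiable assignment step it stands in for, applied to a made-up track-detection affinity matrix.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

affinity = np.array([[0.9, 0.1, 0.3],
                     [0.2, 0.8, 0.1],
                     [0.1, 0.2, 0.7]])        # rows: tracks, cols: detections
row, col = linear_sum_assignment(-affinity)   # negate cost to maximise affinity
for t, d in zip(row, col):
    print(f"track {t} -> detection {d} (affinity {affinity[t, d]:.1f})")
```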

SiamMT: Real-Time Arbitrary Multi-Object Tracking

Lorenzo Vaquero, Manuel Mucientes, Victor Brea

Auto-TLDR; SiamMT: A Deep-Learning-based Arbitrary Multi-Object Tracking System for Video

Visual object tracking is of great interest in many applications, as it preserves the identity of an object throughout a video. However, while real applications demand systems capable of tracking multiple objects in real time, multi-object tracking solutions usually follow the tracking-by-detection paradigm: they depend on running a costly detector in each frame, and they do not allow the tracking of arbitrary objects, i.e., they require training for specific classes. In response to this need, this work presents the architecture of SiamMT, a system capable of efficiently applying individual visual tracking techniques to multiple objects in real time. This makes it the first deep-learning-based arbitrary multi-object tracker. To achieve this, we propose extracting global frame features with a fully-convolutional neural network, followed by cropping and resizing the different object search areas. The final similarity operation between these search areas and the target exemplars is carried out with an optimized pairwise cross-correlation. These novelties allow the system to track multiple targets in a scalable manner, achieving 25 fps with 60 simultaneous objects for VGA videos and 40 objects for HD720 videos, all with a tracking quality similar to SiamFC.
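
The pairwise cross-correlation at the heart of Siamese trackers like SiamMT can be expressed as a grouped convolution, correlating each exemplar only with its own search area. The sketch below is not the SiamMT implementation; the feature sizes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

n, c = 8, 256                             # 8 tracked objects, 256 feature channels
search = torch.randn(1, n * c, 33, 33)    # stacked per-object search-area features
exemplar = torch.randn(n * c, 1, 9, 9)    # one kernel per (object, channel) pair
resp = F.conv2d(search, exemplar, groups=n * c)   # depthwise cross-correlation
resp = resp.view(n, c, 25, 25).sum(dim=1)         # per-object response maps
print(resp.shape)  # torch.Size([8, 25, 25])
```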

Model Decay in Long-Term Tracking

Efstratios Gavves, Ran Tao, Deepak Gupta, Arnold Smeulders

Auto-TLDR; Model Bias in Long-Term Tracking

To account for appearance variations, tracking models need to be updated during the course of inference. However, updating the tracker model with adverse bounding box predictions adds an unavoidable bias term to the learning. This bias term, which we refer to as model decay, offsets the learning and causes tracking drift. While its adverse effect might not be visible in short-term tracking, accumulation of this bias over the long term can eventually lead to a permanent loss of the target. In this paper, we look at the problem of model bias from a mathematical perspective. Further, we briefly examine the effect of various sources of tracking error on model decay, using a correlation filter (ECO) and a Siamese (SINT) tracker. Based on observations and insights, we propose simple additions that help reduce model decay in long-term tracking. The proposed tracker is evaluated on four long-term and one short-term tracking benchmarks, demonstrating superior accuracy and robustness, even on 30-minute-long videos.

Tracking Fast Moving Objects by Segmentation Network

Ales Zita, Filip Sroubek

Auto-TLDR; Fast Moving Objects Tracking by Segmentation Using Deep Learning

Tracking Fast Moving Objects (FMO), which appear as blurred streaks in video sequences, is a difficult task for standard trackers, as the object position does not overlap in consecutive video frames and texture information of the objects is blurred. Up-to-date approaches tuned for this task are based on background subtraction with a static background and slow deblurring algorithms. In this article, we present a tracking-by-segmentation approach implemented using modern deep learning methods that perform near real-time tracking on real-world video sequences. We have developed a physically plausible FMO sequence generator to be a robust foundation for our training pipeline and demonstrate straightforward network adaptation for different FMO scenarios with varying foreground.

AG-GAN: An Attentive Group-Aware GAN for Pedestrian Trajectory Prediction

Yue Song, Niccolò Bisagno, Syed Zohaib Hassan, Nicola Conci

Auto-TLDR; An attentive group-aware GAN for motion prediction in crowded scenarios

Understanding human behaviors in crowded scenarios requires analyzing not only the position of the subjects in space, but also the scene context. Existing approaches mostly rely on the motion history of each pedestrian and model the interactions among people by considering the entire surrounding neighborhood. In our approach, we address the problem of motion prediction by applying coherent group clustering and a global attention mechanism on the LSTM-based Generative Adversarial Networks (GANs). The proposed model consists of an attentive group-aware GAN that observes the agents' past motion and predicts future paths, using (i) a group pooling module to model neighborhood interaction, and (ii) an attention module to specifically focus on hidden states. The experimental results demonstrate that our proposal outperforms state-of-the-art models on common benchmark datasets, and is able to generate socially-acceptable trajectories.

Visual Object Tracking in Drone Images with Deep Reinforcement Learning

Derya Gözen, Sedat Ozer

Auto-TLDR; A Deep Reinforcement Learning based Single Object Tracker for Drone Applications

There is an increasing demand for camera-equipped drones and their applications in many domains, varying from agriculture to entertainment and from sports events to surveillance. In such drone applications, an essential and common task is visually tracking an object of interest. Drone (or UAV) images have different properties from ground-taken (natural) images, and those differences introduce additional complexities when existing object trackers are applied directly to drone applications. Important differences include (i) smaller object sizes to be tracked and (ii) different orientations and viewing angles yielding different textures and features. Therefore, new algorithms trained on drone images are needed for drone-based applications. In this paper, we introduce a deep reinforcement learning (RL) based single object tracker that tracks an object of interest in drone images by estimating a series of actions to find the location of the object in the next frame. This is the first work introducing a single object tracker using a deep RL-based technique for drone images. Our proposed solution introduces a novel reward function that aims to reduce the total number of actions taken to estimate the object's location in the next frame, and also introduces a different backbone network to be used on low-resolution images. Additionally, we introduce a set of new actions into the action library to better deal with the above-mentioned complexities. We compare our proposed solutions to a state-of-the-art tracking algorithm from the recent literature and demonstrate up to 3.87% improvement in precision and 3.6% improvement in IoU values on the VisDrone2019 dataset. We also provide additional results on the OTB-100 dataset and show up to 3.15% improvement in precision when compared to the same previous state-of-the-art algorithm. Lastly, we analyze the ability of our proposed solutions to handle some of the challenges faced during tracking, including but not limited to occlusion, deformation, and scale variation.

PHNet: Parasite-Host Network for Video Crowd Counting

Shiqiao Meng, Jiajie Li, Weiwei Guo, Jinfeng Jiang, Lai Ye

Auto-TLDR; PHNet: A Parasite-Host Network for Video Crowd Counting

Crowd counting plays an increasingly important role in public security. Recently, many crowd counting methods for single images have been proposed, but few studies have focused on using temporal information from video image sequences to improve prediction performance. In existing methods that use videos for crowd estimation, temporal and spatial features are modeled jointly for the prediction, which makes the models less efficient in extracting spatiotemporal features and makes it difficult to improve prediction performance. To solve these problems, this paper proposes a Parasite-Host Network (PHNet), composed of a Parasite branch and a Host branch that extract temporal and spatial features, respectively. To specifically extract transform features in the time domain, we propose a novel architecture termed the "Relational Extractor" (RE), which models the multiplicative interaction features of adjacent frames. In addition, the Host branch extracts spatial features from the current frame and can be replaced with any model that uses a single image for the prediction. We conducted experiments with PHNet on four video crowd counting benchmarks: Venice, UCSD, FDST, and CrowdFlow. Experimental results show that PHNet achieves performance superior to state-of-the-art methods on these four datasets.

Transformer Networks for Trajectory Forecasting

Francesco Giuliari, Irtiza Hasan, Marco Cristani, Fabio Galasso

Auto-TLDR; Transformer Networks for Trajectory Prediction of People Interactions

Most recent successes on forecasting people motion are based on LSTM models, and all most recent progress has been achieved by modelling the social interaction among people and the people interaction with the scene. We question the use of the LSTM models and propose the novel use of Transformer Networks for trajectory forecasting. This is a fundamental switch from the sequential step-by-step processing of LSTMs to the only-attention-based memory mechanisms of Transformers. In particular, we consider both the original Transformer Network (TF) and the larger Bidirectional Transformer (BERT), state-of-the-art on all natural language processing tasks. Our proposed Transformers predict the trajectories of the individual people in the scene. These are "simple" models because each person is modelled separately without any complex human-human nor scene interaction terms. In particular, the TF model without bells and whistles yields the best score on the largest and most challenging trajectory forecasting benchmark of TrajNet [1]. Additionally, its extension which predicts multiple plausible future trajectories performs on par with more engineered techniques on the 5 datasets of ETH [2]+UCY [3]. Finally, we show that Transformers may deal with missing observations, as it may be the case with real sensor data. Code is available at github.com/FGiuliari/Trajectory-Transformer
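
A minimal sketch of the paper's core idea follows, under illustrative assumptions (the commonly used 8-in/12-out horizon and all layer sizes are assumptions here; the released code at the URL above is the authoritative version): encode a single person's observed (x, y) positions with a Transformer encoder and regress future positions, with no social or scene terms.

```python
import torch
import torch.nn as nn

d_model, obs_len, pred_len = 64, 8, 12
embed = nn.Linear(2, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
head = nn.Linear(d_model, pred_len * 2)

past = torch.randn(16, obs_len, 2)            # a batch of observed trajectories
future = head(encoder(embed(past))[:, -1])    # last token summary -> future positions
print(future.view(16, pred_len, 2).shape)     # torch.Size([16, 12, 2])
```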

IPT: A Dataset for Identity Preserved Tracking in Closed Domains

Thomas Heitzinger, Martin Kampel

Auto-TLDR; Identity Preserved Tracking Using Depth Data for Privacy

We present a public dataset for Identity Preserved Tracking (IPT) consisting of sequences of depth data recorded using an Orbbec Astra depth sensor. The dataset features sequences in ten different locations with a high amount of background variation and is designed to be applicable to a wide range of tasks. Its labeling is versatile, allowing for tracking in either 3d space or image coordinates. Next to frame-by-frame 3d and inferred bounding box labeling, we provide supplementary annotation of camera poses and room layouts, split into multiple semantically distinct categories. Intended use-cases are applications where both a high-level understanding of the scene and privacy are central points of consideration, such as active and assisted living (AAL), security, and industrial safety. Compared to similar public datasets, IPT distinguishes itself with its sequential data format, 3d instance labeling, and room layout annotation. We present baseline object detection results in image coordinates using a YOLOv3 network architecture and implement a background model suitable for online tracking applications to increase detection accuracy. Additionally, we propose a novel volumetric non-maximum suppression (V-NMS) approach that takes advantage of known room geometry. Lastly, we provide baseline person tracking results utilizing the Multiple Object Tracking Challenge (MOTChallenge) evaluation metrics of the CVPR19 benchmark.
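
The paper's V-NMS exploits known room geometry, which is not reproduced here; as a rough stand-in for the general idea, the sketch below shows generic greedy suppression over 3D detection centroids, keeping the highest-scoring detection and discarding neighbours closer than a made-up threshold.

```python
import numpy as np

def greedy_nms_3d(centroids, scores, min_dist=0.4):
    # keep detections in descending score order, suppressing close neighbours
    order = np.argsort(-scores)
    keep = []
    for i in order:
        if all(np.linalg.norm(centroids[i] - centroids[j]) >= min_dist for j in keep):
            keep.append(i)
    return keep

pts = np.array([[0.0, 0.0, 1.0], [0.1, 0.0, 1.0], [2.0, 1.0, 1.0]])
print(greedy_nms_3d(pts, np.array([0.9, 0.8, 0.7])))  # [0, 2]
```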

Image Sequence Based Cyclist Action Recognition Using Multi-Stream 3D Convolution

Stefan Zernetsch, Steven Schreck, Viktor Kress, Konrad Doll, Bernhard Sick

Auto-TLDR; 3D-ConvNet: A Multi-stream 3D Convolutional Neural Network for Detecting Cyclists in Real World Traffic Situations

In this article, we present an approach to detect basic movements of cyclists in real world traffic situations based on image sequences, optical flow (OF) sequences, and past positions, using a multi-stream 3D convolutional neural network (3D-ConvNet) architecture. To resolve occlusions of cyclists by other traffic participants or road structures, we use a wide angle stereo camera system mounted at a heavily frequented public intersection. We created a large dataset consisting of 1,639 video sequences containing cyclists, recorded in real world traffic, resulting in over 1.1 million samples. Through modeling the cyclists' behavior by a state machine of basic cyclist movements, our approach takes every situation into account and is not limited to certain scenarios. We compare our method to an approach solely based on position sequences. Both methods are evaluated taking into account frame-wise and scene-wise classification results of basic movements, and detection times of basic movement transitions, where our approach outperforms the position-based approach by producing more reliable detections with shorter detection times. Our code and parts of our dataset are made publicly available.

Reducing False Positives in Object Tracking with Siamese Network

Takuya Ogawa, Takashi Shibata, Shoji Yachida, Toshinori Hosoi

Auto-TLDR; Robust Long-Term Object Tracking with Adaptive Search based on Motion Models

We have developed a robust long-term object tracking method that resolves the fundamental causes of drift and loss of a target in visual object tracking. The proposed method consists of "sampling area extension", which prevents the tracking result from drifting to other objects by learning false positive samples in advance (before they enter the search region of the target), and "adaptive search based on motion models", which prevents the tracking result from drifting to other objects and avoids loss of the target by using not only appearance features but also motion models to adaptively search for the target. Experiments conducted on a long-term tracking dataset showed that our first technique improved robustness by 16.6%, while the second improved it by 15.3%. By combining both, our method achieved 21.7% and 9.1% improvements in robustness and precision, respectively, and the processing speed became 3.3 times faster. Additional experiments showed that our method achieves the top robustness among state-of-the-art methods on three long-term tracking datasets. These findings demonstrate that our method is effective for long-term object tracking and that its performance and speed are promising for use in practical applications of the various technologies underlying object tracking.

Utilising Visual Attention Cues for Vehicle Detection and Tracking

Feiyan Hu, Venkatesh Gurram Munirathnam, Noel E O'Connor, Alan Smeaton, Suzanne Little

Auto-TLDR; Visual Attention for Object Detection and Tracking in Driver-Assistance Systems

Advanced Driver-Assistance Systems (ADAS) have been attracting attention from many researchers. Vision-based sensors are the closest way to emulate human driver visual behavior while driving. In this paper, we explore possible ways to use visual attention (saliency) for object detection and tracking. We investigate: 1) how a visual attention map, such as a subjectness attention or saliency map and an objectness attention map, can facilitate region proposal generation in a 2-stage object detector; 2) how a visual attention map can be used for tracking multiple objects. We propose a neural network that can simultaneously detect objects and generate objectness and subjectness maps to save computational power. We further exploit the visual attention map during tracking using a sequential Monte Carlo probability hypothesis density (PHD) filter. The experiments are conducted on the KITTI and DETRAC datasets. The use of visual attention and hierarchical features has shown a considerable improvement of ≈8% in object detection, which effectively increased tracking performance by ≈4% on the KITTI dataset.

Future Urban Scenes Generation through Vehicles Synthesis

Alessandro Simoni, Luca Bergamini, Andrea Palazzi, Simone Calderara, Rita Cucchiara

Auto-TLDR; Predicting the Future of an Urban Scene with a Novel View Synthesis Paradigm

In this work we propose a deep learning pipeline to predict the future visual appearance of an urban scene. Despite recent advances, generating the entire scene in an end-to-end fashion is still far from being achieved. Instead, here we follow a two-stage approach, where interpretable information is included in the loop and each actor is modelled independently. We leverage a per-object novel view synthesis paradigm, i.e., generating a synthetic representation of an object undergoing a geometrical roto-translation in 3D space. Our model can easily be conditioned with constraints (e.g., input trajectories) provided by state-of-the-art tracking methods or by the user. This allows us to generate a set of diverse, realistic futures starting from the same input in a multi-modal fashion. We visually and quantitatively show the superiority of this approach over traditional end-to-end scene-generation methods on CityFlow, a challenging real-world dataset.

Robust Visual Object Tracking with Two-Stream Residual Convolutional Networks

Ning Zhang, Jingen Liu, Ke Wang, Dan Zeng, Tao Mei

Auto-TLDR; Two-Stream Residual Convolutional Network for Visual Tracking

The current deep learning based visual tracking approaches have been very successful by learning the target classification and/or estimation model from a large amount of supervised training data in offline mode. However, most of them can still fail in tracking objects due to some more challenging issues such as dense distractor objects, confusing background, motion blurs, and so on. Inspired by the human "visual tracking" capability which leverages motion cues to distinguish the target from the background, we propose a Two-Stream Residual Convolutional Network (TS-RCN) for visual tracking, which successfully exploits both appearance and motion features for model update. Our TS-RCN can be integrated with existing deep learning based visual trackers. To further improve the tracking performance, we adopt a "wider" residual network, ResNeXt, as its feature extraction backbone. To the best of our knowledge, TS-RCN is the first end-to-end trainable two-stream visual tracking system, which makes full use of both appearance and motion features of the target. We have extensively evaluated the TS-RCN on the most widely used benchmark datasets, including VOT2018, VOT2019, and GOT-10K. The experiment results have successfully demonstrated that our two-stream model can greatly outperform the appearance based tracker, and it also achieves state-of-the-art performance. The tracking system can run at up to 38.1 FPS.

Point In: Counting Trees with Weakly Supervised Segmentation Network

Pinmo Tong, Shuhui Bu, Pengcheng Han

Auto-TLDR; Weakly Supervised Tree Counting Using Deep Segmentation Network with Localization and Mask Prediction

Traditional image processing methods for tree counting require expensive feature engineering and are not end-to-end frameworks; they introduce additional noise and cannot be optimized as a whole, so they have not been widely used in recent tree counting applications. Recently, many deep learning based approaches have been designed for this task because of their powerful feature extracting ability. The representative approach is the bounding-box based supervised method, but it requires time-consuming annotations, and such methods have difficulty handling occlusion or overlap. To solve this problem, we propose a weakly supervised tree counting network (WTCNet) based on a deep segmentation network with only point supervision. It can simultaneously complete tree counting with localization and output a mask for each tree. We first adopt a novel feature extractor network (FENet) to get features of input images, and then an effective strategy is introduced to deal with different mask predictions. In the end, we propose a basic localization guidance accompanied by a rectification guidance to train the network. We create two different datasets and select an existing challenging plant dataset to evaluate our method on three different tasks. Experimental results show the good performance improvement of our method compared with other existing methods. Further study shows that our method has great potential to reduce human labor and provide effective ground-truth masks, and the results show the superiority of our method over advanced methods.

Semantic Segmentation for Pedestrian Detection from Motion in Temporal Domain

Guo Cheng, Jiang Yu Zheng

Auto-TLDR; Motion Profile: Recognizing Pedestrians along with their Motion Directions in a Temporal Way

In autonomous driving, state-of-the-art methods detect pedestrians through appearance in 2-D spatial images. However, these approaches are typically time-consuming because of the complexity of the algorithms needed to cope with large variations in shape, pose, action, and illumination. They also fall short of capturing temporal continuity in motion traces. In a completely different approach, this work recognizes pedestrians, along with their motion directions, in a temporal way. By projecting a driving video to a 2-D temporal image called a Motion Profile (MP), we can robustly distinguish pedestrians in motion or standing still against smooth background motion. To further ensure non-redundant data processing of the deep network on a compact motion profile, a novel temporal-shift memory (TSM) model is developed to perform deep learning of sequential input in linear processing time. In experiments containing various pedestrian motions from sensors such as video and LiDAR, we demonstrate that, with a data size around 3/720th of the video volume, this motion-based method can reach a pedestrian detection rate of 90% in near and mid-range on the road. With a super-fast processing speed and good accuracy, this method is promising for intelligent vehicles.

MFST: Multi-Features Siamese Tracker

Zhenxi Li, Guillaume-Alexandre Bilodeau, Wassim Bouachir

Auto-TLDR; Multi-Features Siamese Tracker for Robust Deep Similarity Tracking

Siamese trackers have recently achieved interesting results due to their balance between accuracy and speed. This success is mainly due to the fact that deep similarity networks were specifically designed to address the image similarity problem. Therefore, they are inherently more appropriate than classical CNNs for the tracking task. However, Siamese trackers rely on the last convolutional layers for similarity analysis and target search, which restricts their performance. In this paper, we argue that using a single convolutional layer as feature representation is not the optimal choice within the deep similarity framework, as multiple convolutional layers provide several abstraction levels in characterizing an object. Starting from this motivation, we present the Multi-Features Siamese Tracker (MFST), a novel tracking algorithm exploiting several hierarchical feature maps for robust deep similarity tracking. MFST fuses hierarchical features to ensure a richer and more efficient representation. Moreover, we handle appearance variation by calibrating deep features extracted from two different CNN models. Based on this advanced feature representation, our algorithm achieves high tracking accuracy, outperforming several state-of-the-art trackers, including standard Siamese trackers.

Siamese Dynamic Mask Estimation Network for Fast Video Object Segmentation

Dexiang Hong, Guorong Li, Kai Xu, Li Su, Qingming Huang

Auto-TLDR; Siamese Dynamic Mask Estimation for Video Object Segmentation

Video object segmentation (VOS) has been a fundamental topic in recent years, and many deep learning-based methods have achieved state-of-the-art performance on multiple benchmarks. However, most of these methods rely on pixel-level matching between the template and the searched frames over the whole image, while the targets only occupy a small region. Calculating over the entire image brings a lot of additional computational cost. Besides, the whole image may contain distracting information, resulting in many false-positive matching points. To address this issue, motivated by one-stage instance segmentation methods, we propose an efficient Siamese dynamic mask estimation network for fast video object segmentation. VOS is decoupled into two tasks, i.e., mask feature learning and dynamic kernel prediction. The former is responsible for learning high-quality features to preserve structural geometric information, and the latter learns a dynamic kernel which is convolved with the mask features to generate a mask output. We use a Siamese neural network as a feature extractor and directly predict masks after correlation. In this way, we can avoid pixel-level matching, making our framework simpler and more efficient. Experimental results on the DAVIS 2016/2017 datasets show that our proposed method can run at 35 frames per second on an NVIDIA RTX TITAN while preserving competitive accuracy.
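
The dynamic-kernel step can be pictured as predicting convolution weights per target and convolving them with the mask features; the shapes below are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn.functional as F

mask_feat = torch.randn(1, 8, 64, 64)   # learned mask features for one frame
kernel = torch.randn(1, 8, 1, 1)        # per-target kernel predicted by the network
mask = F.conv2d(mask_feat, kernel)      # (1, 1, 64, 64) mask logits for this target
print(mask.shape)
```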

An Adaptive Fusion Model Based on Kalman Filtering and LSTM for Fast Tracking of Road Signs

Chengliang Wang, Xin Xie, Chao Liao

Auto-TLDR; Fusion of ThunderNet and Region Growing Detector for Road Sign Detection and Tracking

The detection and tracking of road signs plays a critical role in various autopilot applications. Utilizing convolutional neural networks (CNNs) mostly incurs a big run-time overhead in feature extraction and object localization. Although the Kalman filter (KF) is a commonly-used tracker, it is likely to be impacted by objects omitted in the detection step. In this paper, we designed a highly efficient detector that combines ThunderNet and a Region Growing Detector (RGD) to detect road signs, and built a fusion model of a long short-term memory network (LSTM) and the KF based on the state estimation and the color histogram. The experimental results demonstrate that the proposed method improved the state estimation accuracy by 6.4% and raised the frames per second (FPS) to 41.
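
The abstract fuses an LSTM with Kalman filtering; as background, here is only the classical constant-velocity Kalman predict/update step such a fusion builds on (state [x, y, vx, vy], position-only measurements). The noise covariances are illustrative assumptions.

```python
import numpy as np

dt = 1.0
F = np.eye(4); F[0, 2] = F[1, 3] = dt     # constant-velocity transition
H = np.eye(2, 4)                          # we measure position only
Q, R = np.eye(4) * 1e-2, np.eye(2) * 1e-1

def kf_step(x, P, z):
    x, P = F @ x, F @ P @ F.T + Q         # predict
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)        # Kalman gain
    x = x + K @ (z - H @ x)               # update with measurement z
    P = (np.eye(4) - K @ H) @ P
    return x, P

x, P = np.zeros(4), np.eye(4)
x, P = kf_step(x, P, np.array([1.0, 0.5]))
print(x[:2])   # filtered position estimate
```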

TSDM: Tracking by SiamRPN++ with a Depth-Refiner and a Mask-Generator

Pengyao Zhao, Quanli Liu, Wei Wang, Qiang Guo

Auto-TLDR; TSDM: A Depth-D Tracker for 3D Object Tracking

In generic object tracking, depth (D) information provides informative cues for foreground-background separation and target bounding box regression. However, so far, few trackers have used depth information in this important role, due to the lack of a suitable model. In this paper, an RGB-D tracker named TSDM is proposed, which is composed of a Mask-generator (M-g), SiamRPN++ and a Depth-refiner (D-r). The M-g generates background masks and updates them as the target's 3D position changes. The D-r optimizes the target bounding box estimated by SiamRPN++, based on the difference in spatial depth distribution between the target and the surrounding background. Extensive evaluation on the Princeton Tracking Benchmark and the Visual Object Tracking challenge shows that our tracker outperforms the state-of-the-art by a large margin while achieving 23 FPS. In addition, a light-weight variant can run at 31 FPS, making it practical for real-world applications. Code and models of TSDM are available at https://github.com/lql-team/TSDM.

Efficient Correlation Filter Tracking with Adaptive Training Sample Update Scheme

Shan Jiang, Shuxiao Li, Chengfei Zhu, Nan Yan

Auto-TLDR; Adaptive Training Sample Update Scheme of Correlation Filter Based Trackers for Visual Tracking

Visual tracking serves as a significant module in many applications. However, the heavy computation and low speed of many recent trackers restrict their applications in scenarios with limited computing power. On the other hand, the simple update scheme of most correlation filter based trackers limits their robustness during target deformation and occlusion. In this paper, we explore the update scheme of correlation filter based trackers and propose an efficient and adaptive training sample update scheme. The training sample extracted in each frame is added to the training set according to its distance to existing samples, measured with a difference hashing algorithm (DHA), or discarded according to the reliability of the tracking result. Experiments on OTB-2015, Temple Color 128 and UAV123 demonstrate that our tracker performs favourably against state-of-the-art trackers with light computation and runs at over 100 fps on a desktop computer with an Intel i7-8700 CPU (3.2 GHz).
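
Difference hashing, used above to measure distance between training samples, is a standard perceptual hash; a common formulation is sketched below (the 9x8 resize and Hamming distance are the usual choices, though the paper's exact parameters may differ). Uses Pillow and NumPy.

```python
import numpy as np
from PIL import Image

def dhash(img: Image.Image) -> np.ndarray:
    # grayscale, resize to 9x8, compare horizontally adjacent pixels -> 64 bits
    g = np.asarray(img.convert("L").resize((9, 8)), dtype=np.int16)
    return (g[:, 1:] > g[:, :-1]).flatten()

def hamming(h1: np.ndarray, h2: np.ndarray) -> int:
    return int(np.count_nonzero(h1 != h2))

a = Image.fromarray(np.random.randint(0, 255, (64, 64, 3), dtype=np.uint8))
b = Image.fromarray(np.random.randint(0, 255, (64, 64, 3), dtype=np.uint8))
print(hamming(dhash(a), dhash(b)))  # small distance -> near-duplicate sample
```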

Machine-Learned Regularization and Polygonization of Building Segmentation Masks

Stefano Zorzi, Ksenia Bittner, Friedrich Fraundorfer

Auto-TLDR; Automatic Regularization and Polygonization of Building Segmentation masks using Generative Adversarial Network

We propose a machine learning based approach for automatic regularization and polygonization of building segmentation masks. Taking an image as input, we first predict building segmentation maps exploiting a generic fully convolutional network (FCN). A generative adversarial network (GAN) is then involved to regularize building boundaries to make them more realistic, i.e., having more rectilinear outlines which form right angles where required. This is achieved through the interplay between the discriminator, which gives the probability of the input image being true, and the generator, which learns from the discriminator's response to create more realistic images. Finally, we train the backbone convolutional neural network (CNN), which is adapted to predict sparse outcomes corresponding to building corners out of the regularized building segmentation results. Experiments on three building segmentation datasets demonstrate that the proposed method is not only capable of obtaining accurate results, but also of producing visually pleasing building outlines parameterized as polygons.

Encoder-Decoder Based Convolutional Neural Networks with Multi-Scale-Aware Modules for Crowd Counting

Pongpisit Thanasutives, Ken-Ichi Fukui, Masayuki Numao, Boonserm Kijsirikul

Auto-TLDR; M-SFANet and M-SegNet for Crowd Counting Using Multi-Scale Fusion Networks

In this paper, we propose two modified neural networks based on dual path multi-scale fusion networks (SFANet) and SegNet for accurate and efficient crowd counting. Inspired by SFANet, the first model, named M-SFANet, is attached with atrous spatial pyramid pooling (ASPP) and a context-aware module (CAN). The encoder of M-SFANet is enhanced with ASPP containing parallel atrous convolutional layers with different sampling rates, and is hence able to extract multi-scale features of the target object and incorporate larger context. To further deal with scale variation throughout an input image, we leverage the CAN module, which adaptively encodes the scales of the contextual information. The combination yields an effective model for counting in both dense and sparse crowd scenes. Based on the SFANet decoder structure, M-SFANet's decoder has dual paths, for density map and attention map generation. The second model, called M-SegNet, is produced by replacing the bilinear upsampling in SFANet with the max unpooling used in SegNet. This change provides a faster model while maintaining competitive counting performance. Designed for high-speed surveillance applications, M-SegNet has no additional multi-scale-aware module, so as not to increase the complexity. Both models are encoder-decoder based architectures and are end-to-end trainable. We conduct extensive experiments on five crowd counting datasets and one vehicle counting dataset to show that these modifications yield algorithms that improve on state-of-the-art crowd counting methods.

RSINet: Rotation-Scale Invariant Network for Online Visual Tracking

Yang Fang, Geunsik Jo, Chang-Hee Lee

Auto-TLDR; RSINet: Rotation-Scale Invariant Network for Adaptive Tracking

Most Siamese network-based trackers perform the tracking process without model update and cannot learn target-specific variations adaptively. Moreover, Siamese-based trackers infer the new state of tracked objects by generating axis-aligned bounding boxes, which contain extra background noise and are unable to accurately estimate the rotation and scale transformation of moving objects, potentially reducing tracking performance. In this paper, we propose a novel Rotation-Scale Invariant Network (RSINet) to address the above problems. Our RSINet tracker consists of a target-distractor discrimination branch and a rotation-scale estimation branch; the rotation and scale knowledge is explicitly learned by a multi-task learning method in an end-to-end manner. In addition, the tracking model is adaptively optimized and updated under spatio-temporal energy control, which ensures model stability and reliability as well as high tracking efficiency. Comprehensive experiments on the OTB-100, VOT2018, and LaSOT benchmarks demonstrate that our proposed RSINet tracker yields new state-of-the-art performance compared with recent trackers, while running at a real-time speed of about 45 FPS.

Anomaly Detection, Localization and Classification for Railway Inspection

Riccardo Gasparini, Andrea D'Eusanio, Guido Borghi, Stefano Pini, Giuseppe Scaglione, Simone Calderara, Eugenio Fedeli, Rita Cucchiara

Auto-TLDR; Anomaly Detection and Localization using thermal images in low-light environments

The ability to detect, localize and classify anomalous objects is a challenging task in the computer vision community. In this paper, we tackle these tasks by developing a framework to automatically inspect the railway during the night. Specifically, it is able to predict the presence, the image coordinates, and the class of obstacles. To deal with the low-light environment, the framework is based on thermal images and consists of three different modules that address the problems of detecting anomalies, predicting their image coordinates, and classifying them. Moreover, due to the absolute lack of publicly released datasets collected in the railway context for anomaly detection, we introduce a new multi-modal dataset, acquired from a rail drone, used to evaluate the proposed framework. Experimental results confirm the accuracy of the framework and its suitability, in terms of computational load, performance, and inference time, for implementation on a self-powered inspection system.

RescueNet: Joint Building Segmentation and Damage Assessment from Satellite Imagery

Rohit Gupta, Mubarak Shah

Auto-TLDR; RescueNet: End-to-End Building Segmentation and Damage Classification for Humanitarian Aid and Disaster Response

Accurate and fine-grained information about the extent of damage to buildings is essential for directing Humanitarian Aid and Disaster Response (HADR) operations in the immediate aftermath of any natural calamity. In recent years, satellite and UAV (drone) imagery has been used for this purpose, sometimes aided by computer vision algorithms. Existing computer vision approaches for building damage assessment typically rely on a two-stage approach, consisting of building detection using an object detection model, followed by damage assessment through classification of the detected building tiles. These multi-stage methods are not end-to-end trainable and suffer from poor overall results. We propose RescueNet, a unified model that can simultaneously segment buildings and assess the damage levels to individual buildings, and can be trained end-to-end. In order to model the composite nature of this problem, we propose a novel localization-aware loss function, which consists of a Binary Cross-Entropy loss for building segmentation and a foreground-only selective Categorical Cross-Entropy loss for damage classification, and show significant improvement over the widely used Cross-Entropy loss. RescueNet is tested on the large-scale and diverse xBD dataset, achieves significantly better building segmentation and damage classification performance than previous methods, and generalizes across varied geographical regions and disaster types.

What and How? Jointly Forecasting Human Action and Pose

Yanjun Zhu, Yanxia Zhang, Qiong Liu, Andreas Girgensohn

Auto-TLDR; Forecasting Human Actions and Motion Trajectories with Joint Action Classification and Pose Regression

Forecasting human actions and motion trajectories addresses the problem of predicting what a person is going to do next and how they will perform it. This is crucial in a wide range of applications such as assisted living and future co-robotic settings. We propose to simultaneously learn actions and action-related human motion dynamics, while existing works perform them independently. In this paper, we present a method to jointly forecast categories of human action and the pose of skeletal joints in the hope that the two tasks can help each other. As a result, our system can predict not only the future actions but also the motion trajectories that will result. To achieve this, we define a task of joint action classification and pose regression. We employ a sequence to sequence encoder-decoder model combined with multi-task learning to forecast future actions and poses progressively before the action happens. Experimental results on two public datasets, IkeaDB and OAD, demonstrate the effectiveness of the proposed method.
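
The joint objective can be pictured as two heads over shared features with a summed loss; the sketch below assumes illustrative sizes, a 17-joint 2D pose, and an arbitrary 0.5 weighting, none of which are taken from the paper.

```python
import torch
import torch.nn as nn

feat = torch.randn(8, 128)                        # shared decoder features per step
cls_head = nn.Linear(128, 10)                     # action-classification head
pose_head = nn.Linear(128, 17 * 2)                # pose-regression head (17 joints, 2D)
actions, poses = torch.randint(0, 10, (8,)), torch.randn(8, 34)

loss = nn.CrossEntropyLoss()(cls_head(feat), actions) \
     + 0.5 * nn.MSELoss()(pose_head(feat), poses)  # summed multi-task loss
print(loss.item())
```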

DAG-Net: Double Attentive Graph Neural Network for Trajectory Forecasting

Alessio Monti, Alessia Bertugli, Simone Calderara, Rita Cucchiara

Auto-TLDR; Recurrent Generative Model for Multi-modal Human Motion Behaviour in Urban Environments

Understanding human motion behaviour is a critical task for several possible applications, like self-driving cars or social robots, and in general for all those settings where an autonomous agent has to navigate inside a human-centric environment. This is non-trivial because human motion is inherently multi-modal: given a history of human motion paths, there are many plausible ways by which people could move in the future. Additionally, people's activities are often driven by goals, e.g., reaching particular locations or interacting with the environment. We address both of the aforementioned aspects by proposing a new recurrent generative model that considers both single agents' future goals and interactions between different agents. The model exploits a double attention-based graph neural network to collect information about the mutual influences among different agents and integrates it with data about agents' possible future objectives. Our proposal is general enough to be applied in different scenarios: the model achieves state-of-the-art results in both urban environments and sports applications.

RWF-2000: An Open Large Scale Video Database for Violence Detection

Ming Cheng, Kunjing Cai, Ming Li

Auto-TLDR; Flow Gated Network for Violence Detection in Surveillance Cameras

In recent years, surveillance cameras have been widely deployed in public places, and the general crime rate has been reduced significantly due to these ubiquitous devices. Usually, these cameras provide cues and evidence after crimes have been committed, while they are rarely used to prevent or stop criminal activities in time. It is both time- and labor-consuming to manually monitor the large amount of video data from surveillance cameras. Therefore, automatically recognizing violent behaviors from video signals becomes essential. In this paper, we summarize several existing video datasets for violence detection and propose a new video dataset with 2,000 videos, all captured by surveillance cameras in real-world scenes. Also, we present a new method that utilizes the merits of both 3D-CNNs and optical flow, namely the Flow Gated Network. The proposed approach obtains an accuracy of 87.25% on the test set of our proposed RWF-2000 database. The proposed database and the source code of this paper are currently open to access.

Video Semantic Segmentation Using Deep Multi-View Representation Learning

Akrem Sellami, Salvatore Tabbone

Auto-TLDR; Deep Multi-view Representation Learning for Video Object Segmentation

In this paper, we propose a deep learning model based on deep multi-view representation learning to address the video object segmentation task. The proposed model emphasizes the importance of the inherent correlation between video frames and incorporates multi-view representation learning based on deep canonically correlated autoencoders. The multi-view representation learning in our model provides an efficient mechanism for capturing inherent correlations by jointly extracting useful features and learning a better representation in a joint feature space, i.e., a shared representation. To increase the training data and the learning capacity, we train the proposed model with pairs of video frames, F_a and F_b. During the segmentation phase, the deep canonically correlated autoencoder model encodes useful features by processing multiple reference frames together, which are used to detect frequently reappearing objects. Our model enhances state-of-the-art deep learning-based methods that mainly focus on learning discriminative foreground representations over appearance and motion. Experimental results on two large benchmarks demonstrate the ability of the proposed method to outperform competitive approaches and to reach good performance in terms of semantic segmentation.

Revisiting Sequence-To-Sequence Video Object Segmentation with Multi-Task Loss and Skip-Memory

Fatemeh Azimi, Benjamin Bischke, Sebastian Palacio, Federico Raue, Jörn Hees, Andreas Dengel

Auto-TLDR; Sequence-to-Sequence Learning for Video Object Segmentation

Video Object Segmentation (VOS) is an active research area of the visual domain. One of its fundamental sub-tasks is semi-supervised / one-shot learning: given only the segmentation mask for the first frame, the task is to provide pixel-accurate masks for the object over the rest of the sequence. Despite much progress in the last years, we noticed that many of the existing approaches lose objects in longer sequences, especially when the object is small or briefly occluded. In this work, we build upon a sequence-to-sequence approach that employs an encoder-decoder architecture together with a memory module for exploiting the sequential data. We further improve this approach by proposing a model that manipulates multi-scale spatio-temporal information using memory-equipped skip connections. Furthermore, we incorporate an auxiliary task based on distance classification which greatly enhances the quality of edges in segmentation masks. We compare our approach to the state of the art and show considerable improvement in the contour accuracy metric and the overall segmentation accuracy.

MagnifierNet: Learning Efficient Small-Scale Pedestrian Detector towards Multiple Dense Regions

Qi Cheng, Mingqin Chen, Yingjie Wu, Fei Chen, Shiping Lin

Auto-TLDR; MagnifierNet: A Simple but Effective Small-Scale Pedestrian Detection Towards Multiple Dense Regions

Despite the success of pedestrian detection, there is still a significant gap in the performance of the detection of pedestrians at different scales. Detecting small-scale pedestrians is extremely challenging due to the low resolution of their convolution features which is essential for downstream classifiers. To address this issue, we observed pedestrian datasets and found that pedestrians often gather together in crowded public places. Then we propose MagnifierNet, a simple but effective small-scale pedestrian detector towards multiple dense regions. MagnifierNet uses our proposed sweep-line based grouping algorithm to find dense regions based on the number of pedestrians in the grouped region. And we adopt a new definition of small-scale pedestrians through grid search and KL-divergence. Besides, our grouping method can also be used as a new strategy for pedestrian data augmentation. The ablation study demonstrates that MagnifierNet improves the representation of small-scale pedestrians. We validate the effectiveness of MagnifierNet on CityPersons and KITTI datasets. Experimental results show that MagnifierNet achieves the best small-scale pedestrian detection performance on CityPersons benchmark without any external data, and also achieves competitive performance for detecting small-scale pedestrians on KITTI dataset without bells and whistles.
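
To illustrate the sweep-line grouping idea (simplified to one dimension, not the paper's exact algorithm): sort boxes by their left edge and start a new group whenever the gap to the current group's rightmost edge exceeds a threshold. The threshold and boxes below are made up.

```python
def sweep_group(boxes, gap=20):
    # boxes: list of (x1, x2) intervals; returns lists of mutually close boxes
    groups, cur, right = [], [], None
    for x1, x2 in sorted(boxes):
        if right is not None and x1 - right > gap:
            groups.append(cur)        # gap too large: close the current group
            cur, right = [], None
        cur.append((x1, x2))
        right = x2 if right is None else max(right, x2)
    if cur:
        groups.append(cur)
    return groups

print(sweep_group([(0, 30), (25, 60), (200, 240), (235, 280)]))
# [[(0, 30), (25, 60)], [(200, 240), (235, 280)]]
```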

Early Wildfire Smoke Detection in Videos

Taanya Gupta, Hengyue Liu, Bir Bhanu

Auto-TLDR; Semi-supervised Spatio-Temporal Video Object Segmentation for Automatic Detection of Smoke in Videos during Forest Fire

Recent advances in unmanned aerial vehicles and camera technology have proven useful for the detection of smoke that emerges above the trees during a forest fire. Automatic detection of smoke in videos is of great interest to fire departments. To date, in most parts of the world, fires are not detected in their early stages and generally turn catastrophic. This paper introduces a novel technique that integrates spatial and temporal features in a deep learning framework using semi-supervised spatio-temporal video object segmentation and dense optical flow. However, detecting this smoke in the presence of haze and without labeled data is difficult. Considering the visibility of haze in the sky, a dark channel pre-processing method is used that reduces the amount of haze in video frames and consequently improves the detection results. Online training is performed on a video at test time, which reduces the need for ground-truth data. Tests using publicly available video datasets show that the proposed algorithms outperform previous work and are robust across different wildfire-threatened locations.
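
The dark channel pre-processing mentioned above rests on the observation that haze-free patches have a near-zero dark channel, so large values flag haze. A minimal sketch of computing it is below; the patch size is an illustrative assumption, and full dehazing additionally estimates atmospheric light and transmission.

```python
import numpy as np
from scipy.ndimage import minimum_filter

def dark_channel(img: np.ndarray, patch: int = 15) -> np.ndarray:
    # img: float RGB in [0, 1]; per-pixel channel minimum, then a
    # minimum filter over each patch neighbourhood
    return minimum_filter(img.min(axis=2), size=patch)

frame = np.random.rand(120, 160, 3)
print(dark_channel(frame).shape)  # (120, 160)
```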

PolyLaneNet: Lane Estimation Via Deep Polynomial Regression

Talles Torres, Rodrigo Berriel, Thiago Paixão, Claudine Badue, Alberto F. De Souza, Thiago Oliveira-Santos

Auto-TLDR; Real-Time Lane Detection with Deep Polynomial Regression

One of the main factors that contributed to the recent large advances in autonomous driving is the advent of deep learning. For safer self-driving vehicles, one of the problems that has yet to be solved completely is lane detection. Since methods for this task have to work in real time (30+ FPS), they not only have to be effective (i.e., have high accuracy) but also efficient (i.e., fast). In this work, we present a novel method for lane detection that takes as input an image from a forward-looking camera mounted on the vehicle and outputs polynomials representing each lane marking in the image, via deep polynomial regression. The proposed method is shown to be competitive with existing state-of-the-art methods on the TuSimple dataset while maintaining its efficiency (115 FPS). Additionally, extensive qualitative results on two additional public datasets are presented, along with a discussion of limitations in the evaluation metrics used by recent works for lane detection. Finally, we provide source code and trained models that allow others to replicate all the results shown in this paper, which is surprisingly rare among state-of-the-art lane detection methods.
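
Representing each lane marking as a polynomial keeps the output head tiny. A hedged sketch of the decoding step (the degree of 3, the four lanes, the normalized coordinates, and the function name are our assumptions) rasterizes each predicted coefficient vector with np.polyval:

    # Sketch: decoding a polynomial-regression lane head; degree and
    # lane count are illustrative assumptions.
    import numpy as np

    def decode_lanes(coeffs, image_height, n_points=50):
        """coeffs: (n_lanes, degree+1) coefficients mapping a normalized
        row y in [0, 1] to a normalized column x."""
        ys = np.linspace(0.0, 1.0, n_points)
        lanes = []
        for c in coeffs:                   # one polynomial per lane
            xs = np.polyval(c, ys)         # x = c0*y^3 + c1*y^2 + c2*y + c3
            lanes.append(np.stack([xs, ys * image_height], axis=1))
        return lanes

    coeffs = np.array([[0.1, -0.2, 0.05, 0.5]] * 4)   # 4 lanes, degree 3
    points = decode_lanes(coeffs, image_height=720)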

Multi-Resolution Fusion and Multi-Scale Input Priors Based Crowd Counting

Usman Sajid, Wenchi Ma, Guanghui Wang

Auto-TLDR; Multi-resolution Fusion Based End-to-End Crowd Counting in Still Images

Crowd counting in still images is a challenging problem in practice due to huge crowd-density variations, large perspective changes, severe occlusion, and variable lighting conditions. State-of-the-art approaches based on the patch rescaling module (PRM) prove to be very effective in improving crowd counting performance. However, the PRM module requires an additional, performance-compromising crowd-density classification process. To address these issues and challenges, this paper proposes a new multi-resolution fusion based end-to-end crowd counting network. It employs three deep columns/branches, each catering to a respective crowd-density scale, and these columns regularly fuse (share) information with each other. The network is divided into three phases, with each phase containing one or more columns. Three input priors are introduced to serve as an efficient and effective alternative to the PRM module without requiring any additional classification operations. Along with the final crowd count regression head, the network also contains three auxiliary crowd estimation regression heads, which are strategically placed at the end of each phase to boost overall performance. Comprehensive experiments on three benchmark datasets demonstrate that the proposed approach outperforms all state-of-the-art models under the RMSE evaluation metric. The proposed approach also has better generalization capability, achieving the best results in cross-dataset experiments.
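
The multi-column layout with cross-column fusion and a per-phase auxiliary head is described only verbally; a compact sketch of one phase, where the kernel sizes, channel width, mean fusion, and class name are entirely our own assumptions, might look like:

    # Hedged sketch of one phase: three columns with different receptive
    # fields, fused by averaging, plus an auxiliary count head.
    import torch
    import torch.nn as nn

    class Phase(nn.Module):
        def __init__(self, ch=64):
            super().__init__()
            self.cols = nn.ModuleList(
                nn.Sequential(nn.Conv2d(ch, ch, k, padding=k // 2),
                              nn.ReLU(inplace=True))
                for k in (3, 5, 7))
            self.aux = nn.Sequential(       # auxiliary count regressor
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(ch, 1))

        def forward(self, x):
            fused = torch.stack([col(x) for col in self.cols]).mean(0)
            return fused, self.aux(fused)   # features + auxiliary count

The training loss would then combine the final count regression with weighted auxiliary losses collected from each phase end.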

Visual Saliency Oriented Vehicle Scale Estimation

Qixin Chen, Tie Liu, Jiali Ding, Zejian Yuan, Yuanyuan Shang

Auto-TLDR; Regularized Intensity Matching for Vehicle Scale Estimation with salient object detection

Vehicle scale estimation with a single camera is a typical application in intelligent transportation, and it faces challenges from visual computing, since intensity-based and descriptor-based methods must be balanced. This paper proposes a vehicle scale estimation method based on salient object detection to resolve this problem. A regularized intensity matching method is formulated in the Lie algebra to achieve robust and accurate scale estimation, and descriptor matching and intensity matching are combined to minimize the proposed loss function. A visual attention mechanism is designed to select image patches with texture and to remove occluded image patches. Weights are then assigned to the pixels of the selected image patches, which alleviates the influence of noise-corrupted pixels. Experiments show that the proposed method significantly outperforms state-of-the-art methods with regard to the robustness and accuracy of vehicle scale estimation.

Tiny Object Detection in Aerial Images

Jinwang Wang, Wen Yang, Haowen Guo, Ruixiang Zhang, Gui-Song Xia

Auto-TLDR; Tiny Object Detection in Aerial Images Using Multiple Center Points Based Learning Network

Object detection in Earth Vision has achieved great progress in recent years. However, tiny object detection in aerial images remains a very challenging problem, since tiny objects contain only a small number of pixels and are easily confused with the background. To advance tiny object detection research in aerial images, we present a new dataset for Tiny Object Detection in Aerial Images (AI-TOD). Specifically, AI-TOD comes with 700,621 object instances for eight categories across 28,036 aerial images. Compared to existing object detection datasets in aerial images, the mean object size in AI-TOD is about 12.8 pixels, which is much smaller than in other datasets. To build a benchmark for tiny object detection in aerial images, we evaluate state-of-the-art object detectors on AI-TOD. Experimental results show that directly applying these approaches to AI-TOD produces suboptimal detection results, so new detectors specialized for tiny objects need to be designed. We therefore propose a multiple center points based learning network (M-CenterNet) to improve the localization performance of tiny object detection, and experimental results show a significant performance gain over the competitors.

A Grid-Based Representation for Human Action Recognition

Soufiane Lamghari, Guillaume-Alexandre Bilodeau, Nicolas Saunier

Auto-TLDR; GRAR: Grid-based Representation for Action Recognition in Videos

Human action recognition (HAR) in videos is a fundamental research topic in computer vision. It consists mainly of understanding actions performed by humans based on a sequence of visual observations. In recent years, HAR has witnessed significant progress, especially with the emergence of deep learning models. However, most existing approaches for action recognition rely on information that is not always relevant to the task and are limited in the way they fuse temporal information. In this paper, we propose a novel method for human action recognition that efficiently encodes the most discriminative appearance information of an action, with explicit attention on representative pose features, into a new compact grid representation. Our GRAR (Grid-based Representation for Action Recognition) method is tested on several benchmark datasets, demonstrating that our model can accurately recognize human actions despite intra-class appearance variations and occlusion challenges.

Global Feature Aggregation for Accident Anticipation

Mishal Fatima, Umar Karim Khan, Chong Min Kyung

Auto-TLDR; Feature Aggregation for Predicting Accidents in Video Sequences

Anticipating accidents ahead of time in autonomous and non-autonomous vehicles aids in accident avoidance. In order to recognize abnormal events such as traffic accidents in a video sequence, it is important that the network take into account the interactions of objects in a given frame. We propose a novel Feature Aggregation (FA) block that refines each object's features by computing a weighted sum of the features of all objects in a frame. We use the FA block together with a Long Short-Term Memory (LSTM) network to anticipate accidents in video sequences. We report mean Average Precision (mAP) and Average Time-to-Accident (ATTA) on the Street Accident (SA) dataset. Our proposed method achieves the highest score for risk anticipation, predicting accidents 0.32 s and 0.75 s earlier than the best results of Adaptive Loss and dynamic-parameter-prediction based methods, respectively.
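
The FA block as described, where each object's features are refined by a weighted sum over all objects in the frame, amounts to a self-attention-style update. A minimal sketch, with the feature dimension, the query/key projections, and the residual form as our own assumptions:

    # Hedged sketch of the described feature aggregation: refine each
    # object's feature by a learned weighted sum over all objects.
    import torch
    import torch.nn as nn

    class FeatureAggregation(nn.Module):
        def __init__(self, dim=256):
            super().__init__()
            self.query = nn.Linear(dim, dim)
            self.key = nn.Linear(dim, dim)

        def forward(self, feats):               # feats: (n_objects, dim)
            scores = self.query(feats) @ self.key(feats).T
            weights = torch.softmax(scores / feats.size(1) ** 0.5, dim=1)
            return feats + weights @ feats      # residual refinement

The refined per-object features would then be fed to the LSTM to model how the object interactions evolve toward an accident over time.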

SAT-Net: Self-Attention and Temporal Fusion for Facial Action Unit Detection

Zhihua Li, Zheng Zhang, Lijun Yin

Auto-TLDR; Temporal Fusion and Self-Attention Network for Facial Action Unit Detection

Research on facial action unit detection has shown remarkable performance using deep spatial learning models in recent years; however, such models fall short of their full learning capacity because they do not exploit the temporal information of AUs across frames. Since AU occurrence in one frame is highly likely to be related to previous frames in a temporal sequence, exploring the temporal correlation of AUs across frames is a key motivation of this work. In this paper, we propose a novel temporal fusion and AU-supervised self-attention network (a so-called SAT-Net) to address the AU detection problem. First, we feed the deep features of a sequence into a convolutional LSTM network, fuse the preceding temporal information into the feature map of the last frame, and continue to learn AU occurrence. Second, considering that AU detection is a multi-label classification problem in which each individual label depends only on certain facial areas, we propose a new self-learned attention mask that focuses the detection of each AU on the relevant facial areas by learning an individual attention mask per AU, thus increasing AU independence without losing any spatial relations. Our extensive experiments show that the proposed framework achieves better AU detection results than the state of the art on two benchmark databases (BP4D and DISFA).
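
The self-learned per-AU attention masks can be read as one spatial mask per label applied to a shared feature map before that AU's classifier. A hedged sketch, where the shapes, the 1x1 convolution, the sigmoid normalization, and the class name are all our assumptions:

    # Hedged sketch: one learned spatial attention mask per action unit
    # (AU), applied to a shared feature map before a per-AU classifier.
    import torch
    import torch.nn as nn

    class PerAUAttention(nn.Module):
        def __init__(self, channels=256, n_aus=12):
            super().__init__()
            self.masks = nn.ModuleList(
                nn.Conv2d(channels, 1, kernel_size=1) for _ in range(n_aus))
            self.heads = nn.ModuleList(
                nn.Linear(channels, 1) for _ in range(n_aus))

        def forward(self, fmap):                  # fmap: (B, C, H, W)
            logits = []
            for mask, head in zip(self.masks, self.heads):
                attn = torch.sigmoid(mask(fmap))           # (B, 1, H, W)
                pooled = (fmap * attn).mean(dim=(2, 3))    # (B, C)
                logits.append(head(pooled))
            return torch.cat(logits, dim=1)                # (B, n_aus)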

Tackling Occlusion in Siamese Tracking with Structured Dropouts

Deepak Gupta, Efstratios Gavves, Arnold Smeulders

Auto-TLDR; Structured Dropout for Occlusion in latent space

Occlusion is one of the most difficult challenges to model in object tracking. This is because, unlike other challenges where data augmentation can help, occlusion is hard to simulate, as the occluding object can be anything of any shape. In this paper, we propose a simple solution that simulates the effects of occlusion in the latent space. Specifically, we present structured dropout to mimic the change in latent codes under occlusion. We present three forms of dropout (channel dropout, segment dropout, and slice dropout) designed with the various forms of occlusion in mind. To demonstrate their effectiveness, the dropouts are incorporated into two modern Siamese trackers (SiamFC and SiamRPN++). The outputs from multiple dropouts are combined using an encoder network to obtain the final prediction. Experiments on several tracking benchmarks show the benefits of structured dropouts, which, owing to their simplicity, require only small changes to existing tracker models.
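
Of the three dropout forms, channel dropout is the easiest to make concrete: zero out whole channels of the latent code so the tracker learns to cope with coherent groups of missing features, loosely mimicking an occluder. A minimal sketch, where the drop rate is an illustrative assumption:

    # Hedged sketch of structured channel dropout on a latent code; whole
    # channels are zeroed, unlike element-wise dropout, to mimic losing
    # coherent feature groups. The rate p is an assumption.
    import torch

    def channel_dropout(latent, p=0.2, training=True):
        """latent: (B, C, H, W) feature map from the Siamese backbone."""
        if not training or p == 0.0:
            return latent
        keep = (torch.rand(latent.size(0), latent.size(1), 1, 1,
                           device=latent.device) > p).to(latent.dtype)
        return latent * keep / (1.0 - p)   # rescale to keep expectation

This channel-wise variant is what PyTorch's nn.Dropout2d computes; segment and slice dropout would instead zero spatially contiguous regions or depth slices of the latent tensor.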

Derivation of Geometrically and Semantically Annotated UAV Datasets at Large Scales from 3D City Models

Sidi Wu, Lukas Liebel, Marco Körner

Auto-TLDR; Large-Scale Dataset of Synthetic UAV Imagery for Geometric and Semantic Annotation

While in high demand for the development of deep learning approaches, extensive datasets of annotated UAV imagery are still scarce today. Manual annotation, however, is time-consuming and has thus limited the potential for creating large-scale datasets. We tackle this challenge by presenting a procedure for the automatic creation of simulated UAV image sequences over urban areas, together with pixel-level annotations, from publicly available data sources. We synthesize photo-realistic UAV imagery from Google Earth Studio and derive annotations from an open CityGML model that provides not only geometric but also semantic information. The first dataset we created using our approach contains 144,000 images of Berlin, Germany, with four types of annotations, namely semantic labels as well as depth, surface normals, and edge maps. In the future, a complete pipeline addressing the remaining technical problems will be provided, together with more accurate models that refine some of the currently empirical settings, in order to automatically generate a large-scale dataset with reliable ground-truth annotations over the whole city of Berlin. The dataset, as well as the source code, will be published at that time, and several methods will be evaluated to test the usability of the dataset. We believe our dataset can be used for, but is not limited to, tasks such as pose estimation, geo-localization, monocular depth estimation, edge detection, building/surface classification, and plane segmentation. A potential research pipeline for geo-localization based on the synthetic dataset is provided.

Ground-truthing Large Human Behavior Monitoring Datasets

Tehreem Qasim, Robert Fisher, Naeem Bhatti

Auto-TLDR; Semi-automated Groundtruthing for Large Video Datasets

We present a ground-truthing approach applicable to large video datasets collected for studying people's behavior and recorded at a low frames-per-second (fps) rate. Ground-truthing a large dataset manually is time-consuming and prone to errors. The proposed approach is semi-automated (using a combination of deep networks and traditional image analysis) to minimize the human labeler's interaction with the video frames. The framework employs Mask R-CNN as a people counter, followed by human-assisted semi-automated tests to correct wrong labels. Subsequently, a bounding box extraction algorithm is used that is fully automated for frames with a single person and semi-automated for frames with two or more people. We also propose a methodology for anomaly detection, i.e., a person collapsing onto a table or the floor. Behavior recognition is performed using a fine-tuned AlexNet convolutional neural network. The people detection and behavior analysis components of the framework are primarily designed to reduce the human labor involved in ground-truthing, so that minimal human involvement is required; they are not meant to be employed as fully automated state-of-the-art systems. The proposed approach is validated on a new dataset presented in this paper, containing human activity in an indoor office environment recorded at 1 fps, as well as on an indoor video sequence recorded at 15 fps. Experimental results show a significant reduction in the human labor involved in ground-truthing: the number of potential clicks was reduced by 99.2% for the office dataset and by 99.7% for the additional test video.
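
The Mask R-CNN people-counting stage can be prototyped with torchvision's pretrained detector; the 0.7 score threshold below is our assumption rather than the paper's setting (in the COCO label map, class 1 is "person"):

    # Hedged sketch of the people-counting stage using torchvision's
    # pretrained Mask R-CNN; the score threshold is an assumption.
    import torch
    import torchvision

    model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
    model.eval()

    def count_people(frame, threshold=0.7):
        """frame: float tensor (3, H, W) in [0, 1] -> number of people."""
        with torch.no_grad():
            out = model([frame])[0]
        keep = (out["labels"] == 1) & (out["scores"] > threshold)
        return int(keep.sum())

The count returned per frame would then be handed to the human-assisted verification tests described above, rather than trusted as-is.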