Compact and Discriminative Multi-Object Tracking with Siamese CNNs

Claire Labit-Bonis, Jérôme Thomas, Frederic Lerasle

Auto-TLDR; Fast, Light-Weight and All-in-One Single Object Tracking for Multi-Target Management

Following the tracking-by-detection paradigm, multiple object tracking deals with challenging scenarios such as occlusions or even missing detections; priority is often given to quality measures over speed, and a good trade-off between the two is hard to achieve. Building on recent work, we propose a fast, light-weight tracker able to predict target positions and re-identify targets at once, a task usually performed in two sequential steps. To do so, we combine a bounding box regressor with a target-oriented appearance learner in a newly designed, unified architecture. This way, our tracker can infer the targets' image pose but also provide a confidence level about target identity. It is also common to filter out the detector outputs with a preprocessing step, throwing away precious information about what has been seen in the image. We instead propose a track management strategy able to balance efficiently between detection and tracking outputs and their associated likelihoods. Simply put, we spotlight a fully Siamese single object tracker able to predict both position and appearance features at once with a light-weight, all-in-one architecture, within a balanced overall multi-target management strategy. We demonstrate the efficiency and speed of our system with respect to the literature on the well-known MOT17 challenge benchmark, and present qualitative evaluations as well as state-of-the-art quantitative results.
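
To make the all-in-one idea concrete, here is a minimal PyTorch sketch of a Siamese network whose single forward pass returns both a bounding box regression and a normalized appearance embedding for re-identification. The backbone, layer sizes and head designs are illustrative assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseTrackHead(nn.Module):
    """Illustrative two-head Siamese tracker: one forward pass yields a box
    regression and an identity embedding (all sizes are assumptions)."""
    def __init__(self, feat_dim=256, embed_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(            # shared feature extractor
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.box_head = nn.Linear(2 * feat_dim, 4)        # (dx, dy, dw, dh)
        self.embed_head = nn.Linear(feat_dim, embed_dim)  # appearance code

    def forward(self, template, search):
        zt, zs = self.backbone(template), self.backbone(search)
        box = self.box_head(torch.cat([zt, zs], dim=1))   # regressed box
        emb = F.normalize(self.embed_head(zs), dim=1)     # re-ID feature
        return box, emb

net = SiameseTrackHead()
box, emb = net(torch.randn(1, 3, 127, 127), torch.randn(1, 3, 127, 127))
print(box.shape, emb.shape)  # torch.Size([1, 4]) torch.Size([1, 128])
```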

Similar papers

SynDHN: Multi-Object Fish Tracker Trained on Synthetic Underwater Videos

Mygel Andrei Martija, Prospero Naval

Auto-TLDR; Underwater Multi-Object Tracking in the Wild with Deep Hungarian Network

In this paper, we seek to extend multi-object tracking research to a relatively unexplored domain: underwater multi-object tracking in the wild. Multi-object fish tracking is an important task because it can provide fish monitoring systems with richer information (e.g., multiple views of the same fish) than detections alone, and it can be an invaluable input to fish behavior analysis. However, there is no annotated benchmark dataset with enough samples for this task. To circumvent the need for manual ground-truth tracking annotation, we craft a synthetic dataset. Using this synthetic dataset, we train an integrated detector and tracker called SynDHN. SynDHN uses the Deep Hungarian Network (DHN), a differentiable approximation of the Hungarian assignment algorithm. We repurpose DHN to become the tracking component of our algorithm by performing affinity estimation between detector predictions. We consider both spatial and appearance features for affinity estimation. Our results show that despite being trained on a synthetic dataset, SynDHN generalizes well to real underwater video tracking and performs better than our baseline algorithms.
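
For readers unfamiliar with affinity-based association, the sketch below builds a cost matrix from spatial (IoU) and appearance (cosine) cues and solves it with the classical, non-differentiable Hungarian algorithm from SciPy; DHN replaces that last step with a differentiable approximation. The cost weighting here is a made-up illustration, not SynDHN's exact formulation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def associate(t_boxes, d_boxes, t_feats, d_feats, w=0.5):
    """Cost mixes spatial (1 - IoU) and appearance (cosine distance) cues;
    the Hungarian algorithm returns the optimal one-to-one assignment."""
    cost = np.zeros((len(t_boxes), len(d_boxes)))
    for i, (tb, tf) in enumerate(zip(t_boxes, t_feats)):
        for j, (db, df) in enumerate(zip(d_boxes, d_feats)):
            cos = np.dot(tf, df) / (np.linalg.norm(tf) * np.linalg.norm(df) + 1e-9)
            cost[i, j] = w * (1 - iou(tb, db)) + (1 - w) * (1 - cos)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols))
```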

SiamMT: Real-Time Arbitrary Multi-Object Tracking

Lorenzo Vaquero, Manuel Mucientes, Victor Brea

Auto-TLDR; SiamMT: A Deep-Learning-based Arbitrary Multi-Object Tracking System for Video

Visual object tracking is of great interest in many applications, as it preserves the identity of an object throughout a video. However, while real applications demand systems capable of tracking multiple objects in real time, multi-object tracking solutions usually follow the tracking-by-detection paradigm: they depend on running a costly detector in each frame, and they do not allow the tracking of arbitrary objects, i.e., they require training for specific classes. In response to this need, this work presents the architecture of SiamMT, a system capable of efficiently applying individual visual tracking techniques to multiple objects in real time, making it the first deep-learning-based arbitrary multi-object tracker. To achieve this, we propose extracting global frame features with a fully-convolutional neural network, followed by cropping and resizing the different object search areas. The final similarity operation between these search areas and the target exemplars is carried out with an optimized pairwise cross-correlation. These novelties allow the system to track multiple targets in a scalable manner, achieving 25 fps with 60 simultaneous objects for VGA videos and 40 objects for HD720 videos, all with a tracking quality similar to SiamFC.
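
The pairwise cross-correlation step can be implemented efficiently as a grouped convolution, as is common in Siamese-tracking codebases; the sketch below is a generic depthwise cross-correlation batched over targets, assumed here for illustration rather than taken from the SiamMT implementation.

```python
import torch
import torch.nn.functional as F

def xcorr_depthwise(search, kernel):
    """Depthwise cross-correlation between per-target search-area features
    and target exemplars, batched via a single grouped convolution."""
    n, c, h, w = search.shape                 # n targets, c channels
    search = search.reshape(1, n * c, h, w)   # fold targets into channels
    kernel = kernel.reshape(n * c, 1, *kernel.shape[2:])
    out = F.conv2d(search, kernel, groups=n * c)
    return out.reshape(n, c, *out.shape[2:])  # one response map per target

# e.g. 60 simultaneous targets with 256-channel feature maps
resp = xcorr_depthwise(torch.randn(60, 256, 24, 24), torch.randn(60, 256, 8, 8))
print(resp.shape)  # torch.Size([60, 256, 17, 17])
```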

RSINet: Rotation-Scale Invariant Network for Online Visual Tracking

Yang Fang, Geunsik Jo, Chang-Hee Lee

Auto-TLDR; RSINet: Rotation-Scale Invariant Network for Adaptive Tracking

Most Siamese network-based trackers perform the tracking process without model update and cannot learn target-specific variation adaptively. Moreover, Siamese-based trackers infer the new state of tracked objects by generating axis-aligned bounding boxes, which contain extra background noise and are unable to accurately estimate the rotation and scale transformations of moving objects, thus potentially reducing tracking performance. In this paper, we propose a novel Rotation-Scale Invariant Network (RSINet) to address the above problems. Our RSINet tracker consists of a target-distractor discrimination branch and a rotation-scale estimation branch; rotation and scale knowledge is explicitly learned by a multi-task learning method in an end-to-end manner. In addition, the tracking model is adaptively optimized and updated under spatio-temporal energy control, which ensures model stability and reliability as well as high tracking efficiency. Comprehensive experiments on the OTB-100, VOT2018, and LaSOT benchmarks demonstrate that our proposed RSINet tracker yields new state-of-the-art performance compared with recent trackers, while running in real time at about 45 FPS.

DAL: A Deep Depth-Aware Long-Term Tracker

Yanlin Qian, Song Yan, Alan Lukežič, Matej Kristan, Joni-Kristian Kamarainen, Jiri Matas

Auto-TLDR; Deep Depth-Aware Long-Term RGBD Tracking with Deep Discriminative Correlation Filter

The best RGBD trackers provide high accuracy but are slow to run. On the other hand, the best RGB trackers are fast but clearly inferior on the RGBD datasets. In this work, we propose a deep depth-aware long-term tracker that achieves state-of-the-art RGBD tracking performance and is fast to run. We reformulate the deep discriminative correlation filter (DCF) to embed the depth information into deep features. Moreover, the same depth-aware correlation filter is used for target re-detection. Comprehensive evaluations show that the proposed tracker achieves state-of-the-art performance on the Princeton RGBD, STC, and the newly-released CDTB benchmarks and runs at 20 fps.

AerialMPTNet: Multi-Pedestrian Tracking in Aerial Imagery Using Temporal and Graphical Features

Maximilian Kraus, Seyed Majid Azimi, Emec Ercelik, Reza Bahmanyar, Peter Reinartz, Alois Knoll

Auto-TLDR; AerialMPTNet: A novel approach for multi-pedestrian tracking in geo-referenced aerial imagery by fusing appearance features

Multi-pedestrian tracking in aerial imagery has several applications, such as large-scale event monitoring, disaster management, search-and-rescue missions, and input to predictive crowd dynamic models. Due to challenges such as the large number and tiny size of the pedestrians (e.g., 4 x 4 pixels), their similar appearances, the different scales and atmospheric conditions of the images, and extremely low frame rates (e.g., 2 fps), current state-of-the-art algorithms, including deep learning-based ones, are unable to perform well. In this paper, we propose AerialMPTNet, a novel approach for multi-pedestrian tracking in geo-referenced aerial imagery that fuses appearance features from a Siamese Neural Network, movement predictions from a Long Short-Term Memory, and pedestrian interconnections from a GraphCNN. In addition, to address the lack of diverse aerial multi-pedestrian tracking datasets, we introduce the Aerial Multi-Pedestrian Tracking (AerialMPT) dataset, consisting of 307 frames with 44,740 annotated pedestrians. To the best of our knowledge, AerialMPT is the largest and most diverse dataset to date and will be released publicly. We evaluate AerialMPTNet on AerialMPT and KIT AIS, and benchmark against several state-of-the-art tracking methods. Results indicate that AerialMPTNet significantly outperforms other methods in accuracy and time efficiency.

Robust Visual Object Tracking with Two-Stream Residual Convolutional Networks

Ning Zhang, Jingen Liu, Ke Wang, Dan Zeng, Tao Mei

Auto-TLDR; Two-Stream Residual Convolutional Network for Visual Tracking

Current deep learning based visual tracking approaches have been very successful, learning the target classification and/or estimation model from a large amount of supervised training data in offline mode. However, most of them can still fail to track objects under challenging conditions such as dense distractor objects, confusing background, motion blur, and so on. Inspired by the human "visual tracking" capability, which leverages motion cues to distinguish the target from the background, we propose a Two-Stream Residual Convolutional Network (TS-RCN) for visual tracking, which successfully exploits both appearance and motion features for model update. Our TS-RCN can be integrated with existing deep learning based visual trackers. To further improve the tracking performance, we adopt a "wider" residual network, ResNeXt, as its feature extraction backbone. To the best of our knowledge, TS-RCN is the first end-to-end trainable two-stream visual tracking system, which makes full use of both appearance and motion features of the target. We have extensively evaluated TS-RCN on the most widely used benchmark datasets, including VOT2018, VOT2019, and GOT-10K. The experimental results demonstrate that our two-stream model can greatly outperform the appearance-based tracker and achieves state-of-the-art performance. The tracking system can run at up to 38.1 FPS.

Model Decay in Long-Term Tracking

Efstratios Gavves, Ran Tao, Deepak Gupta, Arnold Smeulders

Auto-TLDR; Model Bias in Long-Term Tracking

To account for appearance variations, tracking models need to be updated during the course of inference. However, updating the tracker model with adverse bounding box predictions adds an unavoidable bias term to the learning. This bias term, which we refer to as model decay, offsets the learning and causes tracking drift. While its adverse effect might not be visible in short-term tracking, accumulation of this bias over the long term can eventually lead to a permanent loss of the target. In this paper, we look at the problem of model bias from a mathematical perspective. Further, we briefly examine the effect of various sources of tracking error on model decay, using a correlation filter (ECO) and a Siamese (SINT) tracker. Based on observations and insights, we propose simple additions that help to reduce model decay in long-term tracking. The proposed tracker is evaluated on four long-term and one short-term tracking benchmarks, demonstrating superior accuracy and robustness, even on 30-minute-long videos.

Siamese Fully Convolutional Tracker with Motion Correction

Mathew Francis, Prithwijit Guha

Auto-TLDR; A Siamese Ensemble for Visual Tracking with Appearance and Motion Components

Visual tracking algorithms use cues like appearance, structure, motion, etc. for locating an object in a video. We propose an ensemble tracker with appearance and motion components. The ensemble comprises a Siamese tracker that learns object appearance from a static image and a flow network that computes motion vectors between consecutive frames. Motion-predicted object localization is used to correct the appearance component in the ensemble. The complementary nature of the components brings performance improvements, as observed in experiments performed on the VOT2018 and VOT2019 datasets.

TSDM: Tracking by SiamRPN++ with a Depth-Refiner and a Mask-Generator

Pengyao Zhao, Quanli Liu, Wei Wang, Qiang Guo

Auto-TLDR; TSDM: A Depth-D Tracker for 3D Object Tracking

In generic object tracking, depth (D) information provides informative cues for foreground-background separation and target bounding box regression. However, so far, few trackers have used depth information to play this important role, due to the lack of a suitable model. In this paper, an RGB-D tracker named TSDM is proposed, composed of a Mask-generator (M-g), SiamRPN++ and a Depth-refiner (D-r). The M-g generates the background masks and updates them as the target's 3D position changes. The D-r optimizes the target bounding box estimated by SiamRPN++, based on the difference in spatial depth distribution between the target and the surrounding background. Extensive evaluation on the Princeton Tracking Benchmark and the Visual Object Tracking challenge shows that our tracker outperforms the state-of-the-art by a large margin while achieving 23 FPS. In addition, a light-weight variant can run at 31 FPS and is thus practical for real-world applications. Code and models of TSDM are available at https://github.com/lql-team/TSDM.

Tackling Occlusion in Siamese Tracking with Structured Dropouts

Deepak Gupta, Efstratios Gavves, Arnold Smeulders

Auto-TLDR; Structured Dropout for Occlusion in latent space

Occlusion is one of the most difficult challenges to model in object tracking. This is because, unlike other challenges where data augmentation can help, occlusion is hard to simulate, as the occluding object can be anything in any shape. In this paper, we propose a simple solution that simulates the effects of occlusion in the latent space. Specifically, we present structured dropout to mimic the change in latent codes under occlusion. We present three forms of dropout (channel dropout, segment dropout and slice dropout) designed with the various forms of occlusion in mind. To demonstrate its effectiveness, the dropouts are incorporated into two modern Siamese trackers (SiamFC and SiamRPN++). The outputs from multiple dropouts are combined using an encoder network to obtain the final prediction. Experiments on several tracking benchmarks show the benefits of structured dropouts, which, due to their simplicity, require only small changes to the existing tracker models.
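
A rough sketch of what such structured dropouts might look like on a latent feature map is given below; the three forms (whole channels, a spatial band, a contiguous channel segment) are plausible readings of the abstract, not the paper's exact definitions.

```python
import torch

def structured_dropouts(z, p=0.3):
    """Three illustrative structured dropouts on a latent map z of shape
    (N, C, H, W); the paper's exact forms may differ."""
    n, c, h, w = z.shape
    # channel dropout: zero entire channels at random
    ch_mask = (torch.rand(n, c, 1, 1) > p).float()
    z_channel = z * ch_mask / (1 - p)
    # slice dropout: zero a horizontal band of the spatial map
    band = max(1, int(p * h))
    top = torch.randint(0, h - band + 1, (1,)).item()
    z_slice = z.clone()
    z_slice[:, :, top:top + band, :] = 0
    # segment dropout: zero a contiguous block of channels
    seg = max(1, int(p * c))
    start = torch.randint(0, c - seg + 1, (1,)).item()
    z_segment = z.clone()
    z_segment[:, start:start + seg] = 0
    return z_channel, z_slice, z_segment

outs = structured_dropouts(torch.randn(2, 16, 14, 14))
print([o.shape for o in outs])
```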

An Adaptive Fusion Model Based on Kalman Filtering and LSTM for Fast Tracking of Road Signs

Chengliang Wang, Xin Xie, Chao Liao

Auto-TLDR; Fusion of ThunderNet and Region Growing Detector for Road Sign Detection and Tracking

The detection and tracking of road signs plays a critical role in various autopilot applications. Utilizing convolutional neural networks (CNNs) mostly incurs a big run-time overhead in feature extraction and object localization. Although the Kalman filter (KF) is a commonly-used tracker, it is likely to be impacted by objects omitted in the detection step. In this paper, we design a highly efficient detector that combines ThunderNet and a Region Growing Detector (RGD) to detect road signs, and build a fusion model of a long short-term memory (LSTM) network and a KF for state estimation, together with a color histogram. The experimental results demonstrate that the proposed method improves the state estimation accuracy by 6.4% and raises the frame rate to 41 Frames Per Second (FPS).
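
As background for the KF component, the following is a textbook constant-velocity Kalman filter over image position; the process and measurement noise levels are illustrative, and the paper's LSTM fusion is not shown.

```python
import numpy as np

class ConstantVelocityKF:
    """Minimal constant-velocity Kalman filter over state (x, y, vx, vy);
    noise levels are illustrative, not the paper's tuned values."""
    def __init__(self, q=1e-2, r=1.0):
        self.F = np.eye(4); self.F[0, 2] = self.F[1, 3] = 1.0  # dt = 1
        self.H = np.eye(2, 4)                  # we observe position only
        self.Q, self.R = q * np.eye(4), r * np.eye(2)
        self.x, self.P = np.zeros(4), np.eye(4)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, z):
        y = np.asarray(z) - self.H @ self.x             # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)        # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P

kf = ConstantVelocityKF()
for z in [(10.0, 5.0), (11.0, 5.5), (12.2, 6.1)]:
    kf.predict(); kf.update(z)
print(kf.x[:2])  # filtered position estimate
```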

IPT: A Dataset for Identity Preserved Tracking in Closed Domains

Thomas Heitzinger, Martin Kampel

Auto-TLDR; Identity Preserved Tracking Using Depth Data for Privacy

We present a public dataset for Identity Preserved Tracking (IPT), consisting of sequences of depth data recorded using an Orbbec Astra depth sensor. The dataset features sequences in ten different locations with a high amount of background variation and is designed to be applicable to a wide range of tasks. Its labeling is versatile, allowing for tracking in either 3D space or image coordinates. Next to frame-by-frame 3D and inferred bounding box labeling, we provide supplementary annotation of camera poses and room layouts, split into multiple semantically distinct categories. Intended use cases are applications where both a high-level understanding of the scene and privacy are central considerations, such as active and assisted living (AAL), security and industrial safety. Compared to similar public datasets, IPT distinguishes itself with its sequential data format, 3D instance labeling and room layout annotation. We present baseline object detection results in image coordinates using a YOLOv3 network architecture and implement a background model suitable for online tracking applications to increase detection accuracy. Additionally, we propose a novel volumetric non-maximum suppression (V-NMS) approach that takes advantage of known room geometry. Lastly, we provide baseline person tracking results using the Multiple Object Tracking Challenge (MOTChallenge) evaluation metrics of the CVPR19 benchmark.

From Certain to Uncertain: Toward Optimal Solution for Offline Multiple Object Tracking

Kaikai Zhao, Takashi Imaseki, Hiroshi Mouri, Einoshin Suzuki, Tetsu Matsukawa

Auto-TLDR; Agglomerative Hierarchical Clustering with Ensemble of Tracking Experts for Object Tracking

An affinity measure in object tracking outputs a similarity or distance score for given detections. As an affinity measure is typically imperfect, it generally has an uncertain region in which deciding whether two groups of detections are the same object or different objects based on the score can be wrong. How to reduce the uncertain region is a major challenge for most similarity-based tracking methods. Early mistakes often result in distribution drifts for tracked objects, which is another major issue in object tracking. In this paper, we propose a new offline tracking method called agglomerative hierarchical clustering with an ensemble of tracking experts (AHC_ETE) to tackle the uncertain-region and early-mistake issues. We conduct tracking from certain to uncertain to reduce early mistakes. Meanwhile, we ensemble multiple tracking experts to reduce the uncertain region, as the final one is the union of those of the individual tracking experts. Experiments on the MOT16 dataset demonstrate the effectiveness of our method.

VTT: Long-Term Visual Tracking with Transformers

Tianling Bian, Yang Hua, Tao Song, Zhengui Xue, Ruhui Ma, Neil Robertson, Haibing Guan

Auto-TLDR; Visual Tracking Transformer with transformers for long-term visual tracking

Long-term visual tracking is a challenging problem. State-of-the-art long-term trackers, e.g., GlobalTrack, utilize region proposal networks (RPNs) to generate target proposals. However, the performance of these trackers is affected by occlusions and large scale or ratio variations. To address these issues, in this paper we are the first to propose a novel architecture with transformers for long-term visual tracking. Specifically, the proposed Visual Tracking Transformer (VTT) utilizes a transformer encoder-decoder architecture to aggregate global information and deal with occlusion and large scale or ratio variation. Furthermore, it also shows better discriminative power against instance-level distractors without the need for extra labeling and hard-sample mining. We conduct extensive experiments on the three largest long-term tracking datasets and achieve state-of-the-art performance.

Siamese Dynamic Mask Estimation Network for Fast Video Object Segmentation

Dexiang Hong, Guorong Li, Kai Xu, Li Su, Qingming Huang

Auto-TLDR; Siamese Dynamic Mask Estimation for Video Object Segmentation

Video object segmentation (VOS) has been a fundamental topic in recent years, and many deep learning-based methods have achieved state-of-the-art performance on multiple benchmarks. However, most of these methods rely on pixel-level matching between the template and the searched frames over the whole image, while the targets occupy only a small region. Computing over the entire image incurs substantial additional cost, and the whole image may contain distracting information, resulting in many false-positive matching points. To address this issue, motivated by one-stage instance segmentation methods, we propose an efficient Siamese dynamic mask estimation network for fast video object segmentation. VOS is decoupled into two tasks, i.e., mask feature learning and dynamic kernel prediction. The former is responsible for learning high-quality features that preserve structural geometric information, and the latter learns a dynamic kernel which is convolved with the mask feature to generate the mask output. We use a Siamese neural network as the feature extractor and directly predict masks after correlation. In this way, we avoid pixel-level matching, making our framework simpler and more efficient. Experimental results on the DAVIS 2016/2017 datasets show that our proposed method runs at 35 frames per second on an NVIDIA RTX TITAN while preserving competitive accuracy.
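
The dynamic kernel prediction step can be pictured as below: a pooled target vector is mapped to a small convolution kernel, which is then convolved with the mask feature map to produce the mask logits. The dimensions and the 1×1 kernel size are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicMaskHead(nn.Module):
    """Illustrative dynamic-kernel mask head: one branch predicts a per-target
    convolution kernel applied to the mask feature map (sizes are assumptions)."""
    def __init__(self, feat_dim=64, ksize=1):
        super().__init__()
        self.ksize = ksize
        self.kernel_pred = nn.Linear(feat_dim, feat_dim * ksize * ksize)

    def forward(self, mask_feat, target_vec):
        # mask_feat: (1, C, H, W); target_vec: (1, C) pooled target code
        c = mask_feat.shape[1]
        k = self.kernel_pred(target_vec).reshape(1, c, self.ksize, self.ksize)
        return F.conv2d(mask_feat, k)   # (1, 1, H, W) mask logits

head = DynamicMaskHead()
logits = head(torch.randn(1, 64, 60, 60), torch.randn(1, 64))
print(logits.shape)  # torch.Size([1, 1, 60, 60])
```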

MFST: Multi-Features Siamese Tracker

Zhenxi Li, Guillaume-Alexandre Bilodeau, Wassim Bouachir

Auto-TLDR; Multi-Features Siamese Tracker for Robust Deep Similarity Tracking

Siamese trackers have recently achieved interesting results due to their balance of accuracy and speed. This success is mainly due to the fact that deep similarity networks were specifically designed to address the image similarity problem; they are therefore inherently more appropriate than classical CNNs for the tracking task. However, Siamese trackers rely on the last convolutional layers for similarity analysis and target search, which restricts their performance. In this paper, we argue that using a single convolutional layer as feature representation is not the optimal choice within the deep similarity framework, as multiple convolutional layers provide several abstraction levels in characterizing an object. Starting from this motivation, we present the Multi-Features Siamese Tracker (MFST), a novel tracking algorithm exploiting several hierarchical feature maps for robust deep similarity tracking. MFST fuses hierarchical features to ensure a richer and more efficient representation. Moreover, we handle appearance variation by calibrating deep features extracted from two different CNN models. Based on this advanced feature representation, our algorithm achieves high tracking accuracy, outperforming several state-of-the-art trackers, including standard Siamese trackers.

Exploiting Distilled Learning for Deep Siamese Tracking

Chengxin Liu, Zhiguo Cao, Wei Li, Yang Xiao, Shuaiyuan Du, Angfan Zhu

Auto-TLDR; Distilled Learning Framework for Siamese Tracking

Existing deep siamese trackers are typically built on off-the-shelf CNN models for feature learning, at the cost of huge power consumption and memory storage. This prevents current deep siamese trackers from being deployed on resource-constrained devices like mobile phones, given that such a deployment normally requires cost-effective solutions. In this work, we address this issue by presenting a novel Distilled Learning Framework (DLF) for siamese tracking, which aims at learning a tracking model with efficiency and high accuracy. Specifically, we propose two simple yet effective knowledge distillation strategies, denoted point-wise distillation and pair-wise distillation, which are designed to transfer knowledge from a more discriminative teacher tracker into a compact student tracker. In this way, cost-effective, high-performance tracking can be achieved. Extensive experiments on several tracking benchmarks demonstrate the effectiveness of our proposed method.
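
One plausible reading of the two strategies is sketched below: point-wise distillation matches teacher and student activations location by location, while pair-wise distillation matches their pairwise similarity structure. The exact loss forms are assumptions, and equal student/teacher feature shapes are assumed for simplicity (a projection layer would otherwise be needed).

```python
import torch
import torch.nn.functional as F

def pointwise_distill(f_student, f_teacher):
    """Point-wise distillation: match features at every spatial location
    (a plain MSE here; an assumption about the exact form)."""
    return F.mse_loss(f_student, f_teacher)

def pairwise_distill(f_student, f_teacher):
    """Pair-wise distillation: match the pairwise similarity structure
    between spatial locations rather than raw activations."""
    def sim(f):
        v = f.flatten(2)                          # (N, C, H*W)
        v = F.normalize(v, dim=1)
        return torch.bmm(v.transpose(1, 2), v)    # (N, HW, HW) similarities
    return F.mse_loss(sim(f_student), sim(f_teacher))

fs, ft = torch.randn(2, 32, 8, 8), torch.randn(2, 32, 8, 8)
loss = pointwise_distill(fs, ft) + pairwise_distill(fs, ft)
print(loss.item())
```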

Visual Object Tracking in Drone Images with Deep Reinforcement Learning

Derya Gözen, Sedat Ozer

Auto-TLDR; A Deep Reinforcement Learning based Single Object Tracker for Drone Applications

There is an increasing demand for utilizing camera-equipped drones and their applications in many domains, varying from agriculture to entertainment and from sports events to surveillance. In such drone applications, an essential and common task is visually tracking an object of interest. Drone (or UAV) images have different properties than ground-level (natural) images, and those differences introduce additional complexities for existing object trackers to be directly applied to drone applications. Some important differences include (i) smaller object sizes to be tracked and (ii) different orientations and viewing angles yielding different textures and features. Therefore, new algorithms trained on drone images are needed for drone-based applications. In this paper, we introduce a deep reinforcement learning (RL) based single object tracker that tracks an object of interest in drone images by estimating a series of actions to find the location of the object in the next frame. This is the first work introducing a single object tracker using a deep RL-based technique for drone images. Our proposed solution introduces a novel reward function that aims to reduce the total number of actions taken to estimate the object's location in the next frame, and introduces a different backbone network to be used on low-resolution images. Additionally, we introduce a set of new actions into the action library to better deal with the above-mentioned complexities. We compare our proposed solutions to a state-of-the-art tracking algorithm from the recent literature and demonstrate up to 3.87% improvement in precision and 3.6% improvement in IoU values on the VisDrone2019 dataset. We also provide additional results on the OTB-100 dataset, showing up to 3.15% improvement in precision when compared to the same previous state-of-the-art algorithm. Lastly, we analyze the ability of our proposed solutions to handle some of the challenges faced during tracking, including but not limited to occlusion, deformation, and scale variation.

Tracking Fast Moving Objects by Segmentation Network

Ales Zita, Filip Sroubek

Auto-TLDR; Fast Moving Objects Tracking by Segmentation Using Deep Learning

Tracking Fast Moving Objects (FMO), which appear as blurred streaks in video sequences, is a difficult task for standard trackers, as the object position does not overlap in consecutive video frames and the texture information of the objects is blurred. Current approaches tuned for this task are based on background subtraction with a static background and slow deblurring algorithms. In this article, we present a tracking-by-segmentation approach implemented using modern deep learning methods that performs near real-time tracking on real-world video sequences. We have developed a physically plausible FMO sequence generator as a robust foundation for our training pipeline and demonstrate straightforward network adaptation for different FMO scenarios with varying foreground.

Reducing False Positives in Object Tracking with Siamese Network

Takuya Ogawa, Takashi Shibata, Shoji Yachida, Toshinori Hosoi

Auto-TLDR; Robust Long-Term Object Tracking with Adaptive Search based on Motion Models

We have developed a robust long-term object tracking method that resolves the fundamental causes of drift and loss of a target in visual object tracking. The proposed method consists of “sampling area extension”, which prevents the tracking result from drifting to other objects by learning false-positive samples in advance (before they enter the search region of the target), and “adaptive search based on motion models”, which prevents the tracking result from drifting to other objects and avoids losing the target by using not only appearance features but also motion models to adaptively search for the target. Experiments conducted on a long-term tracking dataset showed that the first technique improved robustness by 16.6%, while the second improved robustness by 15.3%. By combining both, our method achieved 21.7% and 9.1% improvements in robustness and precision, respectively, and the processing speed became 3.3 times faster. Additional experiments showed that our method achieved the top robustness among state-of-the-art methods on three long-term tracking datasets. These findings demonstrate that our method is effective for long-term object tracking and that its performance and speed are promising for use in practical applications of various technologies underlying object tracking.

Utilising Visual Attention Cues for Vehicle Detection and Tracking

Feiyan Hu, Venkatesh Gurram Munirathnam, Noel E O'Connor, Alan Smeaton, Suzanne Little

Auto-TLDR; Visual Attention for Object Detection and Tracking in Driver-Assistance Systems

Advanced Driver-Assistance Systems (ADAS) have been attracting attention from many researchers. Vision-based sensors are the closest way to emulate human driver visual behavior while driving. In this paper, we explore possible ways to use visual attention (saliency) for object detection and tracking. We investigate: 1) how a visual attention map, such as a subjectness attention or saliency map and an objectness attention map, can facilitate region proposal generation in a 2-stage object detector; 2) how a visual attention map can be used for tracking multiple objects. We propose a neural network that can simultaneously detect objects and generate objectness and subjectness maps to save computational power. We further exploit the visual attention map during tracking using a sequential Monte Carlo probability hypothesis density (PHD) filter. The experiments are conducted on the KITTI and DETRAC datasets. The use of visual attention and hierarchical features shows a considerable improvement of ≈8% in object detection, which effectively increased tracking performance by ≈4% on the KITTI dataset.

Efficient Correlation Filter Tracking with Adaptive Training Sample Update Scheme

Shan Jiang, Shuxiao Li, Chengfei Zhu, Nan Yan

Auto-TLDR; Adaptive Training Sample Update Scheme of Correlation Filter Based Trackers for Visual Tracking

Visual tracking serves as a significant module in many applications. However, the heavy computation and low speed of many recent trackers restrict their use in scenarios with limited computing power. On the other hand, the simple update scheme of most correlation filter based trackers limits their robustness during target deformation and occlusion. In this paper, we explore the update scheme of correlation filter based trackers and propose an efficient and adaptive training sample update scheme. The training sample extracted in each frame is added to the training set according to its distance from existing samples, measured with a difference hashing algorithm (DHA), or discarded according to the reliability of the tracking result. Experiments on OTB-2015, Temple Color 128 and UAV123 demonstrate that our tracker performs favourably against state-of-the-art trackers with light computation and runs at over 100 fps on a desktop computer with an Intel i7-8700 CPU (3.2 GHz).
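
Difference hashing (dHash) itself is a standard perceptual-hash technique; a minimal version is sketched below, with the Hamming distance between hashes serving as the sample-to-sample distance. The patch and hash sizes are illustrative, not the paper's settings.

```python
import numpy as np
import cv2

def dhash(patch, hash_size=8):
    """Difference hash: resize to (hash_size+1, hash_size) grayscale and
    compare horizontally adjacent pixels, giving a 64-bit binary signature."""
    gray = cv2.cvtColor(patch, cv2.COLOR_BGR2GRAY)
    small = cv2.resize(gray, (hash_size + 1, hash_size))
    return (small[:, 1:] > small[:, :-1]).flatten()

def hamming(h1, h2):
    """Distance between two samples = number of differing hash bits."""
    return int(np.count_nonzero(h1 != h2))

a = dhash((np.random.rand(64, 64, 3) * 255).astype(np.uint8))
b = dhash((np.random.rand(64, 64, 3) * 255).astype(np.uint8))
print(hamming(a, b))  # small distance => near-duplicate sample, skip update
```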

Multi-Level Deep Learning Vehicle Re-Identification Using Ranked-Based Loss Functions

Eleni Kamenou, Jesus Martinez-Del-Rincon, Paul Miller, Patricia Devlin-Hill

Auto-TLDR; Multi-Level Re-identification Network for Vehicle Re-Identification

Identifying vehicles across a network of cameras with non-overlapping fields of view remains a challenging research problem due to scene occlusions, significant inter-class similarity and intra-class variability. In this paper, we propose an end-to-end multi-level re-identification network that is capable of successfully projecting same-identity vehicles closer to one another in the embedding space than vehicles of different identities. Robust feature representations are obtained by combining features at multiple levels of the network. As for the learning process, we employ a recent state-of-the-art structured metric learning loss function previously applied to other retrieval problems and adjust it to the vehicle re-identification task. Furthermore, we explore the cases of image-to-image, image-to-video and video-to-video similarity metrics. Finally, we evaluate our system and achieve strong performance on two large-scale publicly available datasets, CityFlow-ReID and VeRi-776. Compared to most existing state-of-the-art approaches, our approach is simpler and more straightforward, utilizing only identity-level annotations, while avoiding post-processing of the ranking results (re-ranking) at the testing phase.

FeatureNMS: Non-Maximum Suppression by Learning Feature Embeddings

Niels Ole Salscheider

Auto-TLDR; FeatureNMS: Non-Maximum Suppression for Multiple Object Detection

Most state-of-the-art object detectors output multiple detections per object. The duplicates are removed in a post-processing step called Non-Maximum Suppression. Classical Non-Maximum Suppression has shortcomings in scenes that contain objects with high overlap: the idea of this heuristic is that a high bounding box overlap corresponds to a high probability of having a duplicate. We propose FeatureNMS to solve this problem. FeatureNMS recognizes duplicates not only based on the intersection over union between bounding boxes, but also based on the difference of feature vectors. These feature vectors can encode additional information, such as visual appearance. Our approach outperforms classical NMS and derived approaches, and achieves state-of-the-art performance.
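
A greedy sketch of the idea, suppressing a detection only when it is close to a kept one in both box overlap and embedding space, is given below; the thresholds and the greedy formulation are illustrative, not the paper's exact procedure.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def feature_nms(boxes, scores, feats, iou_thr=0.5, feat_thr=0.8):
    """Suppress a box only if it overlaps a kept box (IoU > iou_thr) AND has
    a similar embedding (L2 distance < feat_thr); thresholds are made up."""
    order = np.argsort(scores)[::-1]   # process highest-scoring boxes first
    keep = []
    for i in order:
        duplicate = any(
            iou(boxes[i], boxes[j]) > iou_thr
            and np.linalg.norm(feats[i] - feats[j]) < feat_thr
            for j in keep)
        if not duplicate:
            keep.append(i)
    return keep
```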

Correlation-Based ConvNet for Small Object Detection in Videos

Brais Bosquet, Manuel Mucientes, Victor Brea

Auto-TLDR; STDnet-ST: An End-to-End Spatio-Temporal Convolutional Neural Network for Small Object Detection in Video

The detection of small objects is of particular interest in many real applications. In this paper, we propose STDnet-ST, a novel approach to small object detection in video that uses spatial information operating alongside temporal video information. STDnet-ST is an end-to-end spatio-temporal convolutional neural network that detects small objects over time and correlates pairs of the top-ranked regions with the highest likelihood of containing small objects. This architecture links the small objects across time as tubelets and is able to dismiss unprofitable object links in order to provide high-quality tubelets. STDnet-ST achieves state-of-the-art results for small objects on the publicly available USC-GRAD-STDdb and UAVDT video datasets.

Detective: An Attentive Recurrent Model for Sparse Object Detection

Amine Kechaou, Manuel Martinez, Monica Haurilet, Rainer Stiefelhagen

Auto-TLDR; Detective: An attentive object detector that identifies objects in images in a sequential manner

In this work, we present Detective – an attentive object detector that identifies objects in images in a sequential manner. Our network is based on an encoder-decoder architecture, where the encoder is a convolutional neural network, and the decoder is a convolutional recurrent neural network coupled with an attention mechanism. At each iteration, our decoder focuses on the relevant parts of the image using an attention mechanism, and then estimates the object's class and bounding box coordinates. Current object detection models generate dense predictions and rely on post-processing to remove duplicates. Detective is a sparse object detector that generates a single bounding box per object instance. However, training a sparse object detector is challenging, as it requires the model to reason at the instance level and not just at the class and spatial levels. We propose a training mechanism based on the Hungarian algorithm and a loss that balances the localization and classification tasks. This allows Detective to achieve promising results on the PASCAL VOC object detection dataset. Our experiments demonstrate that sparse object detection is possible and has great potential for future developments in applications where the order of the objects to be predicted is of interest.

Transformer Networks for Trajectory Forecasting

Francesco Giuliari, Hasan Irtiza, Marco Cristani, Fabio Galasso

Auto-TLDR; Transformer Networks for Trajectory Prediction of People Interactions

Most recent successes in forecasting people motion are based on LSTM models, and all most recent progress has been achieved by modelling the social interaction among people and the people interaction with the scene. We question the use of the LSTM models and propose the novel use of Transformer Networks for trajectory forecasting. This is a fundamental switch from the sequential step-by-step processing of LSTMs to the only-attention-based memory mechanisms of Transformers. In particular, we consider both the original Transformer Network (TF) and the larger Bidirectional Transformer (BERT), state-of-the-art on all natural language processing tasks. Our proposed Transformers predict the trajectories of the individual people in the scene. These are "simple" models because each person is modelled separately without any complex human-human nor scene interaction terms. In particular, the TF model without bells and whistles yields the best score on the largest and most challenging trajectory forecasting benchmark of TrajNet [1]. Additionally, its extension which predicts multiple plausible future trajectories performs on par with more engineered techniques on the 5 datasets of ETH [2]+UCY [3]. Finally, we show that Transformers may deal with missing observations, as may be the case with real sensor data. Code is available at github.com/FGiuliari/Trajectory-Transformer

PolyLaneNet: Lane Estimation Via Deep Polynomial Regression

Talles Torres, Rodrigo Berriel, Thiago Paixão, Claudine Badue, Alberto F. De Souza, Thiago Oliveira-Santos

Auto-TLDR; Real-Time Lane Detection with Deep Polynomial Regression

One of the main factors that contributed to the large advances in autonomous driving is the advent of deep learning. For safer self-driving vehicles, one of the problems that has yet to be completely solved is lane detection. Since methods for this task have to work in real time (+30 FPS), they not only have to be effective (i.e., have high accuracy) but also efficient (i.e., fast). In this work, we present a novel method for lane detection that uses as input an image from a forward-looking camera mounted on the vehicle and outputs polynomials representing each lane marking in the image, via deep polynomial regression. The proposed method is shown to be competitive with existing state-of-the-art methods on the TuSimple dataset, while maintaining its efficiency (115 FPS). Additionally, extensive qualitative results on two additional public datasets are presented, along with a discussion of limitations in the evaluation metrics used by recent works for lane detection. Finally, we provide source code and trained models that allow others to replicate all the results shown in this paper, which is surprisingly rare in state-of-the-art lane detection methods.
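
Decoding such a prediction is straightforward: given polynomial coefficients and a vertical span for one lane marking, sample the polynomial to recover image points. The third-degree coefficients and span below are hypothetical values for illustration only.

```python
import numpy as np

def lane_points(coeffs, y_top, y_bottom, n=20):
    """Evaluate one predicted lane marking: the network outputs polynomial
    coefficients (highest degree first) plus the vertical span of the lane;
    sampling the polynomial recovers (x, y) points for drawing or metrics."""
    ys = np.linspace(y_top, y_bottom, n)
    xs = np.polyval(coeffs, ys)        # x = a*y^3 + b*y^2 + c*y + d
    return np.stack([xs, ys], axis=1)

# hypothetical third-degree lane prediction spanning image rows 250..590
pts = lane_points([1e-6, -3e-3, 1.2, 40.0], 250, 590)
print(pts[:3])
```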

Adaptive Context-Aware Discriminative Correlation Filters for Robust Visual Object Tracking

Tianyang Xu, Zhenhua Feng, Xiaojun Wu, Josef Kittler

Auto-TLDR; ACA-DCF: Adaptive Context-Aware Discriminative Correlation Filter with complementary attention mechanisms

In recent years, Discriminative Correlation Filters (DCFs) have gained popularity due to their superior performance in visual object tracking. However, existing DCF trackers usually learn filters using fixed attention mechanisms that focus on the centre of an image and suppress filter amplitudes in the surroundings. In this paper, we propose an Adaptive Context-Aware Discriminative Correlation Filter (ACA-DCF) that improves the existing DCF formulation with complementary attention mechanisms. Our ACA-DCF integrates foreground attention and background attention for complementary context-aware filter learning. More importantly, we ameliorate the design with an adaptive weighting strategy that takes complex appearance variations into account. The experimental results obtained on several well-known benchmarks demonstrate the effectiveness and superiority of the proposed method over state-of-the-art approaches.

An Empirical Analysis of Visual Features for Multiple Object Tracking in Urban Scenes

Mehdi Miah, Justine Pepin, Nicolas Saunier, Guillaume-Alexandre Bilodeau

Auto-TLDR; Evaluating Appearance Features for Multiple Object Tracking in Urban Scenes

This paper addresses the problem of selecting appearance features for multiple object tracking (MOT) in urban scenes. Over the years, a large number of features have been used for MOT. However, it is not clear whether some of them are better than others. Commonly used features are color histograms, histograms of oriented gradients, deep features from convolutional neural networks and re-identification (ReID) features. In this study, we assess how good these features are at discriminating objects enclosed by a bounding box in urban scene tracking scenarios. Several affinity measures, namely the L1, L2 and Bhattacharyya distances, Rank-1 counts and the cosine similarity, are also assessed for their impact on the discriminative power of the features. Results on several datasets show that features from ReID networks are the best at discriminating instances from one another regardless of the quality of the detector. If a ReID model is not available, color histograms may be selected if the detector has good recall and there are few occlusions; otherwise, deep features are more robust to detectors with lower recall.
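
For reference, the assessed affinity measures are each a few lines of NumPy; the Bhattacharyya variant below follows the common Hellinger-style form (as in OpenCV's histogram comparison), which may differ in detail from the paper's definition.

```python
import numpy as np

def l1(a, b):
    return np.abs(a - b).sum()

def l2(a, b):
    return np.linalg.norm(a - b)

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def bhattacharyya(p, q):
    """Distance between two histograms, normalised to sum to one."""
    p, q = p / p.sum(), q / q.sum()
    bc = np.sum(np.sqrt(p * q))             # Bhattacharyya coefficient
    return np.sqrt(max(0.0, 1.0 - bc))      # Hellinger-style distance

h1, h2 = np.random.rand(64), np.random.rand(64)   # e.g. colour histograms
print(l1(h1, h2), l2(h1, h2), cosine_sim(h1, h2), bhattacharyya(h1, h2))
```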

Mobile Augmented Reality: Fast, Precise, and Smooth Planar Object Tracking

Dmitrii Matveichev, Daw-Tung Lin

Auto-TLDR; Planar Object Tracking with Sparse Optical Flow Tracking and Descriptor Matching

We propose an innovative method for combining sparse optical flow tracking and descriptor matching algorithms. The proposed approach solves the following problems that are inherent to keypoint-based and optical flow based tracking algorithms: spatial jitter, extreme scale transformation, extreme perspective transformation, degradation in the number of tracking points, and drifting of tracking points. Our algorithm provides smooth object-position tracking under six-degrees-of-freedom transformations with a small computational cost, providing a high-quality real-time AR experience on mobile platforms. We experimentally demonstrate that our approach outperforms state-of-the-art tracking algorithms while offering faster computation time. A mobile augmented reality (AR) application developed using our approach delivers planar object tracking at 30 FPS on modern mobile phones for a camera resolution of 1280×720. Finally, we compare the performance of our AR application with that of a Vuforia-based AR application on the same planar object database. The test results show that our AR application delivers a better AR experience than Vuforia in terms of smooth transitions of object pose between video frames.

Not 3D Re-ID: Simple Single Stream 2D Convolution for Robust Video Re-Identification

Toby Breckon, Aishah Alsehaim

Auto-TLDR; ResNet50-IBN for Video-based Person Re-Identification using Single Stream 2D Convolution Network

Video-based person re-identification has received increasing attention recently, as it plays an important role within surveillance video analysis. Video-based Re-ID expands earlier image-based re-identification methods by learning features from a video via multiple image frames for each person. Most contemporary video Re-ID methods utilise complex CNN-based network architectures using 3D convolution or multi-branch networks to extract spatial-temporal features from the video. By contrast, in this paper we illustrate superior performance from a simple single stream 2D convolution network leveraging the ResNet50-IBN architecture to extract frame-level features, followed by temporal attention for clip-level features. These clip-level features can be generalised to video-level features by averaging, without any additional cost. Our model, using video Re-ID best practice and transfer learning between datasets, outperforms existing state-of-the-art approaches on the MARS, PRID2011 and iLIDS-VID datasets with 89.62%, 97.75% and 97.33% rank-1 accuracy respectively, and with 84.61% mAP for MARS, without reliance on complex and memory-intensive 3D convolutions or multi-stream network architectures as found in other contemporary work. Conversely, this work shows that global features extracted by the 2D convolution network are a sufficient representation for robust state-of-the-art video Re-ID.

Light3DPose: Real-Time Multi-Person 3D Pose Estimation from Multiple Views

Alessio Elmi, Davide Mazzini, Pietro Tortella

Auto-TLDR; 3D Pose Estimation of Multiple People from a Few Calibrated Camera Views using Deep Learning

We present an approach to perform 3D pose estimation of multiple people from a few calibrated camera views. Our architecture, leveraging the recently proposed unprojection layer, aggregates feature maps from a 2D pose estimator backbone into a comprehensive representation of the 3D scene. This intermediate representation is then elaborated by a fully-convolutional volumetric network and a decoding stage to extract 3D skeletons with sub-voxel accuracy. Our method achieves state-of-the-art MPJPE on the CMU Panoptic dataset using a few unseen views and obtains competitive results even with a single input view. We also assess the transfer learning capabilities of the model by testing it against the publicly available Shelf dataset, obtaining good performance metrics. The proposed method is inherently efficient: as a pure bottom-up approach, it is computationally independent of the number of people in the scene. Furthermore, even though the computational burden of the 2D part scales linearly with the number of input views, the overall architecture is able to exploit a very lightweight 2D backbone that is orders of magnitude faster than the volumetric counterpart, resulting in fast inference times. The system can run at 6 FPS, processing up to 10 camera views on a single 1080Ti GPU.

FC-DCNN: A Densely Connected Neural Network for Stereo Estimation

Dominik Hirner, Friedrich Fraundorfer

Auto-TLDR; FC-DCNN: A Lightweight Network for Stereo Estimation

We propose a novel lightweight network for stereo estimation. Our network consists of a fully-convolutional densely connected neural network (FC-DCNN) that computes matching costs between rectified image pairs. Our FC-DCNN method learns expressive features and performs some simple but effective post-processing steps. The densely connected layer structure connects the output of each layer to the input of each subsequent layer. This network structure, in addition to eliminating all fully-connected layers, leads to a very lightweight network. The output of this network is used to calculate matching costs and create a cost volume. Instead of using time- and memory-inefficient cost-aggregation methods such as semi-global matching or conditional random fields to improve the result, we rely on filtering techniques, namely the median filter and the guided filter. Through a left-right consistency check we remove inconsistent values. Afterwards, we apply a watershed foreground-background segmentation to the disparity image with the inconsistencies removed. This mask is then used to refine the final prediction. We show that our method works well for both challenging indoor and outdoor scenes by evaluating it on the Middlebury, KITTI and ETH3D benchmarks respectively.
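
A basic left-right consistency check of the kind described is sketched below: each left-image disparity is compared with the right-image disparity at the pixel it maps to, and disagreements beyond a threshold are invalidated. The threshold and NaN masking are illustrative choices, not taken from the FC-DCNN code.

```python
import numpy as np

def left_right_consistency(disp_left, disp_right, tau=1.0):
    """Invalidate disparities where the left and right maps disagree by
    more than tau pixels (standard check; tau is an illustrative value)."""
    disp_left = np.asarray(disp_left, dtype=float)
    h, w = disp_left.shape
    xs = np.arange(w)[None, :].repeat(h, axis=0)
    # column each left pixel maps to in the right image (x_right = x - d)
    xr = np.clip((xs - np.round(disp_left)).astype(int), 0, w - 1)
    d_right = np.take_along_axis(np.asarray(disp_right, dtype=float), xr, axis=1)
    out = disp_left.copy()
    out[np.abs(disp_left - d_right) > tau] = np.nan  # inconsistent => removed
    return out
```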

Object Segmentation Tracking from Generic Video Cues

Amirhossein Kardoost, Sabine Müller, Joachim Weickert, Margret Keuper

Auto-TLDR; A Light-Weight Variational Framework for Video Object Segmentation in Videos

We propose a light-weight variational framework for online tracking of object segmentations in videos based on optical flow and image boundaries. While high-end computer vision methods for this task rely on sequence-specific training of dedicated CNN architectures, we show the potential of a variational model based on generic video information from motion and color. Such cues are usually required for tasks such as robot navigation or grasp estimation, and we leverage them directly for video object segmentation, thus providing accurate segmentations at potentially very low extra cost. Our simple method can provide competitive results compared to the costly CNN-based methods with parameter tuning. Furthermore, we show that our approach can be combined with state-of-the-art CNN-based segmentations in order to improve over their respective results. We evaluate our method on the DAVIS 16/17 and SegTrack v2 datasets.

Domain Siamese CNNs for Sparse Multispectral Disparity Estimation

David-Alexandre Beaupre, Guillaume-Alexandre Bilodeau

Auto-TLDR; Multispectral Disparity Estimation between Thermal and Visible Images using Deep Neural Networks

Multispectral disparity estimation is a difficult task for many reasons: it has all the same challenges as traditional visible-visible disparity estimation (occlusions, repetitive patterns, textureless surfaces), in addition to having very little common visual information between images (e.g., color information vs. thermal information). In this paper, we propose a new CNN architecture able to perform disparity estimation between images from different spectra, namely thermal and visible in our case. Our proposed model takes two patches as input and proceeds to domain feature extraction for each of them. Features from both domains are then merged with two fusion operations, namely correlation and concatenation. These merged vectors are then forwarded to their respective classification heads, which are responsible for classifying the inputs as being the same or not. Using two merging operations gives more robustness to our feature extraction process, which leads to more precise disparity estimation. Our method was tested using the publicly available LITIV 2014 and LITIV 2018 datasets, and showed the best results when compared to other state-of-the-art methods.

ACRM: Attention Cascade R-CNN with Mix-NMS for Metallic Surface Defect Detection

Junting Fang, Xiaoyang Tan, Yuhui Wang

Auto-TLDR; Attention Cascade R-CNN with Mix Non-Maximum Suppression for Robust Metal Defect Detection

Metallic surface defect detection is of great significance in quality control for production. However, this task is very challenging due to noise disturbance, large appearance variation, and the ambiguous definition of the defect individual. Traditional image processing methods are unable to detect the damaged region effectively and efficiently. In this paper, we propose a new defect detection method, Attention Cascade R-CNN with Mix-NMS (ACRM), to classify and locate defects robustly. Three submodules are developed to achieve this goal: 1) a lightweight attention block is introduced, which improves the ability to capture both global and local features in the spatial and channel dimensions; 2) we are the first to apply cascade R-CNN to our task, which exploits multiple detectors to sequentially refine the detection result robustly; 3) we introduce a new method named Mix Non-Maximum Suppression (Mix-NMS), which significantly improves the ability to filter redundant detection results in our task. Extensive experiments on a real industrial dataset show that ACRM achieves state-of-the-art results compared to existing methods, demonstrating the effectiveness and robustness of our detection method.

Multi-View Object Detection Using Epipolar Constraints within Cluttered X-Ray Security Imagery

Brian Kostadinov Shalon Isaac-Medina, Chris G. Willcocks, Toby Breckon

Auto-TLDR; Exploiting Epipolar Constraints for Multi-View Object Detection in X-ray Security Images

Automatic detection of threat items is an emerging area with future applications in X-ray security imagery. Although modern X-ray security scanners can provide two or more views, the integration of object detectors across views has not been widely explored with rigour. Therefore, we investigate the application of geometric constraints using the epipolar nature of multi-view imagery to improve object detection performance. Furthermore, we assume that images come from uncalibrated views, and propose a method to estimate the fundamental matrix using ground-truth bounding box centroids from multi-view object detection labels. In addition, each detection is given a score based on its consistency with the distribution of the epipolar estimation error. This score is used as a confidence weight for merging duplicated predictions using non-maximum suppression. Using a standard object detector (YOLOv3), our technique increases the average precision of detection by 2.8% on a dataset composed of firearms, laptops, knives and cameras. These results indicate that integrating images from different views significantly improves the detection performance of threat items in cluttered X-ray security images.
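
A minimal version of the uncalibrated pipeline, estimating F from matched centroids and scoring detections by their epipolar error, can be written with OpenCV; the centroid coordinates below are fabricated purely for illustration.

```python
import numpy as np
import cv2

# Hypothetical bounding-box centroids of the same objects seen in two views.
pts_a = np.float32([[110, 200], [340, 180], [520, 260], [90, 400],
                    [300, 420], [450, 100], [200, 320], [380, 300]])
pts_b = np.float32([[130, 210], [355, 175], [540, 250], [105, 395],
                    [310, 415], [470, 110], [215, 330], [395, 310]])

# Estimate the fundamental matrix from correspondences (8-point algorithm,
# matching the uncalibrated setting described in the abstract).
F, _ = cv2.findFundamentalMat(pts_a, pts_b, cv2.FM_8POINT)

def epipolar_error(x_a, x_b, F):
    """Distance from a view-B point to the epipolar line of its view-A match."""
    l = F @ np.array([x_a[0], x_a[1], 1.0])       # line ax + by + c = 0
    return abs(l @ np.array([x_b[0], x_b[1], 1.0])) / np.hypot(l[0], l[1])

print(epipolar_error(pts_a[0], pts_b[0], F))  # small => consistent detection
```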

Mutual-Supervised Feature Modulation Network for Occluded Pedestrian Detection

Ye He, Chao Zhu, Xu-Cheng Yin

Auto-TLDR; A Mutual-Supervised Feature Modulation Network for Occluded Pedestrian Detection

State-of-the-art pedestrian detectors have achieved significant progress on non-occluded pedestrians, yet they are still struggling under heavy occlusions. The recent occlusion handling strategy of popular two-stage approaches is to build a two-branch architecture with the help of additional visible body annotations. Nonetheless, these methods still have some weaknesses: either the two branches are trained independently with only score-level fusion, which cannot guarantee that the detectors learn sufficiently robust pedestrian features, or attention mechanisms are exploited to emphasize only the visible body features. However, the visible body features of heavily occluded pedestrians are concentrated in a relatively small area, which easily causes missed detections. To address the above issues, we propose in this paper a novel Mutual-Supervised Feature Modulation (MSFM) network to better handle occluded pedestrian detection. The key MSFM module in our network calculates the similarity loss of full body boxes and visible body boxes corresponding to the same pedestrian, so that the full-body detector can learn more complete and robust pedestrian features with the assistance of contextual features from the occluding parts. To facilitate the MSFM module, we also propose a novel two-branch architecture, consisting of a standard full body detection branch and an extra visible body classification branch. These two branches are trained in a mutual-supervised way with full body annotations and visible body annotations, respectively. To verify the effectiveness of our proposed method, extensive experiments are conducted on two challenging pedestrian datasets: Caltech and CityPersons. Our approach achieves superior performance compared to other state-of-the-art methods on both datasets, especially in heavy occlusion cases.

Automated Whiteboard Lecture Video Summarization by Content Region Detection and Representation

Bhargava Urala Kota, Alexander Stone, Kenny Davila, Srirangaraj Setlur, Venu Govindaraju

Responsive image

Auto-TLDR; A Framework for Summarizing Whiteboard Lecture Videos Using Feature Representations of Handwritten Content Regions

Poster Similar

Lecture videos are rapidly becoming an invaluable source of information for students across the globe. Given the large number of online courses currently available, it is important to condense the information within these videos into a compact yet representative summary that can be used for search-based applications. We propose a framework to summarize whiteboard lecture videos by finding feature representations of detected handwritten content regions to determine unique content. We investigate multi-scale histograms of gradients and embeddings from deep metric learning for feature representation. We explicitly handle occluded, growing and disappearing handwritten content. Our method is capable of producing two kinds of lecture video summaries: the unique regions themselves (so-called key content), and keyframes, which contain all unique content in a video segment. We use weighted spatio-temporal conflict minimization to segment the lecture and produce keyframes from the detected regions and features. We evaluate both types of summaries and find that we obtain state-of-the-art performance in terms of the number of summary keyframes, while our unique content recall and precision are comparable to the state of the art.
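
For the first of the two feature choices above, a multi-scale histogram of gradients over a content region might look like the following scikit-image sketch; the scales and HOG parameters are illustrative assumptions, not the paper's settings.

```python
# Hedged sketch: concatenate HOG descriptors of a handwritten region
# computed at several resized scales.
import numpy as np
from skimage.feature import hog
from skimage.transform import resize

def multiscale_hog(region, scales=(32, 64, 128)):
    """region: 2D grayscale crop of a detected content region."""
    feats = []
    for s in scales:
        patch = resize(region, (s, s), anti_aliasing=True)
        feats.append(hog(patch, orientations=9,
                         pixels_per_cell=(8, 8), cells_per_block=(2, 2)))
    return np.concatenate(feats)
```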

HPERL: 3D Human Pose Estimation from RGB and LiDAR

Michael Fürst, Shriya T.P. Gupta, René Schuster, Oliver Wasenmüller, Didier Stricker

Responsive image

Auto-TLDR; 3D Human Pose Estimation from RGB and LiDAR Using a Weakly-Supervised Approach

Slides Poster Similar

In-the-wild human pose estimation has huge potential for various fields, ranging from animation and action recognition to intention recognition and prediction for autonomous driving. The current state of the art focuses only on RGB and RGB-D approaches for predicting the 3D human pose. However, not using precise LiDAR depth information limits performance and leads to very inaccurate absolute pose estimation. With LiDAR sensors becoming more affordable and common on robots and autonomous vehicles, we propose an end-to-end architecture using RGB and LiDAR to predict the absolute 3D human pose with unprecedented precision. Additionally, we introduce a weakly-supervised approach to generate 3D predictions using only 2D pose annotations from PedX. This opens up many new opportunities in the field of 3D human pose estimation.
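
The weak supervision described above boils down to a 2D reprojection loss; a minimal sketch, assuming known camera intrinsics K and camera-frame joint predictions, could look like this.

```python
# Hedged sketch: penalise the pixel residual between projected 3D joints
# and the available 2D pose annotations.
import torch

def reprojection_loss(joints_3d, joints_2d, K):
    """joints_3d: (J, 3) predicted camera-frame joints; joints_2d: (J, 2); K: (3, 3)."""
    proj = (K @ joints_3d.T).T          # homogeneous image coordinates, (J, 3)
    proj = proj[:, :2] / proj[:, 2:3]   # perspective divide -> pixels
    return torch.mean(torch.norm(proj - joints_2d, dim=1))
```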

Video Anomaly Detection by Estimating Likelihood of Representations

Yuqi Ouyang, Victor Sanchez

Responsive image

Auto-TLDR; Video Anomaly Detection in the latent feature space using a deep probabilistic model

Slides Poster Similar

Video anomaly detection is a challenging task, not only because it involves solving many sub-tasks such as motion representation, object localization and action recognition, but also because it is commonly considered an unsupervised learning problem that involves detecting outliers. Traditionally, solutions to this task have focused on the mapping between video frames and their low-dimensional features, while ignoring the spatial connections of those features. Recent solutions analyze these spatial connections by using hard clustering techniques, such as K-Means, or by applying neural networks to map latent features to a general understanding, such as action attributes. In order to solve video anomaly detection in the latent feature space, we propose a deep probabilistic model that transfers the task into a density estimation problem, where latent manifolds are generated by a deep denoising autoencoder and clustered by expectation maximization. Evaluations on several benchmark datasets show the strengths of our model, which achieves outstanding performance on challenging datasets.
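
The density-estimation formulation above can be illustrated with scikit-learn: fit a Gaussian mixture (trained by EM) on autoencoder latents of normal videos and score test latents by likelihood. The encoder itself and all parameters here are placeholders, not the paper's model.

```python
# Hedged sketch: expectation-maximisation clustering of latent features,
# with low likelihood under the mixture flagged as anomalous.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_density(latents_train, n_components=10):
    gmm = GaussianMixture(n_components=n_components, covariance_type="full")
    gmm.fit(latents_train)                     # EM on normal-only latents
    return gmm

def anomaly_scores(gmm, latents_test):
    return -gmm.score_samples(latents_test)    # higher = less likely = more anomalous
```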

Learning Object Deformation and Motion Adaption for Semi-Supervised Video Object Segmentation

Xiaoyang Zheng, Xin Tan, Jianming Guo, Lizhuang Ma

Responsive image

Auto-TLDR; Semi-supervised Video Object Segmentation with Mask-propagation-based Model

Slides Poster Similar

We propose a novel method for semi-supervised video object segmentation, where the mask annotation is given only for the first frame of the video sequence. A mask-propagation-based model is applied to learn past and current information for segmentation. Besides, due to the scarcity of training data, image/mask pairs that model object deformation and shape variance are generated for the training phase. In addition, we generate key flips between adjacent frames for motion adaptation. The method works in an end-to-end way, without any online fine-tuning on test videos. Extensive experiments demonstrate that our method achieves competitive performance against state-of-the-art algorithms on benchmark datasets, covering cases with a single object or multiple objects. We also conduct extensive ablation experiments to analyze the effectiveness of the proposed method.
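
One simple way to generate the deformation-modelling image/mask pairs mentioned above is to warp both with the same random affine transform; the parameter ranges below are illustrative assumptions.

```python
# Hedged sketch: jointly deform an image and its segmentation mask so the
# pair stays consistent while object shape varies.
import numpy as np
import cv2

def deform_pair(image, mask, max_rot=15, max_scale=0.2):
    h, w = mask.shape[:2]
    angle = np.random.uniform(-max_rot, max_rot)
    scale = 1.0 + np.random.uniform(-max_scale, max_scale)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    image_w = cv2.warpAffine(image, M, (w, h), flags=cv2.INTER_LINEAR)
    mask_w = cv2.warpAffine(mask, M, (w, h), flags=cv2.INTER_NEAREST)
    return image_w, mask_w
```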

Video Object Detection Using Object's Motion Context and Spatio-Temporal Feature Aggregation

Jaekyum Kim, Junho Koh, Byeongwon Lee, Seungji Yang, Jun Won Choi

Responsive image

Auto-TLDR; Video Object Detection Using Spatio-Temporal Aggregated Features and Gated Attention Network

Slides Poster Similar

The deep learning technique has recently led to significant improvements in object-detection accuracy. Numerous object detection schemes have been designed to process each frame independently. However, in many applications, object detection is performed on video data, which consists of a sequence of two-dimensional (2D) image frames; the detection accuracy can therefore be improved by exploiting the temporal context of the video sequence. In this paper, we propose a novel video object detection method that exploits both the motion context of the object and spatio-temporally aggregated features to enhance detection performance. First, the motion of the object is captured by the correlation between the spatial feature maps of two adjacent frames. Then, an embedding vector representing the motion context is obtained by feeding the N correlation maps to a long short-term memory (LSTM) network. In addition to generating the motion context vector, the spatial feature maps of the N adjacent frames are aggregated to boost the quality of the feature map. A gated attention network is employed to selectively combine only highly correlated feature maps based on their relevance. While most video object detection methods are built on two-stage detectors, our proposed method is applicable to one-stage detectors, which tend to be preferred in practical applications owing to their reduced computational complexity. Our numerical evaluation on the ImageNet VID dataset shows that our network offers significant performance gains over baseline algorithms and outperforms the existing state-of-the-art one-stage video object detection methods.
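
A rough PyTorch sketch of the motion-context branch described above: correlate channel-normalized feature maps of adjacent frames, pool the N correlation responses, and summarize them with an LSTM. Shapes, pooling and the module name are assumptions, not the paper's design.

```python
# Hedged sketch: adjacent-frame feature correlation -> LSTM motion context.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionContext(nn.Module):
    def __init__(self, channels, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(input_size=channels, hidden_size=hidden,
                            batch_first=True)

    def forward(self, feats):                    # feats: (B, N+1, C, H, W)
        a = F.normalize(feats[:, :-1], dim=2)    # frames t = 0 .. N-1
        b = F.normalize(feats[:, 1:], dim=2)     # frames t = 1 .. N
        corr = (a * b).mean(dim=(3, 4))          # (B, N, C) per-channel correlation
        _, (h_n, _) = self.lstm(corr)
        return h_n[-1]                           # (B, hidden) motion-context vector
```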

Video Face Manipulation Detection through Ensemble of CNNs

Nicolo Bonettini, Edoardo Daniele Cannas, Sara Mandelli, Luca Bondi, Paolo Bestagini, Stefano Tubaro

Responsive image

Auto-TLDR; Face Manipulation Detection in Video Sequences Using Convolutional Neural Networks

Slides Similar

In the last few years, several techniques for facial manipulation in videos have been successfully developed and made available to the masses (i.e., FaceSwap, deepfake, etc.). These methods enable anyone to easily edit faces in video sequences with incredibly realistic results and very little effort. Despite the usefulness of these tools in many fields, if used maliciously they can have a significantly negative impact on society (e.g., fake news spreading, cyberbullying through fake revenge porn). The ability to objectively detect whether a face has been manipulated in a video sequence is therefore a task of utmost importance. In this paper, we tackle the problem of face manipulation detection in video sequences, targeting modern facial manipulation techniques. In particular, we study the ensembling of different trained Convolutional Neural Network (CNN) models. In the proposed solution, different models are obtained starting from a base network (i.e., EfficientNetB4) making use of two different concepts: (i) attention layers; (ii) siamese training. We show that combining these networks leads to promising face manipulation detection results on two publicly available datasets with more than 119,000 videos.
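
Score-level ensembling of the trained models could be as simple as the sketch below; the model list and single-logit output convention are assumptions, and the paper's exact EfficientNetB4 variants are not reproduced.

```python
# Hedged sketch: average per-frame manipulation scores over several CNNs.
import torch

@torch.no_grad()
def ensemble_score(models, frames):
    """frames: (B, 3, H, W) preprocessed face crops -> (B,) fake probabilities."""
    logits = torch.stack([m(frames).squeeze(-1) for m in models])  # (M, B)
    return torch.sigmoid(logits).mean(dim=0)
```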

Building Computationally Efficient and Well-Generalizing Person Re-Identification Models with Metric Learning

Vladislav Sovrasov, Dmitry Sidnev

Responsive image

Auto-TLDR; Cross-Domain Generalization in Person Re-identification using Omni-Scale Network

Slides Similar

This work considers the problem of domain shift in person re-identification. Being trained on one dataset, a re-identification model usually performs much worse on unseen data. This gap is partially caused by the relatively small scale of person re-identification datasets (compared to face recognition ones, for instance), but it is also related to the training objectives. We propose to use a metric learning objective, namely the AM-Softmax loss, together with additional training practices to build well-generalizing yet computationally efficient models. We use the recently proposed Omni-Scale Network (OSNet) architecture, combined with several training tricks and architecture adjustments, to obtain state-of-the-art results on the cross-domain generalization problem on the large-scale MSMT17 dataset in three setups: MSMT17-all->DukeMTMC, MSMT17-train->Market1501 and MSMT17-all->Market1501.
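
The AM-Softmax objective named above has a standard form: cosine logits with an additive margin m subtracted from the target class, scaled by s. A compact PyTorch sketch follows; the hyperparameter values are common defaults, not necessarily the paper's.

```python
# Additive-margin softmax (AM-Softmax) loss sketch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMSoftmax(nn.Module):
    def __init__(self, feat_dim, n_classes, s=30.0, m=0.35):
        super().__init__()
        self.W = nn.Parameter(torch.randn(feat_dim, n_classes))
        self.s, self.m = s, m

    def forward(self, feats, labels):
        # cosine similarity between L2-normalised features and class weights
        cos = F.normalize(feats, dim=1) @ F.normalize(self.W, dim=0)  # (B, C)
        margin = F.one_hot(labels, cos.size(1)) * self.m
        return F.cross_entropy(self.s * (cos - margin), labels)
```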

An Adaptive Video-To-Video Face Identification System Based on Self-Training

Eric Lopez-Lopez, Carlos V. Regueiro, Xosé M. Pardo

Responsive image

Auto-TLDR; Adaptive Video-to-Video Face Recognition Using Dynamic Ensembles of SVMs

Slides Poster Similar

Video-to-video face recognition in unconstrained conditions is still a very challenging problem, as the combination of several factors generally leads to low-quality facial frames. Besides, in some real contexts the availability of labelled samples is limited, data arrives as a stream, or it is only available temporarily due to storage constraints or privacy issues. In these cases, treating learning as an unsupervised incremental process is a feasible option. This work proposes a system based on dynamic ensembles of SVMs, which uses the ideas of self-training to perform adaptive video-to-video face identification. The only labelling requirement of the system is a few frames (5 in our experiments) taken directly from the video-surveillance stream. The system autonomously uses additional video frames to update and improve the initial model in an unsupervised way. Results show a significant improvement in comparison to other state-of-the-art static models.
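
The self-training loop sketched below captures the idea: fit on the few labelled frames, then absorb only the classifier's most confident predictions as pseudo-labels. A single SVM stands in for the paper's dynamic ensemble, and the confidence threshold is an assumption.

```python
# Hedged sketch of unsupervised incremental self-training with an SVM.
import numpy as np
from sklearn.svm import SVC

def self_train(X_seed, y_seed, batches, confidence=0.9):
    """batches: iterable of (M, D) feature arrays from the video stream."""
    X, y = np.asarray(X_seed), np.asarray(y_seed)
    clf = SVC(probability=True).fit(X, y)
    for X_new in batches:
        proba = clf.predict_proba(X_new)
        keep = proba.max(axis=1) >= confidence        # confident frames only
        if keep.any():
            X = np.vstack([X, X_new[keep]])
            y = np.concatenate([y, clf.classes_[proba[keep].argmax(axis=1)]])
            clf = SVC(probability=True).fit(X, y)     # retrain with pseudo-labels
    return clf
```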

SSDL: Self-Supervised Domain Learning for Improved Face Recognition

Samadhi Poornima Kumarasinghe Wickrama Arachchilage, Ebroul Izquierdo

Responsive image

Auto-TLDR; Self-supervised Domain Learning for Face Recognition in unconstrained environments

Slides Poster Similar

Face recognition in unconstrained environments is challenging due to variations in illumination, sensing quality, motion blur, etc. An individual's face appearance can vary drastically under different conditions, creating a gap between training (source) and varying test (target) data. This domain gap can degrade performance in direct knowledge transfer from source to target. Although fine-tuning with domain-specific data could be an effective solution, collecting and annotating data for all domains is extremely expensive. To this end, we propose a self-supervised domain learning (SSDL) scheme that trains on triplets mined from unlabelled data. A key factor in effective discriminative learning is selecting informative triplets. Building on the most confident predictions, we follow an "easy-to-hard" scheme of alternating triplet mining and self-learning. Comprehensive experiments on four different benchmarks show that SSDL generalizes well to different domains.
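
The confidence-gated mining step could be sketched as follows; the embeddings, pseudo-identities and hardest-negative choice are all illustrative assumptions about the "easy-to-hard" scheme, not the paper's exact procedure.

```python
# Hedged sketch: build triplets only from confidently pseudo-labelled faces,
# pairing each anchor with a same-identity positive and its hardest negative.
import numpy as np

def mine_triplets(embeddings, pseudo_ids, confidences, thresh):
    keep = confidences >= thresh                  # confident predictions only
    emb, ids = embeddings[keep], pseudo_ids[keep]
    triplets = []
    for i, pid in enumerate(ids):
        pos = np.where(ids == pid)[0]
        neg = np.where(ids != pid)[0]
        if len(pos) > 1 and len(neg) > 0:
            p = pos[pos != i][0]                              # a same-identity positive
            dists = np.linalg.norm(emb[neg] - emb[i], axis=1)
            triplets.append((i, p, neg[np.argmin(dists)]))    # hardest negative
    return triplets
```

Tightening thresh over successive rounds would give the easy-to-hard schedule of alternating mining and self-learning described above.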