Exploiting Distilled Learning for Deep Siamese Tracking

Chengxin Liu, Zhiguo Cao, Wei Li, Yang Xiao, Shuaiyuan Du, Angfan Zhu

Responsive image

Auto-TLDR; Distilled Learning Framework for Siamese Tracking

Slides Poster

Existing deep siamese trackers are typically built on off-the-shelf CNN models for feature learning, with the demand for huge power consumption and memory storage. This limits current deep siamese trackers to be carried on resource-constrained devices like mobile phones, given factor that such a deployment normally requires cost-effective considerations. In this work, we address this issue by presenting a novel Distilled Learning Framework(DLF) for siamese tracking, which aims at learning tracking model with efficiency and high accuracy. Specifically, we propose two simple yet effective knowledge distillation strategies, denote as point-wise distillation and pair-wise distillation, which are designed for transferring knowledge from a more discriminative teacher tracker into a compact student tracker. In this way, cost-effective and high performance tracking could be achieved. Extensive experiments on several tracking benchmarks demonstrate the effectiveness of our proposed method.

Similar papers

RSINet: Rotation-Scale Invariant Network for Online Visual Tracking

Yang Fang, Geunsik Jo, Chang-Hee Lee

Responsive image

Auto-TLDR; RSINet: Rotation-Scale Invariant Network for Adaptive Tracking

Slides Poster Similar

Most Siamese network-based trackers perform the tracking process without model update, and cannot learn target-specific variation adaptively. Moreover, Siamese-based trackers infer the new state of tracked objects by generating axis-aligned bounding boxes, which contain extra background noise, and are unable to accurately estimate the rotation and scale transformation of moving objects, thus potentially reducing tracking performance. In this paper, we propose a novel Rotation-Scale Invariant Network (RSINet) to address the above problem. Our RSINet tracker consists of a target-distractor discrimination branch and a rotation-scale estimation branch, the rotation and scale knowledge can be explicitly learned by a multi-task learning method in an end-to-end manner. In addtion, the tracking model is adaptively optimized and updated under spatio-temporal energy control, which ensures model stability and reliability, as well as high tracking efficiency. Comprehensive experiments on OTB-100, VOT2018, and LaSOT benchmarks demonstrate that our proposed RSINet tracker yields new state-of-the-art performance compared with recent trackers, while running at real-time speed about 45 FPS.

MFST: Multi-Features Siamese Tracker

Zhenxi Li, Guillaume-Alexandre Bilodeau, Wassim Bouachir

Responsive image

Auto-TLDR; Multi-Features Siamese Tracker for Robust Deep Similarity Tracking

Slides Similar

Siamese trackers have recently achieved interesting results due to their balanced accuracy-speed. This success is mainly due to the fact that deep similarity networks were specifically designed to address the image similarity problem. Therefore, they are inherently more appropriate than classical CNNs for the tracking task. However, Siamese trackers rely on the last convolutional layers for similarity analysis and target search, which restricts their performance. In this paper, we argue that using a single convolutional layer as feature representation is not the optimal choice within the deep similarity framework, as multiple convolutional layers provide several abstraction levels in characterizing an object. Starting from this motivation, we present the Multi-Features Siamese Tracker (MFST), a novel tracking algorithm exploiting several hierarchical feature maps for robust deep similarity tracking. MFST proceeds by fusing hierarchical features to ensure a richer and more efficient representation. Moreover, we handle appearance variation by calibrating deep features extracted from two different CNN models. Based on this advanced feature representation, our algorithm achieves high tracking accuracy, while outperforming several state-of-the-art trackers, including standard Siamese trackers.

VTT: Long-Term Visual Tracking with Transformers

Tianling Bian, Yang Hua, Tao Song, Zhengui Xue, Ruhui Ma, Neil Robertson, Haibing Guan

Responsive image

Auto-TLDR; Visual Tracking Transformer with transformers for long-term visual tracking

Slides Similar

Long-term visual tracking is a challenging problem. State-of-the-art long-term trackers, e.g., GlobalTrack, utilize region proposal networks (RPNs) to generate target proposals. However, the performance of the trackers is affected by occlusions and large scale or ratio variations. To address these issues, in this paper, we are the first to propose a novel architecture with transformers for long-term visual tracking. Specifically, the proposed Visual Tracking Transformer (VTT) utilizes a transformer encoder-decoder architecture for aggregating global information to deal with occlusion and large scale or ratio variation. Furthermore, it also shows better discriminative power against instance-level distractors without the need for extra labeling and hard-sample mining. We conduct extensive experiments on three largest long-term tracking dataset and have achieved state-of-the-art performance.

Tackling Occlusion in Siamese Tracking with Structured Dropouts

Deepak Gupta, Efstratios Gavves, Arnold Smeulders

Responsive image

Auto-TLDR; Structured Dropout for Occlusion in latent space

Slides Poster Similar

Occlusion is one of the most difficult challenges in object tracking to model. This is because unlike other challenges, where data augmentation can be of help, occlusion is hard to simulate as the occluding object can be anything in any shape. In this paper, we propose a simple solution to simulate the effects of occlusion in the latent space. Specifically, we present structured dropout to mimic the change in latent codes under occlusion. We present three forms of dropout (channel dropout, segment dropout and slice dropout) with the various forms of occlusion in mind. To demonstrate its effectiveness, the dropouts are incorporated into two modern Siamese trackers (SiamFC and SiamRPN++). The outputs from multiple dropouts are combined using an encoder network to obtain the final prediction. Experiments on several tracking benchmarks show the benefits of structured dropouts, while due to their simplicity requiring only small changes to the existing tracker models.

SiamMT: Real-Time Arbitrary Multi-Object Tracking

Lorenzo Vaquero, Manuel Mucientes, Victor Brea

Responsive image

Auto-TLDR; SiamMT: A Deep-Learning-based Arbitrary Multi-Object Tracking System for Video

Slides Poster Similar

Visual object tracking is of great interest in many applications, as it preserves the identity of an object throughout a video. However, while real applications demand systems capable of real-time-tracking multiple objects, multi-object tracking solutions usually follow the tracking-by-detection paradigm, thus they depend on running a costly detector in each frame, and they do not allow the tracking of arbitrary objects, i.e., they require training for specific classes. In response to this need, this work presents the architecture of SiamMT, a system capable of efficiently applying individual visual tracking techniques to multiple objects in real-time. This makes it the first deep-learning-based arbitrary multi-object tracker. To achieve this, we propose the global frame features extraction by using a fully-convolutional neural network, followed by the cropping and resizing of the different object search areas. The final similarity operation between these search areas and the target exemplars is carried out with an optimized pairwise cross-correlation. These novelties allow the system to track multiple targets in a scalable manner, achieving 25 fps with 60 simultaneous objects for VGA videos and 40 objects for HD720 videos, all with a tracking quality similar to SiamFC.

Model Decay in Long-Term Tracking

Efstratios Gavves, Ran Tao, Deepak Gupta, Arnold Smeulders

Responsive image

Auto-TLDR; Model Bias in Long-Term Tracking

Slides Poster Similar

To account for appearance variations, tracking models need to be updated during the course of inference. However, updating the tracker model with adverse bounding box predictions adds an unavoidable bias term to the learning. This bias term, which we refer to as model decay, offsets the learning and causes tracking drift. While its adverse affect might not be visible in short-term tracking, accumulation of this bias over a long-term can eventually lead to a permanent loss of the target. In this paper, we look at the problem of model bias from a mathematical perspective. Further, we briefly examine the effect of various sources of tracking error on model decay, using a correlation filter (ECO) and a Siamese (SINT) tracker. Based on observations and insights, we propose simple additions that help to reduce model decay in long-term tracking. The proposed tracker is evaluated on four long-term and one short-term tracking benchmarks, demonstrating superior accuracy and robustness, even on 30 minute long videos.

TSDM: Tracking by SiamRPN++ with a Depth-Refiner and a Mask-Generator

Pengyao Zhao, Quanli Liu, Wei Wang, Qiang Guo

Responsive image

Auto-TLDR; TSDM: A Depth-D Tracker for 3D Object Tracking

Slides Poster Similar

In a generic object tracking, depth (D) information provides informative cues for foreground-background separation and target bounding box regression. However, so far, few trackers have used depth information to play the important role aforementioned due to the lack of a suitable model. In this paper, a RGB-D tracker named TSDM is proposed, which is composed of a Mask-generator (M-g), SiamRPN++ and a Depth-refiner (D-r). The M-g generates the background masks, and updates them as the target 3D position changes. The D-r optimizes the target bounding box estimated by SiamRPN++, based on the spatial depth distribution difference between the target and the surrounding background. Extensive evaluation on the Princeton Tracking Benchmark and the Visual Object Tracking challenge shows that our tracker outperforms the state-of-the-art by a large margin while achieving 23 FPS. In addition, a light-weight variant can run at 31 FPS and thus it is practical for real world applications. Code and models of TSDM are available at https://github.com/lql-team/TSDM.

Siamese Fully Convolutional Tracker with Motion Correction

Mathew Francis, Prithwijit Guha

Responsive image

Auto-TLDR; A Siamese Ensemble for Visual Tracking with Appearance and Motion Components

Slides Poster Similar

Visual tracking algorithms use cues like appearance, structure, motion etc. for locating an object in a video. We propose an ensemble tracker with appearance and motion components. A siamese tracker that learns object appearance from a static image and motion vectors computed between consecutive frames with a flow network forms the ensemble. Motion predicted object localization is used to correct the appearance component in the ensemble. Complementary nature of the components bring performance improvement as observed in experiments performed on VOT2018 and VOT2019 datasets.

DAL: A Deep Depth-Aware Long-Term Tracker

Yanlin Qian, Song Yan, Alan Lukežič, Matej Kristan, Joni-Kristian Kamarainen, Jiri Matas

Responsive image

Auto-TLDR; Deep Depth-Aware Long-Term RGBD Tracking with Deep Discriminative Correlation Filter

Slides Poster Similar

The best RGBD trackers provide high accuracy but are slow to run. On the other hand, the best RGB trackers are fast but clearly inferior on the RGBD datasets. In this work, we propose a deep depth-aware long-term tracker that achieves state-of-the-art RGBD tracking performance and is fast to run. We reformulate deep discriminative correlation filter (DCF) to embed the depth information into deep features. Moreover, the same depth-aware correlation filter is used for target re- detection. Comprehensive evaluations show that the proposed tracker achieves state-of-the-art performance on the Princeton RGBD, STC, and the newly-released CDTB benchmarks and runs 20 fps.

Robust Visual Object Tracking with Two-Stream Residual Convolutional Networks

Ning Zhang, Jingen Liu, Ke Wang, Dan Zeng, Tao Mei

Responsive image

Auto-TLDR; Two-Stream Residual Convolutional Network for Visual Tracking

Slides Poster Similar

The current deep learning based visual tracking approaches have been very successful by learning the target classification and/or estimation model from a large amount of supervised training data in offline mode. However, most of them can still fail in tracking objects due to some more challenging issues such as dense distractor objects, confusing background, motion blurs, and so on. Inspired by the human ``visual tracking'' capability which leverages motion cues to distinguish the target from the background, we propose a Two-Stream Residual Convolutional Network (TS-RCN) for visual tracking, which successfully exploits both appearance and motion features for model update. Our TS-RCN can be integrated with existing deep learning based visual trackers. To further improve the tracking performance, we adopt a ``wider'' residual network ResNeXt as its feature extraction backbone. To the best of our knowledge, TS-RCN is the first end-to-end trainable two-stream visual tracking system, which makes full use of both appearance and motion features of the target. We have extensively evaluated the TS-RCN on most widely used benchmark datasets including VOT2018, VOT2019, and GOT-10K. The experiment results have successfully demonstrated that our two-stream model can greatly outperform the appearance based tracker, and it also achieves state-of-the-art performance. The tracking system can run at up to 38.1 FPS.

Siamese Dynamic Mask Estimation Network for Fast Video Object Segmentation

Dexiang Hong, Guorong Li, Kai Xu, Li Su, Qingming Huang

Responsive image

Auto-TLDR; Siamese Dynamic Mask Estimation for Video Object Segmentation

Slides Poster Similar

Video object segmentation(VOS) has been a fundamental topic in recent years, and many deep learning-based methods have achieved state-of-the-art performance on multiple benchmarks. However, most of these methods rely on pixel-level matching between the template and the searched frames on the whole image while the targets only occupy a small region. Calculating on the entire image brings lots of additional computation cost. Besides, the whole image may contain some distracting information resulting in many false-positive matching points. To address this issue, motivated by one-stage instance object segmentation methods, we propose an efficient siamese dynamic mask estimation network for fast video object segmentation. The VOS is decoupled into two tasks, i.e. mask feature learning and dynamic kernel prediction. The former is responsible for learning high-quality features to preserve structural geometric information, and the latter learns a dynamic kernel which is used to convolve with the mask feature to generate a mask output. We use Siamese neural network as a feature extractor and directly predict masks after correlation. In this way, we can avoid using pixel-level matching, making our framework more simple and efficient. Experiment results on DAVIS 2016 /2017 datasets show that our proposed methods can run at 35 frames per second on NVIDIA RTX TITAN while preserving competitive accuracy.

Adaptive Context-Aware Discriminative Correlation Filters for Robust Visual Object Tracking

Tianyang Xu, Zhenhua Feng, Xiaojun Wu, Josef Kittler

Responsive image

Auto-TLDR; ACA-DCF: Adaptive Context-Aware Discriminative Correlation Filter with complementary attention mechanisms

Slides Poster Similar

In recent years, Discriminative Correlation Filters (DCFs) have gained popularity due to their superior performance in visual object tracking. However, existing DCF trackers usually learn filters using fixed attention mechanisms that focus on the centre of an image and suppresses filter amplitudes in surroundings. In this paper, we propose an Adaptive Context-Aware Discriminative Correlation Filter (ACA-DCF) that is able to improve the existing DCF formulation with complementary attention mechanisms. Our ACA-DCF integrates foreground attention and background attention for complementary context-aware filter learning. More importantly, we ameliorate the design using an adaptive weighting strategy that takes complex appearance variations into account. The experimental results obtained on several well-known benchmarks demonstrate the effectiveness and superiority of the proposed method over the state-of-the-art approaches.

Efficient Correlation Filter Tracking with Adaptive Training Sample Update Scheme

Shan Jiang, Shuxiao Li, Chengfei Zhu, Nan Yan

Responsive image

Auto-TLDR; Adaptive Training Sample Update Scheme of Correlation Filter Based Trackers for Visual Tracking

Slides Poster Similar

Visual tracking serves as a significant module in many applications. However, the heavy computation and low speed of many recent trackers restrict their applications in some computing power restricted scenarios. On the other hand, the simple update scheme of most correlation filter based trackers limits their robustness during target deformation and occlusion. In this paper, we explore the update scheme of correlation filter based trackers and propose an efficient and adaptive training sample update scheme. Training sample extracted in each frame is updated to the training set according to its distance between existing samples measured with difference hashing algorithm(DHA) or discarded according to tracking result reliability. Experiments on OTB-2015, Temple Color 128 and UAV123 demonstrate our tracker performs favourably against state-of-the-art trackers with light computation and runs over 100 fps on desktop computer with Intel i7-8700 CPU(3.2GHz).

Reducing False Positives in Object Tracking with Siamese Network

Takuya Ogawa, Takashi Shibata, Shoji Yachida, Toshinori Hosoi

Responsive image

Auto-TLDR; Robust Long-Term Object Tracking with Adaptive Search based on Motion Models

Slides Poster Similar

We have developed a robust long-term object tracking method that resolves the fundamental cause of the drift and loss of a target in visual object tracking. The proposed method consists of “sampling area extension”, which prevents a tracking result from drifting to other objects by learning false positive samples in advance (before they enter the search region of the target), and “adaptive search based on motion models”, which prevents a tracking result from drifting to other objects and avoids the loss of the target by using not only appearance features but also motion models to adaptively search for the target. Experiments conducted on long-term tracking dataset showed that our first technique improved robustness by 16.6% while the second technique improved robustness by 15.3%. By combining both, our method achieved 21.7% and 9.1% improvement for the robustness and precision, and the processing speed became 3.3 times faster. Additional experiments showed that our method achieved the top robustness among state-of-the-art methods on three long-term tracking datasets. These findings demonstrate that our method is effective for long-term object tracking and that its performance and speed are promising for use in practical applications of various technologies underlying object tracking.

Visual Object Tracking in Drone Images with Deep Reinforcement Learning

Derya Gözen, Sedat Ozer

Responsive image

Auto-TLDR; A Deep Reinforcement Learning based Single Object Tracker for Drone Applications

Slides Poster Similar

There is an increasing demand on utilizing camera equipped drones and their applications in many domains varying from agriculture to entertainment and from sports events to surveillance. In such drone applications, an essential and a common task is tracking an object of interest visually. Drone (or UAV) images have different properties when compared to the ground taken (natural) images and those differences introduce additional complexities to the existing object trackers to be directly applied on drone applications. Some important differences among those complexities include (i) smaller object sizes to be tracked and (ii) different orientations and viewing angles yielding different texture and features to be observed. Therefore, new algorithms trained on drone images are needed for the drone-based applications. In this paper, we introduce a deep reinforcement learning (RL) based single object tracker that tracks an object of interest in drone images by estimating a series of actions to find the location of the object in the next frame. This is the first work introducing a single object tracker using a deep RL-based technique for drone images. Our proposed solution introduces a novel reward function that aims to reduce the total number of actions taken to estimate the object's location in the next frame and also introduces a different backbone network to be used on low resolution images. Additionally, we introduce a set of new actions into the action library to better deal with the above-mentioned complexities. We compare our proposed solutions to a state of the art tracking algorithm from the recent literature and demonstrate up to 3.87\% improvement in precision and 3.6\% improvement in IoU values on the VisDrone2019 dataset. We also provide additional results on OTB-100 dataset and show up to 3.15\% improvement in precision on the OTB-100 dataset when compared to the same previous state of the art algorithm. Lastly, we analyze the ability to handle some of the challenges faced during tracking, including but not limited to occlusion, deformation, and scale variation for our proposed solutions.

Compact and Discriminative Multi-Object Tracking with Siamese CNNs

Claire Labit-Bonis, Jérôme Thomas, Frederic Lerasle

Responsive image

Auto-TLDR; Fast, Light-Weight and All-in-One Single Object Tracking for Multi-Target Management

Slides Poster Similar

Following the tracking-by-detection paradigm, multiple object tracking deals with challenging scenarios, occlusions or even missing detections; the priority is often given to quality measures instead of speed, and a good trade-off between the two is hard to achieve. Based on recent work, we propose a fast, light-weight tracker able to predict targets position and reidentify them at once, when it is usually done with two sequential steps. To do so, we combine a bounding box regressor with a target-oriented appearance learner in a newly designed and unified architecture. This way, our tracker can infer the targets' image pose but also provide us with a confidence level about target identity. Most of the time, it is also common to filter out the detector outputs with a preprocessing step, throwing away precious information about what has been seen in the image. We propose a tracks management strategy able to balance efficiently between detection and tracking outputs and their associated likelihoods. Simply put, we spotlight a full siamese based single object tracker able to predict both position and appearance features at once with a light-weight and all-in-one architecture, within a balanced overall multi-target management strategy. We demonstrate the efficiency and speed of our system w.r.t the literature on the well-known MOT17 challenge benchmark, and bring to the fore qualitative evaluations as well as state-of-the-art quantitative results.

A Boundary-Aware Distillation Network for Compressed Video Semantic Segmentation

Hongchao Lu

Responsive image

Auto-TLDR; A Boundary-Aware Distillation Network for Video Semantic Segmentation

Slides Poster Similar

In recent years optical flow is often estimated to reuse features so as to accelerate video semantic segmentation. With addition of optical flow network, however, extra cost may incur and accuracy may thus be degraded because of repeated warping operation. In this paper, we propose a boundary-aware distillation network (BDNet) that replaces optical flow network with block motion vectors encoded in compressed video, resulting in negligible computational complexity. In order to make salient features, an auxiliary boundary-aware stream is added to the main stream to jointly estimate silhouette and segmentation of objects. To further correct warped features, a well-trained teacher network is employed to transfer knowledge to the main stream. Both boundary-aware stream and the teacher network are neglected during inference stage, so that video segmentation network enables to get faster without increasing any computational burden. By splitting the task into three components, our BDNet shows almost 10% time saving as well as 1.6% accuracy improvement over baseline on the Cityscapes dataset.

Utilising Visual Attention Cues for Vehicle Detection and Tracking

Feiyan Hu, Venkatesh Gurram Munirathnam, Noel E O'Connor, Alan Smeaton, Suzanne Little

Responsive image

Auto-TLDR; Visual Attention for Object Detection and Tracking in Driver-Assistance Systems

Slides Poster Similar

Advanced Driver-Assistance Systems (ADAS) have been attracting attention from many researchers. Vision based sensors are the closest way to emulate human driver visual behavior while driving. In this paper, we explore possible ways to use visual attention (saliency) for object detection and tracking. We investigate: 1) How a visual attention map such as a subjectness attention or saliency map and an objectness attention map can facilitate region proposal generation in a 2-stage object detector; 2) How a visual attention map can be used for tracking multiple objects. We propose a neural network that can simultaneously detect objects as and generate objectness and subjectness maps to save computational power. We further exploit the visual attention map during tracking using a sequential Monte Carlo probability hypothesis density (PHD) filter. The experiments are conducted on KITTI and DETRAC datasets. The use of visual attention and hierarchical features has shown a considerable improvement of≈8% in object detection which effectively increased tracking performance by≈4% on KITTI dataset.

Neural Compression and Filtering for Edge-assisted Real-time Object Detection in Challenged Networks

Yoshitomo Matsubara, Marco Levorato

Responsive image

Auto-TLDR; Deep Neural Networks for Remote Object Detection Using Edge Computing

Slides Poster Similar

The edge computing paradigm places compute-capable devices - edge servers - at the network edge to assist mobile devices in executing data analysis tasks. Intuitively, offloading compute-intense tasks to edge servers can reduce their execution time. However, poor conditions of the wireless channel connecting the mobile devices to the edge servers may degrade the overall capture-to-output delay achieved by edge offloading. Herein, we focus on edge computing supporting remote object detection by means of Deep Neural Networks (DNN), and develop a framework to reduce the amount of data transmitted over the wireless link. The core idea we propose builds on recent approaches splitting DNNs into sections - namely head and tail models - executed by the mobile device and edge server, respectively. The wireless link, then, is used to transport the output of the last layer of the head model to the edge server, instead of the DNN input. Most prior work focuses on classification tasks and leaves the DNN structure unaltered. Herein, we focus on DNNs for three different object detection tasks, which present a much more convoluted structure, and modify the architecture of the network to: (i) achieve in-network compression by introducing a bottleneck layer in the early layers on the head model, and (ii) prefilter pictures that do not contain objects of interest using a convolutional neural network. Results show that the proposed technique represents an effective intermediate option between local and edge computing in a parameter region where these extreme point solutions fail to provide satisfactory performance.

FastSal: A Computationally Efficient Network for Visual Saliency Prediction

Feiyan Hu, Kevin Mcguinness

Responsive image

Auto-TLDR; MobileNetV2: A Convolutional Neural Network for Saliency Prediction

Slides Poster Similar

This paper focuses on the problem of visual saliency prediction, predicting regions of an image that tend to attract human visual attention, under a constrained computational budget. We modify and test various recent efficient convolutional neural network architectures like EfficientNet and MobileNetV2 and compare them with existing state-of-the-art saliency models such as SalGAN and DeepGaze II both in terms of standard accuracy metrics like AUC and NSS, and in terms of the computational complexity and model size. We find that MobileNetV2 makes an excellent backbone for a visual saliency model and can be effective even without a complex decoder. We also show that knowledge transfer from a more computationally expensive model like DeepGaze II can be achieved via pseudo-labelling an unlabelled dataset, and that this approach gives result on-par with many state-of-the-art algorithms with a fraction of the computational cost and model size.

SynDHN: Multi-Object Fish Tracker Trained on Synthetic Underwater Videos

Mygel Andrei Martija, Prospero Naval

Responsive image

Auto-TLDR; Underwater Multi-Object Tracking in the Wild with Deep Hungarian Network

Slides Poster Similar

In this paper, we seek to extend multi-object tracking research on a relatively less explored domain, that of, underwater multi-object tracking in the wild. Multi-object fish tracking is an important task because it can provide fish monitoring systems with richer information (e.g. multiple views of the same fish) as compared to detections and it can be an invaluable input to fish behavior analysis. However, there is a lack of an annotated benchmark dataset with enough samples for this task. To circumvent the need for manual ground truth tracking annotation, we craft a synthetic dataset. Using this synthetic dataset, we train an integrated detector and tracker called SynDHN. SynDHN uses the Deep Hungarian Network (DHN), which is a differentiable approximation of the Hungarian assignment algorithm. We repurpose DHN to become the tracking component of our algorithm by performing the task of affinity estimation between detector predictions. We consider both spatial and appearance features for affinity estimation. Our results show that despite being trained on a synthetic dataset, SynDHN generalizes well to real underwater video tracking and performs better against our baseline algorithms.

Tracking Fast Moving Objects by Segmentation Network

Ales Zita, Filip Sroubek

Responsive image

Auto-TLDR; Fast Moving Objects Tracking by Segmentation Using Deep Learning

Slides Poster Similar

Tracking Fast Moving Objects (FMO), which appear as blurred streaks in video sequences, is a difficult task for standard trackers, as the object position does not overlap in consecutive video frames and texture information of the objects is blurred. Up-to-date approaches tuned for this task are based on background subtraction with a static background and slow deblurring algorithms. In this article, we present a tracking-by-segmentation approach implemented using modern deep learning methods that perform near real-time tracking on real-world video sequences. We have developed a physically plausible FMO sequence generator to be a robust foundation for our training pipeline and demonstrate straightforward network adaptation for different FMO scenarios with varying foreground.

Compact CNN Structure Learning by Knowledge Distillation

Waqar Ahmed, Andrea Zunino, Pietro Morerio, Vittorio Murino

Responsive image

Auto-TLDR; Knowledge Distillation for Compressing Deep Convolutional Neural Networks

Slides Poster Similar

The concept of compressing deep Convolutional Neural Networks (CNNs) is essential to use limited computation, power, and memory resources on embedded devices. However, existing methods achieve this objective at the cost of a drop in inference accuracy in computer vision tasks. To address such a drawback, we propose a framework that leverages knowledge distillation along with customizable block-wise optimization to learn a lightweight CNN structure while preserving better control over the compression-performance tradeoff. Considering specific resource constraints, e.g., floating-point operations per second (FLOPs) or model-parameters, our method results in a state of the art network compression while being capable of achieving better inference accuracy. In a comprehensive evaluation, we demonstrate that our method is effective, robust, and consistent with results over a variety of network architectures and datasets, at negligible training overhead. In particular, for the already compact network MobileNet_v2, our method offers up to 2x and 5.2x better model compression in terms of FLOPs and model-parameters, respectively, while getting 1.05% better model performance than the baseline network.

Efficient Online Subclass Knowledge Distillation for Image Classification

Maria Tzelepi, Nikolaos Passalis, Anastasios Tefas

Responsive image

Auto-TLDR; OSKD: Online Subclass Knowledge Distillation

Slides Poster Similar

Deploying state-of-the-art deep learning models on embedded systems dictates certain storage and computation limitations. During the recent few years Knowledge Distillation (KD) has been recognized as a prominent approach to address this issue. That is, KD has been effectively proposed for training fast and compact deep learning models by transferring knowledge from more complex and powerful models. However, knowledge distillation, in its conventional form, involves multiple stages of training, rendering it a computationally and memory demanding procedure. In this paper, a novel single-stage self knowledge distillation method is proposed, namely Online Subclass Knowledge Distillation (OSKD), that aims at revealing the similarities inside classes, improving the performance of any deep neural model in an online manner. Hence, as opposed to existing online distillation methods, we are able to acquire further knowledge from the model itself, without building multiple identical models or using multiple models to teach each other, rendering the OSKD approach more efficient. The experimental evaluation on two datasets validates that the proposed method improves the classification performance.

Towards Low-Bit Quantization of Deep Neural Networks with Limited Data

Yong Yuan, Chen Chen, Xiyuan Hu, Silong Peng

Responsive image

Auto-TLDR; Low-Precision Quantization of Deep Neural Networks with Limited Data

Slides Poster Similar

Recent machine learning methods use increasingly large deep neural networks to achieve state-of-the-art results in various tasks. Network quantization can effectively reduce computation and memory costs without modifying network structures, facilitating the deployment of deep neural networks (DNNs) on cloud and edge devices. However, most of the existing methods usually need time-consuming training or fine-tuning and access to the original training dataset that may be unavailable due to privacy or security concerns. In this paper, we present a novel method to achieve low-precision quantization of deep neural networks with limited data. Firstly, to reduce the complexity of per-channel quantization and the degeneration of per-layer quantization, we introduce group-wise quantization which separates the output channels into groups that each group is quantized separately. Secondly, to better distill knowledge from the pre-trained FP32 model with limited data, we introduce a two-stage knowledge distillation method that divides the optimization process into independent optimization stage and joint optimization stage to address the limitation of layer-wise supervision and global supervision. Extensive experiments on ImageNet2012 (ResNet18/50, ShuffleNetV2, and MobileNetV2) demonstrate that the proposed approach can significantly improve the quantization model's accuracy when only a few training samples are available. We further show that the method also extends to other computer vision architectures and tasks such as object detection and semantic segmentation.

Automatic Student Network Search for Knowledge Distillation

Zhexi Zhang, Wei Zhu, Junchi Yan, Peng Gao, Guotong Xie

Responsive image

Auto-TLDR; NAS-KD: Knowledge Distillation for BERT

Slides Poster Similar

Pre-trained language models (PLMs), such as BERT, have achieved outstanding performance on multiple natural language processing (NLP) tasks. However, such pre-trained models usually contain a huge number of parameters and are computationally expensive. The high resource demand hinders their application on resource-restricted devices like mobile phones. Knowledge distillation (KD) is an effective compression approach, aiming at encouraging a light-weight student network to imitate the teacher network, and accordingly latent knowledge is transferred from the teacher to student. However, the great majority of student networks in previous KD methods are manually designed, normally a subnetwork of the teacher network. Transformer is generally utilized as the student for compressing BERT but still contains masses of parameters. Motivated by this, we propose a novel approach named NAS-KD, which automatically generates an optimal student network using neural architecture search (NAS) to enhance the distillation for BERT. Experiment on 7 classification tasks in NLP domain demonstrates that NAS-KD can substantially reduce the size of BERT without much performance sacrifice.

Knowledge Distillation Beyond Model Compression

Fahad Sarfraz, Elahe Arani, Bahram Zonooz

Responsive image

Auto-TLDR; Knowledge Distillation from Teacher to Student

Slides Poster Similar

Knowledge distillation (KD) is commonly deemed as an effective model compression technique in which a compact model (student) is trained under the supervision of a larger pretrained model or an ensemble of models (teacher). Various techniques have been proposed since the original formulation, which mimics different aspects of the teacher such as the representation space, decision boundary or intra-data relationship. Some methods replace the one way knowledge distillation from a static teacher with collaborative learning between a cohort of students. Despite the recent advances, a clear understanding of where knowledge resides in a deep neural network and optimal method for capturing knowledge from teacher and transferring it to student still remains an open question. In this study we provide an extensive study on 9 different knowledge distillation methods which covers a broad spectrum of approaches to capture and transfer knowledge. We demonstrate the versatility of the KD framework on different datasets and network architectures under varying capacity gaps between the teacher and student. The study provides intuition for the effects of mimicking different aspects of the teacher and derives insights from the performance of the different distillation approaches to guide the the design of more effective KD methods . Furthermore, our study shows the effectiveness of the KD framework in learning efficiently under varying severity levels of label noise and class imbalance, consistently providing significant generalization gains over standard training. We emphasize that the efficacy of KD goes much beyond a model compression technique and should be considered as a general purpose training paradigm which offers more robustness to common challenges in the real-world datasets compared to the standard training procedure.

SFPN: Semantic Feature Pyramid Network for Object Detection

Yi Gan, Wei Xu, Jianbo Su

Responsive image

Auto-TLDR; SFPN: Semantic Feature Pyramid Network to Address Information Dilution Issue in FPN

Slides Poster Similar

Feature Pyramid Network(FPN) employs a top-down path to enhance low level feature by utilizing high level feature.However, further improvement of detector is greatly hindered by the inner defect of FPN. The dilution issue in FPN is analyzed in this paper, and a new architecture named Semantic Feature Pyramid Network(SFPN) is introduced to address the information imbalance problem caused by information dilution. The proposed method consists of two simple and effective components: Semantic Pyramid Module(SPM) and Semantic Feature Fusion Module(SFFM). To compensate for the weaknesses of FPN, the semantic segmentation result is utilized as an extra information source in our architecture.By constructing a semantic pyramid based on the segmentation result and fusing it with FPN, feature maps at each level can obtain the necessary information without suffering from the dilution issue. The proposed architecture could be applied on many detectors, and non-negligible improvement could be achieved. Although this method is designed for object detection, other tasks such as instance segmentation can also largely benefit from it. The proposed method brings Faster R-CNN and Mask R-CNN with ResNet-50 as backbone both 1.8 AP improvements respectively. Furthermore, SFPN improves Cascade R-CNN with backbone ResNet-101 from 42.4 AP to 43.5 AP.

Dynamic Multi-Path Neural Network

Yingcheng Su, Yichao Wu, Ken Chen, Ding Liang, Xiaolin Hu

Responsive image

Auto-TLDR; Dynamic Multi-path Neural Network

Slides Similar

Although deeper and larger neural networks have achieved better performance, due to overwhelming burden on computation, they cannot meet the demands of deployment on resource-limited devices. An effective strategy to address this problem is to make use of dynamic inference mechanism, which changes the inference path for different samples at runtime. Existing methods only reduce the depth by skipping an entire specific layer, which may lose important information in this layer. In this paper, we propose a novel method called Dynamic Multi-path Neural Network (DMNN), which provides more topology choices in terms of both width and depth on the fly. For better modelling the inference path selection, we further introduce previous state and object category information to guide the training process. Compared to previous dynamic inference techniques, the proposed method is more flexible and easier to incorporate into most modern network architectures. Experimental results on ImageNet and CIFAR-100 demonstrate the superiority of our method on both efficiency and classification accuracy.

Forground-Guided Vehicle Perception Framework

Kun Tian, Tong Zhou, Shiming Xiang, Chunhong Pan

Responsive image

Auto-TLDR; A foreground segmentation branch for vehicle detection

Slides Poster Similar

As the basis of advanced visual tasks such as vehicle tracking and traffic flow analysis, vehicle detection needs to accurately predict the position and category of vehicle objects. In the past decade, deep learning based methods have made great progress. However, we also notice that some existing cases are not studied thoroughly. First, false positive on the background regions is one of the critical problems. Second, most of the previous approaches only optimize a single vehicle detection model, ignoring the relationship between different visual perception tasks. In response to the above two findings, we introduce a foreground segmentation branch for the first time, which can predict the pixel level of vehicles in advance. Furthermore, two attention modules are designed to guide the work of the detection branch. The proposed method can be easily grafted into the one-stage and two-stage detection framework. We evaluate the effectiveness of our model on LSVH, a dataset with large variations in vehicle scales, and achieve the state-of-the-art detection accuracy.

Multi-Order Feature Statistical Model for Fine-Grained Visual Categorization

Qingtao Wang, Ke Zhang, Shaoli Huang, Lianbo Zhang, Jin Fan

Responsive image

Auto-TLDR; Multi-Order Feature Statistical Method for Fine-Grained Visual Categorization

Slides Poster Similar

Fine-grained visual categorization aims to learn a robust image representation modeling subtle differences from similar categories. Existing methods in this field tackle the problem by designing complex frameworks, which produce high-level features by performing first-order or second-order pooling. Despite the impressive performance achieved by these strategies, the single-order networks only carry linear or non-linear information of the last convolutional layer, neglecting the fact that feature from different orders are mutually complementary. In this paper, we propose a Multi-Order Feature Statistical Method (MOFS), which learns fine-grained features characterizing multiple orders. Specifically, the MOFS consists of two sub-modules: (i) a first-order module modeling both mid-level and high-level features. (ii) a covariance feature statistical module capturing high-order features. By deploying these two sub-modules on the top of existing backbone networks, MOFS simultaneously captures multi-level of discrimative patters including local, global and co-related patters. We evaluate the proposed method on three challenging benchmarks, namely CUB-200-2011, Stanford Cars, and FGVC-Aircraft. Compared with state-of-the-art methods, experiments results exhibit superior performance in recognizing fine-grained objects

Nighttime Pedestrian Detection Based on Feature Attention and Transformation

Gang Li, Shanshan Zhang, Jian Yang

Responsive image

Auto-TLDR; FAM and FTM: Enhanced Feature Attention Module and Feature Transformation Module for nighttime pedestrian detection

Slides Poster Similar

Pedestrian detection at nighttime is an important yet challenging task, which is fundamental for many practical applications, e.g. autonomous driving, video surveillance. To address this problem, in this work we start with some analysis, from which we find that the nighttime features have much more noise than that of daytime, resulting in low discrimination ability. Besides, we also observe some pedestrian examples are under adverse illumination conditions, and they can hardly provide sufficient information for accurate detection. Based on these findings, we propose the Feature Attention Module (FAM) and Feature Transformation Module (FTM) to enhance nighttime features. In FAM, guided by progressive segmentation supervision, hierarchical feature attention is produced to enhance multi-level features. On the other hand, FTM is introduced to enforce features from adverse illumination to approach that from better illumination. Based on feature attention and transformation (FAT) mechanism, a two-stage detector called FATNet is constructed for nighttime pedestrian detection. We conduct extensive experiments on nighttime datasets of EuroCity Persons (Night) and NightOwls to demonstrate the effectiveness of our method. On both two datasets, our method achieves significant improvements to the baseline and also outperforms state-of-the-art detectors.

Teacher-Student Training and Triplet Loss for Facial Expression Recognition under Occlusion

Mariana-Iuliana Georgescu, Radu Ionescu

Responsive image

Auto-TLDR; Knowledge Distillation for Facial Expression Recognition under Occlusion

Slides Similar

In this paper, we study the task of facial expression recognition under strong occlusion. We are particularly interested in cases where 50% of the face is occluded, e.g. when the subject wears a Virtual Reality (VR) headset. While previous studies show that pre-training convolutional neural networks (CNNs) on fully-visible (non-occluded) faces improves the accuracy, we propose to employ knowledge distillation to achieve further improvements. First of all, we employ the classic teacher-student training strategy, in which the teacher is a CNN trained on fully-visible faces and the student is a CNN trained on occluded faces. Second of all, we propose a new approach for knowledge distillation based on triplet loss. During training, the goal is to reduce the distance between an anchor embedding, produced by a student CNN that takes occluded faces as input, and a positive embedding (from the same class as the anchor), produced by a teacher CNN trained on fully-visible faces, so that it becomes smaller than the distance between the anchor and a negative embedding (from a different class than the anchor), produced by the student CNN. Third of all, we propose to combine the distilled embeddings obtained through the classic teacher-student strategy and our novel teacher-student strategy based on triplet loss into a single embedding vector. We conduct experiments on two benchmarks, FER+ and AffectNet, with two CNN architectures, VGG-f and VGG-face, showing that knowledge distillation can bring significant improvements over the state-of-the-art methods designed for occluded faces in the VR setting. Furthermore, we obtain accuracy rates that are quite close to the state-of-the-art models that take as input fully-visible faces. For example, on the FER+ data set, our VGG-face based on concatenated distilled embeddings attains an accuracy rate of 82.75% on lower-half-visible faces, which is only 2.24% below the accuracy rate of a state-of-the-art VGG-13 that is evaluated on fully-visible faces. Given that our model sees only the lower-half of the face, we consider this to be a remarkable achievement. In conclusion, we consider that our distilled CNN models can provide useful feedback for the task of recognizing the facial expressions of a person wearing a VR headset.

Object Detection Model Based on Scene-Level Region Proposal Self-Attention

Yu Quan, Zhixin Li, Canlong Zhang, Huifang Ma

Responsive image

Auto-TLDR; Exploiting Semantic Informations for Object Detection

Slides Poster Similar

The improvement of object detection performance is mostly focused on the extraction of local information near the region of interest in the image, which results in detection performance in this area being unable to achieve the desired effect. First, a depth-wise separable convolution network(D_SCNet-127 R-CNN) is built on the backbone network. Considering the importance of scene and semantic informations for visual recognition, the feature map is sent into the branch of the semantic segmentation module, region proposal network module, and the region proposal self-attention module to build the network of scene-level and region proposal self-attention module. Second, a deep reinforcement learning was utilized to achieve accurate positioning of border regression, and the calculation speed of the whole model was improved through implementing a light-weight head network. This model can effectively solve the limitation of feature extraction in traditional object detection and obtain more comprehensive detailed features. The experimental verification on MSCOCO17, VOC12, and Cityscapes datasets shows that the proposed method has good validity and scalability.

AerialMPTNet: Multi-Pedestrian Tracking in Aerial Imagery Using Temporal and Graphical Features

Maximilian Kraus, Seyed Majid Azimi, Emec Ercelik, Reza Bahmanyar, Peter Reinartz, Alois Knoll

Responsive image

Auto-TLDR; AerialMPTNet: A novel approach for multi-pedestrian tracking in geo-referenced aerial imagery by fusing appearance features

Slides Poster Similar

Multi-pedestrian tracking in aerial imagery has several applications such as large-scale event monitoring, disaster management, search-and-rescue missions, and as input into predictive crowd dynamic models. Due to the challenges such as the large number and the tiny size of the pedestrians (e.g., 4 x 4 pixels) with their similar appearances as well as different scales and atmospheric conditions of the images with their extremely low frame rates (e.g., 2 fps), current state-of-the-art algorithms including the deep learning-based ones are unable to perform well. In this paper, we propose AerialMPTNet, a novel approach for multi-pedestrian tracking in geo-referenced aerial imagery by fusing appearance features from a Siamese Neural Network, movement predictions from a Long Short-Term Memory, and pedestrian interconnections from a GraphCNN. In addition, to address the lack of diverse aerial multi-pedestrian tracking datasets, we introduce the Aerial Multi-Pedestrian Tracking (AerialMPT) dataset consisting of 307 frames and 44,740 pedestrians annotated. To the best of our knowledge, AerialMPT is the largest and most diverse dataset to this date and will be released publicly. We evaluate AerialMPTNet on AerialMPT and KIT AIS, and benchmark with several state-of-the-art tracking methods. Results indicate that AerialMPTNet significantly outperforms other methods on accuracy and time-efficiency.

Distilling Spikes: Knowledge Distillation in Spiking Neural Networks

Ravi Kumar Kushawaha, Saurabh Kumar, Biplab Banerjee, Rajbabu Velmurugan

Responsive image

Auto-TLDR; Knowledge Distillation in Spiking Neural Networks for Image Classification

Slides Poster Similar

Spiking Neural Networks (SNN) are energy-efficient computing architectures that exchange spikes for processing information, unlike classical Artificial Neural Networks (ANN). Due to this, SNNs are better suited for real-life deployments. However, similar to ANNs, SNNs also benefit from deeper architectures to obtain improved performance. Furthermore, like the deep ANNs, the memory, compute and power requirements of SNNs also increase with model size, and model compression becomes a necessity. Knowledge distillation is a model com- pression technique that enables transferring the learning of a large machine learning model to a smaller model with minimal loss in performance. In this paper, we propose techniques for knowledge distillation in spiking neural networks for the task of image classification. We present ways to distill spikes from a larger SNN, also called the teacher network, to a smaller one, also called the student network, while minimally impacting the classification accuracy. We demonstrate the effectiveness of the proposed method with detailed experiments on three standard datasets while proposing novel distillation methodologies and loss functions. We also present a multi-stage knowledge distillation technique for SNNs using an intermediate network to obtain higher performance from the student network. Our approach is expected to open up new avenues for deploying high performing large SNN models on resource-constrained hardware platforms.

Channel Planting for Deep Neural Networks Using Knowledge Distillation

Kakeru Mitsuno, Yuichiro Nomura, Takio Kurita

Responsive image

Auto-TLDR; Incremental Training for Deep Neural Networks with Knowledge Distillation

Slides Poster Similar

In recent years, deeper and wider neural networks have shown excellent performance in computer vision tasks, while their enormous amount of parameters results in increased computational cost and overfitting. Several methods have been proposed to compress the size of the networks without reducing network performance. Network pruning can reduce redundant and unnecessary parameters from a network. Knowledge distillation can transfer the knowledge of deeper and wider networks to smaller networks. The performance of the smaller network obtained by these methods is bounded by the predefined network. Neural architecture search has been proposed, which can search automatically the architecture of the networks to break the structure limitation. Also, there is a dynamic configuration method to train networks incrementally as sub-networks. In this paper, we present a novel incremental training algorithm for deep neural networks called planting. Our planting can search the optimal network architecture with smaller number of parameters for improving the network performance by augmenting channels incrementally to layers of the initial networks while keeping the earlier trained parameters fixed. Also, we propose using the knowledge distillation method for training the channels planted. By transferring the knowledge of deeper and wider networks, we can grow the networks effectively and efficiently. We evaluate the effectiveness of the proposed method on different datasets such as CIFAR-10/100 and STL-10. For the STL-10 dataset, we show that we are able to achieve comparable performance with only 7% parameters compared to the larger network and reduce the overfitting caused by a small amount of the data.

Knowledge Distillation with a Precise Teacher and Prediction with Abstention

Xu Yi, Jian Pu, Hui Zhao

Responsive image

Auto-TLDR; Knowledge Distillation using Deep gambler loss and selective classification framework

Slides Poster Similar

Knowledge distillation, which aims to train model under the supervision from another large model (teacher model) to the original model (student model), has achieved remarkable results in supervised learning. However, there are two major problems with existing knowledge distillation methods. One is the teacher's supervision is sometimes misleading, and the other is the student's prediction is not accurate enough. To address the first issue, instead of learning a combination of both teachers and ground truth, we apply knowledge adjustment to correct teachers' supervision using ground truth. For the second problem, we use the selective classification framework to train the student model. In particular, the deep gambler loss is adopted to predict with reservation by explicitly introducing the $(m+1)$-th class. We consider two settings of knowledge distillation: (1) distillation across different network structures ({\it AlexNet, ResNet}), and (2) distillation across networks with different depths ({\it ResNet18, ResNet50}) to evaluate the effectiveness of our method. The experimental results on benchmark datasets (i.e., {\it Fashion-MNIST, SVHN, CIFAR10, CIFAR100}) are reported with higher prediction accuracies and lower coverage errors.

Temporal Feature Enhancement Network with External Memory for Object Detection in Surveillance Video

Masato Fujitake, Akihiro Sugimoto

Responsive image

Auto-TLDR; Temporal Attention Based External Memory Network for Surveillance Object Detection

Poster Similar

Video object detection is challenging and essential in practical applications, such as surveillance cameras for traffic control and public security. Unlike the video in natural scenes, the surveillance video tends to contain dense, and small objects (typically vehicles) in their appearances. Therefore, existing methods for surveillance object detection utilize still-image object detection approaches with rich feature extractors at the expense of their run-time speeds. The run-time speed, however, becomes essential when the video is being streamed. In this paper, we exploit temporal information in videos to enrich the feature maps, proposing the first temporal attention based external memory network for the live stream of video. Extensive experiments on real-world traffic surveillance benchmarks demonstrate the real-time performance of the proposed model while keeping comparable accuracy with state-of-the-art.

Feature Fusion for Online Mutual Knowledge Distillation

Jangho Kim, Minsung Hyun, Inseop Chung, Nojun Kwak

Responsive image

Auto-TLDR; Feature Fusion Learning Using Fusion of Sub-Networks

Slides Poster Similar

We propose a learning framework named Feature Fusion Learning (FFL) that efficiently trains a powerful classifier through a fusion module which combines the feature maps generated from parallel neural networks and generates meaningful feature maps. Specifically, we train a number of parallel neural networks as sub-networks, then we combine the feature maps from each sub-network using a fusion module to create a more meaningful feature map. The fused feature map is passed into the fused classifier for overall classification. Unlike existing feature fusion methods, in our framework, an ensemble of sub-network classifiers transfers its knowledge to the fused classifier and then the fused classifier delivers its knowledge back to each sub-network, mutually teaching one another in an online-knowledge distillation manner. This mutually teaching system not only improves the performance of the fused classifier but also obtains performance gain in each sub-network. Moreover, our model is more beneficial than other alternative methods because different types of network can be used for each sub-network. We have performed a variety of experiments on multiple datasets such as CIFAR-10, CIFAR-100 and ImageNet and proved that our method is more effective than other alternative methods in terms of performances of both sub-networks and the fused classifier, and the aspect of generating meaningful feature maps.

Second-Order Attention Guided Convolutional Activations for Visual Recognition

Shannan Chen, Qian Wang, Qiule Sun, Bin Liu, Jianxin Zhang, Qiang Zhang

Responsive image

Auto-TLDR; Second-order Attention Guided Network for Convolutional Neural Networks for Visual Recognition

Slides Poster Similar

Recently, modeling deep convolutional activations by the global second-order pooling has shown great advance on visual recognition tasks. However, most of the existing deep second-order statistical models mainly compute second-order statistics of activations of the last convolutional layer as image representations, and they seldom introduce second-order statistics into earlier layers to better fit network topology, thus limiting the representational ability to a certain extent. Motivated by the flexibility of attention blocks that are commonly plugged into intermediate layers of deep convolutional networks (ConvNets), this work makes an attempt to combine deep second-order statistics with attention mechanisms in ConvNets, and further proposes a novel Second-order Attention Guided Network (SoAG-Net) for visual recognition. More specifically, SoAG-Net involves several SoAG modules seemingly inserted into intermediate layers of the network, in which SoAG collects second-order statistics of convolutional activations by polynomial kernel approximation to predict channel-wise attention maps utilized for guiding the learning of convolutional activations through tensor scaling along channel dimension. SoAG improves the nonlinearity of ConvNets and enables ConvNets to fit more complicated distribution of convolutional activations. Experiment results on three commonly used datasets illuminate that SoAG-Net outperforms its counterparts and achieves competitive performance with state-of-the-art models under the same backbone.

Rethinking of Deep Models Parameters with Respect to Data Distribution

Shitala Prasad, Dongyun Lin, Yiqun Li, Sheng Dong, Zaw Min Oo

Responsive image

Auto-TLDR; A progressive stepwise training strategy for deep neural networks

Slides Poster Similar

The performance of deep learning models are driven by various parameters but to tune all of them every time, for every dataset, is a heuristic practice. In this paper, unlike the common practice of decaying the learning rate, we propose a step-wise training strategy where the learning rate and the batch size are tuned based on the dataset size. Here, the given dataset size is progressively increased during the training to boost the network performance without saturating the learning curve, after certain epochs. We conducted extensive experiments on multiple networks and datasets to validate the proposed training strategy. The experimental results proves our hypothesis that the learning rate, the batch size and the data size are interrelated and can improve the network accuracy if an optimal progressive stepwise training strategy is applied. The proposed strategy also the overall training computational cost is reduced.

Fast and Accurate Real-Time Semantic Segmentation with Dilated Asymmetric Convolutions

Leonel Rosas-Arias, Gibran Benitez-Garcia, Jose Portillo-Portillo, Gabriel Sanchez-Perez, Keiji Yanai

Responsive image

Auto-TLDR; FASSD-Net: Dilated Asymmetric Pyramidal Fusion for Real-Time Semantic Segmentation

Slides Poster Similar

Recent works have shown promising results applied to real-time semantic segmentation tasks. To maintain fast inference speed, most of the existing networks make use of light decoders, or they simply do not use them at all. This strategy helps to maintain a fast inference speed; however, their accuracy performance is significantly lower in comparison to non-real-time semantic segmentation networks. In this paper, we introduce two key modules aimed to design a high-performance decoder for real-time semantic segmentation for reducing the accuracy gap between real-time and non-real-time segmentation networks. Our first module, Dilated Asymmetric Pyramidal Fusion (DAPF), is designed to substantially increase the receptive field on the top of the last stage of the encoder, obtaining richer contextual features. Our second module, Multi-resolution Dilated Asymmetric (MDA) module, fuses and refines detail and contextual information from multi-scale feature maps coming from early and deeper stages of the network. Both modules exploit contextual information without excessively increasing the computational complexity by using asymmetric convolutions. Our proposed network entitled “FASSD-Net” reaches 78.8% of mIoU accuracy on the Cityscapes validation dataset at 41.1 FPS on full resolution images (1024x2048). Besides, with a light version of our network, we reach 74.1% of mIoU at 133.1 FPS (full resolution) on a single NVIDIA GTX 1080Ti card with no additional acceleration techniques. The source code and pre-trained models are available at https://github.com/GibranBenitez/FASSD-Net.

Video Object Detection Using Object's Motion Context and Spatio-Temporal Feature Aggregation

Jaekyum Kim, Junho Koh, Byeongwon Lee, Seungji Yang, Jun Won Choi

Responsive image

Auto-TLDR; Video Object Detection Using Spatio-Temporal Aggregated Features and Gated Attention Network

Slides Poster Similar

The deep learning technique has recently led to significant improvement in object-detection accuracy. Numerous object detection schemes have been designed to process each frame independently. However, in many applications, object detection is performed using video data, which consists of a sequence of two-dimensional (2D) image frames. Thus, the object detection accuracy can be improved by exploiting the temporal context of the video sequence. In this paper, we propose a novel video object detection method that exploits both the motion context of the object and spatio-temporal aggregated features in the video sequence to enhance the object detection performance. First, the motion of the object is captured by the correlation between the spatial feature maps of two adjacent frames. Then, the embedding vector, representing the motion context, is obtained by feeding the N correlation maps to long short term memory (LSTM). In addition to generating the motion context vector, the spatial feature maps for N adjacent frames are aggregated to boost the quality of the feature map. The gated attention network is employed to selectively combine only highly correlated feature maps based on their relevance. While most video object detectors are applied to two-stage detectors, our proposed method is applicable to one-stage detectors, which tend to be preferred for practical applications owing to reduced computational complexity. Our numerical evaluation conducted on the ImageNet VID dataset shows that our network offers significant performance gain over baseline algorithms, and it outperforms the existing state-of-the-art one-stage video object detection methods.

Detective: An Attentive Recurrent Model for Sparse Object Detection

Amine Kechaou, Manuel Martinez, Monica Haurilet, Rainer Stiefelhagen

Responsive image

Auto-TLDR; Detective: An attentive object detector that identifies objects in images in a sequential manner

Slides Poster Similar

In this work, we present Detective – an attentive object detector that identifies objects in images in a sequential manner. Our network is based on an encoder-decoder architecture, where the encoder is a convolutional neural network, and the decoder is a convolutional recurrent neural network coupled with an attention mechanism. At each iteration, our decoder focuses on the relevant parts of the image using an attention mechanism, and then estimates the object’s class and the bounding box coordinates. Current object detection models generate dense predictions and rely on post-processing to remove duplicate predictions. Detective is a sparse object detector that generates a single bounding box per object instance. However, training a sparse object detector is challenging, as it requires the model to reason at the instance level and not just at the class and spatial levels. We propose a training mechanism based on the Hungarian Algorithm and a loss that balances the localization and classification tasks. This allows Detective to achieve promising results on the PASCAL VOC object detection dataset. Our experiments demonstrate that sparse object detection is possible and has a great potential for future developments in applications where the order of the objects to be predicted is of interest.

Learning a Dynamic High-Resolution Network for Multi-Scale Pedestrian Detection

Mengyuan Ding, Shanshan Zhang, Jian Yang

Responsive image

Auto-TLDR; Learningable Dynamic HRNet for Pedestrian Detection

Slides Poster Similar

Pedestrian detection is a canonical instance of object detection in computer vision. In practice, scale variation is one of the key challenges, resulting in unbalanced performance across different scales. Recently, the High-Resolution Network (HRNet) has become popular because high-resolution feature representations are more friendly to small objects. However, when we apply HRNet for pedestrian detection, we observe that it improves for small pedestrians on one hand, but hurts the performance for larger ones on the other hand. To overcome this problem, we propose a learnable Dynamic HRNet (DHRNet) aiming to generate different network paths adaptive to different scales. Specifically, we construct a parallel multi-branch architecture and add a soft conditional gate module allowing for dynamic feature fusion. Both branches share all the same parameters except the soft gate module. Experimental results on CityPersons and Caltech benchmarks indicate that our proposed dynamic HRNet is more capable of dealing with pedestrians of various scales, and thus improves the performance across different scales consistently.

Dual-Attention Guided Dropblock Module for Weakly Supervised Object Localization

Junhui Yin, Siqing Zhang, Dongliang Chang, Zhanyu Ma, Jun Guo

Responsive image

Auto-TLDR; Dual-Attention Guided Dropblock for Weakly Supervised Object Localization

Slides Poster Similar

Attention mechanisms is frequently used to learn the discriminative features for better feature representations. In this paper, we extend the attention mechanism to the task of weakly supervised object localization (WSOL) and propose the dual-attention guided dropblock module (DGDM), which aims at learning the informative and complementary visual patterns for WSOL. This module contains two key components, the channel attention guided dropout (CAGD) and the spatial attention guided dropblock (SAGD). To model channel interdependencies, the CAGD ranks the channel attentions and treats the top-k attentions with the largest magnitudes as the important ones. It also keeps some low-valued elements to increase their value if they become important during training. The SAGD can efficiently remove the most discriminative information by erasing the contiguous regions of feature maps rather than individual pixels. This guides the model to capture the less discriminative parts for classification. Furthermore, it can also distinguish the foreground objects from the background regions to alleviate the attention misdirection. Experimental results demonstrate that the proposed method achieves new state-of-the-art localization performance.

VPU Specific CNNs through Neural Architecture Search

Ciarán Donegan, Hamza Yous, Saksham Sinha, Jonathan Byrne

Responsive image

Auto-TLDR; Efficient Convolutional Neural Networks for Edge Devices using Neural Architecture Search

Slides Poster Similar

The success of deep learning at computer vision tasks has led to an ever-increasing number of applications on edge devices. Often with the use of edge AI hardware accelerators like the Intel Movidius Vision Processing Unit (VPU). Performing computer vision tasks on edge devices is challenging. Many Convolutional Neural Networks (CNNs) are too complex to run on edge devices with limited computing power. This has created large interest in designing efficient CNNs and one promising way of doing this is through Neural Architecture Search (NAS). NAS aims to automate the design of neural networks. NAS can also optimize multiple different objectives together, like accuracy and efficiency, which is difficult for humans. In this paper, we use a differentiable NAS method to find efficient CNNs for VPU that achieves state-of-the-art classification accuracy on ImageNet. Our NAS designed model outperforms MobileNetV2, having almost 1\% higher top-1 accuracy while being 13\% faster on MyriadX VPU. To the best of our knowledge, this is the first time a VPU specific CNN has been designed using a NAS algorithm. Our results also reiterate the fact that efficient networks must be designed for each specific hardware. We show that efficient networks targeted at different devices do not perform as well on the VPU.