Accurate Background Subtraction Using Dynamic Object Presence Probability in Sports Scenes

Ryosuke Watanabe, Jun Chen, Tomoaki Konno, Sei Naito

Auto-TLDR; DOPP: Dynamic Object Presence Probabilistic Background Subtraction for Foreground Segmentation

Foreground segmentation technologies play an important role in applications such as free-viewpoint video (FVV) and sports video analysis. For these applications, we propose a new method that achieves accurate foreground silhouette extraction using dynamic object presence probability (DOPP). Our main contributions are as follows. 1) The object presence probability for each pixel is calculated from object recognition results based on deep learning. Background subtraction is then performed by changing the threshold and the update rate of the background model in response to the object presence probability. In this way, the parameters of background subtraction are tuned using the object recognition results to improve the silhouette extraction quality. 2) To calculate more accurate silhouette images, the parameters of background subtraction are adjusted by monitoring optical flows between consecutive frames. The object presence probability of the current frame is dynamically updated using the object presence probability of the previous frame together with the optical flows. In the experiments, we confirmed that the proposed method achieved more accurate silhouette extraction than conventional methods in three sports sequences.
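The first contribution can be illustrated with a minimal sketch. It is not the authors' implementation: the linear mapping from probability to parameters, the direction of the mapping (a higher presence probability lowers the threshold and slows the background update), the running-average background model, and every name and default value below are assumptions made for illustration.

```python
def subtract_background(frame, background, presence,
                        t_low=10.0, t_high=40.0,
                        a_low=0.0, a_high=0.1):
    """Per-pixel background subtraction with probability-dependent parameters.

    frame, background: 2D lists of grayscale values.
    presence: 2D list of object presence probabilities in [0, 1].
    Returns (mask, new_background).
    All parameter names and defaults are hypothetical.
    """
    mask, new_background = [], []
    for f_row, b_row, p_row in zip(frame, background, presence):
        m_row, n_row = [], []
        for f, b, p in zip(f_row, b_row, p_row):
            # Assumed mapping: likely-object pixels get a lower threshold ...
            threshold = t_high - p * (t_high - t_low)
            # ... and a slower update, so objects are not absorbed into the model.
            alpha = a_high - p * (a_high - a_low)
            m_row.append(1 if abs(f - b) > threshold else 0)
            # Running-average update of the background model.
            n_row.append((1.0 - alpha) * b + alpha * f)
        mask.append(m_row)
        new_background.append(n_row)
    return mask, new_background
```

With the same intensity difference at two pixels, the pixel with high presence probability is marked as foreground and leaves the background model untouched, while the pixel with low probability is treated as background and blended into the model.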

Similar papers

Coarse-To-Fine Foreground Segmentation Based on Co-Occurrence Pixel-Block and Spatio-Temporal Attention Model

Xinyu Liu, Dong Liang

Auto-TLDR; Foreground Segmentation from coarse to Fine Using Co-occurrence Pixel-Block Model for Dynamic Scene

Foreground segmentation in dynamic scenes is an important task in video surveillance. Unsupervised background subtraction methods based on background statistics modeling have difficulty with updating. On the other hand, supervised foreground segmentation methods based on deep learning rely on large-scale, accurately annotated training data, which limits their cross-scene performance. In this paper, we propose a coarse-to-fine foreground segmentation method. First, a Spatio-Temporal Attention Model (STAM) trained across scenes is used to achieve coarse segmentation, which does not require training on the specific scene. Then the coarse segmentation is used as a reference to help the Co-occurrence Pixel-Block Model (CPB) complete the fine segmentation and, at the same time, to help CPB update its background model. This method is more flexible than deep-learning-based methods that depend on scene-specific training, and it realizes accurate online dynamic updating of the background model. Experimental results on WallFlower and LIMU show that our method outperforms STAM, CPB, and the other compared methods.

GraphBGS: Background Subtraction Via Recovery of Graph Signals

Jhony Heriberto Giraldo Zuluaga, Thierry Bouwmans

Auto-TLDR; Graph BackGround Subtraction using Graph Signals

Background subtraction is a fundamental pre-processing task in computer vision. This task becomes challenging in real scenarios due to variations in the background for both static and moving camera sequences. Several deep learning methods for background subtraction have been proposed in the literature with competitive performance. However, these models show performance degradation when tested on unseen videos, and they require a huge amount of data to avoid overfitting. Recently, graph-based algorithms have been successful in approaching unsupervised and semi-supervised learning problems. Furthermore, the theories of graph signal processing and semi-supervised learning have been combined, leading to new insights in the field of machine learning. In this paper, concepts from the recovery of graph signals are introduced into the problem of background subtraction. We propose a new algorithm called Graph BackGround Subtraction (GraphBGS), which is composed of instance segmentation, background initialization, graph construction, graph sampling, and a semi-supervised algorithm inspired by the theory of recovery of graph signals. Our algorithm has the advantage of requiring less data than deep learning methods while achieving competitive results on both static and moving camera videos. GraphBGS outperforms unsupervised and supervised methods in several challenging conditions on the publicly available Change Detection (CDNet2014) and UCSD background subtraction databases.

Unsupervised Moving Object Detection through Background Models for PTZ Camera

Kimin Yun, Hyung-Il Kim, Kangmin Bae, Jongyoul Park

Auto-TLDR; Unsupervised Moving Object Detection in a PTZ Camera through Two Background Models

Moving object detection in a video plays an important role in many vision applications. Recently, moving object detection using appearance modeling based on a convolutional neural network has been actively developed. However, CNN-based methods usually require the user's supervision of the first frame, so they become highly dependent on the training dataset. In contrast, methods that find the foreground by modeling the background, which occupies a large proportion of an image, can detect a moving object efficiently in an unsupervised manner. However, existing methods based on background modeling for a pan-tilt-zoom (PTZ) camera suffer from many false positives or loss of moving objects due to errors in the estimated camera motion. To overcome these limitations, we propose a moving object detection method for a PTZ camera that uses two background models. In an unsupervised way, our method builds two background models that have different roles: 1) a coarse background model for detecting large changes, and 2) a fine background model for detecting small changes. In more detail, the coarse background model builds a block-based Gaussian model, and the fine model builds a sample consensus model. Both models are adaptively updated according to the estimated camera motion in the video recorded by a PTZ camera. Then, the foreground results from the two background models are combined to fill the moving object region. In experiments, the proposed method achieves better performance than state-of-the-art methods and operates in real time without parallel processing. In addition, we show the effectiveness of the proposed model by improving moving object detection results when it is combined with the latest supervised method.
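The sample consensus idea behind the fine model can be sketched in a few lines. This shows the general technique only, under assumed parameters: the function name, the grayscale input, the matching radius, and the minimum match count are hypothetical and are not taken from the paper.

```python
def is_background(pixel, samples, radius=20, min_matches=2):
    """Sample consensus test for one grayscale pixel.

    samples: previously observed background values stored for this pixel.
    The pixel is classified as background when at least `min_matches`
    stored samples lie within `radius` of the current value.
    (radius and min_matches are illustrative defaults.)
    """
    matches = sum(1 for s in samples if abs(pixel - s) <= radius)
    return matches >= min_matches
```

A pixel that fails the test is a foreground candidate. In a full system the stored samples would also be updated over time and compensated for camera motion, which this sketch leaves out.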

Motion U-Net: Multi-Cue Encoder-Decoder Network for Motion Segmentation

Gani Rahmon, Filiz Bunyak, Kannappan Palaniappan

Auto-TLDR; Motion U-Net: A Deep Learning Framework for Robust Moving Object Detection under Challenging Conditions

Detection of moving objects is a critical first step in many computer vision applications. Several algorithms for motion and change detection have been proposed. However, many of these approaches lack the ability to handle challenging real-world scenarios. Recently, deep learning approaches have started to produce impressive solutions to computer vision tasks, particularly for detection and segmentation. Many existing deep learning networks proposed for moving object detection rely only on spatial appearance cues. In this paper, we propose a novel multi-cue and multi-stream network, Motion U-Net (MU-Net), which integrates motion, change, and appearance cues using a deep learning framework for robust moving object detection under challenging conditions. The proposed network consists of a two-stream encoder module followed by feature concatenation and a decoder module. Motion and change cues are computed through our tensor-based motion estimation and multi-modal background subtraction modules. The proposed system was tested and evaluated on the change detection challenge dataset (CDnet-2014) and compared to state-of-the-art methods. On the CDnet-2014 dataset, our approach reaches an average overall F-measure of 0.9852 and outperforms all current state-of-the-art methods. The network was also tested on the unseen SBI-2015 dataset and produced promising results.

Tracking Fast Moving Objects by Segmentation Network

Ales Zita, Filip Sroubek

Auto-TLDR; Fast Moving Objects Tracking by Segmentation Using Deep Learning

Tracking Fast Moving Objects (FMO), which appear as blurred streaks in video sequences, is a difficult task for standard trackers, as the object position does not overlap in consecutive video frames and texture information of the objects is blurred. Up-to-date approaches tuned for this task are based on background subtraction with a static background and slow deblurring algorithms. In this article, we present a tracking-by-segmentation approach implemented using modern deep learning methods that perform near real-time tracking on real-world video sequences. We have developed a physically plausible FMO sequence generator to be a robust foundation for our training pipeline and demonstrate straightforward network adaptation for different FMO scenarios with varying foreground.

Multi-Camera Sports Players 3D Localization with Identification Reasoning

Yukun Yang, Ruiheng Zhang, Wanneng Wu, Yu Peng, Xu Min

Auto-TLDR; Probabilistic and Identified Occupancy Map for Sports Players 3D Localization

Multi-camera 3D localization of sports players is a challenging task due to heavy occlusions in crowded sports scenes. Traditional methods can only provide player locations without identification information. Existing localization methods may produce ambiguous detections and unsatisfactory precision and recall, especially when heavy occlusions occur. To solve this problem, we propose a generic localization method that provides distinguishable results: the probabilities of locations being occupied by players with unique ID labels. We design the algorithms with a multi-dimensional Bayesian model to create a Probabilistic and Identified Occupancy Map (PIOM). Using this model, we jointly apply deep learning-based object segmentation and identification to obtain the players' probable positions and their likely identification labels. This approach not only provides the players' 3D locations but also gives ID information that distinguishes each player from the others. Experimental results demonstrate that our method outperforms previous localization approaches with reliable and distinguishable outcomes.

Early Wildfire Smoke Detection in Videos

Taanya Gupta, Hengyue Liu, Bir Bhanu

Auto-TLDR; Semi-supervised Spatio-Temporal Video Object Segmentation for Automatic Detection of Smoke in Videos during Forest Fire

Recent advances in unmanned aerial vehicles and camera technology have proven useful for the detection of smoke that emerges above the trees during a forest fire. Automatic detection of smoke in videos is of great interest to fire departments. To date, in most parts of the world, fire is not detected in its early stage and generally turns catastrophic. This paper introduces a novel technique that integrates spatial and temporal features in a deep learning framework using semi-supervised spatio-temporal video object segmentation and dense optical flow. However, detecting smoke in the presence of haze and without labeled data is difficult. Considering the visibility of haze in the sky, a dark channel pre-processing method is used that reduces the amount of haze in video frames and consequently improves the detection results. Online training is performed on a video at test time, which reduces the need for ground-truth data. Tests using publicly available video datasets show that the proposed algorithms outperform previous work and are robust across different wildfire-threatened locations.
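As background on the dark channel step, the sketch below computes a dark channel map in its standard form: for each pixel, the minimum intensity over the colour channels and over a small surrounding patch. Haze-free regions tend to have low values and hazy regions high ones. The abstract does not describe the authors' exact pre-processing, so the function name, the list-of-tuples image format, and the patch radius are assumptions.

```python
def dark_channel(image, radius=1):
    """Compute the dark channel of an RGB image.

    image: 2D list of (r, g, b) tuples.
    radius: half-width of the square patch (hypothetical default).
    Returns a 2D list with the patch-wise minimum of the per-pixel
    channel minimum.
    """
    height, width = len(image), len(image[0])
    # Minimum over the colour channels at every pixel.
    channel_min = [[min(px) for px in row] for row in image]
    dark = []
    for y in range(height):
        row = []
        for x in range(width):
            ys = range(max(0, y - radius), min(height, y + radius + 1))
            xs = range(max(0, x - radius), min(width, x + radius + 1))
            # Minimum over the local patch, clipped at the image border.
            row.append(min(channel_min[j][i] for j in ys for i in xs))
        dark.append(row)
    return dark
```

A dehazing pipeline would go on to estimate atmospheric light and transmission from this map; those steps are omitted here.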

Object Segmentation Tracking from Generic Video Cues

Amirhossein Kardoost, Sabine Müller, Joachim Weickert, Margret Keuper

Auto-TLDR; A Light-Weight Variational Framework for Video Object Segmentation in Videos

We propose a light-weight variational framework for online tracking of object segmentations in videos based on optical flow and image boundaries. While high-end computer vision methods on this task rely on sequence specific training of dedicated CNN architectures, we show the potential of a variational model, based on generic video information from motion and color. Such cues are usually required for tasks such as robot navigation or grasp estimation. We leverage them directly for video object segmentation and thus provide accurate segmentations at potentially very low extra cost. Our simple method can provide competitive results compared to the costly CNN-based methods with parameter tuning. Furthermore, we show that our approach can be combined with state-of-the-art CNN-based segmentations in order to improve over their respective results. We evaluate our method on the datasets DAVIS 16,17 and SegTrack v2.

Edge-Aware Monocular Dense Depth Estimation with Morphology

Zhi Li, Xiaoyang Zhu, Haitao Yu, Qi Zhang, Yongshi Jiang

Auto-TLDR; Spatio-Temporally Smooth Dense Depth Maps Using Only a CPU

Dense depth maps play an important role in computer vision (CV) and augmented reality (AR). For CV applications, a dense depth map is the cornerstone of 3D reconstruction, allowing real objects to be precisely displayed in the computer. In AR, dense depth maps make it possible to handle occlusion relationships between virtual content and real objects correctly, for a better user experience. However, the complexity of the computation limits the development of dense depth map estimation. We present a novel algorithm that produces low-latency, spatio-temporally smooth dense depth maps using only a CPU. The depth maps exhibit sharp discontinuities at depth edges and are obtained with low computational complexity. Our algorithm first obtains the sparse SLAM reconstruction, then extracts coarse depth edges from a down-sampled RGB image by morphology operations. Next, we thin the depth edges and align them with image edges. Finally, a warm-start initialization scheme and an improved optimization solver are adopted to accelerate convergence. We evaluate our proposal quantitatively, and the results show improvements in depth map accuracy with respect to other state-of-the-art and baseline techniques.

Weight Estimation from an RGB-D Camera in Top-View Configuration

Marco Mameli, Marina Paolanti, Nicola Conci, Filippo Tessaro, Emanuele Frontoni, Primo Zingaretti

Auto-TLDR; Top-View Weight Estimation using Deep Neural Networks

The development of so-called soft-biometrics aims at providing information related to the physical and behavioural characteristics of a person. This paper focuses on bodyweight estimation based on the observation from a top-view RGB-D camera. In fact, the capability to estimate the weight of a person can be of help in many different applications, from health-related scenarios to business intelligence and retail analytics. To deal with this issue, a TVWE (Top-View Weight Estimation) framework is proposed with the aim of predicting the weight. The approach relies on the adoption of Deep Neural Networks (DNNs) that have been trained on depth data. Each network has also been modified in its top section to replace classification with prediction inference. The performance of five state-of-art DNNs has been compared, namely VGG16, ResNet, Inception, DenseNet and Efficient-Net. In addition, a convolutional auto-encoder has also been included for completeness. Considering the limited literature in this domain, the TVWE framework has been evaluated on a new publicly available dataset: “VRAI Weight estimation Dataset”, which also collects, for each subject, labels related to weight, gender, and height. The experimental results have demonstrated that the proposed methods are suitable for this task, bringing different and significant insights for the application of the solution in different domains.

Video Semantic Segmentation Using Deep Multi-View Representation Learning

Akrem Sellami, Salvatore Tabbone

Auto-TLDR; Deep Multi-view Representation Learning for Video Object Segmentation

In this paper, we propose a deep learning model based on deep multi-view representation learning, to address the video object segmentation task. The proposed model emphasizes the importance of the inherent correlation between video frames and incorporates a multi-view representation learning based on deep canonically correlated autoencoders. The multi-view representation learning in our model provides an efficient mechanism for capturing inherent correlations by jointly extracting useful features and learning better representation into a joint feature space, i.e., shared representation. To increase the training data and the learning capacity, we train the proposed model with pairs of video frames, i.e., $F_{a}$ and $F_{b}$. During the segmentation phase, the deep canonically correlated autoencoders model encodes useful features by processing multiple reference frames together, which is used to detect the frequently reappearing. Our model enhances the state-of-the-art deep learning-based methods that mainly focus on learning discriminative foreground representations over appearance and motion. Experimental results over two large benchmarks demonstrate the ability of the proposed method to outperform competitive approaches and to reach good performances, in terms of semantic segmentation.

Exploring Severe Occlusion: Multi-Person 3D Pose Estimation with Gated Convolution

Renshu Gu, Gaoang Wang, Jenq-Neng Hwang

Auto-TLDR; 3D Human Pose Estimation for Multi-Human Videos with Occlusion

3D human pose estimation (HPE) is crucial in human behavior analysis, augmented reality/virtual reality (AR/VR) applications, and the self-driving industry. Videos that contain multiple, potentially occluded people captured from freely moving monocular cameras are very common in real-world scenarios, yet 3D HPE for such scenarios is quite challenging, partially because existing datasets lack such data with accurate 3D ground truth labels. In this paper, we propose a temporal regression network with a gated convolution module to transform 2D joints to 3D and, at the same time, recover the missing occluded joints. A simple yet effective localization approach is then applied to transform the normalized pose to the global trajectory. To verify the effectiveness of our approach, we also collect a new moving camera multi-human (MMHuman) dataset that includes multiple people with heavy occlusion captured by moving cameras. The 3D ground truth joints are provided by an accurate motion capture (MoCap) system. From experiments on the static-camera Human3.6M data and our own moving-camera data, we show that our proposed method outperforms most state-of-the-art 2D-to-3D pose estimation methods, especially in scenarios with heavy occlusions.

Reducing False Positives in Object Tracking with Siamese Network

Takuya Ogawa, Takashi Shibata, Shoji Yachida, Toshinori Hosoi

Auto-TLDR; Robust Long-Term Object Tracking with Adaptive Search based on Motion Models

We have developed a robust long-term object tracking method that resolves the fundamental cause of the drift and loss of a target in visual object tracking. The proposed method consists of two techniques. The first, “sampling area extension”, prevents a tracking result from drifting to other objects by learning false positive samples in advance (before they enter the search region of the target). The second, “adaptive search based on motion models”, prevents a tracking result from drifting to other objects and avoids the loss of the target by using not only appearance features but also motion models to adaptively search for the target. Experiments conducted on a long-term tracking dataset showed that our first technique improved robustness by 16.6% while the second technique improved robustness by 15.3%. By combining both, our method achieved 21.7% and 9.1% improvements in robustness and precision, respectively, and the processing speed became 3.3 times faster. Additional experiments showed that our method achieved the top robustness among state-of-the-art methods on three long-term tracking datasets. These findings demonstrate that our method is effective for long-term object tracking and that its performance and speed are promising for use in practical applications of various technologies underlying object tracking.

Extracting and Interpreting Unknown Factors with Classifier for Foot Strike Types in Running

Chanjin Seo, Masato Sabanai, Yuta Goto, Koji Tagami, Hiroyuki Ogata, Kazuyuki Kanosue, Jun Ohya

Auto-TLDR; Deep Learning for Foot Strike Classification using Accelerometer Data

This paper proposes a method that classifies foot strike types using a deep learning model and extracts unknown factors using the contribution degree of input values (CDIV), which makes it possible to evaluate running motions without being influenced by the biases of sports experts. Accelerometers are attached to the runner’s body, and while the runner runs, a fixed camera observes the runner and acquires a video sequence synchronized with the accelerometers. To train a deep learning model for classifying foot strikes, we objectively annotate the foot strike acceleration data as RFS (rearfoot strike) or non-RFS by watching the video. To interpret the unknown factors extracted from the learned model, we calculate two CDIVs: the contributions of the resampling time and of the accelerometer value to the output (foot strike type). Experiments on classifying unknown runners’ foot strikes were conducted. Consistent with findings in sports science, it is confirmed that the CDIVs contribute highly at the time of the right foot strike, and that the sensor values corresponding to the right and left tibias contribute highly to classifying the foot strikes. Experimental results show that the right tibia is important for classifying foot strikes. This is because much of the training data shows a difference between the two foot strike types in the right tibia. In conclusion, our proposed method could extract unknown factors from the classifier and could interpret them; the factors contain knowledge similar to the prior knowledge of experts, as well as new findings that are not included in conventional knowledge.

IPT: A Dataset for Identity Preserved Tracking in Closed Domains

Thomas Heitzinger, Martin Kampel

Auto-TLDR; Identity Preserved Tracking Using Depth Data for Privacy and Privacy

We present a public dataset for Identity Preserved Tracking (IPT) consisting of sequences of depth data recorded using an Orbbec Astra depth sensor. The dataset features sequences in ten different locations with a high amount of background variation and is designed to be applicable to a wide range of tasks. Its labeling is versatile, allowing for tracking in either 3D space or image coordinates. In addition to frame-by-frame 3D and inferred bounding box labeling, we provide supplementary annotation of camera poses and room layouts, split into multiple semantically distinct categories. Intended use cases are applications where both high-level scene understanding and privacy are central points of consideration, such as active and assisted living (AAL), security, and industrial safety. Compared to similar public datasets, IPT distinguishes itself with its sequential data format, 3D instance labeling, and room layout annotation. We present baseline object detection results in image coordinates using a YOLOv3 network architecture and implement a background model suitable for online tracking applications to increase detection accuracy. Additionally, we propose a novel volumetric non-maximum suppression (V-NMS) approach that takes advantage of known room geometry. Lastly, we provide baseline person tracking results using the Multiple Object Tracking Challenge (MOTChallenge) evaluation metrics of the CVPR19 benchmark.

Image Sequence Based Cyclist Action Recognition Using Multi-Stream 3D Convolution

Stefan Zernetsch, Steven Schreck, Viktor Kress, Konrad Doll, Bernhard Sick

Auto-TLDR; 3D-ConvNet: A Multi-stream 3D Convolutional Neural Network for Detecting Cyclists in Real World Traffic Situations

In this article, we present an approach to detect basic movements of cyclists in real world traffic situations based on image sequences, optical flow (OF) sequences, and past positions using a multi-stream 3D convolutional neural network (3D-ConvNet) architecture. To resolve occlusions of cyclists by other traffic participants or road structures, we use a wide angle stereo camera system mounted at a heavily frequented public intersection. We created a large dataset consisting of 1,639 video sequences containing cyclists, recorded in real world traffic, resulting in over 1.1 million samples. Through modeling the cyclists' behavior by a state machine of basic cyclist movements, our approach takes every situation into account and is not limited to certain scenarios. We compare our method to an approach solely based on position sequences. Both methods are evaluated taking into account frame wise and scene wise classification results of basic movements, and detection times of basic movement transitions, where our approach outperforms the position based approach by producing more reliable detections with shorter detection times. Our code and parts of our dataset are made publicly available.

PA-FlowNet: Pose-Auxiliary Optical Flow Network for Spacecraft Relative Pose Estimation

Zhi Yu Chen, Po-Heng Chen, Kuan-Wen Chen, Chen-Yu Chan

Auto-TLDR; PA-FlowNet: An End-to-End Pose-auxiliary Optical Flow Network for Space Travel and Landing

During space travel and landing, spacecraft attitude estimation is indispensable for navigation. Since there are not enough satellites for GPS-like localization in space, computer vision techniques are adopted to address the issue. The most crucial task for localization is the extraction of correspondences. In computer vision, optical flow estimation is often used for finding correspondences between images. As deep neural networks have become more popular in recent years, FlowNet2 has played a vital role and achieved great success. In this paper, we present PA-FlowNet, an end-to-end pose-auxiliary optical flow network that can use the predicted relative camera pose to improve the performance of optical flow. PA-FlowNet is composed of two sub-networks, the foreground-attention flow network and the pose regression network. The foreground-attention flow network is based on the FlowNet2 model and modified with the proposed foreground-attention approach. We introduce this approach with the concept of curriculum learning for foreground-background segmentation to prevent backgrounds from causing flow prediction errors. The pose regression network is used to regress the relative camera pose as an auxiliary for increasing the accuracy of the flow estimation. In addition, to simulate the test environment for spacecraft pose estimation, we construct a 64K moon model and simulate aerial photography with various attitudes to generate the Moon64K dataset. PA-FlowNet significantly outperforms all existing methods on the proposed Moon64K dataset. Furthermore, we also predict the relative pose via the proposed PA-FlowNet and achieve remarkable performance.

Siamese Dynamic Mask Estimation Network for Fast Video Object Segmentation

Dexiang Hong, Guorong Li, Kai Xu, Li Su, Qingming Huang

Auto-TLDR; Siamese Dynamic Mask Estimation for Video Object Segmentation

Video object segmentation (VOS) has been a fundamental topic in recent years, and many deep learning-based methods have achieved state-of-the-art performance on multiple benchmarks. However, most of these methods rely on pixel-level matching between the template and the searched frames over the whole image, while the targets occupy only a small region. Computing over the entire image incurs considerable additional computation cost. Besides, the whole image may contain distracting information that results in many false-positive matching points. To address this issue, motivated by one-stage instance segmentation methods, we propose an efficient Siamese dynamic mask estimation network for fast video object segmentation. VOS is decoupled into two tasks, i.e., mask feature learning and dynamic kernel prediction. The former is responsible for learning high-quality features that preserve structural geometric information, and the latter learns a dynamic kernel that is convolved with the mask feature to generate a mask output. We use a Siamese neural network as a feature extractor and directly predict masks after correlation. In this way, we avoid pixel-level matching, making our framework simpler and more efficient. Experimental results on the DAVIS 2016/2017 datasets show that our proposed method can run at 35 frames per second on an NVIDIA RTX TITAN while preserving competitive accuracy.
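The step in which a predicted kernel is convolved with the mask feature can be sketched as follows. This is a simplified illustration rather than the paper's architecture: it assumes a 1x1 kernel with one weight per feature channel and a sigmoid output, and the function and parameter names are hypothetical.

```python
import math

def apply_dynamic_kernel(mask_features, kernel, bias=0.0):
    """Convolve a per-target 1x1 kernel with a mask feature map.

    mask_features: list of C channels, each an H x W list of floats.
    kernel: list of C weights, assumed to be predicted for the target.
    Returns an H x W list of mask probabilities in (0, 1).
    """
    height, width = len(mask_features[0]), len(mask_features[0][0])
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            # A 1x1 convolution is a weighted sum over channels at each pixel.
            logit = bias + sum(w * ch[y][x]
                               for w, ch in zip(kernel, mask_features))
            row.append(1.0 / (1.0 + math.exp(-logit)))
        mask.append(row)
    return mask
```

Because the kernel is produced per target while the mask features are shared, changing the kernel alone yields a different mask from the same feature map.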

What and How? Jointly Forecasting Human Action and Pose

Yanjun Zhu, Yanxia Zhang, Qiong Liu, Andreas Girgensohn

Auto-TLDR; Forecasting Human Actions and Motion Trajectories with Joint Action Classification and Pose Regression

Forecasting human actions and motion trajectories addresses the problem of predicting what a person is going to do next and how they will perform it. This is crucial in a wide range of applications such as assisted living and future co-robotic settings. We propose to simultaneously learn actions and action-related human motion dynamics, while existing works perform them independently. In this paper, we present a method to jointly forecast categories of human action and the pose of skeletal joints in the hope that the two tasks can help each other. As a result, our system can predict not only the future actions but also the motion trajectories that will result. To achieve this, we define a task of joint action classification and pose regression. We employ a sequence to sequence encoder-decoder model combined with multi-task learning to forecast future actions and poses progressively before the action happens. Experimental results on two public datasets, IkeaDB and OAD, demonstrate the effectiveness of the proposed method.

AerialMPTNet: Multi-Pedestrian Tracking in Aerial Imagery Using Temporal and Graphical Features

Maximilian Kraus, Seyed Majid Azimi, Emec Ercelik, Reza Bahmanyar, Peter Reinartz, Alois Knoll

Auto-TLDR; AerialMPTNet: A novel approach for multi-pedestrian tracking in geo-referenced aerial imagery by fusing appearance features

Multi-pedestrian tracking in aerial imagery has several applications, such as large-scale event monitoring, disaster management, search-and-rescue missions, and input to predictive crowd dynamics models. Current state-of-the-art algorithms, including deep learning-based ones, are unable to perform well because of challenges such as the large number and tiny size of the pedestrians (e.g., 4 x 4 pixels), their similar appearances, the different scales and atmospheric conditions of the images, and the extremely low frame rates (e.g., 2 fps). In this paper, we propose AerialMPTNet, a novel approach for multi-pedestrian tracking in geo-referenced aerial imagery that fuses appearance features from a Siamese Neural Network, movement predictions from a Long Short-Term Memory, and pedestrian interconnections from a GraphCNN. In addition, to address the lack of diverse aerial multi-pedestrian tracking datasets, we introduce the Aerial Multi-Pedestrian Tracking (AerialMPT) dataset, consisting of 307 frames and 44,740 annotated pedestrians. To the best of our knowledge, AerialMPT is the largest and most diverse dataset to date and will be released publicly. We evaluate AerialMPTNet on AerialMPT and KIT AIS, and benchmark it against several state-of-the-art tracking methods. Results indicate that AerialMPTNet significantly outperforms other methods in accuracy and time efficiency.

Ground-truthing Large Human Behavior Monitoring Datasets

Tehreem Qasim, Robert Fisher, Naeem Bhatti

Auto-TLDR; Semi-automated Groundtruthing for Large Video Datasets

We present a groundtruthing approach which is applicable to large video datasets collected for studying people's behavior and recorded at a low frames-per-second (fps) rate. Groundtruthing a large dataset manually is a time-consuming task and is prone to errors. The proposed approach is semi-automated (using a combination of a deep network and traditional image analysis) to minimize the human labeler's interaction with the video frames. The framework employs Mask R-CNN as a people counter followed by human-assisted semi-automated tests to correct the wrong labels. Subsequently, a bounding box extraction algorithm is used which is fully automated for frames with a single person and semi-automated for frames with two or more people. We also propose a methodology for anomaly detection, i.e., collapse on a table or the floor. Behavior recognition is performed by using a fine-tuned AlexNet convolutional neural network. The people detection and behavior analysis components of the framework are primarily designed to help reduce human labor in ground-truthing so that minimal human involvement is required. They are not meant to be employed as fully automated state-of-the-art systems. The proposed approach is validated on a new dataset presented in this paper, containing human activity in an indoor office environment and recorded at 1 fps, as well as an indoor video sequence recorded at 15 fps. Experimental results show a significant reduction in the human labor involved in the process of ground-truthing, i.e., the number of potential clicks was reduced by 99.2% for the office dataset and by 99.7% for the additional test video.

Online Object Recognition Using CNN-Based Algorithm on High-Speed Camera Imaging

Shigeaki Namiki, Keiko Yokoyama, Shoji Yachida, Takashi Shibata, Hiroyoshi Miyano, Masatoshi Ishikawa

Auto-TLDR; Real-Time Object Recognition with High-Speed Camera Imaging with Population Data Clearing and Data Ensemble

High-speed camera imaging (e.g., 1,000 fps) is effective to detect and recognize objects moving at high speeds because temporally dense images obtained by a high-speed camera can usually capture the best moment for object detection and recognition. However, the latest recognition algorithms, with their high complexity, are difficult to utilize in real-time applications involving high-speed cameras because a vast amount of images need to be processed with no latency. To tackle this problem, we propose a novel framework for real-time object recognition with high-speed camera imaging. The proposed framework has the key processes of population data cleansing and data ensemble. Population data cleansing improves the recognition accuracy by quantifying the recognizability and by excluding part of the images prior to the recognition process, while data ensemble improves the robustness of object recognition by merging the class probabilities with multiple images of the same object. Experimental results with a real dataset show that our framework is more effective than existing methods.
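
As a rough illustration of the two processes named in this abstract, the sketch below drops frames whose recognizability score is low and then averages the class probabilities of the remaining frames. The function name `ensemble_probabilities`, the `min_score` threshold, and the use of a plain average are assumptions made for this example; the abstract does not specify how recognizability is quantified or how probabilities are merged.

```python
def ensemble_probabilities(frame_probs, recognizability, min_score=0.5):
    """Average per-frame class probabilities over frames that pass a filter.

    frame_probs: list of per-frame class-probability lists (same length each).
    recognizability: one hypothetical quality score per frame.
    min_score: assumed cut-off below which a frame is excluded.
    """
    # "Data cleansing" step (assumed form): keep only recognizable frames.
    kept = [p for p, r in zip(frame_probs, recognizability) if r >= min_score]
    if not kept:
        raise ValueError("no frame passed the recognizability filter")
    n_classes = len(kept[0])
    # "Data ensemble" step (assumed form): mean probability per class.
    return [sum(p[c] for p in kept) / len(kept) for c in range(n_classes)]


# Three frames of the same object; the second one is blurry and gets excluded.
frame_probs = [[0.6, 0.4], [0.2, 0.8], [0.7, 0.3]]
recognizability = [0.9, 0.1, 0.8]
merged = ensemble_probabilities(frame_probs, recognizability)
predicted_class = max(range(len(merged)), key=merged.__getitem__)
```

Averaging is only one possible merging rule; a product of probabilities or a weighted mean would fit the same description.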

An Adaptive Fusion Model Based on Kalman Filtering and LSTM for Fast Tracking of Road Signs

Chengliang Wang, Xin Xie, Chao Liao

Auto-TLDR; Fusion of ThunderNet and Region Growing Detector for Road Sign Detection and Tracking

The detection and tracking of road signs plays a critical role in various autopilot applications. Utilizing convolutional neural networks (CNNs) mostly incurs a large run-time overhead in feature extraction and object localization. Although the Kalman filter (KF) is a commonly used tracker, it is likely to be impacted by objects omitted in the detection step. In this paper, we design a highly efficient detector that combines ThunderNet and a Region Growing Detector (RGD) to detect road signs, and build a fusion model of a long short-term memory network (LSTM) and KF in the state estimation and the color histogram. The experimental results demonstrate that the proposed method improves the state estimation accuracy by 6.4% and raises the frame rate to 41 frames per second (FPS).
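
To make the point about omitted detections concrete, here is a minimal sketch of a plain one-dimensional constant-velocity Kalman filter that falls back to prediction only when a detection is missing. It is not the paper's LSTM and KF fusion model; the class name `ConstantVelocityKalman` and the noise values are assumptions chosen for illustration.

```python
import numpy as np


class ConstantVelocityKalman:
    """1D constant-velocity Kalman filter (illustrative, not the paper's model)."""

    def __init__(self, x0, process_var=1e-2, meas_var=1.0):
        self.x = np.array([x0, 0.0])                 # state: [position, velocity]
        self.P = np.eye(2) * 10.0                    # initial state covariance
        self.F = np.array([[1.0, 1.0], [0.0, 1.0]])  # one-frame transition
        self.H = np.array([[1.0, 0.0]])              # only position is observed
        self.Q = np.eye(2) * process_var             # assumed process noise
        self.R = np.array([[meas_var]])              # assumed measurement noise

    def step(self, z=None):
        # Predict the next state and grow the uncertainty.
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # Correct only when the detector returned a measurement.
        if z is not None:
            y = np.array([z]) - self.H @ self.x
            S = self.H @ self.P @ self.H.T + self.R
            K = self.P @ self.H.T @ np.linalg.inv(S)
            self.x = self.x + K @ y
            self.P = (np.eye(2) - K @ self.H) @ self.P
        return float(self.x[0])


# None marks a frame in which the detector missed the sign.
measurements = [0.0, 1.1, None, 2.9, 4.0]
kf = ConstantVelocityKalman(x0=0.0)
estimates = [kf.step(z) for z in measurements]
```

On a missed frame the covariance `P` keeps growing, which is the weakness the abstract attributes to a KF used alone.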

Learning to Segment Dynamic Objects Using SLAM Outliers

Dupont Romain, Mohamed Tamaazousti, Hervé Le Borgne

Auto-TLDR; Automatic Segmentation of Dynamic Objects Using SLAM Outliers Using Consensus Inversion

We present a method to automatically learn to segment dynamic objects using SLAM outliers. It requires only one monocular sequence per dynamic object for training and consists in localizing dynamic objects using SLAM outliers, creating their masks, and using these masks to train a semantic segmentation network. We integrate the trained network in ORB-SLAM 2 and LDSO. At runtime we remove features on dynamic objects, making the SLAM unaffected by them. We also propose a new stereo dataset and new metrics to evaluate SLAM robustness. Our dataset includes consensus inversions, i.e., situations where the SLAM uses more features on dynamic objects than on the static background. Consensus inversions are challenging for SLAM as they may cause major SLAM failures. Our approach performs better than the state of the art on the TUM RGB-D dataset in monocular mode and on our dataset in both monocular and stereo modes.

Learning Defects in Old Movies from Manually Assisted Restoration

Arthur Renaudeau, Travis Seng, Axel Carlier, Jean-Denis Durou, Fabien Pierre, Francois Lauze, Jean-François Aujol

Auto-TLDR; U-Net: Detecting Defects in Old Movies by Inpainting Techniques

We propose to detect defects in old movies, as the first step of a larger framework for old movie restoration by inpainting techniques. The specificity of our work is to learn a film restorer's expertise from a pair of sequences, composed of a movie with defects and the same movie semi-automatically restored with the help of specialized software. In order to detect those defects with minimal human interaction and further reduce the time spent on a restoration, we feed a U-Net with consecutive defective frames as input to detect the unexpected variations of pixel intensity over space and time. Since the output of the network is a mask of defect locations, we first have to create the dataset of mask frames on the basis of restored frames from the software used by the film restorer, instead of classical synthetic ground truth, which is not available. These masks are estimated by computing the absolute difference between restored frames and defective frames, combined with thresholding and morphological closing. Our network succeeds in automatically detecting real defects with more precision than the manual selection with an all-encompassing shape, including some the expert restorer could have missed for lack of time.
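
The mask estimation step described above (absolute difference, thresholding, morphological closing) can be sketched with NumPy alone. The threshold of 30 gray levels and the 3 × 3 structuring element are assumptions for this example; the abstract gives neither value.

```python
import numpy as np


def _dilate(mask):
    """Binary dilation with a 3x3 square; outside the image counts as False."""
    h, w = mask.shape
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    for dy in range(3):
        for dx in range(3):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def _erode(mask):
    """Binary erosion with a 3x3 square; outside the image counts as True."""
    h, w = mask.shape
    padded = np.pad(mask, 1, mode="constant", constant_values=True)
    out = np.ones_like(mask)
    for dy in range(3):
        for dx in range(3):
            out &= padded[dy:dy + h, dx:dx + w]
    return out


def defect_mask(defective, restored, threshold=30):
    """Threshold the absolute frame difference, then close small gaps."""
    diff = np.abs(defective.astype(np.int32) - restored.astype(np.int32))
    mask = diff > threshold
    return _erode(_dilate(mask))  # closing = dilation followed by erosion


# Toy frames: a scratch with a one-pixel gap at (3, 3).
restored = np.full((7, 7), 100, dtype=np.uint8)
defective = restored.copy()
defective[3, 2] = 200
defective[3, 4] = 200
mask = defect_mask(defective, restored)
```

In the toy frames the closing fills the gap between the two defective pixels, so the mask covers the scratch as one connected region.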

Real Time Fencing Move Classification and Detection at Touch Time During a Fencing Match

Cem Ekin Sunal, Chris G. Willcocks, Boguslaw Obara

Auto-TLDR; Fencing Body Move Classification and Detection Using Deep Learning

Fencing is a fast-paced sport played with one of three swords: Epee, Foil, or Saber. However, such a fast pace can cause referees to make wrong decisions. Review of slow-motion camera footage in tournaments helps referees' decision making, but it interrupts the match and may not be available for every organization. Motivated by the need for better decision making, analysis, and availability, we introduce the first fully-automated deep learning classification and detection system for fencing body moves at the moment a touch is made. This is an important step towards creating a fencing analysis system, with player profiling and decision tools that will benefit the fencing community. The proposed architecture combines You Only Look Once version three (YOLOv3) with a ResNet-34 classifier, trained on ImageNet settings to obtain 83.0% test accuracy on the fencing moves. These results are an exciting development in the sport, providing immediate feedback and analysis along with accessibility, hence making it a valuable tool for trainers and fencing match referees.

Gender Classification Using Video Sequences of Body Sway Recorded by Overhead Camera

Takuya Kamitani, Yuta Yamaguchi, Shintaro Nakatani, Masashi Nishiyama, Yoshio Iwai

Auto-TLDR; Spatio-Temporal Feature for Gender Classification of a Standing Person Using Body Stance Using Time-Series Signals

We investigate whether it is possible to classify the gender of a standing person based on a video sequence containing body sway recorded by an overhead camera. Existing methods that extract a feature from the movement of a walking person for gender classification cannot detect the slight movements of a standing person. In this paper, we propose a method for extracting a feature from the body sway of a standing person. We design a spatio-temporal feature for representing body sway using the frequency analysis of time-series signals derived from the local movements of the upper body. To evaluate the accuracy of our method, we acquired video sequences of body sway from 30 females and 30 males using an overhead camera. We found that our method obtained 90.3% accuracy for the gender classification of a standing person. We compared the accuracy of our method with that of parameters based on medical data. We found that the proposed spatio-temporal feature extracted from body sway significantly improved gender classification accuracy.

Object Detection Model Based on Scene-Level Region Proposal Self-Attention

Yu Quan, Zhixin Li, Canlong Zhang, Huifang Ma

Auto-TLDR; Exploiting Semantic Information for Object Detection

The improvement of object detection performance is mostly focused on the extraction of local information near the region of interest in the image, which results in detection performance in this area being unable to achieve the desired effect. First, a depth-wise separable convolution network (D_SCNet-127 R-CNN) is built on the backbone network. Considering the importance of scene and semantic information for visual recognition, the feature map is sent into the branches of the semantic segmentation module, the region proposal network module, and the region proposal self-attention module to build the network of scene-level and region proposal self-attention modules. Second, deep reinforcement learning is utilized to achieve accurate positioning of border regression, and the calculation speed of the whole model is improved by implementing a light-weight head network. This model can effectively overcome the limitation of feature extraction in traditional object detection and obtain more comprehensive detailed features. The experimental verification on the MSCOCO17, VOC12, and Cityscapes datasets shows that the proposed method has good validity and scalability.

Detecting Anomalies from Video-Sequences: A Novel Descriptor

Giulia Orrù, Davide Ghiani, Maura Pintor, Gian Luca Marcialis, Fabio Roli

Auto-TLDR; Trit-based Measurement of Group Dynamics for Crowd Behavior Analysis and Anomaly Detection

We present a novel descriptor for crowd behavior analysis and anomaly detection. The goal is to measure, by appropriate patterns, the speed of formation and disintegration of groups in the crowd. This descriptor is inspired by the concept of one-dimensional local binary patterns: in our case, such patterns depend on the number of groups observed in a time window. An appropriate measurement unit, named "trit" (trinary digit), represents three possible dynamic states of groups on a certain frame. Our hypothesis is that abrupt variations in the number of groups may be due to an anomalous event, which can accordingly be detected by translating these variations into temporal trit-based sequences of strings that differ significantly from the sequence describing the "no-anomaly" case. Due to the peculiarity of the rationale behind this work, which relies on the number of groups, three different methods of people group extraction are compared. Experiments are carried out on the Motion-Emotion benchmark data set. Reported results point out in which cases the trit-based measurement of group dynamics allows us to detect the anomaly. Besides the promising performance of our approach, we show how it is correlated with the anomaly typology and the camera's perspective to the crowd's flow (frontal, lateral).
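
One way to picture a trit-based sequence is sketched below. The encoding is an assumption made for illustration only: each frame gets 0, 1, or 2 depending on whether the number of groups decreased, stayed the same, or increased relative to the previous frame, and strings are read off a sliding window. The abstract does not define the three states or the window length, so `trit_sequence`, `trit_strings`, and the window size of 3 are hypothetical.

```python
def trit_sequence(group_counts):
    """Map consecutive group counts to trits (assumed encoding).

    0: fewer groups than in the previous frame
    1: same number of groups
    2: more groups than in the previous frame
    """
    trits = []
    for prev, cur in zip(group_counts, group_counts[1:]):
        trits.append(0 if cur < prev else 1 if cur == prev else 2)
    return trits


def trit_strings(trits, size):
    """Slide a window of `size` trits over the sequence and emit strings."""
    return ["".join(str(t) for t in trits[i:i + size])
            for i in range(len(trits) - size + 1)]


# Number of groups detected in seven consecutive frames (toy data).
group_counts = [3, 3, 4, 4, 2, 2, 2]
trits = trit_sequence(group_counts)
patterns = trit_strings(trits, size=3)
```

Under this reading, a calm scene produces strings dominated by 1, while abrupt formation or disintegration of groups shows up as runs containing 0 or 2.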

RWF-2000: An Open Large Scale Video Database for Violence Detection

Ming Cheng, Kunjing Cai, Ming Li

Auto-TLDR; Flow Gated Network for Violence Detection in Surveillance Cameras

In recent years, surveillance cameras are widely deployed in public places, and the general crime rate has been reduced significantly due to these ubiquitous devices. Usually, these cameras provide cues and evidence after crimes were conducted, while they are rarely used to prevent or stop criminal activities in time. It is both time and labor consuming to manually monitor a large amount of video data from surveillance cameras. Therefore, automatically recognizing violent behaviors from video signals becomes essential. In this paper, we summarize several existing video datasets for violence detection and propose a new video dataset with 2,000 videos all captured by surveillance cameras in real-world scenes. Also, we present a new method that utilizes both the merits of 3D-CNNs and optical flow, namely Flow Gated Network. The proposed approach obtains an accuracy of 87.25% on the test set of our proposed RWF-2000 database. The proposed database and source codes of this paper are currently open to access.

Vacant Parking Space Detection Based on Task Consistency and Reinforcement Learning

Manh Hung Nguyen, Tzu-Yin Chao, Ching-Chun Huang

Auto-TLDR; Vacant Space Detection via Semantic Consistency Learning

In this paper, we propose a novel task-consistency learning method that allows training a vacant space detection network (target task) based on the logistic consistency with the semantic outcomes from a naive flow-based motion behavior classifier (source task) in a parking lot. By carefully designing the reward mechanism upon semantic consistency, we show that it is possible to train the target network in a reinforcement learning setting. Compared with conventional supervised detection methods, the major contribution of this work is to learn a vacant space detector via semantic consistency rather than supervised labels. The dynamic learning property may allow the proposed detector to be deployed in different lots easily without heavy training loads. The experiments show that, based on the task consistency rewards from the motion behavior classifier, the vacant space detector can be trained successfully.

End-To-End Deep Learning Methods for Automated Damage Detection in Extreme Events at Various Scales

Yongsheng Bai, Alper Yilmaz, Halil Sezen

Auto-TLDR; Robust Mask R-CNN for Crack Detection in Extreme Events

Robust Mask R-CNN (Mask Regional Convolutional Neural Network) methods are proposed and tested for automatic detection of cracks on structures or their components that may be damaged during extreme events, such as earthquakes. We curated a new dataset with 2,021 labeled images for training and validation and aimed to find end-to-end deep neural networks for crack detection in the field. With data augmentation and parameter fine-tuning, Path Aggregation Network (PANet) with spatial attention mechanisms and High-resolution Network (HRNet) are introduced into Mask R-CNNs. The tests on three public datasets with low- or high-resolution images demonstrate that the proposed methods can achieve a big improvement over alternative networks, so the proposed method may be sufficient for crack detection at a variety of scales in real applications.

STaRFlow: A SpatioTemporal Recurrent Cell for Lightweight Multi-Frame Optical Flow Estimation

Pierre Godet, Alexandre Boulch, Aurélien Plyer, Guy Le Besnerais

Auto-TLDR; STaRFlow: A lightweight CNN-based algorithm for optical flow estimation

We present a new lightweight CNN-based algorithm for multi-frame optical flow estimation. Our solution introduces a double recurrence over spatial scale and time through repeated use of a generic "STaR" (SpatioTemporal Recurrent) cell. It includes (i) a temporal recurrence based on conveying learned features rather than optical flow estimates; (ii) an occlusion detection process which is coupled with optical flow estimation and therefore uses a very limited number of extra parameters. The resulting STaRFlow algorithm gives state-of-the-art performance on MPI Sintel and Kitti2015 and involves significantly fewer parameters than all other methods with comparable results.

Visual Saliency Oriented Vehicle Scale Estimation

Qixin Chen, Tie Liu, Jiali Ding, Zejian Yuan, Yuanyuan Shang

Auto-TLDR; Regularized Intensity Matching for Vehicle Scale Estimation with salient object detection

Vehicle scale estimation with a single camera is a typical application in intelligent transportation, and it faces challenges from visual computing in that intensity-based and descriptor-based methods must be balanced. This paper proposes a vehicle scale estimation method based on salient object detection to resolve this problem. A regularized intensity matching method is proposed in Lie algebra to achieve robust and accurate scale estimation, and descriptor matching and intensity matching are combined to minimize the proposed loss function. The visual attention mechanism is designed to select image patches with texture and remove the occluded image patches. Weights are then assigned to pixels from the selected image patches, which alleviates the influence of noise-corrupted pixels. The experiments show that the proposed method significantly outperforms state-of-the-art methods with regard to the robustness and accuracy of vehicle scale estimation.

Video Anomaly Detection by Estimating Likelihood of Representations

Yuqi Ouyang, Victor Sanchez

Auto-TLDR; Video Anomaly Detection in the latent feature space using a deep probabilistic model

Video anomaly detection is a challenging task not only because it involves solving many sub-tasks such as motion representation, object localization and action recognition, but also because it is commonly considered as an unsupervised learning problem that involves detecting outliers. Traditionally, solutions to this task have focused on the mapping between video frames and their low-dimensional features, while ignoring the spatial connections of those features. Recent solutions focus on analyzing these spatial connections by using hard clustering techniques, such as K-Means, or applying neural networks to map latent features to a general understanding, such as action attributes. In order to solve video anomaly detection in the latent feature space, we propose a deep probabilistic model to transform this task into a density estimation problem where latent manifolds are generated by a deep denoising autoencoder and clustered by expectation maximization. Evaluations on several benchmark datasets show the strengths of our model, achieving outstanding performance on challenging datasets.

Better Prior Knowledge Improves Human-Pose-Based Extrinsic Camera Calibration

Olivier Moliner, Sangxia Huang, Kalle Åström

Auto-TLDR; Improving Human-pose-based Extrinsic Calibration for Multi-Camera Systems

Accurate extrinsic calibration of wide baseline multi-camera systems enables better understanding of 3D scenes for many applications and is of great practical importance. Classical Structure-from-Motion calibration methods require special calibration equipment so that accurate point correspondences can be detected between different views. In addition, an operator with some training is usually needed to ensure that data is collected in a way that leads to good calibration accuracy. This limits the ease of adoption of such technologies. Recently, methods have been proposed to use human pose estimation models to establish point correspondences, thus removing the need for any special equipment. The challenge with this approach is that human pose estimation algorithms typically produce much less accurate feature points compared to classical patch-based methods. Another problem is that ambient human motion might not be optimal for calibration. We build upon prior works and introduce several novel ideas to improve the accuracy of human-pose-based extrinsic calibration. Our first contribution is a robust reprojection loss based on a better understanding of the sources of pose estimation error. Our second contribution is a 3D human pose likelihood model learned from motion capture data. We demonstrate significant improvements in calibration accuracy by evaluating our method on four publicly available datasets.

Utilising Visual Attention Cues for Vehicle Detection and Tracking

Feiyan Hu, Venkatesh Gurram Munirathnam, Noel E O'Connor, Alan Smeaton, Suzanne Little

Auto-TLDR; Visual Attention for Object Detection and Tracking in Driver-Assistance Systems

Advanced Driver-Assistance Systems (ADAS) have been attracting attention from many researchers. Vision-based sensors are the closest way to emulate human driver visual behavior while driving. In this paper, we explore possible ways to use visual attention (saliency) for object detection and tracking. We investigate: 1) how a visual attention map such as a subjectness attention or saliency map and an objectness attention map can facilitate region proposal generation in a 2-stage object detector; 2) how a visual attention map can be used for tracking multiple objects. We propose a neural network that can simultaneously detect objects and generate objectness and subjectness maps to save computational power. We further exploit the visual attention map during tracking using a sequential Monte Carlo probability hypothesis density (PHD) filter. The experiments are conducted on the KITTI and DETRAC datasets. The use of visual attention and hierarchical features has shown a considerable improvement of ≈ 8% in object detection, which effectively increased tracking performance by ≈ 4% on the KITTI dataset.

Global Feature Aggregation for Accident Anticipation

Mishal Fatima, Umar Karim Khan, Chong Min Kyung

Auto-TLDR; Feature Aggregation for Predicting Accidents in Video Sequences

Anticipation of accidents ahead of time in autonomous and non-autonomous vehicles aids in accident avoidance. In order to recognize abnormal events such as traffic accidents in a video sequence, it is important that the network takes into account interactions of objects in a given frame. We propose a novel Feature Aggregation (FA) block that refines each object's features by computing a weighted sum of the features of all objects in a frame. We use the FA block along with a Long Short-Term Memory (LSTM) network to anticipate accidents in the video sequences. We report mean Average Precision (mAP) and Average Time-to-Accident (ATTA) on the Street Accident (SA) dataset. Our proposed method achieves the highest score for risk anticipation by predicting accidents 0.32 sec and 0.75 sec earlier compared to the best results with Adaptive Loss and dynamic parameter prediction based methods, respectively.
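
The idea of refining each object's features with a weighted sum over all objects in the frame can be sketched as follows. How the weights are obtained is not stated in the abstract; the softmax over dot-product similarities used here is an assumption, as is the function name `aggregate_features`.

```python
import numpy as np


def aggregate_features(features):
    """Refine each object's feature as a weighted sum over all objects.

    features: array of shape (num_objects, feature_dim).
    Weights are a row-wise softmax of dot-product similarities, which is
    an assumed choice for this sketch, not the paper's stated design.
    """
    scores = features @ features.T                      # pairwise similarity
    scores = scores - scores.max(axis=1, keepdims=True) # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ features, weights


# Three detected objects in one frame, each with a 4-dimensional feature.
features = np.array([[1.0, 0.0, 0.0, 0.5],
                     [0.9, 0.1, 0.0, 0.4],
                     [0.0, 1.0, 1.0, 0.0]])
refined, weights = aggregate_features(features)
```

Each row of `weights` sums to one, so every refined feature is a convex combination of the original object features and stays in the same feature space.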

LFIR2Pose: Pose Estimation from an Extremely Low-Resolution FIR Image Sequence

Saki Iwata, Yasutomo Kawanishi, Daisuke Deguchi, Ichiro Ide, Hiroshi Murase, Tomoyoshi Aizawa

Auto-TLDR; LFIR2Pose: Human Pose Estimation from a Low-Resolution Far-InfraRed Image Sequence

In this paper, we propose a method for human pose estimation from a Low-resolution Far-InfraRed (LFIR) image sequence captured by a 16 × 16 FIR sensor array. Human pose estimation from a single such LFIR image is a hard task. Annotating the images with human poses for training the estimation model is also difficult for humans. Thus, we propose the LFIR2Pose model, which accepts a sequence of LFIR images and outputs the human pose of the last frame, and also propose an automatic annotation system for the model training. Additionally, considering that the scale of human body motion differs largely among body parts, we also propose a loss function focusing on this difference. Through an experiment, we evaluated the human pose estimation accuracy using an original data set, and confirmed that human pose can be estimated accurately from an LFIR image sequence.

PHNet: Parasite-Host Network for Video Crowd Counting

Shiqiao Meng, Jiajie Li, Weiwei Guo, Jinfeng Jiang, Lai Ye

Auto-TLDR; PHNet: A Parasite-Host Network for Video Crowd Counting

Crowd counting plays an increasingly important role in public security. Recently, many crowd counting methods for a single image have been proposed, but few studies have focused on using temporal information from image sequences of videos to improve prediction performance. In the existing methods using videos for crowd estimation, temporal features and spatial features are modeled jointly for the prediction, which makes the model less efficient in extracting spatiotemporal features and makes it difficult to improve the performance of predictions. In order to solve these problems, this paper proposes a Parasite-Host Network (PHNet), which is composed of a Parasite branch and a Host branch to extract temporal features and spatial features, respectively. To specifically extract the transform features in the time domain, we propose a novel architecture termed "Relational Extractor" (RE), which models the multiplicative interaction features of adjacent frames. In addition, the Host branch extracts the spatial features from the current frame and can be replaced with any model that uses a single image for the prediction. We conducted experiments using our PHNet on four video crowd counting benchmarks: Venice, UCSD, FDST and CrowdFlow. Experimental results show that PHNet achieves superior performance to the state-of-the-art methods on these four datasets.

A Grid-Based Representation for Human Action Recognition

Soufiane Lamghari, Guillaume-Alexandre Bilodeau, Nicolas Saunier

Auto-TLDR; GRAR: Grid-based Representation for Action Recognition in Videos

Human action recognition (HAR) in videos is a fundamental research topic in computer vision. It consists mainly in understanding actions performed by humans based on a sequence of visual observations. In recent years, HAR has witnessed significant progress, especially with the emergence of deep learning models. However, most existing approaches for action recognition rely on information that is not always relevant for the task, and are limited in the way they fuse temporal information. In this paper, we propose a novel method for human action recognition that efficiently encodes the most discriminative appearance information of an action, with explicit attention on representative pose features, into a new compact grid representation. Our GRAR (Grid-based Representation for Action Recognition) method is tested on several benchmark datasets, which demonstrate that our model can accurately recognize human actions despite intra-class appearance variations and occlusion challenges.

Siamese Fully Convolutional Tracker with Motion Correction

Mathew Francis, Prithwijit Guha

Auto-TLDR; A Siamese Ensemble for Visual Tracking with Appearance and Motion Components

Visual tracking algorithms use cues like appearance, structure, and motion for locating an object in a video. We propose an ensemble tracker with appearance and motion components. The ensemble is formed by a siamese tracker that learns object appearance from a static image and by motion vectors computed between consecutive frames with a flow network. Motion-predicted object localization is used to correct the appearance component in the ensemble. The complementary nature of the components brings a performance improvement, as observed in experiments performed on the VOT2018 and VOT2019 datasets.

Deep Photo Relighting by Integrating Both 2D and 3D Lighting Information

Takashi Machida, Satoru Nakanishi

Auto-TLDR; DPR: Deep Photorelighting for Image Detection/Classification and Data Augmentation

In this paper, we propose a novel framework called "deep photorelighting" (DPR) that can transform the lighting condition of an image for virtual tests of image detection/classification algorithms, city environment design, and data augmentation for machine learning. Our framework employs a deep neural network (DNN) approach based on U-Net. Specifically, DPR has two key points for transforming one lighting condition to another by DNN. One is that we can support all factors that affect the lighting conditions (e.g., viewpoint, object materials/geometry, light position) by using 2D and 3D information such as an omnidirectional image, an omnidirectional depth image, and a region segmentation image. The other key point is that we can reproduce indirect influences from outside the frame, such as shadows, by grasping the whole lighting environment with the omnidirectional image/depth. As a result, DPR can generate relit images without fatal artifacts such as unnatural shading/shadows of objects. In experiments, we confirmed that a generated image is well reproduced compared with the ground truth image. We also confirmed that shadows, which occur inside and outside the frame through obstacles, are properly added/deleted in the generated image compared with the ground truth image.

Motion and Region Aware Adversarial Learning for Fall Detection with Thermal Imaging

Vineet Mehta, Abhinav Dhall, Sujata Pal, Shehroz Khan

Auto-TLDR; Automatic Fall Detection with Adversarial Network using Thermal Imaging Camera

Automatic fall detection is a vital technology for ensuring the health and safety of people. Home-based camera systems for fall detection often put people's privacy at risk. Thermal cameras can partially or fully obfuscate facial features, thus preserving the privacy of a person. Another challenge is the rarity of falls in comparison to normal activities of daily living. As falls occur rarely, it is non-trivial to train algorithms due to class imbalance. To handle these problems, we formulate fall detection as anomaly detection within an adversarial framework using a thermal imaging camera. We present a novel adversarial network that comprises two-channel 3D convolutional autoencoders, one each handling video sequences and optical flow, which then reconstruct the thermal data and the optical flow input sequences. We introduce a differential constraint, a technique to track the region of interest, and a joint discriminator to compute the reconstruction error. A larger reconstruction error indicates the occurrence of a fall in a video sequence. The experiments on a publicly available thermal fall dataset show superior results in comparison to the standard baseline.

Towards Practical Compressed Video Action Recognition: A Temporal Enhanced Multi-Stream Network

Bing Li, Longteng Kong, Dongming Zhang, Xiuguo Bao, Di Huang, Yunhong Wang

Auto-TLDR; TEMSN: Temporal Enhanced Multi-Stream Network for Compressed Video Action Recognition

Current compressed video action recognition methods mainly assume that the compressed video has been received completely. However, in real transmission, compressed video packets often arrive out of order or are lost due to network jitter or congestion. It is of great significance to recognize actions in early phases with limited packets, e.g., to quickly forecast potential risks from videos. In this paper, we propose a Temporal Enhanced Multi-Stream Network (TEMSN) for practical compressed video action recognition. First, we use three compressed modalities as complementary cues and build a multi-stream network to capture the rich information in compressed video packets. Second, we design a temporal enhanced module based on an encoder-decoder structure, applied to each stream, to infer the missing packets and generate more complete action dynamics. Thanks to the rich modalities and temporal enhancement, our approach is able to better model actions with limited compressed packets. Experiments on the HMDB-51 and UCF-101 datasets validate its effectiveness and efficiency.

A Boundary-Aware Distillation Network for Compressed Video Semantic Segmentation

Hongchao Lu

Auto-TLDR; A Boundary-Aware Distillation Network for Video Semantic Segmentation

In recent years, optical flow has often been estimated to reuse features and thereby accelerate video semantic segmentation. Adding an optical flow network, however, may incur extra cost, and accuracy may be degraded by the repeated warping operation. In this paper, we propose a boundary-aware distillation network (BDNet) that replaces the optical flow network with the block motion vectors encoded in compressed video, at negligible computational cost. To make features more salient, an auxiliary boundary-aware stream is added to the main stream to jointly estimate the silhouette and segmentation of objects. To further correct warped features, a well-trained teacher network is employed to transfer knowledge to the main stream. Both the boundary-aware stream and the teacher network are dropped at the inference stage, so the video segmentation network becomes faster without any additional computational burden. By splitting the task into three components, our BDNet shows almost 10% time saving as well as a 1.6% accuracy improvement over the baseline on the Cityscapes dataset.
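
To make the idea of reusing features with block motion vectors concrete, the following is a minimal sketch of block-wise warping. It is not BDNet itself: the function name `warp_with_block_motion`, the convention that each vector `(dy, dx)` points from a block in the current frame to its source in the previous feature map, the nearest-pixel copy, and the clamping at the borders are all assumptions made for illustration.

```python
from typing import List, Tuple

Grid = List[List[float]]
MotionField = List[List[Tuple[int, int]]]


def warp_with_block_motion(prev: Grid, motion: MotionField, block: int) -> Grid:
    """Build the current feature map by copying from the previous one.

    Every pixel in a block shares that block's motion vector (dy, dx),
    so no per-pixel optical flow has to be estimated.
    """
    height, width = len(prev), len(prev[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dy, dx = motion[y // block][x // block]
            # Clamp the source position so it stays inside the map.
            src_y = min(max(y + dy, 0), height - 1)
            src_x = min(max(x + dx, 0), width - 1)
            out[y][x] = prev[src_y][src_x]
    return out


# 4x4 feature map split into 2x2 blocks; only the top-right block moves.
prev_features = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
    [13.0, 14.0, 15.0, 16.0],
]
block_motion = [
    [(0, 0), (0, -2)],
    [(0, 0), (0, 0)],
]
warped = warp_with_block_motion(prev_features, block_motion, block=2)
```

Because the motion vectors are already present in the compressed stream, the only work at this step is the copy itself, which is why the cost is small compared with running a flow network.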

Human Segmentation with Dynamic LiDAR Data

Tao Zhong, Wonjik Kim, Masayuki Tanaka, Masatoshi Okutomi

Auto-TLDR; Spatiotemporal Neural Network for Human Segmentation with Dynamic Point Clouds

Consecutive LiDAR scans and depth images compose dynamic 3D sequences, which contain more spatiotemporal information than a single frame. As with the progression from image to video perception, dynamic 3D sequence perception is beginning to attract attention following inspiring research on static 3D data perception. This work proposes a spatiotemporal neural network for human segmentation with dynamic LiDAR point clouds. It takes a sequence of depth images as input and has a two-branch structure: a spatial segmentation branch and a temporal velocity estimation branch. The velocity estimation branch is designed to capture motion cues from the input sequence and propagate them to the other branch, so that the segmentation branch segments humans according to both spatial and temporal features. The two branches are jointly learned on a generated dynamic point cloud dataset for human recognition. Our work fills a gap in dynamic point cloud perception using the spherical representation of point clouds and achieves high accuracy. The experiments indicate that introducing temporal features benefits the segmentation of dynamic point clouds.

Correlation-Based ConvNet for Small Object Detection in Videos

Brais Bosquet, Manuel Mucientes, Victor Brea

Auto-TLDR; STDnet-ST: An End-to-End Spatio-Temporal Convolutional Neural Network for Small Object Detection in Video

The detection of small objects is of particular interest in many real applications. In this paper, we propose STDnet-ST, a novel approach to small object detection in video that uses spatial information alongside temporal video information. STDnet-ST is an end-to-end spatio-temporal convolutional neural network that detects small objects over time and correlates pairs of the top-ranked regions with the highest likelihood of containing small objects. This architecture links the small objects across time as tubelets and can dismiss unprofitable object links in order to provide high-quality tubelets. STDnet-ST achieves state-of-the-art results for small objects on the publicly available USC-GRAD-STDdb and UAVDT video datasets.