HMFlow: Hybrid Matching Optical Flow Network for Small and Fast-Moving Objects

Suihanjin Yu, Youmin Zhang, Chen Wang, Xiao Bai, Liang Zhang, Edwin Hancock

Responsive image

Auto-TLDR; Hybrid Matching Optical Flow Network with Global Matching Component

Slides Poster

In optical flow estimation task, coarse-to-fine warping strategy is widely used to deal with the large displacement problem and provides efficiency and speed. However, limited by the small search range between the first images and warped second images, current coarse-to-fine optical flow networks fail to capture small and fast-moving objects which has disappeared at coarse resolution levels. To address this problem, we introduce a lightweight but effective Global Matching Component (GMC) to grab global matching features. We propose a new Hybrid Matching Optical Flow Network (HMFlow) by integrating GMC into existing coarse-to-fine networks seamlessly. Besides keeping in high accuracy and small model size, our proposed HMFlow can apply global matching features to guide the network to discover the small and fast-moving objects mismatched by local matching features. We also build a new dataset, named SFChairs, for evaluation. The experimental results show that our proposed network achieves considerable performance, especially at regions with small and fast-moving objects.

Similar papers

STaRFlow: A SpatioTemporal Recurrent Cell for Lightweight Multi-Frame Optical Flow Estimation

Pierre Godet, Alexandre Boulch, Aurélien Plyer, Guy Le Besnerais

Responsive image

Auto-TLDR; STaRFlow: A lightweight CNN-based algorithm for optical flow estimation

Slides Poster Similar

We present a new lightweight CNN-based algorithm for multi-frame optical flow estimation. Our solution introduces a double recurrence over spatial scale and time through repeated use of a generic "STaR" (SpatioTemporal Recurrent) cell. It includes (i) a temporal recurrence based on conveying learned features rather than optical flow estimates; (ii) an occlusion detection process which is coupled with optical flow estimation and therefore uses a very limited number of extra parameters. The resulting STaRFlow algorithm gives state-of-the-art performances on MPI Sintel and Kitti2015 and involves significantly less parameters than all other methods with comparable results.

PA-FlowNet: Pose-Auxiliary Optical Flow Network for Spacecraft Relative Pose Estimation

Zhi Yu Chen, Po-Heng Chen, Kuan-Wen Chen, Chen-Yu Chan

Responsive image

Auto-TLDR; PA-FlowNet: An End-to-End Pose-auxiliary Optical Flow Network for Space Travel and Landing

Slides Poster Similar

During the process of space travelling and space landing, the spacecraft attitude estimation is the indispensable work for navigation. Since there are not enough satellites for GPS-like localization in space, the computer vision technique is adopted to address the issue. The most crucial task for localization is the extraction of correspondences. In computer vision, optical flow estimation is often used for finding correspondences between images. As the deep neural network being more popular in recent years, FlowNet2 has played a vital role which achieves great success. In this paper, we present PA-FlowNet, an end-to-end pose-auxiliary optical flow network which can use the predicted relative camera pose to improve the performance of optical flow. PA-FlowNet is composed of two sub-networks, the foreground-attention flow network and the pose regression network. The foreground-attention flow network is constructed bybased on FlowNet2 model and modified with the proposed foreground-attention approach. We introduced this approach with the concept of curriculum learning for foreground-background segmentation to avoid backgrounds from resulting in flow prediction error. The pose regression network is used to regress the relative camera pose as an auxiliary for increasing the accuracy of the flow estimation. In addition, to simulate the test environment for spacecraft pose estimation, we construct a 64K moon model and to simulate aerial photography with various attitudes to generate Moon64K dataset in this paper. PA-FlowNet significantly outperforms all existing methods on our the proposed Moon64K dataset. Furthermore, we also predict the relative pose via proposed PA-FlowNet and accomplish the remarkable performance.

ResFPN: Residual Skip Connections in Multi-Resolution Feature Pyramid Networks for Accurate Dense Pixel Matching

Rishav ., René Schuster, Ramy Battrawy, Oliver Wasenmüler, Didier Stricker

Responsive image

Auto-TLDR; Resolution Feature Pyramid Networks for Dense Pixel Matching

Slides Similar

Dense pixel matching is required for many computer vision algorithms such as disparity, optical flow or scene flow estimation. Feature Pyramid Networks (FPN) have proven to be a suitable feature extractor for CNN-based dense matching tasks. FPN generates well localized and semantically strong features at multiple scales. However, the generic FPN is not utilizing its full potential, due to its reasonable but limited localization accuracy. Thus, we present ResFPN – a multiresolution feature pyramid network with multiple residual skip connections, where at any scale, we leverage the information from higher resolution maps for stronger and better localized features. In our ablation study we demonstrate the effectiveness of our novel architecture with clearly higher accuracy than FPN. In addition, we verify the superior accuracy of ResFPN in many different pixel matching applications on established datasets like KITTI, Sintel, and FlyingThings3D.

Two-Stage Adaptive Object Scene Flow Using Hybrid CNN-CRF Model

Congcong Li, Haoyu Ma, Qingmin Liao

Responsive image

Auto-TLDR; Adaptive object scene flow estimation using a hybrid CNN-CRF model and adaptive iteration

Slides Poster Similar

Scene flow estimation based on stereo sequences is a comprehensive task relevant to disparity and optical flow. Some existing methods are time-consuming and often fail in the presence of reflective surfaces. In this paper, we propose a two-stage adaptive object scene flow estimation method using a hybrid CNN-CRF model (ACOSF), which benefits from high-quality features and the structured modelling capability. Meanwhile, in order to balance the computational efficiency and accuracy, we employ adaptive iteration for energy function optimization, which is flexible and efficient for various scenes. Besides, we utilize high-quality pixel selection to reduce the computation time with only a slight decrease in accuracy. Our method achieves competitive results with the state-of-the-art, which ranks second on the challenging KITTI 2015 scene flow benchmark.

Deeply-Fused Attentive Network for Stereo Matching

Zuliu Yang, Xindong Ai, Weida Yang, Yong Zhao, Qifei Dai, Fuchi Li

Responsive image

Auto-TLDR; DF-Net: Deep Learning-based Network for Stereo Matching

Slides Poster Similar

In this paper, we propose a novel learning-based network for stereo matching called DF-Net, which makes three main contributions that are experimentally shown to have practical merit. Firstly, we further increase the accuracy by using the deeply fused spatial pyramid pooling (DF-SPP) module, which can acquire the continuous multi-scale context information in both parallel and cascade manners. Secondly, we introduce channel attention block to dynamically boost the informative features. Finally, we propose a stacked encoder-decoder structure with 3D attention gate for cost regularization. More precisely, the module fuses the coding features to their next encoder-decoder structure under the supervision of attention gate with long-range skip connection, and thus exploit deep and hierarchical context information for disparity prediction. The performance on SceneFlow and KITTI datasets shows that our model is able to generate better results against several state-of-the-art algorithms.

A Lightweight Network to Learn Optical Flow from Event Data

Zhuoyan Li, Jiawei Shen

Responsive image

Auto-TLDR; A lightweight pyramid network with attention mechanism to learn optical flow from events data

Similar

Existing deep neural networks have found success in estimation of event-based optical flow, but are at the expense of complicated architectures. Moreover, few prior works discuss how to tackle with the noise problem of event camera, which would severely contaminate the data quality and make estimation an ill-posed problem. In this work, we present a lightweight pyramid network with attention mechanism to learn optical flow from events data. Specially, the network is designed according to two-well established principles: Laplacian pyramidal decomposition and channel attention mechanism. By integrating Laplacian pyramidal processing into CNN, the learning problem is simplified into several subproblems at each pyramid level, which can be handled by a relatively shallow network with few parameters. The channel attention block, embedded in each pyramid level, treats channels of feature map unequally and provides extra flexibility in suppressing background noises. The size of the proposed network is about only 5% of previous methods while our method still achieves state-of-the-art performance on the benchmark dataset. The experimental video samples of continuous flow estimation is presented at :https://github.com/xfleezy/blob.

OmniFlowNet: A Perspective Neural Network Adaptation for Optical Flow Estimation in Omnidirectional Images

Charles-Olivier Artizzu, Haozhou Zhang, Guillaume Allibert, Cédric Demonceaux

Responsive image

Auto-TLDR; OmniFlowNet: A Convolutional Neural Network for Omnidirectional Optical Flow Estimation

Slides Poster Similar

Spherical cameras and the latest image processing techniques open up new horizons. In particular, methods based on Convolutional Neural Networks (CNNs) now give excellent results for optical flow estimation on perspective images. However, these approaches are highly dependent on their architectures and training datasets. This paper proposes to benefit from years of improvement in perspective images optical flow estimation and to apply it to omnidirectional ones without training on new datasets. Our network, OmniFlowNet, is built on a CNN specialized in perspective images. Its convolution operation is adapted to be consistent with the equirectangular projection. Tested on spherical datasets created with Blender and several equirectangular videos realized from real indoor and outdoor scenes, OmniFlowNet shows better performance than its original network.

Learning Stereo Matchability in Disparity Regression Networks

Jingyang Zhang, Yao Yao, Zixin Luo, Shiwei Li, Tianwei Shen, Tian Fang, Long Quan

Responsive image

Auto-TLDR; Deep Stereo Matchability for Weakly Matchable Regions

Slides Similar

Learning-based stereo matching has recently achieved promising results, yet still suffers difficulties in establishing reliable matches in weakly matchable regions that are textureless, non-Lambertian, or occluded. In this paper, we address this challenge by proposing a stereo matching network that considers pixel-wise matchability. Specifically, the network jointly regresses disparity and matchability maps from 3D probability volume through expectation and entropy operations. Next, a learned attenuation is applied as the robust loss function to alleviate the influence of weakly matchable pixels in the training. Finally, a matchability-aware disparity refinement is introduced to improve the depth inference in weakly matchable regions. The proposed deep stereo matchability (DSM) framework can improve the matching result or accelerate the computation while still guaranteeing the quality. Moreover, the DSM framework is portable to many recent stereo networks. Extensive experiments are conducted on Scene Flow and KITTI stereo datasets to demonstrate the effectiveness of the proposed framework over the state-of-the-art learning-based stereo methods.

Revisiting Optical Flow Estimation in 360 Videos

Keshav Bhandari, Ziliang Zong, Yan Yan

Responsive image

Auto-TLDR; LiteFlowNet360: A Domain Adaptation Framework for 360 Video Optical Flow Estimation

Slides Similar

Nowadays 360 video analysis has become a significant research topic in the field since the appearance of high-quality and low-cost 360 wearable devices. In this paper, we propose a novel LiteFlowNet360 architecture for 360 videos optical flow estimation. We design LiteFlowNet360 as a domain adaptation framework from perspective video domain to 360 video domain. We adapt it from simple kernel transformation techniques inspired by Kernel Transformer Network (KTN) to cope with inherent distortion in 360 videos caused by the sphere-to-plane projection. First, we apply an incremental transformation of convolution layers in feature pyramid network and show that further transformation in inference and regularization layers are not important, hence reducing the network growth in terms of size and computation cost. Second, we refine the network by training with augmented data in a supervised manner. We perform data augmentation by projecting the images in a sphere and re-projecting to a plane. Third, we train LiteFlowNet360 in a self-supervised manner using target domain 360 videos. Experimental results show the promising results of 360 video optical flow estimation using the proposed novel architecture.

Movement-Induced Priors for Deep Stereo

Yuxin Hou, Muhammad Kamran Janjua, Juho Kannala, Arno Solin

Responsive image

Auto-TLDR; Fusing Stereo Disparity Estimation with Movement-induced Prior Information

Slides Poster Similar

We propose a method for fusing stereo disparity estimation with movement-induced prior information. Instead of independent inference frame-by-frame, we formulate the problem as a non-parametric learning task in terms of a temporal Gaussian process prior with a movement-driven kernel for inter-frame reasoning. We present a hierarchy of three Gaussian process kernels depending on the availability of motion information, where our main focus is on a new gyroscope-driven kernel for handheld devices with low-quality MEMS sensors, thus also relaxing the requirement of having full 6D camera poses available. We show how our method can be combined with two state-of-the-art deep stereo methods. The method either work in a plug-and-play fashion with pre-trained deep stereo networks, or further improved by jointly training the kernels together with encoder--decoder architectures, leading to consistent improvement.

Attention Stereo Matching Network

Doudou Zhang, Jing Cai, Yanbing Xue, Zan Gao, Hua Zhang

Responsive image

Auto-TLDR; ASM-Net: Attention Stereo Matching with Disparity Refinement

Slides Poster Similar

Despite great progress, previous stereo matching algorithms still lack the ability to match textureless regions and slender structure areas. To tackle this problem, we propose ASM-Net, an attention stereo matching network. Attention module and disparity refinement module are constructed in the ASMNet. The attention module can improve correlation information between two images by channels and spatial attention.The feature-guided disparity refinement module learns more geometry information in different feature levels to refine the coarse prediction resolution constantly. The proposed approach was evaluated on several benchmark datasets. Experiments show that the proposed method achieves competitive results on KITTI and Scene-Flow datasets while running in real-time at 14ms.

Leveraging a Weakly Adversarial Paradigm for Joint Learning of Disparity and Confidence Estimation

Matteo Poggi, Fabio Tosi, Filippo Aleotti, Stefano Mattoccia

Responsive image

Auto-TLDR; Joint Training of Deep-Networks for Outlier Detection from Stereo Images

Slides Poster Similar

Deep architectures represent the state-of-the-art for perceiving depth from stereo images. Although these methods are highly accurate, it is crucial to effectively detect any outlier through confidence measures since a wrong perception of even small portions of the sensed scene might lead to catastrophic consequences, for instance, in autonomous driving. Purposely, state-of-the-art confidence estimation methods rely on deep-networks as well. In this paper, arguing that these tasks are two sides of the same coin, we propose a novel paradigm for their joint training. Specifically, inspired by the successful deployment of GANs in other fields, we design two deep architectures: a generator for disparity estimation and a discriminator for distinguishing correct assignments from outliers. The two networks are jointly trained in a new peculiar weakly adversarial manner pushing the former to fix the errors detected by the discriminator while keeping the correct prediction unchanged. Experimental results on standard stereo datasets prove that such joint training paradigm yields significant improvements. Moreover, an additional outcome of our proposal is the ability to detect outliers with better accuracy compared to the state-of-the-art.

Suppressing Features That Contain Disparity Edge for Stereo Matching

Xindong Ai, Zuliu Yang, Weida Yang, Yong Zhao, Zhengzhong Yu, Fuchi Li

Responsive image

Auto-TLDR; SDE-Attention: A Novel Attention Mechanism for Stereo Matching

Slides Poster Similar

Existing networks for stereo matching usually use 2-D CNN as the feature extractor. However, objects are usually continuous in spatial, if an extracted feature contains disparity edge (the representation of this feature on original image contains disparity edge), then this feature usually not occur inside the region of an object. We propose a novel attention mechanism to suppress features containing disparity edge, named SDE-Attention (SDEA). We notice that features containing disparity edge are usually continuous in one image and discontinuous in another, which means that they usually have a greater difference in two feature maps of the same layer than features that don’t contain disparity edge. SDEA calculate the weight matrix of the intermediate feature map according to this trait, then the weight matrix is multiplied to the intermediate feature map. We test SDEA on PSMNet, experimental results show that our method has a significant improvement in accuracy and our network achieves state-of-the-art performance among the published networks.

FC-DCNN: A Densely Connected Neural Network for Stereo Estimation

Dominik Hirner, Friedrich Fraundorfer

Responsive image

Auto-TLDR; FC-DCNN: A Lightweight Network for Stereo Estimation

Slides Poster Similar

We propose a novel lightweight network for stereo estimation. Our network consists of a fully-convolutional densely connected neural network (FC-DCNN) that computes matching costs between rectified image pairs. Our FC-DCNN method learns expressive features and performs some simple but effective post-processing steps. The densely connected layer structure connects the output of each layer to the input of each subsequent layer. This network structure in addition to getting rid of any fully-connected layers leads to a very lightweight network. The output of this network is used in order to calculate matching costs and create a cost-volume. Instead of using time and memory-inefficient cost-aggregation methods such as semi-global matching or conditional random fields in order to improve the result, we rely on filtering techniques, namely median filter and guided filter. By computing a left-right consistency check we get rid of inconsistent values. Afterwards we use a watershed foreground-background segmentation on the disparity image with removed inconsistencies. This mask is then used to refine the final prediction. We show that our method works well for both challenging indoor and outdoor scenes by evaluating it on the Middlebury, KITTI and ETH3D benchmarks respectively.

Object Segmentation Tracking from Generic Video Cues

Amirhossein Kardoost, Sabine Müller, Joachim Weickert, Margret Keuper

Responsive image

Auto-TLDR; A Light-Weight Variational Framework for Video Object Segmentation in Videos

Slides Poster Similar

We propose a light-weight variational framework for online tracking of object segmentations in videos based on optical flow and image boundaries. While high-end computer vision methods on this task rely on sequence specific training of dedicated CNN architectures, we show the potential of a variational model, based on generic video information from motion and color. Such cues are usually required for tasks such as robot navigation or grasp estimation. We leverage them directly for video object segmentation and thus provide accurate segmentations at potentially very low extra cost. Our simple method can provide competitive results compared to the costly CNN-based methods with parameter tuning. Furthermore, we show that our approach can be combined with state-of-the-art CNN-based segmentations in order to improve over their respective results. We evaluate our method on the datasets DAVIS 16,17 and SegTrack v2.

Siamese Fully Convolutional Tracker with Motion Correction

Mathew Francis, Prithwijit Guha

Responsive image

Auto-TLDR; A Siamese Ensemble for Visual Tracking with Appearance and Motion Components

Slides Poster Similar

Visual tracking algorithms use cues like appearance, structure, motion etc. for locating an object in a video. We propose an ensemble tracker with appearance and motion components. A siamese tracker that learns object appearance from a static image and motion vectors computed between consecutive frames with a flow network forms the ensemble. Motion predicted object localization is used to correct the appearance component in the ensemble. Complementary nature of the components bring performance improvement as observed in experiments performed on VOT2018 and VOT2019 datasets.

Residual Learning of Video Frame Interpolation Using Convolutional LSTM

Keito Suzuki, Masaaki Ikehara

Responsive image

Auto-TLDR; Video Frame Interpolation Using Residual Learning and Convolutional LSTMs

Slides Poster Similar

Video frame interpolation aims to generate interme- diate frames between the original frames. This produces videos with a higher frame r ate and creates smoother motion. Many video frame interpolation methods first estimate the motion vector between the input frames and then synthesizes the intermediate frame based on the motion. However, these methods rely on the accuracy of the motion estimation step and fail to accurately generate the interpolated frame when the estimated motion vectors are inaccurate. Therefore, to avoid the uncertainties caused by motion estimation, this paper proposes a method that directly generates the intermediate frame. Since two consecutive frames are relatively similar, our method takes the average of these two frames and utilizes residual learning to learn the difference between the average of these frames and the ground truth middle frame. In addition, our method uses Convolutional LSTMs and four input frames to better incorporate spatiotemporal information. This neural network can be easily trained end to end without difficult to obtain data such as optical flow. Our experimental results show that the proposed method can perform favorably against other state-of-the-art frame interpolation methods.

Extending Single Beam Lidar to Full Resolution by Fusing with Single Image Depth Estimation

Yawen Lu, Yuxing Wang, Devarth Parikh, Guoyu Lu

Responsive image

Auto-TLDR; Self-supervised LIDAR for Low-Cost Depth Estimation

Slides Similar

Depth estimation is playing an important role in indoor and outdoor scene understanding, autonomous driving, augmented reality and many other tasks. Vehicles and robotics are able to use active illumination sensors such as LIDAR to receive high precision depth estimation. However, high-resolution Lidars are usually too expensive, which limits its massive production on various applications. Though single beam LIDAR enjoys the benefits of low cost, one beam depth sensing is not usually sufficient to perceive the surrounding environment in many scenarios. In this paper, we propose a learning-based framework to explore to replicate similar or even higher performance as costly LIDARs with our designed self-supervised network and a low-cost single-beam LIDAR. After the accurate calibration with a visible camera, the single beam LIDAR can adjust the scale uncertainty of the depth map estimated by the visible camera. The adjusted depth map enjoys the benefits of high resolution and sensing accuracy as high beam LIDAR and maintains low-cost as single beam LIDAR. Thus we can achieve similar sensing effect of high beam LIDAR with more than a 50-100 times cheaper price (e.g., \$80000 Velodyne HDL-64E LIDAR v.s. \$1000 SICK TIM-781 2D LIDAR and normal camera). The proposed approach is verified on our collected dataset and public dataset with superior depth-sensing performance.

Video Semantic Segmentation Using Deep Multi-View Representation Learning

Akrem Sellami, Salvatore Tabbone

Responsive image

Auto-TLDR; Deep Multi-view Representation Learning for Video Object Segmentation

Slides Poster Similar

In this paper, we propose a deep learning model based on deep multi-view representation learning, to address the video object segmentation task. The proposed model emphasizes the importance of the inherent correlation between video frames and incorporates a multi-view representation learning based on deep canonically correlated autoencoders. The multi-view representation learning in our model provides an efficient mechanism for capturing inherent correlations by jointly extracting useful features and learning better representation into a joint feature space, i.e., shared representation. To increase the training data and the learning capacity, we train the proposed model with pairs of video frames, i.e., $F_{a}$ and $F_{b}$. During the segmentation phase, the deep canonically correlated autoencoders model encodes useful features by processing multiple reference frames together, which is used to detect the frequently reappearing. Our model enhances the state-of-the-art deep learning-based methods that mainly focus on learning discriminative foreground representations over appearance and motion. Experimental results over two large benchmarks demonstrate the ability of the proposed method to outperform competitive approaches and to reach good performances, in terms of semantic segmentation.

Multi-Scale Residual Pyramid Attention Network for Monocular Depth Estimation

Jing Liu, Xiaona Zhang, Zhaoxin Li, Tianlu Mao

Responsive image

Auto-TLDR; Multi-scale Residual Pyramid Attention Network for Monocular Depth Estimation

Slides Poster Similar

Monocular depth estimation is a challenging problem in computer vision and is crucial for understanding 3D scene geometry. Recently, deep convolutional neural networks (DCNNs) based methods have improved the estimation accuracy significantly. However, existing methods fail to consider complex textures and geometries in scenes, thereby resulting in loss of local details, distorted object boundaries, and blurry reconstruction. In this paper, we proposed an end-to-end Multi-scale Residual Pyramid Attention Network (MRPAN) to mitigate these problems.First,we propose a Multi-scale Attention Context Aggregation (MACA) module, which consists of Spatial Attention Module (SAM) and Global Attention Module (GAM). By considering the position and scale correlation of pixels from spatial and global perspectives, the proposed module can adaptively learn the similarity between pixels so as to obtain more global context information of the image and recover the complex structure in the scene. Then we proposed an improved Residual Refinement Module (RRM) to further refine the scene structure, giving rise to deeper semantic information and retain more local details. Experimental results show that our method achieves more promisin performance in object boundaries and local details compared with other state-of-the-art methods.

Siamese Dynamic Mask Estimation Network for Fast Video Object Segmentation

Dexiang Hong, Guorong Li, Kai Xu, Li Su, Qingming Huang

Responsive image

Auto-TLDR; Siamese Dynamic Mask Estimation for Video Object Segmentation

Slides Poster Similar

Video object segmentation(VOS) has been a fundamental topic in recent years, and many deep learning-based methods have achieved state-of-the-art performance on multiple benchmarks. However, most of these methods rely on pixel-level matching between the template and the searched frames on the whole image while the targets only occupy a small region. Calculating on the entire image brings lots of additional computation cost. Besides, the whole image may contain some distracting information resulting in many false-positive matching points. To address this issue, motivated by one-stage instance object segmentation methods, we propose an efficient siamese dynamic mask estimation network for fast video object segmentation. The VOS is decoupled into two tasks, i.e. mask feature learning and dynamic kernel prediction. The former is responsible for learning high-quality features to preserve structural geometric information, and the latter learns a dynamic kernel which is used to convolve with the mask feature to generate a mask output. We use Siamese neural network as a feature extractor and directly predict masks after correlation. In this way, we can avoid using pixel-level matching, making our framework more simple and efficient. Experiment results on DAVIS 2016 /2017 datasets show that our proposed methods can run at 35 frames per second on NVIDIA RTX TITAN while preserving competitive accuracy.

Multi-Direction Convolution for Semantic Segmentation

Dehui Li, Zhiguo Cao, Ke Xian, Xinyuan Qi, Chao Zhang, Hao Lu

Responsive image

Auto-TLDR; Multi-Direction Convolution for Contextual Segmentation

Slides Similar

Context is known to be one of crucial factors effecting the performance improvement of semantic segmentation. However, state-of-the-art segmentation models built upon fully convolutional networks are inherently weak in encoding contextual information because of stacked local operations such as convolution and pooling. Failing to capture context leads to inferior segmentation performance. Despite many context modules have been proposed to relieve this problem, they still operate in a local manner or use the same contextual information in different positions (due to upsampling). In this paper, we introduce the idea of Multi-Direction Convolution (MDC)—a novel operator capable of encoding rich contextual information. This operator is inspired by an observation that the standard convolution only slides along the spatial dimension (x, y direction) where the channel dimension (z direction) is fixed, which renders slow growth of the receptive field (RF). If considering the channel-fixed convolution to be one-direction, MDC is multi-direction in the sense that MDC slides along both spatial and channel dimensions, i.e., it slides along x, y when z is fixed, along x, z when y is fixed, and along y, z when x is fixed. In this way, MDC is able to encode rich contextual information with the fast increase of the RF. Compared to existing context modules, the encoded context is position-sensitive because no upsampling is required. MDC is also efficient and easy to implement. It can be implemented with few standard convolution layers with permutation. We show through extensive experiments that MDC effectively and selectively enlarges the RF and outperforms existing contextual modules on two standard benchmarks, including Cityscapes and PASCAL VOC2012.

Real-Time Monocular Depth Estimation with Extremely Light-Weight Neural Network

Mian Jhong Chiu, Wei-Chen Chiu, Hua-Tsung Chen, Jen-Hui Chuang

Responsive image

Auto-TLDR; Real-Time Light-Weight Depth Prediction for Obstacle Avoidance and Environment Sensing with Deep Learning-based CNN

Slides Poster Similar

Obstacle avoidance and environment sensing are crucial applications in autonomous driving and robotics. Among all types of sensors, RGB camera is widely used in these applications as it can offer rich visual contents with relatively low-cost, and using a single image to perform depth estimation has become one of the main focuses in resent research works. However, prior works usually rely on highly complicated computation and power-consuming GPU to achieve such task; therefore, we focus on developing a real-time light-weight system for depth prediction in this paper. Based on the well-known encoder-decoder architecture, we propose a supervised learning-based CNN with detachable decoders that produce depth predictions with different scales. We also formulate a novel log-depth loss function that computes the difference of predicted depth map and ground truth depth map in log space, so as to increase the prediction accuracy for nearby locations. To train our model efficiently, we generate depth map and semantic segmentation with complex teacher models. Via a series of ablation studies and experiments, it is validated that our model can efficiently performs real-time depth prediction with only 0.32M parameters, with the best trained model outperforms previous works on KITTI dataset for various evaluation matrices.

Visual Saliency Oriented Vehicle Scale Estimation

Qixin Chen, Tie Liu, Jiali Ding, Zejian Yuan, Yuanyuan Shang

Responsive image

Auto-TLDR; Regularized Intensity Matching for Vehicle Scale Estimation with salient object detection

Slides Poster Similar

Vehicle scale estimation with a single camera is a typical application for intelligent transportation and it faces the challenges from visual computing while intensity-based method and descriptor-based method should be balanced. This paper proposed a vehicle scale estimation method based on salient object detection to resolve this problem. The regularized intensity matching method is proposed in Lie Algebra to achieve robust and accurate scale estimation, and descriptor matching and intensity matching are combined to minimize the proposed loss function. The visual attention mechanism is designed to select image patches with texture and remove the occluded image patches. Then the weights are assigned to pixels from the selected image patches which alleviates the influence of noise-corrupted pixels. The experiments show that the proposed method significantly outperforms state-of-the-art methods with regard to the robustness and accuracy of vehicle scale estimation.

Fast and Accurate Real-Time Semantic Segmentation with Dilated Asymmetric Convolutions

Leonel Rosas-Arias, Gibran Benitez-Garcia, Jose Portillo-Portillo, Gabriel Sanchez-Perez, Keiji Yanai

Responsive image

Auto-TLDR; FASSD-Net: Dilated Asymmetric Pyramidal Fusion for Real-Time Semantic Segmentation

Slides Poster Similar

Recent works have shown promising results applied to real-time semantic segmentation tasks. To maintain fast inference speed, most of the existing networks make use of light decoders, or they simply do not use them at all. This strategy helps to maintain a fast inference speed; however, their accuracy performance is significantly lower in comparison to non-real-time semantic segmentation networks. In this paper, we introduce two key modules aimed to design a high-performance decoder for real-time semantic segmentation for reducing the accuracy gap between real-time and non-real-time segmentation networks. Our first module, Dilated Asymmetric Pyramidal Fusion (DAPF), is designed to substantially increase the receptive field on the top of the last stage of the encoder, obtaining richer contextual features. Our second module, Multi-resolution Dilated Asymmetric (MDA) module, fuses and refines detail and contextual information from multi-scale feature maps coming from early and deeper stages of the network. Both modules exploit contextual information without excessively increasing the computational complexity by using asymmetric convolutions. Our proposed network entitled “FASSD-Net” reaches 78.8% of mIoU accuracy on the Cityscapes validation dataset at 41.1 FPS on full resolution images (1024x2048). Besides, with a light version of our network, we reach 74.1% of mIoU at 133.1 FPS (full resolution) on a single NVIDIA GTX 1080Ti card with no additional acceleration techniques. The source code and pre-trained models are available at https://github.com/GibranBenitez/FASSD-Net.

Learning Object Deformation and Motion Adaption for Semi-Supervised Video Object Segmentation

Xiaoyang Zheng, Xin Tan, Jianming Guo, Lizhuang Ma

Responsive image

Auto-TLDR; Semi-supervised Video Object Segmentation with Mask-propagation-based Model

Slides Poster Similar

We propose a novel method to solve the task of semi-supervised video object segmentation in this paper, where the mask annotation is only given at the first frame of the video sequence. A mask-propagation-based model is applied to learn the past and current information for segmentation. Besides, due to the scarcity of training data, image/mask pairs that model object deformation and shape variance are generated for the training phase. In addition, we generate the key flips between two adjacent frames for motion adaptation. The method works in an end-to-end way, without any online fine-tuning on test videos. Extensive experiments demonstrate that our method achieves competitive performance against state-of-the-art algorithms on benchmark datasets, covering cases with single object or multiple objects. We also conduct extensive ablation experiments to analyze the effectiveness of our proposed method.

Real-Time Semantic Segmentation Via Region and Pixel Context Network

Yajun Li, Yazhou Liu, Quansen Sun

Responsive image

Auto-TLDR; A Dual Context Network for Real-Time Semantic Segmentation

Slides Poster Similar

Real-time semantic segmentation is a challenging task as both segmentation accuracy and inference speed need to be considered at the same time. In this paper, we present a Dual Context Network (DCNet) to address this challenge. It contains two independent sub-networks: Region Context Network and Pixel Context Network. Region Context Network is main network with low-resolution input and feature re-weighting module to achieve sufficient receptive field. Meanwhile, Pixel Context Network with location attention module to capture the location dependencies of each pixel for assisting the main network to recover spatial detail. A contextual feature fusion is introduced to combine output features of these two sub-networks. The experiments show that DCNet can achieve high-quality segmentation while keeping a high speed. Specifically, for Cityscapes test dataset, we achieve 76.1% Mean IOU with the speed of 82 FPS on a single GTX 2080Ti GPU when using ResNet50 as backbone, and 71.2% Mean IOU with the speed of 142 FPS when using ResNet18 as backbone.

Coarse to Fine: Progressive and Multi-Task Learning for Salient Object Detection

Dong-Goo Kang, Sangwoo Park, Joonki Paik

Responsive image

Auto-TLDR; Progressive and mutl-task learning scheme for salient object detection

Slides Poster Similar

Most deep learning-based salient object detection (SOD) methods tried to manipulate the convolution block to effectively capture the context of object. In this paper, we propose a novel method, called progressive and mutl-task learning scheme, to extract the context of object by only manipulating the learning scheme without changing the network architecture. The progressive learning scheme is a method to grow the decoder progressively in the train phase. In other words, starting from easier low-resolution layers, it gradually adds high-resolution layers. Although the progressive learning successfullyl captures the context of object, its output boundary tends to be rough. To solve this problem, we also propose a multi-task learning (MTL) scheme that processes the object saliency map and contour in a single network jointly. The proposed MTL scheme trains the network in an edge-preserved direction through an auxiliary branch that learns contours. The proposed a learning scheme can be combined with other convolution block manipulation methods. Extensive experiments on five datasets show that the proposed method performs best compared with state-of-the-art methods in most cases.

TSMSAN: A Three-Stream Multi-Scale Attentive Network for Video Saliency Detection

Jingwen Yang, Guanwen Zhang, Wei Zhou

Responsive image

Auto-TLDR; Three-stream Multi-scale attentive network for video saliency detection in dynamic scenes

Slides Poster Similar

Video saliency detection is an important low-level task that has been used in a large range of high-level applications. In this paper, we proposed a three-stream multi-scale attentive network (TSMSAN) for saliency detection in dynamic scenes. TSMSAN integrates motion vector representation, static saliency map, and RGB information in multi-scales together into one framework on the basis of Fully Convolutional Network (FCN) and spatial attention mechanism. On the one hand, the respective motion features, spatial features, as well as the scene features can provide abundant information for video saliency detection. On the other hand, spatial attention mechanism can combine features with multi-scales to focus on key information in dynamic scenes. In this manner, the proposed TSMSAN can encode the spatiotemporal features of the dynamic scene comprehensively. We evaluate the proposed approach on two public dynamic saliency data sets. The experimental results demonstrate TSMSAN is able to achieve the state-of-the-art performance as well as the excellent generalization ability. Furthermore, the proposed TSMSAN can provide more convincing video saliency information, in line with human perception.

5D Light Field Synthesis from a Monocular Video

Kyuho Bae, Andre Ivan, Hajime Nagahara, In Kyu Park

Responsive image

Auto-TLDR; Synthesis of Light Field Video from Monocular Video using Deep Learning

Slides Similar

Commercially available light field cameras have difficulty in capturing 5D (4D + time) light field videos. They can only capture still light filed images or are excessively expensive for normal users to capture the light field video. To tackle this problem, we propose a deep learning-based method for synthesizing a light field video from a monocular video. We propose a new synthetic light field video dataset that renders photorealistic scenes using Unreal Engine because no light field video dataset is available. The proposed deep learning framework synthesizes the light field video with a full set (9x9) of sub-aperture images from a normal monocular video. The proposed network consists of three sub-networks, namely, feature extraction, 5D light field video synthesis, and temporal consistency refinement. Experimental results show that our model can successfully synthesize the light field video for synthetic and real scenes and outperforms the previous frame-by-frame method quantitatively and qualitatively.

Delivering Meaningful Representation for Monocular Depth Estimation

Doyeon Kim, Donggyu Joo, Junmo Kim

Responsive image

Auto-TLDR; Monocular Depth Estimation by Bridging the Context between Encoding and Decoding

Slides Poster Similar

Monocular depth estimation plays a key role in 3D scene understanding, and a number of recent papers have achieved significant improvements using deep learning based algorithms. Most papers among them proposed methods that use a pre-trained network as a deep feature extractor and then decode the obtained features to create a depth map. In this study, we focus on how to use this encoder-decoder structure to deliver meaningful representation throughout the entire network. We propose a new network architecture with our suggested modules to create a more accurate depth map by bridging the context between the encoding and decoding phase. First, we place the pyramid block at the bottleneck of the network to enlarge the view and convey rich information about the global context to the decoder. Second, we suggest a skip connection with the fuse module to aggregate the encoder and decoder feature. Finally, we validate our approach on the NYU Depth V2 and KITTI datasets. The experimental results prove the efficacy of the suggested model and show performance gains over the state-of-the-art model.

Enhancing Depth Quality of Stereo Vision Using Deep Learning-Based Prior Information of the Driving Environment

Weifu Li, Vijay John, Seiichi Mita

Responsive image

Auto-TLDR; A Novel Post-processing Mathematical Framework for Stereo Vision

Slides Poster Similar

Generation of high density depth values of the driving environment is indispensable for autonomous driving. Stereo vision is one of the practical and effective methods to generate these depth values. However, the accuracy of the stereo vision is limited by texture-less regions, such as sky and road areas, and repeated patterns in the image. To overcome these problems, we propose to enhance the stereo generated depth by incorporating prior information of the driving environment. Prior information, generated by deep learning-based U-Net model, is utilized in a novel post-processing mathematical framework to refine the stereo generated depth. The proposed mathematical framework is formulated as an optimization problem, which refines the errors due to texture-less regions and repeated patterns. Owing to its mathematical formulation, the post-processing framework is not a black-box and is explainable, and can be readily utilized for depth maps generated by any stereo vision algorithm. The proposed framework is qualitatively validated on the acquired dataset and KITTI dataset. The results obtained show that the proposed framework improves the stereo depth generation accuracy

Motion Complementary Network for Efficient Action Recognition

Ke Cheng, Yifan Zhang, Chenghua Li, Jian Cheng, Hanqing Lu

Responsive image

Auto-TLDR; Efficient Motion Complementary Network for Action Recognition

Slides Poster Similar

Both two-stream ConvNet and 3D ConvNet are widely used in action recognition. However, both methods are not efficient for deployment: calculating optical flow is very slow, while 3D convolution is computationally expensive. Our key insight is that the motion information from optical flow maps is complementary to the motion information from 3D ConvNet. Instead of simply combining these two methods, we propose two novel techniques to enhance the performance with less computational cost: \textit{fixed-motion-accumulation} and \textit{balanced-motion-policy}. With these two techniques, we propose a novel framework called Efficient Motion Complementary Network(EMC-Net) that enjoys both high efficiency and high performance. We conduct extensive experiments on Kinetics, UCF101, and Jester datasets. We achieve notably higher performance while consuming 4.7$\times$ less computation than I3D, 11.6$\times$ less computation than ECO, 17.8$\times$ less computation than R(2+1)D. On Kinetics dataset, we achieve 2.6\% better performance than the recent proposed TSM with 1.4$\times$ fewer FLOPs and 10ms faster on K80 GPU.

ACCLVOS: Atrous Convolution with Spatial-Temporal ConvLSTM for Video Object Segmentation

Muzhou Xu, Shan Zong, Chunping Liu, Shengrong Gong, Zhaohui Wang, Yu Xia

Responsive image

Auto-TLDR; Semi-supervised Video Object Segmentation using U-shape Convolution and ConvLSTM

Slides Poster Similar

Semi-supervised video object segmentation aims at segmenting the target of interest throughout a video sequence when only the annotated mask of the first frame is given. A feasible method for segmentation is to capture the spatial-temporal coherence between frames. However, it may suffer from mask drift when the spatial-temporal coherence is unreliable. To relieve this problem, we propose an encoder-decoder-recurrent model for semi-supervised video object segmentation. The model adopts a U-shape architecture that combines atrous convolution and ConvLSTM to establish the coherence in both the spatial and temporal domains. Furthermore, the weight ratio for each block is also reconstructed to make the model more suitable for the VOS task. We evaluate our method on two benchmarks, DAVIS-2017 and Youtube-VOS, where state-of-the-art segmentation accuracy with a real-time inference speed of 21.3 frames per second on a Tesla P100 is obtained.

Dynamic Guided Network for Monocular Depth Estimation

Xiaoxia Xing, Yinghao Cai, Yiping Yang, Dayong Wen

Responsive image

Auto-TLDR; DGNet: Dynamic Guidance Upsampling for Self-attention-Decoding for Monocular Depth Estimation

Slides Poster Similar

Self-attention or encoder-decoder structure has been widely used in deep neural networks for monocular depth estimation tasks. The former mechanism are capable to capture long-range information by computing the representation of each position by a weighted sum of the features at all positions, while the latter networks can capture structural details information by gradually recovering the spatial information. In this work, we combine the advantages of both methods. Specifically, our proposed model, DGNet, extends EMANet Network by adding an effective decoder module to refine the depth results. In the decoder stage, we further design dynamic guidance upsampling which uses local neighboring information of low-level features guide coarser depth to upsample. In this way, dynamic guidance upsampling generates content-dependent and spatially-variant kernels for depth upsampling which makes full use of spatial details information from low-level features. Experimental results demonstrate that our method obtains higher accuracy and generates the desired depth map.

Coarse-To-Fine Foreground Segmentation Based on Co-Occurrence Pixel-Block and Spatio-Temporal Attention Model

Xinyu Liu, Dong Liang

Responsive image

Auto-TLDR; Foreground Segmentation from coarse to Fine Using Co-occurrence Pixel-Block Model for Dynamic Scene

Slides Poster Similar

Foreground segmentation in dynamic scene is an important task in video surveillance. The unsupervised background subtraction method based on background statistics modeling has difficulties in updating. On the other hand, the supervised foreground segmentation method based on deep learning relies on the large-scale of accurately annotated training data, which limits its cross-scene performance. In this paper, we propose a foreground segmentation method from coarse to fine. First, a across-scenes trained Spatio-Temporal Attention Model (STAM) is used to achieve coarse segmentation, which does not require training on specific scene. Then the coarse segmentation is used as a reference to help Co-occurrence Pixel-Block Model (CPB) complete the fine segmentation, and at the same time help CPB to update its background model. This method is more flexible than those deep-learning-based methods which depends on the specific-scene training, and realizes the accurate online dynamic update of the background model. Experimental results on WallFlower and LIMU validate our method outperforms STAM, CPB and other methods of participating in comparison.

FastSal: A Computationally Efficient Network for Visual Saliency Prediction

Feiyan Hu, Kevin Mcguinness

Responsive image

Auto-TLDR; MobileNetV2: A Convolutional Neural Network for Saliency Prediction

Slides Poster Similar

This paper focuses on the problem of visual saliency prediction, predicting regions of an image that tend to attract human visual attention, under a constrained computational budget. We modify and test various recent efficient convolutional neural network architectures like EfficientNet and MobileNetV2 and compare them with existing state-of-the-art saliency models such as SalGAN and DeepGaze II both in terms of standard accuracy metrics like AUC and NSS, and in terms of the computational complexity and model size. We find that MobileNetV2 makes an excellent backbone for a visual saliency model and can be effective even without a complex decoder. We also show that knowledge transfer from a more computationally expensive model like DeepGaze II can be achieved via pseudo-labelling an unlabelled dataset, and that this approach gives result on-par with many state-of-the-art algorithms with a fraction of the computational cost and model size.

Ordinal Depth Classification Using Region-Based Self-Attention

Minh Hieu Phan, Son Lam Phung, Abdesselam Bouzerdoum

Responsive image

Auto-TLDR; Region-based Self-Attention for Multi-scale Depth Estimation from a Single 2D Image

Slides Poster Similar

Depth estimation from a single 2D image has been widely applied in 3D understanding, 3D modelling and robotics. It is challenging as reliable cues (e.g. stereo correspondences and motions) are not available. Most of the modern approaches exploited multi-scale feature extraction to provide more powerful representations for deep networks. However, these studies have not focused on how to effectively fuse the learned multi-scale features. This paper proposes a novel region-based self-attention (rSA) module. The rSA recalibrates the multi-scale responses by explicitly modelling the interdependency between channels in separate image regions. We discretize continuous depths to solve an ordinal depth classification in which the relative order between categories is significant. We contribute a dataset of 4410 RGB-D images, captured in outdoor environments at the University of Wollongong's campus. In our experimental results, the proposed module improves the lightweight models on small-sized datasets by 22% - 40%

3D Semantic Labeling of Photogrammetry Meshes Based on Active Learning

Mengqi Rong, Shuhan Shen, Zhanyi Hu

Responsive image

Auto-TLDR; 3D Semantic Expression of Urban Scenes Based on Active Learning

Slides Poster Similar

As different urban scenes are similar but still not completely consistent, coupled with the complexity of labeling directly in 3D, high-level understanding of 3D scenes has always been a tricky problem. In this paper, we propose a procedural approach for 3D semantic expression of urban scenes based on active learning. We first start with a small labeled image set to fine-tune a semantic segmentation network and then project its probability map onto a 3D mesh model for fusion, finally outputs a 3D semantic mesh model in which each facet has a semantic label and a heat model showing each facet’s confidence. Our key observation is that our algorithm is iterative, in each iteration, we use the output semantic model as a supervision to select several valuable images for annotation to co-participate in the fine-tuning for overall improvement. In this way, we reduce the workload of labeling but not the quality of 3D semantic model. Using urban areas from two different cities, we show the potential of our method and demonstrate its effectiveness.

Revisiting Sequence-To-Sequence Video Object Segmentation with Multi-Task Loss and Skip-Memory

Fatemeh Azimi, Benjamin Bischke, Sebastian Palacio, Federico Raue, Jörn Hees, Andreas Dengel

Responsive image

Auto-TLDR; Sequence-to-Sequence Learning for Video Object Segmentation

Slides Poster Similar

Video Object Segmentation (VOS) is an active research area of the visual domain. One of its fundamental sub-tasks is semi-supervised / one-shot learning: given only the segmentation mask for the first frame, the task is to provide pixel-accurate masks for the object over the rest of the sequence. Despite much progress in the last years, we noticed that many of the existing approaches lose objects in longer sequences, especially when the object is small or briefly occluded. In this work, we build upon a sequence-to-sequence approach that employs an encoder-decoder architecture together with a memory module for exploiting the sequential data. We further improve this approach by proposing a model that manipulates multi-scale spatio-temporal information using memory-equipped skip connections. Furthermore, we incorporate an auxiliary task based on distance classification which greatly enhances the quality of edges in segmentation masks. We compare our approach to the state of the art and show considerable improvement in the contour accuracy metric and the overall segmentation accuracy.

Selective Kernel and Motion-Emphasized Loss Based Attention-Guided Network for HDR Imaging of Dynamic Scenes

Yipeng Deng, Qin Liu, Takeshi Ikenaga

Responsive image

Auto-TLDR; SK-AHDRNet: A Deep Network with attention module and motion-emphasized loss function to produce ghost-free HDR images

Slides Poster Similar

Ghost-like artifacts caused by ill-exposed and motion areas is one of the most challenging problems in high dynamic range (HDR) image reconstruction.When the motion range is small, previous methods based on optical flow or patch-match can suppress ghost-like artifacts by first aligning input images before merging them.However, they are not robust enough and still produce artifacts for challenging scenes where large foreground motions exist.To this end, we propose a deep network with attention module and motion-emphasized loss function to produce ghost-free HDR images. In attention module, we use channel and spatial attention to guide network to emphasize important components such as motion and saturated areas automatically. With the purpose of being robust to images with different resolutions and objects with distinct scale, we adopt the selective kernel network as the basic framework for channel attention. In addition to the attention module, the motion-emphasized loss function based on the motion and ill-exposed areas mask is designed to help network reconstruct motion areas. Experiments on the public dataset indicate that the proposed SK-AHDRNet produces ghost-free results where detail in ill-exposed areas is well recovered. The proposed method scores 43.17 with PSNR metric and 61.02 with HDR-VDP-2 metric on test which outperforms all conventional works. According to quantitative and qualitative evaluations, the proposed method can achieve state-of-the-art performance.

Motion-Supervised Co-Part Segmentation

Aliaksandr Siarohin, Subhankar Roy, Stéphane Lathuiliere, Sergey Tulyakov, Elisa Ricci, Nicu Sebe

Responsive image

Auto-TLDR; Self-supervised Co-Part Segmentation Using Motion Information from Videos

Slides Similar

Recent co-part segmentation methods mostly operate in a supervised learning setting, which requires a large amount of annotated data for training. To overcome this limitation, we propose a self-supervised deep learning method for co-part segmentation. Differently from previous works, our approach develops the idea that motion information inferred from videos can be leveraged to discover meaningful object parts. To this end, our method relies on pairs of frames sampled from the same video. The network learns to predict part segments together with a representation of the motion between two frames, which permits reconstruction of the target image. Through extensive experimental evaluation on publicly available video sequences we demonstrate that our approach can produce improved segmentation maps with respect to previous self-supervised co-part segmentation approaches.

DA-RefineNet: Dual-Inputs Attention RefineNet for Whole Slide Image Segmentation

Ziqiang Li, Rentuo Tao, Qianrun Wu, Bin Li

Responsive image

Auto-TLDR; DA-RefineNet: A dual-inputs attention network for whole slide image segmentation

Slides Poster Similar

Automatic medical image segmentation techniques have wide applications for disease diagnosing, however, its much more challenging than natural optical image segmentation tasks due to the high-resolution of medical images and the corresponding huge computation cost. Sliding window was a commonly used technique for whole slide image (WSI) segmentation, however, for these methods that based on sliding window, the main drawback was lacking of global contextual information for supervision. In this paper, we proposed a dual-inputs attention network (denoted as DA-RefineNet) for WSI segmentation, where both local fine-grained information and global coarse information can be efficiently utilized. Sufficient comparative experiments were conducted to evaluate the effectiveness of the proposed method, the results proved that the proposed method can achieve better performance on WSI segmentation tasks compared to methods rely on single-input.

P2D: A Self-Supervised Method for Depth Estimation from Polarimetry

Marc Blanchon, Desire Sidibe, Olivier Morel, Ralph Seulin, Daniel Braun, Fabrice Meriaudeau

Responsive image

Auto-TLDR; Polarimetric Regularization for Monocular Depth Estimation

Slides Poster Similar

Monocular depth estimation is a recurring subject in the field of computer vision. Its ability to describe scenes via a depth map while reducing the constraints related to the formulation of perspective geometry tends to favor its use. However, despite the constant improvement of algorithms, most methods exploit only colorimetric information. Consequently, robustness to events to which the modality is not sensitive to, like specularity or transparency, is neglected. In response to this phenomenon, we propose using polarimetry as an input for a self-supervised monodepth network. Therefore, we propose exploiting polarization cues to encourage accurate reconstruction of scenes. Furthermore, we include a term of polarimetric regularization to state-of-the-art method to take specific advantage of the data. Our method is evaluated both qualitatively and quantitatively demonstrating that the contribution of this new information as well as an enhanced loss function improves depth estimation results, especially for specular areas.

Human Segmentation with Dynamic LiDAR Data

Tao Zhong, Wonjik Kim, Masayuki Tanaka, Masatoshi Okutomi

Responsive image

Auto-TLDR; Spatiotemporal Neural Network for Human Segmentation with Dynamic Point Clouds

Slides Similar

Consecutive LiDAR scans and depth images compose dynamic 3D sequences, which contain more abundant spatiotemporal information than a single frame. Similar to the development history of image and video perception, dynamic 3D sequence perception starts to come into sight after inspiring research on static 3D data perception. This work proposes a spatiotemporal neural network for human segmentation with the dynamic LiDAR point clouds. It takes a sequence of depth images as input. It has a two-branch structure, i.e., the spatial segmentation branch and the temporal velocity estimation branch. The velocity estimation branch is designed to capture motion cues from the input sequence and then propagates them to the other branch. So that the segmentation branch segments humans according to both spatial and temporal features. These two branches are jointly learned on a generated dynamic point cloud data set for human recognition. Our works fill in the blank of dynamic point cloud perception with the spherical representation of point cloud and achieves high accuracy. The experiments indicate that the introduction of temporal feature benefits the segmentation of dynamic point cloud perception.

TinyVIRAT: Low-Resolution Video Action Recognition

Ugur Demir, Yogesh Rawat, Mubarak Shah

Responsive image

Auto-TLDR; TinyVIRAT: A Progressive Generative Approach for Action Recognition in Videos

Slides Poster Similar

The existing research in action recognition is mostly focused on high-quality videos where the action is distinctly visible. In real-world surveillance environments, the actions in videos are captured at a wide range of resolutions. Most activities occur at a distance with a small resolution and recognizing such activities is a challenging problem. In this work, we focus on recognizing tiny actions in videos. We introduce a benchmark dataset, TinyVIRAT, which contains natural low-resolution activities. The actions in TinyVIRAT videos have multiple labels and they are extracted from surveillance videos which makes them realistic and more challenging. We propose a novel method for recognizing tiny actions in videos which utilizes a progressive generative approach to improve the quality of low-resolution actions. The proposed method also consists of a weakly trained attention mechanism which helps in focusing on the activity regions in the video. We perform extensive experiments to benchmark the proposed TinyVIRAT dataset and observe that the proposed method significantly improves the action recognition performance over baselines. We also evaluate the proposed approach on synthetically resized action recognition datasets and achieve state-of-the-art results when compared with existing methods. The dataset and code will be publicly available.

Towards Practical Compressed Video Action Recognition: A Temporal Enhanced Multi-Stream Network

Bing Li, Longteng Kong, Dongming Zhang, Xiuguo Bao, Di Huang, Yunhong Wang

Responsive image

Auto-TLDR; TEMSN: Temporal Enhanced Multi-Stream Network for Compressed Video Action Recognition

Slides Poster Similar

Current compressed video action recognition methods are mainly based on completely received compressed videos. However, in real transmission, the compressed video packets are usually disorderly received and lost due to network jitters or congestion. It is of great significance to recognize actions in early phases with limited packets, e.g. forecasting the potential risks from videos quickly. In this paper, we proposed a Temporal Enhanced Multi-Stream Network (TEMSN) for practical compressed video action recognition. First, we use three compressed modalities as complementary cues and build a multi-stream network to capture the rich information from compressed video packets. Second, we design a temporal enhanced module based on Encoder-Decoder structure applied on each stream to infer the missing packets, and generate more complete action dynamics. Thanks to the rich modalities and temporal enhancement, our approach is able to better modeling the action with limited compressed packets. Experiments on HMDB-51 and UCF-101 dataset validate its effectiveness and efficiency.

Global-Local Attention Network for Semantic Segmentation in Aerial Images

Minglong Li, Lianlei Shan, Weiqiang Wang

Responsive image

Auto-TLDR; GLANet: Global-Local Attention Network for Semantic Segmentation

Slides Poster Similar

Errors in semantic segmentation task could be classified into two types: large area misclassification and local inaccurate boundaries. Previously attention based methods capture rich global contextual information, this is beneficial to diminish the first type of error, but local imprecision still exists. In this paper we propose Global-Local Attention Network (GLANet) with a simultaneous consideration of global context and local details. Specifically, our GLANet is composed of two branches namely global attention branch and local attention branch, and three different modules are embedded in the two branches for the purpose of modeling semantic interdependencies in spatial, channel and boundary dimensions respectively. We sum the outputs of the two branches to further improve feature representation, leading to more precise segmentation results. The proposed method achieves very competitive segmentation accuracy on two public aerial image datasets, bringing significant improvements over baseline.