AV-SLAM: Autonomous Vehicle SLAM with Gravity Direction Initialization

Kaan Yilmaz, Baris Suslu, Sohini Roychowdhury, L. Srikar Muppirisetty

Auto-TLDR; VI-SLAM with AGI: A combination of three SLAM algorithms for autonomous vehicles

Simultaneous localization and mapping (SLAM) algorithms aimed at autonomous vehicles (AVs) must exploit the sensor redundancies specific to AVs and deliver accurate, fast and repeatable estimates of pose and path trajectories. In this work, we present a combination of three SLAM algorithms, each of which uses a different subset of the available sensors, such as an inertial measurement unit (IMU), a gray-scale mono-camera, and a Lidar. We also propose a novel acceleration-based gravity direction initialization (AGI) method for the visual-inertial SLAM (VI-SLAM) algorithm. We analyze the SLAM algorithms and initialization methods for pose estimation accuracy, speed of convergence and repeatability on the KITTI odometry sequences. The proposed VI-SLAM with AGI method achieves relative pose errors below 2%, convergence in half a minute or less, and convergence time variability below 3 s, which makes it preferable for AVs.
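
As a rough illustration of what an acceleration-based estimate of the gravity direction can look like, the Python sketch below averages accelerometer samples and normalizes the result. It is not the AGI method of the paper, whose details are not given in the abstract. It assumes the vehicle is stationary or moving at constant velocity while the samples are taken, and that the accelerometer reports specific force (so gravity points opposite to the measured vector). The function names `gravity_direction` and `tilt_from_vertical` are hypothetical.

```python
import math


def gravity_direction(accel_samples):
    """Return the unit gravity direction in the sensor frame.

    accel_samples: iterable of (ax, ay, az) readings in m/s^2.
    Assumption: the sensor is not accelerating, so the mean reading
    is the reaction to gravity and gravity points the opposite way.
    """
    samples = list(accel_samples)
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one accelerometer sample")
    # Averaging suppresses zero-mean sensor noise.
    mean = [sum(s[i] for s in samples) / n for i in range(3)]
    norm = math.sqrt(sum(c * c for c in mean))
    if norm == 0.0:
        raise ValueError("mean acceleration is zero; direction undefined")
    return [-c / norm for c in mean]


def tilt_from_vertical(gravity):
    """Angle (radians) between the sensor z-axis and the 'up' direction."""
    up_z = -gravity[2]  # 'up' is opposite to gravity
    return math.acos(max(-1.0, min(1.0, up_z)))


if __name__ == "__main__":
    g = gravity_direction([(0.0, 0.0, 9.81), (0.0, 0.0, 9.79)])
    print(g, tilt_from_vertical(g))
```

A real initialization would also have to handle vehicle acceleration and sensor bias, which this sketch ignores.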

Similar papers

Benchmarking Cameras for OpenVSLAM Indoors

Kevin Chappellet, Guillaume Caron, Fumio Kanehiro, Ken Sakurada, Abderrahmane Kheddar

Auto-TLDR; OpenVSLAM: Benchmarking Camera Types for Visual Simultaneous Localization and Mapping

In this paper we benchmark different types of cameras and evaluate their performance in terms of localization reliability and precision in Visual Simultaneous Localization and Mapping (vSLAM). Such benchmarking has been done for visual odometry, but never for vSLAM. Existing studies usually compare several algorithms for a given camera. This work is the first to handle the dual of the latter, i.e., comparing several cameras for a given SLAM algorithm. The evaluation methodology we propose is applied to the recent OpenVSLAM framework, which is versatile enough to natively deal with perspective, fisheye and 360 cameras in a monocular or stereoscopic setup, and in RGB or RGB-D modalities. Results on various sequences containing light variation and scene modifications show quantitatively that 360 vision achieves the highest localization rate. In contrast, RGB-D vision shows the lowest localization rate, but the highest precision when localization is possible. Stereo-fisheye offers a trade-off in localization rate and precision between 360 vision and RGB-D vision. The dataset with ground truth will be made available in open access to allow evaluating other and future vSLAM algorithms with respect to these camera types.

RISEdb: A Novel Indoor Localization Dataset

Carlos Sanchez Belenguer, Erik Wolfart, Álvaro Casado Coscollá, Vitor Sequeira

Auto-TLDR; Indoor Localization Using LiDAR SLAM and Smartphones: A Benchmarking Dataset

In this paper we introduce a novel public dataset for developing and benchmarking indoor localization systems. We have selected and 3D mapped a set of representative indoor environments including a large office building, a conference room, a workshop, an exhibition area and a restaurant. Our acquisition pipeline is based on a portable LiDAR SLAM backpack to map the buildings and to accurately track the pose of the user as they move freely inside them. We introduce the calibration procedures that enable us to acquire and geo-reference live data coming from different independent sensors rigidly attached to the backpack. This has allowed us to collect long sequences of spherical and stereo images, together with all the sensor readings coming from a consumer smartphone, and to locate them inside the map with centimetre accuracy. The dataset addresses many of the limitations of existing indoor localization datasets regarding the scale and diversity of the mapped buildings; the number of acquired sequences under varying conditions; the accuracy of the ground-truth trajectory; the availability of a detailed 3D model and the availability of different sensor types. It enables the benchmarking of existing indoor localization approaches and the development of new ones, in particular deep-learning-based systems that require large amounts of labeled training data.

A Two-Step Approach to Lidar-Camera Calibration

Yingna Su, Yaqing Ding, Jian Yang, Hui Kong

Auto-TLDR; Closed-Form Calibration of Lidar-camera System for Ego-motion Estimation and Scene Understanding

Autonomous vehicles and robots are typically equipped with a Lidar and a camera. Hence, calibrating the Lidar-camera system is of extreme importance for ego-motion estimation and scene understanding. In this paper, we propose a two-step approach (coarse + fine) for the external calibration between a camera and a multiple-line Lidar. First, a new closed-form solution is proposed to obtain the initial calibration parameters. We compare our solution with the state-of-the-art SVD-based algorithm and show its benefits in both efficiency and stability. With the initial calibration parameters, an ICP-based calibration framework is used to register the point clouds extracted from the camera and Lidar coordinate frames, respectively. Our method has been applied to two Lidar-camera systems: an HDL-64E Lidar-camera system and a VLP-16 Lidar-camera system. Experimental results demonstrate that our method achieves promising performance and higher accuracy than two open-source methods.

NetCalib: A Novel Approach for LiDAR-Camera Auto-Calibration Based on Deep Learning

Shan Wu, Amnir Hadachi, Damien Vivet, Yadu Prabhakar

Auto-TLDR; Automatic Calibration of LiDAR and Cameras using Deep Neural Network

A fusion of LiDAR and cameras has been widely used in many robotics applications such as classification, segmentation, object detection, and autonomous driving. It is essential that the LiDAR sensor measures distances accurately, since this complements the cameras well. Hence, calibrating the sensors before deployment is a mandatory step. Conventional methods rely on checkerboards, specific patterns, or human labeling, which is tedious and labor-intensive if the same calibration process has to be repeated every time. The main purpose of this research work is to build a deep neural network that is capable of automatically finding the geometric transformation between LiDAR and cameras. The results show that our model manages to recover the transformations from randomly sampled artificial errors. In addition, our work is open-sourced so that the community can build on the methodology to develop the approach further and to initiate collaboration and innovation on the topic.

Better Prior Knowledge Improves Human-Pose-Based Extrinsic Camera Calibration

Olivier Moliner, Sangxia Huang, Kalle Åström

Auto-TLDR; Improving Human-pose-based Extrinsic Calibration for Multi-Camera Systems

Accurate extrinsic calibration of wide baseline multi-camera systems enables better understanding of 3D scenes for many applications and is of great practical importance. Classical Structure-from-Motion calibration methods require special calibration equipment so that accurate point correspondences can be detected between different views. In addition, an operator with some training is usually needed to ensure that data is collected in a way that leads to good calibration accuracy. This limits the ease of adoption of such technologies. Recently, methods have been proposed to use human pose estimation models to establish point correspondences, thus removing the need for any special equipment. The challenge with this approach is that human pose estimation algorithms typically produce much less accurate feature points compared to classical patch-based methods. Another problem is that ambient human motion might not be optimal for calibration. We build upon prior works and introduce several novel ideas to improve the accuracy of human-pose-based extrinsic calibration. Our first contribution is a robust reprojection loss based on a better understanding of the sources of pose estimation error. Our second contribution is a 3D human pose likelihood model learned from motion capture data. We demonstrate significant improvements in calibration accuracy by evaluating our method on four publicly available datasets.

Generic Merging of Structure from Motion Maps with a Low Memory Footprint

Gabrielle Flood, David Gillsjö, Patrik Persson, Anders Heyden, Kalle Åström

Auto-TLDR; A Low-Memory Footprint Representation for Robust Map Merge

With the development of cheap image sensors, the amount of available image data has increased enormously, and the possibility of using crowdsourced collection methods has emerged. This calls for the development of ways to handle all these data. In this paper, we present new tools that will enable efficient, flexible and robust map merging. Assuming that separate optimisations have been performed for the individual maps, we show how only relevant data can be stored in a low memory footprint representation. We use these representations to perform map merging so that the algorithm is invariant to the merging order and independent of the choice of coordinate system. The result is a robust algorithm that can be applied to several maps simultaneously. The result of a merge can also be represented with the same type of low memory footprint format, which enables further merging and updating of the map in a hierarchical way. Furthermore, the method can perform loop closing and also detect changes in the scene between the capture of the different image sequences. Using both simulated and real data, captured with a handheld mobile phone and with a drone, we verify the performance of the proposed method.

Rotational Adjoint Methods for Learning-Free 3D Human Pose Estimation from IMU Data

Caterina Emilia Agelide Buizza, Yiannis Demiris

Auto-TLDR; Learning-free 3D Human Pose Estimation from Inertial Measurement Unit Data

We present a new framework for learning-free 3D human pose estimation from Inertial Measurement Unit (IMU) data. The proposed method does not rely on a full motion sequence to calculate a pose for any particular time point and thus can operate in real-time. A cost function based only on joint rotations is used, removing the need for frequent transformations between rotations and 3D Cartesian coordinates. A Jacobian that preserves skeleton structure is derived using Adjoint methods from Variational Data Assimilation. To facilitate further research in IMU-based Motion Capture, we provide a dataset that combines RGB and depth images from an Intel RealSense camera, marker-based motion capture from an Optitrack system and Xsens IMU data. We have evaluated our method on both our dataset and the Total Capture dataset, showing an average error across 24 joints of 0.45 and 0.48 radians respectively.
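
To make an error figure in radians concrete, the sketch below computes the geodesic angle between an estimated and a reference joint rotation and averages it over joints. This is one common way to measure rotation error; the abstract does not state which metric the authors use, so treat it as an assumption. Rotations are represented as 3x3 nested lists, and the names `rotation_angle_error`, `mean_joint_error` and `rot_z` are hypothetical.

```python
import math


def rotation_angle_error(R_est, R_ref):
    """Geodesic angle (radians) between two 3x3 rotation matrices."""
    # trace(R_est^T R_ref) equals the sum of element-wise products.
    trace = sum(R_est[i][j] * R_ref[i][j] for i in range(3) for j in range(3))
    cos_theta = (trace - 1.0) / 2.0
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, cos_theta)))


def mean_joint_error(est_rotations, ref_rotations):
    """Average rotation error over all joints of one pose."""
    errors = [rotation_angle_error(e, r)
              for e, r in zip(est_rotations, ref_rotations)]
    return sum(errors) / len(errors)


def rot_z(theta):
    """Rotation by theta radians about the z-axis (helper for the demo)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


if __name__ == "__main__":
    est = [rot_z(0.3), rot_z(1.0)]
    ref = [rot_z(0.8), rot_z(1.0)]
    print(mean_joint_error(est, ref))
```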

Learning to Segment Dynamic Objects Using SLAM Outliers

Dupont Romain, Mohamed Tamaazousti, Hervé Le Borgne

Auto-TLDR; Automatic Segmentation of Dynamic Objects Using SLAM Outliers Using Consensus Inversion

We present a method to automatically learn to segment dynamic objects using SLAM outliers. It requires only one monocular sequence per dynamic object for training and consists of localizing dynamic objects using SLAM outliers, creating their masks, and using these masks to train a semantic segmentation network. We integrate the trained network in ORB-SLAM 2 and LDSO. At runtime we remove features on dynamic objects, making the SLAM unaffected by them. We also propose a new stereo dataset and new metrics to evaluate SLAM robustness. Our dataset includes consensus inversions, i.e., situations where the SLAM uses more features on dynamic objects than on the static background. Consensus inversions are challenging for SLAM as they may cause major SLAM failures. Our approach performs better than the state of the art on the TUM RGB-D dataset in monocular mode and on our dataset in both monocular and stereo modes.

Map-Based Temporally Consistent Geolocalization through Learning Motion Trajectories

Bing Zha, Alper Yilmaz

Auto-TLDR; Exploiting Motion Trajectories for Geolocalization of Object on Topological Map using Recurrent Neural Network

In this paper, we propose a novel trajectory learning method that exploits motion trajectories on a topological map using a recurrent neural network for temporally consistent geolocalization of an object. Inspired by humans' ability to be aware of both the distance and the direction of self-motion during navigation, our trajectory learning method learns a pattern representation of trajectories encoded as a sequence of distances and turning angles to assist self-localization. We pose the learning process as a conditional sequence prediction problem in which each output locates the object on a traversable edge in a map. Since the predicted sequence ought to be topologically connected in the graph-structured map, we adopt two different hypothesis generation and elimination strategies to eliminate disconnected sequence predictions. We demonstrate our approach on the KITTI stereo visual odometry dataset, which covers a city-scale environment. The key benefits of our approach to geolocalization are that we 1) take advantage of the powerful sequence modeling ability of recurrent neural networks and their robustness to noisy input, 2) only require a map in the form of a graph and 3) simply use an affordable sensor that generates a motion trajectory. The experiments show that the motion trajectories can be learned by training a recurrent neural network, and that temporally consistent geolocation can be predicted with both of the proposed strategies.

Extending Single Beam Lidar to Full Resolution by Fusing with Single Image Depth Estimation

Yawen Lu, Yuxing Wang, Devarth Parikh, Guoyu Lu

Auto-TLDR; Self-supervised LIDAR for Low-Cost Depth Estimation

Depth estimation plays an important role in indoor and outdoor scene understanding, autonomous driving, augmented reality and many other tasks. Vehicles and robots are able to use active illumination sensors such as LIDAR to obtain high-precision depth estimates. However, high-resolution LIDARs are usually too expensive, which limits their large-scale deployment across applications. Although single-beam LIDAR enjoys the benefit of low cost, one-beam depth sensing is usually not sufficient to perceive the surrounding environment in many scenarios. In this paper, we propose a learning-based framework that aims to replicate performance similar to, or even higher than, that of costly LIDARs using our self-supervised network and a low-cost single-beam LIDAR. After accurate calibration with a visible camera, the single-beam LIDAR can correct the scale uncertainty of the depth map estimated from the visible camera. The adjusted depth map combines the high resolution and sensing accuracy of a high-beam LIDAR with the low cost of a single-beam LIDAR. We can thus achieve a sensing effect similar to that of a high-beam LIDAR at a price 50 to 100 times lower (e.g., a $80,000 Velodyne HDL-64E LIDAR vs. a $1,000 SICK TIM-781 2D LIDAR and a normal camera). The proposed approach is verified on our collected dataset and a public dataset, with superior depth-sensing performance.
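
The idea of using a few metric LIDAR ranges to resolve the scale of a camera-based depth map can be sketched with simple median scaling, shown below in Python. This is a common baseline and only a stand-in: the paper describes a learned, self-supervised framework, which is not reproduced here. The sketch assumes the LIDAR returns have already been projected into the image, so that each range is paired with the predicted depth at the same pixel; `scale_from_lidar` and `rescale_depth_map` are hypothetical names.

```python
import statistics


def scale_from_lidar(pred_depths, lidar_ranges):
    """Estimate one global scale factor from paired depth values.

    pred_depths: up-to-scale depths predicted at the pixels hit by the beam.
    lidar_ranges: metric ranges measured by the single-beam LIDAR there.
    The median of the ratios is robust to a few bad correspondences.
    """
    ratios = [lidar / pred
              for pred, lidar in zip(pred_depths, lidar_ranges)
              if pred > 0 and lidar > 0]
    if not ratios:
        raise ValueError("no valid depth/range pairs")
    return statistics.median(ratios)


def rescale_depth_map(depth_map, scale):
    """Apply the scale factor to every pixel of a 2D depth map."""
    return [[d * scale for d in row] for row in depth_map]


if __name__ == "__main__":
    s = scale_from_lidar([1.0, 2.0, 4.0], [2.0, 4.0, 8.2])
    print(s, rescale_depth_map([[1.0, 3.0]], s))
```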

Total Estimation from RGB Video: On-Line Camera Self-Calibration, Non-Rigid Shape and Motion

Antonio Agudo

Auto-TLDR; Joint Auto-Calibration, Pose and 3D Reconstruction of a Non-rigid Object from an uncalibrated RGB Image Sequence

In this paper we present a sequential approach to jointly retrieve camera auto-calibration, camera pose and the 3D reconstruction of a non-rigid object from an uncalibrated RGB image sequence, without assuming any prior information about the shape structure and without requiring a calibration pattern or any training data. To this end, we propose a Bayesian filtering approach based on a sum-of-Gaussians filter composed of a bank of extended Kalman filters (EKF). For every EKF, we make use of dynamic models to estimate its state vector; these estimates are later combined through the sum of Gaussians to achieve a global solution. To deal with deformable objects, we incorporate a mechanical model solved by using the finite element method. Thanks to these ingredients, the resulting method is both efficient and robust to several artifacts such as missing and noisy observations as well as sudden camera motions, while being applicable to a wide variety of objects and materials, including isometric and elastic shape deformations. We validate the method in real experiments, showing its strengths with respect to competing approaches.

Camera Calibration Using Parallel Line Segments

Gaku Nakano

Auto-TLDR; Closed-Form Calibration of Surveillance Cameras using Parallel 3D Line Segment Projections

This paper proposes a camera calibration method suitable for surveillance cameras using the image projection of parallel 3D line segments of the same length. We assume that vertical line segments are perpendicular to the ground plane and their bottom end-points are on the ground plane. Under this assumption, the camera parameters can be directly solved by at least two line segments without estimating vanishing points. Extending the minimal solution, we derive a closed-form solution to the least squares case with more than two line segments. Lens distortion is jointly optimized in bundle adjustment. Synthetic data evaluation shows that the best depression angle of a camera is around 50 degrees. In real data evaluation, we use body joints of pedestrians as vertical line segments. The experimental results on publicly available datasets show that the proposed method with a human pose detector can correctly calibrate wide-angle cameras including radial distortion.

A Plane-Based Approach for Indoor Point Clouds Registration

Ketty Favre, Muriel Pressigout, Luce Morin, Eric Marchand

Auto-TLDR; A plane-based registration approach for indoor environments based on LiDAR data

Iterative Closest Point (ICP) is one of the most widely used algorithms for 3D point cloud registration. This classical approach can be impacted by the large number of points contained in a point cloud. Planar structures, which are less numerous than points, can be used in well-structured man-made environments. In this paper we propose a plane-based registration method for indoor environments, inspired by the ICP algorithm. This method is based solely on data acquired with a LiDAR sensor. A new metric based on plane characteristics is introduced to find the best plane correspondences. The optimal transformation is estimated through a two-step minimization approach, successively performing robust plane-to-plane minimization and non-linear robust point-to-plane registration. Experiments on the Autonomous Systems Lab (ASL) dataset show that the proposed method successfully registers 100% of the scans from the three indoor sequences. Experiments also show that the proposed method is more robust in large motion scenarios than other state-of-the-art algorithms.
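
For readers unfamiliar with the point-to-plane term, the Python sketch below evaluates the signed distances of points to a plane and the resulting least-squares cost. It only illustrates the residual that a point-to-plane registration minimizes; the robust weighting, the plane matching metric and the optimization of the transformation described in the abstract are not shown. The plane is assumed to be given as a unit normal `n` and offset `d` with `n . x + d = 0`, and the function names are hypothetical.

```python
def point_to_plane_residuals(points, plane_normal, plane_offset):
    """Signed distances of 3D points to the plane n . x + d = 0.

    Assumption: plane_normal has unit length, so the value is a distance
    in the same units as the points.
    """
    return [sum(n * c for n, c in zip(plane_normal, p)) + plane_offset
            for p in points]


def point_to_plane_cost(points, plane_normal, plane_offset):
    """Sum of squared point-to-plane distances (plain least squares)."""
    residuals = point_to_plane_residuals(points, plane_normal, plane_offset)
    return sum(r * r for r in residuals)


if __name__ == "__main__":
    # Plane z = 1, i.e. n = (0, 0, 1) and d = -1.
    pts = [(0.0, 0.0, 1.0), (2.0, 3.0, 3.0)]
    print(point_to_plane_residuals(pts, (0.0, 0.0, 1.0), -1.0))
    print(point_to_plane_cost(pts, (0.0, 0.0, 1.0), -1.0))
```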

P2D: A Self-Supervised Method for Depth Estimation from Polarimetry

Marc Blanchon, Desire Sidibe, Olivier Morel, Ralph Seulin, Daniel Braun, Fabrice Meriaudeau

Auto-TLDR; Polarimetric Regularization for Monocular Depth Estimation

Monocular depth estimation is a recurring subject in the field of computer vision. Its ability to describe scenes via a depth map while reducing the constraints related to the formulation of perspective geometry tends to favor its use. However, despite the constant improvement of algorithms, most methods exploit only colorimetric information. Consequently, robustness to phenomena to which this modality is not sensitive, such as specularity or transparency, is neglected. In response, we propose using polarimetry as an input for a self-supervised monodepth network, exploiting polarization cues to encourage accurate reconstruction of scenes. Furthermore, we add a polarimetric regularization term to a state-of-the-art method to take specific advantage of the data. Our method is evaluated both qualitatively and quantitatively, demonstrating that this new information, together with an enhanced loss function, improves depth estimation results, especially in specular areas.

Edge-Aware Monocular Dense Depth Estimation with Morphology

Zhi Li, Xiaoyang Zhu, Haitao Yu, Qi Zhang, Yongshi Jiang

Auto-TLDR; Spatio-Temporally Smooth Dense Depth Maps Using Only a CPU

Dense depth maps play an important role in Computer Vision (CV) and Augmented Reality (AR). For CV applications, a dense depth map is the cornerstone of 3D reconstruction, allowing real objects to be precisely displayed in the computer. In AR, dense depth maps make it possible to handle occlusion relationships between virtual content and real objects correctly, for a better user experience. However, the computational cost of dense depth maps has limited their development. We present a novel algorithm that produces low-latency, spatio-temporally smooth dense depth maps using only a CPU. The depth maps exhibit sharp discontinuities at depth edges while keeping computational complexity low. Our algorithm first obtains the sparse SLAM reconstruction, then extracts coarse depth edges from a down-sampled RGB image by morphology operations. Next, we thin the depth edges and align them with image edges. Finally, a Warm-Start initialization scheme and an improved optimization solver are adopted to accelerate convergence. We evaluate our proposal quantitatively, and the results show improved depth map accuracy with respect to other state-of-the-art and baseline techniques.

Calibration and Absolute Pose Estimation of Trinocular Linear Camera Array for Smart City Applications

Martin Ahrnbom, Mikael Nilsson, Håkan Ardö, Kalle Åström, Oksana Yastremska-Kravchenko, Aliaksei Laureshyn

Auto-TLDR; Trinocular Linear Camera Array Calibration for Traffic Surveillance Applications

A method for calibrating a Trinocular Linear Camera Array (TLCA) for traffic surveillance applications, such as those aimed at smart cities, is presented. A TLCA-specific parametrization guarantees that the calibration finds a model where all the cameras are on a straight line. The method uses both a chequerboard close to the camera and measured 3D points far from the camera: points measured in world coordinates, together with their corresponding 2D points found manually in the images. Superior calibration accuracy can be obtained compared to standard methods using only a single data source, largely due to the use of chequerboards, while the line constraint in the parametrization allows for joint rectification. The triangulation accuracy improves from 8-12 cm to around 6 cm when calibrating with 30-50 points in our experiment, allowing better road user analysis. The method is demonstrated by a proof-of-concept application where a point cloud is generated from multiple disparity maps, visualizing road user detections in 3D.

Multiple Future Prediction Leveraging Synthetic Trajectories

Lorenzo Berlincioni, Federico Becattini, Lorenzo Seidenari, Alberto Del Bimbo

Auto-TLDR; Synthetic Trajectory Prediction using Markov Chains

Trajectory prediction is an important task, especially in autonomous driving. The ability to forecast the position of other moving agents can lead to effective planning, ensuring safety for the autonomous vehicle as well as for the observed entities. In this work we propose a data-driven approach based on Markov Chains to generate synthetic trajectories, which are useful for training a multiple future trajectory predictor. The advantages are twofold: on the one hand, synthetic samples can be used to augment existing datasets and train more effective predictors; on the other hand, the approach makes it possible to generate samples with multiple ground truths, corresponding to diverse, equally likely outcomes of the observed trajectory. We define a trajectory prediction model and a loss that explicitly address the multimodality of the problem, and we show that combining synthetic and real data leads to prediction improvements, obtaining state-of-the-art results.
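
A minimal sketch of Markov-chain trajectory generation, assuming trajectories have been discretized into symbolic motion states (the labels "straight", "left" and "right" below are invented for illustration). Transition probabilities are estimated from observed sequences by counting, and new sequences are sampled from them. The state representation and chain order used in the paper are not given in the abstract, so this first-order chain is an assumption, and `fit_transitions` and `sample_trajectory` are hypothetical names.

```python
import random
from collections import defaultdict


def fit_transitions(sequences):
    """Estimate first-order transition probabilities by counting."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    return {
        state: {nxt: c / sum(successors.values())
                for nxt, c in successors.items()}
        for state, successors in counts.items()
    }


def sample_trajectory(transitions, start, length, rng):
    """Sample up to `length` states; stop early at a state with no successor."""
    trajectory = [start]
    while len(trajectory) < length and trajectory[-1] in transitions:
        successors = transitions[trajectory[-1]]
        states = list(successors)
        weights = [successors[s] for s in states]
        trajectory.append(rng.choices(states, weights=weights, k=1)[0])
    return trajectory


if __name__ == "__main__":
    observed = [["straight", "straight", "left"], ["straight", "right"]]
    model = fit_transitions(observed)
    # Different seeds give different, equally plausible futures.
    for seed in range(3):
        print(sample_trajectory(model, "straight", 5, random.Random(seed)))
```

Sampling several times from the same start state yields multiple plausible continuations, which is the property the abstract exploits to obtain samples with multiple ground truths.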

Vehicle Lane Merge Visual Benchmark

Kai Cordes, Hellward Broszio

Auto-TLDR; A Benchmark for Automated Cooperative Maneuvering Using Multi-view Video Streams and Ground Truth Vehicle Description

Automated driving is regarded as the most promising technology for improving road safety in the future. In this context, connected vehicles have an important role regarding their ability to perform cooperative maneuvers for challenging traffic situations. We propose a benchmark for automated cooperative maneuvers. The targeted cooperative maneuver is the vehicle lane merge where a vehicle on the acceleration lane merges into the traffic of a motorway. The benchmark enables the evaluation of vehicle localization approaches as well as the study of cooperative maneuvers. It consists of temporally synchronized multi-view video streams, highly accurate camera calibration, and ground truth vehicle descriptions, including position, heading, speed, and shape. For benchmark generation, the lane merge maneuver is performed by human drivers on a test track, resulting in 120 lane merge data sets with various traffic situations and video recording conditions.

Localization of Unmanned Aerial Vehicles in Corridor Environments Using Deep Learning

Ram Padhy, Shahzad Ahmad, Sachin Verma, Sambit Bakshi, Pankaj Kumar Sa

Auto-TLDR; A monocular vision assisted localization algorithm for indoor corridor environments

We propose a monocular vision assisted localization algorithm that helps a UAV navigate safely in indoor corridor environments. The aim is always to navigate the UAV through a corridor in the forward direction, keeping it at the center and not tilted toward either the left or the right side. The algorithm takes the RGB image captured by the UAV front camera and passes it through a trained Deep Neural Network (DNN) to predict the position of the UAV as being on the left, center or right side of the corridor. Depending on the deviation of the UAV from an imaginary central line, known as the central bisector line (CBL) of the corridor, a suitable command is generated to bring the UAV to the center. When the UAV is at the center of the corridor, a new image is passed through another trained DNN to predict the orientation of the UAV with respect to the CBL of the corridor. If the UAV is tilted to the left or right, an appropriate command is generated to rectify the orientation. We also propose a new corridor dataset, named UAVCorV1, which contains images captured by the UAV front camera when the UAV is at all possible locations in a variety of corridors. An exhaustive set of experiments in different corridors reveals the efficacy of the proposed algorithm.

Unconstrained Vision Guided UAV Based Safe Helicopter Landing

Arindam Sikdar, Abhimanyu Sahu, Debajit Sen, Rohit Mahajan, Ananda Chowdhury

Auto-TLDR; Autonomous Helicopter Landing in Hazardous Environments from Unmanned Aerial Images Using Constrained Graph Clustering

In this paper, we address the problem of automated detection of safe zones for helicopter landing in hazardous environments from images captured by an Unmanned Aerial Vehicle (UAV). The unconstrained motion of the image-capturing drone (the UAV in our case) makes the problem even more difficult. The solution pipeline consists of natural landmark detection and tracking, stereo-pair generation using constrained graph clustering, digital terrain map construction and safe landing zone detection. The main methodological contribution lies in mathematically formulating the epipolar constraint and then using it in a Minimum Spanning Tree (MST) based graph clustering approach. We have also made the AHL (Autonomous Helicopter Landing) dataset publicly available; it is a new aerial video dataset captured by a drone, with annotated ground truths. Experimental comparisons with other competing clustering methods, i) in terms of the Dunn Index and the Davies-Bouldin Index and ii) for frame-level safe zone detection in terms of F-measure and confusion matrix, clearly demonstrate the effectiveness of the proposed formulation.
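
The general idea of MST-based graph clustering can be sketched as follows: build a minimum spanning tree over the items and cut its long edges, so that each remaining connected component is a cluster. The Python sketch below does this with Kruskal's algorithm and a union-find structure, stopping at the first edge longer than a threshold. It uses plain Euclidean distance between points as the edge weight, which is an assumption made for illustration; the paper instead builds its constraint from epipolar geometry, and that formulation is not reproduced here. `mst_clusters` and `max_edge` are hypothetical names.

```python
import math


def mst_clusters(points, max_edge):
    """Cluster points by cutting MST edges longer than max_edge.

    Kruskal's algorithm adds edges in increasing order of weight; stopping
    at the first edge above max_edge leaves exactly the components obtained
    by removing the long edges from the full MST.
    Returns one integer cluster label per point.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    for weight, i, j in edges:
        if weight > max_edge:
            break
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    # Relabel the roots as consecutive integers in order of appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]


if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10)]
    print(mst_clusters(pts, max_edge=2.0))
```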

Minimal Solvers for Indoor UAV Positioning

Marcus Valtonen Örnhag, Patrik Persson, Mårten Wadenbäck, Kalle Åström, Anders Heyden

Auto-TLDR; Relative Pose Solvers for Visual Indoor UAV Navigation

In this paper we consider a collection of relative pose problems which arise naturally in applications for visual indoor UAV navigation. We focus on cases where additional information from an onboard IMU is available and thus provides a partial extrinsic calibration through the gravitational vector. The solvers are designed for a partially calibrated camera, for a variety of realistic indoor scenarios, which makes it possible to navigate using images of the ground floor. Current state-of-the-art solvers use more general assumptions, such as using arbitrary planar structures; however, these solvers do not yield adequate reconstructions for real scenes, nor do they perform fast enough to be incorporated in real-time systems. We show that the proposed solvers enjoy better numerical stability, are faster, and require fewer point correspondences, compared to state-of-the-art solvers. These properties are vital components for robust navigation in real-time systems, and we demonstrate on both synthetic and real data that our method outperforms other methods, and yields superior motion estimation.

Object-Oriented Map Exploration and Construction Based on Auxiliary Task Aided DRL

Junzhe Xu, Jianhua Zhang, Shengyong Chen, Honghai Liu

Auto-TLDR; Auxiliary Task Aided Deep Reinforcement Learning for Environment Exploration by Autonomous Robots

Environment exploration by autonomous robots through deep reinforcement learning (DRL) based methods has attracted increasing attention. However, existing methods usually focus on robot navigation to single or multiple fixed goals, while ignoring the perception and construction of external environments. In this paper, we propose a novel environment exploration task based on DRL, which requires a robot to perceive all objects of interest quickly and completely, and to reconstruct their poses in a global environment map, as far as the robot is able to. To this end, we design an auxiliary task aided DRL model, which is integrated with auxiliary object detection and 6-DoF pose estimation components. The outcome of the auxiliary tasks can improve the learning speed and robustness of DRL, as well as the accuracy of object pose estimation. Comprehensive experimental results on the indoor simulation platform AI2-THOR show the effectiveness and robustness of our method.

Two-Stage Adaptive Object Scene Flow Using Hybrid CNN-CRF Model

Congcong Li, Haoyu Ma, Qingmin Liao

Auto-TLDR; Adaptive object scene flow estimation using a hybrid CNN-CRF model and adaptive iteration

Scene flow estimation based on stereo sequences is a comprehensive task relevant to disparity and optical flow. Some existing methods are time-consuming and often fail in the presence of reflective surfaces. In this paper, we propose a two-stage adaptive object scene flow estimation method using a hybrid CNN-CRF model (ACOSF), which benefits from high-quality features and structured modelling capability. To balance computational efficiency and accuracy, we employ adaptive iteration for energy function optimization, which is flexible and efficient for various scenes. In addition, we use high-quality pixel selection to reduce the computation time with only a slight decrease in accuracy. Our method achieves results competitive with the state of the art, ranking second on the challenging KITTI 2015 scene flow benchmark.

A Bayesian Approach to Reinforcement Learning of Vision-Based Vehicular Control

Zahra Gharaee, Karl Holmquist, Linbo He, Michael Felsberg

Auto-TLDR; Bayesian Reinforcement Learning for Autonomous Driving

In this paper, we present a state-of-the-art reinforcement learning method for autonomous driving. Our approach employs temporal difference learning in a Bayesian framework to learn vehicle control signals from sensor data. The agent has access to images from a forward facing camera, which are pre-processed to generate semantic segmentation maps. We trained our system using both ground truth and estimated semantic segmentation input. Based on our observations from a large set of experiments, we conclude that training the system on ground truth input data leads to better performance than training the system on estimated input even if estimated input is used for evaluation. The system is trained and evaluated in a realistic simulated urban environment using the CARLA simulator. The simulator also contains a benchmark that allows for comparing to other systems and methods. The required training time of the system is shown to be lower and the performance on the benchmark superior to competing approaches.

Holistic Grid Fusion Based Stop Line Estimation

Runsheng Xu, Faezeh Tafazzoli, Li Zhang, Timo Rehfeld, Gunther Krehl, Arunava Seal

Auto-TLDR; Fused Multi-Sensory Data for Stop Lines Detection in Intersection Scenarios

Intersection scenarios provide the most complex traffic situations in Autonomous Driving and Driving Assistance Systems. Knowing where to stop in advance in an intersection is an essential parameter in controlling the longitudinal velocity of the vehicle. Most of the existing methods in literature solely use cameras to detect stop lines, which is typically not sufficient in terms of detection range. To address this issue, we propose a method that takes advantage of fused multi-sensory data including stereo camera and lidar as input and utilizes a carefully designed convolutional neural network architecture to detect stop lines. Our experiments show that the proposed approach can improve detection range compared to camera data alone, works under heavy occlusion without observing the ground markings explicitly, is able to predict stop lines for all lanes and allows detection at a distance up to 50 meters.

Can You Trust Your Pose? Confidence Estimation in Visual Localization

Luca Ferranti, Xiaotian Li, Jani Boutellier, Juho Kannala

Auto-TLDR; Pose Confidence Estimation in Large-Scale Environments: A Light-weight Approach to Improving Pose Estimation Pipeline

Camera pose estimation in large-scale environments is still an open question and, despite recent promising results, it may still fail in some situations. The research so far has focused on improving subcomponents of estimation pipelines, to achieve more accurate poses. However, there is no guarantee for the result to be correct, even though the correctness of pose estimation is critically important in several visual localization applications, such as in autonomous navigation. In this paper we bring to attention a novel research question, pose confidence estimation, where we aim at quantifying how reliable the visually estimated pose is. We develop a novel confidence measure to fulfill this task and show that it can be flexibly applied to different datasets, indoor or outdoor, and for various visual localization pipelines. We also show that the proposed techniques can be used to accomplish a secondary goal: improving the accuracy of existing pose estimation pipelines. Finally, the proposed approach is computationally light-weight and adds only a negligible increase to the computational effort of pose estimation.

HPERL: 3D Human Pose Estimation from RGB and LiDAR

Michael Fürst, Shriya T.P. Gupta, René Schuster, Oliver Wasenmüler, Didier Stricker

Auto-TLDR; 3D Human Pose Estimation Using RGB and LiDAR Using Weakly-Supervised Approach

In-the-wild human pose estimation has a huge potential for various fields, ranging from animation and action recognition to intention recognition and prediction for autonomous driving. The current state-of-the-art is focused only on RGB and RGB-D approaches for predicting the 3D human pose. However, not using precise LiDAR depth information limits the performance and leads to very inaccurate absolute pose estimation. With LiDAR sensors becoming more affordable and common on robots and autonomous vehicle setups, we propose an end-to-end architecture using RGB and LiDAR to predict the absolute 3D human pose with unprecedented precision. Additionally, we introduce a weakly-supervised approach to generate 3D predictions using 2D pose annotations from PedX. This allows for many new opportunities in the field of 3D human pose estimation.

Movement-Induced Priors for Deep Stereo

Yuxin Hou, Muhammad Kamran Janjua, Juho Kannala, Arno Solin

Auto-TLDR; Fusing Stereo Disparity Estimation with Movement-induced Prior Information

We propose a method for fusing stereo disparity estimation with movement-induced prior information. Instead of independent inference frame-by-frame, we formulate the problem as a non-parametric learning task in terms of a temporal Gaussian process prior with a movement-driven kernel for inter-frame reasoning. We present a hierarchy of three Gaussian process kernels depending on the availability of motion information, where our main focus is on a new gyroscope-driven kernel for handheld devices with low-quality MEMS sensors, thus also relaxing the requirement of having full 6D camera poses available. We show how our method can be combined with two state-of-the-art deep stereo methods. The method either works in a plug-and-play fashion with pre-trained deep stereo networks, or can be further improved by jointly training the kernels together with encoder-decoder architectures, leading to consistent improvement.

Real-Time End-To-End Lane ID Estimation Using Recurrent Networks

Ibrahim Halfaoui, Fahd Bouzaraa, Onay Urfalioglu

Auto-TLDR; Real-Time, Vision-Only Lane Identification Using Monocular Camera

Acquiring information about the road lane structure is a crucial step for autonomous navigation. To this end, several approaches tackle this task from different perspectives such as lane marking detection or semantic lane segmentation. However, to the best of our knowledge, there is not yet a purely vision-based end-to-end solution to answer the precise question: How to estimate the relative number or "ID" of the currently driven lane within a multi-lane road or a highway? In this work, we propose a real-time, vision-only (i.e. monocular camera) solution to the problem based on a dual left-right convention. We interpret this task as a classification problem by limiting the maximum number of lane candidates to eight. Our approach is designed to meet low-complexity specifications and limited runtime requirements. It harnesses the temporal dimension inherent to the input sequences to improve upon high-complexity state-of-the-art models. We achieve more than 95% accuracy on a challenging test set with extreme conditions and different routes.

Attention Based Coupled Framework for Road and Pothole Segmentation

Shaik Masihullah, Ritu Garg, Prerana Mukherjee, Anupama Ray

Auto-TLDR; Few Shot Learning for Road and Pothole Segmentation on KITTI and IDD

In this paper, we propose a novel attention-based coupled framework for road and pothole segmentation. In many developing countries as well as in rural areas, the drivable areas are neither well-defined nor well-maintained. Under such circumstances, an Advanced Driver Assistance System (ADAS) is needed to assess the drivable area and warn of potholes ahead to ensure vehicle safety. Moreover, this information can also be used in structured environments for assessment and maintenance of road health. We demonstrate a few-shot learning approach for pothole detection that maintains accuracy even with fewer training samples. We report exhaustive experimental results for road segmentation on the KITTI and IDD datasets. We also present pothole segmentation on IDD.

CARRADA Dataset: Camera and Automotive Radar with Range-Angle-Doppler Annotations

Arthur Ouaknine, Alasdair Newson, Julien Rebut, Florence Tupin, Patrick Pérez

Auto-TLDR; CARRADA: A dataset of synchronized camera and radar recordings with range-angle-Doppler annotations for autonomous driving

High quality perception is essential for autonomous driving (AD) systems. To reach the accuracy and robustness that are required by such systems, several types of sensors must be combined. Currently, mostly cameras and laser scanners (lidar) are deployed to build a representation of the world around the vehicle. While radar sensors have been used for a long time in the automotive industry, they are still under-used for AD despite their appealing characteristics (notably, their ability to measure the relative speed of obstacles and to operate even in adverse weather conditions). To a large extent, this situation is due to the relative lack of automotive datasets with real radar signals that are both raw and annotated. In this work, we introduce CARRADA, a dataset of synchronized camera and radar recordings with range-angle-Doppler annotations. We also present a semi-automatic annotation approach, which was used to annotate the dataset, and a radar semantic segmentation baseline, which we evaluate on several metrics. Both our code and dataset will be released.

RefiNet: 3D Human Pose Refinement with Depth Maps

Andrea D'Eusanio, Stefano Pini, Guido Borghi, Roberto Vezzani, Rita Cucchiara

Auto-TLDR; RefiNet: A Multi-stage Framework for 3D Human Pose Estimation

Human Pose Estimation is a fundamental task for many applications in the Computer Vision community and it has been widely investigated in the 2D domain, i.e. intensity images. Therefore, most of the available methods for this task are mainly based on 2D Convolutional Neural Networks and huge manually-annotated RGB datasets, achieving stunning results. In this paper, we propose RefiNet, a multi-stage framework that regresses an extremely-precise 3D human pose estimation from a given 2D pose and a depth map. The framework consists of three different modules, each one specialized in a particular refinement and data representation, i.e. depth patches, 3D skeleton and point clouds. Moreover, we collect a new dataset, namely Baracca, acquired with RGB, depth and thermal cameras and specifically created for the automotive context. Experimental results confirm the quality of the refinement procedure that largely improves the human pose estimations of off-the-shelf 2D methods.

Partially Supervised Multi-Task Network for Single-View Dietary Assessment

Ya Lu, Thomai Stathopoulou, Stavroula Mougiakakou

Auto-TLDR; Food Volume Estimation from a Single Food Image via Geometric Understanding and Semantic Prediction

Food volume estimation is an essential step in the pipeline of dietary assessment and demands the precise depth estimation of the food surface and table plane. Existing methods based on computer vision require either multi-image input or additional depth maps, reducing convenience of implementation and practical significance. Despite the recent advances in unsupervised depth estimation from a single image, the achieved performance in the case of large texture-less areas needs to be improved. In this paper, we propose a network architecture that jointly performs geometric understanding (i.e., depth prediction and 3D plane estimation) and semantic prediction on a single food image, enabling a robust and accurate food volume estimation regardless of the texture characteristics of the target plane. For the training of the network, only monocular videos with semantic ground truth are required, while the depth map and 3D plane ground truth are no longer needed. Experimental results on two separate food image databases demonstrate that our method performs robustly on texture-less scenarios and is superior to unsupervised networks and structure from motion based approaches, while it achieves comparable performance to fully-supervised methods.

Dynamic Resource-Aware Corner Detection for Bio-Inspired Vision Sensors

Sherif Abdelmonem Sayed Mohamed, Jawad Yasin, Mohammad-Hashem Haghbayan, Antonio Miele, Jukka Veikko Heikkonen, Hannu Tenhunen, Juha Plosila

Auto-TLDR; Three Layer Filtering-Harris Algorithm for Event-based Cameras in Real-Time

Event-based cameras are vision devices that transmit only brightness changes with low latency and ultra-low power consumption. Such characteristics make event-based cameras attractive in the field of localization and object tracking in resource-constrained systems. Since the number of generated events in such cameras is huge, the selection and filtering of the incoming events are beneficial for both increasing the accuracy of the features and reducing the computational load. In this paper, we present an algorithm to detect asynchronous corners from a stream of events in real-time on embedded systems. The algorithm is called the Three Layer Filtering-Harris or TLF-Harris algorithm. The algorithm is based on an event filtering strategy whose purpose is 1) to increase the accuracy by deliberately eliminating some incoming events, i.e., noise, and 2) to improve the real-time performance of the system, i.e., preserving a constant throughput in terms of input events per second, by discarding unnecessary events with a limited accuracy loss. An approximation of the Harris algorithm, in turn, is used to exploit its high-quality detection capability with a low-complexity implementation to enable seamless real-time performance on embedded computing platforms. The proposed algorithm is capable of selecting the best corner candidate among neighbors and achieves an average execution time savings of 59% compared with the conventional Harris score. Moreover, our approach outperforms the competing methods, such as eFAST, eHarris, and FA-Harris, in terms of real-time performance, and surpasses Arc* in terms of accuracy.

Hybrid Approach for 3D Head Reconstruction: Using Neural Networks and Visual Geometry

Oussema Bouafif, Bogdan Khomutenko, Mohammed Daoudi

Auto-TLDR; Recovering 3D Head Geometry from a Single Image using Deep Learning and Geometric Techniques

Recovering the 3D geometric structure of a face from a single input image is a challenging and active research area in computer vision. In this paper, we present a novel method for reconstructing 3D heads from one or more images using a hybrid approach based on deep learning and geometric techniques. We propose an encoder-decoder network based on the U-net architecture and trained on synthetic data only. It predicts both pixel-wise normal vectors and landmark maps from a single input photo. Landmarks are used for the pose computation and the initialization of the optimization problem, which, in turn, reconstructs the 3D head geometry by using a parametric morphable model and normal vector fields. State-of-the-art results are achieved in qualitative and quantitative evaluation tests on both single and multi-view settings. Despite the fact that the model was trained only on synthetic data, it successfully recovers 3D geometry and precise poses for real-world images.

Loop-closure detection by LiDAR scan re-identification

Jukka Peltomäki, Xingyang Ni, Jussi Puura, Joni-Kristian Kamarainen, Heikki Juhani Huttunen

Auto-TLDR; Loop-Closing Detection from LiDAR Scans Using Convolutional Neural Networks

In this work, loop-closure detection from LiDAR scans is defined as an image re-identification problem. Re-identification is performed by computing Euclidean distances of a query scan to a gallery set of previous scans. The distances are computed in a feature embedding space where the scans are mapped by a convolutional neural network (CNN). The network is trained using the triplet loss training strategy. In our experiments we compare different backbone networks, variants of the triplet loss, and generic and LiDAR-specific data augmentation techniques. With a realistic indoor dataset, the best architecture obtains a mean average precision (mAP) above 90%.
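
The re-identification step described in this abstract can be illustrated with a minimal sketch. It assumes scan embeddings are already available as NumPy vectors (the CNN itself is not shown), and the function names, the margin value and the distance threshold are hypothetical choices for illustration, not taken from the paper.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss on Euclidean distances (margin is an assumed value)."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    # Penalize triplets where the positive is not closer than the negative by `margin`.
    return max(0.0, d_pos - d_neg + margin)


def rank_gallery(query, gallery):
    """Return gallery indices sorted by Euclidean distance to the query embedding."""
    dists = np.linalg.norm(gallery - query, axis=1)
    order = np.argsort(dists)
    return order, dists[order]


def detect_loop_closure(query, gallery, threshold):
    """Return the index of the closest previous scan, or None if it is too far away.

    `threshold` is a hypothetical acceptance distance in embedding space.
    """
    order, dists = rank_gallery(query, gallery)
    if len(order) > 0 and dists[0] <= threshold:
        return int(order[0])
    return None


if __name__ == "__main__":
    # Toy 2D "embeddings" of three previously visited places.
    gallery = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]])
    query = np.array([0.9, 0.0])
    print(detect_loop_closure(query, gallery, threshold=0.5))
```

In this toy setting the query lies closest to the second gallery entry, so that entry is reported as the loop-closure candidate; a stricter threshold would reject it.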

Ghost Target Detection in 3D Radar Data Using Point Cloud Based Deep Neural Network

Mahdi Chamseddine, Jason Rambach, Oliver Wasenmüler, Didier Stricker

Auto-TLDR; Point Based Deep Learning for Ghost Target Detection in 3D Radar Point Clouds

Ghost targets are targets that appear at wrong locations in radar data and are caused by the presence of multiple indirect reflections between the target and the sensor. In this work, we introduce the first point-based deep learning approach for ghost target detection in 3D radar point clouds. This is done by extending the PointNet network architecture, modifying its input to include radar point features beyond location and introducing skip connections. We compare different input modalities and analyze the effects of the changes we introduced. We also propose an approach for automatic labeling of ghost targets in 3D radar data using lidar as reference. The algorithm is trained and tested on real data in various driving scenarios, and the tests show promising results in classifying real and ghost radar targets.

A Globally Optimal Method for the PnP Problem with MRP Rotation Parameterization

Manolis Lourakis, George Terzakis

Auto-TLDR; A Direct least squares, algebraic PnP solver with modified Rodrigues parameters

The perspective-n-point (PnP) problem is of fundamental importance in computer vision. A global optimality condition for PnP that is independent of a particular rotation parameterization was recently developed by Nakano. This paper puts forward a direct least squares, algebraic PnP solution that extends Nakano's work by combining his optimality condition with the modified Rodrigues parameters (MRPs) for parameterizing rotation. The result is a system of polynomials that is solved using the Groebner basis approach. An MRP vector has twice the rotational range of the classical Rodrigues (i.e., Cayley) vector used by Nakano to represent rotation. The proposed solver provides strong guarantees that the full rotation singularity associated with MRPs is avoided. Furthermore, detailed experiments provide evidence that our solver attains accuracy that is indistinguishable from Nakano's Cayley-based solution with a moderate increase in computational cost.
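
The claim about rotational range follows from the standard definitions of the two parameterizations, sketched here for a rotation by angle $\theta$ about a unit axis $\mathbf{e}$. The symbols $\mathbf{g}$ and $\boldsymbol{\sigma}$ are notation assumed for this note, not taken from the paper.

```latex
% Classical Rodrigues (Cayley/Gibbs) vector: singular at \theta = \pm\pi
\mathbf{g} = \tan\!\left(\tfrac{\theta}{2}\right)\mathbf{e}

% Modified Rodrigues parameters: singular only at \theta = \pm 2\pi
\boldsymbol{\sigma} = \tan\!\left(\tfrac{\theta}{4}\right)\mathbf{e}

% The two are related by the tangent double-angle identity
\mathbf{g} = \frac{2\,\boldsymbol{\sigma}}{1 - \boldsymbol{\sigma}^{\top}\boldsymbol{\sigma}}
```

Because the MRP vector only becomes unbounded at a full revolution, it can represent rotations up to (but excluding) $\pm 360^\circ$, whereas the Cayley vector already diverges at $\pm 180^\circ$.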

Inner Eye Canthus Localization for Human Body Temperature Screening

Claudio Ferrari, Lorenzo Berlincioni, Marco Bertini, Alberto Del Bimbo

Auto-TLDR; Automatic Localization of the Inner Eye Canthus in Thermal Face Images using 3D Morphable Face Model

In this paper, we propose an automatic approach for localizing the inner eye canthus in thermal face images. We first coarsely detect 5 facial keypoints corresponding to the center of the eyes, the nose tip and the ears. Then we compute a sparse 2D-3D point correspondence using a 3D Morphable Face Model (3DMM). This correspondence is used to project the entire 3D face onto the image, and subsequently locate the inner eye canthus. Detecting this location makes it possible to obtain the most precise body temperature measurement for a person using a thermal camera. We evaluated the approach on a thermal face dataset provided with manually annotated landmarks. However, such manual annotations are normally conceived to identify facial parts such as eyes, nose and mouth, and are not specifically tailored for localizing the eye canthus region. As an additional contribution, we enrich the original dataset by using the annotated landmarks to deform and project the 3DMM onto the images. Then, by manually selecting a small region corresponding to the eye canthus, we enrich the dataset with additional annotations. By using the manual landmarks, we ensure the correctness of the 3DMM projection, which can be used as ground truth for future evaluations. Moreover, we supply the dataset with the 3D head poses and per-point visibility masks for detecting self-occlusions. The data will be publicly released.

Visual Prediction of Driver Behavior in Shared Road Areas

Peter Gawronski, Darius Burschka

Auto-TLDR; Predicting Vehicle Behavior in Shared Road Segment Intersections Using Topological Knowledge

We propose a framework to analyze and predict vehicle behavior within shared road segments such as intersections or narrow passages. The system first identifies critical interaction regions based on topological knowledge. It then checks for possibly colliding trajectories from the current state of vehicles in the scene, defined by overlapping occupation times in road segments. For each possible interaction area, it analyzes the behavioral profile of both vehicles. Depending on right of way and (unpredictable) behavior parameters, different outcomes are expected and are tested against the input. The interaction between vehicles is analyzed over a short time horizon based on an initial action from one vehicle and the reaction by the other. The vehicle that has to yield most often performs the first action, and the response of the opponent vehicle is measured after a reaction time. The observed reaction is classified by attention (whether there was a reaction at all) and by the collaboration of the opponent vehicle (whether it helps to resolve the situation or hinders it). The output is a classification of the behavior of the involved vehicles in terms of active participation in the interaction and assertiveness of driving style, i.e. collaborative or disruptive behavior. The additional knowledge is used to refine the prediction of intention and outcome of a scene, which is then compared to the current status to catch unexpected behavior. The applicability of the concept and ideas of the approach is validated on scenarios from the recent Intersection Drone (inD) data set.

Mobile Augmented Reality: Fast, Precise, and Smooth Planar Object Tracking

Dmitrii Matveichev, Daw-Tung Lin

Auto-TLDR; Planar Object Tracking with Sparse Optical Flow Tracking and Descriptor Matching

We propose an innovative method for combining sparse optical flow tracking and descriptor matching algorithms. The proposed approach solves the following problems that are inherent to keypoint-based and optical-flow-based tracking algorithms: spatial jitter, extreme scale transformation, extreme perspective transformation, degradation in the number of tracking points, and drifting of tracking points. Our algorithm provides smooth object-position tracking under six-degrees-of-freedom transformations at a small computational cost, providing a high-quality real-time AR experience on mobile platforms. We experimentally demonstrate that our approach outperforms state-of-the-art tracking algorithms while offering faster computation. A mobile augmented reality (AR) application developed using our approach delivers planar object tracking at 30 FPS on modern mobile phones for a camera resolution of 1280×720. Finally, we compare the performance of our AR application with that of a Vuforia-based AR application on the same planar objects database. The test results show that our AR application delivers a better AR experience than Vuforia in terms of smooth transition of object pose between video frames.

Self-Supervised Detection and Pose Estimation of Logistical Objects in 3D Sensor Data

Nikolas Müller, Jonas Stenzel, Jian-Jia Chen

Auto-TLDR; A self-supervised and fully automated deep learning approach for object pose estimation using simulated 3D data

Localization of objects in cluttered scenes with machine learning methods is a fairly young research area. Despite the high potential of object localization for full process automation in Industry 4.0 and logistical environments, 3D data sets for training machine learning models for such applications are not openly available, and few publications have addressed the topic. To the authors' knowledge, this is the first publication that describes a self-supervised and fully automated deep learning approach for object pose estimation using simulated 3D data. The solution covers the simulated generation of training data, the detection of objects in point clouds using a fully convolutional feedforward network and the computation of the pose for each detected object instance.

Towards life-long mapping of dynamic environments using temporal persistence modeling

Georgios Tsamis, Ioannis Kostavelis, Dimitrios Giakoumis, Dimitrios Tzovaras

Auto-TLDR; Lifelong Mapping for Mobile Robot Navigation in Dynamic Environments

Contemporary SLAM mapping systems assume a static environment and build a map that is then used for mobile robot navigation, disregarding dynamic changes in this environment. The paper at hand presents a novel solution for the lifelong mapping problem that continually updates a metric map, represented as a 2D occupancy grid, in large-scale indoor environments with movable objects such as people and robots, making it suitable for industrial applications. We formalize each cell's occupancy as a failure analysis problem and contribute temporal persistence modeling (TPM), an algorithm for probabilistic prediction of the time that a cell in an observed location is expected to be "occupied" or "empty", given sparse prior observations from a task-specific mobile robot. Our work is evaluated in the Gazebo simulation environment against the nominal occupancy of cells and the estimated persistence of obstacles. We also show that robot navigation with lifelong mapping demands fewer re-plans and leads to more efficient navigation in highly dynamic environments.

Air-Writing with Sparse Network of Radars Using Spatio-Temporal Learning

Muhammad Arsalan, Avik Santra, Kay Bierzynski, Vadim Issakov

Auto-TLDR; An Air-writing System for Sparse Radars using Deep Convolutional Neural Networks

Hand gesture and motion sensing offer an intuitive and natural form of human-machine interface. Air-writing systems allow users to draw alphanumeric or linguistic characters on a virtual board in the air through hand gestures. Traditionally, radar-based air-writing systems have relied on a network of at least three radars to localize the hand target through a trilateration algorithm, followed by tracking to extract the drawn trajectory, and then by recognition of the drawn character using either a Long Short-Term Memory (LSTM) network on the sensed trajectory or a Deep Convolutional Neural Network (DCNN) on a 2D image reconstructed from the trajectory. However, practical deployments of such systems are limited, since detection of the finger or hand target by all three radars cannot be guaranteed, leading to failure of the trilateration algorithm. Further, placement of three or more radars for the air-writing solution is neither always physically feasible nor cost-effective. Furthermore, these solutions do not exploit the full potential of deep neural networks, which are generally capable of learning features implicitly. In this paper, we propose an air-writing system based on a sparse network of radars, i.e. strictly fewer than three, using a 1D DCNN-LSTM-1D transposed DCNN architecture to reconstruct and classify the drawn character utilizing only the range information from each radar. The paper employs real data from one and two 60 GHz millimeter-wave radar sensors to demonstrate the success of the proposed air-writing solution.

Motion Segmentation with Pairwise Matches and Unknown Number of Motions

Federica Arrigoni, Tomas Pajdla, Luca Magri

Auto-TLDR; Motion Segmentation using Multi-Model Fitting and Permutation Synchronization

In this paper we address motion segmentation, that is, the problem of clustering points in multiple images according to a number of moving objects. Two-frame correspondences are assumed as input without prior knowledge about trajectories. Our method is based on principles from "multi-model fitting" and "permutation synchronization" and, unlike previous techniques working under the same assumptions, it can handle an unknown number of motions. The proposed approach is validated on standard datasets, showing that it can correctly estimate the number of motions while maintaining comparable or better accuracy than the state of the art.

Deep Next-Best-View Planner for Cross-Season Visual Route Classification

Kurauchi Kanya, Kanji Tanaka

Auto-TLDR; Active Visual Place Recognition using Deep Convolutional Neural Network

This paper addresses the problem of active visual place recognition (VPR) from a novel perspective of long-term autonomy. In our approach, a next-best-view (NBV) planner plans an optimal action-observation sequence to maximize the expected cost-performance for a visual route classification task. A difficulty arises from the fact that the NBV planner is trained and tested in different domains (times of day, weather conditions, and seasons). Existing NBV methods may be confused and degraded by domain shifts, and require significant effort to adapt to a new domain. We address this issue with a novel deep convolutional neural network (DNN)-based NBV planner that does not require such adaptation. Our main contributions in this paper are summarized as follows: (1) We present a novel domain-invariant NBV planner that is specifically tailored for DNN-based VPR. (2) We formulate active VPR as a POMDP problem and present a feasible solution to address the inherent intractability. Specifically, the probability distribution vector (PDV) output by the available DNN is used as a domain-invariant observation model without the need to retrain it. (3) We verify the efficacy of the proposed approach through challenging cross-season VPR experiments, where it is confirmed that the proposed approach clearly outperforms previous single-view-based or multi-view-based VPR in terms of VPR accuracy and/or action-observation cost.

Future Urban Scenes Generation through Vehicles Synthesis

Alessandro Simoni, Luca Bergamini, Andrea Palazzi, Simone Calderara, Rita Cucchiara

Auto-TLDR; Predicting the Future of an Urban Scene with a Novel View Synthesis Paradigm

In this work we propose a deep learning pipeline to predict the future visual appearance of an urban scene. Despite recent advances, generating the entire scene in an end-to-end fashion is still far from being achieved. Instead, here we follow a two-stage approach, where interpretable information is included in the loop and each actor is modelled independently. We leverage a per-object novel view synthesis paradigm, that is, generating a synthetic representation of an object undergoing a geometrical roto-translation in 3D space. Our model can be easily conditioned with constraints (e.g. input trajectories) provided by state-of-the-art tracking methods or by the user. This allows us to generate a set of diverse realistic futures starting from the same input in a multi-modal fashion. We visually and quantitatively show the superiority of this approach over traditional end-to-end scene-generation methods on CityFlow, a challenging real-world dataset.

Effective Deployment of CNNs for 3DoF Pose Estimation and Grasping in Industrial Settings

Daniele De Gregorio, Riccardo Zanella, Gianluca Palli, Luigi Di Stefano

Auto-TLDR; Automated Deep Learning for Robotic Grasping Applications

In this paper we investigate how to effectively deploy deep learning in practical industrial settings, such as robotic grasping applications. When a deep-learning-based solution is proposed, it usually lacks any simple method to generate the training data. In the industrial field, where automation is the main goal, not bridging this gap is one of the main reasons why deep learning is not as widespread as it is in the academic world. For this reason, in this work we developed a system composed of a 3-DoF Pose Estimator based on Convolutional Neural Networks (CNNs) and an effective procedure to gather massive amounts of training images in the field with minimal human intervention. By automating the labeling stage, we also obtain very robust systems suitable for production-level usage. An open source implementation of our solution is provided, along with the dataset used for the experimental evaluation.