Polarimetric Image Augmentation

Marc Blanchon, Fabrice Meriaudeau, Olivier Morel, Ralph Seulin, Desire Sidibe

Auto-TLDR; Polarimetric Augmentation for Deep Learning in Robotics Applications

This paper deals with new augmentation methods for polarimetry, an unconventional imaging modality sensitive to the physics of the observed scene. In nature, polarized light is obtained by reflection or scattering. Robotics applications in urban environments are subject to many obstacles that can be specular and therefore provide polarized light. These areas are prone to segmentation errors using standard modalities, but such errors could be resolved using the information carried by the polarized light. Deep Convolutional Neural Networks (DCNNs) have shown excellent segmentation results, but require a significant amount of data to achieve their best performance. The lack of data is usually overcome by using augmentation methods. However, unlike RGB images, polarization images are not only scalar (intensity) images, and standard augmentation techniques cannot be applied straightforwardly. We propose enhancing deep learning models through a regularized augmentation procedure applied to polarimetric data in order to characterize scenes more effectively under challenging conditions. We subsequently observe an average 18.1% improvement in IoU between non-augmented and regularized training procedures on real-world data.
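
To illustrate why a geometric augmentation cannot be applied to polarization images as if they were plain intensity images, here is a minimal sketch. It assumes a four-angle polarizer layout (0, 45, 90, 135 degrees) and an angle of linear polarization (AoLP) measured from the image x-axis; the function names and the sign convention for rotation are illustrative assumptions, not the procedure from the paper. The point is that rotating or flipping the pixel grid must be accompanied by a matching change of the angle stored in each pixel.

```python
import math


def stokes_from_intensities(i0, i45, i90, i135):
    """Linear Stokes parameters from intensities behind four polarizer angles."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)  # total intensity
    s1 = i0 - i90                       # horizontal vs. vertical component
    s2 = i45 - i135                     # diagonal components
    return s0, s1, s2


def dolp_aolp(s0, s1, s2):
    """Degree and angle of linear polarization; AoLP is wrapped to [0, pi)."""
    dolp = math.hypot(s1, s2) / s0 if s0 > 0 else 0.0
    aolp = (0.5 * math.atan2(s2, s1)) % math.pi
    return dolp, aolp


def rotate_aolp(aolp, theta):
    """AoLP after an in-plane image rotation by theta radians.

    Assumption: positive theta rotates in the same direction in which the
    AoLP increases; flip the sign of theta for the opposite convention.
    """
    return (aolp + theta) % math.pi


def hflip_aolp(aolp):
    """AoLP after a horizontal mirror (x -> -x), angle measured from the x-axis."""
    return (math.pi - aolp) % math.pi
```

In this sketch the intensity and the degree of polarization are left unchanged by rotation and flipping; only the angle channel needs the extra correction after the pixels have been moved.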

Similar papers

P2D: A Self-Supervised Method for Depth Estimation from Polarimetry

Marc Blanchon, Desire Sidibe, Olivier Morel, Ralph Seulin, Daniel Braun, Fabrice Meriaudeau

Auto-TLDR; Polarimetric Regularization for Monocular Depth Estimation

Monocular depth estimation is a recurring subject in the field of computer vision. Its ability to describe scenes via a depth map while reducing the constraints related to the formulation of perspective geometry tends to favor its use. However, despite the constant improvement of algorithms, most methods exploit only colorimetric information. Consequently, robustness to phenomena to which the modality is not sensitive, like specularity or transparency, is neglected. In response, we propose using polarimetry as an input for a self-supervised monodepth network, exploiting polarization cues to encourage accurate reconstruction of scenes. Furthermore, we add a polarimetric regularization term to a state-of-the-art method to take specific advantage of the data. Our method is evaluated both qualitatively and quantitatively, demonstrating that the contribution of this new information, as well as an enhanced loss function, improves depth estimation results, especially for specular areas.

A Fine-Grained Dataset and Its Efficient Semantic Segmentation for Unstructured Driving Scenarios

Kai Andreas Metzger, Peter Mortimer, Hans J "Joe" Wuensche

Auto-TLDR; TAS500: A Semantic Segmentation Dataset for Autonomous Driving in Unstructured Environments

Research in autonomous driving for unstructured environments suffers from a lack of semantically labeled datasets compared to its urban counterpart. Urban and unstructured outdoor environments are challenging due to the varying lighting and weather conditions during a day and across seasons. In this paper, we introduce TAS500, a novel semantic segmentation dataset for autonomous driving in unstructured environments. TAS500 offers fine-grained vegetation and terrain classes to learn drivable surfaces and natural obstacles in outdoor scenes effectively. We evaluate the performance of modern semantic segmentation models with an additional focus on their efficiency. Our experiments demonstrate the advantages of fine-grained semantic classes to improve the overall prediction accuracy, especially along the class boundaries. The dataset, code, and pretrained model are available online.

Surface Material Dataset for Robotics Applications (SMDRA): A Dataset with Friction Coefficient and RGB-D for Surface Segmentation

Donghun Noh, Hyunwoo Nam, Min Sung Ahn, Hosik Chae, Sangjoon Lee, Kyle Gillespie, Dennis Hong

Auto-TLDR; A Surface Material Dataset for Robotics Applications

In this paper, we introduce the Surface Material Dataset for Robotics Applications (SMDRA), a collection of RGB color images, depth data, and pixel-wise friction coefficient data for 10 different materials. It targets computer vision research for robotics applications that require physical contact between the robot and its environment, such as robotic manipulators or walking robots. The selected surface materials are both easily accessible in daily life and cover a wide range of friction coefficients. Our dataset is unique in that, while there is an abundance of RGB-D data due to the popularization of imaging sensors, additional pixel-wise aligned data of a different modality are not readily available. The depth data is collected by an active stereo camera, which has shown promise in a variety of robotic applications. In addition, this dataset is greatly expanded with friction coefficient data. As for humans, this additional information can be helpful in ensuring proper decision making in tasks ranging from grasping orientation and strength to path determination in an unstructured environment. A newly developed friction measuring device was used to obtain this data. We verify that existing Convolutional Neural Network (CNN) architectures, the Fully Convolutional Network (FCN) and U-Net, can be trained on the SMDRA. This result demonstrates that the SMDRA can be used to train a neural network model for segmentation, and that these different modes are not just additional information but valuable modes that researchers can incorporate and exploit when applying computer vision algorithms on robotic platforms.

RISEdb: A Novel Indoor Localization Dataset

Carlos Sanchez Belenguer, Erik Wolfart, Álvaro Casado Coscollá, Vitor Sequeira

Auto-TLDR; Indoor Localization Using LiDAR SLAM and Smartphones: A Benchmarking Dataset

In this paper we introduce a novel public dataset for developing and benchmarking indoor localization systems. We have selected and 3D mapped a set of representative indoor environments including a large office building, a conference room, a workshop, an exhibition area and a restaurant. Our acquisition pipeline is based on a portable LiDAR SLAM backpack to map the buildings and to accurately track the pose of the user as it moves freely inside them. We introduce the calibration procedures that enable us to acquire and geo-reference live data coming from different independent sensors rigidly attached to the backpack. This has allowed us to collect long sequences of spherical and stereo images, together with all the sensor readings coming from a consumer smartphone and locate them inside the map with centimetre accuracy. The dataset addresses many of the limitations of existing indoor localization datasets regarding the scale and diversity of the mapped buildings; the number of acquired sequences under varying conditions; the accuracy of the ground-truth trajectory; the availability of a detailed 3D model and the availability of different sensor types. It enables the benchmarking of existing and the development of new indoor localization approaches, in particular for deep learning based systems that require large amounts of labeled training data.

Benchmarking Cameras for OpenVSLAM Indoors

Kevin Chappellet, Guillaume Caron, Fumio Kanehiro, Ken Sakurada, Abderrahmane Kheddar

Auto-TLDR; OpenVSLAM: Benchmarking Camera Types for Visual Simultaneous Localization and Mapping

In this paper we benchmark different types of cameras and evaluate their performance in terms of localization reliability and precision in Visual Simultaneous Localization and Mapping (vSLAM). Such benchmarking has so far been reported only for visual odometry, never for vSLAM. Existing studies usually compare several algorithms for a given camera. This work is the first to handle the dual of the latter, i.e. comparing several cameras for a given SLAM algorithm. The evaluation methodology we propose is applied to the recent OpenVSLAM framework. The latter is versatile enough to natively deal with perspective, fisheye, and 360 cameras in a monocular or stereoscopic setup, and in RGB or RGB-D modalities. Results on various sequences containing light variation and scene modifications show quantitatively that 360 vision achieves the maximum localization rate. On the contrary, RGB-D vision shows the lowest localization rate, but the highest precision when localization is possible. Stereo-fisheye offers a trade-off in localization rate and precision between 360 vision and RGB-D vision. The dataset with ground truth will be made available in open access to allow evaluating other/future vSLAM algorithms with respect to these camera types.

Enhancing Deep Semantic Segmentation of RGB-D Data with Entangled Forests

Matteo Terreran, Elia Bonetto, Stefano Ghidoni

Auto-TLDR; FuseNet: A Lighter Deep Learning Model for Semantic Segmentation

Semantic segmentation is a problem which is getting more and more attention in the computer vision community. Nowadays, deep learning methods represent the state of the art for this problem, and the trend is to use deeper networks to get higher performance. The drawback of such models is a higher computational cost, which makes it difficult to integrate them on mobile robot platforms. In this work we explore how to obtain lighter deep learning models without compromising performance. To do so, we consider the features used in the Entangled Random Forest algorithm and study the best strategies to integrate them within the FuseNet deep network. Such new features allow us to shrink the network size without losing performance, hence obtaining a lighter model which achieves state-of-the-art performance on the semantic segmentation task and represents an interesting alternative for mobile robotics applications, where computational power and energy are limited.

Extending Single Beam Lidar to Full Resolution by Fusing with Single Image Depth Estimation

Yawen Lu, Yuxing Wang, Devarth Parikh, Guoyu Lu

Auto-TLDR; Self-supervised LIDAR for Low-Cost Depth Estimation

Depth estimation plays an important role in indoor and outdoor scene understanding, autonomous driving, augmented reality and many other tasks. Vehicles and robots are able to use active illumination sensors such as LIDAR to obtain high-precision depth estimates. However, high-resolution LIDARs are usually too expensive, which limits their mass deployment in various applications. Though single-beam LIDAR enjoys the benefit of low cost, one-beam depth sensing is usually not sufficient to perceive the surrounding environment in many scenarios. In this paper, we propose a learning-based framework that aims to match or even exceed the performance of costly LIDARs using our self-supervised network and a low-cost single-beam LIDAR. After accurate calibration with a visible camera, the single-beam LIDAR can resolve the scale uncertainty of the depth map estimated from the visible camera. The adjusted depth map enjoys the high resolution and sensing accuracy of a high-beam LIDAR while keeping the low cost of a single-beam LIDAR. Thus we can achieve a sensing effect similar to that of a high-beam LIDAR at a price 50-100 times lower (e.g., an $80000 Velodyne HDL-64E LIDAR vs. a $1000 SICK TIM-781 2D LIDAR and a normal camera). The proposed approach is verified on our collected dataset and a public dataset, showing superior depth-sensing performance.
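
The idea of using a few metric LIDAR returns to fix the unknown scale of a monocular depth map can be sketched as follows. This is a generic median-ratio alignment written for illustration only; the abstract does not say how the adjustment is actually computed, and the function names `lidar_scale` and `rescale` are hypothetical.

```python
from statistics import median


def lidar_scale(pred_depths, lidar_depths):
    """Scale factor aligning predicted depths to LIDAR depths.

    Both arguments hold depths at the same pixels, i.e. the pixels hit by the
    single LIDAR beam after camera-LIDAR calibration. The median of the
    per-pixel ratios is used so that a few outliers do not dominate.
    """
    ratios = [l / p for p, l in zip(pred_depths, lidar_depths) if p > 0 and l > 0]
    if not ratios:
        raise ValueError("no valid LIDAR/prediction pairs")
    return median(ratios)


def rescale(depth_map, scale):
    """Apply the global scale to every pixel of a depth map (list of rows)."""
    return [[d * scale for d in row] for row in depth_map]


# Usage: three beam pixels, relative prediction vs. metric LIDAR reading.
scale = lidar_scale([1.0, 2.0, 4.0], [2.0, 4.0, 8.4])
metric_depth = rescale([[1.0, 3.0]], scale)
```

A single global factor is the simplest possible choice; a learned or spatially varying correction would need more than this sketch shows.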

Derivation of Geometrically and Semantically Annotated UAV Datasets at Large Scales from 3D City Models

Sidi Wu, Lukas Liebel, Marco Körner

Auto-TLDR; Large-Scale Dataset of Synthetic UAV Imagery for Geometric and Semantic Annotation

While in high demand for the development of deep learning approaches, extensive datasets of annotated UAV imagery are still scarce today. Manual annotation is time-consuming and has thus limited the potential for creating large-scale datasets. We tackle this challenge by presenting a procedure for the automatic creation of simulated UAV image sequences in urban areas and pixel-level annotations from publicly available data sources. We synthesize photo-realistic UAV imagery from Google Earth Studio and derive annotations from an open CityGML model that provides not only geometric but also semantic information. The first dataset we created as an example using our approach contains 144000 images of Berlin, Germany, with four types of annotations, namely semantic labels as well as depth, surface normals, and edge maps. In the future, a complete pipeline addressing all the technical problems will be provided, together with more accurate models to refine some of the current empirical settings, to automatically generate a large-scale dataset with reliable ground-truth annotations over the whole city of Berlin. The dataset, as well as the source code, will be published by then. Different methods will also be applied to test the usability of the dataset. We believe our dataset can be used for, but is not limited to, tasks like pose estimation, geo-localization, monocular depth estimation, edge detection, building/surface classification, and plane segmentation. A potential research pipeline for geo-localization based on the synthetic dataset is provided.

Improving Robotic Grasping on Monocular Images Via Multi-Task Learning and Positional Loss

William Prew, Toby Breckon, Magnus Bordewich, Ulrik Beierholm

Auto-TLDR; Improving grasping performance from monocular colour images in an end-to-end CNN architecture with multi-task learning

In this paper we introduce two methods of improving real-time object grasping performance from monocular colour images in an end-to-end CNN architecture. The first is the addition of an auxiliary task during model training (multi-task learning). Our multi-task CNN model improves grasping performance from a baseline average of 72.04% to 78.14% on the large Jacquard grasping dataset when performing a supplementary depth reconstruction task. The second is introducing a positional loss function that emphasises loss per pixel for secondary parameters (gripper angle and width) only on points of an object where a successful grasp can take place. This increases performance from a baseline average of 72.04% to 78.92% as well as reducing the number of training epochs required. These methods can also be performed in tandem, resulting in a further performance increase to 79.12%, while maintaining sufficient inference speed to enable processing at 50 FPS.

Early Wildfire Smoke Detection in Videos

Taanya Gupta, Hengyue Liu, Bir Bhanu

Auto-TLDR; Semi-supervised Spatio-Temporal Video Object Segmentation for Automatic Detection of Smoke in Videos during Forest Fire

Recent advances in unmanned aerial vehicles and camera technology have proven useful for the detection of smoke that emerges above the trees during a forest fire. Automatic detection of smoke in videos is of great interest to fire departments. To date, in most parts of the world, fires are not detected at an early stage and generally turn catastrophic. This paper introduces a novel technique that integrates spatial and temporal features in a deep learning framework using semi-supervised spatio-temporal video object segmentation and dense optical flow. However, detecting smoke in the presence of haze and without labeled data is difficult. Considering the visibility of haze in the sky, a dark channel pre-processing method is used that reduces the amount of haze in video frames and consequently improves the detection results. Online training is performed on a video at test time, which reduces the need for ground-truth data. Tests using publicly available video datasets show that the proposed algorithms outperform previous work and are robust across different wildfire-threatened locations.
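
As a rough illustration of the quantity behind dark channel pre-processing, the sketch below computes the dark channel of an RGB frame: the minimum over the colour channels, followed by a minimum over a local window. It assumes the usual definition from the dark channel prior for dehazing; the window radius and the pure-Python image layout are illustrative assumptions, and the full haze-removal step built on top of it is not shown.

```python
def dark_channel(image, radius=1):
    """Dark channel of an image given as H x W rows of (r, g, b) tuples in [0, 1].

    Each output pixel is the smallest channel value found in the square
    window of the given radius around it (windows are clipped at the border).
    """
    height, width = len(image), len(image[0])
    # Step 1: per-pixel minimum over the three colour channels.
    min_rgb = [[min(pixel) for pixel in row] for row in image]
    # Step 2: minimum filter over the local window.
    dark = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            y0, y1 = max(0, y - radius), min(height, y + radius + 1)
            x0, x1 = max(0, x - radius), min(width, x + radius + 1)
            dark[y][x] = min(
                min_rgb[j][i] for j in range(y0, y1) for i in range(x0, x1)
            )
    return dark
```

Haze-free outdoor regions tend to have a dark channel close to zero, while hazy regions are bright in every channel and therefore keep a high value, which is what makes the quantity useful for estimating and reducing haze.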

Learning Non-Rigid Surface Reconstruction from Spatio-Temporal Image Patches

Matteo Pedone, Abdelrahman Mostafa, Janne Heikkilä

Auto-TLDR; Dense Spatio-Temporal Depth Maps of Deformable Objects from Video Sequences

We present a method to reconstruct a dense spatio-temporal depth map of a non-rigidly deformable object directly from a video sequence. The estimation of depth is performed locally on spatio-temporal patches of the video, and then the full depth video of the entire shape is recovered by combining them together. Since the geometric complexity of a local spatio-temporal patch of a deforming non-rigid object is often simple enough to be faithfully represented with a parametric model, we artificially generate a database of small deforming rectangular meshes rendered with different material properties and light conditions, along with their corresponding depth videos, and use such data to train a convolutional neural network. We tested our method on both synthetic and Kinect data and experimentally observed that the reconstruction error is significantly lower than the one obtained using other approaches like conventional non-rigid structure from motion.

Semi-Supervised Deep Learning Techniques for Spectrum Reconstruction

Adriano Simonetto, Vincent Parret, Alexander Gatto, Piergiorgio Sartor, Pietro Zanuttigh

Auto-TLDR; hyperspectral data estimation from RGB data using semi-supervised learning

State-of-the-art approaches for the estimation of hyperspectral images (HSI) from RGB data are mostly based on deep learning techniques, but due to the lack of training data their performance is limited to uncommon scenarios where a large hyperspectral database is available. In this work we present a family of novel deep learning schemes for hyperspectral data estimation that work when the hyperspectral information at our disposal is limited. Firstly, we introduce a learning scheme exploiting a physical model based on the backward mapping to the RGB space and total variation regularization that can be trained with a limited number of HSI images. Then, we propose a novel semi-supervised learning scheme able to work even with just a few pixels labeled with hyperspectral information. Finally, we show that the approach can be extended to a transfer learning scenario. The proposed techniques achieve impressive performance while requiring only a few HSI images or just a few pixels for training.

NetCalib: A Novel Approach for LiDAR-Camera Auto-Calibration Based on Deep Learning

Shan Wu, Amnir Hadachi, Damien Vivet, Yadu Prabhakar

Auto-TLDR; Automatic Calibration of LiDAR and Cameras using Deep Neural Network

The fusion of LiDAR and cameras has been widely used in many robotics applications such as classification, segmentation, object detection, and autonomous driving. It is essential that the LiDAR sensor can measure distances accurately, which is a good complement to the cameras. Hence, calibrating sensors before deployment is a mandatory step. Conventional methods rely on checkerboards, specific patterns, or human labeling, which is tedious and labor-intensive if the same calibration process has to be repeated every time. The main purpose of this research work is to build a deep neural network that is capable of automatically finding the geometric transformation between LiDAR and cameras. The results show that our model manages to find the transformations from randomly sampled artificial errors. In addition, our work is open-sourced so that the community can fully use the methodology to develop the approach further and to foster collaboration and innovation on the topic.

A Lumen Segmentation Method in Ureteroscopy Images Based on a Deep Residual U-Net Architecture

Jorge Lazo, Marzullo Aldo, Sara Moccia, Michele Catellani, Benoit Rosa, Elena De Momi, Michel De Mathelin, Francesco Calimeri

Auto-TLDR; A Deep Neural Network for Ureteroscopy with Residual Units

Ureteroscopy is becoming the first surgical treatment option for the majority of urinary disorders. This procedure is carried out using an endoscope which provides the surgeon with the visual and spatial information necessary to navigate inside the urinary tract. With a view to developing surgical assistance systems that could enhance the performance of the surgeon, lumen segmentation is a fundamental task, since the lumen is the visual reference that marks the path the endoscope should follow. This has not been analyzed in ureteroscopy data before. However, this task presents several challenges given the image quality and the conditions of ureteroscopy procedures themselves. In this paper, we study the implementation of a Deep Neural Network which exploits the advantage of residual units in an architecture based on U-Net. For the training of these networks, we analyze the use of two different color spaces: gray-scale and RGB images. We found that training on gray-scale images gives the best results, obtaining mean values of Dice Score, Precision, and Recall of 0.73, 0.58, and 0.92 respectively. The results show that a residual U-Net could be a suitable model for the further development of a computer-aided system for navigation and guidance through the urinary system.
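
For readers unfamiliar with the three reported metrics, the following minimal sketch computes the Dice score, precision, and recall of one binary segmentation mask against its ground truth, using the standard definitions in terms of true positives, false positives, and false negatives. The function name and the handling of empty masks are illustrative choices, not taken from the paper.

```python
def segmentation_scores(pred, target):
    """Dice score, precision, and recall for two binary masks (lists of rows)."""
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1      # lumen pixel correctly predicted
            elif p and not t:
                fp += 1      # predicted lumen where there is none
            elif t and not p:
                fn += 1      # missed lumen pixel
    denom = 2 * tp + fp + fn
    dice = 2 * tp / denom if denom else 1.0      # two empty masks agree fully
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return dice, precision, recall


# Usage: an over-segmenting prediction gives high recall but low precision.
dice, precision, recall = segmentation_scores([[1, 1], [1, 0]], [[1, 0], [0, 0]])
```

The usage example shows the same pattern as the reported numbers: a prediction that covers the whole target plus extra pixels reaches full recall while precision stays low.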

Machine-Learned Regularization and Polygonization of Building Segmentation Masks

Stefano Zorzi, Ksenia Bittner, Friedrich Fraundorfer

Auto-TLDR; Automatic Regularization and Polygonization of Building Segmentation masks using Generative Adversarial Network

We propose a machine learning based approach for automatic regularization and polygonization of building segmentation masks. Taking an image as input, we first predict building segmentation maps using a generic fully convolutional network (FCN). A generative adversarial network (GAN) is then used to regularize the building boundaries to make them more realistic, i.e., having more rectilinear outlines which form right angles where required. This is achieved through the interplay between the discriminator, which gives the probability of an input image being true, and the generator, which learns from the discriminator’s response to create more realistic images. Finally, we train the backbone convolutional neural network (CNN), which is adapted to predict sparse outcomes corresponding to building corners from the regularized building segmentation results. Experiments on three building segmentation datasets demonstrate that the proposed method is not only capable of obtaining accurate results, but also of producing visually pleasing building outlines parameterized as polygons.

Hybrid Approach for 3D Head Reconstruction: Using Neural Networks and Visual Geometry

Oussema Bouafif, Bogdan Khomutenko, Mohammed Daoudi

Auto-TLDR; Recovering 3D Head Geometry from a Single Image using Deep Learning and Geometric Techniques

Recovering the 3D geometric structure of a face from a single input image is a challenging active research area in computer vision. In this paper, we present a novel method for reconstructing 3D heads from a single or multiple image(s) using a hybrid approach based on deep learning and geometric techniques. We propose an encoder-decoder network based on the U-net architecture and trained on synthetic data only. It predicts both pixel-wise normal vectors and landmarks maps from a single input photo. Landmarks are used for the pose computation and the initialization of the optimization problem, which, in turn, reconstructs the 3D head geometry by using a parametric morphable model and normal vector fields. State-of-the-art results are achieved through qualitative and quantitative evaluation tests on both single and multi-view settings. Despite the fact that the model was trained only on synthetic data, it successfully recovers 3D geometry and precise poses for real-world images.

User-Independent Gaze Estimation by Extracting Pupil Parameter and Its Mapping to the Gaze Angle

Sang Yoon Han, Nam Ik Cho

Auto-TLDR; Gaze Point Estimation using Pupil Shape for Generalization

Since gaze estimation plays a crucial role in recognizing human intentions, it has been researched for a long time, and its accuracy is ever increasing. However, due to the wide variation in eye shapes and focusing abilities between individuals, the accuracy of most algorithms varies from person to person in the test group, especially when the initial calibration is not well performed. To alleviate this user dependency, we attempt to derive features that are general for most people and use them as the input to a deep network instead of using the images as the input. Specifically, we use the pupil shape as the core feature because it is directly related to the 3D eyeball rotation, and thus the gaze direction. While existing deep learning methods learn the gaze point by extracting various features from the image, we focus on the mapping function from the eyeball rotation to the gaze point by using the pupil shape as the input. It is shown that the accuracy of gaze point estimation also becomes robust for the uncalibrated points by following the characteristics of the mapping function. Also, our gaze network learns the gaze difference to facilitate the re-calibration process to fix the calibration-drift problem that typically occurs with glass-type or head-mounted devices.

Incorporating Depth Information into Few-Shot Semantic Segmentation

Yifei Zhang, Desire Sidibe, Olivier Morel, Fabrice Meriaudeau

Auto-TLDR; RDNet: A Deep Neural Network for Few-shot Segmentation Using Depth Information

Few-shot segmentation presents a significant challenge for semantic scene understanding under limited supervision. Specifically, this task aims to generalize the segmentation ability of the model to new categories given a few samples. In order to obtain complete scene information, we extend the RGB-centric methods to take advantage of complementary depth information. In this paper, we propose a two-stream deep neural network based on metric learning. Our method, known as RDNet, learns class-specific prototype representations within RGB and depth embedding spaces, respectively. The learned prototypes provide effective semantic guidance on the corresponding RGB and depth query image, leading to more accurate performance. Moreover, we build a novel outdoor scene dataset, known as Cityscapes-3i, using labeled RGB images and depth images from the Cityscapes dataset. We also perform ablation studies to explore the effective use of depth information in few-shot segmentation tasks. Experiments on Cityscapes-3i show that our method achieves promising results with visual and complementary geometric cues from only a few labeled examples.

3D Semantic Labeling of Photogrammetry Meshes Based on Active Learning

Mengqi Rong, Shuhan Shen, Zhanyi Hu

Auto-TLDR; 3D Semantic Expression of Urban Scenes Based on Active Learning

As different urban scenes are similar but still not completely consistent, and labeling directly in 3D is complex, high-level understanding of 3D scenes has always been a tricky problem. In this paper, we propose a procedural approach for 3D semantic expression of urban scenes based on active learning. We first start with a small labeled image set to fine-tune a semantic segmentation network, then project its probability map onto a 3D mesh model for fusion, and finally output a 3D semantic mesh model in which each facet has a semantic label, together with a heat model showing each facet’s confidence. Our key observation is that the algorithm is iterative: in each iteration, we use the output semantic model as supervision to select several valuable images for annotation, which then take part in the fine-tuning for overall improvement. In this way, we reduce the labeling workload without reducing the quality of the 3D semantic model. Using urban areas from two different cities, we show the potential of our method and demonstrate its effectiveness.

Planar 3D Transfer Learning for End to End Unimodal MRI Unbalanced Data Segmentation

Martin Kolarik, Radim Burget, Carlos M. Travieso-Gonzalez, Jan Kocica

Auto-TLDR; Planar 3D Res-U-Net Network for Unbalanced 3D Image Segmentation using Fluid Attenuation Inversion Recovery

We present a novel approach to 2D-to-3D transfer learning based on mapping pre-trained 2D convolutional neural network weights into planar 3D kernels. The method is validated with the proposed planar 3D res-u-net network, whose encoder is transferred from the 2D VGG-16 and which is applied to single-stage unbalanced 3D image data segmentation. In particular, we evaluate the method on the MICCAI 2016 MS lesion segmentation challenge dataset utilizing solely the Fluid Attenuation Inversion Recovery (FLAIR) sequence without brain extraction for training and inference to simulate real medical practice. The planar 3D res-u-net network performed best in both sensitivity and Dice score among end-to-end methods processing raw MRI scans and achieved a Dice score comparable to a state-of-the-art unimodal, non-end-to-end approach. The complete source code was released under an open-source license and this paper complies with the Machine Learning Reproducibility Checklist. By implementing practical transfer learning for 3D data representation we were able to successfully segment heavily unbalanced data without selective sampling and achieved more reliable results using less training data in a single modality. From a medical perspective, the unimodal approach gives an advantage in real practice as it requires neither co-registration nor additional scanning time during examination. Although modern medical imaging methods capture high-resolution 3D anatomy scans suitable for processing by computer-aided detection systems, deployment of automatic systems for interpretation of radiology imaging is still rather theoretical in many medical areas. Our work aims to bridge this gap by offering a solution to some of the research questions involved.
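
One way to picture the mapping of 2D weights into planar 3D kernels is to give a k x k kernel a depth of one, so that the 3D convolution applies the pre-trained 2D filter to every slice of the volume. The sketch below checks this equivalence on a toy example; placing the singleton dimension along the slice axis is an assumption made here for illustration, since the abstract does not state which axis is used, and all function names are hypothetical.

```python
def to_planar_3d(kernel_2d):
    """Wrap a k x k kernel as a 1 x k x k (planar) 3D kernel."""
    return [[row[:] for row in kernel_2d]]


def conv2d_valid(img, k):
    """Plain 2D cross-correlation without padding, as used in CNN layers."""
    kh, kw = len(k), len(k[0])
    return [
        [
            sum(img[y + j][x + i] * k[j][i] for j in range(kh) for i in range(kw))
            for x in range(len(img[0]) - kw + 1)
        ]
        for y in range(len(img) - kh + 1)
    ]


def conv3d_valid(vol, k3):
    """Plain 3D cross-correlation without padding on a D x H x W volume."""
    kd, kh, kw = len(k3), len(k3[0]), len(k3[0][0])
    depth, height, width = len(vol), len(vol[0]), len(vol[0][0])
    return [
        [
            [
                sum(
                    vol[z + d][y + j][x + i] * k3[d][j][i]
                    for d in range(kd)
                    for j in range(kh)
                    for i in range(kw)
                )
                for x in range(width - kw + 1)
            ]
            for y in range(height - kh + 1)
        ]
        for z in range(depth - kd + 1)
    ]


# Usage: the planar 3D kernel reproduces the 2D filter slice by slice.
kernel = [[1, 0], [0, -1]]
volume = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[9, 8, 7], [6, 5, 4], [3, 2, 1]],
]
slice_wise = [conv2d_valid(s, kernel) for s in volume]
planar = conv3d_valid(volume, to_planar_3d(kernel))
```

Under this assumption the transferred weights behave exactly like the 2D network on each slice, which is what makes 2D pre-training reusable as a starting point for a 3D model.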

Multiple Future Prediction Leveraging Synthetic Trajectories

Lorenzo Berlincioni, Federico Becattini, Lorenzo Seidenari, Alberto Del Bimbo

Auto-TLDR; Synthetic Trajectory Prediction using Markov Chains

Trajectory prediction is an important task, especially in autonomous driving. The ability to forecast the position of other moving agents can lead to effective planning, ensuring safety for the autonomous vehicle as well as for the observed entities. In this work we propose a data-driven approach based on Markov Chains to generate synthetic trajectories, which are useful for training a multiple future trajectory predictor. The advantages are twofold: on the one hand, synthetic samples can be used to augment existing datasets and train more effective predictors; on the other hand, the approach makes it possible to generate samples with multiple ground truths, corresponding to diverse, equally likely outcomes of the observed trajectory. We define a trajectory prediction model and a loss that explicitly address the multimodality of the problem, and we show that combining synthetic and real data leads to prediction improvements, obtaining state-of-the-art results.
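
The general idea of generating synthetic trajectories with a Markov chain can be sketched in a few lines: count the transitions between discrete states in observed sequences, then sample new sequences from those counts. The state labels used here (coarse motion primitives) and the function names are hypothetical; the abstract does not describe how the paper defines its states or transition model.

```python
import random
from collections import defaultdict


def fit_transitions(sequences):
    """Count first-order transitions between consecutive states."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    return counts


def sample_trajectory(counts, start, length, rng):
    """Sample a state sequence, choosing each next state by observed frequency.

    Sampling stops early when the current state has no observed successor.
    """
    trajectory = [start]
    while len(trajectory) < length and trajectory[-1] in counts:
        successors = counts[trajectory[-1]]
        states = list(successors)
        weights = [successors[s] for s in states]
        trajectory.append(rng.choices(states, weights=weights, k=1)[0])
    return trajectory


# Usage: two observed sequences of hypothetical motion primitives.
observed = [
    ["straight", "straight", "left"],
    ["straight", "right", "right"],
]
model = fit_transitions(observed)
synthetic = sample_trajectory(model, "straight", 5, random.Random(0))
```

Sampling repeatedly from the same starting state yields several different continuations of one observed prefix, which is the property that lets such samples serve as multiple ground truths.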

One Step Clustering Based on A-Contrario Framework for Detection of Alterations in Historical Violins

Alireza Rezaei, Sylvie Le Hégarat-Mascle, Emanuel Aldea, Piercarlo Dondi, Marco Malagodi

Auto-TLDR; A-Contrario Clustering for the Detection of Altered Violins using UVIFL Images

Preventive conservation is an important practice in Cultural Heritage. The constant monitoring of the state of conservation of an artwork helps us reduce the risk of damage and the number of interventions necessary. In this work, we propose a probabilistic approach for the detection of alterations on the surface of historical violins based on an a-contrario framework. Our method is a one-step NFA clustering solution which considers grey-level and spatial density information in one background model. The proposed method is robust to noise and avoids parameter tuning and any assumption about the quantity of the worn-out areas. We have used UV induced fluorescence (UVIFL) images as input in order to consider details not perceivable with visible light. Tests were conducted on image sequences included in the "Violins UVIFL imagery" dataset. Results illustrate the ability of the algorithm to distinguish the worn area from the surrounding regions. Comparisons with state-of-the-art clustering methods show improved overall precision and recall.

Effective Deployment of CNNs for 3DoF Pose Estimation and Grasping in Industrial Settings

Daniele De Gregorio, Riccardo Zanella, Gianluca Palli, Luigi Di Stefano

Auto-TLDR; Automated Deep Learning for Robotic Grasping Applications

In this paper we investigate how to effectively deploy deep learning in practical industrial settings, such as robotic grasping applications. When a deep-learning-based solution is proposed, it usually lacks any simple method to generate the training data. In the industrial field, where automation is the main goal, not bridging this gap is one of the main reasons why deep learning is not as widespread as it is in the academic world. For this reason, in this work we developed a system composed of a 3-DoF Pose Estimator based on Convolutional Neural Networks (CNNs) and an effective procedure to gather massive amounts of training images in the field with minimal human intervention. By automating the labeling stage, we also obtain very robust systems suitable for production-level usage. An open-source implementation of our solution is provided, along with the dataset used for the experimental evaluation.

Yolo+FPN: 2D and 3D Fused Object Detection with an RGB-D Camera

Ya Wang

Auto-TLDR; Yolo+FPN: Combining 2D and 3D Object Detection for Real-Time Object Detection

In this paper we propose a new deep neural network system, called Yolo+FPN, which fuses 2D and 3D object detection algorithms to achieve better real-time object detection results and faster inference speed for use on real robots. Finding an optimized fusion strategy to efficiently combine 3D object detection with 2D detection information is useful and challenging for both indoor and outdoor robots. In order to satisfy real-time requirements, a trade-off between accuracy and efficiency is needed. We not only obtain improved training and test accuracies and lower mean losses on the KITTI object detection benchmark, but also achieve better average precision on 3D detection of all classes in three levels of difficulty. We also implemented the Yolo+FPN system using an RGB-D camera and compared the speed of 2D and 3D object detection using different GPUs. For the real implementation in both indoor and outdoor scenes, we focus on person detection, which is the most challenging and important among the three classes.

Ghost Target Detection in 3D Radar Data Using Point Cloud Based Deep Neural Network

Mahdi Chamseddine, Jason Rambach, Oliver Wasenmüler, Didier Stricker

Auto-TLDR; Point Based Deep Learning for Ghost Target Detection in 3D Radar Point Clouds

Ghost targets are targets that appear at wrong locations in radar data and are caused by the presence of multiple indirect reflections between the target and the sensor. In this work, we introduce the first point-based deep learning approach for ghost target detection in 3D radar point clouds. This is done by extending the PointNet network architecture: we modify its input to include radar point features beyond location and introduce skip connections. We compare different input modalities and analyze the effects of the changes we introduced. We also propose an approach for automatic labeling of ghost targets in 3D radar data using lidar as reference. The algorithm is trained and tested on real data in various driving scenarios, and the tests show promising results in classifying real and ghost radar targets.

A Systematic Investigation on Deep Architectures for Automatic Skin Lesions Classification

Pierluigi Carcagni, Marco Leo, Andrea Cuna, Giuseppe Celeste, Cosimo Distante

Auto-TLDR; RegNet: Deep Investigation of Convolutional Neural Networks for Automatic Classification of Skin Lesions

Computer vision-based techniques are increasingly employed in healthcare and medical fields, principally to support experienced medical staff in making a quick and correct diagnosis. One of the hot topics in this arena concerns the automatic classification of skin lesions. Several promising works exist on this topic, mainly leveraging Convolutional Neural Networks (CNNs), but the proposed pipelines mainly rely on complex data preprocessing, and there is no systematic investigation of how available deep models can actually reach the accuracy needed for real applications. In order to overcome these drawbacks, in this work an end-to-end pipeline is introduced, and some of the most recent CNN architectures are included in it and compared on the largest common benchmark dataset recently introduced. To this aim, for the first time in this application context, a new network design paradigm, namely RegNet, has been exploited to get the best models among a population of configurations. The paper makes a threefold contribution with respect to the previous literature: a deep investigation of several CNN architectures leading to a consistent improvement in lesion recognition accuracy, the exploitation of a new network design paradigm able to study the behavior of populations of models, and a deep discussion of the pros and cons of each analyzed method, paving the way towards new research lines.

Attention Based Coupled Framework for Road and Pothole Segmentation

Shaik Masihullah, Ritu Garg, Prerana Mukherjee, Anupama Ray

Auto-TLDR; Few Shot Learning for Road and Pothole Segmentation on KITTI and IDD

In this paper, we propose a novel attention-based coupled framework for road and pothole segmentation. In many developing countries as well as in rural areas, the drivable areas are neither well-defined nor well-maintained. Under such circumstances, an Advanced Driver Assistance System (ADAS) is needed to assess the drivable area and alert about the potholes ahead to ensure vehicle safety. Moreover, this information can also be used in structured environments for assessment and maintenance of road health. We demonstrate a few-shot learning approach for pothole detection that retains accuracy even with fewer training samples. We report exhaustive experimental results for road segmentation on the KITTI and IDD datasets. We also present pothole segmentation on IDD.

EdgeNet: Semantic Scene Completion from a Single RGB-D Image

Aloisio Dourado, Teofilo De Campos, Adrian Hilton, Hansung Kim

Auto-TLDR; Semantic Scene Completion using 3D Depth and RGB Information

Semantic scene completion is the task of predicting a complete 3D representation of volumetric occupancy with corresponding semantic labels for a scene from a single point of view. In this paper, we present EdgeNet, a new end-to-end neural network architecture that fuses information from depth and RGB, explicitly representing RGB edges in 3D space. Previous works on this task used either depth only or depth with colour, projecting 2D semantic labels generated by a 2D segmentation network into the 3D volume, which requires a two-step training process. Our EdgeNet representation encodes colour information in 3D space using edge detection and flipped truncated signed distance, which improves semantic completion scores, especially in hard-to-detect classes. We achieved state-of-the-art scores on both synthetic and real datasets with a simpler and more computationally efficient training pipeline than competing approaches.

Street-Map Based Validation of Semantic Segmentation in Autonomous Driving

Laura Von Rueden, Tim Wirtz, Fabian Hueger, Jan David Schneider, Nico Piatkowski, Christian Bauckhage

Auto-TLDR; Semantic Segmentation Mask Validation Using A-priori Knowledge from Street Maps

Artificial intelligence for autonomous driving must meet strict requirements on safety and robustness, which motivates the thorough validation of learned models. However, current validation approaches mostly require ground truth data and are thus both cost-intensive and limited in their applicability. We propose to overcome these limitations with a model-agnostic validation using a-priori knowledge from street maps. In particular, we show how to validate semantic segmentation masks and demonstrate the potential of our approach using OpenStreetMap. We introduce validation metrics that indicate false positive or false negative road segments. Besides the validation approach, we present a method to correct the vehicle's GPS position so that a more accurate localization can be used for the street-map-based validation. Lastly, we present quantitative results on the Cityscapes dataset indicating that our validation approach can indeed uncover errors in semantic segmentation masks.

A Two-Step Approach to Lidar-Camera Calibration

Yingna Su, Yaqing Ding, Jian Yang, Hui Kong

Auto-TLDR; Closed-Form Calibration of Lidar-camera System for Ego-motion Estimation and Scene Understanding

Autonomous vehicles and robots are typically equipped with a Lidar and a camera. Hence, calibrating the Lidar-camera system is of extreme importance for ego-motion estimation and scene understanding. In this paper, we propose a two-step approach (coarse + fine) for the external calibration between a camera and a multiple-line Lidar. First, a new closed-form solution is proposed to obtain the initial calibration parameters. We compare our solution with the state-of-the-art SVD-based algorithm and show its benefits in both efficiency and stability. With the initial calibration parameters, the ICP-based calibration framework is used to register the point clouds extracted from the camera and Lidar coordinate frames, respectively. Our method has been applied to two Lidar-camera systems: an HDL-64E Lidar-camera system and a VLP-16 Lidar-camera system. Experimental results demonstrate that our method achieves promising performance and higher accuracy than two open-source methods.

Automatically Gather Address Specific Dwelling Images Using Google Street View

Salman Khan, Carl Salvaggio

Auto-TLDR; Automatic Address Specific Dwelling Image Collection Using Google Street View Data

Exciting research is being conducted using Google’s Street View imagery. Researchers can access training data that allows CNN training for topics ranging from assessing neighborhood environments to estimating the age of a building. However, due to the uncontrolled nature of imagery available via Google’s Street View API, data collection can be lengthy and tedious. In an effort to help researchers gather address-specific dwelling images efficiently, we developed a novel way of automatically performing this task. It was accomplished by exploiting Google’s publicly available platform with a combination of 3 separate network types and postprocessing techniques. Our custom NMS technique helped achieve 99.4% valid, address-specific dwelling images.

SAILenv: Learning in Virtual Visual Environments Made Simple

Enrico Meloni, Luca Pasqualini, Matteo Tiezzi, Marco Gori, Stefano Melacci

Auto-TLDR; SAILenv: A Simple and Customized Platform for Visual Recognition in Virtual 3D Environment

Recently, researchers in Machine Learning algorithms, Computer Vision scientists, engineers and others have shown a growing interest in 3D simulators as a means to artificially create experimental settings that are very close to those in the real world. However, most of the existing platforms to interface algorithms with 3D environments are designed to set up navigation-related experiments, to study physical interactions, or to handle ad-hoc cases that are not meant to be customized, sometimes lacking a strong photorealistic appearance and an easy-to-use software interface. In this paper, we present a novel platform, SAILenv, that is specifically designed to be simple and customizable, and that allows researchers to experiment with visual recognition in virtual 3D scenes. A few lines of code are needed to interface every algorithm with the virtual world, and non-3D-graphics experts can easily customize the 3D environment itself, exploiting a collection of photorealistic objects. Our framework yields pixel-level semantic and instance labeling and depth and, to the best of our knowledge, it is the only one that provides motion-related information directly inherited from the 3D engine. The client-server communication operates at a low level, avoiding the overhead of HTTP-based data exchanges. We perform experiments using a state-of-the-art object detector trained on real-world images, showing that it is able to recognize the photorealistic 3D objects of our environment. The computational burden of the optical flow compares favourably with the estimation performed using modern GPU-based convolutional networks or more classic implementations. We believe that the scientific community will benefit from the ease of use and high quality of our framework to evaluate newly proposed algorithms in their own customized realistic conditions.

Anomaly Detection, Localization and Classification for Railway Inspection

Riccardo Gasparini, Andrea D'Eusanio, Guido Borghi, Stefano Pini, Giuseppe Scaglione, Simone Calderara, Eugenio Fedeli, Rita Cucchiara

Auto-TLDR; Anomaly Detection and Localization using thermal images in the lowlight environment

The ability to detect, localize and classify objects that are anomalies is a challenging task in the computer vision community. In this paper, we tackle these tasks by developing a framework to automatically inspect the railway during the night. Specifically, it is able to predict the presence, the image coordinates and the class of obstacles. To deal with the low-light environment, the framework is based on thermal images and consists of three different modules that address the problems of detecting anomalies, predicting their image coordinates and classifying them. Moreover, due to the absolute lack of publicly released datasets collected in the railway context for anomaly detection, we introduce a new multi-modal dataset, acquired from a rail drone, used to evaluate the proposed framework. Experimental results confirm the accuracy of the framework and its suitability, in terms of computational load, performance, and inference time, to be implemented on a self-powered inspection system.

Developing Motion Code Embedding for Action Recognition in Videos

Maxat Alibayev, David Andrea Paulius, Yu Sun

Auto-TLDR; Motion Embedding via Motion Codes for Action Recognition

We propose a motion embedding strategy via motion codes, a vectorized representation of motions based on their salient mechanical attributes. We show that our motion codes can provide a robust motion representation. We train a deep neural network model that learns to embed demonstration videos into motion codes. We integrate the extracted features from the motion embedding model into the current state-of-the-art action recognition model. The obtained model achieved higher accuracy than the baseline on a verb classification task from egocentric videos in the EPIC-KITCHENS dataset.

Bridging the Gap between Natural and Medical Images through Deep Colorization

Lia Morra, Luca Piano, Fabrizio Lamberti, Tatiana Tommasi

Auto-TLDR; Transfer Learning for Diagnosis on X-ray Images Using Color Adaptation

Deep learning has thrived by training on large-scale datasets. However, in many applications, such as medical image diagnosis, obtaining massive amounts of data is still prohibitive due to privacy, lack of acquisition homogeneity and annotation cost. In this scenario, transfer learning from natural image collections is a standard practice that attempts to tackle shape, texture and color discrepancy all at once through pretrained model fine-tuning. In this work we propose to disentangle those challenges and design a dedicated network module that focuses on color adaptation. We combine learning from scratch of the color module with transfer learning of different classification backbones, obtaining an end-to-end, easy-to-train architecture for diagnostic image recognition on X-ray images. Extensive experiments show that our approach is particularly efficient in cases of data scarcity and provides a new path for further transferring the learned color information across multiple medical datasets.

Holistic Grid Fusion Based Stop Line Estimation

Runsheng Xu, Faezeh Tafazzoli, Li Zhang, Timo Rehfeld, Gunther Krehl, Arunava Seal

Auto-TLDR; Fused Multi-Sensory Data for Stop Lines Detection in Intersection Scenarios

Intersection scenarios provide the most complex traffic situations in Autonomous Driving and Driving Assistance Systems. Knowing in advance where to stop in an intersection is essential for controlling the longitudinal velocity of the vehicle. Most of the existing methods in the literature use only cameras to detect stop lines, which is typically not sufficient in terms of detection range. To address this issue, we propose a method that takes fused multi-sensory data, including stereo camera and lidar, as input and uses a carefully designed convolutional neural network architecture to detect stop lines. Our experiments show that the proposed approach can improve detection range compared to camera data alone, works under heavy occlusion without observing the ground markings explicitly, is able to predict stop lines for all lanes and allows detection at a distance of up to 50 meters.

Fast and Accurate Real-Time Semantic Segmentation with Dilated Asymmetric Convolutions

Leonel Rosas-Arias, Gibran Benitez-Garcia, Jose Portillo-Portillo, Gabriel Sanchez-Perez, Keiji Yanai

Auto-TLDR; FASSD-Net: Dilated Asymmetric Pyramidal Fusion for Real-Time Semantic Segmentation

Recent works have shown promising results on real-time semantic segmentation tasks. To maintain fast inference speed, most of the existing networks make use of light decoders, or they simply do not use them at all. This strategy keeps inference fast; however, the resulting accuracy is significantly lower than that of non-real-time semantic segmentation networks. In this paper, we introduce two key modules for designing a high-performance decoder for real-time semantic segmentation that reduces the accuracy gap between real-time and non-real-time segmentation networks. Our first module, Dilated Asymmetric Pyramidal Fusion (DAPF), is designed to substantially increase the receptive field on top of the last stage of the encoder, obtaining richer contextual features. Our second module, the Multi-resolution Dilated Asymmetric (MDA) module, fuses and refines detail and contextual information from multi-scale feature maps coming from early and deeper stages of the network. Both modules exploit contextual information without excessively increasing the computational complexity by using asymmetric convolutions. Our proposed network, entitled “FASSD-Net”, reaches 78.8% mIoU on the Cityscapes validation dataset at 41.1 FPS on full resolution images (1024x2048). In addition, with a light version of our network, we reach 74.1% mIoU at 133.1 FPS (full resolution) on a single NVIDIA GTX 1080Ti card with no additional acceleration techniques. The source code and pre-trained models are available at https://github.com/GibranBenitez/FASSD-Net.

Real-Time Monocular Depth Estimation with Extremely Light-Weight Neural Network

Mian Jhong Chiu, Wei-Chen Chiu, Hua-Tsung Chen, Jen-Hui Chuang

Auto-TLDR; Real-Time Light-Weight Depth Prediction for Obstacle Avoidance and Environment Sensing with Deep Learning-based CNN

Obstacle avoidance and environment sensing are crucial applications in autonomous driving and robotics. Among all types of sensors, the RGB camera is widely used in these applications as it can offer rich visual content at relatively low cost, and using a single image to perform depth estimation has become one of the main focuses of recent research. However, prior works usually rely on highly complicated computation and power-consuming GPUs to achieve this task; therefore, we focus on developing a real-time light-weight system for depth prediction in this paper. Based on the well-known encoder-decoder architecture, we propose a supervised learning-based CNN with detachable decoders that produce depth predictions at different scales. We also formulate a novel log-depth loss function that computes the difference between the predicted depth map and the ground truth depth map in log space, so as to increase the prediction accuracy for nearby locations. To train our model efficiently, we generate depth maps and semantic segmentations with complex teacher models. A series of ablation studies and experiments validates that our model can efficiently perform real-time depth prediction with only 0.32M parameters, and that the best trained model outperforms previous works on the KITTI dataset on various evaluation metrics.
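
To illustrate why comparing depths in log space emphasises nearby locations, here is a minimal sketch. It assumes a mean absolute difference of logarithms, which is only one plausible form. The function name `log_depth_l1` and the `eps` stabiliser are hypothetical, and the paper's exact loss may differ.

```python
import math
from typing import Sequence


def log_depth_l1(pred: Sequence[float], target: Sequence[float],
                 eps: float = 1e-6) -> float:
    """Mean absolute difference between log-depths (assumed loss form).

    `eps` guards against log(0) and is an illustrative choice.
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and equally long")
    total = sum(abs(math.log(p + eps) - math.log(t + eps))
                for p, t in zip(pred, target))
    return total / len(pred)


# The same 1 m error is penalised more at 2 m than at 50 m.
near_error = log_depth_l1([3.0], [2.0])    # about log(3/2)
far_error = log_depth_l1([51.0], [50.0])   # about log(51/50)
```

Because the logarithm compresses large values, a fixed absolute error contributes more to the loss when the true depth is small, which matches the stated goal of improving accuracy for nearby locations.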

Confidence Calibration for Deep Renal Biopsy Immunofluorescence Image Classification

Federico Pollastri, Juan Maroñas, Federico Bolelli, Giulia Ligabue, Roberto Paredes, Riccardo Magistroni, Costantino Grana

Auto-TLDR; A Probabilistic Convolutional Neural Network for Immunofluorescence Classification in Renal Biopsy

With this work we tackle immunofluorescence classification in renal biopsy, employing state-of-the-art Convolutional Neural Networks. In this setting, the aim of the probabilistic model is to assist an expert practitioner towards identifying the location pattern of antibody deposits within a glomerulus. Since modern neural networks often provide overconfident outputs, we stress the importance of having a reliable prediction, demonstrating that Temperature Scaling, a recently introduced re-calibration technique, can be successfully applied to immunofluorescence classification in renal biopsy. Experimental results demonstrate that the designed model yields good accuracy on the specific task, and that Temperature Scaling is able to provide reliable probabilities, which are highly valuable for such a task given the low inter-rater agreement.
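
For readers unfamiliar with Temperature Scaling, the sketch below shows its core operation: the logits are divided by a single scalar temperature T before the softmax. In practice T is fitted on a held-out validation set; the value used here is arbitrary, and the function name is hypothetical.

```python
import math
from typing import List, Sequence


def softmax_with_temperature(logits: Sequence[float],
                             temperature: float = 1.0) -> List[float]:
    """Softmax of logits / temperature, computed in a numerically stable way."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum to avoid overflow in exp
    exps = [math.exp(z - peak) for z in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]


logits = [4.0, 1.0, 0.0]
uncalibrated = softmax_with_temperature(logits, temperature=1.0)
# T > 1 softens the distribution; T = 2.0 is an arbitrary illustrative value.
calibrated = softmax_with_temperature(logits, temperature=2.0)
```

Dividing by a positive scalar never changes which class has the highest probability, so accuracy is unaffected and only the confidence values change.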

Multi-Direction Convolution for Semantic Segmentation

Dehui Li, Zhiguo Cao, Ke Xian, Xinyuan Qi, Chao Zhang, Hao Lu

Auto-TLDR; Multi-Direction Convolution for Contextual Segmentation

Context is known to be one of the crucial factors affecting the performance of semantic segmentation. However, state-of-the-art segmentation models built upon fully convolutional networks are inherently weak in encoding contextual information because of stacked local operations such as convolution and pooling. Failing to capture context leads to inferior segmentation performance. Although many context modules have been proposed to relieve this problem, they still operate in a local manner or use the same contextual information in different positions (due to upsampling). In this paper, we introduce the idea of Multi-Direction Convolution (MDC), a novel operator capable of encoding rich contextual information. This operator is inspired by the observation that the standard convolution only slides along the spatial dimensions (x, y direction) while the channel dimension (z direction) is fixed, which results in slow growth of the receptive field (RF). If the channel-fixed convolution is considered one-direction, MDC is multi-direction in the sense that it slides along both spatial and channel dimensions, i.e., it slides along x, y when z is fixed, along x, z when y is fixed, and along y, z when x is fixed. In this way, MDC is able to encode rich contextual information with a fast increase of the RF. Compared to existing context modules, the encoded context is position-sensitive because no upsampling is required. MDC is also efficient and easy to implement: it can be implemented with a few standard convolution layers and permutations. We show through extensive experiments that MDC effectively and selectively enlarges the RF and outperforms existing contextual modules on two standard benchmarks, Cityscapes and PASCAL VOC2012.

Localization of Unmanned Aerial Vehicles in Corridor Environments Using Deep Learning

Ram Padhy, Shahzad Ahmad, Sachin Verma, Sambit Bakshi, Pankaj Kumar Sa

Auto-TLDR; A monocular vision assisted localization algorithm for indoor corridor environments

We propose a monocular-vision-assisted localization algorithm that will help a UAV navigate safely in indoor corridor environments. The aim is always to navigate the UAV through a corridor in the forward direction by keeping it at the center with no orientation either to the left or right side. The algorithm makes use of the RGB image captured from the UAV front camera and passes it through a trained Deep Neural Network (DNN) to predict the position of the UAV as being on the left, center or right side of the corridor. Depending upon the divergence of the UAV with respect to an imaginary central line, known as the central bisector line (CBL) of the corridor, a suitable command is generated to bring the UAV to the center. When the UAV is at the center of the corridor, a new image is passed through another trained DNN to predict the orientation of the UAV with respect to the CBL of the corridor. If the UAV is tilted either left or right, an appropriate command is generated to rectify the orientation. We also propose a new corridor dataset, named UAVCorV1, which contains images captured by the UAV front camera when the UAV is at all possible locations in a variety of corridors. An exhaustive set of experiments in different corridors reveals the efficacy of the proposed algorithm.

Weight Estimation from an RGB-D Camera in Top-View Configuration

Marco Mameli, Marina Paolanti, Nicola Conci, Filippo Tessaro, Emanuele Frontoni, Primo Zingaretti

Auto-TLDR; Top-View Weight Estimation using Deep Neural Networks

The development of so-called soft-biometrics aims at providing information related to the physical and behavioural characteristics of a person. This paper focuses on bodyweight estimation based on the observation from a top-view RGB-D camera. The capability to estimate the weight of a person can be of help in many different applications, from health-related scenarios to business intelligence and retail analytics. To deal with this issue, a TVWE (Top-View Weight Estimation) framework is proposed with the aim of predicting the weight. The approach relies on the adoption of Deep Neural Networks (DNNs) that have been trained on depth data. Each network has also been modified in its top section to replace classification with prediction inference. The performance of five state-of-the-art DNNs has been compared, namely VGG16, ResNet, Inception, DenseNet and Efficient-Net. In addition, a convolutional auto-encoder has also been included for completeness. Considering the limited literature in this domain, the TVWE framework has been evaluated on a new publicly available dataset, the “VRAI Weight estimation Dataset”, which also collects, for each subject, labels related to weight, gender, and height. The experimental results have demonstrated that the proposed methods are suitable for this task, bringing different and significant insights for the application of the solution in different domains.

OmniFlowNet: A Perspective Neural Network Adaptation for Optical Flow Estimation in Omnidirectional Images

Charles-Olivier Artizzu, Haozhou Zhang, Guillaume Allibert, Cédric Demonceaux

Auto-TLDR; OmniFlowNet: A Convolutional Neural Network for Omnidirectional Optical Flow Estimation

Spherical cameras and the latest image processing techniques open up new horizons. In particular, methods based on Convolutional Neural Networks (CNNs) now give excellent results for optical flow estimation on perspective images. However, these approaches are highly dependent on their architectures and training datasets. This paper proposes to benefit from years of improvement in optical flow estimation for perspective images and to apply it to omnidirectional ones without training on new datasets. Our network, OmniFlowNet, is built on a CNN specialized in perspective images. Its convolution operation is adapted to be consistent with the equirectangular projection. Tested on spherical datasets created with Blender and on several equirectangular videos recorded in real indoor and outdoor scenes, OmniFlowNet shows better performance than its original network.

Lane Detection Based on Object Detection and Image-To-Image Translation

Hiroyuki Komori, Kazunori Onoguchi

Auto-TLDR; Lane Marking and Road Boundary Detection from Monocular Camera Images using Inverse Perspective Mapping

In this paper, we propose a method to detect various types of lane markings and road boundaries simultaneously from a monocular camera image. This method detects lane markings and road boundaries in IPM images obtained by Inverse Perspective Mapping of the input images. First, bounding boxes surrounding a lane marking or the road boundary are extracted by the object detection network. At the same time, these areas are labelled as a solid line, a dashed line, a zebra line, a curb, grass, a sidewall and so on. Next, in each bounding box, lane marking boundaries or road boundaries are drawn by the image-to-image translation network. We use YOLOv3 for the object detection and pix2pix for the image translation. We create our own datasets including various types of lane markings and road boundaries and evaluate our approach on these datasets qualitatively and quantitatively.

Enhancing Semantic Segmentation of Aerial Images with Inhibitory Neurons

Ihsan Ullah, Sean Reilly, Michael Madden

Auto-TLDR; Lateral Inhibition in Deep Neural Networks for Object Recognition and Semantic Segmentation

In a Convolutional Neural Network, each neuron in the output feature map takes input from the neurons in its receptive field. This receptive field concept plays a vital role in today's deep neural networks. However, inspired by neuro-biological research, it has been proposed to add inhibitory neurons outside the receptive field, which may enhance the performance of neural network models. In this paper, we begin with deep network architectures such as VGG and ResNet, and propose an approach to add lateral inhibition in each output neuron to reduce its impact on its neighbours, both in fine-tuning pre-trained models and in training from scratch. Our experiments show that notable improvements upon prior baseline deep models can be achieved. A key feature of our approach is that it is easy to add to baseline models; it can be adopted in any model containing convolution layers, and we demonstrate its value in applications including object recognition and semantic segmentation of aerial images, where we show state-of-the-art results on the Aeroscape dataset. On semantic segmentation tasks, our enhancement shows 17.43% higher mIoU than a single baseline model on a single source (the Aeroscape dataset), 13.43% higher performance than an ensemble model on the same single source, and 7.03% higher than an ensemble model on multiple sources (segmentation datasets). Our experiments illustrate the potential impact of using inhibitory neurons in deep learning models, and they also show better results than the baseline models that have standard convolutional layers.

PSDNet: A Balanced Architecture of Accuracy and Parameters for Semantic Segmentation

Yue Liu, Zhichao Lian

Auto-TLDR; Pyramid Pooling Module with SE1Cblock and D2SUpsample Network (PSDNet)

In this paper, we present our Pyramid Pooling Module (PPM) with SE1Cblock and D2SUpsample Network (PSDNet), a novel architecture for accurate semantic segmentation. Starting from the well-known Pyramid Scene Parsing Network (PSPNet), PSDNet augments the pyramid pooling structure of the PPM with a channel attention module and a feature transform module. The PPM enhanced with these two components strengthens the context information flowing through the network instead of damaging it. The channel attention module is an improved “Squeeze and Excitation with 1D Convolution” (SE1C) block, which explicitly models the interrelationships between channels with fewer parameters. The feature transform module, named “Depth to Space Upsampling” (D2SUpsample), preserves the integrity of features by transforming them while interpolating, and reduces the number of parameters at the same time. In addition, we introduce a joint strategy in SE1Cblock that combines two variants of global pooling without increasing the number of parameters. Compared with PSPNet, our work achieves higher accuracy on public datasets, with 73.97% mIoU and 82.89% mAcc on the Cityscapes dataset using a ResNet50 backbone.
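The abstract does not specify the internals of D2SUpsample, so the following is only a sketch of the generic depth-to-space rearrangement that the name refers to: groups of r*r channels are redistributed into an r-times larger spatial grid, which increases resolution without any learned parameters. The function name `depth_to_space` is hypothetical, and the channel ordering is assumed to follow the one used by PyTorch's `PixelShuffle`.

```python
def depth_to_space(x, r):
    """Rearrange a (C*r*r, H, W) nested list into a (C, H*r, W*r) nested list."""
    channels_in = len(x)
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = channels_in // (r * r)
    h, w = len(x[0]), len(x[0][0])
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):          # row offset inside each r x r output block
            for j in range(r):      # column offset inside each r x r output block
                src = x[c * r * r + i * r + j]
                for row in range(h):
                    for col in range(w):
                        out[c][row * r + i][col * r + j] = src[row][col]
    return out


# Four 1x1 channels become a single 2x2 channel.
features = [[[1]], [[2]], [[3]], [[4]]]
upsampled = depth_to_space(features, 2)
```

In practice this operation is applied to tensors with a framework primitive rather than nested lists; the loops here only make the index mapping explicit.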

Vehicle Lane Merge Visual Benchmark

Kai Cordes, Hellward Broszio

Auto-TLDR; A Benchmark for Automated Cooperative Maneuvering Using Multi-view Video Streams and Ground Truth Vehicle Description

Automated driving is regarded as the most promising technology for improving road safety in the future. In this context, connected vehicles play an important role because of their ability to perform cooperative maneuvers in challenging traffic situations. We propose a benchmark for automated cooperative maneuvers. The targeted cooperative maneuver is the vehicle lane merge, in which a vehicle on the acceleration lane merges into the traffic of a motorway. The benchmark enables the evaluation of vehicle localization approaches as well as the study of cooperative maneuvers. It consists of temporally synchronized multi-view video streams, highly accurate camera calibration, and ground truth vehicle descriptions, including position, heading, speed, and shape. For benchmark generation, the lane merge maneuver is performed by human drivers on a test track, resulting in 120 lane merge data sets with various traffic situations and video recording conditions.

Complex-Object Visual Inspection: Empirical Studies on a Multiple Lighting Solution

Maya Aghaei, Matteo Bustreo, Pietro Morerio, Nicolò Carissimi, Alessio Del Bue, Vittorio Murino

Auto-TLDR; A Novel Illumination Setup for Automatic Visual Inspection of Complex Objects

The design of an automatic visual inspection system is usually performed in two stages. The first stage consists of selecting the hardware setup that most effectively highlights the defects on the surface to be inspected; the second stage concerns the development of algorithmic solutions that exploit the potential of the collected data. In this paper, first, we present a novel illumination setup embedding four illumination configurations that resemble diffused, dark-field, and front lighting techniques. Second, we analyze the contribution of deploying the proposed setup in the training phase only, mimicking the scenario in which an already developed visual inspection system cannot be modified on the customer site. Through an exhaustive set of experiments, we demonstrate the suitability of the proposed setup for effective illumination of complex objects, defined as manufactured items with variable surface characteristics that cannot be determined a priori. Finally, we provide insights into the importance of having multiple light configurations available during training and into their natural boosting effect, which improves overall system performance without requiring any modification of the system design in the evaluation phase.