PEAN: 3D Hand Pose Estimation Adversarial Network

Linhui Sun, Yifan Zhang, Jing Lu, Jian Cheng, Hanqing Lu

Responsive image

Auto-TLDR; PEAN: 3D Hand Pose Estimation with Adversarial Learning Framework

Slides Poster

Despite recent emerging research attention, 3D hand pose estimation still suffers from the problems of predicting inaccurate or invalid poses which conflict with physical and kinematic constraints. To address these problems, we propose a novel 3D hand pose estimation adversarial network (PEAN) which can implicitly utilize such constraints to regularize the prediction in an adversarial learning framework. PEAN contains two parts: a 3D hierarchical estimation network (3DHNet) to predict hand pose, which decouples the task into multiple subtasks with a hierarchical structure; a pose discrimination network (PDNet) to judge the reasonableness of the estimated 3D hand pose, which back-propagates the constraints to the estimation network. During the adversarial learning process, PDNet is expected to distinguish the estimated 3D hand pose and the ground truth, while 3DHNet is expected to estimate more valid pose to confuse PDNet. In this way, 3DHNet is capable of generating 3D poses with accurate positions and adaptively adjusting the invalid poses without additional prior knowledge. Experiments show that the proposed 3DHNet does a good job in predicting hand poses, and introducing PDNet to 3DHNet does further improve the accuracy and reasonableness of the predicted results. As a result, the proposed PEAN achieves the state-of-the-art performance on three public hand pose estimation datasets.

Similar papers

Orthographic Projection Linear Regression for Single Image 3D Human Pose Estimation

Yahui Zhang, Shaodi You, Theo Gevers

Responsive image

Auto-TLDR; A Deep Neural Network for 3D Human Pose Estimation from a Single 2D Image in the Wild

Slides Poster Similar

3D human pose estimation from a single 2D image in the wild is an important computer vision task but yet extremely challenging. Unlike images taken from indoor and well constrained environments, 2D outdoor images in the wild are extremely complex because of varying imaging conditions. Furthermore, 2D images usually do not have corresponding 3D pose ground truth making a supervised approach ill constrained. Therefore, in this paper, we propose to associate the 3D human pose, the 2D human pose projection and the 2D image appearance through a new orthographic projection based linear regression module. Unlike existing reprojection based approaches, our orthographic projection and regression do not suffer from small angle problems, which usually lead to overfitting in the depth dimension. Hence, we propose a deep neural network which adopts the 2D pose, 3D pose regression and orthographic projection linear regression module. The proposed method shows state-of-the art performance on the Human3.6M dataset and generalizes well to in-the-wild images.

RefiNet: 3D Human Pose Refinement with Depth Maps

Andrea D'Eusanio, Stefano Pini, Guido Borghi, Roberto Vezzani, Rita Cucchiara

Responsive image

Auto-TLDR; RefiNet: A Multi-stage Framework for 3D Human Pose Estimation

Slides Similar

Human Pose Estimation is a fundamental task for many applications in the Computer Vision community and it has been widely investigated in the 2D domain, i.e. intensity images. Therefore, most of the available methods for this task are mainly based on 2D Convolutional Neural Networks and huge manually-annotated RGB datasets, achieving stunning results. In this paper, we propose RefiNet, a multi-stage framework that regresses an extremely-precise 3D human pose estimation from a given 2D pose and a depth map. The framework consists of three different modules, each one specialized in a particular refinement and data representation, i.e. depth patches, 3D skeleton and point clouds. Moreover, we collect a new dataset, namely Baracca, acquired with RGB, depth and thermal cameras and specifically created for the automotive context. Experimental results confirm the quality of the refinement procedure that largely improves the human pose estimations of off-the-shelf 2D methods.

Light3DPose: Real-Time Multi-Person 3D Pose Estimation from Multiple Views

Alessio Elmi, Davide Mazzini, Pietro Tortella

Responsive image

Auto-TLDR; 3D Pose Estimation of Multiple People from a Few calibrated Camera Views using Deep Learning

Slides Poster Similar

We present an approach to perform 3D pose estimation of multiple people from a few calibrated camera views. Our architecture, leveraging the recently proposed unprojection layer, aggregates feature-maps from a 2D pose estimator backbone into a comprehensive representation of the 3D scene. Such intermediate representation is then elaborated by a fully-convolutional volumetric network and a decoding stage to extract 3D skeletons with sub-voxel accuracy. Our method achieves state of the art MPJPE on the CMU Panoptic dataset using a few unseen views and obtains competitive results even with a single input view. We also assess the transfer learning capabilities of the model by testing it against the publicly available Shelf dataset obtaining good performance metrics. The proposed method is inherently efficient: as a pure bottom-up approach, it is computationally independent of the number of people in the scene. Furthermore, even though the computational burden of the 2D part scales linearly with the number of input views, the overall architecture is able to exploit a very lightweight 2D backbone which is orders of magnitude faster than the volumetric counterpart, resulting in fast inference time. The system can run at 6 FPS, processing up to 10 camera views on a single 1080Ti GPU.

A Multi-Task Neural Network for Action Recognition with 3D Key-Points

Rongxiao Tang, Wang Luyang, Zhenhua Guo

Responsive image

Auto-TLDR; Multi-task Neural Network for Action Recognition and 3D Human Pose Estimation

Slides Poster Similar

Action recognition and 3D human pose estimation are the fundamental problems in computer vision and closely related. In this work, we propose a multi-task neural network for action recognition and 3D human pose estimation. The results of the previous methods are still error-prone especially when tested against the images taken in-the-wild, leading error results in action recognition. To solve this problem, we propose a principled approach to generate high quality 3D pose ground truth given any in-the-wild image with a person inside. We achieve this by first devising a novel stereo inspired neural network to directly map any 2D pose to high quality 3D counterpart. Based on the high-quality 3D labels, we carefully design the multi-task framework for action recognition and 3D human pose estimation. The proposed architecture can utilize the shallow, deep features of the images, and the in-the-wild 3D human key-points to guide a more precise result. High quality 3D key-points can fully reflect the morphological features of motions, thus boosting the performance on action recognition. Experiments demonstrate that 3D pose estimation leads to significantly higher performance on action recognition than separated learning. We also evaluate the generalization ability of our method both quantitatively and qualitatively. The proposed architecture performs favorably against the baseline 3D pose estimation methods. In addition, the reported results on Penn Action and NTU datasets demonstrate the effectiveness of our method on the action recognition task.

Exploring Severe Occlusion: Multi-Person 3D Pose Estimation with Gated Convolution

Renshu Gu, Gaoang Wang, Jenq-Neng Hwang

Responsive image

Auto-TLDR; 3D Human Pose Estimation for Multi-Human Videos with Occlusion

Slides Similar

3D human pose estimation (HPE) is crucial in human behavior analysis, augmented reality/virtual reality (AR/VR) applications, and self-driving industry. Videos that contain multiple potentially occluded people captured from freely moving monocular cameras are very common in real-world scenarios, while 3D HPE for such scenarios is quite challenging, partially because there is a lack of such data with accurate 3D ground truth labels in existing datasets. In this paper, we propose a temporal regression network with a gated convolution module to transform 2D joints to 3D and recover the missing occluded joints in the meantime. A simple yet effective localization approach is further conducted to transform the normalized pose to the global trajectory. To verify the effectiveness of our approach, we also collect a new moving camera multi-human (MMHuman) dataset that includes multiple people with heavy occlusion captured by moving cameras. The 3D ground truth joints are provided by accurate motion capture (MoCap) system. From the experiments on static-camera based Human3.6M data and our own collected moving-camera based data, we show that our proposed method outperforms most state-of-the-art 2D-to-3D pose estimation methods, especially for the scenarios with heavy occlusions.

Unsupervised 3D Human Pose Estimation in Multi-view-multi-pose Video

Cheng Sun, Diego Thomas, Hiroshi Kawasaki

Responsive image

Auto-TLDR; Unsupervised 3D Human Pose Estimation from 2D Videos Using Generative Adversarial Network

Slides Poster Similar

3D human pose estimation from a single 2D video is an extremely difficult task because computing 3D geometry from 2D images is an ill-posed problem. Recent popular solutions adopt fully-supervised learning strategy, which requires to train a deep network on a large-scale ground truth dataset of 3D poses and 2D images. However, such a large-scale dataset with natural images does not exist, which limits the usability of existing methods. While building a complete 3D dataset is tedious and expensive, abundant 2D in-the-wild data is already publicly available. As a consequence, there is a growing interest in the computer vision community to design efficient techniques that use the unsupervised learning strategy, which does not require any ground truth 3D data. Such methods can be trained with only natural 2D images of humans. In this paper we propose an unsupervised method for estimating 3D human pose in videos. The standard approach for unsupervised learning is to use the Generative Adversarial Network (GAN) framework. To improve the performance of 3D human pose estimation in videos, we propose a new GAN network that enforces body consistency over frames in a video. We evaluate the efficiency of our proposed method on a public 3D human body dataset.

PROPEL: Probabilistic Parametric Regression Loss for Convolutional Neural Networks

Muhammad Asad, Rilwan Basaru, S M Masudur Rahman Al Arif, Greg Slabaugh

Responsive image

Auto-TLDR; PRObabilistic Parametric rEgression Loss for Probabilistic Regression Using Convolutional Neural Networks

Slides Similar

In recent years, Convolutional Neural Networks (CNNs) have enabled significant advancements to the state-of-the-art in computer vision. For classification tasks, CNNs have widely employed probabilistic output and have shown the significance of providing additional confidence for predictions. However, such probabilistic methodologies are not widely applicable for addressing regression problems using CNNs, as regression involves learning unconstrained continuous and, in many cases, multi-variate target variables. We propose a PRObabilistic Parametric rEgression Loss (PROPEL) that facilitates CNNs to learn parameters of probability distributions for addressing probabilistic regression problems. PROPEL is fully differentiable and, hence, can be easily incorporated for end-to-end training of existing CNN regression architectures using existing optimization algorithms. The proposed method is flexible as it enables learning complex unconstrained probabilities while being generalizable to higher dimensional multi-variate regression problems. We utilize a PROPEL-based CNN to address the problem of learning hand and head orientation from uncalibrated color images. Our experimental validation and comparison with existing CNN regression loss functions show that PROPEL improves the accuracy of a CNN by enabling probabilistic regression, while significantly reducing required model parameters by 10x, resulting in improved generalization as compared to the existing state-of-the-art.

Occlusion-Tolerant and Personalized 3D Human Pose Estimation in RGB Images

Ammar Qammaz, Antonis Argyros

Responsive image

Auto-TLDR; Real-Time 3D Human Pose Estimation in BVH using Inverse Kinematics Solver and Neural Networks

Poster Similar

We introduce a real-time method that estimates the 3D human pose directly in the popular BVH format, given estimations of the 2D body joints in RGB images. Our contributions include: (a) A novel and compact 2D pose representation. (b) A human body orientation classifier and an ensemble of orientation-tuned neural networks that regress the 3D human pose by also allowing for the decomposition of the body to an upper and lower kinematic hierarchy. This permits the recovery of the human pose even in the case of significant occlusions. (c) An efficient Inverse Kinematics solver that refines the neural-network-based solution providing 3D human pose estimations that are consistent with the limb sizes of a target person (if known). All the above yield a 33% accuracy improvement on the H3.6M dataset compared to the baseline MocapNET method while maintaining real-time performance (70 fps in CPU-only execution).

Simple Multi-Resolution Representation Learning for Human Pose Estimation

Trung Tran Quang, Van Giang Nguyen, Daeyoung Kim

Responsive image

Auto-TLDR; Multi-resolution Heatmap Learning for Human Pose Estimation

Slides Poster Similar

Human pose estimation - the process of recognizing human keypoints in a given image - is one of the most important tasks in computer vision and has a wide range of applications including movement diagnostics, surveillance, or self-driving vehicle. The accuracy of human keypoint prediction is increasingly improved thanks to the burgeoning development of deep learning. Most existing methods solved human pose estimation by generating heatmaps in which the ith heatmap indicates the location confidence of the ith keypoint. In this paper, we introduce novel network structures referred to as multi-resolution representation learning for human keypoint prediction. At different resolutions in the learning process, our networks branch off and use extra layers to learn heatmap generation. We firstly consider the architectures for generating the multi-resolution heatmaps after obtaining the lowest-resolution feature maps. Our second approach allows learning during the process of feature extraction in which the heatmaps are generated at each resolution of the feature extractor. The first and second approaches are referred to as multi-resolution heatmap learning and multi-resolution feature map learning respectively. Our architectures are simple yet effective, achieving good performance. We conducted experiments on two common benchmarks for human pose estimation: MS-COCO and MPII dataset.

On the Robustness of 3D Human Pose Estimation

Zerui Chen, Yan Huang, Liang Wang

Responsive image

Auto-TLDR; Robustness of 3D Human Pose Estimation Methods to Adversarial Attacks

Slides Similar

It is widely shown that Convolutional Neural Networks (CNNs) are vulnerable to adversarial examples on most recognition tasks, such as image classification and segmentation. However, few work studies the more complicated task -- 3D human pose estimation. This task often requires large-scale datasets, specialized network architectures, and it can be solved either from single-view RGB images or from multi-view RGB images. In this paper, we make the first attempt to investigate the robustness of current state-of-the-art 3D human pose estimation methods. To this end, we build four representative baseline models, where most of the current methods can be generally classified as one of them. Furthermore, we design targeted adversarial attacks to detect whether 3D pose estimators are robust to different camera parameters. For different types of methods, we present a comprehensive study of their robustness on the large-scale \emph{Human3.6M} benchmark. Our work shows that different methods vary significantly in their resistance to adversarial attacks. Through extensive experiments, we show that multi-view 3D pose estimators can be more vulnerable to adversarial examples. We believe that our efforts can shed light on future works to design more robust 3D human pose estimators.

HPERL: 3D Human Pose Estimastion from RGB and LiDAR

Michael Fürst, Shriya T.P. Gupta, René Schuster, Oliver Wasenmüler, Didier Stricker

Responsive image

Auto-TLDR; 3D Human Pose Estimation Using RGB and LiDAR Using Weakly-Supervised Approach

Slides Poster Similar

In-the-wild human pose estimation has a huge potential for various fields, ranging from animation and action recognition to intention recognition and prediction for autonomous driving. The current state-of-the-art is focused only on RGB and RGB-D approaches for predicting the 3D human pose. However, not using precise LiDAR depth information limits the performance and leads to very inaccurate absolute pose estimation. With LiDAR sensors becoming more affordable and common on robots and autonomous vehicle setups, we propose an end-to-end architecture using RGB and LiDAR to predict the absolute 3D human pose with unprecedented precision. Additionally, we introduce a weakly-supervised approach to generate 3D predictions using 2D pose annotations from PedX. This allows for many new opportunities in the field of 3D human pose estimation.

Learning to Implicitly Represent 3D Human Body from Multi-Scale Features and Multi-View Images

Zhongguo Li, Magnus Oskarsson, Anders Heyden

Responsive image

Auto-TLDR; Reconstruction of 3D human bodies from multi-view images using multi-stage end-to-end neural networks

Slides Poster Similar

Reconstruction of 3D human bodies, from images, faces many challenges, due to it generally being an ill-posed problem. In this paper we present a method to reconstruct 3D human bodies from multi-view images, through learning an implicit function to represent 3D shape, based on multi-scale features extracted by multi-stage end-to-end neural networks. Our model consists of several end-to-end hourglass networks for extracting multi-scale features from multi-view images, and a fully connected network for implicit function classification from these features. Given a 3D point, it is projected to multi-view images and these images are fed into our model to extract multi-scale features. The scales of features extracted by the hourglass networks decrease with the depth of our model, which represents the information from local to global scale. Then, the multi-scale features as well as the depth of the 3D point are combined to a new feature vector and the fully connected network classifies the feature vector, in order to predict if the point lies inside or outside of the 3D mesh. The advantage of our method is that we use both local and global features in the fully connected network and represent the 3D mesh by an implicit function, which is more memory-efficient. Experiments on public datasets demonstrate that our method surpasses previous approaches in terms of the accuracy of 3D reconstruction of human bodies from images.

P2 Net: Augmented Parallel-Pyramid Net for Attention Guided Pose Estimation

Luanxuan Hou, Jie Cao, Yuan Zhao, Haifeng Shen, Jian Tang, Ran He

Responsive image

Auto-TLDR; Parallel-Pyramid Net with Partial Attention for Human Pose Estimation

Slides Poster Similar

The target of human pose estimation is to determine the body parts and joint locations of persons in the image. Angular changes, motion blur and occlusion etc. in the natural scenes make this task challenging, while some joints are more difficult to be detected than others. In this paper, we propose an augmented Parallel-Pyramid Net (P^2Net) with an partial attention module. During data preprocessing, we proposed a differentiable auto data augmentation (DA^2) method in which sequences of data augmentations are formulated as a trainable and operational Convolution Neural Network (CNN) component. DA^2 improves the training efficiency and effectiveness. A parallel pyramid structure is followed to compensate the information loss introduced by the network. For the information loss problem in the backbone network, we optimize the backbone network by adopting a new parallel structure without increasing the overall computational complexity. To further refine the predictions after completion of global predictions, an Partial Attention Module (PAM) is defined to extract weighted features from different scale feature maps generated by the parallel pyramid structure. Compared with the traditional up-sampling refining, PAM can better capture the relationship between channels. Experiments corroborate the effectiveness of our proposed method. Notably, our method achieves the best performance on the challenging MSCOCO and MPII datasets.

MixedFusion: 6D Object Pose Estimation from Decoupled RGB-Depth Features

Hangtao Feng, Lu Zhang, Xu Yang, Zhiyong Liu

Responsive image

Auto-TLDR; MixedFusion: Combining Color and Point Clouds for 6D Pose Estimation

Slides Poster Similar

Estimating the 6D pose of objects is an important process for intelligent systems to achieve interaction with the real-world. As the RGB-D sensors become more accessible, the fusion-based methods have prevailed, since the point clouds provide complementary geometric information with RGB values. However, Due to the difference in feature space between color image and depth image, the network structures that directly perform point-to-point matching fusion do not effectively fuse the features of the two. In this paper, we propose a simple but effective approach, named MixedFusion. Different from the prior works, we argue that the spatial correspondence of color and point clouds could be decoupled and reconnected, thus enabling a more flexible fusion scheme. By performing the proposed method, more informative points can be mixed and fused with rich color features. Extensive experiments are conducted on the challenging LineMod and YCB-Video datasets, show that our method significantly boosts the performance without introducing extra overheads. Furthermore, when the minimum tolerance of metric narrows, the proposed approach performs better for the high-precision demands.

Joint Face Alignment and 3D Face Reconstruction with Efficient Convolution Neural Networks

Keqiang Li, Huaiyu Wu, Xiuqin Shang, Zhen Shen, Gang Xiong, Xisong Dong, Bin Hu, Fei-Yue Wang

Responsive image

Auto-TLDR; Mobile-FRNet: Efficient 3D Morphable Model Alignment and 3D Face Reconstruction from a Single 2D Facial Image

Slides Poster Similar

3D face reconstruction from a single 2D facial image is a challenging and concerned problem. Recent methods based on CNN typically aim to learn parameters of 3D Morphable Model (3DMM) from 2D images to render face alignment and 3D face reconstruction. Most algorithms are designed for faces with small, medium yaw angles, which is extremely challenging to align faces in large poses. At the same time, they are not efficient usually. The main challenge is that it takes time to determine the parameters accurately. In order to address this challenge with the goal of improving performance, this paper proposes a novel and efficient end-to-end framework. We design an efficient and lightweight network model combined with Depthwise Separable Convolution and Muti-scale Representation, Lightweight Attention Mechanism, named Mobile-FRNet. Simultaneously, different loss functions are used to constrain and optimize 3DMM parameters and 3D vertices during training to improve the performance of the network. Meanwhile, extensive experiments on the challenging datasets show that our method significantly improves the accuracy of face alignment and 3D face reconstruction. The model parameters and complexity of our method are also improved greatly.

DeepPear: Deep Pose Estimation and Action Recognition

Wen-Jiin Tsai, You-Ying Jhuang

Responsive image

Auto-TLDR; Human Action Recognition Using RGB Video Using 3D Human Pose and Appearance Features

Slides Poster Similar

Human action recognition has been a popular issue recently because it can be applied in many applications such as intelligent surveillance systems, human-robot interaction, and autonomous vehicle control. Human action recognition using RGB video is a challenging task because the learning of actions is easily affected by the cluttered background. To cope with this problem, the proposed method estimates 3D human poses first which can help remove the cluttered background and focus on the human body. In addition to the human poses, the proposed method also utilizes appearance features nearby the predicted joints to make our action prediction context-aware. Instead of using 3D convolutional neural networks as many action recognition approaches did, the proposed method uses a two-stream architecture that aggregates the results from skeleton-based and appearance-based approaches to do action recognition. Experimental results show that the proposed method achieved state-of-the-art performance on NTU RGB+D which is a largescale dataset for human action recognition.

A Grid-Based Representation for Human Action Recognition

Soufiane Lamghari, Guillaume-Alexandre Bilodeau, Nicolas Saunier

Responsive image

Auto-TLDR; GRAR: Grid-based Representation for Action Recognition in Videos

Slides Poster Similar

Human action recognition (HAR) in videos is a fundamental research topic in computer vision. It consists mainly in understanding actions performed by humans based on a sequence of visual observations. In recent years, HAR have witnessed significant progress, especially with the emergence of deep learning models. However, most of existing approaches for action recognition rely on information that is not always relevant for the task, and are limited in the way they fuse temporal information. In this paper, we propose a novel method for human action recognition that encodes efficiently the most discriminative appearance information of an action with explicit attention on representative pose features, into a new compact grid representation. Our GRAR (Grid-based Representation for Action Recognition) method is tested on several benchmark datasets that demonstrate that our model can accurately recognize human actions, despite intra-class appearance variations and occlusion challenges.

Shape Consistent 2D Keypoint Estimation under Domain Shift

Levi Vasconcelos, Massimiliano Mancini, Davide Boscaini, Barbara Caputo, Elisa Ricci

Responsive image

Auto-TLDR; Deep Adaptation for Keypoint Prediction under Domain Shift

Slides Poster Similar

Recent unsupervised domain adaptation methods based on deep architectures have shown remarkable performance not only in traditional classification tasks but also in more complex problems involving structured predictions (e.g. semantic segmentation, depth estimation). Following this trend, in this paper we present a novel deep adaptation framework for estimating keypoints under \textit{domain shift}, i.e. when the training (\textit{source}) and the test (\textit{target}) images significantly differ in terms of visual appearance. Our method seamlessly combines three different components: feature alignment, adversarial training and self-supervision. Specifically, our deep architecture leverages from domain-specific distribution alignment layers to perform target adaptation at the feature level. Furthermore, a novel loss is proposed which combines an adversarial term for ensuring aligned predictions in the output space and a geometric consistency term which guarantees coherent predictions between a target sample and its perturbed version. Our extensive experimental evaluation conducted on three publicly available benchmarks shows that our approach outperforms state-of-the-art domain adaptation methods in the 2D keypoint prediction task.

What and How? Jointly Forecasting Human Action and Pose

Yanjun Zhu, Yanxia Zhang, Qiong Liu, Andreas Girgensohn

Responsive image

Auto-TLDR; Forecasting Human Actions and Motion Trajectories with Joint Action Classification and Pose Regression

Slides Poster Similar

Forecasting human actions and motion trajectories addresses the problem of predicting what a person is going to do next and how they will perform it. This is crucial in a wide range of applications such as assisted living and future co-robotic settings. We propose to simultaneously learn actions and action-related human motion dynamics, while existing works perform them independently. In this paper, we present a method to jointly forecast categories of human action and the pose of skeletal joints in the hope that the two tasks can help each other. As a result, our system can predict not only the future actions but also the motion trajectories that will result. To achieve this, we define a task of joint action classification and pose regression. We employ a sequence to sequence encoder-decoder model combined with multi-task learning to forecast future actions and poses progressively before the action happens. Experimental results on two public datasets, IkeaDB and OAD, demonstrate the effectiveness of the proposed method.

Rotational Adjoint Methods for Learning-Free 3D Human Pose Estimation from IMU Data

Caterina Emilia Agelide Buizza, Yiannis Demiris

Responsive image

Auto-TLDR; Learning-free 3D Human Pose Estimation from Inertial Measurement Unit Data

Poster Similar

We present a new framework for learning-free 3D human pose estimation from Inertial Measurement Unit (IMU) data. The proposed method does not rely on a full motion sequence to calculate a pose for any particular time point and thus can operate in real-time. A cost function based only on joint rotations is used, removing the need for frequent transformations between rotations and 3D Cartesian coordinates. A Jacobian that preserves skeleton structure is derived using Adjoint methods from Variational Data Assimilation. To facilitate further research in IMU-based Motion Capture, we provide a dataset that combines RGB and depth images from an Intel RealSense camera, marker-based motion capture from an Optitrack system and Xsens IMU data. We have evaluated our method on both our dataset and the Total Capture dataset, showing an average error across 24 joints of 0.45 and 0.48 radians respectively.

Dynamic Guided Network for Monocular Depth Estimation

Xiaoxia Xing, Yinghao Cai, Yiping Yang, Dayong Wen

Responsive image

Auto-TLDR; DGNet: Dynamic Guidance Upsampling for Self-attention-Decoding for Monocular Depth Estimation

Slides Poster Similar

Self-attention or encoder-decoder structure has been widely used in deep neural networks for monocular depth estimation tasks. The former mechanism are capable to capture long-range information by computing the representation of each position by a weighted sum of the features at all positions, while the latter networks can capture structural details information by gradually recovering the spatial information. In this work, we combine the advantages of both methods. Specifically, our proposed model, DGNet, extends EMANet Network by adding an effective decoder module to refine the depth results. In the decoder stage, we further design dynamic guidance upsampling which uses local neighboring information of low-level features guide coarser depth to upsample. In this way, dynamic guidance upsampling generates content-dependent and spatially-variant kernels for depth upsampling which makes full use of spatial details information from low-level features. Experimental results demonstrate that our method obtains higher accuracy and generates the desired depth map.

JUMPS: Joints Upsampling Method for Pose Sequences

Lucas Mourot, Francois Le Clerc, Cédric Thébault, Pierre Hellier

Responsive image

Auto-TLDR; JUMPS: Increasing the Number of Joints in 2D Pose Estimation and Recovering Occluded or Missing Joints

Slides Poster Similar

Human Pose Estimation is a low-level task useful for surveillance, human action recognition, and scene understanding at large. It also offers promising perspectives for the animation of synthetic characters. For all these applications, and especially the latter, estimating the positions of many joints is desirable for improved performance and realism. To this purpose, we propose a novel method called JUMPS for increasing the number of joints in 2D pose estimates and recovering occluded or missing joints. We believe this is the first attempt to address the issue. We build on a deep generative model that combines a GAN and an encoder. The GAN learns the distribution of high-resolution human pose sequences, the encoder maps the input low-resolution sequences to its latent space. Inpainting is obtained by computing the latent representation whose decoding by the GAN generator optimally matches the joints locations at the input. Post-processing a 2D pose sequence using our method provides a richer representation of the character motion. We show experimentally that the localization accuracy of the additional joints is on average on par with the original pose estimates.

Extending Single Beam Lidar to Full Resolution by Fusing with Single Image Depth Estimation

Yawen Lu, Yuxing Wang, Devarth Parikh, Guoyu Lu

Responsive image

Auto-TLDR; Self-supervised LIDAR for Low-Cost Depth Estimation

Slides Similar

Depth estimation is playing an important role in indoor and outdoor scene understanding, autonomous driving, augmented reality and many other tasks. Vehicles and robotics are able to use active illumination sensors such as LIDAR to receive high precision depth estimation. However, high-resolution Lidars are usually too expensive, which limits its massive production on various applications. Though single beam LIDAR enjoys the benefits of low cost, one beam depth sensing is not usually sufficient to perceive the surrounding environment in many scenarios. In this paper, we propose a learning-based framework to explore to replicate similar or even higher performance as costly LIDARs with our designed self-supervised network and a low-cost single-beam LIDAR. After the accurate calibration with a visible camera, the single beam LIDAR can adjust the scale uncertainty of the depth map estimated by the visible camera. The adjusted depth map enjoys the benefits of high resolution and sensing accuracy as high beam LIDAR and maintains low-cost as single beam LIDAR. Thus we can achieve similar sensing effect of high beam LIDAR with more than a 50-100 times cheaper price (e.g., \$80000 Velodyne HDL-64E LIDAR v.s. \$1000 SICK TIM-781 2D LIDAR and normal camera). The proposed approach is verified on our collected dataset and public dataset with superior depth-sensing performance.

Attention-Oriented Action Recognition for Real-Time Human-Robot Interaction

Ziyang Song, Ziyi Yin, Zejian Yuan, Chong Zhang, Wanchao Chi, Yonggen Ling, Shenghao Zhang

Responsive image

Auto-TLDR; Attention-Oriented Multi-Level Network for Action Recognition in Interaction Scenes

Slides Poster Similar

Despite the notable progress made in action recognition tasks, not much work has been done in action recognition specifically for human-robot interaction. In this paper, we deeply explore the characteristics of the action recognition task in interaction scenes and propose an attention-oriented multi-level network framework to meet the need for real-time interaction. Specifically, a Pre-Attention network is employed to roughly focus on the interactor in the scene at low resolution firstly and then perform fine-grained pose estimation at high resolution. The other compact CNN receives the extracted skeleton sequence as input for action recognition, utilizing attention-like mechanisms to capture local spatial-temporal patterns and global semantic information effectively. To evaluate our approach, we construct a new action dataset specially for the recognition task in interaction scenes. Experimental results on our dataset and high efficiency (112 fps at 640 x 480 RGBD) on the mobile computing platform (Nvidia Jetson AGX Xavier) demonstrate excellent applicability of our method on action recognition in real-time human-robot interaction.

Adaptive Feature Fusion Network for Gaze Tracking in Mobile Tablets

Yiwei Bao, Yihua Cheng, Yunfei Liu, Feng Lu

Responsive image

Auto-TLDR; Adaptive Feature Fusion Network for Multi-stream Gaze Estimation in Mobile Tablets

Slides Poster Similar

Recently, many multi-stream gaze estimation methods have been proposed. They estimate gaze from eye and face appearances and achieve reasonable accuracy. However, most of the methods simply concatenate the features extracted from eye and face appearance. The feature fusion process has been ignored. In this paper, we propose a novel Adaptive Feature Fusion Network (AFF-Net), which performs gaze tracking task in mobile tablets. We stack two-eye feature maps and utilize Squeeze-and-Excitation layers to adaptively fuse two-eye features based on different eye features. Meanwhile, we also propose Adaptive Group Normalization to recalibrate eye features with the guidance of face appearance characteristics. Extensive experiments on both GazeCapture and MPIIFaceGaze datasets demonstrate consistently superior performance of the proposed method.

6D Pose Estimation with Correlation Fusion

Yi Cheng, Hongyuan Zhu, Ying Sun, Cihan Acar, Wei Jing, Yan Wu, Liyuan Li, Cheston Tan, Joo-Hwee Lim

Responsive image

Auto-TLDR; Intra- and Inter-modality Fusion for 6D Object Pose Estimation with Attention Mechanism

Slides Poster Similar

6D object pose estimation is widely applied in robotic tasks such as grasping and manipulation. Prior methods using RGB-only images are vulnerable to heavy occlusion and poor illumination, so it is important to complement them with depth information. However, existing methods using RGB-D data cannot adequately exploit consistent and complementary information between RGB and depth modalities. In this paper, we present a novel method to effectively consider the correlation within and across both modalities with attention mechanism to learn discriminative and compact multi-modal features. Then, effective fusion strategies for intra- and inter-correlation modules are explored to ensure efficient information flow between RGB and depth. To our best knowledge, this is the first work to explore effective intra- and inter-modality fusion in 6D pose estimation. The experimental results show that our method can achieve the state-of-the-art performance on LineMOD and YCBVideo dataset. We also demonstrate that the proposed method can benefit a real-world robot grasping task by providing accurate object pose estimation.

Pose-Based Body Language Recognition for Emotion and Psychiatric Symptom Interpretation

Zhengyuan Yang, Amanda Kay, Yuncheng Li, Wendi Cross, Jiebo Luo

Responsive image

Auto-TLDR; Body Language Based Emotion Recognition for Psychiatric Symptoms Prediction

Slides Poster Similar

Inspired by the human ability to infer emotions from body language, we propose an automated framework for body language based emotion recognition starting from regular RGB videos. In collaboration with psychologists, we further extend the framework for psychiatric symptom prediction. Because a specific application domain of the proposed framework may only supply a limited amount of data, the framework is designed to work on a small training set and possess a good transferability. The proposed system in the first stage generates sequences of body language predictions based on human poses estimated from input videos. In the second stage, the predicted sequences are fed into a temporal network for emotion interpretation and psychiatric symptom prediction. We first validate the accuracy and transferability of the proposed body language recognition method on several public action recognition datasets. We then evaluate the framework on a proposed URMC dataset, which consists of conversations between a standardized patient and a behavioral health professional, along with expert annotations of body language, emotions, and potential psychiatric symptoms. The proposed framework outperforms other methods on the URMC dataset.

Face Super-Resolution Network with Incremental Enhancement of Facial Parsing Information

Shuang Liu, Chengyi Xiong, Zhirong Gao

Responsive image

Auto-TLDR; Learning-based Face Super-Resolution with Incremental Boosting Facial Parsing Information

Slides Poster Similar

Recently, facial priors based face super-resolution (SR) methods have obtained significant performance gains in dealing with extremely degraded facial images, and facial priors have also been proved useful in facilitating the inference of face images. Based on this, how to fully fuse facial priors into deep features to improve face SR performance has attracted a major attention. In this paper, we propose a learning-based face SR approach with incremental boosting facial parsing information (IFPSR) for high-magnification of low-resolution faces. The proposed IFPSR method consists of three main parts: i) a three-stage parsing map embedded features upsampling network, in which image recovery and prior estimation processes are performed simultaneously and progressively to improve the image resolution; ii) a progressive training method and a joint facial attention and heatmap loss to obtain better facial attributes; iii) the channel attention strategy in residual dense blocks to adaptively learn facial features. Extensive experimental results show that compared with the state-of-the-art methods in terms of quantitative and qualitative metrics, our approach can achieve an outstanding balance between SR image quality and low network complexity.

Multi-Attribute Regression Network for Face Reconstruction

Xiangzheng Li, Suping Wu

Responsive image

Auto-TLDR; A Multi-Attribute Regression Network for Face Reconstruction

Slides Poster Similar

In this paper, we propose a multi-attribute regression network (MARN) to investigate the problem of face reconstruction, especially in challenging cases when faces undergo large variations including severe poses, extreme expressions, and partial occlusions in unconstrained environments. The traditional 3DMM parametric regression method is absent from the learning of identity, expression, and attitude attributes, resulting in lacking geometric details in the reconstructed face. Our MARN method is to enable the network to better extract the feature information of face identity, expression, and pose attributes. We introduced identity, expression, and pose attribute loss functions to enhance the learning of details in each attribute. At the same time, we carefully design the geometric contour constraint loss function and use the constraints of sparse 2D face landmarks to improve the reconstructed geometric contour information. The experimental results show that our face reconstruction method has achieved significant results on the AFLW2000-3D and AFLW datasets compared with the most advanced methods. In addition, there has been a great improvement in dense face alignment. .

Deep Space Probing for Point Cloud Analysis

Yirong Yang, Bin Fan, Yongcheng Liu, Hua Lin, Jiyong Zhang, Xin Liu, 蔡鑫宇 蔡鑫宇, Shiming Xiang, Chunhong Pan

Responsive image

Auto-TLDR; SPCNN: Space Probing Convolutional Neural Network for Point Cloud Analysis

Slides Poster Similar

3D points distribute in a continuous 3D space irregularly, thus directly adapting 2D image convolution to 3D points is not an easy job. Previous works often artificially divide the space into regular grids, yet it could be suboptimal to learn geometry. In this paper, we propose SPCNN, namely, Space Probing Convolutional Neural Network, which naturally generalizes image CNN to deal with point clouds. The key idea of SPCNN is learning to probe the 3D space in an adaptive manner. Specifically, we define a pool of learnable convolutional weights, and let each point in the local region learn to choose a suitable convolutional weight from the pool. This is achieved by constructing a geometry guided index-mapping function that implicitly establishes a correspondence between convolutional weights and some local regions in the neighborhood (Fig. 1). In this way, the index-mapping function learns to adaptively partition nearby space for local geometry pattern recognition. With this convolution as a basic operator, SPCNN, a hierarchical architecture can be developed for effective point cloud analysis. Extensive experiments on challenging benchmarks across three tasks demonstrate that SPCNN achieves the state-of-the-art or has competitive performance.

Towards Efficient 3D Point Cloud Scene Completion Via Novel Depth View Synthesis

Haiyan Wang, Liang Yang, Xuejian Rong, Ying-Li Tian

Responsive image

Auto-TLDR; 3D Point Cloud Completion with Depth View Synthesis and Depth View synthesis

Poster Similar

3D point cloud completion has been a long-standing challenge at scale, and corresponding per-point supervised training strategies suffered from the cumbersome annotations. 2D supervision has recently emerged as a promising alternative for 3D tasks, but specific approaches for 3D point cloud completion still remain to be explored. To overcome these limitations, we propose an end-to-end method that directly lifts a single depth map to a completed point cloud. With one depth map as input, a multi-way novel depth view synthesis network (NDVNet) is designed to infer coarsely completed depth maps under various viewpoints. Meanwhile, a geometric depth perspective rendering module is introduced to utilize the raw input depth map to generate a re-projected depth map for each view. Therefore, the two parallelly generated depth maps for each view are further concatenated and refined by a depth completion network (DCNet). The final completed point cloud is fused from all refined depth views. Experimental results demonstrate the effectiveness of our proposed approach composed of aforementioned components, to produce high-quality state-of-the-art results on the popular SUNCG benchmark.

Hybrid Approach for 3D Head Reconstruction: Using Neural Networks and Visual Geometry

Oussema Bouafif, Bogdan Khomutenko, Mohammed Daoudi

Responsive image

Auto-TLDR; Recovering 3D Head Geometry from a Single Image using Deep Learning and Geometric Techniques

Slides Poster Similar

Recovering the 3D geometric structure of a face from a single input image is a challenging active research area in computer vision. In this paper, we present a novel method for reconstructing 3D heads from a single or multiple image(s) using a hybrid approach based on deep learning and geometric techniques. We propose an encoder-decoder network based on the U-net architecture and trained on synthetic data only. It predicts both pixel-wise normal vectors and landmarks maps from a single input photo. Landmarks are used for the pose computation and the initialization of the optimization problem, which, in turn, reconstructs the 3D head geometry by using a parametric morphable model and normal vector fields. State-of-the-art results are achieved through qualitative and quantitative evaluation tests on both single and multi-view settings. Despite the fact that the model was trained only on synthetic data, it successfully recovers 3D geometry and precise poses for real-world images.

Human Segmentation with Dynamic LiDAR Data

Tao Zhong, Wonjik Kim, Masayuki Tanaka, Masatoshi Okutomi

Responsive image

Auto-TLDR; Spatiotemporal Neural Network for Human Segmentation with Dynamic Point Clouds

Slides Similar

Consecutive LiDAR scans and depth images compose dynamic 3D sequences, which contain more abundant spatiotemporal information than a single frame. Similar to the development history of image and video perception, dynamic 3D sequence perception starts to come into sight after inspiring research on static 3D data perception. This work proposes a spatiotemporal neural network for human segmentation with the dynamic LiDAR point clouds. It takes a sequence of depth images as input. It has a two-branch structure, i.e., the spatial segmentation branch and the temporal velocity estimation branch. The velocity estimation branch is designed to capture motion cues from the input sequence and then propagates them to the other branch. So that the segmentation branch segments humans according to both spatial and temporal features. These two branches are jointly learned on a generated dynamic point cloud data set for human recognition. Our works fill in the blank of dynamic point cloud perception with the spherical representation of point cloud and achieves high accuracy. The experiments indicate that the introduction of temporal feature benefits the segmentation of dynamic point cloud perception.

Let's Play Music: Audio-Driven Performance Video Generation

Hao Zhu, Yi Li, Feixia Zhu, Aihua Zheng, Ran He

Responsive image

Auto-TLDR; APVG: Audio-driven Performance Video Generation Using Structured Temporal UNet

Slides Poster Similar

We propose a new task named Audio-driven Performance Video Generation (APVG), which aims to synthesize the video of a person playing a certain instrument guided by a given music audio clip. It is a challenging task to generate the high-dimensional temporal consistent videos from low-dimensional audio modality. In this paper, we propose a multi-staged framework to achieve this new task to generate realistic and synchronized performance video from given music. Firstly, we provide both global appearance and local spatial information by generating the coarse videos and keypoints of body and hands from a given music respectively. Then, we propose to transform the generated keypoints to heatmap via a differentiable space transformer, since the heatmap offers more spatial information but is harder to generate directly from audio. Finally, we propose a Structured Temporal UNet (STU) to extract both intra-frame structured information and inter-frame temporal consistency. They are obtained via graph-based structure module, and CNN-GRU based high-level temporal module respectively for final video generation. Comprehensive experiments validate the effectiveness of our proposed framework.

Learning Semantic Representations Via Joint 3D Face Reconstruction and Facial Attribute Estimation

Zichun Weng, Youjun Xiang, Xianfeng Li, Juntao Liang, Wanliang Huo, Yuli Fu

Responsive image

Auto-TLDR; Joint Framework for 3D Face Reconstruction with Facial Attribute Estimation

Slides Poster Similar

We propose a novel joint framework for 3D face reconstruction (3DFR) that integrates facial attribute estimation (FAE) as an auxiliary task. One of the essential problems of 3DFR is to extract semantic facial features (e.g., Big Nose, High Cheekbones, and Asian) from in-the-wild 2D images, which is inherently involved with FAE. These two tasks, though heterogeneous, are highly relevant to each other. To achieve this, we leverage a Convolutional Neural Network to extract shared facial representations for both shape decoder and attribute classifier. We further develop an in-batch hybrid-task training scheme that enables our model to learn from heterogeneous facial datasets jointly within a mini-batch. Thanks to the joint loss that provides supervision from both 3DFR and FAE domains, our model learns the correlations between 3D shapes and facial attributes, which benefit both feature extraction and shape inference. Quantitative evaluation and qualitative visualization results confirm the effectiveness and robustness of our joint framework.

Attention-Driven Body Pose Encoding for Human Activity Recognition

Bappaditya Debnath, Swagat Kumar, Marry O'Brien, Ardhendu Behera

Responsive image

Auto-TLDR; Attention-based Body Pose Encoding for Human Activity Recognition

Slides Poster Similar

This article proposes a novel attention-based body pose encoding for human activity recognition. Most of the existing human activity recognition approaches based on 3D pose data often enrich the input data using additional handcrafted representations such as velocity, super normal vectors, pairwise relations, and so on. The enriched data complements the 3D body joint position data and improves the model performance. In this paper, we propose a novel approach that learns enhanced feature representations from a given sequence of 3D body joints. To achieve this, the approach exploits two body pose streams: 1) a spatial stream which encodes the spatial relationship between various body joints at each time point to learn spatial structure involving the spatial distribution of different body joints 2) a temporal stream that learns the temporal variation of individual body joints over the entire sequence duration to present a temporally enhanced representation. Afterwards, these two pose streams are fused with a multi-head attention mechanism. We also capture the contextual information from the RGB video stream using a deep Convolutional Neural Network (CNN) model combined with a multi-head attention and a bidirectional Long Short-Term Memory (LSTM) network. Finally, the RGB video stream is combined with the fused body pose stream to give a novel end-to-end deep model for effective human activity recognition. The proposed model is evaluated on three datasets including the challenging NTU-RGBD dataset and achieves state-of-the-art results.

Partially Supervised Multi-Task Network for Single-View Dietary Assessment

Ya Lu, Thomai Stathopoulou, Stavroula Mougiakakou

Responsive image

Auto-TLDR; Food Volume Estimation from a Single Food Image via Geometric Understanding and Semantic Prediction

Slides Poster Similar

Food volume estimation is an essential step in the pipeline of dietary assessment and demands the precise depth estimation of the food surface and table plane. Existing methods based on computer vision require either multi-image input or additional depth maps, reducing convenience of implementation and practical significance. Despite the recent advances in unsupervised depth estimation from a single image, the achieved performance in the case of large texture-less areas needs to be improved. In this paper, we propose a network architecture that jointly performs geometric understanding (i.e., depth prediction and 3D plane estimation) and semantic prediction on a single food image, enabling a robust and accurate food volume estimation regardless of the texture characteristics of the target plane. For the training of the network, only monocular videos with semantic ground truth are required, while the depth map and 3D plane ground truth are no longer needed. Experimental results on two separate food image databases demonstrate that our method performs robustly on texture-less scenarios and is superior to unsupervised networks and structure from motion based approaches, while it achieves comparable performance to fully-supervised methods.

Better Prior Knowledge Improves Human-Pose-Based Extrinsic Camera Calibration

Olivier Moliner, Sangxia Huang, Kalle Åström

Responsive image

Auto-TLDR; Improving Human-pose-based Extrinsic Calibration for Multi-Camera Systems

Slides Poster Similar

Accurate extrinsic calibration of wide baseline multi-camera systems enables better understanding of 3D scenes for many applications and is of great practical importance. Classical Structure-from-Motion calibration methods require special calibration equipment so that accurate point correspondences can be detected between different views. In addition, an operator with some training is usually needed to ensure that data is collected in a way that leads to good calibration accuracy. This limits the ease of adoption of such technologies. Recently, methods have been proposed to use human pose estimation models to establish point correspondences, thus removing the need for any special equipment. The challenge with this approach is that human pose estimation algorithms typically produce much less accurate feature points compared to classical patch-based methods. Another problem is that ambient human motion might not be optimal for calibration. We build upon prior works and introduce several novel ideas to improve the accuracy of human-pose-based extrinsic calibration. Our first contribution is a robust reprojection loss based on a better understanding of the sources of pose estimation error. Our second contribution is a 3D human pose likelihood model learned from motion capture data. We demonstrate significant improvements in calibration accuracy by evaluating our method on four publicly available datasets.

StrongPose: Bottom-up and Strong Keypoint Heat Map Based Pose Estimation

Niaz Ahmad, Jongwon Yoon

Responsive image

Auto-TLDR; StrongPose: A bottom-up box-free approach for human pose estimation and action recognition

Slides Poster Similar

Adaptation of deep convolutional neural network has made revolutionary progress in human pose estimation, various applications in recent years have drawn considerable attention. However, prediction and localization of the keypoints in single and multi-person images are a challenging problem. Towards this purpose, we present a bottom-up box-free approach for the task of pose estimation and action recognition. We proposed a StrongPose system model that uses part-based modeling to tackle object-part associations. The model utilizes a convolution network that learns how to detect Strong Keypoints Heat Maps (SKHM) and predict their comparative displacements, enabling us to group keypoints into person pose instances. Further, we produce Body Heat Maps (BHM) with the help of keypoints which allows us to localize the human body in the picture. The StrongPose framework is based on fully-convolutional engineering and permits proficient inference, with runtime basically autonomous of the number of individuals display within the scene. Train and test on COCO data alone, our framework achieves COCO test-dev keypoint average precision of 0.708 using ResNet-101 and 0.725 using ResNet-152, which considerably outperforms all prior bottom-up pose estimation frameworks.

Space-Time Domain Tensor Neural Networks: An Application on Human Pose Classification

Konstantinos Makantasis, Athanasios Voulodimos, Anastasios Doulamis, Nikolaos Doulamis, Nikolaos Bakalos

Responsive image

Auto-TLDR; Tensor-Based Neural Network for Spatiotemporal Pose Classifiaction using Three-Dimensional Skeleton Data

Slides Poster Similar

Recent advances in sensing technologies require the design and development of pattern recognition models capable of processing spatiotemporal data efficiently. In this study, we propose a spatially and temporally aware tensor-based neural network for human pose classifiaction using three-dimensional skeleton data. Our model employs three novel components. First, an input layer capable of constructing highly discriminative spatiotemporal features. Second, a tensor fusion operation that produces compact yet rich representations of the data, and third, a tensor-based neural network that processes data representations in their original tensor form. Our model is end-to-end trainable and characterized by a small number of trainable parameters making it suitable for problems where the annotated data is limited. Experimental evaluation of the proposed model indicates that it can achieve state-of-the-art performance.

Real-Time Monocular Depth Estimation with Extremely Light-Weight Neural Network

Mian Jhong Chiu, Wei-Chen Chiu, Hua-Tsung Chen, Jen-Hui Chuang

Responsive image

Auto-TLDR; Real-Time Light-Weight Depth Prediction for Obstacle Avoidance and Environment Sensing with Deep Learning-based CNN

Slides Poster Similar

Obstacle avoidance and environment sensing are crucial applications in autonomous driving and robotics. Among all types of sensors, RGB camera is widely used in these applications as it can offer rich visual contents with relatively low-cost, and using a single image to perform depth estimation has become one of the main focuses in resent research works. However, prior works usually rely on highly complicated computation and power-consuming GPU to achieve such task; therefore, we focus on developing a real-time light-weight system for depth prediction in this paper. Based on the well-known encoder-decoder architecture, we propose a supervised learning-based CNN with detachable decoders that produce depth predictions with different scales. We also formulate a novel log-depth loss function that computes the difference of predicted depth map and ground truth depth map in log space, so as to increase the prediction accuracy for nearby locations. To train our model efficiently, we generate depth map and semantic segmentation with complex teacher models. Via a series of ablation studies and experiments, it is validated that our model can efficiently performs real-time depth prediction with only 0.32M parameters, with the best trained model outperforms previous works on KITTI dataset for various evaluation matrices.

The Role of Cycle Consistency for Generating Better Human Action Videos from a Single Frame

Runze Li, Bir Bhanu

Responsive image

Auto-TLDR; Generating Videos with Human Action Semantics using Cycle Constraints

Slides Poster Similar

This paper addresses the challenging problem of generating videos with human action semantics. Unlike previous work which predict future frames in a single forward pass, this paper introduces the cycle constraints in both forward and backward passes in the generation of human actions. This is achieved by enforcing the appearance and motion consistency across a sequence of frames generated in the future. The approach consists of two stages. In the first stage, the pose of a human body is generated. In the second stage, an image generator is used to generate future frames by using (a) generated human poses in the future from the first stage, (b) the single observed human pose, and (c) the single corresponding future frame. The experiments are performed on three datasets: Weizmann dataset involving simple human actions, Penn Action dataset and UCF-101 dataset containing complicated human actions, especially in sports. The results from these experiments demonstrate the effectiveness of the proposed approach.

Efficient Grouping for Keypoint Detection

Alexey Sidnev, Ekaterina Krasikova, Maxim Kazakov

Responsive image

Auto-TLDR; Automatic Keypoint Grouping for DeepFashion2 Dataset

Slides Poster Similar

DeepFashion2 dataset raises a new challenge for a keypoint detection task. It contains 13 categories with a different number of keypoints, 294 in total. Direct prediction of all keypoints leads to huge memory consumption, slow training, and inference speed. This paper presents a study of keypoint grouping approach and how it affects performance on the example of CenterNet architecture. We propose a simple and efficient automatic grouping technique and apply it to DeepFashion2 fashion landmark task and MS COCO Human Pose task. It allows reducing memory consumption up to 30%, decreasing inference time up to 30%, and training time up to 26% without compromising accuracy.

Multi-Scale Residual Pyramid Attention Network for Monocular Depth Estimation

Jing Liu, Xiaona Zhang, Zhaoxin Li, Tianlu Mao

Responsive image

Auto-TLDR; Multi-scale Residual Pyramid Attention Network for Monocular Depth Estimation

Slides Poster Similar

Monocular depth estimation is a challenging problem in computer vision and is crucial for understanding 3D scene geometry. Recently, deep convolutional neural networks (DCNNs) based methods have improved the estimation accuracy significantly. However, existing methods fail to consider complex textures and geometries in scenes, thereby resulting in loss of local details, distorted object boundaries, and blurry reconstruction. In this paper, we proposed an end-to-end Multi-scale Residual Pyramid Attention Network (MRPAN) to mitigate these problems.First,we propose a Multi-scale Attention Context Aggregation (MACA) module, which consists of Spatial Attention Module (SAM) and Global Attention Module (GAM). By considering the position and scale correlation of pixels from spatial and global perspectives, the proposed module can adaptively learn the similarity between pixels so as to obtain more global context information of the image and recover the complex structure in the scene. Then we proposed an improved Residual Refinement Module (RRM) to further refine the scene structure, giving rise to deeper semantic information and retain more local details. Experimental results show that our method achieves more promisin performance in object boundaries and local details compared with other state-of-the-art methods.

Leveraging a Weakly Adversarial Paradigm for Joint Learning of Disparity and Confidence Estimation

Matteo Poggi, Fabio Tosi, Filippo Aleotti, Stefano Mattoccia

Responsive image

Auto-TLDR; Joint Training of Deep-Networks for Outlier Detection from Stereo Images

Slides Poster Similar

Deep architectures represent the state-of-the-art for perceiving depth from stereo images. Although these methods are highly accurate, it is crucial to effectively detect any outlier through confidence measures since a wrong perception of even small portions of the sensed scene might lead to catastrophic consequences, for instance, in autonomous driving. Purposely, state-of-the-art confidence estimation methods rely on deep-networks as well. In this paper, arguing that these tasks are two sides of the same coin, we propose a novel paradigm for their joint training. Specifically, inspired by the successful deployment of GANs in other fields, we design two deep architectures: a generator for disparity estimation and a discriminator for distinguishing correct assignments from outliers. The two networks are jointly trained in a new peculiar weakly adversarial manner pushing the former to fix the errors detected by the discriminator while keeping the correct prediction unchanged. Experimental results on standard stereo datasets prove that such joint training paradigm yields significant improvements. Moreover, an additional outcome of our proposal is the ability to detect outliers with better accuracy compared to the state-of-the-art.

DmifNet:3D Shape Reconstruction Based on Dynamic Multi-Branch Information Fusion

Lei Li, Suping Wu

Responsive image

Auto-TLDR; DmifNet: Dynamic Multi-branch Information Fusion Network for 3D Shape Reconstruction from a Single-View Image

Slides Similar

3D object reconstruction from a single-view image is a long-standing challenging problem. Previous works are difficult to accurately reconstruct 3D shapes with a complex topology which has rich details at the edges and corners. Moreover, previous works use synthetic data to train their network, but domain adaptation problems occurred when testing on real data. In this paper, we propose a Dynamic Multi-branch Information Fusion Network (DmifNet) which can recover a high-fidelity 3D shape of arbitrary topology from a 2D image. Specifically, we design several side branches from the intermediate layers to make the network produce more diverse representations to improve the generalization ability of network. In addition, we utilize DoG (Difference of Gaussians) to extract edge geometry and corners information from input images. Then, we use a separate side branch network to process the extracted data to better capture edge geometry and corners feature information. Finally, we dynamically fuse the information of all branches to gain final predicted probability. Extensive qualitative and quantitative experiments on a large-scale publicly available dataset demonstrate the validity and efficiency of our method. Code and models are publicly available at https://github.com/leilimaster/DmifNet.

P2D: A Self-Supervised Method for Depth Estimation from Polarimetry

Marc Blanchon, Desire Sidibe, Olivier Morel, Ralph Seulin, Daniel Braun, Fabrice Meriaudeau

Responsive image

Auto-TLDR; Polarimetric Regularization for Monocular Depth Estimation

Slides Poster Similar

Monocular depth estimation is a recurring subject in the field of computer vision. Its ability to describe scenes via a depth map while reducing the constraints related to the formulation of perspective geometry tends to favor its use. However, despite the constant improvement of algorithms, most methods exploit only colorimetric information. Consequently, robustness to events to which the modality is not sensitive to, like specularity or transparency, is neglected. In response to this phenomenon, we propose using polarimetry as an input for a self-supervised monodepth network. Therefore, we propose exploiting polarization cues to encourage accurate reconstruction of scenes. Furthermore, we include a term of polarimetric regularization to state-of-the-art method to take specific advantage of the data. Our method is evaluated both qualitatively and quantitatively demonstrating that the contribution of this new information as well as an enhanced loss function improves depth estimation results, especially for specular areas.

DEN: Disentangling and Exchanging Network for Depth Completion

You-Feng Wu, Vu-Hoang Tran, Ting-Wei Chang, Wei-Chen Chiu, Ching-Chun Huang

Responsive image

Auto-TLDR; Disentangling and Exchanging Network for Depth Completion

Slides Similar

In this paper, we tackle the depth completion problem. Conventional depth sensors usually produce incomplete depth maps due to the property of surface reflection, especially for the window areas, metal surfaces, and object boundaries. However, we observe that the corresponding RGB images are still dense and preserve all of the useful structural information. This brings us to the question of whether we can borrow this structural information from RGB images to inpaint the corresponding incomplete depth maps. In this paper, we answer that question by proposing a Disentangling and Exchanging Network (DEN) for depth completion. The network is designed based on an assumption that after suitable feature disentanglement, RGB images and depth maps share a common domain for representing structural information. So we firstly disentangle both RGB and depth images into domain-invariant content parts, which contain structural information, and domain-specific style parts. Then, by exchanging the complete structural information extracted from RGB image with incomplete information extracted from depth map, we can generate the complete version of depth map. Furthermore, to address the mixed-depth problem, a newly proposed depth representation is applied. By modeling depth estimation as a classification problem coupled with coefficient estimation, blurry edges are enhanced in the depth map. At last, we have implemented ablation experiments to verify the effectiveness of our proposed DEN model. The results also demonstrate the superiority of DEN over some state-of-the-art approaches.