LFIR2Pose: Pose Estimation from an Extremely Low-Resolution FIR Image Sequence

Saki Iwata, Yasutomo Kawanishi, Daisuke Deguchi, Ichiro Ide, Hiroshi Murase, Tomoyoshi Aizawa

Responsive image

Auto-TLDR; LFIR2Pose: Human Pose Estimation from a Low-Resolution Far-InfraRed Image Sequence

Slides Poster

In this paper, we propose a method for human pose estimation from a Low-resolution Far-InfraRed (LFIR) image sequence captured by a 16 × 16 FIR sensor array. Human body estimation from such a single LFIR image is a hard task. For training the estimation model, annotation of the human pose to the images is also a difficult task for human. Thus, we propose the LFIR2Pose model which accepts a sequence of LFIR images and outputs the human pose of the last frame, and also propose an automatic annotation system for the model training. Additionally, considering that the scale of human body motion is largely different among body parts, we also propose a loss function focusing on the difference. Through an experiment, we evaluated the human pose estimation accuracy using an original data set, and confirmed that human pose can be estimated accurately from an LFIR image sequence.

Similar papers

RefiNet: 3D Human Pose Refinement with Depth Maps

Andrea D'Eusanio, Stefano Pini, Guido Borghi, Roberto Vezzani, Rita Cucchiara

Responsive image

Auto-TLDR; RefiNet: A Multi-stage Framework for 3D Human Pose Estimation

Slides Similar

Human Pose Estimation is a fundamental task for many applications in the Computer Vision community and it has been widely investigated in the 2D domain, i.e. intensity images. Therefore, most of the available methods for this task are mainly based on 2D Convolutional Neural Networks and huge manually-annotated RGB datasets, achieving stunning results. In this paper, we propose RefiNet, a multi-stage framework that regresses an extremely-precise 3D human pose estimation from a given 2D pose and a depth map. The framework consists of three different modules, each one specialized in a particular refinement and data representation, i.e. depth patches, 3D skeleton and point clouds. Moreover, we collect a new dataset, namely Baracca, acquired with RGB, depth and thermal cameras and specifically created for the automotive context. Experimental results confirm the quality of the refinement procedure that largely improves the human pose estimations of off-the-shelf 2D methods.

Exploring Severe Occlusion: Multi-Person 3D Pose Estimation with Gated Convolution

Renshu Gu, Gaoang Wang, Jenq-Neng Hwang

Responsive image

Auto-TLDR; 3D Human Pose Estimation for Multi-Human Videos with Occlusion

Slides Similar

3D human pose estimation (HPE) is crucial in human behavior analysis, augmented reality/virtual reality (AR/VR) applications, and self-driving industry. Videos that contain multiple potentially occluded people captured from freely moving monocular cameras are very common in real-world scenarios, while 3D HPE for such scenarios is quite challenging, partially because there is a lack of such data with accurate 3D ground truth labels in existing datasets. In this paper, we propose a temporal regression network with a gated convolution module to transform 2D joints to 3D and recover the missing occluded joints in the meantime. A simple yet effective localization approach is further conducted to transform the normalized pose to the global trajectory. To verify the effectiveness of our approach, we also collect a new moving camera multi-human (MMHuman) dataset that includes multiple people with heavy occlusion captured by moving cameras. The 3D ground truth joints are provided by accurate motion capture (MoCap) system. From the experiments on static-camera based Human3.6M data and our own collected moving-camera based data, we show that our proposed method outperforms most state-of-the-art 2D-to-3D pose estimation methods, especially for the scenarios with heavy occlusions.

What and How? Jointly Forecasting Human Action and Pose

Yanjun Zhu, Yanxia Zhang, Qiong Liu, Andreas Girgensohn

Responsive image

Auto-TLDR; Forecasting Human Actions and Motion Trajectories with Joint Action Classification and Pose Regression

Slides Poster Similar

Forecasting human actions and motion trajectories addresses the problem of predicting what a person is going to do next and how they will perform it. This is crucial in a wide range of applications such as assisted living and future co-robotic settings. We propose to simultaneously learn actions and action-related human motion dynamics, while existing works perform them independently. In this paper, we present a method to jointly forecast categories of human action and the pose of skeletal joints in the hope that the two tasks can help each other. As a result, our system can predict not only the future actions but also the motion trajectories that will result. To achieve this, we define a task of joint action classification and pose regression. We employ a sequence to sequence encoder-decoder model combined with multi-task learning to forecast future actions and poses progressively before the action happens. Experimental results on two public datasets, IkeaDB and OAD, demonstrate the effectiveness of the proposed method.

StrongPose: Bottom-up and Strong Keypoint Heat Map Based Pose Estimation

Niaz Ahmad, Jongwon Yoon

Responsive image

Auto-TLDR; StrongPose: A bottom-up box-free approach for human pose estimation and action recognition

Slides Poster Similar

Adaptation of deep convolutional neural network has made revolutionary progress in human pose estimation, various applications in recent years have drawn considerable attention. However, prediction and localization of the keypoints in single and multi-person images are a challenging problem. Towards this purpose, we present a bottom-up box-free approach for the task of pose estimation and action recognition. We proposed a StrongPose system model that uses part-based modeling to tackle object-part associations. The model utilizes a convolution network that learns how to detect Strong Keypoints Heat Maps (SKHM) and predict their comparative displacements, enabling us to group keypoints into person pose instances. Further, we produce Body Heat Maps (BHM) with the help of keypoints which allows us to localize the human body in the picture. The StrongPose framework is based on fully-convolutional engineering and permits proficient inference, with runtime basically autonomous of the number of individuals display within the scene. Train and test on COCO data alone, our framework achieves COCO test-dev keypoint average precision of 0.708 using ResNet-101 and 0.725 using ResNet-152, which considerably outperforms all prior bottom-up pose estimation frameworks.

A Multi-Task Neural Network for Action Recognition with 3D Key-Points

Rongxiao Tang, Wang Luyang, Zhenhua Guo

Responsive image

Auto-TLDR; Multi-task Neural Network for Action Recognition and 3D Human Pose Estimation

Slides Poster Similar

Action recognition and 3D human pose estimation are the fundamental problems in computer vision and closely related. In this work, we propose a multi-task neural network for action recognition and 3D human pose estimation. The results of the previous methods are still error-prone especially when tested against the images taken in-the-wild, leading error results in action recognition. To solve this problem, we propose a principled approach to generate high quality 3D pose ground truth given any in-the-wild image with a person inside. We achieve this by first devising a novel stereo inspired neural network to directly map any 2D pose to high quality 3D counterpart. Based on the high-quality 3D labels, we carefully design the multi-task framework for action recognition and 3D human pose estimation. The proposed architecture can utilize the shallow, deep features of the images, and the in-the-wild 3D human key-points to guide a more precise result. High quality 3D key-points can fully reflect the morphological features of motions, thus boosting the performance on action recognition. Experiments demonstrate that 3D pose estimation leads to significantly higher performance on action recognition than separated learning. We also evaluate the generalization ability of our method both quantitatively and qualitatively. The proposed architecture performs favorably against the baseline 3D pose estimation methods. In addition, the reported results on Penn Action and NTU datasets demonstrate the effectiveness of our method on the action recognition task.

Rotational Adjoint Methods for Learning-Free 3D Human Pose Estimation from IMU Data

Caterina Emilia Agelide Buizza, Yiannis Demiris

Responsive image

Auto-TLDR; Learning-free 3D Human Pose Estimation from Inertial Measurement Unit Data

Poster Similar

We present a new framework for learning-free 3D human pose estimation from Inertial Measurement Unit (IMU) data. The proposed method does not rely on a full motion sequence to calculate a pose for any particular time point and thus can operate in real-time. A cost function based only on joint rotations is used, removing the need for frequent transformations between rotations and 3D Cartesian coordinates. A Jacobian that preserves skeleton structure is derived using Adjoint methods from Variational Data Assimilation. To facilitate further research in IMU-based Motion Capture, we provide a dataset that combines RGB and depth images from an Intel RealSense camera, marker-based motion capture from an Optitrack system and Xsens IMU data. We have evaluated our method on both our dataset and the Total Capture dataset, showing an average error across 24 joints of 0.45 and 0.48 radians respectively.

A Grid-Based Representation for Human Action Recognition

Soufiane Lamghari, Guillaume-Alexandre Bilodeau, Nicolas Saunier

Responsive image

Auto-TLDR; GRAR: Grid-based Representation for Action Recognition in Videos

Slides Poster Similar

Human action recognition (HAR) in videos is a fundamental research topic in computer vision. It consists mainly in understanding actions performed by humans based on a sequence of visual observations. In recent years, HAR have witnessed significant progress, especially with the emergence of deep learning models. However, most of existing approaches for action recognition rely on information that is not always relevant for the task, and are limited in the way they fuse temporal information. In this paper, we propose a novel method for human action recognition that encodes efficiently the most discriminative appearance information of an action with explicit attention on representative pose features, into a new compact grid representation. Our GRAR (Grid-based Representation for Action Recognition) method is tested on several benchmark datasets that demonstrate that our model can accurately recognize human actions, despite intra-class appearance variations and occlusion challenges.

Video Analytics Gait Trend Measurement for Fall Prevention and Health Monitoring

Lawrence O'Gorman, Xinyi Liu, Md Imran Sarker, Mariofanna Milanova

Responsive image

Auto-TLDR; Towards Health Monitoring of Gait with Deep Learning

Slides Poster Similar

We design a video analytics system to measure gait over time and detect trend and outliers in the data. The purpose is for health monitoring, the thesis being that trend especially can lead to early detection of declining health and be used to prevent accidents such as falls in the elderly. We use the OpenPose deep learning tool for recognizing the back and neck angle features of walking people, and measure speed as well. Trend and outlier statistics are calculated upon time series of these features. A challenge in this work is lack of testing data of decaying gait. We first designed experiments to measure consistency of the system on a healthy population, then analytically altered this real data to simulate gait decay. Results on about 4000 gait samples of 50 people over 3 months showed good separation of healthy gait subjects from those with trend or outliers, and furthermore the trend measurement was able to detect subtle decay in gait not easily discerned by the human eye.

Better Prior Knowledge Improves Human-Pose-Based Extrinsic Camera Calibration

Olivier Moliner, Sangxia Huang, Kalle Åström

Responsive image

Auto-TLDR; Improving Human-pose-based Extrinsic Calibration for Multi-Camera Systems

Slides Poster Similar

Accurate extrinsic calibration of wide baseline multi-camera systems enables better understanding of 3D scenes for many applications and is of great practical importance. Classical Structure-from-Motion calibration methods require special calibration equipment so that accurate point correspondences can be detected between different views. In addition, an operator with some training is usually needed to ensure that data is collected in a way that leads to good calibration accuracy. This limits the ease of adoption of such technologies. Recently, methods have been proposed to use human pose estimation models to establish point correspondences, thus removing the need for any special equipment. The challenge with this approach is that human pose estimation algorithms typically produce much less accurate feature points compared to classical patch-based methods. Another problem is that ambient human motion might not be optimal for calibration. We build upon prior works and introduce several novel ideas to improve the accuracy of human-pose-based extrinsic calibration. Our first contribution is a robust reprojection loss based on a better understanding of the sources of pose estimation error. Our second contribution is a 3D human pose likelihood model learned from motion capture data. We demonstrate significant improvements in calibration accuracy by evaluating our method on four publicly available datasets.

Pose-Based Body Language Recognition for Emotion and Psychiatric Symptom Interpretation

Zhengyuan Yang, Amanda Kay, Yuncheng Li, Wendi Cross, Jiebo Luo

Responsive image

Auto-TLDR; Body Language Based Emotion Recognition for Psychiatric Symptoms Prediction

Slides Poster Similar

Inspired by the human ability to infer emotions from body language, we propose an automated framework for body language based emotion recognition starting from regular RGB videos. In collaboration with psychologists, we further extend the framework for psychiatric symptom prediction. Because a specific application domain of the proposed framework may only supply a limited amount of data, the framework is designed to work on a small training set and possess a good transferability. The proposed system in the first stage generates sequences of body language predictions based on human poses estimated from input videos. In the second stage, the predicted sequences are fed into a temporal network for emotion interpretation and psychiatric symptom prediction. We first validate the accuracy and transferability of the proposed body language recognition method on several public action recognition datasets. We then evaluate the framework on a proposed URMC dataset, which consists of conversations between a standardized patient and a behavioral health professional, along with expert annotations of body language, emotions, and potential psychiatric symptoms. The proposed framework outperforms other methods on the URMC dataset.

Orthographic Projection Linear Regression for Single Image 3D Human Pose Estimation

Yahui Zhang, Shaodi You, Theo Gevers

Responsive image

Auto-TLDR; A Deep Neural Network for 3D Human Pose Estimation from a Single 2D Image in the Wild

Slides Poster Similar

3D human pose estimation from a single 2D image in the wild is an important computer vision task but yet extremely challenging. Unlike images taken from indoor and well constrained environments, 2D outdoor images in the wild are extremely complex because of varying imaging conditions. Furthermore, 2D images usually do not have corresponding 3D pose ground truth making a supervised approach ill constrained. Therefore, in this paper, we propose to associate the 3D human pose, the 2D human pose projection and the 2D image appearance through a new orthographic projection based linear regression module. Unlike existing reprojection based approaches, our orthographic projection and regression do not suffer from small angle problems, which usually lead to overfitting in the depth dimension. Hence, we propose a deep neural network which adopts the 2D pose, 3D pose regression and orthographic projection linear regression module. The proposed method shows state-of-the art performance on the Human3.6M dataset and generalizes well to in-the-wild images.

Space-Time Domain Tensor Neural Networks: An Application on Human Pose Classification

Konstantinos Makantasis, Athanasios Voulodimos, Anastasios Doulamis, Nikolaos Doulamis, Nikolaos Bakalos

Responsive image

Auto-TLDR; Tensor-Based Neural Network for Spatiotemporal Pose Classifiaction using Three-Dimensional Skeleton Data

Slides Poster Similar

Recent advances in sensing technologies require the design and development of pattern recognition models capable of processing spatiotemporal data efficiently. In this study, we propose a spatially and temporally aware tensor-based neural network for human pose classifiaction using three-dimensional skeleton data. Our model employs three novel components. First, an input layer capable of constructing highly discriminative spatiotemporal features. Second, a tensor fusion operation that produces compact yet rich representations of the data, and third, a tensor-based neural network that processes data representations in their original tensor form. Our model is end-to-end trainable and characterized by a small number of trainable parameters making it suitable for problems where the annotated data is limited. Experimental evaluation of the proposed model indicates that it can achieve state-of-the-art performance.

Inner Eye Canthus Localization for Human Body Temperature Screening

Claudio Ferrari, Lorenzo Berlincioni, Marco Bertini, Alberto Del Bimbo

Responsive image

Auto-TLDR; Automatic Localization of the Inner Eye Canthus in Thermal Face Images using 3D Morphable Face Model

Slides Poster Similar

In this paper, we propose an automatic approach for localizing the inner eye canthus in thermal face images. We first coarsely detect 5 facial keypoints corresponding to the center of the eyes, the nosetip and the ears. Then we compute a sparse 2D-3D points correspondence using a 3D Morphable Face Model (3DMM). This correspondence is used to project the entire 3D face onto the image, and subsequently locate the inner eye canthus. Detecting this location allows to obtain the most precise body temperature measurement for a person using a thermal camera. We evaluated the approach on a thermal face dataset provided with manually annotated landmarks. However, such manual annotations are normally conceived to identify facial parts such as eyes, nose and mouth, and are not specifically tailored for localizing the eye canthus region. As additional contribution, we enrich the original dataset by using the annotated landmarks to deform and project the 3DMM onto the images. Then, by manually selecting a small region corresponding to the eye canthus, we enrich the dataset with additional annotations. By using the manual landmarks, we ensure the correctness of the 3DMM projection, which can be used as ground-truth for future evaluations. Moreover, we supply the dataset with the 3D head poses and per-point visibility masks for detecting self-occlusions. The data will be publicly released.

Boundary Guided Image Translation for Pose Estimation from Ultra-Low Resolution Thermal Sensor

Kohei Kurihara, Tianren Wang, Teng Zhang, Brian Carrington Lovell

Responsive image

Auto-TLDR; Pose Estimation on Low-Resolution Thermal Images Using Image-to-Image Translation Architecture

Slides Poster Similar

This work addresses the pose estimation task on low-resolution images captured using thermal sensors which can operate in a no-light environment. Low-resolution thermal sensors have been widely adopted in various applications for cost control and privacy protection purposes. In this paper, targeting the challenging scenario of ultra-low resolution thermal imaging (3232 pixels), we aim to estimate human poses for the purpose of monitoring health conditions and indoor events. To overcome the challenges in ultra-low resolution thermal imaging such as blurred boundaries and data scarcity, we propose a new Image-to-Image (I2I) translation architecture which can translate the original blurred thermal image into a visible light image with sharper boundaries. Then the generated visible light image can be fed into the off-the-shelf pose estimator which was well-trained in the visible domain. Experimental results suggest that the proposed framework outperforms other state-of-the-art methods in the I2I based pose estimation task for our thermal image dataset. Furthermore, we also demonstrated the merits of the proposed method on the publicly available FLIR dataset by measuring the quality of translated images.

Weight Estimation from an RGB-D Camera in Top-View Configuration

Marco Mameli, Marina Paolanti, Nicola Conci, Filippo Tessaro, Emanuele Frontoni, Primo Zingaretti

Responsive image

Auto-TLDR; Top-View Weight Estimation using Deep Neural Networks

Slides Poster Similar

The development of so-called soft-biometrics aims at providing information related to the physical and behavioural characteristics of a person. This paper focuses on bodyweight estimation based on the observation from a top-view RGB-D camera. In fact, the capability to estimate the weight of a person can be of help in many different applications, from health-related scenarios to business intelligence and retail analytics. To deal with this issue, a TVWE (Top-View Weight Estimation) framework is proposed with the aim of predicting the weight. The approach relies on the adoption of Deep Neural Networks (DNNs) that have been trained on depth data. Each network has also been modified in its top section to replace classification with prediction inference. The performance of five state-of-art DNNs has been compared, namely VGG16, ResNet, Inception, DenseNet and Efficient-Net. In addition, a convolutional auto-encoder has also been included for completeness. Considering the limited literature in this domain, the TVWE framework has been evaluated on a new publicly available dataset: “VRAI Weight estimation Dataset”, which also collects, for each subject, labels related to weight, gender, and height. The experimental results have demonstrated that the proposed methods are suitable for this task, bringing different and significant insights for the application of the solution in different domains.

Light3DPose: Real-Time Multi-Person 3D Pose Estimation from Multiple Views

Alessio Elmi, Davide Mazzini, Pietro Tortella

Responsive image

Auto-TLDR; 3D Pose Estimation of Multiple People from a Few calibrated Camera Views using Deep Learning

Slides Poster Similar

We present an approach to perform 3D pose estimation of multiple people from a few calibrated camera views. Our architecture, leveraging the recently proposed unprojection layer, aggregates feature-maps from a 2D pose estimator backbone into a comprehensive representation of the 3D scene. Such intermediate representation is then elaborated by a fully-convolutional volumetric network and a decoding stage to extract 3D skeletons with sub-voxel accuracy. Our method achieves state of the art MPJPE on the CMU Panoptic dataset using a few unseen views and obtains competitive results even with a single input view. We also assess the transfer learning capabilities of the model by testing it against the publicly available Shelf dataset obtaining good performance metrics. The proposed method is inherently efficient: as a pure bottom-up approach, it is computationally independent of the number of people in the scene. Furthermore, even though the computational burden of the 2D part scales linearly with the number of input views, the overall architecture is able to exploit a very lightweight 2D backbone which is orders of magnitude faster than the volumetric counterpart, resulting in fast inference time. The system can run at 6 FPS, processing up to 10 camera views on a single 1080Ti GPU.

DeepPear: Deep Pose Estimation and Action Recognition

Wen-Jiin Tsai, You-Ying Jhuang

Responsive image

Auto-TLDR; Human Action Recognition Using RGB Video Using 3D Human Pose and Appearance Features

Slides Poster Similar

Human action recognition has been a popular issue recently because it can be applied in many applications such as intelligent surveillance systems, human-robot interaction, and autonomous vehicle control. Human action recognition using RGB video is a challenging task because the learning of actions is easily affected by the cluttered background. To cope with this problem, the proposed method estimates 3D human poses first which can help remove the cluttered background and focus on the human body. In addition to the human poses, the proposed method also utilizes appearance features nearby the predicted joints to make our action prediction context-aware. Instead of using 3D convolutional neural networks as many action recognition approaches did, the proposed method uses a two-stream architecture that aggregates the results from skeleton-based and appearance-based approaches to do action recognition. Experimental results show that the proposed method achieved state-of-the-art performance on NTU RGB+D which is a largescale dataset for human action recognition.

HPERL: 3D Human Pose Estimastion from RGB and LiDAR

Michael Fürst, Shriya T.P. Gupta, René Schuster, Oliver Wasenmüler, Didier Stricker

Responsive image

Auto-TLDR; 3D Human Pose Estimation Using RGB and LiDAR Using Weakly-Supervised Approach

Slides Poster Similar

In-the-wild human pose estimation has a huge potential for various fields, ranging from animation and action recognition to intention recognition and prediction for autonomous driving. The current state-of-the-art is focused only on RGB and RGB-D approaches for predicting the 3D human pose. However, not using precise LiDAR depth information limits the performance and leads to very inaccurate absolute pose estimation. With LiDAR sensors becoming more affordable and common on robots and autonomous vehicle setups, we propose an end-to-end architecture using RGB and LiDAR to predict the absolute 3D human pose with unprecedented precision. Additionally, we introduce a weakly-supervised approach to generate 3D predictions using 2D pose annotations from PedX. This allows for many new opportunities in the field of 3D human pose estimation.

Attention-Oriented Action Recognition for Real-Time Human-Robot Interaction

Ziyang Song, Ziyi Yin, Zejian Yuan, Chong Zhang, Wanchao Chi, Yonggen Ling, Shenghao Zhang

Responsive image

Auto-TLDR; Attention-Oriented Multi-Level Network for Action Recognition in Interaction Scenes

Slides Poster Similar

Despite the notable progress made in action recognition tasks, not much work has been done in action recognition specifically for human-robot interaction. In this paper, we deeply explore the characteristics of the action recognition task in interaction scenes and propose an attention-oriented multi-level network framework to meet the need for real-time interaction. Specifically, a Pre-Attention network is employed to roughly focus on the interactor in the scene at low resolution firstly and then perform fine-grained pose estimation at high resolution. The other compact CNN receives the extracted skeleton sequence as input for action recognition, utilizing attention-like mechanisms to capture local spatial-temporal patterns and global semantic information effectively. To evaluate our approach, we construct a new action dataset specially for the recognition task in interaction scenes. Experimental results on our dataset and high efficiency (112 fps at 640 x 480 RGBD) on the mobile computing platform (Nvidia Jetson AGX Xavier) demonstrate excellent applicability of our method on action recognition in real-time human-robot interaction.

Deep Gait Relative Attribute Using a Signed Quadratic Contrastive Loss

Yuta Hayashi, Shehata Allam, Yasushi Makihara, Daigo Muramatsu, Yasushi Yagi

Responsive image

Auto-TLDR; Signal-Contrastive Loss for Gait Attributes Estimation

Similar

This paper presents a deep learning-based method to estimate gait attributes (e.g., stately, cool, relax, etc.). Similarly to the existing studies on relative attribute, human perception-based annotations on the gait attributes are given to pairs of gait videos (i.e., the first one is better, tie, and the second one is better), and the relative annotations are utilized to train a ranking model of the gait attribute. More specifically, we design a Siamese (i.e., two-stream) network which takes a pair of gait inputs and output gait attribute score for each. We then introduce a suitable loss function called a signed contrastive loss to train the network parameters with the relative annotation. Unlike the existing loss functions for learning to rank does not inherent a nice property of a quadratic contrastive loss, the proposed signed quadratic contrastive loss function inherents the nice property. The quantitative evaluation results reveal that the proposed method shows better or comparable accuracies of relative attribute prediction against the baseline methods.

JT-MGCN: Joint-Temporal Motion Graph Convolutional Network for Skeleton-Based Action Recognition

Suekyeong Nam, Seungkyu Lee

Responsive image

Auto-TLDR; Joint-temporal Motion Graph Convolutional Networks for Action Recognition

Slides Similar

Recently, action recognition methods using graph convolutional networks (GCN) have shown remarkable performance thanks to its concise but effective representation of human body motion. Prior methods construct human body motion graph building edges between neighbor or distant body joints. On the other hand, human action contains lots of temporal variations showing strong temporal correlations between joint motions. Thus the characterization of an action requires a comprehensive analysis of joint motion correlations on spatial and temporal domains. In this paper, we propose Joint-temporal Motion Graph Convolutional Networks (JT-MGCN) in which joint-temporal edges learn the correlations between different joints at different time. Experimental evaluation on large public data sets such as NTU rgb+d data set and kinetics-skeleton data set show outstanding action recognition performance.

Audio-Video Detection of the Active Speaker in Meetings

Francisco Madrigal, Frederic Lerasle, Lionel Pibre, Isabelle Ferrané

Responsive image

Auto-TLDR; Active Speaker Detection with Visual and Contextual Information from Meeting Context

Slides Poster Similar

Meetings are a common activity that provides certain challenges when creating systems that assist them. Such is the case of the Active speaker detection, which can provide useful information for human interaction modeling, or human-robot interaction. Active speaker detection is mostly done using speech, however, certain visual and contextual information can provide additional insights. In this paper we propose an active speaker detection framework that integrates audiovisual features with social information, from the meeting context. Visual cue is processed using a Convolutional Neural Network (CNN) that captures the spatio-temporal relationships. We analyze several CNN architectures with both cues: raw pixels (RGB images) and motion (estimated with optical flow). Contextual reasoning is done with an original methodology, based on the gaze of all participants. We evaluate our proposal with a public \textcolor{black}{benchmark} in state-of-art: AMI corpus. We show how the addition of visual and context information improves the performance of the active speaker detection.

P2 Net: Augmented Parallel-Pyramid Net for Attention Guided Pose Estimation

Luanxuan Hou, Jie Cao, Yuan Zhao, Haifeng Shen, Jian Tang, Ran He

Responsive image

Auto-TLDR; Parallel-Pyramid Net with Partial Attention for Human Pose Estimation

Slides Poster Similar

The target of human pose estimation is to determine the body parts and joint locations of persons in the image. Angular changes, motion blur and occlusion etc. in the natural scenes make this task challenging, while some joints are more difficult to be detected than others. In this paper, we propose an augmented Parallel-Pyramid Net (P^2Net) with an partial attention module. During data preprocessing, we proposed a differentiable auto data augmentation (DA^2) method in which sequences of data augmentations are formulated as a trainable and operational Convolution Neural Network (CNN) component. DA^2 improves the training efficiency and effectiveness. A parallel pyramid structure is followed to compensate the information loss introduced by the network. For the information loss problem in the backbone network, we optimize the backbone network by adopting a new parallel structure without increasing the overall computational complexity. To further refine the predictions after completion of global predictions, an Partial Attention Module (PAM) is defined to extract weighted features from different scale feature maps generated by the parallel pyramid structure. Compared with the traditional up-sampling refining, PAM can better capture the relationship between channels. Experiments corroborate the effectiveness of our proposed method. Notably, our method achieves the best performance on the challenging MSCOCO and MPII datasets.

Temporal Attention-Augmented Graph Convolutional Network for Efficient Skeleton-Based Human Action Recognition

Negar Heidari, Alexandros Iosifidis

Responsive image

Auto-TLDR; Temporal Attention Module for Efficient Graph Convolutional Network-based Action Recognition

Slides Poster Similar

Graph convolutional networks (GCNs) have been very successful in modeling non-Euclidean data structures, like sequences of body skeletons forming actions modeled as spatio-temporal graphs. Most GCN-based action recognition methods use deep feed-forward networks with high computational complexity to process all skeletons in an action. This leads to a high number of floating point operations (ranging from 16G to 100G FLOPs) to process a single sample, making their adoption in restricted computation application scenarios infeasible. In this paper, we propose a temporal attention module (TAM) for increasing the efficiency in skeleton-based action recognition by selecting the most informative skeletons of an action at the early layers of the network. We incorporate the TAM in a light-weight GCN topology to further reduce the overall number of computations. Experimental results on two benchmark datasets show that the proposed method outperforms with a large margin the baseline GCN-based method while having 2.9 times less number of computations. Moreover, it performs on par with the state-of-the-art with up to 9.6 times less number of computations.

IPT: A Dataset for Identity Preserved Tracking in Closed Domains

Thomas Heitzinger, Martin Kampel

Responsive image

Auto-TLDR; Identity Preserved Tracking Using Depth Data for Privacy and Privacy

Slides Poster Similar

We present a public dataset for Identity Preserved Tracking (IPT) consisting of sequences of depth data recorded using an Orbbec Astra depth sensor. The dataset features sequences in ten different locations with a high amount of background variation and is designed to be applicable to a wide range of tasks. Its labeling is versatile, allowing for tracking in either 3d space or image coordinates. Next to frame-by-frame 3d and inferred bounding box labeling we provide supplementary annotation of camera poses and room layouts, split in multiple semantically distinct categories. Intended use-cases are applications where both a high level understanding of scene understanding and privacy are central points of consideration, such as active and assisted living (AAL), security and industrial safety. Compared to similar public datasets IPT distinguishes itself with its sequential data format, 3d instance labeling and room layout annotation. We present baseline object detection results in image coordinates using a YOLOv3 network architecture and implement a background model suitable for online tracking applications to increase detection accuracy. Additionally we propose a novel volumetric non-maximum suppression (V-NMS) approach, taking advantage of known room geometry. Last we provide baseline person tracking results utilizing Multiple Object Tracking Challenge (MOTChallenge) evaluation metrics of the CVPR19 benchmark.

Simple Multi-Resolution Representation Learning for Human Pose Estimation

Trung Tran Quang, Van Giang Nguyen, Daeyoung Kim

Responsive image

Auto-TLDR; Multi-resolution Heatmap Learning for Human Pose Estimation

Slides Poster Similar

Human pose estimation - the process of recognizing human keypoints in a given image - is one of the most important tasks in computer vision and has a wide range of applications including movement diagnostics, surveillance, or self-driving vehicle. The accuracy of human keypoint prediction is increasingly improved thanks to the burgeoning development of deep learning. Most existing methods solved human pose estimation by generating heatmaps in which the ith heatmap indicates the location confidence of the ith keypoint. In this paper, we introduce novel network structures referred to as multi-resolution representation learning for human keypoint prediction. At different resolutions in the learning process, our networks branch off and use extra layers to learn heatmap generation. We firstly consider the architectures for generating the multi-resolution heatmaps after obtaining the lowest-resolution feature maps. Our second approach allows learning during the process of feature extraction in which the heatmaps are generated at each resolution of the feature extractor. The first and second approaches are referred to as multi-resolution heatmap learning and multi-resolution feature map learning respectively. Our architectures are simple yet effective, achieving good performance. We conducted experiments on two common benchmarks for human pose estimation: MS-COCO and MPII dataset.

Occlusion-Tolerant and Personalized 3D Human Pose Estimation in RGB Images

Ammar Qammaz, Antonis Argyros

Responsive image

Auto-TLDR; Real-Time 3D Human Pose Estimation in BVH using Inverse Kinematics Solver and Neural Networks

Poster Similar

We introduce a real-time method that estimates the 3D human pose directly in the popular BVH format, given estimations of the 2D body joints in RGB images. Our contributions include: (a) A novel and compact 2D pose representation. (b) A human body orientation classifier and an ensemble of orientation-tuned neural networks that regress the 3D human pose by also allowing for the decomposition of the body to an upper and lower kinematic hierarchy. This permits the recovery of the human pose even in the case of significant occlusions. (c) An efficient Inverse Kinematics solver that refines the neural-network-based solution providing 3D human pose estimations that are consistent with the limb sizes of a target person (if known). All the above yield a 33% accuracy improvement on the H3.6M dataset compared to the baseline MocapNET method while maintaining real-time performance (70 fps in CPU-only execution).

Toward Building a Data-Driven System ForDetecting Mounting Actions of Black Beef Cattle

Yuriko Kawano, Susumu Saito, Nakano Teppei, Ikumi Kondo, Ryota Yamazaki, Hiromi Kusaka, Minoru Sakaguchi, Tetsuji Ogawa

Responsive image

Auto-TLDR; Cattle Mounting Action Detection Using Crowdsourcing and Pattern Recognition

Poster Similar

This paper tackles on building a pattern recognition system that detects whether a pair of Japanese black beefs captured in a given image region is in a “mounting” action, which is known to be a sign critically important to be detected for cattle farmers before artificial insemination. The “mounting” action refers to a cattle’s action where a cow bends over another cow usually when either cow is in estrus. Although a pattern recognition-based approach for detecting such an action would be appreciated as being low-cost and robust, it had not been discussed much due to the complexity of the system architecture, unavailability of datasets, etc. This study presents i) our image dataset construction technique that exploits both object detection algorithm and crowdsourcing for collecting cattle pair images with labels of either “mounting” or not; and ii) a system for detecting the mounting action from any given image of a cattle pair, developed based on the dataset. Starting with an algorithm for extracting regions of cattle pairs from a video frame based on intersection of single cattle regions, we then designed our crowdsourcing microtask in which crowd workers were given simple guidelines to annotate mounting-action-relevant labels to the extracted regions, to finally obtain a dataset. We also introduce our tandem-layered pattern recognition system trained with the dataset. The system is comprised of two serially-connected machine learning components, and is capable of more robustly detecting mounting actions even with a small amount of training data than a normal end-to-end neural network. Experimental comparisons demonstrated that our detection system was capable of detecting estrus with a precision rate of 80% and a recall rate of 76%.

Unsupervised 3D Human Pose Estimation in Multi-view-multi-pose Video

Cheng Sun, Diego Thomas, Hiroshi Kawasaki

Responsive image

Auto-TLDR; Unsupervised 3D Human Pose Estimation from 2D Videos Using Generative Adversarial Network

Slides Poster Similar

3D human pose estimation from a single 2D video is an extremely difficult task because computing 3D geometry from 2D images is an ill-posed problem. Recent popular solutions adopt fully-supervised learning strategy, which requires to train a deep network on a large-scale ground truth dataset of 3D poses and 2D images. However, such a large-scale dataset with natural images does not exist, which limits the usability of existing methods. While building a complete 3D dataset is tedious and expensive, abundant 2D in-the-wild data is already publicly available. As a consequence, there is a growing interest in the computer vision community to design efficient techniques that use the unsupervised learning strategy, which does not require any ground truth 3D data. Such methods can be trained with only natural 2D images of humans. In this paper we propose an unsupervised method for estimating 3D human pose in videos. The standard approach for unsupervised learning is to use the Generative Adversarial Network (GAN) framework. To improve the performance of 3D human pose estimation in videos, we propose a new GAN network that enforces body consistency over frames in a video. We evaluate the efficiency of our proposed method on a public 3D human body dataset.

Learning Group Activities from Skeletons without Individual Action Labels

Fabio Zappardino, Tiberio Uricchio, Lorenzo Seidenari, Alberto Del Bimbo

Responsive image

Auto-TLDR; Lean Pose Only for Group Activity Recognition

Similar

To understand human behavior we must not just recognize individual actions but model possibly complex group activity and interactions. Hierarchical models obtain the best results in group activity recognition but require fine grained individual action annotations at the actor level. In this paper we show that using only skeletal data we can train a state-of-the art end-to-end system using only group activity labels at the sequence level. Our experiments show that models trained without individual action supervision perform poorly. On the other hand we show that pseudo-labels can be computed from any pre-trained feature extractor with comparable final performance. Finally our carefully designed lean pose only architecture shows highly competitive results versus more complex multimodal approaches even in the self-supervised variant.

Temporal Extension Module for Skeleton-Based Action Recognition

Yuya Obinata, Takuma Yamamoto

Responsive image

Auto-TLDR; Extended Temporal Graph for Action Recognition with Kinetics-Skeleton

Slides Poster Similar

We present a module that extends the temporal graph of a graph convolutional network (GCN) for action recognition with a sequence of skeletons. Existing methods attempt to represent a more appropriate spatial graph on an intra-frame, but disregard optimization of the temporal graph on the inter-frame. Concretely, these methods connect between vertices corresponding only to the same joint on the inter-frame. In this work, we focus on adding connections to neighboring multiple vertices on the inter-frame and extracting additional features based on the extended temporal graph. Our module is a simple yet effective method to extract correlated features of multiple joints in human movement. Moreover, our module aids in further performance improvements, along with other GCN methods that optimize only the spatial graph. We conduct extensive experiments on two large datasets, NTU RGB+D and Kinetics-Skeleton, and demonstrate that our module is effective for several existing models and our final model achieves state-of-the-art performance.

PEAN: 3D Hand Pose Estimation Adversarial Network

Linhui Sun, Yifan Zhang, Jing Lu, Jian Cheng, Hanqing Lu

Responsive image

Auto-TLDR; PEAN: 3D Hand Pose Estimation with Adversarial Learning Framework

Slides Poster Similar

Despite recent emerging research attention, 3D hand pose estimation still suffers from the problems of predicting inaccurate or invalid poses which conflict with physical and kinematic constraints. To address these problems, we propose a novel 3D hand pose estimation adversarial network (PEAN) which can implicitly utilize such constraints to regularize the prediction in an adversarial learning framework. PEAN contains two parts: a 3D hierarchical estimation network (3DHNet) to predict hand pose, which decouples the task into multiple subtasks with a hierarchical structure; a pose discrimination network (PDNet) to judge the reasonableness of the estimated 3D hand pose, which back-propagates the constraints to the estimation network. During the adversarial learning process, PDNet is expected to distinguish the estimated 3D hand pose and the ground truth, while 3DHNet is expected to estimate more valid pose to confuse PDNet. In this way, 3DHNet is capable of generating 3D poses with accurate positions and adaptively adjusting the invalid poses without additional prior knowledge. Experiments show that the proposed 3DHNet does a good job in predicting hand poses, and introducing PDNet to 3DHNet does further improve the accuracy and reasonableness of the predicted results. As a result, the proposed PEAN achieves the state-of-the-art performance on three public hand pose estimation datasets.

Attention-Driven Body Pose Encoding for Human Activity Recognition

Bappaditya Debnath, Swagat Kumar, Marry O'Brien, Ardhendu Behera

Responsive image

Auto-TLDR; Attention-based Body Pose Encoding for Human Activity Recognition

Slides Poster Similar

This article proposes a novel attention-based body pose encoding for human activity recognition. Most of the existing human activity recognition approaches based on 3D pose data often enrich the input data using additional handcrafted representations such as velocity, super normal vectors, pairwise relations, and so on. The enriched data complements the 3D body joint position data and improves the model performance. In this paper, we propose a novel approach that learns enhanced feature representations from a given sequence of 3D body joints. To achieve this, the approach exploits two body pose streams: 1) a spatial stream which encodes the spatial relationship between various body joints at each time point to learn spatial structure involving the spatial distribution of different body joints 2) a temporal stream that learns the temporal variation of individual body joints over the entire sequence duration to present a temporally enhanced representation. Afterwards, these two pose streams are fused with a multi-head attention mechanism. We also capture the contextual information from the RGB video stream using a deep Convolutional Neural Network (CNN) model combined with a multi-head attention and a bidirectional Long Short-Term Memory (LSTM) network. Finally, the RGB video stream is combined with the fused body pose stream to give a novel end-to-end deep model for effective human activity recognition. The proposed model is evaluated on three datasets including the challenging NTU-RGBD dataset and achieves state-of-the-art results.

Median-Shape Representation Learning for Category-Level Object Pose Estimation in Cluttered Environments

Hiroki Tatemichi, Yasutomo Kawanishi, Daisuke Deguchi, Ichiro Ide, Hiroshi Murase, Ayako Amma

Responsive image

Auto-TLDR; An Occlusion-Robust Pose Estimation Method from a Depth Image

Slides Poster Similar

In this paper, we propose an occlusion-robust pose estimation method of an unknown object instance in an object category from a depth image. In a cluttered environment, objects are often occluded mutually. For estimating the pose of an object in such a situation, a method that de-occludes the unobservable area of the object would be effective. However, there are two difficulties; occlusion causes offset between the actual object center and the center of its observable area, and different instances in a category may have different shapes. To cope with these difficulties, we propose a two-stage Encoder-Decoder model to extract features with objects whose centers are aligned to the image center. In the model, we also propose the Median-shape Reconstructor as the second stage to absorb shape variations in a category. By evaluating the method with both a large-scale virtual dataset and a real dataset, we confirmed the proposed method achieves good performance on pose estimation of an occluded object from a depth image.

JUMPS: Joints Upsampling Method for Pose Sequences

Lucas Mourot, Francois Le Clerc, Cédric Thébault, Pierre Hellier

Responsive image

Auto-TLDR; JUMPS: Increasing the Number of Joints in 2D Pose Estimation and Recovering Occluded or Missing Joints

Slides Poster Similar

Human Pose Estimation is a low-level task useful for surveillance, human action recognition, and scene understanding at large. It also offers promising perspectives for the animation of synthetic characters. For all these applications, and especially the latter, estimating the positions of many joints is desirable for improved performance and realism. To this purpose, we propose a novel method called JUMPS for increasing the number of joints in 2D pose estimates and recovering occluded or missing joints. We believe this is the first attempt to address the issue. We build on a deep generative model that combines a GAN and an encoder. The GAN learns the distribution of high-resolution human pose sequences, the encoder maps the input low-resolution sequences to its latent space. Inpainting is obtained by computing the latent representation whose decoding by the GAN generator optimally matches the joints locations at the input. Post-processing a 2D pose sequence using our method provides a richer representation of the character motion. We show experimentally that the localization accuracy of the additional joints is on average on par with the original pose estimates.

Online Object Recognition Using CNN-Based Algorithm on High-Speed Camera Imaging

Shigeaki Namiki, Keiko Yokoyama, Shoji Yachida, Takashi Shibata, Hiroyoshi Miyano, Masatoshi Ishikawa

Responsive image

Auto-TLDR; Real-Time Object Recognition with High-Speed Camera Imaging with Population Data Clearing and Data Ensemble

Slides Poster Similar

High-speed camera imaging (e.g., 1,000 fps) is effective to detect and recognize objects moving at high speeds because temporally dense images obtained by a high-speed camera can usually capture the best moment for object detection and recognition. However, the latest recognition algorithms, with their high complexity, are difficult to utilize in real-time applications involving high-speed cameras because a vast amount of images need to be processed with no latency. To tackle this problem, we propose a novel framework for real-time object recognition with high-speed camera imaging. The proposed framework has the key processes of population data cleansing and data ensemble. Population data cleansing improves the recognition accuracy by quantifying the recognizability and by excluding part of the images prior to the recognition process, while data ensemble improves the robustness of object recognition by merging the class probabilities with multiple images of the same object. Experimental results with a real dataset show that our framework is more effective than existing methods.

On the Robustness of 3D Human Pose Estimation

Zerui Chen, Yan Huang, Liang Wang

Responsive image

Auto-TLDR; Robustness of 3D Human Pose Estimation Methods to Adversarial Attacks

Slides Similar

It is widely shown that Convolutional Neural Networks (CNNs) are vulnerable to adversarial examples on most recognition tasks, such as image classification and segmentation. However, few work studies the more complicated task -- 3D human pose estimation. This task often requires large-scale datasets, specialized network architectures, and it can be solved either from single-view RGB images or from multi-view RGB images. In this paper, we make the first attempt to investigate the robustness of current state-of-the-art 3D human pose estimation methods. To this end, we build four representative baseline models, where most of the current methods can be generally classified as one of them. Furthermore, we design targeted adversarial attacks to detect whether 3D pose estimators are robust to different camera parameters. For different types of methods, we present a comprehensive study of their robustness on the large-scale \emph{Human3.6M} benchmark. Our work shows that different methods vary significantly in their resistance to adversarial attacks. Through extensive experiments, we show that multi-view 3D pose estimators can be more vulnerable to adversarial examples. We believe that our efforts can shed light on future works to design more robust 3D human pose estimators.

Efficient High-Resolution High-Level-Semantic Representation Learning for Human Pose Estimation

Hong Liu, Lisi Guan

Responsive image

Auto-TLDR; Spatial enhanced separated temporal spatial convolutional neural network

Slides Poster Similar

Temporal-spatial information, as the most key issue for human action recognition, has been explored by lots of means, such as 3D convolution network (3DCNN) based or 3DCNN decomposing based approaches. Though the latter can be seen as a trade-off for overcoming the shortage caused by the former for reducing the computation cost and saving parameters, information imbalance of videos between spatial and temporal information is still not been well excavated. To tackle this problem, spatial enhanced separated temporal spatial convolutional neural network (SESTSN) is proposed in this paper, which can easily outperform 3DCNN based and 3DCNN decomposing based methods with fewer parameters. What's more, to further reduce parameter and computation cost, we adopt depth-wise convolution to the proposed SESTSN and propose the channel separated spatial enhanced separated temporal spatial convolutional neural network (CSESTSN). Experiments show that the proposed CSESTSN contains considerably fewer parameters involving much lower computation cost, while it achieves comparable performance to 3D convolution-based methods. Our method outperforms state-of-the-art methods on two challenging datasets, namely NTU RGB+D dataset and Northwestern-UCLA dataset, which verifies the effectiveness of our method.

Activity Recognition Using First-Person-View Cameras Based on Sparse Optical Flows

Peng-Yuan Kao, Yan-Jing Lei, Chia-Hao Chang, Chu-Song Chen, Ming-Sui Lee, Yi-Ping Hung

Responsive image

Auto-TLDR; 3D Convolutional Neural Network for Activity Recognition with FPV Videos

Slides Poster Similar

First-person-view (FPV) cameras are finding wide use in daily life to record activities and sports. In this paper, we propose a succinct and robust 3D convolutional neural network (CNN) architecture accompanied with an ensemble-learning network for activity recognition with FPV videos. The proposed 3D CNN is trained on low-resolution (32x32) sparse optical flows using FPV video datasets consisting of daily activities. According to the experimental results, our network achieves an average accuracy of 90%.

Modeling Long-Term Interactions to Enhance Action Recognition

Alejandro Cartas, Petia Radeva, Mariella Dimiccoli

Responsive image

Auto-TLDR; A Hierarchical Long Short-Term Memory Network for Action Recognition in Egocentric Videos

Slides Poster Similar

In this paper, we propose a new approach to understand actions in egocentric videos that exploit the semantics of object interactions at both frame and temporal levels. At the frame level, we use a region-based approach that takes as input a primary region roughly corresponding to the user hands and a set of secondary regions potentially corresponding to the interacting objects and calculates the action score through a CNN formulation. This information is then fed to a Hierarchical Long Short-Term Memory Network (HLSTM) that captures temporal dependencies between actions within and across shots. Ablation studies thoroughly validate the proposed approach, showing in particular that both levels of the HLSTM architecture contribute to performance improvement. Furthermore, quantitative comparisons show that the proposed approach outperforms the state-of-the-art in terms of action recognition on standard benchmarks, without relying on motion information.

From Human Pose to On-Body Devices for Human-Activity Recognition

Fernando Moya Rueda, Gernot Fink

Responsive image

Auto-TLDR; Transfer Learning from Human Pose Estimation for Human Activity Recognition using Inertial Measurements from On-Body Devices

Slides Poster Similar

Human Activity Recognition (HAR), using inertial measurements from on-body devices, has not seen a great advantage from deep architectures. This is mainly due to the lack of annotated data, diversity of on-body device configurations, the class-unbalance problem, and non-standard human activity definitions. Approaches for improving the performance of such architectures, e.g., transfer learning, are therefore difficult to apply. This paper introduces a method for transfer learning from human-pose estimations as a source for improving HAR using inertial measurements obtained from on-body devices. We propose to fine-tune deep architectures, trained using sequences of human poses from a large dataset and their derivatives, for solving HAR on inertial measurements from on-body devices. Derivatives of human poses will be considered as a sort of synthetic data for HAR. We deploy two different temporal-convolutional architectures as classifiers. An evaluation of the method is carried out on three benchmark datasets improving the classification performance.

Tilting at Windmills: Data Augmentation for Deeppose Estimation Does Not Help with Occlusions

Rafal Pytel, Osman Semih Kayhan, Jan Van Gemert

Responsive image

Auto-TLDR; Targeted Keypoint and Body Part Occlusion Attacks for Human Pose Estimation

Slides Poster Similar

Occlusion degrades the performance of human poseestimation. In this paper, we introduce targeted keypoint andbody part occlusion attacks. The effects of the attacks are system-atically analyzed on the best performing methods. In addition, wepropose occlusion specific data augmentation techniques againstkeypoint and part attacks. Our extensive experiments show thathuman pose estimation methods are not robust to occlusion anddata augmentation does not solve the occlusion problems.

Channel-Wise Dense Connection Graph Convolutional Network for Skeleton-Based Action Recognition

Michael Lao Banteng, Zhiyong Wu

Responsive image

Auto-TLDR; Two-stream channel-wise dense connection GCN for human action recognition

Slides Poster Similar

Skeleton-based action recognition task has drawn much attention for many years. Graph Convolutional Network (GCN) has proved its effectiveness in this task. However, how to improve the model's robustness to different human actions and how to make effective use of features produced by the network are main topics needed to be further explored. Human actions are time series sequence, meaning that temporal information is a key factor to model the representation of data. The ranges of body parts involved in small actions (e.g. raise a glass or shake head) and big actions (e.g. walking or jumping) are diverse. It's crucial for the model to generate and utilize more features that can be adaptive to a wider range of actions. Furthermore, feature channels are specific with the action class, the model needs to weigh their importance and pay attention to more related ones. To address these problems, in this work, we propose a two-stream channel-wise dense connection GCN (2s-CDGCN). Specifically, the skeleton data was extracted and processed into spatial and temporal information for better feature representation. A channel-wise attention module was used to select and emphasize the more useful features generated by the network. Moreover, to ensure maximum information flow, dense connection was introduced to the network structure, which enables the network to reuse the skeleton features and generate more information adaptive and related to different human actions. Our model has shown its ability to improve the accuracy of human action recognition task on two large datasets, NTU-RGB+D and Kinetics. Extensive evaluations were conducted to prove the effectiveness of our model.

PHNet: Parasite-Host Network for Video Crowd Counting

Shiqiao Meng, Jiajie Li, Weiwei Guo, Jinfeng Jiang, Lai Ye

Responsive image

Auto-TLDR; PHNet: A Parasite-Host Network for Video Crowd Counting

Slides Poster Similar

Crowd counting plays an increasingly important role in public security. Recently, many crowd counting methods for a single image have been proposed but few studies have focused on using temporal information from image sequences of videos to improve prediction performance. In the existing methods using videos for crowd estimation, temporal features and spatial features are modeled jointly for the prediction, which makes the model less efficient in extracting spatiotemporal features and difficult to improve the performance of predictions. In order to solve these problems, this paper proposes a Parasite-Host Network(PHNet) which is composed of Parasite branch and Host branch to extract temporal features and spatial features respectively. To specifically extract the transform features in the time domain, we propose a novel architecture termed as “Relational Extractor”(RE) which models the multiplicative interaction features of adjacent frames. In addition, the Host branch extracts the spatial features from a current frame which can be replaced with any model that uses a single image for the prediction. We conducted experiments by using our PHNet on four video crowd counting benchmarks: Venice,UCSD,FDST and CrowdFlow. Experimental results show that PHnet achieves superior performance on these four datasets to the state-of-the-art methods.

Late Fusion of Bayesian and Convolutional Models for Action Recognition

Camille Maurice, Francisco Madrigal, Frederic Lerasle

Responsive image

Auto-TLDR; Fusion of Deep Neural Network and Bayesian-based Approach for Temporal Action Recognition

Slides Poster Similar

The activities we do in our daily-life are generally carried out as a succession of atomic actions, following a logical order. During a video sequence, actions usually follow a logical order. In this paper, we propose a hybrid approach resulting from the fusion of a deep learning neural network with a Bayesian-based approach. The latter models human-object interactions and transition between actions. The key idea is to combine both approaches in the final prediction. We validate our strategy in two public datasets: CAD-120 and Watch-n-Patch. We show that our fusion approach yields performance gains in accuracy of respectively +4\% and +6\% over a baseline approach. Temporal action recognition performances are clearly improved by the fusion, especially when classes are imbalanced.

Learning to Implicitly Represent 3D Human Body from Multi-Scale Features and Multi-View Images

Zhongguo Li, Magnus Oskarsson, Anders Heyden

Responsive image

Auto-TLDR; Reconstruction of 3D human bodies from multi-view images using multi-stage end-to-end neural networks

Slides Poster Similar

Reconstruction of 3D human bodies, from images, faces many challenges, due to it generally being an ill-posed problem. In this paper we present a method to reconstruct 3D human bodies from multi-view images, through learning an implicit function to represent 3D shape, based on multi-scale features extracted by multi-stage end-to-end neural networks. Our model consists of several end-to-end hourglass networks for extracting multi-scale features from multi-view images, and a fully connected network for implicit function classification from these features. Given a 3D point, it is projected to multi-view images and these images are fed into our model to extract multi-scale features. The scales of features extracted by the hourglass networks decrease with the depth of our model, which represents the information from local to global scale. Then, the multi-scale features as well as the depth of the 3D point are combined to a new feature vector and the fully connected network classifies the feature vector, in order to predict if the point lies inside or outside of the 3D mesh. The advantage of our method is that we use both local and global features in the fully connected network and represent the 3D mesh by an implicit function, which is more memory-efficient. Experiments on public datasets demonstrate that our method surpasses previous approaches in terms of the accuracy of 3D reconstruction of human bodies from images.

Real-Time Monocular Depth Estimation with Extremely Light-Weight Neural Network

Mian Jhong Chiu, Wei-Chen Chiu, Hua-Tsung Chen, Jen-Hui Chuang

Responsive image

Auto-TLDR; Real-Time Light-Weight Depth Prediction for Obstacle Avoidance and Environment Sensing with Deep Learning-based CNN

Slides Poster Similar

Obstacle avoidance and environment sensing are crucial applications in autonomous driving and robotics. Among all types of sensors, RGB camera is widely used in these applications as it can offer rich visual contents with relatively low-cost, and using a single image to perform depth estimation has become one of the main focuses in resent research works. However, prior works usually rely on highly complicated computation and power-consuming GPU to achieve such task; therefore, we focus on developing a real-time light-weight system for depth prediction in this paper. Based on the well-known encoder-decoder architecture, we propose a supervised learning-based CNN with detachable decoders that produce depth predictions with different scales. We also formulate a novel log-depth loss function that computes the difference of predicted depth map and ground truth depth map in log space, so as to increase the prediction accuracy for nearby locations. To train our model efficiently, we generate depth map and semantic segmentation with complex teacher models. Via a series of ablation studies and experiments, it is validated that our model can efficiently performs real-time depth prediction with only 0.32M parameters, with the best trained model outperforms previous works on KITTI dataset for various evaluation matrices.

RWF-2000: An Open Large Scale Video Database for Violence Detection

Ming Cheng, Kunjing Cai, Ming Li

Responsive image

Auto-TLDR; Flow Gated Network for Violence Detection in Surveillance Cameras

Slides Poster Similar

In recent years, surveillance cameras are widely deployed in public places, and the general crime rate has been reduced significantly due to these ubiquitous devices. Usually, these cameras provide cues and evidence after crimes were conducted, while they are rarely used to prevent or stop criminal activities in time. It is both time and labor consuming to manually monitor a large amount of video data from surveillance cameras. Therefore, automatically recognizing violent behaviors from video signals becomes essential. In this paper, we summarize several existing video datasets for violence detection and propose a new video dataset with 2,000 videos all captured by surveillance cameras in real-world scenes. Also, we present a new method that utilizes both the merits of 3D-CNNs and optical flow, namely Flow Gated Network. The proposed approach obtains an accuracy of 87.25% on the test set of our proposed RWF-2000 database. The proposed database and source codes of this paper are currently open to access.