Subspace Clustering for Action Recognition with Covariance Representations and Temporal Pruning

Giancarlo Paoletti, Jacopo Cavazza, Cigdem Beyan, Alessio Del Bue

Responsive image

Auto-TLDR; Unsupervised Learning for Human Action Recognition from Skeletal Data

Slides

This paper tackles the problem of human action recognition, defined as classifying which action is displayed in a trimmed sequence, from skeletal data. Albeit state-of-the-art approaches designed for this application are all supervised, in this paper we pursue a more challenging direction: Solving the problem with unsupervised learning. To this end, we propose a novel subspace clustering method, which exploits covariance matrix to enhance the action’s discriminability and a timestamp pruning approach that allow us to better handle the temporal dimension of the data. Through a broad experimental validation, we show that our computational pipeline surpasses existing unsupervised approaches but also can result in favorable performances as compared to supervised methods.

Similar papers

Vision-Based Multi-Modal Framework for Action Recognition

Djamila Romaissa Beddiar, Mourad Oussalah, Brahim Nini

Responsive image

Auto-TLDR; Multi-modal Framework for Human Activity Recognition Using RGB, Depth and Skeleton Data

Slides Poster Similar

Human activity recognition plays a central role in the development of intelligent systems for video surveillance, public security, health care and home monitoring, where detection and recognition of activities can improve the quality of life and security of humans. Typically, automated, intuitive and real-time systems are required to recognize human activities and identify accurately unusual behaviors in order to prevent dangerous situations. In this work, we explore the combination of three modalities (RGB, depth and skeleton data) to design a robust multi-modal framework for vision-based human activity recognition. Especially, spatial information, body shape/posture and temporal evolution of actions are highlighted using illustrative representations obtained from a combination of dynamic RGB images, dynamic depth images and skeleton data representations. Therefore, each video is represented with three images that summarize the ongoing action. Our framework takes advantage of transfer learning from pre trained models to extract significant features from these newly created images. Next, we fuse extracted features using Canonical Correlation Analysis and train a Long Short-Term Memory network to classify actions from visual descriptive images. Experimental results demonstrated the reliability of our feature-fusion framework that allows us to capture highly significant features and enables us to achieve the state-of-the-art performance on the public UTD-MHAD and NTU RGB+D datasets.

Sparse-Dense Subspace Clustering

Shuai Yang, Wenqi Zhu, Yuesheng Zhu

Responsive image

Auto-TLDR; Sparse-Dense Subspace Clustering with Piecewise Correlation Estimation

Slides Poster Similar

Subspace clustering refers to the problem of clustering high-dimensional data into a union of low-dimensional subspaces. Current subspace clustering approaches are usually based on a two-stage framework. In the first stage, an affinity matrix is generated from data. In the second one, spectral clustering is applied on the affinity matrix. However, the affinity matrix produced by two-stage methods cannot fully reveal the similarity between data points from the same subspace, resulting in inaccurate clustering. Besides, most approaches fail to solve large-scale clustering problems due to poor efficiency. In this paper, we first propose a new scalable sparse method called Iterative Maximum Correlation (IMC) to learn the affinity matrix from data. Then we develop Piecewise Correlation Estimation (PCE) to densify the intra-subspace similarity produced by IMC. Finally we extend our work into a Sparse-Dense Subspace Clustering (SDSC) framework with a dense stage to optimize the affinity matrix for two-stage methods. We show that IMC is efficient for large-scale tasks, and PCE ensures better performance for IMC. We show the universality of our SDSC framework for current two-stage methods as well. Experiments on benchmark data sets demonstrate the effectiveness of our approaches.

Subspace Clustering Via Joint Unsupervised Feature Selection

Wenhua Dong, Xiaojun Wu, Hui Li, Zhenhua Feng, Josef Kittler

Responsive image

Auto-TLDR; Unsupervised Feature Selection for Subspace Clustering

Poster Similar

Any high-dimensional data arising from practical applications usually contains irrelevant features, which may impact on the performance of existing subspace clustering methods. This paper proposes a novel subspace clustering method, which reconstructs the feature matrix by the means of unsupervised feature selection (UFS) to achieve a better dictionary for subspace clustering (SC). Different from most existing clustering methods, the proposed approach uses a reconstructed feature matrix as the dictionary rather than the original data matrix. As the feature matrix reconstructed by representative features is more discriminative and closer to the ground-truth, it results in improved performance. The corresponding non-convex optimization problem is effectively solved using the half-quadratic and augmented Lagrange multiplier methods. Extensive experiments on four real datasets demonstrate the effectiveness of the proposed method.

Learning Recurrent High-Order Statistics for Skeleton-Based Hand Gesture Recognition

Xuan Son Nguyen, Luc Brun, Olivier Lezoray, Sébastien Bougleux

Responsive image

Auto-TLDR; Exploiting High-Order Statistics in Recurrent Neural Networks for Hand Gesture Recog-nition

Slides Similar

High-order statistics have been proven useful inthe framework of Convolutional Neural Networks (CNN) fora variety of computer vision tasks. In this paper, we proposeto exploit high-order statistics in the framework of RecurrentNeural Networks (RNN) for skeleton-based hand gesture recog-nition. Our method is based on the Statistical Recurrent Units(SRU), an un-gated architecture that has been introduced as analternative model for Long-Short Term Memory (LSTM) andGate Recurrent Unit (GRU). The SRU captures sequential infor-mation by generating recurrent statistics that depend on a contextof previously seen data and by computing moving averages atdifferent scales. The integration of high-order statistics in theSRU significantly improves the performance of the original one,resulting in a model that is competitive to state-of-the-art methodson the Dynamic Hand Gesture (DHG) dataset, and outperformsthem on the First-Person Hand Action (FPHA) dataset.

Fast Subspace Clustering Based on the Kronecker Product

Lei Zhou, Xiao Bai, Liang Zhang, Jun Zhou, Edwin Hancock

Responsive image

Auto-TLDR; Subspace Clustering with Kronecker Product for Large Scale Datasets

Slides Poster Similar

Subspace clustering is a useful technique for many computer vision applications in which the intrinsic dimension of high-dimensional data is often smaller than the ambient dimension. Spectral clustering, as one of the main approaches to subspace clustering, often takes on a sparse representation or a low-rank representation to learn a block diagonal self-representation matrix for subspace generation. However, existing methods require solving a large scale convex optimization problem with a large set of data, with computational complexity reaches O(N^3) for N data points. Therefore, the efficiency and scalability of traditional spectral clustering methods can not be guaranteed for large scale datasets. In this paper, we propose a subspace clustering model based on the Kronecker product. Due to the property that the Kronecker product of a block diagonal matrix with any other matrix is still a block diagonal matrix, we can efficiently learn the representation matrix which is formed by the Kronecker product of k smaller matrices. By doing so, our model significantly reduces the computational complexity to O(kN^{3/k}). Furthermore, our model is general in nature, and can be adapted to different regularization based subspace clustering methods. Experimental results on two public datasets show that our model significantly improves the efficiency compared with several state-of-the-art methods. Moreover, we have conducted experiments on synthetic data to verify the scalability of our model for large scale datasets.

A Grid-Based Representation for Human Action Recognition

Soufiane Lamghari, Guillaume-Alexandre Bilodeau, Nicolas Saunier

Responsive image

Auto-TLDR; GRAR: Grid-based Representation for Action Recognition in Videos

Slides Poster Similar

Human action recognition (HAR) in videos is a fundamental research topic in computer vision. It consists mainly in understanding actions performed by humans based on a sequence of visual observations. In recent years, HAR have witnessed significant progress, especially with the emergence of deep learning models. However, most of existing approaches for action recognition rely on information that is not always relevant for the task, and are limited in the way they fuse temporal information. In this paper, we propose a novel method for human action recognition that encodes efficiently the most discriminative appearance information of an action with explicit attention on representative pose features, into a new compact grid representation. Our GRAR (Grid-based Representation for Action Recognition) method is tested on several benchmark datasets that demonstrate that our model can accurately recognize human actions, despite intra-class appearance variations and occlusion challenges.

Vertex Feature Encoding and Hierarchical Temporal Modeling in a Spatio-Temporal Graph Convolutional Network for Action Recognition

Konstantinos Papadopoulos, Enjie Ghorbel, Djamila Aouada, Bjorn Ottersten

Responsive image

Auto-TLDR; Spatio-Temporal Graph Convolutional Network for Skeleton-Based Action Recognition

Slides Poster Similar

Spatio-temporal Graph Convolutional Networks (ST-GCNs) have shown great performance in the context of skeleton-based action recognition. Nevertheless, ST-GCNs use raw skeleton data as vertex features. Such features have low dimensionality and might not be optimal for action discrimination. Moreover, a single layer of temporal convolution is used to model short-term temporal dependencies but can be insufficient for capturing both long-term. In this paper, we extend the Spatio-Temporal Graph Convolutional Network for skeleton-based action recognition by introducing two novel modules, namely, the Graph Vertex Feature Encoder (GVFE) and the Dilated Hierarchical Temporal Convolutional Network (DH-TCN). On the one hand, the GVFE module learns appropriate vertex features for action recognition by encoding raw skeleton data into a new feature space. On the other hand, the DH-TCN module is capable of capturing both short-term and long-term temporal dependencies using a hierarchical dilated convolutional network. Experiments have been conducted on the challenging NTU RGB-D 60, NTU RGB-D 120 and Kinetics datasets. The obtained results show that our method competes with state-of-the-art approaches while using a smaller number of layers and parameters; thus reducing the required training time and memory.

Attention-Driven Body Pose Encoding for Human Activity Recognition

Bappaditya Debnath, Swagat Kumar, Marry O'Brien, Ardhendu Behera

Responsive image

Auto-TLDR; Attention-based Body Pose Encoding for Human Activity Recognition

Slides Poster Similar

This article proposes a novel attention-based body pose encoding for human activity recognition. Most of the existing human activity recognition approaches based on 3D pose data often enrich the input data using additional handcrafted representations such as velocity, super normal vectors, pairwise relations, and so on. The enriched data complements the 3D body joint position data and improves the model performance. In this paper, we propose a novel approach that learns enhanced feature representations from a given sequence of 3D body joints. To achieve this, the approach exploits two body pose streams: 1) a spatial stream which encodes the spatial relationship between various body joints at each time point to learn spatial structure involving the spatial distribution of different body joints 2) a temporal stream that learns the temporal variation of individual body joints over the entire sequence duration to present a temporally enhanced representation. Afterwards, these two pose streams are fused with a multi-head attention mechanism. We also capture the contextual information from the RGB video stream using a deep Convolutional Neural Network (CNN) model combined with a multi-head attention and a bidirectional Long Short-Term Memory (LSTM) network. Finally, the RGB video stream is combined with the fused body pose stream to give a novel end-to-end deep model for effective human activity recognition. The proposed model is evaluated on three datasets including the challenging NTU-RGBD dataset and achieves state-of-the-art results.

JT-MGCN: Joint-Temporal Motion Graph Convolutional Network for Skeleton-Based Action Recognition

Suekyeong Nam, Seungkyu Lee

Responsive image

Auto-TLDR; Joint-temporal Motion Graph Convolutional Networks for Action Recognition

Slides Similar

Recently, action recognition methods using graph convolutional networks (GCN) have shown remarkable performance thanks to its concise but effective representation of human body motion. Prior methods construct human body motion graph building edges between neighbor or distant body joints. On the other hand, human action contains lots of temporal variations showing strong temporal correlations between joint motions. Thus the characterization of an action requires a comprehensive analysis of joint motion correlations on spatial and temporal domains. In this paper, we propose Joint-temporal Motion Graph Convolutional Networks (JT-MGCN) in which joint-temporal edges learn the correlations between different joints at different time. Experimental evaluation on large public data sets such as NTU rgb+d data set and kinetics-skeleton data set show outstanding action recognition performance.

Temporal Attention-Augmented Graph Convolutional Network for Efficient Skeleton-Based Human Action Recognition

Negar Heidari, Alexandros Iosifidis

Responsive image

Auto-TLDR; Temporal Attention Module for Efficient Graph Convolutional Network-based Action Recognition

Slides Poster Similar

Graph convolutional networks (GCNs) have been very successful in modeling non-Euclidean data structures, like sequences of body skeletons forming actions modeled as spatio-temporal graphs. Most GCN-based action recognition methods use deep feed-forward networks with high computational complexity to process all skeletons in an action. This leads to a high number of floating point operations (ranging from 16G to 100G FLOPs) to process a single sample, making their adoption in restricted computation application scenarios infeasible. In this paper, we propose a temporal attention module (TAM) for increasing the efficiency in skeleton-based action recognition by selecting the most informative skeletons of an action at the early layers of the network. We incorporate the TAM in a light-weight GCN topology to further reduce the overall number of computations. Experimental results on two benchmark datasets show that the proposed method outperforms with a large margin the baseline GCN-based method while having 2.9 times less number of computations. Moreover, it performs on par with the state-of-the-art with up to 9.6 times less number of computations.

Channel-Wise Dense Connection Graph Convolutional Network for Skeleton-Based Action Recognition

Michael Lao Banteng, Zhiyong Wu

Responsive image

Auto-TLDR; Two-stream channel-wise dense connection GCN for human action recognition

Slides Poster Similar

Skeleton-based action recognition task has drawn much attention for many years. Graph Convolutional Network (GCN) has proved its effectiveness in this task. However, how to improve the model's robustness to different human actions and how to make effective use of features produced by the network are main topics needed to be further explored. Human actions are time series sequence, meaning that temporal information is a key factor to model the representation of data. The ranges of body parts involved in small actions (e.g. raise a glass or shake head) and big actions (e.g. walking or jumping) are diverse. It's crucial for the model to generate and utilize more features that can be adaptive to a wider range of actions. Furthermore, feature channels are specific with the action class, the model needs to weigh their importance and pay attention to more related ones. To address these problems, in this work, we propose a two-stream channel-wise dense connection GCN (2s-CDGCN). Specifically, the skeleton data was extracted and processed into spatial and temporal information for better feature representation. A channel-wise attention module was used to select and emphasize the more useful features generated by the network. Moreover, to ensure maximum information flow, dense connection was introduced to the network structure, which enables the network to reuse the skeleton features and generate more information adaptive and related to different human actions. Our model has shown its ability to improve the accuracy of human action recognition task on two large datasets, NTU-RGB+D and Kinetics. Extensive evaluations were conducted to prove the effectiveness of our model.

Kernel-based Graph Convolutional Networks

Hichem Sahbi

Responsive image

Auto-TLDR; Spatial Graph Convolutional Networks in Recurrent Kernel Hilbert Space

Slides Poster Similar

Learning graph convolutional networks (GCNs) is an emerging field which aims at generalizing deep learning to arbitrary non-regular domains. Most of the existing GCNs follow a neighborhood aggregation scheme, where the representation of a node is recursively obtained by aggregating its neighboring node representations using averaging or sorting operations. However, these operations are either ill-posed or weak to be discriminant or increase the number of training parameters and thereby the computational complexity and the risk of overfitting. In this paper, we introduce a novel GCN framework that achieves spatial graph convolution in a reproducing kernel Hilbert space. The latter makes it possible to design, via implicit kernel representations, convolutional graph filters in a high dimensional and more discriminating space without increasing the number of training parameters. The particularity of our GCN model also resides in its ability to achieve convolutions without explicitly realigning nodes in the receptive fields of the learned graph filters with those of the input graphs, thereby making convolutions permutation agnostic and well defined. Experiments conducted on the challenging task of skeleton-based action recognition show the superiority of the proposed method against different baselines as well as the related work.

Recurrent Graph Convolutional Networks for Skeleton-Based Action Recognition

Guangming Zhu, Lu Yang, Liang Zhang, Peiyi Shen, Juan Song

Responsive image

Auto-TLDR; Recurrent Graph Convolutional Network for Human Action Recognition

Slides Poster Similar

Human action recognition is one of the challenging and active research fields due to its wide applications. Recently, graph convolutions for skeleton-based action recognition have attracted much attention. Generally, the adjacency matrices of the graph are fixed to the hand-crafted physical connectivity of the human joints, or learned adaptively via deep learining. The hand-crafted or learned adjacency matrices are fixed when processing each frame of an action sequence. However, the interactions of different subsets of joints may play a core role at different phases of an action. Therefore, it is reasonable to evolve the graph topology with time. In this paper, a recurrent graph convolution is proposed, in which the graph topology is evolved via a long short-term memory (LSTM) network. The proposed recurrent graph convolutional network (R-GCN) can recurrently learn the data-dependent graph topologies for different layers, different time steps and different kinds of actions. Experimental results on the NTU RGB+D and Kinetics-Skeleton datasets demonstrate the advantages of the proposed R-GCN.

A Two-Stream Recurrent Network for Skeleton-Based Human Interaction Recognition

Qianhui Men, Edmond S. L. Ho, Shum Hubert P. H., Howard Leung

Responsive image

Auto-TLDR; Two-Stream Recurrent Neural Network for Human-Human Interaction Recognition

Slides Poster Similar

This paper addresses the problem of recognizing human-human interaction from skeletal sequences. Existing methods are mainly designed to classify single human action. Many of them simply stack the movement features of two characters to deal with human interaction, while neglecting the abundant relationships between characters. In this paper, we propose a novel two-stream recurrent neural network by adopting the geometric features from both single actions and interactions to describe the spatial correlations with different discriminative abilities. The first stream is constructed under pairwise joint distance (PJD) in a fully-connected mesh to categorize the interactions with explicit distance patterns. To better distinguish similar interactions, in the second stream, we combine PJD with the spatial features from individual joint positions using graph convolutions to detect the implicit correlations among joints, where the joint connections in the graph are adaptive for flexible correlations. After spatial modeling, each stream is fed to a bi-directional LSTM to encode two-way temporal properties. To take advantage of the diverse discriminative power of the two streams, we come up with a late fusion algorithm to combine their output predictions concerning information entropy. Experimental results show that the proposed framework achieves state-of-the-art performance on 3D and comparable performance on 2D interaction datasets. Moreover, the late fusion results demonstrate the effectiveness of improving the recognition accuracy compared with single streams.

Scalable Direction-Search-Based Approach to Subspace Clustering

Yicong He, George Atia

Responsive image

Auto-TLDR; Fast Direction-Search-Based Subspace Clustering

Slides Similar

Subspace clustering finds a multi-subspace representation that best fits a high-dimensional dataset. The computational and storage complexities of existing algorithms limit their usefulness for large scale data. In this paper, we develop a novel scalable approach to subspace clustering termed Fast Direction-Search-Based Subspace Clustering (Fast DiSC). In sharp contrast to existing scalable solutions which are mostly based on the self-expressiveness property of the data, Fast DiSC rests upon a new representation obtained from projections on computed data-dependent directions. These directions are derived from a convex formulation for optimal direction search to gauge hidden similarity relations. The computational complexity is significantly reduced by performing direction search in partitions of sampled data, followed by a retrieval step to cluster out-of-sample data using projections on the computed directions. A theoretical analysis underscores the ability of the proposed formulation to construct local similarity relations for the different data points. Experiments on both synthetic and real data demonstrate that the proposed algorithm can often outperform the state-of-the-art clustering methods.

Single View Learning in Action Recognition

Gaurvi Goyal, Nicoletta Noceti, Francesca Odone

Responsive image

Auto-TLDR; Cross-View Action Recognition Using Domain Adaptation for Knowledge Transfer

Slides Poster Similar

Viewpoint is an essential aspect of how an action is visually perceived, with the motion appearing substantially different for some viewpoint pairs. Data driven action recognition algorithms compensate for this by including a variety of viewpoints in their training data, adding to the cost of data acquisition as well as training. We propose a novel methodology that leverages deeply pretrained features to learn actions from a single viewpoint using domain adaptation for knowledge transfer. We demonstrate the effectiveness of this pipeline on 3 different datasets: IXMAS, MoCA and NTU RGBD+, and compare with both classical and deep learning methods. Our method requires low training data and demonstrates unparalleled cross-view action recognition accuracies for single view learning.

Learning Connectivity with Graph Convolutional Networks

Hichem Sahbi

Responsive image

Auto-TLDR; Learning Graph Convolutional Networks Using Topological Properties of Graphs

Slides Poster Similar

Learning graph convolutional networks (GCNs) is an emerging field which aims at generalizing convolutional operations to arbitrary non-regular domains. In particular, GCNs operating on spatial domains show superior performances compared to spectral ones, however their success is highly dependent on how the topology of input graphs is defined. In this paper, we introduce a novel framework for graph convolutional networks that learns the topological properties of graphs. The design principle of our method is based on the optimization of a constrained objective function which learns not only the usual convolutional parameters in GCNs but also a transformation basis that conveys the most relevant topological relationships in these graphs. Experiments conducted on the challenging task of skeleton-based action recognition shows the superiority of the proposed method compared to handcrafted graph design as well as the related work.

DeepPear: Deep Pose Estimation and Action Recognition

Wen-Jiin Tsai, You-Ying Jhuang

Responsive image

Auto-TLDR; Human Action Recognition Using RGB Video Using 3D Human Pose and Appearance Features

Slides Poster Similar

Human action recognition has been a popular issue recently because it can be applied in many applications such as intelligent surveillance systems, human-robot interaction, and autonomous vehicle control. Human action recognition using RGB video is a challenging task because the learning of actions is easily affected by the cluttered background. To cope with this problem, the proposed method estimates 3D human poses first which can help remove the cluttered background and focus on the human body. In addition to the human poses, the proposed method also utilizes appearance features nearby the predicted joints to make our action prediction context-aware. Instead of using 3D convolutional neural networks as many action recognition approaches did, the proposed method uses a two-stream architecture that aggregates the results from skeleton-based and appearance-based approaches to do action recognition. Experimental results show that the proposed method achieved state-of-the-art performance on NTU RGB+D which is a largescale dataset for human action recognition.

Space-Time Domain Tensor Neural Networks: An Application on Human Pose Classification

Konstantinos Makantasis, Athanasios Voulodimos, Anastasios Doulamis, Nikolaos Doulamis, Nikolaos Bakalos

Responsive image

Auto-TLDR; Tensor-Based Neural Network for Spatiotemporal Pose Classifiaction using Three-Dimensional Skeleton Data

Slides Poster Similar

Recent advances in sensing technologies require the design and development of pattern recognition models capable of processing spatiotemporal data efficiently. In this study, we propose a spatially and temporally aware tensor-based neural network for human pose classifiaction using three-dimensional skeleton data. Our model employs three novel components. First, an input layer capable of constructing highly discriminative spatiotemporal features. Second, a tensor fusion operation that produces compact yet rich representations of the data, and third, a tensor-based neural network that processes data representations in their original tensor form. Our model is end-to-end trainable and characterized by a small number of trainable parameters making it suitable for problems where the annotated data is limited. Experimental evaluation of the proposed model indicates that it can achieve state-of-the-art performance.

Low Rank Representation on Product Grassmann Manifolds for Multi-viewSubspace Clustering

Jipeng Guo, Yanfeng Sun, Junbin Gao, Yongli Hu, Baocai Yin

Responsive image

Auto-TLDR; Low Rank Representation on Product Grassmann Manifold for Multi-View Data Clustering

Slides Poster Similar

Clustering high dimension multi-view data with complex intrinsic properties and nonlinear manifold structure is a challenging task since these data are always embedded in low dimension manifolds. Inspired by Low Rank Representation (LRR), some researchers extended classic LRR on Grassmann manifold or Product Grassmann manifold to represent data with non-linear metrics. However, most of these methods utilized convex nuclear norm to leverage a low-rank structure, which was over-relaxation of true rank and would lead to the results deviated from the true underlying ones. And, the computational complexity of singular value decomposition of matrix is high for nuclear norm minimization. In this paper, we propose a new low rank model for high-dimension multi-view data clustering on Product Grassmann Manifold with the matrix tri-factorization which is used to control the upper bound of true rank of representation matrix. And, the original problem can be transformed into the nuclear norm minimization with smaller scale matrices. An effective solution and theoretical analysis are also provided. The experimental results show that the proposed method obviously outperforms other state-of-the-art methods on several multi-source human/crowd action video datasets.

A Spectral Clustering on Grassmann Manifold Via Double Low Rank Constraint

Xinglin Piao, Yongli Hu, Junbin Gao, Yanfeng Sun, Xin Yang, Baocai Yin

Responsive image

Auto-TLDR; Double Low Rank Representation for High-Dimensional Data Clustering on Grassmann Manifold

Slides Similar

High-dimension data clustering is a fundamental topic in machine learning and data mining areas. In recent year, researchers have proposed a series of effective methods based on Low Rank Representation (LRR) which could explore low-dimension subspace structure embedded in original data effectively. The traditional LRR methods usually treat original data as samples in Euclidean space. They generally adopt linear metric to measure the distance between two data. However, high-dimension data (such as video clip or imageset) are always considered as non-linear manifold data such as Grassmann manifold. Therefore, the traditional linear Euclidean metric would be no longer suitable for these special data. In addition, traditional LRR clustering method always adopt nuclear norm as low rank constraint which would lead to suboptimal solution and decrease the clustering accuracy. In this paper, we proposed a new low rank method on Grassmann manifold for high-dimension data clustering task. In the proposed method, a double low rank representation approach is proposed by combining the nuclear norm and bilinear representation for better construct the representation matrix. The experimental results on several public datasets show that the proposed method outperforms the state-of-the-art clustering methods.

Learning Group Activities from Skeletons without Individual Action Labels

Fabio Zappardino, Tiberio Uricchio, Lorenzo Seidenari, Alberto Del Bimbo

Responsive image

Auto-TLDR; Lean Pose Only for Group Activity Recognition

Similar

To understand human behavior we must not just recognize individual actions but model possibly complex group activity and interactions. Hierarchical models obtain the best results in group activity recognition but require fine grained individual action annotations at the actor level. In this paper we show that using only skeletal data we can train a state-of-the art end-to-end system using only group activity labels at the sequence level. Our experiments show that models trained without individual action supervision perform poorly. On the other hand we show that pseudo-labels can be computed from any pre-trained feature extractor with comparable final performance. Finally our carefully designed lean pose only architecture shows highly competitive results versus more complex multimodal approaches even in the self-supervised variant.

Self-Supervised Joint Encoding of Motion and Appearance for First Person Action Recognition

Mirco Planamente, Andrea Bottino, Barbara Caputo

Responsive image

Auto-TLDR; A Single Stream Architecture for Egocentric Action Recognition from the First-Person Point of View

Slides Poster Similar

Wearable cameras are becoming more and more popular in several applications, increasing the interest of the research community in developing approaches for recognizing actions from the first-person point of view. An open challenge in egocentric action recognition is that videos lack detailed information about the main actor's pose and thus tend to record only parts of the movement when focusing on manipulation tasks. Thus, the amount of information about the action itself is limited, making crucial the understanding of the manipulated objects and their context. Many previous works addressed this issue with two-stream architectures, where one stream is dedicated to modeling the appearance of objects involved in the action, and another to extracting motion features from optical flow. In this paper, we argue that learning features jointly from these two information channels is beneficial to capture the spatio-temporal correlations between the two better. To this end, we propose a single stream architecture able to do so, thanks to the addition of a self-supervised block that uses a pretext motion prediction task to intertwine motion and appearance knowledge. Experiments on several publicly available databases show the power of our approach.

What and How? Jointly Forecasting Human Action and Pose

Yanjun Zhu, Yanxia Zhang, Qiong Liu, Andreas Girgensohn

Responsive image

Auto-TLDR; Forecasting Human Actions and Motion Trajectories with Joint Action Classification and Pose Regression

Slides Poster Similar

Forecasting human actions and motion trajectories addresses the problem of predicting what a person is going to do next and how they will perform it. This is crucial in a wide range of applications such as assisted living and future co-robotic settings. We propose to simultaneously learn actions and action-related human motion dynamics, while existing works perform them independently. In this paper, we present a method to jointly forecast categories of human action and the pose of skeletal joints in the hope that the two tasks can help each other. As a result, our system can predict not only the future actions but also the motion trajectories that will result. To achieve this, we define a task of joint action classification and pose regression. We employ a sequence to sequence encoder-decoder model combined with multi-task learning to forecast future actions and poses progressively before the action happens. Experimental results on two public datasets, IkeaDB and OAD, demonstrate the effectiveness of the proposed method.

SL-DML: Signal Level Deep Metric Learning for Multimodal One-Shot Action Recognition

Raphael Memmesheimer, Nick Theisen, Dietrich Paulus

Responsive image

Auto-TLDR; One-Shot Action Recognition using Metric Learning

Slides Similar

Recognizing an activity with a single reference sample using metric learning approaches is a promising research field. The majority of few-shot methods focus on object recognition or face-identification. We propose a metric learning approach to reduce the action recognition problem to a nearest neighbor search in embedding space. We encode signals into images and extract features using a deep residual CNN. Using triplet loss, we learn a feature embedding. The resulting encoder transforms features into an embedding space in which closer distances encode similar actions while higher distances encode different actions. Our approach is based on a signal level formulation and remains flexible across a variety of modalities. It further outperforms the baseline on the large scale NTU RGB+D 120 dataset for the One-Shot action recognition protocol by \ntuoneshotimpro%. With just 60% of the training data, our approach still outperforms the baseline approach by \ntuoneshotimproreduced%. With 40% of the training data, our approach performs comparably well as the second follow up. Further, we show that our approach generalizes well in experiments on the UTD-MHAD dataset for inertial, skeleton and fused data and the Simitate dataset for motion capturing data. Furthermore, our inter-joint and inter-sensor experiments suggest good capabilities on previously unseen setups.

A Multi-Task Neural Network for Action Recognition with 3D Key-Points

Rongxiao Tang, Wang Luyang, Zhenhua Guo

Responsive image

Auto-TLDR; Multi-task Neural Network for Action Recognition and 3D Human Pose Estimation

Slides Poster Similar

Action recognition and 3D human pose estimation are the fundamental problems in computer vision and closely related. In this work, we propose a multi-task neural network for action recognition and 3D human pose estimation. The results of the previous methods are still error-prone especially when tested against the images taken in-the-wild, leading error results in action recognition. To solve this problem, we propose a principled approach to generate high quality 3D pose ground truth given any in-the-wild image with a person inside. We achieve this by first devising a novel stereo inspired neural network to directly map any 2D pose to high quality 3D counterpart. Based on the high-quality 3D labels, we carefully design the multi-task framework for action recognition and 3D human pose estimation. The proposed architecture can utilize the shallow, deep features of the images, and the in-the-wild 3D human key-points to guide a more precise result. High quality 3D key-points can fully reflect the morphological features of motions, thus boosting the performance on action recognition. Experiments demonstrate that 3D pose estimation leads to significantly higher performance on action recognition than separated learning. We also evaluate the generalization ability of our method both quantitatively and qualitatively. The proposed architecture performs favorably against the baseline 3D pose estimation methods. In addition, the reported results on Penn Action and NTU datasets demonstrate the effectiveness of our method on the action recognition task.

Learning Dictionaries of Kinematic Primitives for Action Classification

Alessia Vignolo, Nicoletta Noceti, Alessandra Sciutti, Francesca Odone, Giulio Sandini

Responsive image

Auto-TLDR; Action Understanding using Visual Motion Primitives

Slides Poster Similar

This paper proposes a method based on visual motion primitives to address the problem of action understanding. The approach builds in an unsupervised way a dictionary of kinematic primitives from a set of sub-movements obtained by segmenting the velocity profile of an action on the basis of local minima derived directly from the optical flow. The dictionary is then used to describe each sub-movement as a linear combination of atoms using sparse coding. The descriptive capability of the proposed motion representation is experimentally validated on the MoCA dataset, a collection of synchronized multi-view videos and motion capture data of cooking activities. The results show that the approach, despite its simplicity, has a good performance in action classification, especially when the motion primitives are combined over time. Also, the method is proved to be tolerant to view point changes, and can thus support cross-view action recognition. Overall, the method may be seen as a backbone of a general approach to action understanding, with potential applications in robotics.

Pose-Based Body Language Recognition for Emotion and Psychiatric Symptom Interpretation

Zhengyuan Yang, Amanda Kay, Yuncheng Li, Wendi Cross, Jiebo Luo

Responsive image

Auto-TLDR; Body Language Based Emotion Recognition for Psychiatric Symptoms Prediction

Slides Poster Similar

Inspired by the human ability to infer emotions from body language, we propose an automated framework for body language based emotion recognition starting from regular RGB videos. In collaboration with psychologists, we further extend the framework for psychiatric symptom prediction. Because a specific application domain of the proposed framework may only supply a limited amount of data, the framework is designed to work on a small training set and possess a good transferability. The proposed system in the first stage generates sequences of body language predictions based on human poses estimated from input videos. In the second stage, the predicted sequences are fed into a temporal network for emotion interpretation and psychiatric symptom prediction. We first validate the accuracy and transferability of the proposed body language recognition method on several public action recognition datasets. We then evaluate the framework on a proposed URMC dataset, which consists of conversations between a standardized patient and a behavioral health professional, along with expert annotations of body language, emotions, and potential psychiatric symptoms. The proposed framework outperforms other methods on the URMC dataset.

Extracting Action Hierarchies from Action Labels and their Use in Deep Action Recognition

Konstadinos Bacharidis, Antonis Argyros

Responsive image

Auto-TLDR; Exploiting the Information Content of Language Label Associations for Human Action Recognition

Slides Poster Similar

Human activity recognition is a fundamental and challenging task in computer vision. Its solution can support multiple and diverse applications in areas including but not limited to smart homes, surveillance, daily living assistance, Human-Robot Collaboration (HRC), etc. In realistic conditions, the complexity of human activities ranges from simple coarse actions, such as siting or standing up, to more complex activities that consist of multiple actions with subtle variations in appearance and motion patterns. A large variety of existing datasets target specific action classes, with some of them being coarse and others being fine-grained. In all of them, a description of the action and its complexity is manifested in the action label sentence. As the action/activity complexity increases, so is the label sentence size and the amount of action-related semantic information contained in this description. In this paper, we propose an approach to exploit the information content of these action labels to formulate a coarse-to-fine action hierarchy based on linguistic label associations, and investigate the potential benefits and drawbacks. Moreover, in a series of quantitative and qualitative experiments, we show that the exploitation of this hierarchical organization of action classes in different levels of granularity improves the learning speed and overall performance of a range of baseline and mid-range deep architectures for human action recognition (HAR).

N2D: (Not Too) Deep Clustering Via Clustering the Local Manifold of an Autoencoded Embedding

Ryan Mcconville, Raul Santos-Rodriguez, Robert Piechocki, Ian Craddock

Responsive image

Auto-TLDR; Local Manifold Learning for Deep Clustering on Autoencoded Embeddings

Slides Similar

Deep clustering has increasingly been demonstrating superiority over conventional shallow clustering algorithms. Deep clustering algorithms usually combine representation learning with deep neural networks to achieve this performance, typically optimizing a clustering and non-clustering loss. In such cases, an autoencoder is typically connected with a clustering network, and the final clustering is jointly learned by both the autoencoder and clustering network. Instead, we propose to learn an autoencoded embedding and then search this further for the underlying manifold. For simplicity, we then cluster this with a shallow clustering algorithm, rather than a deeper network. We study a number of local and global manifold learning methods on both the raw data and autoencoded embedding, concluding that UMAP in our framework is able to find the best clusterable manifold of the embedding. This suggests that local manifold learning on an autoencoded embedding is effective for discovering higher quality clusters. We quantitatively show across a range of image and time-series datasets that our method has competitive performance against the latest deep clustering algorithms, including out-performing current state-of-the-art on several. We postulate that these results show a promising research direction for deep clustering. The code can be found at https://github.com/rymc/n2d.

Temporal Extension Module for Skeleton-Based Action Recognition

Yuya Obinata, Takuma Yamamoto

Responsive image

Auto-TLDR; Extended Temporal Graph for Action Recognition with Kinetics-Skeleton

Slides Poster Similar

We present a module that extends the temporal graph of a graph convolutional network (GCN) for action recognition with a sequence of skeletons. Existing methods attempt to represent a more appropriate spatial graph on an intra-frame, but disregard optimization of the temporal graph on the inter-frame. Concretely, these methods connect between vertices corresponding only to the same joint on the inter-frame. In this work, we focus on adding connections to neighboring multiple vertices on the inter-frame and extracting additional features based on the extended temporal graph. Our module is a simple yet effective method to extract correlated features of multiple joints in human movement. Moreover, our module aids in further performance improvements, along with other GCN methods that optimize only the spatial graph. We conduct extensive experiments on two large datasets, NTU RGB+D and Kinetics-Skeleton, and demonstrate that our module is effective for several existing models and our final model achieves state-of-the-art performance.

From Human Pose to On-Body Devices for Human-Activity Recognition

Fernando Moya Rueda, Gernot Fink

Responsive image

Auto-TLDR; Transfer Learning from Human Pose Estimation for Human Activity Recognition using Inertial Measurements from On-Body Devices

Slides Poster Similar

Human Activity Recognition (HAR), using inertial measurements from on-body devices, has not seen a great advantage from deep architectures. This is mainly due to the lack of annotated data, diversity of on-body device configurations, the class-unbalance problem, and non-standard human activity definitions. Approaches for improving the performance of such architectures, e.g., transfer learning, are therefore difficult to apply. This paper introduces a method for transfer learning from human-pose estimations as a source for improving HAR using inertial measurements obtained from on-body devices. We propose to fine-tune deep architectures, trained using sequences of human poses from a large dataset and their derivatives, for solving HAR on inertial measurements from on-body devices. Derivatives of human poses will be considered as a sort of synthetic data for HAR. We deploy two different temporal-convolutional architectures as classifiers. An evaluation of the method is carried out on three benchmark datasets improving the classification performance.

Attention-Oriented Action Recognition for Real-Time Human-Robot Interaction

Ziyang Song, Ziyi Yin, Zejian Yuan, Chong Zhang, Wanchao Chi, Yonggen Ling, Shenghao Zhang

Responsive image

Auto-TLDR; Attention-Oriented Multi-Level Network for Action Recognition in Interaction Scenes

Slides Poster Similar

Despite the notable progress made in action recognition tasks, not much work has been done in action recognition specifically for human-robot interaction. In this paper, we deeply explore the characteristics of the action recognition task in interaction scenes and propose an attention-oriented multi-level network framework to meet the need for real-time interaction. Specifically, a Pre-Attention network is employed to roughly focus on the interactor in the scene at low resolution firstly and then perform fine-grained pose estimation at high resolution. The other compact CNN receives the extracted skeleton sequence as input for action recognition, utilizing attention-like mechanisms to capture local spatial-temporal patterns and global semantic information effectively. To evaluate our approach, we construct a new action dataset specially for the recognition task in interaction scenes. Experimental results on our dataset and high efficiency (112 fps at 640 x 480 RGBD) on the mobile computing platform (Nvidia Jetson AGX Xavier) demonstrate excellent applicability of our method on action recognition in real-time human-robot interaction.

Motion Segmentation with Pairwise Matches and Unknown Number of Motions

Federica Arrigoni, Tomas Pajdla, Luca Magri

Responsive image

Auto-TLDR; Motion Segmentation using Multi-Modelfitting andpermutation synchronization

Slides Poster Similar

In this paper we address motion segmentation, that is the problem of clustering points in multiple images according to a number of moving objects. Two-frame correspondences are assumed as input without prior knowledge about trajectories. Our method is based on principles from ''multi-model fitting'' and ''permutation synchronization'', and - differently from previous techniques working under the same assumptions - it can handle an unknown number of motions. The proposed approach is validated on standard datasets, showing that it can correctly estimate the number of motions while maintaining comparable or better accuracy than the state of the art.

Temporal Binary Representation for Event-Based Action Recognition

Simone Undri Innocenti, Federico Becattini, Federico Pernici, Alberto Del Bimbo

Responsive image

Auto-TLDR; Temporal Binary Representation for Gesture Recognition

Slides Poster Similar

In this paper we present an event aggregation strategy to convert the output of an event camera into frames processable by traditional Computer Vision algorithms. The proposed method first generates sequences of intermediate binary representations, which are then losslessly transformed into a compact format by simply applying a binary-to-decimal conversion. This strategy allows us to encode temporal information directly into pixel values, which are then interpreted by deep learning models. We apply our strategy, called Temporal Binary Representation, to the task of Gesture Recognition, obtaining state of the art results on the popular DVS128 Gesture Dataset. To underline the effectiveness of the proposed method compared to existing ones, we also collect an extension of the dataset under more challenging conditions on which to perform experiments.

Modeling Long-Term Interactions to Enhance Action Recognition

Alejandro Cartas, Petia Radeva, Mariella Dimiccoli

Responsive image

Auto-TLDR; A Hierarchical Long Short-Term Memory Network for Action Recognition in Egocentric Videos

Slides Poster Similar

In this paper, we propose a new approach to understand actions in egocentric videos that exploit the semantics of object interactions at both frame and temporal levels. At the frame level, we use a region-based approach that takes as input a primary region roughly corresponding to the user hands and a set of secondary regions potentially corresponding to the interacting objects and calculates the action score through a CNN formulation. This information is then fed to a Hierarchical Long Short-Term Memory Network (HLSTM) that captures temporal dependencies between actions within and across shots. Ablation studies thoroughly validate the proposed approach, showing in particular that both levels of the HLSTM architecture contribute to performance improvement. Furthermore, quantitative comparisons show that the proposed approach outperforms the state-of-the-art in terms of action recognition on standard benchmarks, without relying on motion information.

Late Fusion of Bayesian and Convolutional Models for Action Recognition

Camille Maurice, Francisco Madrigal, Frederic Lerasle

Responsive image

Auto-TLDR; Fusion of Deep Neural Network and Bayesian-based Approach for Temporal Action Recognition

Slides Poster Similar

The activities we do in our daily-life are generally carried out as a succession of atomic actions, following a logical order. During a video sequence, actions usually follow a logical order. In this paper, we propose a hybrid approach resulting from the fusion of a deep learning neural network with a Bayesian-based approach. The latter models human-object interactions and transition between actions. The key idea is to combine both approaches in the final prediction. We validate our strategy in two public datasets: CAD-120 and Watch-n-Patch. We show that our fusion approach yields performance gains in accuracy of respectively +4\% and +6\% over a baseline approach. Temporal action recognition performances are clearly improved by the fusion, especially when classes are imbalanced.

Nonlinear Ranking Loss on Riemannian Potato Embedding

Byung Hyung Kim, Yoonje Suh, Honggu Lee, Sungho Jo

Responsive image

Auto-TLDR; Riemannian Potato for Rank-based Metric Learning

Slides Poster Similar

We propose a rank-based metric learning method by leveraging a concept of the Riemannian Potato for better separating non-linear data. By exploring the geometric properties of Riemannian manifolds, the proposed loss function optimizes the measure of dispersion using the distribution of Riemannian distances between a reference sample and neighbors and builds a ranked list according to the similarities. We show the proposed function can learn a hypersphere for each class, preserving the similarity structure inside it on Riemannian manifold. As a result, compared with Euclidean distance-based metric, our method can further jointly reduce the intra-class distances and enlarge the inter-class distances for learned features, consistently outperforming state-of-the-art methods on three widely used non-linear datasets.

Occlusion-Tolerant and Personalized 3D Human Pose Estimation in RGB Images

Ammar Qammaz, Antonis Argyros

Responsive image

Auto-TLDR; Real-Time 3D Human Pose Estimation in BVH using Inverse Kinematics Solver and Neural Networks

Poster Similar

We introduce a real-time method that estimates the 3D human pose directly in the popular BVH format, given estimations of the 2D body joints in RGB images. Our contributions include: (a) A novel and compact 2D pose representation. (b) A human body orientation classifier and an ensemble of orientation-tuned neural networks that regress the 3D human pose by also allowing for the decomposition of the body to an upper and lower kinematic hierarchy. This permits the recovery of the human pose even in the case of significant occlusions. (c) An efficient Inverse Kinematics solver that refines the neural-network-based solution providing 3D human pose estimations that are consistent with the limb sizes of a target person (if known). All the above yield a 33% accuracy improvement on the H3.6M dataset compared to the baseline MocapNET method while maintaining real-time performance (70 fps in CPU-only execution).

Signature Features with the Visibility Transformation

Yue Wu, Hao Ni, Terry Lyons, Robin Hudson

Responsive image

Auto-TLDR; The Visibility Transformation for Pattern Recognition

Slides Poster Similar

In this paper we put the visibility transformation on a clear theoretical footing and show that this transform is able to embed the effect of the absolute position of the data stream into signature features in a unified and efficient way. The generated feature set is particularly useful in pattern recognition tasks, for its simplifying role in allowing the signature feature set to accommodate nonlinear functions of absolute and relative values.

Transformer Networks for Trajectory Forecasting

Francesco Giuliari, Hasan Irtiza, Marco Cristani, Fabio Galasso

Responsive image

Auto-TLDR; TransformerNetworks for Trajectory Prediction of People Interactions

Slides Poster Similar

Most recent successes on forecasting the people mo-tion are based on LSTM models andallmost recent progress hasbeen achieved by modelling the social interaction among peopleand the people interaction with the scene. We question the useof the LSTM models and propose the novel use of TransformerNetworks for trajectory forecasting. This is a fundamental switchfrom the sequential step-by-step processing of LSTMs to theonly-attention-based memory mechanisms of Transformers. Inparticular, we consider both the original Transformer Network(TF) and the larger Bidirectional Transformer (BERT), state-of-the-art on all natural language processing tasks. Our proposedTransformers predict the trajectories of the individual peoplein the scene. These are “simple” models because each personis modelled separately without any complex human-human norscene interaction terms. In particular, the TF modelwithoutbells and whistlesyields the best score on the largest and mostchallenging trajectory forecasting benchmark of TrajNet [1]. Ad-ditionally, its extension which predicts multiple plausible futuretrajectories performs on par with more engineered techniqueson the 5 datasets of ETH [2]+UCY [3]. Finally, we showthat Transformers may deal with missing observations, as itmay be the case with real sensor data. Code is available atgithub.com/FGiuliari/Trajectory-Transformer

Exploring Severe Occlusion: Multi-Person 3D Pose Estimation with Gated Convolution

Renshu Gu, Gaoang Wang, Jenq-Neng Hwang

Responsive image

Auto-TLDR; 3D Human Pose Estimation for Multi-Human Videos with Occlusion

Slides Similar

3D human pose estimation (HPE) is crucial in human behavior analysis, augmented reality/virtual reality (AR/VR) applications, and self-driving industry. Videos that contain multiple potentially occluded people captured from freely moving monocular cameras are very common in real-world scenarios, while 3D HPE for such scenarios is quite challenging, partially because there is a lack of such data with accurate 3D ground truth labels in existing datasets. In this paper, we propose a temporal regression network with a gated convolution module to transform 2D joints to 3D and recover the missing occluded joints in the meantime. A simple yet effective localization approach is further conducted to transform the normalized pose to the global trajectory. To verify the effectiveness of our approach, we also collect a new moving camera multi-human (MMHuman) dataset that includes multiple people with heavy occlusion captured by moving cameras. The 3D ground truth joints are provided by accurate motion capture (MoCap) system. From the experiments on static-camera based Human3.6M data and our own collected moving-camera based data, we show that our proposed method outperforms most state-of-the-art 2D-to-3D pose estimation methods, especially for the scenarios with heavy occlusions.

2D Deep Video Capsule Network with Temporal Shift for Action Recognition

Théo Voillemin, Hazem Wannous, Jean-Philippe Vandeborre

Responsive image

Auto-TLDR; Temporal Shift Module over Capsule Network for Action Recognition in Continuous Videos

Slides Similar

Action recognition in continuous video streams is a growing field since the past few years. Deep learning techniques and in particular Convolutional Neural Networks (CNNs) achieved good results in this topic. However, intrinsic CNNs limitations begin to cap the results since 2D CNN cannot capture temporal information and 3D CNN are to much resource demanding for real-time applications. Capsule Network, evolution of CNN, already proves its interesting benefits on small and low informational datasets like MNIST but yet its true potential has not emerged. In this paper we tackle the action recognition problem by proposing a new architecture combining Temporal Shift module over deep Capsule Network. Temporal Shift module permits us to insert temporal information over 2D Capsule Network with a zero computational cost to conserve the lightness of 2D capsules and their ability to connect spatial features. Our proposed approach outperforms or brings near state-of-the-art results on color and depth information on public datasets like First Person Hand Action and DHG 14/28 with a number of parameters 10 to 40 times less than existing approaches.

Soft Label and Discriminant Embedding Estimation for Semi-Supervised Classification

Fadi Dornaika, Abdullah Baradaaji, Youssof El Traboulsi

Responsive image

Auto-TLDR; Semi-supervised Semi-Supervised Learning for Linear Feature Extraction and Label Propagation

Slides Poster Similar

In recent times, graph-based semi-supervised learning proved to be a powerful paradigm for processing and mining large datasets. The main advantage relies on the fact that these methods can be useful in propagating a small set of known labels to a large set of unlabeled data. The scarcity of labeled data may affect the performance of the semi-learning. This paper introduces a new semi-supervised framework for simultaneous linear feature extraction and label propagation. The proposed method simultaneously estimates a discriminant transformation and the unknown label by exploiting both labeled and unlabeled data. In addition, the unknowns of the learning model are estimated by integrating two types of graph-based smoothness constraints. The resulting semi-supervised model is expected to learn more discriminative information. Experiments are conducted on six public image datasets. These experimental results show that the performance of the proposed method can be better than that of many state-of-the-art graph-based semi-supervised algorithms.

Learning Sparse Deep Neural Networks Using Efficient Structured Projections on Convex Constraints for Green AI

Michel Barlaud, Frederic Guyard

Responsive image

Auto-TLDR; Constrained Deep Neural Network with Constrained Splitting Projection

Slides Poster Similar

In recent years, deep neural networks (DNN) have been applied to different domains and achieved dramatic performance improvements over state-of-the-art classical methods. These performances of DNNs were however often obtained with networks containing millions of parameters and which training required heavy computational power. In order to cope with this computational issue a huge literature deals with proximal regularization methods which are time consuming.\\ In this paper, we propose instead a constrained approach. We provide the general framework for our new splitting projection gradient method. Our splitting algorithm iterates a gradient step and a projection on convex sets. We study algorithms for different constraints: the classical $\ell_1$ unstructured constraint and structured constraints such as the nuclear norm, the $\ell_{2,1} $ constraint (Group LASSO). We propose a new $\ell_{1,1} $ structured constraint for which we provide a new projection algorithm We demonstrate the effectiveness of our method on three popular datasets (MNIST, Fashion MNIST and CIFAR). Experiments on these datasets show that our splitting projection method with our new $\ell_{1,1} $ structured constraint provides the best reduction of memory and computational power. Experiments show that fully connected linear DNN are more efficient for green AI.

Light3DPose: Real-Time Multi-Person 3D Pose Estimation from Multiple Views

Alessio Elmi, Davide Mazzini, Pietro Tortella

Responsive image

Auto-TLDR; 3D Pose Estimation of Multiple People from a Few calibrated Camera Views using Deep Learning

Slides Poster Similar

We present an approach to perform 3D pose estimation of multiple people from a few calibrated camera views. Our architecture, leveraging the recently proposed unprojection layer, aggregates feature-maps from a 2D pose estimator backbone into a comprehensive representation of the 3D scene. Such intermediate representation is then elaborated by a fully-convolutional volumetric network and a decoding stage to extract 3D skeletons with sub-voxel accuracy. Our method achieves state of the art MPJPE on the CMU Panoptic dataset using a few unseen views and obtains competitive results even with a single input view. We also assess the transfer learning capabilities of the model by testing it against the publicly available Shelf dataset obtaining good performance metrics. The proposed method is inherently efficient: as a pure bottom-up approach, it is computationally independent of the number of people in the scene. Furthermore, even though the computational burden of the 2D part scales linearly with the number of input views, the overall architecture is able to exploit a very lightweight 2D backbone which is orders of magnitude faster than the volumetric counterpart, resulting in fast inference time. The system can run at 6 FPS, processing up to 10 camera views on a single 1080Ti GPU.

Deep Convolutional Embedding for Digitized Painting Clustering

Giovanna Castellano, Gennaro Vessio

Responsive image

Auto-TLDR; A Deep Convolutional Embedding Model for Clustering Artworks

Slides Poster Similar

Clustering artworks is difficult because of several reasons. On one hand, recognizing meaningful patterns in accordance with domain knowledge and visual perception is extremely hard. On the other hand, the application of traditional clustering and feature reduction techniques to the highly dimensional pixel space can be ineffective. To address these issues, we propose to use a deep convolutional embedding model for digitized painting clustering, in which the task of mapping the input raw data to an abstract, latent space is jointly optimized with the task of finding a set of cluster centroids in this latent feature space. Quantitative and qualitative experimental results show the effectiveness of the proposed method. The model is also able to outperform other state-of-the-art deep clustering approaches to the same problem. The proposed method may be beneficial to several art-related tasks, particularly visual link retrieval and historical knowledge discovery in painting datasets.

RMS-Net: Regression and Masking for Soccer Event Spotting

Matteo Tomei, Lorenzo Baraldi, Simone Calderara, Simone Bronzin, Rita Cucchiara

Responsive image

Auto-TLDR; An Action Spotting Network for Soccer Videos

Slides Poster Similar

The recently proposed action spotting task consists in finding the exact timestamp in which an event occurs. This task fits particularly well for soccer videos, where events correspond to salient actions strictly defined by soccer rules (a goal occurs when the ball crosses the goal line). In this paper, we devise a lightweight and modular network for action spotting, which can simultaneously predict the event label and its temporal offset using the same underlying features. We enrich our model with two training strategies: the first one for data balancing and uniform sampling, the second for masking ambiguous frames and keeping the most discriminative visual cues. When tested on the SoccerNet dataset and using standard features, our full proposal exceeds the current state of the art by 3 Average-mAP points. Additionally, it reaches a gain of more than 10 Average-mAP points on the test set when fine-tuned in combination with a strong 2D backbone.