Extracting Action Hierarchies from Action Labels and their Use in Deep Action Recognition

Konstadinos Bacharidis, Antonis Argyros

Responsive image

Auto-TLDR; Exploiting the Information Content of Language Label Associations for Human Action Recognition

Slides Poster

Human activity recognition is a fundamental and challenging task in computer vision. Its solution can support multiple and diverse applications in areas including but not limited to smart homes, surveillance, daily living assistance, Human-Robot Collaboration (HRC), etc. In realistic conditions, the complexity of human activities ranges from simple coarse actions, such as siting or standing up, to more complex activities that consist of multiple actions with subtle variations in appearance and motion patterns. A large variety of existing datasets target specific action classes, with some of them being coarse and others being fine-grained. In all of them, a description of the action and its complexity is manifested in the action label sentence. As the action/activity complexity increases, so is the label sentence size and the amount of action-related semantic information contained in this description. In this paper, we propose an approach to exploit the information content of these action labels to formulate a coarse-to-fine action hierarchy based on linguistic label associations, and investigate the potential benefits and drawbacks. Moreover, in a series of quantitative and qualitative experiments, we show that the exploitation of this hierarchical organization of action classes in different levels of granularity improves the learning speed and overall performance of a range of baseline and mid-range deep architectures for human action recognition (HAR).

Similar papers

Single View Learning in Action Recognition

Gaurvi Goyal, Nicoletta Noceti, Francesca Odone

Responsive image

Auto-TLDR; Cross-View Action Recognition Using Domain Adaptation for Knowledge Transfer

Slides Poster Similar

Viewpoint is an essential aspect of how an action is visually perceived, with the motion appearing substantially different for some viewpoint pairs. Data driven action recognition algorithms compensate for this by including a variety of viewpoints in their training data, adding to the cost of data acquisition as well as training. We propose a novel methodology that leverages deeply pretrained features to learn actions from a single viewpoint using domain adaptation for knowledge transfer. We demonstrate the effectiveness of this pipeline on 3 different datasets: IXMAS, MoCA and NTU RGBD+, and compare with both classical and deep learning methods. Our method requires low training data and demonstrates unparalleled cross-view action recognition accuracies for single view learning.

A Grid-Based Representation for Human Action Recognition

Soufiane Lamghari, Guillaume-Alexandre Bilodeau, Nicolas Saunier

Responsive image

Auto-TLDR; GRAR: Grid-based Representation for Action Recognition in Videos

Slides Poster Similar

Human action recognition (HAR) in videos is a fundamental research topic in computer vision. It consists mainly in understanding actions performed by humans based on a sequence of visual observations. In recent years, HAR have witnessed significant progress, especially with the emergence of deep learning models. However, most of existing approaches for action recognition rely on information that is not always relevant for the task, and are limited in the way they fuse temporal information. In this paper, we propose a novel method for human action recognition that encodes efficiently the most discriminative appearance information of an action with explicit attention on representative pose features, into a new compact grid representation. Our GRAR (Grid-based Representation for Action Recognition) method is tested on several benchmark datasets that demonstrate that our model can accurately recognize human actions, despite intra-class appearance variations and occlusion challenges.

Developing Motion Code Embedding for Action Recognition in Videos

Maxat Alibayev, David Andrea Paulius, Yu Sun

Responsive image

Auto-TLDR; Motion Embedding via Motion Codes for Action Recognition

Slides Poster Similar

We propose a motion embedding strategy via the motion codes that is a vectorized representation of motions based on their salient mechanical attributes. We show that our motion codes can provide robust motion representation. We train a deep neural network model that learns to embed demonstration videos into motion codes. We integrate the extracted features from the motion embedding model into the current state-of-the-art action recognition model. The obtained model achieved higher accuracy than the baseline on a verb classification task from egocentric videos in EPIC-KITCHENS dataset.

Self-Supervised Joint Encoding of Motion and Appearance for First Person Action Recognition

Mirco Planamente, Andrea Bottino, Barbara Caputo

Responsive image

Auto-TLDR; A Single Stream Architecture for Egocentric Action Recognition from the First-Person Point of View

Slides Poster Similar

Wearable cameras are becoming more and more popular in several applications, increasing the interest of the research community in developing approaches for recognizing actions from the first-person point of view. An open challenge in egocentric action recognition is that videos lack detailed information about the main actor's pose and thus tend to record only parts of the movement when focusing on manipulation tasks. Thus, the amount of information about the action itself is limited, making crucial the understanding of the manipulated objects and their context. Many previous works addressed this issue with two-stream architectures, where one stream is dedicated to modeling the appearance of objects involved in the action, and another to extracting motion features from optical flow. In this paper, we argue that learning features jointly from these two information channels is beneficial to capture the spatio-temporal correlations between the two better. To this end, we propose a single stream architecture able to do so, thanks to the addition of a self-supervised block that uses a pretext motion prediction task to intertwine motion and appearance knowledge. Experiments on several publicly available databases show the power of our approach.

Vision-Based Multi-Modal Framework for Action Recognition

Djamila Romaissa Beddiar, Mourad Oussalah, Brahim Nini

Responsive image

Auto-TLDR; Multi-modal Framework for Human Activity Recognition Using RGB, Depth and Skeleton Data

Slides Poster Similar

Human activity recognition plays a central role in the development of intelligent systems for video surveillance, public security, health care and home monitoring, where detection and recognition of activities can improve the quality of life and security of humans. Typically, automated, intuitive and real-time systems are required to recognize human activities and identify accurately unusual behaviors in order to prevent dangerous situations. In this work, we explore the combination of three modalities (RGB, depth and skeleton data) to design a robust multi-modal framework for vision-based human activity recognition. Especially, spatial information, body shape/posture and temporal evolution of actions are highlighted using illustrative representations obtained from a combination of dynamic RGB images, dynamic depth images and skeleton data representations. Therefore, each video is represented with three images that summarize the ongoing action. Our framework takes advantage of transfer learning from pre trained models to extract significant features from these newly created images. Next, we fuse extracted features using Canonical Correlation Analysis and train a Long Short-Term Memory network to classify actions from visual descriptive images. Experimental results demonstrated the reliability of our feature-fusion framework that allows us to capture highly significant features and enables us to achieve the state-of-the-art performance on the public UTD-MHAD and NTU RGB+D datasets.

3D Attention Mechanism for Fine-Grained Classification of Table Tennis Strokes Using a Twin Spatio-Temporal Convolutional Neural Networks

Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Péteri, Julien Morlier

Responsive image

Auto-TLDR; Attentional Blocks for Action Recognition in Table Tennis Strokes

Slides Poster Similar

The paper addresses the problem of recognition of actions in video with low inter-class variability such as Table Tennis strokes. Two stream, "twin" convolutional neural networks are used with 3D convolutions both on RGB data and optical flow. Actions are recognized by classification of temporal windows. We introduce 3D attention modules and examine their impact on classification efficiency. In the context of the study of sportsmen performances, a corpus of the particular actions of table tennis strokes is considered. The use of attention blocks in the network speeds up the training step and improves the classification scores up to 5% with our twin model. We visualize the impact on the obtained features and notice correlation between attention and player movements and position. Score comparison of state-of-the-art action classification method and proposed approach with attentional blocks is performed on the corpus. Proposed model with attention blocks outperforms previous model without them and our baseline.

Modeling Long-Term Interactions to Enhance Action Recognition

Alejandro Cartas, Petia Radeva, Mariella Dimiccoli

Responsive image

Auto-TLDR; A Hierarchical Long Short-Term Memory Network for Action Recognition in Egocentric Videos

Slides Poster Similar

In this paper, we propose a new approach to understand actions in egocentric videos that exploit the semantics of object interactions at both frame and temporal levels. At the frame level, we use a region-based approach that takes as input a primary region roughly corresponding to the user hands and a set of secondary regions potentially corresponding to the interacting objects and calculates the action score through a CNN formulation. This information is then fed to a Hierarchical Long Short-Term Memory Network (HLSTM) that captures temporal dependencies between actions within and across shots. Ablation studies thoroughly validate the proposed approach, showing in particular that both levels of the HLSTM architecture contribute to performance improvement. Furthermore, quantitative comparisons show that the proposed approach outperforms the state-of-the-art in terms of action recognition on standard benchmarks, without relying on motion information.

Learning Group Activities from Skeletons without Individual Action Labels

Fabio Zappardino, Tiberio Uricchio, Lorenzo Seidenari, Alberto Del Bimbo

Responsive image

Auto-TLDR; Lean Pose Only for Group Activity Recognition

Similar

To understand human behavior we must not just recognize individual actions but model possibly complex group activity and interactions. Hierarchical models obtain the best results in group activity recognition but require fine grained individual action annotations at the actor level. In this paper we show that using only skeletal data we can train a state-of-the art end-to-end system using only group activity labels at the sequence level. Our experiments show that models trained without individual action supervision perform poorly. On the other hand we show that pseudo-labels can be computed from any pre-trained feature extractor with comparable final performance. Finally our carefully designed lean pose only architecture shows highly competitive results versus more complex multimodal approaches even in the self-supervised variant.

Knowledge Distillation for Action Anticipation Via Label Smoothing

Guglielmo Camporese, Pasquale Coscia, Antonino Furnari, Giovanni Maria Farinella, Lamberto Ballan

Responsive image

Auto-TLDR; A Multi-Modal Framework for Action Anticipation using Long Short-Term Memory Networks

Slides Poster Similar

Human capability to anticipate near future from visual observations and non-verbal cues is essential for developing intelligent systems that need to interact with people. Several research areas, such as human-robot interaction (HRI), assisted living or autonomous driving need to foresee future events to avoid crashes or help people. Egocentric scenarios are classic examples where action anticipation is applied due to their numerous applications. Such challenging task demands to capture and model domain's hidden structure to reduce prediction uncertainty. Since multiple actions may equally occur in the future, we treat action anticipation as a multi-label problem with missing labels extending the concept of label smoothing. This idea resembles the knowledge distillation process since useful information is injected into the model during training. We implement a multi-modal framework based on long short-term memory (LSTM) networks to summarize past observations and make predictions at different time steps. We perform extensive experiments on EPIC-Kitchens and EGTEA Gaze+ datasets including more than 2500 and 100 action classes, respectively. The experiments show that label smoothing systematically improves performance of state-of-the-art models for action anticipation.

Attentive Visual Semantic Specialized Network for Video Captioning

Jesus Perez-Martin, Benjamin Bustos, Jorge Pérez

Responsive image

Auto-TLDR; Adaptive Visual Semantic Specialized Network for Video Captioning

Slides Poster Similar

As an essential high-level task of video understanding topic, automatically describing a video with natural language has recently gained attention as a fundamental challenge in computer vision. Previous models for video captioning have several limitations, such as the existence of gaps in current semantic representations and the inexpressibility of the generated captions. To deal with these limitations, in this paper, we present a new architecture that we callAttentive Visual Semantic Specialized Network(AVSSN), which is an encoder-decoder model based on our Adaptive Attention Gate and Specialized LSTM layers. This architecture can selectively decide when to use visual or semantic information into the text generation process. The adaptive gate makes the decoder to automatically select the relevant information for providing a better temporal state representation than the existing decoders. Besides, the model is capable of learning to improve the expressiveness of generated captions attending to their length, using a sentence-length-related loss function. We evaluate the effectiveness of the proposed approach on the Microsoft Video Description(MSVD) and the Microsoft Research Video-to-Text (MSR-VTT) datasets, achieving state-of-the-art performance with several popular evaluation metrics: BLEU-4, METEOR, CIDEr, and ROUGE_L.

Subspace Clustering for Action Recognition with Covariance Representations and Temporal Pruning

Giancarlo Paoletti, Jacopo Cavazza, Cigdem Beyan, Alessio Del Bue

Responsive image

Auto-TLDR; Unsupervised Learning for Human Action Recognition from Skeletal Data

Slides Similar

This paper tackles the problem of human action recognition, defined as classifying which action is displayed in a trimmed sequence, from skeletal data. Albeit state-of-the-art approaches designed for this application are all supervised, in this paper we pursue a more challenging direction: Solving the problem with unsupervised learning. To this end, we propose a novel subspace clustering method, which exploits covariance matrix to enhance the action’s discriminability and a timestamp pruning approach that allow us to better handle the temporal dimension of the data. Through a broad experimental validation, we show that our computational pipeline surpasses existing unsupervised approaches but also can result in favorable performances as compared to supervised methods.

Learnable Higher-Order Representation for Action Recognition

Jie Shao, Xiangyang Xue

Responsive image

Auto-TLDR; Learningable Higher-Order Operations for Spatiotemporal Dynamics in Video Recognition

Similar

Capturing spatiotemporal dynamics is an essential topic in video recognition. In this paper, we present learnable higher-order operations as a generic family of building blocks for capturing spatiotemporal dynamics from RGB input video space. Similar to higher-order functions, the weights of higher-order operations are themselves derived from the data with learnable parameters. Classical architectures such as residual learning and network-in-network are first-order operations where weights are directly learned from the data. Higher-order operations make it easier to capture context-sensitive patterns, such as motion. Self-attention models are also higher-order operations, but the attention weights are mostly computed from an affine operation or dot product. The learnable higher-order operations can be more generic and flexible. Experimentally, we show that on the task of video recognition, our higher-order models can achieve results on par with or better than the existing state-of-the-art methods on Something-Something (V1 and V2), Kinetics and Charades datasets.

Attention-Driven Body Pose Encoding for Human Activity Recognition

Bappaditya Debnath, Swagat Kumar, Marry O'Brien, Ardhendu Behera

Responsive image

Auto-TLDR; Attention-based Body Pose Encoding for Human Activity Recognition

Slides Poster Similar

This article proposes a novel attention-based body pose encoding for human activity recognition. Most of the existing human activity recognition approaches based on 3D pose data often enrich the input data using additional handcrafted representations such as velocity, super normal vectors, pairwise relations, and so on. The enriched data complements the 3D body joint position data and improves the model performance. In this paper, we propose a novel approach that learns enhanced feature representations from a given sequence of 3D body joints. To achieve this, the approach exploits two body pose streams: 1) a spatial stream which encodes the spatial relationship between various body joints at each time point to learn spatial structure involving the spatial distribution of different body joints 2) a temporal stream that learns the temporal variation of individual body joints over the entire sequence duration to present a temporally enhanced representation. Afterwards, these two pose streams are fused with a multi-head attention mechanism. We also capture the contextual information from the RGB video stream using a deep Convolutional Neural Network (CNN) model combined with a multi-head attention and a bidirectional Long Short-Term Memory (LSTM) network. Finally, the RGB video stream is combined with the fused body pose stream to give a novel end-to-end deep model for effective human activity recognition. The proposed model is evaluated on three datasets including the challenging NTU-RGBD dataset and achieves state-of-the-art results.

Learning Dictionaries of Kinematic Primitives for Action Classification

Alessia Vignolo, Nicoletta Noceti, Alessandra Sciutti, Francesca Odone, Giulio Sandini

Responsive image

Auto-TLDR; Action Understanding using Visual Motion Primitives

Slides Poster Similar

This paper proposes a method based on visual motion primitives to address the problem of action understanding. The approach builds in an unsupervised way a dictionary of kinematic primitives from a set of sub-movements obtained by segmenting the velocity profile of an action on the basis of local minima derived directly from the optical flow. The dictionary is then used to describe each sub-movement as a linear combination of atoms using sparse coding. The descriptive capability of the proposed motion representation is experimentally validated on the MoCA dataset, a collection of synchronized multi-view videos and motion capture data of cooking activities. The results show that the approach, despite its simplicity, has a good performance in action classification, especially when the motion primitives are combined over time. Also, the method is proved to be tolerant to view point changes, and can thus support cross-view action recognition. Overall, the method may be seen as a backbone of a general approach to action understanding, with potential applications in robotics.

Contextual Classification Using Self-Supervised Auxiliary Models for Deep Neural Networks

Sebastian Palacio, Philipp Engler, Jörn Hees, Andreas Dengel

Responsive image

Auto-TLDR; Self-Supervised Autogenous Learning for Deep Neural Networks

Slides Poster Similar

Classification problems solved with deep neural networks (DNNs) typically rely on a closed world paradigm, and optimize over a single objective (e.g., minimization of the cross- entropy loss). This setup dismisses all kinds of supporting signals that can be used to reinforce the existence or absence of particular patterns. The increasing need for models that are interpretable by design makes the inclusion of said contextual signals a crucial necessity. To this end, we introduce the notion of Self-Supervised Autogenous Learning (SSAL). A SSAL objective is realized through one or more additional targets that are derived from the original supervised classification task, following architectural principles found in multi-task learning. SSAL branches impose low-level priors into the optimization process (e.g., grouping). The ability of using SSAL branches during inference, allow models to converge faster, focusing on a richer set of class-relevant features. We equip state-of-the-art DNNs with SSAL objectives and report consistent improvements for all of them on CIFAR100 and Imagenet. We show that SSAL models outperform similar state-of-the-art methods focused on contextual loss functions, auxiliary branches and hierarchical priors.

A Two-Stream Recurrent Network for Skeleton-Based Human Interaction Recognition

Qianhui Men, Edmond S. L. Ho, Shum Hubert P. H., Howard Leung

Responsive image

Auto-TLDR; Two-Stream Recurrent Neural Network for Human-Human Interaction Recognition

Slides Poster Similar

This paper addresses the problem of recognizing human-human interaction from skeletal sequences. Existing methods are mainly designed to classify single human action. Many of them simply stack the movement features of two characters to deal with human interaction, while neglecting the abundant relationships between characters. In this paper, we propose a novel two-stream recurrent neural network by adopting the geometric features from both single actions and interactions to describe the spatial correlations with different discriminative abilities. The first stream is constructed under pairwise joint distance (PJD) in a fully-connected mesh to categorize the interactions with explicit distance patterns. To better distinguish similar interactions, in the second stream, we combine PJD with the spatial features from individual joint positions using graph convolutions to detect the implicit correlations among joints, where the joint connections in the graph are adaptive for flexible correlations. After spatial modeling, each stream is fed to a bi-directional LSTM to encode two-way temporal properties. To take advantage of the diverse discriminative power of the two streams, we come up with a late fusion algorithm to combine their output predictions concerning information entropy. Experimental results show that the proposed framework achieves state-of-the-art performance on 3D and comparable performance on 2D interaction datasets. Moreover, the late fusion results demonstrate the effectiveness of improving the recognition accuracy compared with single streams.

TinyVIRAT: Low-Resolution Video Action Recognition

Ugur Demir, Yogesh Rawat, Mubarak Shah

Responsive image

Auto-TLDR; TinyVIRAT: A Progressive Generative Approach for Action Recognition in Videos

Slides Poster Similar

The existing research in action recognition is mostly focused on high-quality videos where the action is distinctly visible. In real-world surveillance environments, the actions in videos are captured at a wide range of resolutions. Most activities occur at a distance with a small resolution and recognizing such activities is a challenging problem. In this work, we focus on recognizing tiny actions in videos. We introduce a benchmark dataset, TinyVIRAT, which contains natural low-resolution activities. The actions in TinyVIRAT videos have multiple labels and they are extracted from surveillance videos which makes them realistic and more challenging. We propose a novel method for recognizing tiny actions in videos which utilizes a progressive generative approach to improve the quality of low-resolution actions. The proposed method also consists of a weakly trained attention mechanism which helps in focusing on the activity regions in the video. We perform extensive experiments to benchmark the proposed TinyVIRAT dataset and observe that the proposed method significantly improves the action recognition performance over baselines. We also evaluate the proposed approach on synthetically resized action recognition datasets and achieve state-of-the-art results when compared with existing methods. The dataset and code will be publicly available.

Flow-Guided Spatial Attention Tracking for Egocentric Activity Recognition

Tianshan Liu, Kin-Man Lam

Responsive image

Auto-TLDR; flow-guided spatial attention tracking for egocentric activity recognition

Slides Poster Similar

The popularity of wearable cameras has opened up a new dimension for egocentric activity recognition. While some methods introduce attention mechanisms into deep learning networks to capture fine-grained hand-object interactions, they often neglect exploring the spatio-temporal relationships. Generating spatial attention, without adequately exploiting temporal consistency, will result in potentially sub-optimal performance in the video-based task. In this paper, we propose a flow-guided spatial attention tracking (F-SAT) module, which is based on enhancing motion patterns and inter-frame information, to highlight the discriminative features from regions of interest across a video sequence. A new form of input, namely the optical-flow volume, is presented to provide informative cues from moving parts for spatial attention tracking. The proposed F-SAT module is deployed to a two-branch-based deep architecture, which fuses complementary information for egocentric activity recognition. Experimental results on three egocentric activity benchmarks show that the proposed method achieves state-of-the-art performance.

What and How? Jointly Forecasting Human Action and Pose

Yanjun Zhu, Yanxia Zhang, Qiong Liu, Andreas Girgensohn

Responsive image

Auto-TLDR; Forecasting Human Actions and Motion Trajectories with Joint Action Classification and Pose Regression

Slides Poster Similar

Forecasting human actions and motion trajectories addresses the problem of predicting what a person is going to do next and how they will perform it. This is crucial in a wide range of applications such as assisted living and future co-robotic settings. We propose to simultaneously learn actions and action-related human motion dynamics, while existing works perform them independently. In this paper, we present a method to jointly forecast categories of human action and the pose of skeletal joints in the hope that the two tasks can help each other. As a result, our system can predict not only the future actions but also the motion trajectories that will result. To achieve this, we define a task of joint action classification and pose regression. We employ a sequence to sequence encoder-decoder model combined with multi-task learning to forecast future actions and poses progressively before the action happens. Experimental results on two public datasets, IkeaDB and OAD, demonstrate the effectiveness of the proposed method.

Feature Pyramid Hierarchies for Multi-Scale Temporal Action Detection

Jiayu He, Guohui Li, Jun Lei

Responsive image

Auto-TLDR; Temporal Action Detection using Pyramid Hierarchies and Multi-scale Feature Maps

Slides Poster Similar

Temporal action detection is a challenging but promising task in video content analysis. It is in great demand in the field of public safety. The main difficulty of the task is precisely localizing activities in the video especially those short duration activities. And most of the existing methods can not achieve a satisfactory detection result. Our method addresses a key point to improve detection accuracy, which is to use multi-scale feature maps for regression and classification. In this paper, we introduce a novel network based on classification following proposal framework. In our network, a 3D feature pyramid hierarchies is built to enhance the ability of detecting short duration activities. The input RGB/Flow frames are first encoded by a 3D feature pyramid hierarchies, and this subnet produces multi-level feature maps. Then temporal proposal subnet uses these features to pick out proposals which might contain activity segments. Finally a pyramid region of interest (RoI) pooling pipeline and two fully connected layers reuse muti-level feature maps to refine the temporal boundaries of proposals and classify them. We use late feature fusion scheme to combine RGB and Flow information. The network is trained end-to-end and we evaluate it in THUMOS'14 dataset. Our network achieves a good result among typical methods. A further ablation test demonstrate that pyramid hierarchies is effective to improve detecting short duration activity segments.

RMS-Net: Regression and Masking for Soccer Event Spotting

Matteo Tomei, Lorenzo Baraldi, Simone Calderara, Simone Bronzin, Rita Cucchiara

Responsive image

Auto-TLDR; An Action Spotting Network for Soccer Videos

Slides Poster Similar

The recently proposed action spotting task consists in finding the exact timestamp in which an event occurs. This task fits particularly well for soccer videos, where events correspond to salient actions strictly defined by soccer rules (a goal occurs when the ball crosses the goal line). In this paper, we devise a lightweight and modular network for action spotting, which can simultaneously predict the event label and its temporal offset using the same underlying features. We enrich our model with two training strategies: the first one for data balancing and uniform sampling, the second for masking ambiguous frames and keeping the most discriminative visual cues. When tested on the SoccerNet dataset and using standard features, our full proposal exceeds the current state of the art by 3 Average-mAP points. Additionally, it reaches a gain of more than 10 Average-mAP points on the test set when fine-tuned in combination with a strong 2D backbone.

SL-DML: Signal Level Deep Metric Learning for Multimodal One-Shot Action Recognition

Raphael Memmesheimer, Nick Theisen, Dietrich Paulus

Responsive image

Auto-TLDR; One-Shot Action Recognition using Metric Learning

Slides Similar

Recognizing an activity with a single reference sample using metric learning approaches is a promising research field. The majority of few-shot methods focus on object recognition or face-identification. We propose a metric learning approach to reduce the action recognition problem to a nearest neighbor search in embedding space. We encode signals into images and extract features using a deep residual CNN. Using triplet loss, we learn a feature embedding. The resulting encoder transforms features into an embedding space in which closer distances encode similar actions while higher distances encode different actions. Our approach is based on a signal level formulation and remains flexible across a variety of modalities. It further outperforms the baseline on the large scale NTU RGB+D 120 dataset for the One-Shot action recognition protocol by \ntuoneshotimpro%. With just 60% of the training data, our approach still outperforms the baseline approach by \ntuoneshotimproreduced%. With 40% of the training data, our approach performs comparably well as the second follow up. Further, we show that our approach generalizes well in experiments on the UTD-MHAD dataset for inertial, skeleton and fused data and the Simitate dataset for motion capturing data. Furthermore, our inter-joint and inter-sensor experiments suggest good capabilities on previously unseen setups.

Continuous Sign Language Recognition with Iterative Spatiotemporal Fine-Tuning

Kenessary Koishybay, Medet Mukushev, Anara Sandygulova

Responsive image

Auto-TLDR; A Deep Neural Network for Continuous Sign Language Recognition with Iterative Gloss Recognition

Slides Poster Similar

This paper aims to develop a deep neural network for Continuous Sign Language Recognition (CSLR) with iterative Gloss Recognition (GR) fine-tuning. CSLR has been a popular research field in the last years and iterative optimization methods are well established. This paper introduces our proposed architecture involving Spatiotemporal feature-extraction model to segment useful ``gloss-unit" features and BiLSTM with CTC as a sequence model. Spatiotemporal Feature Extractor is used for both image features extraction and sequence length reduction. To this end, we compare different architectures for feature extraction and sequence model. In addition, we iteratively fine-tune feature extractor on gloss-unit video segments with alignments from the end2end model. During the iterative training, we use novel alignment correction technique, which is based on minimum transformations of Levenshtein distance. All the experiments were conducted on the RWTH-PHOENIX-Weather-2014 dataset.

Space-Time Domain Tensor Neural Networks: An Application on Human Pose Classification

Konstantinos Makantasis, Athanasios Voulodimos, Anastasios Doulamis, Nikolaos Doulamis, Nikolaos Bakalos

Responsive image

Auto-TLDR; Tensor-Based Neural Network for Spatiotemporal Pose Classifiaction using Three-Dimensional Skeleton Data

Slides Poster Similar

Recent advances in sensing technologies require the design and development of pattern recognition models capable of processing spatiotemporal data efficiently. In this study, we propose a spatially and temporally aware tensor-based neural network for human pose classifiaction using three-dimensional skeleton data. Our model employs three novel components. First, an input layer capable of constructing highly discriminative spatiotemporal features. Second, a tensor fusion operation that produces compact yet rich representations of the data, and third, a tensor-based neural network that processes data representations in their original tensor form. Our model is end-to-end trainable and characterized by a small number of trainable parameters making it suitable for problems where the annotated data is limited. Experimental evaluation of the proposed model indicates that it can achieve state-of-the-art performance.

Multimodal Side-Tuning for Document Classification

Stefano Zingaro, Giuseppe Lisanti, Maurizio Gabbrielli

Responsive image

Auto-TLDR; Side-tuning for Multimodal Document Classification

Slides Poster Similar

In this paper, we propose to exploit the side-tuning framework for multimodal document classification. Side-tuning is a methodology for network adaptation recently introduced to solve some of the problems related to previous approaches. Thanks to this technique it is actually possible to overcome model rigidity and catastrophic forgetting of transfer learning by fine-tuning. The proposed solution uses off-the-shelf deep learning architectures leveraging the side-tuning framework to combine a base model with a tandem of two side networks. We show that side-tuning can be successfully employed also when different data sources are considered, e.g. text and images in document classification. The experimental results show that this approach pushes further the limit for document classification accuracy with respect to the state of the art.

Text Synopsis Generation for Egocentric Videos

Aidean Sharghi, Niels Lobo, Mubarak Shah

Responsive image

Auto-TLDR; Egocentric Video Summarization Using Multi-task Learning for End-to-End Learning

Slides Similar

Mass utilization of body-worn cameras has led to a huge corpus of available egocentric video. Existing video summarization algorithms can accelerate browsing such videos by selecting (visually) interesting shots from them. Nonetheless, since the system user still has to watch the summary videos, browsing large video databases remain a challenge. Hence, in this work, we propose to generate a textual synopsis, consisting of a few sentences describing the most important events in a long egocentric videos. Users can read the short text to gain insight about the video, and more importantly, efficiently search through the content of a large video database using text queries. Since egocentric videos are long and contain many activities and events, using video-to-text algorithms results in thousands of descriptions, many of which are incorrect. Therefore, we propose a multi-task learning scheme to simultaneously generate descriptions for video segments and summarize the resulting descriptions in an end-to-end fashion. We Input a set of video shots and the network generates a text description for each shot. Next, visual-language content matching unit that is trained with a weakly supervised objective, identifies the correct descriptions. Finally, the last component of our network, called purport network, evaluates the descriptions all together to select the ones containing crucial information. Out of thousands of descriptions generated for the video, a few informative sentences are returned to the user. We validate our framework on the challenging UT Egocentric video dataset, where each video is between 3 to 5 hours long, associated with over 3000 textual descriptions on average. The generated textual summaries, including only 5 percent (or less) of the generated descriptions, are compared to groundtruth summaries in text domain using well-established metrics in natural language processing.

MFI: Multi-Range Feature Interchange for Video Action Recognition

Sikai Bai, Qi Wang, Xuelong Li

Responsive image

Auto-TLDR; Multi-range Feature Interchange Network for Action Recognition in Videos

Slides Poster Similar

Short-range motion features and long-range dependencies are two complementary and vital cues for action recognition in videos, but it remains unclear how to efficiently and effectively extract these two features. In this paper, we propose a novel network to capture these two features in a unified 2D framework. Specifically, we first construct a Short-range Temporal Interchange (STI) block, which contains a Channels-wise Temporal Interchange (CTI) module for encoding short-range motion features. Then a Graph-based Regional Interchange (GRI) module is built to present long-range dependencies using graph convolution. Finally, we replace original bottleneck blocks in the ResNet with STI blocks and insert several GRI modules between STI blocks, to form a Multi-range Feature Interchange (MFI) Network. Practically, extensive experiments are conducted on three action recognition datasets (i.e., Something-Something V1, HMDB51, and UCF101), which demonstrate that the proposed MFI network achieves impressive results with very limited computing cost.

A Novel Attention-Based Aggregation Function to Combine Vision and Language

Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

Responsive image

Auto-TLDR; Fully-Attentive Reduction for Vision and Language

Slides Poster Similar

The joint understanding of vision and language has been recently gaining a lot of attention in both the Computer Vision and Natural Language Processing communities, with the emergence of tasks such as image captioning, image-text matching, and visual question answering. As both images and text can be encoded as sets or sequences of elements - like regions and words - proper reduction functions are needed to transform a set of encoded elements into a single response, like a classification or similarity score. In this paper, we propose a novel fully-attentive reduction method for vision and language. Specifically, our approach computes a set of scores for each element of each modality employing a novel variant of cross-attention, and performs a learnable and cross-modal reduction, which can be used for both classification and ranking. We test our approach on image-text matching and visual question answering, building fair comparisons with other reduction choices, on both COCO and VQA 2.0 datasets. Experimentally, we demonstrate that our approach leads to a performance increase on both tasks. Further, we conduct ablation studies to validate the role of each component of the approach.

Video Representation Fusion Network For Multi-Label Movie Genre Classification

Tianyu Bi, Dmitri Jarnikov, Johan Lukkien

Responsive image

Auto-TLDR; A Video Representation Fusion Network for Movie Genre Classification

Slides Poster Similar

In this paper, we introduce a Video Representation Fusion Network (VRFN) for movie genre classification. Different from the previous works, which use frame-level features for movie genre classification, our approach uses video classification architecture to create video-level features from a group of frames and fuse these features temporally to learn long-term spatiotemporal information for the movie genre classification task. We use a pre-trained I3D model to generate intermediate video representations and connect it with a C3D-LSTM model for feature fusion and movie genre classification. LMTD-9 dataset which contains 4007 trailers multi-labeled with 9 movie genres is used for training and evaluation of the model. The experimental results demonstrate that learning long-term temporal dependencies by fusing video representations improves the performance in movie genre classification. Our best model outperforms the state-of-the-art methods by 3.4% improvement in AUPRC (macro).

Deep Convolutional Embedding for Digitized Painting Clustering

Giovanna Castellano, Gennaro Vessio

Responsive image

Auto-TLDR; A Deep Convolutional Embedding Model for Clustering Artworks

Slides Poster Similar

Clustering artworks is difficult because of several reasons. On one hand, recognizing meaningful patterns in accordance with domain knowledge and visual perception is extremely hard. On the other hand, the application of traditional clustering and feature reduction techniques to the highly dimensional pixel space can be ineffective. To address these issues, we propose to use a deep convolutional embedding model for digitized painting clustering, in which the task of mapping the input raw data to an abstract, latent space is jointly optimized with the task of finding a set of cluster centroids in this latent feature space. Quantitative and qualitative experimental results show the effectiveness of the proposed method. The model is also able to outperform other state-of-the-art deep clustering approaches to the same problem. The proposed method may be beneficial to several art-related tasks, particularly visual link retrieval and historical knowledge discovery in painting datasets.

RWF-2000: An Open Large Scale Video Database for Violence Detection

Ming Cheng, Kunjing Cai, Ming Li

Responsive image

Auto-TLDR; Flow Gated Network for Violence Detection in Surveillance Cameras

Slides Poster Similar

In recent years, surveillance cameras are widely deployed in public places, and the general crime rate has been reduced significantly due to these ubiquitous devices. Usually, these cameras provide cues and evidence after crimes were conducted, while they are rarely used to prevent or stop criminal activities in time. It is both time and labor consuming to manually monitor a large amount of video data from surveillance cameras. Therefore, automatically recognizing violent behaviors from video signals becomes essential. In this paper, we summarize several existing video datasets for violence detection and propose a new video dataset with 2,000 videos all captured by surveillance cameras in real-world scenes. Also, we present a new method that utilizes both the merits of 3D-CNNs and optical flow, namely Flow Gated Network. The proposed approach obtains an accuracy of 87.25% on the test set of our proposed RWF-2000 database. The proposed database and source codes of this paper are currently open to access.

Semantics to Space(S2S): Embedding Semantics into Spatial Space for Zero-Shot Verb-Object Query Inferencing

Sungmin Eum, Heesung Kwon

Responsive image

Auto-TLDR; Semantics-to-Space: Deep Zero-Shot Learning for Verb-Object Interaction with Vectors

Slides Poster Similar

We present a novel deep zero-shot learning (ZSL) model for inferencing human-object-interaction with verb-object (VO) query. While the previous two-stream ZSL approaches only use the semantic/textual information to be fed into the query stream, we seek to incorporate and embed the semantics into the visual representation stream as well. Our approach is powered by Semantics-to-Space (S2S) architecture where semantics derived from the residing objects are embedded into a spatial space of the visual stream. This architecture allows the co-capturing of the semantic attributes of the human and the objects along with their location/size/silhouette information. To validate, we have constructed a new dataset, Verb-Transferability 60 (VT60). VT60 provides 60 different VO pairs with overlapping verbs tailored for testing two-stream ZSL approaches with VO query. Experimental evaluations show that our approach not only outperforms the state-of-the-art, but also shows the capability of consistently improving performance regardless of which ZSL baseline architecture is used.

Uncertainty-Sensitive Activity Recognition: A Reliability Benchmark and the CARING Models

Alina Roitberg, Monica Haurilet, Manuel Martinez, Rainer Stiefelhagen

Responsive image

Auto-TLDR; CARING: Calibrated Action Recognition with Input Guidance

Slides Similar

Beyond assigning the correct class, an activity recognition model should also to be able to determine, how certain it is in its predictions. We present the first study of how well the confidence values of modern action recognition architectures indeed reflect the probability of the correct outcome and propose a learning-based approach for improving it. First, we extend two popular action recognition datasets with a reliability benchmark in form of the expected calibration error and reliability diagrams. Since our evaluation highlights that confidence values of standard action recognition architectures do not represent the uncertainty well, we introduce a new approach which learns to transform the model output into realistic confidence estimates through an additional calibration network. The main idea of our Calibrated Action Recognition with Input Guidance (CARING) model is to learn an optimal scaling parameter depending on the video representation. We compare our model with the native action recognition networks and the temperature scaling approach - a wide spread calibration method utilized in image classification. While temperature scaling alone drastically improves the reliability of the confidence values, our CARING method consistently leads to the best uncertainty estimates in all benchmark settings.

You Ought to Look Around: Precise, Large Span Action Detection

Ge Pan, Zhang Han, Fan Yu, Yonghong Song, Yuanlin Zhang, Han Yuan

Responsive image

Auto-TLDR; YOLA: Local Feature Extraction for Action Localization with Variable receptive field

Slides Similar

For the action localization task, pre-defined action anchors are the cornerstone of mainstream techniques. State-of-the-art models mostly rely on a dense segmenting scheme, where anchors are sampled uniformly over the temporal domain with a predefined set of scales. However, it is not sufficient because action duration varies greatly. Therefore, it is necessary for the anchors or proposals to have a variable receptive field. In this paper, we propose a method called YOLA (You Ought to Look Around) which includes three parts: 1) a robust backbone SPN-I3D for extracting spatio-temporal features. In this part, we employ a stronger backbone I3D with SPN (Segment Pyramid Network) instead of C3D to obtain multi-scale features; 2) a simple but useful feature fusion module named LFE (Local Feature Extraction). Compared with the fully connected layer and global average pooling, our LFE model is more advantageous for network to fit and fuse features. 3) a new feature segment aligning method called TPGC (Two Pathway Graph Convolution), which allows one proposal to leverage semantic features of adjacent proposals to update its content and make sure the proposals have a variable receptive field. YOLA add only a small overhead to the baseline network, and is easy to train in an end-to-end manner, running at a speed of 1097 fps. YOLA achieves a mAP of 58.3%, outperforming all existing models including both RGB-based and two stream on THUMOS'14, and achieves competitive results on ActivityNet 1.3.

From Human Pose to On-Body Devices for Human-Activity Recognition

Fernando Moya Rueda, Gernot Fink

Responsive image

Auto-TLDR; Transfer Learning from Human Pose Estimation for Human Activity Recognition using Inertial Measurements from On-Body Devices

Slides Poster Similar

Human Activity Recognition (HAR), using inertial measurements from on-body devices, has not seen a great advantage from deep architectures. This is mainly due to the lack of annotated data, diversity of on-body device configurations, the class-unbalance problem, and non-standard human activity definitions. Approaches for improving the performance of such architectures, e.g., transfer learning, are therefore difficult to apply. This paper introduces a method for transfer learning from human-pose estimations as a source for improving HAR using inertial measurements obtained from on-body devices. We propose to fine-tune deep architectures, trained using sequences of human poses from a large dataset and their derivatives, for solving HAR on inertial measurements from on-body devices. Derivatives of human poses will be considered as a sort of synthetic data for HAR. We deploy two different temporal-convolutional architectures as classifiers. An evaluation of the method is carried out on three benchmark datasets improving the classification performance.

Enriching Video Captions with Contextual Text

Philipp Rimle, Pelin Dogan, Markus Gross

Responsive image

Auto-TLDR; Contextualized Video Captioning Using Contextual Text

Slides Poster Similar

Understanding video content and generating caption with context is an important and challenging task. Unlike prior methods that typically attempt to generate generic video captions without context, our architecture contextualizes captioning by infusing extracted information from relevant text data. We propose an end-to-end sequence-to-sequence model which generates video captions based on visual input, and mines relevant knowledge such as names and locations from contextual text. In contrast to previous approaches, we do not preprocess the text further, and let the model directly learn to attend over it. Guided by the visual input, the model is able to copy words from the contextual text via a pointer-generator network, allowing to produce more specific video captions. We show competitive performance on the News Video Dataset and, through ablation studies, validate the efficacy of contextual video captioning as well as individual design choices in our model architecture.

Personalized Models in Human Activity Recognition Using Deep Learning

Hamza Amrani, Daniela Micucci, Paolo Napoletano

Responsive image

Auto-TLDR; Incremental Learning for Personalized Human Activity Recognition

Slides Poster Similar

Current sensor-based human activity recognition techniques that rely on a user-independent model struggle to generalize to new users and on to changes that a person may make over time to his or her way of carrying out activities. Incremental learning is a technique that allows to obtain personalized models which may improve the performance on the classifiers thanks to a continuous learning based on user data. Finally, deep learning techniques have been proven to be more effective with respect to traditional ones in the generation of user-independent models. The aim of our work is therefore to put together deep learning techniques with incremental learning in order to obtain personalized models that perform better with respect to user-independent model and personalized model obtained using traditional machine learning techniques. The experimentation was done by comparing the results obtained by a technique in the state of the art with those obtained by two neural networks (ResNet and a simplified CNN) on three datasets. The experimentation showed that neural networks adapt faster to a new user than the baseline.

Pose-Based Body Language Recognition for Emotion and Psychiatric Symptom Interpretation

Zhengyuan Yang, Amanda Kay, Yuncheng Li, Wendi Cross, Jiebo Luo

Responsive image

Auto-TLDR; Body Language Based Emotion Recognition for Psychiatric Symptoms Prediction

Slides Poster Similar

Inspired by the human ability to infer emotions from body language, we propose an automated framework for body language based emotion recognition starting from regular RGB videos. In collaboration with psychologists, we further extend the framework for psychiatric symptom prediction. Because a specific application domain of the proposed framework may only supply a limited amount of data, the framework is designed to work on a small training set and possess a good transferability. The proposed system in the first stage generates sequences of body language predictions based on human poses estimated from input videos. In the second stage, the predicted sequences are fed into a temporal network for emotion interpretation and psychiatric symptom prediction. We first validate the accuracy and transferability of the proposed body language recognition method on several public action recognition datasets. We then evaluate the framework on a proposed URMC dataset, which consists of conversations between a standardized patient and a behavioral health professional, along with expert annotations of body language, emotions, and potential psychiatric symptoms. The proposed framework outperforms other methods on the URMC dataset.

Attention-Oriented Action Recognition for Real-Time Human-Robot Interaction

Ziyang Song, Ziyi Yin, Zejian Yuan, Chong Zhang, Wanchao Chi, Yonggen Ling, Shenghao Zhang

Responsive image

Auto-TLDR; Attention-Oriented Multi-Level Network for Action Recognition in Interaction Scenes

Slides Poster Similar

Despite the notable progress made in action recognition tasks, not much work has been done in action recognition specifically for human-robot interaction. In this paper, we deeply explore the characteristics of the action recognition task in interaction scenes and propose an attention-oriented multi-level network framework to meet the need for real-time interaction. Specifically, a Pre-Attention network is employed to roughly focus on the interactor in the scene at low resolution firstly and then perform fine-grained pose estimation at high resolution. The other compact CNN receives the extracted skeleton sequence as input for action recognition, utilizing attention-like mechanisms to capture local spatial-temporal patterns and global semantic information effectively. To evaluate our approach, we construct a new action dataset specially for the recognition task in interaction scenes. Experimental results on our dataset and high efficiency (112 fps at 640 x 480 RGBD) on the mobile computing platform (Nvidia Jetson AGX Xavier) demonstrate excellent applicability of our method on action recognition in real-time human-robot interaction.

A Detection-Based Approach to Multiview Action Classification in Infants

Carolina Pacheco, Effrosyni Mavroudi, Elena Kokkoni, Herbert Tanner, Rene Vidal

Responsive image

Auto-TLDR; Multiview Action Classification for Infants in a Pediatric Rehabilitation Environment

Slides Similar

Activity recognition in children and infants is important in applications such as safety monitoring, behavior assessment, and child-robot interaction, among others. However, it differs from activity recognition in adults not only because body poses and proportions are different, but also because of the way in which actions are performed. This paper addresses the problem of infant action classification (up to 2 years old) in challenging conditions. The actions are performed in a pediatric rehabilitation environment in which not only infants but also robots and adults are present, with the infant being one of the smallest actors in the scene. We propose a multiview action classification system based on Faster R-CNN and LSTM networks, which fuses information from different views by using learnable fusion coefficients derived from detection confidence scores. The proposed system is view-independent, learns features that are close to view-invariant, and can handle new or missing views at test time. Our approach outperforms the state-of-the-art baseline model for this dataset by 11.4% in terms of average classification accuracy in four classes (crawl, sit, stand and walk). Moreover, experiments in a extended dataset from 6 subjects (8 to 24 months old) show that the proposed fusion strategy outperforms the best post-processing fusion strategy by 2.5% and 6.8% average classification accuracy in Leave One Super-session Out and Leave One Subject Out cross-validation, respectively.

Activity Recognition Using First-Person-View Cameras Based on Sparse Optical Flows

Peng-Yuan Kao, Yan-Jing Lei, Chia-Hao Chang, Chu-Song Chen, Ming-Sui Lee, Yi-Ping Hung

Responsive image

Auto-TLDR; 3D Convolutional Neural Network for Activity Recognition with FPV Videos

Slides Poster Similar

First-person-view (FPV) cameras are finding wide use in daily life to record activities and sports. In this paper, we propose a succinct and robust 3D convolutional neural network (CNN) architecture accompanied with an ensemble-learning network for activity recognition with FPV videos. The proposed 3D CNN is trained on low-resolution (32x32) sparse optical flows using FPV video datasets consisting of daily activities. According to the experimental results, our network achieves an average accuracy of 90%.

Context Visual Information-Based Deliberation Network for Video Captioning

Min Lu, Xueyong Li, Caihua Liu

Responsive image

Auto-TLDR; Context visual information-based deliberation network for video captioning

Slides Poster Similar

Video captioning is to automatically and accurately generate a textual description for a video. The typical methods following the encoder-decoder architecture directly utilized hidden states to predict words. Nevertheless, these methods did not amend the inaccurate hidden states before feeding those states into word prediction. This led to a cascade of errors on generating word by word. In this paper, the context visual information-based deliberation network is proposed, abbreviated as CVI-DelNet. Its key idea is to introduce the deliberator into the encoder-decoder framework. The encoder-decoder firstly generates a raw hidden state sequence. Unlike the existing methods, the raw hidden state is no more directly used for word prediction but is fed into the deliberator to generate the refined hidden state. The words are then predicted according to the refined hidden states and the contextual visual features. Results on two datasets shows that the proposed method significantly outperforms the baselines.

Context Aware Group Activity Recognition

Avijit Dasgupta, C. V. Jawahar, Karteek Alahari

Responsive image

Auto-TLDR; A Two-Stream Architecture for Group Activity Recognition in Multi-Person Videos

Slides Poster Similar

This paper addresses the task of group activity recognition in multi-person videos. Existing approaches decompose this task into feature learning and relational reasoning. Despite showing progress, these methods only rely on appearance features for people and overlook the available contextual information, which can play an important role in group activity understanding. In this work, we focus on the feature learning aspect and propose a two-stream architecture that not only considers person-level appearance features, but also makes use of contextual information present in videos for group activity recognition. In particular, we propose to use two types of contextual information beneficial for two different scenarios: \textit{pose context} and \textit{scene context} that provide crucial cues for group activity understanding. We combine appearance and contextual features to encode each person with an enriched representation. Finally, these combined features are used in relational reasoning for predicting group activities. We evaluate our method on two benchmarks, Volleyball and Collective Activity and show that joint modeling of contextual information with appearance features benefits in group activity understanding.

Late Fusion of Bayesian and Convolutional Models for Action Recognition

Camille Maurice, Francisco Madrigal, Frederic Lerasle

Responsive image

Auto-TLDR; Fusion of Deep Neural Network and Bayesian-based Approach for Temporal Action Recognition

Slides Poster Similar

The activities we do in our daily-life are generally carried out as a succession of atomic actions, following a logical order. During a video sequence, actions usually follow a logical order. In this paper, we propose a hybrid approach resulting from the fusion of a deep learning neural network with a Bayesian-based approach. The latter models human-object interactions and transition between actions. The key idea is to combine both approaches in the final prediction. We validate our strategy in two public datasets: CAD-120 and Watch-n-Patch. We show that our fusion approach yields performance gains in accuracy of respectively +4\% and +6\% over a baseline approach. Temporal action recognition performances are clearly improved by the fusion, especially when classes are imbalanced.

SCA Net: Sparse Channel Attention Module for Action Recognition

Hang Song, Yonghong Song, Yuanlin Zhang

Responsive image

Auto-TLDR; SCA Net: Efficient Group Convolution for Sparse Channel Attention

Slides Poster Similar

Channel attention has shown its great performance recently when it was incorporated into deep convolutional neural networks. However, existing methods usually require extensive computing resources due to their involuted structure, which is hard for 3D CNNs to take full advantage of. In this paper, a lightweight sparse channel attention (SCA) module implemented by efficient group convolution is proposed, which adopts the idea of sparse channel connection and involves much less parameters but brings clear performance gain. Meanwhile, to solve the lack of local channel interaction brought by group convolution, a dominant function called Aggregate-Shuffle-Diverge (ASD) is leveraged to enhance information flow over each group with no additional parameters. We also adjust the existing mainstream 3D CNNs by employing 3D convolution factorization, so as to further reduce the parameters. Our SCA module can be flexibly incorporated into most existing 3D CNNs, all of which can achieve a perfect trade-off between performance and complexity on action recognition task with factorized I3D or 3D ResNet backbone networks. The experimental results also indicate that the resulting network, namely, SCA Net can achieve an outstanding performance on UCF-101 and HMDB-51 datasets.

Semantic Bilinear Pooling for Fine-Grained Recognition

Xinjie Li, Chun Yang, Song-Lu Chen, Chao Zhu, Xu-Cheng Yin

Responsive image

Auto-TLDR; Semantic bilinear pooling for fine-grained recognition with hierarchical label tree

Slides Poster Similar

Naturally, fine-grained recognition, e.g., vehicle identification or bird classification, has specific hierarchical labels, where fine categories are always harder to be classified than coarse categories. However, most of the recent deep learning based methods neglect the semantic structure of fine-grained objects and do not take advantage of the traditional fine-grained recognition techniques (e.g. coarse-to-fine classification). In this paper, we propose a novel framework with a two-branch network (coarse branch and fine branch), i.e., semantic bilinear pooling, for fine-grained recognition with a hierarchical label tree. This framework can adaptively learn the semantic information from the hierarchical levels. Specifically, we design a generalized cross-entropy loss for the training of the proposed framework to fully exploit the semantic priors via considering the relevance between adjacent levels and enlarge the distance between samples of different coarse classes. Furthermore, our method leverages only the fine branch when testing so that it adds no overhead to the testing time. Experimental results show that our proposed method achieves state-of-the-art performance on four public datasets.

Precise Temporal Action Localization with Quantified Temporal Structure of Actions

Chongkai Lu, Ruimin Li, Hong Fu, Bin Fu, Yihao Wang, Wai Lun Lo, Zheru Chi

Responsive image

Auto-TLDR; Action progression networks for temporal action detection

Slides Poster Similar

Existing temporal action detection algorithms cannot distinguish complete and incomplete actions while this property is essential in many applications. To tackle this challenge, we proposed the action progression networks (APN), a novel model that predicts action progression of video frames with continuous numbers. Using the progression sequence of test video, on the top of the APN, a complete action searching algorithm (CAS) was designed to detect complete actions only. With the usage of frame-level fine-grained temporal structure modeling and detecting actions according to their whole temporal context, our framework can locate actions precisely and is good at avoiding incomplete action detection. We evaluated our framework on a new dataset (DFMAD-70) collected by ourselves which contains both complete and incomplete actions. Our framework got good temporal localization results with 95.77% average precision when the IoU threshold is 0.5. On the benchmark THUMOS14, an incomplete-ignostic dataset, our framework still obtain competitive performance. The code is available online at https://github.com/MakeCent/Action-Progression-Network

An Improved Bilinear Pooling Method for Image-Based Action Recognition

Wei Wu, Jiale Yu

Responsive image

Auto-TLDR; An improved bilinear pooling method for image-based action recognition

Slides Poster Similar

Action recognition in still images is a challenging task because of the complexity of human motions and the variance of background in the same action category. And some actions typically occur in fine-grained categories, with little visual differences between these categories. So extracting discriminative features or modeling various semantic parts is essential for image-based action recognition. Many methods apply expensive manual annotations to learn discriminative parts information for action recognition, which may severely discourage potential applications in real life. In recent years, bilinear pooling method has shown its effectiveness for image classification due to its learning distinctive features automatically. Inspired by this model, in this paper, an improved bilinear pooling method is proposed for avoiding the shortcomings of traditional bilinear pooling methods. The previous bilinear pooling approaches contain lots of noisy background or harmful feature information, which limit their application for action recognition. In our method, the attention mechanism is introduced into hierarchical bilinear pooling framework with mask aggregation for action recognition. The proposed model can generate the distinctive and ROI-aware feature information by combining multiple attention mask maps from the channel and spatial-wise attention features. To be more specific, our method makes the network to better pay attention to discriminative region of the vital objects in an image. We verify our model on the two challenging datasets: 1) Stanford 40 action dataset and 2) our action dataset that includes 60 categories. Experimental results demonstrate the effectiveness of our approach, which is superior to the traditional and state-of-the-art methods.