Recognizing American Sign Language Nonmanual Signal Grammar Errors in Continuous Videos

Elahe Vahdani, Longlong Jing, Ying-Li Tian, Matt Huenerfauth

Responsive image

Auto-TLDR; ASL-HW-RGBD: Recognizing Grammatical Errors in Continuous Sign Language

Slides Poster

As part of the development of an educational tool that can help students achieve fluency in American Sign Language (ASL) through independent and interactive practice with immediate feedback, this paper introduces a near real-time system to recognize grammatical errors in continuous signing videos without necessarily identifying the entire sequence of signs. Our system automatically recognizes if a performance of ASL sentences contains grammatical errors made by ASL students. We first recognize the ASL grammatical elements including both manual gestures and nonmanual signals independently from multiple modalities (i.e. hand gestures, facial expressions, and head movements) by 3D-ResNet networks. Then the temporal boundaries of grammatical elements from different modalities are examined to detect ASL grammatical mistakes by using a sliding window-based approach. We have collected a dataset of continuous sign language, ASL-HW-RGBD, covering different aspects of ASL grammars for training and testing. Our system is able to recognize grammatical elements on ASL-HW-RGBD from manual gestures, facial expressions, and head movements and successfully detect 8 ASL grammatical mistakes.

Similar papers

Continuous Sign Language Recognition with Iterative Spatiotemporal Fine-Tuning

Kenessary Koishybay, Medet Mukushev, Anara Sandygulova

Responsive image

Auto-TLDR; A Deep Neural Network for Continuous Sign Language Recognition with Iterative Gloss Recognition

Slides Poster Similar

This paper aims to develop a deep neural network for Continuous Sign Language Recognition (CSLR) with iterative Gloss Recognition (GR) fine-tuning. CSLR has been a popular research field in the last years and iterative optimization methods are well established. This paper introduces our proposed architecture involving Spatiotemporal feature-extraction model to segment useful ``gloss-unit" features and BiLSTM with CTC as a sequence model. Spatiotemporal Feature Extractor is used for both image features extraction and sequence length reduction. To this end, we compare different architectures for feature extraction and sequence model. In addition, we iteratively fine-tune feature extractor on gloss-unit video segments with alignments from the end2end model. During the iterative training, we use novel alignment correction technique, which is based on minimum transformations of Levenshtein distance. All the experiments were conducted on the RWTH-PHOENIX-Weather-2014 dataset.

Context Matters: Self-Attention for Sign Language Recognition

Fares Ben Slimane, Mohamed Bouguessa

Responsive image

Auto-TLDR; Attentional Network for Continuous Sign Language Recognition

Slides Poster Similar

This paper proposes an attentional network for the task of Continuous Sign Language Recognition. The proposed approach exploits co-independent streams of data to model the sign language modalities. These different channels of information can share a complex temporal structure between each other. For that reason, we apply attention to synchronize and help capture entangled dependencies between the different sign language components. Even though Sign Language is multi-channel, handshapes represent the central entities in sign interpretation. Seeing handshapes in their correct context defines the meaning of a sign. Taking that into account, we utilize the attention mechanism to efficiently aggregate the hand features with their appropriate Spatio-temporal context for better sign recognition. We found that by doing so the model is able to identify the essential Sign Language components that revolve around the dominant hand and the face areas. We test our model on the benchmark dataset RWTH-PHOENIX-Weather 2014, yielding competitive results.

Pose-Based Body Language Recognition for Emotion and Psychiatric Symptom Interpretation

Zhengyuan Yang, Amanda Kay, Yuncheng Li, Wendi Cross, Jiebo Luo

Responsive image

Auto-TLDR; Body Language Based Emotion Recognition for Psychiatric Symptoms Prediction

Slides Poster Similar

Inspired by the human ability to infer emotions from body language, we propose an automated framework for body language based emotion recognition starting from regular RGB videos. In collaboration with psychologists, we further extend the framework for psychiatric symptom prediction. Because a specific application domain of the proposed framework may only supply a limited amount of data, the framework is designed to work on a small training set and possess a good transferability. The proposed system in the first stage generates sequences of body language predictions based on human poses estimated from input videos. In the second stage, the predicted sequences are fed into a temporal network for emotion interpretation and psychiatric symptom prediction. We first validate the accuracy and transferability of the proposed body language recognition method on several public action recognition datasets. We then evaluate the framework on a proposed URMC dataset, which consists of conversations between a standardized patient and a behavioral health professional, along with expert annotations of body language, emotions, and potential psychiatric symptoms. The proposed framework outperforms other methods on the URMC dataset.

Applying (3+2+1)D Residual Neural Network with Frame Selection for Hong Kong Sign Language Recognition

Zhenxing Zhou, King-Shan Lui, Vincent W.L. Tam, Edmund Y. Lam

Responsive image

Auto-TLDR; Hong Kong Sign Language Recognition with 3D Residual Neural Network and Resilience Model

Slides Poster Similar

As reported by Hong Kong Government in 2017, there are more than 1.5 million residents suffering from hearing impairment in Hong Kong. Most of them rely on Hong Kong Sign Language for daily communication while there are only 63 registered sign language interpreters in Hong Kong. To address this specific social issue and also facilitate the effective communication between the hearing impaired and other people, this paper introduces a word-level Hong Kong Sign Language(HKSL) dataset which currently includes 45 isolated words and at least 30 sign videos per word performed by different signers(more than 1500 videos in total now and still enlarging). Based on this dataset, this paper systemically compares the performances of various deep learning approaches, including (1) 2D histogram of oriented gradients(HOG) feature/pose estimation/feature extraction with long-short term memory(LSTM) layer; (2) 3D Residual Neural Network(ResNet) (3) (2+1)D Residual Neural Network, in HKSL recognition. Meanwhile, to further improve the accuracy of sign language recognition, this paper proposes a novel method called (3+2+1)D ResNet Model with Frame Selection which adopts blurriness detection with Laplacian kernel to construct highquality video clips and also combines both (2+1)D and 3D ResNet for recognizing the sign language. At the end, the experimental results show that the proposed method outperforms other deep learning approaches and attain an impressive accuracy of 94.6% in our dataset.

IPN Hand: A Video Dataset and Benchmark for Real-Time Continuous Hand Gesture Recognition

Gibran Benitez-Garcia, Jesus Olivares-Mercado, Gabriel Sanchez-Perez, Keiji Yanai

Responsive image

Auto-TLDR; IPN Hand: A Benchmark Dataset for Continuous Hand Gesture Recognition

Slides Poster Similar

Continuous hand gesture recognition (HGR) is an essential part of human-computer interaction with a wide range of applications in the automotive sector, consumer electronics, home automation, and others. In recent years, accurate and efficient deep learning models have been proposed for HGR. However, in the research community, the current publicly available datasets lack real-world elements needed to build responsive and efficient HGR systems. In this paper, we introduce a new benchmark dataset named IPN Hand with sufficient size, variation, and real-world elements able to train and evaluate deep neural networks. This dataset contains more than 4 000 gesture samples and 800 000 RGB frames from 50 distinct subjects. We design 13 different static and dynamic gestures focused on interaction with touchless screens. We especially consider the scenario when continuous gestures are performed without transition states, and when subjects perform natural movements with their hands as non-gesture actions. Gestures were collected from about 30 diverse scenes, with real-world variation in background and illumination. With our dataset, the performance of three 3D-CNN models is evaluated on the tasks of isolated and continuous real-time HGR. Furthermore, we analyze the possibility of increasing the recognition accuracy by adding multiple modalities derived from RGB frames, i.e., optical flow and semantic segmentation, while keeping the real-time performance of the 3D-CNN model. Our empirical study also provides a comparison with the publicly available nvGesture (NVIDIA) dataset. The experimental results show that the state-of-the-art ResNext-101 model decreases about 30% accuracy when using our real-world dataset, demonstrating that the IPN Hand dataset can be used as a benchmark, and may help the community to step forward in the continuous HGR.

Depth Videos for the Classification of Micro-Expressions

Ankith Jain Rakesh Kumar, Bir Bhanu, Christopher Casey, Sierra Cheung, Aaron Seitz

Responsive image

Auto-TLDR; RGB-D Dataset for the Classification of Facial Micro-expressions

Slides Poster Similar

Facial micro-expressions are spontaneous, subtle, involuntary muscle movements occurring briefly on the face. The spotting and recognition of these expressions are difficult due to the subtle behavior, and the time duration of these expressions is about half a second, which makes it difficult for humans to identify them. These micro-expressions have many applications in our daily life, such as in the field of online learning, game playing, lie detection, and therapy sessions. Traditionally, researchers use RGB images/videos to spot and classify these micro-expressions, which pose challenging problems, such as illumination, privacy concerns and pose variation. The use of depth videos solves these issues to some extent, as the depth videos are not susceptible to the variation in illumination. This paper describes the collection of a first RGB-D dataset for the classification of facial micro-expressions into 6 universal expressions: Anger, Happy, Sad, Fear, Disgust, and Surprise. This paper shows the comparison between the RGB and Depth videos for the classification of facial micro-expressions. Further, a comparison of results shows that depth videos alone can be used to classify facial micro-expressions correctly in a decision tree structure by using the traditional and deep learning approaches with good classification accuracy. The dataset will be released to the public in the near future.

Automatic Annotation of Corpora for Emotion Recognition through Facial Expressions Analysis

Alex Mircoli, Claudia Diamantini, Domenico Potena, Emanuele Storti

Responsive image

Auto-TLDR; Automatic annotation of video subtitles on the basis of facial expressions using machine learning algorithms

Slides Poster Similar

The recent diffusion of social networks has made available an unprecedented amount of user-generated content, which may be analyzed in order to determine people's opinions and emotions about a large variety of topics. Research has made many efforts in defining accurate algorithms for analyzing emotions expressed by users in texts; however, their performance often rely on the existence of large annotated datasets, whose current scarcity represents a major issue. The manual creation of such datasets represents a costly and time-consuming activity and hence there is an increasing demand for techniques for the automatic annotation of corpora. In this work we present a methodology for the automatic annotation of video subtitles on the basis of the analysis of facial expressions of people in videos, with the goal of creating annotated corpora that may be used to train emotion recognition algorithms. Facial expressions are analyzed through machine learning algorithms, on the basis of a set of manually-engineered facial features that are extracted from video frames. The soundness of the proposed methodology has been evaluated through an extensive experimentation aimed at determining the performance on real datasets of each methodological step.

Concept Embedding through Canonical Forms: A Case Study on Zero-Shot ASL Recognition

Azamat Kamzin, Apurupa Amperyani, Prasanth Sukhapalli, Ayan Banerjee, Sandeep Gupta

Responsive image

Auto-TLDR; A canonical form of gestures in American Sign Language

Slides Poster Similar

In the recognition problem, a canonical form that expresses the spatio-temporal relation of concepts for a given class can potentially increase accuracy. Concepts are defined as attributes that can be recognized using a soft matching paradigm. We consider the specific case study of American Sign Language (ASL) to show that canonical forms of classes can be used to recognize unseen gestures. There are several advantages of a canonical form of gestures including translation between gestures, gesture-based searching, and automated transcription of gestures into any spoken language. We applied our technique to two independently collected datasets: a) IMPACT Lab dataset: 23 ASL gestures each executed three times from 130 first time ASL learners as training data and b) ASLTEXT dataset: 190 gestures each executed six times on an average. Our technique was able to recognize 19 arbitrarily chosen previously unseen gestures in the IMPACT dataset from seven individuals who are not a part of 130 and 34 unseen gestures from the ASLTEXT dataset without any retraining. Our normalized accuracy on ASLTEXT dataset is 66 % which is 13.6 % higher than state-of-art technique.

Exploiting the Logits: Joint Sign Language Recognition and Spell-Correction

Christina Runkel, Stefan Dorenkamp, Hartmut Bauermeister, Michael Möller

Responsive image

Auto-TLDR; A Convolutional Neural Network for Spell-correction in Sign Language Videos

Slides Poster Similar

Machine learning techniques have excelled in the automatic semantic analysis of images, reaching human-level performances on challenging bechmarks. Yet, the semantic analysis of videos remains challenging due to the significantly higher dimensionality of the input data, respectively, the significantly higher need for annotated training examples. By studying the automatic recognition of German sign language videos, we demonstrate that on the relatively scarce training data of 2.800 videos, modern deep learning architectures for video analysis (such as ResNeXt) along with transfer learning on large gesture recognition tasks, can achieve about 75% character accuracy. Considering that this leaves us with a probability of under 25% that a five letter word is spelled correctly, spell-correction systems are crucial for producing readable outputs. The contribution of this paper is to propose a convolutional neural network for spell-correction that expects the softmax outputs of the character recognition network (instead of a misspelled word) as an input. We demonstrate that purely learning on softmax inputs in combination with scarce training data yields overfitting as the network learns the inputs by heart. In contrast, training the network on several variants of the logits of the classification output i.e. scaling by a constant factor, adding of random noise, mixing of softmax and hardmax inputs or purely training on hardmax inputs, leads to better generalization while benefitting from the significant information hidden in these outputs (that have 98% top-5 accuracy), yielding a readable text despite the comparably low character accuracy.

Siamese-Structure Deep Neural Network Recognizing Changes in Facial Expression According to the Degree of Smiling

Kazuaki Kondo, Taichi Nakamura, Yuichi Nakamura, Shin'Ichi Satoh

Responsive image

Auto-TLDR; A Siamese-Structure Deep Neural Network for Happiness Recognition

Slides Poster Similar

A smile is a representative expression of happiness or high quality-of-life; however, automatic recognition of a smile according to happiness remains a challenging task. Because expressions of happiness are strongly dependent upon physical condition and occurrence of other emotions, and similar facial expression often occur under different emotions, we consider that there is no absolute visual pattern of a smile corresponding to happiness. Therefore, in this study, we assumed that a ``smile with happiness'' is observed as the temporal ascent in the degree of smiling and attempted to recognize this by capturing changes in facial expression within temporally sequential images. As an implementation of this scheme, we proposed a Siamese-structure deep neural network to compare facial expressions in two input images and estimate the existence of smile ascension or descension. For primal analysis of the proposed network, we developed a unique smiling dataset containing image pairs with various changes in smiling degree, including slight changes. The results demonstrated that the proposed method achieved nearly perfect recognition with >0.95 accuracy when recognizing changes in the degree of smiling that humans certainly recognize. Attention regions that contributed to the predicted labels were concentrated on the mouth, cheeks, and tail of the eyes, which indicates a reasonable function for recognizing changes in smiling degree was constructed by the proposed method.

What and How? Jointly Forecasting Human Action and Pose

Yanjun Zhu, Yanxia Zhang, Qiong Liu, Andreas Girgensohn

Responsive image

Auto-TLDR; Forecasting Human Actions and Motion Trajectories with Joint Action Classification and Pose Regression

Slides Poster Similar

Forecasting human actions and motion trajectories addresses the problem of predicting what a person is going to do next and how they will perform it. This is crucial in a wide range of applications such as assisted living and future co-robotic settings. We propose to simultaneously learn actions and action-related human motion dynamics, while existing works perform them independently. In this paper, we present a method to jointly forecast categories of human action and the pose of skeletal joints in the hope that the two tasks can help each other. As a result, our system can predict not only the future actions but also the motion trajectories that will result. To achieve this, we define a task of joint action classification and pose regression. We employ a sequence to sequence encoder-decoder model combined with multi-task learning to forecast future actions and poses progressively before the action happens. Experimental results on two public datasets, IkeaDB and OAD, demonstrate the effectiveness of the proposed method.

Modeling Long-Term Interactions to Enhance Action Recognition

Alejandro Cartas, Petia Radeva, Mariella Dimiccoli

Responsive image

Auto-TLDR; A Hierarchical Long Short-Term Memory Network for Action Recognition in Egocentric Videos

Slides Poster Similar

In this paper, we propose a new approach to understand actions in egocentric videos that exploit the semantics of object interactions at both frame and temporal levels. At the frame level, we use a region-based approach that takes as input a primary region roughly corresponding to the user hands and a set of secondary regions potentially corresponding to the interacting objects and calculates the action score through a CNN formulation. This information is then fed to a Hierarchical Long Short-Term Memory Network (HLSTM) that captures temporal dependencies between actions within and across shots. Ablation studies thoroughly validate the proposed approach, showing in particular that both levels of the HLSTM architecture contribute to performance improvement. Furthermore, quantitative comparisons show that the proposed approach outperforms the state-of-the-art in terms of action recognition on standard benchmarks, without relying on motion information.

Late Fusion of Bayesian and Convolutional Models for Action Recognition

Camille Maurice, Francisco Madrigal, Frederic Lerasle

Responsive image

Auto-TLDR; Fusion of Deep Neural Network and Bayesian-based Approach for Temporal Action Recognition

Slides Poster Similar

The activities we do in our daily-life are generally carried out as a succession of atomic actions, following a logical order. During a video sequence, actions usually follow a logical order. In this paper, we propose a hybrid approach resulting from the fusion of a deep learning neural network with a Bayesian-based approach. The latter models human-object interactions and transition between actions. The key idea is to combine both approaches in the final prediction. We validate our strategy in two public datasets: CAD-120 and Watch-n-Patch. We show that our fusion approach yields performance gains in accuracy of respectively +4\% and +6\% over a baseline approach. Temporal action recognition performances are clearly improved by the fusion, especially when classes are imbalanced.

Facial Expression Recognition Using Residual Masking Network

Luan Pham, Vu Huynh, Tuan Anh Tran

Responsive image

Auto-TLDR; Deep Residual Masking for Automatic Facial Expression Recognition

Slides Poster Similar

Automatic facial expression recognition (FER) has gained much attention due to its applications in human-computer interaction. Among the approaches to improve FER tasks, this paper focuses on deep architecture with the attention mechanism. We propose a novel Masking idea to boost the performance of CNN in facial expression task. It uses a segmentation network to refine feature maps, enabling the network to focus on relevant information to make correct decisions. In experiments, we combine the ubiquitous Deep Residual Network and Unet-like architecture to produce a Residual Masking Network. The proposed method holds state-of-the-art (SOTA) accuracy on the well-known FER2013 and private VEMO datasets. Our works are available on Github.

Interpretable Emotion Classification Using Temporal Convolutional Models

Manasi Bharat Gund, Abhiram Ravi Bharadwaj, Ifeoma Nwogu

Responsive image

Auto-TLDR; Understanding the Dynamics of Facial Emotion Expression with Spatiotemporal Representations

Slides Poster Similar

As with many problems solved by deep neural networks, existing solutions rarely explain, precisely, the important factors responsible for the predictions made by the model. This work looks to investigate how different spatial regions and landmark points change in position over time, to better explain the underlying factors responsible for various facial emotion expressions. By pinpointing the specific regions or points responsible for the classification of a particular facial expression, we gain better insight into the dynamics of the face when displaying that emotion. To accomplish this, we examine two spatiotemporal representations of moving faces, while expressing different emotions. The representations are then presented to a convolutional neural network for emotion classification. Class activation maps are used in highlighting the regions of interest and the results are qualitatively compared with the well known facial action units, using the facial action coding system. The model was originally trained and tested on the CK+ dataset for emotion classification, and then generalized to the SAMM dataset. In so doing, we successfully present an interpretable technique for understanding the dynamics that occur during convolutional-based prediction tasks on sequences of face data.

Two-Stream Temporal Convolutional Network for Dynamic Facial Attractiveness Prediction

Nina Weng, Jiahao Wang, Annan Li, Yunhong Wang

Responsive image

Auto-TLDR; 2S-TCN: A Two-Stream Temporal Convolutional Network for Dynamic Facial Attractiveness Prediction

Slides Poster Similar

In the field of facial attractiveness prediction, while deep models using static pictures have shown promising results, little attention is paid to dynamic facial information, which is proven to be influential by psychological studies. Meanwhile, the increasing popularity of short video apps creates an enormous demand of facial attractiveness prediction from short video clips. In this paper, we target on the dynamic facial attractiveness prediction problem. To begin with, a large-scale video-based facial attractiveness prediction dataset (VFAP) with more than one thousand clips from TikTok is collected. A two-stream temporal convolutional network (2S-TCN) is then proposed to capture dynamic attractiveness feature from both facial appearance and landmarks. We employ attentive feature enhancement along with specially designed modality and temporal fusion strategies to better explore the temporal dynamics. Extensive experiments on the proposed VFAP dataset demonstrate that 2S-TCN has a distinct advantage over the state-of-the-art static prediction methods.

SAT-Net: Self-Attention and Temporal Fusion for Facial Action Unit Detection

Zhihua Li, Zheng Zhang, Lijun Yin

Responsive image

Auto-TLDR; Temporal Fusion and Self-Attention Network for Facial Action Unit Detection

Slides Poster Similar

Research on facial action unit detection has shown remarkable performances by using deep spatial learning models in recent years, however, it is far from reaching its full capacity in learning due to the lack of use of temporal information of AUs across time. Since the AU occurrence in one frame is highly likely related to previous frames in a temporal sequence, exploring temporal correlation of AUs across frames becomes a key motivation of this work. In this paper, we propose a novel temporal fusion and AU-supervised self-attention network (a so-called SAT-Net) to address the AU detection problem. First of all, we input the deep features of a sequence into a convolutional LSTM network and fuse the previous temporal information into the feature map of the last frame, and continue to learn the AU occurrence. Second, considering the AU detection problem is a multi-label classification problem that individual label depends only on certain facial areas, we propose a new self-learned attention mask by focusing the detection of each AU on parts of facial areas through the learning of individual attention mask for each AU, thus increasing the AU independence without the loss of any spatial relations. Our extensive experiments show that the proposed framework achieves better results of AU detection over the state-of-the-arts on two benchmark databases (BP4D and DISFA).

RMS-Net: Regression and Masking for Soccer Event Spotting

Matteo Tomei, Lorenzo Baraldi, Simone Calderara, Simone Bronzin, Rita Cucchiara

Responsive image

Auto-TLDR; An Action Spotting Network for Soccer Videos

Slides Poster Similar

The recently proposed action spotting task consists in finding the exact timestamp in which an event occurs. This task fits particularly well for soccer videos, where events correspond to salient actions strictly defined by soccer rules (a goal occurs when the ball crosses the goal line). In this paper, we devise a lightweight and modular network for action spotting, which can simultaneously predict the event label and its temporal offset using the same underlying features. We enrich our model with two training strategies: the first one for data balancing and uniform sampling, the second for masking ambiguous frames and keeping the most discriminative visual cues. When tested on the SoccerNet dataset and using standard features, our full proposal exceeds the current state of the art by 3 Average-mAP points. Additionally, it reaches a gain of more than 10 Average-mAP points on the test set when fine-tuned in combination with a strong 2D backbone.

Responsive Social Smile: A Machine-Learning Based Multimodal Behavior Assessment Framework towards Early Stage Autism Screening

Yueran Pan, Kunjing Cai, Ming Cheng, Xiaobing Zou, Ming Li

Responsive image

Auto-TLDR; Responsive Social Smile: A Machine Learningbased Assessment Framework for Early ASD Screening

Poster Similar

Autism spectrum disorder (ASD) is a neurodevelopmental disorder, which causes social deficits in social lives. Early ASD screening for children is an important method to reduce the impact of ASD on people’s whole lives. Traditional screening methods rely on protocol experiments and subjective evaluations from clinicians and domain experts and thereby cost a lot. To standardize the process of ASD screening, we 1 collaborate with a group of ASD experts, and design a ”Responsive Social Smile” protocol and an experiment environment. Also, we propose a machine learningbased assessment framework for early ASD screening. By integrating technologies of speech recognition and computer vision, the framework can quantitatively analyze the behaviors of children under well-designed protocols. By collecting 196 test samples from 41 children in the clinical treatments, our proposed method obtains 85.20% accuracy for the score prediction of individual protocol, and 80.49% unweighted accuracy for the final ASD prediction. This result indicates that our model reaches the average level of domain experts in ASD diagnosis.

Learning Visual Voice Activity Detection with an Automatically Annotated Dataset

Stéphane Lathuiliere, Pablo Mesejo, Radu Horaud

Responsive image

Auto-TLDR; Deep Visual Voice Activity Detection with Optical Flow

Slides Similar

Visual voice activity detection (V-VAD) uses visual features to predict whether a person is speaking or not. V-VAD is useful whenever audio VAD (A-VAD) is inefficient either because the acoustic signal is difficult to analyze or is simply missing. We propose two deep architectures for V-VAD, one based on facial landmarks and one based on optical flow. Moreover, available datasets, used for learning and for testing V-VAD, lack content variability. We introduce a novel methodology to automatically create and annotate very large datasets in-the-wild, based on combining A-VAD and face detection. A thorough empirical evaluation shows the advantage of training the proposed deep V-VAD models with such a dataset.

Sequential Non-Rigid Factorisation for Head Pose Estimation

Stefania Cristina, Kenneth Patrick Camilleri

Responsive image

Auto-TLDR; Sequential Shape-and-Motion Factorisation for Head Pose Estimation in Eye-Gaze Tracking

Slides Poster Similar

Within the context of eye-gaze tracking, the capability of permitting the user to move naturally is an important step towards allowing for more natural user interaction in less constrained scenarios. Natural movement can be characterised by changes in head pose, as well as non-rigid face deformations as the user performs different facial expressions. While the estimation of head pose within the domain of eye-gaze tracking is being increasingly considered, the face is most often regarded as a rigid body. The few methods that factor the challenge of handling face deformations into the head pose estimation problem, often require the availability of a pre-defined face model or a considerable amount of training data. In this paper, we direct our attention towards the application of shape-and-motion factorisation for head pose estimation, since this does not generally rely on the availability of an initial face model. Over the years, various shape-and-motion factorisation methods have been proposed to address the challenges of rigid and non-rigid shape and motion recovery, in a batch or sequential manner. However, the real-time recovery of non-rigid shape and motion by factorisation remains, in general, an open problem. Our work addresses this open problem by proposing a sequential factorisation method for non-rigid shape and motion recovery, which does not rely on the availability of a pre-defined face deformation model or training data. Quantitative and qualitative results show that our method can handle various non-rigid face deformations without deterioration of the head pose estimation accuracy.

Inner Eye Canthus Localization for Human Body Temperature Screening

Claudio Ferrari, Lorenzo Berlincioni, Marco Bertini, Alberto Del Bimbo

Responsive image

Auto-TLDR; Automatic Localization of the Inner Eye Canthus in Thermal Face Images using 3D Morphable Face Model

Slides Poster Similar

In this paper, we propose an automatic approach for localizing the inner eye canthus in thermal face images. We first coarsely detect 5 facial keypoints corresponding to the center of the eyes, the nosetip and the ears. Then we compute a sparse 2D-3D points correspondence using a 3D Morphable Face Model (3DMM). This correspondence is used to project the entire 3D face onto the image, and subsequently locate the inner eye canthus. Detecting this location allows to obtain the most precise body temperature measurement for a person using a thermal camera. We evaluated the approach on a thermal face dataset provided with manually annotated landmarks. However, such manual annotations are normally conceived to identify facial parts such as eyes, nose and mouth, and are not specifically tailored for localizing the eye canthus region. As additional contribution, we enrich the original dataset by using the annotated landmarks to deform and project the 3DMM onto the images. Then, by manually selecting a small region corresponding to the eye canthus, we enrich the dataset with additional annotations. By using the manual landmarks, we ensure the correctness of the 3DMM projection, which can be used as ground-truth for future evaluations. Moreover, we supply the dataset with the 3D head poses and per-point visibility masks for detecting self-occlusions. The data will be publicly released.

AttendAffectNet: Self-Attention Based Networks for Predicting Affective Responses from Movies

Thi Phuong Thao Ha, Bt Balamurali, Herremans Dorien, Roig Gemma

Responsive image

Auto-TLDR; AttendAffectNet: A Self-Attention Based Network for Emotion Prediction from Movies

Slides Poster Similar

In this work, we propose different variants of the self-attention based network for emotion prediction from movies, which we call AttendAffectNet. We take both audio and video into account and incorporate the relation among multiple modalities by applying self-attention mechanism in a novel manner into the extracted features for emotion prediction. We compare it to the typically temporal integration of the self-attention based model, which in our case, allows to capture the relation of temporal representations of the movie while considering the sequential dependencies of emotion responses. We demonstrate the effectiveness of our proposed architectures on the extended COGNIMUSE dataset [1], [2] and the MediaEval 2016 Emotional Impact of Movies Task [3], which consist of movies with emotion annotations. Our results show that applying the self-attention mechanism on the different audio-visual features, rather than in the time domain, is more effective for emotion prediction. Our approach is also proven to outperform state-of-the-art models for emotion prediction.

Real-Time Driver Drowsiness Detection Using Facial Action Units

Malaika Vijay, Nandagopal Netrakanti Vinayak, Maanvi Nunna, Subramanyam Natarajan

Responsive image

Auto-TLDR; Real-Time Detection of Driver Drowsiness using Facial Action Units using Extreme Gradient Boosting

Slides Poster Similar

This paper presents a two-stage, vision-based pipeline for the real-time detection of driver drowsiness using Facial Action Units (FAUs). FAUs capture movements in groups of muscles in the face like widening of the eyes or dropping of the jaw. The first stage of the pipeline employs a Convolutional Neural Network (CNN) trained to detect FAUs. The output of the penultimate layer of this network serves as an image embedding that captures features relevant to FAU detection. These embeddings are then used to predict drowsiness using an Extreme Gradient Boosting (XGBoost) classifier. A separate XGBoost model is trained for each user of the system so that behavior specific to each user can be modeled into the drowsiness classifier. We show that user-specific classifiers require very little data and low training time to yield high prediction accuracies in real-time.

Audio-Video Detection of the Active Speaker in Meetings

Francisco Madrigal, Frederic Lerasle, Lionel Pibre, Isabelle Ferrané

Responsive image

Auto-TLDR; Active Speaker Detection with Visual and Contextual Information from Meeting Context

Slides Poster Similar

Meetings are a common activity that provides certain challenges when creating systems that assist them. Such is the case of the Active speaker detection, which can provide useful information for human interaction modeling, or human-robot interaction. Active speaker detection is mostly done using speech, however, certain visual and contextual information can provide additional insights. In this paper we propose an active speaker detection framework that integrates audiovisual features with social information, from the meeting context. Visual cue is processed using a Convolutional Neural Network (CNN) that captures the spatio-temporal relationships. We analyze several CNN architectures with both cues: raw pixels (RGB images) and motion (estimated with optical flow). Contextual reasoning is done with an original methodology, based on the gaze of all participants. We evaluate our proposal with a public \textcolor{black}{benchmark} in state-of-art: AMI corpus. We show how the addition of visual and context information improves the performance of the active speaker detection.

Precise Temporal Action Localization with Quantified Temporal Structure of Actions

Chongkai Lu, Ruimin Li, Hong Fu, Bin Fu, Yihao Wang, Wai Lun Lo, Zheru Chi

Responsive image

Auto-TLDR; Action progression networks for temporal action detection

Slides Poster Similar

Existing temporal action detection algorithms cannot distinguish complete and incomplete actions while this property is essential in many applications. To tackle this challenge, we proposed the action progression networks (APN), a novel model that predicts action progression of video frames with continuous numbers. Using the progression sequence of test video, on the top of the APN, a complete action searching algorithm (CAS) was designed to detect complete actions only. With the usage of frame-level fine-grained temporal structure modeling and detecting actions according to their whole temporal context, our framework can locate actions precisely and is good at avoiding incomplete action detection. We evaluated our framework on a new dataset (DFMAD-70) collected by ourselves which contains both complete and incomplete actions. Our framework got good temporal localization results with 95.77% average precision when the IoU threshold is 0.5. On the benchmark THUMOS14, an incomplete-ignostic dataset, our framework still obtain competitive performance. The code is available online at https://github.com/MakeCent/Action-Progression-Network

Learning Group Activities from Skeletons without Individual Action Labels

Fabio Zappardino, Tiberio Uricchio, Lorenzo Seidenari, Alberto Del Bimbo

Responsive image

Auto-TLDR; Lean Pose Only for Group Activity Recognition

Similar

To understand human behavior we must not just recognize individual actions but model possibly complex group activity and interactions. Hierarchical models obtain the best results in group activity recognition but require fine grained individual action annotations at the actor level. In this paper we show that using only skeletal data we can train a state-of-the art end-to-end system using only group activity labels at the sequence level. Our experiments show that models trained without individual action supervision perform poorly. On the other hand we show that pseudo-labels can be computed from any pre-trained feature extractor with comparable final performance. Finally our carefully designed lean pose only architecture shows highly competitive results versus more complex multimodal approaches even in the self-supervised variant.

Temporal Binary Representation for Event-Based Action Recognition

Simone Undri Innocenti, Federico Becattini, Federico Pernici, Alberto Del Bimbo

Responsive image

Auto-TLDR; Temporal Binary Representation for Gesture Recognition

Slides Poster Similar

In this paper we present an event aggregation strategy to convert the output of an event camera into frames processable by traditional Computer Vision algorithms. The proposed method first generates sequences of intermediate binary representations, which are then losslessly transformed into a compact format by simply applying a binary-to-decimal conversion. This strategy allows us to encode temporal information directly into pixel values, which are then interpreted by deep learning models. We apply our strategy, called Temporal Binary Representation, to the task of Gesture Recognition, obtaining state of the art results on the popular DVS128 Gesture Dataset. To underline the effectiveness of the proposed method compared to existing ones, we also collect an extension of the dataset under more challenging conditions on which to perform experiments.

Video-Based Facial Expression Recognition Using Graph Convolutional Networks

Daizong Liu, Hongting Zhang, Pan Zhou

Responsive image

Auto-TLDR; Graph Convolutional Network for Video-based Facial Expression Recognition

Slides Poster Similar

Facial expression recognition (FER), aiming to classify the expression present in the facial image or video, has attracted a lot of research interests in the field of artificial intelligence and multimedia. In terms of video based FER task, it is sensible to capture the dynamic expression variation among the frames to recognize facial expression. However, existing methods directly utilize CNN-RNN or 3D CNN to extract the spatial-temporal features from different facial units, instead of concentrating on a certain region during expression variation capturing, which leads to limited performance in FER. In our paper, we introduce a Graph Convolutional Network (GCN) layer into a common CNN-RNN based model for video-based FER. First, the GCN layer is utilized to learn more contributing facial expression features which concentrate on certain regions after sharing information between nodes those represent CNN extracted features. Then, a LSTM layer is applied to learn long-term dependencies among the GCN learned features to model the variation. In addition, a weight assignment mechanism is also designed to weight the output of different nodes for final classification by characterizing the expression intensities in each frame. To the best of our knowledge, it is the first time to use GCN in FER task. We evaluate our method on three widely-used datasets, CK+, Oulu-CASIA and MMI, and also one challenging wild dataset AFEW8.0, and the experimental results demonstrate that our method has superior performance to existing methods.

RWF-2000: An Open Large Scale Video Database for Violence Detection

Ming Cheng, Kunjing Cai, Ming Li

Responsive image

Auto-TLDR; Flow Gated Network for Violence Detection in Surveillance Cameras

Slides Poster Similar

In recent years, surveillance cameras are widely deployed in public places, and the general crime rate has been reduced significantly due to these ubiquitous devices. Usually, these cameras provide cues and evidence after crimes were conducted, while they are rarely used to prevent or stop criminal activities in time. It is both time and labor consuming to manually monitor a large amount of video data from surveillance cameras. Therefore, automatically recognizing violent behaviors from video signals becomes essential. In this paper, we summarize several existing video datasets for violence detection and propose a new video dataset with 2,000 videos all captured by surveillance cameras in real-world scenes. Also, we present a new method that utilizes both the merits of 3D-CNNs and optical flow, namely Flow Gated Network. The proposed approach obtains an accuracy of 87.25% on the test set of our proposed RWF-2000 database. The proposed database and source codes of this paper are currently open to access.

Attention-Oriented Action Recognition for Real-Time Human-Robot Interaction

Ziyang Song, Ziyi Yin, Zejian Yuan, Chong Zhang, Wanchao Chi, Yonggen Ling, Shenghao Zhang

Responsive image

Auto-TLDR; Attention-Oriented Multi-Level Network for Action Recognition in Interaction Scenes

Slides Poster Similar

Despite the notable progress made in action recognition tasks, not much work has been done in action recognition specifically for human-robot interaction. In this paper, we deeply explore the characteristics of the action recognition task in interaction scenes and propose an attention-oriented multi-level network framework to meet the need for real-time interaction. Specifically, a Pre-Attention network is employed to roughly focus on the interactor in the scene at low resolution firstly and then perform fine-grained pose estimation at high resolution. The other compact CNN receives the extracted skeleton sequence as input for action recognition, utilizing attention-like mechanisms to capture local spatial-temporal patterns and global semantic information effectively. To evaluate our approach, we construct a new action dataset specially for the recognition task in interaction scenes. Experimental results on our dataset and high efficiency (112 fps at 640 x 480 RGBD) on the mobile computing platform (Nvidia Jetson AGX Xavier) demonstrate excellent applicability of our method on action recognition in real-time human-robot interaction.

Identity-Aware Facial Expression Recognition in Compressed Video

Xiaofeng Liu, Linghao Jin, Xu Han, Jun Lu, Jonghye Woo, Jane You

Responsive image

Auto-TLDR; Exploring Facial Expression Representation in Compressed Video with Mutual Information Minimization

Slides Similar

This paper targets to explore the inter-subject variations eliminated facial expression representation in the compressed video domain. Most of the previous methods process the RGB images of a sequence, while the off-the-shelf and valuable expression-related muscle movement already embedded in the compression format. In the up to two orders of magnitude compressed domain, we can explicitly infer the expression from the residual frames and possible to extract identity factors from the I frame with a pre-trained face recognition network. By enforcing the marginal independent of them, the expression feature is expected to be purer for the expression and be robust to identity shifts. Specifically, we propose a novel collaborative min-min game for mutual information (MI) minimization in latent space. We do not need the identity label or multiple expression samples from the same person for identity elimination. Moreover, when the apex frame is annotated in the dataset, the complementary constraint can be further added to regularize the feature-level game. In testing, only the compressed residual frames are required to achieve expression prediction. Our solution can achieve comparable or better performance than the recent decoded image based methods on the typical FER benchmarks with about 3$\times$ faster inference with compressed data.

Enriching Video Captions with Contextual Text

Philipp Rimle, Pelin Dogan, Markus Gross

Responsive image

Auto-TLDR; Contextualized Video Captioning Using Contextual Text

Slides Poster Similar

Understanding video content and generating caption with context is an important and challenging task. Unlike prior methods that typically attempt to generate generic video captions without context, our architecture contextualizes captioning by infusing extracted information from relevant text data. We propose an end-to-end sequence-to-sequence model which generates video captions based on visual input, and mines relevant knowledge such as names and locations from contextual text. In contrast to previous approaches, we do not preprocess the text further, and let the model directly learn to attend over it. Guided by the visual input, the model is able to copy words from the contextual text via a pointer-generator network, allowing to produce more specific video captions. We show competitive performance on the News Video Dataset and, through ablation studies, validate the efficacy of contextual video captioning as well as individual design choices in our model architecture.

A Prototype-Based Generalized Zero-Shot Learning Framework for Hand Gesture Recognition

Jinting Wu, Yujia Zhang, Xiao-Guang Zhao

Responsive image

Auto-TLDR; Generalized Zero-Shot Learning for Hand Gesture Recognition

Slides Poster Similar

Hand gesture recognition plays a significant role in human-computer interaction for understanding various human gestures and their intent. However, most prior works can only recognize gestures of limited labeled classes and fail to adapt to new categories. The task of Generalized Zero-Shot Learning (GZSL) for hand gesture recognition aims to address the above issue by leveraging semantic representations and detecting both seen and unseen class samples. In this paper, we propose an end-to-end prototype-based GZSL framework for hand gesture recognition which consists of two branches. The first branch is a prototype-based detector that learns gesture representations and determines whether an input sample belongs to a seen or unseen category. The second branch is a zero-shot label predictor which takes the features of unseen classes as input and outputs predictions through a learned mapping mechanism between the feature and the semantic space. We further establish a hand gesture dataset that specifically targets this GZSL task, and comprehensive experiments on this dataset demonstrate the effectiveness of our proposed approach on recognizingQuestionnaire both seen and unseen gestures.

Three-Dimensional Lip Motion Network for Text-Independent Speaker Recognition

Jianrong Wang, Tong Wu, Shanyu Wang, Mei Yu, Qiang Fang, Ju Zhang, Li Liu

Responsive image

Auto-TLDR; Lip Motion Network for Text-Independent and Text-Dependent Speaker Recognition

Slides Poster Similar

Lip motion reflects behavior characteristics of speakers, and thus can be used as a new kind of biometrics in speaker recognition. In the literature, lots of works used two dimensional (2D) lip images to recognize speaker in a text-dependent context. However, 2D lip easily suffers from face orientations. To this end, in this work, we present a novel end-to-end 3D lip motion Network (3LMNet) by utilizing the sentence-level 3D lip motion (S3DLM) to recognize speakers in both the text-independent and text-dependent contexts. A novel regional feedback module (RFM) is proposed to explore attentions in different lip regions. Besides, prior knowledge of lip motion is investigated to complement RFM, where landmark-level and frame-level features are merged to form a better feature representation. Moreover, we present two methods, i.e., coordinate transformation and face posture correction to pre-process the LSD-AV dataset, which contains 68 speakers and 146 sentences per speaker. The evaluation results on this dataset demonstrate that our proposed 3LMNet is superior to the baseline models, i.e., LSTM, VGG-16 and ResNet-34, and outperforms the state-of-the-art using 2D lip image as well as the 3D face. The code of this work is released at https://github.com/wutong18/Three-Dimensional-Lip-Motion-Ne twork-for-Text-Independent-Speaker-Recognition.

Attentive Hybrid Feature Based a Two-Step Fusion for Facial Expression Recognition

Jun Weng, Yang Yang, Zichang Tan, Zhen Lei

Responsive image

Auto-TLDR; Attentive Hybrid Architecture for Facial Expression Recognition

Slides Poster Similar

Facial expression recognition is inherently a challenging task, especially for the in-the-wild images with various occlusions and large pose variations, which may lead to the loss of some crucial information. To address it, in this paper, we propose an attentive hybrid architecture (AHA) which learns global, local and integrated features based on different face regions. Compared with one type of feature, our extracted features own complementary information and can reduce the loss of crucial information. Specifically, AHA contains three branches, where all sub-networks in those branches employ the attention mechanism to further localize the interested pixels/regions. Moreover, we propose a two-step fusion strategy based on LSTM to deeply explore the hidden correlations among different face regions. Extensive experiments on four popular expression databases (i.e., CK+, FER-2013, SFEW 2.0, RAF-DB) show the effectiveness of the proposed method.

A Grid-Based Representation for Human Action Recognition

Soufiane Lamghari, Guillaume-Alexandre Bilodeau, Nicolas Saunier

Responsive image

Auto-TLDR; GRAR: Grid-based Representation for Action Recognition in Videos

Slides Poster Similar

Human action recognition (HAR) in videos is a fundamental research topic in computer vision. It consists mainly in understanding actions performed by humans based on a sequence of visual observations. In recent years, HAR have witnessed significant progress, especially with the emergence of deep learning models. However, most of existing approaches for action recognition rely on information that is not always relevant for the task, and are limited in the way they fuse temporal information. In this paper, we propose a novel method for human action recognition that encodes efficiently the most discriminative appearance information of an action with explicit attention on representative pose features, into a new compact grid representation. Our GRAR (Grid-based Representation for Action Recognition) method is tested on several benchmark datasets that demonstrate that our model can accurately recognize human actions, despite intra-class appearance variations and occlusion challenges.

JT-MGCN: Joint-Temporal Motion Graph Convolutional Network for Skeleton-Based Action Recognition

Suekyeong Nam, Seungkyu Lee

Responsive image

Auto-TLDR; Joint-temporal Motion Graph Convolutional Networks for Action Recognition

Slides Similar

Recently, action recognition methods using graph convolutional networks (GCN) have shown remarkable performance thanks to its concise but effective representation of human body motion. Prior methods construct human body motion graph building edges between neighbor or distant body joints. On the other hand, human action contains lots of temporal variations showing strong temporal correlations between joint motions. Thus the characterization of an action requires a comprehensive analysis of joint motion correlations on spatial and temporal domains. In this paper, we propose Joint-temporal Motion Graph Convolutional Networks (JT-MGCN) in which joint-temporal edges learn the correlations between different joints at different time. Experimental evaluation on large public data sets such as NTU rgb+d data set and kinetics-skeleton data set show outstanding action recognition performance.

Feature Pyramid Hierarchies for Multi-Scale Temporal Action Detection

Jiayu He, Guohui Li, Jun Lei

Responsive image

Auto-TLDR; Temporal Action Detection using Pyramid Hierarchies and Multi-scale Feature Maps

Slides Poster Similar

Temporal action detection is a challenging but promising task in video content analysis. It is in great demand in the field of public safety. The main difficulty of the task is precisely localizing activities in the video especially those short duration activities. And most of the existing methods can not achieve a satisfactory detection result. Our method addresses a key point to improve detection accuracy, which is to use multi-scale feature maps for regression and classification. In this paper, we introduce a novel network based on classification following proposal framework. In our network, a 3D feature pyramid hierarchies is built to enhance the ability of detecting short duration activities. The input RGB/Flow frames are first encoded by a 3D feature pyramid hierarchies, and this subnet produces multi-level feature maps. Then temporal proposal subnet uses these features to pick out proposals which might contain activity segments. Finally a pyramid region of interest (RoI) pooling pipeline and two fully connected layers reuse muti-level feature maps to refine the temporal boundaries of proposals and classify them. We use late feature fusion scheme to combine RGB and Flow information. The network is trained end-to-end and we evaluate it in THUMOS'14 dataset. Our network achieves a good result among typical methods. A further ablation test demonstrate that pyramid hierarchies is effective to improve detecting short duration activity segments.

Quantified Facial Temporal-Expressiveness Dynamics for Affect Analysis

Md Taufeeq Uddin, Shaun Canavan

Responsive image

Auto-TLDR; quantified facial Temporal-expressiveness Dynamics for quantified affect analysis

Poster Similar

The quantification of visual affect data (e.g. face images) is essential to build and monitor automated affect modeling systems efficiently. Considering this, this work proposes quantified facial Temporal-expressiveness Dynamics (TED) to quantify the expressiveness of human faces. The proposed algorithm leverages multimodal facial features by incorporating static and dynamic information to enable accurate measurements of facial expressiveness. We show that TED can be used for high-level tasks such as summarization of unstructured visual data, expectation from and interpretation of automated affect recognition models. To evaluate the positive impact of using TED, a case study was conducted on spontaneous pain using the UNBC-McMaster spontaneous shoulder pain dataset. Experimental results show the efficacy of using TED for quantified affect analysis.

Inferring Tasks and Fluents in Videos by Learning Causal Relations

Haowen Tang, Ping Wei, Huan Li, Nanning Zheng

Responsive image

Auto-TLDR; Joint Learning of Complex Task and Fluent States in Videos

Slides Poster Similar

Recognizing time-varying object states in complex tasks is an important and challenging issue. In this paper, we propose a novel model to jointly infer object fluents and complex tasks in videos. A task is a complex goal-driven human activity and a fluent is defined as a time-varying object state. A hierarchical graph represents a task as a human action stream and multiple concurrent object fluents which vary as the human performs the actions. In this process, the human actions serve as the causes of object state changes which conversely reflect the effects of human actions. Given an input video, a causal sampling beam search (CSBS) algorithm is proposed to jointly infer the task category and the states of objects in each video frame. For model learning, a structural SVM framework is adopted to jointly train the task, fluent, cause, and effect parameters. We collected a new large-scale dataset of tasks and fluents in third-person view videos. It contains 14 categories of tasks, 24 categories of object fluents, 50 categories of object states, 809 videos, and 333,351 frames. Experimental results demonstrate the effectiveness of the proposed method.

You Ought to Look Around: Precise, Large Span Action Detection

Ge Pan, Zhang Han, Fan Yu, Yonghong Song, Yuanlin Zhang, Han Yuan

Responsive image

Auto-TLDR; YOLA: Local Feature Extraction for Action Localization with Variable receptive field

Slides Similar

For the action localization task, pre-defined action anchors are the cornerstone of mainstream techniques. State-of-the-art models mostly rely on a dense segmenting scheme, where anchors are sampled uniformly over the temporal domain with a predefined set of scales. However, it is not sufficient because action duration varies greatly. Therefore, it is necessary for the anchors or proposals to have a variable receptive field. In this paper, we propose a method called YOLA (You Ought to Look Around) which includes three parts: 1) a robust backbone SPN-I3D for extracting spatio-temporal features. In this part, we employ a stronger backbone I3D with SPN (Segment Pyramid Network) instead of C3D to obtain multi-scale features; 2) a simple but useful feature fusion module named LFE (Local Feature Extraction). Compared with the fully connected layer and global average pooling, our LFE model is more advantageous for network to fit and fuse features. 3) a new feature segment aligning method called TPGC (Two Pathway Graph Convolution), which allows one proposal to leverage semantic features of adjacent proposals to update its content and make sure the proposals have a variable receptive field. YOLA add only a small overhead to the baseline network, and is easy to train in an end-to-end manner, running at a speed of 1097 fps. YOLA achieves a mAP of 58.3%, outperforming all existing models including both RGB-based and two stream on THUMOS'14, and achieves competitive results on ActivityNet 1.3.

Attention-Driven Body Pose Encoding for Human Activity Recognition

Bappaditya Debnath, Swagat Kumar, Marry O'Brien, Ardhendu Behera

Responsive image

Auto-TLDR; Attention-based Body Pose Encoding for Human Activity Recognition

Slides Poster Similar

This article proposes a novel attention-based body pose encoding for human activity recognition. Most of the existing human activity recognition approaches based on 3D pose data often enrich the input data using additional handcrafted representations such as velocity, super normal vectors, pairwise relations, and so on. The enriched data complements the 3D body joint position data and improves the model performance. In this paper, we propose a novel approach that learns enhanced feature representations from a given sequence of 3D body joints. To achieve this, the approach exploits two body pose streams: 1) a spatial stream which encodes the spatial relationship between various body joints at each time point to learn spatial structure involving the spatial distribution of different body joints 2) a temporal stream that learns the temporal variation of individual body joints over the entire sequence duration to present a temporally enhanced representation. Afterwards, these two pose streams are fused with a multi-head attention mechanism. We also capture the contextual information from the RGB video stream using a deep Convolutional Neural Network (CNN) model combined with a multi-head attention and a bidirectional Long Short-Term Memory (LSTM) network. Finally, the RGB video stream is combined with the fused body pose stream to give a novel end-to-end deep model for effective human activity recognition. The proposed model is evaluated on three datasets including the challenging NTU-RGBD dataset and achieves state-of-the-art results.

Automated Whiteboard Lecture Video Summarization by Content Region Detection and Representation

Bhargava Urala Kota, Alexander Stone, Kenny Davila, Srirangaraj Setlur, Venu Govindaraju

Responsive image

Auto-TLDR; A Framework for Summarizing Whiteboard Lecture Videos Using Feature Representations of Handwritten Content Regions

Poster Similar

Lecture videos are rapidly becoming an invaluable source of information for students across the globe. Given the large number of online courses currently available, it is important to condense the information within these videos into a compact yet representative summary that can be used for search-based applications. We propose a framework to summarize whiteboard lecture videos by finding feature representations of detected handwritten content regions to determine unique content. We investigate multi-scale histogram of gradients and embeddings from deep metric learning for feature representation. We explicitly handle occluded, growing and disappearing handwritten content. Our method is capable of producing two kinds of lecture video summaries - the unique regions themselves or so-called key content and keyframes (which contain all unique content in a video segment). We use weighted spatio-temporal conflict minimization to segment the lecture and produce keyframes from detected regions and features. We evaluate both types of summaries and find that we obtain state-of-the-art peformance in terms of number of summary keyframes while our unique content recall and precision are comparable to state-of-the-art.

Attribute-Based Quality Assessment for Demographic Estimation in Face Videos

Fabiola Becerra-Riera, Annette Morales-González, Heydi Mendez-Vazquez, Jean-Luc Dugelay

Responsive image

Auto-TLDR; Facial Demographic Estimation in Video Scenarios Using Quality Assessment

Slides Similar

Most existing works regarding facial demographic estimation are focused on still image datasets, although nowadays the need to analyze video content in real applications is increasing. We propose to tackle gender, age and ethnicity estimation in the context of video scenarios. Our main contribution is to use an attribute-specific quality assessment procedure to select best quality frames from a video sequence for each of the three demographic modalities. Best quality frames are classified with fine-tuned MobileNet models and a final video prediction is obtained with a majority voting strategy among the best selected frames. Our validation on three different datasets and our comparison with state-of-the-art models, show the effectiveness of the proposed demographic classifiers and the quality pipeline, which allows to reduce both: the number of frames to be classified and the processing time in practical applications; and improves the soft biometrics prediction accuracy.

Relevance Detection in Cataract Surgery Videos by Spatio-Temporal Action Localization

Negin Ghamsarian, Mario Taschwer, Doris Putzgruber, Stephanie. Sarny, Klaus Schoeffmann

Responsive image

Auto-TLDR; relevance-based retrieval in cataract surgery videos

Slides Similar

In cataract surgery, the operation is performed with the help of a microscope. Since the microscope enables watching real-time surgery by up to two people only, a major part of surgical training is conducted using the recorded videos. To optimize the training procedure with the video content, the surgeons require an automatic relevance detection approach. In addition to relevance-based retrieval, these results can be further used for skill assessment and irregularity detection in cataract surgery videos. In this paper, a three-module framework is proposed to detect and classify the relevant phase segments in cataract videos. Taking advantage of an idle frame recognition network, the video is divided into idle and action segments. To boost the performance in relevance detection Mask R-CNN is utilized to detect the cornea in each frame where the relevant surgical actions are conducted. The spatio-temporal localized segments containing higher-resolution information about the pupil texture and actions, and complementary temporal information from the same phase are fed into the relevance detection module. This module consists of four parallel recurrent CNNs being responsible to detect four relevant phases that have been defined with medical experts. The results will then be integrated to classify the action phases as irrelevant or one of four relevant phases. Experimental results reveal that the proposed approach outperforms static CNNs and different configurations of feature-based and end-to-end recurrent networks.

MRP-Net: A Light Multiple Region Perception Neural Network for Multi-Label AU Detection

Yang Tang, Shuang Chen, Honggang Zhang, Gang Wang, Rui Yang

Responsive image

Auto-TLDR; MRP-Net: A Fast and Light Neural Network for Facial Action Unit Detection

Slides Poster Similar

Facial Action Units (AUs) are of great significance in communication. Automatic AU detection can improve the understanding of psychological condition and emotional status. Recently, a number of deep learning methods have been proposed to take charge with problems in automatic AU detection. Several challenges, like unbalanced labels and ignorance of local information, remain to be addressed. In this paper, we propose a fast and light neural network called MRP-Net, which is an end-to-end trainable method for facial AU detection to solve these problems. First, we design a Multiple Region Perception (MRP) module aimed at capturing different locations and sizes of features in the deeper level of the network without facial landmark points. Then, in order to balance the positive and negative samples in the large dataset, a batch balanced method adjusting the weight of every sample in one batch in our loss function is suggested. Experimental results on two popular AU datasets, BP4D and DISFA prove that MRP-Net outperforms state-of-the-art methods. Compared with the best method, not only does MRP-Net have an average F1 score improvement of 2.95% on BP4D and 5.43% on DISFA, and it also decreases the number of network parameters by 54.62% and the number of network FLOPs by 19.6%.

Self-Supervised Learning of Dynamic Representations for Static Images

Siyang Song, Enrique Sanchez, Linlin Shen, Michel Valstar

Responsive image

Auto-TLDR; Facial Action Unit Intensity Estimation and Affect Estimation from Still Images with Multiple Temporal Scale

Slides Poster Similar

Facial actions are spatio-temporal signals by nature, and therefore their modeling is crucially dependent on the availability of temporal information. In this paper, we focus on inferring such temporal dynamics of facial actions when no explicit temporal information is available, i.e. from still images. We present a novel approach to capture multiple scales of such temporal dynamics, with an application to facial Action Unit (AU) intensity estimation and dimensional affect estimation. In particular, 1) we propose a framework that infers a dynamic representation (DR) from a still image, which captures the bi-directional flow of time within a short time-window centered at the input image; 2) we show that we can train our method without the need of explicitly generating target representations, allowing the network to represent dynamics more broadly; and 3) we propose to apply a multiple temporal scale approach that infers DRs for different window lengths (MDR) from a still image. We empirically validate the value of our approach on the task of frame ranking, and show how our proposed MDR attains state of the art results on BP4D for AU intensity estimation and on SEMAINE for dimensional affect estimation, using only still images at test time.