ICPR2020 Paper Browser

Paper download is intended for registered attendees only and is subject to the IEEE Copyright Policy. Any other use is strictly forbidden.

Knowledge Distillation for Action Anticipation Via Label Smoothing

Guglielmo Camporese, Pasquale Coscia, Antonino Furnari, Giovanni Maria Farinella, Lamberto Ballan

Auto-TLDR; A Multi-Modal Framework for Action Anticipation using Long Short-Term Memory Networks

The human capability to anticipate the near future from visual observations and non-verbal cues is essential for developing intelligent systems that need to interact with people. Several research areas, such as human-robot interaction (HRI), assisted living, and autonomous driving, need to foresee future events to avoid crashes or to help people. Egocentric scenarios are classic examples where action anticipation is applied, owing to their numerous applications. Such a challenging task demands capturing and modeling the domain's hidden structure to reduce prediction uncertainty. Since multiple actions may equally occur in the future, we treat action anticipation as a multi-label problem with missing labels, extending the concept of label smoothing. This idea resembles the knowledge distillation process, since useful information is injected into the model during training. We implement a multi-modal framework based on long short-term memory (LSTM) networks to summarize past observations and make predictions at different time steps. We perform extensive experiments on the EPIC-Kitchens and EGTEA Gaze+ datasets, which include more than 2500 and 100 action classes, respectively. The experiments show that label smoothing systematically improves the performance of state-of-the-art models for action anticipation.
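For readers unfamiliar with label smoothing, the general idea can be illustrated with a minimal sketch. It shows only standard uniform label smoothing with a soft-target cross-entropy, not the specific smoothing variants proposed in the paper, which the abstract does not detail. The function names, the `alpha` value, and the example logits are assumptions chosen for illustration.

```python
import math

def smooth_labels(target_index, num_classes, alpha=0.1):
    """Uniform label smoothing: blend a one-hot target with a uniform prior.

    alpha is a hypothetical smoothing strength; alpha=0 gives plain one-hot.
    """
    uniform = alpha / num_classes
    soft = [uniform] * num_classes
    soft[target_index] += 1.0 - alpha
    return soft

def cross_entropy(logits, soft_targets):
    """Cross-entropy between softmax(logits) and a soft target distribution."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(t * (x - log_z) for x, t in zip(logits, soft_targets))

# Illustrative values: 4 action classes, ground-truth class index 2.
logits = [0.5, -1.0, 2.0, 0.1]
soft = smooth_labels(target_index=2, num_classes=4, alpha=0.1)
hard = smooth_labels(target_index=2, num_classes=4, alpha=0.0)

loss_smoothed = cross_entropy(logits, soft)
loss_one_hot = cross_entropy(logits, hard)
```

The smoothed target keeps most of its mass on the annotated action while assigning a small probability to every other class, which is one simple way to express that several future actions may be plausible. In the setting described in the abstract, the uniform prior would presumably be replaced by a more informative distribution over actions.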

Similar papers

What and How? Jointly Forecasting Human Action and Pose

Yanjun Zhu, Yanxia Zhang, Qiong Liu, Andreas Girgensohn

Auto-TLDR; Forecasting Human Actions and Motion Trajectories with Joint Action Classification and Pose Regression

Self-Supervised Joint Encoding of Motion and Appearance for First Person Action Recognition

Modeling Long-Term Interactions to Enhance Action Recognition

Towards Practical Compressed Video Action Recognition: A Temporal Enhanced Multi-Stream Network

Developing Motion Code Embedding for Action Recognition in Videos

Anticipating Activity from Multimodal Signals

AttendAffectNet: Self-Attention Based Networks for Predicting Affective Responses from Movies

A Grid-Based Representation for Human Action Recognition

Transformer Networks for Trajectory Forecasting

Feature-Supervised Action Modality Transfer

Extracting Action Hierarchies from Action Labels and their Use in Deep Action Recognition

A CNN-RNN Framework for Image Annotation from Visual Cues and Social Network Metadata

Flow-Guided Spatial Attention Tracking for Egocentric Activity Recognition

Late Fusion of Bayesian and Convolutional Models for Action Recognition

RMS-Net: Regression and Masking for Soccer Event Spotting

Gabriella: An Online System for Real-Time Activity Detection in Untrimmed Security Videos

Single View Learning in Action Recognition

Global Feature Aggregation for Accident Anticipation

Audio-Video Detection of the Active Speaker in Meetings

TinyVIRAT: Low-Resolution Video Action Recognition

3D Attention Mechanism for Fine-Grained Classification of Table Tennis Strokes Using a Twin Spatio-Temporal Convolutional Neural Networks

Temporal Binary Representation for Event-Based Action Recognition

ActionSpotter: Deep Reinforcement Learning Framework for Temporal Action Spotting in Videos

Uncertainty-Sensitive Activity Recognition: A Reliability Benchmark and the CARING Models

Inferring Tasks and Fluents in Videos by Learning Causal Relations

DAG-Net: Double Attentive Graph Neural Network for Trajectory Forecasting

Learnable Higher-Order Representation for Action Recognition

RWF-2000: An Open Large Scale Video Database for Violence Detection

Attentive Visual Semantic Specialized Network for Video Captioning

MFI: Multi-Range Feature Interchange for Video Action Recognition

Learning Dictionaries of Kinematic Primitives for Action Classification

Activity Recognition Using First-Person-View Cameras Based on Sparse Optical Flows

Hierarchical Multimodal Attention for Deep Video Summarization

Precise Temporal Action Localization with Quantified Temporal Structure of Actions

Explore and Explain: Self-Supervised Navigation and Recounting

Visual Oriented Encoder: Integrating Multimodal and Multi-Scale Contexts for Video Captioning

A Detection-Based Approach to Multiview Action Classification in Infants

Vision-Based Multi-Modal Framework for Action Recognition

Image Sequence Based Cyclist Action Recognition Using Multi-Stream 3D Convolution

Pose-Based Body Language Recognition for Emotion and Psychiatric Symptom Interpretation

Text Synopsis Generation for Egocentric Videos

Feature Pyramid Hierarchies for Multi-Scale Temporal Action Detection

Temporally Coherent Embeddings for Self-Supervised Video Representation Learning

Probability Guided Maxout

Motion U-Net: Multi-Cue Encoder-Decoder Network for Motion Segmentation

The Role of Cycle Consistency for Generating Better Human Action Videos from a Single Frame

SL-DML: Signal Level Deep Metric Learning for Multimodal One-Shot Action Recognition

Object Segmentation Tracking from Generic Video Cues