ICPR2020 Paper Browser

Paper download is intended for registered attendees only, and is subject to the IEEE Copyright Policy. Any other use is strictly forbidden.

Anticipating Activity from Multimodal Signals

Tiziana Rotondo, Giovanni Maria Farinella, Davide Giacalone, Sebastiano Mauro Strano, Valeria Tomaselli, Sebastiano Battiato

Auto-TLDR; Exploiting Multimodal Signal Embedding Space for Multi-Action Prediction

Abstract

Images, videos, audio signals, and sensor data can easily be collected in huge quantities by different devices and processed to emulate the human ability to handle a variety of stimuli. Are multimodal signals useful for understanding and anticipating human actions when they are acquired from the user's viewpoint? This paper proposes building an embedding space in which inputs of different natures that are semantically correlated are projected into a new representation and exploited to anticipate the user's future activity. To this end, we built a new multimodal dataset comprising video, audio, tri-axial acceleration, angular velocity, tri-axial magnetic field, pressure, and temperature. To benchmark the proposed multimodal anticipation challenge, we apply classic classifiers on top of the deep learning methods used to build the embedding space that represents the multimodal signals. The results show that exploiting different modalities improves the anticipation of the future activity.
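
The pipeline outlined in the abstract (project each modality into a shared space, then run a classic classifier on the result) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the data are synthetic, random linear maps stand in for the learned deep encoders, fusion is plain concatenation, and the classifier is nearest-centroid. The names `project`, `fuse`, `fit_centroids`, and `predict` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def project(features, weights):
    """Map one modality into the shared space (stand-in for a learned encoder)."""
    return features @ weights


def fuse(embeddings):
    """Fuse per-modality embeddings by concatenation (one of several options)."""
    return np.concatenate(embeddings, axis=1)


def fit_centroids(x, y):
    """Fit a nearest-centroid classifier: one mean vector per class."""
    labels = np.unique(y)
    centroids = np.stack([x[y == c].mean(axis=0) for c in labels])
    return labels, centroids


def predict(x, labels, centroids):
    """Assign each sample the label of its closest centroid."""
    distances = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return labels[distances.argmin(axis=1)]


# Synthetic stand-in data: each row is an observed segment, and the label
# is the activity that follows it (the quantity to anticipate).
n_per_class, embed_dim = 20, 8
future_labels = np.repeat([0, 1], n_per_class)
offsets = np.where(future_labels == 0, -3.0, 3.0)[:, None]
video = offsets + 0.1 * rng.standard_normal((2 * n_per_class, 16))
inertial = offsets + 0.1 * rng.standard_normal((2 * n_per_class, 6))

# Random projections; a real system would learn these jointly.
w_video = rng.standard_normal((16, embed_dim))
w_inertial = rng.standard_normal((6, embed_dim))

fused = fuse([project(video, w_video), project(inertial, w_inertial)])
labels, centroids = fit_centroids(fused, future_labels)
predicted = predict(fused, labels, centroids)
accuracy = float((predicted == future_labels).mean())
print(fused.shape, accuracy)
```

The accuracy is measured on the same synthetic samples used to fit the centroids, so it only checks that the pieces fit together and says nothing about the results reported in the paper.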

Similar papers

Knowledge Distillation for Action Anticipation Via Label Smoothing

Guglielmo Camporese, Pasquale Coscia, Antonino Furnari, Giovanni Maria Farinella, Lamberto Ballan

Auto-TLDR; A Multi-Modal Framework for Action Anticipation using Long Short-Term Memory Networks

Vision-Based Multi-Modal Framework for Action Recognition

What and How? Jointly Forecasting Human Action and Pose

SL-DML: Signal Level Deep Metric Learning for Multimodal One-Shot Action Recognition

From Human Pose to On-Body Devices for Human-Activity Recognition

Modeling Long-Term Interactions to Enhance Action Recognition

Activity Recognition Using First-Person-View Cameras Based on Sparse Optical Flows

Personalized Models in Human Activity Recognition Using Deep Learning

Self-Supervised Joint Encoding of Motion and Appearance for First Person Action Recognition

Learning Dictionaries of Kinematic Primitives for Action Classification

Space-Time Domain Tensor Neural Networks: An Application on Human Pose Classification

A Grid-Based Representation for Human Action Recognition

A Detection-Based Approach to Multiview Action Classification in Infants

Pose-Based Body Language Recognition for Emotion and Psychiatric Symptom Interpretation

Translation Resilient Opportunistic WiFi Sensing

Inferring Tasks and Fluents in Videos by Learning Causal Relations

Unsupervised Co-Segmentation for Athlete Movements and Live Commentaries Using Crossmodal Temporal Proximity

Conditional-UNet: A Condition-Aware Deep Model for Coherent Human Activity Recognition from Wearables

Weight Estimation from an RGB-D Camera in Top-View Configuration

Electroencephalography Signal Processing Based on Textural Features for Monitoring the Driver’s State by a Brain-Computer Interface

Audio-Video Detection of the Active Speaker in Meetings

Developing Motion Code Embedding for Action Recognition in Videos

Single View Learning in Action Recognition

ESResNet: Environmental Sound Classification Based on Visual Domain Models

Late Fusion of Bayesian and Convolutional Models for Action Recognition

Spatial Bias in Vision-Based Voice Activity Detection

RWF-2000: An Open Large Scale Video Database for Violence Detection

Fall Detection by Human Pose Estimation and Kinematic Theory

Hierarchical Multimodal Attention for Deep Video Summarization

Single-Modal Incremental Terrain Clustering from Self-Supervised Audio-Visual Feature Learning

Flow-Guided Spatial Attention Tracking for Egocentric Activity Recognition

DenseRecognition of Spoken Languages

Location Prediction in Real Homes of Older Adults based on K-Means in Low-Resolution Depth Videos

One-Shot Representational Learning for Joint Biometric and Device Authentication

IPN Hand: A Video Dataset and Benchmark for Real-Time Continuous Hand Gesture Recognition

Exploring Spatial-Temporal Representations for fNIRS-based Intimacy Detection via an Attention-enhanced Cascade Convolutional Recurrent Neural Network

Ballroom Dance Recognition from Audio Recordings

Attention-Driven Body Pose Encoding for Human Activity Recognition

Visual Oriented Encoder: Integrating Multimodal and Multi-Scale Contexts for Video Captioning

Person Recognition with HGR Maximal Correlation on Multimodal Data

AttendAffectNet: Self-Attention Based Networks for Predicting Affective Responses from Movies

Better Prior Knowledge Improves Human-Pose-Based Extrinsic Camera Calibration

Subspace Clustering for Action Recognition with Covariance Representations and Temporal Pruning

Improving Mix-And-Separate Training in Audio-Visual Sound Source Separation with an Object Prior

Mutual Alignment between Audiovisual Features for End-To-End Audiovisual Speech Recognition

End-To-End Triplet Loss Based Emotion Embedding System for Speech Emotion Recognition

Which are the factors affecting the performance of audio surveillance systems?

Temporal Binary Representation for Event-Based Action Recognition