ICPR2020 Paper Browser

Paper download is intended for registered attendees only and is subject to the IEEE Copyright Policy. Any other use is strongly forbidden.

Feature Pyramid Hierarchies for Multi-Scale Temporal Action Detection

Jiayu He, Guohui Li, Jun Lei

Auto-TLDR; Temporal Action Detection using Pyramid Hierarchies and Multi-scale Feature Maps


Temporal action detection is a challenging but promising task in video content analysis, and it is in great demand in the field of public safety. The main difficulty of the task is precisely localizing activities in the video, especially short-duration activities, and most existing methods cannot achieve satisfactory detection results. Our method addresses a key point for improving detection accuracy: using multi-scale feature maps for regression and classification. In this paper, we introduce a novel network based on the classification-following-proposal framework. In our network, a 3D feature pyramid hierarchy is built to enhance the ability to detect short-duration activities. The input RGB/Flow frames are first encoded by the 3D feature pyramid hierarchy, and this subnet produces multi-level feature maps. Then a temporal proposal subnet uses these features to pick out proposals that might contain activity segments. Finally, a pyramid region of interest (RoI) pooling pipeline and two fully connected layers reuse the multi-level feature maps to refine the temporal boundaries of the proposals and classify them. We use a late feature fusion scheme to combine RGB and Flow information. The network is trained end-to-end, and we evaluate it on the THUMOS'14 dataset. Our network achieves a good result among typical methods. A further ablation test demonstrates that the pyramid hierarchy is effective in improving the detection of short-duration activity segments.
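
The idea of pooling one temporal proposal from several feature levels can be illustrated with a minimal sketch. This is not the authors' network: it assumes a 1D list of per-frame scores in place of 3D convolutional features, builds the pyramid by stride-2 max pooling, and uses the hypothetical helper names `build_temporal_pyramid` and `roi_max_pool`.

```python
from typing import List


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def build_temporal_pyramid(features: List[float], levels: int) -> List[List[float]]:
    """Return `levels` maps; each level halves the temporal resolution
    of the previous one with stride-2 max pooling (an assumption made
    for illustration, not the paper's 3D convolutional hierarchy)."""
    pyramid = [list(features)]
    for _ in range(levels - 1):
        prev = pyramid[-1]
        pyramid.append([max(prev[i:i + 2]) for i in range(0, len(prev), 2)])
    return pyramid


def roi_max_pool(level_map: List[float], start: int, end: int,
                 stride: int, bins: int) -> List[float]:
    """Max-pool the frame interval [start, end) into `bins` values on a
    feature map whose cells each cover `stride` input frames."""
    # Project the proposal from frame units onto this level's grid.
    lo = min(start // stride, len(level_map) - 1)
    hi = min(max(lo + 1, ceil_div(end, stride)), len(level_map))
    n = hi - lo
    pooled = []
    for b in range(bins):
        b_lo = lo + (b * n) // bins
        b_hi = lo + ceil_div((b + 1) * n, bins)
        pooled.append(max(level_map[b_lo:b_hi]))
    return pooled


def pyramid_roi_pool(pyramid: List[List[float]], start: int, end: int,
                     bins: int) -> List[List[float]]:
    """Pool the same proposal from every pyramid level (level k has stride 2**k)."""
    return [roi_max_pool(level, start, end, 2 ** k, bins)
            for k, level in enumerate(pyramid)]
```

For example, with `features = [0, 1, 2, 3, 4, 5, 6, 7]` and three levels, pooling the proposal covering frames 2 to 6 into two bins yields a fixed-length descriptor from each level, regardless of how coarse that level is. A short proposal still spans several cells on the fine level, which is the intuition behind using multiple scales for short activities.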

Similar papers

You Ought to Look Around: Precise, Large Span Action Detection

Ge Pan, Zhang Han, Fan Yu, Yonghong Song, Yuanlin Zhang, Han Yuan

Auto-TLDR; YOLA: Local Feature Extraction for Action Localization with Variable receptive field


For the action localization task, pre-defined action anchors are the cornerstone of mainstream techniques. State-of-the-art models mostly rely on a dense segmenting scheme, where anchors are sampled uniformly over the temporal domain with a predefined set of scales. However, this is not sufficient because action duration varies greatly. Therefore, it is necessary for the anchors or proposals to have a variable receptive field. In this paper, we propose a method called YOLA (You Ought to Look Around), which includes three parts: 1) a robust backbone, SPN-I3D, for extracting spatio-temporal features; in this part, we employ a stronger backbone, I3D with SPN (Segment Pyramid Network), instead of C3D to obtain multi-scale features; 2) a simple but useful feature fusion module named LFE (Local Feature Extraction); compared with the fully connected layer and global average pooling, our LFE model is more advantageous for the network to fit and fuse features; 3) a new feature segment aligning method called TPGC (Two Pathway Graph Convolution), which allows one proposal to leverage semantic features of adjacent proposals to update its content and ensures that the proposals have a variable receptive field. YOLA adds only a small overhead to the baseline network and is easy to train in an end-to-end manner, running at a speed of 1097 fps. YOLA achieves a mAP of 58.3%, outperforming all existing models, including both RGB-based and two-stream models, on THUMOS'14, and achieves competitive results on ActivityNet 1.3.
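
The dense segmenting scheme mentioned in the abstract can be sketched as follows. This is only an illustration of uniformly sampled anchors with a fixed set of scales, not YOLA itself; the function names, the stride, and the scale values are assumptions chosen for the example.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[float, float]  # (start, end) on the temporal axis


def dense_anchors(num_positions: int, stride: float,
                  scales: Sequence[float]) -> List[Segment]:
    """Place one anchor per scale at each uniformly spaced center."""
    anchors = []
    for i in range(num_positions):
        center = (i + 0.5) * stride
        for scale in scales:
            anchors.append((center - scale / 2, center + scale / 2))
    return anchors


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two temporal segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def best_match(action: Segment, anchors: Sequence[Segment]) -> float:
    """Highest temporal IoU between one ground-truth action and any anchor."""
    return max(temporal_iou(action, anchor) for anchor in anchors)
```

With anchors of length 4 and 8 only, an action whose duration matches one of the scales and sits on the grid can be covered exactly, while a much longer or much shorter action can only be partially matched. That mismatch is the limitation of fixed scales that the abstract points to.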

Gabriella: An Online System for Real-Time Activity Detection in Untrimmed Security Videos

Mamshad Nayeem Rizve, Ugur Demir, Praveen Tirupattur, Aayush Jung Rana, Kevin Duarte, Ishan Rajendrakumar Dave, Yogesh Rawat, Mubarak Shah

Auto-TLDR; Gabriella: A Real-Time Online System for Activity Detection in Surveillance Videos


Activity detection in surveillance videos is a difficult problem due to multiple factors such as large field of view, presence of multiple activities, varying scales and viewpoints, and its untrimmed nature. The existing research in activity detection is mainly focused on datasets such as UCF-101, JHMDB, THUMOS, and AVA, which only partially address these issues. The requirement of processing the surveillance videos in real time makes this even more challenging. In this work, we propose Gabriella, a real-time online system to perform activity detection on untrimmed surveillance videos. The proposed method consists of three stages: tubelet extraction, activity classification, and online tubelet merging. For tubelet extraction, we propose a localization network which takes a video clip as input and spatio-temporally detects potential foreground regions at multiple scales to generate action tubelets. We propose a novel Patch-Dice loss to handle large variations in actor size. Our online processing of videos at a clip level drastically reduces the computation time in detecting activities. The detected tubelets are assigned activity class scores by the classification network and merged together using our proposed Tubelet-Merge Action-Split (TMAS) algorithm to form the final action detections. The TMAS algorithm efficiently connects the tubelets in an online fashion to generate action detections which are robust against varying-length activities. We perform our experiments on the VIRAT and MEVA (Multiview Extended Video with Activities) datasets and demonstrate the effectiveness of the proposed approach in terms of speed (~100 fps) and performance, with state-of-the-art results. The code and models will be made publicly available.

ActionSpotter: Deep Reinforcement Learning Framework for Temporal Action Spotting in Videos

Guillaume Vaudaux-Ruth, Adrien Chan-Hon-Tong, Catherine Achard

Auto-TLDR; ActionSpotter: A Reinforcement Learning Algorithm for Action Spotting in Video

Precise Temporal Action Localization with Quantified Temporal Structure of Actions

RMS-Net: Regression and Masking for Soccer Event Spotting

RWF-2000: An Open Large Scale Video Database for Violence Detection

Scene Text Detection with Selected Anchors

TinyVIRAT: Low-Resolution Video Action Recognition

Bidirectional Matrix Feature Pyramid Network for Object Detection

MFI: Multi-Range Feature Interchange for Video Action Recognition

SFPN: Semantic Feature Pyramid Network for Object Detection

Multi-Scale 2D Representation Learning for Weakly-Supervised Moment Retrieval

Learning a Dynamic High-Resolution Network for Multi-Scale Pedestrian Detection

What and How? Jointly Forecasting Human Action and Pose

Learnable Higher-Order Representation for Action Recognition

Self-Supervised Joint Encoding of Motion and Appearance for First Person Action Recognition

Construction Worker Hardhat-Wearing Detection Based on an Improved BiFPN

Triplet-Path Dilated Network for Detection and Segmentation of General Pathological Images

Late Fusion of Bayesian and Convolutional Models for Action Recognition

Object Detection Model Based on Scene-Level Region Proposal Self-Attention

Feature-Supervised Action Modality Transfer

CenterRepp: Predict Central Representative Point Set's Distribution for Detection

Cascade Saliency Attention Network for Object Detection in Remote Sensing Images

PRF-Ped: Multi-Scale Pedestrian Detector with Prior-Based Receptive Field

Towards Practical Compressed Video Action Recognition: A Temporal Enhanced Multi-Stream Network

Video Object Detection Using Object's Motion Context and Spatio-Temporal Feature Aggregation

Revisiting Sequence-To-Sequence Video Object Segmentation with Multi-Task Loss and Skip-Memory

Developing Motion Code Embedding for Action Recognition in Videos

Modeling Long-Term Interactions to Enhance Action Recognition

A Novel Region of Interest Extraction Layer for Instance Segmentation

Hierarchical Head Design for Object Detectors

Correlation-Based ConvNet for Small Object Detection in Videos

CAggNet: Crossing Aggregation Network for Medical Image Segmentation

Pose-Based Body Language Recognition for Emotion and Psychiatric Symptom Interpretation

Attention-Oriented Action Recognition for Real-Time Human-Robot Interaction

A Multi-Task Neural Network for Action Recognition with 3D Key-Points

Activity Recognition Using First-Person-View Cameras Based on Sparse Optical Flows

Flow-Guided Spatial Attention Tracking for Egocentric Activity Recognition

Mutual-Supervised Feature Modulation Network for Occluded Pedestrian Detection

A Grid-Based Representation for Human Action Recognition

Video Representation Fusion Network For Multi-Label Movie Genre Classification

Detecting Objects with High Object Region Percentage

MixTConv: Mixed Temporal Convolutional Kernels for Efficient Action Recognition

TSMSAN: A Three-Stream Multi-Scale Attentive Network for Video Saliency Detection

Activity and Relationship Modeling Driven Weakly Supervised Object Detection

Attentive Visual Semantic Specialized Network for Video Captioning

Extracting Action Hierarchies from Action Labels and their Use in Deep Action Recognition

Hierarchical Multimodal Attention for Deep Video Summarization