ICPR2020 Paper Browser

Paper download is intended for registered attendees only, and is subject to the IEEE Copyright Policy. Any other use is strictly forbidden.

TSMSAN: A Three-Stream Multi-Scale Attentive Network for Video Saliency Detection

Jingwen Yang, Guanwen Zhang, Wei Zhou

Auto-TLDR; Three-stream Multi-scale attentive network for video saliency detection in dynamic scenes

Video saliency detection is an important low-level task that has been used in a wide range of high-level applications. In this paper, we propose a three-stream multi-scale attentive network (TSMSAN) for saliency detection in dynamic scenes. TSMSAN integrates motion vector representations, static saliency maps, and RGB information at multiple scales into one framework built on a Fully Convolutional Network (FCN) and a spatial attention mechanism. On the one hand, the motion features, spatial features, and scene features provide abundant information for video saliency detection. On the other hand, the spatial attention mechanism combines multi-scale features to focus on key information in dynamic scenes. In this manner, TSMSAN encodes the spatiotemporal features of a dynamic scene comprehensively. We evaluate the proposed approach on two public dynamic saliency data sets. The experimental results demonstrate that TSMSAN achieves state-of-the-art performance and excellent generalization ability. Furthermore, TSMSAN provides more convincing video saliency information, in line with human perception.
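
The abstract does not give implementation details, so the following is only a minimal sketch of the general idea of fusing three streams with per-pixel spatial attention. It is not the authors' network. The function name `spatial_attention_fuse`, the toy 2 x 2 maps, and the use of the raw map values as attention scores are all assumptions made for illustration; in a trained network the scores would typically come from learned convolutional layers.

```python
import math

def spatial_attention_fuse(streams):
    """Fuse equally sized 2-D maps with per-pixel softmax weights.

    streams: list of H x W nested lists, one per stream.
    Returns (fused, weights), where weights[s][y][x] sums to 1 over s.
    Assumption: the raw map values double as attention scores.
    """
    height, width = len(streams[0]), len(streams[0][0])
    fused = [[0.0] * width for _ in range(height)]
    weights = [[[0.0] * width for _ in range(height)] for _ in streams]
    for y in range(height):
        for x in range(width):
            values = [s[y][x] for s in streams]
            peak = max(values)  # subtract the max for numerical stability
            exps = [math.exp(v - peak) for v in values]
            total = sum(exps)
            for i, (v, e) in enumerate(zip(values, exps)):
                weights[i][y][x] = e / total
                fused[y][x] += (e / total) * v
    return fused, weights

# Toy 2 x 2 maps standing in for the three streams (values are made up).
motion = [[0.0, 2.0], [1.0, 0.0]]
static_saliency = [[0.0, 0.0], [1.0, 3.0]]
rgb = [[0.0, 0.0], [1.0, 0.0]]
fused, weights = spatial_attention_fuse([motion, static_saliency, rgb])
```

At each pixel, the stream with the strongest response receives the largest weight, so the fused map follows the motion stream where motion dominates and the static saliency stream where it dominates. Where all streams agree, the weights are equal.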

Similar papers

ACCLVOS: Atrous Convolution with Spatial-Temporal ConvLSTM for Video Object Segmentation

Muzhou Xu, Shan Zong, Chunping Liu, Shengrong Gong, Zhaohui Wang, Yu Xia

Auto-TLDR; Semi-supervised Video Object Segmentation using U-shape Convolution and ConvLSTM

PHNet: Parasite-Host Network for Video Crowd Counting

Spatial-Related and Scale-Aware Network for Crowd Counting

Flow-Guided Spatial Attention Tracking for Egocentric Activity Recognition

RWF-2000: An Open Large Scale Video Database for Violence Detection

Saliency Prediction on Omnidirectional Images with Brain-Like Shallow Neural Network

Video Semantic Segmentation Using Deep Multi-View Representation Learning

FastSal: A Computationally Efficient Network for Visual Saliency Prediction

MFI: Multi-Range Feature Interchange for Video Action Recognition

SAT-Net: Self-Attention and Temporal Fusion for Facial Action Unit Detection

Multi-Scale Residual Pyramid Attention Network for Monocular Depth Estimation

Wavelet Attention Embedding Networks for Video Super-Resolution

Coarse to Fine: Progressive and Multi-Task Learning for Salient Object Detection

Towards Practical Compressed Video Action Recognition: A Temporal Enhanced Multi-Stream Network

Self-Supervised Joint Encoding of Motion and Appearance for First Person Action Recognition

A Grid-Based Representation for Human Action Recognition

Coarse-To-Fine Foreground Segmentation Based on Co-Occurrence Pixel-Block and Spatio-Temporal Attention Model

Enhanced Feature Pyramid Network for Semantic Segmentation

Classifying Eye-Tracking Data Using Saliency Maps

Efficient-Receptive Field Block with Group Spatial Attention Mechanism for Object Detection

Encoder-Decoder Based Convolutional Neural Networks with Multi-Scale-Aware Modules for Crowd Counting

Context-Aware Residual Module for Image Classification

Two-Stream Temporal Convolutional Network for Dynamic Facial Attractiveness Prediction

Deeply-Fused Attentive Network for Stereo Matching

A Duplex Spatiotemporal Filtering Network for Video-Based Person Re-Identification

Video Object Detection Using Object's Motion Context and Spatio-Temporal Feature Aggregation

Attentive Hybrid Feature Based a Two-Step Fusion for Facial Expression Recognition

What and How? Jointly Forecasting Human Action and Pose

Selective Kernel and Motion-Emphasized Loss Based Attention-Guided Network for HDR Imaging of Dynamic Scenes

DARN: Deep Attentive Refinement Network for Liver Tumor Segmentation from 3D CT Volume

VGG-Embedded Adaptive Layer-Normalized Crowd Counting Net with Scale-Shuffling Modules

A Lightweight Network to Learn Optical Flow from Event Data

Visual Oriented Encoder: Integrating Multimodal and Multi-Scale Contexts for Video Captioning

Dynamic Guided Network for Monocular Depth Estimation

Translating Adult's Focus of Attention to Elderly's

Cascade Saliency Attention Network for Object Detection in Remote Sensing Images

Early Wildfire Smoke Detection in Videos

Two-Level Attention-Based Fusion Learning for RGB-D Face Recognition

Feature Pyramid Hierarchies for Multi-Scale Temporal Action Detection

An Improved Bilinear Pooling Method for Image-Based Action Recognition

Human Segmentation with Dynamic LiDAR Data

Residual Learning of Video Frame Interpolation Using Convolutional LSTM

Attention-Driven Body Pose Encoding for Human Activity Recognition

Do Not Treat Boundaries and Regions Differently: An Example on Heart Left Atrial Segmentation

Activity Recognition Using First-Person-View Cameras Based on Sparse Optical Flows

A Multi-Task Neural Network for Action Recognition with 3D Key-Points

Video Reconstruction by Spatio-Temporal Fusion of Blurred-Coded Image Pair

MixTConv: Mixed Temporal Convolutional Kernels for Efficient Action Recognition