ICPR2020 Paper Browser

Paper downloads are intended for registered attendees only and are subject to the IEEE Copyright Policy. Any other use is strictly forbidden.

Temporally Coherent Embeddings for Self-Supervised Video Representation Learning

Joshua Knights, Ben Harwood, Daniel Ward, Anthony Vanderkop, Olivia Mackenzie-Ross, Peyman Moghadam

Auto-TLDR; Temporally Coherent Embeddings for Self-supervised Video Representation Learning

This paper presents TCE: Temporally Coherent Embeddings for self-supervised video representation learning. The proposed method exploits the inherent structure of unlabeled video data to explicitly enforce temporal coherency in the embedding space, rather than learning it indirectly through ranking or predictive proxy tasks. In the same way that high-level visual information in the world changes smoothly, we believe that the learned representations of nearby frames will benefit from demonstrating similar properties. Under this assumption, we train our TCE model to encode videos such that adjacent frames lie close to each other in the embedding space and different videos are separated from one another. Using TCE, we learn robust representations from large quantities of unlabeled video data. We thoroughly analyse and evaluate our self-supervised TCE models on the downstream task of video action recognition using multiple challenging benchmarks (Kinetics400, UCF101, HMDB51). With a simple but effective 2D-CNN backbone and only RGB stream inputs, TCE pre-trained representations outperform all previous self-supervised 2D-CNN and 3D-CNN models trained on UCF101. The code and pre-trained models for this paper can be downloaded at: https://github.com/csiro-robotics/TCE
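
The idea stated in the abstract, that adjacent frames should sit close together in the embedding space while frames from other videos are pushed away, can be illustrated with a small toy example. The sketch below is not the authors' loss or code (see the linked repository for that). It assumes a generic InfoNCE-style contrastive objective over cosine similarity, and the function names, the `temperature` value, and the 2-D toy embeddings are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def temporal_coherency_loss(anchor, positive, negatives, temperature=0.1):
    """Illustrative InfoNCE-style loss (an assumption, not the TCE objective).

    anchor:    embedding of a frame at time t
    positive:  embedding of an adjacent frame from the same video
    negatives: embeddings of frames taken from other videos
    The loss is small when the anchor is similar to the positive and
    dissimilar to every negative.
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy 2-D embeddings (hypothetical values).
frame_t = [1.0, 0.0]
frame_t_plus_1 = [0.9, 0.1]                # adjacent frame, nearly the same direction
other_videos = [[0.0, 1.0], [-1.0, 0.2]]   # frames from different videos

# Temporally coherent layout: the adjacent frame is the nearest neighbour.
coherent = temporal_coherency_loss(frame_t, frame_t_plus_1, other_videos)

# Incoherent layout: the "adjacent" frame is far away and a frame from
# another video sits next to the anchor instead.
incoherent = temporal_coherency_loss(
    frame_t, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.2]]
)

print(coherent < incoherent)
```

In a real training setup the embeddings would come from a CNN backbone and the loss would be minimised by gradient descent. Here the two calls only show that a temporally coherent layout scores a lower loss than an incoherent one.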

Similar papers

Audio-Visual Predictive Coding for Self-Supervised Visual Representation Learning

Mani Kumar Tellamekala, Michel Valstar, Michael Pound, Timo Giesbrecht

Auto-TLDR; AV-PPC: A Multi-task Learning Framework for Learning Semantic Visual Features from Unlabeled Video Data

Self-Supervised Joint Encoding of Motion and Appearance for First Person Action Recognition

RMS-Net: Regression and Masking for Soccer Event Spotting

Not 3D Re-ID: Simple Single Stream 2D Convolution for Robust Video Re-Identification

The Color Out of Space: Learning Self-Supervised Representations for Earth Observation Imagery

Learnable Higher-Order Representation for Action Recognition

Self-Supervised Learning of Dynamic Representations for Static Images

Multi-Level Deep Learning Vehicle Re-Identification Using Ranked-Based Loss Functions

Attention-Based Deep Metric Learning for Near-Duplicate Video Retrieval

SSDL: Self-Supervised Domain Learning for Improved Face Recognition

What and How? Jointly Forecasting Human Action and Pose

Single View Learning in Action Recognition

Feature-Supervised Action Modality Transfer

Nonlinear Ranking Loss on Riemannian Potato Embedding

MFI: Multi-Range Feature Interchange for Video Action Recognition

Developing Motion Code Embedding for Action Recognition in Videos

A Grid-Based Representation for Human Action Recognition

Learning Embeddings for Image Clustering: An Empirical Study of Triplet Loss Approaches

Multi-Modal Deep Clustering: Unsupervised Partitioning of Images

SL-DML: Signal Level Deep Metric Learning for Multimodal One-Shot Action Recognition

TinyVIRAT: Low-Resolution Video Action Recognition

Generalized Local Attention Pooling for Deep Metric Learning

Learning Group Activities from Skeletons without Individual Action Labels

Graph-Based Interpolation of Feature Vectors for Accurate Few-Shot Classification

Image Representation Learning by Transformation Regression

Self-Supervised Learning for Astronomical Image Classification

Towards Practical Compressed Video Action Recognition: A Temporal Enhanced Multi-Stream Network

Unsupervised Co-Segmentation for Athlete Movements and Live Commentaries Using Crossmodal Temporal Proximity

Gabriella: An Online System for Real-Time Activity Detection in Untrimmed Security Videos

Building Computationally Efficient and Well-Generalizing Person Re-Identification Models with Metric Learning

Dual-MTGAN: Stochastic and Deterministic Motion Transfer for Image-To-Video Synthesis

Supervised Domain Adaptation Using Graph Embedding

Local Clustering with Mean Teacher for Semi-Supervised Learning

Mutual Information Based Method for Unsupervised Disentanglement of Video Representation

Reducing the Variance of Variational Estimates of Mutual Information by Limiting the Critic's Hypothesis Space to RKHS

Temporal Binary Representation for Event-Based Action Recognition

A Self-Supervised GAN for Unsupervised Few-Shot Object Recognition

Generative Latent Implicit Conditional Optimization When Learning from Small Sample

ActionSpotter: Deep Reinforcement Learning Framework for Temporal Action Spotting in Videos

Video Representation Fusion Network For Multi-Label Movie Genre Classification

Modeling Long-Term Interactions to Enhance Action Recognition

3D Attention Mechanism for Fine-Grained Classification of Table Tennis Strokes Using a Twin Spatio-Temporal Convolutional Neural Networks

Video Semantic Segmentation Using Deep Multi-View Representation Learning

Future Urban Scenes Generation through Vehicles Synthesis

Towards Robust Learning with Different Label Noise Distributions

Shape Consistent 2D Keypoint Estimation under Domain Shift

AttendAffectNet: Self-Attention Based Networks for Predicting Affective Responses from Movies

Precise Temporal Action Localization with Quantified Temporal Structure of Actions