Michel Valstar

Papers from this author

Audio-Visual Predictive Coding for Self-Supervised Visual Representation Learning

Mani Kumar Tellamekala, Michel Valstar, Michael Pound, Timo Giesbrecht

Responsive image

Auto-TLDR; AV-PPC: A Multi-task Learning Framework for Learning Semantic Visual Features from Unlabeled Video Data

Slides Poster Similar

Self-supervised learning has emerged as a candidate approach to learn semantic visual features from unlabeled video data. In self-supervised learning, intrinsic correspondences between data points are used to define a proxy task that forces the model to learn semantic representations. Most existing proxy tasks applied to video data exploit only either intra-modal (e.g. temporal) or cross-modal (e.g. audio-visual) correspondences separately. In theory, jointly learning both these correspondences may result in richer visual features; but, as we show in this work, doing so is non-trivial in practice. To address this problem, we introduce `Audio-Visual Permutative Predictive Coding' (AV-PPC), a multi-task learning framework designed to fully leverage the temporal and cross-modal correspondences as natural supervision signals. In AV-PPC, the model is trained to simultaneously learn multiple intra- and cross-modal predictive coding sub-tasks. By using visual speech recognition (lip-reading) as the downstream evaluation task, we show that our proposed proxy task can learn higher quality visual features than existing proxy tasks. We also show that AV-PPC visual features are highly data-efficient. Without further finetuning, AV-PPC visual encoder achieves 80.30% spoken word classification rate on the LRW dataset, performing on par with directly or fully supervised visual encoders learned from large amounts of labeled data.

Self-Supervised Learning of Dynamic Representations for Static Images

Siyang Song, Enrique Sanchez, Linlin Shen, Michel Valstar

Responsive image

Auto-TLDR; Facial Action Unit Intensity Estimation and Affect Estimation from Still Images with Multiple Temporal Scale

Slides Poster Similar

Facial actions are spatio-temporal signals by nature, and therefore their modeling is crucially dependent on the availability of temporal information. In this paper, we focus on inferring such temporal dynamics of facial actions when no explicit temporal information is available, i.e. from still images. We present a novel approach to capture multiple scales of such temporal dynamics, with an application to facial Action Unit (AU) intensity estimation and dimensional affect estimation. In particular, 1) we propose a framework that infers a dynamic representation (DR) from a still image, which captures the bi-directional flow of time within a short time-window centered at the input image; 2) we show that we can train our method without the need of explicitly generating target representations, allowing the network to represent dynamics more broadly; and 3) we propose to apply a multiple temporal scale approach that infers DRs for different window lengths (MDR) from a still image. We empirically validate the value of our approach on the task of frame ranking, and show how our proposed MDR attains state of the art results on BP4D for AU intensity estimation and on SEMAINE for dimensional affect estimation, using only still images at test time.