Hierarchical Multimodal Attention for Deep Video Summarization

Melissa Sanabria, Frederic Precioso, Thomas Menguy

Auto-TLDR; Automatic Summarization of Professional Soccer Matches Using Event-Stream Data and Multi-Instance Learning

The way people consume sports on TV has evolved drastically in recent years, particularly under the combined effects of the legalization of sports betting and the huge growth of sports analytics. Several companies now send observers to the stadiums to collect live data on all the events happening on the field during the match. These data provide a very detailed description of every action occurring during the match, feeding the coaches and staff, the fans, the viewers, and the gamblers. Exploiting all these data, sports broadcasters want to generate extra content such as match highlights, match summaries, and player and team analytics to appeal to subscribers. This paper explores the problem of summarizing professional soccer matches as automatically as possible using both the aforementioned event-stream data collected from the field and the content broadcast on TV. We have designed an architecture that introduces (1) a Multiple Instance Learning method that takes into account the sequential dependency among events and (2) a hierarchical multimodal attention layer that grasps the importance of each event in an action. We evaluate our approach on matches from two professional European soccer leagues, showing its capability to identify the best actions for automatic summarization by comparing against real summaries made by human operators.
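
As a rough illustration of the hierarchical attention idea described above, the sketch below pools a sequence of multimodal event embeddings into a single action representation via learned importance weights. The module name, dimensions and pooling form are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class EventAttentionPooling(nn.Module):
    """Scores each event embedding in an action and pools them into a single
    action representation (illustrative sketch, not the paper's exact layer)."""
    def __init__(self, event_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(event_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, events: torch.Tensor) -> torch.Tensor:
        # events: (batch, num_events, event_dim) -- one action = a sequence of events
        weights = torch.softmax(self.score(events), dim=1)   # (batch, num_events, 1)
        return (weights * events).sum(dim=1)                 # (batch, event_dim)

# Toy usage: 8 actions, each made of 12 multimodal event embeddings of size 256.
pool = EventAttentionPooling(event_dim=256)
action_repr = pool(torch.randn(8, 12, 256))   # -> (8, 256)
```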

Similar papers

RMS-Net: Regression and Masking for Soccer Event Spotting

Matteo Tomei, Lorenzo Baraldi, Simone Calderara, Simone Bronzin, Rita Cucchiara

Auto-TLDR; An Action Spotting Network for Soccer Videos

The recently proposed action spotting task consists of finding the exact timestamp at which an event occurs. This task suits soccer videos particularly well, where events correspond to salient actions strictly defined by soccer rules (a goal occurs when the ball crosses the goal line). In this paper, we devise a lightweight and modular network for action spotting, which can simultaneously predict the event label and its temporal offset using the same underlying features. We enrich our model with two training strategies: the first for data balancing and uniform sampling, the second for masking ambiguous frames and keeping the most discriminative visual cues. When tested on the SoccerNet dataset using standard features, our full proposal exceeds the current state of the art by 3 Average-mAP points. Additionally, it reaches a gain of more than 10 Average-mAP points on the test set when fine-tuned in combination with a strong 2D backbone.
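
The abstract describes a single backbone whose features feed both an event classifier and a temporal-offset regressor. The sketch below shows what such a shared-trunk, two-head design can look like in PyTorch; layer sizes, the class count and head shapes are illustrative assumptions, not the RMS-Net architecture.

```python
import torch
import torch.nn as nn

class SpottingHead(nn.Module):
    """Shared clip features feed two heads: event class logits and a temporal
    offset regressed from the same representation (illustrative sketch)."""
    def __init__(self, feat_dim: int = 512, num_classes: int = 17):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        self.cls_head = nn.Linear(256, num_classes)   # event label
        self.reg_head = nn.Linear(256, 1)             # offset within the clip

    def forward(self, clip_feats: torch.Tensor):
        h = self.trunk(clip_feats)
        return self.cls_head(h), self.reg_head(h).squeeze(-1)

logits, offsets = SpottingHead()(torch.randn(4, 512))   # 4 clips of pooled features
```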

Visual Oriented Encoder: Integrating Multimodal and Multi-Scale Contexts for Video Captioning

Bang Yang, Yuexian Zou

Auto-TLDR; Visual Oriented Encoder for Video Captioning

Video captioning is a challenging task which aims at automatically generating a natural language description of a given video. Recent research has shown that exploiting the intrinsic multi-modalities of videos significantly promotes captioning performance. However, how to integrate multiple modalities to generate effective semantic representations for video captioning is still an open issue. Some researchers proposed to learn multimodal features in parallel during the encoding stage. The downside of these methods lies in neglecting the interaction among modalities and their rich contextual information. In this study, inspired by the fact that visual contents are generally more important for comprehending videos, we propose a novel Visual Oriented Encoder (VOE) to integrate multimodal features in an interactive manner. Specifically, VOE is designed as a hierarchical structure, where the bottom layers are utilized to extract multi-scale contexts from auxiliary modalities while the top layer is exploited to generate joint representations by considering both visual and contextual information. Following the encoder-decoder framework, we systematically develop a VOE-LSTM model and evaluate it on two mainstream benchmarks: MSVD and MSR-VTT. Experimental results show that the proposed VOE surpasses conventional encoders and our VOE-LSTM model achieves competitive results compared with state-of-the-art approaches.

ActionSpotter: Deep Reinforcement Learning Framework for Temporal Action Spotting in Videos

Guillaume Vaudaux-Ruth, Adrien Chan-Hon-Tong, Catherine Achard

Auto-TLDR; ActionSpotter: A Reinforcement Learning Algorithm for Action Spotting in Video

Action spotting has recently been proposed as an alternative to action detection and key frame extraction. However, the current state-of-the-art method for action spotting requires an expensive ground truth composed of the search sequences employed by human annotators spotting actions - a critical limitation. In this article, we propose to use a reinforcement learning algorithm to perform efficient action spotting using only the temporal segments from the action detection annotations, thus opening an interesting solution for video understanding. Experiments performed on the THUMOS14 and ActivityNet datasets show that the proposed method, named ActionSpotter, leads to good results and outperforms state-of-the-art detection outputs redrawn for this application. In particular, the spotting mean Average Precision on THUMOS14 is significantly improved from 59.7% to 65.6% while skipping 23% of the video.

3D Attention Mechanism for Fine-Grained Classification of Table Tennis Strokes Using a Twin Spatio-Temporal Convolutional Neural Networks

Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Péteri, Julien Morlier

Auto-TLDR; Attentional Blocks for Action Recognition in Table Tennis Strokes

The paper addresses the problem of recognizing actions in videos with low inter-class variability, such as table tennis strokes. Two-stream, "twin" convolutional neural networks are used with 3D convolutions on both RGB data and optical flow. Actions are recognized by classifying temporal windows. We introduce 3D attention modules and examine their impact on classification efficiency. In the context of studying athletes' performance, a corpus of the particular actions of table tennis strokes is considered. The use of attention blocks in the network speeds up the training step and improves the classification scores by up to 5% with our twin model. We visualize the impact on the obtained features and notice a correlation between attention and the players' movements and positions. A score comparison between a state-of-the-art action classification method and the proposed approach with attention blocks is performed on the corpus. The proposed model with attention blocks outperforms the previous model without them as well as our baseline.

AttendAffectNet: Self-Attention Based Networks for Predicting Affective Responses from Movies

Thi Phuong Thao Ha, Bt Balamurali, Herremans Dorien, Roig Gemma

Auto-TLDR; AttendAffectNet: A Self-Attention Based Network for Emotion Prediction from Movies

In this work, we propose different variants of a self-attention based network for emotion prediction from movies, which we call AttendAffectNet. We take both audio and video into account and incorporate the relations among multiple modalities by applying the self-attention mechanism in a novel manner to the extracted features for emotion prediction. We compare it to the typical temporal integration of self-attention based models, which in our case captures the relations among temporal representations of the movie while considering the sequential dependencies of emotion responses. We demonstrate the effectiveness of our proposed architectures on the extended COGNIMUSE dataset [1], [2] and the MediaEval 2016 Emotional Impact of Movies Task [3], which consist of movies with emotion annotations. Our results show that applying the self-attention mechanism to the different audio-visual features, rather than in the time domain, is more effective for emotion prediction. Our approach is also shown to outperform state-of-the-art models for emotion prediction.

Video Summarization with a Dual Attention Capsule Network

Hao Fu, Hongxing Wang, Jianyu Yang

Auto-TLDR; Dual Self-Attention Capsule Network for Video Summarization

In this paper, we address the problem of video summarization, which aims at selecting a subset of video frames as a summary to represent the original video contents compactly and completely. We propose a simple but effective supervised approach with a dual attention capsule network towards this end. Unlike existing LSTM based methods, it pays attention to short- and long-term dependencies among video frames through an elaborate dual self-attention architecture, which can handle longer-term dependencies and admit parallel computing. To reconcile the outputs of dual self-attention, we rely on a two-stream capsule network to learn the underlying frame selection criteria. Experiments on real-world datasets show the advantages of the proposed approach compared with state-of-the-art methods.

DAG-Net: Double Attentive Graph Neural Network for Trajectory Forecasting

Alessio Monti, Alessia Bertugli, Simone Calderara, Rita Cucchiara

Auto-TLDR; Recurrent Generative Model for Multi-modal Human Motion Behaviour in Urban Environments

Understanding human motion behaviour is a critical task for several possible applications like self-driving cars or social robots, and in general for all those settings where an autonomous agent has to navigate inside a human-centric environment. This is non-trivial because human motion is inherently multi-modal: given a history of human motion paths, there are many plausible ways in which people could move in the future. Additionally, people's activities are often driven by goals, e.g. reaching particular locations or interacting with the environment. We address both of the aforementioned aspects by proposing a new recurrent generative model that considers both single agents' future goals and interactions between different agents. The model exploits a double attention-based graph neural network to collect information about the mutual influences among different agents and integrates it with data about agents' possible future objectives. Our proposal is general enough to be applied in different scenarios: the model achieves state-of-the-art results in both urban environments and sports applications.

Enriching Video Captions with Contextual Text

Philipp Rimle, Pelin Dogan, Markus Gross

Auto-TLDR; Contextualized Video Captioning Using Contextual Text

Understanding video content and generating captions with context is an important and challenging task. Unlike prior methods that typically attempt to generate generic video captions without context, our architecture contextualizes captioning by infusing extracted information from relevant text data. We propose an end-to-end sequence-to-sequence model which generates video captions based on visual input, and mines relevant knowledge such as names and locations from contextual text. In contrast to previous approaches, we do not preprocess the text further, and let the model directly learn to attend over it. Guided by the visual input, the model is able to copy words from the contextual text via a pointer-generator network, allowing it to produce more specific video captions. We show competitive performance on the News Video Dataset and, through ablation studies, validate the efficacy of contextual video captioning as well as individual design choices in our model architecture.

Unsupervised Co-Segmentation for Athlete Movements and Live Commentaries Using Crossmodal Temporal Proximity

Yasunori Ohishi, Yuki Tanaka, Kunio Kashino

Auto-TLDR; A guided attention scheme for audio-visual co-segmentation

Audio-visual co-segmentation is the task of extracting segments and regions corresponding to specific events from unlabelled audio and video signals. It is particularly important to accomplish it in an unsupervised way, since it is generally very difficult to manually label all the objects and events appearing in audio-visual signals for supervised learning. Here, we propose to take advantage of the temporal proximity of corresponding audio and video entities included in the signals. For this purpose, we apply a guided attention scheme to this task to efficiently detect and exploit temporal co-occurrences of audio and video information. Experiments using a real TV broadcast of Sumo wrestling, a sport event, with live commentaries show that our model can automatically extract specific athlete movements and their spoken descriptions in an unsupervised manner.

Global Feature Aggregation for Accident Anticipation

Mishal Fatima, Umar Karim Khan, Chong Min Kyung

Auto-TLDR; Feature Aggregation for Predicting Accidents in Video Sequences

Anticipating accidents ahead of time in autonomous and non-autonomous vehicles aids in accident avoidance. In order to recognize abnormal events such as traffic accidents in a video sequence, it is important that the network takes into account the interactions of objects in a given frame. We propose a novel Feature Aggregation (FA) block that refines each object's features by computing a weighted sum of the features of all objects in a frame. We use the FA block along with a Long Short Term Memory (LSTM) network to anticipate accidents in video sequences. We report mean Average Precision (mAP) and Average Time-to-Accident (ATTA) on the Street Accident (SA) dataset. Our proposed method achieves the highest score for risk anticipation, predicting accidents 0.32 s and 0.75 s earlier than the best results of Adaptive Loss and dynamic parameter prediction based methods, respectively.
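
The FA block is described as refining each object's features by a weighted sum over all object features in the frame. The sketch below implements that idea with a learned pairwise score followed by a softmax-weighted sum; the scoring function and dimensions are assumptions, not the paper's exact block.

```python
import torch
import torch.nn as nn

class FeatureAggregation(nn.Module):
    """Refines each object's feature as a weighted sum of all object features
    in the frame; weights come from a learned pairwise score (sketch only)."""
    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)

    def forward(self, objs: torch.Tensor) -> torch.Tensor:
        # objs: (batch, num_objects, dim)
        scores = self.query(objs) @ self.key(objs).transpose(1, 2)      # pairwise scores
        weights = torch.softmax(scores / objs.size(-1) ** 0.5, dim=-1)  # per-object weights
        return weights @ objs    # each object refined by all objects in the frame

refined = FeatureAggregation(128)(torch.randn(2, 10, 128))   # -> (2, 10, 128)
```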

Text Synopsis Generation for Egocentric Videos

Aidean Sharghi, Niels Lobo, Mubarak Shah

Auto-TLDR; Egocentric Video Summarization Using Multi-task Learning for End-to-End Learning

Mass utilization of body-worn cameras has led to a huge corpus of available egocentric video. Existing video summarization algorithms can accelerate browsing such videos by selecting (visually) interesting shots from them. Nonetheless, since the system user still has to watch the summary videos, browsing large video databases remains a challenge. Hence, in this work, we propose to generate a textual synopsis, consisting of a few sentences describing the most important events in a long egocentric video. Users can read the short text to gain insight about the video and, more importantly, efficiently search through the content of a large video database using text queries. Since egocentric videos are long and contain many activities and events, using video-to-text algorithms results in thousands of descriptions, many of which are incorrect. Therefore, we propose a multi-task learning scheme to simultaneously generate descriptions for video segments and summarize the resulting descriptions in an end-to-end fashion. We input a set of video shots and the network generates a text description for each shot. Next, a visual-language content matching unit trained with a weakly supervised objective identifies the correct descriptions. Finally, the last component of our network, called the purport network, evaluates the descriptions all together to select the ones containing crucial information. Out of the thousands of descriptions generated for the video, a few informative sentences are returned to the user. We validate our framework on the challenging UT Egocentric video dataset, where each video is between 3 and 5 hours long and is associated with over 3000 textual descriptions on average. The generated textual summaries, including only 5 percent (or less) of the generated descriptions, are compared to ground-truth summaries in the text domain using well-established metrics in natural language processing.

Automated Whiteboard Lecture Video Summarization by Content Region Detection and Representation

Bhargava Urala Kota, Alexander Stone, Kenny Davila, Srirangaraj Setlur, Venu Govindaraju

Auto-TLDR; A Framework for Summarizing Whiteboard Lecture Videos Using Feature Representations of Handwritten Content Regions

Lecture videos are rapidly becoming an invaluable source of information for students across the globe. Given the large number of online courses currently available, it is important to condense the information within these videos into a compact yet representative summary that can be used for search-based applications. We propose a framework to summarize whiteboard lecture videos by finding feature representations of detected handwritten content regions to determine unique content. We investigate multi-scale histograms of gradients and embeddings from deep metric learning for feature representation. We explicitly handle occluded, growing and disappearing handwritten content. Our method is capable of producing two kinds of lecture video summaries - the unique regions themselves, or so-called key content, and keyframes (which contain all unique content in a video segment). We use weighted spatio-temporal conflict minimization to segment the lecture and produce keyframes from detected regions and features. We evaluate both types of summaries and find that we obtain state-of-the-art performance in terms of the number of summary keyframes, while our unique content recall and precision are comparable to the state of the art.

What and How? Jointly Forecasting Human Action and Pose

Yanjun Zhu, Yanxia Zhang, Qiong Liu, Andreas Girgensohn

Auto-TLDR; Forecasting Human Actions and Motion Trajectories with Joint Action Classification and Pose Regression

Forecasting human actions and motion trajectories addresses the problem of predicting what a person is going to do next and how they will perform it. This is crucial in a wide range of applications such as assisted living and future co-robotic settings. We propose to simultaneously learn actions and action-related human motion dynamics, while existing works perform them independently. In this paper, we present a method to jointly forecast categories of human action and the pose of skeletal joints in the hope that the two tasks can help each other. As a result, our system can predict not only the future actions but also the motion trajectories that will result. To achieve this, we define a task of joint action classification and pose regression. We employ a sequence to sequence encoder-decoder model combined with multi-task learning to forecast future actions and poses progressively before the action happens. Experimental results on two public datasets, IkeaDB and OAD, demonstrate the effectiveness of the proposed method.

Information Graphic Summarization Using a Collection of Multimodal Deep Neural Networks

Edward Kim, Connor Onweller, Kathleen F. Mccoy

Auto-TLDR; A multimodal deep learning framework that can generate summarization text supporting the main idea of an information graphic for presentation to blind or visually impaired

We present a multimodal deep learning framework that can generate summarization text supporting the main idea of an information graphic for presentation to a person who is blind or visually impaired. The framework utilizes the visual, textual, positional, and size characteristics extracted from the image to create the summary. Different and complementary neural architectures are optimized for each task using crowdsourced training data. From our quantitative experiments and results, we explain the reasoning behind our framework and show the effectiveness of our models. Our qualitative results showcase text generated from our framework and show that Mechanical Turk participants favor it over other automatic and human-generated summarizations. We describe the design of an experiment to evaluate the utility of our system for people who have visual impairments in the context of understanding Twitter Tweets containing line graphs.

RWF-2000: An Open Large Scale Video Database for Violence Detection

Ming Cheng, Kunjing Cai, Ming Li

Auto-TLDR; Flow Gated Network for Violence Detection in Surveillance Cameras

In recent years, surveillance cameras have been widely deployed in public places, and the general crime rate has been reduced significantly due to these ubiquitous devices. Usually, these cameras provide cues and evidence after crimes have been committed, while they are rarely used to prevent or stop criminal activities in time. It is both time- and labor-consuming to manually monitor the large amount of video data from surveillance cameras. Therefore, automatically recognizing violent behaviors from video signals becomes essential. In this paper, we summarize several existing video datasets for violence detection and propose a new video dataset with 2,000 videos, all captured by surveillance cameras in real-world scenes. We also present a new method that combines the merits of 3D-CNNs and optical flow, namely the Flow Gated Network. The proposed approach obtains an accuracy of 87.25% on the test set of our proposed RWF-2000 database. The proposed database and the source code of this paper are currently open to access.
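
A minimal sketch of the gating idea behind a flow-gated design: optical-flow features produce a multiplicative gate applied to the RGB 3D-CNN features. The 1x1x1 convolution and sigmoid gate are illustrative assumptions, not the published network.

```python
import torch
import torch.nn as nn

class FlowGate(nn.Module):
    """Optical-flow features produce a gate that modulates the RGB 3D-CNN
    features before pooling (a sketch of the gating idea, not the exact net)."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv3d(channels, channels, kernel_size=1),
                                  nn.Sigmoid())

    def forward(self, rgb_feats: torch.Tensor, flow_feats: torch.Tensor) -> torch.Tensor:
        # rgb_feats, flow_feats: (batch, channels, time, height, width)
        return rgb_feats * self.gate(flow_feats)

gated = FlowGate(64)(torch.randn(1, 64, 8, 28, 28), torch.randn(1, 64, 8, 28, 28))
```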

Feature Pyramid Hierarchies for Multi-Scale Temporal Action Detection

Jiayu He, Guohui Li, Jun Lei

Auto-TLDR; Temporal Action Detection using Pyramid Hierarchies and Multi-scale Feature Maps

Temporal action detection is a challenging but promising task in video content analysis. It is in great demand in the field of public safety. The main difficulty of the task is precisely localizing activities in the video, especially short-duration activities, and most existing methods cannot achieve a satisfactory detection result. Our method addresses a key point for improving detection accuracy, which is to use multi-scale feature maps for regression and classification. In this paper, we introduce a novel network based on the classification-following-proposal framework. In our network, a 3D feature pyramid hierarchy is built to enhance the ability to detect short-duration activities. The input RGB/Flow frames are first encoded by the 3D feature pyramid hierarchy, and this subnet produces multi-level feature maps. Then a temporal proposal subnet uses these features to pick out proposals that might contain activity segments. Finally, a pyramid region of interest (RoI) pooling pipeline and two fully connected layers reuse the multi-level feature maps to refine the temporal boundaries of proposals and classify them. We use a late feature fusion scheme to combine RGB and Flow information. The network is trained end-to-end and we evaluate it on the THUMOS'14 dataset. Our network achieves a good result among typical methods. A further ablation test demonstrates that the pyramid hierarchy is effective for improving the detection of short-duration activity segments.

Audio-Based Near-Duplicate Video Retrieval with Audio Similarity Learning

Pavlos Avgoustinakis, Giorgos Kordopatis-Zilos, Symeon Papadopoulos, Andreas L. Symeonidis, Ioannis Kompatsiaris

Auto-TLDR; AuSiL: Audio Similarity Learning for Near-duplicate Video Retrieval

In this work, we address the problem of audio-based near-duplicate video retrieval. We propose the Audio Similarity Learning (AuSiL) approach that effectively captures temporal patterns of audio similarity between video pairs. For the robust similarity calculation between two videos, we first extract representative audio-based video descriptors by leveraging transfer learning based on a Convolutional Neural Network (CNN) trained on a large scale dataset of audio events, and then we calculate the similarity matrix derived from the pairwise similarity of these descriptors. The similarity matrix is subsequently fed to a CNN network that captures the temporal structures existing within its content. We train our network following a triplet generation process and optimizing the triplet loss function. To evaluate the effectiveness of the proposed approach, we have manually annotated two publicly available video datasets based on the audio duplicity between their videos. The proposed approach achieves very competitive results compared to three state-of-the-art methods. Also, unlike the competing methods, it is very robust for the retrieval of audio duplicates generated with speed transformations.
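
The similarity-matrix step described above can be sketched as pairwise cosine similarities between the per-segment audio descriptors of two videos; the resulting matrix would then be fed to the temporal CNN. The descriptor shapes used here are placeholders, not the AuSiL configuration.

```python
import torch
import torch.nn.functional as F

def audio_similarity_matrix(desc_a: torch.Tensor, desc_b: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarities between the per-segment audio descriptors of
    two videos (illustrative of the described pipeline, not the authors' code)."""
    a = F.normalize(desc_a, dim=-1)   # (num_segments_a, dim)
    b = F.normalize(desc_b, dim=-1)   # (num_segments_b, dim)
    return a @ b.t()                  # (num_segments_a, num_segments_b)

sim = audio_similarity_matrix(torch.randn(30, 128), torch.randn(45, 128))
```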

Temporal Binary Representation for Event-Based Action Recognition

Simone Undri Innocenti, Federico Becattini, Federico Pernici, Alberto Del Bimbo

Auto-TLDR; Temporal Binary Representation for Gesture Recognition

In this paper we present an event aggregation strategy to convert the output of an event camera into frames processable by traditional Computer Vision algorithms. The proposed method first generates sequences of intermediate binary representations, which are then losslessly transformed into a compact format by simply applying a binary-to-decimal conversion. This strategy allows us to encode temporal information directly into pixel values, which are then interpreted by deep learning models. We apply our strategy, called Temporal Binary Representation, to the task of gesture recognition, obtaining state-of-the-art results on the popular DVS128 Gesture Dataset. To underline the effectiveness of the proposed method compared to existing ones, we also collect an extension of the dataset under more challenging conditions on which to perform experiments.
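
The binary-to-decimal aggregation described above can be sketched in a few lines: N binary event frames are stacked and the N values at each pixel are read as the bits of an integer, yielding one compact frame. The frame counts and sizes below are arbitrary examples.

```python
import numpy as np

def temporal_binary_representation(event_frames: np.ndarray) -> np.ndarray:
    """Pack N binary event frames into a single frame by treating the N values
    at each pixel as the bits of an integer (a small sketch of the
    binary-to-decimal aggregation described above)."""
    n, h, w = event_frames.shape                     # N binary frames of size HxW
    weights = 2 ** np.arange(n)[:, None, None]       # bit weights 1, 2, 4, ...
    return (event_frames.astype(np.int64) * weights).sum(axis=0)   # HxW integer frame

frames = (np.random.rand(8, 128, 128) > 0.5).astype(np.uint8)
packed = temporal_binary_representation(frames)      # pixel values in [0, 255]
```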

A Grid-Based Representation for Human Action Recognition

Soufiane Lamghari, Guillaume-Alexandre Bilodeau, Nicolas Saunier

Auto-TLDR; GRAR: Grid-based Representation for Action Recognition in Videos

Human action recognition (HAR) in videos is a fundamental research topic in computer vision. It consists mainly in understanding actions performed by humans based on a sequence of visual observations. In recent years, HAR has witnessed significant progress, especially with the emergence of deep learning models. However, most existing approaches for action recognition rely on information that is not always relevant to the task, and are limited in the way they fuse temporal information. In this paper, we propose a novel method for human action recognition that efficiently encodes the most discriminative appearance information of an action, with explicit attention on representative pose features, into a new compact grid representation. Our GRAR (Grid-based Representation for Action Recognition) method is tested on several benchmark datasets, demonstrating that our model can accurately recognize human actions despite intra-class appearance variations and occlusion challenges.

Context Visual Information-Based Deliberation Network for Video Captioning

Min Lu, Xueyong Li, Caihua Liu

Auto-TLDR; Context visual information-based deliberation network for video captioning

Video captioning aims to automatically and accurately generate a textual description of a video. Typical methods following the encoder-decoder architecture directly utilize hidden states to predict words. Nevertheless, these methods do not amend inaccurate hidden states before feeding them into word prediction, which leads to a cascade of errors when generating words one by one. In this paper, the context visual information-based deliberation network is proposed, abbreviated as CVI-DelNet. Its key idea is to introduce a deliberator into the encoder-decoder framework. The encoder-decoder first generates a raw hidden state sequence. Unlike existing methods, the raw hidden state is no longer used directly for word prediction but is fed into the deliberator to generate a refined hidden state. The words are then predicted according to the refined hidden states and the contextual visual features. Results on two datasets show that the proposed method significantly outperforms the baselines.

You Ought to Look Around: Precise, Large Span Action Detection

Ge Pan, Zhang Han, Fan Yu, Yonghong Song, Yuanlin Zhang, Han Yuan

Auto-TLDR; YOLA: Local Feature Extraction for Action Localization with Variable receptive field

For the action localization task, pre-defined action anchors are the cornerstone of mainstream techniques. State-of-the-art models mostly rely on a dense segmenting scheme, where anchors are sampled uniformly over the temporal domain with a predefined set of scales. However, this is not sufficient because action durations vary greatly. Therefore, it is necessary for the anchors or proposals to have a variable receptive field. In this paper, we propose a method called YOLA (You Ought to Look Around) which includes three parts: 1) a robust backbone, SPN-I3D, for extracting spatio-temporal features; in this part, we employ a stronger backbone, I3D with SPN (Segment Pyramid Network), instead of C3D to obtain multi-scale features; 2) a simple but useful feature fusion module named LFE (Local Feature Extraction); compared with a fully connected layer and global average pooling, our LFE module is more advantageous for the network to fit and fuse features; 3) a new feature segment aligning method called TPGC (Two Pathway Graph Convolution), which allows a proposal to leverage the semantic features of adjacent proposals to update its content and ensures the proposals have a variable receptive field. YOLA adds only a small overhead to the baseline network and is easy to train in an end-to-end manner, running at a speed of 1097 fps. YOLA achieves a mAP of 58.3%, outperforming all existing models, including both RGB-based and two-stream models, on THUMOS'14, and achieves competitive results on ActivityNet 1.3.

Modeling Long-Term Interactions to Enhance Action Recognition

Alejandro Cartas, Petia Radeva, Mariella Dimiccoli

Auto-TLDR; A Hierarchical Long Short-Term Memory Network for Action Recognition in Egocentric Videos

In this paper, we propose a new approach to understanding actions in egocentric videos that exploits the semantics of object interactions at both the frame and temporal levels. At the frame level, we use a region-based approach that takes as input a primary region roughly corresponding to the user's hands and a set of secondary regions potentially corresponding to the interacting objects, and calculates the action score through a CNN formulation. This information is then fed to a Hierarchical Long Short-Term Memory Network (HLSTM) that captures temporal dependencies between actions within and across shots. Ablation studies thoroughly validate the proposed approach, showing in particular that both levels of the HLSTM architecture contribute to performance improvement. Furthermore, quantitative comparisons show that the proposed approach outperforms the state of the art in terms of action recognition on standard benchmarks, without relying on motion information.

Precise Temporal Action Localization with Quantified Temporal Structure of Actions

Chongkai Lu, Ruimin Li, Hong Fu, Bin Fu, Yihao Wang, Wai Lun Lo, Zheru Chi

Auto-TLDR; Action progression networks for temporal action detection

Existing temporal action detection algorithms cannot distinguish complete from incomplete actions, while this property is essential in many applications. To tackle this challenge, we propose the action progression network (APN), a novel model that predicts the action progression of video frames with continuous numbers. On top of the APN, a complete action searching algorithm (CAS) is designed that uses the progression sequence of the test video to detect complete actions only. By modeling fine-grained frame-level temporal structure and detecting actions according to their whole temporal context, our framework can locate actions precisely and avoids detecting incomplete actions. We evaluated our framework on a new dataset (DFMAD-70) collected by ourselves, which contains both complete and incomplete actions. Our framework achieves good temporal localization results with 95.77% average precision at an IoU threshold of 0.5. On the THUMOS14 benchmark, an incompleteness-agnostic dataset, our framework still obtains competitive performance. The code is available online at https://github.com/MakeCent/Action-Progression-Network

Knowledge Distillation for Action Anticipation Via Label Smoothing

Guglielmo Camporese, Pasquale Coscia, Antonino Furnari, Giovanni Maria Farinella, Lamberto Ballan

Auto-TLDR; A Multi-Modal Framework for Action Anticipation using Long Short-Term Memory Networks

Human capability to anticipate the near future from visual observations and non-verbal cues is essential for developing intelligent systems that need to interact with people. Several research areas, such as human-robot interaction (HRI), assisted living or autonomous driving, need to foresee future events to avoid crashes or help people. Egocentric scenarios are classic examples where action anticipation is applied due to their numerous applications. Such a challenging task demands capturing and modeling the domain's hidden structure to reduce prediction uncertainty. Since multiple actions may equally occur in the future, we treat action anticipation as a multi-label problem with missing labels, extending the concept of label smoothing. This idea resembles the knowledge distillation process, since useful information is injected into the model during training. We implement a multi-modal framework based on long short-term memory (LSTM) networks to summarize past observations and make predictions at different time steps. We perform extensive experiments on the EPIC-Kitchens and EGTEA Gaze+ datasets, which include more than 2500 and 100 action classes, respectively. The experiments show that label smoothing systematically improves the performance of state-of-the-art models for action anticipation.
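
As a rough illustration of the label-smoothing ingredient, the snippet below applies standard smoothing to one-hot action targets; the paper's extension to multi-label targets with missing labels is only approximated here, and the smoothing factor is an assumed hyper-parameter.

```python
import torch

def smooth_targets(one_hot: torch.Tensor, eps: float = 0.1) -> torch.Tensor:
    """Standard label smoothing over one-hot action targets (illustrative only;
    the paper generalizes this idea to multiple plausible future actions)."""
    num_classes = one_hot.size(-1)
    return one_hot * (1.0 - eps) + eps / num_classes

targets = torch.zeros(2, 10)
targets[0, 3] = 1.0
targets[1, 7] = 1.0
soft = smooth_targets(targets)   # peaks at 0.91, uniform floor of 0.01
```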

Towards Practical Compressed Video Action Recognition: A Temporal Enhanced Multi-Stream Network

Bing Li, Longteng Kong, Dongming Zhang, Xiuguo Bao, Di Huang, Yunhong Wang

Auto-TLDR; TEMSN: Temporal Enhanced Multi-Stream Network for Compressed Video Action Recognition

Current compressed video action recognition methods are mainly based on completely received compressed videos. However, in real transmission, compressed video packets are often received out of order or lost due to network jitter or congestion. It is of great significance to recognize actions in early phases with limited packets, e.g., to quickly forecast potential risks from videos. In this paper, we propose a Temporal Enhanced Multi-Stream Network (TEMSN) for practical compressed video action recognition. First, we use three compressed modalities as complementary cues and build a multi-stream network to capture the rich information from compressed video packets. Second, we design a temporal enhancement module based on an encoder-decoder structure, applied to each stream to infer the missing packets and generate more complete action dynamics. Thanks to the rich modalities and temporal enhancement, our approach is able to better model the action with limited compressed packets. Experiments on the HMDB-51 and UCF-101 datasets validate its effectiveness and efficiency.

Attentive Visual Semantic Specialized Network for Video Captioning

Jesus Perez-Martin, Benjamin Bustos, Jorge Pérez

Auto-TLDR; Adaptive Visual Semantic Specialized Network for Video Captioning

As an essential high-level task in video understanding, automatically describing a video with natural language has recently gained attention as a fundamental challenge in computer vision. Previous models for video captioning have several limitations, such as gaps in current semantic representations and the inexpressibility of the generated captions. To deal with these limitations, in this paper, we present a new architecture that we call the Attentive Visual Semantic Specialized Network (AVSSN), which is an encoder-decoder model based on our Adaptive Attention Gate and Specialized LSTM layers. This architecture can selectively decide when to use visual or semantic information in the text generation process. The adaptive gate enables the decoder to automatically select the relevant information, providing a better temporal state representation than existing decoders. Besides, the model is capable of learning to improve the expressiveness of generated captions by attending to their length, using a sentence-length-related loss function. We evaluate the effectiveness of the proposed approach on the Microsoft Video Description (MSVD) and Microsoft Research Video-to-Text (MSR-VTT) datasets, achieving state-of-the-art performance with several popular evaluation metrics: BLEU-4, METEOR, CIDEr, and ROUGE_L.
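
A hedged sketch of an adaptive gate that decides, at each decoding step, how much visual versus semantic context to pass to the decoder. The gate parameterization and dimensions are assumptions, not the AVSSN layers.

```python
import torch
import torch.nn as nn

class AdaptiveGate(nn.Module):
    """Mixes visual and semantic context with a learned scalar gate conditioned
    on the decoder state (a sketch of the gating idea, not the AVSSN code)."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(3 * dim, 1), nn.Sigmoid())

    def forward(self, hidden, visual, semantic):
        g = self.gate(torch.cat([hidden, visual, semantic], dim=-1))   # (batch, 1)
        return g * visual + (1.0 - g) * semantic

mixed = AdaptiveGate(256)(torch.randn(4, 256), torch.randn(4, 256), torch.randn(4, 256))
```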

SDMA: Saliency Driven Mutual Cross Attention for Multi-Variate Time Series

Yash Garg, K. Selcuk Candan

Auto-TLDR; Salient-Driven Mutual Cross Attention for Intelligent Time Series Analytics

The integration of rich sensory technologies into critical applications, such as gesture recognition and building energy optimization, has highlighted the importance of intelligent time series analytics. To accommodate this demand, uni-variate approaches have been extended to multi-variate scenarios, but naive extensions have led to deterioration in model performance due to their limited ability to capture the information recorded in different variates and the evolution of complex multi-variate time series patterns over time. Furthermore, real-world time series are often contaminated with noisy information. In this paper, we note that a time series often carries robust localized temporal events that could help improve model performance by highlighting the relevant information; however, the lack of sufficient data to train for these events makes it impossible for neural architectures to identify and make use of them. We therefore argue that a companion process that identifies salient events in the input time series and drives the model's attention to the associated salient sub-sequences can help with learning a high-performing network. Relying on this observation, we propose a novel Saliency-Driven Mutual Cross Attention (SDMA) framework that extracts localized temporal events and generates a saliency series to complement the input time series. We further propose an architecture that accounts for the mutual cross-talk between the input and saliency series branches, where the input and saliency series attend to each other. Experiments show that the proposed mutual cross-attention framework can offer significant boosts in model performance when compared against non-attentioned, conventionally attentioned, and conventionally cross-attentioned models.
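
The mutual cross-talk between the input and saliency branches can be sketched with two cross-attention blocks in which each series queries the other. The use of torch.nn.MultiheadAttention and the chosen dimensions are illustrative assumptions, not the SDMA architecture.

```python
import torch
import torch.nn as nn

class MutualCrossAttention(nn.Module):
    """Input and saliency series attend to each other; both attended views are
    returned for downstream fusion (a sketch of the cross-talk idea)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.in_to_sal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.sal_to_in = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor, saliency: torch.Tensor):
        # x, saliency: (batch, time, dim)
        x_att, _ = self.in_to_sal(query=x, key=saliency, value=saliency)
        s_att, _ = self.sal_to_in(query=saliency, key=x, value=x)
        return x_att, s_att

x_att, s_att = MutualCrossAttention(64)(torch.randn(2, 100, 64), torch.randn(2, 100, 64))
```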

A CNN-RNN Framework for Image Annotation from Visual Cues and Social Network Metadata

Tobia Tesan, Pasquale Coscia, Lamberto Ballan

Auto-TLDR; Context-Based Image Annotation with Multiple Semantic Embeddings and Recurrent Neural Networks

Images represent a commonly used form of visual communication among people. Nevertheless, image classification may be a challenging task when dealing with unclear or uncommon images needing more context to be correctly annotated. Metadata accompanying images on social media represent an ideal source of additional information for retrieving proper neighborhoods, easing the image annotation task. To this end, we blend visual features extracted from neighbors and their metadata to jointly leverage context and visual cues. Our models use multiple semantic embeddings to achieve the dual objective of being robust to vocabulary changes between train and test sets and decoupling the architecture from the low-level metadata representation. Convolutional and recurrent neural networks (CNNs-RNNs) are jointly adopted to infer similarity among neighbors and query images. We perform comprehensive experiments on the NUS-WIDE dataset showing that our models outperform state-of-the-art architectures based on images and metadata, and decrease both sensory and semantic gaps to better annotate images.

Pose-Based Body Language Recognition for Emotion and Psychiatric Symptom Interpretation

Zhengyuan Yang, Amanda Kay, Yuncheng Li, Wendi Cross, Jiebo Luo

Auto-TLDR; Body Language Based Emotion Recognition for Psychiatric Symptoms Prediction

Inspired by the human ability to infer emotions from body language, we propose an automated framework for body language based emotion recognition starting from regular RGB videos. In collaboration with psychologists, we further extend the framework for psychiatric symptom prediction. Because a specific application domain of the proposed framework may only supply a limited amount of data, the framework is designed to work on a small training set and possess a good transferability. The proposed system in the first stage generates sequences of body language predictions based on human poses estimated from input videos. In the second stage, the predicted sequences are fed into a temporal network for emotion interpretation and psychiatric symptom prediction. We first validate the accuracy and transferability of the proposed body language recognition method on several public action recognition datasets. We then evaluate the framework on a proposed URMC dataset, which consists of conversations between a standardized patient and a behavioral health professional, along with expert annotations of body language, emotions, and potential psychiatric symptoms. The proposed framework outperforms other methods on the URMC dataset.

Attention-Based Deep Metric Learning for Near-Duplicate Video Retrieval

Kuan-Hsun Wang, Chia Chun Cheng, Yi-Ling Chen, Yale Song, Shang-Hong Lai

Auto-TLDR; Attention-based Deep Metric Learning for Near-duplicate Video Retrieval

Near-duplicate video retrieval (NDVR) is an important and challenging problem due to the increasing amount of videos uploaded to the Internet. In this paper, we propose an attention-based deep metric learning method for NDVR. Our method is based on well-established principles: we leverage two-stream networks to combine RGB and optical flow features, and incorporate an attention module to effectively deal with distractor frames commonly observed in near-duplicate videos. We further aggregate the features corresponding to multiple video segments to enhance the discriminative power. The whole system is trained using a deep metric learning objective with a Siamese architecture. Our experiments show that the attention module helps eliminate redundant and noisy frames, while focusing on visually relevant frames for solving NDVR. We evaluate our approach on recent large-scale NDVR datasets, CC_WEB_VIDEO, VCDB, FIVR and SVD. To demonstrate the generalization ability of our approach, we report results in both within- and cross-dataset settings, and show that the proposed method significantly outperforms state-of-the-art approaches.
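
A minimal sketch of the two ingredients named above: an attention module that down-weights distractor frames before pooling a video into one vector, and a triplet margin loss over anchor/positive/negative videos. Shapes and the pooling form are assumptions, not the authors' network.

```python
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Learns a per-frame attention score so distractor frames contribute less
    to the pooled video embedding used for metric learning (sketch only)."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, dim)
        w = torch.softmax(self.score(frames), dim=1)   # (batch, time, 1)
        return (w * frames).sum(dim=1)                 # (batch, dim)

pool = AttentivePooling(512)
anchor, positive, negative = (pool(torch.randn(4, 64, 512)) for _ in range(3))
loss = nn.TripletMarginLoss(margin=1.0)(anchor, positive, negative)
```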

Self-Supervised Joint Encoding of Motion and Appearance for First Person Action Recognition

Mirco Planamente, Andrea Bottino, Barbara Caputo

Auto-TLDR; A Single Stream Architecture for Egocentric Action Recognition from the First-Person Point of View

Wearable cameras are becoming more and more popular in several applications, increasing the interest of the research community in developing approaches for recognizing actions from the first-person point of view. An open challenge in egocentric action recognition is that videos lack detailed information about the main actor's pose and thus tend to record only parts of the movement when focusing on manipulation tasks. Thus, the amount of information about the action itself is limited, making the understanding of the manipulated objects and their context crucial. Many previous works addressed this issue with two-stream architectures, where one stream is dedicated to modeling the appearance of objects involved in the action, and another to extracting motion features from optical flow. In this paper, we argue that learning features jointly from these two information channels is beneficial to better capture the spatio-temporal correlations between the two. To this end, we propose a single-stream architecture able to do so, thanks to the addition of a self-supervised block that uses a pretext motion prediction task to intertwine motion and appearance knowledge. Experiments on several publicly available databases show the power of our approach.

Late Fusion of Bayesian and Convolutional Models for Action Recognition

Camille Maurice, Francisco Madrigal, Frederic Lerasle

Auto-TLDR; Fusion of Deep Neural Network and Bayesian-based Approach for Temporal Action Recognition

The activities we perform in our daily life are generally carried out as a succession of atomic actions that follow a logical order, and actions in a video sequence usually reflect this order. In this paper, we propose a hybrid approach resulting from the fusion of a deep learning neural network with a Bayesian-based approach. The latter models human-object interactions and transitions between actions. The key idea is to combine both approaches in the final prediction. We validate our strategy on two public datasets: CAD-120 and Watch-n-Patch. We show that our fusion approach yields performance gains in accuracy of +4% and +6%, respectively, over a baseline approach. Temporal action recognition performance is clearly improved by the fusion, especially when classes are imbalanced.

Anticipating Activity from Multimodal Signals

Tiziana Rotondo, Giovanni Maria Farinella, Davide Giacalone, Sebastiano Mauro Strano, Valeria Tomaselli, Sebastiano Battiato

Auto-TLDR; Exploiting Multimodal Signal Embedding Space for Multi-Action Prediction

Images, videos, audio signals, and sensor data can easily be collected in huge quantities by different devices and processed in order to emulate the human capability of elaborating a variety of different stimuli. Are multimodal signals useful for understanding and anticipating human actions if acquired from the user's viewpoint? This paper proposes to build an embedding space where inputs of different nature, but semantically correlated, are projected into a new representation space and properly exploited to anticipate the future user activity. For this purpose, we built a new multimodal dataset comprising video, audio, tri-axial acceleration, angular velocity, tri-axial magnetic field, pressure and temperature. To benchmark the proposed multimodal anticipation challenge, we consider classic classifiers on top of deep learning methods used to build the embedding space representing the multimodal signals. The achieved results show that the exploitation of different modalities is useful to improve the anticipation of the future activity.

Relevance Detection in Cataract Surgery Videos by Spatio-Temporal Action Localization

Negin Ghamsarian, Mario Taschwer, Doris Putzgruber, Stephanie Sarny, Klaus Schoeffmann

Auto-TLDR; relevance-based retrieval in cataract surgery videos

In cataract surgery, the operation is performed with the help of a microscope. Since the microscope allows only up to two people to watch the surgery in real time, a major part of surgical training is conducted using recorded videos. To optimize the training procedure with the video content, surgeons require an automatic relevance detection approach. In addition to relevance-based retrieval, these results can be further used for skill assessment and irregularity detection in cataract surgery videos. In this paper, a three-module framework is proposed to detect and classify the relevant phase segments in cataract videos. Taking advantage of an idle frame recognition network, the video is divided into idle and action segments. To boost relevance detection performance, Mask R-CNN is utilized to detect the cornea in each frame, where the relevant surgical actions are conducted. The spatio-temporally localized segments, containing higher-resolution information about the pupil texture and actions, along with complementary temporal information from the same phase, are fed into the relevance detection module. This module consists of four parallel recurrent CNNs responsible for detecting four relevant phases that have been defined with medical experts. The results are then integrated to classify the action phases as irrelevant or as one of the four relevant phases. Experimental results reveal that the proposed approach outperforms static CNNs and different configurations of feature-based and end-to-end recurrent networks.

Gabriella: An Online System for Real-Time Activity Detection in Untrimmed Security Videos

Mamshad Nayeem Rizve, Ugur Demir, Praveen Praveen Tirupattur, Aayush Jung Rana, Kevin Duarte, Ishan Rajendrakumar Dave, Yogesh Rawat, Mubarak Shah

Auto-TLDR; Gabriella: A Real-Time Online System for Activity Detection in Surveillance Videos

Activity detection in surveillance videos is a difficult problem due to multiple factors such as large field of view, presence of multiple activities, varying scales and viewpoints, and its untrimmed nature. The existing research in activity detection is mainly focused on datasets, such as UCF-101, JHMDB, THUMOS, and AVA, which partially address these issues. The requirement of processing the surveillance videos in real-time makes this even more challenging. In this work we propose Gabriella, a real-time online system to perform activity detection on untrimmed surveillance videos. The proposed method consists of three stages: tubelet extraction, activity classification, and online tubelet merging. For tubelet extraction, we propose a localization network which takes a video clip as input and spatio-temporally detects potential foreground regions at multiple scales to generate action tubelets. We propose a novel Patch-Dice loss to handle large variations in actor size. Our online processing of videos at a clip level drastically reduces the computation time in detecting activities. The detected tubelets are assigned activity class scores by the classification network and merged together using our proposed Tubelet-Merge Action-Split (TMAS) algorithm to form the final action detections. The TMAS algorithm efficiently connects the tubelets in an online fashion to generate action detections which are robust against varying length activities. We perform our experiments on the VIRAT and MEVA (Multiview Extended Video with Activities) datasets and demonstrate the effectiveness of the proposed approach in terms of speed (~100 fps) and performance with state-of-the-art results. The code and models will be made publicly available.

Learning Group Activities from Skeletons without Individual Action Labels

Fabio Zappardino, Tiberio Uricchio, Lorenzo Seidenari, Alberto Del Bimbo

Auto-TLDR; Lean Pose Only for Group Activity Recognition

To understand human behavior we must not only recognize individual actions but also model possibly complex group activities and interactions. Hierarchical models obtain the best results in group activity recognition but require fine-grained individual action annotations at the actor level. In this paper we show that, using only skeletal data, we can train a state-of-the-art end-to-end system using only group activity labels at the sequence level. Our experiments show that models trained without individual action supervision perform poorly. On the other hand, we show that pseudo-labels can be computed from any pre-trained feature extractor with comparable final performance. Finally, our carefully designed lean, pose-only architecture shows highly competitive results versus more complex multimodal approaches, even in the self-supervised variant.

AG-GAN: An Attentive Group-Aware GAN for Pedestrian Trajectory Prediction

Yue Song, Niccolò Bisagno, Syed Zohaib Hassan, Nicola Conci

Auto-TLDR; An attentive group-aware GAN for motion prediction in crowded scenarios

Understanding human behaviors in crowded scenarios requires analyzing not only the position of the subjects in space, but also the scene context. Existing approaches mostly rely on the motion history of each pedestrian and model the interactions among people by considering the entire surrounding neighborhood. In our approach, we address the problem of motion prediction by applying coherent group clustering and a global attention mechanism on the LSTM-based Generative Adversarial Networks (GANs). The proposed model consists of an attentive group-aware GAN that observes the agents' past motion and predicts future paths, using (i) a group pooling module to model neighborhood interaction, and (ii) an attention module to specifically focus on hidden states. The experimental results demonstrate that our proposal outperforms state-of-the-art models on common benchmark datasets, and is able to generate socially-acceptable trajectories.

Exploring Spatial-Temporal Representations for fNIRS-based Intimacy Detection via an Attention-enhanced Cascade Convolutional Recurrent Neural Network

Chao Li, Qian Zhang, Ziping Zhao

Auto-TLDR; Intimate Relationship Prediction by Attention-enhanced Cascade Convolutional Recurrent Neural Network Using Functional Near-Infrared Spectroscopy

The detection of intimacy plays a crucial role in the improvement of intimate relationships, which contributes to promoting family and social harmony. Previous studies have shown that different degrees of intimacy have significant differences in brain imaging. Recently, some work has emerged on recognizing intimacy automatically using machine learning techniques. Moreover, considering the temporal dynamics of intimate relationships at the neural level, how to effectively model spatio-temporal dynamics for intimacy prediction is still a challenge. In this paper, we propose a novel method to explore deep spatial-temporal representations for intimacy prediction with an Attention-enhanced Cascade Convolutional Recurrent Neural Network (ACCRNN). Given the advantages of time-frequency resolution in the analysis of complex neuronal activities, this paper utilizes functional near-infrared spectroscopy (fNIRS) to analyse and infer intimate relationships. We collect an fNIRS-based dataset for the analysis of intimate relationships. Forty-two-channel fNIRS signals are recorded from the 44 subjects' prefrontal cortex while they watch a total of 18 photos of lovers, friends and strangers for 30 seconds per photo. The experimental results show that our proposed method outperforms the others in terms of accuracy, with a precision of 96.5%. To the best of our knowledge, this is the first time that such a hybrid deep architecture has been employed for fNIRS-based intimacy prediction.

ESResNet: Environmental Sound Classification Based on Visual Domain Models

Andrey Guzhov, Federico Raue, Jörn Hees, Andreas Dengel

Auto-TLDR; Environmental Sound Classification with Short-Time Fourier Transform Spectrograms

Environmental Sound Classification (ESC) is an active research area in the audio domain and has seen a lot of progress in the past years. However, many of the existing approaches achieve high accuracy by relying on domain-specific features and architectures, making it harder to benefit from advances in other fields (e.g., the image domain). Additionally, some of the past successes have been attributed to a discrepancy in how results are evaluated (i.e., on unofficial splits of the UrbanSound8K (US8K) dataset), distorting the overall progression of the field. The contribution of this paper is twofold. First, we present a model that is inherently compatible with mono and stereo sound inputs. Our model is based on simple log-power Short-Time Fourier Transform (STFT) spectrograms and combines them with several well-known approaches from the image domain (i.e., ResNet, Siamese-like networks and attention). We investigate the influence of cross-domain pre-training and architectural changes, and evaluate our model on standard datasets. We find that our model outperforms all previously known approaches in a fair comparison by achieving accuracies of 97.0 % (ESC-10), 91.5 % (ESC-50) and 84.2 % / 85.4 % (US8K mono / stereo). Second, we provide a comprehensive overview of the actual state of the field by differentiating several previously reported results on the US8K dataset between official or unofficial splits. For better reproducibility, our code (including any re-implementations) is made available.
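
The model input described above, a log-power STFT spectrogram, can be computed in a few lines with torch.stft; the FFT size, hop length and log floor below are assumed values, not the paper's settings.

```python
import torch

def log_power_spectrogram(wave: torch.Tensor, n_fft: int = 1024, hop: int = 256) -> torch.Tensor:
    """Log-power STFT spectrogram of a mono waveform, the kind of image-like
    input a ResNet-style audio backbone can consume (parameters are assumptions)."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(wave, n_fft=n_fft, hop_length=hop, window=window,
                      return_complex=True)         # (freq_bins, frames)
    power = spec.abs() ** 2
    return torch.log10(power + 1e-10)              # small floor avoids log(0)

spec = log_power_spectrogram(torch.randn(22050))   # ~1 second of 22.05 kHz audio
```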

Developing Motion Code Embedding for Action Recognition in Videos

Maxat Alibayev, David Andrea Paulius, Yu Sun

Auto-TLDR; Motion Embedding via Motion Codes for Action Recognition

We propose a motion embedding strategy based on motion codes, a vectorized representation of motions built from their salient mechanical attributes. We show that these motion codes provide a robust motion representation. We train a deep neural network model that learns to embed demonstration videos into motion codes, and we integrate the features extracted by the motion embedding model into a current state-of-the-art action recognition model. The resulting model achieves higher accuracy than the baseline on a verb classification task over egocentric videos from the EPIC-KITCHENS dataset.
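
Since the abstract does not detail the embedding network, the sketch below only illustrates the general idea of mapping a pooled video feature to a motion code through one prediction head per mechanical attribute; the number and sizes of the attributes are purely hypothetical and do not reflect the authors' taxonomy.

    import torch
    import torch.nn as nn

    class MotionCodeEmbedder(nn.Module):
        """Illustrative embedder: pooled video features -> one head per motion attribute."""
        def __init__(self, feat_dim=2048, attribute_sizes=(2, 3, 4, 2)):
            super().__init__()
            self.trunk = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU())
            self.heads = nn.ModuleList(nn.Linear(512, k) for k in attribute_sizes)

        def forward(self, video_feat):                      # (B, feat_dim), e.g. pooled CNN features
            h = self.trunk(video_feat)
            # Concatenating the per-attribute predictions yields the motion-code vector.
            return torch.cat([head(h) for head in self.heads], dim=1)

    codes = MotionCodeEmbedder()(torch.randn(8, 2048))      # (8, 11) motion-code logits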

Vision-Based Multi-Modal Framework for Action Recognition

Djamila Romaissa Beddiar, Mourad Oussalah, Brahim Nini

Auto-TLDR; Multi-modal Framework for Human Activity Recognition Using RGB, Depth and Skeleton Data

Human activity recognition plays a central role in the development of intelligent systems for video surveillance, public security, health care and home monitoring, where the detection and recognition of activities can improve the quality of life and the security of humans. Typically, automated, intuitive and real-time systems are required to recognize human activities and accurately identify unusual behaviors in order to prevent dangerous situations. In this work, we explore the combination of three modalities (RGB, depth and skeleton data) to design a robust multi-modal framework for vision-based human activity recognition. In particular, spatial information, body shape/posture and the temporal evolution of actions are highlighted using illustrative representations obtained from a combination of dynamic RGB images, dynamic depth images and skeleton data representations. Each video is therefore represented by three images that summarize the ongoing action. Our framework takes advantage of transfer learning from pre-trained models to extract significant features from these newly created images. Next, we fuse the extracted features using Canonical Correlation Analysis and train a Long Short-Term Memory network to classify actions from the visual descriptive images. Experimental results demonstrate the reliability of our feature-fusion framework, which captures highly significant features and achieves state-of-the-art performance on the public UTD-MHAD and NTU RGB+D datasets.
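
To make the fusion step concrete, here is a minimal scikit-learn sketch of Canonical Correlation Analysis applied to two of the three per-video descriptors; all feature dimensions and the number of components are assumptions, and the LSTM classifier that follows is only indicated in a comment.

    import numpy as np
    from sklearn.cross_decomposition import CCA

    # Illustrative fusion of two per-video descriptors (e.g. CNN features of the
    # dynamic RGB image and of the dynamic depth image); all sizes are assumptions.
    rgb_feat   = np.random.randn(256, 128)   # 256 videos, 128-D RGB descriptors
    depth_feat = np.random.randn(256, 128)   # matching depth descriptors

    cca = CCA(n_components=32)               # project both views into a shared 32-D space
    rgb_c, depth_c = cca.fit_transform(rgb_feat, depth_feat)
    fused = np.concatenate([rgb_c, depth_c], axis=1)   # (256, 64) fused descriptor per video
    # 'fused' would then be fed to the sequence classifier (an LSTM in the paper).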

Single View Learning in Action Recognition

Gaurvi Goyal, Nicoletta Noceti, Francesca Odone

Auto-TLDR; Cross-View Action Recognition Using Domain Adaptation for Knowledge Transfer

Viewpoint is an essential aspect of how an action is visually perceived, with the motion appearing substantially different for some viewpoint pairs. Data-driven action recognition algorithms compensate for this by including a variety of viewpoints in their training data, adding to the cost of data acquisition as well as training. We propose a novel methodology that leverages deeply pre-trained features to learn actions from a single viewpoint, using domain adaptation for knowledge transfer. We demonstrate the effectiveness of this pipeline on three different datasets: IXMAS, MoCA and NTU RGB+D, and compare with both classical and deep learning methods. Our method requires little training data and demonstrates unparalleled cross-view action recognition accuracies for single-view learning.
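
The abstract does not name the specific domain adaptation mechanism, so the snippet below only illustrates one common option for aligning features across viewpoints, a linear-kernel Maximum Mean Discrepancy penalty on pre-extracted deep features; it should not be read as the authors' actual formulation.

    import torch

    def mmd_linear(src, tgt):
        """Linear-kernel Maximum Mean Discrepancy between two feature batches.
        One common domain-adaptation penalty, shown here purely for illustration."""
        delta = src.mean(dim=0) - tgt.mean(dim=0)
        return (delta * delta).sum()

    # Pre-extracted deep features from the training viewpoint and from an unseen viewpoint.
    src_feat = torch.randn(32, 1024)
    tgt_feat = torch.randn(32, 1024)
    alignment_loss = mmd_linear(src_feat, tgt_feat)   # would be added to the classification loss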

Multi-Scale 2D Representation Learning for Weakly-Supervised Moment Retrieval

Ding Li, Rui Wu, Zhizhong Zhang, Yongqiang Tang, Wensheng Zhang

Auto-TLDR; Multi-scale 2D Representation Learning for Weakly Supervised Video Moment Retrieval

Video moment retrieval aims to retrieve the moment most relevant to a given language query. However, most existing methods require temporal boundary annotations, which are expensive and time-consuming to label. Hence, weakly supervised methods that use only coarse video-level labels have recently been put forward. Despite their effectiveness, these methods usually process moment candidates independently, ignoring a critical issue: the natural temporal dependencies between candidates at different temporal scales. To cope with this issue, we propose a Multi-scale 2D Representation Learning method for weakly supervised video moment retrieval. Specifically, we first construct a two-dimensional map for each temporal scale to capture the temporal dependencies between candidates. The two dimensions of this map indicate the start and end time points of the candidates. Then, we select the top-K candidates from each scale-varied map with a learnable convolutional neural network. With a newly designed Moments Evaluation Module, we obtain the alignment scores of the selected candidates. Finally, the similarity between captions and the language query serves as supervision for further training the candidate selector. Experiments on two benchmark datasets, Charades-STA and ActivityNet Captions, demonstrate that our approach achieves performance superior to state-of-the-art results.
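
For intuition about the 2-D map described above, the sketch below builds, for a single temporal scale, a start/end table whose entry (i, j) is the mean-pooled feature of the candidate moment spanning clips i to j; the pooling choice and feature sizes are assumptions, and candidate selection, the Moments Evaluation Module and multi-scale handling are omitted.

    import torch

    def moment_map(clip_feats):
        """Build a 2-D start/end map from per-clip features of shape (T, D):
        entry (i, j) is the mean-pooled feature of the moment spanning clips i..j."""
        T, D = clip_feats.shape
        cum = torch.cumsum(clip_feats, dim=0)
        table = torch.zeros(T, T, D)
        for i in range(T):
            for j in range(i, T):
                total = cum[j] - (cum[i - 1] if i > 0 else 0)
                table[i, j] = total / (j - i + 1)
        return table                                    # indexed by (start, end) -> feature

    m = moment_map(torch.randn(16, 256))                # 16 clips, 256-D features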

Attention Based Multi-Instance Thyroid Cytopathological Diagnosis with Multi-Scale Feature Fusion

Shuhao Qiu, Yao Guo, Chuang Zhu, Wenli Zhou, Huang Chen

Auto-TLDR; A weakly supervised multi-instance learning framework based on attention mechanism with multi-scale feature fusion for thyroid cytopathological diagnosis

In recent years, deep learning has become popular in cytopathology diagnosis. Using whole slide images (WSIs) acquired with slide scanners in clinics, researchers have developed many algorithms to classify a slide as benign or malignant. However, the key areas that support the diagnosis can be relatively small in a thyroid WSI, and only a global label is available, which makes the direct use of a strongly supervised learning framework infeasible. Moreover, because the clinical diagnosis of thyroid cells relies on visual features at different scales, a generic feature extraction scheme may not achieve good performance. In this paper, we propose a weakly supervised multi-instance learning framework based on an attention mechanism with multi-scale feature fusion (MSF) using a convolutional neural network (CNN) for thyroid cytopathological diagnosis. We treat each WSI as a bag whose instances are different regions of the slide; the framework is trained to identify the key areas automatically and to make the classification. We also propose a feature fusion structure that merges low-level features into the final feature map and adds an instance-level attention module, which improves classification accuracy. Our model is trained and tested on the collected clinical data and reaches an accuracy of 93.2%, outperforming the other existing methods. We also test our model on a public histopathology dataset and achieve better results than the state-of-the-art deep multi-instance method.
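
The description of the attention module suggests an attention-weighted pooling over the instances (regions) of each WSI bag; the sketch below shows such a pooling in PyTorch in the spirit of attention-based deep MIL, with illustrative dimensions, and leaves out the CNN feature extractor and the multi-scale feature fusion branch.

    import torch
    import torch.nn as nn

    class AttentionMILPooling(nn.Module):
        """Attention pooling over the instances of one WSI bag; sizes are illustrative."""
        def __init__(self, feat_dim=512, att_dim=128, n_classes=2):
            super().__init__()
            self.att = nn.Sequential(nn.Linear(feat_dim, att_dim), nn.Tanh(),
                                     nn.Linear(att_dim, 1))
            self.cls = nn.Linear(feat_dim, n_classes)

        def forward(self, inst_feats):                       # (N_instances, feat_dim) for one bag
            w = torch.softmax(self.att(inst_feats), dim=0)   # weight per instance (key regions)
            bag = (w * inst_feats).sum(dim=0)                # attention-pooled bag feature
            return self.cls(bag), w                          # slide-level logits + attention weights

    logits, weights = AttentionMILPooling()(torch.randn(50, 512))   # 50 regions from one WSI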

Audio-Video Detection of the Active Speaker in Meetings

Francisco Madrigal, Frederic Lerasle, Lionel Pibre, Isabelle Ferrané

Auto-TLDR; Active Speaker Detection with Visual and Contextual Information from Meeting Context

Meetings are a common activity, and building systems that assist them raises several challenges. Such is the case of active speaker detection, which can provide useful information for modelling human interaction or for human-robot interaction. Active speaker detection is mostly done using speech; however, visual and contextual information can provide additional insights. In this paper we propose an active speaker detection framework that integrates audiovisual features with social information from the meeting context. The visual cue is processed with a Convolutional Neural Network (CNN) that captures spatio-temporal relationships. We analyze several CNN architectures with both cues: raw pixels (RGB images) and motion (estimated with optical flow). Contextual reasoning is done with an original methodology based on the gaze of all participants. We evaluate our proposal on a public state-of-the-art benchmark: the AMI corpus. We show how the addition of visual and context information improves the performance of active speaker detection.
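
The abstract does not specify how the audio, visual and gaze-context cues are combined, so the snippet below only sketches a simple late-fusion head over pre-computed descriptors of the three cues; all dimensions are hypothetical.

    import torch
    import torch.nn as nn

    class ActiveSpeakerFusion(nn.Module):
        """Illustrative late fusion of a visual clip descriptor, an audio descriptor
        and a gaze-based context descriptor; dimensions are assumptions."""
        def __init__(self, vis_dim=512, aud_dim=128, ctx_dim=16):
            super().__init__()
            self.head = nn.Sequential(nn.Linear(vis_dim + aud_dim + ctx_dim, 128),
                                      nn.ReLU(), nn.Linear(128, 2))   # speaking / not speaking

        def forward(self, vis, aud, ctx):
            return self.head(torch.cat([vis, aud, ctx], dim=1))

    scores = ActiveSpeakerFusion()(torch.randn(4, 512), torch.randn(4, 128), torch.randn(4, 16))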

Learning to Take Directions One Step at a Time

Qiyang Hu, Adrian Wälchli, Tiziano Portenier, Matthias Zwicker, Paolo Favaro

Auto-TLDR; Generating a Sequence of Motion Strokes from a Single Image

We present a method to generate a video sequence given a single image. Because the items in an image can be animated in arbitrarily many different ways, we introduce a sequence of motion strokes as a control signal. Such a control signal can be transferred automatically from other videos, e.g., via bounding box tracking. Each motion stroke provides the direction for the moving object in the input image, and we aim to train a network to generate an animation following a sequence of such directions. To address this task we design a novel recurrent architecture, which can be trained easily and effectively thanks to an explicit separation of past, future and current states. As we demonstrate in the experiments, our proposed architecture is capable of generating an arbitrary number of frames from a single image and a sequence of motion strokes. Key components of our architecture are an autoencoding constraint to ensure consistency with the past and a generative adversarial scheme to ensure that images look realistic and are temporally smooth. We demonstrate the effectiveness of our approach on the MNIST, KTH, Human3.6M, Push and Weizmann datasets.
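
To illustrate how a per-step motion-stroke direction can condition frame generation, here is a deliberately tiny PyTorch sketch of a single recurrent step that maps the previous frame plus a 2-D direction to the next frame; the real architecture's explicit past/future/current state separation, autoencoding constraint and adversarial training are not shown, and all layer choices are assumptions.

    import torch
    import torch.nn as nn

    class MotionStrokeStep(nn.Module):
        """Illustrative recurrent step: previous frame + a 2-D stroke direction -> next frame."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Conv2d(3 + 2, 32, 3, padding=1), nn.ReLU(),
                                     nn.Conv2d(32, 3, 3, padding=1))

        def forward(self, frame, direction):            # frame: (B, 3, H, W), direction: (B, 2)
            # Broadcast the direction over the spatial grid and concatenate as extra channels.
            d = direction[:, :, None, None].expand(-1, -1, *frame.shape[2:])
            return self.net(torch.cat([frame, d], dim=1))

    next_frame = MotionStrokeStep()(torch.randn(2, 3, 64, 64), torch.randn(2, 2))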

Siamese Fully Convolutional Tracker with Motion Correction

Mathew Francis, Prithwijit Guha

Auto-TLDR; A Siamese Ensemble for Visual Tracking with Appearance and Motion Components

Visual tracking algorithms use cues such as appearance, structure and motion to locate an object in a video. We propose an ensemble tracker with appearance and motion components. The ensemble is formed by a Siamese tracker that learns object appearance from a static image and by motion vectors computed between consecutive frames with a flow network. The motion-predicted object localization is used to correct the appearance component of the ensemble. The complementary nature of the components brings performance improvements, as observed in experiments on the VOT2018 and VOT2019 datasets.
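
The appearance component matches a template against a search region by cross-correlation, as in fully convolutional Siamese trackers; the sketch below shows that matching step only, with hypothetical feature-map sizes, and leaves out the flow-based motion component and the correction mechanism.

    import torch
    import torch.nn.functional as F

    def siamese_response(search_feat, template_feat):
        """Cross-correlate a template feature map with a search-region feature map.
        search_feat: (1, C, Hs, Ws), template_feat: (1, C, Ht, Wt)."""
        return F.conv2d(search_feat, template_feat)     # (1, 1, Hs-Ht+1, Ws-Wt+1) response map

    resp = siamese_response(torch.randn(1, 256, 22, 22), torch.randn(1, 256, 6, 6))
    # The peak of 'resp' indicates the most likely object location in the search region.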

Attention-Driven Body Pose Encoding for Human Activity Recognition

Bappaditya Debnath, Swagat Kumar, Marry O'Brien, Ardhendu Behera

Auto-TLDR; Attention-based Body Pose Encoding for Human Activity Recognition

This article proposes a novel attention-based body pose encoding for human activity recognition. Most existing human activity recognition approaches based on 3D pose data enrich the input data with additional handcrafted representations such as velocity, super normal vectors, pairwise relations, and so on. The enriched data complements the 3D body joint position data and improves model performance. In this paper, we propose a novel approach that learns enhanced feature representations from a given sequence of 3D body joints. To achieve this, the approach exploits two body pose streams: 1) a spatial stream that encodes the spatial relationship between body joints at each time point, capturing the spatial structure given by the distribution of the joints; and 2) a temporal stream that learns the temporal variation of individual body joints over the entire sequence, producing a temporally enhanced representation. These two pose streams are then fused with a multi-head attention mechanism. We also capture the contextual information from the RGB video stream using a deep Convolutional Neural Network (CNN) model combined with multi-head attention and a bidirectional Long Short-Term Memory (LSTM) network. Finally, the RGB video stream is combined with the fused body pose stream to give a novel end-to-end deep model for effective human activity recognition. The proposed model is evaluated on three datasets, including the challenging NTU-RGBD dataset, and achieves state-of-the-art results.
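
As a concrete illustration of the multi-head attention fusion of the two pose streams, here is a minimal PyTorch sketch in which the spatial stream queries the temporal stream; the model dimension, head count, temporal pooling and 60-class output are assumptions, and the RGB context branch is omitted.

    import torch
    import torch.nn as nn

    class PoseStreamFusion(nn.Module):
        """Illustrative fusion of a spatial and a temporal pose stream with multi-head attention."""
        def __init__(self, d_model=256, n_heads=4, n_classes=60):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.cls = nn.Linear(d_model, n_classes)

        def forward(self, spatial_seq, temporal_seq):   # both: (B, T, d_model)
            fused, _ = self.attn(spatial_seq, temporal_seq, temporal_seq)
            return self.cls(fused.mean(dim=1))          # pool over time, then classify

    logits = PoseStreamFusion()(torch.randn(2, 30, 256), torch.randn(2, 30, 256))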