Explore and Explain: Self-Supervised Navigation and Recounting

Roberto Bigazzi, Federico Landi, Marcella Cornia, Silvia Cascianelli, Lorenzo Baraldi, Rita Cucchiara

Responsive image

Auto-TLDR; Exploring a Photorealistic Environment for Explanation and Navigation

Slides

Embodied AI has been recently gaining attention as it aims to foster the development of autonomous and intelligent agents. In this paper, we devise a novel embodied setting in which an agent needs to explore a previously unknown environment while recounting what it sees during the path. In this context, the agent needs to navigate the environment driven by an exploration goal, select proper moments for description, and output natural language descriptions of relevant objects and scenes. Our model integrates a novel self-supervised exploration module with penalty, and a fully-attentive captioning model for explanation. Also, we investigate different policies for selecting proper moments for explanation, driven by information coming from both the environment and the navigation. Experiments are conducted on photorealistic environments from the Matterport3D dataset and investigate the navigation and explanation capabilities of the agent as well as the role of their interactions.

Similar papers

A Novel Attention-Based Aggregation Function to Combine Vision and Language

Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

Responsive image

Auto-TLDR; Fully-Attentive Reduction for Vision and Language

Slides Poster Similar

The joint understanding of vision and language has been recently gaining a lot of attention in both the Computer Vision and Natural Language Processing communities, with the emergence of tasks such as image captioning, image-text matching, and visual question answering. As both images and text can be encoded as sets or sequences of elements - like regions and words - proper reduction functions are needed to transform a set of encoded elements into a single response, like a classification or similarity score. In this paper, we propose a novel fully-attentive reduction method for vision and language. Specifically, our approach computes a set of scores for each element of each modality employing a novel variant of cross-attention, and performs a learnable and cross-modal reduction, which can be used for both classification and ranking. We test our approach on image-text matching and visual question answering, building fair comparisons with other reduction choices, on both COCO and VQA 2.0 datasets. Experimentally, we demonstrate that our approach leads to a performance increase on both tasks. Further, we conduct ablation studies to validate the role of each component of the approach.

A Novel Actor Dual-Critic Model for Remote Sensing Image Captioning

Ruchika Chavhan, Biplab Banerjee, Xiao Xiang Zhu, Subhasis Chaudhuri

Responsive image

Auto-TLDR; Actor Dual-Critic Training for Remote Sensing Image Captioning Using Deep Reinforcement Learning

Slides Poster Similar

We deal with the problem of generating textual captions from optical remote sensing (RS) images using the notion of deep reinforcement learning. Due to the high inter-class similarity in reference sentences describing remote sensing data, jointly encoding the sentences and images encourages prediction of captions that are semantically more precise than the ground truth in many cases. To this end, we introduce an Actor Dual-Critic training strategy where a second critic model is deployed in the form of an encoder-decoder RNN to encode the latent information corresponding to the original and generated captions. While all actor-critic methods use an actor to predict sentences for an image and a critic to provide rewards, our proposed encoder-decoder RNN guarantees high-level comprehension of images by sentence-to-image translation. We observe that the proposed model generates sentences on the test data highly similar to the ground truth and is successful in generating even better captions in many critical cases. Extensive experiments on the benchmark Remote Sensing Image Captioning Dataset (RSICD) and the UCM-captions dataset confirm the superiority of the proposed approach in comparison to the previous state-of-the-art where we obtain a gain of sharp increments in both the ROUGE-L and CIDEr measures.

Object-Oriented Map Exploration and Construction Based on Auxiliary Task Aided DRL

Junzhe Xu, Jianhua Zhang, Shengyong Chen, Honghai Liu

Responsive image

Auto-TLDR; Auxiliary Task Aided Deep Reinforcement Learning for Environment Exploration by Autonomous Robots

Similar

Environment exploration by autonomous robots through deep reinforcement learning (DRL) based methods has attracted more and more attention. However, existing methods usually focus on robot navigation to single or multiple fixed goals, while ignoring the perception and construction of external environments. In this paper, we propose a novel environment exploration task based on DRL, which requires a robot fast and completely perceives all objects of interest, and reconstructs their poses in a global environment map, as much as the robot can do. To this end, we design an auxiliary task aided DRL model, which is integrated with the auxiliary object detection and 6-DoF pose estimation components. The outcome of auxiliary tasks can improve the learning speed and robustness of DRL, as well as the accuracy of object pose estimation. Comprehensive experimental results on the indoor simulation platform AI2-THOR have shown the effectiveness and robustness of our method.

Transformer Reasoning Network for Image-Text Matching and Retrieval

Nicola Messina, Fabrizio Falchi, Andrea Esuli, Giuseppe Amato

Responsive image

Auto-TLDR; A Transformer Encoder Reasoning Network for Image-Text Matching in Large-Scale Information Retrieval

Slides Poster Similar

Image-text matching is an interesting and fascinating task in modern AI research. Despite the evolution of deep-learning-based image and text processing systems, multi-modal matching remains a challenging problem. In this work, we consider the problem of accurate image-text matching for the task of multi-modal large-scale information retrieval. State-of-the-art results in image-text matching are achieved by inter-playing image and text features from the two different processing pipelines, usually using mutual attention mechanisms. However, this invalidates any chance to extract separate visual and textual features needed for later indexing steps in large-scale retrieval systems. In this regard, we introduce the Transformer Encoder Reasoning Network (TERN), an architecture built upon one of the modern relationship-aware self-attentive architectures, the Transformer Encoder (TE). This architecture is able to separately reason on the two different modalities and to enforce a final common abstract concept space by sharing the weights of the deeper transformer layers. Thanks to this design, the implemented network is able to produce compact and very rich visual and textual features available for the successive indexing step. Experiments are conducted on the MS-COCO dataset, and we evaluate the results using a discounted cumulative gain metric with relevance computed exploiting caption similarities, in order to assess possibly non-exact but relevant search results. We demonstrate that on this metric we are able to achieve state-of-the-art results in the image retrieval task. Our code is freely available at https://github.com/mesnico/TERN.

On Embodied Visual Navigation in Real Environments through Habitat

Marco Rosano, Antonino Furnari, Luigi Gulino, Giovanni Maria Farinella

Responsive image

Auto-TLDR; Learning Navigation Policies on Real World Observations using Real World Images and Sensor and Actuation Noise

Slides Poster Similar

Visual navigation models based on deep learning can learn effective policies when trained on large amounts of visual observations through reinforcement learning. Unfortunately, collecting the required experience deploying a robotic platform in the real world is expensive and time-consuming. To deal with this limitation, several simulation platforms have been proposed in order to train visual navigation policies on virtual environments efficiently. Despite the advantages they offer, simulators present a limited realism in terms of appearance and physical dynamics, leading to navigation policies that do not generalize in the real world. In this paper, we propose a tool based on the Habitat simulator which exploits real world images of the environment, together with sensor and actuator noise models, to produce more realistic navigation episodes. We perform a range of experiments using virtual, real and images transformed with a simple domain adaptation approach. We also assess the impact of sensor and actuation noise on the navigation performance and investigate whether they allow to learn more robust navigation policies. We show that our tool can effectively help to train and evaluate navigation policies on real world observations without running navigation episodes in the real world.

Deep Reinforcement Learning on a Budget: 3D Control and Reasoning without a Supercomputer

Edward Beeching, Jilles Steeve Dibangoye, Olivier Simonin, Christian Wolf

Responsive image

Auto-TLDR; Deep Reinforcement Learning in Mobile Robots Using 3D Environment Scenarios

Slides Poster Similar

An important goal of research in Deep Reinforcement Learning in mobile robotics is to train agents capableof solving complex tasks, which require a high level of scene understanding and reasoning from an egocentric perspective.When trained from simulations, optimal environments should satisfy a currently unobtainable combination of high-fidelity photographic observations, massive amounts of different environment configurations and fast simulation speeds. In this paper we argue that research on training agents capable of complex reasoning can be simplified by decoupling from the requirement of high fidelity photographic observations. We present a suite of tasks requiring complex reasoning and exploration in continuous,partially observable 3D environments. The objective is to provide challenging scenarios and a robust baseline agent architecture that can be trained on mid-range consumer hardware in under 24h. Our scenarios combine two key advantages: (i) they are based on a simple but highly efficient 3D environment (ViZDoom)which allows high speed simulation (12000fps); (ii) the scenarios provide the user with a range of difficulty settings, in order to identify the limitations of current state of the art algorithms and network architectures. We aim to increase accessibility to the field of Deep-RL by providing baselines for challenging scenarios where new ideas can be iterated on quickly. We argue that the community should be able to address challenging problems in reasoning of mobile agents without the need for a large compute infrastructure.

Low Dimensional State Representation Learning with Reward-Shaped Priors

Nicolò Botteghi, Ruben Obbink, Daan Geijs, Mannes Poel, Beril Sirmacek, Christoph Brune, Abeje Mersha, Stefano Stramigioli

Responsive image

Auto-TLDR; Unsupervised Learning for Unsupervised Reinforcement Learning in Robotics

Poster Similar

Reinforcement Learning has been able to solve many complicated robotics tasks without any need of feature engineering in an end-to-end fashion. However, learning the optimal policy directly from the sensory inputs, i.e the observations, often requires processing and storage of huge amount of data. In the context of robotics, the cost of data from real robotics hardware is usually very high, thus solutions that achieves high sample-efficiency are needed. We propose a method that aims at learning a mapping from the observations into a lower dimensional state space. This mapping is learned with unsupervised learning using loss functions shaped to incorporate prior knowledge of the environment and the task. Using the samples from the state space, the optimal policy is quickly and efficiently learned. We test the method on several mobile robot navigation tasks in simulation environment and also on a real robot.

Enriching Video Captions with Contextual Text

Philipp Rimle, Pelin Dogan, Markus Gross

Responsive image

Auto-TLDR; Contextualized Video Captioning Using Contextual Text

Slides Poster Similar

Understanding video content and generating caption with context is an important and challenging task. Unlike prior methods that typically attempt to generate generic video captions without context, our architecture contextualizes captioning by infusing extracted information from relevant text data. We propose an end-to-end sequence-to-sequence model which generates video captions based on visual input, and mines relevant knowledge such as names and locations from contextual text. In contrast to previous approaches, we do not preprocess the text further, and let the model directly learn to attend over it. Guided by the visual input, the model is able to copy words from the contextual text via a pointer-generator network, allowing to produce more specific video captions. We show competitive performance on the News Video Dataset and, through ablation studies, validate the efficacy of contextual video captioning as well as individual design choices in our model architecture.

Transformer Networks for Trajectory Forecasting

Francesco Giuliari, Hasan Irtiza, Marco Cristani, Fabio Galasso

Responsive image

Auto-TLDR; TransformerNetworks for Trajectory Prediction of People Interactions

Slides Poster Similar

Most recent successes on forecasting the people mo-tion are based on LSTM models andallmost recent progress hasbeen achieved by modelling the social interaction among peopleand the people interaction with the scene. We question the useof the LSTM models and propose the novel use of TransformerNetworks for trajectory forecasting. This is a fundamental switchfrom the sequential step-by-step processing of LSTMs to theonly-attention-based memory mechanisms of Transformers. Inparticular, we consider both the original Transformer Network(TF) and the larger Bidirectional Transformer (BERT), state-of-the-art on all natural language processing tasks. Our proposedTransformers predict the trajectories of the individual peoplein the scene. These are “simple” models because each personis modelled separately without any complex human-human norscene interaction terms. In particular, the TF modelwithoutbells and whistlesyields the best score on the largest and mostchallenging trajectory forecasting benchmark of TrajNet [1]. Ad-ditionally, its extension which predicts multiple plausible futuretrajectories performs on par with more engineered techniqueson the 5 datasets of ETH [2]+UCY [3]. Finally, we showthat Transformers may deal with missing observations, as itmay be the case with real sensor data. Code is available atgithub.com/FGiuliari/Trajectory-Transformer

DAG-Net: Double Attentive Graph Neural Network for Trajectory Forecasting

Alessio Monti, Alessia Bertugli, Simone Calderara, Rita Cucchiara

Responsive image

Auto-TLDR; Recurrent Generative Model for Multi-modal Human Motion Behaviour in Urban Environments

Slides Poster Similar

Understanding human motion behaviour is a critical task for several possible applications like self-driving cars or social robots, and in general for all those settings where an autonomous agent has to navigate inside a human-centric environment. This is non-trivial because human motion is inherently multi-modal: given a history of human motion paths, there are many plausible ways by which people could move in the future. Additionally, people activities are often driven by goals, e.g. reaching particular locations or interacting with the environment. We address both the aforementioned aspects by proposing a new recurrent generative model that considers both single agents’ future goals and interactions between different agents. The model exploits a double attention-based graph neural network to collect information about the mutual influences among different agents and integrates it with data about agents’ possible future objectives. Our proposal is general enough to be applied in different scenarios: the model achieves state-of-the-art results in both urban environments and also in sports applications.

SAILenv: Learning in Virtual Visual Environments Made Simple

Enrico Meloni, Luca Pasqualini, Matteo Tiezzi, Marco Gori, Stefano Melacci

Responsive image

Auto-TLDR; SAILenv: A Simple and Customized Platform for Visual Recognition in Virtual 3D Environment

Slides Poster Similar

Recently, researchers in Machine Learning algorithms, Computer Vision scientists, engineers and others, showed a growing interest in 3D simulators as a mean to artificially create experimental settings that are very close to those in the real world. However, most of the existing platforms to interface algorithms with 3D environments are often designed to setup navigation-related experiments, to study physical interactions, or to handle ad-hoc cases that are not thought to be customized, sometimes lacking a strong photorealistic appearance and an easy-to-use software interface. In this paper, we present a novel platform, SAILenv, that is specifically designed to be simple and customizable, and that allows researchers to experiment visual recognition in virtual 3D scenes. A few lines of code are needed to interface every algorithm with the virtual world, and non-3D-graphics experts can easily customize the 3D environment itself, exploiting a collection of photorealistic objects. Our framework yields pixel-level semantic and instance labeling, depth, and, to the best of our knowledge, it is the only one that provides motion-related information directly inherited from the 3D engine. The client-server communication operates at a low level, avoiding the overhead of HTTP-based data exchanges. We perform experiments using a state-of-the-art object detector trained on real-world images, showing that it is able to recognize the photorealistic 3D objects of our environment. The computational burden of the optical flow compares favourably with the estimation performed using modern GPU-based convolutional networks or more classic implementations. We believe that the scientific community will benefit from the easiness and high-quality of our framework to evaluate newly proposed algorithms in their own customized realistic conditions.

A Bayesian Approach to Reinforcement Learning of Vision-Based Vehicular Control

Zahra Gharaee, Karl Holmquist, Linbo He, Michael Felsberg

Responsive image

Auto-TLDR; Bayesian Reinforcement Learning for Autonomous Driving

Slides Poster Similar

In this paper, we present a state-of-the-art reinforcement learning method for autonomous driving. Our approach employs temporal difference learning in a Bayesian framework to learn vehicle control signals from sensor data. The agent has access to images from a forward facing camera, which are pre-processed to generate semantic segmentation maps. We trained our system using both ground truth and estimated semantic segmentation input. Based on our observations from a large set of experiments, we conclude that training the system on ground truth input data leads to better performance than training the system on estimated input even if estimated input is used for evaluation. The system is trained and evaluated in a realistic simulated urban environment using the CARLA simulator. The simulator also contains a benchmark that allows for comparing to other systems and methods. The required training time of the system is shown to be lower and the performance on the benchmark superior to competing approaches.

Learning from Learners: Adapting Reinforcement Learning Agents to Be Competitive in a Card Game

Pablo Vinicius Alves De Barros, Ana Tanevska, Alessandra Sciutti

Responsive image

Auto-TLDR; Adaptive Reinforcement Learning for Competitive Card Games

Slides Poster Similar

Learning how to adapt to complex and dynamic environments is one of the most important factors that contribute to our intelligence. Endowing artificial agents with this ability is not a simple task, particularly in competitive scenarios. In this paper, we present a broad study on how popular reinforcement learning algorithms can be adapted and implemented to learn and to play a real-world implementation of a competitive multiplayer card game. We propose specific training and validation routines for the learning agents, in order to evaluate how the agents learn to be competitive and explain how they adapt to each others' playing style. Finally, we pinpoint how the behavior of each agent derives from their learning style and create a baseline for future research on this scenario.

Attentive Visual Semantic Specialized Network for Video Captioning

Jesus Perez-Martin, Benjamin Bustos, Jorge Pérez

Responsive image

Auto-TLDR; Adaptive Visual Semantic Specialized Network for Video Captioning

Slides Poster Similar

As an essential high-level task of video understanding topic, automatically describing a video with natural language has recently gained attention as a fundamental challenge in computer vision. Previous models for video captioning have several limitations, such as the existence of gaps in current semantic representations and the inexpressibility of the generated captions. To deal with these limitations, in this paper, we present a new architecture that we callAttentive Visual Semantic Specialized Network(AVSSN), which is an encoder-decoder model based on our Adaptive Attention Gate and Specialized LSTM layers. This architecture can selectively decide when to use visual or semantic information into the text generation process. The adaptive gate makes the decoder to automatically select the relevant information for providing a better temporal state representation than the existing decoders. Besides, the model is capable of learning to improve the expressiveness of generated captions attending to their length, using a sentence-length-related loss function. We evaluate the effectiveness of the proposed approach on the Microsoft Video Description(MSVD) and the Microsoft Research Video-to-Text (MSR-VTT) datasets, achieving state-of-the-art performance with several popular evaluation metrics: BLEU-4, METEOR, CIDEr, and ROUGE_L.

AVD-Net: Attention Value Decomposition Network for Deep Multi-Agent Reinforcement Learning

Zhang Yuanxin, Huimin Ma, Yu Wang

Responsive image

Auto-TLDR; Attention Value Decomposition Network for Cooperative Multi-agent Reinforcement Learning

Slides Poster Similar

Multi-agent reinforcement learning (MARL) is of importance for variable real-world applications but remains more challenges like stationarity and scalability. While recently value function factorization methods have obtained empirical good results in cooperative multi-agent environment, these works mostly focus on the decomposable learning structures. Inspired by the application of attention mechanism in machine translation and other related domains, we propose an attention based approach called attention value decomposition network (AVD-Net), which capitalizes on the coordination relations between agents. AVD-Net employs centralized training with decentralized execution (CTDE) paradigm, which factorizes the joint action-value functions with only local observations and actions of agents. Our method is evaluated on multi-agent particle environment (MPE) and StarCraft micromanagement environment (SMAC). The experiment results show the strength of our approach compared to existing methods with state-of-the-art performance in cooperative scenarios.

Improving Visual Question Answering Using Active Perception on Static Images

Theodoros Bozinis, Nikolaos Passalis, Anastasios Tefas

Responsive image

Auto-TLDR; Fine-Grained Visual Question Answering with Reinforcement Learning-based Active Perception

Slides Poster Similar

Visual Question Answering (VQA) is one of the most challenging emerging applications of deep learning. Providing powerful attention mechanisms is crucial for VQA, since the model must correctly identify the region of an image that is relevant to the question at hand. However, existing models analyze the input images at a fixed and typically small resolution, often leading to discarding valuable fine-grained details. To overcome this limitation, in this work we propose a reinforcement learning-based active perception approach that works by applying a series of transformation operations on the images (translation, zoom) in order to facilitate answering the question at hand. This allows for performing fine-grained analysis, effectively increasing the resolution at which the models process information. The proposed method is orthogonal to existing attention mechanisms and it can be combined with most existing VQA methods. The effectiveness of the proposed method is experimentally demonstrated on a challenging VQA dataset.

Meta Learning Via Learned Loss

Sarah Bechtle, Artem Molchanov, Yevgen Chebotar, Edward Thomas Grefenstette, Ludovic Righetti, Gaurav Sukhatme, Franziska Meier

Responsive image

Auto-TLDR; meta-learning for learning parametric loss functions that generalize across different tasks and model architectures

Slides Similar

Typically, loss functions, regularization mechanisms and other important aspects of training parametric models are chosen heuristically from a limited set of options. In this paper, we take the first step towards automating this process, with the view of producing models which train faster and more robustly. Concretely, we present a meta-learning method for learning parametric loss functions that can generalize across different tasks and model architectures. We develop a pipeline for “meta-training” such loss functions, targeted at maximizing the performance of the model trained under them. The loss landscape produced by our learned losses significantly improves upon the original task-specific losses in both supervised and reinforcement learning tasks. Furthermore, we show that our meta-learning framework is flexible enough to incorporate additional information at meta-train time. This information shapes the learned loss function such that the environment does not need to provide this information during meta-test time.

Integrating Historical States and Co-Attention Mechanism for Visual Dialog

Tianling Jiang, Yi Ji, Chunping Liu

Responsive image

Auto-TLDR; Integrating Historical States and Co-attention for Visual Dialog

Slides Poster Similar

Visual dialog is a typical multi-modal task which involves both vision and language. Nowadays, it faces two major difficulties. In this paper, we propose Integrating Historical States and Co-attention (HSCA) for visual dialog to solve them. It includes two main modules, Co-ATT and MATCH. Specifically, the main purpose of the Co-ATT module is to guide the image with questions and answers in the early stage to get more specific objects. It tackles the temporal sequence issue in historical information which may influence the precise answer for multi-round questions. The MATCH module is, based on a question with pronouns, to retrieve the best matching historical information block. It overcomes the visual reference problem which requires to solve pronouns referring to unknowns in the text message and then to locate the objects in the given image. We quantitatively and qualitatively evaluate our model on VisDial v1.0, at the same time, ablation studies are carried out. The experimental results demonstrate that HSCA outperforms the state-of-the-art methods in many aspects.

Visual Oriented Encoder: Integrating Multimodal and Multi-Scale Contexts for Video Captioning

Bang Yang, Yuexian Zou

Responsive image

Auto-TLDR; Visual Oriented Encoder for Video Captioning

Slides Poster Similar

Video captioning is a challenging task which aims at automatically generating a natural language description of a given video. Recent researches have shown that exploiting the intrinsic multi-modalities of videos significantly promotes captioning performance. However, how to integrate multi-modalities to generate effective semantic representations for video captioning is still an open issue. Some researchers proposed to learn multimodal features in parallel during the encoding stage. The downside of these methods lies in the neglect of the interaction among multi-modalities and their rich contextual information. In this study, inspired by the fact that visual contents are generally more important for comprehending videos, we propose a novel Visual Oriented Encoder (VOE) to integrate multimodal features in an interactive manner. Specifically, VOE is designed as a hierarchical structure, where bottom layers are utilized to extract multi-scale contexts from auxiliary modalities while the top layer is exploited to generate joint representations by considering both visual and contextual information. Following the encoder-decoder framework, we systematically develop a VOE-LSTM model and evaluate it on two mainstream benchmarks: MSVD and MSR-VTT. Experimental results show that the proposed VOE surpasses conventional encoders and our VOE-LSTM model achieves competitive results compared with state-of-the-art approaches.

Text Synopsis Generation for Egocentric Videos

Aidean Sharghi, Niels Lobo, Mubarak Shah

Responsive image

Auto-TLDR; Egocentric Video Summarization Using Multi-task Learning for End-to-End Learning

Slides Similar

Mass utilization of body-worn cameras has led to a huge corpus of available egocentric video. Existing video summarization algorithms can accelerate browsing such videos by selecting (visually) interesting shots from them. Nonetheless, since the system user still has to watch the summary videos, browsing large video databases remain a challenge. Hence, in this work, we propose to generate a textual synopsis, consisting of a few sentences describing the most important events in a long egocentric videos. Users can read the short text to gain insight about the video, and more importantly, efficiently search through the content of a large video database using text queries. Since egocentric videos are long and contain many activities and events, using video-to-text algorithms results in thousands of descriptions, many of which are incorrect. Therefore, we propose a multi-task learning scheme to simultaneously generate descriptions for video segments and summarize the resulting descriptions in an end-to-end fashion. We Input a set of video shots and the network generates a text description for each shot. Next, visual-language content matching unit that is trained with a weakly supervised objective, identifies the correct descriptions. Finally, the last component of our network, called purport network, evaluates the descriptions all together to select the ones containing crucial information. Out of thousands of descriptions generated for the video, a few informative sentences are returned to the user. We validate our framework on the challenging UT Egocentric video dataset, where each video is between 3 to 5 hours long, associated with over 3000 textual descriptions on average. The generated textual summaries, including only 5 percent (or less) of the generated descriptions, are compared to groundtruth summaries in text domain using well-established metrics in natural language processing.

Detecting and Adapting to Crisis Pattern with Context Based Deep Reinforcement Learning

Eric Benhamou, David Saltiel Saltiel, Jean-Jacques Ohana Ohana, Jamal Atif Atif

Responsive image

Auto-TLDR; Deep Reinforcement Learning for Financial Crisis Detection and Dis-Investment

Slides Poster Similar

Deep reinforcement learning (DRL) has reached super human levels in complexes tasks like game solving (Go, StarCraft II), and autonomous driving. However, it remains an open question whether DRL can reach human level in applications to financial problems and in particular in detecting pattern crisis and consequently dis-investing. In this paper, we present an innovative DRL framework consisting in two sub-networks fed respectively with portfolio strategies past performances and standard deviation as well as additional contextual features. The second sub network plays an important role as it captures dependencies with common financial indicators features like risk aversion, economic surprise index and correlations between assets that allows taking into account context based information. We compare different network architectures either using layers of convolutions to reduce network's complexity or LSTM block to capture time dependency and whether previous allocations is important in the modeling. We also use adversarial training to make the final model more robust. Results on test set show this approach substantially over-performs traditional portfolio optimization methods like Markovitz and is able to detect and anticipate crisis like the current Covid one.

Detective: An Attentive Recurrent Model for Sparse Object Detection

Amine Kechaou, Manuel Martinez, Monica Haurilet, Rainer Stiefelhagen

Responsive image

Auto-TLDR; Detective: An attentive object detector that identifies objects in images in a sequential manner

Slides Poster Similar

In this work, we present Detective – an attentive object detector that identifies objects in images in a sequential manner. Our network is based on an encoder-decoder architecture, where the encoder is a convolutional neural network, and the decoder is a convolutional recurrent neural network coupled with an attention mechanism. At each iteration, our decoder focuses on the relevant parts of the image using an attention mechanism, and then estimates the object’s class and the bounding box coordinates. Current object detection models generate dense predictions and rely on post-processing to remove duplicate predictions. Detective is a sparse object detector that generates a single bounding box per object instance. However, training a sparse object detector is challenging, as it requires the model to reason at the instance level and not just at the class and spatial levels. We propose a training mechanism based on the Hungarian Algorithm and a loss that balances the localization and classification tasks. This allows Detective to achieve promising results on the PASCAL VOC object detection dataset. Our experiments demonstrate that sparse object detection is possible and has a great potential for future developments in applications where the order of the objects to be predicted is of interest.

Context Matters: Self-Attention for Sign Language Recognition

Fares Ben Slimane, Mohamed Bouguessa

Responsive image

Auto-TLDR; Attentional Network for Continuous Sign Language Recognition

Slides Poster Similar

This paper proposes an attentional network for the task of Continuous Sign Language Recognition. The proposed approach exploits co-independent streams of data to model the sign language modalities. These different channels of information can share a complex temporal structure between each other. For that reason, we apply attention to synchronize and help capture entangled dependencies between the different sign language components. Even though Sign Language is multi-channel, handshapes represent the central entities in sign interpretation. Seeing handshapes in their correct context defines the meaning of a sign. Taking that into account, we utilize the attention mechanism to efficiently aggregate the hand features with their appropriate Spatio-temporal context for better sign recognition. We found that by doing so the model is able to identify the essential Sign Language components that revolve around the dominant hand and the face areas. We test our model on the benchmark dataset RWTH-PHOENIX-Weather 2014, yielding competitive results.

Dual Path Multi-Modal High-Order Features for Textual Content Based Visual Question Answering

Yanan Li, Yuetan Lin, Hongrui Zhao, Donghui Wang

Responsive image

Auto-TLDR; TextVQA: An End-to-End Visual Question Answering Model for Text-Based VQA

Slides Similar

As a typical cross-modal problem, visual question answering (VQA) has received increasing attention from the communities of computer vision and natural language processing. Reading and reasoning about texts and visual contents in the images is a burgeoning and important research topic in VQA, especially for the visually impaired assistance applications. Given an image, it aims to predict an answer to a provided natural language question closely related to its textual contents. In this paper, we propose a novel end-to-end textual content based VQA model, which grounds question answering both on the visual and textual information. After encoding the image, question and recognized text words, it uses multi-modal factorized high-order modules and the attention mechanism to fuse question-image and question-text features respectively. The complex correlations among different features can be captured efficiently. To ensure the model's extendibility, it embeds candidate answers and recognized texts in a semantic embedding space and adopts semantic embedding based classifier to perform answer prediction. Extensive experiments on the newly proposed benchmark TextVQA demonstrate that the proposed model can achieve promising results.

Learning with Delayed Feedback

Pranavan Theivendiram, Terence Sim

Responsive image

Auto-TLDR; Unsupervised Machine Learning with Delayed Feedback

Slides Poster Similar

We propose a novel supervised machine learning strategy, inspired by human learning, that enables an Agent to learn continually over its lifetime. A natural consequence is that the Agent must be able to handle an input whose label is delayed until a later time, or may not arrive at all. Our Agent learns in two steps: a short Seeding phase, in which the Agent's model is initialized with labelled inputs, and an indefinitely long Growing phase, in which the Agent refines and assesses its model if the label is given for an input, but stores the input in a finite-length queue if the label is missing. Queued items are matched against future input-label pairs that arrive, and the model is then updated. Our strategy also allows for the delayed feedback to take a different form. For example, in an image captioning task, the feedback could be a semantic segmentation rather than a textual caption. We show with many experiments that our strategy enables an Agent to learn flexibly and efficiently.

Vacant Parking Space Detection Based on Task Consistency and Reinforcement Learning

Manh Hung Nguyen, Tzu-Yin Chao, Ching-Chun Huang

Responsive image

Auto-TLDR; Vacant Space Detection via Semantic Consistency Learning

Slides Poster Similar

In this paper, we proposed a novel task-consistency learning method that allows training a vacant space detection network (target task) based on the logistic consistency with the semantic outcomes from a naive flow-based motion behavior classifier (source task) in a parking lot. By well designing the reward mechanism upon semantic consistency, we show the possibility to train the target network in a reinforcement learning setting. Compared with conventional supervised detection methods, the major contribution of this work is to learn a vacant space detector via semantic consistency rather than supervised labels. The dynamic learning property may make the proposed detector been deployed in different lots easily without heavy training loads. The experiments show that based on the task consistency rewards from the motion behavior classifier, the vacant space detector can be trained successfully.

ActionSpotter: Deep Reinforcement Learning Framework for Temporal Action Spotting in Videos

Guillaume Vaudaux-Ruth, Adrien Chan-Hon-Tong, Catherine Achard

Responsive image

Auto-TLDR; ActionSpotter: A Reinforcement Learning Algorithm for Action Spotting in Video

Slides Poster Similar

Action spotting has recently been proposed as an alternative to action detection and key frame extraction. However, the current state-of-the-art method of action spotting requires an expensive ground truth composed of the search sequences employed by human annotators spotting actions - a critical limitation. In this article, we propose to use a reinforcement learning algorithm to perform efficient action spotting using only the temporal segments from the action detection annotations, thus opening an interesting solution for video understanding. Experiments performed on THUMOS14 and ActivityNet datasets show that the proposed method, named ActionSpotter, leads to good results and outperforms state-of-the-art detection outputs redrawn for this application. In particular, the spotting mean Average Precision on THUMOS14 is significantly improved from 59.7% to 65.6% while skipping 23% of video.

Knowledge Distillation for Action Anticipation Via Label Smoothing

Guglielmo Camporese, Pasquale Coscia, Antonino Furnari, Giovanni Maria Farinella, Lamberto Ballan

Responsive image

Auto-TLDR; A Multi-Modal Framework for Action Anticipation using Long Short-Term Memory Networks

Slides Poster Similar

Human capability to anticipate near future from visual observations and non-verbal cues is essential for developing intelligent systems that need to interact with people. Several research areas, such as human-robot interaction (HRI), assisted living or autonomous driving need to foresee future events to avoid crashes or help people. Egocentric scenarios are classic examples where action anticipation is applied due to their numerous applications. Such challenging task demands to capture and model domain's hidden structure to reduce prediction uncertainty. Since multiple actions may equally occur in the future, we treat action anticipation as a multi-label problem with missing labels extending the concept of label smoothing. This idea resembles the knowledge distillation process since useful information is injected into the model during training. We implement a multi-modal framework based on long short-term memory (LSTM) networks to summarize past observations and make predictions at different time steps. We perform extensive experiments on EPIC-Kitchens and EGTEA Gaze+ datasets including more than 2500 and 100 action classes, respectively. The experiments show that label smoothing systematically improves performance of state-of-the-art models for action anticipation.

MAGNet: Multi-Region Attention-Assisted Grounding of Natural Language Queries at Phrase Level

Amar Shrestha, Krittaphat Pugdeethosapol, Haowen Fang, Qinru Qiu

Responsive image

Auto-TLDR; MAGNet: A Multi-Region Attention-Aware Grounding Network for Free-form Textual Queries

Slides Poster Similar

Grounding free-form textual queries necessitates an understanding of these textual phrases and its relation to the visual cues to reliably reason about the described locations. Spatial attention networks are known to learn this relationship and focus its gaze on salient objects in the image. Thus, we propose to utilize spatial attention networks for image-level visual-textual fusion preserving local (word) and global (phrase) information to refine region proposals with an in-network Region Proposal Network (RPN) and detect single or multiple regions for a phrase query. We focus only on the phrase query - ground truth pair (referring expression) for a model independent of the constraints of the datasets i.e. additional attributes, context etc. For such referring expression dataset ReferIt game, our Multi- region Attention-assisted Grounding network (MAGNet) achieves over 12% improvement over the state-of-the-art. Without the con- text from image captions and attribute information in Flickr30k Entities, we still achieve competitive results compared to the state- of-the-art.

Self-Play or Group Practice: Learning to Play Alternating Markov Game in Multi-Agent System

Chin-Wing Leung, Shuyue Hu, Ho-Fung Leung

Responsive image

Auto-TLDR; Group Practice for Deep Reinforcement Learning

Slides Poster Similar

The research in reinforcement learning has achieved great success in strategic game playing. These successes are thanks to the incorporation of deep reinforcement learning (DRL) and Monte Carlo Tree Search (MCTS) to the agent trained under the self-play (SP) environment. By self-play, agents are provided with an incrementally more difficult curriculum which in turn facilitate learning. However, recent research suggests that agents trained via self-play may easily lead to getting stuck in local equilibria. In this paper, we consider a population of agents each independently learns to play an alternating Markov game (AMG). We propose a new training framework---group practice---for a population of decentralized RL agents. By group practice (GP), agents are assigned into multiple learning groups during training, for every episode of games, an agent is randomly paired up and practices with another agent in the learning group. The convergence result to the optimal value function and the Nash equilibrium are proved under the GP framework. Experimental study is conducted by applying GP to Q-learning algorithm and the deep Q-learning with Monte-Carlo tree search on the game of Connect Four and the game of Hex. We verify that GP is the more efficient training scheme than SP given the same amount of training. We also show that the learning effectiveness can even be improved when applying local grouping to agents.

Context Visual Information-Based Deliberation Network for Video Captioning

Min Lu, Xueyong Li, Caihua Liu

Responsive image

Auto-TLDR; Context visual information-based deliberation network for video captioning

Slides Poster Similar

Video captioning is to automatically and accurately generate a textual description for a video. The typical methods following the encoder-decoder architecture directly utilized hidden states to predict words. Nevertheless, these methods did not amend the inaccurate hidden states before feeding those states into word prediction. This led to a cascade of errors on generating word by word. In this paper, the context visual information-based deliberation network is proposed, abbreviated as CVI-DelNet. Its key idea is to introduce the deliberator into the encoder-decoder framework. The encoder-decoder firstly generates a raw hidden state sequence. Unlike the existing methods, the raw hidden state is no more directly used for word prediction but is fed into the deliberator to generate the refined hidden state. The words are then predicted according to the refined hidden states and the contextual visual features. Results on two datasets shows that the proposed method significantly outperforms the baselines.

Trajectory Representation Learning for Multi-Task NMRDP Planning

Firas Jarboui, Vianney Perchet

Responsive image

Auto-TLDR; Exploring Non Markovian Reward Decision Processes for Reinforcement Learning

Slides Poster Similar

Expanding Non Markovian Reward Decision Processes (NMRDP) into Markov Decision Processes (MDP) enables the use of state of the art Reinforcement Learning (RL) techniques to identify optimal policies. In this paper an approach to exploring NMRDPs and expanding them into MDPs, without the prior knowledge of the reward structure, is proposed. The non Markovianity of the reward function is disentangled under the assumption that sets of similar and dissimilar trajectory batches can be sampled. More precisely, within the same batch, measuring the similarity between any couple of trajectories is permitted, although comparing trajectories from different batches is not possible. A modified version of the triplet loss is optimised to construct a representation of the trajectories under which rewards become Markovian.

Question-Agnostic Attention for Visual Question Answering

Moshiur R Farazi, Salman Hameed Khan, Nick Barnes

Responsive image

Auto-TLDR; Question-Agnostic Attention for Visual Question Answering

Slides Poster Similar

Visual Question Answering (VQA) models employ attention mechanisms to discover image locations that are most relevant for answering a specific question. For this purpose, several multimodal fusion strategies have been proposed, ranging from relatively simple operations (e.g., linear sum) to more complex ones (e.g., Block). The resulting multimodal representations define an intermediate feature space for capturing the interplay between visual and semantic features, that is helpful in selectively focusing on image content. In this paper, we propose a question-agnostic attention mechanism that is complementary to the existing question-dependent attention mechanisms. Our proposed model parses object instances to obtain an `object map' and applies this map on the visual features to generate Question-Agnostic Attention (QAA) features. In contrast to question-dependent attention approaches that are learned end-to-end, the proposed QAA does not involve question-specific training, and can be easily included in almost any existing VQA model as a generic light-weight pre-processing step, thereby adding minimal computation overhead for training. Further, when used in complement with the question-dependent attention, the QAA allows the model to focus on the regions containing objects that might have been overlooked by the learned attention representation. Through extensive evaluation on VQAv1, VQAv2 and TDIUC datasets, we show that incorporating complementary QAA allows state-of-the-art VQA models to perform better, and provides significant boost to simplistic VQA models, enabling them to performance on par with highly sophisticated fusion strategies.

AG-GAN: An Attentive Group-Aware GAN for Pedestrian Trajectory Prediction

Yue Song, Niccolò Bisagno, Syed Zohaib Hassan, Nicola Conci

Responsive image

Auto-TLDR; An attentive group-aware GAN for motion prediction in crowded scenarios

Slides Poster Similar

Understanding human behaviors in crowded scenarios requires analyzing not only the position of the subjects in space, but also the scene context. Existing approaches mostly rely on the motion history of each pedestrian and model the interactions among people by considering the entire surrounding neighborhood. In our approach, we address the problem of motion prediction by applying coherent group clustering and a global attention mechanism on the LSTM-based Generative Adversarial Networks (GANs). The proposed model consists of an attentive group-aware GAN that observes the agents' past motion and predicts future paths, using (i) a group pooling module to model neighborhood interaction, and (ii) an attention module to specifically focus on hidden states. The experimental results demonstrate that our proposal outperforms state-of-the-art models on common benchmark datasets, and is able to generate socially-acceptable trajectories.

Deep Next-Best-View Planner for Cross-Season Visual Route Classification

Kurauchi Kanya, Kanji Tanaka

Responsive image

Auto-TLDR; Active Visual Place Recognition using Deep Convolutional Neural Network

Slides Poster Similar

This paper addresses the problem of active visual place recognition (VPR) from a novel perspective of long-term autonomy. In our approach, a next-best-view (NBV) planner plans an optimal action-observation-sequence to maximize the expected cost-performance for a visual route classification task. A difficulty arises from the fact that the NBV planner is trained and tested in different domains (times of day, weather conditions, and seasons). Existing NBV methods may be confused and deteriorated by the domain-shifts, and require significant efforts for adapting them to a new domain. We address this issue by a novel deep convolutional neural network (DNN) -based NBV planner that does not require the adaptation. Our main contributions in this paper are summarized as follows: (1) We present a novel domain-invariant NBV planner that is specifically tailored for DNN-based VPR. (2) We formulate the active VPR as a POMDP problem and present a feasible solution to address the inherent intractability. Specifically, the probability distribution vector (PDV) output by the available DNN is used as a domain-invariant observation model without the need to retrain it. (3) We verify efficacy of the proposed approach through challenging cross-season VPR experiments, where it is confirmed that the proposed approach clearly outperforms the previous single-view-based or multi-view-based VPR in terms of VPR accuracy and/or action-observation-cost.

The Effect of Multi-Step Methods on Overestimation in Deep Reinforcement Learning

Lingheng Meng, Rob Gorbet, Dana Kulić

Responsive image

Auto-TLDR; Multi-Step DDPG for Deep Reinforcement Learning

Slides Poster Similar

Multi-step (also called n-step) methods in reinforcement learning (RL) have been shown to be more efficient than the 1-step method due to faster propagation of the reward signal, both theoretically and empirically, in tasks exploiting tabular representation of the value-function. Recently, research in Deep Reinforcement Learning (DRL) also shows that multi-step methods improve learning speed and final performance in applications where the value-function and policy are represented with deep neural networks. However, there is a lack of understanding about what is actually contributing to the boost of performance. In this work, we analyze the effect of multi-step methods on alleviating the overestimation problem in DRL, where multi-step experiences are sampled from a replay buffer. Specifically building on top of Deep Deterministic Policy Gradient (DDPG), we experiment with Multi-step DDPG (MDDPG), where different step sizes are manually set, and with a variant called Mixed Multi-step DDPG (MMDDPG) where an average over different multi-step backups is used as target Q-value. Empirically, we show that both MDDPG and MMDDPG are significantly less affected by the overestimation problem than DDPG with 1-step backup, which consequently results in better final performance and learning speed. We also discuss the advantages and disadvantages of different ways to do multi-step expansion in order to reduce approximation error, and expose the tradeoff between overestimation and underestimation that underlies offline multi-step methods. Finally, we compare the computational resource needs of TD3 and our proposed methods, since they show comparable final performance and learning speed.

What and How? Jointly Forecasting Human Action and Pose

Yanjun Zhu, Yanxia Zhang, Qiong Liu, Andreas Girgensohn

Responsive image

Auto-TLDR; Forecasting Human Actions and Motion Trajectories with Joint Action Classification and Pose Regression

Slides Poster Similar

Forecasting human actions and motion trajectories addresses the problem of predicting what a person is going to do next and how they will perform it. This is crucial in a wide range of applications such as assisted living and future co-robotic settings. We propose to simultaneously learn actions and action-related human motion dynamics, while existing works perform them independently. In this paper, we present a method to jointly forecast categories of human action and the pose of skeletal joints in the hope that the two tasks can help each other. As a result, our system can predict not only the future actions but also the motion trajectories that will result. To achieve this, we define a task of joint action classification and pose regression. We employ a sequence to sequence encoder-decoder model combined with multi-task learning to forecast future actions and poses progressively before the action happens. Experimental results on two public datasets, IkeaDB and OAD, demonstrate the effectiveness of the proposed method.

Semantics to Space(S2S): Embedding Semantics into Spatial Space for Zero-Shot Verb-Object Query Inferencing

Sungmin Eum, Heesung Kwon

Responsive image

Auto-TLDR; Semantics-to-Space: Deep Zero-Shot Learning for Verb-Object Interaction with Vectors

Slides Poster Similar

We present a novel deep zero-shot learning (ZSL) model for inferencing human-object-interaction with verb-object (VO) query. While the previous two-stream ZSL approaches only use the semantic/textual information to be fed into the query stream, we seek to incorporate and embed the semantics into the visual representation stream as well. Our approach is powered by Semantics-to-Space (S2S) architecture where semantics derived from the residing objects are embedded into a spatial space of the visual stream. This architecture allows the co-capturing of the semantic attributes of the human and the objects along with their location/size/silhouette information. To validate, we have constructed a new dataset, Verb-Transferability 60 (VT60). VT60 provides 60 different VO pairs with overlapping verbs tailored for testing two-stream ZSL approaches with VO query. Experimental evaluations show that our approach not only outperforms the state-of-the-art, but also shows the capability of consistently improving performance regardless of which ZSL baseline architecture is used.

Deep Reinforcement Learning for Autonomous Driving by Transferring Visual Features

Hongli Zhou, Guanwen Zhang, Wei Zhou

Responsive image

Auto-TLDR; Deep Reinforcement Learning for Autonomous Driving by Transferring Visual Features

Slides Poster Similar

Deep reinforcement learning (DRL) has achieved great success in processing vision-based driving tasks. However, the end-to-end training manner makes DRL agents suffer from overfitting training scenes. The agents easily fail to generalize to unseen environments. In this paper, we propose a deep reinforcement learning for autonomous driving by transferring visual features. We formulate the DRL training as a perception and control module and introduce adversarial training mechanism for autonomous driving. The perception module is able to extract invariant features between different domains through adversarial training. While the DRL agent can then be trained on the basis of low dimensional states. In this manner, the proposed approach enables trained agents to adapt to unseen environments by learning robust features invariant across various scenes. We evaluate the proposed approach by transferring visual features between different simulators. The experimental results demonstrate the driving policy trained in the source domain can be directly applied in the target domain, and achieve great efficient and effective performance for autonomous driving.

Multiple Future Prediction Leveraging Synthetic Trajectories

Lorenzo Berlincioni, Federico Becattini, Lorenzo Seidenari, Alberto Del Bimbo

Responsive image

Auto-TLDR; Synthetic Trajectory Prediction using Markov Chains

Slides Poster Similar

Trajectory prediction is an important task, especially in autonomous driving. The ability to forecast the position of other moving agents can yield to an effective planning, ensuring safety for the autonomous vehicle as well for the observed entities. In this work we propose a data driven approach based on Markov Chains to generate synthetic trajectories, which are useful for training a multiple future trajectory predictor. The advantages are twofold: on the one hand synthetic samples can be used to augment existing datasets and train more effective predictors; on the other hand, it allows to generate samples with multiple ground truths, corresponding to diverse equally likely outcomes of the observed trajectory. We define a trajectory prediction model and a loss that explicitly address the multimodality of the problem and we show that combining synthetic and real data leads to prediction improvements, obtaining state of the art results.

Adaptive Remote Sensing Image Attribute Learning for Active Object Detection

Nuo Xu, Chunlei Huo, Chunhong Pan

Responsive image

Auto-TLDR; Adaptive Image Attribute Learning for Active Object Detection

Slides Similar

In recent years, deep learning methods bring incredible progress to the field of object detection. However, in the field of remote sensing image processing, existing methods neglect the relationship between imaging configuration and detection performance, and do not take into account the importance of detection performance feedback for improving image quality. Therefore, detection performance is limited by the passive nature of the conventional object detection framework. In order to solve the above limitations, this paper takes adaptive brightness adjustment and scale adjustment as examples, and proposes an active object detection method based on deep reinforcement learning. The goal of adaptive image attribute learning is to maximize the detection performance. With the help of active object detection and image attribute adjustment strategies, low-quality images can be converted into high-quality images, and the overall performance is improved without retraining the detector.

Multi-Modal Contextual Graph Neural Network for Text Visual Question Answering

Yaoyuan Liang, Xin Wang, Xuguang Duan, Wenwu Zhu

Responsive image

Auto-TLDR; Multi-modal Contextual Graph Neural Network for Text Visual Question Answering

Slides Poster Similar

Text visual question answering (TextVQA) targets at answering the question related to texts appearing in the given images, posing more challenges than VQA by requiring a deeper recognition and understanding of various shapes of human-readable scene texts as well as their meanings in different contexts. Existing works on TextVQA suffer from two weaknesses: i) scene texts and non-textual objects are processed separately and independently without considering their mutual interactions during the question understanding and answering process, ii) scene texts are encoded only through word embeddings without taking the corresponding visual appearance features as well as their potential relationships with other non-textual objects in the images into account. To overcome the weakness of exiting works, we propose a novel multi-modal contextual graph neural network (MCG) model for TextVQA. The proposed MCG model can capture the relationships between visual features of scene texts and non-textual objects in the given images as well as utilize richer sources of multi-modal features to improve the model performance. In particular, we encode the scene texts into richer features containing textual, visual and positional features, then model the visual relations between scene texts and non-textual objects through a contextual graph neural network. Our extensive experiments on real-world dataset demonstrate the advantages of the proposed MCG model over baseline approaches.

VTT: Long-Term Visual Tracking with Transformers

Tianling Bian, Yang Hua, Tao Song, Zhengui Xue, Ruhui Ma, Neil Robertson, Haibing Guan

Responsive image

Auto-TLDR; Visual Tracking Transformer with transformers for long-term visual tracking

Slides Similar

Long-term visual tracking is a challenging problem. State-of-the-art long-term trackers, e.g., GlobalTrack, utilize region proposal networks (RPNs) to generate target proposals. However, the performance of the trackers is affected by occlusions and large scale or ratio variations. To address these issues, in this paper, we are the first to propose a novel architecture with transformers for long-term visual tracking. Specifically, the proposed Visual Tracking Transformer (VTT) utilizes a transformer encoder-decoder architecture for aggregating global information to deal with occlusion and large scale or ratio variation. Furthermore, it also shows better discriminative power against instance-level distractors without the need for extra labeling and hard-sample mining. We conduct extensive experiments on three largest long-term tracking dataset and have achieved state-of-the-art performance.

Global Context-Based Network with Transformer for Image2latex

Nuo Pang, Chun Yang, Xiaobin Zhu, Jixuan Li, Xu-Cheng Yin

Responsive image

Auto-TLDR; Image2latex with Global Context block and Transformer

Slides Poster Similar

Image2latex usually means converts mathematical formulas in images into latex markup. It is a very challenging job due to the complex two-dimensional structure, variant scales of input, and very long representation sequence. Many researchers use encoder-decoder based model to solve this task and achieved good results. However, these methods don't make full use of the structure and position information of the formula. %In this paper, we improve the encoder by employing Global Context block and Transformer. To solve this problem, we propose a global context-based network with transformer that can (1) learn a more powerful and robust intermediate representation via aggregating global features and (2) encode position information explicitly and (3) learn latent dependencies between symbols by using self-attention mechanism. The experimental results on the dataset IM2LATEX-100K demonstrate the effectiveness of our method.

Multi-Stage Attention Based Visual Question Answering

Aakansha Mishra, Ashish Anand, Prithwijit Guha

Responsive image

Auto-TLDR; Alternative Bi-directional Attention for Visual Question Answering

Poster Similar

Recent developments in the field of Visual Question Answering (VQA) have witnessed promising improvements in performance through contributions in attention based networks. Most such approaches have focused on unidirectional attention that leverage over attention from textual domain (question) on visual space. These approaches mostly focused on learning high-quality attention in the visual space. In contrast, this work proposes an alternating bi-directional attention framework. First, a question to image attention helps to learn the robust visual space embedding, and second, an image to question attention helps to improve the question embedding. This attention mechanism is realized in an alternating fashion i.e. question-to-image followed by image-to-question and is repeated for maximizing performance. We believe that this process of alternating attention generation helps both the modalities and leads to better representations for the VQA task. This proposal is benchmark on TDIUC dataset and against state-of-art approaches. Our ablation analysis shows that alternate attention is the key to achieve high performance in VQA.

Multi-Scale 2D Representation Learning for Weakly-Supervised Moment Retrieval

Ding Li, Rui Wu, Zhizhong Zhang, Yongqiang Tang, Wensheng Zhang

Responsive image

Auto-TLDR; Multi-scale 2D Representation Learning for Weakly Supervised Video Moment Retrieval

Slides Poster Similar

Video moment retrieval aims to search the moment most relevant to a given language query. However, most existing methods in this community often require temporal boundary annotations which are expensive and time-consuming to label. Hence weakly supervised methods have been put forward recently by only using coarse video-level label. Despite effectiveness, these methods usually process moment candidates independently, while ignoring a critical issue that the natural temporal dependencies between candidates in different temporal scales. To cope with this issue, we propose a Multi-scale 2D Representation Learning method for weakly supervised video moment retrieval. Specifically, we first construct a two-dimensional map for each temporal scale to capture the temporal dependencies between candidates. Two dimensions in this map indicate the start and end time points of these candidates. Then, we select top-K candidates from each scale-varied map with a learnable convolutional neural network. With a newly designed Moments Evaluation Module, we obtain the alignment scores of the selected candidates. At last, the similarity between captions and language query is served as supervision for further training the candidates' selector. Experiments on two benchmark datasets Charades-STA and ActivityNet Captions demonstrate that our approach achieves superior performance to state-of-the-art results.

Information Graphic Summarization Using a Collection of Multimodal Deep Neural Networks

Edward Kim, Connor Onweller, Kathleen F. Mccoy

Responsive image

Auto-TLDR; A multimodal deep learning framework that can generate summarization text supporting the main idea of an information graphic for presentation to blind or visually impaired

Slides Similar

We present a multimodal deep learning framework that can generate summarization text supporting the main idea of an information graphic for presentation to a person who is blind or visually impaired. The framework utilizes the visual, textual, positional, and size characteristics extracted from the image to create the summary. Different and complimentary neural architectures are optimized for each task using crowdsourced training data. From our quantitative experiments and results, we explain the reasoning behind our framework and show the effectiveness of our models. Our qualitative results showcase text generated from our framework and show that Mechanical Turk participants favor them to other automatic and human generated summarizations. We describe the design and of of an experiment to evaluate the utility of our system for people who have visual impairments in the context of understanding Twitter Tweets containing line graphs.

RLST: A Reinforcement Learning Approach to Scene Text Detection Refinement

Xuan Peng, Zheng Huang, Kai Chen, Jie Guo, Weidong Qiu

Responsive image

Auto-TLDR; Saccadic Eye Movements and Peripheral Vision for Scene Text Detection using Reinforcement Learning

Slides Poster Similar

Within the research of scene text detection, some previous work has already achieved significant accuracy and efficiency. However, most of the work was generally done without considering about the implicit relationship between detection and eye movements. In this paper, we propose a new method for scene text detection especially for its refinement based on reinforcement learning. The idea of this method is inspired by Saccadic Eye Movements and Peripheral Vision. A saccade makes it possible for humans to orient the gaze to the location where a visual object has appeared. Peripheral vision gathers visual information of surroundings which provides supplement to foveal vision during gazing. We propose a simple pipeline, imitating the way human eyes do a saccade and collect peripheral information, to locate scene text roughly and to refine multi-scale vision field iteratively using reinforcement learning. For both training and evaluation, we use ICDAR2015 Challenge 4 dataset as a base and design several criteria to measure the feasibility of our work.