Vision, Language and Action: from Captioning to Embodied AI

Tutorial at ICIAP 2019

Monday 09:00-11:00
Sala 2


Abstract

Recent progress in the Computer Vision and Natural Language Processing communities has made it possible to connect Vision, Language and Action, achieving significant advances in a variety of tasks which lie at the intersection of Vision, Language and Embodied AI. These tasks range from generating meaningful descriptions of images to answering questions and navigating agents in unseen environments via natural language instructions. This tutorial will give a comprehensive guide through these advances, including state-of-the-art techniques for image and video captioning (Recurrent Neural Networks, Attention, the Transformer paradigm, training with Reinforcement Learning), for cross-modal retrieval, and for visual question answering. It will then discuss how these approaches can be used on embodied agents that interact with the physical world, both for navigation and for other embodied tasks such as recognition and embodied question answering.


Program and slides

To be announced soon.


Presenters

Lorenzo Baraldi

Profile

Marcella Cornia

Profile

Massimiliano Corsini

Profile