Vision, Language and Action: from Captioning to Embodied AI
Tutorial at ICIAP 2019
Recent progress in the Computer Vision and Natural Language Processing communities has made it possible to connect Vision, Language and Action, achieving significant advancements in a variety of tasks at the intersection of Vision, Language and Embodied AI. These tasks range from generating meaningful descriptions of images, to answering questions, to navigating agents through unseen environments via natural language instructions. This tutorial will give a comprehensive guide through these advancements, covering state-of-the-art techniques for image and video captioning (Recurrent Neural Networks, Attention, the Transformer paradigm, training with Reinforcement Learning), as well as for cross-modal retrieval and visual question answering. It will then discuss how these approaches can be applied to embodied agents that interact with the physical world, for navigation and for other embodied tasks such as recognition and embodied question answering.
Program and slides
To be announced soon.