ICPR2020 Paper Browser

Paper download is intended for registered attendees only and is subject to the IEEE Copyright Policy. Any other use is strictly forbidden.

Information Graphic Summarization Using a Collection of Multimodal Deep Neural Networks

Edward Kim, Connor Onweller, Kathleen F. McCoy

Auto-TLDR; A multimodal deep learning framework that can generate summarization text supporting the main idea of an information graphic for presentation to blind or visually impaired

We present a multimodal deep learning framework that can generate summarization text supporting the main idea of an information graphic for presentation to a person who is blind or visually impaired. The framework uses the visual, textual, positional, and size characteristics extracted from the image to create the summary. Different and complementary neural architectures are optimized for each task using crowdsourced training data. Through our quantitative experiments and results, we explain the reasoning behind our framework and show the effectiveness of our models. Our qualitative results showcase text generated by our framework and show that Mechanical Turk participants prefer it to other automatic and human-generated summarizations. We describe the design of an experiment to evaluate the utility of our system for people who have visual impairments in the context of understanding tweets containing line graphs.

Similar papers

Dual Path Multi-Modal High-Order Features for Textual Content Based Visual Question Answering

Yanan Li, Yuetan Lin, Hongrui Zhao, Donghui Wang

Auto-TLDR; TextVQA: An End-to-End Visual Question Answering Model for Text-Based VQA

As a typical cross-modal problem, visual question answering (VQA) has received increasing attention from the computer vision and natural language processing communities. Reading and reasoning about text and visual content in images is a burgeoning and important research topic in VQA, especially for assistive applications for the visually impaired. Given an image, the task is to predict an answer to a natural language question closely related to the image's textual content. In this paper, we propose a novel end-to-end textual-content-based VQA model, which grounds question answering in both visual and textual information. After encoding the image, the question, and the recognized text words, it uses multi-modal factorized high-order modules and an attention mechanism to fuse question-image and question-text features, respectively, so that the complex correlations among different features can be captured efficiently. To ensure the model's extensibility, it embeds candidate answers and recognized texts in a semantic embedding space and adopts a semantic-embedding-based classifier to perform answer prediction. Extensive experiments on the newly proposed TextVQA benchmark demonstrate that the proposed model achieves promising results.

Multi-Modal Contextual Graph Neural Network for Text Visual Question Answering

Yaoyuan Liang, Xin Wang, Xuguang Duan, Wenwu Zhu

Auto-TLDR; Multi-modal Contextual Graph Neural Network for Text Visual Question Answering

Text Synopsis Generation for Egocentric Videos

Visual Style Extraction from Chart Images for Chart Restyling

Enriching Video Captions with Contextual Text

A Novel Attention-Based Aggregation Function to Combine Vision and Language

Multimodal Side-Tuning for Document Classification

Hierarchical Multimodal Attention for Deep Video Summarization

Integrating Historical States and Co-Attention Mechanism for Visual Dialog

Transformer Reasoning Network for Image-Text Matching and Retrieval

Attentive Visual Semantic Specialized Network for Video Captioning

Multi-Modal Identification of State-Sponsored Propaganda on Social Media

Multi-Stage Attention Based Visual Question Answering

Visual Oriented Encoder: Integrating Multimodal and Multi-Scale Contexts for Video Captioning

Question-Agnostic Attention for Visual Question Answering

MAGNet: Multi-Region Attention-Assisted Grounding of Natural Language Queries at Phrase Level

MEG: Multi-Evidence GNN for Multimodal Semantic Forensics

ConvMath: A Convolutional Sequence Network for Mathematical Expression Recognition

Deep Composer: A Hash-Based Duplicative Neural Network for Generating Multi-Instrument Songs

Deep Convolutional Embedding for Digitized Painting Clustering

Improving Word Recognition Using Multiple Hypotheses and Deep Embeddings

P ≈ NP, at Least in Visual Question Answering

Assessing the Severity of Health States Based on Social Media Posts

Cross-Lingual Text Image Recognition Via Multi-Task Sequence to Sequence Learning

Text Recognition - Real World Data and Where to Find Them

Textual-Content Based Classification of Bundles of Untranscribed Manuscript Images

Multi-Scale Relational Reasoning with Regional Attention for Visual Question Answering

Zero-Shot Text Classification with Semantically Extended Graph Convolutional Network

Automated Whiteboard Lecture Video Summarization by Content Region Detection and Representation

PICK: Processing Key Information Extraction from Documents Using Improved Graph Learning-Convolutional Networks

Label or Message: A Large-Scale Experimental Survey of Texts and Objects Co-Occurrence

Exploring Spatial-Temporal Representations for fNIRS-based Intimacy Detection via an Attention-enhanced Cascade Convolutional Recurrent Neural Network

End-To-End Hierarchical Relation Extraction for Generic Form Understanding

Learning with Delayed Feedback

An Evaluation of DNN Architectures for Page Segmentation of Historical Newspapers

The HisClima Database: Historical Weather Logs for Automatic Transcription and Information Extraction

Segmenting Messy Text: Detecting Boundaries in Text Derived from Historical Newspaper Images

PrivAttNet: Predicting Privacy Risks in Images Using Visual Attention

A CNN-RNN Framework for Image Annotation from Visual Cues and Social Network Metadata

Context Visual Information-Based Deliberation Network for Video Captioning

Recognizing Bengali Word Images - A Zero-Shot Learning Perspective

Webly Supervised Image-Text Embedding with Noisy Tag Refinement

Explore and Explain: Self-Supervised Navigation and Recounting

Semantics to Space(S2S): Embedding Semantics into Spatial Space for Zero-Shot Verb-Object Query Inferencing

Developing Motion Code Embedding for Action Recognition in Videos

Context Matters: Self-Attention for Sign Language Recognition

Label Incorporated Graph Neural Networks for Text Classification

Cross-Media Hash Retrieval Using Multi-head Attention Network