ICPR2020 Paper Browser

Paper download is intended for registered attendees only and is subject to the IEEE Copyright Policy. Any other use is strictly forbidden.

Label or Message: A Large-Scale Experimental Survey of Texts and Objects Co-Occurrence

Koki Takeshita, Juntaro Shioyama, Seiichi Uchida

Auto-TLDR; Large-Scale Survey of Co-Occurrence between Objects and Scene Text with a State-of-the-Art Scene Text Detector and Recognizer

Abstract

Our daily life is surrounded by textual information. The automatic collection of this information has become possible owing to drastic improvements in scene text detectors and recognizers. The purpose of this paper is to conduct a large-scale survey of co-occurrence between visual objects (such as books and cars) and scene texts, using a large image dataset and a state-of-the-art scene text detector and recognizer. In particular, we focus on the function of "label" texts, which are attached to objects to describe them in detail. By analyzing co-occurrence between objects and scene texts, we can observe statistics about the label texts and understand how scene texts can help recognize objects and vice versa.
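As a rough illustration of the kind of co-occurrence statistic the abstract describes, the sketch below counts how often an object class and a recognized word appear in the same image. It is a minimal sketch under stated assumptions, not the authors' method: the function name `cooccurrence_counts`, the input layout, and the toy annotations are all hypothetical, and it counts image-level co-occurrence only, without checking whether a text is physically attached to an object.

```python
from collections import Counter
from itertools import product


def cooccurrence_counts(images):
    """Count (object, word) pairs that appear together in the same image.

    `images` is an iterable of (objects, words) pairs, where `objects` holds
    detected object class names and `words` holds recognized scene-text
    strings. This input layout is an assumption made for illustration.
    """
    pair_counts = Counter()
    for objects, words in images:
        # Lowercase the words and use sets so that each (object, word)
        # pair is counted at most once per image.
        normalized_words = {w.lower() for w in words}
        for obj, word in product(set(objects), normalized_words):
            pair_counts[(obj, word)] += 1
    return pair_counts


# Hypothetical toy annotations, not data from the paper.
toy_images = [
    (["bus", "person"], ["Express", "STOP"]),
    (["bus"], ["express"]),
    (["book"], ["Python"]),
]

counts = cooccurrence_counts(toy_images)
print(counts.most_common(3))
```

Counting each pair once per image keeps a single image with many repeated words from dominating the statistics. A study of label texts specifically would also need a spatial criterion, such as overlap between text and object bounding boxes, which this sketch omits.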

Similar papers

Dual Path Multi-Modal High-Order Features for Textual Content Based Visual Question Answering

Yanan Li, Yuetan Lin, Hongrui Zhao, Donghui Wang

Auto-TLDR; TextVQA: An End-to-End Visual Question Answering Model for Text-Based VQA

As a typical cross-modal problem, visual question answering (VQA) has received increasing attention from the computer vision and natural language processing communities. Reading and reasoning about texts and visual contents in images is a burgeoning and important research topic in VQA, especially for applications that assist visually impaired users. Given an image, the task is to predict an answer to a natural language question closely related to the image's textual contents. In this paper, we propose a novel end-to-end textual-content-based VQA model, which grounds question answering on both visual and textual information. After encoding the image, the question, and the recognized text words, it uses multi-modal factorized high-order modules and an attention mechanism to fuse question-image and question-text features, respectively. The complex correlations among different features can thus be captured efficiently. To ensure the model's extensibility, it embeds candidate answers and recognized texts in a semantic embedding space and adopts a semantic-embedding-based classifier to perform answer prediction. Extensive experiments on the newly proposed TextVQA benchmark demonstrate that the proposed model achieves promising results.

Multi-Modal Contextual Graph Neural Network for Text Visual Question Answering

Yaoyuan Liang, Xin Wang, Xuguang Duan, Wenwu Zhu

Auto-TLDR; Multi-modal Contextual Graph Neural Network for Text Visual Question Answering

Text Recognition - Real World Data and Where to Find Them

Weakly Supervised Attention Rectification for Scene Text Recognition

A Multi-Head Self-Relation Network for Scene Text Recognition

Local Gradient Difference Based Mass Features for Classification of 2D-3D Natural Scene Text Images

The HisClima Database: Historical Weather Logs for Automatic Transcription and Information Extraction

Recognizing Multiple Text Sequences from an Image by Pure End-To-End Learning

Feature Embedding Based Text Instance Grouping for Largely Spaced and Occluded Text Detection

Information Graphic Summarization Using a Collection of Multimodal Deep Neural Networks

Self-Training for Domain Adaptive Scene Text Detection

An Accurate Threshold Insensitive Kernel Detector for Arbitrary Shaped Text

A Novel Attention-Based Aggregation Function to Combine Vision and Language

Visual Style Extraction from Chart Images for Chart Restyling

Stratified Multi-Task Learning for Robust Spotting of Scene Texts

Cost-Effective Adversarial Attacks against Scene Text Recognition

To Honor Our Heroes: Analysis of the Obituaries of Australians Killed in Action in WWI and WWII

Sample-Aware Data Augmentor for Scene Text Recognition

Cross-Lingual Text Image Recognition Via Multi-Task Sequence to Sequence Learning

DUET: Detection Utilizing Enhancement for Text in Scanned or Captured Documents

An Integrated Approach of Deep Learning and Symbolic Analysis for Digital PDF Table Extraction

Fast Approximate Modelling of the Next Combination Result for Stopping the Text Recognition in a Video

Textual-Content Based Classification of Bundles of Untranscribed Manuscript Images

MAGNet: Multi-Region Attention-Assisted Grounding of Natural Language Queries at Phrase Level

2D License Plate Recognition based on Automatic Perspective Rectification

Text Recognition in Real Scenarios with a Few Labeled Samples

ReADS: A Rectified Attentional Double Supervised Network for Scene Text Recognition

MEAN: A Multi-Element Attention Based Network for Scene Text Recognition

Scene Text Detection with Selected Anchors

Improving Word Recognition Using Multiple Hypotheses and Deep Embeddings

Multi-Modal Identification of State-Sponsored Propaganda on Social Media

Answer-Checking in Context: A Multi-Modal Fully Attention Network for Visual Question Answering

Multi-Stage Attention Based Visual Question Answering

RWMF: A Real-World Multimodal Foodlog Database

Named Entity Recognition and Relation Extraction with Graph Neural Networks in Semi Structured Documents

Integrating Historical States and Co-Attention Mechanism for Visual Dialog

Gaussian Constrained Attention Network for Scene Text Recognition

RLST: A Reinforcement Learning Approach to Scene Text Detection Refinement

Assessing the Severity of Health States Based on Social Media Posts

Transferable Adversarial Attacks for Deep Scene Text Detection

Recognizing Bengali Word Images - A Zero-Shot Learning Perspective

P ≈ NP, at Least in Visual Question Answering

Graph Discovery for Visual Test Generation

Transformer Reasoning Network for Image-Text Matching and Retrieval

Zero-Shot Text Classification with Semantically Extended Graph Convolutional Network

Improving Visual Relation Detection Using Depth Maps

PICK: Processing Key Information Extraction from Documents Using Improved Graph Learning-Convolutional Networks

End-To-End Hierarchical Relation Extraction for Generic Form Understanding