Daniel Perelman
Paper download is intended for registered attendees only, and is
subjected to the IEEE Copyright Policy. Any other use is strongly forbidden.
Papers from this author
An Integrated Approach of Deep Learning and Symbolic Analysis for Digital PDF Table Extraction
Mengshi Zhang, Daniel Perelman, Vu Le, Sumit Gulwani
Auto-TLDR; Deep Learning and Symbolic Reasoning for Unstructured PDF Table Extraction
Abstract Slides Poster Similar
Deep learning has shown great success at interpreting unstructured data such as object recognition in images. Symbolic/logical-reasoning techniques have shown great success in interpreting structured data such as table extraction in webpages, custom text files, spreadsheets. The tables in PDF documents are often generated from such structured sources (text-based Word/Latex documents, spreadsheets, webpages) but end up being unstructured. We thus explore novel combinations of deep learning and symbolic reasoning techniques to build an effective solution for PDF table extraction. We evaluate effectiveness without granting partial credit for matching part of a table (which may cause silent errors in downstream data processing). Our method achieves a 0.725 F1 score (vs. 0.339 for the state-of-the-art) on detecting correct table bounds---a much stricter metric than the common one of detecting characters within tables---in a well known public benchmark (ICDAR 2013) and a 0.404 F1 score (vs. 0.144 for the state-of-the-art) on our private benchmark with more widely varied table structures.