Recursive Recognition of Offline Handwritten Mathematical Expressions

Marco Cotogni, Claudio Cusano, Antonino Nocera

Auto-TLDR; Offline Handwritten Mathematical Expression Recognition with Recurrent Neural Network

In this paper we propose a method for Offline Handwritten Mathematical Expression recognition. The method is fast and accurate thanks to its architecture, which includes both a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN). The CNN extracts features from the input image, and its output is passed to the RNN, which produces the mathematical expression encoded in LaTeX. To process both sequential and non-sequential mathematical expressions, we also include a deconvolutional module that recursively segments the image for additional analysis. The results show very high accuracy on a large data set of 9100 handwritten expressions.

Similar papers

ConvMath : A Convolutional Sequence Network for Mathematical Expression Recognition

Zuoyu Yan, Xiaode Zhang, Liangcai Gao, Ke Yuan, Zhi Tang

Auto-TLDR; Convolutional Sequence Modeling for Mathematical Expressions Recognition

Despite recent advances in optical character recognition (OCR), mathematical expressions remain challenging to recognize because of their two-dimensional graphical layout. In this paper, we propose a convolutional sequence modeling network, ConvMath, which converts the mathematical expression in an image into a LaTeX sequence in an end-to-end way. The network combines an image encoder for feature extraction and a convolutional decoder for sequence generation. Compared with other Long Short-Term Memory (LSTM) based encoder-decoder models, ConvMath is entirely based on convolution and is therefore easy to parallelize. In addition, the network adopts a multi-layer attention mechanism in the decoder, which allows the model to align output symbols with source feature vectors automatically and alleviates the lack-of-coverage problem during training. The performance of ConvMath is evaluated on an open dataset named IM2LATEX-100K, which contains 103556 samples. The experimental results demonstrate that the proposed network achieves state-of-the-art accuracy and much better efficiency than previous methods.

LODENet: A Holistic Approach to Offline Handwritten Chinese and Japanese Text Line Recognition

Huu Tin Hoang, Chun-Jen Peng, Hung Tran, Hung Le, Huy Hoang Nguyen

Auto-TLDR; Logographic DEComposition Encoding for Chinese and Japanese Text Line Recognition

One of the biggest obstacles in Chinese and Japanese text line recognition is how to represent their enormous character sets. The most common solution is simply to choose a small subset of characters and represent it using one-hot encoding. However, such an approach is costly for huge character sets and ignores the semantic relationships between characters. Recent studies have attempted to use different encoding methods, but they struggle to build a bijective mapping. In this work, we propose a novel encoding method, called LOgographic DEComposition encoding (LODEC), that can efficiently perform a 1-to-1 mapping for all Chinese and Japanese characters with a strong awareness of semantic relationships. As such, LODEC can encode over 21,000 Chinese and Japanese characters using only 520 fundamental elements. Moreover, to handle the vast variety of handwritten texts in the two languages, we propose a novel deep learning (DL) architecture, called LODENet, together with an end-to-end training scheme that leverages auxiliary data generated by LODEC or other radical-based encoding methods. We performed systematic experiments on both Chinese and Japanese datasets and found that our approach surpassed the performance of state-of-the-art baselines. Furthermore, empirical evidence shows that our method can obtain significantly better results using synthesized text line images without the need for domain knowledge.

Watch Your Strokes: Improving Handwritten Text Recognition with Deformable Convolutions

Iulian Cojocaru, Silvia Cascianelli, Lorenzo Baraldi, Massimiliano Corsini, Rita Cucchiara

Auto-TLDR; Deformable Convolutional Neural Networks for Handwritten Text Recognition

Handwritten Text Recognition (HTR) in free-layout pages is a valuable yet challenging task which aims to automatically understand handwritten texts. State-of-the-art approaches in this field usually encode input images with Convolutional Neural Networks, whose kernels are typically defined on a fixed grid and focus on all input pixels independently. However, this is in contrast with the sparse nature of handwritten pages, in which only pixels representing the ink of the writing are useful for the recognition task. Furthermore, the standard convolution operator is not explicitly designed to take into account the great variability in shape, scale, and orientation of handwritten characters. To overcome these limitations, we investigate the use of deformable convolutions for handwriting recognition. This type of convolution deforms the convolution kernel according to the content of the neighborhood, and can therefore adapt better to geometric variations and other deformations of the text. Experiments conducted on the IAM and RIMES datasets demonstrate that the use of deformable convolutions is a promising direction for the design of novel architectures for handwritten text recognition.

Stroke Based Posterior Attention for Online Handwritten Mathematical Expression Recognition

Changjie Wu, Qing Wang, Jianshu Zhang, Jun Du, Jiaming Wang, Jiajia Wu, Jin-Shui Hu

Auto-TLDR; Posterior Attention for Online Handwritten Mathematical Expression Recognition

Recently, many studies have proposed attention-based encoder-decoder models that convert a sequence of trajectory points into a LaTeX string for online handwritten mathematical expression recognition (OHMER), and the recognition performance of these models relies critically on the accuracy of the attention. In this paper, unlike previous methods, which mostly employ a soft attention model, we propose a posterior attention model, which modifies the attention probabilities after observing the output probabilities generated by the soft attention model. To further improve the posterior attention mechanism, we propose a stroke average pooling layer that aggregates point-level features obtained from the encoder into stroke-level features. We argue that posterior attention is better applied to stroke-level features than to point-level features, because the output probabilities generated from strokes are more reliable than those generated from points, and we support this claim through experimental analysis. Validated on the CROHME competition task, stroke-based posterior attention achieves expression recognition rates of 54.26% on CROHME 2014 and 51.75% on CROHME 2016. Attention visualization analysis shows empirically that the posterior attention mechanism achieves better alignment accuracy than the soft attention mechanism.
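
The stroke average pooling idea in the abstract above can be pictured as averaging all point-level feature vectors that belong to the same stroke. The following is only a minimal sketch under stated assumptions: the function name `stroke_average_pool`, the plain-list feature format, and the integer stroke ids are illustrative choices, not the authors' implementation.

```python
from typing import Dict, List, Sequence


def stroke_average_pool(point_features: Sequence[Sequence[float]],
                        stroke_ids: Sequence[int]) -> List[List[float]]:
    """Average point-level feature vectors that share a stroke id.

    Strokes are returned in order of first appearance.
    """
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    order: List[int] = []
    for feat, sid in zip(point_features, stroke_ids):
        if sid not in sums:
            sums[sid] = [0.0] * len(feat)
            counts[sid] = 0
            order.append(sid)
        # Accumulate the feature vector component-wise.
        sums[sid] = [s + f for s, f in zip(sums[sid], feat)]
        counts[sid] += 1
    return [[s / counts[sid] for s in sums[sid]] for sid in order]


# Three trajectory points: the first two belong to stroke 0, the last to stroke 1.
points = [[1.0, 2.0], [3.0, 4.0], [10.0, 0.0]]
strokes = [0, 0, 1]
pooled = stroke_average_pool(points, strokes)
print(pooled)
```

In this toy input, the two points of stroke 0 collapse into one stroke-level vector, so the attention would operate over two positions instead of three.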

Online Trajectory Recovery from Offline Handwritten Japanese Kanji Characters of Multiple Strokes

Hung Tuan Nguyen, Tsubasa Nakamura, Cuong Tuan Nguyen, Masaki Nakagawa

Auto-TLDR; Recovering Dynamic Online Trajectories from Offline Japanese Kanji Character Images for Handwritten Character Recognition

We propose a deep neural network-based method to recover dynamic online trajectories from offline handwritten Japanese kanji character images. It is a challenging task since Japanese kanji characters consist of multiple strokes. Our proposed model has three main components: a Convolutional Neural Network-based encoder, a Long Short-Term Memory Network-based decoder with an attention layer, and a Gaussian Mixture Model (GMM). The encoder focuses on feature extraction, while the decoder refers to the extracted features and generates time sequences of GMM parameters. The attention layer is the key component for trajectory recovery. The GMM provides robustness to style variations so that the proposed model does not overfit to training samples. In the experiments, the proposed method is evaluated by both visual verification and handwritten character recognition. This is the first attempt to use online recovered trajectories to help improve the performance of offline handwriting recognition. Although the visual verification reveals some problems, the recognition experiments demonstrate that trajectory recovery improves the accuracy of offline handwritten character recognition when online recognition of the recovered trajectories is combined with it.

Enhancing Handwritten Text Recognition with N-Gram Sequence Decomposition and Multitask Learning

Vasiliki Tassopoulou, George Retsinas, Petros Maragos

Auto-TLDR; Multi-task Learning for Handwritten Text Recognition

Current state-of-the-art approaches in the field of Handwritten Text Recognition are predominantly single-task, with unigram, character-level target units. In our work, we use a multi-task learning scheme, training the model to perform decompositions of the target sequence with target units of different granularity, from fine to coarse. We consider this method a way to use n-gram information implicitly in the training process, while the final recognition is performed using only the unigram output. Unigram decoding of such a multi-task approach highlights the capability of the learned internal representations, imposed by the different n-grams at the training step. We select n-grams as our target units and experiment with unigrams up to fourgrams, namely subword-level granularities. These multiple decompositions are learned by the network with task-specific CTC losses. Concerning network architectures, we propose two alternatives, namely the Hierarchical and the Block Multi-task. Overall, our proposed model, even though evaluated only on the unigram task, outperforms its single-task counterpart by an absolute 2.52% WER and 1.02% CER with greedy decoding, without any computational overhead during inference, hinting at the successful imposition of an implicit language model.
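
One way to picture the multi-granularity targets described above is to decompose a transcription into character n-grams for n = 1 to 4, giving one target sequence per task. The sketch below assumes overlapping (sliding-window) n-grams and uses a hypothetical helper name, `ngram_targets`; the decomposition actually used in the paper may differ.

```python
from typing import Dict, List


def ngram_targets(text: str, n: int) -> List[str]:
    """Return the overlapping character n-grams of `text`, left to right."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# One target sequence per granularity, from unigrams up to fourgrams.
word = "hand"
targets: Dict[int, List[str]] = {n: ngram_targets(word, n) for n in range(1, 5)}
for n, units in targets.items():
    print(n, units)
```

Each of these sequences would be paired with its own task-specific loss during training, while only the unigram sequence is decoded at inference time, as the abstract states.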

Writer Identification Using Deep Neural Networks: Impact of Patch Size and Number of Patches

Akshay Punjabi, José Ramón Prieto Fontcuberta, Enrique Vidal

Auto-TLDR; Writer Recognition Using Deep Neural Networks for Handwritten Text Images

Traditional approaches to recognizing or identifying the writer of a handwritten text image used to rely on heuristic knowledge about the shape and other features of the strokes of previously segmented characters. However, recent works have significantly advanced the state of the art thanks to the use of various types of deep neural networks. In most of these works, text images are decomposed into patches, which are processed by the networks without any prior character or word segmentation. In this paper, we study how the way images are decomposed into patches impacts recognition accuracy, using three publicly available datasets. The study also includes a simpler architecture where no patches are used at all: a single deep neural network takes a whole text image as input and directly provides a writer recognition hypothesis. Results show that bigger patches generally lead to improved accuracy, achieving on one of the datasets a significant improvement over the best results reported so far.

A Few-Shot Learning Approach for Historical Ciphered Manuscript Recognition

Mohamed Ali Souibgui, Alicia Fornés, Yousri Kessentini, Crina Tudor

Auto-TLDR; Handwritten Ciphers Recognition Using Few-Shot Object Detection

Encoded (or ciphered) manuscripts are a special type of historical document that contains encrypted text. The automatic recognition of this kind of document is challenging because: 1) the cipher alphabet changes from one document to another, 2) annotated corpora for training are lacking, and 3) touching symbols make symbol segmentation difficult and complex. To overcome these difficulties, we propose a novel method for handwritten cipher recognition based on few-shot object detection. Our method first detects all symbols of a given alphabet in a line image, and then a decoding step maps the symbol similarity scores to the final sequence of transcribed symbols. By training on synthetic data, we show that the proposed architecture is able to recognize handwritten ciphers with unseen alphabets. In addition, if a few labeled pages with the same alphabet are used for fine-tuning, our method surpasses existing unsupervised and supervised HTR methods for cipher recognition.

Global Context-Based Network with Transformer for Image2latex

Nuo Pang, Chun Yang, Xiaobin Zhu, Jixuan Li, Xu-Cheng Yin

Auto-TLDR; Image2latex with Global Context block and Transformer

Image2latex usually refers to converting mathematical formulas in images into LaTeX markup. It is a very challenging task due to the complex two-dimensional structure, the varying scales of the input, and the very long representation sequence. Many researchers have used encoder-decoder models to solve this task and achieved good results. However, these methods do not make full use of the structure and position information of the formula. To solve this problem, we propose a global context-based network with a Transformer that can (1) learn a more powerful and robust intermediate representation by aggregating global features, (2) encode position information explicitly, and (3) learn latent dependencies between symbols by using a self-attention mechanism. The experimental results on the IM2LATEX-100K dataset demonstrate the effectiveness of our method.

ReADS: A Rectified Attentional Double Supervised Network for Scene Text Recognition

Qi Song, Qianyi Jiang, Xiaolin Wei, Nan Li, Rui Zhang

Auto-TLDR; ReADS: Rectified Attentional Double Supervised Network for General Scene Text Recognition

In recent years, scene text recognition has commonly been regarded as a sequence-to-sequence problem. Connectionist Temporal Classification (CTC) and attentional sequence recognition (Attn) are two prevailing approaches to this problem, but each may fail in some scenarios. CTC concentrates more on each individual character but is weak at modeling semantic dependencies in text. Attn-based methods model contextual semantics better but tend to overfit on limited training data. In this paper, we design a Rectified Attentional Double Supervised Network (ReADS) for general scene text recognition. To overcome the weaknesses of CTC and Attn, both are applied in our method, but with different modules in two supervised branches that complement each other. Moreover, effective spatial and channel attention mechanisms are introduced to eliminate background noise and extract valid foreground information. Finally, a simple rectified network is implemented to rectify irregular text. ReADS can be trained end-to-end and requires only word-level annotations. Extensive experiments on various benchmarks verify the effectiveness of ReADS, which achieves state-of-the-art performance.
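
To make the CTC side of the comparison above concrete, the sketch below shows standard greedy (best-path) CTC decoding: pick the most likely class per frame, merge consecutive repeats, then drop blanks. It is a generic illustration and not the ReADS decoder; the blank index of 0, the toy alphabet, and the hand-written per-frame probabilities are assumptions made for the example.

```python
from typing import List, Sequence


def ctc_greedy_decode(frame_probs: Sequence[Sequence[float]],
                      alphabet: Sequence[str],
                      blank: int = 0) -> str:
    """Best-path CTC decoding: argmax per frame, merge repeats, remove blanks."""
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    out: List[str] = []
    prev = None
    for idx in best:
        # Emit a symbol only when it differs from the previous frame's label
        # and is not the blank label.
        if idx != prev and idx != blank:
            out.append(alphabet[idx])
        prev = idx
    return "".join(out)


# Index 0 is the blank label (assumed); "a" and "b" are toy characters.
alphabet = ["-", "a", "b"]
frames = [
    [0.1, 0.8, 0.1],    # a
    [0.1, 0.7, 0.2],    # a (repeat, merged with the previous frame)
    [0.9, 0.05, 0.05],  # blank (separates the two "a" symbols)
    [0.2, 0.7, 0.1],    # a
    [0.1, 0.2, 0.7],    # b
]
decoded = ctc_greedy_decode(frames, alphabet)
print(decoded)
```

Because each frame is decoded independently, this procedure illustrates why CTC focuses on individual characters and carries little information about dependencies between them.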

The HisClima Database: Historical Weather Logs for Automatic Transcription and Information Extraction

Verónica Romero, Joan Andreu Sánchez

Auto-TLDR; Automatic Handwritten Text Recognition and Information Extraction from Historical Weather Logs

Knowing the weather and atmospheric conditions of the past can help weather researchers generate models like the ones used to predict how weather conditions are likely to change as global temperatures continue to rise. Many weather records from the past are available, registered on a systematic basis. Historical weather logs were kept on ships while they were on the high seas, recording daily weather conditions such as wind speed, temperature, and coordinates. These historical documents are an important source of knowledge, with valuable information for extracting climatic data from several centuries ago. Such information is usually collected by experts who devote a lot of time to the task. This paper presents a new database, compiled from a ship log composed mainly of handwritten tables that contain mostly numerical information, to support research in automatic handwriting recognition and information extraction. In addition, we present a study of the capability of state-of-the-art handwritten text recognition systems and information extraction techniques when applied to the presented database. Baseline results are reported for reference in future studies.

A Transformer-Based Radical Analysis Network for Chinese Character Recognition

Chen Yang, Qing Wang, Jun Du, Jianshu Zhang, Changjie Wu, Jiaming Wang

Auto-TLDR; Transformer-based Radical Analysis Network for Chinese Character Recognition

Recently, a novel radical analysis network (RAN) has shown the capability of effectively recognizing unseen Chinese character classes and largely reducing the requirement for training data by treating a Chinese character as a hierarchical composition of radicals rather than a single character class. However, when dealing with more challenging issues, such as the recognition of complicated characters, low-frequency character categories, and characters in natural scenes, RAN still has a lot of room for improvement. In this paper, we explore options to further improve the structure generalization and robustness of RAN with the Transformer architecture, which has achieved state-of-the-art results for many sequence-to-sequence tasks. More specifically, we propose to replace the original attention module in RAN with the Transformer decoder, and we name the result a transformer-based radical analysis network (RTN). The experimental results show that the proposed approach significantly outperforms RAN on both a printed Chinese character database and a natural scene Chinese character database. Meanwhile, further analysis shows that RTN generalizes better to complex samples and low-frequency characters, and is more robust in recognizing Chinese characters with different attributes.

Multi-Task Learning Based Traditional Mongolian Words Recognition

Hongxi Wei, Hui Zhang, Jing Zhang, Kexin Liu

Auto-TLDR; Multi-task Learning for Mongolian Words Recognition

In this paper, a multi-task learning framework is proposed for improving traditional Mongolian word recognition. Specifically, a sequence-to-sequence model with an attention mechanism is used to accomplish the recognition task. The attention mechanism is designed to perform glyph segmentation during the recognition process. Although glyph segmentation is an implicit operation, its information can be integrated into the recognition process. The two tasks can then be accomplished simultaneously under the multi-task learning framework. In this way, adjacent image frames can be decoded into a glyph more precisely, which improves not only the performance of word recognition but also the accuracy of character segmentation. Experimental results demonstrate that the proposed multi-task learning based scheme outperforms the conventional glyph segmentation-based method and various segmentation-free (i.e. holistic recognition) methods.

Generation of Hypergraphs from the N-Best Parsing of 2D-Probabilistic Context-Free Grammars for Mathematical Expression Recognition

Noya Ernesto, Joan Andreu Sánchez, Jose Miguel Benedi

Auto-TLDR; Hypergraphs: A Compact Representation of the N-best parse trees from 2D-PCFGs

We consider hypergraphs as a tool to compactly represent the N-best parse trees obtained by Bi-Dimensional Probabilistic Context-Free Grammars (2D-PCFGs) for an input image that represents a mathematical expression. More specifically, in this paper we propose: an algorithm to compute the N-best parse trees from a 2D-PCFG; an algorithm to represent the N-best parse trees using a compact representation in the form of hypergraphs; and a formal framework for the development of inference algorithms (inside and outside) and normalization strategies for hypergraphs.

Radical Counter Network for Robust Chinese Character Recognition

Yunqing Li, Yixing Zhu, Jun Du, Changjie Wu, Jianshu Zhang

Auto-TLDR; Radical Counter Network for Chinese Character Recognition

Chinese character recognition has attracted much interest due to its difficulty and its many applications. The whole-character modeling method can recognize common characters well but is unable to handle unseen characters. Some radical-based modeling methods have achieved strong performance on unseen characters, but their decoding is very time-consuming. Therefore, a highly efficient model that can recognize unseen characters is needed. This paper introduces a novel radical counter network (RCN) to recognize Chinese characters by identifying radicals and spatial structures. The proposed RCN first extracts visual features from the input by employing DenseNet as the encoder. Then a decoder based on a fully connected layer is employed, aiming at synchronously estimating the number of each caption in the character. Decoding all the captions simultaneously greatly reduces the time spent on sequence decoding. Additionally, we design a multi-task learning scheme to combine the global feature extraction capability of whole-character modeling with the local feature extraction capability of radical-based modeling, which further improves model generalization. Experiments on a natural scene character dataset demonstrate that the proposed model significantly outperforms the baseline by 4.81% with comparable model complexity, which shows the robustness and simplicity of our model.

Multimodal Side-Tuning for Document Classification

Stefano Zingaro, Giuseppe Lisanti, Maurizio Gabbrielli

Auto-TLDR; Side-tuning for Multimodal Document Classification

In this paper, we propose to exploit the side-tuning framework for multimodal document classification. Side-tuning is a methodology for network adaptation recently introduced to solve some of the problems related to previous approaches. Thanks to this technique it is actually possible to overcome model rigidity and catastrophic forgetting of transfer learning by fine-tuning. The proposed solution uses off-the-shelf deep learning architectures leveraging the side-tuning framework to combine a base model with a tandem of two side networks. We show that side-tuning can be successfully employed also when different data sources are considered, e.g. text and images in document classification. The experimental results show that this approach pushes further the limit for document classification accuracy with respect to the state of the art.

IBN-STR: A Robust Text Recognizer for Irregular Text in Natural Scenes

Xiaoqian Li, Jie Liu, Shuwu Zhang

Auto-TLDR; IBN-STR: A Robust Text Recognition System Based on Data and Feature Representation

Although text recognition methods based on deep neural networks have promising performance, there are still challenges due to the variety of text styles, perspective distortion, text with large curvature, and so on. To obtain a robust text recognizer, we have improved the performance from two aspects: data aspect and feature representation aspect. In terms of data, we transform the input images into S-shape distorted images in order to increase the diversity of training data. Besides, we explore the effects of different training data. In terms of feature representation, the combination of instance normalization and batch normalization improves the model's capacity and generalization ability. This paper proposes a robust text recognizer IBN-STR, which is an attention-based model. Through extensive experiments, the model analysis and comparison have been carried out from the aspects of data and feature representation, and the effectiveness of IBN-STR on both regular and irregular text instances has been verified. Furthermore, IBN-STR is an end-to-end recognition system that can achieve state-of-the-art performance.

Robust Lexicon-Free Confidence Prediction for Text Recognition

Qi Song, Qianyi Jiang, Rui Zhang, Xiaolin Wei

Auto-TLDR; Confidence Measurement for Optical Character Recognition using Single-Input Multi-Output Network

Benefiting from the success of deep learning, Optical Character Recognition (OCR) has been booming in recent years. Text recognition results are known to be vulnerable to slight perturbations in input images, so a method for measuring how reliable the results are is crucial. In this paper, we present a novel method for confidence measurement given a text recognition result, which can be embedded in any text recognizer with little overhead. Our method consists of two stages in a coarse-to-fine style. The first stage generates multiple candidates for voting coarse scores using a Single-Input Multi-Output network (SIMO). The second stage calculates a refined confidence score based on the voting result and the conditional probabilities of the Top-1 recognition sequence. The highly competitive performance achieved on several standard benchmarks validates the efficiency and effectiveness of the proposed method. Moreover, it can be applied to both Latin and non-Latin languages.

Handwritten Digit String Recognition Using Deep Autoencoder Based Segmentation and ResNet Based Recognition Approach

Anuran Chakraborty, Rajonya De, Samir Malakar, Friedhelm Schwenker, Ram Sarkar

Auto-TLDR; Handwritten Digit Strings Recognition Using Residual Network and Deep Autoencoder Based Segmentation

Recognition of isolated handwritten digits is a well-studied research problem, and several models show high recognition accuracy on different standard datasets. The same is not true of handwritten digit string recognition, although it has many real-life applications such as bank cheque processing, postal code recognition, and numeric field understanding from filled-in form images. The problem becomes more difficult when the digits in the string are not neatly written, which is common in freestyle handwriting. The performance of any such model suffers primarily from the presence of touching digits in the string. To handle these issues, in the present work we first use a deep autoencoder based segmentation technique to isolate the digits from a handwritten digit string, and then pass the isolated digits to a Residual Network (ResNet) based recognition model to obtain the machine-encoded digit string. The proposed model has been evaluated on the Computer Vision Lab (CVL) Handwritten Digit Strings (HDS) database, used in the HDSRC 2013 competition on handwritten digit string recognition, and achieves results competitive with state-of-the-art techniques.

Text Baseline Recognition Using a Recurrent Convolutional Neural Network

Matthias Wödlinger, Robert Sablatnig

Auto-TLDR; Automatic Baseline Detection of Handwritten Text Using Recurrent Convolutional Neural Network

The detection of baselines of text is a necessary pre-processing step for many modern methods of automatic handwriting recognition. In this work a two-stage system for the automatic detection of text baselines of handwritten text is presented. In a first step pixel-wise segmentation on the document image is performed to classify pixels as baselines, start points and end points. This segmentation is then used to extract the start points of lines. Starting from these points the baseline is extracted using a recurrent convolutional neural network that directly outputs the baseline coordinates. This method allows the direct extraction of baseline coordinates as the output of a neural network without the use of any post processing steps. The model is evaluated on the cBAD dataset from the ICDAR 2019 competition on baseline detection.

Improving Word Recognition Using Multiple Hypotheses and Deep Embeddings

Siddhant Bansal, Praveen Krishnan, C. V. Jawahar

Auto-TLDR; EmbedNet: fuse recognition-based and recognition-free approaches for word recognition using learning-based methods

We propose to fuse recognition-based and recognition-free approaches for word recognition using learning-based methods. For this purpose, results obtained using a text recognizer and deep embeddings (generated using an End2End network) are fused. To further improve the embeddings, we propose EmbedNet, which uses triplet loss for training and learns an embedding space where the embedding of the word image lies closer to the embedding of its corresponding text transcription. This updated embedding space helps in choosing the correct prediction with higher confidence. To further improve the accuracy, we propose a plug-and-play module called Confidence based Accuracy Booster (CAB). It takes in the confidence scores obtained from the text recognizer and the Euclidean distances between the embeddings, and generates an updated distance vector. This vector has lower distance values for the correct words and higher distance values for the incorrect words. We rigorously evaluate our proposed method on a collection of books in the Hindi language. Our method achieves an absolute improvement of around 10% in word recognition accuracy.
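
The triplet loss mentioned above has a standard margin-based form: it is zero once the anchor is closer to the positive than to the negative by at least a margin. The sketch below uses plain (unsquared) Euclidean distance and a margin of 1.0; both are assumptions made for illustration, and the toy two-dimensional vectors merely stand in for word-image and text embeddings. It is not EmbedNet's training code.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor: Sequence[float],
                 positive: Sequence[float],
                 negative: Sequence[float],
                 margin: float = 1.0) -> float:
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Anchor: word-image embedding; positive: embedding of its transcription;
# negative: embedding of a different word. All values are made up.
image_emb = [0.0, 0.0]
correct_text_emb = [0.0, 1.0]

easy_loss = triplet_loss(image_emb, correct_text_emb, [3.0, 4.0])
hard_loss = triplet_loss(image_emb, correct_text_emb, [0.0, 1.5])
print(easy_loss, hard_loss)
```

A distant negative contributes no loss, while a negative that lies almost as close as the positive is penalized, which pushes matching image and text embeddings together relative to mismatched ones.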

Cross-Lingual Text Image Recognition Via Multi-Task Sequence to Sequence Learning

Zhuo Chen, Fei Yin, Xu-Yao Zhang, Qing Yang, Cheng-Lin Liu

Auto-TLDR; Cross-Lingual Text Image Recognition with Multi-task Learning

This paper considers recognizing texts shown in a source language and translating them into a target language, without generating intermediate recognition results in the source language. We call this problem Cross-Lingual Text Image Recognition (CLTIR). To solve it, we propose a multi-task system that combines a main task of CLTIR with an auxiliary task of Mono-Lingual Text Image Recognition (MLTIR). Two different sequence-to-sequence learning methods, a convolution-based attention model and a BLSTM model with CTC, are adopted for these tasks respectively. We evaluate the system on a newly collected Chinese-English bilingual movie subtitle image dataset. Experimental results demonstrate that the multi-task learning framework achieves superior performance in both languages.

Equation Attention Relationship Network (EARN) : A Geometric Deep Metric Framework for Learning Similar Math Expression Embedding

Saleem Ahmed, Kenny Davila, Srirangaraj Setlur, Venu Govindaraju

Auto-TLDR; Representational Learning for Similarity Based Retrieval of Mathematical Expressions

Representational learning in the form of high-dimensional embeddings has been used for multiple pattern recognition applications. There has been significant interest in building embedding-based systems for learning representations in the mathematical domain. At the same time, retrieval of structured information such as mathematical expressions is an important need for modern IR systems. In this work, our motivation is to introduce a robust framework for learning representations for similarity-based retrieval of mathematical expressions. Given a query by example, the embedding can find the closest matching expression as a function of the Euclidean distance between them. We leverage recent advancements in image-based and graph-based deep learning algorithms to learn our similarity embeddings. We do this first by using uni-modal encoders in graph space and image space, and then by using a multi-modal combination of the two. To overcome the lack of training data, we force the networks to learn a deep metric using triplets generated with a heuristic scoring function. We also adopt a custom strategy for mining hard samples to train our neural networks. Our system produces rankings similar to those generated by the original scoring function, but in only a fraction of the time. Our results establish the viability of using such a multi-modal embedding for this task.
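
Query-by-example retrieval over such embeddings reduces to ranking indexed expressions by their Euclidean distance to the query embedding. The sketch below shows only that ranking step; the function name `rank_by_distance`, the expression strings, and the two-dimensional vectors are made up for illustration, and `math.dist` requires Python 3.8 or later.

```python
import math
from typing import Dict, List, Sequence


def rank_by_distance(query: Sequence[float],
                     index: Dict[str, Sequence[float]]) -> List[str]:
    """Return indexed expression names, closest to the query first."""
    return sorted(index, key=lambda name: math.dist(query, index[name]))


# Toy embedding index; real embeddings would come from the trained encoders.
index = {
    "x^2 + 1": [0.0, 1.0],
    "a / b": [5.0, 5.0],
    "x^2 + 2": [0.5, 1.0],
}
ranking = rank_by_distance([0.1, 1.0], index)
print(ranking)
```

A linear scan like this is adequate for small collections; a large index would normally use an approximate nearest-neighbor structure instead.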

Ancient Document Layout Analysis: Autoencoders Meet Sparse Coding

Homa Davoudi, Marco Fiorucci, Arianna Traviglia

Auto-TLDR; Unsupervised Representation Learning for Document Layout Analysis

Layout analysis of historical handwritten documents is a key pre-processing step in document image analysis that, by segmenting the image into its homogeneous regions, facilitates subsequent procedures such as optical character recognition and automatic transcription. Learning-based approaches have shown promising performance in layout analysis; however, the majority of them require tedious pixel-wise labelled training data to achieve generalisation capabilities, a limitation that prevents their application when large labelled datasets are lacking. This paper proposes a novel unsupervised representation learning method for document layout analysis that reduces the need for labelled data: a sparse autoencoder is first trained in an unsupervised manner on a historical text document's image; the representation of image patches, computed by the sparse encoder, is then used to classify pixels into various region categories of the document using a feed-forward neural network. A new training method, inspired by the ISTA algorithm, is also introduced here to train the sparse encoder. Experimental results on the DIVA-HisDB dataset demonstrate that the proposed method outperforms previous approaches based on unsupervised representation learning while achieving performance comparable to state-of-the-art fully supervised methods.
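
For context, the classical ISTA iteration that this abstract cites as inspiration can be sketched as follows. This is the textbook algorithm for the lasso problem min 0.5*||Ax - b||^2 + lam*||x||_1, not the authors' training method; the names `ista`, `soft_threshold`, `lam`, and `step` are assumptions made for illustration.

```python
def soft_threshold(v, t):
    """Shrink every entry of v toward zero by t (proximal operator of t*||.||_1)."""
    return [max(abs(x) - t, 0.0) * (1 if x > 0 else -1) for x in v]

def ista(A, b, lam, step, iters=100):
    """Plain ISTA: a gradient step on the quadratic term, then soft-thresholding.
    `step` should not exceed 1/L, where L is the largest eigenvalue of A^T A."""
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(iters):
        # Residual r = A x - b
        r = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(m)]
        # Gradient g = A^T r
        g = [sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]
        x = soft_threshold([x[j] - step * g[j] for j in range(n)], step * lam)
    return x

# With an identity dictionary the solution is simply b shrunk by lam,
# so the small second coefficient is driven exactly to zero.
code = ista([[1.0, 0.0], [0.0, 1.0]], [3.0, 0.5], lam=1.0, step=1.0)
```

The thresholding step is what produces exact zeros, which is why ISTA-style updates are a natural fit for learning sparse codes.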

2D License Plate Recognition based on Automatic Perspective Rectification

Hui Xu, Zhao-Hong Guo, Da-Han Wang, Xiang-Dong Zhou, Yu Shi

Auto-TLDR; Perspective Rectification Network for License Plate Recognition

License plate recognition (LPR) remains a challenging task in the face of difficulties such as image deformation and multi-line character distribution. Text rectification, which is crucial for eliminating the effects of image deformation, has attracted increasing attention in scene text recognition. However, current text rectification methods are not designed specifically for LPR and do not take the features of plate deformation into account. Considering that a license plate (LP), being rigid, can only undergo perspective distortion in the image, in this paper we propose a novel perspective rectification network (PRN) to automatically estimate the perspective transformation and rectify the distorted LP accordingly. For recognition, we propose a location-aware 2D attention based recognition network that is capable of recognizing both single-line and double-line plates with perspective deformation. The rectification network and recognition network are connected for end-to-end training. Experiments on common datasets show that the proposed method achieves state-of-the-art performance, demonstrating the effectiveness of the proposed approach.
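
To make the notion of a perspective transformation concrete, the sketch below applies a 3x3 homography matrix to a 2-D point using homogeneous coordinates. It only illustrates the geometry; how the PRN estimates the matrix is not shown, and the function name and matrix values are hypothetical.

```python
def apply_homography(H, point):
    """Map a 2-D point through a 3x3 perspective matrix H.
    The point is lifted to (x, y, 1), multiplied by H, then divided by w."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)

# A pure translation is the special case where the last row is (0, 0, 1).
shift = [[1.0, 0.0, 5.0],
         [0.0, 1.0, -2.0],
         [0.0, 0.0, 1.0]]
moved = apply_homography(shift, (1.0, 1.0))

# A non-trivial last row makes the scaling depend on position,
# which is what produces perspective foreshortening.
tilt = [[1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0],
        [0.5, 0.0, 1.0]]
warped = apply_homography(tilt, (2.0, 4.0))
```

Rectification amounts to applying the inverse of the estimated distortion so that the plate corners map back to an axis-aligned rectangle.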

Automated Whiteboard Lecture Video Summarization by Content Region Detection and Representation

Bhargava Urala Kota, Alexander Stone, Kenny Davila, Srirangaraj Setlur, Venu Govindaraju

Auto-TLDR; A Framework for Summarizing Whiteboard Lecture Videos Using Feature Representations of Handwritten Content Regions

Lecture videos are rapidly becoming an invaluable source of information for students across the globe. Given the large number of online courses currently available, it is important to condense the information within these videos into a compact yet representative summary that can be used for search-based applications. We propose a framework to summarize whiteboard lecture videos by finding feature representations of detected handwritten content regions to determine unique content. We investigate multi-scale histogram of gradients and embeddings from deep metric learning for feature representation. We explicitly handle occluded, growing and disappearing handwritten content. Our method is capable of producing two kinds of lecture video summaries: the unique regions themselves, or so-called key content, and keyframes (which contain all unique content in a video segment). We use weighted spatio-temporal conflict minimization to segment the lecture and produce keyframes from detected regions and features. We evaluate both types of summaries and find that we obtain state-of-the-art performance in terms of number of summary keyframes while our unique content recall and precision are comparable to the state of the art.

A Fast and Accurate Object Detector for Handwritten Digit String Recognition

Jun Guo, Wenjing Wei, Yifeng Ma, Cong Peng

Auto-TLDR; ChipNet: An anchor-free object detector for handwritten digit string recognition

Focusing on handwritten digit string recognition (HDSR), we propose an anchor-free object detector called ChipNet, for which a novel encoding method is designed. The input image is divided into columns, and these columns are then encoded by the ground truth. Adjacent columns are responsible for detecting the same target, which addresses the class-imbalance problem while reducing network computation. ChipNet is composed of convolutional and bidirectional long short-term memory networks. Unlike typical detectors, it does not use region proposals, anchors or region-of-interest pooling. Hence, it can overcome the shortcomings of anchor-based and dense detectors in HDSR. Experiments are conducted on synthetic digit strings, the CVL HDS database, and the ORAND-CAR-A & B databases. The accuracies achieved surpass previously reported results by a large margin (up to 6.62%). Furthermore, ChipNet runs at 219 FPS on 160×32 px images when using a Tesla P100 GPU. The results also show that ChipNet can handle touching, connected and arbitrary-length digit strings, and that the obtained accuracies in HDSR are as high as those in single handwritten digit recognition.

Recognizing Multiple Text Sequences from an Image by Pure End-To-End Learning

Zhenlong Xu, Shuigeng Zhou, Fan Bai, Cheng Zhanzhan, Yi Niu, Shiliang Pu

Auto-TLDR; Pure End-to-End Learning for Multiple Text Sequences Recognition from Images

We address a challenging problem: recognizing multiple text sequences from an image by pure end-to-end learning. It is twofold: 1) Multiple text sequences recognition. Each image may contain multiple text sequences of different content, location and orientation, and we try to recognize all these texts in the image. 2) Pure end-to-end (PEE) learning. We solve the problem in a pure end-to-end learning way where each training image is labeled by only text transcripts of the contained sequences, without any geometric annotations. Most existing works recognize multiple text sequences from an image in a non-end-to-end (NEE) or quasi-end-to-end (QEE) way, in which each image is trained with both text transcripts and text locations. Only recently, a PEE method was proposed to recognize text sequences from an image where the text sequence was split into several lines in the image. However, it cannot be directly applied to recognizing multiple text sequences from an image. So in this paper, we propose a pure end-to-end learning method to recognize multiple text sequences from an image. Our method directly learns the probability distribution of multiple sequences conditioned on each input image, and outputs multiple text transcripts with a well-designed decoding strategy. To evaluate the proposed method, we construct several datasets mainly based on an existing public dataset and two real application scenarios. Experimental results show that the proposed method can effectively recognize multiple text sequences from images, and outperforms CTC-based and attention-based baseline methods.

Learning to Sort Handwritten Text Lines in Reading Order through Estimated Binary Order Relations

Lorenzo Quirós, Enrique Vidal

Auto-TLDR; Automatic Reading Order of Text Lines in Handwritten Text Documents

Recent advances in Handwritten Text Recognition and Document Layout Analysis make it possible to extract information from digitized documents and make them accessible beyond the archive shelves. But the reading order of the elements in those documents is still an open problem that has to be solved in order to provide that information with the correct structure. Most studies on the reading order task are rule-based approaches that focus on printed documents, while less attention has been paid to handwritten text documents. In this work we propose a new approach to automatically determine the reading order of text lines in handwritten text documents. The task is approached as a sorting problem where the order-relation operator is learned directly from examples. We demonstrate the effectiveness of our method on three different datasets.
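
The idea of sorting with a learned order relation can be sketched with a pairwise comparator. In the sketch below, `precedes_prob` is a hypothetical stand-in for a trained model that estimates the probability that one line is read before another; here it is replaced by a simple top-to-bottom, left-to-right rule so the example runs on its own. The dictionary fields and function names are assumptions, not the authors' interface.

```python
from functools import cmp_to_key

def precedes_prob(a, b):
    """Stand-in for a learned binary order relation: P(a is read before b).
    A real system would query a trained classifier here."""
    return 1.0 if (a["y"], a["x"]) < (b["y"], b["x"]) else 0.0

def compare(a, b):
    """Turn the pairwise probability into a three-way comparison."""
    p = precedes_prob(a, b)
    if p > 0.5:
        return -1
    if p < 0.5:
        return 1
    return 0

def reading_order(lines):
    """Sort text lines using the pairwise relation as the comparison operator."""
    return sorted(lines, key=cmp_to_key(compare))

lines = [
    {"id": "c", "x": 10, "y": 200},
    {"id": "a", "x": 10, "y": 20},
    {"id": "b", "x": 300, "y": 20},
]
ordered = [line["id"] for line in reading_order(lines)]
```

One caveat of this simple design: a learned relation is not guaranteed to be transitive, so the result of a comparison sort can depend on the input order when the model's pairwise predictions conflict.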

PICK: Processing Key Information Extraction from Documents Using Improved Graph Learning-Convolutional Networks

Wenwen Yu, Ning Lu, Xianbiao Qi, Ping Gong, Rong Xiao

Auto-TLDR; PICK: A Graph Learning Framework for Key Information Extraction from Documents

Computer vision with state-of-the-art deep learning models has recently achieved huge success in the field of Optical Character Recognition (OCR), including text detection and recognition tasks. However, Key Information Extraction (KIE) from documents, the downstream task of OCR with a large number of real-world use scenarios, remains a challenge because documents have not only textual features extracted from OCR systems but also semantic visual features that are not fully exploited and play a critical role in KIE. Too little work has been devoted to making full and efficient use of both the textual and visual features of documents. In this paper, we introduce PICK, a framework that is effective and robust in handling complex document layouts for KIE by combining graph learning with graph convolution operation, yielding a richer semantic representation containing the textual and visual features and global layout without ambiguity. Extensive experiments on real-world datasets have been conducted to show that our method outperforms baseline methods by significant margins.

Context Matters: Self-Attention for Sign Language Recognition

Fares Ben Slimane, Mohamed Bouguessa

Auto-TLDR; Attentional Network for Continuous Sign Language Recognition

This paper proposes an attentional network for the task of Continuous Sign Language Recognition. The proposed approach exploits co-independent streams of data to model the sign language modalities. These different channels of information can share a complex temporal structure between each other. For that reason, we apply attention to synchronize and help capture entangled dependencies between the different sign language components. Even though Sign Language is multi-channel, handshapes represent the central entities in sign interpretation. Seeing handshapes in their correct context defines the meaning of a sign. Taking that into account, we utilize the attention mechanism to efficiently aggregate the hand features with their appropriate Spatio-temporal context for better sign recognition. We found that by doing so the model is able to identify the essential Sign Language components that revolve around the dominant hand and the face areas. We test our model on the benchmark dataset RWTH-PHOENIX-Weather 2014, yielding competitive results.

The DeepScoresV2 Dataset and Benchmark for Music Object Detection

Lukas Tuggener, Yvan Putra Satyawan, Alexander Pacha, Jürgen Schmidhuber, Thilo Stadelmann

Auto-TLDR; DeepScoresV2: an extended version of the DeepScores dataset for optical music recognition

In this paper, we present DeepScoresV2, an extended version of the DeepScores dataset for optical music recognition (OMR). We improve upon the original DeepScores dataset by providing much more detailed annotations, namely (a) annotations for 135 classes including fundamental symbols of non-fixed size and shape, increasing the number of annotated symbols by 23%; (b) oriented bounding boxes; (c) higher-level rhythm and pitch information (onset beat for all symbols and line position for noteheads); and (d) a compatibility mode for easy use in conjunction with the MUSCIMA++ dataset for OMR on handwritten documents. These additions open up the potential for future advancement in OMR research. Additionally, we release two state-of-the-art baselines for DeepScoresV2 based on Faster R-CNN and the Deep Watershed Detector. An analysis of the baselines shows that regular orthogonal bounding boxes are unsuitable for objects which are long, small, and potentially rotated, such as ties and beams, which demonstrates the need for detection algorithms that naturally incorporate object angles. Dataset, code and pre-trained models, as well as user instructions, are publicly available at https://tuggeluk.github.io/dsv2_preview/

Human or Machine? It Is Not What You Write, but How You Write It

Luis Leiva, Moises Diaz, M.A. Ferrer, Réjean Plamondon

Auto-TLDR; Behavioral Biometrics via Handwritten Symbols for Identification and Verification

Online fraud often involves identity theft. Since most security measures are weak or can be spoofed, we investigate a more nuanced and less explored avenue: behavioral biometrics via handwriting movements. This kind of data can be used to verify if a legitimate user is operating a device or a computer application, so it is important to distinguish between human and machine-generated movements reliably. For this purpose, we study handwritten symbols (isolated characters, digits, gestures, and signatures) produced by humans and machines, and compare and contrast several deep learning models. We find that if symbols are presented as static images, they can fool state-of-the-art classifiers (near 75% accuracy in the best case) but can be distinguished with remarkable accuracy if they are presented as temporal sequences (95% accuracy in the average case). We conclude that an accurate detection of fake movements has more to do with how users write, rather than what they write. Our work has implications for computerized systems that need to authenticate or verify legitimate human users, and provides an additional layer of security to keep attackers at bay.

A Multi-Head Self-Relation Network for Scene Text Recognition

Zhou Junwei, Hongchao Gao, Jiao Dai, Dongqin Liu, Jizhong Han

Auto-TLDR; Multi-head Self-relation Network for Scene Text Recognition

The text embedded in scene images can be seen everywhere in our lives. However, recognizing text from natural scene images is still a challenge because of its diverse shapes and distorted patterns. Recently, advanced recognition networks generally treat scene text recognition as a sequence prediction task. Although they achieve excellent performance, these recognition networks consider the feature map cells as independent individuals and update each cell's state without utilizing the information of its neighboring cells. Moreover, because of the local receptive field of a traditional convolutional neural network (CNN), a single cell cannot cover the whole text region in an image. Due to these issues, existing recognition networks cannot extract the global context in a visual scene. To deal with the above problems, we propose a Multi-head Self-relation Network (MSRN) for scene text recognition in this paper. The MSRN consists of several multi-head self-relation layers, which are designed to extract the global context of a visual scene by transforming each cell into a new cell that fuses the information of the related cells. Furthermore, experiments demonstrate that our proposed recognition network achieves superior performance on several public benchmark datasets including IC03, IC13, IC15, and SVT-Perspective.

Recognizing Bengali Word Images - A Zero-Shot Learning Perspective

Sukalpa Chanda, Daniël Arjen Willem Haitink, Prashant Kumar Prasad, Jochem Baas, Umapada Pal, Lambert Schomaker

Auto-TLDR; Zero-Shot Learning for Word Recognition in Bengali Script

Zero-Shot Learning (ZSL) techniques can classify a completely unseen class, one never encountered during training. This makes them well suited to real-life classification problems, where it is not possible to train a system with annotated data for all possible class types. This work investigates recognition of word images written in Bengali script in a ZSL framework. The proposed approach performs Zero-Shot word recognition by coupling deep learned features procured from the VGG16 architecture with 13 basic shapes/stroke primitives commonly observed in Bengali script characters. As per the notion of the ZSL framework, those 13 basic shapes are termed “Signature Attributes”. The obtained results are promising; evaluation was carried out in a five-fold cross-validation setup dealing with samples from 250 word classes.

Textual-Content Based Classification of Bundles of Untranscribed Manuscript Images

José Ramón Prieto Fontcuberta, Enrique Vidal, Vicente Bosch, Carlos Alonso, Carmen Orcero, Lourdes Márquez

Auto-TLDR; Probabilistic Indexing for Text-based Classification of Manuscripts

Content-based classification of manuscripts is an important task that is generally performed in archives and libraries by experts with a wealth of knowledge of the manuscripts' contents. Unfortunately, many manuscript collections are so vast that it is not feasible to rely solely on experts to perform this task. Current approaches for textual-content-based manuscript classification generally require the handwritten images to be first transcribed into text, but achieving sufficiently accurate transcripts is generally infeasible for large sets of historical manuscripts. We propose a new approach to automatically perform this classification task which does not rely on any explicit image transcripts. It is based on "probabilistic indexing", a relatively novel technology which makes it possible to effectively represent the intrinsic word-level uncertainty generally exhibited by handwritten text images. We assess the performance of this approach on a large collection of complex manuscripts from the Spanish Archivo General de Indias, with promising results.

Continuous Sign Language Recognition with Iterative Spatiotemporal Fine-Tuning

Kenessary Koishybay, Medet Mukushev, Anara Sandygulova

Auto-TLDR; A Deep Neural Network for Continuous Sign Language Recognition with Iterative Gloss Recognition

This paper aims to develop a deep neural network for Continuous Sign Language Recognition (CSLR) with iterative Gloss Recognition (GR) fine-tuning. CSLR has been a popular research field in recent years and iterative optimization methods are well established. This paper introduces our proposed architecture, involving a spatiotemporal feature-extraction model to segment useful "gloss-unit" features and a BiLSTM with CTC as a sequence model. The spatiotemporal feature extractor is used for both image feature extraction and sequence length reduction. To this end, we compare different architectures for the feature extractor and the sequence model. In addition, we iteratively fine-tune the feature extractor on gloss-unit video segments with alignments from the end2end model. During the iterative training, we use a novel alignment correction technique, which is based on minimum transformations of Levenshtein distance. All the experiments were conducted on the RWTH-PHOENIX-Weather-2014 dataset.
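
The Levenshtein distance mentioned in this abstract is the minimum number of insertions, deletions, and substitutions needed to turn one sequence into another. The sketch below is the standard dynamic-programming computation, shown on both strings and gloss-like token lists; it is not the authors' alignment correction technique, and the example glosses are made up.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of tokens),
    computed row by row so only two rows of the table are kept."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # deleting the first i items of a reaches the empty prefix of b
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

chars = levenshtein("kitten", "sitting")
# Hypothetical gloss sequences: the prediction is missing one gloss.
glosses = levenshtein(["RAIN", "TOMORROW", "NORTH"], ["RAIN", "NORTH"])
```

Because the function only relies on equality between items, it works unchanged on predicted and reference gloss sequences.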

Text Recognition - Real World Data and Where to Find Them

Klára Janoušková, Lluis Gomez, Dimosthenis Karatzas, Jiri Matas

Auto-TLDR; Exploiting Weakly Annotated Images for Text Extraction

We present a method for exploiting weakly annotated images to improve text extraction pipelines. The approach uses an arbitrary end-to-end text recognition system to obtain text region proposals and their, possibly erroneous, transcriptions. The proposed method includes matching of imprecise transcription to weak annotations and edit distance guided neighbourhood search. It produces nearly error-free, localised instances of scene text, which we treat as "pseudo ground truth" (PGT). We apply the method to two weakly-annotated datasets. Training with the extracted PGT consistently improves the accuracy of a state of the art recognition model, by 3.7 % on average, across different benchmark datasets (image domains) and 24.5 % on one of the weakly annotated datasets.

MEAN: A Multi-Element Attention Based Network for Scene Text Recognition

Ruijie Yan, Liangrui Peng, Shanyu Xiao, Gang Yao, Jaesik Min

Auto-TLDR; Multi-element Attention Network for Scene Text Recognition

Scene text recognition is a challenging problem due to the wide variances in content, style, orientation, and image quality of text instances in natural scene images. To learn the intrinsic representation of scene texts, a novel multi-element attention (MEA) mechanism is proposed to exploit geometric structures from local to global levels in the feature map extracted from a scene text image. The MEA mechanism is a generalized form of self-attention technique with the incorporation of graph structure modeling. The elements in feature maps are taken as the nodes of an undirected graph, and three kinds of adjacency matrices are introduced to aggregate information at local, neighborhood and global levels before calculating the attention weights. If only the local adjacency matrix is used, the MEA mechanism degenerates to a self-attention form. A multi-element attention network (MEAN) is implemented which includes a CNN for feature extraction, an encoder with MEA mechanism and a decoder for predicting text codes. Orientation positional encoding information is further added to the feature map output by the CNN, and a feature sequence as the encoder's input is obtained by element-level decomposition of the feature map. Experimental results show that MEAN has achieved state-of-the-art or competitive performance on public English scene text datasets. Further experiments and analyses conducted on both English and Chinese scene text datasets show that MEAN can handle horizontal, vertical, and irregular scene text samples.

Vision-Based Layout Detection from Scientific Literature Using Recurrent Convolutional Neural Networks

Huichen Yang, William Hsu

Auto-TLDR; Transfer Learning for Scientific Literature Layout Detection Using Convolutional Neural Networks

We present an approach for adapting convolutional neural networks for object recognition and classification to scientific literature layout detection (SLLD), a shared subtask of several information extraction problems. Scientific publications contain multiple types of information sought by researchers in various disciplines, organized into an abstract, bibliography, and sections documenting related work, experimental methods, and results; however, there is no effective way to extract this information due to their diverse layout. In this paper, we present a novel approach to developing an end-to-end learning framework to segment and classify major regions of a scientific document. We consider scientific document layout analysis as an object detection task over digital images, without any additional text features that need to be added into the network during the training process. Our technical objective is to implement transfer learning via fine-tuning of pre-trained networks and thereby demonstrate that this deep learning architecture is suitable for tasks that lack very large document corpora for training. As part of the experimental test bed for empirical evaluation of this approach, we created a merged multi-corpus data set for scientific publication layout detection tasks. Our results show good improvement with fine-tuning of a pre-trained base network using this merged data set, compared to the baseline convolutional neural network architecture.

Cross-People Mobile-Phone Based Airwriting Character Recognition

Yunzhe Li, Hui Zheng, He Zhu, Haojun Ai, Xiaowei Dong

Auto-TLDR; Cross-People Airwriting Recognition via Motion Sensor Signal via Deep Neural Network

Airwriting using mobile phones has many applications in human-computer interaction. However, recognizing airwriting characters requires a lot of training data from the user, which makes practical application difficult. A model learnt from a specific person often cannot yield satisfactory results when used on another person. The data gap between people is mainly caused by the following factors: personal writing styles, mobile phone sensors, and ways of holding mobile phones. To address the cross-people problem, we propose a deep neural network (DNN) that combines a convolutional neural network (CNN) and a bidirectional long short-term memory (BLSTM) network. In each layer of the network, we also add an AdaBN layer, which increases the generalization ability of the DNN. Unlike the original AdaBN method, we explore its feasibility for semi-supervised learning. We incorporate it into our design and conduct comprehensive experiments. The evaluation results show that our system can achieve an accuracy of 99% for recognition and an improvement of 10% on average for transfer learning across factors such as people, devices and postures. To the best of our knowledge, our work is the first to implement cross-people airwriting recognition via motion sensor signals, which is a fundamental step towards ubiquitous sensing.

Image-Based Table Cell Detection: A New Dataset and an Improved Detection Method

Dafeng Wei, Hongtao Lu, Yi Zhou, Kai Chen

Auto-TLDR; TableCell: A Semi-supervised Dataset for Table-wise Detection and Recognition

Table detection and recognition have been in the spotlight in recent years; however, the latest works only target the coarse task of table-wise detection. In this paper, we present TableCell, a new image-based dataset which contains 5262 samples with 170K high-precision cell-wise annotations based on a novel semi-supervised method. Several classical deep learning detection models are evaluated to build a strong baseline using the proposed dataset. Furthermore, we propose an efficient table projection method, consisting of row projection and column projection, to facilitate capturing long-range global features. Experiments demonstrate that our proposed method improves the accuracy of table detection. Our dataset and code will be made available at https://github.com/weidafeng/TableCell upon publication.

Trainable Spectrally Initializable Matrix Transformations in Convolutional Neural Networks

Michele Alberti, Angela Botros, Schuetz Narayan, Rolf Ingold, Marcus Liwicki, Mathias Seuret

Auto-TLDR; Trainable and Spectrally Initializable Matrix Transformations for Neural Networks

In this work, we introduce a new architectural component to Neural Networks (NN), i.e., trainable and spectrally initializable matrix transformations on feature maps. While previous literature has already demonstrated the possibility of adding static spectral transformations as feature processors, our focus is on more general trainable transforms. We study the transforms in various architectural configurations on four datasets of different nature: from medical (ColorectalHist, HAM10000) and natural (Flowers) images to historical documents (CB55). With rigorous experiments that control for the number of parameters and randomness, we show that networks utilizing the introduced matrix transformations outperform vanilla neural networks. The observed accuracy increases appreciably across all datasets. In addition, we show that the benefit of spectral initialization leads to significantly faster convergence, as opposed to randomly initialized matrix transformations. The transformations are implemented as auto-differentiable PyTorch modules that can be incorporated into any neural network architecture. The entire code base is open-source.

Cut and Compare: End-To-End Offline Signature Verification Network

Xi Lu, Lin-Lin Huang, Fei Yin

Auto-TLDR; An End-to-End Cut-and-Compare Network for Offline Signature Verification

Offline signature verification, which determines whether a handwritten signature image is genuine or forged for a claimed identity, is needed in many applications. How to extract salient features and how to calculate similarity scores are the major issues. In this paper, we propose a novel end-to-end cut-and-compare network for offline signature verification. Based on the Spatial Transformer Network (STN), discriminative regions are segmented from a pair of input signature images and are compared attentively with the help of an Attentive Recurrent Comparator (ARC). An adaptive distance fusion module is proposed to fuse the distances of these regions. To address the intrapersonal variability problem, we design a smoothed double-margin loss to train the network. The proposed network achieves state-of-the-art performance on the CEDAR, GPDS Synthetic, BHSig-H and BHSig-B datasets, which cover different languages. Furthermore, our network shows strong generalization ability in cross-language tests.

Automatic Semantic Segmentation of Structural Elements related to the Spinal Cord in the Lumbar Region by Using Convolutional Neural Networks

Jhon Jairo Sáenz Gamboa, Maria De La Iglesia-Vaya, Jon Ander Gómez

Auto-TLDR; Semantic Segmentation of Lumbar Spine Using Convolutional Neural Networks

This work addresses the problem of automatically segmenting MR images of the lumbar spine. The purpose is to detect and delimit the different structural elements such as vertebrae, intervertebral discs, nerves, blood vessels, etc. This task is known as semantic segmentation. The approach proposed in this work is based on convolutional neural networks whose output is a mask where each pixel from the input image is classified into one of the possible classes. Classes were defined by radiologists and correspond to structural elements and tissues. The proposed network architectures are variants of the U-Net. Several complementary blocks were used to define the variants: spatial attention models, deep supervision, and multi-kernels at the input. This last block type is based on the idea of inception. The architectures that obtained the best results are described in this paper, and their results are discussed. Two of the proposed architectures outperform the standard U-Net used as a baseline.

Multiple Document Datasets Pre-Training Improves Text Line Detection with Deep Neural Networks

Mélodie Boillet, Christopher Kermorvant, Thierry Paquet

Auto-TLDR; A fully convolutional network for document layout analysis

In this paper, we introduce a fully convolutional network for the document layout analysis task. While state-of-the-art methods use models pre-trained on natural scene images, our method relies on a U-shaped model trained from scratch for detecting objects in historical documents. We consider the line segmentation task, and more generally the layout analysis problem, as a pixel-wise classification task, so our model outputs a pixel labeling of the input images. We show that our method outperforms state-of-the-art methods on various datasets and also demonstrate that parts pre-trained on natural scene images are not required to reach good results. In addition, we show that pre-training on multiple document datasets can improve performance. We evaluate the models using various metrics to provide a fair and complete comparison between the methods.

Deep Transfer Learning for Alzheimer’s Disease Detection

Nicole Cilia, Claudio De Stefano, Francesco Fontanella, Claudio Marrocco, Mario Molinara, Alessandra Scotto Di Freca

Auto-TLDR; Automatic Detection of Handwriting Alterations for Alzheimer's Disease Diagnosis using Dynamic Features

Early detection of Alzheimer’s Disease (AD) is essential in order to initiate therapies that can reduce the effects of the disease, improving both the quality of life and the life expectancy of patients. Among all the activities carried out in our daily life, handwriting seems to be one of the first to be affected by the onset of neurodegenerative diseases. For this reason, the analysis of handwriting and the study of its alterations have become of great interest in this research field, with the aim of making a diagnosis as early as possible. In recent years, many studies have tried to apply classification algorithms to handwriting in order to implement decision support systems for AD diagnosis. A key issue for the use of these techniques is the detection of effective features that allow the system to distinguish the natural handwriting alterations due to age from those caused by neurodegenerative disorders. In this context, many interesting results have been published in the literature in which the features have typically been selected by hand, generally considering the dynamics of the handwriting process in order to detect motor disorders closely related to AD. Features directly derived from handwriting generation models can also be very helpful for AD diagnosis. It should be remarked, however, that the above features do not consider changes in the shape of handwritten traces, which may occur as a consequence of neurodegenerative diseases, nor the correlation between shape alterations and changes in the dynamics of the handwriting process. Starting from these considerations, the aim of this study is to verify whether the combined use of both shape and dynamic features allows a decision support system to improve its performance in AD diagnosis.
For this purpose, starting from a database of on-line handwriting samples, we generated for each of them a synthetic off-line colour image, where the colour of each elementary trait encodes, in the three RGB channels, the dynamic information associated with that trait. Finally, we exploited the capability of Deep Neural Networks (DNN) to automatically extract features from raw images. The experimental comparison of the results obtained by using standard features and features extracted according to the above procedure confirmed the effectiveness of our approach.
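
The idea of turning an on-line sample into a colour image can be sketched as follows. The abstract does not say which dynamic quantities are mapped to which channel, so the choices here are assumptions made purely for illustration: red encodes pen speed, green encodes mean pressure, and blue encodes the elapsed fraction of the writing time. The function `render_strokes` and the `max_speed` normalisation parameter are hypothetical.

```python
import math
from typing import List, Tuple

# (x, y, pressure in [0, 1], time in seconds): a hypothetical sample format.
Point = Tuple[float, float, float, float]
RGB = Tuple[int, int, int]


def render_strokes(points: List[Point], width: int, height: int,
                   max_speed: float) -> List[List[RGB]]:
    """Draw each segment between consecutive points in a colour that
    encodes its dynamics (assumed mapping: R=speed, G=pressure, B=time)."""
    white = (255, 255, 255)
    image = [[white for _ in range(width)] for _ in range(height)]
    start = points[0][3]
    total = points[-1][3] - start
    for (x0, y0, p0, t0), (x1, y1, p1, t1) in zip(points, points[1:]):
        length = math.hypot(x1 - x0, y1 - y0)
        dt = t1 - t0
        speed = length / dt if dt > 0 else 0.0
        red = round(255 * min(speed / max_speed, 1.0))
        green = round(255 * (p0 + p1) / 2)
        blue = round(255 * (t0 - start) / total) if total > 0 else 0
        # Rasterise the segment by sampling it at roughly one-pixel steps.
        steps = max(math.ceil(length), 1)
        for i in range(steps + 1):
            f = i / steps
            x = round(x0 + f * (x1 - x0))
            y = round(y0 + f * (y1 - y0))
            if 0 <= x < width and 0 <= y < height:
                image[y][x] = (red, green, blue)
    return image


# One horizontal segment drawn at full pressure and at the maximum speed.
example_image = render_strokes(
    [(0.0, 0.0, 1.0, 0.0), (3.0, 0.0, 1.0, 1.0)],
    width=4, height=2, max_speed=3.0,
)
```

An image built this way keeps the shape of the trace while carrying dynamic information in its colours, so a network trained on raw images can in principle use both.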

A Gated and Bifurcated Stacked U-Net Module for Document Image Dewarping

Hmrishav Bandyopadhyay, Tanmoy Dasgupta, Nibaran Das, Mita Nasipuri

Auto-TLDR; Gated and Bifurcated Stacked U-Net for Dewarping Document Images

Capturing images of documents is one of the easiest and most widely used methods of recording them. These images, however, are usually captured with handheld devices and often contain undesirable distortions that are hard to remove. We propose a supervised Gated and Bifurcated Stacked U-Net module to predict a dewarping grid and create a distortion-free image from the input. While the network is trained on synthetically warped document images, results are calculated on real-world images. The novelty of our method lies not only in the bifurcation of the U-Net, which helps eliminate the intermingling of the grid coordinates, but also in the use of a gated network, which adds boundary and other minute line-level details to the model. The proposed end-to-end pipeline achieves state-of-the-art performance on the DocUNet dataset after being trained on just 8 percent of the data used in previous methods.
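
How a predicted dewarping grid might be used to produce the flattened image can be shown with a small sketch. It assumes a backward-mapping grid, in which each output pixel stores the (x, y) position to read from in the warped input, and nearest-neighbour sampling; the abstract does not specify the grid convention or the interpolation, and `apply_dewarping_grid` is a hypothetical name.

```python
from typing import List, Tuple

Image = List[List[int]]
# grid[y][x] = (source_x, source_y) in the warped image (assumed convention).
Grid = List[List[Tuple[float, float]]]


def apply_dewarping_grid(warped: Image, grid: Grid, fill: int = 0) -> Image:
    """Build the output image by sampling the warped image at the
    positions given by the grid, using nearest-neighbour lookup."""
    src_h, src_w = len(warped), len(warped[0])
    output = []
    for grid_row in grid:
        out_row = []
        for sx, sy in grid_row:
            ix, iy = round(sx), round(sy)
            if 0 <= ix < src_w and 0 <= iy < src_h:
                out_row.append(warped[iy][ix])
            else:
                # Positions outside the input are filled with a constant.
                out_row.append(fill)
        output.append(out_row)
    return output


# A toy grid that reads every pixel from one column to the right.
example_warped = [[1, 2, 3], [4, 5, 6]]
example_grid = [[(x + 1.0, float(y)) for x in range(3)] for y in range(2)]
example_flat = apply_dewarping_grid(example_warped, example_grid)
```

The grid holds two coordinate values per pixel, which is the quantity whose x and y components the abstract says the bifurcated design keeps from intermingling.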