Recognizing Bengali Word Images - A Zero-Shot Learning Perspective

Sukalpa Chanda, Daniël Arjen Willem Haitink, Prashant Kumar Prasad, Jochem Baas, Umapada Pal, Lambert Schomaker

Responsive image

Auto-TLDR; Zero-Shot Learning for Word Recognition in Bengali Script

Slides Poster

Zero-Shot Learning(ZSL) techniques could classify a completely unseen class, which it has never seen before during training. Thus, making it more apt for any real-life classification problem, where it is not possible to train a system with annotated data for all possible class types. This work investigates recognition of word images written in Bengali Script in a ZSL framework. The proposed approach performs Zero-Shot word recognition by coupling deep learned features procured from VGG16 architecture along with 13 basic shapes/stroke primitives commonly observed in Bengali script characters. As per the notion of ZSL framework those 13 basic shapes are termed as “Signature Attributes”. The obtained results are promising while evaluation was carried out in a Five-Fold cross-validation setup dealing with samples from 250 word classes.

Similar papers

Prior Knowledge about Attributes: Learning a More Effective Potential Space for Zero-Shot Recognition

Chunlai Chai, Yukuan Lou

Responsive image

Auto-TLDR; Attribute Correlation Potential Space Generation for Zero-Shot Learning

Slides Poster Similar

Zero-shot learning (ZSL) aims to recognize unseen classes accurately by learning seen classes and known attributes, but correlations in attributes were ignored by previous study which lead to classification results confused. To solve this problem, we build an Attribute Correlation Potential Space Generation (ACPSG) model which uses a graph convolution network and attribute correlation to generate a more discriminating potential space. Combining potential discrimination space and user-defined attribute space, we can better classify unseen classes. Our approach outperforms some existing state-of-the-art methods on several benchmark datasets, whether it is conventional ZSL or generalized ZSL.

Watch Your Strokes: Improving Handwritten Text Recognition with Deformable Convolutions

Iulian Cojocaru, Silvia Cascianelli, Lorenzo Baraldi, Massimiliano Corsini, Rita Cucchiara

Responsive image

Auto-TLDR; Deformable Convolutional Neural Networks for Handwritten Text Recognition

Slides Poster Similar

Handwritten Text Recognition (HTR) in free-layout pages is a valuable yet challenging task which aims to automatically understand handwritten texts. State-of-the-art approaches in this field usually encode input images with Convolutional Neural Networks, whose kernels are typically defined on a fixed grid and focus on all input pixels independently. However, this is in contrast with the sparse nature of handwritten pages, in which only pixels representing the ink of the writing are useful for the recognition task. Furthermore, the standard convolution operator is not explicitly designed to take into account the great variability in shape, scale, and orientation of handwritten characters. To overcome these limitations, we investigate the use of deformable convolutions for handwriting recognition. This type of convolution deform the convolution kernel according to the content of the neighborhood, and can therefore be more adaptable to geometric variations and other deformations of the text. Experiments conducted on the IAM and RIMES datasets demonstrate that the use of deformable convolutions is a promising direction for the design of novel architectures for handwritten text recognition.

A Few-Shot Learning Approach for Historical Ciphered Manuscript Recognition

Mohamed Ali Souibgui, Alicia Fornés, Yousri Kessentini, Crina Tudor

Responsive image

Auto-TLDR; Handwritten Ciphers Recognition Using Few-Shot Object Detection

Slides Similar

Encoded (or ciphered) manuscripts are a special type of historical documents that contain encrypted text. The automatic recognition of this kind of documents is challenging because: 1) the cipher alphabet changes from one document to another, 2) there is a lack of annotated corpus for training and 3) touching symbols make the symbol segmentation difficult and complex. To overcome these difficulties, we propose a novel method for handwritten ciphers recognition based on few-shot object detection. Our method first detects all symbols of a given alphabet in a line image, and then a decoding step maps the symbol similarity scores to the final sequence of transcribed symbols. By training on synthetic data, we show that the proposed architecture is able to recognize handwritten ciphers with unseen alphabets. In addition, if few labeled pages with the same alphabet are used for fine tuning, our method surpasses existing unsupervised and supervised HTR methods for ciphers recognition.

Writer Identification Using Deep Neural Networks: Impact of Patch Size and Number of Patches

Akshay Punjabi, José Ramón Prieto Fontcuberta, Enrique Vidal

Responsive image

Auto-TLDR; Writer Recognition Using Deep Neural Networks for Handwritten Text Images

Slides Poster Similar

Traditional approaches for the recognition or identification of the writer of a handwritten text image used to relay on heuristic knowledge about the shape and other features of the strokes of previously segmented characters. However, recent works have done significantly advances on the state of the art thanks to the use of various types of deep neural networks. In most of all of these works, text images are decomposed into patches, which are processed by the networks without any previous character or word segmentation. In this paper, we study how the way images are decomposed into patches impact recognition accuracy, using three publicly available datasets. The study also includes a simpler architecture where no patches are used at all - a single deep neural network inputs a whole text image and directly provides a writer recognition hypothesis. Results show that bigger patches generally lead to improved accuracy, achieving in one of the datasets a significant improvement over the best results reported so far.

Zero-Shot Text Classification with Semantically Extended Graph Convolutional Network

Tengfei Liu, Yongli Hu, Junbin Gao, Yanfeng Sun, Baocai Yin

Responsive image

Auto-TLDR; Semantically Extended Graph Convolutional Network for Zero-shot Text Classification

Slides Poster Similar

As a challenging task of Natural Language Processing(NLP), zero-shot text classification has attracted more and more attention recently. It aims to detect classes that the model has never seen in the training set. For this purpose, a feasible way is to construct connection between the seen and unseen classes by semantic extension and classify the unseen classes by information propagation over the connection. Although many related zero-shot text classification methods have been exploited, how to realize semantic extension properly and propagate information effectively is far from solved. In this paper, we propose a novel zero-shot text classification method called Semantically Extended Graph Convolutional Network (SEGCN). In the proposed method, the semantic category knowledge from ConceptNet is utilized to semantic extension for linking seen classes to unseen classes and constructing a graph of all classes. Then, we build upon Graph Convolutional Network (GCN) for predicting the textual classifier for each category, which transfers the category knowledge by the convolution operators on the constructed graph and is trained in a semi-supervised manner using the samples of the seen classes. The experimental results on Dbpedia and 20newsgroup datasets show that our method outperforms the state of the art zero-shot text classification methods.

Improving Word Recognition Using Multiple Hypotheses and Deep Embeddings

Siddhant Bansal, Praveen Krishnan, C. V. Jawahar

Responsive image

Auto-TLDR; EmbedNet: fuse recognition-based and recognition-free approaches for word recognition using learning-based methods

Slides Poster Similar

We propose to fuse recognition-based and recognition-free approaches for word recognition using learning-based methods. For this purpose, results obtained using a text recognizer and deep embeddings (generated using an End2End network) are fused. To further improve the embeddings, we propose EmbedNet, it uses triplet loss for training and learns an embedding space where the embedding of the word image lies closer to its corresponding text transcription’s embedding. This updated embedding space helps in choosing the correct prediction with higher confidence. To further improve the accuracy, we propose a plug-and-play module called Confidence based Accuracy Booster (CAB). It takes in the confidence scores obtained from the text recognizer and Euclidean distances between the embeddings and generates an updated distance vector. This vector has lower distance values for the correct words and higher distance values for the incorrect words. We rigorously evaluate our proposed method systematically on a collection of books that are in the Hindi language. Our method achieves an absolute improvement of around 10% in terms of word recognition accuracy.

Heterogeneous Graph-Based Knowledge Transfer for Generalized Zero-Shot Learning

Junjie Wang, Xiangfeng Wang, Bo Jin, Junchi Yan, Wenjie Zhang, Hongyuan Zha

Responsive image

Auto-TLDR; Heterogeneous Graph-based Knowledge Transfer for Generalized Zero-Shot Learning

Slides Poster Similar

Generalized zero-shot learning (GZSL) tackles the problem of learning to classify instances involving both seen classes and unseen ones. The key issue is how to effectively transfer the model learned from seen classes to unseen classes. Existing works in GZSL usually assume that some prior information about unseen classes are available. However, such an assumption is unrealistic when new unseen classes appear dynamically. To this end, we propose a novel heterogeneous graph-based knowledge transfer method (HGKT) for GZSL, agnostic to unseen classes and instances, by leveraging graph neural network. Specifically, a structured heterogeneous graph is constructed with high-level representative nodes for seen classes, which are chosen through Wasserstein barycenter in order to simultaneously capture inter-class and intra-class relationship. The aggregation and embedding functions can be learned throughgraph neural network, which can be used to compute the embeddings of unseen classes by transferring the knowledge from their neighbors. Extensive experiments on public benchmark datasets show that our method achieves state-of-the-art results.

A Prototype-Based Generalized Zero-Shot Learning Framework for Hand Gesture Recognition

Jinting Wu, Yujia Zhang, Xiao-Guang Zhao

Responsive image

Auto-TLDR; Generalized Zero-Shot Learning for Hand Gesture Recognition

Slides Poster Similar

Hand gesture recognition plays a significant role in human-computer interaction for understanding various human gestures and their intent. However, most prior works can only recognize gestures of limited labeled classes and fail to adapt to new categories. The task of Generalized Zero-Shot Learning (GZSL) for hand gesture recognition aims to address the above issue by leveraging semantic representations and detecting both seen and unseen class samples. In this paper, we propose an end-to-end prototype-based GZSL framework for hand gesture recognition which consists of two branches. The first branch is a prototype-based detector that learns gesture representations and determines whether an input sample belongs to a seen or unseen category. The second branch is a zero-shot label predictor which takes the features of unseen classes as input and outputs predictions through a learned mapping mechanism between the feature and the semantic space. We further establish a hand gesture dataset that specifically targets this GZSL task, and comprehensive experiments on this dataset demonstrate the effectiveness of our proposed approach on recognizingQuestionnaire both seen and unseen gestures.

Multi-Attribute Learning with Highly Imbalanced Data

Lady Viviana Beltran Beltran, Mickaël Coustaty, Nicholas Journet, Juan C. Caicedo, Antoine Doucet

Responsive image

Auto-TLDR; Data Imbalance in Multi-Attribute Deep Learning Models: Adaptation to face each one of the problems derived from imbalance

Slides Poster Similar

Data is one of the most important keys for success when studying a simple or a complex phenomenon. With the use of deep-learning exploding and its democratization, non-computer science experts may struggle to use highly complex deep learning architectures, even when straightforward models offer them suitable performances. In this article, we study the specific and common problem of data imbalance in real databases as most of the bad performance problems are due to the data itself. We review two points: first, when the data contains different levels of imbalance. Classical imbalanced learning strategies cannot be directly applied when using multi-attribute deep learning models, i.e., multi-task and multi-label architectures. Therefore, one of our contributions is our proposed adaptations to face each one of the problems derived from imbalance. Second, we demonstrate that with little to no imbalance, straightforward deep learning models work well. However, for non-experts, these models can be seen as black boxes, where all the effort is put in pre-processing the data. To simplify the problem, we performed the classification task ignoring information that is costly to extract, such as part localization which is widely used in the state of the art of attribute classification. We make use of a widely known attribute database, CUB-200-2011 - CUB as our main use case due to its deeply imbalanced nature, along with two better structured databases: celebA and Awa2. All of them contain multi-attribute annotations. The results of highly fine-grained attribute learning over CUB demonstrate that in the presence of imbalance, by using our proposed strategies is possible to have competitive results against the state of the art, while taking advantage of multi-attribute deep learning models. We also report results for two better-structured databases over which our models over-perform the state of the art.

Textual-Content Based Classification of Bundles of Untranscribed of Manuscript Images

José Ramón Prieto Fontcuberta, Enrique Vidal, Vicente Bosch, Carlos Alonso, Carmen Orcero, Lourdes Márquez

Responsive image

Auto-TLDR; Probabilistic Indexing for Text-based Classification of Manuscripts

Slides Poster Similar

Content-based classification of manuscripts is an important task that is generally performed in archives and libraries by experts with a wealth of knowledge on the manuscripts contents. Unfortunately, many manuscript collections are so vast that it is not feasible to rely solely on experts to perform this task. Current approaches for textual-content-based manuscript classification generally require the handwritten images to be first transcribed into text -- but achieving sufficiently accurate transcripts is generally unfeasible for large sets of historical manuscripts. We propose a new approach to automatically perform this classification task which does not rely on any explicit image transcripts. It is based on ``probabilistic indexing'', a relatively novel technology which allows to effectively represent the intrinsic word-level uncertainty generally exhibited by handwritten text images. We assess the performance of this approach on a large collection of complex manuscripts from the Spanish Archivo General de Indias, with promising results.

Multimodal Side-Tuning for Document Classification

Stefano Zingaro, Giuseppe Lisanti, Maurizio Gabbrielli

Responsive image

Auto-TLDR; Side-tuning for Multimodal Document Classification

Slides Poster Similar

In this paper, we propose to exploit the side-tuning framework for multimodal document classification. Side-tuning is a methodology for network adaptation recently introduced to solve some of the problems related to previous approaches. Thanks to this technique it is actually possible to overcome model rigidity and catastrophic forgetting of transfer learning by fine-tuning. The proposed solution uses off-the-shelf deep learning architectures leveraging the side-tuning framework to combine a base model with a tandem of two side networks. We show that side-tuning can be successfully employed also when different data sources are considered, e.g. text and images in document classification. The experimental results show that this approach pushes further the limit for document classification accuracy with respect to the state of the art.

VSB^2-Net: Visual-Semantic Bi-Branch Network for Zero-Shot Hashing

Xin Li, Xiangfeng Wang, Bo Jin, Wenjie Zhang, Jun Wang, Hongyuan Zha

Responsive image

Auto-TLDR; VSB^2-Net: inductive zero-shot hashing for image retrieval

Slides Poster Similar

Zero-shot hashing aims at learning hashing model from seen classes and the obtained model is capable of generalizing to unseen classes for image retrieval. Inspired by zero-shot learning, existing zero-shot hashing methods usually transfer the supervised knowledge from seen to unseen classes, by embedding the hamming space to a shared semantic space. However, this makes instances difficult to distinguish due to limited hashing bit numbers, especially for semantically similar unseen classes. We propose a novel inductive zero-shot hashing framework, i.e., VSB^2-Net, where both semantic space and visual feature space are embedded to the same hamming space instead. The reconstructive semantic relationships are established in the hamming space, preserving local similarity relationships and explicitly enlarging the discrepancy between semantic hamming vectors. A two-task architecture, comprising of classification module and visual feature reconstruction module, is employed to enhance the generalization and transfer abilities. Extensive evaluation results on several benchmark datasets demonstratethe superiority of our proposed method compared to several state-of-the-art baselines.

Semantics to Space(S2S): Embedding Semantics into Spatial Space for Zero-Shot Verb-Object Query Inferencing

Sungmin Eum, Heesung Kwon

Responsive image

Auto-TLDR; Semantics-to-Space: Deep Zero-Shot Learning for Verb-Object Interaction with Vectors

Slides Poster Similar

We present a novel deep zero-shot learning (ZSL) model for inferencing human-object-interaction with verb-object (VO) query. While the previous two-stream ZSL approaches only use the semantic/textual information to be fed into the query stream, we seek to incorporate and embed the semantics into the visual representation stream as well. Our approach is powered by Semantics-to-Space (S2S) architecture where semantics derived from the residing objects are embedded into a spatial space of the visual stream. This architecture allows the co-capturing of the semantic attributes of the human and the objects along with their location/size/silhouette information. To validate, we have constructed a new dataset, Verb-Transferability 60 (VT60). VT60 provides 60 different VO pairs with overlapping verbs tailored for testing two-stream ZSL approaches with VO query. Experimental evaluations show that our approach not only outperforms the state-of-the-art, but also shows the capability of consistently improving performance regardless of which ZSL baseline architecture is used.

Chebyshev-Harmonic-Fourier-Moments and Deep CNNs for Detecting Forged Handwriting

Lokesh Nandanwar, Shivakumara Palaiahnakote, Kundu Sayani, Umapada Pal, Tong Lu, Daniel Lopresti

Responsive image

Auto-TLDR; Chebyshev-Harmonic-Fourier-Moments and Deep Convolutional Neural Networks for forged handwriting detection

Slides Poster Similar

Recently developed sophisticated image processing techniques and tools have made easier the creation of high-quality forgeries of handwritten documents including financial and property records. To detect such forgeries of handwritten documents, this paper presents a new method by exploring the combination of Chebyshev-Harmonic-Fourier-Moments (CHFM) and deep Convolutional Neural Networks (D-CNNs). Unlike existing methods work based on abrupt changes due to distortion created by forgery operation, the proposed method works based on inconsistencies and irregular changes created by forgery operations. Inspired by the special properties of CHFM, such as its reconstruction ability by removing redundant information, the proposed method explores CHFM to obtain reconstructed images for the color components of the Original, Forged Noisy and Blurred classes. Motivated by the strong discriminative power of deep CNNs, for the reconstructed images of respective color components, the proposed method used deep CNNs for forged handwriting detection. Experimental results on our dataset and benchmark datasets (namely, ACPR 2019, ICPR 2018 FCD and IMEI datasets) show that the proposed method outperforms existing methods in terms of classification rate.

GuCNet: A Guided Clustering-Based Network for Improved Classification

Ushasi Chaudhuri, Syomantak Chaudhuri, Subhasis Chaudhuri

Responsive image

Auto-TLDR; Semantic Classification of Challenging Dataset Using Guide Datasets

Slides Poster Similar

We deal with the problem of semantic classification of challenging and highly-cluttered dataset. We present a novel, and yet a very simple classification technique by leveraging the ease of classifiability of any existing well separable dataset for guidance. Since the guide dataset which may or may not have any semantic relationship with the experimental dataset, forms well separable clusters in the feature set, the proposed network tries to embed class-wise features of the challenging dataset to those distinct clusters of the guide set, making them more separable. Depending on the availability, we propose two types of guide sets: one using texture (image) guides and another using prototype vectors representing cluster centers. Experimental results obtained on the challenging benchmark RSSCN, LSUN, and TU-Berlin datasets establish the efficacy of the proposed method as we outperform the existing state-of-the-art techniques by a considerable margin.

LODENet: A Holistic Approach to Offline Handwritten Chinese and Japanese Text Line Recognition

Huu Tin Hoang, Chun-Jen Peng, Hung Tran, Hung Le, Huy Hoang Nguyen

Responsive image

Auto-TLDR; Logographic DEComposition Encoding for Chinese and Japanese Text Line Recognition

Slides Poster Similar

One of the biggest obstacles in Chinese and Japanese text line recognition is how to present their enormous character sets. The most common solution is to merely choose and represent a small subset of characters using one-hot encoding. However, such an approach is costly to describe huge character sets, and ignores their semantic relationships. Recent studies have attempted to utilize different encoding methods, but they struggle to build a bijection mapping. In this work, we propose a novel encoding method, called LOgographic DEComposition encoding (LODEC), that can efficiently perform a 1-to-1 mapping for all Chinese and Japanese characters with a strong awareness of semantic relationships. As such, LODEC enables to encode over 21,000 Chinese and Japanese characters by only 520 fundamental elements. Moreover, to handle the vast variety of handwritten texts in the two languages, we propose a novel deep learning (DL) architecture, called LODENet, together with an end-to-end training scheme, that leverages auxiliary data generated by LODEC or other radical-based encoding methods. We performed systematic experiments on both Chinese and Japanese datasets, and found that our approach surpassed the performance of state-of-the-art baselines. Furthermore, empirical evidence shows that our method can gain significantly better results using synthesized text line images without the need for domain knowledge.

The HisClima Database: Historical Weather Logs for Automatic Transcription and Information Extraction

Verónica Romero, Joan Andreu Sánchez

Responsive image

Auto-TLDR; Automatic Handwritten Text Recognition and Information Extraction from Historical Weather Logs

Slides Poster Similar

Knowing the weather and atmospheric conditions from the past can help weather researchers to generate models like the ones used to predict how weather conditions are likely to change as global temperatures continue to rise. Many historical weather records are available from the past registered on a systemic basis. Historical weather logs were registered in ships, when they were on the high seas, recording daily weather conditions such as: wind speed, temperature, coordinates, etc. These historical documents represent an important source of knowledge with valuable information to extract climatic information of several centuries ago. Such information is usually collected by experts that devote a lot of time. This paper presents a new database, compiled from a ship log mainly composed by handwritten tables that contain mainly numerical information, to support research in automatic handwriting recognition and information extraction. In addition, a study is presented about the capability of state-of-the-art handwritten text recognition systems and information extraction techniques, when applied to the presented database. Baseline results are reported for reference in future studies.

Generalized Local Attention Pooling for Deep Metric Learning

Carlos Roig Mari, David Varas, Issey Masuda, Juan Carlos Riveiro, Elisenda Bou-Balust

Responsive image

Auto-TLDR; Generalized Local Attention Pooling for Deep Metric Learning

Slides Poster Similar

Deep metric learning has been key to recent advances in face verification and image retrieval amongst others. These systems consist on a feature extraction block (extracts feature maps from images) followed by a spatial dimensionality reduction block (generates compact image representations from the feature maps) and an embedding generation module (projects the image representation to the embedding space). While research on deep metric learning has focused on improving the losses for the embedding generation module, the dimensionality reduction block has been overlooked. In this work, we propose a novel method to generate compact image representations which uses local spatial information through an attention mechanism, named Generalized Local Attention Pooling (GLAP). This method, instead of being placed at the end layer of the backbone, is connected at an intermediate level, resulting in lower memory requirements. We assess the performance of the aforementioned method by comparing it with multiple dimensionality reduction techniques, demonstrating the importance of using attention weights to generate robust compact image representations. Moreover, we compare the performance of multiple state-of-the-art losses using the standard deep metric learning system against the same experiment with our GLAP. Experiments showcase that the proposed Generalized Local Attention Pooling mechanism outperforms other pooling methods when compared with current state-of-the-art losses for deep metric learning.

Supervised Domain Adaptation Using Graph Embedding

Lukas Hedegaard, Omar Ali Sheikh-Omar, Alexandros Iosifidis

Responsive image

Auto-TLDR; Domain Adaptation from the Perspective of Multi-view Graph Embedding and Dimensionality Reduction

Slides Poster Similar

Getting deep convolutional neural networks to perform well requires a large amount of training data. When the available labelled data is small, it is often beneficial to use transfer learning to leverage a related larger dataset (source) in order to improve the performance on the small dataset (target). Among the transfer learning approaches, domain adaptation methods assume that distributions between the two domains are shifted and attempt to realign them. In this paper, we consider the domain adaptation problem from the perspective of multi-view graph embedding and dimensionality reduction. Instead of solving the generalised eigenvalue problem to perform the embedding, we formulate the graph-preserving criterion as loss in the neural network and learn a domain-invariant feature transformation in an end-to-end fashion. We show that the proposed approach leads to a powerful Domain Adaptation framework which generalises the prior methods CCSA and d-SNE, and enables simple and effective loss designs; an LDA-inspired instantiation of the framework leads to performance on par with the state-of-the-art on the most widely used Domain Adaptation benchmarks, Office31 and MNIST to USPS datasets.

A Multi-Head Self-Relation Network for Scene Text Recognition

Zhou Junwei, Hongchao Gao, Jiao Dai, Dongqin Liu, Jizhong Han

Responsive image

Auto-TLDR; Multi-head Self-relation Network for Scene Text Recognition

Slides Poster Similar

The text embedded in scene images can be seen everywhere in our lives. However, recognizing text from natural scene images is still a challenge because of its diverse shapes and distorted patterns. Recently, advanced recognition networks generally treat scene text recognition as a sequence prediction task. Although achieving excellent performance, these recognition networks consider the feature map cells as independent individuals and update cells state without utilizing the information of their neighboring cells. And the local receptive field of traditional convolutional neural network (CNN) makes a single cell that cannot cover the whole text region in an image. Due to these issues, the existing recognition networks cannot extract the global context in a visual scene. To deal with the above problems, we propose a Multi-head Self-relation Network(MSRN) for scene text recognition in this paper. The MSRN consists of several multi-head self-relation layers, which is designed for extracting the global context of a visual scene, so that transforms a cell into a new cell that fuses the information of the related cells. Furthermore, experiments over several public datasets demonstrate that our proposed recognition network achieves superior performance on several benchmark datasets including IC03, IC13, IC15, SVT-Perspective.

Online Trajectory Recovery from Offline Handwritten Japanese Kanji Characters of Multiple Strokes

Hung Tuan Nguyen, Tsubasa Nakamura, Cuong Tuan Nguyen, Masaki Nakagawa

Responsive image

Auto-TLDR; Recovering Dynamic Online Trajectories from Offline Japanese Kanji Character Images for Handwritten Character Recognition

Slides Poster Similar

We propose a deep neural network-based method to recover dynamic online trajectories from offline handwritten Japanese kanji character images. It is a challenging task since Japanese kanji characters consist of multiple strokes. Our proposed model has three main components: Convolutional Neural Network-based encoder, Long Short-Term Memory Network-based decoder with an attention layer, and Gaussian Mixture Model (GMM). The encoder focuses on feature extraction while the decoder refers to the extracted features and generates time-sequences of GMM parameters. The attention layer is the key component for trajectory recovery. The GMM provides robustness to style variations so that the proposed model does not overfit to training samples. In the experiments, the proposed method is evaluated by both visual verification and handwritten character recognition. This is the first attempt to use online recovered trajectories to help improve the performance of offline handwriting recognition. Although the visual verification reveals some problems, the recognition experiments demonstrate the effect of trajectory recovery in improving the accuracy of offline handwritten character recognition when online recognition of the recovered trajectories are combined.

Recursive Recognition of Offline Handwritten Mathematical Expressions

Marco Cotogni, Claudio Cusano, Antonino Nocera

Responsive image

Auto-TLDR; Online Handwritten Mathematical Expression Recognition with Recurrent Neural Network

Slides Poster Similar

In this paper we propose a method for Offline Handwritten Mathematical Expression recognition. The method is a fast and accurate thanks to its architecture, which include both a Convolutional Neural Network and a Recurrent Neural Network. The CNN extracts features from the image to recognize and its output is provided to the RNN which produces the mathematical expression encoded in the LaTeX language. To process both sequential and non-sequential mathematical expressions we also included a deconvolutional module which, in a recursive way, segments the image for additional analysis trough a recursive process. The results obtained show a very high accuracy obtained on a large handwritten data set of 9100 samples of handwritten expressions.

PICK: Processing Key Information Extraction from Documents Using Improved Graph Learning-Convolutional Networks

Wenwen Yu, Ning Lu, Xianbiao Qi, Ping Gong, Rong Xiao

Responsive image

Auto-TLDR; PICK: A Graph Learning Framework for Key Information Extraction from Documents

Slides Poster Similar

Computer vision with state-of-the-art deep learning models have achieved huge success in the field of Optical Character Recognition (OCR) including text detection and recognition tasks recently. However, Key Information Extraction (KIE) from documents as the downstream task of OCR, having a large number of use scenarios in real-world, remains a challenge because documents not only have textual features extracting from OCR systems but also have semantic visual features that are not fully exploited and play a critical role in KIE. Too little work has been devoted to efficiently make full use of both textual and visual features of the documents. In this paper, we introduce PICK, a framework that is effective and robust in handling complex documents layout for KIE by combining graph learning with graph convolution operation, yielding a richer semantic representation containing the textual and visual features and global layout without ambiguity. Extensive experiments on real-world datasets have been conducted to show that our method outperforms baselines methods by significant margins.

Using Scene Graphs for Detecting Visual Relationships

Anurag Tripathi, Siddharth Srivastava, Brejesh Lall, Santanu Chaudhury

Responsive image

Auto-TLDR; Relationship Detection using Context Aligned Scene Graph Embeddings

Slides Poster Similar

In this paper we solve the problem of detecting relationships between pairs of objects in an image. We develop spatially aware word embeddings using scene graphs and use joint feature representations containing visual, spatial and semantic embeddings from the input images to train a deep network on the task of relationship detection. Further, we propose to utilize context aligned scene graph embeddings from the train set, without requiring explicit availability of scene graphs at test time. We show that the proposed method outperforms the state-of-the-art methods for predicate detection and provides competing results on relationship detection. We also show the generalization ability of the proposed method by performing predictions under zero shot settings. Further, we also provide an exhaustive empirical evaluation on each component of the proposed network.

A Transformer-Based Radical Analysis Network for Chinese Character Recognition

Chen Yang, Qing Wang, Jun Du, Jianshu Zhang, Changjie Wu, Jiaming Wang

Responsive image

Auto-TLDR; Transformer-based Radical Analysis Network for Chinese Character Recognition

Slides Poster Similar

Recently, a novel radical analysis network (RAN) has the capability of effectively recognizing unseen Chinese character classes and largely reducing the requirement of training data by treating a Chinese character as a hierarchical composition of radicals rather than a single character class.} However, when dealing with more challenging issues, such as the recognition of complicated characters, low-frequency character categories, and characters in natural scenes, RAN still has a lot of room for improvement. In this paper, we explore options to further improve the structure generalization and robustness capability of RAN with the Transformer architecture, which has achieved start-of-the-art results for many sequence-to-sequence tasks. More specifically, we propose to replace the original attention module in RAN with the transformer decoder, which is named as a transformer-based radical analysis network (RTN). The experimental results show that the proposed approach can significantly outperform the RAN on both printed Chinese character database and natural scene Chinese character database. Meanwhile, further analysis proves that RTN can be better generalized to complex samples and low-frequency characters, and has better robustness in recognizing Chinese characters with different attributes.

Trainable Spectrally Initializable Matrix Transformations in Convolutional Neural Networks

Michele Alberti, Angela Botros, Schuetz Narayan, Rolf Ingold, Marcus Liwicki, Mathias Seuret

Responsive image

Auto-TLDR; Trainable and Spectrally Initializable Matrix Transformations for Neural Networks

Slides Poster Similar

In this work, we introduce a new architectural component to Neural Networks (NN), i.e., trainable and spectrally initializable matrix transformations on feature maps. While previous literature has already demonstrated the possibility of adding static spectral transformations as feature processors, our focus is on more general trainable transforms. We study the transforms in various architectural configurations on four datasets of different nature: from medical (ColorectalHist, HAM10000) and natural (Flowers) images to historical documents (CB55). With rigorous experiments that control for the number of parameters and randomness, we show that networks utilizing the introduced matrix transformations outperform vanilla neural networks. The observed accuracy increases appreciably across all datasets. In addition, we show that the benefit of spectral initialization leads to significantly faster convergence, as opposed to randomly initialized matrix transformations. The transformations are implemented as auto-differentiable PyTorch modules that can be incorporated into any neural network architecture. The entire code base is open-source.

Parallel Network to Learn Novelty from the Known

Shuaiyuan Du, Chaoyi Hong, Zhiyu Pan, Chen Feng, Zhiguo Cao

Responsive image

Auto-TLDR; Trainable Parallel Network for Pseudo-Novel Detection

Slides Poster Similar

Towards multi-class novelty detection, we propose an end-to-end trainable Parallel Network (PN) using no additional data but only the training set itself. Our key idea is to first divide the training set into successive subtasks of pseudo-novelty detection to simulate real scenarios. We then design a multi-branch PN to well address the fine-grained division, which yields a compressed and more discriminative classification space and forms a natural ensemble. In practice, we divide the training data into subsets consisting of known and pseudo-novel classes. Each subset forms a sub-task fed to one branch in PN. During training, both known and pseudo-novel classes are uniformly distributed over the branches for better data balance and model diversity. By distinguishing between the known and the diverse pseudo-novel, PN extracts the concept of novelty in a compressed classification space. This provides PN with generalization ability to real novel classes which are absent during training. During online inference, this ability is further strengthened with the ensemble of PN's multiple branches. Experiments on three public datasets show our method's superiority to the mainstream methods.

Text Recognition - Real World Data and Where to Find Them

Klára Janoušková, Lluis Gomez, Dimosthenis Karatzas, Jiri Matas

Responsive image

Auto-TLDR; Exploiting Weakly Annotated Images for Text Extraction

Slides Poster Similar

We present a method for exploiting weakly annotated images to improve text extraction pipelines. The approach uses an arbitrary end-to-end text recognition system to obtain text region proposals and their, possibly erroneous, transcriptions. The proposed method includes matching of imprecise transcription to weak annotations and edit distance guided neighbourhood search. It produces nearly error-free, localised instances of scene text, which we treat as "pseudo ground truth" (PGT). We apply the method to two weakly-annotated datasets. Training with the extracted PGT consistently improves the accuracy of a state of the art recognition model, by 3.7 % on average, across different benchmark datasets (image domains) and 24.5 % on one of the weakly annotated datasets.

Learning to Sort Handwritten Text Lines in Reading Order through Estimated Binary Order Relations

Lorenzo Quirós, Enrique Vidal

Responsive image

Auto-TLDR; Automatic Reading Order of Text Lines in Handwritten Text Documents

Slides Similar

Recent advances in Handwritten Text Recognition and Document Layout Analysis make it possible to extract information from digitized documents and make them accessible beyond the archive shelves. But the reading order of the elements in those documents still is an open problem that has to be solved in order to provide that information with the correct structure. Most of the studies on the reading order task are rule-base approaches that focus on printed documents, while less attention has been paid to handwritten text documents. In this work we propose a new approach to automatically determine the reading order of text lines in handwritten text documents. The task is approached as a sorting problem where the order-relation operator is learned directly from examples. We demonstrate the effectiveness of our method on three different datasets.

An Evaluation of DNN Architectures for Page Segmentation of Historical Newspapers

Manuel Burghardt, Bernhard Liebl

Responsive image

Auto-TLDR; Evaluation of Backbone Architectures for Optical Character Segmentation of Historical Documents

Slides Poster Similar

One important and particularly challenging step in the optical character recognition of historical documents with complex layouts, such as newspapers, is the separation of text from non-text content (e.g. page borders or illustrations). This step is commonly referred to as page segmentation. While various rule-based algorithms have been proposed, the applicability of Deep Neural Networks for this task recently has gained a lot of attention. In this paper, we perform a systematic evaluation of 11 different published backbone architectures and 9 different tiling and scaling configurations for separating text, tables or table column lines. We also show the influence of the number of labels and the number of training pages on the segmentation quality, which we measure using the Matthews Correlation Coefficient. Our results show that (depending on the task) Inception-ResNet-v2 and EfficientNet backbones work best, vertical tiling is generally preferable to other tiling approaches, and training data that comprises 30 to 40 pages will be sufficient most of the time.

Local Gradient Difference Based Mass Features for Classification of 2D-3D Natural Scene Text Images

Lokesh Nandanwar, Shivakumara Palaiahnakote, Raghavendra Ramachandra, Tong Lu, Umapada Pal, Daniel Lopresti, Nor Badrul Anuar

Responsive image

Auto-TLDR; Classification of 2D and 3D Natural Scene Images Using COLD

Slides Poster Similar

Methods developed for normal 2D text detection do not work well for a text that is rendered using decorative, 3D effects. This paper proposes a new method for classification of 2D and 3D natural scene images such that an appropriate method can be chosen or modified according to the complexity of the individual classes. The proposed method explores local gradient differences for obtaining candidate pixels, which represent a stroke. To study the spatial distribution of candidate pixels, we propose a measure we call COLD, which is denser for pixels toward the center of strokes and scattered for non-stroke pixels. This observation leads us to introduce mass features for extracting the regular spatial pattern of COLD, which indicates a 2D text image. The extracted features are fed to a Neural Network (NN) for classification. The proposed method is tested on both a new dataset introduced in this work and a standard dataset assembled from different natural scene datasets, and compared to from existing methods to show its effectiveness. The approach improves text detection performance significantly after classification.

Context for Object Detection Via Lightweight Global and Mid-Level Representations

Mesut Erhan Unal, Adriana Kovashka

Responsive image

Auto-TLDR; Context-Based Object Detection with Semantic Similarity

Slides Poster Similar

We propose an approach for explicitly capturing context in object detection. We model visual and geometric relationships between object regions, but also model the global scene as a first-class participant. In contrast to prior approaches, both the context we rely on, as well as our proposed mechanism for belief propagation over regions, is lightweight. We also experiment with capturing similarities between regions at a semantic level, by modeling class co-occurrence and linguistic similarity between class names. We show that our approach significantly outperforms Faster R-CNN, and performs competitively with a much more costly approach that also models context.

Open Set Domain Recognition Via Attention-Based GCN and Semantic Matching Optimization

Xinxing He, Yuan Yuan, Zhiyu Jiang

Responsive image

Auto-TLDR; Attention-based GCN and Semantic Matching Optimization for Open Set Domain Recognition

Slides Poster Similar

Open set domain recognition has got the attention in recent years. The task aims to specifically classify each sample in the practical unlabeled target domain, which consists of all known classes in the manually labeled source domain and target-specific unknown categories. The absence of annotated training data or auxiliary attribute information for unknown categories makes this task especially difficult. Moreover, exiting domain discrepancy in label space and data distribution further distracts the knowledge transferred from known classes to unknown classes. To address these issues, this work presents an end-to-end model based on attention-based GCN and semantic matching optimization, which first employs the attention mechanism to enable the central node to learn more discriminating representations from its neighbors in the knowledge graph. Moreover, a coarse-to-fine semantic matching optimization approach is proposed to progressively bridge the domain gap. Experimental results validate that the proposed model not only has superiority on recognizing the images of known and unknown classes, but also can adapt to various openness of the target domain.

A Fast and Accurate Object Detector for Handwritten Digit String Recognition

Jun Guo, Wenjing Wei, Yifeng Ma, Cong Peng

Responsive image

Auto-TLDR; ChipNet: An anchor-free object detector for handwritten digit string recognition

Slides Poster Similar

Focusing on handwritten digit string recognition (HDSR), we propose an anchor-free object detector called ChipNet, where a novel encoding method is designed. The input image is divided into columns, and then these columns are encoded by the ground truth. The adjacent columns are responsible for detecting the same target so that it can well address the class-imbalanced problem meanwhile reducing the network computation. ChipNet is composed of convolutional and bidirectional long short term memory networks. Different from the typical detectors, it doesn't use region proposals, anchors or regions of interest pooling. Hence, it can overcome the shortages of anchor-based and dense detectors in HDSR. The experiments are implemented on the synthetic digit strings, the CVL HDS database, and the ORAND-CAR-A & B databases. The high accuracies, which surpass the reported results by a large margin (up to 6.62%), are achieved. Furthermore, it gets 219 FPS speed on 160*32 px resolution images when using a Tesla P100 GPU. The results also show that ChipNet can handle touching, connecting and arbitrary length digit strings, and the obtained accuracies in HDSR are as high as the ones in single handwritten digit recognition.

More Correlations Better Performance: Fully Associative Networks for Multi-Label Image Classification

Yaning Li, Liu Yang

Responsive image

Auto-TLDR; Fully Associative Network for Fully Exploiting Correlation Information in Multi-Label Classification

Slides Poster Similar

Recent researches demonstrate that correlation modeling plays a key role in high-performance multi-label classification methods. However, existing methods do not take full advantage of correlation information, especially correlations in feature and label spaces of each image, which limits the performance of correlation-based multi-label classification methods. With more correlations considered, in this study, a Fully Associative Network (FAN) is proposed for fully exploiting correlation information, which involves both visual feature and label correlations. Specifically, FAN introduces a robust covariance pooling to summarize convolution features as global image representation for capturing feature correlation in the multi-label task. Moreover, it constructs an effective label correlation matrix based on a re-weighted scheme, which is fed into a graph convolution network for capturing label correlation. Then, correlation between covariance representations (i.e., feature correlation ) and the outputs of GCN (i.e., label correlation) are modeled for final prediction. Experimental results on two datasets illustrate the effectiveness and efficiency of our proposed FAN compared with state-of-the-art methods.

Concept Embedding through Canonical Forms: A Case Study on Zero-Shot ASL Recognition

Azamat Kamzin, Apurupa Amperyani, Prasanth Sukhapalli, Ayan Banerjee, Sandeep Gupta

Responsive image

Auto-TLDR; A canonical form of gestures in American Sign Language

Slides Poster Similar

In the recognition problem, a canonical form that expresses the spatio-temporal relation of concepts for a given class can potentially increase accuracy. Concepts are defined as attributes that can be recognized using a soft matching paradigm. We consider the specific case study of American Sign Language (ASL) to show that canonical forms of classes can be used to recognize unseen gestures. There are several advantages of a canonical form of gestures including translation between gestures, gesture-based searching, and automated transcription of gestures into any spoken language. We applied our technique to two independently collected datasets: a) IMPACT Lab dataset: 23 ASL gestures each executed three times from 130 first time ASL learners as training data and b) ASLTEXT dataset: 190 gestures each executed six times on an average. Our technique was able to recognize 19 arbitrarily chosen previously unseen gestures in the IMPACT dataset from seven individuals who are not a part of 130 and 34 unseen gestures from the ASLTEXT dataset without any retraining. Our normalized accuracy on ASLTEXT dataset is 66 % which is 13.6 % higher than state-of-art technique.

Directed Variational Cross-encoder Network for Few-Shot Multi-image Co-segmentation

Sayan Banerjee, Divakar Bhat S, Subhasis Chaudhuri, Rajbabu Velmurugan

Responsive image

Auto-TLDR; Directed Variational Inference Cross Encoder for Class Agnostic Co-Segmentation of Multiple Images

Slides Poster Similar

In this paper, we propose a novel framework for class agnostic co-segmentation of multiple images using comparatively smaller datasets. We have developed a novel encoder-decoder network termed as DVICE (Directed Variational Inference Cross Encoder), which learns a continuous embedding space to ensure better similarity learning. We employ a combination of the proposed variational encoder-decoder and a novel few-shot learning approach to tackle the small sample size problem in co-segmentation. Furthermore, the proposed framework does not use any semantic class labels and is entirely class agnostic. Through exhaustive experimentation using a small volume of data over multiple datasets, we have demonstrated that our approach outperforms all existing state-of-the-art techniques.

Generative Latent Implicit Conditional Optimization When Learning from Small Sample

Idan Azuri, Daphna Weinshall

Responsive image

Auto-TLDR; GLICO: Generative Latent Implicit Conditional Optimization for Small Sample Learning

Slides Poster Similar

We revisit the long-standing problem of learning from small sample. The generation of new samples from a small training set of labeled points has attracted increased attention in recent years. In this paper, we propose a novel such method called GLICO (Generative Latent Implicit Conditional Optimization). GLICO learns a mapping from the training examples to a latent space and a generator that generates images from vectors in the latent space. Unlike most recent work, which rely on access to large amounts of unlabeled data, GLICO does not require access to any additional data other than the small set of labeled points. In fact, GLICO learns to synthesize completely new samples for every class using as little as 5 or 10 examples per class, with as few as 10 such classes and no data from unknown classes. GLICO is then used to augment the small training set while training a classifier on the small sample. To this end, our proposed method samples the learned latent space using spherical interpolation (slerp) and generates new examples using the trained generator. Empirical results show that the new sampled set is diverse enough, leading to improvement in image classification in comparison with the state of the art when trained on small samples obtained from CIFAR-10, CIFAR-100, and CUB-200.

Contextual Classification Using Self-Supervised Auxiliary Models for Deep Neural Networks

Sebastian Palacio, Philipp Engler, Jörn Hees, Andreas Dengel

Responsive image

Auto-TLDR; Self-Supervised Autogenous Learning for Deep Neural Networks

Slides Poster Similar

Classification problems solved with deep neural networks (DNNs) typically rely on a closed world paradigm, and optimize over a single objective (e.g., minimization of the cross- entropy loss). This setup dismisses all kinds of supporting signals that can be used to reinforce the existence or absence of particular patterns. The increasing need for models that are interpretable by design makes the inclusion of said contextual signals a crucial necessity. To this end, we introduce the notion of Self-Supervised Autogenous Learning (SSAL). A SSAL objective is realized through one or more additional targets that are derived from the original supervised classification task, following architectural principles found in multi-task learning. SSAL branches impose low-level priors into the optimization process (e.g., grouping). The ability of using SSAL branches during inference, allow models to converge faster, focusing on a richer set of class-relevant features. We equip state-of-the-art DNNs with SSAL objectives and report consistent improvements for all of them on CIFAR100 and Imagenet. We show that SSAL models outperform similar state-of-the-art methods focused on contextual loss functions, auxiliary branches and hierarchical priors.

Vision-Based Layout Detection from Scientific Literature Using Recurrent Convolutional Neural Networks

Huichen Yang, William Hsu

Responsive image

Auto-TLDR; Transfer Learning for Scientific Literature Layout Detection Using Convolutional Neural Networks

Slides Poster Similar

We present an approach for adapting convolutional neural networks for object recognition and classification to scientific literature layout detection (SLLD), a shared subtask of several information extraction problems. Scientific publications contain multiple types of information sought by researchers in various disciplines, organized into an abstract, bibliography, and sections documenting related work, experimental methods, and results; however, there is no effective way to extract this information due to their diverse layout. In this paper, we present a novel approach to developing an end-to-end learning framework to segment and classify major regions of a scientific document. We consider scientific document layout analysis as an object detection task over digital images, without any additional text features that need to be added into the network during the training process. Our technical objective is to implement transfer learning via fine-tuning of pre-trained networks and thereby demonstrate that this deep learning architecture is suitable for tasks that lack very large document corpora for training. As part of the experimental test bed for empirical evaluation of this approach, we created a merged multi-corpus data set for scientific publication layout detection tasks. Our results show good improvement with fine-tuning of a pre-trained base network using this merged data set, compared to the baseline convolutional neural network architecture.

Rethinking of Deep Models Parameters with Respect to Data Distribution

Shitala Prasad, Dongyun Lin, Yiqun Li, Sheng Dong, Zaw Min Oo

Responsive image

Auto-TLDR; A progressive stepwise training strategy for deep neural networks

Slides Poster Similar

The performance of deep learning models are driven by various parameters but to tune all of them every time, for every dataset, is a heuristic practice. In this paper, unlike the common practice of decaying the learning rate, we propose a step-wise training strategy where the learning rate and the batch size are tuned based on the dataset size. Here, the given dataset size is progressively increased during the training to boost the network performance without saturating the learning curve, after certain epochs. We conducted extensive experiments on multiple networks and datasets to validate the proposed training strategy. The experimental results proves our hypothesis that the learning rate, the batch size and the data size are interrelated and can improve the network accuracy if an optimal progressive stepwise training strategy is applied. The proposed strategy also the overall training computational cost is reduced.

A Close Look at Deep Learning with Small Data

Lorenzo Brigato, Luca Iocchi

Responsive image

Auto-TLDR; Low-Complex Neural Networks for Small Data Conditions

Slides Poster Similar

In this work, we perform a wide variety of experiments with different Deep Learning architectures in small data conditions. We show that model complexity is a critical factor when only a few samples per class are available. Differently from the literature, we improve the state of the art using low complexity models. We show that standard convolutional neural networks with relatively few parameters are effective in this scenario. In many of our experiments, low complexity models outperform state-of-the-art architectures. Moreover, we propose a novel network that uses an unsupervised loss to regularize its training. Such architecture either improves the results either performs comparably well to low capacity networks. Surprisingly, experiments show that the dynamic data augmentation pipeline is not beneficial in this particular domain. Statically augmenting the dataset might be a promising research direction while dropout maintains its role as a good regularizer.

Human or Machine? It Is Not What You Write, but How You Write It

Luis Leiva, Moises Diaz, M.A. Ferrer, Réjean Plamondon

Responsive image

Auto-TLDR; Behavioral Biometrics via Handwritten Symbols for Identification and Verification

Slides Poster Similar

Online fraud often involves identity theft. Since most security measures are weak or can be spoofed, we investigate a more nuanced and less explored avenue: behavioral biometrics via handwriting movements. This kind of data can be used to verify if a legitimate user is operating a device or a computer application, so it is important to distinguish between human and machine-generated movements reliably. For this purpose, we study handwritten symbols (isolated characters, digits, gestures, and signatures) produced by humans and machines, and compare and contrast several deep learning models. We find that if symbols are presented as static images, they can fool state-of-the-art classifiers (near 75% accuracy in the best case) but can be distinguished with remarkable accuracy if they are presented as temporal sequences (95% accuracy in the average case). We conclude that an accurate detection of fake movements has more to do with how users write, rather than what they write. Our work has implications for computerized systems that need to authenticate or verify legitimate human users, and provides an additional layer of security to keep attackers at bay.

ReADS: A Rectified Attentional Double Supervised Network for Scene Text Recognition

Qi Song, Qianyi Jiang, Xiaolin Wei, Nan Li, Rui Zhang

Responsive image

Auto-TLDR; ReADS: Rectified Attentional Double Supervised Network for General Scene Text Recognition

Slides Poster Similar

In recent years, scene text recognition is always regarded as a sequence-to-sequence problem. Connectionist Temporal Classification (CTC) and Attentional sequence recognition (Attn) are two very prevailing approaches to tackle this problem while they may fail in some scenarios respectively. CTC concentrates more on every individual character but is weak in text semantic dependency modeling. Attn based methods have better context semantic modeling ability while tends to overfit on limited training data. In this paper, we elaborately design a Rectified Attentional Double Supervised Network (ReADS) for general scene text recognition. To overcome the weakness of CTC and Attn, both of them are applied in our method but with different modules in two supervised branches which can make a complementary to each other. Moreover, effective spatial and channel attention mechanisms are introduced to eliminate background noise and extract valid foreground information. Finally, a simple rectified network is implemented to rectify irregular text. The ReADS can be trained end-to-end and only word-level annotations are required. Extensive experiments on various benchmarks verify the effectiveness of ReADS which achieves state-of-the-art performance.

Text Baseline Recognition Using a Recurrent Convolutional Neural Network

Matthias Wödlinger, Robert Sablatnig

Responsive image

Auto-TLDR; Automatic Baseline Detection of Handwritten Text Using Recurrent Convolutional Neural Network

Slides Poster Similar

The detection of baselines of text is a necessary pre-processing step for many modern methods of automatic handwriting recognition. In this work a two-stage system for the automatic detection of text baselines of handwritten text is presented. In a first step pixel-wise segmentation on the document image is performed to classify pixels as baselines, start points and end points. This segmentation is then used to extract the start points of lines. Starting from these points the baseline is extracted using a recurrent convolutional neural network that directly outputs the baseline coordinates. This method allows the direct extraction of baseline coordinates as the output of a neural network without the use of any post processing steps. The model is evaluated on the cBAD dataset from the ICDAR 2019 competition on baseline detection.

MetaMix: Improved Meta-Learning with Interpolation-based Consistency Regularization

Yangbin Chen, Yun Ma, Tom Ko, Jianping Wang, Qing Li

Responsive image

Auto-TLDR; MetaMix: A Meta-Agnostic Meta-Learning Algorithm for Few-Shot Classification

Slides Poster Similar

Model-Agnostic Meta-Learning (MAML) and its variants are popular few-shot classification methods. They train an initializer across a variety of sampled learning tasks (also known as episodes) such that the initialized model can adapt quickly to new tasks. However, within each episode, current MAML-based algorithms have limitations in forming generalizable decision boundaries using only a few training examples. In this paper, we propose an approach called MetaMix. It generates virtual examples within each episode to regularize the backbone models. MetaMix can be applied in any of the MAML-based algorithms and learn the decision boundaries which are more generalizable to new tasks. Experiments on the mini-ImageNet, CUB, and FC100 datasets show that MetaMix improves the performance of MAML-based algorithms and achieves the state-of-the-art result when applied in Meta-Transfer Learning.

Stratified Multi-Task Learning for Robust Spotting of Scene Texts

Kinjal Dasgupta, Sudip Das, Ujjwal Bhattacharya

Responsive image

Auto-TLDR; Feature Representation Block for Multi-task Learning of Scene Text

Slides Similar

Gaining control over the dynamics of multi-task learning should help to unlock the potential of the deep network to a great extent. In the existing multi-task learning (MTL) approaches of deep network, all the parameters of its feature encoding part are subjected to adjustments corresponding to each of the underlying sub-tasks. On the other hand, different functional areas of human brain are responsible for distinct functions such as the Broca's area of the cerebrum is responsible for speech formation whereas its Wernicke's area is related to the language development etc. Inspired by this fact, in the present study, we propose to introduce a block (termed as Feature Representation Block) of connection weights spanned over a few successive layers of a deep multi-task learning architecture and stratify the same into distinct subsets for their adjustments exclusively corresponding to different sub-tasks. Additionally, we have introduced a novel regularization component for controlled training of this Feature Representation Block. The purpose of the development of this learning framework is efficient end-to-end recognition of scene texts. Simulation results of the proposed strategy on various benchmark scene text datasets such as ICDAR 2015, ICDAR 2017 MLT, COCO-Text and MSRA-TD500 have improved respective SOTA performance.

Incrementally Zero-Shot Detection by an Extreme Value Analyzer

Sixiao Zheng, Yanwei Fu, Yanxi Hou

Responsive image

Auto-TLDR; IZSD-EVer: Incremental Zero-Shot Detection for Incremental Learning

Slides Similar

Human beings not only have the ability of recogniz-ing novel unseen classes, but also can incrementally incorporatethe new classes to existing knowledge preserved. However, thezero-shot learning models assume that all seen classes should beknown beforehand, while incremental learning models cannotrecognize unseen classes. This paper introduces a novel andchallenging task of Incrementally Zero-Shot Detection (IZSD),a practical strategy for both zero-shot learning and class-incremental learning in real-world object detection. An innovativeend-to-end model – IZSD-EVer was proposed to tackle this taskthat requires incrementally detecting new classes and detectingthe classes that have never been seen. Specifically, we proposea novel extreme value analyzer to simultaneously detect objectsfrom old seen, new seen, and unseen classes. Additionally andtechnically, we propose two innovative losses, i.e., background-foreground mean squared error loss alleviating the extremeimbalance of the background and foreground of images, andprojection distance loss aligning the visual space and semanticspaces of old seen classes. Experiments demonstrate the efficacyof our model in detecting objects from both the seen and unseenclasses, outperforming the alternative models on Pascal VOC andMSCOCO datasets.