Stroke Based Posterior Attention for Online Handwritten Mathematical Expression Recognition

Changjie Wu, Qing Wang, Jianshu Zhang, Jun Du, Jiaming Wang, Jiajia Wu, Jin-Shui Hu

Responsive image

Auto-TLDR; Posterior Attention for Online Handwritten Mathematical Expression Recognition

Slides Poster

Recently, many researches propose to employ attention based encoder-decoder models to convert a sequence of trajectory points into a LaTeX string for online handwritten mathematical expression recognition (OHMER), and the recognition performance of these models critically relies on the accuracy of the attention. In this paper, unlike previous methods which basically employ a soft attention model, we propose to employ a posterior attention model, which modifies the attention probabilities after observing the output probabilities generated by the soft attention model. In order to further improve the posterior attention mechanism, we propose a stroke average pooling layer to aggregate point-level features obtained from the encoder into stroke-level features. We argue that posterior attention is better to be implemented on stroke-level features than point-level features as the output probabilities generated by stroke is more convincing than generated by point, and we prove that through experimental analysis. Validated on the CROHME competition task, we demonstrate that stroke based posterior attention achieves expression recognition rates of 54.26% on CROHME 2014 and 51.75% on CROHME 2016. According to attention visualization analysis, we empirically demonstrate that the posterior attention mechanism can achieve better alignment accuracy than the soft attention mechanism.

Similar papers

ConvMath : A Convolutional Sequence Network for Mathematical Expression Recognition

Zuoyu Yan, Xiaode Zhang, Liangcai Gao, Ke Yuan, Zhi Tang

Responsive image

Auto-TLDR; Convolutional Sequence Modeling for Mathematical Expressions Recognition

Slides Poster Similar

Despite the recent advances in optical character recognition (OCR), mathematical expressions still face a great challenge to recognize due to their two-dimensional graphical layout. In this paper, we propose a convolutional sequence modeling network, ConvMath, which converts the mathematical expression description in an image into a LaTeX sequence in an end-to-end way. The network combines an image encoder for feature extraction and a convolutional decoder for sequence generation. Compared with other Long Short Term Memory(LSTM) based encoder-decoder models, ConvMath is entirely based on convolution, thus it is easy to perform parallel computation. Besides, the network adopts multi-layer attention mechanism in the decoder, which allows the model to align output symbols with source feature vectors automatically, and alleviates the problem of lacking coverage while training the model. The performance of ConvMath is evaluated on an open dataset named IM2LATEX-100K, including 103556 samples. The experimental results demonstrate that the proposed network achieves state-of-the-art accuracy and much better efficiency than previous methods.

Global Context-Based Network with Transformer for Image2latex

Nuo Pang, Chun Yang, Xiaobin Zhu, Jixuan Li, Xu-Cheng Yin

Responsive image

Auto-TLDR; Image2latex with Global Context block and Transformer

Slides Poster Similar

Image2latex usually means converts mathematical formulas in images into latex markup. It is a very challenging job due to the complex two-dimensional structure, variant scales of input, and very long representation sequence. Many researchers use encoder-decoder based model to solve this task and achieved good results. However, these methods don't make full use of the structure and position information of the formula. %In this paper, we improve the encoder by employing Global Context block and Transformer. To solve this problem, we propose a global context-based network with transformer that can (1) learn a more powerful and robust intermediate representation via aggregating global features and (2) encode position information explicitly and (3) learn latent dependencies between symbols by using self-attention mechanism. The experimental results on the dataset IM2LATEX-100K demonstrate the effectiveness of our method.

Recursive Recognition of Offline Handwritten Mathematical Expressions

Marco Cotogni, Claudio Cusano, Antonino Nocera

Responsive image

Auto-TLDR; Online Handwritten Mathematical Expression Recognition with Recurrent Neural Network

Slides Poster Similar

In this paper we propose a method for Offline Handwritten Mathematical Expression recognition. The method is a fast and accurate thanks to its architecture, which include both a Convolutional Neural Network and a Recurrent Neural Network. The CNN extracts features from the image to recognize and its output is provided to the RNN which produces the mathematical expression encoded in the LaTeX language. To process both sequential and non-sequential mathematical expressions we also included a deconvolutional module which, in a recursive way, segments the image for additional analysis trough a recursive process. The results obtained show a very high accuracy obtained on a large handwritten data set of 9100 samples of handwritten expressions.

A Transformer-Based Radical Analysis Network for Chinese Character Recognition

Chen Yang, Qing Wang, Jun Du, Jianshu Zhang, Changjie Wu, Jiaming Wang

Responsive image

Auto-TLDR; Transformer-based Radical Analysis Network for Chinese Character Recognition

Slides Poster Similar

Recently, a novel radical analysis network (RAN) has the capability of effectively recognizing unseen Chinese character classes and largely reducing the requirement of training data by treating a Chinese character as a hierarchical composition of radicals rather than a single character class.} However, when dealing with more challenging issues, such as the recognition of complicated characters, low-frequency character categories, and characters in natural scenes, RAN still has a lot of room for improvement. In this paper, we explore options to further improve the structure generalization and robustness capability of RAN with the Transformer architecture, which has achieved start-of-the-art results for many sequence-to-sequence tasks. More specifically, we propose to replace the original attention module in RAN with the transformer decoder, which is named as a transformer-based radical analysis network (RTN). The experimental results show that the proposed approach can significantly outperform the RAN on both printed Chinese character database and natural scene Chinese character database. Meanwhile, further analysis proves that RTN can be better generalized to complex samples and low-frequency characters, and has better robustness in recognizing Chinese characters with different attributes.

Online Trajectory Recovery from Offline Handwritten Japanese Kanji Characters of Multiple Strokes

Hung Tuan Nguyen, Tsubasa Nakamura, Cuong Tuan Nguyen, Masaki Nakagawa

Responsive image

Auto-TLDR; Recovering Dynamic Online Trajectories from Offline Japanese Kanji Character Images for Handwritten Character Recognition

Slides Poster Similar

We propose a deep neural network-based method to recover dynamic online trajectories from offline handwritten Japanese kanji character images. It is a challenging task since Japanese kanji characters consist of multiple strokes. Our proposed model has three main components: Convolutional Neural Network-based encoder, Long Short-Term Memory Network-based decoder with an attention layer, and Gaussian Mixture Model (GMM). The encoder focuses on feature extraction while the decoder refers to the extracted features and generates time-sequences of GMM parameters. The attention layer is the key component for trajectory recovery. The GMM provides robustness to style variations so that the proposed model does not overfit to training samples. In the experiments, the proposed method is evaluated by both visual verification and handwritten character recognition. This is the first attempt to use online recovered trajectories to help improve the performance of offline handwriting recognition. Although the visual verification reveals some problems, the recognition experiments demonstrate the effect of trajectory recovery in improving the accuracy of offline handwritten character recognition when online recognition of the recovered trajectories are combined.

Gaussian Constrained Attention Network for Scene Text Recognition

Zhi Qiao, Xugong Qin, Yu Zhou, Fei Yang, Weiping Wang

Responsive image

Auto-TLDR; Gaussian Constrained Attention Network for Scene Text Recognition

Slides Poster Similar

Scene text recognition has been a hot topic in computer vision. Recent methods adopt the attention mechanism for sequence prediction which achieve convincing results. However, we argue that the existing attention mechanism faces the problem of attention diffusion, in which the model may not focus on a certain character area. In this paper, we propose Gaussian Constrained Attention Network to deal with this problem. It is a 2D attention-based method integrated with a novel Gaussian Constrained Refinement Module, which predicts an additional Gaussian mask to refine the attention weights. Different from adopting an additional supervision on the attention weights simply, our proposed method introduce an explicit refinement. In this way, the attention weights will be more concentrated and the attention-based recognition network achieves better performance. The proposed Gaussian Constrained Refinement Module is flexible and can be applied to existing attention-based methods directly. The experiments on several benchmark datasets demonstrate the effectiveness of our proposed method. Our code has been available at https://github.com/Pay20Y/GCAN.

LODENet: A Holistic Approach to Offline Handwritten Chinese and Japanese Text Line Recognition

Huu Tin Hoang, Chun-Jen Peng, Hung Tran, Hung Le, Huy Hoang Nguyen

Responsive image

Auto-TLDR; Logographic DEComposition Encoding for Chinese and Japanese Text Line Recognition

Slides Poster Similar

One of the biggest obstacles in Chinese and Japanese text line recognition is how to present their enormous character sets. The most common solution is to merely choose and represent a small subset of characters using one-hot encoding. However, such an approach is costly to describe huge character sets, and ignores their semantic relationships. Recent studies have attempted to utilize different encoding methods, but they struggle to build a bijection mapping. In this work, we propose a novel encoding method, called LOgographic DEComposition encoding (LODEC), that can efficiently perform a 1-to-1 mapping for all Chinese and Japanese characters with a strong awareness of semantic relationships. As such, LODEC enables to encode over 21,000 Chinese and Japanese characters by only 520 fundamental elements. Moreover, to handle the vast variety of handwritten texts in the two languages, we propose a novel deep learning (DL) architecture, called LODENet, together with an end-to-end training scheme, that leverages auxiliary data generated by LODEC or other radical-based encoding methods. We performed systematic experiments on both Chinese and Japanese datasets, and found that our approach surpassed the performance of state-of-the-art baselines. Furthermore, empirical evidence shows that our method can gain significantly better results using synthesized text line images without the need for domain knowledge.

Multi-Task Learning Based Traditional Mongolian Words Recognition

Hongxi Wei, Hui Zhang, Jing Zhang, Kexin Liu

Responsive image

Auto-TLDR; Multi-task Learning for Mongolian Words Recognition

Slides Poster Similar

In this paper, a multi-task learning framework has been proposed for solving and improving traditional Mongolian words recognition. To be specific, a sequence-to-sequence model with attention mechanism was utilized to accomplish the task of recognition. Therein, the attention mechanism is designed to fulfill the task of glyph segmentation during the process of recognition. Although the glyph segmentation is an implicit operation, the information of glyph segmentation can be integrated into the process of recognition. After that, the two tasks can be accomplished simultaneously under the framework of multi-task learning. By this way, adjacent image frames can be decoded into a glyph more precisely, which results in improving not only the performance of words recognition but also the accuracy of character segmentation. Experimental results demonstrate that the proposed multi-task learning based scheme outperforms the conventional glyph segmentation-based method and various segmentation-free (i.e. holistic recognition) methods.

IBN-STR: A Robust Text Recognizer for Irregular Text in Natural Scenes

Xiaoqian Li, Jie Liu, Shuwu Zhang

Responsive image

Auto-TLDR; IBN-STR: A Robust Text Recognition System Based on Data and Feature Representation

Poster Similar

Although text recognition methods based on deep neural networks have promising performance, there are still challenges due to the variety of text styles, perspective distortion, text with large curvature, and so on. To obtain a robust text recognizer, we have improved the performance from two aspects: data aspect and feature representation aspect. In terms of data, we transform the input images into S-shape distorted images in order to increase the diversity of training data. Besides, we explore the effects of different training data. In terms of feature representation, the combination of instance normalization and batch normalization improves the model's capacity and generalization ability. This paper proposes a robust text recognizer IBN-STR, which is an attention-based model. Through extensive experiments, the model analysis and comparison have been carried out from the aspects of data and feature representation, and the effectiveness of IBN-STR on both regular and irregular text instances has been verified. Furthermore, IBN-STR is an end-to-end recognition system that can achieve state-of-the-art performance.

Generation of Hypergraphs from the N-Best Parsing of 2D-Probabilistic Context-Free Grammars for Mathematical Expression Recognition

Noya Ernesto, Joan Andreu Sánchez, Jose Miguel Benedi

Responsive image

Auto-TLDR; Hypergraphs: A Compact Representation of the N-best parse trees from 2D-PCFGs

Slides Poster Similar

We consider hypergraphs as a tool to compactly represent the result of the n-best parse trees, obtained by Bi-Dimensional Probabilistic Context-Free Grammars, for an input image that represents a mathematical expression. More specifically, in this paper we propose: an algorithm to compute the N-best parse trees from a 2D-PCFGs; an algorithm to represent the n-best parse trees using a compact representation in the form of hypergraphs; and a formal framework for the development of inference algorithms (inside and outside) and normalization strategies of hypergraphs.

Enhancing Handwritten Text Recognition with N-Gram Sequencedecomposition and Multitask Learning

Vasiliki Tassopoulou, George Retsinas, Petros Maragos

Responsive image

Auto-TLDR; Multi-task Learning for Handwritten Text Recognition

Slides Poster Similar

Current state-of-the-art approaches in the field of Handwritten Text Recognition are predominately single task with unigram, character level target units. In our work, we utilize a Multi-task Learning scheme, training the model to perform decompositions of the target sequence with target units of different granularity, from fine tocoarse. We consider this method as a way to utilize n-gram information, implicitly, in the training process, while the final recognition is performed using only the unigram output. Unigram decoding of sucha multi-task approach highlights the capability of the learned internal representations, imposed by the different n-grams at the training step. We select n-grams as our target units and we experiment from unigrams till fourgrams, namely subword level granularities.These multiple decompositions are learned from the network with task-specific CTC losses. Concerning network architectures, we pro-pose two alternatives, namely the Hierarchical and the Block Multi-task. Overall, our proposed model, even though evaluated only onthe unigram task, outperforms its counterpart single-task by absolute 2.52% WER and 1.02% CER, in the greedy decoding, without any computational overhead during inference, hinting towards success-fully imposing an implicit language model

Watch Your Strokes: Improving Handwritten Text Recognition with Deformable Convolutions

Iulian Cojocaru, Silvia Cascianelli, Lorenzo Baraldi, Massimiliano Corsini, Rita Cucchiara

Responsive image

Auto-TLDR; Deformable Convolutional Neural Networks for Handwritten Text Recognition

Slides Poster Similar

Handwritten Text Recognition (HTR) in free-layout pages is a valuable yet challenging task which aims to automatically understand handwritten texts. State-of-the-art approaches in this field usually encode input images with Convolutional Neural Networks, whose kernels are typically defined on a fixed grid and focus on all input pixels independently. However, this is in contrast with the sparse nature of handwritten pages, in which only pixels representing the ink of the writing are useful for the recognition task. Furthermore, the standard convolution operator is not explicitly designed to take into account the great variability in shape, scale, and orientation of handwritten characters. To overcome these limitations, we investigate the use of deformable convolutions for handwriting recognition. This type of convolution deform the convolution kernel according to the content of the neighborhood, and can therefore be more adaptable to geometric variations and other deformations of the text. Experiments conducted on the IAM and RIMES datasets demonstrate that the use of deformable convolutions is a promising direction for the design of novel architectures for handwritten text recognition.

Radical Counter Network for Robust Chinese Character Recognition

Yunqing Li, Yixing Zhu, Jun Du, Changjie Wu, Jianshu Zhang

Responsive image

Auto-TLDR; Radical Counter Network for Chinese Character Recognition

Slides Poster Similar

Chinese character recognition has attracted much interest due to its high challenge and various applications. The whole-character modeling method can recognize common characters well but unable to handle unseen situation. Some radical-based modeling methods have successfully achieved great performance in unseen condition but the decoding takes huge time comsumption. Therefore, a high-efficient model which can recognize unseen characters needs to be proposed. First, this paper introduces a novel radical counter network (RCN) to recognize Chinese characters by identifying radicals and spatial structures. The proposed RCN first extracts visual features from input by employing DenseNet as encoder. Then a decoder based on fully connected layer is employed, aiming at synchronously estimating the number of each caption in character. The manner of simultaneously decoding all the captions greatly saves time of sequence decoding. Additionally, we design a multi-task learning to combine global feature extraction capability of whole-character modeling and local feature extraction capability of radical-based modeling, which further improves the model generalization. Experiments on natural scene character dataset demonstrate that the proposed model significantly outperforms baseline by 4.81\% with a comparable model complexity. That shows great robustness and simplicity of our model.

Cross-Lingual Text Image Recognition Via Multi-Task Sequence to Sequence Learning

Zhuo Chen, Fei Yin, Xu-Yao Zhang, Qing Yang, Cheng-Lin Liu

Responsive image

Auto-TLDR; Cross-Lingual Text Image Recognition with Multi-task Learning

Slides Poster Similar

This paper considers recognizing texts shown in a source language and translating into a target language, without generating the intermediate source language text image recognition results. We call this problem Cross-Lingual Text Image Recognition (CLTIR). To solve this problem, we propose a multi-task system containing a main task of CLTIR and an auxiliary task of Mono-Lingual Text Image Recognition (MLTIR) simultaneously. Two different sequence to sequence learning methods, a convolution based attention model and a BLSTM model with CTC, are adopted for these tasks respectively. We evaluate the system on a newly collected Chinese-English bilingual movie subtitle image dataset. Experimental results demonstrate the multi-task learning framework performs superiorly in both languages.

ReADS: A Rectified Attentional Double Supervised Network for Scene Text Recognition

Qi Song, Qianyi Jiang, Xiaolin Wei, Nan Li, Rui Zhang

Responsive image

Auto-TLDR; ReADS: Rectified Attentional Double Supervised Network for General Scene Text Recognition

Slides Poster Similar

In recent years, scene text recognition is always regarded as a sequence-to-sequence problem. Connectionist Temporal Classification (CTC) and Attentional sequence recognition (Attn) are two very prevailing approaches to tackle this problem while they may fail in some scenarios respectively. CTC concentrates more on every individual character but is weak in text semantic dependency modeling. Attn based methods have better context semantic modeling ability while tends to overfit on limited training data. In this paper, we elaborately design a Rectified Attentional Double Supervised Network (ReADS) for general scene text recognition. To overcome the weakness of CTC and Attn, both of them are applied in our method but with different modules in two supervised branches which can make a complementary to each other. Moreover, effective spatial and channel attention mechanisms are introduced to eliminate background noise and extract valid foreground information. Finally, a simple rectified network is implemented to rectify irregular text. The ReADS can be trained end-to-end and only word-level annotations are required. Extensive experiments on various benchmarks verify the effectiveness of ReADS which achieves state-of-the-art performance.

MEAN: A Multi-Element Attention Based Network for Scene Text Recognition

Ruijie Yan, Liangrui Peng, Shanyu Xiao, Gang Yao, Jaesik Min

Responsive image

Auto-TLDR; Multi-element Attention Network for Scene Text Recognition

Slides Poster Similar

Scene text recognition is a challenging problem due to the wide variances in content, style, orientation, and image quality of text instances in natural scene images. To learn the intrinsic representation of scene texts, a novel multi-element attention (MEA) mechanism is proposed to exploit geometric structures from local to global levels in the feature map extracted from a scene text image. The MEA mechanism is a generalized form of self-attention technique with the incorporation of graph structure modeling. The elements in feature maps are taken as the nodes of an undirected graph, and three kinds of adjacency matrices are introduced to aggregating information at local, neighborhood and global levels before calculating the attention weights. If only the local adjacency matrix is used, the MEA mechanism degenerates to a self-attention form. A multi-element attention network (MEAN) is implemented which includes a CNN for feature extraction, an encoder with MEA mechanism and a decoder for predicting text codes. Orientation positional encoding information is further added to the feature map output by the CNN, and a feature sequence as the encoder's input is obtained by element-level decomposition of the feature map. Experimental results show that MEAN has achieved state-of-the-art or competitive performance on public English scene text datasets. Further experiments and analyses conducted on both English and Chinese scene text datasets show that MEAN can handle horizontal, vertical, and irregular scene text samples.

A Few-Shot Learning Approach for Historical Ciphered Manuscript Recognition

Mohamed Ali Souibgui, Alicia Fornés, Yousri Kessentini, Crina Tudor

Responsive image

Auto-TLDR; Handwritten Ciphers Recognition Using Few-Shot Object Detection

Slides Similar

Encoded (or ciphered) manuscripts are a special type of historical documents that contain encrypted text. The automatic recognition of this kind of documents is challenging because: 1) the cipher alphabet changes from one document to another, 2) there is a lack of annotated corpus for training and 3) touching symbols make the symbol segmentation difficult and complex. To overcome these difficulties, we propose a novel method for handwritten ciphers recognition based on few-shot object detection. Our method first detects all symbols of a given alphabet in a line image, and then a decoding step maps the symbol similarity scores to the final sequence of transcribed symbols. By training on synthetic data, we show that the proposed architecture is able to recognize handwritten ciphers with unseen alphabets. In addition, if few labeled pages with the same alphabet are used for fine tuning, our method surpasses existing unsupervised and supervised HTR methods for ciphers recognition.

Robust Lexicon-Free Confidence Prediction for Text Recognition

Qi Song, Qianyi Jiang, Rui Zhang, Xiaolin Wei

Responsive image

Auto-TLDR; Confidence Measurement for Optical Character Recognition using Single-Input Multi-Output Network

Slides Poster Similar

Benefiting from the success of deep learning, Optical Character Recognition (OCR) is booming in recent years. As we all know, the text recognition results are vulnerable to slight perturbation in input images, thus a method for measuring how reliable the results are is crucial. In this paper, we present a novel method for confidence measurement given a text recognition result, which can be embedded in any text recognizer with little overheads. Our method consists of two stages with a coarse-to-fine style. The first stage generates multiple candidates for voting coarse scores by a Single-Input Multi-Output network (SIMO). The second stage calculates a refined confidence score referred by the voting result and the conditional probabilities of the Top-1 probable recognition sequence. Highly competitive performance is achieved on several standard benchmarks validates the efficiency and effectiveness of the proposed method. Moreover, it can be adopted in both Latin and non-Latin languages.

Sample-Aware Data Augmentor for Scene Text Recognition

Guanghao Meng, Tao Dai, Shudeng Wu, Bin Chen, Jian Lu, Yong Jiang, Shutao Xia

Responsive image

Auto-TLDR; Sample-Aware Data Augmentation for Scene Text Recognition

Slides Poster Similar

Deep neural networks (DNNs) have been widely used in scene text recognition, and achieved remarkable performance. Such DNN-based scene text recognizers usually require plenty of training data for training, but data collection and annotation is usually cost-expensive in practice. To alleviate this issue, data augmentation is often applied to train the scene text recognizers. However, existing data augmentation methods including affine transformation and elastic transformation methods suffer from the problems of under- and over-diversity, due to the complexity of text contents and shapes. In this paper, we propose a sample-aware data augmentor to transform samples adaptively based on the contents of samples. Specifically, our data augmentor consists of three parts: gated module, affine transformation module, and elastic transformation module. In our data augmentor, affine transformation module focuses on keeping the affinity of samples, while elastic transformation module aims to improve the diversity of samples. With the gated module, our data augmentor determines transformation type adaptively based on the properties of training samples and the recognizer capability during the training process. Besides, our framework introduces an adversarial learning strategy to optimize the augmentor and the recognizer jointly. Extensive experiments on scene text recognition benchmarks show that our sample-aware data augmentor significantly improves the performance of state-of-the-art scene text recognizer.

Context Visual Information-Based Deliberation Network for Video Captioning

Min Lu, Xueyong Li, Caihua Liu

Responsive image

Auto-TLDR; Context visual information-based deliberation network for video captioning

Slides Poster Similar

Video captioning is to automatically and accurately generate a textual description for a video. The typical methods following the encoder-decoder architecture directly utilized hidden states to predict words. Nevertheless, these methods did not amend the inaccurate hidden states before feeding those states into word prediction. This led to a cascade of errors on generating word by word. In this paper, the context visual information-based deliberation network is proposed, abbreviated as CVI-DelNet. Its key idea is to introduce the deliberator into the encoder-decoder framework. The encoder-decoder firstly generates a raw hidden state sequence. Unlike the existing methods, the raw hidden state is no more directly used for word prediction but is fed into the deliberator to generate the refined hidden state. The words are then predicted according to the refined hidden states and the contextual visual features. Results on two datasets shows that the proposed method significantly outperforms the baselines.

Weakly Supervised Attention Rectification for Scene Text Recognition

Chengyu Gu, Shilin Wang, Yiwei Zhu, Zheng Huang, Kai Chen

Responsive image

Auto-TLDR; An auxiliary supervision branch for attention-based scene text recognition

Slides Poster Similar

Scene text recognition has become a hot topic in recent years due to its booming real-life applications. Attention-based encoder-decoder framework has become one of the most popular frameworks especially in the irregular text scenario. However, the “attention drift” problem reduces the recognition performance for most existing attention-based scene text recognition methods. To solve this problem, we propose an auxiliary supervision branch along with the attention-based encoder-decoder framework. A new loss function is designed to refine the feature map and to help the attention region align the target character area. Compared with existing attention rectification mechanisms, our method does not require character-level annotations or introduce any additional trainable parameter. Furthermore, our method can improve the performance for both RNN-Attention and Scaled Dot-Product Attention. The experiment results on various benchmarks have demonstrated that the proposed approach outperforms the state-of-the-art methods in both regular and irregular text recognition scenarios.

Switching Dynamical Systems with Deep Neural Networks

Cesar Ali Ojeda Marin, Kostadin Cvejoski, Bogdan Georgiev, Ramses J. Sanchez

Responsive image

Auto-TLDR; Variational RNN for Switching Dynamics

Slides Poster Similar

The problem of uncovering different dynamicalregimes is of pivotal importance in time series analysis. Switchingdynamical systems provide a solution for modeling physical phe-nomena whose time series data exhibit different dynamical modes.In this work we propose a novel variational RNN model forswitching dynamics allowing for both non-Markovian and non-linear dynamical behavior between and within dynamic modes.Attention mechanisms are provided to inform the switchingdistribution. We evaluate our model on synthetic and empiricaldatasets of diverse nature and successfully uncover differentdynamical regimes and predict the switching dynamics.

2D License Plate Recognition based on Automatic Perspective Rectification

Hui Xu, Zhao-Hong Guo, Da-Han Wang, Xiang-Dong Zhou, Yu Shi

Responsive image

Auto-TLDR; Perspective Rectification Network for License Plate Recognition

Slides Poster Similar

License plate recognition (LPR) remains a challenging task in face of some difficulties such as image deformation and multi-line character distribution. Text rectification that is crucial to eliminate the effects of image deformation has attracted increasing attentions in scene text recognition. However, current text rectification methods are not designed specifically for LPR, which did not take the features of plate deformation into account. Considering the fact that a license plate (LP) can only generate perspective distortion in the image due to its rigid feature, in this paper we propose a novel perspective rectification network (PRN) to automatically estimate the perspective transformation and rectify the distorted LP accordingly. For recognition, we propose a location-aware 2D attention based recognition network that is capable of recognizing both single-line and double-line plates with perspective deformation. The rectification network and recognition network are connected for end-to-end training. Experiments on common datasets show that the proposed method achieves the state-of-the-art performance, demonstrating the effectiveness of the proposed approach.

A Multi-Head Self-Relation Network for Scene Text Recognition

Zhou Junwei, Hongchao Gao, Jiao Dai, Dongqin Liu, Jizhong Han

Responsive image

Auto-TLDR; Multi-head Self-relation Network for Scene Text Recognition

Slides Poster Similar

The text embedded in scene images can be seen everywhere in our lives. However, recognizing text from natural scene images is still a challenge because of its diverse shapes and distorted patterns. Recently, advanced recognition networks generally treat scene text recognition as a sequence prediction task. Although achieving excellent performance, these recognition networks consider the feature map cells as independent individuals and update cells state without utilizing the information of their neighboring cells. And the local receptive field of traditional convolutional neural network (CNN) makes a single cell that cannot cover the whole text region in an image. Due to these issues, the existing recognition networks cannot extract the global context in a visual scene. To deal with the above problems, we propose a Multi-head Self-relation Network(MSRN) for scene text recognition in this paper. The MSRN consists of several multi-head self-relation layers, which is designed for extracting the global context of a visual scene, so that transforms a cell into a new cell that fuses the information of the related cells. Furthermore, experiments over several public datasets demonstrate that our proposed recognition network achieves superior performance on several benchmark datasets including IC03, IC13, IC15, SVT-Perspective.

Trajectory-User Link with Attention Recurrent Networks

Tao Sun, Yongjun Xu, Fei Wang, Lin Wu, 塘文 钱, Zezhi Shao

Responsive image

Auto-TLDR; TULAR: Trajectory-User Link with Attention Recurrent Neural Networks

Slides Poster Similar

The prevalent adoptions of GPS-enabled devices have witnessed an explosion of various location-based services which produces a huge amount of trajectories monitoring the individuals' movements. In this paper, we tackle Trajectory-User Link (TUL) problem, which identifies humans' movement patterns and links trajectories to the users who generated them. Existing solutions on TUL problem employ recurrent neural networks and variational autoencoder methods, which face the bottlenecks in the case of excessively long trajectories and fragmentary users' movements. However, these are common characteristics of trajectory data in reality, leading to performance degradation of the existing models. In this paper, we propose an end-to-end attention recurrent neural learning framework, called TULAR (Trajectory-User Link with Attention Recurrent Networks), which focus on selected parts of the source trajectories when linking. TULAR introduce the Trajectory Semantic Vector (TSV) via unsupervised location representation learning and recurrent neural networks, by which to reckon the weight of parts of source trajectory. Further, we employ three attention scores for the weight measurements. Experiments are conducted on two real world datasets and compared with several existing methods, and the results show that TULAR yields a new state-of-the-art performance. Source code is public available at GitHub: https://github.com/taos123/TULAR.

MANet: Multimodal Attention Network Based Point-View Fusion for 3D Shape Recognition

Yaxin Zhao, Jichao Jiao, Ning Li

Responsive image

Auto-TLDR; Fusion Network for 3D Shape Recognition based on Multimodal Attention Mechanism

Slides Poster Similar

3D shape recognition has attracted more and more attention as a task of 3D vision research. The proliferation of 3D data encourages various deep learning methods based on 3D data. Now there have been many deep learning models based on point-cloud data or multi-view data alone. However, in the era of big data, integrating data of two different modals to obtain a unified 3D shape descriptor is bound to improve the recognition accuracy. Therefore, this paper proposes a fusion network based on multimodal attention mechanism for 3D shape recognition. Considering the limitations of multi-view data, we introduce a soft attention scheme, which can use the global point-cloud features to filter the multi-view features, and then realize the effective fusion of the two features. More specifically, we obtain the enhanced multi-view features by mining the contribution of each multi-view image to the overall shape recognition, and then fuse the point-cloud features and the enhanced multi-view features to obtain a more discriminative 3D shape descriptor. We have performed relevant experiments on the ModelNet40 dataset, and experimental results verify the effectiveness of our method.

Sketch-SNet: Deeper Subdivision of Temporal Cues for Sketch Recognition

Yizhou Tan, Lan Yang, Honggang Zhang

Responsive image

Auto-TLDR; Sketch Recognition using Invariable Structural Feature and Drawing Habits Feature

Slides Poster Similar

Sketch recognition is a central task in sketchrelated researches. Different from the natural image, the sparse pixel distribution of sketch destroys the visual texture which encourages researchers to explore the temporal information of sketch. With the release of million-scale datasets, we explore the invariable structure of sketch and specific order of strokes in sketch. Prior works based on Recurrent Neural Network (RNN) trend to output different features with changed stroke orders. In particular, we adopt a novel method by employing a Graph Convolutional Network (GCN) to extract invariable structural feature under any orders of strokes. Compared to traditional comprehension of sketch, we further split the temporal information of sketch into two types of feature (invariable structural feature (ISF) and drawing habits feature (DHF)) which aim to reduce the confusion in temporal information. We propose a two-branch GCN-RNN network to extract two types of feature respectively, termed Sketch-SNet. The GCN branch is encouraged to extract the ISF through receiving various shuffled strokes of an input sketch. The RNN branch takes the original input to extract DHF by learning the pattern of strokes’ order. Meanwhile, we introduce semantic information to generate soft-labels owing to the high abstractness of sketch. Extensive experiments on the Quick-Draw dataset demonstrate that our further subdivision of temporal information improves the performance of sketch recognition which surpasses state-of-the-art by a large margin.

Context Matters: Self-Attention for Sign Language Recognition

Fares Ben Slimane, Mohamed Bouguessa

Responsive image

Auto-TLDR; Attentional Network for Continuous Sign Language Recognition

Slides Poster Similar

This paper proposes an attentional network for the task of Continuous Sign Language Recognition. The proposed approach exploits co-independent streams of data to model the sign language modalities. These different channels of information can share a complex temporal structure between each other. For that reason, we apply attention to synchronize and help capture entangled dependencies between the different sign language components. Even though Sign Language is multi-channel, handshapes represent the central entities in sign interpretation. Seeing handshapes in their correct context defines the meaning of a sign. Taking that into account, we utilize the attention mechanism to efficiently aggregate the hand features with their appropriate Spatio-temporal context for better sign recognition. We found that by doing so the model is able to identify the essential Sign Language components that revolve around the dominant hand and the face areas. We test our model on the benchmark dataset RWTH-PHOENIX-Weather 2014, yielding competitive results.

PIN: A Novel Parallel Interactive Network for Spoken Language Understanding

Peilin Zhou, Zhiqi Huang, Fenglin Liu, Yuexian Zou

Responsive image

Auto-TLDR; Parallel Interactive Network for Spoken Language Understanding

Slides Poster Similar

Spoken Language Understanding (SLU) is an essential part of the spoken dialogue system, which typically consists of intent detection (ID) and slot filling (SF) tasks. Recently, recurrent neural networks (RNNs) based methods achieved the state-of-the-art for SLU. It is noted that, in the existing RNN-based approaches, ID and SF tasks are often jointly modeled to utilize the correlation information between them. However, we noted that, so far, the efforts to obtain better performance by supporting bidirectional and explicit information exchange between ID and SF are not well studied. In addition, few studies attempt to capture the local context information to enhance the performance of SF. Motivated by these findings, in this paper, Parallel Interactive Network (PIN) is proposed to model the mutual guidance between ID and SF. Specifically, given an utterance, a Gaussian self-attentive encoder is introduced to generate the context-aware feature embedding of the utterance which is able to capture local context information. Taking the feature embedding of the utterance, Slot2Intent module and Intent2Slot module are developed to capture the bidirectional information flow for ID and SF tasks. Finally, a cooperation mechanism is constructed to fuse the information obtained from Slot2Intent and Intent2Slot modules to further reduce the prediction bias. The experiments on two benchmark datasets, i.e., SNIPS and ATIS, demonstrate the effectiveness of our approach, which achieves a competitive result with state-of-the-art models. More encouragingly, by using the feature embedding of the utterance generated by the pre-trained language model BERT, our method achieves the state-of-the-art among all comparison approaches.

Cut and Compare: End-To-End Offline Signature Verification Network

Xi Lu, Lin-Lin Huang, Fei Yin

Responsive image

Auto-TLDR; An End-to-End Cut-and-Compare Network for Offline Signature Verification

Slides Poster Similar

Offline signature verification, to determine whether a handwritten signature image is genuine or forged for a claimed identity, is needed in many applications. How to extract salient features and how to calculate similarity scores are the major issues. In this paper, we propose a novel end-to-end cut-and-compare network for offline signature verification. Based on the Spatial Transformer Network (STN), discriminative regions are segmented from a pair of input signature images and are compared attentively with help of Attentive Recurrent Comparator (ARC). An adaptive distance fusion module is proposed to fuse the distances of these regions. To address the intrapersonal variability problem, we design a smoothed double-margin loss to train the network. The proposed network achieves state-of-the-art performance on CEDAR, GPDS Synthetic, BHSig-H and BHSig-B datasets of different languages. Furthermore, our network shows strong generalization ability on cross-language test.

MA-LSTM: A Multi-Attention Based LSTM for Complex Pattern Extraction

Jingjie Guo, Kelang Tian, Kejiang Ye, Cheng-Zhong Xu

Responsive image

Auto-TLDR; MA-LSTM: Multiple Attention based recurrent neural network for forget gate

Slides Poster Similar

With the improvement of data, computing powerand algorithms, deep learning has achieved rapid developmentand showing excellent performance. Recently, many deep learn-ing models are proposed to solve the problems in different areas.A recurrent neural network (RNN) is a class of artificial neuralnetworks where connections between nodes form a directedgraph along a temporal sequence. This allows it to exhibittemporal dynamic behavior, which makes it applicable to taskssuch as handwriting recognition or speech recognition. How-ever, the RNN relies heavily on the automatic learning abilityto update parameters which concentrate on the data flow butseldom considers the feature extraction capability of the gatemechanism. In this paper, we propose a novel architecture tobuild the forget gate which is generated by multiple bases.Instead of using the traditional single-layer fully-connectednetwork, we use a Multiple Attention (MA) based network togenerate the forget gate which refines the optimization spaceof gate function and improve the granularity of the recurrentneural network to approximate the map in the ground truth.Credit to the MA structure on the gate mechanism. Our modelhas a better feature extraction capability than other knownmodels. MA-LSTM is an alternative module which can directly replace the recurrent neural network and has achieved good performance in many areas that people are concerned about.

PICK: Processing Key Information Extraction from Documents Using Improved Graph Learning-Convolutional Networks

Wenwen Yu, Ning Lu, Xianbiao Qi, Ping Gong, Rong Xiao

Responsive image

Auto-TLDR; PICK: A Graph Learning Framework for Key Information Extraction from Documents

Slides Poster Similar

Computer vision with state-of-the-art deep learning models have achieved huge success in the field of Optical Character Recognition (OCR) including text detection and recognition tasks recently. However, Key Information Extraction (KIE) from documents as the downstream task of OCR, having a large number of use scenarios in real-world, remains a challenge because documents not only have textual features extracting from OCR systems but also have semantic visual features that are not fully exploited and play a critical role in KIE. Too little work has been devoted to efficiently make full use of both textual and visual features of the documents. In this paper, we introduce PICK, a framework that is effective and robust in handling complex documents layout for KIE by combining graph learning with graph convolution operation, yielding a richer semantic representation containing the textual and visual features and global layout without ambiguity. Extensive experiments on real-world datasets have been conducted to show that our method outperforms baselines methods by significant margins.

Text Recognition in Real Scenarios with a Few Labeled Samples

Jinghuang Lin, Cheng Zhanzhan, Fan Bai, Yi Niu, Shiliang Pu, Shuigeng Zhou

Responsive image

Auto-TLDR; Few-shot Adversarial Sequence Domain Adaptation for Scene Text Recognition

Slides Poster Similar

Scene text recognition (STR) is still a hot research topic in computer vision field due to its various applications. Existing works mainly focus on learning a general model with a huge number of synthetic text images to recognize unconstrained scene texts, and have achieved substantial progress. However, these methods are not quite applicable in many real-world scenarios where 1) high recognition accuracy is required, while 2) labeled samples are lacked. To tackle this challenging problem, this paper proposes a few-shot adversarial sequence domain adaptation (FASDA) approach to build sequence adaptation between the synthetic source domain (with many synthetic labeled samples) and a specific target domain (with only some or a few real labeled samples). This is done by simultaneously learning each character’s feature representation with an attention mech- anism and establishing the corresponding character-level latent subspace with adversarial learning. Our approach can maximize the character-level confusion between the source domain and the target domain, thus achieves the sequence-level adaptation with even a small number of labeled samples in the target domain. Extensive experiments on various datasets show that our method significantly outperforms the finetuning scheme, and obtains comparable performance to the state-of-the-art STR methods.

Exploring Spatial-Temporal Representations for fNIRS-based Intimacy Detection via an Attention-enhanced Cascade Convolutional Recurrent Neural Network

Chao Li, Qian Zhang, Ziping Zhao

Responsive image

Auto-TLDR; Intimate Relationship Prediction by Attention-enhanced Cascade Convolutional Recurrent Neural Network Using Functional Near-Infrared Spectroscopy

Slides Poster Similar

The detection of intimacy plays a crucial role in the improvement of intimate relationship, which contributes to promote the family and social harmony. Previous studies have shown that different degrees of intimacy have significant differences in brain imaging. Recently, a few of work has emerged to recognise intimacy automatically by using machine learning technique. Moreover, considering the temporal dynamic characteristics of intimacy relationship on neural mechanism, how to model spatio-temporal dynamics for intimacy prediction effectively is still a challenge. In this paper, we propose a novel method to explore deep spatial-temporal representations for intimacy prediction by Attention-enhanced Cascade Convolutional Recurrent Neural Network (ACCRNN). Given the advantages of time-frequency resolution in complex neuronal activities analysis, this paper utilizes functional near-infrared spectroscopy (fNIRS) to analyse and infer to intimate relationship. We collect a fNIRS-based dataset for the analysis of intimate relationship. Forty-two-channel fNIRS signals are recorded from the 44 subjects' prefrontal cortex when they watched a total of 18 photos of lovers, friends and strangers for 30 seconds per photo. The experimental results show that our proposed method outperforms the others in terms of accuracy with the precision of 96.5%. To the best of our knowledge, this is the first time that such a hybrid deep architecture has been employed for fNIRS-based intimacy prediction.

Moto: Enhancing Embedding with Multiple Joint Factors for Chinese Text Classification

Xunzhu Tang, Rujie Zhu, Tiezhu Sun

Responsive image

Auto-TLDR; Moto: Enhancing Embedding with Multiple J\textbf{o}int Fac\textBF{to}rs

Slides Poster Similar

Recently, language representation techniques have achieved great performances in text classification. However, most existing representation models are specifically designed for English materials, which may fail in Chinese because of the huge difference between these two languages. Actually, few existing methods for Chinese text classification process texts at a single level. However, as a special kind of hieroglyphics, radicals of Chinese characters are good semantic carriers. In addition, Pinyin codes carry the semantic of tones, and Wubi reflects the stroke structure information, \textit{etc}. Unfortunately, previous researches neglected to find an effective way to distill the useful parts of these four factors and to fuse them. In our works, we propose a novel model called Moto: Enhancing Embedding with \textbf{M}ultiple J\textbf{o}int Fac\textbf{to}rs. Specifically, we design an attention mechanism to distill the useful parts by fusing the four-level information above more effectively. We conduct extensive experiments on four popular tasks. The empirical results show that our Moto achieves SOTA 0.8316 ($F_1$-score, 2.11\% improvement) on Chinese news titles, 96.38 (1.24\% improvement) on Fudan Corpus and 0.9633 (3.26\% improvement) on THUCNews.

Visual Oriented Encoder: Integrating Multimodal and Multi-Scale Contexts for Video Captioning

Bang Yang, Yuexian Zou

Responsive image

Auto-TLDR; Visual Oriented Encoder for Video Captioning

Slides Poster Similar

Video captioning is a challenging task which aims at automatically generating a natural language description of a given video. Recent researches have shown that exploiting the intrinsic multi-modalities of videos significantly promotes captioning performance. However, how to integrate multi-modalities to generate effective semantic representations for video captioning is still an open issue. Some researchers proposed to learn multimodal features in parallel during the encoding stage. The downside of these methods lies in the neglect of the interaction among multi-modalities and their rich contextual information. In this study, inspired by the fact that visual contents are generally more important for comprehending videos, we propose a novel Visual Oriented Encoder (VOE) to integrate multimodal features in an interactive manner. Specifically, VOE is designed as a hierarchical structure, where bottom layers are utilized to extract multi-scale contexts from auxiliary modalities while the top layer is exploited to generate joint representations by considering both visual and contextual information. Following the encoder-decoder framework, we systematically develop a VOE-LSTM model and evaluate it on two mainstream benchmarks: MSVD and MSR-VTT. Experimental results show that the proposed VOE surpasses conventional encoders and our VOE-LSTM model achieves competitive results compared with state-of-the-art approaches.

Tackling Contradiction Detection in German Using Machine Translation and End-To-End Recurrent Neural Networks

Maren Pielka, Rafet Sifa, Lars Patrick Hillebrand, David Biesner, Rajkumar Ramamurthy, Anna Ladi, Christian Bauckhage

Responsive image

Auto-TLDR; Contradiction Detection in Natural Language Inference using Recurrent Neural Networks

Slides Poster Similar

Natural Language Inference, and specifically Contradiction Detection, is still an unexplored topic with respect to German text. In this paper, we apply Recurrent Neural Network (RNN) methods to learn contradiction-specific sentence embeddings. Our data set for evaluation is a machine-translated version of the Stanford Natural Language Inference (SNLI) corpus. The results are compared to a baseline using unsupervised vectorization techniques, namely tf-idf and Flair, as well as state-of-the art transformer-based (MBERT) methods. We find that the end-to-end models outperform the models trained on unsupervised embeddings, which makes them the better choice in an empirical use case. The RNN methods also perform superior to MBERT on the translated data set.

Writer Identification Using Deep Neural Networks: Impact of Patch Size and Number of Patches

Akshay Punjabi, José Ramón Prieto Fontcuberta, Enrique Vidal

Responsive image

Auto-TLDR; Writer Recognition Using Deep Neural Networks for Handwritten Text Images

Slides Poster Similar

Traditional approaches for the recognition or identification of the writer of a handwritten text image used to relay on heuristic knowledge about the shape and other features of the strokes of previously segmented characters. However, recent works have done significantly advances on the state of the art thanks to the use of various types of deep neural networks. In most of all of these works, text images are decomposed into patches, which are processed by the networks without any previous character or word segmentation. In this paper, we study how the way images are decomposed into patches impact recognition accuracy, using three publicly available datasets. The study also includes a simpler architecture where no patches are used at all - a single deep neural network inputs a whole text image and directly provides a writer recognition hypothesis. Results show that bigger patches generally lead to improved accuracy, achieving in one of the datasets a significant improvement over the best results reported so far.

Equation Attention Relationship Network (EARN) : A Geometric Deep Metric Framework for Learning Similar Math Expression Embedding

Saleem Ahmed, Kenny Davila, Srirangaraj Setlur, Venu Govindaraju

Responsive image

Auto-TLDR; Representational Learning for Similarity Based Retrieval of Mathematical Expressions

Slides Poster Similar

Representational Learning in the form of high dimensional embeddings have been used for multiple pattern recognition applications. There has been a significant interest in building embedding based systems for learning representationsin the mathematical domain. At the same time, retrieval of structured information such as mathematical expressions is an important need for modern IR systems. In this work, our motivation is to introduce a robust framework for learning representations for similarity based retrieval of mathematical expressions. Given a query by example, the embedding can find the closest matching expression as a function of euclidean distance between them. We leverage recent advancements in image-based and graph-based deep learning algorithms to learn our similarity embeddings. We do this first, by using uni-modal encoders in graph space and image space and then, a multi-modal combination of the same. To overcome the lack of training data, we force the networks to learn a deep metric using triplets generated with a heuristic scoring function. We also adopt a custom strategy for mining hard samples to train our neural networks. Our system produces rankings similar to those generated by the original scoring function, but using only a fraction of the time. Our results establish the viability of using such a multi-modal embedding for this task.

Ancient Document Layout Analysis: Autoencoders Meet Sparse Coding

Homa Davoudi, Marco Fiorucci, Arianna Traviglia

Responsive image

Auto-TLDR; Unsupervised Unsupervised Representation Learning for Document Layout Analysis

Slides Poster Similar

Layout analysis of historical handwritten documents is a key pre-processing step in document image analysis that, by segmenting the image into its homogeneous regions, facilitates subsequent procedures such as optical character recognition and automatic transcription. Learning-based approaches have shown promising performances in layout analysis, however, the majority of them requires tedious pixel-wise labelled training data to achieve generalisation capabilities, this limitation preventing their application due to the lack of large labelled datasets. This paper proposes a novel unsupervised representation learning method for documents’ layout analysis that reduces the need for labelled data: a sparse autoencoder is first trained in an unsupervised manner on a historical text document’s image; representation of image patches, computed by the sparse encoder, is then used to classify pixels into various region categories of the document using a feed-forward neural network. A new training method, inspired by the ISTA algorithm, is also introduced here to train the sparse encoder. Experimental results on DIVA-HisDB dataset demonstrate that the proposed method outperforms previous approaches based on unsupervised representation learning while achieving performances comparable to the state-of-the-art fully supervised methods.

Fast Approximate Modelling of the Next Combination Result for Stopping the Text Recognition in a Video

Konstantin Bulatov, Nadezhda Fedotova, Vladimir V. Arlazarov

Responsive image

Auto-TLDR; Stopping Video Stream Recognition of a Text Field Using Optimized Computation Scheme

Slides Poster Similar

In this paper, we consider a task of stopping the video stream recognition process of a text field, in which each frame is recognized independently and the individual results are combined together. The video stream recognition stopping problem is an under-researched topic with regards to computer vision, but its relevance for building high-performance video recognition systems is clear. Firstly, we describe an existing method of optimally stopping such a process based on a modelling of the next combined result. Then, we describe approximations and assumptions which allowed us to build an optimized computation scheme and thus obtain a method with reduced computational complexity. The methods were evaluated for the tasks of document text field recognition and arbitrary text recognition in a video. The experimental comparison shows that the introduced approximations do not diminish the quality of the stopping method in terms of the achieved combined result precision, while dramatically reducing the time required to make the stopping decision. The results were consistent for both text recognition tasks.

Text Baseline Recognition Using a Recurrent Convolutional Neural Network

Matthias Wödlinger, Robert Sablatnig

Responsive image

Auto-TLDR; Automatic Baseline Detection of Handwritten Text Using Recurrent Convolutional Neural Network

Slides Poster Similar

The detection of baselines of text is a necessary pre-processing step for many modern methods of automatic handwriting recognition. In this work a two-stage system for the automatic detection of text baselines of handwritten text is presented. In a first step pixel-wise segmentation on the document image is performed to classify pixels as baselines, start points and end points. This segmentation is then used to extract the start points of lines. Starting from these points the baseline is extracted using a recurrent convolutional neural network that directly outputs the baseline coordinates. This method allows the direct extraction of baseline coordinates as the output of a neural network without the use of any post processing steps. The model is evaluated on the cBAD dataset from the ICDAR 2019 competition on baseline detection.

AG-GAN: An Attentive Group-Aware GAN for Pedestrian Trajectory Prediction

Yue Song, Niccolò Bisagno, Syed Zohaib Hassan, Nicola Conci

Responsive image

Auto-TLDR; An attentive group-aware GAN for motion prediction in crowded scenarios

Slides Poster Similar

Understanding human behaviors in crowded scenarios requires analyzing not only the position of the subjects in space, but also the scene context. Existing approaches mostly rely on the motion history of each pedestrian and model the interactions among people by considering the entire surrounding neighborhood. In our approach, we address the problem of motion prediction by applying coherent group clustering and a global attention mechanism on the LSTM-based Generative Adversarial Networks (GANs). The proposed model consists of an attentive group-aware GAN that observes the agents' past motion and predicts future paths, using (i) a group pooling module to model neighborhood interaction, and (ii) an attention module to specifically focus on hidden states. The experimental results demonstrate that our proposal outperforms state-of-the-art models on common benchmark datasets, and is able to generate socially-acceptable trajectories.

The HisClima Database: Historical Weather Logs for Automatic Transcription and Information Extraction

Verónica Romero, Joan Andreu Sánchez

Responsive image

Auto-TLDR; Automatic Handwritten Text Recognition and Information Extraction from Historical Weather Logs

Slides Poster Similar

Knowing the weather and atmospheric conditions from the past can help weather researchers to generate models like the ones used to predict how weather conditions are likely to change as global temperatures continue to rise. Many historical weather records are available from the past registered on a systemic basis. Historical weather logs were registered in ships, when they were on the high seas, recording daily weather conditions such as: wind speed, temperature, coordinates, etc. These historical documents represent an important source of knowledge with valuable information to extract climatic information of several centuries ago. Such information is usually collected by experts that devote a lot of time. This paper presents a new database, compiled from a ship log mainly composed by handwritten tables that contain mainly numerical information, to support research in automatic handwriting recognition and information extraction. In addition, a study is presented about the capability of state-of-the-art handwritten text recognition systems and information extraction techniques, when applied to the presented database. Baseline results are reported for reference in future studies.

Two-Stream Temporal Convolutional Network for Dynamic Facial Attractiveness Prediction

Nina Weng, Jiahao Wang, Annan Li, Yunhong Wang

Responsive image

Auto-TLDR; 2S-TCN: A Two-Stream Temporal Convolutional Network for Dynamic Facial Attractiveness Prediction

Slides Poster Similar

In the field of facial attractiveness prediction, while deep models using static pictures have shown promising results, little attention is paid to dynamic facial information, which is proven to be influential by psychological studies. Meanwhile, the increasing popularity of short video apps creates an enormous demand of facial attractiveness prediction from short video clips. In this paper, we target on the dynamic facial attractiveness prediction problem. To begin with, a large-scale video-based facial attractiveness prediction dataset (VFAP) with more than one thousand clips from TikTok is collected. A two-stream temporal convolutional network (2S-TCN) is then proposed to capture dynamic attractiveness feature from both facial appearance and landmarks. We employ attentive feature enhancement along with specially designed modality and temporal fusion strategies to better explore the temporal dynamics. Extensive experiments on the proposed VFAP dataset demonstrate that 2S-TCN has a distinct advantage over the state-of-the-art static prediction methods.

Attentive Hybrid Feature Based a Two-Step Fusion for Facial Expression Recognition

Jun Weng, Yang Yang, Zichang Tan, Zhen Lei

Responsive image

Auto-TLDR; Attentive Hybrid Architecture for Facial Expression Recognition

Slides Poster Similar

Facial expression recognition is inherently a challenging task, especially for the in-the-wild images with various occlusions and large pose variations, which may lead to the loss of some crucial information. To address it, in this paper, we propose an attentive hybrid architecture (AHA) which learns global, local and integrated features based on different face regions. Compared with one type of feature, our extracted features own complementary information and can reduce the loss of crucial information. Specifically, AHA contains three branches, where all sub-networks in those branches employ the attention mechanism to further localize the interested pixels/regions. Moreover, we propose a two-step fusion strategy based on LSTM to deeply explore the hidden correlations among different face regions. Extensive experiments on four popular expression databases (i.e., CK+, FER-2013, SFEW 2.0, RAF-DB) show the effectiveness of the proposed method.

Continuous Sign Language Recognition with Iterative Spatiotemporal Fine-Tuning

Kenessary Koishybay, Medet Mukushev, Anara Sandygulova

Responsive image

Auto-TLDR; A Deep Neural Network for Continuous Sign Language Recognition with Iterative Gloss Recognition

Slides Poster Similar

This paper aims to develop a deep neural network for Continuous Sign Language Recognition (CSLR) with iterative Gloss Recognition (GR) fine-tuning. CSLR has been a popular research field in the last years and iterative optimization methods are well established. This paper introduces our proposed architecture involving Spatiotemporal feature-extraction model to segment useful ``gloss-unit" features and BiLSTM with CTC as a sequence model. Spatiotemporal Feature Extractor is used for both image features extraction and sequence length reduction. To this end, we compare different architectures for feature extraction and sequence model. In addition, we iteratively fine-tune feature extractor on gloss-unit video segments with alignments from the end2end model. During the iterative training, we use novel alignment correction technique, which is based on minimum transformations of Levenshtein distance. All the experiments were conducted on the RWTH-PHOENIX-Weather-2014 dataset.

Attentive Visual Semantic Specialized Network for Video Captioning

Jesus Perez-Martin, Benjamin Bustos, Jorge Pérez

Responsive image

Auto-TLDR; Adaptive Visual Semantic Specialized Network for Video Captioning

Slides Poster Similar

As an essential high-level task of video understanding topic, automatically describing a video with natural language has recently gained attention as a fundamental challenge in computer vision. Previous models for video captioning have several limitations, such as the existence of gaps in current semantic representations and the inexpressibility of the generated captions. To deal with these limitations, in this paper, we present a new architecture that we callAttentive Visual Semantic Specialized Network(AVSSN), which is an encoder-decoder model based on our Adaptive Attention Gate and Specialized LSTM layers. This architecture can selectively decide when to use visual or semantic information into the text generation process. The adaptive gate makes the decoder to automatically select the relevant information for providing a better temporal state representation than the existing decoders. Besides, the model is capable of learning to improve the expressiveness of generated captions attending to their length, using a sentence-length-related loss function. We evaluate the effectiveness of the proposed approach on the Microsoft Video Description(MSVD) and the Microsoft Research Video-to-Text (MSR-VTT) datasets, achieving state-of-the-art performance with several popular evaluation metrics: BLEU-4, METEOR, CIDEr, and ROUGE_L.