Cut and Compare: End-To-End Offline Signature Verification Network

Xi Lu, Lin-Lin Huang, Fei Yin

Auto-TLDR; An End-to-End Cut-and-Compare Network for Offline Signature Verification

Offline signature verification, which determines whether a handwritten signature image is genuine or forged for a claimed identity, is needed in many applications. The major issues are how to extract salient features and how to calculate similarity scores. In this paper, we propose a novel end-to-end cut-and-compare network for offline signature verification. Based on the Spatial Transformer Network (STN), discriminative regions are segmented from a pair of input signature images and are compared attentively with the help of the Attentive Recurrent Comparator (ARC). An adaptive distance fusion module is proposed to fuse the distances of these regions. To address the intrapersonal variability problem, we design a smoothed double-margin loss to train the network. The proposed network achieves state-of-the-art performance on the CEDAR, GPDS Synthetic, BHSig-H and BHSig-B datasets, which cover different languages. Furthermore, our network shows strong generalization ability in cross-language tests.
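
The abstract does not give the form of the smoothed double-margin loss, so the following is only a minimal sketch of the general idea, under stated assumptions. It shows a plain double-margin hinge on a pair distance, plus one common way to smooth a hinge (softplus). The function names, the margin values and the choice of softplus are all hypothetical and are not taken from the paper.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x)), a smooth stand-in for max(0, x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def double_margin_loss(distance, is_genuine, m_genuine=0.3, m_forged=0.7, smooth=False):
    """Pair loss with two margins (illustrative values, not from the paper).

    Genuine pairs are penalised only when their distance exceeds m_genuine;
    forged pairs are penalised only when their distance falls below m_forged.
    Tolerating small distances for genuine pairs is one way to leave room
    for intrapersonal variability.
    """
    hinge = softplus if smooth else (lambda x: max(0.0, x))
    if is_genuine:
        return hinge(distance - m_genuine)
    return hinge(m_forged - distance)
```

With `smooth=False` the loss is exactly zero inside the margins. With `smooth=True` it is small but nonzero everywhere, so every pair contributes a gradient.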

Similar papers

Learning Metric Features for Writer-Independent Signature Verification Using Dual Triplet Loss

Qian Wan, Qin Zou

Auto-TLDR; A dual triplet loss based method for offline writer-independent signature verification

The handwritten signature has long been a widely accepted biometric and is applied in many verification scenarios. However, automatic signature verification remains an open research problem, mainly for three reasons. 1) Skilled forgeries produced by persons who imitate the original writing pattern are very difficult to distinguish from genuine signatures. This is especially so for offline signatures, where only the signature image is captured as a feature for verification. 2) Most state-of-the-art models are writer-dependent, requiring a specific model to be trained whenever a new user is registered, which is quite inconvenient. 3) Writer-independent models often have unsatisfactory performance. To this end, we propose a novel metric learning based method for offline writer-independent signature verification. Specifically, a dual triplet loss is used to train the model, where two different triplets are constructed for random and skilled forgeries, respectively. Experiments on three alphabet datasets (GPDS Synthetic, MCYT and CEDAR) show that the proposed method achieves competitive or superior performance compared with state-of-the-art methods. Experiments are also conducted on a new offline Chinese signature dataset, CSIG-WHU, and the results show that the proposed method is highly feasible for character-based signatures.
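
To make the idea of two triplets concrete, here is a hedged sketch that sums a standard triplet loss over a random-forgery triplet and a skilled-forgery triplet. The abstract does not say how the two triplets are combined or weighted, so the sum, the margin values and all function names are assumptions for illustration only.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def dual_triplet_loss(anchor, genuine, random_forgery, skilled_forgery,
                      margin_random=1.0, margin_skilled=0.5):
    """Sum of two triplet terms sharing the same anchor and positive.

    One term uses a random forgery as the negative, the other a skilled
    forgery. The margins are hypothetical values.
    """
    return (triplet_loss(anchor, genuine, random_forgery, margin_random)
            + triplet_loss(anchor, genuine, skilled_forgery, margin_skilled))
```

In this sketch a random forgery that already lies far from the anchor contributes nothing, while a skilled forgery close to the genuine sample still produces a positive loss.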

Total Whitening for Online Signature Verification Based on Deep Representation

Xiaomeng Wu, Akisato Kimura, Kunio Kashino, Seiichi Uchida

Auto-TLDR; Total Whitening for Online Signature Verification

In deep metric learning targeted at time series, the correlation between feature activations may be easily enlarged through highly nonlinear neural networks, leading to suboptimal embedding effectiveness. An effective solution to this problem is whitening. For example, in online signature verification, whitening can be derived for three individual Gaussian distributions, namely the distributions of local features at all temporal positions 1) for all signatures of all subjects, 2) for all signatures of each particular subject, and 3) for each particular signature of each particular subject. This study proposes a unified method called total whitening that integrates these individual Gaussians. Total whitening rectifies the layout of multiple individual Gaussians to resemble a standard normal distribution, improving the balance between intraclass invariance and interclass discriminative power. Experimental results demonstrate that total whitening achieves state-of-the-art accuracy when tested on online signature verification benchmarks.
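
For reference, the textbook whitening transform for a single Gaussian with mean \mu and positive-definite covariance \Sigma is sketched below. This is only the generic building block; how total whitening integrates the three individual Gaussians is not specified in the abstract and is not reproduced here.

```latex
\[
\mu = \mathbb{E}[x], \qquad
\Sigma = \mathbb{E}\big[(x-\mu)(x-\mu)^{\top}\big], \qquad
z = \Sigma^{-1/2}\,(x-\mu)
\]
\[
\mathbb{E}[z] = 0, \qquad
\operatorname{Cov}(z) = \Sigma^{-1/2}\,\Sigma\,\Sigma^{-1/2} = I
\]
```

After the transform the feature dimensions are decorrelated with unit variance, which is the property the abstract appeals to when it speaks of resembling a standard normal distribution.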

An Investigation of Feature Selection and Transfer Learning for Writer-Independent Offline Handwritten Signature Verification

Victor Souza, Adriano Oliveira, Rafael Menelau Oliveira E Cruz, Robert Sabourin

Auto-TLDR; Overfitting of SigNet using Binary Particle Swarm Optimization

SigNet is a state-of-the-art model for feature representation used in handwritten signature verification (HSV). This representation is based on a Deep Convolutional Neural Network (DCNN) and contains 2048 dimensions. When transposed to a dissimilarity space generated by the dichotomy transformation (DT), related to the writer-independent (WI) approach, these features may include redundant information. This paper investigates the presence of overfitting when using Binary Particle Swarm Optimization (BPSO) to perform feature selection in a wrapper mode. We propose a method based on a global validation strategy with an external archive to control overfitting during the search for the most discriminant representation. Moreover, we evaluate the use of the selected features in a transfer learning context. The analysis is carried out with a writer-independent approach on the CEDAR, MCYT and GPDS-960 datasets. The experimental results show the presence of overfitting when no validation is used during the optimization process, and an improvement when the global validation strategy with an external archive is used. In addition, the space generated after feature selection can be used in a transfer learning context.
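
As a rough illustration of the dissimilarity space mentioned above, the sketch below assumes the dichotomy transformation is the element-wise absolute difference between two feature vectors, and that a feature-selection result is represented as a binary mask over dimensions. Both function names are hypothetical; the BPSO search and the validation strategy are not shown.

```python
def dichotomy_transform(query, reference):
    """Map a pair of feature vectors to one dissimilarity vector.

    Assumption: the dissimilarity is the element-wise absolute difference,
    so pairs from the same writer should land near the origin.
    """
    return [abs(q - r) for q, r in zip(query, reference)]


def apply_mask(vector, mask):
    """Keep only the dimensions whose mask bit is set (1 = selected)."""
    return [v for v, keep in zip(vector, mask) if keep]
```

A binary particle in a BPSO search could then be read as such a mask, with each candidate mask scored by the classifier trained on the reduced dissimilarity vectors.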

Writer Identification Using Deep Neural Networks: Impact of Patch Size and Number of Patches

Akshay Punjabi, José Ramón Prieto Fontcuberta, Enrique Vidal

Auto-TLDR; Writer Recognition Using Deep Neural Networks for Handwritten Text Images

Traditional approaches for the recognition or identification of the writer of a handwritten text image used to rely on heuristic knowledge about the shape and other features of the strokes of previously segmented characters. However, recent works have made significant advances on the state of the art thanks to the use of various types of deep neural networks. In most of these works, text images are decomposed into patches, which are processed by the networks without any previous character or word segmentation. In this paper, we study how the way images are decomposed into patches impacts recognition accuracy, using three publicly available datasets. The study also includes a simpler architecture where no patches are used at all: a single deep neural network takes a whole text image as input and directly provides a writer recognition hypothesis. Results show that bigger patches generally lead to improved accuracy, achieving in one of the datasets a significant improvement over the best results reported so far.
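
Patch decomposition can be sketched as a sliding window over the image. The version below is a minimal illustration on a nested-list image; the function name, the stride parameter and the handling of borders (incomplete patches are dropped) are assumptions, not details from the paper.

```python
def extract_patches(image, patch_h, patch_w, stride):
    """Slide a patch_h x patch_w window over a 2D image (list of rows).

    Windows that would cross the image border are skipped. Setting the
    patch size to the full image size yields a single patch, which
    corresponds to the no-patch setting.
    """
    rows, cols = len(image), len(image[0])
    patches = []
    for top in range(0, rows - patch_h + 1, stride):
        for left in range(0, cols - patch_w + 1, stride):
            patches.append([row[left:left + patch_w]
                            for row in image[top:top + patch_h]])
    return patches
```

Patch size and stride together determine how many patches each image yields, which are the two quantities the study varies.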

Watch Your Strokes: Improving Handwritten Text Recognition with Deformable Convolutions

Iulian Cojocaru, Silvia Cascianelli, Lorenzo Baraldi, Massimiliano Corsini, Rita Cucchiara

Auto-TLDR; Deformable Convolutional Neural Networks for Handwritten Text Recognition

Handwritten Text Recognition (HTR) in free-layout pages is a valuable yet challenging task which aims to automatically understand handwritten texts. State-of-the-art approaches in this field usually encode input images with Convolutional Neural Networks, whose kernels are typically defined on a fixed grid and focus on all input pixels independently. However, this is in contrast with the sparse nature of handwritten pages, in which only pixels representing the ink of the writing are useful for the recognition task. Furthermore, the standard convolution operator is not explicitly designed to take into account the great variability in shape, scale, and orientation of handwritten characters. To overcome these limitations, we investigate the use of deformable convolutions for handwriting recognition. This type of convolution deforms the convolution kernel according to the content of the neighborhood, and can therefore be more adaptable to geometric variations and other deformations of the text. Experiments conducted on the IAM and RIMES datasets demonstrate that the use of deformable convolutions is a promising direction for the design of novel architectures for handwritten text recognition.

Attentive Hybrid Feature Based a Two-Step Fusion for Facial Expression Recognition

Jun Weng, Yang Yang, Zichang Tan, Zhen Lei

Auto-TLDR; Attentive Hybrid Architecture for Facial Expression Recognition

Facial expression recognition is an inherently challenging task, especially for in-the-wild images with various occlusions and large pose variations, which may lead to the loss of some crucial information. To address this, we propose an attentive hybrid architecture (AHA) which learns global, local and integrated features based on different face regions. Compared with a single type of feature, our extracted features carry complementary information and can reduce the loss of crucial information. Specifically, AHA contains three branches, where all sub-networks in those branches employ the attention mechanism to further localize the pixels/regions of interest. Moreover, we propose a two-step fusion strategy based on LSTM to deeply explore the hidden correlations among different face regions. Extensive experiments on four popular expression databases (i.e., CK+, FER-2013, SFEW 2.0, RAF-DB) show the effectiveness of the proposed method.

Online Trajectory Recovery from Offline Handwritten Japanese Kanji Characters of Multiple Strokes

Hung Tuan Nguyen, Tsubasa Nakamura, Cuong Tuan Nguyen, Masaki Nakagawa

Auto-TLDR; Recovering Dynamic Online Trajectories from Offline Japanese Kanji Character Images for Handwritten Character Recognition

We propose a deep neural network-based method to recover dynamic online trajectories from offline handwritten Japanese kanji character images. It is a challenging task since Japanese kanji characters consist of multiple strokes. Our proposed model has three main components: a Convolutional Neural Network-based encoder, a Long Short-Term Memory Network-based decoder with an attention layer, and a Gaussian Mixture Model (GMM). The encoder focuses on feature extraction, while the decoder refers to the extracted features and generates time sequences of GMM parameters. The attention layer is the key component for trajectory recovery. The GMM provides robustness to style variations so that the proposed model does not overfit to training samples. In the experiments, the proposed method is evaluated by both visual verification and handwritten character recognition. This is the first attempt to use online recovered trajectories to help improve the performance of offline handwriting recognition. Although the visual verification reveals some problems, the recognition experiments demonstrate that trajectory recovery improves the accuracy of offline handwritten character recognition when online recognition of the recovered trajectories is combined with it.

One-Shot Representational Learning for Joint Biometric and Device Authentication

Sudipta Banerjee, Arun Ross

Auto-TLDR; Joint Biometric and Device Recognition from a Single Biometric Image

In this work, we propose a method to simultaneously perform (i) biometric recognition (i.e., identify the individual), and (ii) device recognition (i.e., identify the device) from a single biometric image, say, a face image, using a one-shot schema. Such a joint recognition scheme can be useful in devices such as smartphones for enhancing security as well as privacy. We propose to automatically learn a joint representation that encapsulates both biometric-specific and sensor-specific features. We evaluate the proposed approach using iris, face and periocular images acquired using near-infrared iris sensors and smartphone cameras. Experiments conducted using 14,451 images from 13 sensors resulted in a rank-1 identification accuracy of up to 99.81% and a verification accuracy of up to 100% at a false match rate of 1%.

2D License Plate Recognition based on Automatic Perspective Rectification

Hui Xu, Zhao-Hong Guo, Da-Han Wang, Xiang-Dong Zhou, Yu Shi

Auto-TLDR; Perspective Rectification Network for License Plate Recognition

License plate recognition (LPR) remains a challenging task in the face of difficulties such as image deformation and multi-line character distribution. Text rectification, which is crucial for eliminating the effects of image deformation, has attracted increasing attention in scene text recognition. However, current text rectification methods are not designed specifically for LPR and do not take the features of plate deformation into account. Considering that a license plate (LP), being rigid, can only undergo perspective distortion in the image, we propose a novel perspective rectification network (PRN) to automatically estimate the perspective transformation and rectify the distorted LP accordingly. For recognition, we propose a location-aware 2D attention based recognition network that is capable of recognizing both single-line and double-line plates with perspective deformation. The rectification network and recognition network are connected for end-to-end training. Experiments on common datasets show that the proposed method achieves state-of-the-art performance, demonstrating the effectiveness of the proposed approach.

Rotation Invariant Aerial Image Retrieval with Group Convolutional Metric Learning

Hyunseung Chung, Woo-Jeoung Nam, Seong-Whan Lee

Auto-TLDR; Robust Remote Sensing Image Retrieval Using Group Convolution with Attention Mechanism and Metric Learning

Remote sensing image retrieval (RSIR) is the process of ranking database images by their degree of similarity to the query image. As the complexity of RSIR increases due to the diversity in shooting range, angle, and location of remote sensors, there is an increasing demand for methods to address these issues and improve retrieval performance. In this work, we introduce a novel method for retrieving aerial images by merging group convolution with an attention mechanism and metric learning, resulting in robustness to rotational variations. For refinement and emphasis on important features, we apply channel attention in each group convolution stage. By utilizing the characteristics of group convolution and channel-wise attention, it is possible to acknowledge the equality among rotated but identically located images. The training procedure has two main steps: (i) training the network with the Aerial Image Dataset (AID) for classification, and (ii) fine-tuning the network with triplet loss for retrieval with the Google Earth South Korea and NWPU-RESISC45 datasets. Results show that the proposed method outperforms other state-of-the-art retrieval methods in both rotated and original environments. Furthermore, we utilize class activation maps (CAM) to visualize the distinct difference in main features between our method and the baseline, which accounts for the better adaptability in rotated environments.

Free-Form Image Inpainting Via Contrastive Attention Network

Xin Ma, Xiaoqiang Zhou, Huaibo Huang, Zhenhua Chai, Xiaolin Wei, Ran He

Auto-TLDR; Self-supervised Siamese inference for image inpainting

Most deep learning based image inpainting approaches adopt an autoencoder or one of its variants to fill missing regions in images. Encoders are usually utilized to learn powerful representational spaces, which are important for dealing with sophisticated learning tasks. Specifically, in the image inpainting task, masks of any shape can appear anywhere in images (i.e., free-form masks), forming complex patterns. It is difficult for encoders to capture such powerful representations in this complex situation. To tackle this problem, we propose a self-supervised Siamese inference network to improve robustness and generalization. Moreover, the restored image usually cannot be harmoniously integrated into the existing content, especially in the boundary area. To address this problem, we propose a novel Dual Attention Fusion module (DAF), which can combine the restored and known regions in a smoother way and be inserted into decoder layers in a plug-and-play way. DAF is developed not only to adaptively rescale channel-wise features by taking interdependencies between channels into account, but also to force deep convolutional neural networks (CNNs) to focus more on unknown regions. In this way, the unknown region is naturally filled from the outside to the inside. Qualitative and quantitative experiments on multiple datasets, including facial and natural datasets (i.e., Celeb-HQ, Paris Street View, Places2 and ImageNet), demonstrate that our proposed method outperforms state-of-the-art methods in generating high-quality inpainting results.

Documents Counterfeit Detection through a Deep Learning Approach

Darwin Danilo Saire Pilco, Salvatore Tabbone

Auto-TLDR; End-to-End Learning for Counterfeit Documents Detection using Deep Neural Network

The main topic of this work is the detection of counterfeit documents, especially banknotes. We propose an end-to-end learning model using a deep learning approach based on Adapnet++, which manages feature extraction at multiple scale levels using several residual units. Unlike previous models based on regions of interest (ROI) and high-resolution documents, our network is fed with simple input images (i.e., a single patch) and does not need high-resolution images. Besides, discriminative regions can be visualized at different scales. Our network learns by itself which regions of interest yield the best results. Experimental results show that our approach is competitive with the state of the art, and that our deep neural network generalizes well and can be applied to other kinds of documents, such as identity or administrative documents.

Attention-Based Deep Metric Learning for Near-Duplicate Video Retrieval

Kuan-Hsun Wang, Chia Chun Cheng, Yi-Ling Chen, Yale Song, Shang-Hong Lai

Auto-TLDR; Attention-based Deep Metric Learning for Near-duplicate Video Retrieval

Near-duplicate video retrieval (NDVR) is an important and challenging problem due to the increasing number of videos uploaded to the Internet. In this paper, we propose an attention-based deep metric learning method for NDVR. Our method is based on well-established principles: we leverage two-stream networks to combine RGB and optical flow features, and incorporate an attention module to effectively deal with distractor frames commonly observed in near-duplicate videos. We further aggregate the features corresponding to multiple video segments to enhance the discriminative power. The whole system is trained using a deep metric learning objective with a Siamese architecture. Our experiments show that the attention module helps eliminate redundant and noisy frames, while focusing on visually relevant frames for solving NDVR. We evaluate our approach on recent large-scale NDVR datasets: CC_WEB_VIDEO, VCDB, FIVR and SVD. To demonstrate the generalization ability of our approach, we report results in both within- and cross-dataset settings, and show that the proposed method significantly outperforms state-of-the-art approaches.

Multi-Label Contrastive Focal Loss for Pedestrian Attribute Recognition

Xiaoqiang Zheng, Zhenxia Yu, Lin Chen, Fan Zhu, Shilong Wang

Auto-TLDR; Multi-label Contrastive Focal Loss for Pedestrian Attribute Recognition

Pedestrian Attribute Recognition (PAR) has received extensive attention during the past few years. With the advances of deep convolutional neural networks (CNNs), the performance of PAR has been significantly improved. Existing methods tend to acquire attribute-specific features by designing various complex network structures with additional modules. Such additional modules, however, dramatically increase the number of parameters. Meanwhile, the problems of class imbalance and hard attribute retrieval remain underestimated in PAR. In this paper, we explore the optimization mechanism of the training process to account for these problems and propose a new loss function called Multi-label Contrastive Focal Loss (MCFL). The proposed MCFL emphasizes the hard and minority attributes by using a separate re-weighting mechanism for positive and negative classes to alleviate the impact of the imbalance. MCFL is also able to enlarge the gaps between the intra-class of multi-label attributes, forcing CNNs to extract more subtle discriminative features. We evaluate the proposed MCFL on three large public pedestrian datasets: RAP, PA-100K, and PETA. The experimental results indicate that the proposed MCFL with the ResNet-50 backbone outperforms the other state-of-the-art approaches in the comparison.
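
The abstract does not define MCFL, so the sketch below shows only the standard binary focal loss applied independently to each attribute, as background for the re-weighting idea. It is not the proposed loss. The function names and the default values of `gamma` and `alpha` are illustrative assumptions.

```python
import math


def binary_focal_loss(prob, target, gamma=2.0, alpha=0.25):
    """Standard focal loss for one binary attribute.

    prob is the predicted probability of the positive class. The factor
    (1 - p_t) ** gamma down-weights well-classified examples, and alpha
    gives positives and negatives separate weights.
    """
    p_t = prob if target == 1 else 1.0 - prob
    alpha_t = alpha if target == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # clamp to avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def multilabel_focal_loss(probs, targets, gamma=2.0, alpha=0.25):
    """Sum the per-attribute focal losses over all attributes."""
    return sum(binary_focal_loss(p, t, gamma, alpha)
               for p, t in zip(probs, targets))
```

With `gamma=0` and `alpha=0.5` each term reduces to half the usual binary cross-entropy, which makes the effect of the two re-weighting factors easy to check in isolation.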

Global Context-Based Network with Transformer for Image2latex

Nuo Pang, Chun Yang, Xiaobin Zhu, Jixuan Li, Xu-Cheng Yin

Auto-TLDR; Image2latex with Global Context block and Transformer

Image2latex usually means converting mathematical formulas in images into LaTeX markup. It is a very challenging task due to the complex two-dimensional structure, the varying scales of the input, and the very long representation sequence. Many researchers have used encoder-decoder based models to solve this task and achieved good results. However, these methods do not make full use of the structure and position information of the formula. To solve this problem, we propose a global context-based network with a transformer that can (1) learn a more powerful and robust intermediate representation by aggregating global features, (2) encode position information explicitly, and (3) learn latent dependencies between symbols by using a self-attention mechanism. The experimental results on the IM2LATEX-100K dataset demonstrate the effectiveness of our method.

Human or Machine? It Is Not What You Write, but How You Write It

Luis Leiva, Moises Diaz, M.A. Ferrer, Réjean Plamondon

Auto-TLDR; Behavioral Biometrics via Handwritten Symbols for Identification and Verification

Online fraud often involves identity theft. Since most security measures are weak or can be spoofed, we investigate a more nuanced and less explored avenue: behavioral biometrics via handwriting movements. This kind of data can be used to verify if a legitimate user is operating a device or a computer application, so it is important to distinguish between human and machine-generated movements reliably. For this purpose, we study handwritten symbols (isolated characters, digits, gestures, and signatures) produced by humans and machines, and compare and contrast several deep learning models. We find that if symbols are presented as static images, they can fool state-of-the-art classifiers (near 75% accuracy in the best case) but can be distinguished with remarkable accuracy if they are presented as temporal sequences (95% accuracy in the average case). We conclude that an accurate detection of fake movements has more to do with how users write, rather than what they write. Our work has implications for computerized systems that need to authenticate or verify legitimate human users, and provides an additional layer of security to keep attackers at bay.

Recursive Recognition of Offline Handwritten Mathematical Expressions

Marco Cotogni, Claudio Cusano, Antonino Nocera

Auto-TLDR; Offline Handwritten Mathematical Expression Recognition with Recurrent Neural Network

In this paper we propose a method for Offline Handwritten Mathematical Expression recognition. The method is fast and accurate thanks to its architecture, which includes both a Convolutional Neural Network and a Recurrent Neural Network. The CNN extracts features from the image to be recognized, and its output is provided to the RNN, which produces the mathematical expression encoded in the LaTeX language. To process both sequential and non-sequential mathematical expressions, we also include a deconvolutional module which recursively segments the image for additional analysis. The results show very high accuracy on a large data set of 9100 handwritten expressions.

IBN-STR: A Robust Text Recognizer for Irregular Text in Natural Scenes

Xiaoqian Li, Jie Liu, Shuwu Zhang

Auto-TLDR; IBN-STR: A Robust Text Recognition System Based on Data and Feature Representation

Although text recognition methods based on deep neural networks have promising performance, there are still challenges due to the variety of text styles, perspective distortion, text with large curvature, and so on. To obtain a robust text recognizer, we have improved the performance from two aspects: data aspect and feature representation aspect. In terms of data, we transform the input images into S-shape distorted images in order to increase the diversity of training data. Besides, we explore the effects of different training data. In terms of feature representation, the combination of instance normalization and batch normalization improves the model's capacity and generalization ability. This paper proposes a robust text recognizer IBN-STR, which is an attention-based model. Through extensive experiments, the model analysis and comparison have been carried out from the aspects of data and feature representation, and the effectiveness of IBN-STR on both regular and irregular text instances has been verified. Furthermore, IBN-STR is an end-to-end recognition system that can achieve state-of-the-art performance.

ReADS: A Rectified Attentional Double Supervised Network for Scene Text Recognition

Qi Song, Qianyi Jiang, Xiaolin Wei, Nan Li, Rui Zhang

Auto-TLDR; ReADS: Rectified Attentional Double Supervised Network for General Scene Text Recognition

In recent years, scene text recognition has usually been regarded as a sequence-to-sequence problem. Connectionist Temporal Classification (CTC) and attentional sequence recognition (Attn) are the two prevailing approaches to this problem, but each may fail in some scenarios. CTC concentrates more on every individual character but is weak in modeling semantic dependencies in text. Attn based methods have better context semantic modeling ability but tend to overfit on limited training data. In this paper, we design a Rectified Attentional Double Supervised Network (ReADS) for general scene text recognition. To overcome the weaknesses of CTC and Attn, both are applied in our method, with different modules in two supervised branches that complement each other. Moreover, effective spatial and channel attention mechanisms are introduced to eliminate background noise and extract valid foreground information. Finally, a simple rectification network is implemented to rectify irregular text. ReADS can be trained end-to-end, and only word-level annotations are required. Extensive experiments on various benchmarks verify the effectiveness of ReADS, which achieves state-of-the-art performance.
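
The per-character focus of CTC can be seen in its standard greedy decoding rule: take the best label for each frame independently, merge consecutive repeats, then drop blanks. The sketch below illustrates that rule on a string of per-frame labels. It is generic CTC decoding rather than anything specific to ReADS, and the function name and the `-` blank symbol are assumptions.

```python
def ctc_greedy_decode(frame_labels, blank="-"):
    """Collapse a per-frame label sequence into an output string.

    Consecutive duplicates are merged first and blanks are removed
    afterwards, so a blank between two identical labels keeps both.
    """
    decoded = []
    previous = None
    for label in frame_labels:
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return "".join(decoded)
```

Because each frame is labeled on its own, nothing in this rule models dependencies between output characters, which is the weakness the abstract attributes to CTC.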

Attentive Part-Aware Networks for Partial Person Re-Identification

Lijuan Huo, Chunfeng Song, Zhengyi Liu, Zhaoxiang Zhang

Auto-TLDR; Part-Aware Learning for Partial Person Re-identification

Partial person re-identification (re-ID) refers to re-identifying a person from occluded images. It suffers from two major challenges: insufficient training data and incomplete probe images. In this paper, we introduce an automatic data augmentation module and a part-aware learning method for partial re-identification. On the one hand, we adopt data augmentation to enhance the training data and help the network learn more stable partial features. On the other hand, we observe that partial person images usually contain fixed percentages of parts; therefore, in the partial person re-ID task, the probe image can be cropped from the pictures and divided into several partial types following fixed ratios. Based on the cropped images, we propose the Cropping Type Consistency (CTC) loss to classify the cropping types of partial images. Moreover, to help the network better fit the generated and cropped data, we incorporate the Block Attention Mechanism (BAM) into the framework for attentive learning. To enhance the retrieval performance in the inference stage, we crop gallery images according to the predicted types of the partial probe images. By calculating feature distances between the partial image and the cropped holistic gallery images, we can recognize the right person from the gallery. To validate the effectiveness of our approach, we conduct extensive experiments on the partial re-ID benchmarks and achieve state-of-the-art performance.

ConvMath: A Convolutional Sequence Network for Mathematical Expression Recognition

Zuoyu Yan, Xiaode Zhang, Liangcai Gao, Ke Yuan, Zhi Tang

Auto-TLDR; Convolutional Sequence Modeling for Mathematical Expressions Recognition

Despite the recent advances in optical character recognition (OCR), mathematical expressions remain very challenging to recognize due to their two-dimensional graphical layout. In this paper, we propose a convolutional sequence modeling network, ConvMath, which converts the mathematical expression in an image into a LaTeX sequence in an end-to-end way. The network combines an image encoder for feature extraction and a convolutional decoder for sequence generation. Compared with Long Short-Term Memory (LSTM) based encoder-decoder models, ConvMath is entirely based on convolution, so it is easy to parallelize. Besides, the network adopts a multi-layer attention mechanism in the decoder, which allows the model to align output symbols with source feature vectors automatically, and alleviates the problem of lacking coverage while training the model. The performance of ConvMath is evaluated on an open dataset named IM2LATEX-100K, which includes 103556 samples. The experimental results demonstrate that the proposed network achieves state-of-the-art accuracy and much better efficiency than previous methods.

Cross-Lingual Text Image Recognition Via Multi-Task Sequence to Sequence Learning

Zhuo Chen, Fei Yin, Xu-Yao Zhang, Qing Yang, Cheng-Lin Liu

Auto-TLDR; Cross-Lingual Text Image Recognition with Multi-task Learning

This paper considers recognizing texts shown in a source language and translating them into a target language, without generating intermediate source-language recognition results. We call this problem Cross-Lingual Text Image Recognition (CLTIR). To solve this problem, we propose a multi-task system that handles a main task of CLTIR and an auxiliary task of Mono-Lingual Text Image Recognition (MLTIR) simultaneously. Two different sequence-to-sequence learning methods, a convolution based attention model and a BLSTM model with CTC, are adopted for these tasks respectively. We evaluate the system on a newly collected Chinese-English bilingual movie subtitle image dataset. Experimental results demonstrate that the multi-task learning framework achieves superior performance in both languages.

A Multi-Head Self-Relation Network for Scene Text Recognition

Zhou Junwei, Hongchao Gao, Jiao Dai, Dongqin Liu, Jizhong Han

Auto-TLDR; Multi-head Self-relation Network for Scene Text Recognition

Text embedded in scene images can be seen everywhere in our lives. However, recognizing text in natural scene images is still a challenge because of its diverse shapes and distorted patterns. Recent recognition networks generally treat scene text recognition as a sequence prediction task. Although they achieve excellent performance, these networks treat the feature map cells as independent individuals and update each cell's state without using the information of its neighboring cells. Moreover, the local receptive field of a traditional convolutional neural network (CNN) means that a single cell cannot cover the whole text region in an image. Because of these issues, existing recognition networks cannot extract the global context of a visual scene. To deal with these problems, we propose a Multi-head Self-relation Network (MSRN) for scene text recognition. The MSRN consists of several multi-head self-relation layers, which are designed to extract the global context of a visual scene by transforming each cell into a new cell that fuses the information of its related cells. Experiments demonstrate that the proposed recognition network achieves superior performance on several public benchmark datasets, including IC03, IC13, IC15 and SVT-Perspective.

Dual-Attention Guided Dropblock Module for Weakly Supervised Object Localization

Junhui Yin, Siqing Zhang, Dongliang Chang, Zhanyu Ma, Jun Guo

Auto-TLDR; Dual-Attention Guided Dropblock for Weakly Supervised Object Localization

Attention mechanisms are frequently used to learn discriminative features for better feature representations. In this paper, we extend the attention mechanism to the task of weakly supervised object localization (WSOL) and propose the dual-attention guided dropblock module (DGDM), which aims at learning informative and complementary visual patterns for WSOL. This module contains two key components, the channel attention guided dropout (CAGD) and the spatial attention guided dropblock (SAGD). To model channel interdependencies, the CAGD ranks the channel attentions and treats the top-k attentions with the largest magnitudes as the important ones. It also keeps some low-valued elements so that their value can increase if they become important during training. The SAGD can efficiently remove the most discriminative information by erasing contiguous regions of feature maps rather than individual pixels. This guides the model to capture the less discriminative parts for classification. Furthermore, it can also distinguish the foreground objects from the background regions to alleviate attention misdirection. Experimental results demonstrate that the proposed method achieves new state-of-the-art localization performance.

Recognizing Bengali Word Images - A Zero-Shot Learning Perspective

Sukalpa Chanda, Daniël Arjen Willem Haitink, Prashant Kumar Prasad, Jochem Baas, Umapada Pal, Lambert Schomaker

Auto-TLDR; Zero-Shot Learning for Word Recognition in Bengali Script

Zero-Shot Learning (ZSL) techniques can classify a completely unseen class, that is, a class never encountered during training. This makes them well suited to real-life classification problems, where it is not possible to train a system with annotated data for all possible class types. This work investigates recognition of word images written in Bengali script in a ZSL framework. The proposed approach performs zero-shot word recognition by coupling deep learned features obtained from the VGG16 architecture with 13 basic shapes/stroke primitives commonly observed in Bengali script characters. Following the notion of the ZSL framework, those 13 basic shapes are termed “Signature Attributes”. The obtained results are promising; evaluation was carried out in a five-fold cross-validation setup with samples from 250 word classes.

Local Attention and Global Representation Collaborating for Fine-Grained Classification

He Zhang, Yunming Bai, Hui Zhang, Jing Liu, Xingguang Li, Zhaofeng He

Auto-TLDR; Weighted Region Network for Cosmetic Contact Lenses Detection

Cosmetic contact lenses worn over an iris may change its original textural pattern, which is the foundation for iris recognition, making cosmetic lenses a possible and easy-to-use means of iris presentation attack. Some approaches have been proposed for cosmetic contact lens detection in practical application systems, but they still face unsolved problems, such as low-quality iris images and inaccurately localized iris boundaries. In this paper, we propose a novel framework called Weighted Region Network (WRN) for cosmetic contact lens detection. The WRN includes both the local attention Weight Network and the global classification Region Network. With the inherent attention mechanism, the proposed network is able to find the most discriminative regions, which reduces the requirement for target detection and improves the ability of classification based on some specific areas and patterns. The Weight Network can be trained using Rank loss and MSE loss without manual annotation of discriminative regions. Experiments are conducted on several databases and a newly collected low-quality iris image database. The proposed method outperforms state-of-the-art fake iris detection algorithms and is also effective for the fine-grained image classification task.

Generalized Local Attention Pooling for Deep Metric Learning

Carlos Roig Mari, David Varas, Issey Masuda, Juan Carlos Riveiro, Elisenda Bou-Balust

Auto-TLDR; Generalized Local Attention Pooling for Deep Metric Learning

Deep metric learning has been key to recent advances in face verification and image retrieval, among others. These systems consist of a feature extraction block (extracts feature maps from images) followed by a spatial dimensionality reduction block (generates compact image representations from the feature maps) and an embedding generation module (projects the image representation to the embedding space). While research on deep metric learning has focused on improving the losses for the embedding generation module, the dimensionality reduction block has been overlooked. In this work, we propose a novel method to generate compact image representations which uses local spatial information through an attention mechanism, named Generalized Local Attention Pooling (GLAP). This method, instead of being placed at the end layer of the backbone, is connected at an intermediate level, resulting in lower memory requirements. We assess the performance of the method by comparing it with multiple dimensionality reduction techniques, demonstrating the importance of using attention weights to generate robust compact image representations. Moreover, we compare the performance of multiple state-of-the-art losses using the standard deep metric learning system against the same experiment with our GLAP. Experiments show that the proposed Generalized Local Attention Pooling mechanism outperforms other pooling methods when combined with current state-of-the-art losses for deep metric learning.

3D Facial Matching by Spiral Convolutional Metric Learning and a Biometric Fusion-Net of Demographic Properties

Soha Sadat Mahdi, Nele Nauwelaers, Philip Joris, Giorgos Bouritsas, Imperial London, Sergiy Bokhnyak, Susan Walsh, Mark Shriver, Michael Bronstein, Peter Claes

Auto-TLDR; Multi-biometric Fusion for Biometric Verification using 3D Facial Measures

Face recognition is a widely accepted biometric verification tool, as the face contains a lot of information about the identity of a person. In this study, a 2-step neural-based pipeline is presented for matching 3D facial shape to multiple DNA-related properties (sex, age, BMI and genomic background). The first step consists of a triplet loss-based metric learner that compresses facial shape into a lower dimensional embedding while preserving information about the property of interest. Most studies in the field of metric learning have only focused on Euclidean data. In this work, geometric deep learning is employed to learn directly from 3D facial meshes. To this end, spiral convolutions are used along with a novel mesh-sampling scheme that retains uniformly sampled 3D points at different levels of resolution. The second step is a multi-biometric fusion by a fully connected neural network. The network takes an ensemble of embeddings and property labels as input and returns genuine and imposter scores. Since embeddings are accepted as an input, there is no need to train classifiers for the different properties and available data can be used more efficiently. Results obtained by a 10-fold cross-validation for biometric verification show that combining multiple properties leads to stronger biometric systems. Furthermore, the proposed neural-based pipeline outperforms a linear baseline, which consists of principal component analysis, followed by classification with linear support vector machines and a Naïve Bayes-based score-fuser.

Stroke Based Posterior Attention for Online Handwritten Mathematical Expression Recognition

Changjie Wu, Qing Wang, Jianshu Zhang, Jun Du, Jiaming Wang, Jiajia Wu, Jin-Shui Hu

Auto-TLDR; Posterior Attention for Online Handwritten Mathematical Expression Recognition

Recently, many studies have proposed attention-based encoder-decoder models that convert a sequence of trajectory points into a LaTeX string for online handwritten mathematical expression recognition (OHMER), and the recognition performance of these models relies critically on the accuracy of the attention. In this paper, unlike previous methods, which mostly employ a soft attention model, we propose to employ a posterior attention model, which modifies the attention probabilities after observing the output probabilities generated by the soft attention model. To further improve the posterior attention mechanism, we propose a stroke average pooling layer that aggregates point-level features obtained from the encoder into stroke-level features. We argue that posterior attention is better applied to stroke-level features than to point-level features, because the output probabilities generated from strokes are more reliable than those generated from points, and we support this through experimental analysis. Validated on the CROHME competition task, stroke-based posterior attention achieves expression recognition rates of 54.26% on CROHME 2014 and 51.75% on CROHME 2016. Through attention visualization analysis, we empirically demonstrate that the posterior attention mechanism can achieve better alignment accuracy than the soft attention mechanism.
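As an illustration of the stroke average pooling idea mentioned in this abstract, the following is a minimal sketch that averages point-level feature vectors belonging to the same stroke. The function name `stroke_average_pool`, the list-based feature format, and the per-point `stroke_ids` input are assumptions made for the example; the abstract does not specify the authors' implementation.

```python
from typing import Dict, List, Sequence


def stroke_average_pool(point_features: Sequence[Sequence[float]],
                        stroke_ids: Sequence[int]) -> List[List[float]]:
    """Average point-level feature vectors that share a stroke id.

    Hypothetical helper: strokes are returned in order of first appearance.
    """
    if len(point_features) != len(stroke_ids):
        raise ValueError("one stroke id is required per point")
    sums: Dict[int, List[float]] = {}   # stroke id -> running sum of features
    counts: Dict[int, int] = {}         # stroke id -> number of points seen
    for feat, sid in zip(point_features, stroke_ids):
        if sid not in sums:
            sums[sid] = [0.0] * len(feat)
            counts[sid] = 0
        sums[sid] = [s + f for s, f in zip(sums[sid], feat)]
        counts[sid] += 1
    # Divide each accumulated sum by the number of points in that stroke.
    return [[s / counts[sid] for s in total] for sid, total in sums.items()]


# Three trajectory points: the first two belong to stroke 0, the last to stroke 1.
points = [[1.0, 2.0], [3.0, 4.0], [10.0, 0.0]]
print(stroke_average_pool(points, [0, 0, 1]))  # [[2.0, 3.0], [10.0, 0.0]]
```

The pooled sequence is shorter than the point sequence, so an attention model that operates on it has one candidate per stroke rather than one per point.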

Self and Channel Attention Network for Person Re-Identification

Asad Munir, Niki Martinel, Christian Micheloni

Auto-TLDR; SCAN: Self and Channel Attention Network for Person Re-identification

Recent research has shown promising results for person re-identification by focusing on several trends. One is designing efficient metric learning loss functions such as triplet loss family to learn the most discriminative representations. The other is learning local features by designing part based architectures to form an informative descriptor from semantically coherent parts. Some efforts adjust distant outliers to their most similar positions by using soft attention and learn the relationship between distant similar features. However, only a few prior efforts focus on channel-wise dependencies and learn non-local sharp similar part features directly for the degraded data in the person re-identification task. In this paper, we propose a novel Self and Channel Attention Network (SCAN) to model long-range dependencies between channels and feature maps. We add multiple classifiers to learn discriminative global features by using classification loss. Self Attention (SA) module and Channel Attention (CA) module are introduced to model non-local and channel-wise dependencies in the learned features. Spectral normalization is applied to the whole network to stabilize the training process. Experimental results on the person re-identification benchmarks show the proposed components achieve significant improvement with respect to the baseline.

Handwritten Signature and Text Based User Verification Using Smartwatch

Raghavendra Ramachandra, Sushma Venkatesh, Raja Kiran, Christoph Busch

Auto-TLDR; A novel technique for user verification using a smartwatch based on writing pattern or signing pattern

Wrist-wearable devices such as smartwatches have gained popularity because they provide quick access to information and easy access to multiple applications. Among the various applications of the smartwatch, user verification based on handwriting has recently been investigated. In this paper, we present a novel technique for user verification using a smartwatch based on writing pattern or signing pattern. The proposed technique leverages accelerometer data captured from the smartwatch, which are represented using the 2D Continuous Wavelet Transform (CWT), with deep features extracted using a pre-trained ResNet50. The comparison is performed using an ensemble of classifiers. Extensive experiments are carried out on a newly captured dataset using two different smartwatches with three different writing scenarios (or activities). The article provides key insights and analysis of the results in such a verification scenario.

TAAN: Task-Aware Attention Network for Few-Shot Classification

Zhe Wang, Li Liu, Fanzhang Li

Auto-TLDR; TAAN: Task-Aware Attention Network for Few-Shot Classification

Few-shot classification aims to recognize unlabeled samples from unseen classes given only a few labeled samples. Current approaches to few-shot learning usually employ a metric-learning framework to learn a feature similarity comparison between a query (test) example and the few support (training) examples. However, these approaches all extract features from samples independently without looking at the task as a whole, and so fail to make the features sufficiently discriminative. Moreover, the existing approaches lack the ability to select the most relevant features for the task at hand. In this work, we propose a novel algorithm called Task-Aware Attention Network (TAAN) to address the above problems in few-shot classification. By inserting a Task-Relevant Channel Attention Module into metric-based few-shot learners, TAAN generates channel attentions for each sample by aggregating the context of the entire support set and identifies the most relevant features for similarity comparison. Experiments demonstrate that TAAN is competitive in overall performance with recent state-of-the-art systems and improves performance considerably over baseline systems on both the mini-ImageNet and tiered-ImageNet benchmarks.

A Transformer-Based Radical Analysis Network for Chinese Character Recognition

Chen Yang, Qing Wang, Jun Du, Jianshu Zhang, Changjie Wu, Jiaming Wang

Auto-TLDR; Transformer-based Radical Analysis Network for Chinese Character Recognition

Recently, a novel radical analysis network (RAN) has shown the capability of effectively recognizing unseen Chinese character classes and largely reducing the requirement for training data by treating a Chinese character as a hierarchical composition of radicals rather than a single character class. However, when dealing with more challenging issues, such as the recognition of complicated characters, low-frequency character categories, and characters in natural scenes, RAN still has a lot of room for improvement. In this paper, we explore options to further improve the structure generalization and robustness of RAN with the Transformer architecture, which has achieved state-of-the-art results for many sequence-to-sequence tasks. More specifically, we propose to replace the original attention module in RAN with the transformer decoder; the resulting model is named the transformer-based radical analysis network (RTN). The experimental results show that the proposed approach can significantly outperform RAN on both a printed Chinese character database and a natural scene Chinese character database. Further analysis shows that RTN generalizes better to complex samples and low-frequency characters, and is more robust in recognizing Chinese characters with different attributes.

Face Anti-Spoofing Using Spatial Pyramid Pooling

Lei Shi, Zhuo Zhou, Zhenhua Guo

Auto-TLDR; Spatial Pyramid Pooling for Face Anti-Spoofing

Face recognition systems are vulnerable to many kinds of presentation attacks, so effectively detecting whether an image comes from a real face is particularly important. Many deep learning-based anti-spoofing methods have been proposed, but these approaches have some limitations: global average pooling (GAP) easily loses local information of faces, single-scale features easily ignore information differences across scales, and a complex network is prone to overfitting. In this paper, we propose a face anti-spoofing approach using spatial pyramid pooling (SPP). First, we use ResNet-18, which has a small number of parameters, as the basic model to avoid overfitting. Further, we use a spatial pyramid pooling module in the single model to enhance local features while fusing multi-scale information. The effectiveness of the proposed method is evaluated on three databases, CASIA-FASD, Replay-Attack and CASIA-SURF. The experimental results show that the proposed approach can achieve state-of-the-art performance.
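To make the contrast with global average pooling concrete, here is a minimal sketch of generic spatial pyramid pooling over a single-channel feature map. The function name `spatial_pyramid_pool`, the choice of max pooling within each bin, and the default pyramid levels are assumptions for illustration; the abstract does not state the configuration the authors use.

```python
import math
from typing import List, Sequence


def spatial_pyramid_pool(fmap: Sequence[Sequence[float]],
                         levels: Sequence[int] = (1, 2)) -> List[float]:
    """Max-pool a 2D map over an n x n grid for each pyramid level n.

    The output length is sum(n * n for n in levels), whatever the input size.
    """
    h, w = len(fmap), len(fmap[0])
    pooled: List[float] = []
    for n in levels:
        for i in range(n):
            r0 = (i * h) // n
            r1 = max(math.ceil((i + 1) * h / n), r0 + 1)  # at least one row
            for j in range(n):
                c0 = (j * w) // n
                c1 = max(math.ceil((j + 1) * w / n), c0 + 1)  # at least one column
                pooled.append(max(fmap[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled


fmap = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
# Level 1 gives the global maximum; level 2 keeps one value per quadrant.
print(spatial_pyramid_pool(fmap))  # [16, 6, 8, 14, 16]
```

The coarse level plays the role of a global summary, while the finer levels retain where in the map the strong responses occur, which is the local information that a single global pooling step discards.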

Recurrent Deep Attention Network for Person Re-Identification

Changhao Wang, Jun Zhou, Xianfei Duan, Guanwen Zhang, Wei Zhou

Auto-TLDR; Recurrent Deep Attention Network for Person Re-identification

Person re-identification (re-id) is an important task in video surveillance. It is challenging because a person's appearance varies widely across non-overlapping camera views. In recent years, attention-based models have been introduced to learn discriminative representations. In this paper, we treat attention selection for person re-id in a natural way, much as a human moves attention across different parts of the visual field. Concretely, we propose a Recurrent Deep Attention Network (RDAN) with an attention selection mechanism based on reinforcement learning. The RDAN aims to adaptively observe the identity-sensitive regions to build up the representation of individuals step by step. Extensive experiments on three person re-id benchmarks, Market-1501, DukeMTMC-reID and CUHK03-NP, demonstrate that the proposed method can achieve competitive performance.

SAT-Net: Self-Attention and Temporal Fusion for Facial Action Unit Detection

Zhihua Li, Zheng Zhang, Lijun Yin

Auto-TLDR; Temporal Fusion and Self-Attention Network for Facial Action Unit Detection

Research on facial action unit (AU) detection has shown remarkable performance in recent years by using deep spatial learning models. However, it is far from reaching its full learning capacity because temporal information about AUs is not used. Since the AU occurrence in one frame is very likely related to previous frames in a temporal sequence, exploring the temporal correlation of AUs across frames is a key motivation of this work. In this paper, we propose a novel temporal fusion and AU-supervised self-attention network (called SAT-Net) to address the AU detection problem. First, we input the deep features of a sequence into a convolutional LSTM network, fuse the previous temporal information into the feature map of the last frame, and continue to learn the AU occurrence. Second, considering that AU detection is a multi-label classification problem in which each label depends only on certain facial areas, we propose a new self-learned attention mask that focuses the detection of each AU on parts of the face by learning an individual attention mask for each AU, thus increasing AU independence without losing any spatial relations. Our extensive experiments show that the proposed framework achieves better AU detection results than the state of the art on two benchmark databases (BP4D and DISFA).

FeatureNMS: Non-Maximum Suppression by Learning Feature Embeddings

Niels Ole Salscheider

Auto-TLDR; FeatureNMS: Non-Maximum Suppression for Multiple Object Detection

Most state of the art object detectors output multiple detections per object. The duplicates are removed in a post-processing step called Non-Maximum Suppression. Classical Non-Maximum Suppression has shortcomings in scenes that contain objects with high overlap: The idea of this heuristic is that a high bounding box overlap corresponds to a high probability of having a duplicate. We propose FeatureNMS to solve this problem. FeatureNMS recognizes duplicates not only based on the intersection over union between bounding boxes, but also based on the difference of feature vectors. These feature vectors can encode more information like visual appearance. Our approach outperforms classical NMS and derived approaches and achieves state of the art performance.
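The idea of deciding duplicates from both box overlap and feature difference can be sketched as follows. This is one plausible reading of the abstract, not the authors' algorithm: the function names, the two IoU thresholds (`iou_low`, `iou_high`) and the embedding distance threshold (`max_feat_dist`) are all assumptions chosen for the example.

```python
import math
from typing import List, Sequence


def iou(a: Sequence[float], b: Sequence[float]) -> float:
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def feature_aware_nms(boxes, scores, features,
                      iou_low: float = 0.1, iou_high: float = 0.9,
                      max_feat_dist: float = 1.0) -> List[int]:
    """Return indices of kept detections, highest score first.

    Assumed rule: very low overlap never suppresses, very high overlap always
    suppresses, and in between a detection is a duplicate only if its feature
    vector is close to that of an already kept detection.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        duplicate = False
        for k in keep:
            overlap = iou(boxes[i], boxes[k])
            if overlap <= iou_low:
                continue
            if (overlap >= iou_high
                    or math.dist(features[i], features[k]) < max_feat_dist):
                duplicate = True
                break
        if not duplicate:
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (2, 0, 12, 10), (1, 0, 11, 10)]
scores = [0.9, 0.8, 0.7]
features = [(0.0, 0.0), (5.0, 0.0), (0.1, 0.0)]
# Box 1 overlaps box 0 heavily but has a distant feature, so it is kept;
# box 2 overlaps box 0 and has a nearby feature, so it is suppressed.
print(feature_aware_nms(boxes, scores, features))  # [0, 1]
```

With a classical IoU-only rule at a threshold of 0.5, box 1 would have been removed as well, which is the failure case for heavily overlapping objects that the abstract describes.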

LODENet: A Holistic Approach to Offline Handwritten Chinese and Japanese Text Line Recognition

Huu Tin Hoang, Chun-Jen Peng, Hung Tran, Hung Le, Huy Hoang Nguyen

Auto-TLDR; Logographic DEComposition Encoding for Chinese and Japanese Text Line Recognition

One of the biggest obstacles in Chinese and Japanese text line recognition is how to represent their enormous character sets. The most common solution is to choose only a small subset of characters and represent them using one-hot encoding. However, such an approach is costly for describing huge character sets and ignores their semantic relationships. Recent studies have attempted to use different encoding methods, but they struggle to build a bijective mapping. In this work, we propose a novel encoding method, called LOgographic DEComposition encoding (LODEC), that can efficiently perform a 1-to-1 mapping for all Chinese and Japanese characters with a strong awareness of semantic relationships. As such, LODEC can encode over 21,000 Chinese and Japanese characters using only 520 fundamental elements. Moreover, to handle the vast variety of handwritten texts in the two languages, we propose a novel deep learning (DL) architecture, called LODENet, together with an end-to-end training scheme, that leverages auxiliary data generated by LODEC or other radical-based encoding methods. We performed systematic experiments on both Chinese and Japanese datasets and found that our approach surpassed the performance of state-of-the-art baselines. Furthermore, empirical evidence shows that our method can achieve significantly better results using synthesized text line images without the need for domain knowledge.

Super-Resolution Guided Pore Detection for Fingerprint Recognition

Syeda Nyma Ferdous, Ali Dabouei, Jeremy Dawson, Nasser M. Nasarabadi

Auto-TLDR; Super-Resolution Generative Adversarial Network for Fingerprint Recognition Using Pore Features

The performance of fingerprint recognition algorithms relies substantially on fine features extracted from fingerprints. Apart from minutiae and ridge patterns, pore features have proven to be usable for fingerprint recognition. Although features from minutiae and ridge patterns are quite attainable from low-resolution images, using pore features is practical only if the fingerprint image is of high resolution, which necessitates a model that enhances the image quality of conventional 500 ppi legacy fingerprints while preserving the fine details. To recover pore information from low-resolution fingerprints, we adopt a joint learning-based approach that combines super-resolution and pore detection networks. Our modified single-image Super-Resolution Generative Adversarial Network (SRGAN) framework helps to reliably reconstruct high-resolution fingerprint samples from low-resolution ones, assisting the pore detection network in identifying pores with high accuracy. The network jointly learns a distinctive feature representation from a real low-resolution fingerprint sample and successfully synthesizes a high-resolution sample from it. To add discriminative information and uniqueness for all the subjects, we have integrated features extracted from a deep fingerprint verifier with the SRGAN quality discriminator. We also add a ridge reconstruction loss, utilizing ridge patterns to make the best use of extracted features. Our proposed method addresses the recognition problem by improving the quality of fingerprint images. The high recognition accuracy of the synthesized samples, which is close to the accuracy achieved using the original high-resolution images, validates the effectiveness of our proposed model.

A Fast and Accurate Object Detector for Handwritten Digit String Recognition

Jun Guo, Wenjing Wei, Yifeng Ma, Cong Peng

Auto-TLDR; ChipNet: An anchor-free object detector for handwritten digit string recognition

Focusing on handwritten digit string recognition (HDSR), we propose an anchor-free object detector called ChipNet, for which a novel encoding method is designed. The input image is divided into columns, and these columns are then encoded according to the ground truth. Adjacent columns are responsible for detecting the same target, which addresses the class-imbalance problem while reducing the network computation. ChipNet is composed of convolutional and bidirectional long short-term memory networks. Unlike typical detectors, it does not use region proposals, anchors or region-of-interest pooling. Hence, it can overcome the shortcomings of anchor-based and dense detectors in HDSR. The experiments are conducted on synthetic digit strings, the CVL HDS database, and the ORAND-CAR-A & B databases. ChipNet achieves high accuracies that surpass the reported results by a large margin (up to 6.62%). Furthermore, it runs at 219 FPS on 160×32 px images when using a Tesla P100 GPU. The results also show that ChipNet can handle touching, connected and arbitrary-length digit strings, and the obtained accuracies in HDSR are as high as those in single handwritten digit recognition.

Automated Whiteboard Lecture Video Summarization by Content Region Detection and Representation

Bhargava Urala Kota, Alexander Stone, Kenny Davila, Srirangaraj Setlur, Venu Govindaraju

Auto-TLDR; A Framework for Summarizing Whiteboard Lecture Videos Using Feature Representations of Handwritten Content Regions

Lecture videos are rapidly becoming an invaluable source of information for students across the globe. Given the large number of online courses currently available, it is important to condense the information within these videos into a compact yet representative summary that can be used for search-based applications. We propose a framework to summarize whiteboard lecture videos by finding feature representations of detected handwritten content regions to determine unique content. We investigate multi-scale histogram of gradients and embeddings from deep metric learning for feature representation. We explicitly handle occluded, growing and disappearing handwritten content. Our method is capable of producing two kinds of lecture video summaries: the unique regions themselves (so-called key content) and keyframes (which contain all unique content in a video segment). We use weighted spatio-temporal conflict minimization to segment the lecture and produce keyframes from detected regions and features. We evaluate both types of summaries and find that we obtain state-of-the-art performance in terms of the number of summary keyframes, while our unique content recall and precision are comparable to the state of the art.

Sample-Aware Data Augmentor for Scene Text Recognition

Guanghao Meng, Tao Dai, Shudeng Wu, Bin Chen, Jian Lu, Yong Jiang, Shutao Xia

Auto-TLDR; Sample-Aware Data Augmentation for Scene Text Recognition

Deep neural networks (DNNs) have been widely used in scene text recognition and have achieved remarkable performance. Such DNN-based scene text recognizers usually require plenty of training data, but data collection and annotation are usually costly in practice. To alleviate this issue, data augmentation is often applied when training scene text recognizers. However, existing data augmentation methods, including affine transformation and elastic transformation methods, suffer from the problems of under- and over-diversity because of the complexity of text contents and shapes. In this paper, we propose a sample-aware data augmentor that transforms samples adaptively based on their contents. Specifically, our data augmentor consists of three parts: a gated module, an affine transformation module, and an elastic transformation module. In our data augmentor, the affine transformation module focuses on keeping the affinity of samples, while the elastic transformation module aims to improve the diversity of samples. With the gated module, our data augmentor determines the transformation type adaptively based on the properties of training samples and the recognizer capability during the training process. In addition, our framework introduces an adversarial learning strategy to optimize the augmentor and the recognizer jointly. Extensive experiments on scene text recognition benchmarks show that our sample-aware data augmentor significantly improves the performance of state-of-the-art scene text recognizers.

Siamese Dynamic Mask Estimation Network for Fast Video Object Segmentation

Dexiang Hong, Guorong Li, Kai Xu, Li Su, Qingming Huang

Auto-TLDR; Siamese Dynamic Mask Estimation for Video Object Segmentation

Video object segmentation (VOS) has been a fundamental topic in recent years, and many deep learning-based methods have achieved state-of-the-art performance on multiple benchmarks. However, most of these methods rely on pixel-level matching between the template and the searched frames over the whole image, while the targets occupy only a small region. Computing over the entire image incurs considerable additional cost. Besides, the whole image may contain distracting information, resulting in many false-positive matching points. To address these issues, motivated by one-stage instance segmentation methods, we propose an efficient Siamese dynamic mask estimation network for fast video object segmentation. VOS is decoupled into two tasks, i.e., mask feature learning and dynamic kernel prediction. The former is responsible for learning high-quality features that preserve structural geometric information, and the latter learns a dynamic kernel that is convolved with the mask feature to generate a mask output. We use a Siamese neural network as a feature extractor and directly predict masks after correlation. In this way, we avoid pixel-level matching, making our framework simpler and more efficient. Experimental results on the DAVIS 2016/2017 datasets show that our proposed method can run at 35 frames per second on an NVIDIA RTX TITAN while preserving competitive accuracy.
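The step in which a predicted kernel is convolved with the mask feature can be illustrated with a minimal sketch. It assumes a 1×1 kernel (one weight per feature channel), a sigmoid output and a 0.5 threshold; the function name `apply_dynamic_kernel` and all of these choices are hypothetical, since the abstract does not give the kernel size or the network details.

```python
import math
from typing import List, Sequence


def apply_dynamic_kernel(mask_features: Sequence[Sequence[Sequence[float]]],
                         kernel: Sequence[float],
                         bias: float = 0.0,
                         threshold: float = 0.5) -> List[List[int]]:
    """Apply a per-target 1x1 kernel to an H x W x C feature map.

    Each output pixel is the sigmoid of the dot product between the kernel
    and the feature vector at that pixel, binarized at `threshold`.
    """
    mask: List[List[int]] = []
    for row in mask_features:
        out_row: List[int] = []
        for feat in row:
            logit = sum(w * f for w, f in zip(kernel, feat)) + bias
            prob = 1.0 / (1.0 + math.exp(-logit))
            out_row.append(1 if prob > threshold else 0)
        mask.append(out_row)
    return mask


# A 2 x 2 map with two channels; in the described setting the kernel would
# be predicted per target rather than hand-written as it is here.
features = [[[1.0, 0.0], [0.0, 1.0]],
            [[2.0, -1.0], [-1.0, 0.0]]]
print(apply_dynamic_kernel(features, kernel=[1.0, -1.0]))  # [[1, 0], [1, 0]]
```

Because the kernel changes with the target while the mask features are shared, producing a mask costs one convolution over the feature map rather than a pixel-by-pixel matching against the template.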

UHRSNet: A Semantic Segmentation Network Specifically for Ultra-High-Resolution Images

Lianlei Shan, Weiqiang Wang

Auto-TLDR; Ultra-High-Resolution Segmentation with Local and Global Feature Fusion

Semantic segmentation is a basic task in computer vision, but only limited attention has been devoted to ultra-high-resolution (UHR) image segmentation. Since UHR images occupy too much memory, they cannot be loaded directly onto the GPU for training. Previous methods either crop images into small patches or downsample the whole image. Cropping and downsampling cause the loss of context and detail, both of which are essential for segmentation accuracy. To solve this problem, we improve and simplify the local and global feature fusion method of previous works. Local features are extracted from patches, and global features are extracted from downsampled images. In addition, we propose a new fusion, called local feature fusion, which lets each patch obtain information from its surrounding patches. We call the network with these two fusions the ultra-high-resolution segmentation network (UHRSNet). The two fusions effectively and efficiently address the problems caused by cropping and downsampling. Experiments show a remarkable improvement on the DeepGlobe dataset.

VTT: Long-Term Visual Tracking with Transformers

Tianling Bian, Yang Hua, Tao Song, Zhengui Xue, Ruhui Ma, Neil Robertson, Haibing Guan

Auto-TLDR; Visual Tracking Transformer with transformers for long-term visual tracking

Long-term visual tracking is a challenging problem. State-of-the-art long-term trackers, e.g., GlobalTrack, use region proposal networks (RPNs) to generate target proposals. However, the performance of these trackers is affected by occlusions and by large variations in scale or aspect ratio. To address these issues, we are the first to propose an architecture with transformers for long-term visual tracking. Specifically, the proposed Visual Tracking Transformer (VTT) uses a transformer encoder-decoder architecture to aggregate global information and thereby handle occlusion and large variations in scale or aspect ratio. Furthermore, it shows better discriminative power against instance-level distractors without the need for extra labeling or hard-sample mining. We conduct extensive experiments on the three largest long-term tracking datasets and achieve state-of-the-art performance.

A Few-Shot Learning Approach for Historical Ciphered Manuscript Recognition

Mohamed Ali Souibgui, Alicia Fornés, Yousri Kessentini, Crina Tudor

Auto-TLDR; Handwritten Ciphers Recognition Using Few-Shot Object Detection

Encoded (or ciphered) manuscripts are a special type of historical document that contains encrypted text. The automatic recognition of this kind of document is challenging because: 1) the cipher alphabet changes from one document to another, 2) there is a lack of annotated corpora for training, and 3) touching symbols make symbol segmentation difficult and complex. To overcome these difficulties, we propose a novel method for handwritten cipher recognition based on few-shot object detection. Our method first detects all symbols of a given alphabet in a line image, and then a decoding step maps the symbol similarity scores to the final sequence of transcribed symbols. By training on synthetic data, we show that the proposed architecture is able to recognize handwritten ciphers with unseen alphabets. In addition, if a few labeled pages with the same alphabet are used for fine-tuning, our method surpasses existing unsupervised and supervised HTR methods for cipher recognition.

Deep Gait Relative Attribute Using a Signed Quadratic Contrastive Loss

Yuta Hayashi, Shehata Allam, Yasushi Makihara, Daigo Muramatsu, Yasushi Yagi

Auto-TLDR; Signed Quadratic Contrastive Loss for Gait Attribute Estimation

This paper presents a deep learning-based method to estimate gait attributes (e.g., stately, cool, relaxed). As in existing studies on relative attributes, human perception-based annotations of the gait attributes are given to pairs of gait videos (i.e., the first one is better, tie, or the second one is better), and these relative annotations are used to train a ranking model of the gait attribute. More specifically, we design a Siamese (i.e., two-stream) network that takes a pair of gait inputs and outputs a gait attribute score for each. We then introduce a suitable loss function, called a signed quadratic contrastive loss, to train the network parameters with the relative annotations. Unlike existing loss functions for learning to rank, which do not inherit the nice properties of a quadratic contrastive loss, the proposed signed quadratic contrastive loss inherits them. The quantitative evaluation results show that the proposed method achieves better or comparable accuracy in relative attribute prediction compared with the baseline methods.
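The abstract does not give the formula for the signed quadratic contrastive loss, so the following is only one plausible form, written under explicit assumptions: labels are encoded as `y = +1` (first is better), `y = -1` (second is better), and `y = 0` (tie); ties are pulled together with a squared score difference; ordered pairs are pushed apart in the labeled direction with a squared hinge and a hypothetical `margin`. The function name and the margin value are illustrative, not taken from the paper.

```python
import numpy as np

def signed_quadratic_contrastive_loss(s1, s2, y, margin=1.0):
    """Illustrative pairwise ranking loss with three-way relative labels.

    s1, s2: attribute scores from the two Siamese streams, shape (N,).
    y:      +1 if the first sample is better, -1 if the second is, 0 for a tie.
    margin: assumed separation required between ordered pairs.
    """
    s1, s2, y = np.asarray(s1, float), np.asarray(s2, float), np.asarray(y)
    d = s1 - s2
    tie_term = d ** 2                                   # ties: scores should match
    ranked_term = np.maximum(0.0, margin - y * d) ** 2  # ordered: signed gap >= margin
    return float(np.where(y == 0, tie_term, ranked_term).mean())

# Pairs that already satisfy their labels incur no loss.
loss = signed_quadratic_contrastive_loss([2.0, 0.0, 1.0], [0.0, 2.0, 1.0], [1, -1, 0])
print(loss)
```

The sign of `y` determines the direction in which the score difference is penalized, which is the sense in which this sketch is "signed"; the paper's actual definition may differ.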

DFH-GAN: A Deep Face Hashing with Generative Adversarial Network

Bo Xiao, Lanxiang Zhou, Yifei Wang, Qiangfang Xu

Auto-TLDR; Deep Face Hashing with GAN for Face Image Retrieval

Face image retrieval is one of the key research directions in computer vision. Thanks to the rapid development of deep neural networks in recent years, deep hashing has achieved good performance in image retrieval. For large-scale face image retrieval, however, performance still needs to be improved. In this paper, we propose Deep Face Hashing with GAN (DFH-GAN), a novel deep hashing method for face image retrieval that consists of three main components: a generator network for generating synthesized images, a discriminator network with a shared CNN to learn multi-domain face features, and a hash encoding network to generate compact binary hash codes. The generator network performs data augmentation so that the model can learn from both real images and diverse synthesized images. We adopt a two-stage training strategy. In the first stage, the GAN is trained to generate fake images. In the second stage, to make the network converge faster, the model inherits the trained shared CNN of the discriminator and trains the DFH model with several supervised loss functions applied not only to the last layer but also to the middle layers of the network. Extensive experiments on two widely used datasets demonstrate that DFH-GAN generates high-quality binary hash codes and greatly exceeds the performance of the state-of-the-art model.
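How compact binary hash codes are used for retrieval can be sketched in a few lines. This is a generic illustration of hashing-based retrieval, not the DFH-GAN pipeline: the thresholding at zero, the function names `to_hash_codes` and `hamming_rank`, and the toy feature vectors are assumptions. Real-valued outputs of a hash layer are binarized, and database items are ranked by Hamming distance to the query code.

```python
import numpy as np

def to_hash_codes(features):
    """Binarize real-valued hash-layer outputs by thresholding at zero (assumed)."""
    return (np.asarray(features) > 0).astype(np.uint8)

def hamming_rank(query_code, db_codes):
    """Rank database codes by Hamming distance to the query code.

    Returns the database indices from nearest to farthest and the distances
    in original database order.
    """
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    return np.argsort(dists, kind="stable"), dists

# Toy 4-bit example with made-up feature values.
db_codes = to_hash_codes([[1.0, -1.0, 1.0, -1.0],
                          [1.0, 1.0, 1.0, 1.0],
                          [-1.0, -1.0, -1.0, -1.0]])
query_code = to_hash_codes([1.0, -1.0, 1.0, 1.0])
order, dists = hamming_rank(query_code, db_codes)
print(order, dists)
```

Comparing short binary codes this way is what makes hashing attractive for large-scale retrieval: the distance computation reduces to counting differing bits.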