Predicting Chemical Properties Using Self-Attention Multi-Task Learning Based on SMILES Representation

Sangrak Lim, Yong Oh Lee

Auto-TLDR; Self-attention based Transformer-Variant Model for Chemical Compound Properties Prediction

In the computational prediction of chemical compound properties, molecular descriptors and fingerprints encoded as low-dimensional vectors are used. Selecting proper molecular descriptors and fingerprints is both important and challenging, as the performance of such models depends heavily on the descriptors. To overcome this challenge, natural language processing models that take the simplified molecular-input line-entry system (SMILES) as input have been studied, and several transformer-variant models achieved superior results compared with conventional methods. In this study, we explored the structural differences of transformer-variant models and proposed a new self-attention based model. The representation learning performance of the self-attention module was evaluated in a multi-task learning environment using imbalanced chemical datasets. The experimental results showed that our model achieved competitive outcomes on several benchmark datasets. The source code of our experiment is available at https://github.com/arwhirang/sa-mtl and the dataset is available from the same URL.
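To make the described setup concrete, below is a minimal sketch (in PyTorch, assuming a recent version with batch_first Transformer layers) of a shared self-attention encoder over SMILES character tokens with one binary head per property task; the vocabulary size, model dimensions, and task count are illustrative assumptions, not the configuration used in the paper or the linked repository.

```python
import torch
import torch.nn as nn

class SelfAttentionMTL(nn.Module):
    """Shared self-attention encoder over SMILES characters with one binary head per task."""
    def __init__(self, vocab_size=64, d_model=128, n_heads=4, n_layers=2, n_tasks=12, max_len=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model, padding_idx=0)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=256, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.heads = nn.ModuleList([nn.Linear(d_model, 1) for _ in range(n_tasks)])  # one head per property

    def forward(self, tokens):                       # tokens: (batch, seq_len) integer ids, 0 = padding
        pos_ids = torch.arange(tokens.size(1), device=tokens.device)
        x = self.embed(tokens) + self.pos(pos_ids)   # token + position embeddings
        pad = tokens.eq(0)
        h = self.encoder(x, src_key_padding_mask=pad)
        # mean-pool over non-padding positions
        pooled = h.masked_fill(pad.unsqueeze(-1), 0.0).sum(1) / (~pad).sum(1, keepdim=True).clamp(min=1)
        return torch.cat([head(pooled) for head in self.heads], dim=-1)  # (batch, n_tasks) logits

model = SelfAttentionMTL()
logits = model(torch.randint(1, 64, (8, 100)))       # dummy batch of tokenized SMILES strings
loss = nn.BCEWithLogitsLoss()(logits, torch.rand(8, 12).round())
```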

Similar papers

Evaluation of BERT and ALBERT Sentence Embedding Performance on Downstream NLP Tasks

Hyunjin Choi, Judong Kim, Seongho Joe, Youngjune Gwon

Auto-TLDR; Sentence Embedding Models for BERT and ALBERT: A Comparison and Evaluation

Contextualized representations from a pre-trained language model are central to achieving high performance on downstream NLP tasks. The pre-trained BERT and A Lite BERT (ALBERT) models can be fine-tuned to give state-of-the-art results in sentence-pair regressions such as semantic textual similarity (STS) and natural language inference (NLI). Although BERT-based models yield the [CLS] token vector as a reasonable sentence embedding, the search for an optimal sentence embedding scheme remains an active research area in computational linguistics. This paper explores sentence embedding models for BERT and ALBERT. In particular, we take a modified BERT network with siamese and triplet network structures called Sentence-BERT (SBERT) and replace BERT with ALBERT to create Sentence-ALBERT (SALBERT). We also experiment with an outer CNN sentence-embedding network for SBERT and SALBERT. We evaluate the performance of all sentence-embedding models considered using the STS and NLI datasets. The empirical results indicate that our CNN architecture improves ALBERT models substantially more than BERT models on the STS benchmark. Despite significantly fewer model parameters, ALBERT sentence embedding is highly competitive with BERT in downstream NLP evaluations.
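As a rough illustration of the sentence-embedding step these models build on, the sketch below mean-pools contextual token vectors from a Hugging Face checkpoint and compares two sentences with cosine similarity; the checkpoint name and pooling choice are illustrative assumptions, not the exact SBERT/SALBERT training setup.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative checkpoint; the paper's SBERT/SALBERT add siamese/triplet fine-tuning on top of this idea.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state          # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(1) / mask.sum(1)              # mean pooling over real tokens

a, b = embed(["A man is playing a guitar.", "Someone plays an instrument."])
similarity = torch.nn.functional.cosine_similarity(a, b, dim=0)  # STS-style score
```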

PIN: A Novel Parallel Interactive Network for Spoken Language Understanding

Peilin Zhou, Zhiqi Huang, Fenglin Liu, Yuexian Zou

Auto-TLDR; Parallel Interactive Network for Spoken Language Understanding

Spoken Language Understanding (SLU) is an essential part of the spoken dialogue system, which typically consists of intent detection (ID) and slot filling (SF) tasks. Recently, recurrent neural network (RNN) based methods achieved the state-of-the-art for SLU. In the existing RNN-based approaches, ID and SF tasks are often jointly modeled to utilize the correlation information between them. However, efforts to obtain better performance by supporting bidirectional and explicit information exchange between ID and SF have so far not been well studied. In addition, few studies attempt to capture the local context information to enhance the performance of SF. Motivated by these findings, in this paper, a Parallel Interactive Network (PIN) is proposed to model the mutual guidance between ID and SF. Specifically, given an utterance, a Gaussian self-attentive encoder is introduced to generate a context-aware feature embedding of the utterance that is able to capture local context information. Taking the feature embedding of the utterance, Slot2Intent and Intent2Slot modules are developed to capture the bidirectional information flow for the ID and SF tasks. Finally, a cooperation mechanism is constructed to fuse the information obtained from the Slot2Intent and Intent2Slot modules to further reduce the prediction bias. The experiments on two benchmark datasets, i.e., SNIPS and ATIS, demonstrate the effectiveness of our approach, which achieves a competitive result with state-of-the-art models. More encouragingly, by using the feature embedding of the utterance generated by the pre-trained language model BERT, our method achieves the state-of-the-art among all comparison approaches.

Cross-Lingual Text Image Recognition Via Multi-Task Sequence to Sequence Learning

Zhuo Chen, Fei Yin, Xu-Yao Zhang, Qing Yang, Cheng-Lin Liu

Auto-TLDR; Cross-Lingual Text Image Recognition with Multi-task Learning

This paper considers recognizing texts shown in a source language and translating them into a target language, without generating intermediate source-language text recognition results. We call this problem Cross-Lingual Text Image Recognition (CLTIR). To solve this problem, we propose a multi-task system containing a main task of CLTIR and an auxiliary task of Mono-Lingual Text Image Recognition (MLTIR) simultaneously. Two different sequence-to-sequence learning methods, a convolution based attention model and a BLSTM model with CTC, are adopted for these tasks respectively. We evaluate the system on a newly collected Chinese-English bilingual movie subtitle image dataset. Experimental results demonstrate that the multi-task learning framework achieves superior performance in both languages.

Emerging Relation Network and Task Embedding for Multi-Task Regression Problems

Schreiber Jens, Bernhard Sick

Auto-TLDR; A Comparative Study of Multi-Task Learning for Non-linear Time Series Problems

Multi-task learning (MTL) provides state-of-the-art results in many applications of computer vision and natural language processing. In contrast to single-task learning (STL), MTL allows for leveraging knowledge between related tasks, improving prediction results on all tasks. However, there is a limited number of comparative studies applying MTL architectures to regression and time-series problems that take recent advances of MTL into account. An intriguing, non-linear time-series problem is the day-ahead forecast of the expected power generation of renewable power plants. Therefore, the main contribution of this article is a comparative study of the following recent and relevant MTL architectures: hard-parameter sharing, cross-stitch network, and sluice network (SN). They are compared to a multi-layer perceptron (MLP) model of similar size in an STL setting. As an additional contribution, we provide a simple, yet practical approach to model task-specific information through an embedding layer in an MLP, referred to as task embedding. Further, we contribute a new MTL architecture named emerging relation network (ERN), which can be considered as an extension of the SN. For a solar power dataset, the task embedding achieves the best mean improvement of 8.2%. For two wind and one additional solar dataset, the ERN is the best MTL architecture, with improvements up to 11.3%.

Automatic Student Network Search for Knowledge Distillation

Zhexi Zhang, Wei Zhu, Junchi Yan, Peng Gao, Guotong Xie

Auto-TLDR; NAS-KD: Knowledge Distillation for BERT

Pre-trained language models (PLMs), such as BERT, have achieved outstanding performance on multiple natural language processing (NLP) tasks. However, such pre-trained models usually contain a huge number of parameters and are computationally expensive. The high resource demand hinders their application on resource-restricted devices like mobile phones. Knowledge distillation (KD) is an effective compression approach, aiming at encouraging a light-weight student network to imitate the teacher network, so that latent knowledge is transferred from the teacher to the student. However, the great majority of student networks in previous KD methods are manually designed, normally a subnetwork of the teacher network. A Transformer is generally utilized as the student for compressing BERT but still contains masses of parameters. Motivated by this, we propose a novel approach named NAS-KD, which automatically generates an optimal student network using neural architecture search (NAS) to enhance the distillation for BERT. Experiments on 7 classification tasks in the NLP domain demonstrate that NAS-KD can substantially reduce the size of BERT without much performance sacrifice.

A Transformer-Based Radical Analysis Network for Chinese Character Recognition

Chen Yang, Qing Wang, Jun Du, Jianshu Zhang, Changjie Wu, Jiaming Wang

Auto-TLDR; Transformer-based Radical Analysis Network for Chinese Character Recognition

Recently, the radical analysis network (RAN) has demonstrated the capability of effectively recognizing unseen Chinese character classes and largely reducing the training-data requirement by treating a Chinese character as a hierarchical composition of radicals rather than a single character class. However, when dealing with more challenging issues, such as the recognition of complicated characters, low-frequency character categories, and characters in natural scenes, RAN still has a lot of room for improvement. In this paper, we explore options to further improve the structure generalization and robustness of RAN with the Transformer architecture, which has achieved state-of-the-art results for many sequence-to-sequence tasks. More specifically, we propose to replace the original attention module in RAN with the Transformer decoder, yielding what we name the transformer-based radical analysis network (RTN). The experimental results show that the proposed approach significantly outperforms RAN on both a printed Chinese character database and a natural-scene Chinese character database. Meanwhile, further analysis proves that RTN generalizes better to complex samples and low-frequency characters, and is more robust in recognizing Chinese characters with different attributes.

KoreALBERT: Pretraining a Lite BERT Model for Korean Language Understanding

Hyunjae Lee, Jaewoong Yun, Bongkyu Hwang, Seongho Joe, Seungjai Min, Youngjune Gwon

Auto-TLDR; KoreALBERT: A monolingual ALBERT model for Korean language understanding

A Lite BERT (ALBERT) has been introduced to scale up deep bidirectional representation learning for natural languages. Due to the lack of pretrained ALBERT models for the Korean language, the best available practice has been to use the multilingual model or to fall back on another BERT-based model. In this paper, we develop and pretrain KoreALBERT, a monolingual ALBERT model specifically for Korean language understanding. We introduce a new training objective, namely Word Order Prediction (WOP), and use it alongside the existing MLM and SOP criteria with the same architecture and model parameters. Despite having significantly fewer model parameters (and thus being quicker to train), our pretrained KoreALBERT outperforms its BERT counterpart on the KorQuAD 1.0 benchmark for machine reading comprehension. Consistent with the empirical results in English by Lan et al., KoreALBERT seems to improve downstream task performance involving multi-sentence encoding for the Korean language. The pretrained KoreALBERT is publicly available to encourage research and application development for Korean NLP.

Tackling Contradiction Detection in German Using Machine Translation and End-To-End Recurrent Neural Networks

Maren Pielka, Rafet Sifa, Lars Patrick Hillebrand, David Biesner, Rajkumar Ramamurthy, Anna Ladi, Christian Bauckhage

Auto-TLDR; Contradiction Detection in Natural Language Inference using Recurrent Neural Networks

Natural Language Inference, and specifically Contradiction Detection, is still a largely unexplored topic with respect to German text. In this paper, we apply Recurrent Neural Network (RNN) methods to learn contradiction-specific sentence embeddings. Our evaluation data set is a machine-translated version of the Stanford Natural Language Inference (SNLI) corpus. The results are compared to a baseline using unsupervised vectorization techniques, namely tf-idf and Flair, as well as state-of-the-art transformer-based (MBERT) methods. We find that the end-to-end models outperform the models trained on unsupervised embeddings, which makes them the better choice in an empirical use case. The RNN methods also outperform MBERT on the translated data set.

Adversarial Training for Aspect-Based Sentiment Analysis with BERT

Akbar Karimi, Andrea Prati, Leonardo Rossi

Auto-TLDR; Adversarial Training of BERT for Aspect-Based Sentiment Analysis

Aspect-Based Sentiment Analysis (ABSA) studies the extraction of sentiments and their targets. Collecting labeled data for this task in order to help neural networks generalize better can be laborious and time-consuming. As an alternative, data similar to real-world examples can be produced artificially through an adversarial process carried out in the embedding space. Although these examples are not real sentences, they have been shown to act as a regularization method which can make neural networks more robust. In this work, we fine-tune the general-purpose BERT and the domain-specific post-trained BERT (BERT-PT) using adversarial training. After improving the results of post-trained BERT with different hyperparameters, we propose a novel architecture called BERT Adversarial Training (BAT) to utilize adversarial training for the two major tasks of Aspect Extraction and Aspect Sentiment Classification in sentiment analysis. The proposed model outperforms the general BERT as well as the in-domain post-trained BERT on both tasks. To the best of our knowledge, this is the first study on the application of adversarial training in ABSA. The code is publicly available in a GitHub repository at https://github.com/IMPLabUniPr/Adversarial-Training-for-ABSA
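For readers unfamiliar with adversarial training in the embedding space, the sketch below shows a generic FGSM-style perturbation of input embeddings and a combined clean-plus-adversarial loss; the model is assumed to be a Hugging Face-style classifier that accepts inputs_embeds (e.g. a BERT sequence classifier), and the step size and normalization are illustrative, not the BAT recipe from the paper.

```python
import torch

def adversarial_step(model, embeddings, labels, loss_fn, epsilon=1e-2):
    """One FGSM-style perturbation in embedding space; `model` is assumed to be a
    Hugging Face-style sequence classifier that accepts `inputs_embeds` and returns `.logits`."""
    embeddings = embeddings.detach().requires_grad_(True)
    clean_loss = loss_fn(model(inputs_embeds=embeddings).logits, labels)
    grad, = torch.autograd.grad(clean_loss, embeddings, retain_graph=True)
    # move each token embedding a small step in the direction that increases the loss
    perturbed = embeddings + epsilon * grad / (grad.norm(dim=-1, keepdim=True) + 1e-12)
    adv_loss = loss_fn(model(inputs_embeds=perturbed).logits, labels)
    return clean_loss + adv_loss   # backpropagate the combined clean + adversarial loss
```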

Conditional Multi-Task Learning for Plant Disease Identification

Sue Han Lee, Herve Goëau, Pierre Bonnet, Alexis Joly

Auto-TLDR; A conditional multi-task learning approach for plant disease identification

Several recent studies have proposed automatic plant disease identification systems based on deep learning. Although successful, these approaches are generally based on learned classification models whose target classes are joint host species-disease pairs, which may not allow optimal use of the available information. This is due to the fact that they require distinguishing between similar host species or diseases. In fact, these approaches have limited scalability because the size of a network gradually increases as new classes are added, even if information on host species or diseases is already available. This constraint is all the more important as it can be difficult to collect/establish a specific list of all diseases for each host plant species in an actual application. In this paper, we address these problems by proposing a new conditional multi-task learning (CMTL) approach which allows the distributions of host species and disease characteristics to be learned simultaneously with a conditional link between them. This conditioning is formed in such a way that the knowledge to infer the prediction of one concept (the diseases) depends on the other concept (the host species), which corresponds to the way plant pathologists infer the diseases of host species. We show that our approach can improve the performance of plant disease identification compared to the usual species-disease pair modeling of previous studies. Meanwhile, we also compose a new dataset on plant disease identification that could serve as an important benchmark in this field.

Segmenting Messy Text: Detecting Boundaries in Text Derived from Historical Newspaper Images

Carol Anderson, Phil Crone

Auto-TLDR; Text Segmentation of Marriage Announcements Using Deep Learning-based Models

Text segmentation, the task of dividing a document into sections, is often a prerequisite for performing additional natural language processing tasks. Existing text segmentation methods have typically been developed and tested using clean, narrative-style text with segments containing distinct topics. Here we consider a challenging text segmentation task: dividing newspaper marriage announcement lists into units of one couple each. In many cases the information is not structured into sentences, and adjacent segments are not topically distinct from each other. In addition, the text of the announcements, which is derived from images of historical newspapers via optical character recognition, contains many typographical errors. Because of these properties, these announcements are not amenable to segmentation with existing techniques. We present a novel deep learning-based model for segmenting such text and show that it significantly outperforms an existing state-of-the-art method on our task.

Enhancing Handwritten Text Recognition with N-Gram Sequence Decomposition and Multitask Learning

Vasiliki Tassopoulou, George Retsinas, Petros Maragos

Auto-TLDR; Multi-task Learning for Handwritten Text Recognition

Current state-of-the-art approaches in the field of Handwritten Text Recognition are predominately single-task with unigram, character-level target units. In our work, we utilize a multi-task learning scheme, training the model to perform decompositions of the target sequence with target units of different granularity, from fine to coarse. We consider this method as a way to utilize n-gram information implicitly in the training process, while the final recognition is performed using only the unigram output. Unigram decoding of such a multi-task approach highlights the capability of the learned internal representations, imposed by the different n-grams at the training step. We select n-grams as our target units and we experiment from unigrams to fourgrams, namely subword-level granularities. These multiple decompositions are learned from the network with task-specific CTC losses. Concerning network architectures, we propose two alternatives, namely the Hierarchical and the Block Multi-task. Overall, our proposed model, even though evaluated only on the unigram task, outperforms its single-task counterpart by an absolute 2.52% WER and 1.02% CER in greedy decoding, without any computational overhead during inference, hinting towards successfully imposing an implicit language model.
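A minimal sketch of the multi-granularity training signal described here, assuming one CTC head per n-gram decomposition of the same transcript; the loss weights and vocabulary sizes are illustrative, and the paper's Hierarchical/Block architectures are not reproduced.

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)

def multitask_ctc_loss(outputs, targets, input_lengths, target_lengths, weights=(1.0, 0.5, 0.25)):
    """Weighted sum of CTC losses over unigram/bigram/trigram decompositions of the same transcript.
    outputs[k]: (T, batch, vocab_k) log-probabilities from the k-th task head."""
    total = 0.0
    for w, log_probs, tgt, tgt_len in zip(weights, outputs, targets, target_lengths):
        total = total + w * ctc(log_probs, tgt, input_lengths, tgt_len)
    return total

# dummy example: three heads with growing label vocabularies over a batch of 2 frame sequences
outs = [torch.randn(50, 2, v).log_softmax(-1) for v in (30, 200, 800)]
tgts = [torch.randint(1, v, (2, 10)) for v in (30, 200, 800)]
lens_in = torch.full((2,), 50, dtype=torch.long)
lens_tgt = [torch.full((2,), 10, dtype=torch.long)] * 3
loss = multitask_ctc_loss(outs, tgts, lens_in, lens_tgt)
```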

PICK: Processing Key Information Extraction from Documents Using Improved Graph Learning-Convolutional Networks

Wenwen Yu, Ning Lu, Xianbiao Qi, Ping Gong, Rong Xiao

Auto-TLDR; PICK: A Graph Learning Framework for Key Information Extraction from Documents

Computer vision with state-of-the-art deep learning models has achieved huge success in the field of Optical Character Recognition (OCR), including text detection and recognition tasks. However, Key Information Extraction (KIE) from documents, a downstream task of OCR with a large number of real-world use scenarios, remains a challenge, because documents not only have textual features extracted by OCR systems but also semantic visual features that are not fully exploited yet play a critical role in KIE. Too little work has been devoted to efficiently making full use of both the textual and visual features of documents. In this paper, we introduce PICK, a framework that is effective and robust in handling complex document layouts for KIE by combining graph learning with graph convolution operations, yielding a richer semantic representation containing the textual and visual features and global layout without ambiguity. Extensive experiments on real-world datasets have been conducted to show that our method outperforms baseline methods by significant margins.

Global Context-Based Network with Transformer for Image2latex

Nuo Pang, Chun Yang, Xiaobin Zhu, Jixuan Li, Xu-Cheng Yin

Auto-TLDR; Image2latex with Global Context block and Transformer

Image2latex usually means converting mathematical formulas in images into LaTeX markup. It is a very challenging job due to the complex two-dimensional structure, variant scales of input, and very long representation sequences. Many researchers use encoder-decoder based models to solve this task and have achieved good results. However, these methods don't make full use of the structure and position information of the formula. To solve this problem, we propose a global context-based network with a Transformer that can (1) learn a more powerful and robust intermediate representation by aggregating global features, (2) encode position information explicitly, and (3) learn latent dependencies between symbols by using the self-attention mechanism. The experimental results on the IM2LATEX-100K dataset demonstrate the effectiveness of our method.

Transformer Networks for Trajectory Forecasting

Francesco Giuliari, Hasan Irtiza, Marco Cristani, Fabio Galasso

Auto-TLDR; Transformer Networks for Trajectory Prediction of People Interactions

Most recent successes on forecasting the people motion are based on LSTM models, and almost all recent progress has been achieved by modelling the social interaction among people and the people interaction with the scene. We question the use of the LSTM models and propose the novel use of Transformer Networks for trajectory forecasting. This is a fundamental switch from the sequential step-by-step processing of LSTMs to the only-attention-based memory mechanisms of Transformers. In particular, we consider both the original Transformer Network (TF) and the larger Bidirectional Transformer (BERT), state-of-the-art on all natural language processing tasks. Our proposed Transformers predict the trajectories of the individual people in the scene. These are “simple” models because each person is modelled separately without any complex human-human nor scene interaction terms. In particular, the TF model without bells and whistles yields the best score on the largest and most challenging trajectory forecasting benchmark of TrajNet [1]. Additionally, its extension which predicts multiple plausible future trajectories performs on par with more engineered techniques on the 5 datasets of ETH [2]+UCY [3]. Finally, we show that Transformers may deal with missing observations, as it may be the case with real sensor data. Code is available at github.com/FGiuliari/Trajectory-Transformer
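The sketch below gives the flavour of such a per-person Transformer forecaster: an encoder over the observed (x, y) track regressing a fixed number of future offsets; positional encodings and the paper's exact TF/BERT configurations are omitted, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class TrajectoryTransformer(nn.Module):
    """Per-person forecaster: self-attention over the observed (x, y) track, regressing future offsets."""
    def __init__(self, d_model=64, horizon=12):
        super().__init__()
        self.inp = nn.Linear(2, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=3)
        self.out = nn.Linear(d_model, horizon * 2)

    def forward(self, obs):                   # obs: (batch, obs_len, 2) observed positions
        h = self.encoder(self.inp(obs))       # (batch, obs_len, d_model)
        return self.out(h[:, -1]).view(obs.size(0), -1, 2)   # (batch, horizon, 2) predicted offsets

pred = TrajectoryTransformer()(torch.randn(4, 8, 2))   # 8 observed steps -> 12 predicted steps
```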

Adversarial Encoder-Multi-Task-Decoder for Multi-Stage Processes

Andre Mendes, Julian Togelius, Leandro Dos Santos Coelho

Auto-TLDR; Multi-Task Learning and Semi-Supervised Learning for Multi-Stage Processes

In multi-stage processes, decisions occur in an ordered sequence of stages. Early stages usually have more observations with general information (easier/cheaper to collect), while later stages have fewer observations but more specific data. This situation can be represented by a dual funnel structure, in which the sample size decreases from one stage to the other while the information increases. Training classifiers in this scenario is challenging since information in the early stages may not contain distinct patterns to learn (underfitting). In contrast, the small sample size in later stages can cause overfitting. We address both cases by introducing a framework that combines adversarial autoencoders (AAE), multi-task learning (MTL), and multi-label semi-supervised learning (MLSSL). We improve the decoder of the AAE with an MTL component so it can jointly reconstruct the original input and use feature nets to predict the features for the next stages. We also introduce a sequence constraint in the output of an MLSSL classifier to guarantee the sequential pattern in the predictions. Using real-world data from different domains (selection process, medical diagnosis), we show that our approach outperforms other state-of-the-art methods.

TAAN: Task-Aware Attention Network for Few-Shot Classification

Zhe Wang, Li Liu, Fanzhang Li

Auto-TLDR; TAAN: Task-Aware Attention Network for Few-Shot Classification

Few-shot classification aims to recognize unlabeled samples from unseen classes given only a few labeled samples. Current approaches to few-shot learning usually employ a metric-learning framework to learn a feature similarity comparison between a query (test) example and the few support (training) examples. However, these approaches all extract features from samples independently without looking at the entire task as a whole, and so fail to provide sufficiently discriminative features. Moreover, the existing approaches lack the ability to select the most relevant features for the task at hand. In this work, we propose a novel algorithm called Task-Aware Attention Network (TAAN) to address the above problems in few-shot classification. By inserting a Task-Relevant Channel Attention Module into metric-based few-shot learners, TAAN generates channel attentions for each sample by aggregating the context of the entire support set and identifies the most relevant features for similarity comparison. The experiments demonstrate that TAAN is competitive in overall performance compared to recent state-of-the-art systems and improves the performance considerably over baseline systems on both the mini-ImageNet and tiered-ImageNet benchmarks.
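As a rough sketch of task-conditioned channel attention, the module below pools the whole support set into a task context and re-weights feature channels of both support and query samples; the pooling and MLP sizes are assumptions rather than the exact TAAN module.

```python
import torch
import torch.nn as nn

class TaskChannelAttention(nn.Module):
    """Re-weight feature channels using a context vector pooled from the entire support set."""
    def __init__(self, channels=64, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, support_feats, query_feats):
        # support_feats: (n_support, C), query_feats: (n_query, C)
        task_context = support_feats.mean(dim=0, keepdim=True)   # aggregate the whole task
        attn = self.mlp(task_context)                             # (1, C) channel weights
        return support_feats * attn, query_feats * attn

s, q = TaskChannelAttention()(torch.randn(25, 64), torch.randn(15, 64))  # 5-way 5-shot style shapes
```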

GCNs-Based Context-Aware Short Text Similarity Model

Xiaoqi Sun

Auto-TLDR; Context-Aware Graph Convolutional Network for Text Similarity

Semantic textual similarity is a fundamental task in text mining and natural language processing (NLP) with profound research value. The essential step for text similarity is text representation learning. Recently, researchers have explored graph convolutional network (GCN) techniques for text representation, since GCNs do well in handling complex structures and preserving syntactic information. However, current GCN models are usually limited to very shallow layers due to the vanishing gradient problem, and so cannot capture non-local dependency information of sentences. In this paper, we propose a GCNs-based context-aware (GCSTS) model that applies iterated GCN blocks to train deeper GCNs. Recurrently employing the same GCN block prevents over-fitting and provides a broad effective input width. Combined with dense connections, GCSTS can be trained more deeply. Besides, we use dynamic graph structures in the block, which further extend the receptive field of each vertex in the graph, learning better sentence representations. Experiments show that our model outperforms existing models on several text similarity datasets, while also verifying that GCN-based text representation models can be trained more deeply, rather than with only two or three layers.

MA-LSTM: A Multi-Attention Based LSTM for Complex Pattern Extraction

Jingjie Guo, Kelang Tian, Kejiang Ye, Cheng-Zhong Xu

Auto-TLDR; MA-LSTM: Multiple Attention based recurrent neural network for forget gate

With the improvement of data, computing power and algorithms, deep learning has achieved rapid development and shown excellent performance. Recently, many deep learning models have been proposed to solve problems in different areas. A recurrent neural network (RNN) is a class of artificial neural networks where connections between nodes form a directed graph along a temporal sequence. This allows it to exhibit temporal dynamic behavior, which makes it applicable to tasks such as handwriting recognition or speech recognition. However, the RNN relies heavily on the automatic learning ability to update parameters, which concentrates on the data flow but seldom considers the feature extraction capability of the gate mechanism. In this paper, we propose a novel architecture to build the forget gate, which is generated by multiple bases. Instead of using the traditional single-layer fully-connected network, we use a Multiple Attention (MA) based network to generate the forget gate, which refines the optimization space of the gate function and improves the granularity of the recurrent neural network to approximate the map in the ground truth. Thanks to the MA structure on the gate mechanism, our model has a better feature extraction capability than other known models. MA-LSTM is an alternative module which can directly replace the recurrent neural network and has achieved good performance in many areas that people are concerned about.

Mood Detection Analyzing Lyrics and Audio Signal Based on Deep Learning Architectures

Konstantinos Pyrovolakis, Paraskevi Tzouveli, Giorgos Stamou

Auto-TLDR; Automated Music Mood Detection using Music Information Retrieval

The digital era has changed the way music is produced and propagated, creating new needs for automated and more effective management of music tracks in large volumes. Automated music mood detection constitutes an active task in the field of MIR (Music Information Retrieval) and has been the subject of many research papers in the past few years. In order to approach the task of mood detection, we separately addressed the analysis of musical lyrics and the analysis of the musical audio signal. Then we applied a uniform multichannel analysis to classify our data into mood classes. The data we use to train and evaluate our models consist of a total of 2,000 song titles, classified into four mood classes {happy, angry, sad, relaxed}. The result of this process is a uniform prediction of the emotional arousal that a music track can cause in a listener and points the way to many applications.

Context Matters: Self-Attention for Sign Language Recognition

Fares Ben Slimane, Mohamed Bouguessa

Auto-TLDR; Attentional Network for Continuous Sign Language Recognition

This paper proposes an attentional network for the task of Continuous Sign Language Recognition. The proposed approach exploits co-independent streams of data to model the sign language modalities. These different channels of information can share a complex temporal structure between each other. For that reason, we apply attention to synchronize and help capture entangled dependencies between the different sign language components. Even though Sign Language is multi-channel, handshapes represent the central entities in sign interpretation. Seeing handshapes in their correct context defines the meaning of a sign. Taking that into account, we utilize the attention mechanism to efficiently aggregate the hand features with their appropriate Spatio-temporal context for better sign recognition. We found that by doing so the model is able to identify the essential Sign Language components that revolve around the dominant hand and the face areas. We test our model on the benchmark dataset RWTH-PHOENIX-Weather 2014, yielding competitive results.

A Novel Attention-Based Aggregation Function to Combine Vision and Language

Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

Auto-TLDR; Fully-Attentive Reduction for Vision and Language

The joint understanding of vision and language has been recently gaining a lot of attention in both the Computer Vision and Natural Language Processing communities, with the emergence of tasks such as image captioning, image-text matching, and visual question answering. As both images and text can be encoded as sets or sequences of elements - like regions and words - proper reduction functions are needed to transform a set of encoded elements into a single response, like a classification or similarity score. In this paper, we propose a novel fully-attentive reduction method for vision and language. Specifically, our approach computes a set of scores for each element of each modality employing a novel variant of cross-attention, and performs a learnable and cross-modal reduction, which can be used for both classification and ranking. We test our approach on image-text matching and visual question answering, building fair comparisons with other reduction choices, on both COCO and VQA 2.0 datasets. Experimentally, we demonstrate that our approach leads to a performance increase on both tasks. Further, we conduct ablation studies to validate the role of each component of the approach.
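A simplified sketch of a learnable attentive reduction over a set of encoded elements (regions or words) is shown below; it uses a single scoring head rather than the cross-attention variant proposed in the paper, so treat it only as an illustration of the general idea.

```python
import torch
import torch.nn as nn

class AttentiveReduction(nn.Module):
    """Score each element of a set (image regions or words) and reduce to one vector by a learned weighted sum."""
    def __init__(self, dim=512):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, elements, mask=None):            # elements: (batch, n, dim)
        scores = self.score(elements).squeeze(-1)      # (batch, n)
        if mask is not None:
            scores = scores.masked_fill(~mask, float("-inf"))
        weights = scores.softmax(dim=-1)
        return torch.einsum("bn,bnd->bd", weights, elements)

pooled = AttentiveReduction()(torch.randn(2, 36, 512))   # e.g. 36 region features per image
```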

Assessing the Severity of Health States Based on Social Media Posts

Shweta Yadav, Joy Prakash Sain, Amit Sheth, Asif Ekbal, Sriparna Saha, Pushpak Bhattacharyya

Auto-TLDR; A Multiview Learning Framework for Assessment of Health State in Online Health Communities

The unprecedented growth of Internet users has resulted in an abundance of unstructured information on social media, including health forums, where patients request health-related information or opinions from other users. Previous studies have shown that online peer support has limited effectiveness without expert intervention. Therefore, a system capable of assessing the severity of a health state from the patients' social media posts can help health professionals (HP) in prioritizing the user's post. In this study, we inspect the efficacy of different aspects of Natural Language Understanding (NLU) in identifying the severity of the user's health state in relation to two perspectives (tasks): (a) Medical Condition (i.e., Recover, Exist, Deteriorate, Other) and (b) Medication (i.e., Effective, Ineffective, Serious Adverse Effect, Other) in online health communities. We propose a multiview learning framework that models both the textual content and contextual information to assess the severity of the user's health state. Specifically, our model utilizes NLU views such as sentiment, emotions, personality, and use of figurative language to extract the contextual information. The diverse NLU views demonstrate their effectiveness on both tasks, as well as on individual diseases, in assessing a user's health.

Weakly Supervised Attention Rectification for Scene Text Recognition

Chengyu Gu, Shilin Wang, Yiwei Zhu, Zheng Huang, Kai Chen

Auto-TLDR; An auxiliary supervision branch for attention-based scene text recognition

Scene text recognition has become a hot topic in recent years due to its booming real-life applications. Attention-based encoder-decoder framework has become one of the most popular frameworks especially in the irregular text scenario. However, the “attention drift” problem reduces the recognition performance for most existing attention-based scene text recognition methods. To solve this problem, we propose an auxiliary supervision branch along with the attention-based encoder-decoder framework. A new loss function is designed to refine the feature map and to help the attention region align the target character area. Compared with existing attention rectification mechanisms, our method does not require character-level annotations or introduce any additional trainable parameter. Furthermore, our method can improve the performance for both RNN-Attention and Scaled Dot-Product Attention. The experiment results on various benchmarks have demonstrated that the proposed approach outperforms the state-of-the-art methods in both regular and irregular text recognition scenarios.

Attention-Driven Body Pose Encoding for Human Activity Recognition

Bappaditya Debnath, Swagat Kumar, Marry O'Brien, Ardhendu Behera

Auto-TLDR; Attention-based Body Pose Encoding for Human Activity Recognition

This article proposes a novel attention-based body pose encoding for human activity recognition. Most of the existing human activity recognition approaches based on 3D pose data often enrich the input data using additional handcrafted representations such as velocity, super normal vectors, pairwise relations, and so on. The enriched data complements the 3D body joint position data and improves the model performance. In this paper, we propose a novel approach that learns enhanced feature representations from a given sequence of 3D body joints. To achieve this, the approach exploits two body pose streams: 1) a spatial stream which encodes the spatial relationship between various body joints at each time point to learn spatial structure involving the spatial distribution of different body joints 2) a temporal stream that learns the temporal variation of individual body joints over the entire sequence duration to present a temporally enhanced representation. Afterwards, these two pose streams are fused with a multi-head attention mechanism. We also capture the contextual information from the RGB video stream using a deep Convolutional Neural Network (CNN) model combined with a multi-head attention and a bidirectional Long Short-Term Memory (LSTM) network. Finally, the RGB video stream is combined with the fused body pose stream to give a novel end-to-end deep model for effective human activity recognition. The proposed model is evaluated on three datasets including the challenging NTU-RGBD dataset and achieves state-of-the-art results.

Stratified Multi-Task Learning for Robust Spotting of Scene Texts

Kinjal Dasgupta, Sudip Das, Ujjwal Bhattacharya

Auto-TLDR; Feature Representation Block for Multi-task Learning of Scene Text

Gaining control over the dynamics of multi-task learning should help to unlock the potential of the deep network to a great extent. In the existing multi-task learning (MTL) approaches of deep networks, all the parameters of the feature encoding part are subjected to adjustments corresponding to each of the underlying sub-tasks. On the other hand, different functional areas of the human brain are responsible for distinct functions: the Broca's area of the cerebrum is responsible for speech formation, whereas the Wernicke's area is related to language development, etc. Inspired by this fact, in the present study, we propose to introduce a block (termed the Feature Representation Block) of connection weights spanning a few successive layers of a deep multi-task learning architecture and to stratify it into distinct subsets whose adjustments correspond exclusively to different sub-tasks. Additionally, we have introduced a novel regularization component for controlled training of this Feature Representation Block. The purpose of this learning framework is efficient end-to-end recognition of scene texts. Simulation results of the proposed strategy on various benchmark scene text datasets such as ICDAR 2015, ICDAR 2017 MLT, COCO-Text and MSRA-TD500 improve the respective SOTA performances.

Learning Neural Textual Representations for Citation Recommendation

Thanh Binh Kieu, Inigo Jauregi Unanue, Son Bao Pham, Xuan-Hieu Phan, M. Piccardi

Auto-TLDR; Sentence-BERT cascaded with Siamese and triplet networks for citation recommendation

With the rapid growth of the scientific literature, manually selecting appropriate citations for a paper is becoming increasingly challenging and time-consuming. While several approaches for automated citation recommendation have been proposed in the recent years, effective document representations for citation recommendation are still elusive to a large extent. For this reason, in this paper we propose a novel approach to citation recommendation which leverages a deep sequential representation of the documents (Sentence-BERT) cascaded with Siamese and triplet networks in a submodular scoring function. To the best of our knowledge, this is the first approach to combine deep representations and submodular selection for a task of citation recommendation. Experiments have been carried out using a popular benchmark dataset -- the ACL Anthology Network corpus -- and evaluated against baselines and a state-of-the-art approach using metrics such as the MRR and F1@k score. The results show that the proposed approach has been able to outperform all the compared approaches in every measured metric.

Deep Multi-Task Learning for Facial Expression Recognition and Synthesis Based on Selective Feature Sharing

Rui Zhao, Tianshan Liu, Jun Xiao, P. K. Daniel Lun, Kin-Man Lam

Auto-TLDR; Multi-task Learning for Facial Expression Recognition and Synthesis

Multi-task learning is an effective learning strategy for deep-learning-based facial expression recognition tasks. However, most existing methods give only limited consideration to feature selection when transferring information between different tasks, which may lead to task interference when training the multi-task networks. To address this problem, we propose a novel selective feature-sharing method and establish a multi-task network for facial expression recognition and facial expression synthesis. The proposed method can effectively transfer beneficial features between different tasks, while filtering out useless and harmful information. Moreover, we employ the facial expression synthesis task to enlarge and balance the training dataset to further enhance the generalization ability of the proposed method. Experimental results show that the proposed method achieves state-of-the-art performance on commonly used facial expression recognition benchmarks, which makes it a potential solution to real-world facial expression recognition problems.

Reinforcement Learning with Dual Attention Guided Graph Convolution for Relation Extraction

Zhixin Li, Yaru Sun, Suqin Tang, Canlong Zhang, Huifang Ma

Auto-TLDR; Dual Attention Graph Convolutional Network for Relation Extraction

To better learn the dependency relationships between nodes, we address the relation extraction task by capturing rich contextual dependencies based on the attention mechanism, and using distributional reinforcement learning to generate an optimal relation information representation. This method is called Dual Attention Graph Convolutional Network (DAGCN), and it adaptively integrates local features with their global dependencies. Specifically, we append two types of attention modules on top of the GCN, which model the semantic interdependencies in the spatial and relational dimensions respectively. The position attention module selectively aggregates the feature at each position by a weighted sum of the features at all positions of the nodes' internal features. Meanwhile, the relation attention module selectively emphasizes interdependent node relations by integrating associated features among all nodes. We sum the outputs of the two attention modules and use reinforcement learning to predict the classification of node relationships to further improve the feature representation, which contributes to more precise extraction results. The results on the TACRED and SemEval datasets show that the model can obtain more useful information for relation extraction tasks and achieves better performance on various evaluation indexes.

Multi-Task Learning Based Traditional Mongolian Words Recognition

Hongxi Wei, Hui Zhang, Jing Zhang, Kexin Liu

Auto-TLDR; Multi-task Learning for Mongolian Words Recognition

In this paper, a multi-task learning framework is proposed for solving and improving traditional Mongolian word recognition. To be specific, a sequence-to-sequence model with an attention mechanism is utilized to accomplish the recognition task. Therein, the attention mechanism is designed to fulfill the task of glyph segmentation during the process of recognition. Although glyph segmentation is an implicit operation, the information of glyph segmentation can be integrated into the recognition process. After that, the two tasks can be accomplished simultaneously under the framework of multi-task learning. In this way, adjacent image frames can be decoded into glyphs more precisely, which improves not only the performance of word recognition but also the accuracy of character segmentation. Experimental results demonstrate that the proposed multi-task learning based scheme outperforms the conventional glyph segmentation-based method and various segmentation-free (i.e., holistic recognition) methods.

Neuron-Based Network Pruning Based on Majority Voting

Ali Alqahtani, Xianghua Xie, Ehab Essa, Mark W. Jones

Auto-TLDR; Large-Scale Neural Network Pruning using Majority Voting

The achievement of neural networks in a variety of applications is accompanied by a dramatic increase in computational costs and memory requirements. In this paper, we propose an efficient method to simultaneously identify the critical neurons and prune the model during training without involving any pre-training or fine-tuning procedures. Unlike existing methods, which accomplish this task in a greedy fashion, we propose a majority voting technique to compare the activation values among neurons and assign a voting score to quantitatively evaluate their importance. This mechanism helps to effectively reduce model complexity by eliminating the less influential neurons and aims to determine a subset of the whole model that can represent the reference model with much fewer parameters within the training process. Experimental results show that majority voting efficiently compresses the network with no drop in model accuracy, pruning more than 79% of the original model parameters on CIFAR10 and more than 91% of the original parameters on MNIST. Moreover, we show that with our proposed method, sparse models can be further pruned into even smaller models by removing more than 60% of the parameters, whilst preserving the reference model accuracy.
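A minimal sketch of activation-based majority voting is shown below: each sample votes for the neurons it activates above a threshold, and low-vote neurons become pruning candidates; the threshold and keep ratio are illustrative, and the paper's exact voting and training-time pruning scheme may differ.

```python
import torch

def voting_scores(activations, threshold=0.0):
    """Each sample 'votes' for the neurons it activates above a threshold.
    activations: (n_samples, n_neurons) recorded activations of one layer."""
    return (activations > threshold).float().sum(dim=0)

def prune_mask(activations, keep_ratio=0.2):
    """Keep the neurons with the most votes; the rest are pruning candidates."""
    votes = voting_scores(activations)
    k = max(1, int(keep_ratio * votes.numel()))
    mask = torch.zeros_like(votes, dtype=torch.bool)
    mask[torch.topk(votes, k).indices] = True
    return mask   # apply to the layer's units / corresponding weight rows

mask = prune_mask(torch.relu(torch.randn(1000, 512)))   # dummy activations for a 512-unit layer
```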

Edge-Aware Graph Attention Network for Ratio of Edge-User Estimation in Mobile Networks

Jiehui Deng, Sheng Wan, Xiang Wang, Enmei Tu, Xiaolin Huang, Jie Yang, Chen Gong

Auto-TLDR; EAGAT: Edge-Aware Graph Attention Network for Automatic REU Estimation in Mobile Networks

Estimating the Ratio of Edge-Users (REU) is an important issue in mobile networks, as it helps the subsequent adjustment of loads in different cells. However, existing approaches usually determine the REU manually, which are experience-dependent and labor-intensive, and thus the estimated REU might be imprecise. Considering the inherited graph structure of mobile networks, in this paper, we utilize a graph-based deep learning method for automatic REU estimation, where the practical cells are deemed as nodes and the load switchings among them constitute edges. Concretely, Graph Attention Network (GAT) is employed as the backbone of our method due to its impressive generalizability in dealing with networked data. Nevertheless, conventional GAT cannot make full use of the information in mobile networks, since it only incorporates node features to infer the pairwise importance and conduct graph convolutions, while the edge features that are actually critical in our problem are disregarded. To accommodate this issue, we propose an Edge-Aware Graph Attention Network (EAGAT), which is able to fuse the node features and edge features for REU estimation. Extensive experimental results on two real-world mobile network datasets demonstrate the superiority of our EAGAT approach to several state-of-the-art methods.

Aggregating Object Features Based on Attention Weights for Fine-Grained Image Retrieval

Hongli Lin, Yongqi Song, Zixuan Zeng, Weisheng Wang

Auto-TLDR; DSAW: Unsupervised Dual-selection for Fine-Grained Image Retrieval

Object localization and local feature representation are key issues in fine-grained image retrieval. However, the existing unsupervised methods still need to be improved in both aspects. To address these issues in a unified framework, a novel unsupervised scheme, named DSAW for short, is presented in this paper. Firstly, we propose a dual-selection (DS) method, which achieves more accurate object localization by using an adaptive threshold method to perform feature selection on the local and global activation maps in turn. Secondly, a novel and faster self-attention weights (AW) method is developed to weight local features by measuring their importance in the global context. Finally, we evaluated the performance of the proposed method on five fine-grained image datasets, and the results show that our DSAW outperforms the existing best method.

Equation Attention Relationship Network (EARN) : A Geometric Deep Metric Framework for Learning Similar Math Expression Embedding

Saleem Ahmed, Kenny Davila, Srirangaraj Setlur, Venu Govindaraju

Auto-TLDR; Representational Learning for Similarity Based Retrieval of Mathematical Expressions

Representational learning in the form of high-dimensional embeddings has been used for multiple pattern recognition applications. There has been a significant interest in building embedding based systems for learning representations in the mathematical domain. At the same time, retrieval of structured information such as mathematical expressions is an important need for modern IR systems. In this work, our motivation is to introduce a robust framework for learning representations for similarity based retrieval of mathematical expressions. Given a query by example, the embedding can find the closest matching expression as a function of the Euclidean distance between them. We leverage recent advancements in image-based and graph-based deep learning algorithms to learn our similarity embeddings. We do this first by using uni-modal encoders in graph space and image space and then by a multi-modal combination of the same. To overcome the lack of training data, we force the networks to learn a deep metric using triplets generated with a heuristic scoring function. We also adopt a custom strategy for mining hard samples to train our neural networks. Our system produces rankings similar to those generated by the original scoring function, but using only a fraction of the time. Our results establish the viability of using such a multi-modal embedding for this task.

VTT: Long-Term Visual Tracking with Transformers

Tianling Bian, Yang Hua, Tao Song, Zhengui Xue, Ruhui Ma, Neil Robertson, Haibing Guan

Auto-TLDR; Visual Tracking Transformer with transformers for long-term visual tracking

Long-term visual tracking is a challenging problem. State-of-the-art long-term trackers, e.g., GlobalTrack, utilize region proposal networks (RPNs) to generate target proposals. However, the performance of these trackers is affected by occlusions and large scale or ratio variations. To address these issues, in this paper, we are the first to propose a novel architecture with transformers for long-term visual tracking. Specifically, the proposed Visual Tracking Transformer (VTT) utilizes a transformer encoder-decoder architecture to aggregate global information and deal with occlusion and large scale or ratio variation. Furthermore, it also shows better discriminative power against instance-level distractors without the need for extra labeling and hard-sample mining. We conduct extensive experiments on the three largest long-term tracking datasets and achieve state-of-the-art performance.

Multi-Task Learning for Calorie Prediction on a Novel Large-Scale Recipe Dataset Enriched with Nutritional Information

Robin Ruede, Verena Heusser, Lukas Frank, Monica Haurilet, Alina Roitberg, Rainer Stiefelhagen

Auto-TLDR; Pic2kcal: Learning Food Recipes from Images for Calorie Estimation

A rapidly growing amount of content posted online, such as food recipes, opens doors to new exciting applications at the intersection of vision and language. In this work, we aim to estimate the calorie amount of a meal directly from an image by learning from recipes people have published on the Internet, thus skipping time-consuming manual data annotation. Since there are few large-scale publicly available datasets captured in unconstrained environments, we propose the pic2kcal benchmark comprising 308,000 images from over 70,000 recipes including photographs, ingredients and instructions. To obtain nutritional information of the ingredients and automatically determine the ground-truth calorie value, we match the items in the recipes with structured information from a food item database. We evaluate various neural networks for regression of the calorie quantity and extend them with the multi-task paradigm. Our learning procedure combines the calorie estimation with prediction of proteins, carbohydrates, and fat amounts as well as a multi-label ingredient classification. Our experiments demonstrate clear benefits of multi-task learning for calorie estimation, surpassing the single-task calorie regression by 9.9%. To encourage further research on this task, we make the code for generating the dataset and the models publicly available.
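As an illustration of the multi-task objective described here, the sketch below combines calorie regression with macro-nutrient regression and multi-label ingredient classification; the head layout, key names, and loss weights are assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()
bce = nn.BCEWithLogitsLoss()

def multitask_recipe_loss(pred, target, w_kcal=1.0, w_macro=1.0, w_ingr=1.0):
    """Calorie regression + protein/carb/fat regression + multi-label ingredient classification."""
    return (w_kcal * mse(pred["kcal"], target["kcal"])
            + w_macro * mse(pred["macros"], target["macros"])
            + w_ingr * bce(pred["ingredients"], target["ingredients"]))

# dummy batch of 4 recipe images, 100 candidate ingredients
pred = {"kcal": torch.randn(4, 1), "macros": torch.randn(4, 3), "ingredients": torch.randn(4, 100)}
target = {"kcal": torch.randn(4, 1), "macros": torch.randn(4, 3),
          "ingredients": torch.randint(0, 2, (4, 100)).float()}
loss = multitask_recipe_loss(pred, target)
```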

Exploring Spatial-Temporal Representations for fNIRS-based Intimacy Detection via an Attention-enhanced Cascade Convolutional Recurrent Neural Network

Chao Li, Qian Zhang, Ziping Zhao

Auto-TLDR; Intimate Relationship Prediction by Attention-enhanced Cascade Convolutional Recurrent Neural Network Using Functional Near-Infrared Spectroscopy

The detection of intimacy plays a crucial role in improving intimate relationships, which contributes to family and social harmony. Previous studies have shown that different degrees of intimacy show significant differences in brain imaging. Recently, a few works have emerged that recognise intimacy automatically by using machine learning techniques. Moreover, considering the temporal dynamics of intimate relationships at the neural level, how to model spatio-temporal dynamics effectively for intimacy prediction is still a challenge. In this paper, we propose a novel method to explore deep spatial-temporal representations for intimacy prediction with an Attention-enhanced Cascade Convolutional Recurrent Neural Network (ACCRNN). Given the advantages of time-frequency resolution in the analysis of complex neuronal activities, this paper utilizes functional near-infrared spectroscopy (fNIRS) to analyse and infer intimate relationships. We collect an fNIRS-based dataset for the analysis of intimate relationships. Forty-two-channel fNIRS signals are recorded from the 44 subjects' prefrontal cortex as they watch a total of 18 photos of lovers, friends and strangers for 30 seconds per photo. The experimental results show that our proposed method outperforms the others in terms of accuracy, with a precision of 96.5%. To the best of our knowledge, this is the first time that such a hybrid deep architecture has been employed for fNIRS-based intimacy prediction.

Answer-Checking in Context: A Multi-Modal Fully Attention Network for Visual Question Answering

Hantao Huang, Tao Han, Wei Han, Deep Yap, Cheng-Ming Chiang

Auto-TLDR; Fully Attention Based Visual Question Answering

Visual Question Answering (VQA) is challenging due to the complex cross-modality relations. It has received extensive attention from the research community. From the human perspective, to answer a visual question, one needs to read the question and then refer to the image to generate an answer. Such an answer is then checked against the question and image again for the final confirmation. In this paper, we mimic this process and propose a fully attention based VQA architecture. Moreover, an answer-checking module is proposed to perform a unified attention on the joint answer, question and image representation to update the answer. This mimics the human answer-checking process of considering the answer in context. With answer-checking modules and transferred BERT layers, our model achieves a state-of-the-art accuracy of 71.57% using fewer parameters on the VQA-v2.0 test-standard split.

Recursive Convolutional Neural Networks for Epigenomics

Aikaterini Symeonidi, Anguelos Nicolaou, Frank Johannes, Vincent Christlein

Auto-TLDR; Recursive Convolutional Neural Networks for Epigenomic Data Analysis

Deep learning has proven very promising for the analysis of genomic and epigenomic data. In this paper we introduce the use of Recursive Convolutional Neural Networks (RCNN) as a tool for epigenomic data analysis. We focus on the task of predicting gene expression from the intensity of histone modifications. The proposed RCNN architecture can be applied to data of arbitrary size and has a single meta-parameter that quantifies the model's capacity, making it flexible for experimentation. The proposed architecture outperforms state-of-the-art systems while having several orders of magnitude fewer parameters.
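The key property of the RCNN is that a single convolutional block with shared weights is applied recursively, so the network handles inputs of arbitrary length and its capacity is controlled by one meta-parameter. Below is a minimal sketch of that idea, with the filter count standing in for the capacity parameter; it is an illustration of the principle, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class RecursiveConvNet(nn.Module):
    """A single convolution block applied recursively with shared weights,
    halving the sequence length each pass until it fits a fixed-size head.
    `capacity` (the number of filters) is the lone meta-parameter (sketch)."""
    def __init__(self, in_channels, capacity=32, n_outputs=1):
        super().__init__()
        self.embed = nn.Conv1d(in_channels, capacity, kernel_size=1)
        self.block = nn.Sequential(
            nn.Conv1d(capacity, capacity, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.head = nn.Linear(capacity, n_outputs)

    def forward(self, x):                 # x: (batch, in_channels, length)
        h = self.embed(x)
        while h.size(-1) > 1:             # the same weights are reused at every scale
            h = self.block(h)
        return self.head(h.squeeze(-1))
```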

Zero-Shot Text Classification with Semantically Extended Graph Convolutional Network

Tengfei Liu, Yongli Hu, Junbin Gao, Yanfeng Sun, Baocai Yin

Auto-TLDR; Semantically Extended Graph Convolutional Network for Zero-shot Text Classification

As a challenging task in Natural Language Processing (NLP), zero-shot text classification has attracted increasing attention recently. It aims to detect classes that the model has never seen in the training set. For this purpose, a feasible approach is to construct connections between seen and unseen classes via semantic extension and to classify unseen classes by propagating information over these connections. Although many zero-shot text classification methods have been explored, how to realize semantic extension properly and propagate information effectively is far from solved. In this paper, we propose a novel zero-shot text classification method called Semantically Extended Graph Convolutional Network (SEGCN). In the proposed method, semantic category knowledge from ConceptNet is utilized for semantic extension, linking seen classes to unseen classes and constructing a graph over all classes. Then, we build on a Graph Convolutional Network (GCN) to predict a textual classifier for each category; the GCN transfers category knowledge through convolution operators on the constructed graph and is trained in a semi-supervised manner using samples of the seen classes. Experimental results on the Dbpedia and 20newsgroup datasets show that our method outperforms state-of-the-art zero-shot text classification methods.
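The propagation step can be pictured as a small GCN whose node inputs are class embeddings and whose node outputs are per-class classifier weights, as in the hedged sketch below; the normalization scheme and layer sizes are generic GCN choices, not necessarily those of SEGCN.

```python
import torch
import torch.nn as nn

class ClassifierGCN(nn.Module):
    """Two-layer GCN over the class graph: node inputs are class embeddings
    (e.g., ConceptNet-based), node outputs are the weight vectors of per-class
    text classifiers (sketch)."""
    def __init__(self, in_dim, hidden_dim, text_feat_dim):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hidden_dim)
        self.w2 = nn.Linear(hidden_dim, text_feat_dim)

    def forward(self, class_feats, adj_norm):
        # adj_norm: symmetrically normalized adjacency D^-1/2 (A + I) D^-1/2
        h = torch.relu(adj_norm @ self.w1(class_feats))
        return adj_norm @ self.w2(h)      # one classifier vector per class

def zero_shot_logits(text_features, classifier_weights):
    # score every document against every (seen or unseen) class classifier
    return text_features @ classifier_weights.t()
```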

Leveraging Unlabeled Data for Glioma Molecular Subtype and Survival Prediction

Nicholas Nuechterlein, Beibin Li, Mehmet Saygin Seyfioglu, Sachin Mehta, Patrick Cimino, Linda Shapiro

Auto-TLDR; Multimodal Brain Tumor Segmentation Using Unlabeled MR Data and Genomic Data for Cancer Prediction

In this paper, we address two long-standing challenges in neuro-oncology: (1) how to leverage large amounts of unlabeled magnetic resonance (MR) imaging data for radiogenomic tasks and (2) how to unite glioma MR imaging with genomic data. We examine multi-parametric MR data from 542 patients in the combined training, validation, and testing sets of the 2018 Multimodal Brain Tumor Segmentation Challenge and somatic copy number alteration (SCNA) data from 1090 patients in The Cancer Genome Atlas (TCGA) lower-grade glioma and glioblastoma projects. We propose a novel application of multi-task learning (MTL) that leverages unlabeled MR data by jointly learning tumor segmentation masks with glioma molecular subtype markers and allows for SCNA input when available. There are 235 patients in the intersection of these MR and SCNA datasets, which we divide into an unlabeled training set, a labeled training set, and a validation set. Our MTL model significantly outperforms comparable classification models trained only on labeled MR data for both IDH1/2 mutation and 1p/19q co-deletion glioma subtype marker prediction tasks. We also observe that models trained on genomic and imaging data improve survival prediction results achieved by models trained on either alone. We will release our source code for future research.
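One way to realize the described multi-task setup is a shared MR encoder feeding both a segmentation head (trainable on scans without subtype labels) and a subtype head that optionally concatenates SCNA features. The sketch below is purely illustrative; the layer shapes, the four-modality input, and the handling of missing SCNA data are assumptions, not the authors' network.

```python
import torch
import torch.nn as nn

class GliomaMTLSketch(nn.Module):
    """Shared MR encoder with a segmentation head (usable on scans lacking
    subtype labels) and a molecular-subtype head that can also consume SCNA
    features when available (illustrative sketch)."""
    def __init__(self, scna_dim=128, n_subtype_tasks=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(4, 16, 3, padding=1), nn.ReLU(),    # 4 MR modalities in
            nn.Conv3d(16, 32, 3, padding=1), nn.ReLU(),
        )
        self.seg_head = nn.Conv3d(32, 4, 1)               # tumor sub-region mask
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.scna_proj = nn.Linear(scna_dim, 32)
        self.subtype_head = nn.Linear(32 + 32, n_subtype_tasks)  # IDH, 1p/19q

    def forward(self, mr, scna=None):
        f = self.encoder(mr)                              # mr: (B, 4, D, H, W)
        seg = self.seg_head(f)
        g = self.pool(f).flatten(1)
        if scna is None:                                  # SCNA input is optional
            scna_feat = torch.zeros(g.size(0), 32, device=g.device)
        else:
            scna_feat = torch.relu(self.scna_proj(scna))
        subtype = self.subtype_head(torch.cat([g, scna_feat], dim=1))
        return seg, subtype
```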

Deep Learning on Active Sonar Data Using Bayesian Optimization for Hyperparameter Tuning

Henrik Berg, Karl Thomas Hjelmervik

Auto-TLDR; Bayesian Optimization for Sonar Operations in Littoral Environments

Sonar operations in littoral environments may be challenging due to an increased probability of false alarms. Machine learning can be used to train classifiers that filter out most of the false alarms automatically; however, this is a time-consuming process, with many hyperparameters that need to be tuned in order to yield useful results. In this paper, Bayesian optimization is used to search for good values for some of the hyperparameters, such as topology and training parameters, resulting in performance superior to earlier trial-and-error based training. Additionally, we analyze some of the parameters involved in the Bayesian optimization, as well as the resulting hyperparameter values.
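As a rough illustration of the procedure, a Gaussian-process surrogate can be fit to the observed (hyperparameters, validation loss) pairs and the next candidate chosen by expected improvement. The sketch below uses scikit-learn with random candidate sampling; the `objective` function, `bounds`, and evaluation budget are placeholders for whatever classifier and hyperparameters are being tuned, not the authors' setup.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(gp, X_cand, y_best):
    """Expected-improvement acquisition for minimization."""
    mu, sigma = gp.predict(X_cand, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (y_best - mu) / sigma
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def bayes_opt(objective, bounds, n_init=5, n_iter=20, seed=0):
    """Minimize `objective` (e.g., validation loss of a sonar classifier as a
    function of its hyperparameters) with a GP surrogate (sketch)."""
    rng = np.random.default_rng(seed)
    dim = len(bounds)
    lo, hi = np.array(bounds).T
    X = rng.uniform(lo, hi, size=(n_init, dim))           # random initial design
    y = np.array([objective(x) for x in X])
    for _ in range(n_iter):
        gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
        cand = rng.uniform(lo, hi, size=(1024, dim))      # random candidate pool
        x_next = cand[np.argmax(expected_improvement(gp, cand, y.min()))]
        X = np.vstack([X, x_next])
        y = np.append(y, objective(x_next))
    return X[np.argmin(y)], y.min()
```

Here `objective(x)` would train the classifier with the hyperparameters encoded in `x` (e.g., number of hidden units, learning rate) and return its validation loss.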

Text Synopsis Generation for Egocentric Videos

Aidean Sharghi, Niels Lobo, Mubarak Shah

Auto-TLDR; Egocentric Video Summarization Using Multi-task Learning for End-to-End Learning

Mass utilization of body-worn cameras has led to a huge corpus of available egocentric video. Existing video summarization algorithms can accelerate browsing such videos by selecting (visually) interesting shots from them. Nonetheless, since the system user still has to watch the summary videos, browsing large video databases remains a challenge. Hence, in this work, we propose to generate a textual synopsis consisting of a few sentences describing the most important events in a long egocentric video. Users can read the short text to gain insight into the video and, more importantly, efficiently search through the content of a large video database using text queries. Since egocentric videos are long and contain many activities and events, using video-to-text algorithms results in thousands of descriptions, many of which are incorrect. Therefore, we propose a multi-task learning scheme to simultaneously generate descriptions for video segments and summarize the resulting descriptions in an end-to-end fashion. We input a set of video shots, and the network generates a text description for each shot. Next, a visual-language content-matching unit, trained with a weakly supervised objective, identifies the correct descriptions. Finally, the last component of our network, called the purport network, evaluates the descriptions all together to select the ones containing crucial information. Out of the thousands of descriptions generated for the video, a few informative sentences are returned to the user. We validate our framework on the challenging UT Egocentric video dataset, where each video is 3 to 5 hours long and is associated with over 3000 textual descriptions on average. The generated textual summaries, including only 5 percent (or less) of the generated descriptions, are compared to ground-truth summaries in the text domain using well-established metrics in natural language processing.
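The selection stage, where each description is judged both against its shot and in the context of all other descriptions, can be sketched as follows. The transformer-based scorer and the simple multiplicative combination of matching and purport scores are assumptions for illustration, not the paper's exact components.

```python
import torch
import torch.nn as nn

class PurportNetSketch(nn.Module):
    """Scores each shot description in the context of all the others; the
    captioner and the weakly supervised matching unit are assumed to exist
    upstream (sketch)."""
    def __init__(self, dim=512, n_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads,
                                           batch_first=True)
        self.context = nn.TransformerEncoder(layer, num_layers=2)
        self.score = nn.Linear(dim, 1)

    def forward(self, description_embeddings):   # (B, n_descriptions, dim)
        h = self.context(description_embeddings) # each description sees the rest
        return self.score(h).squeeze(-1)         # importance score per description

def select_synopsis(descriptions, match_scores, purport_scores, keep_ratio=0.05):
    """Keep only descriptions that both match their shot and carry key information."""
    combined = match_scores * torch.sigmoid(purport_scores)
    k = max(1, int(keep_ratio * len(descriptions)))
    idx = torch.topk(combined, k).indices.sort().values
    return [descriptions[i] for i in idx]
```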

Improving Robotic Grasping on Monocular Images Via Multi-Task Learning and Positional Loss

William Prew, Toby Breckon, Magnus Bordewich, Ulrik Beierholm

Auto-TLDR; Improving grasping performance from monocularcolour images in an end-to-end CNN architecture with multi-task learning

In this paper we introduce two methods of improving real-time object grasping performance from monocular colour images in an end-to-end CNN architecture. The first is the addition of an auxiliary task during model training (multi-task learning). Our multi-task CNN model improves grasping performance from a baseline average of 72.04% to 78.14% on the large Jacquard grasping dataset when performing a supplementary depth reconstruction task. The second is the introduction of a positional loss function that emphasises loss per pixel for secondary parameters (gripper angle and width) only on points of an object where a successful grasp can take place. This increases performance from a baseline average of 72.04% to 78.92%, as well as reducing the number of training epochs required. These methods can also be performed in tandem, resulting in a further performance increase to 79.12%, while maintaining sufficient inference speed to enable processing at 50 FPS.
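The positional loss can be illustrated as a per-pixel loss for the secondary grasp parameters that is masked by the ground-truth grasp-quality map, as in the sketch below; the specific loss terms and their equal weighting are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def positional_loss(pred_quality, pred_angle, pred_width,
                    gt_quality, gt_angle, gt_width):
    """Grasp-map loss where the angle and width terms are only counted on
    pixels where a successful grasp can occur (sketch of the idea)."""
    # grasp-quality map is supervised everywhere
    loss_q = F.binary_cross_entropy_with_logits(pred_quality, gt_quality)

    # secondary parameters supervised only on graspable pixels
    mask = (gt_quality > 0.5).float()
    n = mask.sum().clamp(min=1.0)
    loss_angle = (F.mse_loss(pred_angle, gt_angle, reduction="none") * mask).sum() / n
    loss_width = (F.mse_loss(pred_width, gt_width, reduction="none") * mask).sum() / n
    return loss_q + loss_angle + loss_width
```

In the tandem setting described above, an additional depth-reconstruction term would simply be added to this sum as the auxiliary multi-task objective.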

Contextual Classification Using Self-Supervised Auxiliary Models for Deep Neural Networks

Sebastian Palacio, Philipp Engler, Jörn Hees, Andreas Dengel

Auto-TLDR; Self-Supervised Autogenous Learning for Deep Neural Networks

Classification problems solved with deep neural networks (DNNs) typically rely on a closed-world paradigm and optimize over a single objective (e.g., minimization of the cross-entropy loss). This setup dismisses all kinds of supporting signals that can be used to reinforce the existence or absence of particular patterns. The increasing need for models that are interpretable by design makes the inclusion of such contextual signals a crucial necessity. To this end, we introduce the notion of Self-Supervised Autogenous Learning (SSAL). An SSAL objective is realized through one or more additional targets that are derived from the original supervised classification task, following architectural principles found in multi-task learning. SSAL branches impose low-level priors into the optimization process (e.g., grouping). The ability to use SSAL branches during inference allows models to converge faster, focusing on a richer set of class-relevant features. We equip state-of-the-art DNNs with SSAL objectives and report consistent improvements for all of them on CIFAR100 and ImageNet. We show that SSAL models outperform similar state-of-the-art methods focused on contextual loss functions, auxiliary branches and hierarchical priors.
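A minimal SSAL-style setup derives an auxiliary target directly from the supervised labels, for example a coarse grouping of the original classes, and trains an extra branch on it. The sketch below assumes a generic backbone and a CIFAR100-like fine-to-coarse label mapping; the branch placement and loss weight are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SSALModel(nn.Module):
    """Backbone with the original classification head plus an auxiliary branch
    whose target (a coarse grouping of the original classes) is derived from
    the supervised labels themselves (sketch of the SSAL idea)."""
    def __init__(self, backbone, feat_dim, n_classes=100, n_groups=20):
        super().__init__()
        self.backbone = backbone
        self.main_head = nn.Linear(feat_dim, n_classes)
        self.aux_head = nn.Linear(feat_dim, n_groups)   # low-level grouping prior

    def forward(self, x):
        f = self.backbone(x)
        return self.main_head(f), self.aux_head(f)

def ssal_loss(main_logits, aux_logits, labels, class_to_group, aux_weight=0.3):
    """`class_to_group` maps each fine label to its coarse group, so the
    auxiliary target is autogenously derived from the original supervision."""
    group_labels = class_to_group[labels]
    return (F.cross_entropy(main_logits, labels)
            + aux_weight * F.cross_entropy(aux_logits, group_labels))
```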

Transformer Reasoning Network for Image-Text Matching and Retrieval

Nicola Messina, Fabrizio Falchi, Andrea Esuli, Giuseppe Amato

Auto-TLDR; A Transformer Encoder Reasoning Network for Image-Text Matching in Large-Scale Information Retrieval

Image-text matching is a fascinating task in modern AI research. Despite the evolution of deep-learning-based image and text processing systems, multi-modal matching remains a challenging problem. In this work, we consider the problem of accurate image-text matching for the task of multi-modal large-scale information retrieval. State-of-the-art results in image-text matching are achieved by interplaying image and text features from the two different processing pipelines, usually through mutual attention mechanisms. However, this precludes extracting the separate visual and textual features needed for later indexing steps in large-scale retrieval systems. In this regard, we introduce the Transformer Encoder Reasoning Network (TERN), an architecture built upon one of the modern relationship-aware self-attentive architectures, the Transformer Encoder (TE). This architecture is able to reason separately on the two modalities and to enforce a final common abstract concept space by sharing the weights of the deeper transformer layers. Thanks to this design, the implemented network produces compact and very rich visual and textual features available for the successive indexing step. Experiments are conducted on the MS-COCO dataset, and we evaluate the results using a discounted cumulative gain metric with relevance computed from caption similarities, in order to assess possibly non-exact but relevant search results. We demonstrate that on this metric we achieve state-of-the-art results in the image retrieval task. Our code is freely available at https://github.com/mesnico/TERN.
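The weight-sharing idea, i.e. private lower transformer layers per modality and shared deeper layers producing separately indexable global features, can be sketched as below; dimensions, depths, and the mean-pooling choice are assumptions rather than the published TERN configuration.

```python
import torch
import torch.nn as nn

class TERNSketch(nn.Module):
    """Separate transformer-encoder stacks reason over image regions and over
    word tokens; the deeper layers share weights so both modalities land in a
    common abstract space, and each side ends in a compact global feature
    usable for indexing (sketch)."""
    def __init__(self, dim=768, n_heads=8, n_private=2, n_shared=2):
        super().__init__()
        def stack(n):
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads,
                                               batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=n)
        self.visual_private = stack(n_private)
        self.text_private = stack(n_private)
        self.shared = stack(n_shared)          # weights shared across modalities

    def encode(self, tokens, private):
        h = self.shared(private(tokens))
        return nn.functional.normalize(h.mean(dim=1), dim=-1)  # pooled feature

    def forward(self, region_feats, word_feats):
        v = self.encode(region_feats, self.visual_private)
        t = self.encode(word_feats, self.text_private)
        return v @ t.t()                       # image-text similarity matrix
```

Because `encode` can be run on either modality alone, the pooled vectors can be stored in an index and compared offline, which is the property the abstract emphasizes for large-scale retrieval.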

Analyzing Zero-Shot Cross-Lingual Transfer in Supervised NLP Tasks

Hyunjin Choi, Judong Kim, Seongho Joe, Seungjai Min, Youngjune Gwon

Auto-TLDR; Cross-Lingual Language Model Pretraining for Zero-Shot Cross-lingual Transfer

In zero-shot cross-lingual transfer, a supervised NLP task trained on a corpus in one language is directly applicable to another language without any additional training. A source of cross-lingual transfer can be as straightforward as lexical overlap between languages (e.g., use of the same scripts, shared subwords), which naturally forces text embeddings to occupy a similar representation space. Recently introduced cross-lingual language model (XLM) pretraining brings out neural parameter sharing in Transformer-style networks as the most important factor for the transfer. In this paper, we aim to validate the hypothetically strong cross-lingual transfer properties induced by XLM pretraining. In particular, we take XLM-RoBERTa (XLM-R) in our experiments, which extend semantic textual similarity (STS), SQuAD and KorQuAD for machine reading comprehension, sentiment analysis, and alignment of sentence embeddings under various cross-lingual settings. Our results indicate that cross-lingual transfer is most pronounced in STS, followed by sentiment analysis, and weakest in MRC; that is, the complexity of a downstream task softens the degree of cross-lingual transfer. All of our results are empirically observed and measured, and we make our code and data publicly available.
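The evaluation protocol can be illustrated with XLM-R used as a multilingual encoder: a model tuned (or, as in the simplest STS baseline, used as-is) on English data is applied unchanged to another language. The sketch below uses the Hugging Face `transformers` API with mean pooling and cosine similarity; it shows only the zero-shot application step, not the fine-tuning described in the paper.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load XLM-R once; the same encoder is then applied unchanged to sentences
# in any of its pretraining languages.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")
model.eval()

def embed(sentences):
    """Mean-pooled XLM-R sentence embeddings (one simple pooling choice)."""
    batch = tokenizer(sentences, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state       # (B, L, dim)
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)

def sts_score(sent_a, sent_b):
    """Cosine similarity as the STS prediction, usable across languages."""
    a, b = embed([sent_a]), embed([sent_b])
    return torch.cosine_similarity(a, b).item()

# e.g. an English/Korean pair (same meaning) scored with no Korean training data
print(sts_score("A man is playing a guitar.", "한 남자가 기타를 치고 있다."))
```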

CKG: Dynamic Representation Based on Context and Knowledge Graph

Xunzhu Tang, Tiezhu Sun, Rujie Zhu

Auto-TLDR; CKG: Dynamic Representation Based on Knowledge Graph for Language Sentences

Recently, neural language representation models pre-trained on large corpora have been shown to capture rich co-occurrence information and can be fine-tuned on downstream tasks to improve performance. As a result, they have achieved state-of-the-art results across a wide range of language tasks. However, external knowledge graphs (KGs) contain other valuable semantic information, such as similar, opposite, or other possible meanings. We argue that entities in KGs can be used to enhance the correct semantic meaning of language sentences. In this paper, we propose a new method, CKG: Dynamic Representation Based on Context and Knowledge Graph. On the one hand, CKG can extract rich semantic information from a large corpus; on the other hand, it makes full use of internal information, such as co-occurrence in the corpus, and external information, such as similar entities in KGs. We conduct extensive experiments on a wide range of tasks, including QQP, MRPC, SST-5, SQuAD, CoNLL 2003, and SNLI. The experimental results show that CKG achieves a state-of-the-art score of 89.2 on SQuAD, compared with SAN (84.4), ELMo (85.8), and BERT-Base (88.5).
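The general idea, enriching contextual token representations with embeddings of linked KG entities, can be sketched as a gated fusion layer; the dimensions, the gating mechanism, and the entity-linking step it presupposes are assumptions for illustration, not CKG's actual design.

```python
import torch
import torch.nn as nn

class ContextKGFusion(nn.Module):
    """Gated fusion of a contextual token representation (from a pre-trained LM)
    with the embedding of the KG entity linked to that token; tokens without a
    linked entity keep their contextual vector (sketch of the general idea)."""
    def __init__(self, ctx_dim=768, kg_dim=200):
        super().__init__()
        self.kg_proj = nn.Linear(kg_dim, ctx_dim)
        self.gate = nn.Linear(2 * ctx_dim, ctx_dim)

    def forward(self, ctx_emb, kg_emb, has_entity):
        # ctx_emb: (B, L, ctx_dim); kg_emb: (B, L, kg_dim); has_entity: (B, L) bool
        kg = torch.relu(self.kg_proj(kg_emb))
        g = torch.sigmoid(self.gate(torch.cat([ctx_emb, kg], dim=-1)))
        fused = g * ctx_emb + (1.0 - g) * kg
        return torch.where(has_entity.unsqueeze(-1), fused, ctx_emb)
```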