Interpretable Structured Learning with Sparse Gated Sequence Encoder for Protein-Protein Interaction Prediction

Kishan K C, Feng Cui, Anne Haake, Rui Li

Responsive image

Auto-TLDR; Predicting Protein-Protein Interactions Using Sequence Representations

Slides Poster

Predicting protein-protein interactions (PPIs) by learning informative representations from amino acid sequences is a challenging yet important problem in biology. Although various deep learning models in Siamese architecture have been proposed to model PPIs from sequences, these methods are computationally expensive for a large number of PPIs due to the pairwise encoding process. Furthermore, these methods are difficult to interpret because of non-intuitive mappings from protein sequences to their sequence representation. To address these challenges, we present a novel deep framework to model and predict PPIs from sequence alone. Our model incorporates a bidirectional gated recurrent unit to learn sequence representations by leveraging contextualized and sequential information from sequences. We further employ a sparse regularization to model long-range dependencies between amino acids and to select important amino acids (protein motifs), thus enhancing interpretability. Besides, the novel design of the encoding process makes our model computationally efficient and scalable to an increasing number of interactions. Experimental results on up-to-date interaction datasets demonstrate that our model achieves superior performance compared to other state-of-the-art methods. Literature-based case studies illustrate the ability of our model to provide biological insights to interpret the predictions.

Similar papers

Trajectory-User Link with Attention Recurrent Networks

Tao Sun, Yongjun Xu, Fei Wang, Lin Wu, 塘文 钱, Zezhi Shao

Responsive image

Auto-TLDR; TULAR: Trajectory-User Link with Attention Recurrent Neural Networks

Slides Poster Similar

The prevalent adoptions of GPS-enabled devices have witnessed an explosion of various location-based services which produces a huge amount of trajectories monitoring the individuals' movements. In this paper, we tackle Trajectory-User Link (TUL) problem, which identifies humans' movement patterns and links trajectories to the users who generated them. Existing solutions on TUL problem employ recurrent neural networks and variational autoencoder methods, which face the bottlenecks in the case of excessively long trajectories and fragmentary users' movements. However, these are common characteristics of trajectory data in reality, leading to performance degradation of the existing models. In this paper, we propose an end-to-end attention recurrent neural learning framework, called TULAR (Trajectory-User Link with Attention Recurrent Networks), which focus on selected parts of the source trajectories when linking. TULAR introduce the Trajectory Semantic Vector (TSV) via unsupervised location representation learning and recurrent neural networks, by which to reckon the weight of parts of source trajectory. Further, we employ three attention scores for the weight measurements. Experiments are conducted on two real world datasets and compared with several existing methods, and the results show that TULAR yields a new state-of-the-art performance. Source code is public available at GitHub: https://github.com/taos123/TULAR.

Temporal Pattern Detection in Time-Varying Graphical Models

Federico Tomasi, Veronica Tozzo, Annalisa Barla

Responsive image

Auto-TLDR; A dynamical network inference model that leverages on kernels to consider general temporal patterns

Slides Poster Similar

Graphical models allow to describe the interplay among variables of a system through a compact representation, suitable when relations evolve over time. For example, in a biological setting, genes interact differently depending on external environmental or metabolic factors. To incorporate this dynamics a viable strategy is to estimate a sequence of temporally related graphs assuming similarity among samples in different time points. While adjacent time points may direct the analysis towards a robust estimate of the underlying graph, the resulting model will not incorporate long-term or recurrent temporal relationships. In this work we propose a dynamical network inference model that leverages on kernels to consider general temporal patterns (such as circadian rhythms or seasonality). We show how our approach may also be exploited when the recurrent patterns are unknown, by coupling the network inference with a clustering procedure that detects possibly non-consecutive similar networks. Such clusters are then used to build similarity kernels. The convexity of the functional is determined by whether we impose or infer the kernel. In the first case, the optimisation algorithm exploits efficiently proximity operators with closed-form solutions. In the other case, we resort to an alternating minimisation procedure which jointly learns the temporal kernel and the underlying network. Extensive analysis on synthetic data shows the efficacy of our models compared to state-of-the-art methods. Finally, we applied our approach on two real-world applications to show how considering long-term patterns is fundamental to have insights on the behaviour of a complex system.

Region and Relations Based Multi Attention Network for Graph Classification

Manasvi Aggarwal, M. Narasimha Murty

Responsive image

Auto-TLDR; R2POOL: A Graph Pooling Layer for Non-euclidean Structures

Slides Poster Similar

Graphs are non-euclidean structures that can represent many relational data efficiently. Many studies have proposed the convolution and the pooling operators on the non-euclidean domain. The graph convolution operators have shown astounding performance on various tasks such as node representation and classification. For graph classification, different pooling techniques are introduced, but none of them has considered both neighborhood of the node and the long-range dependencies of the node. In this paper, we propose a novel graph pooling layer R2POOL, which balances the structure information around the node as well as the dependencies with far away nodes. Further, we propose a new training strategy to learn coarse to fine representations. We add supervision at only intermediate levels to generate predictions using only intermediate-level features. For this, we propose the concept of an alignment score. Moreover, each layer's prediction is controlled by our proposed branch training strategy. This complete training helps in learning dominant class features at each layer for representing graphs. We call the combined model by R2MAN. Experiments show that R2MAN the potential to improve the performance of graph classification on various datasets.

AG-GAN: An Attentive Group-Aware GAN for Pedestrian Trajectory Prediction

Yue Song, Niccolò Bisagno, Syed Zohaib Hassan, Nicola Conci

Responsive image

Auto-TLDR; An attentive group-aware GAN for motion prediction in crowded scenarios

Slides Poster Similar

Understanding human behaviors in crowded scenarios requires analyzing not only the position of the subjects in space, but also the scene context. Existing approaches mostly rely on the motion history of each pedestrian and model the interactions among people by considering the entire surrounding neighborhood. In our approach, we address the problem of motion prediction by applying coherent group clustering and a global attention mechanism on the LSTM-based Generative Adversarial Networks (GANs). The proposed model consists of an attentive group-aware GAN that observes the agents' past motion and predicts future paths, using (i) a group pooling module to model neighborhood interaction, and (ii) an attention module to specifically focus on hidden states. The experimental results demonstrate that our proposal outperforms state-of-the-art models on common benchmark datasets, and is able to generate socially-acceptable trajectories.

End-To-End Multi-Task Learning of Missing Value Imputation and Forecasting in Time-Series Data

Jinhee Kim, Taesung Kim, Jang-Ho Choi, Jaegul Choo

Responsive image

Auto-TLDR; Time-Series Prediction with Denoising and Imputation of Missing Data

Slides Poster Similar

Multivariate time-series prediction is a common task, but it often becomes challenging due to missing values involved in data caused by unreliable sensors and other issues. In fact, inaccurate imputation of missing values can degrade the downstream prediction performance, so it may be better not to rely on the estimated values of missing data. Furthermore, observed data may contain noise, so denoising them can be helpful for the main task at hand. In response, we propose a novel approach that can automatically utilize the optimal combination of the observed and the estimated values to generate not only complete, but also noise-reduced data by our own gating mechanism. We evaluate our model on real-world time-series datasets and achieved state-of-the-art performance, demonstrating that our method successfully handle the incomplete datasets. Moreover, we present in-depth studies using a carefully designed, synthetic multivariate time-series dataset to verify the effectiveness of the proposed model. The ablation studies and the experimental analysis of the proposed gating mechanism show that the proposed method works as an effective denoising as well as imputation method for time-series classification tasks.

Recursive Convolutional Neural Networks for Epigenomics

Aikaterini Symeonidi, Anguelos Nicolaou, Frank Johannes, Vincent Christlein

Responsive image

Auto-TLDR; Recursive Convolutional Neural Networks for Epigenomic Data Analysis

Slides Poster Similar

Deep learning for epigenomic data analysis has demonstrated to be very promising for analysis of genomic and epigenomic data. In this paper we introduce the use of Recursive Convolutional Neural Networks (RCNN) as tool for epigenomic data analysis. We focus on the task of predicting gene expression from the intensity of histone modifications. The proposed RCNN architecture can be applied on data of an arbitrary size and has a single meta-parameter that quantifies the models capacity making it flexible for experimenting. The proposed architecture outperforms state-of-the-art systems while having several orders of magnitude fewer parameters.

Aggregating Dependent Gaussian Experts in Local Approximation

Hamed Jalali, Gjergji Kasneci

Responsive image

Auto-TLDR; A novel approach for aggregating the Gaussian experts by detecting strong violations of conditional independence

Slides Poster Similar

Distributed Gaussian processes (DGPs) are prominent local approximation methods to scale Gaussian processes (GPs) to large datasets. Instead of a global estimation, they train local experts by dividing the training set into subsets, thus reducing the time complexity. This strategy is based on the conditional independence assumption, which basically means that there is a perfect diversity between the local experts. In practice, however, this assumption is often violated, and the aggregation of experts leads to sub-optimal and inconsistent solutions. In this paper, we propose a novel approach for aggregating the Gaussian experts by detecting strong violations of conditional independence. The dependency between experts is determined by using a Gaussian graphical model, which yields the precision matrix. The precision matrix encodes conditional dependencies between experts and is used to detect strongly dependent experts and construct an improved aggregation. Using both synthetic and real datasets, our experimental evaluations illustrate that our new method outperforms other state-of-the-art (SOTA) DGP approaches while being substantially more time-efficient than SOTA approaches, which build on independent experts.

Road Network Metric Learning for Estimated Time of Arrival

Yiwen Sun, Kun Fu, Zheng Wang, Changshui Zhang, Jieping Ye

Responsive image

Auto-TLDR; Road Network Metric Learning for Estimated Time of Arrival (RNML-ETA)

Slides Poster Similar

Recently, deep learning have achieved promising results in Estimated Time of Arrival (ETA), which is considered as predicting the travel time from the origin to the destination along a given path. One of the key techniques is to use embedding vectors to represent the elements of road network, such as the links (road segments). However, the embedding suffers from the data sparsity problem that many links in the road network are traversed by too few floating cars even in large ride-hailing platforms like Uber and DiDi. Insufficient data makes the embedding vectors in an under-fitting status, which undermines the accuracy of ETA prediction. To address the data sparsity problem, we propose the Road Network Metric Learning framework for ETA (RNML ETA). It consists of two components: (1) a main regression task to predict the travel time, and (2) an auxiliary metric learning task to improve the quality of link embedding vectors. We further propose the triangle loss, a novel loss function to improve the efficiency of metric learning. We validated the effectiveness of RNML-ETA on large scale real-world datasets, by showing that our method outperforms the state-of-the-art model and the promotion concentrates on the cold links with few data.

Switching Dynamical Systems with Deep Neural Networks

Cesar Ali Ojeda Marin, Kostadin Cvejoski, Bogdan Georgiev, Ramses J. Sanchez

Responsive image

Auto-TLDR; Variational RNN for Switching Dynamics

Slides Poster Similar

The problem of uncovering different dynamicalregimes is of pivotal importance in time series analysis. Switchingdynamical systems provide a solution for modeling physical phe-nomena whose time series data exhibit different dynamical modes.In this work we propose a novel variational RNN model forswitching dynamics allowing for both non-Markovian and non-linear dynamical behavior between and within dynamic modes.Attention mechanisms are provided to inform the switchingdistribution. We evaluate our model on synthetic and empiricaldatasets of diverse nature and successfully uncover differentdynamical regimes and predict the switching dynamics.

Exploring Spatial-Temporal Representations for fNIRS-based Intimacy Detection via an Attention-enhanced Cascade Convolutional Recurrent Neural Network

Chao Li, Qian Zhang, Ziping Zhao

Responsive image

Auto-TLDR; Intimate Relationship Prediction by Attention-enhanced Cascade Convolutional Recurrent Neural Network Using Functional Near-Infrared Spectroscopy

Slides Poster Similar

The detection of intimacy plays a crucial role in the improvement of intimate relationship, which contributes to promote the family and social harmony. Previous studies have shown that different degrees of intimacy have significant differences in brain imaging. Recently, a few of work has emerged to recognise intimacy automatically by using machine learning technique. Moreover, considering the temporal dynamic characteristics of intimacy relationship on neural mechanism, how to model spatio-temporal dynamics for intimacy prediction effectively is still a challenge. In this paper, we propose a novel method to explore deep spatial-temporal representations for intimacy prediction by Attention-enhanced Cascade Convolutional Recurrent Neural Network (ACCRNN). Given the advantages of time-frequency resolution in complex neuronal activities analysis, this paper utilizes functional near-infrared spectroscopy (fNIRS) to analyse and infer to intimate relationship. We collect a fNIRS-based dataset for the analysis of intimate relationship. Forty-two-channel fNIRS signals are recorded from the 44 subjects' prefrontal cortex when they watched a total of 18 photos of lovers, friends and strangers for 30 seconds per photo. The experimental results show that our proposed method outperforms the others in terms of accuracy with the precision of 96.5%. To the best of our knowledge, this is the first time that such a hybrid deep architecture has been employed for fNIRS-based intimacy prediction.

PIN: A Novel Parallel Interactive Network for Spoken Language Understanding

Peilin Zhou, Zhiqi Huang, Fenglin Liu, Yuexian Zou

Responsive image

Auto-TLDR; Parallel Interactive Network for Spoken Language Understanding

Slides Poster Similar

Spoken Language Understanding (SLU) is an essential part of the spoken dialogue system, which typically consists of intent detection (ID) and slot filling (SF) tasks. Recently, recurrent neural networks (RNNs) based methods achieved the state-of-the-art for SLU. It is noted that, in the existing RNN-based approaches, ID and SF tasks are often jointly modeled to utilize the correlation information between them. However, we noted that, so far, the efforts to obtain better performance by supporting bidirectional and explicit information exchange between ID and SF are not well studied. In addition, few studies attempt to capture the local context information to enhance the performance of SF. Motivated by these findings, in this paper, Parallel Interactive Network (PIN) is proposed to model the mutual guidance between ID and SF. Specifically, given an utterance, a Gaussian self-attentive encoder is introduced to generate the context-aware feature embedding of the utterance which is able to capture local context information. Taking the feature embedding of the utterance, Slot2Intent module and Intent2Slot module are developed to capture the bidirectional information flow for ID and SF tasks. Finally, a cooperation mechanism is constructed to fuse the information obtained from Slot2Intent and Intent2Slot modules to further reduce the prediction bias. The experiments on two benchmark datasets, i.e., SNIPS and ATIS, demonstrate the effectiveness of our approach, which achieves a competitive result with state-of-the-art models. More encouragingly, by using the feature embedding of the utterance generated by the pre-trained language model BERT, our method achieves the state-of-the-art among all comparison approaches.

Total Whitening for Online Signature Verification Based on Deep Representation

Xiaomeng Wu, Akisato Kimura, Kunio Kashino, Seiichi Uchida

Responsive image

Auto-TLDR; Total Whitening for Online Signature Verification

Slides Poster Similar

In deep metric learning targeted at time series, the correlation between feature activations may be easily enlarged through highly nonlinear neural networks, leading to suboptimal embedding effectiveness. An effective solution to this problem is whitening. For example, in online signature verification, whitening can be derived for three individual Gaussian distributions, namely the distributions of local features at all temporal positions 1) for all signatures of all subjects, 2) for all signatures of each particular subject, and 3) for each particular signature of each particular subject. This study proposes a unified method called total whitening that integrates these individual Gaussians. Total whitening rectifies the layout of multiple individual Gaussians to resemble a standard normal distribution, improving the balance between intraclass invariance and interclass discriminative power. Experimental results demonstrate that total whitening achieves state-of-the-art accuracy when tested on online signature verification benchmarks.

Automatically Mining Relevant Variable Interactions Via Sparse Bayesian Learning

Ryoichiro Yafune, Daisuke Sakuma, Yasuo Tabei, Noritaka Saito, Hiroto Saigo

Responsive image

Auto-TLDR; Sparse Bayes for Interpretable Non-linear Prediction

Slides Poster Similar

With the rapid increase in the availability of large amount of data, prediction is becoming increasingly popular, and has widespread through our daily life. However, powerful non- linear prediction methods such as deep learning and SVM suffer from interpretability problem, making it hard to use in domains where the reason for decision making is required. In this paper, we develop an interpretable non-linear model called itemset Sparse Bayes (iSB), which builds a Bayesian probabilistic model, while simultaneously considering variable interactions. In order to suppress the resulting large number of variables, sparsity is imposed on regression weights by a sparsity inducing prior. As a subroutine to search for variable interactions, itemset enumeration algorithm is employed with a novel bounding condition. In computational experiments using real-world dataset, the proposed method performed better than decision tree by 10% in terms of r-squared . We also demonstrated the advantage of our method in Bayesian optimization setting, in which the proposed approach could successfully find the maximum of an unknown function faster than Gaussian process. The interpretability of iSB is naturally inherited to Bayesian optimization, thereby gives us a clue to understand which variables interactions are important in optimizing an unknown function.

EasiECG: A Novel Inter-Patient Arrhythmia Classification Method Using ECG Waves

Chuanqi Han, Ruoran Huang, Fang Yu, Xi Huang, Li Cui

Responsive image

Auto-TLDR; EasiECG: Attention-based Convolution Factorization Machines for Arrhythmia Classification

Slides Poster Similar

Abstract—In an ECG record, the PQRST waves are of important medical significance which provide ample information reflecting heartbeat activities. In this paper, we propose a novel arrhythmia classification method namely EasiECG, characterized by simplicity and accuracy. Compared with other works, the EasiECG takes the configuration of these five key waves into account and does not require complicated feature engineering. Meanwhile, an additional encoding of the extracted features makes the EasiECG applicable even on samples with missing waves. To automatically capture interactions that contribute to the classification among the processed features, a novel adapted classification model named Attention-based Convolution Factorization Machines (ACFM) is proposed. In detail, the ACFM can learn both linear and high-order interactions from linear regression and convolution on outer-product feature interaction maps, respectively. After that, an attention mechanism implemented in the model can further assign different importance of these interactions when predicting certain types of heartbeats. To validate the effectiveness and practicability of our EasiECG, extensive experiments of inter-patient paradigm on the benchmark MIT-BIH arrhythmia database are conducted. To tackle the imbalanced sample problem in this dataset, an ingenious loss function: focal loss is adopted when training. The experiment results show that our method is competitive compared with other state-of-the-arts, especially in classifying the Supraventricular ectopic beats. Besides, the EasiECG achieves an overall accuracy of 87.6% on samples with a missing wave in the related experiment, demonstrating the robustness of our proposed method.

Constructing Geographic and Long-term Temporal Graph for Traffic Forecasting

Yiwen Sun, Yulu Wang, Kun Fu, Zheng Wang, Changshui Zhang, Jieping Ye

Responsive image

Auto-TLDR; GLT-GCRNN: Geographic and Long-term Temporal Graph Convolutional Recurrent Neural Network for Traffic Forecasting

Slides Poster Similar

Traffic forecasting influences various intelligent transportation system (ITS) services and is of great significance for user experience as well as urban traffic control. It is challenging due to the fact that the road network contains complex and time-varying spatial-temporal dependencies. Recently, deep learning based methods have achieved promising results by adopting graph convolutional network (GCN) to extract the spatial correlations and recurrent neural network (RNN) to capture the temporal dependencies. However, the existing methods often construct the graph only based on road network connectivity, which limits the interaction between roads. In this work, we propose Geographic and Long-term Temporal Graph Convolutional Recurrent Neural Network (GLT-GCRNN), a novel framework for traffic forecasting that learns the rich interactions between roads sharing similar geographic or long-term temporal patterns. Extensive experiments on a real-world traffic state dataset validate the effectiveness of our method by showing that GLT-GCRNN outperforms the state-of-the-art methods in terms of different metrics.

Privacy Attributes-Aware Message Passing Neural Network for Visual Privacy Attributes Classification

Hanbin Hong, Wentao Bao, Yuan Hong, Yu Kong

Responsive image

Auto-TLDR; Privacy Attributes-Aware Message Passing Neural Network for Visual Privacy Attribute Classification

Slides Poster Similar

Visual Privacy Attribute Classification (VPAC) identifies privacy information leakage via social media images. These images containing privacy attributes such as skin color, face or gender are classified into multiple privacy attribute categories in VPAC. With limited works in this task, current methods often extract features from images and simply classify the extracted feature into multiple privacy attribute classes. The dependencies between privacy attributes, e.g., skin color and face typically co-exist in the same image, are usually ignored in classification, which causes performance degradation in VPAC. In this paper, we propose a novel end-to-end Privacy Attributes-aware Message Passing Neural Network (PA-MPNN) to address VPAC. Privacy attributes are considered as nodes on a graph and an MPNN is introduced to model the privacy attribute dependencies. To generate representative features for privacy attribute nodes, a class-wise encoder-decoder is proposed to learn a latent space for each attribute. An attention mechanism with multiple correlation matrices is also introduced in MPNN to learn the privacy attributes graph automatically. Experimental results on the Privacy Attribute Dataset demonstrate that our framework achieves better performance than state-of-the-art methods on visual privacy attributes classification.

Epitomic Variational Graph Autoencoder

Rayyan Ahmad Khan, Muhammad Umer Anwaar, Martin Kleinsteuber

Responsive image

Auto-TLDR; EVGAE: A Generative Variational Autoencoder for Graph Data

Slides Poster Similar

Variational autoencoder (VAE) is a widely used generative model for learning latent representations. Burda et al. in their seminal paper showed that learning capacity of VAE is limited by over-pruning. It is a phenomenon where a significant number of latent variables fail to capture any information about the input data and the corresponding hidden units become inactive. This adversely affects learning diverse and interpretable latent representations. As variational graph autoencoder (VGAE) extends VAE for graph-structured data, it inherits the over-pruning problem. In this paper, we adopt a model based approach and propose epitomic VGAE (EVGAE),a generative variational framework for graph datasets which successfully mitigates the over-pruning problem and also boosts the generative ability of VGAE. We consider EVGAE to consist of multiple sparse VGAE models, called epitomes, that are groups of latent variables sharing the latent space. This approach aids in increasing active units as epitomes compete to learn better representation of the graph data. We verify our claims via experiments on three benchmark datasets. Our experiments show that EVGAE has a better generative ability than VGAE. Moreover, EVGAE outperforms VGAE on link prediction task in citation networks

Equation Attention Relationship Network (EARN) : A Geometric Deep Metric Framework for Learning Similar Math Expression Embedding

Saleem Ahmed, Kenny Davila, Srirangaraj Setlur, Venu Govindaraju

Responsive image

Auto-TLDR; Representational Learning for Similarity Based Retrieval of Mathematical Expressions

Slides Poster Similar

Representational Learning in the form of high dimensional embeddings have been used for multiple pattern recognition applications. There has been a significant interest in building embedding based systems for learning representationsin the mathematical domain. At the same time, retrieval of structured information such as mathematical expressions is an important need for modern IR systems. In this work, our motivation is to introduce a robust framework for learning representations for similarity based retrieval of mathematical expressions. Given a query by example, the embedding can find the closest matching expression as a function of euclidean distance between them. We leverage recent advancements in image-based and graph-based deep learning algorithms to learn our similarity embeddings. We do this first, by using uni-modal encoders in graph space and image space and then, a multi-modal combination of the same. To overcome the lack of training data, we force the networks to learn a deep metric using triplets generated with a heuristic scoring function. We also adopt a custom strategy for mining hard samples to train our neural networks. Our system produces rankings similar to those generated by the original scoring function, but using only a fraction of the time. Our results establish the viability of using such a multi-modal embedding for this task.

Multi-Graph Convolutional Network for Relationship-Driven Stock Movement Prediction

Jiexia Ye, Juanjuan Zhao, Kejiang Ye, Cheng-Zhong Xu

Responsive image

Auto-TLDR; Multi-GCGRU: A Deep Learning Framework for Stock Price Prediction with Cross Effect

Slides Poster Similar

Stock price movement prediction is commonly accepted as a very challenging task due to the volatile nature of financial markets. Previous works typically predict the stock price mainly based on its own information, neglecting the cross effect among involved stocks. However, it is well known that an individual stock price is correlated with prices of other stocks in complex ways. To take the cross effect into consideration, we propose a deep learning framework, called Multi-GCGRU, which comprises graph convolutional network (GCN) and gated recurrent units (GRU) to predict stock movement. Specifically, we first encode multiple relationships among stocks into graphs based on financial domain knowledge and utilize GCN to extract the cross effect based on the pre-defined graphs. The cross-correlation features produced by GCN are concatenated with historical records and fed into GRU to model the temporal pattern in stock price. To further get rid of prior knowledge, we explore an adaptive stock graph learned by data automatically. Experiments on two stock indexes in China market show that our model outperforms other baselines. Note that our model is rather feasible to incorporate more effective pre-defined stock relationships. What's more, it can also learn a data-driven relationship without any domain knowledge.

Cross-Lingual Text Image Recognition Via Multi-Task Sequence to Sequence Learning

Zhuo Chen, Fei Yin, Xu-Yao Zhang, Qing Yang, Cheng-Lin Liu

Responsive image

Auto-TLDR; Cross-Lingual Text Image Recognition with Multi-task Learning

Slides Poster Similar

This paper considers recognizing texts shown in a source language and translating into a target language, without generating the intermediate source language text image recognition results. We call this problem Cross-Lingual Text Image Recognition (CLTIR). To solve this problem, we propose a multi-task system containing a main task of CLTIR and an auxiliary task of Mono-Lingual Text Image Recognition (MLTIR) simultaneously. Two different sequence to sequence learning methods, a convolution based attention model and a BLSTM model with CTC, are adopted for these tasks respectively. We evaluate the system on a newly collected Chinese-English bilingual movie subtitle image dataset. Experimental results demonstrate the multi-task learning framework performs superiorly in both languages.

Dimensionality Reduction for Data Visualization and Linear Classification, and the Trade-Off between Robustness and Classification Accuracy

Martin Becker, Jens Lippel, Thomas Zielke

Responsive image

Auto-TLDR; Robustness Assessment of Deep Autoencoder for Data Visualization using Scatter Plots

Slides Poster Similar

This paper has three intertwined goals. The first is to introduce a new similarity measure for scatter plots. It uses Delaunay triangulations to compare two scatter plots regarding their relative positioning of clusters. The second is to apply this measure for the robustness assessment of a recent deep neural network (DNN) approach to dimensionality reduction (DR) for data visualization. It uses a nonlinear generalization of Fisher's linear discriminant analysis (LDA) as the encoder network of a deep autoencoder (DAE). The DAE's decoder network acts as a regularizer. The third goal is to look at different variants of the DNN: ones that promise robustness and ones that promise high classification accuracies. This is to study the trade-off between these two objectives -- our results support the recent claim that robustness may be at odds with accuracy; however, results that are balanced regarding both objectives are achievable. We see a restricted Boltzmann machine (RBM) pretraining and the DAE based regularization as important building blocks for achieving balanced results. As a means of assessing the robustness of DR methods, we propose a measure that is based on our similarity measure for scatter plots. The robustness measure comes with a superimposition view of Delaunay triangulations, which allows a fast comparison of results from multiple DR methods.

Temporal Collaborative Filtering with Graph Convolutional Neural Networks

Esther Rodrigo-Bonet, Minh Duc Nguyen, Nikos Deligiannis

Responsive image

Auto-TLDR; Temporal Collaborative Filtering with Graph-Neural-Network-based Neural Networks

Slides Poster Similar

Temporal collaborative filtering (TCF) methods aim at modelling non-static aspects behind recommender systems, such as the dynamics in users' preferences and social trends around items. State-of-the-art TCF methods employ recurrent neural networks (RNNs) to model such aspects. These methods deploy matrix-factorization-based (MF-based) approaches to learn the user and item representations. Recently, graph-neural-network-based (GNN-based) approaches have shown improved performance in providing accurate recommendations over traditional MF-based approaches in non-temporal CF settings. Motivated by this, we propose a novel TCF method that leverages GNNs to learn user and item representations, and RNNs to model their temporal dynamics. A challenge with this method lies in the increased data sparsity, which negatively impacts obtaining meaningful quality representations with GNNs. To overcome this challenge, we train a GNN model at each time step using a set of observed interactions accumulated time-wise. Comprehensive experiments on real-world data show the improved performance obtained by our method over several state-of-the-art temporal and non-temporal CF models.

Global Context-Based Network with Transformer for Image2latex

Nuo Pang, Chun Yang, Xiaobin Zhu, Jixuan Li, Xu-Cheng Yin

Responsive image

Auto-TLDR; Image2latex with Global Context block and Transformer

Slides Poster Similar

Image2latex usually means converts mathematical formulas in images into latex markup. It is a very challenging job due to the complex two-dimensional structure, variant scales of input, and very long representation sequence. Many researchers use encoder-decoder based model to solve this task and achieved good results. However, these methods don't make full use of the structure and position information of the formula. %In this paper, we improve the encoder by employing Global Context block and Transformer. To solve this problem, we propose a global context-based network with transformer that can (1) learn a more powerful and robust intermediate representation via aggregating global features and (2) encode position information explicitly and (3) learn latent dependencies between symbols by using self-attention mechanism. The experimental results on the dataset IM2LATEX-100K demonstrate the effectiveness of our method.

CardioGAN: An Attention-Based Generative Adversarial Network for Generation of Electrocardiograms

Subhrajyoti Dasgupta, Sudip Das, Ujjwal Bhattacharya

Responsive image

Auto-TLDR; CardioGAN: Generative Adversarial Network for Synthetic Electrocardiogram Signals

Slides Poster Similar

Electrocardiogram (ECG) signal is studied to obtain crucial information about the condition of a patient's heart. Machine learning based automated medical diagnostic systems that may help to evaluate the condition of the heart from this signal are required to be trained using large volumes of labelled training samples and the same may increase the chance of compromising with the patients' privacy. To solve this issue, generation of synthetic electrocardiogram signals by learning only from the general distributions of the available real training samples have been attempted in the literature. However, these studies did not pay necessary attention to the specific vital details of these signals, such as the P wave, the QRS complex, and the T wave. This shortcoming often results in the generation of unrealistic synthetic signals, such as a signal which does not contain one or more of the above components. In the present study, a novel deep generative architecture, termed as CardioGAN, based on generative adversarial network and powered by the effective attention mechanism has been designed which is capable of learning the intricate inter-dependencies among the various parts of real samples leading to the generation of more realistic electrocardiogram signals. Also, it helps in reducing the risk of breaching the privacy of patients. Extensive experimentation performed by us establishes that the proposed method achieves a better performance in generating synthetic electrocardiogram signals in comparison to the existing methods. The source code will be made available on github.

RNN Training along Locally Optimal Trajectories via Frank-Wolfe Algorithm

Yun Yue, Ming Li, Venkatesh Saligrama, Ziming Zhang

Responsive image

Auto-TLDR; Frank-Wolfe Algorithm for Efficient Training of RNNs

Slides Poster Similar

We propose a novel and efficient training method for RNNs by iteratively seeking a local minima on the loss surface within a small region, and leverage this directional vector for the update, in an outer-loop. We propose to utilize the Frank-Wolfe (FW) algorithm in this context. Although, FW implicitly involves normalized gradients, which can lead to a slow convergence rate, we develop a novel RNN training method that, surprisingly, even with the additional cost, the overall training cost is empirically observed to be lower than back-propagation. Our method leads to a new Frank-Wolfe method, that is in essence an SGD algorithm with a restart scheme. We prove that under certain conditions our algorithm has a sublinear convergence rate of $O(1/\epsilon)$ for $\epsilon$ error. We then conduct empirical experiments on several benchmark datasets including those that exhibit long-term dependencies, and show significant performance improvement. We also experiment with deep RNN architectures and show efficient training performance. Finally, we demonstrate that our training method is robust to noisy data.

Transfer Learning with Graph Neural Networks for Short-Term Highway Traffic Forecasting

Tanwi Mallick, Prasanna Balaprakash, Eric Rask, Jane Macfarlane

Responsive image

Auto-TLDR; Transfer Learning for Highway Traffic Forecasting on Unseen Traffic Networks

Slides Poster Similar

Large-scale highway traffic forecasting approaches are critical for intelligent transportation systems. Recently, deep-learning-based traffic forecasting methods have emerged as promising approaches for a wide range of traffic forecasting tasks. However, these methods are specific to a given traffic network and consequently, they cannot be used for forecasting traffic on an unseen traffic network. Previous work has identified diffusion convolutional recurrent neural network (DCRNN), as a state-of-the-art method for highway traffic forecasting. It models the complex spatial and temporal dynamics of a highway network using a graph-based diffusion convolution operation within a recurrent neural network. Currently, DCRNN cannot perform transfer learning because it learns location-specific traffic patterns, which cannot be used for unseen regions of a network or new geographic locations. To that end, we develop TL-DCRNN, a new transfer learning approach for DCRNN, where a single model trained on a highway network can be used to forecast traffic on unseen highway networks. Given a traffic network with a large amount of traffic data, our approach consists of partitioning the traffic network into a number of subgraphs and using a new training scheme that utilizes subgraphs for the DCRNN to marginalize the location-specific information, thus learning the traffic as a function of network connectivity and temporal patterns alone. The resulting trained model can be used to forecast traffic on unseen networks. We demonstrate that TL-DCRNN can learn from San Francisco regional traffic data and forecast traffic on the Los Angeles region and vice versa.

Edge-Aware Graph Attention Network for Ratio of Edge-User Estimation in Mobile Networks

Jiehui Deng, Sheng Wan, Xiang Wang, Enmei Tu, Xiaolin Huang, Jie Yang, Chen Gong

Responsive image

Auto-TLDR; EAGAT: Edge-Aware Graph Attention Network for Automatic REU Estimation in Mobile Networks

Slides Poster Similar

Estimating the Ratio of Edge-Users (REU) is an important issue in mobile networks, as it helps the subsequent adjustment of loads in different cells. However, existing approaches usually determine the REU manually, which are experience-dependent and labor-intensive, and thus the estimated REU might be imprecise. Considering the inherited graph structure of mobile networks, in this paper, we utilize a graph-based deep learning method for automatic REU estimation, where the practical cells are deemed as nodes and the load switchings among them constitute edges. Concretely, Graph Attention Network (GAT) is employed as the backbone of our method due to its impressive generalizability in dealing with networked data. Nevertheless, conventional GAT cannot make full use of the information in mobile networks, since it only incorporates node features to infer the pairwise importance and conduct graph convolutions, while the edge features that are actually critical in our problem are disregarded. To accommodate this issue, we propose an Edge-Aware Graph Attention Network (EAGAT), which is able to fuse the node features and edge features for REU estimation. Extensive experimental results on two real-world mobile network datasets demonstrate the superiority of our EAGAT approach to several state-of-the-art methods.

DAG-Net: Double Attentive Graph Neural Network for Trajectory Forecasting

Alessio Monti, Alessia Bertugli, Simone Calderara, Rita Cucchiara

Responsive image

Auto-TLDR; Recurrent Generative Model for Multi-modal Human Motion Behaviour in Urban Environments

Slides Poster Similar

Understanding human motion behaviour is a critical task for several possible applications like self-driving cars or social robots, and in general for all those settings where an autonomous agent has to navigate inside a human-centric environment. This is non-trivial because human motion is inherently multi-modal: given a history of human motion paths, there are many plausible ways by which people could move in the future. Additionally, people activities are often driven by goals, e.g. reaching particular locations or interacting with the environment. We address both the aforementioned aspects by proposing a new recurrent generative model that considers both single agents’ future goals and interactions between different agents. The model exploits a double attention-based graph neural network to collect information about the mutual influences among different agents and integrates it with data about agents’ possible future objectives. Our proposal is general enough to be applied in different scenarios: the model achieves state-of-the-art results in both urban environments and also in sports applications.

Nonlinear Ranking Loss on Riemannian Potato Embedding

Byung Hyung Kim, Yoonje Suh, Honggu Lee, Sungho Jo

Responsive image

Auto-TLDR; Riemannian Potato for Rank-based Metric Learning

Slides Poster Similar

We propose a rank-based metric learning method by leveraging a concept of the Riemannian Potato for better separating non-linear data. By exploring the geometric properties of Riemannian manifolds, the proposed loss function optimizes the measure of dispersion using the distribution of Riemannian distances between a reference sample and neighbors and builds a ranked list according to the similarities. We show the proposed function can learn a hypersphere for each class, preserving the similarity structure inside it on Riemannian manifold. As a result, compared with Euclidean distance-based metric, our method can further jointly reduce the intra-class distances and enlarge the inter-class distances for learned features, consistently outperforming state-of-the-art methods on three widely used non-linear datasets.

An Efficient Empirical Solver for Localized Multiple Kernel Learning Via DNNs

Ziming Zhang

Responsive image

Auto-TLDR; Localized Multiple Kernel Learning using LMKL-Net

Slides Poster Similar

In this paper we propose solving localized multiple kernel learning (LMKL) using LMKL-Net, a feedforward deep neural network (DNN). In contrast to previous works, as a learning principle we propose parameterizing the gating function for learning kernel combination weights and the multiclass classifier using an attentional network (AN) and a multilayer perceptron (MLP), respectively. Such interpretability helps us better understand how the network solves the problem. Thanks to stochastic gradient descent (SGD), our approach has {\em linear} computational complexity in training. Empirically on benchmark datasets we demonstrate that with comparable or better accuracy than the state-of-the-art, our LMKL-Net can be trained about {\bf two orders of magnitude} faster with about {\bf two orders of magnitude} smaller memory footprint for large-scale learning.

Attentive Visual Semantic Specialized Network for Video Captioning

Jesus Perez-Martin, Benjamin Bustos, Jorge Pérez

Responsive image

Auto-TLDR; Adaptive Visual Semantic Specialized Network for Video Captioning

Slides Poster Similar

As an essential high-level task of video understanding topic, automatically describing a video with natural language has recently gained attention as a fundamental challenge in computer vision. Previous models for video captioning have several limitations, such as the existence of gaps in current semantic representations and the inexpressibility of the generated captions. To deal with these limitations, in this paper, we present a new architecture that we callAttentive Visual Semantic Specialized Network(AVSSN), which is an encoder-decoder model based on our Adaptive Attention Gate and Specialized LSTM layers. This architecture can selectively decide when to use visual or semantic information into the text generation process. The adaptive gate makes the decoder to automatically select the relevant information for providing a better temporal state representation than the existing decoders. Besides, the model is capable of learning to improve the expressiveness of generated captions attending to their length, using a sentence-length-related loss function. We evaluate the effectiveness of the proposed approach on the Microsoft Video Description(MSVD) and the Microsoft Research Video-to-Text (MSR-VTT) datasets, achieving state-of-the-art performance with several popular evaluation metrics: BLEU-4, METEOR, CIDEr, and ROUGE_L.

Learning Sign-Constrained Support Vector Machines

Kenya Tajima, Kouhei Tsuchida, Esmeraldo Ronnie Rey Zara, Naoya Ohta, Tsuyoshi Kato

Responsive image

Auto-TLDR; Constrained Sign Constraints for Learning Linear Support Vector Machine

Poster Similar

Domain knowledge is useful to improve the generalization performance of learning machines. Sign constraints are a handy representation to combine domain knowledge with learning machine. In this paper, we consider constraining the signs of the weight coefficients in learning the linear support vector machine, and develop two optimization algorithms for minimizing the empirical risk under the sign constraints. One of the two algorithms is based on the projected gradient method, in which each iteration of the projected gradient method takes O(nd) computational cost and the sublinear convergence of the objective error is guaranteed. The second algorithm is based on the Frank-Wolfe method that also converges sublinearly and possesses a clear termination criterion. We show that each iteration of the Frank-Wolfe also requires O(nd) cost. Furthermore, we derive the explicit expression for the minimal iteration number to ensure an epsilon-accurate solution by analyzing the curvature of the objective function. Finally, we empirically demonstrate that the sign constraints are a promising technique when similarities to the training examples compose the feature vector.

Attention-Based Deep Metric Learning for Near-Duplicate Video Retrieval

Kuan-Hsun Wang, Chia Chun Cheng, Yi-Ling Chen, Yale Song, Shang-Hong Lai

Responsive image

Auto-TLDR; Attention-based Deep Metric Learning for Near-duplicate Video Retrieval

Slides Similar

Near-duplicate video retrieval (NDVR) is an important and challenging problem due to the increasing amount of videos uploaded to the Internet. In this paper, we propose an attention-based deep metric learning method for NDVR. Our method is based on well-established principles: We leverage two-stream networks to combine RGB and optical flow features, and incorporate an attention module to effectively deal with distractor frames commonly observed in near duplicate videos. We further aggregate the features corresponding to multiple video segments to enhance the discriminative power. The whole system is trained using a deep metric learning objective with a Siamese architecture. Our experiments show that the attention module helps eliminate redundant and noisy frames, while focusing on visually relevant frames for solving NVDR. We evaluate our approach on recent large-scale NDVR datasets, CC_WEB_VIDEO, VCDB, FIVR and SVD. To demonstrate the generalization ability of our approach, we report results in both within- and cross-dataset settings, and show that the proposed method significantly outperforms state-of-the-art approaches.

Feature-Aware Unsupervised Learning with Joint Variational Attention and Automatic Clustering

Wang Ru, Lin Li, Peipei Wang, Liu Peiyu

Responsive image

Auto-TLDR; Deep Variational Attention Encoder-Decoder for Clustering

Slides Poster Similar

Deep clustering aims to cluster unlabeled real-world samples by mining deep feature representation. Most of existing methods remain challenging when handling high-dimensional data and simultaneously exploring the complementarity of deep feature representation and clustering. In this paper, we propose a novel Deep Variational Attention Encoder-decoder for Clustering (DVAEC). Our DVAEC improves the representation learning ability by fusing variational attention. Specifically, we design a feature-aware automatic clustering module to mitigate the unreliability of similarity calculation and guide network learning. Besides, to further boost the performance of deep clustering from a global perspective, we define a joint optimization objective to promote feature representation learning and automatic clustering synergistically. Extensive experimental results show the promising performance achieved by our DVAEC on six datasets comparing with several popular baseline clustering methods.

A General Model for Learning Node and Graph Representations Jointly

Chaofan Chen

Responsive image

Auto-TLDR; Joint Community Detection/Dynamic Routing for Graph Classification

Slides Poster Similar

This paper focuses on two fundamental graph recognition tasks: node classification and graph classification. Existing methods usually learn the node and graph representations for these two tasks separately, and ignore modeling the relations between the local and global structures. In this paper, we propose a general approach to learn the local and global features collaboratively: (1) in order to characterize the correlation among nodes and communities (a set of nodes), we employ the joint community detection/dynamic routing modules to generate the clustering assignment matrices at first and then utilize these matrices to cluster nodes to capture the global information of graphs (locally relevant graph representations). Inspired by the success of spectral clustering, we minimize the ratiocut loss to help optimize the learned assignment matrices. (2) We maximize the mutual information between local and global representations to help learn the globally relevant node representations. Experimental results on a variety of node and graph classification benchmarks show that our model can achieve superior performance over the state-of-the-art approaches.

An Intransitivity Model for Matchup and Pairwise Comparison

Yan Gu, Jiuding Duan, Hisashi Kashima

Responsive image

Auto-TLDR; Blade-Chest: A Low-Rank Matrix Approach for Probabilistic Ranking of Players

Slides Poster Similar

Ranking is a ubiquitous problem appearing in many real-world applications. The superior players or objects are oftentimes determined by a matchup or pairwise comparison. Various models have been developed to integrate the matchup results into a single ranking list of players and to further predict the results of future matchups. Amongst them, the Bradley-Terry model is a mainstream model that achieves the goals by constructing explicit probabilistic interpretation. However, the model suffers from its strong assumption of transitive relationships and becomes vulnerable in practices where intransitive relationships exist. Blade-Chest model is an alternative solution to this intransitivity challenge by allowing the multi-dimensional representation of players. In this paper, we propose a low-rank matrix approach to characterize all players and generalize the related works by introducing a unified framework. Our experimental results on synthetic datasets and real-world datasets show that the proposed model is stably competitive with the standard models in terms of the consistency of probabilistic model interpretation and the predictive performance in out-of-sample tests.

Malware Detection by Exploiting Deep Learning over Binary Programs

Panpan Qi, Zhaoqi Zhang, Wei Wang, Chang Yao

Responsive image

Auto-TLDR; End-to-End Malware Detection without Feature Engineering

Slides Poster Similar

Malware evolves rapidly over time, which makes existing solutions being ineffective in detecting newly released malware. Machine learning models that can learn to capture malicious patterns directly from the data play an increasingly important role in malware analysis. However, traditional machine learning models heavily depend on feature engineering. The extracted static features are vulnerable as hackers could create new malware with different feature values to deceive the machine learning models. In this paper, we propose an end-to-end malware detection framework consisting of convolutional neural network, autoencoder and neural decision trees. It learns the features from multiple domains for malware detection without feature engineering. In addition, since anti-virus products should have a very low false alarm rate to avoid annoying users, we propose a special loss function, which optimizes the recall for a fixed low false positive rate (e.g., less than 0.1%). Experiments show that the proposed framework has achieved a better recall than the baseline models, and the derived loss function also makes a difference.

Picture-To-Amount (PITA): Predicting Relative Ingredient Amounts from Food Images

Jiatong Li, Fangda Han, Ricardo Guerrero, Vladimir Pavlovic

Responsive image

Auto-TLDR; PITA: A Deep Learning Architecture for Predicting the Relative Amount of Ingredients from Food Images

Slides Poster Similar

Increased awareness of the impact of food consumption on health and lifestyle today has given rise to novel data-driven food analysis systems. Although these systems may recognize the ingredients, a detailed analysis of their amounts in the meal, which is paramount for estimating the correct nutrition, is usually ignored. In this paper, we study the novel and challenging problem of predicting the relative amount of each ingredient from a food image. We propose PITA, the Picture-to-Amount deep learning architecture to solve the problem. More specifically, we predict the ingredient amounts using a domain-driven Wasserstein loss from image-to-recipe cross-modal embeddings learned to align the two views of food data. Experiments on a dataset of recipes collected from the Internet show the model generates promising results and improves the baselines on this challenging task.

Context Matters: Self-Attention for Sign Language Recognition

Fares Ben Slimane, Mohamed Bouguessa

Responsive image

Auto-TLDR; Attentional Network for Continuous Sign Language Recognition

Slides Poster Similar

This paper proposes an attentional network for the task of Continuous Sign Language Recognition. The proposed approach exploits co-independent streams of data to model the sign language modalities. These different channels of information can share a complex temporal structure between each other. For that reason, we apply attention to synchronize and help capture entangled dependencies between the different sign language components. Even though Sign Language is multi-channel, handshapes represent the central entities in sign interpretation. Seeing handshapes in their correct context defines the meaning of a sign. Taking that into account, we utilize the attention mechanism to efficiently aggregate the hand features with their appropriate Spatio-temporal context for better sign recognition. We found that by doing so the model is able to identify the essential Sign Language components that revolve around the dominant hand and the face areas. We test our model on the benchmark dataset RWTH-PHOENIX-Weather 2014, yielding competitive results.

Leveraging Quadratic Spherical Mutual Information Hashing for Fast Image Retrieval

Nikolaos Passalis, Anastasios Tefas

Responsive image

Auto-TLDR; Quadratic Mutual Information for Large-Scale Hashing and Information Retrieval

Slides Poster Similar

Several deep supervised hashing techniques have been proposed to allow for querying large image databases. However, it is often overlooked that the process of information retrieval can be modeled using information-theoretic metrics, leading to optimizing various proxies for the problem at hand instead. Contrary to this, we propose a deep supervised hashing algorithm that optimizes the learned codes using an information-theoretic measure, the Quadratic Mutual Information (QMI). The proposed method is adapted to the needs of large-scale hashing and information retrieval leading to a novel information-theoretic measure, the Quadratic Spherical Mutual Information (QSMI), that is inspired by QMI, but leads to significant better retrieval precision. Indeed, the effectiveness of the proposed method is demonstrated under several different scenarios, using different datasets and network architectures, outperforming existing deep supervised image hashing techniques.

Assortative-Constrained Stochastic Block Models

Daniel Gribel, Thibaut Vidal, Michel Gendreau

Responsive image

Auto-TLDR; Constrained Stochastic Block Models for Assortative Communities in Neural Networks

Slides Poster Similar

Stochastic block models (SBMs) are often used to find assortative community structures in networks, such that the probability of connections within communities is higher than in between communities. However, classic SBMs are not limited to assortative structures. In this study, we discuss the implications of this model-inherent indifference towards assortativity or disassortativity, and show that it can lead to undesirable outcomes in datasets which are known to be assortative but which contain a reduced amount of information. To circumvent these issues, we propose a constrained SBM that imposes strong assortativity constraints, along with efficient algorithmic solutions. These constraints significantly boost community-detection capabilities in regimes which are close to the detectability threshold. They also permit to identify structurally-different communities in networks representing cerebral-cortex activity regions.

Space-Time Domain Tensor Neural Networks: An Application on Human Pose Classification

Konstantinos Makantasis, Athanasios Voulodimos, Anastasios Doulamis, Nikolaos Doulamis, Nikolaos Bakalos

Responsive image

Auto-TLDR; Tensor-Based Neural Network for Spatiotemporal Pose Classifiaction using Three-Dimensional Skeleton Data

Slides Poster Similar

Recent advances in sensing technologies require the design and development of pattern recognition models capable of processing spatiotemporal data efficiently. In this study, we propose a spatially and temporally aware tensor-based neural network for human pose classifiaction using three-dimensional skeleton data. Our model employs three novel components. First, an input layer capable of constructing highly discriminative spatiotemporal features. Second, a tensor fusion operation that produces compact yet rich representations of the data, and third, a tensor-based neural network that processes data representations in their original tensor form. Our model is end-to-end trainable and characterized by a small number of trainable parameters making it suitable for problems where the annotated data is limited. Experimental evaluation of the proposed model indicates that it can achieve state-of-the-art performance.

Transformer Networks for Trajectory Forecasting

Francesco Giuliari, Hasan Irtiza, Marco Cristani, Fabio Galasso

Responsive image

Auto-TLDR; TransformerNetworks for Trajectory Prediction of People Interactions

Slides Poster Similar

Most recent successes on forecasting the people mo-tion are based on LSTM models andallmost recent progress hasbeen achieved by modelling the social interaction among peopleand the people interaction with the scene. We question the useof the LSTM models and propose the novel use of TransformerNetworks for trajectory forecasting. This is a fundamental switchfrom the sequential step-by-step processing of LSTMs to theonly-attention-based memory mechanisms of Transformers. Inparticular, we consider both the original Transformer Network(TF) and the larger Bidirectional Transformer (BERT), state-of-the-art on all natural language processing tasks. Our proposedTransformers predict the trajectories of the individual peoplein the scene. These are “simple” models because each personis modelled separately without any complex human-human norscene interaction terms. In particular, the TF modelwithoutbells and whistlesyields the best score on the largest and mostchallenging trajectory forecasting benchmark of TrajNet [1]. Ad-ditionally, its extension which predicts multiple plausible futuretrajectories performs on par with more engineered techniqueson the 5 datasets of ETH [2]+UCY [3]. Finally, we showthat Transformers may deal with missing observations, as itmay be the case with real sensor data. Code is available atgithub.com/FGiuliari/Trajectory-Transformer

Global Feature Aggregation for Accident Anticipation

Mishal Fatima, Umar Karim Khan, Chong Min Kyung

Responsive image

Auto-TLDR; Feature Aggregation for Predicting Accidents in Video Sequences

Slides Similar

Anticipation of accidents ahead of time in autonomous and non-autonomous vehicles aids in accident avoidance. In order to recognize abnormal events such as traffic accidents in a video sequence, it is important that the network takes into account interactions of objects in a given frame. We propose a novel Feature Aggregation (FA) block that refines each object's features by computing a weighted sum of the features of all objects in a frame. We use FA block along with Long Short Term Memory (LSTM) network to anticipate accidents in the video sequences. We report mean Average Precision (mAP) and Average Time-to-Accident (ATTA) on Street Accident (SA) dataset. Our proposed method achieves the highest score for risk anticipation by predicting accidents 0.32 sec and 0.75 sec earlier compared to the best results with Adaptive Loss and dynamic parameter prediction based methods respectively.

Variational Deep Embedding Clustering by Augmented Mutual Information Maximization

Qiang Ji, Yanfeng Sun, Yongli Hu, Baocai Yin

Responsive image

Auto-TLDR; Clustering by Augmented Mutual Information maximization for Deep Embedding

Slides Poster Similar

Clustering is a crucial but challenging task in pattern analysis and machine learning. Recent many deep clustering methods combining representation learning with cluster techniques emerged. These deep clustering methods mainly focus on the correlation among samples and ignore the relationship between samples and their representations. In this paper, we propose a novel end-to-end clustering framework, namely variational deep embedding clustering by augmented mutual information maximization (VCAMI). From the perspective of VAE, we prove that minimizing reconstruction loss is equivalent to maximizing the mutual information of the input and its latent representation. This provides a theoretical guarantee for us to directly maximize the mutual information instead of minimizing reconstruction loss. Therefore we proposed the augmented mutual information which highlights the uniqueness of the representations while discovering invariant information among similar samples. Extensive experiments on several challenging image datasets show that the VCAMI achieves good performance. we achieve state-of-the-art results for clustering on MNIST (99.5%) and CIFAR-10 (65.4%) to the best of our knowledge.

Dual Path Multi-Modal High-Order Features for Textual Content Based Visual Question Answering

Yanan Li, Yuetan Lin, Hongrui Zhao, Donghui Wang

Responsive image

Auto-TLDR; TextVQA: An End-to-End Visual Question Answering Model for Text-Based VQA

Slides Similar

As a typical cross-modal problem, visual question answering (VQA) has received increasing attention from the communities of computer vision and natural language processing. Reading and reasoning about texts and visual contents in the images is a burgeoning and important research topic in VQA, especially for the visually impaired assistance applications. Given an image, it aims to predict an answer to a provided natural language question closely related to its textual contents. In this paper, we propose a novel end-to-end textual content based VQA model, which grounds question answering both on the visual and textual information. After encoding the image, question and recognized text words, it uses multi-modal factorized high-order modules and the attention mechanism to fuse question-image and question-text features respectively. The complex correlations among different features can be captured efficiently. To ensure the model's extendibility, it embeds candidate answers and recognized texts in a semantic embedding space and adopts semantic embedding based classifier to perform answer prediction. Extensive experiments on the newly proposed benchmark TextVQA demonstrate that the proposed model can achieve promising results.

SDMA: Saliency Driven Mutual Cross Attention for Multi-Variate Time Series

Yash Garg, K. Selcuk Candan

Responsive image

Auto-TLDR; Salient-Driven Mutual Cross Attention for Intelligent Time Series Analytics

Slides Poster Similar

Integration of rich sensory technologies into critical applications, such as gesture recognition and building energy optimization, has highlighted the importance of intelligent time series analytics. To accommodate this demand, uni-variate approaches have been extended for multi-variate scenarios, but naive extensions have lead to deterioration in model performances due to their limited ability to capture the information recorded in different variates and complex multi-variate time series patterns’ evolution over time. Furthermore, real-world time series are often contaminated with noisy information. In this paper, we note that a time series often carry robust localized temporal events that could help improve model performance by highlighting the relevant information; however, the lack of sufficient data to train for these events make it impossible for neural architectures to identify and make use of these temporal events. We, therefore, argue that a companion process helping identify salient events in the input time series and driving model’s attention to the associated salient sub-sequences can help with learning a high-performing network. Relying on this observation, we propose a novel Saliency-Driven Mutual Cross Attention (SDMA) framework that extracts localized temporal events and generate a saliency series to complement the input time series. We further propose an architecture which accounts for the mutual cross-talk between the input and saliency series branches where input and saliency series attend each other. Experiments show that the proposed mutually-cross attention framework can offer significant boosts in model performance when compared against non-attentioned, conventionally attentioned, and conventionally cross-attentioned models.

More Correlations Better Performance: Fully Associative Networks for Multi-Label Image Classification

Yaning Li, Liu Yang

Responsive image

Auto-TLDR; Fully Associative Network for Fully Exploiting Correlation Information in Multi-Label Classification

Slides Poster Similar

Recent researches demonstrate that correlation modeling plays a key role in high-performance multi-label classification methods. However, existing methods do not take full advantage of correlation information, especially correlations in feature and label spaces of each image, which limits the performance of correlation-based multi-label classification methods. With more correlations considered, in this study, a Fully Associative Network (FAN) is proposed for fully exploiting correlation information, which involves both visual feature and label correlations. Specifically, FAN introduces a robust covariance pooling to summarize convolution features as global image representation for capturing feature correlation in the multi-label task. Moreover, it constructs an effective label correlation matrix based on a re-weighted scheme, which is fed into a graph convolution network for capturing label correlation. Then, correlation between covariance representations (i.e., feature correlation ) and the outputs of GCN (i.e., label correlation) are modeled for final prediction. Experimental results on two datasets illustrate the effectiveness and efficiency of our proposed FAN compared with state-of-the-art methods.