Enhanced User Interest and Expertise Modeling for Expert Recommendation

Tongze He, Caili Guo, Yunfei Chu

Auto-TLDR; A Unified Framework for Expert Recommendation in Community Question Answering

The rapid development of Community Question Answering (CQA) satisfies users' demand for professional and personal knowledge. In CQA, one key issue is expert recommendation: recommending users with high expertise and willingness to answer a given question. However, most existing methods for expert recommendation ignore key information such as time information and historical feedback, which degrades performance. On the one hand, users' interests change over time, so a model that ignores these dynamics is biased. On the other hand, feedback information is critical for estimating users' expertise. To solve these problems, we propose a unified framework for expert recommendation that models user interest and expertise more precisely. Because interest and expertise can be inconsistent with each other, we learn their embeddings separately. We use Long Short-Term Memory (LSTM) to model a user's short-term interest and combine it with long-term interest. User expertise is learned by a dedicated user expertise network, which explicitly models feedback on users' historical behavior. Extensive experiments on a large-scale dataset from a real-world CQA site demonstrate that our method outperforms state-of-the-art solutions to the problem.

Similar papers

Integrating Historical States and Co-Attention Mechanism for Visual Dialog

Tianling Jiang, Yi Ji, Chunping Liu

Auto-TLDR; Integrating Historical States and Co-attention for Visual Dialog

Visual dialog is a typical multi-modal task that involves both vision and language. It currently faces two major difficulties: the temporal sequence issue in historical information and the visual reference problem. In this paper, we propose Integrating Historical States and Co-attention (HSCA) for visual dialog to solve them. HSCA includes two main modules, Co-ATT and MATCH. The Co-ATT module uses questions and answers to guide attention over the image at an early stage, so that more specific objects are obtained. It tackles the temporal sequence issue in historical information, which can affect the accuracy of answers to multi-round questions. The MATCH module takes a question containing pronouns and retrieves the best-matching block of historical information. It overcomes the visual reference problem, which requires resolving pronouns that refer to unknown entities in the text and then locating the corresponding objects in the given image. We evaluate our model quantitatively and qualitatively on VisDial v1.0 and also carry out ablation studies. The experimental results demonstrate that HSCA outperforms state-of-the-art methods in many aspects.

Dual Path Multi-Modal High-Order Features for Textual Content Based Visual Question Answering

Yanan Li, Yuetan Lin, Hongrui Zhao, Donghui Wang

Auto-TLDR; TextVQA: An End-to-End Visual Question Answering Model for Text-Based VQA

As a typical cross-modal problem, visual question answering (VQA) has received increasing attention from the computer vision and natural language processing communities. Reading and reasoning about text and visual content in images is a burgeoning and important research topic in VQA, especially for applications that assist the visually impaired. Given an image, the task is to predict an answer to a natural language question closely related to the image's textual content. In this paper, we propose a novel end-to-end textual content based VQA model, which grounds question answering on both the visual and the textual information. After encoding the image, the question and the recognized text words, it uses multi-modal factorized high-order modules and the attention mechanism to fuse question-image and question-text features respectively. The complex correlations among different features can thus be captured efficiently. To ensure the model's extensibility, it embeds candidate answers and recognized texts in a semantic embedding space and adopts a semantic-embedding-based classifier to perform answer prediction. Extensive experiments on the newly proposed benchmark TextVQA demonstrate that the proposed model can achieve promising results.

Multi-Stage Attention Based Visual Question Answering

Aakansha Mishra, Ashish Anand, Prithwijit Guha

Auto-TLDR; Alternative Bi-directional Attention for Visual Question Answering

Recent developments in Visual Question Answering (VQA) have brought promising performance improvements through attention-based networks. Most such approaches use unidirectional attention, in which attention from the textual domain (the question) is applied to the visual space, and they mostly focus on learning high-quality attention in the visual space. In contrast, this work proposes an alternating bi-directional attention framework. First, question-to-image attention helps to learn a robust visual embedding; second, image-to-question attention helps to improve the question embedding. This attention mechanism is realized in an alternating fashion, i.e., question-to-image followed by image-to-question, and is repeated to maximize performance. We believe that this process of alternating attention generation helps both modalities and leads to better representations for the VQA task. The proposal is benchmarked on the TDIUC dataset against state-of-the-art approaches. Our ablation analysis shows that alternating attention is the key to achieving high performance in VQA.

Trajectory-User Link with Attention Recurrent Networks

Tao Sun, Yongjun Xu, Fei Wang, Lin Wu, 塘文 钱, Zezhi Shao

Auto-TLDR; TULAR: Trajectory-User Link with Attention Recurrent Neural Networks

The widespread adoption of GPS-enabled devices has led to an explosion of location-based services, which produce a huge number of trajectories recording individuals' movements. In this paper, we tackle the Trajectory-User Link (TUL) problem, which identifies humans' movement patterns and links trajectories to the users who generated them. Existing solutions to the TUL problem employ recurrent neural networks and variational autoencoder methods, which hit bottlenecks when trajectories are excessively long and users' movements are fragmentary. However, these are common characteristics of real trajectory data, leading to performance degradation of the existing models. We propose an end-to-end attention recurrent neural learning framework, called TULAR (Trajectory-User Link with Attention Recurrent Networks), which focuses on selected parts of the source trajectories when linking. TULAR introduces the Trajectory Semantic Vector (TSV), obtained via unsupervised location representation learning and recurrent neural networks, and uses it to compute the weights of the parts of a source trajectory. Further, we employ three attention scores for the weight measurements. Experiments on two real-world datasets, with comparisons against several existing methods, show that TULAR achieves new state-of-the-art performance. Source code is publicly available on GitHub: https://github.com/taos123/TULAR.

Temporal Collaborative Filtering with Graph Convolutional Neural Networks

Esther Rodrigo-Bonet, Minh Duc Nguyen, Nikos Deligiannis

Auto-TLDR; Temporal Collaborative Filtering with Graph-Neural-Network-based Neural Networks

Temporal collaborative filtering (TCF) methods aim at modelling non-static aspects behind recommender systems, such as the dynamics in users' preferences and social trends around items. State-of-the-art TCF methods employ recurrent neural networks (RNNs) to model such aspects. These methods deploy matrix-factorization-based (MF-based) approaches to learn the user and item representations. Recently, graph-neural-network-based (GNN-based) approaches have shown improved performance in providing accurate recommendations over traditional MF-based approaches in non-temporal CF settings. Motivated by this, we propose a novel TCF method that leverages GNNs to learn user and item representations, and RNNs to model their temporal dynamics. A challenge with this method lies in the increased data sparsity, which negatively impacts obtaining meaningful quality representations with GNNs. To overcome this challenge, we train a GNN model at each time step using a set of observed interactions accumulated time-wise. Comprehensive experiments on real-world data show the improved performance obtained by our method over several state-of-the-art temporal and non-temporal CF models.
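The time-wise accumulation of observed interactions mentioned in this abstract can be sketched as follows. This is only an illustration under assumptions, not the authors' code: it assumes that the graph at time step t contains every (user, item) pair observed at or before t, and the function name `accumulate_interactions` and the toy events are hypothetical. The GNN and RNN themselves are omitted.

```python
def accumulate_interactions(interactions, num_steps):
    """Build one cumulative edge set per time step.

    interactions: iterable of (user, item, step) with 0 <= step < num_steps.
    Returns a list whose t-th entry holds all (user, item) pairs
    observed at or before step t (an assumed accumulation scheme).
    """
    per_step = [set() for _ in range(num_steps)]
    for user, item, step in interactions:
        per_step[step].add((user, item))

    cumulative = []
    seen = set()
    for step_edges in per_step:
        seen |= step_edges                 # carry earlier edges forward
        cumulative.append(frozenset(seen))
    return cumulative


# Toy usage with hypothetical users and items.
events = [("u1", "i1", 0), ("u2", "i1", 1), ("u1", "i2", 2)]
snapshots = accumulate_interactions(events, num_steps=3)
for t, edges in enumerate(snapshots):
    print(t, sorted(edges))
```

Each snapshot would serve as the edge set of the graph on which a GNN is trained for that time step; accumulating edges makes later graphs denser, which is the stated remedy for sparsity.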

Multi-Modal Contextual Graph Neural Network for Text Visual Question Answering

Yaoyuan Liang, Xin Wang, Xuguang Duan, Wenwu Zhu

Auto-TLDR; Multi-modal Contextual Graph Neural Network for Text Visual Question Answering

Text visual question answering (TextVQA) aims to answer questions about text appearing in given images. It poses more challenges than VQA because it requires deeper recognition and understanding of the various shapes of human-readable scene text as well as its meaning in different contexts. Existing works on TextVQA suffer from two weaknesses: i) scene texts and non-textual objects are processed separately and independently, without considering their mutual interactions during question understanding and answering; ii) scene texts are encoded only through word embeddings, without taking into account their visual appearance features or their potential relationships with non-textual objects in the images. To overcome these weaknesses of existing works, we propose a novel multi-modal contextual graph neural network (MCG) model for TextVQA. The proposed MCG model can capture the relationships between visual features of scene texts and non-textual objects in the given images, and it uses richer sources of multi-modal features to improve model performance. In particular, we encode the scene texts into richer features containing textual, visual and positional features, then model the visual relations between scene texts and non-textual objects through a contextual graph neural network. Our extensive experiments on a real-world dataset demonstrate the advantages of the proposed MCG model over baseline approaches.

Probabilistic Latent Factor Model for Collaborative Filtering with Bayesian Inference

Jiansheng Fang, Xiaoqing Zhang, Yan Hu, Yanwu Xu, Ming Yang, Jiang Liu

Auto-TLDR; Bayesian Latent Factor Model for Collaborative Filtering

The Latent Factor Model (LFM) is one of the most successful methods for collaborative filtering (CF) in recommender systems, in which both users and items are projected into a joint latent factor space. Based on matrix factorization, which is commonly applied in pattern recognition, LFM models user-item interactions as inner products of the user and item factor vectors in that space and can be solved efficiently by least-squares methods with optimal estimation. However, such optimal estimation methods are prone to overfitting due to the extreme sparsity of user-item interactions. In this paper, we propose a Bayesian treatment for LFM, named Bayesian Latent Factor Model (BLFM). Based on observed user-item interactions, we build a probabilistic factor model in which regularization is introduced by placing prior constraints on the latent factors, and the likelihood function is established over observations and parameters. Then we draw samples of latent factors from the posterior distribution with Variational Inference (VI) to predict the expected value. We further extend BLFM to BLFMBias, which incorporates user-dependent and item-dependent biases into the model to enhance performance. Extensive experiments on a movie rating dataset show the effectiveness of our proposed models compared with several strong baselines.
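The inner-product scoring described in this abstract can be illustrated with a minimal sketch. It shows only the deterministic prediction rule of a latent factor model with optional user and item bias terms; the additive form of the biases is an assumption made for illustration, and the Bayesian treatment (priors, variational inference) is not shown. The function name `predict_rating` and the toy numbers are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def predict_rating(user_factors, item_factors, user_bias=0.0, item_bias=0.0):
    """Score a user-item pair as an inner product in the latent space.

    The bias terms are added to the inner product; this additive form
    is an assumption of the sketch, not taken from the abstract.
    """
    return dot(user_factors, item_factors) + user_bias + item_bias


# Toy usage with hypothetical two-dimensional factors.
plain = predict_rating([1.0, 2.0], [3.0, 4.0])
with_bias = predict_rating([1.0, 2.0], [3.0, 4.0], user_bias=0.5, item_bias=-0.25)
print(plain, with_bias)
```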

An Intransitivity Model for Matchup and Pairwise Comparison

Yan Gu, Jiuding Duan, Hisashi Kashima

Auto-TLDR; Blade-Chest: A Low-Rank Matrix Approach for Probabilistic Ranking of Players

Ranking is a ubiquitous problem appearing in many real-world applications. The superior players or objects are often determined by a matchup or pairwise comparison. Various models have been developed to integrate the matchup results into a single ranking list of players and to further predict the results of future matchups. Among them, the Bradley-Terry model is a mainstream model that achieves these goals by constructing an explicit probabilistic interpretation. However, the model relies on a strong assumption of transitive relationships and becomes vulnerable in practice when intransitive relationships exist. The Blade-Chest model is an alternative solution to this intransitivity challenge that allows a multi-dimensional representation of players. In this paper, we propose a low-rank matrix approach to characterize all players and generalize the related works by introducing a unified framework. Our experimental results on synthetic and real-world datasets show that the proposed model is consistently competitive with the standard models in terms of the consistency of probabilistic model interpretation and the predictive performance in out-of-sample tests.
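To make the transitivity limitation concrete, here is a minimal sketch, not the authors' low-rank model. It assumes the standard Bradley-Terry form, where the win probability is a logistic function of the difference between two scalar strengths, and the inner-product variant of the blade-chest model from the earlier literature, where the matchup score is blade_a · chest_b − blade_b · chest_a. The helper `make_player` and the rock-paper-scissors vectors are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real-valued score to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def bradley_terry(strength_a, strength_b):
    """P(a beats b) from scalar log-strengths; always transitive."""
    return sigmoid(strength_a - strength_b)


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def blade_chest(blade_a, chest_a, blade_b, chest_b):
    """P(a beats b) under the blade-chest inner-product form (assumed)."""
    score = dot(blade_a, chest_b) - dot(blade_b, chest_a)
    return sigmoid(score)


def make_player(angle_deg):
    """Hypothetical 2-D player: unit blade, chest = blade rotated by 90 degrees."""
    t = math.radians(angle_deg)
    return (math.cos(t), math.sin(t)), (-math.sin(t), math.cos(t))


rock, paper, scissors = make_player(0), make_player(120), make_player(240)
p_paper_rock = blade_chest(*paper, *rock)
p_scissors_paper = blade_chest(*scissors, *paper)
p_rock_scissors = blade_chest(*rock, *scissors)
print(p_paper_rock, p_scissors_paper, p_rock_scissors)
```

With these vectors each player is favored over the next one in the cycle, which no assignment of scalar Bradley-Terry strengths can reproduce, since a cycle would require s_a > s_b > s_c > s_a.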

PIN: A Novel Parallel Interactive Network for Spoken Language Understanding

Peilin Zhou, Zhiqi Huang, Fenglin Liu, Yuexian Zou

Auto-TLDR; Parallel Interactive Network for Spoken Language Understanding

Spoken Language Understanding (SLU) is an essential part of a spoken dialogue system and typically consists of intent detection (ID) and slot filling (SF) tasks. Recently, methods based on recurrent neural networks (RNNs) have achieved state-of-the-art results for SLU. In existing RNN-based approaches, the ID and SF tasks are often modeled jointly to exploit the correlation between them. However, bidirectional and explicit information exchange between ID and SF as a way to obtain better performance has not been well studied so far. In addition, few studies attempt to capture local context information to enhance the performance of SF. Motivated by these findings, we propose the Parallel Interactive Network (PIN) to model the mutual guidance between ID and SF. Specifically, given an utterance, a Gaussian self-attentive encoder generates a context-aware feature embedding of the utterance that captures local context information. Taking this feature embedding as input, the Slot2Intent and Intent2Slot modules capture the bidirectional information flow between the ID and SF tasks. Finally, a cooperation mechanism fuses the information obtained from the Slot2Intent and Intent2Slot modules to further reduce the prediction bias. Experiments on two benchmark datasets, SNIPS and ATIS, demonstrate the effectiveness of our approach, which achieves results competitive with state-of-the-art models. More encouragingly, when using the utterance feature embedding generated by the pre-trained language model BERT, our method achieves state-of-the-art results among all compared approaches.

Edge-Aware Graph Attention Network for Ratio of Edge-User Estimation in Mobile Networks

Jiehui Deng, Sheng Wan, Xiang Wang, Enmei Tu, Xiaolin Huang, Jie Yang, Chen Gong

Auto-TLDR; EAGAT: Edge-Aware Graph Attention Network for Automatic REU Estimation in Mobile Networks

Estimating the Ratio of Edge-Users (REU) is an important issue in mobile networks, as it helps the subsequent adjustment of loads in different cells. However, existing approaches usually determine the REU manually, which is experience-dependent and labor-intensive, so the estimated REU may be imprecise. Considering the inherent graph structure of mobile networks, in this paper we use a graph-based deep learning method for automatic REU estimation, in which the cells are treated as nodes and the load switchings among them constitute edges. Concretely, Graph Attention Network (GAT) is employed as the backbone of our method because of its impressive generalizability in dealing with networked data. Nevertheless, conventional GAT cannot make full use of the information in mobile networks, since it uses only node features to infer pairwise importance and perform graph convolutions, while the edge features that are critical in our problem are disregarded. To address this issue, we propose an Edge-Aware Graph Attention Network (EAGAT), which fuses node features and edge features for REU estimation. Extensive experimental results on two real-world mobile network datasets demonstrate the superiority of our EAGAT approach over several state-of-the-art methods.

Constructing Geographic and Long-term Temporal Graph for Traffic Forecasting

Yiwen Sun, Yulu Wang, Kun Fu, Zheng Wang, Changshui Zhang, Jieping Ye

Auto-TLDR; GLT-GCRNN: Geographic and Long-term Temporal Graph Convolutional Recurrent Neural Network for Traffic Forecasting

Traffic forecasting influences various intelligent transportation system (ITS) services and is of great significance for user experience as well as urban traffic control. It is challenging due to the fact that the road network contains complex and time-varying spatial-temporal dependencies. Recently, deep learning based methods have achieved promising results by adopting graph convolutional network (GCN) to extract the spatial correlations and recurrent neural network (RNN) to capture the temporal dependencies. However, the existing methods often construct the graph only based on road network connectivity, which limits the interaction between roads. In this work, we propose Geographic and Long-term Temporal Graph Convolutional Recurrent Neural Network (GLT-GCRNN), a novel framework for traffic forecasting that learns the rich interactions between roads sharing similar geographic or long-term temporal patterns. Extensive experiments on a real-world traffic state dataset validate the effectiveness of our method by showing that GLT-GCRNN outperforms the state-of-the-art methods in terms of different metrics.

Multi-Graph Convolutional Network for Relationship-Driven Stock Movement Prediction

Jiexia Ye, Juanjuan Zhao, Kejiang Ye, Cheng-Zhong Xu

Auto-TLDR; Multi-GCGRU: A Deep Learning Framework for Stock Price Prediction with Cross Effect

Stock price movement prediction is commonly accepted as a very challenging task due to the volatile nature of financial markets. Previous works typically predict a stock's price mainly from its own information, neglecting the cross effect among related stocks. However, it is well known that an individual stock price is correlated with the prices of other stocks in complex ways. To take the cross effect into consideration, we propose a deep learning framework, called Multi-GCGRU, which combines a graph convolutional network (GCN) and gated recurrent units (GRU) to predict stock movement. Specifically, we first encode multiple relationships among stocks into graphs based on financial domain knowledge and use the GCN to extract the cross effect from the pre-defined graphs. The cross-correlation features produced by the GCN are concatenated with historical records and fed into the GRU to model the temporal pattern in stock prices. To reduce the dependence on prior knowledge, we also explore an adaptive stock graph learned automatically from data. Experiments on two stock indexes in the Chinese market show that our model outperforms other baselines. Our model can readily incorporate more effective pre-defined stock relationships, and it can also learn a data-driven relationship without any domain knowledge.

A Novel Attention-Based Aggregation Function to Combine Vision and Language

Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

Auto-TLDR; Fully-Attentive Reduction for Vision and Language

The joint understanding of vision and language has been recently gaining a lot of attention in both the Computer Vision and Natural Language Processing communities, with the emergence of tasks such as image captioning, image-text matching, and visual question answering. As both images and text can be encoded as sets or sequences of elements - like regions and words - proper reduction functions are needed to transform a set of encoded elements into a single response, like a classification or similarity score. In this paper, we propose a novel fully-attentive reduction method for vision and language. Specifically, our approach computes a set of scores for each element of each modality employing a novel variant of cross-attention, and performs a learnable and cross-modal reduction, which can be used for both classification and ranking. We test our approach on image-text matching and visual question answering, building fair comparisons with other reduction choices, on both COCO and VQA 2.0 datasets. Experimentally, we demonstrate that our approach leads to a performance increase on both tasks. Further, we conduct ablation studies to validate the role of each component of the approach.

GCNs-Based Context-Aware Short Text Similarity Model

Xiaoqi Sun

Auto-TLDR; Context-Aware Graph Convolutional Network for Text Similarity

Semantic textual similarity is a fundamental task in text mining and natural language processing (NLP) and has profound research value. The essential step for text similarity is text representation learning. Recently, researchers have explored graph convolutional network (GCN) techniques for text representation, since GCNs do well in handling complex structures and preserving syntactic information. However, current GCN models are usually limited to very shallow layers due to the vanishing gradient problem, and therefore cannot capture non-local dependency information in sentences. In this paper, we propose a GCNs-based context-aware (GCSTS) model that applies iterated GCN blocks to train deeper GCNs. Recurrently employing the same GCN block prevents over-fitting and provides a broad effective input width. Combined with dense connections, GCSTS can be trained more deeply. In addition, we use dynamic graph structures in the block, which further extend the receptive field of each vertex in the graph and lead to better sentence representations. Experiments show that our model outperforms existing models on several text similarity datasets and verify that GCNs-based text representation models can be trained with deeper architectures than the usual two or three layers.

PrivAttNet: Predicting Privacy Risks in Images Using Visual Attention

Chen Zhang, Thivya Kandappu, Vigneshwaran Subbaraju

Auto-TLDR; PrivAttNet: A Visual Attention Based Approach for Privacy Sensitivity in Images

Visual privacy concerns associated with image sharing are a critical issue that needs to be addressed to enable safe and lawful use of online social platforms. Users of social media platforms often receive no guidance when sharing sensitive images in public and often face social and legal consequences. Given the recent success of visual attention based deep learning methods in measuring abstract phenomena like image memorability, we are motivated to investigate whether visual attention based methods could be useful in measuring psycho-physical phenomena like "privacy sensitivity". In this paper we propose PrivAttNet, a visual attention based approach that can be trained end-to-end to estimate the privacy sensitivity of images without explicitly detecting objects and attributes present in the image. We show that our PrivAttNet model outperforms various SOTA and baseline strategies, with a 1.6-fold reduction in L1 error over SOTA and a 7%-10% improvement in Spearman rank correlation between the predicted and ground-truth sensitivity scores. Additionally, the attention maps from PrivAttNet are found to be useful in directing users to the regions that are responsible for generating the privacy risk score.

Label Incorporated Graph Neural Networks for Text Classification

Yuan Xin, Linli Xu, Junliang Guo, Jiquan Li, Xin Sheng, Yuanyuan Zhou

Auto-TLDR; Graph Neural Networks for Semi-supervised Text Classification

Graph Neural Networks (GNNs) have achieved great success on graph-structured data, and their application to traditional data structures, in areas such as natural language processing and semi-supervised text classification, has been extensively explored in recent years. Previous works consider only the text information when building the graph, and heterogeneous information such as labels is ignored. In this paper, we propose to incorporate the label information when building the graph by adding text-label-text paths, through which the supervision information propagates through the graph more directly. Specifically, we treat labels as nodes in the graph, which also contains text and word nodes, and connect each label with the texts belonging to it. Through graph convolutions, label embeddings are jointly learned with text embeddings in the same latent semantic space. The newly incorporated label nodes facilitate learning more accurate text embeddings by introducing the label information, and thus benefit the downstream text classification tasks. Extensive results on several benchmark datasets show that the proposed framework outperforms baseline methods by a significant margin.
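The graph construction described in this abstract can be sketched as follows. This is an illustration under assumptions rather than the authors' code: edges are unweighted (the abstract does not specify edge weights), only labeled texts are connected to label nodes, and the function name `build_graph` and the toy documents are hypothetical. The graph convolution itself is omitted.

```python
from collections import defaultdict


def build_graph(doc_words, doc_labels):
    """Build an undirected graph with text, word and label nodes.

    doc_words:  dict mapping a text id to its list of words.
    doc_labels: dict mapping labeled text ids to their label
                (unlabeled texts are simply absent).
    Nodes are (kind, name) tuples; returns an adjacency dict of sets.
    """
    adjacency = defaultdict(set)

    def connect(a, b):
        adjacency[a].add(b)
        adjacency[b].add(a)

    for doc, words in doc_words.items():
        for word in words:
            connect(("text", doc), ("word", word))
    for doc, label in doc_labels.items():
        connect(("text", doc), ("label", label))  # enables text-label-text paths
    return adjacency


# Toy usage: d1 and d2 share no words but share a label; d3 is unlabeled.
doc_words = {"d1": ["stock", "rises"], "d2": ["shares", "fall"], "d3": ["team", "wins"]}
doc_labels = {"d1": "finance", "d2": "finance"}
graph = build_graph(doc_words, doc_labels)
print(sorted(graph[("label", "finance")]))
```

In the toy graph, d1 and d2 become two hops apart through the label node even though they have no word in common, which is the kind of direct supervision path the abstract refers to.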

Webly Supervised Image-Text Embedding with Noisy Tag Refinement

Niluthpol Mithun, Ravdeep Pasricha, Evangelos Papalexakis, Amit Roy-Chowdhury

Auto-TLDR; Robust Joint Embedding for Image-Text Retrieval Using Web Images

In this paper, we address the problem of utilizing web images in training robust joint embedding models for the image-text retrieval task. Prior webly supervised approaches directly leverage weakly annotated web images in the joint embedding learning framework. The objective of these approaches suffers significantly when the ratio of noisy and missing tags associated with the web images is very high. In this regard, we propose a CP decomposition based tensor completion framework to refine the tags of web images by modeling observed ternary inter-relations between the sets of labeled images, tags, and web images as a tensor. To deal effectively with the high ratio of missing entries likely in our case, we incorporate intra-modal correlation as side information in the proposed framework. Our tag refinement approach, combined with existing webly supervised image-text embedding approaches, provides a more principled way of learning joint embedding models in the presence of significant noise from web data and limited clean labeled data. Experiments on benchmark datasets demonstrate that the proposed approach helps to achieve a significant performance gain in image-text retrieval.

MA-LSTM: A Multi-Attention Based LSTM for Complex Pattern Extraction

Jingjie Guo, Kelang Tian, Kejiang Ye, Cheng-Zhong Xu

Auto-TLDR; MA-LSTM: Multiple Attention based recurrent neural network for forget gate

With the improvement of data, computing power and algorithms, deep learning has achieved rapid development and shown excellent performance. Recently, many deep learning models have been proposed to solve problems in different areas. A recurrent neural network (RNN) is a class of artificial neural networks where connections between nodes form a directed graph along a temporal sequence. This allows it to exhibit temporal dynamic behavior, which makes it applicable to tasks such as handwriting recognition or speech recognition. However, the RNN relies heavily on its automatic learning ability to update parameters, which concentrates on the data flow but seldom considers the feature extraction capability of the gate mechanism. In this paper, we propose a novel architecture to build the forget gate, which is generated by multiple bases. Instead of using the traditional single-layer fully-connected network, we use a Multiple Attention (MA) based network to generate the forget gate, which refines the optimization space of the gate function and improves the granularity with which the recurrent neural network approximates the ground-truth mapping. Thanks to the MA structure in the gate mechanism, our model has a better feature extraction capability than other known models. MA-LSTM is an alternative module that can directly replace the recurrent neural network and has achieved good performance in many areas of interest.

Moto: Enhancing Embedding with Multiple Joint Factors for Chinese Text Classification

Xunzhu Tang, Rujie Zhu, Tiezhu Sun

Auto-TLDR; Moto: Enhancing Embedding with Multiple Joint Factors

Recently, language representation techniques have achieved great performance in text classification. However, most existing representation models are specifically designed for English materials and may fail on Chinese because of the huge differences between the two languages. Actually, few existing methods for Chinese text classification process texts at a single level. However, as a special kind of hieroglyphics, radicals of Chinese characters are good semantic carriers. In addition, Pinyin codes carry the semantics of tones, Wubi reflects stroke structure information, etc. Unfortunately, previous research has not found an effective way to distill the useful parts of these four factors and to fuse them. In this work, we propose a novel model called Moto: Enhancing Embedding with Multiple Joint Factors. Specifically, we design an attention mechanism that distills the useful parts by fusing the four-level information above more effectively. We conduct extensive experiments on four popular tasks. The empirical results show that Moto achieves state-of-the-art results: 0.8316 (F1-score, 2.11% improvement) on Chinese news titles, 96.38 (1.24% improvement) on Fudan Corpus and 0.9633 (3.26% improvement) on THUCNews.

Cross-Supervised Joint-Event-Extraction with Heterogeneous Information Networks

Yue Wang, Zhuo Xu, Yao Wan, Lu Bai, Lixin Cui, Qian Zhao, Edwin Hancock, Philip Yu

Auto-TLDR; Joint-Event-extraction from Unstructured corpora using Structural Information Network

Joint-event-extraction, which extracts structural information (i.e., entities or triggers of events) from unstructured real-world corpora, has attracted more and more research attention in natural language processing. Most existing works do not fully address the sparse co-occurrence relationships between entities and triggers. This exacerbates the error-propagation problem, which may degrade the extraction performance. To mitigate this issue, we first define joint-event-extraction as a sequence-to-sequence labeling task with a tag set composed of the tags of triggers and entities. Then, to incorporate the missing information in the aforementioned co-occurrence relationships, we propose a Cross-Supervised Mechanism (CSM) to alternately supervise the extraction of either triggers or entities based on the type distribution of each other. Moreover, since the connected entities and triggers naturally form a heterogeneous information network (HIN), we leverage the latent pattern along meta-paths for a given corpus to further improve the performance of our proposed method. To verify the effectiveness of our proposed method, we conduct extensive experiments on real-world datasets and compare our method with state-of-the-art methods. Empirical results and analysis show that our approach outperforms the state-of-the-art methods in both entity and trigger extraction.

Video Episode Boundary Detection with Joint Episode-Topic Model

Shunyao Wang, Ye Tian, Ruidong Wang, Yang Du, Han Yan, Ruilin Yang, Jian Ma

Auto-TLDR; Unsupervised Video Episode Boundary Detection for Bullet Screen Comment Video

Social online video has emerged as one of the most popular applications, and the "bullet screen comment" is one of the favorite features of Asian users. User behavior reports find that most people are used to quickly navigating to and locating the video clip they care about according to its corresponding video labels. Traditional scene segmentation algorithms are mostly based on the analysis of frames, which cannot automatically generate labels. Since time-synchronized comments can reflect the episode at the current moment, this paper proposes an unsupervised video episode boundary detection model (VEBD) for bullet screen comment video. It can not only automatically identify each episode boundary, but also detect the topic for video tagging. Specifically, a Joint Episode-Topic model is first constructed to detect the hidden topics in initially partitioned time slices. Then, based on the detected topics, the temporal and semantic relevancy between adjacent time slices is measured to refine the boundary detection accuracy. Experiments based on real data show that our model outperforms the existing algorithms in both boundary detection and semantic tagging quality.

Multi-Scale Relational Reasoning with Regional Attention for Visual Question Answering

Yuntao Ma, Yirui Wu, Tong Lu

Auto-TLDR; Question-Guided Relational Reasoning for Visual Question Answering

The main challenges of visual question answering (VQA) lie in modeling an alignment between image and question to find informative regions in images that are related to the question, and in reasoning about relations among visual objects according to the question. In this paper, we propose question-guided relational reasoning at multiple scales for visual question answering, in which each region is enhanced by regional attention. Specifically, we present regional attention, which consists of a soft attention and a hard attention, to pick out informative regions of the image according to informativeness evaluations implemented by question-guided soft attention. Combinations of different informative regions are then concatenated with the question embedding at different scales to capture relational information. Relational reasoning can extract question-based relational information between regions, and the multi-scale mechanism gives it the ability to analyze diverse relationships and sensitivity to numbers by modeling scales of relationships. We conduct experiments to show that our proposed architecture is effective and achieves a new state-of-the-art on VQA v2.

Assessing the Severity of Health States Based on Social Media Posts

Shweta Yadav, Joy Prakash Sain, Amit Sheth, Asif Ekbal, Sriparna Saha, Pushpak Bhattacharyya

Auto-TLDR; A Multiview Learning Framework for Assessment of Health State in Online Health Communities

The unprecedented growth of Internet users has resulted in an abundance of unstructured information on social media including health forums, where patients request health-related information or opinions from other users. Previous studies have shown that online peer support has limited effectiveness without expert intervention. Therefore, a system capable of assessing the severity of health state from the patients' social media posts can help health professionals (HP) in prioritizing the user’s post. In this study, we inspect the efficacy of different aspects of Natural Language Understanding (NLU) to identify the severity of the user’s health state in relation to two perspectives (tasks): (a) Medical Condition (i.e., Recover, Exist, Deteriorate, Other) and (b) Medication (i.e., Effective, Ineffective, Serious Adverse Effect, Other) in online health communities. We propose a multiview learning framework that models both the textual content and the contextual information to assess the severity of the user’s health state. Specifically, our model utilizes NLU views such as sentiment, emotions, personality, and use of figurative language to extract the contextual information. The diverse NLU views demonstrate their effectiveness on both tasks, as well as on individual diseases, in assessing a user’s health.

Attentive Visual Semantic Specialized Network for Video Captioning

Jesus Perez-Martin, Benjamin Bustos, Jorge Pérez

Auto-TLDR; Adaptive Visual Semantic Specialized Network for Video Captioning

As an essential high-level task in video understanding, automatically describing a video with natural language has recently gained attention as a fundamental challenge in computer vision. Previous models for video captioning have several limitations, such as the existence of gaps in current semantic representations and the inexpressibility of the generated captions. To deal with these limitations, in this paper, we present a new architecture that we call Attentive Visual Semantic Specialized Network (AVSSN), which is an encoder-decoder model based on our Adaptive Attention Gate and Specialized LSTM layers. This architecture can selectively decide when to use visual or semantic information in the text generation process. The adaptive gate lets the decoder automatically select the relevant information, providing a better temporal state representation than existing decoders. In addition, the model is capable of learning to improve the expressiveness of generated captions with respect to their length, using a sentence-length-related loss function. We evaluate the effectiveness of the proposed approach on the Microsoft Video Description (MSVD) and the Microsoft Research Video-to-Text (MSR-VTT) datasets, achieving state-of-the-art performance on several popular evaluation metrics: BLEU-4, METEOR, CIDEr, and ROUGE_L.

Exploring Spatial-Temporal Representations for fNIRS-based Intimacy Detection via an Attention-enhanced Cascade Convolutional Recurrent Neural Network

Chao Li, Qian Zhang, Ziping Zhao

Auto-TLDR; Intimate Relationship Prediction by Attention-enhanced Cascade Convolutional Recurrent Neural Network Using Functional Near-Infrared Spectroscopy

The detection of intimacy plays a crucial role in the improvement of intimate relationships, which helps to promote family and social harmony. Previous studies have shown that different degrees of intimacy show significant differences in brain imaging. Recently, a few works have emerged that recognise intimacy automatically using machine learning techniques. However, considering the temporal dynamic characteristics of intimate relationships in neural mechanisms, how to effectively model spatio-temporal dynamics for intimacy prediction is still a challenge. In this paper, we propose a novel method to explore deep spatial-temporal representations for intimacy prediction with an Attention-enhanced Cascade Convolutional Recurrent Neural Network (ACCRNN). Given the advantages of time-frequency resolution in the analysis of complex neuronal activities, this paper utilizes functional near-infrared spectroscopy (fNIRS) to analyse and infer intimate relationships. We collect an fNIRS-based dataset for the analysis of intimate relationships. Forty-two-channel fNIRS signals are recorded from the prefrontal cortex of 44 subjects while they watch a total of 18 photos of lovers, friends and strangers for 30 seconds per photo. The experimental results show that our proposed method outperforms the other methods in terms of accuracy, with a precision of 96.5%. To the best of our knowledge, this is the first time that such a hybrid deep architecture has been employed for fNIRS-based intimacy prediction.

Visual Style Extraction from Chart Images for Chart Restyling

Danqing Huang, Jinpeng Wang, Guoxin Wang, Chin-Yew Lin

Auto-TLDR; Exploiting Visual Properties from Reference Chart Images for Chart Restyling

Creating a good-looking chart for better visualization is time-consuming. There are plenty of well-designed charts on the Web, which are ideal references for imitating chart style. However, because they are stored as bitmap images, reference charts hinder machine interpretation of their style settings and are thus difficult to apply directly. In this paper, we extract visual properties from reference chart images as style templates to restyle charts. We first construct a large-scale dataset of 187,059 chart images from real-world data, labeled with predefined visual property values. Then we introduce an end-to-end learning network to extract the properties based on two image-encoding approaches. Furthermore, in order to capture spatial relationships of chart objects, which are crucial in solving the task, we propose a novel positional encoding method to integrate clues of relative positions between objects. Experimental results show that our model significantly outperforms baseline models. By adding positional features, our model achieves better performance. Finally, we present an application for chart restyling based on our model.

Learning Neural Textual Representations for Citation Recommendation

Thanh Binh Kieu, Inigo Jauregi Unanue, Son Bao Pham, Xuan-Hieu Phan, M. Piccardi

Auto-TLDR; Sentence-BERT cascaded with Siamese and triplet networks for citation recommendation

With the rapid growth of the scientific literature, manually selecting appropriate citations for a paper is becoming increasingly challenging and time-consuming. While several approaches for automated citation recommendation have been proposed in recent years, effective document representations for citation recommendation are still elusive to a large extent. For this reason, in this paper we propose a novel approach to citation recommendation which leverages a deep sequential representation of the documents (Sentence-BERT) cascaded with Siamese and triplet networks in a submodular scoring function. To the best of our knowledge, this is the first approach to combine deep representations and submodular selection for the task of citation recommendation. Experiments have been carried out using a popular benchmark dataset -- the ACL Anthology Network corpus -- and evaluated against baselines and a state-of-the-art approach using metrics such as the MRR and F1@k score. The results show that the proposed approach outperforms all the compared approaches on every measured metric.
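
As an illustration of the two metrics named in this abstract, the following is a minimal sketch of MRR and F1@k using their standard definitions. The function names and the toy document identifiers are hypothetical, and the paper's exact evaluation protocol may differ.

```python
def reciprocal_rank(ranked, relevant):
    """Return 1/rank of the first relevant item, or 0.0 if none is found."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(queries):
    """Average the reciprocal rank over (ranked_list, relevant_set) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


def f1_at_k(ranked, relevant, k):
    """Harmonic mean of precision@k and recall@k for one query."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    if hits == 0:
        return 0.0
    precision = hits / k
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)


# Toy usage with made-up citation identifiers.
queries = [(["a", "b", "c"], {"b"}), (["x", "y"], {"x"})]
mrr = mean_reciprocal_rank(queries)
f1 = f1_at_k(["a", "b", "c", "d"], {"b", "d", "z"}, k=2)
```

Here precision@k divides by k itself, which is one common convention; dividing by the number of items actually returned is another.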

Zero-Shot Text Classification with Semantically Extended Graph Convolutional Network

Tengfei Liu, Yongli Hu, Junbin Gao, Yanfeng Sun, Baocai Yin

Auto-TLDR; Semantically Extended Graph Convolutional Network for Zero-shot Text Classification

As a challenging task in Natural Language Processing (NLP), zero-shot text classification has attracted more and more attention recently. It aims to detect classes that the model has never seen in the training set. For this purpose, a feasible way is to construct connections between the seen and unseen classes by semantic extension and to classify the unseen classes by propagating information over these connections. Although many related zero-shot text classification methods have been explored, how to realize semantic extension properly and propagate information effectively is far from solved. In this paper, we propose a novel zero-shot text classification method called Semantically Extended Graph Convolutional Network (SEGCN). In the proposed method, the semantic category knowledge from ConceptNet is used for semantic extension, linking seen classes to unseen classes and constructing a graph of all classes. Then, we build upon a Graph Convolutional Network (GCN) to predict the textual classifier for each category, which transfers the category knowledge through convolution operators on the constructed graph and is trained in a semi-supervised manner using the samples of the seen classes. The experimental results on the Dbpedia and 20newsgroup datasets show that our method outperforms state-of-the-art zero-shot text classification methods.

Road Network Metric Learning for Estimated Time of Arrival

Yiwen Sun, Kun Fu, Zheng Wang, Changshui Zhang, Jieping Ye

Auto-TLDR; Road Network Metric Learning for Estimated Time of Arrival (RNML-ETA)

Recently, deep learning has achieved promising results in Estimated Time of Arrival (ETA), which is the task of predicting the travel time from the origin to the destination along a given path. One of the key techniques is to use embedding vectors to represent the elements of the road network, such as the links (road segments). However, the embedding suffers from a data sparsity problem: many links in the road network are traversed by too few floating cars, even on large ride-hailing platforms like Uber and DiDi. Insufficient data leaves the embedding vectors under-fitted, which undermines the accuracy of ETA prediction. To address the data sparsity problem, we propose the Road Network Metric Learning framework for ETA (RNML-ETA). It consists of two components: (1) a main regression task to predict the travel time, and (2) an auxiliary metric learning task to improve the quality of link embedding vectors. We further propose the triangle loss, a novel loss function to improve the efficiency of metric learning. We validate the effectiveness of RNML-ETA on large-scale real-world datasets, showing that our method outperforms the state-of-the-art model and that the improvement is concentrated on cold links with little data.

Transformer Networks for Trajectory Forecasting

Francesco Giuliari, Hasan Irtiza, Marco Cristani, Fabio Galasso

Auto-TLDR; Transformer Networks for Trajectory Prediction of People Interactions

Most recent successes on forecasting the people motion are based on LSTM models and all most recent progress has been achieved by modelling the social interaction among people and the people interaction with the scene. We question the use of the LSTM models and propose the novel use of Transformer Networks for trajectory forecasting. This is a fundamental switch from the sequential step-by-step processing of LSTMs to the only-attention-based memory mechanisms of Transformers. In particular, we consider both the original Transformer Network (TF) and the larger Bidirectional Transformer (BERT), state-of-the-art on all natural language processing tasks. Our proposed Transformers predict the trajectories of the individual people in the scene. These are “simple” models because each person is modelled separately without any complex human-human nor scene interaction terms. In particular, the TF model without bells and whistles yields the best score on the largest and most challenging trajectory forecasting benchmark of TrajNet [1]. Additionally, its extension which predicts multiple plausible future trajectories performs on par with more engineered techniques on the 5 datasets of ETH [2]+UCY [3]. Finally, we show that Transformers may deal with missing observations, as it may be the case with real sensor data. Code is available at github.com/FGiuliari/Trajectory-Transformer

Region and Relations Based Multi Attention Network for Graph Classification

Manasvi Aggarwal, M. Narasimha Murty

Auto-TLDR; R2POOL: A Graph Pooling Layer for Non-euclidean Structures

Graphs are non-euclidean structures that can represent many kinds of relational data efficiently. Many studies have proposed convolution and pooling operators on the non-euclidean domain. Graph convolution operators have shown astounding performance on various tasks such as node representation and classification. For graph classification, different pooling techniques have been introduced, but none of them considers both the neighborhood of a node and its long-range dependencies. In this paper, we propose a novel graph pooling layer, R2POOL, which balances the structural information around a node with its dependencies on faraway nodes. Further, we propose a new training strategy to learn coarse-to-fine representations. We add supervision only at intermediate levels to generate predictions using only intermediate-level features. For this, we propose the concept of an alignment score. Moreover, each layer's prediction is controlled by our proposed branch training strategy. This complete training helps in learning dominant class features at each layer for representing graphs. We call the combined model R2MAN. Experiments show that R2MAN has the potential to improve the performance of graph classification on various datasets.

Visual Oriented Encoder: Integrating Multimodal and Multi-Scale Contexts for Video Captioning

Bang Yang, Yuexian Zou

Auto-TLDR; Visual Oriented Encoder for Video Captioning

Video captioning is a challenging task which aims at automatically generating a natural language description of a given video. Recent research has shown that exploiting the intrinsic multi-modalities of videos significantly improves captioning performance. However, how to integrate multi-modalities to generate effective semantic representations for video captioning is still an open issue. Some researchers have proposed to learn multimodal features in parallel during the encoding stage. The downside of these methods lies in their neglect of the interaction among multi-modalities and of their rich contextual information. In this study, inspired by the fact that visual contents are generally more important for comprehending videos, we propose a novel Visual Oriented Encoder (VOE) to integrate multimodal features in an interactive manner. Specifically, VOE is designed as a hierarchical structure, where the bottom layers are used to extract multi-scale contexts from auxiliary modalities while the top layer is used to generate joint representations by considering both visual and contextual information. Following the encoder-decoder framework, we systematically develop a VOE-LSTM model and evaluate it on two mainstream benchmarks: MSVD and MSR-VTT. Experimental results show that the proposed VOE surpasses conventional encoders and that our VOE-LSTM model achieves competitive results compared with state-of-the-art approaches.

Global Feature Aggregation for Accident Anticipation

Mishal Fatima, Umar Karim Khan, Chong Min Kyung

Auto-TLDR; Feature Aggregation for Predicting Accidents in Video Sequences

Anticipation of accidents ahead of time in autonomous and non-autonomous vehicles aids in accident avoidance. In order to recognize abnormal events such as traffic accidents in a video sequence, it is important that the network takes into account the interactions of objects in a given frame. We propose a novel Feature Aggregation (FA) block that refines each object's features by computing a weighted sum of the features of all objects in a frame. We use the FA block along with a Long Short Term Memory (LSTM) network to anticipate accidents in video sequences. We report mean Average Precision (mAP) and Average Time-to-Accident (ATTA) on the Street Accident (SA) dataset. Our proposed method achieves the highest score for risk anticipation by predicting accidents 0.32 sec and 0.75 sec earlier than the best results of the Adaptive Loss and dynamic parameter prediction based methods, respectively.
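
To make the "weighted sum of the features of all objects in a frame" concrete, here is a minimal sketch. The abstract does not say how the weights are computed, so this sketch assumes softmax-normalized dot-product similarity; the function names are hypothetical, and it should not be read as the authors' FA block.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def aggregate_features(features):
    """Refine each object's feature as a weighted sum over all objects.

    Assumption: weights are the softmax of dot-product similarities between
    the current object and every object in the frame (including itself).
    """
    refined = []
    for query in features:
        weights = softmax([dot(query, other) for other in features])
        refined.append([
            sum(w * other[d] for w, other in zip(weights, features))
            for d in range(len(query))
        ])
    return refined


# Toy frame with two objects described by 2-dimensional features.
frame = [[1.0, 0.0], [0.0, 1.0]]
refined = aggregate_features(frame)
```

Each refined vector is a convex combination of the frame's object features, so an object that resembles the others is pulled toward them while a dissimilar one stays closer to its own feature.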

Context Visual Information-Based Deliberation Network for Video Captioning

Min Lu, Xueyong Li, Caihua Liu

Auto-TLDR; Context visual information-based deliberation network for video captioning

Video captioning aims to automatically and accurately generate a textual description for a video. Typical methods following the encoder-decoder architecture directly use hidden states to predict words. However, these methods do not correct inaccurate hidden states before feeding them into word prediction, which leads to a cascade of errors when generating word by word. In this paper, the context visual information-based deliberation network is proposed, abbreviated as CVI-DelNet. Its key idea is to introduce a deliberator into the encoder-decoder framework. The encoder-decoder first generates a raw hidden state sequence. Unlike in existing methods, the raw hidden state is no longer directly used for word prediction but is fed into the deliberator to generate a refined hidden state. The words are then predicted according to the refined hidden states and the contextual visual features. Results on two datasets show that the proposed method significantly outperforms the baselines.

Tackling Contradiction Detection in German Using Machine Translation and End-To-End Recurrent Neural Networks

Maren Pielka, Rafet Sifa, Lars Patrick Hillebrand, David Biesner, Rajkumar Ramamurthy, Anna Ladi, Christian Bauckhage

Auto-TLDR; Contradiction Detection in Natural Language Inference using Recurrent Neural Networks

Natural Language Inference, and specifically Contradiction Detection, is still an unexplored topic with respect to German text. In this paper, we apply Recurrent Neural Network (RNN) methods to learn contradiction-specific sentence embeddings. Our data set for evaluation is a machine-translated version of the Stanford Natural Language Inference (SNLI) corpus. The results are compared to a baseline using unsupervised vectorization techniques, namely tf-idf and Flair, as well as state-of-the-art transformer-based (MBERT) methods. We find that the end-to-end models outperform the models trained on unsupervised embeddings, which makes them the better choice in an empirical use case. The RNN methods also outperform MBERT on the translated data set.

A CNN-RNN Framework for Image Annotation from Visual Cues and Social Network Metadata

Tobia Tesan, Pasquale Coscia, Lamberto Ballan

Auto-TLDR; Context-Based Image Annotation with Multiple Semantic Embeddings and Recurrent Neural Networks

Images represent a commonly used form of visual communication among people. Nevertheless, image classification may be a challenging task when dealing with unclear or uncommon images that need more context to be correctly annotated. Metadata accompanying images on social media represent an ideal source of additional information for retrieving proper neighborhoods, easing the image annotation task. To this end, we blend visual features extracted from neighbors and their metadata to jointly leverage context and visual cues. Our models use multiple semantic embeddings to achieve the dual objective of being robust to vocabulary changes between train and test sets and decoupling the architecture from the low-level metadata representation. Convolutional and recurrent neural networks (CNNs-RNNs) are jointly adopted to infer similarity among neighbors and query images. We perform comprehensive experiments on the NUS-WIDE dataset showing that our models outperform state-of-the-art architectures based on images and metadata, and decrease both sensory and semantic gaps to better annotate images.

Equation Attention Relationship Network (EARN) : A Geometric Deep Metric Framework for Learning Similar Math Expression Embedding

Saleem Ahmed, Kenny Davila, Srirangaraj Setlur, Venu Govindaraju

Auto-TLDR; Representational Learning for Similarity Based Retrieval of Mathematical Expressions

Representational Learning in the form of high-dimensional embeddings has been used for multiple pattern recognition applications. There has been significant interest in building embedding-based systems for learning representations in the mathematical domain. At the same time, retrieval of structured information such as mathematical expressions is an important need for modern IR systems. In this work, our motivation is to introduce a robust framework for learning representations for similarity-based retrieval of mathematical expressions. Given a query by example, the embedding can find the closest matching expression as a function of the euclidean distance between them. We leverage recent advancements in image-based and graph-based deep learning algorithms to learn our similarity embeddings. We do this first by using uni-modal encoders in graph space and image space, and then by using a multi-modal combination of the two. To overcome the lack of training data, we force the networks to learn a deep metric using triplets generated with a heuristic scoring function. We also adopt a custom strategy for mining hard samples to train our neural networks. Our system produces rankings similar to those generated by the original scoring function, but in only a fraction of the time. Our results establish the viability of using such a multi-modal embedding for this task.
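
The two ingredients mentioned here, a deep metric learned from triplets and retrieval by Euclidean distance, can be sketched as follows. This uses the standard margin-based triplet loss as an assumption; the abstract does not state the exact loss, and the toy embeddings, expression labels, and function names are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: pull the positive closer than the negative by a margin."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def nearest_expression(query, index):
    """Return the key of the indexed embedding closest to the query embedding."""
    return min(index, key=lambda name: euclidean(query, index[name]))


# Toy 2-dimensional embeddings for two made-up expressions.
index = {"x+y": [1.0, 0.0], "x*y": [0.0, 1.0]}
match = nearest_expression([0.9, 0.1], index)
satisfied = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
violated = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so only violating triplets contribute to training.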

PICK: Processing Key Information Extraction from Documents Using Improved Graph Learning-Convolutional Networks

Wenwen Yu, Ning Lu, Xianbiao Qi, Ping Gong, Rong Xiao

Auto-TLDR; PICK: A Graph Learning Framework for Key Information Extraction from Documents

Computer vision with state-of-the-art deep learning models has recently achieved huge success in the field of Optical Character Recognition (OCR), including text detection and recognition tasks. However, Key Information Extraction (KIE) from documents, as the downstream task of OCR with a large number of real-world use scenarios, remains a challenge because documents not only have textual features extracted from OCR systems but also have semantic visual features that are not fully exploited and play a critical role in KIE. Too little work has been devoted to making full and efficient use of both the textual and visual features of documents. In this paper, we introduce PICK, a framework that is effective and robust in handling complex document layouts for KIE by combining graph learning with graph convolution operations, yielding a richer semantic representation containing the textual and visual features and global layout without ambiguity. Extensive experiments on real-world datasets have been conducted to show that our method outperforms baseline methods by significant margins.

SAT-Net: Self-Attention and Temporal Fusion for Facial Action Unit Detection

Zhihua Li, Zheng Zhang, Lijun Yin

Auto-TLDR; Temporal Fusion and Self-Attention Network for Facial Action Unit Detection

Research on facial action unit (AU) detection has shown remarkable performance by using deep spatial learning models in recent years; however, it is far from reaching its full capacity in learning due to the lack of use of temporal information of AUs across time. Since the AU occurrence in one frame is highly likely to be related to previous frames in a temporal sequence, exploring the temporal correlation of AUs across frames becomes a key motivation of this work. In this paper, we propose a novel temporal fusion and AU-supervised self-attention network (a so-called SAT-Net) to address the AU detection problem. First, we input the deep features of a sequence into a convolutional LSTM network, fuse the previous temporal information into the feature map of the last frame, and continue to learn the AU occurrence. Second, considering that AU detection is a multi-label classification problem in which each individual label depends only on certain facial areas, we propose a new self-learned attention mask that focuses the detection of each AU on parts of the facial area by learning an individual attention mask for each AU, thus increasing AU independence without the loss of any spatial relations. Our extensive experiments show that the proposed framework achieves better AU detection results than the state of the art on two benchmark databases (BP4D and DISFA).

More Correlations Better Performance: Fully Associative Networks for Multi-Label Image Classification

Yaning Li, Liu Yang

Auto-TLDR; Fully Associative Network for Fully Exploiting Correlation Information in Multi-Label Classification

Recent research demonstrates that correlation modeling plays a key role in high-performance multi-label classification methods. However, existing methods do not take full advantage of correlation information, especially correlations in the feature and label spaces of each image, which limits the performance of correlation-based multi-label classification methods. To take more correlations into account, in this study a Fully Associative Network (FAN) is proposed for fully exploiting correlation information, involving both visual feature and label correlations. Specifically, FAN introduces a robust covariance pooling to summarize convolution features as a global image representation for capturing feature correlation in the multi-label task. Moreover, it constructs an effective label correlation matrix based on a re-weighted scheme, which is fed into a graph convolution network (GCN) for capturing label correlation. Then, the correlation between the covariance representations (i.e., feature correlation) and the outputs of the GCN (i.e., label correlation) is modeled for the final prediction. Experimental results on two datasets illustrate the effectiveness and efficiency of our proposed FAN compared with state-of-the-art methods.

Enriching Video Captions with Contextual Text

Philipp Rimle, Pelin Dogan, Markus Gross

Auto-TLDR; Contextualized Video Captioning Using Contextual Text

Understanding video content and generating captions with context is an important and challenging task. Unlike prior methods that typically attempt to generate generic video captions without context, our architecture contextualizes captioning by infusing extracted information from relevant text data. We propose an end-to-end sequence-to-sequence model which generates video captions based on visual input, and mines relevant knowledge such as names and locations from contextual text. In contrast to previous approaches, we do not preprocess the text further, and let the model directly learn to attend over it. Guided by the visual input, the model is able to copy words from the contextual text via a pointer-generator network, allowing it to produce more specific video captions. We show competitive performance on the News Video Dataset and, through ablation studies, validate the efficacy of contextual video captioning as well as individual design choices in our model architecture.
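
The copy mechanism mentioned above can be illustrated with the usual pointer-generator mixture, in which a generation probability blends the vocabulary distribution with the attention weights over the source tokens. This is a minimal sketch of that general formulation with hypothetical names and toy values; the model in this paper may combine the distributions differently.

```python
def pointer_generator_mix(p_gen, vocab_probs, attention, source_tokens):
    """Blend generating from the vocabulary with copying from the source text.

    p_gen: probability of generating from the vocabulary (0..1).
    vocab_probs: dict mapping vocabulary words to generation probabilities.
    attention: attention weights over source_tokens (assumed to sum to 1).
    source_tokens: tokens of the contextual text, aligned with attention.
    """
    final = {word: p_gen * prob for word, prob in vocab_probs.items()}
    for weight, token in zip(attention, source_tokens):
        # Copy probability mass is added even for out-of-vocabulary tokens.
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final


# Toy example: two vocabulary words and two names from the contextual text.
mixed = pointer_generator_mix(
    p_gen=0.6,
    vocab_probs={"a": 0.5, "man": 0.5},
    attention=[0.75, 0.25],
    source_tokens=["Obama", "Paris"],
)
```

Because copied tokens receive probability mass directly from the attention weights, names and locations that are absent from the vocabulary can still appear in the caption.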

Answer-Checking in Context: A Multi-Modal Fully Attention Network for Visual Question Answering

Hantao Huang, Tao Han, Wei Han, Deep Yap Deep Yap, Cheng-Ming Chiang

Auto-TLDR; Fully Attention Based Visual Question Answering

Visual Question Answering (VQA) is challenging due to the complex cross-modality relations. It has received extensive attention from the research community. From the human perspective, to answer a visual question, one needs to read the question and then refer to the image to generate an answer. This answer is then checked against the question and image again for final confirmation. In this paper, we mimic this process and propose a fully attention based VQA architecture. Moreover, an answer-checking module is proposed to perform unified attention on the joint answer, question and image representation to update the answer. This mimics the human answer-checking process of considering the answer in context. With answer-checking modules and transferred BERT layers, our model achieves a state-of-the-art accuracy of 71.57% using fewer parameters on the VQA-v2.0 test-standard split.

AG-GAN: An Attentive Group-Aware GAN for Pedestrian Trajectory Prediction

Yue Song, Niccolò Bisagno, Syed Zohaib Hassan, Nicola Conci

Auto-TLDR; An attentive group-aware GAN for motion prediction in crowded scenarios

Understanding human behaviors in crowded scenarios requires analyzing not only the position of the subjects in space, but also the scene context. Existing approaches mostly rely on the motion history of each pedestrian and model the interactions among people by considering the entire surrounding neighborhood. In our approach, we address the problem of motion prediction by applying coherent group clustering and a global attention mechanism to LSTM-based Generative Adversarial Networks (GANs). The proposed model consists of an attentive group-aware GAN that observes the agents' past motion and predicts future paths, using (i) a group pooling module to model neighborhood interaction, and (ii) an attention module to specifically focus on hidden states. The experimental results demonstrate that our proposal outperforms state-of-the-art models on common benchmark datasets, and is able to generate socially acceptable trajectories.

CKG: Dynamic Representation Based on Context and Knowledge Graph

Xunzhu Tang, Tiezhu Sun, Rujie Zhu

Auto-TLDR; CKG: Dynamic Representation Based on Knowledge Graph for Language Sentences

Recently, neural language representation models pre-trained on large corpora have been shown to capture rich co-occurrence information and can be fine-tuned on downstream tasks to improve performance. As a result, they have achieved state-of-the-art results in a wide range of language tasks. However, there exists other valuable semantic information, such as similar, opposite, or other possible meanings, in external knowledge graphs (KGs). We argue that entities in KGs could be used to enhance the correct semantic meaning of language sentences. In this paper, we propose a new method, CKG: Dynamic Representation Based on Context and Knowledge Graph. On the one hand, CKG can extract rich semantic information from large corpora. On the other hand, it can make full use of inside information, such as co-occurrence in large corpora, and outside information, such as similar entities in KGs. We conduct extensive experiments on a wide range of tasks, including QQP, MRPC, SST-5, SQuAD, CoNLL 2003, and SNLI. The experimental results show that CKG achieves a state-of-the-art score of 89.2 on SQuAD, compared with SAN (84.4), ELMo (85.8), and BERT-Base (88.5).

Question-Agnostic Attention for Visual Question Answering

Moshiur R Farazi, Salman Hameed Khan, Nick Barnes

Auto-TLDR; Question-Agnostic Attention for Visual Question Answering

Visual Question Answering (VQA) models employ attention mechanisms to discover image locations that are most relevant for answering a specific question. For this purpose, several multimodal fusion strategies have been proposed, ranging from relatively simple operations (e.g., linear sum) to more complex ones (e.g., Block). The resulting multimodal representations define an intermediate feature space for capturing the interplay between visual and semantic features, which helps in selectively focusing on image content. In this paper, we propose a question-agnostic attention mechanism that is complementary to the existing question-dependent attention mechanisms. Our proposed model parses object instances to obtain an 'object map' and applies this map to the visual features to generate Question-Agnostic Attention (QAA) features. In contrast to question-dependent attention approaches that are learned end-to-end, the proposed QAA does not involve question-specific training, and can be easily included in almost any existing VQA model as a generic lightweight pre-processing step, thereby adding minimal computation overhead for training. Further, when used to complement question-dependent attention, QAA allows the model to focus on regions containing objects that might have been overlooked by the learned attention representation. Through extensive evaluation on the VQAv1, VQAv2, and TDIUC datasets, we show that incorporating complementary QAA allows state-of-the-art VQA models to perform better, and provides a significant boost to simplistic VQA models, enabling them to perform on par with highly sophisticated fusion strategies.
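
As a rough illustration of applying an object map to visual features, the sketch below assumes a binary map on the same grid as the features and uses element-wise masking. The abstract does not specify the exact operation, so the masking rule, the function name, and the toy values are all assumptions rather than the authors' method.

```python
def apply_object_map(features, object_map):
    """Zero out feature vectors at grid cells that contain no object."""
    return [
        [[value * cell for value in vector]
         for vector, cell in zip(feat_row, map_row)]
        for feat_row, map_row in zip(features, object_map)
    ]


# Hypothetical 2x2 grid of 2-dimensional visual features.
features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
# Hypothetical binary object map: 1 where an object instance was detected.
object_map = [
    [1, 0],
    [0, 1],
]

qaa_features = apply_object_map(features, object_map)
```

Because the map depends only on the image and not on the question, it can be computed once per image as a pre-processing step, which matches the light-weight usage the abstract describes.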

Deep Convolutional Embedding for Digitized Painting Clustering

Giovanna Castellano, Gennaro Vessio

Auto-TLDR; A Deep Convolutional Embedding Model for Clustering Artworks

Clustering artworks is difficult for several reasons. On the one hand, recognizing meaningful patterns in accordance with domain knowledge and visual perception is extremely hard. On the other hand, applying traditional clustering and feature reduction techniques to the high-dimensional pixel space can be ineffective. To address these issues, we propose to use a deep convolutional embedding model for digitized painting clustering, in which the task of mapping the raw input data to an abstract, latent space is jointly optimized with the task of finding a set of cluster centroids in this latent feature space. Quantitative and qualitative experimental results show the effectiveness of the proposed method. The model is also able to outperform other state-of-the-art deep clustering approaches to the same problem. The proposed method may be beneficial to several art-related tasks, particularly visual link retrieval and historical knowledge discovery in painting datasets.

Video Summarization with a Dual Attention Capsule Network

Hao Fu, Hongxing Wang, Jianyu Yang

Auto-TLDR; Dual Self-Attention Capsule Network for Video Summarization

In this paper, we address the problem of video summarization, which aims at selecting a subset of video frames as a summary to represent the original video contents compactly and completely. To this end, we propose a simple but effective supervised approach with a dual attention capsule network. Unlike existing LSTM-based methods, it attends to short- and long-term dependencies among video frames through an elaborate dual self-attention architecture, which can handle longer-term dependencies and admits parallel computing. To reconcile the outputs of dual self-attention, we rely on a two-stream capsule network to learn the underlying frame selection criteria. Experiments on real-world datasets show the advantages of the proposed approach compared with state-of-the-art methods.
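
To make the contrast with LSTM-based methods concrete, here is a minimal sketch of generic scaled dot-product self-attention over frame features. It assumes a single head with identity query, key, and value projections, which is a simplification and not the paper's dual architecture; the function names and the toy frame vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Scaled dot-product self-attention with identity projections."""
    dim = len(frames[0])
    outputs = []
    for query in frames:
        # Each frame scores every frame directly, regardless of distance.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in frames
        ]
        weights = softmax(scores)
        # The output is the attention-weighted average of all frames.
        outputs.append([
            sum(w * frame[i] for w, frame in zip(weights, frames))
            for i in range(dim)
        ])
    return outputs


# Hypothetical frame features (three frames, two dimensions each).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(frames)
```

Each output row depends only on the input frames and not on the other output rows, so all rows can be computed independently. This is what permits parallel computation, whereas a recurrent model must process frames one after another.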

Interpretable Structured Learning with Sparse Gated Sequence Encoder for Protein-Protein Interaction Prediction

Kishan K C, Feng Cui, Anne Haake, Rui Li

Auto-TLDR; Predicting Protein-Protein Interactions Using Sequence Representations

Predicting protein-protein interactions (PPIs) by learning informative representations from amino acid sequences is a challenging yet important problem in biology. Although various deep learning models in Siamese architecture have been proposed to model PPIs from sequences, these methods are computationally expensive for a large number of PPIs due to the pairwise encoding process. Furthermore, these methods are difficult to interpret because of non-intuitive mappings from protein sequences to their sequence representation. To address these challenges, we present a novel deep framework to model and predict PPIs from sequence alone. Our model incorporates a bidirectional gated recurrent unit to learn sequence representations by leveraging contextualized and sequential information from sequences. We further employ a sparse regularization to model long-range dependencies between amino acids and to select important amino acids (protein motifs), thus enhancing interpretability. Besides, the novel design of the encoding process makes our model computationally efficient and scalable to an increasing number of interactions. Experimental results on up-to-date interaction datasets demonstrate that our model achieves superior performance compared to other state-of-the-art methods. Literature-based case studies illustrate the ability of our model to provide biological insights to interpret the predictions.