Geographic-Semantic-Temporal Hypergraph Convolutional Network for Traffic Flow Prediction

Kesu Wang, Jing Chen, Shijie Liao, Jiaxin Hou, Qingyu Xiong

Responsive image

Auto-TLDR; Geographic-semantic-temporal convolutional network for traffic flow prediction

Traffic flow prediction is becoming an increasingly important part for intelligent transportation control and management. This task is challenging due to (1) complex geographic and non-geographic spatial correlation; (2) temporal correlations between time slices; (3) dynamics of semantic high-order correlations along temporal dimension. To address those difficulties, commonly-used methods apply graph convolutional networks for spatial correlations and recurrent neural networks for temporal dependencies. In this work, We distinguish the two aspects of spatial correlations and propose the two types of spatial graphes, named as geographic graph and semantic hypergraph. We extend the traditional convolution and propose geographic-temporal graph convolution to jointly capture geographic-temporal correlations and semantic-temporal hypergraph convolution to jointly capture semantic-temporal correlations. Then We propose a geographic-semantic-temporal convolutional network (GST-HCN) that combines our graph convolutions and GRU units hierarchically in a unified end-to-end network. The experiment results on the Caltrans Performance Measurement System (PeMS) dataset show that our proposed model significantly outperforms other popular spatio-temporal deep learning models and suggest the effectiveness to explore geographic-semantic-temporal dependencies on deep learning models for traffic flow prediction.

Similar papers

Constructing Geographic and Long-term Temporal Graph for Traffic Forecasting

Yiwen Sun, Yulu Wang, Kun Fu, Zheng Wang, Changshui Zhang, Jieping Ye

Responsive image

Auto-TLDR; GLT-GCRNN: Geographic and Long-term Temporal Graph Convolutional Recurrent Neural Network for Traffic Forecasting

Slides Poster Similar

Traffic forecasting influences various intelligent transportation system (ITS) services and is of great significance for user experience as well as urban traffic control. It is challenging due to the fact that the road network contains complex and time-varying spatial-temporal dependencies. Recently, deep learning based methods have achieved promising results by adopting graph convolutional network (GCN) to extract the spatial correlations and recurrent neural network (RNN) to capture the temporal dependencies. However, the existing methods often construct the graph only based on road network connectivity, which limits the interaction between roads. In this work, we propose Geographic and Long-term Temporal Graph Convolutional Recurrent Neural Network (GLT-GCRNN), a novel framework for traffic forecasting that learns the rich interactions between roads sharing similar geographic or long-term temporal patterns. Extensive experiments on a real-world traffic state dataset validate the effectiveness of our method by showing that GLT-GCRNN outperforms the state-of-the-art methods in terms of different metrics.

Transfer Learning with Graph Neural Networks for Short-Term Highway Traffic Forecasting

Tanwi Mallick, Prasanna Balaprakash, Eric Rask, Jane Macfarlane

Responsive image

Auto-TLDR; Transfer Learning for Highway Traffic Forecasting on Unseen Traffic Networks

Slides Poster Similar

Large-scale highway traffic forecasting approaches are critical for intelligent transportation systems. Recently, deep-learning-based traffic forecasting methods have emerged as promising approaches for a wide range of traffic forecasting tasks. However, these methods are specific to a given traffic network and consequently, they cannot be used for forecasting traffic on an unseen traffic network. Previous work has identified diffusion convolutional recurrent neural network (DCRNN), as a state-of-the-art method for highway traffic forecasting. It models the complex spatial and temporal dynamics of a highway network using a graph-based diffusion convolution operation within a recurrent neural network. Currently, DCRNN cannot perform transfer learning because it learns location-specific traffic patterns, which cannot be used for unseen regions of a network or new geographic locations. To that end, we develop TL-DCRNN, a new transfer learning approach for DCRNN, where a single model trained on a highway network can be used to forecast traffic on unseen highway networks. Given a traffic network with a large amount of traffic data, our approach consists of partitioning the traffic network into a number of subgraphs and using a new training scheme that utilizes subgraphs for the DCRNN to marginalize the location-specific information, thus learning the traffic as a function of network connectivity and temporal patterns alone. The resulting trained model can be used to forecast traffic on unseen networks. We demonstrate that TL-DCRNN can learn from San Francisco regional traffic data and forecast traffic on the Los Angeles region and vice versa.

Multi-Graph Convolutional Network for Relationship-Driven Stock Movement Prediction

Jiexia Ye, Juanjuan Zhao, Kejiang Ye, Cheng-Zhong Xu

Responsive image

Auto-TLDR; Multi-GCGRU: A Deep Learning Framework for Stock Price Prediction with Cross Effect

Slides Poster Similar

Stock price movement prediction is commonly accepted as a very challenging task due to the volatile nature of financial markets. Previous works typically predict the stock price mainly based on its own information, neglecting the cross effect among involved stocks. However, it is well known that an individual stock price is correlated with prices of other stocks in complex ways. To take the cross effect into consideration, we propose a deep learning framework, called Multi-GCGRU, which comprises graph convolutional network (GCN) and gated recurrent units (GRU) to predict stock movement. Specifically, we first encode multiple relationships among stocks into graphs based on financial domain knowledge and utilize GCN to extract the cross effect based on the pre-defined graphs. The cross-correlation features produced by GCN are concatenated with historical records and fed into GRU to model the temporal pattern in stock price. To further get rid of prior knowledge, we explore an adaptive stock graph learned by data automatically. Experiments on two stock indexes in China market show that our model outperforms other baselines. Note that our model is rather feasible to incorporate more effective pre-defined stock relationships. What's more, it can also learn a data-driven relationship without any domain knowledge.

Edge-Aware Graph Attention Network for Ratio of Edge-User Estimation in Mobile Networks

Jiehui Deng, Sheng Wan, Xiang Wang, Enmei Tu, Xiaolin Huang, Jie Yang, Chen Gong

Responsive image

Auto-TLDR; EAGAT: Edge-Aware Graph Attention Network for Automatic REU Estimation in Mobile Networks

Slides Poster Similar

Estimating the Ratio of Edge-Users (REU) is an important issue in mobile networks, as it helps the subsequent adjustment of loads in different cells. However, existing approaches usually determine the REU manually, which are experience-dependent and labor-intensive, and thus the estimated REU might be imprecise. Considering the inherited graph structure of mobile networks, in this paper, we utilize a graph-based deep learning method for automatic REU estimation, where the practical cells are deemed as nodes and the load switchings among them constitute edges. Concretely, Graph Attention Network (GAT) is employed as the backbone of our method due to its impressive generalizability in dealing with networked data. Nevertheless, conventional GAT cannot make full use of the information in mobile networks, since it only incorporates node features to infer the pairwise importance and conduct graph convolutions, while the edge features that are actually critical in our problem are disregarded. To accommodate this issue, we propose an Edge-Aware Graph Attention Network (EAGAT), which is able to fuse the node features and edge features for REU estimation. Extensive experimental results on two real-world mobile network datasets demonstrate the superiority of our EAGAT approach to several state-of-the-art methods.

Recurrent Graph Convolutional Networks for Skeleton-Based Action Recognition

Guangming Zhu, Lu Yang, Liang Zhang, Peiyi Shen, Juan Song

Responsive image

Auto-TLDR; Recurrent Graph Convolutional Network for Human Action Recognition

Slides Poster Similar

Human action recognition is one of the challenging and active research fields due to its wide applications. Recently, graph convolutions for skeleton-based action recognition have attracted much attention. Generally, the adjacency matrices of the graph are fixed to the hand-crafted physical connectivity of the human joints, or learned adaptively via deep learining. The hand-crafted or learned adjacency matrices are fixed when processing each frame of an action sequence. However, the interactions of different subsets of joints may play a core role at different phases of an action. Therefore, it is reasonable to evolve the graph topology with time. In this paper, a recurrent graph convolution is proposed, in which the graph topology is evolved via a long short-term memory (LSTM) network. The proposed recurrent graph convolutional network (R-GCN) can recurrently learn the data-dependent graph topologies for different layers, different time steps and different kinds of actions. Experimental results on the NTU RGB+D and Kinetics-Skeleton datasets demonstrate the advantages of the proposed R-GCN.

Temporal Attention-Augmented Graph Convolutional Network for Efficient Skeleton-Based Human Action Recognition

Negar Heidari, Alexandros Iosifidis

Responsive image

Auto-TLDR; Temporal Attention Module for Efficient Graph Convolutional Network-based Action Recognition

Slides Poster Similar

Graph convolutional networks (GCNs) have been very successful in modeling non-Euclidean data structures, like sequences of body skeletons forming actions modeled as spatio-temporal graphs. Most GCN-based action recognition methods use deep feed-forward networks with high computational complexity to process all skeletons in an action. This leads to a high number of floating point operations (ranging from 16G to 100G FLOPs) to process a single sample, making their adoption in restricted computation application scenarios infeasible. In this paper, we propose a temporal attention module (TAM) for increasing the efficiency in skeleton-based action recognition by selecting the most informative skeletons of an action at the early layers of the network. We incorporate the TAM in a light-weight GCN topology to further reduce the overall number of computations. Experimental results on two benchmark datasets show that the proposed method outperforms with a large margin the baseline GCN-based method while having 2.9 times less number of computations. Moreover, it performs on par with the state-of-the-art with up to 9.6 times less number of computations.

Road Network Metric Learning for Estimated Time of Arrival

Yiwen Sun, Kun Fu, Zheng Wang, Changshui Zhang, Jieping Ye

Responsive image

Auto-TLDR; Road Network Metric Learning for Estimated Time of Arrival (RNML-ETA)

Slides Poster Similar

Recently, deep learning have achieved promising results in Estimated Time of Arrival (ETA), which is considered as predicting the travel time from the origin to the destination along a given path. One of the key techniques is to use embedding vectors to represent the elements of road network, such as the links (road segments). However, the embedding suffers from the data sparsity problem that many links in the road network are traversed by too few floating cars even in large ride-hailing platforms like Uber and DiDi. Insufficient data makes the embedding vectors in an under-fitting status, which undermines the accuracy of ETA prediction. To address the data sparsity problem, we propose the Road Network Metric Learning framework for ETA (RNML ETA). It consists of two components: (1) a main regression task to predict the travel time, and (2) an auxiliary metric learning task to improve the quality of link embedding vectors. We further propose the triangle loss, a novel loss function to improve the efficiency of metric learning. We validated the effectiveness of RNML-ETA on large scale real-world datasets, by showing that our method outperforms the state-of-the-art model and the promotion concentrates on the cold links with few data.

Graph Convolutional Neural Networks for Power Line Outage Identification

Jia He, Maggie Cheng

Responsive image

Auto-TLDR; Graph Convolutional Networks for Power Line Outage Identification

Poster Similar

In this paper, we consider the power line outage identification problem as a graph signal classification problem, where the signal at each vertex is given as a time series. We propose graph convolutional networks (GCNs) for the task of classifying signals supported on graphs. An important element of the GCN design is filter design. We consider filtering signals in either the vertex (spatial) domain, or the frequency (spectral) domain. Two basic architectures are proposed. In the spatial GCN architecture, the GCN uses a graph shift operator as the basic building block to incorporate the underlying graph structure into the convolution layer. The spatial filter directly utilizes the graph connectivity information. It defines the filter to be a polynomial in the graph shift operator to obtain the convolved features that aggregate neighborhood information of each node. In the spectral GCN architecture, a frequency filter is used instead. A graph Fourier transform operator first transforms the raw graph signal from the vertex domain to the frequency domain, and then a filter is defined using the graph's spectral parameters. The spectral GCN then uses the output from the graph Fourier transform to compute the convolved features. There are additional challenges to classify the time-evolving graph signal as the signal value at each vertex changes over time. The GCNs are designed to recognize different spatiotemporal patterns from high-dimensional data defined on a graph. The application of the proposed methods to power line outage identification shows that these GCN architectures can successfully classify abnormal signal patterns and identify the outage location.

GCNs-Based Context-Aware Short Text Similarity Model

Xiaoqi Sun

Responsive image

Auto-TLDR; Context-Aware Graph Convolutional Network for Text Similarity

Slides Poster Similar

Semantic textual similarity is a fundamental task in text mining and natural language processing (NLP), which has profound research value. The essential step for text similarity is text representation learning. Recently, researches have explored the graph convolutional network (GCN) techniques on text representation, since GCN does well in handling complex structures and preserving syntactic information. However, current GCN models are usually limited to very shallow layers due to the vanishing gradient problem, which cannot capture non-local dependency information of sentences. In this paper, we propose a GCNs-based context-aware (GCSTS) model that applies iterated GCN blocks to train deeper GCNs. Recurrently employing the same GCN block prevents over-fitting and provides broad effective input width. Combined with dense connections, GCSTS can be trained more deeply. Besides, we use dynamic graph structures in the block, which further extend the receptive field of each vertex in graphs, learning better sentence representations. Experiments show that our model outperforms existing models on several text similarity datasets, while also verify that GCNs-based text representation models can be trained in a deeper manner, rather than being trained in two or three layers.

AOAM: Automatic Optimization of Adjacency Matrix for Graph Convolutional Network

Yuhang Zhang, Hongshuai Ren, Jiexia Ye, Xitong Gao, Yang Wang, Kejiang Ye, Cheng-Zhong Xu

Responsive image

Auto-TLDR; Adjacency Matrix for Graph Convolutional Network in Non-Euclidean Space

Slides Poster Similar

Graph Convolutional Network (GCN) is adopted to tackle the problem of the convolution operation in non-Euclidean space. Although previous works on GCN have made some progress, one of their limitations is that their input Adjacency Matrix (AM) is designed manually and requires domain knowledge, which is cumbersome, tedious and error-prone. In addition, entries of this fixed Adjacency Matrix are generally designed as binary values (i.e., ones and zeros) which can not reflect more complex relationship between nodes. However, many applications require a weighted and dynamic Adjacency Matrix instead of an unweighted and fixed Adjacency Matrix. To this end, there are few works focusing on designing a more flexible Adjacency Matrix. In this paper, we propose an end-to-end algorithm to improve the GCN performance by focusing on the Adjacency Matrix. We first provide a calculation method that called node information entropy to update the matrix. Then, we analyze the search strategy in a continuous space and introduce the Deep Deterministic Policy Gradient (DDPG) method to overcome the demerit of the discrete space search. Finally, we integrate the GCN and reinforcement learning into an end-to-end framework. Our method can automatically define the adjacency matrix without artificial knowledge. At the same time, the proposed approach can deal with any size of the matrix and provide a better value for the network. Four popular datasets are selected to evaluate the capability of our algorithm. The method in this paper achieves the state-of-the-art performance on Cora and Pubmed datasets, respectively, with the accuracy of 84.6% and 81.6%.

Region and Relations Based Multi Attention Network for Graph Classification

Manasvi Aggarwal, M. Narasimha Murty

Responsive image

Auto-TLDR; R2POOL: A Graph Pooling Layer for Non-euclidean Structures

Slides Poster Similar

Graphs are non-euclidean structures that can represent many relational data efficiently. Many studies have proposed the convolution and the pooling operators on the non-euclidean domain. The graph convolution operators have shown astounding performance on various tasks such as node representation and classification. For graph classification, different pooling techniques are introduced, but none of them has considered both neighborhood of the node and the long-range dependencies of the node. In this paper, we propose a novel graph pooling layer R2POOL, which balances the structure information around the node as well as the dependencies with far away nodes. Further, we propose a new training strategy to learn coarse to fine representations. We add supervision at only intermediate levels to generate predictions using only intermediate-level features. For this, we propose the concept of an alignment score. Moreover, each layer's prediction is controlled by our proposed branch training strategy. This complete training helps in learning dominant class features at each layer for representing graphs. We call the combined model by R2MAN. Experiments show that R2MAN the potential to improve the performance of graph classification on various datasets.

Vertex Feature Encoding and Hierarchical Temporal Modeling in a Spatio-Temporal Graph Convolutional Network for Action Recognition

Konstantinos Papadopoulos, Enjie Ghorbel, Djamila Aouada, Bjorn Ottersten

Responsive image

Auto-TLDR; Spatio-Temporal Graph Convolutional Network for Skeleton-Based Action Recognition

Slides Poster Similar

Spatio-temporal Graph Convolutional Networks (ST-GCNs) have shown great performance in the context of skeleton-based action recognition. Nevertheless, ST-GCNs use raw skeleton data as vertex features. Such features have low dimensionality and might not be optimal for action discrimination. Moreover, a single layer of temporal convolution is used to model short-term temporal dependencies but can be insufficient for capturing both long-term. In this paper, we extend the Spatio-Temporal Graph Convolutional Network for skeleton-based action recognition by introducing two novel modules, namely, the Graph Vertex Feature Encoder (GVFE) and the Dilated Hierarchical Temporal Convolutional Network (DH-TCN). On the one hand, the GVFE module learns appropriate vertex features for action recognition by encoding raw skeleton data into a new feature space. On the other hand, the DH-TCN module is capable of capturing both short-term and long-term temporal dependencies using a hierarchical dilated convolutional network. Experiments have been conducted on the challenging NTU RGB-D 60, NTU RGB-D 120 and Kinetics datasets. The obtained results show that our method competes with state-of-the-art approaches while using a smaller number of layers and parameters; thus reducing the required training time and memory.

MFI: Multi-Range Feature Interchange for Video Action Recognition

Sikai Bai, Qi Wang, Xuelong Li

Responsive image

Auto-TLDR; Multi-range Feature Interchange Network for Action Recognition in Videos

Slides Poster Similar

Short-range motion features and long-range dependencies are two complementary and vital cues for action recognition in videos, but it remains unclear how to efficiently and effectively extract these two features. In this paper, we propose a novel network to capture these two features in a unified 2D framework. Specifically, we first construct a Short-range Temporal Interchange (STI) block, which contains a Channels-wise Temporal Interchange (CTI) module for encoding short-range motion features. Then a Graph-based Regional Interchange (GRI) module is built to present long-range dependencies using graph convolution. Finally, we replace original bottleneck blocks in the ResNet with STI blocks and insert several GRI modules between STI blocks, to form a Multi-range Feature Interchange (MFI) Network. Practically, extensive experiments are conducted on three action recognition datasets (i.e., Something-Something V1, HMDB51, and UCF101), which demonstrate that the proposed MFI network achieves impressive results with very limited computing cost.

MA-LSTM: A Multi-Attention Based LSTM for Complex Pattern Extraction

Jingjie Guo, Kelang Tian, Kejiang Ye, Cheng-Zhong Xu

Responsive image

Auto-TLDR; MA-LSTM: Multiple Attention based recurrent neural network for forget gate

Slides Poster Similar

With the improvement of data, computing powerand algorithms, deep learning has achieved rapid developmentand showing excellent performance. Recently, many deep learn-ing models are proposed to solve the problems in different areas.A recurrent neural network (RNN) is a class of artificial neuralnetworks where connections between nodes form a directedgraph along a temporal sequence. This allows it to exhibittemporal dynamic behavior, which makes it applicable to taskssuch as handwriting recognition or speech recognition. How-ever, the RNN relies heavily on the automatic learning abilityto update parameters which concentrate on the data flow butseldom considers the feature extraction capability of the gatemechanism. In this paper, we propose a novel architecture tobuild the forget gate which is generated by multiple bases.Instead of using the traditional single-layer fully-connectednetwork, we use a Multiple Attention (MA) based network togenerate the forget gate which refines the optimization spaceof gate function and improve the granularity of the recurrentneural network to approximate the map in the ground truth.Credit to the MA structure on the gate mechanism. Our modelhas a better feature extraction capability than other knownmodels. MA-LSTM is an alternative module which can directly replace the recurrent neural network and has achieved good performance in many areas that people are concerned about.

On the Global Self-attention Mechanism for Graph Convolutional Networks

Chen Wang, Deng Chengyuan

Responsive image

Auto-TLDR; Global Self-Attention Mechanism for Graph Convolutional Networks

Slides Similar

Applying Global Self-Attention (GSA) mechanism over features has achieved remarkable success on Convolutional Neural Networks (CNNs). However, it is not clear if Graph Convolutional Networks (GCNs) can similarly benefit from such a technique. In this paper, inspired by the similarity between CNNs and GCNs, we study the impact of the Global Self-Attention mechanism on GCNs. We find that consistent with the intuition, the GSA mechanism allows GCNs to capture feature-based vertex relations regardless of edge connections; As a result, the GSA mechanism can introduce extra expressive power to the GCNs. Furthermore, we analyze the impacts of the GSA mechanism on the issues of overfitting and over-smoothing. We prove that the GSA mechanism can alleviate both the overfitting and the over-smoothing issues based on some recent technical developments. Experiments on multiple benchmark datasets illustrate both superior expressive power and less significant overfitting and over-smoothing problems for the GSA-augmented GCNs, which corroborate the intuitions and the theoretical results.

A Two-Stream Recurrent Network for Skeleton-Based Human Interaction Recognition

Qianhui Men, Edmond S. L. Ho, Shum Hubert P. H., Howard Leung

Responsive image

Auto-TLDR; Two-Stream Recurrent Neural Network for Human-Human Interaction Recognition

Slides Poster Similar

This paper addresses the problem of recognizing human-human interaction from skeletal sequences. Existing methods are mainly designed to classify single human action. Many of them simply stack the movement features of two characters to deal with human interaction, while neglecting the abundant relationships between characters. In this paper, we propose a novel two-stream recurrent neural network by adopting the geometric features from both single actions and interactions to describe the spatial correlations with different discriminative abilities. The first stream is constructed under pairwise joint distance (PJD) in a fully-connected mesh to categorize the interactions with explicit distance patterns. To better distinguish similar interactions, in the second stream, we combine PJD with the spatial features from individual joint positions using graph convolutions to detect the implicit correlations among joints, where the joint connections in the graph are adaptive for flexible correlations. After spatial modeling, each stream is fed to a bi-directional LSTM to encode two-way temporal properties. To take advantage of the diverse discriminative power of the two streams, we come up with a late fusion algorithm to combine their output predictions concerning information entropy. Experimental results show that the proposed framework achieves state-of-the-art performance on 3D and comparable performance on 2D interaction datasets. Moreover, the late fusion results demonstrate the effectiveness of improving the recognition accuracy compared with single streams.

Trajectory-User Link with Attention Recurrent Networks

Tao Sun, Yongjun Xu, Fei Wang, Lin Wu, 塘文 钱, Zezhi Shao

Responsive image

Auto-TLDR; TULAR: Trajectory-User Link with Attention Recurrent Neural Networks

Slides Poster Similar

The prevalent adoptions of GPS-enabled devices have witnessed an explosion of various location-based services which produces a huge amount of trajectories monitoring the individuals' movements. In this paper, we tackle Trajectory-User Link (TUL) problem, which identifies humans' movement patterns and links trajectories to the users who generated them. Existing solutions on TUL problem employ recurrent neural networks and variational autoencoder methods, which face the bottlenecks in the case of excessively long trajectories and fragmentary users' movements. However, these are common characteristics of trajectory data in reality, leading to performance degradation of the existing models. In this paper, we propose an end-to-end attention recurrent neural learning framework, called TULAR (Trajectory-User Link with Attention Recurrent Networks), which focus on selected parts of the source trajectories when linking. TULAR introduce the Trajectory Semantic Vector (TSV) via unsupervised location representation learning and recurrent neural networks, by which to reckon the weight of parts of source trajectory. Further, we employ three attention scores for the weight measurements. Experiments are conducted on two real world datasets and compared with several existing methods, and the results show that TULAR yields a new state-of-the-art performance. Source code is public available at GitHub: https://github.com/taos123/TULAR.

Label Incorporated Graph Neural Networks for Text Classification

Yuan Xin, Linli Xu, Junliang Guo, Jiquan Li, Xin Sheng, Yuanyuan Zhou

Responsive image

Auto-TLDR; Graph Neural Networks for Semi-supervised Text Classification

Slides Poster Similar

Graph Neural Networks (GNNs) have achieved great success on graph-structured data, and their applications on traditional data structures such as natural language processing and semi-supervised text classification have been extensively explored in recent years. While previous works only consider the text information while building the graph, heterogeneous information such as labels is ignored. In this paper, we consider to incorporate the label information while building the graph by adding text-label-text paths, through which the supervision information will propagate among the graph more directly. Specifically, we treat labels as nodes in the graph which also contains text and word nodes, and then connect labels with texts belonging to that label. Through graph convolutions, label embeddings are jointly learned with text embeddings in the same latent semantic space. The newly incorporated label nodes will facilitate learning more accurate text embeddings by introducing the label information, and thus benefit the downstream text classification tasks. Extensive results on several benchmark datasets show that the proposed framework outperforms baseline methods by a significant margin.

More Correlations Better Performance: Fully Associative Networks for Multi-Label Image Classification

Yaning Li, Liu Yang

Responsive image

Auto-TLDR; Fully Associative Network for Fully Exploiting Correlation Information in Multi-Label Classification

Slides Poster Similar

Recent researches demonstrate that correlation modeling plays a key role in high-performance multi-label classification methods. However, existing methods do not take full advantage of correlation information, especially correlations in feature and label spaces of each image, which limits the performance of correlation-based multi-label classification methods. With more correlations considered, in this study, a Fully Associative Network (FAN) is proposed for fully exploiting correlation information, which involves both visual feature and label correlations. Specifically, FAN introduces a robust covariance pooling to summarize convolution features as global image representation for capturing feature correlation in the multi-label task. Moreover, it constructs an effective label correlation matrix based on a re-weighted scheme, which is fed into a graph convolution network for capturing label correlation. Then, correlation between covariance representations (i.e., feature correlation ) and the outputs of GCN (i.e., label correlation) are modeled for final prediction. Experimental results on two datasets illustrate the effectiveness and efficiency of our proposed FAN compared with state-of-the-art methods.

End-To-End Multi-Task Learning of Missing Value Imputation and Forecasting in Time-Series Data

Jinhee Kim, Taesung Kim, Jang-Ho Choi, Jaegul Choo

Responsive image

Auto-TLDR; Time-Series Prediction with Denoising and Imputation of Missing Data

Slides Poster Similar

Multivariate time-series prediction is a common task, but it often becomes challenging due to missing values involved in data caused by unreliable sensors and other issues. In fact, inaccurate imputation of missing values can degrade the downstream prediction performance, so it may be better not to rely on the estimated values of missing data. Furthermore, observed data may contain noise, so denoising them can be helpful for the main task at hand. In response, we propose a novel approach that can automatically utilize the optimal combination of the observed and the estimated values to generate not only complete, but also noise-reduced data by our own gating mechanism. We evaluate our model on real-world time-series datasets and achieved state-of-the-art performance, demonstrating that our method successfully handle the incomplete datasets. Moreover, we present in-depth studies using a carefully designed, synthetic multivariate time-series dataset to verify the effectiveness of the proposed model. The ablation studies and the experimental analysis of the proposed gating mechanism show that the proposed method works as an effective denoising as well as imputation method for time-series classification tasks.

Video-Based Facial Expression Recognition Using Graph Convolutional Networks

Daizong Liu, Hongting Zhang, Pan Zhou

Responsive image

Auto-TLDR; Graph Convolutional Network for Video-based Facial Expression Recognition

Slides Poster Similar

Facial expression recognition (FER), aiming to classify the expression present in the facial image or video, has attracted a lot of research interests in the field of artificial intelligence and multimedia. In terms of video based FER task, it is sensible to capture the dynamic expression variation among the frames to recognize facial expression. However, existing methods directly utilize CNN-RNN or 3D CNN to extract the spatial-temporal features from different facial units, instead of concentrating on a certain region during expression variation capturing, which leads to limited performance in FER. In our paper, we introduce a Graph Convolutional Network (GCN) layer into a common CNN-RNN based model for video-based FER. First, the GCN layer is utilized to learn more contributing facial expression features which concentrate on certain regions after sharing information between nodes those represent CNN extracted features. Then, a LSTM layer is applied to learn long-term dependencies among the GCN learned features to model the variation. In addition, a weight assignment mechanism is also designed to weight the output of different nodes for final classification by characterizing the expression intensities in each frame. To the best of our knowledge, it is the first time to use GCN in FER task. We evaluate our method on three widely-used datasets, CK+, Oulu-CASIA and MMI, and also one challenging wild dataset AFEW8.0, and the experimental results demonstrate that our method has superior performance to existing methods.

You Ought to Look Around: Precise, Large Span Action Detection

Ge Pan, Zhang Han, Fan Yu, Yonghong Song, Yuanlin Zhang, Han Yuan

Responsive image

Auto-TLDR; YOLA: Local Feature Extraction for Action Localization with Variable receptive field

Slides Similar

For the action localization task, pre-defined action anchors are the cornerstone of mainstream techniques. State-of-the-art models mostly rely on a dense segmenting scheme, where anchors are sampled uniformly over the temporal domain with a predefined set of scales. However, it is not sufficient because action duration varies greatly. Therefore, it is necessary for the anchors or proposals to have a variable receptive field. In this paper, we propose a method called YOLA (You Ought to Look Around) which includes three parts: 1) a robust backbone SPN-I3D for extracting spatio-temporal features. In this part, we employ a stronger backbone I3D with SPN (Segment Pyramid Network) instead of C3D to obtain multi-scale features; 2) a simple but useful feature fusion module named LFE (Local Feature Extraction). Compared with the fully connected layer and global average pooling, our LFE model is more advantageous for network to fit and fuse features. 3) a new feature segment aligning method called TPGC (Two Pathway Graph Convolution), which allows one proposal to leverage semantic features of adjacent proposals to update its content and make sure the proposals have a variable receptive field. YOLA add only a small overhead to the baseline network, and is easy to train in an end-to-end manner, running at a speed of 1097 fps. YOLA achieves a mAP of 58.3%, outperforming all existing models including both RGB-based and two stream on THUMOS'14, and achieves competitive results on ActivityNet 1.3.

Learning Connectivity with Graph Convolutional Networks

Hichem Sahbi

Responsive image

Auto-TLDR; Learning Graph Convolutional Networks Using Topological Properties of Graphs

Slides Poster Similar

Learning graph convolutional networks (GCNs) is an emerging field which aims at generalizing convolutional operations to arbitrary non-regular domains. In particular, GCNs operating on spatial domains show superior performances compared to spectral ones, however their success is highly dependent on how the topology of input graphs is defined. In this paper, we introduce a novel framework for graph convolutional networks that learns the topological properties of graphs. The design principle of our method is based on the optimization of a constrained objective function which learns not only the usual convolutional parameters in GCNs but also a transformation basis that conveys the most relevant topological relationships in these graphs. Experiments conducted on the challenging task of skeleton-based action recognition shows the superiority of the proposed method compared to handcrafted graph design as well as the related work.

Boundary-Aware Graph Convolution for Semantic Segmentation

Hanzhe Hu, Jinshi Cui, Jinshi Hongbin Zha

Responsive image

Auto-TLDR; Boundary-Aware Graph Convolution for Semantic Segmentation

Slides Poster Similar

Recent works have made great progress in semantic segmentation by exploiting contextual information in a local or global manner with dilated convolutions, pyramid pooling or self-attention mechanism. However, few works have focused on harvesting boundary information to improve the segmentation performance. In order to enhance the feature similarity within the object and keep discrimination from other objects, we propose a boundary-aware graph convolution (BGC) module to propagate features within the object. The graph reasoning is performed among pixels of the same object apart from the boundary pixels. Based on the proposed BGC module, we further introduce the Boundary-aware Graph Convolution Network(BGCNet), which consists of two main components including a basic segmentation network and the BGC module, forming a coarse-to-fine paradigm. Specifically, the BGC module takes the coarse segmentation feature map as node features and boundary prediction to guide graph construction. After graph convolution, the reasoned feature and the input feature are fused together to get the refined feature, producing the refined segmentation result. We conduct extensive experiments on three popular semantic segmentation benchmarks including Cityscapes, PASCAL VOC 2012 and COCO Stuff, and achieve state-of-the-art performance on all three benchmarks.

Channel-Wise Dense Connection Graph Convolutional Network for Skeleton-Based Action Recognition

Michael Lao Banteng, Zhiyong Wu

Responsive image

Auto-TLDR; Two-stream channel-wise dense connection GCN for human action recognition

Slides Poster Similar

Skeleton-based action recognition task has drawn much attention for many years. Graph Convolutional Network (GCN) has proved its effectiveness in this task. However, how to improve the model's robustness to different human actions and how to make effective use of features produced by the network are main topics needed to be further explored. Human actions are time series sequence, meaning that temporal information is a key factor to model the representation of data. The ranges of body parts involved in small actions (e.g. raise a glass or shake head) and big actions (e.g. walking or jumping) are diverse. It's crucial for the model to generate and utilize more features that can be adaptive to a wider range of actions. Furthermore, feature channels are specific with the action class, the model needs to weigh their importance and pay attention to more related ones. To address these problems, in this work, we propose a two-stream channel-wise dense connection GCN (2s-CDGCN). Specifically, the skeleton data was extracted and processed into spatial and temporal information for better feature representation. A channel-wise attention module was used to select and emphasize the more useful features generated by the network. Moreover, to ensure maximum information flow, dense connection was introduced to the network structure, which enables the network to reuse the skeleton features and generate more information adaptive and related to different human actions. Our model has shown its ability to improve the accuracy of human action recognition task on two large datasets, NTU-RGB+D and Kinetics. Extensive evaluations were conducted to prove the effectiveness of our model.

Zero-Shot Text Classification with Semantically Extended Graph Convolutional Network

Tengfei Liu, Yongli Hu, Junbin Gao, Yanfeng Sun, Baocai Yin

Responsive image

Auto-TLDR; Semantically Extended Graph Convolutional Network for Zero-shot Text Classification

Slides Poster Similar

As a challenging task of Natural Language Processing(NLP), zero-shot text classification has attracted more and more attention recently. It aims to detect classes that the model has never seen in the training set. For this purpose, a feasible way is to construct connection between the seen and unseen classes by semantic extension and classify the unseen classes by information propagation over the connection. Although many related zero-shot text classification methods have been exploited, how to realize semantic extension properly and propagate information effectively is far from solved. In this paper, we propose a novel zero-shot text classification method called Semantically Extended Graph Convolutional Network (SEGCN). In the proposed method, the semantic category knowledge from ConceptNet is utilized to semantic extension for linking seen classes to unseen classes and constructing a graph of all classes. Then, we build upon Graph Convolutional Network (GCN) for predicting the textual classifier for each category, which transfers the category knowledge by the convolution operators on the constructed graph and is trained in a semi-supervised manner using the samples of the seen classes. The experimental results on Dbpedia and 20newsgroup datasets show that our method outperforms the state of the art zero-shot text classification methods.

TreeRNN: Topology-Preserving Deep Graph Embedding and Learning

Yecheng Lyu, Ming Li, Xinming Huang, Ulkuhan Guler, Patrick Schaumont, Ziming Zhang

Responsive image

Auto-TLDR; TreeRNN: Recurrent Neural Network for General Graph Classification

Slides Poster Similar

General graphs are difficult for learning due to their irregular structures. Existing works employ message passing along graph edges to extract local patterns using customized graph kernels, but few of them are effective for the integration of such local patterns into global features. In contrast, in this paper we study the methods to transfer the graphs into trees so that explicit orders are learned to direct the feature integration from local to global. To this end, we apply the breadth first search (BFS) to construct trees from the graphs, which adds direction to the graph edges from the center node to the peripheral nodes. In addition, we proposed a novel projection scheme that transfer the trees to image representations, which is suitable for conventional convolution neural networks (CNNs) and recurrent neural networks (RNNs). To best learn the patterns from the graph-tree-images, we propose TreeRNN, a 2D RNN architecture that recurrently integrates the image pixels by rows and columns to help classify the graph categories. We evaluate the proposed method on several graph classification datasets, and manage to demonstrate comparable accuracy with the state-of-the-art on MUTAG, PTC-MR and NCI1 datasets.

Kernel-based Graph Convolutional Networks

Hichem Sahbi

Responsive image

Auto-TLDR; Spatial Graph Convolutional Networks in Recurrent Kernel Hilbert Space

Slides Poster Similar

Learning graph convolutional networks (GCNs) is an emerging field which aims at generalizing deep learning to arbitrary non-regular domains. Most of the existing GCNs follow a neighborhood aggregation scheme, where the representation of a node is recursively obtained by aggregating its neighboring node representations using averaging or sorting operations. However, these operations are either ill-posed or weak to be discriminant or increase the number of training parameters and thereby the computational complexity and the risk of overfitting. In this paper, we introduce a novel GCN framework that achieves spatial graph convolution in a reproducing kernel Hilbert space. The latter makes it possible to design, via implicit kernel representations, convolutional graph filters in a high dimensional and more discriminating space without increasing the number of training parameters. The particularity of our GCN model also resides in its ability to achieve convolutions without explicitly realigning nodes in the receptive fields of the learned graph filters with those of the input graphs, thereby making convolutions permutation agnostic and well defined. Experiments conducted on the challenging task of skeleton-based action recognition show the superiority of the proposed method against different baselines as well as the related work.

Exploring Spatial-Temporal Representations for fNIRS-based Intimacy Detection via an Attention-enhanced Cascade Convolutional Recurrent Neural Network

Chao Li, Qian Zhang, Ziping Zhao

Responsive image

Auto-TLDR; Intimate Relationship Prediction by Attention-enhanced Cascade Convolutional Recurrent Neural Network Using Functional Near-Infrared Spectroscopy

Slides Poster Similar

The detection of intimacy plays a crucial role in the improvement of intimate relationship, which contributes to promote the family and social harmony. Previous studies have shown that different degrees of intimacy have significant differences in brain imaging. Recently, a few of work has emerged to recognise intimacy automatically by using machine learning technique. Moreover, considering the temporal dynamic characteristics of intimacy relationship on neural mechanism, how to model spatio-temporal dynamics for intimacy prediction effectively is still a challenge. In this paper, we propose a novel method to explore deep spatial-temporal representations for intimacy prediction by Attention-enhanced Cascade Convolutional Recurrent Neural Network (ACCRNN). Given the advantages of time-frequency resolution in complex neuronal activities analysis, this paper utilizes functional near-infrared spectroscopy (fNIRS) to analyse and infer to intimate relationship. We collect a fNIRS-based dataset for the analysis of intimate relationship. Forty-two-channel fNIRS signals are recorded from the 44 subjects' prefrontal cortex when they watched a total of 18 photos of lovers, friends and strangers for 30 seconds per photo. The experimental results show that our proposed method outperforms the others in terms of accuracy with the precision of 96.5%. To the best of our knowledge, this is the first time that such a hybrid deep architecture has been employed for fNIRS-based intimacy prediction.

A General Model for Learning Node and Graph Representations Jointly

Chaofan Chen

Responsive image

Auto-TLDR; Joint Community Detection/Dynamic Routing for Graph Classification

Slides Poster Similar

This paper focuses on two fundamental graph recognition tasks: node classification and graph classification. Existing methods usually learn the node and graph representations for these two tasks separately, and ignore modeling the relations between the local and global structures. In this paper, we propose a general approach to learn the local and global features collaboratively: (1) in order to characterize the correlation among nodes and communities (a set of nodes), we employ the joint community detection/dynamic routing modules to generate the clustering assignment matrices at first and then utilize these matrices to cluster nodes to capture the global information of graphs (locally relevant graph representations). Inspired by the success of spectral clustering, we minimize the ratiocut loss to help optimize the learned assignment matrices. (2) We maximize the mutual information between local and global representations to help learn the globally relevant node representations. Experimental results on a variety of node and graph classification benchmarks show that our model can achieve superior performance over the state-of-the-art approaches.

Temporal Collaborative Filtering with Graph Convolutional Neural Networks

Esther Rodrigo-Bonet, Minh Duc Nguyen, Nikos Deligiannis

Responsive image

Auto-TLDR; Temporal Collaborative Filtering with Graph-Neural-Network-based Neural Networks

Slides Poster Similar

Temporal collaborative filtering (TCF) methods aim at modelling non-static aspects behind recommender systems, such as the dynamics in users' preferences and social trends around items. State-of-the-art TCF methods employ recurrent neural networks (RNNs) to model such aspects. These methods deploy matrix-factorization-based (MF-based) approaches to learn the user and item representations. Recently, graph-neural-network-based (GNN-based) approaches have shown improved performance in providing accurate recommendations over traditional MF-based approaches in non-temporal CF settings. Motivated by this, we propose a novel TCF method that leverages GNNs to learn user and item representations, and RNNs to model their temporal dynamics. A challenge with this method lies in the increased data sparsity, which negatively impacts obtaining meaningful quality representations with GNNs. To overcome this challenge, we train a GNN model at each time step using a set of observed interactions accumulated time-wise. Comprehensive experiments on real-world data show the improved performance obtained by our method over several state-of-the-art temporal and non-temporal CF models.

Object Detection Using Dual Graph Network

Shengjia Chen, Zhixin Li, Feicheng Huang, Canlong Zhang, Huifang Ma

Responsive image

Auto-TLDR; A Graph Convolutional Network for Object Detection with Key Relation Information

Slides Similar

Most object detection methods focus only on the local information near the region proposal and ignore the object's global semantic relation and local spatial relation information, resulting in limited performance. To capture and explore these important relations, we propose a detection method based on a graph convolutional network (GCN). Two independent relation graph networks are used to obtain the global semantic information of the object in labels and the local spatial information in images. Semantic relation networks can implicitly acquire global knowledge, and by constructing a directed graph on the dataset, each node is represented by the word embedding of labels and then sent to the GCN to obtain high-level semantic representation. The spatial relation network encodes the relation by the positional relation module and the visual connection module, and enriches the object features through local key information from objects. The feature representation is further improved by aggregating the outputs of the two networks. Instead of directly disseminating visual features in the network, the dual-graph network explores more advanced feature information, giving the detector the ability to obtain key relations in labels and region proposals. Experiments on the PASCAL VOC and MS COCO datasets demonstrate that key relation information significantly improve the performance of detection with better ability to detect small objects and reasonable boduning box. The results on COCO dataset demonstrate our method obtains around 32.3% improvement on AP in terms of small objects.

What Nodes Vote To? Graph Classification without Readout Phase

Yuxing Tian, Zheng Liu, Weiding Liu, Zeyu Zhang, Yanwen Qu

Responsive image

Auto-TLDR; node voting based graph classification with convolutional operator

Slides Poster Similar

In recent years, many researchers have started to construct Graph Neural Networks (GNNs) to deal with graph classification task. Those GNNs can fit into a framework named Message Passing Neural Networks (MPNNs), which consists of two phases: a Message Passing phase used for updating node embeddings and a Readout phase. In Readout phase, node embeddings are aggregated to extract graph feature used for classification. However, the above operation may obscure the affect of the node embedding of each node on graph classification. Therefore, a node voting based graph classification model is proposed in this paper, called Node Voting net (NVnet). Similar to the MPNNs, NVnet also contains the Message Passing phase. The main differences between NVnet and MPNNs are: 1, a decoder for graph reconstruction is added to NVnet to make node embeddings contain as much graph structure information as possible; 2, NVnet replaces the Readout phase with a new phase called Node Voting phase. In the Node Voting phase, an attention layer based on the gate mechanism is constructed to help each node observe the node embeddings of other nodes in the graph, and each node predicts the graph class from its own perspective. The above process is called node voting. After voting, the results of all nodes are aggregated to get the final graph classification result. In addition, considering that aggregation operation may also obscure the difference between node voting results, our solution is to add a regularization term to drive node voting results to reach group consensus. We evaluate the performance of the NVnet on 4 benchmark datasets. The experimental results show that compared with other 10 baselines, NVnet can achieve higher graph classification accuracy on datasets by using appropriate convolutional operator.

JT-MGCN: Joint-Temporal Motion Graph Convolutional Network for Skeleton-Based Action Recognition

Suekyeong Nam, Seungkyu Lee

Responsive image

Auto-TLDR; Joint-temporal Motion Graph Convolutional Networks for Action Recognition

Slides Similar

Recently, action recognition methods using graph convolutional networks (GCN) have shown remarkable performance thanks to its concise but effective representation of human body motion. Prior methods construct human body motion graph building edges between neighbor or distant body joints. On the other hand, human action contains lots of temporal variations showing strong temporal correlations between joint motions. Thus the characterization of an action requires a comprehensive analysis of joint motion correlations on spatial and temporal domains. In this paper, we propose Joint-temporal Motion Graph Convolutional Networks (JT-MGCN) in which joint-temporal edges learn the correlations between different joints at different time. Experimental evaluation on large public data sets such as NTU rgb+d data set and kinetics-skeleton data set show outstanding action recognition performance.

Video Semantic Segmentation Using Deep Multi-View Representation Learning

Akrem Sellami, Salvatore Tabbone

Responsive image

Auto-TLDR; Deep Multi-view Representation Learning for Video Object Segmentation

Slides Poster Similar

In this paper, we propose a deep learning model based on deep multi-view representation learning, to address the video object segmentation task. The proposed model emphasizes the importance of the inherent correlation between video frames and incorporates a multi-view representation learning based on deep canonically correlated autoencoders. The multi-view representation learning in our model provides an efficient mechanism for capturing inherent correlations by jointly extracting useful features and learning better representation into a joint feature space, i.e., shared representation. To increase the training data and the learning capacity, we train the proposed model with pairs of video frames, i.e., $F_{a}$ and $F_{b}$. During the segmentation phase, the deep canonically correlated autoencoders model encodes useful features by processing multiple reference frames together, which is used to detect the frequently reappearing. Our model enhances the state-of-the-art deep learning-based methods that mainly focus on learning discriminative foreground representations over appearance and motion. Experimental results over two large benchmarks demonstrate the ability of the proposed method to outperform competitive approaches and to reach good performances, in terms of semantic segmentation.

A Grid-Based Representation for Human Action Recognition

Soufiane Lamghari, Guillaume-Alexandre Bilodeau, Nicolas Saunier

Responsive image

Auto-TLDR; GRAR: Grid-based Representation for Action Recognition in Videos

Slides Poster Similar

Human action recognition (HAR) in videos is a fundamental research topic in computer vision. It consists mainly in understanding actions performed by humans based on a sequence of visual observations. In recent years, HAR have witnessed significant progress, especially with the emergence of deep learning models. However, most of existing approaches for action recognition rely on information that is not always relevant for the task, and are limited in the way they fuse temporal information. In this paper, we propose a novel method for human action recognition that encodes efficiently the most discriminative appearance information of an action with explicit attention on representative pose features, into a new compact grid representation. Our GRAR (Grid-based Representation for Action Recognition) method is tested on several benchmark datasets that demonstrate that our model can accurately recognize human actions, despite intra-class appearance variations and occlusion challenges.

Hcore-Init: Neural Network Initialization Based on Graph Degeneracy

Stratis Limnios, George Dasoulas, Dimitrios Thilikos, Michalis Vazirgiannis

Responsive image

Auto-TLDR; K-hypercore: Graph Mining for Deep Neural Networks

Slides Poster Similar

Neural networks are the pinnacle of Artificial Intelligence, as in recent years we witnessed many novel architectures, learning and optimization techniques for deep learning. Capitalizing on the fact that neural networks inherently constitute multipartite graphs among neuron layers, we aim to analyze directly their structure to extract meaningful information that can improve the learning process. To our knowledge graph mining techniques for enhancing learning in neural networks have not been thoroughly investigated. In this paper we propose an adapted version of the k-core structure for the complete weighted multipartite graph extracted from a deep learning architecture. As a multipartite graph is a combination of bipartite graphs, that are in turn the incidence graphs of hypergraphs, we design k-hypercore decomposition, the hypergraph analogue of k-core degeneracy. We applied k-hypercore to several neural network architectures, more specifically to convolutional neural networks and multilayer perceptrons for image recognition tasks after a very short pretraining. Then we used the information provided by the hypercore numbers of the neurons to re-initialize the weights of the neural network, thus biasing the gradient optimization scheme. Extensive experiments proved that k-hypercore outperforms the state-of-the-art initialization methods.

Directional Graph Networks with Hard Weight Assignments

Miguel Dominguez, Raymond Ptucha

Responsive image

Auto-TLDR; Hard Directional Graph Networks for Point Cloud Analysis

Slides Poster Similar

Point cloud analysis is an important field for 3D scene understanding. It has applications in self driving cars and robotics (via LIDAR sensors), 3D graphics, and computer-aided design. Neural networks have recently achieved strong results on point cloud analysis problems such as classification and segmentation. Each point cloud network has the challenge of defining a convolution that can learn useful features on unstructured points. Some recent point cloud convolutions create separate weight matrices for separate directions like a CNN, but apply every weight matrix to every neighbor with soft assignments. This increases computational complexity and makes relatively small neighborhood aggregations expensive to compute. We propose Hard Directional Graph Networks (HDGN), a point cloud model that both learns directional weight matrices and assigns a single matrix to each neighbor, achieving directional convolutions at lower computational cost. HDGN's directional modeling achieves state-of-the-art results on multiple point cloud vision benchmarks.

Sketch-Based Community Detection Via Representative Node Sampling

Mahlagha Sedghi, Andre Beckus, George Atia

Responsive image

Auto-TLDR; Sketch-based Clustering of Community Detection Using a Small Sketch

Slides Poster Similar

This paper proposes a sketch-based approach to the community detection problem which clusters the full graph through the use of an informative and concise sketch. The reduced sketch is built through an effective sampling approach which selects few nodes that best represent the complete graph and operates on a pairwise node similarity measure based on the average commute time. After sampling, the proposed algorithm clusters the nodes in the sketch, and then infers the cluster membership of the remaining nodes in the full graph based on their aggregate similarity to nodes in the partitioned sketch. By sampling nodes with strong representation power, our approach can improve the success rates over full graph clustering. In challenging cases with large node degree variation, our approach not only maintains competitive accuracy with full graph clustering despite using a small sketch, but also outperforms existing sampling methods. The use of a small sketch allows considerable storage savings, and computational and timing improvements for further analysis such as clustering and visualization. We provide numerical results on synthetic data based on the homogeneous, heterogeneous and degree corrected versions of the stochastic block model, as well as experimental results on real-world data.

Privacy Attributes-Aware Message Passing Neural Network for Visual Privacy Attributes Classification

Hanbin Hong, Wentao Bao, Yuan Hong, Yu Kong

Responsive image

Auto-TLDR; Privacy Attributes-Aware Message Passing Neural Network for Visual Privacy Attribute Classification

Slides Poster Similar

Visual Privacy Attribute Classification (VPAC) identifies privacy information leakage via social media images. These images containing privacy attributes such as skin color, face or gender are classified into multiple privacy attribute categories in VPAC. With limited works in this task, current methods often extract features from images and simply classify the extracted feature into multiple privacy attribute classes. The dependencies between privacy attributes, e.g., skin color and face typically co-exist in the same image, are usually ignored in classification, which causes performance degradation in VPAC. In this paper, we propose a novel end-to-end Privacy Attributes-aware Message Passing Neural Network (PA-MPNN) to address VPAC. Privacy attributes are considered as nodes on a graph and an MPNN is introduced to model the privacy attribute dependencies. To generate representative features for privacy attribute nodes, a class-wise encoder-decoder is proposed to learn a latent space for each attribute. An attention mechanism with multiple correlation matrices is also introduced in MPNN to learn the privacy attributes graph automatically. Experimental results on the Privacy Attribute Dataset demonstrate that our framework achieves better performance than state-of-the-art methods on visual privacy attributes classification.

What and How? Jointly Forecasting Human Action and Pose

Yanjun Zhu, Yanxia Zhang, Qiong Liu, Andreas Girgensohn

Responsive image

Auto-TLDR; Forecasting Human Actions and Motion Trajectories with Joint Action Classification and Pose Regression

Slides Poster Similar

Forecasting human actions and motion trajectories addresses the problem of predicting what a person is going to do next and how they will perform it. This is crucial in a wide range of applications such as assisted living and future co-robotic settings. We propose to simultaneously learn actions and action-related human motion dynamics, while existing works perform them independently. In this paper, we present a method to jointly forecast categories of human action and the pose of skeletal joints in the hope that the two tasks can help each other. As a result, our system can predict not only the future actions but also the motion trajectories that will result. To achieve this, we define a task of joint action classification and pose regression. We employ a sequence to sequence encoder-decoder model combined with multi-task learning to forecast future actions and poses progressively before the action happens. Experimental results on two public datasets, IkeaDB and OAD, demonstrate the effectiveness of the proposed method.

PICK: Processing Key Information Extraction from Documents Using Improved Graph Learning-Convolutional Networks

Wenwen Yu, Ning Lu, Xianbiao Qi, Ping Gong, Rong Xiao

Responsive image

Auto-TLDR; PICK: A Graph Learning Framework for Key Information Extraction from Documents

Slides Poster Similar

Computer vision with state-of-the-art deep learning models have achieved huge success in the field of Optical Character Recognition (OCR) including text detection and recognition tasks recently. However, Key Information Extraction (KIE) from documents as the downstream task of OCR, having a large number of use scenarios in real-world, remains a challenge because documents not only have textual features extracting from OCR systems but also have semantic visual features that are not fully exploited and play a critical role in KIE. Too little work has been devoted to efficiently make full use of both textual and visual features of the documents. In this paper, we introduce PICK, a framework that is effective and robust in handling complex documents layout for KIE by combining graph learning with graph convolution operation, yielding a richer semantic representation containing the textual and visual features and global layout without ambiguity. Extensive experiments on real-world datasets have been conducted to show that our method outperforms baselines methods by significant margins.

Visual Oriented Encoder: Integrating Multimodal and Multi-Scale Contexts for Video Captioning

Bang Yang, Yuexian Zou

Responsive image

Auto-TLDR; Visual Oriented Encoder for Video Captioning

Slides Poster Similar

Video captioning is a challenging task which aims at automatically generating a natural language description of a given video. Recent researches have shown that exploiting the intrinsic multi-modalities of videos significantly promotes captioning performance. However, how to integrate multi-modalities to generate effective semantic representations for video captioning is still an open issue. Some researchers proposed to learn multimodal features in parallel during the encoding stage. The downside of these methods lies in the neglect of the interaction among multi-modalities and their rich contextual information. In this study, inspired by the fact that visual contents are generally more important for comprehending videos, we propose a novel Visual Oriented Encoder (VOE) to integrate multimodal features in an interactive manner. Specifically, VOE is designed as a hierarchical structure, where bottom layers are utilized to extract multi-scale contexts from auxiliary modalities while the top layer is exploited to generate joint representations by considering both visual and contextual information. Following the encoder-decoder framework, we systematically develop a VOE-LSTM model and evaluate it on two mainstream benchmarks: MSVD and MSR-VTT. Experimental results show that the proposed VOE surpasses conventional encoders and our VOE-LSTM model achieves competitive results compared with state-of-the-art approaches.

SDMA: Saliency Driven Mutual Cross Attention for Multi-Variate Time Series

Yash Garg, K. Selcuk Candan

Responsive image

Auto-TLDR; Salient-Driven Mutual Cross Attention for Intelligent Time Series Analytics

Slides Poster Similar

Integration of rich sensory technologies into critical applications, such as gesture recognition and building energy optimization, has highlighted the importance of intelligent time series analytics. To accommodate this demand, uni-variate approaches have been extended for multi-variate scenarios, but naive extensions have lead to deterioration in model performances due to their limited ability to capture the information recorded in different variates and complex multi-variate time series patterns’ evolution over time. Furthermore, real-world time series are often contaminated with noisy information. In this paper, we note that a time series often carry robust localized temporal events that could help improve model performance by highlighting the relevant information; however, the lack of sufficient data to train for these events make it impossible for neural architectures to identify and make use of these temporal events. We, therefore, argue that a companion process helping identify salient events in the input time series and driving model’s attention to the associated salient sub-sequences can help with learning a high-performing network. Relying on this observation, we propose a novel Saliency-Driven Mutual Cross Attention (SDMA) framework that extracts localized temporal events and generate a saliency series to complement the input time series. We further propose an architecture which accounts for the mutual cross-talk between the input and saliency series branches where input and saliency series attend each other. Experiments show that the proposed mutually-cross attention framework can offer significant boosts in model performance when compared against non-attentioned, conventionally attentioned, and conventionally cross-attentioned models.

MANet: Multimodal Attention Network Based Point-View Fusion for 3D Shape Recognition

Yaxin Zhao, Jichao Jiao, Ning Li

Responsive image

Auto-TLDR; Fusion Network for 3D Shape Recognition based on Multimodal Attention Mechanism

Slides Poster Similar

3D shape recognition has attracted more and more attention as a task of 3D vision research. The proliferation of 3D data encourages various deep learning methods based on 3D data. Now there have been many deep learning models based on point-cloud data or multi-view data alone. However, in the era of big data, integrating data of two different modals to obtain a unified 3D shape descriptor is bound to improve the recognition accuracy. Therefore, this paper proposes a fusion network based on multimodal attention mechanism for 3D shape recognition. Considering the limitations of multi-view data, we introduce a soft attention scheme, which can use the global point-cloud features to filter the multi-view features, and then realize the effective fusion of the two features. More specifically, we obtain the enhanced multi-view features by mining the contribution of each multi-view image to the overall shape recognition, and then fuse the point-cloud features and the enhanced multi-view features to obtain a more discriminative 3D shape descriptor. We have performed relevant experiments on the ModelNet40 dataset, and experimental results verify the effectiveness of our method.

TSMSAN: A Three-Stream Multi-Scale Attentive Network for Video Saliency Detection

Jingwen Yang, Guanwen Zhang, Wei Zhou

Responsive image

Auto-TLDR; Three-stream Multi-scale attentive network for video saliency detection in dynamic scenes

Slides Poster Similar

Video saliency detection is an important low-level task that has been used in a large range of high-level applications. In this paper, we proposed a three-stream multi-scale attentive network (TSMSAN) for saliency detection in dynamic scenes. TSMSAN integrates motion vector representation, static saliency map, and RGB information in multi-scales together into one framework on the basis of Fully Convolutional Network (FCN) and spatial attention mechanism. On the one hand, the respective motion features, spatial features, as well as the scene features can provide abundant information for video saliency detection. On the other hand, spatial attention mechanism can combine features with multi-scales to focus on key information in dynamic scenes. In this manner, the proposed TSMSAN can encode the spatiotemporal features of the dynamic scene comprehensively. We evaluate the proposed approach on two public dynamic saliency data sets. The experimental results demonstrate TSMSAN is able to achieve the state-of-the-art performance as well as the excellent generalization ability. Furthermore, the proposed TSMSAN can provide more convincing video saliency information, in line with human perception.

Temporal Extension Module for Skeleton-Based Action Recognition

Yuya Obinata, Takuma Yamamoto

Responsive image

Auto-TLDR; Extended Temporal Graph for Action Recognition with Kinetics-Skeleton

Slides Poster Similar

We present a module that extends the temporal graph of a graph convolutional network (GCN) for action recognition with a sequence of skeletons. Existing methods attempt to represent a more appropriate spatial graph on an intra-frame, but disregard optimization of the temporal graph on the inter-frame. Concretely, these methods connect between vertices corresponding only to the same joint on the inter-frame. In this work, we focus on adding connections to neighboring multiple vertices on the inter-frame and extracting additional features based on the extended temporal graph. Our module is a simple yet effective method to extract correlated features of multiple joints in human movement. Moreover, our module aids in further performance improvements, along with other GCN methods that optimize only the spatial graph. We conduct extensive experiments on two large datasets, NTU RGB+D and Kinetics-Skeleton, and demonstrate that our module is effective for several existing models and our final model achieves state-of-the-art performance.

DAG-Net: Double Attentive Graph Neural Network for Trajectory Forecasting

Alessio Monti, Alessia Bertugli, Simone Calderara, Rita Cucchiara

Responsive image

Auto-TLDR; Recurrent Generative Model for Multi-modal Human Motion Behaviour in Urban Environments

Slides Poster Similar

Understanding human motion behaviour is a critical task for several possible applications like self-driving cars or social robots, and in general for all those settings where an autonomous agent has to navigate inside a human-centric environment. This is non-trivial because human motion is inherently multi-modal: given a history of human motion paths, there are many plausible ways by which people could move in the future. Additionally, people activities are often driven by goals, e.g. reaching particular locations or interacting with the environment. We address both the aforementioned aspects by proposing a new recurrent generative model that considers both single agents’ future goals and interactions between different agents. The model exploits a double attention-based graph neural network to collect information about the mutual influences among different agents and integrates it with data about agents’ possible future objectives. Our proposal is general enough to be applied in different scenarios: the model achieves state-of-the-art results in both urban environments and also in sports applications.