RNN Training along Locally Optimal Trajectories via Frank-Wolfe Algorithm

Yun Yue, Ming Li, Venkatesh Saligrama, Ziming Zhang

Auto-TLDR; Frank-Wolfe Algorithm for Efficient Training of RNNs

We propose a novel and efficient training method for RNNs that iteratively seeks a local minimum on the loss surface within a small region and leverages the resulting directional vector for the update in an outer loop. We propose to utilize the Frank-Wolfe (FW) algorithm in this context. Although FW implicitly involves normalized gradients, which can lead to a slow convergence rate, we develop a novel RNN training method whose overall training cost, surprisingly and despite the additional cost, is empirically observed to be lower than that of back-propagation. Our method leads to a new Frank-Wolfe method that is in essence an SGD algorithm with a restart scheme. We prove that under certain conditions our algorithm has a sublinear convergence rate of $O(1/\epsilon)$ for $\epsilon$ error. We then conduct empirical experiments on several benchmark datasets, including those that exhibit long-term dependencies, and show significant performance improvement. We also experiment with deep RNN architectures and show efficient training performance. Finally, we demonstrate that our training method is robust to noisy data.
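
To make the "normalized gradients" remark concrete, here is a minimal sketch of the classic Frank-Wolfe iteration on a small L2 ball. It is not the authors' training method: the toy quadratic loss, the function names (`frank_wolfe_l2_ball`, `toy_loss`, `toy_grad`) and the open-loop step size 2/(k+2) are all assumptions chosen for illustration.

```python
import math

def frank_wolfe_l2_ball(grad, center, radius, steps=200):
    """Classic Frank-Wolfe on the ball {x : ||x - center||_2 <= radius}."""
    x = list(center)  # the center is feasible, so start there
    for k in range(steps):
        g = grad(x)
        norm = math.sqrt(sum(gi * gi for gi in g))
        if norm == 0.0:
            break  # stationary point: nothing left to do
        # Linear minimization oracle: the point of the ball minimizing <g, s>.
        # Only g / ||g|| enters, i.e. a normalized gradient.
        s = [ci - radius * gi / norm for ci, gi in zip(center, g)]
        gamma = 2.0 / (k + 2.0)  # standard open-loop step size (assumption)
        x = [(1.0 - gamma) * xi + gamma * si for xi, si in zip(x, s)]
    return x

# Hypothetical toy problem: squared distance to a target outside the ball.
def toy_loss(x, target=(3.0, 4.0)):
    return sum((xi - ti) ** 2 for xi, ti in zip(x, target))

def toy_grad(x, target=(3.0, 4.0)):
    return [2.0 * (xi - ti) for xi, ti in zip(x, target)]

x_fw = frank_wolfe_l2_ball(toy_grad, center=[0.0, 0.0], radius=1.0)
```

Every iterate is a convex combination of points in the ball, so it stays feasible without any projection step. An outer loop in the spirit of the abstract would re-center the ball at `x_fw` and repeat, but that part is not sketched here.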

Similar papers

Learning Stable Deep Predictive Coding Networks with Weight Norm Supervision

Guo Ruohao

Auto-TLDR; Stability of Predictive Coding Network with Weight Norm Supervision

Predictive Coding Network (PCN) is an important neural network inspired by visual processing models in neuroscience. It combines the feedforward and feedback processing and has the architecture of recurrent neural networks (RNNs). This type of network is usually trained with backpropagation through time (BPTT). With infinite recurrent steps, PCN is a dynamic system. However, as one of the most important properties, stability is rarely studied in this type of network. Inspired by reservoir computing, we investigate the stability of hierarchical RNNs from the perspective of dynamic systems, and propose a sufficient condition for their echo state property (ESP). Our study shows the global stability is determined by stability of the local layers and the feedback between neighboring layers. Based on it, we further propose Weight Norm Supervision, a new algorithm that controls the stability of PCN dynamics by imposing different weight norm constraints on different parts of the network. We compare our approach with other training methods in terms of stability and prediction capability. The experiments show that our algorithm learns stable PCNs with a reliable prediction precision in the most effective and controllable way.

Low-Cost Lipschitz-Independent Adaptive Importance Sampling of Stochastic Gradients

Huikang Liu, Xiaolu Wang, Jiajin Li, Man-Cho Anthony So

Auto-TLDR; Adaptive Importance Sampling for Stochastic Gradient Descent

Stochastic gradient descent (SGD) usually samples training data based on the uniform distribution, which may not be a good choice because of the high variance of its stochastic gradient. Thus, importance sampling methods are considered in the literature to improve the performance. Most previous work on SGD-based methods with importance sampling requires the knowledge of Lipschitz constants of all component gradients, which are in general difficult to estimate. In this paper, we study an adaptive importance sampling method for common SGD-based methods by exploiting the local first-order information without knowing any Lipschitz constants. In particular, we periodically change the sampling distribution by only utilizing the gradient norms in the past few iterations. We prove that our adaptive importance sampling non-asymptotically reduces the variance of the stochastic gradients in SGD, and thus better convergence bounds than those for vanilla SGD can be obtained. We extend this sampling method to several other widely used stochastic gradient algorithms including SGD with momentum and ADAM. Experiments on common convex learning problems and deep neural networks illustrate notably enhanced performance using the adaptive sampling strategy.
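
The idea of sampling by gradient norm can be illustrated with a small sketch. It is a generic importance-sampling example, not the paper's update rule: the scalar "gradients", the `smoothing` floor and all function names are assumptions. Sampling example i with probability p_i and reweighting its gradient by 1/(n p_i) keeps the estimate unbiased, and choosing p_i roughly proportional to the gradient norm lowers its second moment.

```python
def sampling_distribution(recent_grad_norms, smoothing=1e-3):
    """Probabilities proportional to recent gradient norms (plus a small floor)."""
    weights = [g + smoothing for g in recent_grad_norms]
    total = sum(weights)
    return [w / total for w in weights]

def expected_estimate(grads, probs):
    """Exact expectation of the reweighted estimator g_i / (n * p_i)."""
    n = len(grads)
    return sum(p * (g / (n * p)) for g, p in zip(grads, probs))

def second_moment(grads, probs):
    """Exact second moment of the reweighted estimator."""
    n = len(grads)
    return sum(p * (g / (n * p)) ** 2 for g, p in zip(grads, probs))

# Hypothetical per-example scalar gradients; one example dominates.
grads = [0.1, 0.2, 4.0, 0.3]
norms = [abs(g) for g in grads]
uniform = [1.0 / len(grads)] * len(grads)
adaptive = sampling_distribution(norms)
```

Both distributions give the same expected gradient, so the variance comparison reduces to comparing `second_moment(grads, adaptive)` with `second_moment(grads, uniform)`.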

Learning Sparse Deep Neural Networks Using Efficient Structured Projections on Convex Constraints for Green AI

Michel Barlaud, Frederic Guyard

Auto-TLDR; Constrained Deep Neural Network with Constrained Splitting Projection

In recent years, deep neural networks (DNNs) have been applied to different domains and achieved dramatic performance improvements over state-of-the-art classical methods. These performances were, however, often obtained with networks containing millions of parameters whose training required heavy computational power. To cope with this computational issue, a huge literature deals with proximal regularization methods, which are time consuming. In this paper, we propose instead a constrained approach. We provide the general framework for our new splitting projection gradient method. Our splitting algorithm iterates a gradient step and a projection on convex sets. We study algorithms for different constraints: the classical $\ell_1$ unstructured constraint and structured constraints such as the nuclear norm and the $\ell_{2,1}$ constraint (Group LASSO). We propose a new $\ell_{1,1}$ structured constraint for which we provide a new projection algorithm. We demonstrate the effectiveness of our method on three popular datasets (MNIST, Fashion MNIST and CIFAR). Experiments on these datasets show that our splitting projection method with our new $\ell_{1,1}$ structured constraint provides the best reduction of memory and computational power. Experiments show that fully connected linear DNNs are more efficient for green AI.

An Efficient Empirical Solver for Localized Multiple Kernel Learning Via DNNs

Ziming Zhang

Auto-TLDR; Localized Multiple Kernel Learning using LMKL-Net

In this paper we propose solving localized multiple kernel learning (LMKL) using LMKL-Net, a feedforward deep neural network (DNN). In contrast to previous works, as a learning principle we propose parameterizing the gating function for learning kernel combination weights and the multiclass classifier using an attentional network (AN) and a multilayer perceptron (MLP), respectively. Such interpretability helps us better understand how the network solves the problem. Thanks to stochastic gradient descent (SGD), our approach has linear computational complexity in training. Empirically on benchmark datasets we demonstrate that with comparable or better accuracy than the state-of-the-art, our LMKL-Net can be trained about two orders of magnitude faster with about two orders of magnitude smaller memory footprint for large-scale learning.

Stochastic Runge-Kutta Methods and Adaptive SGD-G2 Stochastic Gradient Descent

Gabriel Turinici, Imen Ayadi

Auto-TLDR; Adaptive Stochastic Runge Kutta for the Minimization of the Loss Function

The minimization of the loss function is of paramount importance in deep neural networks. Many popular optimization algorithms have been shown to correspond to some evolution equation of gradient flow type. Inspired by the numerical schemes used for general evolution equations, we introduce a second-order stochastic Runge-Kutta method and show that it yields a consistent procedure for the minimization of the loss function. In addition, it can be coupled, in an adaptive framework, with Stochastic Gradient Descent (SGD) to automatically adjust the learning rate of the SGD. The resulting adaptive SGD, called SGD-G2, shows good results in terms of convergence speed when tested on standard datasets.
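
The link between gradient descent and gradient flow can be seen in a small deterministic sketch. Plain gradient descent is the explicit Euler scheme for x' = -grad f(x), and Heun's method is one standard second-order Runge-Kutta scheme for the same equation. This is not the SGD-G2 algorithm: the stochastic part and the adaptive learning rate are omitted, and the function names and toy loss are assumptions.

```python
import math

def euler_step(x, grad, h):
    """One gradient-descent step, i.e. explicit Euler on x' = -grad(x)."""
    return [xi - h * gi for xi, gi in zip(x, grad(x))]

def heun_step(x, grad, h):
    """One Heun (second-order Runge-Kutta) step on x' = -grad(x)."""
    g1 = grad(x)
    x_pred = [xi - h * gi for xi, gi in zip(x, g1)]  # Euler predictor
    g2 = grad(x_pred)
    # Corrector: average the slopes at the start and at the predicted point.
    return [xi - 0.5 * h * (a + b) for xi, a, b in zip(x, g1, g2)]

# Hypothetical toy loss f(x) = x^2 / 2, whose gradient flow is x(t) = x(0) * exp(-t).
def toy_grad(x):
    return list(x)

h = 0.1
x_euler = euler_step([1.0], toy_grad, h)
x_heun = heun_step([1.0], toy_grad, h)
x_exact = math.exp(-h)
```

On this toy problem the Heun step lands closer to the exact flow than the Euler step, at the price of a second gradient evaluation per step.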

Improving Batch Normalization with Skewness Reduction for Deep Neural Networks

Pak Lun Kevin Ding, Martin Sarah, Baoxin Li

Auto-TLDR; Batch Normalization with Skewness Reduction

Batch Normalization (BN) is a well-known technique used in training deep neural networks. The main idea behind batch normalization is to normalize the features of the layers (i.e., transforming them to have a mean equal to zero and a variance equal to one). Such a procedure encourages the optimization landscape of the loss function to be smoother, and improves the learning of the networks in both speed and performance. In this paper, we demonstrate that the performance of the network can be improved if the distributions of the features of the output in the same layer are similar. As normalizing based on mean and variance does not necessarily make the features have the same distribution, we propose a new normalization scheme: Batch Normalization with Skewness Reduction (BNSR). Compared with other normalization approaches, BNSR transforms not only the mean and variance, but also the skewness of the data. By tackling this property of a distribution, we are able to make the output distributions of the layers more similar. The nonlinearity of BNSR may further improve the expressiveness of the underlying network. Comparisons with other normalization schemes are tested on the CIFAR-100 and ImageNet datasets. Experimental results show that the proposed approach can outperform other state-of-the-art methods that are not equipped with BNSR.
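
The motivation can be checked numerically: standardizing a feature to zero mean and unit variance is an affine map with positive scale, so it leaves the skewness untouched. The sketch below only demonstrates that fact; it does not implement BNSR, and the sample values and function names are assumptions.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def standardize(xs):
    """Shift to zero mean and scale to unit (population) variance."""
    m = mean(xs)
    std = math.sqrt(mean([(x - m) ** 2 for x in xs]))
    return [(x - m) / std for x in xs]

def skewness(xs):
    """Third standardized moment of the sample."""
    m = mean(xs)
    var = mean([(x - m) ** 2 for x in xs])
    third = mean([(x - m) ** 3 for x in xs])
    return third / var ** 1.5

# Hypothetical activations of one feature across a batch, with a long right tail.
features = [0.1, 0.2, 0.2, 0.3, 0.4, 2.5]
normalized = standardize(features)
```

After standardization the mean is 0 and the variance is 1, yet `skewness(normalized)` equals `skewness(features)`, which is why a further, nonlinear transformation is needed to change it.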

Norm Loss: An Efficient yet Effective Regularization Method for Deep Neural Networks

Theodoros Georgiou, Sebastian Schmitt, Thomas Baeck, Wei Chen, Michael Lew

Auto-TLDR; Weight Soft-Regularization with Oblique Manifold for Convolutional Neural Network Training

Convolutional neural network training can suffer from diverse issues like exploding or vanishing gradients, scaling-based weight space symmetry and covariate shift. In order to address these issues, researchers develop weight regularization methods and activation normalization methods. In this work we propose a weight soft-regularization method based on the Oblique manifold. The proposed method uses a loss function which pushes each weight vector to have a norm close to one, i.e. the weight matrix is smoothly steered toward the so-called Oblique manifold. We evaluate our method on the very popular CIFAR-10, CIFAR-100 and ImageNet 2012 datasets using two state-of-the-art architectures, namely the ResNet and wide-ResNet. Our method introduces negligible computational overhead and the results show that it is competitive with the state-of-the-art and in some cases superior to it. Additionally, the results are less sensitive to hyperparameter settings such as batch size and regularization factor.
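
A penalty that "pushes each weight vector to have a norm close to one" can be sketched as follows. The squared form (1 - ||w||)^2 summed over weight vectors is an assumption made for illustration; the abstract does not give the exact formula, and the function name is hypothetical.

```python
import math

def norm_loss(weight_rows):
    """Sum over weight vectors of (1 - ||w||_2)^2; zero when every row has unit norm."""
    total = 0.0
    for row in weight_rows:
        norm = math.sqrt(sum(w * w for w in row))
        total += (1.0 - norm) ** 2
    return total

# Hypothetical weight matrices: rows play the role of per-filter weight vectors.
on_manifold = [[0.6, 0.8], [1.0, 0.0]]
off_manifold = [[3.0, 4.0]]
```

In training, such a term would be scaled by a regularization factor and added to the task loss, so it acts as a soft constraint rather than a hard projection.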

Revisiting the Training of Very Deep Neural Networks without Skip Connections

Oyebade Kayode Oyedotun, Abd El Rahman Shabayek, Djamila Aouada, Bjorn Ottersten

Auto-TLDR; Optimization of Very Deep PlainNets without shortcut connections with 'vanishing and exploding units' activations'

Deep neural networks (DNNs) with many layers of feature representations yield state-of-the-art results on several difficult learning tasks. However, optimizing very deep DNNs without shortcut connections, known as PlainNets, is a notoriously hard problem. Considering the growing interest in this area, this paper investigates holistically two scenarios that plague the training of very deep PlainNets: (1) the relatively popular challenge of 'vanishing and exploding units' activations', and (2) the less investigated 'singularity' problem, which is studied in detail in this paper. In contrast to earlier works that study only the saturation and explosion of units' activations in isolation, this paper harmonizes the inconspicuous coexistence of the aforementioned problems for very deep PlainNets. Particularly, we argue that the aforementioned problems would have to be tackled simultaneously for the successful training of very deep PlainNets. Finally, different techniques that can be employed for tackling the optimization problem are discussed, and a specific combination of simple techniques that allows the successful training of PlainNets having up to 100 layers is demonstrated.

Regularized Flexible Activation Function Combinations for Deep Neural Networks

Renlong Jie, Junbin Gao, Andrey Vasnev, Minh-Ngoc Tran

Auto-TLDR; Flexible Activation in Deep Neural Networks using ReLU and ELUs

Activation in deep neural networks is fundamental to achieving non-linear mappings. Traditional studies mainly focus on finding fixed activations for a particular set of learning tasks or model architectures. The research on flexible activation is quite limited in both design philosophy and application scenarios. In this study, three principles of choosing flexible activation components are proposed and a general combined form of flexible activation functions is implemented. Based on this, a novel family of flexible activation functions that can replace sigmoid or tanh in LSTM cells is implemented, as well as a new family combining ReLU and ELUs. Also, two new regularisation terms based on assumptions as prior knowledge are introduced. It has been shown that LSTM models with the proposed flexible activations P-Sig-Ramp provide significant improvements in time series forecasting, while the proposed P-E2-ReLU achieves better and more stable performance on lossy image compression tasks with convolutional auto-encoders. In addition, the proposed regularization terms improve the convergence, performance and stability of the models with flexible activation functions. The code for this paper is available at https://github.com/9NXJRDDRQK/Flexible Activation.

Exploiting Non-Linear Redundancy for Neural Model Compression

Muhammad Ahmed Shah, Raphael Olivier, Bhiksha Raj

Auto-TLDR; Compressing Deep Neural Networks with Linear Dependency

Deploying deep learning models with millions, even billions, of parameters is challenging given real world memory, power and compute constraints. In an effort to make these models more practical, in this paper, we propose a novel model compression approach that exploits linear dependence between the activations in a layer to eliminate entire structural units (neurons/convolutional filters). Our approach also adjusts the weights of the layer in a manner that is provably lossless while training if the removed neuron was perfectly predictable. We combine this approach with an annealing algorithm that may be applied during training, or even on a trained model, and demonstrate, using popular datasets, that our technique can reduce the parameters of VGG and AlexNet by more than 97% on CIFAR, 85% on Caltech, and 19% on ImageNet at less than 2% loss in accuracy. Furthermore, we provide theoretical results showing that in overparametrized, locally linear (ReLU) neural networks where redundant features exist, and with correct hyperparameter selection, our method is indeed able to capture and suppress those dependencies.

Learning Sign-Constrained Support Vector Machines

Kenya Tajima, Kouhei Tsuchida, Esmeraldo Ronnie Rey Zara, Naoya Ohta, Tsuyoshi Kato

Auto-TLDR; Constrained Sign Constraints for Learning Linear Support Vector Machine

Domain knowledge is useful to improve the generalization performance of learning machines. Sign constraints are a handy representation to combine domain knowledge with learning machines. In this paper, we consider constraining the signs of the weight coefficients in learning the linear support vector machine, and develop two optimization algorithms for minimizing the empirical risk under the sign constraints. The first algorithm is based on the projected gradient method, in which each iteration takes O(nd) computational cost and the sublinear convergence of the objective error is guaranteed. The second algorithm is based on the Frank-Wolfe method, which also converges sublinearly and possesses a clear termination criterion. We show that each iteration of the Frank-Wolfe method also requires O(nd) cost. Furthermore, we derive the explicit expression for the minimal iteration number to ensure an epsilon-accurate solution by analyzing the curvature of the objective function. Finally, we empirically demonstrate that the sign constraints are a promising technique when similarities to the training examples compose the feature vector.
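
For sign constraints, the Euclidean projection used by a projected gradient method is a coordinate-wise clip, which is what keeps the per-iteration cost low. The sketch below shows one such step in isolation. The sign encoding (+1 for non-negative, -1 for non-positive, 0 for unconstrained), the function names and the example numbers are assumptions, and the SVM gradient itself is not computed here.

```python
def project_signs(w, signs):
    """Euclidean projection onto {w : w_i >= 0 if s_i > 0, w_i <= 0 if s_i < 0}."""
    projected = []
    for wi, si in zip(w, signs):
        if si > 0:
            projected.append(max(wi, 0.0))   # must be non-negative
        elif si < 0:
            projected.append(min(wi, 0.0))   # must be non-positive
        else:
            projected.append(wi)             # unconstrained coordinate
    return projected

def projected_gradient_step(w, grad, signs, lr):
    """Take a gradient step, then project back onto the sign constraints."""
    stepped = [wi - lr * gi for wi, gi in zip(w, grad)]
    return project_signs(stepped, signs)

# Hypothetical weights, gradient and constraints for a three-feature model.
w_next = projected_gradient_step(
    w=[0.5, -0.25, 0.5], grad=[1.0, -1.0, 1.0], signs=[1, -1, 0], lr=1.0
)
```

Here the first two coordinates would cross their sign boundary and are clipped to zero, while the unconstrained third coordinate moves freely.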

Meta Learning Via Learned Loss

Sarah Bechtle, Artem Molchanov, Yevgen Chebotar, Edward Thomas Grefenstette, Ludovic Righetti, Gaurav Sukhatme, Franziska Meier

Auto-TLDR; meta-learning for learning parametric loss functions that generalize across different tasks and model architectures

Typically, loss functions, regularization mechanisms and other important aspects of training parametric models are chosen heuristically from a limited set of options. In this paper, we take the first step towards automating this process, with the view of producing models which train faster and more robustly. Concretely, we present a meta-learning method for learning parametric loss functions that can generalize across different tasks and model architectures. We develop a pipeline for “meta-training” such loss functions, targeted at maximizing the performance of the model trained under them. The loss landscape produced by our learned losses significantly improves upon the original task-specific losses in both supervised and reinforcement learning tasks. Furthermore, we show that our meta-learning framework is flexible enough to incorporate additional information at meta-train time. This information shapes the learned loss function such that the environment does not need to provide this information during meta-test time.

Learning Connectivity with Graph Convolutional Networks

Hichem Sahbi

Auto-TLDR; Learning Graph Convolutional Networks Using Topological Properties of Graphs

Learning graph convolutional networks (GCNs) is an emerging field which aims at generalizing convolutional operations to arbitrary non-regular domains. In particular, GCNs operating on spatial domains show superior performance compared to spectral ones; however, their success is highly dependent on how the topology of input graphs is defined. In this paper, we introduce a novel framework for graph convolutional networks that learns the topological properties of graphs. The design principle of our method is based on the optimization of a constrained objective function which learns not only the usual convolutional parameters in GCNs but also a transformation basis that conveys the most relevant topological relationships in these graphs. Experiments conducted on the challenging task of skeleton-based action recognition show the superiority of the proposed method compared to handcrafted graph design as well as the related work.

Classification and Feature Selection Using a Primal-Dual Method and Projections on Structured Constraints

Michel Barlaud, Antonin Chambolle, Jean-Baptiste Caillau

Auto-TLDR; A Constrained Primal-dual Method for Structured Feature Selection on High Dimensional Data

This paper deals with feature selection using supervised classification on high dimensional datasets. A classical approach is to project data on a low dimensional space and classify by minimizing an appropriate quadratic cost. Our first contribution is to introduce a matrix of centers in the definition of this cost. Moreover, as quadratic costs are not robust to outliers, we propose to use an $\ell_1$ cost instead (or Huber loss to mitigate overfitting issues). While control on sparsity is commonly obtained by adding an $\ell_1$ constraint on the vectorized matrix of weights used for projecting the data, our second contribution is to enforce structured sparsity. To this end we propose constraints that take into account the matrix structure of the data, based either on the nuclear norm, on the $\ell_{2,1}$ norm, or on the $\ell_{1,2}$ norm for which we provide a new projection algorithm. We optimize simultaneously the projection matrix and the matrix of centers thanks to a new tailored constrained primal-dual method. The primal-dual framework is general enough to encompass the various robust losses and structured constraints we use, and allows a convergence analysis. We demonstrate the effectiveness of the approach on three biological datasets. Our primal-dual method with robust losses, adaptive centers and structured constraints does significantly better than classical methods, both in terms of accuracy and computational time.

Unveiling Groups of Related Tasks in Multi-Task Learning

Jordan Frecon, Saverio Salzo, Massimiliano Pontil

Auto-TLDR; Continuous Bilevel Optimization for Multi-Task Learning

A common approach in multi-task learning is to encourage the tasks to share a low dimensional representation. This has led to the popular method of trace norm regularization, which has proved effective in many applications. In this paper, we extend this approach by allowing the tasks to partition into different groups, within which trace norm regularization is separately applied. We propose a continuous bilevel optimization framework to simultaneously identify groups of related tasks and learn a low dimensional representation within each group. Hinging on recent results on the derivative of generalized matrix functions, we devise a smooth approximation of the upper-level objective via a dual forward-backward algorithm with Bregman distances. This allows us to solve the bilevel problem by a gradient-based scheme. Numerical experiments on synthetic and benchmark datasets support the effectiveness of the proposed method.

On the Global Self-attention Mechanism for Graph Convolutional Networks

Chen Wang, Deng Chengyuan

Auto-TLDR; Global Self-Attention Mechanism for Graph Convolutional Networks

Applying the Global Self-Attention (GSA) mechanism over features has achieved remarkable success on Convolutional Neural Networks (CNNs). However, it is not clear if Graph Convolutional Networks (GCNs) can similarly benefit from such a technique. In this paper, inspired by the similarity between CNNs and GCNs, we study the impact of the Global Self-Attention mechanism on GCNs. We find that, consistent with the intuition, the GSA mechanism allows GCNs to capture feature-based vertex relations regardless of edge connections; as a result, the GSA mechanism can introduce extra expressive power to the GCNs. Furthermore, we analyze the impacts of the GSA mechanism on the issues of overfitting and over-smoothing. We prove that the GSA mechanism can alleviate both the overfitting and the over-smoothing issues based on some recent technical developments. Experiments on multiple benchmark datasets illustrate both superior expressive power and less significant overfitting and over-smoothing problems for the GSA-augmented GCNs, which corroborate the intuitions and the theoretical results.

Learning with Multiplicative Perturbations

Xiulong Yang, Shihao Ji

Auto-TLDR; xAT and xVAT: A Multiplicative Adversarial Training Algorithm for Robust DNN Training

Adversarial Training (AT) and Virtual Adversarial Training (VAT) are regularization techniques that train Deep Neural Networks (DNNs) with adversarial examples generated by adding small but worst-case perturbations to input examples. In this paper, we propose xAT and xVAT, new adversarial training algorithms that generate multiplicative perturbations to input examples for robust training of DNNs. Such perturbations are much more perceptible and interpretable than their additive counterparts exploited by AT and VAT. Furthermore, the multiplicative perturbations can be generated transductively or inductively, while the standard AT and VAT only support a transductive implementation. We conduct a series of experiments that analyze the behavior of the multiplicative perturbations and demonstrate that xAT and xVAT match or outperform state-of-the-art classification accuracies across multiple established benchmarks while being about 30% faster than their additive counterparts. Our source code can be found at https://github.com/sndnyang/xvat

Energy Minimum Regularization in Continual Learning

Xiaobin Li, Weiqiang Wang

Auto-TLDR; Energy Minimization Regularization for Continuous Learning

How to give agents the ability to learn continually, as humans and animals do, is still a challenge. In the regularized continual learning method OWM, the constraint of the model on the energy compression of the learned task is ignored, which results in the poor performance of the method on datasets with a large number of learning tasks. In this paper, we propose an energy minimization regularization (EMR) method to constrain the energy of learned tasks, providing enough learning space for the following tasks that are not yet learned, and increasing the capacity of the model with respect to the number of learning tasks. A large number of experiments show that our method can effectively increase the capacity of the model and reduce the sensitivity of the model to the number of tasks and the size of the network.

Is the Meta-Learning Idea Able to Improve the Generalization of Deep Neural Networks on the Standard Supervised Learning?

Xiang Deng, Zhongfei Zhang

Auto-TLDR; Meta-learning Based Training of Deep Neural Networks for Few-Shot Learning

Substantial efforts have been made on improving the generalization abilities of deep neural networks (DNNs) in order to obtain better performance without introducing more parameters. On the other hand, meta-learning approaches exhibit powerful generalization on new tasks in few-shot learning. Intuitively, few-shot learning is more challenging than standard supervised learning as each target class has only very few or no training samples. The natural question that arises is whether the meta-learning idea can be used for improving the generalization of DNNs on standard supervised learning. In this paper, we propose a novel meta-learning based training procedure (MLTP) for DNNs and demonstrate that the meta-learning idea can indeed improve the generalization abilities of DNNs. MLTP simulates the meta-training process by considering a batch of training samples as a task. The key idea is that the gradient descent step for improving the current task performance should also improve a new task performance, which is ignored by the current standard procedure for training neural networks. MLTP also benefits from all the existing training techniques such as dropout, weight decay, and batch normalization. We evaluate MLTP by training a variety of small and large neural networks on three benchmark datasets, i.e., CIFAR-10, CIFAR-100, and Tiny ImageNet. The experimental results show a consistently improved generalization performance on all the DNNs with different sizes, which verifies the promise of MLTP and demonstrates that the meta-learning idea is indeed able to improve the generalization of DNNs on standard supervised learning.

Generalization Comparison of Deep Neural Networks Via Output Sensitivity

Mahsa Forouzesh, Farnood Salehi, Patrick Thiran

Auto-TLDR; Generalization of Deep Neural Networks using Sensitivity

Although recent works have brought some insights into the performance improvement of techniques used in state-of-the-art deep-learning models, more work is needed to understand their generalization properties. We shed light on this matter by linking the loss function to the output's sensitivity to its input. We find a rather strong empirical relation between the output sensitivity and the variance in the bias-variance decomposition of the loss function, which hints at using sensitivity as a metric for comparing the generalization performance of networks, without requiring labeled data. We find that sensitivity is decreased by applying popular methods which improve the generalization performance of the model, such as (1) using a deep network rather than a wide one, (2) adding convolutional layers to baseline classifiers instead of adding fully-connected layers, (3) using batch normalization, dropout and max-pooling, and (4) applying parameter initialization techniques.

Trajectory-User Link with Attention Recurrent Networks

Tao Sun, Yongjun Xu, Fei Wang, Lin Wu, 塘文 钱, Zezhi Shao

Auto-TLDR; TULAR: Trajectory-User Link with Attention Recurrent Neural Networks

The prevalent adoption of GPS-enabled devices has brought an explosion of various location-based services, which produce a huge number of trajectories monitoring individuals' movements. In this paper, we tackle the Trajectory-User Link (TUL) problem, which identifies humans' movement patterns and links trajectories to the users who generated them. Existing solutions to the TUL problem employ recurrent neural networks and variational autoencoder methods, which face bottlenecks in the case of excessively long trajectories and fragmentary user movements. However, these are common characteristics of trajectory data in reality, leading to performance degradation of the existing models. In this paper, we propose an end-to-end attention recurrent neural learning framework, called TULAR (Trajectory-User Link with Attention Recurrent Networks), which focuses on selected parts of the source trajectories when linking. TULAR introduces the Trajectory Semantic Vector (TSV) via unsupervised location representation learning and recurrent neural networks, by which it computes the weights of parts of the source trajectory. Further, we employ three attention scores for the weight measurements. Experiments are conducted on two real-world datasets and compared with several existing methods, and the results show that TULAR yields new state-of-the-art performance. Source code is publicly available on GitHub: https://github.com/taos123/TULAR.

Switching Dynamical Systems with Deep Neural Networks

Cesar Ali Ojeda Marin, Kostadin Cvejoski, Bogdan Georgiev, Ramses J. Sanchez

Auto-TLDR; Variational RNN for Switching Dynamics

The problem of uncovering different dynamical regimes is of pivotal importance in time series analysis. Switching dynamical systems provide a solution for modeling physical phenomena whose time series data exhibit different dynamical modes. In this work we propose a novel variational RNN model for switching dynamics allowing for both non-Markovian and non-linear dynamical behavior between and within dynamic modes. Attention mechanisms are provided to inform the switching distribution. We evaluate our model on synthetic and empirical datasets of diverse nature and successfully uncover different dynamical regimes and predict the switching dynamics.

Neuron-Based Network Pruning Based on Majority Voting

Ali Alqahtani, Xianghua Xie, Ehab Essa, Mark W. Jones

Auto-TLDR; Large-Scale Neural Network Pruning using Majority Voting

The achievement of neural networks in a variety of applications is accompanied by a dramatic increase in computational costs and memory requirements. In this paper, we propose an efficient method to simultaneously identify the critical neurons and prune the model during training without involving any pre-training or fine-tuning procedures. Unlike existing methods, which accomplish this task in a greedy fashion, we propose a majority voting technique to compare the activation values among neurons and assign a voting score to quantitatively evaluate their importance. This mechanism helps to effectively reduce model complexity by eliminating the less influential neurons and aims to determine a subset of the whole model that can represent the reference model with far fewer parameters within the training process. Experimental results show that majority voting efficiently compresses the network with no drop in model accuracy, pruning more than 79% of the original model parameters on CIFAR10 and more than 91% of the original parameters on MNIST. Moreover, we show that with our proposed method, sparse models can be further pruned into even smaller models by removing more than 60% of the parameters, whilst preserving the reference model accuracy.

WeightAlign: Normalizing Activations by Weight Alignment

Xiangwei Shi, Yunqiang Li, Xin Liu, Jan Van Gemert

Auto-TLDR; WeightAlign: Normalization of Activations without Sample Statistics

Batch normalization (BN) allows training very deep networks by normalizing activations by mini-batch sample statistics, which renders BN unstable for small batch sizes. Current small-batch solutions such as Instance Norm, Layer Norm, and Group Norm use channel statistics which can be computed even for a single sample. Such methods are less stable than BN as they critically depend on the statistics of a single input sample. To address this problem, we propose a normalization of activation without sample statistics. We present WeightAlign: a method that normalizes the weights by the mean and scaled standard deviation computed within a filter, which normalizes activations without computing any sample statistics. Our proposed method is independent of batch size and stable over a wide range of batch sizes. Because weight statistics are orthogonal to sample statistics, we can directly combine WeightAlign with any method for activation normalization. We experimentally demonstrate these benefits for classification on CIFAR-10, CIFAR-100, ImageNet, for semantic segmentation on PASCAL VOC 2012 and for domain adaptation on Office-31.
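
The idea of normalizing weights within a filter can be illustrated with a minimal sketch. It assumes a filter flattened to a plain list of floats; the function name `weight_align`, the gain `gamma`, and in particular the `sqrt(n)` scaling of the standard deviation are assumptions made for illustration, since the abstract does not state the exact scaling factor.

```python
import math

def weight_align(filter_weights, gamma=1.0, eps=1e-8):
    """Center one filter's weights and divide by a scaled standard deviation.

    Only weight statistics are used; no activation (sample) statistics.
    """
    n = len(filter_weights)
    mean = sum(filter_weights) / n
    var = sum((w - mean) ** 2 for w in filter_weights) / n
    # Assumed scaling: sqrt(n) * std, which gives the filter unit L2 norm.
    scale = math.sqrt(n) * math.sqrt(var) + eps
    return [gamma * (w - mean) / scale for w in filter_weights]

# Hypothetical 4-weight filter.
aligned = weight_align([0.5, -1.0, 2.0, 0.1])
```

With this choice of scaling the aligned weights sum to zero and have unit squared norm, independent of batch size.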

Speeding-Up Pruning for Artificial Neural Networks: Introducing Accelerated Iterative Magnitude Pruning

Marco Zullich, Eric Medvet, Felice Andrea Pellegrino, Alessio Ansuini

Auto-TLDR; Iterative Pruning of Artificial Neural Networks with Overparametrization

In recent years, pruning of Artificial Neural Networks (ANNs) has become the focal point of much research, due to the extreme overparametrization of such models. This has urged the scientific world to investigate methods for the simplification of the structure of weights in ANNs, mainly in an effort to reduce time for both training and inference. Frankle and Carbin and later Renda, Frankle, and Carbin introduced and refined an iterative pruning method which is able to effectively prune the network of a great portion of its parameters with little to no loss in performance. On the downside, this method requires a large amount of time for its application, since, for each iteration, the network has to be trained for (almost) the same number of epochs as the unpruned network. In this work, we show that, for a limited setting, if targeting high overall sparsity rates, this time can be effectively reduced for each iteration, save for the last one, by more than 50%, while yielding a final product (i.e., final pruned network) whose performance is comparable to the ANN obtained using the existing method.

Adaptive Noise Injection for Training Stochastic Student Networks from Deterministic Teachers

Yi Xiang Marcus Tan, Yuval Elovici, Alexander Binder

Auto-TLDR; Adaptive Stochastic Networks for Adversarial Attacks

Adversarial attacks have been a prevalent problem causing misclassification in machine learning models, with stochasticity being a promising direction towards greater robustness. However, stochastic networks frequently underperform compared to deterministic deep networks. In this work, we present a conceptually clear adaptive noise injection mechanism in combination with teacher-initialisation, which adjusts its degree of randomness dynamically through the computation of mini-batch statistics. This mechanism is embedded within a simple framework to obtain stochastic networks from existing deterministic networks. Our experiments show that our method is able to outperform prior baselines under white-box settings, exemplified through CIFAR-10 and CIFAR-100. We then perform an in-depth analysis of how varying different components of training with our approach affects robustness and accuracy, by studying the evolution of the decision boundary and trend curves of clean accuracy and attack success over differing degrees of stochasticity. We also shed light on the effects of adversarial training on a pre-trained network, through the lens of decision boundaries.

A Multilinear Sampling Algorithm to Estimate Shapley Values

Ramin Okhrati, Aldo Lipani

Auto-TLDR; A sampling method for Shapley values for multilayer Perceptrons

Shapley values are great analytical tools in game theory to measure the importance of a player in a game. Due to their axiomatic and desirable properties such as efficiency, they have become popular for feature importance analysis in data science and machine learning. However, the time complexity to compute Shapley values based on the original formula is exponential, and as the number of features increases, this becomes infeasible. Castro et al. [1] developed a sampling algorithm to estimate Shapley values. In this work, we propose a new sampling method based on a multilinear extension technique as applied in game theory. The aim is to provide a more efficient (sampling) method for estimating Shapley values. Our method is applicable to any machine learning model, in particular for either multiclass classification or regression problems. We apply the method to estimate Shapley values for multilayer perceptrons (MLPs) and, through experimentation on two datasets, we demonstrate that our method provides more accurate estimations of the Shapley values by reducing the variance of the sampling statistics.
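
To make the sampling idea concrete, the following is a minimal sketch of the generic permutation-sampling estimator (in the spirit of the baseline attributed to Castro et al.), not the multilinear-extension method proposed in the paper. The names `sampled_shapley`, `additive_game`, and `weights` are hypothetical and chosen only for illustration.

```python
import random

def sampled_shapley(value_fn, n_players, n_samples=2000, seed=0):
    """Estimate Shapley values by averaging marginal contributions
    over randomly sampled player orderings."""
    rng = random.Random(seed)
    estimates = [0.0] * n_players
    players = list(range(n_players))
    for _ in range(n_samples):
        rng.shuffle(players)
        coalition = set()
        prev = value_fn(frozenset(coalition))
        for p in players:
            coalition.add(p)
            cur = value_fn(frozenset(coalition))
            estimates[p] += cur - prev  # marginal contribution of p
            prev = cur
    return [e / n_samples for e in estimates]

# Toy additive game: a coalition's value is the sum of its members' weights,
# so the exact Shapley value of player i is weights[i].
weights = [1.0, 2.0, 3.0]

def additive_game(coalition):
    return sum(weights[i] for i in coalition)

phi = sampled_shapley(additive_game, 3, n_samples=200)
```

In a feature-importance setting, `value_fn` would be replaced by a model evaluation on a subset of features; each sample then costs one model call per feature, which is why lower-variance sampling schemes matter.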

P-DIFF: Learning Classifier with Noisy Labels Based on Probability Difference Distributions

Wei Hu, Qihao Zhao, Yangyu Huang, Fan Zhang

Auto-TLDR; P-DIFF: A Simple and Effective Training Paradigm for Deep Neural Network Classifier with Noisy Labels

Learning a deep neural network (DNN) classifier with noisy labels is a challenging task because the DNN can easily overfit on these noisy labels due to its high capacity. In this paper, we present a very simple but effective training paradigm called P-DIFF, which can train DNN classifiers while markedly alleviating the adverse impact of noisy labels. Our proposed probability difference distribution implicitly reflects the probability of a training sample being clean; this probability is then employed to re-weight the corresponding sample during the training process. P-DIFF can also achieve good performance even without prior knowledge of the noise rate of training samples. Experiments on benchmark datasets also demonstrate that P-DIFF is superior to the state-of-the-art sample selection methods.

MA-LSTM: A Multi-Attention Based LSTM for Complex Pattern Extraction

Jingjie Guo, Kelang Tian, Kejiang Ye, Cheng-Zhong Xu

Auto-TLDR; MA-LSTM: Multiple Attention based recurrent neural network for forget gate

With the improvement of data, computing power and algorithms, deep learning has achieved rapid development and shown excellent performance. Recently, many deep learning models have been proposed to solve problems in different areas. A recurrent neural network (RNN) is a class of artificial neural networks where connections between nodes form a directed graph along a temporal sequence. This allows it to exhibit temporal dynamic behavior, which makes it applicable to tasks such as handwriting recognition or speech recognition. However, the RNN relies heavily on the automatic learning ability to update parameters which concentrate on the data flow but seldom considers the feature extraction capability of the gate mechanism. In this paper, we propose a novel architecture to build the forget gate, which is generated by multiple bases. Instead of using the traditional single-layer fully-connected network, we use a Multiple Attention (MA) based network to generate the forget gate, which refines the optimization space of the gate function and improves the granularity with which the recurrent neural network approximates the ground-truth mapping. Thanks to the MA structure on the gate mechanism, our model has a better feature extraction capability than other known models. MA-LSTM is an alternative module which can directly replace the recurrent neural network and has achieved good performance in many areas of interest.

N2D: (Not Too) Deep Clustering Via Clustering the Local Manifold of an Autoencoded Embedding

Ryan Mcconville, Raul Santos-Rodriguez, Robert Piechocki, Ian Craddock

Auto-TLDR; Local Manifold Learning for Deep Clustering on Autoencoded Embeddings

Deep clustering has increasingly been demonstrating superiority over conventional shallow clustering algorithms. Deep clustering algorithms usually combine representation learning with deep neural networks to achieve this performance, typically optimizing a clustering and non-clustering loss. In such cases, an autoencoder is typically connected with a clustering network, and the final clustering is jointly learned by both the autoencoder and clustering network. Instead, we propose to learn an autoencoded embedding and then search this further for the underlying manifold. For simplicity, we then cluster this with a shallow clustering algorithm, rather than a deeper network. We study a number of local and global manifold learning methods on both the raw data and autoencoded embedding, concluding that UMAP in our framework is able to find the best clusterable manifold of the embedding. This suggests that local manifold learning on an autoencoded embedding is effective for discovering higher quality clusters. We quantitatively show across a range of image and time-series datasets that our method has competitive performance against the latest deep clustering algorithms, including out-performing current state-of-the-art on several. We postulate that these results show a promising research direction for deep clustering. The code can be found at https://github.com/rymc/n2d.

Dimensionality Reduction for Data Visualization and Linear Classification, and the Trade-Off between Robustness and Classification Accuracy

Martin Becker, Jens Lippel, Thomas Zielke

Auto-TLDR; Robustness Assessment of Deep Autoencoder for Data Visualization using Scatter Plots

This paper has three intertwined goals. The first is to introduce a new similarity measure for scatter plots. It uses Delaunay triangulations to compare two scatter plots regarding their relative positioning of clusters. The second is to apply this measure for the robustness assessment of a recent deep neural network (DNN) approach to dimensionality reduction (DR) for data visualization. It uses a nonlinear generalization of Fisher's linear discriminant analysis (LDA) as the encoder network of a deep autoencoder (DAE). The DAE's decoder network acts as a regularizer. The third goal is to look at different variants of the DNN: ones that promise robustness and ones that promise high classification accuracies. This is to study the trade-off between these two objectives -- our results support the recent claim that robustness may be at odds with accuracy; however, results that are balanced regarding both objectives are achievable. We see a restricted Boltzmann machine (RBM) pretraining and the DAE based regularization as important building blocks for achieving balanced results. As a means of assessing the robustness of DR methods, we propose a measure that is based on our similarity measure for scatter plots. The robustness measure comes with a superimposition view of Delaunay triangulations, which allows a fast comparison of results from multiple DR methods.

Rethinking Experience Replay: A Bag of Tricks for Continual Learning

Pietro Buzzega, Matteo Boschini, Angelo Porrello, Simone Calderara

Auto-TLDR; Experience Replay for Continual Learning: A Practical Approach

In Continual Learning, a Neural Network is trained on a stream of data whose distribution shifts over time. Under these assumptions, it is especially challenging to improve on classes appearing later in the stream while remaining accurate on previous ones. This is due to the infamous problem of catastrophic forgetting, which causes a quick performance degradation when the classifier focuses on learning new categories. Recent literature proposed various approaches to tackle this issue, often resorting to very sophisticated techniques. In this work, we show that naive rehearsal can be patched to achieve similar performance. We point out some shortcomings that restrain Experience Replay (ER) and propose five tricks to mitigate them. Experiments show that ER, thus enhanced, displays an accuracy gain of 51.2 and 26.9 percentage points on the CIFAR-10 and CIFAR-100 datasets respectively (memory buffer size 1000). As a result, it surpasses current state-of-the-art rehearsal-based methods.

Kernel-based Graph Convolutional Networks

Hichem Sahbi

Auto-TLDR; Spatial Graph Convolutional Networks in Reproducing Kernel Hilbert Space

Learning graph convolutional networks (GCNs) is an emerging field which aims at generalizing deep learning to arbitrary non-regular domains. Most of the existing GCNs follow a neighborhood aggregation scheme, where the representation of a node is recursively obtained by aggregating its neighboring node representations using averaging or sorting operations. However, these operations are either ill-posed, too weak to be discriminant, or increase the number of training parameters and thereby the computational complexity and the risk of overfitting. In this paper, we introduce a novel GCN framework that achieves spatial graph convolution in a reproducing kernel Hilbert space. The latter makes it possible to design, via implicit kernel representations, convolutional graph filters in a high dimensional and more discriminating space without increasing the number of training parameters. The particularity of our GCN model also resides in its ability to achieve convolutions without explicitly realigning nodes in the receptive fields of the learned graph filters with those of the input graphs, thereby making convolutions permutation agnostic and well defined. Experiments conducted on the challenging task of skeleton-based action recognition show the superiority of the proposed method against different baselines as well as the related work.

Meta Soft Label Generation for Noisy Labels

Görkem Algan, Ilkay Ulusoy

Auto-TLDR; MSLG: Meta-Learning for Noisy Label Generation

The existence of noisy labels in the dataset causes significant performance degradation for deep neural networks (DNNs). To address this problem, we propose a Meta Soft Label Generation algorithm called MSLG, which can jointly generate soft labels using meta-learning techniques and learn DNN parameters in an end-to-end fashion. Our approach adapts the meta-learning paradigm to estimate the optimal label distribution by checking gradient directions on both noisy training data and noise-free meta-data. In order to iteratively update soft labels, a meta-gradient descent step is performed on the estimated labels, which minimizes the loss on noise-free meta samples. In each iteration, the base classifier is trained on the estimated meta labels. MSLG is model-agnostic and can be added on top of any existing model at hand with ease. We performed extensive experiments on CIFAR10, Clothing1M and Food101N datasets. Results show that our approach outperforms other state-of-the-art methods by a large margin. Our code is available at https://github.com/gorkemalgan/MSLG_noisy_label.

Resource-efficient DNNs for Keyword Spotting using Neural Architecture Search and Quantization

David Peter, Wolfgang Roth, Franz Pernkopf

Auto-TLDR; Neural Architecture Search for Keyword Spotting in Limited Resource Environments

This paper introduces neural architecture search (NAS) for the automatic discovery of small models for keyword spotting (KWS) in limited resource environments. We employ a differentiable NAS approach to optimize the structure of convolutional neural networks (CNNs) to meet certain memory constraints for storing weights as well as constraints on the number of operations per inference. Using NAS only, we were able to obtain a highly efficient model with 95.6% accuracy on the Google speech commands dataset with 494.8 kB of memory usage and 19.6 million operations. Additionally, weight quantization is used to reduce the memory consumption even further. We show that weight quantization to low bit-widths (e.g. 1 bit) can be used without substantial loss in accuracy. By increasing the number of input features from 10 MFCC to 20 MFCC we were able to increase the accuracy to 96.6% at 340.1 kB of memory usage and 27.1 million operations.

E-DNAS: Differentiable Neural Architecture Search for Embedded Systems

Javier García López, Antonio Agudo, Francesc Moreno-Noguer

Auto-TLDR; E-DNAS: Differentiable Architecture Search for Light-Weight Networks for Image Classification

Designing optimal and lightweight networks to fit in resource-limited platforms like mobiles, DSPs or GPUs is a challenging problem with a wide range of interesting applications, e.g. in embedded systems for autonomous driving. While most approaches are based on manual hyperparameter tuning, there exists a new line of research, the so-called NAS (Neural Architecture Search) methods, that aim to optimize several metrics during the design process, including memory requirements of the network, number of FLOPs, number of MACs (Multiply-ACcumulate operations) or inference latency. However, while NAS methods have shown very promising results, they are still significantly time-consuming and costly. In this work we introduce E-DNAS, a differentiable architecture search method, which improves the efficiency of NAS methods in designing light-weight networks for the task of image classification. Concretely, E-DNAS computes, in a differentiable manner, the optimal size of a number of meta-kernels that capture patterns of the input data at different resolutions. We also leverage the additive property of convolution operations to merge several kernels with different compatible sizes into a single one, thus reducing the number of operations and the time required to estimate the optimal configuration. We evaluate our approach on several datasets to perform classification. We report results in terms of the SoC (System on Chips) metric, typically used in the Texas Instruments TDA2x families for autonomous driving applications. The results show that our approach allows designing low latency architectures significantly faster than state-of-the-art.

Boundaries of Single-Class Regions in the Input Space of Piece-Wise Linear Neural Networks

Jay Hoon Jung, Youngmin Kwon

Auto-TLDR; Piece-wise Linear Neural Networks with Linear Constraints

An input space is a set of all the possible inputs for a neural network. An element or a group of elements in the input space can easily be understood by projecting them on their original forms. Even though Piece-wise Linear Neural Networks (PLNNs) are a nonlinear system in general, a PLNN can also be expressed in terms of linear constraints because the Rectified Linear Units (ReLU) function is a piece-wise linear function. A PLNN divides the input space into disjoint linear regions. We proved that all components of the outputs are continuous at the boundary between two different adjacent regions. This continuity implies that the boundary corresponding to a unit itself should be continuous regardless of the regions. Furthermore, we also obtained the boundaries of a single-class region, which has the same predicted classes in the interior of the region. Finally, we suggested that the point-wise robustness of a neural network can be calculated by investigating the boundaries of linear regions and the single-class regions. We obtained adversarial examples in which Euclidean distances from original inputs are less than 0.01 pixels.
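
The statement that a ReLU network is linear within each region can be checked on a toy example. The sketch below assumes a network with one hidden ReLU layer; the helper names `relu_forward` and `local_affine_map` and all weight values are hypothetical. It computes the activation pattern at an input and the affine map that the network equals on every input sharing that pattern.

```python
def relu_forward(W1, b1, W2, b2, x):
    """Forward pass of a one-hidden-layer ReLU network.
    Returns the output and the on/off activation pattern of the hidden units."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    pattern = [1 if z > 0 else 0 for z in pre]
    hidden = [z * a for z, a in zip(pre, pattern)]
    out = [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(W2, b2)]
    return out, pattern

def local_affine_map(W1, b1, W2, b2, pattern):
    """Return (A, c) such that f(x) = A x + c inside the region of `pattern`."""
    n_hidden, n_in = len(W1), len(W1[0])
    A = [[sum(W2[k][j] * pattern[j] * W1[j][i] for j in range(n_hidden))
          for i in range(n_in)] for k in range(len(W2))]
    c = [sum(W2[k][j] * pattern[j] * b1[j] for j in range(n_hidden)) + b2[k]
         for k in range(len(W2))]
    return A, c

# Hypothetical 2-2-2 network and input.
W1, b1 = [[1.0, -1.0], [0.5, 2.0]], [0.0, -1.0]
W2, b2 = [[1.0, 1.0], [-1.0, 2.0]], [0.0, 0.5]
x = [2.0, 1.0]

out, pattern = relu_forward(W1, b1, W2, b2, x)
A, c = local_affine_map(W1, b1, W2, b2, pattern)
affine_out = [sum(a * xi for a, xi in zip(row, x)) + ci for row, ci in zip(A, c)]
```

Each hidden unit contributes one hyperplane (its pre-activation equal to zero) to the boundaries between regions; crossing it flips one entry of `pattern` and changes `(A, c)`.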

Feature Engineering and Stacked Echo State Networks for Musical Onset Detection

Peter Steiner, Azarakhsh Jalalvand, Simon Stone, Peter Birkholz

Auto-TLDR; Echo State Networks for Onset Detection in Music Analysis

In music analysis, one of the most fundamental tasks is note onset detection - detecting the beginning of new note events. As the target function of onset detection is related to other tasks, such as beat tracking or tempo estimation, onset detection is the basis for such related tasks. Furthermore, it can help to improve Automatic Music Transcription (AMT). Typically, different approaches for onset detection follow a similar outline: An audio signal is transformed into an Onset Detection Function (ODF), which should have rather low values (i.e. close to zero) for most of the time but with pronounced peaks at onset times, which can then be extracted by applying peak picking algorithms on the ODF. In recent years, several kinds of neural networks have been used successfully to compute the ODF from feature vectors. Currently, Convolutional Neural Networks (CNNs) define the state of the art. In this paper, we build upon an alternative approach to obtain an ODF by Echo State Networks (ESNs), which have achieved comparable results to CNNs in several tasks, such as speech and image recognition. In contrast to the typical iterative training procedures of deep learning architectures, such as CNNs or networks consisting of Long-Short-Term Memory Cells (LSTMs), in ESNs only a very small part of the weights is trained, easily and in one shot, using linear regression. By comparing the performance of several feature extraction methods and pre-processing steps, and by introducing a new way to stack ESNs, we expand our previous approach to achieve results that fall between a bidirectional LSTM network and a CNN with relative improvements of 1.8% and -1.4%, respectively. For the evaluation, we used exactly the same 8-fold cross validation setup as for the reference results.

Can Data Placement Be Effective for Neural Networks Classification Tasks? Introducing the Orthogonal Loss

Brais Cancela, Veronica Bolon-Canedo, Amparo Alonso-Betanzos

Auto-TLDR; Spatial Placement for Neural Network Training Loss Functions

Traditionally, a Neural Network classification training loss function follows the same principle: minimizing the distance between samples that belong to the same class, while maximizing the distance to the other classes. There are no restrictions on the spatial placement of deep features (last layer input). This paper addresses this issue when dealing with Neural Networks, providing a set of loss functions that are able to train a classifier by forcing the deep features to be projected over a predefined orthogonal basis. Experimental results show that these 'data placement' functions can surpass the training accuracy provided by the classic cross-entropy loss function.

Towards Explaining Adversarial Examples Phenomenon in Artificial Neural Networks

Ramin Barati, Reza Safabakhsh, Mohammad Rahmati

Auto-TLDR; Convolutional Neural Networks and Adversarial Training from the Perspective of convergence

In this paper, we study the existence of adversarial examples and adversarial training from the standpoint of convergence and provide evidence that pointwise convergence in ANNs can explain these observations. The main contribution of our proposal is that it relates the objective of the evasion attacks and adversarial training with concepts already defined in learning theory. Also, we extend and unify some of the other proposals in the literature and provide alternative explanations of the observations made in those proposals. Through different experiments, we demonstrate that the framework is valuable in the study of the phenomenon and is applicable to real-world problems.

Understanding Integrated Gradients with SmoothTaylor for Deep Neural Network Attribution

Gary Shing Wee Goh, Sebastian Lapuschkin, Leander Weber, Wojciech Samek, Alexander Binder

Auto-TLDR; SmoothGrad: bridging Integrated Gradients and SmoothGrad from the Taylor's theorem perspective

Integrated Gradients as an attribution method for deep neural network models offers simple implementability. However, it suffers from noisiness of explanations, which affects the ease of interpretability. The SmoothGrad technique was proposed to solve the noisiness issue and smooth the attribution maps of any gradient-based attribution method. In this paper, we present SmoothTaylor as a novel theoretical concept bridging Integrated Gradients and SmoothGrad, from the Taylor's theorem perspective. We apply the methods to the image classification problem, using the ILSVRC2012 ImageNet object recognition dataset, and a couple of pretrained image models to generate attribution maps. These attribution maps are empirically evaluated using quantitative measures for sensitivity and noise level. We further propose adaptive noising to optimize for the noise scale hyperparameter value. From our experiments, we find that the SmoothTaylor approach together with adaptive noising is able to generate better quality saliency maps with less noise and higher sensitivity to the relevant points in the input space as compared to Integrated Gradients.
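
For readers unfamiliar with the baseline method, the following minimal sketch approximates standard Integrated Gradients with a midpoint Riemann sum along the straight path from a baseline to the input. It does not implement SmoothTaylor. The toy function `f`, its analytic gradient `grad_f`, and the step count are assumptions made for illustration; a real model would supply gradients by automatic differentiation.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate IG_i = (x_i - b_i) * integral over alpha in [0, 1] of
    dF/dx_i evaluated at b + alpha * (x - b), using the midpoint rule."""
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_fn(point)
        for i in range(n):
            avg_grad[i] += g[i] / steps
    return [(xi - b) * ag for xi, b, ag in zip(x, baseline, avg_grad)]

# Toy differentiable "model" with a hand-written gradient.
def f(x):
    return x[0] ** 2 + 3.0 * x[0] * x[1]

def grad_f(x):
    return [2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]]

x, baseline = [1.0, 2.0], [0.0, 0.0]
attr = integrated_gradients(grad_f, x, baseline)
```

The attributions satisfy the completeness property: they sum to `f(x) - f(baseline)`, up to the error of the numerical integration.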

On Resource-Efficient Bayesian Network Classifiers and Deep Neural Networks

Wolfgang Roth, Günther Schindler, Holger Fröning, Franz Pernkopf

Auto-TLDR; Quantization-Aware Bayesian Network Classifiers for Small-Scale Scenarios

We present two methods to reduce the complexity of Bayesian network (BN) classifiers. First, we introduce quantization-aware training using the straight-through gradient estimator to quantize the parameters of BNs to few bits. Second, we extend a recently proposed differentiable tree-augmented naive Bayes (TAN) structure learning approach to also consider the model size. Both methods are motivated by recent developments in the deep learning community, and they provide effective means to trade off between model size and prediction accuracy, which is demonstrated in extensive experiments. Furthermore, we contrast quantized BN classifiers with quantized deep neural networks (DNNs) for small-scale scenarios which have hardly been investigated in the literature. We show Pareto optimal models with respect to model size, number of operations, and test error and find that both model classes are viable options.

Revisiting Graph Neural Networks: Graph Filtering Perspective

Hoang Nguyen-Thai, Takanori Maehara, Tsuyoshi Murata

Auto-TLDR; Two-Layers Graph Convolutional Network with Graph Filters Neural Network

In this work, we develop quantitative results on the learnability of a two-layer Graph Convolutional Network (GCN). Instead of analyzing GCN under some classes of functions, our approach provides a quantitative gap between a two-layer GCN and a two-layer MLP model. From the graph signal processing perspective, we provide useful insights into some flaws of graph neural networks for vertex classification. We empirically demonstrate a few cases when GCN and other state-of-the-art models cannot learn even when true vertex features are extremely low-dimensional. To demonstrate our theoretical findings and propose a solution to the aforementioned adversarial cases, we build a proof of concept graph neural network model with different filters named Graph Filters Neural Network (gfNN).

Trainable Spectrally Initializable Matrix Transformations in Convolutional Neural Networks

Michele Alberti, Angela Botros, Schuetz Narayan, Rolf Ingold, Marcus Liwicki, Mathias Seuret

Auto-TLDR; Trainable and Spectrally Initializable Matrix Transformations for Neural Networks

In this work, we introduce a new architectural component to Neural Networks (NN), i.e., trainable and spectrally initializable matrix transformations on feature maps. While previous literature has already demonstrated the possibility of adding static spectral transformations as feature processors, our focus is on more general trainable transforms. We study the transforms in various architectural configurations on four datasets of different nature: from medical (ColorectalHist, HAM10000) and natural (Flowers) images to historical documents (CB55). With rigorous experiments that control for the number of parameters and randomness, we show that networks utilizing the introduced matrix transformations outperform vanilla neural networks. The observed accuracy increases appreciably across all datasets. In addition, we show that the benefit of spectral initialization leads to significantly faster convergence, as opposed to randomly initialized matrix transformations. The transformations are implemented as auto-differentiable PyTorch modules that can be incorporated into any neural network architecture. The entire code base is open-source.

A Randomized Algorithm for Sparse Recovery

Huiyuan Yu, Maggie Cheng, Yingdong Lu

Auto-TLDR; A Constrained Graph Optimization Algorithm for Sparse Signal Recovery

This paper considers the problem of sparse signal recovery where there is a structure in the signal. Efficient recovery schemes can be designed to leverage the signal structure. Following the model-based compressive sensing framework, we have developed an efficient algorithm for both head and tail approximations for the model-projection problem. The problem is modeled as a constrained graph optimization problem, which is an NP-hard optimization problem. Solving the NP-hard optimization program is then transformed into solving a linear program and applying a randomized algorithm to find an integral solution. The integral solution is optimal-in-expectation. The algorithm is proved to have the same geometric convergence as previous work. The algorithm has been tested on various compressing matrices. It worked well with matrices that have the Restricted Isometry Property (RIP), and also with some matrices that have not been shown to have RIP. The proposed algorithm demonstrated improved recoverability and used fewer iterations to recover the signal.

Quaternion Capsule Networks

Barış Özcan, Furkan Kınlı, Mustafa Furkan Kirac

Auto-TLDR; Quaternion Capsule Networks for Object Recognition

Capsules are groupings of neurons that can represent sophisticated information about a visual entity, such as pose and features. In view of this property, Capsule Networks outperform CNNs in challenging tasks like object recognition in unseen viewpoints, and this is achieved by learning the transformations between the object and its parts with the help of a high dimensional representation of pose information. In this paper, we present Quaternion Capsules (QCN), where pose information of capsules and their transformations are represented by quaternions. Quaternions are immune to gimbal lock, have straightforward regularization of the rotation representation for capsules, and require fewer parameters than matrices. The experimental results show that QCNs generalize better to novel viewpoints with fewer parameters, and also achieve on-par or better performance compared with the state-of-the-art Capsule architectures on well-known benchmarking datasets.

HFP: Hardware-Aware Filter Pruning for Deep Convolutional Neural Networks Acceleration

Fang Yu, Chuanqi Han, Pengcheng Wang, Ruoran Huang, Xi Huang, Li Cui

Auto-TLDR; Hardware-Aware Filter Pruning for Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are powerful but computationally demanding and memory intensive, which impedes their practical application on resource-constrained hardware. Filter pruning is an efficient approach to deep CNN compression and acceleration that aims to eliminate some filters with tolerable performance degradation. In the literature, the majority of approaches prune networks by defining the redundant filters or by training the networks with a sparsity-prior loss function. These approaches mainly use FLOPs as their speed metric. However, the inference latency of pruned networks cannot be directly controlled on the hardware platform, which is an important dimension of practicality. To address this issue, we propose a novel Hardware-aware Filter Pruning method (HFP) that produces pruned networks satisfying the actual latency budget on the hardware of interest. In addition, we propose an iterative pruning framework called Opti-Cut to reduce the accuracy degradation of the pruning process and accelerate the pruning procedure while meeting the hardware budget. More specifically, HFP first builds a lookup table to quickly estimate the latency of the target network, layer by layer, as a function of the filter configuration. Then, HFP leverages information gain (IG) to globally evaluate each filter's contribution to the network output distribution. HFP uses the Opti-Cut framework to globally prune the filters with the minimum IG one by one until the latency budget is satisfied. We verify the effectiveness of the proposed method on CIFAR-10 and ImageNet. Compared with state-of-the-art pruning methods, HFP demonstrates superior performance on VGGNet, ResNet and MobileNet V1/V2.
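
The lookup-table and greedy-pruning loop described in this abstract can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' HFP or Opti-Cut implementation: the function names, the per-layer latency table, and the fixed IG scores are all hypothetical, and the latency of a layer is assumed to depend only on how many of its filters remain.

```python
def total_latency(counts, latency_table):
    """Sum per-layer latencies looked up by remaining filter count."""
    return sum(latency_table[layer][n] for layer, n in counts.items())


def greedy_prune(ig_scores, latency_table, budget_ms, min_filters=1):
    """Remove the globally lowest-IG filter until the latency budget is met.

    ig_scores:     {layer: {filter_id: information_gain}} (assumed precomputed)
    latency_table: {layer: [latency_ms with 0 filters, with 1 filter, ...]}
    """
    kept = {layer: dict(scores) for layer, scores in ig_scores.items()}
    removed = []
    while True:
        counts = {layer: len(filters) for layer, filters in kept.items()}
        if total_latency(counts, latency_table) <= budget_ms:
            return kept, removed
        # Candidates are filters in layers that can still shrink.
        candidates = [
            (ig, layer, fid)
            for layer, filters in kept.items()
            if len(filters) > min_filters
            for fid, ig in filters.items()
        ]
        if not candidates:
            raise ValueError("latency budget cannot be met")
        _, layer, fid = min(candidates)
        del kept[layer][fid]
        removed.append((layer, fid))


# Made-up numbers purely for illustration.
ig_scores = {
    "conv1": {0: 0.9, 1: 0.1, 2: 0.5},
    "conv2": {0: 0.3, 1: 0.8},
}
latency_table = {
    "conv1": [0.0, 1.0, 2.0, 3.0],
    "conv2": [0.0, 2.0, 4.0],
}
kept, removed = greedy_prune(ig_scores, latency_table, budget_ms=5.0)
```

In this toy setting the unpruned network costs 7.0 ms, so the loop removes the two lowest-IG filters (one from each layer) and stops at 4.0 ms. The sketch keeps IG scores fixed throughout, whereas a real pipeline might re-estimate them or fine-tune between pruning steps.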

Compression Strategies and Space-Conscious Representations for Deep Neural Networks

Giosuè Marinò, Gregorio Ghidoli, Marco Frasca, Dario Malchiodi

Auto-TLDR; Compression of Large Convolutional Neural Networks by Weight Pruning and Quantization

Recent advances in deep learning have made available large, powerful convolutional neural networks (CNNs) with state-of-the-art performance in several real-world applications. Unfortunately, these large models have millions of parameters and are therefore not deployable on resource-limited platforms (e.g. where RAM is limited). Compression of CNNs therefore becomes a critical problem in achieving memory-efficient and possibly computationally faster model representations. In this paper, we investigate the impact of lossy compression of CNNs by weight pruning and quantization, and of lossless weight matrix representations based on source coding. We tested several combinations of these techniques on four benchmark datasets for classification and regression problems, achieving compression rates of up to 165 times while preserving or improving the model performance.
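
To make the two lossy steps concrete, here is a minimal sketch of weight pruning followed by quantization on a small weight matrix. The abstract does not say which pruning or quantization schemes the authors use, so magnitude-based pruning and uniform quantization are swapped in here as common, simple choices; the function names and the example weights are hypothetical.

```python
def prune_by_magnitude(weights, fraction):
    """Zero out roughly `fraction` of the weights with smallest magnitude.

    Ties at the threshold are all pruned, so slightly more than
    `fraction` of the weights may be removed.
    """
    magnitudes = sorted(abs(w) for row in weights for w in row)
    k = int(len(magnitudes) * fraction)
    if k == 0:
        return [list(row) for row in weights]
    threshold = magnitudes[k - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


def quantize_uniform(weights, n_levels):
    """Snap nonzero weights to `n_levels` evenly spaced values.

    Zeros are left untouched so the sparsity from pruning is preserved.
    """
    nonzero = [w for row in weights for w in row if w != 0.0]
    if not nonzero or n_levels < 2:
        return [list(row) for row in weights]
    lo, hi = min(nonzero), max(nonzero)
    step = (hi - lo) / (n_levels - 1)
    if step == 0.0:
        return [list(row) for row in weights]

    def snap(w):
        return 0.0 if w == 0.0 else lo + round((w - lo) / step) * step

    return [[snap(w) for w in row] for row in weights]


# Made-up weights purely for illustration.
weights = [[0.05, -0.8, 0.4], [1.0, -0.02, 0.6]]
pruned = prune_by_magnitude(weights, fraction=0.5)
quantized = quantize_uniform(pruned, n_levels=3)
```

After these steps the matrix holds many zeros and only a few distinct values, which is the kind of redundancy a sparse format or an entropy coder can then store compactly without further loss.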