Adversarially Training for Audio Classifiers

Raymel Alfonso Sallo, Mohammad Esmaeilpour, Patrick Cardinal

Responsive image

Auto-TLDR; Adversarially Training for Robust Neural Networks against Adversarial Attacks

Slides Poster

In this paper, we investigate the potential effect of the adversarially training on the robustness of six advanced deep neural networks against a variety of targeted and non-targeted adversarial attacks. We firstly show that, the ResNet-56 model trained on the 2D representation of the discrete wavelet transform appended with the tonnetz chromagram outperforms other models in terms of recognition accuracy. Then we demonstrate the positive impact of adversarially training on this model as well as other deep architectures against six types of attack algorithms (white and black-box) with the cost of the reduced recognition accuracy and limited adversarial perturbation. We run our experiments on two benchmarking environmental sound datasets and show that without any imposed limitations on the budget allocations for the adversary, the fooling rate of the adversarially trained models can exceed 90%. In other words, adversarial attacks exist in any scales, but they might require higher adversarial perturbations compared to non-adversarially trained models.

Similar papers

Defense Mechanism against Adversarial Attacks Using Density-Based Representation of Images

Yen-Ting Huang, Wen-Hung Liao, Chen-Wei Huang

Responsive image

Auto-TLDR; Adversarial Attacks Reduction Using Input Recharacterization

Slides Poster Similar

Adversarial examples are slightly modified inputs devised to cause erroneous inference of deep learning models. Protection against the intervention of adversarial examples is a fundamental issue that needs to be addressed before the wide adoption of deep-learning based intelligent systems. In this research, we utilize the method known as input recharacterization to effectively eliminate the perturbations found in the adversarial examples. By converting images from the intensity domain into density-based representation using halftoning operation, performance of the classifier can be properly maintained. With adversarial attacks generated using FGSM, I-FGSM, and PGD, the top-5 accuracy of the hybrid model can still achieve 80.97%, 78.77%, 81.56%, respectively. Although the accuracy has been slightly affected, the influence of adversarial examples is significantly discounted. The average improvement over existing input transform defense mechanisms is approximately 10%.

Variational Inference with Latent Space Quantization for Adversarial Resilience

Vinay Kyatham, Deepak Mishra, Prathosh A.P.

Responsive image

Auto-TLDR; A Generalized Defense Mechanism for Adversarial Attacks on Data Manifolds

Slides Poster Similar

Despite their tremendous success in modelling highdimensional data manifolds, deep neural networks suffer from the threat of adversarial attacks - Existence of perceptually valid input-like samples obtained through careful perturbation that lead to degradation in the performance of the underlying model. Major concerns with existing defense mechanisms include non-generalizability across different attacks, models and large inference time. In this paper, we propose a generalized defense mechanism capitalizing on the expressive power of regularized latent space based generative models. We design an adversarial filter, devoid of access to classifier and adversaries, which makes it usable in tandem with any classifier. The basic idea is to learn a Lipschitz constrained mapping from the data manifold, incorporating adversarial perturbations, to a quantized latent space and re-map it to the true data manifold. Specifically, we simultaneously auto-encode the data manifold and its perturbations implicitly through the perturbations of the regularized and quantized generative latent space, realized using variational inference. We demonstrate the efficacy of the proposed formulation in providing resilience against multiple attack types (black and white box) and methods, while being almost real-time. Our experiments show that the proposed method surpasses the stateof-the-art techniques in several cases.

F-Mixup: Attack CNNs from Fourier Perspective

Xiu-Chuan Li, Xu-Yao Zhang, Fei Yin, Cheng-Lin Liu

Responsive image

Auto-TLDR; F-Mixup: A novel black-box attack in frequency domain for deep neural networks

Slides Poster Similar

Recent research has revealed that deep neural networks are highly vulnerable to adversarial examples. In this paper, different from most adversarial attacks which directly modify pixels in spatial domain, we propose a novel black-box attack in frequency domain, named as f-mixup, based on the property of natural images and perception disparity between human-visual system (HVS) and convolutional neural networks (CNNs): First, natural images tend to have the bulk of their Fourier spectrums concentrated on the low frequency domain; Second, HVS is much less sensitive to high frequencies while CNNs can utilize both low and high frequency information to make predictions. Extensive experiments are conducted and show that deeper CNNs tend to concentrate more on the high frequency domain, which may explain the contradiction between robustness and accuracy. In addition, we compared f-mixup with existing attack methods and observed that our approach possesses great advantages. Finally, we show that f-mixup can be also incorporated in training to make deep CNNs defensible against a kind of perturbations effectively.

Attack Agnostic Adversarial Defense via Visual Imperceptible Bound

Saheb Chhabra, Akshay Agarwal, Richa Singh, Mayank Vatsa

Responsive image

Auto-TLDR; Robust Adversarial Defense with Visual Imperceptible Bound

Slides Poster Similar

High susceptibility of deep learning algorithms against structured and unstructured perturbations has motivated the development of efficient adversarial defense algorithms. However, the lack of generalizability of existing defense algorithms and the high variability in the performance of the attack algorithms for different databases raises several questions on the effectiveness of the defense algorithms. In this research, we aim to design a defense model that is robust within the certain bound against both seen and unseen adversarial attacks. This bound is related to the visual appearance of an image and we termed it as \textit{Visual Imperceptible Bound (VIB)}. To compute this bound, we propose a novel method that uses the database characteristics. The VIB is further used to compute the effectiveness of attack algorithms. In order to design a defense model, we propose a defense algorithm which makes the model robust within the VIB against both seen and unseen attacks. The performance of the proposed defense algorithm and the method to compute VIB are evaluated on MNIST, CIFAR-10, and Tiny ImageNet databases on multiple attacks including C\&W ($l_2$) and DeepFool. The proposed defense algorithm is not only able to increase the robustness against several attacks but also retain or improve the classification accuracy on an original clean test set. Experimentally, it is demonstrated that the proposed defense is better than existing strong defense algorithms based on adversarial retraining. We have additionally performed the PGD attack in white box settings and compared the results with the existing algorithms. The proposed defense is independent of the target model and adversarial attacks, and therefore can be utilized against any attack.

A Delayed Elastic-Net Approach for Performing Adversarial Attacks

Brais Cancela, Veronica Bolon-Canedo, Amparo Alonso-Betanzos

Responsive image

Auto-TLDR; Robustness of ImageNet Pretrained Models against Adversarial Attacks

Slides Poster Similar

With the rise of the so-called Adversarial Attacks, there is an increased concern on model security. In this paper we present two different contributions: novel measures of robustness (based on adversarial attacks) and a novel adversarial attack. The key idea behind these metrics is to obtain a measure that could compare different architectures, with independence of how the input is preprocessed (robustness against different input sizes and value ranges). To do so, a novel adversarial attack is presented, performing a delayed elastic-net adversarial attack (constraints are only used whenever a successful adversarial attack is obtained). Experimental results show that our approach obtains state-of-the-art adversarial samples, in terms of minimal perturbation distance. Finally, a benchmark of ImageNet pretrained models is used to conduct experiments aiming to shed some light about which model should be selected whenever security is a role factor.

Accuracy-Perturbation Curves for Evaluation of Adversarial Attack and Defence Methods

Jaka Šircelj, Danijel Skocaj

Responsive image

Auto-TLDR; Accuracy-perturbation Curve for Robustness Evaluation of Adversarial Examples

Slides Poster Similar

With more research published on adversarial examples, we face a growing need for strong and insightful methods for evaluating the robustness of machine learning solutions against their adversarial threats. Previous work contains problematic and overly simplified evaluation methods, where different methods for generating adversarial examples are compared, even though they produce adversarial examples of differing perturbation magnitudes. This creates a biased evaluation environment, as higher perturbations yield naturally stronger adversarial examples. We propose a novel "accuracy-perturbation curve" that visualizes a classifiers classification accuracy response to adversarial examples of different perturbations. To demonstrate the utility of the curve we perform evaluation of responses of different image classifier architectures to four popular adversarial example methods. We also show how adversarial training improves the robustness of a classifier using the "accuracy-perturbation curve".

Adaptive Noise Injection for Training Stochastic Student Networks from Deterministic Teachers

Yi Xiang Marcus Tan, Yuval Elovici, Alexander Binder

Responsive image

Auto-TLDR; Adaptive Stochastic Networks for Adversarial Attacks

Slides Similar

Adversarial attacks have been a prevalent problem causing misclassification in machine learning models, with stochasticity being a promising direction towards greater robustness. However, stochastic networks frequently underperform compared to deterministic deep networks. In this work, we present a conceptually clear adaptive noise injection mechanism in combination with teacher-initialisation, which adjusts its degree of randomness dynamically through the computation of mini-batch statistics. This mechanism is embedded within a simple framework to obtain stochastic networks from existing deterministic networks. Our experiments show that our method is able to outperform prior baselines under white-box settings, exemplified through CIFAR-10 and CIFAR-100. Following which, we perform in-depth analysis on varying different components of training with our approach on the effects of robustness and accuracy, through the study of the evolution of decision boundary and trend curves of clean accuracy/attack success over differing degrees of stochasticity. We also shed light on the effects of adversarial training on a pre-trained network, through the lens of decision boundaries.

Optimal Transport As a Defense against Adversarial Attacks

Quentin Bouniot, Romaric Audigier, Angélique Loesch

Responsive image

Auto-TLDR; Sinkhorn Adversarial Training with Optimal Transport Theory

Slides Poster Similar

Deep learning classifiers are now known to have flaws in the representations of their class. Adversarial attacks can find a human-imperceptible perturbation for a given image that will mislead a trained model. The most effective methods to defend against such attacks trains on generated adversarial examples to learn their distribution. Previous work aimed to align original and adversarial image representations in the same way as domain adaptation to improve robustness. Yet, they partially align the representations using approaches that do not reflect the geometry of space and distribution. In addition, it is difficult to accurately compare robustness between defended models. Until now, they have been evaluated using a fixed perturbation size. However, defended models may react differently to variations of this perturbation size. In this paper, the analogy of domain adaptation is taken a step further by exploiting optimal transport theory. We propose to use a loss between distributions that faithfully reflect the ground distance. This leads to SAT (Sinkhorn Adversarial Training), a more robust defense against adversarial attacks. Then, we propose to quantify more precisely the robustness of a model to adversarial attacks over a wide range of perturbation sizes using a different metric, the Area Under the Accuracy Curve (AUAC). We perform extensive experiments on both CIFAR-10 and CIFAR-100 datasets and show that our defense is globally more robust than the state-of-the-art.

Beyond Cross-Entropy: Learning Highly Separable Feature Distributions for Robust and Accurate Classification

Arslan Ali, Andrea Migliorati, Tiziano Bianchi, Enrico Magli

Responsive image

Auto-TLDR; Gaussian class-conditional simplex loss for adversarial robust multiclass classifiers

Slides Poster Similar

Deep learning has shown outstanding performance in several applications including image classification. However, deep classifiers are known to be highly vulnerable to adversarial attacks, in that a minor perturbation of the input can easily lead to an error. Providing robustness to adversarial attacks is a very challenging task especially in problems involving a large number of classes, as it typically comes at the expense of an accuracy decrease. In this work, we propose the Gaussian class-conditional simplex (GCCS) loss: a novel approach for training deep robust multiclass classifiers that provides adversarial robustness while at the same time achieving or even surpassing the classification accuracy of state-of-the-art methods. Differently from other frameworks, the proposed method learns a mapping of the input classes onto target distributions in a latent space such that the classes are linearly separable. Instead of maximizing the likelihood of target labels for individual samples, our objective function pushes the network to produce feature distributions yielding high inter-class separation. The mean values of the distributions are centered on the vertices of a simplex such that each class is at the same distance from every other class. We show that the regularization of the latent space based on our approach yields excellent classification accuracy and inherently provides robustness to multiple adversarial attacks, both targeted and untargeted, outperforming state-of-the-art approaches over challenging datasets.

Cost-Effective Adversarial Attacks against Scene Text Recognition

Mingkun Yang, Haitian Zheng, Xiang Bai, Jiebo Luo

Responsive image

Auto-TLDR; Adversarial Attacks on Scene Text Recognition

Slides Poster Similar

Scene text recognition is a challenging task due to the diversity in text appearance and complexity of natural scenes. Thanks to the development of deep learning and the large volume of training data, scene text recognition has made impressive progress in recent years. However, recent research on adversarial examples has shown that deep learning models are vulnerable to adversarial input with imperceptible changes. As one of the most practical tasks in computer vision, scene text recognition is also facing huge security risks. To our best knowledge, there has been no work on adversarial attacks against scene text recognition. To investigate its effects on scene text recognition, we make the first attempt to attack the state-of-the-art scene text recognizer, i.e., attention-based recognizer. To that end, we first adjust the objective function designed for non-sequential tasks, such as image classification, semantic segmentation and image retrieval, to the sequential form. We then propose a novel and effective objective function to further reduce the amount of perturbation while achieving a higher attack success rate. Comprehensive experiments on several standard benchmarks clearly demonstrate effective adversarial effects on scene text recognition by the proposed attacks.

AdvHat: Real-World Adversarial Attack on ArcFace Face ID System

Stepan Komkov, Aleksandr Petiushko

Responsive image

Auto-TLDR; Adversarial Sticker Attack on ArcFace in Shooting Conditions

Slides Poster Similar

In this paper we propose a novel easily reproducible technique to attack the best public Face ID system ArcFace in different shooting conditions. To create an attack, we print the rectangular paper sticker on a common color printer and put it on the hat. The adversarial sticker is prepared with a novel algorithm for off-plane transformations of the image which imitates sticker location on the hat. Such an approach confuses the state-of-the-art public Face ID model LResNet100E-IR, ArcFace@ms1m-refine-v2 and is transferable to other Face ID models.

ESResNet: Environmental Sound Classification Based on Visual Domain Models

Andrey Guzhov, Federico Raue, Jörn Hees, Andreas Dengel

Responsive image

Auto-TLDR; Environmental Sound Classification with Short-Time Fourier Transform Spectrograms

Slides Poster Similar

Environmental Sound Classification (ESC) is an active research area in the audio domain and has seen a lot of progress in the past years. However, many of the existing approaches achieve high accuracy by relying on domain-specific features and architectures, making it harder to benefit from advances in other fields (e.g., the image domain). Additionally, some of the past successes have been attributed to a discrepancy of how results are evaluated (i.e., on unofficial splits of the UrbanSound8K (US8K) dataset), distorting the overall progression of the field. The contribution of this paper is twofold. First, we present a model that is inherently compatible with mono and stereo sound inputs. Our model is based on simple log-power Short-Time Fourier Transform (STFT) spectrograms and combines them with several well-known approaches from the image domain (i.e., ResNet, Siamese-like networks and attention). We investigate the influence of cross-domain pre-training, architectural changes, and evaluate our model on standard datasets. We find that our model out-performs all previously known approaches in a fair comparison by achieving accuracies of 97.0 % (ESC-10), 91.5 % (ESC-50) and 84.2 % / 85.4 % (US8K mono / stereo). Second, we provide a comprehensive overview of the actual state of the field, by differentiating several previously reported results on the US8K dataset between official or unofficial splits. For better reproducibility, our code (including any re-implementations) is made available.

Killing Four Birds with One Gaussian Process: The Relation between Different Test-Time Attacks

Kathrin Grosse, Michael Thomas Smith, Michael Backes

Responsive image

Auto-TLDR; Security of Gaussian Process Classifiers against Attack Algorithms

Slides Poster Similar

In machine learning (ML) security, attacks like evasion, model stealing or membership inference are generally studied in individually. Previous work has also shown a relationship between some attacks and decision function curvature of the targeted model. Consequently, we study an ML model allowing direct control over the decision surface curvature: Gaussian Process classifiers (GPCs). For evasion, we find that changing GPC's curvature to be robust against one attack algorithm boils down to enabling a different norm or attack algorithm to succeed. This is backed up by our formal analysis showing that static security guarantees are opposed to learning. Concerning intellectual property, we show formally that lazy learning does not necessarily leak all information when applied. In practice, often a seemingly secure curvature can be found. For example, we are able to secure GPC against empirical membership inference by proper configuration. In this configuration, however, the GPC's hyper-parameters are leaked, e.g. model reverse engineering succeeds. We conclude that attacks on classification should not be studied in isolation, but in relation to each other.

Verifying the Causes of Adversarial Examples

Honglin Li, Yifei Fan, Frieder Ganz, Tony Yezzi, Payam Barnaghi

Responsive image

Auto-TLDR; Exploring the Causes of Adversarial Examples in Neural Networks

Slides Poster Similar

The robustness of neural networks is challenged by adversarial examples that contain almost imperceptible perturbations to inputs which mislead a classifier to incorrect outputs in high confidence. Limited by the extreme difficulty in examining a high-dimensional image space thoroughly, research on explaining and justifying the causes of adversarial examples falls behind studies on attacks and defenses. In this paper, we present a collection of potential causes of adversarial examples and verify (or partially verify) them through carefully-designed controlled experiments. The major causes of adversarial examples include model linearity, one-sum constraint, and geometry of the categories. To control the effect of those causes, multiple techniques are applied such as $L_2$ normalization, replacement of loss functions, construction of reference datasets, and novel models using multi-layer perceptron probabilistic neural networks (MLP-PNN) and density estimation (DE). Our experiment results show that geometric factors tend to be more direct causes and statistical factors magnify the phenomenon, especially for assigning high prediction confidence. We hope this paper will inspire more studies to rigorously investigate the root causes of adversarial examples, which in turn provide useful guidance on designing more robust models.

Explain2Attack: Text Adversarial Attacks via Cross-Domain Interpretability

Mahmoud Hossam, Le Trung, He Zhao, Dinh Phung

Responsive image

Auto-TLDR; Transfer2Attack: A Black-box Adversarial Attack on Text Classification

Slides Poster Similar

Training robust deep learning models is a critical challenge for downstream tasks. Research has shown that common down-stream models can be easily fooled with adversarial inputs that look like the training data, but slightly perturbed, in a way imperceptible to humans. Understanding the behavior of natural language models under these attacks is crucial to better defend these models against such attacks. In the black-box attack setting, where no access to model parameters is available, the attacker can only query the output information from the targeted model to craft a successful attack. Current black-box state-of-the-art models are costly in both computational complexity and number of queries needed to craft successful adversarial examples. For real world scenarios, the number of queries is critical, where less queries are desired to avoid suspicion towards an attacking agent. In this paper, we propose Transfer2Attack, a black-box adversarial attack on text classification task, that employs cross-domain interpretability to reduce target model queries during attack. We show that our framework either achieves or out-performs attack rates of the state-of-the-art models, yet with lower queries cost and higher efficiency.

Which are the factors affecting the performance of audio surveillance systems?

Antonio Greco, Antonio Roberto, Alessia Saggese, Mario Vento

Responsive image

Auto-TLDR; Sound Event Recognition Using Convolutional Neural Networks and Visual Representations on MIVIA Audio Events

Slides Similar

Sound event recognition systems are rapidly becoming part of our life, since they can be profitably used in several vertical markets, ranging from audio security applications to scene classification and multi-modal analysis in social robotics. In the last years, a not negligible part of the scientific community started to apply Convolutional Neural Networks (CNNs) to image-based representations of the audio stream, due to their successful adoption in almost all the computer vision tasks. In this paper, we carry out a detailed benchmark of various widely used CNN architectures and visual representations on a popular dataset, namely the MIVIA Audio Events database. Our analysis is aimed at understanding how these factors affect the sound event recognition performance with a particular focus on the false positive rate, very relevant in audio surveillance solutions. In fact, although most of the proposed solutions achieve a high recognition rate, the capability of distinguishing the events-of-interest from the background is often not yet sufficient for real systems, and prevent its usage in real applications. Our comprehensive experimental analysis investigates this aspect and allows to identify useful design guidelines for increasing the specificity of sound event recognition systems.

Towards Explaining Adversarial Examples Phenomenon in Artificial Neural Networks

Ramin Barati, Reza Safabakhsh, Mohammad Rahmati

Responsive image

Auto-TLDR; Convolutional Neural Networks and Adversarial Training from the Perspective of convergence

Slides Poster Similar

In this paper, we study the adversarial examples existence and adversarial training from the standpoint of convergence and provide evidence that pointwise convergence in ANNs can explain these observations. The main contribution of our proposal is that it relates the objective of the evasion attacks and adversarial training with concepts already defined in learning theory. Also, we extend and unify some of the other proposals in the literature and provide alternative explanations on the observations made in those proposals. Through different experiments, we demonstrate that the framework is valuable in the study of the phenomenon and is applicable to real-world problems.

Task-based Focal Loss for Adversarially Robust Meta-Learning

Yufan Hou, Lixin Zou, Weidong Liu

Responsive image

Auto-TLDR; Task-based Adversarial Focal Loss for Few-shot Meta-Learner

Slides Poster Similar

Adversarial robustness of machine learning has been widely studied in recent years, and a series of effective methods are proposed to resist adversarial attacks. However, less attention is paid to few-shot meta-learners which are much more vulnerable due to the lack of training samples. In this paper, we propose Task-based Adversarial Focal Loss (TAFL) to handle this tough challenge on a typical meta-learner called MAML. More concretely, we regard few-shot classification tasks as normal samples in learning models and apply focal loss mechanism on them. Our proposed method focuses more on adversarially fragile tasks, leading to improvement on overall model robustness. Results of extensive experiments on several benchmarks demonstrate that TAFL can effectively promote the performance of the meta-learner on adversarial examples with elaborately designed perturbations.

Polynomial Universal Adversarial Perturbations for Person Re-Identification

Wenjie Ding, Xing Wei, Rongrong Ji, Xiaopeng Hong, Yihong Gong

Responsive image

Auto-TLDR; Polynomial Universal Adversarial Perturbation for Re-identification Methods

Slides Poster Similar

In this paper, we focus on Universal Adversarial Perturbations (UAP) attack on state-of-the-art person re-identification (Re-ID) methods. Existing UAP methods usually compute a perturbation image and add it to the images of interest. Such a simple constant form greatly limits the attack power. To address this problem, we extend the formulation of UAP to a polynomial form and propose the Polynomial Universal Adversarial Perturbation (PUAP). Unlike traditional UAP methods which only rely on the additive perturbation signal, the proposed PUAP consists of both an additive perturbation and a multiplicative modulation factor. The additive perturbation produces the fundamental component of the signal, while the multiplicative factor modulates the perturbation signal in line with the unit impulse pattern of the input image. Moreover, we design a Pearson correlation coefficient loss to generate universal perturbations, for disrupting the outputs of person Re-ID methods. Extensive experiments on DukeMTMC-ReID, Market-1501, and MARS show that the proposed method can efficiently improve the attack performance, especially when the magnitude of UAP is constrained to a small value.

CCA: Exploring the Possibility of Contextual Camouflage Attack on Object Detection

Shengnan Hu, Yang Zhang, Sumit Laha, Ankit Sharma, Hassan Foroosh

Responsive image

Auto-TLDR; Contextual camouflage attack for object detection

Slides Poster Similar

Deep neural network based object detection has become the cornerstone of many real-world applications. Along with this success comes concerns about its vulnerability to malicious attacks. To gain more insight into this issue, we propose a contextual camouflage attack (CCA for short) algorithm to influence the performance of object detectors. In this paper, we use an evolutionary search strategy and adversarial machine learning in interactions with a photo-realistic simulated environment to find camouflage patterns that are effective over a huge variety of object locations, camera poses, and lighting conditions. The proposed camouflages are validated effective to most of the state-of-the-art object detectors.

Attack-Agnostic Adversarial Detection on Medical Data Using Explainable Machine Learning

Matthew Watson, Noura Al Moubayed

Responsive image

Auto-TLDR; Explainability-based Detection of Adversarial Samples on EHR and Chest X-Ray Data

Slides Poster Similar

Explainable machine learning has become increasingly prevalent, especially in healthcare where explainable models are vital for ethical and trusted automated decision making. Work on the susceptibility of deep learning models to adversarial attacks has shown the ease of designing samples to mislead a model into making incorrect predictions. In this work, we propose an explainability-based method for the accurate detection of adversarial samples on two datasets with different complexity and properties: Electronic Health Record (EHR) and chest X-ray (CXR) data. On the MIMIC-III and Henan-Renmin EHR datasets, we report a detection accuracy of 77% against the Longitudinal Adversarial Attack. On the MIMIC-CXR dataset, we achieve an accuracy of 88%; significantly improving on the state of the art of adversarial detection in both datasets by over 10% in all settings. We propose an anomaly detection based method using explainability techniques to detect adversarial samples which is able to generalise to different attack methods without a need for retraining.

Ballroom Dance Recognition from Audio Recordings

Tomas Pavlin, Jan Cech, Jiri Matas

Responsive image

Auto-TLDR; A CNN-based approach to classify ballroom dances given audio recordings

Slides Poster Similar

We propose a CNN-based approach to classify ten genres of ballroom dances given audio recordings, five latin and five standard, namely Cha Cha Cha, Jive, Paso Doble, Rumba, Samba, Quickstep, Slow Foxtrot, Slow Waltz, Tango and Viennese Waltz. We utilize a spectrogram of an audio signal and we treat it as an image that is an input of the CNN. The classification is performed independently by 5-seconds spectrogram segments in sliding window fashion and the results are then aggregated. The method was tested on following datasets: Publicly available Extended Ballroom dataset collected by Marchand and Peeters, 2016 and two YouTube datasets collected by us, one in studio quality and the other, more challenging, recorded on mobile phones. The method achieved accuracy 93.9%, 96.7% and 89.8% respectively. The method runs in real-time. We implemented a web application to demonstrate the proposed method.

On-Manifold Adversarial Data Augmentation Improves Uncertainty Calibration

Kanil Patel, William Beluch, Dan Zhang, Michael Pfeiffer, Bin Yang

Responsive image

Auto-TLDR; On-Manifold Adversarial Data Augmentation for Uncertainty Estimation

Slides Similar

Uncertainty estimates help to identify ambiguous, novel, or anomalous inputs, but the reliable quantification of uncertainty has proven to be challenging for modern deep networks. To improve uncertainty estimation, we propose On-Manifold Adversarial Data Augmentation or OMADA, which specifically attempts to generate challenging examples by following an on-manifold adversarial attack path in the latent space of an autoencoder that closely approximates the decision boundaries between classes. On a variety of datasets and for multiple network architectures, OMADA consistently yields more accurate and better calibrated classifiers than baseline models, and outperforms competing approaches such as Mixup, as well as achieving similar performance to (at times better than) post-processing calibration methods such as temperature scaling. Variants of OMADA can employ different sampling schemes for ambiguous on-manifold examples based on the entropy of their estimated soft labels, which exhibit specific strengths for generalization, calibration of predicted uncertainty, or detection of out-of-distribution inputs.

Learning with Multiplicative Perturbations

Xiulong Yang, Shihao Ji

Responsive image

Auto-TLDR; XAT and xVAT: A Multiplicative Adversarial Training Algorithm for Robust DNN Training

Slides Poster Similar

Adversarial Training (AT) and Virtual Adversarial Training (VAT) are the regularization techniques that train Deep Neural Networks (DNNs) with adversarial examples generated by adding small but worst-case perturbations to input examples. In this paper, we propose xAT and xVAT, new adversarial training algorithms that generate multiplicative perturbations to input examples for robust training of DNNs. Such perturbations are much more perceptible and interpretable than their additive counterparts exploited by AT and VAT. Furthermore, the multiplicative perturbations can be generated transductively or inductively, while the standard AT and VAT only support a transductive implementation. We conduct a series of experiments that analyze the behavior of the multiplicative perturbations and demonstrate that xAT and xVAT match or outperform state-of-the-art classification accuracies across multiple established benchmarks while being about 30% faster than their additive counterparts. Our source code can be found at https://github.com/sndnyang/xvat

Feature Engineering and Stacked Echo State Networks for Musical Onset Detection

Peter Steiner, Azarakhsh Jalalvand, Simon Stone, Peter Birkholz

Responsive image

Auto-TLDR; Echo State Networks for Onset Detection in Music Analysis

Slides Poster Similar

In music analysis, one of the most fundamental tasks is note onset detection - detecting the beginning of new note events. As the target function of onset detection is related to other tasks, such as beat tracking or tempo estimation, onset detection is the basis for such related tasks. Furthermore, it can help to improve Automatic Music Transcription (AMT). Typically, different approaches for onset detection follow a similar outline: An audio signal is transformed into an Onset Detection Function (ODF), which should have rather low values (i.e. close to zero) for most of the time but with pronounced peaks at onset times, which can then be extracted by applying peak picking algorithms on the ODF. In the recent years, several kinds of neural networks were used successfully to compute the ODF from feature vectors. Currently, Convolutional Neural Networks (CNNs) define the state of the art. In this paper, we build up on an alternative approach to obtain a ODF by Echo State Networks (ESNs), which have achieved comparable results to CNNs in several tasks, such as speech and image recognition. In contrast to the typical iterative training procedures of deep learning architectures, such as CNNs or networks consisting of Long-Short-Term Memory Cells (LSTMs), in ESNs only a very small part of the weights is easily trained in one shot using linear regression. By comparing the performance of several feature extraction methods, pre-processing steps and introducing a new way to stack ESNs, we expand our previous approach to achieve results that fall between a bidirectional LSTM network and a CNN with relative improvements of 1.8% and -1.4%, respectively. For the evaluation, we used exactly the same 8-fold cross validation setup as for the reference results.

DenseRecognition of Spoken Languages

Jaybrata Chakraborty, Bappaditya Chakraborty, Ujjwal Bhattacharya

Responsive image

Auto-TLDR; DenseNet: A Dense Convolutional Network Architecture for Speech Recognition in Indian Languages

Slides Poster Similar

In the present study, we have, for the first time, con- sidered a large number of Indian languages for recog- nition from their audio signals of different sources. A dense convolutional network architecture (DenseNet) has been proposed for this classification problem. Dy- namic elimination of low energy frames from the input speech signal has been considered as a preprocessing operation. Mel-spectrogram of pre-processed speech signal is fed to a DenseNet architecture for recogni- tion of its language. Recognition performance of the proposed architecture has been compared with that of several state-of-the-art deep architectures which include a traditional convolutional neural network (CNN), multiple ResNet architectures, CNN-BLSTM and DenseNet-BLSTM hybrid architectures. Addition- ally, we obtained recognition performances of a stacked BLSTM architecture fed with different sets of hand- crafted features for comparison purpose. Simulations have been performed on two different standard datasets which include (i) IITKGP-MLILSC dataset of news clips in 27 different Indian languages and (ii) Linguistic Data Consortium (LDC) dataset of telephonic conver- sations in 5 different Indian languages. Recognition performance of the proposed framework has been found to be consistently and significantly better than all other frameworks implemented in this study.

Transferable Adversarial Attacks for Deep Scene Text Detection

Shudeng Wu, Tao Dai, Guanghao Meng, Bin Chen, Jian Lu, Shutao Xia

Responsive image

Auto-TLDR; Robustness of DNN-based STD methods against Adversarial Attacks

Slides Similar

Scene text detection (STD) aims to locate text in images and plays an important role in many computer vision tasks including automatic driving and text recognition systems. Recently, deep neural networks (DNNs) have been widely and successfully used in scene text detection, leading to plenty of DNN-based STD methods including regression-based and segmentation-based STD methods. However, recent studies have also shown that DNN is vulnerable to adversarial attacks, which can significantly degrade the performance of DNN models. In this paper, we investigate the robustness of DNN-based STD methods against adversarial attacks. To this end, we propose a generic and efficient attack method to generate adversarial examples, which are produced by adding small but imperceptible adversarial perturbation to the input images. Experiments on attacking four various models and a real-world STD engine of Google optical character recognition (OCR) show that the state-of-the-art DNN-based STD methods including regression-based and segmentation-based methods are vulnerable to adversarial attacks.

Delving in the Loss Landscape to Embed Robust Watermarks into Neural Networks

Enzo Tartaglione, Marco Grangetto, Davide Cavagnino, Marco Botta

Responsive image

Auto-TLDR; Watermark Aware Training of Neural Networks

Slides Poster Similar

In the last decade the use of artificial neural networks (ANNs) in many fields like image processing or speech recognition has become a common practice because of their effectiveness to solve complex tasks. However, in such a rush, very little attention has been paid to security aspects. In this work we explore the possibility to embed a watermark into the ANN parameters. We exploit model redundancy and adaptation capacity to lock a subset of its parameters to carry the watermark sequence. The watermark can be extracted in a simple way to claim copyright on models but can be very easily attacked with model fine-tuning. To tackle this culprit we devise a novel watermark aware training strategy. We aim at delving into the loss landscape to find an optimal configuration of the parameters such that we are robust to fine-tuning attacks towards the watermarked parameters. Our experimental results on classical ANN models trained on well-known MNIST and CIFAR-10 datasets show that the proposed approach makes the embedded watermark robust to fine-tuning and compression attacks.

Understanding Integrated Gradients with SmoothTaylor for Deep Neural Network Attribution

Gary Shing Wee Goh, Sebastian Lapuschkin, Leander Weber, Wojciech Samek, Alexander Binder

Responsive image

Auto-TLDR; SmoothGrad: bridging Integrated Gradients and SmoothGrad from the Taylor's theorem perspective

Slides Similar

Integrated Gradients as an attribution method for deep neural network models offers simple implementability. However, it suffers from noisiness of explanations which affects the ease of interpretability. The SmoothGrad technique is proposed to solve the noisiness issue and smoothen the attribution maps of any gradient-based attribution method. In this paper, we present SmoothTaylor as a novel theoretical concept bridging Integrated Gradients and SmoothGrad, from the Taylor's theorem perspective. We apply the methods to the image classification problem, using the ILSVRC2012 ImageNet object recognition dataset, and a couple of pretrained image models to generate attribution maps. These attribution maps are empirically evaluated using quantitative measures for sensitivity and noise level. We further propose adaptive noising to optimize for the noise scale hyperparameter value. From our experiments, we find that the SmoothTaylor approach together with adaptive noising is able to generate better quality saliency maps with lesser noise and higher sensitivity to the relevant points in the input space as compared to Integrated Gradients.

Digit Recognition Applied to Reconstructed Audio Signals Using Deep Learning

Anastasia-Sotiria Toufa, Constantine Kotropoulos

Responsive image

Auto-TLDR; Compressed Sensing for Digit Recognition in Audio Reconstruction

Poster Similar

Compressed sensing allows signal reconstruction from a few measurements. This work proposes a complete pipeline for digit recognition applied to audio reconstructed signals. The reconstruction procedure exploits the assumption that the original signal lies in the range of a generator. A pretrained generator of a Generative Adversarial Network generates audio digits. A new method for reconstruction is proposed, using only the most active segment of the signal, i.e., the segment with the highest energy. The underlying assumption is that such segment offers a more compact representation, preserving the meaningful content of signal. Cases when the reconstruction produces noise, instead of digit, are treated as outliers. In order to detect and reject them, three unsupervised indicators are used, namely, the total energy of reconstructed signal, the predictions of an one-class Support Vector Machine, and the confidence of a pretrained classifier used for recognition. This classifier is based on neural networks architectures and is pretrained on original audio recordings, employing three input representations, i.e., raw audio, spectrogram, and gammatonegram. Experiments are conducted, analyzing both the quality of reconstruction and the performance of classifiers in digit recognition, demonstrating that the proposed method yields higher performance in both the quality of reconstruction and digit recognition accuracy.

Audio-Based Near-Duplicate Video Retrieval with Audio Similarity Learning

Pavlos Avgoustinakis, Giorgos Kordopatis-Zilos, Symeon Papadopoulos, Andreas L. Symeonidis, Ioannis Kompatsiaris

Responsive image

Auto-TLDR; AuSiL: Audio Similarity Learning for Near-duplicate Video Retrieval

Slides Poster Similar

In this work, we address the problem of audio-based near-duplicate video retrieval. We propose the Audio Similarity Learning (AuSiL) approach that effectively captures temporal patterns of audio similarity between video pairs. For the robust similarity calculation between two videos, we first extract representative audio-based video descriptors by leveraging transfer learning based on a Convolutional Neural Network (CNN) trained on a large scale dataset of audio events, and then we calculate the similarity matrix derived from the pairwise similarity of these descriptors. The similarity matrix is subsequently fed to a CNN network that captures the temporal structures existing within its content. We train our network following a triplet generation process and optimizing the triplet loss function. To evaluate the effectiveness of the proposed approach, we have manually annotated two publicly available video datasets based on the audio duplicity between their videos. The proposed approach achieves very competitive results compared to three state-of-the-art methods. Also, unlike the competing methods, it is very robust for the retrieval of audio duplicates generated with speed transformations.

Hybrid Network for End-To-End Text-Independent Speaker Identification

Wajdi Ghezaiel, Luc Brun, Olivier Lezoray

Responsive image

Auto-TLDR; Text-Independent Speaker Identification with Scattering Wavelet Network and Convolutional Neural Networks

Slides Poster Similar

Deep learning has recently improved the performance of Speaker Identification (SI) systems. Promising results have been obtained with Convolutional Neural Networks (CNNs). This success are mostly driven by the advent of large datasets. However in the context of commercial applications, collection of large amount of training data is not always possible. In addition, robustness of a SI system is adversely effected by short utterances. SI with only a few and short utterances is a challenging problem. Therefore, in this paper, we propose a novel text-independent speaker identification system. The proposed system can identify speakers by learning from only few training short utterances examples. To achieve this, we combine CNN with Scattering Wavelet Network. We propose a two-stage feature extraction framework using a two-layer wavelet scattering network coupled with a CNN for SI system. The proposed architecture takes variable length speech segments. To evaluate the effectiveness of the proposed approach, Timit and Librispeech datasets are used in the experiments. These conducted experiments show that our hybrid architecture performs successfully for SI, even with a small number and short duration of training samples. In comparaison with related methods, the obtained results shows that an hybrid architecture achieve better performance.

Trainable Spectrally Initializable Matrix Transformations in Convolutional Neural Networks

Michele Alberti, Angela Botros, Schuetz Narayan, Rolf Ingold, Marcus Liwicki, Mathias Seuret

Responsive image

Auto-TLDR; Trainable and Spectrally Initializable Matrix Transformations for Neural Networks

Slides Poster Similar

In this work, we introduce a new architectural component to Neural Networks (NN), i.e., trainable and spectrally initializable matrix transformations on feature maps. While previous literature has already demonstrated the possibility of adding static spectral transformations as feature processors, our focus is on more general trainable transforms. We study the transforms in various architectural configurations on four datasets of different nature: from medical (ColorectalHist, HAM10000) and natural (Flowers) images to historical documents (CB55). With rigorous experiments that control for the number of parameters and randomness, we show that networks utilizing the introduced matrix transformations outperform vanilla neural networks. The observed accuracy increases appreciably across all datasets. In addition, we show that the benefit of spectral initialization leads to significantly faster convergence, as opposed to randomly initialized matrix transformations. The transformations are implemented as auto-differentiable PyTorch modules that can be incorporated into any neural network architecture. The entire code base is open-source.

Knowledge Distillation Beyond Model Compression

Fahad Sarfraz, Elahe Arani, Bahram Zonooz

Responsive image

Auto-TLDR; Knowledge Distillation from Teacher to Student

Slides Poster Similar

Knowledge distillation (KD) is commonly deemed as an effective model compression technique in which a compact model (student) is trained under the supervision of a larger pretrained model or an ensemble of models (teacher). Various techniques have been proposed since the original formulation, which mimics different aspects of the teacher such as the representation space, decision boundary or intra-data relationship. Some methods replace the one way knowledge distillation from a static teacher with collaborative learning between a cohort of students. Despite the recent advances, a clear understanding of where knowledge resides in a deep neural network and optimal method for capturing knowledge from teacher and transferring it to student still remains an open question. In this study we provide an extensive study on 9 different knowledge distillation methods which covers a broad spectrum of approaches to capture and transfer knowledge. We demonstrate the versatility of the KD framework on different datasets and network architectures under varying capacity gaps between the teacher and student. The study provides intuition for the effects of mimicking different aspects of the teacher and derives insights from the performance of the different distillation approaches to guide the the design of more effective KD methods . Furthermore, our study shows the effectiveness of the KD framework in learning efficiently under varying severity levels of label noise and class imbalance, consistently providing significant generalization gains over standard training. We emphasize that the efficacy of KD goes much beyond a model compression technique and should be considered as a general purpose training paradigm which offers more robustness to common challenges in the real-world datasets compared to the standard training procedure.

The Effect of Spectrogram Reconstruction on Automatic Music Transcription: An Alternative Approach to Improve Transcription Accuracy

Kin Wai Cheuk, Yin-Jyun Luo, Emmanouil Benetos, Herremans Dorien

Responsive image

Auto-TLDR; Exploring the effect of spectrogram reconstruction loss on automatic music transcription

Slides Similar

Most of the state-of-the-art automatic music transcription (AMT) models break down the main transcription task into sub-tasks such as onset prediction and offset prediction and train them with onset and offset labels. These predictions are then concatenated together and used as the input to train another model with the pitch labels to obtain the final transcription. We attempt to use only the pitch labels (together with spectrogram reconstruction loss) and explore how far this model can go without introducing supervised sub-tasks. In this paper, we do not aim at achieving state-of-the-art transcription accuracy, instead, we explore the effect that spectrogram reconstruction has on our AMT model. Our proposed model consists of two U-nets: the first U-net transcribes the spectrogram into a posteriorgram, and a second U-net transforms the posteriorgram back into a spectrogram. A reconstruction loss is applied between the original spectrogram and the reconstructed spectrogram to constrain the second U-net to focus only on reconstruction. We train our model on different datasets including MAPS, MAESTRO, and MusicNet. Our experiments show that adding the reconstruction loss can generally improve the note-level transcription accuracy when compared to the same model without the reconstruction part. Moreover, it can also boost the frame-level precision to be higher than the state-of-the-art models. The feature maps learned by our u-net contain gridlike structures (not present in the baseline model) which implies that with the present of reconstruction loss, the model is probably trying to count along both the time and frequency axis, resulting in a higher note-level transcription accuracy.

On the Robustness of 3D Human Pose Estimation

Zerui Chen, Yan Huang, Liang Wang

Responsive image

Auto-TLDR; Robustness of 3D Human Pose Estimation Methods to Adversarial Attacks

Slides Similar

It is widely shown that Convolutional Neural Networks (CNNs) are vulnerable to adversarial examples on most recognition tasks, such as image classification and segmentation. However, few work studies the more complicated task -- 3D human pose estimation. This task often requires large-scale datasets, specialized network architectures, and it can be solved either from single-view RGB images or from multi-view RGB images. In this paper, we make the first attempt to investigate the robustness of current state-of-the-art 3D human pose estimation methods. To this end, we build four representative baseline models, where most of the current methods can be generally classified as one of them. Furthermore, we design targeted adversarial attacks to detect whether 3D pose estimators are robust to different camera parameters. For different types of methods, we present a comprehensive study of their robustness on the large-scale \emph{Human3.6M} benchmark. Our work shows that different methods vary significantly in their resistance to adversarial attacks. Through extensive experiments, we show that multi-view 3D pose estimators can be more vulnerable to adversarial examples. We believe that our efforts can shed light on future works to design more robust 3D human pose estimators.

S2I-Bird: Sound-To-Image Generation of Bird Species Using Generative Adversarial Networks

Joo Yong Shim, Joongheon Kim, Jong-Kook Kim

Responsive image

Auto-TLDR; Generating bird images from sound using conditional generative adversarial networks

Slides Poster Similar

Generating images from sound is a challenging task. This paper proposes a novel deep learning model that generates bird images from their corresponding sound information. Our proposed model includes a sound encoder in order to extract suitable feature representations from audio recordings, and then it generates bird images that corresponds to its calls using conditional generative adversarial networks (GANs) with auxiliary classifiers. We demonstrate that our model produces better image generation results which outperforms other state-of-the-art methods in a similar context.

Auto Encoding Explanatory Examples with Stochastic Paths

Cesar Ali Ojeda Marin, Ramses J. Sanchez, Kostadin Cvejoski, Bogdan Georgiev

Responsive image

Auto-TLDR; Semantic Stochastic Path: Explaining a Classifier's Decision Making Process using latent codes

Slides Poster Similar

In this paper we ask for the main factors that determine a classifier's decision making process and uncover such factors by studying latent codes produced by auto-encoding frameworks. To deliver an explanation of a classifier's behaviour, we propose a method that provides series of examples highlighting semantic differences between the classifier's decisions. These examples are generated through interpolations in latent space. We introduce and formalize the notion of a semantic stochastic path, as a suitable stochastic process defined in feature (data) space via latent code interpolations. We then introduce the concept of semantic Lagrangians as a way to incorporate the desired classifier's behaviour and find that the solution of the associated variational problem allows for highlighting differences in the classifier decision. Very importantly, within our framework the classifier is used as a black-box, and only its evaluation is required.

Adversarial Training for Aspect-Based Sentiment Analysis with BERT

Akbar Karimi, Andrea Prati, Leonardo Rossi

Responsive image

Auto-TLDR; Adversarial Training of BERT for Aspect-Based Sentiment Analysis

Slides Poster Similar

Aspect-Based Sentiment Analysis (ABSA) studies the extraction of sentiments and their targets. Collecting labeled data for this task in order to help neural networks generalize better can be laborious and time-consuming. As an alternative, similar data to the real-world examples can be produced artificially through an adversarial process which is carried out in the embedding space. Although these examples are not real sentences, they have been shown to act as a regularization method which can make neural networks more robust. In this work, we fine-tune the general purpose BERT and domain specific post-trained BERT (BERT-PT) using adversarial training. After improving the results of post-trained BERT with different hyperparameters, we propose a novel architecture called BERT Adversarial Training (BAT) to utilize adversarial training for the two major tasks of Aspect Extraction and Aspect Sentiment Classification in sentiment analysis. The proposed model outperforms the general BERT as well as the in-domain post-trained BERT in both tasks. To the best of our knowledge, this is the first study on the application of adversarial training in ABSA. The code is publicly available on a GitHub repository at https://github.com/IMPLabUniPr/Adversarial-Training-fo r-ABSA

Boundary Optimised Samples Training for Detecting Out-Of-Distribution Images

Luca Marson, Vladimir Li, Atsuto Maki

Responsive image

Auto-TLDR; Boundary Optimised Samples for Out-of-Distribution Input Detection in Deep Convolutional Networks

Slides Poster Similar

This paper presents a new approach to the problem of detecting out-of-distribution (OOD) inputs in image classifications with deep convolutional networks. We leverage so-called boundary samples to enforce low confidence (maximum softmax probabilities) for inputs far away from the training data. In particular, we propose the boundary optimised samples (named BoS) training algorithm for generating them. Unlike existing approaches, it does not require extra generative adversarial network, but achieves the goal by simply back propagating the gradient of an appropriately designed loss function to the input samples. At the end of the BoS training, all the boundary samples are in principle located on a specific level hypersurface with respect to the designed loss. Our contributions are i) the BoS training as an efficient alternative to generate boundary samples, ii) a robust algorithm therewith to enforce low confidence for OOD samples, and iii) experiments demonstrating improved OOD detection over the baseline. We show the performance using standard datasets for training and different test sets including Fashion MNIST, EMNIST, SVHN, and CIFAR-100, preceded by evaluations with a synthetic 2-dimensional dataset that provide an insight for the new procedure.

A Joint Representation Learning and Feature Modeling Approach for One-Class Recognition

Pramuditha Perera, Vishal Patel

Responsive image

Auto-TLDR; Combining Generative Features and One-Class Classification for Effective One-class Recognition

Slides Poster Similar

One-class recognition is traditionally approached either as a representation learning problem or a feature modelling problem. In this work, we argue that both of these approaches have their own limitations; and a more effective solution can be obtained by combining the two. The proposed approach is based on the combination of a generative framework and a one-class classification method. First, we learn generative features using the one-class data with a generative framework. We augment the learned features with the corresponding reconstruction errors to obtain augmented features. Then, we qualitatively identify a suitable feature distribution that reduces the redundancy in the chosen classifier space. Finally, we force the augmented features to take the form of this distribution using an adversarial framework. We test the effectiveness of the proposed method on three one-class classification tasks and obtain state-of-the-art results.

Improving Mix-And-Separate Training in Audio-Visual Sound Source Separation with an Object Prior

Quan Nguyen, Simone Frintrop, Timo Gerkmann, Mikko Lauri, Julius Richter

Responsive image

Auto-TLDR; Object-Prior: Learning the 1-to-1 correspondence between visual and audio signals by audio- visual sound source methods

Slides Similar

The performance of an audio-visual sound source separation system is determined by its ability to separate audio sources given images of the sources and the audio mixture. The goal of this study is to investigate the ability to learn the mapping between the sounds and the images of instruments by audio- visual sound source separation methods based on the state-of-the- art PixelPlayer [1]. Theoretical and empirical analyses illustrate that the PixelPlayer is not properly trained to learn the 1-to- 1 correspondence between visual and audio signals during its mix-and-separate training process. Based on the insights from this analysis, a weakly-supervised method called Object-Prior is proposed and evaluated on two audio-visual datasets. The experimental results show that the proposed Object-Prior method outperforms the PixelPlayer and other baselines in the audio- visual sound source separation task. It is also more robust against asynchronized data, where the frame and the audio do not come from the same video, and recognizes musical instruments based on their sound with higher accuracy than the PixelPlayer. This indicates that learning the 1-to-1 correspondence between visual and audio features of an instrument improves the effectiveness of audio-visual sound source separation.

ResMax: Detecting Voice Spoofing Attacks with Residual Network and Max Feature Map

Il-Youp Kwak, Sungsu Kwag, Junhee Lee, Jun Ho Huh, Choong-Hoon Lee, Youngbae Jeon, Jeonghwan Hwang, Ji Won Yoon

Responsive image

Auto-TLDR; ASVspoof 2019: A Lightweight Automatic Speaker Verification Spoofing and Countermeasures System

Slides Poster Similar

The ``2019 Automatic Speaker Verification Spoofing And Countermeasures Challenge'' (ASVspoof) competition aimed to facilitate the design of highly accurate voice spoofing attack detection systems. the competition did not emphasize model complexity and latency requirements; such constraints are strict and integral in real-world deployment. Hence, most of the top performing solutions from the competition all used an ensemble approach, and combined multiple complex deep learning models to maximize detection accuracy -- this kind of approach would sit uneasily with real-world deployment constraints. To design a lightweight system, we combined the notions of skip connection (from ResNet) and max feature map (from Light CNN), and evaluated the accuracy of the system using the ASVspoof 2019 dataset. With an optimized constant Q transform (CQT) feature, our single model achieved a replay attack detection equal error rate (EER) of 0.37% on the evaluation set, outperforming the top ensemble system from the competition that achieved an EER of 0.39%.

AttendAffectNet: Self-Attention Based Networks for Predicting Affective Responses from Movies

Thi Phuong Thao Ha, Bt Balamurali, Herremans Dorien, Roig Gemma

Responsive image

Auto-TLDR; AttendAffectNet: A Self-Attention Based Network for Emotion Prediction from Movies

Slides Poster Similar

In this work, we propose different variants of the self-attention based network for emotion prediction from movies, which we call AttendAffectNet. We take both audio and video into account and incorporate the relation among multiple modalities by applying self-attention mechanism in a novel manner into the extracted features for emotion prediction. We compare it to the typically temporal integration of the self-attention based model, which in our case, allows to capture the relation of temporal representations of the movie while considering the sequential dependencies of emotion responses. We demonstrate the effectiveness of our proposed architectures on the extended COGNIMUSE dataset [1], [2] and the MediaEval 2016 Emotional Impact of Movies Task [3], which consist of movies with emotion annotations. Our results show that applying the self-attention mechanism on the different audio-visual features, rather than in the time domain, is more effective for emotion prediction. Our approach is also proven to outperform state-of-the-art models for emotion prediction.

Norm Loss: An Efficient yet Effective Regularization Method for Deep Neural Networks

Theodoros Georgiou, Sebastian Schmitt, Thomas Baeck, Wei Chen, Michael Lew

Responsive image

Auto-TLDR; Weight Soft-Regularization with Oblique Manifold for Convolutional Neural Network Training

Slides Poster Similar

Convolutional neural network training can suffer from diverse issues like exploding or vanishing gradients, scaling-based weight space symmetry and covariant-shift. In order to address these issues, researchers develop weight regularization methods and activation normalization methods. In this work we propose a weight soft-regularization method based on the Oblique manifold. The proposed method uses a loss function which pushes each weight vector to have a norm close to one, i.e. the weight matrix is smoothly steered toward the so-called Oblique manifold. We evaluate our method on the very popular CIFAR-10, CIFAR-100 and ImageNet 2012 datasets using two state-of-the-art architectures, namely the ResNet and wide-ResNet. Our method introduces negligible computational overhead and the results show that it is competitive to the state-of-the-art and in some cases superior to it. Additionally, the results are less sensitive to hyperparameter settings such as batch size and regularization factor.

How Does DCNN Make Decisions?

Yi Lin, Namin Wang, Xiaoqing Ma, Ziwei Li, Gang Bai

Responsive image

Auto-TLDR; Exploring Deep Convolutional Neural Network's Decision-Making Interpretability

Slides Poster Similar

Deep Convolutional Neural Networks (DCNN), despite imitating the human visual system, present no such decision credibility as human observers. This phenomenon, therefore, leads to the limitations of DCNN's applications in the security and trusted computing, such as self-driving cars and medical diagnosis. Focusing on this issue, our work aims to explore the way DCNN makes decisions. In this paper, the major contributions we made are: firstly, provide the hypothesis, “point-wise activation” of convolution function, according to the analysis of DCNN’s architectures and training process; secondly, point out the effect of “point-wise activation” on DCNN’s uninterpretable classification and pool robustness, and then suggest, in particular, the contradiction between the traditional and DCNN’s convolution kernel functions; finally, distinguish decision-making interpretability from semantic interpretability, and indicate that DCNN’s decision-making mechanism need to evolve towards the direction of semantics in the future. Besides, the “point-wise activation” hypothesis and conclusions proposed in our paper are supported by extensive experimental results.

Discriminative Multi-Level Reconstruction under Compact Latent Space for One-Class Novelty Detection

Jaewoo Park, Yoon Gyo Jung, Andrew Teoh

Responsive image

Auto-TLDR; Discriminative Compact AE for One-Class novelty detection and Adversarial Example Detection

Slides Similar

In one-class novelty detection, a model learns solely on the in-class data to single out out-class instances. Autoencoder (AE) variants aim to compactly model the in-class data to reconstruct it exclusively, thus differentiating the in-class from out-class by the reconstruction error. However, compact modeling in an improper way might collapse the latent representations of the in-class data and thus their reconstruction, which would lead to performance deterioration. Moreover, to properly measure the reconstruction error of high-dimensional data, a metric is required that captures high-level semantics of the data. To this end, we propose Discriminative Compact AE (DCAE) that learns both compact and collapse-free latent representations of the in-class data, thereby reconstructing them both finely and exclusively. In DCAE, (a) we force a compact latent space to bijectively represent the in-class data by reconstructing them through internal discriminative layers of generative adversarial nets. (b) Based on the deep encoder's vulnerability to open set risk, out-class instances are encoded into the same compact latent space and reconstructed poorly without sacrificing the quality of in-class data reconstruction. (c) In inference, the reconstruction error is measured by a novel metric that computes the dissimilarity between a query and its reconstruction based on the class semantics captured by the internal discriminator. Extensive experiments on public image datasets validate the effectiveness of our proposed model on both novelty and adversarial example detection, delivering state-of-the-art performance.

Improving Gravitational Wave Detection with 2D Convolutional Neural Networks

Siyu Fan, Yisen Wang, Yuan Luo, Alexander Michael Schmitt, Shenghua Yu

Responsive image

Auto-TLDR; Two-dimensional Convolutional Neural Networks for Gravitational Wave Detection from Time Series with Background Noise

Poster Similar

Sensitive gravitational wave (GW) detectors such as that of Laser Interferometer Gravitational-wave Observatory (LIGO) realize the direct observation of GW signals that confirm Einstein's general theory of relativity. However, it remains challenges to quickly detect faint GW signals from a large number of time series with background noise under unknown probability distributions. Traditional methods such as matched-filtering in general assume Additive White Gaussian Noise (AWGN) and are far from being real-time due to its high computational complexity. To avoid these weaknesses, one-dimensional (1D) Convolutional Neural Networks (CNNs) are introduced to achieve fast online detection in milliseconds but do not have enough consideration on the trade-off between the frequency and time features, which will be revisited in this paper through data pre-processing and subsequent two-dimensional (2D) CNNs during offline training to improve the online detection sensitivity. In this work, the input data is pre-processed to form a 2D spectrum by Short-time Fourier transform (STFT), where frequency features are extracted without learning. Then, carrying out two 1D convolutions across time and frequency axes respectively, and concatenating the time-amplitude and frequency-amplitude feature maps with equal proportion subsequently, the frequency and time features are treated equally as the input of our following two-dimensional CNNs. The simulation of our above ideas works on a generated data set with uniformly varying SNR (2-17), which combines the GW signal generated by PYCBC and the background noise sampled directly from LIGO. Satisfying the real-time online detection requirement without noise distribution assumption, the experiments of this paper demonstrate better performance in average compared to that of 1D CNNs, especially in the cases of lower SNR (4-9).