ESResNet: Environmental Sound Classification Based on Visual Domain Models

Andrey Guzhov, Federico Raue, Jörn Hees, Andreas Dengel

Responsive image

Auto-TLDR; Environmental Sound Classification with Short-Time Fourier Transform Spectrograms

Slides Poster

Environmental Sound Classification (ESC) is an active research area in the audio domain and has seen a lot of progress in the past years. However, many of the existing approaches achieve high accuracy by relying on domain-specific features and architectures, making it harder to benefit from advances in other fields (e.g., the image domain). Additionally, some of the past successes have been attributed to a discrepancy of how results are evaluated (i.e., on unofficial splits of the UrbanSound8K (US8K) dataset), distorting the overall progression of the field. The contribution of this paper is twofold. First, we present a model that is inherently compatible with mono and stereo sound inputs. Our model is based on simple log-power Short-Time Fourier Transform (STFT) spectrograms and combines them with several well-known approaches from the image domain (i.e., ResNet, Siamese-like networks and attention). We investigate the influence of cross-domain pre-training, architectural changes, and evaluate our model on standard datasets. We find that our model out-performs all previously known approaches in a fair comparison by achieving accuracies of 97.0 % (ESC-10), 91.5 % (ESC-50) and 84.2 % / 85.4 % (US8K mono / stereo). Second, we provide a comprehensive overview of the actual state of the field, by differentiating several previously reported results on the US8K dataset between official or unofficial splits. For better reproducibility, our code (including any re-implementations) is made available.

Similar papers

Which are the factors affecting the performance of audio surveillance systems?

Antonio Greco, Antonio Roberto, Alessia Saggese, Mario Vento

Responsive image

Auto-TLDR; Sound Event Recognition Using Convolutional Neural Networks and Visual Representations on MIVIA Audio Events

Slides Similar

Sound event recognition systems are rapidly becoming part of our life, since they can be profitably used in several vertical markets, ranging from audio security applications to scene classification and multi-modal analysis in social robotics. In the last years, a not negligible part of the scientific community started to apply Convolutional Neural Networks (CNNs) to image-based representations of the audio stream, due to their successful adoption in almost all the computer vision tasks. In this paper, we carry out a detailed benchmark of various widely used CNN architectures and visual representations on a popular dataset, namely the MIVIA Audio Events database. Our analysis is aimed at understanding how these factors affect the sound event recognition performance with a particular focus on the false positive rate, very relevant in audio surveillance solutions. In fact, although most of the proposed solutions achieve a high recognition rate, the capability of distinguishing the events-of-interest from the background is often not yet sufficient for real systems, and prevent its usage in real applications. Our comprehensive experimental analysis investigates this aspect and allows to identify useful design guidelines for increasing the specificity of sound event recognition systems.

Ballroom Dance Recognition from Audio Recordings

Tomas Pavlin, Jan Cech, Jiri Matas

Responsive image

Auto-TLDR; A CNN-based approach to classify ballroom dances given audio recordings

Slides Poster Similar

We propose a CNN-based approach to classify ten genres of ballroom dances given audio recordings, five latin and five standard, namely Cha Cha Cha, Jive, Paso Doble, Rumba, Samba, Quickstep, Slow Foxtrot, Slow Waltz, Tango and Viennese Waltz. We utilize a spectrogram of an audio signal and we treat it as an image that is an input of the CNN. The classification is performed independently by 5-seconds spectrogram segments in sliding window fashion and the results are then aggregated. The method was tested on following datasets: Publicly available Extended Ballroom dataset collected by Marchand and Peeters, 2016 and two YouTube datasets collected by us, one in studio quality and the other, more challenging, recorded on mobile phones. The method achieved accuracy 93.9%, 96.7% and 89.8% respectively. The method runs in real-time. We implemented a web application to demonstrate the proposed method.

Mood Detection Analyzing Lyrics and Audio Signal Based on Deep Learning Architectures

Konstantinos Pyrovolakis, Paraskevi Tzouveli, Giorgos Stamou

Responsive image

Auto-TLDR; Automated Music Mood Detection using Music Information Retrieval

Slides Poster Similar

Digital era has changed the way music is produced and propagated creating new needs for automated and more effective management of music tracks in big volumes. Automated music mood detection constitutes an active task in the field of MIR (Music Information Retrieval) and connected with many research papers in the past few years. In order to approach the task of mood detection, we faced separately the analysis of musical lyrics and the analysis of musical audio signal. Then we applied a uniform multichannel analysis to classify our data in mood classes. The available data we will use to train and evaluate our models consists of a total of 2.000 song titles, classified in four mood classes {happy, angry, sad, relaxed}. The result of this process leads to a uniform prediction for emotional arousal that a music track can cause to a listener and show the way to develop many applications.

Feature Engineering and Stacked Echo State Networks for Musical Onset Detection

Peter Steiner, Azarakhsh Jalalvand, Simon Stone, Peter Birkholz

Responsive image

Auto-TLDR; Echo State Networks for Onset Detection in Music Analysis

Slides Poster Similar

In music analysis, one of the most fundamental tasks is note onset detection - detecting the beginning of new note events. As the target function of onset detection is related to other tasks, such as beat tracking or tempo estimation, onset detection is the basis for such related tasks. Furthermore, it can help to improve Automatic Music Transcription (AMT). Typically, different approaches for onset detection follow a similar outline: An audio signal is transformed into an Onset Detection Function (ODF), which should have rather low values (i.e. close to zero) for most of the time but with pronounced peaks at onset times, which can then be extracted by applying peak picking algorithms on the ODF. In the recent years, several kinds of neural networks were used successfully to compute the ODF from feature vectors. Currently, Convolutional Neural Networks (CNNs) define the state of the art. In this paper, we build up on an alternative approach to obtain a ODF by Echo State Networks (ESNs), which have achieved comparable results to CNNs in several tasks, such as speech and image recognition. In contrast to the typical iterative training procedures of deep learning architectures, such as CNNs or networks consisting of Long-Short-Term Memory Cells (LSTMs), in ESNs only a very small part of the weights is easily trained in one shot using linear regression. By comparing the performance of several feature extraction methods, pre-processing steps and introducing a new way to stack ESNs, we expand our previous approach to achieve results that fall between a bidirectional LSTM network and a CNN with relative improvements of 1.8% and -1.4%, respectively. For the evaluation, we used exactly the same 8-fold cross validation setup as for the reference results.

Hybrid Network for End-To-End Text-Independent Speaker Identification

Wajdi Ghezaiel, Luc Brun, Olivier Lezoray

Responsive image

Auto-TLDR; Text-Independent Speaker Identification with Scattering Wavelet Network and Convolutional Neural Networks

Slides Poster Similar

Deep learning has recently improved the performance of Speaker Identification (SI) systems. Promising results have been obtained with Convolutional Neural Networks (CNNs). This success are mostly driven by the advent of large datasets. However in the context of commercial applications, collection of large amount of training data is not always possible. In addition, robustness of a SI system is adversely effected by short utterances. SI with only a few and short utterances is a challenging problem. Therefore, in this paper, we propose a novel text-independent speaker identification system. The proposed system can identify speakers by learning from only few training short utterances examples. To achieve this, we combine CNN with Scattering Wavelet Network. We propose a two-stage feature extraction framework using a two-layer wavelet scattering network coupled with a CNN for SI system. The proposed architecture takes variable length speech segments. To evaluate the effectiveness of the proposed approach, Timit and Librispeech datasets are used in the experiments. These conducted experiments show that our hybrid architecture performs successfully for SI, even with a small number and short duration of training samples. In comparaison with related methods, the obtained results shows that an hybrid architecture achieve better performance.

Trainable Spectrally Initializable Matrix Transformations in Convolutional Neural Networks

Michele Alberti, Angela Botros, Schuetz Narayan, Rolf Ingold, Marcus Liwicki, Mathias Seuret

Responsive image

Auto-TLDR; Trainable and Spectrally Initializable Matrix Transformations for Neural Networks

Slides Poster Similar

In this work, we introduce a new architectural component to Neural Networks (NN), i.e., trainable and spectrally initializable matrix transformations on feature maps. While previous literature has already demonstrated the possibility of adding static spectral transformations as feature processors, our focus is on more general trainable transforms. We study the transforms in various architectural configurations on four datasets of different nature: from medical (ColorectalHist, HAM10000) and natural (Flowers) images to historical documents (CB55). With rigorous experiments that control for the number of parameters and randomness, we show that networks utilizing the introduced matrix transformations outperform vanilla neural networks. The observed accuracy increases appreciably across all datasets. In addition, we show that the benefit of spectral initialization leads to significantly faster convergence, as opposed to randomly initialized matrix transformations. The transformations are implemented as auto-differentiable PyTorch modules that can be incorporated into any neural network architecture. The entire code base is open-source.

Audio-Based Near-Duplicate Video Retrieval with Audio Similarity Learning

Pavlos Avgoustinakis, Giorgos Kordopatis-Zilos, Symeon Papadopoulos, Andreas L. Symeonidis, Ioannis Kompatsiaris

Responsive image

Auto-TLDR; AuSiL: Audio Similarity Learning for Near-duplicate Video Retrieval

Slides Poster Similar

In this work, we address the problem of audio-based near-duplicate video retrieval. We propose the Audio Similarity Learning (AuSiL) approach that effectively captures temporal patterns of audio similarity between video pairs. For the robust similarity calculation between two videos, we first extract representative audio-based video descriptors by leveraging transfer learning based on a Convolutional Neural Network (CNN) trained on a large scale dataset of audio events, and then we calculate the similarity matrix derived from the pairwise similarity of these descriptors. The similarity matrix is subsequently fed to a CNN network that captures the temporal structures existing within its content. We train our network following a triplet generation process and optimizing the triplet loss function. To evaluate the effectiveness of the proposed approach, we have manually annotated two publicly available video datasets based on the audio duplicity between their videos. The proposed approach achieves very competitive results compared to three state-of-the-art methods. Also, unlike the competing methods, it is very robust for the retrieval of audio duplicates generated with speed transformations.

One-Shot Learning for Acoustic Identification of Bird Species in Non-Stationary Environments

Michelangelo Acconcjaioco, Stavros Ntalampiras

Responsive image

Auto-TLDR; One-shot Learning in the Bioacoustics Domain using Siamese Neural Networks

Slides Poster Similar

This work introduces the one-shot learning paradigm in the computational bioacoustics domain. Even though, most of the related literature assumes availability of data characterizing the entire class dictionary of the problem at hand, that is rarely true as a habitat's species composition is only known up to a certain extent. Thus, the problem needs to be addressed by methodologies able to cope with non-stationarity. To this end, we propose a framework able to detect changes in the class dictionary and incorporate new classes on the fly. We design an one-shot learning architecture composed of a Siamese Neural Network operating in the logMel spectrogram space. We extensively examine the proposed approach on two datasets of various bird species using suitable figures of merit. Interestingly, such a learning scheme exhibits state of the art performance, while taking into account extreme non-stationarity cases.

DenseRecognition of Spoken Languages

Jaybrata Chakraborty, Bappaditya Chakraborty, Ujjwal Bhattacharya

Responsive image

Auto-TLDR; DenseNet: A Dense Convolutional Network Architecture for Speech Recognition in Indian Languages

Slides Poster Similar

In the present study, we have, for the first time, con- sidered a large number of Indian languages for recog- nition from their audio signals of different sources. A dense convolutional network architecture (DenseNet) has been proposed for this classification problem. Dy- namic elimination of low energy frames from the input speech signal has been considered as a preprocessing operation. Mel-spectrogram of pre-processed speech signal is fed to a DenseNet architecture for recogni- tion of its language. Recognition performance of the proposed architecture has been compared with that of several state-of-the-art deep architectures which include a traditional convolutional neural network (CNN), multiple ResNet architectures, CNN-BLSTM and DenseNet-BLSTM hybrid architectures. Addition- ally, we obtained recognition performances of a stacked BLSTM architecture fed with different sets of hand- crafted features for comparison purpose. Simulations have been performed on two different standard datasets which include (i) IITKGP-MLILSC dataset of news clips in 27 different Indian languages and (ii) Linguistic Data Consortium (LDC) dataset of telephonic conver- sations in 5 different Indian languages. Recognition performance of the proposed framework has been found to be consistently and significantly better than all other frameworks implemented in this study.

The Effect of Spectrogram Reconstruction on Automatic Music Transcription: An Alternative Approach to Improve Transcription Accuracy

Kin Wai Cheuk, Yin-Jyun Luo, Emmanouil Benetos, Herremans Dorien

Responsive image

Auto-TLDR; Exploring the effect of spectrogram reconstruction loss on automatic music transcription

Slides Similar

Most of the state-of-the-art automatic music transcription (AMT) models break down the main transcription task into sub-tasks such as onset prediction and offset prediction and train them with onset and offset labels. These predictions are then concatenated together and used as the input to train another model with the pitch labels to obtain the final transcription. We attempt to use only the pitch labels (together with spectrogram reconstruction loss) and explore how far this model can go without introducing supervised sub-tasks. In this paper, we do not aim at achieving state-of-the-art transcription accuracy, instead, we explore the effect that spectrogram reconstruction has on our AMT model. Our proposed model consists of two U-nets: the first U-net transcribes the spectrogram into a posteriorgram, and a second U-net transforms the posteriorgram back into a spectrogram. A reconstruction loss is applied between the original spectrogram and the reconstructed spectrogram to constrain the second U-net to focus only on reconstruction. We train our model on different datasets including MAPS, MAESTRO, and MusicNet. Our experiments show that adding the reconstruction loss can generally improve the note-level transcription accuracy when compared to the same model without the reconstruction part. Moreover, it can also boost the frame-level precision to be higher than the state-of-the-art models. The feature maps learned by our u-net contain gridlike structures (not present in the baseline model) which implies that with the present of reconstruction loss, the model is probably trying to count along both the time and frequency axis, resulting in a higher note-level transcription accuracy.

Adversarially Training for Audio Classifiers

Raymel Alfonso Sallo, Mohammad Esmaeilpour, Patrick Cardinal

Responsive image

Auto-TLDR; Adversarially Training for Robust Neural Networks against Adversarial Attacks

Slides Poster Similar

In this paper, we investigate the potential effect of the adversarially training on the robustness of six advanced deep neural networks against a variety of targeted and non-targeted adversarial attacks. We firstly show that, the ResNet-56 model trained on the 2D representation of the discrete wavelet transform appended with the tonnetz chromagram outperforms other models in terms of recognition accuracy. Then we demonstrate the positive impact of adversarially training on this model as well as other deep architectures against six types of attack algorithms (white and black-box) with the cost of the reduced recognition accuracy and limited adversarial perturbation. We run our experiments on two benchmarking environmental sound datasets and show that without any imposed limitations on the budget allocations for the adversary, the fooling rate of the adversarially trained models can exceed 90%. In other words, adversarial attacks exist in any scales, but they might require higher adversarial perturbations compared to non-adversarially trained models.

AttendAffectNet: Self-Attention Based Networks for Predicting Affective Responses from Movies

Thi Phuong Thao Ha, Bt Balamurali, Herremans Dorien, Roig Gemma

Responsive image

Auto-TLDR; AttendAffectNet: A Self-Attention Based Network for Emotion Prediction from Movies

Slides Poster Similar

In this work, we propose different variants of the self-attention based network for emotion prediction from movies, which we call AttendAffectNet. We take both audio and video into account and incorporate the relation among multiple modalities by applying self-attention mechanism in a novel manner into the extracted features for emotion prediction. We compare it to the typically temporal integration of the self-attention based model, which in our case, allows to capture the relation of temporal representations of the movie while considering the sequential dependencies of emotion responses. We demonstrate the effectiveness of our proposed architectures on the extended COGNIMUSE dataset [1], [2] and the MediaEval 2016 Emotional Impact of Movies Task [3], which consist of movies with emotion annotations. Our results show that applying the self-attention mechanism on the different audio-visual features, rather than in the time domain, is more effective for emotion prediction. Our approach is also proven to outperform state-of-the-art models for emotion prediction.

Influence of Event Duration on Automatic Wheeze Classification

Bruno M Rocha, Diogo Pessoa, Alda Marques, Paulo Carvalho, Rui Pedro Paiva

Responsive image

Auto-TLDR; Experimental Design of the Non-wheeze Class for Wheeze Classification

Slides Poster Similar

Patients with respiratory conditions typically exhibit adventitious respiratory sounds, such as wheezes. Wheeze events have variable duration. In this work we studied the influence of event duration on wheeze classification, namely how the creation of the non-wheeze class affected the classifiers' performance. First, we evaluated several classifiers on an open access respiratory sound database, with the best one reaching sensitivity and specificity values of 98% and 95%, respectively. Then, by changing one parameter in the design of the non-wheeze class, i.e., event duration, the best classifier only reached sensitivity and specificity values of 53% and 75%, respectively. These results demonstrate the importance of experimental design on the assessment of wheeze classification algorithms' performance.

FatNet: A Feature-Attentive Network for 3D Point Cloud Processing

Chaitanya Kaul, Nick Pears, Suresh Manandhar

Responsive image

Auto-TLDR; Feature-Attentive Neural Networks for Point Cloud Classification and Segmentation

Slides Similar

The application of deep learning to 3D point clouds is challenging due to its lack of order. Inspired by the point embeddings of PointNet and the edge embeddings of DGCNNs, we propose three improvements to the task of point cloud analysis. First, we introduce a novel feature-attentive neural network layer, a FAT layer, that combines both global point-based features and local edge-based features in order to generate better embeddings. Second, we find that applying the same attention mechanism across two different forms of feature map aggregation, max pooling and average pooling, gives better performance than either alone. Third, we observe that residual feature reuse in this setting propagates information more effectively between the layers, and makes the network easier to train. Our architecture achieves state-of-the-art results on the task of point cloud classification, as demonstrated on the ModelNet40 dataset, and an extremely competitive performance on the ShapeNet part segmentation challenge.

3D Attention Mechanism for Fine-Grained Classification of Table Tennis Strokes Using a Twin Spatio-Temporal Convolutional Neural Networks

Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Péteri, Julien Morlier

Responsive image

Auto-TLDR; Attentional Blocks for Action Recognition in Table Tennis Strokes

Slides Poster Similar

The paper addresses the problem of recognition of actions in video with low inter-class variability such as Table Tennis strokes. Two stream, "twin" convolutional neural networks are used with 3D convolutions both on RGB data and optical flow. Actions are recognized by classification of temporal windows. We introduce 3D attention modules and examine their impact on classification efficiency. In the context of the study of sportsmen performances, a corpus of the particular actions of table tennis strokes is considered. The use of attention blocks in the network speeds up the training step and improves the classification scores up to 5% with our twin model. We visualize the impact on the obtained features and notice correlation between attention and player movements and position. Score comparison of state-of-the-art action classification method and proposed approach with attentional blocks is performed on the corpus. Proposed model with attention blocks outperforms previous model without them and our baseline.

The Application of Capsule Neural Network Based CNN for Speech Emotion Recognition

Xincheng Wen, Kunhong Liu

Responsive image

Auto-TLDR; CapCNN: A Capsule Neural Network for Speech Emotion Recognition

Slides Poster Similar

Moreover, the abstraction of audio features makes it impossible to fully use the inherent relationship among audio features. This paper proposes a model that combines a convolutional neural network(CNN) and a capsule neural network (CapsNet), named as CapCNN. The advantage of CapCNN lies in that it provides a solution to solve time sensitivity and focus on the overall characteristics. In this study, it is found that CapCNN can well handle the speech emotion recognition task. Compared with other state-of-art methods, our algorithm shows high performances on the CASIA and EMODB datasets. The detailed analysis confirms that our method provides balanced results on the various classes.

Improving Gravitational Wave Detection with 2D Convolutional Neural Networks

Siyu Fan, Yisen Wang, Yuan Luo, Alexander Michael Schmitt, Shenghua Yu

Responsive image

Auto-TLDR; Two-dimensional Convolutional Neural Networks for Gravitational Wave Detection from Time Series with Background Noise

Poster Similar

Sensitive gravitational wave (GW) detectors such as that of Laser Interferometer Gravitational-wave Observatory (LIGO) realize the direct observation of GW signals that confirm Einstein's general theory of relativity. However, it remains challenges to quickly detect faint GW signals from a large number of time series with background noise under unknown probability distributions. Traditional methods such as matched-filtering in general assume Additive White Gaussian Noise (AWGN) and are far from being real-time due to its high computational complexity. To avoid these weaknesses, one-dimensional (1D) Convolutional Neural Networks (CNNs) are introduced to achieve fast online detection in milliseconds but do not have enough consideration on the trade-off between the frequency and time features, which will be revisited in this paper through data pre-processing and subsequent two-dimensional (2D) CNNs during offline training to improve the online detection sensitivity. In this work, the input data is pre-processed to form a 2D spectrum by Short-time Fourier transform (STFT), where frequency features are extracted without learning. Then, carrying out two 1D convolutions across time and frequency axes respectively, and concatenating the time-amplitude and frequency-amplitude feature maps with equal proportion subsequently, the frequency and time features are treated equally as the input of our following two-dimensional CNNs. The simulation of our above ideas works on a generated data set with uniformly varying SNR (2-17), which combines the GW signal generated by PYCBC and the background noise sampled directly from LIGO. Satisfying the real-time online detection requirement without noise distribution assumption, the experiments of this paper demonstrate better performance in average compared to that of 1D CNNs, especially in the cases of lower SNR (4-9).

Improving Mix-And-Separate Training in Audio-Visual Sound Source Separation with an Object Prior

Quan Nguyen, Simone Frintrop, Timo Gerkmann, Mikko Lauri, Julius Richter

Responsive image

Auto-TLDR; Object-Prior: Learning the 1-to-1 correspondence between visual and audio signals by audio- visual sound source methods

Slides Similar

The performance of an audio-visual sound source separation system is determined by its ability to separate audio sources given images of the sources and the audio mixture. The goal of this study is to investigate the ability to learn the mapping between the sounds and the images of instruments by audio- visual sound source separation methods based on the state-of-the- art PixelPlayer [1]. Theoretical and empirical analyses illustrate that the PixelPlayer is not properly trained to learn the 1-to- 1 correspondence between visual and audio signals during its mix-and-separate training process. Based on the insights from this analysis, a weakly-supervised method called Object-Prior is proposed and evaluated on two audio-visual datasets. The experimental results show that the proposed Object-Prior method outperforms the PixelPlayer and other baselines in the audio- visual sound source separation task. It is also more robust against asynchronized data, where the frame and the audio do not come from the same video, and recognizes musical instruments based on their sound with higher accuracy than the PixelPlayer. This indicates that learning the 1-to-1 correspondence between visual and audio features of an instrument improves the effectiveness of audio-visual sound source separation.

The Color Out of Space: Learning Self-Supervised Representations for Earth Observation Imagery

Stefano Vincenzi, Angelo Porrello, Pietro Buzzega, Marco Cipriano, Pietro Fronte, Roberto Cuccu, Carla Ippoliti, Annamaria Conte, Simone Calderara

Responsive image

Auto-TLDR; Satellite Image Representation Learning for Remote Sensing

Slides Poster Similar

The recent growth in the number of satellite images fosters the development of effective deep-learning techniques for Remote Sensing (RS). However, their full potential is untapped due to the lack of large annotated datasets. Such a problem is usually countered by fine-tuning a feature extractor that is previously trained on the ImageNet dataset. Unfortunately, the domain of natural images differs from the RS one, which hinders the final performance. In this work, we propose to learn meaningful representations from satellite imagery, leveraging its high-dimensionality spectral bands to reconstruct the visible colors. We conduct experiments on land cover classification (BigEarthNet) and West Nile Virus detection, showing that colorization is a solid pretext task for training a feature extractor. Furthermore, we qualitatively observe that guesses based on natural images and colorization rely on different parts of the input. This paves the way to an ensemble model that eventually outperforms both the above-mentioned techniques.

Video Face Manipulation Detection through Ensemble of CNNs

Nicolo Bonettini, Edoardo Daniele Cannas, Sara Mandelli, Luca Bondi, Paolo Bestagini, Stefano Tubaro

Responsive image

Auto-TLDR; Face Manipulation Detection in Video Sequences Using Convolutional Neural Networks

Slides Similar

In the last few years, several techniques for facial manipulation in videos have been successfully developed and made available to the masses (i.e., FaceSwap, deepfake, etc.). These methods enable anyone to easily edit faces in video sequences with incredibly realistic results and a very little effort. Despite the usefulness of these tools in many fields, if used maliciously, they can have a significantly bad impact on society (e.g., fake news spreading, cyber bullying through fake revenge porn). The ability of objectively detecting whether a face has been manipulated in a video sequence is then a task of utmost importance. In this paper, we tackle the problem of face manipulation detection in video sequences targeting modern facial manipulation techniques. In particular, we study the ensembling of different trained Convolutional Neural Network (CNN) models. In the proposed solution, different models are obtained starting from a base network (i.e., EfficientNetB4) making use of two different concepts: (i) attention layers; (ii) siamese training. We show that combining these networks leads to promising face manipulation detection results on two publicly available datasets with more than 119000 videos.

Single-Modal Incremental Terrain Clustering from Self-Supervised Audio-Visual Feature Learning

Reina Ishikawa, Ryo Hachiuma, Akiyoshi Kurobe, Hideo Saito

Responsive image

Auto-TLDR; Multi-modal Variational Autoencoder for Terrain Type Clustering

Slides Poster Similar

The key to an accurate understanding of terrain is to extract the informative features from the multi-modal data obtained from different devices. Sensors, such as RGB cameras, depth sensors, vibration sensors, and microphones, are used as the multi-modal data. Many studies have explored ways to use them, especially in the robotics field. Some papers have successfully introduced single-modal or multi-modal methods. However, in practice, robots can be faced with extreme conditions; microphones do not work well in the crowded scenes, and an RGB camera cannot capture terrains well in the dark. In this paper, we present a novel framework using the multi-modal variational autoencoder and the Gaussian mixture model clustering algorithm on image data and audio data for terrain type clustering. Our method enables the terrain type clustering even if one of the modalities (either image or audio) is missing at the test-time. We evaluated the clustering accuracy with a conventional multi-modal terrain type clustering method and we conducted ablation studies to show the effectiveness of our approach.

Dynamically Mitigating Data Discrepancy with Balanced Focal Loss for Replay Attack Detection

Yongqiang Dou, Haocheng Yang, Maolin Yang, Yanyan Xu, Dengfeng Ke

Responsive image

Auto-TLDR; Anti-Spoofing with Balanced Focal Loss Function and Combination Features

Slides Poster Similar

It becomes urgent to design effective anti-spoofing algorithms for vulnerable automatic speaker verification systems due to the advancement of high-quality playback devices. Current studies mainly treat anti-spoofing as a binary classification problem between bonafide and spoofed utterances, while lack of indistinguishable samples makes it difficult to train a robust spoofing detector. In this paper, we argue that for anti-spoofing, it needs more attention for indistinguishable samples over easily-classified ones in the modeling process, to make correct discrimination a top priority. Therefore, to mitigate the data discrepancy between training and inference, we propose to leverage a balanced focal loss function as the training objective to dynamically scale the loss based on the traits of the sample itself. Besides, in the experiments, we select three kinds of features that contain both magnitude-based and phase-based information to form complementary and informative features. Experimental results on the ASVspoof2019 dataset demonstrate the superiority of the proposed methods by comparison between our systems and top-performing ones. Systems trained with the balanced focal loss perform significantly better than conventional cross-entropy loss. With complementary features, our fusion system with only three kinds of features outperforms other systems containing five or more complex single models by 22.5% for min-tDCF and 7% for EER, achieving a min-tDCF and an EER of 0.0124 and 0.55% respectively. Furthermore, we present and discuss the evaluation results on real replay data apart from the simulated ASVspoof2019 data, indicating that research for anti-spoofing still has a long way to go.

Translation Resilient Opportunistic WiFi Sensing

Mohammud Junaid Bocus, Wenda Li, Jonas Paulavičius, Ryan Mcconville, Raul Santos-Rodriguez, Kevin Chetty, Robert Piechocki

Responsive image

Auto-TLDR; Activity Recognition using Fine-Grained WiFi Channel State Information using WiFi CSI

Slides Poster Similar

Passive wireless sensing using WiFi signals has become a very active area of research over the past few years. Such techniques provide a cost-effective and non-intrusive solution for human activity sensing especially in healthcare applications. One of the main approaches used in wireless sensing is based on fine-grained WiFi Channel State Information (CSI) which can be extracted from commercial Network Interface Cards (NICs). In this paper, we present a new signal processing pipelines required for effective wireless sensing. An experiment involving five participants performing six different activities was carried out in an office space to evaluate the performance of activity recognition using WiFi CSI in different physical layouts. Experimental results show that the CSI system has the best detection performance when activities are performed half-way in between the transmitter and receiver in a line-of-sight (LoS) setting. In this case, an accuracy as high as 91% is achieved while the accuracy for the case where the transmitter and receiver are co-located is around 62%. As for the case when data from all layouts is combined, which better reflects the real-world scenario, the accuracy is around 67%. The results showed that the activity detection performance is dependent not only on the locations of the transmitter and receiver but also on the positioning of the person performing the activity.

Detection of Calls from Smart Speaker Devices

Vinay Maddali, David Looney, Kailash Patil

Responsive image

Auto-TLDR; Distinguishing Between Smart Speaker and Cell Devices Using Only the Audio Using a Feature Set

Slides Poster Similar

The ubiquity of smart speakers is increasing, with a growing number of households utilising these devices to make calls over the telephony network. As the technology is typically configured to retain the cellular phone number of the user, it presents challenges in applications where knowledge of the true call origin is required. There are a wide range of makes and models for these devices, as is the case with cell phones, and it is challenging to detect the general category as a smart speaker or cell, independent of the designated phone number. In this paper, we present an approach to differentiate between calls originating from smart speakers and ones from cellular devices using only the audio. We present a feature set that characterises the relevant acoustic information, such as the degree of reverberation and noise, to distinguish between these categories. When evaluated on a dataset spanning multiple models for each device category, as well as different modes-of-usage and microphone-speaker distances, the method yields an Equal Error Rate (EER) of 12.6%.

Audio-Visual Speech Recognition Using a Two-Step Feature Fusion Strategy

Hong Liu, Wanlu Xu, Bing Yang

Responsive image

Auto-TLDR; A Two-Step Feature Fusion Network for Speech Recognition

Slides Poster Similar

Lip-reading methods and fusion strategy are crucial for audio-visual speech recognition. In recent years, most approaches involve two separate audio and visual streams with early or late fusion strategies. Such a single-stage fusion method may fail to guarantee the integrity and representativeness of fusion information simultaneously. This paper extends a traditional single-stage fusion network to a two-step feature fusion network by adding an audio-visual early feature fusion (AV-EFF) stream to the baseline model. This method can learn the fusion information of different stages, preserving the original features as much as possible and ensuring the independence of different features. Besides, to capture long-range dependencies of video information, a non-local block is added to the feature extraction part of the visual stream (NL-Visual) to obtain the long-term spatio-temporal features. Experimental results on the two largest public datasets in English (LRW) and Mandarin (LRW-1000) demonstrate our method is superior to other state-of-the-art methods.

Spatial Bias in Vision-Based Voice Activity Detection

Kalin Stefanov, Mohammad Adiban, Giampiero Salvi

Responsive image

Auto-TLDR; Spatial Bias in Vision-based Voice Activity Detection in Multiparty Human-Human Interactions

Poster Similar

We present models for automatic vision-based voice activity detection (VAD) in multiparty human-human interactions that are aimed at complementing the acoustic VAD methods. We provide evidence that this type of vision-based VAD models are susceptible to spatial bias in the datasets. The physical settings of the interaction, usually constant throughout data acquisition, determines the distribution of head poses of the participants. Our results show that when the head pose distributions are significantly different in the training and test sets, the performance of the models drops significantly. This suggests that previously reported results on datasets with a fixed physical configuration may overestimate the generalization capabilities of this type of models. We also propose a number of possible remedies to the spatial bias, including data augmentation, input masking and dynamic features, and provide an in-depth analysis of the visual cues used by our models.

Planar 3D Transfer Learning for End to End Unimodal MRI Unbalanced Data Segmentation

Martin Kolarik, Radim Burget, Carlos M. Travieso-Gonzalez, Jan Kocica

Responsive image

Auto-TLDR; Planar 3D Res-U-Net Network for Unbalanced 3D Image Segmentation using Fluid Attenuation Inversion Recover

Slides Similar

We present a novel approach of 2D to 3D transfer learning based on mapping pre-trained 2D convolutional neural network weights into planar 3D kernels. The method is validated by proposed planar 3D res-u-net network with encoder transferred from the 2D VGG-16 which is applied for a single-stage unbalanced 3D image data segmentation. In particular, we evaluate the method on the MICCAI 2016 MS lesion segmentation challenge dataset utilizing solely Fluid Attenuation Inversion Recover (FLAIR) sequence without brain extraction for training and inference to simulate real medical praxis. The planar 3D res-u-net network performed the best both in sensitivity and Dice score amongst end to end methods processing raw MRI scans and achieved comparable Dice score to a state-of-the-art unimodal not end to end approach. Complete source code was released under the open-source license and this paper is in compliance with the Machine learning Reproducibility Checklist. By implementing practical transfer learning for 3D data representation we were able to successfully segment heavily unbalanced data without selective sampling and achieved more reliable results using less training data in single modality. From medical perspective, the unimodal approach gives an advantage in real praxis as it does not require co-registration nor additional scanning time during examination. Although modern medical imaging methods capture high resolution 3D anatomy scans suitable for computer aided detection system processing, deployment of automatic systems for interpretation of radiology imaging is still rather theoretical in many medical areas. Our work aims to bridge the gap offering solution for partial research questions.

End-To-End Triplet Loss Based Emotion Embedding System for Speech Emotion Recognition

Puneet Kumar, Sidharth Jain, Balasubramanian Raman, Partha Pratim Roy, Masakazu Iwamura

Responsive image

Auto-TLDR; End-to-End Neural Embedding System for Speech Emotion Recognition

Slides Poster Similar

In this paper, an end-to-end neural embedding system based on triplet loss and residual learning has been proposed for speech emotion recognition. The proposed system learns the embeddings from the emotional information of the speech utterances. The learned embeddings are used to recognize the emotions portrayed by given speech samples of various lengths. The proposed system implements Residual Neural Network architecture. It is trained using softmax pre-training and triplet loss function. The weights between the fully connected and embedding layers of the trained network are used to calculate the embedding values. The embedding representations of various emotions are mapped onto a hyperplane, and the angles among them are computed using the cosine similarity. These angles are utilized to classify a new speech sample into its appropriate emotion class. The proposed system has demonstrated 91.67\% and 64.44\% accuracy while recognizing emotions for RAVDESS and IEMOCAP dataset, respectively.

A Close Look at Deep Learning with Small Data

Lorenzo Brigato, Luca Iocchi

Responsive image

Auto-TLDR; Low-Complex Neural Networks for Small Data Conditions

Slides Poster Similar

In this work, we perform a wide variety of experiments with different Deep Learning architectures in small data conditions. We show that model complexity is a critical factor when only a few samples per class are available. Differently from the literature, we improve the state of the art using low complexity models. We show that standard convolutional neural networks with relatively few parameters are effective in this scenario. In many of our experiments, low complexity models outperform state-of-the-art architectures. Moreover, we propose a novel network that uses an unsupervised loss to regularize its training. Such architecture either improves the results either performs comparably well to low capacity networks. Surprisingly, experiments show that the dynamic data augmentation pipeline is not beneficial in this particular domain. Statically augmenting the dataset might be a promising research direction while dropout maintains its role as a good regularizer.

Enhancing Semantic Segmentation of Aerial Images with Inhibitory Neurons

Ihsan Ullah, Sean Reilly, Michael Madden

Responsive image

Auto-TLDR; Lateral Inhibition in Deep Neural Networks for Object Recognition and Semantic Segmentation

Slides Poster Similar

In a Convolutional Neural Network, each neuron in the output feature map takes input from the neurons in its receptive field. This receptive field concept plays a vital role in today's deep neural networks. However, inspired by neuro-biological research, it has been proposed to add inhibitory neurons outside the receptive field, which may enhance the performance of neural network models. In this paper, we begin with deep network architectures such as VGG and ResNet, and propose an approach to add lateral inhibition in each output neuron to reduce its impact on its neighbours, both in fine-tuning pre-trained models and training from scratch. Our experiments show that notable improvements upon prior baseline deep models can be achieved. A key feature of our approach is that it is easy to add to baseline models; it can be adopted in any model containing convolution layers, and we demonstrate its value in applications including object recognition and semantic segmentation of aerial images, where we show state-of-the-art result on the Aeroscape dataset. On semantic segmentation tasks, our enhancement shows 17.43% higher mIoU than a single baseline model on a single source (the Aeroscape dataset), 13.43% higher performance than an ensemble model on the same single source, and 7.03% higher than an ensemble model on multiple sources (segmentation datasets). Our experiments illustrate the potential impact of using inhibitory neurons in deep learning models, and they also show better results than the baseline models that have standard convolutional layer.

ResNet-Like Architecture with Low Hardware Requirements

Elena Limonova, Daniil Alfonso, Dmitry Nikolaev, Vladimir V. Arlazarov

Responsive image

Auto-TLDR; BM-ResNet: Bipolar Morphological ResNet for Image Classification

Slides Poster Similar

One of the most computationally intensive parts in modern recognition systems is an inference of deep neural networks that are used for image classification, segmentation, enhancement, and recognition. The growing popularity of edge computing makes us look for ways to reduce its time for mobile and embedded devices. One way to decrease the neural network inference time is to modify a neuron model to make it more efficient for computations on a specific device. The example of such a model is a bipolar morphological neuron model. The bipolar morphological neuron is based on the idea of replacing multiplication with addition and maximum operations. This model has been demonstrated for simple image classification with LeNet-like architectures [1]. In the paper, we introduce a bipolar morphological ResNet (BM-ResNet) model obtained from a much more complex ResNet architecture by converting its layers to bipolar morphological ones. We apply BM-ResNet to image classification on MNIST and CIFAR-10 datasets with only a moderate accuracy decrease from 99.3% to 99.1% and from 85.3% to 85.1%. We also estimate the computational complexity of the resulting model. We show that for the majority of ResNet layers, the considered model requires 2.1-2.9 times fewer logic gates for implementation and 15-30% lower latency.

Directional Graph Networks with Hard Weight Assignments

Miguel Dominguez, Raymond Ptucha

Responsive image

Auto-TLDR; Hard Directional Graph Networks for Point Cloud Analysis

Slides Poster Similar

Point cloud analysis is an important field for 3D scene understanding. It has applications in self driving cars and robotics (via LIDAR sensors), 3D graphics, and computer-aided design. Neural networks have recently achieved strong results on point cloud analysis problems such as classification and segmentation. Each point cloud network has the challenge of defining a convolution that can learn useful features on unstructured points. Some recent point cloud convolutions create separate weight matrices for separate directions like a CNN, but apply every weight matrix to every neighbor with soft assignments. This increases computational complexity and makes relatively small neighborhood aggregations expensive to compute. We propose Hard Directional Graph Networks (HDGN), a point cloud model that both learns directional weight matrices and assigns a single matrix to each neighbor, achieving directional convolutions at lower computational cost. HDGN's directional modeling achieves state-of-the-art results on multiple point cloud vision benchmarks.

Fast and Accurate Real-Time Semantic Segmentation with Dilated Asymmetric Convolutions

Leonel Rosas-Arias, Gibran Benitez-Garcia, Jose Portillo-Portillo, Gabriel Sanchez-Perez, Keiji Yanai

Responsive image

Auto-TLDR; FASSD-Net: Dilated Asymmetric Pyramidal Fusion for Real-Time Semantic Segmentation

Slides Poster Similar

Recent works have shown promising results applied to real-time semantic segmentation tasks. To maintain fast inference speed, most of the existing networks make use of light decoders, or they simply do not use them at all. This strategy helps to maintain a fast inference speed; however, their accuracy performance is significantly lower in comparison to non-real-time semantic segmentation networks. In this paper, we introduce two key modules aimed to design a high-performance decoder for real-time semantic segmentation for reducing the accuracy gap between real-time and non-real-time segmentation networks. Our first module, Dilated Asymmetric Pyramidal Fusion (DAPF), is designed to substantially increase the receptive field on the top of the last stage of the encoder, obtaining richer contextual features. Our second module, Multi-resolution Dilated Asymmetric (MDA) module, fuses and refines detail and contextual information from multi-scale feature maps coming from early and deeper stages of the network. Both modules exploit contextual information without excessively increasing the computational complexity by using asymmetric convolutions. Our proposed network entitled “FASSD-Net” reaches 78.8% of mIoU accuracy on the Cityscapes validation dataset at 41.1 FPS on full resolution images (1024x2048). Besides, with a light version of our network, we reach 74.1% of mIoU at 133.1 FPS (full resolution) on a single NVIDIA GTX 1080Ti card with no additional acceleration techniques. The source code and pre-trained models are available at https://github.com/GibranBenitez/FASSD-Net.

S2I-Bird: Sound-To-Image Generation of Bird Species Using Generative Adversarial Networks

Joo Yong Shim, Joongheon Kim, Jong-Kook Kim

Responsive image

Auto-TLDR; Generating bird images from sound using conditional generative adversarial networks

Slides Poster Similar

Generating images from sound is a challenging task. This paper proposes a novel deep learning model that generates bird images from their corresponding sound information. Our proposed model includes a sound encoder in order to extract suitable feature representations from audio recordings, and then it generates bird images that corresponds to its calls using conditional generative adversarial networks (GANs) with auxiliary classifiers. We demonstrate that our model produces better image generation results which outperforms other state-of-the-art methods in a similar context.

A Systematic Investigation on Deep Architectures for Automatic Skin Lesions Classification

Pierluigi Carcagni, Marco Leo, Andrea Cuna, Giuseppe Celeste, Cosimo Distante

Responsive image

Auto-TLDR; RegNet: Deep Investigation of Convolutional Neural Networks for Automatic Classification of Skin Lesions

Slides Poster Similar

Computer vision-based techniques are more and more employed in healthcare and medical fields nowadays in order, principally, to be as a support to the experienced medical staff to help them to make a quick and correct diagnosis. One of the hot topics in this arena concerns the automatic classification of skin lesions. Several promising works exist about it, mainly leveraging Convolutional Neural Networks (CNN), but proposed pipeline mainly rely on complex data preprocessing and there is no systematic investigation about how available deep models can actually reach the accuracy needed for real applications. In order to overcome these drawbacks, in this work, an end-to-end pipeline is introduced and some of the most recent Convolutional Neural Networks (CNNs) architectures are included in it and compared on the largest common benchmark dataset recently introduced. To this aim, for the first time in this application context, a new network design paradigm, namely RegNet, has been exploited to get the best models among a population of configurations. The paper introduces a threefold level of contribution and novelty with respect the previous literature: the deep investigation of several CNN architectures driving to a consistent improvement of the lesions recognition accuracy, the exploitation of a new network design paradigm able to study the behavior of populations of models and a deep discussion about pro and cons of each analyzed method paving the path towards new research lines.

Attention Pyramid Module for Scene Recognition

Zhinan Qiao, Xiaohui Yuan, Chengyuan Zhuang, Abolfazl Meyarian

Responsive image

Auto-TLDR; Attention Pyramid Module for Multi-Scale Scene Recognition

Slides Poster Similar

The unrestricted open vocabulary and diverse substances of scenery images bring significant challenges to scene recognition. However, most deep learning architectures and attention methods are developed on general-purpose datasets and omit the characteristics of scene data. In this paper, we exploit the attention pyramid module (APM) to tackle the predicament of scene recognition. Our method streamlines the multi-scale scene recognition pipeline, learns comprehensive scene features at various scales and locations, addresses the interdependency among scales, and further assists feature re-calibration as well as aggregation process. APM is extremely light-weighted and can be easily plugged into existing network architectures in a parameter-efficient manner. By simply integrating APM into ResNet-50, we obtain a 3.54\% boost in terms of top-1 accuracy on the benchmark scene dataset. Comprehensive experiments show that APM achieves better performance comparing with state-of-the-art attention methods using significant less computation budget. Code and pre-trained models will be made publicly available.

SECI-GAN: Semantic and Edge Completion for Dynamic Objects Removal

Francesco Pinto, Andrea Romanoni, Matteo Matteucci, Phil Torr

Responsive image

Auto-TLDR; SECI-GAN: Semantic and Edge Conditioned Inpainting Generative Adversarial Network

Slides Poster Similar

Image inpainting aims at synthesizing the missing content of damaged and corrupted images to produce visually realistic restorations; typical applications are in image restoration, automatic scene editing, super-resolution, and dynamic object removal. In this paper, we propose Semantic and Edge Conditioned Inpainting Generative Adversarial Network (SECI-GAN), an architecture that jointly exploits the high-level cues extracted by semantic segmentation and the fine-grained details captured by edge extraction to condition the image inpainting process. SECI-GAN is designed with a particular focus on recovering big regions belonging to the same object (e.g. cars or pedestrians) in the context of dynamic object removal from complex street views. To demonstrate the effectiveness of SECI-GAN, we evaluate our results on the Cityscapes dataset, showing that SECI-GAN is better than competing state-of-the-art models at recovering the structure and the content of the missing parts while producing consistent predictions.

Building Computationally Efficient and Well-Generalizing Person Re-Identification Models with Metric Learning

Vladislav Sovrasov, Dmitry Sidnev

Responsive image

Auto-TLDR; Cross-Domain Generalization in Person Re-identification using Omni-Scale Network

Slides Similar

This work considers the problem of domain shift in person re-identification.Being trained on one dataset, a re-identification model usually performs much worse on unseen data. Partially this gap is caused by the relatively small scale of person re-identification datasets (compared to face recognition ones, for instance), but it is also related to training objectives. We propose to use the metric learning objective, namely AM-Softmax loss, and some additional training practices to build well-generalizing, yet, computationally efficient models. We use recently proposed Omni-Scale Network (OSNet) architecture combined with several training tricks and architecture adjustments to obtain state-of-the art results in cross-domain generalization problem on a large-scale MSMT17 dataset in three setups: MSMT17-all->DukeMTMC, MSMT17-train->Market1501 and MSMT17-all->Market1501.

Multimodal Side-Tuning for Document Classification

Stefano Zingaro, Giuseppe Lisanti, Maurizio Gabbrielli

Responsive image

Auto-TLDR; Side-tuning for Multimodal Document Classification

Slides Poster Similar

In this paper, we propose to exploit the side-tuning framework for multimodal document classification. Side-tuning is a methodology for network adaptation recently introduced to solve some of the problems related to previous approaches. Thanks to this technique it is actually possible to overcome model rigidity and catastrophic forgetting of transfer learning by fine-tuning. The proposed solution uses off-the-shelf deep learning architectures leveraging the side-tuning framework to combine a base model with a tandem of two side networks. We show that side-tuning can be successfully employed also when different data sources are considered, e.g. text and images in document classification. The experimental results show that this approach pushes further the limit for document classification accuracy with respect to the state of the art.

Domain Siamese CNNs for Sparse Multispectral Disparity Estimation

David-Alexandre Beaupre, Guillaume-Alexandre Bilodeau

Responsive image

Auto-TLDR; Multispectral Disparity Estimation between Thermal and Visible Images using Deep Neural Networks

Slides Poster Similar

Multispectral disparity estimation is a difficult task for many reasons: it as all the same challenges as traditional visible-visible disparity estimation (occlusions, repetitive patterns, textureless surfaces), in addition of having very few common visual information between images (e.g. color information vs. thermal information). In this paper, we propose a new CNN architecture able to do disparity estimation between images from different spectrum, namely thermal and visible in our case. Our proposed model takes two patches as input and proceeds to do domain feature extraction for each of them. Features from both domains are then merged with two fusion operations, namely correlation and concatenation. These merged vectors are then forwarded to their respective classification heads, which are responsible for classifying the inputs as being same or not. Using two merging operations gives more robustness to our feature extraction process, which leads to more precise disparity estimation. Our method was tested using the publicly available LITIV 2014 and LITIV 2018 datasets, and showed best results when compared to other state of the art methods.

Personalized Models in Human Activity Recognition Using Deep Learning

Hamza Amrani, Daniela Micucci, Paolo Napoletano

Responsive image

Auto-TLDR; Incremental Learning for Personalized Human Activity Recognition

Slides Poster Similar

Current sensor-based human activity recognition techniques that rely on a user-independent model struggle to generalize to new users and on to changes that a person may make over time to his or her way of carrying out activities. Incremental learning is a technique that allows to obtain personalized models which may improve the performance on the classifiers thanks to a continuous learning based on user data. Finally, deep learning techniques have been proven to be more effective with respect to traditional ones in the generation of user-independent models. The aim of our work is therefore to put together deep learning techniques with incremental learning in order to obtain personalized models that perform better with respect to user-independent model and personalized model obtained using traditional machine learning techniques. The experimentation was done by comparing the results obtained by a technique in the state of the art with those obtained by two neural networks (ResNet and a simplified CNN) on three datasets. The experimentation showed that neural networks adapt faster to a new user than the baseline.

Anticipating Activity from Multimodal Signals

Tiziana Rotondo, Giovanni Maria Farinella, Davide Giacalone, Sebastiano Mauro Strano, Valeria Tomaselli, Sebastiano Battiato

Responsive image

Auto-TLDR; Exploiting Multimodal Signal Embedding Space for Multi-Action Prediction

Slides Poster Similar

Images, videos, audio signals, sensor data, can be easily collected in huge quantity by different devices and processed in order to emulate the human capability of elaborating a variety of different stimuli. Are multimodal signals useful to understand and anticipate human actions if acquired from the user viewpoint? This paper proposes to build an embedding space where inputs of different nature, but semantically correlated, are projected in a new representation space and properly exploited to anticipate the future user activity. To this purpose, we built a new multimodal dataset comprising video, audio, tri-axial acceleration, angular velocity, tri-axial magnetic field, pressure and temperature. To benchmark the proposed multimodal anticipation challenge, we consider classic classifiers on top of deep learning methods used to build the embedding space representing multimodal signals. The achieved results show that the exploitation of different modalities is useful to improve the anticipation of the future activity.

Exploiting the Logits: Joint Sign Language Recognition and Spell-Correction

Christina Runkel, Stefan Dorenkamp, Hartmut Bauermeister, Michael Möller

Responsive image

Auto-TLDR; A Convolutional Neural Network for Spell-correction in Sign Language Videos

Slides Poster Similar

Machine learning techniques have excelled in the automatic semantic analysis of images, reaching human-level performances on challenging bechmarks. Yet, the semantic analysis of videos remains challenging due to the significantly higher dimensionality of the input data, respectively, the significantly higher need for annotated training examples. By studying the automatic recognition of German sign language videos, we demonstrate that on the relatively scarce training data of 2.800 videos, modern deep learning architectures for video analysis (such as ResNeXt) along with transfer learning on large gesture recognition tasks, can achieve about 75% character accuracy. Considering that this leaves us with a probability of under 25% that a five letter word is spelled correctly, spell-correction systems are crucial for producing readable outputs. The contribution of this paper is to propose a convolutional neural network for spell-correction that expects the softmax outputs of the character recognition network (instead of a misspelled word) as an input. We demonstrate that purely learning on softmax inputs in combination with scarce training data yields overfitting as the network learns the inputs by heart. In contrast, training the network on several variants of the logits of the classification output i.e. scaling by a constant factor, adding of random noise, mixing of softmax and hardmax inputs or purely training on hardmax inputs, leads to better generalization while benefitting from the significant information hidden in these outputs (that have 98% top-5 accuracy), yielding a readable text despite the comparably low character accuracy.

GazeMAE: General Representations of Eye Movements Using a Micro-Macro Autoencoder

Louise Gillian C. Bautista, Prospero Naval

Responsive image

Auto-TLDR; Fast and Slow Eye Movement Representations for Sentiment-agnostic Eye Tracking

Slides Poster Similar

Eye movements are intricate and dynamic events that contain a wealth of information about the subject and the stimuli. We propose an abstract representation of eye movements that preserve the important nuances in gaze behavior while being stimuli-agnostic. We consider eye movements as raw position and velocity signals and train a deep temporal convolutional autoencoder to learn micro-scale and macro-scale representations corresponding to the fast and slow features of eye movements. These joint representations are evaluated by fitting a linear classifier on various tasks and outperform other works in biometrics and stimuli classification. Further experiments highlight the validity and generalizability of this method, bringing eye tracking research closer to real-world applications.

Automatic Semantic Segmentation of Structural Elements related to the Spinal Cord in the Lumbar Region by Using Convolutional Neural Networks

Jhon Jairo Sáenz Gamboa, Maria De La Iglesia-Vaya, Jon Ander Gómez

Responsive image

Auto-TLDR; Semantic Segmentation of Lumbar Spine Using Convolutional Neural Networks

Slides Poster Similar

This work addresses the problem of automatically segmenting the MR images corresponding to the lumbar spine. The purpose is to detect and delimit the different structural elements like vertebrae, intervertebral discs, nerves, blood vessels, etc. This task is known as semantic segmentation. The approach proposed in this work is based on convolutional neural networks whose output is a mask where each pixel from the input image is classified into one of the possible classes. Classes were defined by radiologists and correspond to structural elements and tissues. The proposed network architectures are variants of the U-Net. Several complementary blocks were used to define the variants: spatial attention models, deep supervision and multi-kernels at input, this last block type is based on the idea of inception. Those architectures which got the best results are described in this paper, and their results are discussed. Two of the proposed architectures outperform the standard U-Net used as baseline.

The DeepHealth Toolkit: A Unified Framework to Boost Biomedical Applications

Michele Cancilla, Laura Canalini, Federico Bolelli, Stefano Allegretti, Salvador Carrión, Roberto Paredes, Jon Ander Gómez, Simone Leo, Marco Enrico Piras, Luca Pireddu, Asaf Badouh, Santiago Marco-Sola, Lluc Alvarez, Miquel Moreto, Costantino Grana

Responsive image

Auto-TLDR; DeepHealth Toolkit: An Open Source Deep Learning Toolkit for Cloud Computing and HPC

Slides Poster Similar

Given the overwhelming impact of machine learning on the last decade, several libraries and frameworks have been developed in recent years to simplify the design and training of neural networks, providing array-based programming, automatic differentiation and user-friendly access to hardware accelerators. None of those tools, however, was designed with native and transparent support for Cloud Computing or heterogeneous High-Performance Computing (HPC). The DeepHealth Toolkit is an open source deep learning toolkit aimed at boosting productivity of data scientists operating in the medical field by providing a unified framework for the distributed training of neural networks, that is able to leverage hybrid HPC and Cloud environments in a way transparent to the user. The toolkit is composed of a computer vision library, a deep learning library, and a front-end for non-expert users; all of the components are focused on the medical domain, but they are general purpose and can be applied to any other field. In this paper, the principles driving the design of the DeepHealth libraries are described, along with details about the implementation and the interaction between the different elements composing the toolkit. Finally, experiments on common benchmarks prove the efficiency of each separate component, and of the DeepHealth Toolkit overall.

Digit Recognition Applied to Reconstructed Audio Signals Using Deep Learning

Anastasia-Sotiria Toufa, Constantine Kotropoulos

Responsive image

Auto-TLDR; Compressed Sensing for Digit Recognition in Audio Reconstruction

Poster Similar

Compressed sensing allows signal reconstruction from a few measurements. This work proposes a complete pipeline for digit recognition applied to audio reconstructed signals. The reconstruction procedure exploits the assumption that the original signal lies in the range of a generator. A pretrained generator of a Generative Adversarial Network generates audio digits. A new method for reconstruction is proposed, using only the most active segment of the signal, i.e., the segment with the highest energy. The underlying assumption is that such segment offers a more compact representation, preserving the meaningful content of signal. Cases when the reconstruction produces noise, instead of digit, are treated as outliers. In order to detect and reject them, three unsupervised indicators are used, namely, the total energy of reconstructed signal, the predictions of an one-class Support Vector Machine, and the confidence of a pretrained classifier used for recognition. This classifier is based on neural networks architectures and is pretrained on original audio recordings, employing three input representations, i.e., raw audio, spectrogram, and gammatonegram. Experiments are conducted, analyzing both the quality of reconstruction and the performance of classifiers in digit recognition, demonstrating that the proposed method yields higher performance in both the quality of reconstruction and digit recognition accuracy.

MEG: Multi-Evidence GNN for Multimodal Semantic Forensics

Ekraam Sabir, Ayush Jaiswal, Wael Abdalmageed, Prem Natarajan

Responsive image

Auto-TLDR; Scalable Image Repurposing Detection with Graph Neural Network Based Model

Slides Poster Similar

Image repurposing is a category of fake news where a digitally unmanipulated image is misrepresented by means of its accompanying metadata such as captions, location, etc., where the image and accompanying metadata together comprise a multimedia package. The problem setup is to authenticate a query multimedia package using a reference dataset of potentially related packages as evidences. Existing methods are limited to using a single evidence (retrieved package), which ignores potential performance improvement from the use of multiple evidences. In this work, we introduce a novel graph neural network based model for image repurposing detection, which effectively utilizes multiple retrieved packages as evidences and is scalable with the number of evidences. We compare the scalability and performance of our model against existing methods. Experimental results show that the proposed model outperforms existing state-of-the-art for image repurposing detection with an error reduction of up to 25%.