Locally-Connected, Irregular Deep Neural Networks for Biomimetic Active Vision in a Simulated Human

Masaki Nakada, Honglin Chen, Arjun Lakshmipathy, Demetri Terzopoulos

Auto-TLDR; Locally-Connected, Irregular Deep Neural Networks for Biomimetic Active Vision

An advanced simulation framework has recently been introduced for exploring human perception and visuomotor control. In this context, we investigate locally-connected, irregular deep neural networks (liNets) for biomimetic active vision. Like commonly used CNNs, liNets are locally-connected, forming receptive fields. Unlike CNNs, liNets are ideal for irregular photoreceptor distributions like those found in foveated biological retinas. Relative to fully-connected deep neural networks, liNets accommodate a dramatically greater number of retinal photoreceptors for significantly enhanced visual acuity, while avoiding prohibitive memory requirements. We demonstrate that our new networks can serve effectively in the biomimetic active vision system embodied in a simulated human that learns active visuomotor control and active appearance-based recognition.
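
The abstract does not say how liNet receptive fields are built, so the following is only a minimal sketch of the general idea under a stated assumption: each unit is wired to its `k` nearest photoreceptors, wherever they happen to lie. The names `receptive_fields` and `forward` and the k-nearest grouping are hypothetical and are not taken from the paper.

```python
import math

def receptive_fields(photoreceptors, unit_centers, k):
    """For each unit, list the indices of its k nearest photoreceptors.

    Assumption (not from the paper): receptive fields are formed by
    k-nearest-neighbour grouping over irregular 2D positions.
    """
    fields = []
    for cx, cy in unit_centers:
        order = sorted(
            range(len(photoreceptors)),
            key=lambda i: math.hypot(photoreceptors[i][0] - cx,
                                     photoreceptors[i][1] - cy),
        )
        fields.append(order[:k])
    return fields

def forward(values, fields, weights):
    """Locally-connected layer: each unit has its own weights (no sharing)
    and sees only its own receptive field; ReLU is applied to the sum."""
    return [
        max(0.0, sum(w * values[i] for w, i in zip(unit_w, field)))
        for field, unit_w in zip(fields, weights)
    ]

# Four irregularly spaced photoreceptors, two units, two inputs per unit.
photoreceptors = [(0, 0), (1, 0), (5, 0), (6, 0)]
fields = receptive_fields(photoreceptors, [(0.4, 0), (5.6, 0)], k=2)
outputs = forward([1, 2, 3, 4], fields, [[1, 1], [1, -2]])
```

In this sketch the layer stores `len(unit_centers) * k` weights, whereas a fully-connected layer over the same inputs would store `len(unit_centers) * len(photoreceptors)`, which illustrates the memory argument made in the abstract.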

Similar papers

Temporal Pulses Driven Spiking Neural Network for Time and Power Efficient Object Recognition in Autonomous Driving

Wei Wang, Shibo Zhou, Jingxi Li, Xiaohua Li, Junsong Yuan, Zhanpeng Jin

Auto-TLDR; Spiking Neural Network for Real-Time Object Recognition on Temporal LiDAR Pulses

Accurate real-time object recognition from sensory data has long been a crucial and challenging task for autonomous driving. Even though deep neural networks (DNNs) have been widely applied in this area, their considerable processing latency, power consumption, and computational complexity remain challenging issues for real-time autonomous driving applications. In this paper, we propose an approach that addresses the real-time object recognition problem using spiking neural networks (SNNs). The proposed SNN model works directly with raw temporal LiDAR pulses without the pulse-to-point-cloud preprocessing procedure, which can significantly reduce delay and power consumption. Evaluated on various datasets derived from LiDAR and dynamic vision sensors (DVS), including Sim LiDAR, KITTI, and DVS-barrel, our proposed model shows remarkable time and power efficiency while achieving recognition performance comparable to state-of-the-art methods. This paper highlights the great potential of SNNs in autonomous driving and related applications. To the best of our knowledge, this is the first attempt to use an SNN to directly perform time- and energy-efficient object recognition on temporal LiDAR pulses in the setting of autonomous driving.

Occlusion-Tolerant and Personalized 3D Human Pose Estimation in RGB Images

Ammar Qammaz, Antonis Argyros

Auto-TLDR; Real-Time 3D Human Pose Estimation in BVH using Inverse Kinematics Solver and Neural Networks

We introduce a real-time method that estimates the 3D human pose directly in the popular BVH format, given estimations of the 2D body joints in RGB images. Our contributions include: (a) A novel and compact 2D pose representation. (b) A human body orientation classifier and an ensemble of orientation-tuned neural networks that regress the 3D human pose by also allowing for the decomposition of the body to an upper and lower kinematic hierarchy. This permits the recovery of the human pose even in the case of significant occlusions. (c) An efficient Inverse Kinematics solver that refines the neural-network-based solution providing 3D human pose estimations that are consistent with the limb sizes of a target person (if known). All the above yield a 33% accuracy improvement on the H3.6M dataset compared to the baseline MocapNET method while maintaining real-time performance (70 fps in CPU-only execution).

VOWEL: A Local Online Learning Rule for Recurrent Networks of Probabilistic Spiking Winner-Take-All Circuits

Hyeryung Jang, Nicolas Skatchkovsky, Osvaldo Simeone

Auto-TLDR; VOWEL: A Variational Online Local Training Rule for Winner-Take-All Spiking Neural Networks

Networks of spiking neurons and Winner-Take-All spiking circuits (WTA-SNNs) can detect information encoded in spatio-temporal multi-valued events. These are described by the timing of events of interest, e.g., clicks, as well as by categorical numerical values assigned to each event, e.g., like or dislike. Other use cases include object recognition from data collected by neuromorphic cameras, which produce, for each pixel, signed bits at the times of sufficiently large brightness variations. Existing schemes for training WTA-SNNs are limited to rate-encoding solutions, and are hence able to detect only spatial patterns. Developing more general training algorithms for arbitrary WTA-SNNs inherits the challenges of training (binary) Spiking Neural Networks (SNNs). These amount, most notably, to the non-differentiability of threshold functions, to the recurrent behavior of spiking neural models, and to the difficulty of implementing backpropagation in neuromorphic hardware. In this paper, we develop a variational online local training rule for WTA-SNNs, referred to as VOWEL, that leverages only local pre- and post-synaptic information for visible circuits, and an additional common reward signal for hidden circuits. The method is based on probabilistic generalized linear neural models, control variates, and variational regularization. Experimental results on real-world neuromorphic datasets with multi-valued events demonstrate the advantages of WTA-SNNs over conventional binary SNNs trained with state-of-the-art methods, especially in the presence of limited computing resources.

Spiking Neural Networks with Single-Spike Temporal-Coded Neurons for Network Intrusion Detection

Shibo Zhou, Xiaohua Li

Auto-TLDR; Spiking Neural Network with Leaky Neurons

Spiking neural networks (SNNs) are of interest due to their strong bio-plausibility and high energy efficiency. However, their performance falls far behind that of conventional deep neural networks (DNNs). In this paper, considering a general class of single-spike temporal-coded integrate-and-fire neurons, we analyze the input-output expressions of both leaky and nonleaky neurons. We show that SNNs built with leaky neurons suffer from an overly nonlinear and overly complex input-output response, which is the major reason for their difficult training and low performance. This reason is more fundamental than the commonly believed problem of nondifferentiable spikes. To support this claim, we show that SNNs built with nonleaky neurons can have a less complex and less nonlinear input-output response. They can be easily trained and can have superior performance, which is demonstrated by experimenting with the SNNs over two popular network intrusion detection datasets, i.e., the NSL-KDD and the AWID datasets. Our experimental results show that the proposed SNNs outperform a comprehensive list of DNN models and classic machine learning models. This paper demonstrates that, contrary to common belief, SNNs can be promising and competitive.

Fully Convolutional Neural Networks for Raw Eye Tracking Data Segmentation, Generation, and Reconstruction

Wolfgang Fuhl, Yao Rong, Enkelejda Kasneci

Auto-TLDR; Semantic Segmentation of Eye Tracking Data with Fully Convolutional Neural Networks

In this paper, we use fully convolutional neural networks for the semantic segmentation of eye tracking data. We also use these networks for reconstruction, and in conjunction with a variational auto-encoder to generate eye movement data. The first improvement of our approach is that no input window is necessary, due to the use of fully convolutional networks and therefore any input size can be processed directly. The second improvement is that the used and generated data is raw eye tracking data (position X, Y and time) without preprocessing. This is achieved by pre-initializing the filters in the first layer and by building the input tensor along the z axis. We evaluated our approach on three publicly available datasets and compare the results to the state of the art.

User-Independent Gaze Estimation by Extracting Pupil Parameter and Its Mapping to the Gaze Angle

Sang Yoon Han, Nam Ik Cho

Auto-TLDR; Gaze Point Estimation using Pupil Shape for Generalization

Since gaze estimation plays a crucial role in recognizing human intentions, it has been researched for a long time, and its accuracy is ever increasing. However, due to the wide variation in eye shapes and focusing abilities between individuals, the accuracy of most algorithms varies from person to person in the test group, especially when the initial calibration is not performed well. To alleviate this user dependency, we attempt to derive features that are general for most people and use them as the input to a deep network instead of using the images as the input. Specifically, we use the pupil shape as the core feature because it is directly related to the 3D eyeball rotation, and thus the gaze direction. While existing deep learning methods learn the gaze point by extracting various features from the image, we focus on the mapping function from the eyeball rotation to the gaze point by using the pupil shape as the input. It is shown that the accuracy of gaze point estimation also becomes robust for uncalibrated points by following the characteristics of the mapping function. Also, our gaze network learns the gaze difference to facilitate the re-calibration process that fixes the calibration-drift problem typically occurring with glasses-type or head-mounted devices.

SPA: Stochastic Probability Adjustment for System Balance of Unsupervised SNNs

Xingyu Yang, Mingyuan Meng, Shanlin Xiao, Zhiyi Yu

Auto-TLDR; Stochastic Probability Adjustment for Spiking Neural Networks

Spiking neural networks (SNNs) receive widespread attention because of their low-power hardware characteristics and brain-like signal response mechanism, but the performance of SNNs still lags behind that of Artificial Neural Networks (ANNs). We build an information theory-inspired system called the Stochastic Probability Adjustment (SPA) system to reduce this gap. The SPA maps the synapses and neurons of SNNs into a probability space, where a neuron with all the pre-synapses connected to it is represented by a cluster, and the movement of the synaptic transmitter between different clusters is a Brownian-like stochastic process in which the transmitter distribution is adaptively adjusted at different firing phases. We tested various existing unsupervised SNN architectures and achieved good, consistent performance improvements: classification accuracy on the MNIST and EMNIST datasets improved by 1.99% and 6.29%, respectively.

Detection and Correspondence Matching of Corneal Reflections for Eye Tracking Using Deep Learning

Soumil Chugh, Braiden Brousseau, Jonathan Rose, Moshe Eizenman

Auto-TLDR; A Fully Convolutional Neural Network for Corneal Reflection Detection and Matching in Extended Reality Eye Tracking Systems

Eye tracking systems that estimate the point-of-gaze are essential in extended reality (XR) systems as they enable new interaction paradigms and technological improvements. It is important for these systems to maintain accuracy when the headset moves relative to the head (known as device slippage) due to head movements or user adjustment. One of the most accurate eye tracking techniques, which is also insensitive to shifts of the system relative to the head, uses two or more infrared (IR) light emitting diodes to illuminate the eye and an IR camera to capture images of the eye. An essential step in estimating the point-of-gaze in these systems is the precise determination of the location of two or more corneal reflections (virtual images of the IR-LEDs that illuminate the eye) in images of the eye. Eye trackers tend to have multiple light sources to ensure at least one pair of reflections for each gaze position. The use of multiple light sources introduces a difficult problem: the need to match the corneal reflections with the corresponding light source over the range of expected eye movements. Corneal reflection detection and matching often fail in XR systems due to the proximity of camera and steep illumination angles of light sources with respect to the eye. The failures are caused by corneal reflections having varying shape and intensity levels or disappearance due to rotation of the eye, or the presence of spurious reflections. We have developed a fully convolutional neural network, based on the UNET architecture, that solves the detection and matching problem in the presence of spurious and missing reflections. Eye images of 25 people were collected in a virtual reality headset using a binocular eye tracking module consisting of five infrared light sources per eye. A set of 4,000 eye images were manually labelled for each of the corneal reflections, and data augmentation was used to generate a dataset of 40,000 images. 
The network is able to correctly identify and match 91% of corneal reflections present in the test set. This is comparable to a state-of-the-art deep learning system, but our approach requires 33 times less memory and executes 10 times faster. The proposed algorithm, when used in an eye tracker in a VR system, achieved an average mean absolute gaze error of 1°. This is a significant improvement over the state-of-the-art learning-based XR eye tracking systems that have reported gaze errors of 2-3°.

Estimating Gaze Points from Facial Landmarks by a Remote Spherical Camera

Shigang Li

Auto-TLDR; Gaze Point Estimation from a Spherical Image from Facial Landmarks

From a spherical image, a gaze point, instead of gaze vectors, can be estimated directly because a remote spherical camera can observe a user's face and a gaze target simultaneously. This paper investigates the problem of estimating a gaze point in a spherical image from facial landmarks. In contrast with existing methods, which usually assume that gaze points move on a narrow plane, the proposed method can cope with the situation where gaze points vary in depth over a relatively wide field of view. The comparative experiments show that the orthogonal coordinates of facial landmarks on a unit sphere are a reasonable representation in comparison with spherical polar coordinates, and that head pose cues help to improve the accuracy of gaze points. Consequently, the proposed method achieves gaze point estimation accuracy comparable to that of state-of-the-art methods.

Wireless Localisation in WiFi Using Novel Deep Architectures

Peizheng Li, Han Cui, Aftab Khan, Usman Raza, Robert Piechocki, Angela Doufexi, Tim Farnham

Auto-TLDR; Deep Neural Network for Indoor Localisation of WiFi Devices in Indoor Environments

This paper studies the indoor localisation of WiFi devices based on a commodity chipset and standard channel sounding. First, we present a novel shallow neural network (SNN) in which features are extracted from the channel state information (CSI) corresponding to WiFi subcarriers received on different antennas and used to train the model. The single-layer architecture of this localisation neural network makes it lightweight and easy to deploy on devices with stringent constraints on computational resources. We further investigate the use of deep learning models for localisation and design novel architectures for a convolutional neural network (CNN) and a long short-term memory (LSTM) network. We extensively evaluate these localisation algorithms for continuous tracking in indoor environments. Experimental results show that even an SNN model, after careful handcrafted feature extraction, can achieve accurate localisation. Meanwhile, using a well-organised architecture, the neural network models can be trained directly with raw data from the CSI, and localisation features can be automatically extracted to achieve accurate position estimates. We also found that the performance of neural network-based methods is directly affected by the number of anchor access points (APs), regardless of their structure. With three APs, all neural network models proposed in this paper can obtain localisation accuracy of around 0.5 metres. In addition, the proposed deep NN architecture reduces the data pre-processing time by 6.5 hours compared with a shallow NN using the data collected in our testbed. In the deployment phase, the inference time is also significantly reduced to 0.1 ms per sample. We also demonstrate the generalisation capability of the proposed method by evaluating models using target movement characteristics different from those on which they were trained.

Map-Based Temporally Consistent Geolocalization through Learning Motion Trajectories

Bing Zha, Alper Yilmaz

Auto-TLDR; Exploiting Motion Trajectories for Geolocalization of Object on Topological Map using Recurrent Neural Network

In this paper, we propose a novel trajectory learning method that exploits motion trajectories on a topological map using a recurrent neural network for temporally consistent geolocalization of an object. Inspired by the human ability to be aware of both the distance and the direction of self-motion during navigation, our trajectory learning method learns a pattern representation of trajectories encoded as a sequence of distances and turning angles to assist self-localization. We pose the learning process as a conditional sequence prediction problem in which each output locates the object on a traversable edge in a map. Since the prediction sequence ought to be topologically connected in the graph-structured map, we adopt two different hypothesis generation and elimination strategies to eliminate disconnected sequence predictions. We demonstrate our approach on the KITTI stereo visual odometry dataset, which covers a city-scale environment. The key benefits of our approach to geolocalization are that we 1) take advantage of the powerful sequence modeling ability of recurrent neural networks and their robustness to noisy input, 2) require only a map in the form of a graph, and 3) simply use an affordable sensor that generates a motion trajectory. The experiments show that the motion trajectories can be learned by training a recurrent neural network, and that temporally consistent geolocation can be predicted with both of the proposed strategies.
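
The encoding of a trajectory as a sequence of distances and turning angles can be sketched as follows. This is an illustrative reading of the abstract only: the function name `encode_trajectory`, the 2D point input, and the convention that the first step has a turning angle of zero are assumptions, not details given by the authors.

```python
import math

def encode_trajectory(points):
    """Convert a list of (x, y) positions into (distance, turning_angle) steps.

    The turning angle is the signed change in heading between consecutive
    segments, wrapped to the range [-pi, pi]. The first segment has no
    previous heading, so its turning angle is set to 0.0 (an assumption).
    """
    steps = []
    prev_heading = None
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        distance = math.hypot(dx, dy)
        heading = math.atan2(dy, dx)
        if prev_heading is None:
            turn = 0.0
        else:
            delta = heading - prev_heading
            # Wrap the heading difference so that small left and right
            # turns stay small even across the +/-pi discontinuity.
            turn = math.atan2(math.sin(delta), math.cos(delta))
        steps.append((distance, turn))
        prev_heading = heading
    return steps

# One unit east, then one unit north: a single 90-degree left turn.
steps = encode_trajectory([(0, 0), (1, 0), (1, 1)])
```

A representation of this kind is independent of the absolute position and orientation of the path, which is one plausible reason it suits a sensor that reports only self-motion.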

GazeMAE: General Representations of Eye Movements Using a Micro-Macro Autoencoder

Louise Gillian C. Bautista, Prospero Naval

Auto-TLDR; Fast and Slow Eye Movement Representations for Sentiment-agnostic Eye Tracking

Eye movements are intricate and dynamic events that contain a wealth of information about the subject and the stimuli. We propose an abstract representation of eye movements that preserves the important nuances in gaze behavior while being stimuli-agnostic. We consider eye movements as raw position and velocity signals and train a deep temporal convolutional autoencoder to learn micro-scale and macro-scale representations corresponding to the fast and slow features of eye movements. These joint representations are evaluated by fitting a linear classifier on various tasks and outperform other works in biometrics and stimuli classification. Further experiments highlight the validity and generalizability of this method, bringing eye tracking research closer to real-world applications.

From Early Biological Models to CNNs: Do They Look Where Humans Look?

Marinella Iole Cadoni, Andrea Lagorio, Enrico Grosso, Jia Huei Tan, Chee Seng Chan

Auto-TLDR; Comparing Neural Networks to Human Fixations for Semantic Learning

Early hierarchical computational visual models as well as recent deep neural networks have been inspired by the functioning of the primate visual cortex. Although much effort has been made to dissect neural networks to visualize the features they learn at the individual units, the scope of the visualizations has been limited to a categorization of the features in terms of their semantic level. Considering the ability humans have to select high semantic level regions of a scene, the questions naturally arise of whether neural networks can match this ability and whether similarity with human attention is correlated with neural network performance. To address these questions, we propose a pipeline to select sets of feature points that maximally activate individual network units and compare them to human fixations. We extract features from a variety of neural networks, from early hierarchical models such as HMAX up to recent deep convolutional neural networks such as DenseNet, to compare them to human fixations. Experiments over the ETD database show that human fixations correlate with CNN features from deep layers significantly better than with random sets of points, while they do not correlate with features extracted from the first layers of CNNs, nor with the HMAX features, which seem to have a low semantic level compared with the features that respond to the automatically learned filters of CNNs. It also turns out that there is a correlation between a CNN's similarity to humans and its classification performance.

Collaborative Human Machine Attention Module for Character Recognition

Chetan Ralekar, Tapan Gandhi, Santanu Chaudhury

Auto-TLDR; A Collaborative Human-Machine Attention Module for Deep Neural Networks

Deep learning models that include attention mechanisms have been shown to enhance the performance and efficiency of various computer vision tasks such as pattern recognition, object detection, and face recognition. Although the visual attention mechanism is the source of inspiration for these models, recent attention models treat `attention' as a pure machine vision optimization problem, and visual attention remains the most neglected aspect. Therefore, this paper presents a collaborative human and machine attention module which considers both visual attention and the network's attention. The proposed module is inspired by the dorsal (`where') pathways of visual processing and it can be integrated with any convolutional neural network (CNN) model. First, the module computes the spatial attention map from the input feature maps, which is then combined with the visual attention maps. The visual attention maps are created using eye fixations obtained by performing an eye-tracking experiment with human participants. The visual attention map covers the highly salient and discriminative image regions, as humans tend to focus on such regions, whereas the other relevant image regions are processed by the spatial attention map. The combination of these two maps yields a finer refinement of the feature maps, which results in improved performance. The comparative analysis reveals that our model not only shows significant improvement over the baseline model but also outperforms the other models. We hope that our findings using a collaborative human-machine attention module will be helpful in other vision tasks as well.

Adaptive Feature Fusion Network for Gaze Tracking in Mobile Tablets

Yiwei Bao, Yihua Cheng, Yunfei Liu, Feng Lu

Auto-TLDR; Adaptive Feature Fusion Network for Multi-stream Gaze Estimation in Mobile Tablets

Recently, many multi-stream gaze estimation methods have been proposed. They estimate gaze from eye and face appearances and achieve reasonable accuracy. However, most of these methods simply concatenate the features extracted from eye and face appearance, and the feature fusion process has been ignored. In this paper, we propose a novel Adaptive Feature Fusion Network (AFF-Net), which performs the gaze tracking task on mobile tablets. We stack two-eye feature maps and utilize Squeeze-and-Excitation layers to adaptively fuse two-eye features based on different eye features. Meanwhile, we also propose Adaptive Group Normalization to recalibrate eye features with the guidance of face appearance characteristics. Extensive experiments on both the GazeCapture and MPIIFaceGaze datasets demonstrate consistently superior performance of the proposed method.

AerialMPTNet: Multi-Pedestrian Tracking in Aerial Imagery Using Temporal and Graphical Features

Maximilian Kraus, Seyed Majid Azimi, Emec Ercelik, Reza Bahmanyar, Peter Reinartz, Alois Knoll

Auto-TLDR; AerialMPTNet: A novel approach for multi-pedestrian tracking in geo-referenced aerial imagery by fusing appearance features

Multi-pedestrian tracking in aerial imagery has several applications such as large-scale event monitoring, disaster management, search-and-rescue missions, and as input into predictive crowd dynamic models. Due to the challenges such as the large number and the tiny size of the pedestrians (e.g., 4 x 4 pixels) with their similar appearances as well as different scales and atmospheric conditions of the images with their extremely low frame rates (e.g., 2 fps), current state-of-the-art algorithms including the deep learning-based ones are unable to perform well. In this paper, we propose AerialMPTNet, a novel approach for multi-pedestrian tracking in geo-referenced aerial imagery by fusing appearance features from a Siamese Neural Network, movement predictions from a Long Short-Term Memory, and pedestrian interconnections from a GraphCNN. In addition, to address the lack of diverse aerial multi-pedestrian tracking datasets, we introduce the Aerial Multi-Pedestrian Tracking (AerialMPT) dataset consisting of 307 frames and 44,740 pedestrians annotated. To the best of our knowledge, AerialMPT is the largest and most diverse dataset to this date and will be released publicly. We evaluate AerialMPTNet on AerialMPT and KIT AIS, and benchmark with several state-of-the-art tracking methods. Results indicate that AerialMPTNet significantly outperforms other methods on accuracy and time-efficiency.

Space-Time Domain Tensor Neural Networks: An Application on Human Pose Classification

Konstantinos Makantasis, Athanasios Voulodimos, Anastasios Doulamis, Nikolaos Doulamis, Nikolaos Bakalos

Auto-TLDR; Tensor-Based Neural Network for Spatiotemporal Pose Classification using Three-Dimensional Skeleton Data

Recent advances in sensing technologies require the design and development of pattern recognition models capable of processing spatiotemporal data efficiently. In this study, we propose a spatially and temporally aware tensor-based neural network for human pose classification using three-dimensional skeleton data. Our model employs three novel components: first, an input layer capable of constructing highly discriminative spatiotemporal features; second, a tensor fusion operation that produces compact yet rich representations of the data; and third, a tensor-based neural network that processes data representations in their original tensor form. Our model is end-to-end trainable and characterized by a small number of trainable parameters, making it suitable for problems where the annotated data is limited. Experimental evaluation of the proposed model indicates that it can achieve state-of-the-art performance.

Explainable Online Validation of Machine Learning Models for Practical Applications

Wolfgang Fuhl, Yao Rong, Thomas Motz, Michael Scheidt, Andreas Markus Hartel, Andreas Koch, Enkelejda Kasneci

Auto-TLDR; A Reformulation of Regression and Classification for Machine Learning Algorithm Validation

We present a reformulation of regression and classification that aims to validate the result of a machine learning algorithm. Our reformulation simplifies the original problem and validates the result of the machine learning algorithm using the training data. Since the validation of machine learning algorithms must always be explainable, we perform our experiments with the kNN algorithm as well as with an algorithm based on conditional probabilities, which is proposed in this work. For the evaluation of our approach, three publicly available data sets were used, and three classification and two regression problems were evaluated. The presented algorithm based on conditional probabilities is also online capable and requires only a fraction of the memory needed by the kNN algorithm.

Weight Estimation from an RGB-D Camera in Top-View Configuration

Marco Mameli, Marina Paolanti, Nicola Conci, Filippo Tessaro, Emanuele Frontoni, Primo Zingaretti

Auto-TLDR; Top-View Weight Estimation using Deep Neural Networks

The development of so-called soft biometrics aims at providing information related to the physical and behavioural characteristics of a person. This paper focuses on bodyweight estimation based on the observation from a top-view RGB-D camera. The capability to estimate the weight of a person can be of help in many different applications, from health-related scenarios to business intelligence and retail analytics. To deal with this issue, a TVWE (Top-View Weight Estimation) framework is proposed with the aim of predicting the weight. The approach relies on the adoption of Deep Neural Networks (DNNs) that have been trained on depth data. Each network has also been modified in its top section to replace classification with prediction inference. The performance of five state-of-the-art DNNs has been compared, namely VGG16, ResNet, Inception, DenseNet and Efficient-Net. In addition, a convolutional auto-encoder has also been included for completeness. Considering the limited literature in this domain, the TVWE framework has been evaluated on a new publicly available dataset, the “VRAI Weight estimation Dataset”, which also collects, for each subject, labels related to weight, gender, and height. The experimental results have demonstrated that the proposed methods are suitable for this task, bringing different and significant insights for the application of the solution in different domains.

Multiple Future Prediction Leveraging Synthetic Trajectories

Lorenzo Berlincioni, Federico Becattini, Lorenzo Seidenari, Alberto Del Bimbo

Auto-TLDR; Synthetic Trajectory Prediction using Markov Chains

Trajectory prediction is an important task, especially in autonomous driving. The ability to forecast the position of other moving agents can lead to effective planning, ensuring safety for the autonomous vehicle as well as for the observed entities. In this work, we propose a data-driven approach based on Markov Chains to generate synthetic trajectories, which are useful for training a multiple future trajectory predictor. The advantages are twofold: on the one hand, synthetic samples can be used to augment existing datasets and train more effective predictors; on the other hand, the approach makes it possible to generate samples with multiple ground truths, corresponding to diverse, equally likely outcomes of the observed trajectory. We define a trajectory prediction model and a loss that explicitly address the multimodality of the problem, and we show that combining synthetic and real data leads to prediction improvements, obtaining state-of-the-art results.
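
As a rough illustration of generating synthetic trajectories with a Markov chain, the sketch below estimates first-order transition counts from observed state sequences and then samples new sequences from them. The abstract does not specify the state space or chain order, so the discrete motion labels ("straight", "left"), the first-order assumption, and the function names `fit_transitions` and `sample_trajectory` are all hypothetical.

```python
import random
from collections import defaultdict

def fit_transitions(sequences):
    """Count first-order transitions state -> next_state in the observed data."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    return {state: dict(nxt) for state, nxt in counts.items()}

def sample_trajectory(transitions, start, length, rng):
    """Sample up to `length` states, choosing each next state with
    probability proportional to its observed transition count.
    Sampling stops early at a state with no observed successor."""
    state = start
    trajectory = [state]
    while len(trajectory) < length and state in transitions:
        successors = transitions[state]
        state = rng.choices(list(successors),
                            weights=list(successors.values()))[0]
        trajectory.append(state)
    return trajectory

# Hypothetical observed sequences of discretized motion states.
observed = [["straight", "left", "straight"], ["straight", "left"]]
transitions = fit_transitions(observed)
synthetic = sample_trajectory(transitions, "straight", 4, random.Random(0))
```

Sampling several times from the same starting state would give several plausible continuations of one observed prefix, which is the sense in which a chain of this kind can supply multiple ground truths for a single input.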

Real-Time Driver Drowsiness Detection Using Facial Action Units

Malaika Vijay, Nandagopal Netrakanti Vinayak, Maanvi Nunna, Subramanyam Natarajan

Auto-TLDR; Real-Time Detection of Driver Drowsiness using Facial Action Units using Extreme Gradient Boosting

Slides Poster Similar

This paper presents a two-stage, vision-based pipeline for the real-time detection of driver drowsiness using Facial Action Units (FAUs). FAUs capture movements in groups of muscles in the face like widening of the eyes or dropping of the jaw. The first stage of the pipeline employs a Convolutional Neural Network (CNN) trained to detect FAUs. The output of the penultimate layer of this network serves as an image embedding that captures features relevant to FAU detection. These embeddings are then used to predict drowsiness using an Extreme Gradient Boosting (XGBoost) classifier. A separate XGBoost model is trained for each user of the system so that behavior specific to each user can be modeled into the drowsiness classifier. We show that user-specific classifiers require very little data and low training time to yield high prediction accuracies in real-time.

InsideBias: Measuring Bias in Deep Networks and Application to Face Gender Biometrics

Ignacio Serna, Alejandro Peña Almansa, Aythami Morales, Julian Fierrez

Auto-TLDR; InsideBias: Detecting Bias in Deep Neural Networks from Face Images

This work explores the biases in learning processes based on deep neural network architectures. We analyze how bias affects deep learning processes through a toy example using the MNIST database and a case study in gender detection from face images. We employ two gender detection models based on popular deep neural networks. We present a comprehensive analysis of bias effects when using an unbalanced training dataset on the features learned by the models. We show how bias impacts the activations of gender detection models based on face images. We finally propose InsideBias, a novel method to detect biased models. InsideBias is based on how the models represent the information instead of how they perform, which is the normal practice in other existing methods for bias detection. Our strategy with InsideBias makes it possible to detect biased models with very few samples (only 15 images in our case study). Our experiments include 72K face images from 24K identities and 3 ethnic groups.

Distilling Spikes: Knowledge Distillation in Spiking Neural Networks

Ravi Kumar Kushawaha, Saurabh Kumar, Biplab Banerjee, Rajbabu Velmurugan

Auto-TLDR; Knowledge Distillation in Spiking Neural Networks for Image Classification

Spiking Neural Networks (SNN) are energy-efficient computing architectures that exchange spikes for processing information, unlike classical Artificial Neural Networks (ANN). Due to this, SNNs are better suited for real-life deployments. However, similar to ANNs, SNNs also benefit from deeper architectures to obtain improved performance. Furthermore, like the deep ANNs, the memory, compute and power requirements of SNNs also increase with model size, and model compression becomes a necessity. Knowledge distillation is a model compression technique that enables transferring the learning of a large machine learning model to a smaller model with minimal loss in performance. In this paper, we propose techniques for knowledge distillation in spiking neural networks for the task of image classification. We present ways to distill spikes from a larger SNN, also called the teacher network, to a smaller one, also called the student network, while minimally impacting the classification accuracy. We demonstrate the effectiveness of the proposed method with detailed experiments on three standard datasets while proposing novel distillation methodologies and loss functions. We also present a multi-stage knowledge distillation technique for SNNs using an intermediate network to obtain higher performance from the student network. Our approach is expected to open up new avenues for deploying high performing large SNN models on resource-constrained hardware platforms.
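For readers unfamiliar with the teacher-student setup, the sketch below shows the conventional temperature-softened distillation loss used with ordinary ANNs. It is not the spike-based loss proposed in the paper, whose form the abstract does not give; the function names, the temperature value, and the example scores (which could stand in for output-layer spike counts) are assumptions for illustration.

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over temperature-scaled scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_scores, teacher_scores, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_scores, temperature)
    q = softmax(student_scores, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Hypothetical per-class output scores for one input image.
teacher = [8.0, 2.0, 1.0]
student = [5.0, 4.0, 1.0]
loss_mismatch = distillation_loss(student, teacher)
loss_match = distillation_loss(teacher, teacher)
```

The loss vanishes when the student reproduces the teacher's softened distribution and grows as the two diverge, which is the signal a student network would be trained to minimize alongside its usual classification loss.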

Detecting Manipulated Facial Videos: A Time Series Solution

Zhang Zhewei, Ma Can, Gao Meilin, Ding Bowen

Auto-TLDR; Face-Alignment Based Bi-LSTM for Fake Video Detection

We propose a new method to expose fake videos based on a time series solution. The method is based on a bidirectional long short-term memory (Bi-LSTM) backbone architecture with two different types of features, Face-Alignment and Dense-Face-Alignment, both of which are physiological signals that can distinguish fake videos from original ones. We choose 68 landmark points as the feature for Face-Alignment and the Pose Adaptive Feature (PAF) for Dense-Face-Alignment. Based on these two facial features, we design two deep networks. In addition, we optimize our network by adding an attention mechanism that improves detection precision. Our method is tested on benchmarks of the Face Forensics/Face Forensics++ dataset and shows promising inference speed while maintaining accuracy comparable to state-of-the-art solutions for DeepFake detection.

Exploring Spatial-Temporal Representations for fNIRS-based Intimacy Detection via an Attention-enhanced Cascade Convolutional Recurrent Neural Network

Chao Li, Qian Zhang, Ziping Zhao

Auto-TLDR; Intimate Relationship Prediction by Attention-enhanced Cascade Convolutional Recurrent Neural Network Using Functional Near-Infrared Spectroscopy

The detection of intimacy plays a crucial role in improving intimate relationships, which helps promote family and social harmony. Previous studies have shown that different degrees of intimacy produce significant differences in brain imaging. Recently, a few works have emerged that recognise intimacy automatically using machine learning techniques. Moreover, considering the temporal dynamics of intimate relationships at the neural level, how to effectively model spatio-temporal dynamics for intimacy prediction is still a challenge. In this paper, we propose a novel method to explore deep spatial-temporal representations for intimacy prediction using an Attention-enhanced Cascade Convolutional Recurrent Neural Network (ACCRNN). Given the advantages of time-frequency resolution in the analysis of complex neuronal activity, this paper utilizes functional near-infrared spectroscopy (fNIRS) to analyse and infer intimate relationships. We collect an fNIRS-based dataset for the analysis of intimate relationships. Forty-two-channel fNIRS signals are recorded from the prefrontal cortex of 44 subjects while they watch a total of 18 photos of lovers, friends and strangers for 30 seconds per photo. The experimental results show that our proposed method outperforms the others in terms of accuracy, with a precision of 96.5%. To the best of our knowledge, this is the first time that such a hybrid deep architecture has been employed for fNIRS-based intimacy prediction.

Radar Image Reconstruction from Raw ADC Data Using Parametric Variational Autoencoder with Domain Adaptation

Michael Stephan, Thomas Stadelmayer, Avik Santra, Georg Fischer, Robert Weigel, Fabian Lurz

Auto-TLDR; Parametric Variational Autoencoder-based Human Target Detection and Localization for Frequency Modulated Continuous Wave Radar

This paper presents a parametric variational autoencoder-based human target detection and localization framework working directly with the raw analog-to-digital converter data from the frequency modulated continuous wave radar. We propose a parametrically constrained variational autoencoder, with residual and skip connections, capable of generating the clustered and localized target detections on the range-angle image. Furthermore, to circumvent the problem of training the proposed neural network on all possible scenarios using real radar data, we propose domain adaptation strategies whereby we first train the neural network using ray tracing based model data and then adapt the network to work on real sensor data. This strategy ensures better generalization and scalability of the proposed neural network even though it is trained with limited radar data. We demonstrate the superior detection and localization performance of our proposed solution compared to the conventional signal processing pipeline and an earlier state-of-the-art deep U-Net architecture with range-Doppler images as inputs.

Spatial Bias in Vision-Based Voice Activity Detection

Kalin Stefanov, Mohammad Adiban, Giampiero Salvi

Auto-TLDR; Spatial Bias in Vision-based Voice Activity Detection in Multiparty Human-Human Interactions

We present models for automatic vision-based voice activity detection (VAD) in multiparty human-human interactions that are aimed at complementing the acoustic VAD methods. We provide evidence that this type of vision-based VAD model is susceptible to spatial bias in the datasets. The physical settings of the interaction, usually constant throughout data acquisition, determine the distribution of head poses of the participants. Our results show that when the head pose distributions are significantly different in the training and test sets, the performance of the models drops significantly. This suggests that previously reported results on datasets with a fixed physical configuration may overestimate the generalization capabilities of this type of model. We also propose a number of possible remedies to the spatial bias, including data augmentation, input masking and dynamic features, and provide an in-depth analysis of the visual cues used by our models.

Hybrid Approach for 3D Head Reconstruction: Using Neural Networks and Visual Geometry

Oussema Bouafif, Bogdan Khomutenko, Mohammed Daoudi

Auto-TLDR; Recovering 3D Head Geometry from a Single Image using Deep Learning and Geometric Techniques

Recovering the 3D geometric structure of a face from a single input image is a challenging active research area in computer vision. In this paper, we present a novel method for reconstructing 3D heads from a single or multiple image(s) using a hybrid approach based on deep learning and geometric techniques. We propose an encoder-decoder network based on the U-net architecture and trained on synthetic data only. It predicts both pixel-wise normal vectors and landmarks maps from a single input photo. Landmarks are used for the pose computation and the initialization of the optimization problem, which, in turn, reconstructs the 3D head geometry by using a parametric morphable model and normal vector fields. State-of-the-art results are achieved through qualitative and quantitative evaluation tests on both single and multi-view settings. Despite the fact that the model was trained only on synthetic data, it successfully recovers 3D geometry and precise poses for real-world images.

Revisiting the Training of Very Deep Neural Networks without Skip Connections

Oyebade Kayode Oyedotun, Abd El Rahman Shabayek, Djamila Aouada, Bjorn Ottersten

Auto-TLDR; Optimization of Very Deep PlainNets without shortcut connections with 'vanishing and exploding units' activations'

Deep neural networks (DNNs) with many layers of feature representations yield state-of-the-art results on several difficult learning tasks. However, optimizing very deep DNNs without shortcut connections, known as PlainNets, is a notoriously hard problem. Considering the growing interest in this area, this paper investigates holistically two scenarios that plague the training of very deep PlainNets: (1) the relatively popular challenge of 'vanishing and exploding units' activations', and (2) the less investigated 'singularity' problem, which is studied in detail in this paper. In contrast to earlier works that study only the saturation and explosion of units' activations in isolation, this paper harmonizes the inconspicuous coexistence of the aforementioned problems for very deep PlainNets. Particularly, we argue that the aforementioned problems would have to be tackled simultaneously for the successful training of very deep PlainNets. Finally, different techniques that can be employed for tackling the optimization problem are discussed, and a specific combination of simple techniques that allows the successful training of PlainNets having up to 100 layers is demonstrated.

RISEdb: A Novel Indoor Localization Dataset

Carlos Sanchez Belenguer, Erik Wolfart, Álvaro Casado Coscollá, Vitor Sequeira

Auto-TLDR; Indoor Localization Using LiDAR SLAM and Smartphones: A Benchmarking Dataset

In this paper we introduce a novel public dataset for developing and benchmarking indoor localization systems. We have selected and 3D mapped a set of representative indoor environments including a large office building, a conference room, a workshop, an exhibition area and a restaurant. Our acquisition pipeline is based on a portable LiDAR SLAM backpack to map the buildings and to accurately track the pose of the user as it moves freely inside them. We introduce the calibration procedures that enable us to acquire and geo-reference live data coming from different independent sensors rigidly attached to the backpack. This has allowed us to collect long sequences of spherical and stereo images, together with all the sensor readings coming from a consumer smartphone and locate them inside the map with centimetre accuracy. The dataset addresses many of the limitations of existing indoor localization datasets regarding the scale and diversity of the mapped buildings; the number of acquired sequences under varying conditions; the accuracy of the ground-truth trajectory; the availability of a detailed 3D model and the availability of different sensor types. It enables the benchmarking of existing and the development of new indoor localization approaches, in particular for deep learning based systems that require large amounts of labeled training data.

Fine-Tuning Convolutional Neural Networks: A Comprehensive Guide and Benchmark Analysis for Glaucoma Screening

Amed Mvoulana, Rostom Kachouri, Mohamed Akil

Auto-TLDR; Fine-tuning Convolutional Neural Networks for Glaucoma Screening

This work aims to give a comprehensive and detailed guide to fine-tuning Convolutional Neural Networks (CNNs) for glaucoma screening. Transfer learning is a promising alternative to training CNNs from scratch, avoiding the huge data and resource requirements. After a thorough study of five state-of-the-art CNN architectures, a complete and well-explained strategy for fine-tuning these networks is proposed, using hyperparameter grid search and a two-phase training approach. Excellent performance is reached on model evaluation, with a validation AUROC of 0.9772, giving rise to reliable glaucoma diagnosis-support systems. Also, a benchmark analysis is conducted across all fine-tuned models, studying them according to performance indices such as model complexity and size, AUROC density and inference time. This in-depth analysis allows a rigorous comparison between model characteristics, and is useful for giving practitioners important reference points for prospective applications and deployments.

Deep Ordinal Regression with Label Diversity

Axel Berg, Magnus Oskarsson, Mark Oconnor

Auto-TLDR; Discrete Regression via Classification for Neural Network Learning

Regression via classification (RvC) is a common method used for regression problems in deep learning, where the target variable belongs to a set of continuous values. By discretizing the target into a set of non-overlapping classes, it has been shown that training a classifier can improve neural network accuracy compared to using a standard regression approach. However, it is not clear how the set of discrete classes should be chosen and how it affects the overall solution. In this work, we propose that using several discrete data representations simultaneously can improve neural network learning compared to a single representation. Our approach is end-to-end differentiable and can be added as a simple extension to conventional learning methods, such as deep neural networks. We test our method on three challenging tasks and show that our method reduces the prediction error compared to a baseline RvC approach while maintaining a similar model complexity.
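The label side of this idea can be sketched without any network. The example below discretizes one continuous target under three shifted, equal-width grids and decodes by averaging the bin centres, assuming an idealized classifier that always predicts the correct class. The shifted-grid construction and the names `make_edges`, `encode` and `decode` are assumptions chosen for illustration; the abstract does not state which discretizations the authors combine.

```python
def make_edges(lo, hi, n_bins, offset):
    """Edges of an equal-width grid shifted left by `offset` bin widths (0 <= offset < 1)."""
    width = (hi - lo) / n_bins
    start = lo - offset * width
    # n_bins + 1 bins (n_bins + 2 edges) so the shifted grid still covers [lo, hi].
    return [start + i * width for i in range(n_bins + 2)]

def encode(y, edges):
    """Class index of the bin containing y; assumes lo <= y <= hi."""
    for k in range(len(edges) - 1):
        if edges[k] <= y < edges[k + 1]:
            return k
    return len(edges) - 2  # y sits on the final edge: use the last bin

def decode(k, edges):
    """Map a class index back to a continuous value: the bin centre."""
    return 0.5 * (edges[k] + edges[k + 1])

# Three discrete representations of the same target range [0, 100].
grids = [make_edges(0.0, 100.0, 10, off) for off in (0.0, 1 / 3, 2 / 3)]
y = 37.0
labels = [encode(y, g) for g in grids]           # one class label per grid
estimates = [decode(k, g) for k, g in zip(labels, grids)]
single = estimates[0]                            # baseline: one representation
combined = sum(estimates) / len(estimates)       # several representations averaged
```

For this particular target the single-grid estimate is the centre of the bin [30, 40), while the averaged estimate lands closer to 37. The improvement is not guaranteed for every individual target; the sketch only shows why overlapping discretizations can reduce quantization error on average.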

OmniFlowNet: A Perspective Neural Network Adaptation for Optical Flow Estimation in Omnidirectional Images

Charles-Olivier Artizzu, Haozhou Zhang, Guillaume Allibert, Cédric Demonceaux

Auto-TLDR; OmniFlowNet: A Convolutional Neural Network for Omnidirectional Optical Flow Estimation

Spherical cameras and the latest image processing techniques open up new horizons. In particular, methods based on Convolutional Neural Networks (CNNs) now give excellent results for optical flow estimation on perspective images. However, these approaches are highly dependent on their architectures and training datasets. This paper proposes to benefit from years of improvement in perspective images optical flow estimation and to apply it to omnidirectional ones without training on new datasets. Our network, OmniFlowNet, is built on a CNN specialized in perspective images. Its convolution operation is adapted to be consistent with the equirectangular projection. Tested on spherical datasets created with Blender and several equirectangular videos realized from real indoor and outdoor scenes, OmniFlowNet shows better performance than its original network.

Generalization Comparison of Deep Neural Networks Via Output Sensitivity

Mahsa Forouzesh, Farnood Salehi, Patrick Thiran

Auto-TLDR; Generalization of Deep Neural Networks using Sensitivity

Although recent works have brought some insights into the performance improvement of techniques used in state-of-the-art deep-learning models, more work is needed to understand their generalization properties. We shed light on this matter by linking the loss function to the output's sensitivity to its input. We find a rather strong empirical relation between the output sensitivity and the variance in the bias-variance decomposition of the loss function, which hints at using sensitivity as a metric for comparing the generalization performance of networks, without requiring labeled data. We find that sensitivity is decreased by applying popular methods which improve the generalization performance of the model, such as (1) using a deep network rather than a wide one, (2) adding convolutional layers to baseline classifiers instead of adding fully-connected layers, (3) using batch normalization, dropout and max-pooling, and (4) applying parameter initialization techniques.
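One simple way to make "sensitivity of the output to the input" concrete is a finite-difference estimate: perturb each unlabeled input with small Gaussian noise and average the squared change in the output. The sketch below does this for two toy scalar models. The estimator, the noise scale, and the toy models are assumptions for illustration and may differ from the exact definition used in the paper.

```python
import random

def output_sensitivity(model, inputs, noise_std=0.01, seed=0):
    """Mean squared output change under small Gaussian input perturbations."""
    rng = random.Random(seed)
    squared_changes = []
    for x in inputs:
        noisy = [xi + rng.gauss(0.0, noise_std) for xi in x]
        squared_changes.append((model(noisy) - model(x)) ** 2)
    return sum(squared_changes) / len(squared_changes)

# Hypothetical toy models: linear maps with small and large weights.
def smooth_model(x):
    return 0.5 * sum(x)

def steep_model(x):
    return 5.0 * sum(x)

# Unlabeled inputs are enough: no targets are used anywhere.
data_rng = random.Random(1)
inputs = [[data_rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(50)]

sens_smooth = output_sensitivity(smooth_model, inputs)
sens_steep = output_sensitivity(steep_model, inputs)
```

Both calls use the same seed and therefore the same perturbations, so the comparison between `sens_smooth` and `sens_steep` reflects the models alone. No labels enter the computation, which is what makes such a quantity usable on unlabeled data.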

A Neural Lip-Sync Framework for Synthesizing Photorealistic Virtual News Anchors

Ruobing Zheng, Zhou Zhu, Bo Song, Ji Changjiang

Auto-TLDR; Lip-sync: Synthesis of a Virtual News Anchor for Low-Delayed Applications

Lip sync has emerged as a promising technique to generate mouth movements from audio signals. However, synthesizing a high-resolution and photorealistic virtual news anchor with current methods is still challenging. The lack of natural appearance, visual consistency, and processing efficiency is the main issue. In this paper, we present a novel lip-sync framework specially designed for producing a virtual news anchor for a target person. A pair of Temporal Convolutional Networks are used to learn the seq-to-seq mapping from audio signals to mouth movements, followed by a neural rendering model that translates the intermediate face representation to the high-quality appearance. This fully-trainable framework avoids several time-consuming steps in traditional graphics-based methods, meeting the requirements of many low-delay applications. Experiments show that our method has advantages over modern neural-based methods in both visual appearance and processing efficiency.

Edge-Aware Graph Attention Network for Ratio of Edge-User Estimation in Mobile Networks

Jiehui Deng, Sheng Wan, Xiang Wang, Enmei Tu, Xiaolin Huang, Jie Yang, Chen Gong

Auto-TLDR; EAGAT: Edge-Aware Graph Attention Network for Automatic REU Estimation in Mobile Networks

Estimating the Ratio of Edge-Users (REU) is an important issue in mobile networks, as it helps the subsequent adjustment of loads in different cells. However, existing approaches usually determine the REU manually, which is experience-dependent and labor-intensive, and thus the estimated REU might be imprecise. Considering the inherent graph structure of mobile networks, in this paper, we utilize a graph-based deep learning method for automatic REU estimation, where the practical cells are deemed as nodes and the load switchings among them constitute edges. Concretely, Graph Attention Network (GAT) is employed as the backbone of our method due to its impressive generalizability in dealing with networked data. Nevertheless, conventional GAT cannot make full use of the information in mobile networks, since it only incorporates node features to infer the pairwise importance and conduct graph convolutions, while the edge features that are actually critical in our problem are disregarded. To address this issue, we propose an Edge-Aware Graph Attention Network (EAGAT), which is able to fuse the node features and edge features for REU estimation. Extensive experimental results on two real-world mobile network datasets demonstrate the superiority of our EAGAT approach to several state-of-the-art methods.

A General End-To-End Method for Characterizing Neuropsychiatric Disorders Using Free-Viewing Visual Scanning Tasks

Hong Yue Sean Liu, Jonathan Chung, Moshe Eizenman

Auto-TLDR; A general, data-driven, end-to-end framework that extracts relevant features of attentional bias from visual scanning behaviour and uses these features

The growing availability of eye-gaze tracking technology has allowed for its employment in a wide variety of applications, one of which is the objective diagnosis and monitoring of neuropsychiatric disorders from features of attentional bias extracted from visual scanning patterns. Current techniques in this field are largely comprised of non-generalizable methodologies that rely on domain expertise and study-specific assumptions. In this paper, we present a general, data-driven, end-to-end framework that extracts relevant features of attentional bias from visual scanning behaviour and uses these features to classify between subject groups with standard machine learning techniques. During the free-viewing task, subjects view sets of slides with thematic images while their visual scanning patterns (sets of ordered fixations) are monitored by an eye-tracking system. We encode fixations into relative visual attention maps (RVAMs) to describe measurement errors, and two data-driven methods are proposed to segment regions of interest from RVAMs: 1) using group average RVAMs, and 2) using difference of group average RVAMs. Relative fixation times within regions of interest are calculated and used as input features for a vanilla multilayered perceptron to classify between patient groups. The methods were evaluated on data from an anorexia nervosa (AN) study with 37 subjects and a bipolar/major depressive disorder (BD-MDD) study with 73 subjects. Using leave-one-subject-out cross validation, our technique is able to achieve an area under the receiver operating curve (AUROC) score of 0.935 for the AN study and 0.888 for the BD-MDD study, the latter of which exceeds the performance of the state-of-the-art analysis model designed specifically for the BD-MDD study, which had an AUROC of 0.879. The results validate the proposed methods' efficacy as generalizable, standard baselines for analyzing visual scanning data.
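The feature described here, relative fixation time within a region of interest, can be illustrated with a short sketch. It assumes fixations are given as (x, y, duration) tuples and that regions of interest are axis-aligned rectangles; both are simplifications made for this example, since the paper segments its regions from RVAMs rather than from hand-drawn boxes. The function name and sample values are hypothetical.

```python
def relative_fixation_times(fixations, rois):
    """Fraction of total fixation time spent inside each region of interest.

    fixations: list of (x, y, duration_seconds)
    rois: dict mapping a region name to (x0, y0, x1, y1), with x1 and y1 exclusive
    """
    total = sum(duration for _, _, duration in fixations)
    features = {}
    for name, (x0, y0, x1, y1) in rois.items():
        inside = sum(
            duration
            for x, y, duration in fixations
            if x0 <= x < x1 and y0 <= y < y1
        )
        features[name] = inside / total if total > 0 else 0.0
    return features

# Hypothetical slide with two thematic images and three recorded fixations.
fixations = [(100, 120, 0.30), (110, 130, 0.20), (400, 300, 0.50)]
rois = {"image_a": (0, 0, 200, 200), "image_b": (300, 200, 500, 400)}
features = relative_fixation_times(fixations, rois)
```

The resulting dictionary holds one value per region, and such vectors are the kind of input a small multilayer perceptron could take to separate subject groups.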

Rotational Adjoint Methods for Learning-Free 3D Human Pose Estimation from IMU Data

Caterina Emilia Agelide Buizza, Yiannis Demiris

Auto-TLDR; Learning-free 3D Human Pose Estimation from Inertial Measurement Unit Data

We present a new framework for learning-free 3D human pose estimation from Inertial Measurement Unit (IMU) data. The proposed method does not rely on a full motion sequence to calculate a pose for any particular time point and thus can operate in real-time. A cost function based only on joint rotations is used, removing the need for frequent transformations between rotations and 3D Cartesian coordinates. A Jacobian that preserves skeleton structure is derived using Adjoint methods from Variational Data Assimilation. To facilitate further research in IMU-based Motion Capture, we provide a dataset that combines RGB and depth images from an Intel RealSense camera, marker-based motion capture from an Optitrack system and Xsens IMU data. We have evaluated our method on both our dataset and the Total Capture dataset, showing an average error across 24 joints of 0.45 and 0.48 radians respectively.

Photometric Stereo with Twin-Fisheye Cameras

Jordan Caracotte, Fabio Morbidi, El Mustapha Mouaddib

Auto-TLDR; Photometric stereo problem for low-cost 360-degree cameras

In this paper, we introduce and solve, for the first time, the photometric stereo problem for low-cost 360-degree cameras. In particular, we present a spherical image irradiance equation which is adapted to twin-fisheye cameras, and an original algorithm for the estimation of light directions based on the specular highlights observed on mirror balls. Extensive experiments with synthetic and real-world images captured by a Ricoh Theta V camera, demonstrate the effectiveness and robustness of the proposed 3D reconstruction pipeline. To foster reproducible research, the image dataset and code developed for this paper are made publicly available at the address: https://home.mis.u-picardie.fr/~fabio/PhotoSphere.html

Inferring Functional Properties from Fluid Dynamics Features

Andrea Schillaci, Maurizio Quadrio, Carlotta Pipolo, Marcello Restelli, Giacomo Boracchi

Auto-TLDR; Exploiting Convective Properties of Computational Fluid Dynamics for Medical Diagnosis

In a wide range of applied problems involving fluid flows, Computational Fluid Dynamics (CFD) provides detailed quantitative information on the flow field, at various levels of fidelity and computational cost. However, CFD alone cannot predict high-level functional properties of the system that are not easily obtained from the equations of fluid motion. In this work, we present a data-driven framework to extract additional information, such as medical diagnostic output, from CFD solutions. The task is made difficult by the huge data dimensionality of CFD, together with the limited amount of training data implied by its high computational cost. By pursuing a traditional ML pipeline of pre-processing, feature extraction, and model training, we demonstrate that informative features can be extracted from CFD data. Two experiments, pertaining to different application domains, support the claim that the convective properties implicit in a CFD solution can be leveraged to retrieve functional information for which an analytical definition is missing. Despite the preliminary nature of our study and the relative simplicity of both the geometrical and CFD models, for the first time we demonstrate that the combination of ML and CFD can diagnose a complex system in terms of high-level functional information.

NetCalib: A Novel Approach for LiDAR-Camera Auto-Calibration Based on Deep Learning

Shan Wu, Amnir Hadachi, Damien Vivet, Yadu Prabhakar

Auto-TLDR; Automatic Calibration of LiDAR and Cameras using Deep Neural Network

The fusion of LiDAR and cameras has been widely used in many robotics applications such as classification, segmentation, object detection, and autonomous driving. It is essential that the LiDAR sensor can measure distances accurately, which is a good complement to the cameras. Hence, calibrating sensors before deployment is a mandatory step. The conventional methods rely on checkerboards, specific patterns, or human labeling, which is tedious and labor-intensive if the same calibration process must be repeated every time. The main purpose of this research work is to build a deep neural network that is capable of automatically finding the geometric transformation between LiDAR and cameras. The results show that our model manages to find the transformations from randomly sampled artificial errors. In addition, our work is open-sourced so that the community can fully utilize the methodology to develop the approach further and to initiate collaboration and innovation on the topic.

Audio-Video Detection of the Active Speaker in Meetings

Francisco Madrigal, Frederic Lerasle, Lionel Pibre, Isabelle Ferrané

Auto-TLDR; Active Speaker Detection with Visual and Contextual Information from Meeting Context

Meetings are a common activity that poses certain challenges when creating systems to assist them. Such is the case of active speaker detection, which can provide useful information for human interaction modeling or human-robot interaction. Active speaker detection is mostly done using speech; however, certain visual and contextual information can provide additional insights. In this paper we propose an active speaker detection framework that integrates audiovisual features with social information from the meeting context. The visual cue is processed using a Convolutional Neural Network (CNN) that captures the spatio-temporal relationships. We analyze several CNN architectures with both cues: raw pixels (RGB images) and motion (estimated with optical flow). Contextual reasoning is done with an original methodology, based on the gaze of all participants. We evaluate our proposal on a public state-of-the-art benchmark: the AMI corpus. We show how the addition of visual and context information improves the performance of active speaker detection.

Local Attention and Global Representation Collaborating for Fine-Grained Classification

He Zhang, Yunming Bai, Hui Zhang, Jing Liu, Xingguang Li, Zhaofeng He

Auto-TLDR; Weighted Region Network for Cosmetic Contact Lenses Detection

Cosmetic contact lenses over an iris may change its original textural pattern, which is the foundation for iris recognition, making cosmetic lenses a possible and easy-to-use means of iris presentation attack. Some approaches have been proposed for cosmetic contact lens detection in practical application systems, but they still face unsolved problems, such as low-quality iris images and inaccurately localized iris boundaries. In this paper, we propose a novel framework called Weighted Region Network (WRN) for cosmetic contact lens detection. The WRN includes both the local attention Weight Network and the global classification Region Network. With the inherent attention mechanism, the proposed network is able to find the most discriminative regions, which reduces the requirement for target detection and improves the ability of classification based on some specific areas and patterns. The Weight Network can be trained by using Rank loss and MSE loss without manual discriminative region annotations. Experiments are conducted on several databases and a newly collected low-quality iris image database. The proposed method outperforms state-of-the-art fake iris detection algorithms, and is also effective for the fine-grained image classification task.

RefiNet: 3D Human Pose Refinement with Depth Maps

Andrea D'Eusanio, Stefano Pini, Guido Borghi, Roberto Vezzani, Rita Cucchiara

Auto-TLDR; RefiNet: A Multi-stage Framework for 3D Human Pose Estimation

Human Pose Estimation is a fundamental task for many applications in the Computer Vision community and has been widely investigated in the 2D domain, i.e. intensity images. Therefore, most of the available methods for this task are based on 2D Convolutional Neural Networks and huge manually annotated RGB datasets, achieving stunning results. In this paper, we propose RefiNet, a multi-stage framework that regresses an extremely precise 3D human pose estimate from a given 2D pose and a depth map. The framework consists of three modules, each specialized in a particular refinement and data representation, i.e. depth patches, 3D skeleton and point clouds. Moreover, we collect a new dataset, namely Baracca, acquired with RGB, depth and thermal cameras and specifically created for the automotive context. Experimental results confirm the quality of the refinement procedure, which largely improves the human pose estimations of off-the-shelf 2D methods.

Real-Time Drone Detection and Tracking with Visible, Thermal and Acoustic Sensors

Fredrik Svanström, Cristofer Englund, Fernando Alonso-Fernandez

Auto-TLDR; Automatic multi-sensor drone detection using sensor fusion

This paper explores the process of designing an automatic multi-sensor drone detection system. Besides the common video and audio sensors, the system also includes a thermal infrared camera, which is shown to be a feasible solution to the drone detection task. Even with slightly lower resolution, its performance is just as good as that of a camera in the visible range. The detector performance as a function of the sensor-to-target distance is also investigated. In addition, using sensor fusion, the system is made more robust than the individual sensors, helping to reduce false detections. To counteract the lack of public datasets, a novel video dataset containing 650 annotated infrared and visible videos of drones, birds, airplanes and helicopters is also presented. The database is complemented with an audio dataset covering the classes drones, helicopters and background noise.

Better Prior Knowledge Improves Human-Pose-Based Extrinsic Camera Calibration

Olivier Moliner, Sangxia Huang, Kalle Åström

Auto-TLDR; Improving Human-pose-based Extrinsic Calibration for Multi-Camera Systems

Accurate extrinsic calibration of wide-baseline multi-camera systems enables better understanding of 3D scenes for many applications and is of great practical importance. Classical Structure-from-Motion calibration methods require special calibration equipment so that accurate point correspondences can be detected between different views. In addition, an operator with some training is usually needed to ensure that data is collected in a way that leads to good calibration accuracy. This limits the ease of adoption of such technologies. Recently, methods have been proposed that use human pose estimation models to establish point correspondences, thus removing the need for any special equipment. The challenge with this approach is that human pose estimation algorithms typically produce much less accurate feature points than classical patch-based methods. Another problem is that ambient human motion might not be optimal for calibration. We build upon prior work and introduce several novel ideas to improve the accuracy of human-pose-based extrinsic calibration. Our first contribution is a robust reprojection loss based on a better understanding of the sources of pose estimation error. Our second contribution is a 3D human pose likelihood model learned from motion capture data. We demonstrate significant improvements in calibration accuracy by evaluating our method on four publicly available datasets.

Reducing-Over-Time Tree for Event-Based Data

Shane Harrigan, Sonya Coleman, Dermot Kerr, Pratheepan Yogarajah, Zheng Fang, Chengdong Wu

Auto-TLDR; Reducing-Over-Time Binary Tree Structure for Event-Based Vision Data

This paper presents a novel Reducing-Over-Time (ROT) binary tree structure for event-based vision data, together with subtypes of the tree structure. A framework using ROT is presented that takes advantage of the self-balancing and self-pruning nature of the tree structure to extract spatial-temporal information. The ROT framework is paired with an established motion classification technique, and its performance is evaluated against other state-of-the-art techniques using four datasets. Additionally, the ROT framework as a processing platform is compared with other event-based vision processing platforms in terms of memory usage and is found to be one of the most memory-efficient platforms available.

Transfer Learning through Weighted Loss Function and Group Normalization for Vessel Segmentation from Retinal Images

Abdullah Sarhan, Jon Rokne, Reda Alhajj, Andrew Crichton

Auto-TLDR; Deep Learning for Segmentation of Blood Vessels in Retinal Images

The structure of retinal blood vessels is important in diagnosing retinal conditions such as glaucoma and diabetic retinopathy. Accurate segmentation of these vessels can help in detecting retinal objects such as the optic disc and optic cup and hence in determining whether these areas are damaged. Moreover, the structure of the vessels can help in diagnosing glaucoma. The rapid development of digital imaging and computer-vision techniques has increased the potential for developing approaches for segmenting retinal vessels. In this paper, we propose an approach for segmenting retinal vessels that uses deep learning along with transfer learning. We adapted the U-Net structure to use a customized InceptionV3 as the encoder and used multiple skip connections to form the decoder. Moreover, we used a weighted loss function to handle the issue of class imbalance in retinal images. Furthermore, we contributed a new dataset to this field. We tested our approach on six publicly available datasets and a newly created dataset. We achieved an average accuracy of 95.60% and a Dice coefficient of 80.98%. The results obtained from comprehensive experiments demonstrate the robustness of our approach to the segmentation of blood vessels in retinal images obtained from different sources. Our approach results in greater segmentation accuracy than other approaches.