Multi-Order Feature Statistical Model for Fine-Grained Visual Categorization

Qingtao Wang, Ke Zhang, Shaoli Huang, Lianbo Zhang, Jin Fan

Responsive image

Auto-TLDR; Multi-Order Feature Statistical Method for Fine-Grained Visual Categorization

Slides Poster Similar

Fine-grained visual categorization aims to learn a robust image representation modeling subtle differences from similar categories. Existing methods in this field tackle the problem by designing complex frameworks, which produce high-level features by performing first-order or second-order pooling. Despite the impressive performance achieved by these strategies, the single-order networks only carry linear or non-linear information of the last convolutional layer, neglecting the fact that feature from different orders are mutually complementary. In this paper, we propose a Multi-Order Feature Statistical Method (MOFS), which learns fine-grained features characterizing multiple orders. Specifically, the MOFS consists of two sub-modules: (i) a first-order module modeling both mid-level and high-level features. (ii) a covariance feature statistical module capturing high-order features. By deploying these two sub-modules on the top of existing backbone networks, MOFS simultaneously captures multi-level of discrimative patters including local, global and co-related patters. We evaluate the proposed method on three challenging benchmarks, namely CUB-200-2011, Stanford Cars, and FGVC-Aircraft. Compared with state-of-the-art methods, experiments results exhibit superior performance in recognizing fine-grained objects

Visibility Restoration in Infra-Red Images

Olivier Fourt, Jean-Philippe Tarel

Responsive image

Auto-TLDR; Single Image Defogging for Long-Wavelength Infra-Red (LWIR)

Slides Poster Similar

For the last decade, single image defogging has been a subject of interest in image processing. In the visible spectrum, fog and haze decrease the visibility of distant objects. Thus, the objective of the visibility restoration is to remove as much as possible the effect of the fog within the image. Infrared sensors are more and more used in automotive and aviation industries but the effect of fog and haze is not restricted to the visible spectrum and also applies in the infrared band. After recalling the effects of fog in the common sub-bands of the infrared spectrum, we tested if the approach used for single image defogging in the visible spectrum might also work for infrared. This led us to propose a new approach of single image defogging for Long-Wavelength Infra-Red (LWIR) or Thermal Infra-Red. Several experiments are presented showing that the proposed algorithm offers interesting results not only for fog and haze but for bad weather conditions in general, during day and night.

Deep Reinforcement Learning on a Budget: 3D Control and Reasoning without a Supercomputer

Edward Beeching, Jilles Steeve Dibangoye, Olivier Simonin, Christian Wolf

Responsive image

Auto-TLDR; Deep Reinforcement Learning in Mobile Robots Using 3D Environment Scenarios

Slides Poster Similar

An important goal of research in Deep Reinforcement Learning in mobile robotics is to train agents capableof solving complex tasks, which require a high level of scene understanding and reasoning from an egocentric perspective.When trained from simulations, optimal environments should satisfy a currently unobtainable combination of high-fidelity photographic observations, massive amounts of different environment configurations and fast simulation speeds. In this paper we argue that research on training agents capable of complex reasoning can be simplified by decoupling from the requirement of high fidelity photographic observations. We present a suite of tasks requiring complex reasoning and exploration in continuous,partially observable 3D environments. The objective is to provide challenging scenarios and a robust baseline agent architecture that can be trained on mid-range consumer hardware in under 24h. Our scenarios combine two key advantages: (i) they are based on a simple but highly efficient 3D environment (ViZDoom)which allows high speed simulation (12000fps); (ii) the scenarios provide the user with a range of difficulty settings, in order to identify the limitations of current state of the art algorithms and network architectures. We aim to increase accessibility to the field of Deep-RL by providing baselines for challenging scenarios where new ideas can be iterated on quickly. We argue that the community should be able to address challenging problems in reasoning of mobile agents without the need for a large compute infrastructure.

On Morphological Hierarchies for Image Sequences

Caglayan Tuna, Alain Giros, François Merciol, Sébastien Lefèvre

Responsive image

Auto-TLDR; Comparison of Hierarchies for Image Sequences

Slides Poster Similar

Morphological hierarchies form a popular framework aiming at emphasizing the multiscale structure of digital image by performing an unsupervised spatial partitioning of the data. These hierarchies have been recently extended to cope with image sequences, and different strategies have been proposed to allow their construction from spatio-temporal data. In this paper, we compare these hierarchical representation strategies for image sequences according to their structural properties. We introduce a projection method to make these representations comparable. Furthermore, we extend one of these recent strategies in order to obtain more efficient hierarchical representations for image sequences. Experiments were conducted on both synthetic and real datasets, the latter being made of satellite image time series. We show that building one hierarchy by using spatial and temporal information together is more efficient comparing to other existing strategies.

EdgeNet: Semantic Scene Completion from a Single RGB-D Image

Aloisio Dourado, Teofilo De Campos, Adrian Hilton, Hansung Kim

Responsive image

Auto-TLDR; Semantic Scene Completion using 3D Depth and RGB Information

Slides Poster Similar

Semantic scene completion is the task of predicting a complete 3D representation of volumetric occupancy with corresponding semantic labels for a scene from a single point of view. In this paper, we present EdgeNet, a new end-to-end neural network architecture that fuses information from depth and RGB, explicitly representing RGB edges in 3D space. Previous works on this task used either depth-only or depth with colour by projecting 2D semantic labels generated by a 2D segmentation network into the 3D volume, requiring a two step training process. Our EdgeNet representation encodes colour information in 3D space using edge detection and flipped truncated signed distance, which improves semantic completion scores especially in hard to detect classes. We achieved state-of-the-art scores on both synthetic and real datasets with a simpler and a more computationally efficient training pipeline than competing approaches.

Semi-Supervised Class Incremental Learning

Alexis Lechat, Stéphane Herbin, Frederic Jurie

Responsive image

Auto-TLDR; incremental class learning with non-annotated batches

Slides Poster Similar

This paper makes a contribution to the problem of incremental class learning, the principle of which is to sequentially introduce batches of samples annotated with new classes during the learning phase. The main objective is to reduce the drop in classification performance on old classes, a phenomenon commonly called catastrophic forgetting. We propose in this paper a new method which exploits the availability of a large quantity of non-annotated images in addition to the annotated batches. These images are used to regularize the classifier and give the feature space a more stable structure. We demonstrate on several image data sets that our approach is able to improve the global performance of classifiers learned using an incremental learning protocol, even with annotated batches of small size.

Supervised Classification Using Graph-Based Space Partitioning for Multiclass Problems

Nicola Yanev, Ventzeslav Valev, Adam Krzyzak, Karima Ben Suliman

Responsive image

Auto-TLDR; Box Classifier for Multiclass Classification

Slides Poster Similar

We introduce and investigate in multiclass setting an efficient classifier which partitions the training data by means of multidimensional parallelepipeds called boxes. We show that multiclass classification problem at hand can be solved by integrating the heuristic minimum clique cover approach and the k-nearest neighbor rule. Our algorithm is motivated an algorithm for partitioning a graph into a minimal number of maximal. The main advantage of the new classifier called Box classifier is that it optimally utilizes the geometrical structure of the training set by decomposing the l-class problem (l > 2) into l binary classification problems. We discuss computational complexity of the proposed Box classifier. The extensive experiments performed on the simulated and real data for binary and multiclass problems show that in almost all cases the Box classifier performs significantly better than k-NN, SVM and decision trees.

Context Matters: Self-Attention for Sign Language Recognition

Fares Ben Slimane, Mohamed Bouguessa

Responsive image

Auto-TLDR; Attentional Network for Continuous Sign Language Recognition

Slides Poster Similar

This paper proposes an attentional network for the task of Continuous Sign Language Recognition. The proposed approach exploits co-independent streams of data to model the sign language modalities. These different channels of information can share a complex temporal structure between each other. For that reason, we apply attention to synchronize and help capture entangled dependencies between the different sign language components. Even though Sign Language is multi-channel, handshapes represent the central entities in sign interpretation. Seeing handshapes in their correct context defines the meaning of a sign. Taking that into account, we utilize the attention mechanism to efficiently aggregate the hand features with their appropriate Spatio-temporal context for better sign recognition. We found that by doing so the model is able to identify the essential Sign Language components that revolve around the dominant hand and the face areas. We test our model on the benchmark dataset RWTH-PHOENIX-Weather 2014, yielding competitive results.

Lightweight Low-Resolution Face Recognition for Surveillance Applications

Yoanna Martínez-Díaz, Heydi Mendez-Vazquez, Luis S. Luevano, Leonardo Chang, Miguel Gonzalez-Mendoza

Responsive image

Auto-TLDR; Efficiency of Lightweight Deep Face Networks on Low-Resolution Surveillance Imagery

Slides Poster Similar

Typically, real-world requirements to deploy face recognition models in unconstrained surveillance scenarios demand to identify low-resolution faces with extremely low computational cost. In the last years, several methods based on complex deep learning models have been proposed with promising recognition results but at a high computational cost. Inspired by the compactness and computation efficiency of lightweight deep face networks and their high accuracy on general face recognition tasks, in this work we propose to benchmark two recently introduced lightweight face models on low-resolution surveillance imagery to enable efficient system deployment. In this way, we conduct a comprehensive evaluation on the two typical settings: LR-to-HR and LR-to-LR matching. In addition, we investigate the effect of using trained models with down-sampled synthetic data from high-resolution images, as well as the combination of different models, for face recognition on real low-resolution images. Experimental results show that the used lightweight face models achieve state-of-the-art results on low-resolution benchmarks with low memory footprint and computational complexity. Moreover, we observed that combining models trained with different degradations improves the recognition accuracy on low-resolution surveillance imagery, which is feasible due to their low computational cost.

A Transformer-Based Network for Anisotropic 3D Medical Image Segmentation

Guo Danfeng, Demetri Terzopoulos

Responsive image

Auto-TLDR; A transformer-based model to tackle the anisotropy problem in 3D medical image analysis

Slides Poster Similar

A critical challenge of applying neural networks to 3D medical image analysis is to deal with the anisotropy problem. The inter-slice contextual information contained in medical images is important, especially when the structural information of lesions is needed. However, such information often varies with cases because of variable slice spacing. Image anisotropy downgrades model performance especially when slice spacing varies significantly among training and testing datasets. ExsiWe proposed a transformer-based model to tackle the anisotropy problem. It is adaptable to different levels of anisotropy and is computationally efficient. Experiments are conducted on 3D lung cancer segmentation task. Our model achieves an average Dice score of approximately 0.87, which generally outperforms baseline models.

LFIR2Pose: Pose Estimation from an Extremely Low-Resolution FIR Image Sequence

Saki Iwata, Yasutomo Kawanishi, Daisuke Deguchi, Ichiro Ide, Hiroshi Murase, Tomoyoshi Aizawa

Responsive image

Auto-TLDR; LFIR2Pose: Human Pose Estimation from a Low-Resolution Far-InfraRed Image Sequence

Slides Poster Similar

In this paper, we propose a method for human pose estimation from a Low-resolution Far-InfraRed (LFIR) image sequence captured by a 16 × 16 FIR sensor array. Human body estimation from such a single LFIR image is a hard task. For training the estimation model, annotation of the human pose to the images is also a difficult task for human. Thus, we propose the LFIR2Pose model which accepts a sequence of LFIR images and outputs the human pose of the last frame, and also propose an automatic annotation system for the model training. Additionally, considering that the scale of human body motion is largely different among body parts, we also propose a loss function focusing on the difference. Through an experiment, we evaluated the human pose estimation accuracy using an original data set, and confirmed that human pose can be estimated accurately from an LFIR image sequence.

Better Prior Knowledge Improves Human-Pose-Based Extrinsic Camera Calibration

Olivier Moliner, Sangxia Huang, Kalle Åström

Responsive image

Auto-TLDR; Improving Human-pose-based Extrinsic Calibration for Multi-Camera Systems

Slides Poster Similar

Accurate extrinsic calibration of wide baseline multi-camera systems enables better understanding of 3D scenes for many applications and is of great practical importance. Classical Structure-from-Motion calibration methods require special calibration equipment so that accurate point correspondences can be detected between different views. In addition, an operator with some training is usually needed to ensure that data is collected in a way that leads to good calibration accuracy. This limits the ease of adoption of such technologies. Recently, methods have been proposed to use human pose estimation models to establish point correspondences, thus removing the need for any special equipment. The challenge with this approach is that human pose estimation algorithms typically produce much less accurate feature points compared to classical patch-based methods. Another problem is that ambient human motion might not be optimal for calibration. We build upon prior works and introduce several novel ideas to improve the accuracy of human-pose-based extrinsic calibration. Our first contribution is a robust reprojection loss based on a better understanding of the sources of pose estimation error. Our second contribution is a 3D human pose likelihood model learned from motion capture data. We demonstrate significant improvements in calibration accuracy by evaluating our method on four publicly available datasets.

Distilling Spikes: Knowledge Distillation in Spiking Neural Networks

Ravi Kumar Kushawaha, Saurabh Kumar, Biplab Banerjee, Rajbabu Velmurugan

Responsive image

Auto-TLDR; Knowledge Distillation in Spiking Neural Networks for Image Classification

Slides Poster Similar

Spiking Neural Networks (SNN) are energy-efficient computing architectures that exchange spikes for processing information, unlike classical Artificial Neural Networks (ANN). Due to this, SNNs are better suited for real-life deployments. However, similar to ANNs, SNNs also benefit from deeper architectures to obtain improved performance. Furthermore, like the deep ANNs, the memory, compute and power requirements of SNNs also increase with model size, and model compression becomes a necessity. Knowledge distillation is a model com- pression technique that enables transferring the learning of a large machine learning model to a smaller model with minimal loss in performance. In this paper, we propose techniques for knowledge distillation in spiking neural networks for the task of image classification. We present ways to distill spikes from a larger SNN, also called the teacher network, to a smaller one, also called the student network, while minimally impacting the classification accuracy. We demonstrate the effectiveness of the proposed method with detailed experiments on three standard datasets while proposing novel distillation methodologies and loss functions. We also present a multi-stage knowledge distillation technique for SNNs using an intermediate network to obtain higher performance from the student network. Our approach is expected to open up new avenues for deploying high performing large SNN models on resource-constrained hardware platforms.

Unsupervised 3D Human Pose Estimation in Multi-view-multi-pose Video

Cheng Sun, Diego Thomas, Hiroshi Kawasaki

Responsive image

Auto-TLDR; Unsupervised 3D Human Pose Estimation from 2D Videos Using Generative Adversarial Network

Slides Poster Similar

3D human pose estimation from a single 2D video is an extremely difficult task because computing 3D geometry from 2D images is an ill-posed problem. Recent popular solutions adopt fully-supervised learning strategy, which requires to train a deep network on a large-scale ground truth dataset of 3D poses and 2D images. However, such a large-scale dataset with natural images does not exist, which limits the usability of existing methods. While building a complete 3D dataset is tedious and expensive, abundant 2D in-the-wild data is already publicly available. As a consequence, there is a growing interest in the computer vision community to design efficient techniques that use the unsupervised learning strategy, which does not require any ground truth 3D data. Such methods can be trained with only natural 2D images of humans. In this paper we propose an unsupervised method for estimating 3D human pose in videos. The standard approach for unsupervised learning is to use the Generative Adversarial Network (GAN) framework. To improve the performance of 3D human pose estimation in videos, we propose a new GAN network that enforces body consistency over frames in a video. We evaluate the efficiency of our proposed method on a public 3D human body dataset.

Improving Batch Normalization with Skewness Reduction for Deep Neural Networks

Pak Lun Kevin Ding, Martin Sarah, Baoxin Li

Responsive image

Auto-TLDR; Batch Normalization with Skewness Reduction

Slides Poster Similar

Batch Normalization (BN) is a well-known technique used in training deep neural networks. The main idea behind batch normalization is to normalize the features of the layers ($i.e.$, transforming them to have a mean equal to zero and a variance equal to one). Such a procedure encourages the optimization landscape of the loss function to be smoother, and improve the learning of the networks for both speed and performance. In this paper, we demonstrate that the performance of the network can be improved, if the distributions of the features of the output in the same layer are similar. As normalizing based on mean and variance does not necessarily make the features to have the same distribution, we propose a new normalization scheme: Batch Normalization with Skewness Reduction (BNSR). Comparing with other normalization approaches, BNSR transforms not just only the mean and variance, but also the skewness of the data. By tackling this property of a distribution, we are able to make the output distributions of the layers to be further similar. The nonlinearity of BNSR may further improve the expressiveness of the underlying network. Comparisons with other normalization schemes are tested on the CIFAR-100 and ImageNet datasets. Experimental results show that the proposed approach can outperform other state-of-the-arts that are not equipped with BNSR.

Real Time Fencing Move Classification and Detection at Touch Time During a Fencing Match

Cem Ekin Sunal, Chris G. Willcocks, Boguslaw Obara

Responsive image

Auto-TLDR; Fencing Body Move Classification and Detection Using Deep Learning

Slides Similar

Fencing is a fast-paced sport played with swords which are Epee, Foil, and Saber. However, such fast-pace can cause referees to make wrong decisions. Review of slow-motion camera footage in tournaments helps referees’ decision making, but it interrupts the match and may not be available for every organization. Motivated by the need for better decision making, analysis, and availability, we introduce the first fully-automated deep learning classification and detection system for fencing body moves at the moment a touch is made. This is an important step towards creating a fencing analysis system, with player profiling and decision tools that will benefit the fencing community. The proposed architecture combines You Only Look Once version three (YOLOv3) with a ResNet-34 classifier, trained on ImageNet settings to obtain 83.0\% test accuracy on the fencing moves. These results are exciting development in the sport, providing immediate feedback and analysis along with accessibility, hence making it a valuable tool for trainers and fencing match referees.

Multiple Future Prediction Leveraging Synthetic Trajectories

Lorenzo Berlincioni, Federico Becattini, Lorenzo Seidenari, Alberto Del Bimbo

Responsive image

Auto-TLDR; Synthetic Trajectory Prediction using Markov Chains

Slides Poster Similar

Trajectory prediction is an important task, especially in autonomous driving. The ability to forecast the position of other moving agents can yield to an effective planning, ensuring safety for the autonomous vehicle as well for the observed entities. In this work we propose a data driven approach based on Markov Chains to generate synthetic trajectories, which are useful for training a multiple future trajectory predictor. The advantages are twofold: on the one hand synthetic samples can be used to augment existing datasets and train more effective predictors; on the other hand, it allows to generate samples with multiple ground truths, corresponding to diverse equally likely outcomes of the observed trajectory. We define a trajectory prediction model and a loss that explicitly address the multimodality of the problem and we show that combining synthetic and real data leads to prediction improvements, obtaining state of the art results.

Edge-Guided CNN for Denoising Images from Portable Ultrasound Devices

Yingnan Ma, Fei Yang, Anup Basu

Responsive image

Auto-TLDR; Edge-Guided Convolutional Neural Network for Portable Ultrasound Images

Slides Poster Similar

Ultrasound is a non-invasive tool that is useful for medical diagnosis and treatment. To reduce long wait times and add convenience to patients, portable ultrasound scanning devices are becoming increasingly popular. These devices can be held in one palm, and are compatible with modern cell phones. However, the quality of ultrasound images captured from the portable scanners is relatively poor compared to standard ultrasound scanning systems in hospitals. To improve the quality of the ultrasound images obtained from portable ultrasound devices, we propose a new neural network architecture called Edge-Guided Convolutional Neural Network (EGCNN), which can preserve significant edge information in ultrasound images when removing noise. We also study and compare the effectiveness of classical filtering approaches in removing speckle noise in these images. Experimental results show that after applying the proposed EGCNN, various organs can be better recognized from ultrasound images. This approach is expected to lead to better accuracy in diagnostics in the future.

Adaptive L2 Regularization in Person Re-Identification

Xingyang Ni, Liang Fang, Heikki Juhani Huttunen

Responsive image

Auto-TLDR; AdaptiveReID: Adaptive L2 Regularization for Person Re-identification

Slides Poster Similar

We introduce an adaptive L2 regularization mechanism termed AdaptiveReID, in the setting of person re-identification. In the literature, it is common practice to utilize hand-picked regularization factors which remain constant throughout the training procedure. Unlike existing approaches, the regularization factors in our proposed method are updated adaptively through backpropagation. This is achieved by incorporating trainable scalar variables as the regularization factors, which are further fed into a scaled hard sigmoid function. Extensive experiments on the Market-1501, DukeMTMC-reID and MSMT17 datasets validate the effectiveness of our framework. Most notably, we obtain state-of-the-art performance on MSMT17, which is the largest dataset for person re-identification. Source code will be published at https://github.com/nixingyang/AdaptiveReID.

T-SVD Based Non-Convex Tensor Completion and Robust Principal Component Analysis

Tao Li, Jinwen Ma

Responsive image

Auto-TLDR; Non-Convex tensor rank surrogate function and non-convex sparsity measure for tensor recovery

Slides Poster Similar

In this paper, we propose a novel non-convex tensor rank surrogate function and a novel non-convex sparsity measure. The basic idea is to sidestep the bias of $\ell_1-$norm by introducing the concavity. Furthermore, we employ this non-convex penalty in tensor recovery problems such as tensor completion and tensor robust principal component analysis. Due to the concavity, the parameters of these models are difficult to solve. To tackle this problem, we devise a majorization minimization algorithm that can optimize the upper bound of the original function in each iteration, and every sub-problem is solved by the alternating direction multiplier method. We also analyze the theoretical properties of the proposed algorithm. Finally, the experimental results on natural and hyperspectral images demonstrate the efficacy and efficiency of the proposed method.

EEG-Based Cognitive State Assessment Using Deep Ensemble Model and Filter Bank Common Spatial Pattern

Debashis Das Chakladar, Shubhashis Dey, Partha Pratim Roy, Masakazu Iwamura

Responsive image

Auto-TLDR; A Deep Ensemble Model for Cognitive State Assessment using EEG-based Cognitive State Analysis

Slides Poster Similar

Electroencephalography (EEG) is the most used physiological measure to evaluate the cognitive state of a user efficiently. As EEG inherently suffers from a poor spatial resolution, features extracted from each EEG channel may not efficiently used for cognitive state assessment. In this paper, the EEG-based cognitive state assessment has been performed during the mental arithmetic experiment, which includes two cognitive states (task and rest) of a user. To obtain the temporal as well as spatial resolution of the EEG signal, we combined the Filter Bank Common Spatial Pattern (FBCSP) method and Long Short-Term Memory (LSTM)-based deep ensemble model for classifying the cognitive state of a user. Subject-wise data distribution has been performed due to the execution of a large volume of data in a low computing environment. In the FBCSP method, the input EEG is decomposed into multiple equal-sized frequency bands, and spatial features of each frequency bands are extracted using the Common Spatial Pattern (CSP) algorithm. Next, a feature selection algorithm has been applied to identify the most informative features for classification. The proposed deep ensemble model consists of multiple similar structured LSTM networks that work in parallel. The output of the ensemble model (i.e., the cognitive state of a user) is computed using the average weighted combination of individual model prediction. This proposed model achieves 87\% classification accuracy, and it can also effectively estimate the cognitive state of a user in a low computing environment.