An Intelligent Photographing Guidance System Based on Compositional Deep Features and Intepretable Machine Learning Model

Chin-Shyurng Fahn, Meng-Luen Wu, Sheng-Kuei Tsau

Responsive image

Auto-TLDR; Photography Guidance Using Interpretable Machine Learning

Photography is the activity of recording precious moments which are often difficult to make up afterwards. Therefore, taking the correct picture under proper guidance assistance is important. Although there are many factors that can determine a good photo, in general, photos that do not follow the composition rules usually look bad that make the viewer feel uncomfortable. As a solution, in this paper, we propose an intelligent photographing guidance system using machine learning. The guidance is based on interpretable models that can give reasons for decisions. There are two kinds of features for guidance, which are traditional image features and deep features. Traditional features includes saliency map, sharpness map, prominent lines, and each of them are in multi-scale Gaussian pyramid. Deep feature is extracted during the establishment of a CNN based image composition classifier. We use these two kinds of features as inputs for tree based interpretable machine learning model to establish a feasible photographing guidance system. The guidance system references our composition classifier with 94.8% of accuracy, which the tree based interpretable model is capable of guiding camera users to alter image contents for obtaining better aesthetical compositions to take photos of good quality.

Similar papers

A Heuristic-Based Decision Tree for Connected Components Labeling of 3D Volumes

Maximilian Söchting, Stefano Allegretti, Federico Bolelli, Costantino Grana

Responsive image

Auto-TLDR; Entropy Partitioning Decision Tree for Connected Components Labeling

Slides Poster Similar

Connected Components Labeling represents a fundamental step for many Computer Vision and Image Processing pipelines. Since the first appearance of the task in the sixties, many algorithmic solutions to optimize the computational load needed to label an image have been proposed. Among them, block-based scan approaches and decision trees revealed to be some of the most valuable strategies. However, due to the cost of the manual construction of optimal decision trees and the computational limitations of automatic strategies employed in the past, the application of blocks and decision trees has been restricted to small masks, and thus to 2D algorithms. With this paper we present a novel heuristic algorithm based on decision tree learning methodology, called Entropy Partitioning Decision Tree (EPDT). It allows to compute near-optimal decision trees for large scan masks. Experimental results demonstrate that algorithms based on the generated decision trees outperform state-of-the-art competitors.

Merged 1D-2D Deep Convolutional Neural Networks for Nerve Detection in Ultrasound Images

Mohammad Alkhatib, Adel Hafiane, Pierre Vieyres

Responsive image

Auto-TLDR; A Deep Neural Network for Deep Neural Networks to Detect Median Nerve in Ultrasound-Guided Regional Anesthesia

Slides Poster Similar

Ultrasound-Guided Regional Anesthesia (UGRA) becomes a standard procedure in surgical operations and contributes to pain management. It offers the advantages of the targeted nerve detection and provides the visualization of regions of interest such as anatomical structures. However, nerve detection is one of the most challenging tasks that anesthetists can encounter in the UGRA procedure. A computer-aided system that can detect automatically the nerve region would facilitate the anesthetist's daily routine and allow them to concentrate more on the anesthetic delivery. In this paper, we propose a new method based on merging deep learning models from different data to detect the median nerve. The merged architecture consists of two branches, one being one dimensional (1D) convolutional neural networks (CNN) branch and another 2D CNN branch. The merged architecture aims to learn the high-level features from 1D handcrafted noise-robust features and 2D ultrasound images. The obtained results show the validity and high accuracy of the proposed approach and its robustness.

CNN-Based Repetitive Self-Revised Learning for Photos’ Aesthetics Imbalanced Classification

Dai Ying

Responsive image

Auto-TLDR; Automatic Aesthetic Assessment Using Recurrent Self-revised Learning

Poster Similar

Aesthetic assessment is subjective, and the distribution of the aesthetic levels is imbalanced. In order to realize the auto-assessment of photo aesthetics, we focus on using repetitive self-revised learning (RSRL) to train the CNN-based aesthetics classification network by imbalanced data set. As RSRL, the network is trained repetitively by dropping out the low likelihood photo samples at the middle levels of aesthetics from the training data set based on the previously trained network. Further, the retained two networks are used in extracting highlight regions of the photos related with the aesthetic assessment. Experimental results show that the CNN-based repetitive self-revised learning is effective for improving the performances of the imbalanced classification.

Color, Edge, and Pixel-Wise Explanation of Predictions Based onInterpretable Neural Network Model

Jay Hoon Jung, Youngmin Kwon

Responsive image

Auto-TLDR; Explainable Deep Neural Network with Edge Detecting Filters

Poster Similar

We design an interpretable network model by introducing explainable components into a Deep Neural Network (DNN). We substituted the first kernels of a Convolutional Neural Network (CNN) and a ResNet-50 with the well-known edge detecting filters such as Sobel, Prewitt, and other filters. Each filters' relative importance scores are measured with a variant of Layer-wise Relevance Propagation (LRP) method proposed by Bach et al. Since the effects of the edge detecting filters are well understood, our model provides three different scores to explain individual predictions: the scores with respect to (1) colors, (2) edge filters, and (3) pixels of the image. Our method provides more tools to analyze the predictions by highlighting the location of important edges and colors in the images. Furthermore, the general features of a category can be shown in our scores as well as individual predictions. At the same time, the model does not degrade performances on MNIST, Fruit360 and ImageNet datasets.

Exploring Spatial-Temporal Representations for fNIRS-based Intimacy Detection via an Attention-enhanced Cascade Convolutional Recurrent Neural Network

Chao Li, Qian Zhang, Ziping Zhao

Responsive image

Auto-TLDR; Intimate Relationship Prediction by Attention-enhanced Cascade Convolutional Recurrent Neural Network Using Functional Near-Infrared Spectroscopy

Slides Poster Similar

The detection of intimacy plays a crucial role in the improvement of intimate relationship, which contributes to promote the family and social harmony. Previous studies have shown that different degrees of intimacy have significant differences in brain imaging. Recently, a few of work has emerged to recognise intimacy automatically by using machine learning technique. Moreover, considering the temporal dynamic characteristics of intimacy relationship on neural mechanism, how to model spatio-temporal dynamics for intimacy prediction effectively is still a challenge. In this paper, we propose a novel method to explore deep spatial-temporal representations for intimacy prediction by Attention-enhanced Cascade Convolutional Recurrent Neural Network (ACCRNN). Given the advantages of time-frequency resolution in complex neuronal activities analysis, this paper utilizes functional near-infrared spectroscopy (fNIRS) to analyse and infer to intimate relationship. We collect a fNIRS-based dataset for the analysis of intimate relationship. Forty-two-channel fNIRS signals are recorded from the 44 subjects' prefrontal cortex when they watched a total of 18 photos of lovers, friends and strangers for 30 seconds per photo. The experimental results show that our proposed method outperforms the others in terms of accuracy with the precision of 96.5%. To the best of our knowledge, this is the first time that such a hybrid deep architecture has been employed for fNIRS-based intimacy prediction.

Image Representation Learning by Transformation Regression

Xifeng Guo, Jiyuan Liu, Sihang Zhou, En Zhu, Shihao Dong

Responsive image

Auto-TLDR; Self-supervised Image Representation Learning using Continuous Parameter Prediction

Slides Poster Similar

Self-supervised learning is a thriving research direction since it can relieve the burden of human labeling for machine learning by seeking for supervision from data instead of human annotation. Although demonstrating promising performance in various applications, we observe that the existing methods usually model the auxiliary learning tasks as classification tasks with finite discrete labels, leading to insufficient supervisory signals, which in turn restricts the representation quality. In this paper, to solve the above problem and make full use of the supervision from data, we design a regression model to predict the continuous parameters of a group of transformations, i.e., image rotation, translation, and scaling. Surprisingly, this naive modification stimulates tremendous potential from data and the resulting supervisory signal has largely improved the performance of image representation learning. Extensive experiments on four image datasets, including CIFAR10, CIFAR100, STL10, and SVHN, indicate that our proposed algorithm outperforms the state-of-the-art unsupervised learning methods by a large margin in terms of classification accuracy. Crucially, we find that with our proposed training mechanism as an initialization, the performance of the existing state-of-the-art classification deep architectures can be preferably improved.

LFIEM: Lightweight Filter-Based Image Enhancement Model

Oktai Tatanov, Aleksei Samarin

Responsive image

Auto-TLDR; Image Retouching Using Semi-supervised Learning for Mobile Devices

Slides Poster Similar

Photo retouching features are being integrated into a growing number of mobile applications. Current learning-based approaches enhance images using large convolutional neural network-based models, where the result is received directly from the neural network outputs. This method can lead to artifacts in the resulting images, models that are complicated to interpret, and can be computationally expensive. In this paper, we explore the application of a filter-based approach in order to overcome the problems outlined above. We focus on creating a lightweight solution suitable for use on mobile devices when designing our model. A significant performance increase was achieved through implementing consistency regularization used in semi-supervised learning. The proposed model can be used on mobile devices and achieves competitive results compared to known models.

MFPP: Morphological Fragmental Perturbation Pyramid for Black-Box Model Explanations

Qing Yang, Xia Zhu, Jong-Kae Fwu, Yun Ye, Ganmei You, Yuan Zhu

Responsive image

Auto-TLDR; Morphological Fragmental Perturbation Pyramid for Explainable Deep Neural Network

Slides Poster Similar

Deep neural networks (DNNs) have recently been applied and used in many advanced and diverse tasks, such as medical diagnosis, automatic driving, etc. Due to the lack of transparency of the deep models, DNNs are often criticized for their prediction that cannot be explainable by human. In this paper, we propose a novel Morphological Fragmental Perturbation Pyramid (MFPP) method to solve the Explainable AI problem. In particular, we focus on the black-box scheme, which can identify the input area responsible for the output of the DNN without having to understand the internal architecture of the DNN. In the MFPP method, we divide the input image into multi-scale fragments and randomly mask out fragments as perturbation to generate a saliency map, which indicates the significance of each pixel for the prediction result of the black box model. Compared with the existing input sampling perturbation method, the pyramid structure fragment has proved to be more effective. It can better explore the morphological information of the input image to match its semantic information, and does not need any value inside the DNN. We qualitatively and quantitatively prove that MFPP meets and exceeds the performance of state-of-the-art (SOTA) black-box interpretation method on multiple DNN models and datasets.

Explainable Online Validation of Machine Learning Models for Practical Applications

Wolfgang Fuhl, Yao Rong, Thomas Motz, Michael Scheidt, Andreas Markus Hartel, Andreas Koch, Enkelejda Kasneci

Responsive image

Auto-TLDR; A Reformulation of Regression and Classification for Machine Learning Algorithm Validation

Slides Poster Similar

We present a reformulation of the regression and classification, which aims to validate the result of a machine learning algorithm. Our reformulation simplifies the original problem and validates the result of the machine learning algorithm using the training data. Since the validation of machine learning algorithms must always be explainable, we perform our experiments with the kNN algorithm as well as with an algorithm based on conditional probabilities, which is proposed in this work. For the evaluation of our approach, three publicly available data sets were used and three classification and two regression problems were evaluated. The presented algorithm based on conditional probabilities is also online capable and requires only a fraction of memory compared to the kNN algorithm.

Classifying Eye-Tracking Data Using Saliency Maps

Shafin Rahman, Sejuti Rahman, Omar Shahid, Md. Tahmeed Abdullah, Jubair Ahmed Sourov

Responsive image

Auto-TLDR; Saliency-based Feature Extraction for Automatic Classification of Eye-tracking Data

Slides Poster Similar

A plethora of research in the literature shows how human eye fixation pattern varies depending on different factors, including genetics, age, social functioning, cognitive functioning, and so on. Analysis of these variations in visual attention has already elicited two potential research avenues: 1) determining the physiological or psychological state of the subject and 2) predicting the tasks associated with the act of viewing from the recorded eye-fixation data. To this end, this paper proposes a visual saliency based novel feature extraction method for automatic and quantitative classification of eye-tracking data, which is applicable to both of the research directions. Instead of directly extracting features from the fixation data, this method employs several well-known computational models of visual attention to predict eye fixation locations as saliency maps. Comparing the saliency amplitudes, similarity and dissimilarity of saliency maps with the corresponding eye fixations maps gives an extra dimension of information which is effectively utilized to generate discriminative features to classify the eye-tracking data. Extensive experimentation using Saliency4ASD [1], Age Prediction [2], and Visual Perceptual Task [3] dataset show that our saliency-based feature can achieve superior performance, outperforming the previous state-of-the-art methods [2],[4], [5] by a considerable margin. Moreover, unlike the existing application-specific solutions, our method demonstrates performance improvement across three distinct problems from the real-life domain: Autism Spectrum Disorder screening, toddler age prediction, and human visual perceptual task classification, providing a general paradigm that utilizes the extra-information inherent in saliency maps for a more accurate classification.

Saliency Prediction on Omnidirectional Images with Brain-Like Shallow Neural Network

Zhu Dandan, Chen Yongqing, Min Xiongkuo, Zhao Defang, Zhu Yucheng, Zhou Qiangqiang, Yang Xiaokang, Tian Han

Responsive image

Auto-TLDR; A Brain-like Neural Network for Saliency Prediction of Head Fixations on Omnidirectional Images

Slides Poster Similar

Deep feedforward convolutional neural networks (CNNs) perform well in the saliency prediction of omnidirectional images (ODIs), and have become the leading class of candidate models of the visual processing mechanism in the primate ventral stream. These CNNs have evolved from shallow network architecture to extremely deep and branching architecture to achieve superb performance in various vision tasks, yet it is unclear how brain-like they are. In particular, these deep feedforward CNNs are difficult to mapping to ventral stream structure of the brain visual system due to their vast number of layers and missing biologically-important connections, such as recurrence. To tackle this issue, some brain-like shallow neural networks are introduced. In this paper, we propose a novel brain-like network model for saliency prediction of head fixations on ODIs. Specifically, our proposed model consists of three modules: a CORnet-S module, a template feature extraction module and a ranking attention module (RAM). The CORnet-S module is a lightweight artificial neural network (ANN) with four anatomically mapped areas (V1, V2, V4 and IT) and it can simulate the visual processing mechanism of ventral visual stream in the human brain. The template features extraction module is introduced to extract attention maps of ODIs and provide guidance for the feature ranking in the following RAM module. The RAM module is used to rank and select features that are important for fine-grained saliency prediction. Extensive experiments have validated the effectiveness of the proposed model in predicting saliency maps of ODIs, and the proposed model outperforms other state-of-the-art methods with similar scale.

Point In: Counting Trees with Weakly Supervised Segmentation Network

Pinmo Tong, Shuhui Bu, Pengcheng Han

Responsive image

Auto-TLDR; Weakly Tree counting using Deep Segmentation Network with Localization and Mask Prediction

Slides Poster Similar

For tree counting tasks, since traditional image processing methods require expensive feature engineering and are not end-to-end frameworks, this will cause additional noise and cannot be optimized overall, so this method has not been widely used in recent trends of tree counting application. Recently, many deep learning based approaches are designed for this task because of the powerful feature extracting ability. The representative way is bounding box based supervised method, but time-consuming annotations are indispensable for them. Moreover, these methods are difficult to overcome the occlusion or overlap. To solve this problem, we propose a weakly tree counting network (WTCNet) based on deep segmentation network with only point supervision. It can simultaneously complete tree counting with localization and output mask of each tree at the same time. We first adopt a novel feature extractor network (FENet) to get features of input images, and then an effective strategy is introduced to deal with different mask predictions. In the end, we propose a basic localization guidance accompany with rectification guidance to train the network. We create two different datasets and select an existing challenging plant dataset to evaluate our method on three different tasks. Experimental results show the good performance improvement of our method compared with other existing methods. Further study shows that our method has great potential to reduce human labor and provide effective ground-truth masks and the results show the superiority of our method over the advanced methods.

End-To-End Deep Learning Methods for Automated Damage Detection in Extreme Events at Various Scales

Yongsheng Bai, Alper Yilmaz, Halil Sezen

Responsive image

Auto-TLDR; Robust Mask R-CNN for Crack Detection in Extreme Events

Slides Poster Similar

Robust Mask R-CNN (Mask Regional Convolutional Neural Network) methods are proposed and tested for automatic detection of cracks on structures or their components that may be damaged during extreme events, such as earth-quakes. We curated a new dataset with 2,021 labeled images for training and validation and aimed to find end-to-end deep neural networks for crack detection in the field. With data augmentation and parameters fine-tuning, Path Aggregation Network (PANet) with spatial attention mechanisms and High-resolution Network (HRNet) are introduced into Mask R-CNNs. The tests on three public datasets with low- or high-resolution images demonstrate that the proposed methods can achieve a big improvement over alternative networks, so the proposed method may be sufficient for crack detection for a variety of scales in real applications.

Coarse to Fine: Progressive and Multi-Task Learning for Salient Object Detection

Dong-Goo Kang, Sangwoo Park, Joonki Paik

Responsive image

Auto-TLDR; Progressive and mutl-task learning scheme for salient object detection

Slides Poster Similar

Most deep learning-based salient object detection (SOD) methods tried to manipulate the convolution block to effectively capture the context of object. In this paper, we propose a novel method, called progressive and mutl-task learning scheme, to extract the context of object by only manipulating the learning scheme without changing the network architecture. The progressive learning scheme is a method to grow the decoder progressively in the train phase. In other words, starting from easier low-resolution layers, it gradually adds high-resolution layers. Although the progressive learning successfullyl captures the context of object, its output boundary tends to be rough. To solve this problem, we also propose a multi-task learning (MTL) scheme that processes the object saliency map and contour in a single network jointly. The proposed MTL scheme trains the network in an edge-preserved direction through an auxiliary branch that learns contours. The proposed a learning scheme can be combined with other convolution block manipulation methods. Extensive experiments on five datasets show that the proposed method performs best compared with state-of-the-art methods in most cases.

A Comparison of Neural Network Approaches for Melanoma Classification

Maria Frasca, Michele Nappi, Michele Risi, Genoveffa Tortora, Alessia Auriemma Citarella

Responsive image

Auto-TLDR; Classification of Melanoma Using Deep Neural Network Methodologies

Slides Poster Similar

Melanoma is the deadliest form of skin cancer and it is diagnosed mainly visually, starting from initial clinical screening and followed by dermoscopic analysis, biopsy and histopathological examination. A dermatologist’s recognition of melanoma may be subject to errors and may take some time to diagnose it. In this regard, deep learning can be useful in the study and classification of skin cancer. In particular, by classifying images with Deep Neural Network methodologies, it is possible to obtain comparable or even superior results compared to those of dermatologists. In this paper, we propose a methodology for the classification of melanoma by adopting different deep learning techniques applied to a common dataset, composed of images from the ISIC dataset and consisting of different types of skin diseases, including melanoma on which we applied a specific pre-processing phase. In particular, a comparison of the results is performed in order to select the best effective neural network to be applied to the problem of recognition and classification of melanoma. Moreover, we also evaluate the impact of the pre- processing phase on the final classification. Different metrics such as accuracy, sensitivity, and specificity have been selected to assess the goodness of the adopted neural networks and compare them also with the manual classification of dermatologists.

Lane Detection Based on Object Detection and Image-To-Image Translation

Hiroyuki Komori, Kazunori Onoguchi

Responsive image

Auto-TLDR; Lane Marking and Road Boundary Detection from Monocular Camera Images using Inverse Perspective Mapping

Slides Poster Similar

In this paper, we propose a method to detect various types of lane markings and road boundaries simultaneously from a monocular camera image. This method detects lane markings and road boundaries in IPM images obtained by the Inverse Perspective Mapping of input images. First, bounding boxes surrounding a lane marking or the road boundary are extracted by the object detection network. At the same time, these areas are labelled with a solid line, a dashed line, a zebra line, a curb, a grass, a sidewall and so on. Next, in each bounding box, lane marking boundaries or road boundaries are drawn by the image-to-image translation network. We use YOLOv3 for the object detection and pix2pix for the image translation. We create our own datasets including various types of lane markings and road boundaries and evaluate our approach using these datasets qualitatively and quantitatively.

Dynamic Low-Light Image Enhancement for Object Detection Via End-To-End Training

Haifeng Guo, Yirui Wu, Tong Lu

Responsive image

Auto-TLDR; Object Detection using Low-Light Image Enhancement for End-to-End Training

Slides Poster Similar

Object detection based on convolutional neural networks is a hot research topic in computer vision. The illumination component in the image has a great impact on object detection, and it will cause a sharp decline in detection performance under low-light conditions. Using low-light image enhancement technique as a pre-processing mechanism can improve image quality and obtain better detection results.However, due to the complexity of low-light environments, the existing enhancement methods may have negative effects on some samples. Therefore, it is difficult to improve the overall detection performance in low-light conditions. In this paper, our goal is to use image enhancement to improve object detection performance rather than perceptual quality for humans. We propose a novel framework that combines low-light enhancement and object detection for end-to-end training. The framework can dynamically select different enhancement subnetworks for each sample to improve the performance of the detector. Our proposed method consists of two stage: the enhancement stage and the detection stage. The enhancement stage dynamically enhances the low-light images under the supervision of several enhancement methods and output corresponding weights. During the detection stage, the weights offers information on object classification to generate high-quality region proposals and in turn result in accurate detection. Our experiments present promising results, which show that the proposed method can significantly improve the detection performance in low-light environment.

Construction Worker Hardhat-Wearing Detection Based on an Improved BiFPN

Chenyang Zhang, Zhiqiang Tian, Jingyi Song, Yaoyue Zheng, Bo Xu

Responsive image

Auto-TLDR; A One-Stage Object Detection Method for Hardhat-Wearing in Construction Site

Slides Poster Similar

Work in the construction site is considered to be one of the occupations with the highest safety risk factor. Therefore, safety plays an important role in construction site. One of the most fundamental safety rules in construction site is to wear a hardhat. To strengthen the safety of the construction site, most of the current methods use multi-stage method for hardhat-wearing detection. These methods have limitations in terms of adaptability and generalizability. In this paper, we propose a one-stage object detection method based on convolutional neural network. We present a multi-scale strategy that selects the high-resolution feature maps of DarkNet-53 to effectively identify small-scale hardhats. In addition, we propose an improved weighted bi-directional feature pyramid network (BiFPN), which could fuse more semantic features from more scales. The proposed method can not only detect hardhat-wearing, but also identify the color of the hardhat. Experimental results show that the proposed method achieves a mAP of 87.04%, which outperforms several state-of-the-art methods on a public dataset.

Mobile Phone Surface Defect Detection Based on Improved Faster R-CNN

Tao Wang, Can Zhang, Runwei Ding, Ge Yang

Responsive image

Auto-TLDR; Faster R-CNN for Mobile Phone Surface Defect Detection

Slides Poster Similar

Various surface defects will inevitably occur in the production process of mobile phones, which have a huge impact on the enterprise. Therefore, precise defect detection is of great significance in the production of mobile phones. However, the traditional manual inspection and machine vision inspection have low efficiency and accuracy respectively which cannot meet the rapid production needs of modern enterprises. In this paper, we proposed a mobile phone surface defect (MPSD) detection model based on deep learning, which greatly reduce the requirement of a large dataset and improve detection performance. First, Boundary Equilibrium Generative Adversarial Networks (BEGAN) is used to generate and augment the defect data. Then, based on Faster R-CNN model, Feature Pyramid Network (FPN) and ResNet 101 are combined as feature extraction network to get more small target defect features. Further, replacing the ROI pooling layer with an ROI Align layer reduces the quantization deviation during the pooling process. Finally, we train and evaluate our model on our own dataset. The experimental results indicate that compared with some traditional methods based on handcrafted feature extraction and the traditional Faster R-CNN, the improved Faster R-CNN achieves 99.43% mAP, which is more effective in MPSD defect detection area.

Early Wildfire Smoke Detection in Videos

Taanya Gupta, Hengyue Liu, Bir Bhanu

Responsive image

Auto-TLDR; Semi-supervised Spatio-Temporal Video Object Segmentation for Automatic Detection of Smoke in Videos during Forest Fire

Poster Similar

Recent advances in unmanned aerial vehicles and camera technology have proven useful for the detection of smoke that emerges above the trees during a forest fire. Automatic detection of smoke in videos is of great interest to Fire department. To date, in most parts of the world, the fire is not detected in its early stage and generally it turns catastrophic. This paper introduces a novel technique that integrates spatial and temporal features in a deep learning framework using semi-supervised spatio-temporal video object segmentation and dense optical flow. However, detecting this smoke in the presence of haze and without the labeled data is difficult. Considering the visibility of haze in the sky, a dark channel pre-processing method is used that reduces the amount of haze in video frames and consequently improves the detection results. Online training is performed on a video at the time of testing that reduces the need for ground-truth data. Tests using the publicly available video datasets show that the proposed algorithms outperform previous work and they are robust across different wildfire-threatened locations.

Enhancing Deep Semantic Segmentation of RGB-D Data with Entangled Forests

Matteo Terreran, Elia Bonetto, Stefano Ghidoni

Responsive image

Auto-TLDR; FuseNet: A Lighter Deep Learning Model for Semantic Segmentation

Slides Poster Similar

Semantic segmentation is a problem which is getting more and more attention in the computer vision community. Nowadays, deep learning methods represent the state of the art to solve this problem, and the trend is to use deeper networks to get higher performance. The drawback with such models is a higher computational cost, which makes it difficult to integrate them on mobile robot platforms. In this work we want to explore how to obtain lighter deep learning models without compromising performance. To do so we will consider the features used in the Entangled Random Forest algorithm and we will study the best strategies to integrate these within FuseNet deep network. Such new features allow us to shrink the network size without loosing performance, obtaining hence a lighter model which achieves state-of-the-art performance on the semantic segmentation task and represents an interesting alternative for mobile robotics applications, where computational power and energy are limited.

Malware Detection by Exploiting Deep Learning over Binary Programs

Panpan Qi, Zhaoqi Zhang, Wei Wang, Chang Yao

Responsive image

Auto-TLDR; End-to-End Malware Detection without Feature Engineering

Slides Poster Similar

Malware evolves rapidly over time, which makes existing solutions being ineffective in detecting newly released malware. Machine learning models that can learn to capture malicious patterns directly from the data play an increasingly important role in malware analysis. However, traditional machine learning models heavily depend on feature engineering. The extracted static features are vulnerable as hackers could create new malware with different feature values to deceive the machine learning models. In this paper, we propose an end-to-end malware detection framework consisting of convolutional neural network, autoencoder and neural decision trees. It learns the features from multiple domains for malware detection without feature engineering. In addition, since anti-virus products should have a very low false alarm rate to avoid annoying users, we propose a special loss function, which optimizes the recall for a fixed low false positive rate (e.g., less than 0.1%). Experiments show that the proposed framework has achieved a better recall than the baseline models, and the derived loss function also makes a difference.

Responsive Social Smile: A Machine-Learning Based Multimodal Behavior Assessment Framework towards Early Stage Autism Screening

Yueran Pan, Kunjing Cai, Ming Cheng, Xiaobing Zou, Ming Li

Responsive image

Auto-TLDR; Responsive Social Smile: A Machine Learningbased Assessment Framework for Early ASD Screening

Poster Similar

Autism spectrum disorder (ASD) is a neurodevelopmental disorder, which causes social deficits in social lives. Early ASD screening for children is an important method to reduce the impact of ASD on people’s whole lives. Traditional screening methods rely on protocol experiments and subjective evaluations from clinicians and domain experts and thereby cost a lot. To standardize the process of ASD screening, we 1 collaborate with a group of ASD experts, and design a ”Responsive Social Smile” protocol and an experiment environment. Also, we propose a machine learningbased assessment framework for early ASD screening. By integrating technologies of speech recognition and computer vision, the framework can quantitatively analyze the behaviors of children under well-designed protocols. By collecting 196 test samples from 41 children in the clinical treatments, our proposed method obtains 85.20% accuracy for the score prediction of individual protocol, and 80.49% unweighted accuracy for the final ASD prediction. This result indicates that our model reaches the average level of domain experts in ASD diagnosis.

Face Anti-Spoofing Using Spatial Pyramid Pooling

Lei Shi, Zhuo Zhou, Zhenhua Guo

Responsive image

Auto-TLDR; Spatial Pyramid Pooling for Face Anti-Spoofing

Slides Poster Similar

Face recognition system is vulnerable to many kinds of presentation attacks, so how to effectively detect whether the image is from the real face is particularly important. At present, many deep learning-based anti-spoofing methods have been proposed. But these approaches have some limitations, for example, global average pooling (GAP) easily loses local information of faces, single-scale features easily ignore information differences in different scales, while a complex network is prune to be overfitting. In this paper, we propose a face anti-spoofing approach using spatial pyramid pooling (SPP). Firstly, we use ResNet-18 with a small amount of parameter as the basic model to avoid overfitting. Further, we use spatial pyramid pooling module in the single model to enhance local features while fusing multi-scale information. The effectiveness of the proposed method is evaluated on three databases, CASIA-FASD, Replay-Attack and CASIA-SURF. The experimental results show that the proposed approach can achieve state-of-the-art performance.

Automatic Annotation of Corpora for Emotion Recognition through Facial Expressions Analysis

Alex Mircoli, Claudia Diamantini, Domenico Potena, Emanuele Storti

Responsive image

Auto-TLDR; Automatic annotation of video subtitles on the basis of facial expressions using machine learning algorithms

Slides Poster Similar

The recent diffusion of social networks has made available an unprecedented amount of user-generated content, which may be analyzed in order to determine people's opinions and emotions about a large variety of topics. Research has made many efforts in defining accurate algorithms for analyzing emotions expressed by users in texts; however, their performance often rely on the existence of large annotated datasets, whose current scarcity represents a major issue. The manual creation of such datasets represents a costly and time-consuming activity and hence there is an increasing demand for techniques for the automatic annotation of corpora. In this work we present a methodology for the automatic annotation of video subtitles on the basis of the analysis of facial expressions of people in videos, with the goal of creating annotated corpora that may be used to train emotion recognition algorithms. Facial expressions are analyzed through machine learning algorithms, on the basis of a set of manually-engineered facial features that are extracted from video frames. The soundness of the proposed methodology has been evaluated through an extensive experimentation aimed at determining the performance on real datasets of each methodological step.

Temporal Pulses Driven Spiking Neural Network for Time and Power Efficient Object Recognition in Autonomous Driving

Wei Wang, Shibo Zhou, Jingxi Li, Xiaohua Li, Junsong Yuan, Zhanpeng Jin

Responsive image

Auto-TLDR; Spiking Neural Network for Real-Time Object Recognition on Temporal LiDAR Pulses

Slides Poster Similar

Accurate real-time object recognition from sensory data has long been a crucial and challenging task for autonomous driving. Even though deep neural networks (DNNs) have been widely applied in this area, their considerable processing latency, power consumption as well as computational complexity have been challenging issues for real-time autonomous driving applications. In this paper, we propose an approach to address the real-time object recognition problem utilizing spiking neural networks (SNNs). The proposed SNN model works directly with raw temporal LiDAR pulses without the pulse-to-point cloud preprocessing procedure, which can significantly reduce delay and power consumption. Being evaluated on various datasets derived from LiDAR and dynamic vision sensor (DVS), including Sim LiDAR, KITTI, and DVS-barrel, our proposed model has shown remarkable time and power efficiency, while achieving comparable recognition performance as the state-of-the-art methods. This paper highlights the SNN's great potentials in autonomous driving and related applications. To the best of our knowledge, this is the first attempt to use SNN to directly perform time and energy efficient object recognition on temporal LiDAR pulses in the setting of autonomous driving.

AdaFilter: Adaptive Filter Design with Local Image Basis Decomposition for Optimizing Image Recognition Preprocessing

Aiga Suzuki, Keiichi Ito, Takahide Ibe, Nobuyuki Otsu

Responsive image

Auto-TLDR; Optimal Preprocessing Filtering for Pattern Recognition Using Higher-Order Local Auto-Correlation

Slides Poster Similar

Image preprocessing is an important process during pattern recognition which increases the recognition performance. Linear convolution filtering is a primary preprocessing method used to enhance particular local patterns of the image which are essential for recognizing the images. However, because of the vast search space of the preprocessing filter, almost no earlier studies have tackled the problem of identifying an optimal preprocessing filter that yields effective features for input images. This paper proposes a novel design method for the optimal preprocessing filter corresponding to a given task. Our method calculates local image bases of the training dataset and represents the optimal filter as a linear combination of these local image bases with the optimized coefficients to maximize the expected generalization performance. Thereby, the optimization problem of the preprocessing filter is converted to a lower-dimensional optimization problem. Our proposed method combined with a higher-order local auto-correlation (HLAC) feature extraction exhibited the best performance both in the anomaly detection task with the conventional pattern recognition algorithm and in the classification task using the deep convolutional neural network compared with typical preprocessing filters.

Decision Snippet Features

Pascal Welke, Fouad Alkhoury, Christian Bauckhage, Stefan Wrobel

Responsive image

Auto-TLDR; Decision Snippet Features for Interpretability

Slides Poster Similar

Decision trees excel at interpretability of their prediction results. To achieve required prediction accuracies, however, often large ensembles of decision trees -- random forests -- are considered, reducing interpretability due to large size. Additionally, their size slows down inference on modern hardware and restricts their applicability in low-memory embedded devices. We introduce \emph{Decision Snippet Features}, which are obtained from small subtrees that appear frequently in trained random forests. We subsequently show that linear models on top of these features achieve comparable and sometimes even better predictive performance than the original random forest, while reducing the model size by up to two orders of magnitude.

Siamese Graph Convolution Network for Face Sketch Recognition

Liang Fan, Xianfang Sun, Paul Rosin

Responsive image

Auto-TLDR; A novel Siamese graph convolution network for face sketch recognition

Slides Poster Similar

In this paper, we present a novel Siamese graph convolution network (GCN) for face sketch recognition. To build a graph from an image, we utilize a deep learning method to detect the image edges, and then use a superpixel method to segment the edge image. Each segmented superpixel region is taken as a node, and each pair of adjacent regions forms an edge of the graph. Graphs from both a face sketch and a face photo are input into the Siamese GCN for recognition. A deep graph matching method is used to share messages between cross-modal graphs in this model. Experiments show that the GCN can obtain high performance on several face photo-sketch datasets, including seen and unseen face photo-sketch datasets. It is also shown that the model performance based on the graph structure representation of the data using the Siamese GCN is more stable than a Siamese CNN model.

User-Independent Gaze Estimation by Extracting Pupil Parameter and Its Mapping to the Gaze Angle

Sang Yoon Han, Nam Ik Cho

Responsive image

Auto-TLDR; Gaze Point Estimation using Pupil Shape for Generalization

Slides Poster Similar

Since gaze estimation plays a crucial role in recognizing human intentions, it has been researched for a long time, and its accuracy is ever increasing. However, due to the wide variation in eye shapes and focusing abilities between the individuals, accuracies of most algorithms vary depending on each person in the test group, especially when the initial calibration is not well performed. To alleviate the user-dependency, we attempt to derive features that are general for most people and use them as the input to a deep network instead of using the images as the input. Specifically, we use the pupil shape as the core feature because it is directly related to the 3D eyeball rotation, and thus the gaze direction. While existing deep learning methods learn the gaze point by extracting various features from the image, we focus on the mapping function from the eyeball rotation to the gaze point by using the pupil shape as the input. It is shown that the accuracy of gaze point estimation also becomes robust for the uncalibrated points by following the characteristics of the mapping function. Also, our gaze network learns the gaze difference to facilitate the re-calibration process to fix the calibration-drift problem that typically occurs with glass-type or head-mount devices.

FastSal: A Computationally Efficient Network for Visual Saliency Prediction

Feiyan Hu, Kevin Mcguinness

Responsive image

Auto-TLDR; MobileNetV2: A Convolutional Neural Network for Saliency Prediction

Slides Poster Similar

This paper focuses on the problem of visual saliency prediction, predicting regions of an image that tend to attract human visual attention, under a constrained computational budget. We modify and test various recent efficient convolutional neural network architectures like EfficientNet and MobileNetV2 and compare them with existing state-of-the-art saliency models such as SalGAN and DeepGaze II both in terms of standard accuracy metrics like AUC and NSS, and in terms of the computational complexity and model size. We find that MobileNetV2 makes an excellent backbone for a visual saliency model and can be effective even without a complex decoder. We also show that knowledge transfer from a more computationally expensive model like DeepGaze II can be achieved via pseudo-labelling an unlabelled dataset, and that this approach gives result on-par with many state-of-the-art algorithms with a fraction of the computational cost and model size.

Personalized Models in Human Activity Recognition Using Deep Learning

Hamza Amrani, Daniela Micucci, Paolo Napoletano

Responsive image

Auto-TLDR; Incremental Learning for Personalized Human Activity Recognition

Slides Poster Similar

Current sensor-based human activity recognition techniques that rely on a user-independent model struggle to generalize to new users and on to changes that a person may make over time to his or her way of carrying out activities. Incremental learning is a technique that allows to obtain personalized models which may improve the performance on the classifiers thanks to a continuous learning based on user data. Finally, deep learning techniques have been proven to be more effective with respect to traditional ones in the generation of user-independent models. The aim of our work is therefore to put together deep learning techniques with incremental learning in order to obtain personalized models that perform better with respect to user-independent model and personalized model obtained using traditional machine learning techniques. The experimentation was done by comparing the results obtained by a technique in the state of the art with those obtained by two neural networks (ResNet and a simplified CNN) on three datasets. The experimentation showed that neural networks adapt faster to a new user than the baseline.

Video Lightening with Dedicated CNN Architecture

Li-Wen Wang, Wan-Chi Siu, Zhi-Song Liu, Chu-Tak Li, P. K. Daniel Lun

Responsive image

Auto-TLDR; VLN: Video Lightening Network for Driving Assistant Systems in Dark Environment

Slides Poster Similar

Darkness brings us uncertainty, worry and low confidence. This is a problem not only applicable to us walking in a dark evening but also for drivers driving a car on the road with very dim or even without lighting condition. To address this problem, we propose a new CNN structure named as Video Lightening Network (VLN) that regards the low-light enhancement as a residual learning task, which is useful as reference to indirectly lightening the environment, or for vision-based application systems, such as driving assistant systems. The VLN consists of several Lightening Back-Projection (LBP) and Temporal Aggregation (TA) blocks. Each LBP block enhances the low-light frame by domain transfer learning that iteratively maps the frame between the low- and normal-light domains. A TA block handles the motion among neighboring frames by investigating the spatial and temporal relationships. Several TAs work in a multi-scale way, which compensates the motions at different levels. The proposed architecture has a consistent enhancement for different levels of illuminations, which significantly increases the visual quality even in the extremely dark environment. Extensive experimental results show that the proposed approach outperforms other methods under both objective and subjective metrics.

Algorithm Recommendation for Data Streams

Jáder Martins Camboim De Sá, Andre Luis Debiaso Rossi, Gustavo Enrique De Almeida Prado Alves Batista, Luís Paulo Faina Garcia

Responsive image

Auto-TLDR; Meta-Learning for Algorithm Selection in Time-Changing Data Streams

Slides Poster Similar

In the last decades, many companies are taking advantage of massive data generation at high frequencies through knowledge discovery to identify valuable information. Machine learning techniques can be employed for knowledge discovery, since they are able to extract patterns from data and induce models to predict future events. However, dynamic and evolving environments generate streams of data that usually are non-stationary. Models induced in these scenarios may perish over time due to seasonality or concept drift. The periodic retraining could help but the fixed algorithm's hypothesis space could no longer be appropriate. An alternative solution is to use meta-learning for periodic algorithm selection in time-changing environments, choosing the bias that best suits the current data. In this paper, we present an enhanced framework for data streams algorithm selection based on MetaStream. Our approach uses meta-learning and incremental learning to actively select the best algorithm for the current concept in a time-changing. Different from previous works, a set of cutting edge meta-features and an incremental learning approach in the meta-level based on LightGBM are used. The results show that this new strategy can improve the recommendation of the best algorithm more accurately in time-changing data.

Walk the Lines: Object Contour Tracing CNN for Contour Completion of Ships

André Peter Kelm, Udo Zölzer

Responsive image

Auto-TLDR; Walk the Lines: A Convolutional Neural Network trained to follow object contours

Slides Poster Similar

We develop a new contour tracing algorithm to enhance the results of the latest object contour detectors. The goal is to achieve a perfectly closed, single-pixel wide and detailed object contour, since this type of contour could be analyzed using methods such as Fourier descriptors. Convolutional Neural Networks (CNNs) are rarely used for contour tracing, and we see great potential in using their capabilities for this task. Therefore we present the Walk the Lines (WtL) algorithm: A standard regression CNN trained to follow object contours. As initial step, we train the CNN only on ship contours, but the principle is applicable to other objects. Input data are the image and the associated object contour prediction of the recently published RefineContourNet (RCN). The WtL gets the center pixel coordinates, which defines an input section, plus an angle for rotating this section. Ideally, the center pixel moves on the contour, while the angle describes upcoming directional contour changes. The WtL predicts its steps pixelwise in a selfrouting way. To obtain a complete object contour the WtL runs in parallel at different image locations and the traces of its individual paths are summed. In contrast to the comparable Non-Maximum Suppression (NMS) method, our approach produces connected contours with finer details. Finally, the object contour is binarized under the condition of being closed. In case all procedures work as desired, excellent ship segmentations with high IoUs are produced, showing details such as antennas and ship superstructures that are easily omitted by other segmentation methods.

Uncertainty-Sensitive Activity Recognition: A Reliability Benchmark and the CARING Models

Alina Roitberg, Monica Haurilet, Manuel Martinez, Rainer Stiefelhagen

Responsive image

Auto-TLDR; CARING: Calibrated Action Recognition with Input Guidance

Slides Similar

Beyond assigning the correct class, an activity recognition model should also to be able to determine, how certain it is in its predictions. We present the first study of how well the confidence values of modern action recognition architectures indeed reflect the probability of the correct outcome and propose a learning-based approach for improving it. First, we extend two popular action recognition datasets with a reliability benchmark in form of the expected calibration error and reliability diagrams. Since our evaluation highlights that confidence values of standard action recognition architectures do not represent the uncertainty well, we introduce a new approach which learns to transform the model output into realistic confidence estimates through an additional calibration network. The main idea of our Calibrated Action Recognition with Input Guidance (CARING) model is to learn an optimal scaling parameter depending on the video representation. We compare our model with the native action recognition networks and the temperature scaling approach - a wide spread calibration method utilized in image classification. While temperature scaling alone drastically improves the reliability of the confidence values, our CARING method consistently leads to the best uncertainty estimates in all benchmark settings.

DR2S: Deep Regression with Region Selection for Camera Quality Evaluation

Marcelin Tworski, Stéphane Lathuiliere, Salim Belkarfa, Attilio Fiandrotti, Marco Cagnazzo

Responsive image

Auto-TLDR; Texture Quality Estimation Using Deep Learning

Slides Poster Similar

In this work, we tackle the problem of estimating a camera capability to preserve fine texture details at a given lighting condition. Importantly, our texture preservation measurement should coincide with human perception. Consequently, we formulate our problem as a regression one and we introduce a deep convolutional network to estimate texture quality score. At training time, we use ground-truth quality scores provided by expert human annotators in order to obtain a subjective quality measure. In addition, we propose a region selection method to identify the image regions that are better suited at measuring perceptual quality. Finally, our experimental evaluation shows that our learning-based approach outperforms existing methods and that our region selection algorithm consistently improves the quality estimation.

Facial Expression Recognition Using Residual Masking Network

Luan Pham, Vu Huynh, Tuan Anh Tran

Responsive image

Auto-TLDR; Deep Residual Masking for Automatic Facial Expression Recognition

Slides Poster Similar

Automatic facial expression recognition (FER) has gained much attention due to its applications in human-computer interaction. Among the approaches to improve FER tasks, this paper focuses on deep architecture with the attention mechanism. We propose a novel Masking idea to boost the performance of CNN in facial expression task. It uses a segmentation network to refine feature maps, enabling the network to focus on relevant information to make correct decisions. In experiments, we combine the ubiquitous Deep Residual Network and Unet-like architecture to produce a Residual Masking Network. The proposed method holds state-of-the-art (SOTA) accuracy on the well-known FER2013 and private VEMO datasets. Our works are available on Github.

Verifying the Causes of Adversarial Examples

Honglin Li, Yifei Fan, Frieder Ganz, Tony Yezzi, Payam Barnaghi

Responsive image

Auto-TLDR; Exploring the Causes of Adversarial Examples in Neural Networks

Slides Poster Similar

The robustness of neural networks is challenged by adversarial examples that contain almost imperceptible perturbations to inputs which mislead a classifier to incorrect outputs in high confidence. Limited by the extreme difficulty in examining a high-dimensional image space thoroughly, research on explaining and justifying the causes of adversarial examples falls behind studies on attacks and defenses. In this paper, we present a collection of potential causes of adversarial examples and verify (or partially verify) them through carefully-designed controlled experiments. The major causes of adversarial examples include model linearity, one-sum constraint, and geometry of the categories. To control the effect of those causes, multiple techniques are applied such as $L_2$ normalization, replacement of loss functions, construction of reference datasets, and novel models using multi-layer perceptron probabilistic neural networks (MLP-PNN) and density estimation (DE). Our experiment results show that geometric factors tend to be more direct causes and statistical factors magnify the phenomenon, especially for assigning high prediction confidence. We hope this paper will inspire more studies to rigorously investigate the root causes of adversarial examples, which in turn provide useful guidance on designing more robust models.

Depth Videos for the Classification of Micro-Expressions

Ankith Jain Rakesh Kumar, Bir Bhanu, Christopher Casey, Sierra Cheung, Aaron Seitz

Responsive image

Auto-TLDR; RGB-D Dataset for the Classification of Facial Micro-expressions

Slides Poster Similar

Facial micro-expressions are spontaneous, subtle, involuntary muscle movements occurring briefly on the face. The spotting and recognition of these expressions are difficult due to the subtle behavior, and the time duration of these expressions is about half a second, which makes it difficult for humans to identify them. These micro-expressions have many applications in our daily life, such as in the field of online learning, game playing, lie detection, and therapy sessions. Traditionally, researchers use RGB images/videos to spot and classify these micro-expressions, which pose challenging problems, such as illumination, privacy concerns and pose variation. The use of depth videos solves these issues to some extent, as the depth videos are not susceptible to the variation in illumination. This paper describes the collection of a first RGB-D dataset for the classification of facial micro-expressions into 6 universal expressions: Anger, Happy, Sad, Fear, Disgust, and Surprise. This paper shows the comparison between the RGB and Depth videos for the classification of facial micro-expressions. Further, a comparison of results shows that depth videos alone can be used to classify facial micro-expressions correctly in a decision tree structure by using the traditional and deep learning approaches with good classification accuracy. The dataset will be released to the public in the near future.

Detecting Marine Species in Echograms Via Traditional, Hybrid, and Deep Learning Frameworks

Porto Marques Tunai, Alireza Rezvanifar, Melissa Cote, Alexandra Branzan Albu, Kaan Ersahin, Todd Mudge, Stephane Gauthier

Responsive image

Auto-TLDR; End-to-End Deep Learning for Echogram Interpretation of Marine Species in Echograms

Slides Poster Similar

This paper provides a comprehensive comparative study of traditional, hybrid, and deep learning (DL) methods for detecting marine species in echograms. Acoustic backscatter data obtained from multi-frequency echosounders is visualized as echograms and typically interpreted by marine biologists via manual or semi-automatic methods, which are time-consuming. Challenges related to automatic echogram interpretation are the variable size and acoustic properties of the biological targets (marine life), along with significant inter-class similarities. Our study explores and compares three types of approaches that cover the entire range of machine learning methods. Based on our experimental results, we conclude that an end-to-end DL-based framework, that can be readily scaled to accommodate new species, is overall preferable to other learning approaches for echogram interpretation, even when only a limited number of annotated training samples is available.

Appliance Identification Using a Histogram Post-Processing of 2D Local Binary Patterns for Smart Grid Applications

Yassine Himeur, Abdullah Alsalemi, Faycal Bensaali, Abbes Amira

Responsive image

Auto-TLDR; LBP-BEVM based Local Binary Patterns for Appliances Identification in the Smart Grid

Similar

Identifying domestic appliances in the smart grid leads to a better power usage management and further helps in detecting appliance-level abnormalities. An efficient identification can be achieved only if a robust feature extraction scheme is developed with a high ability to discriminate between different appliances on the smart grid. Accordingly, we propose in this paper a novel method to extract electrical power signatures after transforming the power signal to 2D space, which has more encoding possibilities. Following, an improved local binary patterns (LBP) is proposed that relies on improving the discriminative ability of conventional LBP using a post-processing stage. A binarized eigenvalue map (BEVM) is extracted from the 2D power matrix and then used to post-process the generated LBP representation. Next, two histograms are constructed, namely up and down histograms, and are then concatenated to form the global histogram. A comprehensive performance evaluation is performed on two different datasets, namely the GREEND and WITHED, in which power data were collected at 1 Hz and 44000 Hz sampling rates, respectively. The obtained results revealed the superiority of the proposed LBP-BEVM based system in terms of the identification performance versus other 2D descriptors and existing identification frameworks.

Automatic Tuberculosis Detection Using Chest X-Ray Analysis with Position Enhanced Structural Information

Hermann Jepdjio Nkouanga, Szilard Vajda

Responsive image

Auto-TLDR; Automatic Chest X-ray Screening for Tuberculosis in Rural Population using Localized Region on Interest

Slides Poster Similar

For Tuberculosis (TB) detection beside the more expensive diagnosis solutions such as culture or sputum smear analysis one could consider the automatic analysis of the chest X-ray (CXR). This could mimic the lung region reading by the radiologist and it could provide a cheap solution to analyze and diagnose pulmonary abnormalities such as TB which often co- occurs with HIV. This software based pulmonary screening can be a reliable and affordable solution for rural population in different parts of the world such as India, Africa, etc. Our fully automatic system is processing the incoming CXR image by applying image processing techniques to detect the region on interest (ROI) followed by a computationally cheap feature extraction involving edge detection using Laplacian of Gaussian which we enrich by counting the local distribution of the intensities. The choice to ”zoom in” the ROI and look for abnormalities locally is motivated by the fact that some pulmonary abnormalities are localized in specific regions of the lungs. Later on the classifiers can decide about the normal or abnormal nature of each lung X-ray. Our goal is to find a simple feature, instead of a combination of several ones, -proposed and promoted in recent years’ literature, which can properly describe the different pathological alterations in the lungs. Our experiments report results on two publicly available data collections1, namely the Shenzhen and the Montgomery collection. For performance evaluation, measures such as area under the curve (AUC), and accuracy (ACC) were considered, achieving AUC = 0.81 (ACC = 83.33%) and AUC = 0.96 (ACC = 96.35%) for the Montgomery and Schenzen collections, respectively. Several comparisons are also provided to other state- of-the-art systems reported recently in the field.

How Does DCNN Make Decisions?

Yi Lin, Namin Wang, Xiaoqing Ma, Ziwei Li, Gang Bai

Responsive image

Auto-TLDR; Exploring Deep Convolutional Neural Network's Decision-Making Interpretability

Slides Poster Similar

Deep Convolutional Neural Networks (DCNN), despite imitating the human visual system, present no such decision credibility as human observers. This phenomenon, therefore, leads to the limitations of DCNN's applications in the security and trusted computing, such as self-driving cars and medical diagnosis. Focusing on this issue, our work aims to explore the way DCNN makes decisions. In this paper, the major contributions we made are: firstly, provide the hypothesis, “point-wise activation” of convolution function, according to the analysis of DCNN’s architectures and training process; secondly, point out the effect of “point-wise activation” on DCNN’s uninterpretable classification and pool robustness, and then suggest, in particular, the contradiction between the traditional and DCNN’s convolution kernel functions; finally, distinguish decision-making interpretability from semantic interpretability, and indicate that DCNN’s decision-making mechanism need to evolve towards the direction of semantics in the future. Besides, the “point-wise activation” hypothesis and conclusions proposed in our paper are supported by extensive experimental results.

Two-Level Attention-Based Fusion Learning for RGB-D Face Recognition

Hardik Uppal, Alireza Sepas-Moghaddam, Michael Greenspan, Ali Etemad

Responsive image

Auto-TLDR; Fused RGB-D Facial Recognition using Attention-Aware Feature Fusion

Slides Poster Similar

With recent advances in RGB-D sensing technologies as well as improvements in machine learning and fusion techniques, RGB-D facial recognition has become an active area of research. A novel attention aware method is proposed to fuse two image modalities, RGB and depth, for enhanced RGB-D facial recognition. The proposed method first extracts features from both modalities using a convolutional feature extractor. These features are then fused using a two layer attention mechanism. The first layer focuses on the fused feature maps generated by the feature extractor, exploiting the relationship between feature maps using LSTM recurrent learning. The second layer focuses on the spatial features of those maps using convolution. The training database is preprocessed and augmented through a set of geometric transformations, and the learning process is further aided using transfer learning from a pure 2D RGB image training process. Comparative evaluations demonstrate that the proposed method outperforms other state-of-the-art approaches, including both traditional and deep neural network-based methods, on the challenging CurtinFaces and IIIT-D RGB-D benchmark databases, achieving classification accuracies over 98.2% and 99.3% respectively. The proposed attention mechanism is also compared with other attention mechanisms, demonstrating more accurate results.

Object-Oriented Map Exploration and Construction Based on Auxiliary Task Aided DRL

Junzhe Xu, Jianhua Zhang, Shengyong Chen, Honghai Liu

Responsive image

Auto-TLDR; Auxiliary Task Aided Deep Reinforcement Learning for Environment Exploration by Autonomous Robots

Similar

Environment exploration by autonomous robots through deep reinforcement learning (DRL) based methods has attracted more and more attention. However, existing methods usually focus on robot navigation to single or multiple fixed goals, while ignoring the perception and construction of external environments. In this paper, we propose a novel environment exploration task based on DRL, which requires a robot fast and completely perceives all objects of interest, and reconstructs their poses in a global environment map, as much as the robot can do. To this end, we design an auxiliary task aided DRL model, which is integrated with the auxiliary object detection and 6-DoF pose estimation components. The outcome of auxiliary tasks can improve the learning speed and robustness of DRL, as well as the accuracy of object pose estimation. Comprehensive experimental results on the indoor simulation platform AI2-THOR have shown the effectiveness and robustness of our method.

RWMF: A Real-World Multimodal Foodlog Database

Pengfei Zhou, Cong Bai, Kaining Ying, Jie Xia, Lixin Huang

Responsive image

Auto-TLDR; Real-World Multimodal Foodlog: A Real-World Foodlog Database for Diet Assistant

Slides Poster Similar

With the increasing health concerns on diet, it's worthwhile to develop an intelligent assistant that can help users eat healthier. Such assistants can automatically give personal advice for the users' diet and generate health reports about eating on a regular basis. To boost the research on such diet assistant, we establish a real-world foodlog database using various methods such as filter, cluster and graph convolutional network. This database is built based on real-world lifelog and medical data, which is named as Real-World Multimodal Foodlog (RWMF). It contains 7500 multimodal pairs, and each pair consists of a food image paired with a line of personal biometrics data (such as Blood Glucose) and a textual food description of food composition paired with a line of food nutrition data. In this paper, we present the detailed procedures for setting up the database. We evaluate the performance of RWMF using different food classification and cross-modal retrieval approaches. We also test the performance of multimodal fusion on RWMF through ablation experiments. The experimental results show that the RWMF database is quite challenging and can be widely used to evaluate the performance of food analysis methods based on multimodal data.

DE-Net: Dilated Encoder Network for Automated Tongue Segmentation

Hui Tang, Bin Wang, Jun Zhou, Yongsheng Gao

Responsive image

Auto-TLDR; Automated Tongue Image Segmentation using De-Net

Slides Poster Similar

Automated tongue recognition is a growing research field due to global demand for personal health care. Using mobile devices to take tongue pictures is convenient and of low cost for tongue recognition. It is particularly suitable for self-health evaluation of the public. However, images taken by mobile devices are easily affected by various imaging environment, which makes fine segmentation a more challenging task compared with those taken by specialized acquisition devices. Deep learning approaches are promising for tongue image segmentation because they have powerful feature learning and representation capability. However, the successive pooling operations in these methods lead to loss of information on image details, making them fail when segmenting low-quality images captured by mobile devices. To address this issue, we propose a dilated encoder network (DE-Net) to capture more high-level features and get high-resolution output for automated tongue image segmentation. In addition, we construct two tongue image datasets which contain images taken by specialized devices and mobile devices, respectively, to verify the effectiveness of the proposed method. Experimental results on both datasets demonstrate that the proposed method outperforms the state-of-the-art methods in tongue image segmentation.