Relevance Detection in Cataract Surgery Videos by Spatio-Temporal Action Localization

Negin Ghamsarian, Mario Taschwer, Doris Putzgruber, Stephanie. Sarny, Klaus Schoeffmann

Responsive image

Auto-TLDR; relevance-based retrieval in cataract surgery videos

Slides Similar

In cataract surgery, the operation is performed with the help of a microscope. Since the microscope enables watching real-time surgery by up to two people only, a major part of surgical training is conducted using the recorded videos. To optimize the training procedure with the video content, the surgeons require an automatic relevance detection approach. In addition to relevance-based retrieval, these results can be further used for skill assessment and irregularity detection in cataract surgery videos. In this paper, a three-module framework is proposed to detect and classify the relevant phase segments in cataract videos. Taking advantage of an idle frame recognition network, the video is divided into idle and action segments. To boost the performance in relevance detection Mask R-CNN is utilized to detect the cornea in each frame where the relevant surgical actions are conducted. The spatio-temporal localized segments containing higher-resolution information about the pupil texture and actions, and complementary temporal information from the same phase are fed into the relevance detection module. This module consists of four parallel recurrent CNNs being responsible to detect four relevant phases that have been defined with medical experts. The results will then be integrated to classify the action phases as irrelevant or one of four relevant phases. Experimental results reveal that the proposed approach outperforms static CNNs and different configurations of feature-based and end-to-end recurrent networks.

Are Multiple Cross-Correlation Identities Better Than Just Two? Improving the Estimate of Time Differences-Of-Arrivals from Blind Audio Signals

Danilo Greco, Jacopo Cavazza, Alessio Del Bue

Responsive image

Auto-TLDR; Improving Blind Channel Identification Using Cross-Correlation Identity for Time Differences-of-Arrivals Estimation

Slides Poster Similar

Given an unknown audio source, the estimation of time differences-of-arrivals (TDOAs) can be efficiently and robustly solved using blind channel identification and exploiting the cross-correlation identity (CCI). Prior "blind" works have improved the estimate of TDOAs by means of different algorithmic solutions and optimization strategies, while always sticking to the case N = 2 microphones. But what if we can obtain a direct improvement in performance by just increasing N? In this paper we try to investigate this direction, showing that, despite the arguable simplicity, this is capable of (sharply) improving upon state-of-the-art blind channel identification methods based on CCI, without modifying the computational pipeline. Inspired by our results, we seek to warm up the community and the practitioners by paving the way (with two concrete, yet preliminary, examples) towards joint approaches in which advances in the optimization are combined with an increased number of microphones, in order to achieve further improvements.

Trajectory Representation Learning for Multi-Task NMRDP Planning

Firas Jarboui, Vianney Perchet

Responsive image

Auto-TLDR; Exploring Non Markovian Reward Decision Processes for Reinforcement Learning

Slides Poster Similar

Expanding Non Markovian Reward Decision Processes (NMRDP) into Markov Decision Processes (MDP) enables the use of state of the art Reinforcement Learning (RL) techniques to identify optimal policies. In this paper an approach to exploring NMRDPs and expanding them into MDPs, without the prior knowledge of the reward structure, is proposed. The non Markovianity of the reward function is disentangled under the assumption that sets of similar and dissimilar trajectory batches can be sampled. More precisely, within the same batch, measuring the similarity between any couple of trajectories is permitted, although comparing trajectories from different batches is not possible. A modified version of the triplet loss is optimised to construct a representation of the trajectories under which rewards become Markovian.

Adaptive Sampling of Pareto Frontiers with Binary Constraints Using Regression and Classification

Raoul Heese, Michael Bortz

Responsive image

Auto-TLDR; Adaptive Optimization for Black-Box Multi-Objective Optimizing Problems with Binary Constraints

Poster Similar

We present a novel adaptive optimization algorithm for black-box multi-objective optimization problems with binary constraints on the foundation of Bayes optimization. Our method is based on probabilistic regression and classification models, which act as a surrogate for the optimization goals and allow us to suggest multiple design points at once in each iteration. The proposed acquisition function is intuitively understandable and can be tuned to the demands of the problems at hand. We also present a novel ellipsoid truncation method to speed up the expected hypervolume calculation in a straightfoward way for regression models with a normal probability density. We benchmark our approach with an evolutionary algorithm on multiple test problems.

In Depth Semantic Scene Completion

David Gillsjö, Kalle Åström

Responsive image

Auto-TLDR; Bayesian Convolutional Neural Network for Semantic Scene Completion

Slides Poster Similar

For autonomous agents moving around in our world, mapping of the environment is essential. This is their only perception of their surrounding, what is not measured is unknown. Humans have learned from experience what to expect in certain environments, for example in indoor offices or supermarkets. This work studies Semantic Scene Completion which aims to predict a 3D semantic segmentation of our surroundings, even though some areas are occluded. For this we construct a Bayesian Convolutional Neural Network (BCNN), which is not only able to perform the segmentation, but also predict model uncertainty. This is an important feature not present in standard CNNs. We show on the MNIST dataset that the Bayesian approach performs equal or better to the standard CNN when processing digits unseen in the training phase when looking at accuracy, precision and recall. With the added benefit of having better calibrated scores and the ability to express model uncertainty. We then show results for the Semantic Scene Completion task where a category is introduced at test time on the SUNCG dataset. In this more complex task the Bayesian approach outperforms the standard CNN. Showing better Intersection over Union score and excels in Average Precision and separation scores.

Optimal Transport As a Defense against Adversarial Attacks

Quentin Bouniot, Romaric Audigier, Angélique Loesch

Responsive image

Auto-TLDR; Sinkhorn Adversarial Training with Optimal Transport Theory

Slides Poster Similar

Deep learning classifiers are now known to have flaws in the representations of their class. Adversarial attacks can find a human-imperceptible perturbation for a given image that will mislead a trained model. The most effective methods to defend against such attacks trains on generated adversarial examples to learn their distribution. Previous work aimed to align original and adversarial image representations in the same way as domain adaptation to improve robustness. Yet, they partially align the representations using approaches that do not reflect the geometry of space and distribution. In addition, it is difficult to accurately compare robustness between defended models. Until now, they have been evaluated using a fixed perturbation size. However, defended models may react differently to variations of this perturbation size. In this paper, the analogy of domain adaptation is taken a step further by exploiting optimal transport theory. We propose to use a loss between distributions that faithfully reflect the ground distance. This leads to SAT (Sinkhorn Adversarial Training), a more robust defense against adversarial attacks. Then, we propose to quantify more precisely the robustness of a model to adversarial attacks over a wide range of perturbation sizes using a different metric, the Area Under the Accuracy Curve (AUAC). We perform extensive experiments on both CIFAR-10 and CIFAR-100 datasets and show that our defense is globally more robust than the state-of-the-art.

SSDL: Self-Supervised Domain Learning for Improved Face Recognition

Samadhi Poornima Kumarasinghe Wickrama Arachchilage, Ebroul Izquierdo

Responsive image

Auto-TLDR; Self-supervised Domain Learning for Face Recognition in unconstrained environments

Slides Poster Similar

Face recognition in unconstrained environments is challenging due to variations in illumination, quality of sensing, motion blur and etc. An individual’s face appearance can vary drastically under different conditions creating a gap between train (source) and varying test (target) data. The domain gap could cause decreased performance levels in direct knowledge transfer from source to target. Despite fine-tuning with domain specific data could be an effective solution, collecting and annotating data for all domains is extremely expensive. To this end, we propose a self-supervised domain learning (SSDL) scheme that trains on triplets mined from unlabelled data. A key factor in effective discriminative learning, is selecting informative triplets. Building on most confident predictions, we follow an “easy-to-hard” scheme of alternate triplet mining and self-learning. Comprehensive experiments on four different benchmarks show that SSDL generalizes well on different domains.

PCANet: Pyramid Context-Aware Network for Retinal Vessel Segmentation

Yi Zhang, Yixuan Chen, Kai Zhang

Responsive image

Auto-TLDR; PCANet: Adaptive Context-Aware Network for Automated Retinal Vessel Segmentation

Slides Poster Similar

Automated retinal vessel segmentation plays an important role in the diagnosis of some diseases such as diabetes, arteriosclerosis and hypertension. Recent works attempt to improve segmentation performance by exploring either global or local contexts. However, the context demands are varying from regions in each image and different levels of network. To address these problems, we propose Pyramid Context-aware Network (PCANet), which can adaptively capture multi-scale context representations. Specifically, PCANet is composed of multiple Adaptive Context-aware (ACA) blocks arranged in parallel, each of which can adaptively obtain the context-aware features by estimating affinity coefficients at a specific scale under the guidance of global contextual dependencies. Meanwhile, we import ACA blocks with specific scales in different levels of the network to obtain a coarse-to-fine result. Furthermore, an integrated test-time augmentation method is developed to further boost the performance of PCANet. Finally, extensive experiments demonstrate the effectiveness of the proposed PCANet, and state-of-the-art performances are achieved with AUCs of 0.9866, 0.9886 and F1 Scores of 0.8274, 0.8371 on two public datasets, DRIVE and STARE, respectively.

Attention Based Pruning for Shift Networks

Ghouthi Hacene, Carlos Lassance, Vincent Gripon, Matthieu Courbariaux, Yoshua Bengio

Responsive image

Auto-TLDR; Shift Attention Layers for Efficient Convolutional Layers

Slides Poster Similar

In many application domains such as computer vision, Convolutional Layers (CLs) are key to the accuracy of deep learning methods. However, it is often required to assemble a large number of CLs, each containing thousands of parameters, in order to reach state-of-the-art accuracy, thus resulting in complex and demanding systems that are poorly fitted to resource-limited devices. Recently, methods have been proposed to replace the generic convolution operator by the combination of a shift operation and a simpler 1x1 convolution. The resulting block, called Shift Layer (SL), is an efficient alternative to CLs in the sense it allows to reach similar accuracies on various tasks with faster computations and fewer parameters. In this contribution, we introduce Shift Attention Layers (SALs), which extend SLs by using an attention mechanism that learns which shifts are the best at the same time the network function is trained. We demonstrate SALs are able to outperform vanilla SLs (and CLs) on various object recognition benchmarks while significantly reducing the number of float operations and parameters for the inference.

Detecting Rare Cell Populations in Flow Cytometry Data Using UMAP

Lisa Weijler, Markus Diem, Michael Reiter

Responsive image

Auto-TLDR; Unsupervised Manifold Approximation and Projection for Small Cell Population Detection in Flow cytometry Data

Slides Poster Similar

We present an approach for detecting small cell populations in flow cytometry (FCM) samples based on the combination of unsupervised manifold embedding and supervised random forest classification. Each sample consists of hundred thousands to a few million cells where each cell typically corresponds to a measurement vector with 10 to 50 dimensions. The difficulty of the task is that clusters of measurement vectors formed in the data space according to standard clustering criteria often do not correspond to biologically meaningful sub-populations of cells, due to strong variations in shape and size of their distributions. In many cases the relevant population consists of less than 100 scattered events out of millions of events, where supervised approaches perform better than unsupervised clustering. The aim of this paper is to demonstrate that the performance of the standard supervised classifier can be improved significantly by combining it with a preceding unsupervised learning step involving the Uniform Manifold Approximation and Projection (UMAP). We present an experimental evaluation on FCM data from children suffering from Acute Lymphoblastic Leukemia (ALL) showing that the improvement particularly occurs in difficult samples where the size of the relevant population of leukemic cells is low in relation to other sub-populations. Further, the experiments indicate that on such samples the algorithm also outperforms other baseline methods based on Gaussian Mixture Models.

Near-Infrared Depth-Independent Image Dehazing using Haar Wavelets

Sumit Laha, Ankit Sharma, Shengnan Hu, Hassan Foroosh

Responsive image

Auto-TLDR; A fusion algorithm for haze removal using Haar wavelets

Slides Poster Similar

We propose a fusion algorithm for haze removal that combines color information from an RGB image and edge information extracted from its corresponding NIR image using Haar wavelets. The proposed algorithm is based on the key observation that NIR edge features are more prominent in the hazy regions of the image than the RGB edge features in those same regions. To combine the color and edge information, we introduce a haze-weight map which proportionately distributes the color and edge information during the fusion process. Because NIR images are, intrinsically, nearly haze-free, our work makes no assumptions like existing works that rely on a scattering model and essentially designing a depth-independent method. This helps in minimizing artifacts and gives a more realistic sense to the restored haze-free image. Extensive experiments show that the proposed algorithm is both qualitatively and quantitatively better on several key metrics when compared to existing state-of-the-art methods.

Bidirectional Matrix Feature Pyramid Network for Object Detection

Wei Xu, Yi Gan, Jianbo Su

Responsive image

Auto-TLDR; BMFPN: Bidirectional Matrix Feature Pyramid Network for Object Detection

Slides Poster Similar

Feature pyramids are widely used to improve scale invariance for object detection. Most methods just map the objects to feature maps with relevant square receptive fields, but rarely pay attention to the aspect ratio variation, which is also an important property of object instances. It will lead to a poor match between rectangular objects and assigned features with square receptive fields, thus preventing from accurate recognition and location. Besides, the information propagation among feature layers is sparse, namely, each feature in the pyramid may mainly or only contain single-level information, which is not representative enough for classification and localization sub-tasks. In this paper, Bidirectional Matrix Feature Pyramid Network (BMFPN) is proposed to address these issues. It consists of three modules: Diagonal Layer Generation Module (DLGM), Top-down Module (TDM) and Bottom-up Module (BUM). First, multi-level features extracted by backbone are fed into DLGM to produce the base features. Then these base features are utilized to construct the final feature pyramid through TDM and BUM in series. The receptive fields of the designed feature layers in BMFPN have various scales and aspect ratios. Objects can be correctly assigned to appropriate and representative feature maps with relevant receptive fields depending on its scale and aspect ratio properties. Moreover, TDM and BUM form bidirectional and reticular information flow, which effectively fuses multi level information in top-down and bottom-up manner respectively. To evaluate the effectiveness of our proposed architecture, an end-toend anchor-free detector is designed and trained by integrating BMFPN into FCOS. And the center ness branch in FCOS is modified with our Gaussian center-ness branch (GCB), which brings another slight improvement. Without bells and whistles, our method gains +3.3%, +2.4% and +2.6% AP on MS COCO dataset from baselines with ResNet-50, ResNet-101 and ResNeXt-101 backbones, respectively.

MAGNet: Multi-Region Attention-Assisted Grounding of Natural Language Queries at Phrase Level

Amar Shrestha, Krittaphat Pugdeethosapol, Haowen Fang, Qinru Qiu

Responsive image

Auto-TLDR; MAGNet: A Multi-Region Attention-Aware Grounding Network for Free-form Textual Queries

Slides Poster Similar

Grounding free-form textual queries necessitates an understanding of these textual phrases and its relation to the visual cues to reliably reason about the described locations. Spatial attention networks are known to learn this relationship and focus its gaze on salient objects in the image. Thus, we propose to utilize spatial attention networks for image-level visual-textual fusion preserving local (word) and global (phrase) information to refine region proposals with an in-network Region Proposal Network (RPN) and detect single or multiple regions for a phrase query. We focus only on the phrase query - ground truth pair (referring expression) for a model independent of the constraints of the datasets i.e. additional attributes, context etc. For such referring expression dataset ReferIt game, our Multi- region Attention-assisted Grounding network (MAGNet) achieves over 12% improvement over the state-of-the-art. Without the con- text from image captions and attribute information in Flickr30k Entities, we still achieve competitive results compared to the state- of-the-art.

A Scalable Deep Neural Network to Detect Low Quality Images without a Reference

Zongyi Liu

Responsive image

Auto-TLDR; A Deep Neural Network-based Algorithm for Non-reference Non-Reference Non-Referential Image Quality Metrics for Streaming Services

Slides Poster Similar

Online streaming services have been growing at a fast pace. To provide the best user experience, it is needed to detect low quality images from videos so that we can repair or improve them before showing to customers. For example, for movie and TV-show streaming services, it is important to check if an original~(master) video produced from a studio has low quality images that contain artifacts such as up-scaling or interlacing; for live streaming services, it is important to detect if a streamed video have hits due to encoding such as h264 or mpeg2. The impairment detection is usually measured by the non-reference~(NR) metrics because it is often difficult and sometimes impossible to get the original~(master) videos. On the other hand, today researches in the image quality area, such as super-resolution, are mainly focused on the full reference~(FR) metrics like PSNR or VMAF. In this paper, we present an algorithm that is able to reliably compute five types spatial NR metrics that are commonly used in the Prime Video~(PV) movie content inspection and live streaming services. The algorithm consists of two components: a pre-processing step that spatially de-correlates pixel intensity values and a novel deep neural network~(DNN) that is able to quantify the $NR$ metrics at the image region level. We show that our algorithm achieves better performance than state-of-art algorithms in this area.

Hierarchical Classification with Confidence Using Generalized Logits

James W. Davis, Tong Liang, James Enouen, Roman Ilin

Responsive image

Auto-TLDR; Generalized Logits for Hierarchical Classification

Slides Poster Similar

We present a bottom-up approach to hierarchical classification based on posteriors conditioned with logits. Beginning with the output logits for a set of terminal labels from a base classifier, an initial hypothesis is repeatedly generalized (softened) to a weaker label until a particular confidence measure is achieved. As conditioning the probabilistic model with the full set of terminal logits quickly becomes intractable for large label sets, we propose an alternative approach employing "generalized logits" spanning relevant hypotheses within the label hierarchy. Experimental results are compared with related methods on multiple datasets and base classifiers. The proposed approach provides an efficient and effective hierarchical classification framework with monotonic, non-decreasing inference behavior.

Feature Extraction and Selection Via Robust Discriminant Analysis and Class Sparsity

Ahmad Khoder, Fadi Dornaika

Responsive image

Auto-TLDR; Hybrid Linear Discriminant Embedding for supervised multi-class classification

Slides Poster Similar

The main goal of discriminant embedding is to extract features that can be compact and informative representations of the original set of features. This paper introduces a hybrid scheme for linear feature extraction for supervised multi-class classification. We introduce a unifying criterion that is able to retain the advantages of robust sparse LDA and Inter-class sparsity. Thus, the estimated transformation includes two types of discrimination which are the inter-class sparsity and robust Linear Discriminant Analysis with feature selection. In order to optimize the proposed objective function, we deploy an iterative alternating minimization scheme for estimating the linear transformation and the orthogonal matrix. The introduced scheme is generic in the sense that it can be used for combining and tuning many other linear embedding methods. In the lights of the experiments conducted on six image datasets including faces, objects, and digits, the proposed scheme was able to outperform competing methods in most of the cases.

Revisiting the Training of Very Deep Neural Networks without Skip Connections

Oyebade Kayode Oyedotun, Abd El Rahman Shabayek, Djamila Aouada, Bjorn Ottersten

Responsive image

Auto-TLDR; Optimization of Very Deep PlainNets without shortcut connections with 'vanishing and exploding units' activations'

Slides Poster Similar

Deep neural networks (DNNs) with many layers of feature representations yield state-of-the-art results on several difficult learning tasks. However, optimizing very deep DNNs without shortcut connections known as PlainNets, is a notoriously hard problem. Considering the growing interest in this area, this paper investigates holistically two scenarios that plague the training of very deep PlainNets: (1) the relatively popular challenge of 'vanishing and exploding units' activations', and (2) the less investigated 'singularity' problem, which is studied in details in this paper. In contrast to earlier works that study only the saturation and explosion of units' activations in isolation, this paper harmonizes the inconspicuous coexistence of the aforementioned problems for very deep PlainNets. Particularly, we argue that the aforementioned problems would have to be tackled simultaneously for the successful training of very deep PlainNets. Finally, different techniques that can be employed for tackling the optimization problem are discussed, and a specific combination of simple techniques that allows the successful training of PlainNets having up to 100 layers is demonstrated.

Do Not Treat Boundaries and Regions Differently: An Example on Heart Left Atrial Segmentation

Zhou Zhao, Elodie Puybareau, Nicolas Boutry, Thierry Geraud

Responsive image

Auto-TLDR; Attention Full Convolutional Network for Atrial Segmentation using ResNet-101 Architecture

Slides Similar

Atrial fibrillation is the most common heart rhythm disease. Due to a lack of understanding in matter of underlying atrial structures, current treatments are still not satisfying. Recently, with the popularity of deep learning, many segmentation methods based on fully convolutional networks have been proposed to analyze atrial structures, especially from late gadolinium-enhanced magnetic resonance imaging. However, two problems still occur: 1) segmentation results include the atrial-like background; 2) boundaries are very hard to segment. Most segmentation approaches design a specific network that mainly focuses on the regions, to the detriment of the boundaries. Therefore, this paper proposes an attention full convolutional network framework based on the ResNet-101 architecture, which focuses on boundaries as much as on regions. The additional attention module is added to have the network pay more attention on regions and then to reduce the impact of the misleading similarity of neighboring tissues. We also use a hybrid loss composed of a region loss and a boundary loss to treat boundaries and regions at the same time. We demonstrate the efficiency of the proposed approach on the MICCAI 2018 Atrial Segmentation Challenge public dataset.

Inferring Functional Properties from Fluid Dynamics Features

Andrea Schillaci, Maurizio Quadrio, Carlotta Pipolo, Marcello Restelli, Giacomo Boracchi

Responsive image

Auto-TLDR; Exploiting Convective Properties of Computational Fluid Dynamics for Medical Diagnosis

Slides Poster Similar

In a wide range of applied problems involving fluid flows, Computational Fluid Dynamics (CFD) provides detailed quantitative information on the flow field, at various levels of fidelity and computational cost. However, CFD alone cannot predict high-level functional properties of the system that are not easily obtained from the equations of fluid motion. In this work, we present a data-driven framework to extract additional information, such as medical diagnostic output, from CFD solutions. The task is made difficult by the huge data dimensionality of CFD, together with the limited amount of training data implied by its high computational cost. By pursuing a traditional ML pipeline of pre-processing, feature extraction, and model training, we demonstrate that informative features can be extracted from CFD data. Two experiments, pertaining to different application domains, support the claim that the convective properties implicit into a CFD solution can be leveraged to retrieve functional information for which an analytical definition is missing. Despite the preliminary nature of our study and the relative simplicity of both the geometrical and CFD models, for the first time we demonstrate that the combination of ML and CFD can diagnose a complex system in terms of high-level functional information.

Few-Shot Font Generation with Deep Metric Learning

Haruka Aoki, Koki Tsubota, Hikaru Ikuta, Kiyoharu Aizawa

Responsive image

Auto-TLDR; Deep Metric Learning for Japanese Typographic Font Synthesis

Slides Poster Similar

Designing fonts for languages with a large number of characters, such as Japanese and Chinese, is an extremely labor-intensive and time-consuming task. In this study, we addressed the problem of automatically generating Japanese typographic fonts from only a few font samples, where the synthesized glyphs are expected to have coherent characteristics, such as skeletons, contours, and serifs. Existing methods often fail to generate fine glyph images when the number of style reference glyphs is extremely limited. Herein, we proposed a simple but powerful framework for extracting better style features. This framework introduces deep metric learning to style encoders. We performed experiments using black-and-white and shape-distinctive font datasets and demonstrated the effectiveness of the proposed framework.

End-To-End Deep Learning Methods for Automated Damage Detection in Extreme Events at Various Scales

Yongsheng Bai, Alper Yilmaz, Halil Sezen

Responsive image

Auto-TLDR; Robust Mask R-CNN for Crack Detection in Extreme Events

Slides Poster Similar

Robust Mask R-CNN (Mask Regional Convolutional Neural Network) methods are proposed and tested for automatic detection of cracks on structures or their components that may be damaged during extreme events, such as earth-quakes. We curated a new dataset with 2,021 labeled images for training and validation and aimed to find end-to-end deep neural networks for crack detection in the field. With data augmentation and parameters fine-tuning, Path Aggregation Network (PANet) with spatial attention mechanisms and High-resolution Network (HRNet) are introduced into Mask R-CNNs. The tests on three public datasets with low- or high-resolution images demonstrate that the proposed methods can achieve a big improvement over alternative networks, so the proposed method may be sufficient for crack detection for a variety of scales in real applications.