#### BCAU-Net: A Novel Architecture with Binary Channel Attention Module for MRI Brain Segmentation

Yongpei Zhu, Zicong Zhou, Guojun Liao, Kehong Yuan
Track 5: Image and Signal Processing
Thu 14 Jan 2021 at 12:00 in session PS T5.5

Auto-TLDR; BCAU-Net: Binary Channel Attention U-Net for MRI brain segmentation

Recently deep learning-based networks have achieved advanced performance in medical image segmentation. However, the development of deep learning is slow in magnetic resonance image (MRI) segmentation of normal brain tissues. In this paper, inspired by channel attention module, we propose a new architecture, Binary Channel Attention U-Net (BCAU-Net), by introducing a novel Binary Channel Attention Module (BCAM) into skip connection of U-Net, which can take full advantages of the channel information extracted from the encoding path and corresponding decoding path. To better aggregate multi-scale spatial information of the feature map, spatial pyramid pooling (SPP) modules with different pooling operations are used in BCAM instead of original average-pooling and max-pooling operations. We verify this model on two datasets including IBSR and MRBrainS18, and obtain better performance on MRI brain segmentation compared with other methods. We believe the proposed method can advance the performance in brain segmentation and clinical diagnosis.

## Similar papers

#### Segmentation of Intracranial Aneurysm Remnant in MRA Using Dual-Attention Atrous Net

Subhashis Banerjee, Ashis Kumar Dhara, Johan Wikström, Robin Strand
Track 5: Image and Signal Processing
Wed 13 Jan 2021 at 16:30 in session PS T5.4

Auto-TLDR; Dual-Attention Atrous Net for Segmentation of Intracranial Aneurysm Remnant from MRA Images

Due to the advancement of non-invasive medical imaging modalities like Magnetic Resonance Angiography (MRA), an increasing number of Intracranial Aneurysm (IA) cases are being reported in recent years. The IAs are typically treated by so-called endovascular coiling, where blood flow in the IA is prevented by embolization with a platinum coil. Accurate quantification of the IA Remnant (IAR), i.e. the volume with blood flow present post treatment is the utmost important factor in choosing the right treatment planning. This is typically done by manually segmenting the aneurysm remnant from the MRA volume. Since manual segmentation of volumetric images is a labour-intensive and error-prone process, development of an automatic volumetric segmentation method is required. Segmentation of small structures such as IA, that may largely vary in size, shape, and location is considered extremely difficult. Similar intensity distribution of IAs and surrounding blood vessels makes it more challenging and susceptible to false positive. In this paper we propose a novel 3D CNN architecture called Dual-Attention Atrous Net (DAtt-ANet), which can efficiently segment IAR volumes from MRA images by reconciling features at different scales using the proposed Parallel Atrous Unit (PAU) along with the use of self-attention mechanism for extracting fine-grained features and intra-class correlation. The proposed DAtt-ANet model is trained and evaluated on a clinical MRA image dataset (prospective research project, approved by the local ethical committee) of IAR consisting of 46 subjects, annotated by an expert radiologist from our group. We compared the proposed DAtt-ANet with five state-of-the-art CNN models based on their segmentation performance. The proposed DAtt-ANet outperformed all other methods and was able to achieve a five-fold cross-validation DICE score of $0.73\pm0.06$.

#### A Benchmark Dataset for Segmenting Liver, Vasculature and Lesions from Large-Scale Computed Tomography Data

Bo Wang, Zhengqing Xu, Wei Xu, Qingsen Yan, Liang Zhang, Zheng You
Track 5: Image and Signal Processing
Thu 14 Jan 2021 at 12:00 in session PS T5.5

Auto-TLDR; The Biggest Treatment-Oriented Liver Cancer Dataset for Segmentation

How to build a high-performance liver-related computer assisted diagnosis system is an open question of great interest. However, the performance of the state-of-art algorithm is always limited by the amount of data and quality of the label. To address this problem, we propose the biggest treatment-oriented liver cancer dataset for liver surgery and treatment planning. This dataset provides 216 cases (totally about 268K frames) scanned images in contrast-enhanced computed tomography (CT). We labeled all the CT images with the liver, liver vasculature and liver tumor segmentation ground truth for train and tune segmentation algorithms in advance. Based on that, we evaluate several recent and state-of-the-art segmentation algorithms, including 7 deep learning methods, on CT sequences. All results are compared to reference segmentations five error metrics that highlight different aspects of segmentation accuracy. In general, compared with previous datasets, our dataset is really a challenging dataset. To our knowledge, the proposed dataset and benchmark allow for the first time systematic exploration of such issues, and will be made available to allow for further research in this field.

#### Do Not Treat Boundaries and Regions Differently: An Example on Heart Left Atrial Segmentation

Zhou Zhao, Elodie Puybareau, Nicolas Boutry, Thierry Geraud
Track 5: Image and Signal Processing
Fri 15 Jan 2021 at 13:00 in session OS T5.5

Auto-TLDR; Attention Full Convolutional Network for Atrial Segmentation using ResNet-101 Architecture

Atrial fibrillation is the most common heart rhythm disease. Due to a lack of understanding in matter of underlying atrial structures, current treatments are still not satisfying. Recently, with the popularity of deep learning, many segmentation methods based on fully convolutional networks have been proposed to analyze atrial structures, especially from late gadolinium-enhanced magnetic resonance imaging. However, two problems still occur: 1) segmentation results include the atrial-like background; 2) boundaries are very hard to segment. Most segmentation approaches design a specific network that mainly focuses on the regions, to the detriment of the boundaries. Therefore, this paper proposes an attention full convolutional network framework based on the ResNet-101 architecture, which focuses on boundaries as much as on regions. The additional attention module is added to have the network pay more attention on regions and then to reduce the impact of the misleading similarity of neighboring tissues. We also use a hybrid loss composed of a region loss and a boundary loss to treat boundaries and regions at the same time. We demonstrate the efficiency of the proposed approach on the MICCAI 2018 Atrial Segmentation Challenge public dataset.

#### CAggNet: Crossing Aggregation Network for Medical Image Segmentation

Xu Cao, Yanghao Lin
Track 3: Computer Vision Robotics and Intelligent Systems
Wed 13 Jan 2021 at 12:00 in session PS T3.4

Auto-TLDR; Crossing Aggregation Network for Medical Image Segmentation

In this paper, we present Crossing Aggregation Network (CAggNet), a novel densely connected semantic segmentation method for medical image analysis. The crossing aggregation network absorbs the idea of deep layer aggregation and makes significant innovations in layer connection and semantic information fusion. In this architecture, the traditional skip-connection structure of general U-Net is replaced by aggregations of multi-level down-sampling and up-sampling layers. This enables the network to fuse information interactively flows at different levels of layers in semantic segmentation. It also introduces weighted aggregation module to aggregate multi-scale output information. We have evaluated and compared our CAggNet with several advanced U-Net based methods in two public medical image datasets, including the 2018 Data Science Bowl nuclei detection dataset and the 2015 MICCAI gland segmentation competition dataset. Experimental results indicate that CAggNet improves medical object recognition and achieves a more accurate and efficient segmentation compared to existing improved U-Net and UNet++ structure.

#### 3D Medical Multi-Modal Segmentation Network Guided by Multi-Source Correlation Constraint

Tongxue Zhou, Stéphane Canu, Pierre Vera, Su Ruan
Track 5: Image and Signal Processing
Fri 15 Jan 2021 at 16:00 in session PS T5.8

Auto-TLDR; Multi-modality Segmentation with Correlation Constrained Network

In the field of multimodal segmentation, the correlation between different modalities can be considered for improving the segmentation results. In this paper, we propose a multi-modality segmentation network with a correlation constraint. Our network includes N model-independent encoding paths with N image sources, a correlation constrain block, a feature fusion block, and a decoding path. The model-independent encoding path can capture modality-specific features from the N modalities. Since there exists a strong correlation between different modalities, we first propose a linear correlation block to learn the correlation between modalities, then a loss function is used to guide the network to learn the correlated features based on the correlation representation block. This block forces the network to learn the latent correlated features which are more relevant for segmentation. Considering that not all the features extracted from the encoders are useful for segmentation, we propose to use dual attention based fusion block to recalibrate the features along the modality and spatial paths, which can suppress less informative features and emphasize the useful ones. The fused feature representation is finally projected by the decoder to obtain the segmentation result. Our experiment results tested on BraTS-2018 dataset for brain tumor segmentation demonstrate the effectiveness of our proposed method.

#### FOANet: A Focus of Attention Network with Application to Myocardium Segmentation

Zhou Zhao, Elodie Puybareau, Nicolas Boutry, Thierry Geraud
Track 5: Image and Signal Processing
Wed 13 Jan 2021 at 16:30 in session PS T5.4

Auto-TLDR; FOANet: A Hybrid Loss Function for Myocardium Segmentation of Cardiac Magnetic Resonance Images

In myocardium segmentation of cardiac magnetic resonance images, ambiguities often appear near the boundaries of the target domains due to tissue similarities. To address this issue, we propose a new architecture, called FOANet, which can be decomposed in three main steps: a localization step, a Gaussian-based contrast enhancement step, and a segmentation step. This architecture is supplied with a hybrid loss function that guides the FOANet to study the transformation relationship between the input image and the corresponding label in a threelevel hierarchy (pixel-, patch- and map-level), which is helpful to improve segmentation and recovery of the boundaries. We demonstrate the efﬁciency of our approach on two public datasets in terms of regional and boundary segmentations.

#### DARN: Deep Attentive Refinement Network for Liver Tumor Segmentation from 3D CT Volume

Yao Zhang, Jiang Tian, Cheng Zhong, Yang Zhang, Zhongchao Shi, Zhiqiang He
Track 5: Image and Signal Processing
Tue 12 Jan 2021 at 17:00 in session PS T5.2

Auto-TLDR; Deep Attentive Refinement Network for Liver Tumor Segmentation from 3D Computed Tomography Using Multi-Level Features

Automatic liver tumor segmentation from 3D Computed Tomography (CT) is a necessary prerequisite in the interventions of hepatic abnormalities and surgery planning. However, accurate liver tumor segmentation remains challenging due to the large variability of tumor sizes and inhomogeneous texture. Recent advances based on Fully Convolutional Network (FCN) in liver tumor segmentation draw on success of learning discriminative multi-level features. In this paper, we propose a Deep Attentive Refinement Network (DARN) for improved liver tumor segmentation from CT volumes by fully exploiting both low and high level features embedded in different layers of FCN. Different from existing works, we exploit attention mechanism to leverage the relation of different levels of features encoded in different layers of FCN. Specifically, we introduce a Semantic Attention Refinement (SemRef) module to selectively emphasize global semantic information in low level features with the guidance of high level ones, and a Spatial Attention Refinement (SpaRef) module to adaptively enhance spatial details in high level features with the guidance of low level ones. We evaluate our network on the public MICCAI 2017 Liver Tumor Segmentation Challenge dataset (LiTS dataset) and it achieves state-of-the-art performance. The proposed refinement modules are an effective strategy to exploit multi-level features and has great potential to generalize to other medical image segmentation tasks.

#### Deep Recurrent-Convolutional Model for AutomatedSegmentation of Craniomaxillofacial CT Scans

Francesca Murabito, Simone Palazzo, Federica Salanitri Proietto, Francesco Rundo, Ulas Bagci, Daniela Giordano, Rosalia Leonardi, Concetto Spampinato
Track 5: Image and Signal Processing
Wed 13 Jan 2021 at 16:30 in session PS T5.4

Auto-TLDR; Automated Segmentation of Anatomical Structures in Craniomaxillofacial CT Scans using Fully Convolutional Deep Networks

In this paper we define a deep learning architecture for automated segmentation of anatomical structures in Craniomaxillofacial (CMF) CT scans that leverages the recent success of encoder-decoder models for semantic segmentation of natural images. In particular, we propose a fully convolutional deep network that combines the advantages of recent fully convolutional models, such as Tiramisu, with squeeze-and-excitation blocks for feature recalibration, integrated with convolutional LSTMs to model spatio-temporal correlations between consecutive slices. The proposed segmentation network shows superior performance and generalization capabilities (to different structures and imaging modalities) than state of the art methods on automated segmentation of CMF structures (e.g., mandibles and airways) in several standard benchmarks (e.g., MICCAI datasets) and on new datasets proposed herein, effectively facing shape variability.

#### SA-UNet: Spatial Attention U-Net for Retinal Vessel Segmentation

Changlu Guo, Marton Szemenyei, Yugen Yi, Wenle Wang, Buer Chen, Changqi Fan
Track 5: Image and Signal Processing
Wed 13 Jan 2021 at 16:30 in session PS T5.4

Auto-TLDR; Spatial Attention U-Net for Segmentation of Retinal Blood Vessels

The precise segmentation of retinal blood vessels is of great significance for early diagnosis of eye-related diseases such as diabetes and hypertension. In this work, we propose a lightweight network named Spatial Attention U-Net (SA-UNet) that does not require thousands of annotated training samples and can be utilized in a data augmentation manner to use the available annotated samples more efficiently. SA-UNet introduces a spatial attention module which infers the attention map along the spatial dimension, and multiplies the attention map by the input feature map for adaptive feature refinement. In addition, the proposed network employs structured dropout convolutional blocks instead of the original convolutional blocks of U-Net to prevent the network from overfitting. We evaluate SA-UNet based on two benchmark retinal datasets: the Vascular Extraction (DRIVE) dataset and the Child Heart and Health Study (CHASE_DB1) dataset. The results show that the proposed SA-UNet achieves state-of-the-art performance on both datasets.The implementation and the trained networks are available on Github1.

#### PCANet: Pyramid Context-Aware Network for Retinal Vessel Segmentation

Yi Zhang, Yixuan Chen, Kai Zhang
Track 5: Image and Signal Processing
Thu 14 Jan 2021 at 12:00 in session PS T5.5

Auto-TLDR; PCANet: Adaptive Context-Aware Network for Automated Retinal Vessel Segmentation

Automated retinal vessel segmentation plays an important role in the diagnosis of some diseases such as diabetes, arteriosclerosis and hypertension. Recent works attempt to improve segmentation performance by exploring either global or local contexts. However, the context demands are varying from regions in each image and different levels of network. To address these problems, we propose Pyramid Context-aware Network (PCANet), which can adaptively capture multi-scale context representations. Specifically, PCANet is composed of multiple Adaptive Context-aware (ACA) blocks arranged in parallel, each of which can adaptively obtain the context-aware features by estimating affinity coefficients at a specific scale under the guidance of global contextual dependencies. Meanwhile, we import ACA blocks with specific scales in different levels of the network to obtain a coarse-to-fine result. Furthermore, an integrated test-time augmentation method is developed to further boost the performance of PCANet. Finally, extensive experiments demonstrate the effectiveness of the proposed PCANet, and state-of-the-art performances are achieved with AUCs of 0.9866, 0.9886 and F1 Scores of 0.8274, 0.8371 on two public datasets, DRIVE and STARE, respectively.

#### A Multi-Task Contextual Atrous Residual Network for Brain Tumor Detection & Segmentation

Ngan Le, Kashu Yamazaki, Quach Kha Gia, Thanh-Dat Truong, Marios Savvides
Track 5: Image and Signal Processing
Tue 12 Jan 2021 at 17:00 in session PS T5.1

Auto-TLDR; Contextual Brain Tumor Segmentation Using 3D atrous Residual Networks and Cascaded Structures

In recent years, deep neural networks have achieved state-of-the-art performance in a variety of recognition and segmentation tasks in medical imaging including brain tumor segmentation. We investigate that segmenting brain tumor is facing to the imbalanced data problem where the number of pixels belonging to background class (non tumor pixel) is much larger than the number of pixels belonging to foreground class (tumor pixel). To address this problem, we propose a multi-task network which is formed as a cascaded structure and designed to share the feature maps. Our model consists of two targets, i.e., (i) effectively differentiating brain tumor regions and (ii) estimating brain tumor masks. The first task is performed by our proposed contextual brain tumor detection network, which plays the role of an attention gate and focuses on the region around brain tumor only while ignore the background (non tumor area). Instead of processing every pixel, our contextual brain tumor detection network only processes contextual regions around ground-truth instances and this strategy helps to produce meaningful regions proposals. The second task is built upon a 3D atrous residual network and under an encode-decode network in order to effectively segment both large and small objects (brain tumor). Our 3D atrous residual network is designed with a skip connection to enables the gradient from the deep layers to be directly propagated to shallow layers, thus, features of different depths are preserved and used for refining each other. In order to incorporate larger contextual information in volume MRI data, our network is designed by 3D atrous convolution with various kernel sizes, which enlarges the receptive field of filters. Our proposed network has been evaluated on various datasets including BRATS2015, BRATS2017 and BRATS2018 datasets with both validation set and testing set. Our performance has been benchmarked by both region-based metrics and surface-based metrics. We also have conducted comparisons against state-of-the-art approaches.

#### Automatic Semantic Segmentation of Structural Elements related to the Spinal Cord in the Lumbar Region by Using Convolutional Neural Networks

Jhon Jairo Sáenz Gamboa, Maria De La Iglesia-Vaya, Jon Ander Gómez
Track 5: Image and Signal Processing
Thu 14 Jan 2021 at 16:00 in session PS T5.6

Auto-TLDR; Semantic Segmentation of Lumbar Spine Using Convolutional Neural Networks

This work addresses the problem of automatically segmenting the MR images corresponding to the lumbar spine. The purpose is to detect and delimit the different structural elements like vertebrae, intervertebral discs, nerves, blood vessels, etc. This task is known as semantic segmentation. The approach proposed in this work is based on convolutional neural networks whose output is a mask where each pixel from the input image is classified into one of the possible classes. Classes were defined by radiologists and correspond to structural elements and tissues. The proposed network architectures are variants of the U-Net. Several complementary blocks were used to define the variants: spatial attention models, deep supervision and multi-kernels at input, this last block type is based on the idea of inception. Those architectures which got the best results are described in this paper, and their results are discussed. Two of the proposed architectures outperform the standard U-Net used as baseline.

#### Dual Encoder Fusion U-Net (DEFU-Net) for Cross-manufacturer Chest X-Ray Segmentation

Zhang Lipei, Aozhi Liu, Jing Xiao
Track 5: Image and Signal Processing
Tue 12 Jan 2021 at 15:00 in session OS T5.2

Auto-TLDR; Inception Convolutional Neural Network with Dilation for Chest X-Ray Segmentation

A number of methods based on the deep learning have been applied to medical image segmentation and have achieved state-of-the-art performance. The most famous technique is U-Net which has been used to many medical datasets including the Chest X-ray. Due to the importance of chest x- ray data in studying COVID-19, there is a demand for state-of- art models capable of precisely segmenting chest x-rays. In this paper, we propose a dual encoder fusion U-Net framework for Chest X-rays based on Inception Convolutional Neural Network with dilation, Densely Connected Recurrent Convolutional Neural Network, which is named DEFU-Net. The densely connected recurrent path extends the network deeper for facilitating context feature extraction. In order to increase the width of network and enrich representation of features, the inception blocks with dilation have been used. The inception blocks can capture globally and locally spatial information with various receptive fields to avoid information loss caused by max-pooling. Meanwhile, the features fusion of two path by summation preserve the context and the spatial information for decoding part. We applied this model in Chest X-ray dataset from two different manufacturers (Montgomery and Shenzhen hospital). The DEFU-Net achieves the better performance than basic U-Net, residual U-Net, BCDU- Net, R2U-Net and attention R2U-Net. This model approaches state-of-the-art in this mixed dataset. The open source code for this proposed framework is public available.

#### CT-UNet: An Improved Neural Network Based on U-Net for Building Segmentation in Remote Sensing Images

Huanran Ye, Sheng Liu, Kun Jin, Haohao Cheng
Track 5: Image and Signal Processing
Wed 13 Jan 2021 at 12:00 in session PS T5.3

Auto-TLDR; Context-Transfer-UNet: A UNet-based Network for Building Segmentation in Remote Sensing Images

With the proliferation of remote sensing images, how to segment buildings more accurately in remote sensing images is a critical challenge. First, the high resolution leads to blurred boundaries in the extracted building maps. Second, the similarity between buildings and background results in intra-class inconsistency. To address these two problems, we propose an UNet-based network named Context-Transfer-UNet (CT-UNet). Specifically, we design Dense Boundary Block (DBB). Dense Block utilizes reuse mechanism to refine features and increase recognition capabilities. Boundary Block introduces the low-level spatial information to solve the fuzzy boundary problem. Then, to handle intra-class inconsistency, we construct Spatial Channel Attention Block (SCAB). It combines context space information and selects more distinguishable features from space and channel. Finally, we propose a novel loss function to enhance the purpose of loss by adding evaluation indicator. Based on our proposed CT-UNet, we achieve 85.33% mean IoU on the Inria dataset and 91.00% mean IoU on the WHU dataset, which outperforms our baseline (U-Net ResNet-34) by 3.76% and Web-Net by 2.24%.

#### PSDNet: A Balanced Architecture of Accuracy and Parameters for Semantic Segmentation

Yue Liu, Zhichao Lian
Track 3: Computer Vision Robotics and Intelligent Systems
Thu 14 Jan 2021 at 14:00 in session PS T3.8

Auto-TLDR; Pyramid Pooling Module with SE1Cblock and D2SUpsample Network (PSDNet)

Abstract—In this paper, we present our Pyramid Pooling Module （PPM） with SE1Cblock and D2SUpsample Network (PSDNet), a novel architecture for accurate semantic segmentation. Started from the known work called Pyramid Scene Parsing Network (PSPNet), PSDNet takes advantage of pyramid pooling structure with channel attention module and feature transform module in Pyramid Pooling Module (PPM). The enhanced PPM with these two components can strengthen context information flowing in the network instead of damaging it. The channel attention module we mentioned is an improved “Squeeze and Excitation with 1D Convolution” (SE1C) block which can explicitly model interrelationship between channels with fewer number of parameters. We propose a feature transform module named “Depth to Space Upsampling” (D2SUpsample) in the PPM which keeps integrity of features by transforming features while interpolating features, at the same time reducing parameters. In addition, we introduce a joint strategy in SE1Cblock which combines two variants of global pooling without increasing parameters. Compared with PSPNet, our work achieves higher accuracy on public datasets with 73.97% mIoU and 82.89% mAcc accuracy on Cityscapes Dataset based on ResNet50 backbone.

#### BiLuNet: A Multi-Path Network for Semantic Segmentation on X-Ray Images

Van Luan Tran, Huei-Yung Lin, Rachel Liu, Chun-Han Tseng, Chun-Han Tseng
Track 5: Image and Signal Processing
Fri 15 Jan 2021 at 15:00 in session PS T5.7

Auto-TLDR; BiLuNet: Multi-path Convolutional Neural Network for Semantic Segmentation of Lumbar vertebrae, sacrum,

Semantic segmentation and shape detection of lumbar vertebrae, sacrum, and femoral heads from clinical X-ray images are important and challenging tasks. In this paper, we propose a new multi-path convolutional neural network, BiLuNet, for semantic segmentation on X-ray images. The network is capable of medical image segmentation with very limited training data. With the shape fitting of the bones, we can identify the location of the target regions very accurately for lumbar vertebra inspection. We collected our dataset and annotated by doctors for model training and performance evaluation. Compared to the state-of-the-art methods, the proposed technique provides better mIoUs and higher success rates with the same training data. The experimental results have demonstrated the feasibility of our network to perform semantic segmentation for lumbar vertebrae, sacrum, and femoral heads.

#### Segmenting Kidney on Multiple Phase CT Images Using ULBNet

Yanling Chi, Yuyu Xu, Gang Feng, Jiawei Mao, Sihua Wu, Guibin Xu, Weimin Huang
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Wed 13 Jan 2021 at 12:00 in session PS T1.4

Auto-TLDR; A ULBNet network for kidney segmentation on multiple phase CT images

Abstract—Segmentation of kidney on CT images is critical to computer-assisted surgical planning for kidney interventional therapy. Segmenting kidney manually is impractical in clinical, automatic segmentation is desirable. U-Net has been successful in medical image segmentation and is a promising candidate for the task. However, semantic gap still exists, especially when multiple phase images or multiple center images are involved. In this paper, we proposed an ULBNet to reduce the semantic gap and to improve segmentation performance. The proposed architecture includes new skip connections of local binary convolution (LBC). We also proposed a novel strategy of fast retraining a model for a new task without manually labelling required. We evaluated the network for kidney segmentation on multiple phase CT images. ULBNet resulted in an overall accuracy of 98.0% with comparison to Resunet 97.5%. Specifically, on the plain phase CT images, 98.1% resulted from ULBNet and 97.6% from Resunet; on the corticomedullay phase images, 97.8% from ULBNet and 97.2% from Resunet; on the nephrographic phase images, 97.6% from ULBNet and 97.4% from Resunet; on the excretory phase images, 98.1% from ULBNet and 97.4% from Resunet. The proposed network architecture performs better than Resunet on generalizing to multiple phase images.

#### BG-Net: Boundary-Guided Network for Lung Segmentation on Clinical CT Images

Rui Xu, Yi Wang, Tiantian Liu, Xinchen Ye, Lin Lin, Yen-Wei Chen, Shoji Kido, Noriyuki Tomiyama
Track 5: Image and Signal Processing
Wed 13 Jan 2021 at 12:00 in session PS T5.3

Auto-TLDR; Boundary-Guided Network for Lung Segmentation on CT Images

Lung segmentation on CT images is a crucial step for a computer-aided diagnosis system of lung diseases. The existing deep learning based lung segmentation methods are less efficient to segment lungs on clinical CT images, especially that the segmentation on lung boundaries is not accurate enough due to complex pulmonary opacities in practical clinics. In this paper, we propose a boundary-guided network (BG-Net) to address this problem. It contains two auxiliary branches that separately segment lungs and extract the lung boundaries, and an aggregation branch that efficiently exploits lung boundary cues to guide the network for more accurate lung segmentation on clinical CT images. We evaluate the proposed method on a private dataset collected from the Osaka university hospital and four public datasets including StructSeg, HUG, VESSEL12, and a Novel Coronavirus 2019 (COVID-19) dataset. Experimental results show that the proposed method can segment lungs more accurately and outperform several other deep learning based methods.

#### A Deep Learning Approach for the Segmentation of Myocardial Diseases

Khawala Brahim, Abdull Qayyum, Alain Lalande, Arnaud Boucher, Anis Sakly, Fabrice Meriaudeau
Track 5: Image and Signal Processing
Thu 14 Jan 2021 at 16:00 in session PS T5.6

Auto-TLDR; Segmentation of Myocardium Infarction Using Late GADEMRI and SegU-Net

Cardiac left ventricular (LV) segmentation is of paramount essential step for both diagnosis and treatment of cardiac pathologies such as ischemia, myocardial infarction, arrhythmia and myocarditis. However, this segmentation is challenging due to high variability across patients and the potential lack of contrast between structures. In this work, we propose and evaluate a (2.5D) SegU-Net model based on the fusion of two deep learning techniques (U-Net and Seg-Net) for automated LGEMRI (Late gadolinium enhanced magnetic resonance imaging) myocardial disease (infarct core and no reflow region) quantification in a new multifield expert annotated dataset. Given that the scar tissue represents a small part of the whole MRI slices, we focused on myocardium area. Segmentation results show that this preprocessing step facilitate the learning procedure. In order to solve the class imbalance problem, we propose to apply the Jaccard loss and the Focal Loss as optimization loss function and to integrate a class weights strategy into the objective function. Late combination has been used to merge the output of the best trained models on a different set of hyperparameters. The final network segmentation performances will be useful for future comparison of new method to the current related work for this task. A total number of 2237 of slices (320 cases) were used for training/validation and 210 slices (35 cases) were used for testing. Experiments over our proposed dataset, using several evaluation metrics such Jaccard distance (IOU), Accuracy and Dice similarity coefficient (DSC), demonstrate efficiency performance in quantifying different zones of myocardium infarction across various patients. As compared to the second intra-observer study, our testing results showed that the SegUNet prediction model leads to these average dice coefficients over all segmented tissue classes, respectively : 'Background': 0.99999, 'Myocardium': 0.99434, 'Infarctus': 0.95587, 'Noreflow': 0.78187.

#### Planar 3D Transfer Learning for End to End Unimodal MRI Unbalanced Data Segmentation

Martin Kolarik, Radim Burget, Carlos M. Travieso-Gonzalez, Jan Kocica
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Thu 14 Jan 2021 at 16:00 in session OS T1.6

Auto-TLDR; Planar 3D Res-U-Net Network for Unbalanced 3D Image Segmentation using Fluid Attenuation Inversion Recover

We present a novel approach of 2D to 3D transfer learning based on mapping pre-trained 2D convolutional neural network weights into planar 3D kernels. The method is validated by proposed planar 3D res-u-net network with encoder transferred from the 2D VGG-16 which is applied for a single-stage unbalanced 3D image data segmentation. In particular, we evaluate the method on the MICCAI 2016 MS lesion segmentation challenge dataset utilizing solely Fluid Attenuation Inversion Recover (FLAIR) sequence without brain extraction for training and inference to simulate real medical praxis. The planar 3D res-u-net network performed the best both in sensitivity and Dice score amongst end to end methods processing raw MRI scans and achieved comparable Dice score to a state-of-the-art unimodal not end to end approach. Complete source code was released under the open-source license and this paper is in compliance with the Machine learning Reproducibility Checklist. By implementing practical transfer learning for 3D data representation we were able to successfully segment heavily unbalanced data without selective sampling and achieved more reliable results using less training data in single modality. From medical perspective, the unimodal approach gives an advantage in real praxis as it does not require co-registration nor additional scanning time during examination. Although modern medical imaging methods capture high resolution 3D anatomy scans suitable for computer aided detection system processing, deployment of automatic systems for interpretation of radiology imaging is still rather theoretical in many medical areas. Our work aims to bridge the gap offering solution for partial research questions.

#### DE-Net: Dilated Encoder Network for Automated Tongue Segmentation

Hui Tang, Bin Wang, Jun Zhou, Yongsheng Gao
Track 5: Image and Signal Processing
Wed 13 Jan 2021 at 12:00 in session PS T5.3

Auto-TLDR; Automated Tongue Image Segmentation using De-Net

Automated tongue recognition is a growing research field due to global demand for personal health care. Using mobile devices to take tongue pictures is convenient and of low cost for tongue recognition. It is particularly suitable for self-health evaluation of the public. However, images taken by mobile devices are easily affected by various imaging environment, which makes fine segmentation a more challenging task compared with those taken by specialized acquisition devices. Deep learning approaches are promising for tongue image segmentation because they have powerful feature learning and representation capability. However, the successive pooling operations in these methods lead to loss of information on image details, making them fail when segmenting low-quality images captured by mobile devices. To address this issue, we propose a dilated encoder network (DE-Net) to capture more high-level features and get high-resolution output for automated tongue image segmentation. In addition, we construct two tongue image datasets which contain images taken by specialized devices and mobile devices, respectively, to verify the effectiveness of the proposed method. Experimental results on both datasets demonstrate that the proposed method outperforms the state-of-the-art methods in tongue image segmentation.

#### Learn to Segment Retinal Lesions and Beyond

Qijie Wei, Xirong Li, Weihong Yu, Xiao Zhang, Yongpeng Zhang, Bojie Hu, Bin Mo, Di Gong, Ning Chen, Dayong Ding, Youxin Chen
Track 5: Image and Signal Processing
Tue 12 Jan 2021 at 17:00 in session PS T5.2

Towards automated retinal screening, this paper makes an endeavor to simultaneously achieve pixel-level retinal lesion segmentation and image-level disease classification. Such a multi-task approach is crucial for accurate and clinically interpretable disease diagnosis. Prior art is insufficient due to three challenges, i.e., lesions lacking objective boundaries, clinical importance of lesions irrelevant to their size, and the lack of one-to-one correspondence between lesion and disease classes. This paper attacks the three challenges in the context of diabetic retinopathy (DR) grading. We propose Lesion-Net, a new variant of fully convolutional networks, with its expansive path re- designed to tackle the first challenge. A dual Dice loss that leverages both semantic segmentation and image classification losses is introduced to resolve the second challenge. Lastly, we build a multi-task network that employs Lesion-Net as a side- attention branch for both DR grading and result interpretation. A set of 12K fundus images is manually segmented by 45 ophthalmologists for 8 DR-related lesions, resulting in 290K manual segments in total. Extensive experiments on this large- scale dataset show that our proposed approach surpasses the prior art for multiple tasks including lesion segmentation, lesion classification and DR grading.

#### DA-RefineNet: Dual-Inputs Attention RefineNet for Whole Slide Image Segmentation

Ziqiang Li, Rentuo Tao, Qianrun Wu, Bin Li
Track 5: Image and Signal Processing
Thu 14 Jan 2021 at 12:00 in session PS T5.5

Auto-TLDR; DA-RefineNet: A dual-inputs attention network for whole slide image segmentation

Automatic medical image segmentation techniques have wide applications for disease diagnosing, however, its much more challenging than natural optical image segmentation tasks due to the high-resolution of medical images and the corresponding huge computation cost. Sliding window was a commonly used technique for whole slide image (WSI) segmentation, however, for these methods that based on sliding window, the main drawback was lacking of global contextual information for supervision. In this paper, we proposed a dual-inputs attention network (denoted as DA-RefineNet) for WSI segmentation, where both local fine-grained information and global coarse information can be efficiently utilized. Sufficient comparative experiments were conducted to evaluate the effectiveness of the proposed method, the results proved that the proposed method can achieve better performance on WSI segmentation tasks compared to methods rely on single-input.

#### Arbitrary Style Transfer with Parallel Self-Attention

Tiange Zhang, Ying Gao, Feng Gao, Lin Qi, Junyu Dong
Track 5: Image and Signal Processing
Wed 13 Jan 2021 at 12:00 in session PS T5.3

Auto-TLDR; Self-Attention-Based Arbitrary Style Transfer Using Adaptive Instance Normalization

Neural style transfer aims to create artistic images by synthesizing patterns from a given style image. Recently, the Adaptive Instance Normalization (AdaIN) layer is proposed to achieve real-time arbitrary style transfer. However, we observed that if crucial features based on AdaIN can be further emphasized during transfer, both content and style information will be better reflected in stylized images. Furthermore, it is always essential to preserve more details and reduce unexpected artifacts in order to generate appealing results. In this paper, we introduce an improved arbitrary style transfer method based on the self-attention mechanism. A self-attention module is designed to learn what and where to emphasize in the input image. In addition, an extra Laplacian loss is applied to preserve structure details of the content while eliminating artifacts. Experimental results demonstrate that the proposed method outperforms AdaIN and can generate more appealing results.

#### CSpA-DN: Channel and Spatial Attention Dense Network for Fusing PET and MRI Images

Bicao Li, Zhoufeng Liu, Shan Gao, Jenq-Neng Hwang, Jun Sun, Zongmin Wang
Track 5: Image and Signal Processing
Tue 12 Jan 2021 at 17:00 in session PS T5.2

Auto-TLDR; CSpA-DN: Unsupervised Fusion of PET and MR Images with Channel and Spatial Attention

In this paper, we propose a novel unsupervised fusion framework based on a dense network with channel and spatial attention (CSpA-DN) for PET and MR images. In our approach, an encoder composed of the densely connected neural network is constructed to extract features from source images, and a decoder network is leveraged to yield the fused image from these features. Simultaneously, a self-attention mechanism is introduced in the encoder and decoder to further integrate local features along with their global dependencies adaptively. The extracted feature of each spatial position is synthesized by a weighted summation of those features at the same row and column with this position via a spatial attention module. Meanwhile, the interdependent relationship of all feature maps is integrated by a channel attention module. The summation of the outputs of these two attention modules is fed into the decoder and the fused image is generated. Experimental results illustrate the superiorities of our proposed CSpA-DN model compared with state-of-the-art methods in PET and MR images fusion according to both visual perception and objective assessment.

#### NephCNN: A Deep-Learning Framework for Vessel Segmentation in Nephrectomy Laparoscopic Videos

Alessandro Casella, Sara Moccia, Chiara Carlini, Emanuele Frontoni, Elena De Momi, Leonardo Mattos
Track 5: Image and Signal Processing
Thu 14 Jan 2021 at 16:00 in session PS T5.6

Auto-TLDR; Adversarial Fully Convolutional Neural Networks for kidney vessel segmentation from nephrectomy laparoscopic videos

Objective: In the last years, Robot-assisted partial nephrectomy (RAPN) is establishing as elected treatment for renal cell carcinoma (RCC). Reduced field of view, field occlusions by surgical tools, and reduced maneuverability may potentially cause accidents, such as unwanted vessel resection with consequent bleeding. Surgical Data Science (SDS) can provide effective context-aware tools for supporting surgeons. However, currently no tools have been exploited for automatic vessels segmentation from nephrectomy laparoscopic videos. Herein, we propose a new approach based on adversarial Fully Convolutional Neural Networks (FCNNs) to kidney vessel segmentation from nephrectomy laparoscopic vision. Methods: The proposed approach enhances existing segmentation framework by (i) encoding 3D kernels for spatio-temporal features extraction to enforce pixel connectivity in time, and (ii) perform training in adversarial fashion, which constrains vessels shape. Results: We performed a preliminary study using 8 different RAPN videos (1871 frames), the first in the field, achieving a median Dice Similarity Coefficient of 71.76%. Conclusions: Results showed that the proposed approach could be a valuable solution with a view to assist surgeon during RAPN.

#### Accurate Cell Segmentation in Digital Pathology Images Via Attention Enforced Networks

Zeyi Yao, Kaiqi Li, Guanhong Zhang, Yiwen Luo, Xiaoguang Zhou, Muyi Sun
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Tue 12 Jan 2021 at 15:00 in session PS T1.1

Auto-TLDR; AENet: Attention Enforced Network for Automatic Cell Segmentation

Automatic cell segmentation is an essential step in the pipeline of computer-aided diagnosis (CAD), such as the detection and grading of breast cancer. Accurate segmentation of cells can not only assist the pathologists to make a more precise diagnosis, but also save much time and labor. However, this task suffers from stain variation, cell inhomogeneous intensities, background clutters and cells from different tissues. To address these issues, we propose an Attention Enforced Network (AENet), which is built on spatial attention module and channel attention module, to integrate local features with global dependencies and weight effective channels adaptively. Besides, we introduce a feature fusion branch to bridge high-level and low-level features. Finally, the marker controlled watershed algorithm is applied to post-process the predicted segmentation maps for reducing the fragmented regions. In the test stage, we present an individual color normalization method to deal with the stain variation problem. We evaluate this model on the MoNuSeg dataset. The quantitative comparisons against several prior methods demonstrate the priority of our approach.

#### Transfer Learning through Weighted Loss Function and Group Normalization for Vessel Segmentation from Retinal Images

Abdullah Sarhan, Jon Rokne, Reda Alhajj, Andrew Crichton
Track 5: Image and Signal Processing
Tue 12 Jan 2021 at 17:00 in session PS T5.1

Auto-TLDR; Deep Learning for Segmentation of Blood Vessels in Retinal Images

The vascular structure of blood vessels is important in diagnosing retinal conditions such as glaucoma and diabetic retinopathy. Accurate segmentation of these vessels can help in detecting retinal objects such as the optic disc and optic cup and hence determine if there are damages to these areas. Moreover, the structure of the vessels can help in diagnosing glaucoma. The rapid development of digital imaging and computer-vision techniques has increased the potential for developing approaches for segmenting retinal vessels. In this paper, we propose an approach for segmenting retinal vessels that uses deep learning along with transfer learning. We adapted the U-Net structure to use a customized InceptionV3 as the encoder and used multiple skip connections to form the decoder. Moreover, we used a weighted loss function to handle the issue of class imbalance in retinal images. Furthermore, we contributed a new dataset to this field. We tested our approach on six publicly available datasets and a newly created dataset. We achieved an average accuracy of 95.60\% and a Dice coefficient of 80.98\%. The results obtained from comprehensive experiments demonstrate the robustness of our approach to the segmentation of blood vessels in retinal images obtained from different sources. Our approach results in greater segmentation accuracy than other approaches.

#### End-To-End Multi-Task Learning for Lung Nodule Segmentation and Diagnosis

Wei Chen, Qiuli Wang, Dan Yang, Xiaohong Zhang, Chen Liu, Yucong Li
Track 5: Image and Signal Processing
Fri 15 Jan 2021 at 13:00 in session OS T5.5

Auto-TLDR; A novel multi-task framework for lung nodule diagnosis based on deep learning and medical features

#### Transitional Asymmetric Non-Local Neural Networks for Real-World Dirt Road Segmentation

Yooseung Wang, Jihun Park
Track 3: Computer Vision Robotics and Intelligent Systems
Thu 14 Jan 2021 at 12:00 in session PS T3.7

Auto-TLDR; Transitional Asymmetric Non-Local Neural Networks for Semantic Segmentation on Dirt Roads

Understanding images by predicting pixel-level semantic classes is a fundamental task in computer vision and is one of the most important techniques for autonomous driving. Recent approaches based on deep convolutional neural networks have dramatically improved the speed and accuracy of semantic segmentation on paved road datasets, however, dirt roads have yet to be systematically studied. Dirt roads do not contain clear boundaries between drivable and non-drivable regions; and thus, this difficulty must be overcome for the realization of fully autonomous vehicles. The key idea of our approach is to apply lightweight non-local blocks to reinforce stage-wise long-range dependencies in encoder-decoder style backbone networks. Experiments on 4,687 images of a dirt road dataset show that our transitional asymmetric non-local neural networks present a higher accuracy with lower computational costs compared to state-of-the-art models.

#### MTGAN: Mask and Texture-Driven Generative Adversarial Network for Lung Nodule Segmentation

Wei Chen, Qiuli Wang, Kun Wang, Dan Yang, Xiaohong Zhang, Chen Liu, Yucong Li
Track 5: Image and Signal Processing
Wed 13 Jan 2021 at 12:00 in session PS T5.3

Accurate segmentation for lung nodules in lung computed tomography (CT) scans plays a key role in the early diagnosis of lung cancer. Many existing methods, especially UNet, have made significant progress in lung nodule segmentation. However, due to the complex shapes of lung nodules and the similarity of visual characteristics between nodules and lung tissues, an accurate segmentation with few false positives of lung nodules is still a challenging problem. Considering the fact that both boundary and texture information of lung nodules are important for obtaining an accurate segmentation result, we propose a novel Mask and Texture-driven Generative Adversarial Network (MTGAN) with a joint multi-scale L1 loss for lung nodule segmentation, which takes full advantages of U-Net and adversarial training. The proposed MTGAN leverages adversarial learning strategy guided by the boundary and texture information of lung nodules to generate more accurate segmentation results with lesser false positives. We validate our model with the LIDC–IDRI dataset, and experimental results show that our method achieves excellent segmentation results for a variety of lung nodules, especially for juxtapleural nodules and low-dense nodules. Without any bells and whistles, the proposed MTGAN achieves significant segmentation performance with the Dice similarity coefficient (DSC) of 85.24% on the LIDC–IDRI dataset.

#### RSAN: Residual Subtraction and Attention Network for Single Image Super-Resolution

Shuo Wei, Xin Sun, Haoran Zhao, Junyu Dong
Track 5: Image and Signal Processing
Tue 12 Jan 2021 at 14:00 in session OS T5.1

Auto-TLDR; RSAN: Residual subtraction and attention network for super-resolution

The single-image super-resolution (SISR) aims to recover a potential high-resolution image from its low-resolution version. Recently, deep learning-based methods have played a significant role in super-resolution field due to its effectiveness and efficiency. However, most of the SISR methods neglect the importance among the feature map channels. Moreover, they can not eliminate the redundant noises, making the output image be blurred. In this paper, we propose the residual subtraction and attention network (RSAN) for powerful feature expression and channels importance learning. More specifically, RSAN firstly implements one redundance removal module to learn noise information in the feature map and subtract noise through residual learning. Then it introduces the channel attention module to amplify high-frequency information and suppress the weight of effectless channels. Experimental results on extensive public benchmarks demonstrate our RSAN achieves significant improvement over the previous SISR methods in terms of both quantitative metrics and visual quality.

#### Global-Local Attention Network for Semantic Segmentation in Aerial Images

Minglong Li, Lianlei Shan, Weiqiang Wang
Track 3: Computer Vision Robotics and Intelligent Systems
Fri 15 Jan 2021 at 15:00 in session PS T3.10

Auto-TLDR; GLANet: Global-Local Attention Network for Semantic Segmentation

Errors in semantic segmentation task could be classified into two types: large area misclassification and local inaccurate boundaries. Previously attention based methods capture rich global contextual information, this is beneficial to diminish the first type of error, but local imprecision still exists. In this paper we propose Global-Local Attention Network (GLANet) with a simultaneous consideration of global context and local details. Specifically, our GLANet is composed of two branches namely global attention branch and local attention branch, and three different modules are embedded in the two branches for the purpose of modeling semantic interdependencies in spatial, channel and boundary dimensions respectively. We sum the outputs of the two branches to further improve feature representation, leading to more precise segmentation results. The proposed method achieves very competitive segmentation accuracy on two public aerial image datasets, bringing significant improvements over baseline.

#### Attention Pyramid Module for Scene Recognition

Zhinan Qiao, Xiaohui Yuan, Chengyuan Zhuang, Abolfazl Meyarian
Track 3: Computer Vision Robotics and Intelligent Systems
Tue 12 Jan 2021 at 17:00 in session PS T3.3

Auto-TLDR; Attention Pyramid Module for Multi-Scale Scene Recognition

The unrestricted open vocabulary and diverse substances of scenery images bring significant challenges to scene recognition. However, most deep learning architectures and attention methods are developed on general-purpose datasets and omit the characteristics of scene data. In this paper, we exploit the attention pyramid module (APM) to tackle the predicament of scene recognition. Our method streamlines the multi-scale scene recognition pipeline, learns comprehensive scene features at various scales and locations, addresses the interdependency among scales, and further assists feature re-calibration as well as aggregation process. APM is extremely light-weighted and can be easily plugged into existing network architectures in a parameter-efficient manner. By simply integrating APM into ResNet-50, we obtain a 3.54\% boost in terms of top-1 accuracy on the benchmark scene dataset. Comprehensive experiments show that APM achieves better performance comparing with state-of-the-art attention methods using significant less computation budget. Code and pre-trained models will be made publicly available.

#### A Transformer-Based Network for Anisotropic 3D Medical Image Segmentation

Guo Danfeng, Demetri Terzopoulos
Track 5: Image and Signal Processing
Tue 12 Jan 2021 at 17:00 in session PS T5.1

Auto-TLDR; A transformer-based model to tackle the anisotropy problem in 3D medical image analysis

A critical challenge of applying neural networks to 3D medical image analysis is to deal with the anisotropy problem. The inter-slice contextual information contained in medical images is important, especially when the structural information of lesions is needed. However, such information often varies with cases because of variable slice spacing. Image anisotropy downgrades model performance especially when slice spacing varies significantly among training and testing datasets. ExsiWe proposed a transformer-based model to tackle the anisotropy problem. It is adaptable to different levels of anisotropy and is computationally efficient. Experiments are conducted on 3D lung cancer segmentation task. Our model achieves an average Dice score of approximately 0.87, which generally outperforms baseline models.

#### Deeply-Fused Attentive Network for Stereo Matching

Zuliu Yang, Xindong Ai, Weida Yang, Yong Zhao, Qifei Dai, Fuchi Li
Track 3: Computer Vision Robotics and Intelligent Systems
Wed 13 Jan 2021 at 12:00 in session PS T3.4

Auto-TLDR; DF-Net: Deep Learning-based Network for Stereo Matching

In this paper, we propose a novel learning-based network for stereo matching called DF-Net, which makes three main contributions that are experimentally shown to have practical merit. Firstly, we further increase the accuracy by using the deeply fused spatial pyramid pooling (DF-SPP) module, which can acquire the continuous multi-scale context information in both parallel and cascade manners. Secondly, we introduce channel attention block to dynamically boost the informative features. Finally, we propose a stacked encoder-decoder structure with 3D attention gate for cost regularization. More precisely, the module fuses the coding features to their next encoder-decoder structure under the supervision of attention gate with long-range skip connection, and thus exploit deep and hierarchical context information for disparity prediction. The performance on SceneFlow and KITTI datasets shows that our model is able to generate better results against several state-of-the-art algorithms.

#### Progressive Scene Segmentation Based on Self-Attention Mechanism

Yunyi Pan, Yuan Gan, Kun Liu, Yan Zhang
Track 3: Computer Vision Robotics and Intelligent Systems
Fri 15 Jan 2021 at 15:00 in session PS T3.10

Auto-TLDR; Two-Stage Semantic Scene Segmentation with Self-Attention

Semantic scene segmentation is vital for a large variety of applications as it enables understanding of 3D data. Nowadays, various approaches based upon point clouds ignore the mathematical distribution of points and treat the points equally. The methods following this direction neglect the imbalance problem of samples that naturally exists in scenes. To avoid these issues, we propose a two-stage semantic scene segmentation framework based on self-attention mechanism and achieved state-of-the-art performance on 3D scene understanding tasks. We split the whole task into two small ones which efficiently relief the sample imbalance issue. In addition, we have designed a new self-attention block which could be inserted into submanifold convolution networks to model the long-range dependencies that exists among points. The proposed network consists of an encoder and a decoder, with the spatial-wise and channel-wise attention modules inserted. The two-stage network shares a U-Net architecture and is an end-to-end trainable framework which could predict the semantic label for the scene point clouds fed into it. Experiments on standard benchmarks of 3D scenes implies that our network could perform at par or better than the existing state-of-the-art methods.

#### Deep Multi-Stage Model for Automated Landmarking of Craniomaxillofacial CT Scans

Simone Palazzo, Giovanni Bellitto, Luca Prezzavento, Francesco Rundo, Ulas Bagci, Daniela Giordano, Rosalia Leonardi, Concetto Spampinato
Track 5: Image and Signal Processing
Fri 15 Jan 2021 at 16:00 in session PS T5.8

Auto-TLDR; Automated Landmarking of Craniomaxillofacial CT Images Using Deep Multi-Stage Architecture

In this paper we define a deep multi-stage architecture for automated landmarking of craniomaxillofacial (CMF) CT images. Our model is composed of three subnetworks that first localize, on reduced-resolution images, areas where land-marks may be found and then refine the search, at full-resolution scale, through a hierarchical structure aiming at increasing the granularity of the investigated region. The multi-stage pipeline is designed to deal with full resolution data and does not require any additional pre-processing step to reduce search space, as opposed to existing methods that can be only adopted for searching landmarks located in well-defined anatomical structures (e.g.,mandibles). The automated landmarking system is tested on identifying landmarks located in several CMF regions, achieving an average error of 0.8 mm, significantly lower than expert readings. The proposed model also outperforms baselines and is on par with existing models that employ additional upstream segmentation, on state-of-the-art benchmarks.

#### Encoder-Decoder Based Convolutional Neural Networks with Multi-Scale-Aware Modules for Crowd Counting

Pongpisit Thanasutives, Ken-Ichi Fukui, Masayuki Numao, Boonserm Kijsirikul
Track 2: Biometrics, Human Analysis and Behavior Understanding
Wed 13 Jan 2021 at 12:00 in session PS T2.2

Auto-TLDR; M-SFANet and M-SegNet for Crowd Counting Using Multi-Scale Fusion Networks

In this paper, we proposed two modified neural networks based on dual path multi-scale fusion networks (SFANet) and SegNet for accurate and efficient crowd counting. Inspired by SFANet, the first model, which is named M-SFANet, is attached with atrous spatial pyramid pooling (ASPP) and context-aware module (CAN). The encoder of M-SFANet is enhanced with ASPP containing parallel atrous convolutional layers with different sampling rates and hence able to extract multi-scale features of the target object and incorporate larger context. To further deal with scale variation throughout an input image, we leverage the CAN module which adaptively encodes the scales of the contextual information. The combination yields an effective model for counting in both dense and sparse crowd scenes. Based on the SFANet decoder structure, M-SFANet's decoder has dual paths, for density map and attention map generation. The second model is called M-SegNet, which is produced by replacing the bilinear upsampling in SFANet with max unpooling that is used in SegNet. This change provides a faster model while providing competitive counting performance. Designed for high-speed surveillance applications, M-SegNet has no additional multi-scale-aware module in order to not increase the complexity. Both models are encoder-decoder based architectures and are end-to-end trainable. We conduct extensive experiments on five crowd counting datasets and one vehicle counting dataset to show that these modifications yield algorithms that could improve state-of-the-art crowd counting methods.

#### Enhanced Feature Pyramid Network for Semantic Segmentation

Mucong Ye, Ouyang Jinpeng, Ge Chen, Jing Zhang, Xiaogang Yu
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Fri 15 Jan 2021 at 15:00 in session PS T1.14

Auto-TLDR; EFPN: Enhanced Feature Pyramid Network for Semantic Segmentation

Multi-scale feature fusion has been an effective way for improving the performance of semantic segmentation. However, current methods generally fail to consider the semantic gaps between the shallow (low-level) and deep (high-level) features and thus the fusion methods may not be optimal. In this paper, to address the issues of the semantic gap between the feature from different layers, we propose a unified framework based on the U-shape encoder-decoder architecture, named Enhanced Feature Pyramid Network (EFPN). Specifically, the semantic enhancement module (SEM), boundary extraction module (BEM), and context aggregation model (CAM) are incorporated into the decoder network to improve the robustness of the multi-level features aggregation. In addition, a global fusion model (GFM) in encoder branch is proposed to capture more semantic information in the deep layers and effectively transmit the high-level semantic features to each layer. Extensive experiments are conducted and the results show that the proposed framework achieves the state-of-the-art results on three public datasets, namely PASCAL VOC 2012, Cityscapes, and PASCAL Context. Furthermore, we also demonstrate that the proposed method is effective for other visual tasks that require frequent fusing features and upsampling.

#### Attention Stereo Matching Network

Doudou Zhang, Jing Cai, Yanbing Xue, Zan Gao, Hua Zhang
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Wed 13 Jan 2021 at 12:00 in session PS T1.4

Auto-TLDR; ASM-Net: Attention Stereo Matching with Disparity Refinement

Despite great progress, previous stereo matching algorithms still lack the ability to match textureless regions and slender structure areas. To tackle this problem, we propose ASM-Net, an attention stereo matching network. Attention module and disparity refinement module are constructed in the ASMNet. The attention module can improve correlation information between two images by channels and spatial attention.The feature-guided disparity refinement module learns more geometry information in different feature levels to refine the coarse prediction resolution constantly. The proposed approach was evaluated on several benchmark datasets. Experiments show that the proposed method achieves competitive results on KITTI and Scene-Flow datasets while running in real-time at 14ms.

#### A Lumen Segmentation Method in Ureteroscopy Images Based on a Deep Residual U-Net Architecture

Jorge Lazo, Marzullo Aldo, Sara Moccia, Michele Catellani, Benoit Rosa, Elena De Momi, Michel De Mathelin, Francesco Calimeri
Track 1: Artificial Intelligence, Machine Learning for Pattern Analysis
Wed 13 Jan 2021 at 16:30 in session PS T1.8

Auto-TLDR; A Deep Neural Network for Ureteroscopy with Residual Units

Ureteroscopy is becoming the first surgical treatment option for the majority of urinary affections. This procedure is carried out using an endoscope which provides the surgeon with the visual and spatial information necessary to navigate inside the urinary tract. Having in mind the development of surgical assistance systems, that could enhance the performance of surgeon, the task of lumen segmentation is a fundamental part since this is the visual reference which marks the path that the endoscope should follow. This is something that has not been analyzed in ureteroscopy data before. However, this task presents several challenges given the image quality and the conditions itself of ureteroscopy procedures. In this paper, we study the implementation of a Deep Neural Network which exploits the advantage of residual units in an architecture based on U-Net. For the training of these networks, we analyze the use of two different color spaces: gray-scale and RGB data images. We found that training on gray-scale images gives the best results obtaining mean values of Dice Score, Precision, and Recall of 0.73, 0.58, and 0.92 respectively. The results obtained show that the use of residual U-Net could be a suitable model for further development for a computer-aided system for navigation and guidance through the urinary system.

#### Semantic Segmentation of Breast Ultrasound Image with Pyramid Fuzzy Uncertainty Reduction and Direction Connectedness Feature

Kuan Huang, Yingtao Zhang, Heng-Da Cheng, Ping Xing, Boyu Zhang
Track 5: Image and Signal Processing
Tue 12 Jan 2021 at 17:00 in session PS T5.1

Auto-TLDR; Uncertainty-Based Deep Learning for Breast Ultrasound Image Segmentation

Deep learning approaches have achieved impressive results in breast ultrasound (BUS) image segmentation. However, these methods did not solve uncertainty and noise in BUS images well. To address this issue, we present a novel deep learning structure for BUS image semantic segmentation by analyzing the uncertainty using a pyramid fuzzy block and generating a novel feature based on connectedness. Firstly, feature maps in the proposed network are down-sampled to different resolutions. Fuzzy transformation and uncertainty representation are applied to each resolution to obtain the uncertainty degree on different scales. Meanwhile, the BUS images contain layer structures. From top to bottom, there are skin layer, fat layer, mammary layer, muscle layer, and background area. A spatial recurrent neural network (RNN) is utilized to calculate the connectedness between each pixel and the pixels on the four boundaries in horizontal and vertical lines. The spatial-wise context feature can introduce the characteristic of layer structure to deep neural network. Finally, the original convolutional features are combined with connectedness feature according to the uncertainty degrees. The proposed methods are applied to two datasets: a BUS image benchmark with two categories (background and tumor) and a five-category BUS image dataset with fat layer, mammary layer, muscle layer, background, and tumor. The proposed method achieves the best results on both datasets compared with eight state-of-the-art deep learning-based approaches.

#### Ordinal Depth Classification Using Region-Based Self-Attention

Minh Hieu Phan, Son Lam Phung, Abdesselam Bouzerdoum
Track 3: Computer Vision Robotics and Intelligent Systems
Wed 13 Jan 2021 at 12:00 in session PS T3.4

Auto-TLDR; Region-based Self-Attention for Multi-scale Depth Estimation from a Single 2D Image

Depth estimation from a single 2D image has been widely applied in 3D understanding, 3D modelling and robotics. It is challenging as reliable cues (e.g. stereo correspondences and motions) are not available. Most of the modern approaches exploited multi-scale feature extraction to provide more powerful representations for deep networks. However, these studies have not focused on how to effectively fuse the learned multi-scale features. This paper proposes a novel region-based self-attention (rSA) module. The rSA recalibrates the multi-scale responses by explicitly modelling the interdependency between channels in separate image regions. We discretize continuous depths to solve an ordinal depth classification in which the relative order between categories is significant. We contribute a dataset of 4410 RGB-D images, captured in outdoor environments at the University of Wollongong's campus. In our experimental results, the proposed module improves the lightweight models on small-sized datasets by 22% - 40%

#### ACRM: Attention Cascade R-CNN with Mix-NMS for Metallic Surface Defect Detection

Junting Fang, Xiaoyang Tan, Yuhui Wang
Track 3: Computer Vision Robotics and Intelligent Systems
Thu 14 Jan 2021 at 14:00 in session PS T3.8

Auto-TLDR; Attention Cascade R-CNN with Mix Non-Maximum Suppression for Robust Metal Defect Detection

Metallic surface defect detection is of great significance in quality control for production. However, this task is very challenging due to the noise disturbance, large appearance variation, and the ambiguous definition of the defect individual. Traditional image processing methods are unable to detect the damaged region effectively and efficiently. In this paper, we propose a new defect detection method, Attention Cascade R-CNN with Mix-NMS (ACRM), to classify and locate defects robustly. Three submodules are developed to achieve this goal: 1) a lightweight attention block is introduced, which can improve the ability in capture global and local feature both in the spatial and channel dimension; 2) we firstly apply the cascade R-CNN to our task, which exploits multiple detectors to sequentially refine the detection result robustly; 3) we introduce a new method named Mix Non-Maximum Suppression (Mix-NMS), which can significantly improve its ability in filtering the redundant detection result in our task. Extensive experiments on a real industrial dataset show that ACRM achieves state-of-the-art results compared to the existing methods, demonstrating the effectiveness and robustness of our detection method.

#### Multi-Scale and Attention Based ResNet for Heartbeat Classification

Haojie Zhang, Gongping Yang, Yuwen Huang, Feng Yuan, Yilong Yin
Track 5: Image and Signal Processing
Wed 13 Jan 2021 at 12:00 in session PS T5.3

Auto-TLDR; A Multi-Scale and Attention based ResNet for ECG heartbeat classification in intra-patient and inter-patient paradigms

This paper presents a novel deep learning framework for the electrocardiogram (ECG) heartbeat classification. Although there have been some studies with excellent overall accuracy, these studies have not been very accurate in the diagnosis of arrhythmia classes especially such as supraventricular ectopic beat (SVEB) and ventricular ectopic beat (VEB). In our work, we propose a Multi-Scale and Attention based ResNet for heartbeat classification in intra-patient and inter-patient paradigms respectively. Firstly, we extract shallow features from a convolutional layer. Secondly, the shallow features are sent into three branches with different convolution kernels in order to combine receptive fields of different sizes. Finally, fully connected layers are used to classify the heartbeat. Besides, we design a new attention mechanism based on the characteristics of heartbeat data. At last, extensive experiments on benchmark dataset demonstrate the effectiveness of our proposed model.

#### Triplet-Path Dilated Network for Detection and Segmentation of General Pathological Images

Jiaqi Luo, Zhicheng Zhao, Fei Su, Limei Guo
Track 5: Image and Signal Processing
Fri 15 Jan 2021 at 13:00 in session OS T5.5

Auto-TLDR; Triplet-path Network for One-Stage Object Detection and Segmentation in Pathological Images

Deep learning has been widely applied in the field of medical image processing. However, compared with flourishing visual tasks in natural images, the progress achieved in pathological images is not remarkable, and detection and segmentation, which are among basic tasks of computer vision, are regarded as two independent tasks. In this paper, we make full use of existing datasets and construct a triplet-path network using dilated convolutions to cooperatively accomplish one-stage object detection and nuclei segmentation for general pathological images. First, in order to meet the requirement of detection and segmentation, a novel structure called triplet feature generation (TFG) is designed to extract high-resolution and multiscale features, where features from different layers can be properly integrated. Second, considering that pathological datasets are usually small, a location-aware and partially truncated loss function is proposed to improve the classification accuracy of datasets with few images and widely varying targets. We compare the performance of both object detection and instance segmentation with state-of-the-art methods. Experimental results demonstrate the effectiveness and efficiency of the proposed network on two datasets collected from multiple organs.

#### OCT Image Segmentation Using NeuralArchitecture Search and SRGAN

Saba Heidari, Omid Dehzangi, Nasser M. Nasarabadi, Ali Rezai
Track 5: Image and Signal Processing
Tue 12 Jan 2021 at 17:00 in session PS T5.1

Auto-TLDR; Automatic Segmentation of Retinal Layers in Optical Coherence Tomography using Neural Architecture Search

Alzheimer’s disease (AD) diagnosis is one of the major research areas in computational medicine. Optical coherence tomography (OCT) is a non-invasive, inexpensive, and timely efficient method that scans the human’s retina with depth. It has been hypothesized that the thickness of the retinal layers extracted from OCTs could be an efficient and effective biomarker for early diagnosis of AD. In this work, we aim to design a self-training model architecture for the task of segmenting the retinal layers in OCT scans. Neural architecture search (NAS) is a subfield of AutoML domain, which has a significant impact on improving the accuracy of machine vision tasks. We integrate the NAS algorithm with a Unet auto-encoder architecture as its backbone. Then, we employ our proposed model to segment the retinal nerve fiber layer in our preprocessed OCT images with the aim of AD diagnosis. In this work, we trained a super-resolution generative adversarial network on the raw OCT scans to improve the quality of the images before the modeling stage. In our architecture search strategy, different primitive operations suggested to find down- \& up-sampling Unet cell blocks and the binary gate method has been applied to make the search strategy more practical. Our architecture search method is empirically evaluated by training on the Unet and NAS-Unet from scratch. Specifically, the proposed NAS-Unet training significantly outperforms the baseline human-designed architecture by achieving 95.1\% in the mean Intersection over Union metric and 79.1\% in the Dice similarity coefﬁcient.

#### Leveraging Unlabeled Data for Glioma Molecular Subtype and Survival Prediction

Nicholas Nuechterlein, Beibin Li, Mehmet Saygin Seyfioglu, Sachin Mehta, Patrick Cimino, Linda Shapiro
Track 5: Image and Signal Processing
Tue 12 Jan 2021 at 17:00 in session PS T5.1

Auto-TLDR; Multimodal Brain Tumor Segmentation Using Unlabeled MR Data and Genomic Data for Cancer Prediction

In this paper, we address two long-standing challenges in neuro-oncology: (1) how to leverage large amounts of unlabeled magnetic resonance (MR) imaging data for radiogenomic tasks and (2) how to unite glioma MR imaging with genomic data. We examine multi-parametric MR data from 542 patients in the combined training, validation, and testing sets of the 2018 Multimodal Brain Tumor Segmentation Challenge and somatic copy number alteration (SCNA) data from 1090 patients in The Cancer Genome Archive's (TCGA) lower-grade glioma and glioblastoma projects. We propose a novel application of multi-task learning (MTL) that leverages unlabeled MR data by jointly learning tumor segmentation masks with glioma molecular subtype markers and allows for SCNA input when available. There are 235 patients in the intersection of these MR and SCNA datasets, which we divide into an unlabeled training set, a labeled training set, and a validation set. Our MTL model significantly outperforms comparable classification models trained only on labeled MR data for both IDH1/2 mutation and 1p/19q co-deletion glioma subtype marker prediction tasks. We also observe that models trained on genomic and imaging data improve survival prediction results achieved by models trained on either alone. We will release our source code for future research.