Hierarchically Aggregated Residual Transformation for Single Image Super Resolution

Zejiang Hou, Sy Kung

Auto-TLDR; HARTnet: Hierarchically Aggregated Residual Transformation for Multi-Scale Super-resolution

Visual patterns usually appear at different scales in natural images. Multi-scale feature representation is therefore of great importance for the single-image super-resolution (SISR) task, which must reconstruct image objects at different scales. However, this characteristic has rarely been considered by CNN-based SISR methods. In this work, we propose a novel building block, the hierarchically aggregated residual transformation (HART), to achieve multi-scale feature representation in each layer of the network. Within each HART block, we connect multiple convolutions in a hierarchical residual-like manner, which greatly expands the range of effective receptive fields and helps to detect image features at different scales. To theoretically understand the proposed HART block, we recast SISR as an optimal control problem and show that HART effectively approximates the classical 4th-order Runge-Kutta method, which has the merit of a small local truncation error when numerically solving ordinary differential equations. By cascading the proposed HART blocks, we establish our high-performing HARTnet. Compared with existing state-of-the-art SR methods (including those on the NTIRE2019 SR Challenge leaderboard), the proposed HARTnet demonstrates consistent PSNR/SSIM improvements on various benchmark datasets under different degradation models. Moreover, HARTnet can efficiently restore more faithful high-resolution images than comparative SR methods (cf. Figure 1).
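
The classical 4th-order Runge-Kutta method mentioned in the abstract can be made concrete with a minimal sketch of the textbook RK4 step for a scalar ODE y' = f(t, y). This shows only the standard numerical scheme, not the HART block or the authors' derivation; the names `rk4_step` and `integrate` are illustrative choices, not taken from the paper.

```python
import math


def rk4_step(f, t, y, h):
    """Advance y' = f(t, y) by one step of size h with classical RK4."""
    k1 = f(t, y)                           # slope at the start of the step
    k2 = f(t + h / 2, y + h * k1 / 2)      # slope at the midpoint, using k1
    k3 = f(t + h / 2, y + h * k2 / 2)      # slope at the midpoint, using k2
    k4 = f(t + h, y + h * k3)              # slope at the end of the step
    # Weighted average of the four slopes (weights 1, 2, 2, 1 over 6).
    return y + h * (k1 + 2 * k2 + 2 * k3 + k4) / 6


def integrate(f, t0, y0, t1, steps):
    """Integrate from t0 to t1 using a fixed number of RK4 steps."""
    h = (t1 - t0) / steps
    t, y = t0, y0
    for _ in range(steps):
        y = rk4_step(f, t, y, h)
        t += h
    return y


if __name__ == "__main__":
    # y' = y with y(0) = 1 has the exact solution y(t) = exp(t).
    approx = integrate(lambda t, y: y, 0.0, 1.0, 1.0, 10)
    print(approx, math.exp(1.0))
```

The local truncation error of this scheme is of order h^5 per step, which is the property the abstract refers to as a small local truncation error.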

Similar papers

Residual Fractal Network for Single Image Super Resolution by Widening and Deepening

Jiahang Gu, Zhaowei Qu, Xiaoru Wang, Jiawang Dan, Junwei Sun

Auto-TLDR; Residual fractal convolutional network for single image super-resolution

The architecture of the convolutional neural network (CNN) plays an important role in single image super-resolution (SISR). However, most models proposed in recent years simply transplant methods or architectures that perform well in other vision fields. Hence, they do not exploit the characteristics of super-resolution (SR) and ignore the key information carried by recurring texture features in the image. To utilize patch-recurrence in SR and the high correlation of texture, we propose a residual fractal convolutional block (RFCB) and expand its depth and width to obtain the residual fractal network (RFN), which comprises the deep residual fractal network (DRFN) and the wide residual fractal network (WRFN). RFCB is recursive, with multiple branches of magnified receptive field. Through the phased feature fusion module, the network focuses on extracting high-frequency texture features that repeatedly appear in the image. We also introduce a residual in residual (RIR) structure to RFCB, which enables abundant low-frequency features to feed into deeper layers and reduces the difficulty of network training. RFN is the first supervised learning method to incorporate the patch-recurrence characteristic of SISR into network design. Extensive experiments demonstrate that RFN outperforms state-of-the-art SISR methods in terms of both quantitative metrics and visual quality, while the number of parameters has been greatly optimized.

Single Image Super-Resolution with Dynamic Residual Connection

Karam Park, Jae Woong Soh, Nam Ik Cho

Auto-TLDR; Dynamic Residual Attention Network for Lightweight Single Image Super-Residual Networks

Deep convolutional neural networks have shown significant improvement in the single image super-resolution (SISR) field. Recently, there have been attempts to solve the SISR problem using lightweight networks, considering the limited computational resources of real-world applications. For lightweight networks in particular, the balance between parameter count and performance is very difficult to tune, and most lightweight SISR networks are manually designed based on a huge number of brute-force experiments. Moreover, network performance depends critically on the skip connections of the building blocks that are repeated throughout the architecture. Notably, in previous works, these connections are pre-defined and manually determined by human researchers. Hence, they adapt poorly to the input image statistics, and a better solution may exist for the given number of parameters. Therefore, we focus on the automated design of networks regarding the connection of basic building blocks (residual networks), and as a result, propose a dynamic residual attention network (DRAN). The proposed method allows the network to dynamically select residual paths depending on the input image, based on the idea of the attention mechanism. For this, we design a dynamic residual module that determines the residual paths between the basic building blocks for the given input image. By finding optimal residual paths between the blocks, the network can selectively bypass informative features needed to reconstruct the target high-resolution (HR) image. Experimental results show that our proposed DRAN outperforms most of the existing state-of-the-art lightweight models in SISR.

RSAN: Residual Subtraction and Attention Network for Single Image Super-Resolution

Shuo Wei, Xin Sun, Haoran Zhao, Junyu Dong

Auto-TLDR; RSAN: Residual subtraction and attention network for super-resolution

Single-image super-resolution (SISR) aims to recover a potential high-resolution image from its low-resolution version. Recently, deep learning-based methods have played a significant role in the super-resolution field due to their effectiveness and efficiency. However, most SISR methods neglect the differing importance of the feature map channels. Moreover, they cannot eliminate redundant noise, which blurs the output image. In this paper, we propose the residual subtraction and attention network (RSAN) for powerful feature expression and channel importance learning. More specifically, RSAN first applies a redundancy removal module to learn the noise information in the feature map and subtract the noise through residual learning. It then introduces a channel attention module to amplify high-frequency information and suppress the weight of ineffective channels. Experimental results on extensive public benchmarks demonstrate that our RSAN achieves significant improvement over previous SISR methods in terms of both quantitative metrics and visual quality.

Boosting High-Level Vision with Joint Compression Artifacts Reduction and Super-Resolution

Xiaoyu Xiang, Qian Lin, Jan Allebach

Auto-TLDR; A Context-Aware Joint CAR and SR Neural Network for High-Resolution Text Recognition and Face Detection

Due to the limits of bandwidth and storage space, digital images are usually down-scaled and compressed when transmitted over networks, resulting in loss of details and jarring artifacts that can lower the performance of high-level visual tasks. In this paper, we aim to generate an artifact-free high-resolution image from a low-resolution one compressed with an arbitrary quality factor by exploring joint compression artifacts reduction (CAR) and super-resolution (SR) tasks. First, we propose a context-aware joint CAR and SR neural network (CAJNN) that integrates both local and non-local features to solve CAR and SR in one-stage. Finally, a deep reconstruction network is adopted to predict high quality and high-resolution images. Evaluation on CAR and SR benchmark datasets shows that our CAJNN model outperforms previous methods and also takes 26.2% less runtime. Based on this model, we explore addressing two critical challenges in high-level computer vision: optical character recognition of low-resolution texts, and extremely tiny face detection. We demonstrate that CAJNN can serve as an effective image preprocessing method and improve the accuracy for real-scene text recognition (from 85.30% to 85.75%) and the average precision for tiny face detection (from 0.317 to 0.611).

Progressive Splitting and Upscaling Structure for Super-Resolution

Qiang Li, Tao Dai, Shutao Xia

Auto-TLDR; PSUS: Progressive and Upscaling Layer for Single Image Super-Resolution

Recently, very deep convolutional neural networks (CNNs) have shown great success in single image super-resolution (SISR). Most of these methods focus on the design of the network architecture and adopt a sub-pixel convolution layer at the end of the network, but few have paid attention to exploring the potential representation ability of the upscaling layer. The sub-pixel convolution layer aggregates several low-resolution (LR) feature maps and builds super-resolution (SR) images in a single step. However, those LR feature maps share similar patterns, as they are extracted from a single trunk network. We believe that the mapping relationships between the input image and each LR feature map are not consistent. Inspired by this, we propose a novel progressive splitting and upscaling structure, termed PSUS, which generates decoupled feature maps for the upscaling layer to obtain a better SR image. Experiments show that our method can not only speed up convergence, but also achieve considerable improvement in image quality with fewer parameters and lower computational complexity.
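
The single-step rearrangement performed by a sub-pixel convolution layer can be sketched as follows. This covers only the channel-to-space shuffle (the convolution that produces the input maps is omitted), it is not the PSUS structure itself, and the channel ordering is an assumption chosen to match the convention commonly used in deep learning frameworks; `pixel_shuffle` is an illustrative name.

```python
def pixel_shuffle(x, r):
    """Rearrange LR feature maps of shape [C*r*r][H][W] into [C][H*r][W*r].

    x is a nested list of channels; r is the integer upscaling factor.
    Assumed ordering: input channel c*r*r + i*r + j fills the output
    sub-pixel at row offset i and column offset j of output channel c.
    """
    channels_in = len(x)
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = channels_in // (r * r)
    h, w = len(x[0]), len(x[0][0])
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):
            for j in range(r):
                src = x[c * r * r + i * r + j]
                for row in range(h):
                    for col in range(w):
                        out[c][row * r + i][col * r + j] = src[row][col]
    return out


if __name__ == "__main__":
    # Four 1x2 LR maps become one 2x4 map for r = 2.
    lr_maps = [[[k * 10, k * 10 + 1]] for k in range(4)]
    for line in pixel_shuffle(lr_maps, 2)[0]:
        print(line)
```

Each output pixel comes from exactly one LR feature map, which is why all r*r maps produced by a single trunk contribute jointly to one SR image.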

LiNet: A Lightweight Network for Image Super Resolution

Armin Mehri, Parichehr Behjati Ardakani, Angel D. Sappa

Auto-TLDR; LiNet: A Compact Dense Network for Lightweight Super Resolution

This paper proposes a new lightweight network, LiNet, that enhances technical efficiency in lightweight super resolution and operates approximately like very large and costly networks in terms of number of network parameters and operations. The proposed architecture allows the network to learn more abstract properties by avoiding low-level information via multiple links. LiNet introduces a Compact Dense Module, which contains a set of inner and outer blocks, to efficiently extract meaningful information, to better leverage multi-level representations before the upsampling stage, and to allow efficient information and gradient flow within the network. Experiments on benchmark datasets show that the proposed LiNet achieves favorable performance against lightweight state-of-the-art methods.

Efficient Super Resolution by Recursive Aggregation

Zhengxiong Luo, Yan Huang, Shang Li, Liang Wang, Tieniu Tan

Auto-TLDR; Recursive Aggregation Network for Efficient Deep Super Resolution

Deep neural networks have achieved remarkable results on image super resolution (SR), but the efficiency problem of deep SR networks is rarely studied. We experimentally find that many sequentially stacked convolutional blocks in current SR networks are far from being fully optimized, which largely damages their overall efficiency. This indicates that comparable or even better results could be achieved with fewer but sufficiently optimized blocks. In this paper, we try to construct a more efficient SR model via the proposed recursive aggregation network (RAN). It recursively aggregates convolutional blocks in different orders and avoids too many sequentially stacked blocks. In this way, multiple shortcuts are introduced in RAN, which help gradients flow more easily to all inner layers, even for very deep SR networks. As a result, all blocks in RAN can be better optimized, and RAN can thus achieve better performance with a smaller model size than existing methods.

Deep Iterative Residual Convolutional Network for Single Image Super-Resolution

Rao Muhammad Umer, Gian Luca Foresti, Christian Micheloni

Auto-TLDR; ISRResCNet: Deep Iterative Super-Resolution Residual Convolutional Network for Single Image Super-resolution

Deep convolutional neural networks (CNNs) have recently achieved great success in the single image super-resolution (SISR) task due to their powerful feature representation capabilities. Most recent deep learning based SISR methods focus on designing deeper or wider models to learn the non-linear mapping between low-resolution (LR) inputs and high-resolution (HR) outputs. These existing SR methods do not take into account the image observation (physical) model and thus require a large number of trainable network parameters and a huge volume of training data. To address these issues, we propose a deep Iterative Super-Resolution Residual Convolutional Network (ISRResCNet) that exploits powerful image regularization and large-scale optimization techniques by training the deep network in an iterative manner with a residual learning approach. Extensive experimental results on various super-resolution benchmarks demonstrate that our method, with few trainable parameters, improves results for different scaling factors in comparison with state-of-the-art methods.

Cross-Layer Information Refining Network for Single Image Super-Resolution

Hongyi Zhang, Wen Lu, Xiaopeng Sun

Auto-TLDR; Interlaced Spatial Attention Block for Single Image Super-Resolution

Recently, deep learning-based image super-resolution (SR) has made remarkable progress. However, previous SR methods rarely focus on the correlation between adjacent layers, which leads to underutilization of the information extracted by each convolutional layer. To address this problem, we design a simple and efficient cross-layer information refining network (CIRN) for single image super-resolution. Concretely, we propose the interlaced spatial attention block (ISAB) to measure the correlation between the feature maps of adjacent layers and adaptively rescale spatial-wise features to refine the information. Owing to the two-stage information propagation strategy, the CIRN can distill the primary information of adjacent layers without introducing too many parameters. Extensive experiments on benchmark datasets illustrate that our method achieves better accuracy than state-of-the-art methods even at 16× scale; specifically, it has a better balance between performance and parameters.

DID: A Nested Dense in Dense Structure with Variable Local Dense Blocks for Super-Resolution Image Reconstruction

Longxi Li, Hesen Feng, Bing Zheng, Lihong Ma, Jing Tian

Auto-TLDR; DID: Deep Super-Residual Dense Network for Image Super-resolution Reconstruction

The success of single image super-resolution reconstruction (SR) relies on a refined mapping from low-resolution (LR) examples to high-resolution (HR) signals. However, the relation is sometimes chaotic, especially in a deep SR network. We try to improve the mapping interpretability in two ways: i) Variable local dense blocks (VLDB) are suggested to match receptive fields at different depths of a residual dense network (RDN), with each VLDB having a dyadic increment in the number of layers over its predecessor. ii) Based on VLDBs, a dense in dense (DID) network is created. It substitutes nodes in a regular RDN with super nodes, i.e. VLDBs, and formulates joint learning through flexible hierarchical feature scaling, reuse, and long-short term aggregation. VLDBs deal with the feature underfitting that occurs when a big receptive field meets a fixed-depth dense block, and the DID network provides a relatively complete feature dictionary to preserve details for feature shift, dilating and grouping in high-dimension image reconstruction. To demonstrate the validity of the DID structure, detailed experiments are performed on the benchmark datasets Set5, Set14, B100 and Urban100, where the accuracy metric PSNR and the visual perceptual metric SSIM are superior to those of most state-of-the-art methods. Besides, due to the depth adaptation of VLDBs and their nesting in a generalized RDN, the DID network converges easily, and gradient explosion or vanishing is alleviated even as the network deepens.
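
Since PSNR is the accuracy metric reported here and in several other abstracts on this page, a minimal sketch of its standard definition, 10·log10(MAX² / MSE), may be useful. It assumes single-channel images stored as nested lists and an 8-bit peak value of 255; the evaluation protocol actually used in the papers (colour space, border cropping) is not stated in the abstracts and is not reproduced here.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Both images are nested lists [H][W] of pixel intensities.
    max_value = 255 assumes 8-bit data (an assumption of this sketch).
    """
    total = 0.0
    count = 0
    for ref_row, est_row in zip(reference, estimate):
        for ref_px, est_px in zip(ref_row, est_row):
            total += (ref_px - est_px) ** 2
            count += 1
    mse = total / count
    if mse == 0:
        return float("inf")  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    hr = [[10, 20], [30, 40]]
    sr = [[11, 19], [31, 39]]  # every pixel is off by one, so MSE = 1
    print(round(psnr(hr, sr), 2))
```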

Neural Architecture Search for Image Super-Resolution Using Densely Connected Search Space: DeCoNAS

Joon Young Ahn, Nam Ik Cho

Auto-TLDR; DeCoNASNet: Automated Neural Architecture Search for Super-Resolution

The recent progress of deep convolutional neural networks has enabled great success in single image super-resolution (SISR) and many other vision tasks. Their performance is also being increased by deepening the networks and developing more sophisticated network structures. However, finding an optimal structure for the given problem is a difficult task, even for human experts. For this reason, neural architecture search (NAS) methods have been introduced, which automate the procedure of constructing the structures. In this paper, we expand NAS to the super-resolution domain and find a lightweight densely connected network named DeCoNASNet. We use a hierarchical search strategy to find the best connection with local and global features. In this process, we define a complexity-based penalty for solving image super-resolution, which can be considered a multi-objective problem. Experiments show that our DeCoNASNet outperforms the state-of-the-art lightweight super-resolution networks designed by handcraft methods and existing NAS-based design.

Wavelet Attention Embedding Networks for Video Super-Resolution

Young-Ju Choi, Young-Woon Lee, Byung-Gyu Kim

Auto-TLDR; Wavelet Attention Embedding Network for Video Super-Resolution

Recently, video super-resolution (VSR) has become more crucial as display resolutions have grown. The majority of deep learning-based VSR methods combine convolutional neural networks (CNNs) with a motion compensation or alignment module to estimate a high-resolution (HR) frame from low-resolution (LR) frames. However, most previous methods treat the spatial features equally and may produce misaligned temporal features through the pixel-based motion compensation and alignment module. This can damage the accuracy of the estimated HR feature. In this paper, we propose a wavelet attention embedding network (WAEN), comprising a wavelet embedding network (WENet) and an attention embedding network (AENet), to fully exploit the spatio-temporal informative features. The WENet operates as a spatial feature extractor of individual low- and high-frequency information based on the 2-D Haar discrete wavelet transform. The meaningful temporal feature is extracted in the AENet by utilizing the weighted attention map between frames. Experimental results demonstrate that the proposed method achieves superior performance compared with state-of-the-art methods.
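
The 2-D Haar discrete wavelet transform on which the WENet is based can be illustrated with a one-level decomposition. This sketch uses the orthonormal scaling (division by 2 per 2×2 block) and the subband names `ll`, `lh`, `hl`, `hh`, both of which are assumptions of this example; naming conventions for the detail subbands vary, and how the network consumes the subbands is not described in the abstract.

```python
def haar_dwt2(image):
    """One-level 2-D Haar DWT of an image with even height and width.

    image is a nested list [H][W]; returns four [H/2][W/2] subbands.
    For each 2x2 block  a b
                        c d
    ll averages the block, lh is high-pass across columns,
    hl is high-pass across rows, hh is high-pass in both directions.
    """
    h, w = len(image), len(image[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    ll, lh, hl, hh = [], [], [], []
    for y in range(0, h, 2):
        row_ll, row_lh, row_hl, row_hh = [], [], [], []
        for x in range(0, w, 2):
            a, b = image[y][x], image[y][x + 1]
            c, d = image[y + 1][x], image[y + 1][x + 1]
            row_ll.append((a + b + c + d) / 2)  # low-frequency content
            row_lh.append((a - b + c - d) / 2)  # change across columns
            row_hl.append((a + b - c - d) / 2)  # change across rows
            row_hh.append((a - b - c + d) / 2)  # diagonal change
        ll.append(row_ll)
        lh.append(row_lh)
        hl.append(row_hl)
        hh.append(row_hh)
    return ll, lh, hl, hh


if __name__ == "__main__":
    for band in haar_dwt2([[1, 2], [3, 4]]):
        print(band)
```

With this scaling the transform preserves energy: the squared subband coefficients of each block sum to the squared pixel values of that block.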

On-Device Text Image Super Resolution

Dhruval Jain, Arun Prabhu, Gopi Ramena, Manoj Goyal, Debi Mohanty, Naresh Purre, Sukumar Moharana

Auto-TLDR; A Novel Deep Neural Network for Super-Resolution on Low Resolution Text Images

Recent research on super-resolution (SR) has witnessed major developments with the advancements of deep convolutional neural networks. There is a need for on-device information extraction from scenic text images and document images, most of which are low-resolution (LR) images. Therefore, SR becomes an essential pre-processing step, as bicubic upsampling, which is conventionally present in smartphones, performs poorly on LR images. To give users more control over their privacy, and to reduce the carbon footprint by reducing the overhead of cloud computing and hours of GPU usage, executing SR models on the edge has become a necessity. There are various challenges in running and optimizing a model on resource-constrained platforms like smartphones. In this paper, we present a novel deep neural network that reconstructs sharper character edges and thus boosts OCR confidence. The proposed architecture not only achieves significant improvement in PSNR over bicubic upsampling on various benchmark datasets but also runs with an average inference time of 11.7 ms per image. We outperform the state of the art on the Text330 dataset. We also achieve an OCR accuracy of 75.89% on the ICDAR 2015 TextSR dataset, where the ground truth has an accuracy of 78.10%.

Multi-Laplacian GAN with Edge Enhancement for Face Super Resolution

Shanlei Ko, Bi-Ru Dai

Auto-TLDR; Face Image Super-Resolution with Enhanced Edge Information

Face image super-resolution has become a research hotspot in the field of image processing. Nowadays, more and more studies add additional information, such as landmarks and identity, to reconstruct high-resolution images from low-resolution ones, and achieve good performance in quantitative terms and perceptual quality. However, this additional information is hard to obtain in many cases. In this work, we focus on reconstructing face images by extracting useful information from the face images directly rather than using additional information. By observing edge information at each scale of face images, we propose a method to reconstruct high-resolution face images with enhanced edge information. In addition, with the proposed training procedure, our method reconstructs photo-realistic images at an upscaling factor of 8x and outperforms state-of-the-art methods in both quantitative terms and perceptual quality.

Face Super-Resolution Network with Incremental Enhancement of Facial Parsing Information

Shuang Liu, Chengyi Xiong, Zhirong Gao

Auto-TLDR; Learning-based Face Super-Resolution with Incremental Boosting Facial Parsing Information

Recently, facial priors based face super-resolution (SR) methods have obtained significant performance gains in dealing with extremely degraded facial images, and facial priors have also been proved useful in facilitating the inference of face images. Based on this, how to fully fuse facial priors into deep features to improve face SR performance has attracted a major attention. In this paper, we propose a learning-based face SR approach with incremental boosting facial parsing information (IFPSR) for high-magnification of low-resolution faces. The proposed IFPSR method consists of three main parts: i) a three-stage parsing map embedded features upsampling network, in which image recovery and prior estimation processes are performed simultaneously and progressively to improve the image resolution; ii) a progressive training method and a joint facial attention and heatmap loss to obtain better facial attributes; iii) the channel attention strategy in residual dense blocks to adaptively learn facial features. Extensive experimental results show that compared with the state-of-the-art methods in terms of quantitative and qualitative metrics, our approach can achieve an outstanding balance between SR image quality and low network complexity.

Single Image Deblurring Using Bi-Attention Network

Yaowei Li, Ye Luo, Jianwei Lu

Auto-TLDR; Bi-Attention Neural Network for Single Image Deblurring

Recently, deep convolutional neural networks have been extensively applied to image deblurring and have achieved remarkable performance. However, most CNN-based image deblurring methods focus on simply increasing network depth, neglecting the contextual information of the blurred image and the reconstructed image. Meanwhile, most encoder-decoder based methods rarely exploit the encoder's multi-layer features. To address these issues, we propose a bi-attention neural network for single image deblurring, which mainly consists of a bi-attention network and a feature fusion network. Specifically, two criss-cross attention modules are plugged in before and after the encoder-decoder to capture long-range spatial contextual information in the blurred image and the reconstructed image simultaneously, and the feature fusion network combines multi-layer features from the encoder to enable the decoder to reconstruct the image with multi-scale features. The whole network is end-to-end trainable. Quantitative and qualitative experimental results validate that the proposed network outperforms state-of-the-art methods in terms of PSNR and SSIM on benchmark datasets.

Deep Universal Blind Image Denoising

Jae Woong Soh, Nam Ik Cho

Auto-TLDR; Image Denoising with Deep Convolutional Neural Networks

Image denoising is an essential part of many image processing and computer vision tasks due to inevitable noise corruption during image acquisition. Traditionally, many researchers have investigated image priors for denoising within the Bayesian perspective, based on image properties and statistics. Recently, deep convolutional neural networks (CNNs) have shown great success in image denoising by incorporating large-scale synthetic datasets. However, both approaches have pros and cons. While deep CNNs are powerful for removing noise with known statistics, they tend to lack flexibility and practicality for blind and real-world noise. Moreover, they cannot easily employ explicit priors. On the other hand, traditional non-learning methods can involve explicit image priors, but they require considerable computation time and cannot exploit large-scale external datasets. In this paper, we present a CNN-based method that leverages the advantages of both approaches based on the Bayesian perspective. Concretely, we divide the blind image denoising problem into sub-problems and conquer each inference problem separately. As the CNN is a powerful tool for inference, our method is rooted in CNNs, and we propose a novel network design for efficient inference. With our proposed method, we can successfully remove blind and real-world noise with a universal CNN that has a moderate number of parameters.

Automatical Enhancement and Denoising of Extremely Low-Light Images

Yuda Song, Yunfang Zhu, Xin Du

Auto-TLDR; INSNet: Illumination and Noise Separation Network for Low-Light Image Restoring

Deep convolutional neural network (DCNN) based methodologies have recently achieved remarkable performance on various low-level vision tasks. Restoring images captured at night is one of the trickiest low-level vision tasks due to their high noise level and low intensity. We propose a DCNN-based methodology, the Illumination and Noise Separation Network (INSNet), which performs both denoising and enhancement on these extremely low-light images. INSNet fully utilizes global-aware features and local-aware features using the modified network structure and image sampling scheme. Compared to well-designed complex neural networks, our proposed methodology only needs to add a bypass network to the existing network. Nevertheless, it dramatically boosts the quality of recovered images while increasing the computational cost by less than 0.1%. Even without any manual settings, INSNet can stably restore extremely low-light images to the desired high-quality images.

Thermal Image Enhancement Using Generative Adversarial Network for Pedestrian Detection

Mohamed Amine Marnissi, Hajer Fradi, Anis Sahbani, Najoua Essoukri Ben Amara

Auto-TLDR; Improving Visual Quality of Infrared Images for Pedestrian Detection Using Generative Adversarial Network

Infrared imaging has recently played an important role in a wide range of applications including surveillance, robotics and night vision. However, infrared cameras often suffer from some limitations, essentially low contrast and blurred details. These problems contribute to the loss of observation of target objects in infrared images, which could limit the feasibility of different infrared imaging applications. In this paper, we mainly focus on the problem of pedestrian detection on thermal images. Particularly, we emphasize the need for enhancing the visual quality of images before performing the detection step. To address that, we propose a novel thermal enhancement architecture based on a Generative Adversarial Network, composed of two modules, contrast enhancement and denoising, with a post-processing step for edge restoration in order to improve the overall quality. The effectiveness of the proposed architecture is assessed by means of visual quality metrics, and better results are obtained compared with the original thermal images and with the results of other existing enhancement methods. These experiments were conducted on a subset of the KAIST dataset. Using the same dataset, the impact of the proposed enhancement architecture has been demonstrated on the detection results, obtaining better performance by a significant margin with the YOLOv3 detector.

Small Object Detection Leveraging on Simultaneous Super-Resolution

Hong Ji, Zhi Gao, Xiaodong Liu, Tiancan Mei

Auto-TLDR; Super-Resolution via Generative Adversarial Network for Small Object Detection

Despite the impressive advancement achieved in object detection, the detection performance of small object is still far from satisfactory due to the lack of sufficient detailed appearance to distinguish it from similar objects. Inspired by the positive effects of super-resolution for object detection, we propose a general framework that can be incorporated with most available detector networks to significantly improve the performance of small object detection, in which the low-resolution image is super-resolved via generative adversarial network (GAN) in an unsupervised manner. In our method, the super-resolution network and the detection network are trained jointly and alternately with each other fixed. In particular, the detection loss is back-propagated into the super-resolution network during training to facilitate detection. Compared with available simultaneous super-resolution and detection methods which heavily rely on low-/high-resolution image pairs, our work breaks through such restriction via applying the CycleGAN strategy, achieving increased generality and applicability, while remaining an elegant structure. Extensive experiments on datasets from both computer vision and remote sensing communities demonstrate that our method works effectively on a wide range of complex scenarios, resulting in best performance that significantly outperforms many state-of-the-art approaches.

Improving Low-Resolution Image Classification by Super-Resolution with Enhancing High-Frequency Content

Liguo Zhou, Guang Chen, Mingyue Feng, Alois Knoll

Auto-TLDR; Super-resolution for Low-Resolution Image Classification

With the rapid development of Convolutional Neural Networks, they can now perform excellently on visual understanding tasks when the input images are of high or common quality. However, a large degradation in performance always occurs when the input images are of low quality. In this paper, we propose a new super-resolution method to improve the classification performance for low-resolution images. In an image, the regions in which pixel values vary dramatically contain more abundant high-frequency content than other parts. Based on this fact, we design a weight map and integrate it with a super-resolution CNN training framework. During training, this weight map locates the positions of the high-frequency pixels in the ground truth high-resolution images. The pixel-level loss function then takes effect only at these positions to minimize the difference between reconstructed high-resolution images and ground truth high-resolution images. Compared with other state-of-the-art super-resolution methods, the experimental results show that our method can recover more high-frequency content in high-resolution image reconstruction, and better improve the classification accuracy after low-resolution image preprocessing.
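
The idea of a loss that acts only at high-frequency positions of the ground truth can be sketched as follows. The abstract does not say how the weight map is computed or which pixel-level loss is used, so everything specific here is an assumption made for illustration: a binary mask from forward differences, a hand-picked `threshold`, and an L1 loss. The names `high_freq_mask` and `masked_l1_loss` are hypothetical.

```python
def high_freq_mask(hr, threshold):
    """Binary map marking pixels of hr whose local variation is large.

    Variation is measured with forward differences to the right and
    bottom neighbours (an assumed operator, clamped at the border).
    """
    h, w = len(hr), len(hr[0])
    mask = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx = hr[y][min(x + 1, w - 1)] - hr[y][x]
            dy = hr[min(y + 1, h - 1)][x] - hr[y][x]
            if abs(dx) + abs(dy) > threshold:
                mask[y][x] = 1
    return mask


def masked_l1_loss(sr, hr, mask):
    """Mean absolute error over the positions selected by mask only."""
    total, count = 0.0, 0
    for y, mask_row in enumerate(mask):
        for x, selected in enumerate(mask_row):
            if selected:
                total += abs(sr[y][x] - hr[y][x])
                count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    hr = [[0, 0, 10, 10], [0, 0, 10, 10]]  # a vertical edge
    sr = [[1, 2, 10, 10], [1, 4, 10, 10]]
    mask = high_freq_mask(hr, threshold=5)
    print(mask, masked_l1_loss(sr, hr, mask))
```

In this toy case the errors in the flat region left of the edge are ignored, and only the two pixels adjacent to the edge contribute to the loss.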

Dynamic Guided Network for Monocular Depth Estimation

Xiaoxia Xing, Yinghao Cai, Yiping Yang, Dayong Wen

Auto-TLDR; DGNet: Dynamic Guidance Upsampling for Self-attention-Decoding for Monocular Depth Estimation

Self-attention and encoder-decoder structures have been widely used in deep neural networks for monocular depth estimation. The former mechanism is capable of capturing long-range information by computing the representation of each position as a weighted sum of the features at all positions, while the latter networks capture structural detail by gradually recovering the spatial information. In this work, we combine the advantages of both methods. Specifically, our proposed model, DGNet, extends EMANet by adding an effective decoder module to refine the depth results. In the decoder stage, we further design dynamic guidance upsampling, which uses local neighboring information from low-level features to guide the upsampling of coarser depth. In this way, dynamic guidance upsampling generates content-dependent and spatially-variant kernels for depth upsampling, making full use of the spatial detail in low-level features. Experimental results demonstrate that our method obtains higher accuracy and generates the desired depth map.

Video Lightening with Dedicated CNN Architecture

Li-Wen Wang, Wan-Chi Siu, Zhi-Song Liu, Chu-Tak Li, P. K. Daniel Lun

Auto-TLDR; VLN: Video Lightening Network for Driving Assistant Systems in Dark Environment

Darkness brings us uncertainty, worry and low confidence. This is a problem not only for pedestrians walking on a dark evening but also for drivers on roads with very dim lighting or no lighting at all. To address this problem, we propose a new CNN structure named Video Lightening Network (VLN) that regards low-light enhancement as a residual learning task. It is useful as a reference for indirectly lightening the environment, or for vision-based application systems such as driving assistant systems. The VLN consists of several Lightening Back-Projection (LBP) and Temporal Aggregation (TA) blocks. Each LBP block enhances the low-light frame by domain transfer learning that iteratively maps the frame between the low- and normal-light domains. A TA block handles the motion among neighboring frames by investigating the spatial and temporal relationships. Several TAs work in a multi-scale way, which compensates for motion at different levels. The proposed architecture provides consistent enhancement across different levels of illumination, which significantly increases the visual quality even in extremely dark environments. Extensive experimental results show that the proposed approach outperforms other methods under both objective and subjective metrics.

Context-Aware Residual Module for Image Classification

Jing Bai, Ran Chen

Auto-TLDR; Context-Aware Residual Module for Image Classification

Attention modules have achieved great success in numerous vision tasks. However, existing visual attention modules generally consider features at a single scale and cannot make full use of multi-scale contextual information. Meanwhile, multi-scale spatial feature representation has demonstrated outstanding performance in a wide range of applications. However, multi-scale features are always represented in a layer-wise manner, i.e. it is impossible to know their contextual information at a granular level. To address this issue, this paper proposes a context-aware residual module for image classification. It consists of a novel multi-scale channel attention module (MSCAM), which learns refined channel weights by considering the visual features of its own scale and its surrounding fields, and a multi-scale spatial-aware module (MSSAM), which captures further spatial information. Either or both of the two modules can be plugged into any CNN-based backbone image classification architecture with a short residual connection to obtain context-aware enhanced features. Experiments on public image recognition datasets including CIFAR10, CIFAR100, Tiny-ImageNet and ImageNet consistently demonstrate that our proposed modules significantly outperform widely used state-of-the-art methods, e.g., ResNet and the lightweight networks MobileNet and SqueezeNet.

Detail-Revealing Deep Low-Dose CT Reconstruction

Xinchen Ye, Yuyao Xu, Rui Xu, Shoji Kido, Noriyuki Tomiyama

Auto-TLDR; A Dual-branch Aggregation Network for Low-Dose CT Reconstruction

Low-dose CT imaging lowers radiation risk by reducing the radiation dose, but it degrades imaging quality. This paper addresses the problem of low-dose CT reconstruction. Previous methods are unsatisfactory because they recover image details inaccurately under the strong noise generated by the reduced radiation dose, which directly affects the final diagnosis. To suppress the noise effectively while retaining the structures well, we propose a detail-revealing dual-branch aggregation network to effectively reconstruct the degraded CT image. Specifically, the main reconstruction branch iteratively exploits and compensates for the reconstruction errors to gradually refine the CT image, while the prior branch learns the structural details as prior knowledge to help recover the CT image. A sophisticated detail-revealing loss is designed to fuse the information from both branches and guide the learning to obtain better performance from pixel-wise and holistic perspectives, respectively. Experimental results show that our method outperforms state-of-the-art methods in both PSNR and SSIM.
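Several abstracts on this page report PSNR. As a reminder of what that metric measures, here is a minimal sketch of the standard definition, PSNR = 10 log10(MAX^2 / MSE). The function name, the list-of-rows image format, and the default peak value of 255 are assumptions for illustration; individual papers may use a different data range or evaluate on a specific channel.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Images are lists of rows of numbers; `max_value` is the largest possible
    pixel value (255 for 8-bit images, an assumed default).
    """
    flat_ref = [v for row in reference for v in row]
    flat_rec = [v for row in reconstruction for v in row]
    mse = sum((a - b) ** 2 for a, b in zip(flat_ref, flat_rec)) / len(flat_ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Every pixel is off by exactly one grey level, so the MSE is 1.
reference = [[10, 20], [30, 40]]
reconstruction = [[11, 21], [29, 41]]
value = psnr(reference, reconstruction)
```

With an MSE of 1 and a peak of 255, the result equals 20 log10(255), roughly 48.13 dB.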

TinyVIRAT: Low-Resolution Video Action Recognition

Ugur Demir, Yogesh Rawat, Mubarak Shah

Auto-TLDR; TinyVIRAT: A Progressive Generative Approach for Action Recognition in Videos

The existing research in action recognition is mostly focused on high-quality videos where the action is distinctly visible. In real-world surveillance environments, the actions in videos are captured at a wide range of resolutions. Most activities occur at a distance with a small resolution and recognizing such activities is a challenging problem. In this work, we focus on recognizing tiny actions in videos. We introduce a benchmark dataset, TinyVIRAT, which contains natural low-resolution activities. The actions in TinyVIRAT videos have multiple labels and they are extracted from surveillance videos which makes them realistic and more challenging. We propose a novel method for recognizing tiny actions in videos which utilizes a progressive generative approach to improve the quality of low-resolution actions. The proposed method also consists of a weakly trained attention mechanism which helps in focusing on the activity regions in the video. We perform extensive experiments to benchmark the proposed TinyVIRAT dataset and observe that the proposed method significantly improves the action recognition performance over baselines. We also evaluate the proposed approach on synthetically resized action recognition datasets and achieve state-of-the-art results when compared with existing methods. The dataset and code will be publicly available.

Attention Pyramid Module for Scene Recognition

Zhinan Qiao, Xiaohui Yuan, Chengyuan Zhuang, Abolfazl Meyarian

Auto-TLDR; Attention Pyramid Module for Multi-Scale Scene Recognition

The unrestricted open vocabulary and diverse substances of scenery images bring significant challenges to scene recognition. However, most deep learning architectures and attention methods are developed on general-purpose datasets and omit the characteristics of scene data. In this paper, we exploit the attention pyramid module (APM) to tackle the predicament of scene recognition. Our method streamlines the multi-scale scene recognition pipeline, learns comprehensive scene features at various scales and locations, addresses the interdependency among scales, and further assists the feature re-calibration and aggregation process. APM is extremely lightweight and can be easily plugged into existing network architectures in a parameter-efficient manner. By simply integrating APM into ResNet-50, we obtain a 3.54% boost in top-1 accuracy on the benchmark scene dataset. Comprehensive experiments show that APM achieves better performance than state-of-the-art attention methods with a significantly smaller computation budget. Code and pre-trained models will be made publicly available.

Transitional Asymmetric Non-Local Neural Networks for Real-World Dirt Road Segmentation

Yooseung Wang, Jihun Park

Auto-TLDR; Transitional Asymmetric Non-Local Neural Networks for Semantic Segmentation on Dirt Roads

Understanding images by predicting pixel-level semantic classes is a fundamental task in computer vision and is one of the most important techniques for autonomous driving. Recent approaches based on deep convolutional neural networks have dramatically improved the speed and accuracy of semantic segmentation on paved road datasets; however, dirt roads have yet to be systematically studied. Dirt roads do not contain clear boundaries between drivable and non-drivable regions, and this difficulty must be overcome for the realization of fully autonomous vehicles. The key idea of our approach is to apply lightweight non-local blocks to reinforce stage-wise long-range dependencies in encoder-decoder style backbone networks. Experiments on a dirt road dataset of 4,687 images show that our transitional asymmetric non-local neural networks achieve higher accuracy at lower computational cost than state-of-the-art models.

Super-Resolution Guided Pore Detection for Fingerprint Recognition

Syeda Nyma Ferdous, Ali Dabouei, Jeremy Dawson, Nasser M. Nasarabadi

Auto-TLDR; Super-Resolution Generative Adversarial Network for Fingerprint Recognition Using Pore Features

The performance of fingerprint recognition algorithms relies substantially on fine features extracted from fingerprints. Apart from minutiae and ridge patterns, pore features have proven to be usable for fingerprint recognition. Although features from minutiae and ridge patterns are quite attainable from low-resolution images, using pore features is practical only if the fingerprint image is of high resolution, which necessitates a model that enhances the image quality of conventional 500 ppi legacy fingerprints while preserving the fine details. To recover pore information from low-resolution fingerprints, we adopt a joint learning-based approach that combines super-resolution and pore detection networks. Our modified single image Super-Resolution Generative Adversarial Network (SRGAN) framework helps to reliably reconstruct high-resolution fingerprint samples from low-resolution ones, assisting the pore detection network in identifying pores with high accuracy. The network jointly learns a distinctive feature representation from a real low-resolution fingerprint sample and successfully synthesizes a high-resolution sample from it. To add discriminative information and uniqueness for all the subjects, we have integrated features extracted from a deep fingerprint verifier with the SRGAN quality discriminator. We also add a ridge reconstruction loss, utilizing ridge patterns to make the best use of extracted features. Our proposed method solves the recognition problem by improving the quality of fingerprint images. The high recognition accuracy of the synthesized samples, which is close to the accuracy achieved using the original high-resolution images, validates the effectiveness of our proposed model.

Fast, Accurate and Lightweight Super-Resolution with Neural Architecture Search

Chu Xiangxiang, Bo Zhang, Micheal Ma Hailong, Ruijun Xu, Jixiang Li, Qingyuan Li

Auto-TLDR; Multi-Objective Neural Architecture Search for Super-Resolution

Deep convolutional neural networks demonstrate impressive results in the super-resolution domain. A series of studies concentrate on improving the peak signal-to-noise ratio (PSNR) by using much deeper networks, which are not friendly to constrained resources. Pursuing a trade-off between the restoration capacity and the simplicity of models is still non-trivial. Recent contributions struggle to maximize this balance manually, while our work achieves the same goal automatically with neural architecture search. Specifically, we handle super-resolution with a multi-objective approach. We also propose an elastic search tactic at both the micro and macro levels, based on a hybrid controller that profits from evolutionary computation and reinforcement learning. Quantitative experiments lead us to conclude that our generated models dominate most of the state-of-the-art methods with respect to the individual FLOPS.
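In multi-objective optimization, one candidate is usually said to dominate another when it is no worse on every objective and strictly better on at least one. The sketch below applies that general definition to two objectives suggested by the abstract, PSNR and FLOPs. The model names and numbers are hypothetical values made up for illustration, and the functions are not taken from the paper.

```python
def dominates(a, b):
    """Return True if model `a` Pareto-dominates model `b`.

    Each model is a (psnr_db, flops) pair: higher PSNR is better and fewer
    FLOPs is better.
    """
    psnr_a, flops_a = a
    psnr_b, flops_b = b
    no_worse = psnr_a >= psnr_b and flops_a <= flops_b
    strictly_better = psnr_a > psnr_b or flops_a < flops_b
    return no_worse and strictly_better


def pareto_front(models):
    """Keep only the models that no other model dominates."""
    return {
        name: scores
        for name, scores in models.items()
        if not any(
            dominates(other, scores)
            for other_name, other in models.items()
            if other_name != name
        )
    }


# Hypothetical (PSNR in dB, FLOPs in G) pairs, not results from any paper.
candidates = {
    "model_a": (37.5, 50.0),
    "model_b": (37.0, 80.0),
    "model_c": (37.9, 120.0),
}
front = pareto_front(candidates)
```

Here `model_a` dominates `model_b`, while `model_a` and `model_c` trade accuracy against cost, so both remain on the front.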

SIDGAN: Single Image Dehazing without Paired Supervision

Pan Wei, Xin Wang, Lei Wang, Ji Xiang, Zihan Wang

Auto-TLDR; DehazeGAN: An End-to-End Generative Adversarial Network for Image Dehazing

Single image dehazing is challenging without the scene airlight and transmission map. Most existing dehazing algorithms estimate key parameters based on manually designed priors or statistics, which may be invalid in some scenarios. Although deep learning-based dehazing methods provide an effective solution, most of them rely on paired training datasets, which are prohibitively difficult to collect in the real world. In this paper, we propose an effective end-to-end generative adversarial network for image dehazing, named DehazeGAN. The proposed DehazeGAN adopts a U-net architecture with a novel color-consistency loss derived from the dark channel prior and a perceptual loss, and it can be trained in an unsupervised fashion without paired synthetic datasets. We create a RealHaze dataset for network training, including 4,000 outdoor hazy images and 4,000 haze-free images. Extensive experiments demonstrate that our proposed DehazeGAN achieves better performance than existing state-of-the-art methods on both synthetic and real-world datasets in terms of PSNR, SSIM, and subjective visual experience.
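The dark channel of an image is commonly defined as the minimum intensity over the colour channels within a local patch around each pixel. The sketch below computes only that quantity. How the abstract's color-consistency loss is derived from it is not stated, so nothing here should be read as the authors' loss; the function name, the tuple-based image format, and the `patch_radius` parameter are assumptions for illustration.

```python
def dark_channel(image, patch_radius=1):
    """Compute the dark channel of an RGB image.

    `image` is a list of rows, each row a list of (r, g, b) tuples.
    For every pixel, take the minimum over the three channels, then the
    minimum of those values over a square patch clipped at the borders.
    """
    height, width = len(image), len(image[0])
    channel_min = [[min(pixel) for pixel in row] for row in image]
    dark = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            y0, y1 = max(0, y - patch_radius), min(height, y + patch_radius + 1)
            x0, x1 = max(0, x - patch_radius), min(width, x + patch_radius + 1)
            dark[y][x] = min(
                channel_min[j][i] for j in range(y0, y1) for i in range(x0, x1)
            )
    return dark


# A single-row toy image with values in [0, 1].
toy = [[(0.9, 0.8, 0.7), (0.5, 0.6, 0.4), (0.9, 0.9, 0.9)]]
per_pixel = dark_channel(toy, patch_radius=0)
patched = dark_channel(toy, patch_radius=1)
```

With a radius of 0 the result is just the per-pixel channel minimum; with a radius of 1 the darkest pixel in the row spreads to its neighbours.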

OCT Image Segmentation Using Neural Architecture Search and SRGAN

Saba Heidari, Omid Dehzangi, Nasser M. Nasarabadi, Ali Rezai

Auto-TLDR; Automatic Segmentation of Retinal Layers in Optical Coherence Tomography using Neural Architecture Search

Alzheimer’s disease (AD) diagnosis is one of the major research areas in computational medicine. Optical coherence tomography (OCT) is a non-invasive, inexpensive, and time-efficient method that scans the human retina with depth. It has been hypothesized that the thickness of the retinal layers extracted from OCTs could be an efficient and effective biomarker for early diagnosis of AD. In this work, we aim to design a self-training model architecture for the task of segmenting the retinal layers in OCT scans. Neural architecture search (NAS) is a subfield of the AutoML domain, which has a significant impact on improving the accuracy of machine vision tasks. We integrate the NAS algorithm with a Unet auto-encoder architecture as its backbone. Then, we employ our proposed model to segment the retinal nerve fiber layer in our preprocessed OCT images with the aim of AD diagnosis. In this work, we trained a super-resolution generative adversarial network on the raw OCT scans to improve the quality of the images before the modeling stage. In our architecture search strategy, different primitive operations are suggested to find down- & up-sampling Unet cell blocks, and the binary gate method has been applied to make the search strategy more practical. Our architecture search method is empirically evaluated by training the Unet and NAS-Unet from scratch. Specifically, the proposed NAS-Unet significantly outperforms the baseline human-designed architecture, achieving 95.1% in the mean Intersection over Union metric and 79.1% in the Dice similarity coefficient.
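For readers unfamiliar with the two segmentation metrics named above, here is a minimal sketch of their standard definitions for a single binary mask: IoU = |A ∩ B| / |A ∪ B| and Dice = 2|A ∩ B| / (|A| + |B|). Mean IoU averages the IoU over classes. The function name, the mask format, and the convention of returning 1.0 when both masks are empty are assumptions; the paper's exact evaluation protocol is not described in the abstract.

```python
def iou_and_dice(predicted, target):
    """Return (IoU, Dice) for two binary masks given as lists of rows."""
    pred = [bool(v) for row in predicted for v in row]
    targ = [bool(v) for row in target for v in row]
    intersection = sum(1 for p, t in zip(pred, targ) if p and t)
    pred_size, targ_size = sum(pred), sum(targ)
    union = pred_size + targ_size - intersection
    # Assumed convention: two empty masks count as a perfect match.
    iou = intersection / union if union else 1.0
    dice = 2 * intersection / (pred_size + targ_size) if union else 1.0
    return iou, dice


# The prediction covers two pixels, one of which matches the target.
predicted = [[1, 1],
             [0, 0]]
target = [[1, 0],
          [0, 0]]
iou, dice = iou_and_dice(predicted, target)
```

In this toy case the intersection is one pixel and the union is two, giving an IoU of 0.5 and a Dice coefficient of 2/3.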

Tarsier: Evolving Noise Injection in Super-Resolution GANs

Baptiste Roziere, Nathanaël Carraz Rakotonirina, Vlad Hosu, Rasoanaivo Andry, Hanhe Lin, Camille Couprie, Olivier Teytaud

Auto-TLDR; Evolutionary Super-Resolution using Diagonal CMA

Super-resolution aims at increasing the resolution and level of detail within an image. The current state of the art in general single-image super-resolution is held by nESRGAN+, which injects Gaussian noise after each residual layer at training time. In this paper, we harness evolutionary methods to improve nESRGAN+ by optimizing the noise injection at inference time. More precisely, we use Diagonal CMA to optimize the injected noise according to a novel criterion combining quality assessment and realism. Our results are validated by the PIRM perceptual score and a human study. Our method outperforms nESRGAN+ on several standard super-resolution datasets. More generally, our approach can be used to optimize any method based on noise injection.

Efficient-Receptive Field Block with Group Spatial Attention Mechanism for Object Detection

Jiacheng Zhang, Zhicheng Zhao, Fei Su

Auto-TLDR; E-RFB: Efficient-Receptive Field Block for Deep Neural Network for Object Detection

Object detection has received growing attention in the computer vision field. Convolutional Neural Networks (CNNs) extract high-level semantic features of images, which directly determine the performance of object detection. As a common solution, embedding integration modules into CNNs can enrich extracted features and thereby improve the performance. However, instability and inconsistency among the internal branches exist in these modules. To address this problem, we propose a novel multibranch module called Efficient-Receptive Field Block (E-RFB), in which multiple levels of features are combined for network optimization. First, by downsampling and increasing depth, the E-RFB provides a sufficient receptive field (RF). Second, in order to eliminate the inconsistency across different branches, a novel spatial attention mechanism, namely the Group Spatial Attention Module (GSAM), is proposed. The GSAM gradually narrows a feature map by channel grouping; thus it encodes the information between spatial and channel dimensions into the final attention heat map. Third, the proposed module can easily be added to various CNNs to enhance feature representation as a plug-and-play component. With SSD-style detectors, our method halves the parameters of the original detection head and achieves high accuracy on the PASCAL VOC and MS COCO datasets. Moreover, the proposed method achieves superior performance compared with state-of-the-art methods based on a similar framework.

Deep Residual Attention Network for Hyperspectral Image Reconstruction

Kohei Yorimoto, Xian-Hua Han

Auto-TLDR; Deep Convolutional Neural Network for Hyperspectral Image Reconstruction from a Snapshot

Coded aperture snapshot spectral imaging (CASSI) captures a full-frame spectral image as a single compressive image, so the underlying hyperspectral image (HSI) must be reconstructed from the snapshot as a post-processing step, which is a challenging inverse problem due to its ill-posed nature. Existing methods for HSI reconstruction from a snapshot usually employ optimization to solve the formulated image degradation model regularized with empirically designed priors, and still cannot achieve sufficient reconstruction accuracy for real HS image analysis systems. Motivated by the recent advances of deep learning in various inverse problems, deep learning based HSI reconstruction has attracted a lot of attention and can boost reconstruction performance. This study proposes a novel deep convolutional neural network (DCNN) based framework for effectively learning the spatial structure and spectral attributes of the underlying HSI with reciprocal spatial and spectral modules. Further, to adaptively leverage the useful learned features for better HSI reconstruction, we integrate residual attention modules into our DCNN by exploring both spatial and spectral attention maps. Experimental results on two benchmark HSI datasets show that our method outperforms state-of-the-art methods in both quantitative values and visual effect.

CURL: Neural Curve Layers for Global Image Enhancement

Sean Moran, Steven Mcdonagh, Greg Slabaugh

Auto-TLDR; CURL: Neural CURve Layers for Image Enhancement

We present a novel approach to adjust global image properties such as colour, saturation, and luminance using human-interpretable image enhancement curves, inspired by the Photoshop curves tool. Our method, dubbed neural CURve Layers (CURL), is designed as a multi-colour space neural retouching block trained jointly in three different colour spaces (HSV, CIELab, RGB) guided by a novel multi-colour space loss. The curves are fully differentiable and are trained end-to-end for different computer vision problems including photo enhancement (RGB-to-RGB) and as part of the image signal processing pipeline for image formation (RAW-to-RGB). To demonstrate the effectiveness of CURL we combine this global image transformation block with a pixel-level (local) image multi-scale encoder-decoder backbone network. In an extensive experimental evaluation we show that CURL produces state-of-the-art image quality versus recently proposed deep learning approaches in both objective and perceptual metrics, setting new state-of-the-art performance on multiple public datasets.

VGG-Embedded Adaptive Layer-Normalized Crowd Counting Net with Scale-Shuffling Modules

Dewen Guo, Jie Feng, Bingfeng Zhou

Auto-TLDR; VadaLN: VGG-embedded Adaptive Layer Normalization for Crowd Counting

Crowd counting is widely used in real-time congestion monitoring and public security. Due to the limited data, many methods generalize poorly because the differences between feature domains are not taken into consideration. We propose VGG-embedded adaptive layer normalization (VadaLN) to filter out the features that are irrelevant to the counting task so that the counting results are not affected by image quality, color or illumination. VadaLN is implemented on the pretrained VGG-16 backbone. Our method requires no additional learnable parameters. VadaLN incorporates the proposed scale-shuffling modules (SSM) to relax the distortions in upsampling operations. Besides, a non-aligned training methodology for the estimation of density maps is leveraged through an adversarial contextual loss (ACL) to improve the counting performance. Based on the proposed method, we construct an end-to-end trainable baseline model without bells and whistles, namely VadaLNet, which outperforms several recent state-of-the-art methods on commonly used challenging standard benchmarks. The intermediate scale-shuffled results are combined to formulate a scale-complementary strategy, yielding a more powerful network, namely VadaLNeSt. We evaluate VadaLNeSt on standard benchmarks, e.g. ShanghaiTech (Part A & Part B), UCF_CC_50, and UCF_QNRF, to show the superiority of our method.

Dynamic Low-Light Image Enhancement for Object Detection Via End-To-End Training

Haifeng Guo, Yirui Wu, Tong Lu

Auto-TLDR; Object Detection using Low-Light Image Enhancement for End-to-End Training

Object detection based on convolutional neural networks is a hot research topic in computer vision. The illumination component of an image has a great impact on object detection, and detection performance declines sharply under low-light conditions. Using a low-light image enhancement technique as a pre-processing mechanism can improve image quality and yield better detection results. However, due to the complexity of low-light environments, the existing enhancement methods may have negative effects on some samples. Therefore, it is difficult to improve the overall detection performance in low-light conditions. In this paper, our goal is to use image enhancement to improve object detection performance rather than perceptual quality for humans. We propose a novel framework that combines low-light enhancement and object detection for end-to-end training. The framework can dynamically select different enhancement subnetworks for each sample to improve the performance of the detector. Our proposed method consists of two stages: the enhancement stage and the detection stage. The enhancement stage dynamically enhances the low-light images under the supervision of several enhancement methods and outputs the corresponding weights. During the detection stage, the weights offer information on object classification to generate high-quality region proposals, which in turn results in accurate detection. Our experiments present promising results, which show that the proposed method can significantly improve detection performance in low-light environments.

Encoder-Decoder Based Convolutional Neural Networks with Multi-Scale-Aware Modules for Crowd Counting

Pongpisit Thanasutives, Ken-Ichi Fukui, Masayuki Numao, Boonserm Kijsirikul

Auto-TLDR; M-SFANet and M-SegNet for Crowd Counting Using Multi-Scale Fusion Networks

In this paper, we propose two modified neural networks based on the dual path multi-scale fusion network (SFANet) and SegNet for accurate and efficient crowd counting. Inspired by SFANet, the first model, named M-SFANet, is equipped with atrous spatial pyramid pooling (ASPP) and a context-aware module (CAN). The encoder of M-SFANet is enhanced with ASPP, which contains parallel atrous convolutional layers with different sampling rates and is hence able to extract multi-scale features of the target object and incorporate larger context. To further deal with scale variation throughout an input image, we leverage the CAN module, which adaptively encodes the scales of the contextual information. The combination yields an effective model for counting in both dense and sparse crowd scenes. Based on the SFANet decoder structure, M-SFANet's decoder has dual paths, for density map and attention map generation. The second model, called M-SegNet, is produced by replacing the bilinear upsampling in SFANet with the max unpooling used in SegNet. This change yields a faster model while providing competitive counting performance. Designed for high-speed surveillance applications, M-SegNet has no additional multi-scale-aware module, so as not to increase the complexity. Both models are encoder-decoder based architectures and are end-to-end trainable. We conduct extensive experiments on five crowd counting datasets and one vehicle counting dataset to show that these modifications yield algorithms that could improve state-of-the-art crowd counting methods.

Enhanced Feature Pyramid Network for Semantic Segmentation

Mucong Ye, Ouyang Jinpeng, Ge Chen, Jing Zhang, Xiaogang Yu

Auto-TLDR; EFPN: Enhanced Feature Pyramid Network for Semantic Segmentation

Multi-scale feature fusion has been an effective way of improving the performance of semantic segmentation. However, current methods generally fail to consider the semantic gaps between the shallow (low-level) and deep (high-level) features, and thus the fusion methods may not be optimal. In this paper, to address the semantic gap between features from different layers, we propose a unified framework based on the U-shape encoder-decoder architecture, named Enhanced Feature Pyramid Network (EFPN). Specifically, the semantic enhancement module (SEM), boundary extraction module (BEM), and context aggregation model (CAM) are incorporated into the decoder network to improve the robustness of multi-level feature aggregation. In addition, a global fusion model (GFM) in the encoder branch is proposed to capture more semantic information in the deep layers and effectively transmit the high-level semantic features to each layer. Extensive experiments are conducted, and the results show that the proposed framework achieves state-of-the-art results on three public datasets, namely PASCAL VOC 2012, Cityscapes, and PASCAL Context. Furthermore, we also demonstrate that the proposed method is effective for other visual tasks that require frequent feature fusion and upsampling.

Fast and Accurate Real-Time Semantic Segmentation with Dilated Asymmetric Convolutions

Leonel Rosas-Arias, Gibran Benitez-Garcia, Jose Portillo-Portillo, Gabriel Sanchez-Perez, Keiji Yanai

Auto-TLDR; FASSD-Net: Dilated Asymmetric Pyramidal Fusion for Real-Time Semantic Segmentation

Recent works have shown promising results on real-time semantic segmentation tasks. To maintain fast inference speed, most existing networks use light decoders or no decoder at all; however, their accuracy is significantly lower than that of non-real-time semantic segmentation networks. In this paper, we introduce two key modules for designing a high-performance decoder for real-time semantic segmentation that reduces the accuracy gap between real-time and non-real-time segmentation networks. Our first module, Dilated Asymmetric Pyramidal Fusion (DAPF), is designed to substantially increase the receptive field on top of the last stage of the encoder, obtaining richer contextual features. Our second module, the Multi-resolution Dilated Asymmetric (MDA) module, fuses and refines detail and contextual information from multi-scale feature maps coming from early and deeper stages of the network. Both modules exploit contextual information without excessively increasing the computational complexity by using asymmetric convolutions. Our proposed network, entitled “FASSD-Net”, reaches 78.8% mIoU on the Cityscapes validation dataset at 41.1 FPS on full resolution images (1024x2048). Besides, with a light version of our network, we reach 74.1% mIoU at 133.1 FPS (full resolution) on a single NVIDIA GTX 1080Ti card with no additional acceleration techniques. The source code and pre-trained models are available at https://github.com/GibranBenitez/FASSD-Net.

Edge-Guided CNN for Denoising Images from Portable Ultrasound Devices

Yingnan Ma, Fei Yang, Anup Basu

Auto-TLDR; Edge-Guided Convolutional Neural Network for Portable Ultrasound Images

Ultrasound is a non-invasive tool that is useful for medical diagnosis and treatment. To reduce long wait times and add convenience to patients, portable ultrasound scanning devices are becoming increasingly popular. These devices can be held in one palm, and are compatible with modern cell phones. However, the quality of ultrasound images captured from the portable scanners is relatively poor compared to standard ultrasound scanning systems in hospitals. To improve the quality of the ultrasound images obtained from portable ultrasound devices, we propose a new neural network architecture called Edge-Guided Convolutional Neural Network (EGCNN), which can preserve significant edge information in ultrasound images when removing noise. We also study and compare the effectiveness of classical filtering approaches in removing speckle noise in these images. Experimental results show that after applying the proposed EGCNN, various organs can be better recognized from ultrasound images. This approach is expected to lead to better accuracy in diagnostics in the future.

Progressive Scene Segmentation Based on Self-Attention Mechanism

Yunyi Pan, Yuan Gan, Kun Liu, Yan Zhang

Auto-TLDR; Two-Stage Semantic Scene Segmentation with Self-Attention

Semantic scene segmentation is vital for a large variety of applications as it enables understanding of 3D data. Many current approaches based on point clouds ignore the mathematical distribution of points and treat all points equally. Methods following this direction neglect the sample imbalance problem that naturally exists in scenes. To avoid these issues, we propose a two-stage semantic scene segmentation framework based on a self-attention mechanism and achieve state-of-the-art performance on 3D scene understanding tasks. We split the whole task into two smaller ones, which effectively relieves the sample imbalance issue. In addition, we design a new self-attention block that can be inserted into submanifold convolution networks to model the long-range dependencies that exist among points. The proposed network consists of an encoder and a decoder, with spatial-wise and channel-wise attention modules inserted. The two-stage network shares a U-Net architecture and is an end-to-end trainable framework that predicts the semantic labels for the scene point clouds fed into it. Experiments on standard benchmarks of 3D scenes indicate that our network performs on par with or better than existing state-of-the-art methods.

GSTO: Gated Scale-Transfer Operation for Multi-Scale Feature Learning in Semantic Segmentation

Zhuoying Wang, Yongtao Wang, Zhi Tang, Yangyan Li, Ying Chen, Haibin Ling, Weisi Lin

Auto-TLDR; Gated Scale-Transfer Operation for Semantic Segmentation

Existing CNN-based methods for semantic segmentation depend heavily on multi-scale features to meet the requirements of both semantic comprehension and detail preservation. State-of-the-art segmentation networks widely exploit conventional scale-transfer operations, i.e., up-sampling and down-sampling, to learn multi-scale features. In this work, we find that these operations lead to scale-confused features and suboptimal performance because they are spatially invariant and directly transfer all feature information across scales without spatial selection. To address this issue, we propose the Gated Scale-Transfer Operation (GSTO) to properly transfer spatially filtered features to another scale. Specifically, GSTO can work either with or without extra supervision. Unsupervised GSTO is learned from the feature itself, while the supervised one is guided by the supervised probability matrix. Both forms of GSTO are lightweight and plug-and-play, and can be flexibly integrated into networks or modules for learning better multi-scale features. In particular, by plugging GSTO into HRNet, we obtain a more powerful backbone (namely GSTO-HRNet) for pixel labeling, which achieves new state-of-the-art results on multiple benchmarks for semantic segmentation, including Cityscapes, LIP and Pascal Context, with negligible extra computational cost. Moreover, experimental results demonstrate that GSTO can also significantly boost the performance of multi-scale feature aggregation modules like PPM and ASPP.
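
The idea of gating features spatially before transferring them to another scale can be sketched as follows. This is only a toy illustration under stated assumptions, not the GSTO implementation: the gate here is a sigmoid of a scalar affine function of each value (standing in for a learned gate predicted from the feature itself), the scale transfer is a 2x2 average pooling, and all function names and parameters are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gate_features(feature, weight, bias):
    """Scale each spatial location by a gate in (0, 1) computed from the feature itself."""
    return [[x * sigmoid(weight * x + bias) for x in row] for row in feature]


def avg_pool_2x2(feature):
    """Down-sample by averaging non-overlapping 2x2 windows (even height and width assumed)."""
    h, w = len(feature), len(feature[0])
    return [[(feature[i][j] + feature[i][j + 1]
              + feature[i + 1][j] + feature[i + 1][j + 1]) / 4.0
             for j in range(0, w, 2)]
            for i in range(0, h, 2)]


def gated_downsample(feature, weight=1.0, bias=0.0):
    """Apply the spatial gate first, then transfer the filtered feature to the coarser scale."""
    return avg_pool_2x2(gate_features(feature, weight, bias))


feature = [[4.0, 4.0, -4.0, -4.0],
           [4.0, 4.0, -4.0, -4.0]]
plain = avg_pool_2x2(feature)      # every location is passed on with equal weight
gated = gated_downsample(feature)  # locations with a low gate value are suppressed
print(plain, gated)
```

In this toy input, plain pooling passes both regions on unchanged in magnitude, while the gated version strongly attenuates the region whose gate value is close to zero.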

Deeply-Fused Attentive Network for Stereo Matching

Zuliu Yang, Xindong Ai, Weida Yang, Yong Zhao, Qifei Dai, Fuchi Li

Auto-TLDR; DF-Net: Deep Learning-based Network for Stereo Matching

In this paper, we propose a novel learning-based network for stereo matching called DF-Net, which makes three main contributions that are experimentally shown to have practical merit. First, we further increase the accuracy by using the deeply fused spatial pyramid pooling (DF-SPP) module, which acquires continuous multi-scale context information in both parallel and cascade manners. Second, we introduce a channel attention block to dynamically boost the informative features. Finally, we propose a stacked encoder-decoder structure with a 3D attention gate for cost regularization. More precisely, the module fuses the encoded features into the next encoder-decoder structure under the supervision of the attention gate with long-range skip connections, and thus exploits deep and hierarchical context information for disparity prediction. The performance on the SceneFlow and KITTI datasets shows that our model generates better results than several state-of-the-art algorithms.

Free-Form Image Inpainting Via Contrastive Attention Network

Xin Ma, Xiaoqiang Zhou, Huaibo Huang, Zhenhua Chai, Xiaolin Wei, Ran He

Auto-TLDR; Self-supervised Siamese inference for image inpainting

Most deep learning based image inpainting approaches adopt an autoencoder or its variants to fill missing regions in images. Encoders are usually utilized to learn powerful representational spaces, which are important for dealing with sophisticated learning tasks. Specifically, in the image inpainting task, masks of any shape can appear anywhere in images (i.e., free-form masks), forming complex patterns. It is difficult for encoders to capture such powerful representations in this complex situation. To tackle this problem, we propose a self-supervised Siamese inference network to improve robustness and generalization. Moreover, the restored image usually cannot be harmoniously integrated into the existing content, especially in the boundary area. To address this problem, we propose a novel Dual Attention Fusion module (DAF), which combines the restored and known regions in a smoother way and can be inserted into decoder layers in a plug-and-play way. DAF is developed not only to adaptively rescale channel-wise features by taking interdependencies between channels into account, but also to force deep convolutional neural networks (CNNs) to focus more on unknown regions. In this way, the unknown region is naturally filled from the outside to the inside. Qualitative and quantitative experiments on multiple datasets, including facial and natural datasets (i.e., Celeb-HQ, Paris Street View, Places2 and ImageNet), demonstrate that our proposed method outperforms state-of-the-art methods in generating high-quality inpainting results.

Adaptive Image Compression Using GAN Based Semantic-Perceptual Residual Compensation

Ruojing Wang, Zitang Sun, Sei-Ichiro Kamata, Weili Chen

Auto-TLDR; Adaptive Image Compression using GAN based Semantic-Perceptual Residual Compensation

Image compression is a basic task in image processing. In this paper, we present an adaptive image compression algorithm that relies on GAN based semantic-perceptual residual compensation and can offer visually pleasing reconstruction at a low bitrate. Our method adopts a U-shaped encoding and decoding structure accompanied by a well-designed dense residual connection with a strip pooling module to improve the original auto-encoder. In addition, we introduce adversarial learning by adding a discriminator, thus constructing a complete GAN. To improve the coding efficiency, we design an adaptive semantic-perception residual compensation block based on the Grad-CAM algorithm. To improve the quantizer, we embed soft quantization to mitigate the problem that quantization hinders back propagation. In addition, we use the latest FLIF lossless compression algorithm and BPG vector compression algorithm to perform deeper compression on the image. More importantly, experimental results, including PSNR and MS-SSIM, demonstrate that the proposed approach outperforms the current state-of-the-art image compression methods.
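
The abstract does not state which soft-quantization formulation is used, so the sketch below shows one common variant purely as an assumption: each value is replaced by a softmax-weighted average of the quantization centers, which is smooth in the input, whereas hard nearest-center rounding is piecewise constant. The centers, the `sigma` temperature, and the function names are illustrative.

```python
import math


def hard_quantize(z, centers):
    """Nearest-center rounding: piecewise constant, so its gradient is zero almost everywhere."""
    return min(centers, key=lambda c: abs(z - c))


def soft_quantize(z, centers, sigma=1.0):
    """Softmax-weighted average of the centers (assumed formulation).

    Larger sigma makes the output approach hard_quantize; smaller sigma
    gives a smoother function that is easier to train through.
    """
    logits = [-sigma * (z - c) ** 2 for c in centers]
    m = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    return sum(w * c for w, c in zip(weights, centers)) / total


centers = [-1.0, 0.0, 1.0]
z = 0.3
print(hard_quantize(z, centers))
print(soft_quantize(z, centers, sigma=1.0))
print(soft_quantize(z, centers, sigma=1000.0))
```

A typical use of such a relaxation is to apply the soft version when computing gradients during training and the hard version when actually encoding; whether the paper does this is not stated in the abstract.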

Bidirectional Matrix Feature Pyramid Network for Object Detection

Wei Xu, Yi Gan, Jianbo Su

Auto-TLDR; BMFPN: Bidirectional Matrix Feature Pyramid Network for Object Detection

Feature pyramids are widely used to improve scale invariance for object detection. Most methods just map the objects to feature maps with relevant square receptive fields, but rarely pay attention to aspect ratio variation, which is also an important property of object instances. This leads to a poor match between rectangular objects and assigned features with square receptive fields, thus preventing accurate recognition and localization. Besides, the information propagation among feature layers is sparse; that is, each feature in the pyramid may mainly or only contain single-level information, which is not representative enough for the classification and localization sub-tasks. In this paper, the Bidirectional Matrix Feature Pyramid Network (BMFPN) is proposed to address these issues. It consists of three modules: Diagonal Layer Generation Module (DLGM), Top-down Module (TDM) and Bottom-up Module (BUM). First, multi-level features extracted by the backbone are fed into DLGM to produce the base features. Then these base features are used to construct the final feature pyramid through TDM and BUM in series. The receptive fields of the designed feature layers in BMFPN have various scales and aspect ratios. Objects can be correctly assigned to appropriate and representative feature maps with relevant receptive fields depending on their scale and aspect ratio properties. Moreover, TDM and BUM form a bidirectional and reticular information flow, which effectively fuses multi-level information in a top-down and bottom-up manner, respectively. To evaluate the effectiveness of our proposed architecture, an end-to-end anchor-free detector is designed and trained by integrating BMFPN into FCOS. The center-ness branch in FCOS is also replaced with our Gaussian center-ness branch (GCB), which brings another slight improvement. Without bells and whistles, our method gains +3.3%, +2.4% and +2.6% AP on the MS COCO dataset over baselines with ResNet-50, ResNet-101 and ResNeXt-101 backbones, respectively.
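
For context on the branch being modified, the sketch below computes the standard FCOS center-ness target for a location inside a box, given its distances to the left, top, right and bottom sides. The abstract does not give the formula for the Gaussian center-ness branch (GCB), so only the original FCOS definition is shown; the function name is an illustrative choice.

```python
import math


def fcos_centerness(l, t, r, b):
    """Standard FCOS center-ness target for a location inside a box.

    l, t, r, b are the (positive) distances from the location to the
    left, top, right and bottom sides of the box. The value is 1.0 at
    the box center and decays toward 0 near the box border.
    """
    horizontal = min(l, r) / max(l, r)
    vertical = min(t, b) / max(t, b)
    return math.sqrt(horizontal * vertical)


print(fcos_centerness(5.0, 5.0, 5.0, 5.0))  # location at the box center
print(fcos_centerness(1.0, 5.0, 9.0, 5.0))  # location shifted toward the left side
```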