Context-Aware Residual Module for Image Classification

Jing Bai, Ran Chen

Responsive image

Auto-TLDR; Context-Aware Residual Module for Image Classification

Slides Poster

Attention module has achieved great success in numerous vision tasks. However, existing visual attention modules generally consider the features of a single-scale, and cannot make full use of their multi-scale contextual information. Meanwhile, the multi-scale spatial feature representation has demonstrated its outstanding performance in a wide range of applications. However, the multi-scale features are always represented in a layer-wise manner, i.e. it is impossible to know their contextual information at a granular level. Focusing on the above issue, a context-aware residual module for image classification is proposed in this paper. It consists of a novel multi-scale channel attention module MSCAM to learn refined channel weights by considering the visual features of its own scale and its surrounding fields, and a multi-scale spatial aware module MSSAM to further capture more spatial information. Either or both of the two modules can be plugged into any CNN-based backbone image classification architecture with a short residual connection to obtain the context-aware enhanced features. The experiments on public image recognition datasets including CIFAR10, CIFAR100,Tiny-ImageNet and ImageNet consistently demonstrate that our proposed modules significantly outperforms a wide-used state-of-the-art methods, e.g., ResNet and the lightweight networks of MobileNet and SqueezeeNet.

Similar papers

Attention Pyramid Module for Scene Recognition

Zhinan Qiao, Xiaohui Yuan, Chengyuan Zhuang, Abolfazl Meyarian

Responsive image

Auto-TLDR; Attention Pyramid Module for Multi-Scale Scene Recognition

Slides Poster Similar

The unrestricted open vocabulary and diverse substances of scenery images bring significant challenges to scene recognition. However, most deep learning architectures and attention methods are developed on general-purpose datasets and omit the characteristics of scene data. In this paper, we exploit the attention pyramid module (APM) to tackle the predicament of scene recognition. Our method streamlines the multi-scale scene recognition pipeline, learns comprehensive scene features at various scales and locations, addresses the interdependency among scales, and further assists feature re-calibration as well as aggregation process. APM is extremely light-weighted and can be easily plugged into existing network architectures in a parameter-efficient manner. By simply integrating APM into ResNet-50, we obtain a 3.54\% boost in terms of top-1 accuracy on the benchmark scene dataset. Comprehensive experiments show that APM achieves better performance comparing with state-of-the-art attention methods using significant less computation budget. Code and pre-trained models will be made publicly available.

PSDNet: A Balanced Architecture of Accuracy and Parameters for Semantic Segmentation

Yue Liu, Zhichao Lian

Responsive image

Auto-TLDR; Pyramid Pooling Module with SE1Cblock and D2SUpsample Network (PSDNet)

Slides Poster Similar

Abstract—In this paper, we present our Pyramid Pooling Module (PPM) with SE1Cblock and D2SUpsample Network (PSDNet), a novel architecture for accurate semantic segmentation. Started from the known work called Pyramid Scene Parsing Network (PSPNet), PSDNet takes advantage of pyramid pooling structure with channel attention module and feature transform module in Pyramid Pooling Module (PPM). The enhanced PPM with these two components can strengthen context information flowing in the network instead of damaging it. The channel attention module we mentioned is an improved “Squeeze and Excitation with 1D Convolution” (SE1C) block which can explicitly model interrelationship between channels with fewer number of parameters. We propose a feature transform module named “Depth to Space Upsampling” (D2SUpsample) in the PPM which keeps integrity of features by transforming features while interpolating features, at the same time reducing parameters. In addition, we introduce a joint strategy in SE1Cblock which combines two variants of global pooling without increasing parameters. Compared with PSPNet, our work achieves higher accuracy on public datasets with 73.97% mIoU and 82.89% mAcc accuracy on Cityscapes Dataset based on ResNet50 backbone.

Efficient-Receptive Field Block with Group Spatial Attention Mechanism for Object Detection

Jiacheng Zhang, Zhicheng Zhao, Fei Su

Responsive image

Auto-TLDR; E-RFB: Efficient-Receptive Field Block for Deep Neural Network for Object Detection

Slides Poster Similar

Object detection has been paid rising attention in computer vision field. Convolutional Neural Networks (CNNs) extract high-level semantic features of images, which directly determine the performance of object detection. As a common solution, embedding integration modules into CNNs can enrich extracted features and thereby improve the performance. However, the instability and inconsistency of internal multiple branches exist in these modules. To address this problem, we propose a novel multibranch module called Efficient-Receptive Field Block (E-RFB), in which multiple levels of features are combined for network optimization. Specifically, by downsampling and increasing depth, the E-RFB provides sufficient RF. Second, in order to eliminate the inconsistency across different branches, a novel spatial attention mechanism, namely, Group Spatial Attention Module (GSAM) is proposed. The GSAM gradually narrows a feature map by channel grouping; thus it encodes the information between spatial and channel dimensions into the final attention heat map. Third, the proposed module can be easily joined in various CNNs to enhance feature representation as a plug-and-play component. With SSD-style detectors, our method halves the parameters of the original detection head and achieves high accuracy on the PASCAL VOC and MS COCO datasets. Moreover, the proposed method achieves superior performance compared with state-of-the-art methods based on similar framework.

Attention As Activation

Yimian Dai, Stefan Oehmcke, Fabian Gieseke, Yiquan Wu, Kobus Barnard

Responsive image

Auto-TLDR; Attentional Activation Units for Convolutional Networks

Slides Similar

Activation functions and attention mechanisms are typically treated as having different purposes and have evolved differently. However, both concepts can be formulated as a non-linear gating function. Inspired by their similarity, we propose a novel type of activation units called attentional activation~(ATAC) units as a unification of activation functions and attention mechanisms. In particular, we propose a local channel attention module for the simultaneous non-linear activation and element-wise feature refinement, which locally aggregates point-wise cross-channel feature contexts. By replacing the well-known rectified linear units by such ATAC units in convolutional networks, we can construct fully attentional networks that perform significantly better with a modest number of additional parameters. We conducted detailed ablation studies on the ATAC units using several host networks with varying network depths to empirically verify the effectiveness and efficiency of the units. Furthermore, we compared the performance of the ATAC units against existing activation functions as well as other attention mechanisms on the CIFAR-10, CIFAR-100, and ImageNet datasets. Our experimental results show that networks constructed with the proposed ATAC units generally yield performance gains over their competitors given a comparable number of parameters.

Dual-Attention Guided Dropblock Module for Weakly Supervised Object Localization

Junhui Yin, Siqing Zhang, Dongliang Chang, Zhanyu Ma, Jun Guo

Responsive image

Auto-TLDR; Dual-Attention Guided Dropblock for Weakly Supervised Object Localization

Slides Poster Similar

Attention mechanisms is frequently used to learn the discriminative features for better feature representations. In this paper, we extend the attention mechanism to the task of weakly supervised object localization (WSOL) and propose the dual-attention guided dropblock module (DGDM), which aims at learning the informative and complementary visual patterns for WSOL. This module contains two key components, the channel attention guided dropout (CAGD) and the spatial attention guided dropblock (SAGD). To model channel interdependencies, the CAGD ranks the channel attentions and treats the top-k attentions with the largest magnitudes as the important ones. It also keeps some low-valued elements to increase their value if they become important during training. The SAGD can efficiently remove the most discriminative information by erasing the contiguous regions of feature maps rather than individual pixels. This guides the model to capture the less discriminative parts for classification. Furthermore, it can also distinguish the foreground objects from the background regions to alleviate the attention misdirection. Experimental results demonstrate that the proposed method achieves new state-of-the-art localization performance.

Enhanced Feature Pyramid Network for Semantic Segmentation

Mucong Ye, Ouyang Jinpeng, Ge Chen, Jing Zhang, Xiaogang Yu

Responsive image

Auto-TLDR; EFPN: Enhanced Feature Pyramid Network for Semantic Segmentation

Slides Poster Similar

Multi-scale feature fusion has been an effective way for improving the performance of semantic segmentation. However, current methods generally fail to consider the semantic gaps between the shallow (low-level) and deep (high-level) features and thus the fusion methods may not be optimal. In this paper, to address the issues of the semantic gap between the feature from different layers, we propose a unified framework based on the U-shape encoder-decoder architecture, named Enhanced Feature Pyramid Network (EFPN). Specifically, the semantic enhancement module (SEM), boundary extraction module (BEM), and context aggregation model (CAM) are incorporated into the decoder network to improve the robustness of the multi-level features aggregation. In addition, a global fusion model (GFM) in encoder branch is proposed to capture more semantic information in the deep layers and effectively transmit the high-level semantic features to each layer. Extensive experiments are conducted and the results show that the proposed framework achieves the state-of-the-art results on three public datasets, namely PASCAL VOC 2012, Cityscapes, and PASCAL Context. Furthermore, we also demonstrate that the proposed method is effective for other visual tasks that require frequent fusing features and upsampling.

Global-Local Attention Network for Semantic Segmentation in Aerial Images

Minglong Li, Lianlei Shan, Weiqiang Wang

Responsive image

Auto-TLDR; GLANet: Global-Local Attention Network for Semantic Segmentation

Slides Poster Similar

Errors in semantic segmentation task could be classified into two types: large area misclassification and local inaccurate boundaries. Previously attention based methods capture rich global contextual information, this is beneficial to diminish the first type of error, but local imprecision still exists. In this paper we propose Global-Local Attention Network (GLANet) with a simultaneous consideration of global context and local details. Specifically, our GLANet is composed of two branches namely global attention branch and local attention branch, and three different modules are embedded in the two branches for the purpose of modeling semantic interdependencies in spatial, channel and boundary dimensions respectively. We sum the outputs of the two branches to further improve feature representation, leading to more precise segmentation results. The proposed method achieves very competitive segmentation accuracy on two public aerial image datasets, bringing significant improvements over baseline.

Transitional Asymmetric Non-Local Neural Networks for Real-World Dirt Road Segmentation

Yooseung Wang, Jihun Park

Responsive image

Auto-TLDR; Transitional Asymmetric Non-Local Neural Networks for Semantic Segmentation on Dirt Roads

Slides Poster Similar

Understanding images by predicting pixel-level semantic classes is a fundamental task in computer vision and is one of the most important techniques for autonomous driving. Recent approaches based on deep convolutional neural networks have dramatically improved the speed and accuracy of semantic segmentation on paved road datasets, however, dirt roads have yet to be systematically studied. Dirt roads do not contain clear boundaries between drivable and non-drivable regions; and thus, this difficulty must be overcome for the realization of fully autonomous vehicles. The key idea of our approach is to apply lightweight non-local blocks to reinforce stage-wise long-range dependencies in encoder-decoder style backbone networks. Experiments on 4,687 images of a dirt road dataset show that our transitional asymmetric non-local neural networks present a higher accuracy with lower computational costs compared to state-of-the-art models.

Real-Time Semantic Segmentation Via Region and Pixel Context Network

Yajun Li, Yazhou Liu, Quansen Sun

Responsive image

Auto-TLDR; A Dual Context Network for Real-Time Semantic Segmentation

Slides Poster Similar

Real-time semantic segmentation is a challenging task as both segmentation accuracy and inference speed need to be considered at the same time. In this paper, we present a Dual Context Network (DCNet) to address this challenge. It contains two independent sub-networks: Region Context Network and Pixel Context Network. Region Context Network is main network with low-resolution input and feature re-weighting module to achieve sufficient receptive field. Meanwhile, Pixel Context Network with location attention module to capture the location dependencies of each pixel for assisting the main network to recover spatial detail. A contextual feature fusion is introduced to combine output features of these two sub-networks. The experiments show that DCNet can achieve high-quality segmentation while keeping a high speed. Specifically, for Cityscapes test dataset, we achieve 76.1% Mean IOU with the speed of 82 FPS on a single GTX 2080Ti GPU when using ResNet50 as backbone, and 71.2% Mean IOU with the speed of 142 FPS when using ResNet18 as backbone.

Hierarchically Aggregated Residual Transformation for Single Image Super Resolution

Zejiang Hou, Sy Kung

Responsive image

Auto-TLDR; HARTnet: Hierarchically Aggregated Residual Transformation for Multi-Scale Super-resolution

Slides Poster Similar

Visual patterns usually appear at different scales/sizes in natural images. Multi-scale feature representation is of great importance for the single-image super-resolution(SISR) task to reconstruct image objects at different scales.However, such characteristic has been rarely considered by CNN-based SISR methods. In this work, we propose a novel build-ing block, i.e. hierarchically aggregated residual transformation(HART), to achieve multi-scale feature representation in each layer of the network. Within each HART block, we connect multiple convolutions in a hierarchical residual-like manner, which greatly expands the range of effective receptive fields and helps to detect image features at different scales. To theoretically understand the proposed HART block, we recast SISR as an optimal control problem and show that HART effectively approximates the classical4th-order Runge-Kutta method, which has the merit of small local truncation error for solving numerical ordinary differential equation. By cascading the proposed HART blocks, we establish our high-performing HARTnet. Comparedwith existing SR state-of-the-arts (including those in NTIRE2019 SR Challenge leaderboard), the proposed HARTnet demonstrates consistent PSNR/SSIM performance improvements on various benchmark datasets under different degradation models.Moreover, HARTnet can efficiently restore more faithful high-resolution images than comparative SR methods (cf. Figure 1).

Single Image Super-Resolution with Dynamic Residual Connection

Karam Park, Jae Woong Soh, Nam Ik Cho

Responsive image

Auto-TLDR; Dynamic Residual Attention Network for Lightweight Single Image Super-Residual Networks

Slides Poster Similar

Deep convolutional neural networks have shown significant improvement in the single image super-resolution (SISR) field. Recently, there have been attempts to solve the SISR problem using lightweight networks, considering limited computational resources for real-world applications. Especially for lightweight networks, balancing between parameter demand and performance is very difficult to adjust, and most lightweight SISR networks are manually designed based on a huge number of brute-force experiments. Besides, a critical key to the network performance relies on the skip connection of building blocks that are repeatedly in the architecture. Notably, in previous works, these connections are pre-defined and manually determined by human researchers. Hence, they are less flexible to the input image statistics, and there can be a better solution for the given number of parameters. Therefore, we focus on the automated design of networks regarding the connection of basic building blocks (residual networks), and as a result, propose a dynamic residual attention network (DRAN). The proposed method allows the network to dynamically select residual paths depending on the input image, based on the idea of attention mechanism. For this, we design a dynamic residual module that determines the residual paths between the basic building blocks for the given input image. By finding optimal residual paths between the blocks, the network can selectively bypass informative features needed to reconstruct the target high-resolution (HR) image. Experimental results show that our proposed DRAN outperforms most of the existing state-of-the-arts lightweight models in SISR.

Fast and Accurate Real-Time Semantic Segmentation with Dilated Asymmetric Convolutions

Leonel Rosas-Arias, Gibran Benitez-Garcia, Jose Portillo-Portillo, Gabriel Sanchez-Perez, Keiji Yanai

Responsive image

Auto-TLDR; FASSD-Net: Dilated Asymmetric Pyramidal Fusion for Real-Time Semantic Segmentation

Slides Poster Similar

Recent works have shown promising results applied to real-time semantic segmentation tasks. To maintain fast inference speed, most of the existing networks make use of light decoders, or they simply do not use them at all. This strategy helps to maintain a fast inference speed; however, their accuracy performance is significantly lower in comparison to non-real-time semantic segmentation networks. In this paper, we introduce two key modules aimed to design a high-performance decoder for real-time semantic segmentation for reducing the accuracy gap between real-time and non-real-time segmentation networks. Our first module, Dilated Asymmetric Pyramidal Fusion (DAPF), is designed to substantially increase the receptive field on the top of the last stage of the encoder, obtaining richer contextual features. Our second module, Multi-resolution Dilated Asymmetric (MDA) module, fuses and refines detail and contextual information from multi-scale feature maps coming from early and deeper stages of the network. Both modules exploit contextual information without excessively increasing the computational complexity by using asymmetric convolutions. Our proposed network entitled “FASSD-Net” reaches 78.8% of mIoU accuracy on the Cityscapes validation dataset at 41.1 FPS on full resolution images (1024x2048). Besides, with a light version of our network, we reach 74.1% of mIoU at 133.1 FPS (full resolution) on a single NVIDIA GTX 1080Ti card with no additional acceleration techniques. The source code and pre-trained models are available at https://github.com/GibranBenitez/FASSD-Net.

Dynamic Multi-Path Neural Network

Yingcheng Su, Yichao Wu, Ken Chen, Ding Liang, Xiaolin Hu

Responsive image

Auto-TLDR; Dynamic Multi-path Neural Network

Slides Similar

Although deeper and larger neural networks have achieved better performance, due to overwhelming burden on computation, they cannot meet the demands of deployment on resource-limited devices. An effective strategy to address this problem is to make use of dynamic inference mechanism, which changes the inference path for different samples at runtime. Existing methods only reduce the depth by skipping an entire specific layer, which may lose important information in this layer. In this paper, we propose a novel method called Dynamic Multi-path Neural Network (DMNN), which provides more topology choices in terms of both width and depth on the fly. For better modelling the inference path selection, we further introduce previous state and object category information to guide the training process. Compared to previous dynamic inference techniques, the proposed method is more flexible and easier to incorporate into most modern network architectures. Experimental results on ImageNet and CIFAR-100 demonstrate the superiority of our method on both efficiency and classification accuracy.

Multi-Scale Residual Pyramid Attention Network for Monocular Depth Estimation

Jing Liu, Xiaona Zhang, Zhaoxin Li, Tianlu Mao

Responsive image

Auto-TLDR; Multi-scale Residual Pyramid Attention Network for Monocular Depth Estimation

Slides Poster Similar

Monocular depth estimation is a challenging problem in computer vision and is crucial for understanding 3D scene geometry. Recently, deep convolutional neural networks (DCNNs) based methods have improved the estimation accuracy significantly. However, existing methods fail to consider complex textures and geometries in scenes, thereby resulting in loss of local details, distorted object boundaries, and blurry reconstruction. In this paper, we proposed an end-to-end Multi-scale Residual Pyramid Attention Network (MRPAN) to mitigate these problems.First,we propose a Multi-scale Attention Context Aggregation (MACA) module, which consists of Spatial Attention Module (SAM) and Global Attention Module (GAM). By considering the position and scale correlation of pixels from spatial and global perspectives, the proposed module can adaptively learn the similarity between pixels so as to obtain more global context information of the image and recover the complex structure in the scene. Then we proposed an improved Residual Refinement Module (RRM) to further refine the scene structure, giving rise to deeper semantic information and retain more local details. Experimental results show that our method achieves more promisin performance in object boundaries and local details compared with other state-of-the-art methods.

Attention Stereo Matching Network

Doudou Zhang, Jing Cai, Yanbing Xue, Zan Gao, Hua Zhang

Responsive image

Auto-TLDR; ASM-Net: Attention Stereo Matching with Disparity Refinement

Slides Poster Similar

Despite great progress, previous stereo matching algorithms still lack the ability to match textureless regions and slender structure areas. To tackle this problem, we propose ASM-Net, an attention stereo matching network. Attention module and disparity refinement module are constructed in the ASMNet. The attention module can improve correlation information between two images by channels and spatial attention.The feature-guided disparity refinement module learns more geometry information in different feature levels to refine the coarse prediction resolution constantly. The proposed approach was evaluated on several benchmark datasets. Experiments show that the proposed method achieves competitive results on KITTI and Scene-Flow datasets while running in real-time at 14ms.

Improved Residual Networks for Image and Video Recognition

Ionut Cosmin Duta, Li Liu, Fan Zhu, Ling Shao

Responsive image

Auto-TLDR; Residual Networks for Deep Learning

Slides Poster Similar

Residual networks (ResNets) represent a powerful type of convolutional neural network (CNN) architecture, widely adopted and used in various tasks. In this work we propose an improved version of ResNets. Our proposed improvements address all three main components of a ResNet: the flow of information through the network layers, the residual building block, and the projection shortcut. We are able to show consistent improvements in accuracy and learning convergence over the baseline. For instance, on ImageNet dataset, using the ResNet with 50 layers, for top-1 accuracy we can report a 1.19% improvement over the baseline in one setting and around 2% boost in another. Importantly, these improvements are obtained without increasing the model complexity. Our proposed approach allows us to train extremely deep networks, while the baseline shows severe optimization issues. We report results on three tasks over six datasets: image classification (ImageNet, CIFAR-10 and CIFAR-100), object detection (COCO) and video action recognition (Kinetics-400 and Something-Something-v2). In the deep learning era, we establish a new milestone for the depth of a CNN. We successfully train a 404-layer deep CNN on the ImageNet dataset and a 3002-layer network on CIFAR-10 and CIFAR-100, while the baseline is not able to converge at such extreme depths. Code is available at: https://github.com/iduta/iresnet

GSTO: Gated Scale-Transfer Operation for Multi-Scale Feature Learning in Semantic Segmentation

Zhuoying Wang, Yongtao Wang, Zhi Tang, Yangyan Li, Ying Chen, Haibin Ling, Weisi Lin

Responsive image

Auto-TLDR; Gated Scale-Transfer Operation for Semantic Segmentation

Slides Poster Similar

Existing CNN-based methods for semantic segmentation heavily depend on multi-scale features to meet the requirements of both semantic comprehension and detail preservation. State-of-the-art segmentation networks widely exploit conventional scale-transfer operations, i.e., up-sampling and down-sampling to learn multi-scale features. In this work, we find that these operations lead to scale-confused features and suboptimal performance because they are spatial-invariant and directly transit all feature information cross scales without spatial selection. To address this issue, we propose the Gated Scale-Transfer Operation (GSTO) to properly transit spatial-filtered features to another scale. Specifically, GSTO can work either with or without extra supervision. Unsupervised GSTO is learned from the feature itself while the supervised one is guided by the supervised probability matrix. Both forms of GSTO are lightweight and plug-and-play, which can be flexibly integrated into networks or modules for learning better multi-scale features. In particular, by plugging GSTO into HRNet, we get a more powerful backbone (namely GSTO-HRNet) for pixel labeling, and it achieves new state-of-the-art results on multiple benchmarks for semantic segmentation including Cityscapes, LIP and Pascal Context, with negligible extra computational cost. Moreover, experiment results demonstrate that GSTO can also significantly boost the performance of multi-scale feature aggregation modules like PPM and ASPP.

Efficient High-Resolution High-Level-Semantic Representation Learning for Human Pose Estimation

Hong Liu, Lisi Guan

Responsive image

Auto-TLDR; Spatial enhanced separated temporal spatial convolutional neural network

Slides Poster Similar

Temporal-spatial information, as the most key issue for human action recognition, has been explored by lots of means, such as 3D convolution network (3DCNN) based or 3DCNN decomposing based approaches. Though the latter can be seen as a trade-off for overcoming the shortage caused by the former for reducing the computation cost and saving parameters, information imbalance of videos between spatial and temporal information is still not been well excavated. To tackle this problem, spatial enhanced separated temporal spatial convolutional neural network (SESTSN) is proposed in this paper, which can easily outperform 3DCNN based and 3DCNN decomposing based methods with fewer parameters. What's more, to further reduce parameter and computation cost, we adopt depth-wise convolution to the proposed SESTSN and propose the channel separated spatial enhanced separated temporal spatial convolutional neural network (CSESTSN). Experiments show that the proposed CSESTSN contains considerably fewer parameters involving much lower computation cost, while it achieves comparable performance to 3D convolution-based methods. Our method outperforms state-of-the-art methods on two challenging datasets, namely NTU RGB+D dataset and Northwestern-UCLA dataset, which verifies the effectiveness of our method.

Residual Fractal Network for Single Image Super Resolution by Widening and Deepening

Jiahang Gu, Zhaowei Qu, Xiaoru Wang, Jiawang Dan, Junwei Sun

Responsive image

Auto-TLDR; Residual fractal convolutional network for single image super-resolution

Slides Poster Similar

The architecture of the convolutional neural network (CNN) plays an important role in single image super-resolution (SISR). However, most models proposed in recent years usually transplant methods or architectures that perform well in other vision fields. Thence they do not combine the characteristics of super-resolution (SR) and ignore the key information brought by the recurring texture feature in the image. To utilize patch-recurrence in SR and the high correlation of texture, we propose a residual fractal convolutional block (RFCB) and expand its depth and width to obtain residual fractal network (RFN), which contains deep residual fractal network (DRFN) and wide residual fractal network (WRFN). RFCB is recursive with multiple branches of magnified receptive field. Through the phased feature fusion module, the network focuses on extracting high-frequency texture feature that repeatedly appear in the image. We also introduce residual in residual (RIR) structure to RFCB that enables abundant low-frequency feature feed into deeper layers and reduce the difficulties of network training. RFN is the first supervised learning method to combine the patch-recurrence characteristic in SISR into network design. Extensive experiments demonstrate that RFN outperforms state-of-the-art SISR methods in terms of both quantitative metrics and visual quality, while the amount of parameters has been greatly optimized.

Boundary-Aware Graph Convolution for Semantic Segmentation

Hanzhe Hu, Jinshi Cui, Jinshi Hongbin Zha

Responsive image

Auto-TLDR; Boundary-Aware Graph Convolution for Semantic Segmentation

Slides Poster Similar

Recent works have made great progress in semantic segmentation by exploiting contextual information in a local or global manner with dilated convolutions, pyramid pooling or self-attention mechanism. However, few works have focused on harvesting boundary information to improve the segmentation performance. In order to enhance the feature similarity within the object and keep discrimination from other objects, we propose a boundary-aware graph convolution (BGC) module to propagate features within the object. The graph reasoning is performed among pixels of the same object apart from the boundary pixels. Based on the proposed BGC module, we further introduce the Boundary-aware Graph Convolution Network(BGCNet), which consists of two main components including a basic segmentation network and the BGC module, forming a coarse-to-fine paradigm. Specifically, the BGC module takes the coarse segmentation feature map as node features and boundary prediction to guide graph construction. After graph convolution, the reasoned feature and the input feature are fused together to get the refined feature, producing the refined segmentation result. We conduct extensive experiments on three popular semantic segmentation benchmarks including Cityscapes, PASCAL VOC 2012 and COCO Stuff, and achieve state-of-the-art performance on all three benchmarks.

Wavelet Attention Embedding Networks for Video Super-Resolution

Young-Ju Choi, Young-Woon Lee, Byung-Gyu Kim

Responsive image

Auto-TLDR; Wavelet Attention Embedding Network for Video Super-Resolution

Slides Poster Similar

Recently, Video super-resolution (VSR) has become more crucial as the resolution of display has been grown. The majority of deep learning-based VSR methods combine the convolutional neural networks (CNN) with motion compensation or alignment module to estimate high-resolution (HR) frame from low-resolution (LR) frames. However, most of previous methods deal with the spatial features equally and may result in the misaligned temporal features by pixel-based motion compensation and alignment module. It can lead to the damaging effect on the accuracy of the estimated HR feature. In this paper, we propose a wavelet attention embedding network (WAEN), including wavelet embedding network (WENet) and attention embedding network (AENet), to fully exploit the spatio-temporal informative features. The WENet is operated as a spatial feature extractor of individual low and high-frequency information based on 2-D Haar discrete wavelet transform. The meaningful temporal feature is extracted in the AENet through utilizing the weighted attention map between frames. Experimental results demonstrate that the proposed method achieves superior performance compared with state-of-the-art methods.

Ordinal Depth Classification Using Region-Based Self-Attention

Minh Hieu Phan, Son Lam Phung, Abdesselam Bouzerdoum

Responsive image

Auto-TLDR; Region-based Self-Attention for Multi-scale Depth Estimation from a Single 2D Image

Slides Poster Similar

Depth estimation from a single 2D image has been widely applied in 3D understanding, 3D modelling and robotics. It is challenging as reliable cues (e.g. stereo correspondences and motions) are not available. Most of the modern approaches exploited multi-scale feature extraction to provide more powerful representations for deep networks. However, these studies have not focused on how to effectively fuse the learned multi-scale features. This paper proposes a novel region-based self-attention (rSA) module. The rSA recalibrates the multi-scale responses by explicitly modelling the interdependency between channels in separate image regions. We discretize continuous depths to solve an ordinal depth classification in which the relative order between categories is significant. We contribute a dataset of 4410 RGB-D images, captured in outdoor environments at the University of Wollongong's campus. In our experimental results, the proposed module improves the lightweight models on small-sized datasets by 22% - 40%

Spatial-Related and Scale-Aware Network for Crowd Counting

Lei Li, Yuan Dong, Hongliang Bai

Responsive image

Auto-TLDR; Spatial Attention for Crowd Counting

Slides Poster Similar

Crowd counting aims to estimate the number of people in images. Although promising progresses have been made with the prevalence of deep Convolutional Neural Networks, there still remains a challenging task due to cluttered backgrounds and varying scales of people within an image. In this paper, we propose a learnable spatial attention module which can get the spatial relations to diminish the negative impact of backgrounds. Besides, a dense hybrid dilated convolution module is also brought up to preserve information derived from varied scales. With these two modules, our network can deal with the problem caused by scale variance and background interference. To demonstrate the effectiveness of our method, we compare it with state-of-the-art algorithms on three representative crowd counting benchmarks (ShanghaiTech UCF-QNRF,UCF_CC_50). Experimental results show that our proposed network can achieve significant improvements on all the three datasets.

RSAN: Residual Subtraction and Attention Network for Single Image Super-Resolution

Shuo Wei, Xin Sun, Haoran Zhao, Junyu Dong

Responsive image

Auto-TLDR; RSAN: Residual subtraction and attention network for super-resolution

Slides Similar

The single-image super-resolution (SISR) aims to recover a potential high-resolution image from its low-resolution version. Recently, deep learning-based methods have played a significant role in super-resolution field due to its effectiveness and efficiency. However, most of the SISR methods neglect the importance among the feature map channels. Moreover, they can not eliminate the redundant noises, making the output image be blurred. In this paper, we propose the residual subtraction and attention network (RSAN) for powerful feature expression and channels importance learning. More specifically, RSAN firstly implements one redundance removal module to learn noise information in the feature map and subtract noise through residual learning. Then it introduces the channel attention module to amplify high-frequency information and suppress the weight of effectless channels. Experimental results on extensive public benchmarks demonstrate our RSAN achieves significant improvement over the previous SISR methods in terms of both quantitative metrics and visual quality.

An Improved Bilinear Pooling Method for Image-Based Action Recognition

Wei Wu, Jiale Yu

Responsive image

Auto-TLDR; An improved bilinear pooling method for image-based action recognition

Slides Poster Similar

Action recognition in still images is a challenging task because of the complexity of human motions and the variance of background in the same action category. And some actions typically occur in fine-grained categories, with little visual differences between these categories. So extracting discriminative features or modeling various semantic parts is essential for image-based action recognition. Many methods apply expensive manual annotations to learn discriminative parts information for action recognition, which may severely discourage potential applications in real life. In recent years, bilinear pooling method has shown its effectiveness for image classification due to its learning distinctive features automatically. Inspired by this model, in this paper, an improved bilinear pooling method is proposed for avoiding the shortcomings of traditional bilinear pooling methods. The previous bilinear pooling approaches contain lots of noisy background or harmful feature information, which limit their application for action recognition. In our method, the attention mechanism is introduced into hierarchical bilinear pooling framework with mask aggregation for action recognition. The proposed model can generate the distinctive and ROI-aware feature information by combining multiple attention mask maps from the channel and spatial-wise attention features. To be more specific, our method makes the network to better pay attention to discriminative region of the vital objects in an image. We verify our model on the two challenging datasets: 1) Stanford 40 action dataset and 2) our action dataset that includes 60 categories. Experimental results demonstrate the effectiveness of our approach, which is superior to the traditional and state-of-the-art methods.

Second-Order Attention Guided Convolutional Activations for Visual Recognition

Shannan Chen, Qian Wang, Qiule Sun, Bin Liu, Jianxin Zhang, Qiang Zhang

Responsive image

Auto-TLDR; Second-order Attention Guided Network for Convolutional Neural Networks for Visual Recognition

Slides Poster Similar

Recently, modeling deep convolutional activations by the global second-order pooling has shown great advance on visual recognition tasks. However, most of the existing deep second-order statistical models mainly compute second-order statistics of activations of the last convolutional layer as image representations, and they seldom introduce second-order statistics into earlier layers to better fit network topology, thus limiting the representational ability to a certain extent. Motivated by the flexibility of attention blocks that are commonly plugged into intermediate layers of deep convolutional networks (ConvNets), this work makes an attempt to combine deep second-order statistics with attention mechanisms in ConvNets, and further proposes a novel Second-order Attention Guided Network (SoAG-Net) for visual recognition. More specifically, SoAG-Net involves several SoAG modules seemingly inserted into intermediate layers of the network, in which SoAG collects second-order statistics of convolutional activations by polynomial kernel approximation to predict channel-wise attention maps utilized for guiding the learning of convolutional activations through tensor scaling along channel dimension. SoAG improves the nonlinearity of ConvNets and enables ConvNets to fit more complicated distribution of convolutional activations. Experiment results on three commonly used datasets illuminate that SoAG-Net outperforms its counterparts and achieves competitive performance with state-of-the-art models under the same backbone.

ResFPN: Residual Skip Connections in Multi-Resolution Feature Pyramid Networks for Accurate Dense Pixel Matching

Rishav ., René Schuster, Ramy Battrawy, Oliver Wasenmüler, Didier Stricker

Responsive image

Auto-TLDR; Resolution Feature Pyramid Networks for Dense Pixel Matching

Slides Similar

Dense pixel matching is required for many computer vision algorithms such as disparity, optical flow or scene flow estimation. Feature Pyramid Networks (FPN) have proven to be a suitable feature extractor for CNN-based dense matching tasks. FPN generates well localized and semantically strong features at multiple scales. However, the generic FPN is not utilizing its full potential, due to its reasonable but limited localization accuracy. Thus, we present ResFPN – a multiresolution feature pyramid network with multiple residual skip connections, where at any scale, we leverage the information from higher resolution maps for stronger and better localized features. In our ablation study we demonstrate the effectiveness of our novel architecture with clearly higher accuracy than FPN. In addition, we verify the superior accuracy of ResFPN in many different pixel matching applications on established datasets like KITTI, Sintel, and FlyingThings3D.

Single Image Deblurring Using Bi-Attention Network

Yaowei Li, Ye Luo, Jianwei Lu

Responsive image

Auto-TLDR; Bi-Attention Neural Network for Single Image Deblurring

Poster Similar

Recently, deep convolutional neural networks have been extensively applied into image deblurring and have achieved remarkable performance. However, most CNN-based image deblurring methods focus on simply increasing network depth, neglecting the contextual information of the blurred image and the reconstructed image. Meanwhile, most encoder-decoder based methods rarely exploit encoder's multi-layer features. To address these issues, we propose a bi-attention neural network for single image deblurring, which mainly consists of a bi-attention network and a feature fusion network. Specifically, two criss-cross attention modules are plugged before and after the encoder-decoder to capture long-range spatial contextual information in the blurred image and the reconstructed image simultaneously, and the feature fusion network combines multi-layer features from encoder to enable the decoder reconstruct the image with multi-scale features. The whole network is end-to-end trainable. Quantitative and qualitative experiment results validate that the proposed network outperforms state-of-the-art methods in terms of PSNR and SSIM on benchmark datasets.

Multi-Direction Convolution for Semantic Segmentation

Dehui Li, Zhiguo Cao, Ke Xian, Xinyuan Qi, Chao Zhang, Hao Lu

Responsive image

Auto-TLDR; Multi-Direction Convolution for Contextual Segmentation

Slides Similar

Context is known to be one of crucial factors effecting the performance improvement of semantic segmentation. However, state-of-the-art segmentation models built upon fully convolutional networks are inherently weak in encoding contextual information because of stacked local operations such as convolution and pooling. Failing to capture context leads to inferior segmentation performance. Despite many context modules have been proposed to relieve this problem, they still operate in a local manner or use the same contextual information in different positions (due to upsampling). In this paper, we introduce the idea of Multi-Direction Convolution (MDC)—a novel operator capable of encoding rich contextual information. This operator is inspired by an observation that the standard convolution only slides along the spatial dimension (x, y direction) where the channel dimension (z direction) is fixed, which renders slow growth of the receptive field (RF). If considering the channel-fixed convolution to be one-direction, MDC is multi-direction in the sense that MDC slides along both spatial and channel dimensions, i.e., it slides along x, y when z is fixed, along x, z when y is fixed, and along y, z when x is fixed. In this way, MDC is able to encode rich contextual information with the fast increase of the RF. Compared to existing context modules, the encoded context is position-sensitive because no upsampling is required. MDC is also efficient and easy to implement. It can be implemented with few standard convolution layers with permutation. We show through extensive experiments that MDC effectively and selectively enlarges the RF and outperforms existing contextual modules on two standard benchmarks, including Cityscapes and PASCAL VOC2012.

LiNet: A Lightweight Network for Image Super Resolution

Armin Mehri, Parichehr Behjati Ardakani, Angel D. Sappa

Responsive image

Auto-TLDR; LiNet: A Compact Dense Network for Lightweight Super Resolution

Slides Poster Similar

This paper proposes a new lightweight network, LiNet, that enhancing technical efficiency in lightweight super resolution and operating approximately like very large and costly networks in terms of number of network parameters and operations. The proposed architecture allows the network to learn more abstract properties by avoiding low-level information via multiple links. LiNet introduces a Compact Dense Module, which contains set of inner and outer blocks, to efficiently extract meaningful information, to better leverage multi-level representations before upsampling stage, and to allow an efficient information and gradient flow within the network. Experiments on benchmark datasets show that the proposed LiNet achieves favorable performance against lightweight state-of-the-art methods.

Cross-Layer Information Refining Network for Single Image Super-Resolution

Hongyi Zhang, Wen Lu, Xiaopeng Sun

Responsive image

Auto-TLDR; Interlaced Spatial Attention Block for Single Image Super-Resolution

Slides Poster Similar

Recently, deep learning-based image super-resolution (SR) has made a remarkable progress. However, previous SR methods rarely focus on the correlation between adjacent layers, which leads to underutilization of the information extracted by each convolutional layer. To address these problem, we design a simple and efficient cross-layer information refining network (CIRN) for single image super-resolution. Concretely, we propose the interlaced spatial attention block (ISAB) to measure the correlation between the adjacent layers feature maps and adaptively rescale spatial-wise features for refining the information. Owing to the two stage information propagation strategy, the CIRN can distill the primary information of adjacent layers without introducing too many parameters. Extensive experiments on benchmark datasets illustrate that our method achieves better accuracy than state-of-the-art methods even in 16× scale, spcifically it has a better banlance between performance and parameters.

Dynamic Guided Network for Monocular Depth Estimation

Xiaoxia Xing, Yinghao Cai, Yiping Yang, Dayong Wen

Responsive image

Auto-TLDR; DGNet: Dynamic Guidance Upsampling for Self-attention-Decoding for Monocular Depth Estimation

Slides Poster Similar

Self-attention or encoder-decoder structure has been widely used in deep neural networks for monocular depth estimation tasks. The former mechanism are capable to capture long-range information by computing the representation of each position by a weighted sum of the features at all positions, while the latter networks can capture structural details information by gradually recovering the spatial information. In this work, we combine the advantages of both methods. Specifically, our proposed model, DGNet, extends EMANet Network by adding an effective decoder module to refine the depth results. In the decoder stage, we further design dynamic guidance upsampling which uses local neighboring information of low-level features guide coarser depth to upsample. In this way, dynamic guidance upsampling generates content-dependent and spatially-variant kernels for depth upsampling which makes full use of spatial details information from low-level features. Experimental results demonstrate that our method obtains higher accuracy and generates the desired depth map.

Deeply-Fused Attentive Network for Stereo Matching

Zuliu Yang, Xindong Ai, Weida Yang, Yong Zhao, Qifei Dai, Fuchi Li

Responsive image

Auto-TLDR; DF-Net: Deep Learning-based Network for Stereo Matching

Slides Poster Similar

In this paper, we propose a novel learning-based network for stereo matching called DF-Net, which makes three main contributions that are experimentally shown to have practical merit. Firstly, we further increase the accuracy by using the deeply fused spatial pyramid pooling (DF-SPP) module, which can acquire the continuous multi-scale context information in both parallel and cascade manners. Secondly, we introduce channel attention block to dynamically boost the informative features. Finally, we propose a stacked encoder-decoder structure with 3D attention gate for cost regularization. More precisely, the module fuses the coding features to their next encoder-decoder structure under the supervision of attention gate with long-range skip connection, and thus exploit deep and hierarchical context information for disparity prediction. The performance on SceneFlow and KITTI datasets shows that our model is able to generate better results against several state-of-the-art algorithms.

Collaborative Human Machine Attention Module for Character Recognition

Chetan Ralekar, Tapan Gandhi, Santanu Chaudhury

Responsive image

Auto-TLDR; A Collaborative Human-Machine Attention Module for Deep Neural Networks

Slides Poster Similar

The deep learning models which include attention mechanisms are shown to enhance the performance and efficiency of the various computer vision tasks such as pattern recognition, object detection, face recognition, etc. Although the visual attention mechanism is the source of inspiration for these models, recent attention models consider `attention' as a pure machine vision optimization problem and visual attention remains the most neglected aspect. Therefore, this paper presents a collaborative human and machine attention module which considers both visual and network's attention. The proposed module is inspired by the dorsal (`where') pathways of visual processing and it can be integrated with any convolutional neural network (CNN) model. First, the module computes the spatial attention map from the input feature maps which is then combined with the visual attention maps. The visual attention maps are created using eye-fixations obtained by performing an eye-tracking experiment with human participants. The visual attention map covers the highly salient and discriminative image regions as humans tend to focus on such regions, whereas the other relevant image regions are processed by spatial attention map. The combination of these two maps results in the finer refinement in feature maps which results in improved performance. The comparative analysis reveals that our model not only shows significant improvement over the baseline model but also outperforms the other models. We hope that our findings using a collaborative human-machine attention module will be helpful in other vision tasks as well.

Feature-Dependent Cross-Connections in Multi-Path Neural Networks

Dumindu Tissera, Kasun Vithanage, Rukshan Wijesinghe, Kumara Kahatapitiya, Subha Fernando, Ranga Rodrigo

Responsive image

Auto-TLDR; Multi-path Networks for Adaptive Feature Extraction

Slides Poster Similar

Learning a particular task from a dataset, samples in which originate from diverse contexts, is challenging, and usually addressed by deepening or widening standard neural networks. As opposed to conventional network widening, multi-path architectures restrict the quadratic increment of complexity to a linear scale. However, existing multi-column/path networks or model ensembling methods do not consider any feature-dependant allocation of parallel resources, and therefore, tend to learn redundant features. Given a layer in a multi-path network, if we restrict each path to learn a context-specific set of features and introduce a mechanism to intelligently allocate incoming feature maps to such paths, each path can specialize in a certain context, reducing the redundancy and improving the quality of extracted features. This eventually leads to better-optimized usage of parallel resources. To do this, we propose inserting feature-dependant cross-connections between parallel sets of feature maps in successive layers. The weights of these cross-connections are learned based on the input features of the particular layer. Our multi-path networks show improved image recognition accuracy at a similar complexity compared to conventional and state-of-the-art methods for deepening, widening and adaptive feature extracting, in both small and large scale datasets.

Boosting High-Level Vision with Joint Compression Artifacts Reduction and Super-Resolution

Xiaoyu Xiang, Qian Lin, Jan Allebach

Responsive image

Auto-TLDR; A Context-Aware Joint CAR and SR Neural Network for High-Resolution Text Recognition and Face Detection

Slides Poster Similar

Due to the limits of bandwidth and storage space, digital images are usually down-scaled and compressed when transmitted over networks, resulting in loss of details and jarring artifacts that can lower the performance of high-level visual tasks. In this paper, we aim to generate an artifact-free high-resolution image from a low-resolution one compressed with an arbitrary quality factor by exploring joint compression artifacts reduction (CAR) and super-resolution (SR) tasks. First, we propose a context-aware joint CAR and SR neural network (CAJNN) that integrates both local and non-local features to solve CAR and SR in one-stage. Finally, a deep reconstruction network is adopted to predict high quality and high-resolution images. Evaluation on CAR and SR benchmark datasets shows that our CAJNN model outperforms previous methods and also takes 26.2% less runtime. Based on this model, we explore addressing two critical challenges in high-level computer vision: optical character recognition of low-resolution texts, and extremely tiny face detection. We demonstrate that CAJNN can serve as an effective image preprocessing method and improve the accuracy for real-scene text recognition (from 85.30% to 85.75%) and the average precision for tiny face detection (from 0.317 to 0.611).

SCA Net: Sparse Channel Attention Module for Action Recognition

Hang Song, Yonghong Song, Yuanlin Zhang

Responsive image

Auto-TLDR; SCA Net: Efficient Group Convolution for Sparse Channel Attention

Slides Poster Similar

Channel attention has shown its great performance recently when it was incorporated into deep convolutional neural networks. However, existing methods usually require extensive computing resources due to their involuted structure, which is hard for 3D CNNs to take full advantage of. In this paper, a lightweight sparse channel attention (SCA) module implemented by efficient group convolution is proposed, which adopts the idea of sparse channel connection and involves much less parameters but brings clear performance gain. Meanwhile, to solve the lack of local channel interaction brought by group convolution, a dominant function called Aggregate-Shuffle-Diverge (ASD) is leveraged to enhance information flow over each group with no additional parameters. We also adjust the existing mainstream 3D CNNs by employing 3D convolution factorization, so as to further reduce the parameters. Our SCA module can be flexibly incorporated into most existing 3D CNNs, all of which can achieve a perfect trade-off between performance and complexity on action recognition task with factorized I3D or 3D ResNet backbone networks. The experimental results also indicate that the resulting network, namely, SCA Net can achieve an outstanding performance on UCF-101 and HMDB-51 datasets.

Bidirectional Matrix Feature Pyramid Network for Object Detection

Wei Xu, Yi Gan, Jianbo Su

Responsive image

Auto-TLDR; BMFPN: Bidirectional Matrix Feature Pyramid Network for Object Detection

Slides Poster Similar

Feature pyramids are widely used to improve scale invariance for object detection. Most methods just map the objects to feature maps with relevant square receptive fields, but rarely pay attention to the aspect ratio variation, which is also an important property of object instances. It will lead to a poor match between rectangular objects and assigned features with square receptive fields, thus preventing from accurate recognition and location. Besides, the information propagation among feature layers is sparse, namely, each feature in the pyramid may mainly or only contain single-level information, which is not representative enough for classification and localization sub-tasks. In this paper, Bidirectional Matrix Feature Pyramid Network (BMFPN) is proposed to address these issues. It consists of three modules: Diagonal Layer Generation Module (DLGM), Top-down Module (TDM) and Bottom-up Module (BUM). First, multi-level features extracted by backbone are fed into DLGM to produce the base features. Then these base features are utilized to construct the final feature pyramid through TDM and BUM in series. The receptive fields of the designed feature layers in BMFPN have various scales and aspect ratios. Objects can be correctly assigned to appropriate and representative feature maps with relevant receptive fields depending on its scale and aspect ratio properties. Moreover, TDM and BUM form bidirectional and reticular information flow, which effectively fuses multi level information in top-down and bottom-up manner respectively. To evaluate the effectiveness of our proposed architecture, an end-toend anchor-free detector is designed and trained by integrating BMFPN into FCOS. And the center ness branch in FCOS is modified with our Gaussian center-ness branch (GCB), which brings another slight improvement. Without bells and whistles, our method gains +3.3%, +2.4% and +2.6% AP on MS COCO dataset from baselines with ResNet-50, ResNet-101 and ResNeXt-101 backbones, respectively.

Learnable Higher-Order Representation for Action Recognition

Jie Shao, Xiangyang Xue

Responsive image

Auto-TLDR; Learningable Higher-Order Operations for Spatiotemporal Dynamics in Video Recognition

Similar

Capturing spatiotemporal dynamics is an essential topic in video recognition. In this paper, we present learnable higher-order operations as a generic family of building blocks for capturing spatiotemporal dynamics from RGB input video space. Similar to higher-order functions, the weights of higher-order operations are themselves derived from the data with learnable parameters. Classical architectures such as residual learning and network-in-network are first-order operations where weights are directly learned from the data. Higher-order operations make it easier to capture context-sensitive patterns, such as motion. Self-attention models are also higher-order operations, but the attention weights are mostly computed from an affine operation or dot product. The learnable higher-order operations can be more generic and flexible. Experimentally, we show that on the task of video recognition, our higher-order models can achieve results on par with or better than the existing state-of-the-art methods on Something-Something (V1 and V2), Kinetics and Charades datasets.

Delivering Meaningful Representation for Monocular Depth Estimation

Doyeon Kim, Donggyu Joo, Junmo Kim

Responsive image

Auto-TLDR; Monocular Depth Estimation by Bridging the Context between Encoding and Decoding

Slides Poster Similar

Monocular depth estimation plays a key role in 3D scene understanding, and a number of recent papers have achieved significant improvements using deep learning based algorithms. Most papers among them proposed methods that use a pre-trained network as a deep feature extractor and then decode the obtained features to create a depth map. In this study, we focus on how to use this encoder-decoder structure to deliver meaningful representation throughout the entire network. We propose a new network architecture with our suggested modules to create a more accurate depth map by bridging the context between the encoding and decoding phase. First, we place the pyramid block at the bottleneck of the network to enlarge the view and convey rich information about the global context to the decoder. Second, we suggest a skip connection with the fuse module to aggregate the encoder and decoder feature. Finally, we validate our approach on the NYU Depth V2 and KITTI datasets. The experimental results prove the efficacy of the suggested model and show performance gains over the state-of-the-art model.

Encoder-Decoder Based Convolutional Neural Networks with Multi-Scale-Aware Modules for Crowd Counting

Pongpisit Thanasutives, Ken-Ichi Fukui, Masayuki Numao, Boonserm Kijsirikul

Responsive image

Auto-TLDR; M-SFANet and M-SegNet for Crowd Counting Using Multi-Scale Fusion Networks

Slides Poster Similar

In this paper, we proposed two modified neural networks based on dual path multi-scale fusion networks (SFANet) and SegNet for accurate and efficient crowd counting. Inspired by SFANet, the first model, which is named M-SFANet, is attached with atrous spatial pyramid pooling (ASPP) and context-aware module (CAN). The encoder of M-SFANet is enhanced with ASPP containing parallel atrous convolutional layers with different sampling rates and hence able to extract multi-scale features of the target object and incorporate larger context. To further deal with scale variation throughout an input image, we leverage the CAN module which adaptively encodes the scales of the contextual information. The combination yields an effective model for counting in both dense and sparse crowd scenes. Based on the SFANet decoder structure, M-SFANet's decoder has dual paths, for density map and attention map generation. The second model is called M-SegNet, which is produced by replacing the bilinear upsampling in SFANet with max unpooling that is used in SegNet. This change provides a faster model while providing competitive counting performance. Designed for high-speed surveillance applications, M-SegNet has no additional multi-scale-aware module in order to not increase the complexity. Both models are encoder-decoder based architectures and are end-to-end trainable. We conduct extensive experiments on five crowd counting datasets and one vehicle counting dataset to show that these modifications yield algorithms that could improve state-of-the-art crowd counting methods.

Efficient Super Resolution by Recursive Aggregation

Zhengxiong Luo Zhengxiong Luo, Yan Huang, Shang Li, Liang Wang, Tieniu Tan

Responsive image

Auto-TLDR; Recursive Aggregation Network for Efficient Deep Super Resolution

Slides Poster Similar

Deep neural networks have achieved remarkable results on image super resolution (SR), but the efficiency problem of deep SR networks is rarely studied. We experimentally find that many sequentially stacked convolutional blocks in nowadays SR networks are far from being fully optimized, which largely damages their overall efficiency. It indicates that comparable or even better results could be achieved with less but sufficiently optimized blocks. In this paper, we try to construct more efficient SR model via the proposed recursive aggregation network (RAN). It recursively aggregates convolutional blocks in different orders, and avoids too many sequentially stacked blocks. In this way, multiple shortcuts are introduced in RAN, and help gradients easier flow to all inner layers, even for very deep SR networks. As a result, all blocks in RAN can be better optimized, thus RAN can achieve better performance with smaller model size than existing methods.

Accurate Cell Segmentation in Digital Pathology Images Via Attention Enforced Networks

Zeyi Yao, Kaiqi Li, Guanhong Zhang, Yiwen Luo, Xiaoguang Zhou, Muyi Sun

Responsive image

Auto-TLDR; AENet: Attention Enforced Network for Automatic Cell Segmentation

Slides Poster Similar

Automatic cell segmentation is an essential step in the pipeline of computer-aided diagnosis (CAD), such as the detection and grading of breast cancer. Accurate segmentation of cells can not only assist the pathologists to make a more precise diagnosis, but also save much time and labor. However, this task suffers from stain variation, cell inhomogeneous intensities, background clutters and cells from different tissues. To address these issues, we propose an Attention Enforced Network (AENet), which is built on spatial attention module and channel attention module, to integrate local features with global dependencies and weight effective channels adaptively. Besides, we introduce a feature fusion branch to bridge high-level and low-level features. Finally, the marker controlled watershed algorithm is applied to post-process the predicted segmentation maps for reducing the fragmented regions. In the test stage, we present an individual color normalization method to deal with the stain variation problem. We evaluate this model on the MoNuSeg dataset. The quantitative comparisons against several prior methods demonstrate the priority of our approach.

Contextual Classification Using Self-Supervised Auxiliary Models for Deep Neural Networks

Sebastian Palacio, Philipp Engler, Jörn Hees, Andreas Dengel

Responsive image

Auto-TLDR; Self-Supervised Autogenous Learning for Deep Neural Networks

Slides Poster Similar

Classification problems solved with deep neural networks (DNNs) typically rely on a closed world paradigm, and optimize over a single objective (e.g., minimization of the cross- entropy loss). This setup dismisses all kinds of supporting signals that can be used to reinforce the existence or absence of particular patterns. The increasing need for models that are interpretable by design makes the inclusion of said contextual signals a crucial necessity. To this end, we introduce the notion of Self-Supervised Autogenous Learning (SSAL). A SSAL objective is realized through one or more additional targets that are derived from the original supervised classification task, following architectural principles found in multi-task learning. SSAL branches impose low-level priors into the optimization process (e.g., grouping). The ability of using SSAL branches during inference, allow models to converge faster, focusing on a richer set of class-relevant features. We equip state-of-the-art DNNs with SSAL objectives and report consistent improvements for all of them on CIFAR100 and Imagenet. We show that SSAL models outperform similar state-of-the-art methods focused on contextual loss functions, auxiliary branches and hierarchical priors.

TSMSAN: A Three-Stream Multi-Scale Attentive Network for Video Saliency Detection

Jingwen Yang, Guanwen Zhang, Wei Zhou

Responsive image

Auto-TLDR; Three-stream Multi-scale attentive network for video saliency detection in dynamic scenes

Slides Poster Similar

Video saliency detection is an important low-level task that has been used in a large range of high-level applications. In this paper, we proposed a three-stream multi-scale attentive network (TSMSAN) for saliency detection in dynamic scenes. TSMSAN integrates motion vector representation, static saliency map, and RGB information in multi-scales together into one framework on the basis of Fully Convolutional Network (FCN) and spatial attention mechanism. On the one hand, the respective motion features, spatial features, as well as the scene features can provide abundant information for video saliency detection. On the other hand, spatial attention mechanism can combine features with multi-scales to focus on key information in dynamic scenes. In this manner, the proposed TSMSAN can encode the spatiotemporal features of the dynamic scene comprehensively. We evaluate the proposed approach on two public dynamic saliency data sets. The experimental results demonstrate TSMSAN is able to achieve the state-of-the-art performance as well as the excellent generalization ability. Furthermore, the proposed TSMSAN can provide more convincing video saliency information, in line with human perception.

P2 Net: Augmented Parallel-Pyramid Net for Attention Guided Pose Estimation

Luanxuan Hou, Jie Cao, Yuan Zhao, Haifeng Shen, Jian Tang, Ran He

Responsive image

Auto-TLDR; Parallel-Pyramid Net with Partial Attention for Human Pose Estimation

Slides Poster Similar

The target of human pose estimation is to determine the body parts and joint locations of persons in the image. Angular changes, motion blur and occlusion etc. in the natural scenes make this task challenging, while some joints are more difficult to be detected than others. In this paper, we propose an augmented Parallel-Pyramid Net (P^2Net) with an partial attention module. During data preprocessing, we proposed a differentiable auto data augmentation (DA^2) method in which sequences of data augmentations are formulated as a trainable and operational Convolution Neural Network (CNN) component. DA^2 improves the training efficiency and effectiveness. A parallel pyramid structure is followed to compensate the information loss introduced by the network. For the information loss problem in the backbone network, we optimize the backbone network by adopting a new parallel structure without increasing the overall computational complexity. To further refine the predictions after completion of global predictions, an Partial Attention Module (PAM) is defined to extract weighted features from different scale feature maps generated by the parallel pyramid structure. Compared with the traditional up-sampling refining, PAM can better capture the relationship between channels. Experiments corroborate the effectiveness of our proposed method. Notably, our method achieves the best performance on the challenging MSCOCO and MPII datasets.

Progressive Scene Segmentation Based on Self-Attention Mechanism

Yunyi Pan, Yuan Gan, Kun Liu, Yan Zhang

Responsive image

Auto-TLDR; Two-Stage Semantic Scene Segmentation with Self-Attention

Slides Poster Similar

Semantic scene segmentation is vital for a large variety of applications as it enables understanding of 3D data. Nowadays, various approaches based upon point clouds ignore the mathematical distribution of points and treat the points equally. The methods following this direction neglect the imbalance problem of samples that naturally exists in scenes. To avoid these issues, we propose a two-stage semantic scene segmentation framework based on self-attention mechanism and achieved state-of-the-art performance on 3D scene understanding tasks. We split the whole task into two small ones which efficiently relief the sample imbalance issue. In addition, we have designed a new self-attention block which could be inserted into submanifold convolution networks to model the long-range dependencies that exists among points. The proposed network consists of an encoder and a decoder, with the spatial-wise and channel-wise attention modules inserted. The two-stage network shares a U-Net architecture and is an end-to-end trainable framework which could predict the semantic label for the scene point clouds fed into it. Experiments on standard benchmarks of 3D scenes implies that our network could perform at par or better than the existing state-of-the-art methods.

Adaptive Image Compression Using GAN Based Semantic-Perceptual Residual Compensation

Ruojing Wang, Zitang Sun, Sei-Ichiro Kamata, Weili Chen

Responsive image

Auto-TLDR; Adaptive Image Compression using GAN based Semantic-Perceptual Residual Compensation

Slides Poster Similar

Image compression is a basic task in image processing. In this paper, We present an adaptive image compression algorithm that relies on GAN based semantic-perceptual residual compensation, which is available to offer visually pleasing reconstruction at a low bitrate. Our method adopt an U-shaped encoding and decoding structure accompanied by a well-designed dense residual connection with strip pooling module to improve the original auto-encoder. Besides, we introduce the idea of adversarial learning by introducing a discriminator thus constructed a complete GAN. To improve the coding efficiency, we creatively designed an adaptive semantic-perception residual compensation block based on Grad-CAM algorithm. In the improvement of the quantizer, we embed the method of soft-quantization so as to solve the problem to some extent that back propagation process is irreversible. Simultaneously, we use the latest FLIF lossless compression algorithm and BPG vector compression algorithm to perform deeper compression on the image. More importantly experimental results including PSNR, MS-SSIM demonstrate that the proposed approach outperforms the current state-of-the-art image compression methods.