Yongtao Wang

Papers from this author

Dual Loss for Manga Character Recognition with Imbalanced Training Data

Yonggang Li, Yafeng Zhou, Yongtao Wang, Xiaoran Qin, Zhi Tang

Auto-TLDR; Dual Adaptive Re-weighting Loss for Manga Character Recognition

Manga character recognition is a key technology for manga character retrieval and verification. The task is very challenging because manga character images have a long-tailed distribution and large quality variations. Training models with cross-entropy softmax loss on such imbalanced data introduces biases in the feature norms and class-weight norms. To handle this problem, we propose a novel dual loss that is the sum of two losses: dual ring loss and dual adaptive re-weighting loss. Dual ring loss combines weight and feature soft normalization and serves as a regularization term for the softmax loss. Dual adaptive re-weighting loss re-weights the softmax loss according to the norms of both the feature and the class weight. With the proposed losses, we achieve encouraging results on the Manga109 benchmark. Specifically, compared with the baseline softmax loss, our method improves the character retrieval mAP from 35.72% to 38.88% and the character verification accuracy from 87.00% to 88.50%.
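
The abstract does not give the formula for the dual ring loss, so the following is only a minimal sketch of what "soft normalization" of both features and class weights could look like. It assumes a ring-loss-style penalty (squared gap between an L2 norm and a target radius) applied once to the feature vectors and once to the class-weight vectors. The function names, the fixed target radii `r_feat` and `r_weight`, and the coefficient `lam` are all hypothetical; in practice such radii are often learned rather than fixed.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a plain Python list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def ring_penalty(vectors, target_norm):
    """Mean squared gap between each vector's L2 norm and a target norm."""
    return sum((l2_norm(v) - target_norm) ** 2 for v in vectors) / len(vectors)


def dual_ring_penalty(features, class_weights, r_feat, r_weight, lam=0.01):
    """Assumed form of a 'dual' soft-normalization regularizer.

    One ring-style penalty pulls feature norms toward r_feat, the other
    pulls class-weight norms toward r_weight. The result would be added
    to the softmax loss as a regularization term.
    """
    feat_term = ring_penalty(features, r_feat)
    weight_term = ring_penalty(class_weights, r_weight)
    return 0.5 * lam * (feat_term + weight_term)
```

Because the penalty is soft, norms are encouraged toward the target radius rather than forced onto it, which is what distinguishes this kind of term from hard L2 normalization.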

GSTO: Gated Scale-Transfer Operation for Multi-Scale Feature Learning in Semantic Segmentation

Zhuoying Wang, Yongtao Wang, Zhi Tang, Yangyan Li, Ying Chen, Haibin Ling, Weisi Lin

Auto-TLDR; Gated Scale-Transfer Operation for Semantic Segmentation

Existing CNN-based methods for semantic segmentation depend heavily on multi-scale features to meet the requirements of both semantic comprehension and detail preservation. State-of-the-art segmentation networks widely exploit conventional scale-transfer operations, i.e., up-sampling and down-sampling, to learn multi-scale features. In this work, we find that these operations lead to scale-confused features and suboptimal performance because they are spatially invariant and directly transfer all feature information across scales without spatial selection. To address this issue, we propose the Gated Scale-Transfer Operation (GSTO) to properly transfer spatially filtered features to another scale. Specifically, GSTO can work either with or without extra supervision. Unsupervised GSTO is learned from the feature itself, while the supervised one is guided by the supervised probability matrix. Both forms of GSTO are lightweight and plug-and-play, and can be flexibly integrated into networks or modules for learning better multi-scale features. In particular, by plugging GSTO into HRNet, we get a more powerful backbone (namely GSTO-HRNet) for pixel labeling, and it achieves new state-of-the-art results on multiple benchmarks for semantic segmentation, including Cityscapes, LIP and Pascal Context, with negligible extra computational cost. Moreover, experimental results demonstrate that GSTO can also significantly boost the performance of multi-scale feature aggregation modules like PPM and ASPP.
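
To make the idea of gating before a scale transfer concrete, here is a rough sketch of an unsupervised gate followed by down-sampling. It is not the authors' implementation: it assumes the gate is a sigmoid over a 1x1 convolution across channels and that the scale transfer is 2x2 average pooling. The names `spatial_gate`, `avg_pool_2x`, and `gated_downsample`, and the `weights` and `bias` parameters, are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def spatial_gate(feature, weights, bias=0.0):
    """Compute an [H][W] gate in (0, 1) from a [C][H][W] feature.

    Assumption: the gate is a 1x1 convolution across channels
    (one weight per channel) followed by a sigmoid.
    """
    channels = len(feature)
    height, width = len(feature[0]), len(feature[0][0])
    return [
        [
            sigmoid(sum(weights[c] * feature[c][i][j] for c in range(channels)) + bias)
            for j in range(width)
        ]
        for i in range(height)
    ]


def avg_pool_2x(fmap):
    """2x2 average pooling of an [H][W] map with even H and W."""
    height, width = len(fmap), len(fmap[0])
    return [
        [
            (fmap[i][j] + fmap[i][j + 1] + fmap[i + 1][j] + fmap[i + 1][j + 1]) / 4.0
            for j in range(0, width, 2)
        ]
        for i in range(0, height, 2)
    ]


def gated_downsample(feature, weights, bias=0.0):
    """Multiply every channel by the spatial gate, then down-sample by 2."""
    gate = spatial_gate(feature, weights, bias)
    gated = [
        [[ch[i][j] * gate[i][j] for j in range(len(ch[0]))] for i in range(len(ch))]
        for ch in feature
    ]
    return [avg_pool_2x(ch) for ch in gated]
```

The gate varies per pixel, so unlike plain pooling or interpolation, the information passed to the other scale depends on spatial position.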

MixTConv: Mixed Temporal Convolutional Kernels for Efficient Action Recognition

Kaiyu Shan, Yongtao Wang, Zhi Tang, Ying Chen, Yangyan Li

Auto-TLDR; Mixed Temporal Convolution for Action Recognition

To efficiently extract spatiotemporal features of video for action recognition, most state-of-the-art methods integrate 1D temporal convolution into a conventional 2D CNN backbone. However, they all use 1D temporal convolution with a fixed kernel size (i.e., 3) in the network building block, and thus have suboptimal temporal modeling capability for handling both long-term and short-term actions. To address this problem, we first investigate the impact of different kernel sizes for the 1D temporal convolutional filters. Then, we propose a simple yet efficient operation called Mixed Temporal Convolution (MixTConv), which consists of multiple depthwise 1D convolutional filters with different kernel sizes. By plugging MixTConv into the conventional 2D CNN backbone ResNet-50, we further propose an efficient and effective network architecture named MSTNet for action recognition, and achieve state-of-the-art results on multiple large-scale benchmarks.
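
The mixed-kernel idea can be illustrated with a small sketch in plain Python. It assumes that channels are split evenly into groups and that each group gets a depthwise temporal convolution with its own kernel size. The even split, the kernel sizes used, and the averaging weights (standing in for learned weights) are assumptions for illustration, as are the function names.

```python
def depthwise_conv1d(signal, kernel):
    """Zero-padded 'same' 1D correlation of one channel with an odd-length kernel."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(kernel[k] * padded[t + k] for k in range(len(kernel)))
        for t in range(len(signal))
    ]


def make_mixed_kernels(num_channels, kernel_sizes):
    """Assign one kernel per channel, with a different size per channel group.

    Averaging kernels are placeholders for weights that would be learned.
    """
    if num_channels % len(kernel_sizes) != 0:
        raise ValueError("num_channels must be divisible by the number of kernel sizes")
    per_group = num_channels // len(kernel_sizes)
    return [[1.0 / size] * size for size in kernel_sizes for _ in range(per_group)]


def mixed_temporal_conv(x, kernels):
    """Apply a per-channel (depthwise) temporal convolution to x of shape [C][T]."""
    if len(x) != len(kernels):
        raise ValueError("need exactly one kernel per channel")
    return [depthwise_conv1d(channel, kernel) for channel, kernel in zip(x, kernels)]
```

A short usage example: with four channels and kernel sizes `(1, 3)`, the first two channels see only the current frame while the last two mix each frame with its neighbours.

```python
x = [[1.0, 2.0, 3.0, 4.0] for _ in range(4)]  # 4 channels, 4 time steps
y = mixed_temporal_conv(x, make_mixed_kernels(4, (1, 3)))
```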