Zhexi Zhang
Papers from this author
Automatic Student Network Search for Knowledge Distillation
Zhexi Zhang, Wei Zhu, Junchi Yan, Peng Gao, Guotong Xie
Auto-TLDR; NAS-KD: Knowledge Distillation for BERT
Pre-trained language models (PLMs), such as BERT, have achieved outstanding performance on multiple natural language processing (NLP) tasks. However, such pre-trained models usually contain a huge number of parameters and are computationally expensive, and this high resource demand hinders their deployment on resource-restricted devices such as mobile phones. Knowledge distillation (KD) is an effective compression approach that encourages a lightweight student network to imitate the teacher network, so that latent knowledge is transferred from the teacher to the student. However, most student networks in previous KD methods are manually designed, typically as a subnetwork of the teacher network. A Transformer is generally used as the student when compressing BERT, but it still contains a large number of parameters. Motivated by this, we propose a novel approach named NAS-KD, which automatically generates an optimal student network using neural architecture search (NAS) to enhance the distillation for BERT. Experiments on 7 NLP classification tasks demonstrate that NAS-KD can substantially reduce the size of BERT without much performance sacrifice.
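To make the distillation objective mentioned in the abstract concrete, below is a minimal sketch of a standard soft-target KD loss in PyTorch. It is not the paper's NAS-KD method; the function name, temperature, and weighting are illustrative assumptions, shown only to clarify how a student is trained to imitate a teacher's output distribution.

```python
# Minimal sketch of a standard knowledge-distillation loss (soft targets + hard labels).
# All names and hyperparameters here are hypothetical, not taken from NAS-KD.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-target KL term (imitate the teacher) with hard-label cross-entropy.

    student_logits, teacher_logits: (batch, num_classes) tensors
    labels: (batch,) ground-truth class indices
    """
    # Soft targets: match the student's temperature-softened distribution to the teacher's.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: ordinary supervised loss on the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss


if __name__ == "__main__":
    # Random tensors stand in for model outputs (e.g. a searched student vs. fine-tuned BERT).
    student_logits = torch.randn(8, 2)
    teacher_logits = torch.randn(8, 2)
    labels = torch.randint(0, 2, (8,))
    print(distillation_loss(student_logits, teacher_logits, labels))
```

In NAS-KD, the student architecture fed into such a loss would be produced by the architecture search rather than designed by hand, but the basic teacher-to-student transfer follows this pattern.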