Compressing BERT - An Evaluation and Combination of Methods


Pre-trained Language Models (LMs) such as ELMo, BERT, and the GPT models show good empirical results on a wide range of Natural Language Processing (NLP) downstream tasks. However, these large models consist of millions or even billions of parameters, making training and inference slow and computationally expensive, especially in resource-constrained environments. In this thesis, we discuss and compare four approaches to compressing BERT based on a literature review and select Theseus Compression (TC) as the most promising for further experimental evaluation. We introduce a change to TC’s initialization procedure, show its benefits, and present a comprehensive analysis of TC’s hyperparameters. A concluding qualitative error analysis reveals that TC efficiently compresses the original model’s knowledge into a smaller model. For a comprehensive evaluation of TC’s performance, we use three datasets from two domains, hate speech and medicine, covering two binary classification downstream tasks: German hate speech detection and in-hospital mortality prediction. Our best experiments show that domain-specific models compressed with Theseus Compression are 1.67x smaller, train on downstream tasks more than 2x faster, retain up to 99% of prediction performance, and increase inference speed by, on average, 1.94x on CPU and 1.73x on GPU.

Beuth University of Applied Sciences Berlin