Language models have drawn a lot of attention in NLP in recent years. Despite their short history of development, they have been employed and delivered astonishing performances in all sorts of NLP tasks, such as translation, question answering, information extraction and intelligent search. However, we should not forget that giant language models are not only data hungry, but also energy hungry. State-of-the-art language models such as BERT, RoBERTa and XLNet process millions of parameters, which is only possible with the help of dozens of sophisticated and expensive chips. The CO2 generated in the process is also massive. Being responsible for such high energy consumption is not easy in times of climate change. In order for companies to benefit from the performance of state-of-the-art language models without putting too much strain on their computing costs, the models used must be reduced to a minimum. Of course, performance should not suffer as a result. One possible means to achieve this is the so-called knowledge distillation, which is one common technique among model compression methods. In this presentation, we will show you how you can use knowledge distillation to generate models that achieve comparable performances as state-of-the-art language models effectively, and in a resource-saving manner. |