
Compress giant language models using knowledge distillation


Formal Metadata

Title
Compress giant language models using knowledge distillation
Alternative Title
Compress language models to effective & resource-saving models with knowledge distillation
Number of Parts
56
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Language models have drawn a lot of attention in NLP in recent years. Despite their short history of development, they have been employed in all sorts of NLP tasks, such as translation, question answering, information extraction and intelligent search, and have delivered astonishing performance. However, we should not forget that giant language models are not only data hungry, but also energy hungry. State-of-the-art language models such as BERT, RoBERTa and XLNet comprise hundreds of millions of parameters and can only be trained and served with the help of dozens of sophisticated and expensive chips; the CO2 generated in the process is massive. Justifying such high energy consumption is not easy in times of climate change. In order for companies to benefit from the performance of state-of-the-art language models without putting too much strain on their computing costs, the models used must be reduced to a minimum. Of course, performance should not suffer as a result. One possible means to achieve this is knowledge distillation, a common model compression technique. In this presentation, we will show how you can use knowledge distillation to obtain models that achieve performance comparable to state-of-the-art language models, effectively and in a resource-saving manner.

Transcript: English (auto-generated)
Thank you very much, everyone, for coming and joining me for the last session of today. I am very excited to be able to present. My name is Wuqi. I originally come from China and have been living in Germany for over five years. After finishing my master's degree in statistics, I started to work as a machine learning engineer at OntoLux. Today, my topic is compressing giant language models using knowledge distillation, which is a popular method among model compression methods. We will come back to that later.
Before I start with the topic, I would like to briefly present our company, the Neofonie Group. We are a digital agency based in Berlin with over 20 years of experience in the fields of content management, e-commerce, mobile and operations. OntoLux is a brand of Neofonie; since 2021 we have been providing state-of-the-art AI solutions to our customers, and in the meantime we are also developing our own research projects. At OntoLux we specialize in text mining, machine learning and search. Our product TXT Werk, for example, is a lightweight text analysis framework with a focus on German text. There we try to combine traditional expert systems with modern machine learning algorithms and thus provide our users with many useful functions such as named entity recognition, entity linking, sentiment analysis, and so on.

The agenda of today's talk: I will first talk about the motivation, why we became interested in model compression and knowledge distillation in the first place. Secondly, I will introduce a couple of popular model compression methods that exist on the market today. Afterwards, I will present our own implementation and experiments with knowledge distillation and we will discuss the results. At the end, I'm happy to answer all your questions. Let us start.
I think we would all agree that we are now living in an exciting time for artificial intelligence. We are often quite overwhelmed by how well AI models perform in many applications and scenarios, such as writing poetry, generating images, making art and composing. Even though all of these things are fascinating, they are, on the other hand, also a little bit intimidating. The AI models are performing better and better, but the model sizes are also growing exponentially. As we can see on the graph of the most popular and successful NLP models from 2018 until today, it did not take long to go from millions of parameters, as in BERT, to billions of parameters, as in GPT-3.

That brings us several problems. The first one is the environmental burden. We can all imagine that big models consume a lot of energy, but I think most of us don't know how big that figure actually is, because people don't like to talk about it. They like to talk about how well their model performs, but they are reluctant to report how much energy it consumes. There are some figures published by researchers which are quite astonishing: the training and experimentation process of a transformer model with around 200 million parameters, which is less than BERT, generates four times as much carbon dioxide as an average car over its whole lifetime.

That is the first problem. The second problem is cost, because in order to make a giant language model work you also need adequate hardware or cloud computing options, which are not cheap. And the third one is speed, because the bigger the model, the longer inference takes, which is problematic in many use cases.

So what are the solutions? The idea is that we would like to have small and efficient models that at the same time don't lose too much performance. There are already a lot of methods in the literature for compressing models, and I will go through the ones listed on the slide one by one.
The first method is called numeric precision reduction, also known as quantization. Because all the weights and biases in the matrices are stored as numeric values in the form of float32, we can intuitively think: maybe we don't need that high a precision, and we could simply reduce them to float16 or int8 in order to achieve model compression. In TensorFlow, for example, there are already built-in functions for this. However, sometimes we cannot naively reduce all the weights and biases in the model, because some computations in the neural network do need the higher precision of float32. Luckily, PyTorch also provides a package that allows us to automate this process.
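As a minimal sketch of what such an automated step can look like in PyTorch, here is post-training dynamic quantization applied to a small stand-in model; the layer sizes are made up for illustration and are not from the talk:

```python
import torch
import torch.nn as nn

# A small stand-in network; in practice this would be a much larger model.
model = nn.Sequential(
    nn.Linear(768, 256),
    nn.ReLU(),
    nn.Linear(256, 4),  # e.g. four NER classes
)

# Post-training dynamic quantization: weights of the listed layer types are
# stored as int8, activations are quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

print(quantized)  # the Linear layers are replaced by their int8 counterparts
```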
The second method is weight sharing, which follows the same logic as quantization: we want to represent a huge model with lower precision and thus achieve model compression. Take a weight matrix as an example. Let's say we have a 4x4 weight matrix; we can cluster its entries according to their values, and all values in the same cluster are then represented by a single shared value, the cluster centroid. In this way, instead of a 4x4 matrix with 32 bits per entry, we only need an index matrix that indicates which cluster each value belongs to, which takes only 2 bits per entry, plus the one shared value per cluster. In this procedure it is also quite common to do a fine-tuning pass in order to further calibrate the model and preserve its performance.
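A toy sketch of this clustering idea on a single 4x4 matrix, assuming k-means from scikit-learn as the clustering step; the matrix and the number of clusters are invented for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 4)).astype(np.float32)  # toy 4x4 weight matrix

k = 4  # number of shared values (the size of the "codebook")
kmeans = KMeans(n_clusters=k, n_init=10).fit(weights.reshape(-1, 1))

index_matrix = kmeans.labels_.reshape(4, 4).astype(np.uint8)  # small integer index per weight
codebook = kmeans.cluster_centers_.ravel()                    # one shared float per cluster

# At inference the full matrix is rebuilt from index + codebook; in practice a
# short fine-tuning pass re-calibrates the shared values afterwards.
reconstructed = codebook[index_matrix]
print(index_matrix)
print(codebook)
```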
The third method is also quite straightforward: pruning. In a giant neural network, not all of the weights are actually important for the output, so we can think about simply cutting away the unnecessary ones. There is structured pruning, where you remove a whole filter or a whole neuron together with all of its connections, and there is unstructured pruning, where you only remove individual weights.
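A minimal sketch of both pruning flavors using PyTorch's built-in pruning utilities; the layer and the pruning amounts are illustrative choices, not values from the talk:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(256, 256)  # an illustrative layer from a larger network

# Unstructured pruning: zero out the 30% of individual weights
# with the smallest absolute value.
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Structured pruning: additionally remove 20% of whole rows
# (entire output neurons) based on their L2 norm along dim 0.
prune.ln_structured(layer, name="weight", amount=0.2, n=2, dim=0)

# Make the pruning permanent (folds the mask into the weight tensor).
prune.remove(layer, "weight")
```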
There is also a more mathematical approach: low-rank factorization. As we learned in linear algebra, we can approximate a big matrix M with the product of two smaller matrices L and R. If a rank k that is much smaller than the dimensions of M is sufficient, this approximation needs far fewer parameters than M itself.
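A small sketch of the low-rank idea using a truncated SVD, which gives the best rank-k approximation of a matrix; the matrix size and the rank are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(512, 512)).astype(np.float32)  # an illustrative weight matrix

k = 64  # target rank, much smaller than 512
U, S, Vt = np.linalg.svd(M, full_matrices=False)

# Keep only the top-k singular values/vectors: M is approximated by L @ R.
L = U[:, :k] * S[:k]   # shape (512, k)
R = Vt[:k, :]          # shape (k, 512)

approx = L @ R
# Storage drops from 512*512 to 2*512*k parameters when k is much smaller than 512.
# (Real weight matrices are often far closer to low rank than this random example.)
print(np.linalg.norm(M - approx) / np.linalg.norm(M))
```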
Finally, there is knowledge distillation, which is our main focus today. This method is a little different from all the methods I have introduced before, because in knowledge distillation we don't try to represent a big model with lower precision and we don't cut it down; instead, a new model is trained. In the language of knowledge distillation, we call the big, powerful and resource-hungry model the teacher model, and we call the small, efficient model the student model. We build the loss function of the student in a way that encourages it to mimic the behavior of the teacher: we take the output of the teacher model and build it into the loss function of the student, and this output is what we call knowledge.

So, having introduced all these methods, why did we as a company choose knowledge distillation? The first reason is that in knowledge distillation the student architecture is quite flexible; it can be independent of the teacher architecture, and we wanted that kind of flexibility. There is also another advantage: because a new model is trained, we can in theory use an existing teacher model to generate as many training samples for our student as we like. So how does it work exactly?
I will take as an example the NLP task of named entity recognition: you have a string of text, and the model has to predict which words in the text belong to a named entity. Typically there are four classes: organization, person, location and miscellaneous, which means the word is a named entity but belongs to none of the first three classes. Take the words "European Union" and feed them into both the teacher model and the student model. The teacher gives me an output telling me the probability that this entity belongs to the classes organization, person and location, and the student does as well. Based on those two outputs we can construct a mean squared error loss that encourages the student to mimic the output of the teacher. If we also have labeled data, we can configure the loss function so that it partly contains the distillation loss, in this case the mean squared error, and partly the loss against the true labels, which is normally the cross-entropy loss.
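A minimal sketch of such a combined loss, assuming a classification setting with teacher probabilities, student logits and gold labels; the weighting factor alpha and the example numbers are illustrative assumptions, not values from the talk:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_probs, gold_labels, alpha=0.5):
    """Mix a distillation term (mimic the teacher) with the usual supervised term.

    alpha is an illustrative weighting between the two terms.
    """
    student_probs = F.softmax(student_logits, dim=-1)

    # Distillation term: mean squared error against the teacher's class probabilities.
    kd_loss = F.mse_loss(student_probs, teacher_probs)

    # Supervised term: cross-entropy against the true labels, if labeled data exists.
    ce_loss = F.cross_entropy(student_logits, gold_labels)

    return alpha * kd_loss + (1.0 - alpha) * ce_loss

# e.g. for "European Union" with classes (ORG, PER, LOC, MISC); numbers are made up.
teacher_probs = torch.tensor([[0.80, 0.10, 0.05, 0.05]])
student_logits = torch.tensor([[1.2, 0.3, -0.1, 0.0]])
gold_labels = torch.tensor([0])  # ORG
print(distillation_loss(student_logits, teacher_probs, gold_labels))
```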
The idea of knowledge distillation has been around since 2015, when it was first published, and ever since then people have been inspired by it and have produced many of their own implementations and explorations of the method. A nice survey paper from 2021 summarizes all of those attempts. Today we will focus on the three different types of knowledge used in knowledge distillation and on three different distillation schemes.

The first question we might ask ourselves in knowledge distillation is: what is knowledge? In the previous example we had "European Union", which the model tags and says probably belongs to the class organization. That kind of knowledge takes the output of the teacher model at the final layer, and we call it response-based knowledge. However, in a giant neural network there is also a lot of hidden information between the layers, so there are approaches that take this information, call it feature-based knowledge, and train a student model on it. And there is a third type of knowledge that is different from both, because those two kinds of knowledge both take outputs of the teacher model. The third type is called relation-based knowledge: the assumption is that in a big neural network there is structure in how the values change from layer to layer, so these approaches model the correlation between different layers and use that correlation as the knowledge to train a student model.
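To make the first two kinds of knowledge concrete, here is a hedged sketch of how one could pull both out of a teacher with the Hugging Face transformers library; the checkpoint name is a generic placeholder, not the teacher from the talk:

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Generic placeholder checkpoint; in practice this would be a fine-tuned NER teacher.
name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
teacher = AutoModelForTokenClassification.from_pretrained(name, output_hidden_states=True)
teacher.eval()

inputs = tokenizer("European Union", return_tensors="pt")
with torch.no_grad():
    outputs = teacher(**inputs)

# Response-based knowledge: the predictions of the final layer.
response_knowledge = outputs.logits           # shape (1, seq_len, num_labels)

# Feature-based knowledge: the hidden states between the layers.
feature_knowledge = outputs.hidden_states     # tuple of (1, seq_len, hidden_size) tensors
```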
There are many different ways to make use of knowledge distillation, and there are also different training schemes. The most common one is called offline distillation, where you already have a pre-trained teacher model and use it to generate the training samples for the student and then train the student. However, in some cases a pre-trained teacher is not available, so you can try to train the teacher and the student together at once; we call this online distillation. And there is a very interesting case where the architecture of the student is exactly the same as the teacher's; we call that self-distillation. In that case it is no longer about model compression but more about exploring what happens. Even though in theory we don't yet have a proof of why this should improve model performance, in practice, in experiments, it does.
After introducing all this theoretical knowledge about model compression and knowledge distillation, we are now curious how well these models actually work. To answer that question I chose a paper with a very good summary, because I think most of us are familiar with BERT, which was already mentioned a couple of times in earlier talks today. After the idea of knowledge distillation came out, a lot of people tried to compress BERT, and this paper summarizes how different methods have been applied to it: quantization, pruning, knowledge distillation with different types of knowledge, and matrix decomposition, which is the low-rank factorization we mentioned before. In general, there are three measures for how well a compressed model works. The first is the model size reduction, which in these cases ranges from 58 percent to 99 percent; the second is the speed-up, which ranges from 1.2 times to 25 times; and the third is the average performance drop, which ranges from minus 0.1 to minus 12 points. The range is wide because the different methods have different purposes and take different approaches.
As an example, take a popular distilled model called DistilBERT. In this specific case the model keeps the same architecture as BERT but with fewer layers, and it is initialized with the original BERT weights. The model size reduction for DistilBERT is about 40 percent, with a speed-up of around 60 percent, while the average performance drop is only about three percent, so that is a pretty good trade-off. It is therefore also very reasonable to distill a general language model and then fine-tune it for specific use cases.
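For reference, this is roughly how the publicly released DistilBERT checkpoint can be loaded for fine-tuning with the Hugging Face transformers library; the task head and label count here are arbitrary examples:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# The publicly released distilled checkpoint; fine-tune it like any other model.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2  # the classification head here is arbitrary
)

inputs = tokenizer("Knowledge distillation makes models smaller.", return_tensors="pt")
print(model(**inputs).logits.shape)  # torch.Size([1, 2])
```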
However, as a company we decided to go the other way, because most of the existing compressed language models mainly focus on English, and for us the main target is German. So we wanted to distill a model that is specifically good at German language tasks. Second, we have data from different projects that differ from each other not only in content but also in structure, so domain adaptation is a task for us anyway. We decided to work on the German named entity recognition task in one of our projects, and as the teacher we chose a state-of-the-art NER model from the FLAIR framework. This model uses XLM-R with document-level features as its backbone; for those who are not familiar with it, XLM-R stands for cross-lingual language model, and the R stands for RoBERTa, which is basically an improved version of BERT.
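As a rough illustration of such a teacher, the FLAIR library ships public XLM-R-based German NER models; the checkpoint name below is one such public model and is only meant as an example of this kind of teacher, not necessarily the exact model from the talk:

```python
from flair.data import Sentence
from flair.models import SequenceTagger

# A public XLM-R-based German NER model (illustrative choice of teacher).
tagger = SequenceTagger.load("flair/ner-german-large")

sentence = Sentence("Die Europäische Union hat ihren Sitz in Brüssel.")
tagger.predict(sentence)

for entity in sentence.get_spans("ner"):
    print(entity.text, entity.tag, entity.score)  # e.g. ORG and LOC with confidence scores
```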
The question now is how to train the student model. There are a couple of choices we have to make: the architecture of the student, which is independent of the teacher, the embeddings, and the training data. For the architecture, most existing models nowadays use the transformer architecture. However, before transformers came into the picture, recurrent neural network based models also performed quite well, and they have the advantage that when the model size is small enough they can actually outperform a transformer model. So for our use case we decided to take an RNN-based model as our student. For the embeddings we chose the popular and conventional GloVe embeddings, because we already lose the contextual semantic information of the transformer model, and we therefore wanted a pre-trained word embedding that already contains a lot of semantic information. And lastly there is the data: as mentioned, we can generate as much training data as we want for our student model. So how much should we generate? To answer that question we also looked into the literature, and the result is that the combination of the teacher's original training data together with new data tagged by the teacher model gives the best results.
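A minimal sketch of what an RNN-based student tagger with frozen pretrained word embeddings can look like; the vocabulary, dimensions and random vectors are placeholders, not the production model from the talk:

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """A small RNN-based student tagger (all sizes are illustrative)."""

    def __init__(self, pretrained_embeddings, num_classes=4, hidden_dim=256):
        super().__init__()
        # Pretrained (e.g. GloVe-style) vectors, frozen so the semantic
        # information is kept even though the transformer teacher is gone.
        self.embedding = nn.Embedding.from_pretrained(pretrained_embeddings, freeze=True)
        self.lstm = nn.LSTM(
            pretrained_embeddings.size(1), hidden_dim,
            num_layers=1, bidirectional=True, batch_first=True,
        )
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)   # (batch, seq, emb_dim)
        hidden, _ = self.lstm(embedded)        # (batch, seq, 2*hidden_dim)
        return self.classifier(hidden)         # per-token class logits

# Usage with a made-up vocabulary of 50k words and 300-d vectors:
vectors = torch.randn(50_000, 300)
student = BiLSTMTagger(vectors)
logits = student(torch.randint(0, 50_000, (8, 32)))  # shape (8, 32, 4)
```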
After all these considerations we trained our model, and we have a nice table showing the results of our student model, which are not bad: we have an average performance drop of around three percent, while the student's inference is almost 40 times faster than the teacher's, which is a good result. However, is that all? Is our model good enough for production use? Here we face a problem that is relevant not only for knowledge distillation but for machine learning model training in general: the validation set is usually not representative enough of the real-life data distribution. We made an experiment in our company with two models of exactly the same size, where the first one is deeper, with a two-layer LSTM, and the second one is wider, with only one layer but a larger hidden dimension. Looking at the validation scores they seem pretty similar, but when we test them on real data, one model performs significantly worse than the other. The message I want to express here is that model selection is a complex problem, and because of the black-box character of neural networks, maybe the only way to get the right answer is by trial and error. So experience does matter in these cases.
There are also a lot of things we are considering for the future. We want to optimize our evaluation process, which may mean that manually labeled data for the evaluation set becomes necessary. We also want to change the student architecture, because we are curious how a transformer student would perform in comparison to the RNN-based model. Then we could also generate more training data and see how the results look. And lastly, we could easily combine knowledge distillation with all the other compression methods mentioned before.

We already have a lot of use cases for our distilled models. We embed them into our framework TXT Werk, which I mentioned before; it is one of our company's products, where we use an expert system to enable many text analysis functions. We also use distilled models in different projects. In Diels, for example, the Digital Environment Urban Solutions, a Europe-wide project where we work with many other organizations, we have environmental data; for the Federal Ministry of Finance in Germany we have legal data. For each of the different use cases we use the same technique of knowledge distillation, but with a different configuration of how the student looks and which word embeddings we choose, which is specific to each use case.
That will be all for the talk today. As a conclusion, I want to say that I think going bigger and bigger shouldn't be the only direction we look at, because at the end of the day we don't want to create an AI model that replaces us as humans, but one that takes over the boring work humans are reluctant to do. It is very useful to have smaller models that are suitable for specific occasions; maybe that doesn't sound as cool as having a general AI capable of doing everything, but it is very useful and energy-saving. That is all for my talk today; I'm here for any questions.

Thank you very much, that was very interesting. So, there are questions; everyone who lifts their hand, I will come over, and you're the first.
Thanks for the great talk. Did you measure power requirement reductions in your use cases?

Did we measure the power requirement reductions? For us, we just use a CPU, because we chose a relatively small model that can also run on a CPU device; the end device could even be a cell phone or another small device.

You were next, yes. My question is similar: you mentioned at the beginning that such a model can consume up to four times what a car consumes; how much is your model consuming?

Well, I don't have an exact measurement, but definitely much, much less than that. The reason those general language models generate four times as much carbon dioxide as a US car is that you need to train them on a huge amount of data, and the whole training process consumes a lot of energy. In our case we are using a small model for a very specific task. We have our own GPU setup at the company, and the training procedure only takes around five hours on our own GPU. Even though I don't have an exact number for how much energy that consumes, it is definitely far less than for the giant models on the market now.

Sounds great, thank you.
Anyone else? Oh, wow. I know it's maybe not the biggest focus of the talk, but could you elaborate a bit on the self-distillation process? What is the exact goal, is it just trying to optimize performance, or is there anything else?

The self-distillation procedure? Yes, the one you mentioned briefly on one of the slides. Do you want me to go back to the slide? Okay, can you please repeat your question? Since we're not trying to get gains on energy consumption, because the number of parameters is not changing, what does the technique entail, how does it work in a general sense? Sorry, I couldn't quite catch that. Could you just elaborate a bit on self-distillation? Ah, yes. For self-distillation you use exactly the same architecture as the teacher model. The teacher model is already trained, so it can already do a lot, and then you train a new model. The teacher was trained on the real data, while the student has exactly the same architecture but gets additional information generated by the teacher model; that extra information is what the model gains in the case of self-distillation. In that case it's not about model compression anymore, because the models have the same size, but it's still very interesting that it actually improves the results of the model.

So, one more. I'll have a look if there's something from the online crowd. No? There's one question over there.
Sorry, thank you. I wanted to ask: does this method lead to overfitting, and have you looked at it from that angle?

Does this method, or distillation in general, overfit? Well, I don't think it does, because the whole training process includes a validation step, and our model is actually relatively small, so by its nature it doesn't have the capacity to overfit too much on the training data. Actually, the teacher model is more certain about whatever prediction it generates, while our student model is less sure which class is the right one; maybe it predicts 60 percent that a word belongs to the organization class. That doesn't really affect the result much: whether it's 99 percent or 60 percent, the 60 percent is still the maximum compared to all the other classes. So by its nature the student architecture is not inclined to overfit.

Okay, thank you. Don't be shy, one more. Yes, I see, thank you.
On your slide there is the European Union example, right? There are two things: one is that there seems to be a negative value, and the second is why you didn't choose the label as one-zero-zero but instead 0.8 and so on.

Yes, that is because it is not yet a distribution. In the model we first get outputs that are just numeric values, the logits, and normally we then apply a softmax to turn them into a distribution. So the numbers on the slide are not yet the probability distribution over the classes, but they convey the same information.
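A tiny illustration of that point: raw logits only become a probability distribution after the softmax; the numbers here are made up:

```python
import torch
import torch.nn.functional as F

# Raw scores (logits) for the classes ORG, PER, LOC, MISC: not yet probabilities,
# and they can be negative.
logits = torch.tensor([2.1, 0.3, -0.5, 0.1])

probs = F.softmax(logits, dim=-1)
print(probs)        # approximately [0.73, 0.12, 0.05, 0.10], values like the 0.8 on the slide
print(probs.sum())  # 1.0: only after softmax do the outputs form a distribution
```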
Then, once again, thank you very much, and have a nice evening.