
Fine-tuning large models on local hardware

Abstract
Fine-tuning big neural nets like Large Language Models (LLMs) has traditionally been prohibitive due to high hardware requirements. However, Parameter-Efficient Fine-Tuning (PEFT) and quantization enable the training of large models on modest hardware. Thanks to the PEFT library and the Hugging Face ecosystem, these techniques are now accessible to a broad audience. Expect to learn:
- what the challenges are of fine-tuning large models
- what solutions have been proposed and how they work
- practical examples of applying the PEFT library
Transcript (English, auto-generated)
Yeah, thanks everyone for attending my talk on fine-tuning large language models, or large models in general, on local hardware. As you can see, the slides are online, so if you're interested, just follow the link or scan the QR code. I want to start right away with the problem and not lose any time on anything else.
You have probably all heard about the recent developments in machine learning and AI when it comes to big models. For instance, we've heard about ChatGPT, Gemini, and Claude for language modeling, or maybe you've heard about image generation models like DALL-E, Midjourney, and Stable Diffusion. Now let's assume you have your own data and you want to train those models on it, because maybe they are not good enough for the task at hand. Do you have any possibility to do that?
Thankfully, yes, you do. Let's assume for this talk that we are interested in training a language model. Meta, for instance, has released a couple of models that we can freely use; one of them is Llama 3 8B, and I wanted to train that model on my PC locally. So I loaded the model through the Hugging Face Transformers library, set up my optimizer, loaded my data, and set up my training loop. This is fairly standard PyTorch code, which you have probably used yourself. Then I wanted to run this code, but I got an error.
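For reference, here is a minimal sketch of that kind of standard fine-tuning setup. The actual slide code is not in the transcript, so the model ID, hyperparameters, and the `dataloader` are assumptions for illustration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"  # assumed model id
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model.to("cuda").train()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for batch in dataloader:  # yields tokenized batches with labels (assumed)
    outputs = model(**{k: v.to("cuda") for k, v in batch.items()})
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
# -> on a 24 GB consumer GPU this raises torch.cuda.OutOfMemoryError
```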
Who can tell me what's wrong? Anyone have an idea? Okay, I can hardly see you, so I'll just come straight to the solution: an out-of-memory error is what we get if we run this code. This is a very dreaded error, the bane of our existence. If you have ever tried to train a big model on your local machine, you probably came across it. In general, memory will be the bottleneck if you want to train large models, and therefore it is a big problem that we want to tackle.
So let's first explore why memory is such a big problem. I checked the Llama 3 8B model and loaded it in float16 or bfloat16 precision. Looking at the different modules, we see the embedding layer, which comes with 500 million parameters; the linear layers, which have weights but no biases, at 7 billion parameters; and the RMSNorm layers, which we can mostly ignore. Overall that comes down to 8 billion parameters, which is why it's called 8B.
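A quick sketch of how you could count these per-module numbers yourself (the grouping logic here is mine, not from the talk):

```python
from collections import Counter
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16
)
counts = Counter()
for module in model.modules():
    if not list(module.children()):  # leaf modules: Embedding, Linear, RMSNorm, ...
        counts[type(module).__name__] += sum(p.numel() for p in module.parameters())
for kind, n in counts.most_common():
    print(f"{kind:>20}: {n / 1e9:.2f}B parameters")
```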
If we look at the memory we need for that, it is 14 gigabytes. That's quite a lot, but the packaging of my GPU said 24 gigabytes, so what's happening? Is Nvidia lying to me? No, it's a bit more complicated than that. When we want to train such a model, we actually need a lot more GPU memory. As you probably know, to train a neural net we have to calculate the gradients, and for each parameter we need a gradient. That means we already need twice as much memory just to store the gradients, but this is not enough. If we check back with the code from earlier, we are using the AdamW optimizer, which is a fairly good, standard optimizer that works really well, but it has some drawbacks: it has to keep track of optimizer states, the mean and variance of the parameter updates. That means we have to add twice the size of the model again on top of the memory we already require. Overall, we need four times the memory it takes to load the model in order to actually train it.
For the Llama 3 model, that means we are at 56 gigabytes of memory. That is quite a lot, and it explains why I could not train it on my machine. On top of that, we also need more memory for things like the activations, which I haven't included here because they're fairly difficult to calculate; just know that we have to add even more memory on top if we want to train this model.
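The back-of-the-envelope arithmetic looks like this (the parameter count is rounded so that it reproduces the talk's 14 GB / 56 GB figures):

```python
params = 7.5e9                 # ~8B parameters, rounded to match the talk's numbers
weights = params * 2           # bf16: 2 bytes per parameter   -> ~14 GiB
grads = params * 2             # one gradient per parameter    -> ~14 GiB
adamw_states = params * 2 * 2  # mean + variance per parameter -> ~28 GiB
print(f"{(weights + grads + adamw_states) / 2**30:.0f} GiB")   # ~56 GiB
```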
Okay, so then I checked a few of the more popular models on the Hugging Face Hub to see what I could train on my machine, and I came up with these numbers. Again, let's assume we load them in float16 precision. We have a couple of nice models from Meta, from Mistral, the Qwen models, and the Google Gemma models, which came out just last week, I think. As you can see, they are all quite big, and the only one I can reasonably train on my machine would be the 1.5B parameter model from Qwen. That's nice, but it might not be enough for my use case. So is there anything we can do about this? Well, of course there is, otherwise I wouldn't be here, and the solution is called parameter-efficient fine-tuning.
I want to introduce you to one of the packages we develop at Hugging Face, which is called Hugging Face PEFT. It implements a lot of different methods that allow you to use parameter-efficient fine-tuning to decrease the memory you need for training. This is achieved by reducing the number of trainable parameters, and we will see what that means in a few seconds. We also provide a bunch of convenience functions that you typically need when working with these types of models.

There are also some misconceptions, so I want to clarify them right away. For inference, there is no memory benefit: this is purely for training, so if you cannot fit the model into memory for inference, PEFT will not help you. Furthermore, most papers show that full fine-tuning is still the best when it comes to performance, so if you want the very best performance you might want to invest in bigger GPUs; but we can get fairly close to those results. Finally, some people think that PEFT makes the training faster, and that's also not necessarily the case: the goal is always to reduce memory, not to make training faster, although in practice it can often be faster. So now let's check out one of these methods in detail. This method is called LoRA,
which is short for low-rank adapters. Why LoRA? Well, it's the most popular method, so I wanted to go into a bit more detail here; just know it's not the only one. I've included the paper link if you're interested in the details, but I also want to give you a short high-level overview of what LoRA is doing. Let's assume we have a linear layer with a weight W, and this W is often quite big; for our example, let's assume it has size 1,000 by 1,000. In normal fine-tuning we would update those 1 million weights, but if we are using LoRA, we apply a LoRA adapter, as it's called. That means we don't update W; instead we update two smaller matrices, A and B. These matrices have a low rank, for instance 8, so the A matrix would be 8 by 1,000 and the B matrix would be 1,000 by 8. If you add that up, you will see it's a much smaller number than 1,000 times 1,000: 16,000 trainable values instead of 1 million.
Now, when it comes to calculating the hidden states: at the top of the slide we have the normal linear layer, which computes W·x + b, so the weight matrix multiplied by the input, plus the bias. With a LoRA layer, we change the equation a little bit. We have W*, which is just the same as W but frozen, meaning we do not apply any updates to the base weights, and then we have a ΔW that is added on top, so the output becomes (W* + ΔW)·x + b. This ΔW is calculated by simply multiplying B and A, and if you do the math, you can see that ΔW = B·A has the same shape as W, but a low rank. By the way, if you did not fully follow that, it's not necessary for the rest of the talk; I still wanted to give you a brief overview. Okay, so how does it look in code?
You don't have to change a lot of your code to train a PEFT model. You load the base model as you always would, and then, after installing PEFT, you can import the LoraConfig class and the get_peft_model function. You define a LoraConfig instance, which takes a bunch of parameters; r is the rank we just saw, but you can configure much more to your liking. Then we call get_peft_model, passing the base model and the config, and all the rest is exactly the same as you would always do. So you can really easily add it to your training script and it should just work.
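A minimal sketch of what that looks like (the exact hyperparameters on the slide are not in the transcript, so these values are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16
)
config = LoraConfig(r=8, lora_alpha=16, target_modules="all-linear")
model = get_peft_model(base_model, config)
# optimizer, data, and training loop stay exactly as before
```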
If you paid attention, you're now probably asking yourself: if we are adding more parameters to the model, how come we need less memory? That doesn't make sense, does it? Well, let's revisit. If you remember, three-quarters of the memory we needed was required for the gradients and the optimizer states, right? But those are only required for the trainable parameters; the base weights are not trainable, so we don't need to calculate gradients and optimizer states for them. And the LoRA weights are typically much, much fewer, less than 1% of the total number of parameters. That means we only have to calculate gradients and optimizer states for 1% of the parameters, so we need less memory to train the model despite having more parameters in total. As a nice bonus, when you are finished with training, you only need to save the LoRA weights, because the base weights of the model are already available. The checkpoints are really, really small, so you can easily share them, move them around, and so on.
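In PEFT this is easy to see and to do; a short hedged sketch (the printed numbers are illustrative, not measured):

```python
model.print_trainable_parameters()
# e.g. "trainable params: 20,971,520 || all params: 8,051,232,768 || trainable%: 0.26"

model.save_pretrained("my-lora-adapter")  # saves only the small LoRA weights
```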
Let's get back to the earlier table of memory requirements for the different models. In the middle column we have the same numbers as before, and on the right we have the numbers for LoRA fine-tuning, assuming a rank of 32. As you can see, the memory we need is much smaller, roughly just a quarter of what we needed initially. For the Llama 3 model in the first row, we went down from 56 gigabytes to 15 gigabytes. That is a really considerable improvement, and on my machine I could probably train this model: even though I need a little extra memory for the activations, I have nine gigabytes to spare, so that should work. However, if you're using a smaller GPU with 16 gigabytes, for instance the free Google Colab T4 GPUs, which have 16 gigabytes, it will still not be enough. So can we do even better? Yes, there is a way, and that is to combine PEFT with quantization. This is not a talk about quantization, but I want to give you a very brief overview of what we mean by that.
Usually when we load a neural net, the weights are loaded in float32, which means four bytes per parameter. We can also go down to float16, with two bytes per parameter. However, if we load a model with quantization, we are loading it in int8 or even int4 precision. There are a few other options, but those are the most common. Now we are down to one byte, or even half a byte, per parameter, which means that going from float16 to int4 gives us a 4x reduction in memory.
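The arithmetic for an 8-billion-parameter model:

```python
params = 8e9
for dtype, bytes_per_param in [("float32", 4), ("float16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{dtype:>8}: {params * bytes_per_param / 1e9:.0f} GB")
# float32: 32 GB, float16: 16 GB, int8: 8 GB, int4: 4 GB
```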
That sounds pretty nice, right? Well, there is a disadvantage: these models will have somewhat degraded performance, because we are making approximations. You have to factor this in, but it's not as bad as it might seem on the surface. There is a bigger problem, though, and that is that we cannot train a quantized model. The short summary of why is that we are loading the weights as integers, and thus we cannot calculate gradients for them; without gradients, we cannot train. So what can we do about that? Well, that's the nice thing about PEFT. If we remember what we learned earlier, we don't have to update the base weights; they stay the same throughout the whole training process. And the LoRA weights we can actually still load in float32 precision. Since there are so few of them, it doesn't really matter for the overall memory footprint that they are in float32; it's still going to be very small. And since the LoRA weights are in float32, we can calculate gradients for them, and thus we can train them. This combination of methods is called QLoRA, quantization plus LoRA, and it has proven to work quite well; there's a paper showing that the results are fairly good, which you can check out later.
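A hedged sketch of a QLoRA setup using the bitsandbytes integration in Transformers (the exact settings are assumptions, not the talk's):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # int4 base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", quantization_config=bnb_config
)
base_model = prepare_model_for_kbit_training(base_model)

# the LoRA weights stay in higher precision, so they remain trainable
model = get_peft_model(base_model, LoraConfig(r=32, target_modules="all-linear"))
```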
Okay, let's revisit the table again. In the middle we have normal LoRA fine-tuning, and on the right side QLoRA fine-tuning, assuming int4 precision. As you can see, the amount of memory we need goes down again, by a factor of almost four. The 15 gigabytes we initially had to reserve for Llama 3 8B is now just 5 gigabytes. So if you have a 16-gigabyte GPU, or even a 12-gigabyte GPU, you can probably train Llama 3 8B on it using QLoRA, which is just quite amazing. If we look at the other models: the 70-billion-parameter models are still not really possible on local GPUs, so there we need some of the bigger machines. But look at the last row, a 27-billion-parameter model: we are now down to just 17 gigabytes, so there's a fairly good chance we could train it on a 24-gigabyte GPU. By the way, I calculated these numbers with a script, which I have also uploaded to the repository. So if you want to check how much memory some models require with different parameters, you can just visit the repo and run the script.
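The actual script lives in the talk's repository; a hedged reconstruction of the kind of estimate it might make could look like this (this is not the speaker's script):

```python
def estimate_training_gib(params: float, weight_bytes: float = 2, trainable_frac: float = 0.01) -> float:
    """Rough training-memory estimate: base weights plus fp32 gradients and AdamW states
    for the trainable (LoRA) fraction of the parameters. Activations are excluded."""
    weights = params * weight_bytes
    grads_and_states = params * trainable_frac * (4 + 8)  # fp32 grads + two AdamW states
    return (weights + grads_and_states) / 2**30

print(f"LoRA  (bf16 base): {estimate_training_gib(8e9):.0f} GiB")                    # ~16 GiB, close to the table's 15
print(f"QLoRA (int4 base): {estimate_training_gib(8e9, weight_bytes=0.5):.0f} GiB")  # ~5 GiB
```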
Okay, I also wanted to talk about some of the other features we have in PEFT. One of our goals when designing this library is to make it very flexible. That means we have a ton of different adapter methods, and they all have their own strengths and weaknesses. Of course, I don't have time to go over them all; just know there is much more than just LoRA, and depending on the type of problem you have, you should probably take a look at these other methods as well. We also make it very easy to choose which layers you want to target, so even if you're not using one of the very popular models, it should still work. Sometimes you might also want to fully fine-tune some of the layers, and that is also possible with PEFT. When it comes to LoRA, we support a bunch of different layer types, which should cover most of your bases. We support half a dozen different quantization methods, and we offer different initialization options, which have advantages in certain circumstances. We have also tested distributed training to ensure that it works: DDP, DeepSpeed, FSDP, it's all working. And we have a bunch of other features as well.

Then I wanted to mention some of the advanced features I alluded to earlier. If you have more than one adapter, say you trained three LoRA adapters for three different tasks,
you can load all of them at the same time on the same model and also switch between those adapters. Often it can also be necessary to disable the adapters so that you can query the base model again; that is also possible. One nice thing we offer is that you can merge the LoRA weights into the base model. Why would you do that? Well, as you saw, with the LoRA weights we have to do a little extra calculation at inference time, so we are a little bit slower, not much, but a little. If we merge them into the base weights, we get back the original speed. So if you have just a single LoRA adapter and you only want to do inference, you should merge it into the base weights. Then we have some other options: you can mix different adapters in the same batch when doing inference, and if you have multiple LoRA adapters, you can merge them into a single LoRA adapter, which can be quite useful because it might have the capabilities of all of them combined. We also have support for torch.compile for many of the features: if you want to do LoRA training, it should work with torch.compile, but some features are not supported, so if you follow the link, you can see what works and what doesn't.
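A hedged sketch of those adapter-management calls, using PEFT's public API (the paths, adapter names, and `inputs` are made up for illustration):

```python
from peft import PeftModel

# load two task-specific adapters onto the same base model
model = PeftModel.from_pretrained(base_model, "adapters/task-a", adapter_name="task_a")
model.load_adapter("adapters/task-b", adapter_name="task_b")
model.set_adapter("task_b")              # switch the active adapter

with model.disable_adapter():            # temporarily query the plain base model
    base_output = model.generate(**inputs)  # `inputs` = a tokenized prompt (assumed)

merged = model.merge_and_unload()        # fold LoRA weights into the base weights
```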
Finally, I want to give you a few tips for getting started with PEFT training. In general, all the knowledge you already have about training neural nets is still valid; this is not something completely different, so keep everything you know in mind. But there are some special things to consider when training a PEFT model. The first tip, which is more general: start small and see whether that's already enough to solve your problem, and only go to a bigger model if you see that the smaller one is not capable of solving your task. I think that's fairly obvious. Then, let's say you're using a large language model: your first option should always be prompting. Maybe prompting techniques like chain-of-thought or few-shot prompting are already enough to reach your goal; in that case, don't bother with PEFT, because training will always complicate things.
But if you find that you do want to use PEFT, I would recommend starting with LoRA first, not necessarily because it's the best technique, but because it's the most popular, so you can find a lot of help online, and it has a lot of features that the other methods might not have. Please also give the other methods a try once you have tested LoRA. It is also important to do a very quick end-to-end run, covering the full training and, if possible, deployment, because there can be pitfalls with this kind of training. For instance, some users were using distributed training, and at the end, after three days of training, they found that the checkpoint was not working, so all that training time was wasted. Therefore, do a quick end-to-end run with just a small amount of data, see if everything works, and only then do the full run.
Users often ask us which layers they should target. Typically that should just be all linear layers; that works best in most cases, and we have an extra option for that. If you find that your model is underfitting, try increasing the rank of the LoRA adapter; if it's overfitting, decrease it. You can of course also apply dropout and the other techniques you already know. With LoRA you can typically get away with a higher learning rate: if you found that a certain learning rate works for full fine-tuning, try ten times that for LoRA fine-tuning. Also, since you have more memory available, you can maybe use a larger batch size, and thus training will be a little quicker for you. Finally, I would recommend taking a look at the initialization options for the adapters. Especially if you're quantizing the model: as I mentioned earlier, we lose a little precision because of that, but the LoRA adapters can actually offset that loss of precision if we initialize them correctly. So that's also an option that can help with quantized models.
Yes, and that's it for my talk today. Thanks for your attention; here again are the slides, and I'm open for questions.

Thank you very much. I have a small present for you as a thank you from the organizers and all the volunteers and everyone here. I hope we do have time for some questions. Do we have any questions? Please make your way to the microphone. We have a question over here.

Yeah, thanks for an excellent talk. I have a question related to quantization: you can reduce the memory requirement for training, and I'm just wondering, if you apply the LoRA technique after that, is there a way to actually map the original parameters back to their previous state? That way, potentially, you could test the performance of the model with the new parameters as well.
Yeah, so by default that's not happening. By default the LoRA adapters are initialized such that they have no influence at all: they are just an identity transform on the model, and only training changes that. But some of the options, let me go back... here, if you follow this link on the initialization schemes, some of the options we have there, I think I also have the name: LoftQ, for example, would be one of them. It basically calculates the error you get from quantizing and then initializes the LoRA weights such that this error is minimized. So if you use this initialization scheme, you should get a model that works almost as well as the non-quantized variant, and then of course you can do additional training and it should be even better. Okay. Thank you.
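A hedged sketch of what that LoftQ-style initialization looks like in PEFT (option names from the library's LoRA config; the values are illustrative):

```python
from peft import LoftQConfig, LoraConfig

config = LoraConfig(
    r=32,
    target_modules="all-linear",
    init_lora_weights="loftq",               # initialize A/B to offset quantization error
    loftq_config=LoftQConfig(loftq_bits=4),  # match the 4-bit quantization of the base
)
```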
We have another question, please go ahead. Could you talk into the microphone a little bit more, sorry. Do you think that LoRA is a good method for the model to learn new knowledge? Because from my experience it just learned your formatting.

Yeah, new knowledge... it depends a little bit. I think what it does best is maybe enforcing a specific kind of knowledge that the model already has. These LLMs, and other models too, can be very general and produce a bunch of different outputs, but you want a specific one; as you mentioned, maybe you want a specific format. That will typically work really well. But if you check out the papers, I think it can be a bit debatable whether it is new knowledge or just the old knowledge repackaged. They can at least solve tasks that were previously not really solvable with these LLMs, so I think there is a little bit of learning capacity, since we do have new parameters. But you should probably not expect too much: a model that, for instance, has only been trained on English, you probably can't teach it Czech; that's probably too much for LoRA. Yeah, thank you.
We have another question over there. Yep. Firstly, thanks for creating this library, and thank you for your talk as well; it's really helpful to see exactly how much memory training some of these models will take. I've already started a project fine-tuning with Axolotl, and I was interested in your take on the different fine-tuning libraries available and how they compare to each other.

Yeah, so I had a slide on this, which I put in the backup slides. Axolotl is really excellent. For those who don't know it, it provides a bunch of streamlined fine-tuning scripts, basically. So if you want to get started and, for example, you don't want to experiment with all the hyperparameters and so on, you can check whether the task and the model you're interested in are already covered in Axolotl, and then you can use that. It builds on top of PEFT, basically. It's not quite as flexible, because it is streamlined for those specific use cases, but if the use case fits your problem, then you should definitely give it a try. And I think there is a similar library, which you should definitely check out, at the top here, which is Unsloth. It adds a bunch of optimizations on top, so you can get quicker results and use even less memory. So you should also check that one out if it supports the use case you're interested in. Thank you very much.

Okay, I don't think there are any more questions. I would like to thank you once again. Can we give one big applause to