Unlocking Mixture of Experts: From One Know-It-All to a Group of Jedi Masters
Formal Metadata
Number of Parts: 131
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/69517 (DOI)
Transcript: English (auto-generated)
00:04
So, today my goal is to explore a technology, the mixture of experts architecture, which is one of the dynamic-computation techniques driving generative AI modeling. Before we push start, I'm going to quickly explain why I deserve this opportunity to
00:23
present to all of you. So I have dedicated AI experience over the last five years: I've built a lot of applications and sold them to enterprises, taking them from prototyping to deployment, I've done a lot of consultancy projects, I'm a fan of Python and PyTorch, the love of my life,
00:40
and I keep up with how fast AI is moving, right? We all know that papers are coming in so fast you can't keep track. So I've been trying to keep track of these things and continuously learning in the field of AI. So let's get straight to the topic. Before understanding mixture of experts, let's try to understand the intuition behind why we've reached the point where we need a mixture of experts.
01:03
So why does a large organization have the organizational hierarchy or the structure that it has? You know, startups usually start out with a flat hierarchy, but beyond a certain scale, they start having this transformation of an organizational hierarchy. And the answer purely is because of specialization, right?
01:22
You need teams that are specialized and can solve specific problems for the business with the highest accuracy, because we cannot compromise on quality, right? So the idea is that it's a divide-and-conquer strategy. We're going to take a big, complex problem, for example, generative modeling in different
01:41
domains and spaces, and we are going to divide it among smaller experts who can solve this problem, right? It also reduces conformity bias. So for example, if you have a group of five people working on the same problem, usually the outcome would be something all five of them agree on. But if you have them working in silos, individually, their responses and their outputs for the
02:03
solution could be completely different. So we want to reduce the interaction between these experts so that they can perform individually. So our aim is to push specialization, and that's where performance boundaries expand. And we're also not looking for a jack of all trades, like a generative AI model such as ChatGPT, which can solve a lot of problems with decent accuracy; for critical domains
02:24
like healthcare, we want experts who are super accurate at solving that one problem in the best possible way. So we are trying to promote specialization over generalization, and we're moving away from the one-size-fits-all approach. From an engineering standpoint, you can think of this as a monolithic architecture where
02:43
you cannot keep expanding the RAM of the system, and eventually you have to move towards a distributed, parallel computing scenario. Now let's quickly cover the mathematical intuition as well. So where did we all start? With traditional machine learning models, the simple y = f(x). We have the x, we know the y, and with supervised learning algorithms we are trying
03:02
to find the f. But then we moved towards bagging, where we knew that one single traditional model might not be capable enough, so we added more models and aggregated their outcomes into one solution. Then we moved towards boosting; the XGBoost algorithm has worked wonders in the ML
03:21
community, and even complex models have been replaced with a simple XGBoost. Then we moved towards ensembles, where we've given different weights to different individual models, and the aggregate outcome was what we preferred. And then we moved towards stacking, where there's like a base estimator and then a meta estimator. So base estimators predict something, and a meta estimator kind of manages them and tries
03:43
to predict the final outcome. And then finally we moved toward a mixture of experts. Now, please don't think of mixture of experts as a technology solely for large language models. It's a technique that can be applied to diverse applications in machine learning. So in the mixture of experts, the core difference is that you have another function,
04:02
g_i(x), which you can see on the slide, which assigns a weight to the output coming from f_i(x), an individual expert. On the right-hand side you can see there is a gating network which decides how much importance or confidence we should give to each individual expert's output.
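In symbols, the classic mixture-of-experts output can be written roughly as follows (standard notation assumed, not copied from the slide):

```latex
y(x) \;=\; \sum_{i=1}^{N} g_i(x)\, f_i(x),
\qquad \sum_{i=1}^{N} g_i(x) = 1,\;\; g_i(x) \ge 0
```

Here f_i is the i-th expert and g_i(x) is the gating network's weight for it; in the sparse variants discussed later, most g_i(x) are exactly zero.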
04:21
So this is dynamic computation: given an input, we dynamically decide which expert would be best to answer this question, and then we aggregate those results. And a big problem with stacking algorithms, which I've experienced personally, is that experts start to compensate for other experts.
04:42
Because they're trained together and their output is combined, individual experts, over backpropagation, start to compensate for what the other expert could not achieve. This is not the case with mixture of experts, because each expert works in a silo. It's an individual, non-interactive expert.
05:00
So we've all heard of Mixtral 8x7B, which has beaten Llama 2 70B. And you might think of Mixtral 8x7B as something like the diagram on the left, where there's a router network at the bottom and eight seven-billion-parameter models that are interacting and producing the output, but that's actually not true.
05:22
The one on the right is the actual representation of the mixture of experts, where the only place where the mixture of experts is placed is at the feed-forward neural network. For the LLM guys, the position-wise feed-forward neural network, that's what is replaced with the mixture of experts. So the architecture doesn't look like the one on the left, but the one on the right.
05:44
Now, the roots of MoE actually come from a 1991 paper, "Adaptive Mixtures of Local Experts". It's pretty old as an architecture, as a technique, but with the added advantage of more data and more computational power, we've actually reached a position where we can use mixture of experts.
06:01
So on the left-hand side, you see the classic transformer architecture, which has the multi-headed attention, the add-and-norm layers, and the residual connections. Nowadays we do the normalization before the multi-head attention, which is usually a technique for more stable training, but this is the original architecture of the transformer. The only thing that has changed from the classical architecture
06:23
to the mixture of experts is the feed-forward layer, the one highlighted in blue that you can see on the right. Instead of the feed-forward layer, which was one single network processing all the requests, we now have multiple individual experts.
06:40
They're all feed-forward layers, but now the router, which decides which expert is going to be called for this input, manages the selection and routing, and the output comes from the selected expert. So if you look on the right, there's an example: two tokens, "more" and "parameters", have been embedded with positional encoding,
07:01
then there's the attention layer, and after the attention layer there is the mixture of experts layer, where the router assigns a selection probability to each expert. An expert is selected, its output gets multiplied by the weight the router gave it, and that goes on to the next layer.
07:22
And because transformers are made of stacked blocks, every transformer block is going to have its own mixture of experts. Now, the motivation behind mixture of experts is this: given a fixed compute budget, training a larger model for fewer steps
07:41
is empirically shown to perform better than training a smaller model for more steps. Now, this goes against our original understanding from traditional models, where we know that if you want to build a robust model you add more and more data and fine-tune it to get a better model; but in the case of LLMs, you can see that historically,
08:03
we've been scaling to bigger and bigger models. GPT-4, which is rumored to be a mixture of experts with 8 x 220 billion parameters, is like a trillion-parameter model. But we'll see why "a trillion parameters" is not really the right way to think about it, because each expert is 220B, and at inference time we're only using one expert at a time.
08:22
And you can see there's a 15,000x increase in five years, and for a comparative analysis, I've also shown the number of parameters in a rat brain and in a human brain, so I'm kind of predicting we're moving towards the number of parameters in a human brain, which is obviously way far off, but we'll get there. There's also empirical evidence to show
08:41
that scaling the model capacity increases model performance, which is partly why companies are scaling their models to bigger and bigger dimensions. You can see this in a sparse model: a sparse model is one where you have experts and you're choosing more than one expert at a time. If you choose a single expert, it's called a switch transformer,
09:01
again, LLM and generative AI lingo, but it's called a switch transformer because only one expert is used. When more experts are used, let's say two or three, it's called a sparse model, as compared to a dense model like GPT-3, where all the neurons are processed for every input. And you can see that as we increase the sparse model parameters,
09:23
the test loss keeps decreasing, so it roughly follows a power-law scaling curve. The curve looks like that, and on the right-hand side you can also see that the bigger model reaches that loss sooner than a smaller model. So those empirical scaling laws do apply.
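For reference, the power-law shape being described is roughly of the generic form used in the scaling-law literature (not the exact fit shown on the slide):

```latex
\mathcal{L}(N) \;\approx\; \left(\frac{N_c}{N}\right)^{\alpha_N}
```

where N is the parameter count and N_c, alpha_N are fitted constants, so loss falls smoothly, with diminishing returns, as the model grows.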
09:41
Now, it begs the question: why mixture of experts? That's one, but also, why have we added the mixture of experts in the feed-forward neural network and not in other parts of the architecture, right? So first of all, why mixture of experts: training the model is one thing, but when you serve it for inference,
10:02
that's when a lot of time is taken. The inference time, the latency, the cost and the throughput all suffer because all the neurons in a dense model have to be processed through matrix multiplication. But in a mixture of experts, you can reduce the inference time because you're now picking experts selectively. So it's a dynamic choice.
10:21
So traditional networks process all input data through every layer, but to increase the capacity of the model we can obviously add more transformer layers. Mixtral, say, is a 32-layer transformer architecture; you could make it 64 or 128 layers. But then again, you'd still have to process all of them, and the cost and latency would increase. On the other hand, if you look at the GPT-3 parameter usage graph on the right,
10:45
a significant share of the model's parameters is actually in the feed-forward layer, not in the attention layer. Adding more parameters in the attention layer also adds quadratic complexity. So with all of these ideas in mind, we want to find the exact place where the mixture of experts
11:03
could replace the existing architecture. And that's why we've chosen the feed-forward layer. Now, unified scaling laws have also been seen to apply to the mixture of experts. So you can see with these curves on the left, as the expert count increases,
11:20
the validation loss keeps decreasing. So, predicting loss for varying expert counts at the same model size, 1.3 billion, 370 million, 130 million parameters, you can see that the validation loss decreases as you keep increasing the number of experts. On the second curve, you can see that for a constant loss value, for models of different dimensions,
11:43
if you increase the expert count, you can see that the performance improves. So the unified model scaling laws apply, which is why we've added a mixture of experts to increase the scale of these models. Now, there's a lot more going on in increasing the efficiency of these transformers, but that's a talk for another day.
12:02
Coming to the core architecture and components, right? The gating router, this is the deciding manager, right? If there's a new project that comes to the business, the manager decides which team and which member of which team actually gets to do that task, because it knows the individual strengths and weaknesses of every associate, every employee, right?
12:21
And then there are the experts, which are the executors. You could think of them as employees being managed by a manager. And the core part is the conditional computation. This is where a lot of variation comes into play. So when you're thinking about routing strategies and hyperparameters, you could ask things like: do I apply the softmax right after predicting the probabilities,
12:43
or do I select the top two or top three and then do a softmax? Should I pick only one expert, or multiple experts? Things like that. Those are hyperparameters you can tune. Let's get our hands a little dirty. On the left, you can see the position-wise feed-forward layer that we currently have.
13:01
It's a simple two-layer linear transformation with dropout, and there's a ReLU in the middle for activation. On the right is where the individual components are built: the expert module and the gating router. The expert, again, usually has a smaller dimension. So if you compare the dimensions of the original feed-forward network with the dimension of an individual expert,
13:21
usually the expert dimension would be much smaller. And the gating router is another single linear layer which, given the input, basically generates the probabilities of picking individual experts. Now, there's a small noise component that I've added during the training phase. We'll come to that when we discuss the challenges of training a mixture of experts.
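The slide code isn't reproduced in this transcript, so here is a minimal PyTorch sketch of what's being described: an expert module, a noisy gating router, and the top-k forward pass walked through next. Class names, dimensions, and the noise scale are illustrative assumptions, not the speaker's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """A small position-wise feed-forward block: Linear -> ReLU -> Linear."""
    def __init__(self, d_model: int, d_hidden: int, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class NoisyTopKRouter(nn.Module):
    """Single linear layer that scores experts; noise is added only during training."""
    def __init__(self, d_model: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x):
        logits = self.gate(x)                                  # (tokens, num_experts)
        if self.training:
            logits = logits + torch.randn_like(logits) * 0.1   # exploration noise
        return logits

class SparseMoE(nn.Module):
    """Replaces the dense FFN: route each token to its top-k experts."""
    def __init__(self, d_model=512, d_hidden=128, num_experts=8, k=2):
        super().__init__()
        self.experts = nn.ModuleList([Expert(d_model, d_hidden) for _ in range(num_experts)])
        self.router = NoisyTopKRouter(d_model, num_experts)
        self.k = k

    def forward(self, x):                                      # x: (tokens, d_model)
        logits = self.router(x)
        topk_logits, topk_idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(topk_logits, dim=-1)               # softmax over the chosen k only
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e                  # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Tiny smoke test: 6 tokens of dimension 512 through an 8-expert, top-2 MoE layer.
moe = SparseMoE()
print(moe(torch.randn(6, 512)).shape)  # torch.Size([6, 512])
```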
13:42
And that might have looked simple, but this is the complex part. The core area where we need to solve unstable training, poor performance, or sudden swings in the model's accuracy is this forward pass. So we are basically taking each and every token
14:01
through the gating router. For every token, the gating router produces the logits. The logits are then passed through PyTorch's top-k selection. So basically, given a set of logits, you find the top two experts that are most likely to handle this token best.
14:20
And then you apply a softmax at the end to find the weight distribution between those two top experts. And finally, you do the weighted sum: expert one's probability times expert one's output, plus expert two's probability times expert two's output. So that forward pass does the core mixture
14:42
of experts logic. Right? But obviously it's not as easy as it sounds, which is why mixture of experts has only recently been made to work well. We could have done this as soon as the attention paper came out in 2017, or after the more recent development of larger models. So why were we not able to do it?
15:02
It's because there were certain training challenges which were severe whenever we tried to train a mixture of experts. And the primary problem is load equalization. Think of this: if you start training a mixture of experts model from a random start, the expert that gets selected initially improves its performance over time.
15:21
And what happens is the gating router keeps choosing just that one expert to solve all of your problems, right? Which is why an expert can become a bottleneck, and we want to avoid that. We want to make sure that the load is distributed equally across all experts, both from an engineering standpoint and from a modeling standpoint, for specialization.
15:42
Now, if an expert becomes a bottleneck, what's also happening is that the other experts, the other resources, are being underutilized, right? And obviously we want to optimize that. There's also an unstable-gradients issue. Think about what happens if the gating router produces a skewed distribution over experts and the same small group of experts keeps being selected.
16:03
The other experts, whenever they are finally chosen, won't have received many backpropagated gradients, so their updates would not have been of the highest quality. In that sense, the unstable gradients mean those experts would be returning results that are, in a way, pretty bogus, right?
16:21
And we are actually aiming for a low variance of the gating probabilities. So if you look at the chart on the bottom right, you can see, for layer 1, layer 16 and layer 32, how the assigned tokens are distributed across the eight experts. You can see that the distribution has to be reasonably balanced across experts,
16:42
which relates to a common misconception about mixture of experts: the idea that experts develop expertise in domains or sections or topics. That is absolutely not the case. It's not like there's one expert who's great at healthcare, one that's great at manufacturing, and one that's great at comprehension. It's actually experts being great
17:02
at processing tokens and not topics, right? So we need this efficient distribution of computational load across all the experts' subnetworks, basically saying experts should be chosen uniformly by the gating router. And for that, we are adding that initial noise. The one that I was showing you earlier,
17:21
here in the gating router: if self.training, that is, during the training process, I want to add some noise. That's why we were doing that, for load balancing or load equalization, right? We also add an auxiliary loss component, which during backpropagation penalizes the model for selecting experts non-uniformly.
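For concreteness, here is a sketch of one common form of that auxiliary loss, the Switch-Transformer-style load-balancing loss; the exact equation on the slide may differ, and the alpha weight here is an assumed value.

```python
import torch

def load_balancing_loss(router_probs, expert_indices, num_experts, alpha=0.01):
    """Switch-Transformer-style auxiliary loss.

    router_probs:   (tokens, num_experts) softmax output of the gating router
    expert_indices: (tokens,) the expert actually chosen for each token
    Returns alpha * N * sum_i f_i * P_i, which is minimized when routing is uniform.
    """
    # f_i: fraction of tokens dispatched to expert i
    one_hot = torch.nn.functional.one_hot(expert_indices, num_experts).float()
    tokens_per_expert = one_hot.mean(dim=0)
    # P_i: mean router probability assigned to expert i
    prob_per_expert = router_probs.mean(dim=0)
    return alpha * num_experts * torch.sum(tokens_per_expert * prob_per_expert)

# Example: perfectly uniform routing over 4 experts gives the minimum value, alpha * 1.0
probs = torch.full((8, 4), 0.25)
chosen = torch.tensor([0, 1, 2, 3, 0, 1, 2, 3])
print(load_balancing_loss(probs, chosen, num_experts=4))  # ~0.01
```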
17:42
That extra loss component, which you can see as an equation on the slide, encourages a uniform distribution of tokens across the different experts. Now, on to something called expert capacity. PyTorch and TensorFlow, at compilation time, expect static sizes.
18:01
For example, the number of batches it can process, the number of tokens it can process. Now because our mixture of experts is conditional computation, like a dynamic choice of experts, we can never really be sure how many tokens will be processed by each expert in the beginning, when we are trying to compile the model. Which is why there's a concept of expert capacity.
18:20
Every expert is actually given a certain capacity. Now, an easy way to allot capacity is the total number of tokens divided by the total number of experts. So if there are 100 tokens and 10 experts, each expert should get about 10 tokens. But obviously there is a trade-off. Why is there a trade-off? Because during multiple calls to the LLM,
18:41
if the other tokens and requests coming into the LLM exceed an individual expert's capacity, then the same token which was routed to expert two in the previous call will now be routed to another expert in the next call, which makes these language models non-deterministic. That's usually okay, because we're using different sampling parameters anyway,
19:01
but if we wanted to make it deterministic, we can't, unless we have an expert capacity or a token routing strategy that ensures determinism. So if you look at the expert capacity illustration on the top right, the first image, there are three experts and six tokens, so each expert should uniformly get two tokens.
19:21
But what if the gating router gives a probability distribution where one of the experts receives more tokens than its capacity? The capacity overflows. What do we do then? That is when we apply strategies where we drop some tokens, or the tokens are sent to the next-best expert,
19:41
which is where the non-determinism comes into the picture. We could increase the expert capacity by a factor, let's say 1.5. That means if there were two slots per expert, it now gets extended to three. So there is more headroom than the nominal capacity.
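A minimal sketch of how that capacity number and overflow handling might look, assuming the common formula (tokens per batch divided by number of experts, times a capacity factor) and a simple drop-overflow policy; this is illustrative, not the routing code of any particular library.

```python
import math
import torch

def route_with_capacity(expert_ids, num_experts, capacity_factor=1.5):
    """expert_ids: (tokens,) top-1 expert chosen by the router for each token.
    Returns a boolean mask of tokens that fit; overflowing tokens are dropped
    (they would simply flow on through the residual connection instead)."""
    num_tokens = expert_ids.shape[0]
    capacity = math.ceil(num_tokens / num_experts * capacity_factor)
    kept = torch.zeros(num_tokens, dtype=torch.bool)
    counts = [0] * num_experts
    for t, e in enumerate(expert_ids.tolist()):   # process tokens in arrival order
        if counts[e] < capacity:
            counts[e] += 1
            kept[t] = True
    return kept, capacity

# 6 tokens, 3 experts -> nominal 2 tokens each; 1.5x factor -> capacity 3.
ids = torch.tensor([0, 0, 0, 0, 1, 2])            # expert 0 is over-subscribed
kept, cap = route_with_capacity(ids, num_experts=3)
print(cap, kept)  # 3, and the 4th token aimed at expert 0 gets dropped
```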
20:00
Now, like I was mentioning, what do experts really learn? The notion that experts learn domains or topics is absolutely incorrect. We can see on the right and on the left that the experts are actually learning tokens. So you can see ArXiv, GitHub, PhilPapers, Stack Exchange, different tokens coming from different training sets, and across the multiple layers for each expert,
20:21
you can see which of those tokens are getting routed to which expert. Because you see a balance here, that is why experts are experts at tokens, not at topics. Now there's another challenge of communication cost. Because this is dynamic computation, the token being routed to a particular expert actually has a communication cost
20:40
because for a larger model, you cannot have a single GPU's VRAM holding, say, a trillion parameters. So you actually divide your experts across different compute devices, and you have to route every token from the gating router to the expert network, which is why there are solutions in place. I've linked the paper. You can go for model parallelism
21:01
as well as expert and model parallelism, which kind of gives us the idea that experts are placed somewhere and tokens are routed to them, but we can duplicate the models, more like an orchestration strategy, like a Kubernetes kind of environment. So if an expert is being chosen more often, then we have more copies of those experts in place so that there's no bottleneck with experts.
21:22
And the communication cost obviously includes the data transfer between experts and the synchronization of parameters during backpropagation when we are training them together. So we need to make sure that for each expert, even if it has multiple copies, the weights in the end are synchronized so that the overall outcome stays deterministic.
21:40
Now, it might all sound like we've pretty much cracked the problem, but there are still points to note. The first one is that you do need a larger VRAM for loading the model parameters. Even though experts are chosen dynamically, you still need to have all experts loaded into the GPU, right?
22:02
So the advantage of MoE at scale is faster inference, but you still have to load all the experts into the GPU, so you do need a larger VRAM. You can think of mixture of experts as being more FLOP-efficient than VRAM-efficient. So choosing the right hardware,
22:21
the right deployment strategy is still a very nuanced decision; improvements come up every day, but you could think of these strategies as research in progress. There are also techniques like quantization, the vLLM package with PagedAttention, and xFormers that do increase efficiency, but you can see from the chart on the top right that the KV cache,
22:41
for any LLM that you want to deploy, actually takes around 30% of the memory. So it's not like, if you're hosting a 13-billion-parameter model, you just need a 16-gigabyte GPU, right? You'd rather need more, because the KV cache, which keeps storing the key-value matrices as tokens are processed,
23:01
also takes a substantial amount of space in memory. Finally, let's take a quick example of Mixtral 8x7B, a model based on the mixture of experts. It's a sparse mixture of experts, which basically means it's not choosing one expert; it's choosing two experts at a time. There are 32 transformer layers.
23:20
Each layer has its own mixture of experts. We are using a top-two routing strategy. It has 6x faster inference than Llama 2 70B, right? So an 8x7B, which you would naively think of as 56 billion parameters, is actually 6x faster than a 70-billion-parameter model. As a matter of fact, it's not actually 56 billion parameters,
23:42
because, as we saw in the earlier image, it's actually a 46.7, roughly 47, billion parameter model, because a lot of the layers share parameters. Only the feed-forward layer has the extended experts, while the attention layers, the add-and-norm layers, and the encoding and decoding parts,
24:00
they are actually sharing parameters. And because it's only choosing two experts at a time, its inference cost is equal to that of a 12.9-billion-parameter model. So you have this roughly 47-billion-parameter model which runs inference at the speed of a 12.9-billion-parameter one. So your inference speed increases, your specialization increases,
24:20
and obviously we have a reduced carbon footprint, which is pretty significant. GPT-4, like I mentioned, is rumored to be an 8x220-billion-parameter model, which is why a lot of the function calling that happens in GPT-4 works so well; one of the experts is rumored to be a function-calling expert. So it knows what function to call when a certain kind of token comes in.
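To make the Mixtral parameter arithmetic a bit more concrete, here is a rough back-of-the-envelope sketch. Only the headline figures (46.7B total, about 12.9B active with top-2 routing out of 8 experts) come from the talk; the derived shared/per-expert split is an approximation.

```python
# Rough, illustrative parameter accounting for Mixtral 8x7B (numbers approximate).
total_params = 46.7e9          # total parameters quoted in the talk
num_experts = 8
active_experts = 2             # top-2 routing
active_params = 12.9e9         # active parameters per token quoted in the talk

# Everything outside the expert FFNs (attention, norms, embeddings) is shared.
# Solve: shared + 8 * expert_ffn = total, and shared + 2 * expert_ffn = active.
expert_ffn = (total_params - active_params) / (num_experts - active_experts)
shared = total_params - num_experts * expert_ffn

print(f"per-expert FFN ~ {expert_ffn / 1e9:.1f}B, shared ~ {shared / 1e9:.1f}B")
# -> per-expert FFN ~ 5.6B, shared ~ 1.6B: a "56B" model that computes like a ~13B one.
```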
24:42
So there are a lot of levers we can tune; I'll quickly cover them. The top-k, the dimensions of individual experts, the number of experts you'll have in your architecture, the gating router's dimension, the parameter initialization. One of the papers, Switch Transformer, has pretty much cracked the unstable training, because they chose a different precision
25:02
for the gating router and a different precision for the experts. The parameter initialization: if you scale down the initialization, you get more stable training. The token routing strategy you choose; the dropout intensity at different layers (using a higher dropout in the expert layers helps them train more stably). And we're moving towards hierarchical MoEs.
25:22
So an MoE whose experts are themselves MoEs; that's something yet to be explored. The strength of the auxiliary loss, to make sure that the load stays balanced. And there are a lot of interesting things coming up; I'll share this slide. People are trying to add the mixture of experts to the transformer's attention layer, which could completely change the dynamics of these models.
25:42
There's an open source project called OpenMoE, which is doing amazing work in building and expanding the horizons of mixture of experts. And there's also the MergeKit library, which can combine individual expert models into something better. Thank you for listening, and enjoy the conference. Happy to take any questions.
26:01
Great talk. So now it's the Q&A section. If anybody has a question, please go to the mic and ask. Great talk. I was wondering if there are any advantages to using a dense model compared to an MoE.
26:21
Are there any places where an MoE falls down and a dense model would be preferable? Okay, so when we were solving this problem for the polymers domain (polymers include adhesives, paints and different components), if you want to restrict the performance of the LLM to a particular domain and you don't want to expand its horizons,
26:42
that's where a smaller model fine-tuned without the experts logic is better, because when you expand the network capacity, first of all, the training is often unstable, so you need to manage that. And apart from that, if you'd be taking a bigger model for a smaller use case, you would rather go with a dense model than a mixture of experts model.
27:02
Okay, thanks. Hello. I'm curious about this determinism that you were talking about. I was under the probably naive impression that this can be handled simply by setting a seed, but it seems that even with a seed set,
27:21
I can still have a non-deterministic output because of the MoE. Could you maybe shed a little bit more light on that? Sure. So I was doing this experiment where I was calling GPT-4 multiple times with the same prompt, the same request, and it was returning different responses. So the idea is that, for the tokens you send,
27:41
at that point in time, the other requests being batched together into the system matter: if the token routing exceeds the expert capacity, the routing changes. Obviously we could keep increasing the capacity to make it more deterministic, but there's still a chance it will not give the same response every time, even if you set the seed to the same value. So definitely, determinism is one of those challenges.
28:02
You could tackle it from an engineering standpoint, but it might still fail at a certain point of time. Thank you so much. Hey. Thank you for the talk. Yeah, I'm just curious how many, like in a given model, what's like sort of the typical percentage of parameters
28:20
that would be dedicated towards the experts as opposed to the other aspects of the transformer architecture? And then also as a followup to that question, how do you actually, when you're training the model, how do you account for this sort of mixture capacity or expert capacity overflow? Is that something that's included in training or is this just like a heuristic decision that was made?
28:42
Like, hey, this would probably work if we just overflow to the second-best expert. Yes, so if you look at the number of parameters on the right, the feed-forward neural network actually takes the majority of the parameters. You could think of it like multi-headed attention: if you have eight different heads,
29:01
you basically divide the number of parameters by eight, and each individual head has one-eighth of the original capacity. Something similar can be done with a mixture of experts as well. If your current feed-forward neural network has, say, X parameters and you plan to add eight experts, you could give each individual expert an X/8 dimension. So the total parameter count could remain constant
29:22
from a theoretical standpoint. But you could also go with a smaller number of experts; like the chart I was showing, a larger number of experts does better than the dense model, so I'd expect even a smaller number of experts to match the dense model's performance. So that's one. And then, just as far as the capacity goes,
29:43
is that something that's somehow accounted for in training? Or is that like, okay. Let me show you that. Yeah, so at the bottom, I'm sorry, the image was pretty large and I wanted to aggregate everything in a single slide, but you can see that there's a token choice routing, there's an expert choice routing. So you could think of tokens selecting
30:01
what experts they want to go to or experts selecting based on how many tokens they can incorporate, selecting the best ones of the lot. And for the ones which do not get their priority choice, you could then think of the second best or the third best. A lot of experiments have also been done in simply dropping those tokens or transferring those tokens
30:21
via the residual layer to the next layer and them not getting passed to the experts at all. But that's primarily the cause of unstable training and non-determinism, which is an active area of research we're trying to solve. All right, thank you. Thank you. So thanks again, Brinchal. And there'll be a coffee break right now.
30:40
So give us last round of applause for Brinchal. Thank you.