Blockchain and BFT: Fast Byzantine SGD (Best Student Paper Award)
Formal Metadata
Number of Parts: 30
License: CC Attribution 4.0 International. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/52884 (DOI)
Transcript: English (auto-generated)
00:03
Hello, everybody. My name is Amine Boussetta. I am a PhD candidate at UM6P in Morocco, and today's talk is about the Byzantine resilience of distributed stochastic gradient descent. After a brief introduction, I will state the problem, and after that I will talk about some related works
00:21
and how they handled the Byzantine resilience problem. I will also briefly present the theoretical guarantees from the full version of this talk. Finally, I will present a selection of experiments to empirically assess the performance of our algorithm. As you probably know, many large-scale machine learning
00:41
applications are now implemented in a distributed way. You are probably most familiar with the parameter server architecture, which was introduced by Google research. In this classical scheme, you have one parameter server, let's call it PS for short, and this PS can possibly be replicated,
01:01
and a set of workers, and each worker has a local data set. If we consider stochastic gradient optimization and suppose that you are at iteration number k, what happens is that first the PS broadcasts the model w_k to all the workers.
01:20
Then each worker computes an estimate of the gradient using this model and its local data set and sends it back to the PS. Finally, the PS aggregates all these received gradient estimates and performs an update. This operation repeats until a stopping criterion is met.
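To make this loop concrete, here is a minimal illustrative sketch of one such round; it is not the paper's implementation, and the names `workers`, `compute_gradient`, and `aggregate` are hypothetical placeholders.

```python
import numpy as np

def sgd_round(w_k, workers, aggregate, lr=0.1):
    """One round of parameter-server SGD with a pluggable aggregation rule.

    Hypothetical sketch: `workers` is a list of objects exposing
    compute_gradient(w) (a stochastic gradient estimate on local data),
    and `aggregate` maps an (n, d) matrix of gradients to a single vector.
    """
    # 1. The PS broadcasts the current model w_k to all workers.
    # 2. Each worker returns a stochastic gradient estimate.
    gradients = np.stack([worker.compute_gradient(w_k) for worker in workers])
    # 3. The PS aggregates the n received vectors (plain averaging here;
    #    a Byzantine-robust rule just replaces this function,
    #    e.g. aggregate = lambda G: G.mean(axis=0)).
    g = aggregate(gradients)
    # 4. The PS performs the update and the next round starts.
    return w_k - lr * g
```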
01:42
Now, what happens when a proportion of these workers are malicious? If the PS simply averages the received vectors, then a single malicious worker suffices to make the system fail completely. What we need, then, is a more robust aggregation rule to defend against bad behavior.
02:01
Here are some of the properties that we want from an aggregation rule. First, the training itself is already taking a huge amount of time. So the defense mechanism should be fast, otherwise the training will be a daunting task. It should also defend against a high number of malicious workers.
02:21
And it should be noted that the maximum number of Byzantine workers that can be tolerated is just under half, because if half or more are malicious, then it becomes impossible to distinguish between the good and the bad. The third point, which is an important one, is that the defense mechanism should not degrade the performance of the system
02:42
when all the workers are honest, because companies or firms that use distributed machine learning are not constantly under attack. Let's say that Byzantine attacks may happen once every five or ten years; the rest of the time, your model should reach full accuracy and should not be limited by your aggregation rule.
03:04
And finally, the system augmented with the defense should have an acceptable performance when facing Byzantine workers. And actually, this is the goal of an aggregation rule. So here is our algorithm.
03:21
It is based on very simple operations like addition, subtraction, computing norms, and medians. Basically, we try to come up with the vectors closest to the coordinate-wise median, which is itself a good robust estimator, but using only full gradients. We do this first by computing
03:40
the coordinate-wise median, which we store in the vector M, and we subtract it from every column vector in the matrix. Then we compute the squared norms of the resulting centered vectors and store them in the vector F. Finally, we compute the median of these squared norms.
04:03
Let's call it R. We then construct the interval from zero to R, where R is the median of the squared norms. Now the filter is very simple: we only select the vectors whose squared distance to the median falls inside this interval,
04:23
and we average them, and this is our output. (A minimal sketch of the whole procedure is given below.)
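The following is a rough NumPy reconstruction of the steps just described, assembled from the talk alone and not taken from the paper's code; `gradients` is assumed to be an (n, d) matrix with one worker gradient per row.

```python
import numpy as np

def aksel_like_aggregate(gradients):
    """Sketch of the aggregation rule described in the talk.

    gradients: (n, d) array, one gradient estimate per worker (row).
    Reconstruction from the talk's description, not the paper's code.
    """
    # 1. Coordinate-wise median of the n gradients.
    m = np.median(gradients, axis=0)
    # 2. Squared distance of every gradient to that median.
    f = np.sum((gradients - m) ** 2, axis=1)
    # 3. Median of these squared norms defines the radius R.
    r = np.median(f)
    # 4. Filter: keep only the gradients within squared distance R of m ...
    selected = gradients[f <= r]
    # 5. ... and average the selected (whole) gradients.
    return selected.mean(axis=0)
```

By construction, roughly half of the received vectors pass the filter, so the final average is never taken over an empty set.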
04:41
From what we have seen, most defense mechanisms can be categorized as follows. You have aggregation rules that use historical information to filter out the bad workers while progressing with each iteration. You also have some methods that rely on redundancy, which is a classical way to deal with failures, and you have aggregation rules based on robust statistics. Our work falls into this last category.
05:01
In robust statistics, there are two ways to deal with outliers. You can either select whole vectors based on some rule or filter, and average only those selected vectors to come up with the final output; this is what we call in the paper a full aggregation rule, or full GAR for short.
05:22
Or you can apply the rule directly on each coordinate, and this is what we call a blended GAR. Now, from experimentation, we observed that blended GARs never reach the top accuracy in a setting where we only have honest workers, but full GARs do.
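To make the distinction concrete, here is a tiny hypothetical example (numbers chosen purely for illustration): a blended GAR such as the coordinate-wise median mixes coordinates coming from different workers, while a full GAR selects whole gradients and averages them.

```python
import numpy as np

# One gradient per row; worker 2 is an outlier in this toy example.
G = np.array([[1.0, 8.0],
              [2.0, 9.0],
              [9.0, 1.0]])

# Blended GAR: the rule acts per coordinate, so the output [2., 8.]
# takes coordinate 0 from worker 1 and coordinate 1 from worker 0;
# it is not any single worker's gradient.
blended = np.median(G, axis=0)

# Full GAR (naive illustration): select whole vectors, e.g. the two
# closest to the coordinate-wise median, then average them -> [1.5, 8.5].
dist = np.sum((G - blended) ** 2, axis=1)
full = G[np.argsort(dist)[:2]].mean(axis=0)
```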
05:41
This is exactly the third point we discussed earlier, the fact that the defense mechanism should not degrade the system. What we see here is an experiment involving 50 workers, and the aggregation rules are tuned to withstand 12 Byzantine workers, but none of them are actually Byzantine in this experiment.
06:02
Let's just forget for now the name Aksel, which is the name of our algorithm, and take a look at the performance of these aggregation rules. It is clear that the full GARs have no accuracy overhead: you can see that they reach the accuracy of plain averaging.
06:22
But the blended GARs have this gap: they do not reach the top accuracy that averaging reaches.
06:41
Now let's compare the three most important properties of the GARs closest to our work. Averaging is indeed fast: it has an optimal time complexity. We say it is optimal because, for deterministic methods, one needs at least to read the contents of the gradient matrix, which takes O(nd).
07:03
Here, n is the number of workers and d is the dimension of the model. Averaging also has a low angular error, which decreases as the number of workers increases, which is good.
07:22
The problem is that it is not Byzantine resilient, so f equals zero. Now, we have seen the downside of blended GARs, although they have an optimal time complexity, an optimal breakdown point, and some of them even
07:40
reach the angular error of averaging, which is the desired property. Now, at the time of publishing this paper, we only knew about three full GARs: Krum, Multi-Krum, and Bulyan. The problem was that they had a high time complexity, which was quadratic
08:04
in the number of workers. They also had a high angular error: you see terms growing with n appearing, even an n²·d term, and this is bad.
08:22
And they did not have an optimal breakdown point, so they are far from n > 2f. For example, Bulyan can only defend against Byzantine workers making up nearly a quarter of the total.
08:40
So we came up with Aksel, which improves on all three properties. Basically, we have taken the best out of the two categories: we took the optimal time complexity and breakdown point, plus the low angular error, from the blended approaches, and we took the low accuracy overhead
09:03
from the full-gradient approaches, which makes our aggregation rule the best of both, as we will see in the evaluation section. (The comparison described here is summarized below.)
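For reference, here is the comparison exactly as described in the talk, gathered into one place; the asymptotic terms are only as precise as the spoken description, and the exact bounds are in the paper.

- Averaging: optimal O(nd) time, low angular error that decreases with n, but not Byzantine resilient (f = 0).
- Blended GARs: optimal time complexity and breakdown point, some even match averaging's angular error, but they never reach the top accuracy when all workers are honest.
- Prior full GARs (Krum, Multi-Krum, Bulyan): O(n²d) time, high angular error growing with n, non-optimal breakdown point (Bulyan tolerates only about a quarter of the workers being Byzantine).
- Aksel: O(nd) time, breakdown point n > 2f, angular error constant in n, and no accuracy overhead when all workers are honest.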
09:23
Now, here is the set of assumptions made in this work. The first one concerns the breakdown point: we require that the number f of malicious workers is less than half of the total number of workers. The second and third assumptions concern the cost function: we want it to be smooth and strongly convex.
09:41
But we also have a result for general smooth, three-times differentiable functions, which is a consequence of the Byzantine resilience; we will see it later. The last assumption is the classical one in this line of research: we assume that each worker has an unbiased estimator of the gradient,
10:02
and that its variance is bounded. All right. So the first thing to do was to upper-bound the variance of Aksel, which was done in Lemma 11. All the theorems that come next are based on this upper-bound result.
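In symbols, the assumptions just listed take roughly the following standard form; this is a sketch of the usual formulation in this literature, and the exact statements and constants are those of the paper.

```latex
% Breakdown point: strictly fewer than half of the n workers are Byzantine.
f < \tfrac{n}{2}

% The cost function Q is L-smooth and mu-strongly convex:
\mu I \preceq \nabla^2 Q(w) \preceq L I \quad \text{for all } w

% Each honest worker i returns an unbiased gradient estimate G_i
% with bounded (per-coordinate) variance:
\mathbb{E}[G_i \mid w] = \nabla Q(w), \qquad
\mathbb{E}\!\left[\lVert G_i - \nabla Q(w) \rVert^2 \,\middle|\, w\right] \le d\,\sigma^2
```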
10:22
For example, if we combine this result with Lemma 12 on the statistical model, we are able to prove the (α, f)-Byzantine resilience of Aksel. Let me talk about this a little bit. We say that an aggregation rule is (α, f)-Byzantine resilient
10:41
if it can tolerate a maximum of f malicious workers. Suppose that this is the true gradient, and this is Aksel's output, for example. Then your aggregation rule
11:02
is (α, f)-Byzantine resilient if its output makes an angle of at most α with the true gradient. Naturally, we want this angle to be as small as possible, even zero if you can achieve it.
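The core of this condition, as introduced in the earlier Krum paper by Blanchard et al. (the full definition also bounds higher moments of the output), can be sketched as follows:

```latex
% A gradient aggregation rule A is (alpha, f)-Byzantine resilient if,
% with at most f Byzantine inputs, its expected output stays within an
% angle alpha of the true gradient:
\left\langle \mathbb{E}\!\left[A(G_1, \dots, G_n)\right],\, \nabla Q(w) \right\rangle
  \;\ge\; (1 - \sin\alpha)\, \lVert \nabla Q(w) \rVert^2 \;>\; 0
```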
11:20
A direct consequence of the (α, f)-Byzantine resilience is almost sure convergence. In the paper, we do not prove this theorem, because it was done in the original paper; in our case, it was sufficient to prove the (α, f)-Byzantine resilience of Aksel to state the theorem. This is indeed a weak convergence result, because we only show that the gradient sequence almost surely reaches zero.
11:43
This means that if your cost function looks like this, for example, you are going to reach a flat region of the cost function. It could be a global minimum, which is very good, but it could also be a local minimum.
12:01
Or worse, it could be a saddle point, which is very bad. Now, in Theorem 15, we prove convergence for strongly convex cost functions. This is exactly how convergence results are stated for all stochastic gradient descent methods.
12:23
You always have a decreasing term that goes to zero as the counter T goes to infinity, and you are left with a small term involving your variance. So it is important to have a gradient aggregation rule
12:43
with a small angular error, which implies a small value of this term. Just for the sake of comparison, if you take vanilla stochastic gradient descent, which uses averaging as its aggregation rule, then this delta will be equal to dσ²/n.
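For concreteness, the generic shape of such a strongly convex SGD statement, written here as a rough sketch rather than the exact constants of Theorem 15, is:

```latex
% A contraction term that vanishes as T grows, plus a residual floor
% driven by the aggregation error Delta (step size eta, strong convexity mu):
\mathbb{E}\!\left[\lVert w_T - w^\star \rVert^2\right]
  \;\le\; (1 - \eta\mu)^T \,\lVert w_0 - w^\star \rVert^2
  \;+\; \frac{\eta}{\mu}\,\Delta
% With plain averaging over n honest workers, Delta = d\sigma^2 / n.
```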
13:05
This is decreasing with n. Now, if you take an aggregation rule like Aksel, we have this term here, and this term is constant
13:22
with respect to the number of workers. If you take the prior full GARs, they had angular errors growing with n, and this was bad. So at least with Aksel,
13:40
we have a constant term with respect to the number of workers. It is not like averaging, but we are approaching averaging. So, up to now, all the results were stated in expectation.
14:03
We had the result on the variance, which is the expected gap; we also proved the expected (α, f)-Byzantine resilience; and finally, we proved the expected convergence. If you look at the last section of the paper, we also studied the actual error. So instead of having an expectation here,
14:24
which was the case in every result, we dropped this expectation and studied the actual error. This was possible by assuming a normal distribution of the gradients, and we used extreme value theory
14:40
in order to prove a logarithmic upper bound on the angular error of Aksel. Prior works, for an optimally robust gradient aggregation rule, that is, when you have n greater than 2f,
15:01
had an angular error with an n²·d term. Now let's take a look at two experiments that were presented in the paper. We actually have a whole section of empirical results for different data sets and different settings in the appendix.
15:21
In this first experiment, we train a CNN on CIFAR-10 using 25 workers, and 11 of them are Byzantine. These Byzantine workers implement the attack presented in the paper Fall of Empires. We actually tested two attacks,
15:42
the one in the paper Fall of Empires, and another one in the paper called A Little Is Enough. Both of them exploit the fact that honest workers send values that are roughly normally distributed: so this is the mean, and this is the standard deviation.
16:00
What the attackers do is send vectors here, or here. These vectors are considered honest, but they are in the tail of the distribution, and this is very bad.
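As an illustration of this style of attack (a hedged sketch inspired by the "A Little Is Enough" idea, not the exact attack from either paper), the Byzantine workers can estimate the per-coordinate mean and standard deviation of the honest gradients and submit a vector shifted a few standard deviations into the tail:

```python
import numpy as np

def tail_attack(honest_gradients, z=1.5):
    """Byzantine gradient placed in the tail of the honest distribution.

    Sketch only: shift each coordinate by z standard deviations, so the
    vector still looks plausible but consistently biases the aggregation.
    """
    mu = honest_gradients.mean(axis=0)      # per-coordinate mean
    sigma = honest_gradients.std(axis=0)    # per-coordinate std deviation
    return mu - z * sigma                   # all Byzantine workers send this
```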
16:22
As you can see here, Aksel reached the best accuracy, nearly 30%, even with nearly half of the workers being Byzantine, while the other GARs were limited to 10%, and even the blended GARs were stuck at maybe 20 or 25%.
16:45
The second experiment is quite interesting, because Aksel is the only GAR that converges and, most importantly, it reaches the top accuracy exactly like averaging, even in the presence of roughly one fifth malicious workers.
17:03
These workers implemented the state-of-the-art attack found in the paper A Little Is Enough. As you can see here, Aksel is the orange dotted line: it completely reached the accuracy of averaging.
17:21
It is maybe even slightly better; I don't know if you can see it, but it is slightly better than averaging. The other GARs were stuck at 10%; maybe Krum did better in this experiment, reaching 30%, but it is still far from averaging.
17:44
And the blended GARs were all stuck at 10% accuracy, so the model diverged. As a conclusion, we are left with some open questions
18:02
that will be addressed in future work, such as whether it is possible to further reduce the angular error and maybe achieve the error of averaging, whether randomness can enable better performance, and what guarantees we can have
18:20
when dealing with non-IID data. That was all; thank you very much for your attention. If you have any questions, I will be glad to respond. Thank you.