Blockchain and BFT: Fast Byzantine SGD (Best Student Paper Award) - TIB AV-Portal

Blockchain and BFT: Fast Byzantine SGD (Best Student Paper Award)

00:00

2

Related Material

Boussetta, Amine El-Mhamdi, El-Mahdi Guerraoui, Rachid Maurer, Alexandre Rouault, Sébastien

Formal Metadata

Title

Blockchain and BFT: Fast Byzantine SGD (Best Student Paper Award)

Title of Series

OPODIS: International Conference on Principles of Distributed Systems, 2020

Number of Parts

30

Author

Boussetta, Amine

El-Mhamdi, El-Mahdi

Guerraoui, Rachid

Maurer, Alexandre

Rouault, Sébastien

License

CC Attribution 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Identifiers

10.5446/52884 (DOI)

Publisher

Release Date

Language

Content Metadata

Subject Area

Computer Science

Genre

Conference/Talk

Abstract

Modern machine learning architectures distinguish servers and workers. Typically, a d-dimensional model is hosted by a server and trained by n workers, using a distributed \textit{stochastic gradient descent} (SGD) optimization scheme. At each SGD step, the goal is to estimate the gradient of a cost function. The simplest way to do this is to \emph{average} the gradients estimated by the workers. However, averaging is not resilient to even one single Byzantine failure of a worker. Many alternative \emph{gradient aggregation rules} (GARs) have recently been proposed to tolerate a maximum number f of Byzantine workers. These GARs differ according to (1) the complexity of their computation time, (2) the maximal number of Byzantine workers despite which convergence can still be ensured (breakdown point), and (3) their accuracy, which can be captured by (3.1) their angular error, namely the angle with the true gradient, as well as (3.2) their ability to aggregate full gradients. In particular, many are not full gradients for they operate on each dimension separately, which results in a coordinate-wise blended gradient, leading to low accuracy in practical situations where the number (s) of workers that are actually Byzantine in an execution is small ($s<We propose \textsc{Aksel}, a new scalable median-based GAR with an optimal time complexity (O(n)), an optimal breakdown point (n>2f) and the lowest upper bound on the \textit{expected angular error} (O(1)) among \textit{full gradient} approaches. We also study the \textit{actual angular error} of \textsc{Aksel} when the gradient distribution is normal and show that it only grows in O(logn), which is the first logarithmic upper bound ever proven assuming an optimal breakdown point. We also report on an empirical evaluation of \textsc{Aksel} on various classification tasks, which we compare to alternative GARs against state-of-the-art attacks. \textsc{Aksel} is the only GAR reaching top accuracy when there is actually none or few Byzantine workers while maintaining a good defense even under the extreme case (s=f). For simplicity of presentation, we consider a scheme with a single server. However, as we explain in the paper, \textsc{Aksel} can also easily be adapted to multi-server architectures.

OPODIS: International Conference on Principles of Distributed Systems, 202011 / 30

1

23:04

Dynamic & Radio Networks: Broadcasting Competitively against Adaptive Adversary in Multi-channel Radio Networks

2

23:04

Dynamic & Radio Networks: Dynamic Byzantine Reliable Broadcast (DBRB)

3

15:43

Dynamic & Radio Networks: Broadcasting with mobile agents in dynamic networks

4

23:38

Dynamic & Radio Networks: On Broadcast in Generalized Network and Adversarial Models

5

13:33

Dynamic & Radio Networks: Maximally Resilient Replacement Paths for a Family of Product Graphs

6

20:21

Dynamic & Radio Networks: Self-stabilizing Byzantine-resilient communication in dynamic networks

7

22:39

Blockchain and BFT: Byzantine Lattice Agreement in Asynchronous Message Systems

8

25:18

Blockchain and BFT: Multi-Threshold Asynchronous Reliable Broadcast and Consensus

9

24:15

Blockchain and BFT: Heterogeneous Paxos

10

21:00

Blockchain and BFT: Echo-CGC: A Communication-Efficient Byzantine-tolerant Distributed Machine Learning Algorithm in Single-Hop Radio Network

11

18:38

Blockchain and BFT: Fast Byzantine SGD (Best Student Paper Award)

12

23:15

Blockchain and BFT: Security Analysis of Ripple Consensus

13

19:58

Consensus Potpourri: ACE: Abstract Consensus Encapsulation for Liveness Boosting of State Machine Replication

14

21:50

Consensus Potpourri: Information Theoretic HotStuff

15

22:53

Consensus Potpourri: Rational Behaviors in Committee-Based Blockchains

16

19:42

Consensus Potpourri: Locally Solvable Tasks and the Limitations of Valency Arguments

17

25:57

Consensus Potpourri: Approximate Majority With Catalytic Inputs

18

16:03

Consensus Potpourri: Optimal Resilience in Systems that Mix Shared Memory and Message Passing

19

22:19

Graph algorithms and population protocols: Fast Deterministic Algorithms for Highly-Dynamic Networks

20

22:14

Graph algorithms and population protocols: Approximating Bipartite Minimum Vertex Cover in the CONGEST model

21

20:52

Graph algorithms and population protocols: Distributed Distance Approximation

22

25:50

Graph algorithms and population protocols: Fast Hybrid Network Algorithms for Shortest Paths in Sparse Graphs

23

20:00

Graph algorithms and population protocols: Secured Distributed Algorithms without Hardness Assumptions

24

23:35

Graph algorithms and population protocols: Uniform Bipartition in Population Protocol Model with Arbitrary Communication Graphs

25

22:35

Concurrent data structures: Relaxed Queues and Stacks from Read/Write Operations

26

27:59

Concurrent data structures: Fast and Space-Efficient Queues via Relaxation

27

24:33

Concurrent data structures: Recoverable, Abortable, and Adaptive Mutual Exclusion with Sublogarithmic RMR Complexity

28

27:14

Concurrent data structures: CSR++: A Fast, Scalable, Update-Friendly Graph Data Structure

29

25:22

Formal methods: Monitoring Distributed Systems under Partial Synchrony

30

21:02

Formal methods: Decentralized Runtime Enforcement of Message Sequences in Message-Based Systems

Automatic playback

Speech

Text

Image

00:00

Parameter (computer programming)Server (computing)Data modelStatement (computer science)Physical systemMechanism designArithmetic meanCNNTask (computing)Total S.A.AsymptoteGraph (mathematics)Social classDirected setOverhead (computing)Pairwise comparisonSlide ruleError messagePoint (geometry)Traffic reportingWeightMultiplication signInteractive televisionLine (geometry)Task (computing)Metropolitan area networkNumberConnected spaceTunisPhysical lawCodeHypermediaControl flowPrice indexGradientGoodness of fitRoboticsTwitterFood energyWater vaporAverageBuildingCategory of beingLabour Party (Malta)VotingTrailTotal S.A.Operator (mathematics)Open setInformationOrder (biology)Range (statistics)NeuroinformatikRow (database)Robust statisticsMathematicsMathematical optimizationUniform resource locatorNormal (geometry)MereologyEnterprise architectureInternetworkingSelf-organizationGoogolCondition numberRankingGreatest elementGradient descentStochasticOverhead (computing)IterationVector spaceRule of inferenceSheaf (mathematics)Function (mathematics)Mechanism designSquare numberMedianDeterminantContent (media)Product (business)Term (mathematics)Physical systemDimensional analysisMatrix (mathematics)Error messageFunctional (mathematics)Performance appraisalAlgorithmEndliche ModelltheorieComplex (psychology)Virtual machineWave packetMaxima and minimaServer (computing)Local ringSet (mathematics)EstimatorParameter (computer programming)Cartesian coordinate systemRevision controlSelectivity (electronic)XML

09:12

Slide ruleTheoremLemma (mathematics)Error messageMoment (mathematics)StatisticsArtificial neural networkMIDITwin primeView (database)Hydraulic jumpInfinityBroadcast programmingPerformance appraisalInclusion mapExecution unitConvex setFunctional (mathematics)NumberNichtlineares GleichungssystemEstimatorLemma (mathematics)VarianceNormal distributionMaxima and minimaStandard deviationGradient descentResultantAlpha (investment)Distribution (mathematics)Vulnerability (computing)Multiplication signSequenceSet (mathematics)SmoothingRandomizationArrow of timeCost curveExpected valueRule of inferenceSheaf (mathematics)Graphical user interfaceFunction (mathematics)Different (Kate Ryan album)State of matterVector spaceError messageSigma-algebraEndliche ModelltheorieSquare numberTerm (mathematics)Robust statisticsAngleTheoremGame controllerTheoryOrder (biology)StochasticStatisticsLine (geometry)AverageMoment (mathematics)Computer programmingPoint (geometry)TrailMedical imagingMetropolitan area networkEquivalence relationBitWater vaporTwitterGame theoryAreaBoundary value problemMereologyGodRange (statistics)Data conversion

18:25

Error messageComputer animationMeeting/Interview

Transcript: English(auto-generated)

00:03

Hello, everybody. My name is Amin Bustam. I am a PhD candidate at UNCP in Morocco. And today's talk is about the Byzantine resilience of distributed stochastic gradient descent. So after a brief introduction, I will state the problem. And after that, I will talk about some related works

00:21

and how they handled the Byzantine resilience problem. I will also briefly present the theoretical guarantees in the full version of this talk. Finally, I will present a selection of experiments to empirically assess the performance of our algorithm. So as you probably know, many large-scale machine learning

00:41

applications are now implemented in a distributed way. Perhaps you are mostly familiar with parameter server architecture, which was introduced by Google Research. So in this classical scheme, you have one parameter server. Let's call it PS for short. And this PS can be possibly replicated

01:01

and a set of workers. And each worker has a local data set. So if we consider a stochastic gradient optimization in suppose that you are in iteration number k, what happens is that first, the PS broadcasts the model wk to all the workers.

01:20

Then each worker computes an estimate of the gradient using this model and its local data set and sends it back to the PS. Finally, the PS aggregates all these received estimates of the gradient and performs an update. This operation starts again until a certain criterion is met.

01:42

Now what happens when a proportion of these workers are malicious? So if the PS is only averaging the received vector, then it suffices to just have one malicious worker to make the system completely famous. So what we need then is a more robust aggregation rule to defend against bad behavior.

02:01

Here are some of the properties that we want from an aggregation rule. First, the training itself is already taking a huge amount of time. So the defense mechanism should be fast, otherwise the training will be a daunting task. It should also defend against a high number of malicious workers.

02:21

And it should be noted that the maximum number of Byzantine workers that can be tolerated is half, because if more than half are malicious, then it will be impossible to distinguish between the good and the bad. The third point, which is an important one, is the fact that the defense mechanism should not degrade the performance of the system

02:42

when all the workers are honest, because companies or firms that use distributed machine learning are not constantly under attack. Let's say that Byzantine attacks may happen once every five or 10 years. So in the rest of the time, your model should reach the full accuracy, and it should not be limited by your aggregation law.

03:04

And finally, the system augmented with the defense should have an acceptable performance when facing Byzantine workers. And actually, this is the goal of an aggregation rule. So here is our algorithm.

03:21

It is based on very simple functions like addition, subtraction, computing norms, and median. So basically, we try to come up with the closest vector to the coordinate-wide median, which is itself a good, robust estimator, but only constructed from full gradient. So we do this first by computing

03:40

the coordinate-wise median, and we store it in the vector M, okay? And we subtract it from every vector, okay? Every column vector in the matrix. And then we compute the squared norm of the essential vectors in the vector F, okay? And finally, we compute the median of the squared norm.

04:03

So let's call it R, okay? And then we construct this interval from zero to R, and R is the median of the squared norm. So now the filter is very simple. We only select the vectors inside this interval, only vectors that satisfy this condition, okay?

04:23

And we average them, and this is our output. So from what we have seen, most defense mechanisms can be categorized as follow. You have aggregation rules that uses historical information to filter the bad work

04:41

while progressing with each iteration. You also have some methods that rely on redundancy, which is a classical way to deal with failures, and you have aggregation rules based on robust statistics. So our work falls into this last category.

05:01

So in robust statistics, there are two ways to deal with subjects. You can either select whole vectors based on some rule or some filter, and average only those selected vectors to come up with the final output. And this is what we called in the paper a full aggregation rule, or full GAR for short.

05:22

Or you can apply the rule directly on each coordinate, and this is what we call blended GAR. Now from experimentation, we observed that blended GAR never reached the top accuracy in a setting where we only have honest workers, but full GARs do.

05:41

So this is exactly the third point we discussed earlier on the fact that the defense mechanism should not degrade the system. So what we see here is an experimentation involving 50 workers, and the aggregation rules are tuned to withstand 12 Byzantine workers. But actually, none of them are Byzantine in the experiment.

06:02

So let's just forget for now the name Axed, which is the name of our algorithm, and take a look at the performance of these aggregation rules. So it's clear that full GARs, they have no overhead accuracy goals, okay? You see that they are reaching the accuracy of average.

06:22

But the blended GARs, they have this gap, okay? So they don't reach this top accuracy, which is reached by average, okay?

06:41

Okay, now let's compare the three most important properties of these closest GARs to our work. So averaging is indeed fast, okay? So it has an optimal time complexity. We say it's optimal because in determinative methods, one needs at least to read the contents of the gradient matrix, which is done in big O and D.

07:03

So N is the number of workers, and D is the dimension of the product. So it also has a low angular error, okay? So it is decreasing when the number of workers increase, okay, which is good.

07:22

The problem is that it is not Byzantine-resilient, okay? So F equals zero. Now we have seen the downside of blended GARs, okay? So although they have an optimal time complexity, an optimal breakdown point, and even some of them

07:40

have reached the angular error of average, okay, which is the desired property. Now, to the date of publishing this paper, we only knew about three full GARs. So chrome, multi-chrome, and bullion, okay? The problem was the fact that they had a high time complexity, which was quadratic

08:04

in the number of workers, okay? And they had a high angular error, okay? So you see this term in square N appearing, even N squared D, okay? So this is bad.

08:22

And they didn't have an optimal breakdown point. So they are far from the N bigger than two F, okay? So for example, bullion, it can only defend against nearly a quarter of the Byzantine workers. Okay?

08:40

So we came up with Excel that has improved on the three properties, okay? So basically we have taken the best out of the two categories. So we took optimal time complexity and breakdown point plus the low angular error from the blended approaches. And we took the low overhead accuracy cost

09:03

from full gradient approach, which makes our aggregation the best as we will see in the evaluation section. Okay, now this is the set of assumptions made in this work. So the first one concerns the breakdown point.

09:23

So we require that the number F of malicious workers is less than half of the total worker. Now, the second and the third assumptions concern the cost function, and we want it to be smooth and strongly convex.

09:41

But we also have a result on smooth general functions and three times differentiable, which is the consequence of the Byzantine resilience. We'll see it later. So the last assumption is the classical function in this line of research. So we assume that we have an unbiased estimator of the gradient, okay?

10:02

And their variance is bounded. All right. So the first thing to do was to upper bound the variance of access, so which was done in lemma 11. So all the theorems coming next are based on this upper bound result.

10:22

For example, if we combine this result with lemma 12 on the control statistical models, we are able to prove the alpha F Byzantine resilience of access. So let me talk about this a little bit. So we say that an aggregation rule is alpha F Byzantine resilient,

10:41

if it can tolerate a maximum of F malicious worker. And suppose that your output is this, so this is your function. This is the true gradient, and this is access output, for example. So your function, your aggregation rule

11:02

is alpha F Byzantine resilient if it has an angle alpha with the true gradient. So naturally, we want this angle to be as small as possible, even zero if you can do it. Okay, so a direct consequence of the alpha F Byzantine resilience

11:20

is the almost sure convergence. So in the paper, we don't prove the theorem because it was done on the original paper. So it was sufficient to prove the alpha F Byzantine resilience of access, in our case, to state the theorem. So this is indeed a weak convergence result because we only showed that the gradient sequence will almost surely reach zero.

11:43

So this means that if your function, for example, is like this, okay? So this means that you are going to reach a flat region of the cost function. So it could be a global minimum, which is very good, but it could also be a local minimum.

12:01

Or worse, it could be a saddle point, which is very bad. Now in theorem 15, we prove convergence for strongly convex cost functions. So this is exactly how convergence equations are stated for all stochastic gradient descent methods.

12:23

So you always have a decrease in term that will go to zero, okay, when the counter T will go to infinity. And you are left with a small term here involving your variance, okay? So it is important to have a gradient aggregation rule

12:43

with a small angular error, which implies a small value. So just for the sake of comparison, if you take a vanilla gradient, a stochastic gradient, like which has the aggregation rule of averaging, okay? So the delta will be equal to D sigma square over N.

13:05

So this is decreasing with N, okay? Now, if you take some aggregation rule, like axel, we have this term here, okay? So this term is constant with the number of,

13:22

regarding the number of workers, okay? So if you take prior full guards, they had, okay, they had angular arrows with big O N, and this was bad. Okay, so at least axel,

13:40

with axel we have a constant time regarding the number of workers here. So it's not, sorry, it's not like averaging, okay? But we are approaching average. Okay, so up to now, all the results were stated in expectation.

14:03

So we had the results on the variance, which is the expected gap. We also proved the expected alpha F Byzantine resilience. And finally, we proved the expected convergence. So if you look at the last section of the paper, we also studied the actual error. So instead of having an expectation here, okay,

14:24

which was done in every result, we dropped this expectation and we studied the actual error, okay? So this was possible by assuming a normal distribution of the gradients. And we used the theory of extreme value

14:40

in order to prove a logarithmic upper bound of the angular error of axel, okay? So prior works, they had, for optimally robust gradient aggregation rule, so if you have n greater or equal, greater than two F, okay? If you have n greater than two F,

15:01

then we had an angular error of d n squared. Okay, so now let's take a look at two experiments that were presented in the paper. Actually, we have a whole section on empirical results for different data sets and different settings in the appendix.

15:21

So in this first experiment, we train a CNN on CFR 10 using 25 workers and 11 among them are Byzantine workers. So these Byzantine workers are implementing the attack presented in the paper, Fall of Empires. So actually we have tested two attacks,

15:42

the one in the paper, Fall of Empires, and another one in the paper called A Little is Enough. So both of them, they are exploiting the fact that honest workers will send some values like normally distributed, okay? So this is to me, this is standard deviation.

16:00

So what they do is that they send vectors here, okay? Or here. So these vectors, they are considered as honest because, but they are in the tail of the distribution and this is very bad, okay? So as you can see here,

16:22

Accel has reached the best accuracy, so nearly 30%, even with nearly half of Byzantine workers, while other guards were limited to 10%, okay? And even blended guards, they were stuck at maybe 20 or 25%, okay?

16:45

So the second experiment is quite interesting because Accel is the only guard convergent and most importantly, it has reached the top accuracy exactly like averages, even with the presence of roughly one fifth of the malicious workers.

17:03

So these workers were implemented the state of the art attack found in the paper little is enough, okay? So as you can see here, Accel is the orange dotted line. So it has completely reached the accuracy of average, okay?

17:21

Maybe even, it is even slightly better. I don't know if you can see it, but it is slightly better than average, okay? While the other guards were stuck at 10%, maybe Chrome did better in this experiment, it has reached 30%, okay? But it is still far from the average, okay?

17:44

And blended guards, they were all stuck at 10%, okay? Accuracy, so the model has diverse, okay? So as a conclusion, we are left with some open questions

18:02

that will be addressed in the future work like whether it is possible to reduce further the angular error and achieve maybe the error of averaging or can randomness enable better performance, okay? And what are the guarantees that we can have

18:20

when dealing with non-IOD data, okay? So this was all, thank you very much for your attention. If you have any questions, I will be glad to respond. Thank you.